ABSTRACT

Title of dissertation: TEMPORAL AND SPATIAL ALIGNMENT OF MULTIMEDIA SIGNALS
Hui Su, Doctor of Philosophy, 2014
Dissertation directed by: Professor Min Wu, Department of Electrical and Computer Engineering

With the increasing availability of cameras and other mobile devices, digital images and videos are becoming ubiquitous. Research efforts have been made to develop technologies that utilize multiple pieces of multimedia information simultaneously. This dissertation focuses on the temporal and spatial alignment of multimedia signals, a fundamental problem that needs to be solved to enable applications dealing with multiple pieces of multimedia data.

The first part of the dissertation addresses the synchronization of multimedia signals. We propose a new modality for audio and video synchronization based on the electric network frequency (ENF) signal naturally embedded in multimedia recordings. Synchronization of audio and video is achieved by aligning the ENF signals. The proposed method offers a significant departure from existing work in tackling the audio/video synchronization problem, and a strong potential to address previously intractable scenarios.

Estimation of the ENF signal from video is a challenging task. To address the problem of the insufficient sampling rate of video, we propose to exploit the rolling shutter mechanism commonly adopted in CMOS camera sensors. Several techniques are designed to alleviate the distortions caused by motion and brightness changes in videos for ENF estimation.

We also address several challenges that are unique to the synchronization of digitized analog audio recordings. Speed offset often occurs in digitized analog audio recordings due to inconsistency in the tape's rolling speed. We show that the ENF signal captured by the original analog audio recording can be retained in the digitized version. The ENF signal is considered approximately as a single-tone signal and used as a reference to detect and correct speed offsets automatically.

A complete multimedia application system often needs to jointly consider both temporal synchronization and spatial alignment. The last part of the dissertation examines the quality assessment of local image features for efficient and robust spatial alignment. We propose a scheme to evaluate the quality of SIFT features in terms of their robustness and discriminability. A quality score is assigned to every SIFT feature based on its contrast value, scale, and descriptor, using a quality metric kernel that is obtained in a one-time training phase. Feature selection is performed by retaining features with high quality scores. The proposed approach is also applicable to other local image features, such as the Speeded Up Robust Features (SURF).

TEMPORAL AND SPATIAL ALIGNMENT OF MULTIMEDIA SIGNALS

by Hui Su

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2014

Advisory Committee:
Professor Min Wu, Chair/Advisor
Professor Carol Espy-Wilson
Professor Kari Kraus
Professor K. J. Ray Liu
Professor Gang Qu

© Copyright by Hui Su 2014

To my family

Acknowledgments

First and foremost, I would like to express my sincere and greatest gratitude to my advisor, Prof. Min Wu, for her continuous support and guidance over the past five years of my graduate study.
With her vision, creativeness, and enthusiasm, she has taught me how to identify important research issues, and how to solve problems with critical thinking and unremitting efforts. Her encouragement and constructive suggestions have led me through difficult times. Her endless pursuit of excellence inspired me to strive for achieving my potential in my study, research, and professional careers. The valuable lessons I learned from her not only fueled my graduate study and research, but will also have a lasting influence in my future professional and personal development.

I would like to thank Prof. K.J. Ray Liu for his enlightening courses and his constructive suggestions on my research. I am grateful to Prof. Kari Kraus for her expertise and insightful comments in our collaborative work. I want to thank Prof. Gang Qu and Prof. Carol Espy-Wilson for serving on my dissertation committee and for their valuable comments. I also want to thank Prof. Douglas Oard for the enlightening discussions that inspired our collaborative work.

It has been my pleasure and privilege to work with my wonderful colleagues in Maryland. I thank Dr. Avinash Varna, Dr. Wenjun Lu, Dr. Wei-Hong Chuang, and Dr. Ravi Garg for providing help and guidance on my research. I thank my officemates Chau-Wai Wong, Adi Hajj-Ahmad, and Abbas Kazemipour, with whom I enjoyed our discussions on various research and life topics.

Last but not least, I thank my parents who always give me unconditional love and support. My gratitude and love to them are beyond words. I dedicate this dissertation to them.

Table of Contents

List of Tables
List of Figures
1 Introduction
1.1 Temporal and Spatial Alignment of Multimedia data
1.2 Main Contributions
1.2.1 ENF as a New Modality for Synchronization
1.2.2 Extraction of ENF from Videos Created with Rolling Shutter
1.2.3 Synchronization and Restoration of Old Audio Recordings
1.2.4 Quality Assessment of Image Features For Efficient and Robust Visual Matching
1.3 Dissertation Organization
2 ENF Signal as a New Modality for Video Synchronization
2.1 Chapter Introduction
2.1.1 Synchronization Challenges and Prior Art
2.1.2 New Synchronization Modality via ENF
2.2 Multimedia Synchronization using ENF from Audio
2.2.1 Estimation of ENF from Audio Signal
2.2.2 Video Synchronization by Matching ENF Signals
2.2.3 Discussions
2.3 Chapter Summary
3 ENF Extraction from Visual Recording
3.1 Chapter Introduction
3.2 Extracting ENF from Visual Recording
3.2.1 Exploiting the Rolling Shutter Mechanism
3.2.2 ENF Estimation
3.2.3 Videos with Motions
3.2.4 Brightness Change Compensation
3.2.5 Camera Motion Compensation
3.3 Video Synchronization by Matching ENF from Image Sequence
3.4 Chapter Summary
4 ENF Analysis on Historical Audio Recordings
4.1 Chapter Introduction
4.2 Distortions of ENF Signal in Historical Audio Recordings
4.2.1 Multiple ENF Traces in Recaptured Audio Recordings
4.2.2 The Drifting Effect
4.3 Audio Speed Restoration
4.3.1 ENF as a Guidance for Speed Correction
4.3.2 Experiments and Examples
4.4 Chapter Summary
5 Quality Evaluation of Image Features for Efficient and Robust Spatial Alignment
5.1 Chapter Introduction
5.1.1 Motivation
5.1.2 Related Work
5.2 Quality Evaluation of SIFT Features
5.2.1 Definition of Quality Metric
5.2.2 Soft Quantization in Feature Space
5.2.3 Kernel Training and Quality Assessment
5.3 Experiment Results
5.3.1 Implementation and Experiment Setup
5.3.2 Feature Selection Performance
5.4 Discussions
5.4.1 Examining the Quality Metric Kernel
5.4.2 Generalization To Other Local Image Features
5.5 Chapter Summary
6 Conclusions and Future Perspectives
Bibliography

List of Tables

4.1 Comparison of the speed correction error using different rate conversion methods.
4.2 Comparison of the speed correction error using different sizes of ENF frame (Le) and correction frame (Lc).
5.1 List of transformations used in quality metric kernel training
5.2 Number of features needed to achieve pre-defined Average Top on the UKB dataset
5.3 Average number of hits and retrieval time
5.4 Number of features needed to achieve pre-defined mAP on the Oxford Dataset
5.5 Number of features needed to achieve pre-defined mAP on the INRIA Holiday Dataset

List of Figures

2.1 Spectrograms and ENF estimates from test audio and power signals recorded at the same time.
2.2 Synchronization of the racquetball videos by aligning the ENF signals from soundtracks.
2.3 Synchronization of the parking lot videos by aligning the ENF signals from soundtracks.
2.4 Evaluation of synchronization accuracy.
2.5 Spectrogram strips around the ENF harmonics for the Apollo 13 recordings.
2.6 Synchronize the Apollo 13 mission recordings with the ENF signals.
3.1 The spectrogram of a video recording shooting a white wall under fluorescent lighting.
3.2 Timing of rolling shutter sampling.
3.3 Time domain illustration of the rolling shutter sample acquisition.
3.4 A filter bank model of rolling shutter sample acquisition.
3.5 A test video of a white wall.
3.6 The spectrogram of the row signal from a white wall video recording.
3.7 The ENF signal estimated from a white wall video.
3.8 The ENF signal estimated from a surveillance video.
3.9 The ENF signal estimated from a video of a parking lot during the night.
3.10 An example of finding mutual motion-free regions.
3.11 ENF estimation using the static regions in the hallway test video.
3.12 Sample frames of the Hexbug test video.
3.13 ENF estimation using the static regions in the Hexbug test video.
3.14 Demonstration of brightness change in video.
3.15 The relationship between the pixel values in static regions of two images subject to brightness change.
3.16 The effectiveness of the brightness change compensation technique.
3.17 Sample frames from a test video with camera motion.
3.18 Experiment of a video recording with camera motion.
3.19 Example of video synchronization by aligning the ENF signals.
4.1 Spectrograms of the original and recaptured audio recordings.
4.2 Spectrograms of the original and digitized Kennedy Speech recording.
4.3 The mean and the variance of the NCC values.
4.4 ENF fluctuations as a function of time.
4.5 ROC for audio recapture detection with different clip lengths.
4.6 Demonstration of the drifting effect with a synthetic tone signal.
4.7 Compensating the drifting effect.
4.8 The spectrogram of an Apollo 11 Mission recording.
4.9 Using linear fitting to compensate for the drifting effect.
4.10 The correlation between the timing signatures of the two Apollo 11 Mission recordings.
4.11 The diagram of the ENF-based speed correction scheme.
4.12 Speed correction based on up-sampling and down-sampling.
4.13 Bilinear interpolation for tape speed adjustment.
4.14 The spectrogram of a test audio before correction.
4.15 The spectrogram of a test audio after correction.
4.16 The spectrogram of an Apollo mission control recording before and after speed correction.
4.17 The ENF signal extracted from the Apollo mission control recording before and after speed correction.
4.18 The Quindar tones in an Apollo Mission recording before and after restoration.
5.1 Comparing SIFT feature matching with and without feature pruning.
5.2 Sample images in the training dataset.
5.3 Comparing different feature selection schemes.
5.4 Average Top Scores on the UKB dataset using different feature selection schemes.
5.5 Effect of α on the Average Top Score.
5.6 The mAP scores on the Oxford Buildings dataset using different feature selection schemes.
5.7 The mAP scores on the INRIA Holiday dataset using different feature selection schemes.
5.8 The mAP scores on the INRIA Holiday dataset as a function of feature size.
5.9 Analyzing the quality score as a function of scale and contrast value, respectively.
5.10 Visual words with low quality scores (top) and high quality scores (bottom).
5.11 The mAP scores on the INRIA Holiday dataset using different SURF feature selection schemes.

Chapter 1
Introduction

Recent years have witnessed rapid growth in the amount of digital multimedia data. With the advancement in device design and manufacturing technologies, mobile phones and portable cameras with improved imaging quality and large memory storage capacity are becoming ubiquitous. These developments encourage people to take photos and audio/video recordings to capture interesting moments and important events. Photos created by different people can often include the same objects, such as landmarks. Similarly, events like concerts or speeches may be recorded in multiple audio and video clips. Numerous applications arise for which the common content in images and audio/video recordings is exploited.

When an event is recorded simultaneously by multiple independent cameras, and possibly from a variety of angles, combining the information in these videos may provide a better presentation and a novel experience of the event than any recording alone. For example, using 3D reconstruction techniques, a dynamic scene may be reconstructed from multiple video streams, allowing people to choose from different viewing angles of the scene. Several videos of various perspectives can be "stitched" together to achieve a wider field of view via video panorama [6]. A video sequence of high space-time resolution can be obtained by combining information from multiple low-resolution video sequences of the same dynamic scene [53]. Similarly, multiple photos containing the same object or scene can be exploited to create visually appealing images with panoramic views and super-fine resolutions.

The existence of common objects in multiple images has also motivated research in content-based image retrieval (CBIR). Given a query image, the goal is to retrieve from a large database the images containing the same object or scene as in the query image. CBIR can be useful in scenarios including landmark recognition, direction aid for tourists, and automatic CD/book annotation.

1.1 Temporal and Spatial Alignment of Multimedia data

Temporal and spatial alignment are two essential problems to solve for many applications involving multiple pieces of audio-visual data.
Temporal alignment, or synchronization, is the process of finding the time correspondence between multiple audio/video streams. Most existing approaches to multimedia signal synchronization extract and match audio/visual features from the audio/video recordings. These methods may not perform well in certain situations. For example, it is difficult to synchronize video sequences using visual features when the videos do not share sufficient common areas of the scene or when the viewing angles differ significantly; similar limitations apply to the alignment of audio recordings that have few common acoustic or speech components.

In addition to synchronization, a complete multimedia application system must also address spatial alignment, which is the process of transforming different images of the same scene or object into one coordinate system. Local image features are one of the most successful solutions for image alignment. In the local feature framework, interest points are first selected as distinctive and robust points in the image by a key point detector. Next, a robust feature descriptor is generated using the information within the neighborhood of the interest point. The interest point and feature descriptor are designed so that the same object produces interest points at the same image structures with similar descriptors, regardless of changes in viewing angle and lighting conditions. The point correspondence between images of the same scene can be obtained by matching the feature descriptors.

Most of the solutions for CBIR are also built on matching local image features. To cope with large-scale databases, a visual dictionary-based Bag-of-Words (BoW) approach has been proposed by Sivic and Zisserman [55]. They propose to quantize image features into a set of visual words as a codebook by using k-means clustering on many training features. A given feature can be mapped to its nearest visual word in the codebook. The images are then represented by the frequency vector of the visual word occurrences. The similarity between two images is usually measured using the L1 or L2 distance between their visual word frequency vectors. During a query, the similarity score can be computed efficiently by using an inverted file system associated with the database.

1.2 Main Contributions

We explore two main research problems related to the temporal and spatial alignment of multimedia data in this dissertation. The first problem asks how to synchronize audio and video recordings under conditions that are adverse for conventional methods. We propose a new synchronization modality that exploits the electric network frequency (ENF) signal naturally embedded in multimedia recordings. In contrast to conventional methods, our proposed approach does not rely on the perceptual audio and visual information of the recordings. It offers a strong potential to address difficult scenarios that are otherwise intractable. For video synchronization, it imposes no major constraints on the viewing angles and camera calibrations, as other methods do.

The second problem we explore is how to improve the robustness and efficiency of feature-based image matching. Image feature matching remains critical for image alignment and retrieval. The improvement of image resolution and the growth of image databases in scale may lead to an explosion in the number of local image features.
To expedite feature matching without greatly affecting accuracy, it is desirable to select a subset of the most reliable and informative features to represent each image. We propose a method to evaluate feature quality in terms of robustness and discriminability. Every feature is assigned a quality score used for feature selection. The topics explored in this dissertation are listed below. 4 1.2.1 ENF as a New Modality for Synchronization ENF is the supply frequency of the alternating current in a power grid. The nominal value of the ENF is 60Hz in the Americas, Taiwan, Saudi Arabia, and Philippines, and is 50Hz in other regions except Japan, which adopts both fre- quencies. The instantaneous value of the ENF constantly deviates slightly and randomly from its nominal value, as a result of the dynamic interaction between power generation and consumption and the control mechanism of the power grid. The instantaneous values of the ENF over time is referred to as the ENF signal. The variation patterns in the ENF signal are shown to be consistent within the same power grid, even for distant locations. Multimedia recordings created using electric devices plugged into the power mains or located near electric activities often pick up ENF signals in audio due to electromagnetic interference and/or acoustic vibrations [27]; and in video due to imperceptible flickering in indoor lighting [23]. Viewed as a continuous random process over time, the ENF signal embedded in audio and video signals can be used as a timing fingerprint that is unique at any specific time instance. In this dissertation, we propose to match the ENF signals extracted from audio/video recordings to achieve temporal alignment. After extracting ENF signals from the recordings to be synchronized, the normalized cross correlation coefficients are calculated with different lags between the ENF signals, and the lag corresponding to the maximum correlation coefficient is identified as the temporal shift between the recordings. The experimental results show that the synchronization error of the ENF-based method can be under 0.1 second. 5 1.2.2 Extraction of ENF from Videos Created with Rolling Shutter ENF signals may be extracted from the soundtracks of the video recordings, as well as the image sequences if the video captures the subtle flickering of lights. Extracting the weak ENF signal from image sequences poses a challenging task which has not been well addressed. The temporal sampling rate of visual recordings is generally too low for estimating the ENF signal that may appear at harmonics of 50 or 60 Hz. The ENF traces in video signals are relatively weak, and they may easily be distorted by object and camera motions. We exploit the rolling shutter that is commonly adopted for CMOS sensors to facilitate ENF estimation from video. Unlike global shutters often employed in charge-coupled device (CCD) sensors that record the entire frame from a snapshot of a single point in time, a camera with a rolling shutter scans the vertical or horizontal lines of each frame in a sequential manner. As a result, different lines in the same frame are exposed at slightly different times. By treating each line of the frame as a sample point, the sampling rate can be much higher than the frame rate, which helps solve the aliasing problem. Existing work on ENF estimation from image sequence of video is limited to videos of static scenes. 
It is more challenging to extract ENF signals from video recordings with object motions, brightness changes, and camera motions. To address object motions, if the scene in the video contains a static background, we can use these static regions to estimate the ENF signal. Many cameras are equipped with a brightness control mechanism that adjusts the camera sensor's aperture and/or sensitivity to light in response to the illumination conditions, so that the overall brightness of the acquired image remains visually pleasing. Such brightness changes affect the estimation of ENF signals. From our study, we find that the brightness changes may be well modeled by a linear transform. A linear transform is estimated to compensate for the brightness changes effectively. For videos with camera motions, an optical flow-based method is used to estimate and compensate for the pixel shift among video frames.

1.2.3 Synchronization and Restoration of Old Audio Recordings

As the ENF signal is embedded into multimedia recordings at the time of recording, several interesting challenges arise for ENF analysis on recaptured audio recordings. ENF signals in recaptured audio recordings may contain two components: one is inherited from the original recording, referred to as the original ENF signal; the other becomes embedded during the recapturing process, referred to as the recapturing ENF signal. When the original ENF and recapturing ENF overlap, conventional ENF estimation methods may fail. To solve this problem, a decorrelation-based algorithm is designed to effectively estimate both ENF signals from recaptured audio signals in a sequential order.

In addition to the possible superposition of multiple ENF traces, the "drifting effect" presents another distortion of the ENF signals unique to digitized analog tape recordings. Due to mechanical imperfections of analog recorders and tapes, the rolling speed of these tapes often varies over time during recording and replay. The inconsistency between the rolling speeds of the original recording and of playback during digitization induces speed errors in the digitized audio. Techniques are designed to compensate for such a drifting effect for accurate ENF analysis.

We also show that ENF signals suffering from the drifting effect can be exploited to correct tape speed errors. Using the ENF signal as a reference, the speed of audio can be restored by temporally stretching or compressing the audio signal via rate conversion. We have considered two schemes of rate conversion based on resampling and interpolation, respectively, and their performances are experimentally compared. The effectiveness of the proposed scheme is shown by a demonstration of speed restoration with recordings of the NASA Apollo 11 Mission.

1.2.4 Quality Assessment of Image Features For Efficient and Robust Visual Matching

Feature selection is required to handle image matching efficiently over large-scale data sets. We propose a quality metric to evaluate the usefulness of individual image features. Features with low quality scores may be discarded to reduce the number of features and to improve matching efficiency. We demonstrate the proposed approach with the SIFT feature, which is widely used in computer vision and image processing applications. The methodology may be generalized to other features as well.

Our feature quality metric uses a learning-based method. A feature space is established that consists of the scale, contrast value, and descriptors of SIFT features.
8 The whole feature space is quantized into a certain number of bins. To improve per- formance, we adopt a soft quantization scheme. We collect a set of representative sample images covering various scene types and categories as training images, and extract SIFT features from these training images. The SIFT features are mapped to their nearest bins in the feature space. We conduct synthetic transformations of interest to the training images, match SIFT features between every original and transformed image pair, and record the matching results for the features in each feature space bin. The ratio of the number of correct matches over the total number of features in each bin is calculated and used as that bin’s quality score. A quality metric kernel is thus obtained. Given a SIFT feature, we can compute its quality score using the quality metric kernel. A few nearest feature space bins are first identified, then the quality score of the feature is computed as the weighted average of the quality scores of its nearest bins. The proposed approach is tested on three benchmark datasets for large scale content-based image retrieval. Feature selection according to the proposed quality score is shown to perform better than empirical methods, such as selecting features with largest scales and highest contrast values. 1.3 Dissertation Organization The rest of the dissertation is organized as follows. In Chapter 2, we discuss how the ENF signal can be estimated from audio signals and used for synchroniza- tion of audio and video clips. Experiments on various audio and video recordings 9 are conducted to demonstrate the effectiveness of the proposed approach. The syn- chronization accuracy is also tested and analyzed. In Chapter 3, we study the problem of ENF extraction from visual recordings, which is more challenging than audio recordings. The rolling shutter mechanism is first modeled and analyzed using multi-rate signal processing theory. To move beyond static videos, we propose several techniques to address the object motions, camera motions and brightness changes in video. We then conduct experiments to demonstrate that matching the ENF signals estimated from the image sequences of videos can effectively synchronize multiple video streams. In Chapter 4, we investigate the ENF analysis on recaptured audio recordings. A decorrelation based method is introduced to estimate multiple ENF signals that may superimpose in a recaptured audio recording. The drifting effect of the ENF signal in digitized audio recording with speed errors is examined. We show that the ENF signal may serve as a reference signal to correct speed errors. Several historical recordings of the NASA Apollo 11 Mission are successfully restored with this method. In Chapter 5, we investigate the SIFT feature selection for efficient and robust visual matching. A quality metric considering the scale, contrast value, and the descriptor of SIFT features is obtained through a learning based method. Feature selection using the proposed quality metric is shown to be effective on several bench- mark data sets for large scale image retrieval. We also show that the methodology applies to other image local features such as SURF. Finally, in Chapter 6, we conclude this dissertation and outline research issues 10 that can be explored in the future. 
11 Chapter 2 ENF Signal as a New Modality for Video Synchronization 2.1 Chapter Introduction The analysis of electric network frequency (ENF) signals has emerged in recent years as an important technique for digital multimedia forensics [22, 25, 27, 30–32, 50, 51, 58]. ENF is the supply frequency of the alternating current in a power grid. The nominal value of the ENF is 60Hz in the Americas, Taiwan, Saudi Arabia, and Philippines, and is 50Hz in other regions except Japan, which adopts both frequencies. The ENF has three interesting properties. First, the instantaneous value of the ENF constantly deviates slightly and randomly from its nominal value, as a result of the dynamic interaction between power generation and consumption and the control mechanism of the power grid. The instantaneous values of the ENF over time is referred to as the ENF signal. In second property of ENF, the 12 variation patterns in its instantaneous values are consistent within the same power grid, even for distant locations. For example, there are only three major power grids in the United States, the Eastern Grid, the Western Grid, and the Texas or ERCOT (Electric Reliability Council of Texas) Grid. So, there are mainly three different instantaneous variation patterns of ENF in the United States. Last but not least, multimedia recordings created using electric devices plugged into the power mains or located near electric activities often contain ENF signals in audio due to electromagnetic interference and/or acoustic vibrations [27]; and in video due to imperceptible flickering in indoor lighting [23]. The ENF signal extracted from audio or video recordings has been shown to exhibit a high correlation with the ENF extracted from the power mains measurements at the corresponding time. Several forensic applications have been proposed based on the analysis of the ENF signal. For example, ENF signals have been successfully used as a natural time stamp to authenticate audio recordings [27, 32, 51]. By examining the phase conti- nuity of the ENF signal, one can detect the region of tampering [50]. Some recent work shows that the ENF signal can also reveal information about the locations and regions in which certain recordings are made [22,30,31]. In this work, we explore the potential of the ENF signal from a new perspective and use it for synchronization of multimedia signals, i.e. the temporally alignment of audio and video recordings. Synchronization poses a fundamental problem for ap- plications dealing with multiple pieces of multimedia signals, such as view synthesis and A/V experience reconstruction. 13 2.1.1 Synchronization Challenges and Prior Art In professional video productions such as sports TV broadcasting, the record- ing cameras may be synchronized based on coordinated hardware and communica- tion protocols to provide synchronized timestamps and ensure accurate temporal alignment. For distributed and ad-hoc settings involving consumer-level devices, different cameras’ clocks are not easily synchronized to the frame level. Although an increasing number of devices are now equipped with GPS capabilities that may supply a universal time stamp, many low-cost and low-power devices are not. One line of prior art developed mobile applications and protocols to provide synchronized time tags on video frames [20]. 
As it requires that users of mobile devices must first install an application and/or follow a prescribed protocol before their videos can be aligned, this approach cannot deal with unconstrained streams for general and less intrusive settings in today’s social media and emerging big data paradigms. In absence of proactive synchronization mechanisms, the current solution must rely primarily on visual content and/or sound content, which provides limited alignment resolution on media streams at a rather high computational cost. For these reasons, we have seen limited availability of efficient and effective alignment technologies on audio and video streams, which comes in a sharp contrast with the broad availability and adoption of “stitching” tools for still images. Efforts have been made to address the video synchronization problem us- ing signal processing and computer vision techniques, as shown in the literatures [13,14,46,54,56,63]. In [56], a three-step approach using a set of corresponding fea- 14 ture points is proposed. The method in [13] simultaneously estimates both spatial alignment and temporal synchronization between two video sequences using spatio- temporal information. In [14], alignment is achieved by matching trajectories of moving objects. The authors in [54] present a synchronization method based on detecting camera flashes present in the video content. Most of the existing work exploits certain features from the video signals that can be detected in one video sequence and related to corresponding features in other sequences. These feature- based methods rely on the visual content of the videos and may not always perform well. For example, it is difficult to synchronize video sequences using visual features when they do not share sufficient common contents of the scene, or the viewing an- gles differ significantly. Some methods impose constraints on the video recordings to be synchronized, such as calibrated or static cameras and homography between image sequences. These constraints may not always be satisfied in practice. 2.1.2 New Synchronization Modality via ENF We propose a new modality for video synchronization by exploiting the ENF signal naturally embedded in video recordings [60,61]. In viewing the ENF signal as a continuous-time random process, its realization in each recording may serve as a timing fingerprint. Synchronization of audio and video recordings therefore can be performed by matching and aligning their embedded ENF signals. This approach differs greatly from how the existing work tackles the audio/video synchronization problem, and it has several advantages over conventional methods. The ENF based 15 method does not rely on the analysis of audio and visual content in the recordings to be synchronized. This property allows it a strong potential to address such difficult scenarios that remain intractable with existing methods. In video synchronization, for example, the conventional approaches based on visual cues do not perform well in situations where arbitrary camera motion occurs or the view overlap is insufficient. The ENF based method is not affected by these adverse conditions. Additionally, extracting and aligning ENF signals may be more effective computationally than the approaches that rely on computer vision and/or extensive learning, so more (or longer) recordings could be efficiently processed. It can also be easily generalized to synchronize multiple pieces of recordings. 
ENF signals may be extracted from the soundtracks of video recordings, as well as from the image sequences if the video captures the subtle flickering of lighting. Extracting the weak ENF signal from image sequences presents a challenging task. The temporal sampling rate of visual recordings is generally too low for estimating the ENF signal that may appear at harmonics of 50 or 60 Hz. The ENF traces in video signals are relatively weak and may easily be distorted by object and camera motions. In this chapter, we focus on exploiting ENF signals from video soundtracks for synchronization. The extraction of ENF from visual recordings will be addressed in the next chapter.

2.2 Multimedia Synchronization using ENF from Audio

2.2.1 Estimation of ENF from Audio Signal

A general and easily implementable approach to estimating the ENF signal from a source signal, such as audio, is the short-time Fourier transform (STFT), which is a popular non-parametric tool for frequency analysis of time-varying signals. It divides a signal into possibly overlapping frames of small duration. Within every frame, the signal can be regarded as wide-sense stationary, and each frame undergoes Fourier analysis separately. For ENF estimation, we apply the STFT to a source signal that contains ENF traces, and find the peak frequency within a certain range near the nominal value or its harmonics in each frame. The values of the peak frequency from all the frames are concatenated to form the estimated ENF signal. Plotting the squared magnitude of the STFT for each time frame, a spectrogram is used to visualize the spectrum of the source signal. The spectrogram is usually displayed as a two-dimensional intensity plot with the two axes being time and frequency, respectively.

Comparisons of various frequency estimation approaches for ENF are carried out in [30, 44]. In [37], the authors evaluate the use of quadratic interpolation over DFT samples as a means to improve the accuracy of ENF estimation. A multi-tone harmonic model is used to build a maximum-likelihood estimator for ENF in [12]. Similarly, the use of multiple harmonics is also investigated in [29]. The weighted energy method as used in [23] is adopted here for its robustness and low complexity.

The audio signal is divided into frames of a certain length (e.g., 8 seconds), and the fast Fourier transform (FFT) is calculated for every frame. The ENF signal is then estimated as the dominant frequency within a small vicinity of the nominal value of the ENF:

F(n) = \frac{\sum_{l=L_1}^{L_2} f(n,l)\,|s(n,l)|}{\sum_{l=L_1}^{L_2} |s(n,l)|},    (2.1)

where f_s and N_{FFT} are the sampling frequency of the signal and the number of FFT points, respectively; L_1 = (f_{ENF} - \Delta f) N_{FFT} / f_s and L_2 = (f_{ENF} + \Delta f) N_{FFT} / f_s, where f_{ENF} is the nominal value of the ENF and \Delta f defines the range of frequencies in consideration; f(n,l) and s(n,l) are the frequency and the energy in the l-th frequency bin of the n-th time frame, respectively.

According to sampling theory, if a signal contains no components of frequency higher than F Hz, it can be completely determined by its sampled version when the sampling rate exceeds 2F Hz. The ENF signal embedded in multimedia recordings is usually present close to its nominal value (50 or 60 Hz) and/or several higher-order harmonics. Digital audio recordings are usually created with a sampling rate that is much higher than the value of the ENF. Therefore, we can down-sample an audio signal before extracting its ENF signal to reduce the computational cost.
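As a concrete illustration, the following is a minimal Python/NumPy sketch (not the implementation used in this dissertation) of the frame-wise weighted dominant-frequency estimator in Eq. (2.1), together with the normalized cross-correlation lag search used for alignment in Section 2.2.2. The frame length, frame shift, FFT size, and half-bandwidth Δf are illustrative parameters, and the input is assumed to be already down-sampled (e.g., to about 1 kHz) as discussed above.

```python
import numpy as np

def estimate_enf(signal, fs, f_enf=120.0, delta_f=1.0,
                 frame_len=8.0, frame_shift=1.0, nfft=2**16):
    """Frame-wise ENF estimate: weighted dominant frequency near f_enf, as in Eq. (2.1)."""
    n_frame, n_shift = int(frame_len * fs), int(frame_shift * fs)
    nfft = max(nfft, n_frame)                      # make sure the frame is not truncated
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    band = (freqs >= f_enf - delta_f) & (freqs <= f_enf + delta_f)
    window = np.hanning(n_frame)
    enf = []
    for start in range(0, len(signal) - n_frame + 1, n_shift):
        mag = np.abs(np.fft.rfft(signal[start:start + n_frame] * window, nfft))
        weights = mag[band]                        # spectral weights within the ENF band
        enf.append(np.sum(freqs[band] * weights) / np.sum(weights))
    return np.array(enf)

def enf_lag(enf_a, enf_b, frame_shift=1.0):
    """Lag (in seconds) maximizing the normalized cross-correlation of two ENF signals."""
    a, b = enf_a - enf_a.mean(), enf_b - enf_b.mean()
    ncc = np.correlate(a, b, mode="full") / (np.linalg.norm(a) * np.linalg.norm(b))
    lag_frames = int(np.argmax(ncc)) - (len(b) - 1)  # sign convention follows np.correlate
    return lag_frames * frame_shift, float(ncc.max())
```

In practice the per-lag normalization and the handling of partially overlapping segments can be refined, but this captures the estimate-then-align flow described in this chapter.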
The ground-truth ENF signal can be obtained from power outlet measurements using a step-down transformer and a voltage divider circuit. Fig. 2.1 shows an example of ENF extraction from an audio signal. In this example, an audio recording and a power measurement recording were made simultaneously in the US, where the nominal value of the ENF is 60 Hz. The ENF signal can be extracted from around any harmonic of the nominal value of the ENF, as long as the ENF traces are strong enough. Here, we examine the second harmonic for the audio recording and the base frequency for the power recording. As can be seen from Fig. 2.1, the ENF signal estimated from the audio recording exhibits variation trends similar to the ground-truth ENF signal from the power outlet measurements.

Figure 2.1: Spectrograms and ENF estimates from audio and power signals recorded at the same time. (a) Spectrogram of the test audio signal around the 2nd harmonic (120 Hz); (b) spectrogram of the power signal around the base frequency (60 Hz); (c) ENF signal estimated from the audio signal; (d) ENF signal estimated from the power signal.

2.2.2 Video Synchronization by Matching ENF Signals

In this section, we discuss in detail how the ENF traces embedded in video soundtracks can be used for video synchronization. After taking the soundtracks from the two video recordings to be synchronized, we first divide each soundtrack into overlapping frames of length L_frame seconds. The overlap between adjacent frames is denoted as L_overlap seconds, so the shift from one frame to the next is L_shift = L_frame − L_overlap. For every frame, we estimate the dominant frequency near the nominal value of the ENF. The values of the estimated frequency are concatenated to form the ENF signal of each soundtrack. The normalized cross correlation (NCC) coefficients are calculated with different lags between the ENF signals. The lag corresponding to the maximum correlation coefficient is identified as the temporal shift between the two videos.

In the following, we show the experimental results of the proposed approach using battery-powered consumer-level cameras. In the first experiment, we made two video recordings of people playing racquetball in a gym with a Canon PowerShot SX230 HS camera and a Canon PowerShot A2300 camera, respectively. During the recording, the cameras were fixed on tripods shooting the racquetball court from different viewing angles. Both recordings are approximately 10 minutes long, and one begins approximately 20 seconds earlier than the other.

The ENF signals are estimated from the soundtracks of the video clips, and their NCC is calculated with different values of the lag between them. In Fig. 2.2 (a), we plot the NCC as a function of the lag, and observe a
The images in the same row are from the same video stream, while the images in the same column correspond to the same time instance. By examining the girl’s movement in the images, we can see that the two video sequences are well synchronized after using the proposed approach. For the second experiment, we use the same cameras to make two video record- ings in an office building. The cameras were placed in different rooms facing a park- ing lot through glass windows. We estimate the ENF signals from the soundtracks of the two video recordings and calculate their correlation coefficient as a function of the lags between them, which is plotted in Fig. 2.3 (a). A peak value of 0.95 is found at the lag of 92.16 seconds. The estimated ENF signals after alignment and their reference ENF signal are shown in Fig. 2.3 (b). Although some segments of the ENF signal estimated from one of the video soundtracks (video 1) suffer from some short-term distortions around the 110th second, the video clips are still well synchronized, as shown by the sample image frame pairs in Fig. 2.3 (c). In the examples shown in this section, the cameras are fixed during recording, and some view overlap occurs between the videos to be synchronized. These settings are chosen so that we can appreciate the effectiveness of the proposed approach by visually examining the image frames. It should be noted that, as an advantage over 21 many existing methods, no constraints exist on the camera motion or calibration. Also, no requirement is necessary for any view overlap when using the ENF signals extracted from the soundtracks for synchronization of video recordings. Readers may note the substantial differences of view points and depth of the two cameras in both examples. For such situations, even when some view overlap occurs, it can be challenging to handle with traditional computer vision based approaches. The accuracy of synchronization is important for many applications involving multiple videos. Experiments are conducted to evaluate the synchronization accu- racy of the proposed method. We use two audio recorders to make a total of around 17 hours of audio recordings in an office room. A beep sounds at the beginning of the recording so the ground truth of the lag between the recordings can be obtained by comparing the waveforms of the beep using audio editing software. The recordings are divided into clips of Lc seconds, and each clip is treated as an individual test sample. We apply the proposed method to synchronize the test clips and examine the synchronization accuracy under two variables: the length of the test clips Lc, and the strength of the ENF signal in the source signal measured by the signal to noise ratio (SNR). Assume that the ENF signal is estimated near f0 Hz, which may be the nominal value of ENF or one of its harmonics. The SNR is estimated as the ratio of the energy density in the narrow frequency band (f0 − fδ1, f0 + fδ1) to that in the band (f0 − fδ2, f0 − fδ1) and (f0 + fδ1, f0 + fδ2), where fδ2 > fδ1. To test the more challenging scenarios with low SNR levels, white Gaussian noise of controlled strength is added to the original source signals. Fig. 2.4 shows the synchronization error of each test clip pair and the estimated 22 SNR of the ENF signal. As expected, the synchronization accuracy improves with higher SNR and longer duration of clip. 
The average absolute synchronization errors for the test clips without added noise are 0.48, 0.40, and 0.17 seconds for clip sizes of 5, 10, and 15 minutes, respectively. Although most demonstrations of ENF being picked up by digital audio and video recordings in areas of electrical activities have been reported in the recent decade, the presence of ENF can be found in analog recordings made throughout the second half of the 20th century. In our recent work, we demonstrated that ENF traces can be found in digitized versions of 1960s phone conversation recordings of President Kennedy in the White House [58]. Using ENF to analyze historical recordings could offer many useful applications for forensics and archivists. For instance, many 20th century recordings are important cultural heritage records, but lack necessary metadata, such as the date and time of recording. Also, the need may arise to timestamp old recordings for investigative purposes, and ENF may provide a solution. As an example of exploring historical recording synchronization using the ENF signals, we analyze two recordings from the 1970 NASA Apollo 13 mission [1] that we know were recorded at approximately the same time. The first recording comes from the PAO (Public Affairs Officer) loop, which is the space-to-ground communications that was broadcast to the media. The second recording is of GOSS Net 1 (Ground Operational Support System), which is the recording of the space-to-ground audio as the people in mission control heard it. Both recordings are approximately four hours long. Figure 2.5 shows spectrogram strips for both recordings about the 23 ENF harmonics. We observe that for the first recording, the ENF clearly appears around all the harmonics, and it is especially strong around 360Hz. For the second recording, the ENF is noisier, and it appears best around 120Hz and 360Hz. We extract the ENF of the first recording from around 360Hz. For the second recording, we use the spectrum combining technique for ENF estimation [29], where we combine the ENF traces from around 120Hz and 360Hz to arrive at a more reliable ENF estimate. The resulting ENF signal remains rather noisy; we clean the signal by locating outliers and replacing them using linear interpolation from surrounding ENF values. Figure 2.6 (a) shows 20-minute simultaneous ENF segments from both recordings, with the second ENF signal displaced by 0.05Hz to distinguish them and see them separately. Visually, the two signals appear similar. In a synchronization scenario, we would need to match ENF segments from two or more signals with potentially different lags, then decide on the correct lag based on the similarity of the segments, using the correlation coefficient as a metric. As a proof-of-concept for the Apollo data described above, we divide the first Apollo ENF signal into overlapping 10-min ENF segments, and for each segment, we correlate it with equally-sized segments from the second Apollo ENF with varying lags. The two signals were recorded at the same time, so the ground truth suggests that the highest correlation should be at zero lag. Figure 2.6 (b) shows the mean values of the correlations achieved for different lags, and we can clearly see that the highest correlation is achieved for zero lag, which matches the ground truth. We observe that the techniques discussed earlier for audio and video alignment can be extended to aligning two historical recordings of interest. 
This can potentially 24 help timestamp old recordings of unknown date of capturing. With old recordings, we may not always have access to reference power ENF, as in the case considered here, yet we have the potential to utilize historical recordings of known date and time to create an ENF database to which we can compare recordings of interest with uncertain information about capturing time. 2.2.3 Discussions The prerequisite for the proposed approach is that ENF signals can be ex- tracted well from the audio recordings to be synchronized. This prerequisite may not always be satisfied in practice. For example, audio recordings created by battery- powered recorders at locations far from electrical equipments are likely not to capture ENF traces. We have conducted experiments to study the existence of ENF signals. We use battery-powered audio recorders to make recordings at multiple locations inside a university building. These locations include hallways, classrooms, offices, computer labs, and study lounges. The reference power recordings are recorded as well. With a total of 100 test audio recordings, we are able to extract ENF signals that match with the power reference from 81 of them. It proves that most of audio recordings in the building can capture ENF signals. Statistical modeling and analysis of ENF signals can be performed to study ENF signal’s capability of synchronization theoretically. It is shown in [25] that the ENF signal can be modeled as an autoregressive random process. Based on this model, a decorrelation-based approach is proposed to improve the matching 25 accuracy. Such study can help understand the effect of ENF signal’s duration and signal-to-noise ratio to the synchronization accuracy. 2.3 Chapter Summary In this chapter, we propose to synchronize audio and video recordings by extracting and matching the ENF signal that is naturally embedded in the audio clip during its creation. The value of the ENF signal fluctuates around its nominal value randomly over time. The variation patterns of the ENF are very similar within the same power grid, even at distant locations. Audio recordings can often capture the ENF signal due to electromagnetic interference to the recording sensor or the electric humming. By correlating the ENF signals extracted from audio recordings or the soundtracks of videos, multiple audio and video clips can be synchronized. We have conducted experiments to test the synchronization accuracy of the proposed approach. With a fair signal-to-noise ratio for the ENF signal embedded in the audio signal, the average synchronization error can be as low as about 0.12 second, which is equivalent of 3-4 frames for most video clips created by consumer cameras with a frame rate at 30 fps. Our study also shows that the ENF signal may be present in historical record- ings as well. In particular, we are able to extract ENF signal from some digitized analog recordings of the NASA Apollo 13 Mission. As an example, we have success- fully synchronized two audio clips that were recorded at roughly the same time but at different loops. Such analysis of old recordings may be of great value for revealing 26 unknown historical facts and archiving historical events. The traces of ENF may also be found in visual recordings, such as the image sequence of video clips. Extracting the weak ENF signal from image sequences presents a challenging task. 
The temporal sampling rate of visual recordings is generally too low for estimating the ENF signal that may appear at harmonics of 50 or 60 Hz. The ENF traces in video signals are relatively weak, and may be easily distorted by object and camera motions. In the next chapter, we focus on the extraction of ENF from visual recordings.

Figure 2.2: Example-1 of video synchronization by aligning the ENF signals from soundtracks. (a) The correlation coefficients of the ENF signals as a function of the relative lag between them. (b) The ENF signals from the two video recordings after alignment, and the ENF measured from the power mains at the same time. (c) Several sample frame pairs after alignment. Rows correspond to video sequences, and columns correspond to time instances.

Figure 2.3: Example-2 of video synchronization by aligning the ENF signals from soundtracks. (a) The correlation coefficients of the ENF signals as a function of the relative lag between them. (b) The ENF signals from the two video recordings from different viewing angles after alignment, and the ENF measured from the power mains at the same time. The signals are properly shifted to facilitate comparison. (c) Several sample frame pairs after alignment. Rows correspond to video sequences, and columns correspond to time instances.

Figure 2.4: The synchronization error with clip size being (a) 300 seconds; (b) 600 seconds; (c) 900 seconds.

Figure 2.5: Spectrogram strips around the ENF harmonics for the Apollo 13 recordings. (a): PAO recording; (b): GOSS recording.

Figure 2.6: Synchronize the Apollo 13 mission recordings with the ENF signals.

Chapter 3
ENF Extraction from Visual Recording

3.1 Chapter Introduction

Most previous work related to the analysis of ENF signals is built on extracting ENF traces from audio recordings [27, 29, 50, 51]. Recently, it has been found that indoor lights such as fluorescent lamps and incandescent bulbs vary their light intensity in accordance with the AC supply voltage, which varies according to the AC supply frequency [23]. As a result, cameras can capture the light intensity variation, which can be used to extract the ENF signal. In [23], the authors took the mean of the pixel values in every frame of video recordings that capture indoor lighting, then used spectrogram analysis to estimate the embedded ENF signal.
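As a rough sketch of this frame-averaging idea (an illustration, not the code of [23]), the per-frame signal can be formed as follows, assuming the frames are available as grayscale NumPy arrays; the aliasing problem that limits this simple signal is discussed next.

```python
import numpy as np

def frame_mean_signal(frames, fps):
    """Spatial mean of each frame, giving a light-intensity signal sampled at the frame rate."""
    signal = np.array([float(np.mean(f)) for f in frames])
    return signal, fps   # with ~30 fps, the 100/120 Hz flicker component aliases to near 0 Hz
```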
The aliasing effect presents a major challenge to this scheme. Most consumer-level digital cameras adopt a frame rate of around 30 fps, while the ENF signal appears at harmonics of 50 or 60 Hz. Therefore, the ENF signal suffers from a severe aliasing effect induced by the insufficient sampling speed. In the US, the nominal value of the ENF is 60 Hz. If the frame rate is exactly 30 Hz, the ENF signal will be shifted to 0 Hz, i.e., the DC frequency. As a result, it is highly difficult to estimate the ENF signal due to the low signal-to-noise ratio. Figure 3.1 shows the spectrogram calculated from the mean values of the frames of a video shooting a white wall under fluorescent lighting. We observe that the ENF signal overlaps with the DC components and is difficult to extract.

Figure 3.1: The spectrogram of the mean values of the frames from a test video recording shooting a white wall under fluorescent lighting. The ENF signal overlaps with the DC components and is difficult to extract. Figure is best viewed in color.

It has been proposed to take advantage of the rolling shutter to address the problem of insufficient sampling rate [24]. Rolling shutters are commonly adopted for complementary metal-oxide semiconductor (CMOS) camera sensors. Cameras with global shutters, such as those with charge-coupled device (CCD) sensors, record an entire frame from a snapshot of a single point in time. In contrast, a camera with a rolling shutter scans the vertical or horizontal lines of each frame sequentially, so that different lines in the same frame are exposed at different times. If we treat each line of the frame as a sample, the temporal sampling rate can be much higher than the frame rate, which facilitates the estimation of the ENF signal.

In this work, we conduct a further study on the exploitation of the rolling shutter for extracting ENF traces from video recordings. We model and analyze the rolling shutter mechanism with a filter bank, and then perform analysis using multirate signal processing theory. We extend the scope of ENF extraction from videos of still scenes to those containing motions, which presents a more challenging problem and has not been formally attempted before. Several methods are developed, and promising results are observed.

3.2 Extracting ENF from Visual Recording

3.2.1 Exploiting the Rolling Shutter Mechanism

With a rolling shutter, each frame is recorded by scanning across the frame either vertically or horizontally, line by line, instead of capturing the whole frame at a single point in time as in the case of a global shutter. Figure 3.2 illustrates the timing of the image acquisition of rolling shutters, assuming the frame scanning is done row by row. Each row of the frame is exposed sequentially to light, followed by a possible idle period before proceeding to the next frame.

Figure 3.2: Timing of rolling shutter sampling: the rows of a frame are sequentially exposed, followed by an idle period before proceeding to the next frame.

Since pixels in different rows are exposed at different times, but are displayed simultaneously during playback, the rolling shutter may cause such distortions as skew, smear, and other image artifacts, especially with fast-moving objects and rapid flashes of light [35]. The sequential read-out mechanism of a rolling shutter has been traditionally considered detrimental to image/video quality due to its accompanying artifacts.
However, recent work has shown that the rolling shutter can be exploited with computer vision and computational photography techniques [7, 19, 28]. Our previous work [24] exploited the rolling shutter to extract ENF traces from videos of static scenes. In this work, we investigate the more challenging cases of videos containing motions [59].

A Filter Bank Model

For an image captured by a rolling shutter, we can treat the spatial mean of every row as a temporal sample, since all the pixels in a single row are exposed at the same time. As a between-frame idle period occurs, in terms of capturing the ENF signal over time, we are equivalently abandoning some samples that would have been generated during the idle period. The time-domain illustration of this model is shown in Figure 3.3. Here, we assume that the shutter is able to produce M samples at its full capacity, and only L samples among them are retained while the rest are discarded, where L ≤ M. We denote the input and output signals as x(n) and y(n), respectively. To facilitate frequency-domain analysis, we use an L-branch filter bank to model the relationship between the input signal x(n) and the output signal y(n), as shown in Figure 3.4. In each branch of the filter bank, the input goes through an M-fold down-sampler followed by an L-fold up-sampler, with appropriate delays at both the beginning and the end of the branch. The DTFT of the signal coming from the l-th branch can be analyzed according to multirate signal processing theory [64]:

\[
Y_l(\omega) = \frac{1}{M}\left(\sum_{m=0}^{M-1} X\!\left(\frac{\omega L + 2\pi m}{M}\right) e^{j\frac{\omega L + 2\pi m}{M}l}\right) e^{-j\omega l}. \tag{3.1}
\]

Figure 3.3: A time domain illustration of the rolling shutter sample acquisition. Every L out of M samples are retained, and the other samples that correspond to the idle periods are discarded.

Figure 3.4: A filter bank model of rolling shutter sample acquisition. In each branch of the filter bank, the input goes through an M-fold down-sampler followed by an L-fold up-sampler, with appropriate delays at both the beginning and the end of the branch.

So the DTFT of the combined final output Y(ω) is given by:

\[
Y(\omega) = \sum_{l=0}^{L-1} Y_l(\omega)
= \sum_{l=0}^{L-1} \frac{1}{M}\left(\sum_{m=0}^{M-1} X\!\left(\frac{\omega L + 2\pi m}{M}\right) e^{j\frac{\omega L + 2\pi m}{M}l}\right) e^{-j\omega l}
= \sum_{m=0}^{M-1} X\!\left(\frac{\omega L + 2\pi m}{M}\right) F_m(\omega), \tag{3.2}
\]

where

\[
F_m(\omega) = \frac{1}{M}\sum_{l=0}^{L-1} e^{-j\frac{\omega(M-L) - 2\pi m}{M}l}. \tag{3.3}
\]

Equivalently, with ω = 2πf, we have

\[
Y(f) = \sum_{m=0}^{M-1} X\!\left(\frac{\omega L + 2\pi m}{M}\right) F_m(f). \tag{3.4}
\]

Let T denote the frame duration of the camera. According to the notation above, the sampling rate of the shutter is f_s = M/T, and the perceptual sampling rate of the row signal is L/T. Here, L is the number of rows per frame, and the exact value of M depends on the CMOS manufacturer's design and is usually unavailable to the public. The spectrogram of the row signal is computed using the perceptual sampling rate L/T, instead of the actual sampling rate M/T. Starting from Equation (3.4), we can show that the Fourier representation of the row signal is

\[
Y(f) = \sum_{m=0}^{M-1} X\!\left(\frac{2\pi}{f_s}\left(f + \frac{m}{T}\right)\right) F_m(f). \tag{3.5}
\]

This suggests that the row signal is the weighted summation of a series of transformed versions of x(n) that are shifted by multiples of 1/T in the frequency domain.

3.2.2 ENF Estimation

In this section, we describe how to extract ENF traces from videos captured by rolling shutters. Without loss of generality, we assume the rolling shutter scans the frame row by row.
Consider a video signal s(r, c, n), where 1 ≤ r ≤ R, 1 ≤ c ≤ C and 1 ≤ n ≤ N denote the row index, column index and frame index, respectively. The video signal contains mainly two components: one is the visual component v corresponding to the visual scene, and the other is the ENF component e:

\[
s(r, c, n) = v(r, c, n) + e(r, c, n). \tag{3.6}
\]

The authors in [24] use the spatial average of each row in the video as the source signal to estimate the ENF signal. From Eq. (3.6), we see that the signal-to-noise ratio (SNR) of e in s may be low in the presence of the visual component v. For fixed spatial indices r and c, the visual component v(r, c, n) as a function of n is in general a low-pass signal. In order to suppress the effect of v and extract the ENF component e, we propose to apply high-pass filtering to the video signal s. In the next sections, we describe the high-pass filtering techniques for two cases: videos of static scenes and videos with motions.

We first consider a video recording with a static scene, so that the visual signals of every video frame are identical, i.e., v(r, c, n) = v(r, c). Under this assumption, Eq. (3.6) reduces to

\[
s(r, c, n) = v(r, c) + e(r, c, n). \tag{3.7}
\]

We can apply a high-pass filter to s by subtracting from it its mean value across all frames:

\[
\hat{s}(r, c, n) = s(r, c, n) - \bar{s}_n(r, c)
= s(r, c, n) - \frac{1}{N}\sum_{m=1}^{N} s(r, c, m)
= e(r, c, n) - \frac{1}{N}\sum_{m=1}^{N} e(r, c, m). \tag{3.8}
\]

Here e(r, c, n) is the sinusoidal ENF signal sampled at the r-th row and c-th column in the n-th frame. For any given r and c, e(r, c, n) as a function of n = 1, 2, ..., N is essentially a sinusoid sampled at the frame rate of the video recording. Since the frequency of the ENF signal is changing over time, the samples e(r, c, n) for n = 1, 2, ..., N tend to have different phases and cancel out. So, for a sufficiently large N, the average of these samples is close to 0, i.e.,

\[
\bar{e}_n(r, c) = \frac{1}{N}\sum_{m=1}^{N} e(r, c, m) \approx 0. \tag{3.9}
\]

This leads to

\[
\hat{s}(r, c, n) \approx e(r, c, n). \tag{3.10}
\]

After the high-pass filtering, the SNR of the ENF signal in ŝ is much higher than that in the original video signal s. We then use the spatial average of each row in ŝ(r, c, n) as the source signal to estimate the ENF signal:

\[
R(r, n) = \frac{1}{C}\sum_{c=1}^{C} \hat{s}(r, c, n). \tag{3.11}
\]

R(r, n) is referred to as the row signal hereafter. We can use the frequency estimation techniques discussed earlier to estimate the ENF signal from the row signal.

We have conducted experiments using a Canon PowerShot SX230 HS camera that is battery-powered and equipped with a rolling shutter. Fig. 3.5 shows an example of ENF estimation from a static video recording. The test video recorded a white wall under fluorescent lighting, with the camera mounted on a tripod. The spatial resolution of each video frame is 480 × 320, and the frame rate is 29.97 fps. Fig. 3.5 (a) shows a snapshot of the test video. We calculated the row signal according to Eq. (3.11), and then vectorized it by concatenating its entries frame after frame to form the source signal for ENF estimation. Fig. 3.5 (b) shows a segment of the source signal. We observe that the source signal exhibits sinusoidal waveforms, except for some periodic phase shifts. These phase shifts exist because of the idle period of the rolling shutter between exposing the last row of one frame and starting the first row of the next frame. No recording is conducted during the idle period, and a phase jump of the source signal may occur at every frame border (every 240 samples in this experiment).
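As an illustration of the static-scene processing just described, the following sketch applies the temporal-mean subtraction of Eq. (3.8), the row averaging of Eq. (3.11), and the frame-by-frame vectorization to a video stored as a (frames × rows × columns) array. It is a minimal rendering of these equations rather than the exact experimental code.

```python
import numpy as np

def row_source_signal(video):
    """video: float array of shape (N, R, C) for a static scene.

    Subtract the temporal mean of each pixel (Eq. 3.8), average each row
    spatially (Eq. 3.11), and concatenate the row signal frame after frame
    into a 1-D source signal for ENF estimation.
    """
    s_hat = video - video.mean(axis=0, keepdims=True)   # Eq. (3.8)
    row_sig = s_hat.mean(axis=2)                         # Eq. (3.11), shape (N, R)
    return row_sig.reshape(-1)                           # rows concatenated over frames
```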
In this example, the nominal value of the ENF is 60 Hz, and the frame rate of the camera is 1/T = 29.97 fps. The intensity variations of the fluorescent lighting should follow the instantaneous energy of the AC power supply, exhibiting an oscillation at around 120 Hz. By Eq. (3.5), the ENF traces embedded in the row signal from the video recording should appear at around 120 + m × 29.97 Hz, where m = 1, 2, 3, ..., folded into the baseband of the row-signal spectrogram. This is consistent with what we observe from the spectrogram of the row signal in Figure 3.6. The ENF traces can be extracted from the spectrogram of the row signal around 30 Hz, 60 Hz, 90 Hz, and so on. We estimate the ENF signal from close to 60 Hz, as we see from the spectrogram that the SNR of the ENF signal appears the highest in this frequency range.

Figure 3.5: (a) A snapshot of a test video of a static scene of a white wall. (b) The source signal for ENF estimation.

Figure 3.6: The spectrogram of the row signal from a white wall video recording. Figure is best viewed in color.

The estimated ENF signal from this recording, along with the reference ENF signal simultaneously measured from the power mains, is plotted in Figure 3.7 (a). They are appropriately shifted to lie within the same dynamic range, as only the variation trends are of interest. The ENF signals from the video recording and the power measurement exhibit similar variations. The correlation coefficient between them as a function of the relative time lag is plotted in Figure 3.7 (b), and a clear peak is observed at the ground truth lag of 0 seconds.

Figure 3.7: (a) The ENF signals (appropriately shifted) extracted from a test video of a white wall and the power measurement. (b) Correlation coefficient between the video and power ENF signals as a function of relative time lag.

In the second example, the camera was placed in a room illuminated by fluorescent lights, which is representative of real-life surveillance scenarios where few events are expected to occur. Measurements from the power mains at the time of the video recording were also conducted to provide the reference ENF signal. In Figure 3.8 (a), we plot the estimated ENF signals from the video and the power measurement, and the correlation coefficient between them as a function of lag is plotted in Figure 3.8 (b).

In the third example, we show that videos of outdoor scenes can also capture ENF traces. A test video was captured in a dark parking lot illuminated by outdoor lighting. The camera is fixed, and no object motion occurs in the recording. Fig. 3.9 (a) shows a snapshot of the test video. The ENF signal estimated from the test video exhibits similar variation patterns to the one estimated from the simultaneous power measurements, as can be observed in Fig. 3.9 (b). This confirms that the outdoor lighting is connected to the power mains, and that the ENF signal is successfully extracted from the video by capturing the subtle flickering of the light.
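A minimal sketch of the two remaining steps used in these examples, assuming the concatenated row signal (or a power reference signal) is available as a 1-D array: a per-window dominant-frequency estimate within a chosen band, and a search for the lag that maximizes a simplified normalized cross-correlation between two ENF sequences. The band edges, window length, and normalization are illustrative choices, not the exact settings used in the experiments.

```python
import numpy as np
from scipy import signal

def estimate_enf(x, fs, f_lo, f_hi, win_sec=8.0):
    """Dominant frequency per time window within the band [f_lo, f_hi] Hz."""
    nper = int(win_sec * fs)
    f, t, S = signal.spectrogram(x, fs=fs, nperseg=nper, noverlap=nper // 2)
    band = (f >= f_lo) & (f <= f_hi)
    return t, f[band][np.argmax(S[band, :], axis=0)]

def best_lag(enf_a, enf_b):
    """Lag (in ENF samples) maximizing a simplified normalized cross-correlation."""
    a = (enf_a - enf_a.mean()) / enf_a.std()
    b = (enf_b - enf_b.mean()) / enf_b.std()
    corr = signal.correlate(a, b, mode="full") / min(len(a), len(b))
    lags = signal.correlation_lags(len(a), len(b), mode="full")
    # Multiply the returned lag by the ENF sampling interval to get seconds.
    return lags[np.argmax(corr)], corr.max()
```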
Figure 3.8: (a) The ENF signals (appropriately shifted) extracted from a surveillance video and the power measurement. (b) Correlation coefficient as a function of relative time lag.

3.2.3 Videos with Motions

Extracting ENF signals from video recordings of scenes with moving objects proves to be more challenging. In such a scenario, Eq. (3.7) no longer holds, and the high-pass filtering method of the previous subsection would no longer work.

If the scene in the video contains a static background, we can use the static regions to estimate the ENF signal. Following the notation of the last subsection, given two image frames s(r, c, n) and s(r, c, m), we are interested in finding the regions not affected by object motion in either frame. The mutual motion-free regions between s(r, c, n) and s(r, c, m) are represented by a binary matrix M^{n,m}(r, c), defined as

\[
M^{n,m}(r, c) =
\begin{cases}
1 & \text{if frame } n \text{ and frame } m \text{ are both static at pixel } (r, c), \\
0 & \text{otherwise.}
\end{cases}
\]

Figure 3.9: (a) Screenshot of the test video of a parking lot. (b) The ENF signals (appropriately shifted) extracted from the test video and the power measurement.

As an example, Fig. 3.10 (a) and (b) show two sample images from a video sequence recorded inside an office building. The motion-free regions shown in Fig. 3.10 (c) are found by thresholding the pixel-wise differences of the pixel intensity between the two images.

Using a strategy similar to that in Sec. 3.2.2, we apply a high-pass filter to the video signal by subtracting from it a smoothed version of the original signal. For an image frame s from the video sequence, we search for its mutual motion-free regions against all the other frames. The pixel values of the frames in their respective motion-free regions can be averaged to form a smooth version of s, which is then subtracted from s:

\[
\hat{s}(r, c, n) = s(r, c, n) - \frac{1}{\sum_{m \neq n} M^{n,m}(r, c)} \sum_{m \neq n} s(r, c, m) \cdot M^{n,m}(r, c). \tag{3.12}
\]

Figure 3.10: (a) Frame 1. (b) Frame 2. (c) Mutual motion-free regions are highlighted.

The row signal is obtained by taking the row average of ŝ, from which the ENF signal can be estimated.

We have conducted an experiment with a video that records people walking in an office building's hallway. The video used settings similar to the experiments in Sec. 3.2.2. We use the proposed scheme to extract the ENF signal from this test video. The reference ENF signal is also estimated from a simultaneously recorded power signal. We observe from Fig. 3.11 that the variation trends of the ENF signal estimated from the test video remain consistent with those of the reference ENF signal.

Figure 3.11: The ENF signal estimated from the test video matches well with the reference ENF signal. The signals are properly shifted to facilitate comparison.
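The masked-averaging high-pass filter of Eq. (3.12) can be sketched as follows, assuming the video is a (frames × rows × columns) array and that "static" pixels are identified by simple intensity-difference thresholding as in Fig. 3.10; the threshold value is an illustrative placeholder.

```python
import numpy as np

def static_mask(frame_n, frame_m, thresh=5.0):
    """Binary mask M^{n,m}: 1 where both frames appear static (Sec. 3.2.3)."""
    return (np.abs(frame_n - frame_m) < thresh).astype(float)

def highpass_with_masks(video, n, thresh=5.0):
    """Masked-average high-pass filtering of frame n, following Eq. (3.12)."""
    frame_n = video[n]
    acc = np.zeros_like(frame_n)
    cnt = np.zeros_like(frame_n)
    for m in range(video.shape[0]):
        if m == n:
            continue
        mask = static_mask(frame_n, video[m], thresh)
        acc += video[m] * mask
        cnt += mask
    cnt[cnt == 0] = 1.0          # avoid division by zero at always-moving pixels
    return frame_n - acc / cnt
```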
In a second example, we made a video recording of a moving Hexbug toy in a room with indoor lighting. The Hexbug is a robotic toy with fast movement, powered by the vibrations of a built-in battery motor. Figure 3.12 shows several sample video frames. The ENF signals extracted from the Hexbug video and its reference ENF signal from the power mains are plotted in Figure 3.13 (a). They exhibit similar variations, and a correlation peak is observed when they align, as seen from Figure 3.13 (b).

Figure 3.12: Sample frames of the Hexbug test video.

Figure 3.13: (a) The ENF signals extracted from the test video of the Hexbug and the power measurement (appropriately shifted). (b) Correlation coefficient as a function of relative time lag.

3.2.4 Brightness Change Compensation

Many cameras have a brightness control mechanism that adjusts the camera's sensitivity to light in response to the illumination conditions, so that the overall brightness of an acquired image remains visually pleasing. As an example, two images from a video sequence are shown in Fig. 3.14. As the person in the second image is closer to the camera, the background wall appears brighter than in the first image. Such brightness changes introduce challenges to the estimation of the ENF signal when using the techniques described in the previous subsections. To investigate how to mitigate the negative effect of brightness changes, we have created the following recording: during the first four minutes, a person walked around in a hallway relatively distant from the camera, so the camera's automatic brightness adjustment was not triggered; after four minutes, the person moved closer to the camera and brightness changes occurred, as shown in Fig. 3.14. The ENF signal is extracted from this test video using the techniques discussed in the previous subsections without addressing the brightness changes. In Fig. 3.16, we observe that the estimated ENF signal from the test video becomes distorted after four minutes into the recording as a result of the brightness changes.

We have examined the relationship between the pixel values in different images of the same scene. For two images, we find the regions in which both images remain static. Each pixel in the static regions is represented as a dot in Fig. 3.15, whose X coordinate is the pixel value in image 1, and whose Y coordinate is the pixel value in image 2. From the figure we observe that the brightness change can be well modeled by a linear transform. Given two frames s(r, c, n) and s(r, c, m), we have

\[
s(r, c, n) = a_{n,m} \cdot s(r, c, m) + b_{n,m}. \tag{3.13}
\]

For a frame s(r, c, n), the pixel values in the static background regions are used to estimate the parameters a_{n,m} and b_{n,m}. For brightness change compensation, we apply Eq. (3.13) to each frame s(r, c, m). The high-pass filtering of Eq. (3.12) then becomes

\[
\hat{s}(r, c, n) = s(r, c, n) - \frac{1}{\sum_{m \neq n} M^{n,m}(r, c)} \sum_{m \neq n} \big( a_{n,m} \cdot s(r, c, m) + b_{n,m} \big) \cdot M^{n,m}(r, c). \tag{3.14}
\]

The described scheme was applied to the test video, and the result of ENF estimation is shown in Fig. 3.16. With our proposed brightness change compensation, the ENF signal estimated from the test video now exhibits consistent variations with the reference ENF signal.

Figure 3.14: Two image frames from a test video recording demonstrating the camera's automatic brightness control mechanism.

Figure 3.15: The relationship between the pixel values in static regions of two images subject to brightness change.

Figure 3.16: The effectiveness of the brightness change compensation technique. The signals are properly shifted to facilitate comparison.
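A small sketch of the brightness-compensation step: the parameters a_{n,m} and b_{n,m} of Eq. (3.13) can be fit over the mutually static pixels and then applied to frame m before the masked averaging of Eq. (3.14). A least-squares fit is used here as one reasonable way to estimate the linear transform; the chapter does not prescribe a specific fitting procedure.

```python
import numpy as np

def fit_brightness(frame_n, frame_m, mask):
    """Least-squares fit of s_n ~ a * s_m + b over mutually static pixels (Eq. 3.13)."""
    x = frame_m[mask > 0]
    y = frame_n[mask > 0]
    A = np.stack([x, np.ones_like(x)], axis=1)
    coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    a, b = coeffs
    return a, b

def compensate(frame_m, a, b):
    """Map frame m to the brightness level of frame n before averaging (Eq. 3.14)."""
    return a * frame_m + b
```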
3.2.5 Camera Motion Compensation

In the previous discussions, we have assumed that the camera is fixed during recording, so that the pixels in different image frames align spatially. In practice, people may hold the camera by hand to make a video recording, so the camera undergoes certain movements and the previously described methods may not apply.

In order to address situations with camera motion, we first consider videos of static scenes. For two image frames s(r, c, n) and s(r, c, m), we denote by (δ_r^{n,m}, δ_c^{n,m}) the pixel-wise shift between them due to the camera motion:

\[
s(r, c, n) = s(r + \delta_r^{n,m},\, c + \delta_c^{n,m},\, m). \tag{3.15}
\]

To compensate for the camera motion, we need to shift the pixels in the two frames relatively by (δ_r^{n,m}, δ_c^{n,m}) so that they align spatially. The registered frames can then be processed as described in the previous subsections. Taking the camera motion compensation into account, Eq. (3.8) becomes

\[
\hat{s}(r, c, n) = s(r, c, n) - \frac{1}{N}\sum_{m=1}^{N} s(r + \delta_r^{n,m},\, c + \delta_c^{n,m},\, m), \tag{3.16}
\]

and the ENF signal can be estimated from ŝ(r, c, n).

Optical flow methods can be used to estimate the pixel-wise displacement between image frames. These methods calculate the motion field (V_r, V_c) between two frames s(r, c, n) and s(r, c, n + δn) based on the optical flow equation

\[
\frac{\partial s}{\partial r} V_r + \frac{\partial s}{\partial c} V_c + \frac{\partial s}{\partial n} = 0,
\]

together with certain additional conditions and constraints for regularization. In this work, we have used the implementation of optical flow estimation developed by [36].

An experiment was conducted to verify the proposed camera motion compensation scheme. We used the Canon PowerShot SX230 HS camera to record a video of a hallway. The camera was hand-held during the recording, and we deliberately manipulated the camera to create noticeable motion, as shown in the sample frames in Fig. 3.17. If we ignore the camera motion, the spectrogram of the source signal extracted from the test video exhibits blurry spectral energy distributions, as shown in Fig. 3.18 (a). The ENF signal estimated from the test video without camera motion compensation is shown in Fig. 3.18 (b), and it deviates from the reference ENF signal. We then apply the proposed camera motion compensation to the test video. The ENF signal estimated from the video signal after compensation matches well with the reference ENF signal, as shown in Fig. 3.18 (b).

Figure 3.17: Sample frames from a test video with camera motion.

If object motion occurs in the scene in addition to camera motion, we can apply camera motion compensation first to the image frames, and then use the strategy in Sec. 3.2.3 to locate and utilize the static backgrounds for the extraction of ENF signals.
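For illustration only, the following sketch estimates a single global translation between two frames by phase correlation and registers one frame to the other, in the spirit of Eq. (3.15). It is a simplified, global-translation stand-in for the dense optical flow estimation of [36] that is actually used in the experiments.

```python
import numpy as np

def global_shift(frame_a, frame_b):
    """Estimate an integer global translation (dr, dc) by phase correlation,
    such that frame_a[r, c] is approximately frame_b[r - dr, c - dc]."""
    Fa, Fb = np.fft.fft2(frame_a), np.fft.fft2(frame_b)
    cross = Fa * np.conj(Fb)
    corr = np.fft.ifft2(cross / (np.abs(cross) + 1e-9)).real
    dr, dc = np.unravel_index(np.argmax(corr), corr.shape)
    R, C = frame_a.shape
    if dr > R // 2:          # map wrap-around peaks to negative shifts
        dr -= R
    if dc > C // 2:
        dc -= C
    return int(dr), int(dc)

def register(frame_b, dr, dc):
    """Shift frame_b by (dr, dc) so that it aligns spatially with frame_a."""
    return np.roll(np.roll(frame_b, dr, axis=0), dc, axis=1)
```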
3.3 Video Synchronization by Matching ENF from Image Sequence

In Chapter 2, we demonstrated video synchronization by aligning the ENF signals extracted from the soundtracks of video clips. In certain scenarios, such as some surveillance recordings, video recordings may have been muted or the soundtracks may have been edited, so that no reliable audio is available. As an alternative, we may extract the ENF signal from the image sequence of the visual track using the techniques described in Sec. 3.2. In this section, we present experimental results of this approach.

We used two Canon PowerShot SX230 HS cameras equipped with CMOS sensors and rolling shutters to record an office building's hallway illuminated by indoor light. The cameras were placed to capture the hallway from different viewing angles, and each recording is about eight minutes long. A person walked through the hallway back and forth, and his movements were captured by both cameras. We apply the methods discussed in Sec. 3.2 to estimate the ENF signals from the image sequences of both video recordings. The NCC of the estimated ENF signals as a function of the lag between them is plotted in Fig. 3.19 (a), from which we find a peak NCC value of 0.96 at 60.72 seconds. The ENF signals after alignment are shown in Fig. 3.19 (b), and we observe that the variation patterns of the ENF signals match well. In Fig. 3.19 (c), we show several image frames from the synchronized video recordings. For comparison, we manually aligned the two videos by comparing the image frames and the soundtracks in both video clips, and found the lag to be 60.80 seconds, which is close to the value obtained by the proposed approach.

3.4 Chapter Summary

In this chapter, we have studied the problem of extracting ENF traces from videos. We have exploited the rolling shutter of CMOS imaging sensors and treated each line as an ENF-impacted signal sample in order to compensate for the low frame rate of video recordings. The rolling shutter mechanism was analyzed using a filter bank model and multirate signal processing theory. Extraction of the ENF signal from an image sequence presents many challenges, and, to the best of our knowledge, few research attempts have been made to adequately address it. We have proposed several techniques to overcome the difficulties of extracting the ENF signal from image sequences, such as the low sampling rate, object motions in the scene, camera motions, and brightness changes. Through our experiments, we have demonstrated that video recordings can be accurately synchronized by aligning the inherently embedded ENF signals.

In this chapter together with Chapter 2, we have demonstrated promising results of video synchronization using the naturally embedded ENF signals in the soundtracks and image sequences of video clips. As a prerequisite for this method, the ENF traces available in the audio/video recordings must be strong enough for reliable estimation. Through our current study and experiments, we find that this prerequisite may not always be satisfied. The recordings may have been created in an environment where the ENF traces are relatively weak. For example, audio recordings created by battery-powered recorders at locations far from electrical equipment, and videos made in areas without any electric lighting, would probably not capture any ENF traces. Also, the embedded ENF signals may suffer distortions. Most audio and video recorders apply compression to the output recordings, and strong compression artifacts may adversely affect the estimation of ENF signals. For audio recordings made in loud environments, the strong foreground sounds or voices may also bring distortions to the ENF signals.
Most conventional methods of audio/video synchronization extract and match certain audio and visual features from the contents of recordings. The approach proposed in this work does not rely on the audio or visual contents of the multimedia signals, and is therefore fundamentally different from and complementary to existing work. The ENF-based approach and the audio/visual-cue-based approach may complement each other and may be combined to solve the problem of multimedia synchronization more effectively.

Figure 3.18: Experiment on a video recording with camera motion. (a) The spectrogram of the source signal without camera motion compensation. (b) The ENF signals estimated from the test video. The signals are properly shifted to facilitate comparison.

Figure 3.19: Example of video synchronization by aligning the ENF signals. (a) The correlation coefficient of the ENF signals as a function of the relative lag between them. (b) The ENF signals estimated from the two video recordings after alignment. The signals are properly shifted to facilitate comparison. (c) Several sample frame pairs after alignment. Rows correspond to video sequences, and columns correspond to time instances.

Chapter 4
ENF Analysis on Historical Audio Recordings

4.1 Chapter Introduction

The ENF signal is embedded into multimedia recordings at the time of recording, which leads to several interesting questions about the ENF traces in recaptured audio recordings. If the recapturing of a recording happens in a region with the same nominal ENF as the original recording, the ENF traces due to the two recording processes may overlap with each other. How will such overlap affect the quality of the ENF signal extraction? ENF signals in recaptured audio recordings may contain two components: one is inherited from the original recording, referred to as the original (or content) ENF signal; the other is embedded during the recapturing process, referred to as the recapturing ENF signal. The original and recapturing ENF signals may have different energies; the signal with a higher energy is referred to as the dominant ENF and the one with a lower energy as the latent ENF. The ENF signals in a recaptured audio recording can provide useful information about the audio file in question, such as when the original recording was created and when it was recaptured.

The question of ENF extraction from recaptured audio is relevant to analyzing recordings of historical importance. For example, such historical recordings as the NASA Apollo lunar mission audio recordings [1, 2] and President Kennedy's White House conversations [9] were made in the analog era of the 1960s. These recordings were recently digitized and made available online. Several interesting tasks can be accomplished using such recordings. For example, multiple channels of the NASA Apollo mission recordings can be used to create a time-synchronized exhibit of the mission.
As the ENF signal is time-varying, it can potentially be used to align multiple audio recordings archived from such historical events. However, due to the digitization process, the recordings available online may also have been affected by the ENF signals corresponding to the time of digitization. To the best of our knowledge, no prior work has addressed the effect of the recapturing of audio recordings on ENF signals. As will be shown in this chapter, conventional ENF estimation techniques can only extract the dominant ENF signal. This observation motivates us to design algorithms to extract both the dominant and the latent ENF signals from recaptured recordings. Audio recapturing can also be used as an "anti-forensic" strategy by an adversary to alter ENF traces to mislead a forensic examiner, so developing techniques to extract multiple overlapping ENF signals may also complement the existing techniques to counter such anti-forensic operations [17]. In this work, we propose a decorrelation-based algorithm to estimate both the dominant and the latent ENF from a recaptured audio recording. After estimating the dominant ENF using conventional ENF signal estimation techniques, a residual signal is computed by subtracting the estimated dominant ENF signal from the original signal. The latent ENF is then estimated from the residual signal.

Besides the possible superposition of multiple ENF traces, another distortion of the ENF signals that is unique to digitized analog tape recordings is the "drifting effect". Due to mechanical imperfections of the analog recorders and tapes, the rolling speed of these tapes often varies during recording and replay. The inconsistency between the rolling speeds of the original recording and the playback during digitization induces speed errors in the digitized audio. For example, if the tape rolls faster during digitization than during the original recording, the digitized version of the recording will play faster than normal. As a result, the ENF signal in a digitized analog audio recording may deviate from its original value. In this work, techniques are proposed to compensate for this drifting effect so as to achieve accurate ENF analysis. We also demonstrate that ENF signals suffering from such a drifting effect can be exploited to detect and correct tape speed errors.

4.2 Distortions of ENF Signal in Historical Audio Recordings

4.2.1 Multiple ENF Traces in Recaptured Audio Recordings

As this work aims to analyze the ENF signals present in a recaptured audio recording, we conduct the following experiment to test the robustness of ENF signals to recapturing. An audio recording is made in an office using a digital recorder. To simulate the conditions of recapturing, we play this recording on a stand-alone speaker in an acoustic anechoic chamber and record the playback using a digital recorder. Fig. 4.1 (a) and (b) show the spectrograms of the original recording and the recaptured recording, respectively. From these figures, we observe that the ENF signal is present at the harmonic frequency of 240 Hz in both the original recording and the recaptured recording. High correlation is observed between the ENF signals extracted from the original and recaptured recordings around 240 Hz. When we switch off the replayed audio, the energy peak present at 240 Hz in the spectrogram of the recaptured audio recording disappears. This happens because no interference is present from power lines at 240 Hz in the acoustic chamber.
In this example, the content ENF signals and the recapturing ENF signals do not interfere with each other. In another example, Fig. 4.2 (a) shows the spectrogram of a historical recording from President Kennedy's White House conversations, which are available at [9]. This recording was made in 1962 on analog tape and was digitized later. From this figure, we observe that two different ENF signals are present near 240 Hz, and one of them (present around 239 Hz) disappears well before the end of the audio. After listening to the audio, we note that the original recording is turned off at this time. We conjecture that the 239 Hz signal is the original ENF signal and the 240 Hz signal is the recapturing ENF signal.

These two examples demonstrate the case in which the original ENF signal and the recapturing ENF signal are non-overlapping. From such recordings, both ENF signals can be extracted easily by using suitable bandpass filters followed by conventional ENF estimation techniques around the frequencies of interest. In less favorable cases, however, the original ENF signal and the recapturing ENF signal may overlap and interfere with each other. To illustrate this scenario, we conduct a recording in the acoustic chamber and recapture it in the same place. As the ENF signal in the same room is embedded from the electromagnetic influences of the same power sources, the original ENF and the recapturing ENF overlap at a frequency of 120 Hz. From the spectrogram of the recaptured audio shown in Fig. 4.2 (b), we observe that for the duration of the playback of the original audio on the speaker, the original ENF and the recapturing ENF overlap with each other and the energy distribution of the spectrogram appears noisy. After the original audio is switched off, the ENF signal becomes cleaner, as only the recapturing ENF is captured.

Conventional ENF signal estimation methods extract dominant frequencies in a narrow band around the frequency of interest (the nominal ENF or its harmonics) of a given signal. As the original ENF signal and the recapturing ENF signal in recaptured recordings may overlap with each other in the same frequency range, the conventional methods fail to extract both ENF signals. To demonstrate this, we generate two frequency sequences, E_d(t) and E_l(t), as follows:

\[
E_d(t) = 60 + N_d(t), \qquad E_l(t) = 60 + N_l(t),
\]

where N_d(t) and N_l(t) are drawn from i.i.d. Gaussian random processes of zero mean and variance 0.1. Using these two sequences, we generate a time-domain signal that varies according to the frequencies E_d(t) and E_l(t) as follows:

\[
s(t) = \cos\!\left(2\pi \int_0^t E_d(\tau)\, d\tau\right) + \sqrt{\alpha}\, \cos\!\left(2\pi \int_0^t E_l(\tau)\, d\tau\right) + N(t), \tag{4.1}
\]

where N(t) is a Gaussian random process of zero mean and unit variance, and α is a constant with 0 ≤ α ≤ 1. We see from Eq. (4.1) that the signal s(t) consists of two sinusoids of different amplitudes, with E_d(t) and E_l(t) being their instantaneous frequencies at time t. Based on this model of s(t), E_d(t) is the dominant ENF signal and E_l(t) is the latent ENF signal, as the energy of the sinusoid corresponding to E_d(t) is greater than that of E_l(t). This model of the signal s(t) is similar to the case when the original ENF signal and the recapturing ENF signal overlap.

Figure 4.1: Spectrograms of the original and recaptured recordings. (a) Original audio; (b) recaptured audio (panel annotation: "Speaker Turned Off").

Figure 4.2: Audio recording spectrograms. (a) Digitized Kennedy conversation recording; (b) recaptured audio signal with overlapping ENF signals (panel annotations: "Original Recording Turned off").
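The synthetic test signal of Eq. (4.1) can be generated as in the following sketch. Holding E_d(t) and E_l(t) constant over one-second blocks is a modeling choice made here for illustration, and the sampling rate and duration are placeholders.

```python
import numpy as np

def synth_two_enf(duration=600.0, fs=1000.0, alpha=0.8, var_f=0.1, seed=0):
    """Synthesize s(t) of Eq. (4.1): two sinusoids whose instantaneous
    frequencies E_d(t) and E_l(t) wander around 60 Hz, plus white noise."""
    rng = np.random.default_rng(seed)
    n = int(duration * fs)
    # Frequency sequences, held constant over 1-second blocks (illustrative).
    blocks = int(duration)
    Ed = 60.0 + np.sqrt(var_f) * rng.standard_normal(blocks)
    El = 60.0 + np.sqrt(var_f) * rng.standard_normal(blocks)
    Ed_t = np.repeat(Ed, int(fs))[:n]
    El_t = np.repeat(El, int(fs))[:n]
    # Integrate the instantaneous frequency to obtain the phase of each sinusoid.
    phase_d = 2 * np.pi * np.cumsum(Ed_t) / fs
    phase_l = 2 * np.pi * np.cumsum(El_t) / fs
    s = np.cos(phase_d) + np.sqrt(alpha) * np.cos(phase_l) + rng.standard_normal(n)
    return s, Ed_t, El_t
```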
We use a weighted-energy frequency estimation method to extract the ENF signal from s(t). The spectrogram is computed for s(t), and the ENF for each time bin is estimated by weighting the frequency bins around the nominal ENF value (60 Hz) with the energy present in the corresponding frequency bins. We compute the normalized cross-correlation (NCC) between the estimated ENF signal and the ground-truth frequency sequences E_d(t) and E_l(t), respectively. The experiment is repeated multiple times with different realizations of N_d(t), N_l(t), and N(t). The mean and the variance of the NCC values obtained for different values of α are shown in Fig. 4.3. From this figure, we observe that when there is a significant difference between the energy of the dominant ENF and that of the latent ENF, the correlation between the extracted ENF and the dominant ENF is high (in the 0.6-0.7 range). However, as the energy of the latent ENF signal increases, this correlation value decreases and becomes low (< 0.3 for α close to 1). Similar results were obtained for other frequency estimation methods, such as the subspace-based MUltiple SIgnal Classification (MUSIC) and Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT). Our preliminary results also show that these subspace-based approaches can obtain reliable estimates only when a sufficient margin between E_d(t) and E_l(t) exists. This experiment on synthetic data verifies that conventional ENF estimation techniques fail to extract overlapping ENF signals, which is usually the case with recaptured audio recordings. In the following subsection, we describe a new algorithm to extract both the dominant and the latent ENF from recaptured audio recordings.

Figure 4.3: The mean and the variance of the NCC values.

A Decorrelation-Based Solution

Our proposed algorithm to extract both the dominant and the latent ENF signals in a recaptured audio recording works in two stages: first the dominant ENF is estimated, followed by the latent one. As discussed earlier, the dominant ENF, denoted by E_d(t), in an audio signal s(t) can be estimated using conventional estimation techniques such as the weighted-energy method. After estimating E_d(t), we match it with the reference ENF database from the power grid to estimate the time at which the dominant ENF signal was embedded in the recording. We then subtract the power signal corresponding to the time of recording of the dominant ENF signal from the audio recording. As the magnitude of the power measurements and the actual embedding in the audio may differ, the subtraction is performed by estimating the appropriate scaling factor of the magnitude that makes the ENF signal of the resulting audio signal ŝ(t) maximally decorrelated with the ENF signal of the power recording corresponding to the previously estimated time of dominant ENF embedding. More specifically, we have

\[
\hat{s}(t) = s(t) - \hat{a} \cdot P(t), \quad \text{with} \quad \hat{a} = \arg\min_a \big\{ \mathrm{corr}\big(ENF(s(t) - a P(t)),\, ENF(P(t))\big) \big\}, \tag{4.2}
\]

where P(t) is the power measurement signal at time t, and â is the estimated magnitude of the power. ENF(·) denotes the weighted frequency estimation function. As can be understood from the equation, the selection of â is a search for the relative amplitude of the dominant ENF signal in the audio signal with respect to the power signal. Ideally, after the decorrelation process, the resulting signal ŝ(t) contains no traces of E_d(t).
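A minimal sketch of the decorrelation step of Eq. (4.2): the scaling factor a is searched over a grid so that the ENF of the residual is minimally correlated with the ENF of the power reference. The function enf_estimator stands for the weighted-energy estimator used in this chapter and is assumed to return per-window ENF sequences of equal length for equal-length inputs; the grid range is illustrative.

```python
import numpy as np

def decorrelate(audio, power, fs, enf_estimator, grid=np.linspace(0.0, 1.0, 101)):
    """Search the scaling factor a of Eq. (4.2): subtract a scaled power signal
    so that the ENF of the residual is minimally correlated with the power ENF."""
    enf_p = enf_estimator(power, fs)
    best_a, best_corr, best_res = None, np.inf, None
    for a in grid:
        residual = audio - a * power
        enf_r = enf_estimator(residual, fs)
        c = abs(np.corrcoef(enf_r, enf_p)[0, 1])
        if c < best_corr:
            best_a, best_corr, best_res = a, c, residual
    return best_a, best_res   # latent ENF is then estimated from best_res
```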
The ENF signal that remains in ŝ(t) would come from the latent ENF signal, E_l(t). We estimate the latent ENF from ŝ(t) using the weighted frequency estimation approach.

To show the effectiveness of the proposed algorithm in extracting the dominant and the latent ENFs, we conduct experiments on audio data. An audio recording was made in an acoustic anechoic chamber and recaptured later in the same place. The power measurements were also recorded during the original recording and the recapturing process. The original ENF and the recapturing ENF signals are present around 120 Hz in this case. The ENF extracted directly from the recaptured audio signal shows fluctuations similar to the ENF signal estimated from the power signal at the time of the original recording (NCC 0.62), as can be seen from Fig. 4.4. The content ENF signal is, therefore, the dominant signal in this case. We then decorrelate the recaptured audio by subtracting the estimated dominant ENF signal, as discussed previously. The ENF signal in the power measurement recording is centered around 60 Hz, so we transfer it to 120 Hz by squaring the power signal and feeding it into a bandpass filter with a narrow passband around 120 Hz. The processed power signal is used for decorrelation as in Eq. (4.2). The ENF signal estimated from the decorrelated audio signal shows high correlation with the ENF signal extracted from the power measurements at the time of recapturing (NCC 0.68). Both the original ENF and the recapturing ENF are thus successfully extracted from the recaptured audio recording using the proposed decorrelation method. The time of the original recording and the time of recapturing can then be determined by comparing the original ENF and the recapturing ENF with the reference database obtained from the power mains.

Figure 4.4: ENF fluctuations as a function of time. (a) The ENF from the power measurement signal at the time of the original recording; (b) the ENF from the power measurement signal at the time of recapturing; (c) the dominant ENF estimated from the recaptured audio recording; (d) the ENF estimated from the decorrelated audio signal.

Applications for Recapture Detection

Assuming that the reference measurements from the power grids are available, the proposed decorrelation algorithm can be used for audio recapture detection, i.e., identifying whether a given audio recording is an original or a recaptured version. As discussed earlier, two ENF signals are embedded in a recaptured audio recording. The ENF signal estimated directly from a recaptured audio recording is the dominant ENF signal E_d. After decorrelation, the latent ENF signal E_l can be extracted from the residual audio signal. When compared with the reference ENF database measured from the power mains, these two ENF signals should match different segments of the reference database, the time indices of which are denoted as T_d and T_l, respectively. If the recording is original, T_d and T_l are likely to have similar values. In cases where T_d and T_l differ greatly, the peak correlation C, calculated between the ENF signal estimated from the decorrelated audio signal and the reference database, should be low, since it is a false match.
Under a hypothesis testing framework, the H1 and H0 cases and the decision rule can be formulated as follows:

\[
\begin{cases}
H_1: \text{Test audio is recaptured.} \\
H_0: \text{Test audio is original.}
\end{cases}
\qquad
\mathbf{1}(|T_d - T_l| > \delta) \times C \;\mathop{\gtrless}_{H_0}^{H_1}\; \tau
\]

Here 1(·) is an indicator function, and τ is a decision threshold.

We conduct the following experiments to evaluate the proposed audio recapture detection scheme. Audio recordings were made in the acoustic chamber and in a conference room. Some recordings were then recaptured in the acoustic chamber by playing them on a speaker at varying volumes. The total test dataset includes 8.5 hours of original recordings and 16 hours of recaptured ones. The recordings are divided into short clips of 10, 20, and 30 minutes duration, and each clip is considered a test sample. We evaluate the false alarm rate and the detection rate with different values of τ to obtain the ROC curves, as shown in Fig. 4.5. The detection accuracy increases with longer clips. Specifically, when considering audio clips of 30 minutes, 95% of the recaptured clips are correctly identified without any false alarms.

Figure 4.5: ROC for audio recapture detection with different clip lengths.

4.2.2 The Drifting Effect

Another challenge of dealing with digitized tape recordings is the "drifting" effect. This effect refers to the phenomenon that the signal frequency in a digitized analog audio recording may deviate from its original value. To demonstrate this, we conducted the following experiment with a cassette tape recorder. We generated a synthetic single-tone acoustic signal of 400 Hz. We then made recordings of it with the tape recorder and with a digital recorder, respectively. The tape recording was digitized by playing it on a speaker and recording it with a digital recorder in an acoustic chamber. The spectrograms of the recordings are shown in Fig. 4.6. As we observe from the spectrograms, the digital recording captures the signal at almost exactly 400 Hz, while the signal frequency of the digitized tape recording exhibits random deviations from its original value. This effect probably occurs because the mechanical rolling speed of the tape recorder is not identical during recording and playback. The signal appears at a higher frequency if the rolling speed during playback is faster than that during recording, and vice versa.

Figure 4.6: Demonstration of the drifting effect with a synthetic 400 Hz tone signal. (a) Spectrogram of the digital recording; (b) spectrogram of the digitized cassette tape recording.

As the ENF signal has a relatively small dynamic range (less than 0.1 Hz around 60 Hz), deviations caused by the drifting effect may have a large impact on the estimation of ENF signals. In Fig. 4.7, the blue curve shows the ENF signal estimated directly from the digitized tape recordings in our experiment. The ENF signal estimated from the simultaneous power measurements (shifted to the same scale) is shown as the red curve in the figure. Due to the drifting effect, the audio ENF signal cannot match well with the ground-truth power ENF signal.

Figure 4.7: Compensating for the drifting effect.
We can compensate for the drifting effect if a reference tone signal of frequency f^t is available, as in our experiment (400 Hz). Let ENF_d denote the ENF signal estimated directly from the digitized audio, and let f_d^t be the frequency of the tone signal present in the recording. The adjusted ENF signal can be expressed as

\[
ENF_c = ENF_d \cdot \frac{f^t}{f_d^t}. \tag{4.3}
\]

In Fig. 4.7, the pink curve is the ENF signal after compensation, and it shows similar trends to the ground-truth power ENF signal.

The drifting effect is commonly seen among digitized historical recordings, such as the tapes of the NASA Apollo missions. Fig. 4.8 shows the spectrogram of a digitized Apollo Mission 11 audio recording. In the spectrogram, we can see three strips, two of which exhibit the same variations. The other strip resides at approximately 120 Hz. Based on the previous subsection, we infer that the former are the original ENF signals and the latter is the recapturing ENF signal from the digitization process. The original ENF signals that originate from the analog tape recording deviate from the nominal value of 120 Hz because of the drifting effect. The discontinuity in the original ENF signals is caused by pauses during tape recording or the deletion of certain content during digitization.

Figure 4.8: The spectrogram of an Apollo Mission 11 recording.

Different from the previously discussed scenario, a reference tone signal is not available in these historical recordings. In some situations, the drifting effect can be approximated well with a linear or piece-wise linear model. We provide an example in Fig. 4.9. The ENF signals are extracted from two tapes in the Apollo 11 audio dataset [2] that were recorded at the same time. One of the recordings was made in the Flight Directors (FD) loop, and the other one was made in the Public Affairs Officer (PAO) loop. These two recordings have been manually synchronized based on the conversational content present in both recordings. Given the drifting effect, directly correlating the ENF signals from the two recordings fails to find a true match, which should be at the lag of 0 seconds. We fit the ENF signals with a piece-wise linear model, as illustrated in Fig. 4.9 (a). Fig. 4.9 (b) shows the residues between the original ENF signals and their linear approximations as new timing features, which exhibit consistent variation patterns over time. The correlation coefficients at different lags are plotted in Fig. 4.10. We can see that before the linear fitting and residue extraction, no correlation peak can be observed, whereas the correlation between the residue signals successfully finds the proper temporal alignment.

Figure 4.9: Using linear fitting to compensate for the drifting effect. (a) The ENF signals from the two recordings and their linear approximations; (b) the residue signals.

Figure 4.10: Correlation coefficients between the ENF signals (upper) and their residue signals (lower).

4.3 Audio Speed Restoration

4.3.1 ENF as a Guidance for Speed Correction

The mechanical rolling speed of an analog tape recorder and player is often not well stabilized. As discussed in Sec. 4.2.2, when an analog tape recording is digitized, the rolling speed during tape playback is likely to differ from that during the creation of the tape. As a result, a speed offset is induced in the digitized version of the audio. Significant speed errors can sometimes affect the perceptual quality of the audio recording. Even with a minor speed error that may not be audible to human ears, it is often desirable to perform speed correction. For example, many old recordings have great historical and archival value, so the digitized version of the recording should preserve the original audio as faithfully as possible. Certain historical events were covered by multiple audio recordings [1, 2], and it would be interesting to synchronize the recordings and play them simultaneously in multiple channels. Speed errors may hinder such an application, as it is difficult to align recordings that have unknown and different speed offsets.

As manual detection and correction of speed errors in audio may be too costly and inefficient in practice, an automatic solution is preferred. We have designed a tape speed correction scheme exploiting the ENF signals embedded in audio recordings. As discussed in previous sections, the nominal value of the ENF is known (60 Hz in North America, 50 Hz in most other parts of the world). The instantaneous value of the ENF typically fluctuates around its nominal value as a result of the interaction between power load and generation, and the deviation of the ENF from its nominal value is usually minuscule given the control mechanism of the power grid. For example, in the US the ENF deviation is typically less than 0.05 Hz from the nominal value of 60 Hz, which is equivalent to about 0.08%. The ENF signal can therefore be approximately considered as a single-tone signal. If an original analog recording captured ENF traces during its creation, and such traces are retained in the digitized version of the recording, then we can exploit the ENF signal to detect speed offsets. The ENF signal appearing at a frequency higher than its nominal value indicates that the audio speed is faster than normal, and vice versa. Further, the ENF signal may serve as a reference signal to perform speed error correction.

A general diagram of the proposed scheme is plotted in Fig. 4.11. The given audio signal is first divided into frames of unit length, and the ENF value f is extracted for every frame. We refer to each frame as an ENF frame hereafter. A speed correction ratio is obtained as

\[
r = f / f_0, \tag{4.4}
\]

where f_0 is the nominal value of the ENF. Speed correction is then performed by temporally stretching or compressing the audio signal.

Figure 4.11: The diagram of the ENF-based speed correction scheme: ENF estimation of the input audio signal, computation of the correction ratio r = f / f_0, and stretching/compression to produce the corrected audio signal.

The values of the ENF signal in adjacent ENF frames are likely to differ from each other. This would cause a sudden speed change in the restored audio at the frame boundaries. To alleviate such artifacts, each ENF frame is further divided into several smaller chunks, referred to as correction frames. The ENF values for the correction frames are obtained via interpolation among the ENF values of the ENF frames. Therefore, the values of the ENF signal transition smoothly from one ENF frame to the next, and each correction frame has its own correction ratio that differs slightly from those of its neighbors.
When choosing the ENF frame size, a trade-off exists between the frequency estimation accuracy and the temporal resolution. The optimal amount of speed adjustment of a recording may change quickly over time. Ideally the ENF frame should be as small as possible to achieve satisfactory temporal resolution. However, accurate frequency estimation requires that the ENF frame be of a certain length. Dividing an ENF frame into several correction frames may provide an improved trade-off between the frequency estimation accuracy and the temporal resolution.

Within each correction frame, sample rate conversion is performed on the audio signal so that it becomes temporally stretched or compressed according to the correction ratio r. Two methods of sample rate conversion are considered in this work, and their performances will be analyzed and experimentally compared.

One of the most common strategies for rate conversion is up-sampling followed by down-sampling [45]. The diagram of this scheme is shown in Fig. 4.12. The correction ratio is converted to a rational number r = U/D, with U and D being integers. The audio signal first goes through a U-fold up-sampler, which inserts U - 1 zeros between adjacent samples, followed by a low-pass filter. Then a D-fold down-sampler retains 1 out of every D samples, and the output is the speed-corrected audio signal. The low-pass filter between the up-sampler and the down-sampler replaces the zeros in the up-sampled signal with interpolated values, and avoids aliasing in the subsequent decimation.

Figure 4.12: Speed correction based on up-sampling and down-sampling.

In practice, the correction ratio may not be a rational number. In other cases, U and D may be extremely large numbers, resulting in extraordinary computational and storage complexity. Therefore, approximations are applied, and the values of U and D are chosen as the smallest integers satisfying the following rule:

\[
\left| \frac{U}{D} - r \right| < \delta \cdot r, \tag{4.5}
\]

where δ is a pre-set parameter to adjust the trade-off between efficiency and precision.

Let f_0, f_t, f, and f_r denote the nominal value of the ENF, the true value of the ENF, the observed value of the ENF in the audio before correction, and the observed value of the ENF in the restored audio, respectively. From the discussion above, we have

\[
r = \frac{f}{f_0}; \qquad \hat{r} = r + \Delta r, \qquad f_r = \frac{f}{\hat{r}},
\]

where Δr is the rational approximation error introduced by Eq. (4.5). Assume that the true value of the ENF deviates from the nominal value by Δf:

\[
f_t = f_0 + \Delta f.
\]

The accuracy of the speed correction can be measured by

\[
E = 1 - \frac{f_r}{f_t} = 1 - \frac{f}{\hat{r} \cdot f_t} = 1 - \alpha. \tag{4.6}
\]

Here,

\[
\alpha = \frac{f}{\hat{r} \cdot f_t} = \frac{f_0 \cdot r}{(r + \Delta r)(f_0 + \Delta f)} = \frac{1}{\left(1 + \frac{\Delta r}{r}\right)\left(1 + \frac{\Delta f}{f_0}\right)}. \tag{4.7}
\]

From Eq. (4.7), we can see that the accuracy of the speed correction is affected by two factors: the deviation of the ENF signal from its nominal value (Δf) and the approximation error in the correction ratio (Δr). The analysis above is based on the assumption of perfect ENF estimation. If this assumption does not hold, the speed adjustment is subject to additional distortion due to frequency estimation errors.

The other sample rate conversion method is based on sample interpolation, as has been used in [47]. The original audio signal is denoted by s[n], n = 1, 2, .... With a given correction ratio r, the ideal corrected audio signal is s_opt[n] = s[r · n]. Here, r may be an arbitrary positive real number, so the product of r and n is likely not an integer.
For a non-integer x, the value of s(x) can be obtained from nearby samples at integer indices using interpolation. In this work, we demonstrate this idea with a simple linear interpolation method. Let the largest integer that is less than or equal to x be ⌊x⌋ = k. With linear interpolation, as illustrated in Fig. 4.13, s(x) is determined by

s(x) = w_1 · s(k) + w_2 · s(k + 1),

where w_1 = k + 1 - x and w_2 = x - k.

Figure 4.13: Linear interpolation for tape speed adjustment.

4.3.2 Experiments and Examples

Experiments have been conducted to evaluate the proposed audio speed correction approach and to compare the two sample rate conversion methods. Audio recordings are created using a digital recorder in an office, and a synthetic tone signal of 400 Hz is generated during the recording. The speed of the recordings is then intentionally altered following random patterns via stretching and compressing of the audio signals. We extract ENF signals from the altered audio recordings and apply the proposed speed correction approach. Frequency analysis is performed on the audio signals before and after correction. The synthetic 400 Hz signal provides a way of measuring the speed correction accuracy: if the audio speed is perfectly restored, the tone signal should appear at exactly 400 Hz.

A total of 100 tape recordings are used in our experiment, each being about five minutes long. The proposed speed correction approach is applied to all the audio signals using both rate conversion methods. The sizes of the ENF frame and the correction frame are set to four seconds and one second, respectively. The ENF values for the correction frames are obtained by linear interpolation of the ENF estimates of the ENF frames.

Fig. 4.14 shows the spectrogram of an audio signal before speed restoration. Both the tone signal and the ENF signal have been distorted by the intentional tampering with the audio speed. Fig. 4.15 shows the spectrogram of the audio signal after speed restoration. Our approach is effective, and the tone signal is restored close to its ground truth value.

Figure 4.14: The spectrogram of a test audio before correction. (a) The 400 Hz tone signal. (b) The ENF signal.

The effectiveness of the speed correction is measured by the closeness of the observed tone signal to its ground truth value of 400 Hz. The restored audio signal is divided into frames of four seconds, and the dominant frequency around 400 Hz in the i-th frame is estimated as tone(i), i = 1, 2, ..., N. The overall speed correction error is defined as the root-mean-square error (RMSE) of tone(·):

e = sqrt( (1/N) · Σ_{i=1}^{N} (tone(i) - 400)² ).    (4.8)

Two rate conversion methods are tested in the experiment: re-sampling with an up-sampler and a down-sampler, and interpolation. For the re-sampling method, the correction ratio is converted to a rational number according to Eq. 4.5, in order to avoid excessive computational and memory cost. The average RMSE values of the tone signal over all test audio clips are calculated and listed in Table 4.1, along with the average time needed for restoring a test audio clip with our implementation.

Figure 4.15: The spectrogram of a test audio after correction. (a) The 400 Hz tone signal. (b) The ENF signal.

The experimental results demonstrate that the accuracy of speed restoration with the re-sampling method is improved by using a smaller approximation threshold δ, at the price of more computational cost.
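A minimal sketch of the interpolation-based rate conversion of Fig. 4.13, assuming the whole correction frame is resampled at the fractional positions r·n; the function name is illustrative.

```python
import numpy as np

def resample_linear(s, r):
    """Ideal corrected signal s_opt[n] = s[r*n], evaluated with the linear
    interpolation of Fig. 4.13: s(x) = (k+1-x)*s(k) + (x-k)*s(k+1), k = floor(x)."""
    x = np.arange(0, len(s) - 1, r)        # fractional sampling positions
    k = np.floor(x).astype(int)
    w2 = x - k                             # weight of s(k+1)
    return (1.0 - w2) * s[k] + w2 * s[k + 1]
```

The explicit two-sample weighting is kept here to mirror the equations above; a library routine such as numpy.interp computes the same values.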
The interpolation method for rate conversion generally outperforms the re-sampling method in both accuracy and efficiency, and is therefore chosen as the default method in the experiments in this chapter.

Table 4.1: Comparison of the speed correction error using different rate conversion methods.

  Rate conversion method          average RMSE    average time (sec.)
  re-sampling with δ = 0.01           1.81              1.04
  re-sampling with δ = 0.005          0.83              1.74
  re-sampling with δ = 0.001          0.15              1039
  interpolation                       0.12              0.103

We have also conducted experiments to test the speed correction accuracy under different settings of the ENF frame size Le and the correction frame size Lc. The results are listed in Table 4.2, from which several observations can be made. The speed offset of the audio may change quickly over time. Within each correction frame, a constant correction ratio is used to stretch or compress the audio signal. It is therefore desirable to adopt small correction frames to improve the speed correction accuracy. As for the size of the ENF frame, it should be neither too large nor too small. A single dominant frequency value is estimated for each ENF frame, so the ENF frames should be reasonably short to adapt to the evolution of the ENF signal over time. However, the number of samples in each ENF frame needs to be sufficient for reliable frequency estimation.

Table 4.2: Comparison of the speed correction error using different sizes of ENF frame (Le) and correction frame (Lc).

  Le \ Lc (sec.)     1      2      4      6      8
  1                0.66     -      -      -      -
  2                0.17   0.19     -      -      -
  4                0.12   0.14   0.21     -      -
  6                0.16   0.18   0.23   0.30     -
  8                0.22   0.23   0.27   0.33   0.40

At the end of this section, we demonstrate the application of the proposed speed correction approach to the restoration of certain digitized Apollo 11 recordings. The Apollo 11 mission, operated by NASA, had the objective of performing a crewed lunar landing and returning to Earth. The mission launched on July 16, 1969, and finished on July 24, 1969. On July 20, the mission completed the first-ever human lunar landing. During the mission, a large number of audio recordings were made of the communications among the spacecraft crew and the mission control staff. These recordings, made as historical documentation during the mission, were recorded on analog equipment that is now obsolete. When the recordings are digitized, these machines no longer operate at the correct speed even under optimum circumstances. In some extreme cases, the tape is so greatly off speed that the recorded speech becomes unintelligible. Speed correction is therefore needed to preserve these tapes of tremendous archival value.

The spectrogram of a sample recording with a serious speed offset is shown in Fig. 4.16 (a). From the figure, we observe the capture of ENF traces around 60 Hz. The value of the ENF signal decreases gradually to as low as 45-50 Hz in the earlier part of the recording, and then returns to close to 60 Hz through the end. The proposed speed restoration scheme is applied to this recording using the ENF signal estimated around 60 Hz, which is shown in Fig. 4.17 (a). The spectrogram and the estimated ENF signal of the audio after restoration are shown in Fig. 4.16 (b) and Fig. 4.17 (b), respectively.

Figure 4.16: The spectrogram of an Apollo mission control recording before and after speed correction. (a) Before speed correction; (b) after speed correction.

One way of measuring the speed correction accuracy with the Apollo recordings is to examine the Quindar tones in the audio.
The Quindar tones were used in the Apollo missions by Mission Control to simulate the action of pressing and releasing the push-to-talk (PTT) button. Two tones exist as pure sine waves that are 250 ms long but at slightly different frequencies around 2500 Hz. The "intro tone" is generated at 2525 Hz and signals the keypress of the PTT button that un-mutes the audio. The "outro tone" is generated at 2475 Hz and signals the release of the PTT button that mutes the audio. Most Quindar tones sound loud and clear in the recording, resulting in a high signal-to-noise ratio, so their frequencies can be accurately estimated.

Figure 4.17: The ENF signal extracted from an Apollo mission control recording, (a) before and (b) after speed correction.

The amount of deviation of the Quindar tones from their nominal values indicates the level of speed offset in the audio. For the audio clip whose spectrogram is shown in Fig. 4.16 (a), the Quindar tones' frequencies before and after speed correction are shown in Fig. 4.18. The frequencies of the Quindar tones in the original audio during the first 400 seconds are generally much lower than the nominal values of 2525 and 2475 Hz. The worst one is at about 2050 Hz, which is more than 17% lower than the nominal value. A measurement of the Quindar tone deviation can be defined in the following form:

d_Q = sqrt( (1/n) · Σ_{i=1}^{n} (Q_i - 2500)² ),

where Q_i is the frequency of the i-th of a total of n Quindar tones. The Quindar tone deviation in the original audio is 176. After restoration, the offset of the Quindar tone frequencies is reduced, as can be seen from Fig. 4.18, and the Quindar tone deviation decreases to 60.8. This demonstrates that our proposed approach is effective in reducing the speed offset.

Figure 4.18: The Quindar tones in an Apollo mission recording before and after restoration.

As indicated by the Quindar tone deviation, although the speed offset is reduced, the audio speed is not perfectly recovered. This is because the ENF signal cannot be estimated accurately. Two factors contribute to this issue. First, from the spectrogram we observe that the ENF traces in the recording spread over a wide frequency band, hindering accurate estimation of the dominant frequency. Second, it can be inferred from the Quindar tones in Fig. 4.18 that the speed offset of this recording changes quickly with time: some of the Quindar tones that occur close in time have very different frequencies. As a result, the value of the ENF signal in the recording also changes quickly. As the frequency estimation must be performed on an ENF frame of sufficient length, the temporal resolution of the ENF estimation is not sufficient to cope with such fast changes.

4.4 Chapter Summary

In this chapter, we studied ENF analysis of recaptured audio recordings. As the ENF signal is embedded into multimedia recordings at the time of recording, multiple ENF traces may exist in a recording that has been recaptured. If the recapturing of the recording is performed in a region with the same nominal ENF as the original recording, the ENF traces of the two recording processes may overlap. Conventional frequency estimation methods may fail in such situations.
We propose a decorrelation-based algorithm to estimate multiple ENF signals from a recaptured audio recording in sequential order, assuming that the power measurements are available as a reference.

ENF extraction from recaptured recordings is relevant to analyzing digitized old recordings of historical importance. Many audio recordings that document historical events and conversations were created using analog recorders. These recordings are often digitized to facilitate the preservation and transmission of the audio. During digitization, a new ENF signal may be captured.

A particular distortion of the ENF signal in digitized audio recordings is the drifting effect. It refers to the phenomenon that the value of the ENF in a digitized analog audio recording may deviate from its original value. The mechanical rolling speed of analog tape recorders and players is often not well stabilized. When an analog tape recording is digitized, the rolling speed of the tape is likely to differ from that at the tape's creation. As a result, a speed offset is induced in the digitized version of the audio, causing the drifting effect.

We propose to correct the speed error of digitized audio recordings using the ENF as a reference signal. The ENF signal fluctuates minimally around a known nominal value; it can therefore be considered as a single-tone signal. The ENF signal appearing at a frequency higher than its nominal value indicates that the audio speed is faster than normal, and vice versa. Speed error correction can be performed by stretching or compressing the audio signal according to the amount of deviation of the ENF signal. The proposed method has been applied to audio recordings from the NASA Apollo 11 mission, and it has been shown to be effective.

Acknowledgement

I would like to thank Prof. Douglas W. Oard of the University of Maryland for enlightening discussions that inspired the work in this chapter.

Chapter 5 Quality Evaluation of Image Features for Efficient and Robust Spatial Alignment

In the previous chapters, we have discussed audio/video synchronization, which presents a fundamental task for applications involving multiple pieces of audio-visual data, such as video panorama, 3D reconstruction, and video super-resolution. Spatial alignment of images and video frames, which is often performed together with or following synchronization, is another essential alignment problem for these applications. Spatial alignment is also useful for identifying common objects in images, facilitating such applications as image retrieval. Local image features offer one of the most successful and popular methods for visual matching of images. In this chapter, we investigate feature quality evaluation for efficient and robust spatial alignment of images and video frames.

5.1 Chapter Introduction

5.1.1 Motivation

With the rapid growth of storage capacity and the development of manufacturing technologies, digital cameras are becoming ubiquitous, and the quantity of digital images is increasing drastically. The existence of common objects in these images has motivated research in content-based image retrieval (CBIR). Given a query image, the goal is to retrieve from a large database the images containing the same object or scene as in the query image. CBIR can be useful in scenarios including landmark recognition, direction aid for tourists, and automatic CD/book annotation.
A special application of CBIR is mobile visual search [26], in which users can use their phone cameras to take a picture of an object, such as a landmark, and send the picture or the extracted image features via broadband data connections to a server. The server then conducts image retrieval and object recognition to identify the object in the picture and sends the desired information about the object, such as annotations, back to the user. Examples of commercial products in this category include Google Goggles [3], Nokia Point and Find [4], and Amazon SnapTell [5]. Mobile visual search poses unique challenges. The retrieval process must be conducted within stringent memory, computation, power, and bandwidth constraints. The communication and processing latency should be minimized to ensure a satisfactory user experience. This means that the size of the data transmitted between the mobile client and the server should be as small as possible, and the image retrieval should be as computationally efficient as possible.

Most state-of-the-art solutions for CBIR are built on the usage of local image features. In the local feature framework, interest points are first selected as distinctive and robust points in the image by a keypoint detector. Next, a robust feature descriptor is generated using the information within the neighborhood of the interest point. The images may have gone through different lighting and viewing conditions, so desirable features should be robust to various distortions. Among all the proposed local image features [40, 52], the Scale-Invariant Feature Transform (SIFT) [38] is regarded as a seminal work given its robustness. The interest points used in SIFT are extrema in the Difference-of-Gaussian scale space, and the descriptors are vectors consisting of histograms of the weighted gradient orientations in a local patch around every interest point. The histograms are computed over a 4-by-4 grid of blocks (16 blocks in total) in the neighborhood of the interest point. Each histogram has 8 bins covering 360 degrees, giving a 128-dimensional SIFT descriptor. The patches are orientation and scale normalized, so SIFT is resilient to rotation and scaling variations.

A query image can be matched against an image database by computing the pairwise distances between the SIFT descriptors. However, this scheme is computationally expensive for large corpora and does not scale well. To cope with large-scale databases, a visual dictionary-based Bag-of-Words (BoW) approach has been proposed by Sivic and Zisserman [55]. The idea is to quantize image features into a set of visual words as a codebook by using k-means clustering on many training features. A given feature can be mapped to its nearest visual word in the codebook. The images are then represented by the frequency vector of the visual word occurrences. The similarity between two images is usually measured using the L1 or L2 distance between their visual word frequency vectors. During a query, the similarity score can be computed efficiently by using an inverted file system associated with the database. To improve the retrieval accuracy, Nister and Stewenius use hierarchical k-means clustering to generate a vocabulary tree for a much larger BoW codebook [43]. Alternatively, Philbin et al. propose approximate k-means utilizing randomized k-d trees to achieve a better partition of the descriptor space [48].
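As a concrete illustration of the BoW pipeline just described, the sketch below quantizes descriptors to their nearest visual words, forms a normalized visual word frequency vector per image, and compares two images with an L2 distance. The random arrays stand in for real SIFT descriptors and a k-means codebook, and all names are illustrative.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Map each descriptor to its nearest visual word (smallest Euclidean
    distance) and return the L1-normalized visual word frequency vector."""
    # squared distances via ||a||^2 + ||b||^2 - 2 a.b, avoiding a huge 3-D array
    d2 = (np.sum(descriptors ** 2, axis=1)[:, None]
          + np.sum(codebook ** 2, axis=1)[None, :]
          - 2.0 * descriptors @ codebook.T)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

def l2_similarity(h1, h2):
    """Smaller L2 distance between frequency vectors means more similar images."""
    return -np.linalg.norm(h1 - h2)

# Hypothetical usage with random data standing in for SIFT descriptors and a codebook.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1000, 128))        # e.g. k-means centroids
img_a = bow_histogram(rng.normal(size=(500, 128)), codebook)
img_b = bow_histogram(rng.normal(size=(300, 128)), codebook)
print(l2_similarity(img_a, img_b))
```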
Other extensions of this approach include query expansion [18], Hamming embedding [34], and soft quantization [49].

The visual dictionary-based approach has proven to be significantly more efficient than conventional methods. Much of the computational gain comes from the inverted file system, which decreases the fraction of images in the database that need to be considered for computing the similarity scores. In order to expedite the retrieval process, the visual word frequency vector of each image should be as sparse as possible so that the number of image hits is minimized. To achieve efficient and accurate retrieval, it is desirable to select from among all available features a subset of the most reliable and informative features to represent each image. Feature selection is especially beneficial to mobile visual search, for which communication and processing latency should be minimized to ensure a satisfactory user experience. As bandwidth is a limited resource in wireless communications, a smaller number of features helps reduce the data transmission load and the communication latency between the mobile client and the server. The computational complexity of image retrieval is significantly reduced with fewer, yet more resilient, features. Not only does this reduce the server-side processing latency, it also potentially enhances the search performance. Another possible benefit of feature selection is a smaller feature storage cost, which keeps the size of the feature database feasible for large-scale applications.

In this work, we focus on designing a scheme to evaluate the quality of SIFT features in terms of their robustness and discriminability. A quality score is assigned to every SIFT feature based on its contrast value, scale, and descriptor, using a quality metric kernel that is obtained in a one-time training phase. The quality score predicts how useful a SIFT feature is in describing a certain object or scene. Feature selection can be performed by retaining features with high quality scores. We also show that the proposed methodology can be generalized and applied to other local image features, such as SURF.

5.1.2 Related Work

Several techniques proposed in recent years have the potential to improve the efficiency and accuracy of feature matching for image alignment and retrieval. Turcot and Lowe [62] considered multi-view scenarios and proposed to select only a small subset of useful image features to reduce the number of visual descriptors and also improve the recognition accuracy. Their method is based on an unsupervised pre-processing step to retain descriptors that are geometrically consistent across multiple views. The descriptors are ranked according to the number of times they appear in the training images, and the top ones are retained. This method was improved in both speed and accuracy by Naikal et al. [42] by using sparse Principal Component Analysis (PCA). Fritz et al. [21] present an information-theoretic feature selection framework employing the selection criterion H(O|f) < λ. Here λ is a threshold, and H(O|f) is the estimated conditional entropy of the object O given the feature f, defined as

H(O|f) = − Σ_k P(O_k|f) log P(O_k|f),

where P(O|f) is the conditional distribution of the object class variable O given that feature f comes from O; this distribution can be estimated from the feature matching results of the training images.
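The entropy criterion of Fritz et al. can be sketched as follows, assuming the conditional distributions P(O_k|f) have already been estimated from training matches; the function names and toy numbers are hypothetical.

```python
import numpy as np

def conditional_entropy(p_obj_given_f):
    """H(O|f) = -sum_k P(O_k|f) log P(O_k|f) for one feature."""
    p = np.asarray(p_obj_given_f, dtype=float)
    p = p[p > 0]                     # 0 * log 0 is treated as 0
    return float(-(p * np.log(p)).sum())

def select_by_entropy(p_table, lam):
    """Keep the features whose conditional entropy is below the threshold lambda.
    p_table: array of shape (n_features, n_object_classes) with P(O_k|f) rows."""
    entropies = np.array([conditional_entropy(row) for row in p_table])
    return np.where(entropies < lam)[0]

# Hypothetical example: a peaked distribution (informative) vs. a nearly flat one.
p_table = np.array([[0.90, 0.05, 0.05],
                    [0.34, 0.33, 0.33]])
print(select_by_entropy(p_table, lam=0.8))   # only the first feature is kept
```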
Further, Baheti et al. [10] associate each descriptor with a weight factor that represents the relative importance of the feature in the database. They devise a three-step approach of feature pruning, including intra-object pruning to remove redundancies between the views of the same object, inter-object pruning to retain the most informative features across different objects, and keypoint clustering to further reduce the number of features. Feature descriptor compression [15, 16, 26] has also been proposed for compact storage and reduced network transmission. The SIFT descriptor can be compressed to around 2 bits per dimension with almost no performance degradation.

The feature selection method presented in this chapter works by assigning a quality score to every individual image feature and retaining the features with the highest scores. This scheme is not as database-specific as some of the existing methods. It can also potentially be combined with state-of-the-art techniques.

5.2 Quality Evaluation of SIFT Features

5.2.1 Definition of Quality Metric

Useful features should provide true matches more frequently than they generate false matches. A natural way of measuring the quality of features is to analyze their behavior statistically. In [57], we estimate the probabilities of true match and false match of SIFT features and use them as a quality metric. We apply vector quantization to the 128-dimensional SIFT descriptor space to generate a set of visual words. SIFT features are extracted from many training images and assigned to their nearest visual words. The training images then undergo certain synthetic transformations, such as rotation and blurring. Feature matching is conducted between every pair of original and transformed images. Since the transformations are synthetic, the ground truth of point correspondence is known. If n features fall in a certain visual word bin, and n_t of them are correctly matched while n_f of them are falsely matched, then the probabilities of true match and false match are estimated as n_t/n and n_f/n, respectively. The quality score of the features f belonging to this visual word bin can be defined in the form

Q(f) = R_true(f) − α · R_false(f),    (5.1)

where R_true(f) and R_false(f) are the true match rate and the false match rate, respectively, and α is a weighting parameter. Note that the sum of R_true(f) and R_false(f) is not necessarily 1, because some features may not be matched to any other features at all. A simplifying assumption of this method is that features with descriptors quantized into the same visual word are of the same quality.

Fig. 5.1 shows an example of using the approach in [57] to prune SIFT features. The left figure shows the direct SIFT feature matching between the template image and an image containing the same object in the red circle, where many of the matches are incorrect. The right figure shows the matching result when using only the features with good quality scores. Almost all false matches are eliminated while all true matches are retained.

Several heuristics to identify desirable SIFT features exist, such as selecting features with the highest contrast value or the largest scale [39, 62]. The contrast value refers to the keypoint's response to the Difference-of-Gaussian filter at its characteristic scale. These empirical rules may also provide useful information for evaluating feature quality. In the following sections, we extend the work in [57] by incorporating contrast and scale information, and apply learning techniques to form a feature quality metric kernel.
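A minimal sketch of the descriptor-only quality metric of Eq. 5.1, assuming each training feature carries its visual word index and a recorded matching outcome; the encoding of outcomes (+1/-1/0) and the function name are illustrative choices.

```python
import numpy as np

def per_word_quality(word_ids, outcomes, n_words, alpha=1.0):
    """Estimate R_true and R_false for every visual word bin and combine them
    into the quality score of Eq. 5.1: Q = R_true - alpha * R_false.

    word_ids: visual word index of each training feature.
    outcomes: +1 if the feature was correctly matched to its transformed image,
              -1 if it was falsely matched, 0 if it found no match at all.
    """
    total = np.bincount(word_ids, minlength=n_words).astype(float)
    n_true = np.bincount(word_ids[outcomes == 1], minlength=n_words)
    n_false = np.bincount(word_ids[outcomes == -1], minlength=n_words)
    with np.errstate(invalid="ignore", divide="ignore"):
        q = (n_true - alpha * n_false) / total
    return np.nan_to_num(q)   # empty bins get a neutral score of 0

# Hypothetical toy example with 3 visual words.
word_ids = np.array([0, 0, 0, 1, 1, 2])
outcomes = np.array([1, 1, -1, 0, -1, 1])
print(per_word_quality(word_ids, outcomes, n_words=3, alpha=1.0))
```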
The whole contrast-scale-descriptor space is quantized into a set of bins, and each bin is characterized by its contrast quantization index, scale quantization index, and visual word index. We then analyze the statistical behavior of the features in each bin. We refer to the scale-contrast-descriptor space as the feature space.

5.2.2 Soft Quantization in Feature Space

Quantization of the feature space is necessary to make the feature quality evaluation feasible. Instead of assigning a quality score to every possible feature, we only need to deal with a limited number of feature space bins. However, the quantization provides only a relatively coarse approximation to the actual points in the feature space. The quality evaluation accuracy may be affected by quantization, especially for the features that fall near quantization boundaries.

Figure 5.1: Comparing SIFT feature matching with and without feature pruning. The query object is shown on the left, and the image containing the query object in the red circle is shown on the right. Figure (a) demonstrates the matching result using all the extracted SIFT features; (b) demonstrates the matching result using only selected features. The figures are best viewed in color.

To alleviate the negative impact of quantization, we employ a soft quantization strategy, as in [49]. Instead of a hard quantization where a feature is assigned to a single bin in the feature space, the soft quantization mechanism maps each feature to a set of weighted bins. The weights assigned to the bins can be determined by the distances between the feature point and the bin centroids. An exponential weighting is adopted in this work.

Consider a SIFT feature f with scale S_f, contrast value C_f, and descriptor vector D_f. Suppose that (S^cen_{n_S}, C^cen_{n_C}, D^cen_{n_D}) is the feature space bin centroid corresponding to the n_S-th scale level, the n_C-th contrast level, and the n_D-th visual word, where n_S ∈ {1, 2, ..., N_S}, n_C ∈ {1, 2, ..., N_C}, and n_D ∈ {1, 2, ..., N_D}, for a total of N_S, N_C, and N_D quantization levels for scale, contrast, and descriptor, respectively. We denote (S^cen_{n_S}, C^cen_{n_C}, D^cen_{n_D}) by (n_S, n_C, n_D) in the following. The indices of the K_S nearest neighbors of S_f among {S^cen_{n_S} | n_S ∈ {1, 2, ..., N_S}} are denoted as {I^S_t | t = 1, 2, ..., K_S}, i.e., the t-th nearest neighbor of S_f is S^cen_{I^S_t}. Similar notations are defined for the nearest neighbors of C_f and D_f (in L2 distance): {I^C_t | t = 1, 2, ..., K_C} and {I^D_t | t = 1, 2, ..., K_D} are the nearest neighbor indices of C_f and D_f, respectively.

For hard quantization, f is assigned to the single feature space bin (I^S_1, I^C_1, I^D_1). With the soft quantization strategy, f is mapped to the set of K_S · K_C · K_D nearby bins that are closest to (S_f, C_f, D_f), denoted by {(I^S_{t1}, I^C_{t2}, I^D_{t3}) | t1 = 1, ..., K_S; t2 = 1, ..., K_C; t3 = 1, ..., K_D}. We assign a different weight to each bin according to the distance between the feature point and the bin centroid. Denoting the L2 distance as d(·,·), the weight assigned to bin (I^S_{t1}, I^C_{t2}, I^D_{t3}) is

w(I^S_{t1}, I^C_{t2}, I^D_{t3}) = α_S · exp(−d(S_f, S^cen_{I^S_{t1}})² / σ_S²) × α_C · exp(−d(C_f, C^cen_{I^C_{t2}})² / σ_C²) × α_D · exp(−d(D_f, D^cen_{I^D_{t3}})² / σ_D²) = w_S(I^S_{t1}) · w_C(I^C_{t2}) · w_D(I^D_{t3}),    (5.2)

where σ_S, σ_C, and σ_D are tunable parameters.
Here, α_S, α_C, and α_D are normalization coefficients defined as

α_S = [ Σ_{t=1,...,K_S} exp(−d(S_f, S^cen_{I^S_t})² / σ_S²) ]^{−1},
α_C = [ Σ_{t=1,...,K_C} exp(−d(C_f, C^cen_{I^C_t})² / σ_C²) ]^{−1},
α_D = [ Σ_{t=1,...,K_D} exp(−d(D_f, D^cen_{I^D_t})² / σ_D²) ]^{−1}.    (5.3)

The weights assigned to the bins that are not among the feature point's nearest bins can be regarded as 0. It can be shown that the sum of the weights across all the bins is unity:

Σ_{t1=1}^{N_S} Σ_{t2=1}^{N_C} Σ_{t3=1}^{N_D} w(t1, t2, t3) = Σ_{t1} Σ_{t2} Σ_{t3} w_S(t1) w_C(t2) w_D(t3) = ( Σ_{t1=1}^{K_S} w_S(I^S_{t1}) ) ( Σ_{t2=1}^{K_C} w_C(I^C_{t2}) ) ( Σ_{t3=1}^{K_D} w_D(I^D_{t3}) ) = 1.    (5.4)

5.2.3 Kernel Training and Quality Assessment

In this subsection, we explain how we integrate the techniques discussed above to form and utilize the SIFT feature quality metric kernel.

Quality Metric Kernel Training. To begin, we collect a set of representative sample images covering various scene types and categories as training images, and extract SIFT features from these training images. We then randomly select a subset of these features to estimate the quantization centroids for scale, contrast, and descriptors, using k-means and Lloyd's algorithm. The numbers of quantization levels for each dimension, i.e., N_S, N_C, and N_D, are determined empirically. The feature space is thus divided into bins that are indexed by their scale, contrast, and visual word indices. Then, we apply soft quantization to the features from the training images according to Eq. 5.2. The quantization weights are aggregated into a 3-dimensional table N_total:

N_total(t1, t2, t3) = Σ_{all features} w(t1, t2, t3),    (5.5)

where t1 ∈ {1, 2, ..., N_S}, t2 ∈ {1, 2, ..., N_C}, and t3 ∈ {1, 2, ..., N_D}. We then conduct the synthetic transformations of interest on the training images, and match SIFT features between every original and transformed image pair. The matching results for each feature from the original images are recorded. N_true and N_false are used to store the weight aggregates of each feature space bin corresponding to correctly and falsely matched features, respectively:

N_true(t1, t2, t3) = Σ_{correctly matched features} w(t1, t2, t3),
N_false(t1, t2, t3) = Σ_{falsely matched features} w(t1, t2, t3).    (5.6)

The quality metric kernel can be established using the three tables above:

Q_ker(t1, t2, t3) = ( N_true(t1, t2, t3) − α · N_false(t1, t2, t3) ) / N_total(t1, t2, t3),    (5.7)

where α is a parameter adjusting the weights between the true match rate and the false match rate.

Quality Assessment. Given a SIFT feature f, we compute its quality score using the quality metric kernel. We first soft-quantize f to obtain the weights for each bin in the feature space according to Eq. 5.2, denoted by w_f(t1, t2, t3). Then its score is calculated as the dot product of w_f(t1, t2, t3) and the quality metric kernel Q_ker:

Q(f) = Σ_{t1=1}^{N_S} Σ_{t2=1}^{N_C} Σ_{t3=1}^{N_D} w_f(t1, t2, t3) · Q_ker(t1, t2, t3).    (5.8)

Features with higher quality scores are more desirable.

5.3 Experiment Results

5.3.1 Implementation and Experiment Setup

We collect about 40000 images from Flickr, covering various scene types and categories, as the training dataset. The resolution of these images is generally around 500-by-300. Some sample images are shown in Fig. 5.2. This dataset provides about 20 million SIFT features using the VLFeat SIFT implementation [65]. The numbers of quantization levels are chosen as N_S = 10, N_C = 10, and N_D = 4096. 8000 images are randomly selected as the training images for estimating the quantization centroids and boundaries.
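Putting Eqs. 5.2-5.8 together, the sketch below first computes the per-dimension exponential weights and their products (soft quantization), then aggregates them into N_total, N_true, and N_false to form Q_ker, and finally scores a new feature by the weighted sum of Eq. 5.8. The data layout and function names are illustrative assumptions, not the exact implementation used in our experiments.

```python
import numpy as np

def soft_weights(value, centroids, k, sigma2):
    """Per-dimension soft-quantization weights (one factor of Eq. 5.2):
    w_t proportional to exp(-d(value, centroid_t)^2 / sigma^2) over the k
    nearest centroids, normalized to sum to one (the alpha terms of Eq. 5.3)."""
    diff = centroids - value
    d2 = np.sum(diff ** 2, axis=-1) if diff.ndim > 1 else diff ** 2
    idx = np.argsort(d2)[:k]
    w = np.exp(-d2[idx] / sigma2)
    return idx, w / w.sum()

def soft_quantize(Sf, Cf, Df, S_cen, C_cen, D_cen,
                  Ks=2, Kc=2, Kd=10, sig2_s=2.0, sig2_c=2.0, sig2_d=10000.0):
    """Map a feature (scale, contrast, descriptor) to Ks*Kc*Kd weighted bins;
    the products of the per-dimension weights sum to one (Eq. 5.4)."""
    iS, wS = soft_weights(Sf, np.asarray(S_cen), Ks, sig2_s)
    iC, wC = soft_weights(Cf, np.asarray(C_cen), Kc, sig2_c)
    iD, wD = soft_weights(np.asarray(Df), np.asarray(D_cen), Kd, sig2_d)
    bins, weights = [], []
    for a, wa in zip(iS, wS):
        for b, wb in zip(iC, wC):
            for c, wc in zip(iD, wD):
                bins.append((a, b, c))
                weights.append(wa * wb * wc)
    return bins, np.array(weights)

def train_quality_kernel(features, outcomes, centroids, shape, alpha=0.0):
    """Aggregate soft weights into N_total / N_true / N_false (Eqs. 5.5, 5.6)
    and form Q_ker = (N_true - alpha * N_false) / N_total (Eq. 5.7).
    outcomes: +1 correctly matched, -1 falsely matched, 0 unmatched."""
    n_total, n_true, n_false = (np.zeros(shape) for _ in range(3))
    for (Sf, Cf, Df), out in zip(features, outcomes):
        bins, weights = soft_quantize(Sf, Cf, Df, *centroids)
        for b, w in zip(bins, weights):
            n_total[b] += w
            if out == 1:
                n_true[b] += w
            elif out == -1:
                n_false[b] += w
    with np.errstate(invalid="ignore", divide="ignore"):
        q_ker = (n_true - alpha * n_false) / n_total
    return np.nan_to_num(q_ker)      # empty bins get a neutral score of 0

def quality_score(Sf, Cf, Df, centroids, q_ker):
    """Q(f): dot product between the feature's soft weights and Q_ker (Eq. 5.8)."""
    bins, weights = soft_quantize(Sf, Cf, Df, *centroids)
    return float(sum(w * q_ker[b] for b, w in zip(bins, weights)))
```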
We employ efficient hierarchical k-means to generate the visual word codebook. The transformations used for the quality metric kernel training and the corresponding parameters are listed in Table 5.1. The matching results between the original images and their transformed counterparts are used to learn the feature quality metric kernel as previously discussed. The parameters in Eq. 5.2 are set as σ_S² = 2, σ_C² = 2, and σ_D² = 10000. The numbers of nearest neighbors for scale, contrast, and visual word are chosen as K_S = 2, K_C = 2, and K_D = 10, respectively.

Figure 5.3 provides an example of feature selection on an image from the University of Kentucky Benchmark dataset [43]. The locations of the features are indicated by the centers of the red circles, and the radii of the circles are proportional to the feature scale. The first figure shows all the SIFT features extracted from the image. The other three figures demonstrate the features selected with the highest quality score, the largest scale, and the highest contrast value, respectively. The example shows that the proposed quality score captures more informative and representative features than the other empirical schemes.

Table 5.1: List of transformations used in quality metric kernel training.

  Transformation            Parameter                 Values
  Rotation                  Rotation angle            15, 30, 45, ..., 165
  JPEG                      Quality factor            80, 70, ..., 10
  AWGN                      Noise variance            0.01, 0.02, 0.03, 0.04
  Average filter            Filter size               3, 5, 7, 9
  Gaussian filter           Filter variance           0.5, 1.0, ..., 3.0
  Median filter             Filter size               3, 5, 7, 9
  Histogram equalization    Number of output levels   256, 128, 64, 32

Figure 5.2: Sample images in the training dataset. The training dataset consists of around 40,000 images crawled from Flickr, including categories such as aircraft, cars, rocks, trees, etc.

5.3.2 Feature Selection Performance

To demonstrate the effectiveness of the proposed SIFT feature quality metric as a tool for selecting useful features, we conduct experiments on three datasets: the University of Kentucky Benchmark, the Oxford Buildings Dataset, and the INRIA Holidays Dataset. These are popular datasets in the literature and are often used as benchmarks for image retrieval. The three datasets are of different scales (10K, 5K, and 1.5K images, respectively), and also provide a wide variety of image content.

Figure 5.3: Comparing different feature selection schemes. (a) shows all the SIFT features extracted from the sample image; (b) shows the top 20 features with the highest quality scores; (c) shows the top 20 features with the largest scales; (d) shows the top 20 features with the highest contrast values. The features shown in red circles are best viewed in color.

The University of Kentucky Benchmark. The University of Kentucky Benchmark (UKB) [43] dataset consists of 10200 images of 2550 objects such as shoes, bottles, and CD covers. Each object is represented by four 640×480 images taken from different viewpoints. The most popular performance metric for this database is the average top score. Each image is employed as a query, and the perfect result returns all 4 images of the query object (including the query image itself) as the top returned images. The average top score is defined as the average number of correctly matched images among the top 4 returned images, taken over all possible query images. We use the BoW approach in the experiments, and the vocabulary is obtained using the Hierarchical K-Means (HKM) method [43] with 1 million cluster centers.
N SIFT features are selected to represent each image, and these features (10200 × N in total) are assigned to their corresponding visual words. The similarity between images is measured with the tf-idf weighted scalar product between the frequency vectors of the visual word occurrences. An inverted file system is built to facilitate efficient calculation of the similarity scores. It is desirable to achieve a satisfactory average top score with the number of features N being as small as possible for efficient retrieval. Determining the best method of choosing the representative features for every image is the key problem. With the proposed quality metric, we can calculate the quality scores of the SIFT features and choose the features with the highest scores. To show the effectiveness of our approach, we compare it with several other empirical feature selection schemes, such as picking the features with the largest scales or the highest contrast values. The average top scores under different settings of N are plotted in Figure 5.4.

Figure 5.4: Average top scores on the UKB dataset using different feature selection schemes.

In Fig. 5.4, the magenta curve with triangle symbols corresponds to the method proposed in [57], where the quality metric is established considering only the descriptor information, using hard quantization with 4096 visual words. The blue curve with cross symbols and the green curve with circle symbols show the performance of selection by large scale and selection by high contrast value, respectively. The red curve shows the performance of the proposed feature selection method (with α = 0). The figure shows that the feature selection scheme based on our proposed feature quality score achieves superior performance compared to the other empirical methods, especially for smaller values of N, when the selection of useful features is more crucial. Typically, a 20% saving in the number of features can be obtained while achieving the same average top score. Detailed results are listed in Table 5.2.

We have also tested the effect of α in Eq. 5.7, as illustrated in Figure 5.5. The matching accuracy is not sensitive to the choice of α. This observation remains consistent over all three datasets we test on. In the following experiments, we consider the case where α is 0.

Figure 5.5: Effect of α on the average top score.

In Table 5.3, we list the average number of hits and the retrieval time per image when using different numbers of features to represent each image. For a certain image, a hit happens when one of its visual words is found in another image. The number of hits directly affects the retrieval efficiency of the inverted file indexing system. From the table we observe that a reduced number of features benefits the retrieval efficiency substantially.
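A simplified sketch of the tf-idf weighted scoring through an inverted file, which makes the link between the number of features, the number of hits, and the retrieval time explicit: only the images sharing a visual word with the query are touched. Vector normalization and other refinements are omitted, and all names and toy data are hypothetical.

```python
import numpy as np
from collections import defaultdict

def build_index(image_word_counts, n_words):
    """Inverted file: for each visual word, the list of (image_id, tf) entries,
    plus idf weights computed from the document frequencies."""
    inverted = defaultdict(list)
    df = np.zeros(n_words)
    for img_id, counts in enumerate(image_word_counts):
        for w, c in counts.items():
            inverted[w].append((img_id, c))
            df[w] += 1
    idf = np.log(len(image_word_counts) / np.maximum(df, 1))
    return inverted, idf

def query(query_counts, inverted, idf, n_images):
    """tf-idf weighted scalar product; each posting touched here is one 'hit'."""
    scores = np.zeros(n_images)
    for w, c in query_counts.items():
        for img_id, tf in inverted.get(w, []):
            scores[img_id] += (c * idf[w]) * (tf * idf[w])
    return np.argsort(-scores)

# Hypothetical toy database: each image is a dict {visual_word_id: count}.
db = [{0: 2, 3: 1}, {1: 1, 3: 2}, {2: 4}]
inv, idf = build_index(db, n_words=5)
print(query({0: 1, 3: 1}, inv, idf, n_images=len(db)))
```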
It is also interesting to note that at a certain point, where the number of features is large enough, using more features actually degrades the matching accuracy.

Table 5.2: Number of features needed to achieve a pre-defined average top score on the UKB dataset. The target average top score is listed in the first row; the numbers of features needed to achieve the target are listed in the following rows (NA: not achievable in the experiments).

  Target average top score                    1.8    2.2    2.4    2.6    2.9
  Selection by proposed quality metric       16.9   37.7   57.5   92.7  219.3
  Selection by high contrast                 26.8   51.2   72.6  105.9  236.1
  Selection by large scale                   33.2   47.9   70.6  156.2     NA
  Selection by the quality metric in [57]    29.2   74.6    118  186.6     NA

Table 5.3: Average number of hits and retrieval time per image on the UKB dataset.

  Number of features per image      50     100     200     300     500     700     1000
  Number of hits per image        157.6   439.0  1265.4  2382.5  5425.0  9299.2  15159.4
  Retrieval time per image (ms)     3.7     9.3    26.3    49.9   119.6   220.8    378.4
  Average top score                2.33    2.63    2.87    2.96    3.02    3.01     2.96

It should be noted that the average top score we have obtained here is slightly worse than the numbers reported in the literature. One possible reason is that we employ SIFT feature points (DoG point detector) that are only robust to scaling and rotation, whereas more sophisticated affine-invariant interest point detectors are used in other works. Techniques that can improve the retrieval performance, such as approximate k-means [48], soft visual word assignment [49], and query expansion [18], are not adopted in our current implementation. However, our experiments suffice to show the effectiveness of the proposed feature quality metric as a way of identifying and selecting useful features.

The Oxford Buildings Dataset. The Oxford Buildings Dataset [48] consists of 5062 images collected from Flickr by searching for Oxford landmarks. The dataset has been manually annotated to generate a comprehensive ground truth for 11 different landmarks, each represented by 5 queries. To evaluate the retrieval performance, we use the Average Precision (AP), defined as the area under the precision-recall curve. Precision is the ratio of the number of retrieved true positive images to the total number of retrieved images. Recall is defined as the ratio of the number of retrieved true positive images to the total number of true positive images in the gallery. The perfect precision-recall curve has precision 1 over all recall values, which corresponds to an AP of 1. The AP score is computed for each of the 55 query images in the dataset. The average of these AP scores, termed the Mean Average Precision (mAP), is used to evaluate the overall performance on this dataset.

We use similar settings to the UKB dataset. A visual word vocabulary is obtained using the HKM method with 1 million cluster centers. Each query and training image is represented by N selected SIFT features, and these features are mapped to visual words according to the vocabulary. In Figure 5.6, we plot the mAP scores for different values of N. Feature selection by the proposed quality metric again performs the best among all the methods that we have tested. To achieve the same mAP score, fewer features are needed by the proposed scheme than by selection by large scale or by high contrast, as can be seen from Table 5.4.
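For reference, one common discrete form of the AP computation described above, evaluated on a ranked list of binary relevance labels, is sketched below; the toy relevance vector is hypothetical.

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP from a ranked list of 0/1 relevance labels
    (1 = the retrieved image is a true positive for the query)."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(all_rankings):
    """mAP: mean of the per-query AP scores."""
    return float(np.mean([average_precision(r) for r in all_rankings]))

# Hypothetical query with 3 relevant images in the gallery.
print(average_precision([1, 0, 1, 1, 0]))
```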
Figure 5.6: The mAP scores on the Oxford Buildings dataset using different feature selection schemes.

Table 5.4: Number of features needed to achieve a pre-defined mAP on the Oxford dataset (NA: not achievable in the experiments).

  Target mAP                                 0.15    0.2    0.25    0.3    0.35
  Selection by proposed quality metric       39.9  117.1   222.9  433.2   674.4
  Selection by high contrast                   60  200.7   332.2  487.8   760.6
  Selection by large scale                     76  166.6   372.3  616.3      NA
  Selection by the quality metric in [57]    94.8   86.4     320  612.5      NA

The INRIA Holidays Dataset. The INRIA Holidays dataset contains a set of personal holiday photos of mainly outdoor scenes captured by the authors of [34]. The dataset contains 1,491 images divided into 500 groups, each of which represents a distinct scene or object. In the experiment, we use the first image of every image group as the query image and evaluate the mean average precision as defined in Sec. 5.3.2. As this dataset is relatively small, the retrieval is performed with nearest neighbor matching instead of the BoW approach. N SIFT features are selected from every image to form a feature gallery. When an image is used as a query, we take each of its N features and find its 3 nearest neighbors in the feature gallery. The images that contain at least one of the 3 nearest features receive one vote. After considering all the query features, the images with the most votes are regarded as the top matches. Approximate Nearest Neighbor (ANN) techniques [8, 33, 41] are adopted to accelerate the nearest neighbor search. Using the implementation provided by [41], we can speed up the nearest neighbor search by as much as 50 times while guaranteeing that the search result is correct with a probability above 95%. As illustrated in Figure 5.7 and Table 5.5, the proposed feature selection scheme continues to outperform the other methods.

Table 5.5: Number of features needed to achieve a pre-defined mAP on the INRIA Holidays dataset.

  Target mAP                                 0.55   0.60   0.65   0.70   0.75
  Selection by proposed quality metric        7.2   11.2   19.4   36.8   69.7
  Selection by high contrast                 18.8   28.7   40.3   63.7  103.7
  Selection by large scale                    7.2   11.8   22.8   40.6     75
  Selection by the quality metric in [57]    11.7   18.9   33.4     59  108.2

Figure 5.7: The mAP scores on the INRIA Holidays dataset using different feature selection schemes.

For mobile visual search applications, the size of the feature data transmitted over the network between the client and the server can be further reduced by applying feature compression techniques [15, 16, 26] to the selected features. Here, we conduct experiments to reduce the feature dimensionality via Principal Component Analysis (PCA). The eigenvectors are learned from all features in the dataset, and the features selected by the proposed quality metric are projected onto the M eigenvectors with the largest eigenvalues, resulting in M-dimensional PCA features. Each dimension of the features is then quantized to 8 bits. These PCA features are used for nearest neighbor matching. Figure 5.8 shows the mAP scores achieved with various feature sizes, from which we observe that the efficiency is improved by applying PCA to the original SIFT features.
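A brute-force sketch of the nearest-neighbor voting retrieval used on the INRIA Holidays dataset, where each query feature votes for the images owning its 3 nearest gallery descriptors; in the actual experiments an ANN index replaces the exhaustive distance computation, and the toy data below are hypothetical.

```python
import numpy as np

def vote_retrieval(query_desc, gallery_desc, gallery_img_ids, k=3):
    """Each query descriptor votes once for every image that owns at least one
    of its k nearest gallery descriptors; images are ranked by total votes."""
    n_images = gallery_img_ids.max() + 1
    votes = np.zeros(n_images)
    for q in query_desc:
        d2 = ((gallery_desc - q) ** 2).sum(axis=1)
        nn = np.argsort(d2)[:k]
        for img in np.unique(gallery_img_ids[nn]):
            votes[img] += 1
    return np.argsort(-votes)

# Hypothetical toy data: two gallery images with well-separated descriptors.
rng = np.random.default_rng(1)
gallery = np.vstack([rng.normal(size=(4, 128)),           # image 0 descriptors
                     rng.normal(size=(4, 128)) + 5.0])     # image 1 descriptors
owner = np.array([0, 0, 0, 0, 1, 1, 1, 1])
query = gallery[:3] + 0.01 * rng.normal(size=(3, 128))     # near image 0
print(vote_retrieval(query, gallery, owner))   # image 0 ranks first
```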
The proposed feature selection scheme using the quality metric performs consistently better than the other empirical schemes on all three benchmark datasets that we have tested. Each of these three datasets covers images of distinct and complementary characteristics. For example, the UKB contains images of mainly small objects, texture content, and CD covers, and we note that features with high contrast are more informative than those of large scale. For the INRIA Holidays dataset, which contains images of mostly outdoor scenes, features of large scale are more useful than high-contrast features. Our feature quality metric considers all of the scale, contrast, and descriptor information, so it offers a more stable and effective method of selecting useful features for a broad variety of image contents.

Figure 5.8: The mAP scores on the INRIA Holidays dataset as a function of feature size.

5.4 Discussions

5.4.1 Examining the Quality Metric Kernel

To gain a better understanding of the characteristics of SIFT features, we look closely at the quality metric kernel. The kernel is essentially a 3-dimensional matrix, with each entry representing the robustness and discriminability of the point in the feature space indexed by its scale, contrast, and descriptor. It is enlightening to examine the marginal quality metric, i.e., the quality score as a function of only one of the feature properties (scale, contrast, or descriptor). With the 3-dimensional kernel as a joint quality metric, we can compute the marginal of one feature property by averaging across the other two. In Figure 5.9, we show the quality score as a function of scale and of contrast, respectively. The quality score reaches its peak at a scale near 10-20, and then starts to decrease significantly as the scale increases. However, features of extremely large scale (larger than 20) are rare, as can be seen from the CDF curve. So, although selecting features of the largest scale is not optimal, it still works much better than random selection. Similar rules apply for the keypoint contrast: as the contrast increases, the quality score first improves to its peak at around 20-25, and then decreases.

Figure 5.10 provides some examples of visual words with high quality scores (top row) and low quality scores (bottom row). We observe that the descriptors with high scores generally have multiple significant gradient orientations, while those with low scores have a single outstanding peak along the dominant direction. This corresponds with the intuition that the former corresponds to image patches at complex structures with more discrimination power, while the latter corresponds to image patches around simple line structures, which are not as useful for distinguishing different objects.

5.4.2 Generalization to Other Local Image Features

Although we demonstrated the feature quality evaluation scheme with SIFT features, we expect the methodology to apply to other local image features, such as Speeded Up Robust Features (SURF) [11] and the Gradient Location and Orientation Histogram (GLOH) [40], as well. We briefly demonstrate the results we obtained with SURF. The keypoint detector in the SURF algorithm is an integer approximation to the determinant-of-Hessian blob detector.
The SURF descriptor is a 64-dimensional vector comprising statistics calculated from the wavelet responses in the 4-by-4 grid of regions around the keypoint. We repeat the feature quality kernel training process as we employed for the SIFT features, and conduct experiments on the INRIA Holidays dataset. We represent each image in the dataset with N SURF features selected with different strategies. The retrieval scores (mAP) as a function of N are illustrated in Fig. 5.11. The selection scheme based on the proposed feature quality metric outperforms the other empirical schemes, similar to the case of SIFT features.

5.5 Chapter Summary

In this chapter, we have focused on local image feature selection for efficient visual matching. Local image features present a powerful tool for finding point correspondences between multiple images of the same object or scene. Many of the most successful solutions for image registration and content-based image retrieval are built on the use of local image features. The increase in image resolution and the growth in the scale of image databases may lead to an explosion in the number of image features. Feature selection therefore becomes important in order to improve the feature matching efficiency.

To solve the problem of feature selection, we have presented a quality evaluation method for SIFT features. Our approach is built upon a quality metric kernel, which is essentially a 3-dimensional matrix indexed by a SIFT feature's quantized scale, contrast, and descriptor. In the training phase, a set of training images covering various scene types is collected. We conduct feature matching between the training images and their synthetically transformed versions. The matching results are aggregated into the metric kernel, so every entry of the kernel reflects how the features belonging to the corresponding scale-contrast-descriptor feature space bin behave statistically. The ratio of the number of correct matches over the total number of features in each bin is calculated as the quality score for the corresponding bin. In the application phase, a given feature's quality score is computed based on its affinity to the feature space bins in the kernel and the quality scores of the bins. The proposed approach is tested on 3 benchmark datasets for large-scale content-based image retrieval. Feature selection according to the proposed quality score is shown to perform better than the empirical methods, such as selecting features with the largest scales or the highest contrast values.

Figure 5.9: Analyzing the quality score as a function of scale and contrast value, respectively. (a) shows the quality score of features at different scales; (b) shows the cumulative distribution function (CDF) of feature scale; (c) shows the quality score of features with different contrast values; (d) shows the CDF of feature contrast value.

Figure 5.10: Visual words with low quality scores (top) and high quality scores (bottom).

Figure 5.11: The mAP scores on the INRIA Holidays dataset using different SURF feature selection schemes.
Chapter 6 Conclusions and Future Perspectives

In this dissertation, we have considered the temporal and spatial alignment of multimedia signals. For temporal alignment, we propose a new modality for audio and video synchronization by exploiting the electric network frequency (ENF) signal naturally embedded in multimedia recordings. The value of the ENF signal fluctuates randomly around its nominal value over time, and the variation patterns remain consistent within the same power grid, even at distant locations. Synchronization of audio and video recordings can therefore be achieved by matching and aligning their embedded ENF signals. The ENF-based method does not rely on the audio and visual contents of the multiple recordings to be synchronized. This property provides a strong potential to address difficult scenarios that are intractable by existing methods. Taking video synchronization as an example, the conventional approaches based on visual cues do not work well in situations with arbitrary camera motion or with insufficient view overlap, while the ENF-based method is not affected by these adverse conditions. Additionally, extracting and aligning ENF signals may be computationally more efficient than approaches that rely on computer vision and/or extensive learning, so more (or longer) recordings can be processed efficiently. It can also be easily generalized to synchronize multiple pieces of recordings. To the best of our knowledge, our work is the first endeavor to exploit the ENF for multimedia signal synchronization.

The extraction of ENF signals from multimedia recordings is extensively studied in this dissertation, especially for visual recordings. ENF signals may be extracted from the soundtracks of video recordings, as well as from the image sequences if the video captures the subtle flickering of the lighting. Extracting the weak ENF signal from image sequences presents a challenging task. The temporal sampling rate of visual recordings is generally too low for estimating the ENF signal that may appear at harmonics of 50 or 60 Hz. The ENF traces in video signals are relatively weak, and may be easily distorted by object and camera motions. Following the direction of previous work, we have performed a further study exploiting the rolling shutter to extract ENF traces from video recordings. We model and analyze the rolling shutter mechanism with a filter bank using multirate signal processing theory. We then extend the scope of extracting ENF traces from videos of still scenes to those containing motions, which is a challenging problem that had not been formally attempted before. We have proposed several techniques to overcome the difficulties of extracting the ENF signal from image sequences, such as the low sampling rate, object motions in the scene, camera motions, and brightness changes.

We also address several challenges unique to the ENF analysis of recaptured audio recordings, such as digitized historical recordings. Multiple ENF traces may exist in recaptured audio recordings. Traditional estimation methods may fail when the ENF signals corresponding to different time instances interfere with each other. A decorrelation-based method is proposed to extract multiple ENF signals from recaptured audio recordings in a sequential manner. The ENF signal in a digitized analog recording may also suffer from the drifting effect, which refers to the phenomenon that the value of the ENF signal deviates severely from its original value.
The mechanical rolling speed of analog tape recorders and players is usually not well stabilized. When an analog tape recording is digitized, the rolling speed of the tape is likely to differ from that at the time of the tape's creation. As a result, a speed offset is induced in the digitized version of the audio, causing the drifting effect. We have proposed to detect and correct such speed errors using the ENF signal as a reference and utilizing digital signal processing techniques. The proposed method is used to restore several digitized historical audio recordings of the NASA Apollo 11 mission.

Local image features are widely used to detect common objects among multiple images and to obtain the spatial alignment of the images. In the local feature framework, interest points are first selected as distinctive and robust points in the image by a keypoint detector. Next, a robust feature descriptor is generated using the information within the neighborhood of the interest point. Since the images may have gone through different lighting and viewing conditions, desirable features should be robust to various distortions.

Many of the most successful solutions for image registration and content-based image retrieval are built on the use of local image features. The increase in image resolution and the growth in the scale of image databases may lead to an explosion in the number of image features. Feature selection becomes important in order to improve the feature matching efficiency. To solve the problem of feature selection, we have presented a quality evaluation method for SIFT features. Our approach is built upon a quality metric kernel, which is essentially a 3-dimensional matrix indexed by a SIFT feature's quantized scale, contrast, and descriptor. In the training phase, a set of training images covering various scene types is collected. We perform feature matching between the training images and their synthetically transformed versions. The matching results are aggregated into the metric kernel, so every entry of the kernel reflects how the features belonging to the corresponding scale-contrast-descriptor feature space bin behave statistically. The ratio of the number of correct matches over the total number of features in each bin is calculated as the quality score for the corresponding bin. In the application phase, a given feature's quality score is computed based on its affinity to the feature space bins in the kernel and the quality scores of the bins. The proposed approach is tested on 3 benchmark datasets for large-scale content-based image retrieval, and is shown to outperform the empirical methods.

Several interesting research topics along the directions of the work in this dissertation can be further explored. The ENF-based approach emerges as a new modality for audio and video synchronization. Unlike most conventional methods that extract and match certain audio and visual features from the contents of the recordings, it utilizes the embedded ENF traces that are usually considered as noise. The ENF-based approach and the audio/visual cue-based approach may therefore complement each other and may be combined to solve the problem of multimedia synchronization more effectively. For example, the ENF-based method can first be applied to obtain synchronization of two video clips efficiently with frame-level accuracy. The synchronization can then be refined by a visual cue-based method to sub-frame accuracy with more computation.
Several interesting research topics along the directions of this dissertation can be further explored. The ENF-based approach emerges as a new modality for audio and video synchronization. Unlike most conventional methods that extract and match audio and visual features from the contents of the recordings, it utilizes the embedded ENF traces that are usually treated as noise. The ENF-based approach and the audio/visual cue-based approaches may therefore complement each other and may be combined to solve the multimedia synchronization problem more effectively. For example, the ENF-based method can first be applied to synchronize two video clips efficiently with frame-level accuracy, and the synchronization can then be refined to sub-frame accuracy by a visual-cue-based method at the cost of more computation. Some video clips may also consist of segments for which different synchronization methods are better suited; for such a clip, we may choose the best synchronization method for each segment and combine the per-segment results into a final lag estimate for the whole clip. Video synchronization systems that exploit multiple modalities can be explored in the future.

The estimation of the ENF signal is crucial for the proposed multimedia synchronization method, which relies on matching the embedded ENF signals. In this dissertation, we use the short-time Fourier transform (STFT) based method to extract ENF signals from audio and visual recordings because of its efficiency and simplicity. This method divides a source signal into short, possibly overlapping frames; within each frame the signal can be regarded as wide-sense stationary, and each frame undergoes Fourier analysis separately. For ENF estimation, we apply the STFT to a source signal that contains ENF traces and find, in each frame, the peak frequency within a certain range around the nominal value or one of its harmonics. The peak frequencies from all frames are concatenated to form the estimated ENF signal (a minimal sketch of this estimator is given at the end of this chapter). Several challenges need to be addressed in order to improve the synchronization accuracy. The STFT-based method works well when the signal-to-noise ratio (SNR) of the ENF signal is fairly high, but when the noise distortion is strong, the estimation performance degrades severely. Another problem is the temporal resolution of the frequency estimation: the frame length needs to be large enough to ensure reliable estimation, which limits the resolution of the estimate, and this resolution is critical to the synchronization accuracy. How to improve the frequency estimation accuracy at low SNR while increasing the temporal resolution is therefore an important research topic.

With the image feature selection scheme proposed in this dissertation, we can identify image features that are statistically more likely to be useful for describing visual structures. By retaining only these features, the transmission load and the storage cost for image features can be lowered, and the matching accuracy can potentially be improved as well. However, the computation for generating the features is not reduced, because the feature selection is performed after all feature candidates have been generated. An interesting direction to explore is whether feature selection can be carried out earlier and incorporated into the feature generation process. Much of the time for feature generation is spent on calculating the descriptor vector. Similar to the approach proposed in this dissertation, for an interest point we can try to predict its robustness and discriminability from information such as its scale, contrast value, and other features extracted from its neighboring pixels, and then calculate descriptors only for those interest points that obtain a high quality score. The savings in descriptor calculation may improve feature generation efficiency substantially.
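As a concrete reference point for the STFT-based estimation discussed above, here is a minimal sketch of frame-wise peak picking near a nominal grid frequency, together with a simple correlation-based lag search standing in for the ENF matching step. The frame length, hop, zero-padding factor, and function names are illustrative assumptions rather than the exact settings used in this dissertation.

```python
import numpy as np

def estimate_enf_stft(x, fs, nominal=60.0, band=1.0, frame_sec=8.0, step_sec=1.0):
    """Frame-wise ENF estimate: peak frequency near `nominal` Hz in each frame.

    x       : 1-D signal containing an ENF trace (audio or a row-mean video signal)
    fs      : sampling rate in Hz
    nominal : nominal grid frequency, or the harmonic being tracked
    band    : half-width in Hz of the search range around the nominal value
    """
    frame_len = int(frame_sec * fs)
    step = int(step_sec * fs)
    window = np.hanning(frame_len)
    nfft = 8 * frame_len                      # zero padding for a finer frequency grid
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    in_band = (freqs >= nominal - band) & (freqs <= nominal + band)

    enf = []
    for start in range(0, len(x) - frame_len + 1, step):
        frame = x[start:start + frame_len] * window
        mag = np.abs(np.fft.rfft(frame, n=nfft))
        enf.append(freqs[in_band][np.argmax(mag[in_band])])
    return np.array(enf)                      # one ENF value per frame

def align_by_enf(enf_ref, enf_query, max_lag):
    """Return the lag (in ENF frames) at which the query sequence best matches the
    reference; a positive lag means the query starts that many frames after the
    reference begins.  A simple stand-in for the matching/alignment step."""
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        a, b = (enf_ref[lag:], enf_query) if lag >= 0 else (enf_ref, enf_query[-lag:])
        n = min(len(a), len(b))
        if n < 2:
            continue
        score = np.corrcoef(a[:n], b[:n])[0, 1]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score
```

Refinements such as combining energy from multiple harmonics or interpolating the spectral peak (see, e.g., [29] and [37]) can help at low SNR, and the frame length in this sketch directly embodies the accuracy-versus-temporal-resolution trade-off noted above.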
Bibliography

[1] Apollo 13 audio recordings, https://archive.org/details/Apollo13Audio.
[2] The first men on the moon: The Apollo 11 lunar landing, http://www.firstmenonthemoon.com/.
[3] Google Goggles, http://www.google.com/mobile/goggles.
[4] Nokia Point and Find, http://www.pointandfind.nokia.com.
[5] SnapTell, http://www.snaptell.com.
[6] A. Agarwala, K. C. Zheng, C. Pal, M. Agrawala, M. Cohen, B. Curless, D. Salesin, and R. Szeliski. Panoramic video textures. In SIGGRAPH, 2005.
[7] O. Ait-Aider, A. Bartoli, and N. Andreff. Kinematics from lines in a single rolling shutter image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[8] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching. Journal of the ACM, 45:891-923, 1998.
[9] Miller Center at Univ. of Virginia. Presidential recordings, http://millercenter.org/scripps/archive/presidentialrecordings/kennedy.
[10] P. K. Baheti, A. Swaminathan, M. Chari, S. Diaz, and S. Grzechnik. Information-theoretic database building and querying for mobile augmented reality applications. In IEEE International Symposium on Mixed and Augmented Reality, 2011.
[11] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. In ECCV, 2006.
[12] D. Bykhovsky and A. Cohen. Electrical network frequency (ENF) maximum-likelihood estimation via a multitone harmonic model. IEEE Transactions on Information Forensics and Security, 8(5), May 2013.
[13] Y. Caspi and M. Irani. A step towards sequence-to-sequence alignment. In Int'l Conf. Comput. Vision & Pattern Recognition, 2000.
[14] Y. Caspi, D. Simakov, and M. Irani. Feature-based sequence-to-sequence matching. Int'l J. Comput. Vision, 68(1), Jun. 2006.
[15] V. Chandrasekhar, M. Makar, G. Takacs, D. Chen, S. Tsai, N. Cheung, R. Grzeszczuk, Y. Reznik, and B. Girod. Survey of SIFT compression schemes. In Proc. Visual Communication and Image Processing, 2009.
[16] V. Chandrasekhar, G. Takacs, D. M. Chen, S. S. Tsai, J. P. Singh, and B. Girod. Transform coding of image feature descriptors. In Proc. Visual Communication and Image Processing, 2009.
[17] W.-H. Chuang, R. Garg, and M. Wu. How secure are power network signature based time stamps? In ACM Conf. on Computer and Communication Security, Oct. 2012.
[18] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In CVPR, 2007.
[19] A. Davis, M. Rubinstein, N. Wadhwa, G. Mysore, F. Durand, and W. T. Freeman. The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics (Proc. SIGGRAPH), 33(4):79:1-79:10, 2014.
[20] M. El-Saban, A. Kaheel, and M. Refaat. Stitching videos streamed by mobile phones in real-time. In ACM Multimedia, Oct. 2009.
[21] G. Fritz, C. Seifert, and L. Paletta. A mobile vision system for urban detection with informative local descriptors. In Proc. of the Fourth IEEE Int. Conf. on Computer Vision Systems, Jan. 2006.
[22] R. Garg, A. Hajj-Ahmad, and M. Wu. Geo-location estimation from electrical network frequency signals. In IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP), May 2013.
[23] R. Garg, A. Varna, and M. Wu. 'Seeing' ENF: Natural time stamp for digital video via optical sensing and signal processing. In 19th ACM International Conference on Multimedia, Nov. 2011.
[24] R. Garg, A. L. Varna, A. Hajj-Ahmad, and M. Wu. 'Seeing' ENF: Power signature based timestamp for digital multimedia via optical sensing and signal processing. IEEE Trans. Info. Forensics Security, 8(9), 2013.
[25] R. Garg, A. L. Varna, and M. Wu. Modeling and analysis of electric network frequency signal for timestamp verification. In IEEE International Workshop on Information Forensics and Security, Dec. 2012.
[26] B. Girod, V. Chandrasekhar, D. Chen, N.-M. Cheung, R. Grzeszczuk, Y. Reznik, G. Takacs, S. Tsai, and R. Vedantham. Mobile visual search. IEEE Signal Processing Magazine, July 2011.
[27] C. Grigoras. Applications of ENF criterion in forensics: Audio, video, computer and telecommunication analysis. Forensic Science International, 167(2-3):136-145, April 2007.
[28] J. Gu, Y. Hitomi, T. Mitsunaga, and S. Nayar. Coded rolling shutter photography: Flexible space-time sampling. In IEEE International Conference on Computational Photography (ICCP), 2010.
[29] A. Hajj-Ahmad, R. Garg, and M. Wu. Spectrum combining for ENF signal estimation. IEEE Signal Processing Letters, 20(9), 2013.
[30] A. Hajj-Ahmad, R. Garg, and M. Wu. Instantaneous frequency estimation and localization for ENF signals. In APSIPA Annual Summit and Conference, Dec. 2012.
[31] A. Hajj-Ahmad, R. Garg, and M. Wu. ENF based location classification of sensor recordings. In IEEE Int. Workshop on Info. Forensics and Security (WIFS), Nov. 2013.
[32] M. Huijbregtse and Z. Geradts. Using the ENF criterion for determining the time of recording of short digital audio recordings. In International Workshop on Computational Forensics (IWCF), Aug. 2009.
[33] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, 1998.
[34] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, 2008.
[35] C.-K. Liang, L.-W. Chang, and H. H. Chen. Analysis and compensation of rolling shutter effect. IEEE Transactions on Image Processing, 17(8):1323-1330, 2008.
[36] C. Liu. Beyond pixels: Exploring new representations and applications for motion analysis. Doctoral Thesis, Massachusetts Institute of Technology, May 2009.
[37] Y. Liu, J. Chai, B. Greene, R. Conners, and Y. Liu. A study of the accuracy and precision of quadratic frequency interpolation for ENF estimation. In Proceedings of the AES 46th International Conference, June 2012.
[38] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 2004.
[39] W. Lu, A. L. Varna, and M. Wu. Forensic hash for multimedia information. In Proc. of SPIE Media Forensics and Security, 2010.
[40] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27:1615-1630, October 2005.
[41] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In International Conference on Computer Vision and Applications, 2009.
[42] N. Naikal, A. Y. Yang, and S. S. Sastry. Informative feature selection for object recognition via sparse PCA. In ICCV, Nov. 2011.
[43] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2006.
[44] O. Ojowu, J. Karlsson, J. Li, and Y. Liu. ENF extraction from digital recordings using adaptive techniques and frequency tracking. IEEE Transactions on Information Forensics and Security, 7(4):1330-1338, August 2012.
[45] A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing. Prentice Hall, 2009.
[46] F. Padua, R. Carceroni, G. Santos, and K. Kutulakos. Linear sequence-to-sequence alignment. IEEE Trans. Pattern Anal. Mach. Intell., 32(2), Feb. 2010.
[47] M. Pawig, G. Enzner, and P. Vary. Adaptive sampling rate correction for acoustic echo control in Voice-over-IP. IEEE Transactions on Signal Processing, 58(1), May 2013.
[48] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[49] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
[50] D. Rodriguez, J. Apolinario, and L. Biscainho. Audio authenticity: Detecting ENF discontinuity with high precision phase analysis. IEEE Transactions on Information Forensics and Security, 5(3):534-543, September 2010.
[51] R. W. Sanders. Digital authenticity using the electric network frequency. In 33rd AES International Conference on Audio Forensics, Theory and Practice, June 2008.
[52] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. IJCV, 37(2):151-172, 2000.
[53] E. Shechtman, Y. Caspi, and M. Irani. Space-time super-resolution. IEEE Trans. Pattern Anal. Mach. Intell., Apr. 2005.
[54] P. Shrestha, H. Weda, M. Barbieri, and D. Sekulovski. Synchronization of multiple video recordings based on still camera flashes. In ACM Multimedia, 2006.
[55] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In International Conference on Computer Vision, 2003.
[56] G. P. Stein. Tracking from multiple view points: Self-calibration of space and time. In Int'l Conf. Comput. Vision & Pattern Recognition, 1999.
[57] H. Su, W.-H. Chuang, W. Lu, and M. Wu. Evaluating the quality of individual SIFT features. In ICIP, 2012.
[58] H. Su, R. Garg, A. Hajj-Ahmad, and M. Wu. ENF analysis on recaptured audio recordings. In IEEE Int. Conf. Acoustics, Speech, & Signal Process. (ICASSP), May 2013.
[59] H. Su, A. Hajj-Ahmad, R. Garg, and M. Wu. Exploring rolling shutter for ENF extraction from video. In ICIP, 2014.
[60] H. Su, A. Hajj-Ahmad, M. Wu, and D. Oard. Exploring the use of ENF for multimedia synchronization. In IEEE Int'l Conf. Acoustics, Speech, and Signal Processing (ICASSP), 2014.
[61] H. Su, C.-W. Wong, A. Hajj-Ahmad, R. Garg, and M. Wu. ENF signal induced by power grid: A new modality for video synchronization. In ACM MM Workshop on Immersive Media Experiences, 2014.
[62] P. Turcot and D. Lowe. Better matching with fewer features: The selection of useful features in large database recognition problems. In ICCV Workshop on Emergent Issues in Large Amounts of Visual Data, 2009.
[63] T. Tuytelaars and L. Van Gool. Synchronizing video sequences. In Int'l Conf. Comput. Vision & Pattern Recognition, 2004.
[64] P. P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice Hall, 1992.
[65] A. Vedaldi and B. Fulkerson. VLFeat - an open and portable library of computer vision algorithms. In ACM Multimedia, 2010.