ABSTRACT

Title of dissertation: CONTENT RECOGNITION AND CONTEXT MODELING FOR DOCUMENT ANALYSIS AND RETRIEVAL

Guangyu Zhu, Doctor of Philosophy, 2009

Dissertation directed by: Professor Rama Chellappa, Department of Electrical and Computer Engineering

The nature and scope of available documents are changing significantly in many areas of document analysis and retrieval as complex, heterogeneous collections become accessible to virtually everyone via the web. The increasing level of diversity presents a great challenge for document image content categorization, indexing, and retrieval. Meanwhile, the processing of documents with unconstrained layouts and complex formatting often requires effective leveraging of broad contextual knowledge.

In this dissertation, we first present a novel approach for document image content categorization, using a lexicon of shape features. Each lexical word corresponds to a scale and rotation invariant local shape feature that is generic enough to be detected repeatably and is segmentation free. A concise, structurally indexed shape lexicon is learned by clustering and partitioning feature types through graph cuts. Our idea finds successful application in several challenging tasks, including content recognition of diverse web images and language identification on documents composed of mixed machine printed text and handwriting.

Second, we address two fundamental problems in signature-based document image retrieval. Facing continually increasing volumes of documents, detecting and recognizing unique, evidentiary visual entities (e.g., signatures and logos) provides a practical and reliable supplement to the OCR recognition of printed text. We propose a novel multi-scale framework to detect and segment signatures jointly from document images, based on their structural saliency under a signature production model. We formulate the problem of signature retrieval in the unconstrained setting of geometry-invariant deformable shape matching and demonstrate state-of-the-art performance in signature matching and verification.

Third, we present a model-based approach for extracting relevant named entities from unstructured documents. In a wide range of applications that require structured information from diverse, unstructured document images, processing OCR text does not give satisfactory results due to the absence of linguistic context. Our approach enables learning of inference rules collectively, based on contextual information from both page layout and text features.

Finally, we demonstrate the importance of mining general web user behavior data for improving document ranking and other aspects of the web search experience. The context of web user activities reveals their preferences and intents, and we emphasize the analysis of individual user sessions for creating aggregate models. We introduce a novel algorithm for estimating web page and web site importance, and discuss its theoretical foundation based on an intentional surfer model. We demonstrate that our approach significantly improves large-scale document retrieval performance.

CONTENT RECOGNITION AND CONTEXT MODELING FOR DOCUMENT ANALYSIS AND RETRIEVAL

by Guangyu Zhu

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2009

Advisory Committee:
Professor Rama Chellappa, Chair/Advisor
Dr. David S. Doermann, Co-Advisor
Professor Carol Y. Espy-Wilson
Professor Larry S. Davis
Professor David W. Jacobs
© Copyright by Guangyu Zhu 2009

Dedication

I dedicate this dissertation to my parents and my wife.

Acknowledgments

I owe my gratitude to all the people who have made this thesis possible and because of whom my graduate experience has been one that I will cherish forever.

First and foremost, I would like to thank my advisors, Dr. David Doermann and Professor Rama Chellappa, for teaching me so much and pushing me to excellence in academic and professional development. It has been a pleasure to work with and learn from such extraordinary researchers.

I would also like to thank my committee members, Professors Carol Y. Espy-Wilson, Larry Davis, and David Jacobs, for serving on the committee and reviewing the dissertation. I was extremely lucky to have such a superstar committee and I will never forget the experience. I am especially grateful to Professor David Jacobs, who gave me valuable advice on my research and education.

My LAMP colleagues have enriched my graduate life in many ways, and I received helpful feedback from many of them, including Stefan Jaeger, Yefeng Zheng, Yi Li, Xiaodong Yu, Daniel DeMenthon, Zhe Lin, Daniel Ramsbrock, and Mudit Agrawal. I really enjoyed my collaborations and discussions with these brilliant guys.

I want to give thanks to my co-workers from Yahoo! Labs and IBM Almaden Research Center, who introduced me to challenging research problems and gave me many opportunities early in my career.

I owe my deepest thanks to my parents, who have always supported and motivated me. Finally, I thank my wife Yuan Yuan, who gives meaning to everything I do.

Table of Contents

List of Tables
List of Figures
1 Introduction
  1.1 Document Content Category Recognition
  1.2 Content-based Document Image Analysis and Retrieval
  1.3 Processing Unstructured Document Images
  1.4 Improving Document Search Ranking
  1.5 Outline of the Dissertation
2 Image Content Category Recognition
  2.1 Introduction
  2.2 Related Work
    2.2.1 Image Content Categorization
    2.2.2 Language Identification
    2.2.3 Contour-based Learning
  2.3 Recognizing Image Content Using Shape Lexicon
    2.3.1 Extraction of Contour Feature
    2.3.2 Measure of Dissimilarity
    2.3.3 Learning the Shape Lexicon
      2.3.3.1 Clustering Lexical Words
      2.3.3.2 Organizing Features in the Lexicon
    2.3.4 Constructing the Image Descriptor
  2.4 Experimental Results
    2.4.1 Image Content Category Recognition
      2.4.1.1 Dataset
      2.4.1.2 Overview of the Experiments
      2.4.1.3 Results and Discussions
    2.4.2 Language Identification
      2.4.2.1 Dataset
      2.4.2.2 Overview of the Experiments
      2.4.2.3 Results and Discussions
  2.5 Summary
3 Signature Detection and Matching
  3.1 Introduction
    3.1.1 Signature Detection and Segmentation
    3.1.2 Signature Matching for Document Image Retrieval
  3.2 Related Work
    3.2.1 Structural Saliency and Contour Grouping
    3.2.2 Shape Matching
    3.2.3 Document Image Retrieval
  3.3 Multi-scale Structural Saliency
    3.3.1 Theoretical Framework
    3.3.2 Signature Detection and Segmentation
  3.4 Measure of Saliency for Signatures
  3.5 Matching and Retrieval
    3.5.1 Measures of Shape Dissimilarity
    3.5.2 Shape Distance
    3.5.3 Evaluation Methodology
  3.6 Experiments
    3.6.1 Datasets
    3.6.2 Multi-scale Signature Detection
    3.6.3 Discussion on Signature Detection
    3.6.4 Signature Matching for Document Image Retrieval
      3.6.4.1 Shape Representation
      3.6.4.2 Shape Matching Algorithms
      3.6.4.3 Measures of Shape Dissimilarity
      3.6.4.4 Multiple Instances as Query
    3.6.5 Off-line Signature Verification
  3.7 Summary
4 Learning Document Context
  4.1 Introduction
  4.2 Motivations and Background
  4.3 System Overview
    4.3.1 Electronic Submission
    4.3.2 Named Entity Extraction
    4.3.3 Automated Auditing
    4.3.4 Document Archival
  4.4 Extraction of Relevant Named Entities
    4.4.1 Task
    4.4.2 Our Approach
      4.4.2.1 Page Segmentation
      4.4.2.2 Learning From Document Context
    4.4.3 Feature Selection
      4.4.3.1 Practical Constraints
      4.4.3.2 Feature Set
      4.4.3.3 Region Labels
    4.4.4 Relevant NE Extraction with CRFs
  4.5 Results and Discussion
    4.5.1 Datasets
    4.5.2 Evaluation and Discussion
  4.6 Related Work
  4.7 Summary
5 Mining Session Context
  5.1 Introduction
  5.2 Related Work
  5.3 Mining Web Sessions
    5.3.1 Session Identification
    5.3.2 Session Characteristics
    5.3.3 Session Clustering
    5.3.4 Session Interpretation
  5.4 Using Browsing Information for Web Search
    5.4.1 ClickRank
    5.4.2 Theoretical Analysis
    5.4.3 Application to Web Search Ranking
    5.4.4 Relation to Graph-based Models
  5.5 Experiments
    5.5.1 Site Ranking
    5.5.2 Page Ranking
      5.5.2.1 Evaluation Methodology
      5.5.2.2 Data Preparation
      5.5.2.3 Results and Discussion
    5.5.3 Mining Dynamic Quicklinks
  5.6 Summary
6 Summary of Contributions
  6.1 Possible Extensions
A Proof of Theorems
Bibliography

List of Tables

2.1 Confusion table of our approach for language identification
3.1 Summary of the English and Arabic evaluation datasets
3.2 Retrieval performance using different measures of shape dissimilarity
3.3 Retrieval performance using multiple signature instances
3.4 Summary of signature verification results on the Sabourin database
4.1 Summary of the two real-world receipt image collections
4.2 Named entity extraction performance on the two databases
5.1 Characteristics of general web sessions
5.2 Unsupervised clustering of session histograms
5.3 Top-ranked sites computed by different algorithms
5.4 Page ranking summary
5.5 Statistically significant improvements from ClickRank over a state-of-the-art baseline

List of Figures

1.1 Key challenges facing document analysis and retrieval systems
2.1 Diverse images returned by Google Image Search
2.2 Examples from the Maryland multilingual database and the IAM handwriting DB3.0 database
2.3 Shape differences are captured locally by neighboring contour features
2.4 The 25 most frequent lexical words learned for different languages
2.5 Confusion tables for image content category recognition
2.6 Confusion tables for language identification
2.7 Comparison of language identification performances
2.8 Examples of error cases in language identification
2.9 Recognition rates across languages as the size of training data varies
3.1 The prominence of a signature is perceived across scales
3.2 Shape contexts and local neighborhood graphs constructed from detected and segmented signatures
3.3 Geometric constraint posed by two points on a signature
3.4 Different cycloids are generated when the speeds of writing along two perpendicular directions vary sinusoidally
3.5 Conic sections inferred by a pair of edge points
3.6 Anisotropic scaling and registration quality effectively capture shape differences
3.7 A signature query example
3.8 Examples from the Tobacco-800 and the Maryland Arabic databases
3.9 Signature detection ROC curves for the Tobacco-800 database and the Maryland Arabic database
3.10 Examples of detected signatures from the Tobacco-800 database
3.11 Examples of detected signatures from the Maryland Arabic database
3.12 Examples of false alarms and missed signatures
3.13 Skeleton and contour representations computed from signatures
3.14 Document image retrieval performances using different shape representations
3.15 Document image retrieval using single signature instance as query
3.16 Signature verification results across all signature classes on the Sabourin database
4.1 System architecture of our automated expense reimbursement system
4.2 Architecture of EntityFinder
4.3 Page segmentation and word segmentation results of a receipt image
4.4 Page segmentation results by different algorithms
5.1 Distribution of session lengths and session durations
5.2 Visualization of session histograms in 3D
5.3 Distribution of discretized ClickRank scores
5.4 Dynamic quicklinks discovered using ClickRank

Chapter 1
Introduction

The extent of document analysis has broadened significantly over recent years. The explosive growth of the web has enabled access to much larger and more diverse collections of documents. While the primary purpose of a document remains unaltered, namely to facilitate the transfer of information [1], it can be presented in a much larger variety of contexts and formats to readers. The increasing level of diversity and complexity presents numerous challenges for the analysis and retrieval of documents, both in hard copy and electronic form.
In many areas of document image analysis, assumptions based on structured, narrow-domain documents are no longer valid in the presence of complex, heterogeneous content. Interpreting the content type, language, and genre associated with a document image becomes an essential task. A document image processing workflow, as shown in Figure 1.1, would require image content category recognition, indexing, and search capabilities that can be generalized to semi-structured or unstructured, broad-domain image collections. The goal is no longer limited to the conversion of text to electronic form, but now extends to semantic analysis of broader content.

Figure 1.1: Key challenges facing document image analysis and retrieval systems that process diverse document collections.

For the retrieval of electronic documents, ranking continues to be of utmost importance. The quality of the documents presented to web users has a significant impact on their daily lives, yet people have extremely limited spans of attention and have difficulty dealing with the massive quantity of pages on the web. Experiments on users' interaction with the lists of ranked results returned by web search engines have shown that a user seldom looks for documents beyond the first page of results [2].

This dissertation presents work in four main areas related to the analysis and retrieval of document images and semi-structured electronic documents. In the following sub-sections, we motivate these problems and present an overview of our approaches.

1.1 Document Content Category Recognition

Many computer vision and image analysis problems begin by obtaining a high-level content type interpretation for an image. As shown in Figure 1.1, information about the content category largely determines how we process the image in ensuing tasks. Having recognized text content and the language, a document analysis system can process and index images containing predominantly machine printed text.

We focus on two image content category recognition problems central to heterogeneous document image collections. The first problem lies in the categorization of a general image into different content classes. The second problem arises from language identification for documents composed of mixed machine printed text and handwriting, which has been an open research problem. For systems that process diverse multilingual document images, the performance of language identification is crucial for a broad range of tasks, from determining the correct optical character recognition (OCR) engine for text extraction to document indexing, translation, and search. In character recognition, for instance, almost all existing work requires knowledge of the script and/or language of the processed document [3].

In this dissertation, we propose a novel approach for document image content categorization using image descriptors constructed from a lexicon of shape features. We encode local text structures using scale and rotation invariant lexical words, each representing a segmentation-free shape feature that is generic enough to be detected repeatably. We learn a concise, structurally indexed shape lexicon by clustering and partitioning similar feature types through graph cuts. Our approach is extensible and does not require skew correction, scale normalization, or segmentation.
We demonstrate our approach on two challenging document image content recognition problems: 1) the classification of 4,500 web images from Google Image Search into three content categories: pure image, image with text, and document image; and 2) language identification among eight languages (Arabic, Chinese, English, Hindi, Japanese, Korean, Russian, and Thai) on a database of 1,512 complex document images composed of mixed machine printed text and handwriting.

1.2 Content-based Document Image Analysis and Retrieval

Visual content, such as signatures [4], logos [5], and stamps [6], presents convincing evidence of document source and provides an important form of indexing for document image processing and retrieval. Detecting, segmenting, and matching these free-form objects against cluttered backgrounds pose unique challenges. Signature detection, for example, is an open document analysis problem [7].

In this dissertation, we study two fundamental problems in signature-based document image retrieval. First, we propose a novel multi-scale approach to jointly detecting and segmenting signatures from document images. Rather than focusing on local features that typically have large variations, our approach captures structural saliency using a signature production model and computes the dynamic curvature of 2-D contour fragments over multiple scales. This detection framework is general and computationally tractable. Second, we treat the problem of signature retrieval in the unconstrained setting of translation, scale, and rotation invariant deformable shape matching. We propose two novel measures of shape dissimilarity based on anisotropic scaling and registration residual error, and we present a supervised learning framework for combining complementary shape information from different dissimilarity metrics using linear discriminant analysis (LDA). We quantitatively study state-of-the-art shape representations, shape matching algorithms, measures of dissimilarity, and the use of multiple instances as a query in document image retrieval. We further compare our matching approach to the state of the art in the task of off-line signature verification.

1.3 Processing Unstructured Document Images

Hard copy documents have been the most common form of information-conveying vehicle over the centuries, and people have developed an exceptional capability to extract information from diverse paper sources. Once hard copy documents are captured electronically as scanned images, however, the automated processing of documents with unconstrained layouts and diverse formatting becomes more difficult. In many applications that require such capability, applying traditional language modeling techniques to the stream of OCR text does not produce satisfactory results due to the absence of linguistic context. Extracting structured information, such as named entities, from unstructured document images remains an unsolved research challenge.

In this dissertation, we present a model-based approach for extracting relevant named entities from unstructured document images by combining rich page layout features in the image space with OCR text. We demonstrate our named entity extraction approach in an expense reimbursement system and evaluate its performance on large collections of degraded, real-world receipt images.

1.4 Improving Document Search Ranking

User browsing information, particularly non-search-related activity, reveals important contextual information on the preferences and intents of web users.
A document retrieval system may improve both search ranking performance and user experience by effectively leveraging contextual knowledge gained from general web user behavior data.

In this dissertation, we demonstrate the importance of mining general web user behavior data for improving document ranking and other aspects of the web search experience, with an emphasis on analyzing individual user sessions to create aggregate models. In this context, we introduce ClickRank, an efficient, scalable algorithm for estimating web page and web site importance from general web user behavior data. We lay out the theoretical foundation of ClickRank based on an intentional surfer model and discuss its properties. We quantitatively evaluate its effectiveness for the problem of web search ranking, showing how it contributes significantly to retrieval performance as a novel web search feature. We demonstrate that the results produced by ClickRank for web search ranking are highly competitive with those produced by other approaches, yet are achieved with better scalability and substantially lower computational cost. Finally, we discuss novel applications of ClickRank in providing an enriched web search experience, highlighting the usefulness of our approach for non-ranking tasks.

1.5 Outline of the Dissertation

The remainder of this dissertation is organized as follows. Chapter 2 describes our approach to document image content category recognition using a lexicon composed of a wide variety of local shape features. We demonstrate its application in two challenging tasks: the categorization of diverse web images, and language identification on document images involving a blend of machine printed text and handwriting. Chapter 3 presents our approaches to signature detection, segmentation, and matching for document image retrieval. Chapter 4 describes our approach to extracting relevant named entities from unstructured document images. We present its successful application in an automated expense reimbursement system. Chapter 5 studies the problem of using contextual information from general online user behavior data to improve web search ranking. The main ideas and contributions of the dissertation are summarized in Chapter 6.

Chapter 2
Image Content Category Recognition

2.1 Introduction

Image content categorization has become a pressing problem in computer vision as we face a phenomenal increase in the diversity of visual content. Content category recognition aims to reduce the semantic gap for ensuing tasks by providing a usage-oriented content description that can be utilized in individual applications. For vision systems involving high-volume, complex, and heterogeneous image data, effective high-level content interpretation is essential prior to object detection or object category recognition at a finer level.

In this chapter, we focus on the most pervasive content within documents: text. Once the text content and its language are recognized, images containing text can be processed by an optical character recognition (OCR) system and indexed. Toward this end, however, many unsolved challenges to image content category recognition still exist.

First, we consider the recognition of an image's primary content as one of three content categories: pure image (e.g., natural images and human photos), image with text (see examples in Figure 2.1), or document image. This sort of automated content categorization has broad impact in image search and content-based image indexing and retrieval.
Figure 2.1: Examples of images returned by Google Image Search using the keyword "CD cover".

Second, we study the problem of recognizing the primary language of a document image in an unconstrained setting. This presents a fundamental research challenge for systems that need to process diverse multilingual document images automatically, such as Google Book Search [8] or an automated global expense reimbursement application [9]. Almost all work to date on OCR requires that the script and/or language of the processed document be known [3]. The performance of language identification is crucial for the success of a wide range of tasks, from determining the correct OCR engine for text extraction to document indexing, translation, and search [10].

Progress in the field of language identification has focused almost exclusively on machine printed text. Document collections, as shown in Figure 2.2, often contain a diverse and complex mixture of machine printed and unconstrained handwritten content, and vary tremendously in font and style. Language identification on document images involving diverse content types, including unconstrained handwriting, is still an open research area [7] and, to the best of our knowledge, no reasonable solutions have been presented in the literature.

Figure 2.2: Examples from the Maryland multilingual database [11] and the IAM handwriting DB3.0 database [12]. The languages in the top row are Arabic, Chinese, English, and Hindi. In the lower row are Japanese, Korean, Russian, and Thai.

The problem of language identification for handwriting also highlights several common challenges facing category recognition of diverse content. First, handwriting exhibits much larger variability compared to machine printed text. Handwriting variations due to style, cultural, and personalized differences are typical [13], and they significantly increase the diversity of shapes encountered. Text lines in handwriting are curvilinear, and the gaps between neighboring words and lines are far from uniform. No well-defined baselines exist for handwritten text lines, even by linear or piecewise-linear approximation [14]. Second, automatic processing of off-line document image content should be robust in the presence of unconstrained document layouts, formatting, and image degradations.

Our approach takes the view that the intricate differences between text types can be effectively captured using segmentation-free shape features and structurally indexed shape descriptors. Low-level local shape features serve well for this purpose, because they can be detected robustly in practice, without the detection or segmentation of higher-level entities such as text lines or words. As shown in Figure 2.3, visual differences between handwriting samples across languages are captured well by the different configurations of neighboring line segments, which provide a rich description of local text structures.

We propose a novel approach for document image content recognition, using image descriptors built from a lexicon of generic low-level shape features that are translation, scale, and rotation invariant. To construct a structural index over a large number of diverse features, we dynamically partition the space of shape primitives by clustering similar feature types. We formulate feature partitioning as a graph cuts problem with the objective of obtaining a concise and globally balanced lexicon index by sampling the training data.
Each cluster in the lexicon is represented by an exemplary lexical word, making the association of feature types efficient. We obtain competitive document image content categorization performance using a multi-class SVM classifier.

This chapter is structured as follows. Section 2.2 reviews related work. In Section 2.3, we describe our algorithm for learning the shape lexicon and present a document image content recognition approach using the shape lexicon. We discuss experimental results in Section 2.4, and conclude in Section 2.5.

2.2 Related Work

We first present a comprehensive overview of existing approaches to whole-image categorization and script/language identification, respectively. We then highlight work related to our approach on contour-based learning.

2.2.1 Image Content Categorization

Little literature has directly addressed the problem of content recognition for heterogeneous image repositories. However, several approaches based on different motivations have demonstrated good performance in tasks that involve diverse objects. Oliva and Torralba [15] developed a holistic image representation for scene recognition called the spatial envelope, which characterizes the dominant spatial structure of a scene using a set of discriminative energy spectrum templates. Another fairly intuitive approach treats blocks of text as texture [16]. One widely used rotation invariant feature for texture analysis is the local binary patterns (LBP) proposed by Ojala et al. [17]. LBP captures the spatial structure of local image texture in circular neighborhoods across angular space and resolution, and it has demonstrated state-of-the-art results in a wide range of whole-image categorization problems involving diverse data [18, 19].

2.2.2 Language Identification

Prior literature on script and language identification has largely focused on the domain of machine printed documents. These works can be broadly classified into three categories: statistical analysis of text lines [20, 21, 22, 23, 24], texture analysis [16, 25], and template matching [26].

Statistical analysis using discriminating features extracted from text lines, including the distribution of concavities [20, 22], the horizontal projection profile [21, 23], and vertical cuts of connected components [24], has proven effective on homogeneous collections of machine printed documents. These approaches, however, do have major limitations for handwriting. First, they assume uniformity among printed text, and require precise baseline alignment and word segmentation. Freestyle handwritten text lines are curvilinear and, in general, have no well-defined baselines, even by linear or piecewise-linear approximation [14]. Second, it is difficult to extend these methods to a new language, because they employ a combination of hand-picked and trainable features and a variety of decision rules. In fact, most of these approaches require effective script identification to discriminate between a selected subset of languages, and they use different feature sets for script and language identification, respectively.

Script identification using rotation invariant texture features, including multi-channel Gabor filters [16] and wavelet log co-occurrence features [25], has been demonstrated on small blocks of printed text with similar characteristics. However, no results were reported on full-page documents that involve variations in layout and font. Script identification for printed words was explored in [27, 28] using texture features between a small number of scripts.
The template matching approach computes the most likely script by probabilistic voting on matched templates, where each template pattern is of fixed size and rescaled from a cluster of connected components. The script and language identification system developed by Hochberg et al. [26] at Los Alamos National Laboratory, based on template matching, can process 13 machine printed scripts without explicit assumptions of distinguishing characteristics for a selected subset of languages. Template matching is intuitive and delivers state-of-the-art performance when the content is constrained (i.e., printed text in similar fonts). However, templates are not sufficiently flexible to generalize across large variations in fonts or handwriting styles typical of diverse datasets [26]. From a practical view, the system needs to learn the discriminability of each template through labor-intensive training, which requires tremendous amounts of supervision and further limits applicability.

There exists minimal literature on language identification for handwriting. To the best of our knowledge, the experiments of Hochberg et al. [13] on script and language identification of handwritten document images offer the only reference on this topic in the literature. They used linear discriminant analysis based on five simple features of a connected component, including relative centroid location and aspect ratio. The approach proved sensitive to large variations across writers and diverse document content [13]. In their experiments, irregularities in the documents, including machine printed text, illustrations, markings, and handwriting in a different orientation from the main body, were removed manually from the evaluation dataset by image editing.

2.2.3 Contour-based Learning

Learning contour features is an important aspect of many computer vision problems. Ferrari et al. [29] proposed scale-invariant adjacent segments (kAS) features extracted from the contour segment network of image tiles, and used them in a sliding window scheme for object detection. By explicitly encoding both the geometric and spatial arrangement among the segments, the kAS descriptor demonstrates state-of-the-art performance in shape-based object detection, and it outperforms descriptors based on interest points and histograms of gradient orientations [30]. However, the kAS descriptor is not rotation invariant, because segments are rigidly ordered from left to right. This limits the repeatability of high-order kAS, and the best performance in [29] is reported when using 2AS.

In the handwriting recognition literature, one interesting work, also motivated by the idea of learning a feature codebook, is the study on writer identification by Schomaker et al. [31]. Writer identification assumes that the language is known beforehand and aims to distinguish between writers based on specific characteristics of handwriting. Schomaker et al. used closed-contour features of ink blobs directly, without any shape descriptor. The difference between contour features is computed using the Hamming distance. The low-dimensional codebook representation presented in [31] is based on the Kohonen self-organizing map (SOM) [32]. Their approach demands good segmentation and is not scale or rotation invariant. To account for size variations, for instance, SOMs need to be computed at multiple scales, which requires large training data and is computationally expensive.
2.3 Recognizing Image Content Using Shape Lexicon

Recognition of diverse visual content needs to account for large variations, because content appears in many forms and contexts. The scope of the problem intuitively favors low-level shape primitives that can be detected repeatably. Rather than focusing on the selection of class-specific features, our approach aims to distinguish intricate differences between content types collectively, using the statistics of a large variety of generic, geometrically invariant feature types (lexical words) that are structurally indexed. Our emphasis on the generic nature of lexical words provides a different perspective on recognition, which has traditionally focused on finding sophisticated features or visual selection models that may limit generalization performance.

We explore the kAS contour feature recently introduced by Ferrari et al. [29], which consists of a chain of k roughly straight, connected contour segments. Specifically, we focus on the case of triple contour segments, which strikes a balance between lower-order contour features that are not sufficiently discriminative and higher-order ones that are less likely to be detected robustly.

2.3.1 Extraction of Contour Feature

We perform computation locally, which means our approach detects features in a highly efficient manner. First, we compute edges using the Canny edge detector [33], which consistently demonstrates good performance on text content and gives precise localization and a unique response. Second, we group contour segments by connected components and fit them locally into line segments. Then, within each connected component, we extract every triplet of connected line segments that starts from the current segment. Figure 2.3 visualizes the quality of the contour features detected by our approach using random colors.

Figure 2.3: Shape differences are captured locally by a large variety of neighboring contour features. (a)-(d) Examples of handwriting from four different languages. (e)-(h) Detected contour features by our approach, each shown in a random color.

Our feature detection scheme requires only linear time and space in the number of contour fragments n, and is highly parallelizable. It proves much more efficient and stable than [29], which requires constructing a contour segment network and running a depth-first search from each segment, leading to O(n log n) time on average and O(n^2) in the worst case.

We encode object contours in a translation, scale, and rotation invariant fashion by computing orientations and lengths with reference to the first detected line segment. A contour feature C can be compactly represented by an ordered set of the lengths and orientations of c_i for i \in \{1, 2, 3\}, where c_i denotes line segment i in C. This is distinct from the motivation of the kAS descriptor, which attempts to enumerate the spatial arrangements of contours within local regions. Furthermore, the kAS descriptor does not take rotation invariance into account.

2.3.2 Measure of Dissimilarity

The overall dissimilarity between two contour features can be quantified by the weighted sum of their distances in lengths and orientations. We use the following generalized measure of dissimilarity between two contour features C_a and C_b:

    d(C_a, C_b, \lambda) = \lambda_{\text{length}}^{T} d_{\text{length}} + \lambda_{\text{orient}}^{T} d_{\text{orient}},    (2.1)

where the vectors d_{\text{length}} and d_{\text{orient}} are composed of the distances between contour lengths and orientations, respectively, and \lambda_{\text{length}} and \lambda_{\text{orient}} are the corresponding weight vectors, providing sensitivity control over the tolerance of line fitting. One natural measure of dissimilarity between the lengths of two contour segments is their log ratio. We compute the orientation difference between two segments by normalizing the absolute value of their angle difference to \pi. In our experiments, we use a larger weighting factor for orientation to de-emphasize the difference in lengths, because lengths may be less accurate due to line fitting.

2.3.3 Learning the Shape Lexicon

We extract a large number of lexical words by sampling from training images, and construct an indexed shape lexicon by clustering and partitioning the lexical words. A lexicon provides a concise structural organization for associating large varieties of low-level features, and it is efficient because it enables comparison against far fewer feature types.

2.3.3.1 Clustering Lexical Words

Prior to clustering, we compute the distance between each pair of lexical words and construct a weighted undirected graph G = (V, E), in which each node represents a word. The weight on an edge connecting two nodes C_a and C_b is defined as a function of their distance,

    w(C_a, C_b) = \exp\left( -\frac{d(C_a, C_b)^2}{\sigma_d^2} \right),    (2.2)

where we set the parameter \sigma_d to 20 percent of the maximum distance among all pairs of nodes.

We pose feature clustering as a spectral graph partitioning problem, in which we seek to group the set of vertices V into disjoint sets \{V_1, V_2, \ldots, V_K\} such that, by the measure defined in (2.1), the dissimilarity among vertices within a set is low and that between different sets is high. More concretely, let W be the N \times N symmetric weight matrix over all the vertices, where N = |V|. We define the degree matrix D as the N \times N diagonal matrix whose i-th diagonal element satisfies d(i) = \sum_j w(i, j). We use an N \times K matrix X to represent a graph partition, i.e., X = [X_1, X_2, \ldots, X_K], where each element of X is either 0 or 1. We can show that the feature clustering formulation that seeks globally balanced graph partitions is equivalent to the normalized cuts criterion [34], and it can be written as

    maximize    \epsilon(X) = \frac{1}{K} \sum_{l=1}^{K} \frac{X_l^T W X_l}{X_l^T D X_l},    (2.3)

    subject to    X \in \{0, 1\}^{N \times K}, \quad \sum_j X(i, j) = 1.    (2.4)

Minimizing normalized cuts exactly is NP-complete. We use a fast algorithm [35] for finding a discrete near-global optimum, which is robust to random initialization and converges faster than other clustering methods.

2.3.3.2 Organizing Features in the Lexicon

For each cluster, we select the feature instance closest to the cluster's center as the exemplary lexical word. This ensures that an exemplary word has the smallest sum of squared distances to the other features within the cluster. In addition, each exemplary word is associated with a cluster radius, defined as the maximum distance from the cluster center to all the other features within the cluster. The constructed shape lexicon L is composed of all the exemplary lexical words.
Figure 2.4 shows the 25 most frequent exemplary lexical words for Arabic, Chinese, English, and Hindi, learned from 10 documents of each language. Distinguishing features between languages, including the cursive style in Arabic, the 45- and 90-degree transitions in Chinese, and the various configurations due to long horizontal lines in Hindi, are learned automatically. Each row in Figure 2.4(e)-(h) lists examples of lexical words in the same cluster, ordered by ascending distance to the center of their associated clusters. By clustering, translated, scaled, and rotated versions of feature types are grouped together.

Figure 2.4: The 25 most frequent exemplary lexical words in (a) Arabic, (b) Chinese, (c) English, and (d) Hindi document images capture the distinct features of the different languages. (e)-(h) show lexical words in the same clusters as the top 5 exemplary lexical words for each language, ordered by ascending distances to the centers of their clusters. Scaled and rotated versions of feature types are clustered together.

Since each lexical word represents a generic local shape feature, a majority of lexical words should intuitively appear in images across content categories, even though their frequencies of occurrence deviate significantly. In our experiments, we find that 95.1% and 92.6% of the lexical words in natural images also appear in images with text and in document images, respectively. In addition, 86.3% of lexical word instances appear in document images across all eight languages.

2.3.4 Constructing the Image Descriptor

We construct a shape descriptor for each image, which provides statistics of the frequency at which each feature type occurs. For each detected lexical word W from the test image, we compute the nearest exemplary word C_k in the shape lexicon. We increment the descriptor entry corresponding to C_k only if

    d(W, C_k) < r_k,    (2.5)

where r_k is the cluster radius associated with the exemplary lexical word C_k. This quantization step ensures that unseen features deviating considerably from the training features are not used for image description. In our experiments, we found that less than 2% of the contour features could not be found in the shape lexicon learned from the training data.

2.4 Experimental Results

2.4.1 Image Content Category Recognition

2.4.1.1 Dataset

To evaluate our approach for image content category recognition, we construct a 4,500-image dataset by crawling web images from the Google Image search engine using a wide variety of keywords. Figure 2.1 shows some examples of images with text returned using the keyword "CD cover". All the images are automatically downloaded by a script. Duplicate and junk images are manually inspected and removed to reduce the proportion of unrelated images.

2.4.1.2 Overview of the Experiments

We compare our approach with the spatial envelope [15], local binary patterns (LBP) [17], and the state-of-the-art kAS descriptor [29], which are well-known approaches based on different views of whole-image characterization. The spatial envelope uses a holistic image representation without attempting to exploit localized information such as shape, whereas LBP is based on rotation-invariant texture analysis. Since 2AS gives the best performance among different kAS [29], we use it as the benchmark for kAS. We train a multi-class SVM classifier [36] on only 100 randomly selected images from each category and use it to test the rest of the images in the collection. For easy comparison, we set the dimension of the image descriptor to 90 for both kAS and our approach in the following experiments.
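A minimal training harness in the spirit of this protocol, with scikit-learn's SVC standing in for the multi-class SVM of [36]; the random descriptors, kernel, and regularization constant are placeholders, not the dissertation's settings.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import normalize

# Stand-ins for real data: X holds 90-dimensional lexicon descriptors
# (one row per image); y holds content-category labels 0/1/2 (P/T/D).
rng = np.random.default_rng(0)
X = rng.random((4500, 90))
y = rng.integers(0, 3, size=4500)

# 100 randomly selected training images per category; test on the rest.
train = np.hstack([rng.choice(np.flatnonzero(y == c), 100, replace=False)
                   for c in range(3)])
test = np.setdiff1d(np.arange(len(y)), train)

clf = SVC(kernel="rbf", C=10.0, gamma="scale")   # one-vs-one multi-class SVM
clf.fit(normalize(X[train]), y[train])
accuracy = (clf.predict(normalize(X[test])) == y[test]).mean()
print(f"held-out accuracy: {accuracy:.3f}")
```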
2.4.1.3 Results and Discussions

The confusion tables for the spatial envelope, LBP, kAS, and our approach appear in Figure 2.5. The spatial envelope performs well in recognizing pure images, but is not very effective for text content. Texture-based LBP produces balanced results for all three content types. Our approach obtains the best performance for recognizing each content class, with a respectable mean diagonal of 89.6%. The only notable confusion occurs when distinguishing between image with text and pure image.

Figure 2.5: Confusion tables for image content category recognition using (a) spatial envelope [15] (mean diagonal = 78.9%), (b) LBP [17] (77.6%), (c) kAS [29] (83.3%), and (d) our approach (89.6%). (P: pure image, T: image with text, D: document image)

2.4.2 Language Identification

2.4.2.1 Dataset

We use 1,512 document images in eight languages (Arabic, Chinese, English, Hindi, Japanese, Korean, Russian, and Thai) from the University of Maryland multilingual database [11] and the IAM handwriting database DB3.0 [12] (see Figure 2.2) for the evaluation of language identification. Both databases are large public real-world collections, containing the source identity of each image in the ground truth. This enables us to construct a diverse dataset that closely mirrors the true complexities of heterogeneous document image repositories in practice.

2.4.2.2 Overview of the Experiments

We compare our approach with the state-of-the-art language identification system [26], which is based on template matching. We also include LBP and kAS in this experiment, since they have demonstrated reasonable performance on diverse text content. For effective comparison, LBP, kAS, and our approach each used multi-class SVM classifiers trained on the same pool of randomly selected handwritten document images from each language class in the following experiments. We further evaluated the generalization performance of our approach as the size of the training data varied.

2.4.2.3 Results and Discussions

The confusion tables for LBP, template matching, kAS, and our approach appear in Figure 2.6. Our approach demonstrates excellent results on all eight languages, with a mean diagonal of 95.6% and a standard deviation of 4.5%. Table 2.1 lists all entries in the confusion table of our approach for the eight languages. kAS, with a mean diagonal of 88.2%, is also effective. Neither kAS nor our approach has difficulty generalizing across large variations, such as font types or handwriting styles, as evident from the relatively small standard deviations of the diagonal entries in their confusion tables, shown in Figure 2.7.

Figure 2.6: Confusion tables for language identification using (a) LBP [17] (mean diagonal = 55.1%), (b) template matching [26] (68.1%), (c) kAS [29] (88.2%), and (d) our approach (95.6%). (A: Arabic, C: Chinese, E: English, H: Hindi, J: Japanese, K: Korean, R: Russian, T: Thai, U: Unknown)

Table 2.1: Confusion table of our approach for the eight languages (rows: true language; columns: predicted language; entries in %).

        A      C      E      H      J      K      R      T
  A  99.7    0.3    0      0      0      0      0      0
  C   1.4   85.0    4.0    1.0    6.7    1.0    0.7    0.2
  E   1.6    0     95.9    0.2    0      1.1    0.6    0.6
  H   0.2    0.2    0     98.8    0.8    0      0      0
  J   0      1.3    1.0    0.2   96.2    1.3    0      0
  K   0      0.8    0.1    1.9    0.5   96.0    0.5    0.1
  R   0.5    0      2.0    0      0      0     97.1    0.4
  T   0      0.3    1.6    0.9    0.6    0.3    0     96.3

The performance of template matching varies significantly across languages, with a 68.1% mean diagonal and a 20.5% standard deviation along the diagonal.
One big confusion of template matching occurs between Japanese and Chinese, since a document in Japanese may contain a varying number of kanji (Chinese characters). Templates are not flexible enough to identify discriminative partial features, and the bias of the voting decision toward the dominant candidate causes less frequently matched templates to be ignored. Another performance-lowering source of error comes from undetermined cases (see the unknown column in Figure 2.6(b)), where probabilistic voting cannot decide between languages with roughly equal votes.

Texture-based LBP could not effectively recognize differences between languages on a diverse dataset, because distinctive layouts and unconstrained handwriting exhibit irregularities that are difficult to capture using texture; its mean diagonal is only 55.1%.

Figure 2.7 quantitatively compares the overall language identification performances of the different approaches by the means and standard deviations of the diagonal entries in their confusion tables. Shape-based approaches, including our approach and kAS, show higher recognition rates and smaller performance deviations compared to those based on models of textures and templates.

Figure 2.7: Comparison of language identification performances.

Analysis of the language identification errors made by our approach provides further insights. Among all eight languages, recognition of Chinese handwriting proved the most challenging task. As shown in Figure 2.7, the Chinese language identification performances of all four approaches were significantly lower compared to the other languages. We show several typical error cases in Figure 2.8, where Chinese handwritten documents are recognized incorrectly. These error cases include document images with severe degradations, second-generation documents captured from low-resolution source copies (e.g., a photocopy of a fax document), and documents containing a significant amount of other content.

Figure 2.8: Examples of error cases in Chinese handwritten documents.

Good generalization helps determine the success of document analysis systems that need to process diverse, unconstrained data. Figure 2.9 shows the recognition rates of our approach as the size of the training set varies. We observe highly competitive language identification performance on this challenging dataset even when using a small amount of training data per language class. This demonstrates the effectiveness of generic low-level shape features when mid- or high-level vision representations may not be generalized or flexible enough for the task.

Figure 2.9: Recognition rates of our approach for different languages as the size of training data varies. Our approach achieves excellent performance even using a small number of document images per language for training.

Our results on language identification are encouraging, as the training requires considerably less supervision. Our approach needs only the class label of each training image, and does not require prior skew correction, scale normalization, or segmentation.
2.5 Summary

In this chapter, we proposed a novel approach for document image content categorization using a lexicon composed of a wide variety of local shape features. Each lexical word represents a characteristic structure that is generic enough to be detected repeatably and is segmentation free. The lexicon provides a principled approach to structurally indexing and associating a vast number of feature types, and it is learned from training data with little supervision. Our approach is extensible and does not require constructing explicit content models. In two challenging real-world document image content recognition problems involving large-scale, highly variable image collections, our approach demonstrated excellent results and outperformed other state-of-the-art techniques. Our future work will be directed toward refining and evaluating the approach by further incorporating spatial co-occurrences of lexical words using a secondary lexicon.

Chapter 3
Signature Detection and Matching

3.1 Introduction

Searching for relevant documents in large, complex document image repositories presents a central problem in document image analysis and retrieval. One approach recognizes text in the image using an OCR system, and applies text indexing and querying. This solution is primarily restricted to machine printed text content, because state-of-the-art handwriting recognition is error prone and limited to applications with a small vocabulary, such as postal address recognition and bank check reading [7]. In broader, unconstrained domains, including searching historic manuscripts [37] and processing languages where character recognition is difficult [38], image retrieval has demonstrated better results.

As unique and evidentiary entities in a wide range of business and forensic applications, signatures provide an important form of indexing that enables effective exploration of large heterogeneous document image collections. Given an abundance of documents, searching for a specific signature is a highly effective way of retrieving the documents authorized or authored by an individual [39]. In this context, handwriting recognition is suboptimal, because of its prohibitively low recognition rates and the fact that the character sequence of a signature is often unrelated to the personal identity it represents. More importantly, as a number of studies [40, 41] have demonstrated, signatures are highly stylistic in nature and are best described by their graphic style.

Figure 3.1: The prominence of a signature is perceived across scales, whereas background text merges into blobs at coarser scales. (a) A document image example. (b) and (c) are edges computed at the scales of 1.0 and 2.0, respectively. (d) Detected signature region without segmentation. (e) and (f) are the signatures detected and segmented by our approach at the two scales.

Detecting, segmenting, and matching deformable objects such as signatures are important and challenging problems in computer vision. In the following subsections, we address detection, segmentation, and matching in the context of signature-based document image retrieval and present an overview of our approach.

3.1.1 Signature Detection and Segmentation

Detecting free-form objects is challenging in a number of aspects. First, detection needs to be robust in the presence of large intra-class variations and cluttered backgrounds. Second, the contours of these complex objects are fragmented 2-D signals in real off-line images, for which reliably recovering the order of points in handwriting is generally difficult [42]. Moreover, recognition and retrieval techniques require well-segmented objects to minimize the effects of outliers during matching. Detecting signatures in documents offers one such example, further involving diverse layout structures and a complex mixture of machine printed text, handwriting, diagrams, and other elements. Signature detection and segmentation remains an open research problem [43].
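Figure 3.1 shows edges computed at the scales 1.0 and 2.0. A minimal sketch of such a multi-scale edge computation, assuming OpenCV, appears below; the structural saliency measure that operates on these edge maps is developed later in this chapter and is not reproduced here.

```python
import cv2

def multiscale_edges(gray, sigmas=(1.0, 2.0), lo=50, hi=150):
    """Edge maps at several Gaussian scales: fine text fragments and merges
    into blobs at coarser scales, while prominent, smoothly curved strokes
    such as signatures persist (gray: uint8 image)."""
    maps = []
    for s in sigmas:
        blurred = cv2.GaussianBlur(gray, (0, 0), sigmaX=s)  # kernel from sigma
        maps.append(cv2.Canny(blurred, lo, hi))
    return maps
```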
Prior research on off-line signatures has focused almost exclusively on signature verification and identification for authentication [44, 45, 40, 41, 46, 47, 48] in the context of biometrics. In signature verification, the problem lies in deciding whether a sample signature is genuine by comparing it with stored reference signatures. Signature identification is essentially a writer identification problem, whose objective is to find the signer of a sample given a database of signature exemplars from different signers. Most studies published to date assume the availability of good detection and segmentation [7].

Solving signature detection and segmentation is pivotal to signature-based document indexing and retrieval. Equally important, a solution will benefit off-line signature verification and identification in a range of domains. In addition, the ability to detect signatures robustly and extract them intact from volumes of documents is highly desirable for many business and government applications.

We propose a new multi-scale approach to detect and extract signatures jointly from document images. Rather than focusing on local features, we treat signatures as symbols that exhibit characteristic structural saliency in our multi-scale detection framework. We employ a novel saliency measure based on a signature production model, which captures the dynamic curvature in a signature without recovering its temporal information. The signature detection approach can also be applied to on-line handwritten notes, where the trajectories of the pen are readily available. Our detection framework is general and has the advantage of not embedding explicit assumptions about local features of signatures, such as granulometric size distributions [44] or stroke-level features [40]. Therefore, it performs robustly against many forms of variation in shape-based object detection problems, and is generally applicable despite language differences.

3.1.2 Signature Matching for Document Image Retrieval

Detection and segmentation produce a set of 2-D contour fragments for each detected signature. Given a few available query signature instances and a large database of detected signatures, signature matching needs to find the most similar signature samples in the database. By constructing the list of best matching signatures, we effectively retrieve the set of documents authorized or authored by the same person.

We treat a signature as a non-rigid shape, and represent it by a discrete set of 2-D points sampled from the internal or external contours of the object. A 2-D point feature offers several competitive advantages over other compact geometrical entities used in shape representation. It relaxes the strong assumption that the topology and the temporal order need to be preserved under structural variations or a cluttered background. For instance, two strokes in one signature sample may touch each other, but remain well separated in another. These structural changes, as well as outliers and noise, are generally challenging for shock-graph based approaches [49, 50], which explicitly use the connections between points.

Figure 3.2: Shape contexts [55] and local neighborhood graphs [56] constructed from detected and segmented signatures. First column: examples of signatures. Second column: shape context descriptors constructed at a point, which provide a large-scale shape description. Third column: local neighborhood graphs, which capture local structures for non-rigid shape matching.
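As a concrete illustration of this point-set representation, here is a minimal sketch of sampling a signature shape from edge pixels, assuming OpenCV is available; the Canny thresholds are placeholders, not tuned values from this work:

    import cv2
    import numpy as np

    def sample_shape_points(gray_image, n_points=200, rng=None):
        """Sample a 2-D point-set representation of a segmented signature:
        detect edges and draw up to n_points from the edge pixels. No stroke
        order or connectivity is recovered, matching the representation
        described above."""
        rng = np.random.default_rng() if rng is None else rng
        edges = cv2.Canny(gray_image, 50, 150)      # placeholder thresholds
        ys, xs = np.nonzero(edges)
        idx = rng.choice(len(xs), size=min(n_points, len(xs)), replace=False)
        return np.stack([xs[idx], ys[idx]], axis=1)  # (n, 2) points in R^2

Sampling 200 points per shape mirrors the setup used in the retrieval experiments later in this chapter.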
In earlier studies [51, 52, 53, 54], a shape is represented as an ordered sequence of points. This 1-D representation is well suited to signatures collected on-line using a PDA or Tablet PC. For unconstrained off-line handwriting, however, it is generally difficult to recover the temporal information from real images due to large structural variations [42]. Represented by a 2-D point distribution, a shape is more robust under structural variations while still carrying general shape information. As shown in Figure 3.2, the signature's shape is captured by a finite set $P = \{P_1, \ldots, P_n\}$, $P_i \in \mathbb{R}^2$, of $n$ points, which are sampled from edge pixels computed by an edge detector.

We use two state-of-the-art non-rigid shape matching algorithms for signature matching. The first method is based on the representation of shape contexts, introduced by Belongie et al. [55]. In this approach, a spatial histogram called the shape context is computed for each point, describing the distribution of the relative positions of all remaining points. Prior to matching, the correspondences between points are first solved through weighted bipartite graph matching. Our second method uses the non-rigid shape matching algorithm proposed by Zheng and Doermann [56], which formulates shape matching as an optimization problem that preserves local neighborhood structure. This approach has an intuitive graph matching interpretation, in which each point represents a vertex and two vertices are connected in the graph if they are neighbors. Finding the optimal match between shapes is then equivalent to maximizing the number of matched edges between their corresponding graphs under a one-to-one matching constraint. Computationally, [56] employs an iterative framework for estimating the correspondences and the transformation. In each iteration, graph matching is initialized using the shape context distance, and is subsequently updated through relaxation labeling [57] for more globally consistent results.

Treating an input pattern as a generic 2-D point distribution broadens the space of dissimilarity metrics and enables effective shape discrimination using the correspondences and the underlying transformations [58]. We propose two novel shape dissimilarity metrics that quantitatively measure anisotropic scaling and registration residual error, and we present a supervised training framework for effectively combining complementary shape information from different dissimilarity measures by linear discriminant analysis (LDA).

The contributions of this dissertation are twofold:

1. We propose a novel multi-scale approach to detect and segment signatures jointly from documents with diverse layouts and complex backgrounds. We treat a signature generally as an unknown grouping of 2-D contour fragments, and solve for the two unknowns, the identification of the most salient structure and its grouping, using a signature production model that captures the dynamic curvature of 2-D contour fragments without recovering the temporal information.

2. We treat signature recognition and retrieval in the unconstrained setting of non-rigid shape matching, and propose two new measures of shape dissimilarity that correspond to i) the amount of anisotropic scaling, and ii) the registration residual error.
We demonstrate robust signature-based document image retrieval, and comprehensively evaluate the effects of different shape representations, shape matching algorithms, measures of dissimilarity, and the use of multiple signature instances on overall retrieval accuracy.

We structure this chapter as follows: The next section reviews related work. In Section 3.3, we present our multi-scale structural saliency approach to signature detection and segmentation in detail. Section 3.4 introduces a structural saliency measure for capturing the dynamic curvature under a signature production model. In Section 3.5, we describe our signature matching approach and present methods to combine different measures of shape dissimilarity and multiple query instances for effective retrieval with limited supervised training. We discuss experimental results on real English and Arabic document datasets in Section 3.6, and conclude in Section 3.7.

3.2 Related work

3.2.1 Structural Saliency and Contour Grouping

Detecting salient structures in images is an important task in many segmentation and recognition problems. The saliency network proposed by Sha'ashua and Ullman [59] offers a well-known approach to extracting salient curves by iteratively solving the two aspects of this problem jointly, i.e., identifying salient structures and grouping contours. The saliency function defined in [59] monotonically increases with the length and decreases with the total squared curvature of the evaluated curve. To reduce the exponential search space, the saliency network assumes that an optimal solution has a recurrent structure, a property they called extensibility, so that searching the exponential space of possible curves by dynamic programming takes polynomial time. However, greedily reducing the solution space through such a recurrent formulation involves hard decisions at each step, and, theoretically, a single mistake can lead to convergence to a wrong solution.

Alter and Basri [60] presented a comprehensive analysis of the saliency network and derived its $O(k^2 N^2)$ time complexity, where $N$ is the total number of pixels and $k$ is the number of neighboring elements considered in forming a locally connected network. They demonstrated that the saliency network has a few serious problems due to the extensibility assumption, and that its convergence rate varies significantly depending on the object structure. These limitations are difficult to overcome without fundamentally changing the computation.

A body of literature exists on contour grouping and contour-based learning in computer vision. Here we highlight the work most closely related to ours, which includes Parent and Zucker's [61] work using relaxation methods and Guy and Medioni's [62] work using voting patterns. Elder and Zucker [63] developed a method for finding closed contours using chains of tangent vectors. Williams and Jacobs [64] and Williams and Thornber [65] discussed contour closure using stochastic completion fields. Shotton et al. [66] demonstrated a learning-based method for object detection using local contour-based features extracted from a single image scale. Such an approach, however, is inherently restricted to closed contour shapes, which have an explicit ordering of points.

3.2.2 Shape Matching

Rigid shape matching has been approached in a number of ways, with the intent of obtaining a discriminative global shape description.
Approaches using silhouette features include Fourier descriptors [67, 68], geometric hashing [69], dynamic programming [70, 52], and skeletons derived using Blum's medial axis transform [71]. Although silhouettes are simple and efficient to compare, they have limits as shape descriptors because they ignore internal contours and are difficult to extract from real images [72]. Other approaches, such as chamfer matching [73] and the Hausdorff distance [74], treat the shape as a discrete set of points extracted using an edge detector. Unlike approaches that compute correspondences, these methods do not enforce a pairing of points between the two sets. While they work well under a selected subset of rigid transformations, they cannot be extended to handle non-rigid transformations. General surveys of classic rigid shape matching techniques are given in [75, 76].

Matching non-rigid shapes needs to consider unknown transformations that are both linear (e.g., translation, rotation, scaling, and shear) and non-linear. One comprehensive framework for shape matching in this setting is to iteratively estimate the correspondence and the transformation. The iterative closest point (ICP) algorithm introduced by Besl and McKay [77] and its extensions [78, 79, 80] provide a simple heuristic approach. Assuming the two shapes are roughly aligned, the nearest neighbor in the other shape is assigned as the estimated correspondence at each step. This estimate of the correspondence is then used to refine the estimated affine or piecewise-affine mapping, and vice versa. While ICP is fast and guaranteed to converge to a local minimum, its performance degrades quickly in the presence of large non-rigid deformations or a significant number of outliers [81]. Chui and Rangarajan [82] developed an iterative optimization algorithm to estimate the correspondences and the transformation jointly, using thin plate splines as a generic parameterization of non-rigid transformations. Joint estimation of correspondences and transformation leads to a highly non-convex optimization, which is solved using softassign and deterministic annealing.

3.2.3 Document Image Retrieval

Rath et al. [83] demonstrated retrieval of handwritten historical manuscripts by using images of handwritten words to query unlabeled document images. The system compares word images based on Fourier descriptors computed from a collection of shape features, including the projection profile and the contours extracted from the segmented word. Srihari et al. [84] developed a signature matching and retrieval approach based on the correlation of gradient, structural, and concavity features extracted from fixed-size image patches. It achieved 76.3% precision on a collection of 447 manually cropped signature images from the Tobacco-800 database [85, 86]. These approaches are not translation, scale, or rotation invariant.

3.3 Multi-scale structural saliency

3.3.1 Theoretical framework

We consider the identification of a salient structure and the grouping of its structural components separately. There are clear motivations for decoupling these two unknowns, as opposed to solving them jointly. First, we have a broader set of functions to use as measures of saliency. For object detection, saliency measures that fit high-level knowledge of the object give more globally consistent results than jointly optimizing a fixed set of low-level vision constraints.
Having identified the salient structures, the problem of grouping becomes simpler, relying on constraints such as proximity and good continuation. Second, we can effectively formulate structural saliency across image scales, as opposed to single-scale approaches such as the saliency network. Multi-scale detection is important for non-rigid objects like signatures, whose contours can be severely broken due to poor ink condition and image degradations. Last, multi-scale saliency computation generates detection hypotheses at the natural scale at which the grouping among a set of connected components becomes structurally obvious. Together, these provide a unified framework for object detection and segmentation that produces a meaningful representation for object recognition and retrieval.

From a computational point of view, computing saliency over connected components makes the computation tractable and highly parallelizable. Our serial implementation runs in $O(N)$ time, where $N$ is the total number of edge points. This is significantly faster than the saliency network approach, which has $O(k^2 N^2)$ time complexity.

We also explore document context to improve detection. The idea is to estimate the length and inter-line spacing of text lines, and to use this information to locate the bottom or end of a document, where signatures are more likely to appear. In our evaluation, we show results of signature detection on whole documents, as well as with the exploration of document context.

3.3.2 Signature detection and segmentation

In this section, we describe a structural saliency approach to signature detection that searches a range of scales $S = \{\sigma_1, \sigma_2, \ldots, \sigma_n\}$. We select the initial scale $\sigma_1$ based on the resolution of the input image. We define the multi-scale structural saliency of a curve $\Gamma$ as

$$\Phi(\Gamma) = \max_{\sigma_i \in S} f(\phi_{\sigma_i}(\Gamma^{\sigma_i}), \sigma_i), \qquad (3.1)$$

where $f: \mathbb{R}^2 \to \mathbb{R}$ is a function that normalizes the saliency over its scale, and $\Gamma^{\sigma_i}$ is the corresponding curve computed at the scale $\sigma_i$. Using multiple scales for detection relaxes the requirement that the curve $\Gamma$ be well connected at any specific scale.

Detection at a particular scale $\sigma_i$ proceeds in three steps. First, we convolve the image with a Gaussian kernel $G_{\sigma_i}$, re-sample it using the Lanczos filter [87] at the factor $d_{\sigma_i}$, and compute its edges using the Canny edge detector [33]. This effectively computes a coarse representation of the original image in which small gaps in the curve are bridged by smoothing followed by re-sampling (see Figure 3.1). Second, we form connected components on the edge image at the scale $\sigma_i$, and compute the saliency of each component using the measure presented in Section 3.4, which characterizes its dynamic curvature. We define the saliency of a connected component $\Gamma_k^{\sigma_i}$ as the sum of the saliency values computed from all pairs of its edge points. Third, we identify the most salient curves and use a grouping strategy based on proximity and curvilinear constraints to obtain the remaining signature parts within their neighborhood. Our joint detection and segmentation approach thus identifies the most cursive structure and groups it with neighboring elements in two steps. By locating the most salient signature component, we effectively focus our attention on its neighborhood. Subsequently, a complete signature is segmented from the background by grouping salient neighboring structures.

Figure 3.3: Among the large number of geometric curves passing through the two random end points E1 and E2 on a signature, few are realistic (solid curves).
An example of an unrealistic curve is shown as the dotted line.

3.4 Measure of saliency for signatures

In this section, we consider the problem of recognizing the structural saliency of a 2-D off-line signature segment using a signature production model. As shown in Figure 3.3, among the infinite number of geometric curves that pass through two given end points E1 and E2 on a signature, few are realistic. The author's wrist is highly constrained in its degrees of freedom while signing the document. Furthermore, a signature segment rarely fits a high-order polynomial locally, as shown by the dotted curve in Figure 3.3.

We propose a signature production model that incorporates two degrees of freedom in Cartesian coordinates. We assume that the pen moves in a cycloidal fashion with reference to a sequence of shifting baselines when signing. The local baseline changes as the author's wrist adjusts its position with respect to the document. Within a short curve segment, we assume that the baseline remains unchanged. In addition, the locus of the pen maintains a proportional distance between the local center point (focus) and the local baseline (directrix), which imposes an additional constraint that limits the group of second-order curves to ellipses. A signature fragment can thus be viewed equivalently as a concatenation of small elliptic segments. In a similar spirit, Saint-Marc et al. [88] used piecewise quadratic B-splines to approximate complex-shaped contours for symmetry detection.

Figure 3.4: (a) Curtate cycloid, (b) cycloid, and (c) prolate cycloid curves generated when the speeds of writing along the horizontal baseline and in the vertical direction both vary sinusoidally.

Our hypothesis of cycloidal writing is motivated by Hollerbach's oscillation theory of handwriting [89], which discovered that embedded oscillations coupled along the horizontal and vertical directions produce letter forms that closely resemble handwriting samples. In fact, when the signature baseline aligns with the x axis, the velocities in the x and y directions are

$$v_x = a \sin(\omega_x t + \phi_x) + c, \qquad v_y = b \sin(\omega_y t + \phi_y), \qquad (3.2)$$

where $a$ and $b$ are the horizontal and vertical velocity amplitudes, $\omega_x$ and $\omega_y$ are the horizontal and vertical frequencies, $\phi_x$ and $\phi_y$ are the horizontal and vertical phases, $t$ is the time variable, and the constant $c$ is the average speed of horizontal movement. Without loss of generality, consider the case of $a = b$, $\omega_x = \omega_y$, and $\phi_x - \phi_y = \pi/2$. One can show that for different relative values of $a$ and $c$, the resulting curves are the curtate cycloid ($a < c$), the cycloid ($a = c$), and the prolate cycloid ($a > c$), as shown in Figure 3.4.

We model piecewise segments of a signature by a family of second-order curves that satisfy the constraints imposed by signature production. In addition, we use the local gradient directions approximated at the two end points, which can be viewed as soft constraints on the curve segment imposed by the global structure of the signature instance. In Cartesian coordinates, the graph of a quadratic equation in two variables $x$ and $y$ is always a conic section. Figure 3.5 demonstrates that the directions of the gradients at the two edge points greatly limit the inference on a local curve segment to a family of conics, under the second-order signature production model.

We can formalize this intuition geometrically. For a pair of edge points $E_1$ at $(x_1, y_1)$ and $E_2$ at $(x_2, y_2)$, we obtain estimates of their local gradients $N_1 = (p_1, q_1)$ and $N_2 = (p_2, q_2)$ through edge detection.
For definiteness, we suppose the gradients at $E_1$ and $E_2$ both point into the angular section between the two tangent lines, as shown in Figure 3.5, i.e.,

$$p_1(x_2 - x_1) + q_1(y_2 - y_1) > 0 \quad \text{and} \qquad (3.3a)$$
$$p_2(x_2 - x_1) + q_2(y_2 - y_1) < 0. \qquad (3.3b)$$

The two tangent lines at $E_1$ and $E_2$ are normal to their local gradients and are given by

$$t_1(x, y) \equiv p_1(x - x_1) + q_1(y - y_1) = 0 \qquad (3.4)$$

and

$$t_2(x, y) \equiv p_2(x - x_2) + q_2(y - y_2) = 0. \qquad (3.5)$$

The straight line $l(x, y)$ that passes through $E_1$ and $E_2$ can be written as

$$l(x, y) \equiv \begin{vmatrix} x & y & 1 \\ x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \end{vmatrix} = 0. \qquad (3.6)$$

Figure 3.5: Conic sections inferred by a pair of edge points.

Note that $t_1(x, y)$, $t_2(x, y)$, and $l(x, y)$ are all first-order linear functions in $x$ and $y$. The family of second-order curves bounded within the angular section between $t_1(x, y)$ and $t_2(x, y)$ can be expressed as

$$C(x, y) \equiv l^2(x, y) - \lambda\, t_1(x, y)\, t_2(x, y) = 0. \qquad (3.7)$$

In the canonical form, this is

$$a x^2 + 2hxy + b y^2 + 2gx + 2fy + c = 0, \qquad (3.8)$$

where the parameters $a$, $b$, $c$, $f$, $g$, $h$ are first-order linear functions of $\lambda$, with coefficients determined by the parameter set $(x_1, y_1)$, $(x_2, y_2)$, $(p_1, q_1)$, $(p_2, q_2)$.

Given a parameter set $(x_1, y_1)$, $(x_2, y_2)$, $(p_1, q_1)$, $(p_2, q_2)$, which is equivalent to fixing the three straight lines $t_1(x, y)$, $t_2(x, y)$, and $l(x, y)$, we can theoretically analyze how the parameter $\lambda \in [0, +\infty)$ affects the curvature of the bounded quadratic curve defined by (3.7). When $\lambda = 0$, Equation (3.7) degenerates into the straight line $l(x, y)$, and the total squared curvature of the bounded segment is strictly zero. As $\lambda$ increases monotonically from 0 within a certain range ($0 < \lambda < \lambda_0$), the curve segment bounded by $E_1$ and $E_2$ exhibits increasingly more curvature. This happens because the second-order curves governed by (3.7) for $\lambda \in [0, \lambda_0)$ are ellipses with monotonically increasing eccentricities. As $\lambda \to \lambda_0$, the center of the ellipse recedes to infinity, so that the ellipse tends to a parabola at $\lambda = \lambda_0$. When $\lambda > \lambda_0$, the conic of (3.7) becomes a hyperbola. Eventually, as $\lambda \to +\infty$, the hyperbola degenerates into the two intersecting straight lines $t_1(x, y)$ and $t_2(x, y)$. We prove in Appendix A that $\lambda_0$ is given by

$$\lambda_0 = \frac{4\,[p_1(x_2 - x_1) + q_1(y_2 - y_1)]\,[p_2(x_1 - x_2) + q_2(y_1 - y_2)]}{(p_1 q_2 - p_2 q_1)^2}. \qquad (3.9)$$

Equation (3.9) provides a theoretical second-order upper bound on the dynamic curvature, given a parameter set $(x_1, y_1)$, $(x_2, y_2)$, $(p_1, q_1)$, $(p_2, q_2)$ that fits the signature production model. We use $\lambda_0$ as the saliency value $\phi_{\sigma_i}(E_i, E_j)$ for a pair of points at scale $\sigma_i$. When $a = |OE_1| = |OE_2|$, where $O$ denotes the intersection of the two tangent lines, it is straightforward to show that the right-hand side of (3.9) equals $4a^2$. This result allows us to normalize saliency over scale, whereas the scale interpretation of most published saliency measures surveyed in [65] is largely unclear. Our saliency measure is clearly translation and rotation invariant, as it uses only local gradient directions.

The saliency of a curve $\Gamma_k^{\sigma_i}$ at scale $\sigma_i$ is defined as the sum of the saliency values computed from all pairs of points on it:

$$\phi_{\sigma_i}(\Gamma_k^{\sigma_i}) = \sum_{E_i, E_j \in \Gamma_k^{\sigma_i}} \phi_{\sigma_i}(E_i, E_j). \qquad (3.10)$$

It measures the likelihood of an elliptic-segment fit given a set of 2-D points, and the computation does not require the temporal order among the points. The analysis so far applies only to the continuous case.
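The continuous-case saliency admits a direct transcription. Here is a minimal sketch of (3.9) and (3.10), assuming points and gradients are given as coordinate pairs; the epsilon guards anticipate the discretization conditions introduced next:

    import numpy as np

    def pair_saliency(e1, n1, e2, n2, eps=0.1):
        """Pairwise saliency phi(E_i, E_j): the curvature bound lambda_0 of
        Eq. (3.9) for edge points e1, e2 with local gradients n1, n2. Returns 0
        when conditions (3.3a)-(3.3b) fail or the two tangent lines are
        (nearly) parallel, which makes lambda_0 unbounded."""
        (x1, y1), (p1, q1) = e1, n1
        (x2, y2), (p2, q2) = e2, n2
        A = p1 * (x2 - x1) + q1 * (y2 - y1)   # must be positive, Eq. (3.3a)
        B = p2 * (x1 - x2) + q2 * (y1 - y2)   # Eq. (3.3b), rewritten as positive
        cross = p1 * q2 - p2 * q1             # zero when the tangents are parallel
        if A <= eps or B <= eps or abs(cross) <= eps:
            return 0.0
        return 4.0 * A * B / cross ** 2       # lambda_0, Eq. (3.9)

    def component_saliency(points, gradients):
        """Saliency of a connected component, Eq. (3.10): the sum over all
        pairs of points, O(L^2) in the component length L."""
        total = 0.0
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                total += pair_saliency(points[i], gradients[i],
                                       points[j], gradients[j])
        return total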
To account for the discretization effect, we impose two additional conditions. First, the absolute values of the two expressions on the left-hand side of Equations (3.3a)-(3.3b) must be strictly larger than $\varepsilon$. Second, the denominator term in (3.9) must be strictly larger than $\varepsilon$. In our experiments, we use $\varepsilon = 0.1$. For robustness, we weight each saliency contribution by the gradient magnitude of the weaker edge.

Separating saliency detection from grouping significantly reduces the level of complexity. Let the total number of edge points be $N$ and the average length of the connected components be $L_c$. The saliency computation for each component in Equation (3.10) requires $O(L_c^2)$ time on average. Therefore, the overall computation is of order $(N/L_c) \cdot L_c^2 = N L_c$. Since $L_c$ is effectively upper bounded by a prior estimate of the signature dimensions, and the range of searched scales $n$ is limited, both can be treated as constants. The complexity of the saliency computation is thus linear in $N$. Gaussian smoothing and connected component analysis both require $O(N)$ time. The total complexity of the signature detection algorithm is therefore $O(N)$.

3.5 Matching and Retrieval

3.5.1 Measures of Shape Dissimilarity

Before we introduce two new measures of dissimilarity for general shape matching and retrieval, we first discuss existing shape distance metrics. Each of these dissimilarity measures captures certain shape information for effective discrimination. In the next sub-section, we describe how to combine these individual measures effectively with limited supervised training, and we present our evaluation framework.

Several measures of shape dissimilarity have demonstrated success in object recognition and retrieval, including the thin-plate spline bending energy $D_{be}$ and the shape context distance $D_{sc}$. As a conventional tool for interpolating coordinate mappings from $\mathbb{R}^2$ to $\mathbb{R}^2$ based on point constraints, the thin-plate spline (TPS) is commonly used as a representation of non-rigid transformations [90]. The TPS interpolant $f(x, y)$ minimizes the bending energy

$$\iint_{\mathbb{R}^2} \left[ \left( \frac{\partial^2 f}{\partial x^2} \right)^2 + 2 \left( \frac{\partial^2 f}{\partial x \partial y} \right)^2 + \left( \frac{\partial^2 f}{\partial y^2} \right)^2 \right] dx\, dy \qquad (3.11)$$

over the class of functions that satisfy the given point constraints. Equation (3.11) imposes smoothness constraints that discourage non-rigidities that are too arbitrary. The bending energy $D_{be}$ [82] measures the amount of non-linear deformation needed to warp the shapes into alignment and has a physical interpretation. However, $D_{be}$ measures only the deformation beyond an affine transformation, and the functional in (3.11) is zero if the underlying transformation is purely affine.

The shape context distance $D_{sc}$ between a template shape $\mathcal{T}$ composed of $m$ points and a deformed shape $\mathcal{D}$ of $n$ points is defined in [55] as

$$D_{sc}(\mathcal{T}, \mathcal{D}) = \frac{1}{m} \sum_{t \in \mathcal{T}} \min_{d \in \mathcal{D}} C(T(t), d) + \frac{1}{n} \sum_{d \in \mathcal{D}} \min_{t \in \mathcal{T}} C(T(t), d), \qquad (3.12)$$

where $T(\cdot)$ denotes the estimated TPS transformation and $C(\cdot, \cdot)$ is the cost function for assigning correspondence between any two points. Given two points, $t$ in shape $\mathcal{T}$ and $d$ in shape $\mathcal{D}$, with associated shape contexts $h_t(k)$ and $h_d(k)$, for $k = 1, 2, \ldots, K$, $C(t, d)$ is defined by the $\chi^2$ statistic as

$$C(t, d) \equiv \frac{1}{2} \sum_{k=1}^{K} \frac{[h_t(k) - h_d(k)]^2}{h_t(k) + h_d(k)}. \qquad (3.13)$$

We introduce a new measure of dissimilarity $D_{as}$ that characterizes the amount of anisotropic scaling between shapes.
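Before detailing $D_{as}$, the existing measures in (3.12) and (3.13) admit a compact sketch. It assumes the shape context histograms are precomputed and the TPS warp already applied, and it omits the bipartite correspondence step that precedes distance computation in the full method:

    import numpy as np

    def chi2_cost(h_t, h_d):
        """Chi-square matching cost C(t, d) of Eq. (3.13) between two shape
        context histograms (1-D arrays of equal length K)."""
        den = h_t + h_d
        safe = np.where(den > 0, den, 1.0)   # empty bins contribute zero
        return 0.5 * float(np.sum((h_t - h_d) ** 2 / safe))

    def shape_context_distance(H_T, H_D):
        """Symmetric shape context distance of Eq. (3.12). H_T (m x K) holds
        histograms of the TPS-warped template points; H_D (n x K) holds those
        of the deformed shape."""
        cost = np.array([[chi2_cost(ht, hd) for hd in H_D] for ht in H_T])
        return float(cost.min(axis=1).mean() + cost.min(axis=0).mean())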
Figure 3.6: Anisotropic scaling and registration quality effectively capture shape differences. (a) Signature regions without segmentation; the first two signatures are from the same person, whereas the third is from someone else. (b) Signatures detected and segmented by our approach. Second row: matching results for the first two signatures using (c) shape contexts and (d) the local neighborhood graph, respectively. Last row: matching results for the first and third signatures using (e) shape contexts and (f) the local neighborhood graph, respectively. Corresponding points identified by shape matching are linked, and unmatched points are shown in green. The computed affine maps are shown in the figure legends.

Anisotropic scaling is a form of affine transformation that involves a change in the relative directional scaling. As shown in Figure 3.6, the stretching or squeezing of the scaling in the computed affine map effectively captures global mismatches in shape dimensions among registered points, even in the presence of large intra-class variation. We compute the anisotropic scaling between two shapes by estimating the ratio of the two scaling factors $S_x$ and $S_y$ in the $x$ and $y$ directions, respectively. A TPS transformation can be decomposed into a linear part corresponding to a global affine alignment, together with the superposition of independent, affine-free deformations (or principal warps) of progressively smaller scales [90]. We ignore the non-affine terms of the TPS interpolant when estimating $S_x$ and $S_y$. The 2-D affine transformation is represented by a $2 \times 2$ linear transformation matrix $A$ and a $2 \times 1$ translation vector $T$,

$$\begin{pmatrix} u \\ v \end{pmatrix} = A \begin{pmatrix} x \\ y \end{pmatrix} + T, \qquad (3.14)$$

where $S_x$ and $S_y$ can be computed by singular value decomposition of the matrix $A$. We define $D_{as}$ as

$$D_{as} = \log \frac{\max(S_x, S_y)}{\min(S_x, S_y)}. \qquad (3.15)$$

Note that $D_{as} = 0$ when only isotropic scaling is involved (i.e., $S_x = S_y$).

We propose another distance measure, $D_{re}$, based on the registration residual errors under the estimated non-rigid transformation. To minimize the effect of outliers, we compute the registration residual error over the subset of points that have been assigned correspondences during matching, and we ignore points matched to the dummy point nil. Let the function $M: \mathbb{Z}^+ \to \mathbb{Z}^+$ define the matching between two point sets of size $n$ representing the template shape $\mathcal{T}$ and the deformed shape $\mathcal{D}$. Suppose $t_i$ and $d_{M(i)}$ for $i = 1, 2, \ldots, n$ denote pairs of matched points in shape $\mathcal{T}$ and shape $\mathcal{D}$, respectively. We define $D_{re}$ as

$$D_{re} = \frac{\sum_{i: M(i) \neq \text{nil}} \| T(t_i) - d_{M(i)} \|}{\sum_{i: M(i) \neq \text{nil}} 1}, \qquad (3.16)$$

where $T(\cdot)$ is the estimated TPS transformation and $\| \cdot \|$ is the Euclidean norm.

3.5.2 Shape Distance

After matching, we compute the overall shape distance for retrieval as the weighted sum of the individual distances given by all the measures: the shape context distance, the TPS bending energy, the anisotropic scaling, and the registration residual error:

$$D = w_{sc} D_{sc} + w_{be} D_{be} + w_{as} D_{as} + w_{re} D_{re}. \qquad (3.17)$$

The weights in (3.17) are optimized by LDA, using only a small amount of training data.

The retrieval performance of a single query may depend largely on the instance used for the query [91]. In practice, it is often possible to obtain multiple signature samples from the same person, so we can use them as an equivalence class to achieve better retrieval performance.
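Both proposed measures translate directly into code. A minimal sketch, assuming the affine part $A$ of the fitted TPS transformation is given as a 2x2 matrix, the warped template points $T(t_i)$ are precomputed, and nil matches are encoded as -1:

    import numpy as np

    def d_as(A):
        """Anisotropic scaling distance D_as of Eq. (3.15). The singular
        values of the 2x2 affine matrix A (Eq. (3.14)) are the directional
        scaling factors S_x and S_y."""
        s = np.linalg.svd(A, compute_uv=False)
        return float(np.log(s.max() / s.min()))

    def d_re(warped_template, deformed, match):
        """Registration residual error D_re of Eq. (3.16), averaged over
        points with a real correspondence; match[i] is the index in
        `deformed` assigned to template point i, or -1 for the dummy nil."""
        matched = match >= 0
        diffs = warped_template[matched] - deformed[match[matched]]
        return float(np.linalg.norm(diffs, axis=1).mean())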
When multiple instances $q_1, q_2, \ldots, q_k$ from the same class $Q$ are used as queries, we combine their individual distances $D_1, D_2, \ldots, D_k$ into one shape distance as

$$D = \min(D_1, D_2, \ldots, D_k). \qquad (3.18)$$

3.5.3 Evaluation Methodology

We use two commonly cited measures, average precision and R-precision, to evaluate the performance of each ranked retrieval. First, we make precise the intuitions behind these evaluation metrics, which emphasize the retrieval ranking differently. Given a ranked list of documents returned in response to a query, average precision (AP) is defined as the average of the precisions at each relevant hit. It rewards retrieval systems that rank relevant documents higher and, at the same time, penalizes those that rank irrelevant ones higher. AP effectively combines precision, recall, and relevance ranking, and is considered a stable and discriminating measure of the quality of retrieval engines [91]. R-precision (RP) for a query $i$ is the precision at rank $R(i)$, where $R(i)$ is the number of documents relevant to query $i$. It de-emphasizes the exact ranking among the retrieved relevant documents and is more useful when there is a large number of relevant documents.

Figure 3.7: A signature query example. Among the total of nine relevant signatures, eight appear in the top nine of the returned ranked list, giving an average precision of 96.0% and an R-precision of 88.9%. The irrelevant signature is highlighted with a bounding box. Left: signature regions in the documents. Right: detected and segmented signatures used in retrieval.

Figure 3.7 shows a query example, in which eight of the nine relevant signatures are among the top nine and one relevant signature is ranked 12. For this query, AP = (1+1+1+1+1+1+1+8/9+9/12)/9 = 96.0%, and RP = 8/9 = 88.9%.

Figure 3.8: Examples from the Tobacco-800 database [85, 86] (left) and the Maryland Arabic database [11] (right).

3.6 Experiments

3.6.1 Datasets

To evaluate detection and matching in signature-based document image retrieval, we used two large collections of real-world documents: the Tobacco-800 database [86] and the University of Maryland Arabic database [11]. Tobacco-800 is a public subset of the complex document image processing (CDIP) collection [85] and has been used in the TREC 2006 and 2007 evaluations. It was constructed from 42 million pages of English documents (in 7 million multi-page TIFF images) released by tobacco companies, and was originally hosted at UCSF [92]. Tobacco-800 is a realistic dataset for document analysis and retrieval, as these documents were collected and scanned using a wide variety of equipment over time. The Maryland Arabic dataset consists of 166,071 Arabic handwritten business documents. Figure 3.8 shows some examples from the two datasets.

Table 3.1: Summary of the English and Arabic evaluation datasets.

                          Tobacco-800            Maryland Arabic
    Document Types        Printed/handwritten    Mostly handwritten
    Total Pages           1290                   169
    Resolution (DPI)      150-300                200
    Labeled Signatures    900                    149
    Number of Signers     66                     21

We tested our approach using all 66 and 21 signature classes in the Tobacco-800 and Maryland Arabic datasets, respectively, in which the number of signatures per person varies from 6 to 11. The overall performance across all queries is computed quantitatively as the mean average precision (MAP) and the mean R-precision (MRP), respectively.

3.6.2 Multi-scale Signature Detection

We organize our experiments on signature detection as follows.
First, we use the detection probability $P_D$ and the false-alarm probability $P_F$ as evaluation metrics. $P_D$ and $P_F$ represent the two degrees of freedom in a binary hypothesis test, and they do not involve a priori probabilities of the hypotheses. To factor in the quality of detection, we consider a signature correctly detected and complete if the detected region overlaps more than 75% of the labeled signature region. We declare a false alarm if the detected region does not overlap more than 25% of any labeled signature. Since the number of signatures varies significantly across documents in practice, we assume no prior knowledge of the distribution of signatures per document.

Figure 3.9: Signature detection ROC curves for the Tobacco-800 database (left) and the Maryland Arabic database (right).

Figure 3.9 shows the receiver operating characteristic (ROC) curves for the Tobacco-800 and Maryland Arabic datasets. A Fisher classifier using size, aspect ratio, and spatial density features serves as a baseline for comparison, with all other procedures remaining the same in the experiment. We use two scale levels in the multi-scale detection experiments. The parameters involved in obtaining the ROC curves, including the detection threshold on saliency and the estimates of signature dimensions, are tuned on 10 documents. We compute each operating point on an ROC curve as follows. We sort the list of detected candidates from the entire test set by their saliencies. To plot a new point, we move down the ranked list by one and consider the portion of the list from the top to the current rank, which is equivalent to gradually lowering the decision threshold. The ROC curves computed using this scheme, shown in Figure 3.9, are densely sampled and include every operating point.

Figure 3.10: Examples of detected signatures from the Tobacco-800 database, together with their saliency maps. The top three most salient parts are shown in red, green, and blue, respectively.

The multi-scale saliency approach obtains the best overall detection performance on both the English and Arabic datasets. Using document context, our multi-scale signature detector achieves 92.8% and 86.6% detection rates on the Tobacco-800 and Maryland Arabic datasets, respectively, at 0.3 false positives per image (FPPI). Exploring context is more effective on machine-printed documents, given the uniformity of the geometric relationships among text lines. The operating point of 0.1 FPPI offers an example: context alone gives an average increase of 13.6% in detection accuracy on Tobacco-800, compared to only 5.1% on the Maryland Arabic database. The improvements from using multiple scales are 5.8% and 4.0% on the Tobacco-800 and Maryland Arabic datasets, respectively. This demonstrates the advantage of the multi-scale approach on datasets that capture more diversity, such as the Tobacco-800 database.

Second, we test our saliency measure's ability to discriminate signatures from other handwriting. The handwritten Maryland Arabic dataset serves this purpose better, because local features become clearly less discriminative there, as is evident from the poor performance of the Fisher classifier.

Figure 3.11: Examples of detected signatures from the Maryland Arabic database, together with their saliency maps. The top three most salient parts are shown in red, green, and blue, respectively.

Figures 3.10 and 3.11 show samples of detected signatures from the Tobacco-800 and Maryland Arabic datasets, with their saliency maps.
We delineate the top three most salient parts in red, green, and blue, respectively. In our experiments, a cursive structure is normally more than an order of magnitude more salient than printed text of the same dimensions. However, we did find rare instances of printed text among the false positives, as shown in Figure 3.12(a), with saliencies comparable to those of signatures because of their highly cursive font styles. A limitation of our proposed method arises when the detected and segmented signature contains a few touching printed characters.

For better interpretation of the overall detection performance, we summarize the key evaluation results. On Tobacco-800, 848 of the 900 labeled signatures are detected by the multi-scale saliency approach using document context. Among the detected signatures, 83.3% are complete. The mean percentage area overlap with the groundtruth is 86.8%, with a standard deviation of 11.5%. As shown in Figures 3.10 and 3.11, using connected components for detection gives structurally and perceptually meaningful output quality. Furthermore, it does not necessarily limit the detection probability when used in a multi-scale framework. In fact, these figures approximate the word segmentation performance of a leading commercial OCR product [93] on machine-printed text in Tobacco-800 documents. The results on the Maryland Arabic dataset are also encouraging, as the collection consists mainly of unconstrained handwriting in complex layouts and backgrounds. In addition, we have conducted field tests using an ARDA-sponsored dataset composed of 32,706 document pages in 9,630 multi-page images. Among the top 1,000 detections, 880 are real signatures.

Figure 3.12: Examples of (a) false alarms and (b) missed signatures from the Tobacco-800 database.

3.6.3 Discussion on Signature Detection

On the saliency maps, an edge detector generates two parallel contour lines from a stroke, since both sides are local maxima in gradient magnitude. A ridge detector can generate more compact segmentation output, since it produces only one thin curve in response. However, a ridge detector [94] performs significantly worse in signature detection in our experiments. The Canny edge detector provides good localization that guarantees accurate estimation of the local gradient directions, and it performs more robustly under structural variations.

Examples of false positives from the Tobacco-800 database, which include handwriting, are shown in Figure 3.12(a). The classification between signature and handwriting is sometimes not well posed when studying shape alone. Highly cursive handwriting may not show any obvious visual differences from signatures, as shown in Figure 3.12(a). Geometric information cannot effectively resolve such intricacies in real documents, because this handwriting consists primarily of annotations created in an ad hoc manner. Semantic and content-layer information is required to resolve the ambiguity in such cases.

Figure 3.12(b) gives examples of false negatives. These missed signatures are severely broken, and a step-edge operator like Canny could not detect their contours, even at a coarse scale. As shown for most signatures, however, using multiple scales for detection partially overcomes the limitations of a connected-components-based approach by relaxing the requirement that the contour fragment be well connected at a particular scale. This improvement is more evident on the Tobacco-800 dataset, which contains a considerable number of highly degraded images at low resolutions.
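To make the multi-scale relaxation discussed above concrete, here is a minimal sketch of one scale level of the detection pipeline from Section 3.3.2, assuming OpenCV and an 8-bit grayscale page image; the Canny thresholds are placeholders rather than the tuned values used in our experiments:

    import cv2

    def components_at_scale(gray_image, sigma, downsample):
        """One level of the multi-scale pipeline: Gaussian smoothing followed
        by Lanczos re-sampling bridges small gaps in broken strokes, so
        contours fragmented at a fine scale can merge into one connected
        component at a coarser one."""
        blurred = cv2.GaussianBlur(gray_image, (0, 0), sigma)
        h, w = blurred.shape[:2]
        coarse = cv2.resize(blurred, (w // downsample, h // downsample),
                            interpolation=cv2.INTER_LANCZOS4)
        edges = cv2.Canny(coarse, 50, 150)          # placeholder thresholds
        n_labels, labels = cv2.connectedComponents(edges)
        return edges, labels, n_labels

Each connected component returned here would then be scored with the saliency measure of Section 3.4.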
3.6.4 Signature Matching for Document Image Retrieval

3.6.4.1 Shape Representation

We compare shape representations computed using different segmentation strategies in the context of document image retrieval. In particular, we consider skeletons and contours, which are widely used mid-level features in computer vision and can be extracted with relative robustness.

Figure 3.13: Skeletons and contours computed from signatures. The first column shows labeled signature regions in the groundtruth. The second column shows signature layers extracted from the labeled signature regions by the baseline approach [95]. The third and fourth columns show skeletons computed by Dyer and Rosenfeld [96] and by Zhang and Suen [97], respectively. The last column shows salient contours of signatures actually detected and segmented from documents by our approach.

For comparison, we developed a baseline signature segmentation approach that removes machine-printed text and noise from the labeled signature regions in the groundtruth using a Fisher classifier [95]. To improve classification, the baseline approach models the local contexts among printed text using a Markov random field (MRF). We implemented two classical thinning algorithms, one by Dyer and Rosenfeld [96] and the other by Zhang and Suen [97], which compute skeletons from the signature layer extracted by the baseline approach. Figure 3.13 shows the layer subtraction and skeleton extraction results of the baseline, compared to the salient contours of signatures detected and segmented from documents by our approach.

In this experiment, we sample 200 points along the extracted skeleton and salient contour representations of each signature. We use the faster shape context matching algorithm to solve for correspondences between points on the two shapes, and we compute all four shape distances $D_{sc}$, $D_{be}$, $D_{as}$, and $D_{re}$. To remove any bias in the retrieval experiments, each query signature is removed from the test set for that query.

Figure 3.14: Document image retrieval performance using different shape representations on the two datasets, measured by mean average precision (MAP) and mean R-precision (MRP).

Document image retrieval performance using different shape representations is shown in Figure 3.14. Salient contours computed by our detection and segmentation approach outperform the skeletons extracted directly from labeled signature regions on both datasets. As illustrated by the third and fourth columns in Figure 3.13, thinning algorithms are sensitive to structural variations among neighboring strokes and to noise. In contrast, salient contours provide a globally consistent representation by weighting the more structurally important shape features. This advantage in retrieval performance shows clearly on the Maryland Arabic dataset, in which signatures and background handwriting are closely spaced.

3.6.4.2 Shape Matching Algorithms

We developed signature matching approaches using two non-rigid shape matching algorithms, shape contexts and the local neighborhood graph, and evaluated their retrieval performance on salient contours. We use all four measures of dissimilarity
The weights of different shape distances are optimized by linear discriminant analysis, using randomly selected subset of sig- nature samples as training data. Figure 3.15 shows retrieval performances measured in MAP for both methods as the size of training set varies. A special case in Fig- ure 3.15 occurs when no training data is used. In this case, we simply normalize each shape distance by the standard deviation computed from all instances in that query, effectively weighting every shape distance equally. A significant increase in retrieval performance is observed with a fairly small amount of training data. Both shape matching methods are effective with no sig- nificant difference. In addition, the performances of both methods measured in MAP deviates less than 2.55% and 1.83%, respectively, when different training sets are randomly selected. These demonstrate the generalization performance of rep- resenting signatures by non-rigid shapes and counteracting large variations among unconstrained handwriting through geometrically invariant matching. 63 Table 3.2: Retrieval performance using different measure of shape dissimilarity on the Tobacco-800 database. Measure of Dissimilarity MAP MRP Dsc 66.9% 62.8% Das 61.3% 57.0% Dbe 59.8% 55.6% Dre 52.5% 48.3% Dsc +Dbe 78.7% 74.3% Dsc +Das +Dbe +Dre 90.5% 86.8% 3.6.4.3 Measures of Shape Dissimilarity Table 3.2 summarizes the retrieval performance using different measures of shape dissimilarity on the larger Tobacco-800 database. The results are based on the shape context matching algorithm, as it demonstrates smaller performance deviation in previous experiment. We randomly select 20% of signature instances as training data and use the rest for test. The most powerful single measure of dissimilarity for signature retrieval is the shape context distance (Dsc), followed by the affine transformation based measure (Das), the TPS bending energy (Dbe), and the registration residual error (Dre). By incorporating rich global shape information, shape contexts can discriminate even under large variations. Moreover, the experiment shows that measures based on transformations (affine for linear and TPS for non-linear transformation) are also effective. The two proposed measures of shape dissimilarity Dsc and Dbe improve the retrieval performance considerably, increasing MAP from 78.7% to 90.5%. This demonstrates that we can significantly improve the retrieval quality by combining 64 Table 3.3: Retrieval performance using multiple signature instances on the Tobacco- 800 database. Measure of Dissimilarity MAP MRP One 90.5% 86.8% Two 92.6% 88.2% Three 93.2% 89.5% effective complementary measures of shape dissimilarity through limited supervised training. 3.6.4.4 Multiple Instances as Query Table 3.3 summarizes the retrieval performances using multiple signature in- stances as an equivalent class in each query on the Tobacco-800 database. The queries consist of all the combinations of multiple signature instances from the same person, giving even larger query sets. In each query, we generate a single ranked list of retrieved document images using the final shape distance between each equivalent class of query signatures and each searched instance defined in Equation (3.17). As shown in Table 3.3, using multiple instances steadily improves the performance in terms of both MAP and MRP. The best results on Tobacco-800 is 93.2% MAP and 89.5% MRP, when three instances are used for each query. 
3.6.5 Off-line Signature Verification

We quantitatively compare our signature matching approach with several state-of-the-art off-line signature verification approaches [44, 40, 48] on the Sabourin off-line signature database [44], which has been used in [98, 99, 44, 40]. This database of 800 signature images contains genuine signatures and random forgeries from 20 classes, with 40 signatures per class.

Figure 3.16: Our approach robustly discriminates genuine signatures (red crosses) from forgeries (blue dots) across all signature classes in the Sabourin signature verification database [44].

Historically, results on the Sabourin signature database are reported in terms of the false rejection rate (FRR) of genuine signatures and the false acceptance rate (FAR) of forged signatures, with parameters trained to minimize the total errors. We follow the same evaluation protocol as in [44, 40], where the first 20 signatures in each class are used for training and the remaining 20 signatures for testing.

Our approach exceeded the state-of-the-art performance in off-line signature verification on the Sabourin database without any explicit assumptions about local features, such as the granulometric size distributions in [44] or the sets of stroke-level features in [40]. The 0.50% FRR and 0.21% FAR constitute the best published off-line signature verification result on this database under the standard experimental protocol. As shown in Figure 3.16, our approach effectively separates genuine signatures from forgeries across all 20 signature classes.

Table 3.4: Signature verification results on the Sabourin off-line signature database [44].

    Approach                         FRR      FAR
    Sabourin et al. [44]             1.64%    0.10%
    Guo et al. [40]                  1.66%    1.10%
    Shanker and Rajagopalan [48]     8.75%    3.26%
    Our approach                     0.50%    0.21%

The performance of [44] in Table 3.4 is obtained by combining four classifiers using multiple classification schemes. The best single classifier reported in [44] gives 1.55% FRR and 0.15% FAR at 25 iterations. This experiment also demonstrates the significant difference in performance between our approach and [48], which employs an improved dynamic time warping algorithm on 1-D projection profiles. The projection profile feature provides only a limited ability to discriminate shape and has demonstrated sensitivity to changes in the baseline direction. Heuristic alignment, such as finding the orientation that gives the minimum-length horizontal projection [48], is not robust under the large structural variations in off-line signatures. Treating signatures in the unconstrained setting of 2-D deformable shapes, our approach is more robust to large intra-class variations and provides better generalization performance.

3.7 Summary

In this chapter, we proposed a novel signature detection and segmentation approach based on the view that object detection can be a process that aims to capture the characteristic structural saliency of the object across image scales. This differs from the common object detection framework that focuses on sets of local properties. The results on signature detection using multi-language document image collections show that our approach is effective in the presence of large variations and cluttered backgrounds. One advantage of using multi-scale saliency for joint detection and segmentation is that it provides a general framework in which detection and segmentation degrade gracefully as the problem becomes more challenging. In addition, the detected and segmented outputs are both structurally and perceptually meaningful for matching and recognition.
To handle large structural variations robustly, we treated the signature in the general setting of a non-rigid shape, and demonstrated document image retrieval using state-of-the-art shape representations, measures of shape dissimilarity, shape matching algorithms, and multiple instances as the query. We quantitatively evaluated these techniques in challenging retrieval tests, each composed of a large number of classes but a relatively small number of signature instances per class. We also demonstrated our matching techniques in off-line signature verification and achieved competitive results.

Chapter 4

Learning Document Context

4.1 Introduction

Obtaining structured information from unstructured image sources presents a grand document analysis challenge. In a wide range of applications requiring such a capability, traditional language modeling techniques do not produce satisfactory results on the stream of OCR text, even though character recognition performance on quality machine-printed documents has improved steadily [100, 3]. The processing of receipts offers a classic example, where the unstructured nature of the document images presents critical challenges in several respects. First, text on receipt documents consists predominantly of terse streams of nouns. The lack of linguistic context, such as punctuation and language constructs, makes both syntactic and semantic analysis difficult. Second, the output text from OCR systems does not contain useful page layout information, such as spatial block regions and font types, which is readily available in the source image. Third, the text often contains high error rates (typically more than 6% at the character level), because receipts are often generated by low-resolution printers (e.g., dot matrix printers) and are kept in less than ideal physical conditions. These source image degradations are challenging to recover from and have a significant impact on the performance of character recognition and downstream tasks.

We present a model-based approach for extracting relevant named entities from unstructured document images, which enables the learning of inference rules collectively, based on contextual information from both page layout features in the image space and OCR text. We demonstrate the named entity extraction approach in an automated expense reimbursement system [9], which consists of (1) an electronic submission infrastructure that provides multi-channel image capture, transport, and storage of documents; (2) a document image analysis engine that extracts relevant named entities from unstructured document images; and (3) automation of auditing procedures.

Our contributions in this work include (1) a model-based approach that solves the named entity extraction and question answering aspects of the problem jointly; (2) a formal framework for efficient probabilistic inference that learns contextual dependencies using both page layout and text features; and (3) a demonstration of the effectiveness of integrating the rich feature sets available in the image space when the OCR text reveals limited structural information.

The remainder of this chapter is organized as follows: The next section provides background information on the emerging application of automated expense reimbursement. Section 4.3 presents an overview of the system architecture and components. Section 4.4 describes our approach to extracting relevant named entities from diverse, unstructured document images. Section 4.5 discusses experimental results on collections of real-world receipts.
We review related work in Section 4.6, and conclude in Section 4.7.

Figure 4.1: System architecture of our automated expense reimbursement system.

4.2 Motivations and Background

Expense reimbursement is a time-consuming and labor-intensive process for organizations of all sizes. Even though the policies and regulations defining the process vary across organizations and industries, corporate expense reimbursement faces a set of common challenges. A solution requires technical innovations in the following three key areas.

(1) A generalized paper-free framework for capturing, transporting, and storing paper documents in digital image form

In spite of the progress made with electronic tools, paper consumption in the office is growing, and paper continues to inhibit business process innovation. Expense reporting provides an example where paper receipts continue to generate unnecessary costs and delays, even though the organization has access to web-based applications. The problem arises from the requirement to hold receipts to prove the validity of submitted reimbursement claims. Currently, without any pervasive mechanism for electronic submission, paper receipts are mailed for centralized processing, with printed cover sheets. To protect against the risk of loss in the mail, the package is often copied, which creates still more paper. Even worse than handling the expanding amount of paper, the time to reimbursement remains at the speed of mailed packages and manual processing rather than at the speed of electronic transactions.

A fast technology shift in the printing industry, from analog copiers to consolidated high-resolution digital multi-function devices (MFDs), enables us to close the paper-digital gap by using these image capturing and transporting channels. In our system, submission of paper receipts requires a few easy steps: walking to the office MFD, authenticating yourself with your intranet password, selecting the appropriate menu option on the touch screen, hitting the "big green button" to scan, and submitting the receipts.

(2) Extraction of relevant named entities from diverse receipt images with unconstrained layouts and formatting

It is important to distinguish between the standard named entity recognition (NER) problem [101, 102, 103] and the one present in this situation. In our task, the query for the relevant named entity in each category amounts to a question. We look to find the unique answer that best answers the question using the presented context. For instance, given a receipt document, we can ask for the name of the merchant. If more than one merchant entity exists, the system needs to resolve the ambiguity and return the most relevant answer using all available cues collectively.

An effective solution to both the entity extraction and the question answering (QA) requires integrating an interdependent mixture of features from the page layout and the language content. This is a research area with relatively little work in the literature. In addition, we prefer a formal model-based approach that can be trained on new data.

(3) Automation of auditing procedures that enables an organization to perform expense validation with minimal human interaction

Many organizations have a limited ability to audit large volumes of expense reports, as it requires dedicated auditors to examine incoming receipts manually and judge their accuracy against the associated report.
This labor-intensive approach often causes an organization to downgrade its internal requirements for the percentage of submissions audited. For organizations with a high percentage of employees who travel, controlling costs in expense processing generally results in a lower rate of oversight than would be desired.

4.3 System Overview

In the following subsections, we give an overview of the expense reimbursement system and describe each component in its client-server architecture, shown in Figure 4.1. The system's client may be any computer, multi-function device (MFD), fax machine, or other electronic device with built-in capability to transmit native image files. The application server running the IBM Intelligent Document Gateway (IDG) server directly interacts with the centralized document image repository, the named entity extraction engine (EntityFinder), the dynamic business rule engine server, and back-end business processes.

4.3.1 Electronic Submission

Our system provides multi-channel image capture, transport, and storage of paper documents. On the client side, users have several options for submitting paper receipts or scanned receipt images securely to the document server:

1. Multi-function devices (MFDs): Once the user successfully authenticates with a password, the touch screen displays a customized user interface with directions for scanning and submitting paper receipts.

2. Web-based client: The user uploads document image files directly from their computer to the server through a light-weight client. This web-based application allows the user to provide additional information associated with the submitted receipt document, including its language set and a personal reminder.

3. Desktop print-job: This option allows the user to submit receipt documents in native image format through their computer's printer queue.

The meta-data transmitted along with a submitted image file includes the type of the document and an identifier that links the submitted receipt document to the corresponding expense claim. The receipt image and its associated meta-data are encrypted prior to transmission to the document gateway server via corporate or public networks.

Figure 4.2: Architecture of the document image analysis module, EntityFinder.

4.3.2 Named Entity Extraction

The capability to extract relevant named entities from documents is integrated into a document image analysis module called EntityFinder, as shown in Figure 4.2. At the lower level, EntityFinder handles images in their native formats (e.g., multi-page TIFF images) and provides support for higher-level functions, including document layout analysis and feature extraction, through interfaces with the OCR engine libraries. We present our approach to relevant named entity extraction in Section 4.4.

4.3.3 Automated Auditing

Extracting relevant named entities from unstructured document images opens many possibilities for business process automation. Our system uses a business rules engine to analyze the extracted data for relevancy within the context of automated expense auditing, and it activates actions based on the results of the rule execution. The set of business rules is defined in XML and is dynamically configurable in the live system.
The auditing actions include verifying extracted entities from receipts against their corresponding entries in the expense reimbursement claim, flagging instances of potential fraud, and checking adherence to prescribed organizational policies (e.g., meal limits). Automation of these routine procedures enables a significantly higher rate of auditing and a much shorter processing time between submission and compensation, bringing tangible productivity gains and cost savings to the organization.

4.3.4 Document Archival

Once all business rules for automated auditing have been executed and the resulting external business processes have been initiated, the set of extracted named entities, along with the source document, is stored in centralized repositories. This task aims to conform to business and legal requirements for document retention, to serve the future data analysis needs of the organization, and to support auditing control purposes. Document archival is governed by a configuration file associated with the document process. This includes the type of repository adapter to use, a link to the server, associated authentication data, and descriptive information, including table and column information for database access. The design allows multiple repositories as required by a given process, and can be flexibly adapted to organization-specific archival requirements.

4.4 Extraction of Relevant Named Entities

4.4.1 Task

We consider the task of extracting transaction-related named entities (NEs) from receipt documents. Different levels of complexity are involved in extracting NEs of diverse natures. The limited variation in NEs like transaction amount, date, credit card number, and merchant phone number can be handled effectively using regular expressions in combination with rules. In this chapter, we focus on the task of finding the set of NEs with arbitrarily large variation (e.g., the merchant) and present an approach to extracting these challenging NEs by exploiting context collectively from the page layout in the source image and its OCR text. The application imposes three major requirements:

1. Handle document images effectively, given unconstrained layouts and formatting, since the system must be able to process all kinds of receipts.

2. Provide the most likely answer to each NE as inferred collectively from the document's context.

3. Do not rely on large external dictionaries, as it is not economical to create and maintain such dictionaries. Furthermore, NEs on receipts commonly appear in various abbreviated forms that are difficult to enumerate. In fact, even with this constraint lifted, the NE extraction task is not trivial, but presents a different set of problems. The challenges involved in improving NER performance using external dictionaries are discussed in [104].
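For the low-variation NEs mentioned above, a minimal regular-expression sketch is given below. The patterns are illustrative assumptions, not the production rules; a deployed system would combine such expressions with rules and OCR-error-tolerant variants.

```python
import re

# Hypothetical patterns for low-variation NEs on US-style receipts.
PATTERNS = {
    "date": re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
    "amount": re.compile(r"\$\s?\d{1,6}\.\d{2}\b"),
    "phone": re.compile(r"\(?\d{3}\)?[ -.]?\d{3}[ -.]?\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),  # 16 digits, separators optional
}

def extract_simple_nes(ocr_text):
    """Return every match per category from a receipt's OCR text stream."""
    return {name: pat.findall(ocr_text) for name, pat in PATTERNS.items()}
```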
4.4.2 Our Approach

The structural information derived collectively from page layout and language features is important to the NER task on unstructured documents like receipts, where the OCR text stream may not be sufficient. Figure 4.3 shows a receipt from a Union 76 gas station. The string "76" by itself is most likely a number when it appears without context. However, people can determine that it refers to a merchant by examining the document layout and linguistic elements collectively.

A receipt document with unconstrained page layout and formatting still conveys structural information in two respects:

- Many semantically related entities are placed in close spatial proximity, even if their structure within the region does not seem obvious.

- The sequence of decomposed regions and the combination of layout and linguistic features within the regions reveal important contextual information.

Our approach to extracting relevant NEs involves (1) decomposing the document image spatially into regions by page segmentation; and (2) learning inference rules collectively in a discriminative conditional framework using the contextual information from page layout and text features.

4.4.2.1 Page Segmentation

Two page segmentation strategies can be employed to divide a general document image into homogeneous regions. In one strategy, a stand-alone page segmentation algorithm is used. Representative page segmentation approaches from the document image analysis community include the Docstrum algorithm by O'Gorman [105] and the Voronoi diagram-based algorithm by Kise et al. [106]. (Docstrum is a bottom-up page segmentation algorithm that can work on document images with non-Manhattan layouts and arbitrary skew angles; it has limited capability to handle non-text regions and text zones with irregular font sizes and spacing, and it tends to fragment them.) The other strategy uses the OCR engine through low-level function calls.

Figure 4.3: Page and word segmentation results of a receipt image.

Figure 4.4 shows the page segmentation results produced by the Docstrum algorithm and by the OCR engine, respectively. Each segmented region is plotted as a red bounding box. For better visualization of the content within each segmented region, we show the word segmentation results in the right sub-figure of Figure 4.4 using blue bounding boxes.

Figure 4.4: Page segmentation results by the Docstrum algorithm [105] (left) and by the OCR engine [93] (right).

Using the OCR engine to segment a document page directly has a few practical advantages over a stand-alone algorithm. First, it makes region-level attributes easily accessible. Most leading commercial OCR packages offer region-level classification capabilities on machine-printed documents, which allow regions containing text, tables, and graphics to be identified and processed accordingly. Second, it facilitates feature extraction from the segmented regions, including character-level attributes such as the coordinates of character borders on the image grid, font information, and recognition confidence. Last, using OCR for page segmentation removes the tedious step of tuning the free parameters of a stand-alone algorithm. Packaged OCR products provide a convenient black-box solution in which the engine parameters have been tuned for performance over large collections of documents. At the post-processing stage, the OCR engine can use preliminary recognition results to improve page segmentation in an iterative fashion. Additional information, including consistency in font style and the spatial alignment of segmented regions, helps improve overall page segmentation performance and tends to produce more structurally meaningful results, even for a highly degraded input image. Past empirical studies [107] observe this as well, where representative page segmentation algorithms are evaluated against the built-in page segmentation functions provided by several OCR products.
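As a minimal illustration of the first strategy, the sketch below implements recursive XY-cut, a classic top-down page segmentation scheme. It is not Docstrum, the Voronoi algorithm, or the method used in our system; it simply shows the idea of dividing a page into homogeneous regions. It assumes a binarized, deskewed page as a NumPy array with foreground pixels set to 1, and the gap threshold is a hypothetical parameter.

```python
import numpy as np

def widest_gap(profile, min_gap):
    """Return (start, end) of the widest run of zeros in a 1-D projection
    profile if it is at least min_gap bins wide, else None."""
    gaps, start = [], None
    for i, v in enumerate(profile):
        if v == 0 and start is None:
            start = i
        elif v != 0 and start is not None:
            gaps.append((start, i))
            start = None
    gaps = [g for g in gaps if g[1] - g[0] >= min_gap]
    return max(gaps, key=lambda g: g[1] - g[0]) if gaps else None

def xy_cut(img, top=0, left=0, min_gap=12):
    """Recursively split a binary image (1 = ink) along the widest internal
    whitespace gap; returns (top, left, bottom, right) region boxes."""
    ys, xs = np.nonzero(img)
    if len(ys) == 0:
        return []
    # Tighten to the ink bounding box so only internal gaps remain.
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    sub = img[y0:y1, x0:x1]
    h_gap = widest_gap(sub.sum(axis=1), min_gap)  # candidate horizontal cut
    v_gap = widest_gap(sub.sum(axis=0), min_gap)  # candidate vertical cut
    if h_gap is None and v_gap is None:           # homogeneous region
        return [(top + y0, left + x0, top + y1, left + x1)]
    if v_gap is None or (h_gap is not None and
                         h_gap[1] - h_gap[0] >= v_gap[1] - v_gap[0]):
        s, e = h_gap
        return (xy_cut(sub[:s], top + y0, left + x0, min_gap) +
                xy_cut(sub[e:], top + y0 + e, left + x0, min_gap))
    s, e = v_gap
    return (xy_cut(sub[:, :s], top + y0, left + x0, min_gap) +
            xy_cut(sub[:, e:], top + y0, left + x0 + e, min_gap))
```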
4.4.2.2 Learning From Document Context

We use conditional random fields (CRFs) as the framework for learning inference rules based on features at multiple levels of granularity and multiple modalities. There are clear motivations for using a discriminative model such as a CRF. First, CRFs relax the strict conditional independence assumptions over observations that generative models like hidden Markov models (HMMs) make to ensure tractable inference [108]. This gives CRFs more flexibility to integrate complex, overlapping, and non-independent feature sets that operate at multiple levels of granularity and in multiple modalities. Second, modeling conditional probabilities devotes model resources directly to the relevant task of label inference. This generally requires fewer labeled observation sequences and leads to better generalization given limited training data. In addition, CRFs avoid the label bias problem exhibited by maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models [109]. A few studies have shown that CRFs outperform both MEMMs and HMMs on a number of real-world, language-related sequence labeling tasks [109, 110, 111, 112].

4.4.3 Feature Selection

4.4.3.1 Practical Constraints

Optical character recognition of machine-printed characters has emerged as an industrial-strength technology since its phenomenal advances in the early 1990s. However, OCR accuracy still falls short of that of a second-grade child [3]. It is important to place the strengths and weaknesses of OCR technology in perspective, and to understand the factors that significantly affect performance. These insights provide useful guidelines for selecting feature sets that can be extracted relatively reliably given the practical constraints imposed by a targeted application.

The most successful application domain of OCR technology is machine-printed characters. During the last decade, the acceptance rates of form readers on hand-printed digits and constrained alphanumeric fields have risen significantly. The relatively low recognition errors in these constrained domains reflect the complexities involved in classifying a novel pattern under such limited variation in the data set [113]. In contrast, recognition of unconstrained off-line human handwriting and multi-lingual recognition among a variety of scripts are more challenging problems, and they remain active research frontiers.

Accumulated imaging degradations have a significant effect on OCR performance. Typical imaging defects in the printing process include blotchy characters caused by dot-matrix printer ribbons, and faint impressions resulting from worn ribbons and printer cartridges. The scanning process also introduces various imperfections. Digital scanning involves sampling both horizontally and vertically on the image grid. Sampling rates desirable for OCR are 300 dots per inch (dpi) and above. Although commercial packages can work at as low as 150 dpi by interpolating a low-resolution image to the preferred dpi, this generally leads to a significant increase in recognition errors. Significant image degradation also occurs when storing an image in binary format by applying thresholds to separate foreground content. Using gray-scale and color scans of the image captures more detailed information for pattern recognition and reduces the error rate. Most high-end MFDs provide these functions.
4.4.3.2 Feature Set

We use a rich combination of page layout and text features for NE extraction. The feature extraction process can be viewed as a set of binary-valued functions defined on the appropriate feature space that output either 1 or 0 based on the presence or absence of the corresponding feature. The conditional nature of CRFs enables effective learning from these discrete-valued, interdependent features, which may have extremely complex joint probabilities.

Page Layout Features. As shown in Figure 4.4, noise speckles and graphic elements in the document, including logos, lines, and region borders, may not be reliably classified and segmented, and may thus be passed to the machine-printed text recognizer in error. We simply discard segmented text regions in which the majority of the text is unrecognizable or suspicious. We use the following collection of page layout features extracted from each segmented region:

- Variation of font size and font face within the region
- Presence of the largest font on the entire page
- Presence of a bold font face in the region
- Whether the text block is horizontally aligned to the center
- Whether the text block is horizontally aligned to the left

Text Features. Word tokens that are logically or semantically related to an NE are useful and relatively robust features for extracting the NE. In fact, current OCR systems commonly use this technique for character-level error correction, making explicit use of context at the word level by choosing a common letter n-gram over a rare one [3]. This improves recognition performance at the word level even when the quality of the image remains poor. We organize word tokens into equivalence groups; for example, "Inc." and "Companies" are grouped together. The following text features are also used:

- Capitalization of words
- Mixed cases
- Frequent appearance of digit characters (0-9)
- Presence of special characters (/, ', #, -, *, $, ")
- Presence of special patterns ('s)

Named Entity Features. Orthogonality between NEs can provide effective features for probabilistic inference. A region of text containing a credit card number is less likely to contain the name of the merchant. We use the set of orthogonally related NEs as features. These include addresses, phone numbers, credit card numbers, dates, and monetary amounts.

4.4.3.3 Region Labels

We use a compact set of labels to categorize the ordered list of regions obtained by page segmentation. This is based on our observation that context changes frequently along the sequence of regions, making label inference more effective among neighboring regions. For extracting the merchant, we use three region labels: NON_DATA, MERCHANT_DATA, and TRANS_DATA. NON_DATA represents a region that contains neither details of a transaction nor any association with a merchant. MERCHANT_DATA denotes a region that contains a merchant. A TRANS_DATA region includes details of a transaction.

4.4.4 Relevant NE Extraction with CRFs

A conditional random field can be viewed as an undirected graphical model used to compute the conditional probability of labels on designated output nodes Y, globally conditioned on X, the random variable representing observation sequences. We construct a conditional model p(y|x) from paired observation and label sequences, and do not explicitly model the marginal p(x).
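Before giving the formal model, the sketch below shows how the binary features above and the region labels might be assembled into training sequences for a linear-chain CRF. It is a minimal illustration, not our production EntityFinder code: it assumes the third-party sklearn-crfsuite package, and the region attributes (tokens, font flags, bounding-box fields) are hypothetical names standing in for what the OCR engine exposes.

```python
import sklearn_crfsuite  # assumed third-party wrapper around CRFsuite

def region_features(region, page):
    """Binary layout/text/NE features for one segmented region (cf. 4.4.3.2)."""
    tokens = region.tokens
    return {
        "layout.font_size_varies": len(set(region.font_sizes)) > 1,
        "layout.has_page_max_font": max(region.font_sizes) >= page.max_font_size,
        "layout.has_bold": region.has_bold,
        "layout.centered": abs(region.center_x - page.center_x) < 0.05 * page.width,
        "layout.left_aligned": region.left < 0.1 * page.width,
        "text.capitalized": any(t.istitle() for t in tokens),
        "text.mixed_case": any(t != t.lower() and t != t.upper() for t in tokens),
        "text.many_digits": sum(c.isdigit() for t in tokens for c in t) > 5,
        "text.special_chars": any(c in "/#-*$" for t in tokens for c in t),
        "ne.has_phone": region.has_phone,              # orthogonal-NE features
        "ne.has_credit_card": region.has_credit_card,
    }

# One training instance per receipt: the ordered sequence of regions and
# their labels (NON_DATA / MERCHANT_DATA / TRANS_DATA).
X_train = [[region_features(r, page) for r in page.regions] for page in train_pages]
y_train = [[r.label for r in page.regions] for page in train_pages]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=200)  # L-BFGS training
crf.fit(X_train, y_train)
labels = crf.predict([[region_features(r, p) for r in p.regions] for p in test_pages])
```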
We use CRFs with a linear chain structure. Given an instance of an observed input sequence x, the probability of a label sequence y is defined in [109] as

    p(y|x) \propto \exp\Big( \sum_{t=1}^{T} \big( \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) + \sum_{k} \mu_k g_k(y_t, x, t) \big) \Big),    (4.1)

where f_k(y_{t-1}, y_t, x, t) is a transition feature function of the entire observation sequence and the labels at positions t and t-1, and g_k(y_t, x, t) is a state feature function of the label at position t and the observation sequence. More compactly, the probability of a label sequence y given the observation sequence x is

    p(y|x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k F_k(y_{t-1}, y_t, x, t) \Big),    (4.2)

where Z(x) is a normalization factor,

    Z(x) = \sum_{s \in S^T} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k F_k(y_{t-1}, y_t, x, t) \Big).    (4.3)

Assuming the training data D = \{x^i, y^i\}_{i=1}^{N} are independent and identically distributed, the product of Equation (4.2) over all training sequences, viewed as a function of the parameters \lambda, is the likelihood function. Maximum likelihood training chooses parameter values so that the log-likelihood

    L = \sum_{i=1}^{N} \log p(y^i | x^i)    (4.4)

is maximized. For a CRF, the log-likelihood is a concave function, guaranteeing convergence to the global maximum.

Likelihood maximization can be performed efficiently using a quasi-Newton method such as L-BFGS [114], which approximates the second derivative of the likelihood by keeping a running, finite window of previous first derivatives. L-BFGS can be treated as a black-box optimization procedure, requiring only the first derivative of the function to be optimized. Let y^i be the state path up to position T on instance i of the labeled training sequences. The first derivative of the log-likelihood function is given by

    \frac{\partial L}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{t=1}^{T} F_k(y_t^i, y_{t-1}^i, x_t^i) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'} F_k(y, y', x_t^i)\, p(y, y' | x^i).    (4.5)

Intuitively, when the state paths chosen by the CRF parameters match the state paths from the labeled sequences, the derivative given in Equation (4.5) becomes zero.

4.5 Results and Discussion

4.5.1 Datasets

We used two large real-world receipt collections provided by the IBM World Wide Reimbursement Center for training and testing, which contain binary scanned receipt images from IBM internal business units and IBM Global Services customers. These two collections provide realistic examples because the paper receipts were gathered and scanned over time, using a variety of equipment. Characteristics of the two datasets are summarized in Table 4.1.

Table 4.1: Summary of the two real-world receipt image collections.

                            Collection 1    Collection 2
    Total images            145             283
    Number of characters    71,316          522,320
    Character error rate    6.03%           9.48%
    Image resolution        200-300 dpi     150-200 dpi
    Country of origin       US              UK

We used the first two fifths of each dataset for training and the rest for testing. Ground-truth labels were created by first running a rule-based heuristic on the sequence of segmented regions; human judgment was then employed to correct mistakes in the heuristic labeling.

4.5.2 Evaluation and Discussion

We evaluate performance with the widely used precision-recall metrics. Throughout our evaluation, we define recall as the ratio of the number of NEs correctly extracted to the number of NEs physically present in the collection. Precision is the ratio of the number of NEs correctly extracted to the total number of NEs extracted in the category. The F-measure (or F1 measure) is computed as (2 × Precision × Recall)/(Precision + Recall) [115].
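For concreteness, a minimal computation of these metrics follows; the counts are hypothetical, chosen here to mirror the merchant-by-CRF row of Table 4.2.

```python
def precision_recall_f1(num_correct, num_extracted, num_present):
    """Precision, recall, and F1 for one NE category, as defined above."""
    precision = num_correct / num_extracted
    recall = num_correct / num_present
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g., 62 merchants extracted correctly out of 84 extractions,
# with 88 merchants actually present in the collection:
p, r, f1 = precision_recall_f1(62, 84, 88)  # ~73.8% precision, ~70.5% recall, ~72.1% F1
```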
Table 4.2 summarizes the overall NE extraction performance on the two collections of receipt images. On both datasets, using the CRF to extract the merchant significantly outperformed the rule-based heuristic approach. This is encouraging because improvements to a rule-based system require constant changes to the code base, while improvements to the CRF system generally require only defining new features and retraining the model. In fact, we observed an improvement in performance after modifying the word token sets to reflect the locale difference between the two databases.

Table 4.2: Named entity extraction performance on the two databases.

                                     Collection 1                   Collection 2
    Entity                   Precision  Recall  F-Measure   Precision  Recall  F-Measure
    Date                     100.0      85.2    92.0        96.3       78.2    86.3
    Credit card #            98.8       76.5    86.2        92.4       69.5    79.3
    Expense total            100.0      97.7    98.8        89.1       91.7    90.4
    Phone #                  95.3       78.1    85.8        84.8       66.1    74.3
    Address                  95.4       82.7    88.6        82.3       58.6    68.5
    Merchant (by heuristic)  63.2       56.8    59.8        58.5       49.7    53.7
    Merchant (by CRF)        73.8       70.5    72.1        67.2       62.9    65.0
    Merchant (improvement)   10.6       13.8    12.3        8.7        13.2    11.3

The impact of recognition errors on rule-based NE extraction approaches is evident: almost all the errors made by the heuristic on simple NEs were caused by text errors. Scanning on a higher-resolution device, or on a device that supports gray-scale or color formats, can mitigate this problem by effectively containing the recognition errors. In addition, a paper document undergoes the image capturing (scanning) process only once in our system, in contrast to some receipts in the evaluation datasets that are second-generation copies (e.g., a scanned image of a previously faxed document). The figures in Table 4.2 thus provide a realistic estimate of the practical performance, given improvements in digital imaging technologies.

Extracting relevant NEs using the CRF proves more robust to character recognition errors because it leverages collective information from the presented document context. For instance, the 9.48% character error rate on Collection 2 translates to a 39.2% word-level error rate for a five-letter word. Nevertheless, we achieved a 65.0% F-measure on merchants using the CRF on that dataset.

4.6 Related Work

Named entity recognition (NER) is an essential task in deriving structured information from unstructured sources. Historically, it has been defined on unstructured text due to its roots in the Message Understanding Conferences [101]. NER capabilities have been demonstrated using unstructured text corpora from a wide range of domains, including identifying personal names and company names in newswire text [102], identifying titles and authors in on-line publications [103, 116], and identifying gene and protein names in biomedical publications [117, 118]. More recently, unsupervised NER results have been reported on a massive corpus of domain-independent text from the web [119].

The vast majority of NER systems employ rule-based or machine-learning-based approaches. Examples of rule-based systems in the literature include [120, 121]. Machine-learning-based approaches can be further divided into two main categories: classifier-based and Markov-model-based. Common choices of classifiers include decision trees, naive Bayes, and support vector machines (SVMs).
In addition, studies in [122] and [123] have used classifier combination techniques in NER tasks. Markov-based models, including hidden Markov models (HMMs) [124], maximum entropy Markov models (MEMMs) [125], and conditional random fields (CRFs) [109], are well suited to problems that involve sequential analysis. More recently, CRFs have also been extended to computer vision problems, including region classification [126] and human motion recognition [127].

4.7 Summary

In this chapter, we proposed a model-based approach for extracting relevant named entities from unstructured documents, which enables learning of inference rules collectively based on contextual information from both page layout and text features. We applied our approach to the processing of diverse real-world receipt documents in an expense reimbursement system, and demonstrated that it brought significant improvement to named entity extraction. Our approach delivers better generalization performance and is shown to be more robust to recognition errors than approaches that rely solely on linguistic features. These results demonstrate the importance of jointly interpreting the context of the document when the text itself reveals limited structural information.

Chapter 5
Mining Session Context

5.1 Introduction

Understanding users' behavior presents a major challenge in improving the ranking performance and usability of a document retrieval system. One common solution used by search engines is to record and analyze user activities on search result pages or within query sessions. These search-related logs provide indications of relevance judgment and can be useful as training data to improve numerous search capabilities [128], including query refinement and suggestion [129, 130, 131], learning ranking functions [132, 133, 134, 135], image search [136], and targeted advertising [137].

Analysis of search-centric user behavior data sources has several inherent limitations. First, search-oriented activity represents a minimal fraction (less than 5%) of a user's behavior online [138], even taking into account post-query browsing trails. This may significantly limit document coverage. Second, typical search engine logs contain user activities only within the scope of interactions with search result pages [139], despite the fact that the use of broader contextual information can improve the user's search experience. Third, user activities on search result pages or within search sessions introduce a strong bias toward dated documents with high rankings [140], since users seldom search beyond the first page [2]. Consequently, the learned ranking will not converge to an optimal ranking, because documents that are fairly recent or initially ranked low will rarely be presented to users.
By effectively incorporating information on all web user activities, search engines gain insights into user preferences and intents, and improve both retrieval performance and user experience. First, analysis of all user actions provides a more robust estimate of the perceived importance users associate with web pages and sites [141]. Second, search engines face the challenge of prioritizing and adapting their computing resources under practical constraints in crawling, indexing, and query processing [142]. In this context, the relative attention a web page receives from all users provides an intuitive, user-centric optimization criterion that responds to evolving user behaviors. As a large amount of web content emerges and refreshes within a shorter time interval than a typical crawling and indexing cycle of a search engine [143], discovering popular content and adapting crawling schedules based on the degree of usage may prove an effective and agile policy. Finally, another challenging area for search engines is access to the deep (or invisible) web, the fraction of the web that is dynamically generated and not directly accessible to automated crawlers [144]. Coverage of the deep web can be substantially improved by leveraging large-scale user browsing history, which collectively reveals hidden URLs that provide gateways to its contents.

As a logical unit of user web experience, a web session contains rich information on the user's preferences and intents within an actionable time frame [138]. Web session representation and interpretation is a particularly important subject for the web community because it applies across all activities.

In this dissertation, we focus on using the large amount of knowledge gained from computational analysis of general user browsing behavior by (1) leveraging information on all browsing actions, and (2) developing models that incorporate rich contextual information within web sessions. Our main contribution is ClickRank, a novel algorithm we propose for estimating web page and web site importance, which is based on these two key notions. ClickRank first estimates a local importance value for every page or site in each user browsing session, based on the implicit preference judgments of the user in the session context. It then aggregates these local values over all sessions of interest to construct a global ranking.

We evaluate ClickRank in three important areas of web search. Our first experiment tests the traditional task of web site ranking, where we show that results from ClickRank are competitive against state-of-the-art approaches, including PageRank [145] and the recently proposed BrowseRank [141], and are obtained at significantly lower computational cost. In the second experiment, we demonstrate the novelty and effectiveness of ClickRank in web page ranking alongside several hundred state-of-the-art web search features, including those computed from page visit counts and the link structure of the web graph. In this large-scale test, we formulate the task of learning the optimal ranking model as an additive regression problem using gradient-boosted decision trees, and quantify the variable importance of ClickRank in direct comparison with the other quality features. Finally, we demonstrate ClickRank in a system that mines recent, popular pages and presents them to web search users as dynamic quicklinks in search result summaries.

We structure the remainder of this chapter as follows. Section 5.2 reviews related work. In Section 5.3, we present important characteristics of general web sessions and describe in detail our approach to session mining, which incorporates contextual information in the session representation. In Section 5.4, we introduce the ClickRank algorithm and describe how we combine it with other useful web search signals to improve web search by learning an optimal ranking model. We comprehensively evaluate ClickRank in three web search applications in Section 5.5, and conclude in Section 5.6.
5.2 Related Work

PageRank [145], HITS [146], and TrustRank [147] are representative link analysis algorithms for computing authoritative sources using the link structure of the web graph, and they have been widely used as measures of the relative importance of web pages. The well-known PageRank algorithm, for instance, considers a link from a source page to another page as an explicit endorsement of the destination page's perceived quality, and uses only the static link structure of the web as input. Based on the assumption of a random surfer model and a first-order Markov process, PageRank computes the stationary probability distribution of the web link graph iteratively, resulting in the world's largest matrix computation [148].

A number of problems are commonly associated with link analysis algorithms. First, intent drives user browsing behaviors, which deviate significantly from the random surfer model upon which PageRank is based. A recent study on real network traffic [149] demonstrated that user visitation patterns differ considerably from those approximated by the uniform surfing behavior model used in PageRank. Second, static modeling of the link structure favors old pages, because a recent page is less likely to be linked within a short period of time, even if the page is of good quality. Third, link structures are prone to manipulation, as adversarial links can be generated to inflate ranking artificially, in contrast to good links that typically originate in manual editing. Last, as the web grows at an explosive speed (while the first Google index in 1998 had 26 million pages, this number officially reached the 1 trillion mark as of July 25, 2008 [150]), computing page importance at web scale by link analysis becomes computationally expensive [151], even with various optimization schemes [152, 153].

Little work has been done on general, non-search browsing data. Prior literature related to sessions [154, 131, 155, 156, 157] focuses almost exclusively on search trails within query sessions. However, as we will present in Section 5.3, search-related activities account for less than 5% of overall user activity online. Analysis of web sessions in a general setting broadens user behavior models with richer contextual information from the entire spectrum of actions, and this analysis is key to new web search applications that aim to provide an enriched search experience centered around users' interests.

A new page importance ranking algorithm called BrowseRank [141] has recently been proposed, and it makes two significant contributions. First, it uses the more reliable input of user behavior data, computing a user browsing graph rather than a web link graph. Second, BrowseRank models the random walk on the user browsing graph by a continuous-time Markov process. BrowseRank has shown better ranking performance compared with link analysis algorithms, at the expense of higher computational cost.

Search trails within query sessions are the subject of a few recent works. Dramatic differences in user interaction behaviors with a search engine are reported in [154]. The idea of using popular end points in search trails as a query-dependent feature is studied in [131] to improve web search interaction. A recent study [155] shows improved retrieval quality using post-search browsing activities over alternative data sources that contain only the end points of search trails or clickthrough logs.
Also, that study suggests that post-search browsing behavior logs provide a strong signal for inferring document relevance for future queries.

5.3 Mining Web Sessions

We define a web session as a logical unit of time-ordered user browsing activities, representing a single span of user interactions with a web browser. The concept of a session in our study is generalized to all categories of web activities, whereas studies of search logs or search clickthrough data consider a session simply as a set of search queries and largely ignore all other activity.

5.3.1 Session Identification

A user's browsing history is commonly accessible from several sources, such as the ISP or another gateway to the web [149], or clients installed in the user's environment [154]. In this study, we use information logged by the Yahoo! Toolbar, a browser add-on that assists users with quick access to various web tasks. The toolbar logs user activities for the subset of users who opted into this data collection during installation. Each log entry is a tuple of {cookie, timestamp, URL, referral URL, event attribute list}. The cookie serves as a unique, anonymous client identifier that expires and refreshes after a pre-defined time period. The URL identifies the page being accessed, and the referral URL is the URL from which the user accessed the current URL. The event attribute list comprises various metadata associated with the activity. For the experiments in this dissertation, the browsing data consists of more than 30 billion anonymous events, across millions of unique Yahoo! users, collected over six months in 2008.

To segment web activities into sessions, we first use the referral URL to current URL structure to reconstruct the entire chain of browsing activities per user. This scheme ensures that, for users who are multitasking (e.g., those having multiple browser windows or tabs open), we group activities associated with different tasks into separate sessions rather than interleaving them together. Next, we partition the time-ordered user events using two boundary conditions. First, we start a new session at the current event if more than 30 minutes of inactivity occurred between the current event and its immediately preceding event. Second, a new session starts if the current event entry does not have a referral URL. This typically happens when the user launches a new web browser, or clicks on a link in a non-browser source (e.g., in a text file).

Our session segmentation approach requires only one pass over the data. This may seem a simple mechanism. However, a recent study on finding logical sessions in query logs [156] has shown that in the vast majority (92%) of cases, a session segmentation method based on a timeout threshold gives scores identical to those of an advanced and computationally expensive algorithm [156], when both are compared with human-judged sessions using the objective Rand index [158]. For the small fraction of remaining sessions that are difficult for the advanced algorithm, the timeout-based method suffers only a marginal degradation of 1.4%.
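A minimal sketch of this segmentation logic follows. For brevity it applies only the two boundary conditions, not the referral-chain reconstruction that separates interleaved multitasking activity; log entries are assumed to be grouped per user and sorted by timestamp, and the field names are hypothetical.

```python
from datetime import timedelta

TIMEOUT = timedelta(minutes=30)  # inactivity threshold described above

def segment_sessions(events):
    """Partition one user's time-ordered log entries into sessions.
    Each event is assumed to be a dict with 'timestamp', 'url', and
    'referral_url' (None when the browser was launched fresh)."""
    sessions, current = [], []
    for event in events:
        boundary = (
            current
            and (event["referral_url"] is None                           # no referrer
                 or event["timestamp"] - current[-1]["timestamp"] > TIMEOUT)  # inactivity
        )
        if boundary:
            sessions.append(current)
            current = []
        current.append(event)
    if current:
        sessions.append(current)
    return sessions
```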
5.3.2 Session Characteristics

Table 5.1 summarizes the key characteristics of general web sessions. Figure 5.1 shows the probability distributions of the number of events in a session and of the session duration, respectively.

Figure 5.1: Distribution of session lengths (left) and session durations (right) in general web user behavior logs.

Table 5.1: Key characteristics of general web sessions.

    Average events per session                        9.1
    Standard deviation of events per session          24.5
    Average session duration (seconds)                420.3
    Standard deviation of session duration (seconds)  1068.0
    Sessions per user per day                         15.5
    Percentage of search sessions                     4.85%

The number of events in a web session approximately follows a power law distribution. Its mean and standard deviation are 9.1 and 24.5, respectively, demonstrating that a general web session contains significantly richer activity context and diversity than a search session, which reportedly consists of an average of five events [155]. In addition, search sessions (those containing at least one query sent to one of the major web search engines) constituted only 4.85% of all sessions, signaling that focusing on them may lead to a biased view in downstream analysis [159]. The session duration graph in Figure 5.1 shows two different power law behaviors on either side of the timeout threshold of 1,800 seconds. On average, a web session lasts 420.3 seconds, with a standard deviation of 1,068 seconds, demonstrating its short-to-medium time range coverage of user activities.

5.3.3 Session Clustering

Mining user sessions at web scale is particularly important for learning and recognizing user behavior patterns associated with structured intents. We employ several clustering approaches to discover web sessions driven by different intents and to learn their statistical characteristics. Due to space constraints, we focus our discussion on one representative clustering effort.

In this experiment, we mapped each interpretable URL to one of five intent categories: search, email, information/reference, rich content (e.g., social networking and multimedia), and shopping. We computed the histogram representation of a session from the distribution of the number of events over these intent categories. While certain temporal information is clearly discarded, we will see in the next section that this histogram representation preserves adequate discriminating power for the clustering purpose while remaining compact for the large amount of data.

To reliably associate each URL visit with an intent type, we used human categorizations of the top 1,200 most popular web sites. While the coverage achieved this way was a reasonable 41% of all events, we augmented these categorizations using heuristics that map from URLs to likely intents. For example, URLs of the format shopping.*.com/* were mapped to the shopping intent, and so on.

Within each session, a browsing event was either labeled as unknown or assigned to one of the five intent categories described above. We then computed the distribution of events over the six intent labels (i.e., including the unknown class), and discarded those sessions that contained more than 80% unknown events, as they could not be reliably clustered. Finally, we smoothed each normalized intent histogram by evenly distributing the weight associated with the unknown class over the other five histogram bins.

The final session histogram is a seven-dimensional feature vector. The first five dimensions correspond to the normalized intent histogram, with their sum equal to 100. The last two dimensions correspond to the number of events in the session and the session duration in seconds, respectively.
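A compact sketch of this representation, together with the k = 10 clustering used in Section 5.3.4, is given below. It uses scikit-learn's k-means as a stand-in implementation, and assumes each event already carries an intent label as described above.

```python
import numpy as np
from sklearn.cluster import KMeans

INTENTS = ["search", "email", "information", "content", "shopping"]

def session_histogram(events, duration_seconds):
    """Seven-dimensional session feature: normalized intent histogram
    (summing to 100), event count, and session duration."""
    counts = np.array([sum(e["intent"] == c for e in events) for c in INTENTS], float)
    unknown = sum(e["intent"] == "unknown" for e in events)
    if unknown > 0.8 * len(events):
        return None                        # cannot be reliably clustered
    counts += unknown / len(INTENTS)       # smooth unknown mass evenly
    hist = 100.0 * counts / counts.sum()
    return np.concatenate([hist, [len(events), duration_seconds]])

features = []
for s in sessions:  # sessions as produced by segment_sessions() above
    duration = (s[-1]["timestamp"] - s[0]["timestamp"]).total_seconds()
    h = session_histogram(s, duration)
    if h is not None:
        features.append(h)

clusters = KMeans(n_clusters=10).fit_predict(np.vstack(features))
```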
To gain further insight into the spread of the session histograms, we used principal component analysis (PCA) to reduce the dimensionality. PCA seeks projections onto a low-dimensional linear subspace that best preserve the data scatter in a least-squares sense [113]. The 3D view of the session histograms shown in Figure 5.2 demonstrates their heterogeneity, as the histogram data covers a broad continuum of the activity space. Among the first six significant eigenvalues, the first eigenvalue is dominant.

Figure 5.2: Visualization of session histograms in 3D by dimensionality reduction using principal component analysis.

5.3.4 Session Interpretation

A meaningful interpretation of sessions is key to understanding the context of activities in general, unconstrained user behavior data. Table 5.2 summarizes the unsupervised session histogram clustering results, using the k-means algorithm with k = 10. The clusters are ordered by cluster size, and significant features that clearly indicate cluster attributes are highlighted.

Table 5.2: Unsupervised clustering of session histograms reveals web user browsing patterns. Significant features associated with each cluster are highlighted in bold.

    Feature        Entire Data        Cluster:
                   Mean / Std. Dev.   1       2      3      4      5      6      7      8      9      10
    Cluster size   (% of data)        29.8%   16.6%  14.3%  11.9%  11.0%  4.7%   4.6%   3.5%   2.1%   1.5%
    Search         23.63 / 37.71      0.35    98.43  1.19   2.35   2.33   56.18  41.52  52.23  6.46   0.09
    Mail           16.81 / 34.98      0.07    0.66   97.25  0.39   0.42   1.29   51.79  0.70   9.79   0.07
    Information    12.26 / 30.85      0.04    0.27   0.39   1.03   96.50  24.58  2.65   0.50   5.97   0.02
    Content        34.31 / 45.69      99.42   0.37   0.64   0.45   0.36   0.64   0.95   45.25  60.51  99.54
    Shopping       12.85 / 31.60      0.08    0.24   0.41   95.67  0.29   16.92  2.60   0.86   16.84  0.06
    Events         9.06 / 24.53       11.14   2.89   5.66   6.25   5.33   4.24   5.38   4.26   7.84   151.68
    Duration (s)   420.30 / 1067.99   532.49  261.40 303.85 235.78 298.91 228.40 455.58 218.01 439.78 4237.65

Various intent-driven web browsing patterns emerge from the clusters' statistical properties. The top five clusters correspond to coherent sessions of rich content browsing, search, email, shopping, and information, respectively. For instance, the center of cluster 1, covering 29.8% of the entire data, contains 99.42% rich content browsing. Typically, these are user interactions with social networking sites such as Facebook and MySpace. The cluster-wise standard deviation of 2.82% along this feature dimension is significantly smaller than the standard deviation of 45.69% over the entire data.

Clusters revealing more sophisticated user behaviors are also evident in Table 5.2. These interesting patterns include browsing web search results without a click (cluster 2), collecting information while shopping (cluster 6), visits to rich content web sites through navigational queries (cluster 8), and prolonged activities on social networking sites (cluster 10; note the average session duration).

These observations demonstrate that even a simple approach to session representation, as distributions over high-level event categories, can provide the search engine with valuable information, such as the distribution over the types of content that users are likely to access (useful for crawl scheduling as well as for ranking purposes). If we filter the entire set of sessions to preserve only those containing search queries, we can further observe which queries lead to a particular session type (e.g., a shopping session) and optimize the user experience accordingly.

5.4 Using Browsing Information for Web Search

In this section we present a novel web search ranking algorithm, ClickRank, which combines different notions of user preference mined from browsing sessions.
The ClickRank algorithm provides a robust estimate of the importance of web pages and web sites without explicitly constructing a web graph. Its relatively low computational cost makes it particularly useful for web search ranking purposes. We also describe how ClickRank can be combined with a large set of other ranking features for learning a ranking model.

5.4.1 ClickRank

A web session contains several contextual indicators of user preferences among the visited web pages. Intuitively, users tend to browse content that they perceive as important in the context of their informational need. This makes the dwell time on a web page an important endorsement of the user's interest level. The click order within a general trail of user activities is also important: accessing one web page before another in the session may be interpreted as a preference signal from the user. ClickRank aims to combine these signals to determine a local importance value for each page within a session, and then to aggregate the importance values over all sessions of interest.

We start by computing local importance values within each session. The ClickRank of a web page p_i in a given web session s_j is a function of several indicators within the session context: the dwell time on the page, the page load time, the rank of p_i in the ordered set of all visited URLs, and the frequency of occurrence in the session. We define the local ClickRank function as

    \mathrm{ClickRank}(p_i, s_j) = \sum_{p \in s_j} w_r(i, s_j)\, w_t(p, s_j)\, I(p = p_i),    (5.1)

where w_r(i, s_j) is a weight function induced on the rank of event i in session s_j, w_t(p, s_j) is a weight function computed from the set of temporal attributes associated with the browsing of page p_i, and I(\cdot) denotes the indicator function.

We define the weight function w_r(\cdot) for an event i at rank r(i) of a session s_j with a total of n_j events as

    w_r(i, s_j) = \frac{2\,(n_j + 1 - r(i))}{n_j (n_j + 1)},    (5.2)

where r(i) \in \{1, \ldots, n_j\}, so that w_r(i, s_j) is a monotonically decreasing function of the rank of the event within the session. This choice of w_r(\cdot) is motivated by measurements of implicit user preference judgments through eye-tracking experiments [2], which show decreasing relative attention devoted to ordered clicks in navigational and informational tasks.

Lemma 5.1. The weight function w_r(\cdot) has the following properties: (a) w_r(i, s) is a monotonically decreasing function of the rank r(i) (1 \le r(i) \le n) of the event in session s, and w_r(i-1, s) - w_r(i, s) = w_r(i, s) - w_r(i+1, s); (b) \sum_{i=1}^{n} w_r(i, s) = 1, where n is the number of events in session s.

For a set of web sessions S = (s_1, \ldots, s_k) across users and over a period of time, we aggregate the ClickRank values as

    \mathrm{ClickRank}(p, S) = \mathrm{AGGR}_{s \in S}\,[\mathrm{ClickRank}(p, s)],    (5.3)

where ClickRank(p, s) is the local ClickRank function defined in (5.1) given an instance of observed sessions, and AGGR denotes an aggregation function, such as summation or averaging, over all sessions of interest. In the following experiments, we use summation as the aggregation function.

Finally, the ClickRank of a web site w for a set of sessions S is simply the sum of the ClickRank values of all pages in S that are part of the site: \mathrm{ClickRank}(w, S) = \sum_{p \in w} \mathrm{ClickRank}(p, S). Note that using a sum implicitly models both the importance (as evidenced by the ClickRank values of individual pages) and the size of the web site, i.e., the number of pages it comprises.
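The sketch below computes local and aggregate ClickRank values per Equations (5.1)-(5.3), with summation as the aggregator. The temporal weight w_t is left as a parameter here (the specific form used in our experiments is given later in Equation (5.13)), and the event field names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def rank_weight(rank, n):
    """w_r from Eq. (5.2): linearly decreasing in rank, summing to 1."""
    return 2.0 * (n + 1 - rank) / (n * (n + 1))

def local_clickrank(session, w_t):
    """Per-page ClickRank within one session (Eq. 5.1). A page visited
    several times accumulates weight from each of its events."""
    scores = defaultdict(float)
    n = len(session)
    for rank, event in enumerate(session, start=1):
        scores[event["url"]] += rank_weight(rank, n) * w_t(event, session)
    return scores

def global_clickrank(sessions, w_t):
    """Aggregate local values over all sessions (Eq. 5.3, AGGR = sum)."""
    total = defaultdict(float)
    for session in sessions:
        for url, value in local_clickrank(session, w_t).items():
            total[url] += value
    return total

def site_clickrank(page_scores):
    """Site-level ClickRank: page values summed per host."""
    sites = defaultdict(float)
    for url, value in page_scores.items():
        sites[urlparse(url).netloc] += value
    return sites
```

Because each session contributes an independent, additive term, newly logged sessions only require computing their local values and adding them in, the property discussed in Section 5.4.4.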
A web session can be viewed as a logical sequence of hops through the hyperlink structure of the web. At each step, a user selects what she 106 judges as most relevant as the next click, based on a variety of features such as the attractiveness of content in the context of the user?s activity and her prior knowl- edge. The user further indicates her interest through various temporal attributes, such as the time devoted to the page or whether it was visited multiple times. This process continues throughout the duration of the session, until the user starts another journey on the web. Moreconcretely, thelocalClickRankfunctiondefinesarandomvariableXj : ? ? R+0 associated with the web page pj in the event of a logical sequence of web clicks. The mean and variance of the random variable Xj are non-negative and finite. Lemma 5.2. For any bounded, non-negative weight function wt(), E(Xj) < ? and var(Xj) < ?. Proof. Let wt() be bounded by a non-negative constant A. Then, E(Xj) = summationdisplay pj?s wr(i,s)wt(p,s)P(p = pj,s) = summationdisplay pj?s wr(i,sj)wt(p,s)P(s)P(p = pj|s) = summationdisplay pj?s wr(i,sj)wt(p,s)P(s) ? summationdisplay pj?s wr(i,sj)A ? A summationdisplay pj?s wr(i,sj) ? A by part (b) of Lemma 5.1. Similarly, we can show that var(Xj) ? A2. The following convergence property of ClickRank defines its asymptotic behav- ior over increasing volume of empirical data. Furthermore, we can establish bounds 107 on a ClickRank-induced function in a probabilistic setting by Markov?s inequality and its corollaries [160]. Theorem 5.1 (Convergence property). Let {X1j , X2j , ..., Xkj } be the sequence of random variables associated with the web page pj over observed sessions S = (s1,...,sk), and assume they are independent and identically distributed. Then 1 k ksummationdisplay i=1 Xij ? E(Xj) almost surely, as k ? ?. Proof. Xij are non-negative, independent identically distributed random variables with finite means. The result follows directly from the strong law of large num- bers [161]. Theorem 5.2 (Markov?s inequality). Let f : R ? [0,+?) be a non-negative func- tion, then P[f(X) ? a] ? E(f(X))a for all a > 0. Corollary 5.3. If f : R ? [0,+?) is a non-negative function taking values bounded by some number M, then P[f(X) ? a] ? E(f(X))?aM ?a whenever 0 ? a < M. Corollary 5.4 (Chebyshev?s inequality). Let X be a random variable with expected value ? and finite variance ?2, then P[|X ??| ? a] ? ? 2 a2 if a > 0. Simply put, as the volume of the web browsing sessions analyzed by ClickRank reaches a sizable sample of the entire web traffic, the rank computed by ClickRank for each page converges to its true rank according to a usage criterion. 108 5.4.3 Application to Web Search Ranking As a query-independent feature, ClickRank can be incorporated into a doc- ument ranking process in several ways [162]. One particular framework that has recently become prominent is the learning to rank approach to information retrieval, which aims to apply machine-learning algorithms to derive a ranking function from data. In a machine-learned ranking framework, a large variety of features are used to model a query and a document. Query features can be its length or frequency in a search log, and document features can be term statistics or, in the case of web documents, the number of incoming HTML links. Machine-learned ranking provides a convenient approach for quantitatively evaluating the effectiveness of ClickRank as a novel feature in addition to a large collection of existing ranking features. 
We formulate the task of learning the ranking model for web search as an optimization problem. Our goal is to find a ranking function f^*(x) that maps a set of input random variables corresponding to features x = \{x_1, \ldots, x_n\} to an output random variable y representing the relevance score, such that the expected value of the loss function \Psi(y, f(x)) is minimized over the joint distribution of (y, x):

    f^*(x) \equiv \arg\min_{f(x)} E_{y,x}\, \Psi(y, f(x))    (5.4)
             = \arg\min_{f(x)} E_x \big[ E_y(\Psi(y, f(x))) \mid x \big].    (5.5)

We compute the optimal ranking model using the numerical optimization framework of gradient boosting [163], since an analytical solution cannot be derived for general f(x) and \Psi in Equation (5.4). Gradient boosting employs functional regression that expresses the solution to the ranking function as an additive expansion of M parameterized functions,

    f^*(x) = \sum_{m=0}^{M} f_m(x) \equiv \sum_{m=0}^{M} \beta_m h(x; a_m),    (5.6)

where f_0(x) is an initial guess and [f_m(x)]_1^M are incremental functions (or "boosts"). In Equation (5.6), each incremental function f_m(x) can be further factored as the product of a base learner h(x; a_m) and a corresponding coefficient \beta_m. For parameter estimation, gradient boosting sequentially fits a parameterized function to the current residuals by a least-squares criterion at each iteration,

    \tilde{y}_{im} = -\left[ \frac{\partial \Psi(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = f_{m-1}(x)}    (5.7)

and

    a_m = \arg\min_{a, \rho} \sum_{i=1}^{N} \big[ \tilde{y}_{im} - \rho\, h(x_i; a) \big]^2,    (5.8)

where N is the number of training samples. The optimal coefficient \rho_m is computed by line search:

    \rho_m = \arg\min_{\rho} \sum_{i=1}^{N} \Psi\big(y_i, f_{m-1}(x_i) + \rho\, h(x_i; a_m)\big).    (5.9)

We use a decision tree as the base learner h(x; a_m) in (5.6), parameterized by the splitting variables and the corresponding split points. At each iteration m, a decision tree partitions the entire feature space into disjoint regions [R_{lm}]_{l=1}^{L} and predicts according to the region that contains the observed feature vector x:

    h(x; [R_{lm}]_1^L) = \sum_{l=1}^{L} \bar{y}_{lm}\, I(x \in R_{lm}).    (5.10)

Since a decision tree predicts a constant value \bar{y}_{lm} within each region R_{lm}, we can rewrite (5.9) as

    \gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} \Psi\big(y_i, f_{m-1}(x_i) + \gamma\big).    (5.11)

We add a regularization term, the shrinkage parameter 0 < \nu < 1, to control the learning rate of the update procedure:

    f_m(x) = f_{m-1}(x) + \nu \cdot \gamma_{lm}\, I(x \in R_{lm}).    (5.12)

We use Algorithm 5.1 to estimate the ranking model parameters for decision trees with L terminal nodes.

Algorithm 5.1 Gradient Tree Boosting
1: f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} \Psi(y_i, \gamma)
2: for m = 1 to M do
3:     \tilde{y}_{im} = -\left[ \partial \Psi(y_i, f(x_i)) / \partial f(x_i) \right]_{f(x) = f_{m-1}(x)}, i = 1, \ldots, N
4:     \{R_{lm}\}_1^L = L-terminal node tree\big(\{\tilde{y}_{im}, x_i\}_1^N\big)
5:     \gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} \Psi(y_i, f_{m-1}(x_i) + \gamma)
6:     f_m(x) = f_{m-1}(x) + \nu \cdot \sum_{l=1}^{L} \gamma_{lm}\, I(x \in R_{lm})
7: end for

Gradient boosted decision trees (GBDT) produce competitive, highly robust, and interpretable procedures for regression and classification [163], and are particularly useful for settings with large amounts of data and a dense feature space.
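A compact rendering of Algorithm 5.1 follows, specialized to the squared-error loss \Psi(y, f) = (y - f)^2 / 2, for which the pseudo-responses in line 3 reduce to ordinary residuals and the leaf values \gamma in line 5 are fitted directly by the least-squares tree. It uses scikit-learn's decision tree as the L-terminal-node base learner; this is an illustrative sketch, not the production ranking system.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost(X, y, M=100, L=8, nu=0.1):
    """Algorithm 5.1 for squared-error loss.
    Returns the initial guess f0 and the list of fitted trees."""
    f0 = y.mean()                  # line 1: argmin_g sum_i (y_i - g)^2
    f = np.full(len(y), f0)
    trees = []
    for _ in range(M):             # line 2
        residuals = y - f          # line 3: negative gradient of Psi
        tree = DecisionTreeRegressor(max_leaf_nodes=L).fit(X, residuals)  # lines 4-5
        f += nu * tree.predict(X)  # line 6: shrunken additive update
        trees.append(tree)
    return f0, trees

def predict(f0, trees, X, nu=0.1):
    return f0 + nu * sum(tree.predict(X) for tree in trees)
```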
5.4.4 Relation to Graph-based Models

ClickRank has a number of advantages compared to approaches that estimate web page authority from explicit graph formulations, such as PageRank and BrowseRank. First, ClickRank is data driven and does not embed assumptions about the traversal scheme over the web. Second, it is significantly more computationally efficient: local ClickRank values are inexpensive to calculate and can be derived independently for each session. This makes ClickRank well suited to distributed computing (e.g., the MapReduce framework [164] that we used to implement the experiments in this dissertation), and it is also memory friendly. Furthermore, the addition of new data requires only incremental computation of local ClickRank values on the newly logged web sessions and their combination with those from existing sessions, rather than re-computation of the entire model (as would be needed by PageRank and BrowseRank). This is particularly important for the processing of web-scale user browsing information, which changes constantly.

5.5 Experiments

We demonstrate the effectiveness of the ClickRank algorithm in three core aspects of web search: site ranking, page ranking, and mining new, popular web pages. In our experiments, we assume that the dwell time on a page and the page load time are two independent random processes, and we define the temporal weight function in (5.1) as

    w_t(p, s) = (1 - e^{-\lambda_1 t_d})\, e^{-\lambda_2 t_l}\, I(t(p) \in T),    (5.13)

where t_d and t_l are the dwell time on the page and the page load time, normalized w.r.t. the entire session, t(p) is the timestamp of the event, and T denotes the time range of interest.

In the following experiments, we used the same six months of aggregate user browsing logs collected from the Yahoo! Toolbar. In total, the data comprises more than 3.3 billion web sessions. These sessions contain 16.3 million unique web sites and 3.1 billion unique web pages.
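The temporal weight of Equation (5.13) can be sketched as follows. The rate parameters \lambda_1 and \lambda_2 are illustrative assumptions, since the dissertation does not specify the values used in production.

```python
import math

def temporal_weight(event, session, lam1=5.0, lam2=5.0, time_range=None):
    """Eq. (5.13): reward long dwell times, penalize slow page loads.
    Dwell and load times are normalized w.r.t. the session duration;
    events outside the time range of interest receive zero weight."""
    if time_range is not None and event["timestamp"] not in time_range:
        return 0.0
    t_d, t_l = event["norm_dwell_time"], event["norm_load_time"]
    return (1.0 - math.exp(-lam1 * t_d)) * math.exp(-lam2 * t_l)
```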
In the following experiments, we used the same six months of aggregate user browsing logs collected from the Yahoo! toolbar. In total, the data comprises more than 3.3 billion web sessions, which contain 16.3 million unique web sites and 3.1 billion unique web pages.

5.5.1 Site Ranking

We computed the ClickRank for each web site and ordered the sites by this value. We list the top-ranking 20 sites computed with ClickRank and compare them to those computed by PageRank (using the web link graph as constructed by the Yahoo! crawler) and by BrowseRank (reproduced from the list reported in [141], computed on a total of 5.6 million web sites). Results are listed in Table 5.3, following the same convention used in [141].

Table 5.3: Top-ranked sites computed by different algorithms.

    Rank  PageRank            BrowseRank       ClickRank
    1     adobe.com           myspace.com      yahoo.com
    2     wordpress.com       msn.com          google.com
    3     w3.org              yahoo.com        myspace.com
    4     miibeian.gov.cn     youtube.com      live.com
    5     statcounter.com     live.com         youtube.com
    6     phpbb.com           facebook.com     facebook.com
    7     baidu.com           google.com       msn.com
    8     php.net             ebay.com         friendster.com
    9     microsoft.com       hi5.com          pogo.com
    10    mysql.com           bebo.com         aol.com
    11    mapquest.com        orkut.com        microsoft.com
    12    cnn.com             aol.com          wikipedia.org
    13    google.com          friendster.com   ebay.com
    14    blogger.com         craigslist.org   craigslist.org
    15    paypal.com          google.co.th     hi5.com
    16    macromedia.com      microsoft.com    go.com
    17    jalbum.net          comcast.net      ask.com
    18    nytimes.com         wikipedia.org    google.co.th
    19    simplemachines.org  pogo.com         comcast.net
    20    yahoo.com           photobucket.com  orkut.com

On the task of site ranking, our results confirm the finding reported in [141] that link analysis algorithms like PageRank have a strong bias toward sites with a high degree of inlinks and do not necessarily reflect the degree of actual usage. This is a fundamental limitation of the web link graph, from which PageRank and other link-based authority estimation algorithms are derived.

The site ranked lists computed by the ClickRank and BrowseRank algorithms are strikingly similar, with 18 overlapping entries among the top 20 sites. Both ranked lists correlate better with web users' informational needs than PageRank does, as both are computed from user behavior data. Some ranking differences between BrowseRank and ClickRank in this table can be attributed to their data sources: BrowseRank is computed from a set of users who installed the Live toolbar, and who are presumably users of live.com and msn.com services; similarly, ClickRank is computed from a set of Yahoo! users.

One key difference between the results produced by ClickRank and BrowseRank is that ClickRank consistently ranks the starting points of users' web experience higher. One of the major search engines, ask.com, does not even appear among the top 20 sites produced by BrowseRank.

ClickRank has a significantly lower computational cost than PageRank or BrowseRank: it requires only one pass through the data and does not require building intermediate graphs or solving for stationary probability distributions. This allows rapid adaptation of ClickRank values to new content; as noted earlier, newly collected browsing information does not require recomputation over the entire data. The overall running times of our ClickRank implementation for ranking the 16.3 million web sites in this section and the 3.1 billion web pages in the page ranking test of the next section are 56 minutes and 1 hour 32 minutes, respectively, using the MapReduce framework on 300 Hadoop [165] nodes. To the best of our knowledge, these are the best published running times for page importance ranking at web scale.

In a realistic, production-grade search engine environment, it is important to minimize the footprint of every relevance feature used by the ranking model so that latency and memory requirements are met. Often, floating-point values are compressed or discretized into a small dynamic range that can be represented with as few bits as possible. To this end, and to evaluate the ranking performance of ClickRank as deployed in a production system, we quantize the computed ClickRank score for each web page into an unsigned byte within the range [0, ..., 255]. The distribution of these values is shown in Figure 5.3.

Figure 5.3: Distribution of discretized ClickRank scores over a large collection of judged documents.

5.5.2 Page Ranking

5.5.2.1 Evaluation Methodology

We comprehensively evaluated the performance of ClickRank in conjunction with several hundred features used in commercial search engines. To gain further insight, we quantified the search improvement from ClickRank over a state-of-the-art baseline system and measured its relative variable significance against this large pool of ranking features. This evaluation scheme gives more realistic, quantitative results than common published evaluations that use a limited feature set as the baseline; for instance, [141] employs the single feature of BM25 [166] as the relevance baseline in its comparison.

We used discounted cumulative gain (DCG) and normalized discounted cumulative gain (NDCG), two widely used search engine relevance measures [167], to quantitatively evaluate ranking performance. Given a query and the ranked list of returned documents, the DCG(K) score for the query is computed as

    DCG(K) = \sum_{k=1}^{K} g_k / \log_2(1 + k),          (5.14)

where g_k is the weight of the document at rank k. A five-grade score is assigned to each document based on its degree of relevance.

We trained ranking models using gradient boosted decision trees on the baseline system with all existing features, and on the alternative system that includes one additional ClickRank feature. Training and test data were partitioned through cross-validation, and we used identical parameter settings in all the following comparison experiments.
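For reference, a minimal sketch of the DCG(K) measure of Equation (5.14) and its normalized variant is given below. The mapping from the five-grade editorial score to the gain g_k is engine-specific and not spelled out here, so the sketch takes the per-rank gains directly.

    import math

    def dcg(gains, K):
        """DCG(K) of Equation (5.14); gains[k-1] is the weight g_k of the
        document at rank k."""
        return sum(g / math.log2(1 + k) for k, g in enumerate(gains[:K], start=1))

    def ndcg(gains, K):
        """NDCG: DCG(K) normalized by the DCG of the ideal (sorted) ranking."""
        ideal = dcg(sorted(gains, reverse=True), K)
        return dcg(gains, K) / ideal if ideal > 0 else 0.0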
Table 5.4 provides a high-level summary of this experiment.

Table 5.4: Summary of the page ranking experiment.

    Total number of queries                     9,041
    Number of affected queries                  7,341
    Total number of judged documents            429,985
    Number of documents with ClickRank scores   268,929
    Cross-validation split ratio                3

To quantify the relative importance S_i of each ranking feature x_i, we used the following measure of variable importance for decision trees [168]:

    S_i^2 = (1/M) \sum_{m=1}^{M} \sum_{n=1}^{L-1} [ w_l w_r / (w_l + w_r) ] (\bar{y}_l - \bar{y}_r)^2 I(v_n = i),          (5.15)

where v_n is the splitting variable at non-terminal node n, \bar{y}_l and \bar{y}_r are the means of the regression responses in the left and right subtrees, respectively, and w_l and w_r are the corresponding sums of the weights.

5.5.2.2 Data Preparation

We used a set of 9,041 randomly sampled queries from a search log. For each query, 5–20 web pages were independently judged by a panel of editors and assigned one of the five relevance scores.

5.5.2.3 Results and Discussion

Using ClickRank as an additional relevance feature brings 1.02%, 0.97%, 1.11%, and 1.33% web search improvements in DCG(1), DCG(5), DCG(10), and NDCG, respectively, on top of a state-of-the-art ranking model, one that already incorporates hundreds of features derived from content (e.g., anchor, title, body, and section), from the link structure of the web, from search engine query logs, and from other sources. The reported gains are statistically significant.

The gains in retrieval performance from ClickRank on top of a competitive search engine are summarized in Table 5.5, broken down by query length. These are substantial improvements in the context of commercial web search: our strong baseline incorporates a feature set of several hundred signals tuned over a long period of time. In addition, 81.2% of the more than 9,000 queries are affected in the alternative experiment, demonstrating the generality of ClickRank. Furthermore, we observe higher improvements on long queries, which are typically much more challenging for search engines.

Table 5.5: ClickRank delivers statistically significant improvements over a state-of-the-art baseline.

    Query    Number of  Affected  DCG(1)   DCG(5)   DCG(10)  NDCG     p-value
    Length   Queries    Queries
    1        1484       1232      0.448%   0.713%   1.002%   0.378%   5.33 x 10^-2
    2        2992       2450      0.560%   0.993%   1.121%   1.071%   4.65 x 10^-4
    3        2153       1722      1.618%   1.076%   1.406%   2.177%   1.10 x 10^-4
    4+       2412       1937      0.918%   0.861%   0.784%   1.433%   1.61 x 10^-5
    All      9041       7341      1.020%   0.966%   1.105%   1.331%   9.98 x 10^-5

We experimented with several variants of ClickRank and observed that it consistently ranks among the top features in variable significance as calculated by Equation (5.15). For example, in the experiment shown in Table 5.5, the ClickRank feature is ranked 25th in variable importance among several hundred other features, significantly higher than the highest-ranking feature derived from page visit counts (ranked 56th) and a feature based on propagation of authority through the link graph (ranked 108th). These results demonstrate the significance of session-based web importance estimation and show that ClickRank captures novel user preference knowledge not identified through other modeling techniques.
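One way to compute the importance measure of Equation (5.15) from a fitted ensemble is sketched below. It assumes the scikit-learn tree representation (tree_.feature, tree_.children_left, and so on) rather than the internal toolkit used in these experiments, but the accumulation over internal splits is the same.

    import numpy as np

    def split_importance(trees, n_features):
        """S_i of Equation (5.15): for every internal node splitting on
        feature i, accumulate w_l w_r / (w_l + w_r) * (ybar_l - ybar_r)^2,
        then average over the M trees and take the square root."""
        S2 = np.zeros(n_features)
        for tree in trees:
            t = tree.tree_
            for n in range(t.node_count):
                l, r = t.children_left[n], t.children_right[n]
                if l == -1:               # leaf node: no split
                    continue
                wl = t.weighted_n_node_samples[l]
                wr = t.weighted_n_node_samples[r]
                yl, yr = t.value[l][0][0], t.value[r][0][0]
                S2[t.feature[n]] += wl * wr / (wl + wr) * (yl - yr) ** 2
        return np.sqrt(S2 / len(trees))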
Figure 5.4: Dynamic quicklinks discovered using ClickRank by recency ranking. (a) Quicklinks for August 10, 2008. (b) Quicklinks for August 16, 2008.

5.5.3 Mining Dynamic Quicklinks

Many commercial web search engines supply a set of "quicklinks," direct access links to certain pages within a site, in addition to the search result itself. Typically, these quicklinks are pointers to frequently visited destinations within the host, mined from query or clickthrough logs. This method, however, has two major limitations. First, query logs do not contain user activities beyond the scope of interactions with search engines, and such activities account for the vast majority (more than 95%, as shown earlier) of real web traffic. Second, results computed from query logs have a strong bias toward old navigational links within a site, since these receive more clicks within the visibility range of search engines.

We demonstrate a novel application of ClickRank for discovering and displaying dynamic quicklinks in web search results through recency ranking. The idea is to adapt the time range of the indicator function in Equation (5.13) to the content refresh rate found by web crawlers. In addition to normal search results, the system displays highly ranked web pages computed by ClickRank as quicklinks to the user.

Figure 5.4 shows search results with quicklinks discovered by the system in response to the query "beijing olympic 2008" on two days during the summer Olympic Games in 2008, using a time range of 24 hours. Quicklinks mined by ClickRank are displayed alongside the most frequently clicked navigational links. The quicklinks effectively capture the event highlights, while the most frequently clicked navigational links remain unchanged. The quicklink results produced by ClickRank are more meaningful in suggesting content of potential interest to web users than results that merely reflect the structural properties of the web site.

5.6 Summary

In this chapter, we expanded the use of general web user browsing information for discovering session models driven by structured user intents, and proposed user preference models that incorporate rich session context for web search ranking. We presented characteristics of general web sessions and revealed interesting user behavior patterns mined from sessions. We introduced ClickRank, an efficient, scalable algorithm for estimating web page and web site importance based on user preference judgments mined from session context. ClickRank is based on a data-driven intentional surfer model, and is empirically shown to be an effective, novel ranking feature even on top of a highly competitive baseline system employing hundreds of ranking features. We also discussed the advantages of ClickRank over existing importance ranking approaches: it is efficient to compute, and it delivers highly competitive ranking results compared to state-of-the-art models based on web graphs. Finally, we demonstrated a promising application that mines dynamic quicklinks for enhancing the web user experience. These results on data-driven user behavior modeling highlight the prominent role of mining web user behavior in understanding user intents and in next-generation web search.

Chapter 6

Summary of Contributions

This dissertation has addressed two critical challenges in the analysis and retrieval of unstructured, broad-domain document collections: the recognition of diverse content and the use of broad contextual knowledge. Our key contributions include:

1. Obtaining a high-level content interpretation of an image offers an important starting point in many computer vision and image analysis problems.
We proposed a novel approach for document image content categorization, using image descriptors constructed from a lexicon of shape features. We encoded local text structures using scale and rotation invariant lexical words, and learned a concise, structurally indexed shape lexicon. Our approach is extensible and does not require skew correction, scale normalization, or segmentation. In two challenging problems, content category recognition of diverse web images and language identification of documents with mixed machine printed text and handwriting, our approach's performance exceeded the state of the art.

2. We explored a new direction of detecting and matching evidentiary visual content for document indexing and retrieval. Our study on signature-based document image retrieval addressed two important problems. First, we proposed a novel multi-scale approach to detect and segment signatures jointly from documents. Our approach captures structural saliency using a signature production model and is computationally tractable. Second, we treated the problem of signature retrieval in the unconstrained setting of translation, scale, and rotation invariant deformable shape matching. We proposed two novel measures of shape dissimilarity based on anisotropic scaling and registration residual error, and presented a supervised learning framework for combining complementary shape information from different dissimilarity metrics. Our approach demonstrated state-of-the-art performance in the tasks of signature matching and signature verification.

3. We presented a model-based approach for recognizing relevant named entities from unstructured document images by combining rich page layout features in the image space with OCR text. We demonstrated our named entity extraction approach in an automated expense reimbursement system and evaluated its performance on large collections of real receipt images.

4. We proposed a computational framework for incorporating contextual knowledge gained from general web user behavior data to improve ranking and other aspects of the web search experience, with the objective of constructing aggregate models by analyzing individual user sessions. We introduced ClickRank, an efficient, scalable algorithm for estimating web page and web site importance, and laid out its theoretical foundation. ClickRank is shown to contribute significantly as a novel web search feature.

6.1 Possible Extensions

Some additional work has already been discussed at the end of the relevant chapters, as extensions that follow naturally from the technical content. In this section, we consider topics somewhat further afield from the specific topics of this dissertation, but ones that continue its underlying ideas.

Content Categorization of Printed Text and Handwriting. The approach presented in Chapter 2 worked well for identifying the primary script or language of a document in the presence of mixed printed text and handwriting. The abstraction of shape by line segment features and the representation by geometrically invariant descriptors were shown to be effective on diverse text content. In principle, our shape lexicon approach can be naturally extended to the problem of categorizing text and handwriting, given a region of text. We are collaborating with another Ph.D. student on this topic, and the preliminary results are promising.
We have found that the problem of distinguishing machine printed text from handwriting becomes extremely challenging when the region contains only a minimal amount of text. The limited features observed may be too sparse for reliable classification. It would be useful to include additional features, such as those derived from the uniformity of the text. These cases require further investigation.

Indexing Scheme for Visual Content. A versatile indexing scheme for visual content has been a fundamental challenge in content-based image retrieval. Because image data is complex and high dimensional, obtaining a finer semantic interpretation for compact indexing often presents a difficult challenge. In Chapter 3, we used a set of feature points extracted from signatures for matching and retrieval, which is a simple form of indexing. From the retrieval point of view, it is helpful to include a blend of other useful information in the index, including the text within the context and image features (e.g., dimensions, location, and color information). A flexible engineering design for visual content indexing facilitates query processing and contributes to the success of content-based image retrieval in practice.

Customized Web Search Using Context. The experiments in Chapter 5 demonstrated two important results. First, a high percentage of sessions are associated with structured actions (e.g., checking email, looking for information, and shopping). Second, using models learned from aggregate user behavior data significantly improves search quality for a large number of queries. The natural extensions would be 1) how to predict a structured intent for the current query, and 2) how to develop models from user behavior data to improve customized search results for that structured intent. The answers to both questions involve learning models from aggregate user behavior (for coverage and freshness) and from individual behavior history.

Appendix A

Proof of Theorems

Theorem 3.1 Given two intersecting lines t_1(x, y) and t_2(x, y) in the forms p_1(x - x_1) + q_1(y - y_1) = 0 and p_2(x - x_2) + q_2(y - y_2) = 0, respectively, let L(x, y) be the straight line that passes through the points (x_1, y_1) and (x_2, y_2). Then the quadratic equation

    C(x, y) \equiv L^2(x, y) - \lambda t_1(x, y) t_2(x, y) = 0

represents a conic section, and the range of \lambda for which C(x, y) is an ellipse is 0 < \lambda < \lambda_0, where

    \lambda_0 = 4 [p_1(x_2 - x_1) + q_1(y_2 - y_1)] [p_2(x_1 - x_2) + q_2(y_1 - y_2)] / (p_1 q_2 - p_2 q_1)^2.

Proof. A quadratic equation C(x, y) in two variables x and y can be written as

    C(x, y) = a x^2 + 2 h x y + b y^2 + 2 g x + 2 f y + c = 0,          (A.1)

where a, b, c, f, g, h are linear functions of \lambda and of the parameter set (x_1, y_1), (x_2, y_2), (p_1, q_1), (p_2, q_2). We can rewrite Equation (A.1) as

    x^T A x + 2 f^T x + c = 0,          (A.2)

where x = (x, y)^T, f = (g, f)^T, and the 2 x 2 matrix A is

    A = [ a  h
          h  b ].          (A.3)

When C(x, y) is an ellipse with center x_0 = (x_0, y_0)^T, (A.2) can be written as

    (x - x_0)^T A (x - x_0) = d,          (A.4)

i.e.,

    x^T A x - 2 x_0^T A x + x_0^T A x_0 - d = 0.          (A.5)

Comparing (A.2) with (A.5), we have A x_0 = -f, giving

    x_0 = -A^{-1} f,          (A.6)

and

    d = f^T A^{-1} f - c.          (A.7)

Equation (A.5) represents a parabola if A is singular, since the center x_0 is then at infinity. It represents an ellipse if A is positive definite. When \lambda = 0, the matrix A is singular. As \lambda increases, A first becomes positive definite, and then becomes singular again at a certain positive value \lambda_0. For \lambda larger than \lambda_0, the matrix A is indefinite, and Equation (A.5) becomes a hyperbola. Thus, the range of \lambda for which (A.5) is an ellipse is the interval 0 < \lambda < \lambda_0.

We can find \lambda_0 by solving

    det(A) = a b - h^2 = 0,          (A.8)

where

    a(\lambda) = (y_1 - y_2)^2 - \lambda p_1 p_2,          (A.9)
    b(\lambda) = (x_2 - x_1)^2 - \lambda q_1 q_2,          (A.10)
    h(\lambda) = (y_1 - y_2)(x_2 - x_1) - (1/2) \lambda (p_1 q_2 + p_2 q_1).          (A.11)

One root of (A.8) is zero, and the other root is strictly positive. The positive root is the required \lambda_0 and is given by

    \lambda_0 = 4 [p_1(x_2 - x_1) + q_1(y_2 - y_1)] [p_2(x_1 - x_2) + q_2(y_1 - y_2)] / (p_1 q_2 - p_2 q_1)^2.
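As a numerical sanity check of the theorem (not part of the original proof), the sketch below builds A(\lambda) from Equations (A.9)-(A.11) for an arbitrary configuration of two lines and two points, computes \lambda_0 from the closed form above, and verifies that A is positive definite exactly when 0 < \lambda < \lambda_0. The specific parameter values are arbitrary test inputs.

    import numpy as np

    # Arbitrary test configuration: (x_k, y_k) on line t_k with normal (p_k, q_k).
    x1, y1, p1, q1 = 0.0, 0.0, 1.0, 2.0
    x2, y2, p2, q2 = 3.0, 1.0, -1.0, 1.0

    def A(lam):
        """Quadratic-form matrix of C(x, y), per Equations (A.9)-(A.11)."""
        a = (y1 - y2) ** 2 - lam * p1 * p2
        b = (x2 - x1) ** 2 - lam * q1 * q2
        h = (y1 - y2) * (x2 - x1) - 0.5 * lam * (p1 * q2 + p2 * q1)
        return np.array([[a, h], [h, b]])

    lam0 = (4 * (p1 * (x2 - x1) + q1 * (y2 - y1))
              * (p2 * (x1 - x2) + q2 * (y1 - y2))
              / (p1 * q2 - p2 * q1) ** 2)

    for lam in (0.5 * lam0, 1.5 * lam0):
        is_ellipse = bool(np.all(np.linalg.eigvalsh(A(lam)) > 0))
        print(lam, is_ellipse)   # positive definite iff 0 < lam < lam0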
Lemma 5.1 The weight function w_r(.) has the following properties:

(a) w_r(i, s) is a monotonically decreasing function of the rank r(i) (1 <= r(i) <= n) of the event in session s, and w_r(i-1, s) - w_r(i, s) = w_r(i, s) - w_r(i+1, s);

(b) \sum_{i=1}^{n} w_r(i, s) = 1, where n is the number of events in session s.

Proof. (a) Clearly w_r(i, s) is monotonically decreasing because r(i) is strictly increasing, and for 2 <= i <= n - 1,

    w_r(i-1, s) - w_r(i, s) = w_r(i, s) - w_r(i+1, s) = 2 / (n(n+1)).          (A.12)

(b)

    \sum_{i=1}^{n} w_r(i, s) = \sum_{i=1}^{n} 2(n + 1 - r(i)) / (n(n+1))
                             = n \cdot 2(n+1) / (n(n+1)) - [2 / (n(n+1))] \sum_{i=1}^{n} r(i)
                             = 2 - [2 / (n(n+1))] \cdot n(n+1)/2 = 1.
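Both properties of Lemma 5.1 can be checked exactly with rational arithmetic; in the sketch below, n = 7 is an arbitrary session length and the sequential rank r(i) = i is assumed.

    from fractions import Fraction

    def rank_weight(r, n):
        """Rank weight of Lemma 5.1: w_r = 2(n + 1 - r) / (n(n + 1))."""
        return Fraction(2 * (n + 1 - r), n * (n + 1))

    n = 7
    ws = [rank_weight(r, n) for r in range(1, n + 1)]
    assert sum(ws) == 1                            # property (b)
    assert all(ws[i - 1] - ws[i] == ws[i] - ws[i + 1]
               for i in range(1, n - 1))           # property (a)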
Bibliography

[1] D. Doermann, E. Rivlin, and A. Rosenfeld, "The function of documents," Int. J. Computer Vision, vol. 16, pp. 799–814, 1998.

[2] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay, "Accurately interpreting clickthrough data as implicit feedback," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 154–161.

[3] S. V. Rice, G. Nagy, and T. A. Nartker, Optical Character Recognition: An Illustrated Guide to the Frontier. Norwell, Massachusetts: Kluwer Academic Publishers, 1999.

[4] G. Zhu, Y. Zheng, D. Doermann, and S. Jaeger, "Multi-scale structural saliency for signature detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.

[5] G. Zhu and D. Doermann, "Automatic document logo detection," in Proceedings of the International Conference on Document Analysis and Recognition, 2007, pp. 864–868.

[6] G. Zhu, S. Jaeger, and D. Doermann, "A robust stamp detection framework on degraded documents," in Proceedings of the Document Recognition and Retrieval Conference, vol. 6067, 2006, pp. 1–9.

[7] R. Plamondon and S. N. Srihari, "On-line and off-line handwriting recognition: A comprehensive survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 63–84, 2000.

[8] L. Vincent, "Google Book Search: Document understanding on a massive scale," in Proceedings of the International Conference on Document Analysis and Recognition, 2007, pp. 819–823.

[9] G. Zhu, T. J. Bethea, and V. Krishna, "Extracting relevant named entities for automated expense reimbursement," in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2007, pp. 1004–1012.

[10] G. Zhu, X. Yu, Y. Li, and D. Doermann, "Language identification for handwritten document images using a shape codebook," Pattern Recognition, vol. 42, no. 12, pp. 3184–3191, 2009.

[11] Y. Li, Y. Zheng, D. Doermann, and S. Jaeger, "Script-independent text line segmentation in freestyle handwritten documents," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 8, pp. 1313–1329, 2008.

[12] U. Marti and H. Bunke, "The IAM-database: An English sentence database for off-line handwriting recognition," Int. J. Document Analysis and Recognition, vol. 5, pp. 39–46, 2006. Available at http://www.iam.unibe.ch/~fki/iamDB/.

[13] J. Hochberg, K. Bowers, M. Cannon, and P. Kelly, "Script and language identification for handwritten document images," Int. J. Document Analysis and Recognition, vol. 2, no. 2-3, pp. 45–52, 1999.

[14] X. Li, Y.-Y. Wang, and A. Acero, "Learning query intent from regularized click graphs," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 339–346.

[15] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.

[16] T. Tan, "Rotation invariant texture features and their use in automatic script identification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 7, pp. 751–756, 1998.

[17] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, 2002.

[18] M. Heikkila and M. Pietikainen, "A texture-based method for modeling the background and detecting moving objects," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 657–662, 2006.

[19] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 12, pp. 2037–2041, 2006.

[20] A. Spitz, "Determination of script and language content of document images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 3, pp. 235–245, 1997.

[21] J. Ding, L. Lam, and C. Suen, "Classification of oriental and European scripts by using characteristic features," in Proceedings of the International Conference on Document Analysis and Recognition, 1997, pp. 1023–1027.

[22] D.-S. Lee, C. R. Nohl, and H. S. Baird, Language Identification in Complex, Unoriented, and Degraded Document Images, ser. Document Analysis Systems II, J. J. Hull and S. L. Taylor, Eds. River Edge, New Jersey: World Scientific Publishing Co. Pte. Ltd, 1998.

[23] C. Suen, S. Bergler, N. Nobile, B. Waked, C. Nadal, and A. Bloch, "Categorizing document images into script and language classes," in Proceedings of the International Conference on Document Analysis and Recognition, 1998, pp. 297–306.

[24] S. Lu and C. Tan, "Script and language identification in noisy and degraded document images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 1, pp. 14–24, 2008.

[25] A. Busch, W. Boles, and S. Sridharan, "Texture for script identification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 11, pp. 1720–1732, 2005.

[26] J. Hochberg, P. Kelly, T. Thomas, and L. Kerns, "Automatic script identification from document images using cluster-based templates," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 2, pp. 176–181, 1997.

[27] H. Ma and D. Doermann, "Word level script identification on scanned document images," in Proc. Document Recognition and Retrieval (SPIE), 2004, pp. 124–135.

[28] S. Jaeger, H. Ma, and D. Doermann, "Identifying script on word-level with informational confidence," in Proceedings of the International Conference on Document Analysis and Recognition, 2005, pp. 416–420.

[29] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid, "Groups of adjacent contour segments for object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 1, pp. 36–51, 2008.
[30] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.

[31] L. Schomaker, M. Bulacu, and K. Franke, "Automatic writer identification using fragmented connected-component contours," in Proceedings of the International Workshop on Frontiers in Handwriting Recognition, 2004, pp. 185–190.

[32] T. Kohonen, Self-Organization and Associative Memory, 3rd ed. New York, New York: Springer-Verlag New York, Inc., 1989.

[33] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, no. 6, pp. 679–697, 1986.

[34] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.

[35] S. Yu and J. Shi, "Multiclass spectral clustering," in Proceedings of the IEEE International Conference on Computer Vision, 2003, pp. 11–17.

[36] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support Vector Machines, Online, Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.

[37] T. Rath and R. Manmatha, "Word image matching using dynamic time warping," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2003, pp. 521–527.

[38] J. Chan, C. Ziftci, and D. Forsyth, "Searching off-line Arabic documents," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 1455–1462.

[39] G. Zhu, Y. Zheng, and D. Doermann, "Signature-based document image retrieval," in Proceedings of the European Conference on Computer Vision, vol. 3, 2008, pp. 752–765.

[40] K. Guo, D. Doermann, and A. Rosenfeld, "Forgery detection by local correspondence," Int. J. Pattern Recognition and Artificial Intelligence, vol. 15, no. 4, pp. 579–641, 2001.

[41] B. Fang, C. Leung, Y. Tang, K. Tse, P. Kwok, and Y. Wong, "Off-line signature verification by the tracking of feature and stroke positions," Pattern Recognition, vol. 36, no. 1, pp. 91–101, 2003.

[42] D. Doermann and A. Rosenfeld, "Recovery of temporal information from static images of handwriting," Int. J. Computer Vision, vol. 15, no. 1-2, pp. 143–164, 1995.

[43] G. Nagy, "Twenty years of document image analysis in PAMI," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 38–62, 2000.

[44] R. Sabourin, G. Genest, and F. Preteux, "Off-line signature verification by local granulometric size distributions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 9, pp. 976–988, 1997.

[45] M. Munich and P. Perona, "Continuous dynamic time warping for translation-invariant curve alignment with applications to signature verification," in Proceedings of the IEEE International Conference on Computer Vision, 1999, pp. 108–115.

[46] M. Kalera, S. N. Srihari, and A. Xu, "Off-line signature verification and identification using distance statistics," Int. J. Pattern Recognition and Artificial Intelligence, vol. 18, no. 7, pp. 1339–1360, 2004.

[47] S. Chen and S. N. Srihari, "Use of exterior contours and word shape in off-line signature verification," in Proceedings of the International Conference on Document Analysis and Recognition, 2005, pp. 1280–1284.

[48] A. Shanker and A. Rajagopalan, "Off-line signature verification using DTW," Pattern Recognition Letters, vol. 28, no. 12, pp. 1407–1414, 2007.

[49] K. Siddiqi, A. Shokoufandeh, S. Dickinson, and S. Zucker, "Shock graphs and shape matching," Int. J. Computer Vision, vol. 35, no. 1, pp. 13–32, 1999.
[50] T. Sebastian, P. Klein, and B. Kimia, "Recognition of shapes by editing their shock graphs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 5, pp. 550–571, 2004.

[51] L. Latecki, R. Lakamper, and U. Eckhardt, "Shape descriptors for non-rigid shapes with a single closed contour," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000, pp. 424–429.

[52] E. Petrakis, A. Diplaros, and E. Milios, "Matching and retrieval of distorted and occluded shapes using dynamic programming," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 11, pp. 1501–1516, 2002.

[53] T. Sebastian, P. Klein, and B. Kimia, "On aligning curves," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 1, pp. 116–124, 2003.

[54] H. Ling and D. Jacobs, "Shape classification using the inner-distance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 286–299, 2007.

[55] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 4, pp. 509–522, 2002.

[56] Y. Zheng and D. Doermann, "Robust point matching for non-rigid shapes by preserving local neighborhood structures," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 4, pp. 643–649, 2006.

[57] A. Rosenfeld, R. Hummel, and S. Zucker, "Scene labeling by relaxation operations," IEEE Trans. System, Man and Cybernetics, vol. 6, no. 6, pp. 420–433, 1976.

[58] G. Zhu, Y. Zheng, D. Doermann, and S. Jaeger, "Signature detection and matching for document image retrieval," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 11, pp. 2015–2031, 2009.

[59] A. Sha'ashua and S. Ullman, "Structural saliency: The detection of globally salient structures using a locally connected network," in Proceedings of the IEEE International Conference on Computer Vision, 1988, pp. 321–327.

[60] T. Alter and R. Basri, "Extracting salient curves from images: An analysis of the saliency network," Int. J. Computer Vision, vol. 27, no. 1, pp. 51–69, 1998.

[61] P. Parent and S. Zucker, "Trace inference, curvature consistency, and curve detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 8, pp. 823–839, 1989.

[62] G. Guy and G. Medioni, "Inferring global perceptual contours from local features," Int. J. Computer Vision, vol. 20, no. 1-2, pp. 113–133, 1996.

[63] J. Elder and S. Zucker, "The effect of contour closure on the rapid discrimination of two-dimensional shapes," Vision Res., vol. 33, no. 7, pp. 981–991, 1993.

[64] L. Williams and D. Jacobs, "Stochastic completion fields: A neural model of illusory contour shape and salience," Neural Computation, vol. 9, pp. 849–870, 1997.

[65] L. Williams and K. Thornber, "A comparison of measures for detecting natural shapes in cluttered backgrounds," Int. J. Computer Vision, vol. 33, no. 2-3, pp. 81–96, 2000.

[66] J. Shotton, A. Blake, and R. Cipolla, "Contour-based learning for object detection," in Proceedings of the IEEE International Conference on Computer Vision, 2005, pp. 503–510.

[67] C. Zahn and R. Roskies, "Fourier descriptors for plane closed curves," IEEE Trans. Computing, vol. 21, no. 3, pp. 269–281, 1972.

[68] C. Lin and R. Chellappa, "Classification of partial 2-D shapes using Fourier descriptors," IEEE Trans. Pattern Anal. Mach. Intell., vol. 9, no. 5, pp. 686–690, 1987.

[69] Y. Lamdan, J. Schwartz, and H. Wolfson, "Object recognition by affine invariant matching," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1988, pp. 335–344.
[70] J. Gorman, R. Mitchell, and F. Kuhl, "Partial shape recognition using dynamic programming," IEEE Trans. Pattern Anal. Mach. Intell., vol. 10, no. 2, pp. 257–266, 1988.

[71] D. Sharvit, J. Chan, H. Tek, and B. Kimia, "Symmetry-based indexing of image database," J. Visual Commun. and Image Representation, vol. 9, no. 4, pp. 366–380, 1998.

[72] G. Mori, S. Belongie, and J. Malik, "Efficient shape matching using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 11, pp. 1832–1837, 2005.

[73] G. Borgefors, "Hierarchical chamfer matching: A parametric edge matching algorithm," IEEE Trans. Pattern Anal. Mach. Intell., vol. 10, no. 6, pp. 849–865, 1988.

[74] D. Huttenlocher, R. Lilien, and C. Olson, "Comparing images using the Hausdorff distance," IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 9, pp. 850–863, 1993.

[75] S. Loncaric, "A survey of shape analysis techniques," Pattern Recognition, vol. 31, no. 8, pp. 983–1001, 1998.

[76] R. Veltkamp and M. Hagedoorn, State of the Art in Shape Matching, Technical Report UU-CS-1999-27, Utrecht University, Netherlands, 1999.

[77] P. Besl and H. McKay, "A method for registration of 3-D shapes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 2, pp. 239–256, 1992.

[78] Z. Zhang, "Iterative point matching for registration of free-form curves and surfaces," Int. J. Computer Vision, vol. 13, no. 2, pp. 119–152, 1994.

[79] J. Feldmar and N. Ayache, "Rigid, affine and locally affine registration of free-form surfaces," Int. J. Computer Vision, vol. 18, no. 2, pp. 99–119, 1996.

[80] T. Wakahara and K. Odaka, "Adaptive normalization of handwritten characters using global/local affine transform," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 12, pp. 1332–1341, 1998.

[81] S. Gold, A. Rangarajan, C. Lu, S. Pappu, and E. Mjolsness, "New algorithms for 2-D and 3-D point matching: Pose estimation and correspondence," Pattern Recognition, vol. 31, no. 8, pp. 1019–1031, 1998.

[82] H. Chui and A. Rangarajan, "A new point matching algorithm for non-rigid registration," Computer Vision and Image Understanding, vol. 89, no. 2-3, pp. 114–141, 2003.

[83] T. Rath, R. Manmatha, and V. Lavrenko, "A search engine for historical manuscript images," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pp. 369–376.

[84] S. N. Srihari, S. Shetty, S. Chen, H. Srinivasan, C. Huang, G. Agam, and O. Frieder, "Document image retrieval using signatures as queries," in Proceedings of the International Conference on Document Image Analysis for Libraries, 2006, pp. 198–203.

[85] G. Agam, S. Argamon, O. Frieder, D. Grossman, and D. Lewis, The Complex Document Image Processing (CDIP) Test Collection, Online, Illinois Institute of Technology, 2006, http://ir.iit.edu/projects/CDIP.html.

[86] D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, and J. Heard, "Building a test collection for complex document information processing," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 665–666.

[87] K. Turkowski, Filters for Common Resampling Tasks, A. S. Glassner, Ed. San Diego, California: Academic Press Professional, Inc., 1990.

[88] P. Saint-Marc, H. Rom, and G. Medioni, "B-spline contour representation and symmetry detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 11, pp. 1191–1197, 1993.

[89] J. M. Hollerbach, "An oscillation theory of handwriting," Ph.D. dissertation, Massachusetts Institute of Technology, 1978.
[90] F. Bookstein, "Principal warps: Thin-plate splines and the decomposition of deformations," IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 6, pp. 567–585, 1989.

[91] C. Buckley and E. Voorhees, "Evaluating evaluation measure stability," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2000, pp. 33–40.

[92] The Legacy Tobacco Document Library (LTDL), Online, University of California, San Francisco, 2007, http://legacy.library.ucsf.edu/.

[93] ABBYY FineReader SDK 8.0, ABBYY Corp, 2006, http://www.abbyy.com/.

[94] T. Lindeberg, "Edge detection and ridge detection with automatic scale selection," Int. J. Computer Vision, vol. 30, no. 2, pp. 77–116, 1996.

[95] Y. Zheng, H. Li, and D. Doermann, "Machine printed text and handwriting identification in noisy document images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 3, pp. 337–353, 2004.

[96] C. Dyer and A. Rosenfeld, "Thinning algorithms for gray-scale pictures," IEEE Trans. Pattern Anal. Mach. Intell., vol. 1, no. 1, pp. 88–89, 1979.

[97] T. Zhang and C. Suen, "A fast parallel algorithm for thinning digital patterns," Comm. ACM, vol. 27, no. 3, pp. 236–239, 1984.

[98] R. Sabourin, R. Plamondon, and L. Beaumier, "Structural interpretation of handwritten signature images," Int. J. Pattern Recognition and Artificial Intelligence, vol. 8, no. 3, pp. 709–748, 1994.

[99] J. Drouhard, R. Sabourin, and M. Godbout, "A neural approach to off-line signature verification using directional PDF," Pattern Recognition, vol. 29, no. 3, pp. 415–424, 1996.

[100] S. Mori, C. Y. Suen, and K. Yamamoto, "Historical review of OCR research and development," Proc. IEEE, vol. 80, no. 7, pp. 1029–1058, 1992.

[101] N. Chinchor, "MUC-7 named entity task definition," in Proceedings of the Message Understanding Conference, 1997, pp. 1–21. Available at http://www.itl.nist.gov/iad/894.02/related_projects/muc/proceedings/ne_task.html.

[102] A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman, "Exploiting diverse knowledge sources via maximum entropy in named entity recognition," in Proceedings of the Sixth Workshop on Very Large Corpora, 1998, pp. 152–160.

[103] S. Lawrence, C. L. Giles, and K. Bollacker, "Digital libraries and autonomous citation indexing," IEEE Computer, vol. 32, no. 6, pp. 67–71, 1999.

[104] W. Cohen and S. Sarawagi, "Exploiting dictionaries in named entity extraction: Combining semi-Markov extraction processes and data integration methods," in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2004, pp. 89–98.

[105] L. O'Gorman, "The document spectrum for page layout analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 11, pp. 1162–1173, 1993.

[106] K. Kise, A. Sato, and M. Iwata, "Segmentation of page images using the area Voronoi diagram," Computer Vision and Image Understanding, vol. 70, no. 3, pp. 370–382, 1998.

[107] S. Mao and T. Kanungo, "Empirical performance evaluation methodology and its application to page segmentation algorithms," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 3, pp. 242–256, 2001.

[108] H. M. Wallach, Conditional Random Fields: An Introduction, Technical Report MS-CIS-04-21, University of Pennsylvania, Available at http://repository.upenn.edu/cis_reports/22/, 2004.

[109] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proceedings of the International Conference on Machine Learning, 2001, pp. 282–289.
[110] D. Pinto, A. McCallum, X. Wei, and W. B. Croft, "Table extraction using conditional random fields," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 235–242.

[111] F. Sha and F. Pereira, "Shallow parsing with conditional random fields," in Proceedings of the Conference of the North American Chapter of the ACL on Human Language Technology, 2003, pp. 134–141.

[112] S. Sarawagi and W. W. Cohen, "Semi-Markov conditional random fields for information extraction," in Proceedings of the Annual Conference on Neural Information Processing Systems, 2004, pp. 1185–1192.

[113] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York, New York: John Wiley and Sons, Inc., 2000.

[114] R. H. Byrd, J. Nocedal, and R. B. Schnabel, "Representations of quasi-Newton matrices and their use in limited memory methods," Mathematical Programming, vol. 63, pp. 129–156, 1994.

[115] H. T. Ng, C. Y. Lim, and J. L. T. Koo, "Learning to recognize tables in free text," in Proceedings of the Annual Meeting of the ACL, 1999, pp. 443–450.

[116] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore, "Automating the construction of internet portals with machine learning," Information Retrieval, vol. 3, no. 2, pp. 127–163, 2000.

[117] R. Bunescu, R. Ge, R. J. Mooney, E. Marcotte, and A. K. Ramani, Extracting Gene and Protein Names from Biomedical Abstracts, Unpublished Technical Note, University of Texas, Austin, Available at http://www.cs.utexas.edu/users/ml/publication/ie.html, 2002.

[118] K. Humphreys, G. Demetriou, and R. Gaizauskas, "Two applications of information extraction to biological science journal articles: Enzyme interactions and protein structures," in Proceedings of the Pacific Symposium on Biocomputing, 2000, pp. 502–513.

[119] O. Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, "Unsupervised named-entity extraction from the web: An experimental study," Art. Intell., vol. 165, no. 1, pp. 91–134, 2005.

[120] K. Fukuda, T. Tsunoda, A. Tamura, and T. Takagi, "Toward information extraction: Identifying protein names from biological papers," in Proceedings of the Pacific Symposium on Biocomputing, 1998, pp. 707–718.

[121] M. Narayanaswamy, K. E. Ravikumar, and K. Vijay-Shanker, "A biological named entity recognizer," in Proceedings of the Pacific Symposium on Biocomputing, 2003, pp. 427–438.

[122] X. Carreras, L. Marquez, and L. Padro, "Named entity extraction using AdaBoost," in Proceedings of the Conference on Computational Natural Language Learning, 2002, pp. 167–170.

[123] R. Florian, A. Ittycheriah, H. Jing, and T. Zhang, "Named entity recognition through classifier combination," in Proceedings of the Conference on Computational Natural Language Learning, 2003, pp. 168–171.

[124] M. Collins, "Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2002, pp. 1–8.

[125] A. McCallum, D. Freitag, and F. Pereira, "Maximum entropy Markov models for information extraction and segmentation," in Proceedings of the International Conference on Machine Learning, 2000, pp. 591–598.

[126] X. He, R. S. Zemel, and M. A. Carreira-Perpinan, "Multiscale conditional random fields for image labeling," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 695–702.
[127] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas, "Conditional models for contextual human motion recognition," in Proceedings of the IEEE International Conference on Computer Vision, vol. 2, 2005, pp. 1808–1815.

[128] E. Amitay and A. Broder, "Introduction to special issue on query log analysis: Technology and ethics," ACM Trans. Web, vol. 2, no. 4, Article 18, pp. 1–2, 2008.

[129] D. Beeferman and A. Berger, "Agglomerative clustering of a search engine query log," in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2000, pp. 407–416.

[130] R. Jones, B. Rey, O. Madani, and W. Greiner, "Generating query substitutions," in Proceedings of the World Wide Web Conference, 2006, pp. 387–396.

[131] R. W. White, M. Bilenko, and S. Cucerzan, "Leveraging popular destinations to enhance web search interaction," ACM Trans. Web, vol. 2, no. 3, Article 16, pp. 1–30, 2008.

[132] W. Cohen, R. Shapire, and Y. Singer, "Learning to order things," J. Art. Intell. Res., vol. 10, pp. 243–270, 1999.

[133] T. Joachims, "Optimizing search engines using clickthrough data," in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2002, pp. 133–142.

[134] F. Radlinski and T. Joachims, "Query chains: Learning to rank from implicit feedback," in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2005, pp. 239–248.

[135] E. Agichtein, E. Brill, and S. Dumais, "Improving web search ranking by incorporating user behavior information," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 19–26.

[136] N. Craswell and M. Szummer, "Random walks on the click graph," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2007, pp. 239–246.

[137] F. Radlinski, A. Broder, P. Ciccolo, E. Gabrilovich, V. Josifovski, and L. Riedel, "Optimizing relevance and revenue in ad search: A query substitution approach," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 403–410.

[138] G. Zhu and G. Mishne, "Mining rich session context to improve web search," in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2009, pp. 1037–1046.

[139] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz, "Analysis of a very large web search engine query log," SIGIR Forum, vol. 33, no. 1, pp. 6–12, 1999.

[140] F. Radlinski and T. Joachims, "Active exploration for learning rankings from clickthrough data," in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2007, pp. 570–579.

[141] Y. Liu, B. Gao, T.-Y. Liu, Y. Zhang, Z. Ma, S. He, and H. Li, "BrowseRank: Letting web users vote for page importance," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2008, pp. 451–458.

[142] R. Baeza-Yates, C. Castillo, F. Junqueira, V. Plachouras, and F. Silvestri, "Challenges in distributed information retrieval," in Proceedings of the IEEE Conference on Data Engineering, 2007, pp. 6–20.

[143] C. Olston and S. Pandey, "Recrawl scheduling based on information longevity," in Proceedings of the World Wide Web Conference, 2008, pp. 437–446.

[144] B. He, M. Patel, Z. Zhang, and K. C.-C. Chang, "Accessing the deep Web," Commun. ACM, vol. 50, no. 5, pp. 94–101, 2007.
[145] L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical Report, Stanford InfoLab, Available at http://ilpubs.stanford.edu:8090/422/, 1999.

[146] J. Kleinberg, "Authoritative sources in a hyperlinked environment," J. ACM, vol. 46, no. 5, pp. 604–632, 1999.

[147] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen, "Combating web spam with TrustRank," in Proceedings of the International Conference on Very Large Data Bases, 2004, pp. 576–587.

[148] C. Moler, The World's Largest Matrix Computation, Online, 2002, http://www.mathworks.com/company/newsletters/news_notes/clevescorner/oct02_cleve.html.

[149] M. R. Meiss, F. Menczer, S. Fortunato, A. Flammini, and A. Vespignani, "Ranking web sites with real user traffic," in Proceedings of the ACM Conference on Web Search and Data Mining, 2008, pp. 65–76.

[150] Google, We Knew The Web Was Big, Online, 2008, http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.

[151] A. N. Langville and C. D. Meyer, "Deeper inside PageRank," Internet Math., vol. 1, no. 3, pp. 335–400, 2005.

[152] A. Broder, R. Lempel, F. Maghoul, and J. Pedersen, "Efficient PageRank approximation via graph aggregation," in Proceedings of the World Wide Web Conference, 2004, pp. 484–485.

[153] F. McSherry, "A uniform approach to accelerated PageRank computation," in Proceedings of the World Wide Web Conference, 2005, pp. 575–582.

[154] R. W. White and S. M. Drucker, "Investigating behavioral variability in web search," in Proceedings of the World Wide Web Conference, 2007, pp. 21–30.

[155] M. Bilenko and R. W. White, "Mining the search trails of surfing crowds: Identifying relevant websites from user activity," in Proceedings of the World Wide Web Conference, 2008, pp. 51–60.

[156] P. Boldi, F. Bonchi, C. Castillo, D. Donato, A. Gionis, and S. Vigna, "The query-flow graph: Model and applications," in Proceedings of the ACM Conference on Information and Knowledge Management, 2008, pp. 609–618.

[157] D. Downey, D. Liebling, and S. Dumais, "Understanding the relationship between searchers' queries and information goals," in Proceedings of the ACM Conference on Information and Knowledge Management, 2008, pp. 449–458.

[158] W. M. Rand, "Objective criteria for the evaluation of clustering methods," J. Am. Statist. Assoc., vol. 66, no. 336, pp. 846–850, 1971.

[159] A. Mowshowitz and A. Kawaguchi, "Bias on the Web," Commun. ACM, vol. 45, no. 9, pp. 56–60, 2002.

[160] R. D. Yates and D. J. Goodman, Probability and Stochastic Processes, 2nd ed. Hoboken, New Jersey: John Wiley and Sons, Inc., 2005.

[161] G. Grimmett and D. Stirzaker, Probability and Random Processes, 3rd ed. Oxford: Oxford University Press, 2001.

[162] N. Craswell, S. Robertson, H. Zaragoza, and M. Taylor, "Relevance weighting for query independent evidence," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 416–423.

[163] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Ann. Statist., vol. 29, no. 5, pp. 1189–1232, 2001.

[164] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, 2004, pp. 137–150.

[165] Hadoop, Open Source Implementation of MapReduce, Online, 2007, http://lucene.apache.org/hadoop/.
[166] K. S. Jones, S. Walker, and S. E. Robertson, "A probabilistic model of information retrieval: Development and comparative experiments (parts 1 and 2)," Inform. Process. Manage., vol. 36, no. 6, pp. 779–840, 2000.

[167] K. Järvelin and J. Kekäläinen, "Cumulated gain-based evaluation of IR techniques," ACM Trans. Inf. Syst., vol. 20, no. 4, pp. 422–446, 2002.

[168] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Boca Raton, Florida: CRC Press, 1984.