ABSTRACT

Title of dissertation: USING ONTOLOGIES TO IMPROVE ANSWER QUALITY IN DATABASES
Yu Deng, Doctor of Philosophy, 2006
Dissertation directed by: Professor V.S. Subrahmanian, Department of Computer Science

One of the known shortcomings of relational and XML databases is that they overlook the semantics of terms when answering queries. Ontologies constitute a useful tool to convey the semantics of terms in databases. However, the problem of effectively using semantic information from ontologies is challenging. We first address this problem for relational databases through the notion of an ontology extended relation (OER). An OER contains an ordinary relation as well as an associated ontology that conveys semantic meaning about the terms being used. We then extend the relational algebra to query OERs. We build a prototype for the OER model and show that the system scales to handle large datasets.

We then propose the concept of a similarity enhanced ontology (SEO), which brings a notion of similarity to a graph ontology. We extend TAX, one of the best known algebras for XML databases, with SEOs. The result is our TOSS system, which provides much higher answer quality than TAX does alone. We experimentally evaluate the TOSS system on the DBLP and SIGMOD bibliographic databases and show that TOSS has acceptable performance.

These two projects have involved ontology integration for supporting semantic queries across heterogeneous databases. We show how to efficiently compute the canonical witness to the integrability of graph ontologies given a set of interoperation constraints. We have also developed a polynomial algorithm to compute a minimal witness to the integrability of RDF ontologies under a set of Horn clauses and negative constraints, and experimentally show that our algorithm works very well on real-life ontologies and scales to massive ontologies.

We finally present our work on ontology-based similarity measures for finding relationships between ontologies and searching for similar objects. These measures are applicable to practical classification systems, where ontologies can be DAG-structured, objects can be labeled with multiple terms, and ambiguity can be introduced by an evolving ontology or by classifiers with imperfect knowledge. Experiments on a bioinformatics application show that our measures outperformed previous approaches.

USING ONTOLOGIES TO IMPROVE ANSWER QUALITY IN DATABASES

by
Yu Deng

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2006

Advisory Committee:
Professor V.S. Subrahmanian, Chair/Advisor
Assistant Professor Lise Getoor
Professor James A. Hendler
Professor Dana S. Nau
Associate Professor Donald Yeung

© Copyright by Yu Deng 2006

DEDICATION

This dissertation is dedicated to my husband and family.

ACKNOWLEDGMENTS

I want to thank my advisor, Prof. V.S. Subrahmanian, who has given me the best guidance and support throughout my PhD study. I also want to thank Professors Lise Getoor, James A. Hendler, Dana S. Nau and Donald Yeung for serving on my dissertation committee and for their insightful comments and questions. I would like to thank Dr. Harumi Kuno and Dr. Peter Schwarz for being my mentors during my internships at HP Labs and IBM Almaden respectively, where I advanced my research in schema mapping and semantic similarity search.
My study at Maryland has been a great experience and I would like to thank all of my colleagues. I especially thank Dr. Edward Hung, who often had stimulating ideas; Edward gave me a great deal of help and contributed to the TOSS project. I also give my sincere thanks to Octavian Udrea for his work on the RDF integration project and for the great time I had working with him. Additionally, I want to thank T.J. Rogers for his technical support. Finally, I must thank my family, who have always believed in me and encouraged me to pursue my academic goals. Most importantly, I thank my husband for always being with me through the difficulties of my study. Without the love and support of my family, it would have been impossible for me to finish my dissertation and obtain the degree of Doctor of Philosophy.

TABLE OF CONTENTS

List of Figures

1 Introduction
  1.1 The Recall Problem
  1.2 Ontology Integration
  1.3 Ontology-based Semantic Similarity Measures

2 Review of Ontologies
  2.1 Graph-based Ontologies
  2.2 RDF
    2.2.1 Syntax of RDF
    2.2.2 Semantics of RDF
    2.2.3 RDF Schema
  2.3 SHOE
    2.3.1 Semantics of SHOE
  2.4 DAML+OIL
  2.5 OWL
  2.6 Description Logics

3 Review of the State of the Art
  3.1 Using IR Techniques to Improve Answer Quality
    3.1.1 IR and RDBMS
    3.1.2 IR and XML DB
  3.2 Mediator Systems
  3.3 Ontology Merging and Alignment
  3.4 Object Matching
    3.4.1 Entity Resolution
    3.4.2 Similarity Measures on Ontologies

4 Ontology-extended Relational Algebra
  4.1 Motivating Example
    4.1.1 Simple Industrial Parts Example
    4.1.2 Insurance Example
  4.2 Ontologies and Ontology Integration
    4.2.1 Ontologies
    4.2.2 Ontology Integration
  4.3 Ontology Extended Relational Algebra
    4.3.1 Selection Conditions
    4.3.2 Ontological Relational Algebra
  4.4 Implementation and Experiments
    4.4.1 HOME System
    4.4.2 Experiments
  4.5 Summary

5 TOSS: An Extension of TAX with Ontologies and Similarity Queries
  5.1 Semistructured Instance and TAX
    5.1.1 Embeddings and Witness Trees
    5.1.2 TAX
    5.1.3 Problems with TAX
  5.2 Similarity Enhanced Ontologies
    5.2.1 Ontologies
    5.2.2 Integrating Ontologies
    5.2.3 Similarity Enhanced Ontologies
  5.3 Ontology Extended Semistructured Data Model
    5.3.1 TOSS Algebra
  5.4 Implementation and Experiments
  5.5 Summary

6 A Theoretical Foundation for Integrating RDF Ontologies
  6.1 Preliminaries
  6.2 Horn constraints
  6.3 The RDF Ontology Integration Problem
  6.4 The CROW Algorithm to Integrate Ontologies
  6.5 Implementation and Experiments
  6.6 Summary

7 Finding Similar Objects Using a Taxonomy
  7.1 Information-Theoretic Similarity
    7.1.1 Holistic Similarity
    7.1.2 Generic Similarity
  7.2 Related Work
  7.3 Implementation
    7.3.1 User-Defined Types
    7.3.2 Precomputation
  7.4 Experimental Evaluation
    7.4.1 Results
    7.4.2 Discussion
    7.4.3 Candidate Selection
  7.5 Summary

8 Future Work
  8.1 Improving Performance of Ontology-Extended Systems
  8.2 Learning Ontologies and Interoperation Constraints
  8.3 Semantic Similarity Join
  8.4 Semantic Web Databases
  8.5 OWL Integration

Bibliography

LIST OF FIGURES

2.1 An RDF example describing artifact auctions in Christie's auction house
2.2 A fragment of a weather ontology in DAML+OIL
2.3 A fragment of an OWL ontology
4.1 (a) Ontology associated with Claims1 relation, (b) Ontology associated with Claims2 relation
4.2 (a) Integrated Ontology integrating the Claims1, Claims2 relations, (b) Hierarchy Graph associated with the Claims1, Claims2 relations, (c) Canonical Hierarchy associated with the Claims1, Claims2 relations
4.3 (a) Performance of HOME for conjunctive selection queries, (b) Join queries with varying join selectivity and varying number of tuples per relation
4.4 Performance of HOME for join queries as size of relations being joined is varied
4.5 Performance of ontology integration algorithms
5.1 (a) A DBLP example, (b) A SIGMOD example
5.2 (a) A pattern tree P1, (b) A selection result, (c) A projection result
5.3 (a) A pattern tree P2, (b) A join result
5.4 (a) An example of ontology associated with SIGMOD, (b) An example of ontology associated with DBLP
5.5 Hierarchy Graph associated with SIGMOD and DBLP
5.6 Canonical fusion of the ontologies of SIGMOD and DBLP
5.7 (a) An example ontology and (b) its similarity enhancement
5.8 A pattern tree P3
5.9 Architecture of TOSS system
5.10 (a) Recall against precision for TOSS and TAX results, (b) Square root of product of recall and precision of TOSS and TAX results against the square root of corresponding TAX recall for each query
5.11 (a) Improvement factor of TOSS recall compared with TAX recall normalized by precision, (b) Performance of TOSS and TAX for selection queries
5.12 (a) Performance of TOSS and TAX for join queries, (b) TOSS computation time of selection and join against epsilon1
6.1 RDF (respectively OWL) for the two example ontologies
6.2 Two simple ontologies
6.3 HOG example (partial)
6.4 Minimal integrability witness
6.5 Intermediate result CROW phase (1)
6.6 CROW running time
7.1 Similarity in Context
7.2 Use of Anonymous Terms
7.3 Generic Similarity in Context
7.4 Labeling With Generic Terms

Chapter 1
Introduction

1.1 The Recall Problem

In recent years, people have developed various theories and techniques to handle queries over relational and XML data. For example, relational algebra [1] is one of the most important formalisms of the relational data model. For XML, TAX [56] is one of the best known algebras. One problem with TAX and many existing relational DBMSs is that the semantics of terms in databases is not taken into account when answering queries. Thus, even though TAX and the relational DBMSs answer queries with 100% precision, the recall is relatively low.

For example, a TAX query that wishes to find all papers in the DBLP database written by "J. Ullman" will not find bibliographic references to papers by "J.D. Ullman" or by "Jeffrey Ullman." This is because TAX does not use any notion of similarity between search terms to answer queries. Likewise, TAX cannot answer queries of the form "Find all papers having at least one author from the US government." Such a query may be useful in order to identify papers that potentially have no copyright restrictions. However, few authors, if any, will list their affiliations as "US Government." They are more likely to list their affiliations as "US Census Bureau" or "US Army" and so on. As a consequence, TAX will miss these answers. Problems such as these are not really an artifact of TAX's design; rather, they are caused by a general lack of lexical semantics in answering queries. Most commercial relational DBMSs would not perform such reasoning either when answering queries.

The net effect of this shortcoming of TAX (as well as most commercial DBMSs) is that such systems have high precision (all the answers to a query are correct) but poor recall (not all the answers that should be returned are in fact returned). As commonly adopted by the IR and DB communities, the precision of an answer is the number of correct results returned divided by the number of results returned, and the recall of an answer is the number of correct results returned divided by the total number of correct results that should have been returned. In Chapter 3, we introduce recent work on combining IR techniques with relational/XML database management systems to support keyword search, proximity search, relevance ranking, word stemming and thesauri.
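These two measures amount to a few lines of code; the sketch below is purely illustrative (the result sets are hypothetical) and simply restates the definitions above.

```python
def precision(returned, correct):
    """Fraction of returned results that are correct."""
    if not returned:
        return 0.0
    return len(returned & correct) / len(returned)

def recall(returned, correct):
    """Fraction of all correct results that were actually returned."""
    if not correct:
        return 0.0
    return len(returned & correct) / len(correct)

# Hypothetical example: a query for papers by "J. Ullman" misses variant
# spellings, so precision stays perfect while recall drops.
correct  = {"p1", "p2", "p3", "p4"}   # papers by any spelling of the author
returned = {"p1", "p2"}               # papers matching the literal string
print(precision(returned, correct))   # 1.0
print(recall(returned, correct))      # 0.5
```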
The recall problem exists not only in single DBMSs, but also when integrating multiple DBMSs. For relational databases, although the problem of integrating diverse databases has been studied extensively [111] and many impressive mediator systems have been built, integration of data at the semantic level remains an open problem. Consider a simple example consisting of industrial parts data. There are numerous companies in the US and elsewhere that maintain stocks of various industrial parts. Two different companies of this kind may maintain part information in varying representations. One company may use the column Cost to represent the price of the items and Shipping to represent the shipping cost, while the other may use the columns Price and ShipCost. By looking at the fields alone, we cannot be sure whether Cost and Price refer to the same thing, or whether the columns Shipping and ShipCost refer to the same concept. Note that such semantic ambiguities exist not only in schemas, but also in instance data. In Chapter 3, we give a review of several well-known mediator systems.

We address the recall problem by introducing the concept of an ontology to capture inter-term lexical relationships. In Chapter 2, we give a review of ontologies and ontology languages. In Chapter 4, we define a graph-based ontology and propose the notion of an ontology extended relation (OER). An OER contains an ordinary relation as well as an associated ontology that conveys semantic meaning about the terms being used. We extend the relational algebra to query OERs. Furthermore, we show how multiple ontologies may be merged, which is imperative when executing queries over multiple ontology extended relations. We describe an implementation of the OER model and show (via experiments) that the system scales to handle large data sets.

We have made a similar effort for XML database systems; there, however, we investigate how to use both similarity metrics and ontologies to answer queries correctly. In Chapter 5, we introduce the TOSS system, which extends and enhances the semantics of TAX (a tree algebra for XML databases) so that the resulting system returns high quality answers. We show how to generate a similarity enhanced ontology, a concept that brings a notion of similarity into an integrated ontology. We describe a prototype implementation of TOSS on top of the Apache Xindice XML database system [115]. Our experiments, conducted on the SIGMOD XML proceedings data set [96] and the DBLP data set [31], show that the TOSS system consistently outperformed ordinary XML query engines in terms of answer quality.

1.2 Ontology Integration

It is common for distributed data sources to be associated with different ontologies on similar topics. These associated ontologies provide semantic information about the data. When semantically integrating such data and performing queries over multiple ontology extended data sources, we may need to merge multiple ontologies. Moreover, since the adoption of the Resource Description Framework (RDF) as a web recommendation by the World Wide Web Consortium, there has been growing interest in using RDF for expressing ontologies about a diverse variety of topics. As more and more ontologies emerge about the same topics, there is also a growing need to integrate these ontologies.

There are some initial approaches to merging ontologies in the literature. The pioneering work of [80] showed that ontology merging is an important problem. In another important paper, [24] develops a model-theoretic basis for merging ontologies assuming they are expressed in description logic. [98] uses natural language and formal concept analysis methods to merge ontologies using a concept lattice which is explored and transformed by user interactions. Meanwhile, there is much work on finding relationships between ontologies, e.g. [21, 75]. We will give details on the related work in Chapter 3.
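As a toy illustration of what such a merge involves (and not of any specific algorithm cited above), the sketch below unions two tiny hypothetical is-a ontologies after rewriting terms according to a set of interoperation constraints that equate terms across the two sources.

```python
# A tiny, illustrative merge of two is-a ontologies. Each ontology is a set
# of (child, parent) edges; the interoperation constraints equate terms
# across the two sources. All names here are hypothetical.
onto1 = {("Sedan", "Car"), ("Car", "Vehicle")}
onto2 = {("Automobile", "Vehicle"), ("Coupe", "Automobile")}
constraints = {"Automobile": "Car"}   # treat Automobile and Car as one term

def canonical(term):
    return constraints.get(term, term)

merged = {(canonical(c), canonical(p)) for (c, p) in onto1 | onto2}
print(sorted(merged))
# [('Car', 'Vehicle'), ('Coupe', 'Car'), ('Sedan', 'Car')]
```

A genuine integration witness must additionally verify that the merged graph satisfies the constraints, for example that it remains acyclic; this is what the canonical-witness machinery of Chapters 4 and 6 addresses.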
Meanwhile, extensive research has been performed on integrating logical theories from multiple sources. In Chapter 3, we introduce these research efforts, which are motivated by problems arising in areas such as cooperative information systems, multi-databases, multi-agent systems and distributed expert systems. These works may be applied to ontology integration. However, due to the special properties of some ontologies, especially their underlying graph models, it is often possible to find a more efficient method to integrate them, as we show in Chapter 4 and Chapter 6.

In Chapter 4, we provide a definition of a graph-based ontology and study the problem of integrating ontologies under a given set of interoperation constraints. We formally define the notion of a canonical witness to the integrability of a set of ontologies under such constraints. We prove that a set S of ontologies is integrable if and only if the canonical witness is a witness to the integrability of the ontologies in S. We further provide an efficient algorithm to compute the canonical witness given a set of ontologies and their interoperation constraints.

We present our approach to integrating RDF ontologies in Chapter 6. Our framework allows not only relationships between terms, but also complex Horn constraints, i.e., Horn clauses that specify semantic relationships among terms in ontologies. Note that our approach is rooted in the novel concept of an integration witness and in graph-theoretic methods. We also give correctness and complexity proofs for our approach as well as experimental results on both synthetic and real ontologies.

1.3 Ontology-based Semantic Similarity Measures

Finding interoperation constraints between ontologies is an important step toward integrating them. Since most existing approaches focus on the structure of ontologies, finding such constraints at the semantic level is a challenging problem. In Chapter 7, we introduce our work on ontology-based semantic similarity measures, which are applicable to identifying similarity relationships between two terms, two sets of terms, or two individuals. Our measures are also useful for finding similar objects given a description of a target in a domain. For example, one may wish to find patents similar to a given patent, or subjects similar to a hypothetical "ideal" subject for a clinical trial, or gene products (e.g. proteins) similar to a given gene product.

When merging data from multiple sources, it is often important to efficiently identify duplicate objects. There are several approaches to string matching, object matching and entity resolution in the literature. For example, Cohen et al. [28] conduct extensive experiments to compare a set of string metrics for matching names and records, Zhou et al. [116] present a framework for supporting object matching when integrating heterogeneous data, Hernandez and Stolfo [49] study the problem of merging data and identifying distinct entities in large databases, Bhattacharya and Getoor [18] give an iterative deduplication algorithm to identify similar entities, and Dong et al. [35] propose algorithms for supporting similarity search for web services. However, none of these research groups has considered using taxonomies (ontologies) to find semantically similar objects, e.g., objects that are highly related but not matched by traditional IR metrics.

Taxonomies have been a useful tool for classification. They also provide a means of determining how similar one such individual is to another.
In its simplest form, a taxonomy defines a hierarchical grouping of individuals into ever more specific classes. Two individuals share the properties of the most specific grouping that includes both of them, and the degree to which the two individuals are similar depends on the location of this class in the hierarchy: the lower in the hierarchy, the more similar the individuals are. For example, two apples of the same species are more similar than apples of different species, and an apple is more similar to another apple of any species than it is to an orange. Based on this intuitive idea of similarity, various authors, e.g. [94, 114, 66], have defined different ways to compute a numeric value for measuring the similarity of objects. We will give details about these works in Chapter 3.

However, previous ontology-based similarity measures may not be applicable in more complex cases. First, it is frequently the case that a class of individuals may specialize the properties of more than one parent class, i.e., taxonomies are often DAG-structured. Second, taxonomies often evolve, as new specialized groupings are formed and older ones are reorganized. Even with an unchanging taxonomy, the classification of a particular object may evolve as more is learned about it, or users of the taxonomy may disagree as to how it should be classified. Third, real taxonomies tend to be quite large, and the sets of objects they are used to classify are often very large. Finally, in practice an object is often associated with a set of terms (instead of one term) in a taxonomy.

In Chapter 7, we present a pragmatic approach to the use of taxonomies to identify similarity relationships between terms and objects. We have defined two taxonomy-based similarity measures that take the above cases into account. The first, holistic similarity, is a new information-theoretic similarity measure that is more general and well-founded than prior art. The second, generic holistic similarity, adapts our holistic similarity measure to cope with ambiguity, as introduced by an evolving taxonomy or by classifiers with imperfect knowledge. To the best of our knowledge, this is the first attempt to consider this latter problem. We also describe a scalable implementation of our measures that is tightly integrated with an object-relational database, and we evaluate our approach by applying it to an object-matching problem from bioinformatics for which the correct answers are known a priori. The results show that our new measures are more successful than those previously reported, and that our implementation scales well for large taxonomies and data sets.
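As a concrete, purely illustrative rendering of the hierarchy-based intuition described above (and not of the holistic or generic measures defined in Chapter 7), the following sketch scores two terms by the depth of their most specific common ancestor in a tiny hypothetical fruit taxonomy.

```python
# Toy illustration: the deeper the most specific common ancestor of two
# terms, the more similar the terms. The taxonomy and terms are hypothetical.
parent = {
    "Gala": "Apple", "Fuji": "Apple",
    "Apple": "Fruit", "Orange": "Fruit", "Fruit": "Thing",
}

def ancestors(term):
    chain = [term]
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain                          # the term, then its ancestors up to the root

def lca_depth(a, b):
    anc_b = set(ancestors(b))
    for t in ancestors(a):
        if t in anc_b:
            return len(ancestors(t)) - 1  # depth of the most specific common ancestor
    return 0

print(lca_depth("Gala", "Fuji"))      # 2 -> they share "Apple", quite similar
print(lca_depth("Gala", "Orange"))    # 1 -> they share only "Fruit"
```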
Chapter 2
Review of Ontologies

According to Thomas Gruber [41], "an ontology is an explicit specification of a conceptualization". Here, a conceptualization describes the classification of entities in the universe and how they are related. An ontology defines a set of representational terms to name the entities and associates text and axioms with the terms to constrain their interpretation.

From a more practical point of view, ontologies provide a shared and common understanding of a domain that can be communicated between people and between heterogeneous and distributed application systems [61]. They are (meta)data schemas, providing a controlled vocabulary of concepts, each with explicitly defined and machine-processable semantics [72]. Ontologies play increasingly important roles in the areas of data integration, information retrieval, and knowledge sharing and reuse. In 2001, Berners-Lee et al. [15] proposed the concept of the Semantic Web and described a promising picture of this future generation of the World Wide Web. Ontologies are used in this picture to resolve semantic ambiguities between different databases. Since then, this vision has triggered strong interest in ontology research in the Semantic Web community.

In this chapter, we give a brief introduction to several well-known or widely-used ontologies and ontology languages, including graph-based ontologies, RDF, SHOE, OWL, DAML+OIL and Description Logics.

2.1 Graph-based Ontologies

Graph-based ontologies cover various kinds of ontologies, ranging from simple catalogs and IS-A hierarchies to complex graph systems such as WordNet. These ontologies are closely related to two early knowledge representation systems: semantic networks and frame systems.

In the 1960s, Quillian proposed semantic networks [90], in which each concept is represented by a node (called a type node) in a graph. The meaning of each type node is implied by an associated structure, which consists of "token nodes" interconnected by different kinds of links. These links are further labeled with semantic relations, such as conjunctive and disjunctive relations. In addition, it is the token nodes that connect a type node to other type nodes, because each token node is linked to a type node outside of the structure that the token node belongs to. Later, Minsky introduced frame systems [79]. A frame is a named data object with a set of slots that represent properties of the object. Each slot can specify the conditions its assignments must meet. An assignment of a slot can be an object or a pointer to another frame. Meanwhile, different frames can share slots in order to represent different ways of using the same information. Although frame systems seem to have complex structures, it has been shown that such systems are isomorphic to semantic networks.

In recent years, many graph-based ontologies have been built for specific application domains. They more or less inherit some features from semantic networks and frame systems. WordNet is one example. WordNet is a lexical reference system which organizes lexical information by semantic relations [78]. These relations include synonymy, antonymy, hyponymy, meronymy, etc. In WordNet, English nouns, verbs and adjectives are grouped into synonym sets, each of which represents one lexical concept. For example, {board, plank} and {board, committee} designate two different meanings of board. Semantic relations, like pointers, link the synonym sets. It is important to note that the three categories, nouns, verbs and adjectives, have different semantic organizations in WordNet. For instance, nouns are organized into hierarchical trees, while the organization of adjectives is more or less an n-dimensional hyperspace. WordNet has been widely used in information retrieval systems, and a couple of semantic similarity measures have been defined with regard to this large lexical database. We will come back to this in Chapter 3.

We also propose a simple notion of a graph-based ontology [19], which is a partial mapping from a set of relations to a set of hierarchies. We formally define the concept of a "canonical witness", based on which we provide an efficient algorithm to merge such ontologies under a set of interoperation constraints. The details are given in Chapter 4.
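As a toy illustration of such a term hierarchy (and not the formal definition given in Chapter 4), the sketch below stores a few hypothetical labeled edges and checks, transitively, whether a general term covers a more specific one; this is the kind of reasoning that lets a query on "US Government" reach "US Census Bureau".

```python
# A toy term hierarchy with labeled edges, in the spirit of the graph-based
# ontologies discussed above. The edge set is hypothetical.
edges = {
    ("US Census Bureau", "part-of", "US Government"),
    ("US Army",          "part-of", "US Government"),
    ("Maryland",         "part-of", "USA"),
    ("duck",             "isa",     "bird"),
}

def covers(general, specific):
    """True if `specific` reaches `general` by following isa/part-of edges."""
    frontier, seen = {specific}, set()
    while frontier:
        term = frontier.pop()
        if term == general:
            return True
        seen.add(term)
        frontier |= {p for (c, _, p) in edges if c == term} - seen
    return False

print(covers("US Government", "US Census Bureau"))   # True
print(covers("US Government", "Maryland"))           # False
```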
Note that our notion of ontology generalizes the ontologies in RDF, DAML+OIL and OWL, whose underlying models are graphs.

2.2 RDF

The Resource Description Framework (RDF) is a W3C recommendation endorsed by approximately 400 companies. It is a foundation for processing metadata, and it provides interoperability between applications that exchange machine-understandable information on the Web [64]. Lassila et al. [64] introduce a model for representing RDF metadata as well as a syntax for encoding and transporting the metadata. Their work is clarified and updated by the W3C RDF Primer [74]. Brickley et al. [22] introduce RDF's vocabulary description language, RDF Schema (RDFS). Hayes [46] specifies a precise semantics and corresponding complete systems of inference rules for RDF and RDFS. The basic elements of RDF are triples of the form (subject, predicate, object). RDF is property-centric. RDF Schema defines a vocabulary to describe classes, properties and other resources.

2.2.1 Syntax of RDF

RDF defines a model for describing relationships between resources in terms of uniquely identified properties (attributes) and values. RDF has become the standard for the description and exchange of metadata on the Web. The underlying model for RDF is a labeled directed graph whose nodes are either resources or literals. The graph is defined by a set of triples, statements of the form (subject, predicate, object), where the subject is a resource, the predicate is the edge label, and the object is either a resource or a literal. The basic elements in the RDF data model are as follows:

- Resource: A resource is anything identified by a URI reference (a URI plus optional anchor ids) and described in terms of simple properties and property values.

- Literal: A literal may be plain or typed. A plain literal is a self-denoting string combined with an optional language tag. A typed literal is a string combined with a datatype URI reference; it denotes the member of the identified datatype's value space obtained by applying the lexical-to-value mapping to the literal string.

- Property: A property is a resource that represents a specific aspect, characteristic, attribute, or relation used to describe resources. The set of properties is a subset of the set of resources.

- Statement: A statement is composed of a subject, a predicate and an object in an ordered manner; it is also called a triple. With a statement, a specific resource (the subject) is associated with a property (the predicate) and a value of that property (the object). In particular, a statement can be transformed into a resource with a URI reference using reification, a mechanism in RDF for describing statements.

Example 1 The bottom half of Figure 2.1 shows an RDF example for artifact auctions. The top half shows the corresponding RDF schema.

[Figure 2.1: An RDF example describing artifact auctions in Christie's auction house]
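Since only the caption of Figure 2.1 survives here, the fragment below is a hedged, partial reconstruction of a few of the triples the figure depicts, written with the rdflib library; the namespace and resource URIs are taken from the figure, while the exact property and class names are approximate.

```python
from rdflib import Graph, Literal, Namespace, URIRef, RDF

# Namespace and resource URIs as they appear in Figure 2.1; the property and
# class names (fname, paints, Painter, ...) are transcribed from the figure
# fragment and may be approximate.
NS1 = Namespace("http://www.auctionschema.com/schema1#")
r1 = URIRef("http://www.artist.net#guyrose")
r2 = URIRef("http://www.christies.com/beverly/model.jpg")

g = Graph()
g.bind("ns1", NS1)
g.add((r1, RDF.type, NS1.Painter))
g.add((r1, NS1.fname, Literal("Guy")))
g.add((r1, NS1.lname, Literal("Rose")))
g.add((r1, NS1.paints, r2))
g.add((r2, NS1.title, Literal("The Model")))
g.add((r2, NS1.technique, Literal("oil on canvas")))

print(g.serialize(format="turtle"))
```

Each g.add call asserts one (subject, predicate, object) statement of the kind defined above.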
As illustrated in the figure, the resource r1, with URI http://www.artist.net#guyrose, is associated with several properties, e.g. the property with URI http://www.auctionschema.com/schema1#fname (ns1 is the namespace http://www.auctionschema.com/schema1#). The value of this fname property for the resource r1 is the plain literal "Guy". Furthermore, the statement (r1, ns1:fname, "Guy") associates the resource r1 with the property ns1:fname and the value "Guy".

In contrast to an object-oriented system, RDF is property-centric instead of object-centric, although one might think of RDF resources as analogous to objects and RDF properties as similar to attributes. RDF properties are themselves resources, and are independent from classes. The relationships between RDF resources and properties are not tightly restricted by schemas.

2.2.2 Semantics of RDF

RDF is an assertional logic, in which each triple expresses a simple proposition. An RDF graph is a set of RDF triples, and a ground RDF graph is one with no blank nodes (anonymous resources without URI references). Hayes [46] gives a very detailed discussion of RDF semantics. For the reader's convenience, we present the important concepts of Graph Instance, Merge of RDF Graphs and Interpretation here. Interested readers may refer to [46].

Definition 1 (Graph Instance [46]) Suppose M is a mapping from a set of blank nodes to some set of literals, blank nodes and URI references. Then any graph obtained from a graph G by replacing some or all of the blank nodes N in G by M(N) is an instance of G. A proper instance of a graph is an instance in which a blank node has been replaced by a name, or two blank nodes in the graph have been mapped into the same node in the instance. An RDF graph is lean if it has no instance which is a proper subgraph of the graph.

Definition 2 (Merge of RDF Graphs [46]) If the graphs in the set have no blank nodes in common, then the union of the graphs is a merge; if they do share blank nodes, then it is the union of a set of graphs that is obtained by replacing the graphs in the set by equivalent graphs that share no blank nodes.

Definition 3 (Interpretation [46]) A simple interpretation I of an RDF vocabulary V is defined by:
1. A non-empty set IR of resources, called the domain or universe of I.

2. A set IP, called the set of properties of I.

3. A mapping IEXT from IP into the powerset of IR x IR, i.e. the set of sets of pairs <x, y> with x and y in IR.

4. A mapping IS from URI references in V into (IR union IP).

5. A mapping IL from typed literals in V into IR.

6. A distinguished subset LV of IR, called the set of literal values, which contains all the plain literals in V.

2.2.3 RDF Schema

RDF Schema (RDFS) extends RDF with special class and property constructs that may be used to describe classes, properties and other resources. The important constructs in RDFS are as follows:

- rdfs:Resource. All RDF resources are instances of rdfs:Resource.

- rdfs:Class, the class of resources that are RDF classes.

- rdfs:Literal, the class of literal values. rdfs:Literal is an instance of rdfs:Class and a subclass of rdfs:Resource.

- rdfs:subClassOf, an instance of rdf:Property that is used to state that all the instances of one class are instances of another. It is a transitive relationship w.r.t. the class hierarchy.

- rdfs:subPropertyOf, an instance of rdf:Property that is used to state that all resources related by one property are also related by another. It is a transitive relationship w.r.t. the property hierarchy.

- rdf:type, an instance of rdf:Property that is used to state that a resource is an instance of a class.

- rdfs:domain, an instance of rdf:Property that is used to state that any resource that has a given property is an instance of one or more classes. For example, a triple of the form (P rdfs:domain C) states that the resources denoted by the subjects of triples whose predicate is P are instances of the class C.

- rdfs:range, an instance of rdf:Property that is used to state that the values of a property are instances of one or more classes. A triple of the form (P rdfs:range C) states that the resources denoted by the objects of triples whose predicate is P are instances of the class C.

Besides the above constructs, RDFS also defines a vocabulary for RDF container classes and properties, RDF collections, reification, etc. Please refer to [22] for details.

As pointed out by Staab et al. [97], while support for modeling ontological concepts and relations is extensively provided in RDF(S), the same cannot be said about the modeling of ontological axioms, since RDF(S) has relatively weak expressiveness. They propose an approach to modeling axioms in RDF(S) in which the core semantics of RDF(S) is re-used and the semantics is preserved between different inferencing tools.
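The constructs listed above are themselves ordinary RDF triples. The fragment below is a small hypothetical illustration using the rdflib library (the example.org namespace and the class and property names are invented for this sketch).

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/schema#")      # hypothetical namespace
g = Graph()

g.add((EX.Painter, RDFS.subClassOf, EX.Artist))   # every Painter is an Artist
g.add((EX.paints, RDF.type, RDF.Property))
g.add((EX.paints, RDFS.domain, EX.Painter))       # subjects of paints are Painters
g.add((EX.paints, RDFS.range, EX.Painting))       # objects of paints are Paintings

print(g.serialize(format="turtle"))
```

An RDFS-aware reasoner could then infer, for example, that every subject of ex:paints is also an instance of ex:Artist, by combining the domain statement with the subclass statement.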
2.3 SHOE

In 1995, Hendler's group at the University of Maryland proposed the SHOE language, and later refined the language and experimented with its use [71, 48, 47]. SHOE, which stands for Simple HTML Ontology Extensions, supports knowledge acquisition by augmenting the Web with tags that provide semantic meaning. To eliminate the possibility of contradictions between agent assertions, the designers of SHOE carefully chose the features of the language. For example, SHOE permits neither logical negation nor the specification of disjoint classes.

Syntactically, elements of SHOE can be described in HTML extended with additional semantic tags, or in XML. A SHOE ontology has both an identifier and a version number, where all ontologies with the same identifier are different versions of the same ontology. For example, an ontology declaration can declare an ontology called "university-ont" whose version is 1.0 (all examples of the SHOE language here are from [47]). In addition, SHOE allows for ontology inclusion, which helps to reuse ontologies.

A SHOE ontology may contain a number of elements, of which categories, relations, and inference rules are important components. A category (or class) is a set of objects that share some common properties. The is-a relation is commonly used to group categories together; for example, a category definition may declare a category called "Faculty" as a subcategory of "Person". In SHOE, a relation is equivalent to an n-ary predicate with zero or more arguments, which are typed and explicitly ordered. For instance, an "advises" relation can be defined with "Faculty" and "Student" as its two arguments. SHOE also supports inference rules, each of which consists of a set of antecedents and a set of consequents. Notice that the axioms in SHOE are restricted to Horn clauses, since SHOE's semantics is based on datalog. For example, a rule can state that the head of a Department is a Chair; the subclauses enclosed by the rule's antecedent tags are the antecedents and the ones enclosed by its consequent tags are the consequents.

2.3.1 Semantics of SHOE

A set of functions has been defined to translate the different constructs of the SHOE syntax into concepts in the logical model. For example, res : Url → R maps a uniform resource locator from the set Url to a specific resource in R. Via these functions, each tag in the SHOE language can be mapped into a set of assertions. For a detailed description, please see [47].

2.4 DAML+OIL

DAML+OIL [29], a semantic markup language for Web resources, provides modeling primitives similar to those of frame-based languages. It was built from the original DAML ontology language as well as the language components of OIL, the Ontology Interface Layer defined by Horrocks et al. [50]. Figure 2.2 shows a fragment of a weather ontology in DAML+OIL.

[Figure 2.2: A fragment of a weather ontology in DAML+OIL]

A DAML+OIL ontology consists of zero or more headers, followed by zero or more class elements, property elements, and instances. The headers contain version information and imports elements; as shown in Figure 2.2, a header element carries the version information. A class element contains the definition of an object class. It refers to a class name (a URI), such as "UnitAbbreviation" in the example ontology, and may contain other elements. A daml:Property element refers to a property name (a URI), e.g., "hasUnitAbbreviation", and may likewise contain other elements. There are two kinds of instances: class instances and property instances. For example, in Figure 2.2, celsiusMap is an instance of the class UnitAbbreviation. [29] provides details about the syntax of DAML+OIL and [106] gives the model-theoretic semantics for DAML+OIL.

2.5 OWL

OWL [13, 88], the Web Ontology Language, is a semantic markup language for publishing and sharing ontologies on the World Wide Web. It is a W3C recommendation and provides three subsets of language constructs with different features: OWL Full, OWL DL and OWL Lite. OWL Full allows free mixing of OWL with RDF Schema, but it is not decidable, although it has the highest expressiveness of the three sublanguages of OWL. OWL DL is a subset of OWL Full and was designed to support the existing Description Logics. Note that OWL DL maintains decidability while still having high expressiveness.
OWL Lite was targeted at easy implementation, providing users with a functional subset of constructs. Besides abiding by all the restrictions of OWL DL, it places further requirements on some constructs. Figure 2.3 shows a fragment of an example OWL DL ontology [13].

[Figure 2.3: A fragment of an OWL ontology]

It is important to note that the meaning of an OWL ontology is solely determined by the underlying RDF graph. Patel-Schneider et al. [88] provide a thorough analysis of the model-theoretic semantics of OWL.

2.6 Description Logics

As introduced in [8, 25], Description Logics are a family of knowledge representation languages that can be used to represent the knowledge of an application domain in a structured and formally well-understood way. They allow for modeling an application domain in terms of objects, classes and relationships between classes, and for reasoning about them. The important notions of a given domain are described by concept descriptions, which are expressions built from atomic concepts and roles. A knowledge base in DLs is formed by two components, called the TBox and the ABox. A TBox expresses intensional knowledge about classes and relations, and an ABox expresses extensional knowledge about individual objects. Furthermore, DLs are equipped with a formal, logic-based semantics.

An important problem for a DL system is the trade-off between the expressivity of the DL and the complexity of its inference. Current DL systems can employ very expressive DLs while still providing powerful inference tools. These strengths have made DLs a popular choice for representing ontologies. Note that our notion of graph-based ontology cannot generalize the ontologies in Description Logics. In addition, since OWL DL is based on Description Logics, our notion cannot generalize OWL DL either.

Chapter 3
Review of the State of the Art

3.1 Using IR Techniques to Improve Answer Quality

In recent years, it has become more and more common to have unstructured text embedded in or connected to structured data in database systems. For example, a database about books may have unstructured text containing readers' comments, and a database about patients may associate each patient record with detailed symptoms stored as text. This has motivated research on supporting keyword search in databases. On the other hand, since users may not know the detailed content or structure of a database, they tend to be interested in (and often satisfied with) results that do not exactly match the query but are highly relevant. Furthermore, they often hope to have results ranked according to their relevance to the query. As these trends and query patterns have attracted more and more attention from both academia and industry, many research efforts have emerged to combine IR techniques with relational/XML database management systems in order to query unstructured text embedded in (or connected to) structured data and to support keyword search, proximity search, relevance ranking, word stemming and thesauri in database systems. In this section, we introduce some of the work on keyword search and relevance ranking for relational and XML database systems respectively. Note that some of these systems have incorporated ontological information into query answering.

3.1.1 IR and RDBMS

DBXplorer [2] and DISCOVER [52] are two early systems supporting keyword search in relational database systems, but they do not support IR-style ranking.
Commercial relational database management systems have provided keyword search capabilities over single columns [30, 85, 76]. They use standard IR techniques to rank query results. However, several research groups have found that using standard IR ranking functions has a few shortcomings. For example, traditional IR functions miss records that do not contain the keywords but are closely related to them. Furthermore, the traditional IR semantics is unable to meaningfully sort the resulting objects according to their relevance to the keywords [9]. Instead, these groups [9, 53] have adapted PageRank [23] technology for ranking query results in databases. Balmin et al. [9] develop the ObjectRank system, which models databases as labeled graphs and applies authority-based ranking to keyword search. The idea is to first identify a base set in which each object contains the keywords. Authority then originates at the objects in the base set and flows to other objects according to their semantic connections. ObjectRank makes use of schema information when computing the scores. In addition, it considers domain-specific link semantics and can rank results with regard to specific keywords. However, it needs to materialize the database graph, which imposes a large space overhead. Although it incorporates certain semantics, the system cannot support arbitrary semantic relationships such as is-a and part-of. Furthermore, it relies on the base set, which still requires exact matches.

Note that the above research efforts do not consider answering a keyword query by joining tuples from multiple relations (which is called free-form keyword search). Hristidis et al. [51] adapt IR-style document-relevance ranking strategies to the problem of processing free-form keyword queries over RDBMSs. Different from previous research efforts, their solution for ranking query answers is based on existing search capabilities in RDBMSs as well as IR ranking techniques. In addition, they can handle queries with not only AND but also OR semantics. They have also introduced several top-k query-processing algorithms and compared the performance of the algorithms with regard to different parameters. Different from the above work, we do not consider keyword search or relevance ranking in our HOME system. Instead, our goal is to extend the relational algebra with ontological information such that it is feasible to integrate heterogeneous relational databases at the semantic level. We provide details about the HOME system in Chapter 4.

3.1.2 IR and XML DB

To the best of our knowledge, Florescu et al. [36] were the first to combine XML query processing and keyword search capabilities. They extend the XML-QL query language by adding a predicate called "contains", which has four arguments: an XML element variable, a word, an integer expression and a boolean expression. The integer expression limits the depth at which the word is found within the element. The boolean expression, over the set of constants {tag name, attribute name, content, attribute value}, imposes a constraint on the location of the word. For example, contains($A, "Database", 3, any) is used to search for the elements containing the string "Database" with a search depth of no more than 3. The authors also discussed how to process such keyword queries using a relational database management system. However, their work is limited to single keyword search. They do not consider other cases, such as multiple keywords with AND/OR semantics.
Amer-Yahia et al. [4] propose TeXQuery, a full-text search extension to XQuery. Instead of exploring only a few primitives at a time to support full-text search, they provide a rich set of fully composable primitives, including Boolean connectives, phrase matching, proximity distance, word stemming and thesauri. They introduce an underlying data model called FullMatch to guarantee that the full-text search primitives are closed under the model. The FullMatch model also helps to integrate the primitives seamlessly into the XQuery language. An interesting feature of TeXQuery is that it provides an interface to incorporate terms from thesauri, such as synonyms. However, the system does not support other kinds of queries (such as joins) with regard to thesauri.

Note that the above systems do not consider relevance ranking for query results. We now introduce a few research efforts that support ranking in XML databases. The TAX algebra by Jagadish et al. [56] is a formal theoretical basis for semistructured data. It proposes the concept of a pattern tree to query semistructured data sources. However, ontologies are not used in TAX, and TAX does not support similarity search. Based on TAX, Al-Khalifa et al. introduced a bulk algebra, TIX [3], to integrate information-retrieval-style query processing into a traditional pipelined query evaluation engine for an XML database. TIX uses a scored pattern tree instead of a pattern tree as in TAX. User-defined score functions are attached to some nodes in a scored pattern tree to specify how to compute the scores of nodes using IR-style conditions. By extending existing operators in TAX and introducing new operators, an IR-style query can be expressed in TIX to find relevant results, with weighting and ranking support. However, TIX only supports traditional IR methods (i.e., exact keyword match), such as term frequencies, for scoring. Hence, it cannot incorporate lexical relationships into query answering. As a result, TIX may miss those instances that are highly related to a query but do not contain the query's keywords. Compared to TIX, our work on TOSS uses similarity enhanced ontologies to support both lexical and similarity relationships between terms in databases. In an ontology, users can define arbitrary relationships that they are interested in. In addition, TOSS provides mechanisms to resolve semantic ambiguities when answering queries across heterogeneous databases, which TIX does not consider.

Guo et al. [42] present the XRANK system, which supports ranking results for keyword search queries over XML documents. The ranking is at the granularity of an XML element, and the algorithm for computing the ranking is based on the PageRank [23] technology. Different from previous work, they support two-dimensional proximity search by considering both the keyword distance and the ancestor distance. They also propose new inverted list index structures to efficiently evaluate XML keyword search queries. Theobald and Weikum [100] present interesting work on relevance ranking in XML databases. They consider a similarity operator which takes ontological information into account. They also propose similarity metrics (with regard to ontologies) and special index structures to evaluate XML queries with similarity search conditions. Different from our work, they only consider similarity relationships (including synonyms, hypernyms and hyponyms), but do not support other inter-term lexical relationships such as part-of.
Our TOSS system supports both inter-term lexical relationships and similarity metrics defined in the IR literature. Another difference is that their work is based on an XML query language while our work is based on an XML algebra. Furthermore, we support semantic integration of XML databases while they focus on relevance ranking.

3.2 Mediator Systems

Wiederhold [110, 111] proposed the important concept of a mediator: a program that would integrate data from multiple heterogeneous information sources. The first mediator systems were SIMS [6], HERMES [70] and TSIMMIS [37]. Subsequent efforts included the DISCO system [101]. SIMS [6] is targeted at efficiently mapping a query at the domain level into a set of subqueries to individual information sources. A planner is used to select the information sources to be used in answering a query and also to order the subqueries in a more efficient way. Their domain model is a simple form of ontology which is used to reformulate an original domain-model query into a query in terms of database concepts. This reformulation step is important for finding the related information sources. However, different from our work, their focus is not on defining an ontology-extended algebra. They define a set of operators, some of which are similar to our operators in selection conditions. They use constraints to specialize concepts, while we use the ontology itself. Like SIMS, most of these efforts focused on developing query languages, optimization algorithms or caching strategies with regard to system performance. However, the above efforts do not take into account lexical semantic relationships between terms occurring in databases, such as those shown in the previous examples. They do not have the ability to automatically reason that "US Census Bureau" is part-of the "US Government" and "Maryland" is part-of the "USA", and so on. Other such relationships between terms include isa (e.g. duck isa bird) and affects (e.g. rain affects visibility).

3.3 Ontology Merging and Alignment

Wiederhold et al. [112, 73, 81] at Stanford University have proposed an ontology algebra for manipulating ontologies from different domains. Their algebras consisted of logical statements [73] using a LISP-style syntax. In their later work (combined in their ONION system), they represent ontologies with a graph-based model and provide a set of operators to articulate multiple ontologies into a unified ontology using both logical rules and functional rules. However, they do not consider interoperation constraints in [112, 73], which we think are important when merging ontologies. Although a similar concept was considered in [81], their integrated ontologies were not concise. Furthermore, [112, 73, 81] only merge ontologies together without directly addressing how the query algebra is affected. In addition, no concept of similarity search is involved in these works. [21] extends the syntax of OWL by adding constructs to express relationships between multiple OWL ontologies, but does not actually tell us how to merge ontologies using these relationships. [75] describes a tool called Chimaera that finds taxonomic "areas" for merging as well as a list of similar terms. Noy and Musen [84] propose the PROMPT algorithm to semi-automatically merge and align ontologies. Starting from an initial set of matched terms, the PROMPT algorithm iteratively executes the user's selected operation, checks for any conflicts, and makes a list of suggestions to the user.
Different from previous tools, the PROMPT approach provides good guidance to the user in the process of ontology alignment. Notice that the output of ontology alignment tools can be used as interoperation constraints for our ontology integration algorithm in Chapter 4. Dou et al. [33] proposed an approach (based on ontology merging and automated reasoning) for ontology translation, which is useful for dataset translation, ontology extension generation and querying through different ontologies. They built an inference engine to process assertions and queries in semantic translation. Different from ontology alignment, ontology translation requires human experts to specify semantic differences between ontologies.

On the other hand, extensive research has been performed on integrating logical theories from multiple sources (e.g., [65, 26, 40, 14, 62, 63, 67, 99, 12, 11, 10]). These studies have been motivated by problems arising in areas such as cooperative information systems, multi-databases, multi-agent systems and distributed expert systems. Baral et al. [12, 11, 10] have studied the combination of multiple knowledge bases under a set of integrity constraints when each of the knowledge bases is a normal Horn logic program, a first-order theory, or a default logic theory, respectively. Their idea is to obtain a maximally combined knowledge base, with respect to the union of the knowledge bases, that is consistent with the integrity constraints. Subrahmanian [99] presents a uniform theoretical framework, based on annotated logics, for amalgamating multiple knowledge bases when these knowledge bases (possibly) contain inconsistencies. In the framework, individual knowledge bases are transformed into new annotated logic programs and amalgamated under a new axiom scheme. Based on epistemic logic, Liau [65] provides a modal logic framework for reasoning about multi-agent belief and its fusion. Two strategies are considered for the cautious merging of beliefs, and the author proposes two logics for these two strategies respectively. An important concept in this framework is the different degrees of reliability of the agents.

Different from the above efforts, we study the problems of integrating graph ontologies and RDF ontologies in Chapter 4 and Chapter 6 respectively. We provide efficient algorithms to integrate the ontologies based on their special properties.

3.4 Object Matching

The problem of object matching is important in data integration. It helps to determine whether object representations in different databases refer to the same entity and helps to remove duplicates when combining information from different sources. There is a large body of work on this problem. Object matching may also be referred to as entity resolution, object identification, record linkage, etc. We review some of this work in Section 3.4.1.

3.4.1 Entity Resolution

Entity resolution has received significant attention from both industry and academia, due to its wide use in many problems, such as deduplication in data integration and reference reconciliation in Personal Information Management. The approaches in the literature can be divided into two kinds: single class resolution and multiple class resolution. Earlier work in the literature explored various methods to resolve entities in a single class that has a number of attributes.
Many similarity metrics have been pro- posed to match strings or textual values, for example, Levenstein distance, Monge- Elkan distance[82], Jaro metric[57], and token-based distance like Jaccard similarity and cosine similarity, etc[28]. [28] provides an experimental comparison for these string metrics. Meanwhile, a lot of work has been focused on matching records where 35 each record has a set of attributes and each attribute can be matched by those string metrics or some domain-specific similarity measures. For example, Hernandez and Stolfo [49] proposed a e?cient merge/purge process to compute the transitive clo- sure over multiple independent passes, each of which uses a di?erent key to sort the data such that similar records are in a small neighborhood. Their system also includes a declarative rule language to define equivalence of two values. Recent research e?orts have made intensive use of dependency relationships between di?erent class resolution decisions. Pasula et al. [87] have defined a genera- tive model in the context of citation matching. Their model is based on a relational probability model and adopts an approximation method based on Markov chain Monte Carlo. In [86], Parag and Domingos proposed a multi-relational approach to allow information to propagate from one candidate match to another via their common attributes. Their systems makes a collective decision for all the candidate pairs instead of making a decision for each pair seperately. Bhattacharya and Getoor [18, 16, 17] have proposed various approaches for entity resolution. In [18], an interative algorithm for deduplication was proposed to identify authors. Their main idea is to make use of coauthor relationships and iteratively identify additional potential co-references as more evidence is collected. In [16], they regarded entity resolution as a clustering problem and defined similarity measures to group the references. In [17], they proposed a generative probabilistic model to make collective decisions for resolving entities. In contrast to previous approaches, they introduced two hidden variables for each reference instead of one decision variable for each duplicate pair. In addition, they presented an unsupervised 36 method for determing the number of entities. Dong et al. [34] describe an algorithm to reconcile references (entities) which belong to multiple related classes and each reference may have very few attribute values. The major application of their algorithm is in Personal Information Manage- ment. In their algorithm, they build dependency graph to represent the dependency between a pair of similarities, which may be the similarity of a pair of references or the similarity of a pair of attribute values. By enriching the references while propagating similarity decisions from node to node in the dependency graph, their algorithm makes a continuous loop between computing similarities and matching decisions until a fixed point is reached. None of the above e?orts has considered to use ontologies for matching objects. Ontologies constitute a useful tool to convey the semantics of terms in databases. For example, a ?blue? shirt is considered to be similar to a ?navy? shirt, and a gene product is close to another gene product with similar cellular locations, molecular functions and biological processes. However, the colors and the properties of gene products are ontological terms which are hard to match using string metrics or the above approaches. 
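For concreteness, a minimal sketch of one of the token-based metrics surveyed above (Jaccard similarity over word tokens) applied attribute by attribute to a pair of records is shown below. The records, field names and values are hypothetical and serve only to illustrate the single-class matching style discussed above.

```python
# A minimal sketch (hypothetical records and fields): token-based Jaccard similarity
# applied attribute by attribute, in the spirit of the single-class approaches above.
def jaccard(s, t):
    a, b = set(s.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

r1 = {"name": "Jeffrey D. Ullman", "affiliation": "Stanford University"}
r2 = {"name": "J. D. Ullman",      "affiliation": "Stanford Univ."}

# Average the per-attribute scores; a real system would weight fields and apply a threshold.
score = sum(jaccard(r1[f], r2[f]) for f in r1) / len(r1)
print(round(score, 2))   # roughly 0.42 for this pair
```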
We present our work on ontology-based object matching in Chapter 7. It is important to note that our ontology-based similarity measures can be used as domain-specific metrics in the above approaches, such as the framework in [34]. 37 3.4.2 Similarity Measures on Ontologies Recently, people have proposed a range of ontology-related approaches to com- pute the semantic similarity scores. The literature describes three distinct groups of similarity measures that can be applied to taxonomies (ontologies). The first group of measures, which we will refer to as term-similarity measures, can be used to com- pute the similarity of two individual terms. The other two groups of measures can be used when an object is labeled with multiple terms. Wu and Palmer[114] defined a term-similarity measure based on the depth (in the taxonomy) of the least common ancestor of two terms relative to the depths of the terms individually. The ?closer? the common ancestor is to the terms themselves, the greater the similarity. Wu and Palmer did not describe how to use their approach when the taxonomy is a DAG. Another problem with this approach is that some portions of the taxonomy may have been extensively developed and contain many terms, whereas other areas are sparse. Such variations in the ?density? of terms make this and other measures (such as [91, 92]) that rely on edge counts a poor estimate of similarity. To the best of our knowledge, the idea of using information content to mea- sure similarity is due to Resnik[94]. Using the WordNet taxonomy and frequency estimates derived from a large body of English text, Resnik calculated the seman- tic similarity of word pairs by selecting the common ancestor with the greatest information content. For words with multiple senses, Resnik used the sense that produced the maximum similarity. Using judgments made by human subjects as 38 the standard, Resnik found that his measure worked better than earlier ones based on edge-counting. Although Resnik?s measure can be used when the taxonomy is structured as a DAG, it cannot be used directly when objects are labeled with mul- tiple terms, and it has a number of other disadvantages. Its range is not normalized to [0, 1], but more importantly, by selecting the ancestor with the greatest informa- tion content it understates the similarity of objects by focusing on the single most significant aspect of their commonality, at the expense of all others. Lin[66] provided an axiomatic definition of similarity, and showed how Resnik?s approach could be adapted to fit this framework. Whereas Resnik?s measure was based solely on the commonality between word meanings, Lin?s also takes into ac- count the di?erences in meaning to determine a normalized similarity score. Lin compared his measure to Resnik?s and to Wu and Palmer?s, and found that it pro- duced scores that were better correlated with human judgments than those produced by the other two measures. However, Lin does not describe how to use his mea- sure when the taxonomy is a DAG, or when multiple terms are used to describe an object. The similarity measures in the second group described in the literature measure the similarity of objects based on the number or frequency of terms that are common to the descriptions of both objects. Measures in this group include Jaccard, Dice and Set Cosine, which are used frequently in information retrieval systems and di?er in how the count of common terms is normalized, as well as the FMS measure of Keller at al[60]. 
These measures do not take the structure of the taxonomy into account. Any candidate object that does not share terms with the target will receive a score 39 of zero, even though it may be quite similar. The third set of proposed similarity measures relies on an underlying term- similarity measure to determine the similarity between individual pairs of terms, and then combines these to yield an overall similarity score. Halkidi et al.[45] defined a similarity measure of this type for use in clustering web documents. Using the Wu and Palmer term similarity measure, they consider each term in the target and candidate sets individually, and find the most similar term from the other set. Then, over each term set (the target set and the candidate set), they average the similarity of these best matches. Finally, they combine the average similarity from the two sets with equal weight. Since they use Wu and Palmer as the underlying term similarity measure, it is not clear how to apply this measure when the taxonomy is a DAG. Wang et al.[109] developed a similarity measure that avoids this problem by using a generalized form of Lin?s measure to determine the similarity of each term- pair. They generalize Lin?s formula for use in a DAG taxonomy by selecting the least common ancestor with the maximum information content. Note that they use a di?erent function than Halkidi et al. for combining term-pair scores. Instead of averaging the scores of the best match from the other set for each term, they average the term similarity scores across all term pairs. Keller et al.[60] present several ways of quantifying similarity using fuzzy mea- sures based on the depth or information content of terms. However, their measures either require subjectively-specified weights, or the solution of a high-order polyno- mial equation for each pair of target and candidate objects. We believe this to be prohibitively expensive for large problems. 40 In Chapter 7, we present two taxonomy-based similarity measures which are applicable to practical classification systems, where taxonomies (ontologies) can be DAG-structured, objects can be labeled with multiple terms, and ambiguity can be introduced by an evolving ontology or classifiers with imperfect knowledge. 41 Chapter 4 Ontology-extended Relational Algebra Heterogeneous relational databases being integrated vary widely in how the same data is represented. The discrepancy exists not only in the syntax, such as data structures and storage systems, but also in the semantics, such as the di?erent expressions referring to the same concept. It is widely acknowledged that integra- tion of data at the semantic level remains an open problem. As we mentioned in Chapter 1, semantic discrepancies degrade recall and a?ect the quality of query answers. In this chapter, we present our approach [19] on semantically integrat- ing heterogeneous relational databases. An important notion in our approach is ontology extended relation. Ontologies provide semantic information about some body of data (e.g. num- bers, strings, etc.). An ontology extended relation is a relation that has an associated ontology. Intuitively, the ontology associated with a relation provides some seman- tical information about the relation and about the syntactic objects (e.g. attribute names, attribute values) associated with the relation. 
For example, given a relation about stock information, an ontology associated with this relation may contain an explanation of the terms in the relation and/or an explanation of some of the relationships between such terms. For example, it may tell us that database companies are software companies, that a carburettor is part of a car, and that a car is equivalent to an automobile. We use a running industrial parts example and an insurance example to illustrate the basic concepts of our approach. In addition, when performing queries over multiple ontology extended relations, we may need to merge multiple ontologies. We show how multiple ontologies may be merged.

4.1 Motivating Example

In this section, we present a simple parts database example and an insurance claims database example. Both these examples are toy examples used primarily to illustrate the basic definitions and results of this paper.

4.1.1 Simple Industrial Parts Example

Consider a very simple example consisting of industrial parts data. There are numerous companies in the US and elsewhere that maintain stocks of various industrial parts (e.g. Grainger in the USA). Two different companies of this kind may maintain part information such as that shown in the Parts1 (Table 4.1) and Parts2 (Table 4.2) relations below.

  Name        Cost     Shipping
  Tire        54.19    20.05
  Gasket      3.05     1.55
  Valve       3.35     1.55
  Brake pads  78.50    8.50
  Evaporator  305.00   11.50
  ...         ...      ...

  Table 4.1: Parts1 relation

  Item        Price    ShipCost
  Wheel       50.05    18.00
  Air Gasket  3.00     1.70
  Valve       3.35     1.55
  Hubcap      11.50    6.00
  Spark Plug  20.00    8.50
  ...         ...      ...

  Table 4.2: Parts2 relation

When we examine these two databases, we notice several things. First, the fields Cost and Price probably refer to the same thing. Second, the fields Shipping and ShipCost probably refer to the same concept. However, by looking at the fields, we really are at a loss to determine whether these fields use the same units or not. For example, Cost in the Parts1 relation may be in US dollars, while Price in the Parts2 table may be in Euros. In addition, a wheel may be a part of a tire, and likewise, a hubcap is also a part of a tire. Evaporator and AC Evaporator probably mean the same thing. And an air gasket probably is a gasket.

A user asking queries spanning these two databases would probably like answers that resolve the above types of semantic ambiguities. For example:

- If a user wants to find all tuples dealing with gaskets in either the Parts1 or Parts2 database, then the system should be smart enough to know that the attribute names Name and Item are the same, and that an "Air Gasket" is a kind of gasket.
- Likewise, if the user wants to find prices of all tire parts, the system should be smart enough to know that a wheel is part of a tire and hence constitutes a tire part. Similarly for hubcaps. It should therefore return the tuples associated with wheels, hubcaps and tires.
- A more mundane retrieval may arise because the user may want to find valves that cost less than $3.50. In this case, if the Parts1 relation uses Euros to express units, then the system should automatically convert the value in Euros to USD.

In this paper, we develop a model of ontologies and a concept of ontology extended relations that address these issues.

4.1.2 Insurance Example

Consider the case of an insurance company that maintains records about insurance claims. Such a company may keep track of who made a claim, what the claim was about, etc.
This information may be entered by diverse claim adjusters who work for the company and may have been entered in different ways over the years. When the insurance company buys another company, they may need to execute queries spanning both databases. We show below two such relations - Claims1 (Table 4.3) and Claims2 (Table 4.4) - that may contain the data from two such databases.

  ClaimId  Type      Cost
  1        burglary  2000
  2        theft     150
  3        mugging   860
  4        arson     1800

  Table 4.3: Claims1 relation

  ClaimNumber  Type           Value
  1            robbery        400
  2            fire           550
  3            auto accident  500
  4            burglary       250

  Table 4.4: Claims2 relation

Again, in the above example, we see that burglaries and muggings are types of theft. Likewise, arson is a kind of fire. Burglaries are a kind of robbery (which is synonymous with theft). As a consequence, queries about thefts should include all tuples involving such terms. For example, consider the two situations listed below.

- If a user wanted to find all thefts that involved a cost of over $1000, the system should automatically recognize that burglaries, muggings and robberies count as thefts. If costs are represented in different currencies, conversions should be automatically performed.
- Likewise, if a user wanted to automatically find all fires, it should recognize that arsons are fires and hence should return the appropriate tuples.

4.2 Ontologies and Ontology Integration

In this section, we define ontologies, and describe how to integrate a set of ontologies. We only consider ontologies involving simple concepts - ontologies involving boolean combinations of concepts are not considered.

4.2.1 Ontologies

In order to define an ontology, we first need the notion of a hierarchy. A couple of definitions are needed in order to get a formal definition of a hierarchy. If S is a nonempty set and ≤ ⊆ S × S, then (S, ≤) is an ordering. As usual, (S, ≤) is a partial ordering if ≤ is a reflexive, transitive, and anti-symmetric binary relation on S.

Definition 4 (better than) Suppose (S, ⪯_1), (S, ⪯_2) are two orderings. We say (S, ⪯_1) is better than (S, ⪯_2) iff (∀x, y ∈ S) x ⪯_1 y ⇒ x ⪯_2 y. We will say that (S, ⪯_1) is strictly better than (S, ⪯_2) iff (S, ⪯_1) is better than (S, ⪯_2) and (S, ⪯_2) is not better than (S, ⪯_1).

Definition 5 (hierarchy) Suppose S is a partially ordered set under ordering ≤. A hierarchy for S is an ordering (S, ⪯) such that:
1. (S, ⪯) is better than (S, ≤), and
2. (S, ≤) is the reflexive, transitive closure of (S, ⪯).
3. there is no other ordering (S, ⊑) satisfying the preceding two conditions such that (S, ⊑) is strictly better than (S, ⪯).

A simple example is shown below.

Example 2 Consider the set S = {wheel, car, hubcap}. A partial ordering reflecting the "part of" relation may say that a wheel is part of a car, a hubcap is part of a car, and a hubcap is part of a wheel. (The "part of" relation in this example has nothing to do with philosophical arguments.) In addition, everything is a part of itself. In this case, the natural definition of the part of relation, denoted ≤, is given by the set {(wheel, wheel), (car, car), (hubcap, hubcap), (wheel, car), (hubcap, car), (hubcap, wheel)}. There is only one hierarchy associated with this partial ordering, viz. the set {(wheel, car), (hubcap, wheel)}.

The astute reader will note that a hierarchy as defined above is nothing but the Hasse diagram associated with a partial ordering.
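This observation is easy to make concrete. The sketch below is illustrative only (it is not the dissertation's implementation): it computes the hierarchy of a finite partial order by dropping reflexive pairs and every pair implied by transitivity, and it reproduces the hierarchy of Example 2.

```python
# A minimal sketch: the hierarchy (Hasse diagram) of a finite partial order given as pairs.
def hierarchy(order):
    """order: set of (x, y) pairs with x <= y, assumed reflexive, transitive, antisymmetric."""
    strict = {(x, y) for (x, y) in order if x != y}          # drop reflexive edges
    hasse = set(strict)
    for (x, y) in strict:
        # (x, y) is redundant if some z lies strictly between x and y
        if any((x, z) in strict and (z, y) in strict for z in {b for (_, b) in strict}):
            hasse.discard((x, y))
    return hasse

# Example 2: the part-of ordering over {wheel, car, hubcap}
part_of = {("wheel", "wheel"), ("car", "car"), ("hubcap", "hubcap"),
           ("wheel", "car"), ("hubcap", "car"), ("hubcap", "wheel")}
print(hierarchy(part_of))   # {('wheel', 'car'), ('hubcap', 'wheel')} (set order may vary)
```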
Taking advantage of this fact, we will often talk about hierarchies as though they are graphs (hence, we will talk about paths in hierarchies, etc.).

Definition 6 Suppose Λ is some finite set of strings and S is some set. An ontology w.r.t. Λ is a partial mapping O from Λ to hierarchies for S.

Throughout the rest of this chapter, we will assume that Λ and S are arbitrarily fixed. We will further assume that Λ contains a distinguished string called isa (representing the usual isa relationship) and that O(isa) is always defined.

Example 3 Suppose Λ = {isa}. Figures 4.1(a) and 4.1(b) show ontologies associated with the relations Claims1 and Claims2 respectively.

[Figure 4.1: (a) Ontology associated with the Claims1 relation (burglary and mugging below theft; arson), (b) Ontology associated with the Claims2 relation (burglary below robbery; fire; auto-accident).]

4.2.2 Ontology Integration

Recall that we intend to associate ontologies with relations (though we have not done so yet) and then to extend the relational algebra to handle such ontology-extended relations. When performing binary operations between two such ontology-extended relations, we need to know the relationship between the terms in each ontology. We start by describing the relationship between terms in two hierarchies.

Definition 7 (interoperation constraints) Suppose (H_i, ⪯_i), 1 ≤ i ≤ n are n different hierarchies and suppose i ≠ j. Then (x:i = y:j), (x:i ⪯ y:j), (x:i ⋠ y:j), (x:i ≠ y:j) are called interoperation constraints.

NOTE: As the constraint x:i = y:j can be written as the two constraints x:i ⪯ y:j and y:j ⪯ x:i, we will henceforth assume (without loss of generality) that equality constraints are not present.

For example, an interoperation constraint involving an English and a French veterinary practice may say that dog:1 = chien:2, explaining that Dog (in English) in the first hierarchy and Chien (in French) in the second hierarchy mean the same thing. In the case of the two claims relations for the insurance application, our interoperation constraints may say that theft:1 = robbery:2 and that arson:1 isa fire:2.

Definition 8 (integration) Suppose (H_i, ⪯_i), 1 ≤ i ≤ n are n different hierarchies and suppose IC is a finite set of interoperation constraints. A hierarchy (H, ⪯) is said to be an integration of (H_i, ⪯_i), 1 ≤ i ≤ n iff there are n injective mappings μ_1, ..., μ_n from H_1, ..., H_n respectively to H such that:
1. (∀i ∈ {1,...,n}) x ⪯_i y ⇒ μ_i(x) ⪯ μ_i(y).
2. (∀x ∈ H_i)(∀y ∈ H_j) (x:i op y:j) ∈ IC ⇒ μ_i(x) op μ_j(y).

In this case, (H, ⪯), μ_1, μ_2, ..., μ_n is a witness to the integrability of (H_i, ⪯_i), 1 ≤ i ≤ n in the presence of interoperation constraints IC.

Intuitively, an integration of two hierarchies is a new hierarchy. Each member of the hierarchies being merged must be associated with a member of the integrated hierarchy. Axiom (1) above says that the integrated hierarchy must preserve the ordering associated with each of the input hierarchies. Axiom (2) above says that it must preserve the interoperation constraints. It is important to note that there could be many different ways of integrating hierarchies. Some will be better than others.

Example 4 Consider integrating the two hierarchies of Figures 4.1(a) and 4.1(b) w.r.t. the interoperation constraints theft:1 = robbery:2 and arson:1 isa fire:2.
Figure 4.2(a) shows the integrated hierarchy. The mappings that witness this integration are:

  μ_1: burglary ↦ burglary,  theft ↦ theft,  mugging ↦ mugging,  arson ↦ arson
  μ_2: robbery ↦ theft,  fire ↦ fire,  auto-accident ↦ auto-accident,  burglary ↦ burglary

[Figure 4.2: (a) Integrated ontology integrating the Claims1 and Claims2 relations, (b) Hierarchy graph associated with the Claims1 and Claims2 relations, (c) Canonical hierarchy associated with the Claims1 and Claims2 relations.]

Definition 9 (graph associated with a set of hierarchies) Suppose (H_i, ⪯_i), 1 ≤ i ≤ n are n different hierarchies and suppose IC is a finite set of interoperation constraints. The hierarchy graph G associated with (H_i, ⪯_i), 1 ≤ i ≤ n is defined as follows:
- The set of vertices is the set {x:i | x ∈ H_i};
- The set of edges is the union of:
  - {(x:i, y:i) | x, y ∈ H_i ∧ x ⪯_i y}
  - {(x:i, y:j) | x:i ⪯ y:j ∈ IC}.

Example 5 Returning to our running claims example, Figure 4.2(b) shows the hierarchy graph associated with the two claims ontologies.

Definition 10 (canonical hierarchy) Suppose (H_i, ⪯_i), 1 ≤ i ≤ n are n different hierarchies and suppose IC is a finite set of interoperation constraints. The canonical hierarchy (H*, ⪯*) of (H_i, ⪯_i), 1 ≤ i ≤ n is defined as follows.
- H* is the set of all strongly connected components of the graph associated with (H_i, ⪯_i), 1 ≤ i ≤ n. (Recall that the strongly connected components of a graph are the maximal sets of vertices C such that there is a directed path between any two vertices of C.)
- If x*, y* ∈ H*, then x* ⪯* y* iff either x* = y* or there exists a directed path from x:i to y:j (for some x:i ∈ x* and y:j ∈ y*) in the hierarchy graph associated with (H_i, ⪯_i), 1 ≤ i ≤ n.

The following example illustrates the concept of a canonical hierarchy.

Example 6 Figure 4.2(c) shows the canonical hierarchy associated with our running claims ontologies.

Definition 11 (canonical witness) Suppose (H_i, ⪯_i), 1 ≤ i ≤ n are n different hierarchies, suppose IC is a finite set of interoperation constraints, and suppose (H*, ⪯*) is the canonical hierarchy. The canonical witness for (H_i, ⪯_i), 1 ≤ i ≤ n is (H*, ⪯*), μ*_1, ..., μ*_n where μ*_i(x) = [x:i], i.e. μ*_i maps x into the strongly connected component of the graph containing x.

The canonical witness associated with our example is the obvious one - each element in either of the two hierarchies is mapped onto the node in the canonical hierarchy that contains that element.

Suppose (H_i, ⪯_i), 1 ≤ i ≤ n are n different hierarchies and suppose IC is a finite set of interoperation constraints. The following lemma is an important result stating that the dependencies that hold in the canonical hierarchy (H*, ⪯*) of (H_i, ⪯_i), 1 ≤ i ≤ n must also hold in any arbitrary witness to the integration of the hierarchies (H_i, ⪯_i), 1 ≤ i ≤ n.

Lemma 1 Let (H_i, ⪯_i), 1 ≤ i ≤ n be a family of hierarchies and suppose (H*, ⪯*) is its canonical hierarchy. Suppose (H, ⪯), μ_1, ..., μ_n is any arbitrary witness to the integration of (H_i, ⪯_i), 1 ≤ i ≤ n. Then:

  [x:i] ⪯* [y:j] ⇒ μ_i(x) ⪯ μ_j(y).

In the above, [x:i] (resp. [y:j]) denotes the strongly connected component containing x:i (resp. y:j).
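The canonical construction of Definitions 9-11 is straightforward to compute. The sketch below is illustrative only (the input encoding and the reachability-based grouping of strongly connected components are my own choices, not the HOME implementation): it builds the hierarchy graph, collapses it into strongly connected components, and returns the induced ordering ⪯*.

```python
# A minimal sketch (illustrative encoding, not the HOME implementation): the hierarchy
# graph of Definition 9 and the canonical hierarchy of Definition 10.
from collections import defaultdict

def reachable(graph, src):
    seen, stack = set(), [src]
    while stack:
        for v in graph[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def canonical_hierarchy(hierarchies, constraints):
    """hierarchies: {i: set of (x, y) pairs meaning x is below y in hierarchy i};
       constraints: set of (x, i, y, j) meaning x:i is below y:j (equalities become two such pairs)."""
    graph = defaultdict(set)
    for i, edges in hierarchies.items():
        for x, y in edges:
            graph[(x, i)].add((y, i))
    for x, i, y, j in constraints:
        graph[(x, i)].add((y, j))
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    reach = {n: reachable(graph, n) for n in nodes}
    # two vertices share a strongly connected component iff each reaches the other
    scc = {n: frozenset([n] + [m for m in nodes if m in reach[n] and n in reach[m]])
           for n in nodes}
    order = {(scc[n], scc[m]) for n in nodes for m in reach[n]}   # directed paths induce the ordering
    return set(scc.values()), order

# The running claims example: theft:1 = robbery:2 is written as two "below" constraints.
h = {1: {("burglary", "theft"), ("mugging", "theft")},
     2: {("burglary", "robbery")}}
ic = {("theft", 1, "robbery", 2), ("robbery", 2, "theft", 1), ("arson", 1, "fire", 2)}
H_star, order = canonical_hierarchy(h, ic)   # theft:1 and robbery:2 end up in one component
```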
The reader should note that the above lemma does not say that a witness to the integration of the hierarchies (H_i, ⪯_i), 1 ≤ i ≤ n exists! It merely says that if at least one such witness exists, then that witness must include all the dependencies in the canonical hierarchy. The following theorem now says that a set of hierarchies can be integrated if and only if the canonical hierarchy and the canonical translation constitute a witness to the integrability of (H_i, ⪯_i), 1 ≤ i ≤ n.

Theorem 1 A set (H_i, ⪯_i), 1 ≤ i ≤ n of hierarchies is integrable if and only if the canonical witness of (H_i, ⪯_i), 1 ≤ i ≤ n is a witness to the integrability of (H_i, ⪯_i), 1 ≤ i ≤ n.

The above theorem is very important. It says that to check whether a set of hierarchies is integrable, all we need to do is to compute the canonical hierarchy associated with the hierarchies involved, and then explicitly check if the canonical hierarchy is a witness to their integrability. This can be easily done in polynomial time. We are finally ready to define what it means to integrate two different ontologies.

Definition 12 (integrability of a set of ontologies) Suppose Λ is some finite set of strings, S is some set, and O_1, ..., O_n are ontologies w.r.t. Λ, S. Suppose IC is a finite set of interoperation constraints. The ontologies O_1, ..., O_n are integrable iff for every x ∈ Λ, the hierarchies O_1(x), ..., O_n(x) are integrable.

The preceding results suggest a simple algorithm to integrate a collection O_1, ..., O_n of ontologies w.r.t. Λ, S. First, for each x ∈ Λ, compute the canonical hierarchy H_x of O_1(x), ..., O_n(x). Check if this canonical hierarchy satisfies the interoperation constraints. If this check evaluates to true for all x ∈ Λ, then we know that the ontologies O_1, ..., O_n are integrable and, in fact, the integration of these ontologies is the new integrated ontology O which maps x to H_x. We call O the canonical fusion (or just fusion) of the ontologies O_1, ..., O_n.

Once again, we emphasize that in this paper, we are only considering simple ontologies composed of atomic concepts. Some authors [24] have proposed that expressive logics much richer than those above be used for expressing ontologies, but have not provided extensions of the relational algebra incorporating such ontologies. Extension to richer ontology structures is left for future work.

4.3 Ontology Extended Relational Algebra

In this section, we extend the relational model of data and the relational algebra to utilize the information contained in ontologies associated with relations. Without this capability (as in the standard relational algebra) the answers returned by queries may be semantically incorrect and/or incomplete. Due to space restrictions, in the rest of this paper, we will assume that the set Λ equals {isa}, i.e. only one hierarchy is associated with each ontology. The results of the paper can be easily extended to incorporate other kinds of hierarchies such as part of.

Types, Domain Values, and Hierarchies. We assume there exists a set 𝒯 of strings called types. Each type τ ∈ 𝒯 has an associated set dom(τ) called its domain. The members of dom(τ) are called values of type τ. When T ⊆ 𝒯, we will often use dom(T) to denote ⋃_{τ∈T} dom(τ). For example, we may have 𝒯 = {int, string}. We may have dom(int) be the set of all integers and dom(string) be the set of all strings over {a,...,z}.
However, we may also have in 𝒯 a type like mm whose domain is the set of all non-negative integers (representing millimeters). A hierarchy of the form H = (T_H, ⪯_H), where T_H is a set of types, is called a type hierarchy.

Suppose τ ∈ T_H is a type and a ∈ dom(τ) is an arbitrary value of type τ. Clearly, we could also have a type called a with dom(a) = {a}. Thus, each value of a type may also be viewed as a type. Given any type hierarchy H = (T_H, ⪯_H) and a τ ∈ T_H, let

  below_H(τ) := {τ' | τ' ⪯_H τ} ∪ dom(τ).

Intuitively, below_H extends the partial ordering of the hierarchy H to include the values of the types.

Extended Schemas and Relations. In this section, we show how to extend schemas and relations to include type information. Let 𝒜 be a set of attribute names. If A_1, ..., A_n ∈ 𝒜 and τ_1, ..., τ_n ∈ 𝒯, then (A_1:τ_1, ..., A_n:τ_n) is called a schema. Intuitively, an extended schema associates types with each attribute name. This is important because a type could be something like USD representing US Dollars or cm representing (intuitively) centimeters.

An extended relation is a triple (R, S, H_isa), where S is a schema (A_1:τ_1, ..., A_n:τ_n), H_isa is an isa hierarchy, and the following constraints are satisfied:

  τ_1, ..., τ_n ∈ T_{H_isa}    (4.1)
  R ⊆ below_{H_isa}(τ_1) × ··· × below_{H_isa}(τ_n)    (4.2)

where H_isa is the hierarchy associated with isa. (When we allow Λ to contain not just isa but other strings as well, the H_isa term needs to be replaced by the whole ontology, with obvious corresponding changes to the rest of the paper; for the sake of simplicity and due to space restrictions, in this section we only consider the case where one hierarchy is used.) If t ∈ R and t = ⟨x_1, ..., x_n⟩, then we use t(A_i) to denote x_i.

Conversion Functions. As the reader has already seen, since different units might be used in relations, there is often a need to convert from one unit to another. In this section, we introduce the concept of a conversion function. For each pair of types τ_i and τ_j, we assume there exists at most one conversion function τ_i2τ_j : dom(τ_i) → dom(τ_j). We assume that conversion functions are total. Moreover, we assume that the set of conversion functions enjoys the following closure conditions and constraints:

- for each τ ∈ 𝒯, τ2τ exists and equals the identity function;
- if both τ_12τ_2 and τ_22τ_3 exist, then τ_12τ_3 exists and equals the composition τ_22τ_3 ∘ τ_12τ_2. (Note that as τ_12τ_3 is assumed to be unique, all the possible conversion function compositions that convert τ_1's values into τ_3's values must be equivalent.) Moreover, it is not necessary to specify τ_12τ_3 explicitly. The system may automatically compose it from the other functions.
- for all hierarchies H, if τ_1 ⪯_H τ_2 then there exists a conversion function τ_12τ_2.

The third requirement may cause a significant burden during system specification and implementation. To reduce this burden we adopt the following intuitive strategy. Units of measure always require an explicit conversion function, because they always adopt different scales. Therefore, given a type "Currency" with subtypes "Euro" and "Pound", we need a unified representation for Currency together with conversion functions Euro2Currency and Pound2Currency. On the contrary, when types represent classes of objects (as in the parts and insurance examples), it seems reasonable to assume by default that the conversion functions from subtypes to supertypes are the identity function. These two cases (types as units and types as classes) are explicitly distinguished through suitable declarations of the form: unit τ_1, ..., τ_n and class τ_1, ..., τ_n.

There are cases in which types represent classes, but heterogeneous notation has been adopted in the subclasses. As a result, some objects may have multiple names or identifiers. In these cases, a conversion function is needed to produce a non-ambiguous naming, otherwise equalities and other comparisons involving these objects are potentially erroneous. Such types should not be declared to be units or classes, as that would force the system to check that all the necessary conversions have been specified.
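A minimal sketch of such a family of conversion functions is shown below; the registry class, the unit names and the exchange rate are all hypothetical, and the point is only to illustrate the closure conditions (identity and composition) stated above.

```python
# A minimal sketch (illustrative only): a registry of conversion functions that is
# closed under identity and composition, as required by the constraints above.
class Conversions:
    def __init__(self):
        self.funcs = {}                                   # (src, dst) -> callable

    def register(self, src, dst, f):
        self.funcs[(src, dst)] = f

    def convert(self, value, src, dst, seen=()):
        if src == dst:                                    # tau2tau is the identity
            return value
        if (src, dst) in self.funcs:
            return self.funcs[(src, dst)](value)
        for (a, b), f in list(self.funcs.items()):        # compose a2b with b2dst
            if a == src and b not in seen:
                try:
                    return self.convert(f(value), b, dst, seen + (src,))
                except KeyError:
                    continue
        raise KeyError(f"no conversion from {src} to {dst}")

conv = Conversions()
conv.register("euro", "usd", lambda x: x * 1.2)           # hypothetical rate
conv.register("usd", "currency", lambda x: x)
print(conv.convert(140, "euro", "currency"))              # euro2usd composed with usd2currency
```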
4.3.1 Selection Conditions

Selection conditions are used both in defining the selection operation as well as the join operation. In this section, we define selection conditions. Simple (or atomic) conditions have the form X op Y, where op ∈ {=, ≠, <, ≤, >, ≥, instance_of, isa, subtype_of, above, below}, and X, Y are terms, that is, attributes, types, or typed values v:τ, with v ∈ dom(τ). Types will sometimes be omitted in typed values when they can be unambiguously derived from the context. Selection conditions can now be defined as follows:

- all simple conditions are selection conditions,
- if c_1, c_2 are selection conditions then c_1 ∧ c_2, c_1 ∨ c_2, ¬c_1 are selection conditions,
- nothing else is a selection condition.

Example 7 For instance, in our claims example, Type isa theft and Type isa theft ∧ Cost > 140:pounds are selection conditions. Note that if we apply the second selection condition to the Claims1 relation and we assume that the type of Cost is USD (US dollars), we should only get the first tuple as the answer (the second tuple would have been in the answer if the selection condition had been Type isa theft ∧ Cost > 140:usd).

In the rest of this section, let ℛ = (R, (A_1:τ_1, ..., A_n:τ_n), H_isa) be an arbitrary but fixed extended relation. In the context of ℛ, the type of a term X is defined as follows:

  type(X) = τ_i  if X = A_i
  type(X) = τ    if X = τ or X = v:τ.

When type(X), type(Y) are both in a type hierarchy T and the least upper bound of type(X) and type(Y) in T exists, then it is called the least common supertype of X and Y w.r.t. T. We say that a simple condition X op Y with op ∈ {=, ≠, <, ≤, >, ≥} is well-typed if X and Y have a least common supertype τ and the conversion functions type(X)2τ and type(Y)2τ exist. Simple conditions with other operators are always well-typed. A selection condition is well-typed if all its simple subconditions are.

The value of a term X w.r.t. a tuple t ∈ R is

  X_t = t.A_i  if X = A_i
  X_t = τ      if X = τ
  X_t = v      if X = v:τ.

A tuple t ∈ R satisfies a well-typed condition c in the context of ℛ (denoted ℛ, t |= c) in the following cases:

- c = X op Y, op ∈ {=, ≠, <, ≤, >, ≥}, τ is the least common supertype of X and Y, and (type(X)2τ)(X_t) op (type(Y)2τ)(Y_t) is true.
- c = X instance_of Y, Y_t ∈ 𝒯, type(X) ⪯_{H_isa} Y_t, and X_t ∈ dom(Y_t).
- c = X subtype_of Y, X_t ∈ 𝒯, Y_t ∈ 𝒯, and X_t ⪯_{H_isa} Y_t.
- c = c_1 ∧ c_2, ℛ, t |= c_1 and ℛ, t |= c_2.
- c = c_1 ∨ c_2, and either ℛ, t |= c_1 or ℛ, t |= c_2.
- c = ¬c_1, and ℛ, t ⊭ c_1.
- c = X below Y, and ℛ, t |= X instance_of Y ∨ X subtype_of Y.
- c = X above Y, and ℛ, t |= Y below X.
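To illustrate how a simple comparison condition is evaluated, here is a minimal sketch; the helper functions lcs and convert, the toy type hierarchy and the exchange rate are all hypothetical stand-ins for the machinery defined above.

```python
# A minimal sketch (hypothetical helpers): evaluating a simple condition X op Y by
# converting both sides to their least common supertype, as in Example 7.
import operator

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def satisfies(tuple_, attr_types, cond, lcs, convert):
    """cond = (X, op, Y) where X is an attribute name and Y a typed constant (v, tau);
       lcs(t1, t2) returns the least common supertype, convert(v, src, dst) converts values."""
    attr, op, (v, tau) = cond
    t_x = attr_types[attr]
    common = lcs(t_x, tau)                     # the condition is well-typed if this exists
    lhs = convert(tuple_[attr], t_x, common)
    rhs = convert(v, tau, common)
    return OPS[op](lhs, rhs)

# A Claims1 tuple with Cost stored in USD, queried in pounds (rates and hierarchy are toys).
lcs = lambda a, b: "usd"                       # toy hierarchy: everything converts to USD
convert = lambda v, src, dst: v * 1.6 if src == "pounds" else v
t = {"ClaimId": 1, "Type": "burglary", "Cost": 2000}
print(satisfies(t, {"Cost": "usd"}, ("Cost", ">", (140, "pounds")), lcs, convert))  # True
```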
4.3.2 Ontological Relational Algebra

Suppose (R_1, S_1, H_1), ..., (R_z, S_z, H_z) are ontology extended relations, and suppose F is a fusion of H_1, ..., H_z via witness tr_F. For all algebraic expressions E, [E]_F is inductively defined as follows. Let op range over the operators of the relational algebra.

- If E is a relation R_i with schema S_i = (A_1:τ_1, ..., A_n:τ_n), then [E]_F = (tr_F(R_i), S, H_F), where S = (A_1:tr_F(τ_1), ..., A_n:tr_F(τ_n)) and H_F is the fused hierarchy. In this case, E is always well-typed.
- (Projection) If E is π_{A_{i_1},...,A_{i_k}}(E') (1 ≤ i_j ≤ n, 1 ≤ j ≤ k) and if [E']_F = (R', (A_1:τ_1, ..., A_n:τ_n), H_F), then [E]_F = (R, S, H_F), where R is the standard projection of R' onto A_{i_1}, ..., A_{i_k} and S = (A_{i_1}:τ_{i_1}, ..., A_{i_k}:τ_{i_k}). If E' is well-typed then E is.
- (Cross product) If E is E_1 × E_2 and [E_i]_F = (R_i, S_i, H_F) (i = 1, 2), then [E]_F = (R, S, H_F), where R is the standard cross product of R_1 and R_2 and S is the concatenation of S_1 and S_2. If E_1, E_2 are well-typed and S_1, S_2 have no common attribute, then E is well-typed.
- (Selection) If E is σ_c(E'), [E']_F = (R', S, H_F), and c is a well-typed selection condition in the context of (R', S, H_F), then [E]_F = (R, S, H_F), where R = {t ∈ R' | (R', S, H_F), t |= c}. E is well-typed if E' and c are.

The standard relational condition join operator can be expressed as usual as a composition of σ_c and ×. For the remaining set theoretic algebraic operators we need a notion of schema compatibility. Two schemas S' = (A_1:τ'_1, ..., A_n:τ'_n) and S'' = (A_1:τ''_1, ..., A_n:τ''_n) are compatible if for all i = 1, ..., n, τ'_i and τ''_i have a least common supertype τ_i and the conversion functions τ'_i2τ_i and τ''_i2τ_i exist. When this is the case, (A_1:τ_1, ..., A_n:τ_n) is called the least common superschema of S' and S''. Moreover, the conversion function S'2S is defined by S'2S(R) = {⟨τ'_12τ_1(x_1), ..., τ'_n2τ_n(x_n)⟩ | ⟨x_1, ..., x_n⟩ ∈ R}.

- (Union, Intersection and Difference) If E = E_1 op E_2 where op ∈ {∪, ∩, −}, [E_i]_F = (R_i, S_i, H_F) (i = 1, 2), and S_1, S_2 have a least common superschema S, then [E]_F = (R, S, H_F), where R is the standard result of S_12S(R_1) op S_22S(R_2). E is well-typed if E_1, E_2 are, and S_1, S_2 have a least common superschema.
- (Cast) If E = (S)E', where S is a schema, and [E']_F = (R, S', H_F), then [E]_F = (S'2S(R), S, H_F). If S and S' have the same number of attributes, the conversion functions needed for S'2S exist, and E' is well-typed, then E is well-typed.

Proposition 1 For all relational expressions E over (R_1, S_1, H_1), ..., (R_z, S_z, H_z) and all fusions F of H_1, ..., H_z, [E]_F is an extended relation (i.e., it satisfies conditions (4.1) and (4.2)).
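A minimal sketch of the cast operator and of the union over a least common superschema is shown below; the schemas, the conversion table and the assumption that attribute names have already been reconciled are all mine, for illustration only.

```python
# A minimal sketch (schemas and rates are illustrative; attribute names are assumed to be
# already reconciled): the cast operator (S)E and a union that converts both operands to
# a least common superschema before combining them.
def cast(rows, src_schema, dst_schema, convert):
    """src_schema / dst_schema: lists of (attribute, type); convert(v, src, dst) converts a value."""
    return [tuple(convert(v, s[1], d[1]) for v, s, d in zip(row, src_schema, dst_schema))
            for row in rows]

def ontological_union(r1, s1, r2, s2, superschema, convert):
    return cast(r1, s1, superschema, convert) + cast(r2, s2, superschema, convert)

convert = lambda v, src, dst: v * 1.2 if (src, dst) == ("euro", "usd") else v   # toy rate
parts1 = [("Tire", 54.19), ("Gasket", 3.05)]              # Cost stored in USD
parts2 = [("Wheel", 50.05), ("Air Gasket", 3.00)]         # Price stored in euros
s1 = [("Name", "string"), ("Cost", "usd")]
s2 = [("Name", "string"), ("Cost", "euro")]
s  = [("Name", "string"), ("Cost", "usd")]                # least common superschema
print(ontological_union(parts1, s1, parts2, s2, s, convert))
```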
4.4 Implementation and Experiments

In this section, we briefly describe our HOME (Heterogeneous Ontology Management Engine) system.

4.4.1 HOME System

The HOME system is implemented in Java and currently consists of 23,562 lines of code. HOME accesses an Oracle server across the network. HOME consists of the following components:

1. HOME's graphical user interface allows users to express queries as well as view answers. In addition, users can create ontologies and edit them.
2. HOME has an Ontology Maker that automatically takes a relational database as input and uses WordNet [77] and a commercial thesaurus (ThesDB) to automatically generate ontologies. In addition, the Ontology Maker contains a rule maker that allows a user to specify rules that can be used to further derive isa and part of relations over and above those derived automatically. The Ontology Maker also has a zooming graphical user interface that allows users to zoom in on part of the ontology and either browse it or modify it.
3. HOME's Query Executor parses queries expressed via the HOME GUI. The Query Executor contains code that transforms a user query into a query that takes ontological information into account. It implements the ontology extended algebra described in this paper. The Query Executor also contains code to access any JDBC source (the current backend used is an Oracle server) and the necessary socket code needed to access remote servers.

In addition to the above components, we have also developed code to query XML and MPEG-7 data (on top of the XINDICE database system) using ontologies as well as to infer ontologies from XML and MPEG-7 sources.

4.4.2 Experiments

In all our scalability experiments, we used the GNIS data sets from the US Geological Survey, which contain data about social and cultural locations in the US on a state by state basis. We populated Oracle relations with this data on a state by state basis. The data was stored on a Windows NT server with a 300 MHz Pentium II processor and 64 MB memory. In addition, we used the Ontology Maker to generate ontologies based on the GNIS data sets. The experiments were run on a workstation with a 2 GHz Pentium 4 processor, 1 GB memory and the RedHat Linux enterprise platform. Notice that the workstation was not stand-alone during the experiments. The times taken in all experiments include the following: (i) the time to ship the query to a networked Oracle server running multiple concurrent workloads (not just our experiments), (ii) the time to execute the query using JDBC, (iii) the time to ship the answer back to the client from the remote networked Oracle server, and (iv) the time required to convert the JDBC format to the HOME GUI's presentation format.

Scalability of selection. Figure 4.3(a) shows the results of evaluating conjunctive selection queries where each atomic selection condition in the conjunct has the form A subtype_of v or A = v. We vary selectivity (the ratio of the number of tuples selected to the number of tuples in the relation) for data sets of successively larger sizes.

[Figure 4.3: (a) Performance of HOME for conjunctive selection queries (execution time vs. selectivity, for data sets of 528,397 to 1,614,292 total records); (b) Join queries with varying join selectivity and varying number of tuples per relation (166,321 to 625,353 average records).]

The reader can easily see that our algorithms handle approximately 1,600,000 tuples in 88-104 seconds (the variation depends on selectivity). The slope of the graph is more or less linear - approximately 16,000-20,000 records can be processed in one second.

Scalability of join. We also ran scalability experiments on join.
Figure 4.3(b) shows how the performance of join changes when we vary the join selectivity (the ratio of the number of tuples in the answer to the product of the numbers of tuples in the relations being joined). We plotted different charts based on the number of tuples (from the GNIS data sets) that were considered. The reader can see that, on average, selectivity did not make a big difference but the number of tuples in the relations being queried did. As the average size of the relations being joined increased, we had a linear increase in the time taken to do the joins. Unlike relational databases, varying selectivity did not make a big difference. Part of the reason for this is that the computation time was dominated by network costs and by ontology management costs. Figure 4.4 shows the scalability of join as the size of the two ontology extended relations varies up to a little over 1.6 million tuples.

[Figure 4.4: Performance of HOME for join queries as the size of the relations being joined is varied (execution time vs. total number of records).]

Scalability of ontology integration algorithm. We tested the scalability of our ontology integration algorithm by varying the size of the ontologies being integrated as well as the number of interoperation constraints. Figure 4.5 shows that as the average size of the two ontologies being merged increased from 5,000 to 32,000, the time taken to merge the ontologies was under 2 seconds, exhibiting only a linear increase. However, when the average size of the ontologies went over 32,000, there was a sharp increase in the time taken. We believe that was because we almost ran out of memory.

[Figure 4.5: Performance of the ontology integration algorithm (time in milliseconds vs. average graph size, for 10 to 100 interoperation constraints).]

4.5 Summary

In this chapter, we provide a simple graph-based definition of an ontology and propose the notion of an ontology extended relation (OER) to semantically integrate heterogeneous relational databases. For the problem of ontology integration, we formally define the notion of a canonical witness to the integrability of a set of ontologies under a set of interoperation constraints. We have established that a set S of ontologies is integrable if and only if the canonical witness is a witness to the integrability of the ontologies in S. Furthermore, we provide an efficient algorithm to compute the canonical witness. We then extend the relational algebra to query OERs. Finally we describe an implementation of the OER model and show (via experiments) that the system scales well.

Chapter 5
TOSS: An Extension of TAX with Ontologies and Similarity Queries

TAX is perhaps the best known extension of the relational algebra to handle queries to XML databases. One problem with TAX (as with many existing relational DBMSs) is that the semantics of terms in a TAX DB are not taken into account when answering queries. In this chapter, we introduce our TOSS algebra [55], which extends the TAX algebra and improves the recall of TAX via the concept of a similarity enhanced ontology (SEO). We introduce the following mechanisms to alleviate the recall problem:

1.
We define an ontology extended semistructured instance (OES instance), which consists of a semistructured instance (e.g. an XML instance) and an associated ontology. Intuitively, an ontology is a set of graphs describing relationships (such as isa, partof, etc.) between terms in a DB (see the definition in Chap- ter 4). With an ontology, we can express that ?US Census Bureau? is part of ?US government? or ?Google? isa ?web search company? isa ?Computer com- pany? isa Company (in case people want to ask the DBLP database for a list of all authors of papers written by someone in a web search company). 2. We consider the delicate issue of how to take similarities into account when merging ontologies together. We show that once ontologies are merged to- gether without taking similarity concepts into account, we can apply a simi- 69 larity enhancement operation to the merged ontology to capture such notions of similarity. The resulting similarity enhanced ontology (SEO for short) can be generated using any notion whatsoever of similarity between strings with a specified threshold. We do not propose to reinvent notions of string/lexical similarity in this paper. Rather, we can adapt many notions of similarity in the text processing literature, such as Levenstein distance, Monge-Elkan distance [82], Jaro metric[57], Jaccard similarity[28] and rule-based similarity where a set of domain-specific rules are used to define what is similar to what. Thus, we can identify similarities between terms (even if the term are not necessarily semantically identical). Note that not all of the similarities can be captured by a lexical semantical system such as WordNet [77]. For example, the strings ?J. Ullman?, ?J.D. Ullman? and ?Je?rey D. Ullman? may all refer to the same DB researcher. 3. We show how to extend the TAX algebra to answer queries w.r.t. such a similarity enhanced ontology. Our algebra is called TOSS. We develop a prototype implementation of TOSS on top of the Apache Xindice XML database system[115]. We ran experiments to measure the following proper- ties: (i) how does the quality 1 of answers returned by TOSS (with di?erent similarity threshold) compare against the quality of answers returned by TAX? (ii) how is the performance of TOSS queries compared with that of TAX when we vary the size of the semistructured instance being queried and/or the size of the ontology used 1 The quality of an answer is the square root of the product of the precision and recall of the answer[107]. 70 in executing the query? (iii) how is the performance of TOSS queries a?ected by the similarity threshold? Our experiments were run on the entire SIGMOD XML proceedings data set[96] and on a part of the DBLP data set[31]. In this chapter, we first provide a quick overview of TAX. We start with the notion of a semistructured instance, followed by the concept of a pattern tree used to express queries in TAX. We will use a small subset of DBLP[31] and SIGMOD[96] bibliographies (in XML format) to illustrate why we may get (intuitively) inadequate answers from TAX and how the ontology and similarity mechanisms we propose can overcome these problems. 5.1 Semistructured Instance and TAX TAX assumes XML documents to be in ordered tree structures. Figure 5.1 (a) and (b) show the sample DBLP and SIGMOD bibliographic data respectively. Definition 13 A semistructured instance I over a set O of objects, a set L of strings called labels,asetT of types, and a domain dom(?) for each type ? ?T, is a triple I =(V,E,t) where: 1. 
G = (V, E) is a set of rooted, directed trees where V ⊆ O and E ⊆ V × V.
2. t is a mapping such that for each object o ∈ V, t(o, string) assigns a type in T with:
   - the tag of o, i.e. o.tag, if string = tag,
   - the content of o, i.e. o.content, if string = content.

Intuitively, o.tag is the label of the edge between o and its parent. We call an object's tag and content its attributes. A semistructured database (SDB) is a finite set of semistructured instances.

[Figure 5.1: (a) A DBLP example, (b) A SIGMOD example. Each instance is an ordered tree of tagged objects: the DBLP tree contains inproceedings elements with author, title, year, booktitle and pages children; the SIGMOD tree contains a ProceedingsPage with conference, date, location and article elements carrying authors and title children.]

Example 8 Consider the DBLP example in Figure 5.1(a). V consists of all nodes in the tree and E consists of all edges. Let us use o to denote the first node "author" at the top of the figure. Then, o.tag = author, t(o, tag) = string, o.content = Paolo Ciancarini and t(o, content) = string.

5.1.1 Embeddings and Witness Trees

TAX uses the concept of a pattern tree to express queries. We recapitulate the concepts of a pattern tree, an embedding and a witness tree from [56].

Definition 14 A pattern tree is a pair P = (T, F), where T = (V, E) is an object-labeled and edge-labeled tree such that:
- each object in V has a distinct integer as its label;
- each edge is either labeled pc (for parent-child) or ad (for ancestor-descendant);
- F is a selection condition applicable to objects.

Example 9 Figure 5.2(a) shows an example pattern tree, where T is the tree on the left and F is the condition on the right.

[Figure 5.2: (a) A pattern tree P1, with root #1 joined by pc edges to #2 and #3, and condition #1.tag = inproceedings & #2.tag = author & #3.tag = year & #3.content = 1999; (b) A selection result; (c) A projection result.]
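A minimal sketch of this data model is given below (illustrative only; the class name is mine and the element values are modeled loosely on Figure 5.1(a)): each object carries a tag, an optional content string and an ordered list of children.

```python
# A minimal sketch (illustrative only): objects of a semistructured instance carry a tag,
# an optional content string and an ordered list of children, as in Definition 13.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Obj:
    tag: str
    content: Optional[str] = None
    children: List["Obj"] = field(default_factory=list)

# A fragment modeled loosely on the DBLP instance of Figure 5.1(a).
paper = Obj("inproceedings", children=[
    Obj("author", "Paolo Ciancarini"),
    Obj("author", "Andrea Giovannini"),
    Obj("author", "Davide Rossi"),
    Obj("title", "Mobility and Coordination for Distributed Java Applications"),
    Obj("year", "1999"),
    Obj("booktitle", "Advances in Distributed Systems"),
])
dblp = Obj("dblp", children=[paper])
```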
Suppose SDB is a semistructured database and P = (T, F) a pattern tree. An embedding of a pattern tree P into SDB is a total mapping h from the nodes of T to the nodes ⋃_{(V,E,t)∈SDB} V of SDB such that:

- h preserves the structure of T, i.e., whenever (u, v) is a pc (resp., ad) edge in T, h(v) is a child (resp., descendant) of h(u) in SDB.
- The image under the mapping h satisfies the selection condition F. (The reader may consult [56] for a formal definition of satisfaction.)

Each embedding h of a pattern tree P into SDB induces a witness tree to the embedding, denoted h_SDB(P), which is defined as follows:

- a node n of SDB is in the witness tree if n = h(u) for some node u in the pattern tree P.
- for any pair of nodes n, m in the witness tree, whenever m is the closest ancestor (of the nodes in the witness tree) of n in SDB, the witness tree contains the edge (m, n).
- the witness tree preserves order between nodes in SDB, i.e., for any two nodes in h_SDB(P), whenever m precedes n in the preorder enumeration of SDB, m precedes n in the preorder node enumeration of h_SDB(P) as well.

Example 10 Figure 5.2(b) shows all the possible witness trees from the pattern tree P1 in Figure 5.2(a) to the DBLP example in Figure 5.1(a).

5.1.2 TAX

Selection in TAX. Consider the pattern tree P1 in Figure 5.2(a) and suppose SL is any set of nodes. In TAX, the selection query σ_{P,SL}(SDB) will return all witness trees w.r.t. pattern query P and SDB. In addition, if a node n in SL appears in a witness tree above, then all descendants of n will also be added to the witness tree.

Example 11 Figure 5.2(b) shows the result of σ_{P1,{}}(dblp).

Projection in TAX. Projection takes a semistructured DB SDB, a pattern tree P and a projection list PL (a list of node labels appearing in P) as inputs. The projection operation π_{P,PL}(SDB) returns tree(s) consisting of all nodes n selected from SDB such that for every node n in the result, there exists some witness tree h_SDB(P) and n' ∈ PL where h_SDB(n') = n.

Example 12 To find the authors of papers published in 1999, we can use the pattern tree P1 shown in Figure 5.2(a) and apply projection w.r.t. the semistructured DB shown in Figure 5.1(a). This query, π_{P1,{$1,$2,$3}}(dblp), returns the collection of trees in Figure 5.2(c).

Product in TAX. The product SDB_1 × SDB_2 of two semistructured DBs SDB_1, SDB_2 contains, for each pair of trees T_1 ∈ SDB_1, T_2 ∈ SDB_2, a tree whose root is a new node (called tax_prod_root), whose left child is the root of T_1 and whose right child is the root of T_2. Condition join consists of a product followed by selection.

Example 13 The join of the two XML instances in Figure 5.1(a) and Figure 5.1(b), by taking the product and applying a selection with the pattern tree shown in Figure 5.3(a), yields the result shown in Figure 5.3(b).

[Figure 5.3: (a) A pattern tree P2, whose condition requires #1.tag = tax_prod_root, #2.tag = inproceedings, #3.tag = article, #4.tag = title, #5.tag = booktitle, #6.tag = title, #5.content = "SIGMOD Conference" and #4.content = #6.content; (b) A join result.]
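To make the pc case concrete, a minimal sketch of embedding enumeration is given below; it reuses the Obj class and the dblp instance from the earlier sketch, handles only pc edges and simple tag/content predicates, and is not the TAX or TOSS implementation.

```python
# A minimal sketch (pc edges and simple predicates only; not the TAX/TOSS implementation):
# enumerating embeddings of a pattern tree into the Obj instance built earlier.
def match_at(pattern, node):
    """pattern = (predicate, [child patterns]); maps the pattern root to `node` and
       each pc child pattern to some child of `node`. Returns matches in preorder."""
    pred, child_patterns = pattern
    if not pred(node):
        return []
    partial = [[node]]
    for child_pat in child_patterns:
        partial = [assignment + sub
                   for assignment in partial
                   for child in node.children
                   for sub in match_at(child_pat, child)]
    return partial

def embeddings(pattern, root):
    """Try to root the pattern at every node of the instance."""
    found = match_at(pattern, root)
    for child in root.children:
        found.extend(embeddings(pattern, child))
    return found

# Pattern tree P1 of Figure 5.2(a): an inproceedings node with an author child
# and a year child whose content is 1999.
p1 = (lambda o: o.tag == "inproceedings",
      [(lambda o: o.tag == "author", []),
       (lambda o: o.tag == "year" and o.content == "1999", [])])
for witness in embeddings(p1, dblp):
    print([f"{o.tag}[{o.content}]" if o.content else o.tag for o in witness])
```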
5.1.3 Problems with TAX

When we examine the two databases in Figure 5.1 (a) and (b), we notice several things. First, the object tags booktitle and conference refer to the same thing. Also, the name of the SIGMOD conference is stored differently in the DBLP and SIGMOD databases ("SIGMOD Conference" in DBLP DB but the full name in SIGMOD DB), but they both refer to the same thing. Likewise, the names of the three authors of the paper titled "Securing XML Documents..." are stored differently: their first names are stored in full in DBLP but only initials are stored in SIGMOD. In general, there may be many people with the same last name and first initial of the first name, e.g., Marco Ferrari and Mauro Ferrari. Furthermore, there may be errors in data collection and input, e.g., Gian Luigi Ferrari and GianLuigi Ferrari are probably the same person. (All four names above can be found in DBLP.) Thus, it is reasonable to use some similarity measure d_s to indicate the degree of similarity between two strings. For example, d_s(Gian Luigi Ferrari, GianLuigi Ferrari) = 0.1 (i.e., very similar), d_s(Marco Ferrari, Mauro Ferrari) = 2.2 (quite similar) and d_s(Marco Ferrari, GianLuigi Ferrari) = 6.5 (much less similar).

Second, cost and book-price probably refer to the same concept. However, by looking at the objects, we really are at a loss to determine whether these fields use the same units or not. For example, cost may be in US dollars, while book-price may be in Euros. Our TOSS system can handle this as well via a clean extension of the TAX algebra.

A user asking queries spanning these two databases would probably like these semantic ambiguities resolved. For example, if we wish to find all conference papers published by Mauro Ferrari by looking at both the DBLP and SIGMOD DBs, the system must know that: (1) the tag names booktitle and conference are identical, (2) the short name and full name of the SIGMOD conference refer to the same thing, and (3) it should also be smart enough to consider sufficiently similar names to be the same.
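TOSS does not fix a particular d_s; any of the string measures discussed later in this chapter (Levenshtein distance among them) can be plugged in. The sketch below shows a plain edit-distance implementation applied to the author names above; the distances it produces are not the illustrative d_s values quoted in the text.

```python
# A minimal sketch: Levenshtein edit distance, one possible choice for the string
# similarity measure d_s (the numbers quoted in the text above are illustrative, not these).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

print(levenshtein("GianLuigi Ferrari", "Gian Luigi Ferrari"))   # 1  (very similar)
print(levenshtein("Marco Ferrari", "Mauro Ferrari"))            # 2  (quite similar)
print(levenshtein("Marco Ferrari", "GianLuigi Ferrari"))        # much larger
```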
Figure 5.4: (a) An example ontology associated with SIGMOD, whose part_of hierarchy contains the terms conference, title, ProceedingsPage, articles, article, authors, author, confYear, date, month, year, location, volume and number; (b) an example ontology associated with DBLP, containing inproceedings, author, title, booktitle, year and pages.

Example 15 Suppose Σ = {part_of}. Figures 5.4(a) and (b) show simplified examples of the ontologies associated with SIGMOD and DBLP respectively.

5.2.2 Integrating Ontologies

Suppose our SDB = {I_1, ..., I_n}. Suppose, for the sake of simplicity, that Σ = {part_of}. Suppose (H_j, ≤_j) is the part_of hierarchy associated with I_j. To answer queries spanning these different semistructured instances, we need to take into account the relationships between terms in these hierarchies. These are captured by interoperation constraints (specified by the database administrators) defined in Chapter 4. The formal definitions and algorithm for integrating ontologies are also given in Chapter 4. In this section, we present examples of interoperation constraints, the hierarchy graph and the canonical fusion of ontologies to illustrate the concepts.

Example 16 We may write an interoperation constraint booktitle:1 = conference:2 to indicate that booktitle (in DBLP) is the same as conference in the SIGMOD collection.

Figure 5.5: Hierarchy graph associated with SIGMOD and DBLP; its nodes are the annotated terms of both hierarchies (conference:1, confYear:1, ProceedingsPage:1, ..., inproceedings:2, booktitle:2, ...).

Example 17 Consider integrating the two hierarchies of Figures 5.4(a) and (b) w.r.t. the following five interoperation constraints: conference:1 = booktitle:2, title:1 = title:2, author:1 = author:2, year:1 = year:2, confYear:1 = year:2, which mean that conference in the first hierarchy (SIGMOD) is equal to booktitle in the second hierarchy (DBLP), and so on. Figure 5.5 shows the hierarchy graph associated with the two ontologies. Note that there may be many different ways of integrating hierarchies. [20, 19] describe how to find an integration, which is called the canonical hierarchy. We call this integration the canonical fusion (or just fusion) of the ontologies via a witness (which maps a node of a hierarchy onto a node in the canonical hierarchy). Figure 5.6 shows the fused ontology.

Figure 5.6: Canonical fusion of the ontologies of SIGMOD and DBLP.

We would like to mention that there are various works in the literature on integrating ontologies that have a very rich structure (e.g., definitions where an ontology is an arbitrary first order logic theory, and so on) [24]. However, to date, we have seen no extension that uses these very rich ontology definitions to answer queries over even relational sources. In this paper, we have only considered simple ontologies expressible as graphs.

5.2.3 Similarity Enhanced Ontologies

In the rest of this section, we assume that, given a semistructured database SDB, an ontology associated with each I ∈ SDB and a set of interoperation constraints IC, we have computed the fusion of all the ontologies w.r.t. IC. We will now show how to enhance such a fused ontology using similarity notions. As is common in many aspects of information retrieval, similarity between terms is measured using a distance function [28].
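As a concrete example of such a distance function, the following is a minimal Java sketch of the unit-cost Levenstein (edit) distance, one of the string measures discussed below; the class name is illustrative, and any of the other measures mentioned in this section could be plugged in instead:

// Minimal sketch of a string similarity measure d_s: the Levenstein (edit)
// distance with unit cost for every insertion, deletion and substitution.
// Class and method names are illustrative.
public final class EditDistance {

    public static int distance(String x, String y) {
        int[][] c = new int[x.length() + 1][y.length() + 1];
        for (int i = 0; i <= x.length(); i++) c[i][0] = i;
        for (int j = 0; j <= y.length(); j++) c[0][j] = j;
        for (int i = 1; i <= x.length(); i++) {
            for (int j = 1; j <= y.length(); j++) {
                int subst = x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1;
                c[i][j] = Math.min(c[i - 1][j - 1] + subst,
                           Math.min(c[i - 1][j] + 1, c[i][j - 1] + 1));
            }
        }
        return c[x.length()][y.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("relation", "relational"));  // 2
        System.out.println(distance("model", "models"));         // 1
    }
}

For instance, it returns 2 for (relation, relational) and 1 for (model, models), the values used in Example 18 later in this section.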
Definition 16 A string similarity measure d s is any function which takes two strings X,Y and returns a non-negative real number such that: ??X, d s (X,X)=0. ??X,Y , d s (X,Y )=d s (Y,X). 80 A string similarity measure d s is strong i? for all strings X,Y,Z, d s (X,Y )+ d s (Y,Z) ? d s (X,Z). A similarity measure d is any function which takes nodes A, B as input and returns a non-negative real number such that d(A, B)=min X?S A ,Y ?S B d s (X,Y ), where d s is a string similarity measure, S A ,S B are the sets of strings contained in nodes A, B respectively, and for any node A,ifX,Y ? S A ,thend s (X,Y )=0. d is strong i? d s is strong and satisfies the previous condition. Intuitively, two nodes get more and more similar as the distance between them gets smaller. The following result says that for strong similarity measures, we can avoid trying all combinations of strings between two nodes. Lemma 2 Suppose S A ,S B are the sets of strings contained in nodes A, B respec- tively. If a similarity measure d is strong, then ?X,P ? S A ,Y,Q? S B , d s (X,Y )= d s (P,Q). Therefore, ?X ? S A ,Y ? S B , d(A, B) ? d s (X,Y ). Proof sketch: ?Y,Q ? S B , d s (Y,Q)=d s (Q, Y ) = 0. Because d is strong, so d s is also strong. Therefore, d s (X,Y )+d s (Y,Q) ? d s (X,Q), i.e., d s (X,Y ) ? d s (X,Q). Similarly, d s (X,Q)+d s (Q, Y ) ? d s (X,Y ), i.e., d s (X,Q) ? d s (X,Y ). Thus, d s (X,Q)=d s (X,Y ). Similarly, we can prove ?X,P ? S A ,d s (X,Q)= d s (P,Q). Thus, ?X,P ? S A ,Y,Q? S B , d s (X,Y )=d s (P,Q). a50 Many di?erent similarity measures can be used. The information retrieval com- munity has proposed various string distances such as Levenstein distance 3 , Monge- 3 The Levenstein distance is strong. It assigns a unit cost to every edit operation. 81 Elkan distance[82], Jaro metric[57], and token-based distance like Jaccard similarity 4 and cosine similarity, etc[28]. In addition, products such as WordNet[77] define sim- ilarity measures between terms in English and various foreign languages (but not between proper nouns). In certain domains, rule based methods can also be used to specify similarity between proper nouns (in our SIGMOD/DBLP application for example, we could write a set of rules describing when two names are considered similar). There are also statistical approaches to checking similarity between strings used in methods such as latent semantic indexing. Our goal in this paper is not to invent a new notion of similarity between terms - this has already been well done by the IR community. Instead, the TOSS framework can plug in any such similarity implementation and use it to answer queries in a manner that increases the quality of an answer. Definition 17 (similarity enhancement) Suppose H is an integrated hierarchy, d is a similarity measure and epsilon1 ? 0.Thepair(H prime ,?) is called a similarity enhance- ment of H w.r.t. d, epsilon1 i? H prime is a hierarchy and ? is a function from H to 2 H prime (2 H prime is the power set of H prime ) such that 1. ?A, B ? H,A ? ? ?(A),B ? ? ?(B), if there is a path from A to B in H,then there is a path of length zero or more from A ? to B ? in H prime . Let us define ? ?1 : H prime ? 2 H such that ? ?1 (A prime )={A | A ? H ? A prime ? ?(A)}. Similarly, ?A prime ,B prime ? H prime , if there is a path for A prime to B prime in H prime ,then?A ? ? ?1 (A prime ),B? ? ?1 (B), there is a path from A to B in H. 4 Jaccard similarity between two word sets S, T is | S?T S?T |. 82 2. ?A, B ? H,A ? ? ?(A),B ? ? ?(B),ifA ? = B ? ,thend(A, B) ? 
epsilon1. 3. ?A, B ? H,ifd(A, B) ? epsilon1,then?A ? ? ?(A),B ? ? ?(B) such that A ? = B ? . 4. ?A prime ,B prime ? H prime , ? ?1 (A prime ) ?? ?1 (B prime ) negationslash= ? ?1 (A prime ). The first condition ensures that the similarity enhanced hierarchy preserves the original partial orderings in H and does not include unwarranted orderings in the result. The second condition ensures that all nodes mapped into the same node are su?ciently similar to each other - the threshold epsilon1, specified by the database administrator, says two terms are su?ciently similar if the distance between them is epsilon1 or less. The fourth condition removes redundant nodes which are a subset of some other node in the similarity enhancement. Furthermore, our similarity enhancement says that two strings are similar if and only if the two strings are jointly present in some node in the similarity enhancement. That is why we have the third condition. For example, let us consider a hierarchy H consisting of nodes A, B, C.If d(A, B) ? epsilon1, d(A, C) ? epsilon1, but d(B,C) >epsilon1, then we must have one and only one enhancement: we should have a node containing {A, B} AND a node containing {A, C}. We define the similarity enhancement in this way such that if a query is on a string in A, then we can find the result of all similar strings in the nodes containing {A, B} and {A, C} easily. If we just have two nodes containing {A, B} and {C} then we do not know that strings in C are similar to those in A from the resulting ontology. In this case, we will need to either (i) compare all nodes with A to find those similar; or (ii) use an index to find those nodes similar to A quickly. But then, we can just use (i) or (ii) directly and do not need to use the similarity 83 enhancement. By definition, an ontology is not allowed to have cycles. This may mean, on occasion, that no similarity enhancement exists for some similarity measures and some thresholds epsilon1. Definition 18 (similarity inconsistency) Suppose H is a hierarchy, d is a sim- ilarity measure and epsilon1 ? 0.Thetriple(H,d,epsilon1) is similarity consistent (resp. simi- larity inconsistent) i? there exists (resp. does not exist) a similarity enhancement of H w.r.t. d, epsilon1. The following theorem states that all similarity enhancements of a particular hierarchy (w.r.t a similarity measure d and a threshold epsilon1)areequivalent. Theorem 2 (Equivalence of Similarity Enhancements) Suppose H =(S,precedesequal) is a hierarchy, d is a similarity measure and epsilon1 ? 0.If(H prime 1 ,? 1 ) and (H prime 2 ,? 2 ) are similarity enhancements of H, then there exists an one-to-one mapping M : S prime 1 ? S prime 2 such that M ? ? 1 = ? 2 and ?A, B ? S prime 1 , (A, B) ?precedesequal 1 i? (M(A),M(B)) ?precedesequal 2 ,where H prime 1 =(S prime 1 ,precedesequal 1 ) and H prime 2 =(S prime 2 ,precedesequal 2 ). Proof sketch: Conditions (2) ? (4) of Definition 17 equivalently define the set of nodes and the mapping in the similarity enhancement as follows. First, we define S primeprime = {V |?A, B ? V,A ? S ? B ? S ? d(A, B) ? epsilon1}. Then, define S primeprime ? S primeprime ?{V |?V prime ? S primeprime ?V ? V prime }.Thus,S primeprime is unique. Next, we can define a mapping M 1 : S ? 2 S primeprime s.t. M 1 (A)={V | A ? V}. For any similarity enhancement, say, (H prime 1 ,? 1 ), we can define a one-to-one map- ping M 2 : S primeprime ? S prime 1 s.t. M 2 (B)=C i? ?A ? B, C ? ? 1 (A). 84 Similarly, for (H prime 2 ,? 
2), we can define a one-to-one mapping M′_2 : S″ → S′_2 s.t. M′_2(B) = C iff ∃A ∈ B with C ∈ μ_2(A). Now we can define M : S′_1 → S′_2 s.t. M(C_1) = C_2 iff, for all A ∈ {A | C_1 ∈ μ_1(A)}, C_2 ∈ μ_2(A). Thus, M ∘ μ_1 = μ_2. Because M ∘ μ_1 = μ_2, we have μ_1^{-1} ∘ M^{-1} = μ_2^{-1}. From condition (1) in Definition 17, for all A, B ∈ S′_1, (A, B) ∈ ⪯_1 iff for all a ∈ μ_1^{-1}(A), b ∈ μ_1^{-1}(B), (a, b) ∈ ⪯. Similarly, for all A, B ∈ S′_2, (A, B) ∈ ⪯_2 iff for all a ∈ μ_2^{-1}(A), b ∈ μ_2^{-1}(B), (a, b) ∈ ⪯. Thus, for all A, B ∈ S′_1, (M(A), M(B)) ∈ ⪯_2 iff for all a ∈ μ_2^{-1}(M(A)), b ∈ μ_2^{-1}(M(B)), (a, b) ∈ ⪯. But for all X ∈ S′_1, μ_2^{-1}(M(X)) = μ_1^{-1}(M^{-1}(M(X))) = μ_1^{-1}(X). Thus, "for all a ∈ μ_2^{-1}(M(A)), b ∈ μ_2^{-1}(M(B)), (a, b) ∈ ⪯" is equivalent to "for all a ∈ μ_1^{-1}(A), b ∈ μ_1^{-1}(B), (a, b) ∈ ⪯". Therefore, for all A, B ∈ S′_1, (M(A), M(B)) ∈ ⪯_2 iff (A, B) ∈ ⪯_1. □

Table 5.1 presents the SEA algorithm (short for Similarity Enhancement Algorithm), which finds a similarity enhancement of a hierarchy w.r.t. d, ε. In line 14, the function check-acyclic(H′) checks whether H′ is acyclic. A depth-first search can be modified to detect cycles in O(|⪯′|) time, where |⪯′| is the number of edges in the hierarchy of the similarity enhancement. The time complexity of the SEA algorithm is O(|S|·|S′|·(|S| + |S′|) + |⪯|·|S′|²).

Theorem 3 (Correctness of Algorithm) Given a hierarchy H, a similarity measure d and a real number ε ≥ 0, the SEA algorithm in Table 5.1 returns a similarity enhancement (H′, μ) of H if one exists.

algorithm SEA(H, d, ε)
Input: hierarchy H = (S, ⪯), similarity measure d, real number ε ≥ 0
Output: H′ = (S′, ⪯′), μ
 1.  S′ ← ∅
 2.  for each s ∈ S,
 3.      S′ ← S′ ∪ {{s}}
 4.      for each s′ ∈ S′,
 5.          if ∀a ∈ s′, d(s, a) ≤ ε, then s′ ← s′ ∪ {s}
 6.          else if ∃a ∈ s′, d(s, a) ≤ ε, then
 7.              S′ ← S′ ∪ {{s} ∪ {a | a ∈ s′ ∧ d(s, a) ≤ ε}}
 8.  S′ ← S′ − {s′ | ∃s″ ∈ S′ with s′ ⊂ s″}
 9.  for each s ∈ S,
10.      define μ(s) = {s′ | s′ ∈ S′ ∧ s ∈ s′}
11.  ⪯′ ← ∅
12.  for each (a, b) ∈ ⪯,
13.      ⪯′ ← ⪯′ ∪ {(c, d) | c ∈ μ(a) ∧ d ∈ μ(b) ∧ c ≠ d}
14.  if check-acyclic(H′) == TRUE, then
15.      return H′ = (S′, ⪯′), μ
16.  else return "Similarity inconsistent"

Table 5.1: Similarity Enhancement Algorithm SEA

Proof sketch: Lines 5-7 produce nodes such that the second condition of a similarity enhanced hierarchy is satisfied. Lines 3 and 5-7 ensure that the third condition is satisfied, because line 3 ensures that, for every s ∈ S, there exists some node in S′ that contains s, and lines 5-7 ensure that for all s_1, s_2 ∈ S, if d(s_1, s_2) ≤ ε, then there must exist some node in S′ containing both of them. Line 8 removes redundant nodes to ensure that the fourth condition is satisfied. Lines 11-13 create ⪯′ to satisfy the first condition. Lines 9-10 define μ according to the definition. □
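The SEA pseudocode in Table 5.1 translates fairly directly into code. The following compact Java sketch follows the same line structure, with nodes of the enhancement represented as sets of terms and the measure d supplied as a function; it is an illustration rather than the TOSS implementation, and all class, record and variable names are invented for the sketch (edges are also assumed to relate only terms in S):

import java.util.*;
import java.util.function.ToDoubleBiFunction;

// Compact sketch of the SEA algorithm of Table 5.1.  Names are illustrative.
public final class SEA {

    public record Enhancement(List<Set<String>> nodes,            // S'
                              Set<List<Integer>> order,           // edges over node indices
                              Map<String, List<Integer>> mu) {}   // term -> node indices

    public static Enhancement run(Set<String> terms, Set<String[]> order,
                                  ToDoubleBiFunction<String, String> d, double eps) {
        List<Set<String>> nodes = new ArrayList<>();
        for (String s : terms) {                                   // lines 2-7
            List<Set<String>> created = new ArrayList<>();
            created.add(new HashSet<>(List.of(s)));                // line 3
            for (Set<String> node : nodes) {
                long close = node.stream().filter(a -> d.applyAsDouble(s, a) <= eps).count();
                if (close == node.size()) node.add(s);             // line 5: all members close
                else if (close > 0) {                              // lines 6-7: only some are close
                    Set<String> merged = new HashSet<>(List.of(s));
                    node.stream().filter(a -> d.applyAsDouble(s, a) <= eps).forEach(merged::add);
                    created.add(merged);
                }
            }
            nodes.addAll(created);
        }
        List<Set<String>> keep = new ArrayList<>();                // line 8: drop subsets/duplicates
        for (Set<String> n : nodes)
            if (keep.stream().noneMatch(m -> m.containsAll(n)) &&
                nodes.stream().noneMatch(m -> m != n && m.containsAll(n) && m.size() > n.size()))
                keep.add(n);
        Map<String, List<Integer>> mu = new HashMap<>();           // lines 9-10
        for (String s : terms) {
            List<Integer> img = new ArrayList<>();
            for (int i = 0; i < keep.size(); i++) if (keep.get(i).contains(s)) img.add(i);
            mu.put(s, img);
        }
        Set<List<Integer>> orderPrime = new HashSet<>();           // lines 11-13
        for (String[] ab : order)
            for (int c : mu.get(ab[0]))
                for (int e : mu.get(ab[1]))
                    if (c != e) orderPrime.add(List.of(c, e));
        return isAcyclic(keep.size(), orderPrime)                  // lines 14-16
                ? new Enhancement(keep, orderPrime, mu)
                : null;                                            // similarity inconsistent
    }

    // Kahn-style topological check used as check-acyclic (line 14).
    private static boolean isAcyclic(int n, Set<List<Integer>> edges) {
        int[] indeg = new int[n];
        edges.forEach(e -> indeg[e.get(1)]++);
        Deque<Integer> q = new ArrayDeque<>();
        for (int i = 0; i < n; i++) if (indeg[i] == 0) q.add(i);
        int seen = 0;
        while (!q.isEmpty()) {
            int u = q.poll();
            seen++;
            for (List<Integer> e : edges)
                if (e.get(0) == u && --indeg[e.get(1)] == 0) q.add(e.get(1));
        }
        return seen == n;
    }
}

On the toy isa hierarchy of Figure 5.7(a), with the Levenstein distance and ε = 2, this sketch should produce the two merged nodes {relation, relational} and {model, models} described in Example 18 below.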
Figure 5.7: (a) An example ontology over the terms title, data, relation, relational, model, models and data models, and (b) its similarity enhancement, in which relation/relational and model/models are merged into single nodes.

Example 18 Figure 5.7(a) shows a toy ontology consisting of an isa hierarchy. Suppose we use the Levenstein distance as the similarity measure d and a threshold ε = 2. Because d(relation, relational) = 2 and d(model, models) = 1, the algorithm SEA will form two new nodes in the similarity enhancement, containing {relation, relational} and {model, models} respectively. SEA will also remove the four redundant nodes relation, relational, model and models. It then defines μ, which maps each node in the original hierarchy to itself in the result, except for the four nodes removed, which are mapped to the two new nodes respectively. ⪯′ is defined as shown in Figure 5.7(b).

5.3 Ontology Extended Semistructured Data Model

In this section, we extend the semistructured data model and the TAX algebra to handle ontology and similarity concepts. For the sake of simplicity, in the rest of this section we will assume that the set Σ equals {isa}, i.e., only one hierarchy is associated with each ontology. The results of our work can easily be extended to incorporate arbitrary hierarchies such as part_of.

SEO Semistructured Instance. In this section, we show how to extend a semistructured instance to include type⁵, ontology, and similarity information. Recall that a semistructured instance is defined as I = (V, E, t), where t associates a type in T with each attribute (tag and/or content) of each object o ∈ V. Intuitively, an extended instance associates types with each object attribute. This is important because a type could be something like USD, representing US Dollars, or cm, representing (intuitively) centimetres. It is also associated with an ontology (which for now is just a hierarchy H_isa, a distance function d and a real number ε > 0). An ontology extended semistructured instance is a quadruple (V, E, t, H_isa), where H_isa is an isa hierarchy⁶ and the following constraints are satisfied:

⁵ The formal definitions of types, domain values and hierarchies (as well as conversion functions) are given in Chapter 4.
⁶ When we allow Σ to contain not just isa but other strings as well, the H_isa term needs to be replaced by the full ontology, with obvious corresponding changes to the rest of the paper. For the sake of simplicity and due to space restrictions, in this section we only consider the case where one hierarchy is used.

- for every object o, t(o, tag) ∈ T_isa and o.tag ∈ below_{H_isa}(t(o, tag));
- for every leaf object o, t(o, content) ∈ T_isa and o.content ∈ below_{H_isa}(t(o, content)).

An SEO (similarity enhanced ontology extended) semistructured instance is a quadruple (V, E, t, H_isa′) where (H_isa′, μ) is the similarity enhancement of H_isa for some μ w.r.t. a similarity measure d and threshold ε.

5.3.1 TOSS Algebra

Throughout this section, we assume without loss of generality that we are querying some semistructured database SDB, that the ontologies of the semistructured instances in SDB have been fused together as described in Section 5.2.2, and that the fused ontology was then enhanced with similarity using the SEA algorithm shown in Table 5.1. All algebraic operators therefore assume the existence of this similarity enhanced (fused) ontology.
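A minimal sketch of how an ontology extended instance and its two well-formedness constraints above might be represented in memory follows; the Hierarchy interface, the Obj class and all field names are illustrative simplifications, not the TOSS data structures:

import java.util.*;

// Minimal sketch of an ontology extended semistructured instance
// (V, E, t, H_isa) and its two well-formedness constraints.
public final class OntologyExtendedInstance {

    public interface Hierarchy {                       // the isa hierarchy H_isa
        boolean contains(String type);                 // type is in T_isa
        Set<String> below(String type);                // terms at or below the given type
    }

    public static final class Obj {
        String tag, content, tagType, contentType;     // t(o, tag) and t(o, content)
        List<Obj> children = new ArrayList<>();
        boolean isLeaf() { return children.isEmpty(); }
    }

    // Every object's tag (and every leaf object's content) must be typed in
    // H_isa and must lie below its declared type.
    public static boolean wellFormed(Obj root, Hierarchy hIsa) {
        Deque<Obj> stack = new ArrayDeque<>(List.of(root));
        while (!stack.isEmpty()) {
            Obj o = stack.pop();
            if (!hIsa.contains(o.tagType) || !hIsa.below(o.tagType).contains(o.tag))
                return false;
            if (o.isLeaf() &&
                (!hIsa.contains(o.contentType) || !hIsa.below(o.contentType).contains(o.content)))
                return false;
            stack.addAll(o.children);
        }
        return true;
    }
}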
SEO Embeddings and Witness Trees Selection conditions are used in pattern trees which are used in both selection and projection. In this section, we define selection conditions and show how to extend the definitions of embedding and witness trees to support ontologies and similarity. Simple (or atomic) conditions have the form XopY,whereop ?{=, negationslash=, <, ?, >, ?, ?, instance of, isa, subtype of, above, below },andX,Y are terms,that is, attributes, types, or typed values v:?,withv ? dom(?). Types will sometimes be 89 omitted in typed values when they can be unambiguously derived from the context. Here ? represents the similarTo operation which will return true if both operands are su?ciently similar (i.e., ? epsilon1 for a given similarity measure d). Selection conditions can now be defined as follows: ? all simple conditions are selection conditions, ? if c 1 ,c 2 are selection conditions then c 1 ?c 2 , c 1 ?c 2 , ?c 1 are selection conditions, ? nothing else is a selection condition. Example 19 Suppose we want to find the titles of all papers in DBLP related to Mi- crosoft (independently of the field in which Microsoft appears). We may change the selection condition in the pattern tree in Figure 5.2 (a) to the following and apply it to dblp: #1.tag = inproceedings ? #2.tag = title ? #3.tag part of inproceedings ? #3.content =?starMicrosoftstar primeprime (where star is a wild card). In order to ensure an embedding to be correct w.r.t an SDB with an associ- ated similarity enhanced ontology, we need to redefine the satisfaction of selection conditions. In the rest of this section, let EI =(V,E,t,H isa ) be an arbitrary but fixed ontology extended or SEO instance. 7 In the context of EI,thetypeofatermX 7 The semantics of both ontology extended and SEO instances are similar except that an ontol- ogy extended instance uses an ontology but a SEO instance uses a SEO and it allows similarity operations. 90 w.r.t. a mapping h 8 from a pattern to a tree collection is defined as follows: type(X) h = ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? t(h(X),tag)ifX is the tag of an object t(h(X), content)ifX is the content of an object ? if X = ? or X = v:?. When type(X),type(Y ) are both in a type hierarchy T and the least upper bound of type(X)andtype(Y )inT exists, then it is called the least common supertype of X and Y . We say that a simple condition XopY with op ?{=,negationslash=,<,?,>,?} is well- typed if X and Y have a least common supertype ? and the conversion functions type(X) h 2? and type(Y ) h 2? exist. Simple conditions with other operators are always well-typed. A selection condition is well-typed if all its simple subconditions are. The value of a term X w.r.t. a mapping h from a pattern to a tree collection is X h = ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? h(X).tag if X is the tag of an object h(X).content if X is the content of an object ? if X = ? v if X = v:?. The image (witness tree WI) under the mapping h (from a pattern P to a collection C), i.e., h C (P), satisfies a well-typed condition c in the context of the ontology extended instance EI (denoted EI,WI |= c) in the following cases: 8 Strictly speaking, h maps a node in the pattern tree to a node in a tree collection. However, X in h(X) is not a node but a node?s attribute. For convenience, here we abuse the notation such that, suppose X = y.b where y is a node and b is an attribute, we use h(X)todenoteh(y). 91 ? c = XopY, op ?{=,negationslash=,<,?,>,?,?}, ? 
is the least common supertype of X and Y ,and(type(X) h 2?)(X h ) op (type(Y ) h 2?)(Y h ) is true. For A ? B, the condition is true i? ? a node containing both of them in the similarity enhancement. ? c = X instance ofY , Y h ?T, type(X) h ? H Y h ,andX h ? dom(Y h ). ? c = X subtype ofY , X h ?T, Y h ?T,X h ? H Y h . ? c = c 1 ?c 2 , EI,WI |= c 1 and EI,WI |= c 2 . ? c = c 1 ?c 2 ,eitherEI,WI |= c 1 or EI,WI |= c 2 . ? c = ?c 1 , EI,WI negationslash|= c 1 . ? c = X belowY , EI,WI |= X instance ofY ? X subtype ofY . ? c = X aboveY , EI,WI |= Y belowX. TOSS Algebra In this section, we are able to give a succinct definition of the TOSS algebra. Note that this algebra is easy to implement on top of any semistructured DBMS system. Suppose EI 1 =(V 1 ,E 1 ,t 1 ,H 1 ),..., EI z =(V z , E z , t z , H z ) are ontology ex- tended instances. Suppose F prime is a fusion of H 1 ,...,H z via witness tr F prime,(F,?)is the similarity enhancement of F prime , tr F = ??tr F prime.? F is the ontology that associates the hierarchy F with isa. For all algebraic expressions Exp,[Exp] F is inductively defined as follows. ? If Exp is an instance EI i ,then[Exp] F =(tr F (V i ),E i , tr F (t i ), ? F ), which is 92 an SEO instance. Here tr F (V i ) (resp. tr F (t i )) maps the tags and contents of objects o ? V i (resp. their types) to terms in ? F .Inthiscase,Exp is always well-typed. ? (Selection) If Exp is ? P,SL (Exp prime ) and the selection condition F in P is well- typedinthecontextof[Exp prime ] F ,then[Exp] F returns the set of witness trees WI such that [Exp prime ] F ,WI |= F. SL specifies the nodes whose descendants should be included in the final output. Exp is well-typed if Exp prime and F are. ? (Projection) If Exp is ? P,PL (Exp prime ) and the condition F in P is well-typed in the context of [Exp prime ] F ,then[Exp] F keeps the nodes in [Exp prime ] F which match some pattern node in P (and also in PL) for some embedding h : P ? C (such that the witness tree under h satisfies F in the context of [Exp prime ] F )andtheir hierarchical relationships but eliminates all other nodes. Exp is well-typed if Exp prime and F are. ? (Cross product) If Exp is Exp 1 ?Exp 2 then [Exp] F contains for each pair of trees T i ? [Exp 1 ] F ,T j ? [Exp 2 ] F , a tree, whose root is a new node, left child is the root of T i and right child is the root of T j .IfExp 1 ,Exp 2 are well-typed, then Exp is. The join operator can be expressed in the standard way as a composition of ? P,SL and ?. For the remaining set theoretic algebraic operators we need to specify when two data trees should be considered identical as described in [56]: two data trees T 1 ,T 2 are equal i? there exists an isomorphism ? : T 1 ? T 2 between the two sets of nodes that preserves edges and order, and furthermore, for every value-based 93 atom of the form ?attribute ? value? (? is one of =,negationslash=,>, etc), the atom is true at node u in T 1 i? it is true at node ?(u)inT 2 . ? (Union, Intersection and Di?erence) If Exp = Exp 1 op Exp 2 where op ? {?,?,?},then[Exp] F is the standard result of [Exp 1 ] F op [Exp 2 ] F . Exp is well-typed if Exp 1 ,Exp 2 are. Proposition 2 For all algebraic expressions Exp over EI 1 =(V 1 ,E 1 ,t 1 ,H 1 ),...,EI z = (V z ,E z ,t z ,H z ) and the similarity enhanced fusion F of H 1 ,...,H z , [Exp] F is an SEO instance. 
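Returning to the satisfaction case for the similarTo operator ~ above: because two terms are similar exactly when they occur together in some node of the similarity enhancement, the check reduces to an index lookup rather than a distance computation at query time. A minimal Java sketch, with illustrative names:

import java.util.*;

// Minimal sketch of evaluating X ~ Y against a similarity enhanced ontology:
// the condition holds iff some enhanced node contains both terms.
public final class SimilarTo {

    private final Map<String, Set<Integer>> termToNodes = new HashMap<>();

    // Index the enhanced hierarchy; its nodes are sets of terms.
    public SimilarTo(List<Set<String>> enhancedNodes) {
        for (int i = 0; i < enhancedNodes.size(); i++)
            for (String term : enhancedNodes.get(i))
                termToNodes.computeIfAbsent(term, k -> new HashSet<>()).add(i);
    }

    public boolean similar(String x, String y) {
        Set<Integer> nx = termToNodes.getOrDefault(x, Set.of());
        Set<Integer> ny = termToNodes.getOrDefault(y, Set.of());
        for (Integer n : nx) if (ny.contains(n)) return true;
        return false;
    }

    public static void main(String[] args) {
        SimilarTo s = new SimilarTo(List.of(
            Set.of("relation", "relational"), Set.of("model", "models"), Set.of("title")));
        System.out.println(s.similar("relation", "relational"));  // true
        System.out.println(s.similar("relation", "model"));       // false
    }
}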
Figure 5.8: A pattern tree P3, whose selection condition requires #1.tag = tax_prod_root, #2.tag = inproceedings, #3.tag = ProceedingsPage, #4.tag = title, #5.tag = booktitle, #6.tag = article, #8.tag = title, #7.tag ~ #5.tag, #5.content ~ #7.content and #4.content ~ #8.content.

Example 20 Suppose we want to find the papers in the SIGMOD DB such that the title of the paper is similar to the title of some SIGMOD conference paper recorded in DBLP. A join of the two databases, obtained by taking their product and applying a selection with the pattern tree shown in Figure 5.8, solves this task, i.e., σ_{P3,{}}(dblp × ProceedingsPage). The result contains two trees, corresponding to the papers titled "Materialized View ..." and "Securing XML ..." respectively.

5.4 Implementation and Experiments

Figure 5.9 shows the architecture of our TOSS system, which is implemented in Java and currently consists of 24402 lines of code.

Figure 5.9: Architecture of the TOSS system. XML files feed the Ontology Maker (which uses WordNet and user-specified rules); the resulting fusion of ontologies is passed, together with the chosen similarity measure and the ε threshold, to the Similarity Enhancer; the Query Executor uses the similarity enhanced ontology and the Xindice system to answer user queries.

TOSS is built on top of the Xindice [115] database system and consists of the following components: (1) the Ontology Maker automatically takes XML files as input, uses WordNet [77] and user-specified rules to generate ontologies, and then integrates them to obtain their fusion; (2) the Similarity Enhancer automatically finds the similarity enhancement of the ontology according to the threshold ε and the similarity measure specified by the database administrators from among a variety of possible choices (supported by a Java toolkit available in [28]); (3) the Query Executor implements the ontology extended algebra, transforms a user query into a query that takes ontological information into account, and accesses the remote or local Xindice system.

Recall and precision. We obtained the recall and precision of the results returned by TOSS and TAX by evaluating 12 selection queries on 3 data sets (each containing 100 random papers from DBLP). Each query contains 1 isa, 1 similarTo and 3 tag-matching conditions. For the isa and similarTo conditions, "contains" and exact match are used for TAX respectively. The precision and recall of the TOSS and TAX results for each query are calculated by checking against semantically correct results generated manually. A query result contains 1 to 38 papers.

Figure 5.10(a) shows that TAX always gets 100% precision but low recall (lower than 0.5 for 75% of the queries): the exact match in TAX guarantees that the results it returns are correct, but it misses most of the correct results. Its recall is 1 for the 3 queries whose semantically correct results contain 3 or fewer papers. In contrast, the average precision and recall of the TOSS (ε = 3) results are 0.942 and 0.843. For most queries, TOSS keeps its precision close to 1 with much higher recall. For the query with the lowest TOSS precision, a comparison with the TAX recall shows a tradeoff of one third of the precision for a threefold increase in recall. For TOSS with ε = 2, the average precision and recall are 0.987 and 0.596. The higher precision and lower recall compared with ε = 3 are expected, because the SEO obtained using a lower ε can only merge very similar terms together, missing some correct answers.
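The precision and recall figures reported here, and the combined measures used below, are computed from a returned answer set and a manually built gold set; a minimal sketch, with illustrative names:

import java.util.Set;

// Minimal sketch of the evaluation metrics used in this section: precision and
// recall of a returned answer set against a manually built gold set, plus the
// two combined measures reported below.  Names are illustrative.
public final class AnswerQuality {

    public static double precision(Set<String> returned, Set<String> gold) {
        if (returned.isEmpty()) return 0.0;
        long correct = returned.stream().filter(gold::contains).count();
        return (double) correct / returned.size();
    }

    public static double recall(Set<String> returned, Set<String> gold) {
        if (gold.isEmpty()) return 0.0;
        long correct = returned.stream().filter(gold::contains).count();
        return (double) correct / gold.size();
    }

    public static double quality(double p, double r) {   // sqrt(P * R), the measure used below
        return Math.sqrt(p * r);
    }

    public static double fMeasure(double p, double r) {  // 2PR / (P + R)
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }
}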
Figure 5.10(b) shows the square root of the product of recall and precision as a quality measure [107] of the TOSS and TAX results, plotted against the square root of the TAX recall for the corresponding query. It clearly shows that TOSS (ε=3) outperforms TAX for all queries (except the 3 queries mentioned above). Figure 5.11(a) shows how much the recall is improved by TOSS compared with TAX, normalized by the precision. In TOSS (ε=3), most of the queries get their normalized recall more than doubled. Again, the performance of TOSS (ε=2) lies between TOSS (ε=3) and the original TAX, for a similar reason.

Figure 5.10: (a) Recall against precision for the TOSS (thresholds 2 and 3) and TAX results, (b) square root of the product of recall and precision of the TOSS and TAX results against the square root of the corresponding TAX recall for each query.

We also compared TAX and TOSS using the F-measure, defined as

F-measure = (2 · precision · recall) / (precision + recall).

Table 5.2 shows the performance of TAX and TOSS with regard to the F-measure. As with the results in Figure 5.10(b), TOSS (ε=3) still outperforms TAX for all queries (except the 3 queries mentioned above). Except for query 9, TOSS (ε=2) has equal or better F-measure quality than TAX; it is more than three times better for queries 2 and 6.

Scalability experiments. In all our scalability experiments, we used the DBLP and SIGMOD proceedings data. The total DBLP file size is 188,566,002 bytes. We selected all proceedings records and removed unnecessary spaces. Then we truncated the file to 4,753,774 bytes containing 3712 papers, due to the 5MB maximum data size limitation of Xindice. The SIGMOD data contains 16 proceedings pages having a total size of 712KB.

Figure 5.11: (a) Improvement factor of the TOSS recall compared with the TAX recall, normalized by precision, (b) performance of TOSS and TAX for selection queries.

The experiments were run on a PC with a 1.4GHz CPU, 524MB of memory and the Windows 2000 Professional platform. The times taken in all experiments include the following: (i) time to parse a pattern tree and rewrite it into XPath queries, (ii) time to execute the XPath queries in the Xindice system, and (iii) time to parse the result returned from the Xindice system and convert it to the form defined by TAX.

Scalability of selection. Figure 5.11(b) shows the results of evaluating TOSS and TAX conjunctive selection queries, each of which contains 2 isa and 4 tag-matching conditions, on the DBLP data. For isa conditions, exact match is used for TAX.
Query    TAX    TOSS (threshold=2)    TOSS (threshold=3)
  1      0.62         0.77                  0.84
  2      0.18         0.57                  0.53
  3      0.52         0.64                  0.97
  4      0.47         0.47                  0.96
  5      0.43         0.71                  0.76
  6      0.17         0.67                  0.86
  7      0.55         0.55                  1
  8      1            1                     1
  9      0.68         0.66                  0.75
 10      1            1                     1
 11      0.18         0.54                  0.93
 12      1            1                     1

Table 5.2: F-measure for the TOSS and TAX results

We tested the scalability of selection queries by varying the size of the ontologies (only for TOSS) as well as the size of the XML files. The reader can easily see that our algorithms handle a data size of approximately 4.75 MB in 12.5-14 seconds, almost independently of the ontology size. The slopes of the curves are more or less linear: approximately 500 KB can be processed in one second. As the last XML data file has 4,753,774 bytes, very close to the 5MB maximum data size of Xindice, there is a sudden upward slope at the last point. The difference between the lowest TOSS curve in the figure and TAX is 0.4064-3.1544 seconds; the difference between the highest TOSS curve and TAX is 0.6228-4.1402 seconds. When the data size grows, the difference increases because there are more accesses to the ontology. The bigger difference is also due to the empty results returned by TAX.

Figure 5.12: (a) Performance of TOSS and TAX for join queries, (b) TOSS computation time of selection and join against ε.

Scalability of join. We also ran scalability experiments on join. Each query contains 5 tag-matching and 1 similarTo conditions. For similarTo conditions, exact match is used for TAX. Figure 5.12(a) shows the scalability of the join of the DBLP and SIGMOD data as the total size of the two XML files varies up to a little over 3.7 MB. As the total size of the data being joined increased, the time taken to do the joins increased linearly, except at the last two data points. These two data points show a faster increase because the Xindice system returns intermediate results in super-linear time, which dominates the total execution time when the intermediate result is large. For example, in the experiment at the last data point, the total execution time was 17 seconds, of which 12.1 seconds came from Xindice and 4.9 seconds from our code. The difference between the lowest TOSS curve in the figure and TAX is 0.3078-1.3 seconds; the difference between the highest TOSS curve and TAX is 0.532-2.7238 seconds. The difference increases with the data file size because the number of accesses to the ontology increases.

TOSS computation time vs. ε. Finally, we conducted experiments to show how the evaluation times of conjunctive selection queries and join queries are affected by different values of ε (used in generating the SEO). The selection experiment was run on a dblp file of 4,753,774 bytes with an ontology of 1003 terms. The join experiment was run on a dblp file of 3,071,430 bytes and the sigmod data of 692,188 bytes, with an ontology of 1769 terms. Figure 5.12(b) shows that both execution times increase approximately linearly with ε, because when ε increases, each node contains more similar terms on average and thus more time is needed to output a larger result.

5.5 Summary

Our work builds on two important recent contributions in databases, viz.
the TAX algebra for semistructured databases by Jagadish et al.[56] and the need to use ontologies for query processing by Wiederhold?s group[110, 111]. As mentioned before, TAX is the starting point for our work. The landmark TAX paper proposes a formal theoretical basis for semistructured data, together 101 with an algebra to query semistructured data sources using the concept of a pattern tree. Wiederhold?s group was the first to notice that ontologies can be used to improve the quality of answers to queries. They proposed an ontology algebra [112, 73, 81]. TOSS is fundamentally di?erent from both these works and other related works in many ways. First, in TOSS, ontologies are merged under some interoperation constraints ? no such constraints were considered in [112, 73]. However, this is important because terms in one ontology may have a relationship with terms in another. Second, TOSS does not just merge ontologies together - it extends the semistructured DB model and algebra so that semistructured instances with associ- ated ontologies can be queried. Third, our framework introduces for the first time, the concept of similarity search in semistructured databases - something that nei- ther the TAX work nor Wiederhold?s prior work did. Fourth, we have developed experimental results showing that TOSS greatly improves the quality of answers compared to TAX. 102 Chapter 6 A Theoretical Foundation for Integrating RDF Ontologies Resource Description Framework (RDF)isaW3Crecommendationforpro- cessing metadata and exchanging machine-understandable information on the Web. It defines a model for describing relationships between resources. Triples with the form (subject, predicate, object) are the basic elements of RDF. RDF has become the de facto standard for the description and exchange of metadata on the Web. Many RDF ontologies have been constructed in do- mains such as food, geography and transportation. For example, both SchemaWeb (www.schemaweb.info)andDAMLwebsite(www.daml.org) have hosted a number of RDF ontologies. Since many of these ontologies are about the same topics, there is a growing need to integrate them in order to facilitate reasoning and query answering across distributed data sources. In this chapter, we present our work on integrating RDF ontologies [103] which includes a theoretical framework and a set of experiments on real and synthetic ontologies. We focus on the integration of RDF ontologies at the instance level 1 , each of which is a finite set of triples. 1 In some literature, such an RDF ontology at the instance level may be called an RDF graph. 103 6.1 Preliminaries In this section, we provide a brief overview of the most important constructs in RDF and show how RDF documents may be viewed as graphs. An RDF-ontology is a finite set of triples (r, p, v)wherer is a resource name, p is a property name, and v is a value (which could also be a resource name). RDF- ontologies assume the existence of some set R of resource names, some set P of property names, and a set dom(p) of values associated with any property name p. We do not address reification and containers in RDF due to space constraints. Throughout the rest of this chapter, we will assume that R,P,domare all arbitrary, but fixed. Definition 19 (RDF Ontology graph). Suppose O is an RDF-ontology. An RDF ontology graph for O is a labeled graph (V,E,?) where (1) V = R? uniontext p?P dom(p) is the set of nodes. (2) E = {(r, r prime ) | there exists a property p such that (r, p, r prime ) ? O} is the set of edges. 
(3) λ(r, r′) = {p | (r, p, r′) ∈ O} is the edge labeling function.

It is easy to see that there is a one-to-one correspondence between RDF-ontologies and RDF-ontology graphs: given one of them, we can uniquely determine the other. As a consequence, we will often abuse notation and talk interchangeably about RDF-ontologies and RDF-ontology graphs. Figure 6.1 shows parts of RDF ontologies 64 and 322 from the DAML web site (www.daml.org); their graphs are shown in Figure 6.2.

Figure 6.1: RDF (respectively OWL) for the two example ontologies

Note that both ontologies have had their terms renamed (through the attachment of the strings ":1" and ":2" respectively). Thus, student:2 refers to the student concept in ontology 2. Given two nodes r, r′ in an RDF-ontology graph and a property name p, we say that there exists a p-path from r to r′ if there is a path from node r to node r′ such that every edge along the path contains p in its label. For example, in the second graph of Figure 6.2, there is an S-path from Researcher-in-Academia to Employee (here S stands for "subClassOf"). However, there is no A-path from Student:2 to Organization-Unit:2, where A stands for "affiliateOf".

Our techniques differentiate between transitive and non-transitive properties. For instance, subClassOf is a transitive relationship, while affiliateOf as depicted in Figure 6.2 is not. A cycle involving a transitive relationship could indicate a semantic problem (e.g., a BossOf b, b BossOf c, c BossOf a seems to indicate a problem). However, a cycle involving a non-transitive property may not be a problem (e.g., a FriendOf b, b FriendOf c, c FriendOf a is not unusual).

Figure 6.2: Two simple ontologies (S = subClassOf; A = affiliateOf; P = hasSubUnit). The first ontology contains, among others, Person:1, Organization:1, Student:1, Worker:1, Faculty:1 and Graduate-Student:1; the second contains Person:2, Organization:2, Organization-Unit:2, Student:2, Working-Person:2, Academic:2, Researcher:2, Researcher-in-Academia:2, PhD-Student:2, Affiliated-Person:2, Employee:2 and Educational-Employee:2.

Definition 20 (Graph embedding) Let G_1 = (V_1, E_1, λ_1) and G_2 = (V_2, E_2, λ_2) be two RDF-ontology graphs. G_1 can be embedded into G_2 (denoted by G_1 ⊑ G_2) iff there exists a mapping φ : V_1 → V_2 such that:
(1) For p transitive, if there is a p-path from r to r′ in G_1, then there is a p-path from φ(r) to φ(r′) in G_2.
(2) For q non-transitive, if there is an edge labeled with q from r to r′ in G_1, then there is an edge labeled with q from φ(r) to φ(r′) in G_2.
Graphs G_1 and G_2 are equivalent iff G_1 ⊑ G_2 and G_2 ⊑ G_1.

Note that for any RDF ontology O, we can uniquely determine its RDF-ontology graph; however, there may be many other graphs that are equivalent to it.

6.2 Horn constraints

When integrating two RDF ontologies, there may be various kinds of constraints linking the terms in the two ontologies. For instance, we may say that the terms Person:1 and Person:2 are equivalent. Likewise, we may say that if x is a Researcher-in-Academia:2 and also a Student:2, then x is also a Graduate-Student:1. In general, for any property p, we may have a Horn clause saying that, as far as property p is concerned, if an individual is in all classes in a given set, then it is also in another given class.

Definition 21 (Horn constraint) If r_1, ..., r_n, r are all resource names, then r_1 ∧ ... ∧ r_n → r is a Horn² constraint.
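Before turning to constraints, the graph machinery above (Definition 19 and p-paths) can be made concrete with a small Java sketch. The triples in the example are hypothetical, and the class and method names are illustrative:

import java.util.*;

// Minimal sketch of an RDF-ontology graph built from (subject, property, value)
// triples, with a p-path test: a p-path from r to r' is a path every edge of
// which carries p in its label.
public final class RdfOntologyGraph {

    // edge labels: r -> (r' -> set of property names on the edge (r, r'))
    private final Map<String, Map<String, Set<String>>> out = new HashMap<>();

    public void addTriple(String subject, String property, String value) {
        out.computeIfAbsent(subject, k -> new HashMap<>())
           .computeIfAbsent(value, k -> new HashSet<>())
           .add(property);
    }

    // Is there a p-path (of length >= 1) from r to rPrime?
    public boolean hasPPath(String r, String rPrime, String p) {
        Deque<String> stack = new ArrayDeque<>(List.of(r));
        Set<String> visited = new HashSet<>();
        while (!stack.isEmpty()) {
            String cur = stack.pop();
            for (Map.Entry<String, Set<String>> e : out.getOrDefault(cur, Map.of()).entrySet()) {
                if (!e.getValue().contains(p)) continue;   // every edge must carry p
                if (e.getKey().equals(rPrime)) return true;
                if (visited.add(e.getKey())) stack.push(e.getKey());
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical triples, for illustration only.
        RdfOntologyGraph g = new RdfOntologyGraph();
        g.addTriple("GraduateStudent:1", "subClassOf", "Student:1");
        g.addTriple("Student:1", "subClassOf", "Person:1");
        System.out.println(g.hasPPath("GraduateStudent:1", "Person:1", "subClassOf")); // true
        System.out.println(g.hasPPath("Person:1", "GraduateStudent:1", "subClassOf")); // false
    }
}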
Note that Horn constraints only specify class subclass relationships, and nothing else. Definition 22 (Negative constraints) Suppose a, b are nodes and q is a prop- erty. Then: (1) a negationslash= b is a negative constraint. (2) a negationslash? q b is a negative constraint (which states that there is no q-path from a to b in the graph in question). Two examples of negative constraints associated with Figure 6.2 are: ? Faculty:1negationslash? S Student:2 which states that Faculty is not a subclass of Student. 2 Strictly speaking, we should say ?definite clause?[68] constraint here rather than Horn con- straint, but we will abuse notation and call these Horn clauses. 107 ? Student:2negationslash? S Faculty:1 which states that Student is not a subclass of Faculty. Satisfaction of a Horn clause or a negative constraint by an ontology is straightfor- ward to define. 6.3 The RDF Ontology Integration Problem In this section, we will declaratively define the problem of integration of two RDF ontologies under a given set of Horn and negative constraints. To do this, we first define the notion of a ?Horn Ontology Graph? (HOG for short) which merges Horn constraints with an ontology. Definition 23 (Horn Ontology Graph (HOG)) Suppose O is an ontology and H is a finite set of Horn constraints over R,P,dom.TheHorn Ontology Graph, HOG(O,H) is defined as the labeled graph G =(V,E,?) where: (1) If a is either a resource name or a domain value in O,then{a} is a node in V . (2) If r 1 ?...? r n ? r is in H, then there is a node in V called ? {r 1 ,...,r n }?? sometimes, we may abuse notation and write this node?s label as ?r 1 ?...? r n ? (3) Whenever there are two nodes A, B in V such that A ? B, then there is an edge labeled S from B to A. (4) If r 1 ? ...? r n ? r is in H, then there is an edge labeled S in E from ? {r 1 ,...,r n }?to{r}. 108 (5) If there is an edge in O prime s graph from a to b labeled p, then there is an edge in E from {a} to {b} labeled p. Intuitively, the nodes in the Horn ontology graph are obtained by taking nodes in O?s graph and making singleton sets out of them. In addition, we take the body of each Horn constraint and make it a node labeled by the entire body of the Horn constraint. As far as edges are concerned, condition (5) says that all edges in O are present in the HOG (the only di?erence is that an edge in the original RDF graph from a to b ends up being an edge from {a} to {b}). In addition, for every Horn clause in H, we add an edge from the set associated with its body to the singleton set associated with its head. Strictly speaking, we may be able to eliminate some edges from HOG(O,H). For instance, if we have one Horn clause with the body {a, b, c}, another with the body {a, b} and yet another with the body {a}, then condition (3) above states that there should be an edge from both {a, b} and {a, b, c} to {a} as well as an edge from {a, b, c} to {a, b}. We henceforth assume all such redundant edges are eliminated. It is easy to see that the Horn Ontology Graph thus defined is unique and can be constructed from H and O in time O(|H|?|R? uniontext p?P dom(p)|). Figure 6.3 shows the HOG associated with the union of the two RDF graphs shown in Figure 6.2 under a given set of Horn constraints. Definition 24 (Integrability witness) Suppose O 1 ,O 2 are two ontologies, H is a finite set of Horn clauses and NC is a finite set of negative clauses. An ontology 109 Person:1 ? Person:2 Organization:1 ? Organization:2 Worker:1 ? WorkingPerson:2 Student:1 ? Student:2 Faculty:1 ? 
Academic:2, Researcher-In-Academia:2 ∧ Student:2 → GraduateStudent:1, and PhD-Student:2 → GraduateStudent:1.

Figure 6.3: HOG example (partial), showing the nodes of the two ontologies together with the nodes introduced for the Horn constraint bodies (e.g., Researcher-In-Academia:2 ∧ Student:2) and the S (subClassOf) edges linking them.

graph G = (V, E) is said to be a witness to the integrability of O_1, O_2 w.r.t. H and NC iff:
(1) G contains no p-cycles³ for any transitive property p,
(2) HOG(O_1 ∪ O_2, H) ⊑ G.

³ A p-cycle is a p-path of length 1 or more from a node to itself.

We define the distance between a witness G to the integrability of ontologies O_1, O_2 subject to Horn constraints H and negative constraints NC as d(G, ⟨O_1, O_2, H⟩) = |G_1| + |G_2| + |H| - |G|, where G_i is the graph associated with ontology O_i.

Definition 25 (Minimal integrability witness) Suppose O_1, O_2 are two ontologies with associated graphs G_1, G_2, H is a finite set of Horn constraints and NC is a finite set of negative constraints. A witness G to the integrability of G_1, G_2 w.r.t. H and NC is minimal if and only if there is no strict subgraph⁴ G′ of G which is a witness to the integrability of O_1, O_2 w.r.t. H and NC such that 0 ≤ d(G′, ⟨O_1, O_2, H⟩) ≤ d(G, ⟨O_1, O_2, H⟩).

⁴ We use the standard definition of subgraph.

Figure 6.4: Minimal integrability witness for the two example ontologies. Equivalent terms such as Person:1/Person:2, Organization:1/Organization:2, Student:1/Student:2 and Worker:1/Working-Person:2 are collapsed into single nodes (S = subClassOf; A = affiliateOf; P = hasSubUnit).

Note that the minimal integrability witness is minimal in two respects: first, its graph structure is minimal (i.e., it does not contain unnecessary nodes and edges) and, in addition, the distance according to the d-metric is minimized. Figure 6.4 shows a minimal witness to the integrability of the ontologies in Figure 6.2 using the Horn constraints shown in Figure 6.3 and the negative constraints Faculty:1 ↛_S Student:2 and Student:2 ↛_S Faculty:1 discussed earlier.

RDF Ontology Integration Problem. The RDF ontology integration problem is specified as follows:
Input: RDF ontology graphs O_1, O_2, a finite set H of Horn constraints, and a finite set NC of negative constraints.
Output: Return a minimal witness to the integrability of O_1, O_2 w.r.t. H and NC if a witness exists; otherwise return NULL.

6.4 The CROW Algorithm to Integrate Ontologies

In this section, we present the CROW (Computing RDF Ontology Witness) algorithm to solve the RDF ontology integration problem. The CROW algorithm has four distinct phases:
(1) In the pre-processing phase, we rename the ontologies being merged so that terms in different ontologies are different (this is done by merely appending the ontology id to the end of each term).
(2) In the graph construction phase, CROW constructs HOG(O, H).
(3) In the graph transformation phase, the above graph is simplified using various kinds of strongly connected component computations, which we describe shortly. Redundant edges are also eliminated during this phase.
(4) In the negative constraint check phase, we check whether the graph produced above satisfies the negative constraints. If not, we return "NULL";
otherwise we return the graph produced in (3) above. The heart of the algorithm is in step (3) above ? we have already explained steps (1) and (2) earlier on in the paper. We apply two graph transformations. Definition 26 Suppose G is an RDF ontology graph, and p is any relation in P that is known to be transitive. ? A p-strongly connected component (or p-SCC for short) is any set S of nodes in G such that for all a, b ? S,thereisap-path from a to b. 112 ? The operator ? p (G) returns that graph G prime =(V prime ,E prime ) where: ? V prime is the set of all p-SCCs in G and ? E prime contains an edge from an SCC S 1 to an SCC S 2 labeled p i? there is an edge in G from some node in S 1 to some node from S 2 labeled p. ? p (G) reduces all p-SCCs in G to a single node and then draws edges between two such reduced nodes if there was some node in one of the SCCs that was connected in the original graph G to some node in the other SCC. This is an intuitive method for reducing cycles for transitive relationships (e.g.: if Worker:1 subClassOf WorkingPerson:2 and WorkingPerson:2 subClassOf Worker:1,itwould be safe to conclude that Worker:1=WorkingPerson:2). Note that ? p (G)isa lossy transformation as edges not labeled with p between nodes in the p-SCCs are lost. The second transformation - ? p (O) - is used to eliminate redundant edges and thus obtain a minimal integration witness. Definition 27 An edge from a to b labeled p is said to be redundant w.r.t. graph G i? there is a p-path from a to b in the graph obtained by deleting this edge from G. The operator ? p (G) returns a graph G prime by eliminating as many redundant edges on transitive properties from G as possible. There may be many ways in which redundant edges are removed from a graph G ? all we require here is that ? p (G) return any subgraph of G with no redundant 113 Person:1 Organiza- tion:1 Worker:1 Student:1 Faculty:1 Graduate- Student:1 Person:2 Organiza- tion:2 Working- Person:2 Resear- cher:2 Researcher- In- Academia:2 Academic:2 Employee:2 Affiliated Person:2 Organization- Unit:2 Student:2 A S S S S S S P A S Educational Employee:2 S S S S S S PhD- Student:2 S S S S S S Researcher-In- Academia:2 ? Student:2 S S SS S S=subClassOf; A=affiliateOf; P=hasSubUnit S S SS SS S S S S S S Figure 6.5: Intermediate result CROW phase (1) edges. 5 Theorem 4 For any graph G: (1) G subsetsqequal ? p (G) (2) G subsetsqequal ? p (G). The proof is based on the analysis of a single p-SCC or redundant edge elimi- nation operation; each such operation preserves the subsetsqequal relation. Proof sketch: (1) We represent ? p (G) as a sequence of operations?? 1 p (G),...,? i p (G),...?, where each ? i p (G) collapses exactly one p-SCC into one node. Let us denote by G i the graph resulting from the application of the first i transformations and let V i 5 As such there can be many di?erent implementations of ? p (G). One such algorithm works as follows. First, it computes an adjacency matrix A for G. It then computes a path matrix B such that B[i,j] = 1 i? there is a path of length two or more from node i to j. This can be computed by a straightforward adaptation of Dijkstra?s algorithm. Now remove all edges (i,j)forwhich A[i,j]=1andB[i,j]=1. 114 be its set of nodes. Let S i be the set of nodes that form a p-SCC and should be collapsed at step i +1. Letn S i+1 be the node resulting by collapsing S i and let m i be an arbitrary node in S i . Then we define a mapping ? : V i ? V i+1 such that ? x ? S i ,?(x)=n S i+1 and ? x ? V i ? S i ,?(x)=x. 
To prove that ? p is order preserving, we need to show that ? satisfies the conditions in Definition 20. Let x, y ? V i , x negationslash= y be such that there exists a q-path between x and y represented by ?n 1 ,...,n k ?. Let us consider the sequence ??(n 1 ),...,?(n k prime) ?, where we have omitted consecutive duplicate nodes. As ? p only removes edges not labeled with p in S i (collapsed into one node in G i+1 ), this is a q-path in G i+1 (maybe containing fewer nodes than the original). Hence, every step in ? p is order preserving. So by induction we conclude that ? p is order preserving. (2) Let path q (x, y) be true i? there is a q-path between x and y. By definition, ? p (G) does not modify the value of this predicate for any property q. Hence, the conditions in Definition 20 are preserved. a50 We are now ready to present the CROW algorithm (after the pre-processing step) in Table 6.1. Although the CROW algorithm is designed for integrating two ontologies, a simple extension can be used to integrate sets of ontologies 6 . In lines 1-9, CROW first constructs the Horn Ontology Graph. The result of this first phase is depicted in Figure 6.5. The reader may notice that this graph contains both strongly connected components for the subClassOf hierar- chy (Student:1 ? S Student:2) as well as redundant edges (PhD-Student:2 6 For instance, by using CROW repeatedly or by merging all the ontology graphs at the same time. 115 Algorithm: CROW(O 1 ,O 2 ,H,NC) Inputs: Ontologies O 1 ,O 2 , set of Horn constraints H and set of negative constraints NC. Let G 1 =(V 1 ,E 1 ),G 2 =(V 2 ,E 2 ) be the associated graphs. Output: G =(V,E,?), minimal witness to the integrability of O 1 ,O 2 w.r.t. H and NC or NULL if no minimal witness exists. 1. V ? V 1 ? V 2 ; 2. E ? E 1 ? E 2 ; 3. for r 1 ? ...? r k ? r ? H in ascending order of k do 4. V ? V ?{r 1 ,...,r k }; 5. E ? E ?{({r 1 ,...,r k },r)}; 6. for all n ? V , n ?{r 1 ,...,r k } do 7. E ? E ?{({r 1 ,...,r k },n)}; 8. end 9. end 10. for all properties p 11. if p transitive then (* the transitivity of p is specified a priori *) 12. while ? S, p-SCC in G do 13. for aarrowrighttophalfb? NC do (* by arrowrighttophalf we denote any type of negative constraint *) 14. if a ? S and b ? S return NULL; 15. end 16. G = ? S p (G); (* it collapses only S into one node *) 17. end 18. G = ? p (G); 19. end 20. for cons ? NC labeled with p do 21. if G does not satisfy cons then return NULL; 22. end 23. end 24. return O; Table 6.1: Computing RDF Ontology Witness Algorithm CROW 116 subClassOf Student:2). The reduction of strongly connected components and redundant edges is suitable only for transitive properties, as discussed in the Pre- liminaries Section. After these transformations (lines 16-18) the algorithm verifies the satisfiability of the negative constraints against the integrated ontology (lines 20-22). Note that some negative constraints can be checked directly while detect- ing SCC, which optimizes the response time for cases when there is no minimal integrability witness. Figure 6.4 shows the result of integrating the ontologies in Figure 6.2 using the Horn constraints in Figure 6.3 and using the negative con- straints: NC = {Faculty:1negationslash? S Student:2,Student:2negationslash? S Faculty:1}. The following result shows that CROW is correct. Theorem 5 (Correctness of CROW) CROW is correct, i.e. (i) If CROW(O 1 ,O 2 ,H,NC) does not return NULL, then it returns a minimal wit- ness to the integrability of G 1 ,G 2 w.r.t. 
H and (ii) If CROW(G 1 ,G 2 ,H,NC) returns NULL, then there is no witness to the inte- grability of G 1 ,G 2 w.r.t. H. Proof sketch. (i) Since all sets are finite and operations on lines 10-19 reduce the number of edges and/or nodes in the ontology, the algorithm will terminate. By the construction of the HOG in lines 1-9 we can easily see that G 1 subsetsqequal HOG(G 1 ?G 2 ,H) and G 2 subsetsqequal HOG(G 1 ?G 2 ,H). Furthermore, by Theorem 4 the graph transformations applied to G = HOG(G 1 ?G 2 ,H) preserve order. Also, G will satisfy NC according to lines 14, 20-22. Hence, the returned ontology G is an integration witness for G 1 ,G 2 w.r.t H and NC. Let us assume that G is not a minimal integrability witness 117 and ? G prime an integrability witness which is a strict subgraph of G.IfG prime has fewer nodes than G, it would violate the distance condition in Definition 25. Otherwise, if G prime has fewer edges than G, since all redundant edges for transitive properties have been eliminated in G, that would mean G prime violates condition (2) of Definition 24. Our hypothesis was false, hence G is a minimal integrability witness. (ii) Let us assume that there is a minimal integrability witness G that the CROW algorithm does not find. This means that CROW has returned NULL on either line 14 or 21. However, these directly correspond to checks against the negative constraints in NC. Therefore, G does not satisfy NC and cannot be a minimal integrability witness. a50 The following theorem states that CROW runs quite fast. Theorem 6 (CROW complexity) Let n i , e i be the number of nodes and respec- tively edges in G i , i ?{1, 2}.Letn h be the number of constraints in H and t the number of transitive properties. We denote by n, e the number of nodes and respectively edges in G. Then according to the CROW algorithm: (1) n ? n 1 + n 2 + n h and e ? e 1 + e 2 +2? n h . (2) CROW is O(t ?n ? e ? (n + e)). The proof follows directly from the analysis of the CROW algorithm. Proof: Proof sketch. (i) The CROW algorithm adds all nodes from the source ontologies and at most one node for each Horn constraint to G. Also, aside from the edges in the source ontologies, each Horn constraint can add at most two edges. One is the 118 edge specified by the Horn constraint itself. The other may be a subset-induced edge according to condition (3) in Definition 23. There is at most one such edge in G since S is a transitive relationship and redundant edges for transitive relationships are eliminated. (ii) The initial phase of the algorithm (lines 1-9) is performed in O(n h ? n). SCC detection on the constructed graph can be performed in O(n + e), while ? S p elimination for a p-SCC S can be performed in O(e ? (n + e)) (due to the edge rebuilding phase in Definition 26). Therefore, ? p for all p-SCCs can be performed in O(n ? e ? (n + e)). Meanwhile, by adapting Dijkstra?s algorithm, in O(n 2 )we can detect whether there is a p-path of length more than 1 from a fixed source to a target. Therefore, ? p (G) can be performed in O(e?n 2 ) to eliminate all the redundant edges. Since these two transformations are performed for all transitive relationships, the G ontology is obtained in O(t?n?e?(n+e)). Negative constraint satisfiability is linear in the number of constraints, which cannot be greater than n 2 (which would mean there are negative constraints for every pair of nodes in the graph). Therefore, the complexity of the CROW algorithm is bounded by O(t? n ? e ? (n + e)). 
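The redundant-edge elimination τ_p used in the graph transformation phase can be implemented with the adjacency-matrix/path-matrix method sketched in the footnote accompanying the definition of τ_p. The following Java sketch does exactly that for the subgraph of edges labeled with one transitive property p; it assumes the p-SCCs have already been collapsed (so the p-subgraph is acyclic), and all names are illustrative:

// Minimal sketch of redundant-edge elimination (tau_p) for one transitive
// property p: an edge (i, j) is removed iff there is also a p-path of length
// two or more from i to j.  Applied after p-SCC collapsing, the p-subgraph is
// a DAG, so all redundant edges can be removed simultaneously.
public final class RedundantEdgeElimination {

    // adj[i][j] == true iff there is a p-labeled edge from node i to node j.
    public static boolean[][] eliminate(boolean[][] adj) {
        int n = adj.length;
        boolean[][] reach = new boolean[n][n];            // p-path of length >= 1
        for (int i = 0; i < n; i++) reach[i] = adj[i].clone();
        for (int k = 0; k < n; k++)                       // Warshall's transitive closure
            for (int i = 0; i < n; i++)
                if (reach[i][k])
                    for (int j = 0; j < n; j++)
                        if (reach[k][j]) reach[i][j] = true;
        boolean[][] result = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (!adj[i][j]) continue;
                boolean longPath = false;                 // p-path of length >= 2 from i to j?
                for (int k = 0; k < n && !longPath; k++)
                    longPath = k != j && adj[i][k] && reach[k][j];
                result[i][j] = !longPath;                 // keep the edge only if not redundant
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // a -> b -> c plus the redundant shortcut a -> c (nodes 0, 1, 2).
        boolean[][] adj = {
            { false, true,  true  },
            { false, false, true  },
            { false, false, false } };
        boolean[][] reduced = eliminate(adj);
        System.out.println(reduced[0][2]);                  // false: shortcut removed
        System.out.println(reduced[0][1] && reduced[1][2]); // true: chain edges kept
    }
}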
6.5 Implementation and Experiments

Our prototype is implemented in Java and currently consists of 3041 lines of code. It comprises the following components:
1. The RDF Graph Generator. We use Jena 2.2, a Semantic Web toolkit from Hewlett-Packard Labs, to digest an ontology data file. The RDF Graph Generator then constructs an ontology graph by reading the set of RDF triples stored in Jena.
2. The HOG Generator, which generates a HOG for a set of Horn clauses.
3. The Synthetic Data Generator, which randomly generates ontology graphs, Horn constraints and negative constraints.
4. The Ontology Integrator, which integrates ontologies while taking a set of Horn constraints and a set of negative constraints into account.

Scalability experiments. We report here the performance of the CROW algorithm on both real-life and synthetically generated ontologies. The experiments were run on a PC with a 1.4GHz CPU, 524MB of memory and the Windows 2000 Professional platform. The times reported include both the time to construct the HOG and the time to integrate the ontologies.

Scalability on real-life ontologies. We applied the CROW algorithm to five pairs of similar ontologies (two on academic departments, and one each on publications, travel and music) selected from the DAML, SchemaWeb and OntoBroker ontology libraries. The detailed experimental results are listed in Table 6.2. In the table, due to space constraints, we use D to stand for DAML, S for SchemaWeb and ka2 for the ontology ka2.daml in OntoBroker. In addition, "#conts" is the number of constraints and "#trans" is the number of transitive properties. We used several metrics for the size of the ontologies (using the notation of Theorem 6): (i) size1 = n1 + n2 + nh, (ii) size2 = n1 + n2 + e1 + e2 + nh, (iii) size3 is the number of nodes and edges in the HOG, and (iv) size4 is the total file size of the source ontologies.

Ont 1     Ont 2     #conts   size1   size2   size3   size4 (KB)   #trans   Time (ms)
D-64      D-189     17       485     820     826     17.944       3        40.2
D-4       D-62      25       1171    2142    2147    52.564       1        32
D-62      ka2       20       1019    1864    1868    51.972       1        22.2
S-236     S-197     8        959     1829    1841    51.152       1        24
D-276     S-162     32       1706    2938    2946    81.667       4        58.4

Table 6.2: Experimental results

The running time of the CROW algorithm depends not only on the number of terms and edges, but also on the number of transitive properties. The effect of this factor can be seen in Table 6.2 by comparing the first and third pairs of ontologies.

Scalability on synthetic data. We randomly generated ten pairs of ontology graphs (each with only one transitive property). The total number of terms in each pair of ontologies varies from 8000 to 35000 in steps of 3000. For each pair, we generated ten sets of constraints (including Horn and negative constraints) whose sizes vary from 10 to 100 in steps of 10.

Figure 6.6: CROW running time — integration time (in seconds) versus source-ontology size, with one curve for each constraint-set size from 10 to 100 constraints.

Figure 6.6 shows the scalability of the CROW algorithm on the ten pairs of ontologies. As the total number of terms increased, the time taken to do the integration grew slightly faster than linearly. As Figure 6.6 shows, the increase in running time is due mainly to the increase in the number of terms (and the size of the HOG) rather than to the increase in the number of constraints.
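For readers who want to reproduce the synthetic experiments, the sketch below shows one plausible way to generate a random ontology graph and a set of negative constraints. The edge density, property name and use of java.util.Random are illustrative assumptions, not the exact procedure followed by our Synthetic Data Generator.

import java.util.*;

/** Illustrative random-ontology generator in the spirit of the Synthetic Data Generator. */
class SyntheticOntologySketch {

    /** Generates roughly 2*numTerms random subClassOf edges over numTerms terms. */
    static List<int[]> randomOntology(int numTerms, long seed) {
        Random rnd = new Random(seed);
        List<int[]> edges = new ArrayList<>();
        for (int i = 0; i < 2 * numTerms; i++) {
            int child = rnd.nextInt(numTerms);
            int parent = rnd.nextInt(numTerms);
            if (child != parent) edges.add(new int[] {child, parent}); // edge: child subClassOf parent
        }
        return edges;
    }

    /** Picks random term pairs (a, b) to serve as negative constraints. */
    static List<int[]> randomNegativeConstraints(int numTerms, int howMany, long seed) {
        Random rnd = new Random(seed);
        List<int[]> nc = new ArrayList<>();
        for (int i = 0; i < howMany; i++)
            nc.add(new int[] {rnd.nextInt(numTerms), rnd.nextInt(numTerms)});
        return nc;
    }
}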
6.6 Summary

In this chapter, we have developed a formal model to integrate RDF ontologies in the presence of both Horn constraints and negative constraints. We have defined the concept of a witness to the integrability of two ontologies and developed the CROW algorithm to integrate two ontologies. We have tested CROW and found that it is very fast, both on ontologies from the DAML, SchemaWeb and OntoBroker sites and on synthetically generated ontologies.

Chapter 7
Finding Similar Objects Using a Taxonomy

Finding interoperation constraints between ontologies is fundamentally important for integrating them. In this chapter, we present our work on ontology-based similarity measures [95], which are applicable both to identifying interoperation constraints and to searching for similar objects.

Taxonomies provide a precise way to name classes of individuals that share certain properties or behavior. They also provide a means of determining how similar one such individual is to another. Two individuals share the properties of the most specific grouping that includes both of them, and the degree to which the two individuals are similar depends on the location of this class in the hierarchy. As introduced in Chapter 3, several authors have defined ways of turning this intuitive idea of similarity into a numeric value that can be used to rank the similarity of objects.

The ability to find similar objects given a description of a target is useful in many domains. For example, one may wish to find patents similar to a given patent, or gene products (e.g., proteins) similar to a given gene product. In these domains, and many others, comprehensive taxonomies have been defined and used by various organizations to classify sets of objects. Examples include the Gene Ontology (GO) [7], Medical Subject Headings (MeSH) [83] and the patent classification taxonomies of the United States Patent and Trademark Office (USPTO) [105] and the World Intellectual Property Organization (WIPO) [113].

In this chapter, we address several issues that arise when putting the idea of searching for similar objects with taxonomies into practice. In particular, we provide a definition of information-theoretic similarity for taxonomies that are structured as directed acyclic graphs and in which multiple terms may be used to describe an object. We discuss how our definition should be adapted in the presence of ambiguity, as introduced by an evolving taxonomy or by classifiers with imperfect knowledge. We also present a pragmatic implementation of our similarity measures, and of a variety of others from the literature, that is tightly integrated with an object-relational database and scales to large taxonomies and large datasets. Our experiments, on a bioinformatics application that matches the Gene Ontology annotations for gene products supplied by Proteome Inc. and by UniProt, show that our two measures outperformed previous approaches.

The remainder of this chapter is organized as follows. At the beginning of Section 7.1, we review the information-theoretic definition of similarity. Next, in Section 7.1.1, we define our holistic similarity measure. In Section 7.1.2, we consider how classifiers that have incomplete knowledge of the objects being classified introduce ambiguity, and define the generic holistic similarity measure for such situations. Section 7.2 briefly summarizes related work. In Section 7.3 we describe our implementation, and Section 7.4 compares the efficacy of our new similarity measures to that of a variety of others.
Section 7.5 summarizes our contributions.

7.1 Information-Theoretic Similarity

The idea of using an information-theoretic definition of similarity to compare objects labeled using a taxonomy was introduced by Resnik [94]. Given a taxonomy defined over a set of terms T, Resnik defined the similarity of two terms $t_1 \in T$ and $t_2 \in T$ with respect to a corpus of objects O as
$$sim_{Resnik}(t_1, t_2) = \max_{\hat{t} \in S(t_1, t_2)} \left[-\log p(\hat{t})\right]$$
where $S(t_1, t_2)$ is the set of terms in T that subsume both $t_1$ and $t_2$ (concept $C_1$ subsumes $C_2$ if $C_1$ is a superclass of $C_2$ in the taxonomy), and $p(t)$ is the probability that an object randomly chosen from O represents an occurrence (instance) of term t [94]. An object represents an occurrence of a term t if it is labeled with t or with a descendant of t. The quantity $I(t) = -\log p(t)$ is known as the information content of term t.

Resnik's definition of similarity captures the idea that the similarity between two objects depends not only on what they have in common, but also on the context in which they are compared. It has many intuitively appealing properties. If the taxonomy is a hierarchy (i.e., a tree), the term $\hat{t}$ satisfying Resnik's definition is the least common ancestor of $t_1$ and $t_2$. As the commonality of the terms decreases, the position of $\hat{t}$ moves higher in the hierarchy, becoming the root r if the terms have nothing in common. Since every object in O represents an occurrence of the root, $p(r) = 1$ and $I(r) = 0$. Conversely, if $t_1 = t_2$, then $I(\hat{t}) = I(t_1) = I(t_2)$.

Lin [66] derives an axiomatic definition of similarity based on a limited set of assumptions that can be summarized as follows:
1. The maximum similarity between two (identical) objects is 1 and the minimum similarity is 0.
2. The similarity between two objects is a function of their commonality and their differences.
3. If each object can be described from several perspectives, the overall similarity of two objects is a weighted average of their similarities as seen from the individual perspectives.

Under these assumptions, and additionally assuming that the target and candidate objects are selected independently, the similarity between a target object T and a candidate object C is [66]:
$$sim(T, C) = \frac{I(common(T, C))}{I(descr(T)) + I(descr(C))} \qquad (7.1)$$
where $common(T, C)$ represents a description of what the two objects have in common, and $descr(T)$ and $descr(C)$ represent descriptions of the two objects individually.

If T and C are each labeled by a single term from a hierarchical taxonomy, their commonality is represented by the term a that is the least common ancestor of the terms t and c used to describe T and C, respectively. With respect to a particular corpus of objects O, the similarity becomes [66]:
$$sim_{Lin}(T, C) = \frac{2 I(a)}{I(t) + I(c)} = \frac{-2 \log p(a)}{-\log p(t) - \log p(c)} \qquad (7.2)$$

Suppose, for example, that O is a set of edible things, that t, the label of T, is "apple" and that c, the label of C, is "pear". If a, the least common ancestor of t and c, is the term "fruit", then the numerator of the similarity measure reflects the probability that two objects randomly chosen from O are both labeled with terms that denote kinds of fruit, and the denominator reflects the probability that one is labeled an apple and one is labeled a pear.
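To make Equation 7.2 concrete, here is a small worked instance; the corpus probabilities below are made-up numbers chosen only for illustration. Assume that in O a fraction 0.20 of the objects are occurrences of "fruit", 0.04 of "apple" and 0.05 of "pear". Then, using base-2 logarithms, $I(\mathit{fruit}) \approx 2.32$, $I(\mathit{apple}) \approx 4.64$ and $I(\mathit{pear}) \approx 4.32$, so
$$sim_{Lin}(T, C) = \frac{2\,I(\mathit{fruit})}{I(\mathit{apple}) + I(\mathit{pear})} \approx \frac{2 \times 2.32}{4.64 + 4.32} \approx 0.52,$$
whereas two objects both labeled "apple" would score $2\,I(\mathit{apple}) / (2\,I(\mathit{apple})) = 1$, and two objects whose only common ancestor is the root would score 0.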
7.1.1 Holistic Similarity

Equation 7.2 intuitively captures the idea of similarity in the most straightforward case, in which the taxonomy is structured as a tree and each object is labeled by a single term. However, in practice many taxonomies are not trees, but allow a new term to be derived from multiple parents. Furthermore, classification systems often allow an object to be labeled using multiple terms. In this section, we examine how to define similarity under these more general conditions.

We begin by defining a taxonomy as a directed acyclic graph.

Definition 1 (Taxonomy) A taxonomy T is a directed acyclic graph (N, E, r), where N is a set of nodes, $E \subseteq N \times N$ is a set of directed edges, $r \in N$ is the unique root node of T, and for every other node $n \in N$ there is at least one path from n to the root r.

Each node $n \in N$ represents a term of T, and we will use the words "node" and "term" interchangeably. Each directed edge $e \in E$ connects a more specific child term to a more general parent term. Typically, the relationship between parent and child terms can be described as an is-a or kind-of relationship, but our measures can also be used with other transitive relationships, such as part-of, works-for, belongs-to, etc., as long as one wishes to treat objects closely related in these ways as similar. For convenience, we will refer to the nodes in the out-neighbor region of a given node (i.e., all nodes on the paths from the given node to the root) as "ancestors" of that node, and similarly use "descendants" to refer to the nodes in the given node's in-neighbor region. We will also find it useful to describe portions of the taxonomy as subgraphs of T; we use the notation Terms(g) to refer to the set of nodes in a subgraph g.

A label is a subset of the terms in the taxonomy. A label can be used to represent the classification of a specific object, but it can also be used to represent more general concepts, such as what two objects have in common.

Definition 2 (Label) Given a taxonomy T = (N, E, r), a label is a nonempty set of terms $L \subseteq N$.

A cheeseburger labeled with terms from a food taxonomy, for example, might have the label {Beef, Cheese}.

If a term in a taxonomy applies to an object, so do the terms corresponding to each of its ancestors. Therefore, some of the terms in a label may be redundant. Although, as will be seen, such redundant terms do not affect similarity as calculated by our measures, their presence complicates the descriptions of objects and obscures the equivalence of labels. We therefore define the concept of a minimal label to eliminate such terms.

Definition 3 (Minimal Label) A label L is a minimal label if for every term $l \in L$, no ancestor of l is also in L.

Given a label L, one can derive a unique minimal label L′ by removing from L every term that is an ancestor of another term in L. Let Lmin(L) denote the minimal label derived from L.

A labeling assigns a label to each object in some corpus.

Definition 4 (Labeling) Given a taxonomy T and a corpus O, a labeling is a total function $\mathcal{L}: O \mapsto 2^N$. If o is an object in O, let $\mathcal{L}(o)$ denote the label of object o.

A label can also be associated with any subgraph of the taxonomy: the label for a subgraph g is the set of nodes contained in the subgraph, i.e., Terms(g). For convenience, however, we will generally refer to labels with the notation $L_X$, where X may be either an object or a graph.

A labeling is a proper labeling if for all $o \in O$, $\mathcal{L}(o) \neq \emptyset$, and a minimal labeling if for all $o \in O$, $\mathcal{L}(o)$ is a minimal label. For the remainder of this chapter we will assume that all labelings are minimal proper labelings unless otherwise noted.
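A compact Java sketch of these basic structures follows, under the assumption that terms are identified by strings and that the parent relation is stored explicitly; the class and method names are illustrative, not part of our implementation. For example, with edges Cheeseburger → Beef, Cheeseburger → Cheese and Beef → Food, minimalLabel({Beef, Cheese, Food}) would return {Beef, Cheese}.

import java.util.*;

/** Illustrative sketch of Definitions 1-3: a DAG taxonomy, ancestor sets and minimal labels. */
class TaxonomySketch {
    // parent edges: child term -> its (possibly multiple) parent terms; the root has none.
    private final Map<String, Set<String>> parents = new HashMap<>();

    void addEdge(String child, String parent) {
        parents.computeIfAbsent(child, k -> new HashSet<>()).add(parent);
    }

    /** All ancestors of a term (its out-neighbor region), excluding the term itself. */
    Set<String> ancestors(String term) {
        Set<String> result = new HashSet<>();
        Deque<String> work = new ArrayDeque<>(parents.getOrDefault(term, Set.of()));
        while (!work.isEmpty()) {
            String t = work.pop();
            if (result.add(t)) work.addAll(parents.getOrDefault(t, Set.of()));
        }
        return result;
    }

    /** Lmin(L): drop every term that is an ancestor of another term in the label. */
    Set<String> minimalLabel(Set<String> label) {
        Set<String> minimal = new HashSet<>(label);
        for (String l : label) minimal.removeAll(ancestors(l));
        return minimal;
    }
}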
We interpret the labeling of an object with one or more terms to imply only that at least those terms (and their ancestors) apply to the object. This open-world interpretation of labels is an important assumption of our work, and contrasts with the closed-world model of traditional databases, under which all terms not included in a label do not apply to the labeled object. We adopt the open-world model here for a number of reasons. In the first place, we believe it more accurately reflects how taxonomies are used in many domains. Frequently, both the label of an object and the structure of the taxonomy itself change as knowledge accumulates concerning the domain and the objects of interest. At any point in time, the label of an object only reflects what has been discovered about it so far. Secondly, when the taxonomy is structured as a DAG, the closed-world assumption is incompatible with the use of interior terms in labels. We defer a discussion of this latter point to Section 7.1.2.

It is also useful to be able to enumerate all the terms associated with a label, either directly or indirectly. We therefore formalize the concept of an ancestor graph.

Definition 5 (Ancestor Graph) Let L be a label. For each term $l \in L$, let out(l) be the out-neighbor region of l in T. The ancestor graph of L is the union of the out-neighbor regions of the terms contained in L. That is:
$$Anc(L) = \bigcup_{l \in L} out(l)$$

Given an object o with label $L_o$, we will use the notation Anc(o) to refer to the ancestor graph of $L_o$. The set of nodes Terms(Anc(o)) is an exhaustive list of the terms associated with o.

We now associate a probability with an arbitrary label. Its value is the probability of finding an object to which at least the terms in the label L apply. We refer to this as the inclusion probability, $p_i(L)$.

Definition 6 (Inclusion Probability) Let L be a label. Then $p_i(L)$ is the probability that the ancestor graph of the label of an object chosen at random from O contains L. That is, given a randomly chosen object o:
$$p_i(L) = p\big(L \subseteq Terms(Anc(o))\big)$$

If the ancestor graph of an object's label contains L, it also contains Lmin(L), and vice versa. Hence, if L is not minimal, the extra terms do not affect its inclusion probability; that is, $p_i(L) = p_i(Lmin(L))$.

Inclusion probability gives us the tool we need to apply Lin's general definition of similarity from Equation 7.1 to a taxonomy. To quantify the individual information content of the objects being compared (the denominator of Equation 7.1), we use the inclusion probability of their labels. To quantify the information content of the commonality between the two objects (the numerator of Equation 7.1), we find a label $L_A$ that represents that commonality, and use the corresponding inclusion probability. The label $L_A$ is constructed by intersecting the ancestor graphs of the labels of the objects being compared. We refer to the resulting measure as holistic similarity because it treats all the terms in a label as a group. As we will see in Section 7.2, other measures that have been suggested for use when objects are labeled with multiple terms treat each term individually.

Definition 7 (Holistic Similarity) Let $L_T$ and $L_C$ be the labels of objects T and C, respectively, and let $L_A = Lmin(Terms(Anc(L_T) \cap$
$Anc(L_C)))$. Then:
$$sim_H(T, C) = \frac{-2 \log p_i(L_A)}{-\log p_i(L_T) - \log p_i(L_C)}$$
Alternatively, expressed in terms of the information content corresponding to each probability:
$$sim_H(T, C) = \frac{2 I(L_A)}{I(L_T) + I(L_C)}$$

Figure 7.1: Similarity in Context — panels (a) and (b) depict the populations labeled $L_T$, $L_C$ and $L_A$ as nested regions within the corpus O.

Figure 7.1 illustrates the idea behind our measure graphically, by depicting the two objects being compared, T and C, as members of the populations to which their particular labels apply. All such objects are also members of the larger set of objects to which the label $L_A$ applies, and are of course also members of O, the corpus. The sizes of these sets relative to each other and to the size of the corpus determine the similarity of the objects being compared. The larger the region associated with $L_A$ relative to the size of O, the less significant it is that the objects being compared are included in this set, because the same holds for many other objects in O. Similarly, the smaller the regions $L_T$ and $L_C$ relative to $L_A$, the less significant it is that both objects are in $L_A$, because many objects share this trait. Thus, as illustrated in Figure 7.1(a), maximum similarity (short of identity) occurs when $L_A$ is a small fraction of O and only objects labeled with $L_T$ or $L_C$ are contained within it. Conversely, similarity is at a minimum when $L_A$ contains most of O, including many objects other than those labeled with $L_T$ or $L_C$, as in Figure 7.1(b).

When the taxonomy is a tree and each object is labeled with a single term, Definition 7 reduces to Equation 7.2. Consider objects T and C, labeled with terms t and c, respectively. When the taxonomy is a tree, $L_A$, the minimal label for the intersection of the ancestor graphs of t and c, is a, the least common ancestor of t and c.

As defined by Equation 7.2, Lin's formulation cannot be used when the taxonomy is a DAG, because the terms describing the objects do not necessarily have a unique least common ancestor. Nor is it applicable when objects are labeled with multiple terms. Resnik's similarity measure, which is defined in terms of the ancestor with the maximal information content, can be used with a DAG taxonomy, but not when multiple terms are used as labels. However, because it is specified over labels rather than terms, holistic similarity is well-defined in both of these cases.

Definition 7 can also be viewed in terms of conditional probability. Suppose one has randomly picked two objects, $o_1$ and $o_2$, from O. Let p(X) be the probability that label $L_1$ applies to one of the objects and label $L_2$ applies to the other, given that $L_A$ applies to both of them. Then:
$$p(X) = p_i(L_1 \mid L_A)\, p_i(L_2 \mid L_A) = \frac{p_i(L_1 \cup L_A)\, p_i(L_2 \cup L_A)}{p_i(L_A)^2}$$
Since $L_A$ applies to any object to which $L_1$ applies, $p_i(L_1 \cup L_A) = p_i(L_1)$, and likewise $p_i(L_2 \cup L_A) = p_i(L_2)$. So:
$$p(X) = \frac{p_i(L_1)\, p_i(L_2)}{p_i(L_A)^2} = \log^{-1}\big[\log p_i(L_1) + \log p_i(L_2) - 2 \log p_i(L_A)\big]$$
$$I(X) = -\log p(X) = I(L_1) + I(L_2) - 2 I(L_A)$$

When I(X) is large, the situation is analogous to that of Figure 7.1(b): knowing that $L_A$ applies to both objects tells one relatively little about whether the objects come from the populations labeled $L_1$ and $L_2$. As I(X) approaches its minimum value, the situation becomes analogous to that of Figure 7.1(a): knowing that $L_A$ applies to both objects is tantamount to knowing that they are drawn from the desired populations. I(X) ranges between 0 and $I(L_1) + I(L_2)$. If we normalize I(X) to the range [0..1], we find that:
$$\frac{I(X)}{I(L_1) + I(L_2)} = \frac{I(L_1) + I(L_2) - 2 I(L_A)}{I(L_1) + I(L_2)} = 1 - \frac{2 I(L_A)}{I(L_1) + I(L_2)} = 1 - sim_H(o_1, o_2)$$
$$sim_H(o_1, o_2) = 1 - \frac{I(X)}{I(L_1) + I(L_2)}$$
In other words, I(X) represents the un-normalized distance between the labels of objects $o_1$ and $o_2$, and $sim_H(o_1, o_2)$ normalizes this value and converts it from a distance to a similarity.
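The sketch below shows one direct, brute-force way to compute sim_H from Definitions 5-7. It reuses the hypothetical TaxonomySketch class from the earlier sketch and counts inclusion probabilities by scanning a small in-memory corpus; the names and this naive counting strategy are illustrative assumptions and not our actual SQL/UDF implementation, which is described in Section 7.3.

import java.util.*;

/** Illustrative brute-force computation of holistic similarity (Definition 7). */
class HolisticSimilaritySketch {
    private final TaxonomySketch taxonomy;           // hypothetical class from the earlier sketch
    private final Map<String, Set<String>> corpus;   // object id -> (minimal) label

    HolisticSimilaritySketch(TaxonomySketch taxonomy, Map<String, Set<String>> corpus) {
        this.taxonomy = taxonomy;
        this.corpus = corpus;
    }

    /** Terms(Anc(L)): every term in the label plus all of its ancestors. */
    private Set<String> ancestorTerms(Set<String> label) {
        Set<String> terms = new HashSet<>(label);
        for (String l : label) terms.addAll(taxonomy.ancestors(l));
        return terms;
    }

    /** p_i(L): fraction of corpus objects whose ancestor graph contains every term of L. */
    private double inclusionProbability(Set<String> label) {
        long count = corpus.values().stream()
                .filter(l -> ancestorTerms(l).containsAll(label)).count();
        return (double) count / corpus.size();
    }

    /** sim_H(T, C) = 2 I(L_A) / (I(L_T) + I(L_C)), with I(L) = -log p_i(L). */
    double simH(Set<String> labelT, Set<String> labelC) {
        Set<String> common = new HashSet<>(ancestorTerms(labelT));
        common.retainAll(ancestorTerms(labelC));          // Terms(Anc(L_T)) ∩ Terms(Anc(L_C))
        Set<String> lA = taxonomy.minimalLabel(common);   // L_A = Lmin(...)
        double iA = -Math.log(inclusionProbability(lA));  // log base is irrelevant to the ratio
        double iT = -Math.log(inclusionProbability(labelT));
        double iC = -Math.log(inclusionProbability(labelC));
        return (iT + iC) == 0 ? 1.0 : 2 * iA / (iT + iC); // both labels at the root: treat as identical
    }
}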
7.1.2 Generic Similarity

In this section, we consider more carefully the use of the interior terms of a taxonomy in labels. We begin by examining the meaning of using such a term in a label.

Figure 7.2: Use of Anonymous Terms — (a) a small fruit taxonomy with the terms Fruit, Apple, Cherry, Orange and Pear; (b) the same taxonomy with an anonymous term F* ("other fruit") added as a child of Fruit.

When an interior term is used in a label, there are two possible interpretations. The term may have been selected because no more specific term in the taxonomy applies to the object in question. For example, consider the taxonomy of Figure 7.2(a). If the object to be labeled is a mango, the interior term "fruit" is the most appropriate label, because none of the more specific terms apply. On the other hand, the individual doing the labeling may choose an interior term because he or she does not know which (if any) more specific term applies. We refer to the first type of labeling as careful and to the second as generic. Both often occur in practice, in particular as a taxonomy evolves. Initially, a single term may be applied to what later turns out to be a whole group of distinct subclasses of objects. Over time, as these subclasses are recognized, new, more specific terms are created. However, objects classified under one version of the taxonomy are not necessarily reclassified whenever the taxonomy evolves.

To accommodate the careful use of interior terms as labels, we augment the taxonomy by adding a new descendant term, X*, for every interior term X used as a label. We refer to X* as an anonymous term, because it describes an unnamed subset of the objects to which the term X applies. In the taxonomy of Figure 7.2(b), an anonymous term F* has been added to the taxonomy of Figure 7.2(a). An object carefully labeled "fruit" will be considered an instance of F* and of all of its ancestors (including F). If an object can be labeled with multiple terms, it may also be necessary to introduce an anonymous term for combinations of terms that are used carefully. As far as our definition of similarity is concerned, anonymous terms behave exactly like real terms in T, and we will not consider them further in this chapter.

Unlike careful labeling with interior terms, generic labeling forces us to rethink our basic understanding of similarity by introducing uncertainty into the labeling of objects. If a target object is labeled "fruit", and we are uncertain as to which specific kind of fruit it is, then candidate objects labeled "apple", "pear" or "fruit" all fulfill the only specific requirement posed by the labeling of the target object, that of being a fruit. However, if we apply Definition 7, only the candidate labeled "fruit" will receive a similarity score of 1 with respect to the target. In effect, sim_H penalizes objects labeled "apple" or "pear" for being "too specific" when the target object is generic. Note that the situation changes when the roles of target and candidate are reversed: if the target object is labeled "apple", a candidate object labeled "apple" is a better match than one labeled "pear" or "fruit".
To reflect the asymmetry introduced by generic labeling, we define a revised similarity measure, sim_G, such that sim_G(T, C) = 1 if and only if object C is substitutable for object T. This idea is familiar from the use of type hierarchies in programming languages, where an instance of a subtype is substitutable for an instance of a supertype, but not the other way around.

Definition 8 (Generic Holistic Similarity) Let $L_T$ and $L_C$ be the labels of objects T and C, respectively, and let $L_A = Lmin(Terms(Anc(L_T) \cap Anc(L_C)))$. Then:
$$sim_G(T, C) = \frac{-2 \log p_i(L_A)}{-\log p_i(L_T) - \log p_i(L_A)}$$
Alternatively, expressed in terms of the information content corresponding to each probability:
$$sim_G(T, C) = \frac{2 I(L_A)}{I(L_T) + I(L_A)}$$

Figure 7.3: Generic Similarity in Context — panels (a) and (b) depict the populations labeled $L_T$ and $L_A$ within the corpus O.

The generic similarity measure views the candidate object, C, as an instance of the most specific class of objects that includes both the target and the candidate. Figure 7.3 depicts a target object T compared to a candidate object C using sim_G. As in the case of sim_H, the larger the region associated with $L_A$ relative to the size of O, the less significant it is that the objects being compared are included in this set. Similarly, the smaller the region $L_T$ relative to region $L_A$, the less significant it is that both objects are in $L_A$. As illustrated in Figure 7.3(a), maximum similarity (short of identity) occurs when $L_A$ is a small fraction of O and most of the objects within it are labeled with $L_T$. Conversely, similarity is at a minimum when $L_A$ contains most of O, including many objects other than those labeled with $L_T$, as in Figure 7.3(b).

Like sim_H, sim_G can also be viewed in terms of conditional probability. Suppose one has randomly picked two objects, $o_1$ and $o_2$, from O. Let p(X) be the probability that label $L_1$ applies to $o_1$, given that $L_A$ applies to both objects. Then:
$$p(X) = p_i(L_1 \mid L_A) = \frac{p_i(L_1 \cup L_A)}{p_i(L_A)} = \frac{p_i(L_1)}{p_i(L_A)} = \log^{-1}\big[\log p_i(L_1) + \log p_i(L_A) - 2 \log p_i(L_A)\big]$$
$$I(X) = -\log p(X) = I(L_1) + I(L_A) - 2 I(L_A)$$
Normalizing I(X) to the range [0..1] gives:
$$\frac{I(X)}{I(L_1) + I(L_A)} = \frac{I(L_1) + I(L_A) - 2 I(L_A)}{I(L_1) + I(L_A)} = 1 - \frac{2 I(L_A)}{I(L_1) + I(L_A)} = 1 - sim_G(o_1, o_2)$$
$$sim_G(o_1, o_2) = 1 - \frac{I(X)}{I(L_1) + I(L_A)}$$
So, as in the case of sim_H, sim_G represents a normalized distance viewed as a similarity. In this case, the distance is between $L_1$, the label of $o_1$, and $L_A$, the label representing what $o_1$ and $o_2$ have in common.

Figure 7.4: Labeling With Generic Terms — a food taxonomy rooted at Food, with interior terms Meat, Dairy and Vegetable, and more specific terms such as Chicken, Beef, Cheese, Milk, Tomato, Broccoli, Steak, Cheeseburger, Lasagna and Casserole.

The use of generic labeling also reinforces our choice of an open-world model, because when the taxonomy is structured as a DAG, the closed-world model is incompatible with the use of generic terms as labels. To see why, consider the taxonomy of Figure 7.4. If an object is labeled with the generic term {Beef}, under generic labeling we interpret that to mean the labeler is uncertain as to which of the terms describing a more specific type of beef applies. The object could be a steak, which is "just" beef, or it could be a cheeseburger. Because a cheeseburger contains additional ingredients (i.e., cheese), the ancestor graph of the label {Cheeseburger} contains terms like {Cheese} that are neither descendants nor ancestors of the term {Beef}, and these terms may also apply to the object in question.
Thus, labeling a cheeseburger {Beef} would violate the closed-world assumption, which states that terms not in the ancestor graph of {Beef} do not apply to the object.

We conclude this section with a set of examples that demonstrates how sim_G orders candidate objects with respect to a fixed target. The examples are based on the taxonomy of Figure 7.4, and are intended to be illustrative rather than to provide an exhaustive case analysis.

C     L_C                       L_A               sim_G(T, C)
c1    {Cheeseburger}            {Beef, Cheese}    2I(Beef, Cheese) / (I(Beef, Cheese) + I(Beef, Cheese)) = 1
c2    {Beef, Cheese, Tomato}    {Beef, Cheese}    2I(Beef, Cheese) / (I(Beef, Cheese) + I(Beef, Cheese)) = 1
c3    {Steak}                   {Beef}            2I(Beef) / (I(Beef, Cheese) + I(Beef))
c4    {Steak, Milk}             {Beef, Dairy}     2I(Beef, Dairy) / (I(Beef, Cheese) + I(Beef, Dairy))
c5    {Casserole}               {Meat, Dairy}     2I(Meat, Dairy) / (I(Beef, Cheese) + I(Meat, Dairy))

Table 7.1: Similarity to Target Labeled {Beef, Cheese}

Table 7.1 shows the similarity of five candidate objects to a common target object T labeled {Beef, Cheese}. The first candidate, labeled {Cheeseburger}, is a specialization of the target object and therefore substitutable for it. It receives a similarity score of 1 because its ancestor graph includes the complete ancestor graph of the target. The same is true of c2, the candidate labeled {Beef, Cheese, Tomato}. The ancestor graph of c3, labeled {Steak}, includes the term "Beef" and all its ancestors, but intersects the ancestors of "Cheese" only at the root of the taxonomy ("Food"). L_A is therefore {Beef}, the minimal label of the intersection of the target and candidate ancestor graphs. Since I(Beef) is calculated using inclusion probability, it reflects not only the number of objects labeled {Beef}, but also the populations of objects whose labels are descendants of {Beef} (e.g., {Steak}, {Lasagna}) and of objects that include the term "Beef" among others in their label (e.g., {Beef, Broccoli}). c4, the candidate labeled {Steak, Milk}, is similar to c3, but in this case L_A = {Beef, Dairy}. The population associated with this label is smaller than the one associated with {Beef}, since it includes only those objects whose label includes the term "Dairy" (or one of its descendants) as well as the term "Beef" (or one of its descendants). Hence the information content of this label is higher, and c4 receives a higher similarity score than c3. For c5, labeled {Casserole}, L_A = {Meat, Dairy}. Because inclusion probability associates more objects with this label than with the label {Beef, Dairy}, c5 receives a lower similarity score than c4.

7.2 Related Work

In Chapter 3 we introduced three groups of similarity measures that can be applied to taxonomies: term-similarity measures, common-term measures, and measures that combine the similarity scores between individual pairs of terms. In this section, we briefly review some of them; they are evaluated and compared with our measures in the experimental section.

The simplest measure we studied is Common Term Count, which ranks candidates based solely on the number of terms they have in common with the target. We also evaluated the Jaccard measure, a normalized form of Common Term Count that is commonly used to measure similarity in IR tasks. If $L_T$ and $L_C$ represent the labels of the target and candidate objects, respectively:
$$sim_{Jaccard}(T, C) = \frac{|L_T \cap L_C|}{|L_T \cup L_C|}$$

Two further analogous measures that are also based on common terms are Common Term IC and its normalized form.
For these two measures, we weight each common term by its information content. For Common Term IC, we sum the information content of the common terms to determine similarity:
$$sim_{CTIC}(T, C) = \sum_{a \in L_T \cap L_C} I(a)$$
We then normalize this measure by comparing the common information content to the total information content of all the terms in the two labels:
$$sim_{Norm\text{-}CTIC}(T, C) = \frac{2 \sum_{a \in L_T \cap L_C} I(a)}{\sum_{t \in L_T} I(t) + \sum_{c \in L_C} I(c)}$$

We also considered the measure proposed by Wang et al. [109] and the one proposed by Halkidi et al. [45]. These two measures are not based solely on common terms, but also count as similar pairs of terms that are close in the taxonomy. They calculate the pairwise similarity of individual terms from the target and candidate labels, and combine these to produce an overall score. As mentioned in Chapter 3, Wang et al.'s measure uses a generalized form of Lin's similarity measure to determine the similarity of each term pair. If t and c are the target and candidate terms:
$$sim_{Lin'}(t, c) = \frac{2 \cdot \max_{t_a \in Terms(Anc(c) \cap Anc(t))} I(t_a)}{I(t) + I(c)}$$
Note that this is not the same as applying our formulation in Definition 7 to a pair of single terms, and it does not follow directly from Lin's axiomatic definition of similarity in Equation 7.1, because it considers only a portion of the commonality between the terms (selecting the common ancestor with the maximum information content). Given this definition of term similarity, Wang et al.'s measure combines the pairwise similarities by averaging them across all pairs of terms in the two labels:
$$sim_{Wang}(T, C) = \frac{1}{|L_T|\,|L_C|} \sum_{t \in L_T,\, c \in L_C} sim_{Lin'}(t, c)$$

Halkidi et al. used the similarity measure defined by Wu and Palmer [114] to compute the similarity of term pairs, but it is unclear how to generalize that measure to a taxonomy structured as a DAG. Hence we instead used the same generalized form of Lin's measure as Wang et al. in order to evaluate Halkidi et al.'s measure:
$$sim_{Halkidi\text{-}IC}(T, C) = \frac{1}{2}\left(\frac{1}{|L_T|}\sum_{t \in L_T} \max_{c \in L_C} sim_{Lin'}(t, c) + \frac{1}{|L_C|}\sum_{c \in L_C} \max_{t \in L_T} sim_{Lin'}(c, t)\right)$$
Note that Halkidi et al.'s measure considers each term individually and finds the most similar term from the other set. The similarity of these best matches is then averaged over the terms in each set, and finally the average similarities from the two sets are combined with equal weight.

It is important to note that our holistic measures sim_H and sim_G do not belong to any of these groups. Unlike the term-similarity measures, they can be used when labels contain multiple terms. Unlike the common-term measures, they take distinct but similar terms into account. Unlike the pairwise measures, they do not consider individual terms, but rather take all the terms in each label into account simultaneously.

In Section 7.4, we describe how we evaluated our similarity measures using the taxonomy and associations defined by the Gene Ontology [7]. Other authors have also used taxonomic similarity measures over the Gene Ontology to address the problem of finding gene products similar to a specified target. For example, Lord et al. [69] evaluate the Resnik and Lin measures, and a distance measure due to Jiang and Conrath [58], by determining how well their similarity scores correlate with similarities discovered by comparing the genetic sequences of the target and candidate objects.
Similarly, Wang et al. [109] evaluate their measure by determining how well it correlates with functional properties derived from gene expression data.

7.3 Implementation

To accommodate both large taxonomies and large corpora of objects, we built our implementation in the context of an object-relational database management system, specifically IBM DB2 Universal Database V8.2 (http://www.ibm.com/software/data/db2/udb/). We believe this approach offers a number of advantages. In the first place, storing the corpus in a database allows the full power of SQL to be used to select the objects of interest in a particular situation. For example, a user searching for similar objects in a large database of gene products may wish to restrict the search to human gene products. Secondly, the ability to extend the database management system with user-defined types and functions allowed us to implement certain critical operations very efficiently within the database, without requiring large amounts of data to be retrieved for manipulation by an application.

In our implementation, each similarity measure is implemented as an SQL query against a set of relations with a fixed schema. The primary purpose of these relations is to present, in a standard format, the taxonomy itself and the associations between terms from the taxonomy and objects in the corpus. In our database, the taxonomy is represented as a table of (parent term id, child term id) pairs; the associations are represented as (term id, object id) pairs. Since the associations for the target and candidate objects may in general come from different sources, we use separate relations to represent them. These relations may be base tables, but are more likely views created over the native representation of the associations.

Once the taxonomy and the associations have been specified, additional tables are defined to store information that can be precomputed once and used repeatedly in subsequent evaluations of the similarity measure. We provide more information on this auxiliary data in Section 7.3.2.

7.3.1 User-Defined Types

Within each query, certain critical operations are implemented as user-defined functions (UDFs) that operate on user-defined types (UDTs) represented in the database as Binary Large Objects (BLOBs). Two types of operations warranted such special treatment.

Firstly, the taxonomy is naturally represented as a directed graph, and a key step in the evaluation of sim_H, sim_G and various other measures involves the determination of common ancestors between the labels of target and candidate objects. We have therefore made extensive use of a general-purpose graph library for DB2. The library allows graphs to be constructed efficiently from database data using a user-defined aggregate function. Once constructed, they can be stored in the database as BLOBs and manipulated by UDFs that implement a wide range of graph-theoretic operations, e.g., finding the in-neighbor region of a node, intersecting graphs, etc. Table functions are provided that return the edges or nodes in a graph, along with payload values and various properties, e.g., the incoming edge count of a node. A full discussion of the graph library, which scales to very large graphs, is beyond the scope of this chapter.

A second critical step in the evaluation of sim_H and sim_G is the determination of the inclusion probabilities of particular labels.
To determine the inclusion probability of a label L, one must know its frequency, that is, the number of objects in the corpus to which all of the terms in L apply. Recall that a term t applies to an object o if and only if the ancestor graph of the object's label includes the term in question, i.e., if $t \in Terms(Anc(o))$.

In principle, one could precompute frequencies for each of the $2^{|N|}$ combinations of terms that can be used as a label. For taxonomies of realistic size, however, this approach is impractical. Instead, we build an inverted list for each term, identifying the objects to which the term applies, i.e., those objects whose labels contain the term or any of its descendants. Let O(t) denote the list of objects for term t. The frequency of a label can then be determined by finding the size of the intersection of the inverted lists of its individual terms. The inclusion probability is therefore:
$$p_i(L) = \frac{\big|\bigcap_{t \in L} O(t)\big|}{|O|}$$

Like the taxonomy graph, the inverted lists are implemented as a user-defined type optimized to support the operations needed to compute label frequency: intersection and length. The inverted-list UDT stores a list of object identifiers as a simple vector. Identifiers can be inserted in any order as the list is built (using a user-defined aggregate function), and the list is sorted once when insertion is complete. The intersection of two lists can be computed with a single pass through both lists, and a user-defined aggregate function is provided to find the intersection of a set of lists.

7.3.2 Precomputation

In principle, all the information needed to compute sim_H or sim_G can be derived dynamically from the relations that define the taxonomy and the association corpus. However, certain information used to find and rank candidate objects can be reused for many different target objects, as long as neither the taxonomy nor the corpus changes. In this section we describe several situations in which we chose to precompute such values.

Whenever a target and a candidate object are compared, we need to find the intersection of their respective ancestor graphs. Furthermore, at least in our experiments, the same candidates are evaluated for many different targets. We therefore precompute the ancestor graph for the label of each object in the corpus of candidate objects. The graph library can generate ancestor graphs quite quickly, so precomputation is practical even for large corpora. The space requirement is modest, because the ancestor graph of an object is typically a small fraction of the entire taxonomy; details for our experimental scenario can be found in Section 7.4. Updates to the candidate corpus can be handled incrementally, but updates to the taxonomy may require a complete recomputation of these graphs.

Note that we only precompute the ancestor graphs for candidate object labels, not target object labels, because each target object is generally referenced only once. However, we also precompute the ancestor graph for each individual term in the taxonomy. These graphs make the dynamic computation of ancestor graphs for target object labels more efficient, and they are reused many times since many targets refer to the same terms. They consume much less space than the candidate label ancestor graphs, because there are far fewer terms than labels and each graph is smaller. The term ancestor graphs are not affected by updates to the corpus, but may need recomputation when the taxonomy changes.
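A minimal Java stand-in for the inverted-list idea of Section 7.3.1 follows, under the assumption that object identifiers are ints; the classes below are illustrative sketches rather than reproductions of our sorted-vector UDT and aggregate functions.

import java.util.*;

/** Illustrative stand-in for the inverted-list UDT: a sorted vector of object ids. */
class InvertedListSketch {
    private int[] ids = new int[0];
    private final List<Integer> pending = new ArrayList<>();

    /** Identifiers may be inserted in any order while the list is being built. */
    void add(int objectId) { pending.add(objectId); }

    /** Sort once when insertion is complete. */
    void seal() {
        ids = pending.stream().mapToInt(Integer::intValue).sorted().toArray();
        pending.clear();
    }

    int length() { return ids.length; }

    /** Single-pass intersection of two sorted lists. */
    InvertedListSketch intersect(InvertedListSketch other) {
        InvertedListSketch out = new InvertedListSketch();
        int i = 0, j = 0;
        while (i < ids.length && j < other.ids.length) {
            if (ids[i] == other.ids[j]) { out.add(ids[i]); i++; j++; }
            else if (ids[i] < other.ids[j]) i++;
            else j++;
        }
        out.seal();
        return out;
    }
}

/** p_i(L) = |intersection of O(t) for t in L| / |O|, given one sealed list per term of L. */
class InclusionProbabilitySketch {
    static double inclusionProbability(List<InvertedListSketch> listsForLabelTerms, int corpusSize) {
        InvertedListSketch acc = listsForLabelTerms.get(0);
        for (int k = 1; k < listsForLabelTerms.size(); k++)
            acc = acc.intersect(listsForLabelTerms.get(k));
        return (double) acc.length() / corpusSize;
    }
}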
We also precompute the inverted lists. For each term in the taxonomy, we build a list containing the identifiers of all objects in the corpus that contain the term, or a descendant of the term, in their label. This allows the inclusion probability of an arbitrary label to be found by intersecting the lists corresponding to its terms. Note that the size of these lists is proportional to the size of the corpus. The lists can be updated incrementally as objects are added to the corpus, but may need to be recomputed if objects are deleted or the taxonomy changes.

7.4 Experimental Evaluation

Similarity measures are difficult to evaluate, since the similarity of objects is normally subjective in nature and individuals may differ in how they rank candidate objects with respect to a particular target. Other authors have tried to work around this problem by comparing their rankings to those of a panel of human subjects (e.g., [94, 66]) or to an independent measure of similarity appropriate to their domain (e.g., [69, 109, 60]).

We took a different approach. In search of a result that could be more objectively quantified, we applied our similarity measures to the more difficult problem of matching. Suppose that two sources of associations use the same taxonomy to assign labels to objects. Given the label of a target object as supplied by one source, the task is to find the same object in a corpus of labels supplied by another source. In our experiments, the two sources use a common identifier for objects, so it is easy to verify that a match has been found.

Specifically, we use the Gene Ontology (GO) [7] as the taxonomy in our experiments. This taxonomy contains about 17000 terms, and can be represented by a graph with as many nodes and about 22000 edges. The terms are divided into three independent facets, representing a gene product's cellular location(s), molecular function(s) and the biological process(es) in which it participates. Each facet is a directed acyclic graph; in our experiments, we ignored the facets and treated all terms as belonging to a single taxonomy. The relationships between terms are characterized as either is-a or part-of. We also ignored this distinction, since both relationships possess the transitive behavior that underlies our understanding of classification. Finally, a small number of terms in the taxonomy are considered obsolete and have been gathered together as children of a special node. We removed these nodes from the taxonomy, since their location in the taxonomy does not reflect their semantics.

Various organizations have used the GO taxonomy to annotate a large number of gene products, and the taxonomy and annotations can be downloaded from the GO Consortium (http://www.geneontology.org). Over one million gene products have been annotated with over 4 million term associations, but we restrict our attention to human gene products registered in the SwissProt databank, which number about 28000. Almost all of these gene products have annotations supplied by the UniProt Knowledgebase [5], and we use these gene products as our candidate objects. There are approximately 110000 annotations for these objects, and these annotations constitute our corpus of term associations. We removed only those annotations that associate a gene product with an obsolete term. Of the 28000 gene products, 5789 also have annotations supplied by Proteome, Inc. (http://www.proteome.com). These gene products constitute our target objects, and the corresponding set of about 20000 annotations becomes our target associations.
Our matching task is therefore: given the annotations for a gene product supplied by Proteome Inc., find the same gene product using the annotations supplied by UniProt.

We note that this is inherently a very difficult problem. The two organizations that generated the annotations we use for matching are completely independent. They did not necessarily have the same goals or use the same guidelines in assigning terms to gene products, nor did they necessarily work from the same source information. Furthermore, the space to search is large (28000 objects), and often contains many objects similar to the desired target.

To apply a similarity measure to this problem, we rank the candidate objects using the measure, and then determine whether the matching object is found in the top K objects as ranked by the measure. We also keep track of the depth in the candidate list at which the matching object was found.

In practice, it is too expensive to compute the similarity score of a target with respect to all 28000 objects in the corpus. Hence we applied the following heuristic to limit the candidate set: we considered as candidates only those objects whose labels have at least one term in common with the label of the target object. We chose this heuristic because it can be applied to all the measures we wished to evaluate, some of which are based entirely on common terms and their properties. However, it should be noted that for many target objects, the labels supplied by the two organizations have no terms in common. Thus, regardless of the measure used, this heuristic limits the number of targets that can be successfully matched to a maximum of 3660 of 5789, or 63.2%. We considered other heuristics for determining the candidate set; see Section 7.4.3 for discussion.

7.4.1 Results

We tested sim_H, sim_G and the six measures described in Section 7.2 on targets from the set of 5789 gene products annotated by both UniProt and Proteome, Inc. Using the candidate-selection heuristic described above, we calculated the similarity score of each candidate with respect to the target. If the measure ranked the object corresponding to the target among the top 100 candidates, we considered it to have found a match. While this may seem like a generous definition of success, we note again the inherent difficulty of the matching problem. We calculated the success rate for each measure, along with the depth at which the matching object was found, averaged over the successful matches. The results are presented in Table 7.2. (For the one measure for which we did not test the entire set of targets, a confidence interval is provided for the success rate.) Recall that the maximum possible success rate using the common-term heuristic for candidate selection is 63.2%.

We do not compare the measures on the basis of speed, since we made a concerted effort to reduce costs only for sim_H and sim_G. However, we note that our strategy of precomputation enabled us to test a single candidate in about 4.4ms using sim_G and about 6.4ms using sim_H. These values are for a 2.4GHz Pentium 4 with 1GB of memory, running Windows XP and DB2 UDB v8.2. The space requirements for the precomputed object ancestor graphs, term ancestor graphs and term inverted lists were modest: 122MB, 40MB and 0.5MB, respectively.

Method                 Success Rate (%)       Average Depth
sim_H                  39.3                   23.9
Common Term IC         39.1                   22.2
sim_G                  37.5                   25.2
Normalized CT IC       36.9                   25.4
Halkidi IC             36.1                   26.2
Common Term Count      35.2                   28.4
Jaccard                30.9                   28.7
Wang                   24.5 (±2.42 at 95%)    35.9

Table 7.2: Comparison of Similarity Measures

7.4.2 Discussion

Although the differences in the success rates are not striking, some observations can be made.

Firstly, the measures based solely on common terms did not adequately take into account pairs of terms that are closely related but not identical. Because such pairs do not contribute at all to the similarity score, common-term measures that penalize candidates for terms not shared with the target understate similarity to a greater extent than those that do not. Thus, both Common Term Count and Common Term IC, which ignore unmatched terms, outperformed their normalized counterparts, Jaccard and Normalized Common Term IC, which incur a penalty for unmatched terms. For the holistic measures, all terms contribute to the similarity score, and indeed sim_H outperforms the other measures. However, our candidate-selection heuristic requires each candidate to have at least one term in common with the target. While this heuristic provides a level playing field for comparing the various measures, it limits the beneficial effect of taking related (but not common) terms into account. We explore the effects of relaxing this restriction in Section 7.4.3.

We also observe that measures that weight the importance of pairs of common terms based on their information content performed better than the analogous measures based on counting common terms. In particular, Common Term IC outperformed Common Term Count, and Normalized Common Term IC outperformed Jaccard.

The differences among the measures based on information content stem from differences in how similarity is derived from information content in each case. Both Wang and Halkidi IC normalize the common information content of each term pair independently, and then combine these normalized values into an overall similarity value. As a result, term pairs with relatively low common information content are given the same weight as pairs with much greater common information content. This is particularly so in the case of Wang, which pairs each term with every other term, as opposed to Halkidi IC, which pairs each term only with its best match. Both measures, however, suffer in comparison to those based on common-term information content, which weight each common term in accordance with its information content. Another source of error in these measures is correlation: if a label contains two terms whose occurrences are correlated, these measures overestimate their information content when the terms occur together. The holistic measures sim_H and sim_G avoid both of these problems by calculating the combined information content of all the terms in each relevant set.

Although we believe that generic labeling with interior terms occurs pervasively in our corpus, we note that sim_H outperformed sim_G in our experiments. We believe this reflects our choice of a matching problem rather than one based strictly on similarity. The difference between the measures is that sim_G does not penalize a candidate for being labeled with terms that are more specific than those used to label the target. This may elevate the score of the matching candidate enough to make it competitive with others labeled with more general terms. However, it also elevates the scores of candidates that use more specific terms than the matching candidate and its competitors, which has the opposite effect.
If one views the terms associated with the target as requirements, these additional candidates satisfy the requirements as well as the matching candidate does, and they are therefore equivalently similar. But in a matching problem we are looking for a particular object, and the presence of the others makes finding it more difficult.

7.4.3 Candidate Selection

As mentioned previously, all the measures we evaluated were limited by our candidate-selection heuristic. For the measures based entirely on common terms, this limitation does no harm, since objects that have no terms in common with the target have a similarity score of zero. However, the other measures we evaluated have the potential to find additional matches if we consider additional candidates.

We tested this hypothesis with an alternative candidate-selection heuristic. For each association between a target object and a term, we augmented the set of terms associated with the target by adding associations between the target object and the original term's immediate neighbors. We then used as candidates all objects that have at least one term in common with this expanded set. We refer to this heuristic as KNN-1, because the candidate set comprises those objects that are described by at least one of the original terms or by a neighboring term one hop away. The original common-term heuristic can be described as KNN-0; KNN-2 and other values can be considered as well.

KNN   Patent ID   Score        Title
0     14019       1.0          "Apparatus and method for collecting flue gas particulate with high permeability filter bags"
0     47255       0.9673537    "Advanced hybrid particulate collector and method of operation"
0     195208      0.9673537    "Volatile materials treatment system"
0     265087      0.9673537    "Char for contaminant removal in resource recovery unit"
2     304641      0.9673537    "System and method for removing gas from a stream of a mixture of gas and particulate solids"
0     344644      0.9673537    "Electric dust collector and incinerator"
1     21467       0.9441908    "Thief process for the removal of mercury from flue gas"
0     25179       0.9441908    "Multi-stage particulate matter collector"
1     473644      0.9441908    "Method of regulating the flue gas temperature and voltage supply in an electrostatic precipitator..."
0     92698       0.92622423   "Electrically enhanced electrostatic precipitator with grounded stainless steel collector electrode..."
0     354751      0.92622423   "Device and method for filtering internal combustion engine exhaust gases and vehicle equipped..."

Table 7.3: Patent Similarity Search

When we tested sim_H with KNN-1 on the matching problem, we found that it raised the success rate from 39.3% to 40.5%. Although this is not a dramatic increase, it suggests that our measures can find similar objects even when they have no terms in common with the target. However, the time required to match a target is proportional to the number of candidates considered, and increasing the value
6 Individual patents are labeled with one or more terms from the taxonomy. We tested sim H and sim G on a corpus of approxi- mately 500000 patents filed between 2001 and 2003, as classified by about 1.9 million associations between patents and terms. While we could not perform an objective matching experiment with this dataset, it nevertheless provided a number of interesting opportunities for evaluating our measures and their implementation. Firstly, we were able to verify that our imple- mentation could scale to a much larger corpus and a much larger taxonomy with satisfactory performance. The USPTO taxonomy also has a di?erent structure than the GO taxonomy; it is broader and flatter, and does not have facets or multiple inheritance. Finally, our candidate-selection heuristics generated fewer candidates per target than in the matching experiment. We were thus able to more easily ob- serve the value of the KNN heuristic, coupled with our measures? ability to detect similarity without the presence of common terms between the candidate and target objects. 6 http://www.uspto.gov/web/patents/classification/ 157 Table 7.3 shows the combined result of three similarity searches for a typical target, with KNN values of 0, 1 and 2. The first patent listed, ID 14019, is the target, and thus had a perfect similarity score of 1.0. Below, similar patents are listed with their similarity scores, as well as the KNN value of the search in which they were initially found. For example, patent 304641 was found by the search that used KNN-2 to select candidates, and scored as highly as any of the patents found with smaller KNN values. Similarly, two other highly-scored patents were found with KNN-1 that would not have been found with KNN-0, nor by any of the measures that rely solely on common terms. 7.5 Summary Similarity ranking of objects labeled using a taxonomy is an interesting prob- lem with a variety of useful applications. Our work has made several contributions to the state of the art. Starting from Lin?s axiomatic definition of similarity, we developed new similarity measures that are applicable to real classification systems, in which the taxonomy can be structured as a DAG, objects can be labeled with multiple terms, internal terms can be used in labels, and di?erent users may label thesameobjectindi?erentways. We implemented our measures using SQL and a pair of libraries for specialized data structures, realized as user-defined types. The result is a flexible, scalable implementation that is tightly integrated with a database management system and achieves good performance through strategic precomputation of key data structures. 158 We evaluated our measures on an object-matching task using the Gene Ontol- ogy, a taxonomy with all the properties noted above. Our experiments show that Our measures were more successful at matching objects than those reported in the literature. We also tested our measures on a search task, using the patent classi- fication taxonomy of the USPTO to find patents similar to a specified target. We evaluated two heuristics for candidate selection, a critical issue for large data sets. 159 Chapter 8 Future Work In this chapter, we present future directions for improving or extending our work on ontologies and ontology-based query answering. 8.1 Improving Performance of Ontology-Extended Systems It is true that on average, existing ontologies in various domains have a mod- erate size (usually of the order of hundreds of terms) [102]. 
Chapter 8

Future Work

In this chapter, we present future directions for improving or extending our work on ontologies and ontology-based query answering.

8.1 Improving Performance of Ontology-Extended Systems

It is true that, on average, existing ontologies in various domains have a moderate size, usually on the order of hundreds of terms [102]. For example, most ontologies at www.daml.org have fewer than 500 terms. However, in some application domains we increasingly encounter very large ontologies. For example, the Gene Ontology has 17,000 nodes and the taxonomy of the United States Patent and Trademark Office has 160,000 nodes.

On the other hand, in previous chapters we presented experimental results demonstrating that our algorithms scale to handle substantial amounts of data. However, most of the operations, e.g., query processing and ontology integration, are still memory-based, and it is not efficient to repeatedly run graph-based operations on a large ontology in memory. Hence, it would be natural to introduce index structures that support these operations on large ontologies. Such an ontology-based indexing mechanism would help our system respond to users' queries much more quickly. Furthermore, with such index structures at hand, it would be feasible to design algorithms that support disk-based integration of large ontologies.

8.2 Learning Ontologies and Interoperation Constraints

In Chapter 4 and Chapter 6, we show how to integrate graph-based and RDF ontologies, respectively. These efforts also highlight the challenge of finding interoperation constraints. Our HOME (Chapter 4) and TOSS (Chapter 5) systems include a component called Ontology Maker, which automatically takes relational or XML data as input and uses WordNet [77] and user-specified rules to generate ontologies. Inside Ontology Maker, a subcomponent called Rule Maker helps users specify the interoperation constraints between ontologies. However, it is not reasonable to assume that end-users can always provide interoperation constraints. In our probabilistic ontologies project [102], we proposed an algorithm to infer simple equality constraints, but this part of our work is still far from complete. Learning ontologies and interoperation constraints remain open problems [72, 33]. Note that there has been strong interest in recent years in learning ontologies from multimedia data, such as audio and video. Such ontologies can be used to convey the semantics of audio and video and to answer semantic queries over the multimedia data.

On the other hand, we often need to incrementally maintain the interoperation constraints when changes are made to the ontologies. In [32], we present methodologies for maintaining mappings (constraints) between evolving schemas by using the rich constructs of OWL. This area is still relatively new, however, and more effort is needed, as well as collaboration with other areas (e.g., machine learning and natural language processing).

8.3 Semantic Similarity Join

As we mentioned in Chapter 3, it is often the case that the same entity has different representations (identifications) in heterogeneous data sources. Previous approaches to integrating such data sources usually assume that objects from different sources can be mapped into a global domain in which matching those objects is easy. In practice, however, such a common domain is not easy to construct, owing to subtle differences in the semantics of the objects as well as varying interpretations of even a single object. Hence, several research groups have investigated methods for approximate joins between string-valued or text-valued columns in relational databases. Cohen [27] proposes a system called WHIRL to reason about the similarity of textual values in databases. The similarity is measured using the vector-space model, based on the intuition that two documents are more similar if they share more pairs of terms with high weights; the system relies on an A* search to return the highest-scoring answers efficiently. Gravano et al. [39] present a filtering technique for computing approximate string joins based on the notion of edit distance. Their core idea is to match the q-grams (substrings of length q) of the two strings to be compared, and they identify three important q-gram properties that filter the candidate set of strings for the join. Later, Gravano et al. [38] propose a sampling-based technique for approximate text joins that adopts the cosine similarity metric over the vector-space model. To join relations R1 and R2, they extract a sample set of tuples from R2 for each tuple t in R1; this sample set contains tuples highly similar to t. Whether a tuple t' in R2 is put into the sample set for tuple t in R1 depends on a probability function approximating the similarity between t and t'.

However, none of these methods considers semantic similarity functions based on taxonomies (ontologies); since they rely on edit distance or the cosine similarity metric, they are not directly applicable to joining two sets of ontological terms. Hence, it would be interesting to investigate new join methods that take ontology-based similarity functions into account.
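To make the q-gram idea concrete, and to show where an ontology-based similarity function could slot into a join, the following sketch computes a simple normalized q-gram overlap and uses it in a naive nested-loop similarity join. It illustrates the general idea only: it does not reproduce the filtering properties or the in-database implementation of Gravano et al. [39, 38], and the function names and the threshold in the usage note below are chosen for the example.

from collections import Counter

def qgrams(s, q=3):
    # Pad the string so prefixes and suffixes also contribute q-grams,
    # then return the multiset of all substrings of length q.
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def qgram_overlap(s1, s2, q=3):
    # Normalized overlap of the two q-gram multisets; strings with small
    # edit distance share most of their q-grams.
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    common = sum((g1 & g2).values())
    return common / max(1, min(sum(g1.values()), sum(g2.values())))

def similarity_join(r1, r2, sim, threshold):
    # Naive nested-loop join on a similarity predicate; a semantic similarity
    # join would plug in an ontology-based measure (e.g., one built on sim_H)
    # in place of a string-based sim.
    return [(a, b) for a in r1 for b in r2 if sim(a, b) >= threshold]

For instance, similarity_join(R1, R2, qgram_overlap, 0.7) pairs strings whose q-gram overlap is at least 0.7; substituting an ontology-based measure over sets of terms for the string-based predicate is the kind of join method suggested above.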
8.4 Semantic Web Databases

Recently, a large amount of web data and services has become available in RDF. Several research teams have been working on the formal aspects of the RDF data model, on extensions of the model to represent additional information (e.g., temporal and context-dependent data), and on query processing over RDF data.

Many query languages for RDF data have been proposed, e.g., RQL [59], RDQL [93], and SPARQL [89]. Considering that RDF data on the web is subject to frequent change, several research groups have been working on RDF views and their maintenance. For example, Volz et al. [108] introduced an RDF view mechanism which requires that (i) the results contain class instances (i.e., a subject or object variable), or (ii) the result itself has the pattern of an RDF statement (i.e., a triple containing subject, predicate and object). Later, Hung et al. [54] developed sophisticated view maintenance algorithms that take the graph structure of RDF data into account in order to answer queries very efficiently. They also proposed an efficient algorithm to answer aggregate queries over RDF databases and a set of algorithms to maintain views involving those aggregates.

The research efforts on RDF databases have also motivated the study of the formal aspects of the RDF data model and of query processing. Gutiérrez et al. [44] have studied the features of the RDF data model (such as reification, typing and inheritance) as well as the foundational aspects of RDF query languages (such as the complexity of query processing and query containment). They defined a formal query language for RDF and investigated its theoretical properties.

More recently, there has been a surge of interest in extending the RDF data model to represent temporal and context-dependent information. For example, Gutiérrez et al. [43] present a framework that incorporates temporal reasoning into RDF; they give both a syntax and a semantics for temporal RDF graphs and analyze the complexity of entailment in temporal RDF graphs. Udrea et al. [104] present a unified framework, annotated RDF, to support multiple forms of reasoning in RDF, such as temporal and uncertainty reasoning. In this framework, RDF triples are annotated by members of a partially ordered set. They also present a formal semantics for annotated RDF and propose algorithms for consistency checking and query answering in this extended RDF data model.
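As a rough illustration of what such annotations look like, the sketch below pairs an RDF triple with a temporal validity interval, one simple instance of an annotation domain; the actual syntax and semantics of temporal RDF [43] and annotated RDF [104] differ from this toy representation, and the resource names and dates are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class AnnotatedTriple:
    subject: str
    predicate: str
    obj: str
    annotation: tuple  # here: a temporal validity interval (start_year, end_year)

# Hypothetical data: the statement holds from 2001 through 2006.
t = AnnotatedTriple("ex:alice", "ex:worksFor", "ex:acme", (2001, 2006))

def holds_at(triple, year):
    # A minimal temporal query primitive: does the triple hold at this time?
    start, end = triple.annotation
    return start <= year <= end

In the annotated RDF framework, the annotation would more generally be a member of a partially ordered set, as described above.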
Nevertheless, a large set of open issues in this area remains to be explored, for instance: how to represent temporal, spatial and uncertainty information in the OWL data model; how to query these extended OWL data models; and how to learn such models from data. OWL has more complex features and constructs than RDF, which makes extending the OWL data model both interesting and challenging.

8.5 OWL Integration

In Chapter 4 and Chapter 6, we show how to integrate graph ontologies and RDF ontologies, respectively. It would be interesting to extend this line of work to OWL ontologies as well. Yet integrating OWL ontologies is a more challenging task because of the more complex features and constructs in the model. A range of problems needs to be considered in this task, e.g., consistency checking, how to eliminate inconsistency from an integrated ontology, and how to incorporate the open-world assumption in the integration.

BIBLIOGRAPHY

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In ICDE '02: Proceedings of the 18th International Conference on Data Engineering (ICDE '02), Washington, DC, USA, 2002.
[3] S. Al-Khalifa, C. Yu, and H. V. Jagadish. Querying structured text in an XML database. In Proc. ACM SIGMOD Conf. on Management of Data, San Diego, CA, 2003.
[4] S. Amer-Yahia, C. Botev, and J. Shanmugasundaram. TeXQuery: A full-text search extension to XQuery. In WWW 2004, pages 583–594, New York, USA, 2004.
[5] R. Apweiler, A. Bairoch, C. Wu, W. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, and M. Magrane. UniProt: the universal protein knowledgebase. Nucleic Acids Res., 32(1):D115–119, Jan. 2004.
[6] Y. Arens, C. Y. Chee, C. N. Hsu, and C. A. Knoblock. Retrieving and integrating data from multiple information sources. International Journal of Cooperative Information Systems, 2(2):127–158, 1993.
[7] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene Ontology: Tool for the unification of biology. Nature Genetics, 25(1):25–29, May 2000.
[8] F. Baader, I. Horrocks, and U. Sattler. Description logics as ontology languages for the Semantic Web. Lecture Notes in Artificial Intelligence, 2003.
[9] A. Balmin, V. Hristidis, and Y. Papakonstantinou. ObjectRank: Authority-based keyword search in databases. In Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004.
[10] C. Baral, S. Kraus, and J. Minker. Combining multiple knowledge bases. IEEE Transactions on Knowledge and Data Engineering, 3(2):208–220, 1991.
[11] C. Baral, S. Kraus, J. Minker, and V. S. Subrahmanian. Combining knowledge bases consisting of first order theories. Computational Intelligence, 8(1):45–71, 1992.
[12] C. Baral, S. Kraus, J. Minker, and V. S. Subrahmanian. Combining default logic databases. International Journal on Intelligent Cooperative Information Systems, 3(3):319–348, 1994.
[13] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, P. F. Patel-Schneider, and L. A. Stein. OWL Web ontology language reference. W3C recommendation. http://www.w3.org/TR/owl-ref/, 10 February 2004.
[14] S. Benferhat and S. Kaci. Logical representation and fusion of prioritized information based on guaranteed possibility measures: application to the distance-based merging of classical bases. Artificial Intelligence, 148(1-2):291–333, 2003.
[15] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001.
[16] I. Bhattacharya and L. Getoor. Relational clustering for multi-type entity resolution. In The 11th ACM SIGKDD Workshop on Multi Relational Data Mining (MRDM-05), 2005.
[17] I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. In The 6th SIAM Conference on Data Mining (SIAM SDM-06), 2006.
[18] I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In The 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Paris, France, June 2004.
[19] P. Bonatti, Y. Deng, and V. S. Subrahmanian. An ontology-extended relational algebra. In Proceedings of the IEEE International Conference on Information Reuse and Integration (IEEE IRI 2003), 2003.
[20] P. Bonatti, M. L. Sapino, and V. S. Subrahmanian. Merging heterogeneous security orderings. Journal of Computer Security, 5(1):3–29, 1997.
[21] P. Bouquet, F. Giunchiglia, F. van Harmelen, L. Serafini, and H. Stuckenschmidt. C-OWL: Contextualizing ontologies. In International Semantic Web Conference, pages 164–179, 2003.
[22] D. Brickley and R. Guha. RDF vocabulary description language 1.0: RDF Schema. W3C recommendation. http://www.w3.org/TR/rdf-schema/, 10 February 2004.
[23] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th WWW Conference, Brisbane, Australia, 1998.
[24] D. Calvanese, G. D. Giacomo, and M. Lenzerini. A framework for ontology integration. In Proc. of the First Semantic Web Working Symposium, pages 303–316, 2001.
[25] D. Calvanese, G. D. Giacomo, and M. Lenzerini. Description logics: Foundations for class-based knowledge representation. In Proc. of the 17th IEEE Sym. on Logic in Computer Science (LICS 2002), pages 359–370, 2002.
[26] L. Cholvy and C. Garion. Answering queries addressed to several databases according to a majority merging approach. J. Intell. Inf. Syst., 22(2):175–201, 2004.
[27] W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proc. ACM SIGMOD, pages 201–212, Seattle, WA, 1998.
[28] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string metrics for matching names and records. In Proc. of the First Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.
[29] D. Connolly, F. van Harmelen, I. Horrocks, D. L. McGuinness, P. F. Patel-Schneider, and L. A. Stein. DAML+OIL (March 2001) reference description. W3C note. http://www.w3.org/TR/daml+oil-reference, 18 December 2001.
[30] http://www.ibm.com/software/data/db2/extenders/textinformation/index.html, 2001.
[31] DBLP XML records. Available at http://dblp.uni-trier.de/xml/, Nov 2003.
[32] Y. Deng, H. Kuno, and K. Smathers. Managing the evolution of simple and complex mappings between loosely-coupled systems. In Proceedings of the Second Workshop on Semantics in Peer-to-Peer and Grid Computing at the Thirteenth International World Wide Web Conference, New York, USA, 2004.
[33] A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, and A. Halevy. Learning to match ontologies on the Semantic Web. VLDB Journal, Special Issue on the Semantic Web, 12(4):303–319, 2003.
[34] X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In SIGMOD Conference, Baltimore, USA, 2005.
[35] X. Dong, A. Y. Halevy, J. Madhavan, E. Nemes, and J. Zhang. Similarity search for Web services. In VLDB, pages 372–383, Toronto, Canada, 2004.
[36] D. Florescu, D. Kossmann, and I. Manolescu. Integrating keyword search into XML query processing. In Proceedings of the 9th International World Wide Web Conference, Amsterdam, The Netherlands, 2000.
[37] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, V. Vassalos, and J. Widom. The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information Systems, 8(2):117–132, 1997.
[38] L. Gravano, P. Ipeirotis, N. Koudas, and D. Srivastava. Text joins in an RDBMS for Web data integration. In WWW 2003, pages 90–101, Budapest, Hungary, 2003.
[39] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB 2001, pages 491–500, 2001.
[40] G. Greco, S. Greco, and E. Zumpano. A logical framework for querying and repairing inconsistent databases. IEEE Transactions on Knowledge and Data Engineering, 15(6):1389–1408, 2003.
[41] T. R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2):199–220, 1993.
[42] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram. XRANK: Ranked keyword search over XML documents. In SIGMOD, San Diego, USA, 2003.
[43] C. Gutiérrez, C. Hurtado, and A. Vaisman. Temporal RDF. In European Semantic Web Conference (ESWC 2005), pages 93–107, Heraklion, Greece, 2005.
[44] C. Gutiérrez, C. A. Hurtado, and A. O. Mendelzon. Foundations of Semantic Web databases. In PODS, pages 95–106, 2004.
[45] M. Halkidi, B. Nguyen, I. Varlamis, and M. Vazirgiannis. THESUS: Organizing Web document collections based on semantics and clustering. Technical Report 230, INRIA Project Gemo, 2003.
[46] P. Hayes. RDF semantics. http://www.w3.org/TR/rdf-mt/, 10 February 2004.
[47] J. Heflin. Towards the Semantic Web: Knowledge representation in a dynamic, distributed environment. Ph.D. Dissertation, 2001.
[48] J. Heflin and J. Hendler. Semantic interoperability on the Web. In Proceedings of Extreme Markup Languages 2000, pages 111–120, Alexandria, VA, 2000. Graphic Communications Association.
[49] M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. In SIGMOD Conference, pages 127–138, 1995.
[50] I. Horrocks, D. Fensel, J. Broekstra, S. Decker, M. Erdmann, C. Goble, F. van Harmelen, M. Klein, S. Staab, R. Studer, and E. Motta. OIL: The ontology inference layer. Technical Report IR-479, Vrije Universiteit Amsterdam, Faculty of Sciences. http://www.ontoknowledge.org/oil/, September 2000.
[51] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-style keyword search over relational databases. In Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003.
[52] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.
[53] A. Huang, Q. Xue, and J. Yang. TupleRank and implicit relationship discovery in relational databases. In WAIM, Chengdu, China, 2003.
[54] E. Hung, Y. Deng, and V. S. Subrahmanian. RDF aggregate queries and views. In ICDE, Tokyo, Japan, 2005.
[55] E. Hung, Y. Deng, and V. S. Subrahmanian. TOSS: An extension of TAX with ontologies and similarity queries. In SIGMOD, Paris, France, June 2004.
[56] H. V. Jagadish, L. V. S. Lakshmanan, D. Srivastava, and K. Thompson. TAX: A tree algebra for XML. In Proceedings of DBPL '01, Rome, Italy, 2001.
[57] M. A. Jaro. Probabilistic linkage of large public health data files. Statistics in Medicine, 14:491–498, 1995.
[58] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of International Conference on Research in Computational Linguistics, Taiwan, 1998.
[59] G. Karvounarakis, A. Magkanaraki, S. Alexaki, V. Christophides, D. Plexousakis, M. Scholl, and K. Tolle. Querying the Semantic Web with RQL. Computer Networks and ISDN Systems Journal, 42(5):617–640, 2003.
[60] J. Keller, M. Popescu, and J. Mitchell. Taxonomy-based soft similarity measures in bioinformatics. In Proceedings of the 2004 IEEE International Conference on Fuzzy Systems, 2004.
[61] M. Klein, D. Fensel, F. van Harmelen, and I. Horrocks. The relation between ontologies and schema-languages: translating OIL-specifications in XML-schema. In Proceedings of the ECAI '00 workshop on applications of ontologies and problem-solving methods, Berlin, Germany, Aug 2000.
[62] S. Konieczny. On the difference between merging knowledge bases and combining them. In Seventh International Conference on Principles of Knowledge Representation and Reasoning (KR '00), pages 135–144, 2000.
[63] S. Konieczny and R. P. Pérez. Merging with integrity constraints. In Fifth European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU '99), pages 233–244, 1999.
[64] O. Lassila and R. R. Swick. Resource description framework (RDF) model and syntax specification. W3C recommendation. http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/, 22 February 1999.
[65] C.-J. Liau. A modal logic framework for multi-agent belief fusion. ACM Trans. Comput. Logic, 6(1):124–174, 2005.
[66] D. Lin. An information-theoretic definition of similarity. In Proc. 15th International Conf. on Machine Learning, pages 296–304. Morgan Kaufmann, San Francisco, CA, 1998.
[67] J. Lin. Integration of weighted knowledge bases. Artif. Intell., 83(2):363–378, 1996.
[68] J. Lloyd. Foundations of Logic Programming. Springer Verlag, 1987.
[69] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble. Investigating semantic similarity measures across the Gene Ontology: The relationship between sequence and annotation. Bioinformatics, 19(10):1275–1283, 2003.
[70] J. Lu, G. Moerkotte, J. Schue, and V. S. Subrahmanian. Efficient maintenance of materialized mediated views. In Proc. 1995 ACM SIGMOD Conf. on Management of Data, San Jose, CA, 1995.
[71] S. Luke, L. Spector, D. Rager, and J. Hendler. Ontology-based Web agents. In Proceedings of the First International Conference on Autonomous Agents, pages 59–66, New York, NY, 1997. Association of Computing Machinery.
[72] A. Maedche and S. Staab. Learning ontologies for the Semantic Web. In SemWeb, 2001.
[73] D. Maluf and G. Wiederhold. Abstraction of representation for interoperation. Lecture Notes in AI, 1315, 1997.
[74] F. Manola and E. Miller. RDF primer. W3C recommendation. http://www.w3.org/TR/rdf-primer/, 10 February 2004.
[75] D. L. McGuinness, R. Fikes, J. Rice, and S. Wilder. An environment for merging and testing large ontologies. In Proc. 7th Int. Conf. on Principles of Knowledge Representation and Reasoning, 2000.
[76] http://msdn.microsoft.com/library, 2001.
[77] G. A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, November 1995.
[78] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4):235–244, 1990.
[79] M. Minsky. A framework for representing knowledge. The Psychology of Computer Vision, 1975.
[80] P. Mitra, G. Wiederhold, and J. Jannink. Semi-automatic integration of knowledge sources. In Proc. of the 2nd Int. Conf. on Information FUSION '99, 1999.
[81] P. Mitra, G. Wiederhold, and M. Kersten. A graph-oriented model for articulation of ontology interdependencies. In Proceedings Conference on Extending Database Technology 2000 (EDBT 2000), Konstanz, Germany, 2000.
[82] A. Monge and C. Elkan. The field-matching problem: algorithm and applications. In Proc. of the Second International Conference on Knowledge Discovery and Data Mining, 1996.
[83] National Library of Medicine. Medical subject headings fact sheet. http://www.nlm.nih.gov/pubs/factsheets/mesh.html.
[84] N. F. Noy and M. A. Musen. PROMPT: Algorithm and tool for automated ontology merging and alignment. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), 2000.
[85] http://technet.oracle.com/products/text/content.html, 2001.
[86] Parag and P. Domingos. Multi-relational record linkage. In KDD-2004 Workshop on Multi-Relational Data Mining, 2004.
[87] H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In Advances in Neural Information Processing (NIPS), 2002.
[88] P. F. Patel-Schneider, P. Hayes, and I. Horrocks. OWL Web ontology language semantics and abstract syntax. W3C recommendation. http://www.w3.org/TR/owl-semantics/, 10 February 2004.
[89] E. Prud'hommeaux and A. Seaborne. SPARQL query language for RDF. W3C candidate recommendation. http://www.w3.org/TR/rdf-sparql-query/, 6 April 2006.
[90] M. R. Quillian. Word concepts: A theory and simulation of some basic semantic capabilities. Behavioral Science, 12:410–430, 1967.
[91] R. Rada and E. Bicknell. Ranking documents with a thesaurus. JASIS, 40(5):304–310, 1989.
[92] R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1):17–30, 1989.
[93] RDQL - a query language for RDF. W3C member submission. http://www.w3.org/Submission/RDQL/, 9 Jan 2004.
[94] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In IJCAI, pages 448–453, 1995.
[95] P. Schwarz, Y. Deng, and J. E. Rice. Finding similar objects using a taxonomy: A pragmatic approach. Draft, 2006.
[96] SIGMOD Record in XML. Available at http://www.acm.org/sigmod/record/xml/, Nov 2002.
[97] S. Staab, M. Erdmann, A. Maedche, and S. Decker. An extensible approach for modeling ontologies in RDF(S). In Proceedings of the ECDL 2000 Workshop on the Semantic Web, 2000.
[98] G. Stumme and A. Maedche. FCA-MERGE: Bottom-up merging of ontologies. In IJCAI, pages 225–234, 2001.
[99] V. S. Subrahmanian. Amalgamating knowledge bases. ACM Trans. Database Syst., 19(2):291–331, 1994.
[100] A. Theobald and G. Weikum. The index-based XXL search engine for querying XML data with relevance ranking. In EDBT, pages 477–495, Prague, Czech Republic, 2002.
[101] A. Tomasic, L. Raschid, and P. Valduriez. Scaling access to heterogeneous data sources with DISCO. IEEE TKDE, 1998.
[102] O. Udrea, Y. Deng, E. Hung, and V. S. Subrahmanian. Probabilistic ontologies and relational databases. In Proceedings of the Fourth International Conference on Ontologies, Databases and Applications of Semantics (ODBASE 2005), Agia Napa, Cyprus, 2005.
[103] O. Udrea, Y. Deng, E. Ruckhaus, and V. S. Subrahmanian. A graph theoretical foundation for integrating RDF ontologies. In AAAI, Pittsburgh, USA, July 2005.
[104] O. Udrea, D. R. Recupero, and V. S. Subrahmanian. Annotated RDF. In European Semantic Web Conference (ESWC 2006), to appear, 2006.
[105] United States Patent and Trademark Office. Patent classification home page. http://www.uspto.gov/web/patents/classification/.
[106] F. van Harmelen, I. Horrocks, and P. F. Patel-Schneider. A model-theoretic semantics for DAML+OIL (March 2001). W3C note. http://www.w3.org/TR/daml+oil-model, 18 December 2001.
[107] V. G. Voiskunskii. Evaluation of search results: A new approach. Journal of the American Society for Information Science, 48(2), Feb 1997.
[108] R. Volz, D. Oberle, and R. Studer. Towards views in the Semantic Web. In The 2nd Int'l Workshop on Databases, Documents and Information Fusion (DBFUSION02), Karlsruhe, Germany, 2002.
[109] H. Wang, F. Azuaje, O. Bodenreider, and J. Dopazo. Gene expression correlation and Gene Ontology-based similarity: An assessment of quantitative relationships. In The 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB-2004), 2004.
[110] G. Wiederhold. Mediators in the architecture of future information systems. IEEE Computer, pages 38–49, Mar 1992.
[111] G. Wiederhold. Intelligent integration of information. In Proc. 1993 ACM SIGMOD Conf. on Management of Data, pages 434–437, 1993.
[112] G. Wiederhold. Interoperation, mediation and ontologies. In International Symp. on Fifth Generation Computer Systems, Workshop on Heterogeneous Cooperative Knowledge Bases, ICOT, pages 33–48, 1994.
[113] World Intellectual Property Organization. International patent classification (sixth edition). http://www.wipo.int/classifications.
[114] Z. Wu and M. Palmer. Verb semantics and lexical selection. In 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138, New Mexico State University, Las Cruces, New Mexico, 1994.
[115] Apache Xindice XML database. Available at http://xml.apache.org/xindice/.
[116] G. Zhou, R. Hull, R. King, and J. Franchitti. Using object matching and materialization to integrate heterogeneous databases. In Proc. of 3rd Intl. Conf. on Cooperative Information Systems (CoopIS 95), Vienna, Austria, 1995.