The seven constructs comprise the minimum components required for any SP2P system, and they can be used for assessing SP2P system conformance. In the following, we describe each of these model constructs. Table 2 provides brief definitions of some key model abstractions for quick reference. The model constructs are represented as UML classes and diagrams; this enables a transparent translation of the model constructs into code using high-level programming languages such as Java or C++. In [35] we translated the UML class diagrams into Java classes and packages, and various SP2P systems have been simulated.
Table 2. Definitions of some of the SP2P model concepts

A peer: Represents an active object or an entity in the network.
Autonomous joining: A process in which a peer autonomously selects which other peers it is going to connect with.
Mapping: The semantic relationship between concepts from independent information sources (ontologies).
Network degree: The constraint on the number of relations a peer could make.
Peer discovery: The process or protocol by which peers discover their acquaintances (peers with similar profiles).
Peer neighbors: The references a peer possesses to other peers in the network.
Peer profile: The description of a peer's domain knowledge, the information content offered by the peer, a description of the peer's schema, or the expertise and services the peer provides.
Semantic neighborhood: Connected peers with compatible information resources.
Similarity function: A function to measure the strength of the relation (the semantic affinity) between any two profiles.

5.1 Peers

A Peer P = <id, r, o, n> consists of a unique identifier id, a set of resources r, a
profile o and a set of neighbors n, i.e., references to other peers in the network. Examples of the resources that a peer could manage include sets of document files, bibliography files or learning objects (video, audio, images, etc.) that peers are willing to exchange. The profile is the description of the peer's domain knowledge, expertise or services, and it is used in building the semantic neighborhood. For example, a subset of the data model's key concepts could comprise a peer's profile. Figure 5 is a class view of the Peer construct for the proposed SP2P reference architecture.
Fig. 5. Peer construct
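The constructs are specified as UML classes, and [35] reports a translation into Java. As an illustration only, a sketch of the Peer construct with its identifier, resources, profile and neighbors could look as follows; the field and method names are assumptions, not the code of [35].

```java
import java.util.*;

// Illustrative sketch of the Peer construct: a unique identifier, the managed
// resources, a profile, and references to neighboring peers.
public class Peer {
    private final String id;                 // unique peer identifier
    private final List<String> resources;    // e.g. documents, media objects
    private final Set<String> profile;       // key concepts describing the peer's expertise
    private final List<Peer> neighbors;      // references to acquainted peers

    public Peer(String id, List<String> resources, Set<String> profile) {
        this.id = id;
        this.resources = new ArrayList<>(resources);
        this.profile = new HashSet<>(profile);
        this.neighbors = new ArrayList<>();
    }

    // Autonomous joining: the peer itself decides which peers to connect to.
    public void addNeighbor(Peer p) { neighbors.add(p); }

    public String getId() { return id; }
    public Set<String> getProfile() { return Collections.unmodifiableSet(profile); }
    public List<Peer> getNeighbors() { return Collections.unmodifiableList(neighbors); }
}
```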
5.2 Resources

The Resources r represent the data a peer manages: sets of document files or media objects that peers need to exchange. Peers could have their data represented in different data models; examples of data models include relational tables, XML Schema, the RDF data model, and OWL files. Meta-data, such as an ontology namespace or a peer's knowledge about other peers' resources, is a reference to other external resources available on the network (see [8] for more information on references to external resources). In contrast to conventional P2P networks, resources in SP2P are neither replicated nor assigned to other peers in the network in order to be used by network peers for processing queries; in cases where SP2P is used for file sharing, this characteristic might not hold. The choice of data model is important, and SP2P systems can be differentiated from each other based on the choice of the data model. This is due to the following two features of highly structured data: 1) Support for semantics: the choice of data model determines data semantic transparency. Semantic transparency in turn enables automatic machine processing of data as well as improving query result precision. 2) Support for inferences: the choice of data model determines the extent of the system's ability to answer queries. For example, data models such as RDF and OWL support knowledge inferences. Systems with these types of data models are able to answer queries whose information is not explicitly stored in their repository; this might be difficult for systems with different data models. Figure 6 is a class view of the Resource construct for the proposed SP2P reference architecture.
Fig. 6. Resource construct
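A sketch of the Resource construct can make the role of the data model explicit. The DataModel values simply enumerate the examples named in the text; all names are illustrative assumptions.

```java
import java.util.*;

// Illustrative sketch of the Resource construct: data instances plus the data
// model they are expressed in and meta-data (e.g. ontology namespaces).
public class Resource {
    // Data models named in the text; RDF and OWL support inference.
    public enum DataModel { RELATIONAL, XML_SCHEMA, RDF, OWL }

    private final DataModel dataModel;
    private final List<String> dataInstances;     // documents, media objects, ...
    private final Map<String, String> metaData;   // e.g. namespace -> external reference

    public Resource(DataModel dataModel, List<String> dataInstances, Map<String, String> metaData) {
        this.dataModel = dataModel;
        this.dataInstances = new ArrayList<>(dataInstances);
        this.metaData = new HashMap<>(metaData);
    }

    // Whether the chosen data model supports knowledge inference (RDF, OWL).
    public boolean supportsInference() {
        return dataModel == DataModel.RDF || dataModel == DataModel.OWL;
    }
}
```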
5.3 Query Formulator
The Query Formulator qf = <sc, cq, pq, q, l>, often a graphical user interface component, is a separate component on top of the resource layer for constructing queries. Peers use their own query formulator to select concepts, sc, from the local resource repositories, compose the query content, cq, and place queries, pq, on the neighboring peers n.
Fig. 7. Query formulator construct
Query objects, q, differ based on the system's support for the query's explicit semantics, i.e. the peer's data model (see Subsection 5.2). For example, the query content could incorporate references to local or global ontologies for supporting query concept meanings, or, when a tree-like data representation is used as a resource (e.g. the XML format), a query concept could be substituted by a tree path. A tree path refers to the concept, its ancestors, and its descendant concepts. Another important aspect relevant to the query formulation module is the query language, l. The choice of the query language restricts the explicit semantics of the query content. Figure 7 represents a class view of the Query Formulator construct for the proposed SP2P reference architecture.
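A corresponding sketch of the Query Formulator qf = <sc, cq, pq, q, l>, with assumed method names for selecting concepts (sc), composing the query content (cq) and placing the query on neighbors (pq); the query language l is kept as a plain string.

```java
import java.util.*;

// Illustrative sketch of the Query Formulator qf = <sc, cq, pq, q, l>.
public class QueryFormulator {
    private final String queryLanguage;   // l: restricts the explicit semantics of the query

    public QueryFormulator(String queryLanguage) { this.queryLanguage = queryLanguage; }

    // sc: select concepts from the local resource repository.
    public List<String> selectConcepts(Collection<String> localConcepts, Set<String> wanted) {
        List<String> selected = new ArrayList<>(localConcepts);
        selected.retainAll(wanted);
        return selected;
    }

    // cq: compose the query content from the selected concepts.
    public String composeQuery(List<String> concepts) {
        return String.join(" AND ", concepts);   // placeholder for a real query in language l
    }

    // pq: place the query on the neighboring peers (here a callback per neighbor id).
    public void placeQuery(String query, List<String> neighborIds,
                           java.util.function.BiConsumer<String, String> send) {
        for (String n : neighborIds) send.accept(n, query);
    }
}
```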
5.4 Semantic Neighborhood
Discovering and grouping together peers with compatible semantic information, i.e. forming the semantic neighborhood sn, is a central task of any SP2P system. As reflected in Table 2 and Table 3, it involves autonomous joining, a peer discovery process, a constraint on the network degree, and a similarity function that measures the semantic affinity between peer profiles. Figure 8 is a class view of the Semantic Neighborhood construct for the proposed SP2P reference architecture.
Fig. 8. Semantic neighborhood construct
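Table 2 only requires that some similarity function measures the semantic affinity between two profiles. The sketch below uses a Jaccard overlap between profile concept sets as one possible, assumed choice; the reference model does not prescribe this particular function.

```java
import java.util.*;

// Illustrative similarity function over peer profiles (sets of concepts).
// Jaccard overlap is one possible choice; the reference model only requires
// some function measuring the semantic affinity between two profiles.
public class ProfileSimilarity {

    public static double jaccard(Set<String> profileA, Set<String> profileB) {
        if (profileA.isEmpty() && profileB.isEmpty()) return 0.0;
        Set<String> inter = new HashSet<>(profileA);
        inter.retainAll(profileB);
        Set<String> union = new HashSet<>(profileA);
        union.addAll(profileB);
        return (double) inter.size() / union.size();
    }

    // Peers whose profile similarity exceeds a threshold become semantic neighbors.
    public static boolean semanticallyCompatible(Set<String> a, Set<String> b, double threshold) {
        return jaccard(a, b) >= threshold;
    }

    public static void main(String[] args) {
        Set<String> p1 = Set.of("opera", "vocal", "classical");
        Set<String> p2 = Set.of("opera", "classical", "instrumental");
        System.out.println(semanticallyCompatible(p1, p2, 0.4)); // true: overlap 2/4 = 0.5
    }
}
```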
5.5 Routing
Routing t determines how queries travel through the network. As reflected in Table 3, it involves a query forwarding policy, a policy for handling query cycles (repeated queries), and a query termination policy. Cycle handling matters because, besides producing redundant query
answers, processing repeated queries increases the number of query messages a system would exchange. Query termination policy (tt): Another important matter in routing queries in SP2P systems is when query forwarding should stop. Current common techniques for stopping query forwarding depend on either counting the number of hops or setting the query time-to-live (TTL). Using the hop-counting approach, a system administrator sets the length of the network path that a query message can traverse before terminating. The TTL approach, on the other hand, is time based: a query message can traverse the network for the period of time specified in the query. As a query message traverses the network, its TTL value decreases; when the TTL value reaches zero, message forwarding stops. Note that these techniques have an impact on the query results that can be obtained. For instance, peers will continue to forward queries to their related neighbors, even when they already have answers to the query, as long as the specified constraints permit; as a result, the number of query results will be affected. Yet another query forwarding termination policy is to use the query content to decide whether the query should be forwarded or not. Under such a policy, queries are not forwarded when their content becomes empty; query content becomes empty as a result of concepts being dropped during the translation process. Based on whether uncomprehended query concepts are dropped during query translation, we divide SP2P systems into two groups, irreducible SP2P systems (IRSP2P) and reducible SP2P systems (RSP2P), defined as follows:
Definition 1. An IRSP2P system is an SP2P system with the property that it does not discard uncomprehended query concepts during query translation and forwarding among multiple peers.
Definition 2. An RSP2P system is an SP2P system with the property that it does discard uncomprehended query concepts during query translation and forwarding among multiple peers.
The Piazza system, for example, belongs to the IRSP2P group, and Chatty Web is an instance of an RSP2P system. Figure 9 represents the Router class and its associated forwarding policy.

Fig. 9. Routing construct
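The termination policies above can be expressed as a single check performed before forwarding a query. The Query record and its fields in this sketch are illustrative assumptions.

```java
import java.util.Set;

// Illustrative check of the query termination policies discussed in the text:
// hop counting, time-to-live (TTL), and empty query content (for RSP2P systems
// that drop uncomprehended concepts during translation).
public class TerminationPolicy {

    public static class Query {
        int hops;                  // hops already traversed
        int maxHops;               // limit set by the system administrator
        long expiresAtMillis;      // TTL expressed as an absolute expiry time
        Set<String> content;       // remaining query concepts after translations
    }

    public static boolean shouldForward(Query q) {
        if (q.hops >= q.maxHops) return false;                              // hop-counting policy
        if (System.currentTimeMillis() >= q.expiresAtMillis) return false;  // TTL policy
        if (q.content.isEmpty()) return false;                              // content-based policy (RSP2P)
        return true;
    }
}
```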
5.6 Query Answerer
The Query answerer (Qa) component produces answers to incoming queries and evaluates them. As reflected in Table 3, it covers answer evaluation, answer selection, the way answer correctness is determined, precision and recall measurement, answer integration, and answer arrival.
The way answer evaluations are determined, qd, could be automatic or manual. In manual query answer determination, system users decide on the correctness (or incorrectness) of a query answer. Automatic query answer determination refers to the peer's ability to conclude the query answer's correctness (or incorrectness) itself. In the latter case, the system designer needs to define a set of criteria that empower the SP2P system to decide on the correctness (or incorrectness) of query answers; an example of such a measurement is calculating the semantic relation between query answer concepts and query concepts. The SP2P system's answer correctness is evaluated using the common precision and/or recall metrics. Answer selection (AS) concerns which of the received answers a peer retains and returns to the user. Figure 10 is a class view of the Query Answerer construct for the proposed SP2P reference architecture.

Fig. 10. Query answerer construct
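Since answer correctness is evaluated with the usual precision and recall metrics, a small generic helper (not taken from any of the surveyed systems) suffices to illustrate the computation.

```java
import java.util.*;

// Standard precision/recall over a set of returned answers and the set of
// answers known to be correct (relevant) for the query.
public class AnswerEvaluation {

    public static double precision(Set<String> returned, Set<String> relevant) {
        if (returned.isEmpty()) return 0.0;
        Set<String> correct = new HashSet<>(returned);
        correct.retainAll(relevant);
        return (double) correct.size() / returned.size();
    }

    public static double recall(Set<String> returned, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        Set<String> correct = new HashSet<>(returned);
        correct.retainAll(relevant);
        return (double) correct.size() / relevant.size();
    }
}
```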
5.7 Mappings
The semantic mapping m = <me, mi, mc, mw, mm> refers to the semantic relationship between concepts from independent information sources (ontologies). It is a fundamental design building block for any SP2P system, and a topic undergoing
in-depth research. Using semantic mapping in SP2P systems involves decision making on various issues, including mapping expressiveness, me, mapping implementation, mi, mapping correctness, mc, mapping ownership, mw, and mapping maintenance, mm. Below, a short description of each of these mapping constructs is given. At the beginning of the mapping procedure, a check is required to determine whether the query should simply be passed on or needs to be translated into the neighboring ontology's concepts. Further, SP2P systems may support query reformulation, i.e. splitting the query, or completely reordering the query into a totally different but equivalent query. In such cases the mapping component needs an additional query evaluator procedure to support query evaluation and reformulation. Mapping expressiveness (me): Semantic mapping in its simplest form could be just a matter of finding query concept synonyms among different ontologies. In more expressive mappings, logical relations are used for finding relationships among concepts, concept properties and attributes. The set of logical relations commonly used to define relationships among the peers' ontology concepts is {≡, ⊒, ⊑, ∗, ⊥}. In this case, c1 ≡ c2 means that the two concepts are synonyms; in other words, c1 and c2 are different concepts with similar or identical meanings and are interchangeable. For example, notebook and laptop are synonymous concepts. The relation c1 ⊒ c2 means that c1 is a hypernym of c2, that is, c1 is more generic or broad than c2. For example, the system software concept is more generic or broad than the operating system concept. The relation c1 ⊑ c2 means that c1 has a hyponymous relationship with c2, i.e. c2 is more generic or broad than c1. For example, the book concept is less generic than the publication concept. The relation ⊥ means that two concepts have no semantic relation with each other. For example, bank as a financial institution and bank as a river-bank have no semantic relation. Any other relation between concepts, other than those described above, is captured by the ∗ relation.
Mapping expressions have an effect on the extent of query results. They could increase or decrease the extent of query results based on the permissible logical expressions of the mappings. Systems demanding exact mappings could relax some of their constraints (i.e. allow less restricted mapping logics) to increase query recall. For example, let us assume that the University concept from one ontology and an Educational Institute concept from a second ontology are synonymous, i.e. University ≡ Educational Institute, and that the mapping operation returns a value equal to 1, map(University, Educational Institute)=1.0. This assumption is valid since both concepts can be mapped to a common concept, Institute. Now let us consider the following query being posed on either ontology. Query:
list the name of all Research Institutes in the area.
The restricted query result will be null, since no semantic relationship between Research Institute and University or Educational Institute can be asserted. However, if we relax the synonymous relationship between University and Educational Institute to a weaker related relationship, i.e. the relationship between University and Educational Institute has been defined and the operation map(University, Educational Institute)=0.25, the result of the previous query will be a set of University names, some of which might carry out research. This is because the relationship between Research Institute and Educational Institute can now be asserted: both Research Institute and Educational Institute are Institutes. Information on the mapping operation's numerical result can be found in [37]. Mapping implementation (mi): How the mapping is carried out is an important design issue. Peers could use a local copy of a thesaurus such as WordNet, build their own dictionaries, construct mapping tables or, when feasible, exchange ontologies (schemas) to translate concepts between ontologies. The choice of the approach to carry out the mapping is affected by the scope of the application. For small and domain-specific applications, peers could exchange local ontologies or build their own local dictionaries for translation. Larger and semantically acute applications, on the other hand, may require local thesauruses which are capable of performing some inference rather than just the simple concept-to-concept mappings associated with local dictionaries and tables. Mappings could be carried out automatically, semi-automatically or manually. Mapping correctness measurement (mc): Correct semantic mapping is fundamental to SP2P systems. Various research efforts have been devoted to the classification of possible types of faults, measuring the quality of the mapping and estimating the information loss during query translation. The correctness of mapping is measured in two different ways: numerical and logical measurement. Numerical measurement pertains to the numerical values returned from the mapping tool. For example, a mapping operation could conclude that the semantic relationship between a Laptop concept and a Notebook concept is equal to 1.0: map(c1,c2)=1.0, and the semantic relationship between an Operating System concept and a Software concept is equal to 0.5: map(c3,c4)=0.5, or some other
values. A detailed example related to the use of numerical values in the mapping operation can be found in [37]. If the numerical value returned from a mapping operation is ≥ δ (a threshold), the mapping is considered to be correct. The numerical values associated with semantic relationships between ontology concepts are a system designer decision. For example, when a concordance table is used by an SP2P system for mapping, the values assigned to the relationships between any two concepts in the table will be declared and later used in the mapping process. The logical measurement, on the other hand, concerns the logical relationships that have been concluded during the mapping operation, that is, whether or not the relationship between two concepts satisfies one of the logical relations {≡, ⊒, ⊑, ∗, ⊥}. For example, the logical relationship between publication and book is ⊒. The two methods could be modified such that the logical relation could return numerical values and vice versa. Mapping ownership (mw): An important decision that SP2P system designers have to make is who (i.e., the sender or the receiver peer) is going to carry out the mapping, that is, whether query translation takes place before sending the query or after receiving it. This is important because it has an effect on query routing: if the querying peer performs the mapping first, it can submit the query only to semantically related peers (i.e. those for which the outcome of the mapping is above a certain threshold). This constraint can be used as a strategy for terminating query forwarding. If, instead, the receiving peer performs the mapping after receiving a query, then any query can be posed to any peer (i.e., there is no restriction on query forwarding); query-receiving peers either answer queries (i.e., if they can translate them to their local representation), or forward them to some other peers. Mapping maintenance (mm): Recently, several studies have focused on the mapping maintenance issue and its effects on SP2P systems' reliability [4,11,37,38]. These studies have concluded that mappings between different ontologies (schemas) need maintenance, because mappings can become outdated as a result of ontology changes, and outdated mappings put the entire system at risk of failure. Hence, there is a need for 1. semantic mapping maintenance, 2. mapping corruption detection, and 3. mapping corruption tolerance: mapping maintenance is needed to prevent mappings from becoming corrupted, corruption detection is required so that corrupted mappings can be fixed, and mapping corruption tolerance is necessary in order to limit the damage that mapping corruption may do to the system. Figure 11 is a class view of the Mapping construct for the proposed SP2P reference architecture. In Figure 12 all model constructs are put together into the reference model class diagram. The class diagram shows the dependencies between model components. The model diagram encompasses, among others, four enumeration classes. The SP2P implementer may choose one or more of the enumerated options and ignore the others. For example, an SP2P implementer may decide to implement only "hopCounting" and not "TTL" or "destinationArrival" to stop query forwarding. The sequence of interaction between model components during query processing is represented in Figure 13.
Fig. 11. Mapping construct
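The concordance-table approach with the threshold δ can be sketched as a simple lookup. The class, method names and the example values below (1.0 for the Laptop/Notebook synonyms, 0.25 for the relaxed University/Educational Institute relation) follow the examples in the text, but the code itself is only an illustrative assumption.

```java
import java.util.*;

// Illustrative concordance-table mapping with a numerical correctness check:
// the value declared for a concept pair is compared against the threshold delta.
public class ConcordanceMapping {
    private final Map<String, Double> table = new HashMap<>();
    private final double delta;   // correctness threshold chosen by the system designer

    public ConcordanceMapping(double delta) { this.delta = delta; }

    public void declare(String c1, String c2, double value) {
        table.put(c1 + "->" + c2, value);
    }

    // map(c1, c2): numerical strength of the semantic relation, 0 if undeclared.
    public double map(String c1, String c2) {
        return table.getOrDefault(c1 + "->" + c2, 0.0);
    }

    public boolean isCorrect(String c1, String c2) {
        return map(c1, c2) >= delta;
    }

    public static void main(String[] args) {
        ConcordanceMapping m = new ConcordanceMapping(0.5);
        m.declare("Laptop", "Notebook", 1.0);                    // synonyms, as in the text
        m.declare("University", "Educational Institute", 0.25);  // relaxed, related concepts
        System.out.println(m.isCorrect("Laptop", "Notebook"));                   // true
        System.out.println(m.isCorrect("University", "Educational Institute"));  // false at delta = 0.5
    }
}
```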
Fig. 12. SP2P class model
Fig. 13. The sequence of interaction between model components during query processing
6 Model Applicability and Validation
In order to show the model's applicability, the system architectures described in the paper (i.e., the KEx, P2PSLN, Piazza, and Chatty Web system architectures) are mapped onto the reference model. Individual systems are checked on whether they comply with the generic model, and on how they implement the generic features. Table 3 illustrates that the described state-of-the-art systems possess the model's components; however, they differ in how they implement the component properties and component relations. The table also shows the comparative advantages (and disadvantages) that the systems have over one another in relation to the model. Furthermore, Table 3 shows that the identified components, features and properties of the reference model exist in the Edutella system as well. The Edutella system was not used for deriving the reference model components, yet it has the components and functionality needed in SP2P systems. In other words, Edutella illustrates that the described reference model is valid and rigorous.
Table 3. Concepts and relationships manifested in five SP2P systems. The table compares the KEx, P2PSLN, Piazza, Chatty Web and Edutella systems along the seven model constructs and their properties: Peer (id, neighbors, profile), Resources (data instance, data model, meta-data), Query formulator (select concepts, compose queries, place queries, query language), Semantic neighborhood (autonomous joining, peer discovery, network degree, similarity function), Routing (query forwarding, cycle handling, query termination), Mapping (expressiveness, implementation, correctness, ownership, maintenance) and Query answerer (evaluation, selection, determination, precision, recall, integration, arrival).
7 Conclusion
In this research work, we have identified that current SP2P research efforts have generated many and diverse realizations and architectures. This diversity in implementations and architectures has led to ambiguity and incompatibility in defining SP2P abstractions and concepts. We have described a reference model for SP2P systems in an effort to model this emerging decentralized computing paradigm at a generic and high level of abstraction. The model contributes to the advancement of current SP2P networks in different ways. The minimum necessary constructs to build SP2P networks and their related design issues have been identified and defined; this enables researchers and network architects to focus on the core components of SP2P systems and their related design issues. The reference model can also reduce the conceptual ambiguity of the semantics and meanings of SP2P system constructs. Furthermore, the model helps in building new systems and simulations seamlessly; we were able to transform the model diagrams into an implementation and simulate different SP2P systems (Chatty Web, Piazza, P2PSLN). The simulations show that SP2P systems built based on the reference model are easy to change and modify. The simulation and results are presented in [35]. The described reference model is an essential step toward building a comprehensive API for SP2P networks. We consider building such an API to be important future work. Combining prominent features from different SP2P systems to come up with new systems is yet another promising direction. Lastly, a comprehensive model validation is a viable future work: currently, the model validation is limited to one system, the Edutella system. The model validation can be extended by showing how the various concepts and relationships manifested in other systems can be mapped to the model's features, properties and associations.
References 1. Aberer, K., Alima, L.O., Ghodsi, A., Girdzijauskas, S., Haridi, S., Hauswirth, M.: The essence of P2P: a reference architecture for overlay networks. In: Proc. Fifth IEEE International Conference on P2P Computing, pp. 11–20 (2005) 2. Aberer, K., Cudr´e-Mauroux, P., Hauswirth, M., Van Pelt, T.: GridVine: Building Internet-Scale Semantic Overlay Networks. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 107–121. Springer, Heidelberg (2004) 3. Aberer, K., Cudr´e-Mauroux, P., Hauswirth, M.: Start making sense: The Chatty Web approach for global semantic agreements. Journal of Web Semantics 1(1), 89–114 (2003) 4. An, Y., Borgida, A., Mylopulos, J.: Discovery and maintaining Semantic Mappings between XML Schemas and Ontologies. Journal of Computing Science and Engineering 2(1), 44–73 (2008) 5. Bianchini, D., De Antonellis, V., Melchiori, M., Salvi, D., Bianchini, D.: Peerto-peer semantic-based web service discovery: state of the art, Technical Report, Dipartimento di Elettronica per l’Automazione Universit` a di Brescia (2006)
6. Bonifacio, M., Bouquet, P., Mameli, G., Nori, M.: Peer-mediated distributed knowledge management. In: van Elst, L., Dignum, V., Abecker, A. (eds.) AMKM 2003. LNCS (LNAI), vol. 2926, pp. 31–47. Springer, Heidelberg (2004) 7. Bouquet, P., Don` a, A., Scrafini, L., Zanobini, S.: ConTcXtualizcd Local Ontology Specification via CTXML. In: AAAI Workshop on Meaning Negotiation, pp. 64–72 (2002) 8. Castano, S., Ferrara, A., Montanelli, S.: H-Match: an Algorithm for Dynamically Matching Ontologies in Peer-based Systems. In: The 1st VLDB Int. Workshop on Semantic Web and Databases (SWDB), pp. 231–250 (2003) 9. Castano, S., Montanelli, S.: Enforcing a Semantic Routing Mechanism based on Peer Context Matching. In: Proc. of the 2nd Int. ECAI Workshop on Contexts and Ontologies: Theory, Practice and Applications (2006) 10. Choi, N., Song, I., Han, H.: A survey on ontology mapping. SIGMOD Rec. 35(3), 34–41 (2006) 11. Colazzo, D., Sartiani, C.: Mapping Maintenance in XML P2P Databases. In: Bierman, G., Koch, C. (eds.) DBPL 2005. LNCS, vol. 3774, pp. 74–89. Springer, Heidelberg (2005) 12. Dabek, F., Brunskill, E., Kaashoek, M.F., Karger, D.: Building peer-to-peer systems with Chord, a distributed lookup service. In: Proc. 8th Wshop. Hot Topics in Operating Syst. (HOTOS-VIII) (May 2001) 13. Dabek, F., Zhao, B., Druschel, P., Kubiatowicz, J., Stoica, I.: Towards a Common API for Structured Peer-to-Peer Overlays. In: Kaashoek, M.F., Stoica, I. (eds.) IPTPS 2003. LNCS, vol. 2735. Springer, Heidelberg (2003) 14. Doulkeridis, C., Vlachou, A., Nørv˚ ag, K., Kotidis, Y., Vazirgiannis, M.: Efficient search based on content similarity over self-organizing P2P networks. Peer-to-Peer Networking and Applications Journal (2010), doi:10.1007/s12083-009-0058-2 15. Ehrig, M.: Ontology alignment: bridging the semantic gap. Springer publishing, Heidelberg (2007) 16. Euzenat, J., Shvaiko, P.: Ontology matching. Springer publishing, Heidelberg (2007) 17. Fergus, P., Mingkhwan, A., Merabti, M., Hanneghan, M.: Distributed emergent semantics in P2P networks. In: Proc. of the Second IASTED International Conference on Information and Knowledge Sharing, pp. 75-82 (2003) 18. Franconi, E., Kuper, G., Lopatenko, A., Zaihrayeu, I.: Queries and updates in the coDB peer to peer database system. In: Proc. of 30th International Conference on Very Large Databases VLDB 2004, pp. 1277–1280 (2004) 19. Freenet, http://www.freenetproject.org 20. Guarino, N.: Formal ontology and information systems. In: Proc. of Formal Ontology in Information Systems, pp. 3–15 (1998) 21. Gruber, T.R.: The Role of Common Ontology in Achieving Sharable, Reusable Knowledge Bases. In: Proc. of the 2nd International Conference on Principles of Knowledge Representation and Reasoning, pp. 601–602 (1991) 22. Gnutella, http://www.gnutella.wego.com 23. Haase, P., Broekstra, J., Ehrig, M., Menken, M.R., Mika, P., Olko, M., Plechawski, M., Pyszlak, P., Schnizler, B., Siebes, R., Staab, S., Tempich, C.: Bibster – A Semantics-based Bibliographic Peer-to-Peer System. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 122–136. Springer, Heidelberg (2004)
24. Haase, P., Siebes, R., van Harmelen, F.: Peer Selection in Peer-to-Peer Networks with Semantic Topologies. In: Bouzeghoub, M., Goble, C.A., Kashyap, V., Spaccapietra, S. (eds.) ICSNW 2004. LNCS, vol. 3226, pp. 108–125. Springer, Heidelberg (2004) 25. Hai, Z., Liu, J., Feng, L., Sun, X., He, C.: Query Routing in a Peer-to-Peer Semantic Link Network. Computational Intelligence 21(2), 197–216 (2005) 26. Halevy, A., Ives, Z., Mork, P., Tatarinov, I.: Piazza: Mediation and integration infrastructure for semantic web data. Web Semantics: Science, Services and Agents on the World Wide Web 2(1), 155–175 (2004) 27. http://www.cnscenter.future.co.kr/hot-topic/p2p.html 28. https://www.jxta.dev.java.net/ 29. http://www.w3.org/RDF/ 30. Joseph, S.: Neurogrid: Semantically Routing Queries in Peer-to-Peer Networks. In: Proc. Intl. Workshop on Peer-to-Peer Computing (2002) 31. Kang, K.: Feature-Oriented Domain Analysis. Technical Report No. CMU/SEI-90TR-21, Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA (1990) 32. Kementsietsidis, A., Arenas, M., Miller, R.: Managing Data Mappings in the Hyperion Project. In: Proc.of the 19th Intl. Conf. on Data Engineering, pp. 732–734 (2003) 33. Liu, L., Xu, J., Russell, D., Antonopoulos, N.: Self-Organization of Autonomous Peers with Human Strategies. In: Proc. of ICIW 2008, pp. 348–357 (2008) 34. L¨ oser, A., Staab, S., Tempich, C.: Semantic social overlay networks. IEEE Journal on Selected Areas in Communications 25(1), 5–14 (2007) 35. Mawlood-Yunis, A.-R., Weiss, M., Santoro, N.: From P2P to Reliable Semantic P2P Systems. Peer-to-Peer Networking and Applications Journal, special issue, 1–19 (2010), doi:10.1007/s12083-009-0066-2 36. Mawlood-Yunis, A.-R., Weiss, M., Santoro, N.: Reference Model for Semantic Peerto-Peer Networks. In: Babin, G., Kropf, P., Weiss, M. (eds.) E-Technologies: Innovation in an Open World. Lecture Notes in Business Information Processing, vol. 26, pp. 319–334. Springer, Heidelberg (2009) 37. Mawlood-Yunis, A.-R., Weiss, M., Santoro, N.: Fault-Tolerant Emergent Semantics in P2P Networks. In: Cardoso, J., Lytras, M. (eds.) Semantic Web Engineering in the Knowledge Society, pp. 161–187. IGI Global (2008) 38. Mccann, R., Alshebli, B., Le, Q., Nguyen, H., Vu, L., Doan, A.: Mapping maintenance for data integration systems. In: Proc. of the 31st International Conference on VLDB, pp. 1018–1029 (2005) 39. McDougall, P.: The power of peer-to-peer. Information week (August 28, 2000), http://www.informationWeek.com 40. Mena, E., Illarramendi, A., Kashyap, V., Sheth, A.P.: OBSERVER: an approach for query processing in global information systems based on interpretation across pre-existing ontologies. Distributed and Parallel Databases 8(2), 223–271 (2000) 41. Napster, http://www.napster.com 42. Nejdl, W., Wolf, B., Qu, C., Decker, S., Sintek, M., Naeve, A., Nilsson, M., Palm´er, M., Risch, T.: EDUTELLA: a P2P networking infrastructure based on RDF. In: Proc. of the 11th International Conference on World Wide Web, pp. 604–615 (2002) 43. Ng, W.S., Ooi, B.C., Tan, K.-L., Zhou, A.: PeerDB: a P2P-based system for distributed data sharing. In: Proc. of 19th International Conference on Data Engineering, pp. 633–644 (2003) 44. Oram, A. (ed.): Peer-to-Peer: harnessing the power of Disruptive Technologies. O’Reilly and Associates, Inc. publishing, Sebastopol (2001)
45. Parashar, M., Member, S., Browns, J.C.: Conceptual and Implementation Models for the Grid. Proc. of IEEE Journal 93(3), 653–668 (2005) 46. Rowstron, A., Druschel, P.: Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In: Liu, H. (ed.) Middleware 2001. LNCS, vol. 2218, pp. 329–350. Springer, Heidelberg (2001) 47. Rousset, M., Chatalic, P., Adjiman, P., Chatalic, P., Goasdoue, F., Simon, L.: Somewhere in the Semantic Web. In: Intl. Workshop on Principles and Practice of Semantic Web Reasoning, pp. 84–99 (2006) 48. Schmitz, C., L¨ oser, A.: How to model Semantic Peer-to-Peer Overlays? In: Proc. P2PIR Workshop, Informatik, vol. (1), pp. 12–19 (2006) 49. Staab, S., Stuckenschmidt, S.: Semantic Web and Peer-to-Peer. Springer Publishing, Heidelberg (2006) 50. Tempich, C., Staab, S., Wranik, A.: Distributed semantic query: Remindin’: semantic query routing in peer-to-peer networks based on social metaphors. In: Proc. of the 13th International Conference on World Wide Web, pp. 640–649 (2004) 51. Voulgaris, S., Kermarrec, A., Massouli´e, L., van Steen, M.: Exploring Semantic Proximity in Peer-to-Peer Content search. In: 10th International Workshop on Futeru Trends in Distriubed Computing Systems (2004) 52. Zaihrayeu, I.: Towards Peer-to-Peer Information Management Systems. PhD Dissertation, International Doctorate School in Information and Communication Technologies, DIT - University of Trento (2006)
Discovery of Probabilistic Mappings between Taxonomies: Principles and Experiments

Rémi Tournaire (1), Jean-Marc Petit (2), Marie-Christine Rousset (1), and Alexandre Termier (1)

(1) Université de Grenoble, UJF / Grenoble INP / UPMF / CNRS, LIG UMR 5217, 681 rue de la Passerelle, 38402 St-Martin d'Hères Cedex, France, [email protected]
(2) Université de Lyon, CNRS, INSA-Lyon, LIRIS UMR 5205, 69621 Villeurbanne Cedex, France
Abstract. In this paper, we investigate a principled approach for defining and discovering probabilistic mappings between two taxonomies. First, we compare two ways of modeling probabilistic mappings which are compatible with the logical constraints declared in each taxonomy. Then we describe a generate and test algorithm which minimizes the number of calls to the probability estimator for determining those mappings whose probability exceeds a certain threshold. Finally, we provide an experimental analysis of this approach.
1 Introduction
The decentralized nature of the development of Web data management systems makes inevitable the independent construction of a large amount of personalized taxonomies used for annotating data and resources at the Web scale. Taxonomies are hierarchical structures appropriate for data categorization and semantic annotation of resources. They play a prominent role in the Semantic Web since they are central components of OWL [13] or RDF(S) [33] ontologies. A taxonomy constrains the vocabulary used to express metadata or semantic annotations to be classes that are related by structural relationships. Taxonomies are easy to create and understand by humans while being machine interpretable and processable thanks to a formal logical semantics supporting reasoning capabilities. In this setting, establishing semantic mappings between taxonomies is the key to enabling collaborative exchange of semantic data. In this work we focus on discovering mappings between populated taxonomies like Web Directories or folksonomies used, for example, for organizing musical songs. Web Directories like the Yahoo! Directory (dir.yahoo.com) and the Google Directory (dir.google.com) (see Figure 2) can be considered as taxonomies because they are hierarchical
structures categorizing web pages.

Fig. 1. Two taxonomies, (a) T1 and (b) T2, and their associated instances

Folksonomies like those represented in Figure 1 are small taxonomies created by final users and populated with musical files, for example. Providing mappings between two taxonomies makes possible an exchange of documents (instances) by querying one taxonomy with the vocabulary of the other one. For instance, mappings between the Yahoo! and Google directories allow querying both of them for pages about "Autos" while covering both the pages indexed by Yahoo! and those indexed by Google. With the folksonomies of Figure 1, a local query for songs about "XXth Vocal" may be completed by songs populating the distant class XXth Opera, thanks to a provided mapping representing the fact that XXth Opera is included in XXth Vocal. We claim that such an application of joint querying requires a grounded and semantic way of defining the mappings and measuring their degree of confidence. Manually finding such mappings is clearly not possible at the Web scale. Therefore, the automatic discovery of semantic mappings is the bottleneck for
Fig. 2. Top levels of the Yahoo! and Google directories: (a) a sample directory from Yahoo!; (b) a sample directory from Google
scalability purposes. Many techniques and prototypes have been developed to suggest candidate mappings between several knowledge representations including taxonomies, ontologies or schemas (see [48,53] for surveys).

Particular Focus and Applications of the Proposed Approach

Most of the proposed approaches rely on evaluating the degree of similarity between the elements (e.g., classes, properties, instances) of one ontology and the elements of another ontology. Many different similarity measures are proposed and often combined. Most of them are based on several syntactic, linguistic or structural criteria to measure the proximity of the terms used to denote the classes and/or their properties within the ontology. Some of them exploit characteristics of the data declared as instances of the classes (e.g. [17,12,34,16]). Almost all the existing matching systems return for every candidate pair of elements a coefficient in the range [0,1] which denotes the strength of the semantic correspondence between those two elements [24,41,8]. Those coefficients are the basis for yearly international comparative evaluation campaigns [22,20]. Those approaches usually consider each candidate mapping in isolation. In particular, they do not take into account possible logical implications between mappings, which can be inferred from the logical inclusion axioms declared between classes within each ontology. This raises a crucial issue: the similarity coefficients returned by the existing ontology or schema matching systems cannot be interpreted as probabilities of the associated mappings. On the other hand, some approaches for detecting semantic mappings by logical reasoning have been proposed (e.g., [52]). By construction, such logical methods are limited to discovering mappings that are certain.
We claim that uncertainty is intrinsic to mapping discovery. It is first due to the methods employed for detecting mappings. Another important reason is that the mappings are usually interpreted as simple semantic relations such as subsumption or equivalence relations between classes, which is often an oversimplification of the complex overlapping relation existing in reality between two classes of different and independently developed ontologies. In this article, we focus on inclusion mappings. Inclusion mappings are of primary importance for ontology alignment applications, but they have received only little attention until now. In addition, inclusion mappings with a probabilistic semantics are useful and easily interpretable by humans or automatic systems, because they combine several theoretical and practical advantages:
1. Inclusion mappings are more likely to happen than equivalences (theoretically at least twice), and so our method discovers more information than methods based on equivalences.
2. Inclusion mappings are logically weaker than equivalences but have a clear logical semantics that is directly useful for integration and query rewriting settings. For example, the inclusion mapping Google/Software/Games ⊑ Yahoo/Entertainment allows to make a query about Yahoo/Entertainment (or any of its superclasses) and leads to answer this query by considering that all pages in the category Google/Software/Games should be in the answer. In this way, an alignment between the Yahoo! and Google directories constituted by inclusion mappings may enable access to all referenced pages of both Yahoo! and Google with a single query. A large amount of relevant mappings (like the above mapping) could not have been discovered by methods that return only equivalences.
3. The probabilities associated to mappings can be used by other systems that require probabilities, for example in order to know which knowledge is the most reliable.
4. Properly handling uncertainty is a key point for the Semantic Web due to the heterogeneous and unreliable nature of available data. In particular, probabilistic inclusion mappings directly fit the framework of probabilistic databases [18] and probabilistic reasoning [1]. These works contribute to probabilistic query answering, which constitutes a major way to exchange data between heterogeneous schemas and ontologies. In a probabilistic query answering setting, users can obtain probabilistic answers to their queries, thanks to probabilistic reasoning using probabilistic mappings.
5. From a user point of view, a probability has a well-understood meaning when standing for confidence: a probability of 0.5 means there is an even chance that the mapping is correct. Probability theory is simple, general, popular and well grounded for dealing with uncertainty, and generality and reusability are key points in computer science.
6. Finally, from a theoretical point of view in the ontology matching domain, there is a need to better understand the foundations of uncertainty in data and schema integration [54].
Main Points and Organization of this Article

In this paper, we propose an approach to automatically discover probabilistic inclusion mappings between classes of taxonomies, in order to query multiple folksonomies or Web Directories in a way that is grounded, robust and easily interpretable. First, we investigate and compare two instance-based ways of modeling probabilistic mappings which are compatible with the logical constraints declared in each taxonomy. In those two probabilistic models, the probability of a mapping relies on the joint probability distribution of the involved classes. They differ on the property of monotony of the corresponding probability function with respect to logical implication. For estimating the mapping probabilities, we follow a Bayesian approach to statistics by exploiting the descriptions of the instances categorized in each taxonomy as observations for the involved classes. The point is that, to estimate the joint probability distribution of two classes C1 and C2 of different taxonomies, we have to determine among the instances that are declared in C1 the ones that can be classified in C2 (based on their description), and similarly for the classification in C1 of instances that are declared in C2. Different classifiers can be used for that purpose. Based on the above probabilistic setting, we have designed, implemented and experimented with a generate and test algorithm for discovering the mappings whose probability is greater than a given threshold. In this algorithm, the monotony of the probability function is exploited to avoid the probability estimation of as many mappings as possible. We have performed thorough experiments on controlled synthetic data to measure the performance of such a systematic approach as a function of the number of possible mappings and the number of instances per class. We have also performed qualitative experiments to measure the impact of the classifiers used to estimate the probabilities on the precision and recall of the mappings returned by our algorithm. The paper is organized as follows. Section 2 presents the formal background and states the problem considered in this paper. Section 3 is dedicated to the definition and computation of mapping probabilities. In Section 4, we present the algorithm that we propose for discovering mappings with high probabilities (i.e., greater than a threshold). Section 5 surveys the quantitative and qualitative experiments that we have done on synthetic controlled data. Finally, in Section 7, we compare our approach to existing works and we conclude.
2 Formal Background
We first define taxonomies as a graphical notation and its interpretation in the standard first-order logical semantics, on which the inheritance of instances is grounded. Then, we define mappings between taxonomies as inclusion statements between classes of two different taxonomies. Finally, we set the problem statement of matching taxonomies that we consider in this paper.
2.1 Taxonomies: Classes and Instances
Given a vocabulary V denoting a set of classes, a taxonomy TV is a Directed Acyclic Graph (DAG) where each node is labelled with a distinct class name of V, and each arc between a node labelled with C and a node labelled with D represents a specialization relation between the classes C and D. Each class in a taxonomy can be associated with a set of instances which have an identifier and a content description modeled with an attribute-value language. By a slight abuse of notation, we will speak of the instance i to refer to the instance identified by i. Figure 1 shows two samples of taxonomies related to the Music domain. Bold arrows are used for representing specialization relations between classes, and dashed arrows for membership relations between instances and classes. In both taxonomies, some instances, with attribute-value descriptions denoted between brackets, are associated to classes. For example, #102 is an instance identifier and [Wagner, Tristan und Isolde, ...] its associated description. The instances that are in the scope of our data model can be web pages (whose content description is a set of words) identified by their URL, RDF resources (whose content description is a set of RDF triples) identified by a URI, or audio or video files identified by a signature and whose content description may be attribute-value metadata that can be extracted from those files. We consider only boolean attribute-value descriptions. Such a description could be obtained by discretization of attribute-value pairs given in a more complex language, as in Figure 1 where textual values are used. We consider that, possibly after a preprocessing step which is out of the scope of this paper, the instances are described in terms of a fixed set of boolean attributes {At1, ..., Atm}. Then, for an instance i, its description, denoted descr(i), is a vector [a1, ..., am] of size m such that for every j ∈ [1..m], aj = 1 if the attribute Atj belongs to the content description of i, and aj = 0 otherwise. Taxonomies have a logical semantics which provides the basis to define formally the extension of a class as the set of instances that are declared or can be inferred for that class.
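The boolean description descr(i) = [a1, ..., am] can be computed directly from an instance's content description, as in the small sketch below (attribute names and values are illustrative).

```java
import java.util.*;

// Builds the boolean description descr(i) = [a1, ..., am] of an instance over a
// fixed, ordered set of attributes {At1, ..., Atm}: aj = 1 iff Atj occurs in the
// instance's content description.
public class InstanceDescription {

    public static boolean[] descr(List<String> attributes, Set<String> contentDescription) {
        boolean[] a = new boolean[attributes.size()];
        for (int j = 0; j < attributes.size(); j++) {
            a[j] = contentDescription.contains(attributes.get(j));
        }
        return a;
    }

    public static void main(String[] args) {
        List<String> attrs = List.of("Wagner", "Ravel", "Opera", "Piano");  // illustrative attributes
        Set<String> instance102 = Set.of("Wagner", "Opera");                // e.g. [Wagner, Tristan und Isolde, ...]
        System.out.println(Arrays.toString(descr(attrs, instance102)));     // [true, false, true, false]
    }
}
```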
2.2 Logical Semantics
There are several graphical or textual notations for expressing the specialization relation between a class C and a class D in a taxonomy. For example, in RDF(S) [33], which is the first standard of the W3C concerning the Semantic Web, it is denoted by (C rdfs:subclassOf D). It corresponds to the inclusion statement C ⊑ D in description logics notation. Similarly, a membership statement denoted by an isa arc from an instance i to a class C corresponds in the RDF(S) notation to (i rdf:type C), and to C(i) in the usual notation of description logics. All those notations have a standard model-theoretic logical semantics based on interpreting classes as sets: an interpretation I consists of a non-empty domain of interpretation ΔI and a function .I that interprets each class as a non-empty subset of ΔI, and each instance identifier as an element of ΔI. The classes declared
in a taxonomy are interpreted as non-empty subsets because they are object containers. According to the unique name assumption, two distinct identifiers a and b have distinct interpretations (aI ≠ bI) in any interpretation I. I is a model of a taxonomy T if:
– for every inclusion statement E ⊑ F of T: EI ⊆ FI,
– for every membership statement C(a) of T: aI ∈ CI.
An inclusion G ⊑ H is inferred by a taxonomy T (denoted by T |= G ⊑ H) iff in every model I of T, GI ⊆ HI. A membership C(e) is inferred by T (denoted by T |= C(e)) iff in every model I of T, eI ∈ CI. Let D be the set of the instances associated to a taxonomy T. The extension of a class C in T, denoted by Ext(C, T), is the set of instances for which it can be inferred from the membership and inclusion statements declared in the taxonomy that they are instances of C:
Ext(C, T) = {d ∈ D | T |= C(d)}
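The extension Ext(C, T) can be computed by collecting the declared instances of C and of every class that T entails to be included in C. The sketch below assumes the taxonomy is held in memory as a map from each class to its direct superclasses; this representation is an assumption, not part of the paper.

```java
import java.util.*;

// Computes Ext(C, T): the declared instances of C together with the instances
// of every class that the taxonomy entails to be a subclass of C.
// The taxonomy is an assumed in-memory DAG: subclass -> set of direct superclasses.
public class Extension {

    public static Set<String> ext(String c,
                                  Map<String, Set<String>> superClasses,     // subclass -> direct superclasses
                                  Map<String, Set<String>> declaredInstances) {
        // Collect all subclasses of c (including c) by a reverse traversal of the DAG.
        Set<String> subClasses = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>(List.of(c));
        while (!stack.isEmpty()) {
            String current = stack.pop();
            if (!subClasses.add(current)) continue;
            for (Map.Entry<String, Set<String>> e : superClasses.entrySet()) {
                if (e.getValue().contains(current)) stack.push(e.getKey());
            }
        }
        // Union of the declared instances of all those subclasses.
        Set<String> extension = new HashSet<>();
        for (String cls : subClasses) {
            extension.addAll(declaredInstances.getOrDefault(cls, Set.of()));
        }
        return extension;
    }
}
```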
2.3 Mappings
The mappings that we consider are inclusion statements involving classes of two different taxonomies T1 and T2. To avoid ambiguity and without loss of generality, we consider that each taxonomy has its own vocabulary: by convention we index the names of the classes by the index of the taxonomy to which they belong. For instance, when involved in a mapping, the class XXth Opera of the taxonomy T2 of Figure 1 will be denoted by XXth Opera2 while the class XXth Vocal of the taxonomy T1 will be denoted by XXth Vocal1. Mappings between T1 and T2 are of the form A1 ⊑ B2 or A2 ⊑ B1, where A1 and B1 denote classes of T1 and A2 and B2 denote classes of T2. For a mapping m of the form Ai ⊑ Bj, its left-hand side Ai will be denoted lhs(m) and its right-hand side will be denoted rhs(m). A mapping Ai ⊑ Bj has the same meaning as a specialization relation between the classes Ai and Bj, and thus is interpreted in logic in the same way, as a set inclusion. The logical entailment between classes extends to logical entailment between mappings as follows.
Definition 1 (Entailment between mappings). Let Ti and Tj be two taxonomies. Let m and m′ be two mappings between Ti and Tj: m entails m′ (denoted m ⊨ m′) iff every model of Ti, Tj and m is also a model of m′.
It is straightforward to show that ⊨ is a (partial) order relation on the set M(Ti, Tj) of mappings between the two taxonomies Ti and Tj. If m ⊨ m′, we will say that m is more specific than m′ (also that m entails m′) and that m′ is more general than m (also that m′ is an implicate of m). The following proposition characterizes the logical entailment between mappings in function of the logical entailment between the classes of their left-hand sides and right-hand sides.
Proposition 1. Let m and m′ be two mappings between two taxonomies. Let Ti be the taxonomy of lhs(m), and Tj the taxonomy of rhs(m). m ⊨ m′ iff
– lhs(m) and lhs(m′) are classes of the same taxonomy Ti, and
– Ti |= lhs(m′) ⊑ lhs(m), and
– Tj |= rhs(m) ⊑ rhs(m′)
For example, two mappings between the taxonomies T1 and T2 of Figure 1 are illustrated in Figure 3:
– the mapping XXth Opera2 ⊑ XXth Vocal1 is more specific than the mapping XXth Opera2 ⊑ XXth Century1,
– and the mapping RecentClassical2 ⊑ XXth Instrumental1 is more specific than the mapping Ravel2 ⊑ Classical Music1.
Fig. 3. 2 mappings between T1 and T2
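Proposition 1 reduces entailment between mappings to two subsumption tests inside the individual taxonomies. A sketch, where the subsumption oracles answering Ti |= A ⊑ B are passed in as assumed predicates:

```java
import java.util.function.BiPredicate;

// Entailment test between two mappings m and m' following Proposition 1:
// m entails m' iff Ti |= lhs(m') ⊑ lhs(m) and Tj |= rhs(m) ⊑ rhs(m').
public class MappingEntailment {

    public record Mapping(String lhs, String rhs) {}

    // subsumedInTi.test(a, b) is assumed to answer Ti |= a ⊑ b (same for Tj).
    public static boolean entails(Mapping m, Mapping mPrime,
                                  BiPredicate<String, String> subsumedInTi,
                                  BiPredicate<String, String> subsumedInTj) {
        return subsumedInTi.test(mPrime.lhs(), m.lhs())
            && subsumedInTj.test(m.rhs(), mPrime.rhs());
    }
}
```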
2.4 Problem Statement
Among all possible mappings between two taxonomies, we want to determine those that are the most probable given the descriptions of the instances associated to each class of the taxonomies. More precisely, the main problem addressed in this paper is the design of an efficient generate and test algorithm which minimizes the number of calls to the probability estimator for determining those mappings whose probability exceeds a certain threshold. The mappings returned by this algorithm will be said to be probabilistically valid (valid for short). Two subproblems are emphasized. The first subproblem to handle is the choice of an appropriate probabilistic model for defining the probability of a mapping. As mentioned in the introduction, a probabilistic semantics of mappings cannot be independent of the logical semantics. In particular, it is expected that a mapping logically entailed by a mapping with a high probability (i.e., whose probability exceeds a threshold) will also get a high probability. The second subproblem is then to find a good probability estimator to compute mapping probabilities, given two taxonomies and the description of their instances.
3 Mapping Probabilities: Models and Estimation

3.1 Modeling Probabilistic Mappings
We have considered two relevant probabilistic models for modeling uncertain mappings. They are both based on the discrete probability measure defined on subsets of the sample set representing the set of all possible instances of the two taxonomies. From now on, we will denote by Pr(E) the probability for an instance to be an element of the subset E. The first model defines the probability of a mapping Ai ⊑ Bj as the conditional probability for an instance to be an instance of Bj knowing that it is an instance of Ai. It is the natural way to extend the logical semantics of entailment to probabilities. The second model comes directly from viewing classes as subsets of the sample space: the probability of Ai ⊑ Bj is the probability for an element to belong to the set Āi ∪ Bj, where Āi denotes the complement set of Ai in the sample set. These two models are described in the following definition.

Definition 2 (Two probabilities for a mapping). Let m be a mapping of the form Ai ⊑ Bj.
– Its conditional probability, denoted Pc(m), is defined as: Pc(m) = Pr(Bj | Ai).
– Its implication probability, denoted Pi(m), is defined as: Pi(m) = Pr(Āi ∪ Bj).

The following proposition states the main (comparative) properties of these two probabilistic models. In particular, they both meet the logical semantics for mappings that are certain, and they can both be equivalently expressed using joint probabilities.

Proposition 2. Let m be a mapping between two taxonomies Ti and Tj. The following properties hold:
1. Pi(m) ≥ Pc(m).
2. If m is a certain mapping (i.e., Ti ∪ Tj |= m): Pc(m) = Pi(m) = 1.
3. Pi(m) = 1 + Pr(lhs(m) ∩ rhs(m)) − Pr(lhs(m))
4. Pc(m) = Pr(lhs(m) ∩ rhs(m)) / Pr(lhs(m))

They differ on the monotony property w.r.t. the (partial) order corresponding to logical implication (cf. Definition 1): Pi verifies a property of monotony whereas Pc verifies a property of weak monotony, as stated in the following theorem.

Theorem 1 (Property of monotony). Let m and m′ be two mappings.
1. If m ⊨ m′ then Pi(m) ≤ Pi(m′)
2. If m ⊨ m′ and lhs(m) = lhs(m′) then Pc(m) ≤ Pc(m′)
The proof [57] results from Proposition 1 and Proposition 2, which relate mappings with the classes of their left-hand sides and right-hand sides for logical entailment and probabilities respectively, and from considering (declared or inherited) class inclusions within each taxonomy as statements whose probability is equal to 1.

3.2 Bayesian Estimation of Mapping Probabilities
As shown in Proposition 2, the computation of Pi(m) and Pc(m) relies on computing the set probability Pr(lhs(m)) and the joint set probability Pr(lhs(m) ∩ rhs(m)). These values are unknown and must be estimated. They are the (unknown) parameters of the underlying Bernoulli distributions modeling the membership function to a set as a random variable taking only two possible values 0 or 1. Following the Bayesian approach to statistics [14], we model these (unknown) parameters as continuous random variables, and we use observations to infer their posterior distribution from their prior distribution. In the absence of any particular knowledge, the prior distribution is usually set to the uniform distribution. In probability theory, a natural way of estimating the value of a parameter modeled by a random variable is to take its expected value. All this is summarized in the following definition.

Definition 3 (Bayesian estimator of Pr(E)). Let E be a subset of the sample set Ω. Let O be a sample of observed elements for which it is known whether they belong or not to E. The Bayesian estimator of Pr(E), denoted P̂r(E), is the expected value of the posterior distribution of Pr(E) knowing the observations on the membership to E of each element in O, and setting the prior probability of a random set to 1/2, and of the intersection of two random sets to 1/4.

Setting the prior probabilities to 1/2 and 1/4, depending on whether E is a class or a conjunction of classes, corresponds to the uniform distribution of instances among the classes. Let Ext(E, O) be the set of observed instances of O that are recognized to be instances of E. According to a basic theorem in probability theory (Theorem 1, page 160, [14]), if the prior distribution of the random variable modeling Pr(E) is a Beta distribution of parameters α and β, then its posterior distribution is also a Beta distribution the parameters of which are: α + |Ext(E, O)| and β + |O|. The Beta distribution is a family of continuous probability distributions parameterized by two parameters α and β which play an important role in Bayesian statistics. If its two parameters are equal to 1, it corresponds to the uniform distribution for the associated random variable. Its expected value is α/(α + β). In our setting, the set O is the union of the two (possibly disjoint) sets Oi and Oj of instances observed in two distinct taxonomies Ti and Tj. This raises the issue of computing the set Ext(E, Oi ∪ Oj), especially when E is the conjunction of a class Ci of the taxonomy Ti and a class Dj of the other taxonomy Tj. In this case:
Ext(Ci ∩ Dj, Oi ∪ Oj) = Ext(Ci, Oi ∪ Oj) ∩ Ext(Dj, Oi ∪ Oj)
Since the two taxonomies have been created and populated independently by different users, the only information that can be extracted from those two taxonomies are the extensions of each class within each taxonomy: Ext(Ci, Ti) and Ext(Dj, Tj). By construction, it is likely that their intersection contains very few instances or even no instance at all. Therefore, we use automatic classifiers to compute Ext(E, O). The machine learning community has produced several types of classifiers that have been extensively (theoretically and experimentally) studied (see [44] for a survey) and have been made available through open source platforms like Weka [61]. They are based on different approaches (e.g., Naive Bayes learning, decision trees, SVM) but they all need a training phase on two sets of positive and negative examples. Let C be a classifier. Let E be a class of one of the two taxonomies, which we denote by Ti, the other one being denoted Tj. For computing Ext(E, O) we follow the same approach as [17]:
– C is trained on the descriptions of the elements of the two sets Ext(E, Ti) and Oi \ Ext(E, Ti), taken as the sets of positive and negative examples respectively,
– C is then applied to each instance of Oj to recognize whether it belongs to E or not.
As a result, the following theorem provides a simple way to compute the Bayesian estimations P̂i(m) and P̂c(m) of the two probabilities Pi(m) and Pc(m) defined in Definition 2.

Theorem 2 (Estimation of mapping probabilities). Let m : Ci ⊑ Dj be a mapping between two taxonomies Ti and Tj. Let O be the union of the instances observed in Ti and Tj. Let N = |O|, Ni = |Ext(Ci, O)|, Nj = |Ext(Dj, O)| and Nij = |Ext(Ci ∩ Dj, O)|.
– P̂i(m) = 1 + (1 + Nij)/(4 + N) − (1 + Ni)/(2 + N)
– P̂c(m) = (1 + Nij)/(4 + N) × (2 + N)/(1 + Ni)
It is worth comparing the (Bayesian) ratios (1 + Ni)/(2 + N) and (1 + Nij)/(4 + N) appearing in the formulas for computing P̂i(m) and P̂c(m) in Theorem 2 with the corresponding ratios Ni/N and Nij/N that would have been obtained by following the standard (frequency-based) approach of statistics (as is the case, for instance, in [17]). The corresponding ratios converge to the same expected value when there are many instances, but the Bayesian ratios are more robust to a small number of instances. In contrast with the frequency-based approach, they are defined even in the case where no instance is observed: their respective values in this particular case (i.e., 1/2 and 1/4) correspond to the probability of a random set and the joint probability of two random sets, respectively, for a uniform distribution of instances in the sample set.
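The contrast between the two kinds of ratios can be illustrated with a small hypothetical helper implementing the estimators of Theorem 2 (the method names and the numeric examples are ours, not the authors'):

```java
// Sketch of the Bayesian estimators of Theorem 2, contrasted with the
// frequency-based ratios (hypothetical helper, not the ProbaMap code itself).
public final class MappingEstimators {

    /** P̂i(m) = 1 + (1+Nij)/(4+N) - (1+Ni)/(2+N). */
    static double bayesianPi(int n, int ni, int nij) {
        return 1.0 + (1.0 + nij) / (4.0 + n) - (1.0 + ni) / (2.0 + n);
    }

    /** P̂c(m) = (1+Nij)/(4+N) * (2+N)/(1+Ni). */
    static double bayesianPc(int n, int ni, int nij) {
        return (1.0 + nij) / (4.0 + n) * (2.0 + n) / (1.0 + ni);
    }

    public static void main(String[] args) {
        // With no observation at all, the Bayesian estimates are still defined,
        // while the frequency ratios Nij/N and Ni/N would be undefined (0/0).
        System.out.println(bayesianPc(0, 0, 0));   // (1/4)/(1/2) = 0.5
        System.out.println(bayesianPi(0, 0, 0));   // 1 + 1/4 - 1/2 = 0.75
        // With many observations the two approaches converge.
        System.out.println(bayesianPc(10000, 4000, 3000)); // close to 3000/4000 = 0.75
    }
}
```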
4
Discovery Process of Probabilistic Mappings
Given two taxonomies Ti and Tj (and their associated instances), let M(Ti, Tj) be the set of all mappings from Ti to Tj (i.e., of the form Ci ⊑ Dj). The ProbaMap algorithm determines all the mappings m of M(Ti, Tj) that verify the two probabilistic-based criteria Pi(m) ≥ Su and Pc(m) ≥ Sc, where Su and Sc are two thresholds in [0, 1].

Candidate mapping generation - The principle of the ProbaMap algorithm is to generate mappings from the two sets of classes in the two taxonomies, ordered according to a topological sort [10]. Namely (see the nested loops (Line 4) in Algorithm 1), it generates all the mappings Ci ⊑ Dj by enumerating the classes Ci of Ti following a reverse topological order and the classes Dj of Tj following a direct topological order. The following proposition is a corollary of Proposition 1.

Proposition 3. Let Ti and Tj be two taxonomies. Let ReverseTopo(Ti) be the sequence of classes of Ti resulting from a reverse topological sort of Ti. Let Topo(Tj) be the sequence of classes of Tj resulting from a topological sort of Tj. Let m : Ci ⊑ Dj and m′ : Ci′ ⊑ Dj′ be two mappings from Ti to Tj. If m′ is an implicant of m (i.e., m′ ⊨ m), then Ci is before Ci′ in ReverseTopo(Ti), or Ci = Ci′ and Dj is before Dj′ in Topo(Tj).

This proposition ensures that the mappings are generated in an order that respects the logical implication, according to the knowledge of the taxonomies.

Pruning the candidate mappings to test - Based on the monotony property of the probability function Pi (Theorem 1), every mapping m′ that is an implicant of a mapping m such that Pi(m) < Su verifies Pi(m′) < Su. Therefore, in ProbaMap, we prune the probability estimation of all the implicants of every m such that Pi(m) < Su. We shall use the notation Implicants(m) to denote the set of all mappings that are implicants of m. Similarly, based on the property of weak monotony of the probability function Pc (Theorem 1), when a tested candidate mapping m is such that P̂c(m) < Sc, we prune the probability estimation of all the implicants of m having the same left-hand side as m. We shall denote this set Implicants_c(m). Based on Proposition 1, Implicants(m) and Implicants_c(m) can be generated from Ti and Tj.

The ProbaMap Algorithm - The resulting ProbaMap algorithm is described in Algorithm 1, in which
– the primitive functions Implicants and Implicants_c return the implicants of the mapping in argument and the implicants having the same class at the left-hand side, respectively;
– Topo and ReverseTopo return the sequences of classes of a taxonomy that respect the logical implication order (in the direct and reverse directions, respectively);
– exti and extj represent the two maps that store the extension of each class of Ti and Tj respectively, or the extensions obtained by classification when it is enabled.
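The only non-standard ingredient needed for this generation order is a topological sort of each class DAG; a minimal, generic sketch (Kahn's algorithm, hypothetical helper code, with ReverseTopo obtained simply by reversing the returned order) is shown below.

```java
import java.util.*;

// Hypothetical helper illustrating the Topo / ReverseTopo primitives.
public final class TopoSort {
    /** Topological order of a DAG given as adjacency lists (edges parent -> child). */
    static List<String> topo(Map<String, List<String>> dag) {
        Map<String, Integer> inDegree = new HashMap<>();
        dag.keySet().forEach(c -> inDegree.putIfAbsent(c, 0));
        dag.values().forEach(succs -> succs.forEach(s -> inDegree.merge(s, 1, Integer::sum)));
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((c, d) -> { if (d == 0) ready.add(c); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String c = ready.poll();
            order.add(c);
            for (String s : dag.getOrDefault(c, List.of())) {
                if (inDegree.merge(s, -1, Integer::sum) == 0) ready.add(s);
            }
        }
        return order;   // Topo; Collections.reverse(order) gives ReverseTopo
    }

    public static void main(String[] args) {
        Map<String, List<String>> dag = Map.of(
            "Thing", List.of("Animal", "Plant"),
            "Animal", List.of("Snake"));
        System.out.println(topo(dag));   // [Thing, Animal, Plant, Snake]
    }
}
```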
Algorithm 1 returns mappings directed from Ti to Tj. In order to obtain all valid mappings, it must be applied again by swapping its inputs Ti and Tj.

Algorithm 1. ProbaMap
Require: Taxonomies (DAGs) Ti, Tj, thresholds Sc, Si, instance maps exti, extj
Ensure: return {m ∈ M(Ti, Tj) such that Pi(m) ≥ Si and P̂c(m) ≥ Sc}
1: MVal ← ∅
2: MNVal ← ∅
3: for all Ci ∈ ReverseTopo(Ti) do
4:   for all Dj ∈ Topo(Tj) do
5:     let m = Ci ⊑ Dj
6:     if m ∉ MNVal then
7:       if Pi(m, exti, extj, Ti, Tj) ≥ Si then
8:         if P̂c(m, exti, extj, Ti, Tj) ≥ Sc then
9:           MVal ← MVal ∪ {m}
10:        else
11:          MNVal ← MNVal ∪ Implicants_c(m, Tj)   {Pruning using the weak monotony}
12:      else
13:        MNVal ← MNVal ∪ Implicants(m, Ti, Tj)   {Pruning using the strong monotony}
14: return MVal
The discovered valid mappings are stored in the set MVal. Mappings that are pruned are stored in the set MNVal. The nested loops in Line 4 of Algorithm 1 generate all the mappings Ci ⊑ Dj from the most general to the most specific mapping. Based on this, Proposition 3 shows that this algorithm maximizes the number of prunings. When a mapping is generated and is not already pruned, it is first tested with regard to Pi(m) ≥ Si in Line 7; then, if the test is positive, it is tested with P̂c(m) ≥ Sc in Line 8. In the case Pi(m) < Si, all the implicants of m are marked as pruned (Line 13), thanks to the strong monotony property of Pi. If Pi(m) ≥ Si but P̂c(m) < Sc, then the property of weak monotony leads to pruning all the implicants of m with the same left-hand side, in Line 11, and they are added to MNVal. Hence, the MNVal set is kept up to date and contains all the pruned mappings from the beginning. The general structure of the implementation of ProbaMap is pictured in Figure 4:
1. Computation of the transitive logical closure of the two taxonomies and their instances given as input (in order to obtain the class extensions quickly and to optimize the pruning process).
2. Automatic classification step in order to merge the instances of the two taxonomies. This step has been detailed in Section 3.4. It can be disabled by the user.
3. Core of ProbaMap: generation-and-test of candidate mappings, according to the two thresholds given as input parameters.
Fig. 4. Structure of ProbaMap
The output is directly the set of mappings from the first taxonomy to the second one for which Pi and Pc both exceed their respective thresholds Su and Sc. We insist on the fact that, once the parameters are set, the discovery process is fully automatic. As many matching methods provide equivalences as their results, we can add a postprocessing step to ProbaMap in order to obtain the equivalences logically implied by the discovered inclusion mappings. In this case, the thresholds Su and Sc should be lowered with respect to the discovery of pure implications, because equivalences are stronger relations than implications, and are therefore less likely to hold.
5
Experiments
We have performed three series of experiments:
1. on controlled synthetic data,
2. on the OAEI directory benchmark,
3. on Web directories, with a comparison with the method SBI [34].
This section is structured according to these three series. 5.1
Experiments on Synthetic Data
For the purpose of systematic testing of our approach in various conditions, we have evaluated Algorithm 1 on synthetic data on which we can control important parameters and guarantee structural or distributional properties. Different analyses have been conducted. We have measured the impact of the probabilistic models and of the thresholds involved in the validity criteria on
the precision of the results and on the pruning ratio. The pruning ratio is the ratio of mappings that are pruned by adding, in lines 13 and 11, Implicants(m) (or Implicants_c(m)) to the set MNVal of invalid mappings, without estimating their probabilities. We have also measured the qualitative and quantitative impact of the choice of a classifier. Automatic classification is at the core of the estimation of Pi and P̂c, through the computation of Ext(Ci ∩ Dj, O) (see Theorem 2). For evaluating the quality of our results, we use the standard criteria of precision and recall [58]. Recall is the ratio of returned results that are expected w.r.t. all expected results. Precision is the ratio of returned results that are expected w.r.t. all returned results. We have also tested the robustness of Algorithm 1 to noisy data. Finally, we have conducted experiments on real-world data to check the scalability of Algorithm 1 on large taxonomies and to make qualitative measurements. We first describe the principles and the process of the data generator on which we have conducted the different experiments. Then we describe the experimental protocol that we have followed. Finally, we summarize the main experimental results that we have obtained on synthetic and real-world data. Synthetic data generation is divided into three steps: generation of taxonomies with fixed sizes, generation of the expected mappings to discover, and population of each class by generating a fixed number of instances and their associated descriptions.

Generation of taxonomies - Given constraints on the numbers of classes n1 and n2, we generate the structure of the two respective taxonomies T1 and T2 as a forest of general trees (unconstrained in-degrees) by using a Boltzmann sampler for unlabelled trees described in [19]. We use a rejection method to get random forests with n1 and n2 nodes. This method is simple and efficient while guaranteeing a uniform distribution among the trees with the same number of nodes. Then, we label each node by a distinct class name. In our experiments, we set n1 = n2, so the two taxonomies have the same size, which is the unique parameter of the taxonomy generation.

Mappings generation - We initialize the generation of the mappings to be discovered MG with a set MS of seed mappings, whose size is either fixed or randomly chosen between 1 and the common size of the taxonomies. Each mapping m ∈ MS is generated by a random choice of the two classes lhs(m) and rhs(m) in T1 and T2, or in T2 and T1, depending on the mapping direction, which is randomly chosen too. We reject mappings which logically entail class inclusions that are not entailed within each taxonomy (i.e., we forbid generated mappings to modify the knowledge of each taxonomy). The set MG of all mappings to discover will then be the set of mappings that can be logically entailed by MS and the two taxonomies.
Following [21], the computation of precision and recall will be based on MG. Let R be the result of Algorithm 1. The recall is the proportion of mappings of MG actually returned by Algorithm 1:

Recall = |MG ∩ R| / |MG|

The precision is the proportion of returned mappings that are actually in MG:

Precision = |MG ∩ R| / |R|
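Both measures are straightforward to compute from the two sets of mappings; a trivial, hypothetical helper:

```java
import java.util.HashSet;
import java.util.Set;

// Precision and recall of a returned set R w.r.t. the expected set MG.
public final class PrecisionRecall {
    static <T> double recall(Set<T> expected, Set<T> returned) {
        Set<T> inter = new HashSet<>(expected);
        inter.retainAll(returned);
        return (double) inter.size() / expected.size();
    }

    static <T> double precision(Set<T> expected, Set<T> returned) {
        Set<T> inter = new HashSet<>(expected);
        inter.retainAll(returned);
        return (double) inter.size() / returned.size();
    }

    public static void main(String[] args) {
        Set<String> mg = Set.of("m1", "m2", "m3", "m4");   // expected mappings
        Set<String> r  = Set.of("m1", "m2", "m5");         // returned mappings
        System.out.println(recall(mg, r));     // 0.5
        System.out.println(precision(mg, r));  // ~0.67
    }
}
```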
Instances and description generation - For this step, we consider the two taxonomies and the mappings between them as a whole DAG of classes. The acyclicity of that graph is guaranteed by the constraints imposed in the production of the set MG of generated mappings described above. We first generate a set of boolean attributes sufficient to associate a minimal intentional description to each class respecting the semantic partial order conveyed by the above DAG structure. Then, we use this intentional knowledge to generate accordingly the descriptions of the instances with which we populate each class in the taxonomies.

Generation of the intentional description of classes: We traverse the DAG of classes according to a reverse topological order [10], starting from the most general classes, which constitute level 0, and we iterate the following process for generating the intentions of classes as sets of attributes:
- For each class Ci^0 of level 0, we generate a disjoint set of distinct attributes Ati^0 and we set the intention of Ci^0, denoted Int(Ci^0), to be Ati^0.
- For each class Ci^j of level j (according to the reverse topological order), we generate a set Ati^j of novel attributes (disjoint from the set of existing attributes) with a size fixed to the out-degree of Ci^j in the DAG of classes, and we set Int(Ci^j) to be Ati^j ∪ ⋃k Int(Ck^(j−1)), where the Ck^(j−1) are the successors of Ci^j in the DAG.

Population of classes: Let {At1, ..., Atm} be the set of attributes generated at the previous step. We populate each class with nP instances, and we associate to them descriptions that respect the corresponding intentional description, as follows. For each class C, each of its instances is described by a boolean vector [a1, ..., am] obtained by:
- setting to 1 each ai such that the corresponding attribute Ati is in the intention of the class C,
- randomly setting the other values aj to 0 or 1.
This way, by construction, all the instances in the extension of a class have in common that all the attributes in Int(C) are present in their description. In Section 5.1 we will use an oracle classifier, which classifies an instance i in the class C iff all the attributes of the intention Int(C) of the class C are present in the description of i.
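For illustration, the population step for one instance can be sketched as follows (hypothetical code, with Int(C) given as a set of attribute indices): attributes of Int(C) are forced to 1, the others are drawn at random.

```java
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of instance generation respecting Int(C).
public final class ClassPopulator {
    /** Generates one instance description [a1,...,am] for a class C. */
    static boolean[] generateInstance(int m, Set<Integer> intentionOfC, Random rnd) {
        boolean[] description = new boolean[m];
        for (int a = 0; a < m; a++) {
            // attributes of Int(C) are always present; the others are set randomly
            description[a] = intentionOfC.contains(a) || rnd.nextBoolean();
        }
        return description;
    }

    public static void main(String[] args) {
        Set<Integer> intC = Set.of(0, 3);   // hypothetical Int(C) = {At1, At4}
        boolean[] inst = generateInstance(6, intC, new Random());
        // inst[0] and inst[3] are guaranteed to be true.
    }
}
```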
The results of data generation can be summarized into a table Tdata with m + nC columns, where m is the number of generated attributes and nC is the number of generated classes, and each tuple [a1, ..., am, c1, ..., cnC] concatenates the description [a1, ..., am] of an instance in terms of attributes and its categorization [c1, ..., cnC] with respect to the classes: for each i ∈ [1..nC], ci is equal to 1 if the instance belongs to Ext(Ci) and to 0 otherwise.

Connection with Armstrong relations - Data generation in our context is quite a challenging issue when compared to other synthetic data generation, such as Armstrong relations for functional dependencies3 [4], or transactional databases for frequent itemsets [49]. Roughly speaking, we borrow the same principles as those developed for Armstrong relations [25]. Each generated mapping should be satisfied by the generated data, and each mapping that is not generated should be contradicted by the generated data. With regard to Armstrong relations, our generation tackles additional issues. Firstly, the structure may be generated, whereas Armstrong relations suppose the structure (schemas) to be given. Secondly, for relational databases, it is enough to generate two tuples that contradict a dependency to ensure that the dependency is not satisfied. In our case, where mapping probabilities are estimated from statistics on class extensions, the amount of instances that contradict a mapping has a strong impact on its validity, so we cannot generate only one instance for each mapping to be contradicted.

Experimental protocol - We first explain the goals of the successive experiments that we have performed. For an overview of the complete setting, the connection between ProbaMap and the generator is pictured in Figure 5. Distinguishing between the parameters of ProbaMap (the two thresholds for Pc and Pi, and the kind of classifier) and those of the generator is a key point. The first goal is to analyze the impact of the thresholds Sc and Su on the precision, with the purpose of fixing appropriate thresholds for the next experiments. The second goal is to analyze the impact of the probabilities P̂c and Pi on the pruning ratio of the algorithm. The purpose is to determine, among three validity criteria, the one offering the best performance: using Pi alone, using Pc alone, or using both of them. The third goal is to analyse and compare the impact, both on precision/recall and on total running time, of three real classifiers (Naive Bayes, C4.5 and SVM) for estimating the probabilities. The purpose is to determine the classifier offering the best tradeoff between quality of results and running time. Note that we do not take the learning time of classifiers into account, because we consider that this task can be precomputed for each taxonomy.
3 An Armstrong relation for a set of functional dependencies is a relation that satisfies each dependency implied by the set and does not satisfy any dependency that is not implied by it.
Fig. 5. Structure of ProbaMap
The last goal with synthetic data is to analyse the robustness of the approach to noisy data. For all the experiments on synthetic data presented in this section, each point is obtained by averaging the results of 100 runs. For each run, a new synthetic dataset is generated with the appropriate parameters. Note that in our experiments we generate taxonomies with a few dozen classes. The number of random taxonomies of such sizes can be counted in billions. Thus, averaging over 100 runs for a point does not prevent local variations, leading to curves that are not smooth. Our algorithm is written in Java and compiled using Sun Java version 1.6. We run all the tests on a quad-core Intel Q6700 Xeon at 2.66 GHz with 4 GB of memory. The OS is Ubuntu Linux 8.10. For all the experiments measuring run times, only one instance of our program and the OS are running on the machine, to avoid memory contention effects with other programs that would affect the results.

Experimental results
1 - Impact of thresholds on precision
We compare the influence of the thresholds Sc and Su, associated to the probabilities P̂c and Pi, on the quality of the results returned by Algorithm 1. For doing so, we run the algorithm with the validity criterion: P̂c ≥ Sc and Pi ≥ Su. In this experiment, the computation of probabilities is performed using the oracle classifier. The parameters in the synthetic generator are defined such that |M(T1, T2)| = 320. We set the number of seed mappings |MS| = 4. Note that by logical entailment the total number |MG| of mappings to be discovered may be much greater. For each couple of thresholds (Sc, Su) ∈ [0.78, 0.995]², we compute the precision and the recall of the results of Algorithm 1.
Fig. 6. Precision w.r.t. Sc and Su thresholds
We observed that the recall remains constant at 1.0 independently of the values of Sc and Su. This is because, thanks to the oracle classifier, the estimated probabilities for the mappings of MG are very close to 1 and superior to all the experimented thresholds, leading to a perfect recall. Thus, we only show the results for precision in Figure 6. The figure shows the contours of the different precision levels, from a precision of 0.93 to a precision of 0.99. From the shape of these contours, it is clear that both P̂c and Pi have an influence on precision. As the relation Pi ≥ P̂c holds (Proposition 2), under the diagonal Pi has no influence on precision. The figure shows that P̂c is more discriminant than Pi. The probability P̂c influences the precision for a large range of values of the threshold Sc, while Pi only has an influence for very high values of Su. We have observed the estimated probabilities for different mappings, and found that there is an important gap in the values of P̂c between valid and invalid mappings. This gap is much smaller for Pi. Since Pi gives higher probability values to invalid mappings, this explains why it can only have an influence on precision at very high Su values. Based on the contours of Figure 6, we fix the thresholds at (Sc = 0.83, Su = 0.96) for the experiments where the classifier used to estimate the probabilities is the oracle. This gives a good precision of 0.95, and maps to a region where Pi has an influence on the quality of results. For the experiments in which a real classifier is used to estimate the probabilities, we fix the thresholds at (Sc = 0.85, Su = 0.90) to be tolerant to classification errors.

2 - Impact of probabilities on pruning ratio
We now study the impact of three probabilistic criteria for testing the validity of a mapping on the pruning ratio achieved by Algorithm 1:
1. using only Pi, by setting Sc = 0 (bypass)
2. using only Pc, by setting Su = 0 (bypass)
3. using both Pi and Pc
Figure 7 shows the ratio of pruning made by the algorithm using the different validity criteria, w.r.t. a naive approach not doing any pruning.
Fig. 7. Pruning factor for validity computation with P̂c, Pi, and P̂c and Pi
The validity computation using only Pi is the one that prunes the fewest mappings, computing probabilities for about 40% of all the mappings of M(Ti, Tj). Both P̂c alone and the combination of P̂c and Pi do more prunings and obtain a significant reduction of the search space. Combining Pi and P̂c obtains slightly better results than using P̂c alone, so for the remainder of this experiments section we use P̂c ≥ Sc and Pi ≥ Su as the validity criterion. This allows the probabilities to be computed for only 20% of the mappings of M(Ti, Tj) when the number of candidate mappings is high. For example, a search space of 3500 mappings is pruned down to 22% by combining Pc and Pi, which implies that only about 770 mappings are examined, i.e., have their Pi and Pc estimated.

3 - Impact of the classifiers
In this subsection, we replace the oracle classifier with a real classifier. We compare the results given by three well-known classifiers: Naive Bayes [44], C4.5 [47] and SVM [27]. We use the Weka implementation of these classifiers and have interfaced it with our code. The comparisons of running times are shown in Figure 8 and, in log scale, in Figure 9.

Fig. 8. Computation time (s) for different classifiers

A first conclusion is that the running times are polynomial in the number of mappings and are very similar, with Naive Bayes being slightly slower than C4.5 and SVM. Comparisons for precision and recall are shown in Figure 10 and Figure 11, respectively. Naive Bayes has both the worst recall and the worst precision; the choice is thus between C4.5 and SVM. They seem to have similar results. However, the learning time of SVM (not shown here) is much longer than the learning time of C4.5. We thus choose C4.5 for further experiments, and analyse the impact of the number of instances per class on the classification performance of Algorithm 1 with C4.5.
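As an illustration of how such Weka classifiers can be invoked, the sketch below trains a C4.5 (J48) classifier on a set of positive/negative examples and applies it to the instances of the other taxonomy, in the spirit of the training and classification steps of Section 3.2; the file names and the surrounding code are hypothetical and do not come from the ProbaMap implementation.

```java
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierSketch {
    public static void main(String[] args) throws Exception {
        // Positive/negative examples for a class E (hypothetical ARFF file):
        // the last attribute is the class label {positive, negative}.
        Instances train = DataSource.read("extension_of_E.arff");
        train.setClassIndex(train.numAttributes() - 1);

        Classifier c45 = new J48();          // Weka's C4.5 implementation
        c45.buildClassifier(train);          // training phase

        // Instances of the other taxonomy to be recognized as members of E or not.
        Instances test = DataSource.read("other_taxonomy_instances.arff");
        test.setClassIndex(test.numAttributes() - 1);
        for (int i = 0; i < test.numInstances(); i++) {
            Instance inst = test.instance(i);
            double label = c45.classifyInstance(inst);
            boolean inE = test.classAttribute().value((int) label).equals("positive");
            // inE contributes to Ext(E, O), hence to the counts N, Ni, Nij of Theorem 2.
        }
    }
}
```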
Fig. 9. Computation time in log scale for different classifiers
Fig. 10. Precision for different classifiers
Fig. 11. Recall for different classifiers
Fig. 12. C4.5 - Computation time (s) - impact of number of inst/class
We vary the number of instances per class nP between 10 and 450. The results for computation time, precision and recall are shown in Figures 12, 13 and 14. In this experiment, the number of classes and of mappings is constant, hence the number of classifications to perform is linear in the number of instances. The C4.5 algorithm takes linear time in the number of instances; since all instances are classified into each class at the beginning of ProbaMap, this is perfectly coherent. This linearity also holds for Algorithm 1, as shown by Figure 9. Increasing the number of instances per class only slightly increases precision, whereas it strongly improves recall. The most important point to note is that excellent values of precision and recall are obtained with as few as 50 instances per class, as expected from the use of the Bayesian approach of statistics.
Fig. 13. C4.5 - Precision - impact of number of inst/class
Fig. 14. C4.5 - Recall - impact of number of inst/class
4 - Robustness to noisy data
In order to test the robustness of our algorithm to noise, we define a new parameter θ corresponding to the quantity of noise to inject in the synthetic data. Each dataset produced by the synthetic data generator goes through a noise-application step, in which each boolean corresponding to the value of an attribute for an instance can be reversed with probability θ. The new dataset is then processed as usual by Algorithm 1. The variations of precision and recall for values of θ ∈ [0; 0.3] are shown in Figure 15. The figure shows that recall degrades gracefully when noise increases. At 10% noise, the recall is nearly unaffected, at a value of 0.95. Values of noise superior to 15% have a more significant impact and lead to poor recall. Precision, however, exhibits a different behavior: it first increases with noise, before abruptly decreasing for more than 24% of noise.
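A minimal sketch of this noise-application step (hypothetical code, with the generated dataset represented as a plain boolean matrix) could look as follows.

```java
import java.util.Random;

// Flip each attribute value of each instance with probability theta (noise level).
public final class NoiseInjector {
    static void applyNoise(boolean[][] instanceDescriptions, double theta, Random rnd) {
        for (boolean[] description : instanceDescriptions) {
            for (int a = 0; a < description.length; a++) {
                if (rnd.nextDouble() < theta) {
                    description[a] = !description[a];   // reverse the boolean value
                }
            }
        }
    }

    public static void main(String[] args) {
        boolean[][] data = { {true, false, true}, {false, false, true} };
        applyNoise(data, 0.1, new Random(42));
    }
}
```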
Fig. 15. C4.5 - Precision and recall - impact of noise level
In order to understand this phenomenon, we have investigated in detail the classifier results and the probability values given to mappings. We found that for 0% noise, there are invalid mappings that are incorrectly given too high probabilities, and that appear as valid. This explains the non-perfect 0.88 precision value. The probability values for these mappings are close to the threshold. Increasing noise makes the classifiers more selective, and tends to decrease the values of all probabilities. So the probabilities of these invalid mappings go below the threshold for a moderate amount of noise, whereas the probabilities of valid mappings remain above the threshold. Thus the precision increases. 5.2
Real-World OAEI Data
We have made experiments on the directory set of the OAEI [22,20] contest. This set is constituted by two large taxonomies of respectively 2857 and 6628 classes. For the contest, due to scalability issues, the taxonomies are split into the set of their branches, a small subset of which is given to the competitors for mapping alignment. In contrast, our algorithm is able to handle the two whole taxonomies, thus taking advantage of the complete structure of the taxonomies. It is important to note that without pruning, this would lead to a search space of 30 million mappings. For compensating the absence of available instances for these taxonomies, we use a method inspired by [52] to automatically populate the classes with synsets4. We follow a similar method to Ctx-Match [52]. Intuitively, the principle of this population is based on associating each class C with a set of WordNet synsets (atomic semantic units in this thesaurus) that reflect the right meaning of the label of C in the context where it appears, i.e. the labels of the ancestor classes of C in its taxonomy. This helps to disambiguate the meaning of a word: for instance, the label "Arizona" can correspond to a state of the U.S.A. or to a snake.
4 In WordNet, a synset is a semantic unit; two synonyms correspond to the same synset.
If the Arizona class is a child class of "Animals", the label "Animals" can be used to guess that "Arizona" means the snake species. On the two whole taxonomies, the population phase produces about 30000 instances and takes 5 hours, while the mapping discovery algorithm itself only takes 11 minutes. These are very reasonable computational times for handling 30 million possible mappings. For evaluating the precision of the set of mappings discovered by our algorithm, we could only compute a lower bound based on the partial reference provided by OAEI. The results are promising: for the thresholds Su and Sc respectively set to 0.9 and 0.8, we obtained a lower bound of precision of 67%5. 5.3
Comparative Analysis on Collected Web Directories
In this section, we test ProbaMap on parts of the Internet directories from Yahoo! and Google (actually based on Dmoz) that are rooted at similar or very close labels. These sub-directories are considered as taxonomies, and the URLs referenced inside each class of a taxonomy as its instances. The main difference with the experiments of the previous section is that the dataset contains original instances that are collected with their taxonomies, avoiding the need for an artificial population step. We compare our approach to the SBI algorithm of Ichise et al. [34,35], which is dedicated to the discovery of mappings between Internet directories, and to the integration of such directories. Internet directories are trees of categories, which can be seen as taxonomies, categories being the classes. Each category contains a set of links (i.e. URLs of web sites), which can be seen as the instances of the class. Each link comes with a small text summary, whose words can be seen as instance attributes for classification. Our datasets are corresponding locations in the Yahoo! and Google directories, which have also been used in the experiments of [34,35]:
– Yahoo! : Recreation / Automotive & Google : Recreation / Autos
– Yahoo! : Recreation / Outdoors & Google : Recreation / Outdoors
– Yahoo! : Computers and Internet / Software & Google : Computers / Software
– Yahoo! : Arts / Visual Arts / Photography & Google : Arts / Photography
The data from the directories was collected in June 2010, so it is different from the data of [35] and [34], which was collected in Fall 2001. Table 1 shows for each dataset the number of classes and instances in each directory, and the number of instances shared between the Yahoo! and the Google directories. Two instances are considered shared if they correspond to the same URL in both directories. For a fair comparison, we have implemented both ProbaMap and the SBI algorithm in Java.
5 For more details, see http://disi.unitn.it/~pane/OAEI/2009/directory/result/
Table 1. Statistics on data collected from subdirectories on Yahoo! and Google

              Yahoo!               Google              shared
              classes  instances   classes  instances  instances
Autos         947      4406        967      6425       837
Outdoors      2428     5511        1441     13863      623
Software      323      2390        2395     30140      572
Photography   168      1851        321      3852       286
Experimental protocol - An overview of the setting used for the comparative experiment is pictured in Figure 16. SBI takes as input one source taxonomy, one target taxonomy, and the instances declared for each class of both taxonomies. For each class Cs of the source taxonomy, SBI returns a rule Cs → Ctpredicted associating Cs to a target class Ctpredicted in the target taxonomy. In order to fit the evaluation framework of SBI, we added a postprocessing step to ProbaMap to obtain a similar form of results, i.e. a set of unique rules for each class of the source taxonomy. The complete process is the following:
1. Application of ProbaMap on T1 and T2.
2. For each class C1 of T1, among all the classes C2 for which the two mappings C1 ⊑ C2 and C2 ⊑ C1 have been discovered, select the class C2 for which min(P̂c(C1 ⊑ C2), P̂c(C2 ⊑ C1)) has the highest value.
3. For each class C1 of T1, if there is no rule for C1, associate to C1 the rule of its closest ancestor in T1.
Fig. 16. Setting used for the comparative experiment between SBI and ProbaMap
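Step 2 of this post-processing can be sketched as follows (hypothetical helper; it assumes the discovered mappings and their P̂c values are available in a map keyed by (source class, target class) pairs, and it omits step 3, the closest-ancestor fallback):

```java
import java.util.*;

public final class RuleSelection {
    /**
     * For each class c1 of T1, keep the class c2 maximizing
     * min(Pc(c1 ⊑ c2), Pc(c2 ⊑ c1)) over the discovered pairs.
     */
    static Map<String, String> selectRules(Map<List<String>, Double> pcOfMapping,
                                           Set<String> classesOfT1) {
        Map<String, String> rules = new HashMap<>();
        for (String c1 : classesOfT1) {
            double best = -1;
            for (Map.Entry<List<String>, Double> e : pcOfMapping.entrySet()) {
                if (!e.getKey().get(0).equals(c1)) continue;            // mapping c1 ⊑ c2
                String c2 = e.getKey().get(1);
                Double reverse = pcOfMapping.get(List.of(c2, c1));      // mapping c2 ⊑ c1
                if (reverse == null) continue;                          // both directions required
                double score = Math.min(e.getValue(), reverse);
                if (score > best) { best = score; rules.put(c1, c2); }
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        Map<List<String>, Double> pc = new HashMap<>();
        pc.put(List.of("Autos", "Cars"), 0.9);   // hypothetical P̂c(Autos ⊑ Cars)
        pc.put(List.of("Cars", "Autos"), 0.8);   // hypothetical P̂c(Cars ⊑ Autos)
        System.out.println(selectRules(pc, Set.of("Autos")));  // {Autos=Cars}
    }
}
```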
In this way we obtain a unique rule C1 → C2 for each class of T1, like the SBI system. As there is a good proportion of shared instances for each directory, which makes the classification step not mandatory, we can and will compare SBI against both ProbaMap versions, with and without the classification step. For the version without classification, preliminary tests led us to set Sc = 0.6 and Su = 0.9. SVM is set to be the default classifier, by using the SMO implementation in Weka. When using classification, because of the small values of the joint probabilities of classes, we set Su = Sc = 0. In this case, the postprocessing step is modified to take into account the probabilities Pc and Pi of the reverse mappings.

Comparative qualitative evaluation - The goal of our experiments is to compare the quality of Internet directory alignment for ProbaMap and SBI. For the discovery of mappings, ProbaMap and SBI receive a "training" set of instances which is a subset of the shared and annotated instances. The test set used for the evaluation of the discovered mappings is constituted by the remaining instances among the shared ones. In the case where ProbaMap is set to use classification, the training set is extended with all the non-shared instances. The ratio of the number of shared instances used for training among all shared instances is a controlled variable in the experiments we have conducted. The classification is performed using the SVM implementation SMO [27] in Weka [61], where the classification attributes for an instance are the words of its summary. The experiments on controlled data have shown that SVM and C4.5 are comparable in quality but that C4.5 is faster than SVM. Nevertheless, we have conducted this experiment with SVM, which is expected to perform better on sparse data. The evaluation is done by using the test set of instances. Each instance of this set belongs to a class Cs of the source taxonomy. Hence, we can compare:
– the class Ctpredicted of the instance, predicted by the output rule Cs → Ctpredicted,
– the class Ct in which the instance is declared in the target taxonomy (each instance of the test set is common to the two taxonomies).
The accuracy measures the ratio of the instances in the test set for which Ctpredicted = Ct. Accuracy is a standard micro-averaged evaluation measure which is based on instances, whereas precision and recall are macro-averaged measures based on mappings themselves. As there is no reference set of mappings provided for the considered Yahoo! and Google subdirectories, but a sufficient proportion of instances is shared by them, we use the accuracy criterion to evaluate ProbaMap in this experiment. This also enables us to fit the evaluation framework of SBI (described in [35]).

Results - The averaged results in Table 2 show that ProbaMap outperforms SBI on average, and that ProbaMap with classification outperforms ProbaMap without classification. The average is computed on the four directories of Table 1 and for each direction of alignment.
Table 2. Averaged accuracy for SBI and ProbaMap

Ratio of shared instances     SBI    ProbaMap   ProbaMap + classif
provided for training
0.5                           0.23   0.28       0.33
0.9                           0.29   0.33       0.38
(a) Autos: Yahoo! into Google
(b) Photography: Google into Yahoo!
Fig. 17. Comparative accuracy results (1)
(a) Outdoors: Google into Yahoo!
(b) Software: Google into Yahoo!
Fig. 18. Comparative accuracy results (2)
In particular, ProbaMap with classification significantly outperforms ProbaMap without classification (about 10% better) and SBI (about 20% better) for the datasets Autos and Photography, whatever the size of the training set. For the directories Software and Outdoors, ProbaMap without classification and SBI both provide lower results, SBI performing a little better (by no more than 5%). In these two cases, there are initially few instances per class, and the classification step then improves the results: ProbaMap with classification outperforms SBI on the Software directory up to 60% of instances for the training set, whereas ProbaMap without classification does not. For the Outdoors results pictured in Figure 18(a), ProbaMap with classification is better for a small
training set (≤ 20% of the shared instances). The reason is that, for this particular dataset, a very small ratio of instances is classified, and then the extensions of classes do not change much after the classification step. We have also checked that all these results, both for ProbaMap and ProbaMap with classification, are roughly better on average than the results of a modified version of SBI using Naive Bayes classification, called SBI-NB [35]. More precisely, both versions of ProbaMap perform better for 6 of the 8 tested directories. Note however that the results provided for SBI-NB were obtained on an older dataset collected in 2004, so the comparison is not totally fair. Finally, SBI and ProbaMap (without classification) both take a few seconds on each of the directories of Table 1. The version of ProbaMap with classification requires additional time (several minutes) for classifying each instance into each class, but this significantly improves the quality of the results. These experiments show that the mapping discovery method that we propose gives good results on real-world datasets, and can take advantage of classification techniques to compensate for small training set sizes. This is an important quality for real-world taxonomies built by different people, which are unlikely to have many instances in common.
6
Related Work
As outlined in the introduction, semantic mappings are the glue for data integration systems. A wide range of methods for schema/ontology matching have been developed both in the database and the semantic web communities [23]. One of the principles widely exploited is terminological comparison of the labels of classes, with string-based similarities or lexicon-based similarities (like WordNet) (e.g., TaxoMap [32], H-MATCH [8]). Another widely used principle is structure comparison between labeled graphs representing ontologies (e.g., OLA [24]), based on similarity flooding [43]. The underlying idea is that two elements of two distinct ontologies are similar when their adjacent elements are similar. This is applied by spreading similarities, in the spirit of how IP packets flood the network in broadcast communication. A category of methods (e.g., FCA-merge [55], [12]) are called "extensional" because they are instance-based: they rely on instances of classes to compute matches between them. It is shown in [36] that the reliability of instance-based methods depends on the applications and on the kind of correspondences that are considered. The instance-based method SBI [34], with which we have compared our method, is based on the computation of Fleiss Kappa coefficients, which are symmetric and thus cannot be interpreted as probabilities on inclusion mappings. Most of the instance-based methods make use of classifiers using a corpus of labelled instances (e.g., LSD [16], SemInt [38], GLUE [17], [45]). It should be pointed out that GLUE discovers symmetric correspondences that have no formal semantics. The associated confidence values for the discovered correspondences are not real probabilities, as they are based on similarities like the Jaccard
similarity. In contrast, we consider mappings that denote inclusions between classes of different taxonomies, and we define a formal probabilistic semantics for these inclusion mappings. sPLmap [45] is an approach for schema matching, and oPLmap [46] is a variant for ontology matching, that rely on classifiers and deal with probabilities. sPLMap finds the best set of mappings M between two schemas S, T that maximizes the probability that the tuples in the rewritten schema S using M (denoted SM) are plausible in T, and vice versa. At the end, each returned mapping is associated with a probability based on the conditional probability formula. In contrast with our work, sPLmap computes probabilities by combining scores provided by several weighted classifiers (working on text or data values of tuples). Other approaches have been investigated with machine learning techniques using a corpus of schema matches (e.g., [40], [60]). The work introduced in [60] uses classifiers directly on correspondences, by representing each correspondence in a vector space constructed from instance features. A training set of true correspondences should be provided. Then, for a tested correspondence between two classes A and B, the similarities between (i) the instances of A, (ii) the instances of B, and (iii) the instances of all example correspondences allow the tested correspondence to be given a position in the correspondence space. Hence the classifier can be used to determine whether the tested correspondence is relevant or not, according to its position in the correspondence space and the learned examples. In fact, most of the existing matchers combine these elementary approaches in different ways (e.g., COMA++ [3] and COMA [15], Cupid [41], H-MATCH [8], Lily [59], S-Match [30], Clio [9]). In [28], an extension of existing matching methods is proposed by considering the k best-ranked alignments. The top-k alignment ranking is combined with the schema to match in order to generate the final alignment that globally maximizes its score.
7
Conclusion and Future Work
It is standard practice for ontology and schema matchers to associate numbers with the candidate mappings they propose. The uncertainty is intrinsic to correspondence discovery, because two classes or properties (for example) independently created are unlikely to exactly match. As stated in [54], there is still a need to better understand the foundations of modeling uncertainty, which is of primary importance for improving the detection of correspondences causing inconsistencies, e.g., via probabilistic reasoning, for identifying where user feedback is maximally useful, and for improving the quality of the interpretation of correspondences. However, those uncertainty coefficients do not have a probabilistic meaning and are just used for ranking. In contrast, our approach promotes a probabilistic semantics for mappings and provides a method to compute mapping probabilities based on the descriptions of instances categorized in each ontology. It is
important to note that even if we use classification techniques similar to those of [17], we use them for computing true probabilities and not similarity coefficients. The most distinguishing feature of our approach is that it bridges the gap between logic and probabilities by providing probabilistic models that are consistent with the logical semantics underlying ontology languages. Therefore, our approach generalizes existing works based on algebraic or logical representation of mappings as a basis for reasoning (e.g., Ctx-Match [52], Clio [9]). The work in [29] introduces a framework for modeling and evaluating automatic semantic reconciliation. It provides a formal model for semantic reconciliation and theoretically analyses the factors that impact the effectiveness of matching algorithms. The proposed formal model borrows from fuzzy set theory for handling uncertainty by combinations of similarity measures with good properties. This work is not connected to probabilities but is complementary to the approach in this thesis. The mappings that are returned by our algorithm can be exploited for mapping validation by probabilistic reasoning, in the line of what is proposed in [6]. More generally, our approach is complementary to the recent work that has been flourishing on probabilistic databases [5,11]. In particular, it fits into the general framework set in [18] for handling uncertainty in data integration, for which it provides an effective way of computing mapping probabilities. ProbaMap can be generalized to perform without instances at all, by using techniques that lead to correct estimates of P(A1) and P(A1 ∩ B2) (for a mapping A1 ⊑ B2). Such a generalization should take care that the distributions of A and of A ∩ B are considered and estimated in a probabilistic framework. For example, a linguistic resource can be used for estimating those probabilities, by taking the labels of A and B and returning probabilities for A and A ∩ B, not only coefficients. This requires that the techniques to obtain such probabilities from the linguistic resource respect some properties that fit with probability theory. The experiments that we have conducted on both real-world and controlled synthetic data have shown the feasibility and the scalability of our approach. In contrast with our approach, which prunes the search space by an appropriate ordering for generating and testing the mappings, a lot of existing approaches compute a similarity matrix for all pairs of classes. The ordering used in SBI is not logically consistent. Several complex matching systems use logical consistency as a filter or as a way to check their results, but not for pruning the search space. Up to now, the OAEI context does not focus on scalability, except for isolated cases like the anatomy 2006 dataset6, which contains large ontologies in the anatomy domain. Methods like PRIOR [42] and H-MATCH [7] have shown good performance on this dataset. In particular, the PRIOR system makes use of information retrieval techniques, by indexing a profile for each class (or any ontology entity), and by using a query answering algorithm to rank and find the best matches for a particular class by making a query about its own profile. Note that the anatomy dataset does not contain any instance.
6 http://oaei.ontologymatching.org/2006/anatomy/
Another possible approach, investigated in [31] for handling large taxonomies, is based on partitioning. In contrast, our approach is scalable while keeping the whole logical structure, which can potentially be lost by partitioning taxonomies.

Perspectives
We envision two main perspectives. First, we will study a setting for probabilistic query answering, in the spirit of probabilistic databases [18]. In such a setting, probabilities of mappings will be used by the query answering algorithm to give a probability value for each answer. In particular, it should be interesting to focus on reasoning-based query algorithms like in the Somewhere [2] setting. There are existing works on introducing probabilities in logic and in the inference process. Probability Logic [1] fits our proposed model in the way that the classical implication and the conditional are both present in the language. Probabilistic description logics [37,6] are also based on conditional formulas. Based on inference rules and their properties about probabilities, such a probabilistic and logical framework can be extended to perform probabilistic reasoning involving probabilistic mappings. Our second perspective tackles the central issue in ontology matching (see [54]) of providing a formal semantics to the coefficients returned as output of existing alignment methods. The idea is to design and implement a post-processing step in order to transform the returned coefficients into coefficients that can be interpreted as probabilities, i.e. that respect the property of monotony (Theorem 1). For this purpose, we plan to use the similarity flooding principle [43], in the spirit of N2R [51] and OLA [24]. Coefficients for mappings are initialized by those returned by the existing method to postprocess. The properties of monotony are then translated into strong influences between coefficients of mappings connected by an entailment relation. These influences between mapping coefficients are finally modeled by (non-linear) equations involving the maximum function. Like in N2R, the main issue would be to find an iterative algorithm guaranteed to converge towards a fixpoint that is the solution of the equation system.
References 1. Adams, E.: A Primer of Probability logic, CSLI. Stanford University, Stanford (1998) 2. Adjiman, P., Chatalic, P., Goasdou´e, F., Rousset, M.C., Simon, L.: Distributed reasoning in a peer-to-peer setting: Application to the semantic web. Journal of Artificial Intelligence Research (JAIR) 25, 269–314 (2006) 3. Aumueller, D., Do, H.H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. In: SIGMOD 2005: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 906–908. ACM, New York (2005) 4. Beeri, C., Dowd, M., Fagin, R., Statman, R.: On the structure of Armstrong relations for functional dependencies. Journal of the ACM (JACM) 31(1), 30–46 (1984)
5. Benjelloun, O., Sarma, A.D., Halevy, A.Y., Widom, J.: ULDBs: Databases with uncertainty and lineage. In: VLDB, pp. 953–964 (2006) 6. Castano, S., Ferrara, A., Lorusso, D., N¨ ath, T.H., M¨ oller, R.: Mapping validation by probabilistic reasoning. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 170–184. Springer, Heidelberg (2008) 7. Castano, S., Ferrara, A., Messa, G.: Results of the H-MATCH ontology matchmaker in OAEI 2006. In: Proceedings of the ISWC 2006 Workshop on Ontology Matching, Athens, GA, USA (2006) 8. Castano, S., Ferrara, A., Montanelli, S.: H-MATCH: an algorithm for dynamically matching ontologies in peer-based systems. In: SWDB, pp. 231–250 (2003) 9. Chiticariu, L., Hern´ andez, M.A., Kolaitis, P.G., Popa, L.: Semi-automatic schema integration in clio. In: VLDB, pp. 1326–1329 (2007) 10. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. The MIT Press, Cambridge (September 2001) 11. Dalvi, N.N., Suciu, D.: Answering queries from statistics and probabilistic views. In: VLDB, pp. 805–816 (2005) 12. David, J., Guillet, F., Gras, R., Briand, H.: An interactive, asymmetric and extensional method for matching conceptual hierarchies. In: EMOI-INTEROP Workshop, Luxembourg (2006) 13. Dean, M., Schreiber, G.: OWL web ontology language reference. W3C recommendation, W3C (February 2004) 14. Degroot, M.H.: Optimal Statistical Decisions (Wiley Classics Library). WileyInterscience, Hoboken (April 2004) 15. Do, H.H., Rahm, E.: COMA - a system for flexible combination of schema matching approaches. In: VLDB (2002) 16. Doan, A., Domingos, P., Levy, A.Y.: Learning mappings between data schemas. In: Proceedings of the AAAI 2000 Workshop on Learning Statistical Models from Relational DatA (2000) 17. Doan, A., Madhavan, J., Domingos, P., Halevy, A.Y.: Learning to map between ontologies on the Semantic Web. In: WWW, pp. 662–673 (2002) 18. Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. In: VLDB, pp. 687–698 (2007) 19. Duchon, P., Flajolet, P., Louchard, G., Schaeffer, G.: Boltzmann samplers for the random generation of combinatorial structures. Comb. Probab. Comput. 13(4-5), 577–625 (2004) 20. Euzenat, J., Ferrara, A., Hollink, L., Isaac, A., Joslyn, C., Malais, V., Meilicke, C., Nikolov, A., Pane, J., Sabou, M., et al.: Results of the ontology alignment evaluation initiative 2009. In: Fourth International Workshop on Ontology Matching, Washington, DC (2009) 21. Euzenat, J.: Semantic Precision and Recall for Ontology Alignment Evaluation. In: IJCAI, pp. 348–353 (2007) 22. Euzenat, J.: Ontology alignment evaluation initiative (July 2008), http://www.oaei.ontologymatching.org/ 23. Euzenat, J., Shvaiko, P.: Ontology matching. Springer, Heidelberg (2007) 24. Euzenat, J., Valtchev, P.: Similarity-based ontology alignment in OWL-Lite. In: ECAI, pp. 333–337 (2004) 25. Fagin, R.: Horn clauses and database dependencies. J. ACM 29(4), 952–985 (1982) 26. Fellbaum, C.: WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press, Cambridge (May 1998)
27. Flake, G.W., Lawrence, S.: Efficient SVM regression training with SMO. Mach. Learn. 46(1-3), 271–290 (2002) 28. Gal, A.: Managing uncertainty in schema matching with top-k schema mappings. Journal on Data Semantics 6 (2006) 29. Gal, A., Anaby-Tavor, A., Trombetta, A., Montesi, D.: A framework for modeling and evaluating automatic semantic reconciliation. The VLDB Journal 14(1), 50–67 (2005), http://www.portal.acm.org.gate6.inist.fr/ citation.cfm?id=1053477 30. Giunchiglia, F., Shvaiko, P., Yatskevich, M.: S-match: an algorithm and an implementation of semantic matching. In: Bussler, C.J., Davies, J., Fensel, D., Studer, R. (eds.) ESWS 2004. LNCS, vol. 3053, pp. 61–75. Springer, Heidelberg (2004) 31. Hamdi, F., Safar, B., Reynaud, C., Zargayouna, H.: Alignment-based Partitioning of Large-scale Ontologies. In: Guillet, F., Ritschard, G., Zighed, D.A., Briand, H. (eds.) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol. 292, pp. 251–269. Springer, Heidelberg (2010), http://www.hal.inria.fr/inria-00432606/en/ 32. Hamdi, F., Zargayouna, H., Safar, B., Reynaud, C.: TaxoMap in the OAEI 2008 alignment contest. In: Ontology Alignment Evaluation Initiative (OAEI) 2008, Campaign - Int. Workshop on Ontology Matching (2008) 33. Hayes, P. (ed.) RDF Semantics. W3C Recommendation, World Wide Web Consortium (February 2004), http://www.w3.org/TR/rdf-mt/ 34. Ichise, R., Takeda, H., Honiden, S.: Integrating multiple internet directories by instance-based learning. In: International Joint Conference on Artificial Intelligence (IJCAI), vol. 18, pp. 22–30 (2003) 35. Ichise, R., Hamasaki, M., Takeda, H.: Discovering relationships among catalogs. In: Suzuki, E., Arikawa, S. (eds.) DS 2004. LNCS (LNAI), vol. 3245, pp. 371– 379. Springer, Heidelberg (2004), http://www.citeseerx.ist.psu.edu/viewdoc/ summary?doi=10.1.1.121.5336 36. Isaac, A., van der Meij, L., Schlobach, S., Wang, S.: An empirical study of instancebased ontology matching. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudr´e-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 253–266. Springer, Heidelberg (2007) 37. Koller, D., Levy, A., Pfeffer, A.: P-CLASSIC: a tractable probablistic description logic. In: Proceedings of the National Conference on Artificial Intelligence, pp. 390–397 (1997) 38. Li, W.S., Clifton, C.: SEMINT: a tool for identifying attribute correspondences in heterogeneous databases using neural networks. Data Knowl. Eng. 33(1), 49–84 (2000) 39. Lin, F., Sandkuhl, K.: A survey of exploiting wordnet in ontology matching. In: Artificial Intelligence in Theory and Practice II, pp. 341–350 (2008) 40. Madhavan, J., Bernstein, P.A., Doan, A., Halevy, A.: Corpus-based schema matching. In: International Conference on Data Engineering, pp. 57–68 (2005) 41. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with cupid. The VLDB Journal, 49–58 (2001), http://www.citeseer.ist.psu.edu/ madhavan01generic.html 42. Mao, M., Peng, Y.: PRIOR system: Results for OAEI 2006. In: Proceedings of the Ontology Alignment Evaluation Initiative, pp. 165–172 (2006) 43. Melnik, S., Garcia-Molina, H., Rahm, E., et al.: Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In: Proceedings of the International Conference on Data Engineering, pp. 117–128 (2002)
Discovery of Probabilistic Mappings between Taxonomies
101
44. Mitchell, T.: Machine Learning. McGraw-Hill Education (ISE Editions) (1997), http://www.amazon.ca/exec/obidos/ redirect?tag=citeulike09-20&path=ASIN/0071154671 45. Nottelmann, H., Straccia, U.: Information retrieval and machine learning for probabilistic schema matching. Information Processing and Management 43(3), 552–576 (2007) 46. Nottelmann, H., Straccia, U.: A probabilistic, logic-based framework for automated web director alignment. In: Ma, Z. (ed.) Soft Computing in Ontologies and the Semantic Web. Studies in Fuzziness and Soft Computing, vol. 204, pp. 47–77. Springer, Heidelberg (2006) 47. Quinlan, R.J.: C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning. Morgan Kaufmann, San Francisco (January 1993), http:// www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20\&path=ASIN/ 1558602380 48. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB Journal 10(4), 334–350 (2001) 49. Ramesh, G., Maniatty, W., Zaki, M.J.: Feasible itemset distributions in data mining: theory and application. In: PODS, pp. 284–295 (2003) 50. Resnik, P.: Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11(95), 130 (1999) 51. Sa¨ıs, F., Pernelle, N., Rousset, M.C.: Combining a logical and a numerical method for data reconciliation. In: Spaccapietra, S. (ed.) Journal on Data Semantics XII. LNCS, vol. 5480, pp. 66–94. Springer, Heidelberg (2009) 52. Serafini, L., Bouquet, P., Magnini, B., Zanobini, S.: An algorithm for matching contextualized schemas via SAT. In: Proceedings of CONTEXT 2003 (2003) 53. Shvaiko, P., Euzenat, J.: A survey of schema-based matching approaches. In: Spaccapietra, S. (ed.) Journal on Data Semantics IV. LNCS, vol. 3730, pp. 146–171. Springer, Heidelberg (2005) 54. Shvaiko, P., Euzenat, J.: Ten challenges for ontology matching. In: Meersman, R., Tari, Z. (eds.) OTM 2008, Part II, pp. 1164–1182. Springer, Heidelberg (2008) 55. Stumme, G., Maedche, A.: FCA-MERGE: Bottom-Up Merging of Ontologies. In: Proc. of the 17th International Joint Conference on Artificial Intelligence, pp. 225–234 (2001) 56. Tournaire, R., Petit, J.M., Rousset, M.C., Termier, A.: Discovery of Probabilistic Mappings between Taxonomies: Principles and Experiments (technical report) (2009), http://www.membres-liglab.imag.fr/tournaire/longpaper.pdf 57. Tournaire, R., Rousset, M.C.: D´ecouverte automatique de correspondances entre taxonomies - internal report (in french) (2008), http://www.membres-liglab.imag.fr/tournaire/irap08.pdf 58. Van Rijsbergen, C.J.: Information retrieval. Butterworths, London (1975) 59. Wang, P., Xu, B.: Lily: Ontology alignment results for OAEI 2009. Shvaiko, et al [SEG+ 09] (2009) 60. Wang, S., Englebienne, G., Schlobach, S.: Learning concept mappings from instance similarity. In: Sheth, A.P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T., Thirunarayan, K. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 339–355. Springer, Heidelberg (2008), http://www.portal.acm.org.gate6.inist.fr/citation.cfm?id=1483184 61. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
TwigTable: Using Semantics in XML Twig Pattern Query Processing

Huayu Wu, Tok Wang Ling, Bo Chen, and Liang Xu
School of Computing, National University of Singapore
{wuhuayu,lingtw,chenbo,xuliang}@comp.nus.edu.sg
Abstract. In this paper, we demonstrate how semantic information, such as value, property, object class and relationship between object classes in XML data, impacts XML query processing. We show that the lack of such semantics causes different problems in value management and content search in existing approaches. To solve these problems, we propose a semantic approach for XML twig pattern query processing. In particular, we design the TwigTable algorithm to incorporate property and value information into query processing. This information can be correctly discovered in any XML data. In addition, we propose three object-based optimization techniques for TwigTable. If more semantics of object classes is known in an XML document, we can process queries more efficiently with these semantic optimizations. Last, we show the benefits of our approach by a comprehensive experimental study.¹

¹ This is an extended and updated version of the previously published paper [30]. The major extension includes a discussion on semantics in XML, a technique to process queries across multiple twig patterns, discussions on potential problems of optimized tables, a new optimization scheme with semantics of relationship, and a comparison with the schema-aware relational approach in experiments.

1 Introduction

XML query processing has been studied for over a decade. In most XML query languages, e.g., XPath [4] and XQuery [5], queries are expressed as twig patterns. Finding all occurrences of a twig pattern query in an XML document is considered the core operation of XML query processing. This process is also referred to as twig pattern query processing or twig pattern matching. More background on XML queries and twig pattern matching is discussed in Section 2.1. Normally an XML query is composed of structural search and content search. Consider an XPath query Q1 that finds the subject name of the book with the title “Network”, issued to the XML data in Fig. 1:

Q1: //subject[//book/title=“Network”]/name

In this query, //subject[//book/title]/name is a structural search, aiming to find all matches in the document that satisfy this structural constraint, while the predicate title=“Network” is a content search, which filters the structural search
result based on the specified value comparison. Most state-of-the-art XML query processing approaches focus only on employing effective indexes, e.g., inverted lists (details shown in Section 2.2), to improve the efficiency of performing joins between nodes in more complex twig pattern queries, without distinguishing between structural search and content search. This attempt is proven efficient for structural search when values are not considered. However, due to the different characteristics of leaf value nodes and internal non-value nodes in XML data, using inverted lists to manage values and to process content search in the same way as structural search causes problems in managing a tremendous number of inverted lists and in performing costly structural joins for content search.

Besides the inefficiency in content search, which is caused by ignoring the semantic information of value and non-value nodes, existing approaches may also suffer from efficiency problems in structural search. Looking deeper into the semantics of non-value document nodes, we can find that a non-value document node may further correspond to an object or a property. Most real-life queries aim to find desired objects based on value predicates on their properties. However, none of the existing approaches takes the semantics of object and property into account when it manages the inverted list index and processes queries. Ignoring such semantics results in scanning many useless labels in inverted lists during structural search. The details are discussed in Section 3.2 and Section 5.1.

In this paper, we propose a semantic approach for twig pattern query processing. To address the problems caused by the lack of semantics of object, property and value in existing approaches, we propose semantics-based relational tables incorporated with inverted lists of tags to aid twig pattern query processing. In particular, relational tables are used to store values, while inverted lists are used to index internal document nodes, including property nodes and object nodes, but not value nodes. We design the TwigTable algorithm to perform content search and structural search separately with the two kinds of indexes in twig pattern matching. Content search is performed by table selection before structural search. Because a content search is always a predicate between a property and a value, after performing content search the size of the inverted list of the relevant property node is reduced due to the selectivity of the predicate, and the twig pattern query can be simplified by removing value comparisons. Matching a simplified twig pattern with reduced inverted lists for several query nodes reduces the complexity of structural search, and thus improves the twig pattern matching performance. Finally, the semantic tables help to extract actual values to answer queries that ask for property values or object details, which is not efficient to achieve in other structural join based algorithms.

We also need to highlight that the relational tables are constructed based on semantic information such as the relationship among object, property and value. The semantics of property is common to any XML document, i.e., the parent node of each value must be the property of that value. Based on this default semantic information, we initially store each value with its property in the corresponding property table. When more of an object’s semantics is known,
Fig. 1. The bookstore document with all nodes labeled by containment scheme
we propose three optimization techniques to change the tables to be object based. We will show that, using object-based tables, a query can be processed even more efficiently. In short, the more semantic information is known, the more efficiently our approach can process twig pattern queries. The rest of the paper is organized as follows. We first describe some background information in Section 2. After that we revisit related work and discuss the motivation of our research in Section 3. The TwigTable algorithm with three semantic optimizations is presented in Sections 4 and 5. We present the experimental results in Section 6 and conclude our work in Section 7.
2 Background

2.1 Data Model and Twig Pattern Query
Normally an XML document is modeled as an ordered tree, without considering ID references. Fig. 1 (ignoring node labels) shows the tree structure of a bookstore document. In an XML tree, the internal nodes represent the elements and attributes in the document, and the leaf nodes represent the data values, which are either the text in an element or an attribute value. Thus a node name is a tag label, an attribute name or a value. Edges in an XML tree reflect element-subelement, element-attribute, element-value, and attribute-value pairs. Two nodes connected by a tree edge are in a parent-child (PC) relationship, and two nodes on the same path are in an ancestor-descendant (AD) relationship. The core query pattern in most standard XML query languages (e.g., XPath and XQuery) is also in a tree-like structure, which is often referred to as a twig pattern. In particular, an XPath query is normally modeled as a twig pattern query, and an XQuery query is normally modeled as several twig patterns
Fig. 2. Example twig pattern expressions: (a) the twig pattern for XPath query Q1; (b) XQuery query Q2, expressed by joining the two twig patterns t1 and t2 based on the author value
linked by joins. For example, the XPath query Q1 in Section 1, i.e., //subject[//book/title=“Network”]/name, can be represented as a twig pattern query in Fig. 2(a). Example 1 shows the twig patterns for an XQuery expression.

Example 1. A query to find the titles of all books written by the author of the book “Network” can be expressed by XQuery:

Q2: For $a in distinct-values(doc(“bookstore.xml”)//book[title=“Network”]/author)
    For $b in doc(“bookstore.xml”)//book
    Where $b/author=$a
    Return
by the relationship between the corresponding document nodes. Matching Q to T returns a list of n-ary tuples to answer Q, where n is the number of nodes in Q and each tuple (a1, a2, ..., an) consists of the document nodes a1, a2, ..., an, which identify a distinct match of Q in T.

2.2 Document Labeling and Inverted List
Checking whether two document nodes satisfy the PC or AD relationship specified in a twig pattern query is essential to twig pattern query processing. To facilitate the PC and AD relationship checking, we normally assign a positional label (label for short, if no confusion arises) to each document node. There are multiple labeling schemes proposed for XML data, among which the containment scheme [37] and the prefix scheme [26] are the most popular. There are also several dynamic encoding schemes [16] [33] that avoid re-labeling for dynamic documents. In this paper, we use the containment labeling scheme for illustration, as shown in Fig. 1. It can be replaced by other schemes as long as they are sufficient for PC and AD relationship determination. Labels are usually organized in inverted lists. Normally, for each type of document node, there is a corresponding inverted list that stores the labels of all nodes of this type in document order. To process a query, only the relevant inverted lists are scanned to perform structural joins based on the query constraints. Because in most algorithms an inverted list is scanned in a streaming fashion, it is also referred to as a label stream, or simply a stream, in some work.
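To make the relationship checks concrete, the following sketch (our own illustration in Java, not code from the paper) shows how the AD and PC tests on containment labels of the form (start:end, level) reduce to simple integer comparisons; the sample labels are taken from Fig. 1.

// Containment label (start:end, level), e.g., book (8:29,4) and title (13:16,5) in Fig. 1.
final class Label {
    final int start, end, level;
    Label(int start, int end, int level) { this.start = start; this.end = end; this.level = level; }

    // Ancestor-descendant (AD): the ancestor's interval strictly contains the descendant's.
    boolean isAncestorOf(Label d) { return start < d.start && d.end < end; }

    // Parent-child (PC): containment plus a level difference of exactly one.
    boolean isParentOf(Label c) { return isAncestorOf(c) && c.level == level + 1; }
}

class ContainmentDemo {
    public static void main(String[] args) {
        Label book  = new Label(8, 29, 4);   // book  (8:29,4)
        Label title = new Label(13, 16, 5);  // title (13:16,5)
        System.out.println(book.isAncestorOf(title)); // true
        System.out.println(book.isParentOf(title));   // true
    }
}

Structural join algorithms such as TwigStack rely on exactly these two tests while merging label streams.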
2.3 Semantics of Object, Property and Value in XML Data
Object (or entity) is an important information unit in data management, e.g., in the ER approach to designing a relational database. In data-centric XML databases, the object also plays an important role, as most XML queries ask about information of objects. An object normally has several properties to describe it from different aspects. A property is also referred to as an attribute. To differentiate it from the attribute type in a document schema (e.g., DTD), we use the term property in this paper. In our work, we use the semantics of value, property, object and relationship among objects to improve the performance of XML twig pattern query processing. Usually, such semantic information can be provided by the XML database designer. Next we briefly discuss how to discover the semantic information, in case it is not available from database designers. Identifying values is trivial. Each non-tag text in an XML document is a value, and it must appear as a leaf node in the corresponding document tree. Furthermore, we can also infer that the parent node of each value node in an XML tree corresponds to the property of that value. For example, in Fig. 1 “Network” is a value and its parent node title is the property of this value. This inference of value and property always holds for any document, regardless of whether more semantics is provided or not. Thus our basic algorithm TwigTable constructs relational tables based on property and value. Semantics on object contributes to our optimizations to further improve query processing performance, as presented in Section 5. However, to discover the semantics
of object and relationship between objects is not so trivial. Intuitively, we may consider the parent node of each property as the corresponding object. For example, in the document in Fig. 1, the subject nodes and the book nodes are all objects, because they are parent nodes of properties such as name, publisher, title, etc. However, in some documents, properties do not directly connect to their objects. For example, if the document further groups the properties of a book as shown in the two examples in Fig. 3, the parent node of a property may not be the associated object.
Fig. 3. Two alternative designs of book in the bookstore document
Two or more objects that are related to each other normally reside along the same path in an XML tree. However, the properties of a relationship are not easy to identify. Because of the hierarchical structure of XML data, a property of the relationship between two objects is normally stored under the deeper object of the relationship. This makes it difficult to distinguish such a property from the properties of that deeper object. Later, in Fig. 11, we show an example document in which quantity is a property of the relationship between branch and book, but this property appears as a child node of the deeper object book, together with other book properties. The situation may be even more complex when dealing with ternary or n-ary relationships.

There are many attempts to discover semantics such as objects and relationships between objects in an XML document. Generally, they are categorized into three classes.

1. Using available tools and information. There are semantically rich models that work as a schema for XML documents. For example, the ORA-SS [17] model can distinguish between properties, objects and relationships, as well as specify the degree of n-ary relationships and indicate whether a property belongs to an object or a relationship. If such a model is available alongside an XML document, we can easily discover useful semantic information. Also, as the semantic web develops rapidly, there are many ontologies available for different domains. By exploring the ontologies of relevant domains, we can obtain the desired semantic information of an XML document.

2. Mining schema or document. Liu et al. [18] infer objects by analyzing DTDs. [9] and [35] propose algorithms to discover semantics such as keys and
functional dependencies in XML documents, which can help to identify objects and relationships. There are also many data mining techniques, such as decision trees, that can be adopted to infer semantic information.

3. User interaction. Many web-based information management systems [10] [25] use mass collaboration to seek feedback from users to improve semantics identification.

In this paper, we do not offer any new technique to discover objects and relationships in an XML document. All the existing semantics discovery techniques can be used in our work for semantic optimizations. The guideline is that our algorithm is initially built on the semantics of property and value. This information can be correctly inferred from the document structure. When more semantics on objects is known, we can incorporate such information into relational table construction, to further improve the performance of query processing.
3 Related Work and Motivation

3.1 Related Work
In the early stages, there were many research efforts on storing and querying XML data using an RDBMS. Those relational approaches shred XML data into relational tables, and convert XML queries into SQL statements to query the database. The node-based approach [37] [13], the edge-based approach [11] and the path-based approach [34] [22] shred XML documents based on different kinds of tree components. However, they all suffer from efficiency problems when dealing with structural search. The node-based approach and the edge-based approach need too many costly table joins to process a twig pattern query. The path-based approach is not efficient at handling the “//” axis. The schema-aware decomposition methods [24] [6] are proven more efficient than schemaless methods [27], but they are still not efficient for structural search when the document structure is complex. For example, consider a fragment
based on the containment labeling of an XML document, and showed its superiority over relational approaches. Later, an improved stack-based structural join algorithm was proposed by Al-Khalifa et al. [3]. These two algorithms, as well as most prior works, decomposed a twig pattern into a set of binary relationships, i.e., parent-child and ancestor-descendant relationships. Twig pattern matching is done by matching every binary relationship and combining these basic binary matches. The main problem of such approaches is that the intermediate result size may be very large, even when the input and final result sizes are more manageable. To overcome this limitation, Bruno et al. [7] proposed a holistic twig join algorithm, TwigStack, which avoids producing an unnecessarily large intermediate result. However, this algorithm is only optimal for twig pattern queries with only ancestor-descendant relationships. There are many subsequent works [19] [15] [8] [20] [14] [36] that optimize TwigStack, or extend TwigStack to solve different kinds of problems. In particular, Lu et al. [19] introduced a list structure to make it optimal for queries containing parent-child relationships under non-branching nodes. TSGeneric [15] improved the query performance by indexing each inverted list and skipping labels within one list. Chen et al. [8] divided an inverted list into several sub-lists associated with each prefix path or each (tag, level) pair and pruned some sub-lists before evaluating the twig pattern. Lu et al. [20] used the Extended Dewey labeling scheme and scanned only the labels of leaf nodes in a twig query. [14] and [36] extended the twig pattern query to support OR-predicates and NOT-predicates, respectively. However, all these structural join based works focus only on structural search. For the value node in each query predicate, they normally treat it the same as an element node and perform structural joins for the whole query structure. As a result, they suffer from several problems, as mentioned in the next section.

3.2 Motivation
The structural join based approaches are proven efficient in structural search. However, because they do not consider the semantics of values and other types of document nodes, they suffer from several problems during query processing.

1. Inverted list management. In most structural join based approaches, all nodes, including elements, attributes and values in an XML tree, are labeled and the labels of each type of node are organized in an inverted list. When we build inverted lists for values, the number of different values causes a problem of managing a tremendous number of inverted lists. Based on our investigation, a 100MB XML document contains over 300 thousand different values, which correspond to 300 thousand inverted lists. This number increases linearly with the document size.

2. Advanced content search. Since a twig pattern query normally models an XPath expression, the advanced content searches, such as numeric range search, containment search or even conjunction/disjunction of several value comparisons, which often appear in XPath query predicates, may also appear as leaf nodes in a twig pattern query. Without handling values specially, existing approaches have difficulty supporting these advanced content searches.
For example, to process a query that finds the books with a price greater than 15, it is time consuming to get all the inverted lists whose numeric names are greater than 15, and to combine the labels in them by document order, in order to perform this range search. Also, structural joins with inverted lists can hardly support containment search, such as //book[contains(@title, ‘XML query’)]/price.

3. Redundant search in inverted lists. Inverted lists for values do not have semantic meanings. This may cause redundant search during inverted list scanning. For example, when a query is interested in books with the price of 35 in the bookstore document, structural search scans the inverted list for the value node ‘35’ (denoted by T35). Since in T35 we do not differentiate whether a label corresponds to price or quantity, we need to check all labels in this inverted list, though many of them stand for a quantity of 35 and definitely do not contribute to the query result.

4. Actual value extraction. To answer a query, what we need is not twig pattern occurrences represented as tuples of labels, but value results. For example, after finding a number of occurrences of the twig pattern query in Fig. 2(a), we need to know the value under each name node. One major advantage of the structural join based approaches is that they only need to load relevant inverted lists to process a query, instead of scanning the whole document with high I/O cost. However, when a query asks for the values of a certain property, after getting a set of resulting labels of that property from pattern matching, they cannot find the child value under each label using inverted lists. To extract actual values, they have to read the document again, which violates the initial objective of I/O saving.

Motivated by all these problems, we propose a semantic approach that uses both inverted lists and relational tables to perform twig pattern matching.
4 TwigTable Algorithm

4.1 An Overview of TwigTable
In TwigTable, we pay attention to the semantics of values during index construction. We maintain the inverted list index only for structural nodes (internal nodes). Each value node is stored in a relational table index² with the label of its property node, instead of being labeled and put into an inverted list as other approaches do. Then the number of inverted lists is limited to the number of different element/attribute types in the document. Query processing in TwigTable includes three steps. In the first step, we perform content search for the value comparisons in query predicates, using relational tables. After that, the query is rewritten by removing all value comparisons. In the second step, we perform structural search for the simplified query pattern, using any structural join algorithm, e.g., TwigStack. The last step is to return results to users. If the output node is a property type, the value results can be extracted from tables, instead of accessing the original document.

² To avoid the overhead of maintaining relational tables, the relational table index can also be replaced by other types of index that bidirectionally map values and their properties. However, the trade-off is the inconvenience of supporting advanced content search and of meeting the requirements of our object-related optimizations.

4.2 Document Parsing in TwigTable
When we parse an XML document, we only label elements and attributes, and put the labels into the corresponding inverted lists. Values in the document are not labeled; instead, we put them into relational tables together with the labels of their parent property nodes. Normally this parsing step is only executed once for an XML document, and after all relevant indexes are properly built during the parsing step, the system is ready to process twig pattern queries over the given document. The detailed algorithm Parser is presented in Algorithm 1.

Algorithm 1. Parser
Input: A SAX stream of the given XML document
Output: A set of inverted lists and a set of relational tables
1:  initialize Stack S
2:  while there are more events in SAX stream do
3:    let e = next event
4:    if e is a start tag then
5:      //step 1: label elements
6:      create object o for e
7:      assign label to o
8:      push o onto S
9:      for all attributes attr of e do
10:       //attributes are parsed in the same way as elements
11:       assign label to attr
12:       put label of attr into the inverted list Tattr
13:       insert the label of attr and the value of attr into the table Rattr
14:     end for
15:     //step 2: put labels of elements into inverted lists
16:     put label of o into the inverted list Te
17:   else if e is a value then
18:     set e to be the child value of the top object in S
19:   else if e is an end tag then
20:     //step 3: insert values with their parent element into tables
21:     pop o from S
22:     if o contains a child value then
23:       insert label of o together with its child value into table Re
24:     end if
25:   end if
26: end while
We use SAX to read the input document and transform each tag and value into events. Line 3 captures the next event if there are more events in the SAX stream. Based on the type of event, different operations are performed. Lines 4-16 are executed if the event e is a start tag. In this case, the first two steps are triggered. The system first constructs an object for this element and assigns a label to it. It then puts the label into the inverted list for that tag. A stack S is used to temporarily store the object so that when an end tag is reached, the system can easily tell on which object the operation should be executed. In lines 9-14, the system analyzes the attributes of an element, if any. Following the same steps, it labels the attributes and puts the labels into inverted lists. The attribute values are treated in the same way as element
values. Lines 17-18 handle the case where the event is a value. The value is simply bound to the top object in S for later table insertion. When the event is an end tag, in lines 19-25, the last step is performed: the top object o is popped from S, and the label of o together with its value is inserted into the relational table for o, if it has a value.

Example 2. After parsing the bookstore document, the new labeled document tree is shown in Fig. 4. Compared to the document tree in Fig. 1, we can see that TwigTable does not label value nodes. The advantages of this attempt are reported in Section 6. (A simplified code sketch of this parsing step follows Fig. 4.)
Fig. 4. The bookstore document with only internal nodes labeled, during TwigTable parsing
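As a concrete illustration of the parsing step, the sketch below is our own simplified Java rendering of Algorithm 1 (not the authors' implementation): it labels internal nodes with the containment scheme, keeps inverted lists and property tables as in-memory maps, and omits attribute handling (lines 9-14) for brevity.

import java.util.*;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Simplified Algorithm 1: only internal nodes receive containment labels; leaf values go to property tables.
public class TwigTableParser extends DefaultHandler {
    static final class Label {
        int start, end, level;
        @Override public String toString() { return "(" + start + ":" + end + "," + level + ")"; }
    }
    private static final class Open {
        final String tag; final Label label; String value;
        Open(String tag, Label label) { this.tag = tag; this.label = label; }
    }

    private int counter = 0;
    private final Deque<Open> stack = new ArrayDeque<>();
    // Inverted lists: tag name -> labels of internal nodes, in document order.
    final Map<String, List<Label>> invertedLists = new HashMap<>();
    // Property tables: property name -> (label, value) rows, mirroring Fig. 5.
    final Map<String, List<Map.Entry<Label, String>>> propertyTables = new HashMap<>();

    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        Label l = new Label();
        l.start = ++counter;
        l.level = stack.size() + 1;
        invertedLists.computeIfAbsent(qName, k -> new ArrayList<>()).add(l);
        stack.push(new Open(qName, l));
    }

    @Override public void characters(char[] ch, int offset, int len) {
        String text = new String(ch, offset, len).trim();
        if (!text.isEmpty() && !stack.isEmpty()) stack.peek().value = text; // bind value to its property
    }

    @Override public void endElement(String uri, String local, String qName) {
        Open o = stack.pop();
        o.label.end = ++counter;
        if (o.value != null)                                   // a property node with a child value
            propertyTables.computeIfAbsent(o.tag, k -> new ArrayList<>())
                          .add(new AbstractMap.SimpleEntry<>(o.label, o.value));
    }

    public static void main(String[] args) throws Exception {
        TwigTableParser handler = new TwigTableParser();
        SAXParserFactory.newInstance().newSAXParser().parse(new java.io.File(args[0]), handler);
        System.out.println(handler.propertyTables.get("title")); // rows like (9:10,5)=Network
    }
}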
Some example relational tables which store data values are shown in Fig. 5. The name of each table is a property name, and each table contains two fields: the label of the property node and the corresponding child value. We call these tables property tables.

Rpublisher                 Rtitle                     Rauthor
label      value           label      value           label      value
(7:8,5)    Hillman         (9:10,5)   Network         (11:12,5)  Green
(19:20,5)  Elco            (21:22,5)  Database        (23:24,5)  White
...        ...             ...        ...             (25:26,5)  Brown
                                                      ...        ...

Fig. 5. Example property tables
Similar to inverted lists, relational tables also play an index role to aid twig pattern query processing. As a result, how to cope with document updates is an important issue for both inverted lists and relational tables. Inverted lists
have been widely used and well studied for years. There are different ways to maintain an inverted list under updates, e.g., employing a B+ tree on each list. Here we discuss the maintenance of relational tables when the XML document is updated. Actually, the relational tables are easy to maintain. If the document is dynamic with frequent updates, we can adopt a dynamic labeling scheme to label the document so that an update will not cause re-labeling of the remaining document nodes. Thus, any deletion or insertion only requires updating the relevant tuple, without affecting other tuples.

4.3 Query Processing in TwigTable
As mentioned in Section 4.1, query processing in TwigTable contains three steps: content search and query rewriting, structural search, and value extraction. Theoretically, the first two steps, i.e., content search and structural search, can be reordered. The reason why we perform content search before structural search is that content search normally results in high selectivity. By performing content search first, we can reduce the complexity of structural joins. This is similar to selection push-ahead in relational query optimizers. The pseudo-code of TwigTable query processing is presented in Algorithm 2.

Algorithm 2. TwigTable query processing
Input: A query Q and necessary inverted lists and relational tables
Output: A set of value results answering Q
1:  //step 1: perform content search, construct new inverted lists and rewrite the query
2:  while there are more value comparisons in predicates of Q do
3:    let c be the next value comparison, and p be its property (parent element or attribute)
4:    create a new inverted list Tp for p
5:    select the labels based on c from the table Rp, and sort the resulting labels by document order
6:    put the selected labels into Tp
7:    rewrite the query to replace the sub-structure p/c by p
8:  end while
9:  //step 2: perform structural search on the rewritten query with new inverted lists
10: process the rewritten pattern of Q using any existing efficient structural join algorithm like TwigStack, to get labels for output nodes
11: remove newly created inverted lists
12: //step 3: return query answer
13: extract actual values with labels from corresponding tables, if the output node is a property node; otherwise access the document to return subtrees
We first perform content search in lines 2-8. The algorithm recursively handles all value comparisons in two phases: creating new inverted lists based on the predicates and rewriting the query to remove the processed value comparisons. In more detail, lines 3-6 execute an SQL selection in the corresponding property table based on each value comparison, and then put all the selected labels, which satisfy the value comparison, into the new inverted list for the corresponding property node. Line 7 rewrites the query in such a way that every value comparison and its parent property are replaced by a new query node that has the same name as the corresponding new inverted list. The second step, in lines 10-11, uses TwigStack or another efficient structural join algorithm to process the simplified query with the new inverted lists. Last, in line 13, we return the result
based on the labels of the output nodes. In particular, we can extract actual values from the corresponding table, if the output node is a property node.

Example 3. We use the twig pattern query in Fig. 2(a) to illustrate how TwigTable works. In the first step, TwigTable identifies that the only predicate with a value comparison is title=“Network”. During content search, TwigTable executes an SQL selection in the table Rtitle to get all the labels of the element title that have a value of “Network”. Then we put the selected labels into a new inverted list for title, Ttitle, and rewrite the twig pattern query to replace the sub-structure of the node title and its child node “Network” by title’, as shown in Fig. 6(a). The new query node title’ corresponds to the newly created inverted list Ttitle, in which all labels satisfy the constraint title=“Network”. To clearly denote title’ in the rewritten query, we write titleNetwork in Fig. 6(a). Finally we use a twig pattern matching algorithm, e.g., TwigStack, to process the rewritten query in Fig. 6(a), with the inverted list Ttitle for the node titleNetwork. (A sketch of this content-search step follows Fig. 6.)
Fig. 6. A rewritten query example (a) and an invalid twig pattern query example (b)
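To make the content-search step of Example 3 concrete, the sketch below is our own illustration: the table name Rtitle and its label/value columns follow Fig. 5, while the JDBC URL and the storage of labels as strings of the form "(start:end,level)" are assumptions.

import java.sql.*;
import java.util.*;

// Step 1 of Algorithm 2: evaluate title="Network" against the property table Rtitle
// and build the reduced inverted list for the rewritten node title'.
public class ContentSearch {
    static List<String> selectLabels(Connection conn, String table, String predicate)
            throws SQLException {
        List<String> labels = new ArrayList<>();
        String sql = "SELECT label FROM " + table + " WHERE " + predicate;
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) labels.add(rs.getString("label"));
        }
        // Restore document order by sorting on the start component of "(start:end,level)".
        labels.sort(Comparator.comparingInt(l -> Integer.parseInt(l.substring(1, l.indexOf(':')))));
        return labels;
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(args[0])) {     // JDBC URL of the table store
            List<String> tTitle = selectLabels(conn, "Rtitle", "value = 'Network'");
            System.out.println(tTitle);                                    // e.g., [(9:10,5)]
        }
    }
}

The query node title and its child value node are then replaced by a single node bound to this list, so the subsequent structural search joins fewer and shorter label streams.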
As described in Section 2.1, a twig pattern query is an intermediate query representation of formal XML query languages, e.g., XPath and XQuery. Since in the predicate of an XPath or XQuery query a value must be linked to a property through an operator, e.g., price>20, the value comparison in a twig pattern query must be a child (‘/’), rather than a descendant (“//”), of an internal query node. The twig pattern expression shown in Fig. 6(b) is invalid, as the value comparison follows a “//” edge. Semantically, this query cannot be well interpreted; and practically, this query will never appear in XPath or XQuery expressions. We do not consider such invalid twig patterns, and thus our algorithm can perform any content search using property tables. Note that TwigTable saves I/O cost in value extraction when the output node is a property node, because we do not need to visit the original document, but only access the relevant relational tables to find values. However, if the output node matches internal nodes with subelements in the document, the result should be the whole subtree rooted at each matched node, instead of a single value. In this case, TwigTable has no advantage in result return over other approaches.

4.4 Analysis of TwigTable
Label and inverted list management. TwigTable combines values with their parent elements, and avoids labeling value nodes separately. Then the
number of labeled nodes in memory will be greatly reduced. Moreover, TwigTable puts values into relational tables, instead of maintaining separate inverted lists for them. Thus the problem of managing a tremendous number of inverted lists in previous work is solved.

Content search. TwigTable organizes values in tables based on their property semantics. When the value in a query predicate has different semantic meanings, i.e., corresponds to different properties, TwigTable only accesses the correct property table to search the value. In contrast, other approaches have to scan all such values to perform structural joins, though many of them correspond to other properties and definitely do not contribute to the result.

Inverted list searching reduction. Performing content search before structural search in TwigTable can significantly reduce the size of the relevant inverted lists. Consider the query in Fig. 2(a) and assume there is only one book called “Network”. If the number of different books is b, the size of the inverted list for the property title is also b in previous approaches, and we need O(b) time to scan all the labels in the inverted list for title. TwigTable processes the selection in advance, so that the new inverted list for title is created based on the value “Network”; in this case the new inverted list contains only one label under our assumption. Generally, when the selectivity of a property is high, as in this example, TwigTable can significantly improve the efficiency of structural search by greatly reducing the inverted list size for this property.

Advanced search support. Since TwigTable can use any existing RDBMS to manage property tables, all the advanced searches supported by the relational system are also supported in TwigTable.

We can observe that sequential scans and structural joins over the labels of both property nodes and value nodes in previous work are replaced by selections in semantic tables in TwigTable. In any relational database system, such table selections can be done very efficiently, so it is not surprising that replacing structural joins by selections for content search improves the overall performance. Generally, TwigTable gains its benefit from performing content search ahead of structural search, which reduces the complexity of structural search. Thus most advantages discussed in this section hold only for queries with value predicates, which are common in real life. When a query does not have a value comparison as a predicate, we simply follow any existing structural join algorithm to perform structural search directly.

4.5 Queries across Multiple Twig Patterns
A twig pattern can be used to model a simple query, e.g., a query that can be represented by XPath. When a query is more complex, we need to model it with multiple twig patterns and value-based joins are used to connect these twig patterns. One example is shown in Fig. 2(b). As pointed out by [7], structural join based algorithms can only efficiently process single-patterned queries. When a complex query involves several twig patterns, either from the same document or across different documents, structural join based algorithms will fail to work.
The reason why structural join based twig pattern matching algorithms cannot process queries involving several twig patterns is that those algorithms cannot perform value-based joins between twig patterns using their inverted list indexes. One naive approach is to match the different twig patterns in such a query separately. By considering each query node that is involved in a value-based join as an output node, they can then access the original document to retrieve the child values for these query nodes. Last, they join the matching results from the different twig patterns based on the retrieved values. Obviously this attempt is I/O costly, and it may also produce a large intermediate result.

In TwigTable, we introduce relational tables to store values. This structure offers an opportunity to process queries across multiple twig patterns. We observe that a join operation between two twig patterns is based on a value comparison between two properties in the two twigs. Using property tables, we can easily perform the value-based join. We use an example to illustrate how TwigTable is extended to process such queries.

Example 4. Consider the query in Fig. 2(b). Two twig patterns, t1 and t2, are involved in this query. First, TwigTable estimates the selectivity of each twig pattern. In this case, t1 obviously has a higher selectivity. Then TwigTable matches t1 against the document, to get the value result of the query node author. Due to the high selectivity of title=“Network” w.r.t. the data in Fig. 4, the matching result returns only one author name, which is “Green”. In the next step, TwigTable joins the value result from the first twig pattern with the property table that corresponds to the joining node in the second twig pattern. In this example, we join the only value “Green” with Rauthor, to get a list of labels all of which correspond to the value “Green”. Finally, we form a new inverted list with the selected labels for the author node, and match t2 against the document. (A sketch of this value-based join follows at the end of this section.)

Discussion: TwigTable uses both inverted lists and tables for twig pattern matching, which offers a good opportunity to process queries across pieces of a document or across different documents. However, to process such a query, an optimizer is necessary to decide which twig pattern should be matched first, so as to reduce the search space for the other twig patterns. Such an optimizer is quite similar to a relational optimizer, and needs to estimate the cost of matching each twig pattern, the selectivity of each twig pattern, and also the cost and the selectivity of each value-based join between twig patterns. In this paper, we show the capability of TwigTable to process queries across multiple twig patterns. How to generate an optimal query plan to evaluate such queries will be investigated further.
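The sketch below is our own illustration of the value-based join in Example 4 (the table Rauthor and its columns follow Fig. 5; the JDBC connection and label encoding are assumptions): the author values returned by matching t1 are pushed into a selection on Rauthor, and the selected labels become the inverted list for the author node of t2.

import java.sql.*;
import java.util.*;

// Join two twig patterns on a value: fetch labels of "author" nodes whose value appears
// in the result of the first (more selective) twig pattern, e.g., {"Green"}.
public class TwigValueJoin {
    static List<String> authorLabelsFor(Connection conn, Collection<String> authorValues)
            throws SQLException {
        List<String> labels = new ArrayList<>();
        try (PreparedStatement ps =
                 conn.prepareStatement("SELECT label FROM Rauthor WHERE value = ?")) {
            for (String v : authorValues) {              // one lookup per value from t1's result
                ps.setString(1, v);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) labels.add(rs.getString("label"));
                }
            }
        }
        return labels;                                   // becomes the inverted list for t2's author node
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(args[0])) {
            System.out.println(authorLabelsFor(conn, List.of("Green")));
        }
    }
}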
5 Semantic Optimizations
Tables in TwigTable are built based on the semantic relationship between property and value. That is why we call them property tables. Using property tables to perform content search may still not be efficient enough in some cases. Since object is an important information unit for most queries, we can optimize the property tables to be object based, to further improve the performance.
5.1 Optimization 1: Object/Property Table
Motivation: Using property tables may still suffer from redundant search in the relevant inverted lists. Consider the query in Fig. 2(a). Suppose there are b books in the bookstore and only one of them is called “Network”. After TwigTable rewrites the query as in Fig. 6, the size of the inverted list for title is reduced to 1. However, the size of the inverted list for book is still b, though we know that only one label in it matches the label in the title inverted list. To solve this efficiency problem, we propose an optimization scheme based on the object semantics.

Optimization: Instead of storing each value with the label of its associated property node, we can put the property value and the label of the corresponding object node into relational tables. For example, in the bookstore document we put the values of publisher, title and so forth with the labels of the corresponding object book into object/property tables, as shown in Fig. 7(a). The ‘label’ field of each table stores the label of the object, and the ‘value’ field corresponds to the value of a different property in each table. When we perform a content search, we can directly select the object labels in the corresponding object/property tables and construct a new inverted list for the object. To process the query in Fig. 2(a), we perform the content search using Rbook/title to restrict the book labels based on the condition on the title value. After that the query can be further rewritten accordingly, as shown in Fig. 7(b), where Tbook is the new inverted list for the element book and we write booktitle=“Network” to explicitly explain book’. Here we not only reduce the size of Tbook, but also further reduce the number of structural joins and the number of query nodes by one. Then we get better performance when we execute the simplified query.
Rbook/publisher            Rbook/title                Rbook/author
label      value           label      value           label      value
(6:17,4)   Hillman         (6:17,4)   Network         (6:17,4)   Green
(18:31,4)  Elco            (18:31,4)  Database        (18:31,4)  White
...        ...             ...        ...             (18:31,4)  Brown
                                                      ...        ...

(a) Tables in TwigTable Optimization 1    (b) Rewritten query for Fig. 2(a)

Fig. 7. Tables and rewritten query under TwigTable Optimization 1
This object-level optimization is general to all queries with a predicate on a property to constrain an object. In the case that the same property type belongs to multiple object classes, this optimization gains even more benefit, as it avoids accessing those property nodes that do not belong to the required object class by distinguishing their object semantics.

Discussion: [Ordinal Column.] This optimization may lose order information for multi-valued properties. Such information may be important in some cases.
For example, the order of authors is important, but from the book/author table we cannot tell which author comes first for a certain book. To solve this limitation, we can simply add an additional column in the corresponding object/property table for such a multi-valued property, to indicate the ordinal information.

5.2 Optimization 2: Object Table
Motivation: It is quite normal that some queries contain multiple predicates on the same object. Consider the query shown in Fig. 8(a), which aims to find the subject of the book with the title “Network” and a price less than 40. To answer this query, Optimization 1 needs to find the labels of the books whose title is “Network” and the labels of the books whose price is less than 40 separately, using the object/property tables, and intersect them. With more semantic information, we know that title and price are both properties of the object book. If we have one table for this object that contains both properties, books satisfying these two constraints can be found directly with one SQL selection.
Fig. 8. Example query with multiple value predicates under the same object (a) and its rewritten query in Optimization 2 (b)
Optimization: A simple idea is to merge the object/property tables of Optimization 1 for each object class. For a multi-valued property, such as author in our example, it is not practical to merge it with other properties. In this case, we can merge all the single-valued properties of an object into one object table and keep the object/property tables for multi-valued properties. The resulting tables for the object class book in the bookstore document under this optimization are shown in Fig. 9. In Rbook, each label of book is stored with all the single-valued property values of that book. When we process queries with multiple predicates on single-valued properties of an object, we can do a selection in that object table based on multiple constraints at one time. For example, to process the query in Fig. 8(a), we can select book labels based on the two predicates in Rbook with one SQL selection, as sketched after Fig. 9. Then the original query can be rewritten as shown in Fig. 8(b). Compared with the Optimization 1 approach, we further simplify the query and prune intermediate results for the two predicates.
Rbook
label      publisher   title      price   quantity
(6:17,4)   Hillman     Network    45      30
(18:31,4)  Elco        Database   35      15
...        ...         ...        ...     ...

Rbook/author
label      value
(6:17,4)   Green
(18:31,4)  White
(18:31,4)  Brown
...        ...

Fig. 9. Tables for book in the bookstore document under TwigTable Optimization 2
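As a small illustration of why the merged object table pays off (our own sketch; the table Rbook and its columns follow Fig. 9, while the JDBC connection and the assumption that price is stored as a numeric column are ours), both predicates of Fig. 8(a) are evaluated with a single selection that returns book labels directly:

import java.sql.*;
import java.util.*;

// Optimization 2: one selection on the object table handles several predicates at once
// and yields labels of the object (book), not of individual properties.
public class ObjectTableSearch {
    public static void main(String[] args) throws Exception {
        List<String> bookLabels = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(args[0]);
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT label FROM Rbook WHERE title = ? AND price < ?")) {
            ps.setString(1, "Network");
            ps.setInt(2, 40);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) bookLabels.add(rs.getString("label"));
            }
        }
        System.out.println(bookLabels); // book labels feeding the rewritten query of Fig. 8(b)
    }
}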
Discussion: [Mixed table selection.] If the multiple predicates involve both single-valued and multi-valued properties, we can intersect the selection result from the object table for the single-valued properties with the selection result from the object/property tables for the multi-valued properties.

[Rare property.] Properties may appear only optionally under the associated objects. In some cases, the occurrence of certain properties may be rare. We call such properties rare properties. Suppose in the bookstore document only a few books have a second title; then second title is a rare property. If we put this rare property as a column in the book table, there will be too many NULL entries. Some relational database systems can deal with sparse attributes in the physical storage. In case other systems do not have this functionality, we can maintain a separate table specifically for all rare properties. The rare property table contains the object name, the rare property name, the object label and the property value. Suppose in the bookstore document second title is a rare property of book and sale region is a rare property of magazine; the example rare property table is shown in Fig. 10. Queries involving rare properties are processed by accessing the rare property table with the object name and the property name.
Rrare_property
object     property       label         value
book       second_title   (76:89,4)     An introduction to data mining
magazine   sale_region    (128:143,4)   Singapore
book       second_title   (282:299,4)   A first course
...        ...            ...           ...

Fig. 10. Table for rare properties
[Vertical partitioning.] An object table is obtained by merging the object/property tables for all the single-valued property types under the same object class. When there are many single-valued property types under a certain object class, the tuple size of the corresponding object table will be very large, which results in a high I/O cost in selection. A common way in RDBMS design to reduce such I/O cost is the vertical partitioning of a table [21]. We can refer to the query history to see which properties often appear together in the same
query, and then split the original object table into several partitions according to such information. Since vertical partitioning is not a new technique, we do not discuss it any more.

5.3 Optimization 3: Relationship Table
Motivation: The hierarchical structure of an XML document cannot reflect the relationships between objects explicitly; however, such relationships usually do exist. Consider another design of the bookstore document, shown in Fig. 11, in which the books are also grouped by different branches. In Fig. 11, quantity is not a property of book, but a property of the relationship between branch and book. Putting a relationship property into the object table of the property’s nearest object does not affect the accuracy of query processing. Consider a query to find the code of the branch that has some computer book with a low quantity, i.e., less than 20, expressed in Fig. 12(a). If we have no idea of the relationship between branch and book, but store the property quantity in the object table for book, Optimization 2 will rewrite the query as in Fig. 12(b) for further matching. However, since we aim to find qualified branches, matching the book node is redundant. If we know the predicate is on the relationship between branch and book, we may ignore book during pattern matching to improve efficiency.
Fig. 11. Another design of the bookstore document
Using Semantics in XML Twig Pattern Query Processing
subject
name
subjectname=“computer”
branch
“ computer”
121
code
branch
book
code
bookquantity<20
quantity
<20
(a) Example twig pattern query
(b) Rewritten query for Fig 12(a) under Optimization 2
Fig. 12. Example query with predicate on relationship property and its rewritten query in Optimization 2 subjectname=“computer”
Rbranch-book labelbranch
labelbook
quantity
(6:107,4)
(9:20,5)
30
(6:107,4)
(21:34,5)
15
…
…
…
(a) Relationship table
branchbranch-book.quantity<20
code
(b) Rewritten query for Fig 12(a) under Optimization 3
Fig. 13. Example relationship table and rewritten query in TwigTable Optimization 3
Compared to Optimization 2, the query is further simplified with the semantics of relationship, and the size of the inverted list of the branch node is reduced. Then the query processing performance will be further improved. Discussion: [Overlapping predicates.] A query node can be involved in both an object predicate and a relationship predicate. For example, if the query in Fig. 12(a) has an additional predicate on book price, the query node book will be involved in two predicates (i.e., the relationship predicate and the object predicate) overlapping on it. To handle overlapping predicates, we can perform the content search based on different predicates separately, and then intersect the label results to construct the temporary inverted list for the involved object. [Merging object table and relationship table.] If the semantics of participation constraint between two object classes is known, we can merge the object table(s) and the relationship table when the constraint is many-to-one or one-to-one. This is similar to the translation from ER diagram to tables with the consideration of
122
H. Wu et al.
participation constraints in relational database design. However, similar to the vertical partitioning, how to physically maintain relational tables is generally bound to the performance analysis for practical queries. 5.4
A Summary
The optimization techniques are proposed around the semantics of object. This object-level attempt is motivated by the fact that most queries ask about the information of objects. Generally, as more semantic information is known, we can optimize TwigTable to different levels, to get better performance. When the object information is not available but we still need it to manage data, we can roughly treat the parent node of each property as an object. As indicated in Section 2.3, this inference may not be correct in some cases, but it does not affect the correctness of twig pattern query processing. However, we construct table index based on the semantics of object instead of the simple structural information is because that object is the information unit in real life queries. Optimizations in object level can minimize the number of structural joins without affecting the processing of general queries. For example, if a person has a composite property name with firstName and lastName. Because most queries are issued to the object person, instead of to the property name only, if we construct table based on name, with the property of firstName and lastName, after the content search, we still have to join name with person. However, if we build the table based on the real object person, i.e., incorporating name/firstName and name/lastName into person table, any predicate on firstName or lastName will result in a label filtering for person directly, to save a structural join.
6 Experiments
In this section, we conduct experiments to show the advantage of TwigTable. We first compare our approach to a schema-aware relational approach [24], which is considered more efficient than other relational approaches. Then we compare TwigTable and its two optimizations to TwigStack, a typical structural-join-based twig pattern matching algorithm. Note that in these experiments our algorithms use TwigStack to perform the structural search, so we compare against TwigStack to show the benefit gained. Any other structural join algorithm could be used for the structural search instead; we do not compare with them because the comparison with TwigStack is sufficient to show the advantage of our approach.
6.1 Settings
We implemented all algorithms in Java. The experiments were performed on a machine with a dual-core 2.33GHz CPU and 4GB of RAM under Windows XP. We used three real-world and synthetic data sets to compare the performance of TwigStack and our approaches: NASA [1], DBLP and XMark [32]. NASA is a 25MB document with a deep and complex schema. The DBLP data set is a 127MB fragment of the DBLP database. It is rather regular, with a simple DTD schema but a large amount of data values.
Data Set  Query  Path Expression
NASA      NQ1    //dataset//source//other[date/year>1919 and year<2000]/author/lastName
NASA      NQ2    //dataset/tableHead[//field/name='rah']//tableLinks//title
NASA      NQ3    //dataset//history//ingest[date[year>1949 and year<2000][month='Nov'][day>14 and day<21]]//creator/lastName
DBLP      DQ1    /dblp/article[/author='Jim Gray']/title
DBLP      DQ2    /dblp/proceedings[year>1979]/isbn
DBLP      DQ3    /dblp/inproceedings[title='A Flexible Modeling Approach for Software Reliability Growth'][year='1987'][author='Sergio Bittanti']/booktitle
XMark     XQ1    //regions/africa/item[//mailbox//mail/from='Libero Rive']//keyword
XMark     XQ2    //person[//profile/age>20]/name
XMark     XQ3    //open_auction[//bidder[time>18:00:00]/increase>5]/quantity

Fig. 14. Experimental queries
We also used 10 sets of XMark benchmark data with sizes from 11MB to 110MB for our experiments. We selected three meaningful queries for each data set. All the queries contain predicates with value comparisons, as value predicates appear in most practical queries. Generally, there are three types of query predicates: predicates with equality comparisons, predicates with range comparisons, and multiple predicates with different comparisons under one object. The queries are shown in Fig. 14. In TwigTable, we use Sybase SQL Anywhere [2] to manage the relational tables, with the default database parameters.
6.2 Comparison with Schema-Aware Relational Approach
In this section, we compare TwigTable with the schema-aware relational approach proposed in [24], which we call SRA for short. We also use Sybase SQL Anywhere for SRA. Since this approach is weak in dealing with the "//"-axis, we adopt the proposal in [12] to augment it. SRA has been shown to be more efficient than other schemaless relational approaches for processing XML queries. The execution time of both SRA and TwigTable for the queries on NASA, DBLP and a 110MB XMark document is shown in Fig. 15. Note that the Y-axis is in logarithmic scale. From Fig. 15 we can see that for the NASA and XMark data (NQ1-3, XQ1-3) TwigTable is more efficient, but for the DBLP data (DQ1-3) SRA is more efficient. The DBLP data is rather regular and flat, so its values can be perfectly shredded into relational tables. The maximum height of DBLP is 4, which means that with the SRA shredding method at most one table join is required for all queries, and this join is between the table for the root (containing only one tuple) and another table.
[Figure: execution times (ms, logarithmic scale) of SRA and TwigTable for queries NQ1-NQ3, DQ1-DQ3 and XQ1-XQ3.]
Fig. 15. Comparison result between SRA and TwigTable
For such a relational-like document, the relational approach is much more efficient, because table selection dominates the overall performance and this operation can be performed very efficiently in all relational databases. However, most real-life XML data is not as regular as DBLP; otherwise, the advantage of the semi-structured format would be lost. As we see for NQ1-3 and XQ1-3, when the document is deeper and more complex, i.e., requires more table joins for SRA, the performance of SRA is badly affected.
6.3 Comparison with TwigStack
Space management: We evaluate the space used for document parsing, including the number of labeled nodes in memory, and the number of inverted lists and the number of tables maintained on disk. We parse the NASA, DBLP and 110MB XMark documents using the two approaches. The result is shown in Fig. 16.
Data    Number of Labeled Nodes              Number of Inverted Lists    Number of Tables
        TwigStack   TwigTable   Saving       TwigStack   TwigTable       TwigStack   TwigTable
NASA    997,987     532,963     46.6%        121,833     68              N.A.        39
DBLP    6,771,148   3,736,406   44.8%        388,630     37              N.A.        26
XMark   3,221,925   2,048,193   36.4%        353,476     79              N.A.        43

Fig. 16. Number of labeled nodes, inverted lists and tables in TwigStack and TwigTable
This result validates our analysis in Section 4.4 about the reduction of labeled nodes in memory and the reduction of inverted lists. In TwigTable, the relational tables are built for the different types of properties, so the number of tables is limited to the number of distinct property types. We also use 10 sets of XMark data, whose sizes vary between 11MB and 110MB, to further demonstrate the superiority of TwigTable in space management. The experimental result is shown in Fig. 17. We can see that the number of labeled nodes scales with the document size for both approaches, and TwigTable always manages fewer labeled nodes.
[Figure: (a) number of labeled nodes (millions) and (b) number of inverted lists (millions) for TwigStack and TwigTable, plotted against the file sizes (MB) of the 10 XMark documents.]

Fig. 17. Space management comparisons

Total size of indexes (MB):

Data    TwigStack                            TwigTable
        Inverted Lists   Tables   Total      Inverted Lists   Tables   Total
NASA    462              0        462        9                31       40
DBLP    1290             0        1290       30               119      149
XMark   1040             0        1040       17               47       64

Fig. 18. Index sizes in TwigStack and TwigTable
The number of inverted lists scales with the document size in TwigStack, whereas it is a constant in TwigTable. For a large data set it is not practical to handle the tremendous number of inverted lists using TwigStack. We further analyze the total space used to store the indexes of the two approaches, as shown in Fig. 18. For both approaches, we simply use a separate file to represent each inverted list on disk. The inverted list files are further indexed by a B+ tree automatically built by the file system, so that during query processing the relevant inverted lists can be quickly located. From the result, we can see that TwigStack needs a large amount of space to manage inverted lists. We investigated further and found that the total size of the inverted lists themselves is actually not very large, and is similar to the total size used by TwigTable. However, because of the tremendous number of inverted lists, the extra structures, e.g., the B+ tree, built by the file system to index them are quite large. In contrast, although TwigTable maintains table indexes in addition to inverted lists, the total size is still much smaller.

Query performance: We used the NASA, DBLP and 110MB XMark data sets for the query performance comparison. We compare TwigStack with our original TwigTable algorithm, as well as with Optimization 1 and Optimization 2. As mentioned earlier, we can infer the object information in an XML document.
Although the inference may not be semantically correct, it does not affect the correctness of the result. In the two optimizations, we use this inference to build object/property tables and object tables. We do not test Optimization 3. The inverted lists and relational tables are stored on disk and loaded into memory as needed during query processing. As mentioned above, inverted lists are stored as separate files and indexed by a B+ tree to ensure high performance of inverted list access. TwigTable adopts TwigStack (with the same implementation) to perform the structural search, after the content search with table selection. The execution times of TwigTable and its optimizations include the I/O and CPU costs of accessing the relational tables for the content search, as well as the cost of the structural search. The comparison result is shown in Fig. 19.
[Figure: execution times (ms) of TwigStack, TwigTable, TwigTable Optimization 1 and TwigTable Optimization 2 for (a) NASA queries NQ1-NQ3, (b) DBLP queries DQ1-DQ3 and (c) XMark queries XQ1-XQ3.]

Fig. 19. Execution time by TwigStack and TwigTable without optimizations, with Optimization 1 and with Optimization 2 in the three XML documents
From the result we can see that TwigTable outperforms TwigStack for all queries. The reason is that TwigTable performs the content search first to simplify the query pattern before the structural joins, thus gaining better overall performance. The result also shows that the overhead of table selection does not offset the benefit gained from twig pattern simplification. TwigStack performs very badly on DQ2. The DBLP data contains a lot of numeric values, and to process DQ2, TwigStack has to merge, in document order, the labels of all the inverted lists whose numeric name is greater than 1979. Loading and merging these inverted lists is costly. In TwigTable, however, this step is replaced by a table selection and is therefore more efficient.
We can see that, although TwigStack is not very slow at range search for a small document (e.g., NASA), it becomes inefficient at loading and merging labels when the number of labels in the numeric inverted lists is large. For all queries, Optimization 1 works better than plain TwigTable. The reason is that Optimization 1 removes one more level from the original twig pattern query, which further improves query processing performance. Comparing Optimization 1 with Optimization 2, we can see that for single-predicate queries there is no obvious difference. For some queries, Optimization 2 is even slightly worse than Optimization 1; the reason is that the combined object table is larger than an object/property table, so more I/O may be needed to load tuples from the object table. However, for multi-predicate queries, i.e., NQ3, DQ3 and XQ3, Optimization 2 performs better, because it performs the content search for all the value comparisons on the same object at the same time. This again confirms our analysis in Section 4.4.
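Returning to DQ2, the table-selection step that replaces the costly inverted-list merge can be sketched as follows (illustrative only; the table and column names are ours, not the exact names used in our implementation):

    -- TwigTable content search for DQ2's predicate year > 1979:
    -- a single range selection returns the labels of the qualifying year elements,
    -- instead of loading and merging one inverted list per distinct numeric value.
    SELECT label
    FROM   year_of_proceedings    -- object/property table for year under proceedings
    WHERE  value > 1979;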
7 Conclusion
In this paper, we propose a semantic approach, TwigTable, to solve the different kinds of content problems raised by existing approaches to twig pattern query processing. Unlike TwigStack and its subsequent algorithms, our approach uses semantic tables to store the values of an XML document and avoids managing a tremendous number of inverted lists for different values. During query processing, we perform the content search first to reduce the size of the relevant inverted lists, and rewrite the query to reduce the number of structural nodes and structural joins in it. Then, we match the simplified pattern, with size-reduced inverted lists, against the document. We also show that using hybrid indexes (inverted lists and relational tables) makes it easy and efficient to process queries across multiple twig patterns. Our approach is a semantic approach because the relational tables are initially built based on the semantics of properties. With more semantics on objects and relationships, we propose three optimization techniques to further improve the tables and enhance the efficiency of query processing. In particular, (1) if the object associated with each property is known, we can change each property table into an object/property table in Optimization 1; (2) if we know that certain properties belong to the same object, we can combine the corresponding object/property tables into an object table in Optimization 2; (3) if the relationships between objects are known, we can introduce relationship tables to precisely store the property values of relationships in Optimization 3. To summarize, as more semantic information is known, we can further optimize the tables and obtain better performance. Furthermore, when an output node of a query corresponds to a property type, after finding an occurrence of the twig pattern in the document, our algorithm can easily extract the actual values that answer the query from the relational tables, whereas existing approaches need more work to convert labels into values by accessing the document again. Our work brings a new perspective on incorporating the relational approach into the native approach to manage and query XML data. Processing queries involving
ID references [31] or queries across multiple documents becomes possible in our approach, as it uses value-based joins to link the different twig patterns involved in a complex query. We will further investigate how to generate optimal query plans for queries involving both structural joins and table joins.
References

1. http://www.cs.washington.edu/research/xmldatasets/data/nasa/nasa.xml
2. http://www.sybase.com/products/databasemanagement/sqlanywhere
3. Al-Khalifa, S., Jagadish, H.V., Patel, J.M., Wu, Y., Koudas, N., Srivastava, D.: Structural joins: A primitive for efficient XML query pattern matching. In: Proc. of ICDE, pp. 141–154 (2002)
4. Berglund, A., Chamberlin, D., Fernandez, M.F., Kay, M., Robie, J., Simeon, J.: XML Path Language (XPath) 2.0. W3C Working Draft (2003)
5. Boag, S., Chamberlin, D., Fernandez, M.F., Florescu, D., Robie, J., Simeon, J.: XQuery 1.0: An XML Query Language. W3C Working Draft (2003)
6. Bohannon, P., Freire, J., Roy, P., Simeon, J.: From XML schema to relations: a cost-based approach to XML storage. In: Proc. of ICDE, pp. 64–75 (2002)
7. Bruno, N., Koudas, N., Srivastava, D.: Holistic twig joins: Optimal XML pattern matching. In: Proc. of SIGMOD, pp. 310–321 (2002)
8. Chen, T., Lu, J., Ling, T.W.: On boosting holism in XML twig pattern matching using structural indexing techniques. In: Proc. of SIGMOD, pp. 455–466 (2005)
9. Chen, Y., Davidson, S.B., Hara, C.S., Zheng, Y.: RRXS: redundancy reducing XML storage in relations. In: Proc. of VLDB, pp. 189–200 (2003)
10. Doan, A., Ramakrishnan, R., Chen, F., DeRose, P., Lee, Y., McCann, R., Sayyadian, M., Shen, W.: Community information management. IEEE Data Eng. Bull. 29(1), 64–72 (2006)
11. Florescu, D., Kossmann, D.: Storing and querying XML data using an RDBMS. IEEE Data Eng. Bull. 22(3), 27–34 (1999)
12. Gou, G., Chirkova, R.: Efficiently querying large XML data repositories: a survey. IEEE Transactions on Knowledge and Data Engineering 19(10), 1381–1403 (2007)
13. Grust, T.: Accelerating XPath location steps. In: Proc. of SIGMOD, pp. 109–120 (2002)
14. Jiang, H., Lu, H., Wang, W.: Efficient processing of XML twig queries with OR-predicates. In: Proc. of SIGMOD, pp. 59–70 (2004)
15. Jiang, H., Wang, W., Lu, H., Yu, J.: Holistic twig joins on indexed XML documents. In: Proc. of VLDB, pp. 273–284 (2003)
16. Li, C., Ling, T.W.: QED: a novel quaternary encoding to completely avoid relabeling in XML updates. In: Proc. of CIKM, pp. 501–508 (2005)
17. Ling, T.W., Lee, M.L., Dobbie, G.: Semistructured database design (Web Information Systems Engineering and Internet Technologies Series). Springer, Heidelberg (2004)
18. Liu, Z., Chen, Y.: Identifying meaningful return information for XML keyword search. In: Proc. of SIGMOD, pp. 329–340 (2007)
19. Lu, J., Chen, T., Ling, T.W.: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. In: Proc. of CIKM, pp. 533–542 (2004)
20. Lu, J., Ling, T.W., Chan, C., Chen, T.: From region encoding to extended dewey: On efficient processing of XML twig pattern matching. In: Proc. of VLDB, pp. 193–204 (2005)
21. Navathe, S., Ceri, S., Wiederhold, G., Dou, J.: Vertical partitioning algorithms for database design. ACM Transactions on Database Systems 9(4), 680–710 (1984)
22. Pal, S., Cseri, I., Seeliger, O., Schaller, G., Giakoumakis, L., Zolotov, V.: Indexing XML data stored in a relational database. In: Proc. of VLDB, pp. 1146–1157 (2004)
23. Rao, P.R., Moon, B.: PRIX: Indexing and querying XML using Prufer sequences. In: Proc. of ICDE, p. 288 (2004)
24. Shanmugasundaram, J., Tufte, K., Zhang, C., He, G., DeWitt, D.J., Naughton, J.F.: Relational databases for querying XML documents: limitations and opportunities. In: Proc. of VLDB, pp. 302–314 (1999)
25. Spink, A.: A user-centered approach to evaluating human interaction with web search engines: an exploratory study. Information Processing & Management 38(3), 401–426 (2002)
26. Tatarinov, I., Viglas, S., Beyer, K.S., Shanmugasundaram, J., Shekita, E.J., Zhang, C.: Storing and querying ordered XML using a relational database system. In: Proc. of SIGMOD, pp. 204–215 (2002)
27. Tian, F., DeWitt, D.J., Chen, J., Zhang, C.: The design and performance evaluation of alternative XML storage strategies. SIGMOD Record 31(1), 5–10 (2002)
28. TreeBank. Retrieved from University of Washington Database Group (2002)
29. Wang, H., Park, S., Fan, W., Yu, P.S.: ViST: A dynamic index method for querying XML data by tree structures. In: Proc. of SIGMOD, pp. 110–121 (2003)
30. Wu, H., Ling, T.W., Chen, B.: VERT: A semantic approach for content search and content extraction in XML query processing. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.) ER 2007. LNCS, vol. 4801, pp. 534–549. Springer, Heidelberg (2007)
31. Wu, H., Ling, T.W., Dobbie, G., Bao, Z., Xu, L.: Reducing graph matching to tree matching for XML queries with ID references. In: Bringas, P.G., Hameurlain, A., Quirchmayr, G. (eds.) DEXA 2010. LNCS, vol. 6262, pp. 391–406. Springer, Heidelberg (2010)
32. XMark: An XML benchmark project, http://www.xml-benchmark.org
33. Xu, L., Ling, T.W., Wu, H., Bao, Z.: DDE: From Dewey to a fully dynamic XML labeling scheme. In: Proc. of SIGMOD, pp. 719–730 (2009)
34. Yoshikawa, M., Amagasa, T., Shimura, T., Uemura, S.: XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Trans. Internet Techn. 1(1), 110–141 (2001)
35. Yu, C., Jagadish, H.V.: Efficient discovery of XML data redundancies. In: Proc. of VLDB, pp. 103–114 (2006)
36. Yu, T., Ling, T.W., Lu, J.: TwigStackListNot: A holistic twig join algorithm for twig query with NOT-predicates on XML data. In: Li Lee, M., Tan, K.-L., Wuwongse, V. (eds.) DASFAA 2006. LNCS, vol. 3882, pp. 249–263. Springer, Heidelberg (2006)
37. Zhang, C., Naughton, J., DeWitt, D., Luo, Q., Lohman, G.: On supporting containment queries in relational database management systems. In: Proc. of SIGMOD, pp. 425–436 (2001)
Database Semantics Recovery through Analysis of Dynamic SQL Statements

Anthony Cleve 1,3, Jean-Roch Meurisse 2, and Jean-Luc Hainaut 1

1 Faculty of Computer Science, PReCISE Research Center, University of Namur, Belgium
  {acl,jlh}@info.fundp.ac.be
2 Department of Education and Technology, University of Namur, Belgium
  [email protected]
3 Department of Information and Communication Sciences, Université Libre de Bruxelles, Belgium
Abstract. The documentation of a database includes its conceptual schema, which formalizes the semantics of the data, and its logical schema, which translates the former according to an operational database model. Important engineering processes such as database and program evolution rely on complete and accurate database documentation. In many cases, however, these schemas are missing or, at best, incomplete and outdated. Their reconstruction, a process called database reverse engineering, requires DDL code analysis but, more importantly, the elicitation of implicit constructs (data structures and constraints), that is, constructs that have been incompletely translated into the operational database schema. The most powerful discovery technique for these implicit constructs is the static analysis of application program source code, and in particular of embedded SQL statements. Unfortunately, the increasing use of dynamic SQL in business application development makes such techniques helpless, so that modern web software systems can no longer be correctly redocumented through current reverse engineering techniques. This paper introduces and evaluates dynamic SQL capturing and analysis techniques and shows that they can replace and even improve upon pure static analysis. A real-world case study is presented and discussed.
1 Introduction

According to standard database design methodologies, any decently developed database should be accompanied by complete, accurate and up-to-date documentation, comprising, among others, its conceptual and logical schemas and, more specifically for the database administrator, its physical schema and the translation of the latter into the DDL of the DBMS.

1.1 Database Schemas

The conceptual schema describes the semantics of the data in terms of the subject application domain, while the logical schema is its lossless transcription in the data structures
of the DBMS model. Both express the same semantics but in different languages, the former providing a high level of abstraction and semantic expressivity, while the latter focuses on programmability and technological efficiency. These schemas are not mere byproducts of the design process: they are also indispensable for the exploitation of the database, in particular for application program development, and for the maintenance and evolution of the database. These complex processes cannot be carried out efficiently and reliably without precise knowledge of the meaning and the properties of the data. In many situations, however, particularly for ageing databases, these documents are obsolete, partial, inaccurate or simply lost, if there ever was any.

1.2 Database Reverse Engineering

The need for precise methods to reconstruct the documentation of a database was widely recognized in the eighties1 under the name database reverse engineering. The first approaches were based on simple rules that work nicely with databases designed in a systematic way [1,2,3,4]. A second generation of methodologies coped with physical schemas resulting from empirical design, in which practitioners tend to apply non-standard and undisciplined techniques. More complex design rules were identified and interpreted [5], and, based on them, structured and comprehensive approaches were developed [6] while the first industrial tools appeared (e.g., Bachman's Reengineering Tool). Many contributions were published in the nineties, addressing practically all the legacy technologies and exploiting such sources of information as application source code and database contents. Among synthesis publications, we mention [7], the first tentative history of this discipline. These second-generation approaches were faced with two kinds of problems induced by empirical design. The first problem is the recovery of implicit constructs, that is, structures and constraints that have not been explicitly declared in the DDL code. Their elicitation requires the analysis of such complex information sources as the source code of the application programs (and particularly the DML statements), the contents of the database and the user interface. The second problem is that of the semantic interpretation of logical schemas that may include non-standard data structures.

1.3 New Challenges

The process of legacy database reverse engineering has been studied for several decades and solid methodologies and tools are now available. One of the major supporting techniques that have been designed and implemented is program code exploration, through which essential implicit constructs of the database schemas can be discovered by understanding how programs access and process the data. In particular, DML statements (in most cases SQL statements) are a particularly rich information source on the existence and the properties of implicit constructs. We observe, however, that new forms of information systems are emerging that pose new problems that cannot be adequately solved by these resources [8].
1 In fact, the term reverse engineering was already coined in the documents of the ISDOS system, in the seventies. See http://www.pslpsa.com/HistoryTop.htm for example.
The proliferation of web-based applications is one of them. Often developed in an undisciplined way with rapid development tools, these applications generate an increasing number of poorly designed and undocumented databases, making the reverse engineering process even more critical. Another challenge we currently face is the increasing use of dynamic languages, and in particular dynamic SQL, notably through the ODBC and JDBC interfaces. Since the actual statements only exist at runtime, standard program analysis techniques generally fail, and new, more complex, dynamic analysis techniques must be designed.

1.4 About This Paper

The objective of this paper is to study the characteristics of dynamic SQL in data-intensive application programs and to propose techniques to extract data semantics from these statements. The paper is structured as follows. Section 2 provides a brief introduction to database reverse engineering (DBRE). Section 3 illustrates the difference between static and dynamic SQL. In Section 4, we elaborate on the analysis of (dynamic) SQL statements in support of DBRE. Section 5 identifies and compares several techniques for capturing the trace of the SQL statements that are executed by an application program at runtime. In Section 6 we show how such an SQL execution trace can be analyzed in order to identify implicit schema constructs, in particular undeclared foreign keys. Section 7 reports on a case study evaluating the proposed approach on a real web application. Related work is discussed in Section 8 and concluding remarks are given in Section 9.
2 Background: Database Reverse Engineering

Basically, database reverse engineering comprises three processes, namely physical schema extraction, logical schema reconstruction and schema conceptualization. Each of them has distinct objectives and relies on specific techniques, in such a way that they can, to a large extent, be studied independently2.

2.1 Physical Schema Extraction

This process recovers the physical schema of a database by parsing its DDL code or, equivalently, by analyzing the contents of its active data dictionary, such as the system tables in most relational systems. This extraction makes visible the explicit constructs of the schema, that is, the data structures and constraints that have been explicitly declared through DDL statements. Such is the case for primary keys, unique constraints, foreign keys and mandatory fields. Generally, this process is fairly straightforward. However, the analysis of sub-schemas (e.g., relational views or IMS PCBs) can be more intricate, since each of them brings a partial, and often refined, view of the global schema. Since the handling of the database by the DBMS relies on the physical schema, the latter can be considered exact.
2 Some of the material of this section is drawn from [9].
2.2 Logical Schema Reconstruction

This major process addresses the discovery of the implicit constructs of the schema, that is, those that have not been declared by explicit DDL statements and clauses. In some favorable situations, they have been translated into programmed database components such as SQL checks, triggers and stored procedures. However, most of them have been coded in application program fragments that are (nearly) duplicated and scattered throughout millions of lines of code. For instance, a popular way to check a referential constraint consists in accessing the target record before storing the source record in its file. Recovering these implicit constructs, in contrast with explicit constructs, requires a precise analysis of various pieces of procedural code, and particularly DML statements. Though program source code is the richest information source, the database contents (the data), screen layouts, report structures, program executions, user interviews and, of course, (possibly obsolete) documentation will be analyzed as well. Contrary to the physical schema, the logical schema produced by this process can only be qualified as reasonably exact, inasmuch as some implicit constructs may not have been discovered (silence) while some constructs introduced in the schema may simply not exist (noise). There is no formal way to prove that all the implicit constructs, and only them, have been discovered.

2.3 Schema Conceptualization

The goal of this process is to interpret the logical schema semantically by extracting a conceptual schema that represents its intended meaning. It mainly relies on transformational techniques that undo the effect of the logical design process [10]. This complex process is decomposed into three subprocesses, namely untranslation, de-optimization and conceptual normalization. The untranslation process consists in reversing the transformations that have been used to derive the logical schema from the conceptual schema. For instance, each foreign key is interpreted as the implementation of a many-to-one relationship type. This process relies on a solid knowledge of the rules and heuristics that have been used to design the database. Those rules can be standard, which makes the process fairly straightforward, but they can also be specific to the company or even to the developer in charge of the database (who may have left the company), in which case reverse engineering can be quite tricky. The main constructs that have to be recovered from their logical implementation are relationship types, supertype/subtype hierarchies, multivalued attributes, compound attributes and optional attributes. The de-optimization process removes the traces of the optimization techniques that have been used to improve the performance of the database. Redundancies must be identified and discarded, unnormalized data structures must be decomposed, and horizontal and vertical partitioning must be identified and undone. Finally, conceptual normalization improves the expressiveness, simplicity, readability and extensibility of the conceptual schema. Though the experience and skill of the (reverse) engineer are essential in schema conceptualization, the quality of the conceptual schema mainly depends on the completeness and soundness of the logical schema.
2.4 Implicit Constructs

All the structures and constraints of a logical schema that cannot be expressed in the DDL are implicit by nature. However, many databases include implicit constructs that could have been declared at design time but were not, for such reasons as convenience, standardization, inheritance from older technology, or simply ignorance or bad design. For example, in many legacy relational databases, foreign keys are not declared through foreign key clauses but are, at best, managed by an appropriate set of triggers. We briefly describe some of the most important implicit constructs that have been observed in relational databases [11].

– Exact column structure. Compound and multivalued columns (that would have been translated into row and array complex types in SQL3) are often anonymously represented by the concatenation of their component values.
– Unique keys of tables and multivalued fields. This property is particularly important in implicitly strongly structured columns.
– Foreign keys. Each value of a column, or of a set of columns, is processed as a reference to a record in another table.
– Functional dependencies. The values of a column can depend on the values of other columns that have not been declared or elicited as a candidate key. This pattern has often been used in older databases for performance reasons.
– Value domains. A more precise definition of the domain of a field can be discovered by data and program analysis. Identifying enumerated domains is particularly important.

2.5 Information Sources and Discovery Techniques

Discovering implicit constructs is based on ad hoc techniques depending on the nature and reliability of the information sources. None of them can guarantee in an absolute way the presence or the absence of such constructs, but they can all contribute to a better knowledge of the hidden components and properties of a database schema. We describe four of them.

– Schema analysis [3,12,13]. Spotting similarities in names, value domains and representative patterns may help identify hidden constructs such as foreign keys.
– Data analysis [14,15,16,17]. Mining the database contents can be used in two ways: firstly, to discover implicit properties, such as functional dependencies and foreign keys; secondly, to check hypothetical constructs that have been suggested by the other techniques. Considering the combinatorial explosion that threatens the first approach, data analysis is most often applied to check the existence of formerly identified patterns.
– Program analysis [18,19,20,21]. Understanding how programs use the data provides crucial information on the properties of these data. Even simple analyses, such as dataflow graph exploration, can bring valuable information on field structures and meaningful names. More sophisticated techniques such as dependency analysis and program slicing can be used to identify complex constraint checking or foreign keys. SQL statement examination, which will be detailed in this paper, is one of the most powerful variants of program analysis.
– Screen/report layout analysis [22,23,24]. Forms, reports and dialog boxes are user-oriented views on the database. They exhibit spatial structures (e.g., data aggregates), meaningful names, explicit usage guidelines and, at runtime, data population and error messages that, combined with dataflow analysis, provide much information on hidden data structures and properties.

2.6 SQL Statement Analysis

When, as has been common for more than two decades, data are managed by relational DBMSs, the database/program interactions are performed through the SQL language and protocols. Based on the relational algebra and the relational calculus, SQL is a high-level language that allows programmers to describe in a declarative way the properties of the data they instruct the DBMS to provide them with. By contrast, navigational DMLs (also called one-record-at-a-time DMLs) access the data through procedural code that specifies the successive operations necessary to get these data. Therefore, a single SQL statement can be the declarative equivalent of a procedural section of several hundred lines of code, and understanding the semantics of an SQL statement is often much easier than understanding that procedural fragment. The analysis of SQL statements in application programs has long been used as a major program understanding technique in database reverse engineering [18,25,26,27,21]. We illustrate its importance on a small but representative example based on the schema of Figure 1, made up of two tables describing customers and orders. This physical schema graphically translates the constructs declared in the DDL code.
CUSTOMER
  CustNum: num (8)
  CustName: varchar (30)
  CustAddress: char (120)
  id: CustNum

ORDERS
  OrdNum: num (10)
  OrdDate: date (1)
  Reference: char (12)
  Sender: num (8)
  id: OrdNum

Fig. 1. A two-table schema including implicit constructs
A careful examination of Query 1 shows that it provides the customer city and the ordered product for a given order. It asks the DBMS to extract these data from a row built by joining tables CUSTOMER and ORDERS for that order. It exhibits the main features of the program/query interface: the program transmits an input value through host variable ORDID and the query returns result values in host variables CITY and PRODUCT. The analysis of this query brings to light some important hidden information, two items of which are essential for the understanding of the database schema.

1. The join is performed on columns CustNum and Sender. The former is the primary key of table CUSTOMER, while the second one plays no role so far. Knowing that most joins are based on the matching of a foreign key and a primary key, Query 1 strongly suggests that column Sender is an implicit foreign key to CUSTOMER. Further analysis will confirm or reject this hypothesis.
select substring(CustAddress from 61 for 30), Reference
into   :CITY, :PRODUCT
from   CUSTOMER C, ORDERS O
where  C.CustNum = O.Sender
and    OrdNum = :ORDID
Query 1. Extracting hidden City and Product information (predicative join)

2. Atomic column CustAddress appears to actually be a compound field, since its substring at positions 61 to 90 is extracted and stored into variable CITY.

Translating this new knowledge into the original schema leads to the more precise schema of Figure 2.

CUSTOMER
  CustNum: num (8)
  CustName: varchar (30)
  CustAddress: compound (120)
    Data1: char (60)
    City: char (30)
    Data2: char (30)
  id: CustNum

ORDERS
  OrdNum: num (10)
  OrdDate: date (1)
  Reference: char (12)
  Sender: num (8)
  id: OrdNum
  ref: Sender

Fig. 2. Two implicit constructs revealed by the analysis of Query 1
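Were these two hypotheses confirmed, the recovered constructs could be made explicit in the DDL code, for instance along the following lines (a sketch only: the constraint and view names are ours, and exposing the address structure through a view is just one possible choice):

    -- Declare the recovered implicit foreign key.
    ALTER TABLE ORDERS
      ADD CONSTRAINT fk_orders_sender
      FOREIGN KEY (Sender) REFERENCES CUSTOMER (CustNum);

    -- Expose the implicit structure of CustAddress through a view.
    CREATE VIEW CUSTOMER_ADDRESS AS
      SELECT CustNum,
             substring(CustAddress from 1 for 60)  AS Data1,
             substring(CustAddress from 61 for 30) AS City,
             substring(CustAddress from 91 for 30) AS Data2
      FROM   CUSTOMER;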
Analyzing independent statements may prove insufficient in that it provides incomplete information for some frequent programming patterns. An SQL statement can be decomposed into several fragments in such a way that its actual global structure is no longer explicit. The code of Query 2 is an example of a procedural join, functionally similar to that of Query 1. One can reconstruct the intended join by examining the usage of shared variables that define inter-query dependencies.
-- Q21
select Sender into :CUST
from   ORDERS
where  OrdNum = :ORDID

-- Q22

Query 2. Extracting hidden City and Product information (procedural join)
3 Static vs Dynamic SQL

The introductory examples of Section 2.6 are expressed in static SQL, a variant of the language in which the SQL statements are hard-coded in the source program. Static SQL, in which Query 1 and Query 2 are written, explicitly shows the architecture of the
query according to the native SQL syntax. It mentions the schema constructs concerned and identifies the input and output variables through which the host statements interact with the query. There is another family of SQL interfaces, called dynamic SQL, with which the SQL statements are built at runtime and sent to the database server through a specific API. Typically, the program builds each query as a character string, then asks the DBMS to prepare the query (i.e., to compile it) and finally executes it. The only point in the program at which the SQL query actually exists is at runtime, when, or just before, the query string is sent to the DBMS for compilation and/or execution. Query 3 is a simple example of the use of dynamic SQL to extract the name of customer 'C400'. Dynamic SQL is now widely used in data-intensive application programs, particularly in web business systems. Its analysis is therefore particularly important for the reverse engineering of poorly documented databases.
CNUM  = "C400";
QUERY = "select CustName from CUSTOMER where CustNum = '" + CNUM + "'";
exec SQL prepare Q from :QUERY;
exec SQL execute Q into :NAME;
Query 3. First example of dynamic SQL

Dynamic SQL provides a high level of flexibility, but the application programs that use it are generally difficult to analyze and to understand. Indeed, dynamically generated queries must be reconstructed before they can be properly analyzed. This reconstruction process is not always possible through static program analysis [28]. For instance, such techniques as dataflow analysis or program slicing fall short in this context: the high level of dynamicity causes a lot of noise in the data dependency graph [21], with a negative impact on the precision of the results. Most major DBMSs, such as Oracle and DB2, include interfaces for both static and dynamic SQL. Though dynamic SQL was standardized in the eighties and implemented by most relational DBMS vendors, as illustrated in Queries 3 and 4, all of them also provide drivers for the popular DBMS-independent APIs ODBC from Microsoft and JDBC from Sun.

QUERY = "select CustName from CUSTOMER where CustNum = :v1";
exec SQL prepare Q from :QUERY;
exec SQL execute Q into :NAME using :CNUM;
Query 4. Another common usage pattern of dynamic SQL

The examples in Queries 3 and 4 are written in dynamic SQL for C/C++. The first one shows the build-time injection of a constant from variable CNUM through mere string concatenation, while Query 4 illustrates the binding of formal variable v1 with the actual host variable CNUM through the into and using clauses of the prepare and execute statements.
The ODBC and JDBC interfaces provide several query processing protocols, differing notably in the binding technique, which we refer to in this study. The most general form is illustrated in Query 5, where the iteration structure for obtaining successive results has been ignored for simplicity. Line 1 creates database connection con. Line 2 builds the SQL query in host variable QUERY. This statement includes the input placeholder ?, which will be bound to an actual value before execution. Line 3 creates and prepares (i.e., compiles) statement SQLstmt from string QUERY. This statement is completed in Line 4, in which the first (and unique) placeholder is replaced with value C400. The statement can then be executed (Line 5), hence creating the resulting set of rows rset. Method next of rset positions its cursor on the first row of this result (Line 6), while Line 7 extracts the first (and unique) output value specified in the query select list and stores it in host variable Num. Line 8 closes the result set.
   CNum = "C400";
1  Connection con = DriverManager.getConnection(DBurl, Login, Password);
2  String QUERY = "select OrdNum from ORDERS where Sender = ?";
3  PreparedStatement SQLstmt = con.prepareStatement(QUERY);
4  SQLstmt.setString(1, CNum);
5  ResultSet rset = SQLstmt.executeQuery();
6  rset.next();
7  Num = rset.getInt(1);
8  rset.close();

Query 5. Standard JDBC database interaction
4 SQL Statement Analysis

We can distinguish two main approaches to SQL statement analysis, namely static analysis and dynamic analysis. Static analysis consists in analyzing the SQL statements occurring in the source code of the application programs. In contrast, dynamic analysis aims at capturing and analyzing the SQL statements that are executed by a running program. It first captures an SQL execution trace, which (at least) contains all the SQL queries sent by the application program to the DBMS. It then analyzes this trace in order to extract useful information from it. The choice between static and dynamic analysis techniques in a given context depends on such criteria as (1) the availability of the source code, (2) the nature of the SQL statements to analyze (static or dynamic SQL) and (3) the final objectives of the analysis.

4.1 Static Analysis of Static SQL Statements

The analysis of static SQL statements has been introduced in Section 2.6. Though this paper is mainly devoted to dynamic techniques, it is worth developing further the way sequences of statements can be analyzed. The analysis of SQL statements in a program can address each statement independently (Query 1) or can extend the understanding to chains of dependent statements (Query 2).
The SQL code of Query 1 explicitly exhibits the input and output variables through which the host statements interact with the query. These variables are important since, on the one hand, input variables define the variable parts of the query and, on the other hand, input and output variables potentially weave dependency links with other SQL queries. The code fragment of Query 2 (procedural join) illustrates such an inter-query dependency by developing the join into two SQL queries. Naming these two SQL queries Q21 and Q22, we observe that the host variable CUST is the output variable of query Q21 and the input variable of query Q22. If the intermediate host statements do not change the value of CUST (this invariance can be checked through dependency analysis), then these queries can be considered as fragments of a distributed global query. Inter-query analysis has been suggested by Petit et al. [18], for example.

Remark. Basically, there is no need to apply dynamic techniques to static SQL, except for two objectives, namely capturing the result sets of rows and identifying SQL statement execution sequences. In addition, the capturing techniques are then limited to program instrumentation and DBMS logging (see Section 5).

4.2 Static Analysis of Dynamic SQL Statements

In some disciplined programming styles, the query building step is written as a neat sequence of explicit substring concatenations just before the prepare/execute section, so that significant fragments of the SQL query, if not the whole query itself, can be recovered through careful static analysis [29] or symbolic execution [30]. However, some fragments of the query may be initialized long before, and far away from, the execution point, or computed, or extracted from external sources, in such a way that discovering the intended query by code examination alone may prove impossible.

4.3 Dynamic Analysis of Dynamic SQL Statements

While static analysis of SQL statements is mainly based on the dependencies that exist between host variables, dynamic analysis reasons about column values. When static analysis fails, dynamic analysis can detect constant equality in the traces from different program points. If the first constant belongs to the output of query Q1, if the second constant is an input value of query Q2, and if both constants are equal throughout the traces, then the probability that both constants come from the same variable is high. Studying input and output constants may also reveal another kind of inter-query dependency, according to which the result of a query Q1 influences whether another query Q2 is executed or not. This is typically the case when a program first verifies an implicit integrity constraint (such as a foreign key) before inserting or updating an SQL row. Similarly, before inserting a new row in a table, the program may check that the provided identifier value is unique. In both cases, there will be equality between an input value of the validation query Q1 and an input value of the insertion query Q2.

4.4 Capturing Results of SQL Statement Execution

Dynamic analysis is a privileged means to capture actual SQL queries at execution time in order to examine them, but this technique can also be used to examine the results
of SQL query executions. The format of the output data is defined by the form of the SQL query and the information of the logical schema. As just explained, result analysis often is the only technique to trace the link between two SQL queries through shared host variables when the intermediate host statements cannot be analysed statically, as in Query 2 for example. In addition, query results also provide interesting information such as the number of rows that a query usually returns. For instance, an SQL query that always returns a single row probably includes a condition on a column that may be an implicit candidate key.
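For instance, a trace fragment like the following (a purely hypothetical excerpt over the schema of Fig. 1; the values are invented) would suggest that CustName is an implicit candidate key of CUSTOMER, since every execution of the query returns exactly one row:

    select CustNum from CUSTOMER where CustName = 'Mercier'    -- 1 row: (C127)
    select CustNum from CUSTOMER where CustName = 'Dupont'     -- 1 row: (C400)
    select CustNum from CUSTOMER where CustName = 'Santos'     -- 1 row: (C582)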
5 SQL Statement Capturing Techniques

In this section, we identify and describe six different techniques to capture the SQL statements executed by an application program. These techniques, summarized in Figure 3, are intended to capture the behaviour of the client/server system at runtime. More details about them can be found in [31].
[Figure: the six capturing techniques positioned between the client side (programs, API) and the server side (DBMS, database): API overloading, API substitution, program instrumentation and aspect-based tracing on the client side; DBMS logs and tracing stored procedures on the server side.]

Fig. 3. Six capturing techniques for SQL statement executions
Program instrumentation. The capture of a dynamic SQL statement is performed by a dedicated code section inserted before the program point of this statement. Similarly, the result of an SQL statement is captured by a code section inserted after it. This technique requires code analysis to identify and decode the database API statements, and it entails source code modification and recompilation. It provides a temporal list of statement instances. In the example of Figure 4, the tracing statement writes to the log file the program point id (132), the event type (SQLexec), the statement object id (hashCode()) followed by the SQL statement (build-time constant injection is assumed) or the output variable contents. Depending on the information that needs to be extracted from the trace, the program id, process id and/or a timestamp can be output as well.
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery(SQLstmt);
SQLlog.write("132;SQLexec;" + stmt.hashCode() + ";" + SQLstmt);
rs.next();
vName = rs.getString(1);
SQLlog.write("133;SQLgetS1;" + rs.getStatement().hashCode() + ";" + vName);
vSalary = rs.getInt(2);
SQLlog.write("134;SQLgetI2;" + rs.getStatement().hashCode() + ";" + vSalary);
Fig. 4. Logging SQL operations by program instrumentation
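Assuming, for illustration, that SQLstmt held a query over a hypothetical EMPLOYEE table, the instrumented code of Fig. 4 would append trace lines of the following form (program point id; event type; statement object id; SQL text or extracted value):

    132;SQLexec;366712642;select EmpName, Salary from EMPLOYEE where EmpNum = 4017
    133;SQLgetS1;366712642;Jane Doe
    134;SQLgetI2;366712642;3200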
Aspect-based tracing. Aspect technology [32], if available for the host programming language, allows triggers to be added to an existing program without source code modification. The introduction of tracing aspects simply requires the programs to be recompiled. In the case of Java/JDBC applications, AspectJ [33] pointcuts and advices can be defined in order to capture the successive events involved in dynamic database manipulation, like query preparation, input value definition, query execution and result extraction [34]. Figure 5 shows a simple tracing aspect that captures and records the execution of an SQL query (in the absence of query preparation). The declared pointcut (lines 5-6) refers to the invocation of method executeQuery of class Statement. Before each occurrence of a query execution, the advice (lines 8-13) writes a log entry indicating (1) the class name, (2) the line of code, (3) the object id of the statement and (4) the query string itself.

 1 public aspect SQLTracing {
 2
 3   private MySQLLog log = new MySQLLog();
 4
 5   pointcut queryExecution(String query):
 6     call(ResultSet Statement.executeQuery(String)) && args(query);
 7
 8   before(String query): queryExecution(query){
 9     String file = thisJoinPoint.getSourceLocation().getFileName();
10     int LoC = thisJoinPoint.getSourceLocation().getLine();
11     Statement statement = (Statement) thisJoinPoint.getTarget();
12     log.traceQuery(file, LoC, statement.hashCode(), query);
13   }
14 }
Fig. 5. Aspect-based tracing of SQL query executions
API overloading. The API overloading technique consists in encapsulating (part of) the client-side API within dedicated classes which provide similar public methods but produce, in addition, the required SQL execution trace. For instance, we could write our own Statement, PreparedStatement and ResultSet classes which, in turn, make use of the corresponding JDBC classes, as shown in Figure 6. This technique requires minimal program adaptation (illustrated in Figure 7) and recompilation.
package myAPI;
...
public class Statement{
    java.sql.Statement stat;
    public Statement(java.sql.Connection con){
        stat = con.createStatement();
    }
    public ResultSet executeQuery(String sql){
        log.traceQuery(sql);
        return new ResultSet(stat.executeQuery(sql));
    }
    ...
}
Fig. 6. Illustration of API overloading
import java.sql.Statement;               ==>  import myAPI.Statement;
import java.sql.ResultSet;               ==>  import myAPI.ResultSet;
...                                           ...
Statement sta = con.createStatement();   ==>  Statement sta = new Statement(con);
ResultSet rsl = sta.executeQuery(q);     ==>  ResultSet rsl = sta.executeQuery(q);
Fig. 7. Program adaptation for API overloading
API substitution. If the source code of the client-side API is available, which is the case for the ODBC and JDBC drivers of open source RDBMSs, tracing statements can be inserted directly into this API. The latter is then recompiled and bound to the client applications. The client program need not be modified. This technique records the statement instances but ignores the program points.

DBMS logs. Most database servers store the requests received from the client application programs in a specific file or table. For example, MySQL writes all the client queries, in the order it received them, to its general query log (see the MySQL example below). Each record comprises the client process id, the timestamp of query reception and the text of the query as it was executed, the input variables being replaced with their instant values. Compared to the program instrumentation technique, a DBMS log does not provide program points and can be processed off-line only. This technique does not require source code modification.

Tracing stored procedures. SQL procedures are centrally stored in the database and can be invoked by any client program. By building SQL procedures equivalent to the most used SQL queries, and by replacing some or all SQL statements in the programs by invocations of these SQL procedures, we provide an ad hoc API that can be augmented with tracing instructions that log SQL statement executions. This technique can be considered in architectures that already rely on SQL procedures. When client programs include explicit SQL statements, it entails in-depth and complex code modification. However, since it replaces complex input and output variable binding with mere procedure arguments, this reengineering can lead to better code that will be easier to maintain and evolve.
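For the DBMS-log technique, for instance, MySQL's general query log can typically be activated at runtime along the following lines (the exact variable names, defaults and file path depend on the MySQL version and installation):

    SET GLOBAL general_log = 'ON';
    SET GLOBAL log_output = 'FILE';                            -- or 'TABLE'
    SET GLOBAL general_log_file = '/var/log/mysql/query.log';  -- illustrative path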
Techniques comparison. Table 1 provides a comparison of the SQL trace capturing techniques described above. Two main dimensions are considered:

– the richness of the traces that can be obtained: is the technique able to capture such information as the executed queries, the query results and the program points that correspond to the query execution steps?
– the impact on the application programs: does the technique necessitate program modification and/or recompilation?

The choice of a particular tracing technique obviously depends on such criteria as the objective of the dynamic analysis (which influences the information to capture) and the nature of the system of interest (programming language, database platform, architecture, availability of the source code).

Table 1. Comparison of SQL trace capturing techniques
(The first three columns characterize the richness of the SQL trace; the last two the impact on programs.)

Technique            Executed queries   Query results   Program points   Modification   Recompilation
Instrumentation             X                 X               X                X               X
Aspects                     X                 X               X                                X
API overloading             X                 X                               (X)              X
API substitution            X                 X                                                X
DBMS logs                   X
Stored procedures           X                 X                                X               X
6 SQL Trace Analysis for Database Reverse Engineering

Examination of static SQL queries is one of the most powerful techniques to elicit implicit database schema constructs and constraints, among which undeclared foreign keys, identifiers and functional dependencies [35,36,37,38]. In this section, we discuss the use of dynamic analysis of dynamic SQL queries for such a database reverse engineering task. In particular, we show how SQL execution traces can serve as a basis to formulate hypotheses on the existence of undeclared foreign keys (a similar discussion can be developed for the other implicit constructs). These hypotheses will still need to be validated afterwards (e.g., via data analysis or user/programmer interviews). We distinguish two different approaches to detect implicit referential constraints between columns of distinct or identical tables. One can observe either the way such referential constraints are used or the way they are managed.

– Referential constraint usage consists in exploiting the referential constraint. For instance, within the same execution, an output value o1 of an SQL statement s1 querying table T1 is used as an input value of another SQL statement s2 accessing another table T2 (see Query 2). A more direct usage of a foreign key consists of a join of T1 and T2 within a single query (see Query 1). In both cases, this suggests the existence of an implicit foreign key between tables T1 and T2.
– Referential constraint management aims at verifying that the referential constraint keeps being respected when updating the database. For instance, before modifying the contents of a table T2 (by an insert or update statement s2), the program executes a verification query q1 on table T1. According to the result of q1, s2 is executed or not. When both q1 and s2 are executed, they include at least one common input value. Similarly, when deleting a row of a table T1 using a delete statement d2, one observes that the program also deletes a possibly empty set of rows of another table T2 via another delete statement d1 (procedural delete cascade).

We have identified three main heuristics for implicit foreign key constraint detection from SQL execution traces, namely joins, output-input dependency and input-input dependency. Below, we further define and illustrate these three heuristics.

Notations. Let q be an SQL query occurring in an execution trace. Let t be a table of a relational database.

– q.match denotes the set of couples of columns (c1, c2) whose values are matched in an equality condition of q (the q.match relationship is symmetric);
– q.in denotes the set of input values of q;
– q.out denotes the set of output values of q;
– q.seq denotes the sequence number of q in its trace;
– t.cols denotes the set of columns of t.

6.1 Joins

As discussed above, most SQL joins rely on the matching of a foreign key with its target primary key. Query 1 corresponds to a standard join, where several table names appear in the from clause of the query. It combines the rows of those tables, typically based on a join condition which expresses the equality of the primary and foreign key values. When primary keys have been recovered, either from the physical schema or as implicit constructs, the join conditions provide strong evidence for implicit foreign keys.

Definition 1. An SQL query q contains a join of two tables t1 and t2 iff ∃(c1, c2) ∈ q.match such that c1 ∈ t1.cols ∧ c2 ∈ t2.cols.

It is important to mention that not all SQL joins correspond to the matching of a foreign key and a primary key value. Several counter-examples can be observed. A typical case consists in joining two tables on their foreign keys (none of them being a candidate key) to a third table, a dubious pattern known as the connection trap [39].

6.2 Output-Input Dependency

An output-input dependency occurs when an SQL query uses as input some of the results of a previous SQL query of the same program execution.

Definition 2. An SQL query q2 is output-input dependent on another SQL query q1 iff q2.in ∩ q1.out ≠ ∅ ∧ q2.seq > q1.seq.
...
select Sender from ORDERS where OrdNum = 5789
getString(1) = C400
select Name, Address from CUSTOMERS where Num = 'C400'
...
Fig. 8. An SQL trace fragment with an output-input dependency that may reveal the existence of the implicit foreign key of Figure 1
In the presence of foreign keys, be they implicit or not, output-input dependencies can be observed in several navigational programming patterns. For instance, in a procedural join between the source and target tables of a foreign key (see Query 2), the value of the foreign key column(s) is used to retrieve the target row using a subsequent query. Conversely, the value of the identifier of a given target row can be used to extract all the rows referencing it. For instance, the program retrieves a customer, before searching for all her recent orders. Figure 8 shows an example of output-input dependency between two successive SQL queries. In this example, the program retrieves the name and address of the customer who placed a particular order it has just retrieved. We see that the output value of column ORDERS.Sender in the first query is the same as the input value of column CUSTOMER.Num in the second query.

6.3 Input-Input Dependency

In an SQL execution trace, an input-input dependency holds between two successive SQL queries that share common input values.

Definition 3. An SQL query q1 is input-input dependent on another SQL query q2 iff q1.in ∩ q2.in ≠ ∅.

The presence of input-input dependencies in SQL execution traces constitutes another strong indicator of the presence of foreign keys. Several data manipulation patterns for referential constraint management make intensive use of input-input dependent queries. Among the most popular examples, the delete cascade mechanism, which consists in deleting all referencing rows before deleting a target row, makes use of delete queries that share a common input value: the primary/foreign key value of the target rows to be deleted. A second example is the check-before-insert pattern, which aims at preserving a referential integrity constraint when inserting rows in the database. When inserting a row in a referencing table, the program first checks that the provided foreign key value is valid, i.e., it corresponds to the primary key value of an existing row in the target table. Similar patterns can be observed in delete and update procedures. As an illustration, we consider the execution trace given in Figure 9. This trace strongly suggests the existence of an implicit foreign key between column Sender of table ORDERS and column Num of table CUSTOMER. Indeed, a row insertion in table ORDERS is preceded by the execution of a validation query that (1) counts the number
...
select count(*) from CUSTOMER where Num = 'C400'
getInt(1) = 1
insert into ORDERS(_,_,_,Sender) values (_,_,_,'C400')
...
Fig. 9. An SQL trace fragment with an input-input dependency that may reveal the existence of the implicit foreign key of Figure 1
of rows of table CUSTOMER having c as value of column Num – where c corresponds to the value of column Sender of the inserted row of ORDERS – and (2) returns 1 as a result. In other words, the program checks that the provided value of column Sender does correspond to the primary key (Num) value of an existing customer. This SQL trace fragment is actually an instance of the check-before-insert pattern described above.
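The three heuristics above lend themselves to a direct implementation once a trace has been parsed into query records. The sketch below is a minimal illustration of Definitions 2 and 3 over such records; the TraceQuery class and its fields simply mirror the notations q.in, q.out and q.seq and are not part of the tool described later.

import java.util.Collections;
import java.util.Set;

class TraceQuery {
  Set<String> in;   // input values of the query (q.in)
  Set<String> out;  // output values of the query (q.out)
  int seq;          // sequence number in the trace (q.seq)
}

class DependencyHeuristics {

  // Definition 2: q2 is output-input dependent on q1
  static boolean outputInputDependent(TraceQuery q2, TraceQuery q1) {
    return q2.seq > q1.seq && !Collections.disjoint(q2.in, q1.out);
  }

  // Definition 3: q1 is input-input dependent on q2
  static boolean inputInputDependent(TraceQuery q1, TraceQuery q2) {
    return !Collections.disjoint(q1.in, q2.in);
  }
}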
7 Case Study

In order to validate the approach we propose in this paper, we conducted an experiment based on an e-learning application, called WebCampus (see http://webcampus.fundp.ac.be), that is used at the University of Namur. WebCampus is an instantiation of Claroline (see http://www.claroline.net), an open-source Learning Management System (LMS) allowing teachers to offer online courses and to manage learning and collaborative activities on the web. The platform is written in PHP and uses a MySQL database. WebCampus consists of more than a thousand source code files, amounting to 460 000 lines of code. The main database manipulated by WebCampus is made up of 33 tables and 198 columns, representing the data on available online courses, course users, university departments, etc. In addition, the system makes use of one separate database per course, each comprising about 50 tables. The MySQL DDL code of the database does not explicitly declare any foreign key. Indeed, the database makes use of the MyISAM storage engine, which does not support foreign key management. However, the Claroline developers community (the second author is a co-developer of WebCampus and a member of the Claroline community) is aware of all the implicit constructs. We particularly focused on the main database of WebCampus, which actually hosts 35 implicit foreign keys (according to the WebCampus developers). This case study is particularly valuable since the target of the reverse engineering process is (allegedly, as we will see below) known, in such a way that we can rigorously evaluate the effectiveness of a given reverse engineering technique as a basis for the detection of implicit foreign keys. WebCampus can be considered as a representative application of the class of applications we target in this paper. First, both the size of the source code and the size of the database schema fully justify the use of automated foreign key detection techniques. Second, we observe a high level of dynamicity in the construction of the database
queries (by string concatenation), combined with a high concentration of database query execution commands in only a few source code files (data access modules). This makes static analysis techniques such as dataflow analysis, pattern mining and program slicing inapplicable. Three other techniques can be considered, namely data analysis, schema analysis and dynamic analysis of SQL statements.

Data analysis is generally used as a confirmation technique for other detection techniques like schema or program analysis. Indeed, when a foreign key is implicit, by definition, it is not declared in the DDL code of the DBMS. Consequently, there is no guarantee that the underlying referential constraint is respected by all the rows of the referencing table. We learned from our experience with large databases that a significant proportion of the data actually violates implicit integrity constraints like undeclared foreign keys [40].

Schema analysis constitutes, a priori, a promising technique for implicit foreign key detection. It seems indeed reasonable to assume that an implicit foreign key column is likely to have a name similar to that of the primary key column it references. We can also assume that there is a high probability that both the source and target columns are of compatible types. Thus, column name and type similarity appear as adequate heuristics for schema-based implicit foreign key detection. Below, we will verify this by analyzing the WebCampus database schema.

Dynamic analysis of SQL statements is presented above as a promising alternative technique to overcome the limitations of static program analysis in the presence of dynamic database query construction, as in such programming environments as PHP/MySQL or Java/JDBC. This case study is intended to support this claim. Concretely, the case study aims to show that one can detect the implicit foreign keys of the WebCampus database using dynamic SQL analysis. Furthermore, we will compare the results obtained using dynamic SQL analysis with the results obtained by means of schema analysis. The experiment involved the following two steps: (1) collecting SQL execution traces corresponding to typical interaction scenarios within WebCampus; and (2) analyzing these traces in order to detect implicit foreign keys.

7.1 Trace Collection

The SQL trace collected reports on 14 execution scenarios, which translate the most typical operations carried out by WebCampus users on a regular basis. Trace collection was carried out through tool-supported source code instrumentation by the second author. This proved straightforward since only a few source code modules are in charge of querying the WebCampus database (but those data access modules are called from almost all modules of the application). The output of the tracing process was stored in a MySQL database composed of two tables: (1) sql_trace, each row of which describes an executed SQL query; and (2) sql_trace_results, which contains information on the results of these queries. Table 2 provides size metrics about the trace obtained by indicating, for each execution scenario, the number and the nature of the corresponding queries and query results.
Table 2. Some metrics about the SQL trace obtained, classified by execution scenario

Execution scenario             total #      # queries    # select    # insert    # delete    # update    total #      # results
                               of queries   on main DB   on main DB  on main DB  on main DB  on main DB  of results   from main DB
register_user                       27           27           24           3           0           0         163           163
add_course_manager                 364          194          190           4           0           0       2 643         2 391
add_course_user                    289          155          151           4           0           0       2 112         1 908
create_course                       70           29           20           9           0           0         319           299
delete_course                      329          132          123           1           7           0       1 865         1 700
delete_course_user                 159           84           83           0           1           0       1 110           996
delete_dpt_attempt                  37           37           37           0           0           0         423           419
install_applet                      92           88           82           4           0           2         729           721
install_tool                     4 894        2 169        2 039         126           4           0      26 002        24 180
uninstall_applet                    82           78           68           0           9           1         581           573
uninstall_tool                   3 713        1 896        1 888           0           8           0      23 333        22 419
user_register_to_course             64           64           63           1           0           0         721           708
user_register_to_webcampus          35           32           30           2           0           0         188           184
user_unregister_from_course         24           19           17           1           1           0         169           155
Total                           10 179        5 004        4 815         155          30           3      60 358        56 816
7.2 Trace Analysis

The goal of the trace analysis process was to find indications of undeclared foreign keys in the SQL execution traces, with a focus on joins and output-input dependencies. This process took the physical schema of the main WebCampus database (without any foreign key) as a basis to systematically analyze the contents of the tracing tables. This analysis process was supported by a dedicated trace analyzer, implemented as a Java plugin of the DB-MAIN CASE environment [41]. This plugin takes as input (1) a relational database schema and (2) a set of SQL traces stored in a relational database in the format described above. The analyzer returns a set of potential implicit foreign keys together with the number of corresponding joins and output-input dependencies occurring in the input SQL traces. The analyzer behaves as follows. For each couple of not necessarily distinct tables (t1, t2) of the input schema, it generates and executes a set of SQL queries that analyze the SQL execution traces (a sketch of such a query is given below). The input schema is used to determine the primary key (or unique column) pk of table t2, while any column of table t1 is considered by the analyzer as a potential foreign key fk.
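A minimal sketch of one such generated query is given below. The column names of the tracing tables (id, seq, target_table, input_value, output_value, query_id) are assumptions made for the sake of illustration, since only the tables sql_trace and sql_trace_results are described above; for brevity, the sketch also assumes a single input and output value per trace record. For a candidate foreign key, the analyzer would run such a query in both directions between the two tables involved.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class TraceAnalyzerSketch {

  // counts output-input dependencies where a query on table 'first' produces
  // a value later used as input of a query on table 'second'
  static int countOutputInputDependencies(Connection con, String first, String second)
      throws SQLException {
    String sql =
        "select count(*) "
      + "from sql_trace q1 "
      + "  join sql_trace_results r1 on r1.query_id = q1.id "
      + "  join sql_trace q2 on q2.seq > q1.seq and q2.input_value = r1.output_value "
      + "where q1.target_table = ? and q2.target_table = ?";
    try (PreparedStatement ps = con.prepareStatement(sql)) {
      ps.setString(1, first);
      ps.setString(2, second);
      try (ResultSet rs = ps.executeQuery()) {
        rs.next();
        return rs.getInt(1);
      }
    }
  }
}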
queries referencing t1 , (2) the number of queries referencing t2 , and (3) the number of distinct scenarios where both t1 and t2 are accessed. From (3), we can derive that only 27 implicit foreign keys of the schema were potentially detectable in the SQL trace we obtained. Indeed, the minimal requirement for detecting an undeclared foreign key t1 → t2 in an SQL trace is that both t1 and t2 must be involved in at least one execution scenario considered. If this is the case, then the SQL trace obtained can contain indications of the foreign key. The right part of Table 3 summarizes the indications of implicit foreign key that have been found in the SQL trace. For each undeclared foreign key (t1 → t2 ), we provide (1) the number of SQL joins between t1 and t2 , and (2) the number of output-input dependencies between a query q1 accessing t2 and a subsequent query q2 accessing t1 , further subdivided according to the nature of q2 . From these statistics, we observe that we found evidence for 23 implicit foreign keys (those in light grey), which represents a recall of: – about 65% of the total number of implicit foreign keys in the database; – about 85% of the foreign keys identified as potentially detectable in the trace we obtained. Let us now discuss the precision that we reached in this experiment. We observed three unexpected foreign keys hypotheses, summarized in Table 4, that are all based on the presence of SQL joins. It turned out that two of them correspond to actual implicit foreign keys that were not part of the list initially provided by the WebCampus developpers. The third hypothesis is erroneous and therefore constitutes a noise in our experiment. Several joins are made between tables notify and course user based on their respective column user id. Both notify.user id and course.user id reference a third table user, as detected by our analyzer (see Table 3). This case actually corresponds to an instance of the connection trap pattern described above. In summary, our experiment allowed us to correctly detect 25 implicit foreign keys, which corresponds to a recall of 71% (25/37). Considering that only 29 implicit foreign keys were potentially detectable (covered by our execution scenarios), we would reach a recall of 86% (25/29). In terms of precision, only one hypothesis revealed to be erroneous (false-positive), which results in a precision of 96% (25/26). Such false foreign keys do not pose a critical problem in practice since they would most probably be invalidated by other techniques, such as schema and data analysis. 7.4 Comparison with Schema-Based Foreign Key Detection As a further evaluation, we have conducted a second experiment where we used schema analysis techniques to detect the 37 undeclared foreign keys of the WebCampus database. The goal of this second experiment is to compare, on a common case, the precision and recall that can be reached using schema analysis vs dynamic SQL analysis. The experiment was supported by a schema analysis plugin of DB-MAIN, that takes as input a relational database schema (with no foreign key) and returns as output a set of implicit foreign key suggestions. A foreign key from column a to column b is suggested if the following conditions hold:
Table 3. SQL trace analysis results (recall). Conventions: query(t) = query on table t; select(t) = select query on table t; insert(t) = insert query on table t; delete(t) = delete query on table t. For each implicit foreign key (t1 → t2), the table reports the number of queries accessing t1, the number of queries accessing t2, the number of execution scenarios involving both t1 and t2, the number of joins between t1 and t2, and the number of output-input dependencies query(t2) → query(t1), further subdivided into select(t2) → select(t1), select(t2) → insert(t1) and select(t2) → delete(t1). The 35 implicit foreign keys considered are: class → class, course → faculty, course → right_profile, course_user → course, course_user → right_profile, course_user → user, desktop_portlet_data → user, dock → module, faculty → faculty, course_addons → course, course_program → course, user_addons → user, user_addons → program, im_message → course, im_message_status → user, im_message_status → im_message, im_recipient → user, im_recipient → im_message, log → user, module_contexts → module, module_info → module, notify → course, notify → user, notify → course_tool, property_definition → user_property, rel_class_user → user, rel_class_user → class, rel_course_class → class, rel_course_class → course, right_rel_profile_action → course, right_rel_profile_action → right_profile, right_rel_profile_action → right_action, sso → user, tracking_event → user, user_property → user. [Per-key counts omitted.]
Table 4. Unexpected foreign key hypotheses, due to SQL joins

Unexpected foreign key (t1 → t2)    # of joins   Explanation
course_tool → module                    372      actual implicit foreign key
right_action → course_tool               18      actual implicit foreign key
notify → course_user                      6      t1 and t2 reference the same table t3
As far as name similarity is concerned, the tool relies on the Jaro-Winkler string proximity metric [42]. If the proximity between two column names c1, c2 is not less than a given threshold t, then c1 and c2 are considered similar. We ran our schema analysis tool for 11 different threshold values between 0 and 1. For each threshold value, we measured:

– D: the total number of detected foreign keys;
– TP: the number of true positives, i.e., correctly detected foreign keys;
– FP: the number of false positives, i.e., wrongly detected foreign keys;
– FN: the number of false negatives, i.e., undetected foreign keys;
– P = TP/(TP + FP): the precision of the schema-based detection technique;
– R = TP/(TP + FN): the recall of the schema-based detection technique.
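The suggestion rule itself can be sketched as a simple nested loop over the columns of the input schema. The Column class and the similarity placeholder below are our own simplifications; in the actual experiment the proximity is computed with the Jaro-Winkler metric mentioned above.

import java.util.ArrayList;
import java.util.List;

class SchemaAnalysisSketch {

  static class Column {
    String table, name, type;
    boolean primaryKey;
  }

  // suggests a foreign key a -> b whenever the four conditions listed above hold
  static List<Column[]> suggestForeignKeys(List<Column> columns, double threshold) {
    List<Column[]> suggestions = new ArrayList<>();
    for (Column a : columns) {
      for (Column b : columns) {
        if (a != b                                         // a and b are distinct
            && b.primaryKey                                // b is a primary key
            && a.type.equals(b.type)                       // a and b are of the same type
            && similarity(a.name, b.name) >= threshold) {  // a and b have similar names
          suggestions.add(new Column[] { a, b });
        }
      }
    }
    return suggestions;
  }

  // placeholder for the Jaro-Winkler proximity metric
  static double similarity(String s1, String s2) {
    return s1.equalsIgnoreCase(s2) ? 1.0 : 0.0;
  }
}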
Table 5 summarizes the results obtained when using schema analysis as an implicit foreign key detection technique for the WebCampus database. Figure 10 graphically shows the level of precision and recall reached for each name similarity threshold value considered. We observe that a high level of recall can be reached when the name similarity threshold is set to a low value. This is not surprising, since in this case almost all couples of column names are considered similar; thus, almost every column having the same type as a primary key column is considered a foreign key. We can also observe that for 2 implicit foreign keys, the type of the foreign key column differs from the type of the target primary key column. The most important observation concerns the high proportion of false positives, whatever the threshold chosen.

Table 5. Results of implicit foreign key detection using schema analysis

Threshold   # FK detected   # True positives   # False positives   # False negatives   Recall (%)   Precision (%)
  0.0            1329              35                1294                  2               94.6           2.6
  0.1            1026              30                 996                  7               81.1           2.9
  0.2            1026              30                 996                  7               81.1           2.9
  0.3            1025              30                 995                  7               81.1           2.9
  0.4             968              30                 938                  7               81.1           3.1
  0.5             655              29                 626                  8               78.4           4.4
  0.6             296              24                 272                 13               64.9           8.1
  0.7             150              20                 130                 17               54.1          13.3
  0.8             105              17                  88                 20               45.9          16.2
  0.9              75              14                  61                 23               37.8          18.7
  1.0              73              14                  59                 23               37.8          19.2
Fig. 10. Precision and recall values of the schema analysis technique, depending on the name similarity threshold used
This makes the precision of the schema-based detection technique too low to be satisfying in this experiment, especially when compared to the precision of the dynamic analysis technique used in the above experiment.

7.5 Discussion

The experiments presented above clearly validate our approach by confirming that (1) SQL execution traces contain essential information about implicit schema constraints, in particular implicit foreign keys; (2) dynamic analysis of SQL statements allows the detection of implicit foreign keys with a satisfying level of recall and precision; and (3) dynamic analysis of SQL statements can lead to significantly more accurate results than other data reverse engineering techniques like schema analysis. More precisely, our results allow us to say that if there is an implicit foreign key, then the SQL execution traces most probably contain indications of this foreign key, provided that both involved tables are used by the execution scenarios considered. As usual in dynamic program analysis, the main difficulty is to choose relevant execution scenarios based on the knowledge of the analyzed system. In our case, we considered a large set of scenarios in order to reach a good coverage of the database schema. However, the set of selected execution scenarios was not sufficient to reach a detection recall of 100%. This can be explained by several non-exclusive reasons. First, nothing guarantees that the WebCampus application actually exploits all the structures of the database. Second, we did not consider all possible execution scenarios of WebCampus. Third, it is likely that we did not cover all possible interesting execution paths of each execution scenario considered. In the context of this experiment, an execution path is considered interesting if it involves the execution of successive inter-dependent database queries accessing different tables. Precisely evaluating the coverage of execution scenarios and input data is a non-trivial problem that has been extensively studied [43]. Several mature techniques
have been proposed to support this evaluation in the particular case of data-intensive applications [44]. Last, even for each execution path we followed, it is obvious that we did not consider all possible combinations of input data and database states.

The techniques and tools that we used in our experiment can actually be extended and reused in other application contexts. For instance, while the initial goal of the SQL trace analysis tool we developed for the WebCampus case study is to detect indications of implicit foreign keys in SQL execution traces, a similar tool may also be used for confirming candidate implicit foreign keys elicited using other reverse engineering techniques, including schema or user interface analysis. Another interesting application of SQL trace analysis concerns the identification of unsafe data access paths [40], i.e., program fragments where an implicit constraint is not correctly managed. For instance, based on the list of implicit foreign keys, one could detect in the SQL trace that an insert or an update statement is performed without prior verification of the referential constraints. In this case, the analysis would be based on the absence of output-input and input-input dependencies under particular scenarios.
8 Related Work

Most of the previous approaches to SQL statement analysis [18,27,21,29] rely on static program analysis techniques. Petit et al. [18] present a technique for extracting an Entity-Relationship schema from an operational relational database. The enrichment of the raw schema benefits from the analysis of the SQL queries available in the application programs. In particular, joins are seen as heuristics for the detection of implicit dependencies between the columns of distinct tables.

Willmor et al. [27] propose an approach to program slicing in the presence of database states. In particular, they introduce two forms of data dependencies related to database queries. The first category, called program-database dependencies, accounts for interactions between program statements and database statements. The database-database dependencies capture the situation in which the execution of a database statement affects the behaviour of another database statement. We also consider the latter kind of dependencies, but they are extracted from SQL execution traces rather than from the source code.

van den Brink et al. [29] present a tool-supported method for quality assessment of SQL statements. The initial phase of the method consists in extracting the SQL statements from the source code using control and dataflow analysis techniques. Similarly, Ngo and Tan [30] make use of symbolic execution to extract database interaction points from web applications. Based on a case study, they showed that their method is able to extract about 80% of such interactions.

Dynamic analysis of SQL statements has already been used by other authors, but for different purposes than database reverse engineering. Debusmann and Geihs [45] present an aspect-based method for the instrumentation of application components. This method is used in the context of runtime system monitoring. They measure, in particular, the response time of database queries. Yang et al. [46] make use of the aspect-based tracing method that we introduced in [34] for feature model recovery. Their experiments revealed that static analysis techniques would have been inapplicable in this context.
The WAFA approach [47], by Alalfi et al., is dedicated to program comprehension. It combines static and dynamic program analysis techniques for achieving a fine-grained analysis of database interactions in web applications. The key idea is to automatically recover the link between SQL execution instances and the original statement source. A similar method was proposed in previous work [34] in the case of Java systems.
9 Conclusions and Perspectives

SQL statements appear to be a particularly rich source of information for program and data structure understanding, and therefore for legacy database semantics recovery, a strict prerequisite to such essential processes as program and database maintenance, evolution and reengineering. Although various static analysis approaches have been proposed in the last two decades, the increasing use of dynamic SQL in business applications makes such techniques inapplicable, so that web software systems can no longer be correctly redocumented and understood. Dynamic analysis of SQL queries constitutes a promising alternative, although it is still a largely unexplored research and technical domain. The goal of this paper was to mark this engineering domain out by identifying and discussing its basic concepts and specific techniques. We first identified, illustrated and compared a set of techniques for capturing the SQL queries executed by a system at runtime. Then, we elaborated on the analysis of SQL traces in the context of database reverse engineering. We identified possible heuristics for the detection of implicit foreign keys from SQL execution traces. Those heuristics combine both intra-query dependencies (SQL joins) and inter-query dependencies (input-input and output-input dependencies). A comprehensive experiment, based on a real-life web application, allowed us to establish the analysis of dynamic SQL execution traces as a very effective technique for relational database reverse engineering. It proved convincing enough to let ReveR (a university spin-off devoted to information system reverse engineering) include our techniques and tools in its database reverse engineering tool set.

In the near future, we plan to investigate the benefits of hybrid approaches to database reverse engineering, combining schema analysis, data analysis, and static and dynamic program analysis techniques. We also aim at exploring the use of dynamic analysis of SQL statements in other application contexts than database reverse engineering, including quality assessment, program comprehension, intrusion detection and consistency management.

Acknowledgments. This research was carried out during the tenure of an ERCIM "Alain Bensoussan" Fellowship by the first author. Partial support was also received from the IAP Programme of the Belgian State, Belgian Science Policy (MoVES project).
References

1. Casanova, M.A., De Sa, J.E.A.: Mapping uninterpreted schemes into entity-relationship diagrams: two applications to conceptual schema design. IBM J. Res. Dev. 28(1), 82–94 (1984)
2. Davis, K.H., Arora, A.K.: A methodology for translating a conventional file system into an entity-relationship model. In: Proceedings of the Fourth International Conference on Entity-Relationship Approach, pp. 148–159. IEEE Computer Society, Washington, DC, USA (1985)
3. Navathe, S.B., Awong, A.M.: Abstracting relational and hierarchical data with a semantic data model. In: Proceedings of the Sixth International Conference on Entity-Relationship Approach (ER 1987), pp. 305–333. North-Holland Publishing Co., Amsterdam (1988)
4. Johannesson, P.: A method for transforming relational schemas into conceptual schemas. In: Proceedings of the Tenth International Conference on Data Engineering (ICDE 1994), pp. 190–201. IEEE Computer Society, Washington, DC, USA (1994)
5. Blaha, M.R., Premerlani, W.J.: Observed idiosyncracies of relational database designs. In: Proceedings of the Second Working Conference on Reverse Engineering (WCRE 1995), p. 116. IEEE Computer Society, Washington, DC, USA (1995)
6. Hainaut, J.L., Englebert, V., Henrard, J., Hick, J.M., Roland, D.: Database reverse engineering: From requirements to CARE tools. Automated Software Engineering 3, 9–45 (1996)
7. Davis, K.H., Aiken, P.H.: Data reverse engineering: A historical survey. In: Proceedings of the Seventh Working Conference on Reverse Engineering (WCRE 2000), p. 70. IEEE Computer Society, Washington, DC, USA (2000)
8. Hainaut, J.L.: Legacy and future of data reverse engineering. In: Proceedings of the 16th Working Conference on Reverse Engineering (WCRE 2009), p. 4. IEEE Computer Society, Los Alamitos (2009)
9. Hainaut, J.L., Henrard, J., Englebert, V., Roland, D., Hick, J.M.: Database reverse engineering. In: Liu, L., Özsu, M.T. (eds.) Encyclopedia of Database Systems, pp. 723–728. Springer, US (2009)
10. Hainaut, J.-L.: The transformational approach to database engineering. In: Lämmel, R., Saraiva, J., Visser, J. (eds.) GTTSE 2005. LNCS, vol. 4143, pp. 95–143. Springer, Heidelberg (2006)
11. Hainaut, J.L.: Introduction to database reverse engineering. LIBD Publish (2002), http://www.info.fundp.ac.be/dbm/publication/2002/DBRE-2002.pdf
12. Markowitz, V.M., Makowsky, J.A.: Identifying extended entity-relationship object structures in relational schemas. IEEE Trans. Softw. Eng. 16, 777–790 (1990)
13. Premerlani, W.J., Blaha, M.R.: An approach for reverse engineering of relational databases. Commun. ACM 37(5), 42 (1994)
14. Chiang, R.H.L., Barron, T.M., Storey, V.C.: Reverse engineering of relational databases: extraction of an EER model from a relational database. Data Knowl. Eng. 12(2), 107–142 (1994)
15. Lopes, S., Petit, J.M., Toumani, F.: Discovering interesting inclusion dependencies: application to logical database tuning. Inf. Syst. 27, 1–19 (2002)
16. Yao, H., Hamilton, H.J.: Mining functional dependencies from data. Data Min. Knowl. Discov. 16(2), 197–219 (2008)
17. Pannurat, N., Kerdprasop, N., Kerdprasop, K.: Database reverse engineering based on association rule mining. CoRR abs/1004.3272 (2010)
18. Petit, J.M., Kouloumdjian, J., Boulicaut, J.F., Toumani, F.: Using queries to improve database reverse engineering. In: Loucopoulos, P. (ed.) ER 1994. LNCS, vol. 881, pp. 369–386. Springer, Heidelberg (1994)
19. Di Lucca, G.A., Fasolino, A.R., de Carlini, U.: Recovering class diagrams from data-intensive legacy systems. In: Proceedings of the 16th IEEE International Conference on Software Maintenance (ICSM 2000), p. 52. IEEE Computer Society, Los Alamitos (2000)
20. Henrard, J.: Program Understanding in Database Reverse Engineering. PhD thesis, University of Namur (2003)
21. Cleve, A., Henrard, J., Hainaut, J.L.: Data reverse engineering using system dependency graphs. In: Proceedings of the 13th Working Conference on Reverse Engineering (WCRE 2006), pp. 157–166. IEEE Computer Society, Washington, DC, USA (2006)
22. Choobineh, J., Mannino, M.V., Tseng, V.P.: A form-based approach for database analysis and design. Communications of the ACM 35(2), 108–120 (1992)
23. Terwilliger, J.F., Delcambre, L.M.L., Logan, J.: The user interface is the conceptual model. In: Embley, D.W., Olivé, A., Ram, S. (eds.) ER 2006. LNCS, vol. 4215, pp. 424–436. Springer, Heidelberg (2006)
24. Ramdoyal, R., Cleve, A., Hainaut, J.-L.: Reverse engineering user interfaces for interactive database conceptual analysis. In: Pernici, B. (ed.) CAiSE 2010. LNCS, vol. 6051, pp. 332–347. Springer, Heidelberg (2010)
25. Andersson, M.: Searching for semantics in COBOL legacy applications. In: Data Mining and Reverse Engineering: Searching for Semantics, IFIP TC2/WG2.6 Seventh Conference on Database Semantics (DS-7). IFIP Conference Proceedings, vol. 124, pp. 162–183. Chapman & Hall, Boca Raton (1998)
26. Embury, S.M., Shao, J.: Assisting the comprehension of legacy transactions. In: Proceedings of the 8th Working Conference on Reverse Engineering (WCRE 2001), p. 345. IEEE Computer Society, Washington, DC, USA (2001)
27. Willmor, D., Embury, S.M., Shao, J.: Program slicing in the presence of a database state. In: ICSM 2004: Proceedings of the 20th IEEE International Conference on Software Maintenance, pp. 448–452. IEEE Computer Society, Washington, DC, USA (2004)
28. Maule, A., Emmerich, W., Rosenblum, D.S.: Impact analysis of database schema changes. In: Proceedings of the 30th International Conference on Software Engineering (ICSE 2008), pp. 451–460. ACM Press, New York (2008)
29. van den Brink, H., van der Leek, R., Visser, J.: Quality assessment for embedded SQL. In: Proceedings of the 7th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM 2007), pp. 163–170. IEEE Computer Society, Los Alamitos (2007)
30. Ngo, M.N., Tan, H.B.K.: Applying static analysis for automated extraction of database interactions in web applications. Inf. Softw. Technol. 50(3), 160–175 (2008)
31. Cleve, A.: Program Analysis and Transformation for Data-Intensive System Evolution. PhD thesis, University of Namur (October 2009)
32. Kiczales, G., Hilsdale, E.: Aspect-oriented programming. In: ESEC/FSE-9: Proceedings of the 8th European Software Engineering Conference held jointly with 9th ACM SIGSOFT International Symposium on Foundations of Software Engineering, p. 313. ACM, New York (2001)
33. Kiczales, G., Hilsdale, E., Hugunin, J., Kersten, M., Palm, J., Griswold, W.G.: An overview of AspectJ. In: Lee, S.H. (ed.) ECOOP 2001. LNCS, vol. 2072, pp. 327–353. Springer, Heidelberg (2001)
34. Cleve, A., Hainaut, J.L.: Dynamic analysis of SQL statements for data-intensive applications reverse engineering. In: Proceedings of the 15th Working Conference on Reverse Engineering, pp. 192–196. IEEE Computer Society, Los Alamitos (2008)
35. Petit, J.M., Toumani, F., Kouloumdjian, J.: Relational database reverse engineering: A method based on query analysis. Int. J. Cooperative Inf. Syst. 4, 287–316 (1995)
36. Lopes, S., Petit, J.M., Toumani, F.: Discovery of "Interesting" data dependencies from a workload of SQL statements. In: Żytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS (LNAI), vol. 1704, pp. 430–435. Springer, Heidelberg (1999)
37. Tan, H.B.K., Ling, T.W., Goh, C.H.: Exploring into programs for the recovery of data dependencies designed. IEEE Trans. Knowl. Data Eng. 14(4), 825–835 (2002)
38. Tan, H.B.K., Zhao, Y.: Automated elicitation of inclusion dependencies from the source code for database transactions. Journal of Software Maintenance 15(6), 379–392 (2003)
39. Codd, E.F.: A relational model of data for large shared data banks. Commun. ACM 13(6), 377–387 (1970)
40. Cleve, A., Lemaitre, J., Hainaut, J.L., Mouchet, C., Henrard, J.: The role of implicit schema constructs in data quality. In: Proceedings of the 6th International Workshop on Quality in Databases (QDB 2008), pp. 33–40 (2008)
41. DB-MAIN: The DB-MAIN official website (2011), http://www.db-main.be
42. Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In: Proceedings of IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-2003), pp. 73–78 (2003)
43. Zhu, H., Hall, P.A.V., May, J.H.R.: Software unit test coverage and adequacy. ACM Comput. Surv. 29, 366–427 (1997)
44. Kapfhammer, G.M., Soffa, M.L.: A family of test adequacy criteria for database-driven applications. In: Proceedings of the 9th European Software Engineering Conference held jointly with 11th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ESEC/FSE-11, pp. 98–107. ACM, New York (2003)
45. Debusmann, M., Geihs, K.: Efficient and transparent instrumentation of application components using an aspect-oriented approach. In: Brunner, M., Keller, A. (eds.) DSOM 2003. LNCS, vol. 2867, pp. 209–220. Springer, Heidelberg (2003)
46. Yang, Y., Peng, X., Zhao, W.: Domain feature model recovery from multiple applications using data access semantics and formal concept analysis. In: Proceedings of the 16th International Working Conference on Reverse Engineering (WCRE 2009), pp. 215–224. IEEE Computer Society, Los Alamitos (2009)
47. Alalfi, M., Cordy, J., Dean, T.: WAFA: Fine-grained dynamic analysis of web applications. In: Proceedings of the 11th International Symposium on Web Systems Evolution (WSE 2009), pp. 41–50. IEEE Computer Society, Los Alamitos (2009)
Ontology Alignment Evaluation Initiative: Six Years of Experience

Jérôme Euzenat1, Christian Meilicke2, Heiner Stuckenschmidt2, Pavel Shvaiko3, and Cássia Trojahn1

1 INRIA & LIG, Grenoble, France
{jerome.euzenat,cassia.trojahn}@inria.fr
2 University of Mannheim, Germany
{christian,heiner}@informatik.uni-mannheim.de
3 Informatica Trentina S.p.A., Trento, Italy
[email protected]

Abstract. In the area of semantic technologies, benchmarking and systematic evaluation is not yet as established as in other areas of computer science, e.g., information retrieval. In spite of successful attempts, more effort and experience are required in order to achieve such a level of maturity. In this paper, we report results and lessons learned from the Ontology Alignment Evaluation Initiative (OAEI), a benchmarking initiative for ontology matching. The goal of this work is twofold: on the one hand, we document the state of the art in evaluating ontology matching methods and provide potential participants of the initiative with a better understanding of the design and the underlying principles of the OAEI campaigns. On the other hand, we report experiences gained in this particular area of semantic technologies to potential developers of benchmarking for other kinds of systems. For this purpose, we describe the evaluation design used in the OAEI campaigns in terms of datasets, evaluation criteria and workflows, provide a global view on the results of the campaigns carried out from 2005 to 2010 and discuss upcoming trends, both specific to ontology matching and generally relevant for the evaluation of semantic technologies. Finally, we argue that there is a need for a further automation of benchmarking to shorten the feedback cycle for tool developers.

Keywords: Evaluation, experimentation, benchmarking, ontology matching, ontology alignment, schema matching, semantic technologies.
1 Introduction

The past ten years have witnessed impressive development in the area of semantic technologies, mostly driven by the idea of creating a semantic web [4] as a source of information that is accessible by machines. This development has been enabled by the standardization of representation languages for knowledge on the web, in particular RDF and OWL. Based on these languages, many tools have been developed to perform various tasks on the semantic web, such as searching, querying, integrating and reasoning about semi-structured information. Standards were an important factor for the development of software tools supporting semantic web applications. However, a crucial step in their large scale adoption in real world applications will be the ability to
determine the quality of a system in terms of its expected performance on realistic data. This means that systematic evaluation of semantic technologies is an important topic. A major and long term goal of evaluation is to help developers of such systems to improve them and to help users evaluate the suitability of the proposed systems to their needs. The evaluation should thus be run over several years in order to allow for adequate measurement of the evolution of the field. Evaluation should also help assessing absolute results, i.e., what are the properties achieved by a system, and relative results, i.e., how these results compare to the results of other systems.

One particular kind of evaluation is benchmarking. A benchmark is a well-defined set of tests on which the results of a system or a subsystem can be measured [9]. It should make it possible to measure the degree of achievement of proposed tasks on a well-defined scale (that can be achieved or not). It should be reproducible and stable, so that it can be used repeatedly for: (i) testing the improvement or degradation of a system with certainty and (ii) situating a system among others. A medium term goal for evaluation efforts is to set up a collection of reference sets of tests, or benchmark suites, for assessing the strengths and weaknesses of the available tools and to compare their evolution with regard to these references. Building benchmark suites is valuable not just for groups of people who participate in planned evaluations but for the whole community, since system designers can make use of them at any time and compare their results with those of the other systems.

In this paper, we focus on the Ontology Alignment Evaluation Initiative (OAEI, http://oaei.ontologymatching.org/), which carries out annual campaigns for the evaluation of ontology matching tools. Ontology matching is an important functionality in many applications as it is the basis for linking information, e.g., from heterogeneous sources into a common model that can be queried and reasoned upon. Initially, the focus of OAEI was on the task of matching different ontologies rather than on the data itself. More recently, however, the focus is being extended to include data matching algorithms as well. The main goal of OAEI is to compare systems and algorithms on the same basis and to allow anyone to draw conclusions about the best matching strategies. The OAEI ambition is that from such evaluations, tool developers can learn and improve their systems, thus extending the state of the art in ontology matching.

The goal of this paper is to present the state of the art in evaluating ontology matching. For this purpose, we draw lessons from the first six years of carrying out OAEI, focusing on trends we have observed and implications for the further improvement of the OAEI campaigns and the evaluation of semantic technologies in general. Annual OAEI reports [28; 26; 25; 8; 23; 24] present the individual datasets and the results of the different campaigns in detail. In this paper, we take a global view on outcomes of the evaluation campaigns over the years and identify interesting developments, fundamental decisions as well as solved and open problems. Thus, the contributions of the paper are:

– A comprehensive overview of the six years of ontology matching benchmarking in the context of the OAEI initiative, accompanied by a rationale for the choice of the datasets used;
– The identification and discussion of problems in designing experiments for evaluating matching technologies;
– An analysis of the development of the field of ontology matching on the basis of the results obtained in the different evaluation campaigns;
– Current trends and future challenges of ontology matching evaluation based on our observations and experiences from the OAEI campaigns.

In a nutshell, the lessons learned from the evaluation campaigns can be summarized as follows:

– Systematic ontology matching evaluation indeed allows for measuring the progress of the field in terms of participation in the evaluation campaigns, quality of the matching results and runtime performance;
– It is necessary to be reactive, proposing improvements in datasets and evaluation modalities, in order to keep or increase interest in the field;
– Automation is likely to improve the situation on many fronts of ontology matching evaluation, including scalability, variability, and hardness of tests.

The remainder of the paper is structured as follows. In Section 2, we provide an overview of the related work. In Section 3, we introduce the ontology matching problem. Section 4 addresses the problem of designing evaluations for the ontology matching problem and provides some guidelines for the design of future evaluations. Results of the different evaluation campaigns are discussed in Section 5. We first provide background on OAEI, its organization and its development over the years. Then we focus on the progress that has been achieved and how it was measured. In Sections 6 and 7, we summarize our experiences and discuss implications for future evaluation campaigns.
2 Related Work on Evaluations

Currently, the systematic evaluation of semantic technologies in general still falls behind other fields, such as theorem proving and information retrieval, where benchmarking against standardized datasets is a common practice. Standardized evaluations also provide the basis for a fair comparison of systems according to scientific standards and make it harder to tune results in favor of one or another system. Evaluation initiatives like TPTP (Thousands of Problems for Theorem Provers) or TREC (Text Retrieval Conference) that have been carried out on a regular basis for many years have shown that besides the practical benefits of supporting the uptake of technology, systematic and continuous evaluations also lead to a continuous improvement of the field, because fundamental problems are better understood and can be addressed more efficiently due to the direct feedback from the frequent evaluation campaigns. OAEI, presented in this paper, took inspiration from TREC. Indeed, ontology matching is closer to information retrieval than to theorem proving or standard conformance, since there are, in general, no algorithms for providing the solution to the problem to be solved. Thus, establishing an evaluation in such a setting is less direct. As far as ontology matching evaluation is concerned, most of the available works converged towards contributing to the OAEI campaigns. Thus, below, we discuss the related
work on evaluation only in two relevant areas, namely semantic technologies in general and specifically database schema matching.

Evaluation of semantic technologies. While systematic evaluation of semantic technologies is not yet as established as in related areas, such as databases or information retrieval, several initiatives started to investigate this problem by focussing on different types of methods and tools. For example, early efforts have considered the evaluation of semantic web systems with respect to their ability to exchange semantic data without loss of information [63]. Although, in theory, interoperability should be granted by the use of standardized languages, such as RDF and OWL, evaluations have shown that this is not always the case. As a response to this problem, interoperability benchmarks for semantic web tools were defined and implemented for testing existing implementations [29]. So far, interoperability has mostly been tested for ontology development tools. More recent efforts also included the evaluation of APIs for ontology management and API-based interfaces [43]. The efficiency of accessing semantic data is another subject of existing evaluation efforts that stands in the tradition of database systems benchmarking, where the main focus has always been on efficiency. To this end, a number of benchmark datasets for evaluating the performance of RDF databases was defined in terms of generators that can be used to generate arbitrarily large RDF datasets based on a predefined schema [33; 6; 55]. The corresponding experiments typically focus on upload and query execution times. Compared to the existing benchmarking activities in the database area, a special characteristic of semantic data access is the need to perform logical reasoning for answering queries. This means that besides efficiency, completeness and correctness of the underlying reasoning procedures are of major importance and were also considered in the respective benchmarks, see e.g., [33; 44]. More recently, algorithms for generating test data that allows for measuring completeness of a reasoning system independent of a certain schema were investigated as well [61]. Another aspect of semantic technologies that was the subject of evaluation activities is the ability to find and combine relevant information in a useful way. Here, the main criterion is the quality of the resulting information. This task comes in different forms, depending on the kind of information that is concerned. While the use of semantic technologies for enhancing classical information retrieval tasks has not been the subject of systematic evaluation, there is some work from the area of web service discovery and composition, see, e.g., [66]. In particular, the task of selecting appropriate web services based on a user request and semantic annotations was investigated in detail and a comprehensive benchmarking suite is available [41]. Other benchmarking activities are concerned with the integration of different web services into a coherent workflow, although based on a qualitative evaluation rather than concrete quality measures [51]. Different communities have recognized the benefits of providing an automatic evaluation framework where system developers can test their tools against a predefined set of benchmark datasets and receive an evaluation result online. Examples are the SMT-Exec initiative (http://www.smtexec.org) for satisfiability testing and the S3 contest (http://www-ags.dfki.uni-sb.de/~klusch/s3/index.html) for web service matching.
The Ontology Alignment Evaluation Initiative described in this paper is a related activity
in the context of evaluating semantic technologies for finding and combining relevant information that focusses on the task of matching between knowledge models. It thus supplements, or has inspired, the activities mentioned above by focussing on a different technology.

Evaluation of schema matching. Until recently there were no comparative evaluations and it was quite difficult to find two database schema matching systems evaluated on the same dataset. For example, an early evaluation effort of [16] focused mostly on comparison criteria from four areas, such as input (test cases), output (match results), quality measures (precision, recall, f-measure, overall) and savings of manual efforts (pre-match, post-match). It also provided a summary of several matching tools using those criteria. However, even at present in the database community there are no well-established benchmarks for comparing schema matching tools. Instead, the activities were somewhat fragmented, such as those of Cupid [45] and iMAP [15]. Several later works built on these past results in terms of using the same datasets and quality measures for evaluations, such as COMA++ [3], S-Match [31], SMB [47] and YAM [19], to name a few. In turn, the work on STBenchmark [2; 1] focused on the evaluation of mappings, namely on the transformation from source instances into target instances, which finds its parallels with the instance matching track of OAEI. The works closest to OAEI on benchmarking of database schema matching systems are those of [16] and, more recently, of XBenchMatch [18; 17], though these initiatives have not led to well-established recurrent evaluation campaigns.
3 Ontology Matching

Designing and running evaluation campaigns for a certain kind of tools requires a solid understanding of the problem the respective tools try to solve. There have been different formalizations of the matching process and the results generated by this process [5; 42; 38; 59; 70]. We follow the framework presented in [27]. In order to illustrate the matching problem, let us consider the two simple ontologies depicted in Figure 1. These ontologies contain subsumption statements, property specifications and instance descriptions. On an abstract level, ontology matching is the task of finding correspondences between ontologies. Correspondences express relationships supposed to hold between entities in ontologies, for instance, that a SubjectArea in one ontology is the same as a Topic in another one, or that Regular author in one ontology is a subclass of Author in another one. In the example above, one of the correspondences expresses an equivalence, while the other one is a subsumption correspondence. In a further step, one may generate query expressions that automatically translate instances of these ontologies under an integrated ontology. Matching is the process that determines an alignment A for a pair of ontologies o and o′. There are some other parameters that can extend the definition of the matching process, namely: (i) the use of an input alignment A, which is to be completed by the process; (ii) the matching parameters, for instance, weights and thresholds; and (iii) external resources used by the matching process, for instance, common knowledge and domain specific thesauri.
Fig. 1. Two simple ontologies. Classes are shown in rectangles with rounded corners, e.g., in o, Chairman being a specialization (subclass) of Person, while relations are shown without the latter, such as email being an attribute (defined on the domain string) and assignedTo being a property. Ontology matching is a shared instance. Correspondences are shown as arrows that connect an entity from o with an entity from o′. They are annotated with the relation that is expressed by the correspondence.
Each of the elements featured in this definition can have specific characteristics which influence the difficulty of the matching task. It is thus necessary to know and control these characteristics (called dimensions because they define a space of possible tests). The purpose of the dimensions is the definition of the parameters and characteristics of the expected behavior in a benchmark experiment. As depicted in Figure 2, the matching process receives as input three main parameters: the two ontologies to be matched (o and o′) and, possibly, an input alignment (A). The input ontologies can be characterized by the languages in which they are described (e.g., OWL-Lite, OWL-DL, OWL-Full), their size (number of concepts, properties and instances) and their complexity, which indicates how deeply the hierarchy is structured and how dense the interconnections between the ontological entities are. Other properties, such as consistency, correctness and completeness, are also used for characterizing the input ontologies. The input alignment (A) is mainly characterized by its multiplicity (or cardinality, e.g., how many entities of one ontology can correspond to one entity of another one) and its coverage in relation to the ontologies to be matched. In a simple scenario, which is the case for most of the OAEI test cases, the input alignment is empty. Regarding the parameters, some systems take advantage of external resources, such as WordNet, sets of morphological rules or previous alignments among general purpose resources, e.g., Yahoo and Google directories.
Fig. 2. The ontology matching process (from [27])
The output alignment A is a set of correspondences between o and o′:

Definition 1 (Correspondence). Given two ontologies, o and o′, a correspondence is a quintuple ⟨id, e, e′, r, n⟩, such that:
– id is an identifier of the given correspondence;
– e and e′ are entities, e.g., classes and properties, of the first and the second ontology, respectively;
– r is a relation, e.g., equivalence (≡), more general (≥), disjointness (⊥), holding between e and e′;
– n is a confidence measure (typically in the [0, 1] range) holding for the correspondence between e and e′.

Alignments are sets of correspondences between entities belonging to the matched ontologies. The correspondence ⟨id, e, e′, r, n⟩ asserts that the relation r holds between the ontology entities e and e′ with confidence n. The higher the confidence, the higher the likelihood that the relation holds. For example, an alignment A, which contains only equivalence correspondences, is a 1:1 alignment if for all ⟨id1, e1, e′1, r1, n1⟩ ∈ A there exists no ⟨id2, e2, e′2, r2, n2⟩ ∈ A with (e1 = e2 ∧ e′1 ≠ e′2) ∨ (e1 ≠ e2 ∧ e′1 = e′2). For example, in Figure 1, according to some matching algorithm based on linguistic and structure analysis, the confidence measure between the entities with labels Chairman in o and Chair in o′ is 0.75. Suppose that this matching algorithm uses a threshold of 0.55 for determining the resulting alignment, i.e., the algorithm considers all pairs of entities with a confidence measure higher than 0.55 as correct correspondences. Thus, our hypothetical matching algorithm should return to the user the correspondence ⟨id2,4, Chairman, Chair, ≡, 0.75⟩. Different approaches to the problem of ontology matching have emerged from the literature [27]. The main distinction among them is due to the type of knowledge encoded within each ontology, and the way it is utilized when identifying correspondences between features or structures within the ontologies. Terminological methods lexically compare strings (tokens or n-grams) used in naming entities (or in the labels and comments concerning entities), whereas semantic methods utilize model-theoretic semantics to determine whether or not a correspondence exists between two entities. Some approaches may consider the internal ontological structure, such as the range of the
properties (attributes and relations), cardinality, and the transitivity and/or symmetry of the properties, or alternatively the external ontological structure, such as the position of the two entities within the ontological hierarchy. The instances (or extensions) of classes could also be compared using extension-based approaches (e.g., based on frequency distributions). In addition, many ontology matching systems rely not on a single matching method (matcher), but combine several matchers.
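To make Definition 1 concrete, the following sketch models correspondences as plain data objects and implements the 1:1 condition and the threshold-based selection of the Chairman/Chair example. It is our own illustration in Python, not the Alignment API, and all names are invented for the example.

# Illustrative sketch of Definition 1 (our own code, not the Alignment API).
from dataclasses import dataclass

@dataclass(frozen=True)
class Correspondence:
    id: str    # identifier of the correspondence
    e1: str    # entity of the first ontology
    e2: str    # entity of the second ontology
    r: str     # relation, e.g., "≡", "≥", "⊥"
    n: float   # confidence, typically in [0, 1]

def is_one_to_one(alignment):
    """1:1 condition of the text: no two correspondences share an entity
    on one side while differing on the other."""
    return not any(
        (c1.e1 == c2.e1) != (c1.e2 == c2.e2)   # XOR of the two equalities
        for i, c1 in enumerate(alignment)
        for c2 in alignment[i + 1:]
    )

def select(alignment, threshold=0.55):
    """Keep only correspondences whose confidence exceeds the threshold."""
    return [c for c in alignment if c.n > threshold]

# The hypothetical matcher of the running example would return, among others:
c = Correspondence("id2,4", "Chairman", "Chair", "≡", 0.75)
assert select([c]) == [c]     # kept, since 0.75 > 0.55
assert is_one_to_one([c])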
4 Evaluation Design

The design of the evaluations is at the heart of an evaluation campaign, and the design of a good evaluation is a task that should not be underestimated. Setting new challenges for participants in terms of well-designed tests requires a good understanding of the problem domain, in our case ontology matching. In fact, the evaluation initiative only really took off after a theoretical framework for ontology alignment was developed within the KnowledgeWeb network of excellence [7]. Over the years, the theoretical understanding of the problem has been further improved and has led to the development of further datasets. Designing an evaluation is difficult, because it has to balance several partially conflicting desiderata:

D1: The evaluation criteria and tests should cover all relevant aspects of the problem, and the results of an evaluation should provide a good estimation of the expected performance of the tested system in a real application.

D2: The evaluation has to be fair in the sense that it does not favor a certain approach or systems that make certain assumptions about the nature of the data or the result.

D3: The results have to be informative in the sense that they allow the developers of the tested system as well as potential users to learn about the strengths and the weaknesses of a tool and also to decide which tool shows better performance.

D4: The evaluation should allow for quick feedback cycles to foster advances of the state of the art. This requires that the effort of conducting the campaign is not too high, either for the participants or for the organizers.

In the development of the Ontology Alignment Evaluation Initiative we have worked with these desiderata and came up with different methods for improving the evaluations to better meet them. These and further necessary developments are discussed in this section. We start with a basic evaluation design and then discuss its variations. Figure 3 shows a basic evaluation process for ontology matching tools. The main component in this process is the matching component, which represents the system to be evaluated. The system takes two ontologies as input and generates an alignment. The second component is an evaluation script (evaluator) that takes the produced alignment and compares it with a reference alignment representing the expected outcome of the matching process. The evaluator compares the two alignments and computes a measure of the quality of the alignment produced by the matching component. This basic process is simplistic and has to be concretized in many respects. First of all, the input data in terms of the ontologies to be matched has to be defined. No single pair of ontologies can test all aspects of ontology matching. We also experienced that there is a need for different types of datasets: for systematic evaluations and for competitive evaluations.
Fig. 3. Basic evaluation design: a matcher receives two ontologies o and o′ as input and generates an alignment A using a certain set of resources and parameters. An evaluation component receives this alignment and computes a (set of) quality measure(s) m – typically precision and recall – by comparing it to the reference alignment R.
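The evaluator of Figure 3 can be sketched in a few lines. The snippet below is our own illustration (not the OAEI evaluation scripts): alignments are reduced to sets of (entity, entity, relation) triples, and the toy sets A and R are loosely inspired by Figure 1.

# Minimal evaluator sketch (our own illustration, not the OAEI scripts).
# Alignments are represented as sets of (entity1, entity2, relation) triples.
def evaluate(produced, reference):
    """Return precision, recall and f-measure of `produced` w.r.t. `reference`."""
    true_positives = produced & reference
    precision = len(true_positives) / len(produced) if produced else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

A = {("Chairman", "Chair", "≡"), ("Regular author", "Reviewer", "≡")}   # produced (second one wrong)
R = {("Chairman", "Chair", "≡"), ("SubjectArea", "Topic", "≡")}         # reference
print(evaluate(A, R))   # (0.5, 0.5, 0.5)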
Another insight gained was that standard quality measures, in particular precision and recall, are not always suited for the purpose of ontology matching, as they fail to completely capture the semantics of ontology alignments, and different measures are needed for evaluating different aspects. Finally, we found out that more complex approaches are sometimes needed in certain situations, for instance, if a partial alignment exists or if no reference alignment is available. It is possible to use external resources as long as they have not been tuned to the current evaluation experiment (for instance, using a sub-lexicon which is dedicated to the domain considered by the tests). It is acceptable that the algorithm prunes or adapts these resources to the actual ontologies as long as this is part of the normal process of the algorithm. Moreover, some parameters can be provided to the methods participating in an evaluation. However, these parameters must be the same for all the tests. It can be the case that some methods are able to tune their parameters depending on the presented ontologies. In such a case, the tuning process is considered to be part of the method. In the following, we elaborate on these insights with respect to the datasets, quality measures and evaluation processes used in the context of OAEI. Specifically, in §4.1, we discuss properties of ontologies and alignments that determine the hardness of a test. The datasets used in the OAEI initiative are presented in §4.2. In turn, §4.3 discusses the evaluation measures that were developed and used in OAEI. Finally, typical evaluation processes are discussed in §4.4.

4.1 Dataset Characteristics

Good datasets are a prerequisite for a good evaluation. The nature of the datasets determines how far the evaluation design meets our first two desiderata: the coverage of relevant aspects and the fairness of the evaluation. In the case of ontology matching, a dataset typically consists of at least two ontologies and a reference alignment between these ontologies. In the following, we call the combination of exactly two ontologies and, if present, a reference alignment between these ontologies a test. A dataset consists
of several tests. If not defined otherwise, we assume that each combination of ontologies plus the respective reference alignment is a test in the dataset. The work in [30] proposed the following criteria for designing or selecting datasets for ontology matching evaluation:

– Complexity, i.e., that the dataset is hard for state of the art matching systems.
– Discrimination ability, i.e., that the dataset can discriminate sufficiently among various matching approaches.
– Incrementality, i.e., that the dataset allows for incrementally discovering the weaknesses of the tested systems.
– Monotonicity, i.e., that the matching quality measures calculated on subsets of gradually increasing size converge to the values obtained on the whole dataset.
– Correctness, i.e., that a reference alignment is available for the dataset, which allows generated correspondences to be divided into correct and incorrect ones.

There are two basic properties that determine the nature of a dataset, and thus how well it meets the quality criteria mentioned above: the properties of the ontologies to be matched and the properties of the reference alignment that is expected to be reproduced by the matching systems.

Ontologies. There are two major aspects of an ontology that have an influence on the matching process: the complexity of the labels used to describe classes, relations and instances in the ontology, which influences the initial determination of candidate correspondences, and the complexity of the structures used to define these elements, which is often used to improve and validate the initial hypotheses.

Complexity of labels. Many matching systems use a combination of heuristics for comparing the labels of entities in ontologies in order to compute correspondences between these entities. Hence, the kind of labels found in an ontology heavily influences the performance of a particular matching system. Specifically, we distinguish between simple and sentence-like labels, and between monolingual and multilingual labels. It also often makes a large difference whether labels used in an ontology can be anchored to common background knowledge sources, such as WordNet, which helps interpret those labels. Further complexity is added if the ontologies to be matched use specific vocabularies, e.g., from biomedical or geo-spatial applications, that lie outside common language.

Complexity of structures. Almost all matching systems use the structure of definitions in the ontologies to be matched in the later stages of the matching process to propagate similarity estimations and to validate hypotheses on correct correspondences. Therefore, the structures found in ontologies are also an important issue in the design of benchmark datasets. Fortunately, the standardization of the semantic web languages RDF and OWL provides a common syntax for comparing ontologies, but the way and intensity with which this common syntax is used still varies a lot. Directories and thesauri only use the hierarchical structure given by subsumption, while more expressive ontologies use relations between classes that may be constrained by various kinds of axioms. This additional knowledge can be used by matchers for matching as well as for checking the coherence of their alignments [48].
On the level of instances, we can also have different levels of complexity. In particular, instances can either be described in detail using attributes and relations to other instances or can be atomic entities with no further explicit definitions or property specifications. Often instances represent links to external sources, e.g., web pages or images, that can be used as a basis for matching. In this case, the nature of the external resource can also make a significant difference. For example, web pages often provide a good basis for extracting additional information about the described object that makes matching easier, whereas an image is harder to interpret and to compare with other resources.

Reference alignments. A reference alignment is another important aspect to consider: characteristics such as the types of semantic relations used in the alignment or the coverage of the alignment have a significant impact not only on the hardness of the task but also put restrictions on the evaluation measures that are discussed later.

Types of semantic relations. As mentioned in §3, an alignment consists of a set of correspondences defined by elements from the two ontologies and a semantic relation between them. The kind of semantic relations found in the reference alignment also determines what kind of relations the matching systems should be able to produce. The most commonly used relation is equivalence of elements (in most cases classes and relations). The majority of available matching systems are designed to generate equivalence statements. There are exceptions to this rule, however, that should be taken into account. Other kinds of relations that were investigated are subclass [67; 32] and disjointness relations [54; 32].

Formal properties of the alignment. Besides the type of a relation, its semantics is another relevant aspect. In particular, we have to distinguish between more and less rigorous interpretations of relations. The equivalence relation, for example, can be interpreted as logical equivalence or, more informally, as a high level of similarity or exchangeability. Using a rigorous formal interpretation of the semantic relations has the advantage that we can enforce formal properties on the reference alignment. For example, we can claim that the merged model consisting of the two ontologies and the alignment should be coherent, i.e., it should not contain unsatisfiable classes. Enforcing such consistency conditions is not possible for less formal interpretations.

Cardinality and coverage. A less obvious property with a significant influence on the evaluation results is the cardinality of the reference alignment. In principle, there is no restriction on the alignment, so the relation between elements from the different ontologies can be an n-to-m relation. In practice, however, it turns out that the alignment relation is one-to-one in most cases. Therefore, matching systems often generate one-to-one alignments. Along the same lines, the degree of overlap between the ontologies to be matched is not restricted, and a dataset could consist of two ontologies with little or no overlap. Typically, however, it is assumed that the two ontologies to be matched describe the same domain. As a consequence, matching systems normally try to find a correspondence for every element in the two ontologies rather than ignoring elements.
4.2 OAEI Datasets

From 2005 on, different datasets have been used in the OAEI evaluation campaigns. The aim of using these different sets is to cover as many of the relevant aspects of the matching problem as possible, i.e., desideratum D1 discussed above. Initially, the goal of the initiative was to achieve this coverage within a single dataset, the benchmark dataset. The benchmark dataset deals with the topic of scientific publications. It consists of a large set of artificial tests. These tests alter an initial ontology, and the task is to match it to the modified ontology. Modifications concern both the element labels, e.g., replacing them by random labels, and the structure, e.g., deleting or inserting classes in the hierarchy. In addition, the dataset comprises four other real ontologies that have to be matched to the reference ontology. Details about the different tests can be found on the OAEI website4. The declared goal of the benchmark dataset is the analysis of matching systems to identify their strengths and weaknesses with respect to the absence or presence of certain structures in the ontologies to be matched. While the benchmark dataset serves this purpose quite well, it turned out to be less useful for other purposes. In particular, the benchmark dataset is not really suited for comparing the overall performance of systems. Obviously, comparing the performance of systems on the artificial tests is not useful for assessing system behavior in reality, as each of the tests focuses on a specific situation that is not likely to occur in practice and does not reflect any realistic setting. In consequence, we recognized that we needed other, more realistic tests to actually compare the performance of matching systems in realistic situations and that the benchmark dataset is not a suitable means for assessing matcher behavior on real tasks. However, it can still be used as an immediate first check-up of a newly proposed system in terms of its weaknesses, strengths and presumable position with respect to the state of the art. Based on these experiences, the benchmark dataset was complemented by a number of other datasets that try to cover those aspects not addressed by the benchmark dataset. These datasets fall into different categories; see Table 1 for an overview of the datasets that are currently used in OAEI.

Expressive ontologies. To address the issues of realism and difficulty identified for the benchmark dataset, we have introduced two datasets that are more challenging in the sense that they are much larger, more heterogeneous and feature more complex definitions of classes that have to be taken into account during matching. The datasets in this category are the OntoFarm5 dataset [69], also referred to as the conference dataset in the context of the OAEI campaigns, and the anatomy dataset. The conference dataset consists of a set of fifteen OWL ontologies describing scientific conferences using complex definitions. The anatomy dataset consists of two ontologies describing the human and the mouse anatomy that are actually used in the medical community and have been manually matched by medical experts. For both datasets, reference alignments exist, but we have decided not to publish these reference alignments completely to avoid the effect we have observed for the benchmark dataset. Thus, it is possible to conduct a blind evaluation, where the correct answers are not given to the participants.
4 http://oaei.ontologymatching.org/2009/benchmarks/
5 http://nb.vse.cz/~svatek/ontofarm.html
Table 1. Characteristics of test cases ('open' evaluation is made with already published reference alignments, 'blind' evaluation is made by organizers from reference alignments unknown to the participants and 'expert' evaluation involves manual analysis of results, by an expert user)

Dataset        Formalism   Relations                      Confidence  Modalities    Language
benchmarks     OWL         =                              [0 1]       open          EN
anatomy        OWL         =                              [0 1]       blind         EN
conference     OWL-DL      =, <=                          [0 1]       blind+open    EN
directory      OWL         =, <, >, ⊥                     1           blind+open    EN
library        SKOS+OWL    exact-, narrow-, broadMatch    1           blind         EN+NL+FR
benchmarksubs  OWL         =, <, >                        [0 1]       open          EN
ars            RDF         =                              [0 1]       open          EN
tap            RDF         =                              [0 1]       open          EN
iimb           RDF         =                              [0 1]       open          EN
vlcr           SKOS+OWL    exact-, closeMatch             [0 1]       blind expert  NL+EN
Both datasets have become an integral part of the OAEI campaigns.

Directories and thesauri. These datasets consist of large, weakly structured ontologies, as they are already in use on the web and in digital libraries. The lack of a sophisticated structure puts the element labels in a much more prominent position. Besides the analysis of labels, the size of the datasets in this category is a major challenge for many matching systems, as the structures to be matched contain up to hundreds of thousands of classes. A problem connected to these more realistic datasets, e.g., library in Table 1, is the lack of complete reference alignments. Due to the size of the models, creating such an alignment manually is not an option; therefore, other means of evaluation had to be found [36; 30].

Instance matching. With the increasing interest in linked open data, it turns out that typical matching problems on the web consist of finding instances representing the same individual rather than finding equivalent classes in different ontologies. While instance matching is covered by the theory of ontology matching outlined in §3, it had not been represented in the OAEI campaigns until recently. Since 2009 a number of instance matching datasets have been included in the campaigns: iimb, ars and tap. These datasets comprise automatically generated benchmarks, in which one dataset is modified according to various criteria, as well as real data from the domain of scientific publications.

Beyond equivalence. Finally, there are first attempts to move ahead from equivalence as the only semantic relation considered in the OAEI tests. There are now tests that also ask for close matches, using the relations 'exactMatch' and 'closeMatch'.
4.3 Evaluation Measures

The diverse nature of the OAEI datasets, mainly in terms of the complexity of test cases and the presence or absence of (complete) reference alignments, has required the use of different evaluation measures. Furthermore, evaluating matching systems from different perspectives helps avoid favoring a certain approach or system when evaluation is made on the same dataset. This is one of the criteria for meeting desideratum D2 presented above. Organizers also have the important role of conducting a fair evaluation. Table 2 provides an overview of the evaluation criteria used in the OAEI evaluations.

Table 2. OAEI evaluation criteria (compliance is usually measured with variations of precision and recall against available references)

Dataset        Evaluation criteria
benchmarks     complete reference
anatomy        complete reference, efficiency
conference     manual labelling, partial reference, data mining, logical reasoning
directory      complete reference
library        partial reference, application oriented
benchmarksubs  complete reference
ars            complete reference
tap            complete reference
iimb           complete reference
vlcr           partial reference
The most commonly used and understood criterion for the evaluation of ontology alignments is the compliance of matcher alignments with respect to the reference alignments. Measures such as precision (true positives/retrieved), recall (true positives/expected) and f-measure (an aggregation of precision and recall) have been used as a basis in the OAEI campaigns for measuring compliance. For a subset of the datasets, namely conference, library and vlcr, the complete reference alignment is not available, and compliance is then measured on a partial reference alignment. Although precision and recall are standard measures for evaluating the compliance of alignments, alternative measures addressing some of their limitations have been used. For example, it may happen that one alignment is very close to the expected result (reference alignment) and another one is quite remote from it, although both share the same precision and recall. The reason for this is that the standard metrics only compare two sets of correspondences without considering whether these are close or remote to each other. In order to better discriminate between such systems, relaxed precision and recall measures were defined, which replace the set intersection by a distance [20]. To solve another problem, that two alignments may score differently while being semantically equivalent, semantic precision and recall were defined based on entailment instead of inclusion [22].
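As a rough illustration of the idea behind relaxed precision and recall, the sketch below (our own simplification, not the measures as defined in [20]) replaces crisp set membership with a proximity score in [0, 1] between each produced correspondence and the reference; with an exact-match proximity it reduces to standard precision and recall.

# Simplified sketch in the spirit of relaxed precision/recall [20] (our own code).
def relaxed_measures(produced, reference, proximity):
    """`proximity(c, reference)` must return a value in [0, 1] expressing how
    close correspondence c is to the reference alignment."""
    overlap = sum(proximity(c, reference) for c in produced)
    precision = overlap / len(produced) if produced else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall

def exact(c, reference):
    # Exact-match proximity: 1 if the correspondence is in the reference, else 0.
    return 1.0 if c in reference else 0.0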
Especially in the cases where only a partial reference is available, alternative evaluation approaches have been applied. For instance, in the conference track, manual labeling, data mining and logical reasoning techniques were considered:

– For manual labeling, for each matcher the most highly rated correspondences were considered as the population. n correspondences per matcher were randomly sampled from this population. These correspondences were then evaluated as correct or incorrect. As a result, a score for precision was estimated.

– For supporting the discovery of non-trivial findings about matchers, data mining techniques and correspondence patterns were exploited as well. The aim is to find answers to so-called analytic questions, such as: (i) which systems give higher/lower validity than others to the correspondences that are deemed 'in/correct'?; (ii) which systems produce certain matching patterns/correspondence patterns more often than others?; and (iii) which systems are more successful on certain types of ontologies?

– Logical reasoning was used to measure the degree of incoherence that is caused by an alignment. The underlying idea is that a correct alignment should not result in unsatisfiable classes. Measuring the degree of (in)coherence of an alignment was first proposed in [48].

The approach adopted by the library track organizers for compensating the lack of complete reference alignments was based on application relevance. They considered the provided alignment in the context of an annotation translation process supporting the re-indexing of books indexed with one vocabulary A, using concepts from the aligned vocabulary B [36]. For each pair of vocabularies A − B, this scenario interprets the correspondences as rules to translate existing book annotations with A into equivalent annotations with B. Based on the quality of the results for those books for which the correct annotations are known, the quality of the initial correspondences can be assessed.

The criteria above are about alignment quality. However, another useful comparison between systems refers to their efficiency. The best way to measure efficiency is to run all the systems under the same controlled evaluation environment. However, in the previous OAEI campaigns, participants were asked to run their systems on their own machines and to send the resulting alignments to be evaluated. So, the information about the time each system takes to execute the matching was gathered directly from the participants and could not be directly compared.

4.4 Evaluation Processes

An evaluation process represents the interaction between several components in an evaluation experiment (matchers, test providers, evaluators, etc.). A simple process restricts the experiment to the evaluation of one matcher using a set of test cases. Usually, several matchers are evaluated in one evaluation experiment. Figure 4 illustrates the evaluation process that extends the process presented at the beginning of this section (Figure 3). The first step is to retrieve, from a database of tests containing the ontologies to be matched and the corresponding reference alignments, the tests to be
considered in such an evaluation. Next, each available matcher performs the matching task, taking the two ontologies as input parameters. Then, the resulting alignment is evaluated against the reference alignment by an evaluator. Finally, each result interpretation (for instance, precision and recall) is stored in the result database.
Fig. 4. Basic evaluation process
Due to the variability of alignment evaluation, different scenarios can be specified by adding new components to the process presented in Figure 4:

Test generator. Test cases can be generated from a description of the kind of evaluation to be executed (for example, removing n% of the properties of the ontologies). A description of the desired test case must be provided, and the output of the test generator service is then used as input to the matching process. A minimal sketch of such a generator is given after Figure 5.

Lack of reference alignment. Not all test cases have a complete reference alignment. Thus, alternative evaluation metrics must be provided, such as measuring the consensus between several matchers, the intersection or union of the results, etc.

User in the loop. Sometimes, matching systems are considered as semi-automatic and the user has control over the matching process. On the other hand, manual labeling can be required in cases where the reference alignments are not available.

Task-specific evaluation. It can be useful to set up experiments which do not stop at the delivery of alignments, but carry on with the particular task. This is especially true when there is a clear measure of the success of the overall task; see §4.3.

The components described above can be combined together in different ways. Figure 5 illustrates a more elaborate process where tests are generated by a test generator, according to the description provided by the user. This generation process may create a set of alternative ontologies, from a reference ontology, by removing its properties or individuals.
Fig. 5. Advanced evaluation process
Moreover, one can imagine that no reference alignments are provided by the test generator. In such a scenario, the user has the role of an evaluator. For each generated test, the available matchers are executed and their resulting alignments are stored in a database, whose content will be used later for user evaluation.
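The test generator component mentioned above can be pictured as a simple ontology perturbation step. The sketch below is a hypothetical illustration (not the OAEI benchmark generator): it assumes an ontology is given as a dictionary of entity lists and derives an altered copy together with a trivial reference alignment over the kept entities.

# Hypothetical test generator sketch (not the OAEI benchmark generator):
# derive an altered ontology from a reference ontology by removing a fraction
# of its properties and/or individuals, as described in the text.
import random
from copy import deepcopy

def generate_test(ontology, drop_properties=0.0, drop_individuals=0.0, seed=0):
    """`ontology` is assumed to be a dict with 'classes', 'properties' and
    'individuals' lists of entity names."""
    rng = random.Random(seed)
    altered = deepcopy(ontology)
    for kind, fraction in (("properties", drop_properties),
                           ("individuals", drop_individuals)):
        altered[kind] = [e for e in altered[kind] if rng.random() >= fraction]
    # Every entity kept in the altered copy corresponds to itself in the original.
    reference = {(e, e, "≡") for entities in altered.values() for e in entities}
    return altered, reference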
5 Analysis of the Past OAEIs

The evaluation design presented in the previous section was chosen to provide tool developers and potential users with feedback on the state of the art in ontology matching and to foster developments in the field, meeting the last two desiderata presented in §4. Therefore, a crucial question that needs to be answered is whether the initiative indeed supported an improvement of the field. In the following, we try to answer this question by providing an abstract view on the results of the evaluation campaigns. This overview shows that OAEI was a success in many respects. First of all, a large and vivid community was established around the OAEI campaigns, which is shown by an increasing number of participants and test cases provided by the community. Further, we will show that there actually has been an improvement of matching systems that frequently participated in the benchmarking campaigns, both in terms of runtime and quality of matching results. Finally, and probably most importantly, we have gained insights into the nature of ontology matching tasks and the functioning of matching systems. We first provide an overview of the evaluation campaigns that have been carried out from 2004 to 2010 (§5.1). We then summarize our observations with respect to the evolution of result quality (§5.2), the efficiency of matching systems (§5.3) and the impact of matching system configurations on the results (§5.4).
5.1 Campaigns and Participation

The first ontology matching evaluations were carried out in 2004 as part of the Information Interpretation and Integration Conference (I3CON)6 held at the NIST Performance Metrics for Intelligent Systems (PerMIS) workshop and the Evaluation of Ontology-based Tools (EON) workshop of the annual International Semantic Web Conference (ISWC) [62]. The workshops were organized in a joint but complementary way by different organizers. This parallel development emphasized the importance of the topic and indicated that a joint initiative would be of advantage. From 2005 on, joint activities have been carried out under the heading of the Ontology Alignment Evaluation Initiative. The first official evaluation campaign of the initiative was carried out at the Workshop on Integrating Ontologies at the 3rd International Conference on Knowledge Capture (K-CAP 2005) in Banff, Canada. Since 2006, the annual evaluation campaigns have been carried out at the International Semantic Web Conference in the context of the Ontology Matching workshop. Since the very beginning, the workshop has had a constant attendance of more than 50 participants working on the topic. Over the years, the number of systems participating has increased from 4 systems in 2004 to 15 in 2010. A detailed look shows that there was a significant increase from 4 in 2004 up to 17 in 2007, while from 2007 to 2010 the participation rate has remained relatively stable, fluctuating around 15 participants. In the future, OAEI will have to be extended with new datasets and evaluation modalities according to the trends in the field (see §6) in order to maintain or increase the participation rate. Table 3 provides an overview of the campaigns carried out so far. More information on the individual campaigns can be found on the OAEI web site4.

Table 3. Overview of the evaluation campaigns

Campaign  Year  Location          #tests  #participants  Reference
I3CON     2004  Gaithersburg, US  10      5-6            [62]
OAC       2004  Hiroshima, JP     1       4              [28]
OAEI      2005  Banff, CA         3       7              [26]
OAEI      2006  Athens, US        6       10             [25]
OAEI      2007  Busan, KR         7       17             [8]
OAEI      2008  Karlsruhe, DE     8       13             [23]
OAEI      2009  Chantilly, US     9       16             [24]
OAEI      2010  Shanghai, CN      6       15             -
5.2 Quality Improvement

The main goal of OAEI is to support the enhancement of the ontology matching field. In the following, we report on results that show to what extent this goal has been reached. First, we present summative results. In particular, we show how the average f-measure developed from 2005 to 2010, analyzing those datasets which have been run several years in succession.
6 http://www.atl.external.lmco.com/projects/ontology/i3con.html
Then we analyze in detail those systems that have been participating continuously from 2007 to 2010. The presented results allow us to discuss the effects of continuous participation.

Summative results. Our analysis required recomputing some of the values presented as results of the annual campaigns. In particular, the benchmark data was rendered more difficult in 2008. Since these changes affected both the ontologies and the resulting reference alignments, we did not recompute these values. The reference alignments of the conference track have been extended year by year. We recomputed the average f-measure based on the current, most comprehensive corpus of reference alignments. This has to be taken into account when analyzing the results. We have compared the average f-measure in terms of the arithmetic mean over all participants per year and track. This gives a rough representation of the main tendency and allows for abstracting from interdependencies between precision and recall.
Fig. 6. Evolution of the average f-measure in different datasets (benchmark, anatomy, directory and conference)
The results in Figure 6 are heterogeneous for the different datasets. The results for the conference and directory datasets range from an f-measure of 0.3 to 0.5. Both datasets leave room for further improvements. A detailed look reveals that there is a high variance between participants. The top performers of the last years reached an f-measure in the range of 0.5 to 0.6 in the conference track and an f-measure of ≈ 0.6 in the directory track. The average f-measure for the benchmark and anatomy datasets ranges from 0.6 to 0.8, even though both datasets describe different domains and vary in size and expressivity. In both cases good results in f-measure are based on a high precision. The challenge with regard to these datasets is to increase the recall of the results without decreasing the high precision scores. We observe a moderate increase in benchmark and conference, with some exceptions. Results for anatomy and benchmark will be analyzed in more detail later on. The quality of the alignments submitted to the conference track increases with each year, with the exception of 2008. However, Figure 6 does not show that the average f-measure is
based on very different results generated by each of the participating matching systems. It seems to be hard to find an appropriate configuration for matching ontologies of the conference dataset that is also appropriate for the other datasets. We discuss this issue in detail in §5.4. The improvements of the anatomy results are the most significant. In particular, we measured a continuous improvement for each year. Remember that the reference alignment of the anatomy track was not available to participants (blind modality) until 2010 and is hardly reconstructable without biomedical expertise. For what concerns the directory track (where the reference alignments were partially available), the overall trend from 2006 to 2010 is positive, though with a substantial drop in 2008. There are several explanations for this: (i) OLA2 and Prior+ never participated again after 2007, and those were the two systems that showed top results; (ii) the set of participating systems in 2008 was almost completely different compared to 2007; it performed worse than the set of participating systems in 2007, but better than those participating in 2006. Overall, we conclude that the field as a whole improved over the years. We have also claimed that systems entering the campaign several times tend to improve over the years. By providing matching system developers with measurable feedback on their developments, it seems reasonable to think that they will be able to analyze the reasons for these results in order to improve their systems. We consider this claim in the following. Very few of these systems have participated in most of the tests, and only a few systems have participated more than three years in a row, thus allowing a judgement of their individual improvement over time. We therefore have to base our discussion on quality improvement on a limited set of systems and datasets. From among the datasets that were systematically evaluated against a reference alignment from 2007 to 2010, we have chosen the benchmark and the anatomy datasets. We have selected these tracks because several systems participated at least three out of four times in these tracks from 2007 to 2010. For the other tracks, we found a lower level of (more or less) continuous participation.
Fig. 7. Evolution of results on the benchmark dataset (systems shown: asmov, aroma, dssim, lily, mappso, rimom and taxomap)
Results on the benchmark dataset. Figure 7 shows the f-measure of the systems under consideration on the benchmark dataset from 2007 through 2010. These systems achieve a similar level of precision, between 80% and 95%, which is quite a high value for a matching task. Only recall differs and impacts f-measure. However, for each system there is little variation, and not necessarily towards an increase. This is confirmed by the results of ASMOV and RiMOM, which have participated for four years in a row, reaching respectively .92 and .9 in f-measure. Figure 8 shows that the best systems are overall very safe, because their precision is very close to 100% up to 50% recall and is still over 95% up to 90% recall. They are also able to stop providing correspondences when their quality goes down (compared to the edna baseline). Figure 8 also shows the yearly progress of these systems in preserving better precision when looking for more recall. The reason for this behavior is that benchmark is made of a myriad of tasks, some of which are very difficult, but most of which are relatively easy. In consequence, the overall results are higher than for other tasks which were designed to be realistic or hard. This means that the results (precision, recall, f-measure) cannot be transposed from benchmarks to other situations. This also explains why gaining the last points of precision and recall is difficult for each system individually. Due to this feature, the benchmark dataset has lost its discrimination power over the years: a large difference between systems on the benchmarks still reflects a difference in the versatility of systems in practice, but small differences are not relevant. In addition, benchmarks are often used by system developers to tune their systems, both because they cover a variety of situations and because reference alignments are available. Hence, even systems which participate for the first time achieve relatively high performances. This may lead systems to be overfitted to the benchmark dataset, i.e., tuned to solve this particular kind of task.
Fig. 8. Precision/recall curves for the yearly best results on the benchmark dataset (yearly best systems: 2005 Falcon, 2006 RiMOM, 2007 ASMOV, 2008 Lily, 2009 Lily, 2010 ASMOV; edna is a simple matcher based on edit distances on names of entities)
Only systems which are especially designed for another purpose, and whose designers do not want to twist them to address benchmarks, achieve low performances. In summary, although artificially designed, benchmarks can be used as a starting point for developers to test which kinds of ontology features their systems handle better. This feedback can then be exploited for further improvements in their implementations. However, it is not relevant for predicting the behavior of a system in a particular situation.

Results on the anatomy dataset. If our explanation of the results on the benchmark dataset is correct, and there is still an improvement of the overall quality of individual matchers, this improvement will have to be visible in the results on the blind datasets. We chose the anatomy dataset as a basis for checking this, as it has been evaluated against a complete reference alignment from 2007 on. Further, we present the results of those systems that participated at least three times in these four years.
Fig. 9. Evolution of results on the anatomy dataset (systems shown: agrmaker, dssim, rimom, asmov, lily and taxomap)
Figure 9 shows the development of the f-measure of the systems from 2007 to 2010. This time, we can clearly see an upward trend in f-measure, which reflects both a significant increase in precision and a moderate increase in recall. This trend is most significant the second time a system participates, reinforcing the analysis above: once participants know the results of the evaluation they can better improve their systems, but the next time the increase will be smaller. This pleads for more tests and more reference alignments given to participants, because this can be a sign of overfitting to the OAEI datasets in general. Hence, we conclude that there has been a significant increase in quality, at least of those matching systems that participated in the evaluation on a regular basis, on real world datasets. This supports our claim that a measurable quality improvement can be observed in the ontology matching field.
5.3 Runtime
Besides the quality of generated alignments, other criteria are important for practical applications. With the increase in the number and the size of existing ontologies, the runtime of systems also becomes crucial. We therefore made first attempts at measuring the runtime of matching systems as well. This was done in a systematic way for the first time in the 2007 evaluation of the anatomy track. Runtime was not a topic of investigation for the other OAEI tracks, thus we have to focus on the results for the anatomy track. Due to the setting of previous OAEI campaigns, where system developers ran their matchers on the test sets locally and sent the results for inspection, it was not possible to measure comparable runtimes on fair ground. For that purpose, it would have been necessary to execute all systems on the same machine, ensuring a fair comparison. As an alternative, we collected statements about runtimes that had been measured by the participants themselves. This information had to be delivered together with a description of the CPU and memory capabilities of the computer on which the matching process was executed. According to these descriptions, in 2009 most of the systems were run on a CPU with two cores in the range from 2.0 to 3.1 GHz, using 2 or 4 GB of RAM. In 2007, this survey was conducted by the OAEI organizers mainly out of interest. However, the huge variability in the reported runtimes, together with the fact that all systems were executed on machines of similar strength, encouraged us to publish runtimes as part of the results in 2007, following the same strategy in 2008 and 2009. In 2010 it was originally planned to conduct the track in a completely automated evaluation setting. Finally, the evaluation was conducted in a semi-automated way. Due to this, runtime measurements are unfortunately missing for 2010. Table 4 gives an overview of the runtimes that were reported by the matching tool developers. These results show that the community has made clear improvements regarding runtime issues. In 2007, one of the systems required four days for matching the ontologies, while the slowest system in 2009 finished the matching process in under two hours. This trend is also reflected by the average and median values, which significantly decreased from 2007 to 2009. It is also interesting to see that the median and average seem to converge in 2009, because there are no longer negative outliers that require an enormous amount of time. Note also that the decreased runtime is in most cases not related to a decreased quality of the generated alignment.

Table 4. Runtimes in minutes reported by the participants

System          2007   2008   2009
Anchorflood        -      1   0.25
AgreementMaker    30      -     23
Aroma              -      4      1
ASMOV            900    230      5
DSSim             75     17     12
Falcon-AO         12      -      -
kosimap            -      -      5
Lily            5760    200     99
Prior+            23      -      -
RiMOM            240     24     10
SAMBO            360    720      -
SOBOM              -      -     19
TaxoMap          300     25     12
X-SOM            600      -      -
Average          830  152.6   18.6
Median           270   24.5     11
More important than the trend of faster matching systems is the positive acceptance of presenting and discussing runtimes as part of the evaluation. Although we are aware that this way of gathering reported runtimes is subject to criticism, the approach has nevertheless pointed to an important aspect of evaluating matching systems that would otherwise have been neglected. In §6.3, we describe an infrastructure that will finally allow runtimes to be measured by the use of an evaluation runtime environment.

5.4 System Configuration and Optimisation

We study here the configuration of participating systems and the possible influence of datasets and evaluation settings on the performance of systems. For that purpose, we compare the blind results obtained on the conference dataset with the results obtained by the same systems on the benchmark dataset. Our analysis is based on a discussion of the results from the OAEI 2009 conference track. All of the participants of the conference track have also participated in the benchmark track, most of them with good results. These systems are AFlood [56], AgreementMaker [11] as well as an extension of the system, AROMA [13], Asmov [37], Kosimap [52] and DSSim [50]. Two systems, namely DSSim and AFlood, did not annotate correspondences with confidence values. Since our approach requires confidence values, we omitted these systems. The conference submissions of 2009 are well suited for our analysis, because in 2009 the evaluation of the submitted results was based for the first time on the use of a substantial set of reference alignments. Only a small subset of these alignments had been available prior to the evaluation. Similar to the ontologies of the benchmark track, the ontologies of the conference track are of moderate size (between 32 and 140 classes). They cover the domain of conference organization. This domain is partially overlapping with the domain of bibliography. In addition, in both cases, ontologies are labeled in natural language terms (as opposed to the ontologies of the anatomy track, for instance). Thus, we would expect that a system that obtains good results for the benchmark track obtains similar results for the conference track. In particular, we would expect that the configuration of such a system is also well suited for the conference track. However, the results do not fit with this hypothesis. In Figure 10, the dependency between f-measure and threshold is shown for each system that participated in this track. Figure 10 is generated, for each submitted alignment featuring confidences different than 1, by applying a posteriori a threshold that is increased step by step. For each setting the resulting f-measure is measured, and its average over all test cases is depicted in Figure 10. For each of the remaining systems, we can distinguish between two interesting points. The first one is the threshold t where the f-measure increases for the first time (e.g., for kosimap t = 0.1; ASMOV did not use a threshold). This point refers to the threshold that was chosen by the tool developers when running their matching tool to generate the alignments. Since none of the correspondences has a confidence value below this threshold, we observe a horizontal line for, e.g., t < 0.1 with regard to kosimap. The second interesting point is the threshold t where the system reaches its maximum f-measure (e.g., for kosimap t = 0.52).
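This a posteriori threshold sweep can be sketched as follows; it is our own illustration (not the script used in the campaign) and reuses the hypothetical evaluate function and Correspondence objects sketched earlier.

# Sketch of the a posteriori threshold sweep behind Figure 10 (our own illustration).
# `alignment` is a list of Correspondence objects, `reference` a set of triples,
# and `evaluate` returns (precision, recall, f_measure), as sketched in §4.
def f_measure_per_threshold(alignment, reference, evaluate, steps=100):
    curve = []
    for i in range(steps + 1):
        t = i / steps
        kept = {(c.e1, c.e2, c.r) for c in alignment if c.n > t}
        _, _, f = evaluate(kept, reference)
        curve.append((t, f))
    return curve

def optimal_threshold(curve):
    """Return the (t, f) pair at which a system reaches its maximum f-measure."""
    return max(curve, key=lambda point: point[1])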
Fig. 10. F-measures of matching systems for different thresholds (absolute optimal f-measures: amaker 0.55, asmov 0.54, amext 0.49, kosimap 0.45, aroma 0.40)
These curves are all nearly monotonically increasing until the optimum and then monotonically decreasing. This could be the sign of a robust way for all systems to rank correspondences (the increasing phase corresponds to less than optimally ranked false positives and the decreasing phase to more than optimally ranked true positives). On the other hand, while these systems are good at ranking correspondences, they are not very good at finding the optimal threshold. Moreover, the thresholds they use are all lower than optimal. However, the f-measure of these systems is far lower than the one they obtain in the benchmark track. How can this be explained? We cannot exclude that systems are overfitting on benchmarks, i.e., that they are optimized for performing well on the benchmark dataset. Alternatively, the benchmark dataset has a particular feature that can favor those systems which try to maximize the pairing of entities, e.g., by considering a similarity between two entities as a gain and maximizing that gain. Given one of the benchmark test cases, let os be the smaller ontology and ol be the larger ontology. Then each matchable entity in os has a counterpart in ol. Thus maximizing pairing is a good strategy. We call this over-matching. In fact, overfitting or over-matching would have the same results. This characteristic occurs in most of the benchmark tests and in none of the conference tests. Moreover, it is only likely to occur in specific scenarios such as version matching. So, it introduces a bias in evaluation results towards a particular strategy. The lessons of this analysis from the standpoint of evaluation are threefold:

– The benchmark test case should be modified so that this bias is suppressed;
– Delivering more datasets with reference alignments would help developers avoid overfitting;
– Multiplying datasets and studying divergence is a good direction, because it allows matchers to be tested in different situations and the results from multiple datasets to be cross-compared.

Further analysis that goes beyond the scope of a single track is required to understand the effects and appropriateness of specific system configurations.
6 Trends and challenges

From the last years of OAEI (2005-2010), we can identify medium term trends and challenges for ontology matching evaluation. We distinguish between two types of trends: some trends are relevant for the evaluation of systems in general (§6.1), while others are more specific to the matching problem (§6.2). The first ones are independent of the concrete problem that is solved by a system to be evaluated. Typical examples are evaluation metrics related to the effectiveness of user interaction or issues related to the hardness of a test. In addition to these two groups of trends, we conclude in §6.3 with a challenge that cuts across many of these issues. We discuss the automation of evaluation and the infrastructure required for that purpose. In particular, we present the infrastructure developed in the SEALS project7 (Semantic Evaluation at Large Scale), which represents a partial continuation of OAEI, as key to solving many open issues. The work in [60] described ten challenges for ontology matching. Amongst others, these include large-scale evaluation, performance of ontology matching techniques and reasoning with alignments, which are directly related to the trends identified here.

6.1 General Issues

User interaction. Automatic ontology matching can only be a first step in generating a final alignment. Therefore, systems that (i) automatically generate an initial set of matching hypotheses, (ii) support the user in the refinement of the generated alignment, and (iii) propagate the user input to semi-automatically filter and/or extend the alignment are at an advantage and will finally be used to solve concrete matching problems. An example of such a system is AgreementMaker [10]. However, current evaluation techniques do not take into account the quality and effectiveness of user interventions. Currently, only a subtask of the anatomy track in OAEI deals marginally with the third aspect, while the second point is not considered at all. This is one of the most important drawbacks of current evaluation practices that has to be tackled in the future.

In situ evaluation. The overall approach underlying most OAEI evaluations is based on the implicit assumption that there exists a unique reference alignment that correctly describes how ontologies have to be matched. Although this reference alignment is not always available, correspondences can in principle be divided into correct and incorrect ones. However, the relative quality or usefulness of a generated alignment also depends on its intended use.
on its intended use. The difference between these approaches was emphasized in [68] by a comparison of relevance and correctness. In [34], an evaluation method is described that takes into account some characteristics of a usage scenario and reports the respective evaluation results.

Large scale analysis. The OAEI campaigns gave only some preliminary evidence of the scalability characteristics of ontology matching technology; we reported on these attempts in §5. Therefore, larger tests involving 10,000, 100,000 and 1,000,000 entities per ontology (e.g., UMLS has about 200,000 entities) have to be designed and conducted. In turn, this raises the issue of wider automation in acquiring reference alignments, e.g., by minimizing the human effort while increasing the size of the evaluation datasets [60; 46]. Notice also that scalability involves not only runtime, but also aspects such as memory consumption and required disk capacity.

Defining and measuring test hardness. There is a need for evaluation methods grounded in a deep analysis of the matching problem space. Semi-automatic test generation methods require such an analysis as a basis. These methods will allow for the construction of tests of a desired hardness by addressing a particular point in the matching problem space. We have already argued that additional tests are required. Initial steps in this direction were already discussed in [30].

6.2 Specific Issues

In the following, we present several specific issues that we believe will become more important to OAEI: complex matching, instance matching and database schema matching. Complex matching refers to a matching process in which correspondences are not restricted to linking named entities, but can also link complex descriptions. Instance matching is not concerned with matching terminological entities but focuses on matching individuals. Finally, schema matching has received decades of attention in the database community. Database schemas are different from ontologies, e.g., by not providing explicit semantics for their data. However, they are also similar in the sense that both schemas and ontologies provide a vocabulary of terms and constrain the meaning of the terms used in the vocabulary. Moreover, in real-life situations both schemas and ontologies have well-defined as well as obscure labels and structures; thus, they often share similar solutions, which need to be evaluated.

Complex matching. State-of-the-art ontology matching techniques are often limited to detecting correspondences between atomic concepts and properties. Nevertheless, for many concepts and properties atomic counterparts will not exist, while it is possible to construct equivalent complex concept and property descriptions [58]. A typical example, presented in [53], is the correspondence Researcher ≡ Person ⊓ ∃researchedBy⁻¹.⊤. The expressivity supported by the available Alignment API [21] implementation was in the past restricted to non-complex correspondences and has recently been extended to a more expressive language referred to as EDOAL (Expressive and Declarative Ontology Alignment Language) [14]. Even though the infrastructure for expressing complex correspondences is now available and several approaches to complex matching have been proposed (see, for example, [15; 53]), such correspondences have yet to be systematically evaluated in OAEI.
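To make the distinction between simple and complex correspondences concrete, the sketch below represents both kinds as small data structures. This is only an illustration under hypothetical names; it uses neither the Alignment API nor EDOAL syntax, and the Researcher example merely mirrors the correspondence quoted above.

```python
# Toy representation of simple vs. complex correspondences.
# This is neither EDOAL nor the Alignment API; all names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Named:
    iri: str               # a named class or property

@dataclass(frozen=True)
class Inverse:
    prop: Named            # the inverse of a named property

@dataclass(frozen=True)
class Exists:
    prop: object           # Named or Inverse
    filler: object = None  # None stands for an unconstrained filler (top)

@dataclass(frozen=True)
class And:
    left: object
    right: object

@dataclass(frozen=True)
class Correspondence:
    source: object
    target: object
    relation: str = "="    # equivalence by default

# A simple correspondence links two named entities:
simple = Correspondence(Named("o1#Paper"), Named("o2#Article"))

# A complex correspondence links a named class to a constructed description,
# here mirroring Researcher = Person and (exists researchedBy^-1):
complex_corr = Correspondence(
    Named("o1#Researcher"),
    And(Named("o2#Person"), Exists(Inverse(Named("o2#researchedBy")))))

print(simple)
print(complex_corr)
```

Evaluating such correspondences requires reference alignments and comparison measures that go beyond set intersection of entity pairs, which is part of what makes complex matching an open evaluation issue.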
Instance matching and linked data. While rich ontologies were promoted as an integral part of every semantic web application [35], it is increasingly argued that the real value of the semantic web lies in its ability to create and maintain linked open data, which provides effective access to semantically enhanced information on the web [65]. In 2009, OAEI included for the first time a track explicitly concerned with instance matching; six matching systems participated in 2009 and five in 2010. It can be expected that this track will be an important component of OAEI in the following years, with an increasing number of participants.

Database schema matching. As mentioned in §2, there are at present no well-established benchmarks in the database community for comparing schema matching tools. However, there are many recent schema matching tools and, more generally, model management infrastructures, e.g., COMA++ [3], AgreementMaker [12], GeRoMe [40; 39] and Harmony [49; 57], that are also able to process ontologies; hence, their developers might be interested in testing them within OAEI, as actually already happens, though modestly. On the other hand, OAEI has to consider including explicit schema matching tasks involving XML and relational schemas in order to boost the cross-fertilization between these communities.

6.3 Automation

Although the OAEI campaigns have created a basis for evaluation that did not exist before, further progress in leveraging increased evaluation efforts has to be made in order to sustain the growth of ontology matching technology. This progress is highly dependent on the automation of many parts of the evaluation process. Automation would reduce the effort necessary for carrying out evaluations and, above all, make it possible to handle more complex evaluation processes as well as measurements of runtime and memory consumption. Reducing the evaluation effort will also make it easier to meet the fourth desideratum discussed in §4.

The SEALS project aims at establishing systematic evaluation methods for semantic technologies, including ontology matching, by providing standardized datasets, evaluation campaigns for typical semantic web tools and, in particular, a software infrastructure – the SEALS platform – for automatically executing evaluations. This platform will allow matcher developers to run their tools in the execution environment of the platform, both in the context of an evaluation campaign and on their own, for a formative evaluation of the current version of their tool. The results can be published either in the context of the evaluation campaign or in the context of evaluating a tool on its own. In both cases, results are reproducible, since the matching system, the test dataset and the results themselves are stored and archived in the repositories of the SEALS platform.

This approach differs from the one followed in the OAEI campaigns, where participants send their results (and their systems) to the OAEI organizers in the Alignment API format [21]. These submissions are accepted by the organizers as the official results of the matching system. After a phase of validating, e.g., the format of the submissions, evaluation experiments are conducted by the organizers and the results are
prepared and finally presented on a web page (see, for example, http://oaei.ontologymatching.org/2010/results/) and in the annual result reports [28; 26; 25; 8; 23; 24]. This process requires several weeks before the first results are published. The SEALS platform aims at automating most of the evaluation process, which allows tool developers to receive direct feedback. OAEI will in particular benefit both from the reduced effort required of the organizers and from the controlled execution environment. This environment ensures that the matching systems generate their alignments with a fixed setting for each track and test case. In particular, it makes it possible to execute all evaluated matching systems in the same controlled context. Thus, it is possible to conduct precise runtime measurements that will replace the report-based approach used from 2007 to 2009.

OAEI and SEALS are closely coordinated, and the SEALS platform will be progressively integrated within the OAEI evaluations. In a first phase, the participants of three OAEI 2010 tracks (benchmarks, anatomy, conference) were asked to make their tools available as web services. Implementing the required interface allowed participants in 2010 to debug their systems from their own sites. This approach replaces the phase of preliminary testing as well as the submission of the final results. The alignments are generated on the machine of the tool developer and sent to the SEALS platform in the context of an evaluation process. On the one hand, evaluation results are immediately available in this setting; on the other hand, runtime and memory consumption cannot be correctly measured because the controlled execution environment is missing. Details on this approach can be found in [64].

In the second phase, which is already planned for OAEI 2011, the tools will be deployed in the SEALS platform. This allows organizers to compare systems on the same basis, in particular in terms of runtime. It is also a test of the deployability of tools. Successful deployment relies on the Alignment API and requires additional information about how the tool can be executed in the platform and about its dependencies in terms of resources (e.g., installed databases or resources like WordNet). For that reason, the challenging goal of the SEALS project can only be reached with the support of the matching community and highly depends on its acceptance by tool developers.
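As a simple illustration of what such automation involves, the sketch below runs an arbitrary matcher, measures its runtime and scores its output against a reference alignment. The matcher interface, the toy matcher and the data are hypothetical assumptions made for the example; this is not the SEALS platform API or its web-service interface.

```python
# Minimal sketch of an automated evaluation run: invoke a matcher, time it,
# and score the produced alignment against a reference alignment.
# The matcher interface and the data are hypothetical, not the SEALS API.
import time

def evaluate(matcher, onto1, onto2, reference):
    start = time.perf_counter()
    alignment = set(matcher(onto1, onto2))   # matcher returns correspondences
    runtime = time.perf_counter() - start
    tp = len(alignment & reference)
    precision = tp / len(alignment) if alignment else 0.0
    recall = tp / len(reference) if reference else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "f-measure": f, "runtime_s": runtime}

def toy_matcher(onto1, onto2):
    # Stand-in matcher: pairs entities whose local names are identical.
    # Real ontologies would be parsed; here they are plain dictionaries
    # mapping local names to full IRIs.
    return {(onto1[name], onto2[name]) for name in onto1.keys() & onto2.keys()}

onto1 = {"Paper": "o1#Paper", "Author": "o1#Author"}
onto2 = {"Paper": "o2#Paper", "Review": "o2#Review"}
reference = {("o1#Paper", "o2#Paper")}

print(evaluate(toy_matcher, onto1, onto2, reference))
```

Running all matchers through the same harness on the same machine is what makes runtime (and, with additional instrumentation, memory) measurements comparable across systems.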
7 Conclusions

The OAEI campaigns of the last years have provided extensive experience in ontology matching and in the evaluation of semantic technologies in general. This experience has been reported in this paper. We summarize here the lessons learned that are worth emphasizing, because they are relevant not only to OAEI but also to evaluation activities in other areas of semantic technologies and beyond.

As the reported experience indicates, there is foremost a real need for systematic evaluation. Researchers and practitioners working on ontology matching tools have eagerly taken up the challenges offered by OAEI and have actively participated from the beginning. In general, systems have improved their performance over the campaigns for most of the tracks. This is especially corroborated by the results of the anatomy track, but it is a general trend.
We observed that it was necessary to evolve with the field, following both our understanding of it and the reaction of developers to the proposed datasets. For example, most of the participants focused on the benchmark dataset, followed by anatomy and conference; only a few systems did not submit their results for benchmark. This may be due to the fact that benchmark offers relatively easy tasks and full reference alignments. Developers naturally use this available information (evaluation results) for improving their results. However, this overfitting has a potential influence on the performance of the systems. In turn, this requires us to be reactive in proposing new datasets, new measures and new evaluation settings.

We have pointed out areas in which improvements are necessary: more varied benchmarks (from various vertical domains as well as transversal ones), and instance-based and user-assisted matching, to name a few. We have also made the case for automation and reported on the first steps made in that direction. Increased automation does not only mean less work for evaluation organizers and better reproducibility: it offers the opportunity to generate datasets and thus to test scalability and variability and to understand test hardness, and it allows for runtime, space and deployability measurements. Ultimately, it turned out that a rather minimal common infrastructure was sufficient to start the initiative.

Finally, setting up such an evaluation is a great chance, and a great responsibility: it has an influence not only on the improvement of systems but also on the research directions being followed. This chance, however, comes at a price, since a successful evaluation initiative requires a deep understanding of the problem domain and substantial resources dedicated to creating datasets, designing protocols and processing evaluations.
Acknowledgements

Jérôme Euzenat, Christian Meilicke, Heiner Stuckenschmidt and Cássia Trojahn dos Santos are partially supported by the European project SEALS (IST-2009-238975). Pavel Shvaiko was supported by the Trentino as a Lab initiative of the European Network of the Living Labs at Informatica Trentina. We are grateful to all our colleagues who contributed to the OAEI campaigns: Caterina Caracciolo, Alfio Ferrara, Ryutaro Ichise, Antoine Isaac, Fausto Giunchiglia, Willem Robert van Hage, Laura Hollink, Cliff Joslyn, Véronique Malaisé, Malgorzata Mochol, Andriy Nikolov, Natasha Noy, Juan Pane, Marta Sabou, François Scharffe, Vassilis Spiliopoulos, Ondřej Šváb-Zamazal, Vojtech Svatek, George Vouros, Shenghui Wang and Mikalai Yatskevich. Finally, we also thank all the OAEI participants who made these evaluations worthwhile: Masaki Aono, John Beecher-Deighan, Boutheina Ben Yaghlane, Sadok Ben Yahia, Wayne Bethea, Jürgen Bock, Olivier Bodenreider, Gosse Bouma, Silvana Castano, Gong Cheng, Isabel F. Cruz, Carlo Curino, Jérôme David, Jean-François Djoufak Kengue, Marc Ehrig, Daniel Engmann, Alfio Ferrara, Clayton Fink, Sandra Geisler, Jorge Gracia, Philippe Guégan, Fayçal Hamdi, Jan Hettenhausen, Bo Hu, Wei Hu, Yves R. Jean-Mary, Ningsheng Jian, Mansur R. Kabuka, Yannis Kalfoglou, Vangelis Karkaletsis, Ulas C. Keles, David Kensche, Ching-Chieh Kiu, Konstantinos Kotis, Najoua Laamari, Patrick Lambrix, Chien-Sing Lee, Dan Li, Juanzi Li, Xiang Li, Yi Li, Peng Liu, Qiang Liu, Angela Maduko, Ming Mao, Sabine Massmann, Eduardo Mena,
Engelbert Mephu-Nguifo, Gianpaolo Messa, Enrico Motta, Miklos Nagy, Slawomir Niedbala, Nobal Niraula, Giorgio Orsi, Roelant Ossewaarde, Flavio Palandri-Antonelli, Jeff Z. Pan, Yefei Peng, Yuzhong Qu, Christoph Quix, Erhard Rahm, Nataliya Rassadko, Quentin Reul, Chantal Reynaud, Marta Sabou, Brigitte Safar, Hanif Seddiqui, Feng Shi, E. Patrick Shironoshita, Yahya Slimani, Vassilis Spiliopoulos, Piotr Stolarski, Umberto Straccia, Cosmin Stroe, Heiko Stoermer, William Sunna, York Sure, He Tan, Letizia Tanca, Jie Tang, Haijun Tao, Raphaël Troncy, Alexandros G. Valarakos, Petko Valtchev, Maria Vargas-Vera, George A. Vouros, Peng Wang, Yadong Wang, Honghan Wu, Baowen Xu, Peigang Xu, Tianyi Zang, Haifa Zargayouna, Sami Zghal, Yuanyuan Zhao, Duo Zhang, Songmao Zhang, Xiao Zhang, Dongdong Zheng, Qian Zhong and Xinyu Zhong.
References

1. Alexe, B., Tan, W.C., Velegrakis, Y.: Comparing and evaluating mapping systems with STBenchmark. VLDB Endowment (PVLDB) 1(2), 1468–1471 (2008)
2. Alexe, B., Tan, W.C., Velegrakis, Y.: STBenchmark: towards a benchmark for mapping systems. VLDB Endowment (PVLDB) 1(1), 230–244 (2008)
3. Aumueller, D., Do, H.-H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. In: Proceedings of the 24th International Conference on Management of Data (SIGMOD), Software Demonstration, Baltimore, MD, US, pp. 906–908 (June 2005)
4. Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American 284(5), 34–43 (2001)
5. Bernstein, P., Halevy, A., Pottinger, R.: A vision of management of complex models. ACM SIGMOD Record 29(4), 55–63 (2000)
6. Bizer, C., Schultz, A.: The Berlin SPARQL benchmark. International Journal of Semantic Web and Information Systems 5(2), 1–24 (2009)
7. Bouquet, P., Ehrig, M., Euzenat, J., Franconi, E., Hitzler, P., Krötzsch, M., Serafini, L., Stamou, G., Sure, Y., Tessaris, S.: Specification of a common framework for characterizing alignment. Deliverable D2.2.1v2, Knowledge web NoE (December 2004)
8. Caracciolo, C., Euzenat, J., Hollink, L., Ichise, R., Isaac, A., Malaisé, V., Meilicke, C., Pane, J., Shvaiko, P., Stuckenschmidt, H., Zamazal, O.Š., Svatek, V.: Results of the ontology alignment evaluation initiative 2008. In: Proceedings of the ISWC 2008 Workshop on Ontology Matching, Karlsruhe, DE, pp. 73–119 (October 2008)
9. Castro, R.G., Maynard, D., Foxvog, D., Wache, H., González-Cabero, R.: Specification of a methodology, general criteria, and benchmark suites for benchmarking ontology tools. Deliverable D2.1.4, Knowledge web NoE (February 2004)
10. Cruz, I., Antonelli, F.P., Stroe, C.: AgreementMaker: efficient matching for large real-world schemas and ontologies. VLDB Endowment (PVLDB) 2(2), 1586–1589 (2009)
11. Cruz, I., Antonelli, F.P., Stroe, C., Keles, U.C., Maduko, A.: Using AgreementMaker to align ontologies for OAEI 2009: Overview, results, and outlook. In: Proceedings of the ISWC 2009 Workshop on Ontology Matching, pp. 135–146 (October 2009)
12. Cruz, I.F., Antonelli, F.P., Stroe, C.: AgreementMaker: Efficient matching for large real-world schemas and ontologies. VLDB Endowment (PVLDB) 2(2), 1586–1589 (2009)
13. David, J.: AROMA results for OAEI 2009. In: Proceedings of the ISWC 2009 Workshop on Ontology Matching, pp. 147–151 (October 2009)
14. David, J., Euzenat, J., Scharffe, F., Trojahn, C.: The Alignment API 4.0. Semantic Web Journal 2(1) (2011)
15. Dhamankar, R., Lee, Y., Doan, A.-H., Halevy, A., Domingos, P.: iMAP: Discovering complex semantic matches between database schemas. In: Proceedings of the 23rd International Conference on Management of Data (SIGMOD), Paris, FR, pp. 383–394 (June 2004)
16. Do, H.-H., Melnik, S., Rahm, E.: Comparison of schema matching evaluations. In: Chaudhri, A.B., Jeckle, M., Rahm, E., Unland, R. (eds.) NODe-WS 2002. LNCS, vol. 2593, pp. 221–237. Springer, Heidelberg (2003)
17. Duchateau, F., Bellahsene, Z.: Measuring the quality of an integrated schema. In: Parsons, J., Saeki, M., Shoval, P., Woo, C., Wand, Y. (eds.) ER 2010. LNCS, vol. 6412, pp. 261–273. Springer, Heidelberg (2010)
18. Duchateau, F., Bellahsene, Z., Hunt, E.: XBenchMatch: a benchmark for XML schema matching tools. In: Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), Vienna, AT, pp. 1318–1321 (September 2007)
19. Duchateau, F., Coletta, R., Bellahsene, Z., Miller, R.J.: (Not) yet another matcher. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), Hong Kong, CN, pp. 1537–1540 (November 2009)
20. Ehrig, M., Euzenat, J.: Relaxed precision and recall for ontology matching. In: Ashpole, B., Ehrig, M., Euzenat, J., Stuckenschmidt, H. (eds.) Proceedings of the Workshop on Integrating Ontologies, vol. 156, p. 8 (August 2005)
21. Euzenat, J.: An API for ontology alignment. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004. LNCS, vol. 3298, pp. 698–712. Springer, Heidelberg (2004)
22. Euzenat, J.: Semantic precision and recall for ontology alignment evaluation. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, IN, pp. 348–353 (January 2007)
23. Euzenat, J., Ferrara, A., Hollink, L., Isaac, A., Joslyn, C., Malaisé, V., Meilicke, C., Nikolov, A., Pane, J., Sabou, M., Scharffe, F., Shvaiko, P., Spiliopoulos, V., Stuckenschmidt, H., Zamazal, O.Š., Svatek, V., Trojahn, C., Vouros, G., Wang, S.: Results of the ontology alignment evaluation initiative 2009. In: Proceedings of the ISWC 2009 Workshop on Ontology Matching, Washington, DC, US, pp. 73–126 (October 2009)
24. Euzenat, J., Ferrara, A., Meilicke, C., Pane, J., Scharffe, F., Shvaiko, P., Stuckenschmidt, H., Zamazal, O.Š., Svatek, V., Trojahn, C.: Results of the ontology alignment evaluation initiative 2010. In: Proceedings of the ISWC 2010 Workshop on Ontology Matching, Shanghai, CN, pp. 85–125 (2010)
25. Euzenat, J., Isaac, A., Meilicke, C., Shvaiko, P., Stuckenschmidt, H., Šváb, O., Svatek, V., van Hage, W.R., Yatskevich, M.: Results of the ontology alignment evaluation initiative 2007. In: Proceedings of the ISWC 2007 Workshop on Ontology Matching, Busan, KR, pp. 96–132 (November 2007)
26. Euzenat, J., Mochol, M., Shvaiko, P., Stuckenschmidt, H., Šváb, O., Svatek, V., van Hage, W.R., Yatskevich, M.: Results of the ontology alignment evaluation initiative 2006. In: Proceedings of the ISWC 2006 Workshop on Ontology Matching, Athens, GA, US, pp. 73–95 (November 2006)
27. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg (2007)
28. Euzenat, J., Stuckenschmidt, H., Yatskevich, M.: Introduction to the ontology alignment evaluation 2005. In: Proceedings of the K-CAP 2005 Workshop on Integrating Ontologies, Banff, CA (October 2005)
29. Garcia-Castro, R., Gómez-Pérez, A., Prieto-Gonzalez, J.: IBSE: An OWL interoperability evaluation infrastructure. In: Golbreich, C., Kalyanpur, A., Parsia, B. (eds.) Proceedings of the Workshop on OWL: Experiences and Directions (OWLED), Innsbruck, AT. CEUR Workshop Proceedings, vol. 258, pp. 1–10 (June 2007)
30. Giunchiglia, F., Yatskevich, M., Avesani, P., Shvaiko, P.: A large scale dataset for the evaluation of ontology matching systems. The Knowledge Engineering Review (KER) 24(2), 137–157 (2009)
31. Giunchiglia, F., Shvaiko, P., Yatskevich, M.: Semantic schema matching. In: Chung, S. (ed.) OTM 2005. LNCS, vol. 3760, pp. 347–365. Springer, Heidelberg (2005)
32. Giunchiglia, F., Yatskevich, M., Shvaiko, P.: Semantic matching: Algorithms and implementation. In: Spaccapietra, S., Atzeni, P., Fages, F., Hacid, M.-S., Kifer, M., Mylopoulos, J., Pernici, B., Shvaiko, P., Trujillo, J., Zaihrayeu, I. (eds.) Journal on Data Semantics IX. LNCS, vol. 4601, pp. 1–38. Springer, Heidelberg (2007)
33. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics 3(2), 158–182 (2005)
34. Hollink, L., van Assem, M., Wang, S., Isaac, A., Schreiber, G.: Two variations on ontology alignment evaluation: Methodological issues. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 388–401. Springer, Heidelberg (2008)
35. Horrocks, I.: Ontologies and the semantic web. Communications of the ACM 51(11), 58–67 (2008)
36. Isaac, A., Matthezing, H., van der Meij, L., Schlobach, S., Wang, S., Zinn, C.: Putting ontology alignment in context: Usage scenarios, deployment and evaluation in a library case. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 402–417. Springer, Heidelberg (2008)
37. Jean-Mary, Y.R., Shironoshita, E.P., Kabuka, M.R.: Ontology matching with semantic verification. Journal of Web Semantics 7(3), 235–251 (2009)
38. Kalfoglou, Y., Schorlemmer, M.: Ontology mapping: the state of the art. The Knowledge Engineering Review 18(1), 1–31 (2003)
39. Kensche, D., Quix, C., Li, X., Li, Y., Jarke, M.: Generic schema mappings for composition and query answering. Data and Knowledge Engineering 68(7), 599–621 (2009)
40. Kensche, D., Quix, C., Chatti, M.A., Jarke, M.: GeRoMe: A generic role based metamodel for model management. In: Spaccapietra, S., Atzeni, P., Fages, F., Hacid, M.-S., Kifer, M., Mylopoulos, J., Pernici, B., Shvaiko, P., Trujillo, J., Zaihrayeu, I. (eds.) Journal on Data Semantics VIII. LNCS, vol. 4380, pp. 82–117. Springer, Heidelberg (2007)
41. Küster, U., König-Ries, B.: Towards standard test collections for the empirical evaluation of semantic web service approaches. International Journal of Semantic Computing 2(3), 381–402 (2008)
42. Lenzerini, M.: Data integration: A theoretical perspective. In: Proceedings of the 21st Symposium on Principles of Database Systems (PODS), Madison, WI, US, pp. 233–246 (June 2002)
43. Luther, M., Liebig, T., Böhm, S., Noppens, O.: Who the heck is the father of Bob? In: Aroyo, L., Traverso, P., Ciravegna, F., Cimiano, P., Heath, T., Hyvönen, E., Mizoguchi, R., Oren, E., Sabou, M., Simperl, E. (eds.) ESWC 2009. LNCS, vol. 5554, pp. 66–80. Springer, Heidelberg (2009)
44. Ma, L., Yang, Y., Qiu, Z., Xie, G., Pan, Y., Liu, S.: Towards a complete OWL ontology benchmark. In: Sure, Y., Domingue, J. (eds.) ESWC 2006. LNCS, vol. 4011, pp. 125–139. Springer, Heidelberg (2006)
45. Madhavan, J., Bernstein, P., Rahm, E.: Generic schema matching with Cupid. In: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Roma, IT, pp. 48–58 (September 2001)
46. Maltese, V., Giunchiglia, F., Autayeu, A.: Save up to 99% of your time in mapping validation. In: Meersman, R., Dillon, T., Herrero, P. (eds.) OTM 2010. LNCS, vol. 6427, pp. 1044–1060. Springer, Heidelberg (2010)
47. Marie, A., Gal, A.: Boosting schema matchers. In: Chung, S. (ed.) OTM 2008, Part I. LNCS, vol. 5331, pp. 283–300. Springer, Heidelberg (2008)
48. Meilicke, C., Stuckenschmidt, H.: Incoherence as a basis for measuring the quality of ontology mappings. In: Proceedings of the ISWC 2008 Workshop on Ontology Matching, Karlsruhe, DE, pp. 1–12 (October 2008)
49. Mork, P., Seligman, L., Rosenthal, A., Korb, J., Wolf, C.: The Harmony Integration Workbench. In: Spaccapietra, S., Pan, J.Z., Thiran, P., Halpin, T., Staab, S., Svatek, V., Shvaiko, P., Roddick, J. (eds.) Journal on Data Semantics XI. LNCS, vol. 5383, pp. 65–93. Springer, Heidelberg (2008)
50. Nagy, M., Vargas-Vera, M., Motta, E.: DSSim-ontology mapping with uncertainty. In: Proceedings of the ISWC 2006 Workshop on Ontology Matching, pp. 115–123 (November 2006)
51. Petrie, C., Margaria, T., Lausen, H., Zaremba, M.: Semantic Web Services Challenge - Results from the First Year. Semantic Web and Beyond, vol. 8. Springer, Heidelberg (2009)
52. Reul, Q., Pan, J.Z.: KOSIMap: ontology alignment results for OAEI 2009. In: Proceedings of the ISWC 2009 Workshop on Ontology Matching, pp. 177–185 (October 2009)
53. Ritze, D., Meilicke, C., Zamazal, O.Š., Stuckenschmidt, H.: A pattern-based ontology matching approach for detecting complex correspondences. In: Proceedings of the ISWC 2009 Workshop on Ontology Matching, Washington, DC, US (October 2009)
54. Sabou, M., Gracia, J.: Spider: bringing non-equivalence mappings to OAEI. In: Proceedings of the ISWC 2008 Workshop on Ontology Matching, Karlsruhe, DE, pp. 199–205 (October 2008)
55. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: SP²Bench: A SPARQL performance benchmark. In: Proceedings of the 25th International Conference on Data Engineering (ICDE), Shanghai, CN, pp. 222–233. IEEE, Los Alamitos (2009)
56. Seddiqui, M.H., Aono, M.: Anchor-Flood: results for OAEI 2009. In: Proceedings of the ISWC 2009 Workshop on Ontology Matching, pp. 127–134 (October 2009)
57. Seligman, L., Mork, P., Halevy, A.Y., Smith, K., Carey, M.J., Chen, K., Wolf, C., Madhavan, J., Kannan, A., Burdick, D.: OpenII: an open source information integration toolkit. In: Proceedings of the 29th International Conference on Management of Data (SIGMOD), Indianapolis, IN, US, pp. 1057–1060 (2010)
58. Seylan, I., Franconi, E., de Bruijn, J.: Effective query rewriting with ontologies over DBoxes. In: Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), Pasadena, US, pp. 923–925 (July 2009)
59. Shvaiko, P., Euzenat, J.: A survey of schema-based matching approaches. Journal on Data Semantics IV, 146–171 (2005)
60. Shvaiko, P., Euzenat, J.: Ten challenges for ontology matching. In: Proceedings of the 7th International Conference on Ontologies, Databases, and Applications of Semantics, Monterrey, MX, pp. 1163–1181 (November 2008)
61. Stoilos, G., Grau, B.C., Horrocks, I.: How incomplete is your semantic web reasoner? In: Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI 2010), Atlanta, US, pp. 11–15. AAAI Press, Menlo Park (2010)
62. Sure, Y., Corcho, O., Euzenat, J., Hughes, T. (eds.): Proceedings of the 3rd ISWC Workshop on Evaluation of Ontology-based Tools (EON), Hiroshima, JP (November 2004)
63. Sure, Y., Gómez-Pérez, A., Daelemans, W., Reinberger, M.-L., Guarino, N., Noy, N.: Why evaluate ontology technologies? Because it works! IEEE Intelligent Systems 19(4), 74–81 (2004)
64. Trojahn, C., Meilicke, C., Euzenat, J., Stuckenschmidt, H.: Automating OAEI campaigns (first report). In: Proceedings of the International Workshop on Evaluation of Semantic Technologies (IWEST) (2010)
65. Tummarello, G., Delbru, R., Oren, E.: Sindice.com: Weaving the open linked data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L.J.B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC 2007 and ISWC 2007. LNCS, vol. 4825, pp. 552–565. Springer, Heidelberg (2007)
66. Vaccari, L., Shvaiko, P., Pane, J., Besana, P., Marchese, M.: An evaluation of ontology matching in geo-service applications. GeoInformatica (2011) (in press)
67. van Hage, W.R., Katrenko, S., Schreiber, G.: A method to combine linguistic ontology-mapping techniques. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 732–744. Springer, Heidelberg (2005)
68. van Hage, W.R., Kolb, H., Schreiber, G.: Relevance-based evaluation of alignment approaches: the OAEI 2007 food task revisited. In: Proceedings of the ISWC 2008 Workshop on Ontology Matching, pp. 234–238 (October 2008)
69. Šváb, O., Svatek, V., Berka, P., Rak, D., Tomášek, P.: OntoFarm: Towards an experimental collection of parallel ontologies. In: Proceedings of the 4th International Semantic Web Conference (ISWC) – Poster Track, Galway, IE, pp. 1–3 (November 2005)
70. Zimmermann, A., Euzenat, J.: Three semantics for distributed systems and their relations with alignment composition. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 16–29. Springer, Heidelberg (2006)