Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen
2113
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Heinrich C. Mayr Jiri Lazansky Gerald Quirchmayr Pavel Vogel (Eds.)
Database and Expert Systems Applications 12th International Conference, DEXA 2001 Munich, Germany, September 3-5, 2001 Proceedings
Series Editors
Gerhard Goos, Karlsruhe University, Germany
Juris Hartmanis, Cornell University, NY, USA
Jan van Leeuwen, Utrecht University, The Netherlands
Volume Editors
Heinrich C. Mayr, University of Klagenfurt, IFI-IWAS, Universitaetsstr. 65, 9020 Klagenfurt, Austria, E-mail: [email protected]
Jiri Lazansky, Czech Technical University, Faculty of Electrical Engineering, Technicka 2, 166 27 Prague 6, Czech Republic, E-mail: [email protected]
Gerald Quirchmayr, University of South Australia, School of Computer and Information Science, Mawson Lakes Campus, Mawson Lakes, SA 5095, E-mail: [email protected]
Pavel Vogel, Technical University of Munich, Department of Information Systems, Orleanstr. 34, 81667 Munich, Germany, E-mail: [email protected]
Cataloging-in-Publication Data applied for
Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Database and expert systems applications : 12th international conference ; proceedings / DEXA 2001, Munich, Germany, September 3 - 5, 2001. Heinrich C. Mayr ... (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Tokyo : Springer, 2001 (Lecture notes in computer science ; Vol. 2113) ISBN 3-540-42527-6
Preface

DEXA 2001, the 12th International Conference on Database and Expert Systems Applications, was held on September 3–5, 2001, at the Technical University of Munich, Germany. The rapidly growing spectrum of database applications has led to the establishment of more specialized discussion platforms (the DaWaK conference, the EC-Web conference, and the DEXA workshops), which were all held in parallel with the DEXA conference in Munich.

In your hands are the results of much effort, beginning with the preparation of the submitted papers. The papers then passed through the reviewing process, and the accepted papers were revised to final versions by their authors and arranged into the conference program. All this culminated in the conference itself. A total of 175 papers were submitted to this conference, and we would like to thank all the authors: they are the real basis of the conference. The program committee and the supporting reviewers produced altogether 497 referee reports, an average of 2.84 reports per paper, and selected 93 papers for presentation.

Comparing the weight, or more precisely the number, of papers devoted to particular topics at several recent DEXA conferences, an increase can be recognized in the areas of XML databases, active databases, and multi- and hypermedia efforts. The space devoted to the more classical topics such as information retrieval, distribution and Web aspects, and transaction, indexing, and query aspects has remained more or less unchanged. Some decrease is visible for object orientation.

At this point we would like to say many thanks to all the institutions which actively supported this conference and made it possible. These are:
- The Technical University of Munich
- FAW
- The DEXA Association
- The Austrian Computer Society

A conference like DEXA would not be possible without the enthusiastic efforts of several people in the background. First we would like to thank the whole program committee for the thorough refereeing process. Many thanks also to Maria Schweikert (Technical University of Vienna) and to Monika Neubauer and Gabriela Wagner (FAW, University of Linz).

July 2001
Jiri Lazansky, Heinrich C. Mayr, Gerald Quirchmayr, Pavel Vogel
Organization
Program Committee
General Chairperson:
Heinrich C. Mayr, University of Klagenfurt, Austria
Conference Program Chairpersons:
Jiri Lazansky, Czech Technical University, Czech Republic
Gerald Quirchmayr, University of Vienna, Austria
Pavel Vogel, Technical University of Munich, Germany
Workshop Chairpersons:
A Min Tjoa, Technical University of Vienna, Austria
Roland R. Wagner, FAW, University of Linz, Austria
Publication Chairperson:
Vladimir Marik, Czech Technical University, Czech Republic
Program Committee Members:
Michel Adiba, IMAG - Laboratoire LSR, France
Hamideh Afsarmanesh, University of Amsterdam, The Netherlands
Jens Albrecht, Oracle GmbH, Germany
Ala Al-Zobaidie, University of Greenwich, UK
Bernd Amann, CNAM, France
Frederic Andres, NACSIS, Japan
Kurt Bauknecht, University of Zurich, Switzerland
Trevor Bench-Capon, University of Liverpool, United Kingdom
Alfs Berztiss, University of Pittsburgh, USA
Jon Bing, University of Oslo, Norway
Omran Bukhres, Purdue University, USA
Luis Camarinha-Matos, New University of Lisbon, Portugal
Antonio Cammelli, IDG-CNR, Italy
Wojciech Cellary, University of Economics at Poznan, Poland
Stavros Christodoulakis, Technical University of Crete, Greece
Panos Chrysanthis, Univ. of Pittsburgh & Carnegie Mellon Univ., USA
Paolo Ciaccia, University of Bologna, Italy
Christine Collet, LSR-IMAG, France
Carlo Combi, University of Udine, Italy
William Bruce Croft, University of Massachusetts, USA
John Debenham, University of Technology, Sydney, Australia
Misbah Deen, University of Keele, United Kingdom
Nina Edelweiss, University of Rio Grande do Sul, Brazil
Johann Eder, University of Klagenfurt, Austria
Thomas Eiter, Technical University of Vienna, Austria
Gregor Engels, University of Paderborn, Germany
Peter Fankhauser, GMD-IPSI, Germany
Eduardo Fernandez, Florida Atlantic University, USA
Simon Field, IBM Research Division, Switzerland
Peter Funk, Mälardalen University, Sweden
Antonio L. Furtado, University of Rio de Janeiro, Brazil
Georges Gardarin, University of Versailles, France
Parke Godfrey, York University, Canada
Georg Gottlob, Technical University of Vienna, Austria
Paul Grefen, University of Twente, The Netherlands
Abdelkader Hameurlain, Université Paul Sabatier, France
Igor T. Hawryszkiewycz, University of Technology, Sydney, Australia
Mohamed Ibrahim, University of Greenwich, UK
Yahiko Kambayashi, Kyoto University, Japan
Magdi N. Kamel, Naval Postgraduate School, USA
Nabil Kamel, American University in Cairo, Egypt
Gerti Kappel, University of Linz, Austria
Kamalakar Karlapalem, University of Science and Technology, Hong Kong
Randi Karlsen, University of Tromsö, Norway
Rudolf Keller, University of Montreal, Canada
Myoung Ho Kim, KAIST, Korea
Masaru Kitsuregawa, Tokyo University, Japan
Gary J. Koehler, University of Florida, USA
Donald Kossmann, Technical University of Munich, Germany
Jacques Kouloumdjian, INSA, France
Petr Kroha, Technical University Chemnitz-Zwickau, Germany
Josef Küng, University of Linz, Austria
Michel Leonard, University of Geneve, Switzerland
Tok Wang Ling, National University of Singapore, Singapore
Mengchi Liu, University of Regina, Canada
Fred Lochovsky, Hong Kong Univ. of Science and Technology, Hong Kong
Peri Loucopoulos, UMIST, United Kingdom
Sanjay Kumar Madria, University of Missouri-Rolla, USA
Akifumi Makinouchi, Kyushu University, Japan
Vladimir Marik, Czech Technical University, Czech Republic
Simone Marinai, University of Florence, Italy
Subhasish Mazumdar, New Mexico Tech, USA
Robert Meersman, Free University Brussels, Belgium
Elisabeth Metais, University of Versailles, France
Mukesh Mohania, Western Michigan University, USA
Sophie Monties, EPFL, Switzerland
Tadeusz Morzy, Poznan University of Technology, Poland
Günter Müller, University of Freiburg, Germany
Erich J. Neuhold, GMD-IPSI, Germany
Gultekin Ozsoyoglu, Case Western Reserve University, USA
Georgios Pangalos, University of Thessaloniki, Greece
Stott Parker, University of California, Los Angeles (UCLA), USA
Oscar Pastor, Technical University of Valencia, Spain
Marco Patella, University of Bologna, Italy
Barbara Pernici, Politecnico di Milano, Italy
Günter Pernul, University of Essen, Germany
Fausto Rabitti, CNUCE-CNR, Italy
Isidro Ramos, Technical University of Valencia, Spain
Harald Reiterer, University of Konstanz, Germany
Norman Revell, Middlesex University, UK
Sally Rice, University of South Australia, Australia
John Roddick, Flinders University of South Australia, Australia
Colette Rolland, University Paris I, Sorbonne, France
Elke Rundensteiner, Worcester Polytechnic Institute, USA
Domenico Sacca, University of Calabria, Italy
Marinette Savonnet, University of Bourgogne, France
Erich Schweighofer, University of Vienna, Austria
Timos Sellis, National Technical University of Athens, Greece
Michael H. Smith, University of California, USA
Giovanni Soda, University of Florence, Italy
Harald Sonnberger, EUROSTAT, Luxembourg
Günther Specht, Technical University of Ilmenau, Germany
Uma Srinivasan, CSIRO, Australia
Bala Srinivasan, Monash University, Australia
Olga Stepankova, Czech Technical University, Czech Republic
Zbigniew Struzik, CWI, Amsterdam, The Netherlands
Makoto Takizawa, Tokyo Denki University, Japan
Katsumi Tanaka, Kobe University, Japan
Zahir Tari, University of Melbourne, Australia
Stephanie Teufel, University of Fribourg, Switzerland
Jukka Teuhola, University of Turku, Finland
Bernd Thalheim, Technical University of Cottbus, Germany
Jean Marc Thevenin, University of Toulouse, France
Helmut Thoma, IBM Global Services Basel, Switzerland
A Min Tjoa, Technical University of Vienna, Austria
Roland Traunmüller, University of Linz, Austria
Aphrodite Tsalgatidou, University of Athens, Greece
Susan Urban, Arizona State University, USA
Krishnamurthy Vidyasankar, Memorial University of Newfoundland, Canada
Roland R. Wagner, University of Linz, Austria
Michael Wing, Middlesex University, UK
Werner Winiwarter, Software Competence Center Hagenberg, Austria
Gian Piero Zarri, CNRS, France
Arkady Zaslavsky, Monash University, Australia
Table of Contents Invited Talk XML Databases: Modeling and Multidimensional Indexing ..................................... 1 R. Bayer; Germany
Advanced Databases I Updatability in Federated Database Systems ............................................................. 2 M.L. Lee, S.Y. Lee, T.W. Ling; Singapore Designing Semistructured Databases: A Conceptual Approach............................... 12 M.L. Lee, S.Y. Lee, T.W. Ling, G. Dobbie, L.A. Kalinichenko; Singapore, New Zealand, Russia Meaningful Change Detection on the Web .............................................................. 22 S. Flesca, F. Furfaro, E. Masciari; Italy Definition and Application of Metaclasses............................................................... 32 M. Dahchour; Belgium
Information Retrieval Aspects I XSearch: A Neural Network Based Tool for Components Search in a Distributed Object Environment....................................................................... 42 A. Haendchen Filho, H.A. do Prado, P.M. Engel, A. von Staa; Brazil Information Retrieval by Possibilistic Reasoning .................................................... 52 C.-J. Liau, Y.Y. Yao; Taiwan, Canada Extracting Temporal References to Assign Document Event-Time Periods............ 62 D. Llidó, R. Berlanga, M.J. Aramburu; Spain Techniques and Tools for the Temporal Analysis of Retrieved Information ........... 72 R. Berlanga, J. Pérez, M.J. Aramburu, D. Llidó; Spain
Digital Libraries Page Classification for Meta-data Extraction from Digital Collections ................... 82 F. Cesarini, M. Lastri, S. Marinai, G. Soda; Italy A New Conceptual Graph Formalism Adapted for Multilingual Information Retrieval Purposes ........................................................... 92 C. Roussey, S. Calabretto, J.-M. Pinon; France
Flexible Comparison of Conceptual Graphs........................................................... 102 M. Montes-y-Gómez, A. Gelbukh, A. López-López, R. Baeza-Yates; Mexico, Chile Personalizing Digital Libraries for Learners .......................................................... 112 S.-S. Chen, O. Rodriguez, C.-Y. Choo, Y. Shang, H. Shi; USA
User Interfaces Interface for WordNet Enrichment with Classification Systems............................ 122 A. Montoyo, M. Palomar, G. Rigau; Spain An Architecture for Database Marketing Systems ................................................. 131 S.W.M. Siqueira, D.S. Silva, E.M.A. Uchôa, H.L.B. Braz, R.N. Melo; Brazil, Portugal NChiql: The Chinese Natural Language Interface to Databases ............................ 145 X. Meng, S. Wang; China
Advanced Databases II Pattern-Based Guidelines for Coordination Engineering ....................................... 155 P. Etcheverry, P. Lopistéguy, P. Dagorret; France Information Management for Material Science Applications in a Virtual Laboratory ........................................................................................... 165 A. Frenkel, H. Afsarmanesh, G. Eijkel, L.O. Hertzberger; The Netherlands TREAT: A Reverse Engineering Method and Tool for Environmental Databases ................................................................................. 175 M. Ibrahim, A.M. Fedorec, K. Rennolls; United Kingdom
Information Retrieval Aspects II A Very Efficient Order Preserving Scalable Distributed Data Structure................ 186 A. Di Pasquale, E. Nardelli; Italy Business, Culture, Politics, and Sports – How to Find Your Way through a Bulk of News? (On Content-Based Hierarchical Structuring and Organization of Large Document Archives).................................................... 200 M. Dittenbach, A. Rauber, D. Merkl; Austria Feature Selection Using Association Word Mining for Classification................... 211 S.-J. Ko, J.-H. Lee; Korea
Multimedia Databases Efficient Feature Mining in Music Objects ............................................................ 221 J.-L. Koh, W.D.C. Yu; Taiwan An Information-Driven Framework for Image Mining .......................................... 232 J. Zhang, W. Hsu, M.L. Lee; Singapore A Rule-Based Scheme to Make Personal Digests from Video Program Meta Data ............................................................................. 243 T. Hashimoto, Y. Shirota, A. Iizawa, H. Kitagawa; Japan
Workflow Aspects Casting Mobile Agents to Workflow Systems: On Performance and Scalability Issues ............................................................................................. 254 J.-J. Yoo, Y.-H. Suh, D.-I. Lee, S.-W. Jung, C.-S. Jang, J.-B. Kim; Korea Anticipation to Enhance Flexibility of Workflow Execution ................................. 264 D. Grigori, F. Charoy, C. Godart; France Coordinating Interorganizational Workflows Based on Process-Views................. 274 M. Shen, D.-R. Liu; Taiwan
Advanced Databases III Strategies for Semantic Caching............................................................................. 284 L. Li, B. König-Ries, N. Pissinou, K. Makki; USA Information Flow Control among Objects in Role-Based Access Control Model ............................................................................................ 299 K. Izaki, K. Tanaka, M. Takizawa; Japan Object Space Partitioning in a DL-Like Database and Knowledge Base Management System .................................................................. 309 M. Roger, A. Simonet, M. Simonet; France A Genome Databases Framework .......................................................................... 319 L.F.B. Seibel, S. Lifschitz; Brazil Lock Downgrading: An Approach to Increase Inter-transaction Parallelism in Advanced Database Applications .................................................... 330 A. Brayner; Brazil
Information Retrieval Aspects III The SH-tree: A Super Hybrid Index Structure for Multidimensional Data ............ 340 T.K. Dang, J. Küng, R. Wagner; Austria Concept-Based Visual Information Management with Large Lexical Corpus....... 350 Y. Park, P. Kim, F. Golshani, S. Panchanathan; Korea, USA Pyramidal Digest: An Efficient Model for Abstracting Text Databases ................ 360 W.T. Chuang, D.S. Parker; USA A Novel Full-Text Indexing Model for Chinese Text Retrieval............................. 370 S. Zhou, Y. Hu, J. Hu; China
Active Databases Page Access Sequencing in Join Processing with Limited Buffer Space ............... 380 C. Qun, A. Lim, O.W. Chong; Singapore Dynamic Constraints Derivation and Maintenance in the Teradata RDBMS ......... 390 A. Ghazal, R. Bhashyam; USA Improving Termination Analysis of Active Rules with Composite Events............ 400 A. Couchot; France TriGS Debugger – A Tool for Debugging Active Database Behavior ................... 410 G. Kappel, G. Kramler, W. Retschitzegger; Austria Tab-Trees: A CASE Tool for the Design of Extended Tabular Systems ............... 422 A. Ligęza, I. Wojnicki, G.J. Nalepa; Poland
Spatial Databases A Framework for Databasing 3D Synthetic Environment Data ............................. 432 R. Ladner, M. Abdelguerfi, R. Wilson, J. Breckenridge, F. McCreedy, K.B. Shaw; USA GOLAP – Geographical Online Analytical Processing.......................................... 442 P. Mikšovský, Z. Kouba; Czech Republic Declustering Spatial Objects by Clustering for Parallel Disks ............................... 450 H.-C. Kim, K.-J. Li; Korea A Retrieval Method for Real-Time Spatial Data Browsing.................................... 460 Y. Shiraishi, Y. Anzai; Japan
Designing a Compression Engine for Multidimensional Raster Data .................... 470 A. Dehmel; Germany
Advanced Databases IV DrawCAD: Using Deductive Object-Relational Databases in CAD ...................... 481 M. Liu, S. Katragadda; Canada A Statistical Approach to the Discovery of Ephemeral Associations among News Topics ..................................................... 491 M. Montes-y-Gómez, A. Gelbukh, A. López-López; Mexico Improving Integrity Constraint Enforcement by Extended Rules and Dependency Graphs ................................................. 501 S. Jurk, M. Balaban; Germany, Israel Statistical and Feature-Based Methods for Mobile Robot Position Localization ................................................... 517 R. Mázl, M. Kulich, L. Přeučil; Czech Republic
Distributed Databases Efficient View Maintenance Using Version Numbers ........................................... 527 E.K. Sze, T.W. Ling; Singapore α-Partitioning Algorithm: Vertical Partitioning Based on the Fuzzy Graph...................................................................................... 537 J.H. Son, M.H. Kim; Korea Using Market Mechanisms to Control Agent Allocation in Global Information Systems............................................................................... 547 I.N. Wang, N.J. Fiddian, W.A. Gray; United Kingdom
Web Aspects I Query Integration for Refreshing Web Views........................................................ 557 H. Liu, W.K. Ng, E.-P. Lim; Singapore Sized-Adjusted Sliding Window LFU – A New Web Caching Scheme ................ 567 W.-C. Hou, S. Wang; USA ViDE: A Visual Data Extraction Environment for the Web .................................. 577 Y. Li, W.K. Ng, E.-P. Lim; Singapore WebSCAN: Discovering and Notifying Important Changes of Web Sites ............ 587 M. Qiang, S. Miyazaki, K. Tanaka; Japan
Knowledge Aspects I Knowledge Base Maintenance through Knowledge Representation...................... 599 J. Debenham; Australia ANKON: A Multi-agent System for Information Gathering.................................. 609 C. Diamantini, M. Panti; Italy Mining Astronomical Data ..................................................................................... 621 B. Voisin; France
XML Integration of WWW Applications Based on Extensible XML Query and Processing Languages...................................................................................... 632 N. Shinagawa, K. Kuragaki, H. Kitagawa; Japan Incorporating Dimensions in XML and DTD ........................................................ 646 M. Gergatsoulis, Y. Stavrakas, D. Karteris; Greece Keys with Upward Wildcards for XML ................................................................. 657 W. Fan, P. Schwenzer, K. Wu; USA
Datawarehouses A Framework for the Classification and Description of Multidimensional Data Models.......................................................................... 668 A. Abelló, J. Samos, F. Saltor; Spain Range Top/Bottom k Queries in OLAP Sparse Data Cubes................................... 678 Z.W. Luo, T.W. Ling, C.H. Ang, S.Y. Lee, B. Cui; Singapore On Formulation of Disjunctive Coupling Queries in WHOWEDA........................... 688 S.S. Bhowmick, W.K. Ng, S. Madria; USA
Web Aspects II Topic-Centric Querying of Web Information Resources ....................... 699 İ.S. Altıngövde, S.A. Özel, Ö. Ulusoy, G. Özsoyoğlu, Z.M. Özsoyoğlu; Turkey, USA WebCarousel: Automatic Presentation and Semantic Restructuring of Web Search Result for Mobile Environments.................................................... 712 A. Nadamoto, H. Kondo, K. Tanaka; Japan
Imposing Disjunctive Constraints on Inter-document Structure ............................ 723 S.S. Bhowmick, W.K. Ng, S. Madria; USA
Knowledge Aspects II A Semi-automatic Technique for Constructing a Global Representation of Information Sources Having Different Formats and Structure .......................... 734 D. Rosaci, G. Terracina, D. Ursino; Italy Integration of Topic Maps and Databases: Towards Efficient Knowledge Representation and Directory Services ............................................... 744 T. Luckeneder, K. Steiner, W. Wöß; Austria
Hypermedia A Spatial Hypermedia Framework for Position-Aware Information Delivery Systems................................................................................ 754 H. Hiramatsu, K. Sumiya, K. Uehara; Japan Ariadne, a Development Method for Hypermedia ................................................. 764 P. Díaz, I. Aedo, S. Montero; Spain
Index Aspects A General Approach to Compression of Hierarchical Indexes .............................. 775 J. Teuhola; Finland Index and Data Allocation in Mobile Broadcast .................................................... 785 C. Qun, A. Lim, Z. Yi; Singapore
Object-Oriented Databases I Modeling and Transformation of Object-Oriented Conceptual Models into XML Schema .................................................................................................. 795 R. Xiao, T.S. Dillon, E. Chang, L. Feng; China, Australia, The Netherlands Adding Time to an Object-Oriented Versions Model ............................................ 805 M.M. Moro, S.M. Saggiorato, N. Edelweiss, C.S. dos Santos; Brazil Cache Conscious Clustering C3 ............................................................................. 815 Z. He, A. Marquez; Australia Closed External Schemas in Object-Oriented Databases ....................................... 826 M. Torres, J. Samos; Spain
Transaction Aspects I Supporting Cooperative Inter-organizational Business Transactions..................... 836 J. Puustjärvi, H. Laine; Finland O2PC-MT: A Novel Optimistic Two-Phase Commit Protocol for Mobile Transactions ......................................................................................... 846 Z. Ding, X. Meng, S. Wang; China Quorum-Based Locking Protocol in Nested Invocations of Methods .................... 857 K. Tanaka, M. Takizawa; Japan
Query Aspects I Applying Low-Level Query Optimization Techniques by Rewriting .................... 867 J. Płodzień, K. Subieta; Poland A Popularity-Driven Caching Scheme for Meta-search Engines: An Empirical Study ......................................... 877 S.H. Lee, J.S. Hong, L. Kerschberg; Korea, USA Towards the Development of Heuristics for Automatic Query Expansion ............ 887 J. Vilares, M. Vilares, M.A. Alonso; Spain Utilising Multiple Computers in Database Query Processing and Descriptor Rule Management .......................................................................... 897 J. Robinson, B.G.T. Lowden, M. Al Haddad; United Kingdom
Object-Oriented Databases II Estimating Object-Relational Database Understandability Using Structural Metrics......................................................................................... 909 C. Calero, H.A. Sahraoui, M. Piattini, H. Lounis; Spain, Canada CDM – Collaborative Data Model for Databases Supporting Groupware Applications ...................................................................... 923 W. Wieczerzycki; Poland
Transaction Aspects II An Efficient Distributed Concurrency Control Algorithm Using Two Phase Priority....................................................................................... 933 J.S. Lee, J.R. Shin, J.S. Yoo; Korea A New Look at Timestamp Ordering Concurrency Control .................................. 943 R. Srinivasa, C. Williams, P.F. Reynolds; USA
Query Aspects II On the Evaluation of Path-Oriented Queries in Document Databases ................... 953 Y. Chen, G. Huck; Canada, Germany Reasoning with Disjunctive Constrained Tuple-Generating Dependencies ........... 963 J. Wang, R. Topor, M. Maher; Australia, USA A Method for Processing Boolean Queries Using a Result Cache ......................... 974 J.-H. Cheong, S.-G. Lee, J. Chun; Korea
DEXA Position Paper Trends in Database Research.................................................................................. 984 M. Mohania, Y. Kambayashi, A.M. Tjoa, R. Wagner, L. Bellatreche; USA, Japan, Austria, France Author Index......................................................................................................... 989
XML Databases: Modeling and Multidimensional Indexing
Rudolf Bayer
Institut für Informatik, TU München
Abstract. The talk will discuss several relational models for XML databases and methods for mapping XML data to relational data. For each relational model there is a standard technique for rewriting XML queries in order to transform them into relational SQL queries. The relational models and the queries will be considered in combination with classical and multidimensional (UB-tree) indexes, and qualitative performance analyses will be presented depending on the relational models and the indexes used. Based on these analyses, a recommendation will be given for mapping XML to the relational world.
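The abstract stays at the level of ideas; purely as an illustration of the general approach (not of the specific relational models, query rewritings, or UB-tree indexes discussed in the talk), the sketch below shreds a tiny XML document into a generic relational edge table and rewrites a simple path query as a chain of SQL self-joins. The document, table layout, and query are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import count

# A toy document and a generic "edge" mapping: one relational row per element.
DOC = """<catalog>
  <book><title>Java</title><code>CS101</code></book>
  <book><title>Database</title><code>IT321</code></book>
</catalog>"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (id INTEGER PRIMARY KEY, parent INTEGER, tag TEXT, text TEXT)")
ids = count(1)

def shred(elem, parent=None):
    """Store `elem` and its subtree as rows of the edge table."""
    node_id = next(ids)
    conn.execute("INSERT INTO edge VALUES (?, ?, ?, ?)",
                 (node_id, parent, elem.tag, (elem.text or "").strip()))
    for child in elem:
        shred(child, node_id)

shred(ET.fromstring(DOC))

# The path query /catalog/book/title becomes one self-join per path step.
sql = """
    SELECT t.text
    FROM edge c
    JOIN edge b ON b.parent = c.id
    JOIN edge t ON t.parent = b.id
    WHERE c.tag = 'catalog' AND b.tag = 'book' AND t.tag = 'title'
    ORDER BY t.id
"""
print([row[0] for row in conn.execute(sql)])   # ['Java', 'Database']
```

Other relational layouts (one table per element type, inlined attributes, and so on) trade query rewriting complexity against join cost, which is exactly the kind of comparison the talk's performance analyses address.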
Updatability in Federated Database Systems
Mong Li Lee, Sin Yeung Lee, and Tok Wang Ling
School of Computing, National University of Singapore, email: {leeml, jlee, lingtw}@comp.nus.edu.sg
Abstract. It is important to support updates in federated database systems. However, not all updates on the federated schema are possible because some may violate certain constraints in the local databases which are involved in the federation. In this paper, we give a formal framework which characterizes the conditions under which a federated schema object type is updatable. We study the steps involved to systematically map an update request on an external view of a federated schema into the equivalent update(s) on the local databases. We also consider the situation where semantically equivalent object types may not model exactly the same set of objects in the real world. We ensure that the set constraints (EQUAL, SUBSET, DISJOINT) between the instance sets of equivalent object types are not violated after an update.
1 Introduction
A federated database system (FDBS) is a collection of cooperating but autonomous component database systems. [16] proposed a five-level schema architecture (Figure 1) which includes the local schema (conceptual schema of a local database system), the component schema (equivalent local schema modeled in a canonical or common data model), the export schema (subset of a conceptual schema), the federated schema (integration of multiple export schemas), and the external schema (a view of the federated schema). There are three important issues in an FDBS:
1. Translation from the local schema to the component schema with semantic enrichment [3,11,13].
2. Integration of multiple export schemas into a federated schema [1,15,18,7].
3. Translation of retrieval and update requests specified in terms of an external schema to requests on the underlying database systems.
While the first two issues have been studied extensively by many researchers, the third issue has not been studied in depth. Updates in an FDBS are quite different from view updates in centralized databases. The former involve a federated schema which is an integration of multiple schemas, while the latter involve views of an underlying conceptual schema. To support updates in an FDBS, it is necessary to determine which local databases to propagate the update to. This requires us to consider how the various local schemas have been integrated into the federated schema. Consider two SUPPLIER-PART databases with base tables SUPPLIER, PART, and SUPPLY and export schemas ES1 and ES2 respectively.
Fig. 1. The five level architecture (external, federated, export, component, and local schemas over the local databases)
Suppose ES1 consists of entity types SUPPLIER, PART and relationship set SUPPLY, while ES2 has only the entity type SUPPLIER with a multivalued attribute PARTNO. An integration of these two export schemas will give us a federated schema that is the same as ES1. Assume that we have an external schema that is the same as the federated schema. Then a request to delete a relationship supply(s1, p1, 10) from the external schema is translated to deleting the same relationship supply(s1, p1, 10) from the federated schema. This federated schema deletion will be propagated to the export schemas in various ways. In ES1, it is translated to a deletion of supply(s1, p1, 10) from the relationship set SUPPLY. In ES2, the deletion will be translated to an update to the multivalued attribute PARTNO of SUPPLIER. The latter situation would not have occurred in centralized databases. Furthermore, the sets of objects in the local databases may need to obey certain set constraints. For example, the set of students in the science department database should be a subset of the set of students in the university database. If an update is propagated to the science department database, then it must also be propagated to the university database.
The rest of the paper is organized as follows. Section 2 reviews the Entity-Relationship (ER) model and explains the update issue in an FDBS. Section 3 gives a framework for the integration of export schemas. Section 4.1 examines when and how an update to the federated schema can be propagated to the set of export schemas. Section 4.2 discusses how to maintain the consistency of the local databases after an update. We conclude the paper in Section 5.
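Returning to the SUPPLIER-PART example above, the following is a minimal sketch of how the same federated deletion takes two different forms in the two export schemas. The in-memory structures and the function are invented for illustration and are not the paper's notation.

```python
# Hypothetical export-schema states for the SUPPLIER-PART example.
es1_supply = {("s1", "p1", 10), ("s1", "p2", 20)}            # relationship set SUPPLY in ES1
es2_supplier = {"s1": {"PARTNO": {"p1", "p3"}},              # SUPPLIER entities in ES2 with a
                "s2": {"PARTNO": {"p2"}}}                     # multivalued attribute PARTNO

def delete_supply(supplier, part, qty):
    """Propagate deletion of supply(supplier, part, qty) from the federated schema."""
    # ES1 models the association as a relationship set -> delete the tuple.
    es1_supply.discard((supplier, part, qty))
    # ES2 models it implicitly through the multivalued attribute PARTNO
    # -> update the attribute of the owning SUPPLIER entity instead.
    if supplier in es2_supplier:
        es2_supplier[supplier]["PARTNO"].discard(part)

delete_supply("s1", "p1", 10)
print(es1_supply)            # {('s1', 'p2', 20)}
print(es2_supplier["s1"])    # {'PARTNO': {'p3'}}
```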
2 Preliminaries
In this section, we will discuss the advantages of adopting the ER model as the CDM and examine what is involved in supporting updates in an FDBS.
2.1 The Entity-Relationship Approach
Many existing FDBS prototypes use the relational model as the CDM. However, the relational model does not possess the necessary semantics for defining all the integration mappings that might be desired. Updating views in the relational model is inherently ambiguous [6]. The trend is to use semantic data models such as the object-oriented (OO) data model [2,4] and the ER data model [18]. Unfortunately, the OO models suffer from several inadequacies such as the lack of a formal foundation, the lack of a declarative query language, a navigational interface, and conflicts in class hierarchies. The structural properties of the OO model can be derived from the ER approach [19]. Hence, although the ER model does not yet have an equivalent concept of object methods, we will adopt the more established and well-defined ER model as the CDM in our FDBS. In this paper, we assume that the ER data model supports single-valued, multivalued, and composite attributes, weak entity types (both EX-weak and ID-weak), recursive relationship sets, and special relationship sets such as ISA, UNION, and INTERSECT [8].
2.2 The Update Issue
In the traditional problem of updating databases through views, a user specifies an update U against a database view V[DB]. The view update mapper M maps the update U on V[DB] to another database update UM on DB, which results in a new database UM(DB). The new view of the database is therefore V[UM(DB)]. The mapping is correct if V[UM(DB)] = U(V[DB]), that is, the view changes precisely in accordance with the user's request. On the other hand, an update on the external schema of an FDBS has to be translated to the equivalent updates on the local schemas. We identify five mappings that are required:
1. Map an update request on an external schema modeled using data models such as the relational, network, or hierarchical models to the corresponding update on an external schema modeled using the ER model.
2. Map an update request on an ER external schema to the corresponding updates on the ER federated schema.
3. Map updates on the federated schema into updates on the ER export schemas which have been integrated to form the federated schema.
4. Map updates on the export schemas into updates on their respective ER component schemas.
5. Map updates on the component schemas into updates on their corresponding local schemas, which may not be modeled using the ER model.
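A small sketch of the correctness condition V[UM(DB)] = U(V[DB]) above, on a toy relational database where the view is a projection. The data and the particular translation UM are invented; the point is only that the view computed after the translated base update must equal the view the user asked for.

```python
# Toy database: a set of (name, dept, salary) tuples.
DB = {("ann", "cs", 100), ("bob", "ee", 90)}

def V(db):
    """View definition: project out the salary column."""
    return {(name, dept) for (name, dept, _salary) in db}

def U(view):
    """User's view update: delete ('bob', 'ee') from the view."""
    return view - {("bob", "ee")}

def U_M(db):
    """A candidate translation of U against the base relation."""
    return {t for t in db if (t[0], t[1]) != ("bob", "ee")}

# The translation is correct iff the diagram commutes.
print("translation is correct:", V(U_M(DB)) == U(V(DB)))   # True
```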
Algorithms for mappings (1) and (5) can be found in [9] and [14,5,13] respectively. Mappings (2) and (4) are similar to the translation of ER view updates. [10,12] give the theory and algorithms to handle these mappings. Mapping (3), which has not been investigated in the literature, involves the propagation of an update on the federated schema to updates on the various export schemas. The integration feature, which is non-existent in traditional view updates, makes this mapping unique and is the focus of the paper.
3 Integration of Schemas
It is important to know the original modeling constructs of a federated object type (entity type, relationship set, or attribute) in the export schemas to determine the updatability of the federated object type. Structural conflicts occur when related real world concepts are modeled using different modeling constructs in the different schemas. [7] enumerates the possible original modeling structures of a federated schema object type:
1. A federated schema entity type is an entity type, or an attribute of an entity type or a relationship set, in the export schemas.
2. A federated schema relationship set is a relationship set, or a relationship between an attribute and its owner entity type, in the export schemas.
3. A federated schema attribute is an attribute in the export schemas.
Fig. 2. Schema S1 (entity types Publisher with attributes Name and Address, Book with attribute Title, and Topics with attribute Name; 1:m relationship sets Publish between Publisher and Book, and Contain between Book and Topics)
Fig. 3. Schema S2 (entity types Publication with attributes Code, Title, and Publisher, and Keywords with attribute Title; m:n relationship set List between Publication and Keywords)
Example 1. Consider schemas S1 and S2 in Figures 2 and 3. We have
1. S1.Topics = S2.Keywords
2. S1.Topics.Name = S2.Keywords.Title
3. S1.Publisher = S2.Publication.Publisher
4. S1.Book ISA S2.Publication
The structural conflict between S1.Publisher and S2.Publication.Publisher can be resolved by transforming the attribute S2.Publication.Publisher into an entity
type Publisher (Figure 4). A new relationship set Print is created. We merge S1 and S2'. Since we have S1.Book ISA S2.Publication, an ISA relationship is created between Book and Publication. Any redundant relationship sets and entity types are removed. Figure 5 shows the integrated schema.
Fig. 4. Schema S2' - Attribute Publisher in S2 is transformed into an entity type
Fig. 5. Final schema S3 after integrating S1 and S2 (Publisher with Name and Address; Publish; Publication with Code and Title; List; Keywords with Title; Book ISA Publication)
We will now present the notations used to indicate how an integrated entity type/relationship set/attribute is obtained from the export schemas.

Definition 1. Let EF be an entity type in a federated schema. If EF is obtained by integrating a set of objects Ik in export schemas Sk respectively, then we denote EF = Integrate <I1, ..., In>. Object Ik is of the form:
1. Sk.Ej if Ik is an entity type Ej in export schema Sk,
2. Sk.Ej.A if Ik is an attribute A of an entity type Ej in schema Sk,
3. Sk.Rj.A if Ik is an attribute A of a relationship set Rj in schema Sk.

Definition 2. An integrated schema relationship set RF is denoted as Integrate <I1, ..., In> if RF is obtained by integrating a set of objects Ik where Ik is of the form:
1. Sk.Rj if Ik is a relationship set Rj in export schema Sk, or
2. Sk.Ej%A if Ik is the implicit relationship between an attribute A and its owner entity type Ej in export schema Sk.

Definition 3. An integrated schema attribute AF which is obtained by integrating a set of objects Ik is denoted by AF = Integrate <I1, ..., In>, where all the Ik are either of the form Sk.Ej.A or of the form Sk.Rj.A. That is, the owners of the Ik are equivalent entity types (or relationship sets).
For example, the integrated entity type Publisher can be represented as Integrate <S1.Publisher, S2.Publication.Publisher>. This indicates that Publisher is obtained from an entity type in the export schema S1 and an attribute Publisher in S2. The integrated relationship set Publish = Integrate <S1.Publish, S2.Publication%Publisher>. S2.Publication%Publisher denotes the implicit relationship between the entity type Publication and its attribute Publisher. The integrated attribute Title = Integrate <S1.Topics.Name, S2.Keywords.Title>. Note that the entity types Topics and Keywords are equivalent. [11] shows that an attribute of an entity type in one schema cannot be equivalent to an attribute of a relationship set in another schema because their semantics are inherently different.
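One way to make the Integrate <...> notation of Definitions 1-3 operational is to record, for every federated construct, the export-schema constructs it was integrated from. The sketch below (our own encoding, not the paper's) does this for the Publisher, Publish, and Title examples just given.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    """One export-schema construct Ik that a federated construct integrates."""
    schema: str          # export schema, e.g. "S1"
    path: str            # "E", "E.A", "R.A", or "E%A", as in Definitions 1-3

# Federated constructs and their Integrate <...> provenance.
INTEGRATE = {
    ("entity", "Publisher"):     [Source("S1", "Publisher"),
                                  Source("S2", "Publication.Publisher")],
    ("relationship", "Publish"): [Source("S1", "Publish"),
                                  Source("S2", "Publication%Publisher")],
    ("attribute", "Title"):      [Source("S1", "Topics.Name"),
                                  Source("S2", "Keywords.Title")],
}

def kind(src: Source) -> str:
    """Classify how Ik is modeled in its export schema."""
    if "%" in src.path:
        return "implicit entity%attribute relationship"
    if "." in src.path:
        return "attribute"
    return "entity type or relationship set"

for (construct, name), sources in INTEGRATE.items():
    print(name, "=", ", ".join(f"{s.schema}.{s.path} [{kind(s)}]" for s in sources))
```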
4 Updatability of Federated Schemas
In this section, we will investigate the conditions under which an update on the federated schema is acceptable. There are two steps involved:
1. Decide the set of objects in the export schemas to which a federated schema update can be propagated.
2. Decide if the update propagation is consistent with the set constraints that exist among the objects in the local databases.
4.1 Propagating Updates
Definition 4. A federated schema entity type (or relationship set) SF is insertable wrt an export schema SE if we can insert some corresponding entities (or relationships) in SE and the translation of the update is correct. The deletion of a federated schema entity type (or relationship set) can be similarly defined. A federated schema attribute is modifiable wrt an export schema SE if we can modify some corresponding attribute in SE and the translation of the modification is correct.

Theorem 1 (Update propagation for federated schema entity type). Let EF = Integrate <I1, ..., In> be a federated schema entity type. If EF is insertable and deletable, then an insertion or deletion on EF will be propagated to some Ik where Ik is an entity type.

Proof. Assume that any identifier or key conflicts have been resolved. If Ik is an entity type, then a request to insert an entity into EF will be translated into requests to insert corresponding entities with the key value and other attribute values of EF into Ik. A similar argument applies to deletion. Hence, EF is insertable and deletable wrt Ik. If Ik is an attribute, then any insertion or deletion on EF will not be propagated to Ik, since an attribute needs to be associated with an owner entity/relationship. Similarly, when we delete an entity from EF, we cannot delete the corresponding key value from some attribute Ik because this would mean deleting some implicit relationship between the attribute and its owner entity type/relationship set.
Example 2. Let Publisher = Integrate <S1.Publisher, S2.Publication.Publisher>, where the concept publisher is modeled as an entity type Publisher in S1 and as an attribute Publisher of Publication in S2. The integrated entity type Publisher is both insertable and deletable as follows. An insertion of a new publisher into Publisher at the federated schema is propagated to S1 by inserting the new publisher into S1.Publisher. Note that the insertion will not be propagated to S2 because publisher is modeled as an attribute in S2 and needs to be associated with a publication.

The next theorem determines how the insertion or deletion of a federated relationship set is propagated to the export schemas. The modification of a relationship set can refer to the modification of attributes of the relationship set, or the modification of an association among the participating entities of the relationship set. The former is handled in Theorem 3, while the latter is equivalent to a deletion followed by an insertion.

Theorem 2 (Update propagation for federated schema relationship set). Let RF be a federated schema relationship set. RF = Integrate <I1, ..., In> where Ik corresponds to a relationship set or an implicit relationship E%A between an entity type E and an attribute A in some export schema. Then RF is always both insertable and deletable wrt any Ik.

Proof. If Ik is a relationship set, then insertion is similar to the case of an entity type. If Ik is an implicit relationship E%A, then Ik is a binary relationship set, and RF must be a binary relationship set involving two entities E1 and E2. Without loss of generality, let E correspond to E1 and attribute A correspond to the identifier of E2. If A is a single-valued attribute, then we can insert into E%A by first retrieving the corresponding entity in E using the key value of E1. If the value of A is NULL, that is, this relationship instance does not yet exist between E1 and E2, then set the value of A to the identifier value of E2. If A is a multivalued attribute, then update A by inserting the new value into A. Similarly, to delete E%A, set the value of A to NULL if A is a single-valued attribute, or remove the deleted value from A if A is a multivalued attribute.

Example 3. Consider the integrated relationship set Publish with participating entity types Publisher and Publication. If Publish is integrated from S1.Publish and S2.Publication%Publisher, then the integrated relationship set Publish is both insertable and deletable as follows. The insertion of a new relationship (IEEE, Computer) into Publish is propagated to S1 by inserting (IEEE, Computer) into S1.Publish. The insertion is also propagated into S2 by retrieving publication "Computer" from S2.Publication and inserting the value "IEEE" into the multivalued attribute Publisher.

Theorem 3 (Modification of a federated schema attribute). Let AF be an attribute in the federated schema. AF = Integrate <I1, ..., In> where each Ik is an attribute. The owners of the attributes I1, ..., In are all equivalent entity types E1, ..., En (or relationship sets R1, ..., Rn). Then AF is always modifiable wrt any Ik.
Proof. If a federated schema entity type EF (or relationship set RF) is the owner of the attribute AF whose value is to be modified, then we will use the key value of EF (or RF) to retrieve the corresponding entity in Ek (or relationship in Rk). If Ak is a single-valued attribute of Ek, then modify the value of Ak to the new value. If Ak is a multivalued attribute of Ek, then remove the old value from the set attribute Ak and insert the new value into Ak. Hence, AF is modifiable wrt any Ik.
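Theorems 1-3 together amount to a simple dispatch on how each source Ik is modeled in its export schema. The sketch below is our own encoding of these rules, not code from the paper; it lists the export-schema actions for an entity insertion, a relationship insertion, and an attribute modification on the running Publisher/Publish/Title examples.

```python
# Export-schema sources of three federated constructs (cf. the Integrate <...>
# examples above). A path containing '.' is an attribute, '%' an implicit
# relationship between an entity type and one of its attributes.
PUBLISHER = [("S1", "Publisher"), ("S2", "Publication.Publisher")]   # entity type
PUBLISH   = [("S1", "Publish"),   ("S2", "Publication%Publisher")]   # relationship set
TITLE     = [("S1", "Topics.Name"), ("S2", "Keywords.Title")]        # attribute

def propagate(operation, construct, sources):
    """List the export-schema actions implied by Theorems 1-3 (sketch only)."""
    actions = []
    for schema, path in sources:
        target = f"{schema}.{path}"
        if construct == "entity":
            # Theorem 1: only sources modeled as entity types receive the update.
            if "." not in path and "%" not in path:
                actions.append(f"{operation} entity in {target}")
        elif construct == "relationship":
            if "%" in path:
                # Theorem 2: update the (possibly multivalued) attribute A of E.
                entity, attr = path.split("%")
                actions.append(f"{operation} via attribute {schema}.{entity}.{attr}")
            else:
                actions.append(f"{operation} relationship in {target}")
        else:  # attribute modification
            # Theorem 3: every equivalent source attribute is modified.
            actions.append(f"modify attribute {target}")
    return actions

print(propagate("insert", "entity", PUBLISHER))
# ['insert entity in S1.Publisher']  -- the S2 attribute is skipped
print(propagate("insert", "relationship", PUBLISH))
# ['insert relationship in S1.Publish', 'insert via attribute S2.Publication.Publisher']
print(propagate("modify", "attribute", TITLE))
# ['modify attribute S1.Topics.Name', 'modify attribute S2.Keywords.Title']
```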
4.2 Maintaining Set Constraints
We have shown when an update from the federated schema can be propagated to the export schemas. However, that an update can be propagated to an export schema does not imply that the update will be propagated. One of the further considerations is to maintain set constraints among the various local databases. An update will be propagated only if no set constraint is violated. In this section, we consider the following three set constraints. Note that these constraints are not necessarily mutually exclusive. For instance, EQUAL is a special case of SUBSET.
1. EQUAL - Two entity types (or relationship sets) are EQUAL if they model exactly the same set of objects (or relationships) in the real world.
2. SUBSET - An entity type E1 (or a relationship set R1) is a SUBSET of another entity type E2 (or a relationship set R2, respectively) if, for any database instance, the set of objects in the real world modeled by E1 (or R1, respectively) is a subset of the set of objects in the real world modeled by E2 (or R2, respectively).
3. DISJOINT - Two entity types (or two relationship sets) obey the DISJOINT constraint if, for any database instance, the sets of objects in the real world modeled by both entity types (or both relationship sets) are disjoint.

Lemma 1 (Enforce set constraints for attribute modification). Let A = Integrate <A1, ..., An>, where A is a federated schema attribute of an object type (entity type or relationship set) and Ak is an attribute of an object type Ok in the export schemas. The execution of a modification request on the value of a federated schema attribute cannot violate the set relations between any two object types Oi and Oj.

On the other hand, an insertion or deletion of an entity type or relationship set can violate some set constraints. The following theorem describes the condition such that an insertion or a deletion will not violate any of the three given set constraints.

Theorem 4 (Enforce set constraints for entity type and relationship set insertion and deletion). Let O = Integrate <O1, ..., On>, where O is a federated schema object type, which can be either an entity type or a relationship set, and the Ok are equivalent object types in the export schemas. Suppose the insertion of an object x into federated schema object O is propagated to some export
schema O ∈ {O1, ..., On}. This insertion does not violate any set constraint if and only if the following two conditions hold:
1. x is also inserted into each of the Ok ∈ {O1, ..., On} if either O EQUAL Ok or O SUBSET Ok holds.
2. x is not inserted into any of the Ok ∈ {O1, ..., On} if O DISJOINT Ok holds.
Similarly, suppose the deletion of an object x from federated schema object O is propagated to some export schema O ∈ {O1, ..., On}. The deletion of x from O does not violate any set constraint if and only if x is also deleted from every Ok ∈ {O1, ..., On} for which either O EQUAL Ok or Ok SUBSET O holds.

The above theorem only decides whether a set of update propagations is correct. In general, there can be many different update propagations that do not violate any set constraint. Which propagation is finally chosen will be determined by the user or application.

Example 4. Suppose S1.Publisher is modeled as a weak entity type. In this case, we may assume that S1.Publisher EQUAL S2.Publication.Publisher. According to Theorem 1, an insertion to the integrated entity type Publisher can be propagated to S1 but not S2. According to Theorem 4, an insertion into S1 will violate some set constraint if S2 is not inserted at the same time. In this case, the insertion into the integrated schema entity type Publisher is not valid. However, when S1.Publisher is modeled as a normal entity type, we can only assume that S2.Publication.Publisher SUBSET S1.Publisher. An insertion into S1 without an insertion into S2 will not violate any set constraints. In this case, an insertion into the integrated schema entity type Publisher will be valid. Intuitively, this conclusion is sound as the publishers modeled by S2 are "published publishers". A new publisher without any publication can be inserted into S1 but not S2.
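The two conditions of Theorem 4 can be checked mechanically against the declared set constraints before a propagation is accepted. The sketch below is our own encoding of that check; it reproduces the two cases of Example 4, where an EQUAL constraint makes the insertion invalid and a SUBSET constraint in the other direction makes it valid.

```python
# Set constraints among equivalent export objects, given as (kind, a, b) facts.
# ("SUBSET", a, b) means "every instance of a is an instance of b".
def insertion_ok(targets, all_objects, constraints):
    """Check Theorem 4's conditions for inserting an object into `targets`,
    a subset of `all_objects` (the equivalent export-schema objects)."""
    targets = set(targets)
    for o in targets:
        for kind, a, b in constraints:
            others = {a, b} - {o}
            if o not in (a, b) or not others:
                continue
            other = others.pop()
            if other not in all_objects:
                continue
            # Condition 1: EQUAL, or "o SUBSET other", forces insertion into `other` too.
            if kind == "EQUAL" or (kind == "SUBSET" and a == o and b == other):
                if other not in targets:
                    return False
            # Condition 2: DISJOINT forbids inserting into both.
            if kind == "DISJOINT" and other in targets:
                return False
    return True

objs = {"S1.Publisher", "S2.Publication.Publisher"}

# Example 4, first case: the publisher sets are EQUAL, but the insertion can
# only reach S1 (S2 models publisher as an attribute) -> invalid.
print(insertion_ok({"S1.Publisher"}, objs,
                   [("EQUAL", "S1.Publisher", "S2.Publication.Publisher")]))   # False

# Second case: S2's publishers are only a SUBSET of S1's -> inserting into S1
# alone violates nothing -> valid.
print(insertion_ok({"S1.Publisher"}, objs,
                   [("SUBSET", "S2.Publication.Publisher", "S1.Publisher")]))  # True
```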
5 Conclusions
In this paper, we have examined the issue of supporting updates in an ER-based FDBS. We discussed how an update against the external schema of an FDBS needs to be translated into equivalent updates on the local databases via the federated schema, export schemas, component schemas, and local schemas. The crucial step involves mapping an update on the federated schema to equivalent export schema updates. This step requires the determination of the updatability (insertable or deletable) of a federated schema entity type or relationship set. We examined when and how a federated schema update can be propagated to updates on various export schemas, and ensured that such an update does not violate any set constraints.
References
1. Batini C. and Lenzerini M. A methodology for data schema integration in the ER model. In IEEE Trans. on Software Engineering, Vol 10, 1984.
2. Bertino E. Integration of heterogeneous data repositories by using object-oriented views. In First Int. Workshop on Interoperability in Multidatabase Systems, 1991.
3. Castellanos M. and Saltor F. Semantic enrichment of database schema: An object-oriented approach. First Int. Workshop on Interoperability in Multidatabase, 1991.
4. Drosten K., Kaul M. and Neuhold E. Viewsystem: Integrating heterogeneous information bases by object-oriented views. In IEEE 6th Int. Conf. on Data Engineering, 1990.
5. Johannesson P. and Kalman K. A method for translating relational schemas into conceptual schemas. In Proc. of 8th Int. Conf. on ER Approach, 1989.
6. Keller A.M. Algorithms for translating view updates to database updates for views involving selections, projections and joins. In Proc. of ACM SIGACT-SIGMOD Symposium, 1985.
7. Lee M.L. and Ling T.W. Resolving structural conflict in the integration of ER schemas. In Proc. 14th Int. Conf. on ER Approach, 1995. Australia.
8. Ling T.W. A normal form for ER diagrams. In Proc. 4th Int. Conf. on ER Approach, 1985.
9. Ling T.W. External schemas of ER based database management systems. In Proc. 7th Int. Conf. on ER Approach, 1988.
10. Ling T.W. and Lee M.L. A theory for ER view updates. In Proc. 11th Int. Conf. on ER Approach, 1992.
11. Ling T.W. and Lee M.L. Relational to ER schema translation using semantic and inclusion dependencies. In Journal of Integrated Computer Aided Engineering, 2(2), 1995.
12. Ling T.W. and Lee M.L. View update in ER approach. In Data and Knowledge Engineering, Vol 19, 1996.
13. Markowitz M. and Markowsky J. Identifying extended ER object structures in relational schemas. In IEEE Trans. on Software Engineering, 16(8), 1990.
14. Navathe S.B. and Awong A.M. Abstracting relational and hierarchic data with a semantic data model. In Proc. of 7th Int. Conf. on ER Approach, 1987.
15. Navathe S., Larson J. and Elmasri R. A theory of attribute equivalence in schema integration. In IEEE Trans. on Software Engineering, Vol 15, 1989.
16. Sheth A.P. and Gala S.K. Federated database systems for managing distributed, heterogeneous, and autonomous databases. In ACM Computing Surveys, 22(3), 1990.
17. Spaccapietra S., Parent C., and Dupont Y. Model independent assertions for integration of heterogeneous schemas. In VLDB Journal, (1), 1992.
18. Spaccapietra S. and Parent C. View integration: A step forward in solving structural conflicts. In IEEE Trans. on Knowledge and Data Engineering, 6(2), 1994.
19. Teo P.K., Ling T.W. and Yan L.L. Generating object-oriented views from an ER based conceptual schema. In 3rd Int. Symposium on Database Systems for Advanced Applications, 1993.
Designing Semistructured Databases: A Conceptual Approach
Mong Li Lee (1), Sin Yeung Lee (1), Tok Wang Ling (1), Gillian Dobbie (2), Leonid A. Kalinichenko (3)
(1) School of Computing, National University of Singapore, Singapore, {leeml, jlee, lingtw}@comp.nus.edu.sg
(2) Dept of Computer Science, University of Auckland, New Zealand, [email protected]
(3) Institute for Problems of Informatics, Russian Academy of Sciences, Russia, [email protected]
Abstract. Semistructured data has become prevalent with the growth of the Internet. The data is usually stored in a database system or in a specialized repository. Many information providers have presented their databases on the web as semistructured data, while others are developing repositories for new applications. Designing a "good" semistructured database is important to prevent data redundancy and updating anomalies. In this paper, we propose a conceptual approach to design semistructured databases. A conceptual layer based on the Entity-Relationship model is used to remove redundancies at the semantic level. An algorithm to map an ER diagram involving composite attributes, weak entity types, recursive, n-ary, and ISA relationship sets, and aggregations to a semistructured schema graph (S3-Graph) is also given.
1 Introduction
It is increasingly important to design good semistructured databases. While many information providers have presented their databases on the web as semistructured data, others are developing repositories for new applications. Consider the building of an e-commerce application. This involves designing and maintaining repositories of data including product catalogs, customer and vendor information, and business-to-business transactions. Currently application builders must define and create objects and relationships in the underlying databases, as the details of the schema are not expressed in the data. As with traditional databases, poorly designed databases contain redundant data, leading to undesirable anomalies. A design methodology that seamlessly maps from the objects and relations in the database to the hierarchical elements and attributes in semistructured data is also required. In this paper, we describe the research that forms the basis of a methodology to build applications with semistructured data. This comprises three steps:
1. model the underlying database using ER diagrams [1],
2. normalize the ER diagrams [4],
3. model the views on the data using S3-graphs [3].
Introducing a normalized layer has the following advantages. First, anomalies and redundancies can be removed at the semantic level. Second, customized XML views can be generated from the normalized model. Third, data can be stored in relational databases with controlled redundancy. The contribution of this paper is an algorithm that maps ER diagrams to semistructured schema graphs (S3-Graphs), forming a seamless mapping between how the stored data is modeled and the semistructured views of the data. Our study also reveals similarities between the S3-Graph and the hierarchical model and nested relations, in that all have limitations in modeling situations with nonhierarchical relationships given their tree-like structures. The rest of the paper is organized as follows. Section 2 reviews background concepts such as the S3-graph and the ER approach. Section 3 shows how the ER approach can be used to design a semistructured database and gives a comprehensive algorithm to translate an ER diagram into a normal form semistructured schema graph. Finally, we conclude in Section 4.
2 Preliminaries
Data modeling people have acknowledged the fact that if the database attributes are fixed in structure, the modeling power of the data model is greatly reduced. In this section, we review some of the concepts of a semistructured data graph and a semistructured schema graph (S3-Graph) defined in [3]. We also review the ER approach and discuss how it can be used to model semistructured data.
2.1 SemiStructured Graph
With the introduction of XML, semistructured data has become widespread. XML has characteristics which are very similar to semistructured data: self-describing, deeply nested or even cyclic, and irregular. Graph-based complex object models such as the Object Exchange Model (OEM) [7], the Document Object Model (DOM) [2], the Araneus Data Model (ADM) [6], and the semistructured graph [3] provide a natural framework for describing semistructured XML repositories and their DTDs. Figure 1 shows a semistructured data graph. The data graph is basically a labeled graph in which vertices correspond to objects and edges represent the object-subobject relationship. Each edge has a label describing the precise nature of the relationship. Each object has an object identifier (oid) and a value. The value is either atomic or complex, that is, a set of object references denoted as a set of (label, oid) pairs. Figure 2 shows the schema of the semistructured data graph in Figure 1. A SemiStructured Schema Graph (S3-Graph) is a directed graph where each node is either a root node or a component node. We differentiate nodes in a semistructured data graph from the S3-graph by using the & symbol instead of #. Node #1 is a root node representing the entity Student. Node #2 is both a component node as well as a leaf node. Node #3' is a component node which references the root node Course (Node #3).
Fig. 1. A semistructured data graph.
S3-Graph is associated with a tag indicating the relationship between the source and destination nodes. The tag may be suffixed with a "*" indicating multiple occurrences. A node V1 is connected to node V2 by a component edge (a solid arrow line) with a tag T if V2 is a component of V1. If T is suffixed with a "*", the relationship is interpreted as "Entity type represented by V1 has many T". Otherwise, it is interpreted as "Entity type represented by V1 has at most one T". A node V1 is connected to node V2 via a referencing edge (a dashed arrow line) if V1 references the entity represented by node V2. A node V1 is pointed to by a root edge (a solid arrow line with no source node) tagged T if the entity type represented by V1 is owned by the database. The design of our example database is not a good one due to data redundancy. The two instances "&9" and "&14" in Figure 1 actually represent the same tutor "Tan" and his office "S16 05-13". If a tutor changes his office, then all occurrences of this information must be changed to maintain consistency. [3] employs a decomposition method to restructure an S3-Graph to remove anomalies. As in relational databases, the decomposition approach does not ensure a good solution.
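To make the distinction between component and referencing edges concrete, the sketch below encodes a small fragment of the schema of Figure 2 in Python; the class names and fields are illustrative assumptions, not the notation of [3].

```python
# Minimal, illustrative encoding of an S3-Graph: nodes are root or component
# nodes; edges carry a tag (possibly suffixed by "*") and are either
# component edges (solid) or referencing edges (dashed).
from dataclasses import dataclass, field
from typing import List

@dataclass
class S3Node:
    name: str                          # e.g. "#1 Student"
    is_root: bool = False
    edges: List["S3Edge"] = field(default_factory=list)

@dataclass
class S3Edge:
    tag: str                           # e.g. "Course*", "Grade"
    target: "S3Node"
    kind: str = "component"            # "component" or "referencing"

# Fragment of Figure 2: Student has many Courses; node #3' references the
# root node Course (#3) through a referencing edge.
student = S3Node("#1 Student", is_root=True)
course = S3Node("#3 Course", is_root=True)
course_ref = S3Node("#3' Course")
student.edges.append(S3Edge("Course*", course_ref))
course_ref.edges.append(S3Edge("Course", course, kind="referencing"))
```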
2.2 The Entity-Relationship Approach
The ER model [1] incorporates the concepts of entity types, relationship sets, and attributes, which correspond to structures naturally occurring in information systems and database applications. An entity type is a collection of similar objects which have the same set of predefined common properties or attributes.
Fig. 2. Schema of the semistructured graph in Figure 1.
Attributes can be single-valued, multivalued, or composite. A minimal set of attributes whose values uniquely identify an entity in an entity type is called a key. A relationship is an association among two or more entities. A relationship set is a collection of similar relationships with a set of predefined common attributes. If the existence of an entity in one entity type depends on the existence of a specific entity in another entity type, such a relationship set and entity type are called an existence-dependent relationship set and a weak entity type, respectively. A relationship set which involves weak entity types is a weak relationship set. An entity type which is not weak is a regular entity type. If an entity in one entity type E1 is also in another entity type E2, we say that E1 ISA E2. The ER model also supports recursive relationships involving entities in the same entity type. Relationship sets can be viewed as a high-level entity type known as an aggregation.
Fig. 3. Entity-Relationship diagram for Student-Course-Tutor example.
Figure 3 shows the ER diagram for the Student-Course-Tutor database. The relationship set Enrol captures the association that a student is enrolled in a course and has a single-valued attribute Grade. Since a student taking a course is taught by some tutors, we need to associate the relationship set Enrol with entity type Tutor. This association is captured in SCT where Enrol is viewed as an entity type participating in other relationship sets. The ER diagram is also in normal form.
3 ER to S3-Graph Translation
The task of designing a "good" semistructured database can be made easier if we have more semantics. For example, [4] proposes a normal form for the ER model, and [5] uses the normal form ER model to design normal form nested relations. This top-down approach has two advantages. First, normalizing an ER diagram removes ambiguities, anomalies, and redundancies at the semantic level. Second, converting a normalized ER diagram into a set of nested relations results in a database schema with clean semantics, in a good normal form. An S3-Graph is similar to nested relations and the hierarchical model in that they all have a tree-like structure and allow repeating groups or multiple occurrences of objects. All these models represent hierarchical organizations in a direct and natural way, but they are problematic when representing nonhierarchical relationships. Duplication of data is necessary when representing many-to-many relationships or relationships that involve more than two participating entity types. The problem of handling symmetric queries with data duplication also exists in semistructured data. Keys and foreign keys are used in nested relations, while virtual pairing by logical parent pointers is used in the hierarchical model to represent many-to-many relationships and allow symmetric queries. In S3-Graphs, we will show how referencing edges can be used to remove redundancies.
We will now describe the translation of an ER diagram to an S3-Graph. If the ER diagram is not in normal form, then the S3-Graph obtained may contain redundancy. However, if the ER diagram is in normal form, then the S3-Graph obtained will be in normal form (S3-NF). Anomalies in the semistructured database are removed, and any redundancy due to many-to-many relationships and n-ary relationships is controlled.
Algorithm: Translation of an ER diagram to an S3-Graph.
Input: an ER diagram. Output: the equivalent S3-Graph.
Step 1. Transform the ER diagram to a normal form ER diagram.
Step 2. Map regular entity types.
Step 3. Map weak entity types.
Step 4. Map relationship sets with no aggregations.
Step 5. Map relationship sets with aggregations.
Step 6. Map ISA relationship sets.
The participating entity types of a relationship set R may be aggregations or simply entity types.
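Before the detailed rules below, the following sketch illustrates what Steps 2 and 3 produce for regular and weak entity types. The (source, tag, target, kind) edge encoding and the function names are illustrative assumptions made for this sketch, not part of the algorithm; composite attributes are omitted for brevity.

```python
# Illustrative sketch of Steps 2 and 3. An S3-Graph is held as a list of
# (source, tag, target, kind) edges, kind being "component" or "referencing";
# this encoding is an assumption made for the sketch only.

def map_regular_entity_type(name, single, multi, graph):
    """Step 2: the entity type becomes a root node; each attribute becomes a
    component node, with the tag suffixed by "*" if the attribute is multivalued."""
    for a in single:
        graph.append((name, a, f"{name}.{a}", "component"))
    for a in multi:
        graph.append((name, a + "*", f"{name}.{a}", "component"))
    return name

def map_weak_entity_type(name, owner, attrs, graph):
    """Step 3: the weak entity type becomes a component node of its owner,
    reached by a component edge whose tag is suffixed by "*"."""
    graph.append((owner, name + "*", name, "component"))
    for a in attrs:
        graph.append((name, a, f"{name}.{a}", "component"))
    return name

g = []
map_regular_entity_type("Course", ["Code", "Title"], [], g)
map_regular_entity_type("Tutor", ["TName"], ["Office"], g)
for edge in g:
    print(edge)
```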
Step 1. Transform the ER diagram to a normal form ER diagram. Detailed steps with examples are described in [4].
Step 2. Map regular entity types. Each regular entity type E becomes a root node N of an S3-Graph.
(a) Each single-valued and multivalued attribute A of E is mapped to a component node NA connected to N by a component edge tagged A. If A is multivalued, then the tag is suffixed by an "*". NA is also a leaf node labeled with the data type of A.
(b) Each composite attribute A of E is mapped to a component node NA connected to N by a component edge tagged A. Each component attribute C of A is mapped to a component node NC connected to NA by a component edge with tag C.
Step 3. Map weak entity types. Each weak entity type W becomes a component node NW connected to the node N corresponding to its owner entity type E by a component edge tagged W suffixed by an "*". Attributes of W are mapped in the same way as the attributes of a regular entity type.
Step 4. Map regular relationship sets with no aggregations.
Case (1): R is a binary relationship set. Let R be a binary relationship set with participating entity types EA and EB. EA and EB are mapped to root nodes NA and NB respectively. Depending on the application, there are several ways to map R:
(a) R is mapped to a component node NR. NA is connected to NR by a component edge tagged RA, and NR is connected to NB by a referencing edge tagged RB.
(b) R is mapped to a component node NR. NB is connected to NR by a component edge tagged RB, and NR is connected to NA by a referencing edge tagged RA.
(c) R is mapped to component nodes NR and NR'. NA is connected to NR by a component edge tagged RA and NR is connected to NB by a referencing edge tagged RB, while NB is connected to NR' by a component edge tagged R'B and NR' is connected to NA by a referencing edge tagged R'A.
(d) R is mapped to a root node NR. NR is connected to NA and NB by referencing edges tagged RA and RB respectively.
If R is a one-to-one relationship set, then none of the tags of the component and referencing edges is suffixed by an "*". If R is a many-to-many relationship set, then the tags are suffixed by an "*". If R is a one-to-many relationship set, then, without loss of generality, let the cardinalities of EA and EB in R be 1 and m respectively. If NA is connected to NR by a component edge, then the tag of that component edge is suffixed by an "*"; if instead NB is connected to NR by a component edge, then the tag is not suffixed by an "*". Tags of component edges for NR' are suffixed in the same manner.
Attributes of R are mapped to component nodes in the same way as attributes of regular entity types. Figure 4 summarizes the mapping of binary relationship sets. Note that mappings (a) and (b) do not allow symmetric queries to be answered efficiently, while mapping (c) has controlled data redundancy. The relationship set R is "flattened" in mapping (d) to allow symmetric queries without duplication. Note that if R is a recursive relationship set, then only one entity type, say EA, is involved in R, with roles r1 and r2. The mapping for a recursive relationship set is similar to that for binary relationship sets, except that the edges are tagged with the role names of EA.
Fig. 4. Different ways to map a binary relationship set in an ER diagram to an S3-Graph.
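As a concrete illustration of mapping (d), the sketch below "flattens" the Enrol relationship set between Student and Course into a root node with referencing edges. The edge encoding and tag names such as R_Student (mirroring the "R Student" tags of Figure 6) are illustrative assumptions, not prescribed by the algorithm.

```python
# Illustrative sketch of mapping (d): "flatten" a binary relationship set R
# into a root node NR with referencing edges to the participating entity
# types; attributes of R become component nodes of NR.

def flatten_binary_relationship(r_name, ea, eb, attrs, graph):
    graph.append((r_name, f"R_{ea}", ea, "referencing"))
    graph.append((r_name, f"R_{eb}", eb, "referencing"))
    for a in attrs:
        graph.append((r_name, a, f"{r_name}.{a}", "component"))
    return r_name

g = []
flatten_binary_relationship("Enrol", "Student", "Course", ["Grade"], g)
for edge in g:
    print(edge)
# ('Enrol', 'R_Student', 'Student', 'referencing')
# ('Enrol', 'R_Course', 'Course', 'referencing')
# ('Enrol', 'Grade', 'Enrol.Grade', 'component')
```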
Case (2): R is an n-ary relationship set where n > 2. Let the participating entity types of R be E1, E2, ..., Em, where m > 2. E1, E2, ..., Em are mapped to root nodes N1, N2, ..., Nm respectively. There are several ways to map R, as shown in Figure 5.
(a) Map R to a component node NR. Without loss of generality, connect N1 to NR by a component edge tagged RE1. Then NR has referencing edges tagged RE2, ..., REm to the root nodes N2, N3, ..., Nm respectively.
(b) First choose a path to link the participating entity types of R. Let ⟨V1, V2, V3, ..., Vk⟩ be the path, where vertex V1 corresponds to some
participating entity type of R which is associated with some root node N1, and vertex Vi, 2 ≤ i ≤ k, corresponds to either a participating entity type of R or a combination of participating entity types of R. Next, create component nodes NR2, NR3, ..., NRk that are associated with V2, V3, ..., Vk respectively. Root node N1 has a component edge tagged R2E1 to node NR2, while each node NRi, where 2 ≤ i ≤ k − 1, has a component edge tagged Ri+1E1 to NRi+1, and a referencing edge (or edges) tagged REi to the root node(s) associated with the participating entity type(s) of R corresponding to Vi.
(c) "Flatten" the relationship set by mapping R to a root node NR. NR is connected to N1, N2, ..., Nm by referencing edges tagged RE1, RE2, ..., REm respectively.
Fig. 5. Different ways to map an n-ary relationship set in an ER diagram to an S3-Graph.
Step 5. Map regular relationship sets with aggregations. Aggregations are a means of enforcing inclusion dependencies in a database. Let R be a regular relationship set and E1 , E2 , ..., Em and EA1 , EA2 , ..., EAn be the participating entity types of R. Entity types E1 , E2 , ..., Em have been mapped to root nodes N1 , N2 , ..., Nm respectively. Relationships in the aggregations EA1 , EA2 , ..., EAn have been mapped to nodes NA1 , NA2 , ...,
NAn respectively. Depending on the application, use the various alternatives in Step 4 to map R to a node NR and link NR to the nodes Ni and NAj, where 1 ≤ i ≤ m, 1 ≤ j ≤ n.
Step 6. Map special relationship set ISA. Given A ISA B, map A and B to nodes NA and NB respectively, and map the ISA relationship set to a referencing edge tagged ISA connecting NA to NB.
Example. The ER diagram in Figure 3 can be translated to the semistructured schema graph in Figure 6 as follows. The entity types Student, Course and Tutor become entity nodes #1, #3, and #7 respectively. The attributes also become nodes and are connected to their owner entity type by component edges. We need to process the relationship set Enrol before SCT because Enrol is involved in an aggregation. Enrol is mapped to an entity node #13 with component edges to entity nodes Course and Student. The attribute Grade is a component of the entity node #13. Next, we map the relationship set SCT to an entity node #15, which has component edges to entity nodes Enrol and Tutor. The attribute Feedback is a component of node #15. The S3-Graph obtained does not contain data redundancy. Note that the relationship sets Enrol and SCT have been flattened in the S3-Graph in Figure 6 because we want to answer symmetric queries with no redundant data. However, if an application only processes queries which retrieve the courses taken by a student, and does not need to find the students who take a given course, then Figure 7 shows an alternative way to map the ER diagram. Note that this schema cannot answer symmetric queries effectively.
Fig. 6. An S3-Graph for the ER diagram in Figure 3.
Fig. 7. An alternative S3-Graph for the ER diagram in Figure 3.
4 Conclusion
To the best of our knowledge, this is the first paper that presents a conceptual approach for designing semistructured databases which can be associated with some schema. We envisage the growing importance of well designed semistructured databases with the development of new e-commerce applications that require the efficient design and maintenance of large amounts of data. The introduction of an ER-based conceptual layer allows us to remove anomalies and data redundancies at the semantic level. We have developed an algorithm to map an ER diagram involving weak entity types, recursive, n-ary and ISA relationship sets, and aggregations to a normal form S3-Graph. Using the mappings proposed, XML DTDs and customised XML views can be generated from the normal form ER diagrams. Relational tables can also be created from the normalized ER diagram to store the XML data with controlled or no redundancy.
References
1. P.P. Chen. The ER model: Toward a unified view of data. ACM Transactions on Database Systems, Vol. 1, No. 1, 1976.
2. Document Object Model (DOM). http://www.w3.org/TR/REC-DOM-Level-1.
3. S.Y. Lee, M.L. Lee, T.W. Ling, and L. Kalinichenko. Designing good semistructured databases. In Proc. of 18th Int. Conference on ER Approach, 1999.
4. T.W. Ling. A normal form for entity-relationship diagrams. In Proc. of 4th Int. Conference on Entity-Relationship Approach, pages 24–35, 1985.
5. T.W. Ling. A normal form for sets of not-necessarily normalized relations. In Proc. of 22nd Hawaii Int. Conference on Systems Science, pages 578–586, 1989.
6. G. Mecca, P. Merialdo, and P. Atzeni. Araneus in the Era of XML. IEEE Bulletin on Data Engineering, 1999.
7. Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. In IEEE Int. Conference on Data Engineering, 1995.
Meaningful Change Detection on the Web
S. Flesca, F. Furfaro, and E. Masciari
(Work partially supported by the Murst projects Data-X and D2I.)
Abstract. In this paper we present a new technique for detecting changes on the Web. We propose a new method to measure the similarity of two documents, that can be efficiently used to discover changes in selected portions of the original document. The proposed technique has been implemented in the CDWeb system providing a change monitoring service on the Web. CDWeb differs from other previously proposed systems since it allows the detection of changes on portions of documents and specific changes expressed by means of complex conditions, i.e. users might want to know if the value of a given stock has increased by more than 10%. Several tests on stock exchange and auction web pages proved the effectiveness of the proposed approach.
1 Introduction
Due to the increasing number of people who use the Web for shopping or on-line trading, services for searching information and identifying changes on the Web have received renewed attention from both industry and the research community. Indeed, users of e-commerce or on-line trading sites frequently need to keep track of page changes, since they want to access pages only when their information has been updated. Several systems providing change monitoring services have been developed in the last few years [9,10,13,8]. Generally, these systems periodically check the status of the selected web pages, trying to identify how the page of interest has changed. The lack of a fixed data structure makes the problem of detecting meaningful changes on the web, efficiently and effectively, a difficult and interesting one. Most of the systems developed so far are not completely satisfactory since they are only able to check if a page has been modified. For instance, the system Netmind can only detect changes on a selected text region, a registered link or image, a keyword, or the timestamp of the page [9]. Consider, for instance, an on-line auction web page (e.g., eBay, Auckland, etc.): a user wants to be alerted only if a change occurs in one of the items he wants to buy, i.e., if the quotation of an article has been changed or if new items of the desired kind are available on the site. Change detection systems should provide the possibility of specifying the changes the user is interested in: to select the region of the document of interest,
the items inside the region whose changes have to be monitored, and conditions on the type of changes which must be detected. Systems detecting changes on HTML pages with a fixed structure are not able to satisfy this kind of user need, since the page regions considered depend on the user's request. Current techniques for detecting document differences are computationally expensive and unable to focus on the portion of the page that is of interest to the user [3,2,15]. A technique able to detect changes with a reasonable degree of efficiency and accuracy is necessary. The general problem of finding a minimum cost edit script that transforms a document into its modified version is computationally expensive (NP-hard) [3,2]. However, in many application contexts, like the one considered in the above example, users are only interested in the changes made and not in the sequence of updates which produces the new document. For the on-line trading example, users are interested in the change in the quotation of a stock or in the insertion of a new stock, regardless of the changes in the whole structure or the intermediate changes. In this paper we present a different approach that, instead of looking for the exact sequence of changes that permits the new document to be produced from the old version, pays attention to how much the changes have modified the document under observation. Our technique represents the document as a tree and permits the user to focus on specific portions of it, e.g., sub-trees. The paper also describes the architecture of a system, called the CDWeb system, which allows users to monitor web pages, specifying the information and the type of changes they consider relevant. The main contributions of this paper are the definition of a new efficient technique that allows users to measure Web document differences in a quantitative way, the definition of a language to specify web update triggers, and the implementation of a system for change detection on the web.
2 Web Changes Monitoring
In this section we define an efficient technique for the detection of meaningful changes in web documents. We are mainly interested in changes that add, delete or update information contained in specific portions of a Web page. To specify the information that has to be monitored, the user selects the region of the document of interest (a sub-tree of the document tree), the items inside the region (sub-trees of the previously selected sub-tree) whose changes have to be monitored, and conditions on the type of changes which must be detected. The system first has to identify the region of interest (i.e., the portion of the document that is most similar to the region selected in the old version) and then to verify, for each item, the associated conditions. To retrieve the sub-tree of interest in the updated document it is necessary to define a similarity measure between document sub-trees and use it to compare all the possible sub-trees with the old one. The similarity measure of two trees is defined by considering the similarities of their sub-trees. It is worth noting that the use of the minimum cost edit script transforming a given document into the new one [2] to detect changes is not feasible for this kind of application since
it is computationally expensive. Our technique can be seen as the computation of an edit script characterized by a null cost for the "move" operation (and no glue and copy operations are considered). The null cost assumption is not a real limitation, since the types of applications considered are only interested in semantic changes (the position of the stock quote is not of interest). The similarity measure is defined by considering the complete weighted bipartite graph ⟨(N1, N2), E⟩ where N1 and N2 are, respectively, the nodes of the two sub-trees; the weight of each edge (x, y) ∈ E is the similarity of the two nodes. The similarity of two trees is defined by considering the association of nodes (edges of the bipartite graph) which gives the maximum degree of similarity. The association constructed is then used to obtain quantitative information about changes.
Document Model. Several different data models have been proposed to represent Web documents. For instance, the WWW Consortium (W3C) has defined a kind of "generic" model, named the Document Object Model (DOM), which defines a set of basic structures that enable applications to manipulate HTML and XML documents. In this work we represent structured documents as unordered labeled trees, i.e., we do not consider the order of document elements but only the hierarchical information about them. Generally, each node of the tree corresponds to a structuring HTML tag in the document. The document model is defined in a formal way as follows. We assume the presence of an alphabet Σ of content strings, a set of element types τ containing the possible structuring markup, and a set of attribute names A.
Definition 1. (Document Tree) A document tree is a tuple T = ⟨N, p, r, l, t, a⟩, where N is the set of nodes of the tree, p is the parent function associating each node of the tree (except the root r) with its parent, r is the distinguished root of T, l is a labeling function from leaf(T) to Σ+, t is a typing function from N to τ, and a is an attribute function from N to A × Σ*.
Essentially, a document tree is an unordered tree whose nodes (also named elements) are characterized by their markup type and the associated set of attribute-value pairs. Leaf nodes are associated with the actual textual content of the document. Given a document tree T, whose root is r, and a node en of T, we denote by T(en) the sub-tree of T rooted at en. Furthermore, we define two new functions characterizing an element w.r.t. the whole document tree, type(en) and w(en). If ⟨r, e2, ..., en⟩ is the path from the root r to the element en, then type(en) = t(r)t(e2)···t(en), whereas w(en) = {s | s is a word contained in l(e) ∧ e ∈ leaf(T(en))}. We also define a(en) as the set of attributes associated with en. Essentially, w(en) is the set of words contained in the various text strings associated with the leaves of the subtree rooted at en, and type(en) is the concatenation of the type labels on the path starting from the root of the tree and ending in en, i.e., the complete type of the element.
A word is a substring separated from the other substrings by blanks.
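The following is one possible Python rendering of Definition 1 and of the derived functions type(e) and w(e); the class and function names are illustrative assumptions rather than part of the paper's formalism, and the small fragment at the end mirrors the table/tr/td/p chain used in Example 1 below.

```python
# Illustrative encoding of the document tree of Definition 1: each element
# carries a markup type, attribute-value pairs, an optional text label
# (leaves only), and its children.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Element:
    tag: str                                   # markup type t(e)
    attrs: Dict[str, str] = field(default_factory=dict)   # a(e)
    text: Optional[str] = None                 # l(e), for leaf nodes
    children: List["Element"] = field(default_factory=list)
    parent: Optional["Element"] = None         # p(e)

def complete_type(e):
    """type(e): concatenation of the tags on the path from the root to e."""
    path = []
    while e is not None:
        path.append(e.tag)
        e = e.parent
    return ".".join(reversed(path))

def words(e):
    """w(e): set of words in the text of the leaves of the subtree rooted at e."""
    if not e.children:
        return set(e.text.split()) if e.text else set()
    return set().union(*(words(c) for c in e.children))

# A small document fragment: a <p> leaf inside a <table>/<tr>/<td> chain.
table = Element("table")
tr = Element("tr", parent=table)
table.children.append(tr)
td = Element("td", parent=tr)
tr.children.append(td)
p = Element("p", attrs={"align": "left"}, text="This is", parent=td)
td.children.append(p)
print(complete_type(p))    # table.tr.td.p
print(words(table))        # {'This', 'is'}
```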
Example 1. Consider the portion of an HTML document shown in the right side of Fig. 1.
Fig. 1. A document tree
It corresponds to the HTML document tree shown in the left side of Fig. 1, where the corresponding HTML tag and attributes are reported for each node (text is not shown for non-leaf elements). The root element r of this sub-tree is characterized by w(r) = {This, is, an, example}, type(r) = {table} and a(r) = {∅}, whereas for the node p relative to the first paragraph we have w(p) = {This, is}, type(p) = {table.tr.td.p} and a(p) = {A}.
A tree similarity measure. To detect changes in the selected portion of a web page, we first have to retrieve this portion of the document in the new document version. Since in the new version of the web page text can be added or removed before and after the portion of the document we are interested in, we cannot rely on its old position in the document tree to perform this task; consequently, we have to find the portion of the new document that is most similar to the old one. One possibility to perform this task is to follow one of the approaches that compute a minimum edit script between tree structures [3,2]. However, the use of these techniques is not suitable for our problem since, in general, the problem of finding a minimum cost edit script is computationally expensive, and we cannot use heuristics to compute the similarity degree. We define a simple similarity measure between document trees. In the definition of this measure there are two main goals to be achieved: it should be possible to compute it efficiently, and it must be normalized, allowing the comparison of different pairs of trees and the selection of the most similar one. To define the similarity between documents we first associate each element of the selected document with its current version in the new document, and then consider the similarity degree of the two documents w.r.t. this association. So, we first have to define a measure of similarity between single elements and then use it to define a similarity measure between whole trees. Given a document tree T = ⟨N, p, r, l, t, a⟩ and an element r′ of N, the characteristic of r′ (ψ(r′)) is the triple ⟨type(r′), a(r′), w(r′)⟩. The similarity measure of two elements is defined on the basis of the similarity between each component of the characteristics of the elements being considered. We define the following functions measuring similarity between the different
components of element characteristics. Given two trees T1 and T2, and two nodes r1 and r2, we define:

intersect(w(r1), w(r2)) = |w(r1) ∩ w(r2)| / |w(r1) ∪ w(r2)|

attdist(a(r1), a(r2)) = (Σ_{ai ∈ a(r1)∩a(r2)} Weight(ai)) / (Σ_{ai ∈ a(r1)∪a(r2)} Weight(ai))

typedist(type(r1), type(r2)) = (Σ_{i=0}^{suf} 2^(max−i)) / (Σ_{i=0}^{max} 2^i)
The function intersect(w(r1), w(r2)) returns the percentage of words that appear in both w(r1) and w(r2). The function attdist(a(r1), a(r2)) is a measure of the relative weight of the attributes that have the same value in r1 and r2 w.r.t. all the attributes in r1 and r2. The attributes are weighted differently because some attributes are generally considered less relevant than others; for instance, the attribute "href" is considered more relevant than formatting attributes, like "font". The definition of the function typedist(type(r1), type(r2)) takes care of the difference between the complete types of the elements, where suf represents the length of the common suffix between type(r1) and type(r2) and max denotes the maximum cardinality between type(r1) and type(r2). We can now define the similarity between two document tree elements.
Definition 2. (Element Similarity) Given two document trees T1 and T2 and two elements r1 and r2 with characteristics ⟨type(r1), a(r1), w(r1)⟩ and ⟨type(r2), a(r2), w(r2)⟩, the similarity of r1 and r2 (CS(r1, r2)) is defined as:
CS(r1, r2) = −1 + 2 × (α · typedist(type(r1), type(r2)) + β · attdist(a(r1), a(r2)) + γ · intersect(w(r1), w(r2)))
where α + β + γ = 1. The values of α, β, γ are given by the user on the basis of the type of changes to be detected (see Section 5.1). Clearly, the similarity coefficient takes values in the interval [−1, 1], where −1 corresponds to the maximum difference and 1 to the maximum similarity. An element that is deleted (resp. inserted) is assumed to have similarity 0 with the elements of the new (resp. old) document.
Detecting document changes. Once we have defined element similarity, we can complete the definition of our technique. To compare two document sub-trees, we consider the complete weighted bipartite graph ⟨(N1, N2), E⟩ where N1 and N2 are, respectively, the nodes of the two sub-trees; the weight of each edge (x, y) ∈ E is CS(x, y). We use this weighted graph to establish an association between elements belonging to the old and new versions of the document. Obviously, not all the possible associations can be considered valid, since a node association must correspond to an effective document transformation. Also, we do not want to consider all the possible transformations, since it is not probable that complex transformations correspond to a rewriting of some information already present in the document, at least for the type of applications we are considering.
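A compact Python rendering of the three component functions and of CS from Definition 2 is sketched below. The attribute weighting scheme and the default values of α, β, γ are illustrative assumptions (the paper leaves Weight and the coefficients to the user); the typedist code follows the formula as reconstructed above.

```python
# Illustrative implementation of intersect, attdist, typedist and CS
# (Definition 2). Elements are modeled as (complete_type, attrs, words)
# triples; Weight and the alpha/beta/gamma coefficients are assumptions.

def intersect(w1, w2):
    union = w1 | w2
    return len(w1 & w2) / len(union) if union else 0.0

def attdist(a1, a2, weight=lambda name: 2.0 if name == "href" else 1.0):
    shared = {k for k in a1 if k in a2 and a1[k] == a2[k]}
    all_names = set(a1) | set(a2)
    total = sum(weight(k) for k in all_names)
    return sum(weight(k) for k in shared) / total if total else 0.0

def typedist(t1, t2):
    p1, p2 = t1.split("."), t2.split(".")
    suf = 0
    while suf < min(len(p1), len(p2)) and p1[-1 - suf] == p2[-1 - suf]:
        suf += 1
    mx = max(len(p1), len(p2))
    num = sum(2 ** (mx - i) for i in range(suf + 1))
    den = sum(2 ** i for i in range(mx + 1))
    return num / den

def CS(r1, r2, alpha=0.25, beta=0.25, gamma=0.5):
    t1, a1, w1 = r1
    t2, a2, w2 = r2
    return -1 + 2 * (alpha * typedist(t1, t2)
                     + beta * attdist(a1, a2)
                     + gamma * intersect(w1, w2))

r1 = ("table.tr.td.p", {"align": "left"}, {"This", "is"})
r2 = ("table.tr.td.p", {"align": "left"}, {"This", "was"})
print(round(CS(r1, r2), 3))   # 0.333: same type and attributes, words partially overlapping
```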
Document mappings. All the possible changes that can occur in a document must correspond to a change in the association between the nodes in the document tree of the original page and the nodes in the document tree of the new page. As stated above, not all the associations can be considered valid. For example, if we do not want to deal with paragraph splitting or joining, then only one-to-one associations are valid. Also, we do not consider glue or copy operations [3,2] since they seem not to be relevant in this context. In general, we are interested in associations that correspond to some type of editing of the document that adds, changes or deletes some meaningful information in the document, for instance the text of a paragraph or the destination of a hypertext link. Before defining valid edit mappings we introduce some notation. Given two document trees T = ⟨N, p, o, r, l, t⟩ and T′ = ⟨N′, p′, o′, r′, l′, t′⟩, a Tree Mapping from T to T′ is a relation M ⊆ N × N′ such that ⟨r, r′⟩ ∈ M. Given two document trees T and T′, a tree mapping M from T to T′, and a node x in N, we denote by Mx,. the set of nodes of N′ associated with x in M; analogously, given a node y in N′, we denote by M.,y the set of nodes of N associated with y in M.
Definition 3. (Edit Mapping) Given two document trees T = ⟨N, p, o, r, l, t⟩ and T′ = ⟨N′, p′, o′, r′, l′, t′⟩, an edit mapping M from T to T′ is a tree mapping such that ∀x ∈ N, if |Mx,.| > 1 then |M.,y| = 1 for each y in Mx,..
Intuitively, if |Mx,.| > 1 the original node has been split, while if |M.,y| > 1 many nodes in the original tree have been merged. A mapping between two trees T and T′ is said to be Simple if it associates each node in T with at most one node in N′ and each node in T′ with at most one node in N. Given two document trees T and T′ and a tree mapping M, we denote by ext(M) the set of mappings M′ from T to T′ such that M ⊆ M′. The number of valid edit mappings may be very large due to the completeness of the graph, but we can strongly reduce the number of edges to be considered for the mapping. This can be done by considering only the edges that have a weight greater than a predefined threshold.
A cost model for mapping. Once we have defined the valid associations between document trees we have to define the cost of these associations, that is: if we consider the nodes in the new sub-tree as the new version of the associated nodes in the original sub-tree, how similar can we consider the new document sub-tree to the old one? To define document similarity we need to define node similarity.
Definition 4. Given two document trees T1, T2, two sub-trees T1′ and T2′, and an edit mapping M from T1′ to T2′, the similarity of x ∈ N1′ w.r.t. M is defined as:
SimM(x) = avg_{⟨x,y⟩∈M} CS(x, y) if |{⟨x, y⟩ ∈ M}| > 0, and SimM(x) = 0 otherwise.
pairs of elements ⟨x, y⟩ for each y related to x by the edit mapping. Using the previous definition we can now define the concept of similarity among document sub-trees.
Definition 5. Given two document trees T1, T2, two sub-trees T1′ of T1 and T2′ of T2, and an edit mapping M from T1′ to T2′, the similarity of T1′, T2′ w.r.t. M is defined as follows:
SimM(T1′, T2′) = (Σ_{x ∈ N1′∪N2′} SimM(x)) / (|N1′| + |N2′|).
Finally, we define document sub-tree similarity by considering the similarity obtained by the edit mapping that maximizes the similarity between the two sub-trees.
Definition 6. (Tree Similarity) Given two document trees T1, T2, two sub-trees T1′ of T1 and T2′ of T2, and letting ℳ be the set of possible edit mappings from T1′ to T2′, the similarity coefficient of T1′, T2′ (Sim(T1′, T2′)) is defined as:
Sim(T1′, T2′) = max_{M ∈ ℳ} SimM(T1′, T2′).
Searching for the most similar sub-tree. Once we have defined similarity between document sub-trees we can approach the problem of detecting document changes. Here we consider only simple changes, i.e., changes that are detectable using a simple mapping; these changes are insertion, deletion, or textual modification. Note that some types of move operations are also detectable; in particular, changes that move an element e from the sub-tree rooted in the parent of e to another sub-tree that is not contained in T(e) and does not contain T(e). Before presenting the technique used to detect changes we introduce the concept of similarity graph, which will be used in the algorithm searching for the most similar sub-trees. Given two document trees T1 = ⟨N1, p1, r1, l1, t1, a1⟩ and T2 = ⟨N2, p2, r2, l2, t2, a2⟩, the similarity graph associated with T1, T2 (denoted WG(T1, T2)) is a weighted bipartite graph ⟨(N1, N2), E⟩ where E is the set of weighted edges defined as follows:
E = {⟨x, y, CS(x, y)⟩ | x ∈ N1, y ∈ N2}.
Furthermore, given a similarity graph WG(T1, T2) = ⟨N, E⟩, we define the projection of a similarity graph on a set of nodes N′ ⊆ N as π_N′ WG(T1, T2) = ⟨N′, {⟨x, y, c⟩ | ⟨x, y, c⟩ ∈ E ∧ (x ∈ N′ ∨ y ∈ N′)}⟩, i.e., the subgraph representing the piece of the document to be monitored. To define the algorithm that, for a given sub-tree, finds the most similar sub-tree in the new document, we refer to the Maximum Matching problem. Indeed, given two document sub-trees T1′ = ⟨N1′, p1, r1, l1, t1, a1⟩ and T2′ = ⟨N2′, p2, r2, l2, t2, a2⟩, the following proposition assures that we can use the Hungarian algorithm to compute a simple edit mapping between two sub-trees.
Proposition 1. Given two document trees T1, T2, two sub-trees T1′, T2′, and a simple edit mapping M between T1′ and T2′, then SimM(T1′, T2′) = Sim(T1′, T2′) if M is a maximum weight matching on WG(T1′, T2′).
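Proposition 1 suggests a direct implementation: build the similarity matrix of WG(T1′, T2′) and solve a maximum weight bipartite matching with the Hungarian algorithm, here via SciPy's linear_sum_assignment. The node representation, the cs stub, and the pruning of non-positive matches are illustrative assumptions of this sketch.

```python
# Illustrative computation of Sim(T1', T2') following Proposition 1: build
# the similarity matrix of WG(T1', T2') and solve a maximum weight bipartite
# matching with the Hungarian algorithm. cs(x, y) is assumed to implement
# the element similarity CS of Definition 2.
import numpy as np
from scipy.optimize import linear_sum_assignment

def tree_similarity(nodes1, nodes2, cs):
    w = np.array([[cs(x, y) for y in nodes2] for x in nodes1])
    rows, cols = linear_sum_assignment(w, maximize=True)
    # drop pairs with non-positive similarity: they are better treated as
    # deletions/insertions (which contribute similarity 0)
    matched = [(int(i), int(j)) for i, j in zip(rows, cols) if w[i, j] > 0]
    # each matched pair contributes its similarity once for the node of T1'
    # and once for the node of T2' (Definitions 4 and 5)
    total = 2 * sum(w[i, j] for i, j in matched)
    return total / (len(nodes1) + len(nodes2)), matched

sim, mapping = tree_similarity(["a", "b"], ["a", "b", "c"],
                               lambda x, y: 1.0 if x == y else -0.5)
print(round(sim, 2), mapping)   # 0.8 [(0, 0), (1, 1)]
```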
3 Web Update Queries
To better exploit the change detection technique defined in the previous section, we need to provide the possibility of specifying general conditions on the data being observed. A trigger will be executed only if the associated condition is verified. In this section we introduce a language to specify this type of trigger, named web update queries (web triggers). A web update query allows the user to select some specific portions of the document that will be monitored (we refer to these portions as target-zones). These are the portions of the document where the information which is considered relevant is contained. Inside such a zone the user can specify a set of sub-zones, named targets. When specifying the trigger condition, the user can ask for a check of whether the information in a target has been modified. Usually, each target is a leaf of an HTML tree that is considered relevant by the user. Web update queries are expressed using the syntax sketched below:
<WebTrigger>       ::= CREATE WebTrigger <name> ON <zone-list>
                       CHECK <target-list> NOTIFY BY <notify>
                       WHEN <target-condition>
                       [ BETWEEN <date> AND <date> ]
                       [ EVERY <polling-interval> ]
...                    <target-name> OF TYPE <typename> ...
<polling-interval> ::= <number-of-minutes>
where <HTML SubTree> represents a sub-tree of the document representation discussed in the above section. Note that web update queries should be specified using a visual interface, since it is the best way to specify the <HTML SubTree> involved in the trigger definition. The nonterminal symbol <target-condition> in the trigger syntax above represents simple boolean conditions that can be used in the WHEN clause. In particular, when specifying conditions in the WHEN clause it is possible to access both the old and new target values using the NEW and OLD properties of target items, as shown in the example below. Furthermore, the target item type can be cast from the predefined string type to the number and date types. Using the CREATE WebTrigger command a user can create a web trigger on the CDWeb personal server, which handles change detection on his behalf. The server maintains a local copy of the target zones to be monitored and a list of the target predicates that can fire user notification.
Example 2. Consider the web page shown in Fig. 2, which contains information about stock prices on NASDAQ, and suppose that a user would like to be notified
if the quotation for the "Cisco System" stock has a percentage variation of 5%. The user can run CDWeb and, once the item relative to "Cisco System" has been selected (this can be done by simply selecting the table entry indexed by "CSCO"), a Web Trigger can be set as shown in Fig. 2:
Fig. 2. The Nasdaq example.
where "Cisco System" and "price" are, respectively, an HTML sub-tree and a leaf element of it that the user can choose in the CDWeb browser by simply double-clicking on the table row for the Cisco System quotation and then clicking on the price column. If the condition specified in the WHEN clause is verified, the user is notified by an alert.
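For concreteness, a trigger of the kind set up in this example might be written as follows in the syntax sketched above. This is a hypothetical reconstruction (the actual trigger shown in Fig. 2 is not reproduced here); the trigger name, the notification method, and the polling interval are illustrative.

```
CREATE WebTrigger CiscoVariation
  ON    "Cisco System"
  CHECK price OF TYPE number
  NOTIFY BY alert
  WHEN  NEW.price >= OLD.price * 1.05 OR NEW.price <= OLD.price * 0.95
  EVERY 30
```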
4 System Architecture
In this section we describe the evaluation process of web update queries in the CDWeb system, which allows users to specify and execute web update queries using a visual interface. Change detection results are shown when triggers are raised. The system is implemented in Java, and HTML documents are manipulated using the Swing document libraries, which are based on a document model very similar to the model used here. The architecture of the system is reported in Fig. 3 and consists of five main modules: the change monitoring service, the query engine, the change detection module, the query builder, and the change presentation module. The system is composed of two main applications: a visual query editor, which handles query specification, and an active query engine, which evaluates web update queries. The system maintains an object store where the objects describing the currently active web update queries are serialized. Each query object maintains information about the list of target zones (document sub-trees) referred to in the query and, for each target zone, the list of targets contained inside that zone.
Fig. 3. System Architecture
References
1. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change detection in hierarchically structured information. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 493–504, Montreal, Quebec, June 1996.
2. S. Chawathe and H. Garcia-Molina. Meaningful change detection in structured data. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 26–37, Tucson, Arizona, May 1997.
3. S. Chawathe, S. Abiteboul, and J. Widom. Representing and querying changes in semistructured data. In Proc. of the Int. Conf. on Data Engineering, pages 4–13, Orlando, Florida, February 1998.
4. F. Douglis, T. Ball, Y. Chen, and E. Koutsofios. WebGuide: Querying and Navigating Changes in Web Repositories. In WWW5 / Computer Networks, 28(7–11), pages 1335–1344, 1996.
5. F. Douglis and T. Ball. Tracking and Viewing Changes on the Web. In Proc. of USENIX Annual Technical Conference, pages 165–176, 1996.
6. F. Douglis, T. Ball, Y. Chen, and E. Koutsofios. The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web. In World Wide Web, 1(1), pages 27–44, Baltzer Science Publishers, 1998.
7. L. Liu, C. Pu, W. Tang, J. Biggs, D. Buttler, W. Han, P. Benninghoff, and Fenghua. CQ: A personalized update monitoring toolkit. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, 1998.
8. L. Liu, C. Pu, and W. Tang. WebCQ - Detecting and delivering information changes on the web. In Proc. of CIKM'00, Washington, DC, USA, 2000.
9. NetMind. http://www.netmind.com
10. TracerLock. http://www.peacefire.org/tracerlock
11. Wuu Yang. Identifying Syntactic Differences Between Two Programs. In Software - Practice and Experience (SPE), 21(7), pp. 739–755, 1991.
12. J. T. Wang, K. Zhang, and G. Chirn. Algorithms for Approximate Graph Matching. In Information Sciences 82(1–2), pp. 45–74, 1995.
13. Webwhacker. http://www.webwhacker.com
14. J. Widom and J. Ullman. C3: Changes, consistency, and configurations in heterogeneous distributed information systems. Unpublished, available at http://wwwdb.stanford.edu/c3/synopsis.html, 1995.
15. K. Zhang, J. T. Wang, and D. Shasha. On the Editing Distance between Undirected Acyclic Graphs and Related Problems. In Proc. of Combinatorial Pattern Matching, pp. 395–407, 1995.
Definition and Application of Metaclasses
Mohamed Dahchour
University of Louvain, IAG School of Management, 1 Place des Doyens, 1348 Louvain-la-Neuve, Belgium, [email protected]
Abstract. Metaclasses are classes whose instances are themselves classes. Metaclasses are generally used to define and query information relevant to the class level. The paper first analyzes the more general term meta and gives some examples of its use in various application domains. Then, it focuses on the description of metaclasses. To help better understand metaclasses, the paper suggests a set of criteria accounting for the variety of metaclass definitions existing in the literature. The paper finally presents the usage of metaclasses and discusses some questions raised about them.
1 Introduction
Common object models (and languages and database systems based on them) model real-world applications as a collection of objects and classes. Objects model real-world entities while classes represent sets of similar objects. A class describes the structural (attributes) and behavioral (methods) properties of its instances. The attribute values represent the object's status. This status is accessed or modified by sending messages to the object to invoke the corresponding methods. In such models, there are only two abstraction levels: the class level, composed of classes that may be organized into hierarchies along the inheritance (i.e., isA) mechanism, and the instance level, composed of individual objects that are instances of the classes in the class level. However, beyond the need for manipulating individual objects, there is also the need to deal with classes themselves, regardless of their instances. For example, it should be possible to query a class about its name, the list of its attributes and methods, the list of its ancestors and descendants, etc. To be able to do this, some object models (e.g., Smalltalk [11], ConceptBase [16], CLOS [18]) allow classes themselves to be treated as objects that are instances of so-called metaclasses. With metaclasses, the user is able to express the structure and behavior of classes, in such a way that messages can be sent to classes in the same way that messages are sent to individual objects in usual object models. Systems supporting metaclasses allow data to be organized into an architecture of several abstraction levels. Each level describes and controls the lower one. Existing work (e.g., [11,16,18,19,26,10,21]) only deals with particular definitions of metaclasses related to specific systems. This work deals with metaclasses in general. More precisely, the objectives of the paper are:
- clarify the concept of metaclasses, often confused with ordinary classes;
- define a set of criteria characterizing a large variety of metaclass definitions;
- present some uses of metaclasses;
- discuss some problems about metaclasses raised in the literature.
The rest of the paper is organized as follows. Section 2 analyzes the more general term meta and gives some examples of its use beyond the object orientation. Section 3 defines the concept of metaclasses. Section 4 presents a set of criteria accounting for the variety of metaclass definitions found in the literature. Section 5 describes the mechanism of method invocation related to metaclasses. Section 6 presents the usage of metaclasses and Section 7 analyzes some of their drawbacks. Section 8 summarizes and concludes the paper.
2 Meta Concepts
The word meta comes from Greek. According to [29], meta means “occurring later than or in succession to; situated behind or beyond; more highly organized; change and transformation; more comprehensive”. Meta is usually used as a prefix of another word. In the scientific vocabulary, meta expresses the idea of change (e.g., metamorphosis, metabolism) while in the philosophical vocabulary, meta expresses an idea of a higher level of generality and abstractness (e.g., metaphysics, metalanguage). In the computing field meta has the latter sense and it is explicitly defined as being a “prefix meaning one level of description higher. If X is some concept then meta-X is data about, or processes operating on, X” [15]. Here are some examples of use of meta in computing:
Metaheuristic. A metaheuristic is a heuristic about heuristics. In game theory and expert systems, metaheuristics are used to give advice about when, how, and why to combine or favor one heuristic over another.
Metarule. A metarule is a rule that describes how ordinary rules should be used or modified. More generally, it is a rule about rules. The following is an example of a metarule:
"If the rule base contains two rules R1 and R2 such that
R1 ≡ A ∧ B ⇒ C
R2 ≡ A ∧ not B ⇒ C
then the expression B is not necessary in the two rules; we can replace the two rules by R3 such that R3 ≡ A ⇒ C."
Metarules can be used during problem solving to select an appropriate rule when conflicts occur within a set of applicable rules. Meta-heuristics and meta-rules are known in knowledge-based systems under a more generic term, metaknowledge.
Metaknowledge. Metaknowledge is the knowledge that a system has about how it reasons, operates, or uses domain knowledge. An example of metaknowledge is shown below:
"If more than one rule applies to the situation at hand, then use rules supplied by experts before rules supplied by novices."
Metalanguage. A metalanguage is a language which describes the syntax and semantics of a given language. For instance, in a metalanguage for C++, the (meta)instruction ⟨variable⟩ "=" ⟨expression⟩ ";" describes the assignment statement in C++, of which "x=3;" is an instance.
Metadata. In databases, metadata means data about data and refers to things such as a data dictionary, a repository, or other descriptions of the contents and structure of a data source [22].
Metamodel. A metamodel is a model representing a model. Metamodels aim at clarifying the semantics of the modeling constructs used in a modeling language. For instance, a metamodel for OML relationships is proposed in [14]. Figure 1 shows a metamodel for the well-known ER model.
Fig. 1. Metamodel of the ER model.
The basic concepts of the ER model are the following: entity types, relationships associating entity types, roles played by participating entity types, attributes characterizing entity types or the relationships themselves, identification structures identifying the entity types in a unique manner, and cardinalities related to the roles. Each of these concepts appears as a metatype in the metamodel shown in Figure 1.
3 The Metaclass Concept
In a system with metaclasses, a class can also be seen as an object. Two-faceted constructs make that double role explicit. Each two-faceted construct is a composite structure comprising an object, called the object facet, and an associated class, called the class facet. To underline their double role, we draw a two-faceted construct as an object box adjacent to a class box. Like classes, class facets are drawn as rectangular boxes, while objects (and object facets) appear as rectangular boxes with rounded corners, as in Figure 2. MC is a metaclass with attribute A and method M1(..). Object I_MC is an instance of MC, with a0 as the value for attribute A. I_MC is the object facet of a two-faceted construct with C as class facet. A is an instance attribute of MC
Fig. 2. Class/metaclass correspondence.
(i.e., it receives a value for each instance of MC) and a class attribute of C (i.e., its value is the same for all instances of C). For instances I1_C and I2_C of C, attribute A is either inapplicable (e.g., an aggregate value over all instances) or constant, i.e., an instance attribute with the same value for all instances. In addition to the class attribute A, C defines attribute B and method M2(..). The figure shows that methods like M1(..) can be invoked on instances of MC (e.g., I_MC), while methods like M2(..) can be invoked on instances of C (e.g., I1_C and I2_C). Note that the two-faceted construct above is useful only to illustrate the double facet of a class that is also an object of a metaclass. Otherwise, in practice, both the object facet I_MC and its associated class facet C (see Figure 2) are the same thing, say, I_MC_C, defined as shown in Figure 3.
Metaclass MC
  Attributes A : AType
  Methods M1(..)
End

Class I_MC_C instanceOf MC
  Values A = a0
  Attributes B : BType
  Methods M2(..)
End

Fig. 3. Definition of class I_MC_C as an instance of metaclass MC.
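For readers more familiar with a mainstream language, the following is a minimal sketch of the same MC / I_MC_C construction using Python's built-in metaclass mechanism, in which every class is itself an instance of a metaclass (by default, type). The keyword-argument handling is a detail of this sketch, not of the paper.

```python
# Sketch of Figure 3 in Python: MC is a metaclass, I_MC_C is at the same
# time an instance of MC (an object) and a class with its own instances.

class MC(type):
    def __new__(mcls, name, bases, namespace, A=None):
        cls = super().__new__(mcls, name, bases, namespace)
        cls.A = A                      # instance attribute of MC = class attribute of cls
        return cls

    def __init__(cls, name, bases, namespace, A=None):
        super().__init__(name, bases, namespace)

    def M1(cls):
        # a method invocable on classes (instances of MC), not on their instances
        return f"M1 on {cls.__name__}, A = {cls.A}"

class I_MC_C(metaclass=MC, A="a0"):
    def __init__(self, b):
        self.B = b                     # instance attribute of I_MC_C

    def M2(self):
        return f"M2 on an instance, B = {self.B}"

print(I_MC_C.M1())                     # message sent to the class facet (instance of MC)
i1_c = I_MC_C("b0")
print(i1_c.M2(), i1_c.A)               # message sent to an ordinary instance; A is shared
```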
Systems with metaclasses comprise at least three levels: token (uninstantiable object), class, and metaclass, as shown in Figure 4. Additional levels, like Metaclass in Figure 4, can be provided as root for the common structure and behavior of all metaclasses. The number of levels of such hierarchies varies from one system to another.
Fig. 4. Levels of systems with a metaclass concept: Metaclass (meta2class level), Entity (metaclass level), Person (class level), and John (token level), connected by is-of links.
4 Various Metaclass Definitions
Substantial differences appear in the literature about the concept of metaclass. We suggest the following criteria to account for the variety of definitions.
- Explicitness: the ability for programmers to explicitly declare a metaclass like they do for ordinary classes. Explicit metaclasses are supported by several semantic models (e.g., TAXIS [23], SHM [2]), object models and systems (e.g., Vodak [19], ADAM [26], OSCAR [10], ConceptBase [16]), knowledge representation languages (e.g., LOOPS [1], KEE [9], PROTEUS [28], SHOOD [25], Telos [24]), and programming languages (e.g., CLASSTALK [21], CLOS [18]). On the contrary, Smalltalk [11] and Gemstone [3], for example, only support implicit system-managed metaclasses. Of course, explicit metaclasses are more flexible [21]. They can, for example, be specialized into other metaclasses in the same way that ordinary classes can.
- Uniformity: the ability to treat an instance of a metaclass like an instance of an application class. More generally, for a system supporting instantiation trees of arbitrary depth, uniformity means that an object at level i (i ≥ 2), instance of a (meta)class at level i+1, can be viewed and treated like an object at level i−1, instance of a (meta)class at level i. Thus, for example, in Figure 4, to create the Entity metaclass, message new is sent to Metaclass; to create the Person class, the message new is sent to the Entity metaclass; and, again, to create the terminal object John, message new is sent to the Person class. While most metaclass systems support uniformity, Smalltalk-80 and Loops, for example, do not.
- Depth of instantiation: the number of levels for the hierarchy of classes and metaclasses. While, for example, Smalltalk has a limited depth in its hierarchy of metaclasses, Vodak and CLOS allow for an arbitrary depth.
- Circularity: the ability to use metaclasses in a system for a uniform description of the system itself. To ensure finiteness of the depth of the instantiation tree, some metaclass concepts have to be instances of themselves. CLOS and ConceptBase, for example, offer that ability. Smalltalk does not.
- Shareability: the ability for more than one class to share the same user-defined metaclass. Most systems supporting explicit metaclasses provide shareability.
- Applicability: whether metaclasses can describe classes only (the general case) or other concepts also. For example, TAXIS extends the use of metaclasses to procedures and exceptions, while ConceptBase uses attribute metaclasses to represent the common properties of a collection of attributes.
- Expressiveness: the expressive power made available by metaclasses. In most systems, metaclasses represent the structure and behavior of their instances only, as shown in Figure 2. In some systems like Vodak [19], metaclasses are able to describe both their direct instances (that are classes) and the instances of those classes. The metaformulas of Telos and ConceptBase can also specify the behavior of the instances of a metaclass and of the instances of its instances.
- Multiple classification: the ability for an object (resp., class) to be an instance of several classes (resp., metaclasses) not related, directly or indirectly, by the generalization link. To our knowledge, only Telos and ConceptBase support this facility.
Note that this list of characteristics has been identified by carefully analyzing a large set of systems supporting metaclasses. We cannot, however, claim its exhaustiveness. The list remains open to other characteristics that could be identified by exploring other systems. Note also that these criteria are very useful in that they greatly help designers to select the most suitable system (with metaclasses) for their specific needs.
5 Method Invocation
In systems with metaclasses, messages can be sent to classes in the same way that messages are sent to individual objects in usual object models. To avoid ambiguity, we show below how messages are invoked at each level of abstraction and how objects are created. Henceforth, the term object will denote tokens, classes, or metaclasses. Two rules specify the method-invocation mechanism.
Rule 1. When message Msg is sent to object o, method Meth which responds to Msg must be available (directly or indirectly by inheritance) in the class of o.
Rule 2. An object o is created by sending a message, say new(), to the class of o. Consequently, according to Rule 1, new() must be available in the class of o's class.
The following messages illustrate the two rules above. They manipulate objects of Figure 4.
John→increaseSalary($1000). In this message, increaseSalary is sent to object John to increase the value of salary by $1000. Method increaseSalary is assumed to be available in the class of John, i.e., Person.
¹ These rules assume that the target object system represents object behavior with methods. Systems like ConceptBase that represent object behavior using constraints and deductive rules are not concerned with message-passing rules.
John := Person→new(). In this message, new is sent to object Person in order to create object John as an instance of Person. According to Rule 1, method new must be available in the class of Person, i.e., Entity.
Most object systems provide built-in primitives and appropriate syntax to define classes (e.g., Person), their attributes (e.g., salary), and their methods (e.g., increaseSalary). However, to illustrate how metaclasses affect classes, just as classes affect tokens, we show in the following how messages can be sent to the Entity metaclass to build classes and their features.
Person := Entity→new(). In this message, Person is created as an instance of Entity. Once again, this assumes that method new is available in Entity's class, i.e., Metaclass.
Entity→addAttributes(Person, { [attrName: name, attrDomain: String]; [attrName: salary, attrDomain: Real] }). This message adds the attributes name and salary to the newly created object Person. Similarly, a message can be sent to object Entity to add a new method to object Person.
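Java offers no user-defined metaclasses (its class objects are implicit and system-managed, much like Smalltalk's), but the uniformity idea of sending a creation message to a class object can still be illustrated with reflection. The sketch below is only an analogy under that assumption, not the metaclass mechanism discussed above; the Person class and its members are hypothetical.

```java
import java.lang.reflect.Constructor;

public class UniformityDemo {
    // Hypothetical application class, standing in for Person of Figure 4.
    static class Person {
        double salary;
        void increaseSalary(double amount) { salary += amount; }
    }

    public static void main(String[] args) throws Exception {
        // The class itself is an object that can receive messages ...
        Class<Person> personClass = Person.class;
        Constructor<Person> ctor = personClass.getDeclaredConstructor();
        // ... including a creation message, analogous to John := Person -> new().
        Person john = ctor.newInstance();
        // An ordinary message to the instance, analogous to John -> increaseSalary($1000).
        john.increaseSalary(1000);
        System.out.println(personClass.getSimpleName() + " instance, salary = " + john.salary);
    }
}
```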
6 Usage of Metaclasses
Various reasons warrant a metaclass mechanism in a model or a system. Typically, metaclasses extend the system kernel, blurring the boundary between users and implementors. Explicit metaclasses can specify knowledge to:
- Represent group information that concerns a set of objects as a whole. For example, the average age of employees is naturally attached to an EmployeeClass metalevel (see the sketch following this list).
- Represent class properties unrelated to the semantics of instances, such as the fact that a class is concrete or abstract², has a single instance or multiple instances, or has a single superclass or multiple superclasses.
- Customize the creation and initialization of new instances of a class. The message new, which is sent to a class to create new instances, can incorporate additional arguments to initialize the instance variables of the newly created instance. Furthermore, each class can have its own overloaded new method for creating and initializing instances.
- Enhance the extensibility and flexibility of models, and thus allow easy customization. For example, the semantics of generic relationships can be defined once and for all in a structure of metaclasses that provides for defining and querying the relationships at the class level, creating and deleting instances of participating classes, and so on (see e.g., [13,19,5,7,20,6]).
- Extend the basic object model to support new categories of objects (e.g., remote objects or persistent objects) and new needs such as the authorization mechanism. This kind of extension requires the ability to modify some basic behavioral aspects of the system (object creation, message passing), and has often been addressed by allowing these aspects to be manipulated at the metaclass level.
- Define an existing formalism or a development method within a system supporting metaclasses. This definition roughly consists of representing the modeling constructs involved in that formalism or method (i.e., its ontology) with a set of metaclasses of the target system. For example, Fusion [4], an object development method, was partially integrated into ConceptBase [12] using metaclasses.
- Integrate heterogeneous modeling languages within the same sound formalism. For example, a framework combining several formalisms for the requirements engineering of discrete manufacturing systems was defined along the lines of ConceptBase in [27]. The combined formalisms are CIMOSA (for eliciting requirements), i∗ (for enterprise modeling), and the Albert II language (for modeling system requirements).
² Here, an abstract class, in the usual sense of object models, is an incompletely defined class without direct instances, whose complete definition is deferred to subclasses.
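As a loose illustration of the first point above (group information), and only an analogy since Java has no explicit metaclasses, class-level (static) state is the closest Java counterpart to an attribute held by an EmployeeClass metaclass; the Employee class below is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class Employee {
    // Class-level state: information about the set of employees as a whole,
    // in the spirit of an attribute attached to an EmployeeClass metaclass.
    private static final List<Employee> allEmployees = new ArrayList<>();

    private final int age;

    Employee(int age) {
        this.age = age;
        allEmployees.add(this);
    }

    // Group information: the average age of all employees.
    static double averageAge() {
        return allEmployees.stream().mapToInt(e -> e.age).average().orElse(0.0);
    }
}
```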
7 Problems with Metaclasses
Some authors (e.g., [17]) have pointed out some problems with metaclasses. These problems have been analyzed in part in [8]. We summarize the main issues.
- Metaclasses make the system more difficult to understand. We agree with [8] that, once programmers are familiar with metaclasses, having a single mechanism for both data and metadata helps them progress from object design to object system design.
- By themselves, metaclasses do not provide mechanisms to handle all the run-time consequences of extending the data model. This is true for most systems. However, some systems, like ADAM [8] and ConceptBase, introduce the notion of active rules to enforce constraints that keep the database in a consistent state.
- Metaclasses do not facilitate low-level extensions. For most systems this is true, since metaclasses describe the model or class level, above the structures that specify storage management, concurrency, and access control. Thus, in such systems, metaclasses do not let applications define policies at all levels. However, this is not a general rule: systems such as ConceptBase and VODAK provide a metaclass mechanism that allows both the class and instance levels to be described in a coordinated manner.
- With metaclasses, programmers must cope with three levels of objects: instances, classes, and metaclasses. We agree that it can be difficult at the beginning to work with the three levels.
After presenting these problems, the authors conclude that the metaclass approach is not satisfactory. We agree with [8] that this conclusion may be valid when talking about programming languages, but we believe that explicit
metaclasses are a powerful mechanism for enhancing database extensibility, uniformity, and accessibility by addressing these issues at the class level (see e.g., [6]).
8 Conclusion
Metaclasses define the structure and behavior of class objects, just as classes define the structure and behavior of instance objects. In systems with metaclasses, a class can also be seen as an object. We used the two-faceted constructs to make that double role explicit. Substantial differences appear in the literature about the concept of metaclass. We suggested a set of criteria to account for the variety of definitions, namely, uniformity, depth of instantiation, circularity, shareability, applicability, and expressiveness. We then presented the method-invocation mechanism between objects at various levels of abstraction. We also presented some uses of metaclasses and analyzed some of their drawbacks pointed out in the literature.
References
1. D.G. Bobrow and M.J. Stefik. The LOOPS Manual. Xerox Corp., 1983.
2. M.L. Brodie and D. Ridjanovic. On the design and specification of database transactions. In M.L. Brodie, J. Mylopoulos, and J.W. Schmidt, editors, On Conceptual Modelling. Springer-Verlag, 1984.
3. P. Butterworth, A. Ottis, and J. Stein. The Gemstone Database Management System. Communications of the ACM, 34(10):64–77, 1991.
4. D. Coleman, P. Arnold, S. Bodoff, C. Dollin, H. Gilchrist, F. Hayes, and P. Jeremaes. Object-Oriented Development: The Fusion Method. Prentice Hall, 1994.
5. M. Dahchour. Formalizing materialization using a metaclass approach. In B. Pernici and C. Thanos, editors, Proc. of the 10th Int. Conf. on Advanced Information Systems Engineering, CAiSE'98, LNCS 1413, pages 401–421, Pisa, Italy, June 1998. Springer-Verlag.
6. M. Dahchour. Integrating Generic Relationships into Object Models Using Metaclasses. PhD thesis, Département d'ingénierie informatique, Université catholique de Louvain, Belgium, March 2001.
7. M. Dahchour, A. Pirotte, and E. Zimányi. Materialization and its metaclass implementation. To be published in IEEE Transactions on Knowledge and Data Engineering.
8. O. Díaz and N.W. Paton. Extending ODBMSs using metaclasses. IEEE Software, pages 40–47, May 1994.
9. R. Fikes and J. Kehler. The role of frame-based representation in reasoning. Communications of the ACM, 28(9), September 1985.
10. J. Göers and A. Heuer. Definition and application of metaclasses in an object-oriented database model. In Proc. of the 9th Int. Conf. on Data Engineering, ICDE'93, pages 373–380, Vienna, Austria, 1993. IEEE Computer Society.
11. A. Goldberg and D. Robson. Smalltalk-80: The Language and its Implementation. Addison-Wesley, 1983.
12. E.V. Hahn. Metamodeling in ConceptBase - demonstrated on FUSION. Master's thesis, Faculty of CS, Section IV, Technical University of München, Germany, October 1996.
13. M. Halper, J. Geller, and Y. Perl. An OODB part-whole model: Semantics, notation, and implementation. Data & Knowledge Engineering, 27(1):59–95, May 1998.
14. B. Henderson-Sellers, D.G. Firesmith, and I.M. Graham. OML metamodel: Relationships and state modeling. Journal of Object-Oriented Programming, 10(1):47–51, March 1997.
15. D. Howe. The Free On-line Dictionary of Computing. 1999.
16. M. Jarke, R. Gallersdörfer, M.A. Jeusfeld, and M. Staudt. ConceptBase: A deductive object base for meta data management. Journal of Intelligent Information Systems, 4(2):167–192, 1995.
17. S.N. Khoshafian and R. Abnous, editors. Object Orientation: Concepts, Languages, Databases, User Interfaces. John Wiley & Sons, New York, 1990.
18. G. Kiczales, J. des Rivières, and D. Bobrow. The Art of the Metaobject Protocol. MIT Press, 1991.
19. W. Klas and M. Schrefl. Metaclasses and their application. LNCS 943. Springer-Verlag, 1995.
20. M. Kolp. A Metaobject Protocol for Integrating Full-Fledged Relationships into Reflective Systems. PhD thesis, INFODOC, Université Libre de Bruxelles, Belgium, October 1999.
21. T. Ledoux and P. Cointe. Explicit metaclasses as a tool for improving the design of class libraries. In Proc. of the Int. Symp. on Object Technologies for Advanced Software, ISOTAS'96, LNCS 1049, pages 38–55, Kanazawa, Japan, 1996. Springer-Verlag.
22. L. Mark and N. Roussopoulos. Metadata management. IEEE Computer, 19(12):26–36, December 1986.
23. J. Mylopoulos, P. Bernstein, and H. Wong. A language facility for designing interactive, database-intensive applications. ACM Trans. on Database Systems, 5(2), 1980.
24. J. Mylopoulos, A. Borgida, M. Jarke, and M. Koubarakis. Telos: Representing knowledge about information systems. ACM Trans. on Office Information Systems, 8(4):325–362, 1990.
25. G.T. Nguyen and D. Rieu. SHOOD: A design object model. In Proc. of the 2nd Int. Conf. on Artificial Intelligence in Design, Pittsburgh, USA, 1992.
26. N. Paton and O. Diaz. Metaclasses in object oriented databases. In R.A. Meersman, W. Kent, and S. Khosla, editors, Proc. of the 4th IFIP Conf. on Object-Oriented Databases: Analysis, Design and Construction, DS-4, pages 331–347, Windermere, UK, 1991. North-Holland.
27. M. Petit and E. Dubois. Defining an ontology for the formal requirements engineering of manufacturing systems. In K. Kosanke and J.G. Nell, editors, Proc. of the Int. Conf. on Enterprise Integration and Modeling Technology, ICEIMT'97, Torino, Italy, 1997. Springer-Verlag.
28. D.M. Russinof. Proteus: A frame-based nonmonotonic inference system. In W. Kim and F.H. Lochovsky, editors, Object-Oriented Concepts, Databases and Applications, pages 127–150. ACM Press, 1989.
29. M. Webster. The WWWebster Dictionary. 2000.
XSearch: A Neural Network Based Tool for Components Search in a Distributed Object Environment
Aluízio Haendchen Filho¹, Hércules A. do Prado², Paulo Martins Engel², and Arndt von Staa¹
¹ PUC – Pontifícia Universidade Católica do Rio de Janeiro, Departamento de Informática, Rua Marquês de São Vicente 225, CEP 22453-900, Rio de Janeiro, RJ, Brasil {aluizio, arndt}@inf.puc-rio.br
² Universidade Federal do Rio Grande do Sul, Instituto de Informática, Av. Bento Gonçalves, 9500, CEP 91501-970, Porto Alegre, RS, Brasil {prado, engel}@inf.ufrgs.br
Abstract. The large-scale adoption of three-tier partitioned architectures and the support provided by distributed object technology have brought great flexibility to the information systems development process. In addition, the development of applications based on these alternatives has led to an increasing number of components. This growth was strongly influenced by the rise of the Internet, which introduced a number of new kinds of components, such as HTML pages, Java scripts, servlets, and applets. In this context, recovering the most suitable component to fulfill the requirements of a particular application is crucial for effective reuse and the consequent reduction in time, effort, and cost. In this article, we describe a neural network based solution that implements an intelligent component-retrieval mechanism. By applying this process, a developer is able to emphasize reuse while avoiding the unchecked proliferation of nearly identical components.
1 Introduction
The large-scale adoption of architectures partitioned into interface, logic, and data tiers, leveraged by the Internet, has brought an unprecedented flexibility to the development of information systems. However, the development of applications based on these alternatives has led to a considerable increase in the number of components. An important challenge in this scenario is posed by the question: in a repository with an enormous number of alternatives, how can the most suitable component to fulfill the requirements of a particular application be recovered? In this article, we describe a solution based on an artificial neural network, the associative Hopfield Model (HM) [2] [5], to locate and recover components for business applications. The HM is particularly well suited to recording sparse signals, as is the case for software component descriptors. Moreover, in the application phase, the model allows similar components to be recovered from a description of the desired component's requirements. A tool called XSearch was implemented to validate this approach. We also present a small example that illustrates the applicability of the tool.
After discussing the context of the work in the next chapter, we give the details of our approach in Chapter 3. Chapter 4 describes the tool's functions, presenting the different kinds of resources used to build the neural network. Chapter 5 illustrates how the tool works by means of an example. Some related techniques for component recovery are described in Chapter 6.
2 Context
This paper uses results from Software Engineering (SE) and Artificial Intelligence (AI), aiming to support the information systems development process. To clarify the context of the proposed approach, we describe the applied distributed object architecture (DOA), the multi-tier model, and the technology applied to create the DOA. Knowledge of these characteristics simplifies the understanding of the components that compose a typical distributed object environment and of the deployment descriptors that are mapped to the neural network. To avoid confusion when referring to parts of the architecture and to the topology of the HM, we reserve the word "tier" for the software engineering context and "layer" for the HM. In this paper we use the J2EE platform, from Sun Microsystems, and particularly the EJB (Enterprise JavaBeans) component model. EJB was designed to cope with the issues related to the management of distributed business objects in a three-tier architecture [7]. The J2EE platform provides an application model distributed in tiers; this means that the parts of an application can run on different devices. On the other hand, it also enables different client types to transparently access information from an object server on any platform. Figure 1 shows the components and services involved in a typical J2EE multi-tier environment.
Fig. 1. The multi-tier model applied in a J2EE environment
The client tier supports a variety of client types, inside or outside the corporate firewall. The middle tier supports client services through web containers and EJB components that provide the business logic functions. Web containers provide support
for the processing of client requests, producing run-time responses, such as invoking JSPs or servlets, and returning the results to the client [7]. The EIS (Enterprise Information Systems) tier includes the RDBMS used for data persistence. Behind the central concept of a component-based development model we find the notion of containers. Containers are standard processing environments that provide specific services to components [7]. A server-side component model defines an architecture for developing distributed objects. These models are used in the middle tier and manage the processing, assuring the availability of information to local or remote clients. The object server comprises the set of business objects, which reside in this tier. Server-side component models are based on interface specifications. As long as a component adheres to the specifications, it can be used by the CTM (Component Transaction Monitor). The relationship between a server-side component and the CTM is like that between a CD-ROM and a CD player: the component (CD-ROM) must conform to the player's specification [7].
3 Proposed Approach
Considering the context previously described, the approach consists of applying a HM to retain information about components and to recover the most similar one with respect to a particular set of requirements. Figure 2 gives an overview of the whole process, whose steps are described next.
Fig. 2. Representation process overview.
(1) The tool scans the application server and creates a HM representing all components in the environment; (2) the developer presents the specifications of the desired component to the interface, and these specifications are mapped to the input layer of the HM; (3) the tool recovers the most similar component; (4) new components are incrementally incorporated into the HM.
In this chapter, we describe: (a) the component descriptors, provided by the J2EE platform, that are used to build the HM; (b) the HM topology; and (c) how the descriptors are coded in the HM.
3.1 The Deployment Descriptors
The deployment descriptors work very much like a property file, in which the attributes, functions, and behavior of a bean are described in a standard fashion. In our approach, the component descriptors are used to relate the requirements of a desired component to the components already existing in an environment. Starting from these descriptors, the tool locates the component or framework most suitable to be reused or customized. Figure 3 shows an example of a directory structure containing the components that belong to a specific package (Product) of an application (Application).
Fig. 3. The Product beans files.
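For concreteness, the sketch below shows, under the EJB 1.1 conventions the paper assumes, what the Product sources of Figure 3 might look like; all names and method signatures here are hypothetical, and each type would normally live in its own .java file inside the Product package.

```java
import java.rmi.RemoteException;
import javax.ejb.*;

// Product.java -- remote interface: the business methods visible to clients.
interface Product extends EJBObject {
    String getName() throws RemoteException;
    void setName(String name) throws RemoteException;
}

// ProductHome.java -- home interface: life-cycle and finder methods.
interface ProductHome extends EJBHome {
    Product create(String id, String name) throws CreateException, RemoteException;
    Product findByPrimaryKey(ProductPK pk) throws FinderException, RemoteException;
}

// ProductPK.java -- primary key class (entity beans only).
class ProductPK implements java.io.Serializable {
    public String id;
    public ProductPK() {}
    public ProductPK(String id) { this.id = id; }
    public int hashCode() { return id.hashCode(); }
    public boolean equals(Object o) { return (o instanceof ProductPK) && ((ProductPK) o).id.equals(id); }
}

// ProductBean.java -- bean class with the container callbacks (logic elided).
class ProductBean implements EntityBean {
    public String id;
    public String name;

    public ProductPK ejbCreate(String id, String name) { this.id = id; this.name = name; return new ProductPK(id); }
    public void ejbPostCreate(String id, String name) {}

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public void ejbLoad() {}
    public void ejbStore() {}
    public void ejbRemove() {}
    public void ejbActivate() {}
    public void ejbPassivate() {}
    public void setEntityContext(EntityContext ctx) {}
    public void unsetEntityContext() {}
}
```

The deployment descriptor then ties these pieces together by naming the bean class, the two interfaces, and the primary key class.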
When a bean class and its interfaces are defined, a deployment descriptor is created and populated with data about the bean. IDEs (Integrated Development Environments) that work with EJB usually provide this through property sheets similar to those of Visual Basic, Delphi, and others. After the developer describes these properties, the component descriptor can be packaged in a JAR (Java Archive) file. A JAR file contains one or more enterprise beans, including, for each bean, a bean class, remote interface, home interface, and primary key (the latter only for EntityBean types) [8]. The deployment descriptor of a bean must be saved as an ejb-jar.xml file and must be located in the same directory as the other components (interfaces and primary key) of the bean. Normally, in applications, we create directory structures that match the structure of the application packages. Notice that the components belonging to the package Product form a set of files that includes the classes Product.class, ProductHome.class, ProductBean.class, and
ProductPK.class, as well as the .java files (Product.java, ProductHome.java, ProductPK.java). When a JAR file containing a JavaBean (or a set of JavaBeans) is loaded into an IDE, the IDE examines the file in order to determine which classes represent beans. Every development environment knows to look for the deployment descriptor in the JAR file's META-INF directory.
3.2 The Hopfield Model
The adoption of a discrete HM is justified by three main arguments: (1) we are assuming a stable set of components that are going to be stored in the model; (2) the component descriptors are typically represented by, or can be converted to, a binary form; and (3) the descriptor vector is quite sparse, since different components share very few descriptors. Our HM (see Figure 4) has two layers: the input layer, where the binary descriptors are mapped, and the representation layer, where the traces of the vectors are represented.
Fig. 4. The auto-associative architecture of the HM.
The HM can be seen as a non-linear auto-associative memory that always converges to one of the stored patterns in response to the presentation of an incomplete or noisy version of that pattern. The stable points in the network's phase space are the fundamental memories, or prototype states, of the model. A partial pattern presented to the network can be represented as an initial point in the phase space. Since this point is near the stable point representing the item to be recovered, the system evolves in time and converges to this memorized state. The discrete version of the HM uses the formal McCulloch-Pitts neuron, which can take one of two states (+1 or −1). The network works in two phases: storage and recovery. Let us suppose that we want to store a set of p N-dimensional binary vectors, denoted by:
$$\{\xi_\mu \mid \mu = 1, 2, \ldots, p\}$$
These are the p vectors corresponding to the fundamental memories; $\xi_{\mu,i}$ denotes the i-th element of the fundamental memory $\xi_\mu$. By the outer-product storage rule, which is a generalization of the Hebb rule, the synaptic weight from neuron i to neuron j is defined by
$$w_{ji} = \frac{1}{N}\sum_{\mu=1}^{p} \xi_{\mu,j}\,\xi_{\mu,i}, \qquad w_{ii} = 0.$$
Defining $w$ as the N-by-N matrix of synaptic weights, in which $w_{ji}$ is its ji-th element, we can write
$$w = \frac{1}{N}\sum_{\mu=1}^{p} \xi_\mu \xi_\mu^{T} - \frac{p}{N} I.$$
Here $\xi_\mu \xi_\mu^{T}$ denotes the outer product of the vector $\xi_\mu$ with itself, and $I$ denotes the identity matrix. During the recovery phase, an N-dimensional binary probe vector $x$ is presented to the network. A probe vector is typically a noisy or incomplete version of a fundamental memory. In this way a prototype is recovered that represents the most probable candidate component for reuse.
3.3 Coding Components in the Network
Taking into account the context of distributed objects and the adopted platform, we initially consider three kinds of basic components: (1) components of the entity bean type, (2) components of the session bean type, and (3) components of the web type. For each of the basic components a different input vector is built and, as a consequence, a different HM, making the search and recovery process faster and more efficient. In this way, different kinds of networks can be generated, each one better suited to a particular objective. Filtering processes, run when interacting with the user, allow one to establish networks for different component classes. The input vector stores the description of the searched component and is defined by the developer. Figure 5 shows a particular example of an input vector layout for an entity bean component type.
Fig. 5. Input vector data groups to the GenericNetwork.
Three data groups compose the vector of a GenericNetwork: Interface, State, and Behavior. Each group is briefly described next. The fields in this vector are Boolean, receiving the value 1 when the property holds and −1 otherwise. These data groups are defined when configuring the HM. Each cell is coded as the contents of the JAR files are analyzed, according to the following guidelines: C1, C2, ... represent the different characteristics of a client that interacts with the entity bean, such as the existence or not of the local and remote interfaces, whether the client is of an EJB type, and others;
D1, D2, ... hold the DBMS names in the environment, obtained during the interaction with the user when generating the network; I1, I2, ... each represent a quantity of attributes of type int occurring in the component. For example, if a component has 2 attributes of type int, the cell I2 receives 1 and the remaining cells I1, I3, ... receive −1. The same rule applies to the other data types (char, Str, Ima, and so on). CTM can have at most 7 cells, as described further in the Behavior topic.
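Putting Sections 3.2 and 3.3 together, the following minimal sketch, which is not the authors' implementation, shows how bipolar descriptor vectors of the kind just described could be stored in and recalled from a discrete Hopfield memory using the outer-product rule and a sign-threshold update; the vectors and sizes are made up for illustration.

```java
public class HopfieldSketch {

    // Hebbian outer-product rule: w_ji = (1/N) * sum_mu xi_mu,j * xi_mu,i, with w_ii = 0.
    static double[][] store(int[][] patterns) {
        int n = patterns[0].length;
        double[][] w = new double[n][n];
        for (int[] xi : patterns)
            for (int j = 0; j < n; j++)
                for (int i = 0; i < n; i++)
                    if (i != j) w[j][i] += (double) xi[j] * xi[i] / n;
        return w;
    }

    // Recall: iterate s_j = sign(sum_i w_ji * s_i) until the state is stable.
    static int[] recall(double[][] w, int[] probe, int maxSweeps) {
        int n = probe.length;
        int[] s = probe.clone();
        for (int sweep = 0; sweep < maxSweeps; sweep++) {
            boolean changed = false;
            for (int j = 0; j < n; j++) {
                double h = 0;
                for (int i = 0; i < n; i++) h += w[j][i] * s[i];
                int next = h >= 0 ? 1 : -1;
                if (next != s[j]) { s[j] = next; changed = true; }
            }
            if (!changed) break;  // converged to a fundamental memory
        }
        return s;
    }

    public static void main(String[] args) {
        // Two hypothetical descriptor vectors (interface/state/behavior cells).
        int[][] components = {
            { 1, -1,  1, -1,  1, -1,  1, -1},
            {-1,  1, -1,  1, -1,  1, -1,  1}
        };
        double[][] w = store(components);
        // A partial/noisy requirement vector recalls the closest stored component.
        int[] query = { 1, -1,  1, -1, -1, -1,  1, -1};
        System.out.println(java.util.Arrays.toString(recall(w, query, 10)));
    }
}
```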
4 The XSearch Functions
XSearch is fully configurable, which gives great flexibility to simulate alternative HMs by varying the descriptors. By simulating different HMs, the developer can look for a better tradeoff between search performance and precision. Among the advantages of using a neural network, a Hopfield Model in this case, the most important are the simplification of the process and the reduction of processing time. Comparing our approach to an exhaustive search in a relational database, we can see that the operations required by the latter alternative exceed those required by the neural network. Below we list the key operations for each alternative.
Tasks in a relational model: (a) traverse the whole component base for each element of the input vector (n = table size, m = input vector size, cost = n × m); (b) compare at each iteration; (c) sort the selected components by similarity; (d) get the address and recover the component.
Task in our approach: (a) compute the product of the input vector with the neural network.
After recovering a vector describing a component candidate for reuse, it is necessary to locate this component in its specific repository. Indexing the component in a binary tree with each descriptor as a node and its address at the leaf solves this problem (see the sketch below). The example presented in Section 3.3 showed the composition of a generic network (GenericNetwork), better suited to recovering components that possess a large number of methods combined with many attributes, interfaces, and other characteristics. Most of the time, however, we need to recover a component from a less general set of specifications, for example, a session bean that possesses only two or three methods. In this case, a specific network can locate components faster and more efficiently than a GenericNetwork. To cope with this question, considering a component of EJB type, the following QuickNets can be generated: (a) StateNet, which deals only with the State group of the component; (b) BehaviorNet, which considers only the Behavior group; (c) ClientNet, which helps in finding the components located outside the application server; and (d) PKNet, a network that allows recovering components that access databases.
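Returning to the indexing idea mentioned above, one simple realization, sketched here under the assumption that every component is keyed by a fixed-length bipolar descriptor vector, is a binary trie whose branches follow the descriptor values and whose leaf holds the component's repository address.

```java
class DescriptorTrie {
    private final DescriptorTrie[] child = new DescriptorTrie[2]; // index 0 for -1, 1 for +1
    private String address;                                       // set only at the leaf

    // Register a component under its descriptor vector.
    void put(int[] descriptors, String componentAddress) {
        DescriptorTrie node = this;
        for (int d : descriptors) {
            int k = d > 0 ? 1 : 0;
            if (node.child[k] == null) node.child[k] = new DescriptorTrie();
            node = node.child[k];
        }
        node.address = componentAddress;
    }

    // Locate the repository address of the component recalled by the HM.
    String lookup(int[] descriptors) {
        DescriptorTrie node = this;
        for (int d : descriptors) {
            node = node.child[d > 0 ? 1 : 0];
            if (node == null) return null;
        }
        return node.address;
    }
}
```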
To avoid pattern mixing due to successive extensions, the HM comprises only the original components (from which extensions can be generated). For each original component, a list of its extensions is created. When a component is recovered, a sequential search over its extension list is performed to find an extension that is more similar than the original component. To find the most similar extension, the Hamming distance [2] is applied. Dictionaries play an important role in the system. When generating the network, after configuring the environment with the wizard, a process scans the application server to identify and classify components, attributes, and methods. Moreover, the dictionaries simplify the search when recovering components by name, or all components that use a specific method or contain a specific attribute.
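A minimal sketch of the Hamming comparison used over the extension list; it assumes the extensions are encoded with the same bipolar descriptor vectors as the HM input.

```java
final class HammingUtil {
    // Number of positions at which two descriptor vectors differ; the extension
    // with the smallest distance to the recovered component's vector is preferred.
    static int distance(int[] a, int[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) if (a[i] != b[i]) d++;
        return d;
    }
}
```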
5 Example
In this chapter we present a small example illustrating how a component is recovered according to a list of requirements stated by the developer: (a) component type: entity bean; (b) client type: ejb1.1; (c) DBMS vendor: ORACLE; (d) attributes: two integer fields, two String fields, one Date field, and two double fields; and (e) methods: the component must include a set of methods such as ejbCreate, ejbStore, and ejbRemove.
Fig. 6. Specifying the component type.
Fig. 7. Requirements for the state part.
A start window, not shown, enables the HM configuration and includes the specification of limits for the three data groups (interface, state, and behavior). One such limit, for example, is the number of fields in the state data group. Another configuration item is the folder to be scanned when building the HM. Figure 6 shows the window that enables the user to specify the component type. The other operation recovers a component of the type selected in this window. Note that, when the component type (entity, session, or web) is chosen, the specific HM that holds the characteristics of that component type is automatically chosen as well.
The window in Figure 7 allows the user to specify the component requirements; an example of a state requirement specification is shown. Suppose we have a component base and a required component as described in Table 1. In this case, the HM will recover the component C2. Note that, for the sake of simplicity, we adopted a general specification of behavior as transactional or non-transactional. The level of specification depends on the user's preferences when configuring the tool. Due to space limitations, examples of methods are not included in Table 1.
Table 1. Component base and component states
6 Related Works
Recovering components for reuse has been approached in several recent publications. Some of them cope with this question by creating standard libraries for reuse. Michail [1] shows how to discover standard libraries in existing applications using data mining techniques. He applies "generalized association rules" based on inheritance hierarchies to discover potentially reusable components. By browsing generalized association rules, a developer can discover patterns in library usage that take inheritance relationships into account. Küng [10], for example, applies the associative memory model Neunet in data mining, where the basic idea is to build simple neural units and connections between the nodes; a binary representation indicates whether or not a connection between two units exists. The network shows a behavior similar to our approach. Another version, Fuzzy Neunet [11], processes signals that are normally between −1 and +1 [10]. Cohen [3] considers the recovery problem as an instance of a learning approach, focusing on behavior. Recently, library reengineering has been assessed by analyzing library use in many existing applications [6]. Constructing lattices for this purpose provides insights into the usage of the class hierarchy in a specific context. Such a lattice can be used to reengineer the library class hierarchy to better reflect standard usage [1] [6].
7 Conclusions
An important advantage of our proposal is that, by adopting the HM, it is possible to simplify the process and to locate and recover components for reuse more quickly. The tool also supports network maintenance, since the model allows an online update of the network as new components are inserted into the repository. Moreover, the tool can be reconfigured to include new descriptors in the HM, which requires the HM to be rebuilt. Generating and maintaining a neural network in a highly dynamic environment requires many tasks, such as monitoring the environment; keeping the neural network updated when components are included, modified, or excluded; and presenting the search results. To perform these laborious tasks a multi-agent system has been implemented. The increasing complexity of the modern computational environment has called for more refined tools and resources that simplify and increase the efficiency of the development process. This is true even when considering traditional CASE tools [9]. The application of Artificial Intelligence techniques can contribute significantly to providing many of these resources, as we have shown in this paper.
References
1. Michail, A.: Data Mining Library Reuse Patterns using Generalized Association Rules. In Proceedings of the 22nd International Conference on Software Engineering, Limerick, Ireland, 2000. IEEE Computer Society Press.
2. Freeman, J. A.: Neural Networks - Algorithms, Applications, and Programming Techniques. Addison-Wesley Publishing, Menlo Park CA, 1992.
3. Cohen, W. W. et al.: Inductive specification recovery: Understanding software by learning from example behaviors. Automated Software Engineering, 2(2):107–129, 1995.
4. Fayad, M. E. et al.: Application Frameworks: Object-Oriented Foundations of Frameworks Design. New York: John Wiley & Sons, 1999.
5. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, Inc., Englewood Cliffs, New Jersey, 1999.
6. Snelting, G. et al.: Reengineering class hierarchies using concept analysis. In Proceedings of the 6th IEEE International Conference on Automated Software Engineering, 1998.
7. Kassem, N.: Designing Enterprise Applications with the Java 2 Platform, Enterprise Edition. Addison-Wesley, Boston, 2000.
8. Monson-Haefel, R.: Enterprise JavaBeans. O'Reilly & Associates, Inc., California, 1999.
9. Wang, Y. et al.: A Worldwide Survey of Base Process Activities Towards Software Engineering Process Excellence. In Proceedings of the 20th International Conference on Software Engineering, Kyoto, Japan, 1998. IEEE Computer Society Press.
10. Küng, J.: Knowledge Discovery with the Associative Memory Modell Neunet. In Proceedings of the 10th International Conference DEXA'99, Florence, Italy, 1999. Springer-Verlag, Berlin.
11. Andlinger, P.: Fuzzy Neunet. Dissertation, Universität Linz, 1992.
Information Retrieval by Possibilistic Reasoning
Churn-Jung Liau¹ and Y.Y. Yao²
¹ Institute of Information Science, Academia Sinica, Taipei, Taiwan [email protected]
² Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2 [email protected]
Abstract. In this paper, we apply possibilistic reasoning to information retrieval for documents endowed with similarity relations. On the one hand, it is used together with Boolean models for accommodating possibilistic uncertainty. The logical uncertainty principle is then interpreted in the possibilistic framework. On the other hand, possibilistic reasoning is integrated into description logic and applied to some information retrieval problems, such as query relaxation, query restriction, and exemplar-based retrieval. Keywords: Possibilistic logic, Boolean models, Description logic, Similarity-based reasoning.
1 Introduction
In the last two decades, we have witnessed significant progress in information retrieval (IR) research. To meet the challenge of information explosion, many novel models and methods have been proposed. Among them, the logical approach is aimed at laying down a rigorous formal foundation for IR methods and leads to a deeper understanding of the nature of the IR process. Since the pioneering work of Van Rijsbergen [20], several logical approaches to IR have been proposed. These approaches usually rely on philosophical logics or knowledge representation formalisms, such as modal logic [14], relevance logic [13], many-valued logic [16], description logic [12,13], and default logic [5]. This list is by no means exhaustive, and further references and surveys can be found in [8,9]. In the logical approaches, it is common to give the documents and queries some logical representation, and the retrieval work is reduced to establishing some implication between documents and queries. However, it is also well known that classical logical implication is not adequate for this purpose after Van Rijsbergen introduced the logical uncertainty principle (LUP). To cope with the problem, many logical models for IR have been extended with uncertainty management formalisms, such as probability [19], fuzzy logic [15], or Dempster-Shafer
theory [6,7]. Though these extensions cover almost all the mainstream theories of uncertainty reasoning, the management of possibilistic uncertainty has received less attention. Possibilistic uncertainty is due to the fuzziness of information. In particular, in [17], it is shown that possibilistic uncertainty arises naturally from degrees of similarity. Since the matching between documents and queries has been recognized as a kind of similarity in traditional models of IR (such as the vector models), logical models should also have the capability of dealing with possibilistic uncertainty. Possibility theory [21] is the main theory for the management of possibilistic uncertainty. Some logical systems based on possibility theory have been developed and extensively studied in the artificial intelligence literature [3,10]. In these logics, two measures are attached to logical formulas to denote their possibility and necessity. These measures are shown to be closely related to modal logic operators, so their evaluation relies on a set of possible worlds and a similarity relation between them. In IR terms, this means that the uncertainty of a Boolean query matching a document will depend on the similarity between documents. In fact, the inferential IR approach based on fuzzy modal logic in [15] can be seen as an application of the possibility measures. However, the full power of possibilistic reasoning remains to be explored. Also, though it is well known that possibility theory can be seen as a special case of Dempster-Shafer theory, the former provides some simplicity over the latter in the representation of similarity relations. In this paper, it will be shown that possibilistic reasoning can enhance the uncertainty management capability of similarity-based IR models. On the one hand, possibility theory will be used in combination with Boolean models to accommodate possibilistic uncertainty. It is then shown that LUP can be interpreted in the possibilistic framework. On the other hand, due to the modal flavor of the possibility and necessity measures, it is easy to integrate possibilistic reasoning into description logic, so we propose a possibilistic description logic model for IR. This logic is a possibilistic extension of ALC [18]. In the rest of the paper, we first review some notions of possibility theory and description logic. Then we present the possibilistic extensions of the Boolean and description logic IR models in two respective sections. Finally, we conclude the paper with some remarks.
2 Preliminaries
2.1 Possibility Theory and Possibilistic Logic
Possibility theory was developed by Zadeh from fuzzy set theory [21]. Given a universe U, a possibility distribution on U is a function π : U → [0, 1]. In general, the normalization condition is required, i.e., $\sup_{u\in U}\pi(u) = 1$ must hold. Thus, π is the characteristic function of a fuzzy subset of U. Two measures on U can be derived from π. They are called possibility and necessity measures and
denoted by Π and N respectively. Formally, Π, N : 2^U → [0, 1] are defined as
$$\Pi(X) = \sup_{u\in X}\pi(u), \qquad N(X) = 1 - \Pi(\overline{X}),$$
where $\overline{X}$ is the complement of X with respect to U. In the IR application, the possibility distributions are in general induced from a similarity relation. Given a universe U, a similarity relation R : U × U → [0, 1] is a fuzzy relation on U satisfying, for all u, v ∈ U, (i) reflexivity (also called separation in [4]): R(u, v) = 1 iff u = v, and (ii) symmetry: R(u, v) = R(v, u). A binary operation ⊗ : [0, 1]² → [0, 1] is a t-norm if it is associative, commutative, and increasing in both places, and satisfies 1 ⊗ a = a and 0 ⊗ a = 0 for all a ∈ [0, 1]. Some well-known t-norms include the Gödel t-norm a ⊗ b = min(a, b), the product t-norm a ⊗ b = a · b, and the Łukasiewicz t-norm a ⊗ b = max(0, a + b − 1). A similarity relation is called a ⊗-similarity if it in addition satisfies ⊗-transitivity: R(u, v) ⊗ R(v, w) ≤ R(u, w) for all u, v, w ∈ U. For each u ∈ U, the fuzzy relation R induces a possibility distribution πu such that πu(v) = R(u, v) for all v ∈ U. The necessity and possibility measures corresponding to πu are denoted by Nu and Πu respectively.
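On a finite universe the supremum becomes a maximum, so the two measures induced by πu(v) = R(u, v) can be computed directly. The sketch below only illustrates the definitions, with made-up type choices (elements of the universe named by strings).

```java
import java.util.Map;
import java.util.Set;

final class PossibilityMeasures {
    // Pi_u(X) = sup_{v in X} pi_u(v); on a finite universe the sup is a max.
    static double possibility(Map<String, Double> piU, Set<String> x) {
        double p = 0.0;
        for (String v : x) p = Math.max(p, piU.getOrDefault(v, 0.0));
        return p;
    }

    // N_u(X) = 1 - Pi_u(complement of X with respect to the universe U).
    static double necessity(Map<String, Double> piU, Set<String> x, Set<String> universe) {
        double pComplement = 0.0;
        for (String v : universe) {
            if (!x.contains(v)) pComplement = Math.max(pComplement, piU.getOrDefault(v, 0.0));
        }
        return 1.0 - pComplement;
    }
}
```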
2.2 Description Logics
In this subsection, we introduce a description logic called ALC [18]. The alphabet of ALC consists of three disjoint sets, whose elements are called concept names, role names, and individual names respectively. The role terms of ALC are just role names and are denoted by R (sometimes with subscripts), and the concept terms are formed according to the following rule:
C ::= A | ⊤ | ⊥ | C ⊓ D | C ⊔ D | ¬C | ∀R : C | ∃R : C
where A is a metavariable for concept names, R for role terms, and C and D for concept terms. The wffs of ALC consist of terminological and assertional formulas. Their formation rules are as follows.
1. If C and D are concept terms, then C = D is a terminological formula.
2. If C is a concept term, R is a role term, and a, b are individual names, then R(a, b) and C(a) are assertional formulas.
The terminological formula C ⊓ ¬D = ⊥ is abbreviated as C ⊑ D. The Tarskian semantics of ALC is given by assigning sets to concept names and binary relations to role names. Formally, an interpretation for ALC is a pair I = (U, [| · |]), where U is a universe and [| · |] is an interpretation function which assigns to each concept name a subset of U, to each role name a subset of U × U, and to each individual name an element of U. The domain of [| · |] is extended to all concept terms by induction:
1. [|⊤|] = U and [|⊥|] = ∅.
2. [|¬C|] = U \ [|C|], [|C ⊓ D|] = [|C|] ∩ [|D|], and [|C ⊔ D|] = [|C|] ∪ [|D|].
3. [|∀R : C|] = {x | ∀y((x, y) ∈ [|R|] ⇒ y ∈ [|C|])}.
4. [|∃R : C|] = {x | ∃y((x, y) ∈ [|R|] ∧ y ∈ [|C|])}.
An interpretation I = (U, [| · |]) satisfies a wff
C = D ⇔ [|C|] = [|D|], R(a, b) ⇔ ([|a|], [|b|]) ∈ [|R|], C(a) ⇔ [|a|] ∈ [|C|].
If I satisfies a wff ϕ, it will be written as I |= ϕ. A set of wffs Σ is said to be satisfied by I, written as I |= Σ, if I satisfies each wff of Σ and Σ is satisfiable if it is satisfied by some I. A wff ϕ is an ALC-consequence of Σ, denoted by Σ |=ALC ϕ or simply Σ |= ϕ, iff for all interpretations I, I |= Σ implies I |= ϕ, and ϕ is ALC-valid if it is the ALC-consequence of ∅.
3 Possibilistic Reasoning in Boolean Models
An IR model in general consists of three components (D, Q, F), where D is a collection of documents, Q is the query language (i.e., the set of possible queries), and F : D × Q → O is a retrieval ranking function with values in a totally ordered set O. What differentiates the models is the representation of documents and queries and the definition of the retrieval ranking function. The models considered in this section have a logical representation for the documents and queries, and the retrieval ranking function is determined by the possibilistic reasoning mechanism.
3.1 Boolean Models with Complete Information
In Boolean models, we have a propositional query language. The set of index terms A is taken as the set of propositional symbols, and the wffs of the query language Q are formed from the index terms by the Boolean connectives ¬, ∧, and ∨. An interpretation is just a two-valued truth assignment d : A → {0, 1}, and the assignment can be extended to the whole set Q as usual. Let Ω be the set of all interpretations. In Boolean models with complete information, a document is just an interpretation, so D is a subset of Ω. In this model, a document d matches a query ϕ if d(ϕ) = 1, so the retrieval ranking function is completely determined by the satisfaction relation between interpretations and wffs. This retrieval ranking function is two-valued, so it in fact returns a yes/no answer instead of a ranked list. What possibilistic logic can do is improve the ranking capability of this retrieval ranking function. To use possibilistic reasoning, we assume that there exists a similarity relation on the set D. The similarity relation can be imposed externally or generated automatically. One approach to the automatic
generation of a similarity relation is to use Dalal's distance [2,11]. Let A be a finite set and let d1 and d2 be two documents; then Dalal's distance between d1 and d2 is the proportion of A on which d1 and d2 do not agree, i.e.,
$$\delta(d_1, d_2) = \frac{|\{p \in A : d_1(p) \neq d_2(p)\}|}{|A|}.$$
Thus, a similarity relation R on D can be defined by
$$R(d_1, d_2) = 1 - \delta(d_1, d_2) = \frac{|\{p \in A : d_1(p) = d_2(p)\}|}{|A|}.$$
Note that the similarity so defined is a Łukasiewicz t-norm similarity. As mentioned above, given a similarity relation R on D, we can induce a possibility distribution πd for each d ∈ D. The possibility distribution induces necessity and possibility measures on the set of interpretations. Since each query can be identified with its corresponding models, the necessity and possibility measures can be naturally extended to the set of queries. Thus we can further define an ordering ≻ϕ between documents according to the query ϕ:
d1 ≻ϕ d2 ⇔ Πd1(ϕ) > Πd2(ϕ), or Πd1(ϕ) = Πd2(ϕ) and Nd1(ϕ) > Nd2(ϕ).
In other words, the ranking function is defined as F : D × Q → [0, 1]², where [0, 1]² is ordered by the lexicographic ordering >lex, and for d ∈ D and ϕ ∈ Q, F(d, ϕ) = (Πd(ϕ), Nd(ϕ)). Then d1 ≻ϕ d2 iff F(d1, ϕ) >lex F(d2, ϕ). Because of the reflexivity of the similarity relation, each possibility distribution πd is normalized, so, according to possibilistic logic, we have Nd(ϕ) > 0 ⇒ Πd(ϕ) = 1. Thus the ordering d1 ≻ϕ d2 can be divided into two cases:
1. 1 > Πd1(ϕ) > Πd2(ϕ): thus Nd1(ϕ) = Nd2(ϕ) = 0. This means that neither d1 nor d2 satisfies ϕ; they are then ordered according to their nearness to ϕ, since Πd(ϕ) corresponds to the minimal distance (or maximal similarity) from d to the documents satisfying ϕ. This can be seen as an interpretation of LUP in the possibilistic framework, and in fact the same principle has been used in [11] in the case of Dalal's distance. However, we can further distinguish the documents satisfying ϕ by their distances to ¬ϕ, that is:
2. Nd1(ϕ) > Nd2(ϕ) > 0: thus Πd1(ϕ) = Πd2(ϕ) = 1. This means that both d1 and d2 have zero distance to ϕ, since they satisfy ϕ, by the reflexivity of the similarity relation. However, Nd(ϕ) = 1 − Πd(¬ϕ) measures their distance to ¬ϕ: the larger Nd(ϕ), the further d is from ¬ϕ. From the viewpoint of information need, this means that documents close to both ϕ and ¬ϕ may be ambiguous and should be considered to match the need less well. The use of the necessity measure improves precision but reduces recall, so it is particularly useful in meeting the challenge of information explosion.
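A minimal sketch of this ranking, assuming documents and interpretations are boolean index-term vectors and a query is supplied as its satisfaction test; it follows the definitions above (Dalal-based similarity, F(d, ϕ) = (Πd(ϕ), Nd(ϕ)), lexicographic comparison) but is not the authors' implementation.

```java
import java.util.List;
import java.util.function.Predicate;

final class PossibilisticRanking {
    // R(d1, d2) = 1 - delta(d1, d2) = |{p : d1(p) = d2(p)}| / |A|.
    static double similarity(boolean[] d1, boolean[] d2) {
        int agree = 0;
        for (int i = 0; i < d1.length; i++) if (d1[i] == d2[i]) agree++;
        return (double) agree / d1.length;
    }

    // F(d, phi) = (Pi_d(phi), N_d(phi)): possibility is the maximal similarity of d
    // to a model of phi; necessity is 1 minus its maximal similarity to a model of not-phi.
    static double[] rank(boolean[] d, List<boolean[]> interpretations, Predicate<boolean[]> phi) {
        double possPhi = 0.0, possNotPhi = 0.0;
        for (boolean[] omega : interpretations) {
            double s = similarity(d, omega);
            if (phi.test(omega)) possPhi = Math.max(possPhi, s);
            else possNotPhi = Math.max(possNotPhi, s);
        }
        return new double[] { possPhi, 1.0 - possNotPhi };
    }

    // d1 is preferred to d2 for phi iff F(d1, phi) >_lex F(d2, phi).
    static boolean preferred(double[] f1, double[] f2) {
        return f1[0] > f2[0] || (f1[0] == f2[0] && f1[1] > f2[1]);
    }
}
```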
In summary, a Boolean model with complete information is a tuple (Abc, Qbc, Ωbc, Dbc, Rbc, Fbc), where Abc is the set of index terms, Qbc the propositional language formed from Abc, Ωbc = 2^Abc the set of interpretations for Qbc, Dbc ⊆ Ωbc a set of documents, Rbc a similarity relation on Ωbc, and Fbc the ranking function defined above. Note that the domain of Rbc is extended to the whole of Ωbc in order to handle queries not satisfied by any document. For a query ϕ satisfiable in classical logic, if no documents meet its requirement, then Πd(ϕ) = Nd(ϕ) = 0 when Rbc is a similarity relation on Dbc only. However, by extending the domain of Rbc, we can order the documents in Dbc according to their distances to the interpretations satisfying ϕ but not in Dbc. Since Qbc and Ωbc are completely determined by Abc, the model is sometimes abbreviated as (Abc, Dbc, Rbc, Fbc).
3.2 Boolean Models with Incomplete Information
In Boolean models with incomplete information, only a partial description instead of complete information is given for each document, so the model is a tuple (Abi, Qbi, Ωbi, Dbi, Rbi, Fbi), where Abi, Qbi, Ωbi, and Rbi are as above; however, Dbi is now a subset of Qbi, since each document is described by a sentence of the propositional language. As for the ranking function Fbi, we have several choices at our disposal (a sketch follows this list).
1. Consider each possible interpretation ω of the document description. Let ψd ∈ Qbi be the description of document d and ϕ a query; then Fbi : Dbi × Qbi → [0, 1]² can be defined in two ways.
a) Optimistic way: Fbi∃(d, ϕ) = max>lex {Fbc(ω, ϕ) : ω(ψd) = 1}.
b) Pessimistic way: Fbi∀(d, ϕ) = min>lex {Fbc(ω, ϕ) : ω(ψd) = 1}.
2. According to the LUP, "a measure of the uncertainty of ψd → ϕ relative to a data set is determined by the minimal extent to which we have to add information to the data set, to establish the truth of ψd → ϕ"; however, what remain unspecified in the principle are the information measure and the implication →. In the possibilistic framework, the information measure is given by the pair of possibility and necessity measures induced from the similarity relation. Let us first consider the material implication ψd ⊃ ϕ =def ¬ψd ∨ ϕ. Let $ denote ∀ or ∃; then we again have two definitions of the ranking function, based on the optimistic or pessimistic way of looking at the document:
Fbi$⊃(d, ϕ) = Fbi$(d, ψd ⊃ ϕ).
3. In [10], it is shown that possibility theory can provide a natural semantics for conditional implication based on the Ramsey test. Essentially, given an interpretation ω, we can define an ordering >ω on the set Ωbi such that u >ω v iff πω(u) > πω(v). An ω-maximal model of a wff ϕ is an interpretation u satisfying ϕ such that no v satisfying ϕ has v >ω u. Thus we can define ω(ψd → ϕ) = 1 iff all ω-maximal models of ψd satisfy ϕ. In this way, each interpretation in Ωbi can also assign truth values to the wffs of conditional logic, so the possibility and necessity measures can also be extended to the conditional wffs. This results in the definition of a new form of ranking function:
Fbi$→(d, ϕ) = Fbi$(d, ψd → ϕ).
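Under the same assumptions as the earlier sketch, the optimistic and pessimistic rankings of item 1 reduce to taking the lexicographic maximum or minimum of the complete-information ranking over the models of ψd; the code below (reusing the PossibilisticRanking helpers from the earlier sketch) is only an illustrative reading of those definitions.

```java
import java.util.List;
import java.util.function.Predicate;

final class IncompleteInfoRanking {
    // Optimistic: the >_lex-maximum of F_bc(omega, phi) over the models omega of psi_d.
    static double[] optimistic(List<boolean[]> modelsOfPsiD, List<boolean[]> interpretations,
                               Predicate<boolean[]> phi) {
        double[] best = null;
        for (boolean[] omega : modelsOfPsiD) {
            double[] f = PossibilisticRanking.rank(omega, interpretations, phi);
            if (best == null || PossibilisticRanking.preferred(f, best)) best = f;
        }
        return best;
    }

    // Pessimistic: the >_lex-minimum over the same set of models.
    static double[] pessimistic(List<boolean[]> modelsOfPsiD, List<boolean[]> interpretations,
                                Predicate<boolean[]> phi) {
        double[] worst = null;
        for (boolean[] omega : modelsOfPsiD) {
            double[] f = PossibilisticRanking.rank(omega, interpretations, phi);
            if (worst == null || PossibilisticRanking.preferred(worst, f)) worst = f;
        }
        return worst;
    }
}
```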
4 Possibilistic Description Logic
In the IR applications of DLs, it has been shown that the instance checking problem is especially relevant [12,13,19]. In those applications, a set of DL wffs is called a document base, and the IR problem is to determine whether an individual i is an instance of a concept term C. Here a document base contains all descriptions of documents and thesaurus knowledge, an individual represents a document, and a concept term is just a query, so the problem amounts to checking whether a document meets the information need expressed by the query. What makes the DL-based approach advantageous is its capability to represent background knowledge (in particular, thesaurus knowledge) in the document base. However, classical DLs also lack the necessary uncertainty management mechanisms, so a probabilistic extension of DLs was provided in [19]. Though probabilistic DL is definitely a must in dealing with the uncertainty problem of DL-based IR, it does not utilize the similarity between individuals. Obviously, the uncertainty due to randomness and that due to fuzziness are two orthogonal forms of uncertainty and need separate formalisms for handling them. In the last section, we have seen that possibilistic reasoning is an appropriate tool for handling similarity-based reasoning in classical IR. In this section, we propose a possibilistic extension of ALC and show that it is appropriate for DL-based IR. The logic is called PALC. To represent individuals and concepts uniformly, we use the basic hybrid language proposed in [1]. Let A, i, and R be metavariables for concept names, individual names, and role names respectively, and let C and D stand for concept terms; then the formation rules of concept terms are as follows:
C ::= ⊤ | ⊥ | A | i | ¬C | C ⊓ D | C ⊔ D | ∀R : C | ∃R : C | [α]C | [α]⁺C | ⟨α⟩C | ⟨α⟩⁺C
where α ∈ [0, 1]. Note that an individual name is also a concept term; the intended meaning is to treat it as a singleton set. Thus we no longer need assertional formulas. The wffs of PALC are just terminological ones of the form C = D for concept terms C and D. The definition of C ⊑ D is as in ALC. For
convenience, we will write i : C or C(i) for i ⊑ C, and (i, j) : R or R(i, j) for i ⊑ ∃R : j. The new modalities [α], [α]⁺, ⟨α⟩, and ⟨α⟩⁺ quantify the necessity and possibility measures induced from a similarity relation. For example, an individual is in ⟨α⟩C iff it is similar to some element of C at least to the degree α. These modalities are also called numeral modalities. For the formal semantics, a PALC interpretation is a triple I = (U, [| · |], E), where E is a similarity relation on U and (U, [| · |]) is an ALC interpretation, except that [| · |] now assigns to each individual name a singleton subset instead of an element of U. Let Eα = {(u, v) : E(u, v) ≥ α} and Eα⁺ = {(u, v) : E(u, v) > α} denote the α-cut and strict α-cut of E respectively; then four additional rules (5–8), one for each numeral modality, are added to the interpretation of concept terms.
The definitions of satisfaction, validity, etc., are all the same as those for ALC, so the IR problem in the PALC framework is still the instance checking problem. For the application of PALC to IR problems, let us consider some examples.
Example 1 (Query relaxation). Let Σ be a document base in PALC and C a concept term in which the numeral modalities do not occur; then, for the query C, our problem is to find documents i such that Σ |= C(i). Sometimes, however, if C is too restrictive, it may not provide enough recall to meet the user's need. In this case, we may try to relax the query by using the concept term ⟨α⟩C for some α < 1. For example, when using an on-line hotel reservation system, the user may input a query C as follows:
near-train-station ⊓ ¬expensive ⊓ pet-allowed ⊓ ∃has-room-type.(single ⊓ non-smoking)
and the system consequently finds a hotel satisfying the requirement. Unfortunately, however, upon checking availability during the period the user wants, no rooms are available. In this case, the user may relax the query to ⟨0.8⟩C to find a hotel nearly satisfying his requirement.
Example 2 (Query restriction). On the other hand, sometimes the query term is too loose, so that there is too much recall. In this case, we may further require that [α]C be satisfied for some α > 0. Note that [α]C requires that the documents must not have similarity to elements of ¬C with degree exceeding α. Thus, the desired documents must not only be in C but also be far enough from ¬C.
Example 3 (Exemplar-based retrieval). In some cases, in particular for the retrieval of multimedia information, we may be given an exemplar or standard document and try to find documents very similar to the exemplar that also satisfy some additional properties. In this case, we can write the query term as ⟨α⟩i ⊓ C, where i is the name of the exemplar and C denotes the additional properties. According to the semantics, j : ⟨α⟩i will be satisfied by an interpretation I = (U, [| · |], E) iff E(aj, ai) ≥ α, where ai and aj are the elements of [|i|] and [|j|] respectively. Thus, a document j will meet the query if it satisfies the properties denoted by C and is similar to the exemplar at least to the degree α. The last example also suggests that we may have to specify the aspect on which the similarity is based. For example, we may require documents that are similar to the exemplar in style or in color. To model this situation, we should have more than one similarity relation and corresponding numeral modalities. However, this can be achieved by a straightforward generalization of PALC: for example, letting T denote a set of aspects, we can add to our language different modalities [α]t, etc., for all t ∈ T.
5 Concluding Remarks
We have presented some applications of possibilistic reasoning to IR problems. On the one hand, it can be used in combination with Boolean IR models to improve precision and provide a finer ranking of the retrieval results. On the other hand, it can easily be integrated into the DL-based approach to support IR tasks such as query relaxation, query restriction, and exemplar-based retrieval. The scope of these applications is limited to document collections endowed with some similarity relations. However, in most cases the similarity relation can be generated automatically from the document representation, though it can also be given externally by some experts. The automatic generation of the similarity relation may be time-consuming for a large collection of documents: it needs O(n²) time if the computation of the similarity degree between any two documents takes constant time. Fortunately, the generation process can be executed in advance when the collection is constructed, so it can be completed in a preprocessing phase.
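A minimal sketch of that preprocessing step, assuming some pairwise similarity function is available (the function and names below are illustrative, not from the paper):

    import itertools

    def precompute_similarity(docs, similarity):
        # O(n^2) pairs, computed once when the collection is constructed
        E = {}
        for i, j in itertools.combinations(range(len(docs)), 2):
            E[(i, j)] = similarity(docs[i], docs[j])
        return E

    # e.g. similarity could be cosine similarity between the term-weight vectors of two documents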
Extracting Temporal References to Assign Document Event-Time Periods*

D. Llidó¹, R. Berlanga¹, and M.J. Aramburu²

¹ Department of Languages and Computer Systems
² Department of Engineering and Science of Computers
Universitat Jaume I, E-12071, Castellón (Spain)
{dllido, berlanga, aramburu}@uji.es
Abstract. This paper presents a new approach for the automatic assignment of document event-time periods. This approach consists of extracting temporal information from document texts, and translating it into temporal expressions of a formal time model. From these expressions, we are able to approximately calculate the event-time periods of documents. The obtained event-time periods can be useful for both retrieving documents and finding relationships between them, and their inclusion in Information Retrieval Systems can produce significant improvements in their retrieval effectiveness.
1 Introduction Many documents tell us about events and topics that are associated with well-known time periods. For example, newspaper articles, medical reports and legal texts are documents that contain many temporal references, both for placing the occurrences in time and for relating them to other events. Clearly, using this temporal information can be helpful in retrieving documents as well as in discovering new relationships between document contents (e.g. [1] [2] and [3]). Current Information Retrieval Systems can only deal with the publication date of documents, which can be used in queries as a further search field. As an alternative approach, a new object-oriented document model, named TOODOR, is presented in [4]. In this model two time dimensions are considered: the publication date and the event-time period of documents. Furthermore, by means of its query language, called TDRL [5], it is possible to retrieve documents by specifying conditions on their contents, structure and time attributes. However, TOODOR assumes that the event-time period of a document is manually assigned by specialists, which is an important limitation. On the one hand, this task is subjective, as it depends on the reader's particular interpretation of the document texts. On the other hand, in applications where the flow of documents is too high, the manual assignment of event-time periods is impracticable. Consequently, it is
* This work has been funded by the Bancaixa project with contract number PI.1B2000-14 and the CICYT project with contract number TIC2000-1568-C03-02.
necessary to define an automatic method for extracting event-time periods from document contents. In this paper we present an approach to extracting temporal information from document contents, and its application to automatically assigning event-time periods to documents. Moreover, with this work we demonstrate the importance of these attributes in the retrieval of documents. The paper is organized as follows. Section 2 describes the semantic models on which the extraction system relies. Section 3 presents our approach to extracting temporal references from texts. Section 4 describes how event-time periods can be calculated with the extracted dates. Finally, Section 5 presents some conclusions.
2 Semantic Models This section describes the semantic models on which the proposed information extraction method relies; these are a representation model for documents and a time model for representing the temporal information extracted from texts. 2.1 Documents and Their Time Dimensions This work adopts the document model of TOODOR [4]. Under this model, complex documents are represented by means of object aggregation hierarchies. The main novelty of this model is that document objects have two associated time attributes, namely the publication time and the event time. The former indicates when the document was published, whereas the latter expresses the temporal coverage of the topics of the document. The publication time plays an important role in the extraction of temporal expressions, because several temporal sentences, such as "today" and "tomorrow", take it as their point of reference. The event-time period of a document must express the temporal coverage of the relevant events and topics reported by its contents. Since the relevance of a topic depends on the interpretation of the document contents, event-time periods are inherently indeterminate. As a general rule, we assume that the location of these periods will coincide approximately with the temporal references appearing in the document, where a temporal reference is either a date or a period mentioned in the document texts. In this way, event-time periods can be either extracted automatically from the texts or manually assigned by users. 2.2 Time Model Temporal sentences in natural language usually involve the use of calendar granularities. In particular, we can express time instants, intervals, and spans at several granularity levels. In this section we provide a time model that takes into consideration the time entities appearing in temporal sentences.
2.2.1 Granularities The proposed time model relies on the granularity system of Figure 1. From now on, we will denote each granularity of this system by a letter: day (d), week (w), month (m), quarter (q), semester (s), year (y), decade (x) and century (c). As shown in Figure 1, these granularities can be arranged according to the finer-than relationship, which is denoted with ≺ [6]. Note that, unlike in other time models of the literature, in written text it is usual to relate granularities that do not satisfy this relationship (e.g. "the first week of the year"). In Figure 1 they are represented with dashed lines.
Fig. 1. Granularity System
In our model, two types of granularity domains are distinguished, namely relative and absolute domains. A relative domain for a granularity g is defined in terms of another coarser granularity g′ (g ≺ g′), which is denoted with dom(g, g′). For instance, the domain of days relative to weeks is defined as dom(d, w) = {1,…,7}. Relative domains are always represented as finite subsets of the natural numbers. Thus, we will denote with first(g, g′) and last(g, g′) the first and last elements of the domain dom(g, g′) respectively. An absolute domain for a granularity g, denoted with dom(g), is always mapped onto integer numbers (e.g. centuries and years). Time models from the literature associate absolute domains to granularities (called ticks), defining over them the necessary mapping functions to express the finer-than relationship [6]. 2.2.2 Time Entities In this section, we define the time entities of our time model in terms of the granularity system described above. A time point is expressed as the following alternating sequence of granularities and natural numbers: T = g1 n1 g2 n2 ... gk nk. In this expression, if gi is a relative granularity then ni must belong to the domain dom(gi, gi−1) with 1 < i ≤ k; otherwise ni ∈ dom(gi). Consequently, the sequence of granularities must be ordered by the finer-than relationship, i.e. gi+1 ≺ gi with 1 ≤ i < k. From now on, the finest granularity of a time point T is denoted with gran(T).
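Before moving on to intervals and spans, here is a small sketch of how these granularity domains could be represented in code; apart from dom(d, w), which is the only domain given explicitly above, the table entries are nominal calendar values we assume for illustration.

    REL_DOM = {               # relative domains dom(g, g') as (first, last)
        ('d', 'w'): (1, 7),   # days per week, as in the text
        ('d', 'm'): (1, 31),  # the remaining entries are assumed nominal values
        ('w', 'm'): (1, 5),
        ('m', 'q'): (1, 3),
        ('m', 'y'): (1, 12),
        ('q', 'y'): (1, 4),
        ('s', 'y'): (1, 2),
        ('y', 'x'): (1, 10),
        ('x', 'c'): (1, 10),
    }

    def dom(g, coarser):
        lo, hi = REL_DOM[(g, coarser)]
        return range(lo, hi + 1)

    def first(g, coarser):
        return REL_DOM[(g, coarser)][0]

    def last(g, coarser):
        return REL_DOM[(g, coarser)][1]

    print(list(dom('d', 'w')))    # [1, 2, 3, 4, 5, 6, 7]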
A time interval is an anchored span of time that can be expressed with two time points having the same sequence of granularities: I = [T1, T2], where T1 = g1 n1 ... gk nk, T2 = g1 n′1 ... gk n′k and ni ≤ n′i for all 1 ≤ i ≤ k. We will use the functions start(I) and end(I) to denote the starting and end points of the interval I respectively. Besides, the finest granularity of I, denoted with gran(I), is defined as the finest granularity of its time points. Finally, a span of time is defined as an unanchored and directed interval of time. This is expressed as S = ± n1 g1 ... nk gk, where the sign (±) indicates the direction of the span (+ towards the future, − towards the past), ni (1 ≤ i ≤ k) are natural numbers, and the granularities gi with 1 ≤ i < k are ordered (i.e. gi+1 ≺ gi). 2.2.3 Operators This section describes the main operators that are used during the resolution of temporal sentences from the text. Firstly, we define the refinement of a point T = g1 n1 ... gk nk to a finer granularity g as follows: refine(T, g) = [T1, T2], where T1 = g1 n1 ... gk nk g first(g, gk) and T2 = g1 n1 ... gk nk g last(g, gk). Note that this operation can only be applied to granularities with relative domains. Similarly, we define the refinement of a time interval I to a finer granularity g (g ≺ gran(I)) as follows: refine(I, g) = [start(refine(start(I), g)), end(refine(end(I), g))]. Abstraction is the inverse operation of refinement. Applying it, any time entity can be abstracted to a coarser granularity. We will denote this operation with the function abstract(T, g), where g is a granularity that must be contained in T. This operation is performed by truncating the sequence of granularities up to the granularity g. For example, abstract(y2000m3d1, y) = y2000. Finally, the shift of a time point T = g1 n1 ... gk nk by a time span S = ± n g is defined as follows: shift(T, S) = g1 n′1 ... gi n′i gi+1 ni+1 ... gk nk, where gi = g and n′1 ... n′i are the new quantities associated with the granularities, obtained by adding n to ni and propagating the overflow to the coarser granularities. These are some examples: shift(y1999m3, +10m) = y2000m1, shift(y2001, -2y) = y1999, shift(y1998m2w2, -3w) = y1998m1w4.
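The following sketch implements time points and the shift/abstract operators over relative domains, and reproduces the shift examples above; the parsing format and helper names are ours, and only the domain entries needed here are repeated (note that 5 "weeks" per month is assumed so that the third example works out as in the text).

    import re

    REL_DOM = {('m', 'y'): (1, 12), ('w', 'm'): (1, 5), ('d', 'm'): (1, 31), ('d', 'w'): (1, 7)}

    def parse(s):
        # "y1999m3" -> [('y', 1999), ('m', 3)], coarsest granularity first
        return [(g, int(n)) for g, n in re.findall(r'([a-z])(\d+)', s)]

    def fmt(T):
        return ''.join('%s%d' % (g, n) for g, n in T)

    def abstract(T, g):
        i = next(k for k, (gr, _) in enumerate(T) if gr == g)
        return T[:i + 1]

    def shift(T, n, g):
        T = [list(p) for p in T]
        i = next(k for k, (gr, _) in enumerate(T) if gr == g)
        T[i][1] += n
        while i > 0:                               # propagate overflow/underflow upwards
            lo, hi = REL_DOM[(T[i][0], T[i - 1][0])]
            size = hi - lo + 1
            while T[i][1] > hi:
                T[i][1] -= size; T[i - 1][1] += 1
            while T[i][1] < lo:
                T[i][1] += size; T[i - 1][1] -= 1
            i -= 1
        return [tuple(p) for p in T]

    print(fmt(shift(parse('y1999m3'), +10, 'm')))   # y2000m1
    print(fmt(shift(parse('y2001'), -2, 'y')))      # y1999
    print(fmt(shift(parse('y1998m2w2'), -3, 'w')))  # y1998m1w4
    print(fmt(abstract(parse('y2000m3d1'), 'y')))   # y2000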
3 Temporal Information Extraction To calculate the event-time period of a document we apply a sequence of two modules. The first module, named date extraction module, first searches for temporal expressions in the document text, then extracts dates, and finally inserts XML tags with the extracted dates. Figure 2 shows an example of a tagged document. In our approach we use the tag TIMEX defined in [7], to which we have added the attribute VALUE to store the extracted dates. Figure 3 presents the different stages of the date extraction module.
    <News>
      <publicationtime>...</publicationtime>
      El Gobierno británico está decidido a impedir que la marcha de los unionistas de la Orden de
      Orange prevista para el <TIMEX TYPE="DATE" VALUE="...">próximo domingo</TIMEX>
    </News>
Fig. 2. Example of XML tagged document.
Regarding the second module, named event-time extraction module, it processes all the TIMEX tags of the document to approximately obtain its event-time period. This section focuses on describing how the first module works, whereas Section 4 describes the second module.
(Figure 3 pipeline: Document Segmentation turns documents into sentences; Identifying Simple Time Entities yields dates; Grouping Time Entities, using the granularities, yields points, intervals and lists; Temporal Reference Resolution produces the tagged documents. The stages are driven by patterns for extracting dates.)
Fig. 3. Stages of the Date Extraction Process.
In the date extraction module, the main problem to solve is similar to that of any natural language processing system, that is, ambiguity. This appears in several contexts:
• Syntactic ambiguity. We need to know which words belong to the same temporal expression. After testing several syntactic analysers, we have concluded that they are not able to identify whole phrases like "In May of this year". • Word sense disambiguation. We need to fix indefinite phrases like "in the last years", vague adverbial words like "now" and "recently", and references to events like "since the beginning of these negotiations". • Semantic ambiguity. We need to distinguish between temporal expressions that identify spans, intervals or dates. The approach we propose in this work consists of applying a shallow semantic-syntactic parser to extract temporal information. Similarly to Information Extraction systems [8], we begin with a lexical analysis that looks up words related to temporal
expressions in a dictionary (time granularities, days of the week, months, holidays, etc.), together with name recognition of standard date expressions. This is followed by a partial syntactic analysis of the sentences that contain these words, in order to search for more words that probably belong to the same temporal expression. Afterwards, the selected words are coded with their semantic meaning in terms of the formal time model. Finally, these codes are properly combined to obtain dates, intervals and time spans. The next section illustrates the grammatical elements necessary for all this process, and the following stages are described afterwards. 3.1 Grammatical Elements By analysing the range of temporal expressions in natural language, we have classified the words belonging to these expressions into the following categories, which give us the semantic information necessary to assign the corresponding date:
• Granularities, which are words that identify calendar granularities (e.g. "day", "month", "years", "semester", etc.). • Time head nouns, which are words closely related to the calendar granularities. Specifically, these words represent the granularities themselves and their synonyms (e.g. "journey"), the granularity values (e.g. "July", "Monday"), as well as relevant dates and periods like "Halloween night", "Christmas", "autumn", etc. • Quantifiers, which are the cardinal, ordinal and indefinite adjectives, as well as the roman numbers (e.g. "first", "second", "two", etc.). • Modifiers, which are words that can grammatically take part in a temporal expression. In this group we can find words for expressing intervals or periods like "during" and "between", words for indicating the temporal direction of spans like "past" and "next", and words for specifying a position within a time interval like "beginning" and "end". All these elements are always translated into codes representing their temporal meaning in the formal time model of Section 2.2. We use the notation e ⇒ c to denote the translation of a temporal expression e into its corresponding representation c in the formal model. This translation is performed as follows:
• Time head nouns are always encoded as time entities. For example, since Monday is the first day of the week, we encode it as "Monday" ⇒ "wd1". Other head nouns can be encoded as time intervals, for instance "autumn" ⇒ "[m…d…, m…d…]". • Quantifiers are all encoded as natural numbers. Additionally, ordinal and cardinal numbers must be distinguished in order to identify the time entity they are referring to. For instance, "first day" is encoded as "d1" (time point), whereas "two days" is encoded as "2d" (span). The order of a quantifier with respect to a granularity changes its meaning: for instance, we must distinguish between "day two" ⇒ "d2" (time point) and "two days" ⇒ "2d" (span). • Modifiers are used to express the direction of time spans, namely towards the past ('−'), towards the future ('+'), and at present time ('0'). For instance, "last Monday" is
encoded as "−ZG", and "next three days" as "+G". Besides, modifiers can also refer to both other time entities, denoted with the prefix U, and events, denoted with the prefix 5. For example, consider the following translations "that day" ⇒ "UG" and "two days before the agreement" ⇒ "5−G the agreement". 3.2 Date Extraction Module The basic structural unit in our document model is the paragraph. However in the extraction date module, as in most Information Extraction systems, it is necessary to split them into smaller units to extract complex temporal expressions. For this purpose, we make use of the usual separators of sentences (e.g. '!', '¡',. '?', '-', ':', etc.) Since some of these symbols are also used for other purposes such as numeric expressions, we need to define and apply a set of patterns to correctly split sentences. 3.2.1 Extraction of Dates During this stage, regular expressions are applied in order to extract basic temporal expressions for dates. These are common date formats (e.g. ?G^`??G^`??G^`) and relative temporal expressions referred to the publication date (e.g. "today", "this morning", "weekend", etc.). These regular expressions have been obtained by analysing the most frequent temporal sentences. 3.2.2 Identifying Simple Temporal Expressions In this stage all the sentences having temporal head nouns are analysed to extract simple time entities. Sometimes these head nouns appear in usual temporal expressions like "every Monday", "each weekend", "each morning", which do not denote any time entity of our model. To avoid misunderstandings on interpreting such expressions and improve the efficiency of the extraction process, a list of patterns for rejecting them has been defined. Once checked that a temporal expression does not match any of these patterns, the algorithm proceeds to search for modifiers and quantifiers in the head's adjacent words. As a result, the identified head and its modifiers/quantifiers are translated into a single time entity. 3.2.3 Grouping Simple Time Expressions Once the simple time entities from a sentence are extracted, we have to analyse them in order to detect if they are the components of a more complex time entity. Thus, this phase we must determine whether they constitute a single date (e.g. "May last year" ⇒ "\P−\"), a time interval (e.g. "from May to July" ⇒ "from P to P" ⇒ ">P P@"), a list of dates (e.g. "On Wednesday and Friday" ⇒ "on ZG and ZG" ⇒ "^ZG ZG`"), or two different expressions (e.g. "I won yesterday and you today"). Starting from a set of temporal expressions, we have defined a list of regular expressions for grouping simple time entities. For instance, the pattern 'IURP ?HQWLW\ WR ?HQWLW\' is used to identify a time interval. In this way, when a sentence contains several encoded time entities, the algorithm tries to apply these patterns to identify complex time entities.
3.2.4 Resolution of Temporal References Most of the identified time entities can be finally translated into concrete dates, which will be used by the event-time generator. More specifically, only those time entities that contain the granularities either of year or century are translated into dates. In this process we take into account the relationships and operations specified between time entities as well as the time references of the document. To perform these tasks, the system makes use of regular expressions as follows:
• If the sentence matches the pattern '\granularity[…]', the date (or interval date) is extracted by applying the refine operation to the temporal expression. Example: "y1999" ⇒ refine(y1999, d) = [y1999m1d1, y1999m12d31]. • If the sentence matches the pattern '(+|−)?\granularity\d', the date is extracted by applying the denoted shift operation to the publication date. If the shift sign is omitted, the system tries to determine it by using the tense of the verb within the same sentence. Example: "The meeting will be on Monday" ⇒ "The meeting will be on wd1". • If the sentence matches the pattern '(r|−)\granularity\d', the date is extracted by applying the denoted shift operation to the most recently cited date. • If the sentence matches the pattern '(r|−)\d\granularity', we proceed as before. The remaining cases are not currently analysed to extract concrete dates. However, their study can be of interest in order to extract further knowledge about events and their relationships. For instance, temporal expressions containing references to events, for example "R−2d the agreement", can be very useful to identify named events and their occurrences. However, this analysis will be carried out in future work.
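As an illustration of the shift-based resolution relative to the publication date, here is a small sketch for day and week spans only; the encodings, parameter names and the tense heuristic are simplified assumptions on our part.

    from datetime import date, timedelta

    SPAN_UNIT = {'d': lambda n: timedelta(days=n), 'w': lambda n: timedelta(weeks=n)}

    def resolve_span(span, publication_date, verb_tense=None):
        # span is an encoded expression such as '+3d' or '-2w'; if the sign is missing,
        # fall back on the tense of the verb in the sentence (future -> +, otherwise -)
        sign = {'+': 1, '-': -1}.get(span[0])
        if sign is None:
            sign = 1 if verb_tense == 'future' else -1
        else:
            span = span[1:]
        n, unit = int(span[:-1]), span[-1]
        # month/year spans would need the calendar-aware shift of Section 2.2.3 instead of timedelta
        return publication_date + sign * SPAN_UNIT[unit](n)

    print(resolve_span('+3d', date(2001, 9, 3)))           # 2001-09-06
    print(resolve_span('2d', date(2001, 9, 3), 'past'))    # 2001-09-01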
4 Generating Event-Time Periods In this section we describe the module in charge of analysing the extracted dates of each document, and of constructing the event-time period that covers its relevant topics. As in Information Retrieval models, we assume that the relevance of each extracted date is given by its frequency of appearance in the document (i.e. the TF factor). Thus, the most relevant date is considered as the reference time point of the whole document. If all dates have a similar relevance, the publication date is taken as the reference point. This approach differs from others in the literature, where the publication date is always taken as the reference time point. The algorithm for constructing the event-time period of a document groups all the consecutive dates that are located around the reference time point, and whose relevance is greater than a given threshold. Currently, both the date extraction module and the event-time generator have been implemented in the Python language. To perform the dictionary look-ups when solving temporal references, the date extraction module uses the TACAT system [9], which is implemented in Perl.
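A rough sketch of the event-time generator just described; the frequency threshold and the notion of "consecutive" used below are illustrative parameters, not values taken from the paper.

    from collections import Counter
    from datetime import date, timedelta

    def event_time_period(extracted_dates, publication_date, threshold=2, max_gap=timedelta(days=1)):
        # the most frequent extracted date is the reference point; if all dates are
        # equally relevant, the publication date is used instead
        if not extracted_dates:
            return None
        freq = Counter(extracted_dates)
        best, best_count = freq.most_common(1)[0]
        reference = best if best_count > min(freq.values()) else publication_date
        relevant = {d for d, c in freq.items() if c >= threshold} | {reference}
        # grow the period outwards from the reference while the dates stay consecutive
        start = end = reference
        for d in sorted((d for d in relevant if d < reference), reverse=True):
            if start - d <= max_gap:
                start = d
            else:
                break
        for d in sorted(d for d in relevant if d > reference):
            if d - end <= max_gap:
                end = d
            else:
                break
        return start, end

    dates = [date(2001, 9, 2)] * 3 + [date(2001, 9, 3)] * 2 + [date(2001, 8, 1)]
    print(event_time_period(dates, date(2001, 9, 4)))   # (2001-09-02, 2001-09-03)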
4.1 Preliminary Results To evaluate the performance of the date extraction module we have analysed four newspapers containing 1,634 time expressions. The overall precision (valid extracted dates / total extracted dates) of the evaluated set was 96.2 percent, while the overall recall (valid extracted dates / valid dates in the set) was 95.2 percent. Regarding the execution times, each news article is tagged in 0.1 seconds. These results, obtained on a dual Pentium III-600 MHz, are very satisfactory for our applications. To study the properties of the generated event-time periods, we have applied the extraction modules to 4,274 news articles. Then we have classified them into the following four classes: 1. Class A: news whose event-time periods contain the publication date and are smaller than three days. 2. Class B: news whose event-time periods do not contain the publication date and are smaller than three days. 3. Class C: news whose event-time periods are between four and fourteen days. 4. Class D: news whose event-time periods are greater than fourteen days.

Table 1. Classification of documents according to their event-time period.
    Class A   Class B   Class C   Class D
      21%       53%        9%       11%
The obtained results are given in Table 1. It is worth pointing out that nearly 6% of the articles have no event-time assigned. These cases are due to the lack of dates in the document contents. Moreover, around 42% of the articles contain dates located at least 14 days before or after the publication date. These dates are references to other past or future events, probably described in other newspaper articles. The extraction of these dates can be very useful to automatically link documents through their time references.
5 Related Work
The extraction of temporal information from texts is a recent research field within the Information Retrieval area. In [7] it has been shown that nearly 25% of the tagged tokens in documents are time entities, whereas nearly 31% of the tags correspond to person names. The relevance of temporal information is also demonstrated in [2], where the impact of time attributes on Information Retrieval systems is analysed. Extracting temporal information is also important in topic detection and tracking tasks. However, the methods proposed in the literature (e.g. [1]) use the publication date as the event time. The work presented in [2] tries to calculate event-time periods by grouping similar news located in consecutive publication dates. This approach can produce errors because an event is published one or more days after its occurrence. There are other works in the literature dedicated to automatically extracting dates from dialogues [11] and news [12]. The main limitation of these approaches is that only
absolute temporal expressions [7] are analyzed to extract dates. In [12], some simple relative expressions can also be analyzed by applying the tense of verbs to disambiguate them.
6 Conclusions In this paper a new method for extracting temporal references from texts has been presented. With this method, event-time periods can be calculated for documents, which can be used in turn for retrieving documents and discovering temporal relationships. The proposed method is based on the shallow parsing of natural language sentences containing time entities. These are translated into a formal time model where calculations can be performed to obtain concrete dates. Future work is focused on the automatic recognition of events by using the extracted dates and the chunks of text where they appear. Another interesting task consists of solving the temporal expressions that refer to other events.
References
1. J. Allan, R. Papka and V. Lavrenko. "On-Line New Event Detection and Tracking". 21st ACM SIGIR Conference, pp. 37-45, 1998. 2. R. Swan and J. Allan. "Extracting Significant Time Varying Features from Text". CIKM Conference, pp. 38-45, 1999. 3. R. Berlanga, M. J. Aramburu and F. Barber. "Discovering Temporal Relationships in Database of Newspapers". In Tasks and Methods in Applied Artificial Intelligence, LNAI 1416, Springer Verlag, 1998. 4. M. J. Aramburu and R. Berlanga. "Retrieval of Information from Temporal Document Databases". ECOOP Workshop on Object-Oriented Databases, Lisboa, 1999. 5. M. J. Aramburu and R. Berlanga. "A Retrieval Language for Historical Documents". 9th DEXA Conference, LNCS 1460, pp. 216-225, Springer Verlag, 1998. 6. C. Bettini et al. "A glossary of time granularity concepts". In Temporal Databases: Research and Practice, LNCS 1399, Springer-Verlag, 1998. 7. "The task definitions, Named Entity Recognition Task Definition", Version 1.4, http://www.itl.nist.gov/iad/894.01/tests/ie-er/er_99/doc/ne99_taskdef_v1_4.ps 8. R. Grishman. "Information Extraction: Techniques and Challenges". International Summer School SCIE-97, edited by Maria Teresa Pazienza, Springer-Verlag, pp. 10-27, 1997. 9. Castellón, M. Civit and J. Atserias. "Syntactic Parsing of Unrestricted Spanish Text". International Conference on Language Resources and Evaluation, Granada (Spain), 1998. 10. J. Wiebe et al. "An empirical approach to temporal reference resolution". Second Conference on Empirical Methods in Natural Language Processing, Providence, 1997. 11. M. Stede, S. Haas, U. Küssner. "Understanding and tracking temporal descriptions in dialogue". 4th Conference on Natural Language Processing, Frankfurt, 1998. 12. D.B. Koen and W. Bender. "Time frames: Temporal augmentation of the news". IBM Systems Journal, Vol. 39 (3/4), pp. 597-616, 2000.
Techniques and Tools for the Temporal Analysis of Retrieved Information

Rafael Berlanga¹, Juan Pérez, María José Aramburu², and Dolores Llidó¹

¹ Department of Languages and Computer Systems
² Department of Engineering and Science of Computers
Universitat Jaume I, Castellón, Spain
{berlanga, aramburu, dllido}@nuvol.uji.es
Abstract. In this paper we present a set of visual interfaces for querying newspaper databases with conditions on their contents, structure and temporal properties. Query results are presented in various interfaces designed to facilitate the reformulation of query conditions and the analysis of the temporal distribution of news. The group of techniques and tools described here has proven useful for the temporal analysis of information from documents in a way that current systems do not support.
The rest of the paper is organised as follows. Firstly, the underlying documents database and visual interfaces are briefly described. Sections 4 and 5 explain the techniques applied to analyse the evolution of topics and to evaluate temporal patterns. Section 6 explains how to calculate the relevance of retrieved documents, and section 7 how to implement all these techniques. Conclusions are in section 8.
2 Storage and Retrieval System
The techniques and tools presented in this paper have been developed over a document storage and retrieval system that contains a large amount of digital newspapers. This repository has been implemented by means of the Oracle database management system, with the Context tool for text management, and by following the approach presented in [5]. The data and query models adopted for this system were also presented in previous papers, being denoted TOODOR (Temporal Object-Oriented Document Organisation and Retrieval) [2] and TDRL (Temporal Document Retrieval Language) [3], respectively. In the TOODOR data model, each document has an assigned time period denoted event time, which expresses the temporal coverage of the relevant events and topics reported by its contents. This temporal attribute is very useful when retrieving information from the documents stored in the repository. Firstly, it allows retrieving with better precision the documents relevant to user queries. Secondly, it can be applied to the evaluation of temporal relationships between document contents, as for example cause-effect relationships and topic co-occurrence. Finally, it serves to analyse the evolution of the topics described in the documents of the repository. To perform these operations, TDRL provides a complete set of temporal predicates and operators, and a syntax based on OQL. Naturally, to be executed by Oracle, TDRL sentences must first be translated into equivalent SQL queries.
3 Interfaces for Information Retrieval
For users not trained to handle database query languages, the specification of sentences in TDRL can be very difficult, especially when applying its predicates to define temporal relationships between documents. Something similar happens when analysing query answers, because presenting query results as tabular raw data combining attributes, text and temporal information is of little help for extracting conclusions from them. Instead, it is preferable to specify queries in a graphical and intuitive interface that is easy to use for non-specialised users and can be adapted to many different kinds of query conditions, that is, without losing the expressiveness of TDRL. Similarly, the graphical presentation of processed query results would improve the analysis capacity of the users, while also facilitating the reformulation of queries until satisfactory results are obtained. Thus, the main objective of this work is to provide a set of interactive user interfaces to analyse the temporal evolution of the contents of a document repository in an intuitive and useful way.
3.1 Description of the Interfaces The interface for the definition of query variables is the initial one and is presented in Figure 1. It allows for the specification of conditions on the structure and contents of the documents in the repository, with each variable representing a set of documents that satisfy a group of conditions. In this component, it is also possible to define a temporal window for each variable, so that the documents are restricted to those published during those dates.
Fig. 1. Interface for the specification of initial query conditions
After specifying the variables of the query, the interface of Figure 2 can be used for defining the temporal relationships that will constitute the temporal pattern to analyse. The set of available temporal relationships is equivalent to the set of temporal predicates of TDRL, and it is also possible to specify the temporal granularity at which they should be evaluated. In this interface, each query variable is represented by an icon that has associated the number of documents satisfying the corresponding initial conditions. From the variables of the query, the user will be able to choose one of them as the objective of the query, that is, the set of documents to retrieve. This and the rest of the parameters of the query can be modified at any moment, so that it is possible to adjust it depending on the intermediate results, or to analyse these results from different perspectives. Each time these parameters are redefined, the number of documents associated with each variable varies dynamically.
Fig. 2. Interface for the specification of temporal patterns
Finally, in the interface of Figure 3, users can see the list of documents that instantiate the objective variable, ordered by relevance. When selecting one of them, its text is
visualised together with a histogram of the words that occur most frequently in it. In this interface, users can identify and select some topics to feed back into the initial query and refine the results.
Fig. 3. Visualisation of the texts and histogram of keywords
To visualise the query results in a format that facilitates the temporal analysis of information, users can apply the components in Figure 4, denoted respectively temporal histogram and chronicles. The first component shows a bar chart expressing the relevance of the required information in each span of time of the query temporal window. This relevance is calculated as a combination of the frequency and relevance of the documents found in that period. The second component represents the periods of time during which different occurrences of the event described in the query are happening. Each one of these time periods is calculated by grouping the consecutive event times of the documents in the answer, and therefore, this interface expresses the temporal distribution and frequency of events. Like before, the user can adjust the parameters of these interfaces as needed, and visualise the contents of the documents associated to a chart bar by clicking on it.
Fig. 4. Query results presented as temporal histograms and chronicles
3.2 Implementation Requirements The implementation of the interfaces previously described presents several requirements with respect to the processing and refinement of query results. They can be enumerated as follows: 1. The evaluation of the initial query conditions and the temporal pattern is a costly task that needs some preprocessing in order to optimise its execution. For this reason, it is necessary to design optimisation algorithms to decide the order of execution of the query conditions. 2. To generate the charts for the temporal histogram and the chronicles, it is necessary to design algorithms that group query results based on the documents' event-time periods and the query granularity. Furthermore, these algorithms should calculate the relevance of each group, considering in each case several parameters such as topic relevance or structural properties. 3. Users can redefine query parameters at any time, including changes in the granularity of queries or in their retrieval conditions. Therefore, it is necessary to design an execution scheme that refines query answers in an efficient way, that is, without re-evaluating the query completely. In the following sections the solutions that we have developed to satisfy these requirements are explained.
4 Analysis of Topic Evolution
In this section, the processing of query results for visualising the temporal evolution of document topics is described. This process applies temporal aggregation functions to group documents by their event times and, in this way, to draw the two presentations of Figure 4. The time model of TDRL [3] provides us with two mechanisms for performing these grouping operations: 1. Regular time partitions based on several time granularities (i.e. weeks, months, etc.). These are similar to those defined by the Group-By clause of TSQL2 [7]. 2. Irregular time partitions defined by documents with intersecting event times. These were denoted Chronicles in [3] and can be applied to analyse the periods of occurrence of the topics described by documents. 4.1 Elaboration of Histograms After evaluating a query, each regular time partition defined by the chosen granularity level has associated a possibly empty set of documents whose event-time periods fall into the partition. The degrees of relevance of these documents can be combined to calculate the relevance of the whole partition. More specifically, for a retrieval condition IRE, each retrieved document d has associated an index of relevance denoted rel(d, IRE). In our model, this index is evaluated by the usual TF-IDF factor [4] and the degree of structural relevance defined in Section 6. Given a finite and regular time line partition {Pi}i=1...k, each time interval Pi will have associated the following functions: docs(Pi, IRE) = {d | event-time(d) ∩ Pi ≠ ∅ ∧ rel(d, IRE) > 0}
sum(Pi, IRE) = Σ_{d ∈ docs(Pi, IRE)} rel(d, IRE), avg(Pi, IRE) = sum(Pi, IRE) / |docs(Pi, IRE)|. From these functions, we define the relevance of each partition as: rel(Pi, IRE) = α·sum(Pi, IRE) + (1−α)·(sum(Pi, IRE)·avg(Pi, IRE) / |docs(Pi, IRE)|). In other words, the relevance of each time partition is defined as the weighted sum of the relevance of the documents in the partition and a factor that considers the ratio between the sum of relevances, their average, and the number of documents in the partition. The α constant is calculated experimentally; in our experiments a value of 0.3 produces good results. The presentation of the function rel(Pi, IRE) (left hand side of Figure 4) as a histogram shows the relevance of each time partition with respect to the IRE retrieval condition. This is useful to analyse the distribution of the relevant documents along time. 4.2 Elaboration of Chronicles For the construction of chronicles, we start from a regular time line partition defined by choosing a granularity level. These partitions are applied to group the documents whose event-time periods intersect and that have a minimum level of relevance with respect to the query. Each of these groups of documents corresponds to a chronicle, and the algorithm that calculates them is presented in Figure 5. The algorithm takes two input parameters: the maximum number of empty time partitions that will be allowed between the documents of a chronicle (separation), and the minimum index of relevance (limit) that a partition must have to be considered as non-empty. The purpose of the limit parameter is to remove the documents that are not relevant for the query, whereas the separation parameter indicates the degree of tolerance to apply when building the chronicles. In this way, when the limit is increased, the information is more filtered, and when the separation is increased, the information is less fragmented. The right chart of Figure 4 shows the chronicles evaluated from the document distribution represented at the left.
Fig. 5. Algorithm for the elaboration of chronicles
78
5
R. Berlanga et al.
Evaluation of Temporal Patterns
In TDRL, a temporal pattern consists of a set of temporal relationships between the event-time periods of the documents in the query (see Figure 2). Here it is an example of TDRL sentence with a temporal pattern defined: select a from #.Finantial.#.Article as a, Column as b, Article as c where contains(a, ’agricultural subsidies’, 0.7) and contains(b, ’EEC meeting’, 0.8) and contains(c, ’agricultural agreement’, 0.8) and after-within(10 day, a.et, b.et) and intersects-within(3 day, b.et, c.et)
In a TDRL query, there are a set of query variables v1,...,vn, with some unary conditions over them c1,...,cn, and a possibly empty set of temporal relationships between them, Rij with 1 i n, 1 j n and ij. One of these variables is the objective of the query and denotes the set of documents to retrieve. Initially, a possible strategy of query evaluation could apply join operations over the query variables in the following way: (SELc1(v1) R1,2SELc2(v2) R2,3 SELc3(v3) R3,4 SELc4(v4) ...)
However, given that temporal relationships are not very restrictive, the number of tuples that results from the join operations is too large for an efficient execution. By this reason, a new scheme of optimisation must be designed to execute queries. After considering several alternatives, the best results were obtained with the application of semi-join operators by executing a chain of nested EXISTS as shown in Figure 6. As the order of nesting of the semi-joins modifies the total time of execution, it is important to elaborate a good strategy. In a query, each variable has associated a different group of unary conditions that produces a domain for the variable with a given cardinality. Our strategy consists of nesting more deeply those variables with a larger domain, evaluating the variables with smaller cardinalities in last term. In this way, the total time of execution is reduced. SELECT v1 FROM repository v1 WHERE c1 AND EXISTS ( SELECT v2 FROM repository v2 WHERE c2 AND r1,2 AND EXISTS (SELECT v4 FROM repository v4 WHERE c4 AND EXISTS ( SELECT v3 FROM repository v3 WHERE c3 AND r3,4 AND r2,3 ) ))
Fig. 6. Example of nesting of EXISTS clauses
Figure 7 summarises the proposed algorithm of optimisation for queries with a temporal pattern. In it, the objective variable is denoted by vobj, and the cardinality of the domain of a variable vi by ni. The Order(vi) operator returns the order of nesting assigned to the variable vi. Logically, the position of the objective variable is always the first. The Conjunction operator returns the and of the query conditions. Finally, by means of Cond[o], all the conditions over the variable with order o are represented. For variables with similar domains, the behavior of this algorithm may be unsatisfactory, given that the only criteria considered by Step 2 is the cardinality of the variables domains. The final algorithm introduces an additional parameter to take into account the temporal relationships defined between variables.
Step 1: Calculate the number of elements of the domain of each non-objective variable: ∀ vi : vi ≠ vobj, calculate ni in parallel. Step 2: Sort the variables and construct the operator Order: Order(vobj) ← 0; Order(vi ≠ vobj) ← increasing order of the variable vi in terms of its ni. Step 3: Construct the sets of conditions corresponding to each nesting level: Step 3.1: Initialise the sets of retrieval conditions associated to each variable: ∀ vi, Cond[o] ← {ci}, with o = Order(vi). Step 3.2: Iterative construction of the sets of conditions: For each variable vi such that vi ≠ vobj, and with o = Order(vi), taken in reverse order of Step 2: Step 3.2.1: ∀ vj : Order(vj) < o ∧ ∃ ri,j, Cond[o] ← Cond[o] ∪ {ri,j}; Step 3.2.2: Cond[o−1] ← Cond[o−1] ∪ {EXISTS SELECT vi FROM repository vi WHERE Conjunction(Cond[o])}; Step 4: Construct the final SQL sentence: SQL-sentence ← SELECT vobj FROM repository vobj WHERE Conjunction(Cond[0]);
Fig. 7. Algorithm of optimisation of queries with temporal patterns
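A compact sketch of Steps 1–4 as code, building the nested EXISTS chain of Fig. 6; the data structures (dicts for conditions, cardinalities and relationships) and the SQL string layout are our own choices.

    def build_query(objective, variables, conditions, cardinality, relations):
        # order: the objective first, then the remaining variables by increasing cardinality,
        # so that variables with larger domains end up nested more deeply
        others = sorted((v for v in variables if v != objective), key=lambda v: cardinality[v])
        order = [objective] + others
        cond = {v: [conditions[v]] for v in variables}
        for v in reversed(others):                  # innermost (largest domain) first
            o = order.index(v)
            for w in order[:o]:                     # Step 3.2.1: attach relationships with shallower variables
                r = relations.get((v, w)) or relations.get((w, v))
                if r:
                    cond[v].append(r)
            enclosing = order[o - 1]                # Step 3.2.2: nest inside the previous level
            cond[enclosing].append(
                'EXISTS (SELECT %s FROM repository %s WHERE %s)' % (v, v, ' AND '.join(cond[v])))
        return 'SELECT %s FROM repository %s WHERE %s' % (objective, objective, ' AND '.join(cond[objective]))

    print(build_query('v1', ['v1', 'v2', 'v3', 'v4'],
                      {'v1': 'c1', 'v2': 'c2', 'v3': 'c3', 'v4': 'c4'},
                      {'v1': 0, 'v2': 10, 'v3': 500, 'v4': 100},
                      {('v1', 'v2'): 'r1,2', ('v2', 'v3'): 'r2,3', ('v3', 'v4'): 'r3,4'}))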
6 Relevance of Retrieved Documents
In many applications with documents, the logical position of the retrieved elements must be considered to calculate their relevance for the query. For example, with newspapers, the topics that appear in the title, or in the first paragraphs, are more relevant than those that appear in any other part of the news. This section describes how the structural relevance of documents has been included in our query model, and how it can be combined with the relevance based in the frequency of terms (TF-IDF) to calculate a final relevance for each document. In our implementation of TDRL, each document element has associated a code, denoted Scode [5], that indicates its location in the database logical schema. More specifically, this code is a sequence of codified pairs (elem, order), that describes the schema path followed to insert the element. As it was explained in [5], it is possible to define a function of relevance for these codes as follows:
relevance_struc : {Scode} → [0, 100]. In our current approach, this function has been defined in this way:
relevance_struc(Scode) = Σ_{(elem, order) ∈ Scode} weight(elem) / (order + 1). The function weight returns a degree of relevance for each element of the database logical schema. Those elements considered more relevant in the context of an application (titles, keywords, etc.) will have assigned a higher degree of relevance. Each pair (elem, order) of a Scode will have a degree of relevance that depends on the type of the elem component. In the case of multi-valued elements, this relevance degree is modified depending on the order of the element. In this way, it is possible to assign a higher importance to the first paragraphs of the document, as required by many applications. Finally, to combine the structural relevance with the TF-IDF factor, the next formula is applied:
This formula has been obtained by experimentation in the newspapers field and, as can be seen, the structural relevance has a higher importance than the frequency-based factor. In other application areas, this ratio may vary.
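A small sketch of the structural relevance function defined above; since the combination formula with TF-IDF is not reproduced in the text, the linear mix and its weight in final_relevance are purely our assumption, as are the schema weights.

    def relevance_struc(scode, weight):
        # scode: sequence of (elem, order) pairs describing the path of the element in the schema
        return sum(weight[elem] / (order + 1) for elem, order in scode)

    def final_relevance(struct_rel, tfidf, a=0.7):
        # assumed linear combination; the paper only states that the structural part weighs more
        return a * struct_rel + (1 - a) * tfidf

    weights = {'title': 100, 'keyword': 80, 'paragraph': 40}            # illustrative schema weights
    print(relevance_struc([('title', 0), ('paragraph', 2)], weights))   # 100.0 + 13.33...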
7 Implementation
To implement the application presented in this paper, we have designed a three-tier architecture with the different components organised as Figure 8 shows. The upper layer contains the components that visualise the interfaces described in Section 3. The components that translate queries into SQL sentences are in the intermediate layer, together with the components that process query answers and generate the results visualised in the upper layer. This layer is also in charge of storing intermediate query results, thereby accelerating the re-execution of queries when their parameters are modified. Finally, in the lower layer is the Oracle database server. The application interfaces used to connect the three layers are also represented in the figure.
(Figure 8 components: SearchCondition, TemporalPattern, Histogram, Chronicle and TextVisual in the upper layer; DBSchema, Temporal, Serie and TextRetrieval in the intermediate layer, reached through RMI; the Oracle Server in the lower layer, accessed through JDBC.)
Fig. 8. Proposed multi-tier architecture
The main property of this architecture is that each of the three layers can be developed over independent and heterogeneous platforms. In this way, the database server and client processes can be executed apart from those of the intermediate layer, which are much more costly in time and space. Another interesting property of this architecture is that it offers independence with respect to the location and evolution of the database server. From the point of view of the client components, any changes in the server will be transparent and properly managed by the intermediate layer.
8 Conclusions In this paper we have presented a set of techniques and tools developed to help in the analysis of the temporal distribution of the happenings described by documents. In our solution we assume that each document has assigned an event-time attribute indicating the time of occurrence of the described happenings. At the moment we are also working on the development of techniques for extracting these attributes from documents texts automatically [6].
As related work, in [8] the TimeMines system is presented, which, starting from a repository of time-tagged news, generates timelines indicating the most important topics, how much coverage they receive, and their time spans. The purpose of timelines is similar to that of our chronicle generators, but they apply statistical methods and text mining techniques to extract knowledge about time-dependent stories. Our results are being applied to large repositories of newspapers, where users want to discover the relationships between the occurrences of pre-established happenings, to build the sequence of occurrences that has led to a given event, or simply to write the story of some topic. The interfaces presented have been implemented in Java and can be executed from a web browser over our newspapers database. A demo is available at http://www3.uji.es/~berlanga/Demos/demo.zip.
Acknowledgments. This work has been funded by the Bancaixa project with contract number PI.1B2000-14, and the CYCIT project with contract number TIC2000-1568C03-02.
References 1. Aramburu, M. and Berlanga, R.: An Approach to a Digital Library of Newspapers: Information Processing & Management, Pergamon Press, Vol. 33(5), pp. 645-661, 1997. 2. Aramburu, M. and Berlanga, R.: Metadata for a Digital Library of Historical Documents: Proceedings of the 8th International Conference on Database and Expert Systems Applications. Springer Verlag, LNCS 1308, pp 409-418, 1997. 3. Aramburu, M. and Berlanga, R.: A Retrieval Language for Historical Documents: Proceedings of the 9th International Conference on Database and Expert Systems Applications. Springer Verlag, LNCS 1460, pp. 216-225, 1998. 4. Baeza-Yates, R.: Modern information retrieval: Addison-Wesley Longman, 1999. 5. Berlanga, R., Aramburu, M. and Garcia, S.: Efficient Retrieval of Structured Documents from Object-Relational Databases: Proceedings of the 10th International Conference on Database and Expert Systems Applications, Springer Verlag, LNCS 1677, pp. 426-435, 1999. 6. Llidó, D., Berlanga, R. and Aramburu, M.: Extracting Temporal References to Assign Document Event-Time Periods: Proceedings of the 12th International Conference on Database and Expert Systems Applications, Springer Verlag, 2001. 7. Snodgrass, R.: The TSQL2 Temporal Query Language: Kluwer Academic Press, 1995. 8. Swan, R. and Jensen, D. : Automatic generation of overview timelines: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, pp. 49-56, 2000.
Page Classification for Meta-data Extraction from Digital Collections Francesca Cesarini, Marco Lastri, Simone Marinai, and Giovanni Soda Dipartimento di Sistemi e Informatica - Università di Firenze Via S. Marta, 3 - 50139 Firenze - Italy Tel: +39 055 4796361. {cesarini, lastri, simone, giovanni}@mcculloch.ing.unifi.it http://mcculloch.ing.unifi.it/~docproc
Abstract. Automatic extraction of meta-data from collections of scanned documents (books and journals) is a useful task in order to increase the accessibility of these digital collections. In order to improve the extraction of meta-data, the classification of the page layout into a set of pre-defined classes can be helpful. In this paper we describe a method for classifying document images on the basis of their physical layout, that is described by means of a hierarchical representation: the Modified X-Y tree. The Modified X-Y tree describes a document by means of a recursive segmentation by alternating horizontal and vertical cuts along either spaces or lines. Each internal node of the tree represents a separator (a space or a line), whereas leaves represent regions in the page or separating lines. The Modified X-Y tree is built starting from a symbolic description of the document, instead of dealing directly with the image. The tree is afterwards encoded into a fixed-size representation that takes into account occurrences of tree-patterns in the tree representing the page. Lastly, this feature vector is fed to an artificial neural network that is trained to classify document images. The system is applied to the classification of documents belonging to Digital Libraries, examples of classes taken into account for a journal are “title page”, “index”, “regular page”. Some tests of the system are made on a data-set of more than 600 pages belonging to a journal of the 19th Century.
1 Introduction
Meta-data are “data about data” and generally provide high level information about a set of data. In the field of Digital Libraries, appropriate meta-data allow users to effectively access digital material. When dealing with scanned books and journals three main categories of meta-data can be taken into account: administrative (e.g. the ISBN code of a publication), descriptive (e.g. the number of pages of a book), and structural (e.g. the title of a chapter). Whereas administrative and descriptive meta-data are frequently already available in electronic standard formats, or can be easily extracted from library cards, structural metadata can be computed from a digital book only after an accurate analysis of the content of the book. In order to automatically extract structural meta-data from a scanned book, document image analysis techniques can be taken into
account. A useful task for the automatic extraction of structural meta-data is page classification, which is appropriate both for extracting page-level meta-data and for narrowing the set of pages where to look for some meta-data. Page-level meta-data have a one-to-one correspondence with a physical page. Significant examples are the table of contents page, and pages containing pictures. Page classification can also be helpful for locating meta-data which appear only in some pages; for instance, identifying the title page can help to retrieve the title of a book. Page classification has been addressed with different objectives and methods. Most work concerned form classification methods that are aimed at selecting an appropriate reading method for each form to be processed [1,2]. Other approaches address the problem of grouping together similar documents in business environments, for instance separating business letters from technical papers [3]. In the last few years the classification of pages in journals and books has received more attention [4,5]. An important aspect of page classification is the set of features that are extracted from the page and used as input to the classifier. Sub-symbolic features, like the density of black pixels in a region, are computed directly from the image. Symbolic features, for instance the number of horizontal lines, are extracted from a segmentation of the image. Structural features (e.g. relationships between objects in the page) can be computed from a hierarchical description of the document. Textual features, for instance the presence of some keywords, are obtained from the text in the image recognized by an OCR (Optical Character Recognition) program. In this paper we describe a page classification system aimed at splitting pages (belonging to journals or monographs in Digital Libraries) on the basis of the type of page; the input is a structural representation of the page layout. Examples of classes taken into account are advertisement, first page, and index. The structural representation is based on the Modified X-Y tree, a hierarchical description of the page layout. The page is classified by using artificial neural networks (a multilayer perceptron trained with Back-propagation) working on an appropriate encoding of the Modified X-Y tree corresponding to the page. This page classifier is under development in the domain of the METAe European project1. METAe is focused on the semi-automatic extraction of structural meta-data from scanned documents of historical books and journals, in order to make the digital conversion of printed material more reliable in terms of digital preservation. Key components of the project are layout analysis, page classification, and specialized OCR for automatic meta-data extraction. The paper is organized as follows: in Section 2 we describe the structural representation of documents, and in Section 3 we analyze the proposed classification method. Experimental results are reported in Section 4, while conclusions are drawn in Section 5.
1 METAe: the Metadata engine. http://meta-e.uibk.ac.at
Fig. 1. Example of X-Y tree decomposition. In the upper-left part of the image we show the original page. The three images in the lower part describe the position of cuts at different levels of segmentation.
2 Document Layout Representation
The structure of the page is represented with a hierarchical representation (the Modified X-Y tree, MXY tree in the following) that is an extension of the classical X-Y tree representation. In this section, we first review the X-Y tree decomposition algorithm, and afterwards describe the MXY tree extension that is designed in order to deal with documents containing lines. Finally, the building of the MXY tree starting from a symbolic description of the page is analyzed.
2.1 The Modified X-Y Tree
The Modified X-Y tree [6] is an extension of the X-Y tree designed in order to deal with documents containing lines in their layout. The X-Y tree [7] is a top-down data-driven method for page layout analysis. The basic assumption behind the X-Y tree segmentation is that elements of the page (columns, paragraphs, figures) are generally laid out in rectangular blocks. Furthermore, the blocks can usually be grouped in such a way that blocks that are adjacent to one another within a group have one dimension in common. The method consists of using thresholded projection profiles in order to split the document into successively smaller rectangular blocks [8]. A projection profile is the histogram of the number of black pixels along parallel lines through the document (see Figure 3 for an example). Depending on the direction of the parallel lines the profile can be horizontal or vertical. To reduce the effects of noise, a thresholded projection profile is frequently considered. The blocks are split by alternately making horizontal and vertical “cuts” along white spaces which are found by using the thresholded projection profile. The splitting process is stopped when
Fig. 2. The MXY tree of a page. Dotted lines point to images of regions described in the corresponding nodes. VL (HL) denote Vertical (Horizontal) cutting Line; VS (HS) denote Vertical (Horizontal) cutting Space. Nodes with a line indicate leaves corresponding to line separators.
a cutting space (either horizontal or vertical) cannot be found or when the area of the current region is smaller than a pre-defined threshold. The result of such a segmentation can be represented in an X-Y tree, where the root corresponds to the whole page, the leaves correspond to blocks of the page, and each level alternately represents the results of horizontal (x cut) or vertical (y cut) segmentation. Figure 1 contains an example of a page segmented into blocks and the corresponding X-Y tree representation. Two improvements to this approach have been proposed in the literature. The lossless optimization proposed in [9] is based on the consideration that it is sufficient to perform the projections only up to the threshold Tp. In [10], projection profiles are obtained by using bounding boxes of connected components instead of single pixels in order to reduce the computational cost of calculating the projection profile. This method is tightly related to the symbolic extraction of the MXY tree that we propose in Section 2.2. When dealing with documents containing lines, the X-Y tree algorithm can give rise to uneven segmentations because of the presence of regions delimited by lines. The MXY tree extends the basic X-Y tree approach by taking into account the splitting of regions into sub-parts by means of cuts along horizontal and vertical lines, in addition to the classical cuts along white spaces. Each node of an MXY tree is associated either with a region of the page or with a horizontal or vertical line. In particular, internal nodes can have four labels (corresponding to two cutting directions and two cutting ways), and leaves can have four labels. Figure 2 shows an example of a page with the corresponding MXY tree.
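To make the recursive cutting step concrete, the following is a minimal Python sketch of the X-Y cut on a binary page image; the NumPy-based representation, the threshold values, and the stopping area are illustrative assumptions, not the parameters of the system described here.

import numpy as np

def _runs(mask):
    """[start, end) index pairs of contiguous True runs in a boolean array."""
    out, start = [], None
    for i, v in enumerate(mask):
        if v and start is None:
            start = i
        if not v and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(mask)))
    return out

def xy_cut(page, box, horizontal=True, profile_thr=2, min_area=2500):
    """Recursive X-Y cut of a binary page (1 = black pixel).

    box = (x0, y0, x1, y1); cuts alternate between horizontal and vertical
    white gaps found in the thresholded projection profile. Returns the
    leaf blocks as (x0, y0, x1, y1) tuples."""
    x0, y0, x1, y1 = box
    region = page[y0:y1, x0:x1]
    if region.size == 0 or region.size < min_area:
        return [box]

    # Projection profile: black-pixel count per row (horizontal cut)
    # or per column (vertical cut), thresholded to suppress noise.
    profile = region.sum(axis=1 if horizontal else 0)
    stripes = _runs(profile > profile_thr)

    # Stop when no cutting space is found (a full implementation would
    # also try the other direction before giving up).
    if len(stripes) <= 1:
        return [box]

    blocks = []
    for s, e in stripes:
        child = (x0, y0 + s, x1, y0 + e) if horizontal else (x0 + s, y0, x0 + e, y1)
        blocks.extend(xy_cut(page, child, not horizontal, profile_thr, min_area))
    return blocks

# Example: blocks = xy_cut(binary_image, (0, 0, page_width, page_height))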
2.2 Symbolic Building of Modified X-Y Tree
When using the X-Y tree (and also the MXY tree) for document segmentation, the purpose is to extract the blocks composing the page, and the algorithm is applied directly to the document image. To this purpose, appropriate algorithms
Fig. 3. Two approaches for computing the projection profile of textual regions. Left: the classic method, which computes the profile directly from the image. Right: the profile is computed taking into account a uniform contribution for each block.
must be considered for the extraction and analysis of the projection profile, and for the location of separating lines (Section 2.1). However, the MXY tree data structure can also be used to hierarchically represent the layout of the page, and this representation is helpful for understanding the meaning of items in the page, and also for page classification. In order to build an MXY representation of a document already split into its constituent blocks (e.g. provided by a commercial OCR), we developed an algorithm for the symbolic extraction of the MXY tree of a segmented document. Another advantage of this algorithm is the possibility of integrating it with other approaches (e.g. bottom-up methods) that are less sensitive to the skew of the page, but which provide less structured representations of the page. The input to the algorithm is a list of rectangular regions (corresponding to the objects in the page), and the list of horizontal and vertical lines. Since the input format is quite simple, various segmentation algorithms can easily be adapted to this algorithm. The page classifier that we describe in this paper (Section 3) was integrated with a commercial OCR that is able to locate regions corresponding to text and regions corresponding to images. Since horizontal and vertical lines are not provided by the OCR package, we look for them in zones of the image not covered by regions found by the OCR. Moreover, in order to locate segmentation points corresponding to horizontal and vertical white spaces, we compute an approximate projection profile (Figure 3). This profile is computed by considering a uniform contribution from each region extracted by the OCR in both the horizontal and the vertical direction. The amount of contribution to the profile depends on the average number of black pixels in each region, and this value can be either computed directly from the image or estimated on the basis of the number of characters in the region. A side effect of this approach is that noise in the image (not included in segmented regions) does not affect the MXY tree building. This approach is similar to the use of connected components for computing profiles [10] described in Section 2.1. The main difference is that in our approach we use whole regions instead of connected components, and the contribution to the projection profile is related to the density of the region.
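The approximate, symbolic profile can be sketched as follows in Python; the (x0, y0, x1, y1, density) region format is an assumption made for the example and does not correspond to the output format of any particular OCR package.

def approximate_profile(regions, length, horizontal=True):
    """Approximate projection profile built from region boxes alone.

    regions: list of (x0, y0, x1, y1, density) tuples, where density is the
             estimated average amount of black pixels in the region.
    length:  page height (horizontal profile) or width (vertical profile).
    Each region contributes uniformly over the rows (or columns) it spans,
    so the image itself is never touched and isolated noise is ignored."""
    profile = [0.0] * length
    for x0, y0, x1, y1, density in regions:
        lo, hi = (y0, y1) if horizontal else (x0, x1)
        for i in range(max(lo, 0), min(hi, length)):
            profile[i] += density
    return profile

# White gaps (candidate cutting spaces) are positions whose value stays
# below a threshold, exactly as with the pixel-based profile.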
Fig. 4. A common subtree between the MXY trees of two pages of the same class.
3 Page Classification
Page classification is performed with a sequence of operations aimed at encoding the hierarchical structure of the page into a fixed-size feature vector. MXY trees are coded into a fixed-size representation that takes into account the occurrences of some specific tree-patterns in the tree corresponding to each document image. Lastly, this feature vector is fed to an MLP that is trained to classify document images according to the labels assigned to the training data. Most classifiers (e.g. decision trees and neural networks) require a fixed-size feature vector as input. Some approaches have been considered for mapping a graph-based representation into a fixed-size vector. One approach (e.g. [11]) is based on the assignment of some pre-defined slots of the vector to each node and edge of the graph. This approach is appropriate when the maximum size of the graph is bounded, and when a robust ordering algorithm for nodes and edges is available. Another method is based on generalized N-grams [12]: the tree structure of logical documents is represented by probabilities of local tree node patterns similar to mono-dimensional N-grams, which are generalized in order to deal with trees. The generalization is obtained by considering “vertical” N-grams (describing ancestor and child relations) in addition to the more usual “horizontal” N-grams (corresponding to sibling relations). In this paper, we use an encoding method for the classification of trees describing the page layout. The basic idea behind this coding is the observation that similar layout structures often have similar sub-trees in the corresponding MXY representation (Figure 4). In real cases, because of noise and content variability, we cannot expect to find exactly the same sub-tree in all the trees of a given class. For instance, a block of text can sometimes be split into two or more sub-parts in other documents. Due to this size variability of the common sub-trees, we describe each tree by counting the occurrences of some tree-patterns composed of three nodes. This approach is somehow similar to generalized N-grams [12]. The main difference with respect to generalized N-grams is that the tree-patterns considered are composed of three nodes connected to each other by a path in the tree. On the contrary, generalized N-grams also include patterns made of three siblings without taking their parent into account. Trees composed of three nodes can have two basic structures: one composed of a root and two children (referred to as a balanced tree-pattern), and one composed of a root, a child, and a child of the second node. Four labels can be assigned to each internal node: HS, VS (for cuts along spaces), HL, VL (for cuts along lines). Each leaf can have four labels: hl (horizontal line), vl (vertical line), T
Fig. 5. A simple MXY tree; in the right part of the figure we show the balanced tree patterns in the tree, with the corresponding occurrences. Non adjacent nodes are considered in the pattern having HS as root and VS as leaves.
(text region), and I (image). Leaves of tree-patterns can correspond either to a leaf of the MXY tree or to an internal node; consequently, internal nodes of a tree-pattern can have four values, whereas leaves can have eight values. Taking into account all the combinations of labels, 512 possible tree-patterns can be defined. Special care is required for the balanced tree-patterns. Since siblings in the MXY tree can be ordered according to their horizontal or vertical position (depending on the cutting direction described in their parent), the relative position between contiguous blocks is preserved in this description. However, due to noise (or simply variable layouts of the documents), one sub-tree can differ from the reference one only by a node that is inserted between two representative siblings. In order to overcome this problem, when computing the tree-patterns appearing in an MXY tree, we also look for non-adjacent children (Figure 5), and this is another difference with respect to generalized N-grams. The encoding just described takes into account only discrete attributes in the nodes of the tree. To also consider some information about the size of the regions described in the tree nodes, we added four values to the feature vector that take into account the size of textual blocks belonging to the same tree-pattern. Textual blocks are labeled as “small” or “big” depending on the ratio of their area with respect to the area of the page. Blocks with an area lower than a fixed threshold are labeled as “small”, whereas larger blocks are labeled as “big”. Therefore each tree-pattern containing textual leaves can belong to one of four classes according to the possible combinations of size labels of the leaves. The four features bringing size information are obtained by computing the relative distribution of each of the four combinations in the MXY tree. The addition of these features provides increased classification performance, as discussed in the following section. After extracting the vectorial representation of the MXY tree corresponding to a document, various algorithms can be taken into account for the actual classification. In this paper we addressed the problem with a classical MLP-based classifier (trained with the Backpropagation algorithm), which takes the normalized feature vector as input, whereas the outputs describe, with a one-hot coding, the membership of each pattern. One problem with such an approach (which is common to other classification methods) is the large size of the feature vector, since many combinations of node labels can be considered.
Fig. 6. Examples of the classes considered in our experiments. From left to right: advertisement, first page, index, receipts, regular.
From a practical point of view, we can easily find that only a few tree-patterns actually occur in the documents of a given data set, as we will analyze in the next section.
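As an illustration of the encoding step, the following Python sketch counts three-node tree-patterns (chains and balanced patterns, including non-adjacent children) in an MXY tree and maps the counts onto a fixed-size vector covering the 512 possible label combinations; the Node class and the enumeration are simplified assumptions, not the exact data structures used by the authors.

from collections import Counter
from itertools import combinations, product

INTERNAL = ["HS", "VS", "HL", "VL"]            # cutting space / cutting line
LEAF = INTERNAL + ["hl", "vl", "T", "I"]       # pattern leaves may also be line/text/image

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def tree_patterns(root):
    """Count three-node patterns: chains (root, child, grandchild) and
    balanced patterns (root with two children, non-adjacent pairs included)."""
    counts = Counter()
    stack = [root]
    while stack:
        n = stack.pop()
        for c in n.children:
            stack.append(c)
            for g in c.children:                              # chain pattern
                counts[("chain", n.label, c.label, g.label)] += 1
        for a, b in combinations(n.children, 2):              # balanced pattern,
            counts[("bal", n.label, a.label, b.label)] += 1   # adjacency not required
    return counts

# A fixed ordering of all 512 possible patterns yields the fixed-size vector.
ALL_PATTERNS = [("chain", r, a, b) for r, a, b in product(INTERNAL, LEAF, LEAF)] + \
               [("bal", r, a, b) for r, a, b in product(INTERNAL, LEAF, LEAF)]

def encode(root):
    counts = tree_patterns(root)
    return [counts.get(p, 0) for p in ALL_PATTERNS]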
4 Experimental Results
We made a set of experiments in order to evaluate the improvements in classification that can be achieved by considering the information about the relative size of textual blocks, and by using non-adjacent leaves when computing occurrences of balanced tree-patterns. Moreover, we analyzed the results that can be achieved using few patterns in the training set. The experiments are made with a data-set of pages belonging to a historical journal, the American Missionary, that is available in the on-line Digital Library Making of America2. We considered five classes having different layouts and appearing in each issue of the journal. Samples of the 5 classes are shown in Figure 6. Some classes have a very stable layout (e.g. the first page and the index), whereas other classes have a more variable layout (e.g. the advertisement class) and give rise to most errors. For each experiment the documents are split into two sets: one is used for training, and the other is considered for testing purposes. The training set was further divided into three sub-sets in order to perform a three-fold cross validation that allowed us to find the optimal number of epochs required for MLP training. A simple feature selection step was performed by removing from the feature vectors all the items that never appear in the training set. In this way we used a feature vector containing only 177 elements, instead of the 512 possible combinations of labels assigned to nodes. The classification results obtained with an MLP having 177 inputs, 20 hidden nodes, and 5 outputs are summarized in Table 1. A pattern is rejected when the difference between the highest MLP output and the next one is lower than 0.2 (the outputs are in the range [0,1]). As described in Section 3, in order to take into account the size of textual blocks, we added four block size features to the basic features. First, we selected the most appropriate threshold (that discriminates between “small” and “big”
2 Document images can be downloaded from the web site of the collection: http://cdl.library.cornell.edu/moa/.
Table 1. Confusion table of the test set, when using the basic features.

                         Output class
True class     adv   first page   index   receipts   regular   Reject
adv             37        0          1         6         5         2
first page       0       56          0         0         0         3
index            0        0         53         0         0         0
receipts         0        0          0        57         1         1
regular          0        1          0         2        79         0
Table 2. Confusion table of the test set, when adding the textual block size features considering a threshold of 28 %.

                         Output class
True class     adv   first page   index   receipts   regular   Reject
adv             36        0          0         6         4         5
first page       0       58          1         0         0         0
index            0        0         52         0         0         1
receipts         1        0          0        56         1         1
regular          0        0          0         1        79         3
blocks) by evaluating the performance obtained with different values of this threshold. From this experiment we selected a threshold value of 28 % as optimal (the corresponding confusion table is reported in Table 2). Comparing Table 2 with Table 1, we can see that a lower error rate is achieved when introducing the information about the block area. Another experiment was performed in order to evaluate the gain that can be achieved when considering tree-patterns generated from non-adjacent siblings. In this experiment we generated feature vectors considering only adjacent siblings. Also in this case the threshold for the size selection of blocks was 28 %, and we obtained an error rate of 6.8 %, which is higher than the 4.7 % achieved when considering non-adjacent siblings. The last experiment concerns an empirical analysis of the requirements of the proposed method in terms of the number of training samples (Table 3). From this experiment we can see that even with few training patterns the performance is not excessively degraded.
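The rejection rule used in these experiments (reject when the two highest network outputs differ by less than 0.2) is straightforward to express on top of any MLP implementation; the sketch below uses scikit-learn's MLPClassifier as a stand-in for the Backpropagation-trained network, so the solver and training settings are illustrative assumptions rather than the original setup.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Shapes matching the text: 177 features, 5 page classes, 20 hidden units.
N_FEATURES, N_CLASSES, REJECT_MARGIN = 177, 5, 0.2

def train_classifier(X_train, y_train):
    """Train an MLP with one hidden layer of 20 units (a stand-in for the
    Backpropagation-trained network described above)."""
    clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
    clf.fit(X_train, y_train)
    return clf

def classify_with_reject(clf, x):
    """Return the predicted class, or None when the two highest outputs
    are closer than REJECT_MARGIN (the rejection rule of Tables 1 and 2)."""
    scores = clf.predict_proba(np.asarray(x).reshape(1, -1))[0]
    ordered = np.sort(scores)
    if ordered[-1] - ordered[-2] < REJECT_MARGIN:
        return None                      # rejected page
    return clf.classes_[int(np.argmax(scores))]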
5 Conclusions
We propose a method for the classification of document images belonging to Digital Libraries that can be useful for the automatic extraction of structural meta-data. The method is based on a vectorial encoding of the MXY tree representing the document image. Each item in the feature vector describes the occurrences of some tree-patterns in the tree corresponding to the document. After an extensive test on a data-base of more than 600 pages we can conclude that an encoding taking into account non-contiguous siblings (and that uses information on the relative size of textual siblings) is appropriate; moreover, with this approach we are able to obtain reasonable performance also when dealing
Table 3. Classification error versus number of training samples. Each value corresponds to the average of 10 tests obtained by randomly selecting the corresponding number of training samples. The test set is fixed and is composed of 300 samples different from those taken into account for training.

Number of training samples    30    60    90   120   150   180   210   240   270   300
Error (%)                   17.4  10.9   9.2   9.5   7.1   6.5   5.7   5.4   5.3   4.7
with few training samples. Future work will address the use of other feature selection approaches and tests on other kinds of documents. Moreover, other classifiers will be taken into account in place of the MLP-based classifier considered in this paper. We would like to thank Oya Y. Rieger from Cornell University for her help in collecting the data used in our experiments.
References
1. S. L. Taylor, R. Fritzson, and J. Pastor, “Extraction of data from preprinted forms,” Machine Vision and Applications, vol. 5, no. 5, pp. 211–222, 1992.
2. Y. Ishitani, “Flexible and robust model matching based on association graph for form image understanding,” Pattern Analysis and Applications, vol. 3, no. 2, pp. 104–119, 2000.
3. A. Dengel and F. Dubiel, “Clustering and classification of document structure - a machine learning approach,” in Proceedings of the Third International Conference on Document Analysis and Recognition, pp. 587–591, 1995.
4. J. Hu, R. Kashi, and G. Wilfong, “Document image layout comparison and classification,” in Proceedings of the Fifth International Conference on Document Analysis and Recognition, pp. 285–288, 1999.
5. C. Shin and D. Doermann, “Classification of document page images based on visual similarity of layout structures,” in SPIE 2000, pp. 182–190, 2000.
6. F. Cesarini, M. Gori, S. Marinai, and G. Soda, “Structured document segmentation and representation by the modified X-Y tree,” in Proceedings of the Fifth International Conference on Document Analysis and Recognition, pp. 563–566, 1999.
7. G. Nagy and S. Seth, “Hierarchical representation of optically scanned documents,” in Proceedings of the International Conference on Pattern Recognition, pp. 347–349, 1984.
8. G. Nagy and M. Viswanathan, “Dual representation of segmented technical documents,” in Proceedings of the First International Conference on Document Analysis and Recognition, pp. 141–151, 1991.
9. T. M. Ha and H. Bunke, “Model-based analysis and understanding of check forms,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 8, no. 5, pp. 1053–1081, 1994.
10. J. Ha, R. Haralick, and I. Phillips, “Recursive X-Y cut using bounding boxes of connected components,” in Proceedings of the Third International Conference on Document Analysis and Recognition, pp. 952–955, 1995.
11. A. Amin, H. Alsadoun, and S. Fischer, “Hand-printed arabic character recognition system using an artificial network,” Pattern Recognition, vol. 29, no. 4, pp. 663–675, 1996.
12. R. Brugger, A. Zramdini, and R. Ingold, “Modeling documents for structure recognition using generalized N-grams,” in Proceedings of the Fourth International Conference on Document Analysis and Recognition, pp. 56–60, 1997.
A New Conceptual Graph Formalism Adapted for Multilingual Information Retrieval Purposes Catherine Roussey, Sylvie Calabretto, and Jean-Marie Pinon LISI, INSA of Lyon, 20 Avenue A. Einstein 69621 VILLEURBANNE Cedex, FRANCE {croussey,cala,pinon}@lisi.insa-lyon.fr
Abstract. In this paper, a graph formalism is proposed to describe the semantics of documents in a multilingual context. This formalism is an extension of the Sowa formalism of conceptual graphs [4] in which two new concepts are added: vocabulary and term. Based on recent works, we propose a new comparison operator between graphs that takes the specific needs of information retrieval into account. This operator is the core of the comparison function used in our multilingual documentary system, called SyDoM. SyDoM manages XML documents for virtual libraries. An English collection of articles has been used to evaluate SyDoM. This first evaluation gives better results than a traditional Boolean documentary system. Keywords. Digital libraries, information retrieval, knowledge engineering, information modeling, conceptual graph, multilingual information retrieval system
1 Introduction
The emergence of web applications has deeply transformed the access to information. In particular, document exchange between countries is facilitated. Consequently, document collections contain documents written in various languages. Thanks to this technical revolution, libraries became digital libraries able to manage multilingual collections of documents, and Information Retrieval (IR) systems retrieve documents written in different languages. To take the multilingual aspect of such collections into account, it is necessary to improve the representation of documents. In a multilingual context, terms are no longer sufficient to express the document contents. It is thus necessary to work on elements more significant than terms, namely "concepts". Moreover, according to recent works [2], the semantics of indices have to be enhanced. A solution is to transform the usual list of keywords into a more complex indexing structure, in which relations link concepts. That is the reason why the Sowa formalism of Conceptual Graphs (CG) [4] is chosen to express the document contents. Nevertheless, one of the drawbacks of IR systems based on CGs is the time-consuming effort to carry out a retrieval process. Moreover, the CG matching function produces a lot of
silence1, which decreases the recall rate. In this article, an adaptation of the CG formalism is proposed in order to improve the retrieval effectiveness of our system. First of all, the principles of the CG formalism as used in IR systems are presented. Afterwards, we propose the semantic graph formalism and its corresponding matching function. Finally, the validation of our proposition is presented.
2 Conceptual Graph Formalism
A conceptual graph [4] is a graph composed of concept nodes, relation nodes, and edges that link concept and relation nodes. A concept node is labeled by a type and possibly a marker. The type corresponds to a semantic class and the marker is a particular instance of a semantic class. In the same way, a relation node is only labeled by a type. A specialization relation, noted ≤, classifies concept types and relation types in a hierarchy. Specialization relations are useful to compare graphs by the Sowa projection operator. This operator defines a specialization relation between graphs. As shown in Figure 1, there is a projection of a graph H onto a graph G if there exists in G a "copy" of the graph H in which all nodes are specializations of the H nodes.
Fig. 1. A projection example.
The matching function of an IR system is based on the projection operator. For example, the previous graph H represents a query and the graph G corresponds to the document index. If there is a projection of H onto G, the document is considered relevant for the query.
3 State of the Art
Several IR systems based on the CG formalism have been developed:
1. Ounis et al. [2] have developed the RELIEF system. Even if in graph theory a projection cannot be performed in polynomial time, one of the contributions of this work is to propose a fast matching function, based on inverted files and acceleration tables.
2. Genest [1] has noted that the projection operator is not adapted to IR purposes. First, matching functions based on projection give Boolean results. Secondly, a document is not relevant for a query if its index graph contains only one node which is a
1 Relevant documents not retrieved (forgotten) by the IR system.
generalization of the query node, or if the graph structures are different. In order to take such problems into account, Genest defines some transformations on conceptual graphs. Moreover, a mechanism is proposed to order sequences of transformations. As a consequence, the matching function based on projection becomes a ranking function and orders the relevant documents for a query.
Our proposition builds on improvements of the Ounis and Genest methods by proposing a graph matching function optimized for information retrieval needs. First, a graph formalism is presented allowing the description of documents in a multilingual context.
4 Semantic Graph Model
We have simplified the CG formalism to bring it closer to a documentary language. Concepts are limited to generic concepts, because descriptors represent main notions and not individual objects. Moreover, the comparison between graphs should not be based on the graph structure.
4.1 Semantic Thesaurus
We propose an extension of the Sowa formalism in which two kinds of knowledge are identified in a semantic thesaurus:
1. Domain knowledge organizes the domain entities in two hierarchies of types. Types define a pivot language used to represent document and query graphs.
2. Lexical knowledge associates terms, belonging to a vocabulary, with types. Terms are used to present semantic graphs in the user's native language.
4.1.1 Domain Conceptualization or Support
A support S is a 2-tuple S = (TC, TR) such that:
- TC is a set of concept types partially ordered by the specialization relation, noted ≤, and it has a greatest element, noted T.
- TR is a set of binary relation types2 partially ordered by ≤, and it has a greatest element, noted T2.
4.1.2 Semantic Thesaurus
A semantic thesaurus (composed of P languages), noted M, is a 3-tuple M = (S, V, λ) such that:
- S is a support (cf. § 4.1.1).
- V is a set of vocabularies, split into sets of terms belonging to the same language (a vocabulary): V = VL1 ∪ VL2 ∪ … ∪ VLj ∪ … ∪ VLP, where VLj is a set of terms belonging to the language Lj.
2 In general, a type of relation can have any arity, but in this paper relations are considered to be only binary relations, like case relations or thematic roles associated with verbs [5].
- λ = {λVL1, …, λVLj, …, λVLP} is a set of P mappings such that λVLj: TC ∪ TR → VLj is a mapping which associates a term λVLj(t) ∈ VLj of the language Lj with a type t ∈ TC ∪ TR.
Fig. 2. An example of semantic thesaurus.
Figure 2 presents an example of the mapping λ, in which V is composed of two vocabularies: an English vocabulary, noted Veng, and a French vocabulary, noted Vfr. Each concept type is linked to a term of each vocabulary. For example, the concept type tc2.1 is linked to the English term λVeng(tc2.1) = "lubricant" and it is also linked to the French term λVfr(tc2.1) = "lubrifiant". From this semantic thesaurus defining domain knowledge and lexical knowledge, our formalism, called semantic graph, is defined. A semantic graph is a set of concept nodes connected to each other by relations. Compared to conceptual graphs, the notion of arch is defined as a couple of concept nodes labeled by a relation type.
4.2 Semantic Graph
A semantic graph is a 4-tuple Gs = (C, A, μ, ν) related to a semantic thesaurus M, such that:
- C is a set of concept nodes3 contained in Gs.
- A ⊆ C × C is a set of arches contained in Gs.
3 In this article, "concept node" and "concept" are equivalent expressions.
- μ: C → TC, A → TR is a mapping which associates with each concept node c ∈ C a label μ(c) ∈ TC, also called the type of c, and with each arch a ∈ A a label μ(a) ∈ TR, also called the type of a.
- ν = {νVL1, …, νVLj, …, νVLP} is a set of mappings such that the mapping νVLj: C ∪ A → VLj associates an arch a ∈ A or a concept node c ∈ C with a term of the language Lj; νVLj(a) ∈ VLj is called the term of a for the language Lj and νVLj(c) ∈ VLj is called the term of c for the language Lj.
Thanks to the previous definitions, there exist different representations of the same semantic graph depending on the labels used.
1. The first representation of a semantic graph labels each graph component with its type. So for a concept node c, its label is μ(c).
2. The second kind of representation labels each graph component with a term chosen from a vocabulary defined in the semantic thesaurus. So for a concept node c, its label is νVLj(c) = λVLj(μ(c)).
Indeed, there exist several representations of the same semantic graph depending on the chosen vocabulary. Now, the pseudo-projection operator comparing semantic graphs is presented.
4.2.1 Pseudo-Projection Operator
The pseudo-projection operator is an extension of the projection operator of Sowa conceptual graphs. A pseudo-projection defines a morphism between graphs with fewer constraints than the original Sowa operator does. The pseudo-projection of a graph H in a graph G means that H is "comparable" to G. The formalization of the pseudo-projection operator is as follows:
Pseudo-projection operator: A pseudo-projection from a semantic graph H = (CH, AH, μH, νH) to a semantic graph G = (CG, AG, μG, νG) is a mapping Π: AH → AG, CH → CG which associates an arch of H with an arch of G and a concept node of H with a set of concept nodes of G. Π has the following properties:
1. Arches are preserved but concept nodes cannot be preserved.
2. Types can be restricted or increased.
Remark: A concept node can have several images by Π. As a consequence, the pseudo-projection operator makes no difference between a graph containing, for example, several concept nodes typed by tc and another graph containing a unique concept node typed by tc. That is the reason why a semantic graph is considered to contain a unique concept node per type. This is defined as the normal form of a graph.
4.3 Similarity Functions
The result of the pseudo-projection operator between graphs is Boolean: a pseudo-projection exists or does not exist. Often, the matching function of an IR system orders the resulting documents. Thus, a similarity function between graphs is defined. To this end, various similarity functions are presented. Each similarity function returns a normalized
float value ranging between 0 and 1. First, thanks to the specialization relation, a similarity function between types is defined.
4.3.1 Similarity Function between Types
The similarity function between types, noted sim, is an asymmetrical function defined as follows:
- If two types are not comparable, then the similarity function returns 0.
- If two types are identical, then the similarity function returns 1.
- If a type t2.1 directly specializes another type t2, i.e. there is no intermediate type between t2.1 and t2 in the type hierarchy, then the similarity function returns a constant value lower than 1. For example, sim(tC2.1, tC2) = VG and sim(tC2, tC2.1) = VS, where VS and VG are fixed arbitrarily.
- If a type t2.1.1 specializes another type t2 not directly, i.e. there is an intermediate type t2.1 between t2.1.1 and t2 in the type hierarchy, then the similarity between t2.1.1 and t2 is the product of the similarities between (t2.1.1, t2.1) and (t2.1, t2). For example, sim(tC2.1.1, tC2) = sim(tC2.1.1, tC2.1) × sim(tC2.1, tC2).
4.3.2 Similarity Function between Arches
The similarity function between two arches, noted SimA, computes the average of the type similarities between each arch component. For example, let aH be an arch such that aH = (cH, c'H) and μ(aH) = trH, and let aG be an arch such that aG = (cG, c'G) and μ(aG) = trG. Then:

SimA(aH, aG) = ( sim(trH, trG) + sim(μ(cH), μ(cG)) + sim(μ(c'H), μ(c'G)) ) / 3     (1)
4.3.3 Similarity Function between Graphs
The similarity function, noted simG, between a graph H = (CH, AH, μH, νH) and a graph G = (CG, AG, μG, νG) is the average of the similarities between each arch and concept node of H and their images in G by Π. Because a concept node can have several images by Π, we take the maximum of the similarity between a concept and its images:
simG(H, G) = ( Σa∈AH SimA(a, Π(a)) + Σc∈CH max sim(μH(c), μG(Π(c))) ) / ( |AH| + |CH| )     (2)
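The three similarity functions can be put together in a short Python sketch; the parent-map representation of the type hierarchy, the VG/VS constants, and the way graphs and pseudo-projections are encoded are assumptions made for illustration, not the actual data structures of SyDoM.

VG, VS = 0.9, 0.8          # arbitrary constants for generalization / specialization steps

def sim(t1, t2, parent):
    """Similarity between two types in a hierarchy given as a child -> parent map.
    1 for equal types, a product of per-step constants along the specialization
    path, and 0 for non-comparable types."""
    if t1 == t2:
        return 1.0
    def steps(child, anc):
        n, cur = 0, child
        while cur is not None:
            if cur == anc:
                return n
            cur, n = parent.get(cur), n + 1
        return None
    up = steps(t1, t2)             # t1 specializes t2
    if up is not None:
        return VG ** up
    down = steps(t2, t1)           # t1 generalizes t2
    if down is not None:
        return VS ** down
    return 0.0

def sim_arch(arch_h, arch_g, parent):
    """Average of the type similarities of the relation and the two concepts."""
    (c1h, trh, c2h), (c1g, trg, c2g) = arch_h, arch_g
    return (sim(trh, trg, parent) + sim(c1h, c1g, parent) + sim(c2h, c2g, parent)) / 3

def sim_graph(h_arcs, h_concepts, proj_arcs, proj_concepts, parent):
    """Equation (2): h_arcs and h_concepts describe the query graph H; proj_arcs
    maps each arch of H to its image; proj_concepts maps each concept type of H
    to the non-empty set of types of its images in G."""
    total = sum(sim_arch(a, proj_arcs[a], parent) for a in h_arcs)
    total += sum(max(sim(c, c2, parent) for c2 in proj_concepts[c]) for c in h_concepts)
    return total / (len(h_arcs) + len(h_concepts))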
5 Algorithms
After introducing our semantic graph formalism, we shall concentrate on the implementation of the matching function between graphs. Search algorithms evaluate all the pseudo-projections from the query graph to the index graphs stored in the database.
During indexing, all the possible query subgraphs comparable with each index graph are memorized. Finding documents relevant for a query graph then consists of identifying the subgraphs of the current query corresponding to possible query subgraphs stored beforehand in the database. Semantic graphs are composed of arches and concept nodes, so the document content is represented by two different indices: a list of arches and a list of concepts, from which the normal form of the semantic graph can be rebuilt. Following the work of Ounis [2], our algorithms are based on the association of inverted files and acceleration tables. The inverted file groups in the same entry all the documents indexed by an indexing entity. The acceleration tables store, for each indexing entity, the list of the comparable entities as well as the result of the similarity function between the comparable entity and the indexing entity. The acceleration tables thus pre-compute all possible generalizations and specializations of the indexing entities. The construction of the inverted file and the acceleration table is done off-line, as part of the indexing procedure. There is a search algorithm for each kind of indexing entity. Because the two algorithms are similar, only the search algorithm for arches is presented.
GraphReq is a query graph composed of nbArc arches, noted ArcReq, and of nbConcept concept nodes.
ListDocResult is a list of documents weighted by the value of the similarity function between the query graph GraphReq and the index graph of the document.

For each arch ArcReq of GraphReq do
  ListArcIndex ← FindArcComparable(ArcReq)
  For each (ArcIndex, WeightArc) of ListArcIndex do
    ListDoc ← FindListDoc(ArcIndex)
    For each Doc of ListDoc do
      If ListDocArc.Belong(Doc) Then
        Weight ← ListDocArc.FindWeight(Doc)
        NewWeight ← max(Weight, WeightArc)
        ListDocArc.ReplaceWeight(Doc, NewWeight)
      Else
        ListDocArc.Add(Doc, WeightArc)
      Endif
    Endfor
  Endfor
  For each (Doc, WeightArc) of ListDocArc do
    If ListDocResult.Belong(Doc) Then
      Weight ← ListDocResult.FindWeight(Doc)
      NewWeight ← Weight + (WeightArc / (nbArc + nbConcept))
      ListDocResult.ReplaceWeight(Doc, NewWeight)
    Else
      ListDocResult.Add(Doc, WeightArc)
    Endif
  Endfor
Endfor

FindArcComparable(ArcReq) returns a list of arches (ArcIndex) comparable to ArcReq, weighted by the value of the similarity function between ArcReq and ArcIndex, noted WeightArc. Usually, the cost of a projection operator between graphs is prohibitive, because in graph theory it is equivalent to finding a morphism between arbitrary structures. To overcome this problem, the graph structure is limited, that is to say, a semantic graph is supposed to contain a unique concept node per type.
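A compact way to see how the inverted file and the acceleration table cooperate during retrieval is the following Python sketch; the dictionary-based storage, the function name search_arches, and the uniform normalization of each contribution are assumptions made for illustration, standing in for the relational tables and the pseudocode above.

from collections import defaultdict

# Acceleration table: for every arch that may appear in a query, the list of
# comparable index arches together with the pre-computed similarity value.
# Inverted file: for every index arch, the documents it indexes.
acceleration = {}                 # query_arch -> [(index_arch, weight), ...]
inverted = defaultdict(list)      # index_arch -> [doc_id, ...]

def search_arches(query_arches, nb_arc, nb_concept):
    """Score documents from the arch index alone (the concept index is
    handled by a symmetric procedure that adds to the same scores)."""
    scores = defaultdict(float)
    for arc_req in query_arches:
        best_per_doc = {}                          # ListDocArc in the pseudocode
        for arc_index, weight in acceleration.get(arc_req, []):
            for doc in inverted[arc_index]:
                best_per_doc[doc] = max(best_per_doc.get(doc, 0.0), weight)
        for doc, w in best_per_doc.items():
            scores[doc] += w / (nb_arc + nb_concept)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)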
6 Experiment
An information retrieval module based on semantic graphs has been developed. This module is a component of the documentary system called SyDoM (Multilingual Documentary System) [3]. The system is implemented in Java on top of a relational database system. SyDoM is composed of three modules:
1. The semantic thesaurus module manages the documentary language (addition of a new vocabulary or of new domain entities).
2. The indexing module indexes and annotates XML documents with semantic graphs using a set of metadata associated with the semantic thesaurus.
3. The retrieval module performs multilingual retrieval. The users choose their query components in the semantic thesaurus presented in their native language.
As an example, Figure 3 presents a French query graph dealing with a combustion model. The library Doc'INSA, associated with the National Institute of Applied Science of Lyon, gave us a test base of English articles. These articles deal with mechanics and are called pre-prints of the Society of Automotive Engineers (SAE). During manual indexing, only titles are taken into account. For our first experiments, approximately fifty articles were indexed manually and ten queries were performed. Our system was compared to the Boolean system used at Doc'INSA. Indices for the Doc'INSA system were generated automatically from those of SyDoM, to avoid variability. Figure 4 presents this evaluation. The average precision was computed for ten recall intervals. We can notice that relation treatments and hierarchy inference significantly improve the quality of the answers, even for manual indexing.
Fig. 3. The SyDoM interface is composed of a graph editor and a browser of the semantic thesaurus (concept and relation hierarchies). To build their queries, users select graph components in the hierarchies. The graph can be presented in different languages by changing the vocabulary language thanks to the top left button.
7 Conclusion
In this paper, a solution was proposed to the challenge of using a complex knowledge representation formalism for information retrieval purposes. Moreover, a graph formalism is presented to describe the semantics of document contents in a multilingual
Fig. 4. Evaluation of SyDoM (threshold = 0.6) and Doc’INSA system.
context. This formalism is an extension of the Sowa formalism of Conceptual Graphs. Starting from recent works, a new comparison operator between graphs is proposed that is not based on graph structure comparison. This choice enables us to decrease the complexity of the search algorithm. Our proposition is validated by the prototype SyDoM, dedicated to digital libraries. SyDoM has been evaluated by querying an English collection of articles in French. At this stage, SyDoM gives better results than a traditional documentary system. The next step would be to compare our system with RELIEF [2] or the Genest one [1]. Such experiments would test whether our proposition, compared to the extension of CG proposed by Genest, could achieve similar results with less computational time.
References
1. D. Genest. Extension du modèle des graphes conceptuels pour la recherche d'information. PhD Thesis, Montpellier University, Montpellier, France (2000).
2. I. Ounis, M. Pasça. RELIEF: Combining Expressiveness and Rapidity into a Single System. Proceedings of the 18th SIGIR Conference, Melbourne, Australia, (1998), 266-274.
3. C. Roussey, S. Calabretto, J. M. Pinon. Un modèle d'indexation pour une collection multilingue de documents. Proceedings of the 3rd CIDE Conference, Lyon, France, (2000), 153-169.
4. J. Sowa. Conceptual Structures: Information Processing in Mind and Machine. The System Programming Series, Addison Wesley Publishing Company, (1984).
5. J. Sowa. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks Cole Publishing Co., Pacific Grove, CA, (2000).
Flexible Comparison of Conceptual Graphs*
M. Montes-y-Gómez1, A. Gelbukh1, A. López-López2, and R. Baeza-Yates3
1 Center for Computing Research (CIC), National Polytechnic Institute (IPN), 07738, Mexico. [email protected], [email protected]
2 Instituto Nacional de Astrofísica, Optica y Electrónica (INAOE), Mexico. [email protected]
3 Departamento de Ciencias de la Computación, Universidad de Chile, Chile. [email protected]
Abstract. Conceptual graphs allow for powerful and computationally affordable representation of the semantic contents of natural language texts. We propose a method of comparison (approximate matching) of conceptual graphs. The method takes into account synonymy and subtype/supertype relationships between the concepts and relations used in the conceptual graphs, thus allowing for greater flexibility of approximate matching. The method also allows the user to choose the desirable aspect of similarity in the cases when the two graphs can be generalized in different ways. The algorithm and examples of its application are presented. The results are potentially useful in a range of tasks requiring approximate semantic or another structural matching – among them, information retrieval and text mining.
1 Introduction
In many application areas of text processing – e.g., in information retrieval and text mining – simple and shallow representations of the texts are commonly used. On the one hand, such representations are easily extracted from the texts and easily analyzed, but on the other hand, they restrict the precision and the diversity of the results. Recently, in all text-oriented applications there is a tendency to use richer representations than just keywords, i.e., representations with more types of textual elements. Under this circumstance, it is necessary to have appropriate methods for the comparison of two texts in any of these new representations. In this paper, we consider the representation of texts by conceptual graphs [9,10] and focus on the design of a method for comparison of two conceptual graphs. This is a continuation of the research reported in [15]. Most methods for comparison of conceptual graphs come from information retrieval research. Some of them are restricted to the problem of determining if a graph, say, the query graph, is completely contained in the other one, say, the document graph [2,4]; in this case neither a description nor a measure of their similarity is obtained. Some other, more general methods do measure the similarity between two conceptual graphs, but they typically describe this similarity as the set of all their common elements, allowing duplicated information [3,6,7]. Yet other methods are focused on question answering [12]; these methods allow a flexible matching of the graphs, but they do not compute any similarity measure.
* Work done under partial support of CONACyT, CGEPI-IPN, and SNI, Mexico.
The method we propose is general yet flexible. First, it allows measuring the similarity between two conceptual graphs as well as constructing a precise description of this similarity. In other words, this method describes the similarity between two conceptual graphs both quantitatively and qualitatively. Second, it uses domain knowledge – a thesaurus and a set of is-a hierarchies – all along the comparison process, which allows considering non-exact similarities. Third, it allows visualizing the similarities between two conceptual graphs from different points of view and selecting the most interesting one according to the user’s interests. The paper is organized as follows. The main notions concerning conceptual graphs are introduced in section 2. Our method for comparison of two conceptual graphs is described in section 3, matching of conceptual graphs being discussed in subsection 3.1 and the similarity measure in subsection 3.2. An illustrative example is shown in section 4, and finally, some conclusions are discussed in section 5.
2 Conceptual Graphs
This section introduces well-known notions and facts about conceptual graphs. A conceptual graph is a finite oriented connected bipartite graph [9,10]. The two different kinds of nodes of this bipartite graph are concepts and relations. Concepts represent entities, actions, and attributes. Concept nodes have two attributes: type and referent. The type indicates the class of the element represented by the concept. The referent indicates the specific instance of the class referred to by the node. Referents may be generic or individual. Relations show the inter-relationships among the concept nodes. Relation nodes also have two attributes: valence and type. The valence indicates the number of neighbor concepts of the relation, while the type expresses the semantic role of each one. Figure 1 shows a simple conceptual graph. This graph represents the phrase “Tom is chasing a brown mouse”. It has three concepts and three relations. The concept [cat: Tom] is an individual concept of the type cat (a specific cat Tom), while the concepts [chase] and [mouse] are generic concepts. All relations in this graph are binary. For instance, the relation (attr) for attribute indicates that the mouse has brown color. The other two relations stand for the agent and the patient of the action [chase]. Building and manipulating conceptual graphs is mainly based on six canonical rules [9]. Two of these rules are the generalization rules: unrestrict and detach. The unrestrict rule generalizes a conceptual graph by unrestricting one of its concepts either by type or by referent. Unrestriction by type replaces the type label of the concept with one of its supertypes; unrestriction by referent substitutes individual referents by generic ones. The detach rule splits a concept node into two different nodes having the same attributes (type and referent) and distributes the relations of the original node between the two resulting nodes. Often this operation leads to separating the graph into two unconnected parts.
Fig. 2. Projection mapping π: v → u (the highlighted area is the projection of v in u).
A conceptual graph v derivable from the graph u by applying a sequence of generalization rules is called a generalization of the graph u; this is denoted as u ≤ v. In this case there exists a mapping π: v → u with the following properties (πv is a subgraph1 of u called a projection of v in u; see Figure 2):
• For each concept c in v, πc is a concept in πv such that type(πc) ≤ type(c). If c is an individual concept, then referent(πc) = referent(c).
• For each relation node r in v, πr is a relation node in πv such that type(πr) = type(r). If the i-th arc of r is linked to a concept c in v, then the i-th arc of πr must be linked to πc in πv.
The mapping π is not necessarily one-to-one, i.e., two different concepts or relations can have the same projections (x1 ≠ x2 and πx1 = πx2; such a situation results from the application of the detach rule). In addition, it is not necessarily unique, i.e., a conceptual graph v can have two different projections π and π´ in u, π´v ≠ πv. If u1, u2, and v are conceptual graphs such that u1 ≤ v and u2 ≤ v, then v is called a common generalization of u1 and u2. A conceptual graph v is called a maximal common generalization of u1 and u2 if and only if there is no other common generalization v´ of u1 and u2 (i.e., u1 ≤ v´ and u2 ≤ v´) such that v´ ≤ v.
3 Comparison of Conceptual Graphs The procedure we propose for the comparison of two conceptual graphs is summarized in Figure 3. It consists of two main stages. First, the two conceptual graphs are matched and their common elements are identified. Second, their similarity measure is computed as a relative size of their common elements. This measure is a value between 0 and 1, 0 indicating no similarity between the two graphs and 1 indicating that the two conceptual graphs are equal or semantically equivalent. The two stages use domain knowledge and consider the user interests. Basically, the domain knowledge is described as a thesaurus and as a set of user-oriented is-a hierarchies. The thesaurus allows considering the similarity between semantically related concepts, not necessarily equal, while the is-a hierarchies allow determining similarities at different levels of generalization. 1
Here, the functions type(c) and referent(c) return the type and referent of the concept c, respectively; the function type(r) returns the type of the relation r. By type(a) ≤ type(b) we denote the fact that type(b) is a supertype of type(a).
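As a minimal illustration of these notions, the following Python sketch represents a conceptual graph as concept and relation nodes and checks the two projection conditions against a candidate node mapping; the data layout and the small type hierarchy are assumptions made for the example, not the representation used by the authors.

# Type hierarchy as a child -> parent map (a fragment of an is-a hierarchy).
SUPERTYPE = {"cat": "animal", "mouse": "animal", "animal": "entity"}

def is_subtype(t, s):
    """True when t == s or s is reachable from t through the supertype chain."""
    while t is not None:
        if t == s:
            return True
        t = SUPERTYPE.get(t)
    return False

# A conceptual graph: concepts[id] = (type, referent); relations = list of
# (relation_type, tuple_of_concept_ids) following the arc order.
def check_projection(v_concepts, v_relations, u_concepts, u_relations, pi):
    """Check whether the node mapping pi (v concept id -> u concept id,
    ("r", i) -> u relation index) satisfies the projection conditions."""
    for c, (ctype, ref) in v_concepts.items():
        utype, uref = u_concepts[pi[c]]
        if not is_subtype(utype, ctype):          # type(pi(c)) <= type(c)
            return False
        if ref is not None and uref != ref:       # individual referents preserved
            return False
    for i, (rtype, args) in enumerate(v_relations):
        urtype, uargs = u_relations[pi[("r", i)]]
        if urtype != rtype:                       # relation types must match
            return False
        if tuple(pi[a] for a in args) != uargs:   # i-th arcs map consistently
            return False
    return True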
Fig. 3. Comparison of conceptual graphs
3.1 Matching Conceptual Graphs
Matching of two conceptual graphs allows finding all their common elements, i.e., all their common generalizations. Since the projection is not necessarily one-to-one and unique, some of these common generalizations may express redundant (duplicated) information. In order to construct a precise description of the similarity of the two conceptual graphs (e.g. G1 and G2), it is necessary to identify the sets of compatible common generalizations. We call such sets overlaps and define them as follows.
Definition 1. A set of common generalizations O = {g1, g2, …, gn} is called compatible if and only if there exist projection maps2 {π1, π2, …, πn} such that the corresponding projections in G1 and G2 do not intersect, i.e.:

∩i=1..n πG1 gi = ∩i=1..n πG2 gi = ∅
Definition 2. A set of common generalizations O = {g1, g2, …, gn} is called maximal if and only if there does not exist any common generalization g of G1 and G2 such that either of the following conditions holds:
1. O′ = {g1, g2, …, gn, g} is compatible,
2. ∃i: g ≤ gi, g ≠ gi, and O′ = {g1, …, gi−1, g, gi+1, …, gn} is compatible.
(i.e., O cannot be expanded and no element of O can be specialized while preserving the compatibility of O.)
Definition 3. A set O = {g1, g2, …, gn} of common generalizations of two conceptual graphs G1 and G2 is called an overlap if and only if it is compatible and maximal.
Obviously, each overlap expresses completely and precisely the similarity between two conceptual graphs. Therefore, the different overlaps may indicate different and independent ways of visualizing and interpreting their similarity. Let us consider the algorithm to find the overlaps. Given two conceptual graphs G1 and G2, the goal is to find all their overlaps. Our algorithm works in two stages. At the first stage, all similarities (correspondences) between the conceptual graphs are found, i.e., a kind of product graph is constructed [6]. The product graph P expresses the Cartesian product of the nodes and relations of the conceptual graphs,
2 Recall that the projection map, and thus the projection, for a given pair v, u is not unique.
but only considers those pairs with non-empty common generalizations. The algorithm is as follows:

For each concept ci of G1
  For each concept cj of G2
    P ← the common generalization of ci and cj
For each relation ri of G1
  For each relation rj of G2
    P ← the common generalization of ri and rj
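The first stage translates directly into code; the Python sketch below builds the node set of the product graph, keeping only concept pairs with comparable types and non-conflicting referents and relation pairs with equal types (synonymy lookups via a thesaurus are omitted). It reuses the is_subtype helper and SUPERTYPE map from the earlier sketch; the rest of the representation is an illustrative assumption.

def common_concept(c1, c2):
    """Common generalization of two concepts, or None when it is empty."""
    (t1, r1), (t2, r2) = c1, c2
    if is_subtype(t1, t2):
        gen_type = t2
    elif is_subtype(t2, t1):
        gen_type = t1
    else:
        return None                                # types not comparable
    referent = r1 if r1 == r2 else None            # keep referent only if shared
    return (gen_type, referent)

def product_graph(g1_concepts, g2_concepts, g1_relations, g2_relations):
    """Stage 1: all concept pairs and relation pairs with a non-empty
    common generalization (the node set of the product graph P)."""
    concept_pairs = []
    for i, c1 in g1_concepts.items():
        for j, c2 in g2_concepts.items():
            g = common_concept(c1, c2)
            if g is not None:
                concept_pairs.append((i, j, g))
    relation_pairs = [(i, j) for i, (t1, _) in enumerate(g1_relations)
                             for j, (t2, _) in enumerate(g2_relations) if t1 == t2]
    return concept_pairs, relation_pairs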
At the second stage, all maximal sets of compatible elements are detected, i.e., all overlaps are constructed. The algorithm we use in this stage is an adaptation of a well-known algorithm for the detection of all frequent item sets in a large database [1]. Initially, we consider each concept of the product graph as a possible overlap. At each subsequent step, we start with the overlaps found in the previous step and use them as the seed set for generating new, larger overlaps. At the end of the step, the overlaps of the previous step that were used to construct the new overlaps are deleted, because they are not maximal overlaps, and the new overlaps become the seed for the next step. This process continues until no new, larger overlaps are found. Finally, the relations of the product graph are inserted into the corresponding overlaps. This algorithm is as follows:

    Overlaps1 = {all the concepts of P}
    For (k = 2; Overlapsk-1 ≠ ∅; k++)
        Overlapsk ← overlap_gen(Overlapsk-1)
        Overlapsk-1 ← Overlapsk-1 – {elements covered by Overlapsk}
    MaxOverlaps = ⋃k Overlapsk
    For each relation r of P
        For each overlap Oi of MaxOverlaps
            If the neighbor concepts of r are in the overlap Oi
                Oi ← r

The overlap_gen function takes as argument Overlapsk-1, the set of all large (k−1)-overlaps, and returns Overlapsk, the set of all large k-overlaps. Each k-overlap is constructed by joining two compatible (k−1)-overlaps. This function is defined as follows:
    Overlaps′k = { X ∪ X′ | X, X′ ∈ Overlapsk-1, |X ∩ X′| = k − 2 }
    Overlapsk = { X ∈ Overlaps′k | X contains k members of Overlapsk-1 }

with the exception of the case k = 2, where

    Overlaps2 = { X ∪ X′ | X, X′ ∈ Overlaps1, X and X′ are compatible concepts }.

In the next section we give an illustration of the matching of two simple conceptual graphs; see Figure 4. It is well known [5,6] that matching conceptual graphs is an NP-complete problem. Thus, our algorithm has exponential complexity in the number of common nodes of the two graphs. This does not imply, however, any serious limitations for its practical application for our purposes, since the graphs we compare represent the results of a shallow parsing of a single sentence and thus are commonly small and have few nodes in common.
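The two stages can also be pictured as a short Python sketch (ours, not the authors'): the parameters common_generalization and compatible stand in for the conceptual-graph operations assumed above, the Apriori-style pruning condition is reduced to a single compatibility test, and the final insertion of the relations of P into the overlaps is omitted.

from itertools import combinations

def product_concepts(g1_concepts, g2_concepts, common_generalization):
    # First stage: keep every concept pair of G1 x G2 whose common
    # generalization is non-empty (a simplified product graph P).
    pairs = []
    for c1 in g1_concepts:
        for c2 in g2_concepts:
            g = common_generalization(c1, c2)
            if g is not None:
                pairs.append((c1, c2, g))          # items must be hashable
    return pairs

def grow_overlaps(pairs, compatible):
    # Second stage (Apriori-style): join (k-1)-element sets that share k-2 members,
    # keep the compatible candidates, and drop every set absorbed by a larger one.
    level = [frozenset([p]) for p in pairs]         # Overlaps_1
    max_overlaps = []
    while level:
        next_level = set()
        for x, y in combinations(level, 2):
            candidate = x | y
            if len(candidate) == len(x) + 1 and compatible(candidate):
                next_level.add(candidate)
        covered = {x for x in level for c in next_level if x < c}
        max_overlaps.extend(x for x in level if x not in covered)
        level = list(next_level)
    return max_overlaps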
Since our algorithm is an adaptation of the algorithm APRIORI [1], which was reported to be very fast, ours is also fast (which was confirmed in our experiments); in general, algorithms of exponential complexity are used quite frequently in data mining. For a discussion of why exponential complexity does not necessarily present any practical problems, see also [14].
3.2 Similarity Measure
Given two conceptual graphs G1 and G2 and one of their overlaps O, we define their similarity s as a combination of two values: their conceptual similarity sc and their relational similarity sr. The conceptual similarity sc depends on the common concepts of G1 and G2. It indicates how similar the entities, actions, and attributes mentioned in both conceptual graphs are. We calculate it using an expression analogous to the well-known Dice coefficient [8]:

    sc = 2 Σ_{c ∈ U_O} weight(c) × β(π_G1(c), π_G2(c)) / ( Σ_{c ∈ G1} weight(c) + Σ_{c ∈ G2} weight(c) )
Here U_O is the union of all graphs in O, i.e., the set of all their nodes and arcs; the function weight(c) gives the relative importance of the concept c, and the function β(π_G1(c), π_G2(c)) expresses the level of generalization of the common concept c ∈ U_O relative to the original concepts π_G1(c) and π_G2(c). (Because of its simplicity and normalization properties, we take the Dice coefficient as the basis for the similarity measure we propose.)
The function weight(c) is different for nodes of different types; currently we simply distinguish entities, actions, and attributes:

    weight(c) = wE   if c represents an entity,
                wV   if c represents an action,
                wA   if c represents an attribute,

where wE, wV, and wA are positive constants that express the relative importance of the entities, actions, and attributes, respectively. Their values are user-specified. In the future, a less arbitrary mechanism for assigning weights can be developed.
The function β(π_G1(c), π_G2(c)) can be interpreted as a measure of the semantic similarity between the concepts π_G1(c) and π_G2(c). Currently we calculate it as follows:

    β(π_G1(c), π_G2(c)) = 1                                     if type(π_G1(c)) = type(π_G2(c)) and referent(π_G1(c)) = referent(π_G2(c)),
                          depth / (depth + 1)                   if type(π_G1(c)) = type(π_G2(c)) and referent(π_G1(c)) ≠ referent(π_G2(c)),
                          2·d_c / (d_{π_G1(c)} + d_{π_G2(c)})   if type(π_G1(c)) ≠ type(π_G2(c)).

(In this definition, the condition type(π_G1(c)) = type(π_G2(c)) is also satisfied when type(π_G1(c)) and type(π_G2(c)) are synonyms, as defined by the thesaurus.)
In the first condition, the concepts π_G1(c) and π_G2(c) are the same, and thus β(π_G1(c), π_G2(c)) = 1. In the second condition, the concepts π_G1(c) and π_G2(c) refer to different individuals of the same type, i.e., to different instances of the same class. In this case, β(π_G1(c), π_G2(c)) = depth/(depth + 1), where depth indicates the number of levels of the is-a hierarchy. Using this value, the similarity between two concepts having the same type but different referents is always greater than the similarity between any two concepts with different types. In the third condition, the concepts π_G1(c) and π_G2(c) have different types, i.e., refer to elements of different classes. In this case, we define β(π_G1(c), π_G2(c)) as the semantic similarity between type(π_G1(c)) and type(π_G2(c)) in the is-a hierarchy. We calculate it using an expression similar to the one proposed in [11]. In this third option of our formula, d_i indicates the distance – the number of nodes – from the type i to the root of the hierarchy.
The relational similarity sr expresses how similar the relations among the common concepts in the conceptual graphs G1 and G2 are. In other words, the relational similarity indicates how similar the neighborhoods of the overlap in both original graphs are (see more details in [13]). We define the immediate neighborhood of the overlap O in a conceptual graph Gi, N_O(Gi), as the set of all the relations connected to the common concepts in the graph Gi:

    N_O(Gi) = ⋃_{c ∈ O} N_Gi(π_Gi(c)),  where N_G(c) = {r | r is connected to c in G}.
With this, we calculate the relational similarity sr using the following expression, also analogous to the Dice coefficient:

    sr = 2 Σ_{r ∈ U_O} weight_U_O(r) / ( Σ_{r ∈ N_O(G1)} weight_G1(r) + Σ_{r ∈ N_O(G2)} weight_G2(r) )

Here weight_G(r) indicates the relative importance of the conceptual relation r in the conceptual graph G. (This function also holds for overlaps, because an overlap is also a set of conceptual graphs; see Section 3.1.) This value is calculated from the weights of the neighbor concepts of the relation r. This kind of assignment guarantees the homogeneity between the concept and the relation weights. Hence, we compute weight_G(r) as:

    weight_G(r) = ( Σ_{c ∈ N_G(r)} weight(c) ) / |N_G(r)|,  where N_G(r) = {c | c is connected to r in G}.

Now that we have defined the two components of the similarity measure, sc and sr, we combine them into a cumulative measure s. First, the combination should be roughly multiplicative, for the cumulative measure to be proportional to each of the two components. This would give the formula s = sc × sr. However, we note that the relational similarity has a secondary importance, because its existence depends on the existence of some common concept nodes and because even if no common relations
exist between the common concepts of the two graphs, there exists some level of similarity between them. Thus, while the cumulative similarity measure is proportional to sc, it still should not be zero when sr = 0. So we smooth the effect of sr using the expression:
    s = sc × (a + b × sr)

With this definition, if no relational similarity exists between the two graphs (sr = 0), then the general similarity depends only on the value of the conceptual similarity. In this situation, the general similarity is a fraction of the conceptual similarity, where the coefficient a indicates the value of this fraction. The coefficients a and b reflect a user-specified balance (0 < a, b < 1, a + b = 1). The coefficient a indicates the importance of the part of the similarity exclusively dependent on the common concepts, and the coefficient b expresses the importance of the part of the similarity related to the connections between these common concepts. The user's choice of a (and thus b) allows adjusting the similarity measure to different applications and user interests. For instance, when a > b, conceptual similarities are emphasized, while when b > a, structural similarities are stressed.
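As an illustration of how the pieces of the measure fit together, the following Python sketch (written for this text, not taken from the paper) computes sc, sr, and the combined s; the (type, referent, kind) concept triples, the flat parent table standing in for the user-oriented is-a hierarchy, and the default values of the weights and of a and b are all assumptions of the example.

W = {"entity": 1.0, "action": 1.0, "attribute": 0.5}     # user-specified wE, wV, wA

def weight(c):
    # c is a (type, referent, kind) triple with kind one of the keys of W
    return W[c[2]]

def root_path(t, parent):
    # Path of types from t up to the root of the is-a hierarchy (dict: child -> parent).
    path = [t]
    while t in parent:
        t = parent[t]
        path.append(t)
    return path

def beta(c1, c2, parent, depth):
    # Level of generalization of a common concept relative to its projections c1, c2.
    if c1[0] == c2[0]:
        return 1.0 if c1[1] == c2[1] else depth / (depth + 1.0)
    p1, p2 = root_path(c1[0], parent), root_path(c2[0], parent)
    common = next(t for t in p1 if t in p2)      # deepest shared supertype (single root assumed)
    return 2.0 * len(root_path(common, parent)) / (len(p1) + len(p2))

def conceptual_similarity(pairs, g1_concepts, g2_concepts, parent, depth):
    # pairs: (projection in G1, projection in G2) for every concept of U_O
    num = 2.0 * sum(weight(p1) * beta(p1, p2, parent, depth) for p1, p2 in pairs)
    den = sum(map(weight, g1_concepts)) + sum(map(weight, g2_concepts))
    return num / den

def relational_similarity(overlap_rel_w, n_o_g1_w, n_o_g2_w):
    # each argument: the weight_G(r) values (average weight of the concepts attached
    # to r) over the relations of U_O, N_O(G1) and N_O(G2), respectively
    den = sum(n_o_g1_w) + sum(n_o_g2_w)
    return 2.0 * sum(overlap_rel_w) / den if den else 0.0

def similarity(sc, sr, a=0.7, b=0.3):
    # a + b = 1; a > b stresses conceptual similarity, b > a stresses structure
    return sc * (a + b * sr)

Varying a and b then changes which overlap of the same pair of graphs receives the larger score, which is exactly the flexibility exploited in the next section.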
4 An Illustrative Example
Our method for the comparison of two conceptual graphs is very flexible. On one hand, it describes qualitatively and quantitatively the similarity between the two graphs. On the other hand, it considers the user interests all along the comparison process. To illustrate this flexibility, we compare here two simple conceptual graphs. The first one represents the phrase “Gore criticizes Bush” and the second one the phrase “Bush criticizes Gore”. Figure 4 shows the matching of these two graphs. Notice that their similarity can be described in two different ways, i.e., by two different and independent overlaps. The overlap O1 indicates that in both graphs “a candidate criticizes another candidate”, while the overlap O2 indicates that both graphs talk about Bush, Gore, and an action of criticizing. The selection of the best overlap, i.e., the most appropriate description of the similarity, depends on the application and the user interests. These two parameters are modeled by the similarity measure. Table 1 shows the results for the comparison of these two conceptual graphs. Each result corresponds to a different way of evaluating and visualizing the similarity of these graphs. For instance, the first case emphasizes the structural similarity, the second one the conceptual similarity, and the third one focuses on the entities. In each case, the best overlap and the larger similarity measure are highlighted.
5 Conclusions
In order to start using more complete representations of texts than just keywords in the various applications of text processing, one of the main prerequisites is to have an appropriate method for the comparison of such new representations.
Bush and Gore were candidates in the U.S. presidential elections in 2000.
Fig. 4. Flexible matching of two conceptual graphs: (a) the graphs G1 (“Gore criticizes Bush”) and G2 (“Bush criticizes Gore”) matched through the overlap O1, [criticize]—(Agnt)—[candidate], [criticize]—(Ptnt)—[candidate]; (b) the same graphs matched through the overlap O2, which consists of the concepts [candidate: Bush], [criticize], and [candidate: Gore].
Table 1. The flexibility of the similarity measure (conditions and the resulting similarity values; not reproduced here).
We considered the representation of texts by conceptual graphs and proposed a method for the comparison of any pair of conceptual graphs. This method works in two main stages: matching the conceptual graphs and measuring their similarity. Matching is mainly based on the generalization rules of conceptual graph theory. The similarity measure is based on the idea of the Dice coefficient, but it also incorporates some new characteristics derived from the conceptual graph structure, for instance, the combination of two complementary sources of similarity: conceptual and relational similarity. Our method has two interesting characteristics. First, it uses domain knowledge, and second, it allows a direct influence of the user. The domain knowledge is expressed in the form of a thesaurus and a set of small (shallow) is-a hierarchies, both customized by a specific user. The thesaurus allows considering the similarity between semantically related concepts, not necessarily equal, while the is-a hierarchies allow determining the similarities at different levels of generalization. The flexibility of the method comes from the user-defined parameters. These allow
analyzing the similarity of the two conceptual graphs from different points of view and also selecting the best interpretation in accordance with the user interests. Because of this flexibility, our method can be used in different application areas of text processing, for instance, in information retrieval, textual case-based reasoning, and text mining. Currently, we are designing a method for the conceptual clustering of conceptual graphs based on these ideas and an information retrieval system where the non-topical information is represented by conceptual graphs.
References
1. Agrawal, Rakesh, and Ramakrishnan Srikant (1994), “Fast Algorithms for Mining Association Rules”, Proc. 20th VLDB Conference, Santiago de Chile, 1994.
2. Ellis and Lehmann (1994), “Exploiting the Induced Order on Type-Labeled Graphs for Fast Knowledge Retrieval”, Lecture Notes in Artificial Intelligence 835, Springer-Verlag, 1994.
3. Genest, D., and M. Chein (1997), “An Experiment in Document Retrieval Using Conceptual Graphs”, Conceptual Structures: Fulfilling Peirce's Dream, Lecture Notes in Artificial Intelligence 1257, August 1997.
4. Huibers, Ounis, and Chevallet (1996), “Conceptual Graph Aboutness”, Lecture Notes in Artificial Intelligence, Springer, 1996.
5. Marie, Marie (1995), “On generalization / specialization for conceptual graphs”, Journal of Experimental and Theoretical Artificial Intelligence, volume 7, pages 325-344, 1995.
6. Myaeng, Sung H., and Aurelio López-López (1992), “Conceptual Graph Matching: a Flexible Algorithm and Experiments”, Journal of Experimental and Theoretical Artificial Intelligence, Vol. 4, 1992.
7. Myaeng, Sung H. (1992), “Using Conceptual Graphs for Information Retrieval: A Framework for Adequate Representation and Flexible Inferencing”, Proc. of Symposium on Document Analysis and Information Retrieval, Las Vegas, 1992.
8. Rasmussen, Edie (1992), “Clustering Algorithms”, in Information Retrieval: Data Structures & Algorithms, William B. Frakes and Ricardo Baeza-Yates (Eds.), Prentice Hall, 1992.
9. Sowa, John F. (1984), Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley, 1984.
10. Sowa, John F. (1999), Knowledge Representation: Logical, Philosophical and Computational Foundations, 1st edition, Thomson Learning, 1999.
11. Wu and Palmer (1994), “Verb Semantics and Lexical Selection”, Proc. of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994.
12. Yang, Choi, and Oh (1992), “CGMA: A Novel Conceptual Graph Matching Algorithm”, Proc. of the 7th Conceptual Graphs Workshop, Las Cruces, NM, 1992.
13. Manuel Montes-y-Gómez, Alexander Gelbukh, Aurelio López-López (2000), “Comparison of Conceptual Graphs”, in O. Cairo, L.E. Sucar, F.J. Cantu (eds.), MICAI 2000: Advances in Artificial Intelligence, Lecture Notes in Artificial Intelligence N 1793, Springer-Verlag, pp. 548-556, 2000.
14. A. F. Gelbukh (2000), “Review of R. Hausser's ‘Foundations of Computational Linguistics: Man-Machine Communication in Natural Language’”, Computational Linguistics, 26 (3), 2000.
15. Manuel Montes-y-Gómez, Aurelio López-López, and Alexander Gelbukh (2000), “Information Retrieval with Conceptual Graph Matching”, Proc. DEXA-2000, 11th International Conference on Database and Expert Systems Applications, Greenwich, England, September 4-8, 2000, Lecture Notes in Computer Science N 1873, Springer-Verlag, pp. 312–321.
Personalizing Digital Libraries for Learners
Su-Shing Chen, Othoniel Rodriguez, Chee-Yoong Choo, Yi Shang, and Hongchi Shi
University of Missouri-Columbia, Columbia, MO 65211
[email protected]
Abstract. User-centered digital libraries for education are developed. Instead of static contents on the web, searched and retrieved in the traditional sense, we investigate personalized, dynamic information seeking in the learning environment of digital libraries. Learning objects and user profiles are important components of the digital library system. Based on their existing metadata standards, personalizing agents are designed and developed for realizing peer-to-peer educational and learning technologies on the Internet.
and accomplishments of learners and learner-groups, (2) engaging a learner in a learning experience pedagogically, and (3) discovering learning opportunities for learners in the LOVE collection and the NBDL digital library. The goal of the NSF NSDL Program is to enhance education by accumulating modern technologies, not to completely replace teachers by web-based learning systems. Thus LOVE is an intelligent environment supporting peer-to-peer learning. It exploits technologies, such as multiagent systems (e.g., [12]), but also the human involvement of teachers, (K-12) parents, reviewers, editors, and students, all as users.
2 LOVE: Design Objectives
LOVE is intended for the community of teachers, (K-12, university, and life-long) learners, authors, editors, and reviewers. It complements the primary collections of NBDL. Our main design objectives for LOVE as a digital library are adaptivity, interactivity, and openness. Adaptivity is needed to select and customize the learning resources to the learners and to the context in which the learning is taking place. These two aspects exhibit a wide range of variability for digital libraries. Such systems cannot make a priori assumptions about the characteristics of the learner, such as educational background, cognitive style, etc., nor about the context and purpose of the learning process. Instead, the system must be able to adapt dynamically based on explicit knowledge about these aspects, which needs to be maintained independently of the more generic learning-content knowledge. In the following, we describe the overall NBDL architecture, of which LOVE is a subsystem:
Name, Class & Object Libraries
User Interface & Search Engines
SMET Learning Object Virtual Exchange
Services
HTTP server Semantic Object Manager
LOVE
Emerge
PI
Persistent Collections & Resources
PI
PI
Persistent Collections & Resources
Persistent Collections & Resources
PI PI=Protocol Interface
Persistent Collections & Resources
Fig. 1. NBDL Architecture
In the LOVE collection, we will provide learning objects in standardized forms so that intelligent agents can index user profiles and learning objects and match them directly. In the persistent collections of legacy data (e.g., the Library of Congress and National Library of Medicine), we will not be able to directly match the two parties. Instead we would develop data mining techniques in the “Semantic Object Manager” and “Emerge” (search engine) for matching them. The LOVE architecture consists of
the collection managed by a community (e.g., a school district) and several networked services, which implement various educational technologies as multiagents described in later sections.
3 Learning Objects: Metadata Standards
The IEEE Learning Technology Standards Committee (IEEE-LTSC P1484) has undertaken the initiative of drafting a set of standards among which they define a data model for Learning Object Metadata (LOM) [6], [9]. This standard has received the endorsement of other consortiums dealing with educational standards such as ARIADNE, the IMS (Instructional Management Systems) Consortium, and SCORM (Shareable Courseware Object Reference Model) for the ADL-Net (Advanced Distributed Learning Network) within the DOD. Several of these standards are being endorsed by the IMS Consortium, who in addition is developing the Content Packaging Information Model, which describes a self-standing package of learning resources [7]. The IMS Content Packaging Information Model describes data structures that are used to provide interoperability of Internet-based content with content creation tools, learning management systems, and run-time environments. The objective of the IMS Content Packaging Information Model is to define a standardized set of structures that can be used to exchange content. These structures provide the basis for standardized data bindings that allow software developers and implementers to create instructional materials that interoperate across authoring tools, learning management systems and run-time environments that have been developed independently by various software developers.
The IEEE-LTSC LOM model is an abstract model; however, the IMS Consortium has provided one possible binding specification using pure XML and XML-Schema standards. The XML Schema introduces an unambiguous specification of low-level
and intermediate-level data types and structures that assure a higher level of interoperability between XML documents. We will adopt this model to develop our LOVE collection.
Fig. 2. IMS Content Packaging Conceptual Model [7]
The current LOM model includes the following nine metadata elements, some of which can occur multiple times:
1. General: title, cat/entry, language, description, keyword, coverage, aggregation level.
2. Lifecycle: version, status, contributor.
3. MetaMetaData: identifier, catalog/entry, contributor, metadata scheme, language.
4. Technical: format, size (bytes), location, requirements (installation, platforms), duration.
5. Pedagogical: interactivity type, learning resource type, interactivity level, semantic density, intended end user role, learning context, age range, typical learning time, description of how to be used.
6. Rights: cost, copyright, description of conditions of use.
7. Relation: kind, resource (target).
8. Annotation: person, date, description, comment.
9. Classification: taxon, taxon-path.
From this set, the Relation and Classification elements and sub-elements are specifically relevant from the perspective of supporting the LOVE design objectives: adaptivity, interactivity, and openness. The Relation element provides a best-practice set of controlled vocabulary for kind-of-relation pairings with the target LO or resource with which the current LO holds this kind of relation. Their use for navigation between LO's, in a kind of semantic network, is very important. Note that these relations do not necessarily have to link to other LO's, so relations with other resources that may provide associated active content are a possibility here. The Classification element provides the principal mechanism for extending the LOM model by allowing it to reference a Taxonomy and describe associated taxon-path sub-elements corresponding to the LO. Thus Classification provides for multiple alternative descriptions of the LO within the context and meaning of several Taxonomies. In our LOVE collection, LO's will represent small capsules of knowledge in a form suitable for didactic presentation and assimilation by learners. We believe that
the LO metadata standardization will introduce a large degree of interoperability and re-use, promoting widespread investment in, and adoption of, educational technology. Each learning object, by being highly atomic and complete in capturing a concept or “learning chunk”, provides the opportunity for the configuration of a large number of course variations. The resulting fine-grained course customization is expected to lead to “just-in-time”, “just-enough”, “just-for-you” training and performance-support courseware. This implies the traversal of a subject matter domain in a highly flexible and learner-specific way. However, this flexibility must comply with inter-LO dependencies and restrictions, which in turn will require new, goal-driven, more intelligent navigation facilities.
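As a rough illustration only, a LOVE learning object could carry metadata along the nine LOM categories listed above in a structure like the following Python sketch; the field values, the dictionary layout, and the example URL are invented for this text, and the normative binding remains the IMS XML/XML-Schema specification.

lo_record = {
    "general":     {"title": "Introduction to Photosynthesis", "language": "en",
                    "keyword": ["biology", "photosynthesis"], "aggregation_level": 1},
    "lifecycle":   {"version": "1.0", "status": "final"},
    "technical":   {"format": "text/html", "size": 24576,
                    "location": "http://example.org/love/photosynthesis"},
    "pedagogical": {"interactivity_type": "expositive", "learning_context": "K-12",
                    "typical_learning_time": "PT20M"},
    "rights":      {"cost": "no", "copyright": "yes"},
    "relation":    [{"kind": "requires", "resource": "love:cell-structure"}],
    "classification": [{"taxonomy": "IPTC", "taxon_path": ["Science", "Biology"]}],
}

def linked_resources(record, kind):
    # Follow the Relation element to navigate from one LO to related resources.
    return [rel["resource"] for rel in record.get("relation", []) if rel["kind"] == kind]

The Relation entry is what a navigation agent would follow between LO's, while the Classification entry anchors the object in an external taxonomy.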
4 User Profiles: Learner Model Standards
Tracking learners' patterns and preferences for adapting the digital library system to their needs is perhaps the most important aspect of user-centered digital libraries. In addition to LOM, the IMS is also defining a standard model for learners called the IMS LIP (Learner Information Packaging) Model [8]. IMS LIP is based on a data model that describes those characteristics of a learner needed for the general purposes of:
• Recording and managing learning-related history, goals, and accomplishments.
• Engaging a learner in a learning experience.
• Discovering learning opportunities for learners.
Since some IMS LIP elements are narrowly specific to the formal classroom learning context, and not relevant to digital libraries, we will not use them. In the user profiling of our LOVE collection, we will tentatively use the following metadata, subject to NSDL community decisions: Identification, Affiliation, Privacy, Security, Relationship, Accessibility (e.g., disability), Goal, Competency, Performance, Portfolio, Interest, Preference, and Activity.
The LOVE community is composed of learners, but also creators, reviewers, catalogers, librarians, and editors. Its user profiles are manifold. For example:
1. Learners: finding resources, managing resources once retrieved, sharing resources with others, talking to other community members;
2. Teachers: tutoring, advising, managing resources;
3. Parents: advising, coordinating with other parents and teachers;
4. Creators: contributing resources;
5. Reviewers: gaining access for quality assurance and review, rating resources, finding rated resources;
6. Editors: adding resource descriptions;
7. Catalogers: ingesting content into digital libraries; and
8. Librarians: analyzing the collection.
The relationships between learners, reviewers, editors, parents, catalogers, and librarians of LOVE are hierarchical, with different authorities. Learners are basic users without any authority, while reviewers and editors will have authority over the content materials of LO's. Catalogers and librarians will have the final authority to manage the LOVE collection. At present, we have not decided on the role of parents, because it will be a matter for each school district. At least, parents will have a supporting role for K-12 students. Either proactively or on demand, personalization services are derived by matching metadata patterns of both learning objects and learner/user models. Learners would have at least the following needs:
1. accessibility (e.g., styles, disabilities, cognitive level, language),
2. a personal profile of creators/vendors available to judge resource quality,
3. a dynamic “portfolio” of their own ratings and assessments,
4. a dynamic “portfolio” shared in a limited way with other LOVE members.
Getting profile information from learners/users will be iterative over time. Generally, we can expect users to provide profile information only incrementally. Moreover, profile information touches upon privacy issues. We are in the process of building user profiling for NBDL, which will be extended later to the whole NSDL Program.
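A minimal sketch of this matching idea is given below, assuming the simplified profile fields listed above and the record layout of the learning-object sketch in the previous section; the scoring rule is an invented placeholder, not the project's actual algorithm.

def match_score(profile, lo):
    # Score a learning object against a learner profile by comparing a few
    # metadata patterns; a higher score means a better personalized match.
    if lo["technical"].get("format") in profile.get("excluded_formats", []):
        return 0                                   # accessibility: unusable format
    score = 0
    if profile.get("language") == lo["general"].get("language"):
        score += 1
    if profile.get("learning_context") == lo["pedagogical"].get("learning_context"):
        score += 1
    interests = set(profile.get("interest", []))
    score += len(interests & set(lo["general"].get("keyword", [])))
    return score

def recommend(profile, collection, top_k=5):
    # Proactive personalization: rank the collection for this learner.
    ranked = sorted(collection, key=lambda lo: match_score(profile, lo), reverse=True)
    return [lo for lo in ranked if match_score(profile, lo) > 0][:top_k]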
5 Theories: Multiagent Systems
Personalization of digital libraries for education rests on theories of multiagent systems (e.g., [11], [12]). These are general multiagent theories specialized to education. Under our LOVE design objectives, personalization means system adaptivity, interactivity, and openness. Multiagents are required to manipulate and represent many user profiles, learning objects, and learning experiences. Multiagents are intelligent software that incorporate many Artificial Intelligence (AI) and Distributed AI research results, but they also exhibit emergent intelligent behavior resulting from the combined interaction among several agents and their shared environment. In our perspective, the key properties of multiagents are that they are autonomous, proactive, interactive, adaptive, scalable, and decentralized. These properties support personalization in several ways. Our contribution is to develop practical and scalable multiagents as operations on metadata describing user profiles, learning objects, and learning experiences. In the following, we define multiagents in the LOVE environment. Autonomous agents can incorporate a set of goals that steer their planning, behavior, and reactions as they carry out their assigned tasks. Autonomous agents can continuously support the interaction with users. Proactive agents take the initiative in guiding users based on high-level learning goals and low-level task selections. Interactive agents possess complex behavior and can incorporate meaningful responses to users by closely tracking their inputs and reactions. Interactivity means the agents, or significant portions of them, execute on the local desktop, thus avoiding network link latencies
and bandwidth limitations. Adaptive agents are capable of exhibiting a wide range of adaptivity, by properly modeling users and their contexts in user profiles, and by understanding learning-object interdependencies. Scalable agents are able to tackle problems whose computational complexity is usually not scalable when using standard monolithic algorithmic approaches. This is achieved by off-loading some of the potential agent-to-agent interaction complexity to interaction within a shared environment of standardized formats, styles, and protocols. Decentralized agents can interact directly with the user without tight coupling with central services, thus reducing the performance requirements on the network and improving scalability. In the LOVE environment, agents are inherently modular services and are able to communicate asynchronously using a certain inter-agent language in XML. Thus our multiagent architecture is highly flexible, allowing the re-use and creative recombination of different agents and the independent development and deployment of new agents with improved functionality and performance, embodying new ideas on digital libraries. Multiagents permit experimentation and incremental improvement of the system. They also make possible interdisciplinary work and cooperation among different NSDL teams, allowing highly focused development of specialized agents that can be tested and deployed to real-world learning environments without having to make the system obsolete and start from scratch every time. The re-usability of learning objects, coupled with the modularity of agents, provides for steady and continuous improvement both in the quality of content and in the quality of delivery of digital libraries. Although we can conceive of several partitioning alternatives for the minimum set of functions required within the LOVE learning environment, an intuitively appealing break-up is one that follows the traditional human roles in current educational settings. This role redistribution among agents might not necessarily be the best one, but it has the advantage of helping the architectural specification of the LOVE learning environment by exploiting existing metaphors that enhance comprehension of the NSDL architecture. Actual experience with the NSDL Program may suggest a more optimized approach to role partitioning and redistribution. The following is a sample of the kinds of personalizing agents that we will develop. The learner-agent is a proxy for the learner in interactions with other parts of the system. Its responsibilities include learner enrollment into a particular course, presentation of pre-selected material to the learner, capturing learner responses, forwarding these to a teacher-agent, and also forwarding learner input during learner-driven navigation. The teacher-agent is responsible for performing the role of an intelligent tutor, choosing the optimal navigation of the course learning units and the interactive tasks required from the learner. The recommender-agent provides content adaptation and navigation support for learner-driven navigation. A major responsibility of this agent is updating the learner model, including keeping track of the learner's cumulative performance portfolio. The course-agent is customized based on the content of a course. It is responsible for the unwrapping of course content packages and the timely retrieval of all learning resources identified by a course manifest, including active content and any additional delivery mechanisms.
The register-agent manages the learner enrollments into courses, and may enforce some desired curricular sequencing among courses, and other high level policies dealing with long-term learning and development goals. This agent is responsible for the security and privacy of learner public and private information, its storage and retrieval.
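The agent roles above can be thought of as lightweight services that exchange XML messages asynchronously; the following sketch is an assumption of this text rather than the LOVE implementation, and shows a learner-agent forwarding a captured learner response to a teacher-agent over an in-process queue.

import asyncio
from xml.etree.ElementTree import Element, SubElement, tostring

def response_message(learner_id, unit, answer):
    # Wrap a learner response in a small XML inter-agent message.
    msg = Element("message", {"from": "learner-agent", "to": "teacher-agent"})
    SubElement(msg, "learner").text = learner_id
    SubElement(msg, "unit").text = unit
    SubElement(msg, "answer").text = answer
    return tostring(msg, encoding="unicode")

async def learner_agent(outbox):
    # Proxy for the learner: captures a response and forwards it.
    await outbox.put(response_message("s42", "unit-3", "B"))
    await outbox.put(None)                          # no more messages

async def teacher_agent(inbox):
    # Intelligent-tutor role: reacts to each forwarded response.
    while (msg := await inbox.get()) is not None:
        print("teacher-agent received:", msg)       # choose the next learning unit here

async def main():
    queue = asyncio.Queue()
    await asyncio.gather(learner_agent(queue), teacher_agent(queue))

asyncio.run(main())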
6 Learning Technologies and Implementation Issues
The two major predecessor learning technologies impacting education are intelligent tutoring systems (ITS) and adaptive hypermedia systems (AHS) (e.g., Brusilovsky [1]). The goal of ITS is the use of knowledge about the domain and learner profiles to support personalized learning. There are at least four core functions: curriculum sequencing, intelligent analysis of the student's solutions, interactive problem-solving support, and example-based problem-solving support. We will adapt these ideas to the LOVE environment. Curriculum sequencing is to provide the learner with the most suitable, individually planned sequence of knowledge units to learn and sequence of learning tasks (examples, questions, problems, etc.). Sequencing is further divided into active sequencing, dealing with a learning goal, and passive sequencing, dealing with remedial knowledge. Active sequencing can involve fixed system-level selection of goals and adjustable learner selection of a subset of the goals. Most systems provide high-level sequencing of knowledge in concepts, lessons, and topics, and low-level sequencing of tasks in problems, examples, and examinations within a high-level goal. Not all systems adopt intelligent sequencing at both the high and the low level. The learner knowledge is used to drive active sequencing as a function of the “gap” between the goals and the knowledge. Sequencing can also be driven by learner preferences with regard to lesson media. The sequencing can be generated statically, before the learner begins interacting with the system, or dynamically, while the learning process is taking place. Curriculum sequencing has become a favorite technology in web-based learning due to its relatively easy implementation. Historically, most ITS had focused on problem-solving support technologies, with sequencing being left as the responsibility of a human tutor. Problem-solving support technologies are intelligent analysis of the learner's solutions, interactive problem solving, and example-based problem-solving support. Intelligent analysis of the learner's solutions uses the learner's final answers to perform knowledge diagnosis, provide error feedback, and update the learner model. Interactive problem-solving support continually tracks the learner's problem-solving process, identifies difficulties, and can provide an error indication for each individual step, hint at alternative solutions, or provide wizard-like step-by-step help. These interactive tutors can not only help the learner every step of the way but also update the learner model. Example-based problem-solving support shifts the focus from identifying errors or step-by-step support to suggesting previously solved examples that are relevant to the problem at hand. Adaptive hypermedia systems (AHS) will be another important feature of LOVE. Adaptive presentation is to adapt the content of a hypermedia page to the learner's goal, knowledge, and other information stored in the learner model. In this technology, pages are not static, but are adapted to learner goals, knowledge level, etc. Some systems perform low-level conditional text techniques, while others can generate adaptable summaries or prefaces to pages. The latter can take the form of adaptively inserted warnings about learner readiness to learn a given page. Adaptive navigation support is to support the learner in hyperspace orientation and navigation by changing the appearance of visible links.
The technique can be seen as a generalization of curriculum sequencing, but within the hypermedia context and offering more options for direct/indirect guidance. Direct guidance guides learners to the next “best” link. Contrary to curriculum sequencing, where pages are built on demand and only the system
can guide learners to a page, here the page must pre-exist. Direct guidance usually provides one-level sequencing versus the two-level sequencing available in traditional curriculum sequencing. Adaptive link annotation modifies the link colors, associates icons with links, and provides other differentiation cues that help learner selection. Adaptive link hiding makes the links selectively invisible when the learner is not ready to learn that material. Adaptive link sorting sorts links in terms of the next best choice for navigation. Adaptive collaboration support forms different matching groups for collaboration, like identifying a collaboration group adequate to a learner's characteristics or finding a qualified learner-tutor among the other learners. Finally, intelligent class monitoring looks for mismatching user profiles or outlying learners. The theory of multiagents provides the foundation of the learner-agent, teacher-agent, and course-agent, and learning technologies supply the didactic model of the LOVE environment. However, practical implementation issues involve a framework of browsers, applets, servlets, and distributed services in the LOVE environment. In the browser, the active portion is at the client in the form of an applet. It takes advantage of the browser facilities to present material and interact with the learner. Although this allows the applet to be a relatively thin client, the reliance on the browser imposes and inherits all the browser limitations, constraining the ultimate flexibility. For example, a heavyweight browser must always remain in the background, and if the browser window is closed the applet is also terminated. In addition, interaction between the applet and the network is severely constrained. The distributed service framework implements decentralized agents. A portion of the functionality is implemented as a service and other parts as a remote service-object that accesses the service. The client side is the service-object that knows how to communicate with the parent service, probably using a proprietary protocol. The client-side service-object is downloaded from a look-up service that makes the parent service publicly available through a discovery protocol. For example, this is the scheme implemented by JINI [4]. One of the advantages of this approach is the centralization of the learner model at the parent-service site. A potential disadvantage is the centralized nature of the parent service.
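As a small, self-contained illustration of the adaptive navigation support described above (the prerequisite map, the learner-model format, and the three annotation states are assumptions of this sketch, not features of any of the cited systems):

def annotate_links(page_links, prerequisites, learner_known):
    # Adaptive link annotation/hiding: mark each outgoing link as 'ready',
    # 'not-ready' (a candidate for hiding), or 'learned', from the learner model.
    annotated = {}
    for target in page_links:
        missing = set(prerequisites.get(target, [])) - learner_known
        if target in learner_known:
            annotated[target] = "learned"
        elif missing:
            annotated[target] = "not-ready"
        else:
            annotated[target] = "ready"
    return annotated

# A learner who has mastered 'fractions' but not 'algebra-1':
links = ["algebra-2", "geometry-1"]
prereq = {"algebra-2": ["algebra-1"], "geometry-1": ["fractions"]}
print(annotate_links(links, prereq, {"fractions"}))
# -> {'algebra-2': 'not-ready', 'geometry-1': 'ready'}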
Fig. 3. The LOVE Architecture
References
[1] Brusilovsky, P., Adaptive and Intelligent Technologies for Web-based Education, in: C. Rollinger and C. Peylo (eds.), Künstliche Intelligenz, Special Issue on Intelligent Systems and Teleteaching, 1999, 4, 19-25.
[2] Chen, S., Digital Libraries: The Life Cycle of Information, Better Earth Publisher, 1998, http://www.amazon.com.
[3] Deitel, H.M., Deitel, P.J., Nieto, T.R., Internet and World Wide Web: How to Program, Prentice-Hall.
[4] Edwards, W., Core JINI, Sun Microsystems Press, 1999.
[5] Futrelle, J., Chen, S., and Chang, K., NBDL: A CIS framework for NSDL, The First ACM-IEEE Joint Conference on Digital Libraries, Roanoke, VA, June 24-28, 2001.
[6] IMS Learning Resource Metadata Information Model, http://www.imsproject.org/metadata/.
[7] IMS Content Packaging Information Model, http://www.imsproject.org/content/packaging/.
[8] IMS Learner Information Packaging Model, http://www.imsproject.org/profiles/lipinfo01.html.
[9] LOM: Base Scheme - v3.5 (1999-07-15), http://ltsc.ieee.org/doc/wg12/scheme.html.
[10] SMETE.ORG, http://www.smete.org/nsdl/.
[11] Shang, Y. and Shi, H., IDEAL: An integrated distributed environment for asynchronous learning, Distributed Communities on the Web, LNCS, No. 1830, Kropf et al. (eds.), pp. 182-191.
[12] Weiss, G. (ed.), Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, The MIT Press, 1999.
Interface for WordNet Enrichment with Classification Systems
Andrés Montoyo, Manuel Palomar, and German Rigau
Department of Software and Computing Systems, University of Alicante, Alicante, Spain, {montoyo, mpalomar}@dlsi.ua.es
Departament de Llenguatges i Sistemes Informàtics, Universitat Politécnica de Catalunya, 08028 Barcelona, Spain, [email protected]
Abstract. This paper presents an interface that incorporates a method to semantically enrich WordNet 1.6 with categories or classes from other classification systems. In order to build the enriched WordNet, it is necessary to create an interface to label WordNet with categories from the different available classification systems. We describe features of the design and implementation of the interface to obtain extensions and enhancements of the WordNet lexical database, with the goal of providing the NLP community with additional knowledge. The experimental results, when the method is applied to the IPTC Subject Reference System, show that this may be an accurate and effective method to enrich the WordNet taxonomy. The interface has been implemented using the C++ programming language and provides a visual framework.
for linking Spanish and French words from bilingual dictionaries to WordNet synsets are described in [18]. A mechanism for linking LDOCE and DGILE taxonomies using a Spanish/English bilingual dictionary and the notion of Conceptual Distance between concepts is described in [19]. The work reported in [4] used LDOCE and Roget's Thesaurus to label LDOCE. A robust approach for linking already existing lexical/semantic hierarchies, in particular WordNet 1.5 onto WordNet 1.6, is described in [5]. This paper presents an interface that incorporates a method to semantically enrich WordNet 1.6 with categories or classes from other classification systems. In order to build the enriched WordNet, it is necessary to create an interface to label WordNet with categories from the different available classification systems. The organisation of this paper is as follows. After this introduction, in Section 2 we describe the technique used (Word Sense Disambiguation (WSD) using the Specification Marks Method) and its application. In Section 3, we briefly describe the method for labelling the noun taxonomy of WordNet. In Section 4, we describe the user interface that allows the enrichment of WordNet. In Section 5, some experiments related to the proposed method are presented, and finally, conclusions and an outline of further lines of research are given.
2 Specification Marks Method
WSD with Specification Marks is a method for the automatic resolution of the lexical ambiguity of groups of words whose different possible senses are related. The method requires knowing how many of the words are grouped around a specification mark, which is similar to a semantic class in the WordNet taxonomy. The word sense in the sub-hierarchy that contains the greatest number of words for the corresponding specification mark will be chosen for the sense disambiguation of a noun in a given group of words. A detailed explanation of the method can be found in [12], while its application to NLP tasks is addressed in [15].
2.1 Algorithm Description
The algorithm with Specification Marks consists basically of the automatic sense disambiguation of nouns that appear within the context of a sentence and whose different possible senses are related. The context of a noun is the group of words that co-occur with it in the sentence and their relationship to the noun to be disambiguated. The disambiguation is resolved with the use of the WordNet lexical knowledge base (1.6). The input for the WSD algorithm will be the group of words w = {w1, w2, ..., wn}. Each word wi is sought in WordNet, and each one has an associated set si = {si1, si2, ..., sin} of possible senses. Furthermore, each sense has a set of concepts in the IS-A taxonomy (hypernym/hyponym relations). First, the concept that is common to all the senses of all the words that form the context is sought. We call this concept the Initial Specification Mark (ISM), and if it does not immediately resolve the ambiguity of the word, we descend from one level to another through WordNet's hierarchy, assigning
new Specification Marks. The number of concepts contained in the subhierarchy will then be counted for each Specification Mark. The sense that corresponds to the Specification Mark with the highest number of words will then be chosen as the disambiguated sense of the noun in question, within its given context.
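A compressed reading of the algorithm can be sketched with the NLTK WordNet interface; note that the paper uses WordNet 1.6 while NLTK ships a later version, so the sense numbering differs, and the sketch simplifies the method by scoring each candidate sense by how many context nouns share a specification mark with it instead of descending the hierarchy level by level.

from nltk.corpus import wordnet as wn          # requires nltk.download('wordnet')

def ancestors(sense):
    # A sense together with all of its hypernyms (its candidate specification marks).
    return {sense} | set(sense.closure(lambda s: s.hypernyms()))

def specification_marks_wsd(target, context):
    # Pick the sense of `target` whose hypernym subhierarchy gathers the most
    # context nouns (a simplified reading of the Specification Marks method).
    context_marks = [set().union(*(ancestors(s) for s in wn.synsets(w, pos=wn.NOUN)))
                     for w in context if wn.synsets(w, pos=wn.NOUN)]
    best, best_count = None, -1
    for sense in wn.synsets(target, pos=wn.NOUN):
        marks = ancestors(sense)
        count = sum(1 for cm in context_marks if cm & marks)
        if count > best_count:
            best, best_count = sense, count
    return best

print(specification_marks_wsd("plant", ["leaf", "root", "flower"]))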
2.2 Heuristics
At this point, we should like to point out that after having evaluated the method, we subsequently discovered that it could be improved, providing even better results in disambiguation. The results obtained in [13] demonstrate that when the method is applied with the heuristics, the percentage of correct resolutions increases. We therefore define the following heuristics:
Heuristic of Hypernym: This heuristic solves the ambiguity of those words that are not directly related in WordNet (e.g., plant and leaf), but where the word that forms the context is in some composed synset of a hypernym relationship for some sense of the word to be disambiguated (e.g., leaf#1 → plant organ).
Heuristic of Definition: With this heuristic, the word sense is obtained using the definition (the gloss used in the WordNet system) of the words to be disambiguated (e.g., sister, person, musician).
Heuristic of Common Specification Mark: With this heuristic, the problem of fine-grainedness is resolved (e.g., year, month). To disambiguate the word, the first Specification Mark that is common to the resulting senses of the above heuristics is checked. As this is the most informative of the senses, it is chosen. By means of this heuristic, the method tries to resolve the problem of the fine-grainedness of WordNet: since in most cases the senses of the words to be disambiguated differ very little in nuances, and as the context is a rather general one, it is not possible to arrive at the most accurate sense.
Heuristic of Gloss Hypernym: This heuristic resolves the ambiguity of those words that are neither directly related in WordNet nor appear in some composed synset of a hypernym relationship for some sense of the word to be disambiguated. To solve this problem we use the gloss of each synset of a hypernym relationship.
Heuristic of Hyponym: This heuristic resolves the ambiguity of those words that are not directly related in WordNet (e.g., sign and fire), but where the word that forms the context is in some composed synset of a hyponym relationship for some sense of the word to be disambiguated (e.g., sign#3 → visual signal → watch fire).
Heuristic of Gloss Hyponym: This heuristic resolves the ambiguity of those words that are neither directly related in WordNet nor appear in some composed synset of a hyponym relationship for some sense of the word to be disambiguated. To resolve this problem we use the gloss of each synset of a hyponym relationship.
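For instance, the hypernym and gloss-hypernym heuristics can be approximated as follows (again with the WordNet version shipped with NLTK rather than WordNet 1.6, so the exact synsets involved may differ from the paper's examples):

from nltk.corpus import wordnet as wn

def hypernym_heuristic(sense, context_word):
    # Accept `sense` if the context word appears among the lemmas (Heuristic of
    # Hypernym) or in the gloss (Heuristic of Gloss Hypernym) of one of its hypernyms.
    for hyper in sense.closure(lambda s: s.hypernyms()):
        if context_word in hyper.lemma_names() or context_word in hyper.definition():
            return True
    return False

# e.g. the first noun sense of 'leaf' has a plant-organ hypernym whose gloss mentions 'plant'
leaf_1 = wn.synsets("leaf", pos=wn.NOUN)[0]
print(hypernym_heuristic(leaf_1, "plant"))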
3 WordNet Enrichment
The classification systems provide a means of arranging information so that it can be easily located within a library, World Wide Web, newspapers, etc. On the other
hand, WordNet presents word senses that are too fine-grained for NLP tasks. We define a way to deal with this problem, describing an automatic method to semantically enrich WordNet 1.6 with categories or classes from the classification systems using the Specification Marks Method. Categories, such as Agriculture, Health, etc., provide a natural way to establish semantic relations among word senses. These groups of nouns are the input for the WSD module. This module will consult the WordNet knowledge base for all words that appear in the semantic category, returning all of their possible senses. The disambiguation algorithm will then be applied, and a new file will be returned in which the words have the correct sense as assigned by WordNet. After the new file has been obtained, it will be the input for the rules module. This module will apply a set of rules for finding the super-concept in WordNet. This super-concept in WordNet is labelled with its corresponding category of the classification system. A detailed explanation of the method can be found in [11].
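A compressed sketch of this enrichment pipeline is shown below; the rules module is reduced here to taking the lowest common hypernym of the disambiguated senses, which only approximates the actual rules of [11], and meronym labelling is left as a comment.

from nltk.corpus import wordnet as wn

def label_category(category, words, disambiguate):
    # Disambiguate the nouns of one classification-system category, choose a
    # super-concept, and label it and its full hyponym tree with the category.
    senses = [s for s in (disambiguate(w, words) for w in words) if s is not None]
    if not senses:
        return {}
    super_concept = senses[0]
    for s in senses[1:]:
        candidates = super_concept.lowest_common_hypernyms(s)
        if candidates:
            super_concept = candidates[0]
    labelled = {super_concept: category}
    for hypo in super_concept.closure(lambda s: s.hyponyms()):   # meronyms analogously
        labelled[hypo] = category
    return labelled

# e.g. label_category("Health", ["disease", "illness", "epidemic"], my_wsd),
# where my_wsd is any noun disambiguator, such as the Specification Marks sketch above.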
4 Interface
In order to build the enriched WordNet it is necessary to create an interface to label WordNet with categories from the different available classification systems. This interface is made up of a set of computer programs that do all the work leading ultimately to a labelled lexical knowledge base of WordNet. This section describes features of the design and implementation of the interface to obtain extensions and enhancements of the WordNet lexical database, with the goal of providing the NLP community with additional knowledge. The design of the interface is composed of four processes: (i) selecting the classification systems and their categories, (ii) resolving the lexical ambiguity of each word, (iii) finding the super-concept, and (iv) organizing and formatting the WordNet database. These processes are illustrated in Figure 1. In order to validate our study, we implemented the interface using the C++ programming language. It is shown in Figure 2, with the necessary explanations given below. Due to the physical distance between the different members of the research group who use the interface, it has been developed to work over a local area network (LAN). The user interface offers the following operations:
Select the classification system. A classification system selection window contains option buttons. The user clicks on the appropriate button to select the desired classification system. We have considered classification systems such as IPTC, the Dewey classification, the Library of Congress Classification, and Roget's.
Open category. The user clicks on this command button to select a category of the classification system chosen in the previous step. The group of words that belong to the selected category appears in the left text window of the interface, named Input Category.
Figure 1: Interface Process (diagram: select a classification system – IPTC, Dewey, Roget, Library of Congress – then resolve the lexical ambiguity of each word with the WSD algorithm and WordNet, find the super-concept by applying the rules, and organize and format the enriched WordNet)
Run Interface. The two processes, resolving the lexical ambiguity and finding the super-concept, were implemented in a single function. The Run Interface command button allows one to run this function, and the output information for the group of words of the selected category appears in the right text window of the interface, named Output Labelled Synsets. This output information is made up of the WordNet sense and the super-concept obtained for each word belonging to the category. For example:
    WordNet Sense Word: {10129713} disease#1
    Super-Concept: {10120678} ill Health
Save Category. If this command button is clicked, the information above is organized, formatted, and stored in the WordNet lexical database for each super-concept, its full hyponyms, and its meronyms.
In the second approach, we tested the Specification Marks Method on word clusters related by categories over the IPTC Subject Reference System. The percentage of correct resolution was 96.1%. This high percentage is due to the fact that the method uses the knowledge of how many of the words in the context are grouped around a semantic class in the WordNet taxonomy. Once it had been shown that the WSD Specification Marks Method works well with classification systems, we tested the method of combining the semantic categories of IPTC and WordNet. For each IPTC category we counted the number of WordNet synsets correctly labelled, synsets incorrectly labelled, and words left unlabelled (whose synsets are not in WordNet). We then evaluated the precision, coverage, and recall of the method (see footnote 1), obtaining 95.7%, 93.7%, and 89.8%, respectively.
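Given the definitions in footnote 1 at the end of this section, these figures are straightforward ratios over the per-category counts; the counts in the following sketch are invented placeholders, not the actual evaluation data.

def evaluation(correct, incorrect, unlabelled):
    # Precision, coverage and recall over labelled synsets, as defined in footnote 1.
    answered = correct + incorrect
    total = answered + unlabelled
    return correct / answered, answered / total, correct / total

precision, coverage, recall = evaluation(correct=900, incorrect=40, unlabelled=60)
print(f"precision={precision:.1%}  coverage={coverage:.1%}  recall={recall:.1%}")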
6 Conclusion and Further Work
This paper applies the WSD Specification Marks Method to assign a category of a classification system to a WordNet synset, as well as to its full hyponyms and meronyms. We thus enrich the WordNet taxonomy with categories of the classification system. The experimental results, when the method is applied to the IPTC Subject Reference System, indicate that this may be an accurate and effective method to enrich the WordNet taxonomy. The WSD Specification Marks Method works successfully with classification systems, that is, with categories subdivided into groups of words that are strongly related. Although this method has been tested on the IPTC Subject Reference System, it can also be applied to other systems that group words around a single category, such as the Library of Congress Classification (LC), Roget's Thesaurus, or the Dewey Decimal Classification (DDC). A relevant consequence of the application of the method to enrich WordNet is the reduction of word polysemy (i.e., the number of categories for a word is generally lower than the number of senses for the word). That is, category labels (e.g., Health, Sports, etc.) provide a way to establish semantic relations among word senses, grouping them into clusters. Therefore, this method intends to resolve the problem of the fine-grainedness of WordNet's sense distinctions [6]. Researchers are therefore capable of constructing variants of WSD, because for each word in a text a category label has to be chosen instead of a sense label.
1 Precision is given by the ratio between correctly labelled synsets and the total number of answered (correctly and incorrectly) labelled synsets. Coverage is given by the ratio between the total number of answered labelled synsets and the total number of words. Recall is given by the ratio between correctly labelled synsets and the total number of words.
Acknowledgements
This research has been partially funded by the EU Commission (NAMIC IST-1999-12302) and the Spanish Research Department (TIC2000-0335-C03-02 and TIC2000-0664-C02-02).
References
1. Ageno A., Castellón I., Ribas F., Rigau G., Rodríguez H., and Samiotou A. 1994. TGE: Tlink Generation Environment. In Proceedings of the 15th International Conference on Computational Linguistics (COLING'94). Kyoto, (Japan).
2. Alvar M. 1987. Diccionario General Ilustrado de la Lengua Española VOX. Bibliograf S.A. Barcelona, (Spain).
3. Byrd R. 1989. Discovering Relationships among Word Senses. In Proceedings of the 5th Annual Conference of the UW Centre for the New OED, pages 67-79. Oxford, (England).
4. Chen J. and Chang J. 1998. Topical Clustering of MRD Senses Based on Information Retrieval Techniques. Computational Linguistics 24(1): 61-95.
5. Daudé J., Padró L. and Rigau G. 2000. Mapping WordNets Using Structural Information. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL'00). Hong Kong.
6. Ide N. and Véronis J. 1998. Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art. Computational Linguistics 24(1): 1-40.
7. Knight K. 1993. Building a Large Ontology for Machine Translation. In Proceedings of the ARPA Workshop on Human Language Technology, pages 185-190. Princeton.
8. Knight K. and Luk S. 1994. Building a Large-Scale Knowledge Base for Machine Translation. In Proceedings of the American Association for Artificial Intelligence.
9. Miller G. A., Beckwith R., Fellbaum C., Gross D., and Miller K. J. 1990. WordNet: An on-line lexical database. International Journal of Lexicography 3(4): 235-244.
10. Miller G., Leacock C., Randee T. and Bunker R. 1993. A Semantic Concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology, pages 303-308, Plainsboro, (New Jersey).
11. Montoyo A., Palomar M. and Rigau G. 2001. WordNet Enrichment with Classification Systems. In WordNet and Other Lexical Resources: Applications, Extensions and Customisations Workshop (NAACL-01), The Second Meeting of the North American Chapter of the Association for Computational Linguistics. Carnegie Mellon University, Pittsburgh, PA, USA.
12. Montoyo A. and Palomar M. 2000. Word Sense Disambiguation with Specification Marks in Unrestricted Texts. In Proceedings of the 11th International Workshop on Database and Expert Systems Applications (DEXA 2000), pages 103-108. Greenwich, (London).
13. Montoyo A. and Palomar M. 2001. Specification Marks for Word Sense Disambiguation: New Development. In Proceedings of the 2nd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2001). México D.F. (México).
14. Okumura A. and Hovy E. 1994. Building a Japanese-English dictionary based on ontology for machine translation. In Proceedings of the ARPA Workshop on Human Language Technology, pages 236-241.
130
A. Montoyo, M. Palomar, and G. Rigau
Intelligent Text Processing and Computational Linguistics (CICLing-2001). México D.F. (México). 16. Procter P. 1987. Longman Dictionary of common English. Longman Group. England. 17. Rigau G. 1994. An Experiment on Automatic Semantic Tagging of Dictionary Senses. In International Workshop the Future of the Dictionary. Grenoble, (France). 18. Rigau G. and Agirre E.1995. Disambiguating bilingual nominal entries against WordNet. Seventh European Summer School in Logic, Language and Information (ESSLLI´95). Barcelona, (Spain). 19. Rigau G., Rodriguez H., and Turmo J. 1995. Automatically extracting Translation Links using a wide coverage semantic taxonomy. In proceedings fifteenth International Conference AI´95, Language Engineering´95. Montpellier, (France). 20. Risk O. 1989. Sense Disambiguation of Word Translations in Bilingual Dictionaries: Trying to Solve The Mapping Problem Automatically. RC 14666, IBM T.J. Watson Research Center. Yorktown Heights, (United State of America).
An Architecture for Database Marketing Systems

Sean W.M. Siqueira¹, Diva de S. e Silva¹, Elvira Mª A. Uchôa¹, Mª Helena L.B. Braz², and Rubens N. Melo¹

¹ PUC-Rio, Rua Marquês de São Vicente, 255, Gávea, 22453-900, Rio de Janeiro, Brazil
{sean, diva, elvira, rubens}@inf.puc-rio.br
² DECivil/ICIST, Av. Rovisco Pais, Lisboa, Portugal
[email protected]
Abstract. Database Marketing (DBM) refers to the use of database technology for supporting marketing activities. In this paper, an architecture for DBM systems is proposed. This architecture was implemented using HEROS - a Heterogeneous Database Management System as integration middleware. Also, a DBM metamodel is presented in order to improve the development of DBM systems. This metamodel arises from the main characteristics of marketing activities and basic concepts of Data Warehouse technology. A systematic method for using the proposed DBM architecture is presented through an example that shows the architecture's functionality.
HEROS HDBMS as integration middleware. In Section 4, the conceived DBM metamodel is described. In Section 5, a systematic method for using the architecture is detailed through an example that shows the architecture's functionality. Finally, in Section 6, related work and some final remarks are presented.
2 Fundamentals

In this paper, DBM denotes the use of database technology for supporting marketing activities, while marketing database (MktDB) refers to the database system. There are many different concepts in the specialized literature for DBM. PricewaterhouseCoopers [4] proposed three levels of DBM in order to better organize these concepts:
• Direct Marketing – Companies manage customer lists and conduct basic promotion performance analyses.
• Customer Relationship Marketing – Companies apply a more sophisticated, tailored approach and technological tools to manage their relationship with customers.
• Customer-centric Relationship Management – Customer information drives business decisions for the entire enterprise, thus allowing the retailer to dialogue directly with individual customers and ensure loyal relationships.
Besides these levels, some usual marketing functions/processes such as householding, prospecting, campaign planning/management, merchandise planning and cross selling are mentioned in ([4], [6]). These functions/processes should also be supported by DBM systems.
An architecture for DBM systems should satisfy operational and analytical requirements. Usually, analytical systems need the integration of data from several sources, internal and/or external to the corporation, in a MktDB ([9], [15]). This MktDB is used by marketing tools/systems for data analysis and also for planning and executing marketing strategies. Some tools like Xantel Connex ([33]), MarketForce ([7]) and The Archer Retail Database Marketing Software ([27]) consider only operational DBM requirements; thus they do not provide analytical aspects. Other tools, such as ProfitVision [16], Decisionhouse ([25]), SAP FOCUS ([13]), Pivotal Software ([24]) and ERM Central ([5]), consider only analytical DBM requirements. Finally, there are some tools, like the IBM marketing and sales application ([23]), that consider both operational and analytical DBM requirements. They have a MktDB that covers the analytical aspect and they allow operational applications to use their own databases, which are optimized for operational tasks.
Analytical processing in support of management's decisions has been researched in the DW context. According to William H. Inmon [31], a DW is a subject-oriented, integrated, non-volatile collection of data that is time-variant and used in support of management's decisions. Generally, a DW is modeled using the dimensional model [26], an intuitive technique for representing business models that allows high-performance access. Business models are domain dependent and refer to the area of the system. Usually, the dimensional model is implemented in a schema similar to a star.
This resulting schema is called the star schema. In this schema, there is a central data structure called the „fact structure“ that stores business measures (facts). Business measures are indicators of some action/activity of the business.
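To make the fact/dimension vocabulary concrete, the following sketch shows what a minimal star schema for a marketing database could look like in SQL; the table and column names are assumptions chosen for illustration and are not taken from the architecture described here.

```sql
-- Hypothetical star schema: dimension tables around a central fact structure.
CREATE TABLE dim_customer (
  customer_id   INTEGER PRIMARY KEY,
  name          VARCHAR(100),
  segment       VARCHAR(30)
);

CREATE TABLE dim_product (
  product_id    INTEGER PRIMARY KEY,
  description   VARCHAR(100),
  category      VARCHAR(30)
);

CREATE TABLE dim_time (
  time_id       INTEGER PRIMARY KEY,
  calendar_date DATE,
  month_no      INTEGER,
  year_no       INTEGER
);

-- Fact structure: each row stores business measures for one sale event,
-- linked to the dimensions that give the perspectives for analysis.
CREATE TABLE fact_sales (
  customer_id   INTEGER REFERENCES dim_customer(customer_id),
  product_id    INTEGER REFERENCES dim_product(product_id),
  time_id       INTEGER REFERENCES dim_time(time_id),
  quantity      INTEGER,        -- business measure
  revenue       DECIMAL(10,2)   -- business measure
);
```

Analytical queries then aggregate the measures stored in the fact table while filtering and grouping on the surrounding dimensions.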
3 Proposed Database Marketing Architecture

Based on the characteristics of operational and analytical DBM systems, an architecture for DBM systems is proposed and the use of a HDBMS as integration middleware is highlighted. This architecture is implemented using HEROS – a HDBMS.

3.1 Specification of the Proposed DBM Architecture

The work described in this paper presents a four-layer architecture (Fig. 1) for DBM systems.
Fig. 1. Proposed DBM Architecture
In the proposed architecture, the Data Sources layer refers to data from the production systems of the company and from external sources. External data sources refer to data outside the corporation, such as data obtained from market research or data from other corporations. They are important to complement the information of the company. Data sources can be centralized or distributed, homogeneous or heterogeneous, and comprise data relevant to marketing activities.
The Integration layer is responsible for providing an integrated view over the several component data sources, eliminating inconsistency and heterogeneity, and consolidating and aggregating data whenever necessary. In this layer, all the processes for identifying duplications, standardizing names and data types, comparing data, removing strange and excessive data, identifying synonyms and homonyms, and treating any other kind of heterogeneity/inconsistency are executed.
The Materialization layer refers to the materialization of the integrated data in a new database. This layer is responsible for giving persistence to the data resulting from the integration process. It allows better performance in the execution of marketing queries, because a query submitted to a persistent MktDB (where integrated data were previously stored) executes faster than performing the whole integration process „on the fly“.
The execution of the integration processes increases network traffic and processing time in the local systems, which must also execute local applications.
The Materialization layer is composed of a MktDB and an extractor:
• The MktDB corresponds to the database system responsible for storing the integrated data, guaranteeing their persistence and security, as well as allowing the DBM Application layer to access them. This MktDB stores current data (resulting from new data loads), historic data (resulting from previous loads that remain stored in the database) and a catalog for supporting the translation of the output of the integration middleware to the MktDB format.
• The extractor is responsible for activating the integration processes, translating the output of the integration middleware to the MktDB format and, finally, loading the MktDB. It generates a persistent view of the integrated data.
In the proposed architecture, the MktDB is used for data analysis. This database system is based on multidimensional modeling, which presents better performance in query-only environments. Operational applications can extract data from this MktDB to their own databases, where data are stored in a model that is more adequate to operational aspects.
The DBM Application layer embodies tools for marketing activities. These tools or systems access data from the MktDB and allow the execution of marketing functions/processes. Therefore, through this layer, it is possible to visualize, analyze and manipulate data from the MktDB. Generally, this layer is composed of OLAP tools, statistical and data mining tools and/or marketing-specific applications:
• OLAP tools present a multidimensional view of data, allowing sophisticated analyses through easy navigation and visualization of a large volume of data ([1]).
• Tools for statistical analysis and data mining¹ are used for client/product segmentation and valuation or for discovering patterns and information, allowing personalized services ([17], [18]).
• Finally, marketing-specific applications like campaign management, merchandise planning, media selection and scheduling, retention analysis and inventory management can also be performed in this layer.

3.1.1 Integration Middleware

Regardless of which requirements are considered in DBM systems, it is necessary to have integrated access to data from several sources. There are many different ways to provide this kind of access. In the database community, HDBMS are one of the solutions for integrating heterogeneous and distributed data. A HDBMS ([28], [3]) is a layer of software for controlling and coordinating heterogeneous, autonomous and pre-existing data sources, interconnected by communication networks. Heterogeneity means not only technological differences (hardware and software), but also differences in data models, database systems and semantics. The analysis of the level of integration that exists among the component systems allows heterogeneous database systems (HDBS) to be classified into tightly coupled or loosely coupled HDBS [3].
¹ Data mining refers to the extraction of hidden information from large databases. Data mining tools predict trends and future behavior.
In loosely coupled HDBS, the end-user must know in which sources the data he/she wants to access are located, and their paths; the HDBMS just supplies mechanisms to facilitate this access. In tightly coupled HDBS, the end-user has an integrated and homogeneous view of the data, which gives the illusion that there is only one system. The use of a tightly coupled HDBMS as integration middleware was considered appropriate for DBM systems for the following reasons:

Commercial products vs. HDBMS
Some DBM tools consider an existing database or use only their own database as data source, e.g. Xantel Connex and MarketForce. Other tools, such as ProfitVision and DIALOG++, presume the existence of a DW that would be responsible for the data integration processes. Other tools use proprietary solutions exploiting ODBC drivers to access/integrate data, e.g. The Archer™ Retail Database Marketing Software, Decisionhouse, SAP FOCUS, Pivotal Software, ERM Central and the IBM marketing and sales application. Commercial products generally behave as „black“ boxes. The integration procedures are hidden from the users that are responsible for the DBM definition. This fact obstructs the user's perception of the extraction and cleansing processes, creating the possibility of errors in the resulting data. Moreover, these commercial products do not consider semantic heterogeneity; it must be treated through other programs/procedures in some phase before the use of these products. The use of a tightly coupled HDBMS to integrate heterogeneous data presupposes that the database administrator responsible for the DBM definition knows the local data schemas and translation processes. Once this knowledge is made available to the HDBMS, it will treat the heterogeneity and will carry out the data integration processes. In order to include a new data source, the local user (DBA) needs only to define the characteristics of this source in the HDBMS. This approach contributes to a more organized and transparent process. Tightly coupled HDBMS support the development of a DBM without the need for external programs.

Mediators and Wrappers vs. HDBMS
Although no tool or research work using mediators [14] and wrappers [20] in DBM systems was found, they are frequently used for data access/integration. Wrappers and mediators are software programs developed to assist a specific class of problem. They usually work as „black“ boxes and do not allow access to their logic. Therefore, they are less flexible than a HDBMS as an integration middleware responsible for data extraction, transformation and integration. The use of a HDBMS implies representing data schemas and their mappings, which facilitates data understanding and project changes.

3.2 Implementation of the DBM Architecture

Once the use of a tightly coupled HDBMS as integration middleware was considered appropriate, it was decided to use HEROS – HEteRogeneous Object System – in the proposed architecture. HEROS is a tightly coupled HDBMS under development in the Computer Science Department of PUC-Rio (Catholic University of Rio de Janeiro). It
allows the integration of a set of HDBS in a federation. These HDBS are cooperative but autonomous, and queries and updates can be executed with transparency with respect to data location, access paths and any heterogeneity or redundancy [10]. Therefore, the HEROS HDBMS is responsible for all the processes of data integration and consolidation. The use of HEROS allows any kind of system, even non-conventional ones, to be easily integrated; it is only necessary to specialize some classes in HEROS' data model [11]. The extractor, which is responsible for activating HEROS, triggering data integration, decoding HEROS' output and loading the MktDB, was implemented using C++ for its core and Visual Basic for the front end. The database management system (DBMS) used for the MktDB was Oracle 8.0.
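As a rough illustration of the loading step performed by the extractor, the sketch below assumes that the translated output of the integration middleware has already been placed in a staging table (stg_customer, an invented name) and shows how current and historic data could be kept in the MktDB; it is not the actual extractor code.

```sql
-- Hypothetical load step: append the newly integrated data, time-stamping
-- each load so that historic data from previous loads is preserved.
INSERT INTO mktdb_customer (customer_id, name, city, load_date)
SELECT s.customer_id, s.name, s.city, CURRENT_DATE
FROM   stg_customer s;

-- Current data can then be exposed as the most recent load of each customer.
CREATE VIEW mktdb_customer_current AS
SELECT c.*
FROM   mktdb_customer c
WHERE  c.load_date = (SELECT MAX(c2.load_date)
                      FROM   mktdb_customer c2
                      WHERE  c2.customer_id = c.customer_id);
```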
4 Proposed Database Marketing Metamodel

In the development of DBM systems, it is very important to understand marketing processes and activities. A DBM metamodel could incorporate marketing semantics in order to guide the project and therefore provide higher quality DBM systems. However, during this research, no DBM metamodel was found in the literature, so it was decided to propose a DBM metamodel based on fundamental concepts related to marketing activities.

4.1 Characteristics of Marketing Activities

A marketing activity implies answering four important questions [30]:
• Who should I target?
• What should I target them with?
• When should I do it?
• How should I bring the offer to market?
Since marketing activities refer to the exchange of products/services, it is possible to generalize the considerations above:
• An exchange involves a deal between two (or more) partners. Thus, „Who“ represents the corporation's partner in a marketing activity.
• In this exchange relationship, the corporation must offer some product or service. „What“ refers to the product or service that the corporation is offering.
• A marketing activity occurs at a specific moment in time. Thus, „When“ is a temporal aspect in the MktDB and represents the moment of the exchange.
• Finally, the characteristics of the marketing activity are represented by „How“. It is possible to detail it in two aspects:
  • Which promotion channel should I use? – representing the communication channel used to present the promotion.
  • How should the promotion be done? – representing the promotional strategy, discount policies, etc.
Analyzing the three DBM levels (Section 2) and the functions/processes supported by DBM systems, two new important questions, not found in the literature, were introduced:
• Where should I offer it? – representing a spatial aspect in the MktDB. It is related to the physical and/or geographic space where the marketing activity occurs.
• Why should I do it? – representing the purpose of the marketing activity.

4.2 Metamodel

To express these fundamental questions in DBM systems, a metamodel was proposed. This DBM metamodel explores concepts and characteristics of marketing activities and uses some concepts of multidimensional modeling (fact and dimension structures) that are used in the DW area. It increases the semantics and brings the analytical perspective to DBM systems.
Fig. 2. DBM Metamodel
Fig. 2 represents the DBM metamodel using the UML notation [22]. As DBM refers to the use of database technology to support marketing activities, a DBM system may be represented as a set of marketing activities (Mkt_Activity) such as sales or promotions. If the company executes these activities, they are considered Real (representing internal data, or complementary data if they come from an external source). If these activities refer to benchmarking or monitoring of other companies' activities, Prospecting represents them. Each marketing activity has at least one „perspective of the data“ (Dimension). This perspective provides information for comparative analysis. Exploring the concepts and characteristics of marketing activities, the questions Who, What, When, Where, How, Why and Which are the different perspectives of data about marketing activities and represent the Dimensions. Finally, if there is more than one dimension, they are related to a specific subject that is responsible for combining them. This specific subject is represented by Fact. More detail can be found in [29]. If this metamodel is instantiated in a HDBMS, the integration processes carry more DBM semantics and become less liable to errors, reducing the possibility of failure of the DBM project.
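To suggest what such an instantiation could look like at the storage level, here is a hedged relational rendering of the metamodel; the table and column names are assumptions made for this sketch and do not reproduce HEROS' actual global schema classes.

```sql
-- Hypothetical relational rendering of the DBM metamodel.
CREATE TABLE mkt_activity (
  activity_id  INTEGER PRIMARY KEY,
  name         VARCHAR(60),
  kind         VARCHAR(12) CHECK (kind IN ('REAL', 'PROSPECTING'))
);

-- A Fact combines the dimensions of an activity around a specific subject.
CREATE TABLE fact (
  fact_id      INTEGER PRIMARY KEY,
  activity_id  INTEGER REFERENCES mkt_activity(activity_id),
  subject      VARCHAR(60)
);

-- Each marketing activity has at least one perspective of the data (Dimension),
-- answering one of the questions Who, What, When, Where, How, Which or Why.
CREATE TABLE dimension (
  dimension_id INTEGER PRIMARY KEY,
  activity_id  INTEGER REFERENCES mkt_activity(activity_id),
  question     VARCHAR(10) CHECK (question IN
                 ('WHO','WHAT','WHEN','WHERE','HOW','WHICH','WHY')),
  name         VARCHAR(60)
);
```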
5 Systematic Method for Using the Proposed Architecture

The development and use of a DBM system based on the proposed architecture should be guided by a systematic method that embraces three stages: the construction/development of the DBM system, the loading of the resulting MktDB and the use of the MktDB by marketing applications. In order to evaluate the architecture, an example was considered, detailed in [29]. In this paper, a simplified version of this example is used to present the proposed systematic method. All aspects related to the example are printed in italics. The example refers to a virtual bookstore – Book House – that wants to know its customers and their characteristics in detail. Through this knowledge, Book House intends to increase sales using a list of prospects (potential customers), which is obtained from a specialized company (Editora Abril, a Brazilian publishing house). Everyone who has bought any product (book) from the company is considered a customer of „Book House“.
The development of a DBM system – using the proposed architecture – follows seven steps [29]:
1. Identification of the DBM level: According to meetings with enterprise managers, level one of DBM – direct marketing – is the most suitable level for Book House. It is desired to generate a simple list of promotional prospectuses for potential customers, starting data selection/segmentation from the characteristics of the best current customers.
2. Identification of necessary marketing functions/processes: Among the marketing functions/processes that need to be supported by Book House's MktDB, prospecting and campaign management are the two most important. Prospecting refers to getting a list of people that have not yet bought at „Book House“. Campaign management is responsible for coordinating prospectus-mailing activities to prospects and verifying possible answers (purchases and contacts).
3. Business interviews and data gathering for identifying and understanding data sources: After analysis of several data systems of the enterprise and other data sources that are necessary for the desired marketing activities, two data sources were considered in the development of Book House's MktDB. One data source refers to real data that comes from the enterprise and the other refers to prospecting data. The first data source is related to Book House's sales system. This data source, which is identified in Book House's DBM system as the Sales Component, uses the Oracle DBMS. The other data source, called the Ext_Customer Component, corresponds to prospecting data about potential customers bought from Editora Abril. This component uses the Postgres DBMS. Fig. 3 shows the components' local schemas, with some simplifications.
4. Integrate data using the Integration Middleware: In the HEROS HDBMS, data integration is executed through the definition of a set of schemas. The schema architecture used by HEROS is shown in Fig. 4. In this architecture, each data source has a local schema, represented in its own data model. This local schema is translated to HEROS' object-oriented data model, resulting in an export schema. All export schemas must be integrated, resulting in a global schema with no heterogeneity. Finally, end-user views can be created from this global
schema, generating external schemas. A HEROS federation consists of an integrated set of autonomous component systems.
Fig. 3. Local Schemas of (a) Sales System and (b) Ext_Customer
Fig. 4. Schema Architecture used by HEROS
As the implementation of the proposed architecture for DBM systems involves the use of the HEROS HDBMS as integration middleware, it is necessary to follow the steps for the creation of a HEROS federation [12]:
a) To define a new HEROS federation and to describe the data semantics of each local system in HEROS' data dictionary: The BookHouse DBM federation was defined in HEROS and the data semantics of each local system were described.
b) To create an export schema for each local schema (at present, this is necessary because HEROS' active rules mechanism has not yet been developed; this mechanism would create the export schemas automatically). For the creation of the export schemas, it is necessary to represent the local schemas (Sales System and Ext_Customer) in HEROS' data model. Another component system – HEROS itself – must be defined in the federation in order to be used as a working area in the integration process. Therefore, in addition to the classes representing the local schemas (Customer, Book, Time, Sale and E_Customer), another class – Work_Area – is defined. Fig. 5 presents the classes of the export schemas concerning the example. Each method in the classes of the export schema must call local procedures that will be executed in the component systems. Such local procedures are mainly responsible for executing data extraction from the component system. The detailed procedures and execution paths for this example can be found in [29].
Fig. 5. Export Schemas of the Component Systems
c) To create a global schema: For creating this global schema, it is necessary to specialize the metamodel classes, integrating the export schemas, and to define command trees in order to achieve semantic data integration and consolidation.
Fig. 6. Global Schema of „Book_House“ Federation
The DBM metamodel was instantiated into HEROS' global schema and then specialized according to the semantics of the example. Fig. 6 shows the global schema for the BookHouse DBM, representing the metamodel through gray boxes and the specialization for the example through white boxes. One Real Marketing Activity – Mkt_Real – and one Prospecting Marketing Activity – Ed_Abril_Prosp – compose the BookHouse DBM. The Real Marketing Activity is composed of a Fact – Sales – and three Marketing Dimensions: Book (the „what“ aspect), Time (the „when“ aspect) and Customer (the „who“ aspect). The Prospecting Marketing Activity is composed of only one Marketing Dimension: Customer_P, which represents the „who“ aspect.
A query relative to a global class may be decomposed into sub-queries for other classes in the global schema and afterwards into sub-queries for classes in the export schemas, through the execution trees related to the global procedures. The execution trees for the example are detailed in [29]; they are responsible for treating the semantic heterogeneity and data consolidation.
5. Creation of tables for storing integrated data in the MktDB: In the example, the MktDB is a relational DBMS because almost all DBM tools access this type of DBMS. The tables Sale, Customer, Book, Time and Customer_Prosp were created in the Oracle DBMS.
6. Creation of a mapping catalog between the output of the integration middleware and the MktDB's tables: This catalog corresponds to the mapping between HEROS' output and the attributes of the MktDB's tables. Therefore, it enables the extractor to load the MktDB. Table 1 represents the catalog for the example.
Table 1. Catalog of Mappings between HEROS' output and MktDB persistent tables
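As an idea of the kind of information such a catalog could hold, the sketch below shows a possible structure and one illustrative entry; both the column names and the sample mapping are hypothetical and are not the actual catalog of the example.

```sql
-- Hypothetical structure for the mapping catalog used by the extractor.
CREATE TABLE mapping_catalog (
  heros_class     VARCHAR(60),  -- class in the integrated output of HEROS
  heros_attribute VARCHAR(60),  -- attribute of that class
  mktdb_table     VARCHAR(60),  -- target persistent table in the MktDB
  mktdb_column    VARCHAR(60)   -- target column in that table
);

-- Illustrative entry only (invented for this sketch):
INSERT INTO mapping_catalog VALUES ('Customer', 'name', 'CUSTOMER', 'CUST_NAME');
```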
7. Definition and configuration of DBM tools: For the example, some tools for campaign management such as Lodgistics (from DataLodgic), NCR Target Marketing & Campaign Management solution (from NCR) and Trail Blazer (from Aspen Software Corp.) were analyzed. However, it was decided to develop a specific application (BSA – BookHouse Sales Analysis) to access/analyze data from the resulting database in order to improve performance analysis and enable strategic actions. One of the screens of BSA is shown in Figure 7.
Fig. 7. Print Screen of module Best Customers Analysis from BSA
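The kind of analysis behind such a screen can be pictured as an aggregation query over the BookHouse MktDB; the sketch below assumes a Sale fact table joined to Customer and Time dimension tables with invented column names, and is not the actual BSA code.

```sql
-- Hypothetical "best customers" analysis: rank customers by total purchase
-- value and number of purchases within a chosen year.
SELECT c.customer_id,
       c.name,
       SUM(s.amount) AS total_spent,
       COUNT(*)      AS nb_purchases
FROM   sale s
       JOIN customer c ON c.customer_id = s.customer_id
       JOIN time_dim t ON t.time_id     = s.time_id
WHERE  t.year_no = 1999
GROUP  BY c.customer_id, c.name
ORDER  BY total_spent DESC;
```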
After the development of the DBM system, the MktDB is loaded through the activation of the extractor by the DBA, and it is then used through marketing tools.
6 Conclusions

This paper presented an architecture for DBM systems, focusing on the data integration processes in the creation of a MktDB and considering some concepts of DW technology. No similar work was found in the literature. The survey conducted during the development of the work presented in this paper showed that there are some works on DBM systems, but they focus on different characteristics. In the business field, the main works consider aspects related to customer loyalty and the selection of customers ([32], [21]). In the data mining area, some techniques have been proposed to allow knowledge discovery on customer data ([2]). Also, DW systems have been used to store data for data analysis in DBM systems. However, no work treating data integration for DBM systems and considering both operational and analytical aspects was found.
The proposed architecture is adequate for all levels of DBM (Direct Marketing, Customer Relationship Marketing and Customer-centric Relationship Management) because it considers analytical and operational aspects of marketing activities. It considers a MktDB that uses multidimensional modeling – fact and dimension structures – and allows operational marketing applications to use the resulting MktDB as a data source. The HEROS HDBMS was used as integration middleware, providing the necessary transparency about data models, localization and other details to end-users. The proposed DBM metamodel makes a business-oriented view possible because it was based on marketing concepts and characteristics. It guides the definition of the necessary data and therefore the MktDB becomes semantically richer and more reliable. The systematic method for using the architecture guides the development of DBM systems, allowing faster and more reliable DBM projects.
The main contributions of the work presented in this paper are:
• the proposal of an architecture for DBM systems;
• the definition of a DBM metamodel;
• and the use of the HEROS HDBMS as integration middleware in an implementation of the proposed architecture.
As future work, we suggest the development of a tool and a systematic mechanism for the automatic refresh of the MktDB. A monitor responsible for detecting changes in the data sources and automatically triggering a new load of the MktDB could perform this refresh. It should also be able to perform scheduled loads of the MktDB. Another interesting aspect is the research/development of tools to automate some marketing functions/processes using the semantics offered by the resulting MktDB. Then it would be possible, for instance, after a load of the MktDB, to automatically trigger a new marketing campaign to prospects according to an automatic analysis of the profile of the best customers. Therefore, some marketing activities could be performed automatically.
References
[1] A. Berson & S. J. Smith: Data Warehousing, Data Mining & OLAP, McGraw-Hill Companies, Inc., 1997
[2] A. Berson & S. Smith & K. Thearling: Building Data Mining Applications for CRM, McGraw-Hill, 2000
[3] A. P. Sheth & J. A. Larson, „Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases“, in ACM Computing Surveys, Vol. 22, N. 3, September 1990
[4] Coopers & Lybrand Consulting (CLC), „Database Marketing Standards for the Retail Industry“, Retail Target Marketing System Inc., 1996
[5] Customer Analytics to Integrate MyEureka! Within its Enterprise Relationship Management Suite – http://www.informationadvantage.com/pr/ca.asp
[6] D. M. Raab, „Database Marketing“, DM Review, January 1998
[7] D. M. Raab, „MarketFirst Software“, DM News, May 1998 – http://raabassociates.com/a805mark.htm
[8] D. Shepard, Database Marketing, Makron, 1993
[9] D. Shepard, The New Direct Marketing: How to Implement a Profit-Driven Database Marketing Strategy, 3rd edition, McGraw-Hill, 1998
[10] E. M. A. Uchôa & S. Lifschitz & R. N. Melo, „HEROS: A Heterogeneous Object-Oriented Database System“, DEXA Conference and Workshop Programme, Vienna, Austria, 1998
[11] E. M. A. Uchôa & R. N. Melo, „HEROSfw: a Framework for Heterogeneous Database Systems Integration“, DEXA Conference and Workshop Programme, Florence, Italy, 1999
[12] E. M. A. Uchôa: HEROS – A Heterogeneous Database System: Integrating Schemas. Computer Science Department – Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio). M.Sc. Thesis, 1994 (in Portuguese)
[13] Focus Group Approach – Facilitator Manual – http://p2001.health.org/VOL05/FOCUSGRP.SAP.HTM
[14] G. Wiederhold, „Mediators in the Architecture of Future Information Systems“, IEEE Computer, March 1992
[15] J. F. Naughton, „Database Marketing Applications, Relational Databases and Data Warehousing“, http://www.rtms.com/papers/dbmarket.htm, January 1999
[16] J. McMillan, „Hyperion and HNC Software Sign Reseller Agreement to Deliver Profitability Analysis Solutions to the Financial Industry“, http://psweb1.hyperion.com/hyweb/imrsnews.nsf/newsdate/87C87DF6955AE943852568C500547C24
[17] K. Thearling, „From Data Mining to Database Marketing“, http://www3.shore.net/~kht/text/wp9502/wp9502.htm
[18] K. Thearling, „Understanding Data Mining: It's All in the Interaction“, DS, December 1997
[20] M. T. Roth & P. Schwarz, „Don't Scrap it, Wrap it! A Wrapper Architecture for Legacy Data Sources“, Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997
[21] N. Narayandas, „Measuring and Managing the Consequences of Customer Loyalty: An Empirical Investigation“, http://www.hbs.edu/dor/abstracts/9798/98-003.html
[22] OMG Unified Modeling Language Specification, version 1.3, June 1999, http://www.rational.com/media/uml/post.pdf
[23] P. Gwynne, „Digging for Data“, http://www.research.ibm.com/resources/magazine/1996/issue_2/datamine296.html
[24] Pivotal eRelationship™ – „Our award-winning customer relationship management (CRM) solution enables universal collaboration“, http://www.pivotal.com/solutions/eRelationship.htm
[25] Quadstone, „Decisionhouse™“, http://www.quadstone.com/systems/decision/index.html
[26] R. Kimball: The Data Warehouse Toolkit, John Wiley & Sons, Inc., 1996
[27] „RTMS/Customer Insight“, http://www.rtms.com/papers/lybrand.html
[28] S. Ram: Guest Editor's Introduction: Heterogeneous Distributed Database Systems. In: IEEE Computer, Vol. 24, N. 12, December 1991
[29] S. W. M. Siqueira: An Architecture for Database Marketing Systems using HEROS – a HDBMS. Computer Science Department – Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio). M.Sc. Thesis, 1999 (in Portuguese)
[30] T. Suther, „Customer Relationship Management: Why Data Warehouse Planners Should Care About Speed and Intelligence in Marketing“, DM Review, January 1999
[31] W. H. Inmon: Building the Data Warehouse, John Wiley & Sons, Inc., 1996
[32] W. Hoyer, „Quality, Satisfaction and Loyalty in Convenience Stores — Monitoring the Customer Relationship“, University of Texas at Austin – Center for Customer Insight, http://hoyer.crmproject.com/
[33] Xantel Connex version 2.4, http://www.ask-inet.com/html/news_relase3.htm
NChiql: The Chinese Natural Language Interface to Databases

Xiaofeng Meng and Shan Wang

Information School, Renmin University of China, Beijing 100872, China
[email protected], [email protected]
Abstract: Numerous natural language interfaces to databases (NLIDBs) developed in the mid-eighties demonstrated impressive characteristics in certain application areas, but NLIDBs did not gain the expected rapid and wide commercial acceptance. We argue that there are two good reasons that explain why: limited portability and poor usability. This paper describes the design and implementation of NChiql, a Chinese natural language interface to databases. In order to bring out the essence of these problems, we provide an abstract model (AM) in NChiql and try to give a solution for these problems based on this model in our system. In this paper, we describe a novel method based on database semantics (SCM) to handle Chinese natural language queries, which greatly promotes the system's usability. The experiments show that NChiql has good usability and high correctness.
In order to bring out the essence of these problems, we provide an abstract model (AM) in NChiql and try to give a solution for these problems based on this model in our system. The remainder of this paper is organized as follows: in Section 2, the abstract model in NChiql is presented; Section 3 explains the natural language query processing in NChiql; the experimental results are provided in Section 4; Section 5 concludes the paper.
2 Abstract Model in NChiql

Generally, there are three levels of models involved in a NLIDB: the user's linguistic model, the domain conceptual model and the data model. Figure 1 shows that the domain conceptual model is usually fixed for a specific domain, but the corresponding data models in computers may vary. Similarly, users can employ different linguistic models to express the same concepts, for example, "what is on sale on the second floor" and "what can we buy on the second floor", or "what is his salary" and "how much does he earn every month". The task of NLIDBs is to map the user's linguistic model to the machine's data model. However, the distance between them is very large: data models do not contain any semantic information, while linguistic models are flexible and varied. Most systems have to introduce an intermediate representation – conceptual models. Linguistic models are first mapped to unambiguous conceptual models and then to definite data models. All of this is illustrated in Figure 1, which is called the abstract model.
Fig. 1. The Abstract Model in Nchiql
The introduction of conceptual models can bridge the gap between linguistic models and data models to some extent. Most NLIs choose logical forms as the conceptual model. Although E-R models are also available to express the conceptual model, they were developed for database design and cannot express complete semantics; they have to be extended to cover linguistic models in NLIs. Based on this idea, we provide a Semantic Conceptual Model (SCM) to serve as the conceptual model in NChiql. SCM can describe not only the semantics of words, but also the combination relationships among the different words involved in linguistic models. So it is more powerful than the E-R model in terms of semantic expression. Each application domain has a different vocabulary and domain knowledge. So when a NLIDB is ported from one domain to another, the three models will change based on the application domain. The linguistic model and the data model can be obtained from domain experts and database designers respectively. The difficulty is how to generate the conceptual model from the domain, namely how to build the bridge. We think that is the essence of the problems mentioned above. So in order to solve the problem of portability, the system must be able to generate the conceptual model automatically or semi-automatically. Based on the relationships among the three models, we give an extracting method in NChiql that can build the SCM automatically [8]. In an evaluation of transporting NChiql to a new domain, this takes 15-20 minutes. Besides the three models in the AM, there are two mappings among them: the linguistic-model-to-conceptual-model mapping, and the conceptual-to-data mapping. The processing of a natural language query in a NLIDB is in fact the processing of these two mappings. The first mapping is the sentence analysis processing, and the second is the translation to a database query.
3 Chinese Natural Language Query Processing Based on SCM in NChiql

The goal of language processing in a NLIDB is to translate natural language to a database query. So it is not necessary to understand the deep structure of sentences, as long as we can reach the translation goal. On the other hand, different from common language, natural query language has its own features. First, in terms of language structure it mainly includes three kinds of sentences: the imperative, the question and the elliptical. Second, the database conceptual model limits the semantic content of the natural query statements; that is to say, the query statements only involve words that have a direct relationship with the database model (E-R model). Based on these considerations, we describe a novel language processing method based on database semantics, namely SCM. It includes the following steps:
1. Based on the features of Chinese query language, a word segmentation algorithm is designed to handle the delimiting problem;
2. Based on the semantic word segmentation, a revised dependency grammar is used to parse the language structures;
3. Based on the semantic dependency tree, the outcome of the second step, we give a set of heuristic rules for translation to a database query, e.g. SQL.
3.1 Word Segmentation Based on SCM in NChiql
The initial step of any language analysis task is to tokenize the input statement into separate words with a certain meaning. For many writing systems, using white space as a delimiter for words yields reasonable results. However, for Chinese and other systems where white space is not used to delimit words, such trivial schemes will not work. Therefore, how to segment the words in a Chinese natural language query becomes an important issue. Most of the literature [2,3,5] on Chinese segmentation is rooted in natural language processing (NLP). However, NLIDB, as one of the typical application domains of NLP, has its own processing features. It is quite possible to apply the achievements of NLP to NLIDB, but it may not be the best way. The goal of language processing in a NLIDB is to translate natural language to a database query. So it is not necessary to understand the deep structure of sentences, as long as we can reach the translation goal. Generally, the conventional segmentation methods mark the word with a Part of Speech (POS) such as noun, verb, adjunct, pronoun, etc. These methods cannot reflect the database semantics associated with the words. However, in a NLIDB we do not care which class a word belongs to, but what semantics it represents in the database. According to this design principle, we give a novel word segmentation method based on database semantics. The advantage of the word segmenter is that it is simple and efficient; its performance was above 99% precision on our real test queries [9].
Definition 1. The database description of a given word in a specific application domain (DOM) is defined as D(ω:DOM) = (o, [t, c]), where ω represents the given word; o represents the corresponding database object (entity, relation, attribute) of ω; t represents the data type (data type, length, precision) of the database object; c represents the verb case of the database object.

3.2 Sentence Analysis in NChiql

The next step of language analysis is sentence analysis. Much research has shown [10] that Chinese is suitable to be represented by Dependency Grammar (DG). We argue that Chinese natural language queries are especially suitable to be parsed by DG. Within the scope of a database, the dependency relationships are very simple and clear. In SCM, there are three kinds of association among the database objects: 1) modifying/modified association; 2) relationship association; and 3) part-of association. In the query language, every word that has specific database semantics will associate with other words through the above relationships. The query goals and query conditions can be clustered based on these associations. Based on this idea, we put forward a dependency analysis method grounded on database semantics. DG parsing can be represented as dependency trees. Basically, the nodes in the dependency tree are three-attribute tuples of the form <dependant-no>, <…>, <dependant-relation>, and a dependency tree is the collection of such nodes.
In NChiql, we extend the above tree. First, the node is extended to a four-attribute tuple: <dependant-no>, <…>, <dependant-relation>, D, where D is the database semantic description as defined in Definition 1. It will be useful for the query translation to SQL. We design special values for <dependant-relation> based on the database requirements. We have the following dependent relations:
- value based relation
- relationship based relation
- VP based relation
- aggregation based relation
- quantifier based relation
- compare based relation
- conjunction based relation
The tree consisting of the above extended nodes is called the database semantic dependency tree, or, for short, the semantic dependency tree (SDT).

3.3 Translation to Database Query Based on Set-Block in NChiql
The SDT is a hierarchical tree structure. As mentioned in the above section, the SDT combines database semantics with sentence analysis. Therefore, the SDT has both the sentence structure and the semantic information used in the translation to a database query (i.e. SQL). The nodes in the SDT can be classified into the following types:
- Attribute value node (AVD): related to an "Attribute = value"-like expression, i.e. a selection predicate.
- Entity node (ED): related to a database object such as a table name.
- Relationship node (RD): related to a join condition.
- Operation node (OD): related to operators like AND/OR, GREATER/LESS, SOME/ALL, etc.
AVDs give the restrictive conditions, but the restricted objects are given by EDs. So a semantic block in database translation should consist of several nodes that have internal dependent relationships among each other. An ED plus at least one AVD can be a semantic block. Also, an OD plus an ED or a semantic block can serve as a semantic block.
Definition 2. A semantic block can be defined as a subtree rooted at Rs in the SDT such that: 1) Rs is an ED; or 2) Rs is an OD with database object sub-nodes; and 3) every sub-node of Rs is an AVD, an RD, or a semantic block.
It is a recursive definition that reflects the nested nature of semantic blocks and of blocks in database queries. Essentially, a semantic block is a set in database evaluation, so we can also call it a set block. In the database scope, a set block can be defined as below.
Definition 3. Let SetBlock = ⟨Obj, Cond, T⟩ where:
Obj = {o | o ∈ R ∨ o ∈ U}, where U denotes all attributes and R denotes all relations;
Cond = cond1 ∧ cond2 ∧ … ∧ condi ∧ condi+1 ∧ … ∧ condn, where condi = (Ai1 Op V) ∨ (Ai1 Op Ai2) or {SQL where conditions}, 1 ≤ i ≤ n, Ai1, Ai2 ∈ U, V denotes values, and Op ∈ {>, <, ≥, ≤, =};
T = {t | t is the relation that Aij belongs to, 1 ≤ i ≤ n, 1 ≤ j ≤ 2}.
It can be seen that the semantic block acts as the basic unit in the translation from a natural language query to a database query (SQL). Semantic blocks enable us to adopt a "divide-and-conquer" strategy for translation. This method consists of two steps. In the first step, the SDT is transformed into a tree with nested blocks, where each block is a subtree (based on Definition 2). In the second step, the blocks that correspond to subqueries (based on Definition 3) are evaluated in an inside-out manner. When a block is evaluated, the root of the corresponding subtree is reduced; that is, the block corresponding to the root of the subtree is returned as an intermediate result, a possible SQL subquery. After all blocks are evaluated, we obtain a series of subqueries, so an important final stage is to integrate the subqueries into a final SQL statement. Based on the above discussion, the translation process flow is shown in Figure 2.
Fig. 2. Translation process flow (SDT → Transforming to Blocks → Evaluating Blocks → Integrating Subqueries → SQL Query)

3.4 Run-Time System Structure and Implementation

Figure 3 illustrates the run-time system structure and functional components of NChiql. The part included in the box is the language processing function; the outside is the user interface function. NChiql has a good ability of learning and guidance. We explain the processing flow in Figure 3 as below. Users face the Guidance function, which can help users to learn the system scope and typical query statements.
(1) Users input their natural language query statement by voice or writing devices;
(2) The natural language query is passed to the Language Analyzer through the Interface Agent;
(3)(4)(5)(6) The Language Analyzer reads the Semantic Dictionary to process word segmentation and sentence analysis. When it meets words that cannot be identified, the Language Analyzer returns the words to the Interface Agent, which calls the Learning module to handle the new words with the interaction of users. The result is sent back to the Language Analyzer for further processing;
(7) The result of the Language Analyzer is the SDT, which is used as a middle language for translation and paraphrasing;
(8)(9) For the result of the Language Analyzer to be confirmed by users, the middle language is converted back to natural language by the paraphrasing module. If the analysis result does not meet the user's requirement, the system stops the processing and returns to the user interface to revise the paraphrasing result;
(10)(11) If the result is right, the middle language is translated to SQL;
(12)(13) The Executing module calls the DBMS to handle the SQL, and the query results are handed over to Response Analysis to help the user understand the results.
Fig. 3. Run-time structure in NChiql
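To make the translation step of the flow (steps 10-11) concrete, here is a hedged example of the kind of nested SQL a set-block evaluation could produce for a query such as "list the names of students older than 20 who study in the Computer Science department"; the student/department schema is an assumption for illustration and is not NChiql's test database.

```sql
-- Hypothetical output of the set-block translation:
-- the inner block (an entity node 'department' restricted by an
-- attribute-value node on dept_name) is evaluated first and reduced to a
-- subquery; the outer block (the entity node 'student' with its own
-- attribute-value node on age) consumes it through a relationship node.
SELECT s.name
FROM   student s
WHERE  s.age > 20
  AND  s.dept_id IN (SELECT d.dept_id
                     FROM   department d
                     WHERE  d.dept_name = 'Computer Science');
```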
4 System Evaluation

4.1 Test Data Collection and Analysis

To conduct the evaluation, an investigation was designed for two purposes: (1) collecting the test data for the evaluation; (2) analyzing the data to find out the sentence distribution. There were 44 subjects under study in this experiment, with very diverse backgrounds. In fact, the educational backgrounds of these subjects varied drastically, including master's students, undergraduates as well as persons with only secondary school level education. Their ages varied between 14 and 23. Two methods were used for the investigation:
– Passive investigation: Each subject answered our questionnaire by natural language queries. The questionnaire consists of 30 problems shown with pictures. Additional help was available, but did not influence their language usage. Each question can be expressed with several different queries.
– Active investigation: Each subject could ask any question within the specific domain.
We collected 1126 query sentences through the passive investigation and 124 query sentences through the active investigation. There are four kinds of sentences involved in the answers: imperative queries, question queries, elliptical queries, and multi-sentence queries. Tables 1 and 2 describe the percentage of each kind of query.

Table 1. Natural language query percentage (passive investigation)

                   imperative  question  elliptical  multi-sentence  others  error  total
No. of sentences   620         289       35          94              18      70     1126
percentage %       55.1        25.7      3.1         8.3             1.6     6.2    100
Table 2. Natural language query percentage (active investigation)

                   imperative  question  elliptical  multi-sentence  others  error  total
No. of sentences   48          48        2           7               5       14     124
percentage %       38.7        38.7      1.7         5.6             1.6     11.3   100
From Table 1 and Table 2, we can see that imperative queries and question queries are the dominant usage in natural language querying (with percentages of 80.8% and 77.4% respectively). It should be noted here that none of the subjects were knowledgeable in databases or SQL, so their query usage was not influenced by database query languages (e.g. SQL). Based on the investigation, we selected the imperative queries and question queries as the test objects (see Table 3).

Table 3. Testing Queries

                          imperative  question  total
Number of test sentences  130         31        162
4.2 Usability Test

During language processing, the system interacts with users to resolve ambiguities or unknown words in order to obtain the right answer. The number of interactions with users in NChiql is the main factor in terms of usability. The test result is shown in Table 4.

Table 4. Usability Test

No of interactions      0   1   2   3   4   5   6   7   8   9
No of testing queries   46  44  46  12  5   2   2   1   1   3
From Table 4, we can see that 148 sentences (of 162 sentences, 91.36%) needed no more than three interactions with users, and only 8.64% of the sentences needed more than three interactions. This result shows that NChiql has good usability, and proves that NChiql has a good ability of learning and disambiguating.

4.3 Correctness Test

For a query, there are two results: one is the answer required by the user (called UR), the other is the system output (called SR). The result is correct when SR is equal to UR. Unfortunately, SR is not always equal to UR. Basically, there are the following relationships between SR and UR:
• Equality (EQ): combined with the number of interactions, EQ can be evaluated in the following cases:
- EQ(<3): the system outputs a correct result with no more than three interactions;
- EQ(>3): the system outputs a correct result with more than three interactions.
• Partial Equality (PEQ): SR partly matches UR. There are two cases:
- PEQ1: the system outputs an SR, but its contents are less or more than UR. For example, the user wants to find students' names and ages, but the system outputs a smaller result (just including the name or the age attribute) or a bigger result (including more attributes besides name and age).
- PEQ2: the system outputs an SR that is only related to UR. For example, the user wants to find a student's name, but the system gives the student's address.
• Not Equality (NEQ): the system outputs an SR, but it is not related to UR semantically.
Table 5 shows the correctness test result.

Table 5. Correctness Test

(1)                EQ     PEQ    NEQ
No of sentences    139    22     1
percent %          85.8   13.58  0.62

(2)                EQ<3   EQ>3
No of sentences    123    16
percent %          88.49  11.51
85.8% of the sentences can be processed correctly. Among them, 88.49% can be handled correctly with no more than three interactions. So we can conclude that NChiql has good usability and high correctness.
5 Conclusion

Currently, two main problems hinder NLIDBs from gaining rapid and wide commercial acceptance: portability and usability. In order to bring out the essence of these problems, we provide an abstract model (AM) in NChiql. In this paper we describe a novel language processing method based on database semantics, namely
SCM. The experimental results show that NChiql has good usability and high correctness. In the future, we will explore how to use the techniques in NChiql to query the Web in natural language.
Acknowledgements. This work is sponsored by the Natural Science Foundation of China (NSFC) under grant number 69633020. We would like to thank Shuang Liu and Mingzhe Gu of Renmin University of China for their great help and valuable advice in the evaluation and the detailed implementation of NChiql.
References [1] Hendrix G.G., Natural Language Interface, American Journal of Computational Linguistics, 1982, 8(2):56-61.
[3] Hendrix G. G. and Lewis W. H., Transportable Natural Language Interface to Database, American Journal of Computational Linguistic, Vol.7,1981.
[4] Grosz B.,et al., Team: An Experiment in the Design of Transportable Natural-Language Interfaces, Artificial Intelligence , 1987, 12: 173-243.
[5] Cha S K., et al., Kaleidoscope Data Model for An English-like Query Language, Proc. of the 17th International Conference on VLDB, September 3-6, 1991, Spain: 351-361.
[6] Androutsopoulos L., et al., Natural Language Interfaces to Databases – An Introduction, URL: http://xxx.lanl.gov/abs/cmp-lg. Also in Journal of Natural Language Engineering, Cambridge University Press, 1995, 1(1): 29-81.
[7] Epstein S S, Transportable Natural Language Processing Through Simplicity – the PRE System, ACM Transactions on Office Information Systems, 1985, 3(2): 107-120.
[8] Meng X F, Zhou Y, Wang S, Domain Knowledge Extracting in a Chinese Natural Language Interface to Database: NChiql, In Proc. of PAKDD'99, Beijing: Springer-Verlag, 1999: 179-183.
[9] Meng X F, Liu S, Wang S, Word Segmentation based on Database Semantic in NChiql, Journal of Computer Science and Technology, 1998, 5(4): 329-344.
[10] Zhang X X, et al., Encyclopedia of Computer Science and Technology, Tsinghua Press, 1999: 1008-1011
Pattern-Based Guidelines for Coordination Engineering

Patrick Etcheverry, Philippe Lopistéguy, and Pantxika Dagorret
Laboratoire d'Informatique U.P.P.A., IUT de Bayonne – Pays Basque
Château Neuf – 64100 Bayonne – France
{Patrick.Etcheverry, Philippe.Lopisteguy, Pantxika.Dagorret}@iutbayonne.univ-pau.fr
Abstract. This paper focuses on coordination engineering. We state that coordination engineering can be approached through a double point of view. On the one hand, coordination problems are recurrent and on the other hand, tested forms of coordination exist. We define a typology of coordination problems that can be solved by the enforcement of well known coordination forms. We highlight a correlation between our approach and the context-problem-solution formulation of patterns. We present a catalogue of coordination patterns that makes an inventory of a set of coordination problems, and a set of solutions that describe how these problems can be solved. After describing an example of coordination pattern, we finally present guidelines that use the catalogue in a framework of process coordination engineering.
1 Two Key Components for a Coordination Problem Solving
considered as two structural axes for solution modelling, and they will also give rise to methodological steps in the construction of solutions to coordination problems. Firstly, our contribution is presented as a catalogue of coordination patterns that makes an inventory of a set of situations where coordination problems occur, and a set of solutions that describe how these problems can be solved. Secondly, we propose a four-step approach that helps designers to specify the coordination forms that have to be adopted by the activities of the modelled process. For each step, we list the inputs, the deliverables, the corresponding pattern clauses on which it relies and the models/languages needed to perform it. We also indicate adapted methodologies to carry out the step and point to existing computer-based tools that are able to support these methodologies.
2 Situations of Coordination

2.1 Situation Vocabulary = Vocabulary of the Domain
We aim to elaborate guidelines that help to carry out coordination within human or software processes. Consequently, specification of situations and coordination forms has to be performed in terms of elements that belong to the considered processes, that means in terms of the considered domains. Despite the specificity of each process, it is possible to define a common vocabulary allowing a generic description of any process. We consider that a process is basically composed of a set of activities that combine and use resources. An activity corresponds to an action of the process. It can be elementary, or composed of other activities. A resource is an entity belonging to the activities environment and needed for activities progress. We define three types of resources: actors (human being or software, and more generally, any component of the organisation able to process an activity), devices and documents. Moreover, activities and resources have interactions, like resource utilisation (by an activity) and temporal constraints (between activities). This point has been developed in [7] and presented as a complete model of process. 2.2
2.2 Typology of Coordination Situations
Our typology of coordination situations relies on [13], which studies the recurrent and interdisciplinary characteristics of coordination problems. Coordination is defined there as "the management of dependencies between activities", where the main types of dependencies are: production-expenditure of resources, resource sharing, simultaneity constraints and task - sub-task relationships. Consequently, we define a coordination situation as a situation where one of these four dependencies can be identified between activities of a given process. For each coordination situation, we describe, on the one hand, the characteristics of the situation and, on the other hand, the coordination problems related to it.
Production - Expenditure of Resources
Characterisation: This situation arises when an activity produces a resource which is used by another activity. The production - expenditure relation is not limited to material flows; it also extends to informational flows.
Example: In manufacturing processes, a typical example occurs when the result produced at one stage of the assembly line is used as input for the following stage.
Associated coordination problems: We distinguish three families of problems related to the production - expenditure of resources:
- Prerequisite problems: these problems happen when the following constraint cannot be satisfied: an activity P that produces a resource R must be finished before an activity C that consumes R begins. This rule implies two sub-constraints: activity P exists, and P must produce R before the expenditure of R begins.
- Transfer problems: these problems are related to resource "transportation" from the producing activity to the consuming activity (wrong data communication channel, wrong resource deposit place, etc.).
- Usability problems: these problems happen when the produced resource cannot be expended because of its format and/or its properties (access rights, etc.).
Sharing Resources
Characterisation: This situation arises when activities must share a limited resource. It is necessary to have a resource allocator which manages the resource demands formulated by the activities. Resource allocation is probably one of the most widely studied coordination mechanisms: for example, economy, organisation theory and data processing are major domains interested in this issue.
Examples: sharing a storage space, sharing a person's working time, etc.
Associated coordination problem: It is related to concurrent accesses to the shared resource.
Simultaneity Constraints
Characterisation: This situation arises when activities must satisfy one or more temporal constraints among the following ones [11]: "A activity before B activity", "A activity starts B activity", "A activity overlaps B activity", "A activity during B activity", etc.
Example: Planning a rendezvous is a typical situation which supposes the satisfaction of simultaneity constraints ("A activity starts B activity" means that activities A and B must start at the same time).
Associated coordination problems: They concern the difficulty of enforcing and controlling the temporal constraints.
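Such temporal constraints can be checked mechanically. The following Java sketch, given purely as an illustration, tests a few of the interval relations of [11] on activities represented by start and end times; the numeric time representation is an assumption made for the example.

// Illustrative check of a few interval relations from [11] between two activities,
// each modelled by a start and an end time (an assumed representation).
class Interval {
    final long start, end;                       // abstract time units
    Interval(long start, long end) { this.start = start; this.end = end; }

    boolean before(Interval other)   { return this.end < other.start; }                                   // "A before B"
    boolean starts(Interval other)   { return this.start == other.start && this.end < other.end; }        // "A starts B"
    boolean overlaps(Interval other) { return this.start < other.start && other.start < this.end && this.end < other.end; } // "A overlaps B"
    boolean during(Interval other)   { return other.start < this.start && this.end < other.end; }         // "A during B"

    public static void main(String[] args) {
        Interval a = new Interval(0, 10), b = new Interval(0, 20);
        System.out.println("A starts B: " + a.starts(b));   // true: both start together, A finishes first
        System.out.println("A before B: " + a.before(b));   // false
    }
}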
Tasks and Subtasks Relationships
Characterisation: This situation happens when the goal to reach is divided into sub-goals and the associated activities are distributed to several actors.
Example: an information retrieval activity on the Internet, where different search engines are in charge of exploring different sites.
Associated coordination problems: Three kinds of coordination problems emerge. The first consists in determining the goal to reach, the second in splitting the task into sub-tasks, and the last in allocating the sub-tasks to the actors.
3 Coordination Forms
A coordination form defines a mechanism that expresses coordination principles. The study of various works [15], [8], [3] and [5] points out that all the detailed coordination forms are covered by the coordination forms described in [15]. Thus, we rely upon this nomenclature for identifying coordination forms. Any coordination is expressed according to three basic mechanisms: mutual adjustment, supervision, and standardisation (which is refined into four particular sub-forms).
3.1 Supervision
Characterisation: It is the coordination form where a supervisor gives instructions and controls the execution of a set of tasks.
Example: In computer systems, the supervision mechanism is used in systems based on a master - slave architecture. In the management domain, supervision is based on hierarchical levels: leaders supervise managers, who themselves supervise operators.
3.2 Standardisation
Characterisation: It is the coordination form where activities have to respect norms. These norms can focus on: behaviour to carry out, results to reach, qualifications to have or standards to respect.
Example: The TCP/IP protocol specifies a behaviour that has to be adopted by communicating machines.
3.3 Mutual Adjustment
Characterisation: This mechanism carries out the coordination of activities by informal communication. It is a particularly suitable mechanism for complex situations where numerous activities interact.
Example: An organisation in charge of sending a man to the moon for the first time is compelled to use this coordination mechanism. Such a project requires a very elaborate division of labour between thousands of specialists, and its success largely depends on the specialists' capability to adjust to one another [15].
4 Coordination Patterns Catalogue
Our analysis of current research work on coordination leads us to state that the presented coordination situations can be managed by the introduced coordination forms.
4.1 Patterns as a Combination of Situations and Coordination Forms
The catalogue of coordination patterns we propose relies on the former statement. Indeed, for [10], a pattern is "a solution to a problem in a given context". The context refers to all recurrent situations in which the pattern is applied. The problem expresses a set of forces (goals and constraints) which take place in the context. The solution refers to a model that can be applied to solve these forces. Thus, by analysing the coordination issue according to the context-problem-solution point of view, we identify situations (contexts) where coordination forms must be employed to construct solutions (solutions) to coordination problems (problems). To take up the point of view of [14], we consider the set of coordination patterns as a catalogue of solution schemas that can be applied in order to specify and solve coordination problems. Each pattern results from the connection of one coordination situation and one coordination form introduced in the former sections. The pattern catalogue is thus composed of the combinations of coordination situations and coordination forms; it is presented in the following table. For example, mapping the supervision mechanism to the task allocation problem leads to the specification of a pattern, marked (*) in the catalogue, which focuses on task allocation solutions based on supervision techniques.
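To make the structure of the catalogue concrete, the following Java sketch indexes patterns by a (situation, form) pair; the enumeration values follow Sections 2 and 3, while the data structure itself is an assumption made only for this illustration.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: each pattern of the catalogue is indexed by the pair
// (coordination situation, coordination form).
enum Situation { PRODUCTION_EXPENDITURE, RESOURCE_SHARING, SIMULTANEITY, TASK_SUBTASK }
enum Form { SUPERVISION, STANDARDISATION, MUTUAL_ADJUSTMENT }

class Catalogue {
    private final Map<String, String> patterns = new HashMap<>();

    void register(Situation s, Form f, String patternName) {
        patterns.put(s + "/" + f, patternName);
    }
    String lookup(Situation s, Form f) {
        return patterns.getOrDefault(s + "/" + f, "no pattern registered");
    }

    public static void main(String[] args) {
        Catalogue c = new Catalogue();
        c.register(Situation.TASK_SUBTASK, Form.SUPERVISION, "Task allocation by supervision");
        System.out.println(c.lookup(Situation.TASK_SUBTASK, Form.SUPERVISION));
    }
}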
We describe patterns according to a framework derived from those presented in [1] and [9]. The coordination pattern presented below results from mapping the task allocation problem onto the supervision mechanism. We give an informal description of each clause.
Name: Task allocation by supervision
Examples:
- In MIMD multiprocessor architectures, the distribution of processes to multiple processors deals with task allocation problems. When allocation is carried out in a centralised way, the "task allocation by supervision" pattern is applicable.
- This pattern is applicable to the multi-agent planning mode called "centralised planning for multiple agents" [8].
Context: This pattern can be used each time the problem deals with distributing tasks between several actors (human, hardware or software). The context elements are a set of tasks and a set of actors.
Considered problem: The pattern deals with the problem of task allocation between several actors. The difficulty consists in determining who does what. The problem is how to establish links between the tasks to achieve and the potential actors. To solve the problem, it is necessary to have a strategy based on the supervision mechanisms. The strategy has to establish links between tasks and actors; it is distributed between the supervisor and the potential actors, and is controlled by the supervisor.
How to build the solution: The supervisor:
- knows the tasks to be carried out,
- knows the potential actors able to perform tasks,
- has a task allocation algorithm,
- is informed of the acceptance or not of the task assignments.
Each actor is provided with communication and behaviour capabilities:
- reception of instructions,
- sending of notifications in reply to received instructions,
- an instruction interpreter and task performance capabilities.
Solution: The supervisor decides the links to establish. It is informed by the actors of their acceptance of the tasks. Actors can be required to carry out tasks and notify the supervisor of their acceptance or refusal.
Strengths and compromises: This solution, with its centralised structure, facilitates control and allows dynamic adaptation of the supervisor's strategy. This pattern is recommended in strongly dynamic environments where re-planning is an essential activity. The weakness of the pattern is the weakness of centralised systems: if the supervisor undergoes a fault, coordination will not be correctly ensured. In a costly communication environment, this pattern is not efficient, because its success strongly depends on the exchanges between the supervisor and the actors.
Associated patterns: Each pattern related to task allocation brings a different solution. Thus, in order to remedy the shortcomings of this pattern (see Strengths and compromises), it is recommended to use the "task allocation by standardisation" pattern (process standardisation, result standardisation), which needs little communication.
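As a purely illustrative rendering of the How to build the solution and Solution clauses, the following Java sketch shows a supervisor that offers tasks to actors and is notified of acceptance or refusal. The class names, the simple offer-in-turn strategy and the refusal rule are assumptions made for the example; they are not prescribed by the pattern.

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

class Worker {
    final String name;
    private int load = 0;
    Worker(String name) { this.name = name; }

    // Reception of an instruction; the returned value is the notification
    // (acceptance or refusal) sent back to the supervisor.
    boolean offer(String task) {
        if (load >= 2) return false;              // arbitrary refusal rule for the example
        load++;
        System.out.println(name + " accepts task " + task);
        return true;
    }
}

class Supervisor {
    private final List<Worker> actors;            // the supervisor knows the potential actors
    Supervisor(List<Worker> actors) { this.actors = actors; }

    // The allocation strategy is controlled by the supervisor: offer each task
    // to the actors in turn until one of them accepts it.
    void allocate(Queue<String> tasks) {
        while (!tasks.isEmpty()) {
            String task = tasks.poll();
            boolean placed = false;
            for (Worker w : actors) {
                if (w.offer(task)) { placed = true; break; }
            }
            if (!placed) System.out.println("No actor accepted task " + task + "; re-planning needed");
        }
    }

    public static void main(String[] args) {
        Supervisor s = new Supervisor(List.of(new Worker("actor-1"), new Worker("actor-2")));
        s.allocate(new ArrayDeque<>(List.of("t1", "t2", "t3", "t4", "t5")));
    }
}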
5 Using the Catalogue to Solve Coordination Problems
We present an approach that uses the catalogue in a process modelling context [12]. We propose a four-step approach that helps designers to specify the coordination forms that have to be adopted by the components of the modelled process. For each step, we list inputs, deliverables, the models/languages and the corresponding pattern clauses on which it relies. We also indicate methodologies adapted to carrying out the step and identify existing computer-based tools that are able to support the step.
5.1 Describing Activities According to a Process Approach
Objectives: The aim of this step consists in describing the various activities that must be carried out in order to achieve the process goal. It does not focus on how activities are carried out but on how activities are connected to achieve the final goal.
Input data: Informal knowledge about the process to be modelled. This knowledge is parcelled out among the different actors of the process.
Deliverables: A schema of the process describing its components in terms of actors, activities, resources, roles, and the relations between them.
Models, languages: Deliverables are described using a process model defined in [2] together with a semi-formal modelling language (the Unified Modeling Language [17]).
Methodology: Methods belonging to the requirements engineering and knowledge acquisition domains are well adapted to carry out this step.
Tools for Information Technology based support: A collective process editor has been developed in order to support the achievement of this step [6]. It allows the actors of the process themselves to participate simultaneously in the process description.
5.2 Identifying Situations That Generate Coordination Problems
Objectives: The goal of this step consists in analysing the described process and identifying the situations in which problems need coordination solutions.
Input data: The process schema produced by the previous step.
Deliverables: A set of situations and associated problems extracted from the process schema.
Models, languages: Deliverables of this step are situations described using the process model of the previous step.
Methodology: Situations are identified by analysing the existing dependencies between components (activities, resources, …) and comparing them to the situation typology defined in the pattern catalogue.
Tools for Information Technology based support: Pattern recognition systems are of interest in this step. They identify and suggest situations in the process schema that can match situations of the catalogue.
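The following toy Java sketch only illustrates the kind of support such a tool can give: declared dependencies of a process schema are mapped onto the situation typology of the catalogue. The schema representation and the string-based dependency kinds are assumptions made for the example.

import java.util.ArrayList;
import java.util.List;

// Toy illustration of step 5.2: scanning declared dependencies of a process
// schema and classifying them against the situation typology.
class Dependency {
    final String from, to, kind;   // kind: "produces", "shares", "temporal", "subtask"
    Dependency(String from, String to, String kind) { this.from = from; this.to = to; this.kind = kind; }
}

class SituationFinder {
    static String classify(Dependency d) {
        switch (d.kind) {
            case "produces": return "Production - expenditure of resources";
            case "shares":   return "Sharing resources";
            case "temporal": return "Simultaneity constraints";
            case "subtask":  return "Tasks and subtasks relationships";
            default:         return "No coordination situation recognised";
        }
    }

    public static void main(String[] args) {
        List<Dependency> schema = new ArrayList<>();
        schema.add(new Dependency("assembly-stage-1", "assembly-stage-2", "produces"));
        schema.add(new Dependency("task-A", "task-B", "temporal"));
        for (Dependency d : schema)
            System.out.println(d.from + " -> " + d.to + " : " + classify(d));
    }
}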
5.3 Choosing a Coordination Form for Each Coordination Problem
Objectives: The aim of this step consists in associating a coordination form with each identified coordination situation.
Input data: The set of situations and associated problems extracted from the process schema and identified in the previous step.
Deliverables: A set of pairs [coordination situation ; coordination form], which defines the coordination canvas of the process. There is one pair for each extracted situation, and it defines the way the named situation will be managed.
Models, languages: No specific language is needed to describe a set of pairs.
Methodology: The assignment of a coordination form of the catalogue to a coordination situation is carried out by comparing the characteristics and constraints of the situation's environment (organisation: distributed / centralised / hierarchical; communication: quality / rapidity, …) with the advantages and drawbacks of the coordination forms (local / global knowledge and decision, communication usage, …). These correlation aspects are treated in the Strengths and compromises and Associated patterns clauses of the pattern. The choice of forms can also be guided by the analysis of previously acquired experience.
Tools for Information Technology based support: Case-based reasoning systems are of interest in this step.
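As a deliberately simplistic illustration, this comparison can be caricatured as a few decision rules over environment characteristics. The rules below only echo the Strengths and compromises discussion (communication cost, possibility of centralised control, complexity of interactions); they are invented for the example and are not the authors' method.

// Toy illustration of step 5.3: picking a coordination form from a few
// (assumed) environment characteristics.
class FormChooser {
    static String choose(boolean centralisedControlPossible, boolean cheapCommunication, boolean complexInteractions) {
        if (!cheapCommunication) return "Standardisation";        // needs little communication
        if (centralisedControlPossible) return "Supervision";     // central control, easy re-planning
        if (complexInteractions) return "Mutual adjustment";      // informal communication between actors
        return "Standardisation";
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true, false));   // Supervision
        System.out.println(choose(false, false, true));  // Standardisation
    }
}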
5.4 Carrying out Coordination
Objectives: The aim of this step consists in producing a solution for each extracted coordination situation by the enforcement of the corresponding coordination form.
Inputs: The set of pairs [coordination situation ; coordination form] defined in the previous step.
Deliverables: A set of solutions derived from the pairs [coordination situation ; coordination form]. Each solution assigns procedures (directives, rules) to the situation components and brings in, when necessary, the new elements needed to implement the solution (queues, stacks, …).
Models, languages: The procedures assigned by the solution are specified in terms of mechanisms (directives, rules) belonging to the form.
Methodology: The How to build the solution and Solution clauses of the pattern describe how to carry out this step. The objective consists in using the specific mechanisms of the form to express the procedures to be applied by the components. Methodologies adapted to this step concern the engineering of procedure production and the engineering of procedure implementation into the situation components, for example business process re-engineering in the management domain.
Tools for Information Technology based support: The tools facilitate the implementation of the procedures within the situation components. For instance, multi-agent systems are suitable to implement procedures derived from negotiation forms.
6 Conclusion
The whole approach is based on the strong distinction between two axes: situations and forms of coordination. Describing coordination along these two axes is an original way to present coordination problems. The specification of solutions by means of patterns constitutes a framework that allows answering the fundamental questions of [16]: "… why, when, where and how is coordination carried out …". Indeed, each pattern describes a solution (how to) in order to solve a coordination problem (why) that arises in a situation (when / where) [9]. A pattern catalogue structured according to these two main axes allows a large enumeration of solutions, and any extension of either axis leads to the extension of the whole catalogue. The presented approach also relies upon the two axes. Consequently, it suggests that the designer first specify the problematic situations; the analysis of their environment then facilitates an adapted choice of the form to be adopted. Moreover, the organisation of both the catalogue and the approach according to the two axes ensures that the described patterns provide useful help within the approach.
References
1. C. Alexander, S. Ishikawa, M. Silverstein, M. Jacobson, I. Fiskdahl-King, S. Angel: A Pattern Language. Oxford University Press, New York, (1977)
2. C. Bareigts, P. Etcheverry, P. Dagorret, P. Lopistéguy: Models of process specification for Organisational Learning. UK Conference on Communications and Knowledge Management, Swansea, Wales, UK, (2000)
3. B. Chaib-draa, S. Lizotte: Coordination in unfamiliar situations. Third French-speaking days, IAD & SMA, Chambery-St Baldoph, (1995)
4. K. Crowston, C.S. Osborn: A coordination theory approach to process description and redesign. Technical report number 204, Cambridge, MA, MIT, Centre for Coordination Science, (1998)
5. K.S. Decker: Environment Centred Analysis and Design of Coordination Mechanisms. Department of Computer Science, University of Massachusetts, UMass CMPSCI Technical Report, (1995)
6. P. Etcheverry, P. Dagorret, G. Bernadet, N. Salémi, A. Coste: A cooperative editor for process design. Bayonne, (1999)
7. P. Etcheverry, P. Dagorret, P. Lopistéguy: Know-how capitalization, a process approach. Interdisciplinary Research Center, Bayonne, (1999)
8. J. Ferber: Multi-Agent Systems - Towards a collective intelligence. InterEditions, ISBN: 2-7296-0665-3, (1997)
9. E. Gamma, R. Helm, R. Johnson, J. Vlissides: Design Patterns - Elements of Reusable Object-Oriented Software. Addison-Wesley, ISBN: 0-201-63361-2, (1995)
10. D. Lea: Patterns Discussion, http://g.oswego.edu/dl/pd-FAQ/pd-FAQ.html, (1997)
11. T.D.C. Little, A. Ghafoor: Interval-Based Conceptual Models for Time-Dependent Multimedia Data. IEEE Trans. on Knowledge and Data Engineering (Special Issue: Multimedia Information Systems), Vol. 5, No. 4, pp 551-563, (1993)
12. P. Lorino: The value development by processes. French Review of Management, (1995)
13. T.W. Malone, K. Crowston: The Interdisciplinary Study of Coordination. ACM Computing Surveys, 26 (1), pp 87-119, (1993)
14. M. Mattsson: Object-Oriented Frameworks. A survey of methodological issues. Licentiate Thesis, Lund University, Department of Computer Science, (1996)
15. H. Mintzberg: Management. Travel Toward Organizations Centre. Organizations Editions, Paris, (1990)
16. H.S. Nwana, L.C. Lee, N.R. Jennings: Coordination in Software Agent Systems. The British Telecom Technical Journal, 14 (4), pp 79-88, (1996)
17. J. Rumbaugh, G. Booch, I. Jacobson: The Unified Modeling Language Reference Manual. Addison-Wesley, (1998)
Information Management for Material Science Applications in a Virtual Laboratory A. Frenkel1, H. Afsarmanesh1, G. Eijkel2, and L.O. Hertzberger1 1
University of Amsterdam, Computer Science Department Kruislaan 403, 1098 SJ, Amsterdam, The Netherlands {annef, hamideh, bob}@science.uva.nl 2 Institute for Atomic and Molecular Physics (AMOLF) Kruislaan 407, 1098 SJ, Amsterdam, The Netherlands [email protected]
Abstract. The goal of the Virtual Laboratory (VL) project, being developed at the University of Amsterdam, is to provide an open and flexible infrastructure to support scientists in their collaboration towards the achievement of a joint experiment. The advanced features of the VL provide an ideal environment for experiment-based applications, such as the Material Analysis of Complex Surfaces (MACS) experiments, to benefit from the different interfaces developed to the hardware and software required by the scientists. To properly support the information management in this collaborative environment, a set of innovative and specific mechanisms and functionalities for the efficient storage, handling, integration, and retrieval of the MACS-related data, as well as data analysis tools for the experiment results, are being developed. This paper focuses on the information management in the MACS application case and describes its implementation using the Matisse ODBMS.
1 Introduction

The aim of the Virtual Laboratory (VL) project¹ is to provide an open and flexible framework that supports the collaboration between groups of scientists, engineers and scientific organizations that decide to share their knowledge, skills and resources (e.g. data, software, hardware, complex devices, etc.) towards the achievement of a joint experiment [1], [2], [11]. The advanced features of the VL provide an ideal environment for experiment-based applications to benefit from the different interfaces developed to the hardware and software required by the scientists. One of the experiment-based application cases proposed for the VL is focused on the Material Analysis of Complex Surfaces (MACS) experiments. These experiments involve large and complex physics-related devices, such as the Fourier Transform Infra-Red imaging spectrometer (FTIR) and the nuclear microprobe (mBeam).
¹ This research is supported by the ICES/KIS organization.
This application case benefits from the VL because these devices can be operated remotely in a multi-user collaborative way, and because results from different experiments can be combined, creating in this way new research opportunities. In order to support the information management involved in this collaborative environment, a set of innovative and specific mechanisms and functionalities for the efficient storage, handling, integration, and retrieval of the MACS-related data, through the VL, are being developed. These mechanisms and functionalities enable scientists to search through the large amount of stored data in order to identify patterns and similarities. Therefore, the database model is carefully designed to enable an efficient way to store and access the data produced in such scientific environments. The focus of this paper is on describing these information management mechanisms and functionalities, specific to the MACS case, that are being implemented using the Matisse ODBMS. This paper is organized as follows. Section 2 describes the Virtual Laboratory environment and its reference architecture. Section 3 covers the specific domain, i.e. the MACS experiment case. Section 4 presents the development approach and the functional details of the information management system developed for the MACS application using the Matisse ODBMS. Section 5 addresses the main conclusions of this paper and some of the future work that is planned in the context of this research project.
2 The Virtual Laboratory Environment

The Virtual Laboratory environment provides a framework for groups of scientists, engineers and scientific organizations that interact and cooperate with each other towards the achievement of a common experiment. Such an experimental environment enables researchers at different locations to work in an interactive way, as in any laboratory, i.e. the scientists are able to create and conduct the experiments in the same natural and efficient way as if they were in their own laboratory. One of the most important characteristics of these experimental domains is the manipulation of the large data sets produced by the experiment devices, as described in [1]. To be able to handle the resulting experiment data sets, three main requirements are supported within the VL architecture:
- Proper management of large data sets: i.e. storage, handling, integration, and retrieval of large data sets. For example, in such a scientific environment, the size of data sets can range from a few megabytes (e.g. DNA micro-array experiment data sets) to tens of gigabytes (e.g. FTIR imaging micro-spectrometer data sets).
- Information sharing and exchange for collaboration activities: scientists are able to share both the devices used to perform the experiments and the data sets generated by those experiments. They must also be able to look at these data sets and compare them to the ones from previous experiments or other public databases, in order to find similarities and patterns.
- Distributed resource management: this must be properly considered in order to meet the high performance and massive computation and storage requirements.
The Virtual Laboratory architecture, shown in Fig. 1, has incorporated these and other functional requirements through the design of different system components. In particular, the VL architecture consists of three main components:
1. The Application Environment contains the scientific application domains considered in the VL (e.g. the MACS application case, the DNA Micro-array application case, and others), including certain domain-specific functionalities.
2. The VL Middleware enables the VL users to access low-level distributed computing resources. The VL middleware provides: the VL user interface, which enables the scientists to define and execute the experiments; the Abstract Machine (AM), which is the intermediate layer between the Grid infrastructure and the VL users, as described in [2]; and three main functional components. The VIMCO component provides the functionalities to store and retrieve both the large data sets and the data analysis results, the advanced functionalities for intelligent information integration, and the facilities for information sharing based on a federated approach [1]. The ComCol component provides the appropriate mechanisms for data and process handling based on the Grid technology. The ViSE component offers a generic Virtual Simulation and Exploration environment where 3D visualization techniques are offered to analyze large data sets. The functionality provided by each one of these components is integrated through the VL integration architecture.
3. The Distributed Computing Environment provides the network platform that enables the efficient usage of the computing and communication resources. At present, a Gigabit Ethernet connection is being used. In the near future, it will be extended to a wide-area environment using a GigaPort network based on the Surfnet5 backbone, which will result in a speed of 80 gigabits per second and a client connection capacity of 20 gigabits per second [6]. The Grid infrastructure provides the platform to manage data, resources, and processes in distributed collaborative environments, such as the VL scientific applications. The Globus toolkit offers a set of tools to manage the resources in Data-Grid systems [5], [2], [7], [15].
The functionalities provided by the VIMCO layer and the specific domain tools developed in the VL Interface layer specifically for the Material Science applications are described in detail in the following sections.
Case 2 Microbeam
Case 3 DNA Array
Others
End-user Application Environment
...
...
...
VL User Interface Environment
ViSE
ComCol
VIMCO
VL Integration Architecture
VL Middleware
VL Abstract Machine
... Distributed Computing Environment
Fig. 1. Virtual Laboratory reference architecture
3 Material Science Application in VL
The goal of the Material Science application is to study materials and their properties and to understand what happens on surfaces when materials interact. In this section, the Material Analysis of Complex Surfaces experiment, a specific case of the Material Science application, is described.

3.1 Material Analysis of Complex Surfaces Experiment

The Material Analysis of Complex Surfaces (MACS) experiments try to identify and determine the elements that compose complex surfaces, regardless of the nature of the sample. Application areas that benefit from this kind of experiment (some of which are currently implemented or considered) include: art conservation and restoration (e.g. analysis of binding media and organic pigments in old master paintings), bio-medical science (e.g. identification of arteriosclerotic deposits in mice), medical research (e.g. studies of trace elements in brain tissues), and others. The MACS experiment itself can be divided into three phases, as shown in Fig. 2: the preprocessing, the experimentation process, and the analysis of results. The preprocessing phase is where references to related research and images of the object are collected and analyzed. After this, the sample that will be used during the experiment process is extracted from the object. This process includes several extraction protocols and procedures to be followed. Then the sample usually needs to be treated, for example with reagents and solutions, in order to fulfill the requirements of the device used in the material analysis process for the experimentation phase.
Fig. 2. Material Analysis of Complex Surfaces Experiment
The material analysis process is performed with a set of specialized and complex hardware equipment. At present, the FTIR and the mBeam devices are available. The FTIR facility is a non-dispersive infrared imaging spectrometer coupled to an infrared microscope, used to examine the infrared radiation absorbed by complex surfaces, as described in [8] and [4]. The mBeam device provides a highly focused beam of ions, with a spatial resolution in the sub-micrometer range, that can be used to identify trace elements on a surface with a sensitivity of 10⁻¹⁵ grams, as described also in [8]. After the full scan process finishes, the outcome of the experiment is a set of data files containing the experiment results and the device parameters. This data set consists of a stack of images, known as a hyper-spectral data cube. Afterwards, these data files are converted into a format that can be used in the analysis phase. A quality control process is also carried out to certify that the generated data complies with certain standards; otherwise the data is discarded and the material analysis process is redone. The large amount of data produced by these devices makes the analysis phase longer and more effort consuming than the experiment phase itself. For example, the size of a single data cube can range from 16 to 100 Mbytes and, considering that up to 20 data cubes can be generated every day, it is clear that individual scientists cannot perform this analysis by hand. Therefore, a set of analysis tools needs to be integrated into the application to facilitate the work of the scientists, e.g. correlation analysis, multivariate data analysis (PCA, PLS) and others.
4 The MACS Information Management System

The main goal of this system is to design and develop an open and flexible environment to facilitate the experimentation process for physicists involved in MACS-related experiments. This application case is being developed at the CO-IM group [14] at the University of Amsterdam in collaboration with the physics institutes AMOLF and NIKHEF. The first phase of building the MACS system focuses on the specific mechanisms and functionalities that need to be developed for the information management of the data produced by the FTIR and the mBeam devices. Thus, the information management requirements were identified first, including the study of the structures of the input and output data and of the operations on the data of the application domain. The next step was the development of the MACS database, which included: the design of the database, the development of a database prototype, the design and development of tools to load the database, the population of the database with the FTIR and/or mBeam data, and the design and development of the user and query interfaces. The second phase will focus on the development of data analysis and knowledge extraction tools that will be used to process, analyze and present the results in such a way that valuable knowledge can be extracted from the large amount of data generated by these complex devices, i.e. information about experimental resources, experimental parameters and conditions, and raw or processed results.
4.1 MACS Process-Data Model

After studying and analyzing the way in which the MACS experiments are performed (including data, objects, and processes), a process-data flow model was designed. For this design, the Virtual Laboratory Experiment Environment Data (VL-EED) model was used as a reference model [10]. The VL-EED model is a generic database model for experimentation environments. This model is the result of the careful study of several applications within the context of the VL project. Therefore, it was possible to determine the generic characteristics of scientific experiments and design a generic schema to store experimental information. The VL-EED model is a template that facilitates the creation of new experiment-based schemas, preventing in this way the duplication of modeling effort, i.e. the database managers do not have to create a new "schema" for each new experimental application. It also enables a more efficient way to share and access the data from different experiment-based application tools, e.g. data analysis tools, browser and query tools. The VL-EED model (shown in Fig. 3) can be viewed as a hierarchy with the class Project as the root. Under each project a number of experiments can be performed. Each experiment consists of experiment elements that can be either processes or data elements. The experiments and the experiment elements can have comments. The processes are actions that can be described by protocols (i.e. standard procedures) and can have properties. The processes may be carried out with the use of hardware or software tools with their parameters and whose vendor is an organization. In addition, a person that belongs to an organization (both with an address) performs the experiments and processes.

Fig. 3. Virtual Laboratory Experiment Environment Data model
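For illustration, the hierarchy of Fig. 3 can be sketched as a few Java classes; the class and relationship names follow the figure, but the attribute selection and the use of plain Java references instead of ODBMS relationships are simplifications made only for this example, not the actual VL-EED schema definition.

import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Simplified rendering of the VL-EED hierarchy: Project -> Experiment ->
// experiment elements, which are either processes or data elements.
class Project {
    String name, id, description;
    final List<Experiment> experiments = new ArrayList<>();       // has_exp
}

class Experiment {
    String name, id, type, subject;
    Date date;
    final List<ExpElement> elements = new ArrayList<>();          // has_element
    Experiment previous, next;                                    // has_prev_exp / has_next_exp
}

abstract class ExpElement {
    String name, id, description;
    ExpElement previousElement, nextElement;                      // has_prev_elm / has_next_elm
    final List<ExpElement> subElements = new ArrayList<>();       // has_sub_elm
}

class DataElement extends ExpElement { }                          // e.g. a sample or a data cube

class ProcessElement extends ExpElement {                         // an action, described by a protocol
    Date performedOn;
    String protocol;                                              // defined_by PROTOCOL (simplified to a string)
}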
The relationships between experiments and experiment elements are represented by the recursive relations has_prev_elm and has_next_elm. The goal of this representation is to enable a flexible and random process-data flow.
The MACS process-data flow model (shown in Fig. 4) covers the information specific to the material science experiments. Because the VL-EED model is flexible and extensible, it was easy to develop the domain-specific data model on top of it. Following the VL-EED definition, the MACS experiments consist of experiment elements that can be refined into data elements and/or processes. The data elements can be subdivided into active elements and passive elements, considering their participation during the different experimental phases. Thus, the passive elements are just used during the experiment process, while the active elements are generated and/or modified by one or more experimental processes. In the figure, for instance, the set of passive data elements is represented by gray rectangles (e.g. Object, Physics Devices, Analysis Tool, etc.), the active data elements by lined rectangles (e.g. Sample, Data Cube, etc.), and the process elements by ovals (e.g. Sample Extraction, Material Analysis, Data Cube Analysis, etc.).

Fig. 4. MACS Process Data Flow

4.2 MACS Information Management System Development

The MACS database system was developed using the Matisse object-oriented database management system, which provides a set of database management tools for the proper handling of large and complex data from database applications. Some of the advantages of adopting the Matisse ODBMS for this application include its flexible and dynamic data model, its support for many multimedia data types, and the high level of scalability and reliability that it provides, as mentioned in [12]. In order to create the description of the MACS schema in Matisse, the data definition language MATISSE ODL was used. The MACS ODL file provides the description of the persistent data for both the VL-EED and the MACS schema as a set of object classes, including the attributes and relationships. Once the MACS ODL file is ready, the next step is to interpret it using the MATISSE mt_odl utility, which creates
the actual MACS schema in the database. Thus the MACS database schema is stored in the database and can be manipulated like the other objects through the use of APIs. Once the database is set up, the transfer of data from existing external sources can be done with the loader tool specially developed for this purpose. The MACS Database Loader is responsible for providing the proper means for uploading data into the MACS database: instead of creating one object at a time, it is possible to load many objects at once. The format of the source data file is based on the Object Interchange Format (OIF), a specification language proposed in the ODMG standard to dump/load database objects to/from files, as described in [13]. The MACS Database Loader (presented in Fig. 5) was implemented in Java, in order for the application to be portable between platforms and to offer the possibility of using the program as an applet, allowing it to also run remotely from a web browser. For the integration with the MACS database, the Matisse Java API was used [9]. The Matisse Java API, developed at the University of Amsterdam, is a set of library functions that provides high-level, object-oriented Java access to the Matisse ODBMS. It provides a set of generic data management functions that encapsulate Matisse C API commands. In this way, the applications that are developed do not have to deal with Matisse specificities, and may just provide the necessary information through the access functions. These functions do not necessarily imply a one-to-one mapping to Matisse commands; they can encapsulate a sequence of Matisse commands. The functions contained in this library include: the DB Access functions (e.g. to connect and perform transactions on the Matisse DB), the Data Access functions (e.g. to select, update and delete data in the Matisse DB), and the Meta-data Access functions (e.g. to perform operations on the database schema).

Fig. 5. MACS Database Loader user interface
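Since the signatures of the Matisse Java API are not given here, the following Java sketch only illustrates the general shape of a bulk loader built on top of a thin object-database wrapper; every interface and method name in it is hypothetical and does not belong to the real Matisse Java API.

import java.util.List;
import java.util.Map;

// Hypothetical wrapper interface in the spirit of the generic access functions
// described above (DB access, data access, meta-data access).
interface ObjectDbSession {
    void beginTransaction();
    void commit();
    String createObject(String className, Map<String, Object> attributes);  // returns an object id
    void setRelationship(String fromId, String relationName, String toId);
}

// A loader receives OIF-style records parsed elsewhere and creates many objects
// within one transaction instead of one object at a time.
class BulkLoader {
    private final ObjectDbSession session;
    BulkLoader(ObjectDbSession session) { this.session = session; }

    void load(List<Map<String, Object>> records, String className) {
        session.beginTransaction();
        for (Map<String, Object> record : records) {
            session.createObject(className, record);   // attribute values taken from the parsed record
        }
        session.commit();
    }
}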
4.3 The MACS Information Management System in the Virtual Laboratory

Considering as a scenario the experiment for the analysis of highly oxidised diterpenoid acids in Old Master paintings, described in detail in [3], a typical experiment developed within the VL environment would consist of the following steps:
1. Through the VL user interface environment (of the VL middleware), the user logs in to the system, and through a VL web-based interface he/she is able to access the VL resources, which include physical devices, software and data elements.
2. Using further features of the VL Abstract Machine, the experiment is defined by selecting a number of experiment elements, i.e. processes and data elements, and connecting them in order to create a process-data flow. The definition of the experiment is performed using a drag-and-drop interface, which may also provide an intelligent assistant (i.e. the VL-AM Assistant) to help the user during the design of the VL experiment, as described in [2]. It is also possible to load a previous experiment, i.e. an experiment that was performed earlier, or even a pre-defined experiment (i.e. an experiment template).
3. Every application provides a set of user-friendly tools, either specific domain tools or generic tools, to look at the data sets stored in VIMCO. Through the MACS user interface facilities, the user can access, at any time, the data collected from the experiments. In this case, the MACS interface allows the user to perform queries on the MACS database, to apply analysis processes in order to extract valuable information, and to access visualization tools.
4. When the setup of the experiment is finished, the experiment is submitted to the system. At this moment, the VL Abstract Machine Run Time System (VL-AM RTS) uses the tools provided by the Globus toolkit for Data-Grid management to send the different parts of the experiment throughout the distributed environment (within the computational grid), according to the computational requirements and the availability of the resources needed.
5. During the execution of the experiment, through the VL user interface environment of the VL middleware, the user is able to supervise the experiment using monitoring tools. It is also possible for the user to change the experiment parameters at any moment, in order to adjust the experiment process.
5 Conclusions and Future Work

In the VL environment, an important requirement is the appropriate management of the large amount of data produced by the large and complex devices used in the scientific experiments. The information management system developed for the Material Science application in the VL project, and its implementation using the Matisse ODBMS, supports the efficient storage, handling, integration, and retrieval of such data sets. The MACS component, integrated in the VL environment, provides a comprehensive and friendly environment to the scientists of the Material Science application. The user-friendly interfaces that allow the VL users to access the data stored in the Matisse database are now under development. Such a query/search component will enable the VL user to search through the data and look for similarities or patterns. A query component that includes sophisticated search commands is being considered and will result in a more powerful tool. For instance, these query tools can be used to extract slices from the data cubes and, together with specialized tools, perform calculations (e.g. chemometrics, correlation analysis methods) on these data slices. Additionally, data mining and knowledge extraction technology should be offered to analyze the large data sets, to process either the raw data generated by different devices from different applications or the processed experiment results.
This technology is presently being considered to process, analyze and present the results in such a way that valuable knowledge can be extracted from the large amount of data. The stored data that will be used may include information about experimental resources, experimental parameters and conditions, and raw or processed results. The easy retrieval and manipulation of the large data sets, together with sophisticated data analysis and knowledge extraction tools, give the scientists new research possibilities.
References
[1] Afsarmanesh, H., Benabdelkader, A., Kaletas, E.C., et al. Towards a Multi-layer Architecture for Scientific Virtual Laboratories. In 8th International Conference on High Performance Computing and Networking Europe, HPCN 2000. 2000. Amsterdam, The Netherlands: Springer.
[2] Belloum, A., Hendrikse, Z.W., Groep, D.L., et al. The VL Abstract Machine: a Data and Process Handling System on the Grid. In High Performance Computing and Networking Europe, HPCN 2001. 2001. Amsterdam, The Netherlands.
[3] Berg, K.J.v.d., Boon, J.J., Pastorova, I., et al. Mass spectrometric methodology for the analysis of highly oxidized diterpenoid acids in Old Master paintings. Journal of Mass Spectrometry, 2000. 35(4): p. 512-533.
[4] Eijkel, G.B., Afsarmanesh, H., Groep, D., et al. Mass Spectrometry in the Amsterdam Virtual Laboratory: development of a high-performance platform for meta-data analysis. In 13th Sanibel Conference on Mass Spectrometry: informatics and mass spectrometry. 2001. Sanibel Island, Florida, USA.
[5] Foster, I., Kesselman, C., and Tuecke, S. The Anatomy of the Grid: enabling scalable virtual organizations, www.globus.org/research/papers/anatomy.pdf. 2000.
[6] Gigaport, Gigaport Homepage (www.gigaport.nl). 2001.
[7] Global Grid Forum, http://www.gridforum.org/. 2001.
[8] Groep, D., Brand, J.v.d., Bulten, H.J., et al. Analysis of Complex Surfaces in the Virtual Laboratory. 2000, Amsterdam, The Netherlands.
[9] Kaletas, E.C. A Java Based Object-Oriented API for the Matisse OODBMS. 2001, University of Amsterdam: Amsterdam.
[10] Kaletas, E.C. and Afsarmanesh, H. Virtual Laboratory Experiment Environment Data model. 2001, University of Amsterdam: Amsterdam.
[11] Massey, K.D., Kerschberg, L., and Michaels, G. VANILLA: A Dynamic Data Schema for A Generic Scientific Database. In 9th International Conference on Scientific and Statistical Database Management (SSDBM '97). 1997. Olympia, WA, USA: Institute of Electrical and Electronics Engineers (IEEE).
[12] Matisse, Matisse Tutorial. 1998.
[13] ODMG, The Object Data Standard: ODMG 3.0. Series in Data Management Systems, ed. Gray, J., et al. 2000: Morgan Kaufmann Publishers, Inc.
[14] The CO-IM Group, UvA, http://carol.wins.uva.nl/~netpeer/.
[15] The Globus Project, http://www.globus.org/. 2001.
TREAT: A Reverse Engineering Method and Tool for Environmental Databases
Mohamed Ibrahim, Alexander M. Fedorec, and Keith Rennolls
The University of Greenwich, London, UK
{M.T.Ibrahim, A.M.Fedorec, K.Rennolls}@Greenwich.ac.uk
Abstract. This paper focuses on some issues relating to data modelling, quality and management in a specific domain: forests. Many forest domain specialists, e.g. botanists, zoologists and economists, collect vast volumes of data about the forest fauna and flora, climate, soil, etc. The favourite tools for managing these data are spreadsheets and popular DBMS packages such as Access or FoxPro. The use of these tools introduces two major problems: loss of semantics and poor data structure. These problems and associated issues are examined in this paper. To address them, we propose a method for database reverse engineering from spreadsheet tables to a conceptual model and suggest the design of a prototype tool (TREAT). We also explain our motivation and the methodology and approach that we adopted. The interactive process used to identify the constituents of the spreadsheet tables and the data semantics is explained. The semi-automated analysis of the associations between the data items in terms of domain knowledge, constraints and functional dependencies is also described. The output from the tool may be selected as either an Entity-Relationship, an Object or an Object-Relational model.
Keywords. Data management, reverse engineering, data modelling.
conservation of biodiversity, and also the management of medicinal and pharmaceutical resources in forests. In such areas, even the yardsticks of measurement are not well developed, since the inherent structure of tropical rain forests and its relationship to biodiversity and medicinal plant communities are not well understood, and are the subject of continuing ecological and environmental research. This paper is organized as follows. In section 2, we offer some observations based on the practical field experiences of the authors. This is followed by a discussion of some issues of data modelling in section 3. In section 4, we discuss some practical experience in model extraction and share with the reader some of the problems we faced in this respect. Sections 5 and 6 deal with our proposed method for reverse engineering and a prototype tool which we dubbed 'TREAT'. In section 7, we discuss our conclusions and suggest further work.
2. Observations on Data Management Practice

Date describes a database as "nothing more than a computer-based record keeping system: that is a system whose overall purpose is to record and maintain information" [1]. In current database theory it is convenient, when considering design and structure, to assume that there is just one database containing the totality of all stored data in the system. It can be shown that subsequent physical partitioning and distribution of data for practical implementation and performance reasons does not invalidate this assumption [2]. Thus databases are considered 'integrated', that is, a unification of several otherwise distinct data files with any redundancy among those files partially or wholly eliminated. For example, a forestry database may contain both species records, giving name, genus, family, etc., and study plot records listing trees with their heights, diameters and so on. There is clearly no need to include the genus of each tree in the study plot records, as this can always be discovered from the species records. As well as being an integrated repository for stored data, databases are also 'shared'. Sharing is a consequence of integration and implies that different users may access individual data items for different purposes. Any given user will normally be concerned with only a subset of the total database, and different users' subsets may overlap in many different ways. Thus a given database will be perceived by different users in a variety of different ways, and two users sharing the same subset of the database may have views of that subset which differ considerably at a detailed level. For example, the same forest data set may be employed in resource assessment for production, dynamics monitoring for conservation, or environmental impact assessment for protection. An integrated and shared repository of information implies a central responsibility for the data and database management. Although long accepted in areas of commerce and business where data is recognised as a valuable asset, this is in stark contrast to the situation that prevails in many areas of forestry research, where each application has its own private files and local copies of data.
Indeed studies by Boehm have shown that a requirement incorporated into an unstructured system by someone who was not the original author typically takes 40 times the development effort of incorporating the requirement when the system was initially implemented [5]. Data that had been collected and generated at enormous cost was therefore lost or beyond practical use to other research.
3. Issues in Data Modelling

The work in this paper stems from recent practical experience with the Indonesian Forest Sector. The overall aim of this European Commission funded project is to strengthen the Ministry of Forestry's capacity for forest planning and management at a provincial level. Botanical, zoological and soil surveys have been conducted, and these, together with socio-economic, climatic, geological, topographic and other data, are being incorporated into a comprehensive and user-friendly Integrated Forest Resource Information System (IFRIS) to complement the National Forest Inventory [8, 16]. Our prime concern was to recover the data model of this wealth of data from an information systems perspective and hence forward engineer the required information systems on sound foundations. Whilst there are many kinds of information system, our initial interest was restricted to two main kinds: the Operational Transaction Processing (TP) and Decision Support System (DSS) aspects of the IFRIS data. TP systems are concerned with standard day-to-day operations such as entering, modifying and reporting on 'operational' data and are characterized by the traditional ACID concepts of atomicity, consistency, isolation and durability [9]. The collation and management of field study data would typically be within the remit of TP. DSS, on the other hand, supports the strategic exploration of 'informational' data by a 'knowledge worker'. This sort of information processing often takes the form of 'What if?' queries, as exemplified by a researcher's explorations into biodiversity metrics or those of a senior manager concerned with ecological or environmental impact analysis. Management Information Systems (MIS) would clearly be essential in the longer term to bodies responsible for on-going forest management or conservation or, for example, those seeking ISO14000 certification [10], and are therefore an important aspect of IFRIS. With restricted time and limited access to potential end-users, little progress could be made in the required analysis. The standard approach to building TP, MIS and DSS information systems is to bridge the semantic gap from the problem domain to the solution space via a set of models that may be transformed and refined at each step [11]. Whilst different authors use different terms, the standard models are conceptual, logical, and physical. Quality is assured by applying verification and validation to each refinement. The first of these steps maps semantics from the real-world problem to a conceptual model that embodies an abstract representation of the user's mini real world. There are many design solutions to any system, and the actual result will be dependent on the paradigm employed in the modelling and the target technology. Irrespective of the adopted paradigm, the models will normally coincide with the
deliverables of the standard analysis, design and implementation phases of the classical software engineering lifecycle [12]. Each of the information models may be viewed in terms of the data and the processes on that information, and consists of three main components:
1. Structure: including objects, properties and associations between objects,
2. Operations to manipulate the structure,
3. Domain knowledge and constraints to ensure the validity of:
- the (static) database states, and
- the operations and transitions between states.
Although ER modelling has been successfully employed for the data modelling of pictorial or spatial and temporal entities [15], OO has clear advantages in the analysis and design of a system presentation layer that employs event-driven interaction with multimedia objects, such as a GIS-based user interface. However, as our concern with the Indonesian Forest sector was with data management and the storage layer of TP and DSS systems targeted at relational DBMS systems, it was felt that structured data analysis based on ER models was adequate, given the limited local skills available.
4. Practical Model Extraction Experience
An example of the data modelling undertaken in Indonesia is the analysis of botanical data held in spreadsheets [16]. Spreadsheets are popular and flexible data manipulation tools; however, they have no data dictionary or other explicit mechanism to act as a repository for metadata, and no input or state transition validation facilities. The implicit metadata is limited to the data-type information of cell values and the range relationships for embedded formulae provided by the audit facilities of the spreadsheet system.
The botanical data consists of sets of Microsoft Excel workbooks, each of which represents a study area and contains two related worksheets of tree and plot data. An example of the plot data, held in the second of the related worksheets, is presented in figure 1. The size of this data set depends on the number of transects in the study area and the number of plots within the transects; typically there are a few tens of records. An example of tree data is given in figure 2. Again the sheet has been transposed and restricted to three records to fit the printed page. In this table it can be seen that, without domain-specific knowledge of transects, plots and subplots or an understanding of the representation of sample data as arrays of percentages, interpretation may be extremely difficult.
It can thus be seen that discovering what the data means is hard and time-consuming without ‘proper’ documentation. The actual process of producing a data model from the spreadsheet data required many iterations of file-gazing, reading available documents and reports, and interaction with domain experts; for completeness, an example data model manually produced from this botanical data is given in figure 3. It was therefore of interest to explore how much of this effort could be automated.
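To make concrete how little implicit metadata such a workbook exposes, the following sketch (our illustration only, not part of TREAT, which was implemented in Visual Basic; it assumes the Python library openpyxl and a hypothetical file name) dumps the only type information directly available: the coarse data type of each cell value, grouped by column heading.

from openpyxl import load_workbook
from collections import Counter

# Hypothetical file name; any study-area workbook with a heading row would do.
wb = load_workbook("study_area_01.xlsx", data_only=True)

for ws in wb.worksheets:
    print(f"Worksheet '{ws.title}': {ws.max_row} rows x {ws.max_column} columns")
    for column in ws.iter_cols(min_row=2):            # assume row 1 holds the headings
        heading = ws.cell(row=1, column=column[0].column).value
        # openpyxl only distinguishes coarse cell types: 'n' numeric, 's' string, 'd' date, ...
        types = Counter(cell.data_type for cell in column if cell.value is not None)
        print(f"  {heading}: {dict(types)}")

Everything beyond this - units, domains, the meaning of transects, plots and subplots - has to come from elsewhere, which is precisely the gap TREAT tries to bridge.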
Fig. 1. Plot data - (sample only). Column headings include: Pengisi, Tempat, Tinggi, Azimuth, Habitat, Tanggal, T, P, Ukuran plot, No. Pohon, Miring, Dominan, Tajuk Utama, Strata ke2, Strata ke3, pohon kecil, Tanlunak, palm, pandan, Pakis, rotan_m, rotan_r, liana_L5, liana_K5, bambu, Epifit, Belukar tua.
5. The Design of a Reverse-Engineering Tool
Much work has been done within the software engineering community on automated ‘reverse-engineering’ of process models from code and re-engineering the systems [17]. Unfortunately reverse engineering of data models is not as well explored. Chikofsky and Cross define reverse engineering as the “Process of analysing a system to identify components and their inter-relationships in order to create representations in another form, usually at a higher level of abstraction” [18]. Re-engineering is defined as the “process of re-implementing a design recovered by means of reverse engineering, possibly in a different environment”. The relationship between these concepts is shown in figure 4.
The term ‘reverse engineering’ comes from the practice of analysing existing hardware (created by a competitor or even an enemy) to understand its design. As the goal is to create an alternative representation, usually at a higher level of abstraction, it is also known as ‘design recovery’. Re-engineering employs reverse engineering to recover a design, followed by forward engineering to re-implement the system in a changed form. Re-engineering does not necessarily involve changing the system’s external appearance or functionality. One of the benefits of database reverse engineering identified by
Premerlani [19] is the identification of errors in the original data design – he reports that 50% of the database systems he studied had major errors.
(Tree-data column headings include: Id no, Tree No., x, y, d, Ht, Hb, Ba, Vol, Ht/D, T, P, SubP, Lf coll, N_Herb, Fam, Genus, Species, Gen_Spe, freq., Sp_rank, Author, Cnt of Gen_Spe.)
Fig. 2. Tree data - (sample only)
As incomplete information is the norm and undocumented metadata is held in the head of the author, a fully automated tool for reverse engineering the forest botanical data is infeasible. The goals were therefore to:
1. Automate as much as is possible,
2. Ensure quality by verification of transforms,
3. Assist the analyst/domain expert where the process cannot be automated by: (i) pruning of the search space, and (ii) presentation of relevant information.
The desired output would be a sound data model with an explicit representation of the entities and relationships, and a stable structure that could support the diverse information needs of TP, MIS and DSS. The resultant tool is called TREAT – which, in the best software engineering tradition, is an acronym for ‘Trial Reverse Engineering Automated Tool’. It was developed using Microsoft Visual Basic Version 5 for Excel 97 and presents a set of interactive steps which, briefly, are as follows.
Fig. 3. ER Model - Forest Botanical data
The first step is to set up and initialize a symbol table of spreadsheet objects and data dictionary to be populated with the identified entities, relationships and attributes. The symbol table lists each workbook and the worksheets within the workbooks, the ranges, data types and names of the ranges of each dataset. This is achieved by iteratively prompting the user to select ranges of headings and of data.
(The figure shows the relationships among: analysis, design, implementation, forward engineering, reverse engineering, design recovery, re-engineering, restructuring, renovation and refactoring.)
Fig. 4. Concepts in Reengineering
The data dictionary is initialized with a dummy ‘super-entity’ from which further entities will hang. Drawing on the information in the symbol table, each workbook is represented as a tentative entity, as is each of the worksheets within the workbooks. As each new entity is formed it is allocated a temporary unique name of the form TEnnn (the user is given the opportunity to change this to something more meaningful at a later stage).
Fig. 5. Example of TREAT – First Pass Amendment of Entity-Attribute List
The symbol table is then processed. Each column entry is made an attribute of the enclosing sheet entity, and the column reference, name and data type are entered into the data dictionary. As each row is a record, each cell is an instance value and could later be forward engineered into a new database. Having initialized the symbol table and data dictionary, we perform an automatic attribute analysis. This consists of pairing and comparing the data values in each of the datasets allocated to the same entity and logging the nature of the relation between them.
Fig. 6. TREAT Modification of Relations
A set of predicates is used, largely derived from Semmens’ [23] and Fraser’s [24] respective formal specifications of ER models, to indicate which columns should be factored out into new entities and the cardinality of the relation between the existing entity and the new one. The mapping relations of prime interest are the functions; in particular bijections, which indicate potential attributes of the same entity, and the forward and reverse injections, which suggest attributes of entities with a one-to-many cardinality. As a composition of injective functions is itself injective, these are also used to indicate indirect dependencies between entities. The relation log (which is actually part of the symbol table) is also used to provide proof obligations on further transforms, to restrict and validate manual changes to the model, and to provide verification constraints on the final data model and future database modifications.
After the automatic attribute analysis has generated a new set of candidate entities and nominally assigned attributes and relations to them, they must be reviewed manually. Following [25], examples of two of these dialogue boxes are given in figures 5 and 6. On completion of a step, control progresses to the next step or iteration. In practice there is much iteration, backtracking and re-ordering of steps, and so facilities are provided through pull-down menus and command buttons for the user to navigate to any of the key steps.
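The automatic attribute analysis can be illustrated with a few simple predicates over pairs of columns. The sketch below is our own illustration (TREAT itself was written in Visual Basic, and all names here are hypothetical): it classifies the row-wise mapping between two columns as a function, an injection or a bijection and turns that into a factoring suggestion.

def relation_kind(col_a, col_b):
    """Classify the row-wise mapping from col_a to col_b over the observed values."""
    forward, backward = {}, {}
    functional, injective = True, True
    for x, y in zip(col_a, col_b):
        if forward.setdefault(x, y) != y:      # one x value maps to two different y values
            functional = False
        if backward.setdefault(y, x) != x:     # two different x values share one y value
            injective = False
    if functional and injective:
        return "bijection"                     # candidates for attributes of the same entity
    if functional:
        return "function"                      # many-to-one: the col_b side may be a new entity
    if injective:
        return "reverse injection"             # one-to-many in the other direction
    return "none"

def suggestions(columns):
    """columns: dict mapping column name to a list of values, all of equal length."""
    names = list(columns)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            kind = relation_kind(columns[a], columns[b])
            if kind == "bijection":
                yield f"'{a}' and '{b}' look like attributes of one entity"
            elif kind in ("function", "reverse injection"):
                yield f"consider factoring out a new entity: '{a}' and '{b}' relate with one-to-many cardinality"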
6. Notes on ‘Our’ TREAT
Numerous problems and exceptions were found in formulating our predicate set which could not be resolved from the information content of the source data alone. Some of these problems were due to the syntactic style of the original data (for example the formulation of strata data as uniquely identified attributes rather than a repeating group of homogeneous attributes, and the use of repeating groups within the strata variables). It may be feasible (but probably not cost-effective) to include algorithms to search for and handle many of these problems. However, the majority of real problems were due to loss of semantics. These problems required considerable domain knowledge to resolve, and it is clear from this that the reverse engineering process could never be fully automated. The tool therefore ‘suggests’ data refinements and the process still requires substantial domain expertise to clarify the semantics of the entities, attributes and relations between them. Because the process is primarily an iterative factoring and decomposition of existing objects, the data model produced is fundamentally hierarchical. Whilst techniques have been suggested by Blaha and Premerlani for transforming hierarchical and network data structures to fully relational or object-relational models [26], the process still demands considerable data modelling and systems analysis expertise and cannot be automated.
7. Conclusion & Further Work In this paper we have stressed the importance of data modeling and data management for forestry data sets and outlined some important information systems concepts for data analysis. We have presented examples of botanical data and discussed some of the problems associated with the data being held in spreadsheets or poorly structured personal databases. In particular we have noted the loss of semantic data, lack of
metadata and formal documentation, and therefore the difficulty in integrating and sharing that data in operational and informational information systems. Finally, the design and implementation of a proof-of-principle prototype for a tool (TREAT) to facilitate the reverse engineering of data models from these data sets has been described. Whilst limited in functionality, the tool has been found to help structure the systems analyst/domain expert knowledge elicitation process and could potentially reduce the substantial time and effort currently required for data analysis of existing forestry data sets. Work is planned to enhance the functionality of TREAT further and to examine the possibility of implementing a fully automated version.
References
1. Date, C.J., An Introduction to Database Systems, Addison-Wesley, 1995
2. Umar, A., Object-Oriented Client/Server Internet Environments, Prentice-Hall, 1997
3. Korth, H.F., Silberschatz, A., Database Research Faces the Information Explosion, Communications of the ACM, 40(2), Feb 1997, pp. 139-142
4. DeLisi, C., Computation and the Human Genome Project: An Historical Perspective, in G.I. Bell and T.G. Marr (eds.), Computers and DNA, Addison-Wesley, 1988, pp. 13-20
5. Boehm, B., Software Engineering Economics, Prentice-Hall, 1981
6. Fisher, G., Experimental Materials Databases, in M.J. Bishop (ed.), Guide to Human Genome Computing, Academic Press, 1994, pp. 39-58
7. Burks, C., The Flow of Nucleotide Sequence Data into Data Banks, in G.I. Bell and T.G. Marr (eds.), Computers and DNA, Addison-Wesley, 1988, pp. 35-46
8. Legg, C.A., Integrated Forest Resource Information System, Brochure prepared for FIMP, Jakarta, 1998
9. Hennessey, P., Ibrahim, M.T., Fedorec, A.M., Formal Specification, Object Oriented Design and Implementation of an Ephemeral Logger for Database Systems, in R. Wagner and H. Thoma (Eds.), Database and Expert Systems Applications (DEXA '96), Springer-Verlag, 1996, pp. 333-355
10. ISO14000, International Standard ISO 14000 – Introduction, http://www.quality.co.uk/iso14000.htm, 1998
11. Furtado, A.L., Neuhold, E.J., Formal Techniques for Database Design, Springer-Verlag, 1986
12. Dawson, C.W., Dawson, R.J., Towards more flexible management of software systems development using meta-models, Software Engineering Journal, May 1995, pp. 79-88
13. Chen, P., The Entity-Relationship Model: Towards a Unified View of Data, ACM Trans. Database Systems, 1(1), 1976, pp. 9-36
14. Rumbaugh, J., Blaha, M., Premerlani, W., Eddy, F., Lorensen, W., Object-Oriented Modeling and Design, Prentice-Hall, 1991
15. Pizano, A., Klinger, A., Cardenas, A., Specification of Spatial Integrity Constraints in Pictorial Databases, IEEE Computer, 22(12), Dec 1989, pp. 59-71
16. Ibrahim, M.T., FIMP (Forest Inventory Management Project) Database Management Report, 1998, in preparation
17. Sneed, H., Planning the Reengineering of Legacy Systems, IEEE Software, Jan 1995
18. Chikofsky, E., Cross II, J., Reverse Engineering and Design Recovery: A Taxonomy, IEEE Software, 7(1), Jan 1990, pp. 13-19
19. Premerlani, W., Blaha, M., An Approach for Reverse Engineering of Relational Databases, Communications of the ACM, 37(5), May 1994, pp. 42-49
20. Holtzblatt, L.J., et al., Design Recovery for Distributed Systems, IEEE Trans. Software Engineering, 23(7), July 1997, pp. 461-472
21. Markosian, L., et al., Using an Enabling Technology to Reengineer Legacy Systems, Communications of the ACM, 37(5), May 1994, pp. 58-71
22. Aiken, P., Muntz, A., Richards, R., DoD Legacy Systems: Reverse Engineering Data Requirements, Communications of the ACM, 37(5), May 1994, pp. 26-41
23. Semmens, L., Allen, P., Using Yourdon and Z: an Approach to Formal Specification, in J.E. Nicholls (Ed.), Proc. Z User Workshop, Oxford 1990, Springer-Verlag, 1991, pp. 228-253
24. Fraser, M.D., Informal and Formal Requirements Specification Languages: Bridging the Gap, IEEE Trans. Software Engineering, 17(5), May 1991, pp. 454-466
25. Sockut, G.H., Malhotra, A., A Full-Screen Facility for Defining Relational and Entity-Relationship Database Schemas, IEEE Software, 5(6), Nov 1988, pp. 68-78
26. Blaha, M., Premerlani, W., Object Oriented Modeling and Design for Database Applications, Prentice-Hall, 1998
27. Laumonier, Y., King, B., Legg, C., Rennolls, K. (eds.), Data Management and Modelling using Remote Sensing and GIS for Tropical Forest Land Inventory, Proceedings of an International Conference, EU, 1999
28. EC/IUFRO, FIRS (Forest Information from Remote Sensing), Proceedings of the Conference on Remote Sensing and Forest Monitoring, Rogow, Poland, 1-3 June 1999; EC 2000
A Very Efficient Order Preserving Scalable Distributed Data Structure
Adriano Di Pasquale¹ and Enrico Nardelli¹,²
¹ Dipartimento di Matematica Pura ed Applicata, Univ. of L’Aquila, Via Vetoio, Coppito, I-67010 L’Aquila, Italia. {dipasqua,nardelli}@univaq.it
² Istituto di Analisi dei Sistemi ed Informatica, Consiglio Nazionale delle Ricerche, Viale Manzoni 30, I-00185 Roma, Italia.
Abstract. SDDSs (Scalable Distributed Data Structures) are access methods specifically designed to satisfy the high performance requirements of a distributed computing environment made up by a collection of computers connected through a high speed network. In this paper we present and discuss performances of ADST, a new order preserving SDDS with a worst-case constant cost for exact-search queries, a worst-case logarithmic cost for update queries, and an optimal worst-case cost for range search queries of O(k) messages, where k is the number of servers covering the query range. Moreover, our structure has an amortized almost constant cost for any single-key query. Finally, our scheme can be easily generalized to manage k-dimensional points, while maintaining the same costs of the 1-dimensional case. We report experimental comparisons between ADST and its direct competitors (i.e., LH*, DRT, and RP*) where it is shown that ADST behaves clearly better. Furthermore we show how our basic technique can be combined with recent proposals for ensuring high-availability to an SDDS. Therefore our solution is very attractive for network servers requiring both a fast response time and a high reliability. Keywords: Scalable distributed data structure, message passing environment, multi-dimensional search.
1 Introduction
The paradigm of SDDS (Scalable Distributed Data Structures) [9] is used to develop access methods in the technological framework known as network computing: a fast network interconnecting many powerful and low-priced workstations, creating a pool of perhaps terabytes of RAM and even more of disk space. The main goal of an access method based on the SDDS paradigm is the management of very large amounts of data, efficiently implementing standard operations (i.e. inserts, deletions, exact searches, range searches, etc.) and aiming at scalability, i.e. the capacity of the structure to keep the same level of performance while the number of managed objects changes. The main measure of performance for a given operation in the SDDS paradigm is the number of point-to-point messages exchanged by the sites of
the network to perform the operation. Neither the length of the path followed in the network by a message nor its size is relevant in the SDDS context. Note that some variants of SDDSs admit the use of multicast to perform range queries. There are several SDDS proposals in the literature, defining structures based on hashing techniques [3,9,12,16,17], on order preserving techniques [1,2,4,7,8,10], or on multi-dimensional data management techniques [11,14], among others. LH* [9] is the first SDDS that achieves worst-case constant cost for exact searches and insertions, namely 4 messages. It is based on the popular linear hashing technique. However, like other hashing schemes, while it achieves good performance for single-key operations, range searches are not performed efficiently. The same is true for any operation executed by means of a scan involving all the servers in the network. On the contrary, order preserving structures (e.g., RP* [10] and DRT* [5]) achieve good performance for range searches and a reasonably low (i.e. logarithmic), but not constant, worst-case cost for single-key operations. Here we present and discuss experimental results for ADST, the first order preserving SDDS proposal achieving single-key performance comparable with LH*, while continuing to provide the good worst-case complexity for range searches typical of order preserving access methods (e.g., RP* and DRT*). For a more detailed presentation of the data structure see [6]. The technique used in our access method can be applied to the distributed k-d tree [14], an SDDS for managing k-dimensional data, with similar results.
2 Distributed Search Trees
In this section we review the main concepts relative to distributed search trees, in order to prepare the way for the presentation of our proposal and to allow a better comparison with previous solutions. Each server manages a unique bucket of keys. The bucket has a fixed capacity b. We define a server “to be in overflow” or “to go in overflow” when it manages b keys and one more key is assigned to it. When a server s goes in overflow it starts the split operation. This operation basically consists of transferring half of its keys to a new fresh server snew. Consequently the interval of keys I it manages is partitioned into I1 and I2. After the split, s reduces its interval I to I1. When snew receives the keys, it initializes its interval to I2. This is the first interval managed by snew, and we refer to such an interval as the basic interval of a server. From a conceptual point of view, the splits of servers build up a virtual distributed tree, where each leaf is associated to a server, and a split creates a new leaf, associated to the new fresh server, and a new internal node. Please note that the lower end of the interval managed by a server never changes. A split operation is performed by locking the involved servers. Its cost is a constant number of messages, typically 4 messages. We recall that, since in the SDDS paradigm the length of a message is not accounted for in the complexity, it is assumed that all keys are sent to the new server using one message.
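The following lines sketch the bucket and split behaviour just described (our illustration; messages and locking are abstracted away): a server keeps at most b keys for its interval and, on overflow, hands the upper half of its keys and of its interval to a fresh server, so the lower end of its interval never changes.

class Server:
    def __init__(self, low, high, capacity):
        self.low, self.high = low, high        # managed interval I = [low, high)
        self.capacity = capacity               # bucket capacity b
        self.keys = []

    def insert(self, key, new_server):
        assert self.low <= key < self.high
        self.keys.append(key)
        if len(self.keys) > self.capacity:     # overflow: start the split operation
            return self.split(new_server)
        return None

    def split(self, new_server):
        self.keys.sort()
        mid = len(self.keys) // 2
        split_key = self.keys[mid]
        fresh = new_server(split_key, self.high, self.capacity)
        fresh.keys = self.keys[mid:]           # transfer the upper half (one message in SDDS terms)
        self.keys = self.keys[:mid]
        self.high = split_key                  # I is reduced to I1 = [low, split_key), while
        return fresh                           # I2 = [split_key, high) is the basic interval of fresh

Here new_server is simply a constructor for a fresh server, e.g. the Server class itself; with b = 100 the overflowing server keeps 50 keys and the new one receives 51, matching the key counts given just below.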
After a split, s manages b/2 keys and snew manages b/2 + 1 keys. It is easy to prove that for a sequence of m intermixed insertions and exact searches we may have at most m/A splits, where A = b/2.
The split of a server is a local operation. Clients and the other servers are not, in general, informed about the split. As a consequence, clients and servers can make an address error, that is, they send the request to a wrong server. Therefore, clients and servers have a local indexing structure, called the local tree. Whenever a client or a server performs a request and makes an address error, it receives information to correct its local tree. This prevents a client or a server from committing the same address error twice. From a logical point of view the local tree is an incomplete collection of associations ⟨server, interval of keys⟩: for example, an association ⟨s, I(s)⟩ identifies a server s and the managed interval of keys I(s). A local tree can be seen as a tree describing the partition of the domain of keys produced by the splits of servers. A local tree can be wrong, in the sense that in reality a server s is managing an interval smaller than what the client currently knows, due to a split performed by s and yet unknown to the client.
Note that for each request for a key k received by a server s, k is within the basic interval I of s, that is, the interval s managed before its first division. This is due to the fact that if a client has information on s, then certainly s manages an interval I′ ⊆ I, due to the way overflow is managed through splits. In our proposal, like in other SDDS proposals, we do not consider deletions, hence intervals always shrink. Therefore if s is chosen as the server to which to send the request for a key k, it means that k ∈ I′ ⇒ k ∈ I.
Given the local tree lt(s) associated to server s, we denote as I(lt(s)) the interval of lt(s), defined as I(lt(s)) = [m, M), where m is the minimum of the lower ends of the intervals in the associations stored in lt(s), and M is the maximum of the upper ends of the intervals in the associations stored in lt(s). From now on we define a server s as pertinent for a key k if k ∈ I(s), and logically pertinent if k ∈ I(lt(s)).
3 ADST
We now introduce our proposal for a distributed search tree, which can be seen as a variant of the systematic correction technique presented in [2]. Let us consider a split of a server s with a new server s′. Given the leaf f associated to s, a split conceptually creates a new leaf f′ and a new internal node v, father of the two leaves. This virtual node is associated to s or to s′. Which one is chosen is not important: we assume to associate it always with the new server, in this case s′. s stores s′ in the list l of servers associated to the nodes on the path from the leaf associated to itself to the root. s′ initializes its corresponding list l′ with a copy of the one of s (s′ included). Moreover, if this was the first split of s, then s identifies s′ as its basic server and stores it in a specific field. Please note that the interval I(v) now corresponds to the basic interval of s.
After the split, s sends a correction message containing the information about the split to s′ and to the other servers in l. Each server receiving the message corrects its local tree. Each list l of a server s corresponds to the path from the leaf associated with s to the root. This technique ensures that a server sv associated to a node v knows the exact partition of the interval I(v) of v and the exact associations between the elements of the partition and the servers managing them. In other words, the local tree of sv contains all the associations ⟨s′, I(s′)⟩ identifying the partition of I(v). Please note that in this case I(v) corresponds to I(lt(sv)). This allows sv to forward a request for a key belonging to I(v) (i.e. a request for which sv is logically pertinent) directly to the right server, without following the tree structure. In this distributed tree rotations are not applied, hence the association between a server and its basic server never changes.
Suppose a server s receives a request for a key k. If it is pertinent for the request (k ∈ I(s)), then it performs the request and answers to the client. Otherwise, if it is logically pertinent for the request (k ∈ I(lt(s))), then it finds in its local tree lt(s) the pertinent server and forwards it the request. Otherwise it forwards the request to its basic server s′. We recall that I(lt(s′)) corresponds to the basic interval of s; then, as stated before, if the request for k has arrived at s, k has to belong to this interval. Then s′ is certainly logically pertinent. Therefore a request can be managed with at most 2 address errors and 4 messages.
The main idea of our proposal is to keep the path between any leaf and the root short, in order to reduce the cost of correction messages after a split. To obtain this we aggregate internal nodes of the distributed search tree obtained with the above described technique into compound nodes, and apply the above technique to the tree made up of compound nodes. For this reason we call our structure ADST (Aggregation in Distributed Search Tree). Please note that the aggregation only happens at a logical level, in the sense that no additional structure has to be introduced.
Each server s in ADST is conceptually associated to a leaf f. Then, as a leaf, s stores the list l of servers managing compound nodes on the path from f to the (compound) root of the ADST. If s has already split at least one time, then it also stores its basic server s′. In this case s′ is a server that manages a compound node and such that I(lt(s′)) contains the basic interval of s. Any server records in a field called adjacent the server managing the adjacent interval on its right. Moreover, if s also manages a compound node va(s), then it also maintains a local tree, in addition to the other information (see figure 1).
The way compound nodes are created in the structure is called the aggregation policy. We require that an aggregation policy creates compound nodes so that the height of the tree made up of compound nodes is logarithmic in the number of servers of the ADST. In such a way the cost of correcting the local trees after a split is logarithmic as well. One can design several aggregation policies satisfying the previous requirement. The one we use is the following.
(AP): To each compound node va a bound l(va) on the number of internal nodes is associated. The bound of the root compound node ra is l(ra) = 1. If the compound node va′, father of va, has bound l(va′), then l(va) = 2 l(va′) + 1. In figure 2 an example of an ADST is presented.
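To see why AP keeps the compound tree of logarithmic height, note that the bound l grows geometrically with the depth of the compound node: the sequence is 1, 3, 7, 15, ..., i.e. 2^(d+1) - 1 at depth d. A tiny illustration of the recurrence (our own, for clarity only):

def ap_bounds(levels):
    """Yield (depth, bound l) for compound nodes, following l(root) = 1, l(child) = 2*l(parent) + 1."""
    bound = 1
    for depth in range(levels):
        yield depth, bound
        bound = 2 * bound + 1

print(list(ap_bounds(5)))    # [(0, 1), (1, 3), (2, 7), (3, 15), (4, 31)]

Since the bounds grow exponentially with depth, a logarithmic number of compound levels suffices to accommodate the internal nodes created by the splits, which is exactly the requirement stated above.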
Fig. 1. Before (left) and after (right) the split of server s with snew as the new server. Intervals are modified accordingly. Correction messages are sent to the servers managing compound nodes stored in the list s.l, and adjacent pointers are modified. Since the aggregation policy decided to create a new compound node and snew has to manage it, snew is added to the list s.l of servers between the leaf s and the compound root node, and snew sets snew.l = s.l. If this is the first split of s, then s sets snew as its basic server.
We now show how a client c looks for a key k in an ADST: c looks for the pertinent server for k in its local tree, finds the server s, and sends it the request. If s is pertinent, it performs the request and sends the result to c.
Suppose s is not pertinent. If s does not manage a compound node, then it forwards the request to its basic server s′. We recall that I(lt(s′)) includes the basic interval of s; then, as stated before, if the request for k has arrived at s, k has to belong to this interval. Therefore s′ is certainly logically pertinent: it looks for the pertinent server for k in its local tree and finds the server s′′. Then s′ forwards the request to s′′, which performs the request and answers to c. In this case c receives the local tree of s′ in the answer, so as to update its local tree (see figure 3).
Suppose now that s manages a compound node. The way in which compound nodes are created ensures that I(lt(s)) includes the basic interval of s itself. Then s has to be logically pertinent, hence it finds in lt(s) the pertinent server and sends it the request. In this case c receives the local tree of s in the answer.
For an insertion, the protocol for exact search is performed in order to find the pertinent server s for k. Then s inserts k in its bucket.
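The access protocol can be summarised by the following sketch (ours; interval, local_tree, basic_server and execute are assumed helper objects, not actual ADST code). It mirrors the three cases above: pertinent, logically pertinent, and the fallback through the basic server, which is guaranteed to be logically pertinent.

def handle_request(server, key, client):
    """Serve or forward a request; at most two forwards (address errors) are ever needed."""
    if server.interval.contains(key):                      # pertinent: answer directly
        return client.receive(server.execute(key))
    target = server.local_tree.lookup(key)                 # logically pertinent: key in I(lt(server))
    if target is not None:
        return client.receive(target.execute(key), server.local_tree)
    # Not even logically pertinent: go through the basic server, whose local tree
    # covers the basic interval of this server and therefore contains the key.
    basic = server.basic_server
    pertinent = basic.local_tree.lookup(key)
    return client.receive(pertinent.execute(key), basic.local_tree)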
Fig. 2. An example of ADST with policy AP. Lower-case letters denote servers and associated leaves, upper-case letters denote intervals of data domain. The sequence of splits producing the structure is a → b → c → d → e, then d → f → g → h → i → l → m → n, then l → o → p, then c → q and finally e → r, meaning with x → y that the split of x creates the server y.
Fig. 3. Worst-case of the access protocol
If this insertion causes s to go in overflow, then a split is performed. After the split, correction messages are sent to the servers in the list l of s.
Previous SDDSs, e.g. LH*, RP*, DRT*, etc., do not explicitly consider deletions. Hence, in order to compare the performances of ADST and previous SDDSs, we shall not analyze the behavior of ADST under deletions.
To perform a range search, the protocol for exact search is performed in order to find the server s pertinent for the leftmost value of the range. If the range is not completely covered by s, then s sends the request to the server s′ stored in its field adjacent. s′ does the same. Following the adjacent pointers, all the servers covering the range are reached and answer to the client. The operation stops
whenever the server pertinent for the rightmost value of the range is reached (see figure 4).
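A sketch of the range-search traversal (again ours, with the same assumed helper fields as before): once the server pertinent for the lower bound has been located with the exact-search protocol, the query simply follows the adjacent pointers until the upper bound is covered.

def range_search(start_server, low, high):
    """Collect partial answers from every server whose interval intersects [low, high)."""
    answers = []
    server = start_server                        # pertinent for the leftmost value low
    while True:
        answers.append(server.scan(low, high))   # each server also answers the client directly
        if server.interval.high >= high:         # the rightmost value is covered: stop
            break
        server = server.adjacent                 # one message per hop: k - 1 messages for k servers
    return answers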
Fig. 4. Worst-case of the range search. s is the server pertinent for the leftmost value of the range.
In the following we give the main results for ADST. Detailed descriptions can be found in the extended version of the paper [6]. An exact search and an insertion that does not cause a split have in an ADST a worst-case cost of 4 messages. A split and the following corrections of local trees have in an ADST a worst-case cost of log n + 5 messages. A range search has in an ADST a worst-case cost of k + 1 messages, where k is the number of servers covering the range of the query, without accounting for the single request message and the k response messages. Moreover, under realistic assumptions, a sequence of intermixed exact searches and insertions on an ADST has an amortized cost of O(1) messages. The basic assumption is that (log n)/b < 1. For real values of b, e.g. hundreds, thousands or more, the assumption is valid for SDDSs made up of up to billions of servers.
ADST has a good load factor under any key distribution, like all the other order preserving structures, that is 0.5 in the worst case and about 0.7 (= ln 2) as expected value.
Another important performance parameter for an SDDS is the convergence of a new client's index. This is the number of requests that a new client starting with an initial index has to perform in order to have an index reflecting the exact structure of the distributed file: this means that the client does not make address errors until new splits occur in the structure. The faster the convergence, the lower the number of address errors made by clients.
In ADST a new client is initialized with a local tree containing the server associated to the compound root node, or the unique server in the case the ADST is made up of just one server. Due to the correction technique used after a split, this server knows the exact partition of the data domain among the servers. Then it is
easy to show that a new client obtains a completely up-to-date local tree after just one request. Also in this case ADST notably improves previous results. In particular we recall that, for an n-server SDDS, the convergence of a new client's index requires in the worst case:
– n messages in any structure of the DRT family,
– O(n/(0.7 f)) messages in RP*s, where f is the fanout of servers in the kernel,
– O(log n) messages in LH*.
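Returning to the amortized single-key cost claimed above, a back-of-the-envelope derivation (ours, using only the bounds stated in this section: at most 4 messages per non-splitting operation, at most log n + 5 messages per split, and at most m/A splits with A = b/2 over m operations) reads:

\[
  C(m) \;\le\; 4m + \frac{m}{A}\,(\log n + 5)
        \;=\; 4m + \frac{2m(\log n + 5)}{b}
  \quad\Longrightarrow\quad
  \frac{C(m)}{m} \;\le\; 4 + \frac{2(\log n + 5)}{b},
\]

which stays below 5 messages per operation even for b = 100 and n in the billions, in line with the O(1) amortized cost and the condition (log n)/b < 1 stated above.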
4 Experimental Comparison
In this section we discuss the results of experimental comparisons between the performances of ADST and those of RP*, DRT* and LH* with respect to sequences of intermixed exact searches and insertions. The outcome is that ADST behaves clearly better than all its competitors.
As discussed previously, ADST presents worst-case constant costs for exact searches, worst-case logarithmic costs for insertions whenever a split occurs, and amortized constant costs for sequences of intermixed exact searches and insertions. On the other hand, LH* is the best SDDS for single-key requests, since it has worst-case constant costs for both exact searches and insertions, and constant costs in the amortized case as well. The objective of our experimental comparison is to show what the difference is between ADST and its direct competitors. We have not considered the case of deletions in our experimental comparison, since this case is not explicitly analyzed in LH*, RP* and DRT*.
In our experiments we perform a simulation of the SDDSs using the CSIM package [15], which is the standard approach in the SDDS literature for this kind of performance evaluation. We analyze structures with a bucket capacity fixed at b = 100 records, which is a small, but reasonable, value for b. Later we describe the behavior of the structures with respect to different values of b. We consider two situations: a first one with 50 clients manipulating the structures and a second one with 500 clients. Finally, we consider three random sequences of intermixed insertions and exact searches: one with 25% of insertions, one with 50% and one with 75%.
We have considered a more realistic situation of fast working clients, in the sense that it is possible that a new request arrives at a server before it has terminated its updating operations. This happens more frequently when considering 500 clients with respect to the case of 50 clients.
The protocol of the operations is the usual one. In the case of exact searches an answer message arrives at the client which issued the request, with the information to correct the client index. In the case of insertions, the answer message is not sent back. This motivates the slightly higher cost for the structures in the case of a low percentage of inserts, even if more exact searches mean more opportunities to correct the index of a client.
Note that, although all costs in LH* are constant while ADST has a logarithmic split cost, in practice ADST behaves clearly better than LH* in this environment, as shown in figures 5, 6, 7, 8, 9 and 10. This is fundamentally motivated by the better capacity of ADST to update client indexes with respect to LH*, which allows clients to commit a lower number of address errors. This difference in capability is shown by the fact that, while LH* slightly increases its access cost when passing from 50 to 500 clients, for ADST the trend is the opposite: with 500 clients the access cost is slightly lower than with 50.
The logarithmic cost of splits for ADST becomes apparent for lower values of b, where the term (log n)/b increases its relative weight. However, lower values for b, e.g. 10, are not realistic for an SDDS involving a large number of servers. On the contrary, the situation shown in the figures is even more favourable to ADST for larger values of b, for example 1000 or more (in this case the term (log n)/b decreases its relative weight). This also happens for a larger number of clients querying the structure, due to the relevant role played by the correction of client indexes.
Fig. 5. Access cost for a bucket capacity b = 100 and for a number of clients c = 50 and a sequence of requests of intermixed exact searches and inserts, with 25% of inserts.
Other possible experiments could have involved the load factor of the structures, but there we fundamentally obtain the same results as other order preserving SDDSs. For a series of experimental comparisons see [7].
For range searches, ADST is clearly better than LH*, where a range search can require visiting all the servers of the structure, even for small ranges. This is a direct consequence of the fact that ADST preserves the order of data. For other order preserving proposals (e.g. RP*s, BDST), the worst-case range search cost is O(k + log n) messages when exclusively using the point-to-point protocol, without accounting for the request message and the k response messages.
Fig. 6. Access cost for a bucket capacity b = 100 and for a number of clients c = 500 and a sequence of requests of intermixed exact searches and inserts, with 25% of inserts.
Fig. 7. Access cost for a bucket capacity b = 100 and for a number of clients c = 50 and a sequence of requests of intermixed exact searches and inserts, with 50% of inserts.
The logarithmic term is due to the possibility that the request arrives at a wrong server and then has to go up in the tree to find the server associated to the node covering the entire range of the query. The base of the logarithm is a fixed number (it is equal to 2 for BDST and to the fanout of servers in the kernel for RP*s), while n is assumed unbounded.
Fig. 8. Access cost for a bucket capacity b = 100 and for a number of clients c = 500 and a sequence of requests of intermixed exact searches and inserts, with 50% of inserts.
Fig. 9. Access cost for a bucket capacity b = 100 and for a number of clients c = 50 and a sequence of requests of intermixed exact searches and inserts, with 75% of inserts.
Hence, in the case of exclusive use of the point-to-point protocol, our algorithm clearly improves the cost for range searches with respect to other order preserving proposals and reaches optimality. Whenever multicast is used, all proposals have the same cost, since in this case the nature of the access method does not affect the cost.
5 Extensions
The basic ADST technique can be extended to:
– manage k-dimensional data; this is obtained considering the distributed k-d tree with index at client and server sites [14];
– manage deletions.
Fig. 10. Access cost for a bucket capacity b = 100 and for a number of clients c = 500 and a sequence of requests of intermixed exact searches and inserts, with 75% of inserts.
The detailed presentation of the extensions would exceed the limits of the paper and can be found in the extended version of this paper [6]. In the following we just consider the fault-tolerance extension of ADST.
5.1 High Availability
In this section we want to focus on the fact that our scheme is not restrictive with respect to techniques for fault tolerance in SDDSs, and that we can consider ADST as an access method completely orthogonal to such techniques. In particular we focus on the techniques for high availability using Reed-Solomon codes, in general based on record grouping [13], and on the very interesting scalable availability provided by this scheme. One of the important aspects of this work is that, with full availability of buckets, the normal access method can be used, while recovery algorithms have to be applied whenever a client cannot access a record in normal mode. In such a case we say that an operation enters degraded mode. A more detailed description of record grouping and of the techniques based on Reed-Solomon codes would exceed the limits of this paper, hence we give only a brief sketch in the following.
In [13] LH* is used as the access method in normal mode. The operations in degraded mode are handled by the coordinator. From the managed state of the file, it locates the bucket to recover, and then proceeds with the recovery using the parity bucket. Apart from the search for the bucket to recover, the access method does not influence the correctness of the recovery algorithm, which is based on parity buckets. Moreover the coordinator is, in some sense, considered to be always available.
As already stated in [13], the same technique may be applied to SDDSs other than LH*. For example we consider ADST. Buckets can be grouped and the parity bucket can be added in the same way as in [13]. We just have to associate a rank to each record in a bucket; this can be the position of the record at the moment of insertion. The ADST technique is used in normal mode. Whenever an operation enters degraded mode, it is handled by the coordinator. We assume that the coordinator is the server s managing the compound root node (or a server with a copy of the local tree of s that behaves like the coordinator in LH*; this is not important here). From the request that entered degraded mode, the coordinator finds the server and then the bucket pertinent for the request. Then s can proceed with the recovery, following the algorithms of [13]. The cost of an operation in degraded mode is just increased by the cost of the recovery.
In this case we want to emphasize that ADST, extended for achieving high availability, still keeps better worst-case and amortized-case performances than the extensions of other proposals, e.g. RP*s or DRT, to a high availability schema.
6 Conclusions
We presented an evaluation of the performances of ADST (Aggregation in Distributed Search Tree). This is the first order preserving SDDS obtaining a constant single-key query cost, like LH*, and at the same time an optimal cost for range queries, like RP* and DRT*. More precisely, our structure features: (i) a cost of 4 messages for exact-search queries in the worst case, (ii) a logarithmic cost for insert queries producing a split in the worst case, (iii) an optimal cost for range searches, that is, a range search can be answered with O(k) messages, where k is the number of servers covering the query range, and (iv) an amortized almost constant cost for any single-key query.
The experimental analysis compares ADST and its direct competitors, namely LH*, RP* and DRT*. The outcome is that ADST has better performances than LH* (hence better than RP* and DRT*) in the average case for single-key requests. Moreover, ADST is clearly better than other order preserving SDDSs, like RP* and DRT*, for range searches (hence better than LH*).
ADST can be easily extended to manage deletions and to manage k-dimensional data. Moreover, we have shown that ADST is orthogonal to the techniques used to guarantee fault tolerance, in particular to the one in [13], which provides a high-availability SDDS. Hence our proposal is very attractive for distributed applications requiring high performances for single-key and range queries, high availability, and possibly the management of multi-dimensional data.
References
1. P. Bozanis, Y. Manolopoulos: DSL: Accommodating Skip Lists in the SDDS Model, Workshop on Distributed Data and Structures (WDAS 2000), L'Aquila, June 2000.
2. Y. Breitbart, R. Vingralek: Addressing and Balancing Issues in Distributed B+ Trees, 1st Workshop on Distributed Data and Structures (WDAS'98), 1998.
3. R. Devine: Design and implementation of DDH: a distributed dynamic hashing algorithm, 4th Int. Conf. on Foundations of Data Organization and Algorithms (FODO), Chicago, 1993.
4. A. Di Pasquale, E. Nardelli: Fully Dynamic Balanced and Distributed Search Trees with Logarithmic Costs, Workshop on Distributed Data and Structures (WDAS'99), Princeton, NJ, Carleton Scientific, May 1999.
5. A. Di Pasquale, E. Nardelli: Distributed searching of k-dimensional data with almost constant costs, ADBIS 2000, Prague, Lecture Notes in Computer Science, Vol. 1884, pp. 239-250, Springer-Verlag, September 2000.
6. A. Di Pasquale, E. Nardelli: ADST: Aggregation in Distributed Search Trees, Technical Report 1/2001, University of L'Aquila, February 2001, submitted for publication.
7. B. Kröll, P. Widmayer: Distributing a search tree among a growing number of processors, ACM SIGMOD Int. Conf. on Management of Data, pp. 265-276, Minneapolis, MN, 1994.
8. B. Kröll, P. Widmayer: Balanced distributed search trees do not exist, 4th Int. Workshop on Algorithms and Data Structures (WADS'95), Kingston, Canada, (S. Akl et al., Eds.), Lecture Notes in Computer Science, Vol. 955, pp. 50-61, Springer-Verlag, Berlin/New York, August 1995.
9. W. Litwin, M.A. Neimat, D.A. Schneider: LH* - Linear hashing for distributed files, ACM SIGMOD Int. Conf. on Management of Data, Washington, D.C., 1993.
10. W. Litwin, M.A. Neimat, D.A. Schneider: RP* - A family of order-preserving scalable distributed data structures, 20th Conf. on Very Large Data Bases, Santiago, Chile, 1994.
11. W. Litwin, M.A. Neimat, D.A. Schneider: k-RP*s - A High Performance Multi-Attribute Scalable Distributed Data Structure, 4th International Conference on Parallel and Distributed Information Systems, December 1996.
12. W. Litwin, M.A. Neimat, D.A. Schneider: LH* - A Scalable Distributed Data Structure, ACM Trans. on Database Systems, 21(4), 1996.
13. W. Litwin, T.J.E. Schwarz, S.J.: LH*RS: a High-availability Scalable Distributed Data Structure using Reed Solomon Codes, ACM SIGMOD Int. Conf. on Management of Data, 1999.
14. E. Nardelli, F. Barillari, M. Pepe: Distributed Searching of Multi-Dimensional Data: a Performance Evaluation Study, Journal of Parallel and Distributed Computing (JPDC), 49, 1998.
15. H. Schwetman: CSIM reference manual, Tech. Report ACT-ST-252-87, Rev. 14, MCC, March 1990.
16. R. Vingralek, Y. Breitbart, G. Weikum: Distributed file organization with scalable cost/performance, ACM SIGMOD Int. Conf. on Management of Data, Minneapolis, MN, 1994.
17. R. Vingralek, Y. Breitbart, G. Weikum: SNOWBALL: Scalable Storage on Networks of Workstations with Balanced Load, Distr. and Par. Databases, 6, 2, 1998.
Business, Culture, Politics, and Sports – How to Find Your Way through a Bulk of News? On Content-Based Hierarchical Structuring and Organization of Large Document Archives
Michael Dittenbach¹, Andreas Rauber², and Dieter Merkl²
¹ E-Commerce Competence Center – EC3, Siebensterngasse 21/3, A–1070 Wien, Austria
² Institut für Softwaretechnik, Technische Universität Wien, Favoritenstraße 9–11/188, A–1040 Wien, Austria
www.ifs.tuwien.ac.at/{~mbach, ~andi, ~dieter}
Abstract. With the increasing amount of information available in electronic document collections, methods for organizing these collections to allow topic-oriented browsing and orientation gain increasing importance. The SOMLib digital library system provides such an organization based on the Self-Organizing Map, a popular neural network model by producing a map of the document space. However, hierarchical relations between documents are hidden in the display. Moreover, with increasing size of document archives the required maps grow larger, thus leading to problems for the user in finding proper orientation within the map. In this case, a hierarchically structured representation of the document space would be highly preferable. In this paper, we present the Growing Hierarchical Self-Organizing Map, a dynamically growing neural network model, providing a content-based hierarchical decomposition and organization of document spaces. This architecture evolves into a hierarchical structure according to the requisites of the input data during an unsupervised training process. A recent enhancement of the training process further ensures proper orientation of the various topical partitions. This facilitates intuitive navigation between neighboring topical branches. The benefits of this approach are shown by organizing a real-world document collection according to semantic similarities.
1 Introduction
With the increasing amount of textual information stored in digital libraries, means to organize and structure this information have gained importance. Specifically, an organization by content, allowing topic-oriented browsing of text collections, provides a highly intuitive approach to exploring document collections. As one of the most successful methods applied in this field we find the Self-Organizing Map (SOM) [5], a popular unsupervised neural network model, which is frequently being used to provide a map-based representation of document
archives [7,2,10,6]. In such a representation, documents on similar topics are located next to each other. The obvious benefit for the user is that navigation in the document archive is similar to the well-known task of navigating in a geographical map. With the SOMLib digital library [9] we developed a system using the SOM as its core module to provide content-based access to document archives. This allows the user to obtain an overview of the topics covered in a collection, and of their importance with respect to the amount of information present in each topical section.
While these characteristics made the SOM a prominent tool for organizing document collections, most of the research work aims at providing one single map representation for the complete document archive. As a consequence, hierarchical relations between documents are lost in the display. Moreover, it is only natural that with increasing size of the document archive the maps for representing the archive grow larger, thus leading to problems for the user in finding proper orientation within the map. We believe that the representation of hierarchical document relations is vital for the usefulness of map-based document archive visualization approaches.
In this paper we argue in favor of establishing such a hierarchical organization of the document space based on a novel neural network architecture, the Growing Hierarchical Self-Organizing Map (GHSOM) [3]. The distinctive feature of this model is its problem-dependent architecture, which develops during the unsupervised training process. Starting from a rather small high-level SOM, which provides a coarse overview of the various topics present in a document collection, subsequent layers are added where necessary to display a finer subdivision of topics. Each map in turn grows in size until it represents its topic to a sufficient degree of granularity. Since usually not all topics are present equally strongly in a collection, this leads to an unbalanced hierarchy, assigning more “map-space” to topics that are more prominent in a given collection. This allows the user to approach and intuitively browse a document collection in a way similar to conventional libraries.
The hierarchical structuring imposed on the data represents a rather strong separation of clusters mapped onto different branches. While this is a highly desirable characteristic, helping in understanding the topical cluster structure in large data sets, it may lead to misinterpretations when long-stretched clusters are mapped and expanded on two neighboring, yet different units of the SOM. This can be alleviated by ensuring proper orientation of the maps in the various branches of the hierarchy, allowing navigation between branches.
We present the benefits of such a hierarchical organization of digital libraries, as well as the stability of the process, using a set of experiments based on a collection of newspaper articles from the daily Austrian newspaper Der Standard. Specifically, we compare two different representations of the topical hierarchy of this archive resulting from different parameter settings.
The remainder of this paper is organized as follows. In Section 2 we provide a brief review of related architectures, followed by a description of the principles of SOM and GHSOM training in Section 3. Subsequently, we provide a detailed
discussion of our experimental results in Section 4 as well as some conclusions in Section 5.
2 Related Work
A number of extensions and modifications have been proposed over the years in order to enhance the applicability of SOMs to data mining, specifically inter- and intra-cluster similarity identification. The Hierarchical Feature Map(HFM ) [8] addresses the problem of hierarchical data representation by modifying the SOM architecture. Instead of training a flat SOM , a balanced hierarchical structure of SOMs is trained. Data mapped onto one single unit is represented at a further level of detail in the lower-level map assigned to this unit. However, this model merely represents the data in a hierarchical way, rather than really reflecting the hierarchical structure of the data. This is due to the fact that the architecture of the network has to be defined in advance, i.e. the number of layers and the size of the maps at each layer is fixed prior to network training. This leads to the definition of a balanced tree which is used to represent the data. What we want, however, is a network architecture definition based on the actual data presented to the network. The shortcoming of having to define the size of the SOM in advance has been addressed in several models, such as the Incremental Grid Growing (IGG) [1] or Growing Grid (GG) [4] models. The former allows the adding of new units at the boundary of the map, while connections within the map may be removed according to some threshold settings, possibly resulting in several separated, irregular map structures. The latter model, on the other hand, adds rows and columns of units during the training process, starting with an initial 2 × 2 SOM . This way the rectangular layout of the SOM grid is preserved.
3 Content-Based Organization of Text Archives
3.1 Feature Extraction
In order to allow content-based classification of documents we need to obtain a representation of their content. One of the most common representations uses word frequency counts based on full-text indexing. A list of all words present in a document collection is created to span the feature space within which the documents are represented. While hand-crafted stop word lists allow for the specific exclusion of frequently used words, statistical measures may be used to serve the same purpose in a more automatic way. For our experiments we thus remove all words that appear either in too many documents within a collection (say, in more than 50% of all documents) or in too few (say, less than 5 documents), as these words do not contribute to content representation. The words are further weighted according to the standard tf × idf (term frequency times inverse document frequency) weighting scheme [11]. This weighting scheme assigns high values to words that are considered important for content representation. The resulting feature vectors may further be used for SOM training.
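A minimal sketch of this feature extraction step (our illustration; the 50% and 5-document thresholds are the example values mentioned above, and plain raw term frequencies are used, which is one of several possible tf variants):

import math
from collections import Counter

def tfidf_vectors(documents, max_df=0.5, min_df=5):
    """documents: list of token lists. Returns the vocabulary and one weight dict per document."""
    n_docs = len(documents)
    df = Counter()                                   # document frequency of each term
    for doc in documents:
        df.update(set(doc))
    vocabulary = {t for t, d in df.items()
                  if d >= min_df and d / n_docs <= max_df}   # drop too-rare and too-frequent words
    vectors = []
    for doc in documents:
        tf = Counter(t for t in doc if t in vocabulary)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})    # tf x idf
    return vocabulary, vectors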
3.2 Self-Organizing Map
The Self-Organizing Map is an unsupervised neural network providing a mapping from a high-dimensional input space to a usually two-dimensional output space while preserving topological relations as faithfully as possible. The SOM consists of a set of i units arranged in a two-dimensional grid, with a weight vector mi ∈ ℝⁿ attached to each unit. Elements from the high-dimensional input space, referred to as input vectors x ∈ ℝⁿ, are presented to the SOM, and the activation of each unit for the presented input vector is calculated using an activation function. Commonly, the Euclidean distance between the weight vector of the unit and the input vector serves as the activation function. In the next step the weight vector of the unit showing the highest activation (i.e. the smallest Euclidean distance) is selected as the 'winner' and is modified so as to more closely resemble the presented input vector. Pragmatically speaking, the weight vector of the winner is moved towards the presented input signal by a certain fraction of the Euclidean distance as indicated by a time-decreasing learning rate α. Thus, this unit's activation will be even higher the next time the same input signal is presented. Furthermore, the weight vectors of units in the neighborhood of the winner, as described by a time-decreasing neighborhood function, are modified accordingly, yet to a lesser degree than the winner. This learning procedure finally leads to a topologically ordered mapping of the presented input signals: similar input data is mapped onto neighboring regions of the map.
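A compact sketch of one such training run is given below (our own illustration, not the authors' code); it assumes a rectangular grid stored as a NumPy array of weight vectors, a Gaussian neighborhood function, and an exponentially decaying learning rate and neighborhood radius.

import numpy as np

def som_train(data, rows=2, cols=2, iterations=10000, lr0=0.5, radius0=1.0):
    dim = data.shape[1]
    rng = np.random.default_rng(0)
    weights = rng.random((rows, cols, dim))          # one weight vector per unit
    grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))
    for t in range(iterations):
        x = data[rng.integers(len(data))]            # present a random input vector
        dists = np.linalg.norm(weights - x, axis=2)  # Euclidean activation of all units
        winner = np.unravel_index(np.argmin(dists), dists.shape)
        lr = lr0 * np.exp(-t / iterations)           # time-decreasing learning rate
        radius = max(radius0 * np.exp(-t / iterations), 1e-3)
        # Gaussian neighborhood centered on the winner, measured on the grid
        grid_dist = np.linalg.norm(grid - np.array(winner), axis=2)
        h = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        weights += lr * h[..., None] * (x - weights) # move units towards the input
    return weights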
3.3 Growing Hierarchical Self-Organizing Map
The key idea of the GHSOM is to use a hierarchical structure of multiple layers where each layer consists of a number of independent SOMs. One SOM is used at the first layer of the hierarchy. For every unit in this map a SOM might be added to the next layer of the hierarchy. This principle is repeated with the third and any further layers of the GHSOM . Since one of the shortcomings of SOM usage is its fixed network architecture we rather use an incrementally growing version of the SOM . This relieves us from the burden of predefining the network’s size, which is rather determined during the unsupervised training process. We start with a layer 0, which consists of only one single unit. The weight vector of this unit is initialized as the average of all input data. The training process then basically starts with a small map of 2 × 2 units in layer 1, which is self-organized according to the standard SOM training algorithm. This training process is repeated for a fixed number λ of training iterations. Ever after λ training iterations the unit with the largest deviation between its weight vector and the input vectors represented by this very unit is selected as the error unit. In between the error unit and its most dissimilar neighbor in terms of the input space either a new row or a new column of units is inserted. The weight vectors of these new units are initialized as the average of their neighbors. An obvious criterion to guide the training process is the quantization error qi , calculated as the sum of the distances between the weight vector of a unit i
and the input vectors mapped onto this unit. It is used to evaluate the mapping quality of a SOM based on the mean quantization error (MQE ) of all units in the map. A map grows until its MQE is reduced to a certain fraction τ1 of the qi of the unit i in the preceding layer of the hierarchy. Thus, the map now represents the data mapped onto the higher layer unit i in more detail. As outlined above the initial architecture of the GHSOM consists of one SOM . This architecture is expanded by another layer in case of dissimilar input data being mapped on a particular unit. These units are identified by a rather high quantization error qi which is above a threshold τ2 . This threshold basically indicates the desired granularity level of data representation as a fraction of the initial quantization error at layer 0. In such a case, a new map will be added to the hierarchy and the input data mapped on the respective higher layer unit are self-organized in this new map, which again grows until its MQE is reduced to a fraction τ1 of the respective higher layer unit’s quantization error qi .
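The following fragment illustrates, in simplified form and under our own naming, how the quantization errors and the two criteria just described interact; it assumes map units stored as in the previous sketch and is only a sketch of the decision logic, not the authors' implementation.

import numpy as np

def quantization_errors(weights, data):
    """qe[i, j] = sum of distances between unit (i, j) and the inputs mapped onto it."""
    flat = weights.reshape(-1, weights.shape[-1])
    bmu = np.argmin(np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2), axis=1)
    qe = np.zeros(flat.shape[0])
    for idx, x in zip(bmu, data):
        qe[idx] += np.linalg.norm(flat[idx] - x)
    return qe.reshape(weights.shape[:2]), bmu

def ghsom_decisions(weights, data, qe_parent, qe_layer0, tau1, tau2):
    qe, _ = quantization_errors(weights, data)
    mqe = qe.mean()                                    # mean quantization error of the map's units
    grow_horizontally = mqe > tau1 * qe_parent         # map still too coarse: insert a row or column
    expand_units = np.argwhere(qe > tau2 * qe_layer0)  # units whose data should get a child map
    return grow_horizontally, expand_units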
Fig. 1. GHSOM reflecting the hierarchical structure of the input data (layers 0 to 3).
A graphical representation of a GHSOM is given in Figure 1. The map in layer 1 consists of 3 × 2 units and provides a rough organization of the main clusters in the input data. The six independent maps in the second layer offer a more detailed view on the data. Two units from one of the second layer maps have further been expanded into third-layer maps to provide sufficiently granular data representation. Depending on the desired fraction τ1 of MQE reduction we may end up with either a very deep hierarchy with small maps, a flat structure with large maps, or – in the most extreme case – only one large map, which is similar to the Growing Grid. The growth of the hierarchy is terminated when no further units are available for expansion. It should be noted that the training process does not necessarily lead to a balanced hierarchy in terms of all branches having the same depth. This is one of the main advantages of the GHSOM , because the structure of the hierarchy adapts itself according to the requirements of the input space.
Therefore, areas in the input space that require more units for appropriate data representation create deeper branches than others. The growth process of the GHSOM is mainly guided by the two parameters τ1 and τ2 , which merit further consideration. – τ2 : Parameter τ2 controls the minimum granularity of data representation, i.e. no unit may represent data at a coarser granularity. If the data mapped onto one single unit still has a larger variation a new map will be added originating from this unit, representing this unit’s data in more detail at a subsequent layer. This absolute granularity of data representation is specified as a fraction of the inherent dissimilarity of the data collection as such, which is expressed in the mean quantization error of the single unit in layer 0 representing all data points. If we decide after the termination of the training process, that a yet more detailed representation would be desirable, it is possible to resume the training process from the respective lower level maps, continuing to both grow them horizontally as well as to add new lower level maps until a stricter quality criterion is satisfied. This parameter thus represents a global termination and quality criterion for the GHSOM . – τ1 : This parameter controls the actual growth process of the GHSOM . Basically, hierarchical data can be represented in different ways, favoring either (a) lower hierarchies with rather detailed refinements presented at each subsequent layer, or (b) deeper hierarchies, which provide a stricter separation of the various sub-clusters by assigning separate maps. In the first case we will prefer larger maps in each layer, which explain larger portions of the data in their flat representation, allowing less hierarchical structuring. In the second case, however, we will prefer rather small maps, each of which describes only a small portion of the characteristics of the data, and rather emphasize the detection and representation of hierarchical structure. Thus, the smaller the parameter τ1 , the larger will be the degree to which the data has to be explained at one single map. This results in larger maps as the map’s mean quantization error (MQE ) will be lower the more units are available for representing the data. If τ1 is set to a rather high value, the MQE does not need to fall too far below the mqe of the upper layer’s unit it is based upon. Thus, a smaller map will satisfy the stopping criterion for the horizontal growth process, requiring the more detailed representation of the data to be performed in subsequent layers. In a nutshell we can say, that, the smaller the parameter value τ1 , the more shallow the hierarchy, and that, the lower the setting of parameter τ2 , the larger the number of units in the resulting GHSOM network will be. In order to provide a global orientation of the individual maps in the various layers of the hierarchy, their orientation must conform to the orientation of
the data distribution on their parents’ maps. This can be achieved by creating a coherent initialization of the units of a newly created map, i.e. by adding a fraction of the weight vectors in the neighborhood of the parent unit. This initial orientation of the map is preserved during the training process. By providing a global orientation of all maps in the hierarchy, potentially negative effects of splitting a large cluster into two neighboring branches can be alleviated, as it is possible to navigate across map boundaries to neighboring maps.
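One way to realize such an orientation-preserving initialization is sketched below; this is our own simplification, in which each unit of a new 2 × 2 child map is initialized from the parent unit's weight vector blended with the parent's neighbors in the corresponding directions.

import numpy as np

def init_child_map(parent_weights, i, j):
    """Initialize a 2x2 child map for parent unit (i, j), oriented by its neighbors.
    parent_weights: array of shape (rows, cols, dim)."""
    rows, cols, dim = parent_weights.shape
    def neighbor(di, dj):
        ni = min(max(i + di, 0), rows - 1)
        nj = min(max(j + dj, 0), cols - 1)
        return parent_weights[ni, nj]
    center = parent_weights[i, j]
    child = np.empty((2, 2, dim))
    # each corner leans towards the parent's neighbor in that direction,
    # so the child map keeps the orientation of the parent layer
    child[0, 0] = (center + neighbor(-1, 0) + neighbor(0, -1)) / 3.0
    child[0, 1] = (center + neighbor(-1, 0) + neighbor(0, 1)) / 3.0
    child[1, 0] = (center + neighbor(1, 0) + neighbor(0, -1)) / 3.0
    child[1, 1] = (center + neighbor(1, 0) + neighbor(0, 1)) / 3.0
    return child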
4 Two Hierarchies of Newspaper Articles
For the experiments presented hereafter we use a collection of 11,627 articles from the Austrian daily newspaper Der Standard covering the second quarter of 1999. To be used for map training, a vector-space representation of the single documents is created by full-text indexing. Instead of defining language or content specific stop word lists, we rather discard terms that appear in more than 813 (7%) or in less than 65 articles (0.56%). We end up with a vector dimensionality of 3,799 unique terms. The 11,627 articles thus are represented by automatically extracted 3,799-dimensional feature vectors of word histograms, weighted by a tf × idf weighting scheme and normalized to unit length.
4.1 Deep Hierarchy
Training the GHSOM with parameters τ1 = 0.07 and τ2 = 0.0035 results in a rather deep hierarchical structure of up to 13 layers.¹ The layer 1 map depicted in Figure 2(a) grows to a size of 4 × 4 units, all of which are expanded at subsequent layers. Among the well separated main topical branches we find Sports, Culture, Radio- and TV programs, the Political Situation on the Balkan, Internal Affairs, Business, or Weather Reports, to name but a few. These topics are clearly identifiable by the automatically extracted keywords using the LabelSOM technique [10], such as weather, sun, reach, degrees for the section on Weather Reports.² The branch of articles covering the political situation on the Balkan is located in the upper left corner of the top-layer map labeled with Balkan, Slobodan Milosevic, Serbs, Albanians, UNO, Refugees, and others. We find the branch on Internal Affairs in the lower right corner of this map listing the three largest political parties of Austria as well as two key politicians as labels. This unit has been expanded to form a 4×4 map in the second layer as shown in Figure 2(b). The upper left area of this map is dominated by articles related to the Freedom Party, whereas, for example, articles focusing on the Social Democrats are located in the lower left corner. Other dominant clusters on this map are Neutrality, or the elections to the European Parliament, with one unit carrying specifically the five political parties as well as the term Election as labels. Two units of this second layer map are further expanded in a third
¹ The maps are available for interactive exploration at http://www.ifs.tuwien.ac.at/~andi/somlib/experiments_standard
² We provide English translations for the original German labels.
layer, such as, for example, the unit in the lower right corner representing articles related to the coalition of the People’s Party and the Social Democrats. These articles are represented in more detail by a 3 × 4 map in the third layer.
(a) Top layer map: 4×4 units; Main topics   (b) Second layer map: 4×4 units; Internal Affairs (unit labels include Balkan, Radio/TV, Sports, Weather, Culture, Business, Internal Affairs, Freedom Party, EU Elections, Neutrality, and Social Democrats)
Fig. 2. Top and second level map.
4.2 Shallow Hierarchy
To show the effects of different parameter settings we trained a second GHSOM with τ1 set to half of the previous value (τ1 = 0.035), while τ2 , i.e. the absolute granularity of data representation, remained unchanged. This leads to a more shallow hierarchical structure of only up to 7 layers, with the layer 1 map growing to a size of 7 × 4 units. Again, we find the most dominant branches to be, for example, Sports, located in the upper right corner of the map, Internal Affairs in the lower right corner, Internet-related articles on the left hand side of the map, to name but a few. However, due to the large size of the resulting first layer map, a fine-grained representation of the data is already provided at this layer. This results in some larger clusters to be represented by two neighboring units already at the first layer, rather than being split up in a lower layer of the hierarchy. For example, we find the cluster on Internal Affairs to be represented by two
neighboring units. One of these, on position (6/4), covers solely articles related to the Freedom Party and its political leader Jörg Haider, representing one of the most dominant political topics in Austria for some time now, resulting in an accordingly large number of news articles covering this topic. The neighboring unit to the right, i.e. located in the lower right corner on position (7/4), covers other Internal Affairs, with one of the main topics being the elections to the European Parliament. Figure 3 shows these two second-layer maps.
Fig. 3. Two neighboring second-layer maps on Internal Affairs
However, we also find, articles related to the Freedom Party on this second branch covering the more general Internal Affairs, reporting on their role and campaigns for the elections to the European Parliament. As might be expected these are closely related to the other articles on the Freedom Party, which are located in the neighboring branch to the left. Obviously, we would like them to be presented on the left hand side of this map, so as to allow the transition from one map to the next, with a continuous orientation of topics. Due to the initialization of the added maps during the training process, this continuous orientation is preserved, as can easily be seen from the automatically extracted labels provided in Figure 3. Continuing from the second layer map of unit (6/4) to the right we reach the according second layer map of unit (7/4), where we first find articles focusing on the Freedom Party, before moving on to the Social Democrats, the People’s Party, the Green Party and the Liberal Party. We thus find the global orientation to be well preserved in this map. Even though the cluster of Internal Affairs is split into two dominant sub-clusters in the more shallow map, the articles are organized correctly on the two separate
maps in the second layer of the map. This allows the user to continue his exploration across map boundaries. For this purpose, the labels of the upper layers neighboring unit may serve as a general guideline as to which topic is covered by the neighboring map. In the deeper hierarchy, these two sub-clusters are represented within one single branch in the second layer of the map, covering the upper and the lower area of the map, respectively.
5 Conclusions
Automatic topical organization is crucial for providing intuitive means of exploring unknown document collections. While the SOM has proven capable of handling the complexities of content-based document organization, its applicability is limited, firstly, by the size of the resulting map, as well as, secondly, by the fact that hierarchical relations between documents are lost within the map display. In this paper we have argued in favour of a hierarchical representation of document archives. Such an organization provides a more intuitive means for exploring and understanding large information spaces. The Growing Hierarchical Self-Organizing Map (GHSOM) has been shown to provide this kind of representation by adapting both its hierarchical structure as well as the sizes of each individual map to represent data at desired levels of granularity. It fits its architecture according to the requirements of the input space, relieving the user from having to define a static organization prior to the training process. Multiple experiments have shown both its capability of hierarchically organizing document collections according to their topics, as well as the benefits of providing a better overview of, especially, larger collections, where single map-based representations tend to become unacceptably large. Furthermore, by preserving a global orientation of the individual maps, navigation between neighboring maps is facilitated. The presented model thus allows the user to intuitively explore an unknown document collection by browsing through the topical sections.
References 1. J. Blackmore and R. Miikkulainen. Incremental grid growing: Encoding highdimensional structure into a two-dimensional feature map. In Proceedings of the IEEE International Conference on Neural Networks (ICNN’93), volume 1, pages 450–455, San Francisco, CA, USA, 1993. http://ieeexplore.ieee.org/. 2. H. Chen, C. Schuffels, and R. Orwig. Internet categorization and search: A selforganizing approach. Journal of Visual Communication and Image Representation, 7(1):88–102, 1996. http://ai.BPA.arizona.edu/papers/. 3. M. Dittenbach, D. Merkl, and A. Rauber. The growing hierarchical self-organizing map. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2000), volume VI, pages 15 – 19, Como, Italy, 2000. IEEE Computer Society. http://www.ifs.tuwien.ac.at/ifs/research/publications.html.
4. B. Fritzke. Growing Grid – A self-organizing network with constant neighborhood range and adaption strength. Neural Processing Letters, 2(5):1 – 5, 1995. http://pikas.inf.tu-dresden.de/˜fritzke. 5. T. Kohonen. Self-organizing maps. Springer-Verlag, Berlin, 1995. 6. T. Kohonen, S. Kaski, K. Lagus, J. Saloj¨ arvi, J. Honkela, V. Paatero, and A. Saarela. Self-organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3):574–585, May 2000. http://ieeexplore.ieee.org/. 7. X. Lin. A self-organizing semantic map for information retrieval. In Proceedings of the 14. Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR91), pages 262–269, Chicago, IL, October 13 - 16 1991. ACM. http://www.acm.org/dl. 8. R. Miikkulainen. Script recognition with hierarchical feature maps. Connection Science, 2:83 – 101, 1990. 9. A. Rauber and D. Merkl. The SOMLib Digital Library System. In Proceedings of the 3. European Conference on Research and Advanced Technology for Digital Libraries (ECDL99), LNCS 1696, pages 323–342, Paris, France, 1999. Springer. http://www.ifs.tuwien.ac.at/ifs/research/publications.html. 10. A. Rauber and D. Merkl. Using self-organizing maps to organize document collections and to characterize subject matters: How to make a map tell the news of the world. In Proceedings of the 10. International Conference on Database and Expert Systems Applications (DEXA99), LNCS 1677, pages 302–311, Florence, Italy, 1999. Springer. http://www.ifs.tuwien.ac.at/ifs/research/publications.html. 11. G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989.
Feature Selection Using Association Word Mining for Classification Su-Jeong Ko and Jung-Hyun Lee Department of Computer Science & Engineering Inha University, Inchon, Korea {[email protected]},{[email protected]}
Abstract. In this paper, we propose an effective feature selection method using association word mining. Documents are represented as association-word vectors that include a few words instead of single words. The focus of this paper is the use of association rules in the reduction of a high-dimensional feature space. The accuracy and recall of document classification depend on the number of words composing the association words, and on the confidence and support used in the Apriori algorithm. We show how the confidence, the support, and the number of words composing association words in the Apriori algorithm can be selected efficiently. We have used a Naive Bayes classifier on text data using the proposed feature-vector document representation. By experiments on categorizing documents, we have shown that the feature selection method based on association word mining is more efficient than information gain and document frequency.
of document classification depend on the number of words for composing association words, confidence, and support at Apriori algorithm. We show how confidence, support, and the number of words for composing association words at Apriori algorithm are selected efficiently. In order to evaluate the performance of feature selection using association word mining designed in this paper, we compare feature selection methods with information gain and document frequency. In this case, we use Naïve Bayes classifier on text data using proposed feature-vector document representation[7].
2 Feature Selection Methods

Scoring of individual features can be performed by using some of the methods used in machine learning for feature selection during the learning process [18].

2.1 Document Frequency (DF)

Document frequency is the number of documents in which a term occurs. We computed the document frequency for each unique term in the training corpus and removed from the feature space those terms whose document frequency was less than some predetermined threshold.

2.2 Information Gain (IG)

Information gain is frequently employed as a term goodness criterion in the field of machine learning. It measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document. Let {c_i}, i = 1, ..., m, denote the set of categories in the target space. The information gain of term t is defined to be:
2.3 Mutual Information (MI)

If one considers the two-way contingency table of a term t and a category c, where A is the number of times t and c co-occur, B is the number of times t occurs without c, C is the number of times c occurs without t, and N is the total number of documents, then the mutual information criterion between t and c is defined to be I(t, c) = log( (A × N) / ((A + B) × (A + C)) ).
2.4 Term Strength (TS)

Term strength was originally proposed and evaluated for vocabulary reduction in text retrieval. Let x and y be an arbitrary pair of distinct but related documents, and let t be a term; then the strength of the term is defined to be s(t) = Pr(t ∈ y | t ∈ x).
3 Feature Selection Using Association Word Mining

3.1 Feature Selection for Document Representation

We adopt the commonly used 'bag-of-words' [12] document representation scheme, in which we ignore the structure of a document and the order of words in the document [5]. The 'bag-of-words' is composed of nouns, pruned by a stop-list, obtained from the results of morphological analysis [13,14]. In this paper, we extend the 'bag-of-words' to a 'bag-of-association-words'. The feature vectors represent the association words observed in the documents. The association word list in the training set consists of all the distinct words that appear in the training samples after removing the stop-words. The AW (association word) list is defined to be: AW = {(w11&w12...&w1(r-1) => w1r), (w21&w22...&w2(r-1) => w2r), ..., (wk1&wk2...&wk(r-1) => wkr), ..., (wp1&wp2...&wp(r-1) => wpr)}. Here, each of {wk1, wk2, ..., wk(r-1), wkr} in (wk1&wk2...&wk(r-1) => wkr) represents a word composing the association word. "p" in AW represents the number of association words in a document, and "r" in AW represents the number of words in an association word. "&" between words means that the pair of words has a high degree of semantic relatedness. "wk1&wk2...&wk(r-1)" is the antecedent of the association word (wk1&wk2...&wk(r-1) => wkr) and "wkr" is its consequent.

3.2 Confidence and Support at Apriori Algorithm

The Apriori algorithm [1,2] extracts association rules between words through data mining [15]. Mining association rules between words consists of two stages. In the first stage, compositions having transaction support in excess of min_support are found to constitute the frequent word items. In the second stage, the frequent word items are used to create association rules from the database. For every frequent word item L, every non-empty subset of L is considered; for each subset A, if the ratio of support(L) to support(A) is not less than min_confidence, a rule of the form A => (L − A) is produced. The support of this rule is support(L). In order to constitute association words, the confidence and the support have to be decided. Equation (1), which defines the confidence, is obtained by dividing the number of transactions that include all items of W1 and W2 by the number of transactions that include the items of W1.

Confidence(W1 -> W2) = Pr(W2 | W1)    (1)
Fig. 1 shows the accuracy and recall of the extracted association words when varying the confidence over one hundred web documents. The one hundred web documents were collected in the game class, one of the eight classes into which the computer-related web documents are classified for this experiment. The recall and accuracy of the mining results have been evaluated using the word thesaurus WordNet [19]. For the evaluation, synonyms, hyponyms, and hypernyms of words related to games were extracted from WordNet, and 300 association words were built from them. If a mined association word is not included in these 300 association words, it is regarded as an error. Accuracy represents the ratio of mined association words not regarded as errors, and recall is the ratio of the evaluation association words that are covered by the mined association words.
Fig. 1. The accuracy and recall of extracted association words when varying the confidence over one hundred web documents (x-axis: confidence; y-axis: recall and accuracy).
Fig. 1 shows that the larger the confidence, the more accurate the association words become, but the lower the recall becomes. However, the recall is almost constant, and the accuracy is high for confidence values of not less than 85. Accordingly, in order to extract the most proper association words, the confidence should be fixed at not less than 85. Equation (2), which defines the support, represents the frequency of an association word among all word sets; it is obtained by dividing the number of transactions that include all items of W1 and W2 by the number of all transactions within the database.

Support(W1 -> W2) = Pr(W1 ∪ W2)    (2)
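To make Equations (1) and (2) concrete, the small sketch below (our own illustration; the paper itself relies on the Apriori algorithm [1,2]) counts the support and confidence of a word rule over a collection of documents, each treated as a transaction of words.

def support_confidence(transactions, antecedent, consequent):
    """transactions: list of word sets; antecedent/consequent: sets of words."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n                       # Pr(W1 U W2), Equation (2)
    confidence = both / ante if ante else 0  # Pr(W2 | W1), Equation (1)
    return support * 100, confidence * 100

# e.g. support_confidence(docs, {"game", "composition"}, {"choice"})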
If the support threshold is set too high, important association words with low frequency can be omitted, while less important association words with high frequency, such as (basics & method & use & designation => execution), are still extracted. Fig. 2 shows the change in accuracy and recall when varying the support over one hundred web documents. The criteria for evaluating accuracy and recall are the same as for the confidence.
Fig. 2. The change in accuracy and recall when varying the support over one hundred web documents (x-axis: support; y-axis: recall and accuracy).
The accuracy and recall curves intersect at a support of 22, and at this point the most proper association words are extracted. However, if the support is set above 22, both accuracy and recall become lower. Accordingly, in order to extract the most trustworthy association words, a support of not more than 22 should be designated.

3.3 Generating Feature

Our document representation includes not only 2-association words but also up to 5-association words occurring in a document. At confidence 90 and support 20 in the Apriori algorithm, we can capture characteristic word combinations in which the number of words increases. The process of generating features is performed in n database passes, where the n-association words are generated in the last pass. For illustration, Fig. 3(a) shows the accumulated number of features generated from 1000 web documents gathered by an HTTP downloader. Let AW denote an association word used as a feature. In Fig. 3(a), we can see that the number of features generated using 2-AW is larger than the others (149,890 for 2-AW vs. 13,932 for 3-AW vs. 3,802 for 4-AW vs. 98 for 5-AW). Fig. 3(b) shows the result of classification using the newly generated features. In order to evaluate the performance of classification using each AW format (2-AW, 3-AW, 4-AW, 5-AW), we use the Naïve Bayes classifier on 500 web documents gathered from the game class of the Yahoo retrieval engine by the HTTP downloader. If the Naïve Bayes classifier using AW classifies a document into any class other than the game class, the classification is counted as incorrect. The accuracy of classification is the rate of documents correctly classified among the 500 documents. In Fig. 3(b), time (sec) is the response time for document classification. As the graph shows, 2-AW has a very bad speed performance; on the other hand, the accuracy of classification using 2-AW is higher than using 4-AW but lower than using 3-AW. The classification using 3-AW is much more accurate than the others, and its speed is comparatively good. 4-AW has a very good speed performance, but the accuracy of classification using 4-AW is much lower than the others. Therefore, it is appropriate to use the 3-association-word format for feature selection in document classification.
(a) Generating feature   (b) Result of classification
Fig. 3. The accumulated number of features during the process of generating features on 1000 web documents, and the result of classification (accuracy and time in seconds for 2-AW, 3-AW, 4-AW, and 5-AW).
4 Document Classification by Naïve Bayes

This chapter illustrates how the Naïve Bayes classifier using association word mining classifies web documents.

4.1 Naïve Bayes Classifier

In order to classify documents, we use the Naïve Bayes classifier [8]. The Naïve Bayes classifier classifies a document through a learning stage and a classifying stage. In order to learn, we must choose Equation (3) for estimating the probability of association words. In particular, we shall assume that the probability of encountering a specific association word AWk is independent of the specific association word position being considered.

P(AWk | vj) = (nk + 1) / (n + |AWKB|)    (3)

Here, n is the total number of association word positions in all training examples whose target value is vj, nk is the number of times the association word AWk is found among these n association word positions, and |AWKB| is the total number of distinct association words found within the training data. In the second stage, a new web document can be categorized by Equation (4), which assigns the class with the highest posterior probability:

v = argmax_vj P(vj) ∏k P(AWk | vj)    (4)
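A minimal sketch of this two-stage classifier is given below; it is our own code, following the Laplace-smoothed estimate of Equation (3) and the maximum-a-posteriori rule of Equation (4), and all identifiers are ours rather than the authors'.

import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        """docs: list of association-word lists; labels: class of each document."""
        self.classes = set(labels)
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = defaultdict(Counter)   # counts[c][aw] = n_k
        self.totals = Counter()              # totals[c]     = n
        vocab = set()
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc)
            self.totals[c] += len(doc)
            vocab.update(doc)
        self.vocab_size = len(vocab)         # |AWKB|

    def predict(self, doc):
        def log_post(c):                     # log P(v_j) + sum of log P(AW_k | v_j), Eq. (3)/(4)
            s = math.log(self.prior[c])
            for aw in doc:
                p = (self.counts[c][aw] + 1) / (self.totals[c] + self.vocab_size)
                s += math.log(p)
            return s
        return max(self.classes, key=log_post)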
4.2 Feature Selection and Document Classification In order to classify document, we first represent document as 3-association words feature using Apriori algorithm. Apriori algorithm can mine association words at confidence 90 and support 20 and 3-association rule. In order to experiment, web documents on field of computer are classified into 8 classes. A basis of classification follows a statistics that the established information retrieval engines- yahoo, altavista and so on - classify words on field of computer. In Table 1, we show an example of 3association words using Apriori algorithm. Table 1. An example of 3-association words format for feature selection Class
Antecedent
game&composition sports&participation method¢er Graphic manufature&use news&offer News& media inforamtion&flash system&business Semiconductor activity&technique world&netizen Security person&maniplation content&site Internet management&shopping input&edit Publication output&color&kind board&printer Hardware slot&Pentium Game
Consequent choice play evaluation process guide radio computer system hacker communication web electronic publication print machine computer
Average confidence
Average support
91.30%
20.1039%
90.10%
21.4286%
99.9%
20.2838%
96.20%
20.3839%
96.30%
21.7583%
94.90%
19.3838%
95.30%
18.2129%
96.20%
21.2532%
Table 2 shows examples of how Naïve Bayes classifier classifies web document(D) using Equation (3) and Equation (4). Apriori algorithm extracts association words, which represent the web document(D). Association words that represent web document(D) are {game&participation=>event, domain&network=>host, laser&inkjet=>printer, game&technique =>development, composition&choice=> play}. In Table 2, Naïve Bayes classifier in Equation (4) assigns class1 to web document(D). Table 2. Documents classification by Naïve Bayes classifier Association word game&participation=>event
Class 1
Class 2
Class Class 3 4
Class 5
Class 6
7
Class 8
1(0.100386)
Domain&network=>host
1(0.00635)
laser&inkjet=>printer game&technique=>develop 1(0.100386) ment composition&choice=> play 1(0.086614) class
Class
0.1724
1(0.0321)
0.0013
0.00642
5 Evaluation

In order to evaluate the performance of feature selection using association word mining (AW), we compare the feature selection methods with IG and DF. We experiment on Naïve Bayes document classification with 1000 web documents gathered by an HTTP downloader. It is important to evaluate accuracy and recall in conjunction. In order to quantify this with a single measure, we use the F-measure in Equation (5), which is a weighted combination of accuracy and recall [4]:

F_measure = ((β² + 1) · P · R) / (β² · P + R),   P = a / (a + b) × 100%,   R = a / (a + c) × 100%    (5)
P and R in Equation (5) represent the accuracy and the recall. Here, a is the number of documents which appear in both classes, b is the number of documents which appear in the class assigned by the first method but not in the class assigned by the second method, and c is the number of documents which appear in the class assigned by the second method but not in the class assigned by the first method. The larger the F-measure is, the better the performance of classification. The parameter β represents the relative weight of recall with respect to accuracy: for β = 1.0, accuracy and recall are weighted equally, and the larger β is than 1.0, the larger the relative weight of recall. In this experiment, we show the results of the F-measure for β = 1.0 as well as for β varying from 0.5 to 1.4.
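The small helper below (our own sketch, with hypothetical parameter names) evaluates Equation (5) from the three counts a, b, and c.

def f_measure(a, b, c, beta=1.0):
    """a: documents in both classes; b: only in the first method's class;
    c: only in the second method's class."""
    precision = a / (a + b) * 100   # P, accuracy in the paper's terminology
    recall = a / (a + c) * 100      # R
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)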
(a) Accuracy of classification   (b) Recall of classification   (c) F-measure at varying β   (d) F-measure of classification
Fig. 4. Performance of the AW method compared to the IG method and the DF method.
Fig. 4 summarizes the performance of the three methods. In Fig. 4(a), we can see that AW is considerably more accurate than the other methods (average 90.17 for AW vs. 87.33 for IG vs. 85.72 for DF). In Fig. 4(b), we can see that both AW and IG have a significant advantage in recall (average 87.92 for AW vs. 87.47 for IG), while DF is lower in recall (average 85.33 for DF); in addition, AW still has an advantage over IG. In Fig. 4(c), when varying β from 0.5 to 1.4, we can see that all methods show similar behavior with respect to the combination of accuracy and recall. In Fig. 4(d), we can see that AW has higher performance than the other methods (average 89.02 for AW vs. 87.39 for IG vs. 85.50 for DF). These results are encouraging and provide empirical evidence that the use of AW can lead to improved performance on document classification.
6 Conclusion

In this paper, we have proposed a feature-vector document representation that includes association words instead of just single words. We believe that the contributions of this paper are twofold. First, we have shown that association rule features can be extracted effectively at confidence 90 and support 20. Second, we have shown that when association words are composed of 3 words, the performance of document classification is most efficient. We have used the Naïve Bayes classifier on text data using the proposed feature-vector document representation. In order to evaluate the performance of feature selection using the association word mining designed in this paper, we compared the feature selection method with information gain and document frequency. By experiments on classifying documents, we have shown that the feature selection method based on association word mining is much more efficient than information gain and document frequency. In the future, the availability of association rules for feature space reduction may significantly ease the application of more powerful and computationally intensive learning methods, such as neural networks, to very large text classification problems that are otherwise intractable.
References
1. R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
2. R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," Proceedings of the 1993 ACM SIGMOD Conference, Washington DC, USA, 1993.
3. W. W. Cohen and Y. Singer, "Context sensitive learning methods for text categorization," Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 307-315, 1996.
4. V. Hatzivassiloglou and K. McKeown, "Towards the automatic identification of adjectival scales: Clustering adjectives according to meaning," Proceedings of the 31st Annual Meeting of the ACL, pp. 172-182, 1993.
5. D. D. Lewis, Representation and Learning in Information Retrieval, PhD thesis (Technical Report pp. 91-93), Computer Science Dept., Univ. of Massachusetts at Amherst, 1992.
6. D. D. Lewis and M. Ringuette, "Comparison of two learning algorithms for text categorization," Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, 1994.
7. Y. H. Li and A. K. Jain, "Classification of Text Documents," Computer Journal, Vol. 41, No. 8, pp. 537-546, 1998.
8. T. Mitchell, Machine Learning, McGraw-Hill, pp. 154-200, 1997.
9. D. Mladenic, "Feature subset selection in text-learning," Proceedings of the 10th European Conference on Machine Learning, pp. 95-100, 1998.
10. D. Mladenic and M. Grobelnik, "Feature selection for classification based on text hierarchy," Proceedings of the Workshop on Learning from Text and the Web, 1998.
11. I. Moulinier, G. Raskinis, and J. Ganascia, "Text categorization: a symbolic approach," Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval, 1996.
12. M. Pazzani and D. Billsus, "Learning and Revising User Profiles: The Identification of Interesting Web Sites," Machine Learning 27, Kluwer Academic Publishers, pp. 313-331, 1997.
13. C. J. van Rijsbergen, Information Retrieval, Butterworths, London, second edition, 1979.
14. G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
15. E. Wiener, J. O. Pedersen, and A. S. Weigend, "A neural network approach to topic spotting," Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval, 1995.
16. P. C. Wong, P. Whitney, and J. Thomas, "Visualizing Association Rules for Text Mining," Proceedings of the 1999 IEEE Symposium on Information Visualization, pp. 120-123, 1999.
17. Y. Yang and C. G. Chute, "An example-based mapping method for text categorization and retrieval," ACM Transactions on Information Systems, pp. 253-277, 1994.
18. Y. Yang and J. O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," Proceedings of the Fourteenth International Conference on Machine Learning, pp. 412-420, 1997.
19. Cognitive Science Laboratory, Princeton University, "WordNet - a Lexical Database for English," http://www.cogsci.princeton.edu/~wn/.
Efficient Feature Mining in Music Objects Jia-Ling Koh and William D.C. Yu Department of Information and Computer Education National Taiwan Normal University Taipei, Taiwan 106, R.O.C. [email protected]
Abstract. This paper proposes novel strategies for efficiently extracting repeating patterns and frequent note sequences from music objects. Based on a bit stream representation, bit index sequences are designed for representing the whole note sequence of a music object with little space requirement. Besides, the proposed algorithm counts the repeating frequency of a pattern efficiently so as to rapidly extract the repeating patterns in a music object. Moreover, with the assistance of appearing bit sequences, another algorithm is proposed for efficiently verifying the frequent note sequences in a set of music objects. Experimental results demonstrate that the proposed approach is more efficient than the related works.
1 Introduction Data mining has received increasing attention in the area of database, with the conventional alphanumeric data having been extensively studied [4]. However, the mining of multimedia data has received lesser attention. Some interesting patterns or rules in multimedia data can be mined to reveal the hidden useful information. In the melody of a music object, many sequences of notes, called repeating patterns, may appear more than once in the object. For example, “sol-sol-sol-mi” is a well-known melody that repeatedly appears in Beethoven's Fifth Symphony. The repeating pattern, an efficient representation for content-based music retrieval, can represent the important characteristics of a music object. Moreover, the sequence of notes that frequently appear in a set of music objects, an interesting pattern called a frequent note sequence, can be used for music data analysis and classification. Hsu [5] proposed an effective means of finding repeating patterns of music objects based on a data structure called correlative matrix. The correlative matrix was used to record the lengths and appearing positions of note patterns that are the intermediate results during the extracting process. However, as the lengths of music objects increase, the memory requirement increases rapidly. The authors of [5] proposed a new approach in 1999 for discovering repeating patterns of music objects [8]. In this approach, the longer repeating pattern was discovered by using string-join operations to repeatedly combine shorter repeating patterns. Therefore, the storage space and execution time is reduced. However, the approach is inefficient when there exist many
repeating patterns whose length are extremely close to the length of the longest repeating patterns. Mining frequent note sequences that differ from mining frequent item sets must consider the order of data items. Agrawal [2] extended the Apriori approach [1] to mine the sequential orders that frequently appear among the item sets in customer transactions. Bettini [3] adopted the finite automata approach to extract the frequent sequential patterns. However, the lengths and the ending of the repeating patterns and frequent note sequences in music objects can not be predicted, such that the terminating state of the finite automata can not be defined. Wang [9] designed suffix tree structure for representing the data set. Although capable of confirming the appearance of a certain data sequence efficiently by tracing the path of the suffix tree, that approach requires much storage space. In light of above developments, this paper presents an efficient approach for extracting all maximum repeating patterns in a music object. Based on bit stream representation, the bit index sequences are designed for representing the whole note sequence of a music object with little space requirement. In addition, the repeating frequency of a candidate pattern can be counted by performing bit operations on bit index sequences so that repeating patterns can be verified efficiently. Moreover, an extended approach for mining frequent note sequences in a set of music objects is provided. Also based on bit stream representation, the appearing bit sequences are designed for representing the music objects in which a note sequence appears. With the aid of appearing bit sequences and bit index sequences, the number of music objects in which a note sequence appears can be counted by performing bit operations. Therefore, the frequent note sequences can be extracted efficiently. The rest of this paper is organized as follows. Section 2 introduces the basic definitions of the problem domain. Section 3 describes the proposed data structure and algorithm for extracting repeating patterns. By extending the idea proposed in Section 3, Section 4 introduces the algorithm for extracting frequent note sequences in a set of music objects. Next, Section 5 summarizes the experimental results that demonstrate the efficiency of the proposed algorithm. Finally, we conclude with a summary and directions for future work in Section 6.
2 Repeating Patterns and Frequent Note Sequences A repeating pattern is a consecutive sequence of notes appearing more than once in a music object and a frequent note sequence is the one appearing among a set of music objects frequently. As important features in music objects, these two kinds of patterns can be used for content-based retrieval and analysis of music data. [Def. 2.1] Let X denote a note sequence consisting of P1P2...Pn, where n denotes a positive integer and each Pi (i=1, ..., n) denotes a note. The length of X is denoted by length(X), whose value is n. [Def. 2.2] Let X denote a note sequence consisting of P1P2...Pn and X’ denote another note sequence consisting of Q1Q2...Qm, where m and n are positive integers and
each Pi (i = 1, ..., n) and Qj (j = 1, ..., m) denotes a note, respectively. X' is called a sub-pattern of X if (1) m ≤ n, and (2) there exists a positive integer i with i + m − 1 ≤ n such that PiPi+1...Pi+m-1 = Q1Q2...Qm.
[Def. 2.3] Let X denote a note sequence consisting of P1P2...Pn, where n denotes a positive integer and each Pi (i = 1, ..., n) denotes a note. The set containing all the sub-patterns of X is {X' | X' = Pi...Pj, where i and j are positive integers, i ≥ 1, j ≤ n, and i ≤ j}, which is denoted by SUBP(X).
[Def. 2.4] A note sequence X is a repeating pattern in music object M if X satisfies (1) X ∈ SUBP(the entire note sequence of M), and (2) FreqM(X) ≥ min-freq. FreqM(X) denotes the repeating frequency of X in M. Besides, min-freq denotes a constant value, which can be specified by users. Hereinafter, min-freq is set to be 2.
[Def. 2.5] The note sequence X is a maximum repeating pattern in M if X is a repeating pattern in M and there does not exist another repeating pattern X' in M such that X is a sub-pattern of X' and FreqM(X) is equal to FreqM(X').
[Example 2.1]

Table 1. Music example for example 2.1
Music ID   Music Melody
M1         DoMiMiMiSoSo
M2         SoMiMiFaReRe
M3         SoSoSoMiFaFaFaRe
M4         SoDoDoReSoDoDo

Consider the music objects as shown in Table 1. In music object M1, FreqM1("Mi") = 3, FreqM1("So") = 2, and FreqM1("MiMi") = 2. Because the repeating frequencies of these three patterns are larger than or equal to min-freq, "So", "Mi", and "MiMi" are repeating patterns in M1. In addition, these three patterns are also maximum repeating patterns in M1. In music object M4, FreqM4("So") = 2, FreqM4("Do") = 4, FreqM4("SoDo") = 2, FreqM4("DoDo") = 2, and FreqM4("SoDoDo") = 2. These five patterns are repeating patterns in music object M4. However, only "Do" and "SoDoDo" are maximum repeating patterns.
[Def. 2.6] Let MS denote a set of music objects. The note sequence Y is a frequent note sequence in MS if SupMS(Y) ≥ min-sup, where SupMS(Y) = Σ(M ∈ MS) ContainsM(Y), with ContainsM(Y) = 1 if FreqM(Y) ≥ 1 and ContainsM(Y) = 0 otherwise. In the definition, min-sup denotes a constant value, which is used to require that a frequent note sequence must appear in at least min-sup music objects in MS. Besides, the notation SupMS(Y) is named the support of Y in MS.
[Def. 2.7] The note sequence Y is a maximum frequent note sequence in MS if Y is a frequent note sequence in MS and there does not exist another frequent note sequence Y' in MS such that Y is a sub-pattern of Y'.
[Def. 2.8] Let P and Q each denote a note in music object M. Q is named an adjacent note of P if the note sequence PQ appears in M. We also say Q is adjacent to P. Moreover, PQ is an adjacent note pair in M.
3 Repeating Patterns Mining

3.1 Bit Index Table

The bit index table is designed based on a bit stream representation. A bit index sequence is constructed for each note in the note sequence of a music object. The length of the bit index sequence equals the length of the note sequence. Assume that the least significant bit is numbered as bit 1 and the numbering increases towards the most significant bit. If the ith note in the note sequence is note N, bit i in the bit index sequence of N is set to be 1; otherwise, the bit is set to be 0. That is, the bits in the bit index sequence of note N represent the appearing locations of N in the music object. Consider the note sequence "DoDoDoMiMi" as an example. The bit index sequence of "Do", denoted by BISDo, is 00111. The entire note sequence of a music object is then represented by a bit index table that consists of the bit index sequences of the various notes in the music object. Consider the music object with note sequence "SoSoSoMiFaFaFaRe". The length of the sequence is 8, and it contains four distinct notes. Table 2 presents the corresponding bit index table.

Table 2. Example of bit index table
Note   Bit index sequence
Re     10000000
Mi     00001000
Fa     01110000
So     00000111
For each note P, the number of bits with value 1 in its bit index sequence implies the repeating frequency FreqM(P). Suppose a note sequence consists of two notes P and Q. The repeating frequency of the note sequence PQ can also be counted efficiently by performing bit operations on the bit index sequences, as described in the following. Initially, the bit index table provides the corresponding bit index sequences of notes P and Q, BISP and BISQ. A left shift operation on BISP is performed. An and operation on the previous result and BISQ is then performed to get the bit index sequence of PQ. The bits with value 1 in this sequence correspond to the positions in music object M where PQ appears. In addition, the number of bits with value 1 represents FreqM(PQ). Similarly, this strategy can be applied to verify whether a note sequence ST, consisting of a note sequence S with length(S) ≥ 1 and a note T, is a repeating pattern. Hereinafter, Frequent_Count(M, P) represents the function for counting the repeating frequency of note sequence P in music object M.
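A direct way to realize these bit operations is to keep each bit index sequence as an integer, as in the following sketch; this is our own illustration of the scheme described above, and the function names are ours.

def build_bit_index_table(notes):
    """notes: list of note names, e.g. ["So","So","So","Mi","Fa","Fa","Fa","Re"]."""
    table = {}
    for i, n in enumerate(notes):              # bit i+1 marks position i+1 of note n
        table[n] = table.get(n, 0) | (1 << i)
    return table

def frequent_count(table, pattern):
    """Count how often the note sequence `pattern` appears, via shift-and-AND."""
    bits = table.get(pattern[0], 0)
    for note in pattern[1:]:
        bits = (bits << 1) & table.get(note, 0)   # align previous matches with the next note
    return bin(bits).count("1")

# frequent_count(build_bit_index_table(list("SSSMFFFR")), list("SS")) == 2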
[Example 3.1] Given the bit index table of music object M as shown in Table 2, where the represented note sequence is "SoSoSoMiFaFaFaRe", the process for evaluating whether the note sequences "SoSo" and "SoSoMi" are repeating patterns is as follows.
<1> Frequent_Count(M, SoSo)
1) Obtain the bit index sequence of the second note "So", BISSo = 00000111.
2) Perform the left shift operation on BISSo, which is 00000111, and assign the result to the temporal variable t: t = shl(BISSo, 1) = 00001110.
3) Perform r = t ∧ BISSo; the resultant bit sequence is 00000110.
4) Count the number of 1s in the bit sequence r and get FreqM(SoSo) = 2. This implies that "SoSo" appears in M two times and, thus, is a repeating pattern in M.
<2> Frequent_Count(M, SoSoMi)
1) Obtain BISMi = 00001000.
2) BISSoSo, 00000110, is obtained from the previous step. Perform the left shift operation and assign it to variable t: t = 00001100.
3) Perform r = t ∧ BISMi; in doing so, the bit sequence 00001000 is obtained.
4) The number of 1s in the bit sequence r is 1. This indicates that FreqM(SoSoMi) = 1 and, thus, "SoSoMi" is not a repeating pattern in M.

3.2 Algorithm for Mining Repeating Patterns

This subsection presents an algorithm, named the MRP algorithm, for mining repeating patterns by applying the bit index table. Redundancy may occur among the repeating patterns and, therefore, only the maximum repeating patterns are extracted. The algorithm consists of two phases: mining repeating patterns and extracting maximum repeating patterns. The mining phase applies the depth first search approach. First, the candidate pattern consists of a single note. If the pattern is a repeating pattern, the algorithm constructs a new candidate pattern by adding an adjacent note to the old one and verifies its repeating frequency. If the new candidate is a repeating pattern, the process continues recursively. Otherwise, the process returns to the old candidate pattern before the note was added and attempts to add another note to the sequence. Consider the music object with note sequence "ABCABC". The candidate pattern with length 1, "A", is initially chosen and its repeating frequency is counted. Since the repeating frequency of "A" equals 2, "A" is a repeating pattern. Next, an adjacent note of "A" is added, and the candidate pattern "AB" is constructed as well. After verifying that "AB" satisfies the definition of a repeating pattern, the candidate "ABC" is then constructed. Although "ABC" is verified to be a repeating pattern, a repeating pattern with length 4 cannot be obtained by adding any note to "ABC". Thus, the recursive process is terminated and the status returns to the sequence "AB" and then back to the sequence "A". When the process returns the status back to the sequence "A", no other adjacent notes of "A" can be added. Next, another candidate pattern with length 1, "B", is chosen and the above process repeats. After that, the
repeating patterns "B" and "BC" are found. In addition, the final candidate pattern with length 1, "C", is also a repeating pattern. Among the extracted repeating patterns, not all repeating patterns are maximum repeating patterns. Therefore, the second phase is required to investigate the sub-pattern relationship and the repeating frequencies of every pair of repeating patterns in order to remove the non-maximum repeating patterns. In order to reduce the processing cost of the second phase, the following property is applied in the mining phase for removing a part of the non-maximum repeating patterns. If repeating pattern T is constructed from repeating pattern S by adding a note, S is a sub-pattern of T. Therefore, we can be sure that S is not a maximum repeating pattern if S and T have the same repeating frequency, and S can be removed in the mining phase. In the example illustrated above, the repeating frequency of the sequence "AB" is 2, and so is that of the sequence "ABC". Therefore, "AB" is not a maximum repeating pattern. Similarly, "A" is not a maximum repeating pattern. Therefore, only the sequences "ABC", "BC", and "C" remain in the result, because they are possibly maximum repeating patterns and require the processing of the second phase. This strategy filters out many non-maximum repeating patterns, which makes the extraction of the maximum repeating patterns during the second phase more efficient. Among the possible maximum repeating patterns remaining in the set RP, no two repeating patterns exist that have the same repeating frequency where one is a prefix of the other. These patterns are stored in a table as shown in Table 3. In addition to the note sequence of the repeating pattern, the other three columns are used to store the length, the repeating frequency, and the final starting position where the pattern appears in the music object. Applying the table allows us to extract the maximum repeating patterns without performing string matching. The function MMRP() is used to extract the maximum repeating patterns. The pseudo code for function MMRP() is shown below. The first two predicates in the "if" clause are used to verify whether note sequence T is a sub-pattern of sequence S. If T is a sub-pattern of S and has the same repeating frequency as S, T is not a maximum repeating pattern and is removed from the results.

Function MMRP()
{ for each note sequence S in RP
    for each note sequence T in RP and T ≠ S
      i := the final starting position where S appears
      j := the final starting position where T appears
      if (i < j) and (length(S) + i ≥ length(T) + j) and (FreqM(S) = FreqM(T))
        RP := RP − {T}
    end for
  end for }

Table 3. Example of maximum repeating patterns
Repeating pattern   Length   Repeating frequency   Final starting position
ABC                 3        2                     4
BC                  2        2                     5
C                   1        2                     6
Consider the note sequence "ABCABC". Table 3 displays the possible maximum repeating patterns extracted in the mining phase. Both the final starting positions of "BC" and "C" are larger than that of "ABC". Adding the final starting position and the length of each of these patterns gives the same result, and their repeating frequencies are all the same. Therefore, only "ABC" is a maximum repeating pattern among these patterns.
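To make the bit-sequence manipulation used by Frequent_Count() concrete, the following Python sketch builds a bit index table and counts repeating frequencies with shift and and operations. It is only an illustration of the idea described in Example 3.1, not the authors' implementation; the function names are ours, and bit i of each sequence corresponds to note position i+1 of the melody.

def build_bit_index(notes):
    # Bit index table: for each distinct note, set the bit whose position
    # corresponds to the note's position in the melody.
    table = {}
    for pos, note in enumerate(notes):
        table[note] = table.get(note, 0) | (1 << pos)
    return table

def frequent_count(bis_prefix, bis_next_note):
    # One extension step of Frequent_Count(): shift the bit index sequence of
    # the current pattern left by one and AND it with the bit index sequence
    # of the appended note. The 1 bits of the result mark where the extended
    # pattern ends, so their count is its repeating frequency.
    r = (bis_prefix << 1) & bis_next_note
    return r, bin(r).count("1")

# Example 3.1: melody "SoSoSoMiFaFaFaRe"
notes = ["So", "So", "So", "Mi", "Fa", "Fa", "Fa", "Re"]
table = build_bit_index(notes)
bis_soso, freq_soso = frequent_count(table["So"], table["So"])    # freq_soso = 2
bis_sosomi, freq_sosomi = frequent_count(bis_soso, table["Mi"])   # freq_sosomi = 1

Running the sketch reproduces FreqM(SoSo) = 2 and FreqM(SoSoMi) = 1, matching Example 3.1.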
4 Frequent Note Sequences Mining

This section describes a novel algorithm for mining frequent note sequences in a set of music objects. The note combination structure is designed for storing the adjacent note pairs appearing in a set of music objects. Applying the structure allows us to produce the candidate frequent note sequences efficiently in the mining process.

4.1 Note Combination Structure

The note combination structure is a two-dimensional array named the NC table. Notably, the index values of the array correspond to the distinct notes. The data stored in array entry NC[M][N] is a bit stream representing the music objects in which the note sequence MN appears and is named the appearing bit sequence of MN. The length of the sequence equals the number of music objects. If MN appears in the ith music object, bit i in the appearing bit sequence is set to be 1; otherwise, the bit is set to be 0. In the following, AppearS denotes the appearing bit sequence of a note sequence S.

Table 4. Example of NC table

        Do    Re    Mi    Fa
Do      000   011   100   000
Re      000   000   011   000
Mi      000   000   000   111
Fa      110   000   000   000
[Example 4.1] Consider the note sequences of three music objects: "DoReMiFa", "ReMiFaDoRe", and "MiFaDoMi". All the adjacent note pairs appearing in the first music object are "DoRe", "ReMi", and "MiFa". Because "DoRe" appears in music object 1, the first bit of the appearing bit sequence in NC[Do][Re] is set to be 1. Table 4 presents the NC table for these three music objects. The bit sequence stored in NC[Do][Re], which is 011, represents that the note sequence "DoRe" appears in music object 1 and 2. Similarly, AppearMiFa is stored in NC[Mi][Fa]. The sequence "111" implies that "MiFa" appears in all the music objects. For a single note P, the appearing bit sequence of P, denoted as AppearP, represents the music objects in which note P appears. This sequence can be obtained via performing or operations on the sequences stored in the row and column indexed by P. Then the number of bits with value 1 in AppearP represents the support of P.
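A minimal sketch of how the NC table and the appearing bit sequences just described could be maintained, assuming Python integers as bit sequences with the least significant bit denoting music object 1 (the function names are ours, not the paper's):

from collections import defaultdict

def build_nc_table(music_objects):
    # NC[m][n] is the appearing bit sequence of the note pair mn: bit i is 1
    # iff mn occurs in the (i+1)-th music object.
    nc = defaultdict(lambda: defaultdict(int))
    for i, notes in enumerate(music_objects):
        for m, n in zip(notes, notes[1:]):
            nc[m][n] |= 1 << i
    return nc

def appear_single(nc, p):
    # Appearing bit sequence of a single note p: OR of the sequences stored in
    # the row and the column of the NC table indexed by p.
    bits = 0
    for seq in nc[p].values():
        bits |= seq
    for m in nc:
        bits |= nc[m].get(p, 0)
    return bits

music_objects = [["Do", "Re", "Mi", "Fa"],
                 ["Re", "Mi", "Fa", "Do", "Re"],
                 ["Mi", "Fa", "Do", "Mi"]]
nc = build_nc_table(music_objects)
support_do = bin(appear_single(nc, "Do")).count("1")   # 3: "Do" appears in every object

For the three music objects of Example 4.1, the sketch yields NC[Do][Re] = 011, NC[Mi][Fa] = 111, and a support of 3 for note "Do", matching Table 4.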
For a note sequence consisting of two notes, P and Q, AppearPQ is obtained from the NC table directly. Therefore, the support of PQ can also be counted efficiently according to the number of 1s in AppearPQ.

Suppose a note sequence S with length n (n >= 2) is a frequent note sequence, and a new candidate note sequence is constructed by adding a note Q to S. In order to reduce the processing cost of frequent note sequence verification, information in the NC table is used to filter out the non-frequent note sequences as early as possible. Suppose the last note in S is P. The note sequence SQ remains a candidate if both of the following two requirements are satisfied.
1. The number of bits with value 1 in AppearPQ, stored in NC[P][Q], is larger than or equal to min_sup.
2. Perform AppearS and AppearPQ. The number of 1s in the resultant sequence is larger than or equal to min_sup.
The result of AppearS and AppearPQ is an approximation of AppearSQ; it represents the music objects in which both S and PQ appear. If the number of bits with value 1 in this sequence is larger than or equal to min_sup, both S and PQ are frequent note sequences. However, further verification is required by invoking function Frequent_Count() to verify whether SQ actually occurs as a note sequence in the music objects specified by the bits with value 1.

4.2 Algorithm for Mining Frequent Note Sequences

This subsection presents the MFNS algorithm designed for mining frequent note sequences, which mainly consists of the following three steps.
[Step 1] Bit index tables and NC table construction. The note sequences of the given music objects are scanned sequentially. The bit index table for each music object and the NC table for the given music objects are then constructed.
[Step 2] Frequent note sequence extraction. Similar to the mining phase of the MRP algorithm, this step also applies the depth first search approach to extract the frequent note sequences. Initially, the candidate note sequence consists of a single note. The information in the NC table and the bit index tables is used to verify whether a candidate is a frequent note sequence and to construct new candidate sequences recursively.
[Step 3] Maximum frequent note sequence extraction. This step extracts the maximum frequent note sequences. For any two frequent note sequences P and P', if P is a sub-pattern of P', P is not a maximum frequent note sequence and is removed from the result.

[Example 4.2] Table 4 displays the NC table of a set of music objects, and min_sup is set to be 2. Frequent note sequence mining according to Step 2 of the algorithm is performed as follows.
<1> Select note "Do" as the first note of candidate note sequences. After performing AppearDoRe or AppearDoMi or ... or AppearFaDo, the outcome is "111". The sequence represents that "Do" appears in all the music objects and is a frequent note sequence. Then note "Re" is chosen to be added adjacent to "Do".
AppearDoRe="011" implies that "DoRe" appears in two music objects and is a frequent note sequence. Next, note "Mi" is chosen to construct candidate sequence "DoReMi". The outcome is "011" after performing AppearDoRe ("011") ¾ AppearReMi ("011"). This occurrence implies that "DoRe" and "ReMi" appear in the first and second music objects. It is necessary to verify whether if "DoReMi" exists in the first and second music objects. Closely examining the bit index sequences of "DoReMi" in the first and second music objects reveals that "DoReMi" only appears in the first music. Therefore, "DoReMi" is not a frequent note sequence. Next, no other notes can be added to the frequent note sequences "DoRe" and "Do", individually, to construct new candidates. Therefore, the mining process for the frequent note sequences begins with "Do" terminates. "Do" and "DoRe" are the extracted frequent note sequences. <2> Select note "Re" as the first note of candidate note sequences. Applying the same mining process allows us to extract the frequent note sequences "Re", "ReMi", and "ReMiFa" sequentially. Next, the note "Do" is chosen to construct a candidate pattern "ReMiFaDo". However, after performing the and operation on AppearReMiFa ("011") and AppearFaDo("110"), there is only one bit with value 1 in the resultant sequence, implying that only the second music object contains "ReMiFa" and "FaDo" at the same time. Therefore, "ReMiFaDo" is obviously not a frequent note sequence. The frequent note sequences beginning with "Re" are "Re", "ReMi", and "ReMiFa". Similarly, notes "Mi" and "Fa" are chosen as the first note of candidate note sequences individually, and the above process repeats. Finally, the extracted frequent note sequences beginning with "Mi" are "Mi", "MiFa", and "MiFaDo". The frequent note sequences beginning with "Fa" are "Fa" and "FaDo". For a frequent note sequence K, K is not a maximum frequent note sequence if K is a prefix of another frequent note sequence. Such sequences are not stored in MaxFNS when mining frequent note sequences. Therefore, step 3 can be performed more efficiently via only considering the frequent note sequences in MaxFNS. Among the extracted frequent note sequences, only "DoRe", "ReMiFa", "MiFaDo" and "FaDo" are stored in MaxFNS, which are possible maximum frequent note sequences. After step 3 is performed, only the maximum frequent note sequences "DoRe", "ReMiFa", and "MiFaDo" remain.
5 Performance Study

The efficiency of the proposed MRP algorithm is evaluated by comparing it with two related approaches: the correlative matrix approach [5] and the string join method [8]. The algorithms were implemented and run on a personal computer. The data sets used in the experiments include synthetic music objects and real music objects. Herein, the object size of a music object is defined as the length of its note sequence, and the note count represents the number of distinct notes appearing in the music object. In addition, the total repeating frequency of the repeating patterns in a music object is called the RP repeating frequency, and the number of maximum repeating patterns is
named the RP count. Due to the page limit, the parameter settings of the synthetic music objects for each experiment are given in [7]. According to the results of the experiments shown in Fig. 1, the execution time of the MRP algorithm is less than 0.2 seconds for the real music objects and less than 0.01 seconds for the synthetic music objects. The proposed algorithm is more efficient than the other two approaches. Moreover, the memory requirement of the MRP algorithm is less than that of the other approaches, confirming that the algorithm proposed herein is highly appropriate for mining maximum repeating patterns.
Fig. 1. (a) illustrates the execution time versus the object size of the synthetic music objects. (b) shows the results for eight real music objects. (c) indicates that the execution time is inversely proportional to the note count. (d) reveals that the RP repeating frequency influences the execution time of the correlative matrix approach more significantly than that of the MRP algorithm and the string join method. (e) reveals that the execution times of all three algorithms increase with the length of the longest repeating pattern. (f) shows that the memory requirement of the MRP algorithm is the smallest among the three algorithms.
6 Conclusion

This paper presents a novel means of mining the maximum repeating patterns in the melody of a music object. With the design of the bit index sequence representation, the repeating frequency of a candidate pattern can be counted efficiently. In addition, the proposed approach is extended to extract frequent note sequences in a set of music objects. With the aid of the appearing bit sequence representation and the note combination structure, the MFNS algorithm is proposed to generate candidate sequences and verify the frequent note sequences efficiently. Experimental results indicate that both the execution time and the memory requirement of the proposed MRP algorithm are lower than those of the other two related approaches for mining maximum repeating patterns. Moreover, our approach is extended to mine the frequent note sequences in a set of music objects, a problem which has seldom been addressed previously. Future work should address music data clustering and classification based on the extracted repeating patterns. Furthermore, music association rules among frequent note sequences should be analyzed to develop an intelligent agent that automatically retrieves similar music objects. Finally, a future application will extend the proposed techniques to DNA and protein sequence mining.
References
1. R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," in Proc. 20th International Conference on Very Large Data Bases, 1994.
2. R. Agrawal and R. Srikant, "Mining Sequential Patterns," in Proc. IEEE International Conference on Data Engineering (ICDE), Taipei, Taiwan, 1995.
3. C. Bettini, S. Wang, S. Jajodia, and J.-L. Lin, "Discovering Frequent Event Patterns with Multiple Granularities in Time Sequences," IEEE Trans. on Knowledge and Data Eng., Vol. 10, No. 2, 1998.
4. M.S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. Knowledge and Data Eng., Vol. 8, No. 6, Dec. 1996.
5. J.-L. Hsu, C.-C. Liu, and A.L.P. Chen, "Efficient Repeating Pattern Finding in Music Databases," in Proc. 1998 ACM 7th International Conference on Information and Knowledge Management (CIKM'98), 1998.
6. R.J. Bayardo Jr., "Efficiently Mining Long Patterns from Databases," in Proc. ACM SIGMOD International Conference on Management of Data, 1998.
7. J.-L. Koh and W.D.C. Yu, "Efficient Repeating and Frequent Sequential Patterns Mining in Music Databases," Technical Report, Department of Information and Computer Education, National Taiwan Normal University.
8. C.-C. Liu, J.-L. Hsu, and A.L.P. Chen, "Efficient Theme and Non-Trivial Repeating Pattern Discovering in Music Databases," in Proc. IEEE International Conference on Data Engineering, 1999.
9. K. Wang, "Discovering Patterns from Large and Dynamic Sequential Data," Journal of Intelligent Information Systems (JIIS), Vol. 9, No. 1, 1997.
An Information-Driven Framework for Image Mining
Ji Zhang, Wynne Hsu, and Mong Li Lee
School of Computing, National University of Singapore {zhangji, whsu, leeml}@comp.nus.edu.sg
Abstract. Image mining systems that can automatically extract semantically meaningful information (knowledge) from image data are increasingly in demand. The fundamental challenge in image mining is to determine how the low-level pixel representation contained in a raw image or image sequence can be processed to identify high-level spatial objects and relationships. To meet this challenge, we propose an efficient information-driven framework for image mining. We distinguish four levels of information: the Pixel Level, the Object Level, the Semantic Concept Level, and the Pattern and Knowledge Level. High-dimensional indexing schemes and retrieval techniques are also included in the framework to support the flow of information among the levels. We believe this framework represents the first step towards capturing the different levels of information present in image data and addressing the issues and challenges of discovering useful patterns/knowledge from each level.
1 Introduction

An extremely large amount of image data, such as satellite images, medical images, and digital photographs, is generated every day. These images, if analyzed, can reveal useful information to the human user. Unfortunately, there is a lack of effective tools for searching for and finding useful patterns in these images. Image mining systems that can automatically extract semantically meaningful information (knowledge) from image data are increasingly in demand. Image mining deals with the extraction of implicit knowledge, image data relationships, or other patterns not explicitly stored in the images and between image and other alphanumeric data. It is more than just an extension of data mining to the image domain. It is an interdisciplinary endeavor that draws upon expertise in computer vision, image processing, image retrieval, data mining, machine learning, databases, and artificial intelligence [6]. Despite the development of many applications and algorithms in the individual research fields, research in image mining is still in its infancy. The fundamental challenge in image mining is to determine how the low-level pixel representation contained in a raw image or image sequence can be processed to identify high-level spatial objects and relationships.
domain-related alphanumeric data and the semantic concepts obtained from the image data to discover underlying domain patterns and knowledge. High-dimensional indexing schemes and retrieval techniques are also included in the framework to support the flow of information among the levels. This framework represents the first step towards capturing the different levels of information present in image data and addressing the issues and the work that has been done in discovering useful patterns/knowledge from each level.
The rest of this paper is organized as follows: Section 2 presents an overview of the proposed information-driven image mining architecture. Section 3 describes each of the information levels. Section 4 discusses how each of the information levels can be organized and indexed. Section 5 discusses related work, and we conclude in Section 6.
2 Information-Driven Image Mining Framework
The image database containing raw image data cannot be directly used for mining purposes. Raw image data need to be processed to generate the information that is usable for high-level mining modules. An image mining system is often complicated because it employs various approaches and techniques ranging from image retrieval and indexing schemes to data mining and pattern recognition. Such a system typically encompasses the following functions: image storage, image processing, feature extraction, image indexing and retrieval, and pattern and knowledge discovery. A number of researchers have described their image mining frameworks from the functional perspective [6, 25, 37]. While such a function-based framework is easy to understand, it fails to emphasize the different levels of information representation necessary for image data before meaningful mining can take place. Figure 1 shows our proposed information-driven framework for image mining. There are four levels of information, starting from the lowest Pixel Level, through the Object Level and the Semantic Concept Level, to the highest Pattern and Knowledge Level. Inputs from domain scientists are needed to help identify domain-specific objects and semantic concepts. At the Pixel Level, we are dealing with information relating to the primitive features such as color, texture, and shape. At the Object Level, simple clustering algorithms and domain experts help to segment the images into some meaningful regions/objects. At the Semantic Concept Level, the objects/regions identified earlier are placed in the context of the scenes depicted. High-level reasoning and knowledge discovery techniques are used to discover interesting patterns. Finally, at the Pattern and Knowledge Level, the domain-specific alphanumeric data are integrated with the semantic relationships discovered from the images, and further mining is performed to discover useful correlations between the alphanumeric data and the patterns found in the images. Such correlations are particularly useful in the medical domain.
Fig. 1. The proposed information-driven framework for image mining: the Pixel Level, the Object Level, the Semantic Concept Level, and the Pattern and Knowledge Level, together with the image, object/region, alphanumeric, and patterns/knowledge databases, the feature extraction, image indexing/retrieval, and KDD modules, and the user interface for data visualization
3 The Four Information Levels
In this section, we will describe the four information levels in our proposed framework. We will also discuss the issues and challenges faced in extracting the required image features and useful patterns and knowledge from each information level.
3.1 Pixel Level
The Pixel Level is the lowest layer in an image mining system. It consists of raw image information such as image pixels and primitive image features such as color, texture, and edge information.
Color is the most widely used visual feature. Color is typically represented by its RGB values (three numbers from 0 to 255 indicating red, green, and blue). The distribution of color is a global property that does not require knowledge of how an image is composed of component objects. A color histogram is a structure used to store the proportion of pixels of each color within an image. It is invariant under translation and rotation about the view axis and changes only slowly under changes of view angle, changes in scale, and occlusion [32]. Subsequent improvements include the use of the cumulative color histogram [31] and spatial histogram intersection [30].
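As a concrete illustration of these pixel-level features, the sketch below computes a coarse, normalized RGB histogram and the histogram intersection similarity mentioned above. The number of bins per channel and the function names are our own illustrative choices, not prescribed by the cited techniques.

def color_histogram(pixels, bins_per_channel=4):
    # pixels: iterable of (r, g, b) tuples with channel values in 0..255.
    # Returns a normalized histogram with bins_per_channel**3 bins.
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    pixels = list(pixels)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1.0
    total = float(len(pixels)) or 1.0
    return [h / total for h in hist]

def histogram_intersection(h1, h2):
    # Similarity in [0, 1] for normalized histograms: sum of bin-wise minima.
    return sum(min(a, b) for a, b in zip(h1, h2))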
Texture is the visual pattern formed by a sizable layout of color or intensity homogeneity. It contains important information about the structural arrangement of surfaces and their relationship to the surrounding environment [27]. Common representations of texture information include the co-occurrence matrix representation [12], the coarseness, contrast, directionality, regularity, and roughness measures [33], the use of Gabor filters [22], and fractals [17]. [21] develop a texture thesaurus to automatically derive codewords that represent important classes of texture within a collection.
Edge information is an important visual cue to the detection and recognition of objects in an image. This information is obtained by looking for sharp contrasts in nearby pixels. Edges can be grouped to form regions. Content-based image retrieval focuses on the information found at the Pixel Level. Researchers try to identify a small subset of primitive features that can uniquely distinguish images of one class from another class. These primitive image features have their limitations. In particular, they do not capture the concept of objects/regions as perceived by a human user. This implies that the Pixel Level is unable to answer simple queries such as "retrieve the images with a girl and her dog" and "retrieve the images containing blue stars arranged in a ring".
3.2 Object Level

The focus of the Object Level is to identify domain-specific features such as objects and homogeneous regions in the images. While a human being can perform object recognition effortlessly and instantaneously, it has proven to be very difficult to implement the same task on a machine. The object recognition problem can be referred to
as a supervised labeling problem based on models of known objects. Given a target image containing one or more interesting objects and a set of labels corresponding to a set of models known to the system, what object recognition does is to assign correct labels to regions, or a set of regions, in the image. Models of known objects are usually provided by human input a priori. An object recognition module consists of four components: model database, feature detector, hypothesizer and hypothesis verifier [15]. The model database contains all the models known to the system. The models contain important features that describe the objects. The detected image primitive features in the Pixel Level are used to help the hypothesizer to assign likelihood to the objects in the image. The verifier uses the models to verify the hypothesis and refine the object likelihood. The system finally selects the object with the highest likelihood as the correct object.
To improve the accuracy of object recognition, image segmentation is performed on partially recognized image objects rather than by randomly segmenting the image. The techniques include "characteristic maps" to locate a particular known object in images [16], machine learning techniques to generate recognizers automatically [6], and the use of a set of examples already labeled by the domain expert to find common objects in images [10]. Once the objects within an image can be accurately identified, the Object Level is able to deal with queries such as "Retrieve images of a round table" and "Retrieve images of birds flying in the blue sky". However, it is unable to answer queries such as "Retrieve all images concerning a graduation ceremony" or "Retrieve all images that depict a sorrowful mood."

3.3 Semantic Concept Level

While objects are the fundamental building blocks in an image, there is a "semantic gap" between the Object Level and the Semantic Concept Level. Abstract concepts such as happy and sad, as well as scene information, are not captured at the Object Level. Such information requires domain knowledge as well as state-of-the-art pattern discovery techniques to uncover useful patterns that are able to describe the scenes or the abstract concepts. Common pattern discovery techniques include image classification, image clustering, and association rule mining.
(a) Image Classification. Image classification aims to find a description that best describes the images in one class and distinguishes these images from all the other classes. It is a supervised technique where a set of labeled or pre-classified images is given, and the problem is to label a new set of images; the model that performs this labeling is usually called the classifier. There are two types of classifiers: the parametric classifier and the non-parametric classifier. [7] employ classifiers to label the pixels in a Landsat multispectral scanner image. [37] develop an MM-Classifier to classify multimedia data based on given class labels. [36] propose IBCOW (Image-based Classification of Objectionable
Websites) to classify websites into objectionable and benign websites based on image content.
(b) Image Clustering. Image clustering groups a given set of unlabeled images into meaningful clusters according to the image content without a priori knowledge [14]. Typical clustering techniques include hierarchical clustering algorithms, partitioning algorithms, nearest neighbor clustering, and fuzzy clustering. Once the images have been clustered, a domain expert is needed to examine the images of each cluster to label the abstract concepts denoted by the cluster.
(c) Association Rule Mining. Association rule mining aims to find items/objects that occur together frequently. In the context of images, association rule mining is able to discover that when several specific objects occur together, there is a high likelihood that a certain event/scene is being described in the images. An association rule mining algorithm works in two steps: the first step finds all large itemsets that meet the minimum support constraint; the second step generates, from the large itemsets, rules that satisfy the minimum confidence constraint. [25] present an algorithm that uses association rule mining to discover meaningful correlations among the blobs/regions that exist in a set of images. [37] develop an MM-Associator that uses 3-dimensional visualization to explicitly display the associations in the Multimedia Miner prototype.
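The two-step process just described can be sketched as follows for sets of object labels extracted from images. This is a generic, brute-force illustration under our own assumptions, not the specific algorithms of [25] or [37].

from itertools import combinations

def frequent_itemsets(transactions, min_support):
    # Step 1: find all itemsets (sets of object labels) whose support, i.e. the
    # number of images containing them, meets min_support.
    items = {i for t in transactions for i in t}
    frequent, k, current = {}, 1, [frozenset([i]) for i in items]
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # next candidates: unions of frequent itemsets that have size k
        current = list({a | b for a in level for b in level if len(a | b) == k})
    return frequent

def association_rules(frequent, min_confidence):
    # Step 2: generate rules A -> B with confidence support(A u B) / support(A).
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), confidence))
    return rules

images = [{"sky", "sun", "sea"}, {"sky", "sun"}, {"sky", "sea"}, {"grass", "sky"}]
print(association_rules(frequent_itemsets(images, min_support=2), min_confidence=0.6))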
With the Semantic Concept Level, queries involving high-level reasoning about the meaning and purpose of the objects and scenes depicted can be answered. Thus, we will be able to answer queries such as "Retrieve the images of a football match" and "Retrieve the images depicting happiness". It would be tempting to stop at this level. However, careful analysis reveals that there is still one vital piece of missing information: the domain knowledge external to the images. Queries like "Retrieve all medical images with high chances of blindness within one month" require linking the medical images with the medical knowledge of the chance of blindness within one month. Neither the Pixel Level, the Object Level, nor the Semantic Concept Level is able to support such queries.
3.4 Pattern and Knowledge Level
To support all the information needs within the image mining framework, we need the fourth and final level: the Pattern and Knowledge Level. At this level, we are concerned not just with the information derivable from images, but also with all the domain-related alphanumeric data. The key issue here is the integration of knowledge discovered from the image databases and the alphanumeric databases. A comprehensive image mining system would not only mine useful patterns from large collections of images but also integrate the results with alphanumeric data to mine for further pat-
terns. For example, it is useful to combine heart perfusion images and the associated clinical data to discover rules in high-dimensional medical records that may suggest early diagnosis of heart disease. IRIS, an Integrated Retinal Information System, is designed to integrate both patient data and the corresponding retinal images to discover interesting patterns and trends in diabetic retinopathy [13]. The BRAin-Image Database is another image mining system, developed to discover associations between structures and functions of the human brain [23]. The brain modalities are studied by the image mining process, and the brain functions (deficits/disorders) are obtainable from the patients' relational records. The two kinds of information are used together to perform functional brain mapping. By ensuring a proper flow of information from low-level pixel representations to high-level semantic concept representations, we can be assured that the information needed at the fourth level is derivable and that the integration of image data with alphanumeric data will be smooth. Our proposed image mining framework emphasizes the need to focus on the flow of information to ensure that all levels of information needs have been addressed and none is neglected.
4 Indexing of Image Information

While focusing on the information needs at the various levels, it is also important to support the retrieval of image data with fast and efficient indexing schemes. Indexing techniques used range from standard methods such as the signature file and inverted file access methods, to multi-dimensional methods such as the K-D-B tree [26], R-tree [11], R*-tree [3], and R+-tree [29], to high-dimensional indexes such as the SR-tree [18], TV-tree [20], X-tree [4], and iMinMax [24]. Searching for the nearest neighbor is an important problem in high-dimensional indexing. Given a set of n points and a query point Q in a d-dimensional space, we need to find a point in the set such that its distance from Q is less than, or equal to, the distance of Q from any other point in the set [19]. Existing search algorithms can be divided into the following categories: exhaustive search, hashing and indexing, static space partitioning, dynamic space partitioning, and randomized algorithms. When the image database to be searched is large and the feature vectors of the images are of high dimension (typically of the order of 10^2), search complexity is high. Reducing the dimensionality may be necessary to prevent performance degradation. This can be accomplished using two well-known methods: the Singular Value Decomposition (SVD) update algorithm and clustering [28]. The latter achieves dimension reduction by grouping similar feature dimensions together. Current image systems retrieve images based on similarity, but Euclidean measures may not effectively simulate human perception of a certain visual content. Other similarity measures, such as histogram intersection, cosine, and correlation, need to be utilized.
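The cost that these indexing structures are designed to avoid is easy to see in a brute-force nearest-neighbour search with a pluggable similarity measure, as sketched below. This is a simple illustration with our own function names; every query scans the whole collection, which is exactly what high-dimensional indexes and dimension reduction try to beat.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest_neighbors(query, database, similarity=cosine_similarity, k=5):
    # database: dict mapping image id -> feature vector. Every vector is
    # compared with the query, so the search cost grows with both the number
    # of images and the dimensionality of the feature vectors.
    ranked = sorted(database.items(), key=lambda item: similarity(query, item[1]), reverse=True)
    return [image_id for image_id, _ in ranked[:k]]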
One promising approach is to first perform dimension reduction and then use appropriate multi-dimensional indexing techniques that support non-Euclidean similarity measures [27]. [1] develop an image retrieval system on the Oracle platform using multi-level filter indexing. The filters operate on an approximation of the high-dimensional data that represents the images and reduce the search space, so that the computationally expensive comparison is necessary for only a small subset of the data. [12] develop a new compressed image indexing technique by using compressed image features as multiple keys to retrieve images. Other proposed indexing schemes focus on specific image features. [21] present an efficient color indexing scheme for similarity-based retrieval which has a search time that increases logarithmically with the database size. [34] propose a multi-level R-tree index, called the nested R-tree, for retrieving shapes efficiently and effectively. With the proliferation of image retrieval mechanisms, a performance evaluation of color-spatial retrieval techniques was given in [35], which serves as a guideline for selecting a suitable technique and designing new ones.
5 Related Work
Several image mining systems have been developed for different applications. The MultiMediaMiner [37] mines high-level multimedia information and knowledge from large multimedia databases. [8] describe an intelligent satellite mining system that comprises two modules: a data acquisition, preprocessing, and archiving system, which is responsible for the extraction of image information, the storage of raw images, and the retrieval of images; and an image mining system, which enables the users to explore image meaning and detect relevant events. The Diamond Eye [6] is an image mining system that enables scientists to locate and catalog objects of interest in large image collections. These systems incorporate novel image mining algorithms, as well as computational and database resources that allow users to browse, annotate, and search through images and analyze the resulting object catalogs. The architectures of these existing image mining systems are mainly based on module functionality. In contrast, we provide a different perspective on image mining with our four-level information-driven framework. [6, 25] primarily concentrate on the Pixel and Object Levels, while [37] focus on the Semantic Concept Level with some support from the Pixel and Object Levels. It is clear that by proposing a framework based on the information flow, we are able to focus on the critical areas to ensure that all the levels can work together seamlessly. In addition, this framework highlights that we are still very far from being able to fully discover useful domain information from images. More research is needed at the Semantic Concept Level and the Pattern and Knowledge Level.
6 Conclusion
The rapid growth of image data in a variety of media has necessitated a way of making good use of the rich content in the images. Image mining is currently a burgeoning and active research focus in computer science. We have proposed a four-level information-driven framework for image mining systems. High-dimensional indexing schemes and retrieval techniques are also included in the framework to support the flow of information among the levels. We tested the applicability of our framework by applying it to some practical image mining applications. This framework is our effort to provide the developers and designers of image mining systems with a standard framework for image mining with an explicit information hierarchy. We believe this framework represents the first step towards capturing the different levels of information present in image data and addressing the issues and challenges of discovering useful patterns/knowledge from each level.
References
1. Annamalai, M and Chopra, R.: Indexing images in Oracle8i. ACM SIGMOD (2000)
2. Babu, G P and Mehtre, B M.: Color indexing for efficient image retrieval. Multimedia Tools and Applications (1995)
3. Beckmann, N, Kriegel, H P, Schneider, R and Seeger, B.: The R*-tree: An efficient and robust access method for points and rectangles. ACM SIGMOD (1990)
4. Berchtold, S, Keim, D A and Kriegel, H P.: The X-tree: An index structure for high-dimensional data. 22nd Int. Conference on Very Large Databases (1996)
5. Bertino, E, Ooi, B C, Sacks-Davis, R, Tan, K L, Zobel, J, Shilovsky, B and Catania, B.: Indexing Techniques for Advanced Database Systems. Kluwer Academic Publishers (1997)
6. Burl, M C et al.: Mining for image content. In Systems, Cybernetics, and Informatics / Information Systems: Analysis and Synthesis (1999)
7. Cromp, R F and Campbell, W J.: Data mining of multi-dimensional remotely sensed images. International Conference on Information and Knowledge Management (CIKM) (1993)
8. Datcu, M and Seidel, K.: Image information mining: exploration of image content in large archives. IEEE Conference on Aerospace, Vol. 3 (2000)
9. Eakins, J P and Graham, M E.: Content-based image retrieval: a report to the JISC technology applications program. (http://www.unn.ac.uk/iidr/research/cbir/report.html) (1999)
10. Gibson, S et al.: Intelligent mining in image databases, with applications to satellite imaging and to web search. Data Mining and Computational Intelligence, Springer-Verlag, Berlin (2001)
11. Guttman, A.: R-trees: A dynamic index structure for spatial searching. ACM SIGMOD (1984)
12. Haralick, R M and Shanmugam, K.: Texture features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, Vol. 3 (6) (1973)
13. Hsu, W, Lee, M L and Goh, K G.: Image Mining in IRIS: Integrated Retinal Information System. ACM SIGMOD (2000)
14. Jain, A K, Murty, M N and Flynn, P J.: Data clustering: a review. ACM Computing Surveys, Vol. 31, No. 3 (1999)
15. Jain, R, Kasturi, R and Schunck, B G.: Machine Vision. MIT Press (1995)
16. De Bonet, J S.: Image preprocessing for rapid selection in "Pay attention mode". MIT Press (2000)
17. Kaplan, L M et al.: Fast texture database retrieval using extended fractal features. Proc. SPIE Storage and Retrieval for Image and Video Databases VI (Sethi, I K and Jain, R C, eds.) (1998)
18. Katayama, N and Satoh, S.: The SR-tree: An index structure for high-dimensional nearest neighbour queries. ACM SIGMOD (1997)
19. Knuth, D E.: Sorting and Searching, The Art of Computer Programming, Vol. 3. Addison-Wesley, Reading, Mass. (1973)
20. Lin, K, Jagadish, H V and Faloutsos, C.: The TV-tree: An index structure for high-dimensional data. The VLDB Journal, 3 (4) (1994)
21. Ma, W Y and Manjunath, B S.: A texture thesaurus for browsing large aerial photographs. Journal of the American Society for Information Science 49(7) (1998)
22. Manjunath, B S and Ma, W Y.: Texture features for browsing and retrieval of large image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18 (1996)
23. Megalooikonomou, V, Davatzikos, C and Herskovits, E H.: Mining lesion-deficit associations in a brain image database. ACM SIGKDD (1999)
24. Ooi, B C, Tan, K L, Yu, S and Bressan, S.: Indexing the Edges - A Simple and Yet Efficient Approach to High-Dimensional Indexing. 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (2000)
25. Ordonez, C and Omiecinski, E.: Image mining: a new approach for data mining. IEEE (1999)
26. Robinson, J T.: The K-D-B tree: A search structure for large multidimensional dynamic indexes. ACM SIGMOD (1981)
27. Rui, Y, Huang, T S et al.: Image retrieval: Past, present and future. Int. Symposium on Multimedia Information Processing (1997)
28. Salton, G and McGill, M J.: Introduction to Modern Information Retrieval. McGraw-Hill Book Company (1983)
29. Sellis, T, Roussopoulos, N and Faloutsos, C.: The R+-tree: A dynamic index for multi-dimensional objects. Int. Conference on Very Large Databases (1987)
30. Stricker, M and Dimai, A.: Color indexing with weak spatial constraints. Proc. SPIE Storage and Retrieval for Image and Video Databases IV (1996)
31. Stricker, M and Orengo, M.: Similarity of color images. Proc. SPIE Storage and Retrieval for Image and Video Databases III (1995)
32. Swain, M J and Ballard, D H.: Color indexing. International Journal of Computer Vision 7(1) (1991)
33. Tamura, H et al.: Textural features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics 8(6) (1978)
34. Tan, K L, Ooi, B C and Thiang, L F.: Retrieving Similar Shapes Effectively and Efficiently. Multimedia Tools and Applications, Kluwer Academic Publishers, accepted for publication (2001)
35. Tan, K L, Ooi, B C and Yee, C Y.: An Evaluation of Color-Spatial Retrieval Techniques for Large Image Databases. Multimedia Tools and Applications, Vol. 14(1), Kluwer Academic Publishers (2001)
36. Wang, J Z, Li, J et al.: System for Classifying Objectionable Websites. Proc. of the 5th International Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services (IDMS'98), Springer-Verlag LNCS 1483 (1998)
37. Zaiane, O R and Han, J W.: Mining MultiMedia Data. CASCON: the IBM Centre for Advanced Studies Conference (http://www.cas.ibm.ca/cascon/) (1998)
A Rule-Based Scheme to Make Personal Digests from Video Program Meta Data Takako Hashimoto1,2, Yukari Shirota3, Atsushi Iizawa1,2, and Hiroyuki Kitagawa4 1
Information Broadcasting Laboratories, Inc., Tokyo, Japan {takako, izw }@ibl.co.jp 2 Software Research Center, Ricoh Company Ltd., Tokyo, Japan {takako, izw }@src.ricoh.co.jp 3 Faculty of Economics, Gakushuin University, Tokyo, Japan [email protected] 4 Institute of Information Sciences and Electronics, University of Tsukuba, Japan [email protected] Abstract. Content providers have recently started adding a variety of meta data to various video programs; these data provide primitive descriptors of the video contents. Personal digest viewing that uses the meta data is a new application in the digital broadcasting era. To build personal digests, semantic program structures must be constructed and significant scenes must be identified. Digests are currently made manually at content provider sites. This is time-consuming and increases the cost. This paper proposes a way to solve these problems with a rule-based personal digest-making scheme (PDMS) that can automatically and dynamically make personal digests from the meta data. In PDMS, depending on properties of the video program contents and viewer preferences, high-level semantic program structures can be constructed from the added primitive meta data and significant scenes can be extracted. The paper illustrates a formal PDMS model. It also presents detailed evaluation results of PDMS using the contents of a professional baseball game TV program.
1 Introduction

The digitization of video content has grown rapidly with recent advances in digital media technology such as digital broadcasting and DVD (digital video discs). In digital media environments, various kinds of video meta data can be attached to the video program contents. Personal digest viewing is an application that uses such meta data. These meta data can be used to make personal digests automatically and dynamically, and to present them on such viewer terminals as TVs and personal computer monitors. To build personal digests of video programs, higher-level semantic program structures must be constructed from the primitive meta data, and significant scenes need to be extracted. Appropriate semantic structures and scene extraction strategies, however, depend heavily on program content properties. Beyond that, in making personal digests, viewer preferences should be reflected in the significant scene extraction. We therefore need a scheme that is adaptable to the target program contents and viewer preferences. To achieve this, this paper presents a personal digest-making scheme (called PDMS) based on rule descriptions using video program meta data. In PDMS, higher-level semantic scenes are extracted using rules designating occurrence patterns
of meta data primitives. Other rules are used to calculate scene significance. Scenes with high significance scores are selected to compose digests. Viewer preferences can be reflected in calculating significance. Processes involved in the digest making can be tuned flexibly depending on properties of the program contents and viewer preferences. Section 2 of this paper describes related work and our approach. Section 3 explains our personal digest-making scheme. Section 4 describes our prototype digest viewing system based on PDMS, which is applied to the TV program content of a professional baseball game. Section 5 summarizes important points and briefly describes future work.
2 Related Work and Our Approach

This section describes work related to our personal digest-making scheme. Generally speaking, methods for making video digests are divided into two groups. The first is based on image and speech recognition technology, and the second is based on meta data, which are added by content providers. Various researchers have been looking into the first group [1, 2, 3, 4, 5, 6, 7]. An advantage of the first group is that the technology makes it possible to automatically construct index structures to build digests without a lot of manual labor. Its technical reliability, however, is not high, and the recognition process is expensive. The second group, on the other hand, features high reliability. Despite the cost of providing meta data, we consider this approach more practical for use in digital broadcasting services because of its reliability and efficiency. Moreover, many content providers are starting to supply primitive meta data in addition to video contents. For these reasons, our approach follows the second group. There is some research based on meta data [8, 9, 10, 11, 12]. Zettsu et al., for example, proposed a method of finding scenes as logical units by combining image features and attached primitive meta data [10]. Ushiama and Watanabe proposed a scheme for extracting scenes using regular expressions on primitive meta data [11]. Their work concentrates on the process of extracting semantic structures such as scenes from primitive meta data; it does not present schemes for evaluating scene significance or for selecting important scenes to compose digests. Kamahara et al. proposed a digest-making method using scenario templates to express the synopsis of the digest video [5, 12]. This method is useful in building typical digests focused on the story of the video program. This method, however, has two problems: the first is that it is difficult to prepare all the scenario templates that might be assumed as digests; the second is that it is hard to build digests in the middle of a video program. In our scheme, scene significance is calculated more dynamically, and viewer preferences are also taken into account.
3 Personal Digest-Making Scheme 3.1 Inputs to PDMS There are three kinds of data given to PDMS: video data, primitive descriptors, and viewer preferences. The video data and primitive descriptors should be produced by content providers. At home, viewers input their preferences through user interfaces.
Video Data: This data is a video stream, which is expressed as a sequence of frames f1...fn . Each frame has a frame identifier (fid) and time code to specify the starting point in the video program. A sub-sequence of frames in the following explanation is referred to as a frame sequence. Primitive Descriptor: Primitive descriptors are video program meta data. Each primitive descriptor is expressed as the following tuple: (pid, type, ffid, lfid, {attr1, …, attrn}). Pid is the primitive descriptor’s identifier. Type is a type of the description. For example, when the video content is a baseball game, it includes beginning_of_game, beginning_of_inning, hit, out, additional_run, and so on. Type values are decided and registered in advance at the content providers. Each primitive descriptor has the first and last frame identifiers: ffid and lfid. Items { attr1, …, attrn } are attributes of the descriptor. They are decided according to the type value. Fig. 1 shows an example of primitive descriptors. Each line corresponds to one primitive descriptor. 1, beginning_of_game, 100,,Giants, Carp, Tokyo Dome, Oct. .., 0-0, start 2, beginning_of_inning, 130,,1 ,Carp, 0-0,, top 3, at_bat, 2800,, Nomura, Kuwata,,,,0-0, , 4, pitch, 2950,,Nomura, Kuwata, straight, ,0-0-0,,,, 5, hit, 3130,, right,liner 6, at_bat,4250,,Tomashino, Kuwata, Nomura, ,,0-0,, 7, pitch,5135,,Tomashino, Kuwata, straight, ,0-0-0, Nomura,,, 8, hit,5340,,left, liner 9, additional_run, 6150,,1, Nomura, 1,, 1-0,, 10, beginning_of_inning, 8680,, Kanamoto,Kuwata,Nomura,,,2-0,, : 52 ,out, 23440,,strikeout, ,Ogata
Fig. 1. Primitive Descriptor Example
Viewer Preference: A viewer preference is expressed as follows: (uid, prefid, name, category, weight). Uid is the viewer’s identifier and prefid is the identifier of the viewer preference. Name is the preference item name and category designates the category of the preference item. Weight expresses the degree of the viewer’s preference (0<= weight <=1). Preference examples are indicated as follows: (1, 1, "Giants", "Team Name", 0.8) // the fan of team Giants (1, 2, "Brown", "Player Name", 1) // the fan of player Brown Here, viewer 1 registers two viewer preferences. 3.2 Data Generated in PDMS This section describes data generated in PDMS: scenes, annotations, status parameters and preference parameters. Scene: A scene is a frame sequence corresponding to a semantic unit. A scene is expressed as follows: (sid, type, ffid, lfid). Sid is the scene’s identifier. Type expresses the scene’s type. Ffid and lfid express the first and last frame identifiers. For baseball programs, an inning type scene is defined as a frame sequence starting from a primi-
tive descriptor beginning_of_inning and ending just before the next occurrence of beginning_of_inning. A batting type scene (the frame sequence from a primitive descriptor at_bat to just before the next at_bat) is also defined. In the batting type scene, when a primitive descriptor hit and a primitive descriptor additional_run appear in this order, the frame sequence from the hit to the additional_run becomes an RBI(run batted in)_hit type scene (see Fig. 2).
Annotation: An annotation is a text data item used to add semantic information to a primitive descriptor. For baseball programs, when an RBI_hit type scene is extracted, an annotation RBI is added to the primitive descriptor hit (Fig. 2). We can also think of such annotations as come_from_behind and the_first_run. Status Parameter: A status parameter expresses the significance of each frame of the video. For baseball programs, there is the Aggression Level status parameter. The value of Aggression Level increases when aggressive primitive descriptors such as hit and additional_run, or aggressive annotations such as RBI, appear (Fig. 2). Aggression Level expresses the aggressive importance. We can also use Pitching Level to express the pitcher’s condition, Excitement Level to express an exciting situation, and various others for baseball programs. Users can specify which parameter to use. Preference Parameter: A preference parameter expresses the degree of a viewer’s preference. Each viewer has one preference parameter. When a primitive descriptor related to the viewer’s preference appears, the value of the preference parameter increases. For example, suppose the viewer’s preference (1, 2, "Brown", "Player Name", 1) is given. If a primitive descriptor related to the player "Brown" occurs such as at_bat, pitch, and hit in which the batter’s name is "Brown," the value of his or her preference parameter increases (see Fig. 2).
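The tuples above map directly onto simple data structures. The following Python sketch only mirrors the data model of Sections 3.1 and 3.2; the field names follow the paper's notation, but the classes themselves are our own illustration.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class PrimitiveDescriptor:          # (pid, type, ffid, lfid, {attr1, ..., attrn})
    pid: int
    type: str                       # e.g. "at_bat", "hit", "additional_run"
    ffid: int                       # first frame identifier
    lfid: Optional[int] = None      # last frame identifier
    attrs: Dict[str, str] = field(default_factory=dict)

@dataclass
class ViewerPreference:             # (uid, prefid, name, category, weight)
    uid: int
    prefid: int
    name: str                       # e.g. "Brown"
    category: str                   # e.g. "Player Name"
    weight: float                   # 0 <= weight <= 1

@dataclass
class Scene:                        # (sid, type, ffid, lfid)
    sid: int
    type: str                       # e.g. "inning", "batting", "RBI_hit"
    ffid: int
    lfid: int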
3.3 Rules for Making Digests

This section explains the rules for generating scenes and annotations, and for changing the values of status parameters and preference parameters.
Scene Extraction Rule: Scenes and annotations are derived by scene extraction rules. Fig. 3 shows two examples of scene extraction rules. In the first rule, an inning type scene is extracted. In the second rule, an RBI_hit type scene is extracted, and an annotation RBI is also generated and added to the primitive descriptor hit.
Status Parameter Calculation Rule: Fig. 3 also shows a status parameter calculation rule example. An occurrence of the primitive descriptor hit and the generation of the annotation RBI invoke the execution of this rule. Then, the value of the status parameter is calculated. In Fig. 3, the value of Aggression Level is increased by 2 points.
Preference Parameter Calculation Rule: A preference parameter calculation rule is invoked by an occurrence of a primitive descriptor related to the viewer's preference. Fig. 3 shows an example of a preference calculation rule. The viewer's identifier, preference item name, category name, and weight are referred to by $X, $Name, $Category, and $Weight. In this rule, when a primitive descriptor at_bat in which the batter's name matches the viewer's favorite player appears, the value of the preference parameter is increased by "(4 * $Weight)".

Fig. 3. Rule examples: a scene extraction rule that extracts an inning type scene, a scene extraction rule that extracts an RBI_hit type scene, a status parameter calculation rule, and a preference parameter calculation rule
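As an illustration of how such rules might be represented and evaluated, the sketch below encodes a scene extraction rule as a pattern over the sequence of primitive descriptor types and a preference parameter calculation rule as a conditional increment, reusing the data classes sketched in Section 3.2 above. The pattern syntax and function names are our own assumptions; they are not the rule language shown in Fig. 3.

import re

# One-letter codes for primitive descriptor types, so that occurrence patterns
# can be matched with regular expressions over the coded descriptor stream.
TYPE_CODES = {"beginning_of_inning": "I", "at_bat": "B", "pitch": "P",
              "hit": "H", "additional_run": "R", "out": "O"}

# RBI_hit: a hit followed by an additional_run within the same at-bat.
SCENE_RULES = [("RBI_hit", re.compile("H[^B]*R"))]

def extract_scenes(descriptors):
    coded = "".join(TYPE_CODES.get(d.type, ".") for d in descriptors)
    scenes = []
    for scene_type, pattern in SCENE_RULES:
        for m in pattern.finditer(coded):
            first, last = descriptors[m.start()], descriptors[m.end() - 1]
            scenes.append(Scene(sid=len(scenes) + 1, type=scene_type,
                                ffid=first.ffid,
                                lfid=last.lfid if last.lfid is not None else last.ffid))
    return scenes

def preference_rule(descriptor, pref, current_value):
    # When an at_bat descriptor whose batter matches the viewer's favorite
    # player occurs, increase the preference parameter by 4 * weight.
    if descriptor.type == "at_bat" and descriptor.attrs.get("batter") == pref.name:
        return current_value + 4 * pref.weight
    return current_value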
3.4 Digest Making Process in PDMS

In the following, we explain the digest-making process in PDMS, which consists of two steps (see Fig. 4).
Step 1: Scene/annotation extraction and parameter calculation. First, the system analyzes the input meta data. Then, scenes are extracted and annotations are generated, both based on the scene extraction rules. The status/preference parameters are also calculated based on the corresponding rules.
Step 2: Selection of significant scenes. To select significant scenes, the user inputs three kinds of data: (1) Total Digest Time: the total time of the digest; (2) Scene Unit: the unit (granularity) of each extracted scene; and (3) Parameters to Select: a set of status and preference parameters selected by the viewer from among the given parameters. The scenes judged to be significant change depending on these inputs. As Fig. 4 shows, the extracted scenes depend on the selected parameters: scenes Si and Sk are extracted if Aggression Level is selected, whereas scenes Si and Sj are extracted if Aggression Level and Preference Level are selected.
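Step 2 could be read as the following sketch, under our own assumptions: each scene is scored by summing the values of the selected parameters over its frames, and scenes are picked greedily in decreasing score order until the total digest time is filled. The paper does not prescribe this exact scoring, so the sketch is an illustration only.

def select_scenes(scenes, parameter_curves, selected, total_digest_time, seconds_per_frame=1 / 30.0):
    # scenes: list of Scene; parameter_curves: dict mapping a parameter name to
    # a dict of frame id -> value; selected: parameter names chosen by the viewer.
    def score(scene):
        return sum(parameter_curves[p].get(fid, 0.0)
                   for p in selected
                   for fid in range(scene.ffid, scene.lfid + 1))

    digest, used = [], 0.0
    for scene in sorted(scenes, key=score, reverse=True):
        length = (scene.lfid - scene.ffid + 1) * seconds_per_frame
        if used + length <= total_digest_time:
            digest.append(scene)
            used += length
    return sorted(digest, key=lambda s: s.ffid)   # present scenes in broadcast order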
4 Prototype System 4.1 System Architecture As Fig. 5 shows, the system consists of three modules and two kinds of databases: a meta data analysis module, a scene extraction module, a parameter calculation module, rule databases, and a semantic structure database.
First, the meta data analysis module receives the broadcast data and continually monitors for an occurrence of a primitive descriptor (See (1) in Fig. 5). After parsing and checking data consistency, the module notifies the other modules of the primitive descriptor occurrence, issuing a primitive descriptor occurrence event.
Fig. 5. PDMS System Architecture
The scene extraction module tries to find a primitive descriptor occurrence pattern corresponding to a scene or an annotation in the sequence of incoming primitive descriptors (see (2) in Fig. 5). The extracted scenes and the generated annotations are stored in the semantic structure database. If an annotation is generated, the scene extraction module notifies the parameter calculation module by issuing an annotation generation event. The parameter calculation module then starts to calculate the parameter values (see (3) in Fig. 5). Parameter values are also calculated from the sequence of incoming primitive descriptors.
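The event flow among the three modules could be wired up as in the following sketch. The class names and the simple synchronous event bus are our own assumptions for illustration; they are not the prototype's actual implementation.

class EventBus:
    def __init__(self):
        self.handlers = {}
    def subscribe(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)
    def publish(self, event, payload):
        for handler in self.handlers.get(event, []):
            handler(payload)

bus = EventBus()

def meta_data_analysis(descriptor):
    # (1) parse the incoming primitive descriptor, check consistency, and
    # notify the other modules of its occurrence.
    bus.publish("primitive_descriptor_occurrence", descriptor)

def scene_extraction(descriptor):
    # (2) match occurrence patterns; when a scene or annotation is produced,
    # store it and issue an annotation generation event.
    if getattr(descriptor, "type", None) == "hit":
        bus.publish("annotation_generation", {"descriptor": descriptor, "annotation": "RBI"})

def parameter_calculation(payload):
    # (3) update the status and preference parameter values.
    pass

bus.subscribe("primitive_descriptor_occurrence", scene_extraction)
bus.subscribe("primitive_descriptor_occurrence", parameter_calculation)
bus.subscribe("annotation_generation", parameter_calculation)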
Fig. 6. PDMS Screen Example
Fig. 6 shows the internally calculated values of the status parameters, with the horizontal axis showing time (in the home user interface, these internal calculation results are not presented). Each extracted scene is highlighted as a bar; by clicking a bar, you can see the extracted scene on the screen, as shown in Fig. 6.

4.2 Evaluation of PDMS Effectiveness

This section evaluates the effectiveness of PDMS using actual professional baseball content broadcast by the Nippon Television Network Corporation. The sample content is a game between the Giants and the Carp held in the Tokyo Dome. The scores are as follows:
Inning    1  2  3  4  5  6  7  8  9  Total
Carp      1  0  0  0  0  3  0  0  0    4
Giants    3  3  0  0  0  0  0  1  x    7
To evaluate effectiveness, we made two types of digests, depending on viewer preferences. The first type is called a generic digest; it does not take each viewer's preferences into account and is middle-of-the-road. The second type is called a preference digest, which is skewed by the viewer's preferences. To authenticate PDMS, we use a baseball game database provided by Nippon Television Network Corporation, the content provider [13]. Sports correspondents write the game digest articles presented in the database, ensuring high reliability. If a scene selected by PDMS is included in the database digest, we can say the scene is appropriate for a generic digest.

Generic Digest
Table 1 lists the scenes selected by our system as a generic digest under the following conditions:
· Total Digest Time: 3 min.
· Scene Unit: throwing scene
· Selected Parameters: {Aggression Level, Excitement Level, Pitching Level}
Our system extracted each scene in segments of at most 30 seconds. As Table 1 shows, every scene extracted by our system was appropriate; the correctness is 100%. The DB article also extracted six scenes.

Table 1. Extracted Scenes for a Generic Digest

Inning          Team     Selected Scene                        Result
Top of 1st      Carp     Batter Etoh got a sacrifice hit.      OK
Bottom of 1st   Giants   Batter Takahashi hit an RBI.          OK
                         Batter Kawai hit an RBI.              OK
Bottom of 2nd   Giants   Batter Matsui hit a home run.         OK
Top of 6th      Carp     Batter Etoh hit a home run.           OK
Bottom of 8th   Giants   Batter Motoki got a sacrifice hit.    OK
Preference Digests

To evaluate preference digests, we use the preference data of two viewers, A and B. Viewer A's preference is (1, 1, "Nishi", "Player Name", 0.5) and Viewer B's preference is (2, 1, "Nishi", "Player Name", 1). These preference data show that Viewer A is an ordinary fan of batter Nishi, with a weight of 0.5, while Viewer B is an enthusiastic fan of batter Nishi, with a weight of 1. The preference digests are made under the following conditions:
· Total Digest Time: 3 min
· Scene Unit: throwing scene
· Selected Parameters: {Aggression Level, Excitement Level, Pitching Level}
Table 2 lists the resulting data for the two viewers. Both digests are skewed toward batter Nishi's scenes, which occupy the following shares:
Viewer A: 2 (scenes) / 6 (scenes) = 33.3% when w = 0.5
Viewer B: 4 (scenes) / 6 (scenes) = 66.7% when w = 1.0
The number of preference scenes increases with the preference weight. The evaluation results show that the scenes extracted by our system are appropriate for a generic digest and that the number of preference scenes can be controlled by the preference weight.

Table 2. Extracted Scenes for Preference Digests

Viewer  Inning          Team    Selected Scene
(A)     Top of 1st      Carp    Batter Etoh got a sacrifice hit.
        Bottom of 1st   Giants  Batter Nishi walked to 1st base.
        Bottom of 1st   Giants  Batter Takahashi hit an RBI.
        Bottom of 2nd   Giants  Batter Nishi struck out.
        Bottom of 2nd   Giants  Batter Matsui hit a home run.
        Top of 6th      Carp    Batter Etoh hit a home run.
(B)     Bottom of 1st   Giants  Batter Nishi walked to 1st base.
        Bottom of 1st   Giants  Batter Takahashi hit an RBI.
        Bottom of 2nd   Giants  Batter Nishi struck out.
        Bottom of 2nd   Giants  Batter Matsui hit a home run.
        Bottom of 4th   Giants  Batter Nishi grounded out to third.
        Bottom of 7th   Giants  Batter Nishi flied out.
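The excerpt does not spell out the significance formula used by PDMS, so the following is only a plausible sketch of how a preference weight w could skew scene ranking toward a favourite player when a fixed-length digest is assembled. The scoring model, field names, and scene data are assumptions made for illustration.

```python
# Hypothetical sketch of preference-weighted digest selection; the actual
# PDMS significance rules are not given here, so the scoring formula and
# field names are assumptions.

def score(scene, preferences):
    base = scene["significance"]                 # derived from status parameters
    bonus = sum(w for player, w in preferences.items()
                if player in scene["players"])   # preference parameters
    return base + bonus

def make_digest(scenes, preferences, max_scenes=6):
    ranked = sorted(scenes, key=lambda s: score(s, preferences), reverse=True)
    return [s["label"] for s in ranked[:max_scenes]]

scenes = [
    {"label": "Etoh sacrifice hit", "significance": 0.8, "players": {"Etoh"}},
    {"label": "Nishi struck out",   "significance": 0.3, "players": {"Nishi"}},
    {"label": "Matsui home run",    "significance": 0.9, "players": {"Matsui"}},
]
print(make_digest(scenes, {}))                 # generic digest
print(make_digest(scenes, {"Nishi": 1.0}))     # enthusiastic Nishi fan (w = 1)
```

With a larger weight, more of the viewer's preferred scenes rank into the fixed-size digest, which is the behaviour reported in Table 2.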
5 Conclusions and Future Work

This paper described a personal digest-making scheme that can be adapted to various programs and viewer preferences. We proposed a rule-based scheme called PDMS to make personal digests from video program meta data. PDMS extracts higher-level semantic structures and calculates scene significance based on rules, and viewer preferences can be reflected in the significance calculation. Therefore, using PDMS, viewers can make personal digests flexibly and dynamically on their TVs or personal computers. We applied our prototype system based on PDMS to build digests of TV program content for a professional baseball game and evaluated its effectiveness. We verified that our prototype could extract the important scenes selected manually by the
content provider and flexibly make personal digests according to the weight of viewer preferences. Acknowledgements. The authors would like to thank Nippon Television Network Corporation for providing the baseball program contents. We are also indebted to Ms. Hiroko Mano at Ricoh Company Ltd., Takeshi Kimura at NHK (Japan Broadcasting Corporation) Science & Technical Research Laboratories and Mr. Hideo Noguchi and Mr. Kenjiro Kai at Information Broadcasting Laboratories, Inc. for their detailed comments on an earlier draft of this paper.
References
1. Y. Nakamura and T. Kanade: Semantic Analysis for Video Contents Extraction - Spotting by Association in News Video, Proc. of ACM Multimedia, Nov. 1997, pp. 393-401.
2. M. A. Smith and T. Kanade: Video Skimming and Characterization through the Combination of Image and Language Understanding, Proc. of the 1998 Intl. Workshop on Content-Based Access of Image and Video Database (CAIVD '98), IEEE Computer Society, 1998, pp. 61-70.
3. A. G. Hauptmann and D. Lee: Topic Labeling of Broadcast News Stories in the Informedia Digital Video Library, Proc. of the 3rd ACM International Conference on Digital Libraries, ACM Press, June 23-26, 1998, Pittsburgh, PA, USA, pp. 287-288.
4. A. G. Hauptmann and M. J. Witbrock: Story Segmentation and Detection of Commercials in Broadcast News Video, Proc. of the IEEE Forum on Research and Technology Advances in Digital Libraries, IEEE ADL '98, IEEE Computer Society, April 22-24, 1998, Santa Barbara, California, USA, pp. 168-179.
5. J. Kamahara, T. Kaneda, M. Ikezawa, S. Shimojo, S. Nishio, and H. Miyahara: Scenario Language for Automatic News Recomposition on The News-on-Demand, Technical Report of IEICE DE95-50, Vol. 95, No. 287, pp. 1-8, 1995 (in Japanese).
6. M. Nishida and Y. Ariki: Speaker Indexing for News Articles, Debates and Drama in Broadcasted TV Programs, Proc. of the IEEE International Conference on Multimedia Computing and Systems (ICMCS) Volume II, 1999, pp. 466-471.
7. Y. Ariki and K. Matsuura: Automatic Classification of TV News Articles Based on Telop Character Recognition, Proc. of the IEEE International Conference on Multimedia Computing and Systems (ICMCS) Volume II, 1999, pp. 148-152.
8. Y. Shirota, T. Hashimoto, A. Nadamoto, T. Hattori, A. Iizawa, K. Tanaka, and K. Sumiya: A TV Programming Generation System Using Digest Scenes and a Scripting Markup Language, Proc. of HICSS-34, 34th Hawaii International Conference on System Science and CD-ROM of full papers, Jan. 3-6, 2001, Hawaii, USA.
9. Takako Hashimoto, Yukari Shirota, Atsushi Iizawa, and Hideko S. Kunii: Personalized Digests of Sports Programs Using Intuitive Retrieval and Semantic Analysis, in Alberto H. F. Laender, Stephen W. Liddle, and Veda C. Storey (Eds.): Conceptual Modeling - ER 2000, Proc. of the 19th International Conference on Conceptual Modeling, Salt Lake City, Utah, USA, October 9-12, 2000, Lecture Notes in Computer Science, Vol. 1920, Springer, 2000, pp. 584-585.
10. K. Zettsu, K. Uehara, and K. Tanaka: Semantic Structures for Video Data Indexing, in Shojiro Nishio and Fumio Kishino (Eds.): Advanced Multimedia Content Processing, First International Conference, AMCP '98, Osaka, Japan, November 9-11, 1998, Lecture Notes in Computer Science, Vol. 1554, Springer, 1999, pp. 356-369.
11. T. Ushiama and T. Watanabe: A Framework for Using Transitional Roles of Entities for Scene Retrievals Based on Event-Activity Model, Information Processing Society of Japan Transactions on Database, Vol. 40, No. SIG 3 (TOD 1), Feb. 1999, pp. 114-123 (in Japanese).
12. J. Kamahara, Y. Nomura, K. Ueda, K. Kandori, S. Shimojo, and H. Miyahara: A TV News Recommendation System with Automatic Recomposition, in Shojiro Nishio and Fumio Kishino (Eds.): Advanced Multimedia Content Processing, Proc. of the First International Conference, AMCP '98, Osaka, Japan, November 9-11, 1998, Lecture Notes in Computer Science, Vol. 1554, Springer, 1999, pp. 221-235.
13. Nippon Television Network Corporation. http://www.ntv.co.jp
Casting Mobile Agents to Workflow Systems: On Performance and Scalability Issues

Jeong-Joon Yoo (1), Young-Ho Suh (2), Dong-Ik Lee (1), Seung-Woog Jung (3), Choul-Soo Jang (3), and Joong-Bae Kim (3)

(1) Department of Info. and Comm., Kwang-Ju Institute of Science and Technology, 1 Oryong-Dong Buk-Gu Kwangju, Korea (Republic of)
(2) Internet Service Department, Electronics and Telecommunications Research Institute
(3) EC Department, Electronics and Telecommunications Research Institute, 161 Kajong-Dong Yusong-Gu, Taejon, Korea (Republic of)
{jjyoo, dilee}@kjist.ac.kr
{yhsuh, swjung, jangcs, jjkim}@etri.re.kr

Abstract. In this paper we describe two important design issues of mobile agent-based workflow systems, at the architecture and workflow execution levels, and propose solutions for better performance and scalability. We suggest a 3-layer architecture at the architecture level and an agent delegation model at the workflow execution level. Based on the proposed methods, mobile agents effectively distribute the workloads of a naming/location server and of a workflow engine to others. In consequence, the performance and the scalability of workflow systems are improved. This effectiveness is shown through comparison with client server-based and other mobile agent-based workflow systems using stochastic Petri-net simulation. The simulation results show that our approach not only outperforms the others in massive workflow environments but also matches the scalability of previous mobile agent-based workflow systems.
manage by itself. More specifically, if a set of tasks is assigned to an agent, the agent can perform all the tasks without any interaction with workflow engines. With the autonomy and mobility features, remote interactions can be reduced; as a result, performance and scalability are improved. DartFlow was the first WFMS based on mobile agents, aiming at highly flexible and scalable WFMSs [6]. Since the mobile agent carries the workflow definition by itself, it can decide the next tasks to perform without the help of workflow engines. However, DartFlow also has scalability and performance problems: (i) the centralized location/naming server may become a bottleneck, and (ii) the agent code size is large, which introduces considerable communication overhead (referred to as agent migration overhead in this paper). In this paper, we tackle the performance and the scalability of mobile agent-based WFMSs by addressing (i) and (ii) above. We suggest a '3-layer architecture' and an 'agent delegation model' as solutions for issues (i) and (ii), respectively. We show the effectiveness of the proposed solutions through stochastic Petri nets and compare them with previous WFMSs. Throughout this paper, throughput is used as the measure of performance, with throughput = 1/response time, where response time is the duration from the creation of a workflow instance to its completion. Scalability is the sensitivity of throughput to the number of workflow instances. The rest of this paper is organized as follows. In Section 2, we describe design issues and approaches for better performance and scalability of WFMSs based on mobile agents. In Section 3, we compare the performance and the scalability of the agent delegation models through stochastic Petri-net simulation; one of the agent delegation models, combined with the 3-layer architecture, is then compared with previous WFMSs. Finally, Section 4 concludes with future work.
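As a simple illustration of the measures just defined (the response times below are invented, not simulation output), throughput is the reciprocal of response time, and a system scales better when its throughput degrades less as the number of workflow instances grows.

```python
# Illustration of the performance/scalability measures used in the paper.
# The response times below are hypothetical, not simulation results.

def throughput(response_time_sec):
    """Throughput = 1 / response time (completed workflow instances per second)."""
    return 1.0 / response_time_sec

# Assumed response times for 100 and 1000 concurrent workflow instances.
response_time = {100: 40.0, 1000: 900.0}

t_small = throughput(response_time[100])
t_large = throughput(response_time[1000])

# Scalability: how strongly throughput degrades as the load grows.
degradation = (t_small - t_large) / t_small
print(f"throughput@100={t_small:.4f}/s, @1000={t_large:.4f}/s, degradation={degradation:.1%}")
```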
2 Design Issues and Approaches for Increasing Performance and Scalability

In this section we explain design issues and suggest approaches for better performance and scalability. We try to solve the two problems of mobile agent-based workflow systems described before: (i) the centralized location/naming server may become a bottleneck, and (ii) the agent code size is large, which introduces considerable communication overhead (referred to as agent migration overhead in this paper). We suggest a three-layer architecture for (i) and agent delegation models for (ii).

2.1 System Architectural Level: 3-Layer Architecture

To enjoy the advantages of mobile agents we must provide efficient location management for mobile agents, so that agents can communicate with each other for purposes such as dynamic reconfiguration. In order to provide a location-independent name resolution scheme, a location/naming server is required to map a symbolic name to the current location of an agent. However, a centralized location/naming server may be a potential performance bottleneck in mobile agent systems - this may be unacceptable in a system where a huge number of agents execute in parallel.
Fig. 1. 3-layer architecture of a WFMS
Thus, to exploit the potential advantages of the mobile agent paradigm, particularly in terms of performance and scalability, an efficient location management mechanism for mobile agents must be provided at the mobile agent system level. Fig. 1 shows the 3-layer architecture of a WFMS, which we suggest as a solution to this problem. Instead of a mobile agent (referred to in the text as a proxy agent) migrating over the task performers in layer 3 to execute the delegated workflow instance, the proxy agent creates several agents with the help of a workflow engine in layer 2 and delegates a sub-workflow process to each of them. We refer to a mobile agent responsible for part of a workflow process as a sub-agent; the role of sub-agents is defined and used later in the agent delegation model. In conventional mobile agent systems, the locations of both proxy agents and sub-agents are managed by a centralized location/naming server. In our system, by contrast, the location of sub-agents is managed by the corresponding proxy agent. In this 3-layer architecture, location management is gracefully distributed in a hierarchical way, and no centralized location/naming server that could become a bottleneck is required.

2.2 Agent Execution Level: Agent Delegation Model

Agent migration overhead is a source of performance degradation. On the other hand, process migration distributes the workload of a host to others. Considering both, a 'workflow process decomposition policy' must be provided that reduces agent migration overhead and increases load distribution. Finding the optimum agent delegation model is beyond the scope of this paper; the agent delegation model is a kind of workflow process decomposition policy. Here we propose some non-trivial division methods and claim that these strategies are necessary. Starting from the two trivial division methods below (a and b), we define non-trivial division methods.

a. Minimum model (referred to as min). Each task of a workflow process is assigned to a mobile agent, as shown in Fig. 2(a). When a mobile agent completes its own task, it does not return back home but reports the result to the corresponding workflow engine. The workflow engine then updates its local database, runs the scheduler to decide the next task, and assigns it to another mobile agent. As the number of workflow processes increases, the number of generated mobile agents increases, so the workflow engine becomes a bottleneck and scalability decreases.

Fig. 2. Trivial Agent Delegation Models: (a) min (b) max

b. Maximum model (referred to as max). All tasks of a workflow process are assigned to one mobile agent, as shown in Fig. 2(b). The mobile agent itself migrates to the task performers, performs the tasks, and determines the next tasks. A mobile agent residing at the task performers can take load off the workflow engines. The max model favors scalability, while the min model favors performance (as shown later in Section 3). In this paper, we adopt these two trivial agent delegation models as references and try to enhance performance by proposing a comparatively good delegation model that is a hybrid of min and max:

c. Non-trivial agent delegation model: min-based max model. The algorithm for the min-based max model is shown in Fig. 3. ELEMENT is an object data type for tasks, joins, splits, or terminations; TASKS_QUEUE is a queue of tasks. As an example, consider the workflow process shown in Fig. 4(a), ignoring the dotted boxes for now.
Fig. 3. Algorithm for min-based max model
Fig. 4. An example of the proposed delegation model: (a) initialization; (b) after a sub-agent completes one task, two further sub-agents, each assigned the two tasks of a parallel routing, are generated; (c) after a sub-agent completes two tasks of a sequential routing, further sub-agents, each assigned one task of a parallel routing, are generated
In the initialization phase, the process is decomposed into the set of sub-processes enclosed by dotted boxes. The two sub-workflow processes (1) and (2) are further decomposed into the solid boxes in Fig. 4(b) and (c), respectively, by the Decomposition procedure in Fig. 3. This model is compared with the other alternatives, namely min, max, and the maximum-based minimum model, which adopts max as the basic model and partially applies min at the AND-splits of a workflow process (applying the Decomposition procedure without Initialization in Fig. 3).
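Since the decomposition algorithm is only given as Fig. 3, the following sketch shows one plausible reading of it: consecutive sequential tasks are grouped into a single sub-agent assignment, and an AND-split closes the current group so that each parallel branch is delegated to its own sub-agent. The data representation and function names are assumptions, not the paper's actual algorithm.

```python
# A plausible reading of the min-based max decomposition (the paper gives the
# algorithm only as a figure, so structure and names here are assumptions):
# sequential runs of tasks form one sub-agent assignment; an AND-split ends
# the current group and each branch is decomposed separately.

def decompose(elements):
    """elements: list of ("task", name) or ("and_split", [branch, branch, ...]).
    Returns a list of task groups, one group per sub-agent."""
    groups, current = [], []
    for kind, payload in elements:
        if kind == "task":
            current.append(payload)
        elif kind == "and_split":
            if current:
                groups.append(current)
                current = []
            for branch in payload:            # each parallel branch handled on its own
                groups.extend(decompose(branch))
    if current:
        groups.append(current)
    return groups

process = [
    ("task", "t1"),
    ("and_split", [
        [("task", "t2"), ("task", "t3")],     # sequential routing -> one sub-agent
        [("task", "t4")],                     # parallel routing -> own sub-agent
    ]),
    ("task", "t5"),
]
print(decompose(process))   # [['t1'], ['t2', 't3'], ['t4'], ['t5']]
```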
3 Experimental Comparisons

In this section, we evaluate the performance and the scalability of our design strategy - a mobile agent-based workflow system adopting the 3-layer architecture and the delegation model. Our system is compared with previous workflow systems, namely client server-based and other mobile agent-based workflow systems. In our simulation, performance corresponds to the absolute elapsed time needed to complete a fixed number of workflow instances, while scalability is read as the slope of the performance graphs. We use the UltraSAN simulation tool [8] for stochastic Petri-net simulation.

3.1 Simulation Models

In this subsection, we present the stochastic Petri-net models and parameters of our simulations.

Agent delegation models. The simulation models for min and max are shown in Fig. 5(a) and (b). The simulation models for the min-based max and max-based min models are combinations of (a) and (b), as described before. As an example, in Fig. 4(c) there is one mobile agent holding two tasks of a sequential routing and two mobile agents each holding a single task of a parallel routing; a Petri-net model of the hybrid delegation model for this case is shown in Fig. 5(c). All models in Fig. 5 consist of a 'workflow engine', a 'task performer', and a 'channel'. Important simulation parameters for the min, max, min-based
Fig. 5. Petri net models for agent delegation model-based workflow system
max and max-based min agent delegation models are shown in Table 1. All transition rates are modified by the load-dependent rate function shown in Table 2. In addition, the agent transmission rates vary with the agent size, as shown in Table 3. Agent migration rates are determined based on [9], together with the
assumption that the transmission time of code and/or data is proportional to their size. Scheduling rates are those observed in Hanuri/TFlow [10].
Table 1. Parameters for agent delegation models

Minimum model
Transition name   Rate    Type         Semantics
CreateAgent       19.2    Exponential  Load dependent
InitInstance      1.0     Exponential  Load dependent
Scheduling        1.1     Exponential  Load dependent
ReturnResult      15.4    Exponential  Load dependent
Dispatch          0.769   Exponential  Load dependent
Task              19.2    Exponential  Load dependent

Maximum model
Transition name   Rate    Type         Semantics
InitInstance      1.0     Exponential  Load dependent
UpdateLocation    7.69    Exponential  Load dependent
ReturnResult      15.4    Exponential  Load dependent
Dispatch          0.046   Exponential  Load dependent
Task              19.2    Exponential  Load dependent

Minimum-based maximum model, Maximum-based minimum model
Transition name   Rate               Type         Semantics
InitInstance      1.0                Exponential  Load dependent
Scheduling        1.1                Exponential  Load dependent
UpdateLocation    7.69               Exponential  Load dependent
ReturnResult      15.4               Exponential  Load dependent
Dispatch          Refer to Table 3   Exponential  Load dependent
Task              19.2               Exponential  Load dependent
Table 2. Parameters for the load-dependent rate μ = λ·(1 - x/(B+1))^α

Symbol  Value             Meaning
λ       Rate in Table 1   Constant rate for a single-server semantic
B       100               Buffer size
x       -                 Number of customers
α       0.7               Controller for rate change
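For concreteness, the load-dependent rate of Table 2 can be computed as below, using the base rates of Table 1 and the values B = 100, α = 0.7; the example transition chosen ('Task', rate 19.2) and the customer counts are arbitrary.

```python
# Load-dependent transition rate from Table 2:
#   mu = lambda * (1 - x / (B + 1)) ** alpha
# where x is the current number of customers at the server.

def load_dependent_rate(base_rate, customers, buffer_size=100, alpha=0.7):
    return base_rate * (1 - customers / (buffer_size + 1)) ** alpha

# Example: the 'Task' transition (rate 19.2 in Table 1) slows down under load.
for x in (0, 50, 100):
    print(x, round(load_dependent_rate(19.2, x), 3))
```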
Table 3. 'Dispatch' rate for min-based max and max-based min
The number of tasks assigned to sub-agents: 1 2 3 4 5 6
32
Fig. 6. Remote interactions for a task execution in a client server-based WFMS

Table 4. Parameters for the client server-based workflow system

Transition     Rate   Type
SendInit       154    Exponential
SendContext    1.54   Exponential
ReturnResult   15.4   Exponential
Ack            154    Exponential
A client server model. To compare the performance and the scalability of our system with a client server-based workflow system, we define the remote interactions between a client and a server as shown in Fig. 6. In the first step of Fig. 6, a workflow engine asks a task performer to create a context; the task performer allocates space for the context and returns an acknowledgement message of about 100 bytes. After receiving the acknowledgement, the workflow engine sends the context information, consisting of the control data and relevant data of about 10 Kbytes, to the task performer in the second step. In the third step, the workflow engine asks the task performer to execute the task with the context information, and the task performer then begins the task. After completing the task, the task performer returns the results, about 1 Kbyte. The parameter values for the client server-based workflow system are summarized in Table 4, and the simulation model is shown in Fig. 5(d).

3.2 Evaluations and Analysis

We now evaluate the performance and the scalability of the agent delegation models under various parameters - the number of branches of a workflow process, the number of tasks in a branch (referred to as the task length), and the number of workflow processes - to reflect diverse workflow system environments. One of the agent delegation models is then chosen for the comparison with client server-based workflow systems.

Relation between workflow structure and simulation time. Fig. 7 shows the relation between the number of branches in a workflow process and the simulation time. In this simulation we fix the number of processes at 100 and 1000. When the number of workflow processes is 100 (referred to in the text as a small-scale workflow system), min outperforms all other models; when the number of workflow processes is 1000 (referred to in the text as a large-scale workflow system), the min-based max model outperforms all others. These results indicate that min is more sensitive to the number of workflow processes than the other models, owing to the bottleneck at the centralized workflow engine. The same holds when the task length is increased, as shown in Fig. 8.

Relation between the number of workflow processes and simulation time. In this simulation we evaluate the performance and the scalability of the agent delegation models for various numbers of workflow processes, adopting the workflow structure of [11] as the target. As shown in Fig. 9, min outperforms all other agent delegation models for small-scale workflow systems, but min-based max outperforms all others for large-scale workflow systems, while also matching the scalability of max.

Comparison with other workflow systems. In this simulation the min-based max model with the 3-layer architecture, a client server-based workflow system, and another mobile agent-based workflow system are evaluated. As shown in Fig. 10, a workflow system applying the 3-layer architecture and an agent delegation model not only outperforms the client server-based and DartFlow workflow systems in massive workflow environments, such as telecommunication and manufacturing enterprises, but also preserves the scalability of DartFlow.
Fig. 7. The relation between the number of branches and the simulation time: (a) number of workflow processes = 100, (b) number of workflow processes = 1000

Fig. 8. The relation between the length of branch and the simulation time: (a) number of workflow processes = 100, (b) number of workflow processes = 1000

Fig. 9. The relation between the number of processes and the simulation time
Therefore, we conclude that the 3-layer architecture and the agent delegation models together provide a good solution for increasing the performance and the scalability of workflow systems.
4 Conclusions

In this paper, we discussed design issues of mobile agent-based workflow systems. In our proposed system, the mobility of agents is mainly used to transport parts of a workflow implementation to decentralized processing elements, and the autonomy of agents is used to reduce the remote interactions between workflow engines and task performers. Because the workflow structure changes frequently, uploading the parts of the workflow implementation to decentralized processing elements in advance is not an efficient method. By hierarchical distribution of control at the architecture level as well as at the workflow execution level, the potential advantages of a
mobile agent in workflow systems are realized in terms of performance and scalability. We showed the effectiveness of the proposed strategy with the UltraSAN simulation tool. As the stochastic Petri-net simulation results have shown, the proposed model outperforms the client server-based and max-based workflow systems in massive workflow environments, while matching the scalability of max-based workflow systems. Although the code size for error handling is not considered here, we will address it in future work.
Fig. 10. Comparisons of the performance and the scalability of workflow systems
Acknowledgments. This work was partially supported by Korea Science and Engineering Foundation (KOSEF) under contract 98-0102-11-01-3.
References
1. WfMC, "Workflow Management Coalition Terminology and Glossary: WfMC Specification," 1999.
2. Frank Leymann and Dieter Roller, "Business Process Management with FlowMark," Spring Compcon, Digest of Papers, pp. 230-234, 1994.
3. Action Workflow website: http://www.actiontech.com/
4. FloWare website: http://www.plx.com/
5. G. Alonso, C. Mohan et al., "Exotica/FMQM: A Persistent Message-Based Architecture for Distributed Workflow Management," In IFIP WG8.1 Working Conference on Information System Development for Decentralized Organizations, pp. 1-18, 1995.
6. Ting Cai, Peter A. Gloor, and Saurab Nog, "DartFlow: A Workflow Management System on the Web using Transportable Agents," Dartmouth College, Technical Report PCS-TR96-283, 1996.
7. Colin G. Harrison, David M. Chess, and Aaron Kershenbaum, "Mobile Agents: Are they a good idea?," Research Report, IBM Research Division, T.J. Watson Research Center, 1995.
8. D. D. Deavours, W. D. Obal II, M. A. Qureshi, W. H. Sanders, and A. P. A. van Moorsel, "UltraSAN Version 3 Overview," In Proceedings of the International Workshop on Petri Nets and Performance Models, 1995.
9. Manfred Dalmeijer, Eric Rietjens, Dieter Hammer, Ad Aerts, and Michiel Soede, "A Reliable Mobile Agents Architecture," In Proceedings of the Int. Symposium on Object-Oriented Real-Time Distributed Computing, 1998.
10. Kwang-Hoon Kim, Su-Ki Paik, Dong-Su Han, Young-Chul Lew, and Moon-Ja Kim, "An Instance-Active Transactional Workflow Architecture for Hanuri/TFlow," In Proceedings of the International Symposium on Database, Web and Cooperative Systems, 1999.
11. Qinzheng Kong and Graham Chen, "Transactional Workflow for Telecommunication Service Management," In Proceedings of the International Symposium on Network Operations and Management, 1996.
Anticipation to Enhance Flexibility of Workflow Execution

Daniela Grigori, François Charoy, and Claude Godart

LORIA - INRIA Lorraine, Campus Scientifique BP 239, 54506 Vandoeuvre les Nancy, France
{dgrigori, charoy, godart}@loria.fr
Abstract. This paper introduces an evolution of classical workflow that allows more flexible execution of processes while retaining its simplicity. On the one hand, it allows processes to be described in the same way as they are in design and engineering manuals; on the other hand, it allows these processes to be controlled in a way that is close to the way they are actually enacted. This evolution is based on the concept of anticipation, i.e. the weakening of the strict sequential execution of activity sequences in workflows by allowing intermediate results to be used as preliminary input to succeeding activities. The architecture and implementation of a prototype workflow execution engine allowing anticipation are described.
1 Introduction

Current workflow models and systems are mainly concerned with the automation of administrative and production business processes. These processes coordinate well-defined activities that execute in isolation, i.e. synchronize only at their start and terminate states. While current workflow models and systems apply efficiently to this class of applications, they show their limits when one wants to model the subtlety of cooperative interactions as they occur in interactive or creative processes, typically co-design and co-engineering processes. Several research directions are being investigated to provide environments that are more adaptable to user habits. These directions are described in Section 2; most of the time they propose complex evolutions of the basic workflow model. Our approach consists of adding flexibility to workflow execution with minimal changes to the workflow model. We try to reach this goal by relaxing the way the model is interpreted: users can take some initiative regarding the way they start their assigned activities, leaving the burden of consistency management to the execution engine. In this paper we introduce the idea of anticipation as a way to support more flexible execution of workflows. The principle is to allow an activity to start its execution even if all "ideal" conditions for its execution are not yet fulfilled. Anticipation is very common in creative applications: reading a draft, or starting to code before the design is complete, illustrate this idea. Anticipation adds flexibility to workflow execution in a way that cannot be modeled in advance. It can also be used to accelerate process execution, as it increases parallelism in activity execution.
The paper is organized as follows. In the next section, we motivate our approach. In section 3 we describe our view of anticipation in workflows and the constraints related to it in order to ensure consistent execution of a process. Section 4 is dedicated to the implementation of a workflow engine allowing anticipation. Finally, section 5 concludes.
2 Related Work

In the literature, several works address the problem of workflow flexibility. The first approach considers the process as a resource for action [16]. Basically, it means that the process is a guide for users upon which they can build their own plan; it is not a definitive constraint that has to be enforced. Thus, users keep the initiative to execute their activities: they are not constrained by the predefined order of activities but are inspired by it and encouraged to follow it. The authors of [14] propose to enhance the workflow model with goal activities and regions in order to allow its use as a resource for action; a goal node represents a part of the procedure with an unstructured work specification, and its description contains goals, intent, or guidelines. The authors of [1] argue that a plan used as a resource for action must support user awareness, helping users to situate themselves in the context of the process, either to execute it or to escape from it in order to solve a breakdown.

The second approach uses the process as a constraint for the flow of work, but admits that it may change during its lifetime: the process can be dynamically adapted during its execution. ADEPTflex [15], Chautauqua [5], WASA [17], and WIDE [3] provide explicit primitives to dynamically change running workflow instances. These primitives allow tasks to be added or deleted and the control and data flow within a running workflow instance to be changed. Constraints are imposed on the modifications in order to guarantee the syntactic correctness of the resulting process instance.

The third approach consists of evolving the process model itself to allow for more flexible execution. In this case, flexibility has to be modelled and is anticipated during the process modelling step. This is one of the branches followed by the COO project [7] and by other similar work [6]. In Mobile [10], the authors define several perspectives (functional, behavioral, informational, organizational, operational) for a workflow model, the definitions of the perspectives being independent of one another; descriptive modeling is defined as the possibility to omit irrelevant aspects in the definition of a perspective. In [11], [7], and [12], other examples of descriptive modeling are presented as techniques for compact modeling: the authors propose simple modeling constructs that better represent real and complex work patterns, to be used instead of a composition of elementary constructs.

The first two approaches consider flexibility at the level of the process execution itself. In one case, the model is a guide to reach a goal; in the other case, the model is a path to reach a goal that may change during its course. In the third approach, it is the model that evolves to provide the requested flexibility. In this paper we consider a fourth way, which is based neither on the way the process model is used or instantiated, nor on the way it can be evolved or modelled, but which adds flexibility to the workflow management system execution engine itself. This has the advantage of retaining the simplicity of the classical model and may also
be adapted to other approaches. It is a first step toward a simple model that could support a more flexible execution suited to engineering processes.
3 Flexibility of the Execution Model

In this section we describe the evolution of the workflow execution engine to support flexible execution and data flow, and how we tackle the consistency problems that arise. The workflow model that we use is very simple: it provides the basics to support control and data flow modeling, and we give here only the minimal description needed to explain the evolution we propose for the workflow engine. Our workflow model is based on process graphs. A process is represented as a directed graph whose nodes are activities. An activity having more than one incoming edge is a join activity; it has an associated join condition. For an or-join activity the associated join condition is the disjunction of the conditions associated with its incoming edges; for an and-join activity it is their conjunction. An activity having more than one outgoing edge is a fork activity. Activities have input data elements and produce output data elements. The circulation of data between activities is represented by edges between output elements of one activity and input data elements of another activity; the consumer must be a direct or transitive successor of the producer activity. In summary, a process model is represented by a directed graph whose nodes are activities and whose edges represent control flow and data flow constraints.

3.1 Anticipation

Traditional workflow management systems impose an end-start dependency between activities [4]: an activity can be started only after the preceding ones have completed. However, in cooperative processes that do not use coordination support, activities overlap and start their work with intermediate results (as opposed to final results) of preceding activities, even if all conditions for their execution are not completely fulfilled. (An intermediate result is a result produced by an activity during its execution, before its end; a final result is a result produced by an activity when it completes.) Anticipation is the means we propose to support this natural way of executing activities while retaining the advantage of explicit coordination. Anticipation allows an activity to start its execution earlier than the control flow defined in the process model would normally allow. When the preceding activities are completed and all activation conditions are met, an anticipating activity enters the normal executing state, i.e. it continues its execution as if it had never anticipated; at this time, the final values of its input parameters are available. Having already been started, the activity is able to finish its execution earlier. Anticipation thus allows a more flexible execution while preserving the termination order of activities.
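Before moving on, a minimal sketch of the process model described above may help; class and field names are ours, not the paper's, and dependency conditions are reduced to simple activity completion.

```python
# Minimal data structures for the process model described above
# (names are illustrative; the paper defines the model only informally).
from dataclasses import dataclass, field

@dataclass
class Activity:
    name: str
    join_kind: str = "AND"            # how incoming dependencies combine (AND / OR)
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

@dataclass
class Process:
    activities: dict = field(default_factory=dict)
    control_edges: list = field(default_factory=list)   # (from, to) activity names
    data_edges: list = field(default_factory=list)      # (producer output, consumer input)

    def add(self, activity):
        self.activities[activity.name] = activity

    def start_condition(self, name, completed):
        """True when the activity may start: all (AND) or any (OR) predecessors done."""
        preds = [src for src, dst in self.control_edges if dst == name]
        if not preds:
            return True
        done = [p in completed for p in preds]
        return all(done) if self.activities[name].join_kind == "AND" else any(done)

p = Process()
for n in ("Edit", "Review", "Modify"):
    p.add(Activity(n))
p.control_edges += [("Edit", "Review"), ("Review", "Modify")]
p.data_edges += [(("Edit", "document"), ("Review", "document"))]
print(p.start_condition("Review", completed={"Edit"}))   # True
```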
Fig. 1. Execution without (1) and with (2) anticipation
Fig. 1 (2) shows an execution with anticipation: the Edit activity provides an intermediate draft of the edited document to Review, so the Review and Modify activities can be started earlier and the whole process can be terminated earlier. The possibility to anticipate requires some modifications of the workflow execution model. Our approach is to extend the traditional model (we start from the model defined in [13]) to take anticipation into account. Two new activity states are added: ready to anticipate and anticipating. The ready to anticipate state indicates that the activity can start to anticipate; when an agent having the adequate role chooses it from its to-do list, the activity enters the anticipating state. Fig. 2 depicts a state transition diagram including the newly added states. Note that before completing, even if an activity has started to anticipate, it must pass through the executing state.
Fig. 2. State transition diagram for activities
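To make the extended life cycle concrete, here is a minimal sketch of the state machine of Fig. 2. Only the transitions described in the text are reliable; the edges involving the suspended state and some of the dead edges are simplified assumptions, and guard conditions are omitted.

```python
# Minimal sketch of the activity life cycle of Fig. 2, with the two added
# states. Suspended-related edges are assumptions; guards are omitted.
ALLOWED = {
    "initial":             {"ready to anticipate", "executable", "dead"},
    "ready to anticipate": {"anticipating", "executable", "dead"},
    "anticipating":        {"executing", "dead"},
    "executable":          {"executing", "dead"},
    "executing":           {"suspended", "completed"},
    "suspended":           {"executing"},
    "completed":           set(),
    "dead":                set(),
}

class ActivityState:
    def __init__(self):
        self.state = "initial"

    def move(self, target):
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target

a = ActivityState()
for s in ("ready to anticipate", "anticipating", "executing", "completed"):
    a.move(s)      # an anticipating activity still completes via 'executing'
print(a.state)
```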
When a process is started, all activities are in the initial state, except the start activities (activities with no incoming edges), which are in the executable state.

Transition from initial to ready to anticipate. Concerning the moment when an activity in the initial state can start to anticipate, several strategies can be considered:
1. Free anticipation - an activity in the initial state may anticipate at any moment (the ready to anticipate state is merged with the initial state). Control flow dependencies defined in the process model are interpreted at execution time as end-end dependencies, i.e. an activity can finish its execution only when the preceding one has. In our example, this would mean that Modify could start at any time. Free anticipation should be reserved for very special cases.
2. Control flow dictated anticipation - an activity may anticipate when its direct predecessor has started to work, i.e. is in the anticipating or executing state. For an or-join activity, at least one of its predecessors must be in the anticipating or executing state; for an and-join activity, all the preceding activities must have been
started. In this case, the traditional start-end dependency between activities is relaxed and replaced with a start-start dependency. In our example, that means that Modify can start as soon as Review has started.
3. Control flow and data flow dictated anticipation - an activity can anticipate when its predecessors are in the anticipating, executing, or completed state and values are available for all its mandatory inputs. In this case, the Modify activity could start only with a draft document and some early versions of the comments.
These three strategies have an impact on the general flexibility of process execution. The first one is very open but can lead to many inconsistencies, while the last one is more rigid and removes most of the interest of anticipation. In the remainder of the paper, we consider that the implemented policy is the second one.

Transition from ready to anticipate to anticipating. As soon as an activity becomes ready to anticipate, it is scheduled, i.e. it is assigned to all agents who qualify under its associated staff query. It is important to note that users know the state of the activity and can decide to anticipate. When one of these agents starts to anticipate, the activity passes into the anticipating state and disappears from the to-do lists of the other agents.

Transition from anticipating to executing. An activity in the anticipating state passes to the executing state when it reaches a situation in which it would be allowed to start its execution if it were not anticipating.

Transition from ready to anticipate to executable. An activity in the ready to anticipate state passes to the executable state under the same conditions as an activity traditionally passes from initial to executable.

Transition from anticipating to dead. An activity in the anticipating state passes to the dead state if it is situated on a path that is not followed in the current workflow instance. Such a transition has the same motivation as a traditional workflow activity going from the initial state to the dead state. An anticipating activity makes the hypothesis that it will be executed, which is not certain. However, we think that, due to the nature of the applications we consider, this situation will not occur frequently: the objective of anticipation is to make the right decision at the right time thanks to rapid feedback.

We can see from this description that the modifications of the workflow execution engine needed to provide anticipation are not very important. In order to gain the full benefits of anticipating activity execution, it is also necessary to allow early circulation of data between activities, i.e. the publication of early or intermediate results in the output container of executing activities.

3.2 Data Flow Supporting Intermediate Results

As we consider mainly interactive activities, the user can decide to provide output data before the end of the activity, possibly in several successive versions. Two new activity operations are introduced, Write and Read; these operations can be used by users (or even special tools) to manage the publication of data during activity execution. The Write operation updates an output element and makes it available to succeeding activities. Consider that activity A invokes the operation Write(aout1) to publish its output data aout1, and suppose that the data flow definition contains two edges with aout1 as origin: an edge (aout1, bin1) between A and activity B, and another one, (aout1, cin1), between A and activity C. If B or C is in the anticipating state, it must be notified about
the existence of the new data (pull mode) or notified of the arrival of the data in its input container (push mode). The Read operation is used by anticipating activities in pull mode to update an input datum with the new version published by the preceding activity. For unstructured data, a mechanism must be provided to synchronize with new versions (a merge, for instance); for text files, this is a common feature supported by version management systems. Activities are no longer isolated in black-box transactions: they can provide results that can be used by succeeding activities. If the succeeding activities are interactive, the users in charge can consider taking this new value into account, or they may choose to wait for a more stable value expected to arrive later. Breaking activity isolation is necessary in order to benefit from the ability to anticipate, but it may also cause some consistency problems; these problems and the way to address them are described in Section 3.4. Besides supporting early start, anticipation can be used to provide rapid feedback to preceding activities: a communication channel can be created backward along the defined flow of activities. As anticipation allows successive activities to execute partly in parallel, it is natural to imagine that people may have some direct feedback (e.g. comments) to provide to the users of preceding activities. In the example of Fig. 1, the Review activity can provide early comments while the Edit activity is still running.

3.3 Anticipation to Increase Parallelism between a Process and Its Sub-process

In a traditional workflow, an activity is a black box that produces output data at the end of its execution. In our approach, an activity may fill an output parameter in its output container as soon as it is produced. These partial results become available to succeeding activities, which can enter the anticipating state and initiate actions based on them. For instance, consider the activity of sending a letter to a customer: the letter can be prepared in the anticipating state with the available data and actually sent out in the executing state. In this way, the process execution is accelerated. Anticipation is especially useful when the activity is a sub-process. The output container of the activity is the output container of the process implementing it. Like an activity, a process can gradually fill its output container with data produced by its activities; these data become available to subsequent activities, which otherwise would have to wait for the end of the sub-process. Anticipation thus increases the parallelism between the main process and the sub-process implementing the activity. To illustrate this, we use the example of [9] depicted in Fig. 3, which represents a typical process to handle the delivery of products by a retail company. The Get item activity takes care of the stock control; it can be implemented as a sub-process which verifies whether the item is available and otherwise orders it, and finally produces an invoice. The output data of the sub-process are the warehouse where the product is available and the delivery date. As soon as the product is in stock, or the date when it will be received from the manufacturer is known, the acknowledgement to the customer can be prepared. Similarly, the Prepare shipment activity is initiated as soon as it is known from which warehouse the product will be delivered and at what date it will be available.
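A sketch of the Write operation in push mode, reusing the aout1/bin1/cin1 example above, is given below. The class names are illustrative and notification is reduced to a direct method call; the actual engine-side mechanics are not described in this excerpt.

```python
# Sketch of intermediate-result publication (push mode). The data-flow edges
# reuse the aout1/bin1/cin1 example above; notification is a simple call.

class ActivityInstance:
    def __init__(self, name):
        self.name = name
        self.state = "anticipating"
        self.inputs = {}          # input element -> latest published version

    def notify(self, input_name, value):
        if self.state == "anticipating":
            self.inputs[input_name] = value   # push mode: value lands in the input container
            print(f"{self.name}: new version of {input_name} -> {value!r}")

class OutputContainer:
    def __init__(self, data_flow):
        self.data_flow = data_flow            # output element -> [(consumer, input element)]

    def write(self, output_name, value):
        """Publish an intermediate (or final) result and notify consumers."""
        for consumer, input_name in self.data_flow.get(output_name, []):
            consumer.notify(input_name, value)

b, c = ActivityInstance("B"), ActivityInstance("C")
a_out = OutputContainer({"aout1": [(b, "bin1"), (c, "cin1")]})
a_out.write("aout1", "draft v1")      # both anticipating successors see the draft
a_out.write("aout1", "final")         # later they must re-read the final value
```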
Fig. 3. Inter-process data communication
For this kind of process, control and data flow dictated anticipation can be applied efficiently. However, the Shipping activity cannot be anticipated; it is a typical activity that can be executed only when all the preceding activities have terminated.

3.4 Synchronization and Recovery

Publishing intermediate results is important to take advantage of anticipation. However, an activity that has published results during its execution may fail or die after it has published a result. The problem is how to compensate the visibility of such results (it is related to dirty reads in traditional concurrency theory). To this end, the following rules are applied:
1. An activity that has read an intermediate result must read the corresponding final result if it differs from the intermediate result. An activity can enter the completed state only from the executing state, and when an anticipating activity enters the executing state, the previous activities have completed and their final results are available.
2. An activity cannot produce intermediate results if it is not currently "certain" that it will enter the normal executing state. Independently of the anticipation strategy, we adopt a conservative approach concerning the moment when an anticipating activity may publish data, requiring the preceding activity to be in the executing state. Of course, this is an optimistic approach: nothing ensures that the preceding activities will not be canceled.
3. In case an executing activity is canceled, anticipating activities that published data may need to be compensated. As only direct successors could publish data, the influence of canceling an activity is limited. After the state change, the status of the anticipating activities is recomputed. They may go to the dead state, in which case the work done in the anticipating state is lost; otherwise, they remain anticipating, new updated values will be provided by the execution of their preceding activities, and they will have to resynchronize with the next valid input.
As we can see, anticipation does not have an important impact on the general problem of workflow recovery, as long as activities do not have side effects. The
work done in the anticipating state may be lost, but this is the price of flexibility. In the case of activities having side effects outside the scope of the workflow system, we are still obliged to introduce the special case described in the next section.

3.5 Suitability and Applicability of Anticipation

Anticipation is not suitable in every situation and not applicable to every activity type. For instance, it cannot be applied to an automatic activity whose effects cannot be compensated. However, it can be applied to automatic activities of the retry type (flexible transactions). For example, an activity that searches for available flight tickets in a database can be automatically restarted at each modification of its inputs, if the execution speed of the process is more important for the application than the extra load created in the external database. A similar example is an activity that compiles a program at each modification of one of its modules.
During the definition phase of the process, it may not be possible to know in advance when anticipation will occur but it is possible to know which activities must not anticipate. These activities will be marked as such and will follow a more classical execution model.
In this part we have described how a simple modification of the classical model can enhance the flexibility of workflow execution in different ways. Now we are going to present how it is integrated in a larger framework to support cooperative workflow execution.
4 Implementation

We are currently implementing a workflow execution engine allowing anticipation. It is part of the MOTU prototype, whose goal is to provide a framework to support the cooperative work of virtual teams. The prototype is written in Java (motu.sourceforge.net).
Fig. 4. Motu Client
The basis for the implementation is a non-linear version management server that provides the functionalities for public and private workspace management. This system has been extended with a basic workflow execution engine that provides dynamic process instantiation and implements anticipation. It also supports a cooperative transaction manager that allows exchanges of data between concurrently executing activities. Anticipation is implemented as part of the workflow execution engine, and activities that are not anticipatable can be specified. To each activity is associated a COO transaction [8], which provides support for optimistic concurrency control between activities. A COO transaction [2] has a private base (a workspace) that contains a copy of all the objects accessed (read and updated) by activities; it also serves as a communication channel between successive activities.
5 Conclusion

In this paper, we have presented a simple yet powerful way to modify the classical behavior of workflow systems to provide more flexible execution of processes. Compared to other approaches that try to provide flexibility in workflow models, the one we propose is simpler for end users to understand. This simplicity is essential in cooperative processes: it allows a graphical representation, helps participants to situate themselves in the context of the process, and facilitates manual interventions for dynamic modifications. We showed that extending existing systems with anticipation is simple, since it requires just updating the state transition management of the execution engine and some adaptation of the way data are transmitted between activities. Anticipation also provides substantial benefits for activity execution, especially for interactive activities: parallelism of execution between successive activities, the possibility of early feedback between successive activities, and potential acceleration of the overall execution. Of course, there is also the risk of doing some extra work, but as we mainly target interactive activities, we believe that users are able to evaluate the opportunity to use anticipation and what they may gain from it. The next step of this work will be to consolidate the integration of the workflow model with cooperative transactions and to provide a set of operators to allow dynamic modification of the process during its execution. We believe that flexible execution, cooperative data exchange, and dynamic process definition provide a cooperative environment allowing richer interactions between users while preserving coordination control over the process.
References
1. Agostini, A. and G. De Michelis, Modeling the Document Flow within a Cooperative Process as a Resource for Action. 1996, University of Milano.
2. Canals, G., et al., COO Approach to Support Cooperation in Software Developments. IEE Proceedings Software Engineering, 1998. 145(2-3): p. 79-84.
3. Casati, F., et al. Workflow Evolution. In 15th Int. Conf. on Conceptual Modeling (ER'96). 1996.
4. Workflow Management Coalition, The Workflow Reference Model. 1995.
5. Ellis, C. and C. Maltzahn. Chautauqua Workflow System. In 30th Hawaii Int. Conf. on System Sciences, Information System Track. 1997.
6. Georgakopoulos, D. Collaboration Process Management for Advanced Applications. In International Process Technology Workshop. 1999.
7. Godart, C., O. Perrin, and H. Skaf. COO: a Workflow Operator to Improve Cooperation Modeling in Virtual Processes. In 9th Int. Workshop on Research Issues in Data Engineering: Information Technology for Virtual Enterprises (RIDE-VE'99). 1999.
8. Grigori, D., F. Charoy, and C. Godart. Flexible Data Management and Execution to Support Cooperative Workflow: the COO approach. In The Third International Symposium on Cooperative Database Systems for Advanced Applications (CODAS'01). 2001. Beijing, China.
9. Hagen, C. and G. Alonso. Beyond the Black Box: Event-based Inter-Process Communication in Process Support Systems. In 9th International Conference on Distributed Computing Systems (ICDCS 99). 1999. Austin, Texas, USA.
10. Jablonski, S. Mobile: A Modular Workflow Model and Architecture. In 4th International Working Conference on Dynamic Modeling and Information Systems. 1994. Noordwijkerhout, NL.
11. Jablonski, S. and C. Bussler, Workflow Management - Modeling Concepts, Architecture and Implementation. 1996: International Thomson Computer Press.
12. Joeris, G. Defining Flexible Workflow Execution Behaviors. In Enterprise-wide and Cross-enterprise Workflow Management - Concepts, Systems, Applications, GI Workshop Proceedings - Informatik'99, Ulmer Informatik Berichte Nr. 99-07, University of Ulm. 1999.
13. Leymann, F. and D. Roller, Production Workflow. 1999: Prentice Hall.
14. Nutt, G.J. The Evolution Toward Flexible Workflow Systems. In Distributed Systems Engineering. 1996.
15. Reichert, M. and P. Dadam, ADEPTflex - Supporting Dynamic Changes of Workflows Without Losing Control. Journal of Intelligent Information Systems, 10, 1998.
16. Suchmann, L.A., Plans and Situated Action: The Problem of Human-Machine Communication. Cambridge University Press, 1987.
17. Weske, M. Flexible Modeling and Execution of Workflow Activities. In 31st Hawaii International Conference on System Sciences, Software Technology Track (Vol VII). 1996.
Coordinating Interorganizational Workflows Based on Process-Views

Minxin Shen and Duen-Ren Liu

Institute of Information Management, National Chiao Tung University, Taiwan
{shen, dliu}@iim.nctu.edu.tw
Abstract. In multi-enterprise cooperation, an enterprise must monitor the progress of private processes as well as those of the partners to streamline interorganizational workflows. In this work, a process-view model, which extends beyond the conventional activity-based process model, is applied to design workflows across multiple enterprises. A process-view is an abstraction of an implemented process. An enterprise can design various process-views for different partners according to diverse commercial relationships, and establish an integrated process that is comprised of private processes as well as the process-views that these partners provide. Participatory enterprises can obtain appropriate progress information from their own integrated processes, allowing them to collaborate more effectively. Furthermore, interorganizational workflows are coordinated through virtual states of process-views. This work develops a regulated approach to map the states between private processes and process-views. The proposed approach enhances prevalent activity-based process models to be adapted in open and collaborative environments.
A process-view abstracts away critical commercial secrets and serves as an external interface of an internal process. An enterprise can design process-views that are unique to each partner. The process-views of the participatory enterprises comprise a collaboration workflow. Furthermore, the virtual states of a process-view present the progress status of an internal process. An enterprise can monitor and control the progress of partners through the virtual states of their process-views. The proposed approach provides a modeling tool to describe interorganizational workflows as well as an interoperation mechanism to coordinate autonomous, heterogeneous and distributed workflow management systems (WfMSs). The remainder of this paper is organized as follows. Section 2 presents the process-view model and its applications within inter-enterprise cooperation. Section 3 summarizes the procedure of defining an ordering-preserved process-view presented in [9]. Next, Section 4 presents the coordination of interorganizational workflows through the virtual states of process-views, and Section 5 discusses some properties of the process-view based approach and related work. Conclusions are finally made in Section 6.
2 Process-View Based Coordination Model
A process that may have multiple process-views is referred to herein as a base process. A process-view is an abstracted process derived from a base process to provide abstracted process information. Based on the process-view definition tool, a modeler can define various process-views to achieve different levels of information concealment.

Definition 1 (Base process). A base process BP is a 2-tuple ⟨BA, BD⟩, where
1. BD is a set of dependencies. A dependency dep(x, y, C) indicates that "x is completed and C is true" is one precondition for activity y to start.
2. BA is a set of activities. An activity is a 4-tuple ⟨AID, SPLIT_flag, JOIN_flag, SC⟩, where
   (a) AID is a unique activity identifier within a process.
   (b) SPLIT_flag/JOIN_flag may be "NULL", "AND", or "XOR". NULL indicates this activity has only one outgoing/incoming dependency (Sequence). AND/XOR indicates the AND/XOR JOIN/SPLIT ordering structures defined by the WfMC [15].
   (c) SC is the starting condition of this activity. If JOIN_flag is NULL, SC equals the condition associated with its incoming dependency. If JOIN_flag is AND/XOR, SC equals the Boolean AND/XOR combination of all incoming dependencies' conditions.
3. ∀x, y ∈ BA,
   (a) if ∃ dep(x, y, C), then x and y are adjacent;
   (b) the path from x to y is denoted by x ⇝ y;
   (c) x is said to have a higher order than y if ∃ x ⇝ y, i.e., x proceeds before y, and their ordering relation is denoted by x > y or y < x. If neither x ⇝ y nor y ⇝ x exists, i.e., x and y proceed independently, their ordering relation is denoted by x ∥ y.

2.1 Virtual Process: A Process-View
A process-view is generated from either base processes or other process-views and is considered a virtual process. A process-view is defined as follows:
Definition 2 (Process-view). A process-view is a 2-tuple ⟨VA, VD⟩, where (1) VA is a set of virtual activities; (2) VD is a set of virtual dependencies; (3) analogous to a base process, ∀vai, vaj ∈ VA, the path from vai to vaj is denoted by vai ⇝ vaj; the ordering relation between vai and vaj may be ">", "<", or "∥".

A virtual activity is an abstraction of a set of base activities and corresponding base dependencies. A virtual dependency is used to connect two virtual activities in a process-view. Figure 1 illustrates how the components of our model are related. Section 3 demonstrates how to abstract virtual activities and dependencies from a base process. Notably, within an interorganizational environment, a participant's role represents an external partner.

Fig. 1. Process-view model
2.2 Process-View Based Coordination
Figure 2 illustrates the cooperation scenario and system components, in which three systems cooperate through process-views. To enhance interoperability through open techniques, the process-views' interactions (solid bi-arrow lines in the figure) are implemented based on industrial standards, such as CORBA and XML. However, each enterprise autonomously determines the proprietary implementation of the communication (blank bi-arrow lines) among its base processes, process-views and integrated processes.

Fig. 2. System architecture and interaction scenario
A process-view is an external view (or interface) of the private base process and is derived through the procedure described in Section 3. An integrated process is a specific view of the interorganizational workflow from a participatory enterprise's perspective, which consolidates private processes and partners' process-views. Notably, the integrated process is also a virtual process: each of its virtual activities/dependencies is either a base one from a private base process or a virtual one from a partner's process-view. Like a base activity/process, a virtual activity/process is associated with a set of states that represent its run-time status. A virtual state is employed to abstract the execution states of the base activities/processes contained by a virtual activity/process. To monitor and control the progress of a private process through the virtual states of its public process-view, two rules are proposed in Section 4 to map the states between a base and a virtual process. Therefore, an enterprise can coordinate with its partners through the virtual states of process-views.

2.3 Three Phase Modeling
Collaboration modeling is a complex negotiation procedure. Process design is divided into three phases: base process phase, process-view phase and integration phase. Figure 3 illustrates the three phases for the cooperation between enterprises A and B.
- The base process phase is the traditional build phase. A process modeler specifies the activities and their orderings in a business process, based on a top-down decomposition procedure that many activity-based process models support.
- Next, designing a process-view is a bottom-up aggregation procedure. A process modeler can define various process-views for the partners according to diverse cooperation relationships.
- Finally, a process modeler forms an integrated process, i.e., a personalized view of an interorganizational workflow, by consolidating the private base process and the partners' process-views.

Fig. 3. Three phases of designing interorganizational workflows
3 Ordering-Preserved Process-View
Enterprises cooperate through process-views. According to the different properties of a base process, various approaches can be developed to derive a process-view. A novel ordering-preserved approach to derive a process-view from a base process has been presented in [9] and is summarized in this section. The ordering-preserved
approach ensures that the original execution order in a base process is preserved. A legal virtual activity in an ordering-preserved process-view must follow three rules:

Rule 1 (Membership). A virtual activity's member may be a base activity or a previously defined virtual activity.

Rule 2 (Atomicity). A virtual activity, an atomic unit of processing, is completed if and only if each activity contained by it either has been completed or is never executed. A virtual activity is started if and only if one activity contained by it is started. In addition, if an ordering relation § (i.e., >, < or ∥) between two virtual activities is found in a process-view, then an implied ordering relation § exists between these virtual activities' respective members.

Rule 3 (Ordering preservation). The implied ordering relations between two virtual activities' respective members must conform to the ordering relations in the base process.

Based on the above rules, virtual activities and dependencies in an ordering-preserved process-view are formally defined as follows:

Definition 3 (Virtual Activity). For a base process BP = ⟨BA, BD⟩, a virtual activity va is a 6-tuple ⟨VAID, A, D, SPLIT_flag, JOIN_flag, SC⟩, where
1. VAID is a unique virtual activity identifier within a process-view.
2. A is a nonempty set, and its members follow three rules:
   - Its members may be base activities that are members of BA or other previously defined virtual activities that are derived from BP.
   - The fact that va is completed implies that each member of A is either completed or never executed during run time; the fact that va is started implies that one member of A is started.
   - ∀x ∈ BA, x ∉ A, the ordering relations between x and all members (base activities) of A are identical in BP, i.e., ∀y, z ∈ BA, y, z ∈ A, if x § y exists in BP, then x § z also exists in BP.
3. D is a nonempty set, and its members are dependencies whose succeeding activity and preceding activity are contained by A.
4. SPLIT_flag/JOIN_flag may be "NULL" or "MIX". NULL suggests that va has only one outgoing/incoming virtual dependency (Sequence), while MIX indicates that va has more than one outgoing/incoming virtual dependency.
5. SC is the starting condition of va.

The SPLIT_flag and JOIN_flag cannot simply be described as AND or XOR, since va is an abstraction of a set of base activities that may be associated with different ordering structures. Therefore, MIX is used to abstract the complicated ordering structures. A WfMS evaluates SC to determine whether va can be started. The abbreviated notation va = ⟨A, D⟩ is used for brevity.

Definition 4 (Virtual Dependency). For two virtual activities vai = ⟨Ai, Di⟩ and vaj = ⟨Aj, Dj⟩ that are derived from a base process BP = ⟨BA, BD⟩, a virtual dependency from vai to vaj is vdep(vai, vaj, VCij) = { dep(ax, ay, Cxy) | dep(ax, ay, Cxy) ∈ BD, ax ∈ Ai, ay ∈ Aj }, where the virtual condition VCij is a Boolean combination of the Cxy.

The procedure of defining an ordering-preserved process-view is summarized as follows: A process modeler first selects essential activities. The process-view definition tool then automatically generates a legal minimum virtual activity that encapsulates these essential activities. These two steps are repeated until the modeler
determines all required virtual activities. The definition tool then automatically generates all virtual dependencies between these virtual activities, as well as the ordering fields (JOIN/SPLIT_flag) and starting condition (SC) of each virtual activity. [9] presents the algorithm that implements the process-view definition tool.
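The algorithm itself is given in [9]; purely as an illustration, the Python sketch below checks the third membership condition of Definition 3 for a candidate member set. It assumes a hypothetical helper order(x, y) that returns ">", "<", or "∥" according to the precomputed ordering relations of the base process.

def ordering_consistent(members, all_activities, order):
    """Third membership condition of Definition 3: every base activity x outside the
    candidate member set must relate to all members in the same way (same >, < or ||)."""
    outside = [x for x in all_activities if x not in members]
    for x in outside:
        relations = {order(x, y) for y in members}
        if len(relations) > 1:      # x sees the members with mixed ordering relations
            return False
    return True

# Usage sketch: ordering_consistent({"aA2", "aA3"}, base_activity_ids, order) is True
# only if grouping aA2 and aA3 into one virtual activity breaks no ordering relation
# observed by the remaining base activities.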
4 Coordinating Inter-enterprise Processes through Virtual States
In this section, the mechanism that coordinates inter-enterprise processes through activity/process states is described. During run time, cooperative partners monitor and control the progress of inter-enterprise processes through the execution states (virtual states) of virtual activities/processes. First, the states and operations of base activities/processes are described. Then, the state mapping rules to coordinate base processes, process-views and integrated processes during run time are proposed.

4.1 Generic States and Operations
The state of a process or activity instance represents the execution status of the instance at a specific point. A state transition diagram depicts the possible run-time behavior of a process/activity instance. Currently, both WMF and Wf-XML support the generic states shown in Figure 4, in which WfExecutionObject is a generalization of a process or activity instance [11]. Furthermore, the hierarchical structure of states imposes superstate/substate relationships between them.

Fig. 4. States of a WfExecutionObject [11]
After a WfExecutionObject is initiated, it is in the open state; upon completion, it enters the closed state. The open state has two substates: running indicates that the object is executing, while not_running indicates that the object is quiescent because it is either temporarily paused (in the suspended state) or recently initialized and prepared to start (in the not_started state). The state completed indicates that the object has completed correctly. Otherwise, the object has stopped abnormally, i.e., it is in the terminated or aborted state. The operations used to control a WfExecutionObject, e.g., suspend, terminate, and change_state, change the state of the WfExecutionObject as well as its associated WfExecutionObjects. The operation get_current_state, as defined in WMF, returns the current state of a WfExecutionObject instance. In the following section, the state function fs is employed in place of this operation for brevity.
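The superstate/substate hierarchy of Figure 4 can be captured with a small lookup table. The sketch below is illustrative only; the plain strings and the helper is_in are our assumptions and not part of the WMF or Wf-XML APIs.

SUPERSTATE = {
    "open": None, "closed": None,
    "running": "open", "not_running": "open",
    "not_started": "not_running", "suspended": "not_running",
    "completed": "closed", "terminated": "closed", "aborted": "closed",
}

def is_in(state, superstate):
    """True if `state` equals `superstate` or is (transitively) one of its substates."""
    while state is not None:
        if state == superstate:
            return True
        state = SUPERSTATE[state]
    return False

# fs(x) in the text corresponds to get_current_state(x); e.g. is_in("suspended", "open") is True.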
4.2 State Mapping
Both base and virtual activities/processes support the same set of the previously mentioned generic states and operations. In this section, consistent mapping of the execution states between virtual processes/activities and their member processes/activities is discussed. Two cooperation scenarios can trigger state mapping. First, virtual activities/processes must respond to a state change that occurred in base activities/processes, i.e., the mapping occurs from base activities/processes to virtual activities/processes. Second, base activities/processes must react to a request to change state that is triggered by virtual activities/processes, i.e., the mapping occurs from virtual activities/processes to base activities/processes.

State Mapping between a Base Process and a Process-View
The virtual state of a process-view simply equals the state of its base process. For example, a process-view is in the suspended state if its base process is also in that state. However, state mapping between a virtual activity and its member activities must follow the atomicity rule (Rule 2), as follows.

Active state. The atomicity rule states that a virtual activity is active, i.e., in the open state, if at least one member activity is active. The notion of an active degree (or grade) of the active states is introduced to extend the atomicity rule for state mapping. Active states are ranked as follows: running > suspended > not_started. Since an activity has executed for a while prior to suspension, but has never run before reaching the not_started state, the suspended state is more active than the not_started state. The atomicity rule is extended as follows: if two or more member activities of a virtual activity va are active and the states of the member activities compose a state set Q, then the (virtual) state of va equals the most active state in Q.

Inactive state. The atomicity rule also states that a virtual activity is inactive, i.e., in the closed state, if all members are either inactive or never initialized. According to the definition of the closed state and its substates in [11, 14], an execution object (WfExecutionObject) is stopped in the completed state if all execution objects contained within it are completed. Second, an execution object is stopped in the terminated state if all execution objects contained within it are either completed or terminated, and at least one is terminated. Finally, an execution object is stopped in the aborted state if at least one execution object contained within it is aborted. Therefore, based on these definitions, the state of an inactive virtual activity can be determined.

In sum, a virtual activity responds to the state change of member activities according to the following rule. Notably, fs(a)/fs(va) denotes the state/virtual state of a member activity a / virtual activity va.

Rule 4 (State Abstraction). Given a virtual activity va = ⟨A, D⟩:
- If ∃a ∈ A, fs(a) = open or one of its substates, then let the state set Q = { fs(a), ∀a ∈ A }; fs(va) equals the most active state in Q.
- If ∀a ∈ A, fs(a) = closed or one of its substates, and ∃a ∈ A, fs(a) = aborted, then fs(va) = aborted.
- If ∀a ∈ A, fs(a) is either terminated or completed, and ∃a ∈ A, fs(a) = terminated, then fs(va) = terminated.
- If ∀a ∈ A, fs(a) = completed, then fs(va) = completed.

Figure 5 depicts the state transitions of a virtual activity that are triggered by the state transitions of its member activities.
Fig. 5. State transitions of a virtual activity va = ⟨A, D⟩
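For illustration, Rule 4 can be rendered as a small Python function. The function and constant names below are hypothetical, the states are the plain strings of Figure 4, and the sketch is not the authors' implementation.

ACTIVE_RANK = {"running": 3, "suspended": 2, "not_started": 1}   # running > suspended > not_started

def abstract_state(member_states):
    """Rule 4 (State Abstraction): derive fs(va) from the states of va's member activities."""
    active = [s for s in member_states if s in ACTIVE_RANK]
    if active:                                    # at least one member is still open
        return max(active, key=ACTIVE_RANK.__getitem__)
    if "aborted" in member_states:                # all members closed, at least one aborted
        return "aborted"
    if "terminated" in member_states:             # all completed or terminated, one terminated
        return "terminated"
    return "completed"

# Example: abstract_state(["completed", "not_started", "suspended"]) returns "suspended".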
When invoking an operation on a virtual process/activity object, the object propagates the operation to the underlying base process/activity object. A process-view affects the entire base process, while a virtual activity only affects its member activities. For example, invoking the create_process operation on a process-view definition initiates a process-view and its corresponding base process, whereas applying the suspend operation to a virtual activity only suspends its member activity(s). If an external event or API call alters the state of a virtual activity/process, then the resulting state transitions in base activities/processes are governed by the following rule:

Rule 5 (State Propagation). For a virtual activity va = ⟨A, D⟩, when requesting va to be in state s: ∀a ∈ A, if a transition from state fs(a) to state s is valid, then the state of a transfers from fs(a) to s; otherwise, fs(a) is not changed. Next, according to Rule 4, the state of va can be derived. If fs(va) ≠ s, then an InvalidState exception [11] is thrown and the member activities roll back to their original states. If fs(va) = s, then the state transitions of the member activities are committed. For a process-view PV, when requesting PV to be in state s, if its base process BP can transfer to s, then the state transitions of PV and BP are committed; otherwise, an InvalidState exception is returned to the requester and PV and BP roll back to their original states.

State Mapping between an Integrated Process and Its Underlying Processes
Since each virtual activity in an integrated process maps to exactly one activity in the private process or in a partner's process-view, state mapping at the activity level is direct. If virtual activity a's underlying activity can transfer to state s when requesting a to be in state s, then the transition is committed; otherwise, the transition fails. Similarly, if the activities within an integrated process IP can transfer to state s when requesting IP to be in state s, then the request is committed; otherwise, the transition fails.
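As an illustration of the state mapping just described (Rules 4 and 5), the following sketch propagates a requested state to the member activities and rolls back on failure. All names are hypothetical, the transition table is a simplified illustration rather than the normative WMF one, and the Rule 4 helper is restated so the fragment is self-contained.

ACTIVE_RANK = {"running": 3, "suspended": 2, "not_started": 1}

VALID_TRANSITIONS = {                      # illustrative subset of legal transitions
    ("not_started", "running"), ("running", "suspended"), ("suspended", "running"),
    ("running", "completed"), ("running", "terminated"), ("not_started", "terminated"),
}

class InvalidStateError(Exception):
    """Stands in for the InvalidState exception of [11]."""

def abstract_state(states):
    active = [s for s in states if s in ACTIVE_RANK]
    if active:
        return max(active, key=ACTIVE_RANK.__getitem__)
    for s in ("aborted", "terminated"):
        if s in states:
            return s
    return "completed"

def propagate_state(member_states, requested):
    """Rule 5 (State Propagation): move every member that can legally reach `requested`;
    if the derived virtual state (Rule 4) still differs from the request, roll back."""
    tentative = {
        aid: requested if (state, requested) in VALID_TRANSITIONS else state
        for aid, state in member_states.items()
    }
    if abstract_state(list(tentative.values())) != requested:
        raise InvalidStateError("virtual activity cannot reach state " + repr(requested))
    return tentative        # commit: the caller adopts these as the new member states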
5 Discussion
Several investigations have defined and implemented interaction points among cooperating enterprises. Casati and Discenza [2] introduced event nodes in a process to define the tasks that exchange information with partners. Lindert and Deiters [8] discussed, from various aspects, the data that should be described at interaction points. Van der Aalst [13] applied a message sequence chart to identify interactions among enterprises. In these approaches, the external interface is the set of interaction points.
These investigations ensure that only the interaction points are public and the structure of internal processes remains private. A widely used modeling method uses an activity as an external interface to abstract a whole process enacted by another enterprise. The notion is based on service information hiding [1], i.e., a consumer does not need to know the internal process structure of a provider. Therefore, the activity can be viewed as a business service that an external partner enacts. This approach resembles the traditional nested sub-process pattern, in which a child sub-process implements an activity within a parent process. Various investigations are based on this paradigm of service activities, such as [5, 10, 12]. Via service activity states, a consumer can monitor service progress. Most approaches only support the activity states specified by the Workflow Management Coalition (WfMC) [15] in order to comply with interoperation standards such as Wf-XML [14] and the Workflow Management Facility (WMF) [11]. To reveal a more semantic status of the service provider's process, CMI [5] enables modelers to define application-specific states that extend the standard activity states.

This work focuses mainly on supporting collaborative workflow modeling and interoperation. Conventional approaches are restricted by the original granularity of process definitions, which is not intended for outside partners. Therefore, determining which parts of private processes should be revealed to partners is extremely difficult. The process-view model enables a modeler to generate various levels of abstraction (granularity) of a private process flexibly and systematically. A process-view can be considered a compromise between privacy and publicity. Meanwhile, in parallel with the publication of our work, Chiu et al. proposed a workflow view that provides partial visibility of a process to support interorganizational workflow in an e-commerce environment [3]. A workflow view contains selected partial activities of a process. In contrast, a process-view is derived from a bottom-up aggregation of activities to provide various levels of aggregated abstraction of a process.

Interorganizational workflows are coordinated through the virtual states of process-views. The generic states/operations that were defined in the standards were adopted in order to use the existing standards as a backbone to integrate heterogeneous and distributed systems in multi-enterprise cooperation. This work has developed a regulated approach to manage the state mapping between base processes/activities and virtual processes/activities. Although only generic states and operations are discussed herein, the adopted hierarchical structure of states facilitates further extension for specific application domains, e.g., the CMI approach [5]. WISE [7] proposed a framework to compose a virtual business process through the process interfaces of several enterprises. In addition, CrossFlow [6] proposed a framework, based on service contracts, to manage WfMS cooperation between service providers and consumers. These projects focus on providing broking architectures to exchange business processes as business services. Our contribution is a systematic approach from which external interfaces can be derived. The process-view model can be extended to support the trading architectures that WISE and CrossFlow proposed.
6 Conclusion
A process-view model to conduct interorganizational workflow management was presented herein. Notably, a process-view is an abstracted process that can be viewed
as an external interface of a private process. The proposed approach not only preserves the privacy of an internal process structure, but also achieves progress monitoring and control. Moreover, enterprises interact through the virtual states of process-views that conform to interoperation standards. Therefore, distributed, heterogeneous and autonomous WfMSs can be integrated in an open environment. The proposed approach alleviates the shortcomings of inter-enterprise workflow collaboration.

Acknowledgements. The work was supported in part by the National Science Council of the Republic of China under grant NSC 89-2416-H-009-041.
References
1. A. P. Barros and A. H. M. ter Hofstede, "Towards the Construction of Workflow-Suitable Conceptual Modelling Techniques", Information Systems Journal, 8(4), pp. 313-337, 1998.
2. F. Casati and A. Discenza, "Modeling and Managing Interactions among Business Processes", Journal of Systems Integration, 10(2), pp. 145-168, 2001.
3. D. K. W. Chiu, K. Karlapalem, and Q. Li, "Views for Inter-Organization Workflow in an E-Commerce Environment", Proceedings of the 9th IFIP Working Conference on Database Semantics (DS-9), Hong Kong, China, April 24-28, 2001.
4. D. Georgakopoulos, M. Hornick, and A. Sheth, "An Overview of Workflow Management: From Process Modeling to Workflow Automation Infrastructure", Distributed and Parallel Databases, 3(2), pp. 119-153, 1995.
5. D. Georgakopoulos, H. Schuster, A. Cichocki, and D. Baker, "Managing Process and Service Fusion in Virtual Enterprises", Information Systems, 24(6), pp. 429-456, 1999.
6. P. Grefen, K. Aberer, Y. Hoffner, and H. Ludwig, "CrossFlow: Cross-Organizational Workflow Management in Dynamic Virtual Enterprises", Computer Systems Science & Engineering, 15(5), pp. 277-290, 2000.
7. A. Lazcano, G. Alonso, H. Schuldt, and C. Schuler, "The WISE Approach to Electronic Commerce", Computer Systems Science & Engineering, 15(5), pp. 345-357, 2000.
8. F. Lindert and W. Deiters, "Modeling Inter-Organizational Processes with Process Model Fragments", Proceedings of the GI Workshop Informatik'99, Paderborn, Germany, Oct. 6, 1999.
9. D.-R. Liu and M. Shen, "Modeling Workflows with a Process-View Approach", Proceedings of the 7th International Conference on Database Systems for Advanced Applications (DASFAA'01), pp. 260-267, Hong Kong, China, April 18-22, 2001.
10. M. z. Muehlen and F. Klien, "AFRICA: Workflow Interoperability Based on XML Messages", Proceedings of the CAiSE'00 Workshop on Infrastructures for Dynamic Business-to-Business Service Outsourcing (IDSO'00), Stockholm, Sweden, June 5, 2000.
11. Object Management Group, "Workflow Management Facility", Document number formal/00-05-02, April 2000.
12. K. Schulz and Z. Milosevic, "Architecting Cross-Organizational B2B Interactions", Proceedings of the 4th International Enterprise Distributed Object Computing Conference (EDOC 2000), pp. 92-101, Los Alamitos, CA, USA, 2000.
13. W. M. P. van der Aalst, "Process-Oriented Architectures for Electronic Commerce and Interorganizational Workflow", Information Systems, 24(8), pp. 639-671, 1999.
14. Workflow Management Coalition, "Interoperability Wf-XML Binding", Technical report WfMC TC-1023, May 1, 2000.
15. Workflow Management Coalition, "The Workflow Reference Model", Technical report WfMC TC-1003, Jan. 19, 1995.
Strategies for Semantic Caching*

Luo Li¹, Birgitta König-Ries², Niki Pissinou², and Kia Makki²

¹ Center for Advanced Computer Studies, U. of Louisiana at Lafayette
[email protected]
² Telecommunications and Information Technology Institute, Florida International University
niki|[email protected]
Abstract. One major problem with the use of mediator-based architectures is long query response times. An approach to shortening response times is to cache data at the mediator site. Recently, there has been growing interest in semantic query caching, which may generally outperform the page and tuple caching approaches. In this paper, we present two semantic-region caching strategies with different storage granularities for mediators accessing relational databases. In contrast to most existing approaches, we cache not only the projection result of queries but also the condition attributes, resulting in a higher cache hit rate. Additionally, we introduce the Profit-based Replacement algorithm with Aging Counter (PRAG), which incorporates the semantic notion of locality into the system.
1 Introduction

Mediators are software components that homogenize and integrate information stemming from different data sources [Wied92]. Mediator-based architectures are used successfully in applications where transparent access to heterogeneous information sources is needed. Whenever a user query is sent to a mediator, this query is expanded into one to the underlying information sources and results are gathered from these sources. A major disadvantage stemming from this kind of "view-like behavior" is that query response times tend to be very long. For many applications this is not acceptable. For example, consider a mobile and wireless system that has the typical characteristics of frequent disconnections and expensive connections charged by connection time [PMK2000]. In such a system, long response times can result in the user being disconnected by the time the result arrives and in high costs associated with querying. A first step in making mediator architectures more usable in application areas that require short response times is to make query results instantly available. Caching them in the mediator can do this. Then, when a query arrives, before contacting the underlying information source, the mediator checks if all or part of the required
information is in its cache. This materialization has two key advantages: First, partial results are available even if a mediator becomes disconnected from its source. Second, the response times and thus the costs and the likelihood of disconnections are reduced. While these advantages are of particular importance in mobile and wireless applications, more traditional systems can also profit from the above approach. Several caching strategies are available for mediator architectures. Besides traditional tuple and page caching, there has been growing interest in semantic query caching, where the cache is organized in semantic regions containing semantically related query results. Due to semantic locality (i.e., subsequent queries often are related conceptually to previous queries), semantic caching generally outperforms the page and tuple caching approaches [CFZ94]. Different classes of mediators cater to different information needs. In this paper, we are looking at mediators answering queries on information sources with hot-spots, i.e., small areas of the database that are queried frequently. Semantic caching works particularly well for this class of mediators. Hot-spots are common in a wide variety of applications: a library's database may get many queries on bestsellers; an online travel system may have lots of queries about certain places; a movie database may have mainly queries on movies with top ratings. Currently, we are restricting our investigation to non-join queries in read-only applications such as the ones mentioned above. Although this leaves out a number of interesting applications, we believe the remaining ones to be of considerable importance, warranting sophisticated support. In order to achieve a high utilization of data, we cache not only the projection results of a query but also the condition attributes. Additionally, we introduce a Profit-based Replacement algorithm with Aging Counter (PRAG). The remainder of this paper is organized as follows: Section 2 introduces the basic ideas of semantic-region caching and provides our approaches. Two strategies with different storage granularities are studied, and the profit-based replacement algorithm PRAG is introduced. Section 3 provides our experimental results; Section 4 gives an overview of related work. Section 5 summarizes the paper and presents an outlook on future work.
2 Materializing Hot-Spots Using Semantic Caching

In this section, we show how semantic caching can be applied in order to reduce mediator response times for mediators accessing relational databases with a hot-spot pattern. We summarize the semantic caching method described in [DFJ+96], and describe in detail our solutions to the open questions posed in [DFJ+96].

2.1 Preliminaries

[DFJ+96] proposes a semantic model for client-side caching and replacement in a client-server database system. Semantic caching uses semantic descriptions to describe the cached data instead of a list of physical pages, which is used in page caching, or tuple identifiers, which are used in tuple caching. The cache is organized as a collection of semantic regions. Each region groups tuples sharing the same
semantic description, a query statement corresponding to the cached data. By using these descriptions, the client can determine what data is available in the cache. When a query is asked, it is split into a remainder query and a probe query. The probe query is used to retrieve data from the cache, while the remainder query is sent to the database to retrieve data that is not available in the cache. In order to keep the cache organization optimal, semantic regions will be split and merged dynamically based on the queries posed. Thus, semantic descriptions may change over time. The semantic region is also the unit of replacement in case the cache becomes full. Sophisticated value functions incorporating semantic notions of locality can be used for this task. Additionally, for this purpose, information about the reference history is maintained for each region. [DFJ+96] presents the basic ideas of semantic caching, but it does not provide any implementation details. In the following, we provide these implementation details.

2.2 Our Implementation

In order to implement semantic caching, two problems need to be addressed: The first one is the development of a caching strategy, namely the decision of what data to store in the cache. The second one is the development of a replacement algorithm. This algorithm is used to decide which entries to replace in the cache if the cache does not have enough empty space to add a new region. In the following subsections, we look at both problems.

2.2.1 Caching Strategies
In this section, we describe two caching strategies for mediators that support queries to databases containing hot-spots. The existence of hot-spots implies that the same data items will be accessed frequently; therefore, it makes sense to store these data items in the mediator. We assume that this approach is used for queries that access a single table. If there is no specific declaration, qi (i = 1, 2, 3, …) is used to denote a general query:

SELECT qi_project_attributes FROM R WHERE qi_condition(qi_condition_attributes)

where qi is the query name; qi_project_attributes indicates qi's selection list; qi_condition_attributes indicates the attributes used in qi's WHERE clause; qi_condition(qi_condition_attributes) indicates the predicate of qi, i.e., its WHERE clause; and R is the relation name. The general idea is to materialize and store region(s) whenever a query is posed. For each region, two data structures are maintained: one is the "real" data (i.e., the tuples retrieved from the database), the other is the meta-data entry for the region. This contains the cache location, semantic description, size, and reference information. When a query is submitted to the mediator, it is first compared with the semantic descriptions of the regions that we have cached. Then, the query is split into a remainder query, which will be sent to the server, and a probe query, which will be executed in
the local cache. Once query execution is completed, the regions in the cache are reorganized if necessary and the reference history is updated. We have developed two different strategies for building and storing regions. While the first one reduces storage usage, the second one aims at optimizing performance.

Implementation Method 1: Optimizing Storage Usage

Suppose there is no cached data in the mediator at the beginning and query qi on relation R arrives; what data should we retrieve from the database and materialize? In [DFJ+96], complete tuples of the underlying relation are stored. Most other approaches (e.g., [RD2000], [LC98], [GG99]) materialize only part of each tuple. The straightforward method is to materialize exactly the query result. In this case, the semantic description d(ri) of region ri is query qi. However, this is not satisfying, since we want to take advantage of the region(s) to answer future queries. If we create region ri this way, the region can be used only for a query identical to query qi. However, if the incoming query qj is not identical to qi, region ri cannot be used for answering qj, even if the region contains the result or part of the result of qj. Consider as an example the following queries qi and qj on the relation Movie(title, year, length, type, studioName). qi has been materialized as region ri¹.
qi: SELECT title, type FROM movie WHERE year > 1995
d(ri): qi
qj: SELECT title, type FROM movie WHERE year > 1997

Obviously, the result of qj is contained in ri. However, the only columns that we stored for qi are its projection attributes, that is, TITLE and TYPE. We lose information about the condition attribute, the column YEAR. When query qj comes, we cannot use region ri to answer it. To solve this problem, when generating a region for a query, we store not only the projection attributes but also the condition attributes in the semantic region. For example, the semantic description of qi's region is:
d(ri): SELECT qi_project_attributes ∪ qi_condition_attributes ∪ ROWID FROM R WHERE qi_condition(qi_condition_attributes)
Note that we store an extra column, ROWID, in the region, which will be used for region splitting and merging. This is an artificial primary key for a tuple in the scope of the whole database.

¹ We use … as an abbreviation for query q.
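To make this concrete, a small sketch (hypothetical helper name; attribute and condition handling deliberately simplified, and not the authors' implementation) that assembles such a semantic description:

def region_description(project_attrs, condition_attrs, condition, relation):
    """Build d(ri): projection attributes, condition attributes and ROWID, restricted by the condition."""
    select_list = list(dict.fromkeys(project_attrs + condition_attrs)) + ["ROWID"]  # order-preserving union
    return "SELECT " + ", ".join(select_list) + " FROM " + relation + " WHERE " + condition

# For qi above: region_description(["title", "type"], ["year"], "year > 1995", "movie")
# -> "SELECT title, type, year, ROWID FROM movie WHERE year > 1995"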
For simplicity, in what follows, we use "qi_all_attributes" to denote "qi_project_attributes ∪ qi_condition_attributes", and "qi_condition()" to denote "qi_condition(qi_condition_attributes)".

Fig. 1. Region Splitting and Merging
As described above, when a query is submitted to the mediator, we compare its semantic description with that of all the cached regions. Then we can decide how they intersect, generate the probe query and remainder query, and retrieve the necessary data from the remote database. Finally, in order to maintain a well-organized cache, if the query result and an existing region overlap, the regions need to be split and merged. In Figure 1, we show all seven possible situations of overlap between a region and a new query². For each situation in the figure, the left side indicates how the query intersects with a cached region; the right side indicates how the region is split and merged. The boxes in Figure 1 represent relations, with rows (the horizontal direction) representing tuples and columns (the vertical direction) representing attributes: the boxes with bold lines represent cached regions, and the ones with thin lines represent query results. We also add some dashed lines to mark the fragments of a region. These fragments are numbered from 1 to 5. The shadowed boxes indicate the regions whose reference information is updated after the splitting and merging operations.

² It is also possible that a query overlaps with more than one region. Extending the approaches to deal with this case is straightforward.
The exact way of splitting and merging the regions, as well as how the query execution is performed, depends on the type of overlap. Consider as an example Situation 1. Suppose the semantic description for region ri is qi and the semantic description for the new query is qj. From the figure, we can see that qi and qj meet the following condition:

(qi_all_attributes) ⊆ (qj_all_attributes)

However, it is not necessary that the following condition be satisfied:

qj_condition_attributes ⊆ (qi_all_attributes)

If it is met, however, we can execute the probe query pqj and the remainder queries rqj1 and rqj2 shown in Figure 2 in parallel, which will shorten the response time.

Probe query for qj:
  pqj: SELECT * FROM ri WHERE qj_condition()

Remainder queries for qj (R is the corresponding table in the database):
  rqj1: SELECT (qj_all_attributes - qi_all_attributes), ROWID FROM R WHERE qi_condition() ∩ qj_condition()
  rqj2: SELECT qj_all_attributes, ROWID FROM R WHERE ¬qi_condition() ∩ qj_condition()

Fig. 2. Example Probe and Remainder Queries
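As an illustration only, the Situation 1 split could be assembled as follows. The function name is hypothetical, the paper's ∩ and ¬ over conditions are rendered as SQL AND/NOT, and conditions are treated as opaque strings, which sidesteps the satisfiability reasoning a full implementation would need.

def split_query_situation1(qj_attrs, qi_attrs, qi_cond, qj_cond, region_table, base_table):
    """Assemble the probe and remainder queries of Figure 2 for Situation 1:
    the attributes cached for region ri (qi_attrs) are a subset of those qj needs (qj_attrs)."""
    missing = [a for a in qj_attrs if a not in qi_attrs]           # qj_all_attributes - qi_all_attributes
    probe = "SELECT * FROM " + region_table + " WHERE " + qj_cond
    rem1 = ("SELECT " + ", ".join(missing + ["ROWID"]) + " FROM " + base_table +
            " WHERE (" + qi_cond + ") AND (" + qj_cond + ")")
    rem2 = ("SELECT " + ", ".join(qj_attrs + ["ROWID"]) + " FROM " + base_table +
            " WHERE NOT (" + qi_cond + ") AND (" + qj_cond + ")")
    return probe, rem1, rem2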
After we get all the necessary data from the database, we split and merge the region as necessary using the relational operations INSERT and DELETE. We will also update the semantic description and reference history for the changed regions. The original semantic description of ri was:
SELECT qi_all_attributes, ROWID FROM R WHERE qi_condition()

The updated semantic description of ri is:

SELECT qi_all_attributes, ROWID FROM R WHERE qi_condition() ∩ ¬qj_condition()

The semantic description of the newly created region is the query qj.
Let us now consider the second variant of Situation 1, in which the condition qj_condition_attributes ⊆ (qi_all_attributes) is not met. In this case, we cannot execute the probe query until we receive the result of the remainder query from the remote database. Once we do, we execute similar splitting and merging operations. After performing the splitting and merging operations, the reference information of each new region should be updated. For the regions that contain (part of) the result of the new query, we update the reference information; these regions are shown in Figure 1 by shadowed boxes. For the regions that do not contain any result of the new query, we keep the reference information unchanged. The reference information is maintained for region replacement, which is analyzed in detail in Section 2.3.

Implementation Method 2: Optimizing Performance

The method described above minimizes storage usage. The drawback of this approach is that for each relation, a number of differently structured regions may exist, which makes region splitting and merging complicated. We have thus developed a second strategy that achieves uniform regions by storing complete tuples in the semantic region. While this clearly increases storage usage, the advantage is improved performance due to easier region splitting and merging. For example, suppose that the mediator receives a query qi on relation R. If this is the first time that the mediator receives a query on relation R, the mediator will initialize the region space for it. If, however, there is already such a table in the cache, the mediator will add the tuples to it directly. After we add the tuples to the cache table, we create/update the meta-data entry for this region. The region is only a conceptual unit in this strategy: all regions from the same relation are stored together in a single cache table. The only way to distinguish them is to reference their semantic descriptions. For this strategy, we only record the condition of a query as its semantic description; we do not have to record the projection information. When a new query is submitted to the mediator, it is divided into probe query(s) and remainder query(s) as with the first strategy. However, for this strategy (as opposed to the previous one) the splitting and merging operation is simple, as we maintain only one cache table for all the regions that come from the same relation. Suppose that a new query qj is asked. To generate the probe query and the remainder query, we analyze the region whose semantic description is qi. The probe query (or its condition) should be qi_condition() ∩ qj_condition(), or just qj_condition(), which is applied to the region of qi. The remainder query should be ¬qi_condition() ∩ qj_condition(). After we get the tuples from the database, we insert them into the corresponding cache table. We add a new meta-data entry for qj whose semantic description is the same as qj, and we update the semantic description of region qi to qi_condition() ∩ ¬qj_condition(). Note that if the condition qi_condition_attributes ⊆ qj_condition_attributes is met, we may possibly determine that the region can answer the query completely. In this case, we do not have to send a query to the database. We just have to update the semantic descriptions as explained above.
2.3 Replacement Issues

Up to now, we have dealt with adding entries to the cache. When the cache size is exceeded, a replacement policy needs to be used to determine which entries to remove. [DFJ+96] suggests that "a semantic description of cached data enables the use of sophisticated value functions that incorporate semantic notions of locality". However, [DFJ+96] does not give a detailed general solution to this issue³. [SSV96] provides a cache replacement algorithm called LNC-RA (Least Normalized Cost Replacement/Admission) based on a profit model. The algorithm uses the following statistics for each retrieved set RSi corresponding to a query qi:

profit(RSi) = (λi ⋅ Ci) / Si

where
• λi: average rate of reference to query qi;
• Si: size of the set retrieved by query qi;
• Ci: cost of execution of query qi.
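Read as code, the statistic is a one-liner. The following sketch uses hypothetical names and illustrative numbers only; it is merely meant to make the units explicit.

def profit(ref_rate, exec_cost, size):
    """profit(RSi) = (lambda_i * C_i) / S_i: value per unit of cache space of keeping RSi."""
    return ref_rate * exec_cost / size

# e.g. a region referenced 0.2 times/s that costs 3.0 s to recompute and occupies 64 KB:
# profit(0.2, 3.0, 64) is about 0.0094, a relative value used only to compare regions.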
If a new query (or data set) RSi whose result size is bigger than the empty space available in the cache is submitted, LNC-RA will scan the cache to find all cached data sets whose profit is less than that of the current query. If the total size of all such data sets is greater than that of the new query, region replacement will happen. There are several problems with the LNC-RA algorithm. In [SSV96], λi is defined as λi = K/(t - tK), where K is the size of the sliding window, t is the current time, and tK is the time of the K-th reference. However, for a newly retrieved region, we cannot calculate the value of λ, because the denominator "t - t1" is equal to zero. Another problem is that LNC-RA tends to evict newly retrieved data sets first. This is because it first considers all retrieved sets having just one reference in their profit order, then the ones with two references, etc. Since newly retrieved data sets always have fewer references (which means smaller values of λ and profit), this strategy leads to thrashing in the cache: new data sets are admitted and then quickly evicted. In order to prevent this phenomenon, we should gather enough reference information for a new region before it is evicted. We have therefore developed a Profit-based Replacement algorithm with Aging Counter (PRAG). For each region or retrieved set RSi, the profit is still equal to (λi ⋅ Ci)/Si [SSV96]. However, we introduce an "aging counter" strategy to solve the problems mentioned above. For each region, we assign an aging counter. The principles of the aging counter are as follows: Each time a region is admitted, its aging counter is set to an initial value⁴. A region cannot be evicted unless its aging counter is equal to zero. When there is not enough space in the cache, we test each region in the cache and decrease its aging counter by one. When a region is referenced, its aging counter is reset to the initial value.
³ However, [DFJ+96] talks about a Manhattan Distance function. This approach particularly suits mobile navigation applications where geographic proximity is a suitable means to express semantic closeness.
⁴ See Section 3 for a discussion on how to determine this initial value.
CreateRegion(r)
    Calculate r.profit;
    r.agingCounter = PredefinedAgingTimes;

UpdateRegion(r)
    Recalculate r.profit;
    r.agingCounter = PredefinedAgingTimes;    // Reset

ReplaceRegion(r)
    freeSpace = 0;  V = ∅;
    for (each region ri in cache) {
        if (ri.agingCounter <= 0) {
            V = V ∪ {ri};
            freeSpace = freeSpace + sizeof(ri);
        } else
            ri.agingCounter = ri.agingCounter - 1;
    }
    if (freeSpace >= sizeof(r)) {
        evict the n regions in V with least profit such that Σ(j=1..n) sizeof(rj) >= sizeof(r);
        admit r in cache;
    }

Fig. 3. PRAG algorithm
The aging counter has two functions. The first is the "aging function": if a region has not been referenced recently, its aging counter decreases gradually; when it reaches zero, the region is ready to be replaced. The second is that the aging counter allows a new region to stay in the cache for a period of time during which it can gather the necessary reference information. This overcomes the thrashing problem. The "zero denominator" problem is also solved, since we do not use any value that could be zero as a denominator. The algorithm is shown in Figure 3.
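For concreteness, a runnable Python sketch of Figure 3 follows. The Region and Cache names and the admission entry point are our assumptions, and unlike the figure the sketch also admits a region directly whenever enough free space is already available.

from dataclasses import dataclass

INITIAL_AGING = 45      # PredefinedAgingTimes; Section 3 discusses how to choose it

@dataclass
class Region:
    name: str
    size: int           # space the region occupies in the cache
    profit: float       # (lambda_i * C_i) / S_i
    aging: int = INITIAL_AGING

class Cache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.regions = {}                      # name -> Region

    def _used(self):
        return sum(r.size for r in self.regions.values())

    def touch(self, name, new_profit):
        """UpdateRegion: a reference recomputes the profit and resets the aging counter."""
        region = self.regions[name]
        region.profit, region.aging = new_profit, INITIAL_AGING

    def admit(self, new):
        """CreateRegion plus ReplaceRegion of Figure 3."""
        if self.capacity - self._used() >= new.size:
            self.regions[new.name] = new
            return True
        # ReplaceRegion: fully aged regions become eviction candidates, the rest age by one.
        victims, free_space = [], 0
        for r in self.regions.values():
            if r.aging <= 0:
                victims.append(r)
                free_space += r.size
            else:
                r.aging -= 1
        if free_space < new.size:
            return False                       # not admitted yet; the cache keeps aging
        freed = 0
        for r in sorted(victims, key=lambda v: v.profit):   # least profit evicted first
            if freed >= new.size:
                break
            del self.regions[r.name]
            freed += r.size
        self.regions[new.name] = new
        return True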
3 Simulation Results

We have performed extensive experiments to compare the performance of the PRAG replacement algorithm with other commonly used algorithms. In particular, we examine the performance of three algorithms, i.e., LRU (Least Recently Used), MRU (Most Recently Used) and PRAG, on different types of queries. The performance is in inverse proportion to the response time. We use the performance of the LRU algorithm as the measurement unit, that is, we always consider its performance to be 1. The measurement of MRU and PRAG is represented by the ratio of their performance to that of LRU. We use three types of query sets with different patterns:
1) a "very hot" query set, in which 99% of the queries are within 1% of all the tuples;
2) a "hot" query set, in which 90% of the queries are within 10% of all the tuples;
3) a random query set, in which the queries are random.

For each type of query set, we test sizes of 1000, 2000, 3000, 4000, and 5000 queries, respectively. We use a simulation environment to simulate the database, the mediator and the network. The parameters for the simulation environment are:

• the size of the database is 2 MB;
• the number of relations is 10;
• the total number of tuples is about 16000;
• for each attribute, the values are distributed evenly within its domain;
• the transmission speed of the network is 40 KBPS;
• the transmission delay of the network is 0.4 s;
• the transmission speed of the disk is 40 MBPS;
• the transmission delay of the disk is 1 ms;
• the size of each cache block is 128 bytes.
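Purely as an illustration of the workload shapes described above (the 99%/1% and 90%/10% splits come from the text; the generator itself, including its name and the uniform choice within the hot area, is our assumption, not the authors' setup):

import random

def make_queries(n_queries, n_tuples=16000, hot_query_frac=0.90, hot_data_frac=0.10, seed=0):
    """Return tuple indices to query: hot_query_frac of the queries fall into the first
    hot_data_frac of the key space ("hot": 0.90/0.10; "very hot": 0.99/0.01; random: hot_query_frac=0)."""
    rng = random.Random(seed)
    hot_limit = max(1, int(n_tuples * hot_data_frac))
    queries = []
    for _ in range(n_queries):
        if rng.random() < hot_query_frac:
            queries.append(rng.randrange(hot_limit))        # hot-spot access
        else:
            queries.append(rng.randrange(n_tuples))         # uniform access
    return queries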
Fig. 4. Performance comparison for different types of the query set
In Figure 4, the X-axis denotes the size of the query set and the Y-axis denotes the performance. We use a cache with a size of 192 KB to perform this test. The LRU and MRU algorithms show almost the same performance for each type of query set; however, the PRAG algorithm shows a different behavior. In Figure 4a, the PRAG algorithm performs much better for the "very hot" query set than LRU and MRU. What is more, the larger the query set is, the better the performance of PRAG is compared with that of LRU and MRU. This is due to the property of the "very hot" query set, i.e., 99% of the queries are within 1% of all tuples. After a certain number of queries to warm up the cache, the PRAG algorithm is able to fill the cache with as many hot regions as possible. Non-hot regions, thus, are not likely to be admitted. So, the more the cache is referenced, the higher the benefit gained using PRAG. For the "hot" query set, which is shown in Figure 4b, we find that the performance of PRAG is better than that of LRU and MRU while keeping a stable ratio. This is because there are about 10% random queries whose data scope is the whole information source and which cannot be effectively cached. However, because the PRAG algorithm has a better ability to recognize hot regions, its performance stays 25% higher than that of LRU and MRU.
The result in Figure 4c indicates that the performance of the PRAG algorithm is a little worse than that of LRU and MRU for random query sets. This is because the PRAG algorithm always tries to keep the regions with larger profit values in the cache. However, in the random case, no query is more likely to be referenced again soon than any other query. In that case, PRAG may try to keep a region with a high profit value (which it recognized as a hot region) for longer than necessary. LRU and MRU degrade to a FIFO algorithm, which seems to be the best solution for random query sets.
Fig. 5. Performance comparison for different cache sizes
Our next experiment examines the performance of the three algorithms for different cache sizes. In Figure 5, the X-axis denotes the size of the cache and the Y-axis denotes the performance. The size of the query set is fixed at 3000 queries. Again, the LRU and MRU algorithms show almost the same performance for each type of query set. In Figure 5a, the PRAG algorithm performs much better for the "very hot" query set than LRU and MRU, and the ratio gets higher as the cache size increases. The advantage becomes more remarkable if the cache size is near or greater than the total size of all hot regions. In that case, the PRAG algorithm keeps almost all hot regions in the cache and is able to answer almost all queries directly, without contacting the information source any more. In fact, the performance improves for both LRU and MRU as well; however, the rate of improvement is not as large as for PRAG. For the "hot" query set (Figure 5b), the performance ratio of the PRAG algorithm increases as the cache size increases. When the cache size is large enough, the performance ratio of the PRAG algorithm stays stable. The explanation for Figure 5c is similar to that of Figure 4c. The performance ratio of PRAG is insensitive to the cache size. Another interesting question is how the initial value of the aging counter affects the performance of the PRAG algorithm. The experimental result is shown in Figure 6. First, we examine the curve corresponding to the "hot" query set. When the initial aging time is equal to zero, the algorithm degrades to the pure profit algorithm without an aging counter. In such a case, a hot region may be evicted at any time. Thus, there is no guarantee that a hot region will stay in the cache long enough to gather reference information. As the initial aging time increases, the PRAG algorithm gets smarter and its performance improves. This is because the PRAG algorithm obtains more reference information for regions, by which it is able to distinguish hot regions from non-hot regions. However, the aging counter has both a positive and a negative effect on the PRAG algorithm. On the one hand, it allows hot regions to get enough reference information; on the other hand, the non-hot regions
remain in the cache until their aging counters decrease to zero, which affects the performance. If the aging counter is larger than necessary, the negative effect will counteract the positive effect. We can see that the performance curve for the "hot" query set decreases after the initial aging time reaches a certain value. Next, we examine the curve corresponding to the "very hot" query set in Figure 6. The performance of the PRAG algorithm improves when the initial aging time increases from zero. The explanation is similar to the case above. However, the performance is not affected much as the initial aging time continues to increase. This is due to the property of the "very hot" query set, in which 99% of the queries are hot ones. Thus, most regions in the cache are hot regions. Even if the initial aging time is very large, the few non-hot regions do not have much effect. We also find that the performance of PRAG improves for the random query set when the initial value of the aging counter increases. This can be explained as follows: if the initial value of the aging counter is zero, the replacement policy is based on profit only. Regions with larger profit values are prone to stay in the cache, getting more chances to be referenced again, obtaining larger profit values and remaining in the cache. Thus, the "valid" cache size decreases and the utilization of the cache space is low. When the initial value of the aging counter is too small, the situation is similar. With an increasing initial value of the aging counter, regions with small profit values get a chance to stay in the cache and regions with large profit values may be replaced if their aging counters decrease to zero. The replacement policy approaches FIFO when the initial value of the aging counter is large enough. Now, we analyze how to find an appropriate initial aging time for the PRAG algorithm. Our solution is based on the assumption that if a region is referenced more than twice during its aging time, it can be considered a hot region. In other words, the appropriate value for the aging time should be the average number of queries between two subsequent references to a certain hot region. The value of this average number can be estimated as:
Our experimental results, depicted in Figure 6, support this formula. For a database size of 2 MB, a cache size of 192 KB, and 3000 queries, as used in our experiments, the formula suggests an initial aging counter of 47.11 for the hot query set, while the experiments show the best performance at 45; for the very hot query set the corresponding values are 14.55 and 15.
Fig. 6. Performance of PRAG on different initial aging-times
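For illustration only, the following Python sketch shows one way a profit-based replacement policy with an initial aging counter, in the spirit of the behaviour described above, could be organized. The class names, the simplistic profit function (references divided by size), and the eviction rule are our own assumptions, not the paper's exact definitions.

    # Illustrative sketch: profit-based replacement protected by an aging counter.
    # A region can only be evicted after its aging counter has counted down to zero.

    class Region:
        def __init__(self, key, size, initial_aging):
            self.key = key
            self.size = size
            self.references = 1
            self.aging = initial_aging      # region is protected while aging > 0

        def profit(self):
            return self.references / self.size

    class AgingProfitCache:
        def __init__(self, capacity, initial_aging):
            self.capacity = capacity
            self.initial_aging = initial_aging
            self.regions = {}               # key -> Region
            self.used = 0

        def reference(self, key, size):
            """Called once per query that hits or loads the region `key`."""
            # Age every cached region by one query.
            for r in self.regions.values():
                if r.aging > 0:
                    r.aging -= 1
            if key in self.regions:
                r = self.regions[key]
                r.references += 1
                r.aging = self.initial_aging        # refresh protection on a hit
                return True
            while self.used + size > self.capacity and self._evict():
                pass
            if self.used + size <= self.capacity:
                self.regions[key] = Region(key, size, self.initial_aging)
                self.used += size
            return False

        def _evict(self):
            # Evict the unprotected region with the smallest profit.
            victims = [r for r in self.regions.values() if r.aging == 0]
            if not victims:
                victims = list(self.regions.values())   # nothing is protected-free yet
            if not victims:
                return False
            v = min(victims, key=Region.profit)
            self.used -= v.size
            del self.regions[v.key]
            return True

With initial_aging = 0 this sketch degenerates to pure profit-based replacement, which corresponds to the leftmost points of the curves in Figure 6.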
4 Related Work
There are several techniques that use semantic information about the query and the cache to improve the efficiency of query evaluation or to shorten the response time. [DFJ+96] proposes a semantic model for client-side caching and replacement in a client-server database system. As described in Section 2, it provides some key semantic caching solutions; however, [DFJ+96] does not describe any implementation strategies or a general replacement algorithm. In view of this, this paper extends the semantic caching model of [DFJ+96] and proposes the PRAG algorithm for replacement. [GG97] and [GG99] extend the SQC (semantic query caching) paradigm by broadening the domain of applications of SQC to heterogeneous database environments and by presenting a logic framework for SQC. Within that framework they consider the various possibilities for answering a query. Possible (overlap) relations between cache and query are studied in [GG99]. However, in contrast to our caching strategy, which caches both the projection attributes and the condition attributes, their strategy caches only the projection attributes, which weakens the ability to use the cache for answering queries. [GG96] and [GGM96] address the issue of semantic query optimization for bottom-up query evaluation strategies. They focus on the optimization technique of join elimination and propose a framework that allows for semantic optimization over queries that employ views. We do not consider query optimization in this paper; however, such technology can always be employed to achieve efficient query evaluation, which would further decrease the query response time. [KB94] proposes a client-side data-caching scheme for relational databases. It focuses on the issue of "cache currency", which deals with the effect of updates at the central database on the
multiple client caches. The result can also be applied to our semantic caching model when it is considered in a distributed environment. [RD2000] studies semantic caching in the context of mobile computing. It extends the existing research in three ways: formal definitions associated with semantic caching are presented, query processing strategies are investigated, and the performance of the semantic cache model is examined through a detailed simulation study, which shows its effectiveness for mobile computing. Similar to [GG99], it does not consider the strategy of caching both the projection and the condition attributes. [LC98] and [LC99] provide a semantic caching scheme suitable for web database environments with web sources that have very limited querying capabilities. Possible match types and detailed algorithms for comparing the input query with stored semantic views are studied in [LC98]; a seamlessly integrated query translation and capability mapping between the wrappers and web sources in semantic caching is described in [LC99]. Whether our semantic caching scheme can be applied to web sources should be studied further, because web sources "have typically weaker querying capabilities than conventional database" [LC99]. However, the replacement algorithm PRAG is a general one and can also be applied to web data sources.
5 Summary and Future Work
Materialization of mediator results is a promising first step in adapting mediator architectures to mobile and wireless applications. In this paper, we presented materialization strategies for mediators used to query databases whose query profile shows some hot spots. For such environments, [DFJ+96] proposes semantic-region caching; we have extended this work. The main advantages of our approach are optimized methods to define semantic regions in a relational DBMS environment and a profit-based algorithm, PRAG, for region replacement. Our analysis shows that our caching strategies optimize storage space requirements and performance, respectively. Experimental results show that the PRAG algorithm outperforms other popular replacement algorithms in environments with hot or very hot query sets. Currently, we are implementing both strategies to support our analytical conclusions on performance gains with extensive benchmark results. We are also working on strategies to support mediators with join queries.
References
[CFZ94] Michael J. Carey, Michael J. Franklin, Markos Zaharioudakis: Fine-Grained Sharing in a Page Server Database System. In Proc. of the ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, pages 359-370, May 1994.
[DFJ+96] Shaul Dar, Michael J. Franklin, Björn Thór Jónsson, Divesh Srivastava, Michael Tan: Semantic Data Caching and Replacement. In Proc. of the Intl. Conf. on Very Large Databases (VLDB), Bombay, India, pages 330-341, September 1996.
[GG96] Parke Godfrey, Jarek Gryz: A Framework for Intensional Query Optimization. In Proc. of the Workshop on Deductive Databases and Logic Programming, held in conjunction with the Joint International Conference and Symposium on Logic Programming, Bonn, Germany, pages 57-68, September 1996.
[GG97] Parke Godfrey, Jarek Gryz: Semantic Query Caching for Heterogeneous Databases. In Proc. of the 9th Intl. Symposium on Methodologies for Intelligent Systems (ISMIS), Zakopane, Poland, June 1996.
[GG99] Parke Godfrey, Jarek Gryz: Answering Queries by Semantic Caches. In Proc. of the 10th Intl. Conf. on Database and Expert Systems Applications (DEXA), Florence, Italy, pages 485-498, 1999.
[GGM96] Parke Godfrey, Jarek Gryz, Jack Minker: Semantic Query Optimization for Bottom-Up Evaluation. In Proc. of the 9th Intl. Symposium on Methodologies for Intelligent Systems (ISMIS), Zakopane, Poland, pages 561-571, June 1996.
[KB94] M. Keller, Julie Basu: A Predicate-based Caching Scheme for Client-Server Database Architectures. In Proc. of the IEEE Conf. on Parallel and Distributed Information Systems, Austin, Texas, pages 229-238, September 1994.
[LC98] Dongwon Lee, Wesley W. Chu: Conjunctive Point Predicate-based Semantic Caching for Wrappers in Web Databases. In Proc. of the ACM Intl. Workshop on Web Information and Data Management (WIDM'98), Washington DC, USA, November 1998.
[LC99] Dongwon Lee, Wesley W. Chu: Semantic Caching via Query Matching for Web Sources. In Proc. of the ACM Conf. on Information and Knowledge Management (CIKM), Kansas City, MO, pages 77-85, 1999.
[PMK2000] Niki Pissinou, Kia Makki, Birgitta König-Ries: A Middleware Based Architecture to Support Mobile Users in Heterogeneous Environments. In Proc. of the Intl. Workshop on Research Issues in Data Engineering (RIDE), San Diego, CA, 2000.
[RD2000] Qun Ren, Margaret H. Dunham: Semantic Caching in Mobile Computing. Preliminary version (submitted). Available at: http://www.seas.smu.edu/~mhd/pubs/00/tkde.ps
[SSV96] Peter Scheuermann, Junho Shim, Radek Vingralek: WATCHMAN: A Data Warehouse Intelligent Cache Manager. In Proc. of the Intl. Conf. on Very Large Databases (VLDB), Bombay, India, September 1996.
[Wied92] Gio Wiederhold: Mediators in the Architecture of Future Information Systems. IEEE Computer 25(3): 38-49 (1992)
Information Flow Control among Objects in Role-Based Access Control Model Keiji Izaki, Katsuya Tanaka, and Makoto Takizawa Dept. of Computers and Systems Engineering Tokyo Denki University Email {izaki, katsu, taki}@takilab.k.dendai.ac.jp
Abstract. Various kinds of applications have to be secure in an object-based model. A secure system is required not only to protect objects from being illegally manipulated but also to prevent illegal information flow among objects. In this paper, we discuss how to resolve illegal information flow among objects in a role-based model. We define safe roles, with which no illegal information flow occurs. In addition, we discuss how to safely perform transactions with unsafe roles, and we present an algorithm that checks whether illegal information flow occurs each time a method is performed.
1 Introduction
Various kinds of object-based systems like object-oriented database systems, JAVA [10], and CORBA [13] are widely used for applications. Object-based systems are composed of multiple objects cooperating to achieve some objectives by passing messages. An object is an encapsulation of data and methods for manipulating the data. Methods are invoked on objects in a nested manner. Object-based systems are required not only to protect objects from being illegally manipulated but also to prevent illegal information flow among objects in the system. In the access control model [11], an access rule ⟨s, o, t⟩ means that a subject s is allowed to manipulate an object o in an access type t. Only access requests which satisfy the access rules are accepted. However, the confinement problem [12] arises, i.e. illegal information flow may occur among subjects and objects. In the mandatory lattice-based model [1, 3, 16], objects and subjects are classified into security classes. Legal information flow is defined in terms of the can-flow relation [3] between classes, and access rules are specified so that only legal information flow occurs. For example, if a subject s reads an object o, information in o flows to s; hence, the subject s can read the object o only if a can-flow relation from o to s is specified. In the role-based model [6, 17, 19], a role is defined to be a collection of access rights, i.e. pairs of access types and objects, denoting a job function in the enterprise. Subjects are granted roles which show their jobs. In an object-based system, methods are invoked on objects in a nested manner. The purpose-oriented model [18, 20] discusses which methods can invoke another method in the object-based system. In the paper [15], a message filter is used to block read and write requests if illegal information flow occurs. The authors [9] discuss what information flow can possibly
occur among objects if subjects issue methods by the authority of the roles, in the case where no method invocation is nested. Since methods are invoked in a nested manner in object-based systems, we have to discuss the information flow that occurs among objects. We define a safe role, with which no illegal information flow occurs by performing any transaction. In addition, we discuss an algorithm that checks, for each method issued by a transaction, whether illegal information flow occurs if the method is performed. By using the algorithm, some methods issued by a transaction can be performed even if the transaction is in a session with an unsafe role. Data flowing from an object o1 to o2 can be considered to belong to o2 some time after the data flows, and we discuss how to manage such timed information flow. In section 2, we classify methods from the information flow point of view. In section 3, we discuss the information flow that occurs in a nested invocation. In section 4, we discuss how to resolve illegal information flow.
2 Object-Based Systems
An object-based system is composed of objects, which are encapsulations of data and methods. A transaction invokes a method by sending a request message to an object. The method is performed on the object and the response is sent back to the transaction. During the computation of the method, other methods might be invoked; thus, methods are invoked in a nested manner. Each subject plays a role in an organization. In the role-based model [6, 17, 19], a role is modeled as a set of access rights. An access right ⟨o, t⟩ means that t can be performed on the object o. A subject s is granted a role which shows its job function in an enterprise. This means that the subject s can perform a method t on an object o if ⟨o, t⟩ ∈ r. If a subject s is in a session with r, s can issue the methods in r. Each subject can be in a session with at most one role. Each method t on an object o is characterized by the following parameters:
1. Input type = I if the method t has input data in its parameters, else N.
2. Manipulation type = M if the object o is changed by t, else N.
3. Derivation type = D if data is derived from o by t, else N.
4. Output type = O if data is returned to the invoker of t, else N.
Each method t of an object o is characterized by a method type mtype(t) = α1α2α3α4, where input α1 ∈ {I, N}, manipulation α2 ∈ {M, N}, derivation α3 ∈ {D, N}, and output α4 ∈ {O, N}. For example, a method class "IMNN" shows a method which carries data in its parameters to an object and changes the state of the object. N is omitted in the method type; for example, "IM" shows IMNN, and "N" shows the type NNNN. Let MC be the set {IMDO, IDO, IMO, IO, IMD, ID, IM, I, MDO, DO, MO, O, MD, D, M, N} of the sixteen possible method types. A counter object c supports the methods display (dsp), increment (inc), and decrement (dec), with mtype(dsp) = DO and mtype(inc) = mtype(dec) = IMD. The notation "β1, ..., βk ∈ mtype(t)" (k ≤ 4) means that mtype(t) = α1α2α3α4 and βi ∈ {α1, α2, α3, α4} (i ≤ k). For example, I ∈ mtype(inc) and ID ∈ mtype(dec). In object-based systems, objects are
created and dropped, with IM ∈ mtype(create) and N ∈ mtype(drop). The method type mtype(t) is specified for each method t by the owner of the object. We assume that each subject does not have any persistent storage, that is, the subject does not keep a record of data obtained from objects. The subject issues one or more methods to objects. A sequence of methods issued by a subject is referred to as a transaction, which is a unit of work. Each transaction T can be in a session with only one role r. A transaction has a temporary memory; data which the transaction derives from objects may be stored in this memory, and on completion of the transaction the memory is released. A transaction does not share data with other transactions. In this paper, objects are persistent objects. Suppose T with a role r invokes a method t1 on an object o1, where ⟨o1, t1⟩ ∈ r, and t1 invokes another method t2 on an object o2. Here, we assume ⟨o2, t2⟩ ∈ r. That is, ⟨o, t⟩ ∈ r for every method t invoked on an object o in T.
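As a small illustration of this classification (our own encoding, not part of the paper), the four parameters can be packed into a string-valued mtype; the counter object of the example then looks as follows.

    # Hypothetical encoding of method types: each method is tagged with the subset
    # of {I, M, D, O} that applies; N is represented by omitting all four letters.

    def mtype(has_input, manipulates, derives, outputs):
        t = ""
        t += "I" if has_input else ""
        t += "M" if manipulates else ""
        t += "D" if derives else ""
        t += "O" if outputs else ""
        return t or "N"

    # The counter object c of the example: display, increment, decrement.
    counter_methods = {
        "dsp": mtype(False, False, True, True),    # DO
        "inc": mtype(True, True, True, False),     # IMD
        "dec": mtype(True, True, True, False),     # IMD
    }

    def has(flags, t):
        """beta1,...,betak in mtype(t), e.g. has('ID', t) checks I and D are present."""
        return all(ch in t for ch in flags) if t != "N" else flags == "N"

    assert counter_methods["dsp"] == "DO"
    assert has("I", counter_methods["inc"]) and has("ID", counter_methods["dec"])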
3 Nested Invocation
3.1 Invocation Tree
Suppose a transaction T invokes a method t1 on an object o1 and a method t2 on an object o2, and then t1 invokes a method t3 on an object o3. The invocations of methods are represented in a tree form named the invocation tree, as shown in Figure 1. Each node ⟨o, t⟩ shows a method t invoked on an object o in the transaction T. A dotted directed edge from a parent to a child shows that the parent invokes the child. The notation "⟨o1, t1⟩ ▷T ⟨o2, t2⟩" means that a method t1 on an object o1 invokes t2 on o2 in the transaction T. The node ⟨⊥, T⟩ shows the root of the invocation tree of T; here, mtype(T) is N according to the assumption. If a method serially invokes multiple methods, the left-to-right order of the nodes shows the invocation sequence, i.e. the tree is ordered. Suppose ⟨o1, t1⟩ ▷T ⟨o2, t2⟩ and ⟨o1, t1⟩ ▷T ⟨o3, t3⟩ in the invocation tree of a transaction T. If t1 invokes t2 before t3, ⟨o2, t2⟩ precedes ⟨o3, t3⟩ (⟨o2, t2⟩ ≺T ⟨o3, t3⟩). In addition, ⟨o4, t4⟩ ≺T ⟨o3, t3⟩ if ⟨o2, t2⟩ ▷T ⟨o4, t4⟩, and ⟨o2, t2⟩ ≺T ⟨o4, t4⟩ if ⟨o3, t3⟩ ▷T ⟨o4, t4⟩. The relation "≺T" is transitive. In Figure 1, T invokes t1 before t2; here, ⟨o1, t1⟩ ≺T ⟨o2, t2⟩ and ⟨o3, t3⟩ ≺T ⟨o2, t2⟩.
Fig. 1. Invocation tree.
3.2 Information Flow
Suppose mtype(t3) = DO, mtype(t2) = IM, and mtype(t1) = O in Figure 1. In a transaction T, data is derived from an object o3 through the method t3. The data is forwarded to t1 as the response of t3, brought to t2 as an input parameter, and stored into o2 through t2. Thus, the information in o3 is brought to o2; the straight arc in Figure 2 indicates this information flow. This example shows that information flow among objects may occur in a nested invocation.
[Definition] Suppose a pair of methods t1 and t2 on objects o1 and o2, respectively, are invoked in a transaction T.
1. Information passes down from ⟨o1, t1⟩ to ⟨o2, t2⟩ in T (⟨o1, t1⟩ ⇓T ⟨o2, t2⟩) iff t1 invokes t2 (⟨o1, t1⟩ ▷T ⟨o2, t2⟩) and I ∈ mtype(t2), or ⟨o1, t1⟩ ⇓T ⟨o3, t3⟩ ⇓T ⟨o2, t2⟩ for some ⟨o3, t3⟩ in T.
2. Information passes up from ⟨o1, t1⟩ to ⟨o2, t2⟩ in T (⟨o1, t1⟩ ⇑T ⟨o2, t2⟩) iff ⟨o2, t2⟩ ▷T ⟨o1, t1⟩ and O ∈ mtype(t2), or ⟨o1, t1⟩ ⇑T ⟨o3, t3⟩ ⇑T ⟨o2, t2⟩ for some ⟨o3, t3⟩ in T. □
[Definition] Information passes from ⟨o1, t1⟩ to ⟨o2, t2⟩ in an ordered transaction T (⟨o1, t1⟩ →T ⟨o2, t2⟩) iff ⟨o1, t1⟩ ⇓T ⟨o2, t2⟩, ⟨o1, t1⟩ ⇑T ⟨o2, t2⟩, ⟨o1, t1⟩ ⇑T ⟨o3, t3⟩ ⇓T ⟨o2, t2⟩ with ⟨o1, t1⟩ ≺T ⟨o2, t2⟩, or ⟨o1, t1⟩ →T ⟨o3, t3⟩ →T ⟨o2, t2⟩ for some ⟨o3, t3⟩ in T. □
[Definition] Information passes from ⟨o1, t1⟩ to ⟨o2, t2⟩ in an unordered transaction T (also written ⟨o1, t1⟩ →T ⟨o2, t2⟩) iff ⟨o1, t1⟩ ⇓T ⟨o2, t2⟩, ⟨o1, t1⟩ ⇑T ⟨o2, t2⟩, or ⟨o1, t1⟩ →T ⟨o3, t3⟩ →T ⟨o2, t2⟩ for some ⟨o3, t3⟩ in T. □
Suppose t1 is invoked before t2, i.e. ⟨o1, t1⟩ ≺T ⟨o2, t2⟩ in Figure 2. Then ⟨o3, t3⟩ →T ⟨o2, t2⟩ holds, whereas ⟨o1, t1⟩ →T ⟨o2, t2⟩ would hold in the ordered sense only if ⟨o2, t2⟩ ≺T ⟨o1, t1⟩; in the unordered sense, however, ⟨o1, t1⟩ →T ⟨o2, t2⟩ does hold. The relation →T denotes the passing relation in either the ordered or the unordered sense. The notation o1 →T o2 means ⟨o1, t1⟩ →T ⟨o2, t2⟩ for some methods t1 and t2; T →T o and o →T T stand for ⟨⊥, T⟩ →T ⟨o, t⟩ and ⟨o, t⟩ →T ⟨⊥, T⟩, respectively. According to the definitions, information passing in the ordered sense implies passing in the unordered sense.
[Definition] ⟨o1, t1⟩ flows into ⟨o2, t2⟩ in a transaction T (⟨o1, t1⟩ ⇒T ⟨o2, t2⟩) iff ⟨o1, t1⟩ →T ⟨o2, t2⟩, D ∈ mtype(t1), and M ∈ mtype(t2). □
In Figure 2, ⟨o3, t3⟩ ⇒T ⟨o2, t2⟩, where ⟨o3, t3⟩ is a source and ⟨o2, t2⟩ is a sink; data in o3 flows into o2. "⟨o1, t1⟩ ⇒T ⟨o2, t2⟩" can be abbreviated as o1 ⇒T o2. T ⇒T o if T →T o and o is a sink; o ⇒T T if o →T T and o is a source. o1 ⇒r o2 for a role r iff o1 ⇒T o2 for some transaction T with r.
[Definition] Information in oi flows into oj (oi ⇒ oj) iff oi ⇒r oj for some role r, or oi ⇒ ok ⇒ oj for some object ok. □
Fig. 2. Information flow.
(1) Safe  (2) Unsafe
Fig. 3. Safeness.
oi ⇒ oj is primitive for a role r if oi ⇒r oj. oi ⇒ oj is transitive for a role r iff oi ⇒ oj is not primitive for r, i.e. oi ⇒ ok ⇒ oj for some ok but not oi ⇒r oj. If oi ⇒ oj is transitive for r, a transaction T with r may get data of oi through oj even if T is not allowed to get data from oi.
[Definition] "oi ⇒ oj" is illegal iff oi ⇒ oj is transitive for some role r. □
[Definition] A role r threatens another role r1 iff for some objects oi, oj, and o, oi ⇒r1 oj, oj ⇒r o, and oi ⇒ o is transitive for r. □
Suppose information in oi might flow into an object oj (oi ⇒r1 oj) by performing a transaction T1 with a role r1. Even if a transaction T2 is not granted a role to derive data from oi, T2 can get data of oi from oj if T2 is granted a role r to derive data from oj. Thus, if there is another role r threatening a role r1, illegal information flow might occur when some transaction with r is performed.
[Definition] "oi ⇒r oj" is safe for a role r iff r is not threatened by any role. □
Figure 3 shows a system including a pair of roles r and r′ where oi ⇒r oj. In Figure 3 (1), oi ⇒r′ o and oj ⇒r′ o for another role r′. Since r′ does not threaten r, oi ⇒r oj is safe. In Figure 3 (2), oj ⇒r′ o but not oi ⇒r′ o, so a transaction with r′ is not allowed to derive data from oi. Hence, r′ threatens r, oi ⇒r oj is not safe, and oi ⇒ o is illegal. This is a confinement problem on roles. Note that o may also denote a transaction; for example, if a transaction T manipulates oj through a DO method t, then oi ⇒r T.
[Definition] A role r is safe iff r neither threatens any role nor is threatened by any role. □
A transaction is safe iff it is in a session with a safe role; an unsafe transaction is in a session with an unsafe role.
[Theorem] If every transaction is safe, no illegal information flow occurs. □
That is, no illegal information flow occurs if every role is safe. The paper [9] discusses an algorithm to check whether or not illegal information flow possibly occurs if a method is performed.
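To make the flow relation concrete, the following Python sketch derives the flow pairs of a transaction from its invocation tree and the method types, using the unordered reading of the definitions above; the tree encoding, the function names, and the use of a transitive closure are our own choices, not the paper's.

    from itertools import product

    def flows(edges, types):
        """edges: list of (parent, child) invocation pairs over (object, method)
        nodes, with the transaction root written ("_", "T");
        types: node -> mtype string (e.g. "IMD").
        Returns pairs (src, dst) such that information flows from src to dst
        in the unordered sense: a D-node passes (down/up) to an M-node."""
        nodes = {n for e in edges for n in e}
        passes = set()
        for parent, child in edges:
            if "I" in types.get(child, "N"):      # input parameters: parent -> child
                passes.add((parent, child))
            if "O" in types.get(child, "N"):      # returned response: child -> parent
                passes.add((child, parent))
        # transitive closure (Floyd-Warshall style over the node set)
        for k, i, j in product(nodes, nodes, nodes):
            if (i, k) in passes and (k, j) in passes:
                passes.add((i, j))
        return {(a, b) for (a, b) in passes
                if "D" in types.get(a, "N") and "M" in types.get(b, "N")}

    # Figure 1/2 example: T invokes t1 on o1 and t2 on o2; t1 invokes t3 on o3.
    edges = [(("_", "T"), ("o1", "t1")), (("_", "T"), ("o2", "t2")),
             (("o1", "t1"), ("o3", "t3"))]
    types = {("_", "T"): "N", ("o1", "t1"): "O",
             ("o2", "t2"): "IM", ("o3", "t3"): "DO"}
    print(flows(edges, types))   # contains (("o3", "t3"), ("o2", "t2"))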
3.3 Invocation Models
Suppose a transaction T is in a session with a role r. It is not easy to make clear what transactions exist for each role and how each transaction invokes methods. Hence, we first discuss a basic (B) model where there is one transaction Tr which is in a session with a role r and invokes all the methods in r, i.e. ⟨⊥, Tr⟩ ▷Tr ⟨o, t⟩ for every ⟨o, t⟩ in the role r. The invocation tree of Tr is an unordered, two-level tree. Here, Tr →Tr o if ⟨o, t⟩ ∈ r and I ∈ mtype(t), and o →Tr Tr if ⟨o, t⟩ ∈ r and O ∈ mtype(t); →Tr is transitive. o ⇒Tr Tr iff o →Tr Tr and D ∈ mtype(t); Tr ⇒Tr o iff Tr →Tr o and M ∈ mtype(t); and ⟨o1, t1⟩ ⇒r ⟨o2, t2⟩ iff ⟨o1, t1⟩ ⇒Tr ⟨⊥, Tr⟩ and ⟨⊥, Tr⟩ ⇒Tr ⟨o2, t2⟩. In the B model, the relation ⇒r is obtained from this single transaction Tr.
Next, suppose a collection of transactions is defined a priori. Tr(r) is the set of transactions which are in sessions with r. Let N(T) be the set {⟨o, t⟩ | t is invoked on o in a transaction T} and Al(r) be {⟨o, t⟩ | ⟨o, t⟩ ∈ N(T) for some transaction T in Tr(r)} (⊆ r). Suppose two transactions T1 and T2 are in sessions with a role r. T1 invokes a method t1 on an object o1. T2 invokes a method t2 on an object o2, and then t2 invokes a method t3 on an object o3 and t4 on o4. Here, Tr(r) = {T1, T2}, N(T1) = {⟨o1, t1⟩}, N(T2) = {⟨o2, t2⟩, ⟨o3, t3⟩, ⟨o4, t4⟩}, and Al(r) = N(T1) ∪ N(T2). There are two cases: the invocation sequence of methods is a priori fixed or not, i.e. the invocation tree of each transaction is ordered (O) or unordered (U). In the basic (B) model, Tr invokes t1 and t2. Since o1 ⇒r Tr ⇒r o2 ⇒r o3, we obtain o1 ⇒r o3, i.e. information in o1 possibly flows to o3. In the unordered (U) and ordered (O) models, there is no information flow between o1 and o3, because o1 and o3 are manipulated by T1 and T2, respectively. If the transactions are not ordered, o4 ⇒r o3, as shown in Figure 4. On the other hand, if the transactions are ordered, o4 is manipulated before o3, and hence o4 does not flow into o3. In general, every flow of the O model is also a flow of the U model, and every flow of the U model is also a flow of the B model.
Fig. 4. Invocation trees.
4 Resolution of Illegal Information Flow
4.1 Flow Graph
Every safe transaction is allowed to be performed, because no illegal information flow occurs. As discussed for Figure 4, o1 ⇒r o3 holds in the B model but not in the U and O models, and o4 ⇒r o3 holds in the U model but not in the O model. This means that whether or not illegal information flow occurs depends on the invocation sequence of methods. The paper [9] discusses how to decide if a role is safe and gives an algorithm that checks, for each method issued by an unsafe transaction, whether illegal information flow possibly occurs if the method is performed. However, it is not easy, and possibly impossible, to decide whether each role is safe if roles include a large number of objects and roles are dynamically created and dropped. In this paper, we discuss an algorithm to check whether illegal information flow necessarily occurs if a method issued by a transaction is performed. The system maintains the following directed flow graph G.
[Flow graph]
1. Each node in G shows an object in the system; each transaction is also treated as an object. If an object is created, a node for the object is added to G. Initially, G includes no edges.
2. A directed edge o1 →τ o2 is created if o1 ⇒T o2 by performing a transaction T of a role r at time τ. If an edge o1 →τ1 o2 already exists in G, it is changed to o1 →τ o2 if τ1 < τ.
3. For each object o3 such that o3 →τ1 o1 →τ o2 in G:
3.1 an edge o3 →τ o2 is created if there is no edge from o3 to o2 in G;
3.2 if an edge o3 →τ3 o2 is already in G and τ3 < τ, its time is changed to τ. □
Figure 5 shows a flow graph G including four objects o1, o2, o3, and o4. First, suppose o1 →4 o2 and o2 →3 o4 hold in G. Then the information flow o2 ⇒r1 o3 occurs by performing a transaction at time 6, and a directed edge o2 →6 o3 is created in G. Since o1 →4 o2 →6 o3, information flowing to o2 from o1 at time 4 might flow to o3 by the transaction; hence o1 →6 o3 is added, since 4 < 6 [Figure 5 (2)]. Then o3 ⇒r2 o4 occurs at time 8, so o3 →8 o4 is created. Since o1 →4 o2 →6 o3 →8 o4, an edge o1 →8 o4 is also created and another edge o2 →8 o4 is attempted; however, "o2 →3 o4" is already in G and 3 < 8, so the time 3 of the edge "o2 →3 o4" is replaced with 8 [Figure 5 (3)]. In Figure 5 (3), information in the objects o1, o2, and o3 has flown into o4. Let In(o) be the set {o1 | o1 →τ o in G} of objects whose information has flown into an object o; for example, In(o4) = {o1, o2, o3} in Figure 5. Suppose a method t is issued to an object o in a transaction T with a role r. The methods invoked in T are logged as an ordered invocation tree in a log LT. From the invocation tree in LT, every information flow relation "oi ⇒T oj" is obtained. If the following condition is satisfied, t can be invoked on o.
[Condition for a method t] (Figure 6) DO ∈ mtype(t2) and ⟨o2, t2⟩ ∈ r
1. for every "o1 →τ o" in LT if IM ∈ mtype(t),
2. for every "o2 →τ o" in G if DO ∈ mtype(t). □
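The following Python sketch illustrates how a timestamped flow graph of this kind might be maintained and queried; the class name, the API, and the encoding of a role as a set of ("DO", object) rights are our own assumptions, and the sketch ignores the finer time-ordering refinements of rule 3.

    class FlowGraph:
        """Illustrative timestamped flow graph G.
        edges[(a, b)] = latest time at which information from a may have reached b."""
        def __init__(self):
            self.edges = {}

        def record_flow(self, src, dst, tau):
            # Rule 2: create or refresh the direct edge.
            if self.edges.get((src, dst), -1) < tau:
                self.edges[(src, dst)] = tau
            # Rule 3: everything that had flown into src may now reach dst as well.
            for (a, b), t in list(self.edges.items()):
                if b == src and self.edges.get((a, dst), -1) < tau:
                    self.edges[(a, dst)] = tau

        def sources_of(self, obj):
            """In(obj): objects whose information has flown into obj."""
            return {a for (a, b) in self.edges if b == obj}

    def can_derive(graph, obj, role):
        """Sketch of condition 2: a DO method on obj is allowed only if the role
        may derive data from every object whose information has flown into obj."""
        return all(("DO", src) in role for src in graph.sources_of(obj))

    # Worked example of Figure 5 (flows recorded in time order).
    g = FlowGraph()
    g.record_flow("o2", "o4", 3)
    g.record_flow("o1", "o2", 4)
    g.record_flow("o2", "o3", 6)    # also adds o1 -> o3 at time 6
    g.record_flow("o3", "o4", 8)    # refreshes o2 -> o4 and adds o1 -> o4 at time 8
    print(sorted(g.sources_of("o4")))   # ['o1', 'o2', 'o3']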
Fig. 5. Flow graph G.
Fig. 6. Condition.
In condition 1, data in some object o2 might have been brought into o1 (o2 ⇒T o1) before the transaction T manipulates the object o. In condition 2, T issues t to derive data from o.
4.2 Timed Information Flow
Suppose some data in an object oi illegally flows to another object oj by performing a transaction T with a role r at time τ (oi →τ oj in G). The security level of data changes over time: after some time δ has passed, the data brought from oi is considered to belong to oj. An edge "oi →τ oj" is aged if τ + δ < σ, where σ is the current time. Every aged edge is removed from the graph G. In Figure 5, suppose δ = 10. When σ reaches 14, the edge timed 4 is aged and removed; Figure 7 shows the flow graph G obtained in this way. Suppose some transaction T with a role r1 issues a request t3, with DO ∈ mtype(t3), on the object o3 of Figure 5(3), but data in o1 is not allowed to be derived. In Figure 5(3), T is rejected according to the conditions. However, the DO method t3 can be performed in Figure 7, because there is no longer illegal information flow from o1 to T.
Fig. 7. Flow graph.
Fig. 8. Flow graph.
Suppose an object o3 is dropped in the flow graph G of Figure 5(3). Since an edge from o3 to o4 exists in G, some data of o3 might have been copied into o4. Hence, only a transaction which is granted the right to manipulate o3 is allowed to manipulate o4, even after o3 is dropped.
[Drop of an object] When an object o is dropped:
1. The node o is marked.
2. Every incoming edge in In(o) is removed from G.
3. Every outgoing edge in Out(o) is marked. □
Figure 8 shows the flow graph G obtained by dropping the object o3, through the algorithm above, from Figure 5(3). The node o3 is marked with ∗, and the dotted edge from o3 to o4 shows a marked edge. All incoming edges of o3, i.e. "o1 →6 o3" and "o2 →6 o3", are removed from G. Now suppose some transaction T issues a DO method t4 on o4. t4 is rejected if T is not allowed to derive data from o3, even though o3 has already been dropped, because there is still data of o3 in o4. Each marked edge is removed after δ time units. If a marked node o no longer has any outgoing edge, i.e. Out(o) = ∅, o is removed from G.
[Remove of aged edges]
1. For any edge "oi →τ oj" in G, the edge is removed if τ + δ ≤ σ.
2. Every marked node oi is removed if Out(oi) = ∅. □
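A minimal, self-contained sketch of the drop and ageing rules, again with our own encoding (edges as a dict mapping (src, dst) to the time of the latest flow):

    def drop_object(edges, marked, obj):
        """[Drop of an object] Mark obj, remove its incoming edges, keep the
        outgoing ones, whose data may already live in other objects."""
        marked.add(obj)
        return {(a, b): t for (a, b), t in edges.items() if b != obj}, marked

    def remove_aged(edges, marked, delta, now):
        """[Remove of aged edges] Forget edges older than delta, then discard
        marked nodes that no longer have outgoing edges."""
        edges = {e: t for e, t in edges.items() if t + delta > now}
        marked = {o for o in marked if any(a == o for (a, _) in edges)}
        return edges, marked

    # Figure 5(3) state, then drop o3 as in Figure 8 and age with delta = 10.
    edges = {("o1", "o2"): 4, ("o2", "o4"): 8, ("o2", "o3"): 6,
             ("o1", "o3"): 6, ("o3", "o4"): 8, ("o1", "o4"): 8}
    edges, marked = drop_object(edges, set(), "o3")
    print(("o3", "o4") in edges)                 # True: o4 still carries data of o3
    edges, marked = remove_aged(edges, marked, delta=10, now=19)
    print(marked)                                # set(): o3 disappears once its last outgoing edge has aged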
5 Concluding Remarks
This paper discussed an access control model for object-based systems with role concepts. We discussed how to control information flow in a system where methods are invoked in a nested manner. We first defined a safe role, with which no illegal information flow can possibly occur, in three types of invocation models: the basic (B), unordered (U), and ordered (O) models. We then presented an algorithm to check whether each method can be performed, i.e. whether no illegal information flow occurs after the method is performed. By using the algorithm, some methods issued by an unsafe transaction can be performed, depending on the order in which the transaction performs its methods. We also discussed the case where the security level is time-variant: information flowing into another object can be considered to belong to that object after some time.
References 1. Bell, D. E. and LaPadula, L. J., “Secure Computer Systems: Mathematical Foundations and Model,” Mitre Corp. Report, No. M74–244, Bedford, Mass., 1975. 2. Castano, S., Fugini, M., Matella, G., and Samarati, P., “Database Security,” Addison-Wesley, 1995. 3. Denning, D. E., “A Lattice Model of Secure Information Flow,” Communications of the ACM, Vol. 19, No. 5, 1976, pp. 236–243. 4. Fausto, R., Elisa, B., Won, K., and Darrell, W., “A Model of Authorization for Next-Generation Database Systems,” ACM Trans on Database Systems, Vol. 16, No. 1, 1991, pp. 88–131. 5. Ferrai, E., Samarati, P., Bertino, E., and Jajodia, S., “Providing Flexibility in Information Flow Control for Object-Oriented Systems,” Proc. of 1997 IEEE Symp. on Security and Privacy, 1997, pp. 130–140. 6. Ferraiolo, D. and Kuhn, R., “Role-Based Access Controls,” Proc. of 15th NISTNCSC Nat’l Computer Security Conf., 1992, pp. 554–563. 7. Harrison, M. A., Ruzzo, W. L., and Ullman, J. D., “Protection in Operating Systems,” Comm. of the ACM, Vol. 19, No. 8, 1976, pp. 461–471. 8. Izaki, K., Tanaka, K., and Takizawa, M., “Authorization Model in Object-Oriented Systems,” Proc. of IFIP Database Security, 2000.
9. Izaki, K., Tanaka, K., and Takizawa, M., “Information Flow Control in Role-Based Model for Distributed Objects,” Proc. of IEEE Int’l Conf. on Parallel and Distributed Systems, 2001. 10. Gosling, J. and McGilton, H., “The Java Language Environment,” Sun Microsystems, Inc, 1996. 11. Lampson, B. W., “Protection,” Proc. of 5th Princeton Symp. on Information Sciences and Systems, 1971, pp. 437–443. (also in ACM Operating Systems Review, Vol. 8, No. 1, 1974, pp. 18–24.) 12. Lampson, B. W., “A Note on the Confinement Problem,” Comm. of the ACM, Vol. 16, No. 10, 1973, pp. 613–615. 13. Object Management Group Inc., “ The Common Object Request Broker : Architecture and Specification,” Rev. 2.1, 1997. 14. Oracle Corporation,“Oracle8i Concepts”, Vol. 1, Release 8.1.5, 1999. 15. Samarati, P., Bertino, E., Ciampichetti, A., and Jajodia, S., “Information Flow Control in Object-Oriented Systems,” IEEE Trans. on Knowledge and Data Engineering Vol. 9, No. 4, 1997, pp. 524–538. 16. Sandhu, R. S., “Lattice-Based Access Control Models,” IEEE Computer, Vol. 26, No. 11, 1993, pp. 9–19. 17. Sandhu, R. S., Coyne, E. J., Feinstein, H. L., and Youman, C. E., “Role-Based Access Control Models,” IEEE Computer, Vol. 29, No. 2, 1996, pp. 38–47. 18. Tachikawa, T., Yasuda, M., and Takizawa, M., “A Purpose-oriented Access Control Model in Object-based Systems,” Trans. of IPSJ, Vol. 38, No. 11, 1997, pp. 2362– 2369. 19. Tari, Z. and Chan, S. W., “A Role-Based Access Control for Intranet Security,” IEEE Internet Computing, Vol. 1, No. 5, 1997, pp. 24–34. 20. Yasuda, M., Higaki, H., and Takizawa, M., “A Purpose-Oriented Access Control Model for Information Flow Management,” Proc. of 14th IFIP Int’l Information Security Conf. (SEC’98), 1998, pp. 230–239.
Object Space Partitioning in a DL-Like Database and Knowledge Base Management System Mathieu Roger, Ana Simonet, and Michel Simonet TIMC-IMAG Faculté de Médecine de Grenoble 38706 La Tronche Cedex – France [email protected], {Ana,Michel}[email protected]
Abstract. The p-type data model was designed first to answer database needs. Some of its features were and still are quite unusual for a DBMS and, by some aspects, make it nearer Description Logic (DL) systems than classical DBMS. Views play a central role in the model. They are defined in a hierarchical manner, with constraints on role (attribute) types as in DLs, and instance classification (view recognition) is a basic mechanism in p-type implementation by the Osiris system. In this paper we recall the main characteristics of p-types and their semantics as a DL system. We insist on the modelling of unknown values, whose treatment leads to a three-value instance classification system. We develop database specific aspects and particularly the partitioning of the object space and its use for the management of data.
The similarity between views and concepts in Description Logics (DLs) has been pointed out early. Both “define” subsets of objects of a more general concept (a table is a relational representation of a concept). Work has been done on using views – considered as defined concepts1 – to optimise query evaluation [Beneventano et al., 93]. Through this kind of work, database research can benefit from the theoretical work that is being done in the field of DLs. Most of this work concerns the complexity of concept classification depending on the kind of properties the system allows to define the concepts. However, there are important differences between DL systems and DBMS. DLs deal with conceptual aspects, not the physical ones related to object management. DLs do not deal with the management of large quantities of objects, which is the purpose of databases. They do not even consider object evolution. They deal only with a given state of the world, concerning a small amount of static, non-evolving objects. On the other hand, databases do not consider – even today – instance classification, i.e., determination of the current views of an object. The Osiris system, which implements the p-type data model, can be considered at the crossing of both currents: databases and DLs. From databases, it retains the importance of sharing large amounts of data between different categories of users; hence the central place of views and their role that is not limited to that of “named queries”. With DLs it shares the importance given to concept management. The specific aspects of Osiris with respect to both paradigms concern mainly a category of constraints used to define the views, namely Domain Constraints, and particularly their use for object management and query optimisation. In this paper we make a presentation of the p-type data model in the DL style, which enables us to situate its language in the DL classification. This has given us confidence in the possibilities of implementing certain features in a feasible manner. The DL paradigm has also proved useful to express many specific aspects of the model, such as the partitioning of the object space, and even generalise it. Therefore this experience has been fruitful. In this introduction we have outlined some essential characteristics of the p-type data model. In the following, we present a DL model of p-types. We first recall the main results of a previous study [DEXA00]. We show how the explicit treatment of unknown values leads to a three-value instance classification model where views can be Valid, Invalid or Potential for a given object. We then present the principle of a partitioning of the object space that is based on constraints on predefined domains and extended to constraints on views. We show how this partitioning can be used for the management of data in a database perspective: persistent views and object indexing. We present the current solution in the Osiris system and the problems posed by the generalization of constraints from predefined to user-defined domains.
1
“Defined” concepts are defined by necessary and sufficient conditions. They are opposed to “primitive” concepts, which are defined only by necessary conditions. Therefore, it is mandatory to explicitly assign an object to a primitive concept and then let the system classify it into the defined concepts whose properties it satisfies. For example, an object p1 being assigned to the primitive concept PERSON will be classified into the defined concepts ADULT, MINOR, MALE, etc., according to its properties.
2 Previous Results: p-types Syntax and Semantics In this section we recall the main results about the formal definition of the syntax and semantics of the concept language of p-types in a DL-like manner. This work was presented in [DEXA 00]. The modeling has been improved since, but its principle and its results remain mostly the same. We do not consider the external form of p-types, i.e., the syntax used in the Osiris system. We consider a concept language whose semantics is that of p-types, which presents some characteristics that make it a database-oriented model. Description Logics deal with concept definition and classification. In defining P-types a designer is not interested only in the ontological aspects of the system, i.e., the concepts that are taken into consideration, but in their use in a database context. This leads to a decision to transform some concepts into classes of objects and others into views. A class must be understood in the programming sense: an object belongs to a unique class and does not change its class during its lifetime. Views are subsets of objects of a given class. Contrary to the common database perspective, P-type views are not shortcuts for queries. They are defined by logical properties on the attributes of the class and therefore behave as defined concepts in a DL. Similarly, classes behave as primitive concepts. This means that an object must be explicitly assigned to a class and then is automatically classified into its views. This is what happens in Osiris. Any object is created in a given class and automatically classified into the views of this class. Table 1. Syntax and semantics for type concept language constructs, R ranges over roles, A over types and C,D over type concepts.
Construct Name | Syntax | Semantics
Attribute Typing | ∀R.A | { x ∈ Δ^I | ∀y ∈ Δ^I : (x, y) ∈ Relation^I ⇒ y ∈ A^I }
Mono-valued Attribute | (= 1 R) | { x ∈ Δ^I | #{ y ∈ Δ^I : (x, y) ∈ Relation^I } = 1 }
Intersection | C ⊓ D | C^I ∩ D^I
In this section we give a definition of the concept language syntax for OSIRIS. On an operational point of view, this language is used to specify databases schemes. Let us consider: a set of types T (ranged over by A,B,T) containing predefined types INT, REAL, CHAR, STRING, namely Predefined, a set of views V (ranged over by U,V) containing predefined views {]v1,v2[ , [v1,v2[ ,]v1,v2] , [v1,v2], {a1,…,an} | v1 and v2 are both elements of one of the types INT, CHAR or REAL and ai are all elements of one of the types INT,REAL, CHAR or STRING} namely PredefinedViews a set of roles R (ranged over by R).
Table 2. Syntax and Semantics for types concept language constructs, R ranges over roles, U,V over view concepts.
Construct Name | Syntax | Semantics
Intersection | U ⊓ V | U^I ∩ V^I
Negation | ¬U | { x ∈ Δ^I | x ∉ U^I }
Elementary Constraints | Undefined(R) | { x ∈ Δ^I | x ∈ Undefined^I }
 | ∀R.V | { x ∈ Δ^I | ∀y ∈ Δ^I : (x, y) ∈ Rel^I ⇒ y ∈ V^I }
 | ∀R.¬V | { x ∈ Δ^I | ∀y ∈ Δ^I : (x, y) ∈ Rel^I ⇒ y ∉ V^I }
 | ∃R.V | { x ∈ Δ^I | ∃y ∈ Δ^I : (x, y) ∈ Rel^I and y ∈ V^I }
 | ∃R.¬V | { x ∈ Δ^I | ∃y ∈ Δ^I : (x, y) ∈ Rel^I and y ∉ V^I }
I
We define two separate languages that are mostly sub languages of the ALC language family [Domini et al., 96]. One of these languages is called type concept language (namely TL), and is used to declare the types of the application considered as primitive concepts. The other one is called view concept language (namely VL), and is used to declare views (i.e., subsets of types) considered as defined concepts. This distinction is a central point of the Osiris system, and it has also been used in [Buchheit et al., 98]. The purpose of such a distinction is to emphasize the fact that types are inherently primitive concepts and views are defined concepts. A type scheme is a set S of axioms of the type: 1. 2. 3. 4.
A ² C, where A ³ T-Predefined and C ³ TL A ² ½B, where A ³ T and B³ T R ² AB, where R³ R, A ³ T-Predefined and B ³ T V = A ¬ U, where V ³ V, A ³ T and U ³ VL
Such that: Uniqueness of a type definition: for all A ³ T-Predefined, there is one axiom A ² C in S. Such types are called p-types. A role belongs to a single type and all roles are defined: for all R there is exactly one axiom R ² AB and A ² C such as C uses R. Types are disjoint: for all A,B in T, there is an axiom A ² ½B. Uniqueness of a view definition: for all V³ V there is exactly one axiom V = U. Views form a hierarchy: considering the binary relation of view inclusion "V uses U in its definition, V = U¬ …", the directed graph formed by views as node and this relation is acyclic and every connex compound has a unique root called minimal view. A view is a subset of a unique type: for each view V there is exactly one type A such as minimal_view(V) = A ¬ …. I An interpretation I = (DI,.I) of a given type scheme S is given by a set D and an .I interpretation function such that initial concepts and roles are interpreted in the following way:
Object Space Partitioning in a DL-Like Database
313
A is a subset of D for all A in T I I V is a subset of D for all V in V I I I R is a triplet with Relation ² D D , Undefined I I I I I ² D , Unknown ² D and {x ³ D | $ y ³ D and (x,y) ³ Relation} ² D (Undefined Unknown) Concept constructs are interpreted according to table 1 and 2. I
I
Example. We give an example of type scheme: - PERSON ² "partners. PERSON ¬ " age. INT ¬ (= 1 age) ¬ "follow. COURSE ¬ "teach. COURSE ¬ " namePERSON. STRING ¬ (= 1 namePERSON) - COURSE ² "teacher. PERSON ¬ (= 1 teacher) ¬ " nameCOURSE. STRING¬ (= 1 nameCOURSE) - VIEWPERSON = PERSON ¬ ½Undefined(age) ¬ ½Undefined(partner) ¬ ½Undefined(namePERSON) ¬ " age. [0, 100] - STUDENT = VIEWPERSON ¬ ½Undefined (follow) ¬ " age. [0, 50[ - TEACHER = VIEWPERSON ¬ ½Undefined (teach) ¬ " age. [18, 70[ - TEACHINGASSISTANT = STUDENT ¬ TEACHER - MAJOR = VIEWPERSON ¬ " age. [18, 100[ - VIEWCOURSE = COURSE¬½Undefined(teacher)¬½Undefined(nameCOURSE) Remark. The undefined set in a role interpretation represents the objects for which the role has no meaningful value. It is the case for the function 1/x: it is undefined for I x=0. As in Description Logics we assume that names, i.e., the elements from D , are unique (unique name assumption), in the context of our database system these names are oids. Definition. Given I and J two interpretations of a scheme S, we say that J is a more specific interpretation than I (noted as J ² I) iff:
D ²D I J For all A in T, A ² A . I J For all V in V, V ² V . I J For all R in R, R = , R = , Relation1 ² Relation2 , Undefined1 ² Undefined2 and Unknown2 ² Unknown1 . I
J
In other words, J a more specific interpretation than I contains less unknown information and possibly more objects than I. Definition. An interpretation I is said finite if D is finite. I
Definition. An interpretation I is said to be complete if for every R ³ R, such as R = , we have Unknown = «. I
Remark. Intuitively, a interpretation is complete if all the values for all the objects are known.
314
M. Roger, A. Simonet, and M. Simonet
Definition 8. Given a scheme S, we say that a finite interpretation I may satisfy an axiom. More precisely: 1. I satisfies A ² C iff A ² C I I 2. I satisfies A ² ½B iff A ¬ B = « I I I I 3. I satisfies R ² AB iff Relation ² A B , Undefined ² A and Unknown ² A , I where R = < Relation, Undefined, Unknown> I I I 4. I satisfies V = A ¬ U iff V = A ¬ U I
I
Definition. We say that a finite interpretation satisfies a type scheme S (or that I is a valid interpretation of S) iff: 1. I satisfies every axiom of S I I 2. For each p-type A, with minimal view V, we have A =V . 3. There exists a complete interpretation J such that J satisfies S and J ² I. Remark. This Definition needs a few explanations. First, we only consider finite interpretations because they best match our intuition: we do not believe that it is the purpose of computers to deal with infinity. Constraint 1 is classical and means that an interpretation for a concept follows its definition. Constraint 2 means that minimal views are primitive concepts. Constraint 3 deals with the unknown values. It means that in a valid interpretation, if a value is unknown, then it will be possible to put some actual value in the future. Definition. We say that a view V1 subsumes a view V2 , iff V2 ² V1 for all valid complete interpretation I. I
I
Remark. As in Description Logics, deduction does not deal with unknown values. The unknown values only appear when building the extension of a specific database, which is during the “using phase” of a database. Before inserting any object in the database, that is building a specific interpretation, the only thing one can do is to deduce facts that are true for every interpretation.
3 Application to Databases The goal of this section is to show how the object space of a p-type can be partitioned in order to store the objects of an interpretation, i.e., the actual set of objects of the database. This partition may also be used for computing the subsumption relationship, as in [Calvanese 96] where compound concepts form a semantic driven partitioning analogous to the equivalence classes introduced in this section. We give the intuition of the partitioning, starting with the partition of an attribute when it is defined over a predefined type (which is the case in the current implementation of the Osiris system). Then we extend this partitioning to the case where attribute types are views. Definition. Given a role R from a p-type A, let us call constraints(R) the set of all elementary constraints involving R; a stable sub-domain for R is either a subset of constraints(R) or the stable sub-domain sds-undefinedR. The interpretations of a stable I I I I sub-domains s are repectively s = A ¬ Ci - Cj , whereCi ³ s and Cj ´s; and sdsI I undefinedR = Undefined, where R = .
Object Space Partitioning in a DL-Like Database
315
Definition. A stable sub-domain s is said to be consistent iff for some valid I interpretation I we have s «. Example. Consistent stable sub-domains for the PERSON p-type and age role in example are: Sds11={"age.[0,50], "age.[0,70], "age.[0,100]} : those whose age is between 0 and 50 (included) Sds12={"age.[0,70], "age. [0,100]} : those whose age is between 50 (excluded) and 70 (included) Sds13={"age.[0,100]} : those whose age is between 70 (excluded) and 100 (included) Sds14={} : those whose age is not between 0 (included) and 100 (included) Property. Given an interpretation I and a role R, the R’s stable sub-domains’ I interpretations form a partition for A . Proof. Let o ³ A . For a constraint C from constraint(R), either o belongs to C , or o I belongs to ½C . As this is true for each constraint, we have the result.ä I
I
Definition. Given a p-type A whose roles are R1, …, Rn, an eq-class for A is (s1, …, I I sn) where each si is a stable sub-domain for Ri with interpretation (s1, …, sn) = ¬ si . Remark. One can see an eq-class as a hyper-square, as in figure 1.
Fig. 1. Example of graphical representation of eq-classes. Graphical representation for a virtual space of a p-type; only consistent stable sub-domains are represented. The circle symbolizes an object with values follow=Undefined and age=60.
Property. Given an interpretation I, the interpretations of Eq-classes are the equivalence classes for the relationship between objects: o1 o2 iff for each I I constraint C, either both o1 and o2 belong to C , or both o1 and o2 belong to ½C . The following properties and definitions deal with the storage of a specific interpretation. So we assume that we have a valid interpretation I and we concentrate on a single p-type A. Property. Given an object o from I such that, for each R ³ R, with R = , we have o ´ Unknown (i.e. o is completely known), then I there exists a unique eq-class eq, called EquivalentClass(o,I) such as o ³ eq . I
316
M. Roger, A. Simonet, and M. Simonet
Definition. Given a valid interpretation I, and an object o, we define Possible(o,I) as the set of eq-classes associated with o in every more specific interpretation of I. Property. If o is completely known, then Possible(o,I) = {EquivalentClass(o,I)}. The idea would be to index each object with the Possible(o,I) set. In fact, as we do not need so much information, we will use instead an approximation of Possible(o,I). Definition. Given a p-type A with role R, a general stable sub-domain is either a stable sub-domain or the general stable sub-domain sds-unknownR with interpretation I I sds-unknownR = A . Definition. Given a p-type A whose roles are R1, …, Rn, a general eq-class is (s1, …, sn) where each si is a general stable sub-domain for Ri with interpretation I I (s1, …, sn) = ¬ si . Definition. Given an object o, we define GeneralEquivalentClass(o,I) = (s1, …, sn) I where si is either the stable sub-domain such that o ³ si if o is not unknown for Ri, or sds-unknownR. Property. Given an object o and a J valid complete more specific interpretation than I, then we have: J J eq ³ Possible(o,I) eq ² GeneralEquivalentClass(o,I) . Proof. Suppose o is completely known, then Possible(o,I) = {EquivalentClass(o,I)} and GeneralEquivalentClass(o,I) = EquivalentClass(o,I). Now suppose than o is not completely known. Let us reorder the roles from A, such as R1, …, Rk are the roles for which o is known and Rk+1, …, Rn the roles for which o is unknown. Then let sds1, …, sdsk the stables sub-domains for o according to each Ri for I in 1..k. We have: eq in Possible(o,I) Ã eq = (sds1, …, sdsk, …). We also have GeneralEquivalentClass(o,I)=(sds1,…,sdsk,sds-UnknownRk+1,…,sds-UnknownRn). J J As for each Ri, I in k+1 .. n, we have sdsi ² sds-UnknownRi , we have : J J eq ² GeneralEquivalentClass(o,I) .ä Definition. Given a view V of p-type A and a general eq-class eq, we define Validity(eq,V), the validity of eq with regards to V, as : I I True if eq ² V for every valid interpretation I. I I False if eq ¬ V = « for every valid interpretation I. I I I I Possible if eq ¬ V « and eq ¬ ½V « for every valid interpretation I. Definition. Given a general eq-class eq of p-type A with views V1, …,Vn, we call ValidityVector(eq) the set of couples (Vi, Validity(eq,Vi)). Solution. We can associate with each general eq-class its validity vector and the set of objects that belong to the interpretation of the general eq-class. This is the way we implement object indexation as shown in figure 2.
Object Space Partitioning in a DL-Like Database
sds17
sds23
sds31
317
list of objects ValidityVector
sds17
sds23
sds-? list of objects ValidityVector
Fig. 2. Example of general eq-classes for object indexation. Two general eq-classes, with their associated object list and validity vector. The second eq-class has a general stable sub-domain unknown (quoted here as sds-?).
4 Conclusion and Future Work This paper continues the work initiated in [Roger et al. 00]. In particular we present a way of partitioning data according to the semantics. The notion of eq-classes provides semantic-driven partitioning and even if the number of eq-classes is exponential according to the size of the type scheme, only populated eq-classes need to be represented. Thus the number of eq-classes never exceeds the number of objects. This is the central idea of the p-type model as it has proven to be valuable for object indexation and logical deduction as stated in [Calvanese 96]. This convergence between databases and knowledge bases through semantic-driven partitioning allows us to consider the possibility of reusing previous results about concept classification. Future work will consist in extending the expressiveness of the language by reusing previous work on description logics, for example cardinality constraints [Calvanese 96]. We will also study the influence of concepts on a level “above” p-types, which reduces the gap between the p-type data model and that of description logics. An example of such a concept would be “old things” that would gather objects from different p-types according to the role age.
References [ANSI/X3]: ANSI/ X3/ SPARC Study group on database management systems. Interim report, ACM SiGMOD Bulletin 7, N2, 1975. [Bellatreche,00]: L. Bellatreche, Utilisation des Vues Matérialisées, des Index et de la Fragmentation dans la Conception d’un Entrepôt de Données, Thèse d’Université, Université Blaise Pascal, Clermont-Ferrand, France, Dec. 2000. [Beneventano et al., 93]: Beneventano, Bergamaschi, Lodi, Sartori, Using subsumption in semantic query optimisation, IJCAI Workshop on object based representation systems, Août 1993. [Bucheit et al., 98]: Buchheit, Domini, Nutt, Schaerf, A refined architecture for terminological systems: terminology = schema + views, Artificial Intelligence, Vol 99, 1998.
318
M. Roger, A. Simonet, and M. Simonet
[Calvanese 96]: Diego Calvanese, Finite Model Reasoning in Description Logics, in Proceedings of Knowledge Representation 1996. [Domini et al., 96]: Domini, Lenzerini, Nardi, Schaerf, Reasoning in description logics, Principles of Knowledge Representation, pp. 191-236, CSLI Publications, 1996. [Sales 84]: A. Sales, Types abstraits et bases de données, Thèse, Université scientifique et médicale de Grenoble,1984. [Simonet et al., 94]: A. Simonet, M. Simonet, Objects with Views and Constraints : from Databases to Knowledge Bases, Object-Oriented Information Systems OOIS'94 - London, Springer Verlag, pp 182-197, Dec. 1994. [Simonet, 88]: A. Simonet, Les P-TYPES: un modèle pour la définition de bases de connaissances centrées-objets cohérentes, R.R. 751-I laboratoire Artemis, Grenoble, Novembre 1988. [Simonet et al., 98]: A. Simonet, M. Simonet, C. G. Bassolet, X. Delannoy, R. Hamadi, Static Classification Schemes for an Object System, In: FLAIRS-98, 11th Int. FLorida Artificial Intelligence Research Society Conference, AAAI Press, pp 254-258, May 1998. [Roger, 99]: M. Roger, Requêtes dans un SGBD-BC de type objet avec vues, Rapport de Dea, UFR IMA, Grenoble, 1999. [Roger et al., 00]: M. Roger, A. Simonet and M Simonet, A Description Logics-like Model for a Knowledge and Data Management System, in DEXA 2000.
A Genome Databases Framework Luiz Fernando Bessa Seibel and S´ergio Lifschitz Departamento de Inform´ atica Pontificia Universidade Cat´ olica do Rio de Janeiro (PUC-Rio) Rio de Janeiro - Brasil {seibel,sergio}@inf.puc-rio.br
Abstract. There are many Molecular Biology Databases, also known as Genome Databases, and there is a need for integrating all this data sources and related applications. This work proposes the use of an object-oriented framework for genome data access and manipulations. The framework approach is an interesting solution due to the flexibility, reusability and extensibility requirements of this application domain. We give a formal definition of our Genome Databases Framework using UML class diagrams, that explore the structural part of the architecture. A brief discussion on the Framework functionalities is also presented.
1 Introduction
Many molecular biology projects are currently active [19]. In spite of all the benefits one may expect from them, it has become a challenging problem to deal with large volumes of DNA and protein sequences, besides other related data (such as annotations) (e.g., [2,9,23,10]). DNA and protein sequences are text strings and this is one of the reasons why molecular biologist started keeping them in text files. With new technologies available, the sequencing process and genetic code production has increased in such a way that the total volume of data became large enough, motivating the use of DBMSs. Database technology is already present in this research area but to a little extent, i.e., even if some projects include DBMS-like software to store all the data, most does not use DBMS’s functionalities [2,19]. Moreover, most users still work with flat text-based files downloaded from public repositories and data manipulation is done through programs like BLAST search (e.g., [5]). There exists many Molecular Biology Databases, also known as Genome Databases, such as the GenBank Sequence Database [13], the Annotated Protein Sequence Database (Swiss-Prot) [27] and A C. elegans Database [1]. It is important to note that many so-called databases are not always complete database systems but, rather, file systems with own storage, manipulation and access methods. We are interested here in a basic, though very important, problem that is related to the definition of a suitable structure for representing and integrating this kind of genome data. It is a well-known problem for genome and molecular biology data users that the information widely spread in different sites are not easy to deal with in a single and uniform way. Each research group that is H.C. Mayr et al. (Eds.): DEXA 2001, LNCS 2113, pp. 319–329, 2001. c Springer-Verlag Berlin Heidelberg 2001
currently generating or processing these data usually works in an independent manner, using different data models to represent and manipulate mostly the same information. There are object-oriented systems (e.g., AceDB [1]), relational ones (e.g., Swiss-Prot [27]) and semi-structured text-based files (e.g., GenBank [13]). The usual approach to handle this problem is to use a specific integrated model and system that should capture all the needed data for a particular application (e.g., [8]). Since it is a research area that is often changing, with new structural information being incorporated together with new application requirements, it is very difficult to decide upon which data model should be considered in this context. Thus, every existing approach based on a chosen model may not be well adapted to all users and application needs. In this work we propose the use of an object-oriented framework [11] approach to deal with this Genome data integration problem. A framework is an incomplete software system, which contains many basic pre-defined components (frozen spots) and others that must be instantiated (hot spots) for the implementation of the desired and particular functionality. There are multiple object-oriented framework classifications. Our framework belongs to the class called "specific to the application domain", the domain here being the molecular biology (genome) research area. We claim that, using a framework and the software system instantiations it may generate, we have a better solution to most of the questions that arise in this domain. Our proposed framework, briefly introduced in [20], will be discussed and formalized here, with UML class diagrams that explore its structural part. Due to space limitations, we will only give an idea of the framework's dynamic part. This is further explained in [26]. We first motivate our work in the next section, listing a sample of the existing genome data sources and tools, together with some of the most important approaches in the literature. Then, in Section 3, we give an overview of our framework, presenting its basic architecture, a discussion on its functionalities and an instantiation of the biological model. This is followed by the details of the framework modules, shown in Section 4. We conclude in Section 5 with contributions, future and ongoing work.
2 Motivation and Related Work
There are many interesting problems for the database research community, besides simply providing a way to store and give reliable and efficient access to large volumes of data. Among them, we can mention the works on appropriate user interfaces and the interaction between different data collections [4,28]. Some other issues involve the definition of an appropriate ontology [3,15], as well as buffer management [18] and indexes [14]. Molecular genome projects, through sequencing, have produced very large collections of DNA data for multiple organisms. An important problem in this research area is how to deal with genes and other genome sites in order to identify their functions. It is important to enable comparisons between different species' genome data that may be similar and, probably, have the same function.
Many research groups have developed tools to provide integration, structuring, comparison and presentation of genome sequences and related information. In [17,9,21] the authors identify the most important integration strategies:
– (hyperlink navigation for joining information) The idea here is to allow users to jump through registers of different data sources, either through existing links among them (e.g., Entrez [19]) or through navigation systems that create the links among different data sources (e.g., SRS [19]). Thus, in a first movement, the user accesses a data source register and, in what follows, the user asks for a link to another data source where the desired information is; or
– (multidatabases) Another strategy includes those that use integration tools that implement queries to the different pre-existing data sources. These queries may be formulated through special languages (e.g., [7] and CPL/Kleisli [19]) that allow representing complex data types, with an access driver implemented for each data source to be accessed; or
– (data warehouses) An alternative strategy consists of using a mediator that is responsible for determining the data sources that participate in the query, creating an access plan, translating concepts and syntax, assigning the queries to the distributed environment and integrating the results [17].
We deal here with the implementation of a data instance that collects the biological information available in several sources. When genome and molecular biology information structuring is taken into account, other research groups propose and discuss data models that are suitable to represent it. We can cite the OPM semantic-based object-oriented model [8], the DDBJ DNA Database presented in [24] and, more recently, the data warehouse-like approach proposed in [25]. Usually genome projects develop their own system interfaces, which differ in the way they show their data and the results of data manipulation. For example, AceDB [1] offers a powerful graphical interface that enables the user to visualize the chromosome map in detail. We have chosen here an approach that is based on an object-oriented framework. The basic idea for choosing a framework is that we needed a tool for integrating genome information that is spread over a distributed environment (mostly available on the web). Furthermore, this information changes because new biological information and descriptions emerge often. All data is used by distinct and, at the same time, similar applications. Thus, properties like flexibility, extensibility and software reusability are required. The biology data model, once initially defined, needs to incrementally aggregate new information from the different existing data sources. Through an object-oriented framework, it becomes possible to generate interfaces and database instances, executing the most important genome applications in a uniform and integrated way. Our approach integrates the information through the instantiation of particular scientific data warehouses, which respond to the high performance requests of the related applications.
The framework provides the following functionalities:
1. Schema capture of existing and different data sources;
2. Matching of the architecture objects to those objects in the captured schema;
3. Capture of data belonging to the data sources;
4. Definition of new ad-hoc schemas;
5. Data generation in a format required by a molecular biology application;
6. Execution of algorithms instantiated as methods of the biology classes.
The first functionality assumes that there exists a converter (wrapper) for the data sources being considered. When the second functionality is not directly obtained, a new biology object is created and associations must be created. This matching may establish relationships such as "is synonym of". The ability to capture the data is the third functionality mentioned and is needed once the associated schema has already been captured. For this, there is another type of converter. Listed next, in fourth place, is a common demand for new specific application schemas, together with a new associated data set. As there are multiple stored data formats, one may need to convert them to a particular format, mandatory for the execution of a given application (e.g., the FASTA file format for BLAST programs). This is what the fifth functionality is related to and, finally, the framework must be able to execute all the involved methods.
Besides the above mentioned functionalities, we have chosen XML Schema to define the data sources' schemas and XML for storing data [29]. This is due to (i) the intrinsic characteristics of the information in the biology data sources; (ii) the known advantages of semi-structured data models in this context [16]; and (iii) the eventual adoption of XML by many commercial CASE tools.
Framework Architecture
The framework being proposed is divided into four modules: Administrator, Captor, Driver and Converter. Their relationship and an overview of the framework architecture are depicted in Figure 1. The hot spots of our framework are the Biology Model and Algorithms, the Wrappers associated to the biology data sources and the Application Drivers. When instantiated, they implement a particular functionality, defining an application over the molecular biology application domain. The Administrator module performs the interface with the users to provide management of the biological data model: schemas and/or data capturing requests or the execution of algorithms instantiated in the framework. Therefore, this module contains a biology class model that is committed with the existing data sources, as well as with the methods that are associated to these classes. The Captor module is responsible for the data and schemas repository. The Converter provides access to the biology data sources, translating schemas to XML Schema and data to XML. Finally, the Drivers module implements the interface generation between biology applications and the framework.
Fig. 1. The Framework Architecture (the Administrator module with the Biology Model and Biology Algorithms, the Application Drivers for formats such as FASTA, .Ace and Swiss-Prot text records, the Captor, the Converter wrappers for AceDB, GenBank and Swiss-Prot, and the XML Schema and XML Data repositories over the biology data sources)
Overview of Framework Dynamics
When a user asks the Administrator for a schema (or data) from a given data source, this module sends the request to the Captor, which in turn sends it to the corresponding biology data wrapper. Such schema/data capturing may only be done if the corresponding wrapper has been previously developed in the framework. The wrapper implements the mapping of the data source schema to XML Schema and of the data to XML. The schemas and the data obtained are stored in their respective repositories. The user may also ask the Administrator module to generate a file for a given biology application. Much as described before, such a request can only be fulfilled if the associated driver has been previously instantiated in the architecture. The Administrator module triggers the driver in order to execute its task. Then, the driver requests data from the Captor, which manipulates the data repository. There are multiple biology applications that ask for a given file name and its location in order to proceed with the execution. The architecture also allows the execution of a biology algorithm instantiated in the architecture, which may work on the available data stored in the repository. The construction of interfaces between the framework and the existent biology applications can also be done through the Application Driver, i.e., data may be requested in specific formats for a class of applications.
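To make this interaction concrete, the following is a minimal Java sketch of the schema-capturing flow just described. The class and method names (AdministratorFacade, CaptorFacade, BiologyWrapper, captureSchema) are illustrative assumptions of this sketch and not the framework's actual API; a real instantiation would plug concrete wrappers and an XML repository into these hot spots.

import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper contract: each biology data source gets one implementation.
interface BiologyWrapper {
    String captureSchema();               // returns the source schema translated to XML Schema
}

// Captor: keeps the schema repository and forwards capture requests to the wrapper.
class CaptorFacade {
    private final Map<String, BiologyWrapper> wrappers = new HashMap<>();
    private final Map<String, String> schemaRepository = new HashMap<>();

    void registerWrapper(String source, BiologyWrapper w) { wrappers.put(source, w); }

    String captureSchema(String source) {
        BiologyWrapper w = wrappers.get(source);
        if (w == null) throw new IllegalStateException("No wrapper instantiated for " + source);
        String xmlSchema = w.captureSchema();   // the wrapper maps the native schema to XML Schema
        schemaRepository.put(source, xmlSchema);
        return xmlSchema;
    }
}

// Administrator: the user-facing entry point; it only delegates to the Captor.
class AdministratorFacade {
    private final CaptorFacade captor;
    AdministratorFacade(CaptorFacade captor) { this.captor = captor; }
    String captureSchema(String source) { return captor.captureSchema(source); }
}

class CaptureFlowDemo {
    public static void main(String[] args) {
        CaptorFacade captor = new CaptorFacade();
        // A stub standing in for a real GenBank wrapper.
        captor.registerWrapper("GenBank", () -> "<xs:schema/>");
        AdministratorFacade admin = new AdministratorFacade(captor);
        System.out.println(admin.captureSchema("GenBank"));
    }
}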
An Example of Biology Model
We present here an instantiation of the Biology Model to give an idea of the objects involved in the biology application domain, as well as the algorithms associated to each one of these objects.
Fig. 2. Example of Genome Class diagram (classes include BiologyModel, Genome, Chromosome, ChromosomeFragment, TranscriptRegion, RegionNonTranscribed, Gene, PrimaryTranscript, RegulatorySequence, ChromosomalElement, and the GeneFinder pattern-strategy hierarchy with geneFinder, blast and fast operations)
The model presents only a small part of the application domain objects, specifically the pieces of information related to the genome. Other facets refer, for example, to the proteome, transcriptome and metabolome. The biology model - a framework hot spot in our case - can be extended so that all information currently available at the biology data sources is considered. In the model depicted in Figure 2, one may observe that chromosomes form the genome and that each chromosome is considered a set of Chromosome Fragments, which consist of DNA sequences. A Chromosome Fragment can be either a Transcript Region or a Non-Transcript Region. The latter can be a Regulatory Sequence or a Chromosomal Element, and so forth. The algorithms that may be executed over the objects in the chromosome fragments class are, for example, GeneFinder-type methods, like Blast and Fast [22]. These types of algorithms run discovery (mining) processes over the fragments, looking for DNA regions where gene formation is possible.
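The GeneFinder family above is a natural fit for the Strategy pattern the framework relies on. The sketch below is an illustrative Java rendering, not the authors' implementation; the BlastGeneFinder and FastGeneFinder classes are stubs whose real counterparts would invoke the corresponding sequence-comparison programs [22].

import java.util.List;

// Strategy interface for the GeneFinder algorithm family attached to ChromosomeFragment.
interface GeneFinder {
    List<String> execute(String dnaSequence);   // returns candidate gene regions
}

// Stub standing in for a BLAST-based finder; a real version would call the BLAST programs.
class BlastGeneFinder implements GeneFinder {
    public List<String> execute(String dnaSequence) { return List.of(); }
}

// Stub standing in for a FASTA-based finder.
class FastGeneFinder implements GeneFinder {
    public List<String> execute(String dnaSequence) { return List.of(); }
}

// A chromosome fragment delegates gene discovery to whichever strategy was instantiated.
class ChromosomeFragment {
    private final String dnaSequence;
    private GeneFinder geneFinder;

    ChromosomeFragment(String dnaSequence, GeneFinder geneFinder) {
        this.dnaSequence = dnaSequence;
        this.geneFinder = geneFinder;
    }

    void setGeneFinder(GeneFinder geneFinder) { this.geneFinder = geneFinder; }

    List<String> findGenes() { return geneFinder.execute(dnaSequence); }
}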
4 Framework Modules
We present here the architecture of each module. Each module has a class that represents its interface with the other modules. We use the Facade pattern [12] in their implementation. The framework is described and formally specified using the Unified Modeling Language (UML) [6]. Although a complete formal definition of our framework would contain both structural and functional specification
(through class and sequence diagrams), as explained before, only the static part will be presented. We will focus on each module's classes and related functionalities. The attributes and methods will not be detailed, but they are quite immediate.
Fig. 3. Classes Diagram of the Administrator Module (the <<facade>> AdmFacade class, with operations such as captureSchema, captureData, deleteSchema, getAvailableConverters, getAvailableSchemas, matchSchemaModel, queryModel, createObjectModel, createAssociation, createSynonym, includeOwnSchema, queryData, executeAlgorithm and selectAlgorithm)
Administrator Module
The three classes in this module are: AdmFacade, BiologyModel and RepositoryModel. The class AdmFacade is the path to all framework functionalities. The users interact with this class in order to capture schemas or data from a given data source, or to obtain the matching between a biology schema defined in the architecture and a schema that was captured from a data source. When there is no direct matching, new objects can be added to (or suppressed from) the model. Also, object associations can be made (or undone) and objects can be recognized as synonyms of other objects. A user can create a new schema and a data instantiation that are appropriate to a given application. One can, via the Administrator module, generate data for external applications, query schemas and data repositories, or even run biology programs. The class BiologyModel manipulates the biology model, allowing its expansion. Classes that are part of the Administrator module may be extended or modified by programmers. They are, indeed, hot spots of the proposed framework. The Strategy pattern [12] is used to permit the creation of algorithm
families associated with the biology model's classes. This way the programmers can implement variations of the available algorithms. Finally, the class RepositoryModel provides object persistency as well as object retrieval from the Repository.
Captor Module
The class CaptorFacade provides (i) the capture and storage of the biology data sources' schemas/data; (ii) management of own specific schemas, defined from the objects of the biology model available in the framework; (iii) exclusion of own or captured schemas/data; and (iv) query execution over the repository class.
Fig. 4. Classes Diagram of the Captor Module (the <<facade>> CaptorFacade class, which delegates schema/data capture to the WFBFacade of the wrapper module, and the SchemaRepository and DataRepository specializations of a Repository class offering open, close, insert, delete and query operations)
The second class, called Repository, provides persistency of schemas and data, besides enabling data retrieval. It is worth remembering that the schemas are stored in XML Schema and all data in XML. So, there is a need for access and manipulation languages, such as XQuery [30], to deal with the Repository.
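As a rough illustration of such a repository, the Java sketch below keeps schemas and data as plain XML strings behind the open/close/insert/delete/query operations named in Figure 4. It is an assumption-laden toy: a real instantiation would persist the documents and evaluate XQuery expressions with an XML database or query engine rather than the naive substring match used here.

import java.util.ArrayList;
import java.util.List;

// Toy XML repository: documents are kept in memory as strings.
class Repository {
    private final String repositoryName;
    private final List<String> documents = new ArrayList<>();
    private boolean open;

    Repository(String repositoryName) { this.repositoryName = repositoryName; }

    void open()  { open = true; }
    void close() { open = false; }

    void insert(String xmlDocument) {
        if (!open) throw new IllegalStateException(repositoryName + " is not open");
        documents.add(xmlDocument);
    }

    void delete(String xmlDocument) { documents.remove(xmlDocument); }

    // Placeholder for an XQuery evaluation; here a document "matches" if it contains the text.
    List<String> query(String expression) {
        List<String> result = new ArrayList<>();
        for (String doc : documents) {
            if (doc.contains(expression)) result.add(doc);
        }
        return result;
    }
}

class RepositoryDemo {
    public static void main(String[] args) {
        Repository schemas = new Repository("SchemaRepository");
        schemas.open();
        schemas.insert("<xs:schema name=\"GenBank\"/>");
        System.out.println(schemas.query("GenBank"));
        schemas.close();
    }
}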
WrapperBiologySources Module
This module is composed of the classes WFBFacade and DataSourceWrapper. The WFBFacade is an interface class between the framework modules and the biology data wrappers. There will be various converters in the architecture, one for each data source. Therefore, the relationship between the WFBFacade and the Wrappers is of the one-to-many type.
Fig. 5. Classes Diagram of the WrapperBiologySources Module (the <<facade>> WFBFacade with captureSchema, captureData and getAvailableSchemas, connected one-to-many to DataSourceWrapper subclasses such as GenBank, AceDB and Swiss-Prot, each offering open, close, readSchema and readData)
The DataSourceWrapper represents the implementation of each wrapper. A wrapper has two distinct functionalities: on one hand, the ability to capture the biology data source schema and, on the other hand, the capture of the source's data itself. The DataSourceWrapper class is a hot spot of the architecture.
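A possible Java shape for this hot spot is sketched below, using the operation names from Figure 5. The GenBankWrapper is only a stub, since a real wrapper would parse the flat-file records of the source and emit XML Schema and XML.

import java.util.Iterator;
import java.util.List;

// Hot spot: one implementation per biology data source.
interface DataSourceWrapper {
    void open();
    void close();
    String readSchema();            // source schema translated to XML Schema
    Iterator<String> readData();    // source records translated to XML documents
}

// Stub wrapper for GenBank; real parsing of GenBank flat files is omitted.
class GenBankWrapper implements DataSourceWrapper {
    public void open()  { /* e.g., open a downloaded GenBank flat file */ }
    public void close() { }
    public String readSchema() { return "<xs:schema/>"; }
    public Iterator<String> readData() {
        return List.of("<entry accession=\"EXAMPLE\"/>").iterator();
    }
}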
Fig. 6. Class Diagram of the DriverBiologyApplication Module (the <<facade>> ApplicationDriverFacade with generateApplicationData, listAvailableDrivers and selectDriver, connected one-to-many to DataGenerator subclasses such as DotAceGenerator, FASTAGenerator and TxtOfDBGenerator, and interacting with the CaptorFacade to obtain the data)
DriverBiologyApplication Module
The following classes compose the DriverBiologyApplication module. The ApplicationDriverFacade is an interface class between the modules of the framework and the drivers that generate data for the biology applications. There will be multiple application drivers in the architecture, one for each application program to be used. Thus, the relationship between the ApplicationDriverFacade and the drivers is of the one-to-many type. The DataGenerator represents the implementation of each driver. The driver is also a framework hot spot. For instance, a driver can generate data in a text format, according to the syntax used in GenBank or Swiss-Prot, or even in FASTA format to be used in the execution of algorithms that work on it. Moreover, a driver may be the implementation of an interface with a system available on the Web. It can send the data available in the framework repositories to a system that will execute and manipulate them. The driver can also be a data service, allowing an application to be connected to the framework and to receive the data stored there.
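For illustration, a FASTA-emitting driver could look like the Java sketch below. The DataGenerator/FASTAGenerator names follow Figure 6, but the record type and the way data is handed to the driver are assumptions of this sketch.

import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.List;

// Minimal record handed to a driver: an identifier plus its DNA/protein sequence.
record SequenceRecord(String id, String sequence) { }

// Hot spot: each driver turns repository data into the format of one application class.
abstract class DataGenerator {
    abstract void generateApplicationData(Writer out, List<SequenceRecord> records) throws IOException;
}

// FASTA output: a ">" header line followed by the sequence.
class FASTAGenerator extends DataGenerator {
    void generateApplicationData(Writer out, List<SequenceRecord> records) throws IOException {
        for (SequenceRecord r : records) {
            out.write(">" + r.id() + "\n" + r.sequence() + "\n");
        }
    }
}

class FastaDemo {
    public static void main(String[] args) throws IOException {
        Writer out = new StringWriter();
        new FASTAGenerator().generateApplicationData(out, List.of(new SequenceRecord("seq1", "ACGTACGT")));
        System.out.print(out);
    }
}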
5 Final Comments
We proposed and detailed a genome database framework that integrates molecular biology data sources and allows the execution of programs and queries in a uniform way. This is quite different from previous approaches, which rely on particular data models and structures that are only appropriate to specific contexts. The main contribution is based on the idea that our framework works much like data warehouses do, but provides, in addition, flexibility and reusability. The users of such a tool may access a heterogeneous environment of information sources and can deal with schema evolution based on a meta-model, i.e., independently of each distinct data model used. New schemas can be built via framework instantiation, with the help of an ontology and a biology data model. We are currently working on the framework implementation and we hope to have a prototype available soon, with all functionalities, although still with restricted access to data sources. We are also interested in the definition and representation of a specific ontology for the molecular biology and genome area, which will be used with the existing data sources. Moreover, we plan to explore further the schema evolution characteristics of the framework.
References
1. AceDB: http://genome.cornell.edu/acedoc/index.html
2. M. Ashburner and N. Goodman, "Informatics, Genome and Genetics Databases", Current Opinion in Genetics & Development 7, 1997, pp 750–756.
3. P. Baker, C.A. Goble, S. Bechhofer, N.W. Paton, R. Stevens and A. Brass, "An Ontology for Bioinformatics Applications", Bioinformatics 15(6), 1999, pp 510–520.
4. M.I. Bellgard, H.L. Hiew, A. Hunter, M. Wiebrands, "ORBIT: an integrated environment for user-customized bioinformatics tools", Bioinformatics 15(10), 1999, pp 847–851.
5. Blast: http://www.ncbi.nlm.nih.gov/BLAST/
6. G. Booch, J. Rumbaugh and I. Jacobson, "The Unified Modeling Language User Guide", Addison-Wesley Longman, 1999.
7. P. Buneman, S.B. Davidson, K. Hart, G.C. Overton and L. Wong, "A Data Transformation System for Biological Data Sources", VLDB Conference, 1995, pp 158–169.
8. I.A. Chen and V.M. Markowitz, "An Overview of the Object Protocol Model and the OPM Data Management Tools", Information Systems 20(5), 1995, pp 393–418.
9. S.B. Davidson, C. Overton and P. Buneman, "Challenges in Integrating Biological Data Sources", Journal of Computational Biology 2(4), 1995, pp 557–572.
10. R.F. Doolittle (editor), "Methods in Enzymology", Academic Press, 1990.
11. M.E. Fayad, D.C. Schmidt and R.E. Johnson, "Building Application Frameworks", Addison-Wesley, 1999.
12. E. Gamma, R. Helm, R. Johnson and J. Vlissides, "Design Patterns: Elements of Reusable Object-Oriented Software", Addison-Wesley Longman, 1995.
13. GenBank: http://www.ncbi.nlm.nih.gov/Genbank/index.html
14. E. Hunt, M.P. Atkinson and R.W. Irving, "A Database Index to Large Biological Sequences", to appear in VLDB Conference, 2001.
15. Gene Ontology: http://www.geneontology.org/
16. V. Guerrinia and D. Jackson, "Bioinformatics and XML", On Line Journal of Bioinformatics, 1(1), 2000, pp 1–13.
17. P. Karp, "A Strategy for Database Interoperation", Journal of Computational Biology 2(4), 1995, pp 573–586.
18. M. Lemos, "Memory Management for Sequence Comparison", MSc Thesis (in Portuguese), Departamento de Informática, PUC-Rio, August 2000.
19. S. Letovsky (editor), "Bioinformatics: Databases and Systems", Kluwer, 1999.
20. S. Lifschitz, L.F.B. Seibel and E.M.A. Uchôa, "A Framework for Molecular Biology Data Integration", Procs. Workshop on Information Integration on the Web (WIIW), 2001, pp 27–34.
21. V.M. Markowitz and O. Ritter, "Characterizing Heterogeneous Molecular Biology Database Systems", Journal of Computational Biology, 2(4), 1995, pp 547–556.
22. J. Meidanis and J.C. Setúbal, "Introduction to Computational Molecular Biology", PWS Publishing Company, 1997.
23. F. Moussouni, N.W. Paton, A. Hayes, S. Oliver, C.A. Goble and A. Brass, "Database Challenges for Genome Information in the Post Sequencing Phase", Procs 10th Database and Expert Systems Applications (DEXA), 1999, pp 540–549.
24. T. Okayama, T. Tamura, T. Gojobori, Y. Tateno, K. Ikeo, S. Miyasaki, K. Fukami-Kobayashi and H. Sugawara, "Formal Design and Implementation of an Improved DDBJ DNA Database with a New Schema and Object-oriented Library", Bioinformatics 14, 1998, pp 472–478.
25. N.W. Paton, S.A. Khan, A. Hayes, F. Moussouni, A. Brass, K. Eilbeck, C.A. Goble, S.J. Hubbard, S.G. Oliver, "Conceptual modeling of genomic information", Bioinformatics 16(6), 2000, pp 548–557.
26. L.F.B. Seibel and S. Lifschitz, "A Genome Databases Framework", Technical Report (MCC) PUC-Rio, Departamento de Informática, 2001.
27. Swiss-Prot: http://www.ebi.ac.uk/swissprot
28. Tambis Project: http://img.cs.man.ac.uk/tambis/
29. XML: http://www.w3.org/XML/
30. XQuery: http://www.w3.org/TandS/QL/QL98/pp/xquery.html
Lock Downgrading: An Approach to Increase Inter-transaction Parallelism in Advanced Database Applications1
Angelo Brayner University of Fortaleza - UNIFOR, Dept. of Computer Science 60811-341 Fortaleza - Brazil [email protected]
Abstract. In this paper, we propose a concurrency control protocol, denoted Cooperative Locking, which extends the two-phase locking protocol by introducing the notion of downgrading of locks proposed in [9]. The basic idea of the proposed protocol is to provide the following functionality: after using an object in a transaction, the user can downgrade a lock on an object to a less restrictive mode before the transaction ends its execution. The prime goal of our proposal is to provide a high degree of inter-transaction parallelism while ensuring serializability of schedules.
1 Introduction
The classical model for concurrency control in DBMSs adopts serializability as the correctness criterion for the execution of concurrent transactions. In existing DBMSs, serializability is ensured by the two-phase locking (2PL) protocol [7]. The 2PL protocol implements a locking mechanism which requires that a transaction obtains a lock on a database object before accessing it. A transaction may only obtain a lock if no other transaction holds an incompatible lock (e.g., a lock for a read operation is incompatible with a write lock) on the same object. When a transaction obtains a lock, it is retained until the transaction ends its execution (by a commit or an abort operation). However, in recent years, database concepts have been applied to areas such as computer-aided design and software engineering (CAD and CASE), geographic information systems (GIS) and workflow management systems (WFMS). These advanced database applications consist of long-living transactions and present a cooperative environment. Waiting for locks may cause unacceptable delays for concurrent transactions belonging to this class of database applications. Accordingly, the 2PL protocol is not adequate for controlling concurrency in such applications. In this work, we propose an extension to the 2PL protocol by introducing the notion of lock mode downgrading presented in [9] (this notion appears also in [10] as lock release conversion, but it is not used as a primitive in the locking mechanism). Härder and Rothermel propose that a transaction holding a lock in mode M can downgrade it to a less restrictive mode. This notion is used by the authors to implement controlled downward inheritance of locks to provide more intra-transaction parallelism in the processing of nested transactions.
1 Research supported by the University of Fortaleza - UNIFOR.
In our proposal, we use lock downgrading as a primitive in the locking protocol in order to obtain more inter-transaction parallelism. Such a feature can optimize the processing of long-living transactions. With the notion of lock downgrading, the user may relax the blocking property of locks on database objects. Lock downgrading ensures more cooperation among transactions accessing the same set of database objects. For that reason, we denote our proposal Cooperative Locking (CL, for short). This paper is organized as follows. In the next section, we briefly outline some concepts of the conventional transaction model which are used in this paper. Section 3 motivates and discusses the Cooperative Locking protocol. In Section 4, we compare the results of our proposal with other works. Section 5 concludes the paper.
2 The Model
A database is a collection of disjoint objects. The values of these objects may be read and modified by transactions. A transaction is modeled as a finite sequence of read and write operations on database objects, where ri(x) (wi(x)) represents a read (write) operation executed by a transaction Ti on object x. To each transaction, an identifier, denoted TRID, is associated which uniquely identifies it. A schedule models an interleaved execution of transactions. Two operations of different transactions conflict iff they access the same database object and at least one of them is a write operation. With p <S q, we indicate that the operation p is executed in the schedule S before q. The Serialization Graph for a schedule S over T = {T1, T2, ..., Tn} is a directed graph SG(S) = (N, E) where each node in N corresponds to a transaction in T, and E contains edges of the form Ti → Tj if and only if Ti, Tj ∈ N and two operations p in Ti and q in Tj are in conflict, with p <S q. We say that a transaction Tj indirectly conflicts with Ti if there is an edge Ti → Tj in E+, where SG+(S) = (N, E+) is the transitive closure of the serialization graph SG of the schedule S, and Ti, Tj ∈ T. A schedule S is said to be conflict serializable, i.e. S ∈ CSR, iff SG(S) is acyclic [4]. A schedule S is correct if it is either serial or conflict serializable.
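As a small illustration of these definitions, the Java sketch below builds the serialization graph of a schedule and tests it for acyclicity; the Op record and the depth-first cycle test are conveniences of this sketch, not part of the paper's model.

import java.util.*;

// One schedule step: transaction id, 'r' or 'w', and the object it touches.
record Op(int tx, char type, String obj) { }

class SerializationGraph {
    // Builds SG(S): an edge Ti -> Tj for conflicting operations with the Ti-operation first.
    static Map<Integer, Set<Integer>> build(List<Op> schedule) {
        Map<Integer, Set<Integer>> edges = new HashMap<>();
        for (int i = 0; i < schedule.size(); i++) {
            for (int j = i + 1; j < schedule.size(); j++) {
                Op p = schedule.get(i), q = schedule.get(j);
                boolean conflict = p.tx() != q.tx() && p.obj().equals(q.obj())
                        && (p.type() == 'w' || q.type() == 'w');
                if (conflict) edges.computeIfAbsent(p.tx(), k -> new HashSet<>()).add(q.tx());
            }
        }
        return edges;
    }

    // Depth-first search for a cycle; an acyclic SG(S) means S is conflict serializable.
    static boolean hasCycle(Map<Integer, Set<Integer>> edges) {
        Set<Integer> done = new HashSet<>(), onPath = new HashSet<>();
        for (Integer n : edges.keySet()) if (dfs(n, edges, done, onPath)) return true;
        return false;
    }

    private static boolean dfs(Integer n, Map<Integer, Set<Integer>> edges,
                               Set<Integer> done, Set<Integer> onPath) {
        if (onPath.contains(n)) return true;
        if (done.contains(n)) return false;
        onPath.add(n);
        for (Integer m : edges.getOrDefault(n, Set.of())) if (dfs(m, edges, done, onPath)) return true;
        onPath.remove(n);
        done.add(n);
        return false;
    }

    public static void main(String[] args) {
        // S = w1(x) r2(x) w2(y) r1(y): edges T1 -> T2 and T2 -> T1, hence not conflict serializable.
        List<Op> s = List.of(new Op(1, 'w', "x"), new Op(2, 'r', "x"),
                             new Op(2, 'w', "y"), new Op(1, 'r', "y"));
        System.out.println("conflict serializable: " + !hasCycle(build(s)));
    }
}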
3 The CL Protocol
3.1 The Protocol
Upgrading of locks is the only lock conversion supported by 2PL. Cooperative Locking extends 2PL by introducing a mechanism which enables two different types of lock conversion on a database object O:
Upgrading of locks: A transaction holding a lock type L on O can upgrade it to a more restrictive type L', if no other transaction holds a conflicting lock with L' on O;
Downgrading of locks: A transaction holding a lock type L on O can downgrade it to a less restrictive type L'.
Cooperative Locking supports three lock types: Nil, read lock (rl) and write lock (wl). The lock type Nil denotes the absence of locks. The other lock types
have been discussed exhaustively. Figure 1 shows the compatibility table among lock types. The columns represent lock types which a transaction Ti holds on a database object. The rows represent locks which are requested by another transaction Tj on the same object. An entry "+" in the table denotes that the lock types are compatible. On the other hand, an entry "-" denotes that the lock types are not compatible.
        Nil   rli   wli
  rlj    +     +     -
  wlj    +     -     -
Figure 1: Compatibility table for locks in a CL protocol.
A CL scheduler2 manages locks according to the following rules:
(R1) A transaction may acquire a lock of type read on an object O, if no other transaction holds a write lock on O.
(R2) A transaction may acquire a lock of type write on an object O, or upgrade a lock it holds to a write lock, if no other transaction holds a read or write lock on O.
(R3) A transaction holding a write lock may downgrade it to read lock or Nil.
(R4) A transaction holding a read lock may downgrade it to Nil.
(R5) Once a transaction has downgraded a lock, it may not upgrade the lock.
(R6) Once a transaction has released a lock, it may not acquire any new locks (2-phase rule).
The process of lock downgrading can make an object used by a transaction T visible to other transactions. The object may become visible for read or for update operations. For instance, if a transaction T downgrades a write lock on an object O to a read lock, other transactions can read O but not update it. Notwithstanding, if T downgrades the write lock to Nil, other transactions can read and update O. The user should explicitly request a lock downgrading. To represent such a transaction request, we will use the notation dg(O, L), where O denotes the object whose lock should be downgraded to type L according to rules (R3) and (R4). It is easy to see that, differently from lock upgrading, the action of downgrading a lock does not provoke deadlocks. On the other hand, lock downgrading may sometimes present undesirable side-effects. To illustrate such a side-effect, consider that a CL scheduler has already scheduled the following operations: S = w1(x) dg1(x, rl) r2(x) w2(y) c2. Assume that the commit operation c2 has been successfully processed and, after that, the scheduler receives w1(y) and schedules it. A nonserializable schedule is produced (SG(S) contains a cycle). In order to avoid such undesirable side-effects, the CL protocol must perform special control on transactions which have executed at least one lock downgrading. Before describing how this control is carried out, we have to define data structures which are needed for it.
Definition 1. The CL protocol should maintain the following data structures:
1. rl_set(O): This structure represents a set containing the TRIDs of all transactions which are currently holding a read lock on the object O.
2. wl_set(O): Set of TRIDs from transactions currently holding a write lock on O.
3. dg_set: Set whose elements are the TRIDs of all active (not committed) transactions which have downgraded at least one lock. After a transaction has committed or aborted, it is removed from the set dg_set.
4. TCG(T): For each transaction T in dg_set, a directed graph, called transaction conflict graph for T (TCG(T)), is constructed as follows: (a) the nodes represent transactions which directly or indirectly conflict with T, where the conflicting operations occur after T has downgraded a lock; (b) the edges represent the conflicts among the transactions represented in TCG(T). Hence, TCG(T) contains edges of the form T → T' if and only if an operation p in T conflicts with an operation q belonging to T' and p <S q; (c) TCG(T) must be acyclic; (d) after T has ended its execution, TCG(T) can be deleted.
2 A scheduler which implements a Cooperative Locking protocol.
Example 1. Consider the schedule S over the set T = {T1, T2, T3, T4} of transactions, where:
S = w1(x) c1 r2(x) w2(y) dg2(y, rl) r3(y) w3(z) c3 r4(z) c4 w2(v)
The following graph represents the transaction conflict graph for T2: TCG(T2): T2 → T3 → T4. The conflict between T1 and T2 occurs before T2 has downgraded the wl on y to rl. For that reason, T1 is not represented in TCG(T2). On the other hand, T3 conflicts with T2 after the latter has downgraded a lock. By item 4.a of Definition 1, T3 should be represented in the graph. Transaction T4 represents a node in TCG(T2) because T4 indirectly conflicts with T2.
Theorem 1. Let CL be the set of all schedules produced by a Cooperative Locking protocol. Then CL ⊂ CSR.
Proof. Let S be a schedule over a set T of transactions produced by a CL protocol, that is, S ∈ CL.
Case 1. No transaction has requested lock downgrading in S. Suppose, by way of contradiction, that S is not conflict serializable, that is, S ∉ CSR. Hence, the serialization graph of S is cyclic. Without loss of generality, consider that the cycle in SG(S) has the following form: Ti → Tj → ... → Ti. It follows from this that, for some operations pi(x), qi(y), the transaction Ti may obtain a lock on y after having released a lock on x, a contradiction to the 2-phase rule. Thus, CL ⊆ CSR. Consider the following schedule: S = r1(x) w2(x) w1(z). Clearly, S ∈ CSR \ CL, because, according to the CL protocol (rule R2), transaction T2 must wait for the end of T1 in order to execute the operation w2(x). Thus, CL ⊂ CSR, as was to be proved.
Case 2. At least one transaction in S downgrades a lock. By way of contradiction, suppose that S ∉ CSR. Hence, the serialization graph of S contains at least one cycle. Without loss of generality, consider that SG(S) has the following cycle: Ti → Tj → ... → Ti. The edge Ti → Tj is the result of two conflicting operations pi(x) in Ti and qj(x) in Tj, where pi(x) <S qj(x). Now consider that the transaction Ti has downgraded a lock on object x on which the transaction Tj executes operation qj. By Definition 1, a transaction conflict graph for Ti (TCG(Ti)) should be constructed for which Ti → Tj → ... → Ti is a subgraph, a contradiction, because, by item 4.c of Definition 1, TCG(Ti) must be acyclic. Thus, CL ⊆ CSR. The schedule S' = w1(z) r1(x) dg1(x, Nil) w2(x) r2(z) w1(y) is clearly in CSR \ CL, because T2 may not acquire a read lock on z while T1 holds a write lock on z; consequently, T2 may not execute r2(z) before T1 ends. Therefore, CL ⊂ CSR, as was to be proved.
Theorem 1 shows that a scheduler implementing Cooperative Locking enforces serializability. It is important to note that, in the absence of lock downgrading, a CL scheduler behaves like a 2PL scheduler. This is shown in Case 1 of the proof of Theorem 1. This property represents an important result of our proposal, since it assures that a cooperative locking mechanism may be implemented on top of any 2PL scheduler.
3.2 Implementation Aspects
Using the structures described in Definition 1, a CL scheduler performs the operations shown in Figure 2. These two procedures can be summarized as follows. When a transaction Ti requires a lock L for a given object O, the scheduler verifies whether there is a conflicting lock associated with O on behalf of another transaction Tj (i ≠ j). If any Tj holds a conflicting lock on O, the scheduler should delay the processing of setting L until Tj releases its lock on O. In fact, the delay function (delay(rli(O)) or delay(wli(O))) blocks transaction Ti until Tj releases the lock. If there is no conflicting lock, the scheduler must verify whether transaction Ti has downgraded any lock. Hence, if Ti is not an element of dg_set, the lock can be set. Otherwise, it must be checked whether the execution of the operation corresponding to the required lock introduces a cycle in TCG(Ti). This is performed by the function check(TCG(trid)). If no cycle is produced, the required lock may be granted. Otherwise, the required lock should be rejected and the transaction aborted.

Read_Lock(trid, O)
/* read locking protocol */
if wl_set(O) = ∅
  if trid ∉ dg_set
    rl_set(O) ← rl_set(O) ∪ {trid};
  else
    check(TCG(trid));
    if exists cycle
      reject(rl(O))
    else
      rl_set(O) ← rl_set(O) ∪ {trid};
else
  delay(rl(O));

Write_Lock(trid, O)
/* write locking protocol */
if wl_set(O) = ∅ and rl_set(O) = ∅
  if trid ∉ dg_set
    wl_set(O) ← wl_set(O) ∪ {trid};
  else
    check(TCG(trid));
    if exists cycle
      reject(wl(O))
    else
      wl_set(O) ← wl_set(O) ∪ {trid};
else
  delay(wl(O));

Figure 2: Procedures to set read/write locks.
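For readers who prefer running code, the following Java sketch mirrors the bookkeeping of Definition 1 and the two procedures of Figure 2. It is an illustrative simplification: the cycle test is abstracted as a predicate supplied by the caller, blocking is reduced to returning a DELAYED result, and names such as CLLockManager are inventions of this sketch.

import java.util.*;
import java.util.function.Predicate;

enum LockResult { GRANTED, DELAYED, REJECTED }

class CLLockManager {
    private final Map<String, Set<Integer>> rlSet = new HashMap<>();  // rl_set(O)
    private final Map<String, Set<Integer>> wlSet = new HashMap<>();  // wl_set(O)
    private final Set<Integer> dgSet = new HashSet<>();               // dg_set
    // Stands in for check(TCG(trid)): true if granting the operation would close a cycle.
    private final Predicate<Integer> introducesCycle;

    CLLockManager(Predicate<Integer> introducesCycle) { this.introducesCycle = introducesCycle; }

    private Set<Integer> holders(Map<String, Set<Integer>> m, String o) {
        return m.computeIfAbsent(o, k -> new HashSet<>());
    }

    LockResult readLock(int trid, String o) {
        if (!holders(wlSet, o).isEmpty()) return LockResult.DELAYED;            // rule R1
        if (dgSet.contains(trid) && introducesCycle.test(trid)) return LockResult.REJECTED;
        holders(rlSet, o).add(trid);
        return LockResult.GRANTED;
    }

    LockResult writeLock(int trid, String o) {
        // rule R2 (the upgrade of the caller's own read lock, also allowed by R2, is omitted for brevity)
        if (!holders(wlSet, o).isEmpty() || !holders(rlSet, o).isEmpty()) return LockResult.DELAYED;
        if (dgSet.contains(trid) && introducesCycle.test(trid)) return LockResult.REJECTED;
        holders(wlSet, o).add(trid);
        return LockResult.GRANTED;
    }

    // Downgrading request dg(O, L): wl -> rl, wl -> Nil or rl -> Nil; the transaction joins dg_set.
    void downgrade(int trid, String o, boolean toReadLock) {
        dgSet.add(trid);
        holders(wlSet, o).remove(trid);
        if (toReadLock) holders(rlSet, o).add(trid);
        else holders(rlSet, o).remove(trid);
    }
}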
Figure 3 shows the procedure executed by a CL scheduler when it receives a lock downgrading request. The parameter Lold denotes the lock on object O which is to be downgraded to Lnew. Only rl or Nil are valid values for Lnew. Here it is important to underline the difference between downgrading a lock to Nil and releasing a lock. When a transaction T downgrades a lock to Nil, the scheduler should monitor the execution of T more closely in order to ensure that serializability will not be jeopardized. For that reason, T should be inserted in dg_set and the graph TCG(T) should be constructed and maintained by the scheduler. By releasing a lock, a transaction cannot induce inconsistencies in the execution of concurrent executions (this is ensured by the two-phase rule). Thus, no additional control is necessary. Evidently, aborting transactions introduces several difficulties for the processing of long-living transactions. However, we propose mechanisms which can minimize the drawbacks of transaction aborts. It is also important to note that other proposals extending 2PL, such as Altruistic Locking [2, 11] and Locks with Constrained Sharing [1], also suffer from the problem of having to abort transactions.
3.3 Reducing the Negative Effects of Transaction Aborts
As mentioned before, a CL scheduler is sometimes forced to abort transactions. In this section, we present two strategies which can reduce the frequency of aborts. The basic function of a concurrency control mechanism is to synchronize conflicting operations. There are two kinds of conflicts: read-write (write-read) conflicts and write-write conflicts. Sometimes, it may be meaningful to decompose the synchronization realized by a concurrency control mechanism into two subfunctions: (i) synchronization of conflicting read-write (write-read) operations, denoted rw-synchronization, and (ii) synchronization of conflicting write-write operations, denoted ww-synchronization. In order to illustrate this fact, consider that a CL scheduler has already scheduled the following operations: S = rl1(x) r1(x) dg1(x, Nil) wl2(x) w2(x) wl2(z) w2(z) c2. Now, suppose that, after c2 has already been scheduled and performed, the scheduler receives the operation w1(z). For that reason, a wl1(z) is requested. By the CL protocol, wl1(z) cannot be granted, the operation w1(z) should be rejected, and the transaction T1 aborted. However, suppose that the scheduler has granted the write lock on z, but has not executed the operation w1(z) (i.e., the scheduler ignores w1(z)). This yields the same value for the object z as executing the actions of rejecting w1(z) and aborting T1. Therefore, from the ww-synchronization perspective, the actions of rejecting w1(z) and aborting T1 were unnecessary. We can summarize the observation described above as follows. Let Tj be a transaction which has executed a write operation on O before the CL scheduler receives wli(O) (a write lock request of Ti on O). If wli(O) can be granted (because Tj has already committed or downgraded the write lock to Nil) and the operation wi(O) introduces a cycle in TCG(Ti), then the scheduler only has to grant the write lock and ignore the operation wi(O) without aborting Ti. This is the sufficient condition to produce a result similar to the one produced by a correctly
synchronized execution of the two conflicting write operations. To illustrate this fact, consider the following schedule: S = rl1(x) r1(x) dg1(x, Nil) wl2(x) w2(x) wl1(z) w1(z) c1 wl2(z) w2(z) c2. If a CL scheduler has already scheduled the operations of S and then receives wl1(z), T1 should be aborted, because a cycle would be produced in TCG(T1) if w1(z) were executed. However, the scheduler can "correct" S on-the-fly without aborting T1. It only needs to grant the write lock for T1 and ignore operation w1(z). If this rule is applied, the execution of S produces the same database state as does the execution of the following schedule S', which is a correct one: S' = rl1(x) r1(x) dg1(x, Nil) wl2(x) w2(x) wl1(z) w1(z) c1 wl2(z) w2(z) c2.
This ww-synchronization rule is called Thomas' Write Rule (TWR) [4] in the literature. To minimize the negative effects of aborting transactions, we introduce the TWR concept in the CL protocol. The key idea is to verify the applicability of the TWR whenever the scheduler receives a lock request for a write operation and this operation produces a cycle in the TCG of the transaction requiring the lock. For that reason, we have to extend the write locking protocol of Figure 2. The extended protocol is shown in Figure 4.

dg_Lock(trid, O, Lold, Lnew)
/* downgrading of locks */
dg_set ← dg_set ∪ {trid};
if Lnew = rl
  wl_set(O) ← wl_set(O) \ {trid};
  Read_Lock(trid, O);
else
  if Lold = rl
    rl_set(O) ← rl_set(O) \ {trid};
  else
    wl_set(O) ← wl_set(O) \ {trid};

Figure 3. Downgrading of locks.

Write_Lock(trid, O)
/* write locking protocol using TWR */
if wl_set(O) = ∅ and rl_set(O) = ∅
  if trid ∉ dg_set
    wl_set(O) ← wl_set(O) ∪ {trid};
  else
    check(TCG(trid));
    if exists cycle
      if apply_twr
        wl_set(O) ← wl_set(O) ∪ {trid};
        /* the lock is set, but the write operation is not executed */
      else
        reject(wl(O))
    else
      wl_set(O) ← wl_set(O) ∪ {trid};
else
  delay(wl(O));

Figure 4. Granting write locks applying TWR.
337
write operations. However, it is restrictive for long-living transactions with many write operations. The approach described above solves the cascading abort problem. However, it introduces other problems and restrictions. In fact, avoiding cascading aborts for long-living transactions is too expensive and perhaps impracticable. For that reason, several proposals introduce the notion of compensating transactions, instead of avoiding cascading aborts, as for example in [5, 8].
4 Related Work
As already mentioned, Altruistic Locking [2, 11] and Locks with Constrained Sharing [1] are also proposals based on the release of locks before a transaction
ends. The Altruistic Locking (AL) protocol extends 2PL by introducing the donate operation. A donate operation encapsulates the information that an object will not be used by a transaction. Hence, a donate operation converts a write or read lock to Nil. However, it does not have the same semantics as releasing a lock. If a transaction donates an object, it still holds a lock on the object. Locks are released according to the two-phase rule of the 2PL protocol. In order to guarantee serializability, the following condition, denoted the altruistic locking rule [11], must be ensured by the AL protocol: if a transaction Tj accesses an object donated by a transaction Ti, each operation opj(O) in Tj can only be executed if the object O has been donated by Ti or after the first unlock operation in Ti. Observe that the transaction Tj may wait for the end of Ti to execute an operation opj(O), although there is no lock on the object O and transaction Ti will never execute any operation on O. Hence, the altruistic rule is too restrictive. The CL protocol relaxes the altruistic locking rule. For that reason, we can say that the CL protocol provides a higher degree of inter-transaction parallelism than AL. In [6] we show that ALT ⊂ CL. Although the AL protocol has been proposed to increase concurrency when long transactions are processed, it may cause short transactions to wait for the end of long transactions. To overcome such a restriction, some extensions are proposed in [11]. These extensions are based on the pre-declaration of the access sets of transactions. However, because of the interactive nature of transactions in some advanced database applications (e.g., transactions in design activities), the access patterns of such transactions are not predictable. In the AL protocol, a donate operation downgrades read or write locks to Nil. On the other hand, the CL protocol provides a controlled downgrading of locks. The user can specify, for example, whether a write lock should be downgraded to a read lock or to Nil. Thus, the user can decide if an updated object can be seen by other transactions for read-only operations or for updates. The Locks with Constrained Sharing (LCS) protocol introduces the notion of ordered shared locks. The basic idea here is that two transactions may hold conflicting locks on the same object, if the following condition is ensured: the order in which the conflicting locks are acquired must be the same in which the corresponding operations are executed. For example, if rl1(x) < wl2(x), then r1(x) < w2(x). The lock wl2(x) is said to be on hold and it remains on hold until transaction T1
executes an unlock operation on x. In addition to the 2PL rules, this protocol ensures serializability through the following constraint: a transaction may not release any of its locks if it has locks on hold. This constraint, however, can be too restrictive. To illustrate this fact, consider that an LCS scheduler has already scheduled the following operations: S = wl1(x) w1(x) rl2(x) r2(x), where T1 = w1(x) r1(v) ... w1(z) c1; T2 = r2(x) c2. Consider that, after operation r2(x) is processed, the scheduler receives the operation c2. According to the LCS protocol, the execution of c2 must be delayed until T1 releases its locks on x. As in Altruistic Locking, we have here the situation in which short transactions may wait for the end of long transactions. In [6] we show that LCS ⊂ CL.
Figure 3: Relationship among different classes of schedules (a Venn diagram over the classes Serial, 2PL, LCS, ALT, CL, CSR and all schedules).
In Figure 3, we present a Venn diagram depicting the relationships among the classes of schedules produced by the protocols discussed in this paper. One may argue that, in our proposal, the rate of aborts increases. However, this problem also exists in the LCS and AL protocols. In a concurrency control mechanism using the LCS protocol, transactions should be aborted under the same conditions as in a CL protocol, more precisely, when non-serializable (incorrect) schedules are produced. For example, the following schedule may be produced by an LCS scheduler: S = w1(x) r2(x) w2(y) r1(y). In [1], such a phenomenon is called a deadly-embrace situation and the authors propose that one of the involved transactions should be aborted. The AL protocol increases the rate of aborts, since the frequency of deadlocks is increased. To show this fact, consider the following transactions: T1 = w1(x) donate1(x) w1(z) c1; T2 = r2(x) r2(y) c2; T3 = r3(z) w3(x) c3. Now, suppose that an AL scheduler has already scheduled the following operations: S = r3(z) w1(x) donate1(x) r2(x). If, after scheduling r2(x), the scheduler receives r2(y), this operation should be delayed until T1 ends (altruistic locking rule). Thus, transaction T2 should wait for the end of T1. On the other hand, T1 can only end if T3 ends, since T3 holds a lock on z and T1 is waiting for a lock on z in order to execute w1(z). In turn, transaction T3 should wait for the end of T2, because T2 holds a lock on x and T3 cannot execute w3(x). Therefore, transactions T1, T2 and T3 are involved in a deadlock. One of them has to be
aborted, possibly causing cascaded aborts. For example, if T1 is chosen to be aborted, T2 should also be aborted. Note that a similar deadlock situation is also produced if the scheduler receives w1(z) or w3(x) after scheduling r2(x). If the CL protocol were used to synchronize the transactions in such a scenario, no deadlock situation would have been induced.
5 Conclusions
In this paper, we have proposed an extension to the two-phase locking protocol which provides a higher degree of parallelism among transactions. We have shown that, although the Cooperative Locking protocol is more permissive than 2PL, it ensures serializability of schedules. Moreover, we have also shown that Cooperative Locking provides more concurrency among transactions than other proposals extending the 2PL protocol, such as Altruistic Locking and Locks with Constrained Sharing. Of course, our proposal does not completely solve all problems existing in the processing of long-living transactions. However, it can provide a higher degree of inter-transaction parallelism as compared to 2PL, Altruistic Locking and Locks
with Constrained Sharing.
References
[1] Agrawal, D. and Abbadi, A. E. Locks with Constrained Sharing. In Proceedings of the 9th ACM Symposium on PODS, pages 85–93, New York, 1990.
[2] Alonso, R., Garcia-Molina, H. and Salem, K. Concurrency control and recovery for global procedures in federated database systems. A quarterly bulletin of the Computer Society of the IEEE technical committee on Data Engineering, 10(3), 1987.
[3] Berenson, H., Bernstein, P., Gray, J., Melton, J., O'Neil, E. and O'Neil, P. A Critique of ANSI SQL Isolation Levels. In Proceedings of the 1995 ACM SIGMOD Conference, pages 1–10, June 1995.
[4] Bernstein, P. A., Hadzilacos, V. and Goodman, N. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
[5] Biliris, A., Dar, S., Gehani, N., Jagadish, H. V. and Ramamritham, K. ASSET: A System for Supporting Extended Transactions. In Proceedings of the 1994 ACM SIGMOD Conference, pages 44–54, May 1994.
[6] Brayner, A. Transaction Management in Multidatabase Systems. Shaker-Verlag, 1999.
[7] Eswaran, K.P., Gray, J.N., Lorie, R.A. and Traiger, I.L. The Notions of Consistency and Predicate Locks in a Database System. Communications of the ACM, 19(11):624–633, November 1976.
[8] Garcia-Molina, H. and Salem, K. SAGAS. In Proceedings of the ACM SIGMOD Conference, pages 249–259, 1987.
[9] Härder, T. and Rothermel, K. Concurrency Control Issues in Nested Transactions. VLDB Journal, 2(1):39–74, 1993.
[10] Korth, H. F. Locking Primitives in a Database System. Journal of the ACM, 30(1):55–79, 1983.
[11] Salem, K., Garcia-Molina, H. and Shands, J. Altruistic Locking. ACM Transactions on Database Systems, 19(1):117–165, March 1994.
The SH-tree: A Super Hybrid Index Structure for Multidimensional Data
Tran Khanh Dang, Josef Küng, and Roland Wagner
Institute for Applied Knowledge Processing (FAW), University of Linz, Austria
{khanh, jkueng, rwagner}@faw.uni-linz.ac.at
Abstract. Nowadays feature vector based similarity search is increasingly emerging in database systems. Consequently, many multidimensional data index techniques have been widely introduced to the database research community. These index techniques are categorized into two main classes: SP (space partitioning)/KD-tree-based and DP (data partitioning)/R-tree-based. Recently, a hybrid index structure has been proposed. It combines both SP/KD-tree-based and DP/R-tree-based techniques to form a new, more efficient index structure. However, weaknesses still exist in the techniques above. In this paper, we introduce a novel and flexible index structure for multidimensional data, the SH-tree (Super Hybrid tree). Theoretical analyses show that the SH-tree is a good combination of both techniques with respect to both presentation and search algorithms. It overcomes the shortcomings and makes use of their positive aspects to facilitate efficient similarity searches.
Keywords. Similarity search, multidimensional index, bounding sphere (BS), minimum bounding rectangle (MBR), super hybrid tree (SH-tree).
1 Introduction
Feature based similarity search has a long development history, which is still in progress now. Its application range includes multimedia databases [33], time-series databases [32], CAD/CAM systems [34], medical image databases [27], etc. In these large databases, feature spaces have usually been indexed using multidimensional data structures. Since Morton introduced the space-filling curves in 1966, many index structures have been developed. A survey schema that summarizes the history of multidimensional access methods from 1966 to 1996 has been presented in [1]. This summary and two recent publications [2, 19] show that multidimensional index techniques can be divided into two main classes. Index structures based on space partitioning (SP-based) or KD-tree-based, such as the kDB-tree [6], hB-tree [7], LSD-tree and LSDh-tree [8, 9], Grid File [10], BANG file [11], GNAT tree [29], mvp-tree [35], SKD-tree [28], etc. Index structures based on data partitioning (DP-based or R-tree-based) consist of the R-tree and its improved variants [12, 13, 14], X-tree [15], SS-tree [5], TV-tree [3], SR-tree [4], M-tree [20], etc. The remaining techniques, which cannot be categorized into the above schema, are called dimensionality reduction index techniques [19], like the Pyramid technique [16, 17], UB-tree [18], and space-filling curves
(see [1] for a survey). Recently, the Hybrid tree1 [2, 19], a hybrid technique, has been proposed. It is formed by combining both SP-based and DP-based techniques. For detailed explanations of this classification, see [1, 2, 19]. This paper is organized as follows: Section 2 discusses the motivations which lead us to introduce the SH-tree. Section 3 is devoted to discussing the structure and advanced aspects of the SH-tree. Section 4 presents update operations and query algorithms for the SH-tree. Section 5 gives conclusions and future work.
1 The internal node presentation idea is similar to the one introduced by Ooi et al. in 1987 for the Spatial KD-tree [28].
2 Motivations

The SR-tree [4] has shown superior performance over the R*-tree and the SS-tree by dividing the feature space into both small-volume regions (using bounding rectangles, BRs) and short-diameter regions (using bounding spheres, BSs). Nevertheless, the SR-tree suffers from a fan-out problem: its fan-out is only one third that of the SS-tree and two thirds that of the R*-tree [4]. The low fan-out causes SR-tree based searches to read more nodes and reduces query performance. This problem does not occur in KD-tree based index techniques, whose fan-out is constant for an arbitrary number of dimensions. Recently, the Hybrid tree [2, 19] has been introduced. It makes use of positive characteristics of both SP-based and DP-based index techniques: it relies on a KD-tree based layout for internal nodes (an internal node representation similar to the one introduced by Ooi et al. in 1987 for the spatial KD-tree [28]) and employs bounding regions (BRs) as hints for pruning while traversing the tree. To reduce accesses to unnecessary data pages, the Hybrid tree also applies a dead-space elimination technique by coding actual data regions (CADR) [9]. Although the CADR technique partly softens the unnecessary-disk-access problem, it is not an entirely efficient solution. It depends strongly on the number of bits used to code the actual data region and, in some cases, the technique brings no benefit regardless of how many bits are used to code the space. Figures 1a and 1b show such examples in 2-dimensional space: the whole region is coded irrespective of how many bits are used. Figure 1c shows an example where the benefit from coding the actual data region is negligible, especially for range queries, because of the high ratio of remaining dead space in the coded data region. Besides, when new objects fall outside the bounds of the feature space already indexed by the Hybrid tree, the encoded live space (ELS) [19] must be recomputed from scratch.
Furthermore, SP/KD-tree based index techniques commonly partition space recursively into two subspaces using a single dimension until the number of data objects in each subspace fits into a single data page, as in the Hybrid tree, the LSDh-tree, etc. This partitioning strategy quickly destroys data clustering, because objects stored in the same page may be "far away" from each other in the real space. This problem can significantly degrade search performance and increase the number of disk accesses per range query [1]. This is in contrast to DP/R-tree based index techniques such as the SS-tree and the SR-tree, which try to keep objects that are near each other in the feature space within the same data page. To alleviate these problems and retain the inherent advantages of the SR-tree (and of the R-tree based techniques as a whole), together with introducing several novel features, we will present the SH-tree in the next section. In the SH-tree, the fan-out problem is overcome by employing the KD-tree representation for the partitions of internal nodes. The data clustering problem mentioned above is alleviated by keeping an SR-tree-like structure for the balanced and leaf nodes of the SH-tree (cf. Section 3.1). Section 3 details these ideas.
Fig. 1. Some problems with coding the actual data region (panels a-c; legend: coded data space, dead space, coded dead space)
3 The SH-tree

This section introduces the SH-tree. We discuss how multidimensional space is split into subspaces and present the special hybrid structure of the SH-tree.

3.1 Partitioning Multidimensional Space in the SH-tree

Because the SH-tree is intended not only for point data objects but also for extended data objects, we do not choose overlap-free space partitioning. This approach easily handles objects that cross a selected split position and solves the storage utilization problem. The former was described for the SKD-tree [28], and the latter affected the kDB-tree, which shows disappointingly slow performance even in 4-dimensional feature vector spaces [21]. There are three node kinds in the SH-tree: internal, balanced, and leaf nodes. Each internal node i has the structure <d, lo, up, other_info>, where d is the split dimension, lo represents the left (lower) boundary of the right (higher) partition, up represents the right (higher) boundary of the left (lower) partition, and other_info holds additional information such as the number of data objects in its left and right children. While up = lo means no overlap between partitions, up > lo indicates that the partitions overlap. This structure is similar to the ones introduced in the SKD-tree [28] and the Hybrid tree [2]. The supplemental information also gives hints for developing a cost model for nearest neighbor search in high-dimensional spaces, for query selectivity estimation, etc. Moreover, let BRi denote the bounding rectangle of internal node i. The BR of its left child is defined as BRi ∩ (d ≤ up), where ∩ denotes geometric intersection. Similarly, the BR of its right child is defined as BRi ∩ (d ≥ lo). This allows us to apply algorithms used in DP/R-tree based techniques to the SH-tree.
Balanced nodes sit just above leaf nodes and are not hierarchical (Figure 2). Each of them has a structure similar to that of an internal node of the SR-tree. This is a specific characteristic of the SH-tree: it partly preserves data clustering, keeps the height of the SH-tree small, and exploits the SR-tree's superior aspects. Moreover, it shows that SH-trees are not simply binary-shaped as in KD-tree based techniques; they are also multi-way trees, as in R-tree based index techniques:
BN: <B1, B2, ..., Bn> (minBN_E ≤ n ≤ maxBN_E)
A balanced node consists of entries B1, B2, ..., Bn (minBN_E ≤ n ≤ maxBN_E), where minBN_E and maxBN_E are the minimum and maximum numbers of entries in the node. Each entry Bi keeps information about a leaf node and has four components, Bi: <BS, MBR, num, pointer>: a bounding sphere BS, a minimum bounding rectangle MBR, the number of objects in the leaf node num, and a pointer to it. Furthermore, computing the MBS (minimum BS) of a given set of objects is not feasible in a high-dimensional space, since the time complexity is exponential in the number of dimensions [25]. Therefore, the SH-tree uses MBRs together with (not necessarily minimal) BSs; see [4] for the BS calculation formula.
Fig. 2. A possible partition of a data space and the corresponding mapping to the SH-tree (internal nodes with their (d, lo, up) split values, e.g. (1, 6, 6), (2, 3, 4), (1, 3, 3), (2, 8, 8), (2, 5, 6); balanced nodes; leaf nodes 1-16; MBRs, BSs, and overlapping space)
Each leaf node of the SH-tree has the same structure as that of the SS-tree (because the SR-tree [4] is designed only for point objects, while the SH-tree is also planned for both points and extended objects):
LN: <L1, L2, ..., Ln> (minO_E ≤ n ≤ maxO_E)
A leaf node consists of entries L1, L2, ..., Ln (minO_E ≤ n ≤ maxO_E), where minO_E and maxO_E are the minimum and maximum numbers of entries in a leaf. Each entry Li: <obj, info> consists of a data object obj and accompanying information info such as its feature vector, the radius bounding the object's extent in the feature space, the object's MBR, etc. If the objects in the database are complex, obj is an identifier instead of the real data object. In addition, if the SH-tree is applied only to point data objects, each Li is similar to a leaf entry of the SR-tree, Li: <obj, feature vector>; in this case the other object information is no longer needed (for example, the radius is always zero and the MBR is the point itself).
Figure 2 shows a possible partition of a feature space and its corresponding mapping to the SH-tree. Assume we have a 2-dimensional feature space D with a size of (0,0,10,10). With (d, lo, up) = (1,6,6), the BRs of the left and right children of internal node 1 are BR2 = D ∩ (d ≤ 6) = (0,0,6,10) and BR3 = D ∩ (d ≥ 6) = (6,0,10,10), respectively. For internal node 2, (d, lo, up) = (2,3,4), BR4 = BR2 ∩ (d ≤ 4) = (0,0,6,4), BR5 = BR2 ∩ (d ≥ 3) = (0,3,6,10), and so on. The BR information is not stored in the SH-tree but is computed when necessary. Furthermore, the storage utilization of the SH-tree must ensure that each balanced node is filled with at least minBN_E entries and each data page contains at least minO_E objects. Therefore, each subspace corresponding to a balanced node holds N data objects, where N satisfies the following condition:

    minO_E × minBN_E ≤ N ≤ maxO_E × maxBN_E    (1)
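To make the node layout above concrete, the following is a minimal Python sketch (ours, not the authors' implementation; all class and function names are illustrative) of the three node kinds and of how the children's BRs are derived from an internal node by geometric intersection:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Rect:                         # axis-aligned rectangle: low[i] <= high[i]
        low: List[float]
        high: List[float]

    def left_br(br: Rect, d: int, up: float) -> Rect:
        """BR of the left (lower) child: BR intersected with (dimension d <= up)."""
        high = list(br.high)
        high[d] = min(high[d], up)
        return Rect(list(br.low), high)

    def right_br(br: Rect, d: int, lo: float) -> Rect:
        """BR of the right (higher) child: BR intersected with (dimension d >= lo)."""
        low = list(br.low)
        low[d] = max(low[d], lo)
        return Rect(low, list(br.high))

    @dataclass
    class InternalNode:                 # <d, lo, up, other_info>
        d: int                          # split dimension
        lo: float                       # left boundary of the right partition
        up: float                       # right boundary of the left partition
        left: object = None
        right: object = None
        other_info: dict = field(default_factory=dict)   # e.g. child object counts

    @dataclass
    class LeafEntry:                    # <obj, info>
        obj: object
        vector: List[float]
        radius: float = 0.0             # 0 for point objects
        mbr: Optional[Rect] = None

    @dataclass
    class LeafNode:                     # holds minO_E..maxO_E entries
        entries: List[LeafEntry] = field(default_factory=list)

    @dataclass
    class BalancedEntry:                # <BS, MBR, num, pointer>, as in the SR-tree
        bs_center: List[float]
        bs_radius: float
        mbr: Rect
        num: int
        pointer: LeafNode = None

    @dataclass
    class BalancedNode:                 # holds minBN_E..maxBN_E entries
        entries: List[BalancedEntry] = field(default_factory=list)

    # The 2-dimensional example from the text, with 0-based dimension indices:
    D = Rect([0, 0], [10, 10])
    print(left_br(D, 0, 6))    # Rect(low=[0, 0], high=[6, 10])
    print(right_br(D, 0, 6))   # Rect(low=[6, 0], high=[10, 10])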
3.2 The Extended Balanced SH-tree

For most index techniques based on the KD-tree, the tree structure is not balanced (e.g., the LSD/LSDh-tree, the SKD-tree); that is, some leaf nodes are farther away from the root than others. The experiments of [29] have shown that exact balance is not crucial for the performance of the index structure. In this section, we introduce a new concept for the balance problem in the SH-tree: extended balance. The motivation is to retain acceptable performance of the index structure while reducing the maintenance cost of keeping it exactly balanced. Suppose that p, b, b_min, and b_max denote the number of leaf nodes, the number of balanced nodes, and the minimum and maximum numbers of balanced nodes in the SH-tree, respectively. The following inequality holds:

    b_min = ⌈p / maxBN_E⌉ ≤ b ≤ ⌈p / minBN_E⌉ = b_max    (2)
The SH-tree's height h satisfies the following inequality:

    1 + ⌈log2 b_min⌉ ≤ h ≤ ⌈log2 b_max⌉ + 1    (3)
Inequality (3) is used to evaluate whether the SH-tree is "balanced" or not. The meaning of balance here is loose: it does not require the path length from the root to every leaf node to be equal. We call this extended balance in the SH-tree. If the height hl of each leaf node in the SH-tree satisfies (3), i.e. 1 + ⌈log2 b_min⌉ ≤ hl ≤ ⌈log2 b_max⌉ + 1, then the SH-tree is called an extended balanced tree (EBT); otherwise it is not a balanced tree. The extended balance concept generalizes the conventional balance concept: if inequality (3) collapses to 1 + ⌈log2 b_min⌉ = h = ⌈log2 b_max⌉ + 1, then an EBT becomes a conventional balanced tree (CBT). If minBN_E = 2 and maxBN_E = 3, the SH-tree in Figure 2 is neither a CBT nor an EBT; it is not a balanced tree. Inequality (3) can also be extended as follows:

    1 + ⌈log2 b_min⌉ − x ≤ h ≤ ⌈log2 b_max⌉ + 1 + x    (4)

or, in a more general form:

    1 + ⌈log2 b_min⌉ − x ≤ h ≤ ⌈log2 b_max⌉ + 1 + y    (5)
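As a concrete illustration of conditions (2)-(5), the following minimal sketch (ours, not the paper's) computes b_min, b_max and the admissible height range for given p, minBN_E, maxBN_E and the tolerance parameters x and y discussed next:

    from math import ceil, log2

    def height_bounds(p, minBN_E, maxBN_E, x=0, y=0):
        """Admissible SH-tree heights according to inequalities (2)-(5).
        p: number of leaf nodes; x, y: acceptable 'errors' (x = y = 0 gives the EBT case)."""
        b_min = ceil(p / maxBN_E)            # inequality (2)
        b_max = ceil(p / minBN_E)
        lower = 1 + ceil(log2(b_min)) - x    # inequalities (3)-(5)
        upper = ceil(log2(b_max)) + 1 + y
        return lower, upper

    def is_extended_balanced(h, p, minBN_E, maxBN_E, x=0, y=0):
        lower, upper = height_bounds(p, minBN_E, maxBN_E, x, y)
        return lower <= h <= upper

    # The example from the text: p = 16 leaves, minBN_E = 2, maxBN_E = 3
    print(height_bounds(16, 2, 3))            # (4, 4), i.e. 4 <= h <= 4
    print(height_bounds(16, 2, 3, x=1, y=1))  # (3, 5)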
In (4) and (5), x and y are acceptable "errors". These parameters give more flexibility to the SH-tree, but they must be selected carefully to avoid creating a tree that is too unbalanced. An SH-tree that does not satisfy (3) but satisfies (4) or (5) is called a loosely extended balanced tree (LEBT). For example, for the SH-tree in Figure 2, condition (3) becomes 4 ≤ h ≤ 4 (here b_min = ⌈16/3⌉ = 6 and b_max = 8). If the SH-tree satisfies this condition, it really is a CBT (and also an EBT). We can relax this condition with x = 1 and obtain the new condition from (4): 3 ≤ h ≤ 5. With respect to this new condition, the above SH-tree can be considered an LEBT. The parameters x (and y) in (4) (and (5)) depend on many attributes, say p, minBN_E, maxBN_E, and so on. If x is chosen suitably, the maintenance cost of the SH-tree decreases substantially without affecting query performance. In general, if the SH-tree fails to satisfy (4), it needs to be reformed. The reformation can reorganize the SH-tree entirely (also called dynamic bulk loading) or suitably change the splitting algorithm. Henrich has presented a hybrid split strategy for KD-tree based access structures [22]; it relies on a weighted average of the split positions calculated with two split strategies, data dependent and distribution dependent. Notice that the dynamic reformation operation usually incurs substantial costs, including both I/O accesses and CPU time. An efficient algorithm for SH-tree reformation is still an open problem.

3.3 Splitting Nodes in the SH-tree

In the context of dynamic databases, where the SH-tree is created incrementally and data objects can be added or deleted during that process, we present leaf node splitting and balanced node splitting in the SH-tree.
Leaf node splitting. The boundary of a leaf node in the SH-tree is the geometric intersection between its MBR and its BS, but the BS is isotropic and thus not suitable for choosing the split dimension. Therefore, the choice of split dimension depends on the MBR. This problem is solved in the same way as in the Hybrid tree, including overlap-free splitting. The selected split dimension must minimize the number of disk accesses. Without loss of generality, assume that the space is d-dimensional and that the extent of the MBR along the i-th dimension is ei, i ∈ [1,d]. Let the range query Q be a bounding box with each dimension of length r. Proceeding as in [4], we obtain the result that the split dimension is k if r / (ek + r) is minimal. Therefore, split dimension k is chosen such that its extent in the MBR is maximal, i.e. ek = max(ei), i ∈ [1,d]. The next step is to select the split position. First, we check whether it is possible to split in the middle without violating the utilization constraint. If this is impossible, we distribute the data items equally into two nodes. This also handles the special case identified for the hB-tree [7]; Figure 3 shows this case as an example in two-dimensional space.

Fig. 3. Assume the split dimension x is chosen and the minimum number of data objects in each partition is three. There is no suitable split position if we apply the approach of the Hybrid tree as described in [2]. In this case and similar ones, the SH-tree distributes the data items equally into two nodes.

Balanced node splitting. Because a balanced node has a structure similar to the internal nodes of the SR-tree and the R*-tree, the internal node splitting algorithm of the R*-tree [24] can be applied to split overfull balanced nodes of the SH-tree. In the SH-tree, however, if the sibling of an overfull balanced node is also a balanced node and is not yet full, then an entry of the overfull balanced node can be shifted to the sibling to avoid a split. This method also increases storage utilization [36, 9]. Thus, the modified splitting algorithm for balanced nodes can be concisely described as follows: first, try to avoid a node split as just discussed; if that fails, a split algorithm similar to that of the R*-tree is employed. Notice that, in the SH-tree, a balanced node split does not cause propagated splits upwards or downwards (so-called cascading splits [7], which occur in the kDB-tree [6]).
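A minimal sketch (ours; names and parameters are illustrative) of the leaf-split policy just described: pick the dimension with the maximum MBR extent, try the middle of that extent, and fall back to an equal distribution of the objects when the middle split would violate the utilization constraint.

    def choose_split_dimension(mbr_low, mbr_high):
        """Split dimension k with the maximum extent e_k = high[k] - low[k]."""
        extents = [h - l for l, h in zip(mbr_low, mbr_high)]
        return max(range(len(extents)), key=lambda i: extents[i])

    def split_leaf(entries, vectors, mbr_low, mbr_high, min_entries):
        """Return (left, right) entry lists for an overfull leaf node."""
        k = choose_split_dimension(mbr_low, mbr_high)
        mid = (mbr_low[k] + mbr_high[k]) / 2.0
        left = [e for e, v in zip(entries, vectors) if v[k] <= mid]
        right = [e for e, v in zip(entries, vectors) if v[k] > mid]
        if len(left) >= min_entries and len(right) >= min_entries:
            return left, right
        # Middle split violates utilization: distribute the entries equally,
        # ordered along dimension k (the hB-tree-like special case of Fig. 3).
        order = sorted(range(len(entries)), key=lambda i: vectors[i][k])
        half = len(entries) // 2
        return ([entries[i] for i in order[:half]],
                [entries[i] for i in order[half:]])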
4 The SH-tree Operations

4.1 Insertion

Let NDO be a new data object to be inserted into the SH-tree. First, the SH-tree is traversed from the root to locate the leaf node to which NDO will belong. The best candidate is the node whose MBR is closest to NDO (the distance metric used here is MINDIST, described in [26]); ties are broken based on the nodes' object counts. If there is an empty entry in this leaf, NDO is inserted. Conversely, if the leaf is full, one object of this leaf can be redistributed to a sibling that is not yet full, to make space for NDO. This idea is similar to that of [36], but it does not recursively propagate upwards: the siblings here are located locally within the balanced node. In fact, the predefined constant l of the algorithm in [36] corresponds to the current entry number (CEN) of the balanced node (minBN_E ≤ CEN ≤ maxBN_E). The parameter CEN of the SH-tree's redistribution algorithm differs from one balanced node to another; this is a difference from the algorithm presented in [36]. If a split is still unavoidable, it can propagate upwards by at most one level. Figure 4 illustrates split propagation in the SH-tree. There, assume leaf node P1 is selected for inserting an NDO and P1's entry count is already maxO_E. Moreover, suppose that the redistribution also fails. Consequently, P1 is split into P1' and P1''. Nevertheless, because maxBN_E = 2 (minBN_E = 1) in this example, the balanced node B1 is subsequently split into B1' and B1''. Finally, a new internal node N is created. The split process stops and does not propagate to the upper level (the root node R in this example).
Fig. 4. Split in the SH-tree (leaf node P1 splits into P1' and P1''; balanced node B1 splits into B1' and B1''; a new internal node N is created below the root R)
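The leaf-selection step of the insertion algorithm above can be sketched as follows (our illustration, not the authors' code): among the candidate leaf entries of a balanced node, take the one whose MBR has the smallest MINDIST to the new object, breaking ties by the smaller object count; MINDIST is the usual minimum distance from a point to an axis-aligned rectangle [26].

    def mindist(point, mbr_low, mbr_high):
        """Squared minimum distance from a point to an axis-aligned MBR."""
        d = 0.0
        for p, lo, hi in zip(point, mbr_low, mbr_high):
            if p < lo:
                d += (lo - p) ** 2
            elif p > hi:
                d += (p - hi) ** 2
        return d

    def choose_leaf(balanced_entries, new_vector):
        """balanced_entries: iterable of (mbr_low, mbr_high, num_objects, leaf)."""
        return min(balanced_entries,
                   key=lambda e: (mindist(new_vector, e[0], e[1]), e[2]))[3]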
4.2 Deletion

After determining which leaf node contains the object and removing the object, the leaf may become under-full (i.e., the number of objects kept in this leaf is less than minO_E). There are several solutions to this problem, as discussed in [23]: an under-full leaf node can be merged with whichever sibling requires the least enlargement, or its objects can be scattered among sibling nodes. Both can cause node splitting; in particular, the latter can lead to propagated splitting, e.g., the splitting of balanced nodes. The R-tree [23] employs a re-insertion policy instead of the two options above; the SR-tree, the SS-tree, the R*-tree, and the Hybrid tree also employ this policy. We propose a new algorithm to solve the under-full leaf problem, called eliminate-pull-reinsert. The algorithm resembles the eliminate-and-reinsert policy, but because reinsertion can cause splits of leaf and balanced nodes, after deleting the object we first apply a "pull" strategy if the leaf node is under-full: we take one object from a sibling, provided that the sibling still satisfies the utilization constraints. This follows the idea of Section 4.1, but in the opposite direction: while the under-full leaf here "pulls" one object from a sibling, the overflowing leaf in Section 4.1 "shifts" one object to a sibling. If the pull policy does not solve the problem, the objects of the under-full leaf node are reinserted. Note that the pull policy, too, only involves the siblings located in the same balanced node.

4.3 Search

The search operations of the SH-tree are similar to those of the SR-tree for the balanced and leaf nodes, and similar to those of the R-tree for the internal nodes. Because of space limitations, we do not present them here; for a detailed discussion see [31].
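A rough sketch (ours, with illustrative attribute names; not the authors' code) of the eliminate-pull-reinsert policy from Section 4.2: after a deletion leaves a leaf under-full, first try to pull one object from a sibling leaf in the same balanced node that can spare it; only if no sibling can, fall back to reinserting the leaf's remaining objects.

    def handle_underfull_leaf(balanced_node, leaf, min_o, reinsert):
        """balanced_node.entries: entries whose .pointer is a leaf with a list
        .entries of objects; reinsert: callback used for re-insertion."""
        if len(leaf.entries) >= min_o:
            return
        # Pull strategy: borrow one object from a sibling that stays above min_o.
        for entry in balanced_node.entries:
            sibling = entry.pointer
            if sibling is not leaf and len(sibling.entries) > min_o:
                leaf.entries.append(sibling.entries.pop())   # ideally the nearest object
                return
        # No sibling can spare an object: drop the leaf and reinsert its objects.
        balanced_node.entries = [e for e in balanced_node.entries
                                 if e.pointer is not leaf]
        for obj in leaf.entries:
            reinsert(obj)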
5 Conclusions and Future Work

In this paper, we introduced the SH-tree for indexing multidimensional data. The SH-tree is a flexible multidimensional index structure to support similarity searches in information systems. It is a well-combined structure of the SR-tree and the KD-tree based techniques, and it carries positive aspects of both the KD-tree and the R-tree families. While the fan-out problem of the SR-tree is overcome by employing a KD-tree-like representation for the partitions of internal nodes, the SH-tree still takes advantage of the SR-tree by using balanced nodes, which are the same as internal nodes of the SR-tree. Moreover, the tree operations of the SH-tree are similar to those of the R-tree family, with many modifications to adapt them to the new structure. We also introduced a new concept for the SH-tree, the extended balanced tree (EBT). It implies that SH-trees need not be exactly balanced: query performance does not deteriorate, while the maintenance cost for tree balance is reduced. As part of future work, we intend to compare the SH-tree to the SR-tree, the LSDh-tree, and other prominent multidimensional index structures such as the X-tree, SS-tree, and M-tree. We also plan to deploy the SH-tree for indexing features in similarity search systems [30].
References
1. V. Gaede, O. Günther. Multidimensional Access Methods. ACM Computing Surveys, Vol. 30, No. 2, June 1998.
2. K. Chakrabarti, S. Mehrotra. The Hybrid Tree: An Index Structure for High Dimensional Feature Spaces. Proc. of the 15th International Conference on Data Engineering, 1999. IEEE Computer Society.
3. King-Ip Lin, H.V. Jagadish, C. Faloutsos. The TV-Tree: An Index Structure for High-Dimensional Data. VLDB Journal, Vol. 3, No. 1, January 1994.
4. N. Katayama, S. Satoh. The SR-Tree: An Index Structure for High Dimensional Nearest Neighbor Queries. Proc. of the ACM SIGMOD International Conference on Management of Data, 1997.
5. D.A. White, R. Jain. Similarity Indexing with the SS-Tree. Proc. of the International Conference on Data Engineering, 1996. IEEE Computer Society.
6. J.T. Robinson. The k-D-B-Tree: A Search Structure for Large Multidimensional Dynamic Indexes. Proc. of the ACM SIGMOD International Conference on Management of Data, 1981.
7. D.B. Lomet, B. Salzberg. The hB-Tree: A Multiattribute Indexing Method with Good Guaranteed Performance. ACM Trans. on Database Systems, Vol. 15, No. 4, Dec. 1990.
8. A. Henrich, H.W. Six, P. Widmayer. The LSD Tree: Spatial Access to Multidimensional Point and Nonpoint Objects. Proc. of the 15th VLDB, August 1989.
9. A. Henrich. The LSDh-tree: An Access Structure for Feature Vectors. Proc. of the 14th International Conference on Data Engineering, 1998. IEEE Computer Society.
10. J. Nievergelt, H. Hinterberger, K.C. Sevcik. The Grid File: An Adaptable, Symmetric Multikey File Structure. ACM Trans. on Database Systems, Vol. 9, No. 1, March 1984.
11. M. Freeston. The BANG File: A New Kind of Grid File. Proc. of the ACM SIGMOD Annual Conference on Management of Data, 1987.
12. A. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. Proc. of the ACM SIGMOD Conference, 1984.
13. T.K. Sellis, N. Roussopoulos, C. Faloutsos. The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. Proc. of the 13th VLDB, September 1987.
14. N. Beckmann, H.P. Kriegel, R. Schneider, B. Seeger. The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990.
15. S. Berchtold, D.A. Keim, H.P. Kriegel. The X-tree: An Index Structure for High-Dimensional Data. Proc. of the 22nd VLDB, September 1996.
16. S. Berchtold, C. Böhm, H.P. Kriegel. The Pyramid Technique: Towards Breaking the Curse of Dimensionality. Proc. of the ACM SIGMOD International Conference on Management of Data, June 1998.
17. J. Küng, J. Palkoska. An Incremental Hypercube Approach for Finding Best Matches for Vague Queries. Proc. of the 10th International Workshop on Database and Expert Systems Applications, DEXA 99. IEEE Computer Society.
18. R. Bayer. The Universal B-Tree for Multidimensional Indexing. Technical Report TUM-I9637, November 1996. (http://mistral.informatik.tu-muenchen.de/results/publications/)
19. K. Chakrabarti, S. Mehrotra. High Dimensional Feature Indexing Using the Hybrid Tree. Technical Report, Department of Computer Science, University of Illinois at Urbana-Champaign. (http://www-db.ics.uci.edu/pages/publications/1998/TR-MARS-98-14.ps)
20. P. Ciaccia, M. Patella, P. Zezula. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. Proc. of VLDB 1997.
21. D. Greene. An Implementation and Performance Analysis of Spatial Data Access Methods. Proc. of the 5th International Conference on Data Engineering, 1989. IEEE Computer Society.
22. A. Henrich. A Hybrid Split Strategy for k-d-Tree Based Access Structures. Proc. of the 4th ACM Workshop on Advances in Geographic Information Systems, 1997.
23. A. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD, Proc. of Annual Meeting, June 1984.
24. N. Beckmann, H.P. Kriegel, R. Schneider, B. Seeger. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. Proc. of the ACM SIGMOD International Conference on Management of Data, 1990.
25. R. Kurniawati, J.S. Jin, J.A. Shepherd. The SS+-tree: An Improved Index Structure for Similarity Searches in a High-Dimensional Feature Space. SPIE Storage and Retrieval for Image and Video Databases V, San Jose, CA, 1997.
26. N. Roussopoulos, S. Kelley, F. Vincent. Nearest Neighbor Queries. Proc. of the ACM SIGMOD International Conference on Management of Data, 1995.
27. F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, Z. Protopapas. Fast Nearest Neighbor Search in Medical Image Databases. Proc. of VLDB 1996.
28. B.C. Ooi, K.J. McDonell, R. Sacks-Davis. Spatial kd-Tree: A Data Structure for Geographic Databases. Proc. of COMPSAC 87, Tokyo, Japan.
29. S. Brin. Near Neighbor Search in Large Metric Spaces. Proc. of VLDB 1995.
30. FAW Institute, Johannes Kepler University Linz. VASIS - Vague Searches in Information Systems. (http://www.faw.at/cgi-pub/e_showprojekt.pl?projektnr=10)
31. D.T. Khanh, J. Küng, R. Wagner. The SH-tree: A Super Hybrid Index Structure for Multidimensional Data. Technical Report, VASIS Project. (http://www.faw.uni-linz.ac.at)
32. C. Faloutsos, M. Ranganathan, Y. Manolopoulos. Fast Subsequence Matching in Time-Series Databases. Proc. of the ACM SIGMOD International Conference on Management of Data, 1994.
33. T. Seidl, H.P. Kriegel. Efficient User-Adaptable Similarity Search in Large Multimedia Databases. VLDB 1997.
34. S. Berchtold, H.P. Kriegel. S3: Similarity Search in CAD Database Systems. Proc. of the ACM SIGMOD International Conference on Management of Data, 1997.
35. T. Bozkaya, M. Ozsoyoglu. Indexing Large Metric Spaces for Similarity Search Queries. ACM Transactions on Database Systems, Vol. 24, No. 3, September 1999.
36. A. Henrich. Improving the Performance of Multi-Dimensional Access Structures Based on k-d-Trees. Proc. of the 12th International Conference on Data Engineering, 1996.
Pyramidal Digest: An Efficient Model for Abstracting Text Databases

Wesley T. Chuang and D. Stott Parker
Computer Science Department, UCLA, Los Angeles, CA 90095, USA
{yelsew, [email protected]
Abstract. We present a novel model of automated composite text digest, the Pyramidal Digest. The model integrates traditional text summarization and text classification in that the digest not only serves as a "summary" but is also able to classify text segments of any given size, and answer queries relative to a context. "Pyramidal" refers to the fact that the digest is created in at least three dimensions: scope, granularity, and scale. The Pyramidal Digest is defined recursively as a structure of extracted and abstracted features that are obtained gradually (from specific to general, and from large to small text segment size) through a combination of shallow parsing and machine learning algorithms. There are three noticeable threads of learning taking place: learning of characteristic relations, rhetorical relations, and lexical relations. Our model provides a principle for efficiently digesting large quantities of text: progressive learning can digest text by abstracting its significant features. This approach scales, with complexity bounded by O(n log n), where n is the size of the text. It offers a standard and systematic way of collecting as many semantic features as possible that are reachable by shallow parsing. It enables readers to query beyond keyword matches.
1 Introduction

When facing enormous volumes of text information, the fact that syntactic analysis does not scale has encouraged finding an alternative way of determining large-scale meaning without syntax: to consider shallow parsing and to gauge "understanding" on the basis of query relevance and learning accuracy. With this in mind, in this paper we shift the syntax-then-semantics mentality to a semantics-then-syntax one. Our bias is that "microscopic" syntactic categories should play only a minimal role in determining the meaning of lengthy documents, given that we wish to retrieve exact information as well as "understand" the text at any level. These goals together require the system to retrieve keywords as well as an abstract "context."
Putting syntactic structure temporarily aside, however, does not immediately protect us from the complexity of semantics. Viewing text documents as long segments formed by consecutive words, we can assign meanings to any subsegment. A word has many possible meanings and it can relate to other words. A
segment of words, too, can be abstracted with meanings and can be related to other segments. Based on the goals of retrieving keywords as well as context, we need a model that digests the original text efficiently and accurately by analyzing these semantic relations and by extracting as many features as possible. Because the relations are too numerous, the best way to accomplish this is with a machine learning approach. The advantage of a learning approach is that we can make good trade-offs between efficiency and accuracy: learning is possible even with minimal data, and results can be improved with experience. We regard whatever is learned as a digest or abstract.
This introduces a new kind of summarization, both from the point of view of its function and of its representation. In the past, text summarization has focused on natural language generation, or on extracting sentences from the text. By contrast, our approach to text digesting permits summarization to be integrated with text classification, in which the goal is to determine the topic or category of a text document as well as the meaning of a text segment of any size. Our notion of digest has several novelties. For one, our digest takes into account the visual effect of a structure: there is no doubt that pictures can reduce cognitive load, and so can structures. A second novelty is that our digest can take a new composite or vectorized form; specifically, we represent digests with a combination of text segments of different lengths. A third novelty is that we push summarization one step further, so that abstractions of the combined text segments form the digest.

Fig. 1. An example digest in XML (a nested structure with elements such as <extractplus>, <abstract>, <segments>, <sentences>, <satellite>, and <example>, containing word stems like "pain chest headach hurt sever", "protection body_part ache protection", "chest pain correl diagnosi littl", and "body_part ache")
Example 1. In Figure 1, we illustrate one level (from a multilevel) of an example digest in XML after processing a small portion of a book (see [9]). A digest has a nested format and contains segments of various sizes. It is a composite summary of the original text. In later sections, we explain how this digest is obtained. We can see that the sample digest above has a hierarchical or nested nature. Every tag represents a "concept." A concept has both an <extractplus> and an <abstract>. <extractplus> contains extracted words that digest a scope; the reason for the "plus" is that this extract also includes additional synonyms, i.e., it is not extracted purely from the original text. <abstract>, on the other hand, contains words that abstract the scope (the portion of the book) covered by this digest. Under <segments>, the digest contains sentence segments from the original text that are considered important (for example, with rhetorical significance such as <satellite> or <example>). This novel way of extracting and abstracting text features across several dimensions (scale, granularity, and scope) makes the resulting digest not only a composite summary but also a kind of contextual "index," which differs from a traditional inverted index or B-tree index. This contextual index is very convenient for context retrieval.
2 Related Work

A concept hierarchy or taxonomy is usually used to determine (or classify) the topic of documents, or of text granularities larger than documents [1]. We show a way to extend this classification capability down to the sentence-segment and word granularities. At the sentence granularity, previous work has mostly considered extracted sentences to be text summaries [11, 5]. The structural aspect of the text was exploited in [3]. Other summarization systems have been concerned with generating a coherent summary [8, 12]. [10] made use of machine learning algorithms to determine the sentences to be extracted, and [3] combined the structural aspect with learning and extracted sentence segments as a summary. Our work, though, integrates the summary into a so-called digest that serves as a "summary" as well as an "index." Research on lexical relations in WordNet [6] has drawn interest in word sense identification and word sense disambiguation [13]. Hirst et al. use lexical chains (chains of words) to represent context [7]. Their work differs from ours in that we only consider word senses for sufficiently representative words, which are merged into the digest. In addition, context is very important for producing a summary or retrieving information, as shown in [14]. With our digest, context can be efficiently captured and represented by segments of various sizes connected with some strength along several dimensions.
3 A Pyramidal Digesting Model

3.1 Relationships of Text at Different Granularities
Viewing text documents as long segments and their subsegments, we obtain a hierarchy of text segment sizes. A digest built from this hierarchy can then reflect the whole spectrum of segment sizes. Every segment of text is potentially the basis for part of the summary, and it may participate in many complicated semantic relations. Because of this, we have strived to employ methods that can manage large texts well, i.e., methods that scale. Surface ("shallow") parsing thus becomes a reasonable choice, but we also need a method that can extract sufficient semantics from the text "surface".
Definition 1. Consider a document C (where multiple documents can be lumped into one vector) as a vector <w1, w2, ..., wn>, where wi is either a word or a discourse marker (e.g. punctuation, or cue words such as "because", "but", etc.).
Definition 2. A text segment S is any sequence of consecutive words in C. That is, S = C[i:j], where i ≤ j and 0 ≤ i, j ≤ |C|.
Definition 3. A semantic relation R is a binary relation on text segments, i.e., R(Si, Sj), where Si, Sj are text segments.
The number of relations can grow exponentially. In the interest of simplicity, we present only a few relations.
- Above paragraph granularity: We focus on text segments of a size comparable to descriptions of basic "concepts" in the hierarchy [2]. Intuitively, the meaning of any two chapters can be described by how they differ from each other with respect to certain relations. Such characteristic relations, if mapped into a concise form, not only indicate their characteristic differences but also serve as a summary of connections among text segments at this granularity, which we call blocks.
- At sentence-segment granularity: We concentrate on segments that are enclosed by discourse markers [3] (a small segmentation sketch follows this list). There is an enormous number of possible semantic relations; fortunately, in practice it is sufficient to analyze only those segments that are separated by discourse markers (e.g. punctuation, or cue words such as "because", "but", etc.). Some words in a sentence signal so-called "rhetorical relations" between sentence segments; [3] gives a detailed account.
- At word granularity: We work on the semantic relations of a small number of representative words. Whether in a given document or in a dictionary, there are many relations between words, such as synonymic and hypernymic relations [6]. Two synonyms in a summary can be grouped into one; several words that depict one idea can be "elevated" into an abstraction, perhaps represented by another word.
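As a concrete illustration of sentence-segment granularity, the following small sketch (ours, with a hand-picked marker list rather than the authors') cuts text into sentence segments at punctuation and cue words:

    import re

    # A small, illustrative set of cue-word discourse markers; punctuation is
    # handled separately below.
    CUE_WORDS = {"because", "but", "although", "whereas", "however"}

    def sentence_segments(text):
        """Split text into sentence segments at punctuation and cue words."""
        tokens = re.findall(r"\w+|[.,;:!?]", text.lower())
        segments, current = [], []
        for tok in tokens:
            if tok in CUE_WORDS or re.fullmatch(r"[.,;:!?]", tok):
                if current:
                    segments.append(" ".join(current))
                current = []
            else:
                current.append(tok)
        if current:
            segments.append(" ".join(current))
        return segments

    print(sentence_segments(
        "Chapter 1 introduces pain, the most common symptom of disease, "
        "whereas Chapter 2 discusses chest pain."))
    # ['chapter 1 introduces pain', 'the most common symptom of disease',
    #  'chapter 2 discusses chest pain']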
3.2 Pyramidal Digest

Definition 4. A Digest Di at level i, consisting of a collection of text segments, is induced from the previous level Di-1 of text segments:

    Di = Di-1 − Σ(j=1..|Di-1|) f(Sj)·Sj + Σ(j=1..|Di-1|) g(Sj)·(h(Sj) − Sj)

where h: S → S is an abstraction function and f, g are machine learning functions whose definitions follow. Here − and + are difference and insertion operators, respectively, and · is multiplication.
From the definition, a digest is made by taking text segments out of the original document. Two operators are involved in shaping the digest, as defined in the following.
Fig. 2. A Pyramidal (multilevel) digest (dimensions: scale as "Percent of Original", e.g. 50%, 30%, 15%, 5%; scope; granularity; one plane of the pyramid is the digest D at one level)
Definition 5. Word occupation is a machine learning function f: S × S1 × S2 × ... × Sn-1 × R1 × R2 × ... × Rm → {0, 1} that determines whether a segment S can occupy its original position or be eliminated in the next level of the digest. Here n and m are the number of text segments and the number of semantic relations considered, respectively.
Definition 6. Word replacement is a machine learning function g: Sn × Rm → {0, 1} that determines whether a segment can be replaced with another segment containing abstract words.
Definition 7. A Multilevel Digest is an instance of M = (D, f, g, ⪯), which is a hierarchical structure, where ⪯ is a partial order defined over D.
Documents, before being digested, are represented as the digest at the base level. To get a digest at a level above, we permit a machine learning algorithm to learn which text segments should be taken out (by word occupation) and which segments should be replaced (with word replacement). Since the two operators only reduce the number of words, it can be proved that digests obey a partial order.
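A minimal sketch (ours, not the authors' implementation) of how one digest level is derived from the previous one with the two operators, under the reading that f(S) = 1 marks a segment for elimination and g(S) = 1 marks it for replacement by its abstraction h(S):

    def next_level(segments, f, g, h):
        """segments: text segments at level i-1; returns the digest at level i."""
        digest = []
        for s in segments:
            if f(s):
                continue                        # word occupation: segment eliminated
            digest.append(h(s) if g(s) else s)  # word replacement: abstracted or kept
        return digest

    # Toy usage with hand-written stand-ins for the learned functions:
    f = lambda s: len(s.split()) <= 2                 # eliminate very short segments
    g = lambda s: "pain" in s                         # abstract pain-related segments
    h = lambda s: s.replace("chest pain", "symptom")  # crude abstraction
    level0 = ["the patient reports chest pain", "see figure", "headache is common"]
    print(next_level(level0, f, g, h))
    # ['the patient reports symptom', 'headache is common']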
3.3 A Learning Model Unifying Classification and Summarization
One of the great principles of summarization, essentially Occam's razor, is that the simplest description is preferred. This is implicit in our machine learning approach, in which simpler and more abstract digests are produced level after level. At every granularity of summarization, it is classification that governs the learning process. At a high granularity, we classify big text segments, finding their characteristic differences. At the granularity of sentence segments, we observe their rhetorical differences and classify them into essential and non-essential sentence segments.

Table 1. Classify block-size segments into target concepts.
    word1, word2, word3, ..., wordk-1, wordk, ...                        Concept
    pain, common, symptom, ..., management, trigger, ...                 1
    correlation, severity, chest, ..., pain, cause, ...                  2
    pain, somatic, skin, ..., tissue, neuropath, ...                     1
    mucosal, muscle, ..., inflammation, hollow, ...                      3
    abdominal, substernal, pressure, meals, ..., emotion, arousal, ...   2
    chronic pain, distraught, ..., migraine, headache, ...               1
    ...                                                                  ...
At the granularity of word segments, we look into their sense differences and classify them into synonyms and hypernyms that are compatible with the context. As shown in Tables 1-3, targets may be binary- or multiple-valued, depending on how many target classes there are. During learning, target values are known because they are labeled either implicitly or by a human. Then classification, and essentially only classification, is used to predict the target class.

Table 2. Classify sentence segments into targets.
    Segment ID   title words   term freq   ...   antithesis   cause   ...   Important
    1            1             2.3         ...   1.5          0       ...   Y
    2            0             0.5         ...   0            -3.5    ...   N
    3            2             5.7         ...   0            0.5     ...   N
    4            2             7.2         ...   -3.5         0       ...   Y
    5            0             1.1         ...   0            0.33    ...   N
    6            0             3.5         ...   2.5          0       ...   N
    ...          ...           ...         ...   ...          ...     ...   ...

Example 2. Specifically,
in Table 1, a block segment containing certain words can be classified into a target concept. In Table 2, rhetorical as well as other features of a sentence segment can be classified into either the important or the unimportant class. In Table 3, judging from its neighboring words, a word's sense can be classified in a way that agrees with the context. These classification processes can be very efficient, with complexities bounded by O(n log n) time, as explained in [2, 4].

Table 3. Classify words into target senses.
    word       neighbor1    neighbor2   ...   neighbork-1   neighbork   ...   Sense
    chest      pulmonary    embolism    ...   aortic        dissection  ...   1
    migraine   acute        ...         ...   ergotamines   aspirin     ...   2
    headache   pain         sensory     ...   viscera       nerve       ...   1
    symptom    pain         common      ...   cause         factor      ...   1
    immune     tool         medicine    ...   infectious    disease     ...   1
    tension    onset        bilateral   ...   tight         band        ...   3
    ...        ...          ...         ...   ...           ...         ...   ...
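To make the role of classification concrete, here is a small self-contained sketch (ours; the authors' system instead uses accuracy feedback [2], C4.5, Naive Bayes, or DistAl [3]) that classifies segments into target concepts with a nearest-centroid rule over bag-of-words vectors:

    from collections import Counter, defaultdict
    import math

    def bow(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b.get(w, 0) for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def train_centroids(labeled_segments):
        """labeled_segments: list of (text, concept_label) pairs."""
        centroids = defaultdict(Counter)
        for text, label in labeled_segments:
            centroids[label].update(bow(text))
        return centroids

    def classify(text, centroids):
        v = bow(text)
        return max(centroids, key=lambda label: cosine(v, centroids[label]))

    train = [
        ("pain common symptom management trigger", "pain"),
        ("chest pain correlation severity cause", "chest-pain"),
        ("tension headache bilateral tight band", "headache"),
    ]
    centroids = train_centroids(train)
    print(classify("severity and cause of chest pain", centroids))  # chest-pain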
4 Learning to Create a Digest
We now explain how learning takes place at various granularities of the text body, and how the summaries created for each granularity are combined.
Example 3. Examine the following text segment:
"Chapter 1 introduces pain, the most common symptom of disease, whereas Chapter 2 discusses chest pain, for which there is little correlation between severity and cause."
We break it into three different granularities, namely document (paragraph here), sentence segment, and word, as shown in Figure 3. Then the encoded relations are subjected to classification (or learning), as illustrated in Tables 1-3.

Fig. 3. Relations for machine learning to learn: (A) a characteristic relation between the chapter-level segments "Chapter 1 introduces pain, the most common symptom of disease." and "Chapter 2 discusses chest pain, for which there is little correlation ... severity and cause."; (B) a rhetorical relation between the sentence segments "Chapter 2 discusses chest pain" and "for which there is little correlation ..."; (C) lexical relations (synonyms and hypernyms) taken from a dictionary, e.g. cause → reason, origin; chest → thorax, body_part; chapter → section.
These three threads, which learn relations at different granularities, create a digest for one level of the "pyramid". There is another dimension, seen as the vertical "scale" dimension in Figure 2. It can be described as admitting fewer and fewer concepts into the target summary, so that a digest is formed gradually, from specific to general.
4.1 Scale Digest in All Levels

A popular Microsoft product, the AutoSummarizer, provides the capability of producing a summary whose length is any given percentage of the original. If we stack these summaries, from higher percentage to lower, they, too, form a "pyramid" as shown in Figure 2. This pyramid, however, is different from ours: Microsoft summaries consist of sentences extracted from the original, while ours contain segments of different sizes and are both extracted and abstracted, i.e., they include words not in the original text. Nevertheless, we can compare our digest to Microsoft summaries by viewing only the sentence-segment summaries inside the digest. Sentence segments are clauses; they are about the same size as sentences. In [3], a comparison was made for one scale; we use the same data as in [3] but make more comparisons in the next section for different scales.
4.2 Supervised versus Unsupervised Learning

To create summaries of different scales for our pyramidal digest, we use both supervised and unsupervised methods. The pseudocode in Figure 4 illustrates how either supervised or unsupervised learning is applied to create the pyramidal digest that encompasses different granularities and different scales of summaries. Supervised learning requires humans to manually label data and is considered by many to be too expensive, but there are ways to improve it; for example, labeled data can be obtained by feeding initial data to some heuristic function or to existing search engines.
4.3 A Comparison with the MS Word AutoSummarizer

In the following, we take only a portion (i.e., sentence segments) from several levels (i.e., scales of 50%, 30%, 15%, and 5%) of our pyramidal digest, with both supervised and unsupervised learning, and compare the results with summaries produced by the Microsoft AutoSummarizer.
For every scale, the Microsoft Word summary performs the worst in every category: average accuracy, precision, and recall. Note that the Microsoft summary is only compared to a portion of our pyramidal digest. In Figure 5(a), we averaged the overall precision and recall and plotted it for the methods at different scales. It can be seen that all methods drop as the scale approaches zero. There was an unusual spike for unsupervised learning near the lower end of the scale; this is because very few segments appear in the denominator of the precision and recall equations, so the values often come out as either 100% or 0%. Figure 5(b) shows the average test accuracy of the different methods at different scales. Microsoft receives the lowest. As the scale approaches zero, somewhat against our intuition, accuracy increases. This is because accuracy counts not only those items that are correctly retrieved but also those that are correctly not retrieved.

    for (scale = 50% downto 5%)
        if (unsupervised)
            Construct naturally nested (concept) structure.
            Remember text segments in each "concept".
        else
            Manually create concept hierarchy.
            Manually label text segments into each concept.
        end
        Compute TFIDF feature vectors for each concept.
        Apply [2]'s accuracy feedback algorithm.
        for every concept
            Obtain subset features by this scale.
            Find rhetorical relations of sentence segments.
            if (unsupervised)
                Apply [3]'s heuristic function by this scale.
            else
                Manually label segments of this scale.
                Apply C4.5 or Naive Bayesian or DistAl [3].
            end
            Add segments to the digest.
            for top k = scale × |TFIDF| rep. words
                if (unsupervised)
                    Perform classification phase of the accuracy
                    feedback algorithm at word granularity.
                else
                    Manually label word senses.
                    Accuracy feedback at word granularity.
                end
                Add the -nyms to the digest feature vector.
            end
        end
    end

Fig. 4. Learning to create the pyramidal digest.

5 Contextual Retrieval
Fig. 5. Evaluation at different scales: (a) 0.5*(Recall+Precision) (%) versus Scale (%), and (b) Accuracy (%) versus Scale (%), for the MS Word, unsupervised, and supervised methods.

A benefit of this pyramidal digest is that it better reflects context. Digests along the "scope" dimension draw attention to segments from different positions in the sequence, and these different positions correspond to different contexts. Digests along the "granularity" dimension highlight connections between context and text segments of different sizes. Last, digests along the "scale" dimension capture differences in user perceptions of context; this coincides with the assumption that different users want different sizes of summary. Our model produces a digest by focusing only on nearby context and never wastes time considering far-reaching relationships. After learning, the multilevel digest consists of text segments connected to others by some strength measure. This enables a given query string to be matched against the context, part of which is abstract information.
5.1 Abstract Information
Such abstract information is obtained in two ways. One kind of abstract information consists of hypernyms, which are inferred by following WordNet's lexical hierarchy [6] after we learn the correct sense at the word granularity. The other kind is learned during supervised learning, when we permit abstract terms in the model summary. Contextual retrieval exploits this abstract information so that queries are carried out beyond keyword matches.
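For the hypernym side, a minimal sketch using NLTK's WordNet interface (our illustration, assuming NLTK and its WordNet data are installed; not the authors' code). Once a word's sense is known, its synonyms and hypernyms can be collected as abstract terms:

    from nltk.corpus import wordnet as wn

    def abstract_terms(word, sense_index=0):
        """Return (synonyms, hypernyms) for one sense of `word`.
        In the paper the sense is learned from context; here it is a parameter."""
        synsets = wn.synsets(word)
        if not synsets:
            return set(), set()
        sense = synsets[sense_index]
        synonyms = {l.name() for l in sense.lemmas()}
        hypernyms = {l.name() for h in sense.hypernyms() for l in h.lemmas()}
        return synonyms, hypernyms

    print(abstract_terms("chest"))   # synonyms and hypernyms of one sense of 'chest'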
6 Summary and Discussion

When evaluated according to the criteria in Section 4.3, we can see that our digest produced strictly better results than the Microsoft summary. The sentence-segment portion of the digest performs better than Microsoft's summary in the categories of recall, precision, and learning accuracy. But sentence segments are not the only information in our digest: the digest also contains extracted, abstracted, synonymized, etc., segments of various scopes in the original. The experiments also indicate several trade-offs. First, there is a trade-off between supervised and unsupervised learning. Depending on how much humans can be involved in the process, the digest can be improved; even when humans are not present, learning can still proceed without sacrificing much performance. The second trade-off we have learned from the experiment is between
retrieval relevance and scale (see Figure 5). We may prefer shorter and more abstract digests, but it becomes more and more difficult to obtain such a digest as the scale is reduced. Third, from an information retrieval point of view, a higher-level digest will retrieve more abstract information but, at the same time, more irrelevant information. Nevertheless, the digest, or composite summaries, capture much information in a concise form. They basically index all "important" parts of the original text and can be used as an index for contextual retrieval as described in Section 5. In addition, the pyramidal digest has many interesting potential applications, such as XML summarizers, search engines, e-Books, etc.
References
1. Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, and Prabhakar Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In Proceedings of the 23rd VLDB Conference, 1997.
2. Wesley Chuang, Asok Tiyyagura, Jihoon Yang, and Giovanni Giuffrida. A fast algorithm for hierarchical text classification. In Proceedings of the DaWaK Conference, 2000.
3. Wesley Chuang and Jihoon Yang. Extracting sentence segments for text summarization: A machine learning approach. In Proceedings of the 23rd SIGIR Conference, 2000.
4. Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, 2000.
5. H.P. Edmundson. New methods in automatic extracting. Journal of the ACM, 16(2):264-285, 1969.
6. Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. The MIT Press, 1998.
7. Graeme Hirst and David St-Onge. WordNet: An Electronic Lexical Database, chapter Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms, pages 305-332. The MIT Press, 1997.
8. Eduard Hovy and Chin-Yew Lin. Advances in Automatic Text Summarization, chapter Automated Text Summarization in SUMMARIST. MIT Press, 1999.
9. Kurt Isselbacher, Eugene Braunwald, Jean Wilson, Joseph Martin, Anthony Fauci, and Dennis Kasper, editors. Harrison's Principles of Internal Medicine. McGraw-Hill, 13th edition, 1994.
10. Julian Kupiec, Jan O. Pedersen, and Francine Chen. A trainable document summarizer. In Proceedings of the 18th ACM SIGIR Conference, pages 68-73, 1995.
11. H.P. Luhn. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159-165, 1958.
12. Dragomir R. Radev and Kathleen McKeown. Generating natural language summaries from multiple on-line sources. Computational Linguistics, 24(3):469-500, 1998.
13. Mark Sanderson. Word sense disambiguation and information retrieval. In Proceedings of the SIGIR Conference, pages 142-151, 1994.
14. Ayse P. Saygin and Tuba Yavuz. Query processing in context-oriented retrieval of information. In Joint Conference on Intelligent Systems, 1998.
A Novel Full-Text Indexing Model for Chinese Text Retrieval

Shuigeng Zhou, Yunfa Hu* and Jiangtao Hu*
State Key Lab of Software Engineering, Wuhan University, Wuhan, 430072, China
*Department of Computer Science, Fudan University, Shanghai, 200433, China

Abstract. Text retrieval systems require an index to allow fast access to documents at the cost of some storage overhead. This paper proposes a novel full-text indexing model for Chinese text retrieval based on the concept of the adjacency matrix of a directed graph. Using this indexing model, retrieval systems need to keep only indexing data, rather than indexing data and original text data as traditional retrieval systems do, so the system's space cost as a whole can be reduced drastically while retrieval efficiency remains satisfactory. Experiments over five real-world Chinese text collections are carried out to demonstrate the effectiveness and efficiency of this model.

This work was supported by China Postdoctoral Science Foundation and the Natural Science Foundation of China (No. 60003016).
1 Introduction

With the rapid growth of electronic Chinese documents published in Mainland China, Taiwan, Singapore, etc., there is an increasing need for Chinese text retrieval systems that support fast access to large amounts of text documents. Full-text retrieval systems are a popular way of providing support for on-line text access. From the end-user point of view, full-text searching of on-line documents is appealing because a valid query is just any word or sentence of the document. Generally, full-text retrieval systems have an index to allow efficient retrieval of documents. Many text-indexing methods have been developed and used, such as inverted lists [1], signature files [2], PAT trees [3], and PAT arrays [4]. Although word-based indexing is widely used for English [1], it is not easily applied to Chinese, because written Chinese text has no delimiters to mark word boundaries. The first step toward word-based indexing of Chinese text is to break a sequence of characters into words, which is called word segmentation. Word segmentation is known to be a difficult task, because accurate segmentation of written Chinese text may require deep analysis of the sentences [5]. Character-based indexing methods, on the other hand, do not depend on word segmentation and are thus suitable for Chinese text. Recently, there has been increasing research on character-based indexing for text retrieval of Chinese and other Oriental languages [6-7]. However, character-based full-text indexing costs too much storage space, because each character in the text database is indexed and its positional information is stored to support exact searching. It is therefore unfavorable for application areas where storage resources are very limited, e.g., CD-based text retrieval systems.
This paper presents a novel full-text indexing model for Chinese text retrieval. By treating a text database as a directed graph and extending the concept of the adjacency matrix of a directed graph, we propose an adjacency-matrix based full-text indexing model. Using this model, the retrieval system's space cost as a whole can be cut down drastically while retrieval efficiency remains satisfactory. The only precondition for the proposed model is that sufficient main memory is available to hold an in-memory adjacency matrix. Given this precondition, the model can support efficient string searching in large text databases. In Section 2, we describe the novel full-text indexing model. In Section 3 we introduce the implementation techniques. We present the experimental results in Section 4 and conclude the paper in Section 5.
2 The Model

We begin with a Chinese character set Σ: a finite set of Chinese characters, letters, digits, punctuation marks, and other symbols that may occur in Chinese text documents. A text string, or simply string, over Σ is a finite sequence of characters from Σ. The length of a string is its length as a sequence; we denote the length of a string w by |w|. Alternatively, a string w can be considered as a function w: {1,...,|w|}→Σ; the value of w(j), where 1≤j≤|w|, is the character in the j-th position of w. To distinguish identical characters at different positions in a string, we refer to them as different occurrences of the character. That is, the character l∈Σ occurs in the j-th position of the string w if w(j)=l.
Definition 1. Given a string w over Σ, let V⊆Σ be the set of unique characters in w. There exists a directed graph TDG = <Vg, Eg>, where Vg is a set of vertices with Vg = V, i.e., each character in V corresponds to a vertex in Vg, and Eg is a set of directed edges, each of which corresponds to a bigram appearing in w, with its direction pointing from the first character to the second one. Because a character may occur at different positions in a string, the directed graph of a string is usually a directed cyclic graph.
Example 1. Consider a Chinese text string w1: “ ”; its directed graph is illustrated in Fig. 1, where Vg = {“ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”} and Eg = {“ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”, “ ”}.
Definition 2. A simple string, or simply s-string, is a string in which at most one character can occur twice. Equivalently, an s-string is a string whose directed graph contains at most one cycle. Lemma 1. All bigrams in an s-string are unique.
Proof. By contradiction: if two or more identical bigrams existed in an s-string, the s-string's directed graph would contain at least two cycles, which means it could not be an s-string.
Fig. 1 The directed graph of string w1
In what follows, we give an algorithm to segment an arbitrary string into a sequence of s-strings.
Algorithm 1. Segment an arbitrary string w into a sequence of s-strings
1) Set k=1;
2) Scan string w from its first character w(1) to its last character w(|w|);
3) if there are two characters w(i), w(j) such that w(j)=w(i) (1≤j …
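Since the tail of Algorithm 1 (and Definition 3, which labels consecutive s-strings 1, 2, 3, …) is only summarized later in the text, the following Python sketch shows one plausible reading of the segmentation step: a new s-string is started as soon as adding the next character would make a second character occur twice (or a character occur three times). The function name and the greedy cut rule are our assumptions, not the authors' exact formulation.

```python
def segment_into_s_strings(w):
    """Greedily split w into s-strings: substrings in which at most one
    character occurs twice (one plausible reading of Algorithm 1).
    Returns a list of (label, s_string) pairs, labels starting at 1."""
    segments, current, repeated = [], [], None
    k = 1
    for ch in w:
        # Closing condition: the segment already has a repeated character
        # and ch would introduce a second repetition (or a third occurrence).
        if ch in current and repeated is not None:
            segments.append((k, "".join(current)))
            k += 1
            current, repeated = [], None
        if ch in current:
            repeated = ch
        current.append(ch)
    if current:
        segments.append((k, "".join(current)))
    return segments

# segment_into_s_strings("abab") -> [(1, "aba"), (2, "b")]
```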
Lemma 2. If a string is segmented by Algorithm 1, each bigram in the string is uniquely identified by the set of labels of the s-strings in which the bigram occurs. Proof. Each s-string is uniquely identified by its label (Definition 3), and each bigram in an s-string is unique (Lemma 1); since a bigram may occur in multiple s-strings, each bigram in a string can be uniquely identified by the set of labels of the s-strings in which it occurs. Definition 4. The weighted directed graph of a string is established according to the following steps: 1) Segment the string using Algorithm 1 and label the s-strings according to Definition 3;
2) Construct the directed graph of the string according to Definition 1. 3) Associate each edge in the directed graph with the label of the s-string in which the edge's corresponding bigram is located. 4) Compact directed edges sharing the same bigram into one directed edge and unite their corresponding labels into a set of labels as the compacted edge's weight. Formally, denote by WDG=<Vw, Ew, Lw> the weighted directed graph of string w, where Vw is the set of vertices and Ew is the set of directed edges as defined in Definition 1, and Lw is the set of label sets associated with the directed edges in Ew. Let Lw(li, lj) be the set of labels associated with the directed edge li→lj; we have
Lw = {Lw(li, lj) | li→lj ∈ Ew, li ∈ Vw, lj ∈ Vw}.
Definition 5. The adjacency matrix of a string is formally defined as follows:
A = [aij], aij = Lw(li, lj).
Example 3. Given a Chinese text string w2: “ ”. Segment w2 into a sequence of s-strings and label them in the default fashion: s1=“ ”, s2=“ ”, s3=“ ”, s4=“ ”, s5=“ ”, s6=“ ”. The eleven unique characters in w2 constitute the vertex set Vw={“ ”, …, “ ”}. The fourteen unique bigrams constitute the directed edge set Ew={“ ”, …, “ ”}. Fig. 2 shows the weighted directed graph of string w2.
Fig. 2 Weighted directed graph of string w2
Here, we use only 6 labels to represent the 14 bigrams in w2. However, if character-based inverted lists were used, we would have to use 36 positions to identify the 11 characters in w2. Assuming l1=“ ”, l2=“ ” and so on, the corresponding adjacency matrix of Fig. 2 is an 11×11 matrix, as shown in Fig. 3, in which: a14=Lw(“ ”, “ ”)={1, 2}, a24=Lw(“ ”, “ ”)={2, 3}, a34=Lw(“ ”, “ ”)={4, 5}, a45=Lw(“ ”, “ ”)={1, 2, 3, 4, 5, 6}, a56=Lw(“ ”, “ ”)={2, 4, …
Lemma 3. For a string w and its adjacency matrix A, 1) if bigram c1c2 occurs in w, then a(c1, c2)≠Φ; 2) if trigram c1c2c3 occurs in w, then (a(c1, c2) ∩ a(c2, c3)) ∪ ({a(c1, c2)+1} ∩ a(c2, c3)) ≠ Φ. Here, {a(c1, c2)+1} represents a new set formed by adding 1 to each element in a(c1, c2). Proof. 1) If bigram c1c2 occurs in w, it must be located in some s-string(s), thus a(c1, c2)≠Φ. 2) If trigram c1c2c3 occurs in w, then two cases exist: a) c1c2 and c2c3 are located in the same s-string, that is, a(c1, c2) ∩ a(c2, c3) ≠ Φ; b) c1c2 and c2c3 are located in two adjacent s-strings, respectively. Since the labels of two adjacent s-strings differ from each other by 1 (Definition 3), {a(c1, c2)+1} ∩ a(c2, c3) ≠ Φ. Combining these two cases, we have (a(c1, c2) ∩ a(c2, c3)) ∪ ({a(c1, c2)+1} ∩ a(c2, c3)) ≠ Φ. Definition 6. A text database is a collection of text documents, each of which is a string over Σ. Neglecting the boundary between any two adjacent documents, a text database can be seen as one long string whose length is the sum of the lengths of all documents in the text database. While constructing the adjacency matrix of a text database, we require that all documents in the text database be segmented separately. However, all s-strings are labeled globally. That is: 1) no s-string spans two or more adjacent documents; 2) for any two adjacent documents di and di+1, di+1's first s-string is labeled immediately after the label of di's last s-string. Definition 7. The document index table (DIT) of a text database is organized as a triple:
DIT = {(l1, c1 , c2 )}. Each document in the text database has a record in DIT. A record has three fields: l1 is the label of the first s-string in a document; c1 and c2 are the first two characters of the
document. Given the first labels li and li+1 of two adjacent documents di and di+1, the interval [li, li+1-1] constitutes the label range of di's s-strings. Definition 8. The adjacency matrix based full-text indexing model of a text database consists of two parts: the adjacency matrix and the document index table. Unlike current indexing techniques, the proposed indexing model has the following unique characteristics: 1) Bigrams are used as indexed terms, and all bigrams are organized into an adjacency matrix corresponding to the text database's weighted directed graph. 2) After the indexing model is established, the original text documents are not saved in the retrieval system, which results in a drastic reduction of overall system storage cost. While processing a query, original text documents are reconstructed using only the indexing matrix and the DIT. 3) Labels of s-strings, rather than positions of indexed terms, are used to identify different occurrences of an indexed term, which cuts down indexing space overhead because the total number of s-strings in a text database is much smaller than the text database's length. Using the proposed indexing model, an arbitrary string search can be processed by applying Lemma 3 iteratively. Generally, a query is processed in the following three steps (see the sketch below): 1) Based on the adjacency matrix, retrieve the labels or label sequences of s-strings that the query string may be located in or span over; a query-processing algorithm is responsible for this sub-task. 2) According to the DIT, find the desired document index records and the corresponding label ranges. 3) Using the adjacency matrix and the results from step 2), a text-reconstructing algorithm reconstructs the text contents of all desired documents.
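The sketch below illustrates step 1 in a simplified form, applying the condition of Lemma 3 to each consecutive pair of query bigrams; the adjacency matrix is assumed to be a dictionary mapping a bigram (c1, c2) to its label set a(c1, c2), queries of at least two characters are assumed, and the names are ours. The full algorithm in the paper also has to keep label sequences for queries that span several s-strings.

```python
def candidate_labels(A, query):
    """Step 1 of query processing: labels of s-strings that may contain the
    last bigram of the query, obtained by iterating the Lemma 3 condition.
    A maps a bigram (c1, c2) to its label set; an empty result means the
    query cannot occur anywhere in the text database."""
    labels = set(A.get((query[0], query[1]), set()))
    for i in range(1, len(query) - 1):
        nxt = set(A.get((query[i], query[i + 1]), set()))
        # Lemma 3: the two bigrams lie in the same s-string, or in two
        # adjacent s-strings whose labels differ by one.
        labels = (labels & nxt) | ({l + 1 for l in labels} & nxt)
        if not labels:
            break
    return labels
```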
3 Implementation Techniques The indexing adjacency matrix is usually a large, sparse matrix. Statistics for five test collections show that more than 97% of the matrix elements are empty. In the process of index building, the adjacency matrix expands dynamically. We adopt a three-level in-memory structure for the indexing adjacency matrix, as illustrated in Fig. 4. Notice that each matrix element corresponds to a bigram. The first level is a hash indexed by the first character of a bigram. Its maximum size is the number of unique characters appearing in the text database. Each element of the first-level hash consists of three components: the indexed character (ch1), the number of the indexed character's successor characters (number_of_ch2) and a pointer to one instance of the second level. An instance of the second level is a hash indexed by the second character of the bigram; each of its elements also includes three components: the indexed character (ch2), the number of labels (number_of_labels) associated with bigram (ch1, ch2) and a pointer to one instance of the third level. An instance of the third level is a hash indexed by the labels of the s-strings in which a given bigram is located. While processing queries or reconstructing text documents, the indexing matrix does not change, so we then adopt a static in-memory structure for the indexing matrix, which is quite similar to that in Fig. 4, except that 1) size-fixed arrays are used to replace the
first-level hash and the second-level hashes of Fig. 4; and 2) the third-level structure of Fig. 4 is not used, due to memory limitations. Label data is loaded into memory from disk whenever requested.
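A minimal Python sketch of the three-level structure used during matrix building follows; nested dictionaries stand in for the hashes of Fig. 4, and the class and method names are our own.

```python
from collections import defaultdict

class AdjacencyMatrixBuilder:
    """Three-level in-memory structure for matrix building (cf. Fig. 4):
    first level keyed by ch1, second level keyed by ch2, third level a list
    of s-string labels in which the bigram (ch1, ch2) occurs."""

    def __init__(self):
        # matrix[ch1][ch2] -> list of labels
        self.matrix = defaultdict(lambda: defaultdict(list))

    def add_bigram(self, ch1, ch2, label):
        self.matrix[ch1][ch2].append(label)

    def number_of_ch2(self, ch1):
        return len(self.matrix[ch1])

    def number_of_labels(self, ch1, ch2):
        return len(self.matrix[ch1][ch2])
```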
Fig. 4 In-memory structure of the adjacency matrix (used for matrix building)
On disk, the indexing data is stored in two separate files. One file stores the labels of s-strings, i.e., the data in the third-level hashes; we refer to these data as label data. The other saves the structure information of the adjacency matrix, which corresponds to the data stored in the first-level hash and the second-level hashes. We call these two files the label file and the matrix file, respectively. In the label file, the label data associated with each bigram is stored sequentially and contiguously. The matrix file has the structure shown in Fig. 5. Here, ch1 is the first character of a bigram, which specifies the row where the bigram is located in the matrix; ch2 is the second character of the bigram, which determines the column where the bigram is placed. ch2_number indicates the number of successor characters of ch1. label_number indicates the occurrence count of bigram (ch1, ch2) in the text database, and fp_seek specifies the starting address where the label data of bigram (ch1, ch2) is stored in the label file. label_number and fp_seek use user-defined data types because the ranges of their values depend on the text database size; using user-defined data types saves storage space.

struct {
    UNICHAR ch1;
    unsigned int ch2_number;
    struct {
        UNICHAR ch2;
        LABEL_NUMBER_TYPE label_number;
        FILE_ADDRESS fp_seek;
    } Column[ch2_number];
} ROW[Size_Of_Character_Set];
Fig. 5 Structure of the matrix file
A third file, the document index file, stores the records of the document index table. These three files are created when the indexing matrix is established. While processing queries or reconstructing text documents, the data of the matrix file and the document index file is loaded into memory to improve retrieval efficiency. For the 182.2 Mb text collection in Table 1, the memory requirement is about 15 Mb, which is available on current PCs. The process of adjacency matrix building is simultaneously the process of text database building, and it produces three resulting files: the matrix file, the label file and the document index file. As a new document arrives, it is parsed and its bigrams are inserted into the in-memory adjacency matrix. At some point the data in the adjacency matrix must be written to disk to release memory. We write only the label data, which is stored in the third-level hashes, to temporary files; generally, a temporary file corresponds to a bigram. Notice that the occurrences of different bigrams in a text database are quite uneven. In the process of matrix building, some matrix elements (corresponding to frequently occurring bigrams) expand rapidly with the arrival of new documents, while others (corresponding to infrequently appearing bigrams) expand slowly or not at all. In addition, new documents will contain previously unseen bigrams. To amortize the disk-writing cost, when data writing is requested, we move only the label data of bigrams that have accumulated more than Ls labels (Ls is a pre-specified threshold) since the last writing, so that the label data of infrequent bigrams may be written only once (the final time) during the process of matrix building. When all documents have been processed, the label data in the temporary files, as well as the label data still in memory, is merged into the final label file.
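The following sketch shows the amortized flushing policy under our reading of this description: only bigrams that have accumulated more than the threshold Ls of new labels since their last write are flushed to their temporary files. The helper names and the callback interface are assumptions for illustration.

```python
def flush_label_data(matrix, last_flushed, Ls, write_temp):
    """Flush label data of 'hot' bigrams to temporary files.

    matrix: {(ch1, ch2): [labels...]} accumulated in memory
    last_flushed: {(ch1, ch2): count of labels already written}
    Ls: threshold of new labels that triggers a write for a bigram
    write_temp: callback(bigram, labels) appending to the bigram's
                temporary file (one temporary file per bigram)
    """
    for bigram, labels in matrix.items():
        already = last_flushed.get(bigram, 0)
        if len(labels) - already > Ls:
            write_temp(bigram, labels[already:])
            last_flushed[bigram] = len(labels)
    # Infrequent bigrams stay in memory and are written once, at the end,
    # when the temporary files and the remaining in-memory data are merged
    # into the final label file.
```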
4 The Experiments Five Chinese text collections of different sizes are used in the experiments; they are listed in Table 1. All experiments are carried out on a PC with two PII-350 CPUs and 512 MB RAM.

Table 1 Test Collections

Test collection   TC-1   TC-2   TC-3   TC-4    TC-5
Size (Mb)         14.9   39.0   97.6   182.2   500.4
We first test space cost. The expansion ratio is used to measure the space cost of different indexing methods; we define the expansion ratio as (stxt+sind)/stxt, where stxt and sind are the sizes of the text data and the indexing data, respectively. Fig. 6 shows the expansion ratios of four different indexing methods over the five text collections. The four indexing methods are PAT array, bigram-based inverted lists, character-based inverted lists, and the proposed new indexing model. We can see that our indexing method has the lowest expansion ratio. When indexing TC-5, our method consumes about 750 Mb less storage space than the other methods do. We then test matrix building efficiency. Fig. 7 illustrates the matrix building speed for the five text collections. Obviously, as the size of the text collection grows, the frequency of disk writing also increases, which causes processing efficiency to go down. Following
that, we test the efficiency of text reconstruction. The results are shown in Fig. 8. Basically, larger text collections have lower reconstruction efficiency. However, even for the largest test collection TC-4, the reconstruction speed exceeds 12 kb/s, i.e., about 6000 characters/s, which is fast enough to meet the text reading and browsing requirements of most users.
Fig. 6 Expansion ratio comparisons among different indexing methods (adjacency matrix, bigram-based inverted lists, character-based inverted lists, PAT array)
Fig. 7 Matrix building speed (kb/s) for text collections of different sizes
Fig. 8 The impact of text collection size on text reconstructing speed
Finally, we examine the efficiency of query processing using five query lengths: 2, 5, 10, 15 and 20 characters. To make the experimental results more reliable, a program is used to generate queries automatically and randomly. For each query length, 3000 randomly generated queries are processed to measure query efficiency. The results shown in Fig. 9 are averages over the 3000 different queries of the same length. As the size of the text collection becomes larger, the average amount of label data associated with each bigram increases; consequently, more label data needs to be read and processed when evaluating a query of a given length. Generally, a query of no more than 10 characters can be processed within about 1 second.
Fig. 9 The impact of query length on the time cost of query processing (query lengths of 2 to 20 characters over collections of 15M to 500M)
5 Conclusions By treating a text database as a directed graph and extending the concept of the adjacency matrix of a directed graph, we have proposed an adjacency-matrix based full-text indexing model in this paper. The innovative ideas of this paper are: 1) organizing bigrams into an adjacency matrix, which makes it possible to reconstruct text contents using only the indexing data; 2) using the labels of the s-strings of the text database to identify different occurrences of bigrams, which further reduces the indexing space overhead. Experimental results show that the proposed indexing model is effective and efficient. The new model can also be used for text retrieval of other Oriental languages such as Japanese and Korean.
References
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, Reading, Mass., 1999.
[2] C. Faloutsos and S. Christodoulakis. Signature files: an access method for documents and its analytical performance evaluation. ACM Trans. on Office Information Systems, 2(4): 267-288, 1984.
[3] D. R. Morrison. PATRICIA - practical algorithm to retrieve information coded in alphanumeric. Journal of the ACM, 15(4): 514-534, 1968.
[4] G. Navarro. An optimal index for PAT arrays. In: Proceedings of the Third South American Workshop on String Processing, pp. 214-227, 1996.
[5] Z. Wu and G. Tseng. Chinese text segmentation for text retrieval: achievements and problems. Journal of the American Society for Information Science, 44: 532-542, October 1993.
[6] Y. Ogawa and M. Iwasaki. A new character-based indexing method using frequency data for Japanese documents. In: Proc. 18th ACM SIGIR Conf., pp. 121-128, 1995.
[7] K. L. Kwok. Comparing representations in Chinese information retrieval. In: Proc. of 20th ACM SIGIR Conf., pp. 34-41, 1997.
Page Access Sequencing in Join Processing with Limited Buffer Space
Chen Qun, Andrew Lim and Oon Wee Chong
Department of Computer Science, National University of Singapore
Lower Kent Ridge Road, Singapore 119260
Email: {chenqun, alim, [email protected]}
Abstract. When performing the join operation in relational databases, one problem involves finding the optimal page access sequence such that the number of page re-accesses is minimized, given a fixed buffer size. This paper presents a new heuristic for this problem (known as OPAS2) that generally outperforms existing heuristics.
Keywords: Join Processing, Query Processing, Heuristic Design

1 Introduction
The join operation is one of the most expensive and frequently executed operations in database systems. The main cost of join operations involves the fetching of data pages from secondary storage devices to the main memory buffer. Several approaches to this problem have previously been tried [2, 4, 7]. One strategy to minimise memory usage is to first scan the indices of the relevant relations to obtain a set of data page pairs (x, y), where page x contains some tuple which is joined with some tuple in page y. This information can then be used to find an efficient page access sequence to minimise the amount of memory required. This is an extension of the sort-merge join and the simple TID algorithm [1, 4]. There are two related problems when it comes to finding an optimal page access sequence for this strategy:
1. Given that there are no page re-accesses, what page access sequence will require the minimum number of buffer pages?
2. Given a fixed buffer size, what page access sequence will require the minimum number of page re-accesses?
The above problems are referred to as OPAS1 and OPAS2 respectively, for optimal page access sequence problems [6]. Both problems are believed to be NP-Complete. All previous works have concentrated on finding a good solution to OPAS1, and then adapting it to OPAS2 by including a page replacement
strategy when the maximum buffer size is reached. In this study, we present a new heuristic for OPAS2 that is not based on an OPAS1 strategy. In Section 2, we define the symbols and terminology used in this report, along with the graph models used. Section 3 gives a brief description of existing heuristics. The new heuristic is presented in Section 4, with the experimental results analyzed in Section 5. Finally, in Section 6, we conclude our findings.
2 Terminology and Notation
2.1 Problem Definition
We can represent the page-pair information of a join by an undirected join graph G = (V, E), where the set of vertices V represents the pages in the join, and the set of edges E ⊆ V × V represents the set of page-pairs which contain tuples to be joined with each other. As pages are fetched into the buffer, the join graph is updated as follows: an edge (x, y) is removed from the graph if the pages x and y have been fetched and joined; a vertex x is removed from the graph if the degree of x becomes zero. Such a page is said to be released. A page access sequence (PAS) specifies the order of fetching the pages of the join graph into the buffer. For the OPAS2 problem, its definition is as follows:
Definition 1. Let G = (V, E) be a join graph. A page access sequence S = <p1, p2, …, p|V|> is a sequence of pages from V where pi denotes the i-th page fetched into the buffer. This definition differs from an OPAS1 page access sequence in that the pages in the sequence need not be distinct.
Definition 2. For any page p, its resident degree is the number of distinct pages in the buffer that it is adjacent to. Its non-resident degree is the number of distinct pages not in the buffer that it is adjacent to. Every page is adjacent to itself.
Definition 3. Let S = <p1, p2, …, p|V|> be a page access sequence for a join graph G = (V, E) in a system with buffer size B. S is an optimal page access sequence iff (|S| - |V|) ≤ (|S'| - |V|) for all page access sequences S'. For any PAS, the total number of page re-accesses is the difference between the length of the PAS and the total number of pages to be read. An optimal page access sequence is thus one that minimizes this value.
Definition 4. Let S = <p1, p2, …, p|V|> be a page access sequence. S' = <pi, pi+1, …, pi+k>, 1 ≤ i ≤ (|V| - k), is a segment of page access sequence S iff S' is a nonempty subsequence of S such that
1. no page is released by the entry of pj for i ≤ j < i + k,
2. one or more pages are released by the entry of pi+k, and
3. one or more pages are released by the entry of pi-1 if i > 1.
Thus, each PAS can be uniquely expressed as a sequence of m segments. We call a segment of length N that releases K pages an N-Release-K segment.
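To make Definitions 2 and 3 concrete, here is a small Python sketch that computes resident and non-resident degrees and the re-access count of a PAS; the join graph is assumed to be an adjacency-set dictionary, and all names are ours.

```python
def degrees(graph, page, buffer_pages):
    """Resident and non-resident degree of `page` (Definition 2).
    graph: {page: set of adjacent pages}; buffer_pages: set of resident pages.
    Self-adjacency from Definition 2 is not modelled explicitly here."""
    neighbours = graph[page]
    resident = len(neighbours & buffer_pages)
    non_resident = len(neighbours - buffer_pages)
    return resident, non_resident

def page_reaccesses(pas, num_pages):
    """Number of page re-accesses of a PAS (Definition 3): |S| - |V|."""
    return len(pas) - num_pages
```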
2.2 Types of Graphs
There are some types of graphs that are commonly used to model the conditions that arise in database systems. For our study, we make use of the following two types of graphs.

Bipartite Graph. If G is a bipartite graph such that V = V1 ∪ V2 and V1 ∩ V2 = ∅, with degree n%, then:
1. Prob[(vi, vj) ∈ E] = n/100 if vi ∈ V1 and vj ∈ V2,
2. 0 otherwise.
A bipartite graph is partitioned into two sets of vertices, where the vertices within a set cannot be connected to each other. The bipartite graph models bi-relational joins, and is one of the most important join graphs in database systems.

Geometric Graph. If G is a geometric graph with v vertices and the expected degree of each vertex E() = k, then G is generated as follows:
1. Compute d = √(k/v).
2. Generate v points randomly in a unit square, i.e. assign a pair of coordinates (xk, yk), xk, yk ∈ [0, 1], to each vertex vk.
3. Add (vi, vj) into E iff the distance between vi and vj is less than d, i.e. if √((xi - xj)² + (yi - yj)²) < d.
Thus in a geometric graph there is only an edge if two points are "close enough", and therefore the points tend to form clusters. It is our opinion that a geometric graph is a good approximation of multi-relation joins, where each cluster approximates a relation.
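A small generator for both graph models, under the definitions above (and our √(k/v) reading of the radius formula, which should be treated as an assumption), might look as follows; the function names are ours.

```python
import random
from math import sqrt

def bipartite_join_graph(n1, n2, edge_ratio):
    """Random bipartite join graph: cross-partition edges with prob edge_ratio."""
    return {(i, n1 + j) for i in range(n1) for j in range(n2)
            if random.random() < edge_ratio}

def geometric_join_graph(v, k):
    """Random geometric join graph with v vertices and expected degree ~k."""
    d = sqrt(k / v)                      # radius (assumed formula)
    pts = [(random.random(), random.random()) for _ in range(v)]
    return {(i, j) for i in range(v) for j in range(i + 1, v)
            if sqrt((pts[i][0] - pts[j][0]) ** 2 +
                    (pts[i][1] - pts[j][1]) ** 2) < d}
```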
3 Existing Heuristics
All existing heuristics for OPAS2 have so far been OPAS1 heuristics coupled with a page replacement strategy. In this section, we give a brief description of these heuristics. Omiecinski's Heuristic (OH) [5] finds at each step the smallest number of fetches which would remove one page from the memory. The victim page when the buffer is full is the page with the smallest non-resident degree that is not adjacent to the page being brought into the buffer. Chan and Ooi's Heuristic (COH) [6] does not restrict itself to removing pages from the buffer. At each step, it looks for the smallest number of pages to be read in order to release any page, or the smallest minimal segment, and puts it into the buffer in order of descending resident degree. When the buffer is full, the page replaced is the one with the smallest non-resident degree. COH generally outperforms OH for the OPAS1 problem, but performs worse for the OPAS2 problem when the buffer size is lower than a certain threshold.
In essence, the COH heuristic searches for the smallest N-Release-1 segments during each iteration. Lim, Kwan & Oon's Heuristic (LKOH) [3] extends COH by searching for N-Release-K (K ≥ 1) page segments, such that N - K is minimized while maximizing N. It has an added parameter L to limit the value of K in the search. LKOH outperforms COH significantly for geometric graphs, but the improvement is slight in the case of bipartite graphs.

4 The New Heuristic (CLOH)
In the OPAS2 problem, our performance metric is the number of page re-accesses. Therefore, it does not matter if few or no pages are released early in the PAS, as long as the number of page re-accesses is ultimately minimized. A strategy that thus suggests itself is to create as many lightly-connected pages as possible within the memory buffer early in the algorithm. This is the basis of the new heuristic (CLOH). CLOH brings the page with the highest resident degree into the buffer in each iteration. Ties are broken by selecting the page with the lowest non-resident degree. We also define a release level L, such that if there exists a page whose non-resident degree is L or less, we bring in the segment that releases that page. The optimal value of L is determined quantitatively. The page replacement strategy is slightly different depending on whether the threshold L has been reached. In the first case, when the smallest non-resident degree of all resident pages is greater than L, the victim page is the page with the largest non-resident degree that is not connected to the page being brought in. In the second case, when we are bringing in the smallest segment, the victim page has the additional condition of not being the page that is to be released at the end of the segment.
OPAS2-CLOH(G, L)
1. Choose a page pi in the join graph G such that the degree of pi is minimal. Bring pi into the buffer.
2. if (the smallest non-resident degree of pages in the buffer is greater than L), then
   (a) Choose a page pj such that, of all the non-resident pages with the largest resident degree, pj has the smallest non-resident degree.
   (b) if (the buffer is full) then remove from the buffer a page with the largest non-resident degree that is not connected to pj.
   (c) Bring pj into the buffer.
   (d) Delete all edges (pi, pj) from G where pi and pj are contained in the buffer. If the degree of a page becomes zero, then remove the page from the buffer (the page is released), and delete the vertex from G.
3. else
   (a) Choose a set of pages PAGES(G) to bring into the buffer using the following strategy:
      i. Find a page pj such that pj has the minimal non-resident degree.
      ii. Select all pages outside the buffer which are connected with pj. These pages make up PAGES(G).
   (b) for (every page pk in PAGES(G))
      i. if (the buffer is full) then remove from the buffer a page with the largest non-resident degree that is not connected to pk, and is not pj.
      ii. Bring pk into the buffer.
      iii. Delete all edges (pi, pk) from G where pi and pk are contained in the buffer. If the degree of a page becomes zero, then remove the page from the buffer (the page is released), and delete the vertex from G.
4. If G is empty, quit; else goto Step 1.
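Below is a compact, self-contained Python rendering of OPAS2-CLOH as we read the pseudocode above; the data structures, tie-breaking details, the victim fallback when every resident page is connected to the incoming page, and the assumption that the initial join graph has no isolated pages are our simplifications, so treat it as a sketch rather than the authors' implementation.

```python
def cloh(join_graph, buffer_size, release_level):
    """Sketch of OPAS2-CLOH. join_graph: {page: set of adjacent pages}.
    Returns the page access sequence; re-accesses = len(PAS) - number of pages."""
    graph = {p: set(adj) for p, adj in join_graph.items()}
    buffer, pas = set(), []

    def nonres(p):                        # non-resident degree
        return len(graph[p] - buffer)

    def res(p):                           # resident degree
        return len(graph[p] & buffer)

    def fetch(p, keep=None):
        if len(buffer) >= buffer_size:    # choose a victim page
            cands = [q for q in buffer if p not in graph[q] and q != keep]
            if not cands:
                cands = [q for q in buffer if q != keep] or list(buffer)
            buffer.discard(max(cands, key=nonres))
        buffer.add(p)
        pas.append(p)
        for q in list(graph[p] & buffer):  # join p with resident neighbours
            graph[p].discard(q)
            graph[q].discard(p)
        for q in list(buffer):             # release pages of degree zero
            if not graph[q]:
                buffer.discard(q)
                del graph[q]

    while graph:
        if not buffer or min(nonres(p) for p in buffer) > release_level:
            outside = [p for p in graph if p not in buffer]
            best = max(res(p) for p in outside)
            fetch(min((q for q in outside if res(q) == best), key=nonres))
        else:
            target = min(buffer, key=nonres)      # page we intend to release
            for q in list(graph[target] - buffer):
                fetch(q, keep=target)
    return pas
```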
5 Testing and Evaluation
5.1 Bipartite Graph Results
For bipartite graphs, we randomly generated 20 instances with 500 vertices (250 vertices per partition) for each edge ratio of 5%, 10% and 15%. We then ran the OH, COH, LKOH and CLOH algorithms on these graphs, using a range of buffer sizes. For the LKOH heuristic, we used the L-value of 3 which gave the best results in the original work [3]. In the case of the new heuristic, we tested the effects of setting the release level L to various values. Our results show that, in general, the best results are achieved with L set to between 2 and 4. Table 1 gives the results for graphs with edge ratio 10%. The Improvement column gives the absolute difference between the best result obtained by CLOH using L at 2, 3, 4 and 5, and the best result obtained by OH, COH and LKOH(3). The results for graphs with edge ratio 5% and 15% are similar. The improvement patterns for all three cases (shown in Figures 1, 2 and 3) are also similar. We note that CLOH outperforms all existing heuristics except for a narrow range of buffer sizes. Furthermore, the absolute improvement increases markedly as the edge ratio increases, since denser graphs have more heavily-connected vertices.

5.2 Geometric Graph Results
We randomly generated 20 geometric graphs of 500 vertices each for expected degrees of 25, 50 and 75. We ran the COH, OH, LKOH and CLOH algorithms on these graphs for a range of buffer sizes. Once again, the new heuristic outperforms all existing heuristics in general. Table 2 gives the results for the set of geometric graphs with an expected degree of 75. For LKOH, we first ran the algorithm with the recommended L-value of 3. In the course of our experiments, we found that there were cases when an L-value of 5 gave a better result, and these figures are also included in the table. For CLOH, we ran the tests with release values of 2, 3 and 4,
Table 1. Comparison results for bipartite graphs with 250+250 vertices and edge ratio=10%
Fig. 1. Improvement of CLOH for bipartite graphs with 500 vertices and edge ratio=5% (absolute improvement vs. buffer size)
Fig. 2. Improvement of CLOH for bipartite graphs with 500 vertices and edge ratio=10% (absolute improvement vs. buffer size)
Fig. 3. Improvement of CLOH for bipartite graphs with 500 vertices and edge ratio=15% (absolute improvement vs. buffer size)
which were found to give the best results. The Improvement column gives the absolute difference between the best values attained by COH, OH, LKOH(3) and LKOH(5), and that of CLOH for the three release values.

Table 2. Comparison results for geometric graphs with 500 vertices and 75 expected degree

Buffer Size  COH   OH    LKOH(3)  LKOH(5)  CLOH(2)  CLOH(3)  CLOH(4)  Improvement
10           2395  2088  2482     2482     2035     2063     2035     53
15           1990  1454  1845     1851     1448     1410     1428     44
20           1534  1139  1494     1494     1139     1133     1074     65
25           1269  1113  1158     1158     949      984      901      212
30           1164  890   1054     1054     956      910      856      34
35           862   894   884      884      856      788      830      74
40           784   761   793      758      850      729      735      29
45           731   710   792      792      715      660      696      50
50           805   700   725      709      652      663      647      53
55           666   679   749      749      634      664      615      51
60           704   675   601      597      612      691      599      -2
65           631   607   598      585      607      609      604      -19
70           629   593   590      596      558      557      595      33
75           574   564   570      573      550      549      598      15
80           533   550   570      568      544      528      539      5
85           536   530   530      528      534      532      524      4
90           530   515   519      517      530      525      518      -3
Experiments with graphs of expected degree 25 and 50 produced similar results. Figures 4, 5 and 6 give the improvement patterns for all three test sets. These figures show that CLOH gives better results over almost the entire range of buffer sizes.

5.3 Evaluation of the New Heuristic
Our experiments show that the CLOH heuristic outperforms all existing OPAS2 heuristics for both the bipartite and geometric graph models. Through quantitative analysis, we have ascertained that the best release level L for the CLOH algorithm is between 2 and 4, irrespective of buffer size and graph density. This bodes well for practical implementation of this heuristic.
6 Conclusion
In this paper, we proposed a new OPAS2 heuristic, CLOH, that differs from all previous heuristics in that it is not derived from simply giving an OPAS1 heuristic a page replacement strategy. In contrast, CLOH takes advantage of
Fig. 4. Improvement of CLOH for geometric graphs with 500 vertices and expected degree=25 (absolute improvement vs. buffer size)
Fig. 5. Improvement of CLOH for geometric graphs with 500 vertices and expected degree=50 (absolute improvement vs. buffer size)
Fig. 6. OPAS2 improvement of CLOH for geometric graphs with 500 vertices and expected degree=75 (absolute improvement vs. buffer size)
the fact that the performance metric for OPAS2 is page re-accesses, and makes use of the buffer space to create a situation with several lightly-connected pages in memory. This is achieved by reading in the pages with the largest resident degree. Segments are released when their length is less than the release level L, which experiments have shown to be optimal at a value of 2 to 4.
References
1. M. W. Blasgen and K. P. Eswaran, Storage and access in relational databases, IBM Systems Journal, 16 (1977), pp. 363-377.
2. B. C. Desai, Performance of a composite attribute and join index, IEEE Trans. on Software Eng., 15 (1989), pp. 142-152.
3. A. Lim, J. Kwan, and W. C. Oon, Page access sequencing for join processing, in International Conference on Information and Knowledge Management, 1999, pp. 276-283.
4. P. Mishra and M. H. Eich, Join processing in relational databases, ACM Computing Surveys, 24 (1992), pp. 64-113.
5. E. R. Omiecinski, Heuristics for join processing using nonclustered indexes, IEEE Trans. Knowledge and Data Eng., 15 (1989), pp. 19-25.
6. B. C. Ooi and C. Y. Chan, Efficient scheduling of page access in index-based join processing, IEEE Trans. Knowledge and Data Eng., 9 (1997), pp. 1005-1011.
7. P. Valduriez, Join indices, ACM Trans. on Database Systems, 12 (1987), pp. 218-246.
Dynamic Constraints Derivation and Maintenance in the Teradata RDBMS Ahmad Ghazal and Ramesh Bhashyam NCR Corporation, Teradata Division 100 N. Sepulveda Blvd. El Segundo, CA, 90245 {ahmad.ghazal,ramesh.bhashyam}@ncr.com
Abstract. We define a new algorithm that allows the Teradata query optimizer to automatically derive and maintain constraints between date type columns across tables that are related by referential integrity constraints. We provide a novel, quantitative measure of the usefulness of such rules. We show how the Teradata query optimizer utilizes these constraints in producing more optimal plans especially for databases that have tables that are value ordered by date. We also discuss our design for maintaining these constraints in the presence of inserts, deletes, and updates to the relevant tables. Finally, we give performance numbers for seven TPC-H queries from our prototype implementation based on the Teradata Database engine.
1 Introduction
Query optimization is important in relational systems that deal with complex queries on large volumes of data. Unlike previous-generation navigational databases, a query on a relational database specifies what data is to be retrieved from the database but not how to retrieve it. Optimizing a relational query is not that important in transaction-oriented databases where only a few rows are accessed, either because the query is well specified by virtue of the application or because it accesses the database using a highly selective index. In decision support and data mining applications, where the space of possible solutions is large and the penalty for selecting a bad query plan is high, optimizing a query to reduce overall resource utilization is important, since it can provide orders of magnitude of overall performance improvement. There has been a lot of work on query optimization in relational and deductive databases [1,2,9,15,16]. Chaudhuri [8] gives a good overview of query optimization in relational databases. One important optimization technique is to rewrite the user-specified query to be more performant.¹ The query is transformed into a logically equivalent query that performs better, i.e., costs less to execute [2,6]. There are basically two techniques for query transformation – Syntactic and
¹ Physical techniques and algorithms to improve execution are both outside the scope of this paper.
Semantic. Syntactic or algebraic transformations use the properties of the query operators and their mapping to rewrite the query. Some forms of magic set transformation [10,11], most forms of predicate push down, and transitive closure are techniques that fall under this category. Semantic query transformations use declarative structural constraints and the semantics of application-specific knowledge, declared as part of the database, to rewrite the query [1,2,4,18,20]. Semantic query transformation based rewrites are called Semantic Query Optimization or SQO [1]. The basic intent of a query rewrite is to reduce the number of rows processed. King in [2] mentions five transformations. We clarify these in the following and build upon them:
1. Predicate Introduction. A new predicate is inferred from domain knowledge and integrity constraints that are specified as part of the data model. The introduction of the predicate may reduce the number of rows that are read from the relation. For example, a range constraint may be introduced based on a check constraint specified on a column.
2. Predicate MoveAround. Selection predicates may be pushed as far down in the execution tree as possible, moved sideways [11,17], or moved up when the predicates are expensive. [11] gives an example of a view that is restricted based on application-specific information.
3. Operator MoveAround. A join operation may be commuted with an aggregation operation such that the aggregation may be pushed down across the join operation [12].
4. Join Elimination. A join may be deemed unnecessary and hence eliminated [1,5,14]. For example, some portion of joins between relations that have a Primary Key – Foreign Key relationship (referred to in this paper as a PK-FK relationship) can be eliminated, especially if no attributes are projected from that relation. There are also other forms of partial or full join elimination based on materialized views.
5. Join Introduction. If a join with another table will help to reduce the rows processed from the original table, then an extra join with the large table may be justified [2].
6. Others. The literature [2,7] also discusses other transformations such as index introduction and empty-set detection.
Notice that most of what is mentioned in the literature, and almost all of what little is commercially implemented, relates to transformations based on structural constraints and domain knowledge. As a point of interest, note that although the process of transformation is separate from the process of selecting the optimal plan, they can be combined, as in Teradata, based on the cost of each transformation. The research in [2,7] is among the earliest work in semantic query optimization. Structural constraint based semantic optimizations use functional dependencies, key dependencies, value constraints, and referential constraints that are defined on the relations [7,9]. Hammer and Zdonik [15] also discuss the use of application domain knowledge to perform query transformations. [13] shows semantic query transformations using dependency constraints, domain constraints, and constraints between two tables that have a join condition between them. [1,13] extends the classical notion of both integrity constraints and their application. [7] also
discusses query optimization using two other types of constraints called implication integrity constraints and subset integrity constraints. These constraints are defined on chunks of data across relations. They define subset integrity constraints as a superset-subset relationship between the domains of two different attributes of two different relations, and an implication integrity constraint as the valid ranges of values that an attribute can have when some other attributes are restricted in the same or a different relation. There is not much work in the literature on dynamically derived constraints across multiple relations that are based on the actual data stored in those relations. We explore this concept in this paper. Instead of using domain data [7,13], we use the actual data in relations and automatically derive and maintain constraints across two different relations. We then show how these can be used to optimize queries, using the TPC-H query suite [3] as an example. SQO can be very costly and may even cost more than the query execution time [19]. Shekar et al. [19] presented a model for the trade-off between the cost and benefit of applying SQO. The problem of rule derivation we are considering in this paper, a sub-problem of SQO, is also a difficult one. The main difficulties we see are: What kind of rules should the optimizer find? Should the derivation process be completely automated? How can the optimizer decide if a rule is useful or not? In this paper we show how we answered these questions in the Teradata database engine, based on our experience with real customer situations. We limit our analysis to the time dimension; specifically, we limit our analysis to (1) relations that have a PK-FK structural relationship and (2) date attributes in those relations. Time is a key analytic dimension. Both customer relationship management (CRM) and strategic decision support applications limit their search space using time. They also attempt to correlate transactions and behavior activities using time as a key dimension. For example, CRM queries are interested in understanding a customer's propensity for a product in a specific time frame. Similarly, the TPC-H workload often specifies time in its queries [3]. Also, we intend to make the DBMS automatically find these rules, and we formally define a metric of how useful an SQO rule is. The following example illustrates the type of SQO problems we solved. Example 1. In the TPC-H benchmark, LINEITEM and ORDERS are two tables. The ORDERS table gives details about each order. The LINEITEM table gives information about each item in the order; an order may have up to 7 items. O_ORDERDATE is an attribute of the ORDERS table and it represents the date the order was made. L_SHIPDATE is an attribute of the LINEITEM table and it denotes the date that line item was shipped. The LINEITEM and ORDERS tables have a PK-FK referential integrity structural constraint based on O_ORDERKEY=L_ORDERKEY. O_ORDERKEY is the primary key of ORDERS and L_ORDERKEY is a foreign key
for LINEITEM. Line items of an order are shipped within 122 days of the order date. This fact can be written using the following rule:
(L_ORDERKEY=O_ORDERKEY) →² (O_ORDERDATE+1 ≤ L_SHIPDATE) and (L_SHIPDATE ≤ O_ORDERDATE+122)
The following query example (Q3 in TPC-H) illustrates the usefulness of such semantic rules.
SELECT L_ORDERKEY, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) (NAMED REVENUE), O_ORDERDATE, O_SHIPPRIORITY
FROM CUSTOMER, ORDERS, LINEITEM
WHERE C_MKTSEGMENT = 'BUILDING'
AND C_CUSTKEY = O_CUSTKEY
AND L_ORDERKEY = O_ORDERKEY
AND O_ORDERDATE < '1995-03-15'
AND L_SHIPDATE > '1995-03-15'
GROUP BY L_ORDERKEY, O_ORDERDATE, O_SHIPPRIORITY
ORDER BY REVENUE DESC, O_ORDERDATE;
The query has the condition (L_ORDERKEY = O_ORDERKEY), and using the rule (L_ORDERKEY=O_ORDERKEY) → (O_ORDERDATE+122 ≥ L_SHIPDATE and L_SHIPDATE ≥ O_ORDERDATE+1), the optimizer will add (O_ORDERDATE+122 ≥ L_SHIPDATE and L_SHIPDATE ≥ O_ORDERDATE+1) to the where clause of the query. In another phase of the optimizer, the transitive closure of the where-clause conditions is computed and the following range conditions on L_SHIPDATE and O_ORDERDATE, which are specific to this query, are found: L_SHIPDATE < '1995-07-15' and O_ORDERDATE > '1994-11-13'. Together with O_ORDERDATE < '1995-03-15' AND L_SHIPDATE > '1995-03-15', each of O_ORDERDATE and L_SHIPDATE has a range of approximately four months. The new date constraints above can be very useful in one or both of the following situations: They can provide a fast access path to the corresponding table; for example, if ORDERS or one of its secondary indexes is value ordered by O_ORDERDATE, then in Example 1 only four months of data need to be accessed for ORDERS. The new constraints can also reduce the size of an intermediate result. Note that this is applicable even if the derived constraints do not provide an access path to the table. For example, assume that the ORDERS and CUSTOMER tables are
² → means implies.
hash distributed on O_ORDERKEY and C_CUSTKEY, respectively.³ Also, assume that in the final execution plan of the query in Example 1, ORDERS is re-hashed (re-distributed in Teradata terminology) on O_CUSTKEY to join with CUSTOMER. In this case, the new constraint O_ORDERDATE > '1994-11-13' can be applied prior to the re-hashing step, which significantly reduces the amount of data that will be re-hashed, sorted and stored on disk. In this paper we discuss the automatic derivation of these date constraint rules. We refer to these derived date constraint rules as DDCR and to the right hand side of a DDCR as DDC for the rest of this paper. We also discuss the maintenance of a DDCR in the presence of updates to the base relations. Finally, we discuss the performance gains that accrue from applying these constraints on some TPC-H queries, using the parallel Teradata execution engine.
2 Rule Derivations and Usage
In this section we describe how Teradata finds a DDCR between columns of PK-FK tables. As mentioned before, for performance and practical reasons, the optimizer tries to discover such relationships only under specific scenarios. We studied some customer workloads and found that, frequently, date columns from PK-FK tables are semantically related by some range constraints. So our focus was to find such relationships between date columns of two tables that are PK-FK related. A DDCR can typically be represented by (PK=FK) → (Date2 + C1 ≤ Date1 ≤ Date2 + C2), where C1, C2 are constants and Date1, Date2 are date columns in the FK and PK tables, respectively. The optimizer can initiate the DDCR derivation process when the user issues a collect statistics statement on a date column, Date1, of a table that is either a PK or an FK table and the related table also has a date column Date2. The basic idea of the algorithm is to find the values of C1 and C2 above. The high-level algorithm is given below.
Procedure FindConstraint(T1.Date1, T2.Date2)
{Assume that T1 and T2 are PK-FK tables where Date1 and Date2 are date columns in the FK and PK tables, respectively. We also assume, without loss of generality, that both Date1 and Date2 are not nullable.}
begin
³ Teradata is an MPP shared-nothing architecture and tables are partitioned according to the hash value of a predefined primary index of that table.
1. Perform an inner join between T1 and T2 using PK=FK as the join condition. The optimizer will choose the optimal join method, which is irrelevant to this algorithm. Note that we do not need to write the join results to spool, since we derive the relationship (constraint) between the two dates on the fly.
2. Create an initial constraint with an empty range; we call it the running constraint (RC).
3. For every row in the join result, let D1, D2 be the values of Date1 and Date2, respectively. From this join row we can deduce Date2 + (D1-D2) ≤ Date1 ≤ Date2 + (D1-D2). Merge RC with this new Date1 range: if RC is empty then set both C1 and C2 to D1-D2, otherwise set C1 = minimum(C1, D1-D2) and C2 = maximum(C2, D1-D2).
4. The result is a DDCR based on the last value of RC: (PK=FK) → (Date1 ≤ Date2+C2 and Date1 ≥ Date2+C1).
end.
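A minimal sketch of FindConstraint over a stream of joined (D1, D2) date pairs follows; the in-database join and spool handling are abstracted away, and the function name mirrors the procedure above only for readability.

```python
def find_constraint(joined_date_pairs):
    """Compute the DDCR bounds C1, C2 from joined (D1, D2) pairs, where
    D1 is the FK-table date and D2 is the PK-table date (both datetime.date).
    Returns (C1, C2) in days, meaning Date2 + C1 <= Date1 <= Date2 + C2."""
    c1 = c2 = None                      # running constraint RC, initially empty
    for d1, d2 in joined_date_pairs:
        diff = (d1 - d2).days
        if c1 is None:                  # RC is empty: start the interval
            c1 = c2 = diff
        else:                           # merge the one-point range [diff, diff]
            c1 = min(c1, diff)
            c2 = max(c2, diff)
    return c1, c2
```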
The resulting DDCR of the previous algorithm is stored as a table-level constraint on each table. The above algorithm always yields a relationship between Date1 and Date2; however, the relationship may or may not be useful. For example, (L_ORDERKEY = O_ORDERKEY) → (L_SHIPDATE ≤ O_ORDERDATE+2557) is a useless rule for deriving a range constraint on either L_SHIPDATE or O_ORDERDATE for TPC-H, since both L_SHIPDATE and O_ORDERDATE have the same range of values and both are within 7 years (2557 days). Such rules will not benefit query optimization and will be pure overhead. We formally measured the "usefulness" of a DDCR using the following analysis. Assuming uniform distributions of Date1 and Date2, a DDCR is most useful when C2-C1 is minimized. Since both C1 and C2 are computed from D1-D2 in FindConstraint, the range of values for both is from (min(D1) - max(D2)) to (max(D1) - min(D2)); call these Low and High, respectively. The usefulness of a DDCR is measured as (C2-C1)/Size, where Size is the interval size for the values of C2-C1, which is equal to (High-Low+1). The value of the usefulness function is between 0 and 1, and smaller values mean a more useful rule.
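Under the uniform-distribution assumption, the usefulness measure can be computed directly from the output of find_constraint; the helper below is ours, and the worked numbers simply reproduce the TPC-H figures quoted in the next paragraph.

```python
def usefulness(c1, c2, low, high):
    """Usefulness of a DDCR: (C2 - C1) / (High - Low + 1), in [0, 1];
    smaller values indicate a more useful rule.
    low  = min(D1) - max(D2), high = max(D1) - min(D2)."""
    return (c2 - c1) / (high - low + 1)

# Example with the TPC-H figures quoted below:
# usefulness(1, 122, -2557, 2557) -> 121 / 5115, roughly 0.024.
```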
If we take the TPC-H workload as an example and the result of FindConstraint to be (L_ORDERKEY=O_ORDERKEY) → (O_ORDERDATE+122 ≥ L_SHIPDATE and L_SHIPDATE ≥ O_ORDERDATE+1), then C1=1, C2=122, Low=-2557, High=2557 and the usefulness of this rule is 0.024. As a heuristic, the optimizer saves and maintains a DDCR only if the usefulness value is less than or equal to 0.5. Note that the usefulness function can be extended for non-uniform distributions of one or both of the date columns using collected statistics on these columns. This subject is outside the scope of this paper, and we assume that the usefulness function for the uniform distribution is a good approximation for the non-uniform case. One problem with FindConstraint is that the DDC is a single interval, which may not be very useful in some cases. For example, suppose all line items in the TPC-H case were shipped within 122 days of the order date, with the exception of one line item that was shipped 500 days after its order. In this case, the final RC in the algorithm will be (O_ORDERDATE+500 ≥ L_SHIPDATE and L_SHIPDATE ≥ O_ORDERDATE+1). It is more useful if the algorithm finds a set of non-overlapping ranges like (O_ORDERDATE+122 ≥ L_SHIPDATE and L_SHIPDATE ≥ O_ORDERDATE+1) or (O_ORDERDATE+500 ≥ L_SHIPDATE and L_SHIPDATE ≥ O_ORDERDATE+500). If you apply this non-overlapping constraint to a query that has a range constraint on O_ORDERDATE, the optimizer can define the range of values for L_SHIPDATE as the union of two small non-overlapping ranges. In our prototype, FindConstraint was modified to handle a union of non-overlapping ranges by maintaining a list of at most k running non-overlapping constraints,⁴ RC1, RC2, …, RCk. If at any time there are k+1 non-overlapping RC's, then the algorithm picks two of them and merges them. The choice of which two to merge is based on the nearest pair of intervals. Also, the usefulness function is modified to handle a set of RC's rather than just one. First, the usefulness function is applied to each interval using the logic described before. Then the usefulness values of all the intervals are summed up to give the overall usefulness. The optimizer uses a DDCR, as in Example 1, when the left-hand side of the rule exists in the query. To ensure correctness, the optimizer adds the DDC to the largest conjunction that contains the left-hand side. The following example illustrates that. Example 2. Consider the DDCR (L_ORDERKEY=O_ORDERKEY) → (O_ORDERDATE+122 ≥ L_SHIPDATE and L_SHIPDATE ≥ O_ORDERDATE+1) and the query condition (L_ORDERKEY=O_ORDERKEY and L_SHIPDATE > '1999-05-01' and L_QTY > 100) OR (L_ORDERKEY <> O_ORDERKEY and L_QTY < 200). In this case the optimizer rewrites the query condition to be
⁴ The value of k depends on how the optimizer handles OR cases. It is set to 10 in our prototype.
(L_ORDERKEY=O_ORDERKEY and L_SHIPDATE > '1999-05-01' and L_QTY > 100 and O_ORDERDATE+122 ≥ L_SHIPDATE and L_SHIPDATE ≥ O_ORDERDATE+1) OR (L_ORDERKEY <> O_ORDERKEY and L_QTY < 200).

2.1 Redundancy Detection and Removal
The DDC of a DDCR by itself does not provide any benefit to optimizing a query and is therefore redundant. The reason is that the DDC, which is a date range between two date columns, does not provide a fast access path to either relation and does not reduce the size of an intermediate result. It is useful for a query only if it helps transitive closure derive single-column date ranges. Based on that, the optimizer uses the following rule to add the DDC of a DDCR: when the left-hand side of a DDCR is present in some conjunction in the query condition, we add the DDC only if at least one of the date columns is also referenced in that conjunction. The date column must be referenced in a condition of the form "Date op Constant", where op ∈ {<, =, >, ≤, ≥, ≠}.⁵ If you apply this rule to Example 2, the addition of (O_ORDERDATE+122 ≥ L_SHIPDATE and L_SHIPDATE ≥ O_ORDERDATE+1) happens because L_SHIPDATE is referenced in another condition in the same conjunction as (L_ORDERKEY=O_ORDERKEY). After the query execution plan is found, the optimizer simply removes all DDCs, since they are not useful by themselves.

2.2 Rule Maintenance
In this section, we show a high-level approach to maintaining a DDCR in the presence of inserts, deletes and updates to either the PK or the FK table. Overall, this maintenance is performed within the PK-FK system enforcement. The general approach is the following:
1. If the operation is an insert to the PK table, do nothing. This is because new rows in the PK table do not have any matches in the FK table.
2. If the operation is an insert to the FK table, produce a join⁶ of the new rows with the PK table. Apply algorithm FindConstraint to the join result, merge with the existing DDCR, and replace the existing DDCR with the new one.
3. If the operation is a delete to either of the tables, choose between taking no action and redoing the DDCR after some number of deletes. There are multiple options depending on the specific workload. Taking no action would merely reduce the "usefulness" of the DDCR. If deletes are not frequent, re-computing the
⁵ op is restricted to the same comparisons that our transitive closure implementation has.
⁶ This join will be very fast, especially in the Teradata RDBMS, with the Row Hash Match Scan join algorithm.
entire DDCR periodically may suffice. In DSS and CRM applications, deletes are not as frequent as inserts.
4. If the operation is an update to a column that is not a PK,⁷ FK, or one of the relevant date columns of the DDCR, in either of the tables, then no action is taken. Otherwise, follow the same approach as a delete followed by an insert.
3 Experimental Results
We have tested our prototype on the TPC-H benchmark with a 10 GB workload. LINEITEM is hash partitioned on L_ORDERKEY and value ordered by L_SHIPDATE, and ORDERS is hash partitioned on O_ORDERKEY and value ordered by O_ORDERDATE.⁸ With the new rules, 7 queries out of the 22 ran faster and the other 15 queries were not affected. The table below shows the execution time before and after the semantic query optimization is applied. It also displays the percentage of time reduction this optimization provided.

TPC-H Query   Query Time without the new rules   Query Time with the new rules   Savings
Q3            523                                55                              89%
Q4            369                                41                              89%
Q5            804                                217                             73%
Q7            510                                348                             32%
Q8            827                                365                             56%
Q10           390                                176                             55%
Q12           175                                133                             24%

4 Conclusions and Future Work
We introduced a new algorithm that automatically derives integrity constraint rules across two PK-FK relations. These rules are dynamic and dependent on the actual data stored in the database. We invented a new function that assesses the value of these new rules before incorporating them in the database. We show the applicability of these constraints in semantic query optimization using TPC-H as an example. We also give algorithms to automatically maintain these constraints in the presence of inserts, deletes, and updates to the tables used in the constraint. We show the performance results of our prototype with the Teradata RDBMS on TPC-H queries.
⁷ In the Teradata RDBMS, updates to PK columns are discouraged but not prohibited.
⁸ Currently, the Teradata RDBMS does not support value ordering for base tables; value ordering for LINEITEM and ORDERS was simulated using single-table join indexes.
We expect to focus our future work on various structural and dynamic integrity-constraint based semantic query optimizations. Specifically, we believe transitive closure and constraint move-around can be specialized for time.
References
1. U. Chakravarthy, J. Grant and J. Minker. Logic based approach to semantic query optimization. ACM TODS, 15(2): 162-207, June 1990.
2. J. J. King. Quist: A system for semantic query optimization in relational databases. Proc. 7th VLDB, pages 510-517, September 1981.
3. Transaction Processing Performance Council, 777 No. First Street, Suite 600, San Jose, CA 95112-6311. TPC-H Benchmark, 1.2.1.
4. H. Pirahesh, J. M. Hellerstein, and W. Hasan. Extensible/rule based query rewrite optimization in Starburst. In Proc. SIGMOD, pages 39-48, 1992.
5. Qi Cheng, Jack Gryz, Fred Koo, et al. Implementation of two semantic query optimization techniques in DB2 Universal Database. Proc. 25th VLDB, September 1999.
6. M. Siegel, E. Scorie and S. Salveter. A method for automatic rule derivation to support semantic query optimization. ACM TODS, 17(4): 563-600, December 1992.
7. S. T. Shenoy and Z. M. Ozsoyoglu. A system for semantic query optimization. Proceedings of the ACM SIGMOD Annual Conference on Management of Data, 1987, 181-195.
8. Surajit Chaudhuri. An overview of query optimization in relational systems. Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1998, 34-43.
9. M. Jarke and J. Koch. Query optimization in database systems. ACM Computing Surveys, 16 (1984), 111-152.
10. I. S. Mumick, S. J. Finkelstein, H. Pirahesh, R. Ramakrishnan. Magic conditions. ACM Transactions on Database Systems, March 1996, 107-155.
11. I. S. Mumick, S. J. Finkelstein, H. Pirahesh, R. Ramakrishnan. Magic is relevant. ACM SIGMOD International Conference on Management of Data, May 23-26, 1990, 247-258.
12. W. P. Yan and P. A. Larson. Performing Group-By before Join. International Conference on Data Engineering, Feb. 1993, Houston.
13. G. D. Xu. Search control in semantic query optimization. Tech. Rep., Dept. of Computer Science, Univ. of Mass., Amherst, 1983.
14. J. Gryz, L. Liu, and X. Qian. Semantic query optimization in DB2: Initial results. Technical Report CS-1999-01, Department of Computer Science, York University, Toronto, Canada, 1999.
15. M. T. Hammer and S. B. Zdonik. Knowledge based query processing. Proc. 6th VLDB, pages 137-147, October 1980.
16. M. Jarke, J. Clifford, and Y. Vassiliou. An optimizing PROLOG front end to a relational query system. SIGMOD, pages 296-306, 1984.
17. A. Y. Levy, I. Mumick, and Y. Sagiv. Query optimization by predicate move-around. In Proc. of VLDB, pages 96-108, 1994.
18. H. Pirahesh, T. Y. Leung, and W. Hassan. A rule engine for query transformation in Starburst and IBM DB2 C/S DBMS. ICDE, pages 391-400, 1997.
19. S. Shekar, J. Srivistava and S. Dutta. A formal model of trade-off between optimization and execution costs in semantic query optimization. Proc. 14th VLDB, pages 457-467, Los Angeles, 1988.
20. S. T. Shenoy and Z. M. Ozsoyoglu. Design and implementation of a semantic query optimizer. IEEE Transactions on Knowledge and Data Engineering, Sep 1989, 1(3), 344-361.
Improving Termination Analysis of Active Rules with Composite Events
Alain Couchot
Université Paris Val de Marne, Créteil, France
[email protected]
Abstract. This article presents an algorithm for the static termination analysis of active rules with composite events. We refine the concept of triggering graph, including in the graph not only rules but also events (primitive events and composite events). Our termination algorithm improves on previous termination algorithms thanks to the notions of composite path and maximal order M path preceding a rule, which replace the classical notion of cycle. Both composite events and the overall conditions of rule paths can be taken into account for termination analysis. As a result, many more termination situations can be detected by our algorithm, especially when active rules defined with conjunction events or sequence events are used.
1 Introduction
This paper deals with the termination problem of active rules. Active rules (or Event-Condition-Action rules [7]) are intended to facilitate the design and programming of databases. But writing a set of rules actually remains tricky work, often left to specialists. Indeed, a set of rules is not a structured entity: the global behavior of a set of rules can be hard to predict and to control [1]. In particular, research has brought to the fore the termination problem of the rules (the execution of the rules can be infinite in some situations) and the confluence problem (the same rules do not necessarily give the same results, depending on the execution order of the rules). In section 2, we discuss related work; in section 3, we present a motivating example; in section 4, we introduce the events/rules graphs, the composite paths and the maximal order M path preceding a rule; in section 5, we propose a function for the evaluation of the truth value of the condition of a rule (due to a simple path and due to the maximal order M path preceding the rule); in section 6, we present our termination algorithm; section 7 concludes.
guarantee termination. The majority of works on the termination of active rules exploits the concept of triggering graph. [5] introduced, for the first time, the notion of triggering graph. This notion was clarified by [1]: such a graph is built by means of a syntactic analysis of the rules; the nodes of the graph are rules. Two rules r1 and r2 are connected by a directed edge from r1 towards r2 if the action of r1 contains a triggering event of r2. The presence of cycles in such a graph means a risk of non-termination of the set of rules. The absence of cycles in the triggering graph guarantees the termination of the set of rules. However, the possible deactivation of the condition of a rule is not taken into account by this analysis. A finer analysis is carried out by [3, 4, 8, 9, 12], taking the possible deactivation of the condition of a rule into account. Generalized connection formulas are introduced by [8] in order to test whether the overall condition of a path can be satisfied. [10] proposes a technique to remove a path instead of removing a node, and [11] proposes a technique to unroll a cycle. [13] refines the triggering graphs, using partial and total edges, to take into account the influence of composite events. Our work is based on the following observation: the algorithm proposed by [13], which deals with conjunction events, takes into account neither the path removing technique [10] nor the cycle unrolling technique [11]. A new technique must be designed in order to take into account for termination analysis both the composite events and the overall conditions of active rule paths. Thanks to this new technique, termination cases can be discovered that would not be discovered by a simple combination of the algorithms of [13] and [3, 8, 9, 10, 11, 12].
3 Motivating Example
We present in this section an example which motivates our proposition. We consider a banking application. The class Account has the following three methods:
rate_increase(X): this method increases the loan rate of the account by X per cent.
overdraft_increase(X): this method increases the allowed bank overdraft of the account by X.
capacity_increase(X): this method increases the loan capacity of the account by X.
The context of the conjunction events is cumulative [6]. The coupling modes of the active rules are immediate. The active rules are instance oriented. The active rules are the following:
Rule R1: When the rate and the allowed overdraft of an account have been increased, increase the loan capacity of the account by 2000.
Event: A1→rate_increase(X1) AND A1→overdraft_increase(Y1)
Condition: -
Action: A1→capacity_increase(2000)
Rule R2: When the rate or the loan capacity of an account has been increased, if the account is a stocks account, increase the allowed overdraft of the account by 200.
Event: A2→rate_increase(X2) OR A2→capacity_increase(Y2)
Condition: A2.type = stocks_account
Action: A2→overdraft_increase(200)
Rule R3: When the loan capacity of an account A has been increased, if the account A is a standard account, increase by 1.5 the rate of all the accounts of the same customer as the account A which are stocks accounts.
Event: A3→capacity_increase(X3)
Condition: (A3.type = standard_account) AND (B.customer = A3.customer) AND (B.type = stocks_account)
Action: B→rate_increase(1.5)
Let us try to analyze the termination of this rules set using the Refined Triggering Graph method [8, 12] improved with [13] ([13] deals with composite events). We build a modified triggering graph with partial edges and total edges (Fig. 1).
Fig. 1. Modified Triggering Graph with Partial and Total Edges (the edges in thin lines are partial edges, and the edges in thick lines are total edges).
The cycles of the modified triggering graph are: (R1, R3, R2, R1), (R1, R3, R1), (R1, R2, R1). We have to detect false cycles. The cycle (R1, R3, R2, R1) is a false cycle: the path (R2, R1, R3) is not activable, since the generalized connection formula along this path, (A2.type = stocks_account) AND (A2 = A1) AND (A1 = A3) AND (A3.type = standard_account), cannot be satisfied, and no rule action can modify an atom of this formula [8, 12]. The cycle (R1, R3, R1) can trigger itself [13], and the cycle (R1, R2, R1) can trigger itself [13]: these two cycles are not false cycles. Just one cycle can be detected as a false cycle. Therefore, termination cannot be guaranteed by [8, 12], even improved by [13]. Let us analyze the behavior of this rules set in more detail. Let us consider a rules process P. Each occurrence of R1 during P requires an occurrence of R3 and an occurrence of R2 (due to the conjunction event of R1), and thus requires an occurrence of the path (R3; R1) and an occurrence of the path (R2; R1). Each occurrence of R3 during P requires an occurrence of R1, and thus requires an occurrence of the path (R1; R3). Thus, each occurrence of R3 requires an occurrence of the path (R3; R1; R3) and an occurrence of the path (R2; R1; R3). But the path (R2; R1; R3) is not activable. Thus, R3 cannot appear an infinite number of times
during a rules process; and since only R3 raises the event E1 required by the conjunction event of R1, R1 cannot appear an infinite number of times either. R1 can be removed from the triggering graph. So, this rules set always terminates. No previous algorithm, and no combination of previous algorithms, is able to detect the termination of this rules set. This is due to the following fact: termination of this rules set can be guaranteed if we take into account at the same time the composite conjunction event of the rule R1 and the fact that the path (R2; R1; R3) is not activable. No previous algorithm deals at the same time with composite conjunction events and the deactivation of a path. [13], which deals with composite conjunction events, does not deal with the deactivation of conditions, and [10, 11], which deal with the deactivation of the overall condition of a path, do not deal with composite conjunction events.
4 Events/Rules Graphs
In this section, we propose a refinement of the triggering graphs proposed in the past: the triggering graphs we propose contain not only rules, but also events (primitive events and composite events). We then develop the notions of composite path and maximal order M path preceding a rule.
4.1 Considered Active Rules
The active rules that we consider in this article follow the Event-Condition-Action paradigm. The database model is a relational model or an object oriented model. We do not make any assumption about the execution model of the rules (execution in depth, in width, or another model). Each rule definition contains a unique event. A composite event is specified by binding two events (primitive or composite) by means of one of the three binary operators AND, OR and SEQ. AND is the conjunction operator, OR is the disjunction operator, and SEQ is the sequence operator. We assume that the semantics of conjunction events and sequence events is defined by one of the three semantics defined by Snoop [6]: recent context (each occurrence of the composite event is computed with the most recent occurrence of each component event), chronicle context (each occurrence of the composite event is computed with the oldest occurrence of each component event), or cumulative context (each occurrence of the composite event is computed with all the occurrences of each component event). From the point of view of termination analysis, it is important to note the following property of these three semantics: let us consider two (primitive or composite) events E1 and E2. If E1 or E2 can only occur a finite number of times during any rules process, then the conjunction event (E1 AND E2) and the sequence event (E1 SEQ E2) can only occur a finite number of times during any rules process.
4.2 Graphical Representation of the Composite Events
We graphically represent a composite event by means of a tree. The leaves of the tree are the primitive events which make up the composite event. The final node of the tree is the represented composite event. Each intermediate node is a composite event. A conjunction/disjunction/sequence event E has two incoming edges: the event E is the conjunction/disjunction/sequence of the origin nodes of the incoming edges. The tree is built by a syntactic analysis of the composite event.
4.3 Definition
An events/rules graph is an oriented graph whose nodes are events and rules. Composite events are depicted using the tree representation shown above. There is an edge from a composite event E towards a rule R iff E is the composite event defined as the triggering event of the rule R. There is an edge from the rule R towards the event E iff E is a primitive event and R can raise E.
Example. For the active rules of the motivating example (section 3), the events/rules graph is shown in Fig. 2. There are three primitive events: E1 is raised by a call of the method rate_increase, E2 is raised by a call of the method overdraft_increase, and E3 is raised by a call of the method capacity_increase.
Fig. 2. Events/Rules Graph.
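For illustration only, the events/rules graph of Fig. 2 can be encoded as a small Python structure in which every node records its kind and its incoming edges; the node names follow the motivating example, while the encoding itself is an assumption of this sketch, not part of the formal definition above.

# Minimal encoding of the events/rules graph of Fig. 2 (illustrative sketch only).
# Each node records its kind ("rule", "primitive", "and", "or", "seq") and the
# list of nodes from which an edge points towards it.
GRAPH = {
    "E1":        {"kind": "primitive", "preds": ["R3"]},        # raised by rate_increase
    "E2":        {"kind": "primitive", "preds": ["R2"]},        # raised by overdraft_increase
    "E3":        {"kind": "primitive", "preds": ["R1"]},        # raised by capacity_increase
    "E1 AND E2": {"kind": "and",       "preds": ["E1", "E2"]},  # triggering event of R1
    "E1 OR E3":  {"kind": "or",        "preds": ["E1", "E3"]},  # triggering event of R2
    "R1":        {"kind": "rule",      "preds": ["E1 AND E2"]},
    "R2":        {"kind": "rule",      "preds": ["E1 OR E3"]},
    "R3":        {"kind": "rule",      "preds": ["E3"]},
}

def predecessors(graph, node):
    """Nodes N' of the graph such that there is an edge from N' towards `node`."""
    return graph[node]["preds"]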
4.4 Composite Paths
We propose here the notion of composite path. This notion will be used for the evaluation of the truth value of the condition of a rule.
Simple Path. Let n be such that n > 1. Let N1, N2, ..., Ni, ..., Nn be n nodes (not necessarily all distinct) of an events/rules graph such that there is an edge from Ni+1 to Ni. The tuple (N1, N2, ..., Nn) makes up a simple path. We adopt the following notation: N1←N2←...←Ni←...←Nn (N1 is the last node of the path, Nn is the first node of the path).
Composite Path. A simple path is a particular case of composite path. Let N1←N2←...←Nn be a simple path (n > 1). Let Cj (1 ≤ j ≤ k) be k composite paths such that the last node of Cj is Nn and such that the last edge of Cj1 is different from the last edge of Cj2 (for 1 ≤ j1 ≤ k, 1 ≤ j2 ≤ k and j1 ≠ j2). Then, the tuple of paths ((N1←N2←...←Nn), C1, ..., Ck) makes up a composite path. If Nn is a conjunction event or a sequence event, we denote this composite path in the following way: ((N1←N2←...←Nn)←C1) AND ((N1←N2←...←Nn)←C2) AND ... ((N1←N2←...←Nn)←Ck). If Nn is neither a conjunction event nor a sequence event, we denote this composite path in the following way: ((N1←N2←...←Nn)←C1) OR ((N1←N2←...←Nn)←C2) OR ... ((N1←N2←...←Nn)←Ck).
Example. See Fig. 2. (E3←R1←(E1 AND E2)←E1←R3) AND (E3←R1←(E1 AND E2)←E2←R2) is a composite path whose last node is E3.
4.5 Maximal Order M Path Preceding a Rule
Let G be an events/rules graph. We replace the classical notion of cycle (used in the previous termination algorithms for the analysis of triggering graphs) by the notion of maximal order M path preceding a rule. This is the composite path which contains all the simple paths preceding a rule. The number M corresponds to a limit on the length of the considered simple paths and is fixed by the designer. Let R0 be a rule. The maximal order M path preceding R0, Max_Path(R0; M; G), is built by performing a depth-first search in the opposite direction of the edges. The computation is the following:
Path0 = ( R0 )
Max_Path(R0 ; M ; G) = Path_Building_Function ( Path0 )
Path_Building_Function (incoming variable: Pathin, outgoing variable: Pathout)
  Let N be the first node of Pathin
  Let N1, N2, ..., Np be the nodes of G such that there is an edge from Ni to N
  FOR each node Ni (1 ≤ i ≤ p)
    IF Ni appears less than M times in Pathin
      Pathi = Path_Building_Function ( Pathin ← Ni )
    ENDIF
  ENDFOR
  IF N is a conjunction event or a sequence event
    Pathout = ( Path1 AND Path2 ... AND Pathp )
  ELSE
    Pathout = ( Path1 OR Path2 ... OR Pathp )
  ENDIF
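The path-building function can be transcribed almost literally into executable form. The sketch below is illustrative only: it assumes the dictionary-based graph encoding shown after Fig. 2 (node kind plus predecessor list) and represents a composite path either as a plain tuple of node names (a simple path, listed from the rule backwards towards its causes) or as a pair (operator, branches).

def max_path(rule, M, graph):
    """Maximal order M path preceding `rule` in the events/rules graph `graph`."""
    return build_path((rule,), M, graph)

def build_path(path_in, M, graph):
    node = path_in[-1]                       # the "first node" of Pathin (deepest cause so far)
    branches = []
    for pred in graph[node]["preds"]:        # nodes with an edge towards `node`
        if path_in.count(pred) < M:          # each node may occur at most M times in a path
            branches.append(build_path(path_in + (pred,), M, graph))
        else:                                # bound reached: this branch stops here
            branches.append(path_in)
    if not branches:
        return path_in                       # a maximal simple path
    if len(branches) == 1:
        return branches[0]
    op = "AND" if graph[node]["kind"] in ("and", "seq") else "OR"
    return (op, branches)

# For the graph of Fig. 2, max_path("R3", 1, GRAPH) yields the composite path of
# the example below: an AND of the branch through E1 and an OR of the two
# branches that run through E2 and R2 up to (E1 OR E3) and to E1.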
Example. See Fig. 2. Max_Path(R3; 1; G) = (R3←E3←R1←(E1 AND E2)←E1) AND ((R3←E3←R1←(E1 AND E2)←E2←R2←(E1 OR E3)) OR (R3←E3←R1←(E1 AND E2)←E2←R2←(E1 OR E3)←E1))
5 Evaluation of the Truth Value of the Condition of a Rule
We first introduce the notion of stabilization field of a rule. The usefulness of the notion of stabilization field is to represent in a uniform way the possible causes of deactivation of the condition of a rule due to a simple path. The truth value of the condition of a rule R due to a simple path is stored by means of a function TV. The various stored truth values of the condition of the rule R are then manipulated, thanks to the properties of the function TV, in order to deduce the truth value of the condition of R due to the maximal order M path preceding R.
5.1 Stabilization Field of a Rule
Definition. Let Path be a simple path of the events/rules graph such that the last node of Path is the rule Last_Rule. Let R1, R2, ..., Rn be n rules of the events/rules graph. We say that the pair (Path, {R1, R2, ..., Rn}) is a stabilization field of Last_Rule iff we have the following property: for each rules process P such that there is no occurrence of the rules R1, R2, ..., Rn during P, there is only a finite number of occurrences of Path during P. Path is a stabilizer of Last_Rule; {R1, R2, ..., Rn} is a destabilizing set associated with the stabilizer Path. Previous termination algorithms [3, 4, 8, 9, 12] can be used to determine stabilization fields of a rule. For example, a simple path is a stabilizer of the last rule of the path if there is a generalized connection formula along this simple path which cannot be satisfied [8, 10, 12]; an associated destabilizing set then contains all the rules whose action can modify the attributes contained in the atoms of the connection formula. Thanks to this notion, it is possible to represent in a uniform way the possible cases of deactivation of the condition of a rule listed by the previous termination algorithms [3, 4, 8, 9, 12].
5.2 Truth Value of the Condition of a Rule Due to a Simple Path
Let Graph1 be the initial events/rules graph or a subgraph of the initial graph; let Path1 be a simple path of Graph1; let Rule1 be the last rule of Path1. We evaluate the truth value of the condition of the rule Rule1 due to the simple path Path1 for the graph Graph1 using a function TV, which associates a boolean value with the triple (Rule1; Path1; Graph1). The meaning of the function TV is the following: if TV(Rule1; Path1; Graph1) is FALSE, this means that we are sure that, for each rules process P composed of rules of Graph1, there is only a finite number of occurrences of Path1.
If TV(Rule1; Path1; Graph1) is TRUE, this means that we are not sure of the truth value of the condition of Rule1 due to Path1 for Graph1. The function TV is determined using the two following formulas:
(1) An unknown value TV(Rule1; Path1; Graph1) is assumed to be TRUE.
(2) We set TV(Rule1; Path1; Graph1) = FALSE if we can determine a stabilization field (Simple_Path, {R1, R2, ..., Rn}) of Rule1 such that Simple_Path = Path1 and no rule Ri (1 ≤ i ≤ n) is in Graph1.
5.3 Truth Value of the Condition of a Rule Due to the Maximal Order M Path
We extend the previous function TV in order to evaluate the truth value of the condition of a rule Rule1 due to the maximal order M path preceding Rule1. For this, we apply the following formulas to Max_Path(Rule1; M; Graph1), where Graph1 is the initial events/rules graph or a subgraph of the initial events/rules graph, and Path1 and Path2 are composite paths of Graph1 which have the same last rule Rule1:
TV(Rule1; Path1 OR Path2; Graph1) = TV(Rule1; Path1; Graph1) OR TV(Rule1; Path2; Graph1)
TV(Rule1; Path1 AND Path2; Graph1) = TV(Rule1; Path1; Graph1) AND TV(Rule1; Path2; Graph1)
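Using the same illustrative path representation as in the Max_Path sketch (simple paths as tuples listed from the rule backwards, composite combinations as ("AND"/"OR", branches)), the formulas can be folded into one recursive evaluator. The stabilization-field table passed in is an assumption of the sketch; in practice it would be filled by tests such as the Refined Triggering Graph method.

def tv(rule, path, graph, stab_fields):
    """Truth value of the condition of `rule` due to `path` for `graph` (sketch).

    `stab_fields` maps a rule to a list of (stabilizer, destabilizing_rules)
    pairs; a stabilizer is a simple path stored in the same tuple form."""
    # Composite combination: ("AND", branches) or ("OR", branches).
    if len(path) == 2 and path[0] in ("AND", "OR") and isinstance(path[1], list):
        values = [tv(rule, branch, graph, stab_fields) for branch in path[1]]
        return all(values) if path[0] == "AND" else any(values)
    # Simple path, formula (2): FALSE if a stabilization field matches and none
    # of its destabilizing rules is left in the graph.  Formula (2) asks for
    # Simple_Path = Path1; following the worked example in Section 6, a
    # stabilizer that is a prefix of the evaluated simple path is accepted too.
    for stabilizer, destabilizers in stab_fields.get(rule, []):
        matches = path == stabilizer or path[:len(stabilizer)] == stabilizer
        if matches and not any(r in graph for r in destabilizers):
            return False
    return True   # formula (1): an unknown value is assumed TRUE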
6 Termination Algorithm
The termination algorithm is composed of two main parts. The first part removes rules which are only triggered a finite number of times during any rules process, and events which are only raised a finite number of times during any rules process. The second part removes rules R whose condition is deactivated by the maximal order M path preceding R. We sketch our termination algorithm below:
G = Initial Events/Rules Graph
WHILE nodes of G are removed
  WHILE nodes of G are removed
    Part One: Forward Deletion of Rules
    Remove from G the nodes without incoming edge
    Remove from G the conjunction events and the sequence events E such that an incoming edge of E has been removed
  ENDWHILE
  WHILE (all the rules of G have not been tested) AND (the condition of a rule has not been detected as deactivated)
    Part Two: Detection of the Deactivation of the Condition of a Rule
    Choose a rule R
    Compute Max_Path(R ; M ; G)
    Evaluate TV (R ; Max_Path(R ; M ; G) ; G)
    IF TV (R ; Max_Path(R ; M ; G) ; G) = FALSE
      Remove R from G
    ENDIF
  ENDWHILE
ENDWHILE
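Wiring the pieces together, the two-part loop can be sketched as follows. This is illustrative only: `max_path` and `tv` stand for the functions sketched in Sections 4.5 and 5, passed in as callables (with the stabilization fields already bound into `tv`, e.g. via functools.partial), and the graph encoding is the dictionary form assumed earlier.

def analyze_termination(graph, M, max_path, tv):
    """Return the set of nodes that survive the analysis; an empty set means
    termination of the rules set is guaranteed."""
    changed = True
    while changed:
        changed = False
        # Part One: forward deletion of rules and events.
        while True:
            no_incoming = {n for n, d in graph.items()
                           if not any(p in graph for p in d["preds"])}
            broken = {n for n, d in graph.items()
                      if d["kind"] in ("and", "seq")
                      and any(p not in graph for p in d["preds"])}
            removable = no_incoming | broken
            if not removable:
                break
            for n in removable:
                del graph[n]
            changed = True
        # Part Two: remove one rule whose condition is deactivated by the
        # maximal order M path preceding it, then restart Part One.
        for rule in [n for n, d in graph.items() if d["kind"] == "rule"]:
            if not tv(rule, max_path(rule, M, graph), graph):
                del graph[rule]
                changed = True
                break
    return set(graph)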
Termination of the Rules Set. If the final graph is empty, termination of the rules set is guaranteed. If, after application of the termination algorithm, there are still rules in the final graph, these rules risk being triggered infinitely. The designer then has to examine, and possibly modify, these rules.
Example. We analyze the termination of the rules set of our motivating example (section 3). Let G be the initial events/rules graph of this rules set (Fig. 2). Part One of the algorithm removes no node. Let us apply Part Two of the algorithm and choose the rule R3. The maximal order 1 path preceding R3 is:
Max_Path(R3; 1; G) = (R3←E3←R1←(E1 AND E2)←E1) AND ((R3←E3←R1←(E1 AND E2)←E2←R2←(E1 OR E3)) OR (R3←E3←R1←(E1 AND E2)←E2←R2←(E1 OR E3)←E1))
The pair ((R3←E3←R1←(E1 AND E2)←E2←R2), ∅) is a stabilization field of R3. We can establish this result using the Refined Triggering Graph method [8, 12]: the generalized connection formula (A3.type = standard_account) AND (A3 = A1) AND (A1 = A2) AND (A2.type = stocks_account) cannot be satisfied, and the attribute type cannot be updated by any rule action. We can deduce:
TV(R3; Max_Path(R3; 1; G); G) = TRUE AND (FALSE OR FALSE) = FALSE
R3 can be removed. By forward deletion (Part One of the algorithm), the other nodes of the graph can then be removed. Termination of this rules set is guaranteed by our algorithm. Note that no previous algorithm is able to detect the termination of this rules set.
7 Conclusion
We have presented an improvement of the termination analysis of active rules with composite events. Our termination algorithm detects all the termination cases detected by the previous algorithms [3, 8, 9, 10, 11, 12, 13]. Many more termination situations are detected by our algorithm, especially when active rules with composite events are defined. In the future, we plan to determine sufficient conditions to guarantee the termination of the union of several rules sets designed by several designers, even if no designer knows all the rules: this can be useful for modular design, when several active rules sets are designed by distinct designers.
References
1. A. Aiken, J. Widom, J.M. Hellerstein. Behavior of Database Production Rules: Termination, Confluence and Observable Determinism. In Proc. Int'l Conf. on Management of Data (SIGMOD), San Diego, California, 1992.
2. J. Bailey, G. Dong, K. Ramamohanarao. Decidability and Undecidability Results for the Termination Problem of Active Database Rules. In Proc. ACM Symposium on Principles of Database Systems (PODS), Seattle, Washington, 1998.
3. E. Baralis, S. Ceri, S. Paraboschi. Improved Rule Analysis by Means of Triggering and Activation Graphs. In Proc. Int'l Workshop on Rules in Database Systems (RIDS), Athens, Greece, 1995.
4. E. Baralis, J. Widom. An Algebraic Approach to Rule Analysis in Expert Database Systems. In Proc. Int'l Conf. on Very Large Data Bases (VLDB), Santiago, Chile, 1994.
5. S. Ceri, J. Widom. Deriving Production Rules for Constraint Maintenance. In Proc. Int'l Conf. on Very Large Data Bases (VLDB), Brisbane, Queensland, Australia, 1990.
6. S. Chakravarthy, D. Mishra. Snoop: An Expressive Event Specification Language for Active Databases. In Data and Knowledge Engineering, 14, 1994.
7. U. Dayal, A.P. Buchmann, D.R. Mc Carthy. Rules are Objects too: a Knowledge Model for an Active Object Oriented Database System. In Proc. Int'l Workshop on Object-Oriented Database Systems, Bad Münster am Stein-Ebernburg, FRG, 1988.
8. A.P. Karadimce, S.D. Urban. Refined Triggering Graphs: a Logic-Based Approach to Termination Analysis in an Active Object-Oriented Database. In Proc. Int'l Conf. on Data Engineering (ICDE), New Orleans, Louisiana, 1996.
9. S.Y. Lee, T.W. Ling. Refined Termination Decision in Active Databases. In Proc. Int'l Conf. on Database and Expert Systems Applications (DEXA), Toulouse, France, 1997.
10. S.Y. Lee, T.W. Ling. A Path Removing Technique for Detecting Trigger Termination. In Proc. Int'l Conf. on Extending Database Technology (EDBT), Valencia, Spain, 1998.
11. S.Y. Lee, T.W. Ling. Unrolling Cycle to Decide Trigger Termination. In Proc. Int'l Conf. on Very Large Data Bases (VLDB), Edinburgh, Scotland, 1999.
12. M.K. Tschudi, S.D. Urban, S.W. Dietrich, A.P. Karadimce. An Implementation and Evaluation of the Refined Triggering Graph Method for Active Rule Termination Analysis. In Proc. Int'l Workshop on Rules in Database Systems, Skoevde, Sweden, 1997.
13. A. Vaduva, S. Gatziu, K.R. Dittrich. Investigating Termination in Active Database Systems with Expressive Rule Languages. In Proc. Int'l Workshop on Rules in Database Systems, Skoevde, Sweden, 1997.
TriGS Debugger – A Tool for Debugging Active Database Behavior
G. Kappel, G. Kramler, W. Retschitzegger
Institute of Applied Computer Science, Department of Information Systems (IFS)
University of Linz, A-4040 Linz, AUSTRIA
email: {gerti, gerhard, werner}@ifs.uni-linz.ac.at
The financial support by SIEMENS PSE Austria under grant No. 038CE-G-Z360-158680 is gratefully acknowledged.
Abstract. Active database systems represent a powerful means to respond automatically to events that are taking place inside or outside the database. However, one of the main stumbling blocks for their widespread use is the lack of proper tools for the verification of active database behavior. This paper addresses this need by presenting TriGS Debugger, a tool which supports mechanisms for predicting, understanding and manipulating active database behavior. First, TriGS Debugger provides an integrated view of both active and passive behavior by visualizing their interdependencies, thus facilitating pre-execution analysis. Second, post-execution analysis is supported by tracing and graphically representing active behavior, including composite events and rules which are executed in parallel. Third, TriGS Debugger allows the rule developer to interactively examine and manipulate the active behavior at run-time.
1 Introduction
Active database systems have been developed for several years. Basic active facilities in terms of Event/Condition/Action rules (ECA rules) have already found their way into commercial database systems [1]. Although active facilities are suitable for a wide range of different tasks, they are not straightforward to use when developing active database applications [21]. The main reasons are as follows. First, there is the very special nature of active behavior, which is controlled dynamically by events rather than statically by a flow of control, the latter being the case for traditional applications based on passive database behavior. Second, while each single rule is easy to understand, complexity arises from the interdependencies among rules and between active and passive behavior. Finally, the inherent complexity of active behavior is increased by concepts such as composite events, cascaded rule execution, and parallel rule execution. The actual behavior of a set of rules responsible for a certain active behavior is very hard to understand without proper tool support. The special characteristics of active behavior, however, prevent the straightforward employment of traditional debuggers realized for application development based on
passive database behavior. Therefore, specific approaches for the verification of active behavior have been investigated. First of all, there are approaches supporting static rule analysis to determine certain qualities of a set of rules, like termination, confluence, and observable determinism [2], [3], [4], [18], [19], [24]. A major drawback of these approaches is that expressive rule languages which are not formally defined are hard to analyze, leading to imprecise results. Furthermore, on the one hand it is not always obvious what action should be taken when a potential source of nontermination or nonconfluence is detected, and on the other hand, the fact that a set of rules exhibits terminating and confluent behavior does not necessarily imply that it is correct. Due to these drawbacks, static rule analysis has had no major influence on the development of active applications [21]. Most existing systems take a complementary approach in that they record the active behavior at run-time and visualize rule behavior afterwards [5], [6], [7], [8], [9], [10], [23]. Besides mere recording and viewing, some systems let the rule developer control the active behavior by means of breakpoints and step-by-step execution, enabling the inspection of database states at any time during rule execution. However, existing systems often do not cope with the interdependencies between passive behavior and active behavior. They still lack proper debugging support for important aspects of rule behavior, like the composite event detection process [6] and rules executed in parallel, which are not considered at all. Finally, the information overload induced by complex trace data is often not handled properly. TriGS Debugger copes with these drawbacks and the special nature of active database behavior in three different ways. First, pre-execution analysis is allowed on the basis of an integrated view of active behavior and passive behavior. Second, post-execution analysis is supported by a graphical representation of active behavior which includes the detection of composite events and rules which are executed in parallel. Special emphasis is placed on the complexity of the resulting trace data by allowing for filtering and pattern mining. Third, TriGS Debugger allows the rule developer to interactively examine and manipulate the active behavior at run-time. In particular, mechanisms are provided to set breakpoints, to replay single events or event sequences, to (de)activate selected events and rules, and to modify rule properties and the rule code itself. The remainder of this paper is organized as follows. The next section provides a concise overview of the active object-oriented database system TriGS as a prerequisite for understanding the work on the TriGS Debugger. In Section 3, the functionality of TriGS Debugger is presented from a rule developer’s point of view. The paper concludes with some lessons learned from user experiences and points to future research.
2 Overview of TriGS
TriGS Debugger is realized on top of TriGS (Triggersystem for GemStone) [15], which represents an active extension of the object-oriented database system GemStone [11]. The two main components of TriGS are TriGS Engine, comprising the basic concepts employed in TriGS for specifying and executing active behavior, and TriGS Developer, an environment supporting the graphical development of active database
applications [22]. In the following, TriGS Engine is described as far as necessary for understanding the forthcoming sections. Like most active systems, TriGS is designed according to the ECA paradigm. Rules and their components are implemented as first-class objects, allowing both the definition and the modification of rules during run-time. The structure of ECA rules in TriGS is defined by the following template in Backus-Naur form (the bracketed non-terminals were lost in extraction and are reconstructed here from the surrounding description):
<rule> ::= DEFINE RULE <ruleName> AS
  ON <EselC>                  // Condition event selector
  IF <condition> THEN         // Condition part
  [[WAIT UNTIL] ON <EselA>]   // Action event selector
  EXECUTE [INSTEAD] <action>  // Action part
  [WITH PRIORITY <priority>]
  [TRANSACTION MODES (C:{serial|parallel}, A:{serial|parallel})]
  END RULE .
The event part of a rule is represented by a condition event selector (EselC) and an optional action event selector (EselA) determining the events (e.g., a machine breakdown) which are able to trigger the rule's condition and action, respectively. Triggering a rule's condition (i.e., an event corresponding to the EselC is signaled) implies that the condition has to be evaluated. If the condition evaluates to true, and an event corresponding to the EselA is also signaled, the rule's action is executed. If the EselA is not specified, the action is executed immediately after the condition has been evaluated to true. By default, the transaction signaling the condition triggering event is not blocked while the triggered rule is waiting for the action triggering event to occur. Blocking can be specified by the keyword WAIT UNTIL. In TriGS, any message sent to an object may signal a pre- and/or post-message event. In addition, TriGS supports time events, explicit events, and composite events. Composite events consist of component events which may be primitive or composite and which are combined by different event operators such as conjunction, sequence and disjunction. The event starting the detection of a composite event is called the initiating event; the event terminating the detection is called the terminating event. TriGS allows composite events to be detected whose component events are either signaled within a single transaction or within different transactions. It is even possible that component events span different database sessions, each comprising one or more transactions [13]. For each event, a guard, i.e., a predicate over the event's parameters, may be specified, which further restricts the events able to trigger a condition or action, respectively. The condition part of a rule is specified by a Boolean expression, possibly based on the result of a database query (e.g., are there some scheduled jobs on the damaged machine?). The action part is specified again in terms of messages (e.g., display all jobs scheduled on the damaged machine and reschedule them on another machine). Considering rules incorporating message events, the keyword INSTEAD allows one to specify that the action should be executed instead of the method corresponding to the message triggering condition evaluation. The execution order of multiple triggered rules, i.e., conditions and actions of different rules which are triggered at the same time, is controlled by means of priorities.
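As a reading aid only, the execution semantics just described can be paraphrased in a few lines of Python; the class and function names below are inventions of this sketch and do not correspond to the TriGS API.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(eq=False)
class ECARule:
    name: str
    esel_c: Callable[[object], bool]            # condition event selector (incl. guard)
    condition: Callable[[object], bool]         # boolean expression / database query
    action: Callable[[object], None]
    esel_a: Optional[Callable[[object], bool]] = None   # optional action event selector
    wait_until: bool = False                    # block the signalling transaction?
    instead: bool = False                       # execute instead of the triggering method?
    priority: int = 0
    cond_mode: str = "serial"                   # transaction mode of the condition
    act_mode: str = "serial"                    # transaction mode of the action

def on_event(rule: ECARule, event, waiting: list) -> None:
    """Process one signalled event for one rule (scheduling details omitted)."""
    if rule in waiting and rule.esel_a is not None and rule.esel_a(event):
        rule.action(event)                      # action triggered by the EselA event
        waiting.remove(rule)
    elif rule.esel_c(event) and rule.condition(event):
        if rule.esel_a is None:
            rule.action(event)                  # no EselA: execute immediately
        else:
            waiting.append(rule)                # wait for the action triggering event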
The transaction mode is specified separately for conditions and actions, respectively. It defines in which transaction rule processing, comprising rule scheduling and rule execution, takes place. Rule scheduling includes the detection of composite events and the determination of triggered conditions and actions. Rule execution refers to condition evaluation and action execution. In the case of a serial mode, rule processing is done as part of the database session’s transaction which has signaled the triggering event. In the case of a parallel mode, rule processing is done within transactions of separate database sessions. Rules incorporating composite events whose component events are signaled by different database sessions can have a parallel transaction mode only. This is also the case for rules which are triggered by events being signaled outside of a database session, e.g., time events. Rule scheduling is done within a dedicated database session running in a separate thread of control called the Parallel Rule Scheduler. Rule execution is made efficient by means of several Parallel Rule Processors, each running within a separate thread [14]. The number of these threads is dynamically controlled and depends on the utilization of the corresponding rule processors.
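The division of labour between the Parallel Rule Scheduler and the Parallel Rule Processors follows a standard producer/consumer pattern. The following skeleton is an assumption-laden illustration only: a fixed-size worker pool stands in for TriGS's dynamically controlled one, and plain queues stand in for database sessions and transactions.

import queue
import threading

event_queue = queue.Queue()   # events signalled with parallel transaction mode
work_queue = queue.Queue()    # triggered condition evaluations / action executions

def parallel_rule_scheduler(detect_triggered):
    """Dedicated scheduler session: composite event detection and scheduling."""
    while True:
        event = event_queue.get()
        if event is None:                       # shutdown marker
            break
        for job in detect_triggered(event):     # callables: evaluate a condition / run an action
            work_queue.put(job)

def parallel_rule_processor():
    """One Parallel Rule Processor: executes rules in its own session/transaction."""
    while True:
        job = work_queue.get()
        if job is None:
            break
        job()

# Two worker threads stand in for the dynamically sized processor pool.
processors = [threading.Thread(target=parallel_rule_processor, daemon=True)
              for _ in range(2)]
for p in processors:
    p.start()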
3 Functionality of TriGS Debugger
This section is dedicated to an in-depth description of the functionality of TriGS Debugger. TriGS Debugger is operational as part of the TriGS Developer; a more detailed description can be found in [16].
3.1 Providing an Integrated View on Active and Passive Schema
During pre-execution analysis, i.e., schema analysis, of an active system like TriGS it is not sufficient to focus exclusively on the active schema, because of the interdependencies between the active schema and the passive schema. For instance, rules may be triggered by method calls in the case of message events. However, since these schemas are developed using different tools, namely the GemStone (GS) Class Browser and TriGS Designer, respectively, it is difficult to keep track of the interdependencies between passive schema and active schema. In order to provide an integrated view on the database schema, TriGS Debugger supports a Structure Browser (cf. Fig. 1). The Structure Browser shows the passive object-oriented schema on the left side of the window by means of a class hierarchy tree comprising classes, methods, and class inheritance. This is complemented by a visualization of the active schema on the right side of the window, which comprises rules and their event selectors, no matter whether primitive or composite. The interdependencies between passive schema and active schema are depicted by edges between methods and corresponding message event selectors. In order to reduce information overload, guards, conditions, and actions of a rule are not explicitly shown within the Structure Browser. Rather, their existence is indicated by means of small black boxes only. Finally, the Structure Browser provides a filtering mechanism to cope with the fact that, when many classes
and rules are visualized at once, the display becomes cluttered. For example, it is possible to filter the methods and classes related to a particular rule and its event selectors. The filter result may be visualized in a context-preserving way by means of highlighting using different colors.
Fig. 1. Structure Browser (figure legend: class, method, class inheritance, primitive event selector, composite event operator, rule, event guard indicator, condition and action indicators, active schema, passive schema, filter)
3.2 Visualizing Trace Data
Post-execution analysis, i.e., verification of the actual active behavior, requires tracing of events and rule executions, and visualization of the traced data. For this, TriGS Debugger provides a History Browser (cf. Fig. 2), which may be opened at any time during or after the run of an active application to get a graphical snapshot of its behavior until that point in time. The visualization has been designed to highlight the relationships among events, like parallelism, event composition, or causal dependencies. In particular, the History Browser contains the following information:
• Temporal Context. Since active behavior occurs over time, a timeline (cf. Fig. 2) organizes trace information in a temporal context, from top to bottom.
• Event Detection. Event detection of both primitive and composite events is visualized as an event graph. This event graph is different from the event selector tree as depicted by the Structure Browser in that it represents events not at type level but rather at instance level. Since each event may be a component of multiple composite events, a directed acyclic graph is formed.
• Rule Execution. The visualization of rule execution is decomposed into condition evaluation and action execution, because these two phases of rule execution may be triggered at different points in time by different events. Visualization of a condition evaluation comprises the begin of evaluation (cf. "evaluating" in Fig. 2) together with the name of the rule, possibly cascaded rule executions, and the end of evaluation indicated by the result of evaluation, i.e., true, false, an object, or nil. Action execution is visualized accordingly. The symbol “||” preceding the begin of evaluation/execution denotes that the condition/action was performed by a Parallel Rule Processor outside the session which signaled the triggering event.
Fig. 2. History Browser (figure legend: timeline, database sessions, composite event detection, rule execution, cascaded rule execution, triggering indicator)
• Database Sessions. Database sessions during which events have been signaled or rules have been processed appear in the History Browser as parallel vertical tracks. According to the separation of serial rule processing and parallel rule processing in TriGS as discussed in Section 2, all trace data concerning a serial rule is shown within the track of the database session where the triggering event was signaled. This can be either an Application Session or a Parallel Rule Processor Session. Since events triggering parallel rules are processed within a dedicated database session, there is a special track labeled Parallel Rule Scheduler displayed right of the Application Session, where all trace data of event detection for parallel rules is shown. As discussed in Section 2, parallel rules are executed by parallel rule processors in separate database sessions. The tracks of Parallel Rule Processor sessions are displayed right of the Parallel Rule Scheduler track.
• Transactions. Transaction boundaries within database sessions are represented by primitive events corresponding to commit/abort operations. Since committing a transaction may fail due to serialization conflicts, success or failure of a commit operation is indicated as well (cf. transaction commit = true in Fig. 2).
Further details are available by means of the integrated GS Inspector, allowing the inspection and modification of trace data and related database objects, like condition results and action results as well as event parameters. However, one has to be aware that the Inspector shows the current state of an object, which may have changed since the time when, e.g., the object was used to evaluate the condition of a rule.
3.3 Customizing and Analyzing Trace Data
In the case of a huge amount of events and rule executions, a rule developer looking for whether and why something has or has not happened will be overwhelmed by the sheer amount of information preserved in the trace data. Especially composite events, which may have to be traced over a long period of time and which may originate from various
different sessions, contribute to the complexity of the visualization. To reduce the information overload, TriGS Debugger provides two complementary mechanisms for customizing the visualization of trace data, namely a filtering mechanism and a pattern analyzer.
3.3.1 Filtering Mechanism
The filtering mechanism of TriGS Debugger allows the History Browser’s view to be focused on particularly interesting details of the trace data, while preserving their order of occurrence, by providing several context independent and context dependent filters. Whereas context dependent filters allow one to navigate on causes and effects of selected parts of the trace data, context independent filters allow one to focus on the active behavior without selecting any particular trace data. Analogous to the Structure Browser, the filter results may be visualized either exclusively in a new window, or by simply highlighting them. Context independent filters supported by TriGS Debugger are:
• Session. With the session filter, it is possible to show/hide every track of the History Browser.
• Time. The time filter allows one to focus on particular time intervals, specified either by begin and end time, or relative to the trace end, e.g. “the last 10 minutes”.
• Names of Event Selectors and Rules. The view on a certain trace can be restricted to events conforming to certain event selectors, and/or having triggered evaluations and executions of certain rules.
The initial focus set by context independent filters can be modified step-by-step in an interactive manner by means of the following context dependent filters:
• Cause. The cause filter applied to a condition evaluation or an action execution shows the triggering event. Applying the cause filter to a primitive event results in either the rule execution during which it has been signaled, or the Application Session which signaled the event.
• Direct/Indirect Effects. The effect filter allows one to focus either on direct effects only or on both direct and indirect effects of events and rule executions. Concerning a primitive event, the direct effects are the composite events terminated by that event, and the conditions/actions triggered, whereas the indirect effects are, in addition, all initiated composite events and all conditions/actions triggered by those initiated composite events. Concerning a rule execution, the direct effects are the primitive and composite events detected during execution as well as the first level of cascaded rule executions. The indirect effects of a rule execution include all levels of cascaded rule executions, as well as the events detected during these cascaded executions.
• Time. In order to focus on the temporal context of an event or of a rule execution, the time filter may be applied, resulting in all events and rule executions within some time interval before and after the selected event/rule execution.
Fig. 3 is based on the same trace data as shown in Fig. 2, with filters being applied, thus making the trace data more readable. In particular, a context independent filter has been applied showing only the events signaled from Application Session 3, and, based on the result of the first filter, a context dependent filter has been used to highlight the direct and indirect effects of the selected event.
Fig. 3. Filtering Trace Data Within the History Browser (the figure shows the result of a context independent filter and, highlighted, the result of a context dependent filter)
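If each trace record stores a link to its direct cause, the cause and effect filters reduce to simple traversals. A minimal sketch follows; the flat dictionary trace layout is an assumption of the sketch, not the TriGS trace format.

def cause(record_id, trace):
    """'Cause' filter: the event, rule execution or application session that
    directly caused the given trace record."""
    return trace[record_id].get("cause")

def direct_effects(record_id, trace):
    """Records whose direct cause is the given record (terminated composite
    events, triggered condition evaluations and action executions, ...)."""
    return [rid for rid, rec in trace.items() if rec.get("cause") == record_id]

def all_effects(record_id, trace):
    """Direct and indirect effects: transitive closure of direct_effects."""
    seen, stack = set(), [record_id]
    while stack:
        for eff in direct_effects(stack.pop(), trace):
            if eff not in seen:
                seen.add(eff)
                stack.append(eff)
    return seen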
3.3.2 Pattern Analyzer
Besides the filtering mechanisms, TriGS Debugger provides a Pattern Analyzer. In contrast to the History Browser, which shows each detected event and each executed rule in order of their occurrence, the Pattern Analyzer mines patterns of recurring active behavior and shows these patterns once, in an aggregated fashion. With this, a compact view on trace data is provided. Causal dependencies between rules and their components are visualized in one place, without having to browse through the whole history of trace data. In order to mine patterns, the trace is analyzed step by step and similar trace data is aggregated into equivalence classes, whereby not only the equivalence of events and rule executions themselves is considered, but also the equivalence of the relationships among them in terms of causal dependencies. Therefore, the equivalence relation is defined recursively, as follows:
• Two primitive events are considered equivalent if (1) they conform to the same primitive event selector, and (2) their causes, i.e., either the rule executions or the applications having signaled them, are in the same equivalence class.
• Equivalence of (partially detected) composite events is defined by the equivalence of their composition trees consisting of primitive component events (leafs) and event operators (nodes). Two such trees are equivalent if their root nodes are equivalent. Two nodes are in turn equivalent if (1) the event operator is the same, and (2) the child nodes/leafs are in the same equivalence class.
• Two condition evaluations are equivalent if (1) they belong to the same rule, and (2) the triggering events are in the same equivalence class.
• Two action executions are equivalent if (1) they belong to the same rule, (2) the corresponding condition evaluations are in the same equivalence class, and (3) the action triggering events – if defined – are in the same equivalence class.
Patterns of behavior are visualized as a directed acyclic graph, with root nodes on the left representing equivalence classes of active behavior with an outside cause, e.g., equivalence classes of primitive events raised by an application, and child nodes denoting equivalence classes of resulting active behavior. Note that this abstract view on rule behavior is similar to the so-called triggering graphs used in static rule analysis [2]. Unlike static rule analysis, which can be done before run-time, trace data patterns
are not available until the application has been started. However, since trace data patterns represent the real active behavior of an application, they have the advantage that quantitative information can be provided for each element of a pattern in terms of the number of occurrences and the execution time, comprising minimum, maximum and average values. Qualitative information in terms of, e.g., rule cycles is not automatically provided yet but rather subject to future work (cf. Section 4).
Fig. 4. Pattern Analyzer
Fig. 4 shows the Pattern Analyzer’s view of the trace data shown in Fig. 2. One can see in one place, e.g., all the direct and indirect consequences of the event post(MillingMachine,changeState). Another insight is that MillingRule_1 has been processed completely within the execution of ScheduleRule_2.
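The recursive equivalence relation can be realized as a recursive key function: two trace records belong to the same equivalence class exactly when their keys coincide, and the aggregated view then only needs to count records per key. The record layout used here is an assumption of this sketch.

from collections import Counter

def eq_key(rid, trace):
    """Recursive equivalence key of trace record `rid` (illustrative sketch)."""
    rec = trace[rid]
    if rec["kind"] == "primitive":
        cause = rec["cause"]                  # rule execution id or application name
        cause_key = eq_key(cause, trace) if cause in trace else cause
        return ("prim", rec["selector"], cause_key)
    if rec["kind"] == "composite":
        children = tuple(eq_key(c, trace) for c in rec["components"])
        return ("comp", rec["operator"], children)
    if rec["kind"] == "condition":
        return ("cond", rec["rule"], eq_key(rec["trigger"], trace))
    if rec["kind"] == "action":
        key = ("act", rec["rule"], eq_key(rec["condition"], trace))
        if rec.get("trigger") is not None:    # action triggering event, if defined
            key += (eq_key(rec["trigger"], trace),)
        return key
    raise ValueError("unknown record kind: %r" % rec["kind"])

def behaviour_patterns(trace):
    """Number of occurrences per pattern (equivalence class) of active behaviour."""
    return Counter(eq_key(rid, trace) for rid in trace)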
3.4 Interactive Rule Debugging
Besides the visualization and analysis capabilities, TriGS Debugger allows the rule developer to interactively control, examine and manipulate the active behavior by means of breakpoints, a replay and simulation mechanism, and again by taking advantage of the other components of TriGS Developer and the standard GS tools. First of all, TriGS Debugger allows breakpoints to be set from within the History Browser and the Structure Browser. Breakpoints may be defined on all rule components, namely events, guards, conditions, and actions. Whenever TriGS Engine encounters a breakpoint during rule processing, processing is stopped and a GemStone signal is thrown. On the front end, the GS Debugger has been extended to catch this signal and open a special breakpoint window. Depending on the rule component the breakpoint has been set on, different functionality is offered. In case the breakpoint has been set on a condition, one may choose to (1) proceed with condition evaluation, (2) continue under control of the GS Debugger in order to step through the condition’s code, (3) skip the evaluation of the condition by setting the condition to either true or false, or (4) terminate the evaluation. Concerning breakpoints on events, one can in addition ignore the raised event, or continue until the next event is signaled. It has to be emphasized that, since the standard GS Debugger can be incorporated from within the breakpoint window, one can control and examine the passive behavior of the application and its active behavior simultaneously. TriGS Debugger offers interactive features for modifying the active behavior at run-time similar to those of the GS Debugger. Exploiting the interpretative nature of GS, it
is possible to modify rule code at run-time, even during a breakpoint halt. It is further possible to temporarily enable or disable certain events, guards, or rules and to modify rule priority settings. The History Browser also supports a simple kind of event simulation [5] by allowing events to be replayed, meaning that they are raised again. Either a single event or an arbitrary sequence of events can be selected from the event history in order to be replayed. By means of the GS Inspector, it is possible to modify event parameters before replaying, thus enabling the testing of guards and conditions. This way, an application scenario can be simulated, allowing a rule set to be tested without the need to run the application.
4 Lessons Learned and Future Work
TriGS Debugger has already been employed in a project aiming at the development of a schema generator that translates a conceptual schema modeled in terms of AOBD (Active Object/Behavior Diagrams) [17] into an executable TriGS schema [20]. The active behavior generated thereby is highly complex and therefore difficult to understand, since it makes extensive use of both composite events and parallel rules. This section discusses the lessons learned in the course of this project and points to future work.
Conflicts between parallel rules are hard to locate. The complexity of active behavior is not founded in single rules but in the interaction among rules [21]. With parallel executing rules, this interaction is multiplied by the possibility of concurrent access to database resources, which in turn may lead to serialization conflicts and the abort of the corresponding transactions. It has been shown that parallel rules were involved in most of the non-trivial bugs which had to be found and removed during development of the schema generator. Therefore we believe that debugging tools for parallel executing rules are even more important than for serial ones. Although TriGS Debugger facilitates the debugging of parallel rules in that parallel database sessions, transactions, rules executed within a transaction, and the success or failure of transactions are visualized, the reason for a conflict is not explained by the current prototype. In this sense it would be desirable for a debugger to show the conflicting transactions/rules and the objects involved. Furthermore, techniques for debugging parallel programs should be adopted. For instance, the trace driven event replay technique already supported by TriGS Debugger could be used for semi-automatic detection of race conditions [12].
A history of database states would ease debugging. Trace data gathered by TriGS Debugger represents the history of event detections and rule executions, but it does not include the history of related database objects. This leads to the problem that one can inspect the current state of an object only. However, the object may have changed since the time when it was used, e.g., to evaluate the condition of a rule. Breakpoints are a cumbersome means to solve this problem, since they force a rule developer to interactively handle each break during an application test. Instead, it would be beneficial if a rule debugger could access a snapshot of the database state as it was at the time when a guard or condition was evaluated or an action was executed.
Active behavior requires active visualization. In order to support a rule developer in observing active behavior, a debugging tool should notify the developer automatically of any new rules which have been triggered or events which have been signaled. This could be achieved by updating the view(s) on trace data whenever a considerable amount of trace data has been added, or within certain time intervals. In the current prototype, the views of the History Browser and the Pattern Analyzer have to be updated explicitly in order to check for new events/rule executions.
Trace analysis is a promising research topic. We consider the analysis of trace data a promising alternative to static rule analysis. Especially in an environment like TriGS, which is based on the untyped language GS Smalltalk, static analysis is very restricted. Instead, we will focus on deriving qualitative information on active behavior from trace data. For instance, we plan to enhance the Pattern Analyzer to provide information on rule execution cycles. In general, TriGS Debugger should detect (patterns of) active behavior which violates certain properties like termination or determinism, and this should be highlighted/visualized in both the History Browser’s and the Pattern Analyzer’s view. Another possible application of trace data analysis would be to save the trace data of different application tests (of different application versions) and compare them afterwards. Comparison might be in terms of execution order or encountered behavior patterns.
TriGS Debugger is not specific to TriGS only. TriGS Debugger has been designed specifically for TriGS, without having its application to other active database systems in mind. As such, the implementation of features like interactive rule debugging relies heavily on specific properties of the underlying TriGS/GS system. Nevertheless, the basic ideas of TriGS Debugger, like providing an integrated view on active and passive behavior, the visualization and filtering of trace data, and pattern analysis of trace data, can be applied to debuggers for active (object-oriented) database systems other than TriGS as well.
References
1. ACT-NET Consortium: The Active Database Management System Manifesto: A Rulebase of ADBMS Features. SIGMOD Record, Vol. 25, 1996, pp. 40-49
2. Aiken, A., Widom, J., Hellerstein, J.: Behavior of database production rules: Termination, confluence, and observable determinism. SIGMOD Record, Vol. 21, 1992, pp. 59-68
3. Bailey, J., Dong, G., Ramamohanarao, K.: Decidability and Undecidability Results for the termination problem of active database rules. Proc. of the 17th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Seattle, 1998, pp. 264-273
4. Baralis, E.: Rule Analysis. In: Norman W. Paton (Ed.): Active Rules in Database Systems. Springer, New York, 1999, pp. 51-67
5. Behrends, H.: Simulation-based Debugging of Active Databases. In Proceedings of the 4th International Workshop on Research Issues in Data Engineering (RIDE) - Active Database Systems, Houston, Texas, IEEE Computer Society Press, 1994, pp. 172-180
6. Berndtsson, M., Mellin, J., Högberg, U.: Visualization of the Composite Event Detection Process. In Paton, N. W. and Griffiths, T. (eds.): International Workshop on User Interfaces to Data Intensive Systems (UIDIS'99). IEEE Computer Society, 1999, pp. 118-127
7. Chakravarthy, S., Tamizuddin, Z., Zhou, J.: A Visualization and Explanation Tool for Debugging ECA Rules in Active Databases. In Sellis, T. (ed.): Proc. of the 2nd Int. Workshop on Rules in Database Systems. Springer LNCS Vol. 985, 1995, pp. 197-212
8. Coupaye, T., Roncancio, C.L., Bruley, C., Larramona, J.: 3D Visualization Of Rule Processing In Active Databases. Proc. of the workshop on New paradigms in information visualization and manipulation, 1998, pp. 39-42
9. Diaz, O., Jaime, A., Paton, N.: DEAR: a DEbugger for Active Rules in an object-oriented context. In Proceedings of the 1st International Workshop on Rules in Database Systems, Workshops in Computing, Springer, 1993, pp. 180-193
10. Fors, T.: Visualization of Rule Behavior in Active Databases. In Proceedings of the IFIP 2.6 3rd Working Conference on Visual Database Systems (VDB-3), 1995, pp. 215-231
11. GemStone Systems Inc.: http://www.gemstone.com/products/s/, 2001
12. Grabner, S., Kranzlmüller, D., Volkert, J.: Debugging parallel programs using ATEMPT. Proceedings of HPCN Europe 95 Conference, Milano, Italy, May, 1995
13. Kappel, G., Rausch-Schott, S., Retschitzegger, W., Sakkinen, M.: A Transaction Model For Handling Composite Events. Proc. of the Int. Workshop of the Moscow ACM SIGMOD Chapter on Advances in Databases and Information Systems (ADBIS), MePhI, Moscow, September, 1996, pp. 116-125
14. Kappel, G., Rausch-Schott, S., Retschitzegger, W.: A Tour on the TriGS Active Database System - Architecture and Implementation. In J. Carroll et al (eds.): Proc. of the 1998 ACM Symposium on Applied Computing (SAC). Atlanta, USA, March, 1998, pp. 211-219
15. Kappel, G., Retschitzegger, W.: The TriGS Active Object-Oriented Database System - An Overview. ACM SIGMOD Record, Vol. 27, No. 3, September, 1998, pp. 36-41
16. Kappel, G., Kramler, G., Retschitzegger, W.: TriGS Debugger - A Tool for Debugging Active Database Behavior. Technical Report 09/00, Dept. of Information Systems, University of Linz, Austria, March, 2000
17. Lang, P., Obermair, W., Kraus, W., Thalhammer, T.: A Graphical Editor for the Conceptual Design of Business Rules. Proc. of the 14th Int. Conference on Data Engineering (ICDE), Orlando, Florida, IEEE Computer Society Press, February, 1998, pp. 599-609
18. Lee, S.Y., Ling, T.W.: Unrolling cycle to decide trigger termination. In Proceedings of the 25th International Conference on Very Large Data Bases, 1999, pp. 483-493
19. Montesi, D., Bagnato, M., Dallera, C.: Termination Analysis in Active Databases. Proceedings of the 1999 International Database Engineering and Applications Symposium (IDEAS'99), Montreal, Canada, August, 1999
20. Obermair, W., Retschitzegger, W., Hirnschall, A., Kramler, G., Mosnik, G.: The AOODB Workbench: An Environment for the Design of Active Object-Oriented Databases. Software Demonstration at the Int. Conference on Extending Database Technology (EDBT 2000), Konstanz, Germany, March, 2000
21. Paton, N.W., Diaz, O.: Active Database Systems. ACM Computing Surveys, Vol. 31, 1999, pp. 63-103
22. Retschitzegger, W.: TriGS Developer - A Development Environment for Active Object-Oriented Databases. Proceedings of the 4th World Multiconference on Systemics, Cybernetics and Informatics (SCI'2000) and the 6th International Conference on Information Systems Analysis and Synthesis (ISAS'2000), Orlando, USA, July 23-26, 2000
23. Thomas, I.S., Jones, A.C.: The GOAD Active Database Event/Rule Tracer. Proc. of the 7th Int. Conference on Database and Expert Systems Applications, Springer LNCS Vol. 1134, 1996, pp. 436-445
24. Vaduva, A., Gatziu, S., Dittrich, K.R.: Investigating Termination in Active Database Systems with Expressive Rule Languages. In Proceedings of the 3rd International Workshop on Rules In Database Systems, Skovde, Sweden, 1997, pp. 149-164
Tab-Trees: A CASE Tool for the Design of Extended Tabular Systems*

Antoni Ligeza, Igor Wojnicki, and Grzegorz J. Nalepa

Institute of Automatics, University of Technology AGH
al. Mickiewicza 30, 30-059 Kraków, Poland
[email protected], {wojnicki,gjn}@agh.edu.pl
Abstract. Tabular Systems constitute a particular form of rule-based systems. They follow the pattern of Relational Databases and Attributive Decision Tables. In Extended Tabular Systems non-atomic values of attributes are allowed. In order to assure safe, reliable and efficient performance of such systems, analysis and verification of selected qualitative properties such as completeness, consistency and determinism should be carried out. However, verifying these properties after the design of a system is both costly and late. In this paper another solution is proposed: a graphical CASE-like tool supporting the design of Extended Tabular Systems and providing verification possibilities. The tool uses a new rule specification paradigm, the so-called tab-trees, combining the advantages of Attributive Decision Tables and Decision Trees. In this approach the verification stage is moved into the design process. The background theory and the idea of tab-trees are outlined, and a practical tool, the Osiris system, is presented.
1 Introduction
Rule-based systems constitute a powerful programming paradigm. They can be used in Active Databases, Deductive Databases, Expert Systems, Decision Support Systems and intelligent information processing. A particular form of such systems are Tabular Systems, which try to stay close to the Relational Database (RDB) scheme and preserve its advantages. A tabular system can specify facts or rules; in the latter case some of the attributes refer to preconditions, while the rest refer to the hypothesis or decision. Such systems can be embedded within the RDB structure, but if so, all the values of attributes must be atomic. In an Extended Tabular System non-atomic values are allowed; they include subsets of the domain of an attribute, intervals (for ordered domains), and – in general – values of some lattice structure. A tabular system can consist of one or more tables, organised in the form of a modular system [11]. In order to assure safe, reliable and efficient performance, analysis and verification of selected qualitative properties should be carried out [1,2,3,14,15,16]. In this paper we follow the line concerning control and decision support systems [5,12], where logical details were developed in [8,9,11].
* Research supported by the KBN Project No. 8 T11C 019 17.
Properties of interest include features such as completeness, consistency and determinism. However, verifying them after the design of a rule-based system is both costly and late. In this paper another solution is proposed: a graphical CASE-like tool supporting the design of rule-based systems and providing verification possibilities. The tool uses a new rule specification paradigm, the so-called tab-trees (also tabular-trees or tree-tables), combining the advantages of attributive decision tables and decision trees. The tool is aimed at the synthesis of single-level, forward-chaining rule-based systems in tabular form; the verification stage is moved into the design process. The background theory and the idea of the tab-trees are discussed, and a practical tool, the Osiris system, is presented. Osiris was implemented as a graphical CASE tool for the design of Kheops-like rule-based systems in a Unix environment.

The basic idea of the paper is to include the verification stage in the design process, as well as to support the design with a flexible graphical environment of the CAD/CASE type. We follow the line proposed first in [7,8], where the so-called ψ-trees were proposed as a tool supporting the logical design of complete, deterministic and non-redundant rule-based systems. We also use the ideas of tabular systems [9,11], which provide an approach following attributive decision tables; their advantage is the easily readable, relational database-like format, which can be used both for data and knowledge specification. As a result, a fully graphical environment supporting the design and partially including verification has been developed. A prototype system named Osiris, cooperating with gKheops, a graphical user interface for the Kheops system, is presented. The tool is generic; the format of the rules to be generated can be adjusted for other systems as well.
2 Kheops
Kheops [5] is an advanced rule-based real-time system. Its working idea is quite straightforward: it constitutes a reactive, forward interpreter. However, it is relatively fast (response time can be below 15 milliseconds) and oriented toward time-critical, on-line applications. Its distinctive features include compilation of the rule base to the form of a specific decision tree, which allows for checking some formal properties (e.g. completeness) and for evaluating response time, dealing with time representation and temporal inference, and incorporation of specialized forms of rules, including universal quantification and C-expressions. A more detailed description of Kheops can be found in [5].

The Kheops system was applied as one of the principal components in the TIGER project [12]. This was a large, real-domain application in knowledge-based monitoring, supervision, and diagnosis. The system operates on-line 24 hours a day and is applied for continuous monitoring, situation assessment and diagnosis of gas turbines. Its distinctive features include the application of heterogeneous tools, i.e. the Kheops, IxTeT, and CA-EN systems; it is thus a multi-strategy, multi-component system. Details about the TIGER system can be found in the literature quoted in [12].
3 Qualitative Properties Verification

3.1 Subsumption of Rules
Consider the most general case of subsumption; some particular definitions are considered in [1,9,11]. A rule subsumes another rule if the following conditions hold:
– the precondition part of the first rule is weaker (more general) than the precondition of the subsumed rule,
– the conclusion part of the first rule is stronger (more specific) than the conclusion of the subsumed rule.

Let the rules r and r′ satisfy the following assumption: φ′ |= φ and h |= h′. The subsumed rule r′ can be eliminated according to the following scheme:

    r:  φ  −→ h
    r′: φ′ −→ h′
    ------------
    r:  φ  −→ h

For intuition, a subsumed rule can be eliminated because it produces weaker results and requires stronger conditions to be satisfied; thus any of such results can be produced with the subsuming rule. Using tabular notation we have:

    rule  A1   A2   ...  Aj   ...  An   H
    r     t1   t2   ...  tj   ...  tn   h
    r′    t′1  t′2  ...  t′j  ...  t′n  h′

    rule  A1   A2   ...  Aj   ...  An   H
    r     t1   t2   ...  tj   ...  tn   h

The condition for subsumption in the case of the above tabular format takes the form t′j ⊆ tj for j = 1, 2, ..., n and h′ ⊆ h (most often, in practice, h′ = h are atomic values). In the current version of Osiris subsumption is automatically eliminated due to the tabular partitioning of the attribute values into non-overlapping subdomains.
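The subsumption condition above lends itself to a direct mechanical test once the non-atomic cells are represented as sets. The following sketch is purely illustrative (it is not the Osiris code, and the names are invented); precondition cells are finite sets of admissible values, the conclusion is a set of asserted facts, so that r subsumes r′ exactly when t′j ⊆ tj for every attribute and h′ ⊆ h:

    # Illustrative sketch of the subsumption test for extended tabular rules.
    # A rule is a pair (preconditions, conclusion): precondition cells are
    # finite sets of admissible attribute values, the conclusion is a set of
    # asserted facts.

    def subsumes(r, r_prime):
        pre, concl = r
        pre_p, concl_p = r_prime
        weaker_precondition = all(pre_p[a] <= pre[a] for a in pre)
        stronger_conclusion = concl_p <= concl
        return weaker_precondition and stronger_conclusion

    r1 = ({"A": {1, 2, 3}, "B": {"x", "y"}}, {"stop"})
    r2 = ({"A": {2, 3},    "B": {"x"}},      {"stop"})
    assert subsumes(r1, r2)   # r2 is subsumed and can be eliminated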
3.2 Determinism
A set of rules is deterministic iff no two different rules can succeed for the same state. A set of rules which is not deterministic is also referred to as ambivalent. The idea of having a deterministic system consists in an a priori elimination of “overlapping” rules, i.e. ones which operate on a common situation.

From a purely logical point of view the system is deterministic iff, for any two different rules, the conjunction of the precondition formulae φ ∧ φ′ is unsatisfiable. Calculation of φ ∧ φ′ is straightforward: for any attribute Aj there is an atom of the form Aj = tj in φ and Aj = t′j in φ′, j = 1, 2, ..., n. Now, one has to find the intersection of tj and t′j – if at least one of these intersections is empty (e.g. two different values; more generally tj ∩ t′j = ∅), then the rules are disjoint. In the current version of Osiris determinism can be assured (if desired) thanks to the non-overlapping partitioning of attribute domains. Normally, the check for determinism is to be performed for any pair of rules.
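Under the same set-valued representation, the determinism check reduces to testing, for every pair of rules, whether all corresponding precondition cells intersect. A minimal sketch (illustrative only, not the Osiris implementation; for ordered domains the sets could be replaced by intervals):

    from itertools import combinations

    # Two rules overlap iff every pair of corresponding precondition cells
    # has a non-empty intersection; a rule base is deterministic iff no two
    # different rules overlap.

    def overlapping(pre1, pre2):
        return all(pre1[a] & pre2[a] for a in pre1)

    def is_deterministic(preconditions):
        """preconditions: list of dicts, attribute -> set of admissible values."""
        return not any(overlapping(p, q) for p, q in combinations(preconditions, 2))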
3.3 Completeness of Rule-Based Systems
For intuition, an RBS is considered to be complete if there exists at least one rule succeeding for any possible input situation [1,9]. In the following, logical (total) completeness is considered for a set of rules as below:

    r1: φ1 −→ h1
    r2: φ2 −→ h2
    ...
    rm: φm −→ hm

The approach proposed here comes from a purely logical analysis based on the dual resolution method [6]; its algebraic forms are discussed in [9,11]. Consider the joint disjunctive formula of the rule preconditions of the form Φ = φ1 ∨ φ2 ∨ ... ∨ φm. The condition of logical completeness for the above system is |= Φ, which simply means that Φ is a tautology. In the proposed approach, instead of a logical proof [6], an algebraic method based on partitions of the attribute domains is used [9,11]; in effect, no exhaustive enumeration of detailed input states is necessary. In the current version of Osiris the check for completeness is performed automatically, when desired, during the design stage.
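The completeness check can be pictured, in a much simplified form, as follows. The sketch assumes that for every attribute the cells occurring in the rule base form a partition of that attribute's domain (which is how the tabular components are organised in Osiris); the actual algebraic method of [9,11] works on more general partitions and avoids enumerating individual input states. All names are illustrative:

    from itertools import product

    # A rule base is complete iff every combination of partition blocks
    # (one block per attribute) is covered by some rule precondition.

    def uncovered_states(preconditions, attributes):
        """preconditions: list of dicts, attribute -> frozenset (one block per cell).
        Returns the block combinations matched by no rule."""
        blocks = {a: {r[a] for r in preconditions} for a in attributes}
        missing = []
        for combo in product(*(blocks[a] for a in attributes)):
            state = dict(zip(attributes, combo))
            if not any(all(state[a] <= r[a] for a in attributes) for r in preconditions):
                missing.append(state)
        return missing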
4 Graphical Design Concept
There is no general rule for knowledge representation and extraction. There are several approaches, each with advantages and disadvantages depending on the purpose it was created for. This means there is a need for a new approach able to give a clear and efficient way to represent logical structures. The approach proposed below uses the so-called tab-trees or tree-tables for knowledge representation; it is analysed below along with some other representation methods.
4.1 Production Rules and Logic
This method relies on the if-then-else construct, well known from procedural programming languages. It can be described as:

    IF condition THEN action1 ELSE action2

It reads: if condition is true then perform action1, else perform action2. Often the ELSE part is absent, or it is global for all rules.
4.2 Decision Tables
Decision tables are an engineering way of representing production rules. Conditions are formed into a table which also holds the appropriate actions. Classical decision tables use binary logic extended with a “not important” mark to express the states of conditions and the actions to be performed. The main advantage of decision tables is their simple, intuitive interpretation. One of the main disadvantages is that classical tables are limited to binary logic. In some cases the use of values of attributes is more convenient. A slightly extended form are OAV tables (OAT). OAV stands for Object-Attribute-Value (OAT – Object-Attribute-Value Table); such a table is presented below.

Table 1. Object-Attribute-Value table.

    attrib_1  attrib_2  ...  action_1  action_2  ...
    v_11      v_12      ...  w_11      w_12      ...
    v_21      v_22      ...  w_21      w_22      ...
    ...       ...       ...  ...       ...       ...
The rows specify under what attribute values certain actions must be executed. Both v_ij and w_kl may take various values, not only true, false and not important.
4.3 Decision Trees
Tree-like representations are readable, easy to use and understand. The root of the tree is an entry node; under any node there are some branching links. The selection of a link is carried out with respect to a conditional statement assigned to the node. Evaluation of this condition determines the selection of the link. The tree is traversed top-down, and at the leaves final decisions are defined. An example of a decision tree is given in Fig. 1. Circles represent actions, rectangles hold attributes and parallelograms express relations and values.

Decision trees can be more sophisticated. The presented decision tree (Fig. 1) is a binary tree; every node has only two links, which express two different values of a certain attribute. There are also decision trees called ψ-trees [7,8]; in such trees a single node may have more than two links, which makes the decision process more realistic and allows comparing attributes with many different values in a single node. The structure of such trees is modular and hierarchical.
4.4 A New Approach: The Tab-Trees
The main idea is to build a hierarchy of OATs [7,8]. This hierarchy is based on the ψ-tree structure. Each row of an OAV table is connected on its right-hand side to another OAV table. Such a connection implies a logical AND relation between them.
Fig. 1. An example of a decision tree.
OAV tables used in the tree-table representation are divided into two kinds: attribute tables and action tables. Attribute tables are the attribute part of a classical OAT; action tables are the action part. There is one logical limitation: while attribute tables may have as many rows as needed (the number of columns depends on the number of attributes), action tables may have only one row. This means that the specified action, or set of actions if there is more than one column, may have only one set of values, which preserves consistency.
Fig. 2. An example of a tab-tree knowledge representation.
An example of a tab-tree representation is given in Fig. 2. Note that the tab-tree representation is similar to the Relational Database (RDB) data representation scheme. The main features of the tab-tree knowledge representation are:
– simplicity and transparency; intuitive way of representation,
– hierarchical, tree-like knowledge representation,
– highly efficient way of visualisation with high data density,
– power of the decision table representation,
– analogies to the RDB data representation scheme,
– flexibility with respect to knowledge manipulation.
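To make the structure concrete, the hierarchy can be modelled with a few small record types. The sketch below is illustrative only (it is not the Osiris data model or its XML format); it encodes the constraints that a row of an attribute table is connected on its right-hand side either to another attribute table or to an action table, and that an action table has a single row. Flattening the tree yields ordinary rules, each being the conjunction (logical AND) of the conditions along one root-to-leaf path:

    from dataclasses import dataclass, field
    from typing import Dict, List, Union

    @dataclass
    class ActionTable:
        actions: Dict[str, str]                 # single row: action -> value

    @dataclass
    class Row:
        conditions: Dict[str, set]              # attribute -> admissible values
        child: Union["AttributeTable", ActionTable]

    @dataclass
    class AttributeTable:
        rows: List[Row] = field(default_factory=list)

    def rules(table, inherited=None):
        """Flatten a tab-tree into ordinary (precondition, actions) rules."""
        inherited = dict(inherited or {})
        for row in table.rows:
            precondition = {**inherited, **row.conditions}
            if isinstance(row.child, ActionTable):
                yield precondition, row.child.actions
            else:
                yield from rules(row.child, precondition)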
5 The Osiris System
The Kheops [5] system has a text-oriented user interface. It is suitable for advanced users but not for beginners. The gKheops system solves problems with navigation through the Kheops modes and allows the editor to be integrated with the Kheops run-time environment. However, the user still has to know the sophisticated Kheops syntax. The main goal was to create an environment that allows rapid development of Kheops rules using a graphical user interface and cooperates with gKheops. The name chosen was Osiris¹. Osiris was designed to be as universal as possible, to meet the requirements not only of Kheops but of almost all rule-based expert systems without any major modifications. The main features of the system include:
– a graphical rule editor based on logic; it allows creating, modifying and storing rule structures,
– the editor uses the new tab-tree representation as a visualisation and development method for logical structures,
– a rule-checking subsystem (completeness checking), which provides verification possibilities at the development stage,
– a mouse-driven graphical user interface, easy to understand and use, following the RDB paradigm,
– automatic code generation for the Kheops system,
– integration with gKheops to run the developed code,
– the ability to expand or modify the system easily (e.g. to add code generation for other expert systems),
– high flexibility; the tabular components can be split and joined vertically if necessary.

Osiris is a multi-module system designed for UNIX environments (tested under Debian GNU/Linux and Sun Solaris 2.5). It consists of a graphical environment for computer-aided development of rules, a code generator for the Kheops system, a validator, which provides on-line completeness checking, and a run-time environment for the created rules. The architecture of Osiris is shown in Fig. 3.
¹ The name corresponds to an ancient Egyptian god to whom Cheops (Kheops) could have owed his strength. The Osiris system was developed as an M.Sc. thesis at the Institute of Automatics, AGH, in Cracow, Poland.
Fig. 3. The architecture of the Osiris system.
For visualisation and development the tree-table representation is chosen (see Section 4.4). Note that the modules for code generation, verification, and the run-time environment (the Generator, the Validator, and gKheops, respectively) are separate applications. The Generator and the Validator use compiler technology to process an Osiris datafile which describes the logical structures being developed. They produce source code for Kheops and provide incompleteness checking, respectively. All modules are controlled by a single graphical user interface. Osiris uses its own datafile for storing the logical structures; an XML document is used as the storage format. The datafile has a well-defined grammar, which allows creating a generator for almost any expert system and implementing even sophisticated inconsistency-checking algorithms. Most importantly, Osiris provides a way to check the logical structures being developed for possible incompleteness (the Validator), which moves the verification process into the development stage. As the run-time environment, gKheops, a graphical user interface for Kheops, is used. A sample session with Osiris is shown in Fig. 4.
6 The gKheops System
The gKheops system is a graphical user interface to the Kheops expert system. Its main functions are launching and controlling Kheops, and aiding in creating and testing an expert system in the Kheops environment. gKheops has a text editor for Kheops files and a syntax checker module. The interface is easy to use, intuitive and self-documenting. It runs as a process independent of Kheops itself. It uses Unix concurrent processes and inter-process communication mechanisms that allow real-time communication with Kheops.
Fig. 4. A sample screen from a session with the Osiris system.
The gKheops system has been implemented in ANSI C and runs in the GNU/Linux and Unix environments. The interface is based on an advanced graphical toolkit called Gtk+. The Gtk+ library is available on any Unix-like platform in the X Window environment. The design of the gKheops interface was made with Glade, a GUI builder for Gtk+. The whole gKheops project was created entirely using free software. The syntax checker module consists of a Kheops language parser and scanner. The module was implemented using GNU Bison, a LALR(1) context-free parser generator, and Flex, a scanner generator. These tools are compatible with the popular Yacc and Lex tools. The gKheops system simplifies the development of expert systems in the Kheops environment by providing a coherent graphical interface to Kheops itself. It speeds up the process of launching and controlling Kheops, and is very useful in the process of debugging and testing an expert system.
7 Concluding Remarks
In this paper the first ideas for a graphical, interactive, CASE-like environment for the design of rule-based systems have been proposed. The main aim of the system is to provide an intuitive, user-friendly environment which covers both the design and the partial verification of qualitative properties. In the current prototype implementation, the system provides the possibility of completeness checking, while consistency, determinism and subsumption elimination are achieved thanks to the structural specification of the design. The system includes a new knowledge representation, the so-called tab-trees. It seems to be convenient, intuitive and readable, especially for people familiar with relational databases and decision trees. It should be mentioned that only a few tools support the design of the knowledge base in a similar way, e.g. [18]
(which uses mostly specific decision trees) or [19] (but mostly through a knowledge-management tool).
References

1. Andert, E. P.: Integrated knowledge-based system design and validation for solving problems in uncertain environments. Int. J. of Man-Machine Studies, 36, 1992, 357–373.
2. Coenen, F.: Verification and validation in expert and database systems: The expert systems perspective. A keynote presentation in [17], 1998, 16–21.
3. Coenen, F.: Rulebase checking using a spatial representation. In [13], 1998, 166–175.
4. Coenen, F., B. Eaglestone and M. Ridley: Validation, verification and integrity in knowledge and data base systems: future directions. In [15], 1999, 297–311.
5. Gouyon, Jean-Paul: Kheops User's Guide. Report of Laboratoire d'Automatique et d'Analyse des Systemes No. 92503, Toulouse, 1994.
6. Ligeza, A.: A note on backward dual resolution and its application to proving completeness of rule-based systems. Proceedings of the 13th Int. Joint Conference on Artificial Intelligence (IJCAI), Chambery, France, 1, 1993, 132–137.
7. Ligeza, A.: Towards design of complete rule-based control systems. IFAC/IMACS International Workshop on Artificial Intelligence in Real-Time Control, IFAC, Bled, Slovenia, 1995, 189–194.
8. Ligeza, A.: Logical support for design of rule-based systems. Reliability and quality issues. ECAI-96 Workshop on Validation, Verification and Refinement of Knowledge-based Systems, ECAI'96, Budapest, 1996, 28–34.
9. Ligeza, A.: Towards logical analysis of tabular rule-based systems. In [17], 1998, 30–35.
10. Ligeza, A.: Intelligent data and knowledge analysis and verification; towards a taxonomy of specific problems. In [15], 1999, 313–325.
11. Ligeza, A.: Towards logical analysis of tabular rule-based systems. International Journal of Intelligent Systems, 16, 2001, 333–360.
12. Milne, R., C. Nicol, L. Travé-Massuyès and J. Quevedo: TIGER: Knowledge based gas turbine condition monitoring. Applications and Innovations in Expert Systems III, SGES Publications, Cambridge, Oxford, 1995, 23–43.
13. Quirchmayr, G., Schweighofer, E. and T.J.M. Bench-Capon (Eds.): Database and Expert Systems Applications. Proceedings of the 9th Int. Conf., DEXA'98, Vienna. Springer-Verlag Lecture Notes in Computer Science, 1460, Berlin, 1998.
14. Preece, A. D.: A new approach to detecting missing knowledge in expert system rule bases. Int. J. Man-Machine Studies, 38, 1993, 661–668.
15. Vermesan, A. and F. Coenen (Eds.): Validation and Verification of Knowledge Based Systems – Theory, Tools and Practice. Kluwer Academic Publishers, Boston, 1999.
16. Vermesan, A. et al.: Verification and validation in support for software certification methods. In [15], 1999, 277–295.
17. Wagner, R. R. (Ed.): Database and Expert Systems Applications. Proceedings of the Ninth International Workshop, Vienna. IEEE Computer Society, Los Alamitos, CA, 1998.
18. Attar Software: XpertRule 3.0, http://www.attar.com/pages/info xr.htm.
19. AITECH Katowice: Sphinx 2.3, http://www.aitech.gliwice.pl/.
A Framework for Databasing 3D Synthetic Environment Data

Roy Ladner¹, Mahdi Abdelguerfi², Ruth Wilson¹, John Breckenridge¹, Frank McCreedy¹, and Kevin B. Shaw¹

¹ Naval Research Laboratory, Stennis Space Center, MS
{rladner, ruth.wilson, jbreck, mccreedy, shaw}@nrlssc.navy.mil
² University of New Orleans, Computer Science Department, New Orleans, LA
mahdi@cs.uno.edu
Abstract. Since 1994 the Digital Mapping, Charting and Geodesy Analysis Program at the Naval Research Laboratory has been developing an object-oriented spatial and temporal database, the Geographic Information Database (GIDB). Recently, we have expanded our research in the spatial database area to include three-dimensional synthetic environment (3D SE) data. This work has focused on investigating an extension to the National Imagery and Mapping Agency's (NIMA's) current Vector Product Format (VPF) known as VPF+. This paper overviews the GIDB and describes the data structures of VPF+ and a prototyped 3D synthetic environment using VPF+. The latter was designed as a 3D Geographic Information System (3D-GIS) that would assist the U.S. Marine Corps with mission preparation and also provide onsite awareness in urban areas.
tended to facilitate the use of VPF in the 3D SE generation process by supporting a wide range of three-dimensional features expected to be encountered in a three-dimensional synthetic environment. We have prototyped VPF+ in a 3D Geographic Information System (3D-GIS) that would assist the U.S. Marine Corps with mission preparation and also provide onsite awareness in urban areas. These operations require practice in physically entering and searching both entire towns and individual buildings. Our prototype, therefore, supplements the more traditional 2D digital-mapping output with a 3D interactive synthetic environment in which users may walk or fly across terrain, practice entry of buildings through doors and windows, and gain experience navigating the interiors of buildings.
2 DMAP's Spatial Database Experience

DMAP began investigating spatial database issues in 1994 with the development of the GIDB. The GIDB includes an object-oriented model, an object-oriented database management system (OODBMS) and various spatial analysis tools. While the model provides the design of classes and hierarchies, the OODBMS provides an effective means of control and management of objects on disk, such as locking, transaction control, etc. The spatial analysis tools include spatial query interaction, multimedia support and map symbology support. Users can query the database by area-of-interest, time-of-interest, distance and attribute. Interfaces are implemented to afford compatibility with Arc/Info and Oracle 8i, among others.

Not only has the object-oriented approach been beneficial in dealing with complex spatial data, but it has also allowed us to easily integrate a variety of raster and vector data. Some of the raster data includes Compressed ARC Digitized Raster Graphics (CADRG), Controlled Image Base (CIB), JPEG and video. Vector data includes VPF, Shape, sensor data and Digital Terrain Elevation Data (DTED). The VPF data includes such NIMA products as Digital Nautical Chart (DNC), Vector Map (VMAP), Urban Vector Map (UVMAP), Digital Topographic Data Mission Specific Data Sets (DTOP MSDS), and Tactical Oceanographic Data (TOD).

Figure 1 gives an example of how the user may use this data over the web through the applet. The area-of-interest shown in the figure is for a portion of the U.S. Marine Corps Millennium Dragon Exercise that took place in September 2000 in the Gulf of Mexico. Using the applet interface to the GIDB, the user was able to access the area of interest, bring in CIB imagery and overlay it with various vector data from DNC, MSDS and survey data from the Naval Oceanographic Office. The user was then able to zoom in and replace the CIB with CADRG imagery, and then zoom in further to see more of the detail of the MSDS data around the harbor in Gulfport, Mississippi.

In addition to spatial query features, the GIDB is capable of temporal queries, such as wave height over a given time span for spatial objects (for instance, an ocean sensor), and can provide statistics (min, max, mean, standard deviation) of this data and data plots.
Fig. 1. Screen shots from Millennium Dragon exercise area. From left to right: (1a) CIB background with VMAP, DTOP MSDS, sensor data and DNC added. (1b) CADRG Data added, area-of-interest zoomed in. (1c) Additional zoom with CADRG and sensor data.
3 Motivations for the Current Work

NIMA is the primary provider of synthetic environment data to the Department of Defense and to the private sector. VPF and DTED are the formats used by NIMA to disseminate a significant amount of that data. Despite the widespread use of VPF, its shortcomings have been documented in the synthetic environment database generation systems used by a variety of government and private groups, involving different proprietary end-product formats, and across varying user needs [Trott 96]. These shortcomings involve VPF's arrangement of features into disjointed, thematic layers, its lack of attribute and geometric data appropriate to the reconstruction of many 3D objects, and its frequent lack of correlation with DTED data. Disjointed thematic feature data, in particular, requires considerable preprocessing and data integration to be usable in a 3D synthetic environment. These shortcomings add much time and money to the process of constructing synthetic environments from VPF data.

Non-manifold objects are those in which one of the following conditions exists: (1) exactly two faces are not incident at a single edge, (2) objects or faces are incident only through sharing a single vertex, or (3) a dangling edge exists [Gursoz 90, Lienhardt 91, and O'Rourke 95]. A dangling edge is one which is not adjacent to any face. Examples are given in Figure 2, where non-manifold conditions are noted by the bold edges. Non-manifold objects are commonplace in the real world, and they should be found in a synthetic environment (SE).

VPF uses winged-edge topology to provide line network and face topology and seamless coverages across tile boundaries [VPF 96]. However, it lacks the constructs necessary to maintain the adjacency relations in non-manifold objects. While an edge is adjacent to exactly two faces in VPF Level 3 topology, an edge may be adjacent to 0, 1, 2, 3 or more faces in an SE when a non-manifold condition is present. Though the
winged-edge topology used by VPF relates each edge to exactly two faces (left and right, corresponding to two adjacent faces), the concept of a "left" and "right" face may be lacking in 3D non-manifold objects where multiple faces may be incident to the edge. VPF also relates each connected node to exactly one of the edges to which each such node is connected. This allows for retrieval of all edges and faces to which the node is connected using the winged-edge algorithm. However, in a SE a connected node may connect two different 3D objects, two different faces or a dangling edge. Relating the connected node to only one edge in these circumstances will not be adequate for retrieval of all spatial primitives.
Fig. 2. Clockwise from top left: (2a) Multiple Faces Incident at a Single Edge. (2b) Dangling Edge and Building Joined Only By a Common Point. (2c) Two Buildings Sharing a Face With a Non-Manifold Condition at the bold Edge. (2d) Partial Building Creates Non-Manifold Condition.
There are a number of data structures that are capable of maintaining the adjacency relationships found in manifold and non-manifold objects. Notable are the Radial Edge [Weiler 86], the Tri-Cyclic Cusp [Gursoz 90] and the ACIS Geometric Modeler [Spatial 96]. The Radial Edge structure is an edge-based data structure that addresses topological ambiguities found in two non-manifold situations – the non-manifold edge and the non-manifold vertex. The Tri-Cyclic Cusp structure is a vertex-based data structure. It addresses the topological relationships that the Radial Edge structure addresses and, in addition, is specifically intended to resolve ambiguities inherent in certain non-manifold representations that may not be easily eliminated by the Radial Edge structure, as when two objects are joined only at a common point. The ACIS Geometric Modeler is a component-based package consisting of a kernel and various application-based software components.
While these may provide a theoretical basis for a logical extension of the VPF standard, they could not be directly implemented. Our primary area of concern is modeling synthetic environments. These other data structures address a different application area, solid modeling, making them often inconsistent with Winged-Edge topology concepts found in the VPF standard. There are also a number of major developers of synthetic environment database systems such as Loral Advanced Distributed Simulation, Inc., Lockheed Martin Information Systems (LMIS), Multigen, Inc., Evans & Sutherland (E&S) and Lockheed Martin Tactical Defense Systems (LMTDS). Their products include database formats such as the S1000, OpenFlight, TARGET, and specific image generator formats. Their emphasis, however, is on visual representation, not three-dimensional topological relationships.
4 The VPF+ Data Structure

Since VPF has widespread use and there are numerous VPF databases, the data structures described in this section are defined as a superset of VPF, known as VPF+, in order to facilitate the use of VPF in the 3D SE generation process. This superset introduces a new level of topology, called Level 4 Full 3D Topology (Level 4), to accomplish 3D modeling. A boundary representation (B-rep) method is employed. B-rep models 3D objects by describing them in terms of their bounding entities and by topologically orienting them in a manner that enables the distinction between the object's interior and exterior. The topological adjacencies of three-dimensional manifold and non-manifold objects in the SE are described using a new, extended winged-edge data structure, referred to as "Non-Manifold 3D Winged-Edge Topology". Geometric information includes both three-dimensional coordinates and Face and Edge orientation. Although this discussion is restricted to planar geometry, curved surfaces can also be modeled through the inclusion of parametric equations for Faces and Edges as associated attribute information.

Level 4 is a full 3D topology that is capable of representing comprehensive, integrated 3D synthetic environments. Such an environment can include the terrain surface, objects generally associated with the terrain surface such as buildings and roads, and objects that are not attached to the terrain but are rather suspended above the terrain surface or below a water body's surface. There are five main VPF+ primitives in Level 4 topology: (1) Entity node – used to represent isolated features; (2) Connected node – used as endpoints to define edges; (3) Edge – an arc used to represent linear features or borders of faces; (4) Face – a two-dimensional primitive used to represent a facet of a three-dimensional object such as the wall of a building; and (5) Eface – a primitive that describes a use of a face by an edge.

Unlike the topology of traditional VPF, the Level 4 topology of VPF+ does not require a fixed number of faces to be incident to an edge. The Eface is a new primitive introduced to resolve some of the ensuing ambiguities. Efaces describe a use of a Face by an Edge and allow maintenance of the adjacency relationships between an Edge and the zero, one, two or more Faces incident to that Edge. This is achieved in VPF+ by linking each edge to all faces connected along the edge through a circular linked list of efaces. Each eface identifies the face it is associated with, the next eface in the list and the "next" edge about the face in relation to the edge common to the three faces. Efaces are also radially ordered in the linked list in a clockwise direction about the edge in order to make traversal from one face to the radially closest adjacent face a simple list operation.

In addition to the eface structure, VPF+ introduces several extensions to VPF consistent with non-manifold topology and 3D modeling. One extension is the Node-Edge relationship. While VPF relates each Connected Node to exactly one Edge, VPF+ allows for non-manifold Nodes. This requires that a Node point to one Edge in each object connected solely through the Node and to each dangling Edge (an edge that is adjacent to no face). This relationship allows for the retrieval of all Edges and all Faces in each object and the retrieval of all dangling Edges connected to the Node.

Significant to 3D modeling, VPF+ defines two-sided Faces. Faces are defined in VPF as purely planar regions. In VPF+, Faces may be one-sided or two-sided. A two-sided Face, for example, might be used to represent the wall of a building, with one side used for the outside of the building and the other side for the inside. Feature attribute information would be used to render the two different surface textures and colors. A one-sided Face might then be used to represent the terrain surface. Additionally, the orientation of the interior and exterior of 3D objects is organized in relation to the normal vector of the Faces forming the surface boundary of closed objects. This allows for easy distinction between an object's interior and exterior. For more detail on VPF+ topologic structures the interested reader is referred to [Abdelguerfi 98].

Traditional VPF defines five categories of cartographic features: Point, Line, Area, Complex and Text. Point, Line and Area features are classified as Simple Features, composed of only one type of primitive. Each Simple Feature is of differing dimensionality: zero, one and two for Point, Line and Area features, respectively. Unlike Simple Features, Complex Features can be of mixed dimensionality, and are obtained by combining features of similar or differing dimension. For Level 4 topology, VPF+ adds a new Simple Feature class of dimension three. The newly introduced feature, referred to as the 3D Object Feature, is composed solely of Face primitives. This new feature class is aimed at capturing a wide range of 3D objects. Although 3D Object Features are restricted to primitives of one dimensionality, 3D objects of mixed dimensionality can be modeled through Complex Features using Simple Features of similar or mixed dimensionality as building blocks.

Software performance can be improved by identifying characteristics of real 3D objects that allow storage of optional, unambiguous topological information that may otherwise require considerable processing time to derive. Clearly, portions of numerous 3D objects form closed volumes that divide 3D space into interior, exterior and surface regions. Optional topological information in these cases includes the
classification of Faces as either inside of, outside of or part of the boundary of the 3D Object and the orientation of the interior and exterior of the object.

Though Area Features may geometrically exist in 3D space, they are topologically two-dimensional and are intended to model surface area. As with 3D Object Features, Area Features are Simple Features, and objects being modeled at this level are restricted to be composed only of Faces connected along incident Edges or at non-manifold Connected Nodes. Each Face may be single-sided or double-sided, but an Area Feature will generally make use of only a single side of a double-sided Face.

Tiling is the method used in VPF to break up large geographic data into spatial units small enough to fit the limitations of a particular hardware platform and media. Primitives that cross tile boundaries are split in VPF, and topology is maintained through cross-tile topology. The cross-tile constructs of VPF are extended in Level 4 in accordance with the organizational scheme of Non-Manifold 3D Winged-Edge topology. Tile boundaries in VPF+, however, consist of planar divisions.
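A minimal sketch of the eface mechanism described earlier in this section may help to picture it. This is an illustrative data structure only, not the actual VPF+ table layout, and the field names are invented:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Eface:
        face_id: int
        next_edge_id: int                        # "next" edge about that face
        next_eface: Optional["Eface"] = None     # next eface radially about the edge

    @dataclass
    class Edge:
        edge_id: int
        first_eface: Optional[Eface] = None      # None for a dangling edge

    def faces_about_edge(edge: Edge) -> List[int]:
        """Collect the ids of all faces incident to an edge (0, 1, 2 or more),
        walking the circular, radially ordered list of efaces."""
        faces, eface = [], edge.first_eface
        while eface is not None:
            faces.append(eface.face_id)
            eface = eface.next_eface
            if eface is edge.first_eface:        # the circular list is closed
                break
        return faces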
5 The Prototyped Synthetic Environment

The synthetic environment prototype consists of the Military Operations in Urban Terrain (MOUT) site at Camp LeJeune, North Carolina. The MOUT site is a small city built by the Marine Corps for urban combat training. It consists of approximately 30 buildings constructed in a variety of shapes and sizes to resemble what might be expected in an actual urban area. Since the area is supposed to resemble a combat environment, some buildings are constructed to exhibit various degrees of damage. There is also a transportation network and the usual urban features associated with this type of setting, such as trees, park benches, planters, flag poles, etc. Data for the site is readily available, which allowed for the construction of a detailed 3D SE that closely matched its real-world counterpart. MOUT buildings that exhibit damage, for example, are reproduced in the prototyped SE to show the same elements of damage.

The prototype provides a 3D synthetic environment alongside a more traditional 2D digital map. The map view offers general orientation and feature identification, while the 3D SE complements this with an immersive experience of the three-dimensional environment [Ladner 2000]. The combination should prove beneficial to a variety of uses. A commercial off-the-shelf OODBMS was used for the prototype database. Java2 and the Java3D API were used as the interface into the database. The Java3D API provided reasonable performance for 3D interaction and easy implementation.

5.1 The User Interface

The user interface for the prototype consists of windows for displaying 2D digital maps and 3D synthetic environments. Each window is placed in a separate frame to allow for independent re-sizing according to the user's needs. On start-up, the user is
given a map of the world, which allows selection of a user-defined area of interest (AOI) by dragging the mouse across and down the map. Selection of an AOI causes the database to be queried. A digital map is drawn with all database features (Figure 3(a)). The user then has several options, including zoom, pan, render features in 3D and identify objects. Rendering objects in 3D was left to a user decision rather than a default occurrence for performance reasons. Although the user can render all features in 3D, the user is given the choice of zooming and panning the map as a means of selecting features for 3D display. Only those features within the AOI are rendered in 3D, avoiding the unnecessary use of resources to extract unwanted feature geometry from the database.

Figures 3(b) – 5 show the 3D SE of the MOUT facility from various positions. The interface allows the user to move through the SE and into and out of buildings by use of the arrow keys. Movement can be by walking on or flying above the terrain. Drop-down menus allow the user to change speed, background texture and lighting conditions. Altering lighting conditions allows the viewer, for example, to obtain both day and night views of the 3D SE. A feature is also provided that allows the user to track his position in the 3D SE. When activated, this feature places an icon on the map corresponding to the user's position in the 3D SE. As the user moves through the SE, the position of the icon is updated. The icon is oriented to correspond to the user's orientation in the SE.
Fig. 3. From Left: (3a) 2D Map of the MOUT Facility. (3b) 3D View of the MOUT Facility from the Southwest.
6 Observations

This paper has described VPF+, a VPF-consistent data structure capable of supporting topologically consistent three-dimensional feature coverages. VPF winged-edge topology is insufficient to support the many topological adjacency relationships found in a 3D SE. The Non-Manifold 3D Winged-Edge Topology will support these relationships in a wide range of objects likely to be modeled in a 3D SE and provides a framework for the 3D synthetic environment generation process. VPF+ should be useful for commercial as well as the more traditional modeling and simulation applications, especially for developers who want to extend their geographic information system capability to add 3D topology.

Detailed data in the form of highly accurate representations of the interior and exterior of buildings was used in the prototype. Some SE applications do not require building interiors. For these, building exteriors with accompanying topology suffice. VPF+ topology will support these implementations as well. Continuing research into improved methods of automating the extraction of detailed 3D object geometry from satellite, aerial and panoramic imagery should be beneficial for providing detailed data over large areas, at least for building exteriors. For smaller areas, digitizers can be used to re-construct building interiors from building plans, or CAD data can be imported. Where building plans are not available or where more rapid development is required, further work can concentrate on developing tools to project the interior layout of buildings based on photo-imagery, the building use and a basic material composition description, e.g. steel, brick, frame. This type of description should be easily obtainable.
Acknowledgments. The National Imagery and Mapping Agency and the U.S. Marine Corps Warfighting Lab sponsored this work.
Fig. 4. Street View of the MOUT Facility Looking North.
Fig. 5. View of the MOUT Facility from the West.
References

[Abdelguerfi 98] Mahdi Abdelguerfi, Roy Ladner, Kevin B. Shaw, Miyi Chung, Ruth Wilson: VPF+: A Vector Product Format Extension Suitable for Three-Dimensional Modeling and Simulation. Tech. report NRL/FR/7441-98-9683, Naval Research Laboratory, Stennis Space Center, Miss., 1998.
[GIDB] Digital Mapping, Charting, and Geodesy Analysis Program Web site; provides additional information about the Naval Research Laboratory's Geographic Information Database, http://dmap.nrlssc.navy.mil/dmap (current 16 May 2001).
[Gursoz 90] E. Levent Gursoz, Y. Choi, F. B. Prinz: Vertex-based Representation of Non-Manifold Boundaries. In: Geometric Modeling for Product Engineering, M.J. Wozny, J.U. Turner and K. Preiss (Eds.), Elsevier Science Publishers, 1990, pp. 107-131.
[Ladner 2000] Roy Ladner, Mahdi Abdelguerfi, Kevin Shaw: 3D Mapping of an Interactive Synthetic Environment. Computer, Vol. 33, No. 3, March 2000, pp. 35-39.
[Lienhardt 91] Pascal Lienhardt: Topological models for boundary representation: a comparison with n-dimensional generalized maps. Computer Aided Design, Vol. 23, No. 1, January 1991, pp. 59-82.
[O'Rourke 95] Joseph O'Rourke: Computational Geometry in C. Cambridge University Press, 1995, pp. 114-115.
[Spatial 96] Spatial Technology, Inc.: Format Manual. 2425 55th Street, Building A, Boulder, CO 80301, 1996.
[Trott 96] Kevin Trott: Analysis of Digital Topographic Data Issues in Support of Synthetic Environment Terrain Data Base Generation. TEC-0091, U.S. Army Corps of Engineers, Topographic Engineering Center, November 1996.
[VPF 96] Department of Defense: Interface Standard for Vector Product Format. MIL-STD-2407, 28 June 1996.
[Weiler 86] K.J. Weiler: Topological Structures for Geometric Modeling. PhD thesis, Rensselaer Polytechnic Institute, Troy, NY, 1986.
GOLAP – Geographical Online Analytical Processing

Petr Mikšovský and Zdeněk Kouba

The Gerstner Laboratory for Intelligent Decision Making and Control
Faculty of Electrical Engineering, Czech Technical University
Technická 2, 166 27 Prague 6, Czech Republic
Phone: +420-2-24357666, Fax: +420-2-24357224
{miksovsp, kouba}@labe.felk.cvut.cz
Abstract. Current geographical information systems (GIS) handle large amounts of geographical data, usually stored in relational databases. Database vendors have developed special database plug-ins in order to make the retrieval of geographical data more efficient. Basically, they implement spatial indexing techniques aimed at speeding up spatial query processing. This approach is suitable for those spatial queries which select objects in a certain user-defined area. Just as on-line transaction processing (OLTP) systems evolved into on-line analytical processing (OLAP) systems supporting more complicated analytical tasks, a similar evolution can be expected in the context of analytical processing of geographical information. This paper describes the GOLAP system, consisting of a commercial OLAP system enriched with a spatial index. Experiments comparing the efficiency of the original OLAP system and the extended one are presented.
query. However, let us imagine that we need to determine the average income of people living, for example, in an area with a diameter of 5 km around a city. This is an expensive task not only because of the huge number of arithmetic calculations, but also because of the complicated selection of the appropriate objects (houses) which need to be taken into account. In the case of such ad-hoc spatial queries this approach does not help that much.

This paper describes a new approach, which is based on the idea of building up the hierarchical structure of the geographical dimension of a data warehouse according to a spatial index constructed for a given population of GIS objects in a map. A prototype of a GOLAP (Geographical On-Line Analytical Processing) system implementing this idea has been developed. The efficiency of this solution has been explored in a series of experiments.
2. GOLAP Components
The GOLAP system represents a natural embedding of a spatial index into a commercial OLAP system. The prototype implementation makes use of Microsoft SQL Server's Analytical Services. However, it is not restricted to this single OLAP system. Both the idea and the developed software creating spatial indices can be easily adapted to any other data warehouse platform. The following paragraphs briefly describe both the OLAP and the spatial index components exploited by the system.

2.1 OLAP - Online Analytical Processing

Data warehouses are aimed at providing very fast responses to user queries. Usually, the OLAP layer of the data warehouse architecture [4] generates these queries. The data model of a data warehouse is designed to support fast evaluation of very complicated multi-dimensional queries. The following toy example explains the basic concepts of data warehouse modelling. Let us consider an imaginary grocery chain consisting of nine supermarkets located in three districts of Bohemia: Eastern Bohemia, Central Bohemia, and Western Bohemia. Let the data warehouse store the data on the turnover of particular supermarkets. The data warehouse structure is represented by a cube, the dimensions (axes) of which are time, assortment, and location. Every elementary cell of the cube contains a real number identifying the turnover achieved by the corresponding supermarket (position along the location axis) in the respective time slot (position along the time axis) and particular assortment item (position along the assortment axis). This number is called a fact, measure, or metric. Let us choose the term fact to avoid ambiguity.

Data in an OLAP system are usually read-only for all users in order to minimise the overhead of multiple access control. The content is updated in batches in predefined
time periods by an ETL (Extraction, Transformation and Load) process. This is the only way to modify the data stored in a data warehouse. The ETL process extracts data from data sources, transforms it into a multidimensional structure and preferably creates suitable pre-calculated aggregations.
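The toy cube and one pre-calculated aggregate can be pictured with a few lines of plain code. This is purely illustrative and is unrelated to the Microsoft Analysis Services model; the shop names and numbers are invented:

    from collections import defaultdict

    # Toy grocery cube: one fact (turnover) per (time slot, assortment item,
    # supermarket) cell; the ETL step pre-calculates an aggregate along the
    # location hierarchy (supermarket -> district).

    facts = {("2001-Q1", "bread", "shop_1"): 120.0,
             ("2001-Q1", "bread", "shop_2"): 95.0,
             ("2001-Q1", "milk",  "shop_1"): 60.0}

    district_of = {"shop_1": "Eastern Bohemia", "shop_2": "Central Bohemia"}

    aggregate = defaultdict(float)          # (time, item, district) -> turnover
    for (time, item, shop), turnover in facts.items():
        aggregate[(time, item, district_of[shop])] += turnover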
2.2 Spatial Indexing
Generally, spatial data queries are very complex and their evaluation is time consuming. There are some supporting mechanisms, typically based on indexing techniques, which can speed up spatial query evaluation. At least two spatial indexing techniques are used frequently in commercial systems: quad-trees (e.g. Oracle8 Spatial Cartridge) and R-trees (e.g. Informix Spatial DataBlade). The GOLAP system currently makes use of R-trees; therefore a brief description of this technique follows.

The R-tree [1] and its derivatives are spatial data structures devoted to indexing more general objects than single points. In principle, an R-tree is a simple modification of a B-tree, where the leaf records contain pointers to data objects representing spatial objects. An important feature is that an R-tree node is implemented as a page in external memory. The indexing of spatial objects itself is given by a pair ⟨I, Id⟩ (a so-called index record), where I is the minimal bounding hyper-cube (MBH) of the particular spatial object. Every MBH has basically the form of a tuple (I0, I1, ..., In), where Ii is an interval [ai, bi] describing the lower and upper bounds along dimension i. The non-leaf nodes contain records (I, pointer). The pointer refers to the subtree which contains the nodes corresponding to all MBHs covered by the MBH I. The MBHs may be overlapping.

An R-tree is given an order (m, m1), where m is the minimal number of edges leaving a node, whereas m1 is the maximal one. It means that MBHs are constructed in such a way that the number of edges leading from the corresponding node is at least m and at most m1. Figure 1 shows the construction of an R-tree of order (2,3). If the number of spatial objects is E, the depth of an m-ary R-tree is logm E − 1 in the worst case.

The R-tree is a dynamic data structure based on page splitting (in the case of INSERT) and/or page merging (in the case of DELETE). In contrast to B-trees, the search in an R-tree does not follow a single path, as the MBHs may be overlapping. It means that there may be several possibilities how to go on from a given node when searching for an object. There are many heuristics used for R-tree optimisation. Basically, there is a tendency to separate MBHs as much as possible. In principle, the INSERT operation is very important for maintaining the R-tree and sustaining its efficiency. The basic strategy of the algorithm implementing the INSERT operation is to find a leaf in the tree for which inserting an index record will require updating a minimum of nodes on the path from the leaf to the root. The R-tree efficiency is very sensitive to the node splitting method.
The splitting procedure solves the problem of how to partition an unordered set of index records. The idea is to minimise the probability that both nodes will need to be investigated during a search. Therefore the volume of each of the two MBHs corresponding to the nodes after splitting should be minimal. The complexity of an algorithm providing a globally optimal solution is exponential. In practice, sub-optimal algorithms with linear or quadratic complexity [2] are used.
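A window search over such a structure can be sketched in a few lines. The sketch below treats MBHs as 2-D axis-aligned rectangles and omits paging, insertion and the splitting heuristics discussed above; all names are illustrative. It only shows why, unlike in a B-tree, several branches may have to be followed:

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    Rect = Tuple[float, float, float, float]         # (xmin, ymin, xmax, ymax)

    def intersects(a: Rect, b: Rect) -> bool:
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    @dataclass
    class RTreeNode:
        mbh: Rect
        children: List["RTreeNode"] = field(default_factory=list)  # empty in leaves
        object_id: Optional[int] = None                             # set in leaves

    def search(node: RTreeNode, window: Rect, hits: List[int]) -> None:
        """Collect ids of indexed objects whose MBH intersects the query window.
        Several children may have to be visited, because sibling MBHs overlap."""
        if not intersects(node.mbh, window):
            return
        if node.object_id is not None:
            hits.append(node.object_id)
        for child in node.children:
            search(child, window, hits)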
Fig. 1. Construction of an R-tree
3. GOLAP Architecture
The GOLAP system in fact consists of two parts: a spatial index builder and a query evaluator. The first of them (the spatial index builder) is responsible for building an R-tree for the given set of geographical objects. The index is a tree; its leaves correspond to individual geographical objects, whereas the non-leaf nodes correspond to the respective MBHs. Only the ETL process uses the spatial index builder, during data warehouse creation/update. It creates an R-tree, which is then used as a template for the definition of the geographical dimension of the data warehouse. The second part (the query evaluator) is the user interface enabling the user to select an ad-hoc area to be analysed. The query evaluator selects a set of MBHs and single objects covering the explored area. Then it assembles the corresponding OLAP query and sends it to the OLAP engine.
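The covering step performed by the query evaluator can be sketched as a recursive descent over the R-tree. The sketch is illustrative only (invented names, rectangular explored area, OLAP query assembly omitted): MBHs completely inside the explored area contribute their pre-calculated aggregate, partially overlapping subtrees are descended, and single GIS objects lying inside the area are collected for individual evaluation:

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    Rect = Tuple[float, float, float, float]      # (xmin, ymin, xmax, ymax)

    def intersects(a: Rect, b: Rect) -> bool:
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    def contains(outer: Rect, inner: Rect) -> bool:
        return (outer[0] <= inner[0] and outer[1] <= inner[1]
                and inner[2] <= outer[2] and inner[3] <= outer[3])

    @dataclass
    class Node:                                    # R-tree node or single GIS object
        mbh: Rect
        children: List["Node"] = field(default_factory=list)
        object_id: Optional[int] = None            # set only for single objects

    def cover(node: Node, area: Rect, aggregates: List[Node], objects: List[int]) -> None:
        if not intersects(node.mbh, area):
            return
        if node.object_id is not None:             # a single GIS object
            if contains(area, node.mbh):
                objects.append(node.object_id)     # "black points" in Fig. 2 and 3
        elif contains(area, node.mbh):
            aggregates.append(node)                # "black rectangles": use stored aggregate
        else:
            for child in node.children:
                cover(child, area, aggregates, objects)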
4. Experimental Results
Currently, a first series of experiments has been carried out. Basically, the results of these experiments confirmed our expectations concerning the comparison of conventional OLAP and GOLAP performance. The experiments were carried out on a 2xPIII 866 MHz, 256 MB RAM workstation running MS Windows 2000 and MS SQL Server 2000 with Analytical Services. Two random geographical data sets were generated. Both of them consisted of 100000 GIS objects distributed in a grid of 10000x10000 points. One of them was based on a uniform geographical distribution, the other on a 2-dimensional normal (Gaussian) one. The latter simulates a more realistic distribution of objects in a geographical information system. The procedure generating the second data set started with a random selection of 100 centres ("cities"). Then 1000 GIS objects with a 2-dimensional normal distribution were generated around each of them.
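The two synthetic data sets can be reproduced in a few lines. The sketch below is an assumption-laden reconstruction: the paper does not state the random seed or the standard deviation of the clusters, so the values used here (seed 0, standard deviation of 150 grid points) are illustrative only:

    import numpy as np

    rng = np.random.default_rng(0)                 # seed is an assumption
    GRID, N_CENTRES, PER_CENTRE = 10_000, 100, 1_000

    # Data set No. 1: 100 000 objects uniformly distributed over the grid.
    uniform = rng.uniform(0, GRID, size=(N_CENTRES * PER_CENTRE, 2))

    # Data set No. 2: 100 random centres ("cities"), 1000 objects with a
    # 2-dimensional normal distribution around each of them.  The standard
    # deviation (150 grid points) is not given in the paper and is assumed.
    centres = rng.uniform(0, GRID, size=(N_CENTRES, 2))
    gaussian = np.vstack([rng.normal(c, 150.0, size=(PER_CENTRE, 2)) for c in centres])
    gaussian = np.clip(gaussian, 0, GRID)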
Fig. 2. Data set No. 1 - uniform distribution of GIS objects

The multidimensional schema used for the experiments was simple. It included only one fact table and the geographical dimension. The fact table contained a single fact expressing the count of GIS objects in the respective area. The geographical dimension was constructed using the spatial index. An aggregation level is introduced into the geographical dimension for every level of nodes in the R-tree. Thus, a pre-calculated data aggregate can be stored in the data warehouse for each node (i.e. MBH) of the R-tree.

The spatial analysis capabilities of a conventional OLAP system are restricted to simple filtering expressed in terms of SQL-based multidimensional queries (i.e.
Microsoft MDX in our case). On the other hand, GOLAP is capable of analysing areas having the shape of a generic polygon. As the authors wanted to prove that GOLAP is really useful, they chose a pessimistic strategy: the experimental analysis was carried out on rectangular areas rather than on generic polygons. The reason is not the wish to avoid the expensive evaluation of polygons when using the spatial index, but to define fair conditions for comparing the response times of both systems. Even if this strategy degraded the capabilities of the GOLAP system and increased the "chance" of the conventional OLAP, the results are encouraging.
Fig. 3. Data set No. 2 - Gaussian distribution of GIS objects
For each data set, areas of two sizes were analysed. The smaller one was a square of 1000 x 1000 points representing 1% of the whole grid. The bigger one was a square of 5000 x 5000 points representing 25% of the whole grid. For every data set and every size of the explored area, 20 runs were carried out. The explored area was located at various positions (around the centres of clusters in the case of the Gaussian distribution) in the particular runs. The mean values over all 20 runs of both the response time and the number of GIS objects in the explored area are given in Table 1. Figure 2 and Figure 3 show results provided by the query evaluator module. Black rectangles represent those MBHs of the spatial index which are fully contained in the explored area. The black points represent those GIS objects which are contained in the explored area, but whose covering MBHs are not fully contained in it. These black rectangles and points correspond to data aggregates evaluated by the OLAP component of the GOLAP system. In the case of direct evaluation by the conventional OLAP, all GIS objects (i.e. all the black and grey points in the figures) have to be evaluated. The ratio of the number of black rectangles and points to the number of all points determines the efficiency of the GOLAP.
Table 1. Experimental results
OLAP avg. response time [s]: 13.78, 13.11, 14.23, 15.43
Acceleration [%]: 393.7, 2570.6, 42.3, 181.7
Conclusion
The paper describes a geographical extension of a conventional OLAP system capable of evaluating facts for ad-hoc defined areas. Without such an extension, conventional OLAP is applicable only to on-line analysis of pre-defined areas. The solution is based on embedding a spatial index into the multi-dimensional structure and exploiting this index for evaluating complex spatial analytical queries. A prototype implementation called GOLAP has been developed. A first series of experiments demonstrated that the proposed solution is useful. According to the results, GOLAP is 2 to 25 times faster than a conventional OLAP. The GOLAP efficiency depends on the number of objects in the explored area, whereas the response time of the conventional OLAP system is roughly independent of it. The explanation is straightforward. The conventional OLAP system determines the target data aggregate corresponding to the explored area by accessing every single GIS object and testing whether it belongs to the area. The GOLAP, on the other hand, calculates the result using the set of necessary data aggregates identified by the spatial index. It then accesses the data warehouse only to retrieve those data aggregates, without any filtering overhead.
It is obvious from Table 1 that there exist situations in which the conventional OLAP is faster than GOLAP (see the 42% acceleration for the explored area of 25% of the whole map). The experiments support a conjecture that GOLAP's efficiency corresponds to the ratio S/T, where S is the number of GIS objects in the explored area and T is the total number of them. Future research will focus on a more detailed analysis of this dependency. It will aim at finding the critical value of that ratio at which the conventional OLAP starts to behave better than GOLAP. We assume that the size of the explored area does not exceed 10% of the whole map in practical geographical analysis. As the critical value of the above-mentioned ratio seems to be far above the values used in practical analytical queries, we believe that GOLAP will be very efficient in practice. Another topic of future research is further optimisation of the OLAP queries generated from the spatial index. This optimisation will help to move the critical value even higher.
Acknowledgement. The work related to this paper has been carried out with support of the INCO-COPERNICUS No. 977091 research project GOAL – Geographical Information On-line Analysis. The authors want to express thanks to their colleagues from the Gerstner Laboratory for Intelligent Decision Making and Control for creating a friendly environment and to Jaroslav Pokorný for his excellent tutorial on spatial indexing techniques [5].
References [1]
Guttman, A.: R-trees: a dynamic index structure for spatial indexing, Proc. of SIGMOD Int. Conf. on Management of Data, 1984, pp. 47-54
[2]
Gavrila, D.M.: R-tree Index Optimization, CAR-TR-718, Comp. Vision Laboratory Center for Automation Research, University of Maryland, 1994
[3]
Kouba Z., Matoušek K., Mikšovský P.: On Data Warehouse and GIS Integration, In: Proceedings of the 11. International Conference on Database and Expert Systems Applications (DEXA 2000), Ibrahim, M. and Küng, J. and Revell, N. (Eds.), Lecture Notes in Computer Science (LNCS 1873), Springer, Germany
[4]
Kurz A.: Data Warehousing – Enabling Technology, (in German), MITP-Verlag GmBH, Bonn, 1999, ISBN 3-8266-4045-4
[5]
Pokorný J.: Prostorové datové struktury a jejich použití k indexaci prostorových objektů (Spatial Data Structures and their Application for Spatial Object Indexing), in Czech, GIS Ostrava 2000, Technical University Ostrava, Ostrava, Czech Republic, 2000, pp. 146–160
Declustering Spatial Objects by Clustering for Parallel Disks Hak-Cheol Kim and Ki-Joune Li Dept of Computer Science, Pusan National University, Korea [email protected], [email protected]
Abstract. In this paper, we propose an efficient declustering algorithm which adapts to different data distributions. Previous declustering algorithms have a potential drawback in assuming that the data distribution is uniform. Our method, in contrast, shows good declustering performance for spatial data regardless of the data distribution by taking it into consideration. First, we apply a spatial clustering algorithm to find the distribution of the underlying data and then allocate a disk page to each cluster unit. Second, we analyze the effect of outliers on the performance of the declustering algorithm and propose to handle them separately. Experimental results show that these approaches outperform traditional declustering algorithms based on tiling and mapping functions such as DM, FX, HCAM and Golden Ratio Sequences.
1
Introduction
Spatial database systems, such as geographic information systems, CAD systems, etc., store and handle a massive amount of data. As a result, frequent disk accesses become a bottleneck of the overall system performance. We can solve this problem by partitioning data onto multiple parallel disks and accessing them in parallel. This problem is referred to as declustering. For parallel disk accesses, we divide the entire data set into groups by the unit of disk page and assign a disk number to each group so that the partitioned groups can be accessed at the same time by a query. Up to now, several declustering methods have been proposed. Most of them partition a data space into several disjoint tiles and match them with a disk number using a mapping function [1,2,4,14,16]. They focused only on an efficient mapping function from a partitioned tile to a disk number on the assumption that data is uniformly distributed. Therefore their methods, though they give good performance for uniform data, show a drop in efficiency for skewed data. To be effective for skewed data, a declustering algorithm must reflect the distribution of the underlying data. We can apply two approaches, parametric and non-parametric methods, to discover the distribution. With the parametric method, we assume a parametric model such as a normal distribution, gamma distribution, etc. However, we cannot adopt this approach because most real data distributions do not agree with these distribution models. The other approach, non-parametric methods, makes no assumption about the distribution.
Several methods such as kernel estimation, histograms, wavelets and clustering methods belong to this category. Among these methods, we apply a spatial clustering method called SMTin [6] to detect the data distribution. By applying a spatial clustering method, our method is more flexible with respect to the distribution than tiled partitioning methods and results in a high storage utilization and a low disk access rate. In addition to this contribution, we analyze the effect of outliers on the performance of the declustering algorithm and propose a simple and efficient method to control them. This paper is organized as follows. In section 2, we present related works and their problems, and we propose a new declustering algorithm in section 3. In section 4, we show the effects of outliers on the performance of the declustering algorithm and our solutions. We show experimental results in section 5 and conclude this paper in section 6.
2
Background and Motivation
Since the performance of a spatial database system is affected more by the I/O cost than by the CPU cost, a great deal of effort has been devoted to partitioning data onto multiple parallel disks and accessing the objects qualifying a query condition simultaneously. Traditionally this problem is called declustering, which consists of the following two steps:
– step 1: grouping the data set by the unit of disk page size
– step 2: distributing each group of data onto multiple disks
When we distribute objects onto multiple parallel disks, the response time of a query q is determined as follows.
Definition 1. Let M be the number of disks. Then the number of disk accesses to process query q is DAq = max(i=1..M) DAq(i), where DAq(i) is the number of accesses to the i-th disk.
Based on Definition 1, Moon and Saltz defined the condition of strict optimality of a declustering algorithm [13].
Definition 2 (Strictly Optimal Declustering). Let M be the number of disks and let DAq(i) be the number of accesses to the i-th disk for a query q. A declustering method is strictly optimal if ∀q: DAq = ⌈Nq/M⌉, where Nq = Σ(i=1..M) DAq(i).
Previously proposed declustering methods tile the data space and assign a disk number to each tile based on a certain mapping function. Disk Modulo (DM) [1], Field-wise Xor (FX) [2] and the Hilbert Curve Allocation Method (HCAM) [4] use this scheme. Among these methods, the Hilbert Curve Allocation Method proposed by C. Faloutsos and P. Bhagwat has been known to outperform the others [4]. Moon and Saltz proved that the scalability of DM and FX is limited to some degree.
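The two definitions above can be stated compactly in code. The following small Java sketch (an illustration, not code from the paper) computes the response-driving access count DAq from per-disk access counts and checks the strict-optimality condition, writing the ceiling of Definition 2 as (Nq + M - 1) / M in integer arithmetic.

class DeclusterMetrics {
    static int responseDA(int[] perDiskAccesses) {          // DAq = max_i DAq(i)
        int max = 0;
        for (int a : perDiskAccesses) max = Math.max(max, a);
        return max;
    }

    static boolean strictlyOptimal(int[] perDiskAccesses) {
        int nq = 0;                                          // Nq = sum_i DAq(i)
        for (int a : perDiskAccesses) nq += a;
        int m = perDiskAccesses.length;                      // M disks
        return responseDA(perDiskAccesses) == (nq + m - 1) / m;   // DAq == ceil(Nq/M)
    }
}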
They proved that the scalability of DM is bounded by the query side and that the scalability of FX is at best 25 percent when doubling the number of disks. For more details, see [13]. Recently, Bhatia and Sinha proposed a new declustering algorithm based on Golden Ratio Sequences [16]. Their analytical model and experimental results show that GRS outperforms not only traditional tile-based declustering methods such as DM, FX and HCAM, but also the cyclic allocation method [14], which is a generalization of DM. Most of these declustering methods are known not to be strictly optimal without some assumptions [1,2,4,8,14,16], and they try only to find an optimal way of allocating each tile to a disk under the assumption that data is uniformly distributed. However, they ignore the effect of a good grouping method on increasing the storage utilization in step 1. It is obvious that the maximum number of page accesses per disk grows as the total number of disk pages occupied by objects increases. As a result, we might access more disk pages without a good grouping method. In addition to this drawback, they did not consider outliers, which have an effect on the performance of the declustering algorithm. We will explain their effects more closely in section 4.
3
Declustering Skewed Dataset
In section 2, we showed that previously proposed declustering algorithms have a potential weakness in assuming a uniform data distribution. We will show this in detail and propose a new declustering algorithm in this section.
3.1
Skewed Data and Declustering
First, we describe how the performance of a tiling algorithm can be affected by the data distribution. When the number of objects in a certain dense tile exceeds the disk page capacity, all the tiles must be split even though they contain a small number of objects that could fit into one disk page. This results in a low storage utilization and a poor performance in comparison with uniform data, even if the same mapping function is used. We investigate this problem in the rest of this subsection in detail. Table 1 shows the notations and their meanings used from now on.
Lemma 1. When we apply the same tiling scheme, the total number of disk pages occupied by skewed data is larger than that occupied by uniform data.
Proof. Let ni be the number of objects in the i-th tile occupied by skewed data. Then Σ(i=1..Ts) ni = N. Since the distribution of the data is skewed, min(i=1..Ts)(ni) < max(i=1..Ts)(ni) = Bfmax and avg(i=1..Ts)(ni) = N/Ts < Bfmax. For uniform data, the number of objects in a tile is N/Tu, and if the storage utilization is maximal, the number of objects in a tile is Bfmax. It means that N/Tu = Bfmax. Therefore N/Ts < N/Tu, that is, Ts > Tu.
Table 1. Notations and their meanings
N: number of spatial objects in 2-D space
Tu, Ts: total number of tiles (disk pages) occupied by uniform data and skewed data, respectively
Bfmax: maximum disk blocking factor
du, ds: density of uniform data and skewed data, respectively
au, as: area of one tile for uniform data and skewed data, respectively
Lemma 2. For a given query q, let Tu(q) and Ts(q) be the number of disk page accesses to process query q for uniform data and skewed data, respectively. Then Tu(q) < Ts(q).
Proof. We normalize the data space to [0, 1]^2. For completely uniform data, the number of disk pages Tu and the area of a tile are given as follows:
Tu = N / Bfmax   (1)
au = 1 / Tu = Bfmax / N   (2)
Since there are Bfmax objects in a tile, we obtain the following equation:
du · au = Bfmax,   au = Bfmax / du   (3)
For skewed data, the number of objects contained in a tile is variable. We get the following equation for skewed data:
max(ds) · as = Bfmax,   as = Bfmax / max(ds)   (4)
It is evident that du < max(ds), since max(ds) is the maximum density of the skewed data. Therefore we derive the following inequality from equations (3) and (4):
as < au   (5)
For a query q whose area is A(q), the number of tiles contained by q for the uniform data, Tu(q), and for the skewed data, Ts(q), are given as
Tu(q) = A(q) / au,   Ts(q) = A(q) / as   (6)
From equations (5) and (6), we know that the number of tiles for the skewed data qualifying the same query condition is larger than that for uniform data, since as < au.
From Lemmas 1 and 2, we come to the conclusion that tiling methods for skewed data cannot satisfy the strict optimality condition of Definition 2.
Theorem 1. Suppose that DAu(q) and DAs(q) are the maximum numbers of page accesses per disk for uniform and skewed data to process query q, respectively. Then DAu(q) < DAs(q).
Proof. Let M be the number of disks and assume an optimal declustering algorithm. Then the number of page accesses per disk for uniform data and skewed data to process query q is given as follows:
DAu(q) = Tu(q) / M,   DAs(q) = Ts(q) / M
Since Tu(q) < Ts(q), DAu(q) < DAs(q).
This means that the number of tiles (disk pages), in other words the storage utilization, is an important factor for a declustering method in the case of skewed data. In fact, in our experiments we found a significant difference between Ts and Tu depending on the data and the degree of skewness. These observations lead to the conclusion that it is very important to partition spatial objects in a way that reduces the number of tiles, in addition to finding a good mapping function for allocating each tile to a disk number. We will focus on methods to improve the storage utilization in the next subsection.
3.2
An Efficient Declustering Method for Skewed Data
In this subsection, we propose a new declustering algorithm which is flexible with respect to the distribution of the data and results in a small number of pages in comparison with previous declustering algorithms. The proposed algorithm is composed of the following three steps.
Step 1. Find Data Distribution
In this paper, we apply a spatial clustering algorithm to find the distribution of the data. Up to now, several spatial clustering methods have been introduced [5,6,9,10]. Among them, we apply SMTin as the spatial clustering algorithm. SMTin initially constructs a Delaunay triangulation of the point set and extracts clusters from triangles whose distance is within a predefined threshold value. We can find the distribution of the underlying N objects with a time complexity of O(N log N). For more details, see [6].
Step 2. Split Overflow Clusters
After the clustering step, there may be clusters whose number of elements exceeds the disk page capacity. These clusters must be partitioned into several sub-clusters so that each of them fits into one disk page. We apply an efficient partitioning method called STR, proposed by Leutenegger et al. [7], to split these clusters.
Step 3. Distribute Clusters onto Disk Pages
After adjusting all clusters so that each fits into one disk page, we calculate the center of the minimum bounding rectangle enclosing each cluster, sort the clusters by the Hilbert value of their centers, and then assign disk numbers in a round-robin fashion.
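The following Java sketch illustrates this step under the assumption that the cluster centres lie on a 2^k x 2^k grid; xy2d is the standard Hilbert-curve index computation. Note that the DC algorithm in Fig. 1 assigns di = H(xi, yi) mod M directly from the Hilbert value rather than from the sorted rank.

class HilbertAssign {
    // Hilbert index of (x, y) on an n x n grid, where n is a power of two.
    static long xy2d(long n, long x, long y) {
        long d = 0;
        for (long s = n / 2; s > 0; s /= 2) {
            long rx = ((x & s) > 0) ? 1 : 0;
            long ry = ((y & s) > 0) ? 1 : 0;
            d += s * s * ((3 * rx) ^ ry);
            if (ry == 0) {                                  // rotate/flip the quadrant
                if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
                long t = x; x = y; y = t;
            }
        }
        return d;
    }

    // Sort cluster centres by Hilbert value, then assign disks round-robin.
    static int[] assignDisks(long[][] centres, long gridSide, int disks) {
        Integer[] order = new Integer[centres.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        java.util.Arrays.sort(order, (a, b) -> Long.compare(
            xy2d(gridSide, centres[a][0], centres[a][1]),
            xy2d(gridSide, centres[b][0], centres[b][1])));
        int[] disk = new int[centres.length];
        for (int r = 0; r < order.length; r++) disk[order[r]] = r % disks;
        return disk;
    }
}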
4
Outliers and Declustering
In data analysis, data with a large dissimilarity to the rest may deteriorate the result. Much research has been done on such data, called outliers, in the data mining area, and various definitions of outliers have been given in the literature [11,12,15]. However, we give a different definition of outliers, which is based on SMTin, as follows.
Definition 3. Outliers with Cluster Construction Threshold Value. Let CCTcs be a cluster construction threshold value and C1, C2, ..., Cn be the clusters obtained with CCTcs. Then any object O satisfying the following condition is an outlier: ∀O' ∈ Ci, 1 ≤ i ≤ n: dist(O, O') > CCTcs.
In fact, we can control the number of outliers by means of CCTcs. We will explain the effects of outliers and our solutions in the following subsections.
4.1
The Effect of Outliers on the Declustering
First, we illustrate the problems caused by outliers. After finding an initial cluster set, there can be some objects not included in any cluster, depending on the initial cluster construction threshold value. We have to increase the cluster construction threshold value to include them; as a result the shape of the clusters may degenerate and the minimum bounding rectangle enclosing a cluster may be extended unnecessarily. Another problem is that clusters whose size is much smaller than the maximum page size result in a low storage utilization. Although they may be regarded as clusters, it is desirable to treat them as outliers.
4.2
An Efficient Method for Skewed Data with Outliers
In the previous subsection, we showed that outliers may degrade the performance and must be treated carefully. We may apply two approaches to handle them. The simplest approach is to include them in the nearest cluster by force. Another solution is to regard an outlier as a cluster of extremely small size and assign one disk page to it. However, these solutions result in a low storage utilization and a high disk access rate. We propose to keep the outliers in an extra main-memory buffer. As the outlier buffer size is limited, we keep a part of the outliers in the buffer and force the remaining outliers into the nearest cluster. Figure 1 shows our proposed declustering algorithm.
Algorithm DC (Declustering Clusters to parallel disks)
Input: P: set of points {p1, p2, ..., pn}; M: number of disks available;
       Bfout: capacity of outlier buffer; Bf: disk blocking factor;
       CCTcs: cluster construction threshold value
Output: {Ci, di}   // Ci: i-th cluster, di: disk number assigned to Ci
Begin Algorithm
  C ← {};
  Construct initial clusters by CCTcs;
  While (Card(outlier) > Bfout)
    adjust CCTcs;
    reconstruct clusters;
  End while
  For each cluster Si
    If Card(Si) > Bf
      {Si1, Si2, ..., Sik} ← SplitCluster(Si, Bf);
      C ← C ∪ {Si1, Si2, ..., Sik};
    Else
      C ← C ∪ {Si};
    End If
  End For
  For each element Ci in C
    (xi, yi) ← center of cluster Ci;
    di ← H(xi, yi) mod M;   // H(x, y) is the Hilbert value of (x, y)
  End For
End Algorithm
Fig. 1. Description of Declustering by Clustering(DC) algorithm
5
Performance Evaluation
We performed several experiments to compare our method with previously proposed declustering algorithms. To do this, we generated synthetic skewed data to analyze the effect of the data distribution on the performance of declustering. We also prepared two real data sets extracted from the maps of Long Beach County and Seoul city. Figure 2 shows their distributions. For the query sets we generated two types of queries, uniformly distributed in the data space and concentrated on the data area, with sizes of 0.1 x 0.1 (small query) and 0.5 x 0.5 (large query) of the data space [0, 1]^2.
Storage Utilization
We explained that the total number of disk pages has an effect on the performance. We found that the tiling method occupies more disk pages than our proposed method; in detail, it occupies 1.6 to 3.7 times more disk pages than the declustering-by-clustering algorithm.
Fig. 2. Test data distribution: (a) skew synthetic data, (b) LBCounty, (c) Seoul
Fig. 3. Storage utilization [%] of tiling and clustered partition for disk page sizes of 2^x KByte (outlier buffer is 1 KB): (a) skew, (b) LBCounty, (c) Seoul
Figure 3 shows the storage utilization of the tiling method and of DC. It shows that the tiling method is far from an optimal allocation, whereas our method is nearly optimal as far as storage utilization is concerned.
Scalability
We compared our algorithm only with the Golden Ratio Sequence (GRS) [16], since it is known to be the best among the tiling methods. Figure 4 shows the results of the experiments. We see that our method gives good performance regardless of data distribution and query size. Especially when the number of disks is small and the query size is large, the enhancement of our method over GRS is significant. When the query size is 25% of the data space and the number of disks is 8, the declustering method based on the Golden Ratio Sequence accesses about 8 times more disk pages than DC for the Seoul data. However, the performance improvement ratio over GRS becomes lower as the number of available disks increases. We carried out a similar experiment with a set of queries concentrated on a downtown area rather than uniformly distributed. This is a more realistic experimental environment, since queries tend to be located in specific regions. The experiment, however, shows very similar results to those for the uniformly distributed queries.
The Effect of Outliers
We carried out an experiment to reveal the effects of outliers on the performance of the declustering algorithm.
Fig. 4. Performance enhancement ratio of DC over GRS (ratio of improvement vs. number of disks, for small and large queries) for uniformly distributed queries, where the disk page size is 4 KByte and the outlier buffer is 1 KByte: (a) skew, (b) LBCounty, (c) Seoul
Fig. 5. Average response time vs. number of disks for outlier buffer sizes Bfout = 4 KB and 8 KB; the disk page size is 1 KByte: (a) small query, (b) large query
If the size of the outlier buffer is small, the cluster construction threshold value becomes large and the minimum bounding rectangle enclosing a cluster must be extended to include outliers. We found that the effect of the outlier buffer depends on the data distribution. It is of no use to increase the outlier buffer beyond a certain value, which is related to the data distribution. Figure 5 shows the effect of the outlier buffer on the performance of the proposed declustering algorithm for the Seoul data. We gain more performance enhancement, though its effect is small, by increasing the size of the outlier buffer.
6
Conclusion
In this paper, we proposed a new declustering method for spatial objects. We investigated the effect of the data distribution on the performance of the declustering algorithm and showed that previously presented algorithms do not give good performance for skewed and real data. We reviewed the definition of strict optimality proposed by Moon and Saltz and showed that tiling algorithms cannot be strictly optimal for skewed data. Before declustering spatial objects onto multiple disks, we apply a spatial clustering algorithm to discover the data distribution. By doing so, our method adapts better to the data distribution than tiling methods. In addition to this contribution, we showed the effects of outliers on the performance of the declustering algorithm and proposed storing them separately. Experimental results show that our method gives a high storage utilization and a low disk access rate regardless of the data distribution and query distribution.
Currently our work is limited to static data sets. In future work, we will study dynamic clustering algorithms and examine the effect of outliers more closely. Acknowledgements. The author(s) wish(es) to acknowledge the financial support of the Korea Research Foundation made in the Program Year 1997.
References
1. Du, H.C., Sobolewski, J.S.: Disk Allocation for Cartesian Files on Multiple-Disk Systems. Int. J. ACM TODS, Vol. 7, No. 1, (1982) 82-102
2. Fang, M.T., Lee, R.C.T., Chang, C.C.: The Idea of De-Clustering and Its Applications. VLDB (1986) 181-188
3. Faloutsos, C., Metaxas, D.: Disk Allocation methods using error correcting codes. Int. J. IEEE Trans. on Computers, Vol. 40, No. 8, (1991) 907-914
4. Faloutsos, C., Bhagwat, P.: Declustering using fractals. Parallel and Distributed Information Systems Conf. (1993) 18-25
5. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: An efficient data clustering method for very large databases. SIGMOD (1996) 103-114
6. Kang, I.S., Kim, T.W., Li, K.J.: A spatial data mining method by delaunay triangulation. Proc. ACM-GIS (1997) 35-39
7. Leutenegger, S.T., Lopez, M.A., Edgington, J.M.: STR: A simple and efficient algorithm for r-tree packing. ICDE (1997) 497-506
8. Abdel-Ghaffar, K., Abbadi, A.E.: Optimal allocation of two-dimensional data. ICDT (1997) 409-418
9. Sheikhleslami, G., Chatterjee, S., Zhang, A.: WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB (1998) 428-439
10. Guha, S., Rastogi, R., Shim, K.: CURE: An efficient clustering algorithm for large databases. SIGMOD (1998) 73-84
11. Knorr, E., Ng, R.: Algorithms for Mining Distance-Based Outliers in Large Datasets. VLDB (1998) 392-403
12. Barnett, V., Lewis, T.: Outliers in Statistical Data. Third Edition, John Wiley & Sons Ltd. (1998)
13. Moon, B.K., Saltz, J.H.: Scalability Analysis of Declustering Methods for Multidimensional Range Queries. Int. J. IEEE TKDE, Vol. 10, No. 2, (1998) 310-327
14. Prabhakar, S., Abdel-Ghaffar, K., El Abbadi, A.: Cyclic allocation of two-dimensional data. ICDE (1998) 94-101
15. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD (2000) 427-438
16. Bhatia, R., Sinha, R.K., Chen, C.-M.: Declustering using golden ratio sequences. ICDE (2000) 271-280
A Retrieval Method for Real-Time Spatial Data Browsing Yoh Shiraishi and Yuichiro Anzai Department of Computer Science, Keio University, 223-8522, 3-14-1, Hiyoshi, Yokohama, Japan, {siraisi, anzai}@ayu.ics.keio.ac.jp
Abstract. This paper presents a retrieval method for real-time spatial data browsing through a computer network. The method is designed based on the "anytime algorithm" concept, which improves the quality of the processing results over time. The retrieval method decomposes a region specified in a query into sub-regions and searches each decomposed region using a spatial index. The searched data are kept as intermediate results for the query and these results are incrementally sent to a client. Consequently, the client can incrementally browse spatial data for the query. We implemented this retrieval system using the Java language and constructed a map browsing system as a sample application. Through some experiments, we discuss the effectiveness of our retrieval system.
1
Introduction
Providing spatial data such as map data and GIS data through a computer network is effective for location-oriented applications on the Internet. Especially, spatial data expressed as text, such as XML, is important from the point of view of standardization, interoperability and information integration [1]. However, it takes a long response time to retrieve spatial data from remote databases and to obtain a large quantity of spatial data via a network. Therefore, we propose a retrieval method for real-time spatial data browsing through a computer network. Our method is inspired by the "anytime algorithm" [2,3], which improves the quality of the processing results over time. It can provide spatial data incrementally to a client by controlling search and data transmission. This retrieval method executes a range query on a spatial database with a tree index such as the R-tree or R*-tree, in order to collect spatial data within a region specified by a user. The method decomposes the region of the range query into sub-regions, and searches an index tree to find spatial data in each decomposed region. The searched data are kept as intermediate results during the search. These results are decomposed into packets and each packet is transmitted incrementally to the client. The remainder of this paper is organized as follows. In Section 2, we mention the requirements of spatial data retrieval through a network. In Section 3, we present a retrieval method for real-time spatial data browsing. The implementation of this retrieval system is described in Section 4. In Section 5, we show
the experimental results and discuss the effectiveness of our retrieval method. Finally, we give conclusion in Section 6.
2
Real-Time Spatial Data Browsing through Internet
Our research aims to construct a system for browsing spatial data through a computer network. Spatial data are a large quantity of data that include various kinds of information such as map data and location attributes. However, when a user receives a large amount of spatial data for a query, the response time will be long: the larger the size of the collected data, the longer the data transmission time, and the larger the data table of a remote database, the higher the data access cost. Accordingly, spatial data browsing with real-time response is required. In the field of artificial intelligence, the "anytime algorithm" was proposed as an approach to problem solving under time constraints [2,3]. The algorithm improves the quality of the problem solving over time and can return some solution whenever it is interrupted. On the other hand, "imprecise computation" was advocated in the field of real-time systems [4]. This concept is very similar to the anytime algorithm. Both computational models achieve a monotonic improvement of the quality of the processing result by keeping the intermediate results of the processing. There are some applications of imprecise computation [5,6]. An imprecise algorithm for image data transmission was proposed in [5]; however, it is not applicable to text data transmission or spatial data retrieval. Also, "approximate query processing" is a query method for real-time databases based on the imprecise computation model [6]. This method gives incremental output for retrieval from a relational database, but its application to data retrieval via a network or to spatial database systems was not discussed. Therefore, we apply the anytime algorithm to spatial data retrieval.
3
A Retrieval Method for Spatial Data Browsing
In this paper, we regard the processing quality as the rate of collected data for spatial data retrieval and design a retrieval method that can improve this quality. The retrieval quality at time t is defined as follows:
q(t) = (number of spatial data objects collected up to t) / (total number of data objects for the query)   (1)
In order to improve this quality, our retrieval method keeps the intermediate results for the query and transmits these results incrementally. Our method consists of two processes: a process for search and a process for results transmission. The search process decomposes a region of a range query into sub-regions and searches a spatial index to find spatial data in each decomposed region. The transmission process decomposes the retrieval results into packets and supplies each packet to the client incrementally.
3.1
Spatial Index
We adopt R-tree[7] as a spatial index for data management. R-tree is the most popular spatial index based on the minimum bounding rectangle (MBR). The index structure is a balanced tree. Each node of the tree has some entries. An entry in a leaf node points to a data record that keeps a spatial data object and the MBR of the entry is defined as a region that covers the object. An entry in a non-leaf node points to a child of the node. The MBR of a node is calculated as a region that covers all entries of the node. 3.2
Query Expression
Finding spatial data in a region specified by a user is demanded frequently in spatial database systems and geographical information systems. Such a query is called a range query and is often used in spatial database systems. In this paper, we treat two kinds of range queries for retrieving two-dimensional spatial data: a rectangle query and a circular query. A rectangle query has a range along each coordinate axis and covers a rectangular region Rrect = <xmin, ymin, xmax, ymax>. On the other hand, a circular query is a query based on the distance between a query point and spatial data objects. The query region is represented as Rcircle = <xcenter, ycenter, radius>. Our retrieval method collects all spatial data that overlap with the query region of these range queries.
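The overlap tests these two query types require against an object's minimum bounding rectangle are straightforward; the following Java sketch (an illustration, not code from the paper) shows one way to write them, using the distance from the circle centre to the closest point of the MBR for the circular case.

class RangeOverlap {
    static boolean overlapsRect(double xmin, double ymin, double xmax, double ymax,
                                double qxmin, double qymin, double qxmax, double qymax) {
        return xmin <= qxmax && qxmin <= xmax && ymin <= qymax && qymin <= ymax;
    }

    static boolean overlapsCircle(double xmin, double ymin, double xmax, double ymax,
                                  double cx, double cy, double radius) {
        double dx = Math.max(0, Math.max(xmin - cx, cx - xmax));   // distance to MBR in x
        double dy = Math.max(0, Math.max(ymin - cy, cy - ymax));   // distance to MBR in y
        return dx * dx + dy * dy <= radius * radius;
    }
}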
3.3
Query Region Decomposition
Our method decomposes the region of a query into sub-regions that do not overlap one another. Namely, a query region R is expressed as a region that is covered by all of the disjoint regions:
R = r1 ∪ r2 ∪ ... ∪ rk = ∪(i=1..k) ri   (2)
∀i, j: ri ∩ rj = ∅ (i ≠ j)   (3)
k is the number of decomposed regions. Each decomposed region ri is defined by using two bounded regions: a lower bounded region ri^lower and an upper bounded region ri^upper (Fig. 1):
ri = ri^upper − ri^lower for i > 1,   ri = ri^upper for i = 1   (4)
ri^lower = r(i−1)^upper   (5)
These bounded regions must satisfy the condition ri^lower ⊂ ri^upper. The decomposed regions are ordered by formula (5) and are kept as the ordered list region_list.
Fig. 1. Decomposing a region of a query (the decomposed region ri lies between ri^lower and ri^upper)
Since the decomposed regions are searched in the order of the list and the searched area is decided by ri^upper, the area spreads with time. The decomposition of a query region must satisfy these conditions, but there are various ways to decompose a query region. In this paper, we use some parameters that decide how to extend the searched area. These parameters consist of the width, which expresses the degree of the area enlargement, and the base point of all decomposed regions (Fig. 2).
Fig. 2. The implementation of query region decomposition: (a) circular query, grown by dradius around the centre; (b) rectangle query, grown from the base point (xbase, ybase) by dxplus, dxminus, dyplus and dyminus
A query region Rcircle of a circular query is decomposed into sub-regions as shown in Fig. 2 (a). dradius decides the rate at which the radius of the decomposed regions is extended. In this case, the base point is the center of the circle (xcenter, ycenter). The upper bounded region ri^upper is calculated as follows:
=
< xcenter , ycenter , dradius × i > for 1 ≤ i < k for i = k < xcenter , ycenter , radius >
(6)
On the other hand, a query region Rrect of a rectangle query is decomposed as shown in Fig. 2 (b). The coordinate (xbase, ybase) expresses the base point of all decomposed regions. dxplus, dxminus, dyplus and dyminus decide the rate at which the regions are extended along each coordinate axis. ri^upper is calculated as follows:
ri^upper = <xbase + dxminus × i, ybase + dyminus × i, xbase + dxplus × i, ybase + dyplus × i> for 1 ≤ i < k,   ri^upper = <xmin, ymin, xmax, ymax> for i = k   (7)
dxplus ≥ 0,   dyplus ≥ 0,   dxminus ≤ 0,   dyminus ≤ 0   (8)
3.4
Search Algorithm
Our search algorithm behaves in a top-down manner; its description is shown in Fig. 3.
create region_list = r1, ..., rk by decomposing a query region R
create OPEN_LIST_i at each sub-region (ri)
create RESULT_BUFFER_i for ri
insert a root node into OPEN_LIST_1
for each decomposed region (ri) in order of region_list do
    while OPEN_LIST_i is not empty do
        pick up an element e from OPEN_LIST_i
        if e is an entry in a leaf node then
            if overlap(ri, e.rect) ∧ overlap(ri, o.geom) then
                insert e into RESULT_BUFFER_i
            else
                insert e into OPEN_LIST_{i+1}
            endif
        else
            if overlap(ri, e.rect) then
                insert all children of e into OPEN_LIST_i
            else
                insert e into OPEN_LIST_{i+1}
            endif
        endif
    end
end
Fig. 3. A search algorithm based on region decomposition
Our method scans each decomposed region using the index tree in the order of region_list. OPEN_LIST is a queue for the search and RESULT_BUFFER is a buffer that keeps the searched result for each decomposed region. While an entry picked out of OPEN_LIST_i overlaps with the sub-region ri, the entry is expanded using OPEN_LIST_i. Otherwise, the entry is inserted into OPEN_LIST_{i+1} for the next decomposed region r_{i+1} as a candidate for the search. Using the entries in OPEN_LIST_{i+1}, the search for the next sub-region is executed; our method does not start the search for each sub-region from scratch. Since an entry (e) of a leaf node points to a data object (o), this method evaluates whether the bounding box (e.rect) of the entry overlaps with ri as well as
whether the geometric attribute (o.geom) of the object overlaps with the region. If these conditions are satisfied, the entry is inserted into RESULT_BUFFER_i. In this implementation, we manage the entries in OPEN_LIST with a depth-first policy.
3.5
Transmitting Retrieval Results
The data transmission process marks up the retrieval results in XML and incrementally sends each result as a packet to the client. Since the transmission process keeps the results of each decomposed region in multiple buffers, it can supply intermediate results to the client before the search finishes. Spatial data objects for each decomposed region are sent as packets. The process picks these data objects out of RESULT_BUFFER based on the maximum number (max_item) of data objects that a packet can include.
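A minimal Java sketch of this packet decomposition (an illustration, not the implementation described in Section 4) cuts a result buffer into packets of at most max_item objects, so the client can start drawing before the whole region has been searched.

class PacketSplitter {
    static <T> java.util.List<java.util.List<T>> toPackets(java.util.List<T> buffer,
                                                           int maxItem) {
        java.util.List<java.util.List<T>> packets = new java.util.ArrayList<>();
        for (int i = 0; i < buffer.size(); i += maxItem) {
            // copy one slice of at most maxItem result objects into its own packet
            packets.add(new java.util.ArrayList<>(
                buffer.subList(i, Math.min(i + maxItem, buffer.size()))));
        }
        return packets;
    }
}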
4
Implementation
We implemented a spatial data server based on our retrieval method using the Java language (JDK 1.2). This server executes a query requested by a client and supplies the retrieval results to the client. The retrieval system behaves as a TCP server. The search process and the transmission process are each implemented as a thread using the java.lang.Thread class. The search thread and the transmission thread run concurrently and share the intermediate results through multiple buffers that the search thread keeps. Accordingly, the transmission thread can acquire the retrieval results from these buffers during the search. We used the digital map data published by the Geographical Survey Institute [8] as sample data for our experiments and composed a spatial database that includes line data and polygon data. As the spatial index for this database, we adopted an R-tree with a quadratic splitting algorithm [7]. We use the java.io.RandomAccessFile class to access data in a file. A DTD for the XML expression of the retrieval results is shown in Fig. 4.
<!ELEMENT spatial-data (line|polygon)*>
<!ELEMENT line (id?,((x,y),(x,y)))>
<!ELEMENT polygon (id?,((x,y),(x,y),(x,y)+))>
<!ELEMENT x (#PCDATA)>
<!ELEMENT y (#PCDATA)>
Fig. 4. A DTD for description of spatial data
Also, we implemented a map browsing system as a TCP client of this retrieval system. Through the TCP connection, the client requests a query to the server and receives the retrieval results from the server. This client parses spatial data from the results expressed by XML and draws a map based on the parsed data. We used SAX (Simple API for XML) as the XML parser.
5
Evaluation and Discussion
We have conducted some experiments to evaluate our method. A spatial data server and a map browsing client work on the different host (Sun Ultra-10 Workstation). The server host has 440 MHz CPU and 512 MB memory. The client host has 333 MHz CPU and 128 MB memory. These hosts are connected through 100 Mbps Ethernet. 5.1
The Improvement of the Retrieval Quality
First, we evaluated the basic performance of the implemented retrieval system. The response time (twait ) of the retrieval system includes the search time (tsearch ), the XML tagging time (ttag ), the XML parsing time (tparse ), the drawing time (tdraw ) and the rest (tetc ). tetc includes the data transmission time. We measured these time using java.lang.System.currentTimeMillis() method. In this experiment, the data server manages 26745 line data objects that contain 53490 points. We measured the response time when the client finishes the drawing for each received packet, and recorded the retrieval quality calculated by an expression (1). Fig.5 and Fig.6 show the performance for a circular query with radius = 1500. The total number of collected line objects for the query is 6859.
Fig. 5. The improvement of the retrieval quality (quality vs. time [ms]) with packet decomposition for max_item = 500, 2000 and 5000, compared to no decomposition
Fig. 6. The improvement of the retrieval quality (quality vs. time [ms]) with query region decomposition for dradius = 100, 200 and 500, compared to no decomposition
Fig.5 shows the effect of packet decomposition based on the maximum number of data objects (max item) when the server searches without region decomposition. This result shows the monotonic improvement of the retrieval quality. Also, the first response time becomes short by packet decomposition. The server can provide some fragments of the retrieval results before the search finishes completely.
Fig.6 shows the effect of region decomposition using dradius . This result suggests that the retrieval quality improves monotonically, similar to the case using packet decomposition (Fig.5). As dradius is smaller (namely, the number of decomposed regions is larger), the first response time is shorter and the quality improves smoothly. 5.2
Processing Cost
Next, we measured some processing costs for a circular query to collect line data object. The result is shown in Fig.7.
Fig. 7. Processing time
Fig. 8. Search cost
This result suggests that the search time makes up a large percentage of the response time. Also, the longer the query range (radius) is (namely, the more data objects are collected), the larger the XML processing costs (ttag and tparse) are. These costs depend on the size of the transmitted data. Fig. 8 shows the total search time when a client requests a circular query with radius = {1000, 1500, 2000} and the server decomposes the query region using dradius = {100, 200, 500}. This result suggests that the total search time depends on the number of collected objects for the query and that region decomposition adds little overhead to the total search time. In addition, since region decomposition makes the last transmitted result smaller and the XML processing time shorter, the total response time is shorter. This tendency can be observed in Fig. 6.
5.3
Map Browsing
We examined the drawing image when a client requests a rectangle query with Rrect =< −52500, −18000, −49500, −14000 > and the server decomposes the query region using {xbase = −51000, ybase = −16000, dxplus = dyplus = 200, dxminus = dyminus = −200}. The server manages 1678 polygon objects that
Fig. 9. The image (t = 778 msec)
Fig. 10. The image (t = 4398 msec)
Fig. 11. The last drawing image (t = 14754 msec)
Fig. 12. The drawing image using packet decomposition
contain 28659 points. The case using region decomposition has a different drawing manner from the case using packet decomposition. The drawing images are shown in Fig. 9, Fig. 10, Fig. 11 and Fig. 12. In both cases, the client can browse some spatial data before receiving all results from the server, because the client incrementally draws the map data every time it receives a packet. Fig. 9 and Fig. 10 are drawing images based on the intermediate results when the server searches using region decomposition. Fig. 11 is the image based on all data objects collected for the query. These figures show that the region containing the collected objects spreads over time from the base point specified by the user. In contrast, when the server transmits the retrieval results using packet decomposition, the client draws the map data sparsely (Fig. 12).
5.4
Related Work
An incremental nearest neighbor algorithm for ranking spatial objects was proposed in [9]. Using a priority queue, this algorithm ranks data objects while searching. It is more effective than a method using a nearest neighbor query because it does not have to start the search from scratch. Generally, a search algorithm uses a queue, called an open list, and keeps candidate nodes based on a given policy. Our search method is a depth-first approach, but the incremental algorithm is a best-first
approach based on the distance between spatial objects. Also, the algorithm can output the ranking results incrementally. In this respect, it is very similar to our method. However, our goal is to browse spatial objects with real-time response, not to rank spatial objects. The incremental nearest neighbor algorithm sorts exactly using a single queue, whereas our method maintains multiple queues to control the search for each decomposed region. Our approach uses multiple queues for the ordered sub-regions and sorts spatial objects only as roughly as needed to realize real-time spatial data browsing.
6
Conclusion
We designed a retrieval method for real-time spatial data browsing through a computer network and implemented the retrieval system using the Java language. Our retrieval method can incrementally supply the intermediate results for a query. These results are ordered by a search based on region decomposition and are decomposed into packets based on the number of data objects. Consequently, the client can incrementally browse spatial data for the query. Through some experiments, we showed that our method monotonically improves the retrieval quality. The experimental results also suggested that our method can deliver the first response within a short time and keeps the overhead on the total response time small. Further work will include the refinement of the search algorithm and the construction of a framework for information integration based on a spatial data infrastructure.
References
1. G-XML. http://gisclh01.dpc.or.jp/gxml/contents-e.
2. M. Boddy and T. Dean. Solving Time-Dependent Planning Problems. In Proceedings of IJCAI 89, pages 979–984, 1989.
3. S. Zilberstein and S. Russell. Optimal composition of real-time systems. Artificial Intelligence, 82(1-2):181–213, 1996.
4. K.-J. Lin, S. Natarajan, J. W.-S. Liu, and T. Krauskopf. Concord: A System of Imprecise Computation. In Proceedings of COMPSAC 87, pages 75–81. IEEE, 1987.
5. X. Huang and A. M. K. Cheng. Applying Imprecise Algorithms to Real-Time Image and Video Transmission. In Proceedings of the Real-Time Technology and Applications Symposium, pages 96–101, 1995.
6. S. V. Vrbsky. A data model for approximate query processing of real-time databases. Data & Knowledge Engineering, 21:79–102, 1997.
7. A. Guttman. R-Trees: A Dynamic Index Structure for Spatial Searching. In Proceedings of the ACM SIGMOD Conference, pages 47–57, 1984.
8. Geographical Survey Institute. http://www.gsi.go.jp/ENGLISH.
9. G. R. Hjaltason and H. Samet. Ranking in Spatial Databases. In Advances in Spatial Databases: 4th International Symposium, SSD '95 (LNCS 951), pages 83–95, 1995.
Designing a Compression Engine for Multidimensional Raster Data Andreas Dehmel FORWISS Orleansstraße 34 D-81667 Munich, Germany [email protected]
Abstract. Multidimensional raster data appears in many application areas, typically in the form of sampled spatial or spatio-temporal analogue data. Due to the data volume and correlations between neighbouring samples usually encountered in this kind of data it has high potential for efficient compression, as can be seen by the wealth of specialized compression techniques developed for 2D raster images; other examples would be 1D time series, 3D volumetric or spatiotemporal data, 4D spatio-temporal data or 5+D data typically found in OLAP data cubes. Efficiently handling this kind of data often requires compression, be it to reduce storage space requirements or transfer times over low-bandwidth media. In this paper we present the design of the generic, tile-based compression engine developed for this purpose and implemented in the multidimensional array DBMS RasDaMan. Keywords: Spatial and temporal databases, object-oriented databases, data compression
1
Introduction
Raster data of arbitrary dimensionality and base type, which we call Multidimensional Discrete Data or MDD for short, is an everyday phenomenon in digital data management. The most popular special cases of MDD are raster images, but in recent years digital audio and video data have also gained a substantial amount of exposure. The popularity and usability of this kind of data has been largely influenced by the development of dedicated compression methods, for instance JPEG, PNG, or the various MPEG standards, to name but a few, as only compression allows reducing the data volume enough to handle this data efficiently in everyday use. Research done so far on compression in databases has mostly focussed on relational data and the impact of compression on overall performance as in [8]. The situation is very different for MDD, however, since MDD are both considerably
Research funded by the European Commission under grant no. QLG3-CT-199900677 NeuroGenerator, see www.neurogenerator.org
bigger than relational data and have different properties. Compression in relational databases usually means a simple text compression method, whereas more advanced techniques developed for image- and video compression are a much more natural template for the development of MDD compression algorithms because both of these data types are just special cases of MDD themselves and exploit properties like spatial correlations which can be found in many types of MDD as well. The problems in efficiently compressing MDD derive from their generic nature, i.e. neither the number of dimensions nor the structure of the base type may be restricted in any way. There is a huge amount of work done on compression for specific types of MDD like images, video and audio data, but to the best of our knowledge no work on generic MDD compression nor its integration into a DBMS kernel. That means that existing compression techniques may well serve as a design template but are much too restrictive ”out of the box” to be of any use in the MDD context. We will list the requirements in more detail in Sec. 1.3 after a few introductory words on terminology and RasDaMan. 1.1
Terminology
Before taking a closer look at the requirements of MDD compression we will first establish the (RasDaMan) terminology used in the remainder of this paper. An MDD is a template for multidimensional raster data; the two template parameters are information about the geometry, encoded as a spatial domain, and about the base type. This leads to the following definitions:
Spatial Domain: a multidimensional interval spanned by two vectors l, h. We use the syntax [l1 : h1, ..., ld : hd], li, hi ∈ ℤ, li ≤ hi to represent a spatial domain in d dimensions. A point x lies within a spatial domain sdom if ∀i : li ≤ xi ≤ hi (x ∈ sdom).
Cell: a cell is located at each position x ∈ sdom (dense data model). Its structure is determined by the base type of the MDD it belongs to.
Base Type: describes the structure of a cell. There are atomic types as used in high level languages like C (e.g. int, short, ...) and arbitrary structured types which may consist of any combination of atomic types and other structured types.
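The spatial-domain notion translates directly into code; the following small Java class is a hypothetical illustration of it (not a RasDaMan class), with a point-containment test and the number of cells the domain covers under the dense data model.

class SpatialDomain {
    final long[] lo, hi;                       // the vectors l and h, lo[i] <= hi[i]
    SpatialDomain(long[] lo, long[] hi) { this.lo = lo; this.hi = hi; }

    boolean contains(long[] x) {               // x in sdom iff li <= xi <= hi for all i
        for (int i = 0; i < lo.length; i++)
            if (x[i] < lo[i] || x[i] > hi[i]) return false;
        return true;
    }

    long cellCount() {                         // dense model: one cell per grid point
        long n = 1;
        for (int i = 0; i < lo.length; i++) n *= (hi[i] - lo[i] + 1);
        return n;
    }
}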
1.2
RasDaMan
RasDaMan [2] is an array DBMS for MDD developed at FORWISS. In order to achieve finer access granularity, MDD are subdivided into sets of non-overlapping tiles, each of which is stored separately for efficient random access [1,3]. There is no restriction on base types except that all cells within an MDD must have constant size (which rules out e.g. variable-length strings); this allows fast calculation of cell addresses. All data within a cell is stored consecutively in memory which means that the values for the atomic types within a structured type are interleaved in order to achieve maximum speed when most type members are accessed together, which is typical for many queries.
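Because the members of a structured base type are stored interleaved cell by cell, a model layer that wants to compress each atomic member separately first has to regroup the bytes. The following Java sketch shows such a de-interleaving step in the abstract; the per-member byte widths and the flat byte array are assumptions for illustration, not the RasDaMan storage API.

class ChannelSeparation {
    // cells: linearized tile, each cell = its members laid out back to back;
    // memberSize: byte width of each atomic member of the structured type.
    static byte[][] deinterleave(byte[] cells, int[] memberSize) {
        int cellSize = 0;
        for (int s : memberSize) cellSize += s;
        int nCells = cells.length / cellSize;           // assumes an exact multiple
        byte[][] channels = new byte[memberSize.length][];
        int offset = 0;
        for (int m = 0; m < memberSize.length; m++) {
            channels[m] = new byte[nCells * memberSize[m]];
            for (int c = 0; c < nCells; c++)
                System.arraycopy(cells, c * cellSize + offset,
                                 channels[m], c * memberSize[m], memberSize[m]);
            offset += memberSize[m];
        }
        return channels;                                // one byte stream per member
    }
}

Whether this pays off depends on the data, as discussed in the requirements below: the channels of a multi-band image often compress better separately, while cells whose members carry identical values may compress better interleaved.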
1.3
Requirements for MDD Compression
There is a wealth of compression techniques for special data like images, audio and video data, which exploit intrinsic properties of these data types, a nice overview of which can be found in [7]. Images for instance are 2D MDD over a small number of possible base types, by far the most popular of which are 8bit (greyscale) and 24bit (true colour). Standard compression algorithms for generic MDD are lacking, however. There are a number of compression methods in common use today like the LZ series developed by Lempel and Ziv [9,10] which are generic in the sense that they interpret all data as an unstructured bytestream, but these naturally fail to exploit some of the structure inherent in MDD objects, such as – correlations between neighbouring cells which will be lost at least in part if the multidimensional data is simply linearized into a bytestream as required for those compression methods. These correlations will not necessarily exist for all types of data, but many kinds of MDD like tomograms or the results of numerical simulations have a spatial and/or temporal interpretation which implies the presence of such correlations; – the semantics of data belonging to structured base types. A satellite image with 9 spectral channels will often compress better if the values of each channel are compressed separately rather than all of them interleaved. This is even more true when the structured type consists of different subtypes, e.g. a mixture of char, short and int. Please note that the performance of a compression method always depends heavily on the kind of data it is applied to; for instance if the values of a structured base type are the same within each cell, compressing the values for each atomic type separately will often compress worse than the interleaved approach. The standard approach is to separate the compression engine into a model layer at the top which transforms the data depending on its type into a format which allows more efficient compression, and a compression layer at the bottom which consists of traditional symbol-stream oriented compression techniques. The model layer for an image and an OLAP data cube will be completely different, but the compression layer is identical. There is also the important distinction to make between lossless and lossy compression modes, which will be resolved in the model layer as well. Although lossless mode is desirable especially in a DBMS, the entropy sets a hard limit on the maximum achievable rate; on the other hand in some situations the loss of some accuracy may well be worth the additional storage savings achieved by lossy compression. Another aspect of the desired compression engine is that it should be usable for transfer compression as well, which means that it must be available on both the database server and the client and therefore machine-independent. Transfer compression can greatly reduce transfer times; when taking the compression / decompression overhead into account this is weakened somewhat in that only fast (and typically simple) compression techniques will perform well for transfer
compression unless the bandwidth between client and server is very low. In contrast, when storing data the reduction in size is usually more important than the computational overhead involved. For the compression engine this means that it must support a variety of compression methods with different time / compression rate properties. All of the above indicates that a compression engine for multidimensional raster data should not be limited to one technique but rather be a modular collection of different approaches with individual strengths and weaknesses, where specific ones can be chosen according to the user’s requirements.
2 Compression Engine Architecture
All compression functionality was isolated in a new module which is present on both client and server, for maximum flexibility and efficient transfer compression. Modularity was a major factor when designing the compression engine, because there is no universal compression technique and therefore several methods have to be provided through a common interface. Because of RasDaMan's tile-based architecture it is natural to make compression tile-based as well. Each tile is compressed separately and the compression methods can differ between tiles, i.e. a compression method is a property of a tile, not of an MDD (see Fig. 2). The engine consists of two fundamental layers:

– the bottom layer corresponds to traditional stream-oriented compression techniques and is not aware of any MDD semantics but merely operates on a linear stream of symbols; therefore this layer can also be used outside of the tile context. An interface to this layer is provided by the class lincodecstream; it is described in more detail in Sec. 2.1;

– the top (= model) layer is MDD-aware and can therefore exploit characteristics like nD smoothness or base type structure. The interface for this layer is provided by the abstract class tilecompression, which is at the root of an extensive class hierarchy. This layer can only operate on tiles and does not perform any compression itself, but merely transforms the data according to a data model before passing it on to an object of the bottom layer for actual compression. See Sec. 2.2 for a detailed overview of this layer.
2.1 The Lincodecstream Hierarchy
Figure 1 shows a skeleton of the bottom layer's class hierarchy. All classes derive from a common ancestor class linstream but are then split into separate class hierarchies for compression and decompression, because these operations usually work very differently internally; e.g. in a compression operation the final size of the compressed data is unknown, whereas during decompression the size of both compressed and uncompressed data is usually known. When easy access to matching compression and decompression objects is required, this can be achieved via the lincodecstream class that merges both branches.

Fig. 1. Outline of the lincodecstream hierarchy in UML notation

There are currently 3
compression streams available, which have very different weights on complexity and compression factors:

None: no compression, the input stream is just copied.

RLE: an RLE algorithm based on the PackBits approach described in [13]. RLE algorithms decompose a data stream into tuples (value, count) and thus provide fast and simple compression of sequences of identical symbols, which makes them attractive for both sparse data and transfer compression (the latter due to low complexity). PackBits was chosen because of its smart encoding of these tuples, with a worst case data expansion of 1/128th (in contrast to a primitive encoding, which would double the size). The algorithm was further extended to operate on base types of size 1, 2, 4 and 8 to allow it to compress all atomic types as efficiently as possible.¹

ZLib: the well known, free standard compression library ZLib [12] was chosen because of its excellent compression properties on arbitrary binary data. ZLib compression is based on the LZ77 dictionary compression technique [9], which is far superior to RLE in terms of achievable compression, but also of considerably higher complexity.

¹ If the algorithm worked only on bytes it would perform very badly when compressing data from larger base types: the sequence <1,2,2,2,2> over the type short corresponds to the byte sequence <0,1,0,2,0,2,0,2,0,2> on a little endian machine, which obviously wouldn't compress in RLE.
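The PackBits-style coding described above can be illustrated with a minimal sketch of an encoder generalized to symbols of 1, 2, 4 or 8 bytes; the header convention shown is the classic PackBits one, and the generalization is an assumption made for illustration, not RasDaMan's actual rlecompstream implementation.

    def rle_encode(data: bytes, width: int = 1) -> bytes:
        """PackBits-style RLE over symbols of `width` bytes (1, 2, 4 or 8).

        Header byte h: 0..127  -> the next h+1 symbols are literals;
                       129..255 -> the next symbol is repeated 257-h times.
        Worst case: one extra header byte per 128 literal symbols.
        """
        syms = [data[i:i + width] for i in range(0, len(data), width)]
        out = bytearray()
        i = 0
        while i < len(syms):
            run = 1
            while i + run < len(syms) and syms[i + run] == syms[i] and run < 128:
                run += 1
            if run >= 2:                      # repeated run
                out.append(257 - run)
                out += syms[i]
                i += run
            else:                             # literal run up to the next repetition
                j = i + 1
                while (j < len(syms) and j - i < 128 and
                       not (j + 1 < len(syms) and syms[j] == syms[j + 1])):
                    j += 1
                out.append(j - i - 1)
                out += b"".join(syms[i:j])
                i = j
        return bytes(out)

    # the footnote's example: <1,2,2,2,2> as 16-bit little endian values
    sample = bytes([1, 0, 2, 0, 2, 0, 2, 0, 2, 0])
    print(len(rle_encode(sample, width=1)), len(rle_encode(sample, width=2)))

Run on the footnote's example, the same data expands slightly with width=1 but compresses with width=2, which is exactly the motivation for the multi-byte extension.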
Furthermore, due to their streaming properties, lincodecstreams can be nested to any depth as a filter sequence of arbitrary length; they could even be composed dynamically on-the-fly.
2.2 The Tilecompression Hierarchy
The tilecompression layer is MDD-aware, i.e. it knows the base type and spatial domain of the data and can use these to achieve better compression.
The tilecompression layer itself only performs transformations of the data and is coupled with classes of the lincodecstream layer for actual compression. The tilecompression class hierarchy consists of the following major branches (see Fig. 2):
Fig. 2. Outline of the tilecompression hierarchy in UML notation
tilecompnone: no compression. This class is optimized for fast handling of uncompressed data and avoids copying operations that would have been necessary if it had been merged with another class of this hierarchy.

tilecompstream: the simplest variant of tile compression, which does not use any MDD properties but merely reads/writes the linearized tile data from/to a lindecompstream/lincompstream. The (de)compression stream objects are provided by derived classes via get_compressor() / get_decompressor() methods; everything else is done in tilecompstream scope.

tilesepstream: a more sophisticated compression variant which uses the semantics of structured base types and compresses the values belonging to each atomic type separately. Thus an RGB image – which is a 2D MDD over the structured base type struct {char red, char green, char blue} – would be compressed by processing the red, green and blue values separately and concatenating the output. Like tilecompstream, the compression is based on lincodecstreams and uses the same interface to retrieve (de)compression objects from its derived classes. When applied to a tile over an atomic base type, tilecompstream and tilesepstream are equivalent.

waveletcompression: this is the most advanced compression algorithm currently supported by RasDaMan, which uses both base type semantics and
spatial correlations. Section 2.3 deals with the architecture of this subclass in more detail.

Tile compression is integrated in RasDaMan's two communication layers: the one to the base DBMS for storage and retrieval of compressed tiles, and the one to client applications for transfer compression. One issue that hasn't been addressed yet is the specification of parameters for the various kinds of compression, but due to space restrictions this can't be covered in detail in this paper. A parameter string of the form [key=value[,key=value]*] is used for this purpose, which allows integer, floating point and string parameters. We provided the possibility to use parameters for decompression as well, but the policy is that the compression algorithm must store all parameters necessary for decoding the tile along with the data² so the decompressor can always return the correct result; thus decompression parameters are intended for optimization only.

² e.g. subband levels and quantization steps when encoding wavelet coefficients.
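The channel separation performed by tilesepstream, as described above, can be sketched as follows; the function, its interface and the use of zlib as the linear codec are illustrative assumptions, not the actual RasDaMan classes.

    import zlib

    def compress_separated(cells: bytes, member_sizes, codec=zlib.compress):
        """De-interleave a tile of structured cells (tilesepstream-style) and
        compress the values of each atomic member separately.

        member_sizes: byte size of each member of the structured base type,
        e.g. [1, 1, 1] for struct {char red, char green, char blue}.
        """
        cell_size = sum(member_sizes)
        assert len(cells) % cell_size == 0
        parts, offset = [], 0
        for size in member_sizes:
            # gather this member's bytes from every cell, then compress the stream
            stream = b"".join(cells[base + offset: base + offset + size]
                              for base in range(0, len(cells), cell_size))
            parts.append(codec(stream))
            offset += size
        # a real implementation would also record the compressed length of each part
        return b"".join(parts)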
2.3 Wavelet Compression
For the last couple of years, wavelets have received a huge amount of attention, especially in the area of data compression [19,20,18], including the upcoming JPEG2000 standard [15], where they're used to transform a digital input signal into a representation better suited for compression; for example a sine wave has a rather complex spatial representation but a very simple one in the frequency domain. Wavelets are base functions with special properties such as (ideally) compact support; scaled and translated versions of the so-called mother wavelet are used to approximate functions at different resolution levels. What makes wavelets particularly attractive for lossy compression is that the transformed data can often be made very sparse and thus highly compressible without a noticeable degradation in the quality of the reconstructed signal [7]. In the discrete case, a 1D wavelet transform is performed by folding the input signal with matching (wavelet-specific) low-pass and high-pass filters to obtain average and detail coefficients, which can be used to reconstruct the original data during synthesis. Wavelet theory is a highly complex mathematical field far beyond the scope of this paper; the interested reader is referred to [16,17] for the profound theoretical background.

Wavelets in RasDaMan: Using wavelets for MDD compression requires the extension of the wavelet concept (typically given for the 1D case) to an arbitrary number of dimensions. The usual approach chosen in wavelet compression of images [18,19] is to perform 1D wavelet transformations first on rows, storing the average coefficients in the first and the detail coefficients in the second half of the rows, then doing 1D wavelet transformations on the columns of this transformed data, storing average and detail coefficients likewise. This partitions the original image into 4 areas: one where both transformations stored their averages (cc),
one for the two possible mixtures of averages and details (cd, dc), and one for pure detail information (dd). This technique is then recursively applied to the cc region (see Fig. 3); the inverse procedure simply decodes columns, then rows, using 1D wavelet synthesis, starting at the coarsest resolution and doing the inverse recursion to finer resolutions.
Fig. 3. Multiresolution wavelet decomposition in image compression
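A minimal sketch of one level of this row/column decomposition, using a plain averaging/differencing step as the 1D transform; the helper names and the floating point arithmetic are assumptions for illustration only.

    def haar_1d(values):
        """One 1D analysis step: averages in the first half, details in the second."""
        half = len(values) // 2
        avg = [(values[2 * i] + values[2 * i + 1]) / 2.0 for i in range(half)]
        det = [(values[2 * i] - values[2 * i + 1]) / 2.0 for i in range(half)]
        return avg + det

    def decompose_2d(image, levels=1):
        """image: list of rows with even dimensions. After one level the four
        quadrants hold the cc, cd, dc and dd regions; further levels recurse
        on the cc quadrant only, as described in the text."""
        rows = [haar_1d(row) for row in image]               # transform along rows
        cols = [haar_1d(list(col)) for col in zip(*rows)]    # then along columns
        out = [list(row) for row in zip(*cols)]              # transpose back
        if levels > 1:
            h, w = len(out) // 2, len(out[0]) // 2
            cc = decompose_2d([row[:w] for row in out[:h]], levels - 1)
            for i in range(h):
                out[i][:w] = cc[i]
        return out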
It is natural to extend this scheme to an arbitrary number of dimensions D, performing 1D decompositions of the MDD along dimensions 1, ..., D when encoding and 1D synthesis along dimensions D, ..., 1 when decoding. On each scale, this leads to 2^D regions r_i ∈ {c,d}^D, where we recursively apply the same transformation to the region {c}^D. Values belonging to an atomic type within a structured type are transformed separately.

When the values of an atomic type have been wavelet-coded, all of the above mentioned regions at all hierarchic levels are iterated over, quantized, and encoded with a linear stream. The iteration is done by a class banditerator which assigns the regions to bands and steps over the data band-wise. Currently the following band iterators are available:

isolevel: each hierarchical level forms a band;
isodetail: all regions (in all hierarchical levels) with the same number of d in their identifiers form a band (e.g. cd* and dc* in Fig. 3);
leveldetail: all regions within a hierarchical level and with the same number of d in their identifiers form a band (e.g. cd0 and dc0 in Fig. 3).

The wavelet class itself is modular and consists of an abstract base class performing all the wrapper code, like iterating over all atomic types in a structured base type. Its two abstract child classes twaveletcomp and qwaveletcomp implement lossless (transform only) and lossy (quantizing) wavelet compression. The only thing derived classes have to do is the actual wavelet analysis and synthesis. Due to space restrictions we're unable to present the architecture of the wavelet engine or the quantization in more depth here, but give only a short overview of the following wavelet types:
Haar Wavelets: These are the oldest and most widely used wavelets, which is mostly due to their simplicity; they are also used (implicitly) in the S-transform [18,20]. Like all wavelet filter coefficients, those of the Haar wavelet are irrational numbers, but an equivalent transformation can be done in pure integer arithmetic, where a pair (x_{2i}, x_{2i+1}) of input values is transformed into their arithmetic average a_i = (x_{2i} + x_{2i+1})/2 and the difference d_i = (x_{2i} − x_{2i+1})/2 during analysis, and reconstructed using x_{2i} = a_i + d_i and x_{2i+1} = a_i − d_i during synthesis. Using some integer optimizations, we can make sure that the exact encoded data has the same type and number of bits as the original data, which is not normally possible when using wavelets and makes Haar wavelets an attractive transformation technique for lossless compression of integer types. For floating point types, loss in the general area of the machine precision can't be avoided [19].

Generic Wavelet Filters: In the general case, a wavelet filter consists of real numbers which can only be approximated by floating point numbers on a computer system; folding a signal with such a filter naturally results in floating point coefficients too, in contrast to the Haar wavelets in Sec. 2.3. Uniting both concepts under one wavelet architecture was an important extension of the system. All wavelet filters currently supported are orthonormal with compact support (see Fig. 2). Currently there exists a separate class for Daubechies 4 wavelets [16] and an abstract class orthowavelet which can operate with arbitrary orthonormal wavelet filters of even length. The actual filter is initialized in the child class orthofactory, which currently supports 20 filter types with lengths between 6 and 30 coefficients. The main reason for creating a special class for Daubechies 4 was that the generic approach is less efficient the shorter the filter is. Biorthogonal wavelets as used in the upcoming JPEG2000 standard [15] could simply be added as a sibling class to orthowavelet.
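The integer optimizations themselves are not spelled out above; one common lossless integer realization of this averaging/differencing step is the S-transform mentioned in the Haar paragraph, sketched here as an assumption rather than RasDaMan's exact variant.

    def s_analysis(x0: int, x1: int):
        """Lossless integer Haar step (S-transform): floor average and difference."""
        a = (x0 + x1) // 2      # floor of the arithmetic average
        d = x0 - x1             # difference
        return a, d

    def s_synthesis(a: int, d: int):
        """Exact integer reconstruction of the original pair."""
        x0 = a + (d + 1) // 2
        x1 = x0 - d
        return x0, x1

    # the round trip is exact for any integer pair
    assert s_synthesis(*s_analysis(7, -3)) == (7, -3)
    assert s_synthesis(*s_analysis(-3, 0)) == (-3, 0)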
3 Results
Although the engine isn't finished yet, it can already be used for data compression in RasDaMan. In this section we present some preliminary results for several kinds of MDD typically stored in RasDaMan. The results are preliminary in that additional work is required for predictors and more efficient wavelet quantization, which is part of the future work. At the moment, wavelet coefficients are quantized with a user-defined number of bits per band, which unfortunately doesn't allow very high compression rates with acceptable distortion; still, careful bit assignment can already yield good results in many cases. In the following examples, RLE and ZLib are lossless compression, whereas Daub10 is lossy compression using the Daubechies 10-tap wavelet filter, which can be found e.g. in [7], and a zlib stream for compressing the quantized coefficients. Compression factors denote the size of the compressed data relative to the uncompressed data size; for Daub10, the best compression factors that could be found using the current quantization approach without noticeable degradation in quality are
given. These should improve substantially once zerotree quantization is available; a detailed rate-distortion analysis will follow then. Here are the results for some 2D and 3D MDD over integer and floating point base types:

              sdom              type    size [kB]   RLE        ZLib       Daub10
    tomo      256 × 256 × 154   char    9856        28.3589%   21.9568%   14.7853%
    temp      15 × 32 × 64      float   120         100%       73.0924%   12.8507%
    painting  727 × 494         char    351         98.2427%   75.7984%   24.4352%
    tempsec   256 × 512         float   512         98.7696%   88.5839%   11.4721%

Even with the current suboptimal quantization, the wavelet engine can often compress data considerably better than the lossless approaches while keeping distortion low. Most noticeably, Daub10 performs particularly well on floating point data, where lossless compression achieves little to no gain. This is an important result for scientific applications, which usually operate on floating point fields.
4 Conclusions and Future Work
We presented the design of a compression engine for generic MDD which has been implemented in the RasDaMan DBMS. This kind of data differs completely from traditional table data typically stored in relational DBMSs, where mostly light-weight compression seems feasible due to the small access granularity [8]. In contrast, MDD share many of the properties typically found in digital images, can become very large and have considerably coarser access granularity than table data; therefore the use of sophisticated compression techniques, based mostly on work done in image compression, is the most promising approach. The design of the compression engine allows using many different, extensible techniques side by side, so the best method can be chosen depending on the kind of MDD used and to cater for different future requirements.

There are two major issues left in the compression engine: one is more efficient wavelet quantization using a generalized zerotree [21], the other is predictors. Predictors approximate cell values from the values of neighbouring cells and store the difference between the actual and the predicted value. This typically results in a reduction of the value range, which in turn improves the compression performance. On one hand, there are inter-channel predictors, where the values of a channel at a given position x are approximated by the values of the other channels at the same position; this is essentially what the RGB → YUV transformation does for colour images. Then there are intra-channel predictors, which use values at neighbouring positions within the same channel to predict the value at position x. There may even be some potential for uniting both predictor types into one.

After these extensions, another important work package will be performing exhaustive tests on the effect of the available compression techniques on various types of MDD, be it in the compression rates achieved or the impact on total execution times. In that context, the addition of a cost model for transfer compression to the query optimizer comes to mind as well.
References

1. P. Furtado, P. Baumann: Storage of Multidimensional Arrays Based on Arbitrary Tiling. In Proc. of the International Conference on Data Engineering (ICDE), Sydney, Australia, 1999.
2. P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, N. Widmann: Spatio-Temporal Retrieval with RasDaMan (system demonstration). Proc. Very Large Data Bases (VLDB), Edinburgh, 1999, pp. 746-749.
3. P. Furtado: Storage Management of Multidimensional Arrays in Database Management Systems. PhD thesis, Technical University of Munich, 1999.
4. R. Ritsch: Optimization and Evaluation of Array Queries in Database Management Systems. PhD thesis, Technical University of Munich, 1999.
5. P. Baumann: A Database Array Algebra for Spatio-Temporal Data and Beyond. In Proc. Fourth International Workshop on Next Generation Information Technologies and Systems (NGITS '99), Zikhron Yaakov, Israel, July 5-7, 1999, LNCS 1649, Springer.
6. The International Organisation for Standardization (ISO): Database Language SQL. ISO 9075, 1992(E).
7. K. Sayood: Introduction to Data Compression. Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1996.
8. T. Westmann, D. Kossmann, S. Helmer, G. Moerkotte: The Implementation and Performance of Compressed Databases. Reihe Informatik 3/1998.
9. J. Ziv, A. Lempel: A Universal Algorithm for Data Compression. IEEE Transactions on Information Theory, IT-23(3), 1977.
10. J. Ziv, A. Lempel: Compression of Individual Sequences via Variable-Rate Coding. IEEE Transactions on Information Theory, IT-24(5), 1978.
11. R. Cattell: The Object Database Standard: ODMG 2.0. Morgan Kaufmann Publishers, San Mateo, California, USA, 1997.
12. ZLib homepage: http://www.info-zip.org/pub/infozip/zlib/
13. TIFF Revision 6.0 Specification, p. 42; Aldus Corporation, Seattle, 1992.
14. G.K. Wallace: The JPEG Still Picture Compression Standard. Communications of the ACM, No. 4, Vol. 34, Apr. 1991.
15. A.N. Skodras, C.A. Christopoulos, T. Ebrahimi: JPEG2000: The Upcoming Still Image Compression Standard. Proceedings of the 11th Portuguese Conference on Pattern Recognition (RECPA), 2000.
16. I. Daubechies: Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, CBMS 61, 1992.
17. C.K. Chui: An Introduction to Wavelets. Academic Press Inc., 1992.
18. N. Strobel, S.K. Mitra, B.S. Manjunath: Progressive-Resolution Transmission and Lossless Compression of Color Images for Digital Image Libraries. Image Processing Laboratory, University of California, Santa Barbara, CA 93106.
19. A. Trott, R. Moorhead, J. McGinley: Wavelets Applied to Lossless Compression and Progressive Transmission of Floating Point Data in 3-D Curvilinear Grids. IEEE Visualization, 1996.
20. A. Said, W.A. Pearlman: An Image Multiresolution Representation for Lossless and Lossy Compression. SPIE Symposium on Visual Communications and Image Processing, Cambridge, MA, 1993.
21. J. Shapiro: Embedded Image Coding Using Zerotrees of Wavelet Coefficients. IEEE Transactions on Signal Processing, Vol. 41, 1993.
M. Liu, S. Katragadda: DrawCAD: Using Deductive Object-Relational Databases in CAD.
H.C. Mayr et al. (Eds.): DEXA 2001, LNCS 2113, pp. 481–490, 2001.
© Springer-Verlag Berlin Heidelberg 2001

A Statistical Approach to the Discovery of Ephemeral Associations*

M. Montes-y-Gómez, A. Gelbukh, and A. López-López
Instituto Nacional de Astrofísica, Optica y Electrónica (INAOE), Puebla, Mexico. [email protected]

Abstract. News reports are an important source of information about society. Their analysis allows understanding its current interests and measuring the social importance and influence of different events. In this paper, we use the analysis of news as a means to explore the interests of society. We focus on the study of a very common phenomenon of news: the influence of the peak news topics on other current news topics. We propose a simple, statistical text mining method to analyze such influences. We differentiate between the observable associations (those discovered from the newspapers) and the real-world associations, and propose a technique by which the real ones can be inferred from the observable ones. We illustrate the method with some results obtained from preliminary experiments and argue that the discovery of the ephemeral associations can be translated into knowledge about the interests of society and social behavior.
1 Introduction
The problem of analysis of large amounts of information has been solved to a good degree for the case of information that has a fixed structure, such as databases with fields having no complex structure of their own. The methods for the analysis of large databases and the discovery of new knowledge from them are called data mining (Fayyad et al., 1996; Han and Kamber, 2001). However, this problem remains unsolved for non-structured information such as unrestricted natural language texts.

Text mining has emerged as a new area of text processing that attempts to fill this gap (Feldman, 1999; Mladenić, 2000). It can be defined as data mining applied to textual data, i.e., as the discovery of new facts and world knowledge from large collections of texts that, unlike those considered in the problem of natural language understanding, do not explicitly contain the knowledge to be discovered (Hearst, 1999). Naturally, the goals of text mining are similar to those of data mining: for instance, it also attempts to uncover trends, discover associations, and detect deviations in a large collection of texts.

In this paper, we focus on the analysis of a collection of news reports appearing in newspapers, newswires, or other mass media. The analysis of news collections is an interesting challenge since news reports have many characteristics different from the texts in other domains. For instance, the news topics have a high correlation with society interests and behavior, they are very diverse and constantly changing, and they interact with, and influence, each other. Some previous methods consider the trend analysis of news (Montes-y-Gómez et al., 1999), the detection of new events on a news stream (Allan et al., 1998), and the classification of bad and good news (García-Menier, 1998).

Here, we focus on the analysis of a very common phenomenon of news: the influence of the peak news topics over other current news topics. We define a peak news topic as a topic with a one-time short-term peak of frequency of occurrence, i.e., such that its importance sharply rises within a short period and very soon disappears. For instance, the visit of Pope John Paul II to Mexico City became a frequent topic in Mexican newspapers when the Pope arrived in Mexico and disappeared from the newspapers in a few days, as soon as he left the country; thus this is a peak topic. Usually, these topics influence the other news topics in two main ways: a news topic induces other topics to emerge or become important along with it, or it causes momentary oblivion of other topics. The method we propose analyzes the news over a fixed time span and discovers just this kind of influences, which we call ephemeral associations. Basically, this method uses simple statistical representations for the news reports (frequencies and probability distributions) and simple statistical measures (the correlation coefficient) for the analysis and discovery of the ephemeral associations between news topics (Glymour et al., 1997). Additionally, we differentiate between the observable ephemeral associations, those immediately measured by the analysis of the newspapers, and the real-world associations. In our model, the real-world associations in some cases can be inferred from the observable ones, i.e., for some observable associations the possibility of being a real-world one is estimated.

The rest of the paper is organized as follows. Section 2 defines ephemeral associations and describes the method for their detection. Section 3 introduces the distinction between the observable and the real-world associations and describes the general algorithm for the discovery of the real-world associations. Section 4 presents some experimental results. Finally, Section 5 discusses some conclusions.

* Work done under partial support of CONACyT, CGEPI-IPN, and SNI, Mexico.
2 Discovery of Ephemeral Associations

A common phenomenon in news is the influence of a peak topic, i.e., a topic with a one-time short-term peak of frequency, over the other news topics. This influence shows itself in two different forms: the peak topic induces some topics to emerge or become important along with it, and others to be momentarily forgotten. These kinds of influences (time relations) are what we call ephemeral associations.¹ An ephemeral association can be viewed as a direct or inverse relation between the probability distributions of the given topics over a fixed time span.
¹ This kind of associations is different from associations of the form X → Y, because they not only indicate the co-existence or concurrence of two topics or a set of topics (Ahonen-Myka, 1999; Rajman & Besançon, 1998; Feldman & Hirsh, 1996), but mainly indicate how these news topics are related over a fixed time span.
Fig. 1. Ephemeral associations between news topics
Figure 1 illustrates these ideas and shows an inverse and a direct ephemeral association occurring between two news topics. A direct ephemeral association indicates that the peak topic probably caused the momentary arising of the other topic, while an inverse ephemeral association suggests that the peak topic probably produced the momentary oblivion of the other news topic. Thus, given a peak topic and the surrounding data, we detect the ephemeral associations in two steps:

1. Construction of the probability distribution for each news topic over the time span around the peak topic.
2. Detection of the ephemeral associations in the observed data set (if any).

These steps are described in the next two subsections.
2.1 Construction of the Probability Distributions
Given a collection of news reports corresponding to the time span of interest, i.e., the period around the existence of the peak topic, we construct a structured representation of each news report, which in our case is a list of keywords or topics. In our experiments we used a method similar to the one proposed by Gay and Croft (1990), where the topics are related to noun strings. We apply a set of heuristic rules specific to Spanish and based on the proximity of words that allow identifying and extracting phrases. These rules are guided by the occurrence of articles and sometimes by the occurrence of the prepositions de or del (of in English) along with nouns or proper nouns. For instance, given the following paragraph, the highlighted words are selected as keywords.

"La demanda de acción de inconstitucionalidad tiene como argumentos una serie de violaciones que el Congreso de Yucatán incurrió porque, de acuerdo con el PRD, hizo modificaciones a la ley electoral 90 días antes de que se lleven a cabo los comicios en ese Estado de la República".

Once this procedure is done, a frequency f_k^i can be assigned to each news topic. The frequency f_k^i is calculated as the number of news reports for the day i that mention the topic k. It is more convenient, however, to describe each news topic k by a
probability distribution D_k = {p_k^i} over the days i, where for a given day i, p_k^i expresses the probability for a news topic randomly chosen from the reports of that day to be the topic k:²

    p_k^i = f_k^i / Σ_{j ∈ Topics} f_j^i                                  (1)

We will call the values p_k^i relative probabilities. A probability p_k^i = 0 indicates that the topic k was not mentioned on the day i. The relative probabilities p_k^i are advantageous for the discovery of the ephemeral associations mainly because they maintain a normalization effect over the news topics: for any day i,

    Σ_{k ∈ Topics} p_k^i = 1
This condition holds for the whole period of interest and means that an increase of the relative probability of one news topic is always compensated by a decrease of the probability of some other topics, and vice versa.
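A minimal sketch of formula (1); the input layout (a mapping from topic to its list of daily frequencies) and the function name are assumptions made for the example.

    def relative_probabilities(freqs):
        """freqs: dict topic -> list of daily frequencies f_k^i (same length for
        every topic). Returns the relative probabilities p_k^i of formula (1)."""
        days = len(next(iter(freqs.values())))
        totals = [sum(f[i] for f in freqs.values()) for i in range(days)]
        return {k: [f[i] / totals[i] if totals[i] else 0.0 for i in range(days)]
                for k, f in freqs.items()}

    # for every day, the p_k^i of all topics sum to 1 (whenever any topic was mentioned)
    p = relative_probabilities({"pope": [1, 5, 2], "salinas": [3, 0, 2]})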
2.2 Detection of Ephemeral Associations
The ephemeral associations express inverse or direct relations between the peak topic and some other current news topics (see Figure 1). Let the peak topic be, say, the topic k = 0 and the other one we are interested in be, say, the topic k = 1. The statistical method we use to detect the observable associations is based on the correlation measure r between the topics k = 0 and k = 1 (Freund and Walpole, 1990), defined as:

    r = S_01 / √(S_00 · S_11),   where   S_kl = Σ_{i=1}^{m} p_k^i p_l^i − (1/m) (Σ_{i=1}^{m} p_k^i) (Σ_{i=1}^{m} p_l^i),   k, l = 0, 1.      (2)
Here, p_k^i are defined in the previous section and m is the number of days of the period of interest. The correlation coefficient r measures how well the two news topics are related to each other.³ Its values are between −1 and 1, where −1 indicates that there exists an exact inverse relation between the two news topics, 1 indicates the existence of an exact direct relation between the news topics, and 0 the absence of any relation at all. Therefore, if the correlation coefficient between the peak topic and some other news topic is greater than a user-specified threshold u (i.e., r > u), then there exists a direct ephemeral association between them. On the other hand, if the correlation coefficient is less than the threshold −u (i.e., r < −u), then there exists an inverse ephemeral association between the two topics.

² This roughly corresponds to the percentage of the space the newspapers devoted to the topic k on the day i.
³ The usual interpretation of the correlation coefficient is the following: 100·r² is the percentage of the variation in the values of one of the variables that can be explained by the relation with the other variable.

There are two reasons for introducing the user-specified threshold u. First, it softens the criterion so that we can approximate the way a human visually detects the association. Second, it copes with the relatively small data sets in our application: since few data are available (a peak topic persists over few days), random variations of topic frequencies unrelated to the effect in question can greatly affect the value of the correlation coefficient. A typical value recommended for u is around 0.5.
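A small sketch of formula (2) and of the threshold test described above; the function names and the handling of degenerate (constant) series are assumptions for illustration.

    def correlation(p0, p1):
        """Correlation coefficient r of formula (2) between two daily series."""
        m = len(p0)
        def S(a, b):
            return sum(x * y for x, y in zip(a, b)) - sum(a) * sum(b) / m
        s01, s00, s11 = S(p0, p1), S(p0, p0), S(p1, p1)
        return s01 / (s00 * s11) ** 0.5 if s00 > 0 and s11 > 0 else 0.0

    def observable_association(r, u=0.5):
        """Classify an observable ephemeral association with threshold u."""
        if r > u:
            return "direct"
        if r < -u:
            return "inverse"
        return None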
3 Discovery of Real World Associations

Newspapers usually have a fixed size, and the editor has to decide what news to include in the day's issue and what not to include. Thus, the frequencies of the news mentioned in a newspaper do not directly correspond to the number of events that happen in the real world on the same day. In the next subsection, we explain this important difference between the real-world news frequencies and the ones normalized by the fixed size of the newspaper. Then, we show how to estimate whether the observable associations are mainly due to the normalization effect or there is a possible real-world association component.
3.1 The Notion of Real World Associations
Since our ultimate goal is to discover the associations that hold in the real world, it is important to distinguish between two different statistical characteristics of the topics appearing in the newspapers. One characteristic is the real-world frequency: the frequency with which the corresponding news comes from the information agencies, for instance. Another characteristic is the observable frequency, expressed as the pieces of news actually appearing in the newspapers. To illustrate this difference, let us consider two sources of information: say, a journalist working in Colombia and another one working in Salvador. Let the first one send 30 messages each week, and the second one send 30 messages in the first week and 70 messages in the second week. These are the real-world frequencies: 30 and 30 in the first week, and 30 and 70 in the second one (i.e., there was something interesting happening in Salvador in the second week). However, the newspaper has a fixed size and can only publish, say, 10 messages per week. Then it will publish 5 and 5 messages from these correspondents in the first week, but 3 and 7 in the second week. These are the observable frequencies, since this is the only information we have from the newspaper texts. Our further considerations are based on the following two assumptions.

Assumption 1: The newspapers tend to have a constant "size."⁴

Thus, the observable frequencies can be considered normalized, i.e., their sum is a constant, while the real-world ones are not normalized.
⁴ The "size" of a newspaper not only depends on its physical size (for instance, the number of pages) but also on the number of the journalists, the time required for editing, printing, etc.
We assume that these two kinds of frequencies are proportional, with the proportion coefficient being the normalization constant. Thus, we define a real-world ephemeral association as an association that holds between the topics in the real world and not only in the observable (normalized) data, and we consider that an observable ephemeral association is a combination of two sources: a (possible) real-world ephemeral association and the normalization. The normalization effect is always an inverse correlation effect. This means that the increase of probability of the peak topic is always compensated by the decrease of probability of some other topics, and vice versa. Thus, we can conclude that any direct observable ephemeral association is, very probably, a real-world association.

Assumption 2: The peak topic proportionally takes away some space from each current news topic.

First, this assumption implies that the relative proportions among the rest of the news topics do not change if we take away the peak topic and its related topics. Second, no topic completely disappears only as a consequence of the normalization effect.⁵
3.2 Detection of Real World Associations
As we have noted, all direct associations should be considered real-world ones, so we only need to analyze the inverse ones. The idea is to restore the original distribution of the topics by eliminating the normalization effect, and to check if this distribution still correlates with that of the peak topic. Assumption 2 allows us to estimate the probability distribution D'_k = {p'_k^i} of the topic k as it would be without the normalization effect, where the probability p'_k^i expresses the relative probability of occurrence of the topic k on the day i after we take away the peak topic and its related topics. This probability is calculated as follows:

    p'_k^i = f_k^i / Σ_{j ∉ Peak} f_j^i                                   (3)
Here the set Peak consists of the peak topic and its related topics (those with a direct association), while the frequency f_k^i indicates the number of news reports on the day i that mention the topic k. Therefore, an inverse observable association between the peak topic and the news topic k is likely a real-world association if it remains after the normalization effect is eliminated from the topic k. In other words, if the correlation coefficient between the peak distribution and the corrected distribution D'_k is less than the user-specified threshold −u (i.e., r < −u), then the inverse observable ephemeral association is likely a real-world one.
⁵ Usually, newspaper editors design the newspaper format and contents in such a way that they expose all news of the day, even if briefly.
Fig. 2. Analysis of the peak topic "Visit of Pope": probability of the topics Visit of Pope, Virgin of Guadalupe and Raúl Salinas by day, January 20–29.
Concluding, our basic algorithm for the discovery of the real-world ephemeral associations among the news topics of a given period consists of the following steps:
1. Calculate the (observable) probabilities by formula (1);
2. Calculate the (observable) correlations between the peak topic and the other topics by formula (2);
3. Select the topics that strongly correlate with the peak one, using a threshold u ≥ 0.5;
4. Determine which associations are real-world ones:
   a. All direct associations are real-world ones;
   b. For the inverse associations,
      i. Build the corrected distributions by formula (3), using the knowledge obtained at step 3;
      ii. Calculate the (real-world) correlations between the peak topic and the other topics, using formula (2) and the corrected distributions;
      iii. The topics for which this correlation is strong represent the real-world associations.
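The steps above are easy to prototype. The following Python sketch is only illustrative and is not the authors' implementation: it assumes the raw daily frequencies f_ki are available per topic, treats formula (1) simply as normalization of the daily counts, and uses the Pearson correlation coefficient in place of formula (2); the function names are ours.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    # Pearson correlation coefficient (playing the role of formula (2)).
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx, sy = pstdev(xs), pstdev(ys)
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def ephemeral_associations(freqs, peak, u=0.6):
    """freqs: {topic: [daily news counts]}; peak: name of the peak topic.
    Returns (direct, real_world_inverse) topic lists."""
    days = range(len(freqs[peak]))
    totals = [sum(freqs[t][i] for t in freqs) for i in days]
    # Observable (normalized) probabilities, in the spirit of formula (1).
    prob = {t: [freqs[t][i] / totals[i] if totals[i] else 0.0 for i in days]
            for t in freqs}
    direct, inverse = [], []
    for t in freqs:
        if t == peak:
            continue
        r = correlation(prob[peak], prob[t])
        if r >= u:
            direct.append(t)            # step 4a: direct => real-world
        elif r <= -u:
            inverse.append(t)
    # Step 4b: remove the peak and its related topics, renormalize (formula (3)).
    related = set(direct) | {peak}
    real_inverse = []
    for t in inverse:
        rest = [sum(freqs[s][i] for s in freqs if s not in related) for i in days]
        corrected = [freqs[t][i] / rest[i] if rest[i] else 0.0 for i in days]
        if correlation(prob[peak], corrected) <= -u:
            real_inverse.append(t)
    return direct, real_inverse
```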
4 Experimental Results
To test these ideas, we used the Mexican newspaper El Universal.6 We collected the national news for the ten days surrounding the visit of Pope John Paul II to Mexico City, i.e., from January 20 to 29, 1999, and looked for ephemeral associations between this peak topic and the other topics. One of the associations detected with our method (using the threshold u = 0.6) was a direct ephemeral association between the peak topic and the topic Virgin of
6 http://www.el-universal.com.mx
Guadalupe.7 Figure 2 illustrates this association. The correlation coefficient was r = 0.959 for the period between January 23 and 25 (the stay of the Pope in Mexico), and r = 0.719 for the surrounding period between January 20 and 29. Since this association was a direct one, it had a high likelihood of being a real-world one. This means that the topic Virgin of Guadalupe probably emerged because of the influence of the peak topic. Moreover, since this topic was the only one that had a direct association with the peak topic, we deduced that the visit of the Pope was strongly related to the Virgin of Guadalupe (in fact, he focused his discourse on this important Mexican saint). Another interesting discovery was the inverse association between the peak topic and the topic Raúl Salinas (brother of the Mexican ex-president Carlos Salinas de Gortari, sentenced on January 22). Figure 2 also shows this association. The correlation coefficient r = –0.703 between January 22 and 26 (the period covering the visit of the Pope and the sentencing of Raúl Salinas) indicates the existence of an inverse observable ephemeral association. In order to determine the likelihood of this association being a real-world one, we analyzed the normalization effect. First, we built the probability distribution of the topic Raúl Salinas without considering the peak topic and its related topics (the topic Virgin of Guadalupe in this case). The new probability distribution was:
D′Raúl Salinas = {5.88, 5.26, 25, 17.64, 6.25, 9.09, 11.11, 0, 0, 20}

Second, we recomputed the correlation coefficient between the peak topic and the topic Raúl Salinas. The new correlation coefficient r = –0.633 (between January 22 and 26) indicated that it was very likely for this association to be a real-world one. If this is true, then the topic Raúl Salinas went out of the public attention because of the influence of the visit of the Pope to Mexico City. As another example, we examined the peak topic Death of Kennedy Jr. This topic took place between July 18 and 24, 1999. For the analysis of this peak topic, we used the news appearing in the national desk section of the newspaper The New York Times.8 Among our discoveries there were two inverse ephemeral associations: one between the peak topic and the topic Election 2000, with r = –0.68, and the other between the peak topic and the topic Democrats, with r = –0.83. Figure 3 shows these associations. Since these associations were both inverse ones, we analyzed their normalization effect. First, we built their probability distributions without considering the peak topic:
D′Election 2000 = {0, 9.52, 0, 5.26, 0, 0, 0, 2.94, 11.11}
D′Democrats = {11.53, 4.76, 0, 0, 0, 0, 0, 2.94, 7.4}
Then, we recomputed their correlation coefficients. The probability distribution of the topic Democrats did not change (because of the zero probabilities of the topic Democrats during the peak existence).
7 A Mexican saint whose temple the Pope visited.
8 The topics were extracted manually, as opposed to the Spanish examples, which were analyzed automatically.
[Figure: probability (in %) of the topics “Kennedy Jr.”, “Democrats”, and “Election 2000” over the days July 17–25.]
Fig. 3. Analysis of the peak topic “Death of Kennedy Jr.”
Thus, the correlation coefficient was again r = –0.83, and we concluded that this association had a high likelihood of being a real-world one. On the other hand, the new correlation coefficient between the topic Election 2000 and the peak topic, r = –0.534, was not less than the threshold –u (we used u = 0.6); therefore, there was not enough evidence for this association to be a real-world one.
5 Conclusions
We have analyzed a very frequent phenomenon in real-life situations: the influence of a peak news topic on the other news topics. We have described a method for the discovery of this type of influence, which we explain as a kind of association between the two news topics and call ephemeral associations. The ephemeral associations extend the concept of typical associations because they reveal not only coexistence relations between the topics but also their temporal relations. We distinguish between two types of ephemeral associations: the observable ephemeral associations, those discovered directly from the newspapers, and the real-world associations. We have proposed a technique with which the observable associations are detected by simple statistical methods (such as the correlation coefficient) and the real-world associations are heuristically estimated from the observable ones. For the sources that do not have any fixed size, such as newswires, the observed frequencies of the news reports correspond to the real-world ones. For such sources, the method discussed in this paper does not make sense. An easier way to discover the same associations in this case is not to normalize the frequencies in formula (1), using p_ki = f_ki instead, and then to apply formula (2). However, if it is not clear or not known whether the news source presents the normalization problem, then the method presented here can be applied indiscriminately. This is because, in the absence of a normalization effect, our method gives equally correct results, though with more calculations.
As future work, we plan to test these ideas and criteria under different situations and to use them to detect special circumstances (favorable scenarios and difficult conditions) that make the discovery process more robust and precise. Basically, we plan to experiment with multiple sources and to analyze the way their information can be combined in order to increase the precision of the results. Finally, it is important to point out that the discovery of this kind of association, the ephemeral associations among news topics, helps to interpret the interests of society and to discover hidden information about the relationships between the events in social life.
Improving Integrity Constraint Enforcement by Extended Rules and Dependency Graphs

Steffen Jurk1 and Mira Balaban2
1 Cottbus Technical University of Brandenburg, Dept. of Databases and Information Systems, P.O.B. 101344, 03044 Cottbus, Germany, [email protected]
2 Ben-Gurion University, Dept. of Information Systems Engineering, P.O.B. 653, Beer-Sheva 84105, Israel, [email protected]
Abstract. Integrity enforcement (IE) is important in all areas of information processing – DBs, web-based systems, e-commerce. Besides checking and enforcing consistency for given data modifications, approaches to IE have to cope with termination control, repair mechanisms, effect preservation and efficiency. However, existing approaches handle these problems in many different ways. Often the generation of repairs is too complex, termination of repairs is specified imprecisely, and effect preservation is insufficient. In this work we propose to extend integrity constraints by termination bounds and to represent the enforcement task by dependency graphs (DG), which allow efficient pre-processing without costly run-time evaluation of constraints. Further, we present an optimization technique based on serializing DGs and a history approach for effect preservation. Our main contribution is a uniform framework that considers all relevant criteria for integrity enforcement and shows how termination control, effect preservation and efficiency can be designed for use within modern database management systems.
1 Introduction
Integrity enforcement is important in all areas of information processing – DBs, web-based systems, e-commerce, etc. Management of large data-intensive systems requires automatic preservation of semantical correctness. Semantical properties are specified by integrity constraints, which can be verified by querying the information base. In dynamic situations, however, where operations can violate necessary properties, the role of integrity enforcement is to guarantee a consistent information base by applying additional repairing actions. Integrity Enforcement (IE) is responsible for the selection and combination of repairing actions, so that consistency for all constraints is achieved. Selection of a repairing action for a constraint depends on a repairing policy. The combination of repairing actions presents the three main problems in integrity enforcement: termination, effect preservation and efficiency. The termination problem
is caused by repairing actions whose applications violate already enforced constraints, thereby leading to non-termination of the overall integrity enforcement process. The effect preservation problem arises when a repairing action achieves consistency by simply undoing the action being repaired. The efficiency problem deals with order optimization of the enforcement of the individual constraints. The optimization is possible since usually the constraints are presented independently of each other, and an optimized ordering of their enforcement might reduce repeated enforcements, or might lead to early detection of essential non-termination. Termination control has been studied by many authors. The most common method is to describe constraints and their repairs as graphs, and to detect cycles that might indicate potential non-termination. Yet, existing methods do not distinguish desirable cyclic activations of repairing actions (finite cyclic activations) from non-terminating ones. Effect preservation is the main motivation for the GCS work of Schewe and Thalheim [16]. The underlying idea is that an update is a change intended by its initiator. For example, the update insert(a, e) intends that in the database state to which the update evaluates, the entity e is an element of the set a. In the context of Logic Programming the effects of updates have been studied by [9]. There, the effect of Prolog predicates depends on their positions in the bodies of clauses and on the ordering of clauses in the program. Hence, the changes resulting from updating primitives depend on the processing order, which might cause different overall effects of an update. In this paper we suggest a general approach for advanced database management that combines ideas from Rule Triggering Systems (RTSs) [18] with compile-time oriented approaches, such as [16]. We suggest a uniform framework that considers the three relevant criteria of termination, effect preservation and optimization. Rather than providing a new approach, we understand our work as a unifying approach where existing methods and ideas can be used and combined. The fundamental means of our work are dependency graphs, which allow efficient pre-processing of database transactions, termination control and optimization. With respect to the selection of repairing actions, we claim that for each constraint, the desirable repairing policy (or policies) should be specified by the database developers. We underline this claim by showing the existence of multiple policies. With respect to termination, we suggest using termination bounds, provided by the database developers, as a powerful means for termination control. Our understanding of effect preservation is that an applied repair should never undo the initiated update. Finally, we show that the order is an important parameter for the efficiency of enforcement, and present a method for optimizing the order towards avoiding unnecessary computational overhead caused by rollback operations. The paper is organized as follows: Section 2 shortly describes related work and motivates our ideas on policies, termination bounds and optimizing the order of enforcement. As a result, we introduce our framework for integrity enforcement in section 3. Section 4 introduces dependency graphs as an essential means of the
framework. Optimization of the order is discussed in section 5 and the history approach for effect preservation is presented in section 6. Section 7 concludes the paper.
2 The Right Task at the Right Time
The task of integrity enforcement has attracted much work in the area of database management, particularly in the deductive database community. A classification of common approaches is given in [6]. In the GCS approach of [16], integrity with effect preservation is enforced by (static) update rewrite (compile-time integrity enforcement). Most approaches handle integrity enforcement at run-time, i.e., following an update application, relying on design-time analysis that is performed by the database developer. In general, hard problems like cycle detection for termination control and the finding of repairing actions are done at design-time ([3]). Nevertheless, there is no uniform overall framework for controlling the relevant problems in integrity enforcement. In this section we study the correlation, in different approaches, between the relevant problems of IE (termination, consistency, effect preservation, optimization) and the level of time (design-, compile-, pre-processing- and run-time). Design-time (DT) covers activities of developers that specify the behavior and properties of a system. Compile-time (CT) activities cover methods that assist developers in verifying the design, e.g., cycle detection within active rules. In pre-processing-time (PT) an update is pre-processed into a new update that guarantees integrity. In a sense, we consider PT as pre-processing without evaluation of constraints and without execution of repairing actions. The run-time (RT) level is reached if a constraint is evaluated or a repair executed, and only an expensive rollback operation can undo already performed changes. Table 1 shows that some related tasks, such as termination control, selection of repairing actions, or optimization, cannot be assigned to a unique level of integrity enforcement. In the next section we try to provide an answer to the question: “At which level should the relevant tasks of IE be handled?”
2.1 Repairing Mechanisms
As table 1 shows, there is no uniform treatment for repair mechanisms. Mayol [7,8] and MEER [1] compute repairs for each constraint at PT and RT. This is possible due to restrictions on the constraint language. MEER, for example, enforces only cardinality constraints of the ER model. The repairing actions are planned at PT, and the actual policy for selection among various alternatives is applied at RT, by interaction with the user. The following example emphasizes the problem of alternative repairs: Example 1. Consider the integrity constraint a ⊆ b and an insert into the set a, which might violate the constraint in case the new value is not a member of the set b. There exist three possible repairs if the insert does not yield a consistent state: (1) rollback the whole transaction, (2) insert the new value into b, (3) only undo the insert. The latter repair only partially undoes the transaction, while the first one rolls back the whole transaction.
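The three alternatives of Example 1 can be made concrete with a small illustrative Python sketch (not part of the paper's framework); the policy names and the set-based model are ours:

```python
def insert_with_policy(a, b, value, policy="propagate"):
    """Insert value into set a under the constraint a ⊆ b (Example 1).
    Returns True if the modification is kept, False if it is (partially) undone."""
    if value in b:
        a.add(value)               # no violation, no repair needed
        return True
    if policy == "propagate":      # repair (2): insert the new value into b as well
        b.add(value)
        a.add(value)
        return True
    if policy in ("rollback", "undo"):
        # Repairs (1) and (3) coincide here because the insert is the only statement;
        # in a larger transaction (1) undoes everything, (3) only this insert.
        return False
    raise ValueError(f"unknown policy: {policy}")

a, b = {1, 2}, {1, 2, 3}
insert_with_policy(a, b, 5)        # a == {1, 2, 5}, b == {1, 2, 3, 5}
```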
Table 1. Classification of a small range of existing approaches.
In sum, generating repairs at PT or RT requires the consideration of different possible policies, and relies on interaction with the user for making an actual selection. The problem with that approach is that typical users cannot answer questions on repair policies. Therefore, the selection of appropriate repairing actions and their policies must be part of the DT activities of application developers and cannot be handled separately at PT or RT.
2.2 Termination Control
Termination control is a crucial requirement for integrity enforcement. The different methods can be categorized as follows: (1) Bounds. Some DBMSs provide time-out mechanisms that reject a transaction when a time bound set on a transaction is exceeded. Another approach is to restrict the maximal number of nested trigger activations. For example, Sybase allows a trigger to cascade to a maximum of 16 levels. (2) Restricted Constraints. In database theory it is known that certain combinations of classes of constraints, e.g., functional and unary inclusion dependencies, guarantee termination. In MEER [1], the termination proof for the suggested repairing updates relies on known results for testing whether a database schema is strongly satisfiable [5]. (3) Cycle Analysis. The most common method is the detection of potential cycles within database schemata or sets of active rules. The work presented in this paper tries to combine the first and the third approaches. An ultimate static method detects exactly all non-terminating applications: (1) it identifies all such cases (completeness); (2) it does not identify a terminating rule activation as non-terminating (soundness). However, in practice,
most methods prefer completeness. That is, they identify all potentially non-terminating rule activations. The strength of a complete method is inversely related to the number of potential non-terminating activations that it detects. In the field of Rule Triggering Systems [19], much effort has been devoted to termination control.1 In these systems, static methods for termination analysis are used at CT. Consider, for example, two methods introduced in [3], where analysis is performed on activation or triggering graphs. Cycles in the graphs indicate potential non-termination. Figure 1 demonstrates two
[Figure: (a) an activation graph and (b) a triggering graph over the rules r1, ..., r6; the triggering graph contains many cycles, the activation graph none.]
Fig. 1. (a) Activation graph and (b) Triggering graph
such graphs that result from the application of these methods to a set of rules r1, · · ·, r6 (consult [3] for details). Clearly, method (b) detects a high number of potential cycles, e.g., r1, r6, r2, r1, · · ·, while method (a) detects no cycle at all. Provided that both methods are complete, method (a) is stronger (it is also sound). That is, for complete methods, stronger methods detect fewer cycles.2 The problem is that syntactic analysis is not sufficiently strong for distinguishing non-terminating rule sequences from the rest. Therefore, developers usually either avoid all detected cycles (thereby losing the “good” ones that do not cause non-termination), or they leave cycle handling to the DBMS. Figure 2 shows desirable cycles (that the designer is aware of), which result from a recursively modeled hierarchy. A rule enforcing the salary constraint would definitely be executed repeatedly along the hierarchy. The problem is how to distinguish terminating cycles from non-terminating ones. In order to avoid strict cycle rejection we introduce, for every rule or constraint, a bound that restricts the number of executions within a cycle. For example, a rule that enforces the salary constraint is bounded by the number of levels within the hierarchy. Assuming that the bound is known to the developer
1 Limitations of RTSs are studied in [15].
2 This problem is closely related to the problem of integrity constraint checking, where syntactic analysis is used to identify updates that might cause violation of a given set of constraints.
AFTER UPDATE salary ON employee WHEN NOT C
BEGIN
  FOR ALL subordinates(employee) s DO
    IF salary(s) < 1/3 salary(employee) THEN salary(s) = 1/3 salary(employee)
    IF salary(s) > 2/3 salary(employee) THEN salary(s) = 2/3 salary(employee)
  DONE
  IF salary(employee) < 1/3 salary(boss(employee))
     OR salary(employee) > 2/3 salary(boss(employee))
  THEN ROLLBACK TRANSACTION
END
Fig. 2. A recursively defined employee-boss hierarchy including a salary-restricting integrity constraint. The constraint is enforced by the given rule, which adapts the salary of subordinates but not of bosses.
means that the bound association belongs to the category of DT activities. Therefore, we propose the following rule extension: Example 2. Extended rule syntax using known bounds.
CREATE TRIGGER <name> { BEFORE | AFTER } <event> ON <table>
[ REFERENCING <references> ]
[ FOR EACH { ROW | STATEMENT } ]
[ WHEN <condition> ] <action>
BOUNDED BY <maximum number of executions within a cycle>
The placeholders in angle brackets follow the SQL:1999 trigger syntax. The default bound, if no bound is given, is 1, which means that a transaction is rolled back if a cycle contains the rule more than once. The benefit is a meaningful restriction that is based on natural bounds of potential cycles and helps developers to cope with termination problems. Termination can be enforced by pre-computing activation or triggering graphs and unfolding them according to the specified bounds. This is an important aspect of our approach.
2.3 Quality of Enforcement
The main task of IE is to enforce a set of integrity constraints C1 , · · · , Cn . The order of enforcement is approach dependent, and provides a certain degree of
freedom within the process of enforcement. Hence, the quality of enforcement can be understood as finding a “good” ordering. Example 3. Let a, b, c be sets of integers with constraints C1 ≡ ∀t.(t ∈ a ⇒ t ∈ b) and C2 ≡ ∀t.(t ∈ a ⇒ t ∉ c). C1 is repaired by inserting missing elements into b or by propagating deletions in b to a. A rollback is performed if C2 is violated. For an insert into a there are two possible enforcement orderings: either S12 ≡ inserta(t); if t ∉ b then insertb(t); if t ∈ c then rollback, or S21 ≡ inserta(t); if t ∈ c then rollback; if t ∉ b then insertb(t). Obviously, S21 is preferable, since in case of a rollback only inserta(t) has to be undone, whereas S12 requires undoing inserta(t) and insertb(t). The example shows that rollback-optimized orders can improve the overall performance by avoiding undo operations. This is particularly important for large sets of rules and assertions. Note that both assertions and rules are part of the current SQL:1999 standard, and are relevant for commercial database vendors. Going back to the approaches introduced by table 1, optimization is usually not discussed in the literature. RTSs try to optimize rule sequences by applying priorities to rules (triggers) at DT. However, since a rollback operation can be introduced artificially (e.g., on detection of a cycle, an exceeded bound, a violation of effects, etc.), static priorities are not sufficient. Furthermore, information about the amount of data and query costs cannot be taken into account. Below, we propose optimization as part of PT or RT activities.
3 A DT-CT-PT-RT Framework for Integrity Enforcement
Following the observations presented above, we developed a framework for integrity enforcement that is based on extended rules and performs additional preparatory computations. We propose to split the relevant tasks as depicted in table 2. In this section we introduce extended rules, compare them to RTSs, and summarize the PT and RT activities of the framework.

Table 2. Suggested processing of the relevant tasks within our framework.

Task                                            Level
termination bounds (extended rules)             DT
repairing actions (extended rules)              DT/CT
termination (cut cycles according to bounds)    PT
optimization (find a “good” order)              PT
consistency (combine repairs)                   PT
effect preservation (history)                   RT
3.1 Extended Rules
Example 2 already introduced an extended version of the classical ECA rules. In this work we further extend the rule notion towards a constraint-centered rule that groups together several ECA-like repairing rules and a bound. The group includes the rules that are relevant for the enforcement of a constraint. Example 4. Extended rule as an integrity constraint including repairing rules and a bound.
CREATE CONSTRAINT a-subset-b AS
  NOT EXISTS ( SELECT t.* FROM a WHERE t.* IS NOT IN b )
  AFTER INSERT ON a WHEN new NOT IN b THEN INSERT new INTO b
  AFTER DELETE ON b WHEN old IN a THEN DELETE old FROM a
  BOUNDED BY 1
Example 4 presents an inclusion dependency on two relations a and b. Inconsistencies are repaired by propagating insertions and deletions to b and a, respectively. A bound is associated with each constraint in order to enforce termination. The repairing rules correspond to the trigger definition of the SQL:1999 standard. Each rule is a semantic repair for the specified integrity constraint. The designer has three responsibilities: (1) Ensure that each repairing rule is a semantic repair for the constraint. Formal definitions of repairs can be found in [4] and [2]. Approaches for deriving repairs with respect to given constraints can be found in [18,9]. (2) Derive bounds for constraint application, by analyzing the application domain. We hypothesize that for most practical applications, bound specification is feasible. (3) Show confluence of the rules, e.g., [17,3].
3.2 Validating Integrity Enforcement
The constraint of example 4 includes redundancies, since the condition parts of the repairing rules can be derived from the constraint specification (C ≡ a ⊆ b and inserta(t) imply that C is violated if t ∉ b). However, for arbitrarily complex integrity constraints it is hard to derive such simplified and update-adapted conditions (an initial treatment is introduced in [18]). Further, the semantic verification of database behavior is limited to known classes of constraints. Therefore, we propose an alternative method for validating rules. The framework can be run in a test mode, where the database is initially in a consistent state and after each rule execution the whole constraint (the AS clause) is tested. In case the constraint does not hold, the rule must be incorrect. It is therefore the responsibility of the developer to assign test scenarios to each rule. We think that running a database in test mode is a pragmatic alternative for validating database applications, since existing methods of testing software can be used as well. Since testing is beyond the scope of this paper we omit any further details.
3.3 Comparison with Rule Triggering Systems
In contrast to RTSs, termination can be controlled and enforced by using meaningful bounds. Note that in RTSs, even if cycles are detected, termination remains unclear. No optimization such as employing rule priorities at DT is necessary, and effect preservation is left to the DBMS. The following example demonstrates an update for which RTSs do not preserve the effects of the initial update. Example 5. Consider a set x of integers and a set y of tuples (s, t) of integers. Let I1 ≡ ∀z.(z ∈ y ⇒ z.t ∈ x) be an inclusion constraint and I2 ≡ ∀z.(z ∈ y ⇒ z.s ∉ x) be an exclusion constraint. The following ECA rules are obtained:
R1: AFTER inserty((s, t)) WHEN ¬I1 DO insertx(t)
R2: AFTER inserty((s, t)) WHEN ¬I2 DO deletex(s)
R3: AFTER deletex(t) WHEN ¬I1 DO deletey((∗, t))
R4: AFTER insertx(s) WHEN ¬I2 DO deletey((s, ∗))
The wildcard ∗ denotes all values. The operation T ≡ inserty(i, i) on the empty database might trigger either R1; R4; R2 or R2; R1; R4. Both executions lead to insertx(i) on the empty database, which in turn triggers R4 and thus the deletion of (i, i) from y, so the effects of T are definitely not preserved. As long as no tuple of the form (i, i) is involved the effects are preserved, but in the general case (e.g., at CT) we have to assume that the set of rules does not preserve the effects. In section 6 we show how effects are preserved by our framework.
3.4 The Process of Integrity Enforcement
Based on a set of extended rules our framework is designed to handle all relevant problems of integrity enforcement, e.g. termination, effect preservation and optimization. All the constraint specifications are stored as meta data within a data repository of a DBMS. At PT a given update is prepared for its execution. In the first stage, a repair plan is built and termination is enforced by detecting and cutting potential cycles. Here the specified bounds help to enforce termination. Repair plans are based on dependency graphs which are introduced in section 4. In the second stage an order for constraint enforcement is fixed by applying certain optimization strategies as explained in section 5. In this work we discuss optimization on rollback operations only. At RT, the framework executes an optimized repair plan and preserves the effects by using the history of all previous database modifications. Details can be found in section 6.
4 Enforcement Plans and Dependency Graphs
This section introduces repair plans represented by dependency graphs (DGs). This work extends the DGs of Mayol [7]. We discuss the construction of dependency graphs, and how to enforce termination using DGs.
4.1 Static Integrity Constraint Checking
The computation of DGs at PT depends on an interference relationship between updates and integrity constraints. An update interferes with a constraint if its application might violate the constraint. Computing the interference relationship at PT requires detection of potential violations, without performing costly evaluations of the constraints. In general, a PT-computed interference relationship cannot guarantee RT violation, but it reduces the search space of an integrity enforcement procedure. Much work has been devoted to this problem in the field of integrity constraint checking [10,12,13,11]. Some methods perform simplifications that return a simplified condition that is left to test at RT [14]. In this work, we assume the existence of a predicate interfere(S, C) that returns false if S cannot violate C in any database state and true otherwise.
Definition 1. Let R̂ be a set of rules and S an update. violate(S, R̂) = {r | r ∈ R̂ ∧ interfere(S, cond(r))}
Definition of Dependency Graphs
A dependency graph is defined by a tree structure. For a node n, n.children denotes the set of child nodes of n. An update (sometimes called also operation) S is considered as a rule with event(S) = cond(S) = null. ˆ be a set of rules and S an Definition 2 (Dependency Graph - DG). Let R update. A dependency graph t for S is a tree whose nodes are labeled by rules, and satisfy the following: 1. t = S (meaning the root node of t is labeled by S), and ˆ 2. for all nodes n of the tree t : n.children = violate(action(n), R). In a sense, a DG reflects all possible constraint violations - the “enforcement space” or “worst case” - of the update. Note that a DG does not specify a concrete ordering for repairing. In order to derive a concrete transaction the paths of the DG must be serialized (sequenced). We elaborate on that point in section 5. Example 6. Consider the database schema, constraints and rules in figure 3. A bank governs all customer data in a trust database. For each customer, a trust value is computed by a function ftrust that is stored in table trust. Further, customers need to be older than 20, earn more than $5.000 per month, and their trust value must be at least 100. We omit bounds and provide only rules for insert updates. The given DG shows the worst case scenario for inserting a new person.
Improving Integrity Constraint Enforcement by Extended Rules
511
person name profession
R1 insert person
trust name
age salary
value
R2
R3 R4
CREATE CONSTRAINT C1 AS ∀n, p, a, s. (n, p, a, s) ∈ person ⇔ (n, ftrust (p, a, s)) ∈ trust AFTER insertperson (n, p, a, s) DO inserttrust (n, ftrust (p, a, s)) CREATE CONSTRAINT C2 AS ∀n, v. (n, v) ∈ trust ⇒ v > 100 AFTER inserttrust (n, v) WHEN v ≤ 100 DO rollback
(R1 )
(R2 )
CREATE CONSTRAINT C3 AS ∀n, p, a, s. (n, p, a, s) ∈ person ⇒ a > 20 AFTER insertperson (n, p, a, s) WHEN a ≤ 20 DO rollback
(R3 )
CREATE CONSTRAINT C4 AS ∀n, p, a, s. (n, p, a, s) ∈ person ⇒ s > $5.000 AFTER insertperson (n, p, a, s) WHEN s ≤ $5.000 DO rollback
(R4 )
Fig. 3. Left: the trust database; Right: dependency graph for an insert on person; Bottom: definition of constraints and rules
4.3 Termination
A DG might be infinite. A path in the graph represents a sequence of rules such that the action of a rule in the sequence potentially violates the condition of the next rule. Non-termination is characterized by the existence of an infinite sequence (path) in the tree (analogous to the notion of cycles in an activation or a triggering graph). An infinite path must contain a rule that repeats infinitely often, since the total number of rules is finite. Termination can be enforced by restricting the number of enforcements for each rule on the path according to the bound that is specified for its constraint. If the bound is exceeded, the path is cut and a rollback operation is used instead of the rule action. Consequently, a transaction based on a “truncated” DG always terminates. However, static bounds do not exist if the number of performed cycles depends on the update itself. For example, consider two sets and two constraints with cyclic dependencies, as in C1 ≡ ∀t.(t ∈ x ⇒ t ∈ y) and C2 ≡ ∀t.(t ∈ y ⇒ √t ∈ x), where x and y are sets of integers. Clearly, for inserting a value t into x the number of repeated enforcements of C1 and C2 depends on t. Therefore, the bound method is only useful if appropriate bounds exist and can be specified by the application developers.
4.4 Construction of Dependency Graphs
The construction of a dependency graph emerges naturally from Definition 2 and uses the bounds method. Let R̂ denote the set of rules, and let the array count[Ci] with Ci ∈ Ĉ contain the bounds according to each constraint specification. The construction starts by applying createDG(S, count), where the update S is taken as a rule to be repaired.
createDG(S, count)
  tree = S
  forall r ∈ violate(action(S), R̂) do
    if count[con(r)] = 0 then
      subTree = replace(r, rollback)
      tree = addChild(tree, subTree)
    else
      count[con(r)] = count[con(r)] − 1
      subTree = createDG(r, count)
      tree = addChild(tree, subTree)
  return tree
The complexity of the algorithm depends on that of violate. In the worst case, the function violate tests for each rule r ∈ R̂ whether S potentially violates cond(r) (interfere(S, cond(r)) = true). Under the assumption that rules do not change frequently, violate can be implemented efficiently, using indexing and caching techniques that avoid repeated computations. That is, rules sharing the same tables can be pre-computed, and already computed interfere values can be memorized. Consequently, for each database schema the appropriate dependency graph is computed only once, and is revised only when the specification of the constraints has been changed.
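A direct prototype of the construction, reusing the Rule and violate helpers of the previous sketch and representing the tree as nested (node, children) tuples, could look as follows; it is a sketch under these assumptions, not the framework's implementation. Bounds are copied per child so that they are consumed along each path, matching the role of count in the pseudocode.

```python
ROLLBACK = Rule("rollback", set(), set())          # sentinel node: the path is cut here

def create_dg(root, rules, bounds):
    """root: a Rule (the update S is treated as a rule); bounds: {constraint: bound}.
    Returns the dependency graph as nested (rule, children) tuples."""
    children = []
    for r in violate(root.action_tables, rules):
        remaining = dict(bounds)                   # copy: bounds are consumed per path
        left = remaining.get(r.constraint, 1)      # default bound is 1
        if left == 0:
            children.append((ROLLBACK, []))        # bound exceeded: replace action by rollback
        else:
            remaining[r.constraint] = left - 1
            children.append(create_dg(r, rules, remaining))
    return (root, children)
```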
5 Serialization and Optimization
Dependency graphs (DGs) represent the general enforcement task. In order to derive an executable transaction a DG needs to be serialized. In this section we introduce a serialization procedure and discuss its optimization.
Definition 3 (Serialization). Let t be a DG. A serialization of t, denoted ser(t), is a total ordering of the set of nodes (rules) of t.
A path serialization of a dependency graph is a serialization that preserves the path ordering in the graph:
Definition 4 (Path Serialization). Let t be a DG and s = ser(t). The order s is a path serialization of t if for every two nodes t1 and t2 in s where t1 < t2, if t1 and t2 both occur in a single path in t, then t1 occurs before t2 in the path.
Consider the DG in Figure 3. Some possible path serializations are:
insertperson, R1, R2, R3, R4
insertperson, R3, R1, R4, R2
insertperson, R1, R2, R4, R3
insertperson, R4, R1, R3, R2
insertperson, R3, R4, R1, R2
· · ·
Definition 5 (Sequential Execution). A finite serialization s ≡ r1, . . . , rn can be sequentially executed by the program Ps: for i = 1 to n do if cond(ri) then action(ri). The program Ps is called a sequential execution of s.
The following consistency theorem relies on the definition of path serializations (Definition 4). The proof appears in the full paper.
Theorem 1 (Consistency). Let t be a finite DG (already truncated) with respect to an update S and a set of rules R̂. Let s be a path serialization of t. Then the sequential execution of s yields a consistent database state.
Since path serialization is not unique, there is room for optimization. As mentioned before, transaction processing within integrity enforcement can be improved if rollback operations are performed as early as possible. Therefore, one idea for optimization is to identify possible rollback updates in the action parts of rules in the DG, and to push these nodes to the beginning of the serialized DG.
Definition 6 (Rollback Optimization). Let s be a path serialization of a DG g with n nodes, and fs(i) a weight function with 1 ≤ i ≤ n and

$$f_s(i) = \begin{cases} i & \text{if the } i\text{-th node of } s \text{ contains no rollback operation} \\ 0 & \text{otherwise} \end{cases}$$

Define

$$weight(s) = \sum_{i=1}^{n} f_s(i)$$
The sequence s is rollback-optimized if weight(s) = max(weight(s′)), taken over all path serializations s′ of g. We demonstrate the optimization for the DG in figure 3 by the following path serializations and the results of their weight functions:
s1 = insertperson, R1, R2, R3, R4    weight(s1) = 1 + 2 + 0 + 0 + 0 = 3
s2 = insertperson, R3, R1, R4, R2    weight(s2) = 1 + 0 + 3 + 0 + 0 = 4
s3 = insertperson, R3, R4, R1, R2    weight(s3) = 1 + 0 + 0 + 4 + 0 = 5
For the rules given in Figure 3 the serialization s3 is a rollback-optimized one. Clearly, more sophisticated optimization techniques are possible. For example, if BEFORE rules and a modified notion of dependency graphs are used, the problem can be further optimized, as shown by the following serialization (details are omitted):
s4 = R3, R4, insertperson, R1, R2    weight(s4) = 0 + 0 + 3 + 0 + 5 = 8
which, indeed, avoids a rollback of insertperson. The optimization method above assumes that all tests of conditions and executions of actions of rules have the same cost. Our framework also takes into account the average costs of testing conditions and of executing and rolling back actions. Costs can be collected by the running system and reflect the specific behavior of the application. Since for a sequence of user transactions the order of rule enforcements is usually stable, we also employ temporal caching techniques to avoid the computational overhead caused by optimization algorithms.
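For small dependency graphs, a rollback-optimized path serialization can be found by brute force; the following illustrative Python sketch reproduces the weight computation of Definition 6 and recovers an s3-like ordering for the DG of Figure 3 (node names and the tuple-based tree encoding are assumptions of the sketch).

```python
from itertools import permutations

def path_edges(tree):
    """Collect (parent, child) precedence pairs from a DG given as (name, [subtrees])."""
    name, children = tree
    edges = []
    for child in children:
        edges.append((name, child[0]))
        edges.extend(path_edges(child))
    return edges

def weight(order, rollback_nodes):
    # Definition 6: position i contributes i iff the i-th node performs no rollback.
    return sum(i for i, n in enumerate(order, start=1) if n not in rollback_nodes)

def rollback_optimized(tree, rollback_nodes):
    edges = path_edges(tree)
    nodes = [tree[0]] + [child for _, child in edges]
    best, best_w = None, -1
    for order in permutations(nodes):                    # brute force: small DGs only
        pos = {n: i for i, n in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in edges):       # path-serialization constraint
            w = weight(order, rollback_nodes)
            if w > best_w:
                best, best_w = order, w
    return best, best_w

# The DG of Figure 3: insert_person -> {R1 -> R2, R3, R4}; R2, R3, R4 roll back.
dg = ("insert_person", [("R1", [("R2", [])]), ("R3", []), ("R4", [])])
print(rollback_optimized(dg, {"R2", "R3", "R4"}))        # an s3-like order with weight 5
```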
6 Run-Time and Effect Preservation
Theorem 1 guarantees that the sequential execution of a serialization yields a consistent database. In this section we consider the issue of effect preservation. We present a run-time history for achieving effect preservation (EP) under sequential execution. Effect preservation is not often discussed in the literature. Rule triggering systems [18] use the notion of the net effect of rules as a guard against, for example, deletions of inserted tuples in a single transaction. The notion of effect preservation generalizes the notion of net effect, since it amounts to disallowing undo operations. In particular, it implies that an inserted tuple cannot be deleted, a deleted tuple cannot be inserted again, and the update of a tuple cannot be reversed. As shown by example 5, EP at CT is ambiguous and should therefore be handled at RT. In order to support effect preservation, we consider an execution of a serialization s ≡ r1, . . . , rn as a single transaction that is associated with a history hs of database modifications. Every new modification is checked against the history in order to find possible interferences of effects. The history considers insert, delete and update operations, where the latter is represented as “delete existing value” and “insert new value”. Further, we distinguish “real” inserts and deletes from “update caused” inserts and deletes.
Definition 7 (History). The history h is a chronological collection of all database modifications within a transaction. Modifications are mapped to tuples of the history as follows:
inserttable(value) → (table, insert, value)
deletetable(value) → (table, delete, value)
updatetable(vold, vnew) → t1 = (table, udelete, vold) and t2 = (table, uinsert, vnew) with t1 < t2
Since h can be totally ordered, t1 < t2 means that t1 is inserted into the history before t2 (t1 occurs before t2). Effect preservation is reduced to the problem of specifying which modifications are allowed according to the previously executed modifications (the history). Our understanding of EP is characterized by the following rules:
Definition 8 (Effect Preservation). A modification insert, delete or update is accepted if its rule holds.
Ruleinsert ≡ (table, delete, value) ∉ h
Ruledelete ≡ (table, insert, value) ∉ h
Ruleupdate ≡ (table, insert, vold) ∉ h ∧ (table, delete, vnew) ∉ h ∧ ¬∃t1, t2. t1, t2 ∈ h ∧ t1 < t2 ∧ t1 = (table, udelete, vnew) ∧ t2 = (table, uinsert, vold)
Consequently, Ruleupdate avoids updates of newly inserted or newly deleted objects, and sanctions reversed updates, e.g., x = x + 1; x = x − 1. Example 7. Recall example 5 and a modification inserty(i, i) on the empty database, which yields an execution of R2, R1, R4 and the following history:
R2 → h = (y, insert, (i, i))
R1 → h = (y, insert, (i, i)), (x, insert, (i))
R4 → h = (y, insert, (i, i)), (x, insert, (i)), (y, delete, (i, i))
Clearly, the last entry of h does not preserve the effects of the initial modification inserty(i, i). Therefore, the transaction is rejected. Since database modifications usually affect only a few objects, the computational overhead caused is relatively low for each transaction. Yet, effect preservation at RT might cause rollback operations that have not been considered by the rollback optimization technique at PT. The design of PT effect preservation techniques is part of our planned future work.
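A run-time history with the acceptance rules of Definition 8 can be sketched in a few lines of Python; the class and method names below are ours, not the framework's API. For Example 7, recording the first two modifications and then testing the deletion of (i, i) from y returns False, so the transaction is rejected.

```python
class History:
    """Run-time history of a transaction, with the acceptance rules of Definition 8."""
    def __init__(self):
        self.entries = []                       # ordered (table, kind, value) tuples

    def accept_insert(self, table, value):
        return (table, "delete", value) not in self.entries

    def accept_delete(self, table, value):
        return (table, "insert", value) not in self.entries

    def accept_update(self, table, v_old, v_new):
        if (table, "insert", v_old) in self.entries or (table, "delete", v_new) in self.entries:
            return False                        # would modify a newly inserted/deleted object
        seen_udelete = False                    # reject reversed updates (v_new -> v_old earlier)
        for t, kind, v in self.entries:
            if t == table and kind == "udelete" and v == v_new:
                seen_udelete = True
            if seen_udelete and t == table and kind == "uinsert" and v == v_old:
                return False
        return True

    def record_insert(self, table, value):
        self.entries.append((table, "insert", value))

    def record_delete(self, table, value):
        self.entries.append((table, "delete", value))

    def record_update(self, table, v_old, v_new):
        self.entries.append((table, "udelete", v_old))
        self.entries.append((table, "uinsert", v_new))

h = History()
h.record_insert("y", ("i", "i"))
h.record_insert("x", "i")
print(h.accept_delete("y", ("i", "i")))         # False: effects of the initial insert not preserved
```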
7 Conclusion and Future Directions
In this paper we introduce a framework for integrity enforcement that improves existing rule triggering systems by extended rules. Using termination bounds, we have shown that certain classes of termination problems can be fully shifted to design-time, provided that appropriate bounds can be specified by application developers. Furthermore, we have shown that optimization of the order of enforcement can be done efficiently by performing some preparatory work in the database system, so that assigning priorities at design-time can be avoided. The framework is based on dependency graphs, which are a significant means for pre-processing database modifications. By applying indexing and caching techniques, the additional computational overhead caused by processing dependency graphs can be avoided. In the future, we plan to extend our work in the following directions: (1) Dependency graphs can be used for query result caching. That is, if within a rule sequence r1, r2 the condition parts (SQL queries) of r1 and r2 share parts of their query execution plans, the latter rule can profit from the query evaluation of the previous rule. (2) Dependency graphs will serve for identifying actions that can be executed in parallel. (3) We would like to close the gap in our rollback optimization techniques. We plan to further develop the handling of effect preservation in the pre-processing stage of database modifications. (4) We plan to compare the efficiency of rule triggering systems and our framework, based on real-world examples. Acknowledgment. This research was supported by the DFG (Deutsche Forschungsgemeinschaft), Berlin-Brandenburg Graduate School in Distributed Information Systems (DFG grant no. GRK 316).
References
1. M. Balaban and P. Shoval. EER as an active conceptual schema on top of a database schema – object-oriented as an example. Technical report, Information Systems Program, Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Israel, 1999.
2. M. Balaban and S. Jurk. The ACS approach for integrity enforcement. Technical report, Ben-Gurion University of the Negev, April 2000.
3. E. Baralis and J. Widom. An algebraic approach to static analysis of active database rules. ACM Transactions on Database Systems, 25(3):269–332, September 2000.
4. S. Jurk. The active consistent specializations approach for consistency enforcement. Master's thesis, Dept. of Computer Science, Cottbus Technical University, Germany, 2000.
5. M. Lenzerini and P. Nobili. On the satisfiability of dependency constraints in entity-relationship schemata. Information Systems, 15(4):453–461, 1990.
6. E. Mayol and E. Teniente. A survey of current methods for integrity constraint maintenance and view updating. In Chen, Embley, Kouloumdjian, Liddle, and Roddick, editors, Intl. Conf. on Entity-Relationship Approach, volume 1727 of Lecture Notes in Computer Science, pages 62–73, 1999.
7. E. Mayol and E. Teniente. Structuring the process of integrity maintenance. In Proc. 8th Conf. on Database and Expert Systems Applications, pages 262–275, 1997.
8. E. Mayol and E. Teniente. Addressing efficiency issues during the process of integrity maintenance. In Proc. 10th Conf. on Database and Expert Systems Applications, pages 270–281, 1999.
9. F. Bry. Intensional updates: Abduction via deduction. In Proc. 7th Conf. on Logic Programming, 1990.
10. F. Bry, H. Decker, and R. Manthey. A uniform approach to constraint satisfaction and constraint satisfiability in deductive databases. In Proceedings of Extending Database Technology, pages 488–505, 1988.
11. H. Decker. Integrity enforcement on deductive databases. In Proceedings of the 1st International Conference on Expert Database Systems, pages 271–285, 1986.
12. M. Celma and H. Decker. Integrity checking in deductive databases. The ultimate method? In Proceedings of the 5th Australasian Database Conference, pages 136–146, 1995.
13. S.K. Das and M.H. Williams. A path finding method for constraint checking in deductive databases. Data and Knowledge Engineering 3, pages 223–244, 1989.
14. S.Y. Lee and T.W. Ling. Further improvement on integrity constraint checking for stratifiable deductive databases. In Proc. 22nd Conf. on VLDB, pages 495–505, 1996.
15. K.D. Schewe and B. Thalheim. Limitations of rule triggering systems for integrity maintenance in the context of transition specifications. Acta Cybernetica, 13:277–304, 1998.
16. K.D. Schewe and B. Thalheim. Towards a theory of consistency enforcement. Acta Informatica, 36:97–141, 1999.
17. van der Voort and A. Siebes. Termination and confluence of rule execution. In Proceedings of the Second International Conference on Information and Knowledge Management, November 1993.
18. J. Widom and S. Ceri. Deriving production rules for constraint maintenance. In Proc. 16th Conf. on VLDB, pages 566–577, 1990.
19. J. Widom and S. Ceri. Active Database Systems. Morgan Kaufmann, 1996.
Statistical and Feature-Based Methods for Mobile Robot Position Localization

Roman Mázl1, Miroslav Kulich2, and Libor Přeučil1
1 The Gerstner Laboratory for Intelligent Decision Making and Control, Faculty of Electrical Engineering, Czech Technical University, Technická 2, 166 27 Prague 6, Czech Republic, {mazl, preucil}@labe.felk.cvut.cz
2 Center for Applied Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Technická 2, 166 27 Prague 6, Czech Republic, [email protected]
Abstract. The contribution introduces the design and comparison of two different types of methods for position localization of indoor mobile robots. Both methods derive the robot's relative position from the structure of the working environment, based on range measurements gathered by a LIDAR system. While one of the methods uses a statistical description of the scene, the other relies on a feature-based matching approach. The suggested localization methods have been experimentally verified, and the achieved results are presented and discussed.
1 Introduction

In order to effectively explore a working environment and to autonomously build a map of the environment, a robot has to localize its position. Mobile robots are usually equipped with odometry or other dead-reckoning methods, which are capable of providing sufficient accuracy only in the short term. Unfortunately, the position and orientation errors increase over time due to wheel slippage, finite encoder resolution and sampling rate, unequal wheel diameters, and other non-uniform influences, so that localization techniques have to be applied. Localization methods can be divided into three basic categories:
• Landmark-based methods
• Markov localization
• Scan matching approaches
The central idea of landmark localization is to detect and match characteristic features in the environment (landmarks) from sensory inputs. Landmarks can be artificial (for example Z-shaped, line navigation, GPS) or natural, in which case they are learned from a map of the environment [1], [2]. The Markov localization evaluates a probability distribution over the space of
possible robot states (positions). Whenever the robot moves or senses an environmental entity, the probability distribution is updated [3], [4]. The sensor matching techniques compare raw or pre-processed data obtained from sensors with an existing environment map or with previously obtained reference sensor data. The comparison in this sense denotes a correlation process of translations and a rotation of the actual range scan with the reference range scans. Sensor matching methods differ in the representation of the actual and reference scans, so the major types can be recognized as approaches matching particular entities:
• Point-to-point
• Point-to-line
• Line-to-line
For the point-to-point method, both the reference and the actual scans are supposed to be raw sensor measurements (vectors of points), and the matching process consists of two steps: pairs of corresponding points are searched for in the range scans, followed by translations and rotations to fit the reference scan. The procedure is guided by minimization of a distance function by the least-squares method [5]. A crucial part of this approach is the retrieval of the corresponding pairs of points. This works properly only if the difference of the actual scan from the reference scan is sufficiently small. Furthermore, processing a huge number of points is time-consuming, as the complexity is O(n²), where n is the number of measurement points. These disadvantages lead to the usage of point-to-line algorithms. The main idea is to approximate the reference scan by a list of lines or to match the actual scan with visible lines of the map [6]. Similarly, two lists of lines representing the scans are matched in the line-to-line approach [7], [8]. Besides the previous matching techniques working directly with points (or point-based features such as corners, line segments, etc.), there are also methods using statistical representations of the existing features. A fundamental work has been introduced by [9] and applies directional densities (histograms) to find correspondences in sensor data determining relative motions. Moreover, the final performance and robustness of most matching techniques can be additionally improved by using a Kalman filter [10], [11]. Therefore, chapter 2 hereunder is dedicated to a common algorithm for line segmentation from range data. Chapters 3 and 4 describe, in particular, the line-to-line (feature-based) and the histogram-based (statistical) localization methods. Practical performance and a brief discussion of the achieved results are given in chapters 5 and 6.
2 Range Measurement Segmentation

As the obtained range measurements are in the form of separate points (distances to rigid obstacles in the selected discrete directions), the processing requires segmentation of the input data into point sets belonging to particular boundaries.
The used range-measurement segmentation applies two steps. For this purpose, a recursive interval-splitting approach [8] selecting the candidate points for a segment has been developed previously. The algorithm (step 1) takes advantage of the naturally angle-ordered points obtained from a laser range finder. Firstly, all the points are represented by a single segment, which is defined by the boundary points of the laser scan (the first and the last point of the scan). A search for the most distant point from the segment is invoked afterwards. This point cracks the segment in question into two new segments. The whole process is applied recursively to both new segments. Evaluation of a maximum-curvature criterion, followed by application of LSQ fitting (step 2) for an optimal approximation of the points of the segment, serves for the algorithm termination; a sketch of the splitting step is given after the list below. The less important and possibly incorrectly segmented elements are scratched in a post-processing filtration using heuristic rules like:
• Low significance. The segment must be built of at least a certain number of points.
• Minimum angular length. The segment endpoints must be observed within at least a minimum angular interval. This filters out adversely oriented and partially occluded segments.
• Limited gap. Two neighbouring points must lie within a chosen distance threshold. This rule partially handles multiple reflections, laser beam fading, and improper segments originating from occlusions.
• Gravity centre. If the gravity centres of a segment and of the original point set creating the segment in question do not match, the found segment is omitted.
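As a rough illustration of the splitting step (step 1) described above, the following Python sketch recursively cracks an angle-ordered 2-D laser scan at the point most distant from the chord between the current boundary points. The distance threshold, the minimum point count, and the purely distance-based termination test are illustrative assumptions; the maximum-curvature criterion, the least-squares refit, and the post-processing rules of the original algorithm are omitted.

import numpy as np

def split_segments(points, dist_threshold=0.05, min_points=4):
    """Recursive interval splitting of an angle-ordered scan (sketch only)."""
    points = np.asarray(points, dtype=float)
    if len(points) < min_points:
        return []                       # too few points: low significance
    p0, p1 = points[0], points[-1]      # segment defined by its boundary points
    d = p1 - p0
    norm = np.hypot(d[0], d[1])
    if norm < 1e-9:
        return []
    # perpendicular distance of every point from the chord p0-p1
    dist = np.abs(d[0] * (points[:, 1] - p0[1]) - d[1] * (points[:, 0] - p0[0])) / norm
    k = int(np.argmax(dist))
    if dist[k] > dist_threshold:        # the most distant point cracks the segment
        return (split_segments(points[:k + 1], dist_threshold, min_points) +
                split_segments(points[k:], dist_threshold, min_points))
    return [(tuple(p0), tuple(p1))]     # accepted segment (LSQ refit omitted)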
3 Line-to-Line Method
3.1 Corresponding Line Finding
The key part of the method is to find, for each line from the actual scan, a corresponding line from the reference scan (if such a line exists). We say that two lines correspond if and only if the differences of their directions and positions are sufficiently small. To specify these conditions more precisely, let us denote by x and y the lengths of the lines, by a, b, c, d the distances between their vertices, and by φ the angle between the lines.
Fig. 1. Distance definition for evaluation of line correspondence.
The lines in question also have to satisfy the following two constraints given by the expressions:
\varphi < \Phi_{MAX}    (1)

\frac{\min(a, b) + \min(c, d)}{x + y} < K,    (2)
where Φ_MAX and K are thresholds saying how "close" both lines have to be. For example, whenever two lines of length 1 overlap, the common part is at least of length equal to 1 − K.
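A minimal sketch of the correspondence test given by (1) and (2), assuming lines represented by their two endpoints; the exact pairing of the vertex distances a, b, c, d follows Fig. 1 only approximately, and the threshold values phi_max and k are hypothetical.

import math

def lines_correspond(line_a, line_b, phi_max=math.radians(10), k=0.5):
    """Check conditions (1) and (2) for two lines given as ((x1, y1), (x2, y2))."""
    (p1, p2), (q1, q2) = line_a, line_b

    def dist(u, v):
        return math.hypot(v[0] - u[0], v[1] - u[1])

    def direction(u, v):
        return math.atan2(v[1] - u[1], v[0] - u[0])

    x, y = dist(p1, p2), dist(q1, q2)
    phi = abs(direction(p1, p2) - direction(q1, q2)) % math.pi
    phi = min(phi, math.pi - phi)                 # lines are undirected
    if phi >= phi_max:                            # condition (1)
        return False
    a, b = dist(p1, q1), dist(p2, q2)             # distances between vertices
    c, d = dist(p1, q2), dist(p2, q1)
    return (min(a, b) + min(c, d)) / (x + y) < k  # condition (2)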
3.2 Orientation Correction
The angle difference α between the scans is to be determined in this step. The leading idea is to evaluate the weighted sum of the angle differences of the particular corresponding lines:
\alpha = \frac{\sum_{i=1}^{n} \varphi_i w_i}{\sum_{i=1}^{n} w_i}    (3)
where n is the number of pairs of corresponding lines, φi is the angle between the lines of the i-th pair, and wi is the weight of the i-th pair. The weight is defined as the product of the lines' lengths, which prefers pairs containing long lines to pairs of short lines.
3.3 Position Correction
In order to correct the position of the actual scan, we express each line of the reference scan in the standard form:

a_i x + b_i y + c_i = 0    (4)

\min_{p_x, p_y} \sum_{i=1}^{n} \sum_{j=1}^{2} \left( a_i (x_i^j + p_x) + b_i (y_i^j + p_y) + c_i \right)^2 .    (5)
Simultaneously, each line of the actual scan is represented by its outer points [x_i^1, y_i^1], [x_i^2, y_i^2], so that the shift [p_x, p_y] of the actual scan can be determined by minimizing the penalty function (5) for p_x and p_y.
3.4 Final Optimization
The localization problem can also be expressed as minimization of a penalty function respecting both the position and the orientation. As a suitable analytic solution does not exist, a numerical solution has to be applied; the given problem uses the Nelder-Mead simplex search method. The desired initial orientation and position are derived from the separate optimizations for rotation and shift obtained in the preceding steps.
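The two preceding correction steps can be sketched as follows: the weighted angle of (3) has a closed form, and the shift of (5) is an ordinary linear least-squares problem in (p_x, p_y). The data layout (index-aligned lists of matched reference lines and actual line endpoints) is an assumption of this sketch, not the authors' implementation.

import numpy as np

def orientation_correction(pairs):
    """Weighted angle difference (3); pairs = [(phi_i, len_actual_i, len_ref_i), ...]."""
    phi = np.array([p[0] for p in pairs])
    w = np.array([p[1] * p[2] for p in pairs])    # weight = product of line lengths
    return float(np.sum(phi * w) / np.sum(w))

def position_correction(ref_lines, actual_endpoints):
    """Shift [p_x, p_y] minimising (5).

    ref_lines: [(a_i, b_i, c_i), ...], reference lines in standard form (4).
    actual_endpoints: [((x1, y1), (x2, y2)), ...], outer points of the matched
    actual lines, index-aligned with ref_lines.
    """
    A, r = [], []
    for (a, b, c), pts in zip(ref_lines, actual_endpoints):
        for (x, y) in pts:
            A.append([a, b])                      # derivative of a(x+p_x) + b(y+p_y) + c
            r.append(-(a * x + b * y + c))
    shift, *_ = np.linalg.lstsq(np.array(A), np.array(r), rcond=None)
    return shift                                  # [p_x, p_y]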
4 Histogram-Based Correction Method
4.1 The Sensor Heading Correction
The other presented approach for correction of heading and position relies on a method similar to the one used for the initial segmentation. The leading idea for the angular refinement has been derived from [9] and applied to line segments instead of standalone point measurements. While in [9] the tangent line at each point location serves for construction of the angle histogram, our approach uses the directions of line segments. The first step builds a directional histogram comprising the distribution of dominant directions in a particular map image (obtained from the current range scan). Comparison of the current and the reference histograms is performed by a cross-correlation function:

c(j) = \sum_{i=1}^{n} h_1(i) \cdot h_2(i + j), \quad j \in \langle -n/2;\, n/2 \rangle    (6)
The angle used to compensate the scene rotation is obtained for the maximum of c(j). Using a low number of sparse segments typically leads to a directional histogram that does not have a sufficiently smooth shape. This problem can be treated by the following procedure:
(a) A Gaussian core is assigned to each line segment (the set of its creating points):

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right]    (7)
where µ stands for the direction of the segment and σ denotes the inverse number of points relevant to the segment.
(b) All the Gaussian cores are superimposed in order to obtain a smooth directional histogram.
4.2 Sensor Shift Correction
In principle, the x, y scan-to-scan translation of the sensor can be handled in a similar way as the rotation. The major modification is the definition of the histograms along two roughly orthogonal axes. Each axis is determined by the direction of one of the most significant segments from the reference range scan. The definition interval would be (−∞, +∞), but it can be restricted to a finite one by simply omitting the less important (i.e., more distant) segments. The particular Gaussian cores creating the histogram are defined by a mean value µ denoting the projected position of the segment midpoint onto the used coordinate axis. The σ depends on the inverse number of points and on the mutual orientation of the segment with respect to the given axis. Segments parallel to an axis do not describe possible shifts along that axis, as their proper length is unknown; for this reason, the value of σ is increased for Gaussian cores corresponding to segments parallel with the coordinate axes.
Fig. 2. Histogram generation for calculation of x, y shift values
Correlating the histograms h1 and h2 of the current and reference map images in both directions provides the desired translation j for the maximized c(j):

c(j) = \sum_{i=1}^{n} h_1(i) \cdot h_2(i + j), \quad j \in \langle -n/2;\, n/2 \rangle    (8)
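The Gaussian-core histogram of (7) and the correlation of (6)/(8) can be sketched as below; the bin layout and the neglect of angle wrap-around are simplifying assumptions of this illustration.

import numpy as np

def directional_histogram(segments, bins):
    """Superimpose one Gaussian core (7) per segment.

    segments: [(mu, n_points), ...], where mu is the segment direction (or the
    projected midpoint position for the shift histograms) and sigma = 1 / n_points.
    bins: array of histogram bin centres,
          e.g. np.linspace(-np.pi / 2, np.pi / 2, 180, endpoint=False).
    """
    hist = np.zeros(len(bins))
    for mu, n_pts in segments:
        sigma = 1.0 / max(n_pts, 1)
        hist += np.exp(-(bins - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return hist

def best_offset(h1, h2, bin_width):
    """Cross-correlate two histograms as in (6)/(8); return the offset at the maximum."""
    n = len(h1)
    offsets = range(-n // 2, n // 2)
    # c(j) = sum_i h1(i) * h2(i + j); np.roll(h2, -j)[i] == h2[(i + j) % n]
    scores = [np.sum(h1 * np.roll(h2, -j)) for j in offsets]
    return offsets[int(np.argmax(scores))] * bin_width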
Fig. 3. Example of the resulting 1-D cross-correlation function along a single axis.
Two evaluated correction values (a and b) are composed into a single translation vector T. The following figure illustrates the situation with given dominant directions and calculation of the final direction for the sensor shift correction. The largest segment defines the most significant direction; the other significant axis direction corresponds to the i-th segment, which maximizes the following formula:
\max_i \left( \frac{\mathrm{length}(s_i)}{\mathrm{length}(s_l)} \cdot \frac{\left| \mathrm{direction}(s_i) - \mathrm{direction}(s_l) \right|}{\pi / 2} \right)    (9)

where s_i is the candidate segment and s_l stands for the largest segment.
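A direct transcription of (9), assuming each segment is given by its length and direction; handling of angle wrap-around is omitted in this sketch.

import math

def second_significant_direction(segments):
    """Pick the segment maximising (9); segments = [(length, direction_rad), ...]."""
    s_l = max(segments, key=lambda s: s[0])       # the largest segment
    others = [s for s in segments if s is not s_l]
    if not others:
        return None
    return max(others,
               key=lambda s: (s[0] / s_l[0]) * abs(s[1] - s_l[1]) / (math.pi / 2))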
Fig. 4. Dominant directions a, b and derivation of the final translation T for correction.
5 Experiments
Both described approaches were experimentally verified and tested with real data in indoor environments. The tests were set up and conducted with carefully chosen obstacle configurations and environment shapes in order to point out weaknesses in their performance. The experimental evaluation has been designed mainly with respect to:
• testing the sensitivity to the presence of dominant directions,
• evaluating the capability of the matching procedure to determine correct line pairs,
• recognizing and evaluating the influence of the scene sampling density (with respect to maximal inter-scan changes).
From this point of view, scenes with significant dominant directions (see Fig. 6) have been built up on one hand, beside scenes containing boundary directions spread uniformly (often referred to as random scenes, see Fig. 5) on the other. Another aspect taken into consideration was the capability of the verified methods to cope with dynamically changing scenes. This means that not all the obstacles determined in one particular scene scan can be found in the following frames, which tests the robustness of the matching part of the algorithm itself. For the experiments, this type of scene setup was achieved by persons walking through a static environment.
6 Conclusion
The presented methods were mainly designed for self-localization of autonomous mobile robots in indoor environments. The introduced experiments were targeted at localization of relative motions only. This has been done via scan-to-scan comparison determining relative motions of the robot. Absolute positioning could be achieved by fitting the scene scans to a global map of the environment, which is straightforward.
Fig. 5. Original scene with noisy data induced by walking persons.
Fig. 6. Result of a position correction using the histogram-based method.
Fig. 7. Position correction using the line-to-line algorithm.
The experiments have brought valuable insight into the performance of the approaches. Both methods were found to be quite comparable and robust under standard conditions. Although it could be expected that the histogram-based approach is more reliable in noisy scenes (because it integrates the sensor data into histograms), this effect is not very significant. Another weak point of this method lies in its computational costs, which are about 3 times higher than for the other approach. On the other hand, the line-to-line matching approach might be suspected of failures in noisy scenes. However, not only pure lines are correlated and matched here; complete rectangular structures (pairs of close-to-rectangular line segments) are also used to create pairs. Using these "higher level features" for generating line pairs makes the method surprisingly robust to mismatches, even for scenes with lots of randomly spread segments. The maximal motion between two scans is the only remaining limitation, determining the maximal distance for the correspondence search. If it is exceeded, the methods fail to create proper pairs and might in principle fail (see Fig. 7) by mismatching the line pairs in certain situations. The presented comparison of the two methods introduces a new aspect of using low-level self-localization methods with laser range measurements.
Acknowledgements. This research has been carried out being funded by the following grants: FRVŠ No. 1900/2001, MŠMT 212300013, and CTU internal grant No. 300108413. The work of Miroslav Kulich has kindly been supported by the MŠMT 340/35-20401 grant.
References
1. Leonard, J.J., Cox, I.J., Durrant-Whyte, H.F.: Dynamic map building for an autonomous mobile robot. In Proc. IEEE Int. Workshop on Intelligent Robots and Systems, pages 89-96, July 1990.
2. Borenstein, J., Koren, Y.: Histogramic In-motion Mapping for Mobile Robot Obstacle Avoidance, IEEE Journal of Robotics and Automation, 7(4), pages 535-539, 1991.
3. Nourbakhsh, I., Powers, R., Birchfield, S.: DERVISH: An Office-Navigating Robot, AI Magazine, 16(2), pages 53-60, 1995.
4. Fox, D., Burgard, W., Thrun, S.: Markov localization for mobile robots in dynamic environments. Journal of Artificial Intelligence Research, 11, pages 391-427, 1999.
5. Lu, F., Milios, E.: Robot Pose Estimation in Unknown Environments by Matching 2D Range Scans, IEEE Computer Vision and Pattern Recognition Conference (CVPR), pages 935-938, 1994.
6. Gutmann, J.S., Burgard, W., Fox, D., Konolige, K.: Experimental comparison of localization methods. International Conference on Intelligent Robots and Systems, Victoria, B.C., 1998.
7. Gutmann, J.S., Weigel, T., Nebel, B.: Fast, Accurate, and Robust Self-Localization in Polygonal Environments, Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 1999.
8. Chmelař, B., Přeučil, L., Štěpán, P.: Range-data based position localization and refinement for a mobile robot. Proceedings of the 6th IFAC Symposium on Robot Control, Vienna, Austria, pp. 364-379, 2000.
9. Weiss, G., Puttkamer, E.: A map based on laserscans without geometric interpretation. In: U. Rembold et al. (Eds.): Intelligent Autonomous Systems, IOS Press, pages 403-407, 1995.
10. Arras, K.O., Tomatis, N.: Improving Robustness and Precision in Mobile Robot Localization by Using Laser Range Finding and Monocular Vision, Proceedings of the 3rd European Workshop on Advanced Mobile Robots (Eurobot'99), Zurich, Switzerland, 1999.
11. Guivant, J., Nebot, E., Baiker, S.: Localization and map building using laser range sensors in outdoor applications, Journal of Robotic Systems, Volume 17, Issue 10, pages 565-583, 2000.
Efficient View Maintenance Using Version Numbers
Eng Koon Sze and Tok Wang Ling
National University of Singapore
3 Science Drive 2, Singapore 117543
{szeek, lingtw}@comp.nus.edu.sg
Abstract. Maintaining a materialized view in an environment of multiple, distributed, autonomous data sources is a challenging issue. The results of incremental computation are affected by interfering updates, and compensation is required. In this paper, we improve the incremental computation proposed in our previous work by making it more efficient through the use of data source and refreshed version numbers. This is achieved by cutting down unnecessary maintenance queries and thus their corresponding query results. The number of times sub-queries are sent to a data source with multiple base relations is also reduced, and the execution of cartesian products is avoided. Updates that will not affect the view are detected, and incremental computation is not applied to them. We also provide a compensation algorithm that resolves the anomalies caused by using the view in the incremental computation.
1 Introduction
Providing integrated access to information from different data sources has received recent interest from both industry and the research community. Two approaches to this data integration are the on-demand and the in-advance approaches. In the on-demand approach, information is gathered and integrated from the various data sources only when requested by users. To provide fast access to the integrated information, the in-advance approach is preferred instead. Information is extracted and integrated from the data sources, and then stored in a central site as a materialized view. Users access this materialized view directly, and thus queries are answered immediately. Noting that information at the data sources does get updated as time progresses, this materialized view has to be refreshed accordingly to be consistent with the data sources. This refreshing of the materialized view due to changes at the data sources is called materialized view maintenance. The view can be refreshed either by recomputing the integrated information from scratch, or by incrementally changing only the affected portion of the view. It is inefficient to recompute the view from scratch. Deriving the relevant portion of the view and then incrementally changing it is the preferred approach, as a smaller set of data is involved.
In Section 2, we explain the working of materialized view maintenance and its associated problems. In Section 3, we briefly discuss the maintenance algorithm proposed in [7]. The improvements to this algorithm are given in Section 4, and the changes to the compensation algorithm of [7] are given in Section 5. We compare related works in Section 6, and conclude our discussion in Section 7.
2 Background
In this section, we explain the view maintenance algorithm and its problems.
2.1 Incremental Computation
We consider the scenario of a select-project-join view V over n base relations {Ri}1≤i≤n. Each base relation is housed in one of the data sources, and there is a separate site for the view. The view definition is given as V = π_proj_attr σ_sel_cond (R1 ⋈ ... ⋈ Rn). A count attribute is appended to the view relation if it does not contain the key of the join relation R1 ⋈ ... ⋈ Rn, to indicate the number of ways the same view tuple can be derived from the base relations. There are multiple, distributed, autonomous data sources, each with one or more base relations. There is communication between the view site and the data sources, but not between individual data sources. Thus a transaction can involve one or more base relations of the same data source, but not of different data sources. The view site does not control the transactions at the data sources. No assumption is made regarding the reliability of the network, i.e., messages sent could be lost or could arrive at the destination in a different order from what was originally sent out. The data sources send notifications to the view site on the updates that have occurred. To incrementally refresh the view with respect to an update ΔRi of base relation Ri, R1 ⋈ ... ⋈ ΔRi ⋈ ... ⋈ Rn is to be computed. The result is then applied to the view relation.
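A toy sketch of applying the projected join result of an update to the count-carrying view, assuming the delta tuples have already been computed and projected onto the view attributes; it illustrates only the count bookkeeping described above, not the paper's algorithm.

from collections import Counter

def apply_delta(view_counts, delta_tuples, sign):
    """Apply the projected result of R1 ⋈ ... ⋈ ΔRi ⋈ ... ⋈ Rn to the view.

    view_counts: Counter mapping a projected view tuple to its count.
    sign: +1 for an insertion update, -1 for a deletion update.
    """
    for t in delta_tuples:
        view_counts[t] += sign
        if view_counts[t] == 0:
            del view_counts[t]          # the tuple is no longer derivable
    return view_counts

# e.g. apply_delta(Counter(), [("c1",), ("c1",)], +1)  ->  Counter({("c1",): 2})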
2.2 View Maintenance Anomalies and Compensation
There are three problems in using the incremental approach to maintain the view.
Interfering Updates. If updates are separated from one another by a sufficiently large amount of time, incremental computation will not be affected by interfering updates. However, this is often not the case.
Example 1. Consider R1(A, B) and R2(B, C), and view V = π_C (R1 ⋈ R2), with a count attribute added for the proper working of the incremental computation. R1 contains the single tuple (a1, b1) and R2 is empty. Hence the view is also empty. Insert ΔR1(a2, b1) occurs. The view receives this notification and the query ΔR1(a2, b1) ⋈ R2 is sent to R2. Let insert ΔR2(b1, c1) occur just before the view maintenance query for insert ΔR1(a2, b1) reaches R2. Thus, the tuple
(b1, c1) would be returned. The overall result (without projection) is the tuple (a2, b1, c1). The single projected tuple (c1) is added to the view to give (c1, 1). When the view receives the notification of insert ΔR2(b1, c1), it formulates another query R1 ⋈ ΔR2(b1, c1) and sends it to R1. The result {(a1, b1), (a2, b1)} is returned to the view site. The overall result is {(a1, b1, c1), (a2, b1, c1)}, and this adds 2 more tuples of (c1) to the view. The presence of the interfering update (insert ΔR2(b1, c1)) in the incremental computation of insert ΔR1(a2, b1) adds an extra tuple of (c1) to the view relation, giving it a count value of 3 instead of 2.
Misordering of Messages. Compensation is the removal of the effect of interfering updates from the query results of incremental computation. Most of the existing compensation algorithms that remove the effect of interfering updates and thus achieve complete consistency [11] are based on the first-sent-first-received delivery assumption for messages over the network, and thus will not work correctly when messages are misordered. A study carried out by [3] has shown that one percent of the messages delivered over the network are misordered.
Loss of Messages. The third problem is the loss of messages. Although the loss of network packets can be detected and resolved at the network layer, the loss of messages due to the disconnection of the network link (machine reboot, network failure, etc.) has to be resolved by the application itself after re-establishing the link. As the incremental computation and compensation method of the maintenance algorithm is driven by these messages, their loss would cause the view to be refreshed incorrectly.
3 Version Numbers and Compensation for Interfering Updates
The following types of version numbers are proposed in [7].
Base relation version number of a base relation. The base relation version number identifies the state of a base relation. It is incremented by one when there is an update transaction on this base relation.
Highest processing version number of a base relation, stored at the view site. The highest processing version number provides information to the maintenance algorithm on which updates of a base relation have been processed for incremental computation. It indicates the last update transaction of a base relation which has been processed for incremental computation.
Initial version numbers of an update. The initial version numbers of an update identify the states of the base relations on which the result of the incremental computation should be based. Whenever we pick an update transaction of a base relation for processing, the current highest processing version numbers of all base relations become the initial version numbers of this update. At the same time, the highest processing version number of the base relation of this update is incremented by one, which then equals the base relation version number of the update.
Queried version number of a tuple of the result from a base relation. The queried
version numbers of the result of the incremental computation of an update indicate the states of the base relations from which this result was actually generated.
Base relation and highest processing version numbers are associated with the data source and the view site respectively, while initial and queried version numbers are used only for the purpose of incremental computation and need not be stored permanently. The different types of version numbers allow the view site to identify the interfering updates independently of the order of arrival of messages at the view site. The numbers in between the initial and queried version numbers of the same base relation are the version numbers of the interfering updates. This result is stated in Lemma 1, which was given in [7]. Once the interfering updates are identified, compensation can be carried out to resolve the anomalies. Compensation undoes the effect on the result of incremental computation caused by interfering updates. The formal definition for compensating the interfering updates can be found in [7].
Lemma 1. Given that a tuple of the query result from Rj has queried version number βj, and the initial version number of Rj for the incremental computation of ΔRi is αj. If βj > αj, then this tuple requires compensation with the updates from Rj of base relation version numbers βj down to αj + 1. These are the interfering updates on the tuple. Otherwise, if βj = αj, then compensation on the query result from Rj is not required.
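The version-number comparison of Lemma 1 (and of its extension for missing updates, Lemma 4 in Sect. 5) can be sketched as a small classification helper; the return representation is hypothetical.

def classify_compensation(alpha_j, beta_j):
    """Classify the compensation needed for one result tuple from Rj.

    alpha_j: initial version number of Rj for this incremental computation.
    beta_j:  queried version number carried by the tuple.
    Returns the kind of compensation and the base relation version numbers involved.
    """
    if beta_j > alpha_j:
        # interfering updates: versions alpha_j + 1 .. beta_j must be undone
        return "interfering", list(range(alpha_j + 1, beta_j + 1))
    if beta_j < alpha_j:
        # missing updates: versions beta_j + 1 .. alpha_j must be applied
        return "missing", list(range(beta_j + 1, alpha_j + 1))
    return "none", []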
4 Improved Incremental Computation
The view maintenance approach in [7] overcomes the problems caused by the misordering and loss of messages during transmission. However, efficiency in the incremental computation is not taken into consideration. Each sub-query only accesses one base relation, but generally a data source has multiple base relations. [7] uses the same strategy to incrementally compute the change for all updates. Since the view contains partial information of the base relations, we propose in this paper to involve the view in the incremental computation.
4.1 Querying Multiple Base Relations Together
Since a data source usually has more than one base relation, the sub-queries sent to it should access multiple base relations. This cuts down the total network traffic. It also reduces the time required for the incremental computation of an update, and results in a smaller number of interfering updates. Using the join graph to determine the access path for querying the base relations, instead of doing a left and a right scan of the relations based on their arrangement in the relational algebra of the view definition, cuts down the size of the query results sent through the network by avoiding cartesian products. Briefly, the incremental computation is handled as follows. Consider the join graph of the base relations of a view. The view maintenance query starts with the base relation Ri, 1 ≤ i ≤ n, where the update has occurred. A sub-query is
sent to a set of relations S, where S ⊂ {Rj}1≤j≤n, the relations in S come from the same data source, and Ri and the relations in S form a connected sub-graph with Ri as the root of this sub-graph. If there is more than one such set of base relations S, multiple sub-queries can be sent out in parallel. For each sub-query sent, we mark the relations in S as "queried". Ri is also marked as "queried". Whenever a result is returned from a data source, another sub-query is generated using a similar approach. Let Rk, 1 ≤ k ≤ n, be one of the relations that have been queried. A sub-query is sent to a set of relations S, where S ⊂ {Rj}1≤j≤n, the relations in S come from the same data source, none of the relations in S is marked "queried", and Rk and the relations in S form a connected sub-graph with Rk as the root of this sub-graph. Again, if there is more than one such set of base relations S, multiple sub-queries can be sent in parallel. The incremental computation for this update is completed when all the base relations are marked "queried".
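The grouping of same-source relations into sub-queries can be sketched as follows. The sketch processes the join graph sequentially rather than asynchronously as results return, and the data structures (adjacency dict, source map) are assumptions of this illustration rather than the paper's representation.

from collections import deque

def plan_subqueries(join_graph, source_of, start):
    """Yield (root, relations) pairs, one per sub-query, starting from Ri = start.

    join_graph: dict mapping a relation to the set of relations it joins with.
    source_of:  dict mapping a relation to its data source identifier.
    """
    queried = {start}
    frontier = deque([start])
    while frontier:
        root = frontier.popleft()
        # group the not-yet-queried neighbours of `root` by data source
        by_source = {}
        for nb in join_graph[root]:
            if nb not in queried:
                by_source.setdefault(source_of[nb], set()).add(nb)
        for src, seeds in by_source.items():
            # grow the connected sub-graph of unqueried relations of this source
            group, stack = set(), list(seeds)
            while stack:
                r = stack.pop()
                if r in group or r in queried or source_of[r] != src:
                    continue
                group.add(r)
                stack.extend(join_graph[r])
            queried |= group
            frontier.extend(group)
            yield root, frozenset(group)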
4.2 Identifying Irrelevant Updates
If a data source enforces the referential integrity constraint that each tuple in Rj must refer to a tuple in Ri, then we know that an insertion update on Ri will not affect the view and thus can be ignored by our view maintenance process. Similarly, a deletion update on Ri will not affect the view if the data source further enforces that no tuple in Ri can be dropped while there are still tuples in Rj referencing it.
4.3 Partial Self-Maintenance Using Functional Dependencies
We propose the involvement of the view relation to improve the efficiency of the maintenance algorithm by cutting down the need to access the base relations.
Additional Version Numbers. We propose the following additions to the concept of version numbers given in [7]. This enhancement allows a data source to have more than one base relation, i.e., an update transaction of a data source can involve any number of the base relations within it, and enables the maintenance algorithm to utilize the view in its incremental computation. The two new types of version numbers are as follows.
Data source version number of a data source. Since a data source usually has more than one base relation, it is not sufficient to determine the exact sequence of two update transactions involving non-overlapping base relations using the base relation version numbers alone. The data source version number is used to identify the state of a data source. It is incremented by one when there is an update transaction on the data source.
Refreshed version number of a base relation, stored at the view site. If we want to use the view relation for incremental computation, the refreshed version numbers identify the states. The refreshed version number indicates the state of a base relation that the view relation is currently showing.
The data source version number is used to order the update transactions from the same data source for incremental computation, and the subsequent refreshing
of the view. The base relation version number continues to be used for the identification of interfering updates. The refreshed version number is assigned as the queried version number of the tuples of the result when the view is used for incremental computation.
Accessing View Data for Incremental Computation. Lemma 2 uses functional dependencies to involve the view in the incremental computation. In the case where the view is not used, all tuples in µRi (the query result of the incremental computation from Ri) need to be used to query the next base relation Rj. When conditions (1) and (2) of Lemma 2 are satisfied, only those tuples in µRi that cannot be matched with any of the tuples in V[Rj, S] need to be used in accessing the base relations.
Lemma 2. When the incremental computation of an update (insertion, deletion, or modification) needs to query base relation Rj, (1) if the key of Rj is found in the view, and (2) if Rj functionally determines the rest of the relations S that are to be queried based on the query result of Rj (denoted as µRj), then the view can be accessed for this incremental computation and the refreshed version numbers are taken as the queried version numbers for the result. The view is first used for the incremental computation using µRj,S = µRi ⋈ V[Rj, S], where Rj is the set of attributes of Rj in V, S is the set of attributes of the relations S in V, and µRj,S is the query result for Rj and the set of relations S. For the remaining required tuples that are not found in the view, the base relations are accessed next.
Example 2. Consider R1(A, B, C), R2(C, D, E) and R3(E, F, G), with the view defined as V = π_{B,C,F} (R1 ⋈ R2 ⋈ R3). The following shows the initial states of the base and view relations.

R1(A, B, C)    R2(C, D, E)    R3(E, F, G)    V(B, C, F, count)
a1,b1,c1       c1,d1,e1       e1,f1,g1       b1,c1,f1,1
               c2,d2,e2       e2,f2,g2

Let R1 reside in data source 1, and R2 and R3 reside in data source 2. The base relation version numbers of R1, R2 and R3 are each 1, and the data source version numbers of data sources 1 and 2 are also 1. The refreshed version numbers at the view site are 1, 1, 1 for R1, R2 and R3 respectively. An update transaction with data source version number 2 occurs at data source 1, and the updates involved are insert ΔR1(a2, b2, c2) and insert ΔR1(a3, b3, c1); R1 now has its base relation version number changed to 2. The view site receives this notification and proceeds to handle the incremental computation. The highest processing version number of R1 is updated to 2. Thus, the initial version numbers for the incremental computation of this update are 2, 1, 1. R2 and R3 are to be queried. Since the key of R2, which is C, is in the view, and R2 functionally determines the other base relations to be queried (only R3 in this case), the view is first accessed for this incremental computation using the query {(a2, b2, c2), (a3, b3, c1)} ⋈ V[R2, R3] (R2 contains attribute C, and R3 contains attribute F). It is found that the tuple (a3,b3,c1) can join with the
tuple (c1,-,-) from R2 and (-,f1,-) from R3 (C → F). Note the use of "-" for the unknown attribute values. The queried version numbers for both tuples are 1, taken from the refreshed version numbers for R2 and R3. Projecting the overall result over the view attributes adds one tuple of (b3,c1,f1) to the view. Thus the tuple (b3,c1,f1,1) is inserted into the view relation. The tuple (a2,b2,c2), which cannot retrieve any results from the view relation, will have to do so by sending the view maintenance query [π_C{(a2, b2, c2)}] ⋈ (R2 ⋈ R3) to data source 2. The result returned consists of the tuple (c2,d2,e2) from R2 and (e2,f2,g2) from R3, each with queried version number 1. Since the queried version numbers here correspond with the initial version numbers of both R2 and R3, there is no interfering update and compensation is not required. The tuple (b2,c2,f2,1) is inserted into the view.
Maintaining the View Without Querying All Base Relations. It is not necessary to know the complete view tuples before the view can be refreshed in some cases of modification or deletion updates. In this paper, a modification update that involves a change of any of the view's join attributes is handled as a deletion and an insertion update, because such updates will join with different tuples of the other relations after the modification. Otherwise, the modification update is handled as one type of update by our view maintenance algorithm. Applying Lemma 3 also serves to reduce the overall size of the queries and results transmitted.
Lemma 3. For a modification or deletion update ΔRi, if the key of Ri is in the view, then maintenance can be carried out by modifying or deleting the corresponding tuples of Ri in the view through using the key value of ΔRi, without the need to compute the complete view tuple.
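A minimal sketch of the Lemma 3 shortcut, assuming view tuples are kept as dicts and the key of Ri is retained in the view; the field names of the update record are illustrative only.

def maintain_by_key(view_rows, key_attr, update):
    """Refresh the view directly by the key of Ri, without computing view tuples.

    update: {'kind': 'delete' | 'modify', 'key': ..., 'changes': {...}} where
    'changes' holds new values of non-join attributes for a modification.
    """
    if update["kind"] == "delete":
        return [row for row in view_rows if row[key_attr] != update["key"]]
    for row in view_rows:               # modification of non-join attributes only
        if row[key_attr] == update["key"]:
            row.update(update["changes"])
    return view_rows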
5 Compensation for Missing Updates
Lemma 1, which was proposed in [7], is used to identify interfering updates in the case where the base relations are queried for incremental computation. Using the view relation for incremental computation creates a similar kind of problem, in that the view relation might not have been refreshed to the required state when it is accessed. We call these missing updates to differentiate them from the interfering updates.
Lemma 4. Extending Lemma 1, the following is added. If βj < αj, then this tuple (taken from the view relation) requires compensation with the updates from Rj of base relation version numbers βj + 1 to αj. These are the missing updates on the tuple.
5.1 Resolving Missing Updates
The compensation of a missing deletion update is to drop the deleted tuples from the query result, the compensation of a missing insertion update is to add the tuples of this insertion into the query result, and the compensation of a missing modification
update is simply to update the tuples from the unmodified state to the modified state. These are given in Lemmas 5, 6 and 7 respectively.
Lemma 5. Let µRj be the query result from Rj (retrieved from the view) for the incremental computation of ΔRi. To compensate the effect of a missing deletion update ΔRj on the result of the incremental computation of ΔRi, all tuples of ΔRj that are found in µRj are dropped, together with those tuples from the other base relations that were retrieved due to the original presence of ΔRj in µRj.
Lemma 6. Let µRj be the query result from Rj (retrieved from the view) for the incremental computation of ΔRi, and µRk be the query result from Rk. To compensate the effect of a missing insertion update ΔRj on the result of the incremental computation of ΔRi, and assuming that µRj is queried using the result from µRk, all tuples of ΔRj that can join with µRk are added to µRj, together with those tuples from the other base relations that should be retrieved due to the inclusion of ΔRj in µRj.
Lemma 7. Let µRj be the query result from Rj (retrieved from the view) for the incremental computation of ΔRi. To compensate the effect of a missing modification update ΔRj (which does not involve any change to the view's join attributes) on the result of the incremental computation of ΔRi, each old tuple (before the modification) of ΔRj that occurs in µRj has its values changed to the corresponding new tuple (after the modification).
Note that for both a missing deletion and a missing modification update ΔRj, if the key of Rj is not in the view, then the relation Rk whose key functionally determines the attributes of Rj will be used in applying Lemmas 5 and 7.
Theorem 1 gives the overall compensation process that is applied to resolve the maintenance anomalies.
Theorem 1. Given α1,...,αn as the initial version numbers of the incremental computation of update ΔRi, the compensation starts with those relations that are linked to Ri in the join graph, and proceeds recursively to the rest of the relations in the same sequence as they have been queried. The compensation on the query result from Rj proceeds by first compensating with the missing updates of base relation version numbers βj_min + 1 to αj, where βj_min is the minimum queried version number of the tuples in the result from Rj, using Lemmas 5, 6 and 7. This is followed by the compensation with the interfering updates of base relation version numbers βj_max down to αj + 1, where βj_max is the maximum queried version number of the tuples in the query result from Rj, using the method discussed in [7].
Theorem 2. To achieve complete consistency, the view is refreshed with the results of incremental computation in the same order as they have been queried.
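The ordering prescribed by Theorem 1 for one relation Rj can be sketched as below; apply_missing and undo_interfering stand in for the compensation of Lemmas 5–7 and the interfering-update compensation of [7], respectively, and are placeholders rather than the paper's actual routines.

def compensate_relation(queried_versions, alpha_j, apply_missing, undo_interfering):
    """Compensate the query result from Rj in the order given by Theorem 1.

    queried_versions: queried version numbers of the tuples in the result from Rj.
    alpha_j: initial version number of Rj for this incremental computation.
    """
    beta_min, beta_max = min(queried_versions), max(queried_versions)
    # first the missing updates, versions beta_min + 1 .. alpha_j (Lemmas 5-7)
    for v in range(beta_min + 1, alpha_j + 1):
        apply_missing(v)
    # then the interfering updates, versions beta_max down to alpha_j + 1 ([7])
    for v in range(beta_max, alpha_j, -1):
        undo_interfering(v)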
6 Comparison
Related works in this area are the Eager Compensation Algorithm (ECA and ECA^K) [10], the Strobe and C-Strobe algorithms [11], the work of [2], the SWEEP and Nested SWEEP algorithms [1], the work of [3], and the work of [7]. We compare these using a set of criteria, which are grouped into four categories. The first criterion under the environment category is the number of data sources. All the approaches, except ECA, ECA^K and [2], cater for multiple data sources. The second criterion is the handling of compensation. ECA and C-Strobe send compensating queries to the data sources, while the other algorithms handle compensation locally at the view site. The latter method is preferred, as compensating queries add to the overall traffic. The first criterion under the correctness category is the correct detection of interfering updates. The compensation methods of ECA, ECA^K and Strobe do not work through the detection of interfering updates, and hence they only achieve strong consistency [11]. C-Strobe does detect some interfering deletion updates which turn out to be non-interfering. The rest of the algorithms work by correctly detecting the presence of interfering updates when messages are not misordered or lost. The next criterion is the network communication assumption. All the approaches, except [7] and the work of this paper, assume that messages are never lost and misordered. [3] also does not assume that messages are never misordered. There are five criteria under the efficiency category. The first criterion is the number of base relations accessed per sub-query. Most of the approaches can only work by querying one base relation at a time. ECA and ECA^K can query all base relations of their single data source together. The method proposed in this paper is able to access multiple base relations within the same data source via the same query. The second criterion is the parallelism in the incremental computation of an update. Existing methods base their view maintenance querying on a left and right scan approach, and thus limit their parallelism to the two scans. In this paper, we use the join graph to guide the accessing of the base relations, and thus provide more parallelism. The third criterion is the parallelism in the incremental computation between different updates. ECA, ECA^K, Strobe and Nested SWEEP are able to process the incremental computation of different updates concurrently, but achieve only strong consistency. The rest of the methods have to process the incremental computation of different updates sequentially. [7] and the method in this paper can process the incremental computation concurrently and also achieve complete consistency. The fourth criterion is the use of partial self-maintenance. ECA^K, Strobe and C-Strobe have a limited form of partial self-maintenance in that deletion updates need not be processed for incremental computation. [2] can detect updates that will not affect the view. In this paper, we provide more opportunities for partial self-maintenance. The fifth criterion is the handling of modification as one type of update. Only [7] and the method in this paper consider modification as one type of update. The criteria under the application requirements category are the flexibility of the view definition, the quiescence requirement, and the level of consistency achieved.
ECA^K, Strobe and C-Strobe require that the key of each base relation be retained in the view. The number of base relations in [2] is limited to two. The others have no such requirement. C-Strobe, [2], SWEEP, [3], [7] and our approach achieve complete consistency, and also do not require a quiescent state before the view can be refreshed. This does not apply to ECA, ECA^K, Strobe and the Nested SWEEP Algorithm.
7 Conclusion
The use of data source and refreshed version numbers in the maintenance algorithm allows for partial self-maintenance, as well as the accessing of multiple base relations residing at the same data source within a single query. Also, the accessing of the base relations for incremental computation is based on the join graph to avoid cartesian products. Using the join graph to determine the query path also results in more parallelism. Knowledge of the referential integrity constraints imposed by the data sources is used to eliminate irrelevant updates. The overall performance of the maintenance algorithm is improved by reducing the number and size of messages sent over the network.
References
1. Agrawal, D., El Abbadi, A., Singh, A., Yurek, T.: Efficient View Maintenance at Data Warehouses. International Conference on Management of Data (1997) 417–427
2. Chen, R., Meng, W.: Efficient View Maintenance in a Multidatabase Environment. Database Systems for Advanced Applications (1997) 391–400
3. Chen, R., Meng, W.: Precise Detection and Proper Handling of View Maintenance Anomalies in a Multidatabase Environment. Conference on Cooperative Information Systems (1997)
4. Colby, L.S., Griffin, T., Libkin, L., Mumick, I.S., Trickey, H.: Algorithms for Deferred View Maintenance. International Conference on Management of Data (1996) 469–480
5. Griffin, T., Libkin, L.: Incremental Maintenance of Views with Duplicates. International Conference on Management of Data (1995) 328–339
6. Griffin, T., Libkin, L., Trickey, H.: An Improved Algorithm for the Incremental Recomputation of Active Relational Expressions. Knowledge and Data Engineering, Vol. 9 No. 3 (1997) 508–511
7. Ling, T.W., Sze, E.K.: Materialized View Maintenance Using Version Numbers. Database Systems for Advanced Applications (1999) 263–270
8. Qian, X., Wiederhold, G.: Incremental Recomputation of Active Relational Expressions. Knowledge and Data Engineering, Vol. 3 No. 3 (1991) 337–341
9. Quass, D.: Maintenance Expressions for Views with Aggregation. Workshop on Materialized Views: Techniques and Applications (1996)
10. Zhuge, Y., Garcia-Molina, H., Hammer, J., Widom, J.: View Maintenance in a Warehousing Environment. International Conference on Management of Data (1995) 316–327
11. Zhuge, Y., Garcia-Molina, H., Wiener, J.L.: The Strobe Algorithms for Multi-Source Warehouse Consistency. Conference on Parallel and Distributed Information Systems (1996)