Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board: Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Oscar Nierstrasz University of Berne, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen Dortmund University, Germany Demetri Terzopoulos New York University, NY, USA Doug Tygar University of California at Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA
3007
Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo
Jeffrey Xu Yu Xuemin Lin Hongjun Lu Yanchun Zhang (Eds.)
Advanced Web Technologies and Applications 6th Asia-Pacific Web Conference, APWeb 2004 Hangzhou, China, April 14-17, 2004 Proceedings
Volume Editors Jeffrey Xu Yu The Chinese University of Hong Kong Dept. of Systems Engineering & Engineering Management Shatin, New Territories, Hong Kong E-mail: [email protected] Xuemin Lin University of New South Wales, School of Computer Science & Engineering Sydney, NSW 2052, Australia E-mail: [email protected] Hongjun Lu Hong Kong University of Science and Technology, Dept. of Computer Science Clear Water Bay, Kowloon, Hong Kong E-mail: [email protected] Yanchun Zhang Victoria University, School of Computer Science and Mathematics Internet Technologies and Applications Research Lab (ITArl) Ballarat Road, Footscray, P.O. Box 14428, Melbourne City MC, VIC 8001, Australia E-mail: [email protected]
Library of Congress Control Number: 2004102546 CR Subject Classification (1998): H.3, H.4, H.5, C.2, K.4 ISSN 0302-9743 ISBN 3-540-21371-6 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag is a part of Springer Science+Business Media springeronline.com c Springer-Verlag Berlin Heidelberg 2004 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP-Berlin, Protago-TeX-Production GmbH Printed on acid-free paper SPIN: 10996393 06/3142 543210
In memory of Yahiko Kambayashi (1943–2004)

It is with great sorrow that we received the sad news of the sudden and untimely passing of Prof. Yahiko Kambayashi, just before these proceedings went to print. Prof. Kambayashi was an IEEE Fellow, a trustee of the VLDB Endowment (1991–1996), and a winner of the 1995 ACM SIGMOD Contribution Award. All of us know how much he meant to the database community, especially to the database researchers in the Asia-Pacific region. His sudden passing is not only a tragic loss to his family but also a great loss to the whole international database community. Despite his busy schedule, Prof. Kambayashi served as a program committee member for this conference. Together with all the members of the organization and program committees, we would like to take this opportunity to express our heartfelt condolences and our deepest sympathy to his family. February 6, 2004
Hongjun Lu, Yanchun Zhang, Jeffrey X. Yu, Xuemin Lin
Preface
The Asia-Pacific region has emerged in recent years as one of the fastest growing regions in the world in the use of Web technologies, as well as in making significant contributions to WWW research and development. Since the first Asia-Pacific Web conference in 1998, APWeb has continued to provide a forum for researchers, professionals, and industrial practitioners from around the world to share their rapidly evolving knowledge and to report new advances in WWW technologies and applications.

APWeb 2004 received an overwhelming 386 full-paper submissions, including 375 research papers and 11 industrial papers from 20 countries and regions: Australia, Canada, China, France, Germany, Greece, Hong Kong, India, Iran, Japan, Korea, Norway, Singapore, Spain, Switzerland, Taiwan, Turkey, UK, USA, and Vietnam. Each submission was carefully reviewed by three members of the program committee. Among the 386 submitted papers, 60 regular papers, 24 short papers, 15 poster papers, and 3 industrial papers were selected for inclusion in the proceedings. The selected papers cover a wide range of topics, including Web services, Web intelligence, Web personalization, Web query processing, Web caching, Web mining, text mining, data mining and knowledge discovery, XML database and query processing, workflow management, E-commerce, data warehousing, P2P systems and applications, Grid computing, and networking. The paper entitled "Towards Adaptive Probabilistic Search in Unstructured P2P Systems", co-authored by Linhao Xu, Chenyun Dai, Wenyuan Cai, Shuigeng Zhou, and Aoying Zhou, received the APWeb 2004 best student paper award.

The proceedings also include 3 keynote papers by three high-profile researchers, Philip S. Yu (IBM Thomas J. Watson Research Center, USA), Maria E. Orlowska (University of Queensland, Australia), and Arun Sharma (National ICT Australia), and three invited talks by Hui-I Hsiao (IBM Almaden Research Center, USA), Xiaofang Zhou (University of Queensland, Australia), and Wei-Ying Ma (Microsoft Research Asia).

The conference received financial support from the National Natural Science Foundation of China, Microsoft, the Chinese University of Hong Kong, the University of New South Wales, Australia, and National ICT Australia. We are grateful to the program committee members, who helped tremendously in reviewing a large number of submissions in a short period of time, in particular during the Christmas season. We appreciate the work of the members of the APWeb 2004 organizing committees, and of Zheng Liu in setting up and managing the Web site and the paper submission/review system. We would also like to thank Fudan University and Zhejiang University for organizing the conference. February 2004
Jeffrey Xu Yu Xuemin Lin Hongjun Lu Yanchun Zhang
Organization Committee

General Chair
Hongjun Lu, Hong Kong University of Science and Technology, China

Program Committee Co-chairs
Xuemin Lin, University of New South Wales, Australia
Jeffrey Xu Yu, Chinese University of Hong Kong, China

Panel Chair
Sang Kyun Cha, Seoul National University, Korea

Tutorial Chair
Xiaofang Zhou, University of Queensland, Australia

Publication Chair
Yanchun Zhang, Victoria University of Technology, Australia

Publicity Chair
Yoshiharu Ishikawa, University of Tsukuba, Japan

Industrial Chair
Mukesh Mohania, IBM India Research Lab, India

Exhibition/Demo Chairs
Jiajin Le, Donghua University, China
Jian Chen, Netec Corp., China

Organization Co-chairs
Aoying Zhou, Fudan University, China
Zhaohui Wu, Zhejiang University, China
Program Committee
Toshiyuki Amagasa, Nara Institute of Science and Technology, Japan
Boualem Benatallah, University of New South Wales, Australia
Sourav S. Bhowmick, Nanyang Technological University, Singapore
Ulrik Brandes, University of Konstanz, Germany
Stephane Bressan, National University of Singapore, Singapore
Jinli Cao, La Trobe University, Australia
Wojciech Cellary, Poznan U. of Economics, Poland
Ji-Woong Chang, Korea Advanced Institute of Science and Technology, Korea
Akmal Chaudhri, IBM DeveloperWorks, USA
Arbee L.P. Chen, National Tsing Hua University, Taiwan
Hanxiong Chen, University of Tsukuba, Japan
Jian Chen, Netec Network Technology, China
Ming-Syan Chen, National Taiwan University, Taiwan
Yi-Ping Phoebe Chen, Deakin University, Australia
Kai Cheng, Kyushu Sangyo University, Japan
David Cheung, Hong Kong University, China
Guozhu Dong, Wright State University, USA
Xiaoyong Du, Renmin University of China, China
Wenfei Fan, Bell Laboratories, Lucent Technologies, USA
Marie-Christine Fauvet, Université Joseph Fourier, France
Ling Feng, University of Twente, Netherlands
David Hawking, CSIRO, Australia
Jingyu Hou, Deakin University, Australia
Hui Hsiao, IBM Almaden Research Center, USA
Joshua Huang, E-Business Technology Institute, Hong Kong University, China
Yahiko Kambayashi, Kyoto University, Japan
Kamal Karlapalem, IIIT, Hyderabad, India
Markus Kirchberg, Massey University, New Zealand
Hiroyuki Kitagawa, University of Tsukuba, Japan
Masaru Kitsuregawa, University of Tokyo, Japan
Yasushi Kiyoki, Keio University, Japan
Zoe Lacroix, Arizona State University, USA
Chiang Lee, National Cheng Kung University, Taiwan
Dik Lee, Hong Kong University of Science and Technology, China
Hong Thomas Lee, University of Pennsylvania, USA
Chen Li, University of California (Irvine), USA
Jianzhong Li, Harbin University of Technology, China
Qing Li, City University of Hong Kong, China
Xue Li, University of Queensland, Australia
Weifa Liang, Australian National University, Australia
Bing Liu, University of Illinois at Chicago, USA
Chengfei Liu, South Australia University, Australia
Huan Liu, Arizona State University, USA
Jiming Liu, Baptist University of Hong Kong, China
Ee Peng Lim, Nanyang Technological University, Singapore
Qiong Luo, Hong Kong University of Science and Technology, China
Wei-Ying Ma, Microsoft Research Asia, China
Sanjay Madria, University of Missouri-Rolla, USA
Weiyi Meng, Binghamton University, USA
Xiaofeng Meng, Renmin University of China, China
Wee Keong Ng, Nanyang Technological University, Singapore
Beng Chin Ooi, National University of Singapore, Singapore
Maria Orlowska, University of Queensland, Australia
Jian Pei, University at Buffalo, State University of New York, USA
Pearl Pu, EPFL, Switzerland
Depei Qian, Xi'an Jiaotong University, China
Gitesh Raikundalia, Victoria University, Australia
Keun Ho Ryu, Chungbuk National University, Korea
Shazia Sadiq, University of Queensland, Australia
John Shepherd, University of New South Wales, Australia
Timothy K. Shih, Tamkang University, Taiwan
Myra Spiliopoulou, University of Magdeburg, Germany
Jianwen Su, University of California, Santa Barbara, USA
Kian Lee Tan, National University of Singapore, Singapore
Changjie Tang, Sichuan University, China
Alexei Tretiakov, Massey University, New Zealand
Guoren Wang, Northeastern University, China
Ke Wang, Simon Fraser University, Canada
Min Wang, IBM T.J. Watson Research Center, USA
Wei Wang, University of North Carolina, USA
X. Sean Wang, University of Vermont, USA
Raymond Wong, University of New South Wales, Australia
Wanlei Zhou, Deakin University, Australia
Jian Yang, Tilburg University, Netherlands
Yun Yang, Swinburne University of Technology, Australia
Ge Yu, Northeastern University, China
Shuigeng Zhou, Fudan University, China
Qiang Zhu, University of Michigan, USA
Additional Reviewers Yu Bei Cui Bin Christophe Brouard Humberto Cervantes Chee-Yong Chan Linus Chang Chao-chun Chen Chung-Shien Chen Ing-Hsien Chen Tson-Shien Chen Jiefeng Cheng Yu-Chi Chung Lin Dan Lin Deng Nathalie Denos Ming Dong Takeharu Eda Markus Eiglsperger Wei Fan Luodao Feng Daniel Fleischer Xiang Fu Like Gao Cagdas Evren Gerede Torsten Grust Lifang Gu Yuelong Gu Jinhua Guo Wook-Shin Han Carmem S. Hara Sven Hartmann Kenji Hatano Hai He Shen Hengtao John Horwood Dongdong Hu Haibo Hu Xiaodi Huang Joshua W. Hui Kuo-Tin Hwang Yuan-Ke Hwang Tok Wee Hyong Yoshiharu Ishikawa Ravindranath Jampani
Huidong Jin Liang Jin Roland Kaschek Dao Dinh Kha Won-Young Kim Kinshuk Huaizhong Kou Christian A. Lang Sang-Ho Lee Young-Koo Lee Erwin Leonardi Jürgen Lerner Hongchen Li Jia Li Lee Mong Li Ning Li Wenyuan Li Yingjiu Li Zhao Li Daniel Lieuwen Bin Lin Chen Ling Sebastian Link Huan Liu Pei-Lin Liu Gan Long Guojun Lu Hongen Lu Jianguo Lu Shiyong Lu Jianhua Lv Mehregan Mahdavi Brett Matson George Mihaila Jun Miyazaki Hoda Mokhtar Atsuyuki Morishima Akiyo Nadamoto Miyuki Nakano Jianpo Ou P. Sreelakshmi Chaoyi Pang Chong-Mok Park Young-Ho Park
Chang-shing Perng Willy Picard Sandeep Prakash Iko Pramudiono Supawan Prompramote Gang Qian Weining Qian Georges Quenot Faizal Riaz-ud-din Chun Ruan Zhang Rui Jarogniew Rykowski Atul Sajjanhar Thomas Schank Klaus-Dieter Schewe Nalin Sharda Jialie Shen Hao Shi Wei Shi Wei-Feng Shih Khalil Shihab Norihide Shinagawa Michal Shmueli-Scheuer Yanfeng Shu Kirack Sohn Gao Song Yuqing Song I-Fan Su Aixin Sun Bing Sun Yu Suzuki Roger Tagg Katsumi Takahashi Qingzhao Tan Xuan Tian Parag Tijare Trystan Upstill Jhansi Rani V Soujanya Vadapalli Satya Valluri Rishi Vardhan German Vegas Jorge Villalobos Millist Vincent
Yingyu Wan Haixun Wang Wenqiang Wang Yitong Wang Yu Wang Daling Wang Chiemi Watanabe Khin Myo Win Wojciech Wiza Wing Sing (Dicky) Wong
Ming Xiong Linhao Xu Jiong Yang Weiwen Yang Xiaochun Yang Chih-Kang Yeh Kuo-Her Yen Lei Yi Shui Yu Xingbo Yu
Kun Yue Jing Zhao Keping Zhao Qiankun Zhao Qi Zhong Xiangmin Zhou Xuan Zhou Manli Zhu Xingquan Zhu
Monitoring Continual Range Queries Philip S. Yu, Kun-Lung Wu, and Shyh-Kwei Chen IBM T.J. Watson Research Center 19 Skyline Drive Hawthorne, NY 10532 {psyu,klwu,skchen}@us.ibm.com
Abstract. This paper provides an overview of a query indexing method, called VCR indexing, for monitoring continual range queries. A VCR-based query index enables fast matching of events against a large number of range predicates. We first describe VCR indexing for general event matching against a set of 2D range predicates. We then show how VCR indexing can be used for efficient processing of continual range queries over moving objects. VCR stands for virtual construct rectangle. A set of VCRs is predefined, each with a unique ID. Each region defined by a range predicate is decomposed into, or covered by, one or more activated VCRs. The predicate ID is then stored in the ID lists associated with these activated VCRs. The use of VCRs provides an indirect and cost-effective way of pre-computing the search result for any given event or object position. As a result, event matching becomes very efficient.
1 Introduction
Without the constraints of geographic distance, the World Wide Web has become a natural platform for managing many human activities. The Web has been widely used to access and deliver information and services, conduct business transactions, and form special user communities, such as message boards and chat rooms. With a universal Web browser, anyone from anywhere can access the Web independently at any time. While this flexibility and ease of information production and sharing is good, it is equally difficult or daunting to navigate, collect, and track relevant data in this dynamic and open space. The problem is further aggravated as most of the data of interest change unpredictably. To keep abreast of what is happening, there is a strong need for a system to automatically monitor events on behalf of the users. For example, a stock investor might want to be alerted when a certain stock, such as IBM, reaches a certain price range. A national security analyst might want to be alerted when a set of keywords are frequently used on a message board. Another example is the news alert service: when a news article containing certain keywords is published, alerts can be sent to those who are interested in it. The Web is naturally suited to deploy such a monitoring system. Users from anywhere can simply specify their desires via the Web. Once specified, user interests become long-running queries, which must be monitored continually. We call them continual queries. These continual queries
are usually expressed as predicates on a set of attributes. A service provider monitors these predicates against a stream of events, such as the current stock prices, the newest sensor readings, the current interest rates, and the most recent positions of a set of moving objects. Once an event is matched with a subset of the predicates, proper actions can be taken. Alerts can be sent to the users via e-mails or cell-phones. Specific actions, such as buying or selling stocks, can also be triggered. One of the most critical components of monitoring continual queries is the fast matching of events against the predicates. Generally, an efficient predicate index is needed. Note that, in many places of this paper, we use “predicate” and “query” interchangeably because continual queries are defined and monitored based on their predicates. Prior work for fast event matching has mostly focused on building predicate indexes with equality-only clauses [1,2,3]. However, many predicates contain non-equality range clauses, such as intervals or regions. For example, continual queries are usually expressed as rectangles or circles in a 2D space for an application to monitor which moving objects are positioned within a region [5]. In general, continual queries concerning non-integer attributes, such as stock prices, interest rates and keyword frequencies, are likely to be expressed as range predicates. It is difficult to construct an effective index for multidimensional range predicates. It is even more challenging if these predicates are highly overlapping, as they usually are because people often share similar interests. For example, people tend to be interested in the current price ranges of individual stocks. As a result, the range predicates of their interests are likely to be highly overlapping. Although multidimensional range predicates can be treated as spatial objects, a typical spatial index, such as an R-tree or any of its variants, is generally not effective for fast event matching [6]. As has been pointed out in [6] and [7], this is because an R-tree quickly degenerates if spatial objects overlap [7,8]. In this paper, we provide an overview of an efficient query indexing method, called VCR indexing, for monitoring continual range queries. VCR stands for virtual construct rectangles. A VCR is a shingle-like or tile-like object that is conventionally laid to cover an area, such as the roof-top of a building or the floor of a room. A set of VCRs is predefined, each with a unique ID. Each region formed by a 2D range predicate is decomposed into one or more activated VCRs. In other words, one or more activated VCRs are used to completely cover the region. The predicate ID is then inserted into the ID lists associated with those covering VCRs. The use of VCRs provides an indirect and cost-effective way of pre-computing the search result for any given event. Because of pre-computation, the search operation becomes a simple and systematic enumeration of all the VCRs that contain an event point. As a result, event matching or searching is very efficient. We first provide an overview of VCR indexing for general event matching against 2D range predicates. Then we show how VCR indexing can be used to develop an incremental algorithm for processing a set of continual range queries over moving objects. The query results can be used to provide location-aware
services in many mobile applications. For example, a continual range query can be defined as follows: "Find all the taxi cabs currently positioned within a 2-mile radius of the train station at White Plains, NY." If this query is evaluated continually, then its result can be used by a taxi cab company to quickly dispatch a nearby taxi to a customer at the train station. The paper is organized as follows. Section 2 describes VCR indexing for 2D range predicates. Section 3 shows how to use VCR indexing for efficient processing of continual range queries over moving objects. Section 4 discusses related work. Section 5 summarizes the paper.
2 VCR Index for 2D Range Predicates
In this section, we provide an overview of a VCR indexing method for fast event matching against a set of 2D range predicates. We describe the definition of VCR, the predicate insertion and deletion algorithms, and how to perform fast event matching using a VCR index. For further details, including the performance studies of VCR, readers are referred to [9].

2.1 Preliminaries
We assume that predicates are conjunctions of two intervals involving attributes X and Y. For simplicity, assume that the attribute ranges are 0 ≤ X < R_x and 0 ≤ Y < R_y, respectively. We assume the XY attribute region is partitioned into R_x × R_y virtual grid cells. These integer grid points are used to define VCRs and to specify the regions defined by the 2D range predicates. Event points can be anywhere in the region, including non-integer positions. For example, in a 2D space, the spatial region for specifying a user interest is defined with integer grids, but object positions can be anywhere. For applications where the regions defined by the predicates must be specified with non-integers, we can first expand the region to the nearest integers and then perform a check against the search result obtained from VCR indexing.

2.2 Virtual Construct Rectangles
For each integer grid point (a, b), where 0 ≤ a < R_x and 0 ≤ b < R_y, we define a set of B virtual construct rectangles, or VCRs. These B VCRs share the common bottom-left corner at (a, b) but have different shapes and sizes. We assume the shapes of VCRs are rectangular or square-only, and that the maximum side lengths of a VCR are L_x = 2^k and L_y = 2^k, where k is an integer. If l_x and l_y are the side lengths of a VCR, then l_x = 2^i and l_y = 2^j, where 0 ≤ i ≤ k, 0 ≤ j ≤ k, and i and j are integers. These B VCRs are defined such that the region defined by a range query can be strictly covered by one or more VCRs. Fig. 1 shows examples of (a) rectangular VCRs and (b) square-only VCRs when L_x = L_y = 2. For rectangular VCRs, there are four different sizes (i.e., B = B_r = 4): 1 × 1, 2 × 1, 1 × 2 and 2 × 2. For square-only VCRs, there are only two sizes (i.e., B = B_s = 2): 1 × 1 and 2 × 2. Note that rectangular VCRs also contain squares.

Fig. 1. Assigning IDs to VCRs sharing the same bottom-left corner for (a) rectangular VCRs and (b) square-only VCRs.

Each VCR has a unique ID. For rectangular VCRs, it can be computed as follows:

V_r(a, b, 2^i, 2^j) = B_r(a + b·R_x) + j(k + 1) + i,   (1)

where (a, b) is the bottom-left corner of the VCR whose side lengths are 2^i and 2^j. The first term is derived by horizontally scanning the integer grid points from (0, 0) to (R_x − 1, 0), then from (0, 1) to (R_x − 1, 1), ..., until (a − 1, b). There are (a + b·R_x) such grid points, and for each grid point there are B = B_r VCRs defined. The second term is derived by the ID assignment shown in Fig. 1(a). Similarly, for square-only VCRs, the ID of a VCR (a, b, 2^i, 2^i) is calculated as follows:

V_s(a, b, 2^i, 2^i) = B_s(a + b·R_x) + i.   (2)
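To make the ID scheme concrete, here is a short Python sketch (our own illustration, not code from the paper) that evaluates Eqs. (1) and (2); it uses the fact, implied by the size enumeration above, that B_r = (k+1)^2 rectangular and B_s = k+1 square-only VCRs are defined per grid point:

def vcr_id_rect(a, b, i, j, Rx, k):
    # Eq. (1): ID of the rectangular VCR with bottom-left corner (a, b) and sides 2^i x 2^j.
    Br = (k + 1) * (k + 1)            # rectangular VCRs per grid point
    return Br * (a + b * Rx) + j * (k + 1) + i

def vcr_id_square(a, b, i, Rx, k):
    # Eq. (2): ID of the square-only VCR with bottom-left corner (a, b) and side 2^i.
    Bs = k + 1                        # square-only VCRs per grid point
    return Bs * (a + b * Rx) + i

# For the Fig. 1 setting (k = 1, i.e., Lx = Ly = 2) and, say, Rx = 16:
# vcr_id_rect(0, 0, 0, 0, 16, 1) == 0   (the 1 x 1 VCR at the origin)
# vcr_id_rect(0, 0, 1, 1, 16, 1) == 3   (the 2 x 2 VCR at the origin)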
Note that these VCRs are virtual. A virtual VCR becomes activated when it is used to cover the region of a continual range query. Even though there are B·R_x·R_y virtual VCRs, there are far fewer activated ones. The total number of activated VCRs depends on the distribution of the query regions and the strategy used to cover these regions, which we will describe next.

2.3 Predicate Insertion and Deletion
Let (a, b, w, h) represent a rectangle whose bottom-left corner sits at (a, b), whose width is w, and whose height is h. To insert a range query q, defined as (a, b, w, h), we first decompose it into one or more activated VCRs. In other words, one or more VCRs are activated to cover the region formed by (a, b, w, h). Then the query ID q is inserted into each of the ID lists associated with the covering VCRs. Hence, a VCR index maintains a direct mapping between a VCR and all the queries that contain the VCR. There are many decomposition methods. Here, we describe two of them: simple decomposition (SD) and overlapped decomposition (OD).
Fig. 2. Pseudo code for a simple decomposition of a predicate rectangle (a, b, w, h).

Simple decomposition. In simple decomposition, we cover a 2D query region with VCRs following a strategy similar to the one used to cover a floor with tiles. First, we use the largest VCR that can fit into the region; then we use smaller VCRs to decompose the rest. Adjacent VCRs touch each other only on the edges. Fig. 2 shows the pseudo code for simple decomposition (SD). SD first creates a working rectangle, which is initialized to (a, b, w, h). Then it cuts a strip rectangle from the bottom of the working rectangle and moves upwards until the working rectangle is completely removed. The width of a strip rectangle is w, the same as that of the input predicate, and its height is maxVCRh(H_w), the maximal VCR height that is less than or equal to H_w (the height of the working rectangle). For example, if H_w = 5 and L_y = 8, the height of the strip rectangle is 4, because 4 is the maximal VCR height that is less than 5; if H_w = 10 and L_y = 8, the height of the strip rectangle is 8. For each strip rectangle, we find the largest VCR that has the same height as the strip rectangle and use that VCR to cut the strip rectangle; this process is repeated until the strip rectangle is completely removed. At the end, the set of decomposed VCRs is returned. As an example, Fig. 3(a) shows the result of using SD to decompose the predicate rectangle (3, 3, 11, 6), assuming L_x = L_y = 4. It is first partitioned into two strip rectangles, (3, 3, 11, 4) and (3, 7, 11, 2), and each strip rectangle is then decomposed into 4 VCRs, so (3, 3, 11, 6) is decomposed into a total of 8 VCRs of different sizes. The overlap among them is minimal; it occurs only on the boundary lines.
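Since the Fig. 2 pseudo code itself is not reproduced here, the following Python sketch is our own reconstruction of the strip-based procedure from the description above (the function names and the integer-rectangle assumption are ours):

def max_vcr_side(length, max_exp):
    # Largest power-of-two side (capped at 2^max_exp) that fits into 'length' (length >= 1).
    e = 0
    while e < max_exp and 2 ** (e + 1) <= length:
        e += 1
    return 2 ** e

def simple_decompose(a, b, w, h, kx, ky):
    # Cover the rectangle (a, b, w, h) with non-overlapping VCRs of sides 2^i x 2^j,
    # where i <= kx and j <= ky (i.e., Lx = 2^kx, Ly = 2^ky).
    vcrs = []
    y = b
    while y < b + h:
        strip_h = max_vcr_side(b + h - y, ky)        # height of the current strip rectangle
        x = a
        while x < a + w:
            vcr_w = max_vcr_side(a + w - x, kx)      # widest VCR of this height that still fits
            vcrs.append((x, y, vcr_w, strip_h))
            x += vcr_w
        y += strip_h
    return vcrs

# simple_decompose(3, 3, 11, 6, 2, 2) yields 8 VCRs, matching the Fig. 3(a) example.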
Fig. 3. Examples of (a) simple decomposition, SD(3, 3, 11, 6) → 8 VCRs of different sizes, and (b) overlapped decomposition, OD(3, 3, 11, 6) → 6 VCRs of the same size.
Overlapped decomposition. In contrast to SD, overlapped decomposition (OD) uses the same-sized VCR to decompose a given predicate region. OD is very similar to covering a rooftop with same-sized shingles, where overlapping shingles are allowed. OD creates strip rectangles and uses the largest VCR to decompose each strip rectangle in much the same way as SD. The difference between OD and SD lies in how they handle the remnants of a strip rectangle and of a working rectangle: overlapping VCRs are allowed in OD. To achieve this, the left boundary of the last VCR is allowed to shift backward so that the same-sized VCR is used in the decomposition. Similarly, the bottom of the last strip rectangle is allowed to shift downward so that the last strip rectangle has the same height as the other strip rectangles. As an example, Fig. 3(b) shows the result of using OD to decompose the same predicate rectangle (3, 3, 11, 6). In contrast with Fig. 3(a), only a 4 × 4 VCR is used for the decomposition and there are only 6 decomposed VCRs, instead of 8 as in SD. Compared with SD, OD better facilitates the sharing and reuse of decomposed VCRs among predicates, and it reduces the number of activated VCRs. Fewer activated VCRs also make it more effective to reduce the storage requirement via VCR hashing, which is described next.

VCR hashing. For each activated VCR, we maintain a predicate ID list that keeps the IDs of all queries using the VCR in their decompositions. A straightforward way to access the predicate ID list of a given VCR is to maintain an array of pointers, with the array index corresponding to the VCR ID. Because the total number of VCRs is B·R_x·R_y, the storage requirement for this array of pointers can be large. Note, however, that VCRs are virtual: we only need to maintain the activated ID lists, and in practice the total number of activated VCRs is likely to be moderate, particularly when predicates are highly overlapping. To reduce the storage requirement, hashing can be used to maintain the activated predicate ID lists. To ensure an even distribution of hash values, the VCR ID is first randomized and then divided by the hash size, a prime number, to obtain the hash location (a toy sketch of this step is given at the end of this subsection). The randomization step in the hash computation is important because activated VCRs tend to be clustered, due to the fact that predicates are highly overlapping. Once VCR hashing is employed, search inevitably slows down a little because of the hash computation, but this is the trade-off for limiting the storage requirement; efficient search can still be maintained by keeping hash collisions to a minimum.

Predicate deletion. To delete a predicate from the system, it is decomposed into one or more VCRs, similar to predicate insertion. Then the predicate ID is removed from each of the ID lists associated with the decomposed VCRs.
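The toy Python sketch below illustrates the hashing step mentioned above; the multiplicative randomization constant, table size, and bucket layout are our own choices and are not specified by the authors:

HASH_SIZE = 100003                     # a prime table size, chosen here only for illustration

def vcr_hash(vcr_id):
    # Randomize the VCR ID before taking it modulo a prime, so that clustered
    # (highly overlapping) activated VCRs still spread evenly over the table.
    scrambled = (vcr_id * 2654435761) & 0xFFFFFFFF   # Knuth-style multiplicative scrambling
    return scrambled % HASH_SIZE

buckets = {}                           # hash location -> list of (vcr_id, predicate_id_list)

def predicate_id_list(vcr_id):
    # Return (creating on demand) the ID list of an activated VCR;
    # only activated VCRs ever consume storage.
    slot = buckets.setdefault(vcr_hash(vcr_id), [])
    for vid, plist in slot:
        if vid == vcr_id:
            return plist
    plist = []
    slot.append((vcr_id, plist))
    return plist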
2.4 Matching an Event against Predicates
An event is equivalent to a data point. To find all the predicates matching an event is to find all the predicates whose IDs are stored in the ID lists associated with the activated VCRs that contain that data point. Because of the way VCRs are defined, it is rather simple to enumerate such covering VCRs using the concept of a covering VCR set.

Covering VCR set. Let CV(x, y) denote the covering VCR set of a point (x, y), where a < x < a + 1 and b < y < b + 1, and a and b are integer grid points. CV(x, y) contains all the VCRs that can possibly cover point (x, y). Graphically, CV(x, y) contains the set of VCRs that have bottom-left corners in the shaded region southwest of (a, b) and upper-right corners in the shaded region northeast of (a + 1, b + 1), as shown in Fig. 4. The covering VCR sets share two important properties: constant size and identical gap pattern.¹ First, the number of VCRs in a covering VCR set is the same for all event points; namely, |CV(x_1, y_1)| = |CV(x_2, y_2)| even if x_1 ≠ x_2 and y_1 ≠ y_2. Second, if the VCRs are sorted, the ID gap between any two VCRs in matching positions is identical for any two covering VCR sets. Let V_{i,(x,y)} denote the ID of a VCR covering (x, y), with V_{i+1,(x,y)} > V_{i,(x,y)} for 1 ≤ i < |CV(x, y)|; that is, the VCR IDs are in sorted order. The identical gap pattern property can then be stated as V_{i+1,(x_1,y_1)} − V_{i,(x_1,y_1)} = V_{i+1,(x_2,y_2)} − V_{i,(x_2,y_2)}, for 1 ≤ i < |CV(x_1, y_1)| and any two points (x_1, y_1) and (x_2, y_2). These two properties can be verified by grouping all the drawings in Fig. 4 together as a unit and moving it around: when the center is moved from (x_1, y_1) to another point (x_2, y_2), the relative positions of all the covering VCRs stay the same. Note that the gaps between different pairs of adjacent VCRs within a covering VCR set need not be the same; it is the gap pattern that is identical across all covering VCR sets.

¹ For ease of discussion, we focus on a search point (x, y) not under the boundary conditions, i.e., L_x ≤ x ≤ R_x − L_x − 1 and L_y ≤ y ≤ R_y − L_y − 1. Points under the boundary conditions, i.e., 0 ≤ x < L_x or R_x − L_x − 1 < x < R_x or 0 ≤ y < L_y or R_y − L_y − 1 < y < R_y, can be handled similarly.
Fig. 4. An example of the covering VCR set for data point (x, y).

Search algorithm. With these two properties, we can design an efficient algorithm for enumerating the covering VCR set of an event point (x, y). We first define a pivot point P located at (x + 1 − L, y + 1 − L), and a pivot VCR V_p defined as (x + 1 − L, y + 1 − L, 2^0, 2^0); namely, the bottom-left corner of V_p is at the pivot point P and V_p is a unit square. Then we pre-compute a difference array D, which stores the differences between the IDs of neighboring VCRs in a sorted covering VCR set. With D and the pivot VCR V_p, we can enumerate all the covering VCRs by simple additions of the pivot VCR ID to each element stored in D.
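The following Python sketch is our own illustration of this search; it uses the square-only variant of Sect. 2.2 to stay short, and it stores offsets relative to the pivot VCR rather than neighbor-to-neighbor gaps, which is an equivalent use of the identical-gap-pattern property. The offset array is computed once, by brute force, for an arbitrary non-boundary reference point; every subsequent search just adds the pivot VCR ID of the query point to each stored offset.

def _covering_ids_bruteforce(x, y, Rx, k):
    # All square-only VCR IDs whose square contains (x, y); used only once, offline.
    Bs, L = k + 1, 2 ** k
    ids = []
    for a in range(int(x) - L + 1, int(x) + 1):
        for b in range(int(y) - L + 1, int(y) + 1):
            for i in range(k + 1):
                s = 2 ** i
                if a <= x < a + s and b <= y < b + s:
                    ids.append(Bs * (a + b * Rx) + i)
    return sorted(ids)

def precompute_offsets(Rx, k):
    # Offsets from the pivot VCR are the same for every non-boundary point, so one
    # reference point suffices (here (L + 0.5, L + 0.5), assuming Rx and Ry are large
    # enough that this is not a boundary point).
    L = 2 ** k
    x0 = y0 = L + 0.5
    pivot = (k + 1) * ((int(x0) + 1 - L) + (int(y0) + 1 - L) * Rx)   # unit square at the pivot point
    return [v - pivot for v in _covering_ids_bruteforce(x0, y0, Rx, k)]

def covering_ids(x, y, Rx, k, offsets):
    # Enumerate CV(x, y) with |CV| additions: pivot VCR ID plus each precomputed offset.
    L = 2 ** k
    pivot = (k + 1) * ((int(x) + 1 - L) + (int(y) + 1 - L) * Rx)
    return [pivot + d for d in offsets]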
3 Processing Continual Range Queries over Moving Objects
The VCR indexing method described in Section 2 can also be used for efficient processing of continual range queries over moving objects. These range queries basically track which moving objects are currently located inside their respective regions. Up-to-date results from these continual queries can be used to provide many location-aware services. For example, a retail store in a shopping mall can distribute timely promotional e-coupons to a customer's cell-phone or PDA using the result of a continual range query like: "Find all the cell-phones or PDAs located within a 1/2-mile radius of my store." With a query index, such as the VCR index, processing of continual range queries proceeds as follows. Periodically, each object position is first used to search the query index to find all the range queries that contain the object. The object ID is then inserted into the results associated with the matched queries. After every object position is processed, we have the most up-to-date results for all the range queries. In order not to have stale results, it is important that the time it takes to perform the periodic query reevaluation be brief. This can be achieved as follows. First, each search on the query index must be efficient. Second, the query index must also allow the query reevaluation to take advantage of incremental changes
in object positions. Namely, certain object positions need not be processed. The VCR index has both properties. Object positions that have not moved outside a VCR do not need to be processed. In this section, we show how to use the VCR index to perform incremental query reevaluation. Further details, including performance studies, can be found in [5]. Note that, besides the x- and y-coordinates, different attributes can be used to define the positions of moving objects. For example, the price difference and the size of a trade can be used to define the moving position of a stock. The frequencies of a set of selected keywords in the message contents published by a person can also be used to define his/her moving position. Various continual range queries can be defined to monitor and track these moving objects.

3.1 Incremental Query Reevaluation
Here, we assume a VCR index is maintained for the query regions using the algorithms described in Section 2. Query results are maintained in an array of object lists, one for each query. Let OL(q) denote the object list for query q; OL(q) contains the IDs of all objects that are inside the boundaries of q. Periodically, we recompute all OL(·)'s, considering the changes in object locations since the last reevaluation. If the period between two consecutive reevaluations is short, many objects may not have moved outside the VCR boundaries. As a result, many of the search computations can be saved. The use of covering VCRs provides a convenient way to capitalize on the incremental changes in object movements. Algorithm SQI IR, described in Fig. 5, shows how to use a VCR index for periodic query reevaluation; IR stands for Incremental Reevaluation. We assume that the object locations used in the last reevaluation are available; these are referred to as the old locations, in contrast to the new locations used for the current reevaluation. For each o_i ∈ O, where O is the set of moving objects, if the location of o_i, denoted L(o_i), has not been updated since the last reevaluation, nothing needs to be done for this object. For an object whose location has been updated, we compute two covering VCR sets: CV_new(o_i) with the new location data and CV_old(o_i) with the old location data. When an object has moved, we need to consider three cases: (1) it has moved into a new VCR; (2) it has moved out of an old VCR; (3) it has remained inside the same old VCR. With both CV_new(o_i) and CV_old(o_i), we can easily identify the VCRs under each case. For any VCR v_k that is in the new covering VCR set but not the old, i.e., v_k ∈ CV_new(o_i) − CV_old(o_i), we insert an instance of o_i into the OL(q) list for every q ∈ QL(v_k), where QL(v_k) is the query list associated with VCR v_k. This accounts for the case that o_i has moved into these VCRs. On the other hand, for a VCR v_j that is in the old covering VCR set but not the new, i.e., v_j ∈ CV_old(o_i) − CV_new(o_i), we delete an instance of o_i from the OL(q) list for every q ∈ QL(v_j). This accounts for the case that o_i has moved out of these VCRs. For any VCR that is in both covering VCR sets, nothing needs to be done; this accounts for the case that o_i has remained inside the boundaries of these VCRs.
Algorithm SQI IR
for (i = 0; o_i ∈ O; i++) {
    if (L(o_i) has not been updated) { continue; }
    compute CV_new(o_i);
    compute CV_old(o_i);
    for (k = 0; v_k ∈ CV_new(o_i) − CV_old(o_i); k++) {
        q = QL(v_k);
        while (q ≠ NULL) { insert(o_i, OL(q)); q = q→next; }
    }
    for (k = 0; v_k ∈ CV_old(o_i) − CV_new(o_i); k++) {
        q = QL(v_k);
        while (q ≠ NULL) { delete(o_i, OL(q)); q = q→next; }
    }
}
Fig. 5. Pseudo code for Algorithm SQI IR.
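A compact, runnable Python rendering of the same idea is sketched below (our own sketch; it replaces the linked lists of Fig. 5 with Python dictionaries, and keeps each OL(q) as a multiset because an object is inserted once per covering VCR that belongs to q):

from collections import Counter, defaultdict

def reevaluate(objects, old_pos, new_pos, query_lists, results, cover):
    # objects: iterable of object IDs; old_pos/new_pos: object ID -> (x, y);
    # cover(x, y): set of activated VCR IDs covering the point (Sect. 2.4);
    # query_lists: VCR ID -> IDs of queries decomposed onto it (QL);
    # results: query ID -> Counter of object IDs currently inside it (OL).
    for oid in objects:
        if new_pos[oid] == old_pos[oid]:
            continue                                   # location not updated: skip entirely
        cv_new, cv_old = cover(*new_pos[oid]), cover(*old_pos[oid])
        for v in cv_new - cv_old:                      # object moved into these VCRs
            for q in query_lists.get(v, ()):
                results[q][oid] += 1                   # insert an instance of oid into OL(q)
        for v in cv_old - cv_new:                      # object moved out of these VCRs
            for q in query_lists.get(v, ()):
                results[q][oid] -= 1                   # delete an instance of oid from OL(q)
                if results[q][oid] <= 0:
                    del results[q][oid]
        old_pos[oid] = new_pos[oid]

# results can be initialised as defaultdict(Counter); an object is reported inside
# query q whenever results[q] contains its ID.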
4 Related Work
There are strong interests in event matching schemes for content-based pub/sub systems [1,3] and triggers [2]. Most of these have focused on equality-only predicates. In [10], an adaptive multiple key hashing approach was developed to handle range predicates. However, it considered mainly single-dimensional range predicates and treated two range predicates, each from a different attribute, independently. In contrast, we consider 2D range predicates in this paper. Continual queries [11,12,2] have been developed to permit users to be notified about changes that occur in the database. They evaluate conditions that combine incoming events with predicates on the current database state, which makes it difficult for these systems to scale. There are various spatial data structures that can handle non-equality predicates, such as R-trees and their variants [7,8,13]. As pointed out in [6,7], R-trees are generally not suitable for event matching because they quickly degenerate if the objects are highly overlapping. Many data structures have been designed to index a list of predicate intervals defined on the same attribute, such as segment trees [13], interval binary search trees [14], and interval skip lists [6]; however, these mainly address the single-dimensional case. The VCRs defined in this paper are different from space-filling curves, such as the Hilbert curve and the Z-ordering [7], that are used to store multidimensional point data; the objective there is to preserve spatial proximity in the original point data. In contrast, a set of VCRs is defined for a point, and these VCRs are used to decompose predicates, which are spatial objects. Query indexing was not used in the moving object environment until recently [15,4]. In [4], an R-tree-based query indexing method was first proposed for continual range queries over moving objects. In [15], a cell-based query indexing scheme was proposed. It was shown to perform better than an R-tree-based
query index [15]. Basically, the monitoring area is partitioned into cells. Each cell maintains two query lists: full and partial. However, using partial lists has a drawback. The object locations must be compared with the range query boundaries in order to identify those queries that truly contain an object. Because of that, it cannot allow query reevaluation to take advantage of the incremental changes in object locations. There are research papers focusing on other issues of moving object databases. For example, various indexing techniques on moving objects have been proposed [16,17,18,19]. The trajectories, the past, current, and the anticipated future positions of the moving objects have all been explored for indexing. The data modeling issues of representing and querying moving objects were discussed in [20,21,22,23]. Uncertainty in the positions of the moving objects was dealt with by controlling the location update frequency [24,23], where objects report their positions when they have deviated from the last reported positions by a threshold. Partitioning the monitoring area into domains (cells) and making each moving object aware of the query boundaries inside its domain was proposed in [25] for adaptive query processing.
5 Summary
In this paper, we have presented an overview of a VCR indexing method for (a) general fast event matching against a set of 2D range predicates, and (b) efficient processing of continual range queries over moving objects. A VCR is a shingle-like or tile-like object that is conventionally laid to cover an area, such as the roof-top of a building or the floor of a room. A set of VCRs is predefined, each with a unique ID. Each region formed by a 2D range predicate is decomposed into or covered by one or more activated VCRs. The query ID is inserted into the ID lists associated with the activated VCRs. Hence, the use of VCRs provides an indirect and cost-effective way of pre-computing the search result for any given event. As a result, event matching or searching is very efficient.
References
1. Aguilera, M.K., Strom, R.E., Sturman, D.C., Astley, M., Chandra, T.D.: Matching events in a content-based subscription system. In: Proc. of Symp. on Principles of Distributed Computing. (1999)
2. Hanson, E., Carnes, C., Huang, L., Konyala, M., Noronha, L., Parthasarathy, S., Park, J.B., Vernon, A.: Scalable trigger processing. In: Proc. of ICDE. (1999) 266–275
3. Fabret, F., Jacobsen, H.A., Llirbat, F., Pereira, J., Ross, K.A., Shasha, D.: Filtering algorithms and implementation for very fast publish/subscribe systems. In: Proc. of ACM SIGMOD. (2001)
4. Prabhakar, S., Xia, Y., Kalashnikov, D.V., Aref, W.G., Hambrusch, S.E.: Query indexing and velocity constrained indexing: Scalable techniques for continuous queries on moving objects. IEEE Trans. on Computers 51 (Oct. 2002) 1124–1140
5. Wu, K.L., Chen, S.K., Yu, P.S.: Indexing continual range queries with covering tiles for fast locating of moving objects. In: Proc. of 2004 ICDCS Workshop on Distributed Mobile Computing. (2004)
6. Hanson, E., Johnson, T.: Selection predicate indexing for active databases using interval skip lists. Information Systems 21 (1996) 269–298
7. Gaede, V., Günther, O.: Multidimensional access methods. ACM Computing Surveys 30 (June 1998) 170–231
8. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: Proc. of ACM SIGMOD. (1984)
9. Wu, K.L., Chen, S.K., Yu, P.S.: VCR indexing for fast event matching for highly-overlapping range predicates. In: Proc. of 2004 ACM Symp. on Applied Computing. (2004)
10. Wu, K.L., Yu, P.S.: Efficient query monitoring using adaptive multiple key hashing. In: Proc. of ACM CIKM. (2002) 477–484
11. Chen, J., DeWitt, D., Tian, F., Wang, Y.: NiagaraCQ: A scalable continuous query system for Internet databases. In: Proc. of ACM SIGMOD. (2000) 379–390
12. Liu, L., Pu, C., Tang, W.: Continual queries for Internet scale event-driven information delivery. IEEE TKDE 11 (July/Aug. 1999) 610–628
13. Samet, H.: Design and Analysis of Spatial Data Structures. Addison-Wesley (1990)
14. Hanson, E., Chaaboun, M., Kim, C.H., Wang, Y.W.: A predicate matching algorithm for database rule systems. In: Proc. of ACM SIGMOD. (1990) 271–280
15. Kalashnikov, D.V., Prabhakar, S., Aref, W.G., Hambrusch, S.E.: Efficient evaluation of continuous range queries on moving objects. In: Proc. of 13th Int. Conf. on Database and Expert Systems Applications. (2002)
16. Agarwal, P.K., Arge, L., Erickson, J.: Indexing moving objects. In: Proc. of ACM PODS. (2000)
17. Kollios, G., Gunopulos, D., Tsotras, V.J.: On indexing mobile objects. In: Proc. of ACM PODS. (1999)
18. Šaltenis, S., Jensen, C.S., Leutenegger, S.T., Lopez, M.A.: Indexing the positions of continuously moving objects. In: Proc. of ACM SIGMOD. (2000)
19. Pfoser, D., Jensen, C.S., Theodoridis, Y.: Novel approaches to the indexing of moving object trajectories. In: Proc. of VLDB. (2000)
20. Forlizzi, L., Guting, R.H., Nardelli, E., Scheider, M.: A data model and data structures for moving objects. In: Proc. of ACM SIGMOD. (2000)
21. Guting, R.H., Bohlen, M.H., Erwig, M., Jensen, C.S., Lorentzos, N.A., Schneider, M., Vazirgiannis, M.: A foundation for representing and querying moving objects. ACM TODS 25 (Mar. 2000) 1–42
22. Sistla, A.P., Wolfson, O., Chamberlain, S., Dao, S.: Modeling and querying moving objects. In: Proc. of ICDE. (1997)
23. Wolfson, O., Sistla, A.P., Chamberlain, S., Yesha, Y.: Updating and querying databases that track mobile units. Distributed and Parallel Databases 7 (1999) 257–387
24. Wolfson, O., Chamberlain, S., Dao, S., Jiang, L., Mendez, G.: Cost and imprecision in modeling the position of moving objects. In: Proc. of ICDE. (1998)
25. Cai, Y., Hua, K.A.: An adaptive query management technique for real-time monitoring of spatial regions in mobile database systems. In: Proc. of Int. Performance, Computing, and Communication Conference. (2002)
The Next Generation Messaging Technology – Makes Web Services Effective Maria E. Orlowska The University of Queensland School of ITEE St Lucia, Australia [email protected]
Abstract. The vision presented in this paper and its technical content are the result of close collaboration between several researchers from the University of Queensland, Australia and the SAP Corporate Research Center, Brisbane, Australia. In particular, Dr Wasim Sadiq (SAP), Dr Shazia Sadiq (UQ), and Dr Karsten Schultz (SAP) are the prime contributors to the ideas presented. PhD students Mr Dat Ma Cao and Ms Belinda Carter are also involved in the research program. Additionally, the Australian Research Council Discovery Project Scheme and the Australian Research Council Linkage Project Scheme support some aspects of the research work towards the HMT solution.
1 Introduction

Tremendous developments in data storing, processing and communication over the last two decades have made an unprecedented impact on how most companies operate, develop future business strategies and deal with day-to-day operations. Data management systems for structured and unstructured data provide efficient and secure environments for persistent data archives and manipulation. Commonly available computer networking and the expansion of access to the Internet have changed the way we reason about system architectures, with integration becoming an obvious and preferred option. The research efforts and development paths pursued by many academic groups and system vendors, targeting heterogeneous system integration, have not been easy and have not always delivered effective and practical results that could make a real impact on how future solutions are to be constructed. We have learnt many lessons from these research outcomes. They outline the clear boundaries of feasibility when dealing with building new applications out of existing and useful/deployable components [24]. These conclusions are not only related to the technological aspects of integrated structures, such as the middleware, but also to semantic issues of terms used across multiple systems. In particular, the need for a complete and extensible ontology that expresses the basic concepts that are common across a variety of
domains became apparent, forming a new research direction over the last few years [25].

Workflow management systems have delivered effectively in the area of process enforcement, offering a clear separation of business process logic from the component applications involved in process execution, thereby responding to the well-established need for application integration. Workflows' complementary role in supporting advanced database functions offers, for the first time, a complete and powerful environment for information systems architects. It is an observed phenomenon that a new IT solution often triggers additional, and even more advanced, user requirements, which probably would not be discovered if the current systems' functionality were not so widely available. This pattern can be clearly observed in the context of workflow technology evolution, and it is the main motivation for this research program. Primarily, workflow technology has delivered a great deal of productivity improvement, mainly for repetitive business processes with substantial human involvement, by providing effective scheduling, monitoring and resource planning services. As in the past with database systems, the extensively documented successes of workflow deployment in many application domains now call for an even more advanced and rich level of integration facilities. To be useful and commonly accepted, such a technology must provide a great deal of flexibility and autonomy preservation for all local systems forming parts of the integrated infrastructure [21], [23]. It should be possible to form new business processes out of pre-existing, fully operational workflows at different sites for various business partners, with process-owner-controlled levels of data disclosure and, most importantly, practically no impact on the component applications. Only an integration technology that offers rapid and easy integration procedures, requiring only minimal IT expert intervention, can be successful in multiple and diverse, geographically spread e-business environments. The great challenge for IT specialists now is to find a functionally rich and technically feasible, balanced solution for this overall complex problem of integration, taking into account technological and ontological limitations.

There is currently a great drive towards advancement of the technologies surrounding the e-business domain [22], [23]. Businesses are increasingly moving towards extensive automation of their private and public processes. This automation takes the form of complex interactions between heterogeneous and autonomous systems within the enterprise and often across multiple organizations. Controlling these complex interactions in order to effectively manage collaborative business processes is known to be a critical yet difficult problem using current technology solutions. Consequently, the areas of consideration are multi-faceted, ranging from security, reliability and transactionability, quality of service guarantees, process validation, and optimisation to the semantic integrity of the terminology used. Integration technologies such as brokers, application adapters, portals and messaging are fundamental elements of a collaborative business process environment. For this widespread enterprise application integration and/or business-to-business (B2B)
integration to become a reality, we need common architectures and open standards to support it. The industry is currently flooded with initiatives and proposals towards e-business standards. These standards encompass trading partner agreements, business process specification, application integration, and network protocols.

Service Oriented Computing (SOC) has recently become a new computing paradigm that may shape the utilization of the Web in the near future [26], [27], [28]. Services can be viewed as semi-autonomous applications, executable at any time and supported by platform-independent solutions. They can be published, discovered and composed to form larger unities that in turn act as large distributed interoperable applications. The full potential of global Internet access has not yet been explored, especially as a core infrastructure for collaborative business processes. There are numerous reports indicating unprecedented growth in the number of business and scientific users across the whole globe. Although most individual users are currently satisfied with the availability of access to a wide range of information sources and email services, another class of users, business organizations, still has to go further. These should, and will, perceive the Web as a backbone supporting a variety of service offerings, either as consumers or as service providers. Naturally, as with any other traditional (non-Web-based) services, there is no justification for investing in such capacities as long as there is no clear vision as to their visibility/discovery mechanisms for potential clients. There are already substantial developments in these directions [1], [2], [3].

Service Oriented Architectures are central to these trends in distributed computing [26]. Web services should be available on a wide range of devices, from handheld computers equipped with GPS facilities, mobile phones and wireless PDAs to traditional desktops and stationary computers. It is an exciting new trend, providing a solid structural foundation for the advancement of flexible, dynamic distributed computing solutions. Existing distributed computing platforms that came before the service-oriented paradigm tended to be too monolithic, inflexible, and complex. Developing applications on these systems required expertise in complex programming models, and these systems did not lend themselves to interoperability with other platforms. Often these platforms restricted the distribution paradigms that could be supported (e.g. lack of peer-to-peer support). Web services can be thought of as loosely coupled pieces of applications that have their own role within the owner organization but can also be used outside the organization's boundaries, serving a well-defined purpose either in collaboration with other services or as stand-alone service providers. For the cooperative mode of operation, they need to be published, identified, deployed, and integrated with other applications over a network. There is an obvious need for the next generation of messaging technology to provide new ways to connect discovered applications/services, operated either automatically or with some user intervention. In contrast to existing, tightly coupled distributed computing platforms (such as CORBA or DCOM), web services offer
dynamic discovery and binding of application functionality through service descriptions that provide the details of what a given application does and how its services can be accessed. In practice, however, there is only limited evidence of success of such dynamic binding for a wider class of applications. One problem remains which is still far from an adequate solution: the common semantics for all parties involved in such integrated structures, and generic mechanisms to compose several services into one meaningful structure/business process.

The concept of web services can be found at the core of many of the new approaches to distributed computing [6], [7], [8]. Web services are seen as a means of integrating applications, promoting interoperability and facilitating process management over decentralized environments such as the Internet. The loose coupling and dynamic binding characteristics of web services are the main justifications towards achieving the above. An essential component of the next generation of distributed architectures (mostly based on web services, and often referred to as the web services 'stack' [8]) is message-oriented middleware (MOM). MOM provides the basic means for target applications to communicate in a distributed environment.

Messaging middleware, however, is not a new technology. Historically, in the 1980s, the move from highly centralized applications on mainframes to decentralized client/server applications introduced the need for messaging middleware to exchange data between systems on dedicated network connections. Legacy LAN-based messaging products (such as IBM MQSeries and TIBCO Rendezvous) were designed to support the move from centralized mainframe applications to decentralized client/server applications. Working over an internal, often dedicated network, across internal applications and with internal users, these products were well suited to the task of exchanging data in real time between multiple internal systems. In its new role, MOM has gained increasing deployment and has already delivered great benefits for communication between disparate systems, and as a grass-roots component of the web services stack. In spite of the move from proprietary networks to open standards, the fundamental functionality of MOM has not changed substantially. Looking at currently available solutions, we see that the focus of MOM has been primarily to deliver security (authorization, digital signatures, non-repudiation); reliability and serializability (guaranteed delivery in the proper order); and scalability (high volume and speed). The technology is driven by simple dispatch models, such as point-to-point (message exchange between a sender and one recipient, often based on queuing methods, and used mostly in secure financial transactions) and publish-subscribe (used for content dissemination to multiple recipients or subscribers). Some essential enhancements to basic messaging technology have also been proposed, for example content-based routing and event notification. We see a much greater potential in changing the overall messaging systems infrastructure than in the enhancement of current MOM functionality, especially in its new role of facilitating complex and long-running interactions for collaborative business processes executing in a decentralized environment.
2 Harmonised Message Technology

The envisaged Harmonised Message Technology (HMT) environment [29] naturally subsumes all core functions of already established solutions, but mostly provides new and advanced features. The design of message flow between different systems communicating through HMT will be independent of the message content (unless the content is used specifically for content-related navigation) and of any other message attributes. The main driving idea of the HMT vision is a clear separation of message dispatch functions from objects such as users and the messages themselves. There are several aspects of messaging that impact on, and define the scope for, message harmonization. We identify below several classes of harmonized messaging, as a list of our minimum requirements for HMT functionality.

Message coordination – Messages often represent a step in a business transaction or process. Coordinating the flow of messages can take the form of most, if not all, activity coordination structures in workflow/process management. HMT can facilitate coordination through multi-step complex routing specifications.

Time driven messaging – Temporal constraints represent a critical aspect of business events. Time driven messages may depend on absolute time, e.g. 2.00 PM on Friday, as well as relative time, e.g. every 4 hours.

Correlation of messages – Messages from a single sender (or even multiple senders) may be linked in terms of the content they carry. Correlation can include associating or relating a new message with a previously received message, for example multiple items of a single purchase order, or invalidating a previously received message, for example by increasing the items ordered.

Batch oriented messaging – The need for batching messages is clear from the above. Batching or grouping may be required due to message coordination, correlation or time dependencies. The definition of the batch may thus encompass many properties, for example: deliver all messages on a given topic from a given sender at a given time in a given order, rather than one at a time as they arrive at the message server.

Message filtering – This is essentially sending messages to interested parties based on message contents (content-based routing). However, advanced filtering may be required, which takes into consideration a combination of conditions such as content, time, sender attributes and others.

We conclude with a list of the main research challenges that need to be addressed before HMT becomes available for empirical tests:
• A functional specification of HMT needs to be developed.
• A harmonisation specification language, preferably user friendly – masking the formal specification of complex logical expressions – must be formally established.
• Message store – a core data component of the HMT architecture that requires careful design to support all aspects of HMT operations.
• Harmonization engine – the core driving force of the system, supporting the essential functionality of harmonization:
  – the power to interpret and execute the conditions that govern the harmonisation requirements,
  – a persistent storage facility for monitoring and logging operational data,
  – management of concurrent users building message streams,
  – an exception handler dealing with unexpected behaviours,
  – transactionability, in order to guarantee completeness of execution.
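As a purely illustrative sketch of how the harmonized-messaging classes above (coordination, time-driven delivery, correlation, batching and filtering) might be expressed as a single rule, consider the following; every name, field and behaviour here is an assumption of ours, not part of HMT [29].

```python
# Hypothetical sketch of a harmonisation rule covering filtering, batching and
# time-driven delivery. Names and semantics are illustrative assumptions only.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class HarmonisationRule:
    topic: str                      # filter: only messages on this topic
    sender: str                     # filter: only messages from this sender
    deliver_at: datetime            # time-driven: absolute delivery time
    ordered: bool = True            # deliver the batch in message-id order
    batch: list = field(default_factory=list)

    def accept(self, message):
        """Message filtering on content/sender attributes; matching messages are batched."""
        if message["topic"] == self.topic and message["sender"] == self.sender:
            self.batch.append(message)

    def due(self, now):
        return now >= self.deliver_at

    def release(self):
        """Batch-oriented delivery: all matching messages at once, in the proper order."""
        out = sorted(self.batch, key=lambda m: m["id"]) if self.ordered else list(self.batch)
        self.batch = []
        return out

rule = HarmonisationRule("purchase-order", "supplier-42",
                         deliver_at=datetime(2004, 4, 16, 14, 0))
rule.accept({"topic": "purchase-order", "sender": "supplier-42", "id": 1})
if rule.due(datetime(2004, 4, 16, 14, 5)):
    print(rule.release())
```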
References

1. M. Orlowska, V. D'Andrea, M. Papazoglou, P. Traverso, C. Bussler, B. Pernici, J. Yang (Eds.): Workshop on Service Oriented Computing: Models, Architectures and Applications. The 2003 International Symposium on Applications and the Internet, January 27-31, 2003, Orlando, Florida, USA.
2. M. Orlowska, C. Bussler, J. Yang, B. Pernici, S. McIlraith, R. Hull (Eds.): Workshop on Web Services, E-business and Semantic Web. The Fourteenth International Conference on Advanced Information Systems Engineering (CAiSE 02), May 27-31, 2002, Toronto, Ontario, Canada.
3. 3rd VLDB Workshop on Technologies for E-Services (TES'02). http://gkpc14.rbg.informatik.tu-darmstadt.de/tes02/
4. R.M. Colomb and M. Orlowska: Interoperability in Information Systems. Information Systems, 5(1), pp. 37-50, 1994.
5. International Conference on Ontologies, Databases and Applications on Semantics 2002. www.cs.rmit.edu.au/fedconf/odbase/2002/
6. Web Services Description Language (WSDL) 1.1, May 2002. www.w3.org/TR/wsdl
7. The Universal Description, Discovery and Integration (UDDI). www.uddi.org
8. Simple Object Access Protocol 1.1, May 2000. www.w3.org/TR/SOAP/
9. H. Kreger: Web Services Conceptual Architecture (WSCA) 1.0. IBM Software Group, May 2001. www-3.ibm.com/software/solutions/webservices/pdf/WSCA.pdf
10. K. Schulz, M. Orlowska: Architectural Issues for Cross-Organisational B2B Interactions. International Workshop on Distributed Dynamic Multiservice Architectures (DDMA), in conjunction with the 21st International Conference on Distributed Computing Systems (ICDCS-21), IEEE Computer Society Press, Phoenix, USA, April 2001.
11. C. Bussler: Enterprise-wide workflow management. IEEE Concurrency Journal, July 1999.
12. M. Papazoglou: The world of e-business: web-services, workflows, and business transactions. 2002.
13. Georgakopoulos et al.: Modelling and composing service-based and reference process-based multi-enterprise processes. In Proceedings of the 12th International Conference, CAiSE 2000, Springer-Verlag, 2000.
14. G. Riempp: Wide Area Workflow Management. Springer-Verlag, London, 1998.
15. SAP: SAP business maps. Technical report, SAP AG, 2001.
16. ebXML Requirements Specification, Version 1.0. Technical report, ebXML.org, 2000.
17. Web Service Choreography Interface (WSCI) 1.0 Specification. wwws.sun.com/software/xml/developers/wsci
18. Business Process Execution Language for Web Services, Version 1.0. http://www-106.ibm.com/developerworks/webservices/library/ws-bpel/
19. FlowMake – A Process Modelling and Verification Tool. www.dstc.edu.au/praxis/flowmake/
20. W. Sadiq, M. Orlowska: Analysing Process Models using Graph Reduction Techniques. Information Systems, Vol. 25, No. 2, pp. 117-134, Elsevier Science, June 2000.
21. S. Sadiq, O. Marjanovic, M. Orlowska: Managing Change and Time in Dynamic Workflow Processes. International Journal of Cooperative Information Systems, Vol. 9, Nos. 1 & 2, March 2000.
22. K. Schulz, M. Orlowska: Towards a Cross-Organisational Workflow Model. 3rd IFIP Working Conference on Infrastructures for Virtual Enterprises, Lisbon, Portugal, 1-3 May 2002. In L.M. Camarinha-Matos (Ed.): Collaborative Business Ecosystems and Virtual Enterprises, Kluwer Academic Publishers.
23. S. Sadiq: Handling Dynamic Schema Changes in Workflow Processes. Proceedings of the 11th Australian Database Conference (ADC 2000), Canberra, Australia, Jan 30 - Feb 3, 2000.
24. R.M. Colomb, M.E. Orlowska: Interoperability in Information Systems. Information Systems, 5(1), pp. 37-50, 1994.
25. International Conference on Ontologies, Databases and Applications on Semantics 2002. www.cs.rmit.edu.au/fedconf/odbase/2002/
26. M. Orlowska, S. Weerawarana, M. Papazoglou: Service-Oriented Computing, LNCS 2910, Springer, 2003.
27. T. Catarci, J. Mylopoulos, M. Orlowska: Web Information Systems Engineering, IEEE, 2003.
28. X. Zhou, Y. Zhang, M. Orlowska: Web Technologies and Applications, LNCS 2642, Springer, 2003.
29. S. Sadiq, M. Orlowska, W. Sadiq, K. Schultz: Facilitating Business Process Management with Harmonized Messaging. Accepted for ICEIS'04, April 2004, Porto, Portugal.
Web Searching and Σ2 Queries (Extended Abstract)

Arun Sharma

National ICT Australia, Sydney Research Laboratory at Kensington, UNSW Sydney NSW 2052, Australia
[email protected]
In the terminology of logic programming, current search engines answer Σ1 queries (formulas of the form ∃x̄ ϕ(x̄), where ϕ(x̄) is a boolean combination of attributes). Such a query is determined by a particular sequence of keywords input by a user. In order to give more control to users, search engines will have to tackle more expressive queries, namely Σ2 queries (formulas of the form ∃x̄ ∀ȳ ϕ(x̄, ȳ)). The purpose of the talk is to examine which directions could be explored in order to move towards more expressive languages, more powerful search engines, and the benefits that users should expect.

Search engines allow users to input a boolean combination of keywords, like: C = keyword₁ ∧ (¬keyword₂ ∨ keyword₃). In the terminology of logic programming, a user who inputs C implicitly asks the following Σ1 query Q: Does there exist a web page X such that X contains occurrences of keyword₁, and either no occurrence of keyword₂ or some occurrences of keyword₃? A solution to query Q is a (link to a) web page that indeed contains occurrences of keyword₁, and either no occurrence of keyword₂ or some occurrences of keyword₃. Usually, the number of distinct solutions to a query like Q is huge. Users actually have in mind much more restrictive queries. Rather than any solution to Q, they expect a solution to the following Σ2 query Q̂: Does there exist a web page X such that:
– X contains occurrences of keyword₁, and either no occurrence of keyword₂ or some occurrences of keyword₃, and
– for all web pages Y distinct from X, if Y contains occurrences of keyword₁, and either no occurrence of keyword₂ or some occurrences of keyword₃, then X is 'better' than Y.
National ICT Australia is funded by the Australian Government’s Department of Communications, Information Technology and the Arts and the Australian Research Council through Backing Australia’s Ability and the ICT Centre of Excellence Program. This talk is based on joint work with Eric Martin of the School of Computer Science and Engineering at the University of New South Wales.
A search engine follows a medium path:
– It has a definition of 'better', based on numbers of hits, numbers of links, numbers of references, etc.
– It computes all solutions to Q but ranks them according to the ordering underlying the definition of 'better', and the first solution output (the first result of the search) can be seen as the solution to Q̂.

Search engines like Google do a pretty good job, since even when thousands or tens of thousands of links to web pages are output as solutions to queries like Q, users often find the information they seek in one of the web pages listed first. This means that Google's definition of 'better' is remarkably good (better than most of its competitors' definitions), and a lot of ingenuity and sophisticated algorithms indeed underlie this definition. Still, even the best search engines sometimes generate frustration, and a mechanism that would be both more flexible and more powerful is desirable.
– The definition of 'better' should not be independent of the particular keywords input by a particular user. Even if users do not want to or cannot formalize 'better', they could or might want to guide its formalization and constrain its definition. For instance, a 'better' web page might be a more recent one, or a shorter one, or one that contains less jargon, or one that is well illustrated, etc.
– A total ordering on the set of solutions to queries like Q is not always appropriate. The relationship between solutions to Q could be a partial ordering R, and the solution to Q̂ would be all solutions to Q that are maximal in R.
– Universal quantifiers could be used not only to retain just a few of the solutions to Q, but also to increase the expressive power available to users to retrieve some specific piece of information. For instance, a user might look for a book on a particular topic such that all reviews of this book are positive. This means that the relevant query is Σ2, but variables range over books and reviews rather than over web pages: does there exist a book X such that X is a book on spiders and for all reviews Y, if Y is a review of X then Y is positive?

A major challenge that the areas of information retrieval and information synthesis will have to face is to harness the power of Σ2 queries. Whereas Σ1 queries are natural formalizations of problems of a deductive nature, Σ2 queries are natural formalizations of problems having an inductive component, involving learning and nonmonotonic reasoning. Web searching is such a problem. We have observed that users implicitly ask Σ2 queries of a particular kind (like Q̂ above), though they just input a boolean combination of keywords (like C above); this combination is most of the time just a conjunction of atoms. Search engine designers rightly do not expect users to be experts in logic and other technical fields, and let the system write Q̂ from C. Though Σ2 queries will become more flexible and users will play a role in their definition, it is still essential to find mechanisms that ensure that search engines are immediately usable.
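To make the role of the universal part concrete, here is a minimal sketch (our own illustration, not from the talk) of evaluating such a Σ2 query: first keep the pages satisfying the boolean combination C, then retain only the solutions that are maximal under a partial order standing in for 'better'. The page fields and the ordering criteria are assumptions.

```python
# Illustrative sketch: evaluate a Sigma-2 query by filtering on the boolean
# (existential) part and then keeping the solutions that are maximal under a
# partial order ("better"). The page fields and ordering are assumptions.

def satisfies_C(page):
    """Boolean part: keyword1 and (not keyword2 or keyword3)."""
    kw = page["keywords"]
    return "keyword1" in kw and ("keyword2" not in kw or "keyword3" in kw)

def better(x, y):
    """A partial order: x is better than y if it is at least as recent and
    at most as long, and strictly improves on one of the two criteria."""
    ge = x["date"] >= y["date"] and x["length"] <= y["length"]
    gt = x["date"] > y["date"] or x["length"] < y["length"]
    return ge and gt

def sigma2_solutions(pages):
    sols = [p for p in pages if satisfies_C(p)]        # solutions to Q
    return [x for x in sols                            # solutions to Q-hat
            if not any(better(y, x) for y in sols if y is not x)]

pages = [
    {"url": "a", "keywords": {"keyword1"}, "date": 2003, "length": 10},
    {"url": "b", "keywords": {"keyword1", "keyword3"}, "date": 2004, "length": 5},
    {"url": "c", "keywords": {"keyword2"}, "date": 2004, "length": 3},
]
print([p["url"] for p in sigma2_solutions(pages)])
```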
The way users interact with them should remain natural, and should not require any technical skill. To achieve this, an interface could contain two windows, one window W1 for the purely existential part of the query, and another window W2 for the part that involves universal quantifiers. Consider the simpler case of queries where all variables range over web pages. Then the possible inputs to W2 could be a boolean combination of some members of a menu of predefined keywords. For instance, the menu could contain keywords like recent, nontechnical, short, etc. If a user inputs in W1 a boolean combination of atoms like C above, and in W2 just the keyword recent, then the system would construct the query: Does there exist a web page X such that:
– X contains occurrences of keyword₁, and either no occurrence of keyword₂ or some occurrences of keyword₃, and
– for all web pages Y distinct from X, if Y contains occurrences of keyword₁, and either no occurrence of keyword₂ or some occurrences of keyword₃, then Y is not more recent than X and X is 'better' than Y.

We expect that in most cases the information input by the user in W2 would not be restrictive enough, which is why the previous query contains 'and X is "better" than Y', where 'better' is defined by the system. Clearly, 'more recent' would be implemented in a reasonable way and not interpreted in the strictest sense. A well selected collection of keywords for W2 would result in a set of options to build the universal part of the query, and would have many practical uses. Of course, what we have in mind is a much more expressive class of Σ2 queries, where the universal part is not built from predefined keywords: users should be allowed to input their own keywords in both W1 and W2. The main difficulty here is that keywords for W2 should be relations, i.e., predicates of arity 2 or more, whereas for W1, properties, i.e., unary predicates, are still good enough. If a user wants to find a book on spiders that has only positive reviews, then he needs to ask the query: Does there exist a web page X and a book on spiders B such that:
– B has occurrences in X, and
– for all reviews Y, if Y is a review on B then Y is positive.

In order to ask this query, the user could enter book and spiders in W1, and he would have to express in window W2 that the book he is looking for has nothing but positive reviews. To express that the review is positive, a menu of predefined relations, like

k needs to be retrieved. In such a way, the efficiency problem caused by using the substring and aggregate functions (which cannot make use of database indexes efficiently), and possible errors caused by the same polygon entering the same data pixel multiple times, are solved. In other words, the perfect generalization of polygonal data (and the Li-Openshaw algorithm) can be implemented using a single SQL query with only one extra condition added, namely δ > k. Map generalization is one of the most important functions to support Web-based spatial applications. The performance of using the revised scaleless data structure and a single SQL query for both data retrieval and simplification is shown to be significantly better than that of the traditional retrieve-then-simplify approach [11].
4.2 Progressive Vector Data Transfer
Several of the most popular raster image formats, such as GIF and JPEG, have a property called progressive encoding, which allows a rough image preview to be created while the image is being downloaded, with the clarity improving gradually as more data arrives. This property is very important for Web applications, as it holds the user's attention while a large image is being downloaded, and allows the user to click through to other pages if they have obtained sufficient information before the download is complete. This behavior is, however, very difficult to support for vector spatial data, as adding more points can change geometric and topological relationships. A rough image rendered using incomplete vector data can therefore be misleading. Progressive vector data transfer is a highly desirable property for spatial data, but there is no satisfactory solution to this problem. A basic common assumption for progressive vector data transfer is that all the points received earlier must be a subset of the points of the final object, so the newly added points can be used together with, rather than overwrite, those already received. It is easy to see that all points must carry information about their index in the final object [1]. Now we give a simple and efficient solution to this problem using the idea of the scaleless data structure, the same as that used in map generalization. Points of a polygon are sent to the client in descending order of their δ values, and the image on the client side is rendered in a 'blocky' way according to the current δ being transferred; the granularity of the 'blocks' is reduced to improve image quality on the client side as more data is received. This approach ensures that the rough image does not give the user misleading information and, at the same time, allows vector data to be transferred progressively. As images of very large 'blocks' are not useful and can be annoying to the user, the initial δ should not start from 1, but from a more reasonable number that considers both the client-side block size and the amount of data initially transferred. We are currently evaluating different vector data compression methods so that data with a large δ value can also be sent with low precision. Other important issues that need to be considered to make progressive vector data transfer really useful include client- and proxy-side vector data caching and integration with W3C's standard for vector data exchange (such as SVG).
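As an illustration of this scheme, the following is a simplified sketch of our own (not the system's implementation); the polygon representation, chunking policy and renderer are assumptions. The server sends vertices in descending order of δ, and the client only ever adds points, re-rendering at the granularity of the finest δ received so far.

```python
# Hypothetical sketch of progressive vector transfer ordered by delta.
# Each vertex carries (index in final object, x, y, delta); the index lets the
# client merge newly arrived points with those already received.

def server_chunks(polygon, chunk_size=64):
    """Yield vertex chunks in descending delta order (most significant first)."""
    ordered = sorted(polygon, key=lambda v: v[3], reverse=True)
    for i in range(0, len(ordered), chunk_size):
        yield ordered[i:i + chunk_size]

def client_render(chunks):
    """Accumulate points by their final index and refine the rendering."""
    received = {}
    for chunk in chunks:
        for idx, x, y, delta in chunk:
            received[idx] = (x, y)                      # add, never overwrite
        current_delta = min(d for _, _, _, d in chunk)  # finest delta so far
        outline = [received[i] for i in sorted(received)]
        draw_blocky(outline, block_size=current_delta)

def draw_blocky(outline, block_size):
    # Stand-in for a real renderer that snaps points to blocks of this size.
    print(f"render {len(outline)} points at block size {block_size}")

# Toy polygon: (index in final object, x, y, delta)
poly = [(0, 0.0, 0.0, 9.0), (1, 5.0, 0.5, 2.0), (2, 9.0, 1.0, 7.0), (3, 4.0, 6.0, 1.0)]
client_render(server_chunks(poly, chunk_size=2))
```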
4.3 Approximate Spatial Query Processing
Spatial join operations combine two or more spatial datasets using spatial relationships such as overlap, adjacency, enclosure or within-distance [4]. They are among the most time-consuming yet most important operations in a spatial DBMS. Traditionally, the filter-and-refine approach is used to minimize spatial join processing cost. It has two steps: a filter step, which applies simple operations to approximations of the spatial objects, followed by a refinement step, in which a full test using the complete geometry is applied to those objects that survived the filter condition. For example, minimum bounding box intersection is often used
as the filter condition for polygon intersection. Typically, object approximations are used together with a spatial index. A multi-resolution database makes it possible to extend the principle of the filter-and-refine approach further. Firstly, the full geometry used in the refinement stage need not always be the highest-resolution data in the database; a suitable level of resolution, depending on the accuracy model used (either the aggregation model or the probability model), can be determined for a given application, and the data at that level can be used for refinement. This reduces the cost of refinement for applications that do not require the highest level of resolution. Secondly, the data at a lower resolution level can be used as an approximation of its counterpart at a higher resolution level. Thus, multiple filters are available in a multi-resolution database and can be used to minimize the need to use more complex data. Using the scaleless data structure introduced before, a spatial object can be approximated differently as δ changes. The traditional filter-and-refine strategy becomes a simple bounding-box-or-full-geometry special case. Scaleless data structures can produce similar but more flexible approximations, with virtually no extra storage or maintenance overhead. It should be pointed out that scaleless-data-based approximation is neither fully conservative nor fully progressive, and thus may not be suitable for applications that require precise results based on the highest-resolution data. We are working on defining the relevant metadata to be collected and on methods to determine the lowest acceptable level of resolution for a query based on the aggregation model and the probability model. Using the aggregation model (which is simpler but less useful than the probability model), we have observed up to a 75% reduction in disk access costs while the accuracy remains within 80%. A more comprehensive empirical evaluation is underway, and research to develop a probability model has just started.
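A hedged sketch of how such a cascade of filters could look, under the assumption of a simple vertex-level δ representation (our own illustration, not the system described in the paper):

```python
# Hypothetical sketch of a multi-level filter-and-refine spatial join: candidate
# pairs are pruned with coarser approximations first, and only survivors are
# tested at a finer resolution.

def approximate(obj, delta):
    """Vertices of obj retained at detail level delta (those with d > delta)."""
    pts = [(x, y) for x, y, d in obj["points"] if d > delta]
    return pts or [(x, y) for x, y, d in obj["points"]]   # never return empty

def bbox(points):
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def boxes_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def multires_join(r_objs, s_objs, filter_deltas=(16.0, 4.0), refine_delta=1.0):
    candidates = [(r, s) for r in r_objs for s in s_objs]
    for delta in filter_deltas:                            # coarse-to-fine filters
        candidates = [(r, s) for r, s in candidates
                      if boxes_intersect(bbox(approximate(r, delta)),
                                         bbox(approximate(s, delta)))]
    # Refinement on the lowest acceptable resolution, not necessarily full detail.
    return [(r["id"], s["id"]) for r, s in candidates
            if full_geometry_test(approximate(r, refine_delta),
                                  approximate(s, refine_delta))]

def full_geometry_test(pa, pb):
    return boxes_intersect(bbox(pa), bbox(pb))             # placeholder for exact test
```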
4.4 Other Applications
The scaleless data structure has also been applied to many other applications in our lab, including view-dependent multiresolution terrain visualization using the visual quality model [14], 3D spatial analysis (in particular, surface k nearest neighbor queries) using the aggregation model, spatial data amalgamation for spatial data mining and warehousing applications [10], and applications for environmental analysis and location-based services (mainly spatial generalization and tokenization). The underlying data structures for these projects, while all similar and all based on the scaleless data structure introduced in this paper, are customized to specific application problems. In the near future, it will be necessary to consider how to unify these optimized structures for different applications in a general-purpose spatial DBMS.
5 Conclusions
Spatial data is multiresolution in nature. This important property has not been adequately considered in the past to improve the performance of spatial
data management and query processing, which has now become a primary barrier to more efficient and sophisticated spatial applications in the Web environment, where bandwidth can be limited and applications are much more diversified. In this paper, we address this key technical issue, proposing a novel solution to improve the performance of spatial applications by exploiting the multiresolution nature of spatial data. Spatial objects are stored using scaleless data structures. We have shown that the overhead of assembling spatial objects can be compensated for by the reduced amount of data to be retrieved from the database. The benefits of in-database simplification of spatial objects flow on to database retrievals, spatial operation processing, post-query processing and server-to-client data transfer. It also makes it possible, for the first time, to support real progressive spatial data transfer and semantic spatial caching. We have discussed a number of important spatial applications to demonstrate that multiresolution spatial databases are ideal for Web-based spatial applications. Acknowledgment. The work reported in this paper has been partially supported by grant DP0345710 from the Australian Research Council.
References
1. M. Bertolotto and M. Egenhofer. Progressive vector transmission. In ACM GIS, pages 152–157, 1999.
2. G. Dutton. Digital map generalization using a hierarchical coordinate system. In Auto Carto, 1997.
3. A. U. Frank and S. Timpf. Multiple representations for cartographical objects in a multi-scale tree – an intelligent graphical zoom. Computers and Graphics, 18(6):823–829, 1994.
4. R. H. Güting. An introduction to spatial database systems. VLDB Journal, 3(4):357–399, 1994.
5. C. B. Jones and D. B. Kinder. Database design for a multi-scale spatial information system. J. GIS, 10(8):901–920, 1996.
6. Z. Li and S. Openshaw. Algorithms for automated line generalization based on a natural principle of objective generalization. J. GIS, 6(5):373–389, 1992.
7. X. Lin, Q. Liu, Y. Yuan, and X. Zhou. Multiscale histograms: Summarizing topological relations in large spatial datasets. In VLDB, pages 814–825, 2003.
8. R. B. McMaster and K. S. Shea. Generalization in Cartography. Association of American Geographers, Washington, D.C., 1992.
9. S. Prasher. Perfect cartographic generalisation and visualisation. In VDB, 2002.
10. S. Prasher and X. Zhou. Multiresolution amalgamation: Dynamic spatial data cube generation. In ADC, pages 103–111, 2004.
11. S. Prasher, X. Zhou, and M. Kitsuregawa. Dynamic multi-resolution spatial object derivation for mobile and WWW applications. J. WWW, 6(3):305–325, 2003.
12. H. Samet. Applications of Spatial Data Structures. Addison-Wesley, 1990.
13. J. Sharma. Oracle Spatial: an Oracle technical white paper. Oracle Technical Network, 2002.
14. K. Xu, X. Zhou, and X. Lin. Direct mesh: a multiresolution approach to terrain visualization. In ICDE, to appear, 2004.
15. J. Zhou, X. Yates, and G. Chen. Using visual spatial search interface for WWW applications. Info. Sys., 6(2):61–74, 2001.
A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines

Shaozhi Ye∗, Ruihua Song, Ji-Rong Wen, and Wei-Ying Ma

Microsoft Research Asia, 5F, Sigma Center, No. 49 Zhichun Rd, Beijing, China 100080
Abstract. Duplication of Web pages greatly hurts the perceived relevance of a search engine. Existing methods for detecting duplicated Web pages can be classified into two categories, i.e. offline and online methods. The offline methods aim to detect all duplicates in a large set of Web pages, but none of the reported methods is capable of processing more than 30 million Web pages, which is about 1% of the pages indexed by today's commercial search engines. In contrast, the online methods focus on removing duplicated pages from the search results at run time. Although the number of pages to be processed is smaller, these methods can heavily increase the response time of search engines. Our experiments on real query logs show that there is a significant difference between popular and unpopular queries in terms of query number and duplicate distribution. Based on this, we propose a hybrid query-dependent duplicate detection method which combines the advantages of both offline and online methods. This hybrid method provides not only an effective but also a scalable solution for duplicate detection.
1 Introduction

The World Wide Web (WWW) has been growing rapidly in the past decades, and more and more information is becoming available electronically on the Web. The tremendous volume of web documents poses challenges to the performance and scalability of web search engines. Duplication is an inherent problem that search engines have to deal with: a study of 238,000 hosts reported that about 10% of hosts are mirrored to various extents [8]. Consequently, many identical or near-identical results will appear in the search results if search engines do not solve this problem effectively. Such duplicates significantly decrease the perceived relevance of search engines. Therefore, automatic duplicate detection is a crucial technique for search engines.
∗ The author is also with the Department of Electronic Engineering, Tsinghua University. This work was conducted and completed when he was a visiting student at Microsoft Research Asia.
“Duplicate documents” refers not only to completely identical documents but also to nearly identical documents. The typical method of duplicate detection uses a similarity measure, such as syntactic similarity [3, 4, 5] or semantic similarity [11], to calculate the duplicate degree of two documents. Documents with a duplicate degree higher than a predefined threshold are considered duplicate documents. In [4], the concept of resemblance is defined to capture the informal notion of “roughly the same”. The resemblance r(A, B) of two documents A and B is defined as follows. First, each document is transformed into a set of k-grams (or shingles), denoted S(.). Then the resemblance is computed by:

    r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|        (1)

where |S| is the size of the set S. In [11], documents are represented by term vectors and the cosine measure is used to calculate the semantic similarity between two documents. In this paper, we use syntactic similarity to detect duplicate documents.

The existing duplicate detection methods can be classified into two categories, namely offline methods and online methods. An offline method calculates document similarities among a large collection of Web pages and detects all duplicates at the pre-processing stage. In contrast, an online method detects duplicates in the search results at run time. The offline method seems to be more appealing, since duplicate detection is done at the data preparation phase and the response time and throughput of search engines are not affected. However, the huge scale of the Web page collection makes it nearly infeasible to detect all duplicates in practice. As of today, offline methods have been reported to be capable of dealing with 30 million web pages in 10 days [4]. Considering the 3 billion web pages that are currently searchable via commercial search engines, offline methods cannot meet the performance and scalability requirements in such scenarios.

The online methods can be viewed as local methods, since they detect duplicate documents within the scope of the search result of each query, while the offline methods can be taken as global methods, since they detect duplicates in the whole collection. For the online methods, since the number of documents is small, the duplicate detection process can be made fast enough to add only a relatively small overhead to the response time. In addition, since few users check more than the first 3 result pages (about 30 web pages) returned by search engines [10], it is usually unnecessary to detect duplicates beyond the top n documents in the result list, and the duplicate detection process can be further sped up. However, as duplicate detection needs to be performed for each query, the accumulated overhead may become a significant factor slowing down the response time and decreasing the throughput of a search engine.

In this paper, we propose a hybrid method for duplicate detection which takes advantage of both offline and online methods while avoiding their shortcomings. The basic idea is to divide user queries into popular and unpopular queries by mining query logs. For a popular query, we detect duplicates in its corresponding inverted list offline. For an unpopular query, duplicate detection is conducted at run time. Our experiments on real query logs show that there is a significant difference between popular and unpopular queries in terms of query number and duplicate distribution.
Our hybrid method thus achieves good performance and scalability for duplicate detection in large scale search engines. The rest of the paper is organized as follows. In Section 2 we review previous work on duplicate detection. In Section 3 we report several important observations obtained by mining query logs, such as the frequency distribution of queries and the difference in duplicate degree between popular and unpopular queries. Based on these observations, a query-dependent duplicate detection approach is proposed in Section 4. Finally, we conclude the paper and discuss future work in Section 5.
2 Prior Work

Prior work on duplicate detection can be partitioned into three categories based on the way document similarity is calculated – shingle based, term based, and image based algorithms. We review these algorithms in this section.

2.1 Shingle Based Algorithms

These algorithms, such as [1] [3] [4] [5], are based on the concept of a shingle. A shingle is a contiguous sequence of terms in a document. Each document is divided into multiple shingles and one hash value is assigned to each shingle. By sorting these hash values, shingles with the same hash value are grouped together. The resemblance of two documents can then be calculated based on the number of matching shingles. Several optimization techniques have been proposed to reduce the number of comparisons made. [3] selects the shingles with the lowest N hash values and removes shingles with high frequencies. In this way, [4] processes 30M web pages in 10 days. Another, more efficient alternative is also discussed in [4], which combines several shingles into one super shingle and computes hash values of the super shingles. The super shingle algorithm does not count all overlaps and thus is much faster. However, the authors noted that it does not work well for short documents, and no detailed results are reported. In [5], exact copies are removed in advance and then each line is treated as a shingle. With the help of the hashing strategy, the lower bound on the computational complexity of these shingle based algorithms is O(N log N). However, when N is very large and the Web page collection cannot be processed by a single computer, a distributed algorithm is needed and the computational complexity will then be close to O(N²). As the size of the document set increases, more computation time and storage space are needed, making these algorithms feasible only for a relatively small number of Web pages.

2.2 Term Based Algorithms

Term based algorithms [11] [12] use individual terms as the basic unit, instead of continuous k-gram shingles. They focus on semantic similarity rather than
syntactic similarity, discarding the structural information of documents, such as the edit distance between terms and the paragraph and sentence structures. Cosine similarity between document vectors is usually used to calculate the similarity between documents. Unlike the shingle based algorithms, each document in the set has to be compared with all the others, so the computational complexity is O(N²). The largest set processed by term based algorithms contains only about 500K web pages [11]. [12] describes an online algorithm for rapidly determining similarity among the documents returned by an information retrieval system. It uses a phrase recognizer to obtain the most important terms in a document and computes the similarity between documents based on these terms. It works for a small IR system, but for popular search engines, which need to answer over 100M queries every day, this method is not suitable because it is expensive to compute.

2.3 Image Based Algorithms

Image based algorithms [7, 9] target documents stored as images, and their main issues are those of image processing rather than plain text document processing. These algorithms deal with scenarios that are less relevant to our problem here, so we refer readers to [7, 9] for details.
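To make the shingle based approach of Section 2.1 concrete, the following is a minimal illustrative sketch of k-gram shingling and the resemblance measure of Eq. (1); it is our own simplification and omits the hashing and shingle sampling that [4] relies on for scale.

```python
# Minimal sketch of shingle-based resemblance (Eq. 1), without the hashing and
# shingle sampling that [4] uses to handle large collections.

def shingles(text, k=4):
    """Return the set of k-grams (shingles) of a document's token sequence."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def resemblance(doc_a, doc_b, k=4):
    """r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|."""
    sa, sb = shingles(doc_a, k), shingles(doc_b, k)
    return len(sa & sb) / len(sa | sb)

# Two near-identical documents are judged duplicates above a high threshold.
a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river shore"
print(resemblance(a, b), resemblance(a, b) > 0.95)
```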
3 Observations of Queries and Duplicates

We investigate a log file provided by MSN¹, which contains 32,183,256 queries submitted to MSN in one day. In total, 11,609,842 unique queries are extracted from the log. Statistical analysis is conducted to gain insight into these queries and the duplicates in their corresponding search results. Below we report three important observations from our analysis that lead to the design of our duplicate detection algorithm.

3.1 Distribution of Query Frequencies

It is well known that the occurrence numbers of Web queries follow an 80-20 rule, which means that the 20% most frequent query terms account for 80% of the total query occurrences [10]. Some works have shown that the double-log plot of the rank-frequency distribution of queries approximately follows a Zipf distribution [10, 13], which means that the occurrences of the popular queries make up a major part of the whole query set. For example, in an analysis of AltaVista²'s query log, [6] reports that only 13.6% of queries occur more than 3 times, and the 25 most common queries form 1.5% of the total number of queries, despite being only 0.00000016% of the 154 million unique queries. In [10], it was found that the top 75 terms in frequency represent only 0.05% of all unique terms, yet they account for 9% of all 1,277,763 search terms in all unique
¹ http://search.msn.com
² http://www.altavista.com
queries. In [13], 2.56% and 5.40% of the queries in the two log data sets occur more than 10 times.
Fig. 1. Distribution of query frequency
Fig. 2. Proportion of queries, ranked by query frequency
Here we revisit this phenomenon by analyzing the MSN log. Figure 1 shows the distribution of query frequency in the log. The X axis is the proportion of queries, ranked by their frequencies, and the Y axis is the number of occurrences of the queries (on a log scale). It shows that a small portion of the queries is searched many times and that the frequency decreases very quickly. Figure 2 is a clearer illustration of the proportions of frequent queries. The Y axis is the accumulated proportion of query occurrences covered by the top X most frequent queries. It shows that, in the MSN log, nearly 60% of query occurrences are made up of the 10% most frequent queries, and 70% of query occurrences are made up of the 20% most frequent queries.
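A small sketch of how such a coverage curve can be computed from a query log (our own illustration; the log format, one query string per entry, is an assumption):

```python
# Illustrative sketch: cumulative share of query occurrences covered by the
# top fraction of distinct queries, as plotted in Figure 2.

from collections import Counter

def coverage_curve(queries):
    """Return (fraction of top queries, fraction of occurrences covered) pairs."""
    counts = sorted(Counter(queries).values(), reverse=True)
    total = sum(counts)
    covered, curve = 0, []
    for i, c in enumerate(counts, start=1):
        covered += c
        curve.append((i / len(counts), covered / total))
    return curve

log = ["mp3", "games", "mp3", "weather", "mp3", "games", "maps", "mp3"]
for top_frac, occ_frac in coverage_curve(log):
    print(f"top {top_frac:.0%} of queries -> {occ_frac:.0%} of occurrences")
```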
The significance of the skewed query frequency distribution is that we can provide duplicate detection capability for most queries even if only a small portion of the frequent queries are processed offline. For example, if the search results of the 10% most frequent queries are preprocessed to remove duplicates, we can directly return duplicate-free results for 60% of the queries submitted.

3.2 Duplicate Degrees for Popular and Unpopular Queries

The second problem we explored is whether there is any difference in the duplicate degree of search results between popular and unpopular queries. From the log, we randomly select 50 queries which are submitted more than 2,000 times as popular queries, and 50 queries which are submitted exactly 10 times as unpopular queries. Google³ supports disabling its duplicate filter: if the option "filter=0" is appended to the search request URL, duplicate pages in the search result are not filtered. Thus we use Google as our test bed by submitting these 100 queries to it with the duplicate filter disabled. There are 10 web pages in every search result page returned by Google. We fetch the cached results in the first 10 result pages and get 100 results for each query. Then we use the shingle based algorithm of [4] to detect duplicate documents. For each pair of detected duplicate documents, the one with the lower rank is taken as the duplicate and the one with the higher rank as the source of the duplicate (here we mean that rank 1 is higher than rank 2, rank 2 is higher than rank 3, and so on). We use a high threshold for the similarity measure; that is, unless the resemblance is higher than 0.95, two documents are not judged to be duplicates. Since 0.95 is rather high (1.0 stands for an exact match), resemblance here is considered transitive. So in the duplicate detection operation, we merge the duplicate lists using the following rule: if document A is a duplicate of document B and document C is a duplicate of document A, then we treat document C as a duplicate of B too. We leave out the one with the highest rank in a duplicate set and treat the others as duplicate documents.

The results of the analysis on duplicate degrees of popular and unpopular queries are shown in Figure 3. The average duplicate degree in the search results of popular queries is about 5.5%, while that of unpopular ones is about 2.6%. This means that there are more duplicate documents in the search results of popular queries. This observation coincides with our intuition, because popular queries are usually related to popular web pages, and popular web pages tend to have more duplicates on the Web. It indicates that users can benefit more from duplicate removal for popular queries, since there are more duplicates in their search results.

3.3 Duplicate Distribution in Search Results

The third analysis we conducted is to investigate the duplicate distributions in the search results of popular and unpopular queries. If most of the duplicates had low ranks, they would not appear in the first several result pages. Thus users may not care
³ http://www.google.com
too much about them, and detecting duplicates in search results might be less necessary, since most users check no more than 3 search result pages [10]. As shown in Figure 4, however, the duplicate distribution for both popular and unpopular queries is nearly random. In other words, duplicates can appear anywhere in the search results. This observation confirms the need for, and the importance of, detecting and removing duplicates in search results.
Fig. 3. Duplicate degree in retrieval result
Fig. 4. Duplicate distributions in search results of popular and unpopular queries
4 Query-Dependent Duplicate Detection Algorithm

Most prior work uses a query-independent strategy to detect duplicates in a collection of web pages. In this paper, we propose a query-dependent method for duplicate detection. Based on the three important observations in Section 3, we conclude that popular queries, which account for a major portion of all search requests, have more duplicates in their search results than unpopular queries. Also, duplicates can appear anywhere in search results. Therefore, we propose a hybrid method that intelligently takes advantage of query properties.
For popular queries, duplicates are detected and removed using an offline method in the preprocessing phase; for unpopular queries, we execute an online method to detect and remove duplicates at run time.

4.1 Duplicate Detection for Popular Queries

Popular queries can be obtained from query logs through statistical analysis, as shown in Section 3.1. Most search engines use an inverted file to index Web pages. An inverted index is made up of multiple inverted lists, and an inverted list contains a term and the IDs of the documents in which the term appears. For efficiency and ease of implementation, we take advantage of the inverted index to conduct duplicate detection. However, a standard inverted index only indexes separate terms, while a query usually contains multiple terms. We therefore extend the inverted index by treating each popular query as an index unit (like a "phrase") and building inverted lists for these queries. Duplicate detection is executed for each inverted list of a popular query. For each Web page, we only compare the shingles containing the query, to reduce the number of comparisons. We argue that this has little impact on accuracy, as in this case the goal is to detect duplicate "fragments" correlated with the query.

4.2 Duplicate Detection for Unpopular Queries

According to the analysis in Section 3.1, unpopular queries occur much less frequently than popular ones, and the number of distinct unpopular queries is large. So we can only deal with them at run time; otherwise we would suffer the same scalability problem as traditional methods. Since the total number of occurrences of unpopular queries is small, the impact of such an online method on search performance can be managed. In our implementation, only a few top-ranked result pages (e.g., including 1000 web pages) need to be processed, because most users check no more than the first 3 search result pages. Also, only shingles containing the query are used for comparison. With these strategies, the online processing overhead is greatly reduced.

4.3 Performance Improvement

To verify the feasibility and performance of our algorithm, we designed the following simulation experiment. The data used in the experiment is the query log described in Section 3. We suppose that when duplicate detection is done online, the cost for each query is 1. If the search results of a query have been processed offline, there is no online computation cost (or very little in comparison with the online processing cost). Then we increase the proportion of queries processed offline and calculate the total online processing time. Figure 5 shows the decrease in processing time (Y-axis) for online duplicate detection as the amount of offline work increases (X-axis).
The online processing time decreases quickly when X is small. On the other hand, more offline processing time is needed as the number of offline-processed queries increases. Obviously, we have to find the best trade-off between offline and online processing for better performance. This can be decided by the distribution of queries and other operational conditions, such as the interval of index updating and the volume of user requests.
Fig. 5. Simulation experiments of performance improvement
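A hedged sketch of the cost model behind this simulation (our own illustration): each occurrence of a query that is not preprocessed offline costs one unit of online work.

```python
# Illustrative cost model for Figure 5: each occurrence of a query that is not
# preprocessed offline costs one unit of online duplicate detection.

from collections import Counter

def online_cost(query_log, offline_fraction):
    """Total online cost when the top offline_fraction of distinct queries
    (by frequency) have duplicate-free results precomputed offline."""
    counts = Counter(query_log).most_common()
    n_offline = int(len(counts) * offline_fraction)
    return sum(freq for _, freq in counts[n_offline:])

log = ["mp3"] * 50 + ["games"] * 30 + ["weather"] * 15 + ["maps"] * 5
for frac in (0.0, 0.25, 0.5, 0.75):
    print(f"offline {frac:.0%}: online cost {online_cost(log, frac)}")
```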
Here we provide another analysis. The computational complexity of our proposed method is O(N·M·log M), where N stands for the number of queries and M is the number of returned documents relevant to a query. According to Search Engine Watch⁴, the busiest search engine served 250M queries per day in Feb 2003. Based on the results in [6] and [10], we estimate that about 25% of the queries are unique, which is 62.5M, and that less than 1% of queries occur more than 100 times⁵. Assuming we process the top 10% of queries and use the first 1,000 web pages returned for every query, the computational complexity of our proposed method will be about 6.25×10¹⁰. Considering the 3 billion web pages that are currently searchable on the Web, the computational complexity of traditional shingle based algorithms will be close to 9×10¹⁸! As can be seen, our proposed query-dependent algorithm is linear in the number of queries, and thus it is much more scalable than shingle based approaches.
⁴ http://www.searchenginewatch.com
⁵ Actually, according to our statistics, far fewer than 1% of queries occur more than 100 times.
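As a purely illustrative sketch of the hybrid scheme of Sections 4.1 and 4.2, assuming a precomputed store of duplicate-free result lists for popular queries and reusing the resemblance function sketched earlier; none of these names or signatures come from the paper, and for brevity the online pass compares full page texts rather than only the shingles containing the query.

```python
# Hypothetical sketch of the query-dependent hybrid: popular queries hit a
# precomputed duplicate-free result list; unpopular queries are deduplicated
# online over only the top-ranked results. 'resemblance' is the function from
# the earlier shingle sketch; 'search' is an assumed backend search function.

def handle_query(query, search, popular_results, top_k=1000, threshold=0.95):
    if query in popular_results:          # offline path: already deduplicated
        return popular_results[query]
    results = search(query)[:top_k]       # online path: top-ranked results only
    kept = []
    for page in results:
        dup = any(resemblance(page["text"], k["text"]) > threshold for k in kept)
        if not dup:
            kept.append(page)             # keep the highest-ranked copy
    return kept
```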
5 Conclusion and Future Work

Three important observations on the properties of queries and duplicates were reported in this paper. First, based on MSN query logs, we found that popular queries constitute a major portion of the whole set of search requests. Thus online duplicate detection can be omitted for most query occurrences if a small portion of the frequent queries is processed offline. Second, we found that popular queries often lead to more duplicates in the search results, so the benefit of duplicate removal for popular queries is more significant. Third, duplicates are found to be distributed randomly in search results. Based on these observations, we proposed a query-dependent duplicate detection scheme that combines the advantages of both online and offline methods. That is, it first conducts offline processing for popular queries and then does additional work at run time to further improve the performance for unpopular queries. Such a strategy can effectively deal with the scalability problem of traditional offline methods while avoiding the performance problem of traditional online methods.

Although syntactic duplicates can be detected by our methods, in our experimental results there are still many pages that have almost identical content but different formats, e.g., the same page presented with different site templates. For these pages, we cannot simply use a fixed threshold to determine whether they are duplicates; we have to compare both content and template. To deal with this kind of duplicate, one possible solution is to detect the website's template [14], partition pages into blocks [15][16], discard the template blocks, and then compute the similarity of two pages based on their content blocks. We plan to explore this direction in our future work. We have also started to explore duplicate detection in newsgroup and news search on the Web. We found that there are many more duplicates in these data than in general Web pages, and we believe duplicate detection will also greatly improve the quality of retrieval results in these two types of web search.
References

1. Sergey Brin, James Davis, and Hector Garcia-Molina. Copy Detection Mechanisms for Digital Documents. In Proceedings of the Special Interest Group on Management of Data (SIGMOD'95), pp. 298-409, 1995.
2. Peter J. Denning. Plagiarism in the Web. Communications of the ACM, Vol. 38, December 1995.
3. Nevin Heintze. Scalable Document Fingerprinting. In Proceedings of the Second USENIX Electronic Commerce Workshop, pp. 191-200, November 1996.
4. Andrei Z. Broder, Steven C. Glassman and Mark S. Manasse. Syntactic Clustering of the Web. In Proceedings of the Sixth International World Wide Web Conference (WWW6), 1997.
5. Narayanan Shivakumar and Hector Garcia-Molina. Finding Near-Replicas of Documents on the Web. In International Workshop on the Web and Databases (WebDB98), 1998.
6. Craig Silverstein, Monika Henzinger, Hannes Marais, and Michael Moricz. Analysis of a Very Large AltaVista Query Log. Technical Report 1998-014, Digital System Research Center, October 1998.
7. Daniel P. Lopresti. Models and Algorithms for Duplicate Document Detection. In Proceedings of the 5th International Conference on Document Analysis and Recognition, September 1999.
8. Krishna Bharat and Andrei Broder. Mirror on the Web: A Study of Host Pairs with Replicated Content. In Proceedings of the 8th International World Wide Web Conference (WWW8'99), pp. 501-512, 1999.
9. Mark Turner, Yuliya Katsnelson and Jim Smith. Large-Scale Duplicate Document Detection in Operation. In Proceedings of the 2001 Symposium on Document Image Understanding Technology, 2001.
10. Amanda Spink, Dietmar Wolfram, Bernard Jansen and Tefko Saracevic. Searching the Web: The Public and Their Queries. Journal of the American Society for Information Science, Vol. 53, No. 2, pp. 226-234, 2001.
11. Abdur Chowdhury, Ophir Frieder, David Grossman and Mary Catherine McCabe. Collection Statistics for Fast Duplicate Document Detection. ACM Transactions on Information Systems, Vol. 20, No. 2, pp. 171-191, April 2002.
12. James W. Cooper, Anni R. Coden and Eric W. Brown. Detecting Similar Documents using Salient Terms. In the 11th International Conference on Information and Knowledge Management (CIKM'02), November 2002.
13. Yinglian Xie and David O'Hallaron. Locality in Search Engine Queries and its Implications for Caching. In Proceedings of IEEE Infocom 2002, June 2002.
14. Ziv Bar-Yossef and Sridhar Rajagopalan. Template Detection via Data Mining and its Applications. In Proceedings of the 11th International World Wide Web Conference (WWW 2002), 2002.
15. Shipeng Yu, Deng Cai, Ji-Rong Wen and Wei-Ying Ma. Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation. In Proceedings of the 12th International World Wide Web Conference (WWW 2003), pp. 11-18, May 2003.
16. Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma. Extracting Content Structure for Web Pages Based on Visual Representation. In Proceedings of the 5th Asia Pacific Web Conference (APWeb'03), pp. 406-417, 2003.
A Semantic Approach for Web Indexing

B. Wang and G.R. Brookes

Computer Science Department, University of Hull, Hull, HU6 7RX, United Kingdom
Abstract. Classical term weighting approaches normally use only term frequency to determine the importance of documents. In this paper, a new semantic approach is introduced to weight the index terms of Web documents. The approach measures the importance of index terms on the basis of their semantic role in the document. The model and its semantic measurement of index terms are explained, and some experimental results are used to show the significance of the new approach.
1 Introduction
In a traditional Web search approach, a search engine normally displays the search results in order of decreasing term frequency. In fact, the most relevant results are not necessarily those with high term frequency; thus, using only term frequency information is insufficient to rank results by relevance. Most search engines therefore combine classical term weighting schemes with some off-content factors as a supplement. These factors include user popularity [9], PageRank [14] and authority analysis [6]. User popularity methods collect visiting information in the background when users obtain results from queries, and use the count of clicks in ranking pages. The PageRank method uses the link structure of the Web to find a global importance ranking of every page; pages are therefore ranked by counting citations. Authority analysis methods use the relationship between hub pages and authoritative pages to return the best authorities. Using the sub-graph of the link structure about a topic, hubs and authorities can be found, and an iterative algorithm is used to reinforce the relationship between them: pages get more authority weight if they are pointed to by pages with high hub weight, and at the same time, pages pointing to pages with high authority weight get more hub weight. The classical term weighting schemes are still the dominant, if not the only, tools to represent the content of web pages. In this paper, we discuss an approach to improve search results by measuring the semantic significance of index terms. We first review related research in Section 2. Section 3 presents the proposed approach. The experiments and testing are described in Section 4. Finally, Section 5 concludes the research and outlines the major issues for future development.
2 Background Review
There are two major research trends: statistical and intelligent indexing. Statistical indexing uses statistical information, such as word frequencies in a document
and their distributions in the collection, to calculate the weight of index terms. Luhn [11] first proposed the idea that frequency data can be used to measure the significance of a word in representing an information item. Hence, term frequency, inverse document frequency, signal ratio, and discrimination value [10] methods are used in the traditional indexing approach. Term frequency weighting assumes that all words in the collection are of equal quality. In reality, however, some terms provide less information for distinguishing information items. In order to improve indexing quality, inverse document frequency, signal ratio and discrimination value were introduced. Other classifications of term weighting, such as term and collection frequency [15] and local and global weights [5], use the same traditional indexing mechanisms. Statistical index weighting schemes assume that index terms are independent. In most situations, however, index terms are neither independent nor monolithic. A concept can often be represented by multiple individual terms (synonyms), and a term can have multiple meanings (senses). For this reason, intelligent indexing methods have been introduced. There are two types of intelligent indexing: conceptual and semantic indexing. Conceptual indexing [1] is a methodology that keeps the conceptual structure of phrases in the index. With the taxonomy knowledge base from Sun Microsystems Laboratories and the popular lexical database WordNet [17], a structure of conceptual descriptions was created. For instance, a car is an automobile, and washing is a type of cleaning; it is further inferred that car washing is a type of automobile cleaning. This structural relationship is used to organise a collection of phrases into a conceptual taxonomy. When retrieving a document, all the items indexed by sub-ordinate concepts should be returned as well as those indexed by the query concept itself. Semantic indexing prevents mismatches by using word senses or synonym sets in indexing instead of word forms. When a term is indexed, it is labelled with the intended meaning of that term. Thus, if the computer device mouse is required, Mickey Mouse will not be returned. Gonzalo's research [8] shows that retrieval results were improved by about 29% using semantic indexing techniques. The conceptual and semantic indexing models are concerned with the accuracy of index terms at the semantic level; they do not provide a measurement of the importance of index terms. The new approach described in this paper proposes a new semantic weighting method. This method uses the semantic relations of index terms, together with term co-occurrence data, when weighting index terms. Thus, the proposed approach can reveal the potential importance of terms when they are used to represent documents.
3 A Semantic Approach
3.1 Introduction
The model is developed on the assumption that a given term is semantically related to some other terms in the same information field. Such a term carries some
potential semantic information about the document's content and will therefore be more important when indexing. The proposed approach is to give extra weight to terms that have more semantically associated terms in the same information field. For example, a PLANT is known to have parts such as ROOT, STEM, LEAF, FRUIT, and so forth. In a document, the appearance of the word PLANT together with ROOT, LEAF and FRUIT may confirm that PLANT is part of the content of the document. Although the term frequency of PLANT in this document may be low, it may gain more weight because of the presence of the terms ROOT, LEAF and FRUIT. On the basis of this assumption, we use the following procedures to decide the weight of a term.
3.2 Assigning Frequency Weight
In order to get the initial weight, we use the traditional approach to assign a weight to each term. Firstly, a stop-list is used to remove all stop words such as the and a, and the remaining words are stemmed. Then the document collection is transformed into a term-document matrix. In this matrix, each row indicates whether a given term is used as an index term in each document, and each column indicates whether a document contains a particular index term. When all the documents are indexed, the tf-idf weight is assigned to each index term for every document. In the traditional tf-idf weighting method, two concepts are used to represent the importance of indexing terms: the term frequency (tf) in the given document and the inverse document frequency (idf) of the term in the whole database. The term frequency in the given document indicates how important the term is in that document, whereas the document frequency of the term, i.e. the percentage of the documents which contain the term, shows how generally common the term is. Thus, a high weight in a tf-idf ranking scheme is reached by a high term frequency in the given document and a low document frequency of the term in the whole database. We use Wij to represent the tf-idf weight of the ith term in the jth document. Consequently, the first matrix, the frequency weight matrix W, is generated.
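A minimal sketch of this step is shown below, assuming a simple whitespace tokeniser, a small illustrative stop-list and the common log(N/df) variant of idf; the variable names (docs, vocab, W) and the toy documents are our own rather than the paper's.

import math
import numpy as np

docs = ["the plant has a root and a leaf",
        "mickey mouse is a cartoon mouse",
        "the plant produces fruit"]
stop_words = {"the", "a", "is", "and", "has"}

# Tokenise and drop stop words (stemming omitted for brevity)
tokenised = [[w for w in d.split() if w not in stop_words] for d in docs]
vocab = sorted({w for doc in tokenised for w in doc})

n_terms, n_docs = len(vocab), len(docs)
tf = np.zeros((n_terms, n_docs))
for j, doc in enumerate(tokenised):
    for w in doc:
        tf[vocab.index(w), j] += 1

# Document frequency and idf; W[i, j] is the tf-idf weight of term i in document j
df = (tf > 0).sum(axis=1)
idf = np.log(n_docs / df)
W = tf * idf[:, None]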
3.3 Assigning Term Relationship Values
In order to find the relationships among indexing terms, a second matrix is generated. This matrix records the association relationships between terms. The principle is very simple: if a term has a relationship with another term, the value at their corresponding position in the term association matrix is set to 1. Thus, the second matrix, the term association relationship matrix T, is generated, where Tij represents the association relationship value between the ith term and the jth term. To further simplify the situation, we use the binary values 0 and 1 to indicate such relationships between terms. If Tij is 1, the ith term has an association relationship with the jth term; 0 represents that no such relationship exists between the terms.
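The sketch below builds such a binary association matrix with NLTK's WordNet interface, treating two index terms as related if one appears among the other's direct hypernyms, hyponyms, meronyms or holonyms. The choice of these relations, the use of NLTK, and setting the diagonal to 1 are assumptions made for illustration, not details fixed by the paper.

import numpy as np
from nltk.corpus import wordnet as wn

def related_lemmas(word):
    # Collect lemma names reachable from `word` by one direct WordNet relation
    names = set()
    for syn in wn.synsets(word):
        neighbours = (syn.hypernyms() + syn.hyponyms() +
                      syn.part_meronyms() + syn.part_holonyms() +
                      syn.member_meronyms() + syn.member_holonyms())
        for nb in neighbours:
            names.update(l.name().lower() for l in nb.lemmas())
    return names

def association_matrix(vocab):
    T = np.eye(len(vocab))                 # a term is trivially associated with itself
    for i, ti in enumerate(vocab):
        rel = related_lemmas(ti)
        for j, tj in enumerate(vocab):
            if i != j and tj in rel:
                T[i, j] = T[j, i] = 1      # symmetric binary association
    return T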
3.4 Combining W and T Matrices
The above two matrices measure, respectively, the importance of indexing terms and their association relationships. In order to further reveal the relationship between terms and documents, we generate a third matrix Z as the product of T and W (Z = T × W). This product represents the importance of a term that has associations with other terms in a given document: the value Zij indicates the importance of the ith term in the jth document. In order to select the most important terms, a threshold λ is used. This is an artificial value, decided on the basis of the above product; in general, we take the mean of the total term weight values. Using this threshold, we can then decide the final term-document relationship matrix A: Aij = 1 if and only if Zij > λ.
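Continuing the earlier sketch, combining the two matrices and applying the mean-valued threshold might look as follows; treating λ as the mean of all entries of Z follows the description above, while everything else is illustrative.

# Z combines the frequency weights (W) with the term associations (T)
Z = T @ W

# Threshold: the mean of all term weight values in Z
lam = Z.mean()

# Final binary term-document relationship matrix
A_final = (Z > lam).astype(int)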
3.5 Parameters in Semantics Discovery
There are various relations among words, but not all of them benefit the indexing procedure. One of the major tasks in semantics weighting is to find the useful semantic relations with which to build the term-term association matrix. Thesauri contain rich information about semantic relations among words and are helpful tools for mining term relationships for information retrieval. At present, the WordNet lexical database[17] is the reference source for discovering the semantic relations of terms for indexing in the proposed model. In WordNet, terms are grouped into Synsets, i.e. synonym sets. Different relations are defined among Synsets, for instance Hypernym (is-a), Hyponym (instance-of), Meronym (has-part/has-member) and Holonym (member-of/part-of) relations. WordNet provides about 19 direct relations, and more indirect relations can be deduced from these direct relations. The first problem is to find the appropriate relations for indexing. The second problem is to decide the semantic spreading depth for the given semantic relations. Spreading activation[2] originally comes from cognitive science, where it describes the process of activating related concepts when retrieving information from human memory. Semantic spreading brings further words related through a given semantic relation into the term-term association matrix. Unconstrained semantic spreading may activate terms over too wide a range, which does not help indexing; the spreading depth controls the degree to which two words are considered related. The third problem is, once the relations and spreading depth are decided, whether the spreading path should follow all the selected relations or be limited to the previous relation type. Spreading that follows all the possible relations is called blind spreading, whereas spreading that follows only one kind of relation is called non-blind spreading. The final problem is that there are four part-of-speech categories in WordNet (nouns, verbs, adjectives, and adverbs); should all the categories be kept in semantics weighting, or only the nouns? As pointed out by some researchers[3], index terms are usually nouns or noun groups. For the weighting procedure, fewer categories mean the system will work more efficiently, but the question is whether this damages the effectiveness of retrieval at the same time.
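As an illustration of spreading depth, the hedged sketch below collects the words reachable from a term by repeatedly following a single relation type (non-blind spreading) up to a given depth, again using NLTK's WordNet interface; the relation chosen and the depth limit are experimental parameters rather than values fixed by the paper.

from nltk.corpus import wordnet as wn

def spread(word, relation="hyponyms", depth=2):
    # Non-blind spreading: follow one WordNet relation up to `depth` steps
    frontier = set(wn.synsets(word))
    seen = set(frontier)
    related_words = set()
    for _ in range(depth):
        next_frontier = set()
        for syn in frontier:
            for nb in getattr(syn, relation)():
                if nb not in seen:
                    seen.add(nb)
                    next_frontier.add(nb)
                    related_words.update(l.name().lower() for l in nb.lemmas())
        frontier = next_frontier
    return related_words

# e.g. spread("tree", "hyponyms", depth=2) includes kinds of tree such as "palm"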
To find the appropriate relations, spreading depth, path and range for indexing, a set of tests was carried out. In these tests, documents in the test collections were retrieved according to the semantics weight only, so that the impact of semantics weighting could be observed. The retrieval effectiveness measurements Recall and Precision[16] were used to evaluate the parameters. Recall is the ratio of the number of documents retrieved and relevant to the total number of relevant documents in the collection. Precision is the ratio of the number of documents retrieved and relevant to the total number of documents retrieved from the collection. An interpolated Precision at 10 fixed Recall intervals (0.1, 0.2, ..., 1.0) was computed; the interpolated Precision is the highest Precision achieved at or above a given Recall level. A detailed illustration of the interpolation can be found in Salton's book[7]. The interpolation process makes it possible to estimate the average Precision over all the queries in a collection. On the basis of the 10-interval Precision, the average Precision was computed as the indicator of effectiveness.
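A sketch of the interpolation, under the standard convention that the interpolated Precision at a Recall point is the maximum Precision achieved at that Recall level or higher, is given below; the ranked-list input format is assumed.

def interpolated_precision(ranked_relevance, n_relevant, levels=None):
    # ranked_relevance: list of booleans, True if the i-th retrieved document is relevant
    if levels is None:
        levels = [i / 10 for i in range(1, 11)]          # 0.1, 0.2, ..., 1.0
    points = []                                          # (recall, precision) down the ranking
    hits = 0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            points.append((hits / n_relevant, hits / i))
    interpolated = []
    for r in levels:
        candidates = [p for rec, p in points if rec >= r]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated, sum(interpolated) / len(interpolated)   # per-level and average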
4 Testing and Evaluation
Three document sets are used to test the model: ADI, MEDLINE and TIME. The results from the different collections are shown in the following tables and are inconsistent among the collections. The best result comes from the collection ADI, where the Semantics Weighting model improves the Precision at every Recall interval. In MEDLINE, Semantics Weighting worsens the Precision at almost all the Recall intervals. In TIME, the Semantics Weighting model slightly improves the Precision at all the Recall intervals. The following table shows the results for 35 queries in the collection ADI.

Recall  tf-idf weight  SW model  Improvement
0.1     0.6537         0.6872    5.12%
0.2     0.6199         0.6565    5.90%
0.3     0.5228         0.5553    6.22%
0.4     0.4706         0.5074    7.82%
0.5     0.4567         0.4964    8.69%
0.6     0.3476         0.3842    10.53%
0.7     0.2429         0.2733    12.52%
0.8     0.2124         0.2459    15.77%
0.9     0.1697         0.2021    19.09%
1.0     0.1697         0.2021    19.09%

In ADI, the Semantics Weighting model increases the Precision by about 0.03 at every Recall interval. The proportion becomes larger at higher Recall intervals due to the lower baselines. The average improvement of the Semantics Weighting model over all the Recall intervals is 11.1%. This result shows that the semantics weighting model can bring the relevant documents to better rankings. The following table shows the interpolated precision for 30 queries in the collection MEDLINE.
Recall  tf-idf weight  SW model  Improvement
0.1     0.8338         0.8308    -0.36%
0.2     0.7382         0.7370    -0.16%
0.3     0.6983         0.6961    -0.32%
0.4     0.6242         0.6298    0.90%
0.5     0.5220         0.5175    -0.86%
0.6     0.4488         0.4419    -1.54%
0.7     0.3898         0.3802    -2.46%
0.8     0.3202         0.3058    -4.50%
0.9     0.1992         0.1932    -3.01%
1.0     0.0632         0.0608    -3.80%

In MEDLINE, the tf-idf weight works well: the Precision at the lowest Recall level is higher than 0.8, and at 9 of the 10 Recall intervals the semantics weight decreases the Precision. MEDLINE is a collection about medicine, and its documents and queries are full of specialised proper terms. Semantic analysis using a general-purpose lexical database tends to emphasise the relations among non-proper terms. For example, the 20th query in the collection is "somatotropin as it effects bone, bone development, regeneration, resorption, bone cells, osteogenesis, physiologic calcification or ossification, cartilage and bone diseases in general. somatotropin as it relates to hypophysectomy, pituitary function, diseases, dwarfism, neoplasms, hypopituitarism and hyperpituitarism, and growth in general." After removing the stop-words, the remaining terms are listed in the following table. The numbers in the table are the number of times each term receives semantics weight in the collection.

Terms          Times   Terms            Times   Terms             Times
somatotropin   2       physiologic      1       pituitary         13
effect         203     calcification    2       function          56
bone           53      ossification     **      dwarfism          1
development    86      cartilage        7       neoplasm          8
regeneration   6       disease          94      hypopituitarism   *
resorption     *       general          26      hyperpituitarism  *
cell           199     relate           63      growth            98
osteogenesis   *       hypophysectomy   *

* the term was not found in the lexical database. ** the term does not appear in the documents of this collection.

As shown in the above table, several proper terms cannot be found in the lexical database, and the other proper terms receive semantics weight only a few times in the whole collection of 1033 documents. Meanwhile, the general terms receive heavier weighting, so the proper terms have relatively smaller weight in the similarity calculation. In this collection, however, the proper terms are the more important ones for representing the content. In this case, a medical dictionary may help more to discover the semantics. The final test is on the TIME document set. The result of the interpolated precision for 83 queries is shown in the following table.
In TIME, the Precision with the tf-idf weight is very good: even at the highest Recall level, the Precision is higher than 0.4. This indicates that most relevant documents already obtain good rankings according to the frequency weight, and the Semantics Weighting model can offer only a slight improvement. The TIME collection consists of world news articles from Time magazine. Better results were expected because it does not contain as many domain-specific terms as MEDLINE. However, on looking into this collection, it was found to be full of proper names of places and persons. WordNet covers some of them, but quite a few are left out. Thus, just as in MEDLINE, the tf-idf weight works well, but the semantics weight cannot help much. As shown above, although WordNet is a popular lexical database in information retrieval research, it has some limitations, such as the lack of proper-name terms and of domain-specific relationships among words. These limitations have also been found in other research tasks, for instance query expansion[12], and those researchers suggested complementing WordNet with other thesauri. The experimental comparison shows that the Semantics Weighting model can improve the retrieval effectiveness, but its performance may be affected by the fitness of the thesaurus. The advantage of the Semantics Weighting model can be illustrated by a further analysis of a Web search using the leading Web search engine Google. Information about tree was wanted, so the query tree was sent to Google, and 20,800,000 WebPages were returned. All these pages contained the word tree or a backlink[13] with it. Among the top 20 pages returned by Google, four are related to tree; they are ranked at 4, 5, 6 and 11. According to the tf-idf weight, the relevant pages are ranked at 3, 4, 13, and 18. According to the semantics weighting model, a page obtains a better rank if it has not only a high term frequency of tree, but also terms semantically related to tree. According to this principle, the Semantics Weighting model brings most of the relevant pages to the top, and the top 3 results are all relevant. A comparison of the rankings by Google, the tf-idf weight and the Semantics Weighting model is shown in the following table.
Page ID  Google  Tf-idf  SW model  Relevance
1        1       1       4         -
2        2       12      6         -
3        3       2       5         -
4        4       4       2         +
5        5       18      18        +
6        6       3       3         +
7        7       20      20        -
8        8       17      17        -
9        9       16      16        -
10       10      14      14        -
11       11      13      1         +
12       12      9       11        -
13       13      15      15        -
14       14      7       9         -
15       15      19      19        -
16       16      8       10        -
17       17      5       7         -
18       18      10      12        -
19       19      11      13        -
20       20      6       8         -

In the above table, + indicates that the page is relevant and - indicates that it is irrelevant. The position of page 11 changes dramatically. Page 11 is a relevant page; however, according to Google and the tf-idf rule, it is ranked only at number 11 and 13, respectively. In fact, the terms forest and palm occur in the page. Forest is a Holonym of tree, which means a tree is a member of a forest; palm is a Hyponym of tree, which means a palm tree is a kind of tree. These semantically related terms provide the clues that tree is an important word in this page, so it is ranked at number 1 by the Semantics Weighting model. Although the result is improved by the Semantics Weighting model, there is still an obvious shortcoming. Page 5, which contains the terms tree, leaf, and fruit, is clearly related to the query for a human reader; judged by WordNet, however, there is no direct relation between fruit or leaf and tree. WordNet is a general-purpose lexical database and keeps a complex term structure, so sometimes the semantic distance indicated by the spreading depth in WordNet does not match the semantic distance in common sense. Other methods of measuring semantic distance in WordNet have been tested by some researchers[4], but there is no final conclusion on the best way to measure it. The consistency between the semantic distance measurement in WordNet and human knowledge is one of the important aspects for improving the performance of the Semantics Weighting model.
5 Conclusion
In this paper, a new way of indexing terms for web servers is proposed. The approach exploits the semantics underlying co-occurring terms in the term weighting scheme, and it can be used as an efficient method to index massive numbers of web documents. This approach offers an improvement over the classical tf-idf weighting on general-topic collections. From the indexing point of view, hierarchical relations among terms are the most useful for finding the important index terms, and two-level blind spreading over all term categories is a better option for building term-term associations from WordNet. However, there are still unsolved problems in this approach, for instance rank merging. In other pilot tests for this study, it was found that an improvement in semantics similarity did not always bring an improvement in the final ranking result; the rank merging algorithm is where the complementarity of the semantics weight and the frequency weight emerges. On the other hand, the limitations of WordNet restrain the performance of semantics weighting. WordNet is a general-purpose lexical database and lacks proper terms and relations in specific domains, so complementing it with other sources is necessary. Finally, the semantic distance among words is measured by the spreading depth in WordNet; this is not sufficient, and the result may not match the semantic distance in common sense. All these issues will be addressed in future research, and a new version of the model will be published shortly. Special thanks to my PhD students, especially Yao, for her constructive ideas and all the tests and exercises she has done for testing the model.
References
1. W. A. Woods. Conceptual indexing: a better way to organize knowledge. Technical Report TR-97-61, Sun Microsystems Laboratories, 1997.
2. A. D. Baddeley. Memory: Theory and Practice, chapter Knowledge. East Sussex: Psychology Press, 1997.
3. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.
4. A. Budanitsky and G. Hirst. Semantic distance in WordNet: an experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources, North American Chapter of the Association for Computational Linguistics (NAACL-2000), Pittsburgh, PA, USA, 2001.
5. E. Chisholm and T. G. Kolda. New term weighting formulas for the vector space method in information retrieval. Technical Memorandum ORNL-13756, Oak Ridge National Laboratory, 1999.
6. Clever. Clever searching. http://www.almaden.ibm.com/cs/k53/clever.html, 2002.
7. G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
8. J. Gonzalo et al. Indexing with WordNet synsets can improve text retrieval. In Proceedings of the COLING/ACL '98 Workshop on Usage of WordNet for NLP, Montreal, 1998.
9. Direct Hit. Direct Hit system. http://www.directhit.com/help/score.html, 2002.
10. G. Kowalski. Information Retrieval Systems: Theory and Implementation, chapter Automatic Indexing. 1997.
11. H. P. Luhn. A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, pages 309-317, 1957.
12. R. Mandala et al. Complementing WordNet with Roget's and corpus-based thesauri for information retrieval. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, pages 94-101, 1999.
13. L. Page et al. The PageRank citation ranking: bringing order to the web. Technical report, Computer Science Department, Stanford University, 1998.
14. PageRank. Our search. http://www.google.com/technology/, 2002.
15. S. E. Robertson and K. Sparck-Jones. Simple proven approaches to text retrieval. Technical Report TR356, Cambridge University Computer Laboratory, 1997.
16. C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979.
17. WordNet lexical database. Princeton University. http://wordnet.princeton.edu/, 2002.
Semantic, Hierarchical, Online Clustering of Web Search Results Dell Zhang1,2 and Yisheng Dong3 1
Department of Computer Science, School of Computing, National University of Singapore S15-05-24, 3 Science Drive 2, Singapore 117543 2 Computer Science Programme, Singapore-MIT Alliance E4-04-10, 4 Engineering Drive 3, Singapore 117576 [email protected] 3 Department of Computer Science & Engineering, Southeast University Nanjing, 210096, P. R. China [email protected]
Abstract. We propose a Semantic, Hierarchical, Online Clustering (SHOC) approach to automatically organizing Web search results into groups. SHOC combines the power of two novel techniques, key phrase discovery and orthogonal clustering, to generate clusters which are both reasonable and readable. Moreover, SHOC can work for multiple languages: not only English but also oriental languages like Chinese. The main contribution of this paper includes the following. (1) The benefits of using key phrases as Web document features are discussed. A key phrase discovery algorithm based on suffix array is presented. This algorithm is highly effective and efficient no matter how large the language’s alphabet is. (2) The concept of orthogonal clustering is proposed for general clustering problems. The reason why matrix Singular Value Decomposition (SVD) can provide solution to orthogonal clustering is strictly proved. The orthogonal clustering has a solid mathematics foundation and many advantages over traditional heuristic clustering algorithms.
1 Introduction Today, the search engine is the most commonly used tool for Web information retrieval; however, its current status is still far from satisfactory. We think that clustering of Web search results could help users find relevant Web information more easily and quickly: if Web search results are presented in groups, users are able to get an overview of the whole topic and select only the groups of interest to browse. In this paper, we propose a Semantic, Hierarchical, Online Clustering (SHOC) approach to automatically organizing Web search results into groups. SHOC combines the power of two novel techniques, key phrase discovery and orthogonal clustering, to generate clusters which are both reasonable and readable. Moreover, SHOC can work for multiple languages: not only English but also oriental languages like Chinese. We have created a prototype system, Web Information Clustering Engine (WICE), to demonstrate the effectiveness of SHOC.
2 Related Work D. Cutting et al. have created the Scatter/Gather system to cluster Web search results [3]. However, their system has some limitations due to the shortcomings of the traditional heuristic clustering algorithms (e.g. k-means) they used. Y. Wang et al. have proposed an interesting method to cluster Web search results based on hyperlinks [10]. But their method needs to download and parse the original Web pages, so it is not able to provide clustering results quickly. Our work is an extension of O. Zamir and O. Etzioni’s Suffix Tree Clustering (STC) approach to clustering Web search results [12]. We attempt to overcome the following shortcomings of STC: its key phrase discovery algorithm based on suffix tree is not suitable for oriental languages like Chinese; documents containing no key phrase become inaccessible since they are not included in any cluster; sometimes taking the hierarchy of the constructed suffix tree as the hierarchy of the generated clusters is unreasonable; the phenomena of synonymy and polysemy are neglected, and so on.
3 The SHOC Approach
A practical approach to clustering of Web search results should meet the following requirements.
• Semantic. The clustering algorithm should group search results based on their semantic topic. Since a search result may have multiple topics, it is instructive not to confine one search result to only one cluster. The clustering algorithm should also provide each cluster with a label that describes the cluster's topic, so that users can determine at a glance whether a cluster is of interest.
• Hierarchical. The clustering algorithm should automatically organize the generated clusters into a tree structure to facilitate user browsing.
• Online. The clustering algorithm should be able to provide fresh clustering results "just-in-time".
These requirements will be emphasized throughout this paper. Our proposed SHOC approach is composed of three major steps: (1) data collection and cleaning; (2) feature extraction; (3) identifying and organizing clusters.
3.1 Data Collection and Cleaning
The data collection task here is actually meta-search. Given a query, we simply forward it to several search engines and then collect their search results (lists of pointers to Web pages). Usually the search results returned by a search engine are partitioned into several result-pages to facilitate user browsing. To get high efficiency for meta-search, we use a two-level parallelization mechanism: (1) search engines are called in parallel via multi-threads; (2) all of a search engine's result-pages are fetched in parallel via multi-threads. Then we merge all the lists of search results from the different search engines into a uniform ranked list. Duplicate search results are discarded.
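A hedged sketch of the two-level parallelization is shown below; fetch_result_page stands in for the engine-specific HTTP and HTML-parsing code that the paper does not spell out, and the (url, title, snippet) record format is an assumption.

from concurrent.futures import ThreadPoolExecutor

def fetch_result_page(engine, query, page):
    # Placeholder for engine-specific fetching/parsing; assumed to return
    # a list of (url, title, snippet) triples for one result-page.
    raise NotImplementedError

def meta_search(engines, query, pages_per_engine=3):
    with ThreadPoolExecutor() as pool:
        # levels 1 and 2: all engines and all of their result-pages in parallel
        futures = [pool.submit(fetch_result_page, e, query, p)
                   for e in engines for p in range(pages_per_engine)]
        results = [r for f in futures for r in f.result()]
    # merge into one uniform ranked list, discarding duplicates by URL
    seen, merged = set(), []
    for url, title, snippet in results:
        if url not in seen:
            seen.add(url)
            merged.append((url, title, snippet))
    return merged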
A search result usually includes the URL, title and snippet of its corresponding Web page. Note that the "online" requirement implies that we do not have time to download the original complete Web pages that the search results point to; in reality, most users are unwilling to wait for the clustering system to download the original Web pages. Therefore we take a Web page's title and snippet in the search results as a good summary of its content, and use it as a "document" to be fed to the clustering algorithm. Each document is parsed and split into sentences according to punctuation (period, comma, semicolon, question mark, etc.) and HTML tags. The non-word tokens are stripped, the redundant spaces are compressed, and the English words are stemmed.
3.2 Feature Extraction
Most clustering algorithms treat a document as a "bag-of-words", totally ignoring word order and proximity, which may provide important information for clustering. In contrast, we decide to use key phrases extracted from the document collection as document features. The benefit is two-fold: (1) it can improve the quality of the clusters by leveraging more information present in the documents; (2) it is helpful for constructing concise and accurate labels for the generated clusters [12]. The key phrase discovery algorithm for a document collection is the same as that for a single document, because a document collection can be treated as a pseudo-document. In our scenario, a document is essentially a string of characters, and a key phrase is defined as a meaningful substring within a sentence that is both specific and significant. Every phrase is restricted to be within a sentence because sentence boundaries usually indicate topical shifts; this also reduces the cost of our key phrase discovery algorithm. Given a document T of length N, we check whether a substring S of T can be a key phrase through three measures: completeness, stability, and significance.
DEFINITION Suppose S occurs in k distinct positions p_1, p_2, ..., p_k in T. S is "complete" if and only if the (p_i − 1)-th token in T differs from the (p_j − 1)-th token for at least one pair (i, j), 1 ≤ i < j ≤ k (called "left-complete"), and the (p_i + |S|)-th token differs from the (p_j + |S|)-th token for at least one pair (i, j), 1 ≤ i < j ≤ k (called "right-complete"). [14]
DEFINITION Suppose S = c_1 c_2 ... c_P. The stability (mutual information) of S is MI(S) = f(S) / ( f(S_L) + f(S_R) − f(S) ), where S_L = c_1 ... c_{P−1}, S_R = c_2 ... c_P, and f(S), f(S_L), f(S_R) are the frequencies of S, S_L and S_R.
DEFINITION The significance of S can be estimated as se(S) = f(S) × g(|S|), where f(S) is the frequency of S, |S| is the length of S, and g(x) is a heuristic utility function of the string length with g(1) = 0, g(x) = log_2(x) when 2 ≤ x ≤ 8, and g(x) = 3 when x > 8.
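A direct transcription of these measures into code might look as follows; the frequency function simply counts (possibly overlapping) occurrences, and the thresholds used to call a substring "stable" or "significant" are left to the caller because they are not fixed here.

import math

def freq(s, text):
    # Count occurrences of substring s in text, overlapping occurrences included
    count, start = 0, 0
    while True:
        pos = text.find(s, start)
        if pos < 0:
            return count
        count += 1
        start = pos + 1

def stability(s, text):
    # MI(S) = f(S) / (f(S_L) + f(S_R) - f(S)), with S_L, S_R the longest proper prefix/suffix
    if len(s) < 2:
        return 0.0
    f_s = freq(s, text)
    if f_s == 0:
        return 0.0
    f_l, f_r = freq(s[:-1], text), freq(s[1:], text)
    return f_s / (f_l + f_r - f_s)

def g(x):
    if x <= 1:
        return 0.0
    return math.log2(x) if x <= 8 else 3.0

def significance(s, text):
    # se(S) = f(S) * g(|S|)
    return freq(s, text) * g(len(s))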
The task of key phrase discovery in English could be accomplished efficiently using the suffix tree data structure [6], as described in [12]. However, the
performance (time and space complexity) of suffix tree is related to the alphabet size of language [9]. As we all know, oriental languages have much larger alphabets than English, e.g., Chinese has more than 6,000 characters. Hence using suffix tree is not efficient for key phrase discovery in oriental language text. Besides, oriental languages like Chinese do not have explicit word separators (such as blanks in English) [2]. Therefore straightforwardly applying English key phrase discovery algorithms to multi-lingual Web pages may extract many meaningless partial phrases. Here we present a novel key phrase discovery algorithm based on suffix array, which is both scalable over alphabet size and able to avoid extracting meaningless partial phrases. The suffix array data structure was introduced as a text indexing technique [9]. Using a suffix array, online string searches of the type, “Is S a substring of T ?” can be answered in O( P + log N ) time, where P is the length of S and N is the length of T . Such efficiency is competitive with (and in some cases slightly better than) that of using a suffix tree. A suffix array can be constructed with O( N ) expected time complexity, regardless of the alphabet size. The major advantage of using suffix array over using suffix tree is in space. A suffix tree needs O( N Σ ) space that grows with alphabet size Σ . Manber and Myers [9] reported that suffix arrays are an order of magnitude more efficient in space than suffix trees even in the case of relatively small alphabet size ( Σ = 96 ). The advantage of using suffix array over using suffix trees is significant for large alphabet languages like Chinese. The suffix array s of a document T , is an array of all N suffixes of T , sorted alphabetically. A suffix (also known as semi-infinite string) s[i ] , is a string that starts at position i in the text and continues to the end of the text. In practice, a suffix s[i ] is typically denoted by a 4-byte integer, i , which one might have thought would require O( N ) space. Manber and Myers’ algorithm [9] uses an LCP array to accelerate searching operation. The LCP array lcp corresponding to s stores N + 1 integer elements, where lcp[i ] (1 ≤ i ≤ N − 1) indicates the length of the longest common prefix between s[i − 1] and s[i ] , lcp[0] = lcp[ N ] = 0 . Given the suffix array s , the LCP array lcp can also be constructed with O( N ) expected time complexity. Given a document T of length N , a set of key phrases can be efficiently extracted using its suffix array s and its LCP array lcp . THEOREM A substring S of T is right-complete if and only if there is a w (1 ≤ w < N ) and S is the LCP of s[ w − 1] and s[ w] . PROOF Omitted due to space limit. It turns out that every right-complete substring (including complete substring) of T can be identified by the position of a pair of adjacent suffixes in the suffix array. DEFINITION A right-complete substring of T , RCS , can be uniquely identified by ID( RCS ) = min { w
| 1 ≤ w < N, the LCP of s[w − 1] and s[w] is RCS }.
There are at most N − 1 right-complete substrings, even though there are N ( N + 1) 2 substrings of T . Based on the above theorem, we propose a linear time complexity algorithm discover_rcs, to extract all right-complete substrings of T and meanwhile count their
frequencies. The discover_rcs algorithm leverages a stack to store the right-complete substrings under counting. The time complexity of the discover_rcs algorithm is O(N).

void discover_rcs ( ) {
  typedef struct {
    int ID;
    int frequency;
  } RCSTYPE;
  RCSTYPE rcs_stack[N];          // N is the document's length
  Initialize rcs_stack;
  int sp = -1;                   // the stack pointer
  int i = 1;
  while ( i < N+1 ) {
    if ( sp < 0 ) {              // the stack is empty
      if ( lcp[i] > 0 ) {
        sp++;
        rcs_stack[sp].ID = i;
        rcs_stack[sp].frequency = 2;
      }
      i++;
    } else {
      int r = rcs_stack[sp].ID;
      if ( lcp[r] < lcp[i] ) {           // case (a): deeper LCP, push a new substring
        sp++;
        rcs_stack[sp].ID = i;
        rcs_stack[sp].frequency = 2;
        i++;
      } else if ( lcp[r] == lcp[i] ) {   // case (b): same LCP, count one more occurrence
        rcs_stack[sp].frequency++;
        i++;
      } else {                           // case (c): LCP shrinks, pop and report
        Output rcs_stack[sp];            // ID & frequency
        int f = rcs_stack[sp].frequency;
        sp--;
        if ( sp >= 0 ) {
          rcs_stack[sp].frequency = rcs_stack[sp].frequency + f - 1;
        }
      }                                  // end of case (c)
    }                                    // end of if (sp < 0)
  }                                      // end of while
}
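For comparison, a naive quadratic-time Python rendering of the same idea is sketched below: it builds the suffix array and LCP array directly and enumerates the right-complete substrings as the LCPs of adjacent suffixes, leaving the frequency counting to the linear-time stack-based algorithm above. It is an illustration only, not the algorithm used by SHOC.

def build_suffix_array(text):
    # Naive construction: sort suffix start positions by comparing the suffix strings
    return sorted(range(len(text)), key=lambda i: text[i:])

def build_lcp(text, sa):
    # lcp[i] = length of the longest common prefix of suffixes sa[i-1] and sa[i]
    def lcp_len(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [0] + [lcp_len(sa[i - 1], sa[i]) for i in range(1, len(sa))] + [0]

def right_complete_substrings(text):
    # By the theorem above: S is right-complete iff S is the LCP of two adjacent suffixes
    sa = build_suffix_array(text)
    lcp = build_lcp(text, sa)
    return {text[sa[i]:sa[i] + lcp[i]] for i in range(1, len(sa)) if lcp[i] > 0}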
A complete substring should be both right-complete and left-complete. To discover all the left-complete substrings, we just apply the discover_rcs algorithm to ~ T , the inverse document T . If S is a right-complete substring of ~ T , then ~ S must be a left-complete substring of T . Suppose rcs is the array of right-complete substrings, and lcs is the array of leftcomplete substrings. The array rcs is already alphabetically ordered. We also sort the array lcs to make it alphabetically ordered. Then we find the intersection of lcs and rcs using the intersect_lcs_rcs algorithm. The output of this algorithm is the alphabetically ordered array of the complete substrings of T with their frequencies. Suppose the size of lcs and rcs are L and R respectively, the time complexity of the intersect_lcs_rcs algorithm is O( L + R) . After obtaining the alphabetically ordered array of the complete substrings of T with their frequencies, we can get the frequency of any complete substring using the binary search algorithm. Thereafter, the stability (mutual information) and
significance estimation of each complete string could be computed easily. To discover key phrases from the set of phrases (complete substrings), we only need to examine every complete substring to see whether or not it is stable and significant. The complete, stable, significant substrings are just the key phrases we need.

void intersect_lcs_rcs ( sorted lcs array, sorted rcs array ) {
  int i = 0;
  int j = 0;
  while ( ( i < L ) && ( j < R ) ) {
    string str_l = lcs[i].ID denoted left-complete substring;
    string str_r = rcs[j].ID denoted right-complete substring;
    if ( str_l == str_r ) {      // present in both arrays: a complete substring
      Output lcs[i];
      i++;
      j++;
    }
    if ( str_l < str_r ) { i++; }
    if ( str_l > str_r ) { j++; }
  }
}
M. Yamamoto and K. W. Church have developed an algorithm using a suffix array to compute the term frequency and document frequency of all substrings in a corpus of size N in O(N log N) time [11]. In contrast, we only consider "complete" substrings. L. F. Chien has proposed more strict conditions of string "completeness", which require more complex algorithms [13].
3.3 Identifying and Organizing Clusters
The STC algorithm simply groups documents sharing a common phrase into one cluster [12]. This kind of method is often inaccurate and incomplete because of the phenomena of synonymy and polysemy that exist in natural languages. A better clustering approach should run on the semantic level, i.e., group together documents sharing a common conceptual topic. It is natural to assume that a document belongs to a cluster to some degree, so we adopt a continuous cluster definition here.
DEFINITION A cluster C_g of m objects t_1, t_2, ..., t_m can be identified by an m-dimensional vector x_g with ||x_g|| = 1, where x_g(i) represents the degree to which t_i belongs to C_g. x_g is denoted as the cluster vector of C_g, and it can be used interchangeably with C_g.
After the previous steps, we can get m key phrases from n documents. Taking the key phrases as terms, the search results can be described as an m × n term-document matrix A, whose row vectors represent the terms and whose column vectors represent the documents. The element A(i, j) = 1 if the i-th term T_i occurs in the j-th document D_j, and A(i, j) = 0 otherwise.
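As a small illustration, the binary term-document matrix could be assembled as follows, assuming the key phrases have already been discovered; the simple containment test stands in for matching a key phrase against a document's sentences.

import numpy as np

def term_document_matrix(key_phrases, documents):
    # A[i, j] = 1 if the i-th key phrase occurs in the j-th document, 0 otherwise
    A = np.zeros((len(key_phrases), len(documents)), dtype=int)
    for i, phrase in enumerate(key_phrases):
        for j, doc in enumerate(documents):
            if phrase in doc:
                A[i, j] = 1
    return A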
Following the idea of Latent Semantic Indexing (LSI) [1,4], we attempt to discover the latent semantics of documents by analyzing the associations between terms and documents. The term-document matrix A can be visualized as a bipartite graph. The bipartite graph reveals the dual reinforcing relationship between terms and documents, i.e., the terms linked with the same document should be close in the semantic space, and the documents linked with the same terms should be close in the semantic space. That is to say, densely-linked terms or documents are close to each other in the semantic space, so they should be grouped together to form a cluster. The degree of association among the objects in a cluster can be measured by the following notion.
DEFINITION Suppose x_g (y_g) is a cluster of the row (column) vectors of A; then the cluster density of x_g (y_g) is ||x_g^T A|| (||A y_g||). [15]
We want to find the clusters with high densities, since they capture the main topics of the documents. Suppose x_1 is the cluster with maximum density and x_2 is another cluster. It is known from basic linear algebra that x_2 can be written as x_2 = η x_1 + (1 − η^2)^{1/2} z, where η is a constant scalar (0 ≤ η ≤ 1), z ⊥ x_1 and ||z|| = 1. Then the cluster density of x_2 satisfies ||x_2^T A||^2 = η^2 ||x_1^T A||^2 + (1 − η^2) ||z^T A||^2. The larger the value of η, the higher the cluster density of x_2. If there is no constraint on x_2, it will be arbitrarily close to x_1. Therefore, to get a new meaningful cluster x_g, we have to restrict x_g to be orthogonal to the already discovered cluster vectors [15].
DEFINITION The orthogonal clustering of the row (column) vectors of A is discovering a set of cluster vectors x_1, x_2, ..., x_k, where x_g (1 ≤ g ≤ k) is the cluster with maximum density subject to being orthogonal to x_1, ..., x_{g−1}.
To find a solution to the orthogonal clustering problem, we introduce the following definitions and theorems.
DEFINITION Suppose M is a real m × m symmetric matrix; the Rayleigh Quotient of M w.r.t. x ∈ R^m is R(x) = (x^T M x) / (x^T x).
THEOREM Suppose M is a real m × m symmetric matrix with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_m corresponding to orthonormal eigenvectors p_1, p_2, ..., p_m. If x ∈ R^m, then max R(x) = R(p_1) = λ_1 and min R(x) = R(p_m) = λ_m (x ≠ 0); if x ∈ L(p_g, p_{g+1}, ..., p_h), 1 ≤ g ≤ h ≤ m, then max R(x) = R(p_g) = λ_g and min R(x) = R(p_h) = λ_h (x ≠ 0).
DEFINITION Suppose A is an m × n matrix with rank(A) = r, λ_1 ≥ λ_2 ≥ ... ≥ λ_r > 0 are the r non-zero eigenvalues of A A^T (A^T A), and x_1, x_2, ..., x_m (y_1, y_2, ..., y_n) are the corresponding orthonormal eigenvectors. Then the Singular Value Decomposition (SVD) of A is defined as A = U (Σ O; O O) V^T, where U = [x_1, x_2, ..., x_m], V = [y_1, y_2, ..., y_n], Σ = diag(σ_1, ..., σ_r) and O denotes a zero block. The values σ_g = sqrt(λ_g) (g = 1, 2, ..., r) are called the singular values of A, and x_1, x_2, ..., x_m (y_1, y_2, ..., y_n) are called the left (right) singular vectors of A. [5]
It turns out that the SVD of the matrix A provides the solution to the orthogonal clustering of the row or column vectors of A.
THEOREM The left (right) singular vectors of A are the cluster vectors discovered through orthogonal clustering of the row (column) vectors of A.
PROOF Since A A^T is an m × m symmetric matrix, we can compute the Rayleigh Quotient of A A^T w.r.t. x ∈ R^m: R(x) = (x^T (A A^T) x) / (x^T x) = ((x^T A)(A^T x)) / (x^T x) = (x^T A, x^T A) / (x^T x) = ||x^T A||^2 / ||x||^2. If c_g represents a cluster of the row vectors of A, then ||c_g|| = 1 and R(c_g) = ||c_g^T A||^2, so ||c_g^T A|| = sqrt(R(c_g)), i.e., the cluster density of c_g is the square root of the Rayleigh Quotient of A A^T w.r.t. c_g. According to the definition of orthogonal clustering, c_g should have maximum density subject to being orthogonal to c_1, ..., c_{g−1}. From the above theorem about the Rayleigh Quotient, and noting that L(p_1, ..., p_g)^⊥ = L(p_{g+1}, ..., p_m), it is clear that c_g must be the g-th eigenvector p_g of A A^T, that is, the g-th left singular vector x_g of A. The proof for the clustering of A's column vectors is similar. Since there may be some negative elements in the cluster vectors, we add the constraint for each cluster x_g that Σ_{i=1}^{m} x_g(i) ≥ 0, or we use −x_g instead.
Then we address the problem of how to determine the appropriate number of clusters.
DEFINITION In the previous problem setting, the cluster matrix of x_g is X_g = x_g (x_g^T A); similarly, the cluster matrix of y_g is Y_g = (A y_g) y_g^T. The cluster matrix actually represents its corresponding part of the original data matrix A.
THEOREM A = Σ_{k=1}^{r} (σ_k x_k y_k^T).
THEOREM X_g = Y_g = C_g = σ_g x_g y_g^T.
Let A_k = Σ_{g=1}^{k} C_g = Σ_{g=1}^{k} (σ_g x_g y_g^T); the quality of the orthogonal clustering x_1, ..., x_k (y_1, ..., y_k) can be reflected by the ratio of A_k over A.
DEFINITION In the previous problem setting, the k-step orthogonal clustering quality of A is q(A, k) = ||A_k||_F / ||A||_F = Σ_{g=1}^{k} (σ_g)^2 / Σ_{g=1}^{r} (σ_g)^2, (1 ≤ k ≤ r).
Given a cluster-quality threshold q* (e.g. 80%), the ideal cluster number k* is the minimum number k satisfying q(A, k) ≥ q*. SHOC applies orthogonal clustering to the term-document matrix of the Web search results. Given a cluster-strength threshold t, the g-th document cluster V_g is composed of the documents whose value in the vector y_g is greater than t. The term (key phrase) with the largest value in x_g can be taken as the label of V_g.
There are some efficient algorithms which can do SVD for large sparse matrices very quickly [15]. To save time further, we can run the SVD on the top-n items in the search results returned by the search engines, and then "fold in" the rest of the documents incrementally [1]. Because search engines usually place high-quality documents at the top of the result list, this approximation does not seriously hurt the clustering quality. P. Drineas et al. introduced the initial "orthogonal clustering" concept [15]; this paper gives a strict mathematical proof of why SVD provides the solution to orthogonal clustering. J. Kleinberg pointed out that SVD on the hyperlink structure matrix can be used for ranking and clustering Web pages [16]. Furthermore, SHOC organizes the clusters of Web search results into a tree structure to facilitate browsing. This is done by checking each pair of clusters, X and Y, to see whether they can be merged into one cluster or be treated as a parent-child relationship. When two base clusters X and Y are merged into one cluster, their phrase labels label_x and label_y are also merged into one phrase label_xy. We iteratively check every pair of base clusters and organize them; a hierarchy of clusters appears at last. Users are able to adjust the thresholds t1 and t2 to fulfil their own needs.

void combine_clusters ( two clusters X and Y ) {
  if ( |X∩Y| / |X∪Y| > t1 ) {
    X and Y are merged into one cluster;
  } else {
    if ( |X| > |Y| ) {
      if ( |X∩Y| / |Y| > t2 ) { let Y become X's child; }
    } else {
      if ( |X∩Y| / |X| > t2 ) { let X become Y's child; }
    }
  }
}
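Putting the pieces together, a hedged NumPy sketch of the orthogonal clustering step might look as follows; the thresholds q_star and t are the user-tunable values mentioned above, the sign flip implements the non-negativity constraint from the proof, and all names are illustrative.

import numpy as np

def orthogonal_clustering(A, terms, q_star=0.8, t=0.3):
    # A: m x n binary term-document matrix; terms: the m key phrases (row labels)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # k* = smallest k whose k-step clustering quality reaches q_star
    quality = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(quality, q_star) + 1)

    clusters = []
    for g in range(k):
        x_g, y_g = U[:, g], Vt[g, :]
        if x_g.sum() < 0:                   # use -x_g (and -y_g) if the constraint is violated
            x_g, y_g = -x_g, -y_g
        members = np.where(y_g > t)[0]      # documents whose value in y_g exceeds t
        label = terms[int(np.argmax(x_g))]  # key phrase with the largest value in x_g
        clusters.append((label, members))
    return clusters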
References
1. M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using Linear Algebra for Intelligent Information Retrieval. SIAM Review, 37(4), 573-595, 1995.
2. L. F. Chien. PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval. In Proceedings of the 20th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1997.
3. D. Cutting, D. Karger, J. Pedersen, and J. W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proceedings of the 15th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Copenhagen, 1992.
4. S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, pp. 391-407, 1990.
5. G. Golub and C. V. Loan. Matrix Computations (2nd ed.). Johns Hopkins, Baltimore, 1989.
6. D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
7. C. Kwok, O. Etzioni, and D. S. Weld. Scaling Question Answering to the Web. In Proceedings of the 10th International World Wide Web Conference (WWW), Hong Kong, May 2001.
8. R. Kannan, S. Vempala, and A. Vetta. On Clusterings: Good, Bad and Spectral. In Proceedings of the 41st Symposium on the Foundations of Computer Science, 2000.
9. U. Manber and E. Myers. Suffix Arrays: a New Method for On-line String Searches. In Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 319-327, 1990.
10. Y. Wang and M. Kitsuregawa. Link-based Clustering of Web Search Results. In Proceedings of the 2nd International Conference on Web Age Information Management (WAIM), Xi'An, China, 2001.
11. M. Yamamoto and K. W. Church. Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus. Computational Linguistics, 27(1), pp. 1-30, MIT Press, 2001.
12. O. Zamir and O. Etzioni. Web Document Clustering: a Feasibility Demonstration. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Melbourne, Australia, 1998.
13. L. F. Chien. PAT-tree-based Adaptive Keyphrase Extraction for Intelligent Chinese Information Retrieval. Information Processing and Management, 35(4), pp. 501-521, 1999.
14. C. H. Chang and S. C. Lui. IEPAD: Information Extraction based on Pattern Discovery. In Proceedings of the 10th International Conference on World Wide Web (WWW), Hong Kong, May 2-6, 2001.
15. P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering in Large Graphs and Matrices. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1999.
16. J. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46, 1999.
SIMON: A Multi-strategy Classification Approach Resolving Ontology Heterogeneity on the Semantic Web1 Leyun Pan, Shui Yu, and Fanyuan Ma Department of Computer Science and Engineering Shanghai Jiao Tong University, 200030 Shanghai, China {pan-ly,yushui}@cs.sjtu.edu.cn, [email protected] Abstract. One key idea of semantic web is that the content of the web is usable to machines (i.e. software agents). On the semantic web, data interoperability and ontology heterogeneity between agents are becoming ever more important issues. This paper presents a multi-strategy learning approach to resolve these problems. In this paper we describe the SIMON (Semantic Interoperation by Matching between ONtologies) system, which applies multiple classification methods to learn the matching between ontologies. We use the general statistic classification method to discover category features in data instances and use the first-order learning algorithm FOIL to exploit the semantic relations among data instances. On the prediction results of individual methods, the system combines their outcomes using our matching committee rule called the Best Outstanding Champion. The experiments show that SIMON system achieves high accuracy on real-world domain.
1 Introduction Now the web reaches a great scale. The majority of the web pages are in human readable format only, so machines cannot understand and process this information, and much of the potential of the web has so far remained untapped. The traditional web has reached its crucial point. Some researchers and web developers have proposed that we can augment the web with languages that make the meaning of web pages explicit. Tim Berners-Lee, inventor of the web, has coined the term Semantic Web to describe this approach [1]. The vision in the semantic web can be regarded as data interoperation between soft agents. In order to carry out their required tasks, agents must communicate and understand meaning. When agents communicate with each other, they need some ways to ensure that the meaning of what one agent ‘says’ is accurately conveyed to the other agent [2]. In reality, agents need to interact in a much wider world, where it cannot be assumed that other agents will use the same terms. If they do use the same terms, it cannot be assumed that they mean the same thing. 1
Research described in this paper is supported by Major International Cooperation Program of NSFC Grant 60221120145 and by Science & Technology Committee of Shanghai Municipality Key Project Grant 02DJ14045.
An approach for enabling interoperability is the semantic method where ontology is a key factor. The basic idea for how to proceed is to encode the domain concepts and their semantic relations in ontologies with an explicit description. Before the data exchange, there will be a semantic exchange process. So if agent A sends a message to agent B, then along with this message is an indicator of, or a pointer to what ontology term agent A is using for this message. Agent B can look in agent A’s ontology to see what the terms mean. The message is successfully communicated, and then they can change for data smoothly. Bringing the meaning to the web data, in theory, ontology is a good solution for data interoperation. However, semantic heterogeneity or ontology heterogeneity is still a problem in real web environment. Because of the semantic web's distributed nature, ontologies from different communities will inevitably be different. So the problem of improving system interoperability will rely on the reconciliation of different ontologies. We can consider the process of addressing the semantic heterogeneity as the process of ontology matching (ontology mapping) [3]. Mapping processes typically involve analyzing the ontologies and comparing them to determine the correspondence among concepts. Given two ontologies in the same domain, we can find the most similar concept node in one ontology for each concept node in another ontology. Manually finding such mappings is necessary in some situations, i.e. for agreement of basic concept [2]. However, at the web scale, manual matching is tedious, error-prone, and clearly not possible. It cannot satisfy the need of online exchange of ontology to two agents not in agreement. Hence, we must find some approaches to assist in the ontology (semi-) automatically matching process. In the paper, we will discuss the use of data instances associated with the ontology for addressing semantic heterogeneity. These data instances contain much information for identifying similarities and differences between ontologies. We can utilize this kind of information in our matching system. We propose the SIMON (Semantic Interoperation by Matching between ONtologies) system, which applies multiple classification methods to learn the matching between the pair of ontologies that are homogenous and their elements have significant overlap. Given the source ontology B and the target ontology A, for each concept node in target ontology A, we can find the most similar concept node from source ontology B. SIMON considers the ontology A and its data instances as the learning resource. All concept nodes in ontology A are the classification categories and relevant data instances of each concept are labeled learning samples in a classification process. The data instances of concept nodes in ontology B are unseen samples. SIMON classifies instances of each node in ontology B into the categories of ontology A according the classifiers for A. SIMON uses multiple learning strategies, namely multiple classifiers. Each of classifier exploits different type of information either in data instances or in the semantic relations among these data instances. Using appropriate matching committee method, we can get better result than simple classifier. This paper is organized as follows. In the next section, we introduce the overview of the ontology matching system. In section 3, we will discuss the multi-strategy classification for ontology matching. 
Section 4 presents the experiment results with our SIMON system. Section 5 reviews related works. We give the conclusions in section 6.
2 Overview of the Ontology Matching System
The ontology matching system is trained to compare two ontologies and to find the correspondences among concept nodes. An example of such a task is illustrated in Figure 1 and Figure 2, which show the ontologies of two movie database websites. The data on each website is organized into a hierarchical structure that includes movie, person, company, awards and so on. Movie has attributes such as title, language, cast&crew, production company, genre and so on.
Fig. 1. Ontology of movie database IMDB
Fig. 2. Ontology of movie database AllMovie
However, because each website may use a different ontology, it is difficult for an agent that masters only one ontology to completely integrate all the data. For example, an agent may consider that Movie in AllMovie is equivalent to Movie in IMDB; in fact, Movie in IMDB is just an empty ontology node, and MainMovieInfo in IMDB is the node most similar to Movie in AllMovie. Mismatches may also happen between MoviePerson and Person, GenreInstance and Genre, and Awards and Nominations and Awards. In this situation, the agent must find the matching between the ontologies in the movie domain.
When training the ontology matching system, SIMON takes two inputs for each matcher (the ontology to be matched). One input is the ontology that defines the domain knowledge from the matcher's own perspective; by the definition of an ontology, it includes the terms describing classes, properties and relations. The other input is the set of the matcher's data instances that describe the classes and relations. Each data instance of a class is a combination of the values of its properties. Moreover, if a class has a property describing a relation with another class, its data instances will link to the other class's data instances. SIMON uses multi-strategy learning methods, including both statistical and first-order learning techniques. Each base learner exploits well a certain type of information from the training instances to build matching hypotheses. We use a statistical bag-of-words approach to classify the pure text instances. Furthermore, the relations among concepts can help to learn the classifier; an appealing aspect of our approach is that first-order rules can describe categories using a rich description of the local graph structure around them. On the prediction results of the individual methods, the system combines their outcomes using our matching committee rule called the Best Outstanding Champion, which is a weighted voting committee. In this way, we can achieve higher matching accuracy than with any single base classifier alone.
3 Multi-strategy Learning for Ontology Matching 3.1 Statistical Text Classification One of methods that we use for text classification is naive Bayes, which is a kind of probabilistic models that ignore the words sequence and naively assumes that the presence of each word in a document is conditionally independent of all other words in the document. Naive Bayes for text classification can be formulated as follows. Given a set of classes C = {c1,..., cn} and a document consisting of k words, {w1,..., wk} , we classify the document as a member of the class, c * , that is most probable, given the words in the document:
c* = arg max c Pr(c | w1,..., wk )
(1)
Pr(c | w1,..., wk ) can be transformed into a computable expression by applying Bayes Rule (Eq. 2); rewriting the expression using the product rule and dropping the denominator, since this term is a constant across all classes, (Eq. 3); and assuming that words are independent of each other (Eq. 4).
Pr(c | w1,..., wk ) =
Pr(c) Pr( w1,..., wk | c) Pr( w1,..., wk )
(2)
SIMON: A Multi-strategy Classification Approach Resolving Ontology Heterogeneity
83
k
∝ Pr(c)∏ Pr( wi | c, w1,...wi − 1)
(3)
i =1
k
= Pr(c)∏ Pr( wi | c)
(4)
i =1
Pr(c) is estimated as the portion of training instances that belong to c . So a key step in implementing naive Bayes is estimating the word probabilities, Pr( wi | c ) . We use Witten-Bell smoothing [4], which depends on the relationship between the number of unique words and the total number of word occurrences in the training data for the class: if most of the word occurrences are unique words, the prior is stronger; if words are often repeated, the prior is weaker. 3.2 First-Order Text Classification As mentioned above, data instances under ontology are richly structured datasets, which can be best described by a graph where the nodes in the graph are objects and the edges in the graph are links or relations between objects. The methods for classifying data instances that we discussed in the previous section consider the words in a single node of the graph. However, the method can’t learn models that take into account such features as the pattern of connectivity around a given instance, or the words occurring in instance of neighboring nodes. For example, we can learn a rule such as “An data instance belongs to movie if it contains the words minute and release and is linked to an instance that contains the word birth." This kind of rules can be concisely represented using a first-order representation. We can learn to classify text instance using a learner that is able to induce first-order rules. The learning algorithm that we use in our system is Quinlan's Foil algorithm [5]. Foil is a greedy covering algorithm for learning function-free Horn clauses definitions of a relation in terms of itself and other relations. Foil induces each Horn clause by beginning with an empty tail and using a hill-climbing search to add literals to the tail until the clause covers only positive instances. When Foil algorithm is used as a classification method, the input file for learning a category consists of the following relations: 1. category(instance): This is the target relation that will be learned from other background relations. Each learned target relation represents a classification rule for a category. 2. has_word(instance): This set of relations indicates which words occur in which instances. The sample belonging a specific has_word relation consists a set of instances in which the word word occurs. 3. linkto(instance, instance): This relation represents that the semantic relations between two data instances. We apply Foil to learn a separate set of clauses for every concept node in the ontology. When classifying ontology’s data instances, if an instance can’t match any clause of any category, we treat it as an instance of other category.
3.3 Evaluation of Classifiers for Matching and Matching Committees
The method of committees (a.k.a. ensembles) is based on the idea that, given a task that requires expert knowledge to perform, k experts may be better than one if their individual judgments are appropriately combined [8]. For obtaining the matching result, there are two different matching-committee methods, depending on whether a classifier committee is used:
— Microcommittees: the system first uses a classifier committee; the committee negotiates the category of each unseen data instance, and the system then makes the matching decision on the basis of this single classification result.
— Macrocommittees: the system does not use a classifier committee; each classifier individually decides the category of each unseen data instance, and the system then negotiates the matching on the basis of the multiple classification results.
To optimize the result of the combination, we would generally like to give each committee member a weight reflecting its expected relative effectiveness. There are some differences between the evaluation of text classification and that of ontology matching. In text classification, the initial corpus can easily be split into two sets: a training (and validation) set and a test set. In ontology matching, however, the boundary among the training set, the test set and the unseen data instances is not obvious. First, there is no test set in the ontology matching process: the instances of the target ontology are regarded as the training set and the instances of the source ontology as the unseen samples. Second, the unseen data instances are not completely 'unseen', because the instances of the source ontology all have labels; we just do not know what each label means. Because of the absence of a test set, it is difficult to evaluate the classifiers in microcommittees: they can only rely on prior experience and set the classifier weights manually, as is done in [3]. We adopt macrocommittees in our ontology matching system.
Note that the instances of the source ontology are only relatively 'unseen'. When these instances are classified, the unit is not a single instance but a whole category, so we can observe the distribution of a category of instances. Each classifier finds a champion, the target-ontology category that gains the maximal similarity degree. Among these champions, some may have an obvious predominance while others may be only slightly ahead of the other nodes. Generally, the more outstanding a champion is, the more we trust it. Thus we can adopt the degree of outstandingness of a candidate as the evaluation of the effectiveness of each classifier. The degree of outstandingness can be observed from the classification results and need not be adjusted or optimized on a validation set.
We propose a matching committee rule called the Best Outstanding Champion: the system chooses as the final champion the candidate with the maximal accumulated degree of outstandingness among the champion candidates. The method can be regarded as a weighted voting committee. Each classifier casts a vote for the most similar node according to its own judgment, but each vote has a different weight, measured by the degree of the champion's outstandingness, which we define as the ratio of the champion to the secondary node. So, given a source node and the columns of the classifiers' results under this source node, we can find a final champion among the target nodes as follows: for each target node c and k classifiers, the
pooled votes can be calculated as Votes_c = ∑_{i=1}^{k} w_i·v_i, where Votes_c is the combined vote of node c, w_i is the ratio of the champion to the runner-up in classifier i, and v_i equals 1 if node c is the champion in classifier i and 0 otherwise.
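An illustrative sketch of this Best Outstanding Champion rule follows; the function name and the example scores are ours, not the paper's.

```python
# Each classifier reports, for one source node, a similarity score per target node; its
# vote goes to its champion with weight = champion score / runner-up score.
def best_outstanding_champion(classifier_results):
    """classifier_results: list of dicts {target_node: similarity}, one per classifier."""
    votes = {}
    for result in classifier_results:
        ranked = sorted(result.items(), key=lambda kv: kv[1], reverse=True)
        champion, top = ranked[0]
        second = ranked[1][1] if len(ranked) > 1 and ranked[1][1] > 0 else 1.0
        weight = top / second                                # degree of outstandingness w_i
        votes[champion] = votes.get(champion, 0.0) + weight  # Votes_c = sum of w_i * v_i
    return max(votes, key=votes.get)

# Hypothetical example: one classifier slightly favours Director, but the other's
# champion Actor is far more outstanding, so Actor wins the combined vote.
final = best_outstanding_champion([
    {"Actor": 213, "Director": 305},
    {"Actor": 456, "Director": 20},
])
```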
4 Experiments
We take movies as our experimental domain. As experimental objects we choose the first three movie websites ranked in the Google directory Arts > Movies > Databases: IMDB, AllMovie and Rotten Tomatoes. We manually match the three ontologies to each other to measure the matching accuracy, defined as the percentage of the manual mappings that the machine predicted correctly. We found about 150 movies on each website; we then exchanged the keywords among the sites and searched again, finding about 300 more movies, so each ontology holds about 400 movie data instances after removing repetitions. For example, we found data instances about the movie 'The Godfather' in IMDB and data instances about the movie 'The Matrix' in AllMovie; after keyword-exchange searching (KES), IMDB holds some instances about 'The Matrix' and AllMovie holds some instances about 'The Godfather'. Thus each ontology holds similar knowledge about the same set of movies. We use a three-fold cross-matching methodology to evaluate our algorithms: we conduct three runs, in each of which we perform two experiments that map the ontologies to each other. In each experiment, we train the classifiers using the data instances of the target ontology and classify the data instances of the source ontology to find the matching pairs from the source ontology to the target ontology.
Before discussing the results of our statistical and first-order text classification, we first illustrate some rules learned by the first-order text classifier. Figure 3 shows three of the rules learned by Foil in the IMDB-AllMovie experiment.
movie(A) :- has_minute(A), has_release(A), linkto(A,B), has_birth(B).
player(A) :- has_play(A); linkto(A,B), has_best(B), has_actor(B).
director(A) :- has_executive(A); linkto(A,B), has_best(B), has_director(B).
Fig. 3. A few of the rules learned by Foil for classifying data instances of the movie domain
The learned rules for player and director show the power of a first-order representation. Because player and director inherit many properties and have few properties of their own, the data instances of these two nodes have little discriminating power, and it is difficult for a standard classifier to classify instances into the two nodes properly. But if players or directors have awards or nominations records, there are linked instances of awards and nominations, and the rules about linked instances help the classification, because actors often win the best-actor awards and vice versa. Note that the word actor in 'has_actor' is different from the category name 'actor'.
The former is a word that often occurs in instances of the category awards and nominations; it can help classify some instances whose category name may be actor or player.
Table 1. Result matrices of the statistic classifier and the First-Order classifier
IMDB \ AllMovie   Movie       Company   Person      Player      Director   Awards     …
MainMovieInfo     241 (112)   2 (1)     68 (323)    22 (12)     7 (2)      13 (12)    …
Company           1 (0)       25 (18)   11 (5)      34 (13)     12 (4)     1 (1)      …
MoviePerson       12 (97)     1 (0)     715 (418)   54 (11)     11 (1)     12 (9)     …
Actor             7 (5)       0 (1)     11 (5)      213 (456)   40 (13)    7 (1)      …
Director          8 (4)       0 (1)     8 (4)       305 (20)    49 (104)   5 (2)      …
A&N               10 (7)      2 (2)     23 (11)     11 (14)     5 (4)      103 (90)   …
…                 …           …         …           …           …          …          …
Other             0 (52)      0 (4)     0 (169)     0 (182)     0 (19)     0 (21)     …
Total             385         37        1213        833         186        198        …
Table 1 shows the classification result matrices of some of the categories in the AllMovie-IMDB experiment, for the statistic classifier and the First-Order classifier respectively (the numbers in parentheses are the results of the First-Order classifier). Each column of the matrix represents one category of the source ontology AllMovie and shows how the instances of this category are classified into the categories of the target ontology IMDB. These matrices illustrate several interesting results. First, note that for most classes the coverage of the champion is high enough for a matching judgment; for example, 63% of the Movie column in the statistic classifier and 56% of the Player column in the First-Order classifier are correctly classified. Second, there are notable exceptions to this trend: Player and Director in the statistic classifier, and Movie and Person in the First-Order classifier. The results of the Player column in the statistic classifier would lead to a wrong matching decision, in which Player in AllMovie is matched not to Actor but to Director in IMDB. In the other exceptional columns, the first and the second candidates are so close that we cannot fully trust the matching results derived from these classification results.
The low classification coverage of the champion for Player and Director is explained by a characteristic of these categories: they lack distinguishing feature properties, so many of their instances are scattered over other categories. Our First-Order classifier, however, can repair this shortcoming: by mining the information of the neighboring instances of awards and nominations, we can learn rules for the two categories and classify most instances into the proper categories, because players often win the best-actor awards and vice versa.
The neighboring instances do not always provide correct evidence for classification; the Movie column and the Person column in the First-Order classifier belong to this situation. Because many data instances of these two categories link to each other, the effectiveness of the learned rules decreases. Fortunately, in the statistic classifier the classification results of these two categories are good. By using our matching committee rule, we can easily integrate the preferable classification results of both classifiers: after calculating and comparing the degrees of outstandingness, we trust the matching results for Movie and Person from the statistic classifier and those for Player and Director from the First-Order classifier.
Figure 4 shows the three runs and six groups of experimental results. We match the two ontologies to each other in each run, and there is only a small difference between the two experimental results of a run. The three bars for each experiment represent the matching accuracy produced by (1) the statistic learner alone, (2) the First-Order learner alone, and (3) the matching committee using the two learners.
[Figure 4: grouped bar chart (accuracy 0-100%) comparing the statistic learner, the First-Order learner and the matching committee over the six experiments AllMovie to IMDB, IMDB to AllMovie, RT to IMDB, IMDB to RT, RT to AllMovie, and AllMovie to RT.]
Fig. 4. Ontology matching accuracy
According to the experimental results, we can draw the following conclusions.
1. By using the first-order learning algorithm we can exploit the semantic relations among data instances. The first-order learner can find some matching node pairs that cannot be caught by the general classification method.
2. The higher accuracy of the matching committee shows that our committee rule, the Best Outstanding Champion, accumulates the correct matchings of the individual classifiers well.
3. Generally, the effectiveness of many classification methods is good enough for ontology matching and is not the key problem of ontology matching. If each concept node of the ontology has 'good' feature properties and sufficient training data, classification methods are competent for the matching task and the matching results are good. However, some factors still prevent higher accuracy, such as ambiguous concept nodes and insufficient training data. The solutions to these problems will be our future work.
5 Related Works
From the perspective of ontology matching using data instances, several works are related to our system. In [3], several strategies classify the data instances and another strategy, the Relaxation Labeler, searches for the mapping configuration that best satisfies the given domain constraints and heuristic knowledge. In contrast, automated text classification is the core of our system: we focus on fully mining the data instances for automated classification and ontology matching. By exploiting the classification features in or
among data instances, we can make the fullest use of text classification methods. Furthermore, as regards the combination of multiple learning strategies, [3] uses microcommittees and evaluates the classifier weights manually, whereas our system adopts the degree of outstandingness as the classifier weight, which can be computed from the classification results. Without using any domain or heuristic knowledge, our system automatically achieves matching accuracy similar to that of [3]. [6] also compares ontologies using similarity measures, but computes the similarity between lexical entries. [7] describes the use of the FOIL algorithm in classification and extraction for constructing knowledge bases from the Web.
6 Conclusions
The current Web is evolving into the Semantic Web, and it will become reality that intelligent software agents automatically access resources on the Web. When agents exchange information, heterogeneity problems inevitably arise, and data heterogeneity is shifting to semantic heterogeneity between ontologies. In this paper we introduced the SIMON system and described its key techniques. We take movies as our experimental domain and extract the ontologies and the data instances from three different movie database websites. We use a general statistical classification method to discover category features in data instances and use the first-order learning algorithm FOIL to exploit the semantic relations among data instances. The system combines their outcomes using our matching committee rule, the Best Outstanding Champion. A series of experimental results shows that our approach achieves high accuracy on a real-world domain.
References
1. T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001.
2. M. Uschold. Where is the semantics in the Semantic Web? Invited talk at the Ontologies in Agent Systems workshop held at the Autonomous Agents Conference, June 2001.
3. A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Learning to Map between Ontologies on the Semantic Web. In Proceedings of the World Wide Web Conference (WWW-2002).
4. I. H. Witten, T. C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4), July 1991.
5. J. R. Quinlan, R. M. Cameron-Jones. FOIL: A midterm report. In Proceedings of the European Conference on Machine Learning, pages 3-20, Vienna, Austria, 1993.
6. A. Maedche, S. Staab. Comparing Ontologies - Similarity Measures and a Comparison Study. Internal Report No. 408, Institute AIFB, University of Karlsruhe, March 2001.
7. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell. Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence, Elsevier, 1999.
8. F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1, March 2002.
Support Vector Machines Learning for Web-Based Adaptive and Active Information Retrieval
Zhaofeng Ma and Boqin Feng
Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, China
[email protected]
Abstract. An Adaptive and Active Computing Paradigm (AACP) for providing long-term users with personalized information services in a heterogeneous environment is proposed. It delivers user-centered, push-based, high-quality information service in a timely and proper way, and its motivation is generalized as the R4 Service: the Right information serves the Right person at the Right time in the Right way. Formalized algorithms for adaptive user profile management, the active monitoring and delivery mechanism, and adaptive retrieval are discussed in detail. Support Vector Machines are adopted for collaborative retrieval and content-based adaptation, which overcomes the demerits of using a collaborative or content-based algorithm independently and improves precision and recall to a large degree. Performance evaluations showed that the proposed paradigm is effective, stable and feasible for large numbers of users to obtain fresh information instead of polling various information sources.
1 Introduction
During the past decades, pull-based information services such as search engines and traditional full-text retrieval [1,3] were studied intensively, and many applications have been put to real use. However, with the explosive growth of the Internet and the World Wide Web, locating relevant information is time-consuming and expensive, and push technology [4-9] promises a proper way to relieve users from the drudgery of information searching. Some current commercial software and prototype systems, such as PointCast Network, CNN Newswatch, SmartPush and ConCall, serve users in a personalized way [10-12], while recommendation systems such as GroupLens, MovieLens, Alexa, Amazon.com, CDNow.com and Levis.com are used in many Internet commerce fields. Although various personalized recommendation systems have been developed, many problems remain unresolved, and they result in deficient, low-quality information service compared with what the systems declare. One of the most important reasons is that a single recommendation mechanism, such as content-based or collaborative recommendation, can hardly serve all kinds of users with their various information needs.
In this paper, an Adaptive and Active Computing Paradigm (AACP) for personalized information service in a wide-area, distributed, heterogeneous environment is proposed to provide user-centered, push-based, high-quality information service in a timely and proper way. Unlike systems based on a static and flat user profile, we build the system on a fully user-oriented paradigm with adaptive user profile management, incremental information retrieval and an active push-delivery mechanism. Moreover, we integrate collaborative and content-based retrieval technologies [2] for high-quality recommendation of personalized information, in which Support Vector Machines [10] are adopted for efficient mining of the latent knowledge of groups of users. As part of our research, we have implemented the Adaptive and Active Information Retrieval System (AIRS) for scientific research use.
2 The Adaptive and Active Computing Paradigm
A traditional information retrieval system [1,3] is usually composed of a Retrieval-Enabled Information Source (IS), an Indexing Engine (XE), an Information Model (IM), a Retrieval Engine (RE) and a Graphic User Interface (GUI) as the entrance for users to retrieve.
Definition 1: From a system's view, a traditional Information Retrieval System can be defined as a 5-tuple according to its components:
IRSSysView := { IS, XE, IM, RE, GUI }
In particular, in an open information system such as a search engine, the indirect IS is the WWW, while the direct IS is the abstract image of the indirect information source. In non-open information sources such as full-text retrieval systems, the IS is the original metadata suitable for retrieval.
Definition 2: From a user's view, the information retrieval system is a 3-tuple framework composed of a virtual or real information source (IS), the user's Retrieval Input (RI) and the Retrieval Output (RO), that is:
IRSUserView := { IS, RI, RO }
A traditional information system is essentially an information-centered, pull-based computing paradigm that serves customers in a passive mode and can no longer meet the demand of long-term users for getting fresh information efficiently and timely. In this paper, the Adaptive and Active Computing Paradigm (AACP) for personalized information service in a heterogeneous environment is a user-centered, push-based, high-quality information service delivered in a proper way, whose motivation is generalized as the R4 Service: the Right information serves the Right person at the Right time in the Right way, that is:
R4 := R × R × R × R
The R4 Service serves users in an automatic retrieval and push-delivery mode, timely and correctly, and adapts to users' preferences dynamically and automatically.
2.1 The Abstract Architecture of AACP
Based on the above description of a traditional information system, the adaptive and active information service for a specific user in a personalized way is a user-correlative, context-sensitive, push-based, event-driven information system, which is in fact built on the traditional information framework. Thus the paradigm can be formally defined as follows.
Definition 3: The Adaptive and Active Information System (AAIS) for personalized information service is encapsulated on the traditional information system with adaptation and activity, and can be viewed in an abstract architecture:
AACP := IRSSysView ∪ AAIS, where AAIS := (VUser, VIS, C, T, P)
The semantics of each element in AAIS is described in Table 1.
Table 1. Semantics of AAIS
Symbol   Definition
VUser    Vector of the user's interests or preferences
VIS      Vector of the information source
C        Condition that VIS matches VUser
T        Trigger for the system to start working for U
P        Period during which the system serves U
For a long-term user of an information retrieval system, what one really needs in a relatively stable period is just a small part of the whole global information domain; the relationship can be precisely described as in Definition 4.
Definition 4: A Personalized Information Set (PIS) is a subset of a Domain Information Set (DIS), while a DIS is a subset of the Global Information Set (GIS), so that:
PIS ⊆ DIS ⊂ GIS
[Figure 1 sketches this relationship: sets PISi and PISj nested inside DISi and DISj, which are in turn contained in the GIS.]
Fig. 1. Simple Model of Information Scope
Considering the high dimensionality of an information system, the domains usually overlap each other, especially in a multi-subject scope, so the information scope model is in fact an n-dimensional space; Fig. 1 gives a simple, flat model of the above relationship.
Definition 5: Personalized information service over a heterogeneous information source is a matching map R (where R stands for Retrieval) from VIS to VUser with the condition C and trigger T during the period P, that is:
Q = R(VUser, VIS, C, T, P; MInfo)
where MInfo is the information model for the retrieval map, which may be the Boolean, probabilistic, or VSM model.
To serve users adaptively and actively, the system must know the users' information needs. Usually the user profile is the basic infrastructure, and it decides the quality of the adaptive and active service. Other important infrastructures include information indexing, information retrieval/filtering, and automatic monitoring for retrieval and delivery.
3 Infrastructure of AACP
In this section we discuss the infrastructure in detail.
3.1 Abstract User Profile
To serve users according to their preferences and interests, a user profile is needed as an image of what users need, by which the system can decide whom to serve, when to serve, what to serve and how to serve. Then, by Definition 4, a user profile can be defined as:
AUP := {UID, PIV, Period, Freq}
where UID := InterUID || ExterUID; PIV := GIS || DIS || PIS; DIS := DIS1 || DIS2 || … DISm; PIS := PIS1 || PIS2 || … PISn; PISi := ⟨…⟩; Period := StartTime || EndTime; Freq := Hour || Day || Week || Month. In fact, PIS may be defined by definite and candidate weighted keywords: PISD := ⟨…⟩, where wi ≥ wj (1 ≤ i, j ≤ n, i < j).
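As an illustration only, one possible in-memory shape for the abstract user profile AUP = {UID, PIV, Period, Freq} defined above could look like the following; all class and field names here are our own, not part of the paper.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple

@dataclass
class PersonalizedInfoSet:
    # (keyword, weight) pairs, weights in descending order as required by the definition
    keywords: List[Tuple[str, float]] = field(default_factory=list)

@dataclass
class AbstractUserProfile:
    internal_uid: str
    external_uid: str
    definite: PersonalizedInfoSet      # PIS_D: definite weighted keywords
    candidate: PersonalizedInfoSet     # PIS_C: candidate weighted keywords
    start: datetime                    # Period := StartTime || EndTime
    end: datetime
    freq: str = "Day"                  # Freq := Hour | Day | Week | Month
```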
C > 0 is the constant which decides the degree of misclassification, and ∑_{i=1}^{l} ξi is the upper bound of all misclassified samples. In fact, ∑_{i=1}^{l} ξi can be formulated as a generic expression Fσ(ξ) = ∑_{i=1}^{l} ξi^σ. The case σ = 1 corresponds to a linear cost function, and the case σ = 2 corresponds to the square-cost SVM. The solution to the SVM results in the following QP (Quadratic Programming) problem:
Min W(α) = −∑_{i=1}^{l} αi + (1/2) ∑_{i,j=1}^{l} αi αj yi yj K(xi · xj)
s.t. 0 ≤ αi ≤ C, i = 1, …, l;  ∑_{i=1}^{l} αi yi = 0.
Thus the generic decision hyperplane can be formulated as:
f(x) = sgn( ∑_{S.V.} yi αi K(xi · x) + b )
Alternative choices of kernel function, besides the most commonly used ones such as the polynomial, Gaussian RBF and ridge kernels, include ||x − y||^{2n} ln(||x − y||) (used in thin-plate splines) and sinc[(σ/π)(x − y)] (used in the band-limited Paley-Wiener space).
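A minimal sketch of evaluating the decision hyperplane f(x) = sgn(∑ yi αi K(xi · x) + b) with a Gaussian RBF kernel is shown below. The multipliers alpha and the bias b are assumed to come from solving the QP problem above with any off-the-shelf solver; the function names and the gamma value are our own choices.

```python
import numpy as np

def rbf_kernel(xi, x, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def svm_decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    # f(x) = sgn( sum over support vectors of y_i * alpha_i * K(x_i, x) + b )
    s = sum(a * y * kernel(xi, x) for xi, y, a in zip(support_vectors, labels, alphas))
    return 1 if s + b >= 0 else -1
```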
In the following section, we will apply SVMs as the integrating component of adaptive retrieval for efficient and intelligent information service.
(3) SVM-VSM Adaptive Retrieval Algorithm. To obtain better retrieval performance in terms of precision and recall in AACP, we integrate these two methods to achieve much higher precision and recall. The adaptation is divided into two stages: in the first stage we apply SVMs to classify group interests according to the given user's UID[i0].DIS and retrieve from the information source(s) in a classification mode; in the second stage, based on the returned results, we retrieve according to the PIS vector itself for content-based adaptation. The adaptive retrieval algorithm is described as follows.
Algorithm: SVM-VSM-Retrieval
Input: UID[i0]
Output: R+, R− (R+ and R− are the definite and candidate retrieval result sets)
Stage I: Collaborative Retrieval
Step 1: Locate UID[i0] and extract its corresponding DIS[i].DIS[j] (1
0, where AHNb = HNb + degrade × HNa. Here AHNb is the aggregate hit-number of B, including the hit-number of A. In turn, C, D and E create their own RGs for Tq based on AHNb in the same manner, and send their respective AHNs to their neighbors until Q is terminated. Generally, AHN obeys the following formula:
AHN = HNlocal + degrade × AHNi    (1)
where the coefficient degrade (
< alloc Course = "c1" Tutorial = "t1"> .;
(3) Filtering Engine for the XML stream, using the MTrie index and the predicate table, which will be discussed in Section 2.2. The running scenario of
our MTrie EAI prototype is as follows. DTD files are pre-registered by the application integration developer to validate incoming XML messages, and for each DTD we build a related DTD graph. Subscriber applications subscribe to XML messages by XPath queries; based on each DTD graph, all XPath queries from subscribers are merged into an MTrie index. Publisher applications publish XML data messages, which are validated by the pre-registered DTD files. If the XML data satisfies a subscriber's XPath query, the MTrie engine sends the XML data to the subscriber application.
[Figure 1 shows the MTrie architecture: the XML input stream from publishers passes through a SAX parser into the MTrie filtering engine, which uses the MTrie index and predicate table built by the DTD parser and XPath parser from the registered DTDs and the subscribers' XPath queries, and emits the matching XML output data to subscribers.]
Fig. 1. Architecture of MTrie
2.1 MTrie Index Building
The basic idea of building the MTrie index is to combine all XPath queries into a single tree-like index. Building the MTrie index involves two steps: conversion and combination. Conversion turns every XPath query into DPaths over the DTD tree; combination merges the DPaths into the already-built MTrie index, which then replaces the old index. The conversion essentially converts the symbols "/", "*" and "//" in XPath into the symbol ∩ over the DTD tree. For "e1/e2", the conversion is simple: just replace / with ∩. For "e1/*/e2", the element M that is a child of e1 and a parent of e2 is used to replace "*", giving "e1/M/e2". For "e1//e2", the conversion is the most complex: the subpath SubPath := M1 ∩ … ∩ Mn in the DTD tree must be found to replace "//", giving "e1 ∩ M1 ∩ … ∩ Mn ∩ e2", where M1 is a child of e1 and Mn is a parent of e2. What really makes the conversion of "//" complex is that recursive elements may exist. There are three cases of replacement:
• If M1 ∩ … ∩ Mn in e1 ∩ M1 ∩ … ∩ Mn ∩ e2 are recursive elements, we replace M1 ∩ … ∩ Mn with its virtual element V, and the result is e1 ∩ V ∩ e2, where V = (M1 ∩ … ∩ Mn)*.
• If M1 ∩ … e2 … ∩ Mn in e1 ∩ M1 ∩ … ∩ e2 ∩ … ∩ Mn are recursive elements, i.e. e2 is among the recursive elements, the replacement is e1 ∩ V ∩ M1 ∩ … ∩ e2, where V = (M1 ∩ … e2 … ∩ Mn)*.
• If M1 ∩ … e1 … ∩ Mn in M1 ∩ … ∩ e1 ∩ … ∩ Mn ∩ e2 are recursive elements, i.e. e1 is among the recursive elements, the replacement is e1 ∩ … ∩ Mn ∩ V ∩ e2, where V = (M1 ∩ … e1 … ∩ Mn)*.
After the conversion is done, the new XPath is matched against the already-built MTrie index starting from the root element. If a common element is met, the element is merged into the common element; otherwise, the differing element and all remaining elements become children of the last common element. Once the new XPath has been combined into the old MTrie index, the resulting index becomes the one into which the next XPath is combined.
To evaluate the content predicates of XPath queries, a predicate table is built. Each path node of the MTrie index has a pointer to the element's content filter, which can be on the element text, on an attribute value, or an expression involving both element text and attribute values. Element-text and attribute-value filtering can be handled easily in the SAX startElement() and characters() events. A complex content-filtering expression has to be parsed into multiple element-text or attribute-value filters, which may decrease the scalability of the MTrie engine. In this paper we focus on the scalability of the structure filtering of our MTrie engine.
2.2 Matching Algorithm
The filtering algorithm is the key part of our MTrie filtering engine. Compared with earlier XML matching systems [2][3][4], the MTrie matching algorithm is rather simple and efficient. It accepts two inputs, the SAX-parsed XML event stream and the MTrie index, and returns the set of XPath queries that the XML data satisfies.
The basic idea of the MTrie filtering algorithm is as follows. As SAX parses the XML stream, the XML data is traversed in pre-order; during the traversal, the MTrie index is checked to find whether the XML data satisfies the XPath queries. For every incoming XML stream, the SAX startDocument() event is triggered first and the MTrie index begins the matching process from its root node. For each startElement() event, the level of the XML stream grows by 1, and the filtering algorithm goes down in the MTrie index, searching the child nodes of the current node for one whose element name matches. When the current node of the MTrie index has reached a leaf while the XML data stream has not yet reached a leaf element, the MTrie index stays there until the XML data stream returns to the current node on an endElement() event, and then continues the next matching step. When the SAX parser finishes traversing the whole XML data stream with the endDocument() event, the matching procedure ends, returning the satisfied XPath queries, and the XML stream is sent to those subscribers whose XPath queries are among the returned ones.
If a virtual node exists, the hierarchy of sub-elements of the virtual node is checked first: the algorithm goes down directly to the root element of that hierarchy without a startElement() event and then continues the matching process as usual. When the leaf node of the sub-element hierarchy is reached, the next node is either the root element of the hierarchy again or a child element of the whole virtual element, depending on the element name of the startElement() event. If there exists a child node whose XPath path-node level equals −1, the XPath's path filtering is satisfied. When the SAX
endElement() event is triggered, the level of the XML stream decreases by 1 and the current node in the MTrie index goes up, returning to its parent node. Because the MTrie index is a subgraph of the DTD graph, the nodes of the MTrie index can be traversed in step with the SAX events of the XML message stream.
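The sketch below illustrates the core of this idea in simplified form; it assumes the XPath queries have already been converted to linear element paths (the '*', '//' and virtual-node handling of Sect. 2.1 is omitted, as are content predicates), merges them into a small trie, and walks the trie in step with SAX start/end-element events. It is not the authors' implementation.

```python
import xml.sax

def build_trie(paths):                      # paths: {query_id: ["order", "item", ...]}
    trie = {"children": {}, "queries": []}
    for qid, steps in paths.items():
        node = trie
        for step in steps:
            node = node["children"].setdefault(step, {"children": {}, "queries": []})
        node["queries"].append(qid)         # query is satisfied when this node is reached
    return trie

class MTrieHandler(xml.sax.ContentHandler):
    def __init__(self, trie):
        super().__init__()
        self.stack = [trie]                 # current trie position; None = dead branch
        self.matched = set()

    def startElement(self, name, attrs):    # go down one level in the trie
        top = self.stack[-1]
        child = top["children"].get(name) if top else None
        if child:
            self.matched.update(child["queries"])
        self.stack.append(child)

    def endElement(self, name):             # return to the parent trie node
        self.stack.pop()

handler = MTrieHandler(build_trie({"q1": ["order", "item", "price"]}))
xml.sax.parseString(b"<order><item><price>3</price></item></order>", handler)
# handler.matched now contains {"q1"}
```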
3 Performance Evaluation
We implemented an EAI prototype platform based on the MTrie engine in Java 1.4. To measure MTrie's performance, we use filter time as the performance metric, defined as the time between the startDocument() and endDocument() events of an XML message stream. All experiments were conducted on a 1.5 GHz Intel Pentium 4 machine with 2048 MB of main memory running the Windows 2000 Server platform.
We run our experiments on two groups of data sets: a real data set from NASA (http://xml.gsfc.nasa.gov), which has recursive elements, and data generated from nine DTD files from the NIAGARA experiment data page (http://www.cs.wisc.edu/niagara/data.html) using IBM's XML generator tool. The ADC dataset.DTD from NASA contains 140 elements and 8 levels, and we use the real XML files from http://xml.gsfc.nasa.gov/pub/adc/xml_archives/. To generate XML data for the nine DTD files, IBM's XML generator creates random XML instances based on the constraints we provide. For every DTD file, we use the method in [3] to generate 250 XML documents of three sizes, small, medium and large, with an average of 20, 100 and 1000 pairs of tags respectively. We generate the synthetic XPath queries using a generator similar to that of [4]; the modified generator produces XPath queries based on input parameters including the number of queries, the maximum depth of a query, the probability of the wildcard operator '*' for each location step, the probability of "//" for each location step, and the number of value-based predicates in a query, where the values are chosen from our pre-defined data. To simplify our MTrie implementation, all of the generated XPath queries are linear queries. To compare MTrie's performance, we also implemented YFilter and XPush in Java 1.4.
Fig. 2. 50,000 NASA XPath queries and 1M NASA XML message
To address the question of how the XML message size and the number of XPath queries influence the performance of the MTrie engine, we designed an experiment that tests the filter time of the three filtering algorithms, MTrie, YFilter and XPush, on the NASA XML messages for different message sizes and numbers of XPath queries. In Figure 2(a), the MTrie engine has the lowest filter time of the three filtering algorithms, because MTrie need not probe the DFA of XPush or the NFA of YFilter and matches the MTrie index directly. In Figure 2(b), we find that MTrie reaches a stable filter time of 1800 ms after a linear increase up to 300,000 queries, because the merged MTrie index of 300,000 XPath queries already covers the whole NASA DTD graph, while the MTrie index of fewer than 300,000 XPath queries is only a subgraph of the DTD graph.
4 Conclusion
In this paper, we have proposed a novel XML filtering system, termed MTrie. MTrie supports effective and scalable filtering of XML messages based on XPath expressions, and can support a large number of XPath queries by merging them into a single trie-like data structure. For recursive elements, we combine the recursive elements into a virtual element and then treat it like a normal element. Based on the DTD, MTrie converts the merged XPath queries into the MTrie index, which makes XML filtering more effective and faster. MTrie can also support XML message filtering over heterogeneous DTD files by building an MTrie index for every DTD file. Our experimental results show that MTrie outperforms earlier work and scales with both the message size and the number of XPath queries.
References
[1] A. Carzaniga, D. R. Rosenblum, and A. L. Wolf. Challenges for distributed event services: Scalability vs. expressiveness. In Engineering Distributed Objects '99, May 1999.
[2] M. Altinel and M. J. Franklin. Efficient filtering of XML documents for selective dissemination of information. In Proceedings of the VLDB Conference (2000).
[3] Y. Diao, P. Fischer, M. Franklin, and R. To. YFilter: Efficient and scalable filtering of XML documents. In Proceedings of ICDE, 2002.
[4] A. Gupta and D. Suciu. Stream Processing of XPath Queries with Predicates. In Proceedings of the ACM SIGMOD Conference on Management of Data, 2003.
[5] J. Bates, J. Bacon, K. Moody, and M. Spiteri. Using events for the scalable federation of heterogeneous components. In P. Guedes and J. Bacon, editors, Proceedings of the 8th ACM SIGOPS European Workshop: Support for Composing Distributed Applications, Sintra, Portugal, Sept. 1998.
Issues in Cache-Answerability for XML Queries on the Web∗
Jung Kee Park and Hyunchul Kang
School of Computer Science and Engineering, Chung-Ang University, Seoul, 156-756, Korea
[email protected], [email protected]
Abstract. Recently, semantic caching for database-backed Web applications has received much attention. Since the emergence of XML as a standard for data exchange on the Web, today's Web applications retrieve information from remote XML sources across the network. Cache-answerability for XML queries is thus crucial for the efficient support of XML database-backed Web applications. In this paper, we address two issues involved in realizing such functionality: incremental refresh of the XML cache against the updates at the XML source, and integration of the partial query results obtained from the cache and from the source.
1 Introduction
Since the emergence of XML as a standard for data exchange on the Web, today's Web applications retrieve information from remote XML sources across the network. Cache-answerability for XML queries is thus crucial for the efficient support of XML database-backed Web applications [1][3][4][5]. There are three major issues that need to be dealt with in realizing such functionality:
1. Rewriting an XML query with the relevant XML cache.
2. Propagating the updates done to the XML source to the cache.
3. Integrating the partial query results obtained from the cache and from the source to produce the final result.
[Figure 1 shows the architecture considered here: client, Web server, application server with an XML cache, and XML data source server.]
Fig. 1. Multi-tier Architecture for XML DB-backed Web Applications
∗ This work was done as part of the Information & Communication Fundamental Technology Research Program supported by the Ministry of Information & Communication in the Republic of Korea.
The first issue was thoroughly investigated in our earlier work [5]. In this paper, we address the remaining two issues to come up with a full-fledged solution to cache-answerability for XML queries on the Web. We consider the multi-tier architecture for XML database-backed
applications on the Web, where middle-tier XML caching at the application server is provided (Fig. 1). As for XML caching, we consider semantic caching, i.e. view-based query caching, whereby the result of an XML query against the source XML documents is cached and maintained as a materialized view. We employ relational DBMSs as the XML store for both the source and the cache.
2 Update Propagation to XML Cache
In this section, we investigate the issue of propagating the updates done to the source XML documents at the data server to the application server, to incrementally refresh the XML materialized views there.
2.1 XML Numbering Scheme Supporting XML Updates
Consider the XML document in Fig. 2(a). Its tree representation, excluding the text, is in Fig. 2(b), where the integer in parentheses beside each node denotes the element identifier (Eid), which is assigned in a monotonically increasing way from the root element to its subelements in DFS (Depth First Search) order. The Eid is not just the unique identifier of the element; it also carries information on the DFS order of the elements in a document. Without loss of generality, in Fig. 2(b) we assume that Eid's increase monotonically by 1 in DFS order, starting at 1 from the root. To come up with an XML numbering scheme that supports XML updates of any type, such that the Eid's of all the elements in the document are still maintained in DFS order after updates without reassigning the Eid's at all, we introduce RmdEid and NextChildEid. RmdEid stores the identifier of the current element's rightmost descendant element, and thus captures a form of ancestor-descendant relationship among the elements. Fig. 3(a) gives the RmdEid and NextChildEid values as well, where the 3-tuple (i,j,k) beside each node denotes (Eid, RmdEid, NextChildEid). For example, the RmdEid of node c is 5 because c's rightmost descendant is e, whose Eid is 5. The RmdEid of e is 5, which is equal to its own Eid: the rightmost descendant of every leaf node is itself, and thus, for a leaf, RmdEid equals Eid. As for NextChildEid, it stores the
means that u and v are neighbors of each other (u ≠ v). In a structured P2P data system, each cell saves its own information, and the identifiers and scopes of its neighbor cells, in a peer that is called the head of the cell. Routing among cells is implemented through the heads of cells to improve routing speed. In the implementation of structured P2P data systems, several heads can be kept in a cell to guard against the temporary failure of one head. The cell abstract index distinguishes itself by the following advantages. It needs little storage space, supports the set-information publishing method, and can keep up with very frequent updates to the shared documents, because it is constructed on the abstract information of peers instead of information about individual documents. In particular, it supports high autonomy of peers and improves the safety of the system, since no global hash table fragment or similar global information needs to be stored in each peer. In addition, the routing strategy at the unit of a cell accelerates query processing.
3.2 The Construction of Cell Abstract Indices
The construction of a cell abstract index in a P2P data system is the process of organizing all peers of the system and the abstract space of the system into an undirected graph (V, E). When a peer, say peer A, joins the system, it first connects to the Name Server of the system to get its identifier and the location information of an online peer, say peer B. Peer A then computes its abstract and routes it from peer B to its end cell, i.e. the cell whose scope includes the abstract of A, by Procedure Get-End-Cell in Sect. 3.3. Let MPPC (maximum peers per cell) be the maximum number of peers within a cell. If the number of peers in the end cell is less than MPPC, peer A is inserted into the cell. Otherwise the scope of the end cell is partitioned evenly on the component of the abstract with the biggest interval, and the end cell is accordingly split into two new cells that are neighbors of each other, each with half of the original scope. The peers and neighbor cells of the original cell are placed into the proper new cells, and the neighbor cells of the original end cell modify their neighbor information. The head of the original cell remains the head of one new cell, and a peer in the other new cell is selected randomly as its head. Finally, the new end cell of peer A is selected and A is inserted into it; a sketch of this insert-and-split step is given below.
When a peer leaves a P2P system, it informs the head of its cell, provided it is not the head itself; otherwise, it has to find a peer within the cell to take over the information of the cell. If the leaving peer is the last one in the cell, the scope of the cell is mandated to one of the cell's neighbors; in detail, the first neighbor cell replying to the mandatory request is selected as the mandatory cell. If a peer whose abstract is in that scope later joins the system, the mandated cell reappears.
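The following is an illustrative sketch (not the paper's code) of inserting a peer abstract into its end cell and splitting the cell along the widest scope dimension when it already holds MPPC peers; the dictionary layout and names are our own assumptions.

```python
MPPC = 50

def insert_peer(cell, peer_abstract):
    """cell: {'scope': [(lo, hi), ...], 'peers': [abstracts], 'neighbors': [...]}."""
    if len(cell["peers"]) < MPPC:
        cell["peers"].append(peer_abstract)
        return [cell]
    # split on the component with the biggest interval
    dim = max(range(len(cell["scope"])), key=lambda d: cell["scope"][d][1] - cell["scope"][d][0])
    lo, hi = cell["scope"][dim]
    mid = (lo + hi) / 2.0
    left = {"scope": list(cell["scope"]), "peers": [], "neighbors": []}
    right = {"scope": list(cell["scope"]), "peers": [], "neighbors": []}
    left["scope"][dim], right["scope"][dim] = (lo, mid), (mid, hi)
    for p in cell["peers"] + [peer_abstract]:
        (left if p[dim] < mid else right)["peers"].append(p)
    return [left, right]    # the two new cells replace the original end cell
```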
3.3 The Application of Cell Abstract Indices
When a query is submitted by a user, the abstract of the query is extracted and routed to its end cell through the cell abstract index. The end cell of a query is the cell whose scope includes the abstract of the query, and the peers in the end cell are considered the candidate peers. The feature of the query is sent to these peers; after receiving it, each candidate peer compares the feature with the features of its shared documents and sends its local result to the querying peer.
Based on the value of LEVEL, the layer factor defined by the user or the system, the set of candidate peers may be extended to wider areas. When LEVEL=1, only the peers in the end cell are considered. When LEVEL=2, peers in the neighbor cells of the end cell are also considered candidate peers. When LEVEL=3, peers in the neighbor cells of the preceding cells are also considered candidate peers, and so on for larger values of LEVEL. To obtain better candidate peers, some restrictions should be added; for example, peers in the neighbor cells of a cell are not considered candidate peers until the number of documents similar to the user's query returned from the peers of that cell exceeds a given threshold. Let Rt be the number of documents satisfying the user's query in the end cell of the query, and let γ be the decision factor defined by users or the system; then the threshold may be γ × Rt.
The querying peer receives, merges and sorts all results returned from the candidate peers. After the user selects his favorites from the final result, the querying peer connects to the peers that have the selected documents; if the connections between the querying peer and the appropriate peers are established, the querying peer downloads the documents.
Procedure Get-End-Cell is used to route the abstract of a query to the end cell of the query. One of its implementations is greedy forwarding. When the abstracts of the peers of a system are evenly distributed, the complexity of the logical routing of Get-End-Cell is O(n · c^{1/n}), where n is the number of dimensions of the abstract space and c is the number of cells.
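A minimal sketch of greedy forwarding for Get-End-Cell is given below: starting from any cell, keep moving to the neighbor cell whose scope is closest to the query abstract until the abstract falls inside the current cell's scope. The distance measure and data layout are our own assumptions, and the sketch assumes progress towards the end cell is always possible (as under the evenly distributed abstracts considered above).

```python
def in_scope(scope, y):
    return all(lo <= v <= hi for (lo, hi), v in zip(scope, y))

def scope_distance(scope, y):
    # squared distance from point y to the hyper-rectangle described by scope
    return sum(max(lo - v, 0.0, v - hi) ** 2 for (lo, hi), v in zip(scope, y))

def get_end_cell(start_cell, y, cells):
    """cells: {cell_id: {'scope': [(lo, hi), ...], 'neighbors': [cell_ids]}}; y: query abstract."""
    current = start_cell
    while not in_scope(cells[current]["scope"], y):
        current = min(cells[current]["neighbors"],
                      key=lambda cid: scope_distance(cells[cid]["scope"], y))
    return current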
3.4 An Example
Suppose the abstract space S = [0, 1]2 . Several cells in a P2P data system are illustrated in Fig. 1(a), where cell (1) with the scope (0.25, 0.5, 0.75, 1); (2) (0.5, 0.75, 0.75, 1); (3) (0.25, 0.375, 0.5, 0.75); (4) (0.375, 0.5, 0.5, 0.75); (5) (0.5, 0.625, 0.625, 0.75); (6)(0.5, 0.625, 0.5, 0.625); (7)(0.625, 0.75, 0.5, 0.75); (8) (0.75, 1, 0.5, 0.75); (9) (0.25, 0.5, 0.25, 0.5); (10) (0.5, 0.625, 0.25, 0.5); (11) (0.625, 0.75, 0.25, 0.5); (12) (0.75, 1, 0.25, 0.5); (13) (0.5, 0.75, 0, 0.25); (14) (0.75, 1, 0, 0.25).
Fig. 1. A cell abstract index in a structured P2P data system
As shown in Fig. 1(b), a user submits a query q from peer A, where the abstract of q, yq, is (0.54, 0.53) and the abstract of peer A is (0.9, 0.1). Peer A sends yq to the head of the cell in which peer A is included, i.e. peer B. B judges that its cell is not the end cell of q and then routes yq to the end cell of q, cell (6), through Procedure Get-End-Cell. One sample route is A → B → C → D → E → F, where B, C, D, E and F are respectively the heads of cells (14), (12), (11), (10) and (6). When LEVEL=1, the peers in the end cell of q are all candidate peers, i.e. the peers in the cell marked "L=1" in Fig. 1(c). When LEVEL=2, peers in cells marked as
“L=2” are also candidate peers. When LEVEL=3, supposing that the numbers of returns in cell (4) and (5) are more than (γ × Rt ), peers in neighbor cells of (4) and (5) are also candidate peers, i.e. peers in cells marked as “L=3”. Each candidate peer compares the feature of the query and features of local shared documents, and sends local result to the querying peer. The querying peer processes the results returned from all candidate peers. 3.5
Enhancements
Several abstracts can be computed in a single peer; they can be obtained by classification, clustering or other algorithms, such as k-means. The accuracy of the abstract indices is then better.
System update in P2P environments consists of document update and peer update. Document update is the update introduced by a user's additions to and deletions from the local shared document set. When a user adds or deletes a shared document, the abstract of the local peer should be updated. In detail, when the change of the documents alters the abstract enough, for example when the difference between the new abstract and the old one is more than 5%, or when the number of new abstracts differs from the old number, the peer should send its new abstracts to the head of the cell in which it is included in a structured P2P system. To prevent the error introduced by the accumulation of many small differences, the up-to-date abstracts of peers should also be sent to the heads of cells periodically, for example every other hour. Peer update is the update introduced by peers joining or leaving the system. In a structured P2P system based on abstract indices, a peer sends its abstract, instead of the information of all its shared documents, to the head of its end cell when it joins the system; likewise, its abstract information, instead of the information of its shared documents, is deleted from the head of its end cell when it leaves. Obviously, the data used in system update based on abstracts is far less than that used in system update based on documents.
Several cell abstract indices can be established in a structured P2P data system: they can be established for different kinds of media, for different formats of a kind of media, or even for one format of one kind of media. Cell abstract indices can also be combined with other indices or document metadata.
4
Experiments
In this section, experiments are made to verify the effectiveness of cell abstract indices in structured P2P data systems. 4.1
Experimental Setup
The simulated P2P data system consists of 10,000 peers. The number of documents in the system is 3,854,400. Each document is represented as a 3-dimension vector (x1 , x2 , x3 ), where xi (i = 1, 2, 3) may be a digital feature extracted from the document. In the simulator, the range of xi is [0,1]. Documents in each peer
[Fig. 2 and Fig. 3 appear here: Fig. 2 is a histogram of the number of peers in each bin against the number of documents in a peer (up to 1,000), and Fig. 3 plots document recall against the similarity offset (0.05-0.3) for LEVEL=1, 2 and 3.]
Fig. 2. The histogram of peers in the simulated P2P data system according to the number of documents in each peer
Fig. 3. Recalls of satisfied documents in the structured P2P data system
Cell abstract indices are constructed in the simulator. In the simulated structured P2P data system, there is only one head in each cell. The head of each cell needs to remember the identifier and the scope of the cell, the peers within the cell, and the scopes and locations of its neighbor cells. Each peer that is not a head only needs to remember the head of its end cell. MPPC=50, γ=0.9. Returns. In the simulated system, the numbers of returns, i.e. the numbers of satisfied documents in all candidate peers, under cell AbIx, are listed in Table 1. Numbers in Table 1 are all decimal fractions because they are all averages. Each number is the mean of 512 results corresponding to the 512 different queries. Moreover the querying peer is randomly selected in each test. At a given layer factor LEVEL, the returns increase with the increase of the offset because the similarity demands are loosed. At a given offset, the returns decrease with the decease of LEVEL because the candidate peers become less. It can be found from Table 1 that there are many satisfied documents returned at certain layer factor and offset (usually more than 1,000). Therefore the effects of cell abstract indices are very well in structured P2P data systems.
276
C. Wang, J. Li, and S. Shi Table 1. Returns in a structured P2P data system offset LEVEL=1 LEVEL=2 LEVEL=3 0.05 266.65 1,138.51 1,237.36 0.1 1,850.51 8,804.42 9,795.55 0.15 4,880.48 27,392.26 31,863.85 0.2 8,629.34 58,294.23 72,004.53 0.25 12,032.41 99,036.37 131,603.60 0.3 14,602.54 145,404.61 210,343.06
Recalls. In a structured P2P data system, the definition of the metric recall is defined as return , (1) recall = s-all where return is the number of returned satisfied documents; s-all is the number of all satisfied documents in the system. Recalls at different offsets and layer factors are illustrated in Fig. 3. As shown in Fig. 3, the recalls decrease with the increase of offsets at a given layer factor. It is because the loosened condition makes a lot of documents be considered as the similar documents to the query. The inflated similar document set reduces recalls. At a given offset, the recalls increase with the increase of the layer factor because both candidate peers and the satisfied documents in these peers become more. When LEVEL=3, recalls are around 50% and stable. Illustrated in Fig. 4, at LEVEL=2, offset=0.05, when 7.03% peers are searched, 9.30% documents are computed, and 47.36% similar documents can be retrieved. It means that many satisfied documents will be found while just a few peers are searched. The same effects remain in the other layer factors and offsets. It shows the effectiveness of cell abstract indices in structured P2P data systems. 35
recall peer−hit doc−hit
Percentage
0.4
0.3
0.2
0.1
0
LEVEL=1 LEVEL=2 LEVEL=3
30 The first accelerating ratio
0.5
25 20 15 10 5
1
2 LEVEL
3
0
0.05
0.1
0.15 0.2 0.25 Similarity offset
0.3
Fig. 4. Effects in the structured P2P data Fig. 5. The first accelerating factors of cell system when offset=0.05 and LEVEL=2 AbIx in the structured P2P data system
Cell Abstract Indices for Content-Based Approximate Query Processing
277
AR1 . The first accelerating ratio in a structured P2P data system is defined as AR1 =
recall , peer-hit
(2)
where peer-hit is the ratio of the number of candidate peers to the number of all peers in the system. It has important function to describe the efficiency of structured abstract indices. The first accelerating ratio in the structured P2P data system is drawn in Fig. 5. Usually AR1 is more than 5. In the experiments, AR1 decreases with the increase of offset, and it increases with the decrease of layer factor. Given a layer factor, the candidate peers don’t change with the variety of offset, then peer-hit doesn’t change. Therefore in this situation, AR1 varies in the direction of the change of recall. Then AR1 also decreases with the increase of offset. In a structured P2P data system, given a value of offset, recall and peer-hit all decrease with the decrease of layer factor. When the layer factor decreases, AR1 increases because the change rate of recall is more than that of peer-hit. As shown in Fig. 5, the change of AR1 also shows the rightness of cell abstract indices. Formula (2) can be rewritten as AR1 =
recall = peer-hit =
return s-all |candidate peers| |peers in the simulator| return |candidate peers| , s-all |peers in the simulator|
(3)
where || represents the number of . Because s-all and the number of peers in the simulator are invariable when the offset is given, AR1 reflects the average number of satisfied documents on each candidate peer in structured data systems. Figure 5 shows that the peers on which more documents similar to the query can be selected with the decrease of layer factor. AR1 can be used to characterize the performance of cell abstract indices. AR1 = c means that c% returns can be gotten while only 1% peers are searched. As in Fig. 5, AR1 = 10.6 when offset = 0.3 and LEVEL=1. It means that in this situation 10.6% returns can be gotten while only 1% peers are computed. Please note that it does not mean that all satisfied documents can be retrieved with searching only 10% peers.
5
Conclusions and Future Work
This paper presents cell abstract indices to process content-based approximate queries in structured peer-to-peer data systems. Experimental results show the effectiveness of the proposed indices. A P2P music file sharing prototype system based on cell abstract indices will be implemented in the future.
278
C. Wang, J. Li, and S. Shi
References 1. Wang, C., Li, J., Shi, S.: An Approach to Content-Based Approximate Query Processing in Peer-to-Peer Data Systems. In: Proceedings of the Second International Workshop on Grid and Cooperative Computing, Shanghai, China (2003) 2. Wang, C., Li, J., Shi, S.: AbIx: An Approach to Content-Based Approximate Queries in Peer-to-Peer Data Systems. Tech. Report HIT/CSED-DB-03-0710, Database Lab. of Harbin Institute of Technology, Harbin, China (2003) 3. Yang, B., Garcia-Molina, H.: Efficient Search in Peer-to-Peer Networks. In: Proceedings of the 22nd International Conference on Distributed Computing Systems. (2002) 5–14 4. Crespo, A., Garcia-Molina, H.: Routing Indices for Peer-to-Peer Systems. In: Proceedings of the 22nd International Conference on Distributed Computing Systems. (2002) 23–34 5. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Schenker, S.: A Scalable Content-Addressable Network. In: Proceedings of ACM SIGCOMM. (2001) 161– 172 6. Rowstron, A., Druschel, P.: Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems. In: IFIP/ACM International Conference on Distributed Systems Platforms (Middleware). (2001) 329–350 7. Stoica, I., Morris, R., Liben-Nowell, D., Karger, D.R., Kaashoek, M.F., Dabek, F., Balakrishnan, H.: Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications. IEEE/ACM Transactions on Networking 11 (2003) 17–32 8. Zhao, B.Y., Kubiatowicz, J., Joseph, A.D.: Tapestry: An Infrastructure for Faulttolerant Wide-area Location and Routing. Tech. Report UCB/CSD-01-1141, University of California, Berkeley, California 94720 (2001) 9. Cuenca-Acuna, F.M., Nguyen, T.D.: Text-Based Content Search and Retrieval in ad hoc P2P Communities. In: Proceedings of the International Workshop on Peer-to-Peer Computing. (2002) 10. Tang, C., Xu, Z., Mahalingam, M.: pSearch: Information Retrieval in Structured Overlays. In: Proceedings of the 1st HotNets-I, Princeton, New Jersey, USA, ACM Press (2002) 11. Wang, C., Li, J., Shi, S.: A Kind of Content-Based Music Information Retrieval Method in a Peer-to-Peer Environment. In: Proceedings of the 3rd International Symposium on Music Information Retrieval, Paris, France (2002) 178–186 12. Gao, J., Tzanetakis, G., Steenkiste, P.: Content-Based Retrieval of Music in Scalable Peer-to-Peer Networks. In: The 2003 IEEE International Conference on Multimedia & Expo(ICME’03), Baltimore, MD, USA, IEEE CS Press (2003) 13. Gribble, S., Halevy, A., Ives, Z., Rodrig, M., Suciu, D.: What Can Database Do for Peer-to-Peer ? In: Proceedings of the 4th International Workshop on the Web and Databases. (2001) 31–36 14. Halevy, A.Y., Ives, Z.G., Suciu, D., Tatarinov, I.: Schema Mediation in Peer Data Management Systems. In: Proceedings of the 19th International Conference on Data Engineering. (2003) 15. Rowstron, A., Druschel, P.: Storage Management and Caching in PAST, a LargeScale, Persistent Peer-to-Peer Storage Utility. In: Proceedings of ACM Symposium on Operating System Principles (SOSP’01), Banff, Canada (2001)
A Distributed Ranking Strategy in Peer-to-Peer Based Information Retrieval Systems
Zhiguo Lu1, Bo Ling1, Weining Qian1, Wee Siong Ng2, and Aoying Zhou1
1 Department of Computer Science and Engineering, Fudan University, China {luzhiguo,lingbo,wnqian,ayzhou}@fudan.edu.cn
2 Singapore-MIT Alliance, 4 Engineering Drive 3, National University of Singapore, Singapore-117576 [email protected]
Abstract. The state of the art of P2P-based information retrieval is still at its infant stage and confronted with many challenges. One of the most urgent problems is how to rank the results retrieved from different peers. In this paper, we propose a fully distributed strategy to address this challenge. First, we identify the underlying diathesis resulting in the problem. Then we propose a fully distributed strategy to rank and merge results retrieved from different sites, including fully distributed statistics management and distributed ranking processing. Finally, we conduct an extensive experimental study to evaluate the strategy and the results verify the effectiveness of our solution.
1 Introduction
While most P2P-based applications are established to share data, the current mechanisms are largely restricted to semantics-free, file-level sharing by requesting file identifiers. The inability to share data based on their semantics fails to satisfy users' information demands efficiently and results in information overload as well (as many irrelevant answers may be retrieved). On the other hand, information retrieval (IR) has made breakthroughs and has been employed to semantically and efficiently retrieve information of several formats (such as text and images). However, traditional IR systems inherently suffer from many limitations, e.g., lack of scalability and limited power (including computing and storage capability), especially in the current situation of exponential information explosion. To address the problems of both technologies described above, the most natural way is to integrate IR with P2P computing so that they can compensate for each other. Such a system can be defined as a P2P-based information retrieval system (PIRS). Recently, several P2P-based information retrieval systems have been developed and used to semantically retrieve information, such as PlanetP
Supported by High Education Doctorial Subject Research Program of MoE in China under grant No.20030246023, and by Science and Technology Commission of Shanghai Municipal Government under grant No.03DZ15028.
[2] and PeerIS [3]. However, since P2P-based information retrieval is still at its infant stage, it inevitably faces many challenges, and how to rank and merge the results retrieved from different peers is one of the most urgent problems. In this paper, we propose a fully distributed strategy to address this challenge. Regarding related work, PlanetP [2] first ranks peers according to the relevance of each peer's documents to the query, then decides on the number of peers to contact and ranks the documents returned by these peers; its scheme is therefore heuristic rather than deterministic (ours is deterministic). In [1], Tang et al. proposed a deterministic strategy, but it is only suitable for structured P2P systems, while ours is independent of the system structure. Next we identify the ranking problem and propose a fully distributed ranking strategy to address these challenges.
2 A Distributed Ranking Strategy in Peer-to-Peer
2.1 Underlying Diathesis of Ranking Challenge
Without loss of generality, we can assume that the data on each peer are represented with the vector space model [5,6], since it is the most popular strategy employed at present; accordingly, we further assume that every peer adopts the vector space model. In such a context, each (shared) document and each query is represented as a vector, where each dimension is associated with a distinct feature term (or index word). The value of each component of the vector represents the importance of that index term (typically denoted as its weight) to that document or query. Specifically, the vectors of a document and of an initiated query can be defined as

d_j = (w_{1,j}, w_{2,j}, \cdots, w_{t,j}), \quad q = (w_{1,q}, w_{2,q}, \cdots, w_{t,q})    (1)

Then, given a query, we rank the relevance of documents to that query by measuring the similarity between the query's vector and each candidate document's vector, which can be measured as the cosine of the angle between them:

SR(d_j, q) = \frac{d_j \bullet q}{|d_j| \times |q|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}.    (2)

Generally, the weight of the index terms of both documents and queries is decided by the TFxIDF rule, whose rationale is to use some combination of the term frequency (TF) in a document with the inverse of how often that term shows up in the documents of the collection (IDF). In a traditional information retrieval system, there is just one (logical) node. Let N be the total number of files in the node and n_i be the number of files in which the index term k_i appears. Then the weight is decided by

w_{i,j} = f_{i,j} \times idf_i = \frac{freq_{i,j}}{\max_l freq_{l,j}} \times \log \frac{N}{n_i}    (3)
Above, freq_{i,j} is the raw frequency of term k_i in the file d_j, and \max_l freq_{l,j} stands for the maximal value of the raw frequency over all index terms; idf_i is the inverse file frequency of k_i. In P2P-based information retrieval there are a large number of peers, and each of them is an autonomous information retrieval system. So for idf_i the situation is different from traditional IR: any given two peers x and y may maintain different volumes of files, and the numbers of their local files that contain a given index term k_i also differ. Therefore, their idf_i values are different, i.e.,

idf_i^x = \log \frac{N^x}{n_i^x} \neq idf_i^y = \log \frac{N^y}{n_i^y}    (4)
Similarly, the weight of an index term in a given query q is confronted with the same problem: for an index term k_i in the query q, its weight may be unequal on different peers due to different local statistics. Taking the same example again, when the query is routed to the two peers x and y, its weights may be different even for retrieving the same file: w_{i,q}^x \neq w_{i,q}^y. Based on the above formulations, we can conclude that the traditional vector model ranking strategy, i.e., formula (2), cannot be migrated directly into P2P-based information retrieval. In the next section, we will propose a novel strategy to address the problem.
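As a concrete illustration of formulas (2) and (3), the following minimal Python sketch computes TFxIDF weights from one node's local statistics and ranks documents by cosine similarity. It only illustrates the classical scheme discussed above; the function names, the toy data and the use of Python are our own assumptions, not part of the paper.

```python
import math
from collections import Counter

def tfidf_vector(term_counts, doc_freq, num_files):
    """Weights per formula (3): normalized term frequency times log(N / n_i)."""
    max_freq = max(term_counts.values())
    return {t: (f / max_freq) * math.log(num_files / doc_freq[t])
            for t, f in term_counts.items() if doc_freq.get(t, 0) > 0}

def similarity(d, q):
    """SR(d_j, q) per formula (2): cosine of the angle between the two weight vectors."""
    num = sum(w * q[t] for t, w in d.items() if t in q)
    den = math.sqrt(sum(w * w for w in d.values())) * math.sqrt(sum(w * w for w in q.values()))
    return num / den if den else 0.0

# Toy local collection: raw term frequencies per file (a single logical node, as in Sect. 2.1).
files = {
    'f1': Counter({'grid': 3, 'p2p': 1}),
    'f2': Counter({'music': 2, 'p2p': 2}),
    'f3': Counter({'grid': 1, 'retrieval': 4}),
}
N = len(files)
df = Counter(t for counts in files.values() for t in counts)   # n_i for every term

query = Counter({'grid': 1, 'p2p': 1})
q_vec = tfidf_vector(query, df, N)
ranking = sorted(((similarity(tfidf_vector(c, df, N), q_vec), name)
                  for name, c in files.items()), reverse=True)
print(ranking)
```

Note that the query vector is weighted with the same local N and n_i as the documents; that is exactly the assumption which breaks down across peers, as formula (4) shows.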
2.2 Distributed Statistics Management Strategy
As discussed above, the challenge associated with the vector model ranking strategy results from the traditional statistics management strategy. To address the problem, there are two alternative strategies. The more intuitive one is to have one or several servers process all the files in the system and manage all related statistics; when processing a query, all queried peers first retrieve the related metadata from the servers. This strategy is obviously impracticable, since it inherits the limitations of the centralized architecture model, and the problems are even more serious in dynamic and ad hoc P2P systems of large scale. The other one is to devise a fully distributed strategy that lets all peers obtain approximate global statistics related to their local files. Guided by this idea, we propose a fully distributed strategy to manage the statistics, which is detailed in the following. Assume a random given peer p in the P2P-based information retrieval system has a unique identifier, denoted by PID; k_i is an index term of its local (shared) files, n_i^p is the number of local files containing the index term k_i, and N^p is the total number of its local (shared) files. Furthermore, we define a mapping function (e.g., a distributed hash function) that maps the index term k_i to a target peer (named the target) in the system; then the peer p (named the uploader) uploads a tuple <PID(p), n_i^p, N^p, TimeStamp> to the target. Since all peers in the system employ the same mapping function, the target can accordingly collect all statistics related to the index term k_i from different peers in the system. The target then periodically summarizes the statistics and returns the aggregated statistics to those uploaders. Indeed, the aggregated statistics is
the approximate global statistics. Armed with this information, all queried peers can correctly and independently compute the SRs and rank their local shared files on the fly.
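A minimal sketch of the upload/aggregate cycle just described, with an in-memory dictionary standing in for the DHT-style mapping function. The class and function names, the choice of MD5 and the way the target forms its estimate of N are illustrative assumptions, not details taken from the paper.

```python
import hashlib
import time

NUM_PEERS = 64

def target_of(term):
    """Map an index term to a target peer (a stand-in for the shared mapping function)."""
    return int(hashlib.md5(term.encode()).hexdigest(), 16) % NUM_PEERS

class TargetPeer:
    """Collects <PID, n_i^p, N^p, TimeStamp> tuples for the terms it is responsible for."""
    def __init__(self):
        self.reports = {}   # term -> {uploader PID: (n_i^p, N^p, timestamp)}

    def upload(self, term, pid, n_i, n_total):
        self.reports.setdefault(term, {})[pid] = (n_i, n_total, time.time())

    def aggregate(self, term):
        """Approximate global statistics periodically returned to the uploaders."""
        rows = self.reports.get(term, {}).values()
        global_n_i = sum(r[0] for r in rows)   # files containing the term, system-wide
        global_n = sum(r[1] for r in rows)     # one possible corpus-size estimate (cf. Sect. 2.3)
        return global_n_i, global_n

targets = {pid: TargetPeer() for pid in range(NUM_PEERS)}

def publish_local_stats(pid, local_files):
    """Each uploader peer reports, per local index term, its local n_i^p and N^p."""
    n_total = len(local_files)
    for term in set(t for f in local_files for t in f):
        n_i = sum(1 for f in local_files if term in f)
        targets[target_of(term)].upload(term, pid, n_i, n_total)

publish_local_stats(7, [{'grid', 'p2p'}, {'music'}])
publish_local_stats(12, [{'grid'}, {'grid', 'retrieval'}, {'p2p'}])
print(targets[target_of('grid')].aggregate('grid'))   # approximate (n_i, N) for 'grid'
```

With these aggregated values substituted for the local N and n_i, every queried peer evaluates formula (3) against (approximately) the same statistics, so the SR values of formula (2) become comparable across peers.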
2.3 The Determination of the Meta Data N
In the algorithm, the n_i of k_i in the PIRS can be correctly determined, while there is still an issue associated with computing N, the volume of the system corpus. Since our algorithm is index-term-oriented and each peer may have a large number of index terms (equaling the dimensions of its local file space), a peer may receive several N^x values returned from different target peers (related to its distinct index terms), which possibly differ from one another. There are many solutions to this problem. First, we can take the maximum of the returned N^x values as the volume of the system corpus. To make N more consistent, each peer can gossip its N = max(N_t^x) to others by employing the gossip algorithm proposed in [2], where t is a timestamp; further, each N = max(N_t^x) is identified by its PID and the timestamp. Through this comparison communication, peers can obtain an approximately consistent global value of N. However, this solution results in a waste of bandwidth and computing resources. Indeed, the ranking processing just needs appropriate statistics; therefore, each peer can simply take its own max(N_t^x) as N.
2.4 The Dynamism of Peers
One of the predominant features of a P2P system is the dynamism of peers, i.e., peers may join and leave the P2P network at any time, which also brings forth the challenge of maintaining the consistency of the metadata and affects all peers that rank retrieved results locally. We provide two methods to meet this challenge. First, we predefine an appropriate update frequency, which need not be high according to the research results of [7]. Second, each peer (say, p) can invite one or several peers to be its proxies during its offline period, under the constraints \sum_{proxy} online \geq offline_p and \sum_{proxy} online \cap online_p > 0. That is to say, the summed online period of the proxies should cover the offline duration of p, and there should be some online overlap among all of the related peers [4].
3 Evaluation
3.1 Experiment Setup
The experimental environment is made up of 64 “peers” in total, forming a virtual P2P-based information retrieval system, and we implemented a hash-based mapping function to realize the distributed statistics management strategy. We generate a collection of files whose sizes range from 10 KB to 1000 KB, each of which has 100 dimensions, i.e., is indexed with 100 terms. Then we assign 1000∼2000 files to each peer. In addition, we generate a query consisting of 10 terms to evaluate the scenarios defined in the next subsections.
We define two metrics to evaluate our solution, i.e., Recall and Precision, whose respective connotations are detailed in the following subsections. To verify the effectiveness of our solution, we propose three basic scenarios: (i) GRank, in which all files and their statistics are maintained by a single node, so that it has the global statistics of the system and its Recall, Precision and Correctness are assumed to be 100%; this scenario is employed as the baseline. (ii) DRank, in which each peer in the P2P-based information retrieval system adopts the distributed statistics management strategy proposed in this paper, as implemented in PeerIS. (iii) TRank, in which each peer just uses the statistics of its local files to rank the query results, which is the underlying diathesis of the ranking challenge of PIRSs.
3.2 Recall and Precision
Recall and Precision are two classical metrics to evaluate information retrieval technology. However, their connotation in the context of P2P differs from that of traditional information retrieval. In PIRSs, it is more reasonable to compute them by the following formula:

Recall = \frac{Retrieved\ Answer}{System\ Available\ Answer}, \quad Precision = \frac{qualified\ Answer}{retrieved\ Answer}    (5)
In the experiment, we do not take the processing period into account, since the network transfer delay in a LAN is negligible. We initiate the dedicated query 5 times and the average results are plotted in Figures 1 and 2.
Fig. 1. Recall (left: uniform data distribution; right: 80/20 data distribution; curves for GRank, DRank, and TRank over the threshold range 0.68–0.76)
Firstly, let us look at Recall. The left subfigure displays the results when the data is uniformly distributed, while the right subfigure presents the results when the data is distributed in an “80/20” manner, which models the actual situation in reality. From both subfigures we can see that DRank outperforms TRank under both data distributions. Further, when the data is uniformly distributed, both DRank and TRank are similar to GRank; this is because in this case the local n_i^p/N^p of a peer is similar to the n_i/N of the system (TRank). From the right subfigure we can observe that DRank greatly outperforms TRank in the practical environment. Furthermore, even when the “threshold” is bigger, the Recall of DRank remains almost the same and approaches that of GRank, and is much higher than that of TRank.
Fig. 2. Precision (left: uniform data distribution; right: 80/20 data distribution; curves for GRank, DRank, and TRank over the threshold range 0.68–0.76)
For Precision, the results are similar to those for Recall. Based on the above evaluation, we can conclude that our ranking strategy can satisfy users' information demands more efficiently than the current solutions.
4 Conclusion
In this paper, we have conducted an extensive analysis of the most popular vector space model ranking strategy and identified the underlying diathesis. Furthermore, a fully distributed ranking and merging strategy has been proposed. Finally, an extensive experimental study has been carried out and the results are promising. Since P2P-based information retrieval is still at its infant stage, it is confronted with many challenges, such as source selection and query routing, source representation and description, and so on. Some of these are among our current research topics, while others are directions for future work.
References
1. C. Tang, Z. Xu, and M. Mahalingam. PeerSearch: Efficient information retrieval in structured overlays. In Proceedings of HotNets-I, ACM SIGCOMM, 2002.
2. F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen. PlanetP: Infrastructure support for P2P information sharing. Technical Report DCS-TR-465, Department of Computer Science, Rutgers University, Nov. 2001.
3. B. Ling, Z. Lu, W. S. Ng, B. C. Ooi, K.-L. Tan, and A. Zhou. A content-based resource location in PeerIS. In Proceedings of the 3rd International Conference on Web Information Systems Engineering, Singapore, Dec. 2002.
4. B. Ling, W. S. Ng, Y. Shu, and A. Zhou. Fuzzy cost modeling for peer-to-peer systems. In International Workshop on Agents and Peer-to-Peer Computing, Australia, July 2003.
5. G. Salton. The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall Inc., 1971.
6. G. Salton and M. E. Lesk. Computer evaluation of indexing and text processing. Journal of the ACM, Jan. 1968.
7. S. Waterhouse. JXTA Search: Distributed search for distributed networks. http://search.jxta.org/JXTAsearch.pdf.
PeerSDI: A Peer-to-Peer Information Dissemination System Keping Zhao, Shuigeng Zhou, Linhao Xu, Wenyuan Cai, and Aoying Zhou Department of Computer Science and Engineering, Fudan University Shanghai 200433, P.R.China {kpzhao,sgzhou,xulh,wycai,ayzhou}@fudan.edu.cn
Abstract. Selective dissemination of information (SDI) is a powerful mechanism to retrieve information in wide-area networks. With selective information dissemination, users submit continuous queries to the server, which collects new documents from information sources, matches the documents against the queries, and notifies the users of their desired information. On the other hand, Peer-to-Peer (P2P) computing is gaining more and more attention from both academia and the industrial community for its potential to reconstruct current distributed applications on the Internet with inherent advantages, such as scalability, cost sharing and autonomy. This paper presents the PeerSDI information dissemination system, which combines the effective mechanism of SDI with the Peer-to-Peer computing paradigm. PeerSDI supports distributed and scalable content-based information filtering and dissemination and makes the SDI mechanism available on existing structured P2P networks. Our discussion focuses on textual document dissemination; however, the solution is also suitable for the dissemination of multimedia resources. Keywords: Peer-to-Peer, DHT, Publish/subscribe, Selective Information Dissemination
1 Introduction
With its explosively rapid development in recent years, the Internet has become the most important information source, holding much more content than any source that emerged before. It is a great advantage that users can gain easy access to such abundant information, but how to retrieve the interesting information from this giant source becomes a challenge to the users. The Information Retrieval community has proposed several effective solutions to meet the challenge. One of the solutions is selective information dissemination (SDI) [7]. In an SDI system, a user posts a continuous query, called a user profile, to the system, which collects new documents from information sources, matches the documents against the profile, and notifies the user of his interesting information. SDI
This work was supported by National Natural Science Foundation of China under grant No.60373019, and by Science and Technology Commission of Shanghai Municipal Government under grant No.03DZ15028.
systems provide an opportunity for people to access information in a timely way. However, traditional SDI systems, like SIFT [7], work in a centralized way, which makes them face some severe challenges, including flash crowd loads, denial of service attacks, server failures, network outages, etc. On the other hand, Peer-to-Peer (P2P) computing [3] is gaining more and more attention from both academia and the industrial community for its potential to reconstruct current distributed applications on the Internet. In P2P systems, peers use a dedicated naming space and act as both service providers and service consumers, while keeping themselves fully autonomous. Since most peers in P2P networks are equal, traffic is often evenly distributed and congestion can be reduced. Moreover, most P2P networks are designed under the assumption of frequent node failures, so such networks are always able to maintain their performance. In this paper, we introduce the PeerSDI information dissemination system, which combines the mechanism of SDI with the Peer-to-Peer computing paradigm to address the limitations mentioned above. PeerSDI supports distributed and scalable content-based, rather than simple group-based, information filtering and dissemination, and makes the SDI mechanism available on existing structured P2P systems. Furthermore, our system is resilient to faults, which assures that users always keep being notified of the latest information. The remainder of the paper is organized as follows. Section 2 gives an overview of the design of the PeerSDI system. Section 3 introduces our current implementation of the PeerSDI prototype. Section 4 reviews the related work and finally Section 5 concludes this paper.
2 The PeerSDI System Overview
In this section, we present an overview of the PeerSDI system. PeerSDI is constructed on a simple distributed hash table (DHT) substrate; however, PeerSDI can run on any existing structured P2P network supporting DHT functionality. With DHT functionality, each peer in the system is identified by a unique PeerID in the DHT space. Virtually, the PeerSDI system requires two APIs, Map(key) and MapK(key, k), which abstract the services provided by the DHT functionality:
– Map(key) returns the active peer with the PeerID numerically closest to the key in the P2P network. Map is an elementary service provided by DHT functionality.
– MapK(key, k), an extension of Map, returns the k alive peers numerically closest to the key. Some existing P2P systems support the MapK service directly, including Pastry and Tapestry; however, some other systems need to be extended to provide such a service.
In this section, because of the limited space, we only briefly present the design of PeerSDI; more details can be found in our technical report [8].
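The two primitives can be pictured with a toy in-memory ring. The sketch below only illustrates the interface contract stated above and is not the routing logic of Chord, Pastry or Tapestry; the hash function, the class name and the "closeness" metric are simplifying assumptions.

```python
import hashlib

def hash_key(key, bits=32):
    """Hash an arbitrary key into the DHT identifier space."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (1 << bits)

class ToyDHT:
    def __init__(self, peer_ids):
        self.peers = sorted(peer_ids)        # PeerIDs of the currently active peers

    def map(self, key):
        """Map(key): the active peer whose PeerID is numerically closest to the key."""
        k = hash_key(key)
        return min(self.peers, key=lambda pid: abs(pid - k))

    def map_k(self, key, k):
        """MapK(key, k): the k alive peers numerically closest to the key."""
        h = hash_key(key)
        return sorted(self.peers, key=lambda pid: abs(pid - h))[:k]

dht = ToyDHT(peer_ids=[hash_key('peer%d' % i) for i in range(8)])
print(dht.map('sports/football'))        # the peer responsible for this key
print(dht.map_k('sports/football', 3))   # the 3 peers that would hold its replicas
```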
2.1 System Preliminaries
The PeerSDI system so far supports the Vector Space Model (VSM) for content-based queries. In this paper, we focus on the discussion of textual document dissemination. However, VSM is actually employed by nearly all current
multimedia retrieval systems. In PeerSDI, the users are notified of the documents which are relevant to the continuous query, with respect to the similarity between the document vector and the query vector.
Fig. 1. The Architecture of PeerSDI
Fig. 1 illustrates the application scenario of PeerSDI. Each peer in the system, say Peer A, registers one or more profiles, one for each topic of interest. The user profile is delivered by the DHT mechanism to an active (online) peer, say Peer F. Peer F then adds the profile into its profile database, and, in order to improve the resilience of the system, the profile is replicated to some peers with PeerIDs close to the hashed key of the profile, say Peer G and Peer H. Now suppose Peer B publishes a document to PeerSDI and the publish message is routed to Peer F. Peer F matches the document against its profile database. If the document is relevant to the profile of Peer A, a notification is created and delivered directly to Peer A. In case Peer A goes down, the notification is first delivered to the peer that is closest to the hash key of Peer A's PeerID, say Peer C, and Peer C acts as a notification inbox for Peer A. Similarly, one peer can have several notification inboxes; Peer D and Peer E in Fig. 1 are Peer A's other two notification inboxes. When Peer A comes back up later, it tries to retrieve the notifications from its notification inboxes. The detailed description of the proposed selective dissemination mechanism, the replication scheme and the method for dealing with failures is presented in the subsequent sections.
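To make the scenario concrete, here is a compact sketch of how a subscription and a later publication meet on the responsible peer. It is an illustration only: the relevance test is a crude term overlap instead of the VSM similarity of Section 2.1, replication and notification inboxes are omitted, and all names are hypothetical.

```python
import hashlib

def responsible_peer(key, peer_ids):
    """Stand-in for the DHT Map(key) primitive: the peer numerically closest to the hashed key."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16) % (1 << 32)
    return min(peer_ids, key=lambda pid: abs(pid - h))

class Peer:
    def __init__(self, pid):
        self.pid = pid
        self.profiles = []   # (subscriber pid, set of profile terms) stored on this peer
        self.inbox = []      # notifications delivered to this peer

peers = {pid: Peer(pid) for pid in range(0, 1 << 32, 1 << 29)}   # 8 toy peers

def subscribe(subscriber, topic, profile_terms):
    home = responsible_peer(topic, peers)                # plays the role of Peer F
    peers[home].profiles.append((subscriber, set(profile_terms)))

def publish(topic, doc_terms):
    home = responsible_peer(topic, peers)                # publish message routed to Peer F
    for subscriber, profile in peers[home].profiles:
        if profile & set(doc_terms):                     # crude relevance test (not the VSM score)
            peers[subscriber].inbox.append((topic, doc_terms))

some_peer = next(iter(peers))
subscribe(some_peer, topic='music', profile_terms=['jazz', 'piano'])
publish(topic='music', doc_terms=['new', 'jazz', 'album'])
print(peers[some_peer].inbox)
```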
2.2 User Interfaces
Our PeerSDI system provides four main interfaces to users: the Subscribe, UnSubscribe, Publish, and Notification interfaces. With the Subscribe interface, users are able to register and modify user profiles in the system, so that they are notified of the latest desired information, and the UnSubscribe interface enables users to unsubscribe the profiles registered before. On the other hand, users can also make their own resources available in a timely fashion to others who are interested via the Publish interface. The Notification interface enables users to access the notifications. Interested readers are referred to [8] for more details of these interfaces.
2.3 Selective Information Dissemination Mechanics
When a publish message is received, the peer, such as Peer F in Fig. 1, extracts the topic and the document vector from the message. Then the peer matches the new incoming document against its collection of user profiles. With the vector space model we adopt in PeerSDI, the task is to calculate the similarity between the incoming document vector and the stored profile vectors, and to find out the relevant profiles. Because the vectors may be sparse, some profiles need not be examined; in order to reduce the number of profiles that must be examined, PeerSDI makes use of an index structure for user profiles, as sketched below. More information can be found in [8].
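A small sketch of such a profile index: an inverted index from terms to profile identifiers limits the cosine computation to profiles that share at least one term with the incoming document. The concrete data structures and the threshold are our assumptions; the actual index structure used by PeerSDI is described in [8].

```python
import math
from collections import defaultdict

class ProfileIndex:
    """Inverted index over profile vectors: only profiles sharing a term with the
    incoming document are scored."""
    def __init__(self):
        self.profiles = {}                    # profile id -> {term: weight}
        self.postings = defaultdict(set)      # term -> set of profile ids

    def add(self, pid, vector):
        self.profiles[pid] = vector
        for term in vector:
            self.postings[term].add(pid)

    def match(self, doc_vector, threshold=0.3):
        candidates = set()
        for t in doc_vector:
            candidates |= self.postings.get(t, set())
        hits = []
        for pid in candidates:
            p = self.profiles[pid]
            num = sum(p[t] * w for t, w in doc_vector.items() if t in p)
            den = math.sqrt(sum(v * v for v in p.values())) * \
                  math.sqrt(sum(v * v for v in doc_vector.values()))
            if den and num / den >= threshold:
                hits.append(pid)
        return hits

idx = ProfileIndex()
idx.add('peerA/grid', {'grid': 0.8, 'computing': 0.6})
idx.add('peerB/music', {'jazz': 1.0})
print(idx.match({'grid': 0.7, 'middleware': 0.7}))     # -> ['peerA/grid']
```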
2.4 Fault Tolerance
In PeerSDI, there are two main challenges to the resilience of the system. One of the challenges is to maintain the integrity of the profile databases on the peers that process document matching and dissemination. The other one is to assure that users will get their notifications. Our scheme to attack the two challenges is to make several replications of the profiles or messages. If the peer can get the profile or messages desired from one of the replications, the system then works well. The basic idea to maintain integrity of profile database is as follows: collect profiles and unsubscribe message from its neighbors after some time cursor, and calculate the superset of the collections. And notification messages are also replicated on several peers, so the owner of the message can retrieve it with high probability. Interested readers are referred to [8] for the details of our schemes mentioned above to improve fault tolerance of the system and the analysis of them.
3 Implementation
In this section, we present our implementation of the PeerSDI prototype. The PeerSDI prototype is implemented in the Java language, and all the functionality described above is included in the current implementation. We run PeerSDI in the local area network of our lab; each node runs an instance of PeerSDI. We hope to run PeerSDI at a larger scale, and to make it available for free download, in the future.
Fig. 2. The Architecture of the Peer in PeerSDI System (Subscribe, UnSubscribe, Publish and Notification interfaces; profile index organized by topic; document matching engine; profile database; message pool; message manager; DHT functionality service interface)
Fig. 2 presents the layered architecture of a single peer in the PeerSDI prototype. The lowest layer is the DHT functionality service interface. The layer directly on top of the DHT service layer is the message manager, which manages all incoming and outgoing messages and provides the message-sending interface for the upper layers. The user interface module includes the four interfaces presented in Section 2 for user interaction with the system. The information dissemination module includes the document matching engine, the profile database and the profile index. The document matching engine matches incoming publish messages using the profile index and sends out notifications via the message manager. The message pool is used to replicate unsubscribe and notification messages to improve the resilience of PeerSDI.
4 Related Work
The idea of information dissemination has been around for a long time in the information retrieval community [5]. SIFT [7] is a centralized information dissemination system developed at Stanford University, which pays attention to efficiency rather than focusing only on precision as previous systems did. So far, the few dissemination systems based on P2P networks are simple group-based publish/subscribe systems; SCRIBE [4] is a representative of such systems. SCRIBE is implemented on top of Pastry: for each group, a multicast tree is created based on the routes of Pastry's message routing, which introduces single points of failure and bottlenecks. Although the recent pFilter project [6] supports content-based document filtering, it relies on a particular P2P infrastructure and also depends on the mechanics of a multicast tree. Finally, [2] presents an implementation of an SDI solution based on the super-peer P2P networks proposed in [1].
5 Conclusion and Future Work
In this paper, we present the PeerSDI information dissemination system, which makes the SDI functionality available on the current structured P2P networks, and attacks the challenges encountered by traditional centralized SDI systems. We discuss in details the architecture, mechanics of information dissemination and fault tolerance of the system. We also present the implementation of the prototype of PeerSDI system. In the future, we will run the PeerSDI system in larger scale to test the current setting and schemes we adopt. We will also explore the scenario of SDI in unstructured P2P networks.
References 1. A. Carzaniga, D. S. Rosenblum, and A. L. Wolf. Design and evaluation of a wide-area event notification service. ACM Transactions on Computer Systems, 19(3):332–383, 2001. 2. M. Koubarakis, C. Tryfonopoulos, S. Idreos, and Y. Drougas. Selective information dissemination in p2p networks: Problems and solutions. SIGMOD Record, 32(3):Special Section on Peer to Peer Data Management, September 2003. 3. D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Prunyne, B. Richard, S. Rollins, and Z. Xu. Peer-to-peer computing. Technical Report HPL-2002-57, HP Lab, 2002. 4. A. Rowstron, A. M. Kermarrec, M. Castro, and P. Druschel. SCRIBE: The design of a large-scale event notification infrastructure. In Third International Conference on Networked Group Communication (NGC 2001), volume 2233 of LNCS, pages 30–43. Springer-Verlag, 2001. 5. G. Salton. Automatic Information Organization and Retrieval. McGraw-Hill, New York, 1968. 6. C. Tang and Z. Xu. pFilter: Global information filtering and dissemination using structured overlay networks. In The 9th Information Filtering and Dissemination Using Structured Overlay, 2003. 7. T. W. Yan and H. Garcia-Molina. The SIFT information dissemination system. ACM Transactions on Database Systems, 24(4):529–565, 1999. 8. K. Zhao and S. Zhou. The PeerSDI information dissemination system. Technical report, Fudan University, Department of Computer Science and Engineering, 2003.
PWSD: A Scalable Web Service Discovery Architecture Based on Peer-to-Peer Overlay Network1 Yin Li , Futai Zou, Zengde Wu, and Fanyuan Ma The Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai, China, 200030 {Liyin, Zoufutai, Wu-zd, Ma-fy}@cs.sjtu.edu.cn
Abstract. Web services are distributed across the Internet, but existing Web service discovery is processed in a centralized approach such as UDDI, which has the limitations of a single point of failure and a performance bottleneck. We propose PWSD, a scalable Web service discovery architecture based on a peer-to-peer overlay network, to overcome these limitations. In PWSD, the service descriptions are managed in a completely decentralized way. Moreover, since the basic peer-to-peer routing algorithm cannot be applied directly in the service discovery process, we extend the basic peer-to-peer routing algorithm with XML, which enables PWSD to support complex XML-based queries. Experimental results show that PWSD has good robustness and scalability.
1 Introduction
Web service computing is driving distributed computing towards a model of service-based interactions, where corporate enterprises use Web services to interact with each other dynamically. Web services are much more loosely coupled than traditional distributed applications. In order for two services to interact with each other dynamically, they must be able to do at least two fundamental things. First, a service must be able to describe its abstract interfaces and protocol bindings so that clients can figure out how to invoke it. Second, clients must be able to discover services. The first requirement is finely solved by WSDL [1] and SOAP [2]. However, current Web service discovery employs centralized registries such as UDDI [3], which leads to a single point of failure and a performance bottleneck. These registries are critical to the ultimate utility of Web services and must support scalable, flexible and robust discovery mechanisms. Since Web services are widely deployed on a huge number of machines across the Internet, it is highly desirable to manage these Web services in a decentralized way. Peer-to-peer (P2P) [4-6], as a completely distributed computing model, can supply a good scheme for Web service discovery. Existing peer-to-peer overlay networks provide an infrastructure for routing and data location in a decentralized, self-
Supported by The Science & Technology Committee of Shanghai Municipality Key Project Grant 03dz15027 and 03dz15028.
organized environment in which each peer acts not only as a node providing routing and data location services, but also as a server providing service access. All the hosts that provide Web services can be organized together as a Web service peer-to-peer overlay where each host acts as a peer in this overlay. The hosts in the peer-to-peer network can publish their service descriptions to the overlay, and the users can access the up-to-date Web services. We designed a distributed Web service discovery architecture based on a structured peer-to-peer overlay network named PWSD (Peer-to-Peer based Web Service Discovery). We have also extended the peer-to-peer routing algorithm based on Chord [4] to support XML-based service lookup. Experimental results have shown that PWSD has good scalability and robustness. The rest of the paper is organized as follows. Section 2 presents the architecture of PWSD. Section 3 presents the peer-to-peer based Web service discovery algorithm. Section 4 presents the experimental results. Conclusions are given in Section 5.
2 Web Service Discovery in PWSD
The key to Web service discovery lies in service publishing and service lookup. Figure 1 shows the service publishing and locating processes in PWSD. To publish a service, the service description metadata is generated first. Then a set of service keys is obtained and passed to a hash function such as MD5 to generate a set of hash IDs (HID for simplicity). These HIDs are used to locate the appropriate peers by means of the peer-to-peer routing algorithm. At last, the HIDs and the service description metadata are published to the target peers, which save the service description metadata in their router repositories and complete the publishing process. The process of service locating is roughly the same as that of service publishing. The key step in the service publishing and locating process is to look up a peer node according to an HID, which is achieved by the extended routing algorithm. We discuss the service discovery algorithm in detail in Section 3.
Fig. 1. Service publishing and locating process ((a) service publishing: parse service description, hash service key, locate target peer, register service description; (b) service locating: parse service request, hash request key, locate service peer, get service description)
Each peer in PWSD acts as a service peer (SP for simplicity), which not only provides Web service access, but acts as a peer in the Peer-to-Peer overlay network. The architecture of the service peer in PWSD is shown in figure 2. A SP is a
federation of logical machines. Each logical machine consists of three active components, called the Web Service Discovery Interface, the core component and the router, and a passive component called the local repository. Several logical machines can share one piece of hardware; each can have a dedicated machine, or a single logical machine can consist of several physical machines. The Web Service Discovery Interface provides the access interface to publish or locate Web services and also exposes itself as a Web service. The core component consists of the XML parser, the service description resolver, a service key mapper and a service mapper. The service description resolver is a key-splitting algorithm that extracts key nodes from a description; the goal of the algorithm is to break descriptions into shorter pieces so that resolvers can specialize around subsets of the descriptions. Each key node extracted from the description is independently passed to the service key mapper component, together with the service description or query. The service key mapper is responsible for associating an HID with each key node. It does this by concatenating the node name and value of the key node into a single string and computing an HID for this string. More details are given in Section 3. The XML parser parses XML documents and is used by the other components. The service mapper is responsible for mapping HIDs to service descriptions and returns the results to the application services through the Web service discovery interface. The local repository keeps the Web service interfaces, service descriptions and HIDs that the SP is responsible for. The router routes query requests and returns routing results.
Fig. 2. The architecture of the service peer (application service; Web service discovery interface; core component with XML parser, service description resolver, service key mapper and service mapper; local repository; router handling incoming and outgoing messages)
In PWSD, we organize every service peer in a structured Peer-to-Peer overlay network. Because Chord has the features of simplicity, provable correctness, and provable performance compared with other lookup protocols, we use Chord protocol to organize the SP’s routing table. Thus the topology of the overlay network is a circle, which is maintained by Chord algorithm.
3 Service Publishing and Locating Algorithm in PWSD
The service locating algorithm specifies how to route requests to the service peers that satisfy the service requests. In PWSD, the service request is expressed in XML,
which is consistent with that of the service descriptions stored in the destination service peers. However, the routing algorithm of the underlying peer-to-peer overlay network, Chord, only supports exact match. We therefore extend the Chord algorithm to support XML-based conditional match; the extended Chord algorithm is called XChord. In PWSD, WSDL is used to describe the Web service interface, and the service description metadata is generated based on the content of the WSDL document and the description that the user inputs before publishing. An example of Web service description metadata is shown in Fig. 3. In Fig. 3, the services section is automatically generated from the WSDL document and the description section is produced based on the user input.
Fig. 3. An example of Web service description metadata in PWSD (a services section listing ListPriceService, "List the product price", http://services.companya.com/product/ListProductService.wsdl and OrderService, "Make an order to the product", http://services.companya.com/product/OrderService.wsdl, and a description section with company CompanyA (Manufactory), region China, and keywords Automobile, Price, Order)
Before publishing the service descriptions, an XML based tree node extraction approach is used in PWSD. For instance, the example service description shown in figure 3 can be converted to a node-value tree as shown in figure 4. We have to notice that only the important elements in the service description will be extracted and inserted into the NVTree, and only the meaningful nodes in the NVTree will be used to generate a hash value, which in turn will be used as a hash key of the service description and will be inserted into the Peer-to-Peer overlay network.
Fig. 4. NVTree converted from the service description shown in Fig. 3 (a Root node with a services subtree, whose service nodes carry name and documentation leaves, and a description subtree with company, region and keyword leaves)
Fig. 5. Splitting an NVTree into service description nodes (each leaf node-value pair, such as name, documentation, company, region and keyword, is extracted together with its parent node)
In PWSD, we use a simple node-splitting method to extract each node-value pair from NVTrees and independently map it onto a key. However, only the leaf nodes in NVTrees have a pair of node and value; furthermore, in order to preserve the hierarchical relationship, the parent node of each leaf node is also extracted. The description nodes split from Fig. 4 are demonstrated in Fig. 5. After splitting the NVTree into separate simple description nodes, we use a simple concatenating method to convert them into strings, and those nodes whose values consist of several words are further divided into single-word-value nodes. These strings are passed to the hash function to produce hash IDs, which are used as keys to insert into the underlying peer-to-peer overlay by the XChord algorithm. Before presenting the PWSD service publishing and locating algorithm, which is named XChord, we first introduce some definitions.
Definition 1. Let И be the identifier (HID) set and Ω be the SP set. The mappings node and node^{-1} are defined as follows: node: И → Ω maps an HID to the corresponding SP; this mapping is achieved by the Chord algorithm. node^{-1}: Ω → И maps an SP to the HID of that node; this mapping is achieved by consistent hashing.
Definition 2. Let SD stand for a service description document; then Γ(SD) represents the NVTree of SD, and Ν(SD) stands for the set of service description nodes, where Ν(SD) = {N1, N2, …, Nm}.
Definition 3. Let N stand for a service description node; then К(N) represents the set of concatenated strings, where К(N) = {S1, S2, …, Sk}.
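Before turning to the XChord pseudocode, the node-splitting and hashing step described above can be sketched as follows. The nested-dictionary NVTree, the concatenation format and the use of MD5 in this snippet are illustrative assumptions that mirror, but are not taken verbatim from, the PWSD design.

```python
import hashlib

def split_description_nodes(nvtree):
    """Extract each leaf (node, value) pair together with its parent node name,
    as sketched in Fig. 5; nvtree is a nested dict standing in for the NVTree."""
    nodes = []
    def walk(name, subtree, parent):
        if isinstance(subtree, dict):
            for child, grandchild in subtree.items():
                walk(child, grandchild, name)
        else:                                   # leaf: a value string
            nodes.append((parent, name, str(subtree)))
    walk('Root', nvtree, None)
    return nodes

def hids_for(nodes):
    """Concatenate parent, node name and value into strings (splitting multi-word
    values into single-word keys) and hash each string into an HID."""
    hids = {}
    for parent, name, value in nodes:
        for word in value.split():
            key = '%s/%s=%s' % (parent, name, word)   # illustrative concatenation scheme
            hids[key] = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return hids

nvtree = {
    'services': {'service': {'name': 'ListPriceService',
                             'documentation': 'List the product price'}},
    'description': {'company': 'CompanyA', 'region': 'China',
                    'keyword': 'Automobile Price Order'},
}
for key, hid in hids_for(split_description_nodes(nvtree)).items():
    print(hex(hid)[:10], key)
```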
The pseudocode of the XChord service description publishing algorithm is given in Fig. 6. The function Publish is run on node n; it takes a service description (SD) as input and publishes the SD into the peer-to-peer overlay network.
n.Publish(SD){
    Compute Ν(SD) = {N1, N2, …, Nm};
    for each Ni in Ν(SD){
        Compute К(Ni) = {S1, S2, …, Sk};
        for each Sj in К(Ni)
            n.Distribute(Sj, SD);
    }
}
n.Distribute(S, SD){
    id = hash(S);
    n` = n.Route(id);
    n`.Insert(id, S, SD);
}
n.Route(id){
    n looks up the id` which is closest to id in its local routing table;
    if id …

σ_{published-date ≥ '2003-7-21' ∧ has-topic = 'Grid'}([Publication])
Based on the projection and selection, we can specify more complex queries.
Example 3.10. Ask for the author, publishing house and publication date of all publications whose every author is a PhD student.
π_{authored-by, published-by, published-date}(σ_{∀(authored-by = [Phd-student])}([Publication]))
Obviously, it is a composition of two algebraic expressions.
3.4 Join
The Join operator returns a collection containing the pairs of resource elements from c1 and c2 whose property values match the specified join predicate p.
Definition. (c1) Join_p (c2) = {(x, y) | p(x, y) = True, x ∈ C_I(c1) ∧ y ∈ C_I(c2)}, where p is a join predicate.
Example 3.11. To select Phd students and Master students who have common supervisor.
3.6 Intersection
Definition. c1 ∩ c2 = {x | x ∈ C_I(c1) ∧ x ∈ C_I(c2)}.
Example 3.13. To get PhD students who have published a conference paper and a journal paper on Grid.
3.7 Minus
Definition. c1 − c2 = {x | x ∈ C_I(c1 ∧ ¬c2)}.
Example 3.14. To get all publications except for Paper: [Publication] − [Paper].
4 Knowledge-Based Resource Query Language
4.1 KRQL
In this subsection, we develop a high-level declarative query language, KRQL, based on the algebra defined in Section 3. The basic syntax of KRQL has the following form:
SELECT [p1[, p2, …, pn]]
[FROM C1[, C2[, …[, Cn]]]]
[WHERE F']
[BEGIN WITH C3 THROUGH transitive-p [UNTIL max-iteration]]
[UNION] [MINUS] [INTERSECTION]
KRQL is designed according to the defined algebra for the purpose of easy usage by end users or agents. Thus the semantics of KRQL can be clearly understood when we translate the components of KRQL back to algebraic expressions. The semantics of the first three clauses is given by the algebraic expression π_{p1, p2, …, pn}(σ_F(c)), where F is a condition clause obtained from F' by replacing the keywords AND, OR, FORALL, EXIST with the symbols ∧, ∨, ∀, ∃, respectively. If there are two or more
collections in the FROM clause, then a join predicate should be specified in the WHERE clause; in that case the semantics is described with the Join algebraic operator. The BEGIN WITH clause is essentially a syntactic translation of the algebraic closure operator. Finally, the semantics of the last three clauses of KRQL is based on the operators Union, Minus and Intersection, respectively. Next, we present some queries in KRQL, instead of formal algebra, to simply illustrate its usage. We use [ ] to denote a collection of resources and { } a specific resource instance.
4.2 Examples
1) Query the author, publishing house and publication date of all publications.
SELECT authored-by, published-by, publish-date FROM [publication]
2) Query publications whose every author is a PhD student.
SELECT FROM [publication] WHERE FORALL authored-by=[PhD-student]
3) Query all prerequisite courses for Grid research.
SELECT BEGIN WITH {Grid} THROUGH based-on
4) Query the authors and publication date of publications at least one of whose authors is a PhD student and whose topic is Grid.
SELECT authored-by, publish-date FROM [publication] WHERE EXIST authored-by=[Phd-student] AND topic='Grid'
5) Select all the publications whose every author works in the Grid group.
SELECT FROM [publication] WHERE FORALL authored-by.work-in='Grid-group'
6) Get PhD students who have published a conference paper and a journal paper on Grid.
SELECT FROM [Phd-student] WHERE author-of.has-topic='Grid' AND author-of.published-at=[workshop]
INTERSECT
SELECT FROM [Phd-student] WHERE author-of.topic='Grid' AND author-of.published-at=[journal]
7) Select PhD students and Master students who have a common supervisor.
SELECT FROM [p Phd-student],[m Master-student] WHERE p.supervised-by=m.supervised-by
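To make the translation from KRQL clauses back to the algebra concrete, the following small sketch evaluates projection, selection with FORALL/EXIST over multi-valued properties, and queries corresponding to Examples 1) and 2) on resources represented as plain dictionaries. It is an illustrative evaluator written for this discussion, not the authors' implementation, and the sample data are invented.

```python
def select(collection, predicate):
    """Selection sigma_F(c): keep the resources satisfying the condition F."""
    return [r for r in collection if predicate(r)]

def project(collection, *props):
    """Projection pi_{p1,...,pn}(c): keep only the listed property values."""
    return [{p: r.get(p) for p in props} for r in collection]

def forall(resource, prop, allowed):
    values = resource.get(prop, [])
    return bool(values) and all(v in allowed for v in values)

def exists(resource, prop, allowed):
    return any(v in allowed for v in resource.get(prop, []))

publications = [
    {'title': 'Grid query algebra', 'authored-by': ['p1', 'p2'],
     'published-by': 'LNCS', 'publish-date': '2003-9-1', 'has-topic': 'Grid'},
    {'title': 'Web data model', 'authored-by': ['p3'],
     'published-by': 'JCST', 'publish-date': '2002-5-1', 'has-topic': 'Web'},
]
phd_students = {'p1', 'p2'}          # stand-in for the collection [Phd-student]

# Example 2): publications whose every author is a PhD student.
q2 = select(publications, lambda r: forall(r, 'authored-by', phd_students))
# Example 1): author, publishing house and publication date of all publications.
q1 = project(publications, 'authored-by', 'published-by', 'publish-date')
print(q2)
print(q1)
```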
5 Related Works
RAL is an algebra for querying RDF [8]. RAL models RDF as a finite set of triples composed of resources associated through properties, which form a directed labeled graph; the model integrates RDF and RDFS definitions. Compared to the RAL model, our model considers integrity constraints, which makes it a more semantic model. Moreover, the path expression and closure operation in our algebra are absent in
RAL. Finally, we present a clear model-theoretic semantics of our model, which is a fundamental basis for defining the various algebra operators clearly. Other work related to ours comes from the Semantic Web/Grid domain. RQL [7] is a recent Semantic Web query language. Compared with it, our work has three distinctive features. Firstly, our model considers background knowledge modeling with integrity constraint rules, which is useful to enrich resource modeling and to ensure that the domain resources are modeled meaningfully and effectively; in addition, it can be employed in semantic query optimization. Secondly, we defined an algebra to better support the definition of a high-level query language, KRQL. With the increase of resources, query optimization will become an unavoidable aspect; due to the lack of a query algebra, RQL does not have a solid foundation for query optimization, and its semantics can only be described using APIs. Thirdly, we extended RQL with joins, graph navigation and recursion.
6 Conclusions and Future Works
Research on resource management for the Web is still in its infancy. In this paper, we presented a formal data model and algebra to advance the state of the art of resource management in the Web environment. A declarative query language, KRQL, has been specified based on the algebra and illustrated with some examples. Besides, we also made some comparisons with related work on resource matchmaking in the Web and work from the Semantic Web. We are implementing the proposed model and query language with Semantic Web techniques; in the meantime, KRQL is also going to be improved. Moreover, we plan to extend the algebra to support inference and analysis on the schema and instances, based on Description Logic [9]. At last, algebraic equivalence laws and heuristics will be studied for query optimization.
Acknowledgements. The authors thank the anonymous referees for their helpful comments on the earlier version of the paper.
References
1. Codd, E.F.: A relational model of data for large shared data banks. Comm. of the ACM 13(6), June 1970, pp. 377–387
2. Semantic Grid, http://www.semanticgrid.org/
3. Wu, Z., et al.: Knowledge Base Grid: A Generic Grid Architecture for Semantic Web. J. Computer Sci. & Technol. (JCST) 18(4), July 2003, pp. 462–473
4. Zhuge, H.: A knowledge grid model and platform for global knowledge sharing. Expert Systems with Applications 22 (2002), pp. 313–320
5. Shi, Z., Dong, M., Zhang, H., Sheng, Q.: Agent-based Grid Computing. Keynote Speech, International Symposium on Distributed Computing and Applications to Business, Engineering and Science, Wuxi, Dec. 16–20, 2002
A Knowledge-Based Data Model and Query Algebra for the Next-Generation Web 6. 7.
8.
9. 10.
11. 12.
13.
499
XQuery 1.0: An XML Query Language,http://www.w3.org/TR/xquery/ G. Karvounarakis, A. Magkanaraki et al, Querying the Semantic Web with RQL, Computer Networks and ISDN Systems Journal, Vol. 42(5), August 2003, pp. 617-640. Elsevier Science Flavius Frasincar et al , RAL~ An Algebra for Querying RDF, The 3rd International Conference on Web Information Systems Engineering (WISE), pages 173-181, IEEE Computer Society, 2002 Franz Baader et al, The Description Logic Handbook – Theory, Implementation and Applications, ISBN: 0521781760, Cambridge University Press, January 2003 DOMENICO BENEVENTANO and SONIA BERGAMASCHI , CLAUDIO SARTORI, Description logics for semantic query optimization in object-oriented database systems, ACM Transactions on Database Systems, Vol. 28, No. 1, March 2003, Pages 1–50. Richard Hull, Michael Benedikt, Vassilis Christophides,Jianwen Su, EServices:A Look Behind the Curtain, PODS 2003, June 912, 2003, San Diego, CA. Nadine Cullot, Christine Parent, Stefano Spaccapietra, and Christelle Vangenot, Ontologies : A contribution to the DL/DB debate. The first International Workshop on Semantic Web and Databases, Berlin, Germany, September 7-8, 2003 Michael Halper, Li-min Liu, James Geller and Yehoshua Perl, Frameworks for incorporating semantic relationships into object-oriented database systems, Concurrency Computat.: Pract. Exper. 2003, to appear.
Conditional Evidence Theory and Its Application in Knowledge Discovery Yongchuan Tang1 , Shouqian Sun1,2 , and Yangguang Liu1 1
College of Computer Science, Zhejiang University, Hangzhou, Zhejiang Province, 310027, P. R. China [email protected], [email protected] 2 State Key Lab of CAD & CG, Zhejiang University, Hangzhou, Zhejiang Province, 310027, P. R. China [email protected]
Abstract. In this paper, we develop the conditional evidence theory and apply it to knowledge discovery in database. In this theory, we assume that a priori knowledge about generic situation and evidence about situation at hand can be modelled by two independent random sets. Dempster’s rule of combination is a popular method used in evidence theory, we think that this rule can be applied to knowledge revision, but isn’t appropriate for knowledge updating. Based on random set theory, we develop a new bayesian updating rule in evidence theory. More importantly, we show that bayesian updating rule can be performed incrementally by using M¨ obius transforms.
1
Introduction
Let K denote the generic knowledge and E the particular “evidential” information on some cases at hand. K is often represented as a rule-base, or as a probability distribution in expert systems [14]. In this paper the generic knowledge and evidential information will be represented by two independent random sets Γ , Σ on the finite universe U . The problem addressed here is what to do in order to account for the new information so as to produce new conclusions. When all the information is generic, the well-known Dempster’s rule of combination can be used as the rule for knowledge revision [2][11]. When the evidence is represented by a proposition or an event, an operation focusing can be used to answer questions of interest [4,5,6]. As to the general situation, where the evidence is represented by a random set, some work have been done to resolve this inferring question. R. Mahler presented a theory called conditional DempsterShafer theory (CDS) involving a priori knowledge [9,10]. Y. Tang and S. Sun considered more general CDS where a priori knowledge is modelled by a fuzzy measure [13]. But one deficiency of these work is that two conjugate updating rules can be defined, since two conjugate fuzzy measures can be induced from a priori knowledge. In this paper a novel inferring rule is presented, this rule is called as bayesian updating rule in evidence theory. More importantly, the J.X. Yu, X. Lin, H. Lu, and Y. Zhang (Eds.): APWeb 2004, LNCS 3007, pp. 500–505, 2004. c Springer-Verlag Berlin Heidelberg 2004
bayesian updating rule can be performed incrementally without a loss of information. Section 2 reviews random set. Section 3 presents two possible conditional belief measures. In Section 4 the bayesian updating rule is proposed to model the inferring mechanism in evidence theory. Section 5 illustrates an example. The last section is our conclusions.
2
Random Set
For a formal definition of a random set, let (Ω, σΩ , P rΩ ) be a probability space and let (U, σU ) be a measurable space where U = P(U ), P(U ) is a powerset of U and σU is a σ−algebra defined on U. Then a random set of U is a random variable Σ : Ω → U. If Σ and Γ are two random sets of U then random set intersection, union and complement are defined by (Σ ∩ Λ)(ω) Σ(ω) ∩ Λ(ω), (Σ ∪ Λ)(ω) Σ(ω) ∪ Λ(ω), (Σ c )(ω) Σ(ω)c for all ω ∈ Ω. The mass assignment of Σ is defined by mΣ (S) = P rΩ {ω : Σ(ω) = S} for all S ∈ P(U ). The belief and commonality measures associated with Σ are mΣ (E), δΣ (S) P rΩ (Σ ⊇ S) = mΣ (E). βΣ (S) P rΩ (Σ ⊆ S) = E⊆S
E⊇S
The plausibility measure P lΣ (S) = P rΩ (Σ ∩ S = ∅) = 1 − βΣ (S c ). Using M¨ obius transforms, The mass assignment can be achieved: (−1)#(S−T ) βΣ (T ), mΣ (S) = (−1)#(T −S) δΣ (T ), mΣ (S) = T ⊆S
T ⊇S
for all S ⊆ U , where, for all X ⊆ U , #X denotes the number of elements in X. Hence the mass assignment is also called as M¨obius assignment. Two random sets Σ, Λ are statistical independent if, for any S, T ⊆ U , mΣ,Λ (S, T ) P rΩ (Σ = S, Λ = T ) = mΣ (S)mΛ (T ).
3
Geometry Conditional Belief Measure and Dempster Conditional Plausibility Measure
We first extend belief measure βΓ (S) and plausibility measure P lΓ (S) for constant subset S of U to the random set Σ. Definition 1 Let Γ and Σ be two independent random sets on U . Then βΓ (Σ) P rΩ (Γ ⊆ Σ) = βΓ (S)mΣ (S) = δΣ (S)mΓ (S), S⊆U
P lΓ (Σ) P rΩ (Γ ∩ Σ = ∅) =
S⊆U
(1)
S⊆U
P lΓ (S)mΣ (S) =
S⊆U
P lΣ (S)mΓ (S).
(2)
When the random set Γ on U is degenerated to a random variable r, we have βr (Σ) = P lr (Σ) = qr (Σ), where qr (Σ) = mΣ (S)qr (S) = qr ({u})δΣ ({u}). S⊆U
u∈U
Based on the extended definition of belief and plausibility measure, two possible inferring rules in evidence theory are defined as follows. Definition 2 Let Γ and Σ be two independent random sets on U . Then βΓ (S ∗ Σ)
βΓ (S ∩ Σ) , P lΓ (S ∗ Σ) 1 − βΓ (S ∗ Σ), βΓ (Σ)
(3)
whenever βΓ (Σ) = 0. βΓ (· ∗ Σ) (P lΓ (· ∗ Σ)) is called as geometry conditional belief (plausibility) measure conditioned on random set Σ. Definition 3 Let Γ and Σ be two independent random sets on U . Then P lΓ (S ∗ Σ)
P lΓ (S ∩ Σ) , βΓ (S ∗ Σ) 1 − P lΓ (S ∗ Σ), P lΓ (Σ)
(4)
whenever Pl_Γ(Σ) ≠ 0. β_Γ(· ∗ Σ) (Pl_Γ(· ∗ Σ)) is called the Dempster conditional belief (plausibility) measure conditioned on the random set Σ.

Theorem 4. Let Γ and Σ be two independent random sets on U. Let β_Γ(S ∗ Σ) be defined by (3). Then the Möbius transform of β_Γ(· ∗ Σ) is given by

m_Γ(S ∗ Σ) = m_Γ(S) δ_Σ(S) / β_Γ(Σ),   (5)

for all S ⊆ U. When the random set Σ is a constant set E, formula (5) yields m_Γ(· ∗ E), which is called geometry conditioning in some references [1,7,8].

Theorem 5. Let Γ and Σ be two independent random sets on U. Let β_Γ(S ∗ Σ) be defined by (4). Then the Möbius transform of β_Γ(· ∗ Σ) is given by

m_Γ(S ∗ Σ) = Pl_Γ(Σ)^{−1} ∑_{A∩B=S} m_Γ(A) m_Σ(B) if S ≠ ∅,  and m_Γ(S ∗ Σ) = 0 otherwise,   (6)

for all S ⊆ U. This theorem shows that the result fused by Dempster's rule of combination is representable by the Dempster conditional belief measure. Furthermore, when the random set Σ is a constant set E, we obtain the so-called Dempster rule of conditioning m_Γ(· ∗ E) [1,7]. When the random set Γ on U degenerates to a random variable r, we have

q_r(S | Σ) = q_r(S ∩ Σ) / q_r(Σ) = q_r(S ∩ Σ) / (q_r(S ∩ Σ) + q_r(S^c ∩ Σ)).   (7)
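As a concrete illustration of Theorems 4 and 5, the following Python sketch computes the conditional mass assignments of formulas (5) and (6) for finite random sets represented as dictionaries from frozensets to masses. The helper names and the example values are our own assumptions, not the paper's notation or implementation.

def delta(m, S):
    # commonality of S under mass assignment m
    return sum(v for E, v in m.items() if E >= S)

def geometric_conditional_mass(m_gamma, m_sigma):
    # Formula (5): m_Gamma(S * Sigma) = m_Gamma(S) * delta_Sigma(S) / beta_Gamma(Sigma),
    # with beta_Gamma(Sigma) computed as in formula (1).
    beta = sum(m_gamma[S] * delta(m_sigma, S) for S in m_gamma)
    return {S: m_gamma[S] * delta(m_sigma, S) / beta for S in m_gamma}

def dempster_conditional_mass(m_gamma, m_sigma):
    # Formula (6): Dempster's rule of combination, normalized by Pl_Gamma(Sigma).
    pl = sum(a * b for A, a in m_gamma.items() for B, b in m_sigma.items() if A & B)
    out = {}
    for A, a in m_gamma.items():
        for B, b in m_sigma.items():
            S = A & B
            if S:
                out[S] = out.get(S, 0.0) + a * b / pl
    return out

# Small usage example with arbitrary masses.
m_gamma = {frozenset('a'): 0.4, frozenset('ab'): 0.6}
m_sigma = {frozenset('b'): 0.7, frozenset('ab'): 0.3}
print(geometric_conditional_mass(m_gamma, m_sigma))
print(dempster_conditional_mass(m_gamma, m_sigma))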
4 Bayesian Updating Rule in Evidence Theory
In order to define an appropriate inferring rule in evidence theory, we should further extend the previous conditional belief measures.
Definition 6. Let Γ, Σ and Σ′ be three independent random sets on U. Then

β_Γ(Σ′ ∗ Σ) = β_Γ(Σ′ ∩ Σ) / β_Γ(Σ),    Pl_Γ(Σ′ ∗ Σ) = 1 − β_Γ((Σ′)^c ∗ Σ),   (8)

whenever β_Γ(Σ) ≠ 0.

Definition 7. Let Γ, Σ and Σ′ be three independent random sets on U. Then

Pl_Γ(Σ′ ∗ Σ) = Pl_Γ(Σ′ ∩ Σ) / Pl_Γ(Σ),    β_Γ(Σ′ ∗ Σ) = 1 − Pl_Γ((Σ′)^c ∗ Σ),   (9)
whenever Pl_Γ(Σ) ≠ 0. A closer look reveals that the above inferring rules can be performed incrementally. A first observation is that both β_Γ(· ∗ Σ) and Pl_Γ(· ∗ Σ) can be computed incrementally, as the following theorem demonstrates.
Theorem 8. Let Γ, Σ and Σ′ be three independent random sets on U. Then

β_Γ(S ∗ Σ ∩ Σ′) = β_Γ(S ∩ Σ′ ∗ Σ) / β_Γ(Σ′ ∗ Σ),    Pl_Γ(S ∗ Σ ∩ Σ′) = Pl_Γ(S ∩ Σ′ ∗ Σ) / Pl_Γ(Σ′ ∗ Σ),   (10)

whenever β_Γ(Σ ∩ Σ′) ≠ 0 and Pl_Γ(Σ ∩ Σ′) ≠ 0. A second observation is that the conditional mass assignments m_Γ(· ∗ Σ) defined by (5) and (6) can be computed incrementally, as the following theorems demonstrate.
Theorem 9. Let Γ, Σ and Σ′ be three independent random sets on U. Let m_Γ(· ∗ Σ) be defined by (5). Then

m_Γ(S ∗ Σ ∩ Σ′) = m_Γ(S ∗ Σ) δ_{Σ′}(S) / β_Γ(Σ′ ∗ Σ).   (11)
Theorem 10. Let Γ, Σ and Σ′ be three independent random sets on U. Let m_Γ(· ∗ Σ) be defined by (6). Then

m_Γ(S ∗ Σ ∩ Σ′) = ∑_{A∩B=S} m_Γ(A ∗ Σ) m_{Σ′}(B) / Pl_Γ(Σ′ ∗ Σ).   (12)

A third observation is that both β_Γ(Σ) and Pl_Γ(Σ) can be computed incrementally. That is,
β_Γ(Σ ∩ Σ′) = β_Γ(Σ′ ∗ Σ) β_Γ(Σ),    Pl_Γ(Σ ∩ Σ′) = Pl_Γ(Σ′ ∗ Σ) Pl_Γ(Σ).   (13)

The Bayesian updating rule in evidence theory is defined as follows.
Definition 11. For all S ⊆ U, the Bayesian conditional belief measure is defined as

β_Γ(S ‖ Σ) = β_Γ(S ∩ Σ) / (β_Γ(S ∩ Σ) + Pl_Γ(S^c ∩ Σ)).   (14)

Based on the above observations, we can conclude that the Bayesian updating rule can be performed incrementally. For any S ⊆ U, β_Γ(S ‖ Σ ∩ Σ′) equals

β_Γ(S ∗ Σ ∩ Σ′) β_Γ(Σ′ ∗ Σ) β_Γ(Σ) / [β_Γ(S ∗ Σ ∩ Σ′) β_Γ(Σ′ ∗ Σ) β_Γ(Σ) + Pl_Γ(S^c ∗ Σ ∩ Σ′) Pl_Γ(Σ′ ∗ Σ) Pl_Γ(Σ)]
= ∑_{T⊆S} m^{2*}(T) p^{2*} p^{1*} / [∑_{T⊆S} m^{2*}(T) p^{2*} p^{1*} + ∑_{T∩S=∅} m̄^{2*}(T) p̄^{2*} p̄^{1*}].   (15)
5 Knowledge Discovery in Medical Diagnosis
Assume a group of medical experts is beginning to study some diseases in a fictitious region. The only diseases they expect to encounter form the universe U = {kotosis, phlegaria, pinpox}, abbreviated a, b, c, respectively. The only general information is the a priori mass assignment m_Γ: m_Γ({a}) = 0.2, m_Γ({b}) = 0.2, m_Γ({c}) = 0.2, m_Γ({b, c}) = 0.2, m_Γ({a, b}) = 0.1, m_Γ({a, b, c}) = 0.1. The knowledge about the disease at this moment is expressed by four items
m^0, m̄^0, p^{0*}, p̄^{0*}, where m^0 = m̄^0 = m_Γ and p^{0*} = p̄^{0*} = 1. Now assume that expert B examines the patient and delivers a diagnosis. Expert B's report is expressed by a random set Σ. Assume Σ : 0.85{b, c} + 0.15U. The knowledge about the disease can then be updated as follows:

p^{1*} = 0.66, p̄^{1*} = 0.83,
m^{1*}({a}) = 0.045, m^{1*}({b}) = 0.303, m^{1*}({c}) = 0.303, m^{1*}({b, c}) = 0.303, m^{1*}({a, b}) = 0.023, m^{1*}({a, b, c}) = 0.023,
m̄^{1*}({a}) = 0.036, m̄^{1*}({b}) = 0.343, m̄^{1*}({c}) = 0.242, m̄^{1*}({b, c}) = 0.343, m̄^{1*}({a, b}) = 0.018, m̄^{1*}({a, b, c}) = 0.018.

If another expert C examines the patient and delivers a report modelled by a random set Σ′, that is, Σ′ : 0.6{b} + 0.4U, then the knowledge about the disease is updated as follows:

p^{2*} = 0.582, p̄^{2*} = 0.833,
m^{2*}({a}) = 0.03, m^{2*}({b}) = 0.52, m^{2*}({c}) = 0.209, m^{2*}({b, c}) = 0.209, m^{2*}({a, b}) = 0.016, m^{2*}({a, b, c}) = 0.016,
m̄^{2*}({a}) = 0.017, m̄^{2*}({b}) = 0.685, m̄^{2*}({c}) = 0.116, m̄^{2*}({b, c}) = 0.165, m̄^{2*}({a, b}) = 0.009, m̄^{2*}({a, b, c}) = 0.009.
Based on this information, we can now compute β_Γ(· ‖ Σ ∩ Σ′):

β_Γ({a} ‖ Σ ∩ Σ′) = 0.263,  β_Γ({b} ‖ Σ ∩ Σ′) = 0.478,  β_Γ({c} ‖ Σ ∩ Σ′) = 0.116.

On the basis of the evidence provided by experts B and C, we conclude that the patient has phlegaria.
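The first updating step of this example can be checked numerically. The short Python sketch below (with variable names of our own choosing) recomputes β_Γ(Σ) = 0.66, Pl_Γ(Σ) = 0.83 and the geometry-conditioned masses m^{1*} reported above; it is an illustrative verification, not the authors' code.

U = frozenset('abc')
m_gamma = {frozenset('a'): 0.2, frozenset('b'): 0.2, frozenset('c'): 0.2,
           frozenset('bc'): 0.2, frozenset('ab'): 0.1, frozenset('abc'): 0.1}
m_sigma = {frozenset('bc'): 0.85, U: 0.15}    # expert B's report Sigma

def delta(m, S):
    # commonality: mass of focal sets containing S
    return sum(v for E, v in m.items() if E >= S)

def pl(m, S):
    # plausibility: 1 minus the belief of the complement of S
    return 1.0 - sum(v for E, v in m.items() if E <= U - S)

beta_sigma = sum(m_gamma[S] * delta(m_sigma, S) for S in m_gamma)   # beta_Gamma(Sigma)
pl_sigma = sum(m_gamma[S] * pl(m_sigma, S) for S in m_gamma)        # Pl_Gamma(Sigma)
m_geo = {S: round(m_gamma[S] * delta(m_sigma, S) / beta_sigma, 3) for S in m_gamma}

print(round(beta_sigma, 2), round(pl_sigma, 2))   # 0.66 0.83
print(m_geo[frozenset('a')], m_geo[frozenset('b')], m_geo[frozenset('ab')])   # 0.045 0.303 0.023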
6 Conclusions
The issue addressed in this paper is how to account for new information about the case at hand so as to produce new plausible conclusions. The usual method in evidence theory is the well-known Dempster's rule of combination. In this paper we present a new inferring rule called the Bayesian updating rule. Based on the Möbius transforms of the geometry conditional belief measure and the Dempster conditional plausibility measure, we show that the Bayesian updating rule can be performed incrementally.

Acknowledgements. This work has been supported by the Hubei province young elitist project (Grant No. 2002AC001) and the Chinese 973 project (Grant No. 2002CB312106).
References
1. L. Chrisman, "Incremental Conditioning of Lower and Upper Probabilities", International Journal of Approximate Reasoning, Vol. 13, pp. 1–25, 1995.
2. A. P. Dempster, "Upper and lower probabilities induced by a multi-valued mapping," Ann. Mathematical Statistics, Vol. 38, pp. 325–339, 1967.
3. T. Denoeux, "A k-nearest neighbour classification rule based on Dempster-Shafer theory," IEEE Transactions on Systems, Man and Cybernetics, Vol. 25, No. 5, pp. 804–813, 1995.
4. D. Dubois and H. Prade, "Bayesian conditioning in possibility theory", Fuzzy Sets and Systems, Vol. 92, pp. 223–240, 1997.
5. D. Dubois and H. Prade, "Focusing vs. revision in possibility theory", Proceedings of the 4th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE'96), pp. 1700–1705, New Orleans, LA, 9-12 September, 1996.
6. D. Dubois, H. Prade and P. Smets, "Representing partial ignorance", IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, Vol. 26, No. 3, pp. 361–377, 1996.
7. J. Y. Jaffray, "Bayesian updating and belief functions," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 22, No. 5, pp. 1144–1152, 1992.
8. R. Körner and W. Näther, "On the specificity of evidences", Fuzzy Sets and Systems, Vol. 71, pp. 183–196, 1995.
9. R. Mahler, "Combining ambiguous evidence with respect to ambiguous a priori knowledge, I: Boolean logic," IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, Vol. 26, No. 1, pp. 27–41, 1996.
10. R. Mahler, "Combining ambiguous evidence with respect to ambiguous a priori knowledge, II: Fuzzy logic," Fuzzy Sets and Systems, Vol. 75, pp. 319–354, 1995.
11. G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, 1976.
12. S. McClean, B. Scotney and M. Shapcott, "Using background knowledge in the aggregation of imprecise evidence in databases," Data and Knowledge Engineering Journal, Vol. 32, pp. 131–143, 2000.
13. Y. C. Tang and S. Q. Sun, "The Conditional Dempster-Shafer Theory: a New Frame for Knowledge Updating" (to be submitted).
14. P. Walley, "Measures of uncertainty in expert systems", Artificial Intelligence, Vol. 83, pp. 1–58, 1996.
An Efficient Method for Quantitative Association Rules to Raise Reliance of Data*

Hye-Jung Lee¹, Won-Hwan Park², and Doo-Soon Park¹

¹ Division of Computer Science and Computer Engineering, SoonChunHyang University, Sinchang-Myun, Asan-Si, Choongchungnam-Do, South Korea, {mean1218, parkds}@sch.ac.kr
² Korea National Statistical Office, Building 3 Government Complex, Seogoo 920, Teajeon-Si, South Korea, [email protected]
Abstract. This paper suggests an efficient method for converting quantitative items into binary items that raises the reliability of the resulting Large Interval Itemsets. The suggested method does not leave meaningful items behind, can create a larger number of fine-grained Large Interval Itemsets, and minimizes the loss of the characteristics of the original data, because it generates merged intervals whose support is close to the Minimum Support specified by the user and generates Large Interval Itemsets under consideration of the characteristics of data occurrence. It therefore raises the reliability of the data, which is useful when association rules are created later.
This paper suggests a method for generating Large Interval Itemsets that takes the characteristics of data occurrence into account, in order to apply quantitative items such as census data to association rule mining. This method can minimize the loss of the characteristics of the original data, enhance the confidence of the data, and, unlike existing methods, does not lose meaningful items, since it generates Large Interval Itemsets focused on the most frequently occurring areas.
2 Discovery of Association Rules

Most algorithms for discovering association rules, such as Apriori[3], AprioriTID[3,10], AprioriHybrid[3], DHP[4], Partition[5], DIC[10], Direct Sampling[5] and Sampling Approach[6], place importance on solving these problems. We may discover association rules with current discovery algorithms by converting quantitative items into binary items, using a method for discovering association rules over quantitative data, and there have been studies of this methodology [8],[9]. Srikant[8] partitions the defined domain of a quantitative item into small blocks of a certain range and generates Large Interval Itemsets that satisfy the Minimum Support by merging adjacent small blocks. This is effective when the data are distributed evenly over the defined domain of the quantitative item, but it is not effective if the data are concentrated in a certain area. As a solution to this problem, the dynamic partition method[9], which partitions according to the distribution map, was introduced. For partitioning and merging, however, the Dynamic Partition Method uses both a Minimum Support and a Minimum Partition Support.
3 Method for Generation of Large Interval Itemsets to Increase Reliability

The method suggested in this paper generates Large Interval Itemsets by converting the defined domain of quantitative items into binary items, based on the unit interval of the mode. The unit interval of the mode is the area where the density of data occurrence is highest; it is normally used to obtain outstanding results. The database (D), quantitative items (Lq) and frequencies (f(Lq)) required to generate Large Itemsets are defined as follows:

− D : a set of transactions that includes the quantitative items; it also includes Lq and f(Lq)
− Lq : {lq1, lq2, ..., lqn-1, lqn}, where lqi (1 ≤ i ≤ n) is a discrete item of a unit interval
− f(Lq) : {f(lq1), f(lq2), ..., f(lqn-1), f(lqn)}, where f(lqi) (1 ≤ i ≤ n) is the frequency within a unit interval
− Max_lq : the unit interval of the mode
− FL : {fl1, fl2, ..., flm}
− lq_ti (1 ≤ i ≤ n) : a tag that represents whether lqi can still be used
To generate a Large Interval Itemset we must first select the unit interval of the mode (lqi) and then merge the adjacent (left and right) unit intervals (lqi-1, lqi+1) of the selected lqi until the Minimum Support is satisfied. At this time, among the adjacent intervals we merge the nearest one whose item value (frequency or Support) is equal or higher, until the Minimum Support is satisfied. If we cannot proceed with further merging because of the upper or lower bound¹, we take only the value on one side and continue to merge. If we cannot proceed any further because of both the upper and the lower bound, we calculate the number of merged areas and the Support, and judge whether the merged area is meaningful. If it is meaningful, we divide the merged area into two areas based on the Support and include them in the already generated Large Interval Itemsets located on either side. If it is meaningless, we set the merged area as a non-frequent area and continue the same process after selecting the next mode (area). We stop merging adjacent unit intervals as soon as the Minimum Support is satisfied; we then include the result in the Large Interval Itemsets and, at the same time, define it as the domain of a Large Interval Item. We then select the unit interval of the mode among the remaining unit intervals and repeat the same process to generate Large Interval Items, stopping when there is no unit interval left to be selected. Figures 1(a) and 1(b) show the whole process expressed in code: (a) selects the unit interval of the mode and delivers its value to the Gen_FL function, while (b) shows the procedure that merges adjacent unit intervals from the transferred unit interval until the Minimum Support (Smin) is satisfied, where four cases are considered according to adjacency to the upper and lower bound.

3.1 Characteristics of the Suggested Method Using the Mode

The process of generating Large Interval Itemsets with real census data is shown in (a), (b) and (c) of Figures 2 and 3. The data applied in Figures 2 and 3 are the population distribution by age and the household distribution by floor space, based on the survey of basic statistics of enterprises in South Korea. For this example a Minimum Support of 10% is applied. Figures 2(a) and 3(a) show the process by which Large Interval Itemsets are created by the suggested method, while (b) and (c) show the results of the existing methods, the Fixed-Size Interval Method (b) and the Dynamic Partition Method (c). Because the suggested method creates Large Interval Items sequentially via the unit interval of the mode, which is the item with the highest frequency, as shown in Figures 2(a) and 3(a), the range of the Large Interval Items tends to get wider as the process proceeds from the 1st to the kth item. And, as it merges the minimum number of intervals that satisfies the assigned Minimum Support, this method, compared with other methods, can generate more Large Interval Itemsets based on the same
¹ The upper and lower bounds are the bounds of lq1 and lqn; they are also the bounds of unit intervals of the domain that have already been used.
An Efficient Method for Quantitative Association Rules to Raise Reliance of Data // User specifies Minimum Support (Smin) // Create f(lq) through search DB FL = Ψ for (k=1 ; Lq≠Ψ; k++) do begin Max_lq = MAX(f(lqi)), (not tagged, 1≤i≤n) flk merge lqi ; CALL Gen_FL FL = ∪ flk// Answer Max_lq = 0 if flk < Smin // flk is not large then Calculate whether flk is meaningful if flk is meaningful then divide intervals to two section FL = ∪ FLp, FLq (1≤ p,q ≤ k-1) // which are Large Interval Itemsets located in both sides else lqi = tagged (1≤i≤n) // not large Interval endif endif end
(a)
Function Gen_FL
for (j = 1; Max_lq < Smin; j++) do begin
  case 1 : lq_ti-j and lq_ti+j are not tagged
    if (f(lqi-j) + f(lqi+j)) ≤ (Smin − Max_lq) then
      Max_lq = Max_lq + f(lqi-j) + f(lqi+j);
      flk merge lqi-j, lqi+j; lq_ti-j, lq_ti+j = tag
    else
      if (f(lqi-j) ≥ f(lqi+j) and (Smin − Max_lq) ≤ f(lqi-j)) then
        Max_lq = Max_lq + f(lqi-j); flk merge lqi-j; lq_ti-j = tag
      else
        Max_lq = Max_lq + f(lqi+j); flk merge lqi+j; lq_ti+j = tag
      endif
    endif
  case 2 : lq_ti-j is not tagged, lq_ti+j is tagged
    Max_lq = Max_lq + f(lqi-j); flk merge lqi-j; lq_ti-j = tag
  case 3 : lq_ti-j is tagged, lq_ti+j is not tagged
    Max_lq = Max_lq + f(lqi+j); flk merge lqi+j; lq_ti+j = tag
  case 4 : lq_ti-j and lq_ti+j are tagged
    return   // flk is not large
end
Return
(b)
Fig. 1. (a) The suggested algorithm for Large Interval Item creation; (b) the merge function that merges unit intervals.
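A simplified Python sketch of the mode-based merging idea is given below. It treats each unit interval as a frequency count and omits the meaningful-area test and the detailed bound-handling cases of Gen_FL; all names and the sample counts are our own assumptions, not the authors' implementation.

def large_interval_itemsets(freq, min_support):
    """freq[i] is the count in unit interval i; min_support is an absolute count.
    Repeatedly pick the untagged interval with the highest count (the mode) and
    merge adjacent untagged intervals until the merged count reaches min_support."""
    n = len(freq)
    tagged = [False] * n
    itemsets = []
    while not all(tagged):
        i = max((k for k in range(n) if not tagged[k]), key=lambda k: freq[k])
        members, total = [i], freq[i]
        tagged[i] = True
        left, right = i - 1, i + 1
        while total < min_support:
            cand = [k for k in (left, right) if 0 <= k < n and not tagged[k]]
            if not cand:                                 # blocked on both sides
                break
            k = max(cand, key=lambda k: freq[k])         # prefer the larger neighbour
            members.append(k)
            total += freq[k]
            tagged[k] = True
            if k == left:
                left -= 1
            else:
                right += 1
        if total >= min_support:
            itemsets.append((min(members), max(members), total))
    return itemsets

# Example: counts per unit interval and a 10% absolute support threshold.
counts = [2, 5, 30, 22, 8, 3, 1, 9, 14, 6]
print(large_interval_itemsets(counts, min_support=0.10 * sum(counts)))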
Fig. 2. The process generating Large Interval Items with population distribution by age.
Fig. 3. The process generating Large Interval Items with household distribution by a floor space.
section. Besides, it shows intervals of different lengths according to the importance of the data itself, and it does not lose meaningful intervals. In contrast, (b), the Fixed-Size Interval Method, misses intervals and irregularly generates Large Interval Itemsets with much higher Support, because it merges intervals without considering the characteristics of the data. (c), the Dynamic Partition Method, which considers the distribution of the data to some extent, may still miss meaningful intervals, and the quantity of missed intervals depends on the Minimum Partition Support. Therefore, the suggested method can minimize the loss of the characteristics of the original data and can create more fine-grained Large Interval Itemsets in sections where items of high Support are concentrated. The existing methods, by contrast, merge a fixed number of intervals, so the merged intervals are wide and their Support is much higher than the assigned one.
4 Performance Test

For the performance test of the method suggested in this paper, we compare the number of Large Interval Items and the average length of the intervals merged into Large Interval Itemsets by three methods: the Fixed-Size Partition Method (M1), the Dynamic Partition Method (M2) and the Suggested Method (M3). We used the following three data sets for the performance test:
1. Population data classified by age from the survey of the population and housing census of Tae-Jeon, South Korea: 1,214,327 records
2. Enterprise data in Tae-Jeon classified by the number of workers from the survey of the enterprise census: 88,869 records
3. Random data without locality: 37,000 records
Fig. 4. The number of created Large Interval Items
Fig. 5. Average distance of merged interval
Also, nine Minimum Supports (40%, 35%, 30%, 25%, 20%, 15%, 10%, 5%, 3%) are used; the constant partition interval used in M1 is 2 and the Minimum Partition Support in M2 is 1/2 of the Minimum Support. Figure 4 shows the result of the performance test of each method. M1, M2 and M3 have generated the same number of large
interval items at Minimum Supports above 20%, but below 20% M3 has produced more Large Interval Items than the others. The average distance of the merged intervals in the Large Interval Items made by each method is shown in Figure 5. Each graph shows that there is a big difference in interval length between M1, M2 and M3, depending on the characteristics of the distribution of each data set. M3, all in all, generates Large Interval Itemsets with a narrower average interval distance, while M2 and M3, which do consider the characteristics of the data distribution, produce different results depending on the Minimum Support.
5 Conclusion

This paper introduced a more effective method that converts the defined domain of quantitative items into Large Interval Items in the form of binary items, in order to discover association rules from large databases that include quantitative items. The suggested method (M3) is shown to be superior by the performance test with real data, such as the survey of the population and housing census, because it generates more fine-grained Large Interval Itemsets than the other methods. The suggested method considers the characteristics of the data to be mined, that is, the locality of data occurrence, and uses the mode to generate Large Interval Items. The results obtained by using the mode not only include a larger number of fine-grained Large Interval Items but also show that the fineness of the Large Interval Items decreases as the generation sequence proceeds, as shown in Figures 2 and 3. This means that the degree of loss of the characteristics of the real data varies with the generation sequence; that is, the Large Interval Items generated in the beginning stages lose fewer characteristics than those generated later. This property means that the user can exploit the generation sequence of Large Interval Items according to the quality of the rules required when discovering association rules later. Besides, by generating Large Interval Itemsets that satisfy the given Minimum Support, the method can raise the reliability (confidence) of the data and discover better association rules when association rules are generated later.
References
1. R. Agrawal, T. Imielinski, and A. Swami: Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data (1993) 207-216
2. R. Agrawal and R. Srikant: Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference, Santiago, Chile, Sept. (1994)
3. J.S. Park, M.S. Chen, and P.S. Yu: An effective hash-based algorithm for mining association rules. In Proceedings of the ACM SIGMOD Conference on Management of Data, May (1995) 175-186
4. A. Savasere, E. Omiecinski and S. Navathe: An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st VLDB Conference (1995) 432-444
5. J.S. Park, P.S. Yu and M.S. Chen: Mining association rules with adjustable accuracy. In Proceedings of ACM CIKM 97, November (1997) 151-160
6. R. Srikant and R. Agrawal: Mining quantitative association rules in large relational tables. In Proceedings of the ACM SIGMOD Conference on Management of Data (1996)
7. Young-Hee Choi, Su-Min Jang, Jae-Chul Oh: Generating large items efficiently for mining quantitative association rules. Vol. 6, KIPS (1999) 2597-2607
8. Rajeev Rastogi and Kyuseok Shim: Mining optimized association rules with categorical and numeric attributes. IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 1, January/February (2002) 29-50
9. S. Brin, R. Motwani, J.D. Ullman and S. Tsur: Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD Conference on Management of Data (1997) 255-264
Web Usage Mining: Sequential Pattern Extraction with a Very Low Support F. Masseglia, D. Tanasa, and B. Trousse INRIA Sophia Antipolis 2004 route des lucioles - BP 93 06902 Sophia Antipolis, France {Florent.Masseglia, Doru.Tanasa, Brigitte.Trousse}@sophia.inria.fr
Abstract. The goal of this work is to increase the relevance and the interestingness of patterns discovered by a Web Usage Mining process. Indeed, the sequential patterns extracted on web log files, unless they are found under constraints, often lack interest because of their obvious content. Our goal is to discover minority users’ behaviors having a coherence which we want to be aware of (like hacking activities on the Web site or a users’ activity limited to a specific part of the Web site). By means of a clustering method on the extracted sequential patterns, we propose a recursive division of the problem. The developed clustering method is based on patterns summaries and neural networks. Our experiments show that we obtain the targeted patterns whereas their extraction by means of a classical process is impossible because of a very weak support (down to 0.006%). The diversity of users’ behaviors is so large that the minority ones are both numerous and difficult to locate. Keywords: Web usage mining, sequential patterns, clustering, patterns summary, neural networks.
1 Introduction
Analyzing the behavior of a Web site's users, also known as Web Usage Mining, is a research field that consists in adapting data mining methods to the records of access log files. These files collect data such as the IP address of the connected host, the requested URL, the date and other information regarding the navigation of the user. Web Usage Mining techniques provide knowledge about the behavior of users in order to extract relationships from the recorded data. Among the available techniques, sequential patterns are particularly well adapted to log study. Extracting sequential patterns from a log file is supposed to provide this kind of relationship: "On Inria's Web site, 10% of users visited consecutively the homepage, the available positions page, the ET¹ offers, the ET missions and finally the past ET competitive selection". This kind of behavior is just a supposition, because extracting sequential patterns from a log file also implies managing several problems:
¹ ET: Engineers, Technicians
– the cache (on the user's computer) and the proxies (which can be cache servers) can lower the number of records in the access log;
– the great diversity of pages on the site;
– the search engines, which allow the user to directly access a specific part of the Web site (thus reducing the number of entries in the log file, and the number of navigations shared by the users);
– the representativeness of the visited part compared to the entire site (a research team can represent less than 0.7% of the site);
– the representativeness of the users who navigate through that part of the site, compared to the users of the whole site.

If caching problems can be solved [3], the representativeness requires a careful study. In order to illustrate our goal, let us consider the sequential patterns we are supposed to obtain. Due to the small size of the "job offer" part of the site, users requesting a page in that part represent only 0.5% of the users of the entire site. In the same way, users navigating in the "teaching" part of the Lambda research project represent only 0.01% of all users. Therefore, a Web Usage Mining study on such a Web site has to manage this particular representativeness in order to provide satisfying results. Our goal is to show that a classical² sequential pattern mining process is not able to provide behaviors with such a weak support. Furthermore we present a method for discovering the behavior of all the users of a Web site, including the minority behaviors. We describe our experiments and then conclude the paper.
2 Definitions
In this section we define the sequential pattern mining problem in large databases and give an illustration. Then we explain the goals and techniques of Web Usage Mining with sequential patterns.

2.1 Sequential Pattern Mining
In [1], the association rule mining problem is defined as follows:

Definition 1. Let I = {i1, i2, ..., im} be a set of m literals (items). Let D = {t1, t2, ..., tn} be a set of n transactions; associated with each transaction is a unique identifier called its TID and an itemset I. I is a k-itemset, where k is the number of items in I. We say that a transaction T contains X, a set of some items in I, if X ⊆ T. The support of an itemset I is the fraction of transactions in D containing I: supp(I) = |{t ∈ D | I ⊆ t}| / |{t ∈ D}|. An association rule is an implication of the form I1 ⇒ I2, where I1, I2 ⊂ I and I1 ∩ I2 = ∅. The rule I1 ⇒ I2 holds in the transaction set D with confidence c if c% of the transactions in D that contain I1 also contain I2. The rule r : I1 ⇒ I2 has support s in the transaction set D if s% of the transactions in D contain I1 ∪ I2
² Our research field excludes methods with constraints and sampling methods, for reasons that will be given in this paper.
(i.e. supp(r) = supp(I1 ∪ I2)). Given two parameters specified by the user, minsupp and minconfidence, the problem of association rule mining in a database D aims at providing the set of frequent itemsets in D, i.e. all the itemsets having support greater than or equal to minsupp. Association rules with confidence greater than minconfidence are thus generated. As this definition does not take time into consideration, sequential patterns are defined in [10]:

Definition 2. A sequence is an ordered list of itemsets denoted by < s1 s2 ... sn > where sj is an itemset. The data-sequence of a customer c is the sequence in D corresponding to customer c. A sequence < a1 a2 ... an > is a subsequence of another sequence < b1 b2 ... bm > if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.

Example 1. Let C be a client and S = < (3) (4 5) (8) > be that client's purchases. S means that "C bought item 3, then he or she bought 4 and 5 at the same moment (i.e. in the same transaction) and finally bought item 8".

Definition 3. The support for a sequence s, also called supp(s), is defined as the fraction of total data-sequences that contain s. If supp(s) ≥ minsupp, with a minimum support value minsupp given by the user, s is considered a frequent sequential pattern.

2.2 Access Log Files Analysis with Sequential Patterns
The general idea is similar to the principle proposed in [4,9,11]. It relies on three main steps. First of all, starting from a rough data file, a pre-processing step is necessary to clean "useless" information. The second step starts from this pre-processed data and applies data mining algorithms to find frequent itemsets or frequent sequential patterns. Finally, the third step aims at helping the user to analyze the results by providing a visualization and request tool. Raw data is collected in access log files by Web servers. Each input in the log file illustrates a request from a client machine to the server (http daemon). Access log file formats can differ, depending on the system hosting the Web site. For the rest of this presentation we will focus on three fields: client address, the URL requested by the user, and the time and date of that request. We illustrate these concepts with the access log file format given by the CERN and the NCSA [12], where a log input contains records made of 7 fields, separated by spaces:

host user authuser [date:time] "request" status bytes

The access log file is then processed in two steps. First of all, the access log file is sorted by address and by transaction. Afterwards each "uninteresting" data item is pruned out of the file. During the sorting process, in order to allow the knowledge discovery process to be more efficient, URLs and clients are mapped into integers. Each time and date is also translated into relative time, compared to the earliest time in the log file.
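A possible pre-processing sketch in Python is shown below: it parses common-log-format lines, maps hosts and URLs to integers, and converts timestamps to a relative time, in the spirit of the step just described. The regular expression, the status filter and the field handling are simplifying assumptions of ours, not the authors' code.

import re
from datetime import datetime

LOG_RE = re.compile(r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

def preprocess(lines):
    """Return (client_id, relative_time, url_id) triples sorted by client and time."""
    clients, urls, records = {}, {}, []
    t0 = None
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip malformed entries
        host, _, _, ts, _, url, status, _ = m.groups()
        if status != '200':
            continue  # keep successful requests only (a simplifying choice)
        t = datetime.strptime(ts.split()[0], '%d/%b/%Y:%H:%M:%S')
        t0 = t if t0 is None or t < t0 else t0
        cid = clients.setdefault(host, len(clients) + 1)
        uid = urls.setdefault(url, len(urls) + 1)
        records.append((cid, t, uid))
    return sorted((c, int((t - t0).total_seconds()), u) for c, t, u in records)

sample = ['1.2.3.4 - - [14/Apr/2004:10:00:00 +0200] "GET /index.html HTTP/1.0" 200 1043',
          '1.2.3.4 - - [14/Apr/2004:10:00:05 +0200] "GET /jobs.html HTTP/1.0" 200 512']
print(preprocess(sample))   # [(1, 0, 1), (1, 5, 2)]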
Definition 4. Let Log be a set of server access log entries. An entry g, g ∈ Log, is a tuple g = < ip_g, ([l1^g.URL, l1^g.time] ... [lm^g.URL, lm^g.time]) > such that for 1 ≤ k ≤ m, lk^g.URL is the item asked for by the user g at time lk^g.time and, for all 1 ≤ j < k, lk^g.time > lj^g.time.

The structure of a log file, as described in Definition 4, is close to the "Client-Time-Item" structure used by sequential pattern algorithms. In order to extract frequent behaviors from a log file, for each g in the log file, we first have to transform ip_g into a client number and, for each record k in g, lk^g.time is transformed into a time number and lk^g.URL is transformed into an item number. Table 1 gives a file example obtained after that pre-processing. To each client corresponds a series of times and the URL requested by the client at each time. For instance, client 2 requested the URL "60" at time d4. The goal is thus, according to Definition 3 and by means of a data mining step, to find the sequential patterns in the file that can be considered frequent. The result may, for instance, be < (10) (30) (20) (30) > (with the file illustrated in Table 1 and a minimum support given by the user of 100%). Such a result, once mapped back into URLs, strengthens the discovery of a frequent behavior, common to n users (with n the threshold given for the data mining process) and also gives the sequence of events composing that behavior.

Table 1. File obtained after a pre-processing step

Client   d1   d2   d3   d4   d5
1        10   30   40   20   30
2        10   30   20   60   30
3        10   70   30   20   30

3 Divide and Discover: Motivations and Principle

3.1 Motivations
Let us consider Inria's web sites. The main site is www.inria.fr, the site of Sophia Antipolis is www-sop.inria.fr, and so on. These sites can be represented as shown in Figure 1, and the page contents can be about jobs, research, teaching, etc. From such a log analysis, we can draw some lessons:

– Usually, the sequential patterns coming from such a log can be disappointing. In fact their relevance is weak and they can be obvious and not so useful (e.g. "0.1% of users arrive at the homepage and then go to the contents page").
– The interesting behaviors are contained in a specific part of the log. For example, in Figure 1, the part corresponding to the teaching activities of J. Smith (STID) will be requested by 0.01% of the users recorded in the log. The users interested in job opportunities will represent 0.5% of all requests on the site.
Fig. 1. Parts of the Inria’s Web sites
– In order to get interesting patterns from that log, we thus have to specify a really low support.

Let us study the question of a low support. In our research framework, we exclude methods with constraints and sampling methods. Without denying the efficiency of such methods, we argue that techniques based on constraints do not allow finding all the patterns (which are still to be discovered, thus unknown to the user and his constraints). The second technique is sampling. We consider that the representativeness we are working with is so weak that the size of the sample would be almost the same as that of the log. Let us now imagine that we specify a very low support. Two problems will then appear:

– The response time will be too long (in most cases, the result won't even be obtained due to the complexity of the process).
– The amount of frequent patterns generated by this process (in case the process ends) would be very large.

Nevertheless, the behaviors we want to discover have a really low support. These behaviors correspond to minorities, but we aim at discovering this kind of pattern since we consider that they are highly relevant. For instance, among these behaviors, we can notice hacking activities or the navigation (by students) of some teaching pages. Our goal is thus to provide patterns revealing behaviors such as:

– 0.08% of users have a navigation similar to hacking activities. Among them 90% respected a typical hacking navigation.
– 0.007% of users have a navigation related to the teaching pages of J. Smith. Among them, 15% requested consecutively the 6 pages of his course on data mining.

The very weak support of these patterns is mainly due to the great diversity of the behaviors in the analyzed logs and to the large number of URLs contained in the site. In order to solve the problems described above, we developed the "Divide and Discover" method.
3.2 Principle
The outline of our method is the following: discovering clusters of users (grouped by behavior) and then analyzing their navigations by means of a sequential pattern mining process. Our method thus relies on two steps. The first step aims at dividing the log into sub-logs, supposed to represent separate activities. The second step aims at analyzing the behavior of the users recorded in each sub-log. The principle of our method is thus the following:

1. Extract sequential patterns on the original log.
2. Cluster these sequential patterns.
3. Divide the log according to the clusters obtained above. Each sub-log contains sessions from the original log corresponding to at least one behavior of the cluster which enabled the creation of this sub-log. A special sub-log is then created to collect the sessions of the original log which do not correspond to any cluster from the previous step.
4. For each sub-log, apply this whole process (recursively).
Fig. 2. Divide & Discover principle
Figure 2 illustrates this method. First sequential patterns are obtained and clustered (C1 to Cn ) . Then the log is divided (SL1 to SLn ) upon these clusters. Finally a special sub-log (SLn+1 ) is created for the sessions which can not be matched with a behavior from the original log. The quality of the results produced by our approach will rely on this sub-log. In fact, the first sub-logs contain the most represented categories of users. They are thus interesting, but the most interesting patterns will come from the study of the unclustered sessions of the sub-log SLn+1 . Considering this sub-log as a new original log, and repeating the process (as described in Figure 2) will allow us to discover behavior with a low representativeness. In order to provide reliable results, our method depends on a specific factor: the quality of the division proposed for a log. This division
relies on the clustering performed on the discovered sequential patterns in the original log. We describe in the next section the method we employed to cluster the sequential patterns.
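The recursion can be summarized by the following Python skeleton, where mine_patterns, cluster_patterns and matches are placeholders for the sequential pattern miner (PSP in the authors' experiments), the neural clustering step of Section 4 and the session/cluster matching test; none of these names or the termination guard come from the paper, so this is only an illustrative sketch.

def divide_and_discover(sessions, minsupp, mine_patterns, cluster_patterns, matches):
    """Recursively split a log into sub-logs driven by clusters of extracted patterns."""
    if not sessions:
        return []
    patterns = mine_patterns(sessions, minsupp)
    clusters = cluster_patterns(patterns)
    if not clusters:
        return [(sessions, patterns)]          # nothing left to split on
    sub_logs = [[] for _ in clusters]
    unclustered = []
    for s in sessions:
        hits = [i for i, c in enumerate(clusters) if matches(s, c)]
        if hits:
            for i in hits:
                sub_logs[i].append(s)          # a session may match several behaviors
        else:
            unclustered.append(s)
    results = []
    for sub in sub_logs:
        results.extend(divide_and_discover(sub, minsupp, mine_patterns,
                                           cluster_patterns, matches))
    # the unclustered sessions are where the minority behaviors hide
    if len(unclustered) < len(sessions):       # guard against non-termination
        results.extend(divide_and_discover(unclustered, minsupp, mine_patterns,
                                           cluster_patterns, matches))
    else:
        results.append((unclustered, patterns))
    return results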
4 Clustering Based on Pattern Generalisation
We studied several clustering methods for sequential patterns. We describe here the most efficient method that we used for sequential pattern clustering. The clustering method used in this research is based on a method developed in 2000 by [2] for indexing web sequences in the context of Web-based recommender systems. The efficiency of such a method comes from the neural approach and its effectiveness relies on the use of summarized descriptions of sequential patterns: such descriptions are based on a generalization of Web access sequences (cf. Section 4.2).

4.1 Neural Method
We propose a neural clustering method based on [2] (integrated in the object-oriented framework called CBR*Tools [6] for supporting the reuse of past experiences). It was successfully applied in the context of browsing advisors in a Web thematic repertory, for enterprises such as France Telecom. This method relies on a hybrid model of connectionist memory inspired by [7] and composed of a connectionist part [5] and a flat memory made of groups of patterns. A threshold si is associated with each prototype, and it will be modified during the learning step. Such a threshold determines an influence region in the input space. If a pattern introduced into the network falls in the influence region of a prototype, then this prototype will be activated. Such a region is determined by the set of input vectors satisfying a distance measure lower than the threshold. If there is no activated prototype, a new one is created. So the structure of a prototype-based network such as ARN2 is evolutionary, in the sense that the number of prototypes at the hidden level is not fixed a priori and may be increased during the learning step. A prototype is characterized by its reference vector, an influence region and a set of representing patterns.

4.2 Summarizing Sequential Patterns
To characterize a sequential pattern, we use four attributes based on a generalization of the web pages belonging to it: we use 1) the multi-site aspect and 2) the first-level category aspect (for all sites). Currently such category identification is done at the syntactic level on URLs. For example, page (a) belongs to www-sop.inria.fr and has teaching as its second-level category value. So page (a) will be considered as representing a document and a second-level category both for www-sop.inria.fr and for the Lambda first-level category. The four attributes calculated from the Web pages of the sequential pattern are: 1) the number of second-level categories of the pattern per site, 2) the number of documents of the pattern per site, 3) the number of second-level categories of the pattern per first-level category, over all sites, and finally 4) the number of documents per first-level category (over all sites). The dimension of the description vector is equal to 2×(number of considered sites + number of first-level categories of the Web pages of the patterns, i.e. the union of first-level categories independently of the site). Each attribute is normalized between 0 and 1 and has an importance weight assigned according to the chosen context.

Example 2. Let us consider a group of students requesting the pages of a course about data mining: lambda/teaching/STID/. Several pages can thus be requested in that part of the site: annee02-03.html, TD1.html, TD2.html, TD3.html, accesslog.html and errorlog.html. Each of these pages will be named a, b, c, d, e and f (i.e. a = lambda/teaching/STID/annee02-03.html, b = lambda/teaching/STID/TD1.html, ...). The goal is of course to detect that these six pages will be grouped in the same cluster. In fact, once this cluster is detected, a sequential pattern mining process on the sub-log file corresponding to that cluster will allow finding patterns with a high support on this sub-log and a very low representativeness on the entire log. Let us suppose that 1) we join logs from the two sites www.inria.fr and www-sop.inria.fr, 2) we structure such logs into sessions and finally 3) the extracted patterns use Web pages from only six first-level categories (such as "Lambda"). Let us consider that we extracted four sequential patterns concerning these pages: < (a)(b) >, < (b)(c) >, < (d)(e) > and < (d)(f) >. The Web pages "a...f" belonging to these patterns belong to the same site "www-sop.inria.fr", with "Lambda" as the first-level category and "teaching" as the second-level category. So they have the same summary on the four following attributes:

oCategory2PerSite    0, 2      oCategory2PerCategory1    0, 0, 0, 0, 2, 0
oDocsPerSite         0, 2      oDocsPerCategory1         0, 0, 0, 0, 2, 0
The prototype network built on such an example has an input space dimension equal to 16 (related to the description of a pattern). Since the four patterns have an equal description, we obtain only one class: C = {< (a)(b) >, < (b)(c) >, < (d)(e) >, < (d)(f) >}. Hence, our objective of grouping together sessions containing Web pages that are related is now fulfilled.
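A sketch of the summary computation follows: given a pattern (a list of requested URLs) and the lists of sites and first-level categories under consideration, it counts distinct second-level categories and documents per site and per first-level category, yielding a 2×(sites + categories) description vector. The URL parsing conventions and counting choices are assumptions of ours, so the exact numbers of the paper's example are not necessarily reproduced.

from urllib.parse import urlparse

def summarize(pattern_urls, sites, categories1):
    """Return the 2*(len(sites)+len(categories1)) summary vector of a sequential pattern."""
    cat2_per_site = {s: set() for s in sites}
    docs_per_site = {s: set() for s in sites}
    cat2_per_cat1 = {c: set() for c in categories1}
    docs_per_cat1 = {c: set() for c in categories1}
    for url in pattern_urls:
        parsed = urlparse(url)
        site = parsed.netloc
        parts = [p for p in parsed.path.split('/') if p]
        if len(parts) < 2:
            continue                      # no category information in the path
        cat1, cat2 = parts[0], parts[1]
        if site in cat2_per_site:
            cat2_per_site[site].add((cat1, cat2))
            docs_per_site[site].add(url)
        if cat1 in cat2_per_cat1:
            cat2_per_cat1[cat1].add(cat2)
            docs_per_cat1[cat1].add(url)
    return ([len(cat2_per_site[s]) for s in sites] +
            [len(docs_per_site[s]) for s in sites] +
            [len(cat2_per_cat1[c]) for c in categories1] +
            [len(docs_per_cat1[c]) for c in categories1])

urls = ['http://www-sop.inria.fr/lambda/teaching/STID/TD1.html',
        'http://www-sop.inria.fr/lambda/teaching/STID/TD2.html']
print(summarize(urls, ['www.inria.fr', 'www-sop.inria.fr'],
                ['jobs', 'research', 'lambda']))   # [0, 1, 0, 2, 0, 0, 1, 0, 0, 2]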
5 Experiments
The extraction methods are written in C++ and run on a Pentium (2.1 GHz) PC running a Red Hat system. We used the PSP algorithm [8] for sequential pattern extraction. The neural method and the GUI are written in Java. For Inria's main site, the data was collected over a period of one month, while for the Inria Sophia Antipolis site it was collected over a period of two months. Their sizes are 2.1 GB and 3 GB respectively. The description of their characteristics (cf. Figure 3) is the following: N stands for the number of lines in the log, S for the number of sessions,
U for the number of filtered URLs, AL for the average length of the sessions, and AU for the average number of URLs in the sessions. During our experiments, we could highlight frequent behaviors with a relative representativeness getting weaker and weaker, depending on the depth of the sub-log.
to the career opportunities offered by the Inria. The users read the job opportunities page, then the page describing the competitive selection and finally the pages describing the research teams. C2: (support: 0.04%). This behavior is typical of a search
for a security hole in the system. Usually, these attacks are programmed once and then shared and used by different individuals. C3: with the prefix epidaure/Demonstrations/foie3d/: (support: 0.01%)
The behaviors discovered with our method cover more than 50 surfing goals on Inria's main site and more than 100 goals on the site of Inria Sophia Antipolis. We reported here three goals, from job opportunity requests to hacking activities. These behaviors illustrate the success of our method in discovering minority behaviors, which we could not have discovered given their very weak representativeness.
6 Conclusion
In this paper, we proposed a method for extracting the behavior of all the users of a Web site. Our approach has the characteristic of recursively dividing the log file in order to locate behaviors and to represent them as clusters. For this purpose, we had to provide a specific clustering method, dedicated to sequential patterns. The main advantage of our method is to consider Web Usage
Mining with very low support as a problem that can be solved by successive divisions. The problem thus shifts from one open problem to n problems we can solve and one problem that will have to be divided recursively. By proceeding this way, we could establish that the frontier between the quantity of data and the quality of the results can sometimes be pushed back by extracting behaviors with a very low representativeness.
References
1. R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD Conference, pages 207–216, Washington DC, USA, May 1993.
2. A. Benedek and B. Trousse. Adaptation of Self-Organizing Maps for CBR case indexing. In 27th Annual Conference of the Gesellschaft für Klassifikation, Cottbus, Germany, March 2003.
3. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems, 1(1):5–32, 1999.
4. U.M. Fayad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, 1996.
5. A. Giacometti. Modèles hybrides de l'expertise, novembre 1992. PhD thesis (in French), ENST Paris.
6. M. Jaczynski. Modèle et plate-forme à objets pour l'indexation des cas par situations comportementales: application à l'assistance à la navigation sur le web, décembre 1998. PhD thesis (in French), Université de Nice Sophia-Antipolis.
7. M. Malek. Un modèle hybride de mémoire pour le raisonnement à partir de cas, octobre 1996. PhD thesis (in French), Université Joseph Fourier.
8. F. Masseglia, F. Cathala, and P. Poncelet. The PSP Approach for Mining Sequential Patterns. In Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD'98), LNAI, Vol. 1510, pages 176–184, Nantes, France, September 1998.
9. F. Masseglia, P. Poncelet, and R. Cicchetti. An efficient algorithm for web usage mining. Networking and Information Systems Journal (NIS), April 2000.
10. R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology (EDBT'96), pages 3–17, Avignon, France, September 1996.
11. Doru Tanasa and Brigitte Trousse. Web access pattern discovery and analysis based on page classification and on indexing sessions with a generalised suffix tree. In Proceedings of the 3rd International Workshop on Symbolic and Numeric Algorithms for Scientific Computing, pages 62–72, Timisoara, Romania, October 2001.
12. W3C. httpd-log files. http://www.w3.org/Daemon/User/Config/Logging.html, 1995.
CS-Mine: An Efficient WAP-Tree Mining for Web Access Patterns Baoyao Zhou, Siu Cheung Hui, and ACM Fong School of Computer Engineering, Nanyang Technological University, Singapore [email protected], {asschui, ascmfong}@ntu.edu.sg
Abstract. Much research has been done on discovering interesting and frequent user access patterns from web logs. Recently, a novel data structure, known as the Web Access Pattern Tree (or WAP-tree), was developed. The associated WAP-mine algorithm is clearly faster than traditional sequential pattern mining techniques. However, WAP-mine requires re-constructing large numbers of intermediate conditional WAP-trees during mining, which is also very costly. In this paper, we propose an efficient WAP-tree mining algorithm, known as CS-mine (Conditional Sequence mining algorithm), which is based directly on the initial conditional sequence base of each frequent event and eliminates the need for re-constructing intermediate conditional WAP-trees. This improves efficiency significantly compared with WAP-mine, especially when the support threshold becomes smaller and the size of the database gets larger.
1 Introduction
Web usage mining [1] discovers interesting and frequent user access patterns from web usage data that can be stored in web server logs, proxy logs or browser logs. Essentially, a web access pattern [2] is a sequential pattern in a large set of pieces of web logs, which is pursued frequently by users. Most of the previous studies on web access pattern mining have adopted sequential pattern mining techniques [3], such as AprioriAll [3] and GSP [5]. However, these Apriori-based algorithms share the problem of requiring expensive multiple scans of the database in order to determine which candidates are actually frequent. Recently, Pei et al. [2] proposed a compressed data structure known as the Web Access Pattern Tree (or WAP-tree), which facilitates the development of algorithms for mining web access patterns efficiently from web logs. The associated WAP-mine algorithm [2] avoids the problem of generating explosive numbers of candidates. Experimental results have shown that WAP-mine is clearly faster than traditional sequential pattern mining techniques. However, the conditional search strategy in WAP-mine requires re-constructing large numbers of intermediate conditional WAP-trees during mining, which is also very costly. In this paper, we propose a new web access pattern mining algorithm based on the WAP-tree structure, known as CS-mine (Conditional Sequence mining algorithm). To improve efficiency, CS-mine eliminates the need for re-constructing intermediate conditional WAP-trees. The rest of this paper is organized as follows. In Section 2, we introduce the related work on the WAP-tree structure and WAP-tree based mining
algorithms. The proposed CS-mine algorithm is presented in Section 3. Section 4 shows the experimental results. Finally, the conclusion is given in Section 5.
2 WAP-Tree and Mining Web Access Patterns
Generally, web logs can be regarded as a collection of sequences of access events from one user or session in timestamp ascending order. Preprocessing tasks [4] can be applied to the original log files to obtain web access sequences after data cleaning, user identification, session identification, etc., for mining purposes. In this section, we review the WAP-tree structure and the related WAP-tree based mining algorithms.

2.1 WAP-Tree Structure and Construction
Let E be a set of access events, which represent web resources accessed by users, i.e. web pages or URLs. A web access sequence S = e1e2…en (ei ∈ E for 1 ≤ i ≤ n) is a sequence of access events, and |S| = n is called the length of S. Note that it is not necessary that ei ≠ ej for i ≠ j in S, that is, repetition of items is allowed. For example, suppose we have a set of web access sequences over the set of access events E = {a, b, c, d, e, f}. A simple web access sequence database is shown in Table 1.

Table 1. A database of web access sequences

User ID   Web Access Sequence   Frequent Sub-sequence
100       cadba                 caba
200       cacbeae               cacba
300       ceafbab               cabab
400       cfcabfa               ccaba
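For instance, a small Python sketch can verify that a candidate pattern is a frequent sub-sequence (in the sense defined next) of the sequences in Table 1, written here as plain strings of single-event accesses; this simplification of the general itemset-sequence case and the helper names are our own.

def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order (not necessarily contiguously)."""
    it = iter(sequence)
    return all(event in it for event in pattern)

def support(pattern, database):
    """Fraction of sequences in the database containing the pattern."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)

db = ['cadba', 'cacbeae', 'ceafbab', 'cfcabfa']   # the sequences of Table 1
print(support('caba', db))                        # 1.0: 'caba' is a sub-sequence of every sequence
print(support('ab', db), support('abc', db))      # 1.0 0.0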
A web access sequence S' = e1'e2'…em' is called a sub-sequence of S = e1e2…en, denoted as S' ⊆ S, if there exist integers 1 ≤ i1 < i2 < … < im ≤ n such that ej' = eij for 1 ≤ j ≤ m.

… L and T.count >= α
4) output FPs embedded in T based on the conditional pattern P;
5) else
6) for each sub-tree Ts of T do
7) if Ts.count >= α
8) P ← T.item ∪ P;
9) PushRight(Ts's leftmost child, Ts's sibling);
10) Traverse(T's leftmost child, P, L, α);
11) delete T;

Fig. 7. The CFPM Algorithm
represents a pattern, the-path-to-e, with a global support of t. From this observation, we can list all frequent patterns from the leftmost branch of an IFP-tree. Considering the IFP-tree in Figure 5, we can immediately enumerate all the FPs on the leftmost path together with their support, which are {s:2, sb:2}. Note, however, that the non-rooted pattern b does not have the global support.
Fig. 8. Performance of FP-growth, Traverse (L=1) and CFPM (L = 3)
4.2 The CFPM Algorithm
Given an IFP-tree T and its leftmost subtree Ti1, by pushing Ti1 right to its siblings we obtain a new leftmost subtree Ti2 containing all information on FPs containing i2 and the items following i2 in I. Due to space restrictions, we use an example to explain the procedure. Figure 6 shows the IFP-tree that results from pushing right the s-subtree of the IFP-tree in Figure 5. We can see that the new leftmost b-subtree now contains all information needed for mining FPs containing items after s in the item list, which are b, d, f, m, a and e.

Given an IFP-tree T, with a top-down traversal strategy, by recursively enumerating all FPs on the leftmost branch and pushing it right, we can mine all FPs in T. The PushRight procedure incurs the major overhead, and the following optimization aims to reduce it. A single-path IFP-tree is an IFP-tree in which each node has only a single child. If T is a single-path tree, then without recursively pushing right we can enumerate all frequent patterns conditioned on P by combining the nodes with count greater than the threshold.

In traversing IFP-trees for FPs, the cost of pushing and checking single paths is non-trivial, and it is more pronounced when the underlying patterns are short. To address this cost of mining short patterns, only FPs with length greater than a parameter L are mined from the IFP-tree. As will be discussed later, short FPs are mined with the less costly candidate-generation-and-test approach. In CFPM, we use the following strategies to save costs in mining FPs from IFP-trees: (1) If the count of a node equals the global support of the item in it, then this subtree can be pruned. (2) If the length of a single-path tree is ≤ L, this path can be pruned.

The IFP traversal procedure is summarized in Figure 7. Compared with FP-growth, (1) we do not need to construct conditional databases: we can mine patterns directly on the original tree along with some merging operations, which only involve some pointer re-organization and counter increments; (2) we remove a branch after it has been processed, which includes outputting the patterns embedded in it and pushing its information to the right branches. As a result the total number of nodes does not increase too much and the total running time is much less than that of FP-growth. In addition, this approach can dramatically reduce memory usage.

The CFPM algorithm is shown in Figure 7. Only two scans of the database are needed. In the first scan of the database, the supports of all items are accumulated and, based on the minimum support threshold, all the 1-item FPs are found. In the second scan over the database, based on the 1-item FPs, the candidate-generation-and-test approach is used to find FPs of length ≤ L, and the IFP-tree is built for the transactions containing more than L items. Longer FPs, with length > L, are mined by traversing the IFP-tree. Experiments show that CFPM (L = 3, where candidate-generation-and-test is used for mining FPs of length ≤ 3 and pattern-growth is used for mining FPs of length > 3) achieves a significant performance gain over FP-growth on mining Web logs.
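A hedged sketch of the hybrid driver described in the last paragraph is given below: the first scan counts item supports, the second counts candidates of length 2..L built from the frequent items, and the transactions that would feed the IFP-tree are merely collected, since the tree mining itself is not reproduced here. All names are ours and the code is not the authors' implementation.

from itertools import combinations
from collections import Counter

def cfpm_short_patterns(transactions, minsup_count, L=3):
    """Count 1-item FPs in a first scan, then candidate itemsets of size 2..L in a second scan.
    Transactions with more than L frequent items are set aside for IFP-tree mining."""
    # scan 1: item supports
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i for i, c in item_counts.items() if c >= minsup_count}
    # scan 2: candidate-generation-and-test for lengths 2..L
    counts = Counter()
    long_transactions = []
    for t in transactions:
        items = sorted(set(t) & frequent_items)
        for k in range(2, L + 1):
            counts.update(combinations(items, k))
        if len(items) > L:
            long_transactions.append(items)   # would feed the IFP-tree construction
    short_fps = {frozenset({i}): c for i, c in item_counts.items() if c >= minsup_count}
    short_fps.update({frozenset(c): n for c, n in counts.items() if n >= minsup_count})
    return short_fps, long_transactions

db = [['a', 'b', 'c'], ['a', 'c'], ['a', 'b', 'c', 'd'], ['b', 'c']]
fps, rest = cfpm_short_patterns(db, minsup_count=2, L=3)
print(sorted((tuple(sorted(s)), n) for s, n in fps.items()))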
parameter L are mined from the IFP-tree. As will be discussed later, short FPs are mined with the less costly candidate-generation-and-test approach. In CFPM, we use the following strategies to save costs in mining FPs from IFPtrees: (1) If the count of a node equals to the global support of the item in it, then this subtree can be pruned. (2) If the length of a single path tree is ≤ L, this path can be pruned. The IFP traversal procedure is summarized in Figure 7. Comparing with FPgrowth, (1) We do not need to construct conditional databases. We can mine patterns directly on the original tree along with some merging operations, which only involve some pointer re-organization and counter increment. (2) We remove the branch after it has been processed, which includes outputting patterns embedded in it and pushing information in it to right branches. As a result the total number of nodes does not increase too much and the total running time is much less than FP-growth. In addition, this approach can dramatically reduces memory usage. The CFPM algorithm is as shown in Figure 7. Only two scans of the database are needed. After the first scan on the database, all the supports of all items are accumulated. Based on the minimum support threshold, all the 1-item FPs are found. In the second scan over the database, based on the 1-item FPs, the candidate-generation-and-test approach is used to find FPs of length ≤ L. The IFP-tree is built for transactions containing > L items. Longer FPs with length > L are mined by traversing the IFP-tree. Experiments show that CFPM (L = 3, where candidate-generation-and-test is used for mining FPs of length ≤ 3, and pattern-growth is used for mining FPs of lenght > 3, achieves significant performance gain over FP-growth on mining Web logs.
5
Experiments
CFPM was implemented in C++ and experiemts were conducted in the environment as discribed in Section 3. Figure 8 shows the time performance of FP-growth, Traverse (L=1), which is IFP-tree Mining, and CFPM (L = 3) on the BMS-WebView datasets. In Figure 8(a) and Figure 8(b), Traverse (L=1) outperforms FP-growth by 6 times when minimum support threshold is relatively high (≥ 2.5% on BMS-WebView-1, and ≥ 3.2% on BMS-WebView-2). However, with the dropping of minimum support threshold, the performance of Traverse (L=1) drops as well. The reason is the length of FPs increase with a sharp trend of inclining and so much more merging operations are involved in Traverse, which is a non-trival overhead. As expected, the problem is fixed by CFPM. From Figure 8(c) and Figure 8(d) we can see that CFPM significantly improves the performance of Traverse (L=1) at low support thresholds, by 3.1–4.9 times on BMS-WebView-1 and by 1.2–7.8 times on BMS-WebView-2.
542
6
L. Sun and X. Zhang
Conclusions
We have studied the problem of mining frequent patterns (FPs) from Web logs. We have done extensive experiments to examine the features of Web logs and the characteristics of FPs presented in such data. We also have conducted experiments to compare the performance of representative FP minining techniques on mining Web logs and have proposed improvements. Experiments show that the proposed algorithm CFPM significantly improve the performance of the patterngrowth approach by 1.2–7.8 times on mining frequent patterns in Web logs.
References 1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB’94, 1994. 2. R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering, pages 3–14, Taipei, Taiwan, March 1995. 3. R. Cooley, B. Mobasher and J. Srivastava. Grouping Web page references into transactions for mining world wide Web browsing patterns. In Technical Report TR 97-021, University of Minnesota, Dept. of CS, 1997. 4. J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD 2000), 2000. 5. R. Kosala and H.Blockeel. Web mining research: a survey. In ACM SIGKDD Explorations, 2000. 6. G. Liu, H. Lu, Y. Xu and J. X. Yu. Ascending frequency order prefix-tree: efficient mining of frequent patterns. In DASFAA 2003, 2003. 7. J. Liu, Y. Pan, K. Wang and J. Han. Mining frequent item sets by opportunistic projection. In SIGKDD’02, July 2002. 8. J. Pei et al. H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. In ICDM 2001, 2001. 9. J. Pei, J. Han, B. Mortazavi-asl and H. Zhu. Mining access patterns efficiently from Web logs. In Proc. of PAKDD-2000, 2000. 10. J. Srivastava, R. Cooley, P. Deshpande, and P. Tan. Web usage mining: discovery and application of usage patterns from Web data. In SIGKDD Explorations, 2000. 11. Y. Xiao and M. H. Dunham. Efficient mining of traversal patterns. In Data and Knowledge Engineering, 39(2), pages 191–214, 2000.
Mining the Predisposing Factor and Co-incident Factor among Numerical Dynamic Attributes in Time Series Data Set Suwimon Kooptiwoot and M. Abdus Salam Language and Knowledge Management Laboratory, School of Information Technologies The University of Sydney, Sydney, Australia 2006 {suwimon,msalam}@it.usyd.edu.au
Abstract. In this work we propose new algorithms, which combine several existing techniques with an idea drawn from chemical reactions, to mine the predisposing factor and co-incident factor of a reference event of interest. We apply our algorithms to Open Source Software data collected from the SourceForge website and show the results.
1
Introduction
Temporal mining is data mining that takes the time attribute into consideration. Time series data is a data set that includes a time attribute. Many methods and algorithms have been developed for temporal mining, and all are useful for mining knowledge from time series data. We want to use temporal mining techniques to mine the predisposing factor and the co-incident factor of significant changes in the rate of the Download attribute in an OSS data set.
2
Problem
We obtained an OSS data set from http://sourceforge.net, the world's largest Open Source software development website. There are 1,097,341 records and 41,540 projects in this data set. The data set consists of seventeen attributes, including a time attribute; the time attribute of each record is the month. Each project in this data set is a piece of software, and many activities are recorded for it. We are interested in thirteen attributes that indicate the numbers of these activities; the data of these thirteen attributes are all numeric. The value of the Download attribute is the number of downloads, so the Download attribute indicates how popular the software is and how successful its development has been. We are interested in significant changes in the rate of the Download attribute. We employ the event detection technique proposed by [5] to detect the events of the Download attribute. The event of interest is the significant rate
of the data change, which can be acceleration or deceleration. We want to find the predisposing factor and the co-incident factor of the Download events. We employ the same idea of a reference event as proposed in [1]: the reference event is a fixed event of interest, and we want to find the other events related to it. So we call the Download attribute the reference attribute and the event of the Download attribute the reference event. The predisposing factor of the reference event can possibly be the cause of the reference event, or the cause of another event which is in turn the cause of the reference event. The co-incident factor of the reference event can possibly be the effect of the reference event, the effect of another event which is itself an effect of the reference event, a result of the same cause as the reference event, or just the result of another event which happens at the same time as the reference event. To make this concept clear, consider the following example:
Fig. 1. The relationships among the events over time
Suppose we have the events A, B, C, D, E, F, G, H, I, J, K, L and the relationships among them shown in Fig. 1; that is, H and I give B; A and B give C; D and E give F; C and F give G; J and C give K; K and G give L. But our data set consists of only A, C, G, H, L, and the reference event is C. We can see that H and A happen before C, so we may say that A is the cause of C and/or H is the cause of C. But from the real relationships shown above, we know that H is not a direct cause of C: H contributes to B, and both A and B give C. So we call A and H predisposing factors of C. G and L happen after C, so we call G and L co-incident factors of C. From the relationships we can see that G is the result of C and F, and L is the result of G, which is the result of C; only G is a direct result of C.
The main idea in this part is that the predisposing factor can possibly be the cause of the reference event and the co-incident factor can possibly be the effect of the reference event. So we employ the same idea as proposed in [2,3]: the cause happens before the effect, and the effect happens after the cause. We call the time point when the reference event happens the current time point, the time point before it the previous time point, and the time point after it the post time point. We then define the predisposing factor of the reference event as an event which happens at a previous time point, and the co-incident factor of the reference event as an event which happens at a post time point.
3
Basic Definitions and Framework
We use the analogy of a chemical reaction to interpret the predisposing and co-incident factors of the reference event. The point is that the amounts of the reactants and of the catalyst increase significantly before the reaction and then decrease significantly at the reaction time, while the amount of the products increases significantly at the post time point compared with the reaction time. We examine two previous adjacent time points and two post adjacent time points in order to make sure that we cover all of the reactants, catalysts, and products. If an attribute has more significant changes at either of the previous time points than at the current time point, we call it a predisposing factor; if this happens at either of the post time points, we call it a co-incident factor.
Definition 1. A time series data set is a set of records r such that each record contains a set of attributes and a time attribute. The value of the time attribute is a point of time on a time scale such as month or year: rj = {a1, a2, a3, ..., am, tj}, where rj is the j-th record in the data set.
Definition 2. There are two types of attributes in a time series data set. An attribute that depends on time is a dynamic attribute (Ω); otherwise, it is a static attribute (S).
Definition 3. A time point (ti) is a point on the time scale.
Definition 4. A time interval is the range of time between two time points, [t1, t2]. We may refer to it by the end time point of the interval (t2).
Definition 5. An attribute function is a function of time whose values are extracted from the values of attribute i in the records, and is denoted ai(tx) = ai ∈ rj, where ai is attribute i and tx is the time stamp associated with the record.
Definition 6. A feature is defined on a time interval [t1, t2] if some attribute function ai(t) can be approximated by another function φ(t) in that interval, for example ai(t) ≈ φ(t), ∀t ∈ [t1, t2]. We say that φ and its parameters are features of ai(t) in the interval [t1, t2]. If φ(t) = αi t + βi in some interval, we say that in this interval the function ai(t) has a slope of αi, where the slope is a feature extracted from ai(t) in that interval.
Definition 7. Slope (αi) is the change of value of a dynamic attribute (ai) between two adjacent time points: αi = (ai(tx) − ai(tx−1)) / (tx − tx−1), where ai(tx) is the value of ai at the time point tx and ai(tx−1) is the value of ai at the time point tx−1.
Definition 8. Slope direction d(αi) is the direction of the slope. If αi > 0, dα = 1; if αi < 0, dα = −1; if αi ≈ 0, dα = 0.
Definition 9. The reference attribute (at) is the attribute of interest. We want to find the relationship between the reference attribute and the other dynamic attributes in the data set.
Definition 10. The current time point (tc) is the time point at which the reference attribute's event is detected.
Definition 11. The previous time point (tc−1) is the previous adjacent time point of tc.
Definition 12. The second previous time point (tc−2) is the previous adjacent time point of tc−1.
Definition 13. The post time point (tc+1) is the post adjacent time point of tc.
Definition 14. The second post time point (tc+2) is the post adjacent time point of tc+1.
Definition 15. Slope rate (θ) is the relative slope between two adjacent time intervals: θ = (αi+1 − αi) / αi, where αi is the slope over the time interval [ti−1, ti] and αi+1 is the slope over the time interval [ti, ti+1].
Definition 16. Slope rate direction (dθ) is the direction of θ. If θ > 0, dθ = 1 (accelerating); if θ < 0, dθ = −1 (decelerating); if θ ≈ 0, dθ = 0 (steady).
Definition 17. A significant slope rate threshold (δII) is the significant slope rate level specified by the user.
Definition 18. An event (E2) is detected if θ ≥ δII.
Proposition 1. The predisposing factor of at without considering dθ, denoted PE2at, is ai if ((an_i(tc−1) ≥ an_i(tc)) ∨ (an_i(tc−2) ≥ an_i(tc))), where an_i(tc) is the number of E2 events of ai at tc, and an_i(tc−1) and an_i(tc−2) are the numbers at tc−1 and tc−2.
Proposition 2. The co-incident factor of at without considering dθ, denoted CE2at, is ai if ((an_i(tc+1) ≥ an_i(tc)) ∨ (an_i(tc+2) ≥ an_i(tc))), where an_i(tc+1) and an_i(tc+2) are the numbers of E2 events of ai at tc+1 and tc+2.
Proposition 3. The predisposing factor of at considering the slope rate direction of the reference event, denoted PE2at dθat, is an ordered pair (ai, dθat) with ai ∈ Ω, where dθat is the slope rate direction of at.
Proposition 4. If ((antp_i(tc−1) ≥ antp_i(tc)) ∨ (antp_i(tc−2) ≥ antp_i(tc))), then PE2at dθat ≈ (ai, 1), where antp_i(tc), antp_i(tc−1), and antp_i(tc−2) are the numbers of E2 events of ai at tc, tc−1, and tc−2 for which dθat is accelerating.
Proposition 5. If ((antn_i(tc−1) ≥ antn_i(tc)) ∨ (antn_i(tc−2) ≥ antn_i(tc))), then PE2at dθat ≈ (ai, −1), where antn_i(tc), antn_i(tc−1), and antn_i(tc−2) are the numbers of E2 events of ai at tc, tc−1, and tc−2 for which dθat is decelerating.
Proposition 6. The co-incident factor of at considering dθat, denoted CE2at dθat, is an ordered pair (ai, dθat) with ai ∈ Ω.
Proposition 7. If ((antp_i(tc+1) ≥ antp_i(tc)) ∨ (antp_i(tc+2) ≥ antp_i(tc))), then CE2at dθat ≈ (ai, 1), where antp_i(tc+1) and antp_i(tc+2) are the numbers of E2 events of ai at tc+1 and tc+2 for which dθat is accelerating.
Proposition 8. If ((antn_i(tc+1) ≥ antn_i(tc)) ∨ (antn_i(tc+2) ≥ antn_i(tc))), then CE2at dθat ≈ (ai, −1), where antn_i(tc+1) and antn_i(tc+2) are the numbers of E2 events of ai at tc+1 and tc+2 for which dθat is decelerating.
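To illustrate how Definitions 7, 15, and 18 fit together, the following minimal Python sketch computes slopes, slope rates, and event time points for a toy series. Associating θ with the end point of the second interval and guarding against a zero slope are assumptions of this sketch, not details given in the text.

```python
def slopes(times, values):
    # Definition 7: slope between two adjacent time points.
    return [(values[k] - values[k - 1]) / (times[k] - times[k - 1])
            for k in range(1, len(values))]

def slope_rates(alphas, eps=1e-9):
    # Definition 15: theta = (alpha_{i+1} - alpha_i) / alpha_i.
    return [(alphas[i + 1] - alphas[i]) / (alphas[i] if abs(alphas[i]) > eps else eps)
            for i in range(len(alphas) - 1)]

def detect_events(times, values, d_ii):
    # Definition 18: an event E2 is detected whenever theta >= dII.
    thetas = slope_rates(slopes(times, values))
    return [times[i + 2] for i, theta in enumerate(thetas) if theta >= d_ii]

# Toy monthly series with the threshold dII = 1.5 used in the experiments:
# the jump from 14 to 30 gives theta = 7, so an event is flagged at t = 4.
print(detect_events([1, 2, 3, 4, 5], [10, 12, 14, 30, 31], 1.5))  # [4]
```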
4
Algorithms
Now we present two algorithms. Each algorithm is tailored for the type of patterns we wish to explore.
4.1
Without Considering dθ
Input: the data set, which consists of numerical dynamic attributes and is sorted in ascending order by time; the reference attribute at; the threshold δII of each ai.
Output: an_i(tc−2), an_i(tc−1), an_i(tc), an_i(tc+1), an_i(tc+2), PE2at, CE2at.
Method:
/* Basic part */
For each ai:
  For each time interval [tx, tx+1]: calculate αi.
  For each pair of adjacent time intervals: calculate θ.
For at:
  If θt ≥ δII, set that time point as tc and group the records of tc−2, tc−1, tc, tc+1, tc+2.
/* End of basic part */
Count an_i(tc−2), an_i(tc−1), an_i(tc), an_i(tc+1), an_i(tc+2).
// Interpret the result
If ((an_i(tc−1) ≥ an_i(tc)) ∨ (an_i(tc−2) ≥ an_i(tc))), then ai is PE2at.
If ((an_i(tc+1) ≥ an_i(tc)) ∨ (an_i(tc+2) ≥ an_i(tc))), then ai is CE2at.
4.2
With Considering dθ at
Input: the data set, which consists of numerical dynamic attributes and is sorted in ascending order by time; the reference attribute at; the threshold δII of each ai.
Output: antp_i(tc−2), antp_i(tc−1), antp_i(tc), antp_i(tc+1), antp_i(tc+2), PE2at dθat, antn_i(tc−2), antn_i(tc−1), antn_i(tc), antn_i(tc+1), antn_i(tc+2), CE2at dθat.
Method:
/* Basic part (as in Section 4.1) */
Count antp_i(tc−2), antp_i(tc−1), antp_i(tc), antp_i(tc+1), antp_i(tc+2), antn_i(tc−2), antn_i(tc−1), antn_i(tc), antn_i(tc+1), antn_i(tc+2).
// Interpret the result
If ((antp_i(tc−1) ≥ antp_i(tc)) ∨ (antp_i(tc−2) ≥ antp_i(tc))), then ai is PE2at dθat in acceleration.
If ((antn_i(tc−1) ≥ antn_i(tc)) ∨ (antn_i(tc−2) ≥ antn_i(tc))), then ai is PE2at dθat in deceleration.
If ((antp_i(tc+1) ≥ antp_i(tc)) ∨ (antp_i(tc+2) ≥ antp_i(tc))), then ai is CE2at dθat in acceleration.
If ((antn_i(tc+1) ≥ antn_i(tc)) ∨ (antn_i(tc+2) ≥ antn_i(tc))), then ai is CE2at dθat in deceleration.
We deal with the rate of data change, and we note a fact about the catalyst in a chemical reaction: a catalyst can increase the rate of the reaction and make it happen faster. So we look at the character of the catalyst in chemical reactions in [7,6,4,8,10]. A catalyst is not necessary for a chemical reaction, and not all chemical reactions have one; but once a catalyst is present, it can increase the rate of the reaction. We believe that some events act as catalysts. If we used a confidence threshold and/or minimum support threshold to select the predisposing factor or the co-incident factor, the events which act as catalysts, and which are not necessarily part of the reaction, would have low confidence values and thus be discarded.
Fig. 2. The chemical reaction including the catalyst
We regard the time at which the reaction happens as the time at which the reference event happens. The amount of the reactants at the previous time point is higher than at the current time point, and likewise the amount of the catalyst at the previous time point is higher than at the current time point. The amount of the products at the post time point is higher than at the current time point. We therefore treat the reactants and the catalyst at the previous time point as the predisposing factor, and the products as the co-incident factor. Since a catalyst is not transformed into the product, we get the catalyst back after the reaction finishes, so the amount of the catalyst at the post time point is also higher than at the current time point; we therefore treat the catalyst at the post time point as a co-incident factor as well. The main idea in this part is that, although not all chemical reactions have a catalyst, the amounts of the catalyst and of the reactants at the previous time point are higher than their amounts at the current time point, and the amounts of the products and of the catalyst at the post time point are higher than their amounts at the current time point. We use this characteristic to select the predisposing factor and the co-incident factor of the reference event of event type II. The combination of the ideas mentioned above, together with the new idea taken from chemical reactions as explained, can be used to find the predisposing factor and the co-
incident factor of the reference event. The steps to find the predisposing factor and the co-incident factor of the reference event are:
1. Set the rate-of-change threshold of the reference attribute.
2. Use this threshold to find the reference event. If the rate of change of the reference attribute is equal to or higher than the threshold, we say that the change of the data at that time point is the reference event, and we call that time point the current time point.
3. Look at the two previous adjacent time points to find the predisposing factor of the reference event, and at the two post adjacent time points of the current time point to find the co-incident factor of the reference event.
4. Set the rate-of-change threshold of the other attributes under consideration.
5. Use this threshold to find the events of all these attributes at the two previous time points, the current time point, and the two post time points.
6. At each of the two previous time points, the current time point, and the two post time points, count the number of events of each of the other attributes.
7. For each of the other attributes:
a) If the number of its events at either of the two previous time points is higher than the number of its events at the current time point, we say that it is a predisposing factor of the reference event.
b) If the number of its events at either of the two post time points is higher than the number of its events at the current time point, we say that it is a co-incident factor of the reference event.
As an example, we set the time point at which we find the reference event as the current time point, and then look at the two previous adjacent time points and the two post adjacent time points of each current time point. We set the rate-of-change threshold of two other attributes, a2 and a3, and use it to find the events of a2 and a3 at the two previous adjacent time points, the current time point, and the two post adjacent time points. The number of events of a2 and a3 at each time point is shown in Table 1.
Table 1. The number of each event at each time point
     Sec-previous  Previous  Current  Post  Sec-post
a2   0             1         0        1     1
a3   0             0         0        2     0
Comparing the numbers of events of a2 and a3 at the various time points tabulated in Table 1, we determine the predisposing factor and the co-incident factors of the reference event:
Predisposing Factor(s): a2
Co-incident Factor(s): a2, a3
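This determination can be reproduced in a few lines of Python; here "is higher than" in step 7 is read as a strict inequality (the formal propositions use ≥), and the counts are exactly those of Table 1.

```python
counts = {
    "a2": {"sec_prev": 0, "prev": 1, "current": 0, "post": 1, "sec_post": 1},
    "a3": {"sec_prev": 0, "prev": 0, "current": 0, "post": 2, "sec_post": 0},
}

predisposing, coincident = [], []
for attr, c in counts.items():
    if c["prev"] > c["current"] or c["sec_prev"] > c["current"]:
        predisposing.append(attr)   # step 7a
    if c["post"] > c["current"] or c["sec_post"] > c["current"]:
        coincident.append(attr)     # step 7b

print(predisposing)  # ['a2']
print(coincident)    # ['a2', 'a3']
```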
We can also mine in more detail by separating the rate of data change into an acceleration form and a deceleration form and mining the predisposing factor and the co-incident factor of the reference event in each form.
5
Experiments
We apply our method to one OSS data set which consists of 17 attributes (Project name, Month-Year, Rank0, Rank1, Page-views, Download, Bugs0, Bugs1, Support0, Support1, Patches0, Patches1, Tracker0, Tracker1, Tasks0, Tasks1, CVS). This data set consists of 41,540 projects and 1,097,341 records.
Fig. 3. Running time (in seconds) and the number of records to be run at a time (x-axis: number of records, 6,000 to 16,000; y-axis: time in seconds)
5.1
Results
We set the rate-of-change threshold of the Download attribute and of all the other attributes to 1.5.
Without considering the slope rate direction of the Download attribute:
Predisposing Factor(s): Tasks0, Tasks1, CVS
Co-incident Factor(s): Support0, Support1, Patches0, Patches1
Considering the slope rate direction of the Download attribute, for the acceleration of the Download attribute:
Predisposing Factor(s): none
Co-incident Factor(s): Bugs0, Bugs1, Support0, Support1, Patches0, Patches1, Tracker0, Tracker1
For the deceleration of the Download attribute:
Predisposing Factor(s): Bugs0, Bugs1, Support0, Support1, Patches0, Tracker0, Tasks0, Tasks1, CVS
Co-incident Factor(s): Support1
6
Performance
Our method finds the predisposing factor and the co-incident factor of the reference event in O(n) time, where n is the total number of records. This is illustrated in Fig. 3.
7
Conclusion and Discussion
Our new algorithms, which combine existing methods with a new idea drawn from chemical reactions, can be used to mine the predisposing factor and co-incident factor of a reference event of interest very well. As seen in our experiments, the proposed algorithms can be applied to both synthetic data sets and real-life data sets. The performance of our algorithms is also good: their execution time is linear in the number of records. The threshold values defined by the users play a significant role. They can be chosen by looking at the data and observing the characteristics of the attributes of interest. Users have to realize that the results they get can differ depending on their threshold settings. If the degree of importance of an attribute is very high, a small change in the data of that attribute can make the data of the reference attribute change very much [9]. So the setting of the rate-of-change threshold strongly affects the accuracy of the result.
References
1. Bettini, C., Wang, S., et al.: Discovering Frequent Event Patterns with Multiple Granularities in Time Sequences. IEEE Transactions on Knowledge and Data Engineering 10(2) (1998).
2. Blum, L.: Discovery and Representation of Causal Relationships from a Large Time-Oriented Clinical Databases: The Rx Project. Lecture Notes in Medical Informatics 19 (1982).
3. Blum, L.: Discovery, Confirmation and Interpretation of Causal Relationships from a Large Time-Oriented Clinical Databases: The Rx Project. Computers and Biomedical Research 15(2) (1982): 164-187.
4. Freemantle, M.: Chemistry in Action. Great Britain, Macmillan Press (1995).
5. Guralnik, V. and Srivastava, J.: Event Detection from Time Series Data. KDD-99, San Diego, CA, USA (1999).
6. Harrison, R. M., Mora, S., et al.: Introductory Chemistry for the Environmental Sciences. Cambridge, Cambridge University Press (1991).
7. Liska, K. and Pryde, L.: Introductory Chemistry for Health Professionals. USA, Macmillan Publishing Company (1984).
8. Robinson, R., Odom, J., et al.: Essentials of General Chemistry. USA, Houghton Mifflin Company (1997).
9. Salam, M.: Quasi Fuzzy Paths in Semantic Networks. Proceedings 10th IEEE International Conference on Fuzzy Systems, Melbourne, Australia (2001).
10. Snyder, H.: The Extraordinary Chemistry of Ordinary Things. USA, John Wiley & Sons, Inc. (1998).
Query Modification by Discovering Topics from Web Page Structures Satoshi Oyama and Katsumi Tanaka Department of Social Informatics, Graduate School of Informatics, Kyoto University, Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501, Japan {oyama, ktanaka}@i.kyoto-u.ac.jp http://www.dl.kuis.kyoto-u.ac.jp
Abstract. We propose a method that identifies from Web pages pairs of keywords in which one word describes the other and uses these relations to modify the query. It takes into account the positions of the words in the page structures when counting their occurrences and applies statistical tests to examine the differences between word co-occurrence rates. It finds related keywords more robustly regardless of the word type than the conventional methods, which do not consider page structures. It can also identify subject and description keywords in the user’s input and find additional keywords for detailing the query. By considering the document structures, our method can construct queries that are more focused on the user’s topic of interest.
1
Introduction
The most commonly used method for finding information on the Web is using a search engine. However, the results of search engines usually contain many pages not of interest to the user, making it difficult for many users to formulate queries appropriate for finding information of interest [1]. Various techniques for helping users formulate queries have been proposed. For example, query expansion automatically or interactively reformulates queries by using synonymous or related terms [2]. Several commercial Web search engines now have a function for recommending keywords to users. Current approaches emphasize formulating a query as an appropriate set of keywords by adding keywords. However, when a user selects several keywords for a query, he/she may assume that these keywords play different roles in describing the topic of interest. For example, the user may think of one keyword as the subject and the other keywords as describing various aspects of the subject. A simple set of words approach thus may not well represent the potential different ways to describe the topic. In this paper, we describe a method that uses Web page structures. Keywords on different parts of a Web page generally play different roles in describing the subject. Keywords in the title likely represent the overall subject, while those in the body likely describe topics related to the subject.
Fig. 1. Web pages with same words but in different positions
Web pages that have
the same keywords, A and B, but in different positions, as shown in Figure 1, should be treated differently based on this distinction. Several search engines like Google (http://www.google.com) provide functions for using page structures in a Web search. Although representing queries by specifying the positions of keywords on the page seems promising, it has not been used much so far. One of the problems is that it is difficult for a user to formulate a query using these structures. Thus, we propose a method for identifying keywords for the subject and keywords for the description in the user's input and for formulating a query incorporating document structures. For this purpose, we first extract pairs of a subject and a description from a large set of Web pages by counting the co-occurrences of words in different positions in the documents. Then we identify the subject keywords in the user's query and find other keywords describing the subject by referring to the subject-description relations. Based on the user's feedback, a new keyword is added to the query to specify the topic in more detail. In the next section, we define some notations. Section 3 describes the method for identifying subject-description relations from Web page structures. Section 4 describes the method for using these relations to modify the query. We show experimental results in Section 5 and discuss related work in Section 6. We conclude in Section 7.
2
Definitions
We define the notations used in this paper. A, B, . . . denote keywords, and intitle(A) denotes the condition that keyword A should appear in the title portion of a page. The title portion is the text surrounded by the <title> and </title> HTML tags. DF(X) means the number of pages that match query X. For example, DF(intitle(A)) is the number of Web pages containing keyword A in the
title. P(B|A) denotes the conditional probability that a page contains keyword B given that the page contains keyword A. We can empirically calculate this probability as follows:
p(B|A) = DF(A ∧ B) / DF(A).   (1)
Similarly, P(B|intitle(A)) denotes the conditional probability that a page contains keyword B given that the page contains keyword A in the title:
p(B|intitle(A)) = DF(intitle(A) ∧ B) / DF(intitle(A)).   (2)
We call keywords that represent the overall subject subject keywords and keywords that describe a related topic in detail description keywords.
3
Identifying Topics from Web Page Structures
3.1
Word Co-occurrence Statistics Taking into Account Positions of Words on Pages
Generally, we assume that words in the title portion of a document describe the overall subject while words in the body describe topics related to the subject. This means we can identify topical relations between words using document structures. Given word A, if we find another word B that frequently appears in documents that have word A in their title, we can expect with high confidence that B is used to describe a topic of A. Of course, the above assumption does not hold for all pages. That is, all words in the title do not necessarily describe the overall subject, and all words in the body do not necessarily describe the topic. Therefore, to eliminate the noise from the many exceptions, we rely on the index of the search engine and apply statistical techniques to the data. If P(B|intitle(A)) is high, then when word A appears in the title, there is a high probability that word B appears on the page. We can possibly identify keywords that detail the topic of keyword A by looking for a keyword B that yields a high value of P(B|intitle(A)). However, if B is a common word that frequently appears on any page regardless of the presence of word A, a high value does not necessarily mean that keyword B details the topic of word A. For example, words like "information" and "page" occur on many Web pages regardless of the topic, so they have a high value of P(B|intitle(A)) for any word A. Presenting these words would not be helpful to the user. To avoid this problem, we also look at the simple co-occurrence probability that does not take the positions of keywords into account. Keyword B with a high value of P(B|A) is a word that frequently co-occurs with A regardless of its position in the document. We can identify keywords that detail the topic of keyword A more precisely by comparing the values of P(B|A) and P(B|intitle(A)) and selecting only keywords that have a higher P(B|intitle(A)) than P(B|A):
DF(intitle(A) ∧ B) / DF(intitle(A)) > DF(A ∧ B) / DF(A).   (3)
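As a quick illustration of condition (3), the Python sketch below plugs in the document-frequency counts that appear later in Table 2 for the pair "Hungary"/"Budapest"; the helper function and its argument order are ours, not part of the paper.

```python
def satisfies_condition_3(df_title_a, df_title_a_and_b, df_a, df_a_and_b):
    """Condition (3): B co-occurs more often with A-in-title than with A anywhere."""
    return df_title_a_and_b / df_title_a > df_a_and_b / df_a

# Counts from Table 2: DF(intitle(Hungary)) = 3720, DF(intitle(Hungary) AND Budapest) = 979,
# DF(Hungary) = 101000, DF(Hungary AND Budapest) = 12100.
print(satisfies_condition_3(3720, 979, 101000, 12100))  # True (0.263 > 0.120)
```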
Table 1. 2x2 contingency table
            Class D1   Class D2   Total
Class C1    x11        x12        c1
Class C2    x21        x22        c2
Total       d1         d2         N
3.2
Examining the Statistical Significance
There is a possibility that coincidence makes P(B|intitle(A)) higher than P(B|A). To estimate the statistical significance of the difference between the two rates, we use the χ² test, which is also used in pruning decision trees [3] and refining association rules [4]. Let us consider the case in which the population is categorized as C1, C2 and D1, D2 based on attributes C and D, respectively. As described using a 2x2 contingency table (Table 1), we denote the observed frequencies of the categories in a sample of size N as c1, c2 and d1, d2, and denote the observed frequency of the class "Ci and Dj" as xij (i = 1, 2; j = 1, 2), where Σj xij = ci and Σi xij = dj. In this case,
χ²₀ = Σi=1..2 Σj=1..2 (xij − ci dj / N)² / (ci dj / N)   (4)
obeys a χ² distribution with degree of freedom 1. If we choose a significance level of α, and the value of χ²₀ is larger than the upper α point of the distribution, we can conclude that the two attributes are statistically dependent. This technique can be used to determine the statistical significance of the difference between the two co-occurrence rates, P(B|A) and P(B|intitle(A)). For example, when the subject keyword is "Hungary" and a candidate description keyword is "Budapest," class C1 is the set of pages containing "Hungary" in the title, and class C2 is the set of pages containing "Hungary" on other parts of the page. Class D1 is the set of pages containing "Budapest," and class D2 is the set of pages that do not. (Note that in this case all pages contain the keyword "Hungary.") Table 2 is the contingency table in this case. We can calculate the value of each cell in Table 2 by entering queries into a search engine like Google. According to a table of χ² distributions with one degree of freedom [5], the upper 5% point of the distribution is 3.84. The value of χ²₀ (752.86) is greater than 3.84. Therefore, with a significance level of 5%, we can say that "Budapest" appears more frequently on pages with "Hungary" in the title than on pages with "Hungary" elsewhere. Note that the value of χ²₀ becomes large not only when the occurrences of the words are positively correlated, but also when they are negatively correlated. This corresponds to the case in which P(B|intitle(A)) is significantly lower than P(B|A). To eliminate these cases, we also need to consider condition (3).
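The χ²₀ value of 752.86 can be verified directly from equation (4) and the counts in Table 2; the short Python function below is only an illustration, not part of the authors' system.

```python
def chi_square_2x2(x11, x12, x21, x22):
    """Chi-square statistic for a 2x2 contingency table, as in equation (4)."""
    c1, c2 = x11 + x12, x21 + x22          # row totals
    d1, d2 = x11 + x21, x12 + x22          # column totals
    n = c1 + c2                            # sample size
    cells = [(x11, c1, d1), (x12, c1, d2), (x21, c2, d1), (x22, c2, d2)]
    return sum((x - c * d / n) ** 2 / (c * d / n) for x, c, d in cells)

# Counts from Table 2: "Hungary" in title vs. elsewhere, with/without "Budapest".
print(round(chi_square_2x2(979, 2741, 11121, 86159), 2))  # approx. 752.86 > 3.84
```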
Table 2. Contingency table for subject keyword "Hungary" and candidate detailing keyword "Budapest"
                       "Budapest"   No "Budapest"   Total
"Hungary" in title     979          2,741           3,720
"Hungary" elsewhere    11,121       86,159          97,280
Total                  12,100       88,900          101,000
4
Incorporating Web Page Structures into Query Modification
In this section, we describe a method for incorporating Web page structures into query modification. Presenting additional keywords derived from the subject-description relations, rather than merely presenting related keywords, focuses the modified query more on the user's topic of interest. To ensure quick response to the user, we determine subject-description relations between keywords on the Web beforehand by using the method presented in the previous section. We assume a user inputs a set of keywords A1, A2, . . . as a query without specifying their positions on pages. Our method first identifies subject keywords among the user's input keywords. To illustrate this process, we use a directed graph G composed of keywords as nodes and of edges between keywords. We define Ai → Aj as an edge with Aj as a description keyword of Ai. We want to find the "highest" nodes that have no antecedent in the graph. However, the graph may have cycles, and in such cases we cannot define highest nodes. Thus, we convert the graph into a DAG (directed acyclic graph), which has no cycles, by using strongly connected component decomposition [6]. An SCC (strongly connected component) is a maximal subgraph in which each node is reachable from every other node, and any node is reachable from itself. We decompose the directed graph into a set of SCCs, G1, G2, . . .. Then we make a new graph in which each SCC in the original graph is a node and there is an edge between nodes if there are edges between the corresponding SCCs. The resulting graph has no cycles, so we can find highest nodes (SCCs in the original graph) G′1, G′2, . . . in it. We regard the set of keywords in the highest nodes, set S = {Ai | Ai ∈ G′1 ∪ G′2 ∪ · · ·}, as the set of subject keywords. The set of other keywords in the user's query, set D = {Ai | Ai ∉ G′1 ∪ G′2 ∪ · · ·}, is regarded as the set of description keywords. For each subject keyword, we find description keywords that do not appear in the user's query from the set of predetermined subject-description relations. We select a predefined number, M, of these description keywords based on the significance values given by (4) and present them to the user. We add the keyword B′ selected by the user to the set of description keywords and formulate a new structured query: subject keywords S plus original description keywords D plus new description keyword B′. Our algorithm for query modification is summarized in Fig. 2.
// A1, A2, . . . : keywords input by the user
// M : number of keywords presented to the user
1. Construct directed graph G where A1, A2, . . . are nodes and there exists an edge Ai → Aj when keyword Aj describes Ai.
2. Decompose G into a set of SCCs, G1, G2, . . .. Construct a DAG in which each SCC is a node and there is an edge between nodes if there are edges between the corresponding SCCs.
3. Find the "highest" nodes (SCCs in the original graph) G′1, G′2, . . . that have no antecedents in the DAG; determine the set of subject keywords S = {Ai | Ai ∈ G′1 ∪ G′2 ∪ · · ·} and the set of description keywords D = {Ai | Ai ∉ G′1 ∪ G′2 ∪ · · ·}.
4. For each Ai ∈ S, find additional description keywords Bk that do not appear in the query such that Ai → Bk.
5. Select M keywords among {Bk} based on the values of (4).
6. Let B′ be the keyword selected by the user. Form a new query that specifies S as subject keywords and D ∪ {B′} as description keywords.
Fig. 2. Algorithm for query modification
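A rough Python sketch of steps 1-3 of the algorithm is given below. It identifies the SCCs by mutual reachability rather than by an explicit condensation step, which is adequate for the handful of keywords in a query; the sample keywords and edge set are hypothetical, not taken from the paper.

```python
def reachable(graph, start):
    """All nodes reachable from start (including start) via DFS."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen

def split_subject_description(keywords, edges):
    graph = {k: set() for k in keywords}
    for src, dst in edges:                    # edge src -> dst: dst describes src
        graph[src].add(dst)
    reach = {k: reachable(graph, k) for k in keywords}
    # Two keywords belong to the same SCC iff each reaches the other.
    scc_of = {k: frozenset(j for j in keywords if k in reach[j] and j in reach[k])
              for k in keywords}
    subjects = set()
    for k in keywords:
        scc = scc_of[k]
        # An SCC is "highest" if no keyword outside it has an edge into it.
        has_incoming = any(dst in scc
                           for src in keywords if src not in scc
                           for dst in graph[src])
        if not has_incoming:
            subjects.add(k)
    return subjects, set(keywords) - subjects

# Hypothetical query: "Budapest" and "palace" describe "Hungary".
subjects, descriptions = split_subject_description(
    ["Hungary", "Budapest", "palace"],
    [("Hungary", "Budapest"), ("Hungary", "palace")])
print(subjects, descriptions)   # {'Hungary'} {'Budapest', 'palace'}
```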
5
Experiments
We tested our method for finding new description keywords for several examples of subject keywords.
5.1
Data Set
To ensure we used a statistically significant number of Web pages, we used Google Web APIs (http://www.google.com/apis/), a Web service for searching Google's index. Google limits the number of queries to 1000 per day, and calculating the values of P(B|A) and P(B|intitle(A)) for every keyword B in the vocabulary given subject keyword A is a time-consuming task, so we approximated P(B|intitle(A)):
p(B|intitle(A)) ≈ DFsample(intitle(A) ∧ B) / K,   (5)
where DFsample(intitle(A) ∧ B) is the number of Web pages containing word B among the sample pages. We collected K sample pages for each subject keyword A by submitting query intitle(A) to the search engine. To reduce the bias caused by the ranking algorithm of the search engine, we first obtained as many URLs as possible (Google allows viewing of up to 1000 results) and then randomly selected K URLs and downloaded them. Then we submitted query A ∧ B and obtained P(B|A) for only the words B that had a value of (5) larger than a predefined minimum value p. This reduced the number of queries submitted to the search engine. The reason we did not use sampling to obtain P(B|A) was
that a search engine usually ranks pages with keywords in the title even if they are not specified in the query, and this distorts the distribution of samples.
5.2
Examples of Identified Keywords
We set the number of sample pages, K, to 100 and downloaded from the Web as samples 100 of the pages that matched query intitle(A). We used a morphological parser for the Japanese language, ChaSen (http://chasen.aist-nara.ac.jp), to extract the nouns from these Web pages. We also extracted the compound nouns, which are composed of two or more words, by turning on an option of ChaSen. We set threshold p to 0.05 and calculated χ²₀ for words that occurred more frequently than this threshold and that also satisfied condition (3). We performed statistical tests with a significance level of 0.05 and extracted only significant words as description keywords. Table 3 lists up to ten description keywords for each subject keyword. For example, when we set subject word A to "Hungary," our method identified various words for describing Hungary, such as names of places ("Budapest" and "Danube"), a sight ("palace"), the currency ("forint"), and the ethnicity ("Magyar"). An unsuccessful example is the subject keyword "Kyoto." Only a few keywords were found, apparently because the top pages of universities and companies in Kyoto accounted for a large portion of the samples, and many of these pages were "splash" pages, i.e., pages with little textual content describing the topic. We also selected up to ten other keywords found in the same experiment and ranked them in decreasing order of conditional probability, P(B|intitle(A)). These words, which are not shown due to space limitations, include many general words irrelevant to the subject (such as "home page") and more general words for the subject ("nation" and "Europe" for "Hungary"). This indicates that the effect of the difference between P(B|intitle(A)) and P(B|A) must be considered.
5.3
Comparison with Other Methods
Sanderson and Croft [7] proposed extracting a concept hierarchy based on subset relations between sets of documents. They define word A as a parent of word B when p(A|B) = 1, p(B|A) < 1. In this case, the set of documents with word B is a subset of the set of documents with word A (Fig. 3). To avoid the effect of noise, they relax the condition: p(A|B) ≥ 0.8, p(B|A) < 1. Table 4 lists keywords extracted using this criterion from the same set of sample pages as used for the experiment whose results are shown in Table 3. We omitted from the table subject keywords for which we could find no description
Table 3. Description keywords identified for subject keywords. (The original keywords were Japanese.)
Place name
  "Hungary": "Buda," "Matyas," "Budapest," "Danube," "Pest," "Magyar," "palace," "forint," "Hungarian," "Hungarian people"
  "Budapest": "fisherman," "parliament," "palace," "bastion," "goulash," "paprika," "Danube," "hero's square," "Matyas," "light up"
  "Iraq": "sanction," "Hussein," "Baghdad," "US-British army," "inspection," "gulf," "resolution," "principle," "political power," "destruction"
  "Kyoto": "guide," "free"
Persons' name, Group name
  "Koichi Tanaka": "desorption/ionization," "very," "biological macromolecule," "Mr. Tanaka," "engineer," "eccentric," "Nobel prize in chemistry," "Mr. Koichi Tanaka," "protein"
  "Einstein": "special relativity," "general relativity," "theory of relativity," "speed of light"
  "Graduate School of Informatics, Kyoto University": "disaster management information systems," "social information network," "symposium on social informatics," "social informatics model," "related page link," "Prof. Kariya," "fluency," "emergency management systems," "global information network," "distributed information systems"
  "Hanshin Tigers": "Irabu," "Igawa," "advance," "Trakey," "manager Hoshino," "Hanshin fan," "Imaoka," "agreement," "tiger," "Rokkou"
Common noun
  "SARS": "cough," "remedy," "washing hands," "fever," "difficulty in breathing," "severe acute respiratory syndrome," "ministry of health, labor and welfare," "seeing a doctor," "health center," "suspicious case"
  "quantum computer": "very," "teleportation," "quantum mechanism," "qubit," "factorization," "decipherment," "realization," "all over the world," "the present time"
  "environment": "hormone-disrupting chemicals," "environmental conservation," "waste," "recycle"
  "global warming": "greenhouse gas," "methane," "abnormal weather," "carbon dioxide," "drastic," "average temperature," "temperature," "Kyoto protocol," "change," "climate change"
keywords. This method seemed to work well enough for common nouns like “environment,” but it failed to find as many keywords for proper nouns or newly coined words. This could be because proper nouns, such as people names and place names, and newly coined words are described not only by other proper nouns or new
Fig. 3. Example of a subset relation (the pages with keyword B form a subset of the Web pages with keyword A)
words, but also by common nouns. In many cases, a common noun also describes other subjects, and the set of pages with this common noun cannot be a subset of the set of pages with a subject noun. For instance, the common noun "palace" describes the proper noun "Hungary." The king of Hungary's palace is a popular destination for tourists in Hungary, and many Web pages on Hungary mention it. However, many palaces exist in many countries, and the word "palace" appears in Web pages on other countries like "Czech." Therefore, as shown in Fig. 4, the subset relation method cannot find the word "palace." Given a proper noun or a new word as a subject, subset relations fail to find keywords describing common nouns. In contrast, our method can identify such keywords because it uses information on positions of keywords in documents. Words in a description of a vocabulary entry in a dictionary are typical words that detail the headword. We checked whether keywords found by our method also appear in descriptions of the subject keyword in Japanese dictionaries (http://dictionary.goo.ne.jp). Eight of the twelve subject keywords shown in Table 3 appear in such vocabulary entries. Only 20% of the description keywords for these eight keywords are also used in the dictionary descriptions. This might be because the Web and dictionaries describe the subject from different viewpoints. For example, most dictionaries describe names of countries in terms of their geography or history, while Web pages focus more on sightseeing or current affairs, reflecting users' interests. With our method, we can also find keywords for subjects not listed in dictionaries, and this is another advantage of using the Web as a source of information.
6
Related Work
As mentioned in Section 5.3, Sanderson and Croft [7] proposed extracting a concept hierarchy based on subset relations between sets of documents. One of
Table 4. Results based on subset relations. (The original keywords were Japanese.)
Place name
  "Hungary": "Hungarian," "Hungarian people," "forint," "Matyas"
  "Iraq": "Iraq issue," "Hussein administration," "Iraq war," "Iraq attack," "US-British army," "Hussein"
Persons' name, Group name
  "Koichi Tanaka": "Mr. Koichi Tanaka," "desorption/ionization"
  "Graduate School of Informatics, Kyoto University": "symposium on social informatics," "social informatics model," "Prof. Kariya," "social information network"
Common noun
  "environment": "environmental conservation," "environmental report," "waste," "hormone-disrupting chemicals," "global environment," "environmental activity," "environmental education," "environmental problems," "conservation," "recycle"
  "global warming": "battle against global warming," "combating global warming," "global warming issue," "greenhouse effect"
the problems in using subset relations for finding concept hierarchies is that exact subset relations rarely occur between a subject keyword of a proper noun and a description keyword of a common noun. On the other hand, if we make the threshold of p(A|B) too low, it becomes far from a subset relation, so the problem of extracting many irrelevant words arises. Glover et al. [8] proposed a method for determining parent, self, and child keywords for a set of Web pages, where self words describe the cluster and parent and child words describe more general and more specific concepts, respectively. Their method is based on the relative frequency of keywords in and out the cluster. They also use the textual contexts of links to the pages, called extended anchor text, to discover term relations. They do not distinguish keywords in different positions when counting occurrences. Our approach differs in that it explicitly uses information about where keywords appear in pages to determine relations between keywords. Related work on finding hyponyms from text has been done in natural language processing [9]. Lexico-syntactic patterns like “A, such as B” are used to find hypernym-hyponym pairs from a large collection of text documents. However, hyponyms found using these patterns are not always useful in Web searches
Fig. 4. Example where the subset relation does not hold (the word "palace" appears on Web pages about both "Hungary" and "Czech")
because they are not based on document frequencies, which means that searching with these words sometimes results in too few documents being found. Liu et al. [10] proposed using the Web for in-depth learning about a particular topic, like reading a book or surveying papers. Their method identifies subtopics and salient concepts of the topic. Then it finds and organizes informative pages containing descriptions of the topic and subtopics. We believe our technique can be used to support this type of application as well as to modify queries. Our method is a kind of query structuring. While previous research on query structuring has dealt with the problem of searching Web databases [11], ours is aimed at supporting general Web searches. We also developed a system that automatically formulates multiple queries by incorporating different topic structures into the user’s input keywords, finds queries that yield the most different search results, and displays the results to the user for comparison [12]. This helps users find queries with structures appropriate for their information needs.
7
Conclusion
We have proposed a method that identifies pairs of a subject word and a description word from Web pages and uses these relations to modify queries. We take into account the positions of words when counting their occurrences and apply statistical tests to examine the difference between co-occurrence rates. This enables us to find related words more robustly than existing methods that use simple subset relations. Our method can also identify subject keywords and description keywords in the user’s input and propose additional description keywords to be added to the query. Results of experiments demonstrated that it can find appropriate description keywords from a set of Web pages. Our future work includes evaluating the identified relations in user testing. Acknowledgements. This research was partially supported by Grants-in-Aid for Scientific Research (14019048 and 15017249) from the Ministry of Education,
Culture, Sports, Science and Technology of Japan, and by the 21st Century COE Program at the Graduate School of Informatics, Kyoto University.
References
1. Declan Butler. Souped-up search engines. Nature, 405:112–115, 2000.
2. Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
3. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
4. Satoshi Oyama and Toru Ishida. Applying association rules to information navigation. Systems and Computers in Japan, 34(4):12–20, 2003.
5. George Waddel Snedecor and William Gemmell Cochran. Statistical Methods. Iowa State University Press, 1989.
6. Frank Harary, Robert Z. Norman, and Dorwin Cartwright. Structural Models: An Introduction to the Theory of Directed Graphs. John Wiley & Sons, 1965.
7. Mark Sanderson and Bruce Croft. Deriving concept hierarchies from text. In Proceedings of the 22nd ACM SIGIR Conference (SIGIR'99), pages 206–213, 1999.
8. Eric Glover, David M. Pennock, Steve Lawrence, and Robert Krovetz. Inferring hierarchical descriptions. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM'02), pages 507–514, 2002.
9. Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING'92), pages 539–545, 1992.
10. Bing Liu, Chee Wee Chin, and Hwee Tou Ng. Mining topic-specific concepts and definitions on the web. In Proceedings of the 12th International Conference on World Wide Web (WWW2003), pages 251–260, 2003.
11. Pável Calado, Altigran Soares da Silva, Rodrigo C. Vieira, Alberto H. F. Laender, and Berthier A. Ribeiro-Neto. Searching web databases by structuring keyword-based queries. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM'02), pages 26–33, 2002.
12. Satoshi Oyama and Katsumi Tanaka. Exploiting document structures for comparing and exploring topics on the web. In The 12th International World Wide Web Conference (WWW2003), Poster Session, 2003.
Similarity Pattern Discovery Using Calendar Concept Hierarchy in Time Series Data
Sungbo Seo1, Long Jin1, Jun Wook Lee2, and Keun Ho Ryu1
1 Database Laboratory, Chungbuk National University, Cheongju, Korea {sbseo,kimlyong,khryu}@dblab.chungbuk.ac.kr http://dblab.chungbuk.ac.kr/index.html
2 Telematics Research Division, ETRI, Daejeon, Korea [email protected]
Abstract. Most existing approaches for similarity search did not consider applying a calendar concept hierarchy to search for similar patterns in time series data. In this paper, we present two techniques that capture scale-oriented features of time series and provide an analysis method for the multi-resolution view along the time dimension. In particular, we propose a similarity search that makes the most of the calendar concept hierarchy and involves three stages: data cube counting based on the time concept hierarchy, sequence division by time level, and feature vector extraction. Because these feature vectors are inserted into a multi-dimensional index, the pre-processing step executes only once at the beginning of the search process without adding considerable computing cost. Finally, we show that the proposed techniques find useful knowledge with low computational complexity and that the discovered rules can be applied to industrial fields.
view along the time dimension. First, we propose a similarity search using the calendar concept hierarchy. Pre-processing for this similarity search involves three stages: data cube counting based on the calendar concept hierarchy, sequence division by calendric granularity, and feature vector extraction. These feature vectors are inserted into a multi-dimensional index tree and are used in the similarity search. This pre-processing step is executed only once at the beginning of the search process, without additional computing cost. Second, we propose a similarity pattern discovery method for time series data, which involves a rule discovery algorithm and a query process based on the rule bases. Finally, through our experiments, we show that the proposed techniques find more useful knowledge with low computational complexity and can be applied to various applications, including weather forecasting, exchange rates, and so on.
2
Similarity Search Based on Time Concept Hierarchy
To consider a calendar concept hierarchy for similarity search, we define a time series based on a calendar concept hierarchy. We start with a definition of a calendar [1] in our model, followed by details on granularities [1].
Table 1. Summaries of symbols and definitions
G : calendric granularity
S_gt : a set of data sequences ( = {S^1_gt, S^2_gt, ..., S^m_gt}, gt ∈ G )
S^i_gt : a data sequence of length n ( = {s^i_gt[1], s^i_gt[2], ..., s^i_gt[n]}, 1 ≤ i ≤ m, gt ∈ G )
Len(S^i_gt) : length of sequence S^i_gt
S^ij_gt : the j-th disjoint window of sequence S^i_gt ( = S^i_gt[(j−1)∗ω+1 : j∗ω], j ≥ 1 )
S^i_gt[k] : the k-th entry of sequence S^i_gt ( 1 ≤ k ≤ Len(S^i_gt) )
S^i_gt[j:k] : a subsequence of S^i_gt, including entries from the j-th to the k-th (if j > k, it is a null sequence of length 0)
fC : conversion function, S^i_gt ⇒ S^i_gk, gt < gk, gt, gk ∈ G
The main symbols used in the paper and their definitions are summarized in Table 1. When there are two data sequences, we use a model similar to the Euclidean distance for measuring the similarity between them [4]. To analyze various aspects of time series data, we define a conversion function fC, which consists of aggregate functions such as SUM(), AVERAGE(), MIN(), MAX(), COUNT(), and so on. By using fC, we can compute data sequences of different calendric granularities. Many multi-dimensional indexing methods such as the R-tree, R+-tree and R*-tree scale exponentially with dimensionality, eventually
reduce the performance to that of sequential scanning or worse. Therefore, a transformation must be applied to map the data sequences to a new feature space of lower dimensionality. Examples of feature extraction functions are the DFT, FFT, DWT, and so on.
3
System Overview
Using the above definitions, the overall system framework for similarity search based on the calendar concept hierarchy is illustrated in Fig. 1. There are two frameworks: the similarity search framework and the rule discovery framework. The similarity search framework includes the pre-processing step, the similarity search step, and the post-processing step based on the calendar concept hierarchy. We give a detailed description of the rule discovery framework in Sect. 3.2.
3.1
Subsequence Matching Based on Calendar Concept Hierarchy
The procedure of multi-dimensional index creation is given in Algorithm 1, where we construct a record consisting of the feature point f-point, the S-index of the time series, the w-offset of the starting window, and the calendric granularity G_t.
Given a subsequence S^ij_gt, f-point is its f-dimensional feature vector, S-index = i, w-offset = j, and G_t = g_t.
Algorithm 1. Multi-dimensional index creation
Input: TDB: time series database; ω: the window size
Output: Index: a multi-dimensional index
Algorithm:
1. Initialize the Index.
2. Read the data sequences from TDB.
3. For each data sequence in TDB:
   a) Compute the data sequence of each calendric granularity G_i based on the calendar concept hierarchy.
   b) For the data sequences of each calendric granularity G_i:
      i) Divide the sequence into windows (sliding/disjoint windows) of size ω.
   c) For each divided window of each calendric granularity G_i:
      i) Transform it to an f-dimensional point f-point.
      ii) Construct a record <f-point, S-index, w-offset, G_t>.
      iii) Insert the record, whose key is f-point, into the index.
4. Return Index;
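A minimal Python sketch of Algorithm 1 follows. The assumptions of this sketch are ours: the conversion function fC is a simple average, the feature extractor is a piecewise mean (the paper lists DFT, FFT, and DWT as typical choices), and a plain list stands in for the multi-dimensional index tree, so only the record layout and the windowing are illustrated.

```python
def f_c_average(sequence, group_size):
    """Convert a sequence to a coarser calendric granularity by averaging;
    an incomplete trailing group is dropped."""
    return [sum(sequence[i:i + group_size]) / group_size
            for i in range(0, len(sequence) - group_size + 1, group_size)]

def features(window, f=4):
    """Map a window to an f-dimensional point using piecewise means
    (assumes len(window) is a multiple of f)."""
    step = len(window) // f
    return tuple(sum(window[i:i + step]) / step
                 for i in range(0, f * step, step))

def build_index(sequences_by_granularity, w=8):
    """sequences_by_granularity: {calendric granularity -> list of sequences}."""
    index = []
    for g_t, sequences in sequences_by_granularity.items():
        for s_index, seq in enumerate(sequences):
            # Disjoint windows of size w (a sliding window would use step 1).
            for w_offset in range(0, len(seq) - w + 1, w):
                window = seq[w_offset:w_offset + w]
                index.append({"f_point": features(window),
                              "S_index": s_index,
                              "w_offset": w_offset,
                              "G_t": g_t})
    return index
```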
In the similarity search step, when a user inputs a query sequence Q, a tolerance ε, and the required granularity level Gt, the query is processed as follows. First, the query sequence Q is pre-processed in the same way as in Algorithm 1. Second, we perform a range query with the f-point and tolerance ε on the search index. Next, we discard any returned result whose Gt is not consistent with the calendric granularity Gt given by the user, and construct a candidate set from the remaining results. Finally, false matches are discarded through post-processing of each entry S of the candidate set. In post-processing, we extract {S-index, w-offset, Gt} of S, use this information to read the corresponding time series data from the time series database TDB, and then compute the Euclidean distance between sequence S and subsequence Q to check their similarity [4].
3.2 Rule Discovery Based on Similarity Search
This process consists of a pre-processing part and a rule discovery part. The pre-processing step comprises two processes: extracting primary shapes and generating rule data sets. In the process of extracting primary shapes, we cluster the results of the similarity search according to the query valid time. We sort these clusters according to the number of subsequences belonging to each cluster and construct C = {C1, ..., Ck}. From C, we extract the clusters whose count is larger than the Minimum Shape Count and construct PS = {a0, ..., am} (0 ≤ m ≤ k), assigning an alphabet symbol (am) to each of them. Last, we delete the IDs of time series that do not belong to PS and construct LTID, the set of frequent time series, from the remaining ones. In the process of generating rule data sets, to discover calendar-based rules from the results of the similarity search, we use a calendar schema [13] that expresses various time granularities. Algorithm 2 shows the procedure of rule discovery from the results of similarity search on time series data.
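Before turning to Algorithm 2, the similarity search step described above can be sketched in code. The following Python fragment is only an illustration under assumed data structures: a plain list of <f-point, S-index, w-offset, Gt> records stands in for the R-tree, a piecewise-average transform stands in for the DWT feature extraction of Sect. 4, and all names (features, range_query, similarity_search, tdb) are hypothetical.

```python
import math

def features(window, f=4):
    """Map a window to an f-dimensional point (piecewise averages as a stand-in for DWT)."""
    step = max(1, len(window) // f)
    return [sum(window[i:i + step]) / len(window[i:i + step])
            for i in range(0, step * f, step)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(index, q_point, eps):
    """Linear scan standing in for an R-tree range query on feature points."""
    return [rec for rec in index if euclidean(rec["f_point"], q_point) <= eps]

def similarity_search(index, tdb, query, eps, gt, omega):
    q_point = features(query[:omega])                     # pre-process the query like Algorithm 1
    candidates = [rec for rec in range_query(index, q_point, eps)
                  if rec["gt"] == gt]                     # keep only the requested granularity
    results = []
    for rec in candidates:                                # post-processing against the raw data
        s = tdb[rec["gt"]][rec["s_index"]]
        start = (rec["w_offset"] - 1) * omega             # w-offset is 1-based, as in the paper
        window = s[start:start + omega]
        if len(window) == omega and euclidean(window, query[:omega]) <= eps:
            results.append(rec)
    return results
```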
Algorithm 2. Rule discovery based on similarity search
for each Ti ∈ Γ do
    for each (aj, e0) ∈ Ti, aj ∈ PS do
        ps[j] = ps[j] ∪ e0;
for each star pattern e that covers ps[j].e0 do
    count the number m of entries in ps[j] covered by e;
    if (m ≥ min_supp) then L1(e) = L1(e) ∪ Ti;
end
for (k = 2; ∃ a star calendar pattern e such that Lk−1(e) ≠ ∅; k++) do
    for each star pattern e included in Lk−1(e) do
        generate candidates Ck(e);
        for each c ∈ Ck(e) do
            count c.count in Lk−1(e);
            if (c.count ≥ min_supp) then Lk(e) = Lk(e) ∪ c;
end
The first part is the discovery process for the L1 rules. To discover large 1-item rules, we count the occurrences of each time series in one star-pattern interval and regard them as rules only if their frequency is larger than the minimum support (min_supp). The second part is the process of discovering rules Lk (k ≥ 2). This process includes a step that generates the candidate set Ck and a step that discovers the rules Lk. To discover k-item rules, we count how many of the k similar time series occur in one star-pattern interval and regard them as rules only if their frequency is larger than the minimum support. The incremental rule process consists of an insertion rule process and an automatic rule update process. Both processes store the discovered rules in the time series knowledge base, but the second one periodically checks the query state and then automatically discovers the rules for other times. The user query process is as follows. When a user issues a query, we first check the query state of the knowledge base. If the query state already contains the corresponding rules, we retrieve them and return them to the user. Otherwise, the user query process performs the similarity search and then discovers the rules from the results; we update the rules in the knowledge base and return them to the user.
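The level-wise counting in Algorithm 2 follows the familiar Apriori pattern. The sketch below is a simplification under assumed data structures (each star-pattern interval is treated as a transaction containing the symbols of the similar time series found in it) and it omits the calendar-schema handling of [13]; the function name is hypothetical.

```python
from itertools import combinations

def level_wise_rules(intervals, min_supp):
    """intervals: list of sets of time-series symbols observed in one star-pattern interval."""
    # L1: symbols that are frequent across intervals
    counts = {}
    for trans in intervals:
        for sym in trans:
            counts[frozenset([sym])] = counts.get(frozenset([sym]), 0) + 1
    L = {1: {s for s, c in counts.items() if c >= min_supp}}
    k = 2
    while L[k - 1]:
        # candidate generation: join (k-1)-itemsets whose union has size k
        candidates = {a | b for a, b in combinations(L[k - 1], 2) if len(a | b) == k}
        counts = {c: sum(1 for trans in intervals if c <= trans) for c in candidates}
        L[k] = {c for c, n in counts.items() if n >= min_supp}
        k += 1
    return {level: items for level, items in L.items() if items}
```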
4 Experiment and Evaluation
We conducted all experiments on a Windows 2000 Server desktop with a 500 MHz Pentium PC and 128 MB of main memory. We selected a stock application for the experiments, using Nasdaq stock data. In the implementation of similarity search based on the calendar concept hierarchy, we used DualMatch for sequence division, the Euclidean distance as the distance function between two time series, the DWT for feature vector extraction, and an R-tree as the multi-dimensional index. In our experiments, we measured the performance with respect to the average length of the data sequences and the tolerance. Fig. 2 shows that the query processing time increases linearly as the average size of the data sequences increases. Fig. 3 shows that the query processing time increases linearly as the tolerance increases.
Fig. 2. Query processing time by the average size of data sequence
Fig. 3. Results count of similarity search
Fig. 4. Query processing time by tolerance
Fig. 5. Number of generated rules by calendric granularities
Fig. 4 shows that similarity search based on the concept hierarchy finds more similar sequences than similarity search based on a single level. Thus, using the calendar concept hierarchy we can analyze various aspects of time series data and provide users with more useful knowledge. Fig. 5 shows that the rule count at the month time level is larger than at the week time level, because in the real world the rules at the month granularity level are more general and useful than those at the week granularity level.
5 Conclusion and Future Work
In this paper, we first proposed a similarity search using a calendar concept hierarchy. By considering the calendar concept hierarchy, we were able to analyze various aspects of time series data and thus provide users with more useful knowledge. Second, we proposed a similarity pattern discovery method for time series data. When the user issues a query, similarity pattern discovery retrieves knowledge from the knowledge base instead of searching the time series database, and thus provides useful knowledge more quickly. In future work, we plan to apply the proposed algorithms to real application systems and evaluate the usefulness of the various similarity pattern rules.
Acknowledgment. This work was supported by the RRC of MOST and KOSEF as well as ETRI in Korea.
References
1. Iqbal A. Goralwalla, Yuri Leontiev, M. Tamer Ozsu, Duane Szafron, "Temporal Granularity for Unanchored Temporal Data," In Proc. of the 7th International Conference on Information and Knowledge Management, pp. 414-423, 1998.
2. R. Snodgrass, I. Ahn, "Temporal Databases," IEEE Computer, Vol. 19, No. 9, pp. 35-42, September 1986.
3. Gautam Das, King-Ip Lin, Heikki Mannila, Gopal Renganathan, Padhraic Smyth, "Rule Discovery from Time Series," In Proc. of the 4th International Conference on Knowledge Discovery and Data Mining, 1998.
4. Christos Faloutsos, M. Ranganathan, Yannis Manolopoulos, "Fast Subsequence Matching in Time-Series Databases," In Proc. of the 1994 ACM SIGMOD International Conference on Management of Data, 1994.
5. Jun Wook Lee, Yong Joon Lee, Hey Kyu Kim, Bu Hun Hwang, Keun Ho Ryu, "Discovering Temporal Relation Rules Mining from Interval Data," EurAsia-ICT, 2002.
Loss Minimization Based Keyword Distillation* Jie Tang, Juan-Zi Li, Ke-Hong Wang, and Yue-Ru Cai Knowledge Engineering Group, Department of Computer, Tsinghua University, P.R.China, 100084 [email protected], [email protected]
Abstract. Keywords provide rich semantic information for documents, which benefits many applications such as topic retrieval and document clustering. However, a large number of documents still lack keywords, and manually assigning keywords to existing documents is very laborious; it is therefore highly desirable to automate the process. Traditional methods are mainly based on a predefined controlled vocabulary, which is limited by unknown words. This paper presents a new approach based on Bayesian decision theory. The approach casts keyword distillation as a problem of loss minimization, so determining which words can be assigned as keywords becomes a problem of estimating the loss. Feature selection is one of the most important issues in machine learning. Several plausible attributes are usually chosen as learning features, but they all rest on the assumption of word independence, and machine learning based on them does not produce satisfactory results. In this paper, taking words' context and the linkages between words into account, we extend the work on feature selection. Experiments show that our approach significantly improves the quality of the extracted keywords.
1 Introduction
In recent years, the high volume of information has made effectively and efficiently managing online information a significant challenge, often called information overload. Keyword distillation is a very useful technique for addressing this problem. Keywords give a high-level description of a document's contents, which helps prospective readers decide whether or not it is relevant. Keywords summarize documents very concisely; they can be used as an alternative to the document's full content for document clustering at lower cost. Keywords also provide an opportunity to improve the performance of information retrieval. Unfortunately, only a minor part of the documents on the whole web are assigned keywords, and manually assigning keywords to existing documents is tedious. Thus, automatic processes are of interest. The traditional method is based on a predefined controlled vocabulary, i.e. the categories [1]; it is limited by the predefined controlled vocabulary, which cannot handle unknown phrases. The second approach, keyword extraction [2, 3, 4], does not restrict possible keywords to a selected vocabulary. Using machine learning, keywords are distinguished from words that are not. However, these approaches are all based on the assumption that phrases are independent and therefore do not produce satisfactory performance. This paper presents a new approach, LMKD (Loss Minimization based Keyword Distillation), which distills keywords from documents based on Bayesian decision theory. The rest of the paper is structured as follows. Section 2 surveys the related work. Section 3 describes our method LMKD. The evaluation and experiments are given in Section 4. Finally, we conclude the paper with a discussion.
* Supported by the National Natural Science Foundation of China under Grant No. 60443002.
2 Related Works
Turney describes a system for keyword extraction, GenEx, based on a set of parametrized heuristic rules that are fine-tuned using a genetic algorithm [2]. The genetic algorithm optimizes the number of correctly identified keywords in the training documents by adjusting the rules' parameters. Turney compared GenEx to the straightforward application of a standard machine learning technique, i.e. bagged decision trees, and concluded that it gives superior performance. He also showed an important property: GenEx can successfully extract keywords from web pages on a different topic when trained on a collection of journal articles. Eibe Frank et al. propose a keyphrase extraction algorithm, KEA, based on the naïve Bayes machine learning technique [3, 5]. Compared to GenEx, its performance can be boosted significantly if the training and testing documents are from the same domain. Unlike the lengthy genetic algorithm, KEA lowers the cost of training and extraction. In KEA, two features, TF*IDF and the distance of a phrase, are calculated as the features of a word. These two features both ignore possible dependencies between words. However, words' senses are often expressed by their contexts and linkages with other words. For example, analysis reveals that keywords are more likely to appear in contexts with a higher "contribution" (e.g. TF*IDF) to the document. Some other research focuses on integrating existing methods to improve keyword extraction performance [6].
3 LMKD—Loss Minimization Based Keyword Distillation
We propose a loss minimization model to distill keywords from documents based on Bayesian decision theory. Considering the task of picking a word set as keywords for a document, each choice is associated with a loss; the problem is then cast as selecting a subset of words with minimal loss. Machine learning provides off-the-shelf tools for this kind of situation [7]. In our setting, the learning problem is to find a mapping from examples to the two classes "keyword" and "not-keyword". The distillation is therefore divided into two stages: training and distillation. In the first stage the learning component generates the model from the given training data. In the second stage it is applied to unlabeled data, i.e. to distill keywords from new documents. In machine learning, feature selection is one of the most important steps. Traditional approaches are all based on two assumptions. First, the probability of one word being assigned as a keyword is assumed to be independent of the others. Second, the keyword-probability of a word is assumed to reflect its semantic level with respect to
the document. However, in the real world these assumptions do not always hold. In our method, we make full use of the dependencies between words, i.e. words' context and linkage.
3.1 Distillation as a Decision Problem
Bayesian decision theory provides a solid theoretical foundation for thinking about problems of action and inference under uncertainty [8]. In our setting, the task is to select an optimal subset of words/phrases as keywords. Whether a word/phrase is a keyword depends on four features: mutual information mi, word context wc, word linkage wl, and first position in the document fp. Therefore, the expected loss of selecting a word set W is given by
R(W | DC, D) = ∫_Θ L(W, θ, D, DC) p(θ | DC, D) dθ

where DC is the document collection and D is the document in question. θ ≡ {θ_i}_{i=1}^N is the model of whether each word/phrase is a keyword, and the posterior distribution is given by

p(θ | DC, D) ∝ Π_{i=1}^N p(θ_i | mi, wc, fp, wl)
In this case, the loss minimization framework leads to the following form:

W* = argmin_W R(W | DC, D_j)

that is, W* is selected as the keyword set. We choose the loss function

L(W, θ, D, DC) = Σ_{w∈W} −δ(w, D)
where δ(w, D) = 1 if the word w is a keyword and δ(w, D) = −1 otherwise. According to naïve Bayes theory, the probability of a word being a keyword given its four features is

p(y | mi, wc, fp, wl) = p(mi | y) p(wc | y) p(fp | y) p(wl_a | y) p(wl_h | y) p(y) / (p(y) + p(n))

where p(y) = Y / (Y + N) is the probability of the positive instances in the training corpora, and Y and N are the numbers of positive and negative instances respectively. In this way, candidates are ranked by their loss, and several pruning steps are then applied to them [9].
3.2 Feature Definitions
Four features are selected to represent a word: mutual information, first position in the document, word context weight, and word linkage. Their definitions are given below.
1. Mutual Information. Mutual information is a standard metric in information processing. It is designed to measure how much information a word provides about a given document, and vice versa. It is defined as:
w_ij = (tf_ij / NN) / [(Σ_i tf_ij / NN) · (Σ_j tf_ij / NN)] × factor

where tf_ij is the number of times word i occurs in document j, NN = Σ_i Σ_j tf_ij, and factor is used to balance the bias toward infrequent candidates.
2. First Position. The first occurrence is calculated as the number of words that precede the candidate's first appearance divided by the total number of words. The value is a real number in (0, 1).
3. Word Linkage. Word linkage is also an important feature. We give two linkage definitions: linkage authority and linkage hub. Linkage authority denotes how many words modify the word; linkage hub denotes how many words are modified by the word. We therefore define the linkages as:
wl_h = (freq(w_i, ∀) / count(∀, ∀)) × (−log(df(w_i, ∀) / N))
wl_a = (freq(∀, w_i) / count(∀, ∀)) × (−log(df(∀, w_i) / N))

where wl_h and wl_a represent the modifying relationship and the modified relationship respectively; freq(w_i, ∀) is the number of words modified by w_i; df(w_i, ∀) is the number of documents containing the modifier relation freq(w_i, ∀); count(∀, ∀) is the total number of modifier relationships; and N is the size of the corpora.
4. Word Context Weight. A word's context can be defined as a window of words before/after the word. In our experiments, we define the context as twenty words, i.e. ten words before and ten words after. The word context weight is then defined as the average TF*IDF of the words in its context.
3.3 Algorithm
The keyword distillation algorithm has two stages, i.e. training and distillation.
1. Preprocessing. This process includes three steps. First, input documents are filtered to regularize the text and determine the candidate words/phrases; this includes word segmentation, stoplist removal, and stemming. The frequency of each candidate is counted, which is necessary for the next step. For word segmentation, we adopt a tri-gram strategy. Second, the scores of the four features are calculated for each candidate. Finally, all features are discretized before machine learning.
2. Training. To train the Bayes model it is necessary to compute the prior distribution of each attribute, that is, p(mi | y), p(wc | y), p(fp | y), p(wl_a | y), p(wl_h | y), and p(y), p(n). With this, the training stage is accomplished.
3. Distillation. In the distillation stage, the test collection is first fed to the same processing components; then, for each candidate, its posterior distribution is computed. Based on Bayesian decision theory, the subset with minimal loss is selected as the keywords.
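As an illustration of the distillation step, a minimal sketch is given below. It assumes the four features have already been discretized; the function names and the dictionary layout of the trained priors are hypothetical, and ranking candidates by the naïve Bayes score stands in for the full loss-minimization and pruning steps of [9].

```python
def keyword_score(candidate_feats, priors):
    """candidate_feats: dict with discretized 'mi', 'wc', 'fp', 'wl_a', 'wl_h' values.
    priors: conditional tables p(feature value | y) plus class priors p_y, p_n."""
    score = priors["p_y"]
    for name in ("mi", "wc", "fp", "wl_a", "wl_h"):
        score *= priors[name + "|y"].get(candidate_feats[name], 1e-6)  # unseen value -> small prob
    return score / (priors["p_y"] + priors["p_n"])

def distill(candidates, priors, k=5):
    """Rank candidates by score and keep the top k as keywords."""
    ranked = sorted(candidates, key=lambda c: keyword_score(c["features"], priors), reverse=True)
    return [c["word"] for c in ranked[:k]]
```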
4 Experiments and Evaluation
In this section we describe the evaluation and the experiments. Note that in a document some words/phrases may reasonably be assigned as keywords or not. To balance this bias, we expand the manually assigned keyword set for each document, but distill only five keywords with LMKD for evaluation.
Table 1. Experiments Set

Docs              Keywords/doc
Journal articles  8.5 (expanded)
CRANFIELD         6
Reuters-21578     4.5
Average           6.2

Table 2. Experiments results and evaluation (5 distilled)

Docs              matched   Precision(%)   Recall(%)
Journal articles  4.5       95             51.8
CRANFIELD         3.5       70             58.3
Reuters-21578     3         60             66.7
Average           3.6       72             58.1

Fig. 1. Precision-Recall Evaluation
The experiments are based on three different document collections: 200 journal articles arbitrarily downloaded from CiteSeer, the CRANFIELD collection of aeronautical system papers, and Distribution 1.0 of the Reuters-21578 text categorization test collection. For the journal articles, keywords were assigned by the authors, and we simply expanded the keyword sets with other plausible words/phrases. For the other two collections, we selected 180 and 240 documents respectively and assigned keywords to each document manually. Table 1 lists the experimental setup, and Table 2 shows the experimental results and their evaluation. A further experiment is based on the non-expanded journal articles; Figure 1 compares the expanded and non-expanded settings. We believe the expanded experiment gives more reasonable results, since it is difficult, even for the author, to determine whether keywords have been assigned correctly and completely.
5 Conclusions
The main contribution of this paper includes two aspects: proposing a new approach based on loss minimization for keyword distillation, and mining new features, viz. word context and linkages between words, to represent words, which previous research has largely ignored. Experiments show a significant improvement in precision. Building on this paper, we will improve on the following aspects: (1) refinement of the results, since some noise words are still distilled as keywords and it is necessary to provide the ability to refine them; (2) a more practicable evaluation measure.
References
1. S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, 1998.
2. P. D. Turney. Learning to extract keyphrases from text. Technical Report ERB-1057, National Research Council, Institute for Information Technology, 1999.
3. Eibe Frank, Gordon W. Paynter, and Ian H. Witten. Domain-Specific Keyphrase Extraction. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden, Morgan Kaufmann, 1999: 668-673.
4. Daille, B., Gaussier, E., and Langé, J.-M. Towards Automatic Extraction of Monolingual and Bilingual Terminology. In Proceedings of COLING-94, 1994: 515-521.
5. Ian H. Witten, Gordon W. Paynter, et al. KEA: Practical Automatic Keyphrase Extraction. In Fourth ACM Conference on Digital Libraries Proceedings, Berkeley, CA, 1999: 254-255.
6. Jorge Vivaldi and Horacio Rodríguez. Improving term extraction by combining different techniques. Terminology 7:1 (2001).
7. Quinlan, J. R. C4.5: Programs for Machine Learning. California: Morgan Kaufmann, 1993.
8. Berger, J. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, 1985.
9. Tang Jie. Loss Minimization based Keyword Distillation. Technical Report. ftp://keg.cs.tsinghua.edu.cn/publications, 2003.
Using Maximum Entropy Model for Chinese Text Categorization Ronglu Li, Xiaopeng Tao, Lei Tang, and Yunfa Hu Computer Technology and Information Department, Fudan University, 200433 Shanghai, China [email protected], {xptao,yfhu}@fudan.edu.cn, [email protected]
Abstract. The maximum entropy model is a probability estimation technique widely used for a variety of natural language tasks. It offers a clean and flexible framework for combining diverse pieces of contextual information to estimate the probability of certain linguistic phenomena. For many NLP tasks this approach performs near the state-of-the-art level, or outperforms other competing probabilistic methods when trained and tested under similar conditions. In this paper, we use the maximum entropy model for text categorization. We compare and analyze its categorization performance using different approaches to text feature generation, different numbers of features, and a smoothing technique. Moreover, in experiments we compare it to Bayes, KNN, and SVM, and show that its performance is higher than Bayes and comparable with KNN and SVM. We consider it a promising technique for text categorization.
1 Introduction
With the rapid growth of the World Wide Web and the steady accumulation of online document information, text categorization has become one of the key techniques for handling and organizing text data. A comparative evaluation of a wide range of text categorization methods was conducted by Yang [1] on a standard English corpus, with the conclusion that KNN and SVM have better performance than other classifiers in both accuracy and stability. In this paper, we propose the use of maximum entropy techniques for text classification. Since the 1990s, when people began to use maximum entropy for processing large-scale documents, more and more researchers have become interested in this technique for its flexibility. It can be used to construct probability models for a wide range of natural language problems by synthesizing all the relevant and irrelevant facts, and it has been shown to be a viable and competitive algorithm in these domains. Maximum entropy has also been widely used in a variety of natural language tasks, including word segmentation, part-of-speech tagging, word sense disambiguation, machine translation, and so on. However, relatively little work has been done on applying the maximum entropy model to text categorization problems. In addition, to the best of our knowledge, no previous work has focused on using the maximum entropy model to classify Chinese documents. Adwait Ratnaparkhi [2] first introduced maximum entropy to text
classification and compared it to decision trees using both the ME DEFAULT and ME IFS methods. Unfortunately, the features used in his experiments are binary features, which may be sufficient in sentence-level applications. But when it comes to a document-level application like text classification, binary features are not enough to capture sufficient semantic information, since the contribution of a certain word to the whole document cannot be measured by its mere presence in the text, but rather by the corresponding word frequency. Kamal Nigam [3] used word frequency to compute the value of the feature function, investigated the maximum entropy model for text classification, and compared it to the Bayes model. However, the research done by Adwait Ratnaparkhi and Kamal Nigam has the following deficiencies.
– They did not use a smoothing technique.
– No comparison of the maximum entropy model with KNN and SVM has been made.
– The effect of the approach to text feature generation was not taken into account.
In this paper, n-grams and word segmentation are used respectively to generate text features. We compare the maximum entropy model for text categorization with three other approaches, Bayes, KNN, and SVM, and give a detailed analysis. In addition, we use Absolute Discounting to smooth the features. We found that the precision of classification improved to some extent after the smoothing technique was applied.
2 Maximum Entropy Model
The maximum entropy model is used for probability estimation. Assume that a is an event and b is the context in which a happens; we want to know the joint probability of a and b, denoted by p(a, b). Provided that A is the set of all events and B is the set of all contexts, the question becomes: what is p(a, b) for any a ∈ A, b ∈ B? Let us confine the question to the field of text categorization: an event is that a document is assigned to a certain class, and the words in the document can be considered the context of the event. We want to know the probability that a document including word b belongs to a certain class a. It is natural to calculate this from a training corpus. Given a training set, A = {a1, a2, ..., am} is the set of classes and B = {b1, b2, ..., bn} is the set of feature words of a document. If a word bj appears in a document of class ai, denote it as a tuple (ai, bj). Writing num(ai, bj) for the occurrence frequency of tuple (ai, bj), we can estimate the probability as follows.

p̃(ai, bj) = num(ai, bj) / Σ_{i=1}^{m} Σ_{j=1}^{n} num(ai, bj)    (1)
But there is a problem: the sparse evidence problem. Even given massive training documents, many tuples (ai, bj) may never occur, and obviously it is not proper to assume their probabilities to be 0. Maximum entropy, which makes the unknown events distribute as uniformly as possible in order to achieve maximum entropy, is the key to this problem. Take a three-class classifier concerning military affairs, politics, and science & technology as an example. Suppose we know that 80% of the documents containing the word airplane are labeled with the class military affairs, but the probability distribution of this word over the remaining two classes is unknown. According to the rule of maximum entropy, we believe that when a document contains the word airplane, it belongs to the class military affairs with probability 0.8 and to each of the other two classes with probability 0.1; when the word airplane does not appear in a document, the document belongs to any class with the same probability 1/3. In other words, the distribution over the unknown events should be as uniform as possible. According to Shannon's definition, the entropy can be calculated as follows.

H(p) = − Σ_{a,b} p̃(b) p(a|b) log p(a|b)    (2)
Then, the probability distribution that satisfies maximum entropy is

p* = argmax_{p∈P} H(p)    (3)

If there is no prior knowledge, then by the property of maximum entropy, Eq.(3) is satisfied when

p(a|b) = 1/|A|    (4)

because Σ_{a∈A} p(a|b) = 1. Though the training corpus cannot give the probabilities of all tuples (ai, bj), it does provide the probabilities of some tuples, or conditions that certain probabilities must satisfy. The problem then becomes finding the maximum entropy under incomplete information, i.e. optimization under constraints. How can this incomplete information be represented? The feature function, sometimes called a feature for short, is introduced. A feature function is usually a binary function f(a, b) → {0, 1}. For instance, for the text categorization problem above, we can define a feature function as follows.

f(a, b) = 1 if (a = military affairs) ∧ (b = airplane), and 0 otherwise.

For feature function fi, its expectation under the empirical probability distribution p̃(a, b) is

E_p̃ fi = Σ_{a,b} p̃(a, b) fi(a, b)    (5)
For the model p(a|b), the expectation is

E_p fi = Σ_{a,b} p̃(b) p(a|b) fi(a, b)    (6)
On the training set, the two expectations should be equal: E_p fi = E_p̃ fi, i.e.

Σ_{a,b} p̃(a, b) fi(a, b) = Σ_{a,b} p̃(b) p(a|b) fi(a, b)    (7)

We call Eq.(7) a constraint. Obviously, we can define many other similar feature functions. They can be independent; even the aspects of the problem they describe can be totally different, and their granularity can range from small to large. In a word, feature functions can combine many discrete or fragmentary pieces of knowledge to fulfill one task. Given k feature functions f1, f2, ..., fk, we get k constraints.

E_p fi = E_p̃ fi,  i = 1, 2, ..., k    (8)
Then our problem becomes an optimization problem under a group of constraints.

P = {p | E_p fi = E_p̃ fi, i = 1, 2, ..., k},  p* = argmax_{p∈P} H(p)    (9)

The typical approach is to use the Lagrange multiplier method. Here, we just give the solution; p* is as follows.

p*(a|b) = (1/π(b)) exp( Σ_{i=1}^{k} λi fi(a, b) )    (10)

Here, π(b) is a normalization factor:

π(b) = Σ_a exp( Σ_{i=1}^{k} λi fi(a, b) )    (11)
λi is a parameter representing the weight of feature function fi. After training, we obtain the values of λi as well as the probability distribution, completing the construction of the maximum entropy model. Given that |A| is the number of elements in the event set A and k is the number of different feature functions, it follows from Eq.(10) that the time complexity of evaluating the maximum entropy model is O(k|A|). In order to construct the maximum entropy model, we must estimate the parameters λi. We use the GIS algorithm [4]. Assuming that N is the number of samples in the training set and |A| is the number of events, if the algorithm converges after P iterations, its time complexity is O(NP|A|).
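The following sketch illustrates one way the λi could be estimated from Eqs.(5)–(11). It is a simplified GIS-style update that assumes the feature sums are bounded by a constant C and omits the usual correction feature; the helper names (samples, feats) are assumptions, not part of the paper.

```python
import math
from collections import defaultdict

def gis_train(samples, classes, feats, iterations=100):
    """samples: list of (context b, true class a); feats(a, b) -> dict {feature_id: value}.
    Returns the weights lambda_i of Eq.(10)."""
    n = len(samples)
    C = max(sum(feats(a, b).values()) for b, a in samples)        # GIS slack constant
    emp = defaultdict(float)                                      # empirical expectations, Eq.(5)
    for b, a in samples:
        for i, v in feats(a, b).items():
            emp[i] += v / n
    lam = defaultdict(float)
    for _ in range(iterations):
        model = defaultdict(float)                                # model expectations, Eq.(6)
        for b, _ in samples:
            scores = {a: math.exp(sum(lam[i] * v for i, v in feats(a, b).items()))
                      for a in classes}
            z = sum(scores.values())                              # pi(b), Eq.(11)
            for a in classes:
                p = scores[a] / z                                 # p*(a|b), Eq.(10)
                for i, v in feats(a, b).items():
                    model[i] += p * v / n
        for i in emp:                                             # GIS-style update
            if model[i] > 0:
                lam[i] += math.log(emp[i] / model[i]) / C
    return lam
```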
3 Text Categorization Based on Maximum Entropy Model
Label the event set A as the set of classes and the context set B as the set of documents; we can then calculate the probability p(aj|bi) that any document bi ∈ B is of any class aj ∈ A by Eq.(10). If a document can belong to only one class, argmax_j p(aj|bi) is the class of document bi. Otherwise, introducing a threshold ε, document bi belongs to all classes that satisfy p(aj|bi) > ε. For text categorization, the most important issue is then to select the feature functions. Generally, feature functions are binary, which is limiting for document-level applications. So we select the pair (word, class) as a feature, whose value is the word frequency. For word w and class a′, the feature function is as follows.

f_{w,a′}(a, b) = num(b, w) if a = a′, and 0 otherwise.    (12)

Here, num(b, w) is the frequency of word w in document b. We know from Eq.(10) that the time complexity of the maximum entropy model is O(k|A|). For text categorization, the features of every document are very sparse, usually fewer than 1/10 of the total number of features k; therefore the maximum entropy model is very fast. Because the features of one document are usually very sparse, the value of num(b, w) in most of the feature functions is 0. To handle this case, we can use a smoothing technique. So far there is no smoothing technique specialized for the maximum entropy model, hence we adopt Absolute Discounting [5]. Absolute Discounting is a smoothing technique which discounts the probabilities of events seen in the model and distributes the discount uniformly over the probabilities of the unseen events. Since the value of the feature function is a word frequency, we are not constrained to make the probabilities sum to 1 when we discount the frequencies; we simply add a constant d to the word frequency num(b, w) of all feature functions whose num(b, w) is equal to 0. After Absolute Discounting, feature function (12) becomes

f_{w,a′}(a, b) = num(b, w) if num(b, w) ≠ 0, and d if num(b, w) = 0.    (13)

Here, we use d = 0.1.
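A minimal sketch of this classification step, under assumed data structures (a dictionary lam of trained weights keyed by (word, class) pairs and a per-document word-frequency dictionary; the names are hypothetical), might look as follows.

```python
import math

def feature_value(num_bw, d=0.1):
    """Frequency-based feature value with Absolute Discounting, as in Eqs.(12)-(13):
    the raw count if the word occurs, otherwise the small constant d."""
    return num_bw if num_bw != 0 else d

def classify(doc_counts, classes, vocab, lam, eps=None):
    """doc_counts: {word: frequency} for one document; lam[(word, class)]: trained weights.
    Returns the single best class, or all classes above threshold eps if eps is given."""
    scores = {}
    for a in classes:
        s = sum(lam.get((w, a), 0.0) * feature_value(doc_counts.get(w, 0))
                for w in vocab)
        scores[a] = math.exp(s)
    z = sum(scores.values())                      # normalization factor pi(b), Eq.(11)
    probs = {a: scores[a] / z for a in classes}
    if eps is None:
        return max(probs, key=probs.get)
    return [a for a, p in probs.items() if p > eps]
```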
4 The Feature Generation Method for Chinese Text
The main difference between Chinese and Western-language text categorization lies in the way document features are generated. For a Western language like English, if words are used as document features, then no word segmentation is required since each word is separated by white space, but stemming is usually required.
If N-Gram terms are used as document features, we can use either a character or a word as the N-Gram unit. For Chinese text categorization, using either words or N-Gram terms as document features is applicable; however, the granularity of N-Gram terms in Chinese may differ from English, and Chinese word segmentation is still an open problem. In order to evaluate the performance of the maximum entropy model for text categorization, we use both word segmentation and N-Grams to generate document features. We adopt the Association-Retrospect algorithm for word segmentation. The method for N-Gram term generation is illustrated in the following paragraphs.
Suppose the training corpus D has N_D documents, every document contains N_S sentences, and the average sentence length is L_S; then the training corpus contains as many as N_D · N_S · L_S(L_S + 1)/2 N-Gram terms. As we can see, the number of N-Grams in the documents is huge, so we have to make a selection from all the N-Gram terms when using N-Gram features for text categorization. On the other hand, text categorization is semantics-oriented, requiring that document features represent the semantics of the document as well as possible. Of course, not all the N-Gram terms in one document are useful for categorization. The utility, or rather the discriminative power, of an N-Gram for categorization can be measured by the following three quantities: frequency, spread ratio, and convergence ratio. Their definitions are as follows.
Definition 1. In document d, the frequency tf of N-Gram term t is the number of times it appears in d.
Definition 2. In document class c, the spread ratio of N-Gram term t is measured by the number of documents that belong to class c and contain N-Gram term t, denoted by df. The larger df is, the more dispersed t is in c.
Definition 3. In document set D, the convergence ratio of N-Gram term t is measured by the number of classes whose documents contain N-Gram term t, denoted by cf. The smaller cf is, the more concentrated t is in D.
From the definitions above we can see that the more frequent an N-Gram term t is in a document, the more dispersed it is within a class, and the more concentrated it is in the training documents, the more useful or discriminative t is. Yet there is no excellent method for synthesizing the three factors to obtain the optimal classification result. In order to avoid generating unnecessary N-Gram terms, the selected N-Gram terms must satisfy the following constraints.
Constraint 1. Given a minimum frequency min-tf, an N-Gram term t in document d is selected only if tf ≥ min-tf.
Constraint 2. Given a minimum spread ratio min-df, an N-Gram term t in class c is selected only if df ≥ min-df.
In our experiments, both min-tf and min-df are set to 2. One direct method is to scan the documents just once and select in one pass all the N-Gram terms that satisfy the two constraints above. Because it scans all the documents only once, this approach is efficient for a small training corpus.
But for a large corpus, a large amount of memory is required; otherwise, a lot of data transfer between disk and memory will occur. Here, we adopt a step-by-step method. First, we select the eligible 1-Gram terms. Then we construct 2-Gram terms based on the selected 1-Gram terms and delete the ineligible ones, obtaining the required 2-Gram terms. Similarly, we can select the other N-Gram terms. To state the algorithm for generating N-Gram features, we first give one definition and one lemma about sub-terms.
Definition 4 (sub-term). Given an i-Gram and a j-Gram (i ≥ j), if the j-Gram is contained in the i-Gram, the j-Gram is called a sub-term of the i-Gram, denoted by j-Gram ⊆ i-Gram.
Lemma 1. If an i-Gram satisfies Constraint 1 and Constraint 2, all sub-terms of the i-Gram satisfy both constraints.
Algorithm 1. N-Gram term generation algorithm
Input: document set D, min-tf, min-df, and constant N.
Output: set S of all N-Gram terms that satisfy Constraint 1 and Constraint 2.
1. Calculate the set of 1-Gram features S1: scan the documents in D, select all 1-Gram terms, delete the ones that do not satisfy Constraint 1 or 2, and obtain the set S1 of 1-Gram features.
2. Calculate the set of 2-Gram features S2: combine pairs of 1-Gram terms in S1 to get the 2-Gram candidate set C2, delete the 2-Gram terms that do not satisfy Constraint 1 or 2, and obtain the set S2 of 2-Gram features.
3. Calculate the set of i-Gram features Si (i = 3, ..., N).
   a) Calculate the i-Gram candidate set Ci.
   b) Given any two terms tm and tn in Si−1, if tm(k + 1) = tn(k), then Ci = Ci ∪ {tm tn(i − 1)}. Here, tm(k) and tn(k) (k = 1, ..., i − 1) denote the k-th character of tm and tn respectively.
   c) Delete all the ineligible terms that do not satisfy Constraints 1 and 2, and obtain the i-Gram set Si.
Using N-Grams for text categorization, the primary requirement is that the selected N-Gram features cover most of the words in the documents. Hence, it is not true that the more N-Gram terms, the better; we must decide how to choose the parameter N. According to statistics on the number of characters that make up a word in Chinese text, the majority of words are 1-Grams, 2-Grams, 3-Grams, and 4-Grams. These words convey almost all of the semantics of a document, which means that we can take only 1-Grams, 2-Grams, 3-Grams, and 4-Grams into account, i.e. the maximum of N is 4, when using N-Grams for classification.
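Under assumed inputs (a list of (text, label) pairs; the function names are hypothetical, and the per-document/per-class reading of the two constraints is a simplification), the level-wise generation could be sketched as follows.

```python
from collections import defaultdict

def eligible(term, docs, min_tf=2, min_df=2):
    """docs: list of (text, label). A term passes if its frequency reaches min_tf in some
    document (Constraint 1) and it appears in at least min_df documents of some class
    (Constraint 2) -- a simplified reading of the two constraints."""
    df_per_class = defaultdict(int)
    tf_ok = False
    for text, label in docs:
        tf = text.count(term)
        if tf >= min_tf:
            tf_ok = True
        if tf > 0:
            df_per_class[label] += 1
    return tf_ok and any(df >= min_df for df in df_per_class.values())

def ngram_terms(docs, max_n=4, min_tf=2, min_df=2):
    """Level-wise (Apriori-style) generation of 1..max_n character grams."""
    chars = {c for text, _ in docs for c in text}
    level = {c for c in chars if eligible(c, docs, min_tf, min_df)}   # S1
    selected = set(level)
    for n in range(2, max_n + 1):
        candidates = {a + b[-1] for a in level for b in level
                      if a[1:] == b[:-1]}                             # join on overlapping characters
        level = {t for t in candidates if eligible(t, docs, min_tf, min_df)}
        selected |= level
        if not level:
            break
    return selected
```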
5 Experiments and Results Analysis
For English text categorization, relatively standard training and testing corpora exist, which makes it possible to compare different approaches and systems on the same corpus. For Chinese text categorization, no
standard open categorization corpus is available. Therefore, we collected more than 20000 news web pages from www.sina.com and manually classified them into ten classes, including computer, traffic, environment, medicine, military affairs, politics, sports, arts, and education. After deleting the web pages that could not be classified into these classes, we obtained 16085 documents, which we divided into two sets: a training set containing 10723 documents for classifier learning, and a test set containing 5362 documents used to evaluate the performance of the classifier. We use the micro-averaged precision of the classifier at the precision/recall break-even point to evaluate the performance of maximum entropy for text categorization. Evaluations are made in the following ways.
– Performance when using word segmentation or N-Grams for feature generation.
– Performance when using different numbers of features.
– The effect of the smoothing technique on classifier performance.
– The effect of the feature function on classification performance.
– Comparison of the performance of maximum entropy with Bayes, KNN, and SVM.
During the training process, we used both word segmentation and N-Grams to generate document features and then applied the χ² method to select the optimal feature subset. When training the parameters of the maximum entropy model, we adopted the GIS algorithm and iterated 100 times.
Table 1. Performance comparisons among different feature generation methods
In order to compare the performance of different feature generation methods and different numbers of features, we use both word segmentation and N-Grams to generate document features and test the model with the number of features ranging from 300 to 2500. The micro-averaged precision is shown in Table 1. In Table 1, the first column is the number of features. Word means using words as document features; 1-Gram means selecting only 1-Gram features, and 2-Gram is similar; 1/2-Gram means selecting both 1-Gram and 2-Gram features, and 1/2/3-Gram and 1/2/3/4-Gram are similar. We can conclude from Table 1 that:
– It is better to use word segmentation to generate text features than N-Grams. No matter what value N is assigned, how the N-Gram features
are combined, or how many features are used, the precision with N-Grams is below that with word segmentation.
– As the number of features increases, the precision also increases; beyond a certain point, precision no longer increases but decreases. However, the decrease is rather trivial.
Table 2. Effect of the smoothing technique with different feature functions
In order to evaluate the effect of the smoothing technique and of different feature functions on the model, with the number of features ranging from 300 to 2500, we tested the binary-valued feature function and the frequency-based feature function (Eq.(12) and Eq.(13)) respectively, and compared the performance of the classifier. The test results are shown in Table 2, where FBFF stands for Frequency-Based Feature Function, BVFF for Binary-Valued Feature Function, AD for the Absolute Discounting technique, and NS means that no smoothing technique was used. We can conclude from Table 2 that:
– The frequency-based feature function is better than the binary-valued feature function.
– Generally, the smoothing technique helps to improve classification precision, but it can occasionally have a negative effect. For instance, in Table 2, when the number of features is 2000 and the frequency-based feature function with word segmentation is used for text feature generation, the precision decreases. Besides, if we use the binary-valued feature function with word segmentation for text feature generation, the smoothing technique also leads to a decrease in precision.
In order to compare the maximum entropy model for text categorization with other classifiers, we chose three popular methods: Bayes, KNN, and SVM. Here, Bayes adopts the multinomial model, KNN sets K to 50, and SVM uses a polynomial kernel function and the one-vs-rest method for multiclass classification.

Table 3. Performance comparisons among different classification approaches

Number   Maximum Entropy (Word)   Maximum Entropy (N-Gram)
300      91.98                    88.34
500      92.30                    89.41
1000     92.73                    89.52
1500     92.73                    89.20
2000     93.26                    89.41
2500     93.16                    89.73

We can conclude from Table 3 that:
– The maximum entropy model for text categorization performs better than the Bayes method. This matches well with the result of [3].
– The maximum entropy model for text categorization is comparable to KNN and SVM. Although in Table 3 the maximum entropy method performs better than KNN, the scale of our test corpus is not large enough to guarantee this advantage under other conditions. After all, these three methods are comparable with each other.
– SVM performs better than KNN and the maximum entropy method. However, the gap among the three methods is rather small, especially when word segmentation is used to generate document features.
6 Conclusion
In this paper, we adopt the maximum entropy model for text categorization. In addition, we compare and analyze its classification performance using different feature generation approaches, different numbers of features, and a smoothing technique. The experimental results show that the maximum entropy model is a promising classifier. However, we also found in the experiments that maximum entropy is not as stable as KNN: when different training documents are used, the test results change sharply. These problems require in-depth research in the future. Moreover, future work needs to be done on larger-scale experiments.
References
1. Yang, Y. An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval, 1999, 1(1): 76-88.
2. Adwait Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania, 1998.
3. Kamal Nigam, John Lafferty, and Andrew McCallum. Using Maximum Entropy for Text Classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, 1999.
4. Adwait Ratnaparkhi. A Simple Introduction to Maximum Entropy Models for Natural Language Processing. Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania, 1997.
5. Sven Martin, Hermann Ney, and Jörg Zaplo. Smoothing Methods in Maximum Entropy Language Modeling. In IEEE International Conference on Acoustics, Speech and Signal Processing, Phoenix, AR, 1999, 545-548.
A Comparative Study on Feature Weight in Text Categorization Zhi-Hong Deng, Shi-Wei Tang, Dong-Qing Yang, Ming Zhang, Li-Yu Li, and Kun-Qing Xie School of Electronics Engineering and Computer Science, Peking University, Beijing 100871 [email protected]
Abstract. Text categorization is the process of automatically assigning predefined categories to free-text documents. Feature weighting, which calculates feature (term) values in documents, is one of the important preprocessing techniques in text categorization. This paper is a comparative study of feature weight methods in statistical learning for text categorization. Four methods were evaluated: tf*idf, tf*CRF, tf*OddsRatio, and tf*CHI. We evaluated these methods on the benchmark collection Reuters-21578 with Support Vector Machine (SVM) classifiers. We found that tf*CHI is the most effective in our experiments: using tf*CHI with an SVM classifier yielded a very high classification accuracy (87.5% micro-averaged F1 and 87.8% micro-averaged break-even point). tf*idf, which is widely used in text categorization, compares favorably with tf*CRF but is not as effective as tf*CHI and tf*OddsRatio.
A major drawback of idf is that it ignores the fact that features play different roles in different document categories. To overcome this problem, Deng [9] used CRF instead of idf for feature weighting and obtained fair results. Because feature selection methods evaluate features by scores, we can also adopt these methods for feature weighting. In this paper, we study several excellent feature selection methods for feature weighting and compare their performance with tf*idf and tf*CRF on the Reuters-21578 corpus. Experimental results show that OddsRatio and CHI combined with tf outperform tf*idf and tf*CRF. Section 2 describes the feature weight methods. Section 3 describes the classifier used for empirical validation. Section 4 describes the document corpus and presents the experiments and results; the major findings of our experiments are also discussed there. Section 5 discusses related work. Section 6 summarizes the conclusions.
2 Feature Weight Methods
Four methods are included in this study, each of which uses tf as the feature's capacity for describing the document contents. The methods differ in how they measure a feature's capacity for discriminating similar documents via various statistical functions. According to the functions used, the four feature weight methods are tf*idf, tf*CRF, tf*OddsRatio, and tf*CHI. OddsRatio and CHI are excellent methods for feature selection [10][11].
2.1 tf
Before describing the feature weight methods, we give the definition of tf. Let freq_ij be the number of times feature fi is mentioned in the text of document dj. Then the tf of feature fi in document dj is given by

tf(fi, dj) = freq_ij / max_k freq_kj    (1)

The maximum is computed over all features mentioned in the text of document dj. For brevity, tf(fi, dj) is also written as tf_ij.
2.2 tf*idf
tf*idf, which originated in information retrieval, is the best known feature weight scheme in text categorization. This method uses idf to measure a feature's ability to discriminate similar documents. The motivation for idf is that features which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one. Let N be the total number of documents and ni be the number of documents in which feature fi appears. Then idf_i, the inverse document frequency of fi, is given by
idf_i = log(N / ni)    (2)
According to the tf*idf scheme, the value of feature fi in the vector of document dj is given by

v_ij = tf_ij × idf_i = (freq_ij / max_k freq_kj) × log(N / ni)    (3)
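For concreteness, a direct and unoptimized sketch of Eqs.(1)–(3) might look as follows; the function name and input layout are assumptions rather than part of the paper.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, all_docs, vocab):
    """tf*idf weights of Eq.(3) for one document; all_docs is a list of token lists."""
    counts = Counter(doc_tokens)
    max_freq = max(counts.values()) if counts else 1
    n = len(all_docs)
    weights = {}
    for f in vocab:
        n_i = sum(1 for toks in all_docs if f in toks)           # document frequency of f
        idf = math.log(n / n_i) if n_i else 0.0                  # Eq.(2)
        weights[f] = (counts.get(f, 0) / max_freq) * idf         # Eq.(1) x Eq.(2)
    return weights
```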
tf*idf is the simplest technique for feature weighting. It easily scales to very large corpora, with a computational complexity approximately linear in the number of features and training documents. However, idf is a global measure and ignores the fact that features may have different discriminating powers for different document topics. For example, "football" is a most valuable term in sports news, while it has little value for indicating financial news. According to idf, whether "football" occurs in sports news or not, its idf value is the same. The following sections discuss methods that calculate a feature's ability to discriminate similar documents in terms of document categories.
2.3 tf*CRF
CRF (Category Relevance Factor) stands for the discriminating power of features with respect to categories (such as document topics). Let C = {c1, ..., cm} be the set of predefined categories and F = {f1, ..., fn} be the feature set. Let DOC = ∪Di be the set of documents, where Di is the set of documents belonging to category ci. The category relevance factor CRF of fi and cj is given by

CRF(fi, cj) = log[(X / Y) / (U / V)]    (4)

where X is the number of documents that contain feature fi and belong to category cj, Y is the number of documents that belong to category cj, U is the number of documents that contain feature fi and do not belong to category cj, and V is the number of documents that do not belong to category cj. For a document d in Dj, let the feature vector V of d be (v1, v2, ..., vn), where vi is the value of feature fi. Then, under the tf*CRF scheme, vi is given by:

v_i = tf(fi, d) × CRF(fi, cj)    (5)
2.4 tf*OddsRatio
OddsRatio is commonly used in information retrieval, where the problem is to rank documents according to their relevance for the positive class using the occurrence of different words as features. It was first used as a feature selection method by Mladenic [10], who compared six feature scoring measures with each other on real Web documents and found that OddsRatio showed the best performance. This shows that OddsRatio is excellent for feature scoring and may be very suitable for
feature weighting. Consider the two-way contingency table of a feature fi and a category cj: A is the number of times fi and cj co-occur, B is the number of times cj occurs, C is the number of times fi occurs without cj, and D is the number of times cj does not occur. Then the OddsRatio between fi and cj is defined to be

OddsRatio(fi, cj) = log[ P(fi | cj)(1 − P(fi | ¬cj)) / ((1 − P(fi | cj)) P(fi | ¬cj)) ]    (6)

and is estimated using

OddsRatio(fi, cj) ≈ log[ (A/B)(1 − C/D) / ((1 − A/B)(C/D)) ]    (7)
For the feature vector V = (v1, v2, ..., vn) of d, the value of vi is defined to be

v_i = tf(fi, d) × OddsRatio(fi, cj)    (8)
2.5 tf*CHI
The CHI statistic measures the lack of independence between a feature and a category and can be compared to the χ² distribution with one degree of freedom to judge extremeness. Given a feature fi and a category cj, the CHI of fi and cj is given by

CHI(fi, cj) = N × (AD − CB)² / [(A + C) × (B + D) × (A + B) × (C + D)]    (9)

where N is the total number of documents; A is the number of times fi and cj co-occur; B is the number of times fi occurs without cj; C is the number of times cj occurs without fi; and D is the number of times neither fi nor cj occurs. CHI_ij has a value of zero if fi and cj are independent; on the other hand, CHI_ij has the maximal value N if fi and cj are either always co-occurring or always co-absent. The more correlated fi and cj are, the higher CHI_ij is, and vice versa. Yang [11] reported that CHI is one of the most effective feature selection methods; thus, calculating feature weights by CHI is natural and rational. For the feature vector V = (v1, v2, ..., vn) of d, the value of vi is defined to be

v_i = tf(fi, d) × CHI(fi, cj)    (10)
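As an illustration of how such weights might be computed, the following sketch combines Eq.(1), Eq.(9), and Eq.(10); the function names are hypothetical and the contingency counts are recomputed naively rather than precomputed, purely for clarity.

```python
from collections import Counter

def chi(n_docs, a, b, c, d):
    """CHI statistic of Eq.(9): a = docs with feature and category, b = feature without
    category, c = category without feature, d = neither."""
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n_docs * (a * d - c * b) ** 2 / denom if denom else 0.0

def tf_chi_vector(doc_tokens, category, docs, vocab):
    """docs: list of (token list, label). Returns the tf*CHI weights of Eq.(10) for one
    document with respect to one category."""
    counts = Counter(doc_tokens)
    max_freq = max(counts.values()) if counts else 1
    n = len(docs)
    weights = {}
    for f in vocab:
        a = sum(1 for toks, lab in docs if f in toks and lab == category)
        b = sum(1 for toks, lab in docs if f in toks and lab != category)
        c = sum(1 for toks, lab in docs if f not in toks and lab == category)
        d = n - a - b - c
        tf = counts.get(f, 0) / max_freq                       # Eq.(1)
        weights[f] = tf * chi(n, a, b, c, d)                   # Eq.(10)
    return weights
```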
3 SVM Classifiers
To assess the effectiveness of the feature weight methods, we constructed text classifiers based on Support Vector Machines (SVM), a relatively recent inductive learning approach that has proved successful in various application domains, including text categorization.
3.1 SVM
SVM was introduced by Vapnik in 1995 for solving two-class pattern recognition problems using the Structural Risk Minimization principle [12]. Given a training set containing two kinds of data (positive examples and negative examples) that is linearly separable in vector space, this method finds the decision hyperplane that best separates the positive and negative data points in the training set. The problem of searching for the best decision hyperplane can be solved using quadratic programming techniques [13]. SVM can also be extended to linearly non-separable data sets, either by adopting soft-margin hyperplanes or by mapping the original data vectors into a higher-dimensional space in which the data points are linearly separable [12, 13, 14]. Joachims [6] first applied SVM to text categorization and compared its performance with other classification methods on the Reuters-21578 corpus; his results show that SVM outperformed all the other methods tested in his experiments. Subsequently, Dumais [15], Yang [16], Cooley [17], and Bekkerman [18] also explored how to solve text categorization with SVM. Although based on different document collections, their experiments confirmed Joachims' conclusion that SVM is the best method for classifying text documents. SVM is suitable not only for binary (two-class) classification but also for multi-class and multi-labeled categorization. A straightforward approach to multi-class and multi-labeled categorization with m classes is to decompose the problem into m binary problems. For m-class classification, this approach finds m hyperplanes by applying SVM m times; each of these hyperplanes is a decision rule for one category. Given C = {c1, ..., cm} and CS = {cs1, ..., csm}, where csi stands for the set of training documents belonging to ci, for each ci ∈ C we set the positive set ci+ = csi and the negative set ci− = ∪ csj (j ≠ i). Using ci+ and ci− as input, we can generate a decision rule Ri for ci by SVM. For all categories, we obtain m rules R1, R2, ..., Rm.
3.2 Constructing SVM Classifiers with Various Feature Weight Methods
Constructing SVM classifiers with various feature weight methods involves two components: one for learning classifiers and the other for classifying new documents. For ease of description, we call the former Training_Phase and the latter Classifying_Phase.
Training_Phase:
Input: training document set D = ∪ Di, 1 ≤ i ≤ m, Di = {document d | d belongs to category ci}; feature set F = {f1, f2, ..., fn}; SVM_learning() (a function for learning SVM classifying rules).
Output: CD = {CD1, CD2, ..., CDm}, SVMR = {SVMR1, SVMR2, ..., SVMRm}. CDi = (cd1i, cd2i, ..., cdni) is the class discrimination vector, where cdki stands for the value of feature fk with respect to ci under one of the feature weight methods; SVMRi is the classifying rule of ci learned by SVM.
Step 1. Set CD = ∅, SVMR = ∅.
Step 2. For i = 1 to m, do:
1. For each fj, compute cdji according to one of the functions (such as formula (2), (4), (7), or (9)) described in Section 2. Then generate CDi = (cd1i, cd2i, …, cdni) and set CD = CD ∪ {CDi}.
2. Set VDi+ = ∅, VDi- = ∅.
3. For each d ∈ D, compute its feature vector Vd = (v1, v2, …, vn) according to the corresponding feature weight method (such as formula (3), (5), (8), or (10)). If d ∈ Di, then VDi+ = VDi+ ∪ {Vd}; else VDi- = VDi- ∪ {Vd}.
4. SVMRi = SVM_Training(VDi+, VDi-), and set SVMR = SVMR ∪ {SVMRi}.
Step 3. Output CD and SVMR.

Classifying_Phase:
Input: CD = {CD1, CD2, …, CDm}, SVMR = {SVMR1, SVMR2, …, SVMRm}, feature set F = {f1, f2, …, fn}, and an unlabelled document dnew.
Output: Cnew = {l1, l2, …, lm}, the set of Boolean labels, where li ∈ {0, 1} (1 means that dnew belongs to ci and 0 means that it does not).
Step 1. Set Cnew = ∅.
Step 2. For i = 1 to m, do:
1. According to the feature weight method used in Training_Phase, compute the feature vector Vnew of dnew.
2. Run SVMRi on Vnew to obtain li.
3. Cnew = Cnew ∪ {li}.
Step 3. Output Cnew.

Some free software packages provide functions for learning SVM's classifying rules. We choose the function provided by SVMlight [19] as SVM_learning(). Concerning the text categorization domain, several authors advocate the choice of linear SVM (as opposed to kernel-based SVM) because of its speed in both training and classification time and its generalization ability. In all our experiments we used a linear SVM.
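A minimal sketch of Training_Phase and Classifying_Phase as a one-vs-rest linear SVM. The paper uses SVMlight as SVM_learning(); here scikit-learn's LinearSVC stands in for it, and compute_cd / doc_to_vector are placeholders for whichever feature weight formulas from Section 2 are chosen.

```python
from sklearn.svm import LinearSVC

def training_phase(docs_by_cat, compute_cd, doc_to_vector):
    """docs_by_cat: {category: [document, ...]}.
    compute_cd(category) -> class discrimination vector CD_i.
    doc_to_vector(doc, CD_i) -> feature vector for that category."""
    CD, SVMR = {}, {}
    for ci in docs_by_cat:
        CD[ci] = compute_cd(ci)
        X, y = [], []
        for cj, docs in docs_by_cat.items():
            for d in docs:
                X.append(doc_to_vector(d, CD[ci]))
                y.append(1 if cj == ci else 0)   # c_i+ vs c_i- split
        SVMR[ci] = LinearSVC().fit(X, y)          # stands in for SVM_learning()
    return CD, SVMR

def classifying_phase(d_new, CD, SVMR, doc_to_vector):
    """Returns the Boolean label set {c_i: l_i} for an unlabelled document."""
    return {ci: int(SVMR[ci].predict([doc_to_vector(d_new, CD[ci])])[0])
            for ci in SVMR}
```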
4 Experimental Evaluation

We apply feature weighting to yield document vectors in the construction of SVM classifiers. The effectiveness of a feature weight method is evaluated by the performance of SVM with that method.

4.1 Data Collection

To make our evaluation comparable to most of the published results in text categorization, we chose the Reuters-21578 collection, a commonly used corpus that has become a benchmark in text categorization research. For this paper we use the ModApte version of Reuters-21578, which was obtained by eliminating unlabelled documents and selecting the categories that have at least one document in both the training set and the test set. The ModApte version of Reuters-21578 has 90 categories and consists of a training set containing 7769 articles and a test set containing 3019 articles. In both the training and test sets we preprocessed each article so that any additional information except for the title and the body was removed. After conversion to lower case, stop word removal, word stemming, and rare term removal², we obtained a vocabulary of 7014 unique words (features).

4.2 Performance Measures

When measuring the performance of multi-class multi-labeled categorization, it is meaningless to use the standard accuracy measure. To evaluate the effectiveness of category assignments by classifiers, we use the micro-averaging break-even point (BEP) and the micro-averaging F1 measure, which have been used widely. Given a category ci and a test set, the standard definitions of micro-averaging precision pma and micro-averaging recall rma are:
pma = (Σi=1..m αi) / (Σi=1..m αi + Σi=1..m βi),    rma = (Σi=1..m αi) / (Σi=1..m αi + Σi=1..m γi)    (11)
αi is the number of documents in the test set that were classified into ci and whose true label sets include ci; βi is the number of documents in the test set that were classified into ci but whose true label sets do not include ci; γi is the number of documents in the test set that were not classified into ci although their true label sets include ci. The micro-averaging BEP and micro-averaging F1 are given by:

BEP = (pma + rma) / 2,    F1 = (2 × pma × rma) / (pma + rma)    (12)
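A small sketch of formulas (11) and (12), computed from per-category counts (αi, βi, γi); the tuple layout and the toy numbers are ours.

```python
def micro_averaged_scores(counts):
    """counts: list of (alpha_i, beta_i, gamma_i) tuples, one per category."""
    A = sum(a for a, _, _ in counts)   # correctly assigned documents
    B = sum(b for _, b, _ in counts)   # wrongly assigned documents
    G = sum(g for _, _, g in counts)   # missed documents
    p_ma = A / (A + B)
    r_ma = A / (A + G)
    bep = (p_ma + r_ma) / 2                      # BEP in formula (12)
    f1 = 2 * p_ma * r_ma / (p_ma + r_ma)         # F1 in formula (12)
    return p_ma, r_ma, bep, f1

# e.g. two categories with their (alpha, beta, gamma) counts
print(micro_averaged_scores([(80, 10, 15), (40, 5, 10)]))
```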
4.3 Primary Results

Table 1 summarizes the categorization results for SVM on Reuters-21578 after feature weighting using tf*idf, tf*CRF, tf*OddsRatio, and tf*CHI, respectively. Two observations emerge from the categorization results of SVM on Reuters. First, tf*CHI is more effective than tf*idf, tf*CRF, and tf*OddsRatio. The micro-averaging BEP and micro-averaging F1 of SVM + tf*CHI are 0.878 and 0.875 respectively, the highest values for the corresponding measures. Second, tf*idf and tf*CRF have similar effects on the performance of the classifiers, while tf*OddsRatio performs better than both. The micro-level analysis suggests that tf*CHI > tf*OddsRatio > {tf*idf, tf*CRF}.
² We process the Reuters-21578 corpus in the order: conversion to lower case, stop word removal, word stemming, and rare term removal. Rare terms are terms that occur in fewer than three documents.
Table 1. Performance summary of feature weight methods in SVM over Reuters-21578
4.4 Discussion

An interesting phenomenon is that the study of Yang [11] revealed CHI to be the most effective feature selection method on Reuters-21578. It seems that CHI is suitable for depicting the inner feature-class model of Reuters. Whether CHI still retains the highest performance on other document collections is worth studying.

For theoretical interest, it is worth analyzing the relationship between CRF and OddsRatio. OddsRatio can be shown to be equivalent to:

OddsRatio(fi, cj) ≈ log[ ((A/B)(1 − C/D)) / ((C/D)(1 − A/B)) ]
               = log[ (A/B) / (C/D) ] + log[ (1 − C/D) / (1 − A/B) ]
               = CRF(fi, cj) + log[ (1 − C/D) / (1 − A/B) ]    (13)
These formulas show that OddsRatio is the sum of CRF and an extra factor. It is easy to prove that if A/B is equal to C/D, then OddsRatio is equal to CRF; if A/B is bigger than C/D, then OddsRatio is bigger than CRF; and if A/B is smaller than C/D, then OddsRatio is smaller than CRF. This means that, compared with CRF, OddsRatio amplifies the discriminating power of features with respect to classes. Our experiments show that this amplification is favorable to text categorization.

CRF does not show the advantage over idf suggested by the experimental results in [9]. We hypothesize two reasons. First, CRF and idf were not compared within the same algorithm in [9], where CRF is used with a Rocchio-like linear algorithm while idf is used with k-nearest-neighbor and SVM classifiers. The linear classifier with CRF outperforms the k-nearest-neighbor classifier with idf, but it is only competitive with the SVM classifier with idf. Second, [9] used three different document collections instead of Reuters-21578. The category distribution in those collections is more even than in Reuters-21578. It may be that CRF is not suitable for feature weighting on uneven collections.
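A quick numerical check of relation (13) and the three cases discussed above. CRF and OddsRatio are written here directly in terms of the two ratios A/B and C/D, which is our reading of formula (13) and should be checked against the definitions in Section 2; the sample ratio values are invented.

```python
import math

def crf(x, y):
    """CRF expressed in terms of x = A/B and y = C/D (our reading of (13))."""
    return math.log(x / y)

def odds_ratio(x, y):
    """OddsRatio = log( x(1-y) / (y(1-x)) ), again with x = A/B and y = C/D."""
    return math.log(x * (1 - y) / (y * (1 - x)))

# relation (13): OddsRatio = CRF + log((1-y)/(1-x)); the extra factor is zero
# when x == y, positive when x > y, and negative when x < y
for x, y in [(0.3, 0.3), (0.6, 0.2), (0.2, 0.6)]:
    lhs = odds_ratio(x, y)
    rhs = crf(x, y) + math.log((1 - y) / (1 - x))
    print(round(lhs, 10) == round(rhs, 10), lhs >= crf(x, y))
```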
5 Related Work

Several authors have reported experimental results of SVM on Reuters-21578. Table 2 summarizes these data; "-" means that the authors did not report the value of the corresponding measure.
Table 2. Summary of other reported performances of SVM over Reuters-21578

               micro-averaging BEP    micro-averaging F1
Dumais [15]    0.87                   -
Joachims [6]   0.865                  -
Yang [16]      -                      0.86
These authors used various feature weight methods. Although Joachims and Yang used the same tf*idf scheme as defined in this paper, there is a difference between them: Joachims used the within-document frequency of terms (rtf, the number of times a feature occurs in a document) directly, while Yang used log(rtf) instead. Our normal tf and our different preprocessing techniques may explain why our results are better than theirs under the same tf*idf scheme; however, the difference is not significant. Dumais adopted binary feature values, but she did much work on feature space reduction by feature selection. According to the data in Tables 1 and 2, it is obvious that SVM + tf*CHI yielded the best result on the Reuters-21578 corpus.
6 Conclusion

This paper is an evaluation of feature weight methods for text categorization. We found tf*CHI to be the most effective in our experiments with SVM. tf*CRF was found comparable to tf*idf, while tf*idf is simpler. We also discovered that the tf*OddsRatio and tf*CRF scores of a feature are strongly correlated, revealing that amplifying the discriminating power values is favorable. Our experimental results also show that feature weight methods affect the effectiveness of text categorization.

In future work we will investigate combinations of feature weight methods. The intuition is that different feature weight methods score discriminating power in qualitatively different ways. This suggests that different methods potentially offer complementary information about discriminating power, and that a proper combination of feature weight methods could be more effective than each one alone. Ample results from classifier combination would also provide valuable information.
Acknowledgement. This research is supported by the National Key Basic Research Program (NKBRP 973) of China under grant No. G1999032705. Any opinions, findings, and conclusions or recommendations expressed in this paper are the authors' and do not necessarily reflect those of the sponsor. We are also grateful to the two anonymous reviewers for their comments.
References

1. Y. Yang. Expert network: Effective and efficient learning from human decisions in text categorization and retrieval. In 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'94), pages 13-22, 1994.
2. A. McCallum and K. Nigam. A comparison of event models for naïve Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
3. C. Apte, F. Damerau, and S. Weiss. Text mining with decision rules and decision trees. In Proceedings of the Conference on Automated Learning and Discovery, Workshop 6: Learning from Text and the Web, 1998.
4. H.T. Ng, W.B. Goh, and K.L. Low. Feature selection, perceptron learning, and a usability case study for text categorization. In 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'97), pages 67-73, 1997.
5. R.E. Schapire, Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 2000, 39(2/3): 135-168.
6. T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 1998 European Conference on Machine Learning (ECML), pages 137-142, 1998.
7. G. Salton and M.E. Lesk. Computer evaluation of indexing and text processing. Journal of the ACM, 1968, 15(1): 8-36.
8. G. Salton, C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 1988, 24(5): 513-523.
9. Z.H. Deng, S.W. Tang, D.Q. Yang, M. Zhang, X.B. Wu, M. Yang. A linear text classification algorithm based on category relevance factors. In Proceedings of the 5th International Conference on Asian Digital Libraries (ICADL 2002), Lecture Notes in Computer Science (LNCS 2555), Springer-Verlag, pages 88-98, 2002.
10. D. Mladenic, M. Grobelnik. Feature selection for classification based on text hierarchy. In Working Notes of Learning from Text and the Web, Conference on Automated Learning and Discovery (CONALD'98), 1998.
11. Y. Yang, J.O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning, pages 412-420, 1997.
12. V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
13. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20: 273-297, 1995.
14. E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. A.I. Memo, MIT A.I. Lab, 1996.
15. S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 1998 ACM CIKM International Conference on Information and Knowledge Management, pages 148-155, 1998.
16. Y. Yang, X. Liu. A re-examination of text categorization methods. In 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pages 42-49, 1999.
17. R. Cooley. Classification of news stories using support vector machines. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, Text Mining Workshop, 1999.
18. R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. On feature distributional clustering for text categorization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 146-153, 2001.
19. T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.
User-Focus Based Personalization Recommendation for Text Information Retrieval

Zhenya Zhang, Enhong Chen, Jin Wang, and Xufa Wang

Computer Department, University of Sci. and Tech. of China, Hefei, 230027, P.R. China
{zzychm,chamos}@mail.ustc.edu.cn, {cheneh,xfwang}@ustc.edu.cn
Abstract. Personalization recommendation (PR) is an important feature of future search engines. It is a valid method for lightening the user's burden in information retrieval. This paper presents a new approach to PR based on user-focus. To construct the user-focus for a user, a new algorithm, WeightedFP, for mining weighted frequent itemsets is given. The transactions handled by WeightedFP are the query requests of a user over a period of time, and the items in a transaction are the non-noise words in the corresponding query request. Each word, as an item in an itemset, has a weight expressing the importance of the word to the user. Experimental results show that the implementation of PR based on user-focus can, to some extent, lighten the user's burden of filtering valid information from vast amounts of information, while the time requirement of TR is still satisfied.
consider the importance of each item. Some algorithms, such as MINWAL(O) [5], do consider the importance of each item, but they have to construct and validate the candidate itemsets step by step. Because these algorithms are inefficient, they cannot be used in applications requiring an instant response time, such as personalization recommendation for TR. This paper presents an implementation of personalization recommendation based on user-focus for TR. To construct the user-focus for a user, an algorithm named WeightedFP for mining weighted frequent itemsets is presented. This paper is organized as follows. Part II introduces WeightedFP. Part III presents the definition of the user-focus and describes algorithms for the implementation of personalization recommendation with user-focus. Part IV presents and analyses experimental results. Finally, Part V draws a conclusion and highlights further work.
2 Weighted Frequent Itemset Mining

Let I be the set of all items, I = {i1, i2, …, im}. ij is named an item, which is an element of I, j = 1, 2, …, m. Let the data set D be a set of transactions in the database, where each transaction T is a set of items, i.e., T ⊆ I. Support(X), the support of itemset X, is defined as the ratio of the number of transactions containing X to the number of all transactions. Let X, Y be itemsets with X ∩ Y = Φ. Confidence(X, Y), the confidence of the association rule X→Y, is defined as Support(X ∪ Y)/Support(X).

Definition 1. Let I be the set of all items, I = {i1, i2, …, im}. Let W = {w1, w2, …, wm}, wj ∈ (0,1), j = 1, 2, …, m. W is defined as the weight set for itemset I, and the element wj of W is called the weight of element ij in I, j = 1, 2, …, m. The weight of item ij is a measurement of the importance of item ij.

Definition 2. Let I be the set of all items and W the weight set for I. Let X be an itemset, X ⊆ I. The non-normalized weighted support of X is defined as (Σij∈X wj) · support(X), where wj is the weight of item ij, and is denoted wsupport(X).

Definition 3. Let I be the set of all items and W the weight set for I. Let X be an itemset, X ⊆ I. X is a weighted frequent itemset iff wsupport(X) ≥ wminsupp. Here wminsupp is a threshold named the minimal weight support.

Definition 4. Let X, Y be itemsets with X ∩ Y = Φ. X→Y is a weighted interesting association rule iff confidence(X, Y) ≥ minconf and wsupport(X ∪ Y) ≥ wminsupp.

Lemma 1. Let D be the transaction set and let the number of transactions in D be N. Let I be the set of all items, X ⊆ I, and let wminsupp be the minimal weight support. If X is a weighted frequent itemset and the number of transactions containing X is denoted SC(X), then SC(X) ≥ (wminsupp × N) / (Σij∈X wj).
Proof: Because SC(X) is the number of transactions containing X in D and N is the number of transactions in D, SC(X) = N × support(X). According to Definitions 2 and 3, (Σij∈X wj) × SC(X)/N ≥ wminsupp, i.e., SC(X) ≥ (wminsupp × N) / (Σij∈X wj).
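As a small illustration (ours) of Definitions 2 and 3 and Lemma 1, the following sketch computes the weighted support of an itemset over a toy transaction set and the lower bound on its support count SC(X).

```python
import math

def wsupport(X, transactions, weights):
    """Non-normalized weighted support: (sum of item weights in X) * support(X)."""
    sc = sum(1 for t in transactions if X <= t)          # SC(X)
    support = sc / len(transactions)
    return sum(weights[i] for i in X) * support

def min_support_count(X, weights, wminsupp, n_transactions):
    """Lower bound on SC(X) from Lemma 1 for a weighted frequent itemset X."""
    return math.ceil(wminsupp * n_transactions / sum(weights[i] for i in X))

transactions = [{'a', 'b'}, {'a', 'b', 'c'}, {'b', 'c'}, {'a'}]
weights = {'a': 0.8, 'b': 0.5, 'c': 0.3}
X = {'a', 'b'}
print(wsupport(X, transactions, weights))              # 1.3 * 0.5 = 0.65
print(wsupport(X, transactions, weights) >= 0.5)       # weighted frequent at wminsupp = 0.5
print(min_support_count(X, weights, 0.5, len(transactions)))   # 2
```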
Definition 5. Let I be the set of all items and W the weight set for I. Y is a q-itemset, i.e., Y is an itemset with q items in it.

Verification of Temporal Constraints for Concurrent Workflows
H. Li and Y. Yang

Definition 2.1 (Workflow specification): A workflow specification, ws, is a 3-tuple ws = < N, F, ℜ >, where:
i) N = {n1, n2, …, nt} is the union of a set of activity nodes AN = {a1, a2, …, an} and a set of control nodes CN = {cn1, cn2, …, cnm}. Each element in CN has one of the above four types, i.e., as, aj, os, or oj.
ii) F ⊂ N × N is a set of flows between these nodes.
iii) ℜ : AN → R maps each activity to the resource set it accesses, where R = {R1, R2, …, Rn} is a superset and Ri, i = 1, …, n, is the resource set accessed by ai.
iv) ws has a unique start activity (denoted as as) and at least one end activity (denoted as ae).

In order to model the time information of activities in a workflow, we need to augment each of them with two time values, namely, the minimum and maximum durations. So, at build-time, we define d(a) and D(a) as the minimum and maximum durations of activity a for its execution, respectively (d(a) < D(a)) [12]. Time is expressed in some basic time units, such as minutes, hours, or days. The granularity is selected according to the specific workflow application.
A workflow is an execution case of a workflow specification, beginning at the start activity and ending at the end activity. Each workflow is assumed to have an identifier, which distinguishes it from others. Note that a workflow contains a subset of the activities explicitly specified in the associated workflow specification. During a workflow execution, activities are scheduled with respect to the flows between them, which prescribe the precedence relationship between those activities. That is to say, given a workflow specification ws with < ni, nj > ∈ F, if both ni and nj are scheduled in a workflow, nj must start to execute only after the completion of ni, denoted as ni ≺ nj. More formally:

Definition 2.2 (Workflow): A workflow, w, is a 4-tuple < id, ws, A, ≺ >, reflecting the execution of a workflow specification, where:
i) id is an identifier assigned to the workflow.
ii) ws is the associated workflow specification.
iii) A ⊆ N is a set of activities (including control nodes), which contains a subset of the activities in ws.
iv) The execution order ≺ ⊆ (A × A) is the partial order such that if ai, aj ∈ A and < ai, aj > ∈ F in ws, then ai ≺ aj.

At run-time, a completed activity a has a start time (denoted as S(a)) and an end time (denoted as E(a)). It is active during the period from its start time to its end time. This period is called the active interval of activity a, denoted as [S(a), E(a)]. DR(a) = E(a) − S(a) is defined as its run-time duration. Accordingly, DR(ai, aj) denotes the run-time duration from the start time of ai to the end time of aj, where ai ≺ aj. Under the normal condition, we have: (1) d(a) ≤ DR(a) ≤ D(a); (2) E(ai) ≤ S(aj), if ∃f = < ai, aj > ∈ F.
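A minimal data model (ours, not from the paper) for the build-time durations d(a), D(a) and the run-time interval [S(a), E(a)], with the two normal-case checks just listed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Activity:
    name: str
    d: int                      # minimum duration (build-time)
    D: int                      # maximum duration (build-time)
    S: Optional[int] = None     # run-time start time
    E: Optional[int] = None     # run-time end time

    def run_time_duration(self):
        return self.E - self.S              # D_R(a) = E(a) - S(a)

    def normal(self):
        return self.d <= self.run_time_duration() <= self.D   # condition (1)

def flow_respected(ai: Activity, aj: Activity):
    """Condition (2): for a flow <a_i, a_j>, a_j starts only after a_i ends."""
    return ai.E <= aj.S

a1 = Activity('a1', d=2, D=5, S=0, E=4)
a2 = Activity('a2', d=1, D=3, S=4, E=6)
print(a1.normal(), flow_respected(a1, a2))   # True True
```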
2.2 Temporal Constraint and Its Representation

In order to simplify the verification work, we differentiate temporal constraints into two classes, namely, absolute temporal constraints and relative temporal constraints. The temporal constraints of a workflow specification are consistent if and only if they can be satisfied based on the syntax of the workflow specification and the expected minimum and maximum durations of activities [8]. Here, we use E(ai) ≤t ti to represent that ai should end its execution on or before absolute time ti, and S(ai) ≤t ti to represent that ai should start to execute on or before absolute time ti. For relative temporal constraints, we use DR(ai, aj) ≤ p to denote that aj should end its execution no more than p time units after ai starts. Here, ai is called a reference point. In fact, absolute temporal constraints and relative temporal constraints can be transformed into each other. Therefore, in this paper, we only demonstrate the dynamic
checking of relative temporal constraints, and take DR(ak, ai) ≤ p (ak is a reference point) as the canonical representation for temporal constraints.
3 Resource Constraints in WfMSs

As stated earlier, activities in a workflow need to access resources during their executions. In WfMSs, a resource is defined to be any entity required by an activity for its execution, such as a document, a database table, an appliance (e.g., a printer), an application, or even an actor¹. According to the access property of resources in a WfMS, they are classified into two types, namely, shared resources and private resources. Shared resources can be accessed by different activities within a workflow or from different workflows, while private resources cannot be shared and are only accessed by a single activity. So, it is unnecessary to involve private resources in our verification because they do not cause resource constraints between activities.

In WfMSs, a resource can be denoted as r, with a unique identifier. All resources accessed by an activity ai form a set Ri = {r1, …, rm}. The mapping function ℜ(ai) (see Definition 2.1) returns all resources accessed by ai, that is, ℜ(ai) = Ri.

Definition 3.1 (Resource dependency): Given two activities ai, aj (i ≠ j) within a workflow or from two workflows, we say that ai and aj have a resource dependency if ℜ(ai) ∩ ℜ(aj) ≠ φ.

If ai and aj have a resource dependency, ai cannot execute simultaneously with aj. Otherwise, a conflict may arise from the competition for the same resources. We call this kind of conflict a resource conflict. More formally:

Definition 3.2 (Resource conflict): ai has a resource conflict with aj if they have a resource dependency and ([S(ai), E(ai)] ∩ [S(aj), E(aj)]) ≠ φ.

Given a set of concurrent workflows W = {w1, w2, …, wn}, correlations may exist between them due to the resource dependencies between activities. More formally:

Definition 3.3 (Correlation): Two workflows wi and wj (i ≠ j) are correlated iff there are two activities aik ∈ Ai and ajl ∈ Aj such that they have a resource dependency.
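A small sketch (ours) of Definitions 3.1 and 3.2: two activities have a resource dependency when their resource sets intersect, and a resource conflict when, in addition, their active intervals overlap.

```python
def resource_dependency(res_i, res_j):
    """Definition 3.1: the activities share at least one resource."""
    return bool(set(res_i) & set(res_j))

def intervals_overlap(si, ei, sj, ej):
    return max(si, sj) <= min(ei, ej)

def resource_conflict(res_i, iv_i, res_j, iv_j):
    """Definition 3.2: a dependency plus overlapping active intervals."""
    return resource_dependency(res_i, res_j) and intervals_overlap(*iv_i, *iv_j)

# a_i and a_j both use the shared printer and run at the same time -> conflict
print(resource_conflict({'printer', 'db1'}, (0, 5), {'printer'}, (3, 8)))   # True
print(resource_conflict({'db1'}, (0, 5), {'printer'}, (3, 8)))              # False
```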
4 Concurrent Workflow Execution

In general, there are multiple workflows executing concurrently in a WfMS, each of which is controlled and coordinated by a workflow engine² independently. For each workflow, we have the following assumptions.
¹ An actor can be either a person or a software agent, who acts in a role to perform an activity or is responsible for its execution [5].
² A workflow engine is a software service that provides the run-time execution environment for a workflow [10].
Assumption 1 (Consistent Specification): A workflow specification is correctly defined, without errors or inconsistencies.
Assumption 2 (Correct Execution): Each activity in a workflow executes under the normal case.

For the sake of brevity, we introduce a term, workflow schedule, to denote a set of concurrent workflows.

Definition 4.1 (Workflow Schedule): A workflow schedule, S, is a 3-tuple < W, AS, ≺S >, reflecting the concurrent execution of workflows, where:
i) W = {w1, w2, …, wn} is a set of workflows, where wi = < idi, wsi, Ai, ≺i >.
ii) AS = { aij | (aij ∈ Ai) ∧ (wi ∈ W) } is the set of all activities of all workflows in W.
iii) ≺S is a partial order between activities of AS with ≺S ⊆ (AS × AS) and ≺S = ∪wi∈W ≺i.

The definition of a workflow schedule reflects the concurrent execution of workflows at the activity level. Activities within a workflow are scheduled by a workflow engine according to the flows defined in the associated workflow specification. However, activities from different workflows have no such limitation, i.e., given two activities aik ∈ Ai, ajl ∈ Aj (i ≠ j), either S(aik) <t S(ajl) or S(ajl) <t S(aik). This results in a correct workflow schedule. A workflow schedule is correct if it has no resource conflicts. More formally:

Definition 4.2 (Correct Workflow Schedule): A workflow schedule S is correct if for any two activities aik, ajl ∈ AS with a resource dependency, we have ([S(aik), E(aik)] ∩ [S(ajl), E(ajl)]) = φ.

Therefore, when an activity to be scheduled is identified as having a resource conflict with another one, it will be postponed for execution or allowed to pre-empt the execution of the other one if that one has a lower priority. Here, we assume that the First-Come-First-Served (FCFS) policy is adopted to allocate a resource to an activity, i.e., an activity with resource conflicts is postponed for execution until all required resources are available.
5 Dynamic Verification of Temporal Constraints

In this section, we address the checking method for temporal constraints on workflows. The problem to be tackled is formally stated as: given a workflow schedule S = < W, AS, ≺S > and a temporal constraint DR(aik, ail) ≤ p, check whether the temporal constraint can be satisfied at reference point aik in the workflow schedule.

To achieve the dynamic verification of a temporal constraint DR(aik, ail) ≤ p, where aik ≺ ail, in an environment with concurrent workflows, we should calculate the value of DR(aik, ail) and then compare it with p. If the value of DR(aik, ail) is less than or equal to p, we say the temporal constraint is satisfied; otherwise, it is unsatisfied. However, the value of DR(aik, ail) can be
determined only if aik and ail have completed. At that moment, if the temporal constraint is violated, it is too late for any preventive or corrective action, and the checking makes no sense for workflow management. Therefore, it is necessary to dynamically check the temporal constraint before ail starts to execute. To this end, we should estimate some activities' active intervals relative to the reference point.

Given a set of workflow specifications WS = {ws1, ws2, …, wsm} associated with workflow schedule S = < W, AS, ≺S >, each of which is defined as in Definition 2.1 with time information, we have:

Definition 5.1 (Reachability): Node nil is reachable from node nik if there is a path in wsi consisting of a series of flows from nik to nil.

Let Reachable(nik, nil) be a Boolean function denoting the reachability from node nik to nil, such that Reachable(nik, nil) = True if ∃ path = < nik, …, nil >, and False otherwise. Note that all nodes reachable from nik form the reachable set of nik, denoted as Re(nik), which is a subset of Ni and can be calculated as follows:

Re(nik) = { n | n ∈ Ni ∧ ((< nik, n > ∈ Fi) ∨ (∃ nil ∈ Ni, nil ∈ Re(nik) ∧ n ∈ Re(nil))) }.

Here, the relation of reachability is assumed to be reflexive, i.e., nik ∈ Re(nik). Let aik be a reference point; for an activity ail ∈ Re(aik), we have:

Definition 5.2 (Earliest Start Time): The earliest start time of ail, EST(ail), is its start time relative to aik under the condition where (∀a ∈ ANj, j = 1, …, m) ∧ (aik ≺ a) such that DR(a) = d(a).

Definition 5.3 (Earliest End Time): The earliest end time of ail, EET(ail), is its end time relative to aik under the condition where (∀a ∈ ANj, j = 1, …, m) ∧ (aik ≺ a) such that DR(a) = d(a).

Definition 5.4 (Latest End Time): The latest end time of ail, LET(ail), is its end time relative to aik under the condition where (∀a ∈ ANj, j = 1, …, m) ∧ (aik ≺ a) such that DR(a) = D(a).

Definition 5.5 (Estimated Active Interval): The estimated active interval of ail is defined as the period between EST(ail) and LET(ail), where EST(ail) < LET(ail), denoted as [EST(ail), LET(ail)].

The latest start time of an activity is not defined because it is not used in this paper. From the above definitions, we conclude that:

Proposition 5.1: EET(ail) = EST(ail) + d(ail) for any activity ail ∈ Re(aik).
Proof: The proof follows directly from Definitions 5.2, 5.3, and 5.4.

Proposition 5.2: For any activity ail ∈ Re(aik), the active interval [S(ail), E(ail)] is within the estimated active interval [EST(ail), LET(ail)] at reference point aik in workflow schedule S, that is, [S(ail), E(ail)] ⊆ [EST(ail) + S(aik), LET(ail) + S(aik)].
Proof: Omitted. Please refer to [7].
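A sketch (ours) of the reachable set Re(nik) from Definition 5.1, computed by a simple traversal over the flow relation; the detailed EST/LET calculation within a single workflow is left to [6,7], as in the paper.

```python
def reachable_set(start, flows):
    """Re(start): all nodes reachable from start via flows, including start
    itself (reachability is taken to be reflexive, as in Definition 5.1)."""
    succ = {}
    for u, v in flows:
        succ.setdefault(u, set()).add(v)
    seen, stack = {start}, [start]
    while stack:
        n = stack.pop()
        for m in succ.get(n, ()):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen

flows = [('a1', 'a2'), ('a2', 'a3'), ('a2', 'a4'), ('a4', 'a5')]
print(reachable_set('a2', flows))   # {'a2', 'a3', 'a4', 'a5'}
```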
Theorem 1: Given a correct workflow schedule S = < W, AS, ≺S >, a reference point aik, and a temporal constraint DR(aik, ail) ≤ p on wi, where ail ∈ Re(aik), we assert:
i) DR(aik, ail) ≤ p is not satisfied at aik if p < EET(ail).
ii) DR(aik, ail) ≤ p is satisfied at aik if LET(ail) ≤ p.
iii) DR(aik, ail) ≤ p cannot be decided at aik if EET(ail) ≤ p < LET(ail).
Proof: Omitted. Please refer to [7].

Given a workflow specification wsi and a reference point aik, according to the definitions we have EST(aik) = 0 and LET(aik) = D(aik). For each activity ail ∈ Re(aik), we can calculate its estimated active interval within this workflow specification. In this paper, we do not address this calculation in detail due to the space limit; readers are referred to [6,7] for a detailed discussion.

The calculation of estimated active intervals within a single workflow does not consider resource dependencies between activities in a workflow schedule. Hence, the values calculated are inconsistent with the situation in an environment with concurrent workflows, and they need to be adjusted in terms of resource dependencies. Given two concurrent workflows wi and wj, let aik and ajm be selected as two reference points. Now, we can calculate [EST(aip), LET(aip)] for each activity aip ∈ Re(aik) within wi and [EST(ajs), LET(ajs)] for each ajs ∈ Re(ajm) within wj, respectively. However, suppose there are two activities aip ∈ Re(aik) and ajs ∈ Re(ajm) having a resource dependency; the possible relations between their estimated active intervals can be described distinctly by Allen's interval logic [4]. If [EST(aip), LET(aip)] ∩ [EST(ajs), LET(ajs)] ≠ φ, their estimated active intervals need to be adjusted in order to avoid a resource conflict in the concurrent execution. Without loss of generality, suppose EST(aip) ≤ EST(ajs). Then ajs's estimated active interval should be adjusted as follows:

EST(ajs) = MAX{EST(ajs), EST(aip) + d(aip)};
LET(ajs) = MAX{LET(ajs), LET(aip) + D(ajs)}.

After the above adjustment of ajs's estimated active interval, given an active interval of aip such that [S(aip), E(aip)] ⊆ [EST(aip) + S(aik), LET(aip) + S(aik)] in workflow schedule S, we can get an active interval [S(ajs), E(ajs)] ⊆ [EST(ajs) + S(ajm), LET(ajs) + S(ajm)] such that [S(aip), E(aip)] ∩ [S(ajs), E(ajs)] = φ in the workflow schedule.

In fact, the adjustment of the estimated active intervals of all activities with resource dependencies is an iterative process. Given a set of workflow specifications WS associated with workflow schedule S = < W, AS, ≺S > and a reference point aik, the steps for calculating the estimated active intervals of the relevant activities in WS are as follows:
Step 1: Select a reference point for each workflow in W. Assume S(aik) = ti; at that moment, activity ajm is selected as the reference point of wj if it is in the running state.
Step 2: Calculate the estimated active intervals of activities within each workflow specification separately. For each workflow wj ∈ W and the selected reference point ajm, we calculate [EST(ajs), LET(ajs)] for each activity ajs ∈ Re(ajm) within workflow wj.
Step 3: For each workflow wj ∈ W, we need to adjust the estimated active intervals of the activities ajs ∈ Re(ajm) because ajm may have started a while ago. Suppose S(ajm) = tj (tj < ti) and let δ = ti − tj. For each activity ajs ∈ Re(ajm), the estimated active interval is adjusted to [EST(ajs) − δ, LET(ajs) − δ].
Step 4: All activities having estimated active intervals in WS compose a sequence Q, sorted in ascending order of their EST values. We process the sequence in the following steps until it becomes empty.
Step 4.1: Remove the activity at the head of Q, and call it aip.
Step 4.2: Let Dep(aip) be the set of activities in Q which have resource dependencies with aip. Similarly, the elements in Dep(aip) are sorted in ascending order of their EST values.
Step 4.3: If Dep(aip) = φ, then go to Step 4.1; otherwise remove the activity with the smallest EST value from Dep(aip), and call it ajs.
Step 4.4: If [EST(aip), LET(aip)] ∩ [EST(ajs), LET(ajs)] = φ, then go to Step 4.3; otherwise go to Step 4.5.
Step 4.5: EST(ajs) and LET(ajs) are adjusted as follows:
EST(ajs) = MAX{EST(ajs), EST(aip) + d(aip)};
LET(ajs) = MAX{LET(ajs), LET(aip) + D(ajs)}.
For each activity ajt ∈ Re(ajs) (not including ajs itself), its estimated active interval is re-calculated within the workflow specification wsj accordingly. After that, go to Step 4.3.

At the end of this process we obtain the complete workflow schedule Sc of S. More formally:

Definition 5.6 (Complete Workflow Schedule): Let S = < W, AS, ≺S > be a workflow schedule. The complete workflow schedule of S at reference point aik, Sc, is a 3-tuple < WSc, ASc, ≺Sc >, where:
i) W = WSc.
ii) ASc is a set of activities derived from AS in the following ways: (a) For each workflow wi ∈ W, if aip ∈ Ai, then aip ∈ ASc. (b) For each workflow wi ∈ W, suppose aik is its reference point; if aip ∈ Re(aik), then aip ∈ ASc.
iii) The partial order ≺Sc is determined as follows: (a) For any two activities aip, ajs ∈ AS, if aip ≺S ajs in S, then aip ≺Sc ajs in Sc. (b) For two activities aip, ais ∈ (ASc − AS)³, if Reachable(aip, ais) = True, then aip ≺Sc ais in Sc. (c) For two activities aip ∈ AS and ais ∈ (ASc − AS), we have aip ≺Sc ais in Sc.
iv) For every workflow wi ∈ W, each activity aip ∈ Ai that has completed in S has an active interval [S(aip), E(aip)], and each activity ais ∈ Re(aik) has an estimated active interval [EST(ais), LET(ais)] relative to the reference point aik.

Theorem 2: If workflow schedule S = < W, AS, ≺S > is correct, then the complete workflow schedule of S, Sc = < WSc, ASc, ≺Sc >, at reference point aik is also correct.
Proof: Omitted. Please refer to [7].

Based on the above discussion, the checking process is summarised as follows. Firstly, we construct the complete workflow schedule of S at the reference point. Then, according to Theorems 1 and 2, the satisfaction of the temporal constraint is decided by the following three cases:
Case 1) p < EET(ail): the temporal constraint cannot be satisfied at aik.
Case 2) EET(ail) ≤ p < LET(ail): the temporal constraint cannot be decided at aik.
Case 3) LET(ail) ≤ p: the temporal constraint can be satisfied at aik.

We can dynamically check whether a temporal constraint can be satisfied at a reference point through the above process. A result of satisfaction means that the associated workflow is executing under the normal situation. Otherwise, if the result is uncertain or unsatisfied, some measures should be taken to tackle these abnormal cases, for example, triggering exception handling, assigning a higher priority to the workflow, or adjusting the temporal constraint. This issue is beyond the scope of this paper and will be addressed in detail elsewhere.
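The following sketch (ours) combines the adjustment rule of Step 4.5 with the three-way decision of Theorem 1; the interval bookkeeping is simplified to dictionaries keyed by activity name, and the toy values are invented.

```python
def adjust(est, let, d, D, aip, ajs):
    """Step 4.5: push a_js's estimated interval past a_ip when they may conflict."""
    if not (let[aip] < est[ajs] or let[ajs] < est[aip]):     # intervals overlap
        est[ajs] = max(est[ajs], est[aip] + d[aip])
        let[ajs] = max(let[ajs], let[aip] + D[ajs])

def check_constraint(p, est, let, d, ail):
    """Theorem 1 at reference point a_ik for the constraint D_R(a_ik, a_il) <= p."""
    eet = est[ail] + d[ail]                                  # Proposition 5.1
    if p < eet:
        return 'unsatisfied'
    if let[ail] <= p:
        return 'satisfied'
    return 'undecided'

est = {'a_ip': 0, 'a_js': 2}
let = {'a_ip': 6, 'a_js': 9}
d   = {'a_ip': 3, 'a_js': 2}
D   = {'a_ip': 6, 'a_js': 5}
adjust(est, let, d, D, 'a_ip', 'a_js')
print(est['a_js'], let['a_js'])                 # 3 11
print(check_constraint(12, est, let, d, 'a_js'))   # satisfied
```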
6 Conclusions

The ability to dynamically check temporal constraints is crucial to workflow management. Past work on this issue is limited to a single workflow. In fact, multiple workflows execute concurrently in a workflow management system, so we believe that the existing approaches are too simple and weak to verify temporal constraints in an environment with concurrent workflows. This paper proposes an approach to dealing with this problem. Firstly, we unify the representation of absolute temporal constraints and relative temporal constraints, and provide a canonical representation for them. Then, we investigate resource constraints in workflow management systems and concurrent workflow executions. Based on these discussions, we develop the verification of temporal constraints on concurrent workflows. Compared with the existing approaches, the method presented in this paper takes into account the relationship between concurrent workflows. In addition, it is dynamic and can be used by workflow management systems at run-time.
³ ASc − AS is the set consisting of the elements a ∈ ASc with a ∉ AS.
Our ongoing work focuses on workitem prioritisation and resource allocation policies. For a workflow schedule, suitable policies should be adopted to allocate resources to activities by priority so that temporal constraints can be satisfied.
Acknowledgement. The work reported in this paper is supported in part by Swinburne VC Strategic Research Initiative Grant (2002-2004), as well as the National Natural Science Foundation of China under grant No.60273026 and No.60273043.
References

1. Adam, N., Atluri, V., Huang, W.: Modeling and Analysis of Workflows Using Petri Nets. Journal of Intelligent Information Systems, Special Issue on Workflow and Process Management, 10(2), (1998) 131-158
2. Bettini, C., Wang, X., Jajodia, S.: Temporal Reasoning in Workflow Systems. Distributed and Parallel Databases, 11(3), (2002) 269-306
3. Casati, F., Grefen, P., Pernici, B. et al.: WIDE Workflow Model and Architecture. Public Documents, http://dis.sema.es/projects/WIDE/Documents/ase30_4.ps.gz, April (1996)
4. Chinn, S., Madey, G.: Temporal Representation and Reasoning for Workflow in Engineering Design Change Review. IEEE Transactions on Engineering Management, 47(4), (2000) 485-492
5. Fakas, G., Karakostas, B.: A Workflow Management System Based on Intelligent Collaborative Objects. Information and Software Technology, 41, (1999) 907-915
6. Li, H., Yang, Y., Chen, T.Y.: Resource Constraints Analysis of Workflow Specifications. Accepted for publication in the Journal of Systems and Software, Elsevier
7. Li, H., Yang, Y.: Dynamic Checking of Temporal Constraints for Concurrent Workflows. Internal Report, School of Information Technology, Swinburne University of Technology, (2003)
8. Marjanovic, O.: Dynamic Verification of Temporal Constraints in Production Workflows. In Proc. of the Australian Database Conference, Canberra, Australia, (2000) 74-81
9. Reichert, M., Bauer, T., Dadam, P.: Enterprise-wide and Cross-Enterprise Workflow Management: Challenges and Research Issues for Adaptive Workflows. In: Enterprise-wide and Cross-Enterprise Workflow Management: Concepts, Systems, Application, Germany, (1999)
10. Workflow Management Coalition: The Workflow Reference Model. TC00-1003, (1995)
11. Zhao, J., Stohr, E.: Temporal Workflow Management in a Claim Handling System. In Proc. of Work Activities Coordination and Collaboration (WACC'99), San Francisco, CA, USA, (1999) 187-195
12. Zhuge, H., Cheung, T., Pung, H.: A timed workflow process model. The Journal of Systems and Software, 55(3), (2001) 231-243
Research on Partner-Choosing and Web Services Composition for B2B E-commerce in Virtual Enterprises

Chao Young, Shoujian Yu, and Jiajin Le

Department of Computer, Donghua University, Yan'an West Road 1882, 200051 Shanghai, China
{Davidyoung, Jackyyu, Lejiajin}@mail.dhu.edu.cn
Abstract. Based on Web Services architectures (WSA), this paper presents some ideas for B2B e-commerce in virtual enterprises (VE), including partner-choosing according to business information via a multi-layer UDDI and efficient Web Services composition through standard common data schemas and service patterns.
sion, commercial business can be carried out together by several partners that are distributed geographically. Commercial services can be published by an enterprise for its commercial aims as Web Services. An enterprise describes its Web Services in WSDL (Web Services Description Language) and registers them with a registry center. Customers find Web Services in the registry center via UDDI (Universal Description, Discovery and Integration) and call the Web Services according to their WSDL descriptions. The client program and the Web Service exchange data via SOAP (Simple Object Access Protocol) to complete their business. Web Services are based on open standards and protocols and are built on the Internet. Web Services provide a flexible and loosely coupled distributed computing environment for a VE. At the same time, Web Services mask the application platform: data is exchanged in the form of XML documents, which are independent of any programming language. So, at present, Web Services are the ideal candidate for integrating enterprise applications and setting up an open and loosely coupled information platform for a VE.

The remainder of this paper is organized as follows. In the second section we put forward and analyze two issues concerning Web Services composition and partner-choosing. In the third section we give a solution for rapid Web Services composition based on standard common data schemas and service patterns. In the fourth section we extend the traditional UDDI registry center and propose a multi-layer model to define, register, and discover services and enterprise information. In the last section we conclude and outline future work.
2 Issues Overview

Web Services provide a reasonably lightweight and open, standards-based mechanism for computer-to-computer communication. Web Services consist of a set of key technologies and standards that are suitable for B2B e-commerce. It should be pointed out that there are still several issues that need to be addressed before the full potential of a VE platform can be realized over Web Services architectures. In this section we concentrate on the following two aspects.

2.1 Inefficient and Costly Web Services Composition

Web Services are interesting and differ from other distributed computing technologies because they are based on SOAP messages, which are encoded in XML and transported over HTTP (among other common protocols). SOAP is independent of application concerns such as program logic and syntax. In other words, SOAP defines only the structure of the information without prescribing its content; the data types and structures are described in the WSDL document.

Firstly, if all enterprises develop and publish their Web Services only according to their own business logic, enterprises will end up with programming interfaces different from those of other enterprises, even if they conduct the same business (play the same role in the supply chain). Thus, after finding an appropriate service (denoted
SA) from a UDDI registry (either a public registry or a private one hosted by a trusted third party), the customer of the Web Service must study and analyze the WSDL description issued by the publisher of this Web Service, understand its programming interface and data structures, and then program an appropriate module to invoke the service and process the response. When choosing another Web Service (denoted SB) from another partner, programmers have to develop a new module, because the two Web Services (SA and SB) differ in data structure and interface pattern even though they realize the same business logic. All of this hampers the adoption of Web Services and the flexibility of VE applications. Some new tools can generate invocation code from a WSDL document, but their ability is still limited because they cannot understand the interface and parameters exactly without human interaction.

Secondly, there is a lot of data to be exchanged between the partners of a VE. These data, such as prices, orders, specifications, drafts, and so on, are complex, with special data structures that differ among enterprises. Therefore partners have to provide many Web Services to exchange these data in XML format via SOAP messages. But as long as every enterprise uses a different data structure to describe its data, that is, adopts a different XSD (XML Schema Definition) for its orders, for example, customers have to code special modules for each XSD to process the business data exactly. So these different data structures are also a bottleneck for the composition of Web Services.

If the composition of Web Services only involves the applications within an enterprise or only covers several stable partners, the problems mentioned above may not look so crucial. But for the dynamic supply chain of a VE, loose coupling is necessary because this kind of supply chain decomposes or merges frequently. If every change of the supply chain gives rise to the modification or update of code, the cost of change is too expensive and the information platform of the VE is inefficient. To compose Web Services quickly and cost-effectively for effective B2B collaboration, a solution based on standard common data schemas and service patterns is proposed in Section 3.

2.2 Limited Partner-Choosing Support

Electronic commerce is moving towards a vision of Web Services based interactions, where cooperating enterprises use Web Services to interact with each other dynamically. The UDDI specifications define a way to publish and discover information about Web Services. This approach relies upon a distributed registry of businesses and their service descriptions implemented in a common XML format. A UDDI registry is a publicly accessible web site implementing the specification that allows enterprises to register information about the Web Services they offer so that other enterprises can find them.

Firstly, let us analyze the content of the registration information. Conceptually, the information provided in a UDDI registration consists of three components: "white pages", including address, contact, and known identifiers; "yellow pages", including
industrial categorizations based on standard taxonomies; and "green pages", the technical information about the services exposed by the business. Green pages include references to specifications for Web Services, as well as support for pointers to various file- and URL-based discovery mechanisms if required. This point is very important, so we will return to it later in the paper. The information hierarchy and the key XML element names used to describe and discover information about Web Services are shown in Figure 1.
Fig. 1. Registration information hierarchy
A UDDI registry can be used to locate information about how a potential partner has exposed a Web Service and to learn the technical details needed to interact with that service. Furthermore, a UDDI registry can be used at the business level to find an appropriate partner in a given industry or region, in other words, to choose business partners according to UDDI registration data. As Figure 1 shows, however, it is important to note that UDDI does not form a full-featured discovery service. The UDDI registry is targeted at enabling technical discovery of services, and concrete business-level searches are not covered by the UDDI specifications. Consider the following scenarios:
Locate a partner that can provide a special product or service at a given price.
Locate a partner that has an honorable credit standing.
Locate a partner whose production and management level has met a certain international standard.
These kinds of advanced business-related discovery features are not, and should not be, provided by the UDDI registry, because the UDDI specifications have to keep their lightweight flexibility. But these advanced business-related discovery features are important and a routine necessity in business processes. How to solve this problem and complement UDDI search will be discussed in Section 4.
3 Web Services Composition

Web Services composition is gaining considerable momentum as a paradigm for effective B2B collaboration. The main goal of our work is to study how to facilitate large-scale integration of the Web Services that realize a business process.

3.1 Standardizing Common Data Schemas and Service Patterns by tModel within an Industry Domain

As mentioned in Section 2.1, heterogeneous business data schemas and Web Services interfaces make the composition of Web Services between partners time-consuming and costly; consequently, B2B collaboration becomes inflexible and inappropriate for the dynamic environment of a VE. To compose Web Services quickly and cost-effectively for effective B2B collaboration, we find it important to set up a set of common standards or criteria for Web Services interfaces and business data structures. Only if common interfaces are widely adopted in the supply chain of a VE can Web Services be invoked conveniently and efficiently. What is more, only if data structures are standardized can business data be exchanged seamlessly across enterprises.

It is impossible to obtain common interfaces and data structures over all industry domains, so standards should be built within a certain industry domain. All partners of a VE, especially neighboring partners in a supply chain, usually belong to the same large or small industry domain. Therefore, unifying Web Services interfaces and data structures is feasible and is a fundamental job for the VE B2B e-commerce platform of a certain industry domain. Unifying Web Services interfaces and data structures within an industry domain is thus an important idea we provide in our work.

According to the UDDI specifications, tModel entities are references, actually URLs, that can be used to access information about a specification. Also called "fingerprints", this information is metadata about a specification, including its name, publishing organization, and URL pointers to the actual specifications themselves. So we can standardize Web Services interfaces and data structures, save them at a public web site (for example, the e-market we define in the next section), and then use tModels to reference these public technical standards. All enterprises that develop their Web Services according to them will be compatible in technology. Because compatible enterprises adopt common Web Services interfaces and business data structures, their applications can be integrated easily, dynamically, and seamlessly. When a change happens in the supply chain, an enterprise does not have to update or change its program to integrate with new trading partners. Thus the cost of integration is not as expensive as before, and the efficiency of data exchange is improved.
3.2 The Design of Common Data Schemas and Service Patterns

Web Services and related functional modules can be classified and designed into three kinds of components. Each kind consists of several smaller components of different granularity that ultimately form XML documents referenced by standard tModels. These tModels actually become standard technology specifications.
1) Web Services interface specifications (WSIS). These specifications are designed in accordance with the characteristics of the industry. The unified interface specifications are composed of the business processes and logic of the VE partners, and the functions of every participant in the VE supply chain are embodied in sub-modules. For instance, the supply chain of a textile VE may include textile, dyeing, and garment enterprises. Thereby the specification package of the textile domain may comprise textile specifications, dyeing specifications, and garment specifications. All these specifications connect with each other to set up the VE supply chain. An enterprise implements only those specifications corresponding to its business; these will become its Web Services interface.
2) Web Services calling specifications (WSCS). These specifications are designed in accordance with the Web Services interface specifications as well as trading regulations. They are also decomposed into sub-modules for every enterprise of the supply chain. Via them, an enterprise can call the Web Services of its partners. The Web Services calling specifications and the Web Services interface specifications together cover all the business processes of a certain industry domain. Industry enterprises can, in fact, unify their Web Services interfaces through these two component specifications.
3) Business data structure specifications (BDSS). These specifications comprise data structure definitions and data composing-decomposing programs. The data structure definitions are XML schema documents. The data composing-decomposing programs provide common interfaces to connect with back-end data sources.
Fig. 2. Inner-enterprise application architecture
The following is the procedure that an enterprise must take to develop and publish its Web Services. The first step is to implement the specifications related to its own business. For example, a textile enterprise should implement the Web Services interface components in line with the WSIS to establish its Web Services, and implement the dyeing-
Fig. 3. Inter-enterprise application architecture
calling modules and garment-calling modules in line with the WSCS, which will be used to invoke the Web Services of dyeing or garment enterprises, and implement the business data composing-decomposing components in line with the BDSS for data exchange between enterprises. The second step is to integrate the Web Services with legacy applications and data sources via the standard interfaces of the components. The third step is to publish the enterprise's basic information and product or service information in the registry center. The inner-enterprise application architecture is shown in Figure 2, and the whole inter-enterprise application architecture is shown in Figure 3.
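As a toy illustration (ours, with hypothetical interface and method names) of how a textile enterprise might implement its slice of the WSIS while calling a dyeing partner through a WSCS-style stub:

```python
from abc import ABC, abstractmethod

class TextileServiceSpec(ABC):
    """Hypothetical WSIS slice for textile enterprises: every textile partner
    exposes the same operations, so partners can be swapped without recoding."""
    @abstractmethod
    def quote_fabric(self, order_xml: str) -> str: ...
    @abstractmethod
    def confirm_order(self, order_xml: str) -> str: ...

class DyeingCallingStub:
    """Hypothetical WSCS stub: how a textile enterprise calls any dyeing partner
    that implements the common dyeing interface; the endpoint comes from the registry."""
    def __init__(self, endpoint: str):
        self.endpoint = endpoint
    def request_dyeing(self, batch_xml: str) -> str:
        # in a real deployment this would be a SOAP call to self.endpoint
        return f"<ack endpoint='{self.endpoint}'/>"

class MyTextileService(TextileServiceSpec):
    def quote_fabric(self, order_xml: str) -> str:
        return "<quote currency='USD'>12.5</quote>"
    def confirm_order(self, order_xml: str) -> str:
        # pass the dyeing step on to whichever partner was chosen via the e-market
        return DyeingCallingStub("http://dye.example.com/ws").request_dyeing(order_xml)
```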
4 Partner-Choosing According to Business

As mentioned in Section 2.2, advanced business-related registration and discovery are not covered by the UDDI specifications, but they are an important and routine necessity in B2B e-commerce. Through a UDDI registry, one can only evaluate enterprises according to the technical information of their Web Services; a lot of important commercial information is not covered by the UDDI registry. So enterprises cannot make an all-around judgment and a wise partner choice based only on the UDDI registry.

4.1 Multi-layer UDDI Architecture

Commercial information is necessary for partner-choosing, but these data are not covered by the UDDI registry. We therefore propose a multi-layer UDDI architecture to support partner-choosing according to both Web Services technical information and enterprise commercial data. The multi-layer UDDI architecture is shown in Figure 4.
Fig. 4. Multi-layer UDDI architecture
We insert an additional layer between the UDDI client (client program or programmer) and the UDDI registry. This additional layer plays a role like a market, so we name it the e-market. An e-market can be set up in each industry domain. Enterprises belonging to a certain industry register themselves and look for partners through their industry e-market. For example, a textile enterprise can register its commercial data and Web Services information through the textile e-market, and through the textile e-market it can look for the most appropriate dyeing enterprise as a partner.

4.2 Design of the E-market Based on Web Services

As the middle layer between applications and the UDDI registry, the e-market must first provide all the functions of the UDDI registry, because clients register and discover entirely through the e-market without connecting directly to the UDDI registry. Considering compatibility and consistency in technology, the e-market should provide its API in the form of Web Services. So if WSIUDDI stands for the UDDI Programmer's API and WSIE stands for the Programmer's API of the e-market, then the relation WSIUDDI ⊂ WSIE holds. Furthermore, WSIE must provide business-related registration and query functions, for instance, looking for an enterprise that can provide a special product or service at a given price. We call this part of the functions WSIB. Then, WSIE = WSIUDDI + WSIB.

As the core information model used by the UDDI registries is defined in an XML schema, the information model used by the e-market is also defined in an XML schema. XML was chosen because it offers a platform-neutral view of data and allows hierarchical relationships to be described in a natural way. The emerging XML Schema standard was chosen because of its support for rich data types as well as its ability to easily describe and validate information based on information models represented in schemas. The e-market information consists of enterprise commercial information such as prices, product specifications, enterprise reputation, and so on. Additionally, for faster query speed, some UDDI data can be cached here. It must be pointed out that the GUID (Globally Unique Identifier) of a business entity in the e-
is the same as that in the UDDI registry. Thus the data of business entities remain consistent between the e-market and the UDDI registry. The complete registration process of a new business entity is shown in figure 5.
[Sequence diagram between the new client (AP), the e-market (EM), and the UDDI registry: submit application, transfer application, return GUID, register data, register technology data, register business data, return OK.]
Fig. 5. The complete registration process of a new business entity
In the multi-layer UDDI architecture, the e-market provides the business-related registration and query that are necessary for partner-choosing in a VE. At the same time, by caching UDDI data, the e-market can share the load of the UDDI registry. It should be emphasized that enterprise reputation is very important and useful content in the e-market. The production of trust is an important requirement for forming and growing online trading communities. The lack of a common history with potential trading partners, as well as the relative ease with which buyers and sellers can change partners from one transaction to the next, gives incentives to both parties to provide inferior service quality or to hold back on their side of the exchange. Reputation reporting systems have emerged as an important risk management mechanism in such online communities. The goal of reputation systems is to encourage trustworthiness and avoid risk in transactions by using past behavior as a publicly available predictor of likely future behavior. Sometimes commercial information such as price and product specification is deliberately concealed as confidential information. In such cases, reputation becomes the only and key warrant for an enterprise to choose trading partners. We have developed a new online reputation reporting system to manage VE reputation. A new mechanism is used to cope with noisy reputation ratings. These ideas will be presented in future papers.
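As a rough illustration of the kind of partner query such an e-market could answer, the following Python sketch filters registered enterprises by commercial criteria and ranks them by reputation. The class name, field names and sample data are assumptions made here for illustration only; they are not part of the WSIE interface defined above.

from dataclasses import dataclass

@dataclass
class EMarketEntry:
    guid: str            # same GUID as in the UDDI registry
    industry_code: str   # a NAICS-style industry code (assumed encoding)
    product: str
    price: float
    reputation: float    # aggregated reputation rating in [0, 1]

def choose_partners(entries, industry_code, product, max_price, min_reputation):
    """Filter e-market entries by commercial criteria and rank them by reputation."""
    candidates = [e for e in entries
                  if e.industry_code == industry_code
                  and e.product == product
                  and e.price <= max_price
                  and e.reputation >= min_reputation]
    return sorted(candidates, key=lambda e: e.reputation, reverse=True)

# Hypothetical data: two dyeing enterprises registered in a textile e-market.
entries = [
    EMarketEntry("guid-1", "3133", "fabric dyeing", 4.2, 0.91),
    EMarketEntry("guid-2", "3133", "fabric dyeing", 3.8, 0.62),
]
print(choose_partners(entries, "3133", "fabric dyeing", max_price=5.0, min_reputation=0.7))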
5 Conclusion and Future Work
In this paper, we analyze how to realize B2B e-commerce in VEs based on Web Services technology. We focus our work mainly on two questions: how to realize Web Services composition and how to support partner-choosing. We propose unifying Web Services interfaces and data structures within an industry domain, an important idea by which Web Services become compatible. We also propose a multi-layer UDDI architecture to support partner-choosing according to both Web Services technical information and enterprise commercial data. In the future we will take steps toward the following aspects: How to design the business data structure? How to realize business processes with Web Services? How to compose Web Services more effectively? How to design the e-market in detail?
A Verification Model for Electronic Transaction Protocols Qingfeng Chen, Chengqi Zhang, and Shichao Zhang Faculty of Information Technology University of Technology, Sydney P.O. Box 123, Broadway, NSW 2007, Australia {qchen, chengqi, zhangsc}@it.uts.edu.au
Abstract. Subtle flaws have been found in electronic transaction protocols. Recently, model checking has been used to verify electronic transaction protocols, because the traditional approaches suffer from low efficiency and are error prone. This paper proposes an extendable verification model specifically for validating electronic transaction protocols. In particular, the verification model is able to deal with inconsistency in transmitted messages. Thus, we can measure the incoherence in secure messages coming from different sources and at different moments and ensure the validity of the verification result. We analyze two instances by using this model. The analyses uncover some subtle flaws in the protocols.
1
Introduction
With the growth of e-commerce, security has become a key issue. For example, Internet shopping malls and other sites that act as brokers between individual users and/or commercial entities make it difficult to establish the credibility of a counterparty. Therefore, it is of great importance to have protocols that can authenticate principals and messages under these circumstances. However, time has proven security protocols, including electronic transaction protocols, to be extremely error prone. To prevent the use of error-prone protocols, two kinds of methods, theorem proving and model checking, have been used to verify security protocols. Theorem proving has been widely applied to analyze the security properties of key exchange and authentication protocols, such as Needham-Schroeder, Kerberos, SSL, and others [1]. It usually suffers from low efficiency and is error prone. Heintze [2] first used model checking to analyze the implementation of an electronic commerce protocol. However, less attention has been paid to the model checking of electronic transaction protocols. In contrast to security protocols that involve secrecy and authentication, the correctness conditions for electronic transaction protocols contain more components, such as the period of validity of a transmitted message. These conditions present interesting challenges for a formal analysis tool. On the other hand, the main problem in verifying security protocols is that they are intended to run on
networks with an unbounded number of principals, and that the principals operate on data such as encryption keys, certificates, and messages whose values are hard to bound. To address these problems, a verification model is established especially for the verification of electronic transaction protocols. Moreover, it is feasible to validate other security protocols with it, because the fundamental security mechanisms of security protocols remain the same even when a protocol is altered. Usually, the transmitted messages in security protocols are assumed to be consistent. However, inconsistency in messages has become one of the key issues in e-commerce. The verification model proposed by us is able to evaluate the inconsistency in transmitted secure messages coming from different sources and at different moments and to ensure the validity of the verification result. Two instances, from the SET [6] and Needham and Schroeder's [7] protocols, are verified by using this model. A number of abstractions have been employed; for example, the low-level details of the underlying cryptographic mechanisms are disregarded and treated as a black box (this is the standard "perfect encryption" assumption). Thus, we could develop a simple model and focus on the verification of the security properties we expect to hold. From this observation, the verification model, in contrast with theorem proving, is able to partially decrease the risk of introducing flaws and to increase efficiency. Section 2 gives a brief overview of model checking. Section 3 describes the basic concepts and components of the verification model and presents the verification of two instances. Section 4 concludes this paper.
2
Model Checking
Model checking is a technique that formally verifies finite-state concurrent systems. Specifications about the system are expressed as logic formulas, and efficient symbolic algorithms are used to traverse the model defined by the system and check whether the specification holds or not. Model checking has been successfully used in hardware design. For example, SMV and SPIN have been used to detect flaws in published circuit designs and in computer protocols for multiprocessors. It has also recently attracted the attention of software designers. In the security domain, only a few works have been done. Lowe [3] used FDR to debug and validate the correctness of the Needham-Schroeder protocol. Heintze [2] first verified electronic commerce protocols, including NetBill and a simplified digital cash protocol. Protocol verification aims at proving that protocols meet their specifications, i.e., that the actual protocol behavior coincides with the desired one. On the other hand, if the logic formulas cannot be proved within the finite state space, some feasible suggestions will be generated and sent to the user. However, to our knowledge, not much work regarding model checking has been done in validating complicated electronic transaction protocols.
3 Verification Model
3.1 Components
The verification model is separated into four basic modules in terms of their different functions: 1) inference engine; 2) knowledge base; 3) user interface; and 4) facts database. The knowledge base comprises the knowledge that is specific to the application domain, including facts about this domain and rules that describe the relations or phenomena in the domain. The inference rules inside the knowledge base embody the fundamental security mechanisms of protocols.
Example 1. Examples of known facts and rules include: Alice knows Tom's public signature key, know(Alice, Spb(Tom)); and if Alice knows the communication key k and e(m, k), she should know the plain text of m.
The inference engine is the core of the whole inference framework. It knows how to actively use the knowledge in the knowledge base. This verification model uses the embedded inference engine of Prolog. For simplicity, the discussion of the inference engine is omitted here. A user interface is used to provide communication between the user and the system. It is convenient to view the inference engine and the interface as one module, usually called a shell. In addition to the known facts, some real-time facts have to be collected through the user interface and then stored into the facts database.
Example 2. Examples of real-time facts include: Alice knows the book order sent from Bob, know(Alice, book order(Bob)).
From the above description, each of these parts acts as an independent module of the verification model. This scheme separates knowledge from the algorithms that use the knowledge. It enables a rational way of developing a verification model for several applications: develop a shell that can be used universally, and then add new knowledge for each application. Even if modifications of the knowledge base from one protocol to another are necessary, at least the basic security mechanisms remain the same. Thereby, it is convenient to apply this model to validate other protocols.
3.2 Designing the Model
As mentioned above, the facts usually include the known facts about encryption keys and messages and the real-time facts. In particular, the known facts are stored in the knowledge base along with the inference rules. However, the real-time facts have to be derived from the interaction between the user and the verification system via the user interface. Usually, the language of if-then rules is adopted for representing knowledge. Each rule consists of two parts, condition and conclusion. Sometimes the condition of a rule can be the conclusion of other rules.
Example 3. Rule 1: if Alice knows the communication key k and the encrypted message e(m, k), then she should know the plain text of m. Rule 2: if Alice knows message m, she should know the signed message S(m, Spv(Alice)). The conclusion of rule 1 is actually a condition of rule 2. Thus, the knowledge base can be shown as an inference network, as in Fig. 1. Nodes A, B, C, D, E, F, G, and H in the network correspond to propositions and links correspond to rules in the knowledge base. E and G are the conclusions of rule 2 and rule 3 respectively. Moreover, they act as conditions of rule 4. Thus, the inference system will first search for the conclusion in the facts database. If it finds the matching facts in the database, the verification process will skip this rule and turn to check the next inference rule for satisfying the other conditions; if not, it has to match the conditions of this rule one by one. Arcs that connect the links indicate the conjunctive connection between the corresponding propositions.
Fig. 1. Mode of inference network
Once knowledge is represented in some form, we need a reasoning procedure to draw conclusions from the knowledge base. For if-then rules, there are two basic ways of reasoning: 1) backward chaining; and 2) forward chaining. Backward chaining starts with a hypothesis and works backwards, according to the rules in the knowledge base, toward easily confirmed findings. Forward chaining reasons in the opposite direction. In our verification model, backward chaining is chosen as the reasoning method; it searches from the goal we want to verify back to the data. The user thus needs to submit an authentication command as the goal, and the verification system then tries to prove the goal by finding the related data in the facts database and knowledge base. If the verification process finally achieves the goal, the authentication succeeds; if not, based on the existing knowledge, it is natural for us to decide that the authentication fails, due to the fail-negate property of verification [4]. This verification model provides four different ways to handle the information:
• Adopt the external file as the storage of the knowledge base;
• Collect the facts and knowledge by the interaction with user;
• Access the knowledge base;
• Output the results.
The verification procedure is depicted in Fig. 2. When the user detects some suspicious problem during a transaction, he submits an authentication request to the authentication server. As stated above, some known facts should be collected beforehand. He also needs to input the real-time facts and store them into the facts database. In answer to the authentication request, the authentication server will search the knowledge base and try to find the matching rules in terms of the inference network in Fig. 1. Let the set R denote the knowledge base with n inference rules:
R = {rule1, rule2, . . . , rulen}
where the ith rule in the knowledge base is of the form:
rulei = {(N, [Conditionij], Conclusioni) | 1 ≤ i ≤ n, 1 ≤ j}
where Conditionij is a set of assertions connected by the logic operators and and or, Conclusioni is a simple assertion that must not contain a logic operator, and N is the rule name. The assertions in rules can be terms that contain variables. If all the rules of R have been searched without success, the authentication is halted promptly. If a rule rulei is found, the system will check whether the Conclusioni of this rule has already been stored in the facts database. If the Conclusioni is found, the verification system will skip it and go to the next rule; if not, we have to match the Conditionij of rulei one by one. If and only if all the conditions are satisfied, a true value is returned; otherwise we have to check the next inference rule until all the rules are exhausted. The Conditionij can be derived from the facts database or deduced from other inference rules. Finally, a response message is generated and transmitted to the requester, which is used to decide whether the validated message is secure or not. To access the knowledge base and the facts database, it is necessary to establish an interface between the user and the inference model. The interface should support facts collection and query processing. The details are discussed below.
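To make the goal-driven rule matching described above concrete, the following Python sketch shows a minimal backward-chaining check over rules of the form (name, [conditions], conclusion). It is an assumed, simplified stand-in for the embedded Prolog inference engine used by the model; the prove() helper and the example rules and facts are hypothetical.

def prove(goal, rules, facts):
    """Backward chaining: a goal holds if it is a known fact, or if some rule
    concludes it and every condition of that rule can itself be proved."""
    if goal in facts:
        return True
    for name, conditions, conclusion in rules:
        if conclusion == goal and all(prove(c, rules, facts) for c in conditions):
            facts.add(goal)          # cache the derived conclusion
            return True
    return False                      # fail-negate: unprovable goals are rejected

# Hypothetical rule base in the spirit of Example 3.
rules = [
    ("rule1", ["know(Alice,k)", "know(Alice,e(m,k))"], "know(Alice,m)"),
    ("rule2", ["know(Alice,m)", "know(Alice,Spv(Alice))"], "know(Alice,S(m,Spv(Alice)))"),
]
facts = {"know(Alice,k)", "know(Alice,e(m,k))", "know(Alice,Spv(Alice))"}
print(prove("know(Alice,S(m,Spv(Alice)))", rules, facts))   # True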
3.3
Handling the Knowledge and Facts
Two ways can be used to handle the knowledge base of the verification model. One is to input new knowledge and the other is to read information from the existing knowledge base. They can be denoted by 'a' and 'b' respectively:
process('a') if acquisition.
process('b') if write('Name of the knowledge base'), readln(Name), consult(Name).
[Flowchart: start, check the next rule; if the conclusion is already in the facts database skip the rule, otherwise check whether all conditions are satisfied (consulting the knowledge base, the facts database and the user interface); store the result and stop when no rule remains.]
Fig. 2. The algorithm flow of the inference engine
As described above, some known facts are public knowledge. They should be true in any security protocol. Besides the known facts, there are also some real-time facts derived through interaction with the user. Thereby, a predefined functor facts reading needs to be used to collect them; it is triggered promptly once the verification server receives the authentication request from the client. Every time the user inputs facts, the system asks the user whether it is the terminal symbol "yes". If so, the process of facts collection is halted; otherwise the user keeps on inputting further facts. Facts reading is in fact a human-interaction facility, in which the user can collect the facts by answering 'who' and 'what' questions. For example, "Alice knows Bob's public signature key" is a collected fact, in which Alice corresponds to the 'who' question and Bob's public signature key Spb(Bob) answers the 'what' question. For freshness, some facts are temporarily stored in the buffers of the system and are voided automatically once the reasoning ends. In addition, the verification system often asks the same question that was already asked for a previous rule. To avoid repeated questions, answered questions are saved in the working storage of the system. An internal functor assertz is used to achieve this goal. The facts database is created to save the input facts. Furthermore, each fact is separated into several fields in terms of the concrete instances and then stored into the database. Facts in the database are of the format:
Facts = {(Namei, Messagei) | 0 ≤ i}
Thus, the verification system can execute INSERT, UPDATE, and DELETE operations on facts by executing SQL queries. This provides a flexible way for communication among the inference engine, the knowledge base, and the facts database. If the user wants to update the knowledge base, he can use another built-in predicate asserta() to store the new rules into the knowledge base. However, this operation should be accessible to authorized users only. Thus, the user must obtain authorization before updating.
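The (Name, Message) fact format and the SQL-based operations mentioned above could, for instance, be backed by an embedded relational store. The sketch below uses SQLite purely as an assumption for illustration; the paper does not prescribe a particular database, and the fact values are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, message TEXT)")

def insert_fact(name, message):
    conn.execute("INSERT INTO facts VALUES (?, ?)", (name, message))

def facts_of(name):
    return [row[0] for row in
            conn.execute("SELECT message FROM facts WHERE name = ?", (name,))]

# Real-time facts collected through the user interface (hypothetical values).
insert_fact("Alice", "Spb(Bob)")
insert_fact("Alice", "book_order(Bob)")
print(facts_of("Alice"))           # ['Spb(Bob)', 'book_order(Bob)']
conn.execute("DELETE FROM facts")  # buffered facts are voided once the reasoning ends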
3.4
Measuring Inconsistency of Transmitted Information
Before starting the validation of a protocol, the verification system usually has to collect the transmitted messages. They may, however, be inconsistent due to malicious attacks, network blocking, or interruption. We use the QC model in [5] to measure the inconsistency in secure messages. The main definitions are as follows.
Definition 1. Let A be a set of atoms. Let O be the set of objects defined as follows, in which +β is a positive object and −β is a negative object.
O = {+β | β ∈ A} ∪ {−β | β ∈ A}
We call X ∈ ℘(O) a model. So X can contain both +α and −α for some atom α.
Definition 2. Let |= be a satisfaction relation. For a model X, |= can be defined as follows, in which α, β are atoms.
X |= α iff there is a reason for the belief α in X.
Naturally, Definition 2 can be extended as follows.
X |= α ∧ β iff X |= α and X |= β.
X |= α ∨ β iff X |= α or X |= β.
X |= ¬¬α iff X |= α.
X |= ¬(α ∧ β) iff X |= ¬α ∨ ¬β.
X |= ¬(α ∨ β) iff X |= ¬α ∧ ¬β.
Definition 3. Let X ∈ ℘(O).
Conflictbase(X) = {α | +α ∈ X and −α ∈ X}
Opinionbase(X) = {α | +α ∈ X or −α ∈ X}
If Opinionbase(X) = ∅, then X has no opinions. If Opinionbase(X) = A, then X is totally self-opinioned. If Conflictbase(X) = ∅, then X is free of conflict.
Definition 4. The coherence function from ℘(O) into [0, 1] is defined below.
Coherence(X) = 1 − |Conflictbase(X)| / |Opinionbase(X)|
If Coherence(X) = 1 then X is totally coherent, if Coherence(X) = 0 then X is totally incoherent, and otherwise X is partially coherent. Coherence(∅) = 1.
Example 4. Let X = {know(x, m1), ¬know(x, m1), send(x, y, k)} and Y = {send(x, y, m2), ¬send(x, y, m2), know(y, k1), know(z, k2)}. So Coherence(X) = 1/2 and Coherence(Y) = 2/3.
Example 5. Let X ∈ MQC({know(x, k1), ¬know(x, k1) ∨ ¬know(y, m), know(y, m), know(y, k2)}) and Y ∈ MQC({¬know(x, m1), know(y, m2), ¬send(x, y, k1) ∧ send(x, y, k1)}). So Coherence(X) = 1/3 and Coherence(Y) = 2/3 in terms of the definition of the minimal QC model in [5].
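The coherence function of Definition 4 can be computed directly from a set of signed atoms. The following Python sketch (with objects encoded as ('+', atom) / ('-', atom), an encoding chosen here only for illustration) reproduces the values of Example 4.

def coherence(model):
    """Coherence(X) = 1 - |Conflictbase(X)| / |Opinionbase(X)|, with Coherence of the empty set = 1."""
    pos = {a for sign, a in model if sign == '+'}
    neg = {a for sign, a in model if sign == '-'}
    opinionbase = pos | neg
    conflictbase = pos & neg
    if not opinionbase:
        return 1.0
    return 1 - len(conflictbase) / len(opinionbase)

# Example 4: X = {know(x,m1), not-know(x,m1), send(x,y,k)}
X = {('+', 'know(x,m1)'), ('-', 'know(x,m1)'), ('+', 'send(x,y,k)')}
# Y = {send(x,y,m2), not-send(x,y,m2), know(y,k1), know(z,k2)}
Y = {('+', 'send(x,y,m2)'), ('-', 'send(x,y,m2)'), ('+', 'know(y,k1)'), ('+', 'know(z,k2)')}
print(coherence(X), coherence(Y))   # 0.5 and 2/3, as in Example 4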
The messages can be derived from the log files on the server and client sides. In practice, if the coherence of a set of data falls below a predefined threshold, the data should be regarded as insecure. How to deal with such data conflicts will be described in another paper.
3.5 Recognition
This section presents the validation of two instances. For simplicity, the description focuses on the rule handling and facts reading.
Example 6: Merchant Certificate Request in SET. Auth(CA, M, CertReq) is the goal we want to verify. Know(CA, ), Know(M, Spb(M)), and Know(CA, ) are some known facts. The real-time facts, such as Know(CA, Spb(M)), will be input via the facts reading functor until the system reads the terminal symbol. In addition, the messages E(, k1) and S(, Kpb(CA)), etc., sent from M to CA are obtained by asking the user. The knowledge is transferred into rules, in which each rule is assigned a serial number (rule name) and stored into the knowledge base:
rule(1, ["Know(CA, S(, Kpb(CA)))", "Know(CA, Kpv(CA))", "Know(CA, k1)"], "Know(CA, AcctData(M))")
rule(2, ["Know(CA, E(, k1))", "Know(CA, k1)"], "Know(CA, )")
rule(3, ["Know(CA, )"], )
rule(4, ["Know(CA, CertReq)", "Know(CA, S(, Spv(M)))", "Know(CA, |Clock - T| < t1 + t2)"], "Auth(CA, M, CertReq)")
where Clock is the local time, t1 is an interval representing the normal discrepancy between the server's clock and the local clock, and t2 is an interval representing the expected network delay time [8]. Each rule is separated into conditions and a conclusion by the square brackets. The rules presented above are written with constants, which are in fact substituted with variables when stored into the knowledge base. Subsequently, the user inputs the real-time facts. Table 1 describes them with the Name and Message fields. After establishing the knowledge base and completing the collection of real-time facts, the user submits an authentication request:
?- Auth(CA, M, CertReq).
The verification system then tries to find the matching rule from the knowledge base. The system cannot verify Auth(CA, M, CertReq), since the merchant did not add the timestamp and identifier to the message. Finally, the verification result is returned to the user.
Table 1. Merchant certificate request
Name  Message
M     k1
M     Kpb(CA)
M     CertReq
CA    Spb(M)
Example 7: Needham and Schroeder's protocol. We use this example to show that the verification model is also able to validate other security protocols. Next, the verification model is applied to verify the distribution of communication keys in Needham and Schroeder's protocol. Here the authentication server (AS) is responsible for storing and distributing users' public keys. Auth(A, AS, Spb(B)) is the hypothesis that we want to verify. Know(A, Spb(AS)) is a known fact and has been stored in the knowledge base. The message sent from A to AS and the message S(, Spv(AS)) sent from AS to A are collected via interaction with the user. The inference rules are as follows:
rule(1, ["Know(X, Spb(CA))", "Know(X, S(Message, Spv(CA)))"], "Know(X, Message)")
rule(2, ["Know(X, Spv(CA))", "Know(X, Message)"], "Know(X, S(Message, Spv(CA)))")
rule(3, ["Know(X, Spb(Y))", "Know(X, Spb(CA))", "Know(X, S(, Spv(Y)))", "|Clock-T| < t1 + t2"], "Auth(X, Y, Spb(Y))")
The first two rules already exist in the knowledge base because they were generated in the last example. This indicates that the fundamental security mechanisms keep steady. Rule 3 is a rule newly generated by the authorized user, which demonstrates that this model is extensible to verify other protocols. Table 2 presents the real-time facts collected via interaction with the user.
Table 2. Needham and Schroeder's protocol
Name  Message
A     B
AS    A
AS    B
AS    Spb(A)
AS    Spb(B)
Now the user can submit an authentication command to the authentication server: ?- Auth(A, AS, Spb(B)). The system then tries to find the matching rules from the knowledge base. Finally, the authentication fails due to the lack of a timestamp. Although the instances described above involve just a small number of rules, it is not difficult to verify a more complex instance by extending the original knowledge base and facts.
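To illustrate why such a goal fails, a rule fires only if every one of its conditions is matched by a collected fact, and in Example 7 the freshness condition has no supporting fact. The snippet below is a simplified, hypothetical encoding of that check, not the authors' implementation; the fact strings are placeholders.

# Facts collected for Example 7: no timestamp fact is available.
facts = {
    "Know(A, Spb(AS))",
    "Know(A, S(..., Spv(AS)))",
}
rule3_conditions = [
    "Know(A, Spb(AS))",
    "Know(A, S(..., Spv(AS)))",
    "|Clock - T| < t1 + t2",   # freshness condition with no supporting fact
]
print(all(c in facts for c in rule3_conditions))   # False -> the authentication fails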
4
Conclusions
This paper proposes a verification model specifically for validating electronic transaction protocols. In particular, model checking, in contrast to theorem proving, is usually fast and more appropriate for verifying the complicated secure transaction protocols used in e-commerce, mainly for two reasons: 1) the correctness conditions for secure transaction protocols usually contain more components than those of normal security protocols; and 2) the number of messages and participants of the protocols is hard to bound. The model is extensible to verify other protocols because the fundamental security mechanisms keep steady. In particular, this model can measure the inconsistency in messages before verification and ensure the validity of the verification result. Two instances of security protocols were validated by using this verification model. Some potential flaws were detected during the validation. This shows that our verification model is promising.
References
1. Burrows M., Abadi M., Needham R., A Logic of Authentication, ACM Transactions on Computer Systems, 8(1), pages 18-36, February 1990.
2. Heintze N., Tygar J., Wing J., and Wong H., Model Checking Electronic Commerce Protocols, Proceedings of the 2nd USENIX Workshop on Electronic Commerce, November 1996.
3. Lowe G., Breaking and Fixing the Needham-Schroeder Public-Key Protocol Using FDR, In Tools and Algorithms for the Construction and Analysis of Systems: Second International Workshop, TACAS 96, pages 147-166, March 1996.
4. Qingfeng Chen, Chengqi Zhang and Shichao Zhang, ENDL: A Logical Framework for Verifying Secure Transaction Protocols, Knowledge and Information Systems, an International Journal by Springer, accepted.
5. Hunter A., Measuring Inconsistency in Knowledge via Quasi-Classical Models, Proceedings of the 18th National Conference on Artificial Intelligence (AAAI 2002), pages 68-73, MIT Press.
6. SET Secure Electronic Transaction Specification, Book 1: Business Description, Version 1.0, May 31, 1997.
7. Needham R. and Schroeder M., Using Encryption for Authentication in Large Networks of Computers, Communications of the ACM, 21(12), pages 993-999, December 1978.
8. Denning D., Sacco G., Timestamps in Key Distribution Protocols, Communications of the ACM, 24(8):533-536, August 1981.
An Ontology-Oriented Approach for Virtual Enterprises Li Li, Baolin Wu, and Yun Yang CICEC - Centre for Internet Computing and E-Commerce School of Information Technology Swinburne University of Technology PO Box 218, Hawthorn, Melbourne, Australia 3122 {lli,bwu,yyang}@it.swin.edu.au
Abstract. Underlying the formation of a Virtual Enterprise (VE) are efficient partner selection mechanisms that enable all the partners in the VE to be assembled for a short period. In this paper, we present an ontology-oriented approach that uses description logics to formally represent concepts and roles (relations) of a partner's view of domain knowledge. We further address the use of such shared knowledge in a VE, under the knowledge model defined in this paper, to help partner selection by proposing assessment criteria for potential partners. We suggest that using value ranges as thresholds to filter unsuitable bids before evaluation is a better way. The filtering process works well because it fits the dynamic criteria of the principal agent during the process of selecting partners in a VE. We illustrate the approach through an e-shop case study.
1
Introduction
The proliferation of Internet technology and the globalisation of business environments give rise to the advent of dynamic virtual alliances among complementary organisations. A Virtual Enterprise (VE) has emerged as an aggregation of autonomous and independent enterprises connected through a public network like the WWW and assembled to deal with this new type of alliance. In [5], a VE is defined as a temporary network of independent companies that share skills, cost and access to each other's markets. This network looks like a unified ecosystem which evolves over time. According to the definition of the life cycle of the virtual organisation in [18], one of the most vital stages is the selection of partners in a VE. It is also addressed in [11][10]. However, both of them deal only with the highest values under certain circumstances, rather than considering users' expectations, and are thus unable to provide potential choices through their bidding evaluation criteria for later dynamic refinement. We consider flexibility to be a very important issue in VE formation. In order for partners (called agents in this paper) to achieve their maximum gains, it is quite important for them to understand each other by communication while consulting a VE ontology within a virtual environment. We argue that the ontology-oriented approach is about concept shifting from static terms in
business to runtime resource allocation based on agents' capability and responsibility/commitment [13][14][15], along with roles to fill in business. Ontologies [7][3][9] can be used as a guideline to describe requirements and business concepts. The adoption of a shared ontology allows different agents to interoperate simultaneously without misunderstanding while retaining a high degree of autonomy, flexibility and agility. Agent technology is suitable with respect to the operation of a VE. [6] provides a description of how agents fit well into the design and the operation of a VE in the AVE (Agents in Virtual Enterprises) project. Applying agent technology in partner selection is attracting more attention because of its importance in VE formation. As for partner selection in a VE, the main steps can be summarised as follows:
– A principal agent announces its requirements.
– Interested agents (partners) provide their bids.
– The principal agent identifies potential bids and also potential partners.
– Bidding continues until the best partners have been selected.
This paper focuses on the third step above by defining criteria to select potential bids and partners, while at the same time watching the fourth step closely, which mostly involves negotiation between the principal agent and potential partners of a VE. We are concerned mainly with conceptual issues of a VE along with the development of a semantic representation by using description logics (DLs) [2][8]. Our goal is to provide the ontological dimension in a VE in order to provide flexible partner selection mechanisms which might achieve as much common profit as possible. To this end, we extend the basic DL known as ALC (DL ALC is the basis of DLs, mainly characterised by a set of constructors that allow complex concepts and roles to be built from atomic ones, in which concepts correspond to classes, interpreted as sets of objects, and roles are interpreted as binary relations on objects) to include some performatives of Speech Act theory [1]. With DL's well-defined semantics and provision of inference services, we may define two terms in a VE environment, roles and concepts, to simplify the creation and maintenance of ontologies. This representation also plays an important role when partners communicate with each other. This paper is organised as follows. The next section introduces a knowledge model of virtual enterprise formation, showing the ontological dimension of business processes. Section 3 presents a communication model based on the knowledge model proposed. Section 4 shows a case study and Section 5 discusses the related work. Finally, Section 6 concludes our work and identifies potential future research.
2
Knowledge Model of Virtual Enterprise Formation
According to the literature of both knowledge engineering and agent-oriented systems, it is believed that, in terms of the granularity of knowledge (at the meta-level and the conceptual level), schema definitions intuitively link to ontologies. Besides its
traditional hierarchical structure, we will consider the ontology dimensions depicted in Figure 1. We identify a set of relations which represent agents possessing some properties for further inferencing. We do not distinguish role from service, for the reason that every role's function is to provide service(s).
Definition 1: capability(i, Σi) specifies an agent whose name is i and which owns the capability Σi to transit from one state to another with some kinds of resource and ultimately provide some kinds of service, namely being active in its role. It is notated by the x axis.
Definition 2: resource(i, Ψi) denotes an agent whose name is i and which needs to acquire some kinds of resource Ψi to undertake its role under the current circumstances. It is notated by the y axis.
Definition 3: role(i, Ωi) indicates an agent whose name is i and which can play the role Ωi if it can obtain the resource with its current capability. It is notated by the z axis.
Therefore, we may describe the geometric object within a certain range to simplify the relations of these three dimensions, which stand for an agent's capability, resource expectation and role to fill. Here x, y, z are normalised to the interval [0, 1]. The endpoint is the extreme case, which seldom happens; only the extreme case <1, 1, 1> is reasonable here. Consequently, an agent, say i, is depicted by the 3-tuple <Σi, Ψi, Ωi>. We define Γ as the universe of discourse of an agent, Γ = <Σ, Ψ, Ω>. Agents consult each other using performatives of Speech Act theory such as assertives (informing), commissives (promising), directives (requesting or querying) and declaratives, to query the meaning of a concept if it is a part of a shared ontology. The primitives are as follows:
intent = {term | term ∈ {performatives*, assertives*, commissives*, directives*, }}
In terms of the TBox (for terminological knowledge) and the ABox (for assertional knowledge) of DLs, we have:
intent(i, j, C1 ⊑ C2) = {<intent(i, j, C1^I, C2^I)> | i, j ∈ Γ and C1^I ⊆ C2^I}
intent(i, j, α : C) = {<intent(i, j, α^I, C^I)> | i, j ∈ Γ and α^I ∈ C^I}
where α is an individual name and C1, C2 denote any complex concepts. I denotes an interpretation, which is a pair of a domain ∆^I and an interpretation function ·^I, namely I = (∆^I, ·^I). The semantics of the above formulas can be explained as follows: the first formula states that concept C1 is more specific than C2, and the second one is about an instance α of concept C [2][8]. Any query will start with primitives such as ASK, TELL, REQUEST. We do not detail attributes at every step (goal estimation, resource availability, etc.); instead we define a method to make a decision in terms of the three dimensions of the knowledge model. In this paper, we consider not only attributes represented in quantitative forms, possibly belonging to the interval [0,1], but also qualitative ones which may be converted to numeric values between 0 and 1 [19]. Figure 2 is a diagram for agents to query about the selection of potential partners. It shows a basic process
for selecting partnership. We are going to detail the 'evaluation' phase of Figure 2 in the following subsections. In this paper, we only consider the cases where ai and bi (i ∈ {1, 2, 3}) are given, for simplicity. A pair ai, bi denotes the value range of axis x, axis y and axis z respectively, for i ∈ {1, 2, 3}.
Fig. 1. Ontology Dimensions (x: capability, y: resource, z: role)
Fig. 2. Partner Selection Under the Knowledge Model (mutual goal estimation, interested partners, evaluation, qualified partners, bidding, potential partners, negotiation, best partners). Legend: oval - I/O; rectangle - process; arrowed line - information flow
2.1 Capability, Resource, and Role Axes
The difference between b1 and a1 depicts the range of the capability dimension which fits the VE principal agent's criterion for potential partner(s). These ranges are defined by human users/agents at the present stage. Therefore, some partners will be excluded because their credits are too low to fulfil the requirements. The structure of the capability dimension is as follows. It is a k-tuple schema:
AgID(IndusCode, GDes, att1, att2, . . . , attk−2), (k ≥ 2)
where AgID is the agent ID; IndusCode is the industry it belongs to (IndusCode comes from the North American Industry Classification System (NAICS), which includes two main parts, the 2002 NAICS Code and the 2002 NAICS Title, on the website http://www.census.gov/epcd/naics02/naicod02.htm); we assume the codes match exactly (assigning 1 for the similarity between them) for two or more organisations to form an alliance in the VE. GDes is the goal/subgoal description. att1, att2, . . . , attk−2, (k ≥ 2) is the list of its capabilities, k ∈ N, where N denotes the natural numbers. As for the goal/subgoal description GDes, we define the similarity measurement fx GDes_sim ∈ [0, 1], which is calculated as a normalised difference between the expected values and the current ones:
fx GDes_sim = 1 for an exact match, 0 for no match, and a value in (0, 1) otherwise.
We define the overall similarity assessment as
fx CP_sim = (1/n) · Σ(i=1..n) wi · atti
where x corresponds to the capability dimension (referring to Figure 1), wi is the weight assigned by users according to their experience, wi ∈ [0, 1] with i ∈ [1, n], and atti is the attribute related to the capability axis, normalised to the interval [0,1] based on the ideal value and users' expectations. Therefore fx CP_sim ∈ [0, 1]. Both the similarity assessment and similarity measurement definitions for the other two dimensions, namely resource and role, are similar. We omit them in this paper to avoid redundancy.
2.2 Evaluation
Based on the calculation of (1/n) · Σ(i=1..n) wi · atti, we assume that all the interested potential partners are ranked by their numeric values, which are between 0 and 1, in ascending order. By doing so, every interested partner, denoted by its AgID (agent ID), has a numeric value for every dimension, namely the capability dimension SCP, the resource dimension SRS and the role dimension SRL, where SCP, SRS, SRL are sets of AgIDs. Qualified partners ϕ within the range of values are derived from the intersection of the three sets:
ϕ = {SCP ∩ SRS ∩ SRL}
The following is a brief description of the evaluation process (see Figure 3 for its state transition):
(1) derive a new ψ from the waiting state every time a new bidding announcement starts;
(2) if ψ ≠ Φ then continue bidding; bidding terminates according to the principal agent's criteria and is followed by negotiation, which will lead to a success or waiting state;
(3) else (if ψ = Φ) relax the constraints, i.e., let more candidates enter, and go to (1);
(4) if still unable to match (such as running out of the time limit, or no agreement reached so far), just wait for new entrants and go to (1).
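A minimal sketch of this evaluation step, under assumed attribute values, weights and threshold ranges: compute the per-dimension weighted score, keep the agents whose score lies in [ai, bi], and intersect the three sets. The helper names and the example data are illustrative assumptions.

def dimension_score(attributes, weights):
    """(1/n) * sum(w_i * att_i), attributes and weights normalised to [0, 1]."""
    return sum(w * a for w, a in zip(weights, attributes)) / len(attributes)

def in_range(scores, a, b):
    """Agents whose score for one dimension lies inside the threshold range [a, b]."""
    return {agent for agent, s in scores.items() if a <= s <= b}

# Hypothetical normalised attributes per partner and user-assigned weights (capability axis).
cap_attrs = {"P1": [0.8, 0.6, 0.9], "P2": [0.5, 0.4, 0.6], "P3": [0.9, 0.7, 0.8]}
cap = {p: dimension_score(a, [0.9, 0.8, 1.0]) for p, a in cap_attrs.items()}
# Scores for the resource and role axes, assumed to have been computed the same way.
res = {"P1": 0.66, "P2": 0.70, "P3": 0.40}
rol = {"P1": 0.75, "P2": 0.68, "P3": 0.77}

S_CP, S_RS, S_RL = in_range(cap, 0.6, 1.0), in_range(res, 0.5, 1.0), in_range(rol, 0.6, 1.0)
print(S_CP & S_RS & S_RL)   # qualified partners phi = S_CP intersect S_RS intersect S_RL -> {'P1'}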
Fig. 3. State Transition of Potential Partners Selection (states: announcement, evaluation, bidding, negotiation, wait, succeed)
Fig. 4. A Communication Model (agents exchange AgID, IndusCode, capability, resource and role information)
3 Communication Model of Virtual Enterprise Formation
Our communication model highlights two points, namely, internal communication and external communication. The internal one means basic communication
involving common background, common vocabulary, etc. to reach mutual understanding. The external one means some inference is involved in communication. For example, in this paper, concepts of role and class and the constructors of description logics (DL) are included in communication to allow better understanding. Selecting partnership means matching well among agents' capability, resource allocation and role to fill in order to achieve the goal of a VE. Agents may know themselves very well, but might have little knowledge about others. They communicate with each other using the powerful semantic representation language DL, which is particularly fitting for the definition of ontologies and facilitates communication. Figure 4 is a communication model of a VE under the knowledge model described in section 2. As illustrated on the left of Figure 4, agents get a global view of other agents' capability, resource, and role by communication ("intent" here). On the right, announcement is goal-oriented. The principal agent looks like an auctioneer who puts an offer to all interested organisations in the electronic market. The announcement should include information such as IndustryCode, subgoal, due date, cost, risk, etc. Interested agents are filtered by the evaluation criteria [ai, bi] (0 ≤ ai ≤ bi ≤ 1), i ∈ {1, 2, 3}, as defined in section 2. The winners are qualified partners, who then enter the bidding phase based on their capabilities, resource expectations and the roles they need to fill. [ai, bi] (0 ≤ ai ≤ bi ≤ 1), i ∈ {1, 2, 3}, are adjustable to restrict or relax the range of the three dimensions and so exclude/include qualified partner(s). We provide a range rather than a single evaluation value for the winning bid because it is difficult to derive an accurate value in advance in business; instead, qualitative evaluation is applicable. The next bidding phase leads to potential partners, with whom the principal agent may negotiate in order to decide which one is better. Communication plays an important role in the following scenario for organisations/units/partners: (1) to understand what they are talking about (ontology and its refinement); (2) to solve conflicts between them (including some terms which are vague or not available currently); and (3) to preserve update consistency, which includes inserting new concepts and the relations among them.
4
Case Study
In this section, we study an example to illustrate how agents understand each other in a VE environment and get involved in its activities, presenting a VE at the end. For more detail, please visit the website http://www.it.swin.edu.au/centres. Let us take an e-shop, an online book sale, as the example. The precondition of our discussion is that all the agents have the same IndusCode; that is, we omit the similarity measurement of the goal/subgoal description by assuming that they are looking for each other in that case (under the same business pattern). The partnerships for the case are illustrated in Figure 5. The e-shop is a good example for studying what is involved in a VE. Suppose it includes several agents such as CUSTOMER Agents, SUPPLIER Agents, an E-SHOP MANAGER Agent, and BANK Agents. The SUPPLIER (α, where α
L. Li, B. Wu, and Y. Yang customer
deliver
1
deliver
supplier
order
order
e-shop manager
0.6
clearance bank
27
credit card
Fig. 5. Online Bookshop Partnership
Fig. 6. An Expectation Function
is an individual name) has a feature C which is a complex concept, denoted by (α :C). C = AgID capability role resource ∃commit.(AgIDi AgIDj ) ∃involve.AgID commit(i, j , provision ∗ ) = {< i, j , provision > |i, j ∈ Γ , provision ∈ commitment ∗ } commiment∗ [15] may repeat many times involve(i, action ∗ ) = {< i, action > |i ∈ Γ , action ∈ activity ∗ }
activity ∗ (http://process.mit.edu) may repeat many times Some performatives for agents to communicate with each other are shown in the following. Request (i,j,Deliver.(Best-Seller.(Book.“Commerce”))), meaning to request delivering a book object of eCommerce which is also a best-seller one. Ask(i,j, Computer Science.Grid Computing ∃Best-Seller.Book); asking if there exists a concept Grid Computing which belongs to the Computer Science that is implied by an existing book object which is a best-seller one. After some rounds of communications, then start to assessment. Suppose there are three SUPPLIER Agents and all of them are within the assessment range of the E-SHOP MANAGER (suppose the E-SHOP MANAGER is the host who has rights to determine its partner(s) in this process). SUPPLIER Agents compete against each other according to its own properties. In Table 1, these three SUPPLER agents denoted as S1 , S2 , S3 with some properties n such as price, number in stock, etc., denoted as a1 , a2 , a3 , a4 , their 1/n · i=1 wi · atti are calculated respectively. The numbers within the third column are average expectations which are predefined (like the membership functions in fuzzy logic) regarding the attributes, while the values within the fourth column are weights defined by users. For the above example, the CUSTOMER may put an order for a specified book with a price between $25 and $26.5 and other free services such as reward and readers’ community. If one of SUPPLIERs can provide exactly the same book but with the price between $25.9 and $27, then the average expectation is equal to 0.8 not 1 according to the users’ expectation function as illustrated in Figure 6. Estimating functions for other attributes are similar. From the above analysis, the CUSTOMER agent will select S2 in this marketplace. Three agents’ weighted averages are 1.91, 2.266, and 1.302 respectively (the right most column) under assumed conditions. In summary, expectation functions are flexible and more suitable for further amendment when conditions/constraints are changing.
Table 1. The attributes, expectations and weights
agent  attribute            expectation  weight  weighted average
S1     price ($25.9-$27)    0.8          0.8
       number in stock      0.6          0.8
       delivery time        0.6          0.9
       awards               0.5          0.5     1.910
S2     price ($25.1-$26.8)  0.92         0.8
       number in stock      0.7          0.8
       delivery time        0.8          0.9
       awards               0.5          0.5     2.266
S3     price ($26.1-$26.9)  0.94         0.8
       number in stock      0.4          0.8
       delivery time        0.2          0.9
       awards               0.1          0.5     1.302
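As a check of the arithmetic in Table 1, the weighted averages in the right-most column can be recomputed from the expectation and weight columns, as in the short Python sketch below.

suppliers = {
    "S1": [(0.8, 0.8), (0.6, 0.8), (0.6, 0.9), (0.5, 0.5)],
    "S2": [(0.92, 0.8), (0.7, 0.8), (0.8, 0.9), (0.5, 0.5)],
    "S3": [(0.94, 0.8), (0.4, 0.8), (0.2, 0.9), (0.1, 0.5)],
}
for name, rows in suppliers.items():
    total = sum(expectation * weight for expectation, weight in rows)
    print(name, round(total, 3))   # S1 1.91, S2 2.266, S3 1.302 -> S2 is selected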
5
Related Work
In terms of knowledge models, one of the important contributions was made in [12], where a formal description of knowledge models was given. There the process of knowledge-model construction was decomposed into three stages: knowledge identification, specification, and refinement. Also, in [4] a three-level organisation of knowledge is presented, which includes the meta-level, intensional level and extensional level from the knowledge modeling perspective. Our knowledge model is inspired not only by the above work, but also by work from agent systems, where agents usually own a certain degree of knowledge for making a decision. As for the communication of agents, KQML (http://www.cs.umbc.edu/kqml), which is both a message format and a message-handling protocol to support run-time knowledge sharing among agents, is a basic ground. We define the contents of KQML by applying a description logic concept specification language that is a syntactic extension of first-order logic. In addition, the agent interaction protocol is addressed, but we do not provide a formal specification for it. There are some existing ontologies [16][17] which capture certain characteristics of the world by defining meta-concepts and meta-relations and filling each catalogue with terms and relations. Two ontologies, the Edinburgh Enterprise Ontology (http://www.aiai.edu.ac.uk/project/enterprise) and the Toronto Virtual Enterprise (http://www.eil.utoronto.ca/enterprise-modelling/tove), define a set of enterprise domain concepts to describe the enterprise in general. The MIT Process Handbook provides a repository of about 5000 activities/relations involved in business processes as a guide book. Also, [4] addresses a core enterprise ontology to bridge the gap between different emphases in constructing ontologies. The ontology in this paper is different from the above proposals in that we address issues not only from the knowledge management perspective but also from the agents' point of
view. We construct an ontology with four parts which are Entity, Flow, Activity and Model, where in the Entity category we focus on three dimensions of the knowledge model in section 2 with an additional identifier of the agent (Actor).
6
Conclusions and Future Work
In this paper, our general focus is on developing a knowledge model to facilitate communication while emphasising interoperation and interactivity among agents. In our opinion, recognising concepts and the relations among them should take their further applications into account. Unlike general knowledge models, our three-dimensional knowledge model takes into account the behavior of agents in virtual enterprise (VE) formation, especially in selecting partners in a VE. In addition, this model also provides a background for VE partners to understand each other in the presence of communication among agents. We do this by employing description logics for their well-defined semantics, which ensures the correctness of reasoning tasks. Further, we have developed an ontology from the agents' perspective to help the selection of partners based on the proposed knowledge model. This paper shows the good potential of agent-based technology for finding and selecting partners to fill roles such as producing goods, providing services, delivering products, etc. Due to uncertainty and heterogeneity in VEs, we have shown that a range of values within the three-dimensional knowledge model can be useful to provide a set of potential partners for further negotiation. We have also shown that the range of values and the expectation functions can fit well with the dynamic criteria of the principal agent during the process of selecting partners in a VE. There are many issues remaining for future work. Business process ontology is one of the most important issues to be addressed. Another aspect is how to detail VE formation with Web services to eventually provide services. Finally, even though we referred to the issue of resource allocation briefly in this paper, it is obviously an open issue in process management. Acknowledgements. Work reported in this paper is partly supported by Swinburne Vice Chancellor's Strategic Research Initiative Grant 2002-2004 for the project "Internet-based e-business ventures". It is also partly supported by the National Natural Science Foundation of China under grant No. 60273026 and grant No. 60273043.
References 1. Austin, L. J. How to Do Things with Words, Clarendon Press. Oxford, UK, 1962. 2. Baader, F., Horrocks, I. and Sattler, U. Description Logics as Ontology Languages for the Semantic Web. Dieter Hutter and Werner Stephan (Eds.), Festschrift in honor of J¨ org Siekmann, Lecture Notes in Artificial Intelligence. Springer-Verlag, 2003. 3. Benjamins, R., Fensel, D. and Perez, G. A. Knowledge Management Through Ontologies. In Proc. of the 2nd International Conference on Practical Aspect of Knowledge management (PAKM’98 ), Basel, Schweiz, 1998.
4. Bertolazzi, P., Krusich. C., Missikoff, M. An Approach to the Definition of a Core Enterprise Ontology: CEO, OES-SEO 2001, International Workshop on Open Enterprise Solutions: Systems, Experiences, and Organisations, Rome, Sept. 14-15, 2001. 5. Byrne, J. A., Brandt R. And Bort, O. (1993), The Virtual Corporation. Business Week, Feb., 8:36-40, 1993. 6. Fischer, K. M¨ uller, P. J., Heimig, I. and Scheer, W. A. Intelligent Agents in Virtual Enterprises. In Proc. of the 1st International Conference on Practical Applications of Intelligent Agents and Multi-Agent Technology (PAAM’96), London, 1996. 7. Gruber, T. Towards Principles for the Design of Ontologies Used for Knowledge Sharing. In Formal Ontology in Conceptual Analysis and Knowledge Representation, Nicola Gruarino and Roberto Poli (Eds.), Kluwer Academic Publishers, 1993. 8. Horrocks, I., Sattler, U. A Description Logic with Transitive and Inverse Roles and Role Hierarchies. Journal of Logic and Computation, 9(3): 385-410, 1999. 9. Huhns, M. N. and Singh, P. M. Ontologies for Agents, IEEE Internet Computing, Nov/Dec., 1(3):81-83, 1997. 10. Oliveira, E. and Rocha, A. P. Agents’ Advanced Features for Negotiation in Electronic Commerce and Virtual Organisation Formation Process. European Perspectives on Agent Mediated Electronic Commerce, Springer Verlag, 2000. 11. Petersen, S. A., Divitini, M. Using Agents to Support the Selection of Virtual enterprise Teams. In Proc. of the 4th International Bi-Conference Workshop on Agent Oriented Information Systems (AOIS2002), Bologna, Italy, July, 2002. 12. Schreiber, A. T., Akkermans, J. M., Anjewieden, A.A., Dehoog, R., Van De Velde, W., and Wielinga, B.J. Engineering of Knowledge: The CommonKADS Methodology. University of Amsterdam, 1998. 13. Singh, M. P. On the Commitments and Precommitments of Limited Agents. IJCAI Workshop on Theoretical and Practical Design of Rational Agents, Sydney, Australia, 1991. 14. Singh, M. P. Multiagent Systems as Spheres of Commitment. International Conference on Multiagent Systems (ICMAS) Workshop on Norms, Obligations, and Conventions, Kyoto, Japan, 1996. 15. Singh, M. P. An Ontology for Commitments in Multiagents Systems: Towards a Unification of Normative Concepts. Artificial Intelligence and Law, 7: 97-113., 1999. 16. Uschold. M., King. M, Moralee. S. and Zorgios. Y. The Enterprise Ontology. The Knowledge Engineering Review, Special Issue on Putting Ontlogies to Use (Uschold. M. and Tate. A. Eds.), 13(1):31-89, 1998. 17. Uschold. M. Knowledge Level Modelling: Concepts and Terminology. The knowledge Engineering Review, 13(1):5-29, 1998. 18. Van Wijk, ir Jaconliene, Daisy Geurts, Rene Bultje. Seven Steps to Virtuality: Understanding the Virtual Organisation Process Before Designing ICT-Support. An Interdisciplinary Workshop on Objects, Components and the Virtual Enterprise at OOPSLA98, Vancouver, CA, 1998. 19. Zadeh L.A. Fuzzy Sets. Inf. Control, 8:338-353, 1965.
Customer Behavior Pattern Discovering with Web Mining Xiaolong Zhang1 , Wenjuan Gong2 , and Yoshihiro Kawamura3 1
School of Computer Science and Technology Wuhan University of Science and Technology 2 Dept. of Educational Administration Wuhan University of Science and Technology {xiaolong.zhang,wenjuan.gong}@mail.wust.edu.cn 3 Business Intelligence Solutions, IBM Japan [email protected]
Abstract. This paper describes a web mining approach proven in a real application. The approach works with integrated data comprising web logs and customer information from e-commerce web sites. The objective is to acquire behavior patterns of visitors on web sites. The mining tasks include customer clustering, association rules among the web pages of visitor traffic, buying patterns of customers, and prediction model generation for potential customers. As web log data is extraneous, of low granularity and voluminous, a semantic taxonomy method is used to group web pages, helping to interpret the discovered patterns. This web mining work is useful for an enterprise to obtain a multi-level customer view, which supports the decision-making process of the enterprise. Keywords: Data mining, web log analysis, e-commerce, CRM (customer relationship management), business intelligence
1
Introduction
In eBusiness, a company's competitive advantage is achieved by knowing its customers and having a good customer relationship management strategy. Web mining is an important step toward this goal. The company begins to develop and implement strategies for collecting, organizing, managing, and using customer-related information. This paper introduces how to pursue data mining processes with a customer-centric information data warehouse, whose data come from multiple channels in the company. The channels can be telephone, fax, email, Internet, and letters. Some typical issues for eBusiness in the Internet world are how to navigate web visitors through complex web sites, and how to understand your customers in order to run campaigns for cross- and up-selling within e-commerce web sites. Web mining is a crucial way to understand e-commerce web users and transform traffic information into competitive advantage. It enables you to make data-driven decisions, and to develop and enhance your decision strategies. It helps you
to acquire new customers, retain existing customers, and improve customer satisfaction. We present a web mining template that integrates both web logs and corporate data to make data-driven decisions that positively affect the e-commerce business. Analysis of web log data identifies web access patterns. Web usage analysis provides statistical reports (e.g., [1] and [2]), such as the number of hits for a web page, the number of visitors, and the average time visitors stay on a web page. In addition, association rule and sequential pattern analyses (e.g., [3]) discover association patterns among the visited web pages. However, from these results one cannot know, for example, who the profitable buyers are. As an integrated analytical solution, we integrate key corporate data, such as sales data and customer profile information, as well as web log data. Mining the integrated database can discover meaningful business trends and patterns. In order to acquire accurate rules and patterns, a semantic taxonomy is applied in web page grouping. The web log data comes from the history of the traffic of web visitors. The corporate data refers to sales data, content data, customer profiles, and customer lifetime data. This web mining template offers the ability to acquire association rules and sequential patterns among visited web sites, cluster web customers, classify buyers, and predict the sales revenue of products. The remainder of this paper describes a web mining template as a case study for an e-commerce web site, which is an effect-proven web mining solution, presents mining results that help to explain customer behavior on the web, and discusses related work and the conclusion of this paper.
2 Application Background: An Overview
Web businesses usually follow a typical evolutionary pattern. In the early stages of development, a company just begins to create a web site before adequately determining whether it is targeted correctly, accomplishes specific business objectives, and produces a reasonable return on investment. As the web site becomes an integral, strategic part of the business channel, strategy-makers then try to use concrete information to help justify their decisions. Visitor traffic, behavior, and navigation patterns can be measured and analyzed objectively. Having this type of data in your databases is critical whatever the web site is primarily designed for. In addition to web transaction analysis (association rules and sequential patterns), we are most interested in analyzing the activity of web visitors that links purchasing visitors to the e-commerce site. The data analysis needs customer profile information, web access history, and product data. The data marts for web mining require integration of these data. An integrated mining tool, including statistical data analysis tools, a database, and a unified architecture and graphical interface, is selected to perform a variety of mining operations (the mining operations are carried out with IBM DB2 Intelligent Miner). The mining tool has the ability to run clustering, classification, and prediction mining processes.
2.1 Association Rule and Sequential Pattern
The most direct way to analyze web transactions is association and sequential pattern analysis. Association analysis finds association rules among visited pages. This information is useful to web designers and content managers. Sequential pattern analysis provides the most frequent access paths of web visitors, which reflect the paths visitors frequently follow.
Association rule. Given a set of transactions, where each transaction is a set of items, an association rule X → Y (where X and Y are sets of items) means that the transactions containing the items in X tend to also contain the items in Y. An association rule is usually measured by support and confidence. The support(X → Y) is the percentage of transactions containing both X and Y among all transactions. The confidence(X → Y) is calculated by dividing the number of transactions supporting the rule by the number of transactions supporting the rule body X. The data set used for association rule mining is a transaction data set selected from the web log data. The transaction set contains fields such as transaction ID, www page, day, hour, minute, and second, as well as some taxonomy items. The taxonomy is created according to the content a web page contains. In the e-commerce web site we assume the cookie ID is used as the transaction ID and the www page as the corresponding item. The Apriori algorithm [4] is used for finding frequent item sets and the association rules among these item sets.
Sequential pattern. The task of sequential pattern discovery is, given a database of customer transactions, to find the maximal sequential patterns among all the sequences that have a user-specified minimum support (see [5]). When users visit web pages, the web server records the transactions of the users. The frequent paths or browsing patterns are discovered with sequential pattern mining. The sequential patterns are useful to web designers (e.g., where should a banner advertisement be inserted?). Sequential pattern mining is performed with the same transaction data set as that used for association rules. We use the cookie ID as the transaction ID, the www page as the corresponding item, and the field minute (hour, minute, and second describe time in the web logs) as the transaction group ID. The fields hour and second are not used, since the field hour is too general and second too detailed for this pattern discovery.
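To make the two measures defined above concrete, the short sketch below computes support and confidence for candidate page-to-page rules from a list of cookie-level transactions. It is an illustrative calculation only, not the Apriori implementation of IBM DB2 Intelligent Miner, and the page sets are invented.

# Minimal sketch: support/confidence for single-page rules X -> Y,
# assuming each transaction is the set of pages visited under one cookie ID.
from itertools import permutations

transactions = [            # hypothetical cookie-level page sets
    {"email.html", "music.html"},
    {"email.html", "music.html", "chat.html"},
    {"food.html", "chat.html"},
    {"music.html"},
]

def support(itemset):
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

min_support, min_confidence = 0.25, 0.6
pages = set().union(*transactions)
for x, y in permutations(pages, 2):
    s = support({x, y})
    if s < min_support:
        continue
    conf = s / support({x})          # confidence = support(X and Y) / support(X)
    if conf >= min_confidence:
        print(f"{x} -> {y}  support={s:.2f} confidence={conf:.2f}")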
2.2 Clustering Model
Clustering is used to segment a database into subsets, with the members of each subset sharing a number of interesting properties; that is, the users in the same cluster have similar behavior. Clusters can be generated using either neural or symbolic (demographic) learning methods. The Kohonen feature map [6] is applied as a neural clustering method. It consists of two layers of units: a one-dimensional input layer and a two-dimensional competitive layer. Each
unit in the competitive layer has a weight vector. The neural learning algorithm consists of two steps: clustering the input data, and spatially ordering the map so that similar input patterns tend to produce a response in units that are close to each other in the grid. A symbolic clustering algorithm (e.g., AQ11 [7]) operates primarily on instances with nominal values. It considers all the attributes that characterize each instance and uses space search techniques to establish the subset of these attributes that describes each created cluster. Both neural and symbolic algorithms have been implemented in many mining tools. The clustering data set used in this template includes a set of fields or variables (see Table 1). Each record of this table is the summary of the transactions of a customer in a specified period (one month or one week). The fields shopping, communication, information, and fun are four taxonomy items (the taxonomy method is described later), each of them including the web pages that semantically belong to that taxonomy item. Furthermore, the table also contains some data from the corporate data (e.g., the field revenue originally comes from a corporate database).

Table 1. The variables used for customer clustering

Variable          Interpretation
AgeGroup          discretization of users' age
Gender            users' gender
Region            users' living area
Articles bought   number of items bought by a user
TheNumberofHits   total hits of a user in a session
Revenue           money spent by a buyer
Shopping          visited pages in the Shopping taxonomy
Communication     visited pages in the Communication taxonomy
Information       visited pages in the Information taxonomy
Fun               visited pages in the Fun taxonomy
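As an illustration of the neural clustering step, the sketch below trains a small Kohonen-style self-organizing map on records shaped like Table 1 after numeric encoding. It is a simplified stand-in for the clustering component of the mining tool; the feature vectors are randomly generated.

# Minimal Kohonen SOM sketch over Table-1-style records
# (AgeGroup, Gender, hits, revenue, shopping, communication, information, fun),
# already encoded as numbers and scaled to [0, 1]. Data values are invented.
import numpy as np

rng = np.random.default_rng(0)
records = rng.random((200, 8))               # 200 customers, 8 variables

grid_h, grid_w, dim = 4, 4, records.shape[1]
weights = rng.random((grid_h, grid_w, dim))  # competitive-layer weight vectors

def best_matching_unit(x):
    d = np.linalg.norm(weights - x, axis=2)  # distance of x to every unit
    return np.unravel_index(np.argmin(d), d.shape)

for epoch in range(30):
    lr = 0.5 * (1 - epoch / 30)                 # decaying learning rate
    radius = max(1.0, 2.0 * (1 - epoch / 30))   # shrinking neighborhood
    for x in records:
        bi, bj = best_matching_unit(x)
        for i in range(grid_h):
            for j in range(grid_w):
                grid_dist = np.hypot(i - bi, j - bj)
                if grid_dist <= radius:
                    h = np.exp(-grid_dist**2 / (2 * radius**2))
                    weights[i, j] += lr * h * (x - weights[i, j])

# Each customer is assigned to the unit (cluster) whose weight vector is closest.
clusters = [best_matching_unit(x) for x in records]
print(clusters[:5])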
2.3 Classification Model
The decision tree method is a well-known classification method. The essence of a decision tree is learning from supervised data in attribute logic. The objective of a decision tree is to split the given examples into classes by repeated selection of attributes. The principles of decision tree induction can be found in [8,9]. The split points in tree generation are decided by an information function. In ID3 [8], the gain function is used; in IBM DB2 Intelligent Miner, the gini function [9] is employed. The data set used in classification consists of the customer profile data and the aggregated web log data. A variable used as the buyer flag contains either 0 (the customer does not buy) or 1 (the customer buys). The other variables used in this classification are those from the second-level taxonomy (fun, information, communication, and shopping) and the customer profile data. The buyer flag is selected as the class label of the decision tree. This class label is the output of the classification model, classifying a record as a buyer or a non-buyer when the model is used for
scoring a new record. The classification model can be used to find which customers are buyers in the web world.
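To illustrate how a split point is chosen with the gini function, the fragment below scores candidate splits of buyer-flag records on one taxonomy variable. It is a didactic sketch with invented data and thresholds, not the Intelligent Miner tree builder.

# Gini impurity of candidate splits on one numeric attribute (sketch).
# Each record is (shopping_score, buyer_flag); the data and thresholds are invented.
records = [(2.0, 0), (9.0, 1), (8.5, 1), (1.0, 0), (7.0, 1), (3.0, 0)]

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1**2 - (1.0 - p1)**2

def split_gini(threshold):
    left = [flag for score, flag in records if score < threshold]
    right = [flag for score, flag in records if score >= threshold]
    n = len(records)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Pick the threshold with the lowest weighted gini impurity.
best = min((split_gini(t), t) for t in (4.5, 6.0, 8.5))
print("best split:", best)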
2.4 Predictive Model
Prediction is also useful for e-commerce sites. Predictions of revenue, sales, and hits are needed in e-commerce businesses. Many approaches are available in current data mining tools. In our case study, we employ radial basis functions (RBF) as the predictive method, since this method deals with both numerical and categorical variables (note that statistical predictive methods use only numerical data) and has better performance than other neural predictive methods. The RBF network is a feed-forward neural network with only one hidden layer. It differs from the multi-layer perceptron in two ways. (a) The outputs of the hidden layer are not simply the product of the input data and a weighting; all the input data to each neuron are treated as a measure of distance, which can be viewed as how far the data are from a center; the center is the position of the neuron in a spatial system. (b) The transfer functions of the nodes are used to measure the influence that neurons have at the center; these transfer functions are usually radial spline, Gaussian, or power functions. A radial basis function network can easily be represented by a three-layer feed-forward neural network, which allows its model to be generated in an efficient way. More about the transfer functions can be found in [10]. This case study generates a predictive model for revenue. With classification and prediction, profitable customers can be discovered, and their buying behavior patterns are also acquired.
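The sketch below fits a small Gaussian RBF model for a revenue-like target: it places centers on a subset of the training rows, builds the radial activations, and solves for the output weights by least squares. It is only a schematic of the RBF idea on synthetic data, not the predictive component used in the case study.

# Minimal Gaussian RBF regression sketch (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((100, 4))                       # e.g. shopping, communication, information, fun
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.1, 100)   # revenue-like target

centers = X[rng.choice(len(X), 10, replace=False)]   # 10 RBF centers taken from the data
width = 0.5

def activations(data):
    # Gaussian transfer function applied to the distance of each row to each center.
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d / width) ** 2)

H = activations(X)
w, *_ = np.linalg.lstsq(H, y, rcond=None)      # output-layer weights

y_hat = activations(X) @ w
print("training RMSE:", float(np.sqrt(np.mean((y - y_hat) ** 2))))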
3 Data Processing and Transformation
The processing of web log data includes filtering, selecting, and arranging web logs, and often creating variables from the web logs. In addition, a semantic taxonomy of web pages is necessary. The semantic taxonomy for web sites is, given a set of web pages, to semantically divide the given web pages into a set of categories. A web log contains information about each request to a web site's server. There are different log formats, including the common log format, user agent logs, referrer logs, the combined log format, and the extended log format. A number of software tools are able to deal with web logs and format layouts for web log analysis. For example, by means of such a tool, the log entries whose filename suffixes are gif, jpeg, jpg, or map can be removed. In addition, entries created by "spiders" and "robots" can also be filtered. The IBM WebSphere Site Analyzer is applied to deal with the web logs. This web log analysis software filters the web log data according to user-defined categories, domains, browsers, etc. The data marts used for mining are built from both the web logs and the corporate databases. Combining the access data with the customer profiles as well as the customer lifetime records stored in the corporate databases allows web user behavior to be predicted more accurately. Currently, the mining data marts include the customer profile
data, customer lifetime data, and transaction (web clicks extracted from web logs) data. In general, the filtered web logs are still difficult to use directly in mining processes. A further data transformation is needed, which is the semantic taxonomy of web pages.

Table 2. Categories in the first level taxonomy

Category               Interpretation
Chat.html              chat with others in the chat room
Competition.html       a premium user game
Email.html             send an email out (premium user only)
Food.html              food web page
Game.html              web page for playing a virtual game
Gourmet.html           fine cooking web page
Help.html              help page, e.g. how to navigate
Info.html              information
Kids.html              special page for kids
Living.html            articles for bath, kitchen
Mfashion.html          fashion (for men only)
Music.html             music: CDs, instruments, etc.
Newspaper.html         newspaper, books, etc.
Postcards.html         send virtual postcards out
ShopSpecial.html       specials for the youngsters
Sports.html            sport articles
Travel.html            book your travel as premium user
Travel2.html           book your travel as normal user
Womens-fashion.html    fashion (for women only)

3.1 Semantic Taxonomy of Web Site
This section is concerned with the taxonomy of e-commerce web pages. The taxonomy helps build conceptual web categories, which enables the results to be interpreted more clearly. The web page taxonomy is an important step in data mart creation. The taxonomy generates a hierarchical configuration of the given web pages. In our case study, the web page taxonomy consists of two hierarchy levels, namely the first-level taxonomy and the second-level taxonomy. In the first-level taxonomy, every web page is generalized to its root. For instance, a web page named "Fun.html?siid=1?aid=3" is translated into "Fun.html". The second-level taxonomy, given the categories from the first-level taxonomy, creates semantic concepts that refer to the generalized web pages. For example, the web logs used in our experiments are from a retail company, whose web page taxonomy is generated in the following way. First, some of the web pages are generalized by the first-level taxonomy (displayed in Table 2). Then, from the categories of the first-level taxonomy, more abstract categories are generated at the second level. Table 3 displays 4 categories (communication, fun, information, shopping), each of them including some categories of the first-level taxonomy.
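A minimal sketch of the two taxonomy levels is given below: the first level strips the request parameters from a logged page name, and the second level looks the generalized page up in a category map built from Tables 2 and 3. The helper names are ours, not part of the original tool chain.

# Two-level page taxonomy sketch (helper names are hypothetical).
SECOND_LEVEL = {
    "Email.html": "Communication", "Chat.html": "Communication",
    "Competition.html": "Fun", "Postcards.html": "Fun", "Game.html": "Fun",
    "Kids.html": "Information", "Help.html": "Information", "Specials.html": "Information",
    "Food.html": "Shopping", "Gourmet.html": "Shopping", "Living.html": "Shopping",
    "Mfashion.html": "Shopping", "Music.html": "Shopping", "Newspaper.html": "Shopping",
    "ShopSpecial.html": "Shopping", "Sports.html": "Shopping",
    "Travel.html": "Shopping", "Travel2.html": "Shopping", "Women-fashion.html": "Shopping",
}

def first_level(page):
    # "Fun.html?siid=1?aid=3" -> "Fun.html": drop everything after the first '?'
    return page.split("?", 1)[0]

def second_level(page):
    return SECOND_LEVEL.get(first_level(page), "Other")

print(first_level("Fun.html?siid=1?aid=3"))   # Fun.html
print(second_level("Music.html?aid=7"))       # Shopping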
3.2 Building Data Marts
Data marts for click analysis, clustering, classification, and prediction are built. The click stream analysis data mart is built in the following way.
Table 3. Categories in the second level taxonomy

Category           Sub-category
1. Communication   a. Email.html  b. Chat.html
2. Fun             a. Competition.html  b. Postcards.html  c. Game.html
3. Information     a. Kids.html  b. Help.html  c. Specials.html
4. Shopping        a. Food.html  b. Gourmet.html  c. Living.html  d. Mfashion.html  e. Music.html  f. Newspaper.html  g. ShopSpecial.html  h. Sports.html  i. Travel.html  j. Travel2.html  k. Women-fashion.html
The original web log data contains a great deal of information; however, only some of it is of interest for the link analysis. The following 4 variables are prepared for click stream and link analysis: ID (either anonymous ID or premium user ID), Date (day), Time (hour, minute, second), and www page. The data marts for clustering and classification have been described before. In order to predict the revenue of customers, a new field "revenue" is added to the data marts. The field revenue comes from the customer lifetime databases of the corporation. In prediction, the field revenue is used as the dependent variable and some other fields are used as independent variables.
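One way such a data mart row could be assembled is sketched below: click records are grouped by cookie ID, hits are counted per second-level category, and the revenue field is joined in from a corporate table. Field names and sample data are hypothetical, and a compact category lookup is inlined for self-containment.

# Building per-customer data-mart rows from click records (sketch; data invented).
from collections import defaultdict

clicks = [  # (cookie_id, page) pairs extracted from the web log
    ("c1", "Music.html?aid=7"), ("c1", "Chat.html"), ("c2", "Food.html"),
    ("c2", "Game.html?siid=2"), ("c1", "Sports.html"),
]
corporate_revenue = {"c1": 120.0, "c2": 0.0}   # from the corporate database

def second_level(page):
    page = page.split("?", 1)[0]               # first-level generalization
    mapping = {"Music.html": "Shopping", "Sports.html": "Shopping", "Food.html": "Shopping",
               "Chat.html": "Communication", "Game.html": "Fun"}
    return mapping.get(page, "Other")

rows = defaultdict(lambda: {"Shopping": 0, "Communication": 0,
                            "Information": 0, "Fun": 0, "TheNumberofHits": 0})
for cookie, page in clicks:
    row = rows[cookie]
    row["TheNumberofHits"] += 1
    category = second_level(page)
    if category in row:
        row[category] += 1

for cookie, row in rows.items():
    row["Revenue"] = corporate_revenue.get(cookie, 0.0)
    print(cookie, row)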
4 Application Results
This section presents the results of our case study. The web log data is extracted from click streams within 2 days.
4.1 Association Rule and Sequential Pattern
Association and sequential pattern analyses have been performed. Some of the discovered association rules are given in Table 4. For example, the association between "email.html" and "music.html" is strong. In more detail, this rule states that if users go to the "email" pages to send e-mails, 84.6% of them also visit the "music" pages; such rules account for 6.2% of all web traffic transactions. Another rule states that 66.7% of users visit the "food" pages before they go to the "chat" sites; such rules account for 11.2% of all web traffic transactions. Some selected sequential patterns are reported in Table 5. The first sequential pattern is interpreted as 91.4% of the users visiting "women-fashion.html" in two consecutive sessions. Another pattern is interpreted as 43% of users first going to "music.html", then to "chat.html" as a communication medium, and finally going
to "women-fashion.html". From these sequential patterns, one learns the main visiting tracks of web visitors, which helps in placing banners and advertisements on the web pages.

Table 5. Sequential patterns among the visited web pages

Support (%)   Sequential pattern
91.4          [women's fashion.html] [women's fashion.html]
72.4          [women's fashion.html] [chat.html]
70.7          [women's fashion.html] [sport.html]
43.0          [music.html] [chat.html] [women's fashion.html]
4.2 Clustering Model
By clustering, web users are segmented into several clusters, where the users with similar behavior are in the same cluster. In the clustering analysis, we obtain two interesting clusters, characterized as a "high revenue cluster" and a "high fun cluster", respectively. Note that if the semantic taxonomy were not applied in the data transformation, the clusters we obtain could not be distinctly interpreted. The "high revenue cluster" contains 12% of the population. 90% of the customers in this cluster are aged 40-50 or above 50 (AgeGroup). All the customers in this cluster are male (Gender). Moreover, their "hit number", "communication", and "fun" values are relatively low, but their "shopping" and "revenue" values are relatively high. This means that the high revenue customers are usually interested in visiting the shopping pages to look for new and suitable items; they do not spend much time playing games, communicating, and chatting with other people. In fact, the customers in the "high revenue cluster" are the profitable customers of the retail company. On the other hand, the features of the "high fun cluster" are rather different. In the fun cluster there are more younger people; half of this group are less than 40 (AgeGroup), and most of them (60%) are female (Gender). In addition, they have high "hit number", "communication", and "fun" values, but low "shopping" and low "revenue". That is, the visitors in the fun cluster are usually interested in playing games and chatting with each other. They are not the main buyers. If there are suitable game and fun items on the web site, this is a possible way to win buyers from these fun visitors.
4.3 Classification and Prediction
Classification and prediction are performed with the built data marts. The result of classification is a decision tree whose class label is the buyer flag. An understandable rule set is acquired from the decision tree. For example, one rule states that if the interest in "information" is very low (less than 0.5) and the interest in "communication" is high (an access rate of at least 4.5), then users will probably (95%) not buy items. Another rule confirms that if the interest in "information" is not less than 0.5, in "shopping" not less than 8.5, in "communication" not greater than 9.5, and in "fun" less than 4.5, then the user will probably buy (99%). On the other hand, a predictive model is built for predicting customers' revenue. This predictive model is generated with radial basis functions (RBF), where both categorical and numerical variables are used. The generated model consists of several sub-models. The first sub-model describes high revenue customers, indicating the customers who have high shopping and low communication and fun. The last sub-model indicates low revenue customers, the customers who have low shopping and high fun and communication. Almost the same customer behavioral patterns have already been identified in the clustering mining. Of course, the RBF method can also be used to predict sales (articles or items), click numbers, etc. Understanding customer behavior with both clustering and prediction mining is considered a hybrid mining approach. The mining results can be compared with each other. Some of the mining results are strongly confirmed when the results from both mining methods show the same trend. On the other hand, some results should be further analyzed if contradictions are found among them. In both clustering and prediction, the profitable customers are explained as those who have high shopping and low communication and fun.
5 Related Work
As mentioned above, the work in [1] and [2] is web usage analysis, where the number of hits for a web page and the number of visitors can be reported, but customer behavior patterns cannot be given. The work in [3] is about association rule and sequential pattern analyses, which is not related to customer behavioral patterns. Regarding data analysis for customer behavioral patterns, similar work can be found in [11] and [12]. The work of [11] clusters customers with a Bayesian customer model, where customer data and customer behavioral data were also used. However, it did not show how to find the profitable customers or what the key factors in the buying process are. The work of [12] tried to cluster web visitors with a clustering algorithm based on a similarity measure of interests between web visitors, but it gave no idea of how to create an efficient and integrated data mining mart, and therefore no report on how to understand customer behavior in the web world. Neither [11] nor [12] considers how to group web information, which prevents customer behavior from being completely discovered.
6 Conclusion
This paper addresses an application-proven analytical solution for e-commerce businesses, in which the profiles and behavior of customers are clearly described with related variables. The semantic taxonomy of web pages contributes to an efficient mining process for extracting rules and patterns. The discovered patterns help analysts to effectively profile web customers and understand buying activities. As web sites become more and more sophisticated, one direction of future work is to automatically feed the acquired models and rules to a personalization or recommendation engine to navigate web visitors online.
References
1. K. L. Wu, P. S. Yu, and A. Ballman. SpeedTracer: A web usage mining and analysis tool. IBM Systems Journal, 37:89–104, 1998.
2. R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining World Wide Web browsing patterns. Knowledge and Information Systems, 1:5–32, 1999.
3. J. Pei, J. Han, B. Mortazavi-asl, and H. Zhu. Mining access patterns efficiently from web logs. In Knowledge Discovery and Data Mining (PAKDD 2000), Kyoto, Japan, 2000. Springer.
4. R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in massive databases. In Proc. of the ACM-SIGMOD 1993 Int'l Conference on Management of Data, Washington D.C., 1993.
5. R. Agrawal and R. Srikant. Mining sequential patterns: Generalizations and performance improvements. In Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, 1996.
6. T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78:1464–1480, 1990.
7. R. S. Michalski and R. Stepp. Learning from observation: Conceptual clustering. In Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, 1983.
8. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
9. M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. of the Fifth Int'l Conference on Extending Database Technology, Avignon, France, 1996. Morgan Kaufmann.
10. A. Zell, G. Mamier, et al. SNNS: Stuttgart Neural Network Simulator, Version 4.1, 1995; also ftp.informatik.uni-stuttgart.de (129.69.211.2).
11. J. Ji, L. Zhang, and C. Liu. The intelligent electronic shopping system based on Bayesian clustering modeling. In Web Intelligence: Research and Development (WI 2001), LNAI 2198, Springer, 2001.
12. V. Estivill-Castro and J. Yang. Categorizing visitors dynamically by fast and robust clustering of access logs. In Web Intelligence: Research and Development (WI 2001 Proc.), Springer, 2001.
DM-XIDS – An Application-Layer Router of Incoming XML Streams

Hao Gui, YuCai Feng, and Huayang Li

Huazhong University of Science & Technology, 430074 Wuhan, P.R. China
[email protected]
Abstract. With the explosion of information on the Internet and the wide use of XML as a data exchange medium, more and more information applications can communicate with each other and deliver large volumes of data as continuous streams. This trend has led to the emergence of novel concepts in data acquisition, integration, exchange, management, and access. In this paper, we propose a middleware architecture for XML stream information dissemination and design a prototype, DM-XIDS, as an applicable extension to our traditional database management system (named DM). As a middleware component of our database system, DM-XIDS presents the novel concept of an application-layer information router with additional administrative functions, which bridges the XML stream sources and the underlying data storage according to pre-customized strategies.
1 Introduction

The World-Wide Web has become a universal database for data storage and dissemination, enabling people to access an enormous amount of information. It is hard to imagine a successful business transaction that has nothing to do with external statistics. This is a world that we just cannot live in without "information", which is not merely a buzzword of information technology but a quintessential reality. On the one hand, a large number of web sites like Yahoo provide us with diverse collections of web pages containing information covering almost every aspect of our daily life. Countless data-supplying nodes are interwoven with others through physical and logical links, and the Internet turns out to be a huge cobweb. On the other hand, finding relevant information sources and querying them manually is time consuming. We often find ourselves helpless and perplexed in the immense cyberspace. The Internet community is considerate enough: in order to help browsers locate the issues they are interested in, famous web sites collect and classify pages manually with the help of professional librarians, and meanwhile they provide users with search facilities based on keywords. Unfortunately, they emphasize simplicity rather than complex customization. The emergence of XML as the standard for data representation and exchange on the Web sparks interest in services aimed specifically at the collection of XML documents or the information that can be XMLized on the Web [1].
DM-XIDS (XML Information Dissemination System), designed at HUST, is a general-purpose middleware for the DM database management system that processes continuous XMLized incoming streams over the multiple pre-registered queries of the users. In this paper, we describe the architecture and features of this middleware as a software application-layer router and also outline our future work. We begin in Section 2 with the infrastructure of DM-XIDS, and then turn in Section 3 to the introduction of its routing functions. Section 4 and Section 5 illustrate the mechanism for pre-processing of the queries and the details of the filtering engine. The paper concludes in Section 6 with directions for future work.
2 Overview of DM-XIDS and Its Architecture

In this section we introduce the conceptual architecture of DM-XIDS, which has been designed to efficiently filter and route incoming XML streams [7]. Fig. 1 depicts the overall DM-XIDS conceptual architecture.
Fig. 1. Conceptual architecture of DM-XIDS
Before getting down to the details, let us characterize the function of the whole system briefly. An XML data stream, or information that has been XMLized and streamed, comes from the outside world, perhaps from some kind of external storage device or the web pages provided by a web site. According to the pre-registered profiles of the end users, who may be distributed over a certain area but connected to the DM-XIDS server, the server filters the information and guides the direction of the outgoing XML streams. It works just like an air-to-air transformer because of the time-effectiveness of the data and the inability to store the entire input.
2.1 Components

The whole architecture of DM-XIDS is made up of several components, each of which implements some specific required functionality for the dissemination of the data stream. We now take them apart and discuss them one by one from the functional and structural points of view.
Web info Extractor and XML documents. Raw HTML data is the main source of information that we can easily get from the Internet, though more and more business applications begin to exchange or publish their information in XML. So we resort to another newly developed tool of the DM4.0 family named DM-WME (Web information Miner and Extractor). It works as an HTML web page gateway via which the input data in HTML files can be recoded into a format conforming to the XML2.0 Specification. It can also mine out the approximate schema of the resulting XML data. An introduction to the implementation of DM-WME is beyond the scope of this paper.
Processing engine. This is the key component of DM-XIDS. It is the routing and access control center that executes the selection of routes; its function and implementation will be discussed in Section 3. It is also the kernel that filters and matches the user-customized queries, a mechanism that will be discussed in Section 4.
Physically distributed database sites. In the experimental environment in which we developed DM-XIDS, physically distributed database sites are linked with the router according to Fig. 1 to simulate a physical network context. The information that finally reaches these sites is forwarded by DM-XIDS and stored in local databases, which play the role of data-consuming clients. We also use the RDBMS DM, supported by the National 863 Project of China and developed by our lab, to examine the seamless conjunction of this middleware gadget and the underlying system. The physically distributed databases are created on DM4.0 and the data is stored as raw XML information without any modification.

2.2 User Interface and Interactivity in DM-XIDS between Components

The friendly interactive graphical user interface shown in Fig. 2 provides a uniform means for both users and the administrator to pose their individual subscriptions and supervision rules. With this visual tool, instead of having to write XPath queries by hand and analyze the regular expressions in their heads, users can tailor their profiles by clicking on the items they want. We now use a simple example to show the whole process of interaction when customizing a query. Suppose that a user named Bob wants to get daily news reports about the NBA without opening Internet Explorer and clicking here and there day after day. Bob can work out his profile on his computer in this interface once and for all. First, he must specify a certain information source, that is, decide which site the information he wants should come from. Maybe "www.yahoo.com" is not too bad; after the submission of this
request, the DM-XIDS server will tell the Web Page Extractor to mine out the approximate schema of that site and return the result to the server. This information is then delivered to Bob's computer, and he sees a structural tree in the left panel of the interface, as shown in the figure above. After he selects and clicks on the tree nodes, a query tree reflecting Bob's preferences emerges in the right panel, and an output XPath expression is created below the query tree. At this point, Bob has finished the customizing process for his favorite news.
Fig. 2. The graphical user interface
The Web Page Extractor will query the target web site and transform the result into continuous XML streams for the DM-XIDS server. The processing engine then starts its work, filtering the data according to the profiles of all users and routing the results to their destinations. In the eyes of an individual user, all the final information is stored locally, and he can read the news at his ease through his own browser supporting the XML Specification. Other issues about DM-XIDS will be discussed in the following sections.
3 Routing Functions

There are thousands of routers and gateway computers spread out all over the world. Every time we browse a web page, routers transfer the data, router-to-router-to-router, from the computer of the web site that stores it to our computer; their prime function is to direct the data packets. Although for the moment there is no complicated, nested routing mechanism in DM-XIDS, we endow it with some characteristics and features of a real router. The data forwarded by our software router consists of XMLized streams rather than IP data packets.
3.1 Routing Table

A routing table is a practical mechanism: each record in the table maps a destination address to a next-hop address. Routers perform a lookup in the table to find the appropriate forwarding direction. In DM-XIDS, the pre-defined customized XPath queries play the role of the records in the routing table. The difference is that a simple lookup cannot work when dealing with the incoming XMLized streams; the software router has to use the matching technique that will be demonstrated in Section 5 to dynamically choose the destination computer. In a Cisco router, you can designate a static routing record by using the ip route command:
# ip route source-address source-address-subnetmask next-hop-address
For example:
# ip route 192.1.0.128 255.255.255.192 192.1.0.65
In the DM-XIDS system, users can direct some kind of information to their computers by customizing a query of their own. If you browse the routing rules from the configuration interface on the DM-XIDS server, you will see records like:
# $emp-XPath-expression site-n
Although the implementations are different, the goals of finding the next stop and correctly directing the data are the same. We believe that routing mechanisms like or beyond this one will be bi-directional and indispensable for information exchange based on XML streams in the near future.

3.2 Holding List

In the Cisco IOS series, there is a powerful network control tool called the access list. These lists add the flexibility to filter the packet flow into or out of router interfaces. Such control can help limit network traffic and restrict network use by certain users or devices. Access lists differentiate packet traffic into categories that permit or deny other features. The design of the holding list in DM-XIDS simulates this network control concept and extends it into the software-gateway application environment. It plays the role of part of a stream filter in addition to the query matching mechanism. Now, let us see the comparison and similarity between the access list in Cisco IOS and the holding list in DM-XIDS. The phrase below is the standard access control list configuration command in Cisco IOS:
# permit | deny source-address source-address-mask dest-address dest-address-mask [log]
For example:
# access-list 1 deny 10.1.1.0 0.0.0.255 10.1.2.0 0.0.0.255
This access-list configuration command holds back all the packets coming from the hosts specified by IP address 10.1.1.x and going to any hosts specified by IP address 10.1.2.x.
In DM-XIDS, we also work out some descriptive notations and terms to depict the access control strategy in our stream filtering environment, which is also described in a form based on XPath expressions as an extended administrative filter, so that it can be processed in the same way as the other customized queries. It looks like the following expression:
# permit | deny source-xml-stream-criteria dest-host [log]
Suppose that you are an administrator maintaining the software router DM-XIDS, and you want to prevent some XML information from reaching a certain host. You can input the following command, or achieve the same goal by mouse clicks in the configuration interface:
# deny /www.yahoo.com/sports/nba site-1
A [log] option is also provided in DM-XIDS, which enables the router to log the event of holding back an XML stream. Something should be explained further: if the users or applications on site-1 pre-register the query "/www.yahoo.com/sports", the host of site-1 will continue to receive information on other sport issues except the reports on the NBA. That is to say, in addition to restricting the network traffic, the holding list serves as an auxiliary filter to accomplish specific administrative functions. And it seems effective.

3.3 Multicast Group

In the experimental environment of DM-XIDS, the same category of information may be subscribed to by a group of users. If DM-XIDS forwards all this information to each user's site separately after filtering and matching, precious network bandwidth will be wasted and the burden on our DM-XIDS server will also increase. Traditionally, IP packets on the Internet are transmitted in two ways - unicast or broadcast. Multicast is a third way to deliver IP packets to a group of hosts. Host groups are identified by class D IP addresses, i.e., those with "1110" as their higher-order bits. In dotted decimal notation, host group addresses range from 224.0.0.0 to 239.255.255.255. As an application-layer router of incoming XML streams, DM-XIDS can dynamically create and allocate multicast IP addresses for identical user profiles after the pre-processing (described in Section 4) of all the queries that have been submitted. The DM-XIDS server then reports the associated multicast addresses to the hosts that have been arranged into a multicast group, and the hosts are enabled to receive multicast information. All the management of multicast groups, including construction and destruction, is done in the server, and this centralized management provides more flexibility, security, and efficiency when compared with link-layer IGMP+CGMP protocols in a specific application environment. All in all, the application-layer routing functions proposed in this paper may be primitive and incomplete, but a first step towards comprehensive and mature business applications built upon an information exchange context based on XMLized streams has been taken with this experimental project.
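To make the routing-table and holding-list ideas concrete, the sketch below keeps a list of (query prefix, destination) routing entries and a deny list, and decides where a matched document should be forwarded. The matching here is a naive prefix test standing in for the real filtering engine of Section 5; all entries are hypothetical.

# Application-layer routing table + holding list (simplified sketch).
routing_table = [                      # (registered query prefix, destination site)
    ("/www.yahoo.com/sports", "site-1"),
    ("/www.yahoo.com/news", "site-2"),
]
holding_list = [                       # (action, query prefix, destination site)
    ("deny", "/www.yahoo.com/sports/nba", "site-1"),
]

def destinations(document_path):
    # Return the sites a document with this path should be forwarded to.
    targets = [site for prefix, site in routing_table
               if document_path.startswith(prefix)]
    for action, prefix, site in holding_list:
        if action == "deny" and document_path.startswith(prefix) and site in targets:
            targets.remove(site)       # held back by the administrative filter
    return targets

print(destinations("/www.yahoo.com/sports/soccer"))  # ['site-1']
print(destinations("/www.yahoo.com/sports/nba"))     # [] - denied for site-1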
4 Pre-processing of Queries Based on Path Constraints

We take advantage of the path constraints existing in semi-structured databases, which are a fundamental part of the semantics of the data, to pre-process the queries in DM-XIDS. As shown in Fig. 1, the Web info Extractor is a crucial component of DM-XIDS, providing easy access to the countless web sites on the Internet. It can also mine out the structural information about the pages offered by a certain web site by making use of href tags or other kinds of linkage; see Fig. 3. The optimization of navigational queries on semi-structured data can be done by using the implication of path constraints. A self-explanatory example from Fig. 3:
Fig. 3. Structural information of a certain web site
/MainPage/News/…/NBA = /MainPage/Sports/NBA
A path constraint is expressed in one of two forms: Q1 ⊆ Q2 or Q1 = Q2, where Q1 and Q2 are regular expressions. This inclusion (or equivalence) information can be used to optimize and avoid redundant user queries. We draw the following conclusions after a substantive analysis of the path constraints existing in some leading web sites. (1) Regular paths ending with a certain sequence of nodes return to the source page; that is to say, after traveling through a series of pages, we get to the original one again. Let P* be the sequence of nodes; then the query Q = /(Qs)/P*/Qe is equivalent to /(Qs)/Qe. (2) If the path constraint Qm ⊆ Qn is satisfied in a specific site, the query Qm can be omitted in the presence of Qn. It has already been proven that the general implication problem for regular path constraints is decidable in EXPSPACE with respect to the length of the constraints. Equipped with the strategy of using path constraint semantics in the pre-processing of queries, DM-XIDS can cope with a large number of user profiles more efficiently.
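The sketch below shows how such constraints could be used to prune the set of registered queries before the filtering automata are built: an equivalence map mined from the site rewrites queries to a canonical form, and a query is dropped when another registered query subsumes it under a simple prefix/descendant test. The constraint data and the subsumption check are deliberately simplified and hypothetical.

# Query pre-processing with path constraints (simplified, hypothetical constraints).
EQUIVALENCES = {
    # hypothetical mined equivalence: the News section links to the same Sports pages
    "/MainPage/News/Sports": "/MainPage/Sports",
}

def canonical(query):
    for redundant, replacement in EQUIVALENCES.items():
        if query.startswith(redundant):
            query = replacement + query[len(redundant):]
    return query

def subsumes(q_general, q_specific):
    # Naive check: "/a/b" subsumes "/a/b/c"; a trailing "//" subsumes any descendant.
    if q_general.endswith("//"):
        return q_specific.startswith(q_general[:-2])
    return q_specific == q_general or q_specific.startswith(q_general + "/")

def preprocess(queries):
    queries = {canonical(q) for q in queries}
    return [q for q in queries
            if not any(other != q and subsumes(other, q) for other in queries)]

registered = ["/MainPage/News/Sports/NBA", "/MainPage/Sports", "/MainPage/Sports/NBA"]
print(preprocess(registered))   # only "/MainPage/Sports" survives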
5 Mechanism of the Stream Filtering Engine

XML is a self-descriptive language that separates presentation from content and reveals its semantic structure, compared with HTML, which is the predominant
format on today's Internet, and we can use this structural information [5] to implement filtering and matching in DM-XIDS.
Fig. 4. NFA fragments
In DM-XIDS, the filtering engine is the core component that implements the filtering and matching operations. Because of the regular format of XPath expressions, NFAs are used to resolve the queries posed by users. When developing our prototype, we used SAX (Simple API for XML) as the parser for incoming XML streams because of their streaming characteristics and the inability to store the entire input to construct a whole document tree in memory [4]. Each node in the XPath expressions of a user's profile is mapped to a machine state of the corresponding NFA. After the construction of the NFAs, DM-XIDS is ready for incoming XML streams. When the stream parser meets the start of the next node, an event is raised and the active node is added to the tail of the current path to represent the navigational step. We arrange and index queries according to their source site, and XML streams coming from an appointed web site activate all the automata associated with the queries pointing at this site. A query is said to match an XML stream if, during parsing, an accepting state for that query is reached. When a query has been matched, the information is directed to the target according to the relevant record in the routing table mentioned in Section 3. The simplest form of XPath is a single path query [3], which may be either an absolute path from the root or a relative path from a certain context location [2]. For example:
Q1 = /Yahoo!/News & Media/Sports/Basketball/NBA
Q2 = /Yahoo!/News & Media/Sports/*/NBA
Q3 = /Yahoo!/News & Media//NBA
A location step is denoted by a node name of the navigational query. Each location step may also contain one or more conditions to be evaluated to further identify the selected node:
Q4 = /Yahoo!/News & Media/Sports[Date/Value=2003-8-15]/Basketball/NBA
The examples above represent most of the user queries in our system. NFAs have a rather powerful capability for expressing regular languages and are competent to represent and filter the queries. A single FA with fixed states can resolve Q1, while Q4 can be
treated as a simple combination of two single queries. Q2 and Q3 are more complex, and the NFA fragments associated with "../a/*/b" and "../a//b" are shown in Fig. 4. XTrie and YFilter, developed by AT&T Labs, are devised to present a generic model for the filtering of all kinds of XML queries. The stream filtering engine designed in DM-XIDS is relatively naive when compared with them, because of the relatively straightforward experimental environment shown in Fig. 1, which focuses on dealing with user profiles for information subscriptions from a designated web site.
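A compact sketch of this NFA-over-SAX idea is given below: each simple path query (child steps, "*", and "//") is compiled into a step list, and a stack of active state sets is maintained while SAX reports start- and end-element events. It mirrors the approach described above only in spirit; predicates such as the one in Q4 and the real DM-XIDS engine are not covered, and the sample document is invented.

# NFA-style filtering of simple path queries over a SAX stream (sketch).
import xml.sax

def compile_query(q):
    # '/a//b/*' -> [('child','a'), ('desc','b'), ('child','*')]
    steps, axis = [], "child"
    for token in q.split("/")[1:]:
        if token == "":
            axis = "desc"            # an empty token comes from '//'
        else:
            steps.append((axis, token))
            axis = "child"
    return steps

class Filter(xml.sax.ContentHandler):
    def __init__(self, queries):
        super().__init__()
        self.queries = {name: compile_query(q) for name, q in queries.items()}
        self.matched = set()
        self.stack = [{name: {0} for name in self.queries}]  # active states per query

    def startElement(self, name, attrs):
        frame = {}
        for qname, states in self.stack[-1].items():
            steps, new_states = self.queries[qname], set()
            for s in states:
                axis, test = steps[s]
                if axis == "desc":
                    new_states.add(s)            # '//' may skip this element
                if test == "*" or test == name:  # this element matches step s
                    if s + 1 == len(steps):
                        self.matched.add(qname)  # accepting state reached
                    else:
                        new_states.add(s + 1)
            frame[qname] = new_states
        self.stack.append(frame)

    def endElement(self, name):
        self.stack.pop()

queries = {"Q2": "/site/sports/*/nba", "Q3": "/site//nba"}
handler = Filter(queries)
xml.sax.parseString(b"<site><sports><basketball><nba/></basketball></sports></site>", handler)
print(handler.matched)    # {'Q2', 'Q3'}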
6 Conclusion and Future Work

In this paper we have presented the architecture and routing functions of DM-XIDS, an experimental system being developed at HUST, whose main objective is to meet the requirements of data exchange based on XML streams by providing an integrated environment for managing and disseminating XML streaming data in an applicable way. A routing engine is devised upon a 2-tier architecture to process the incoming data on the fly. The graphical user interface is also designed to cope with the characteristics of the XML format, so that it can easily be operated to tailor a query plan. DM-XIDS is an ongoing project, and much work remains to be done in the context of the dissemination of XMLized stream information. In particular, we are interested in the application-layer router architecture of the future XML stream data exchange environment, because routing mechanisms implemented in the application layer have several unique strong points, though some of them may not be found in DM-XIDS for the moment. The following extended routing functions are the next step for DM-XIDS in a more sophisticated WAN environment [6]:
• Almost unlimited scalability
• Efficient access to globally distributed services
• Replication on demand
• Self-tuning architecture
• Application-level QoS and dynamic routing
In addition, we hope to pay more attention to XPath query filtering and pre-processing in the core engine and plan to improve the algorithm we have adopted by using more advanced indexing techniques. We also want to enable our system to deal with specific ad hoc queries to enhance its adaptability to real business applications. All these issues are worth further investigation.
References
1. Lukasz Golab, M. Tamer Özsu. Issues in Data Stream Management. SIGMOD Record, June 2003, Vol. 32, No. 2:5–14
2. Yanlei Diao, Michael J. Franklin. High-Performance XML Filtering: An Overview of YFilter. IEEE Data Engineering Bulletin, March 2003, Vol. 26, No. 1:41–48
3. Peter Buneman, Susan Davidson, Mary Fernandez and Dan Suciu. Adding Structure to Unstructured Data. In Proceedings of ICDT, January 1997
4. Serge Abiteboul. Querying semi-structured data. In Proceedings of ICDT, January 1997
5. Monika R. Henzinger, Thomas A. Henzinger and Peter W. Kopke. Computing Simulations on Finite and Infinite Graphs. In Proceedings of IEEE FOCS, October 1995:453–462
6. RFC 1812 Vol. 7. APPLICATION LAYER - ROUTING PROTOCOLS. Cisco Systems
7. Stratis D. Viglas, Jeffrey F. Naughton. Rate-Based Query Optimization for Streaming Information Sources. In SIGMOD 2002
Developing a Main Memory Moving Objects DBMS for High-Performance Location-Based Services

Kwang Woo Nam, Jai Ho Lee, Seong Ho Lee, Jun Wook Lee, and Jong Hyun Park

LBS Research Team, Telematics Research Division, Electronics and Telecommunications Research Institute, Daejeon, 305-350, Korea
{kwnam, snoopy, sholee, ljw63822, jhp}@etri.re.kr
Abstract. This paper proposes a moving objects data model, query language, and main memory database system for location-based services. Location-based services require a GIS database system to represent moving objects and to support querying of the motion properties of objects. For example, fleet management applications may require the storage of information about moving vehicles. Also, advanced CRM applications may require storing and querying the trajectories of mobile phone users. However, modeling consistent information about the location of continuously moving objects and processing motion-specific queries is a challenging problem. We formally define a data model and data types for moving objects that include complex evolving spatial structure and indeterminacy, and then describe the system implementation.
since they do not support a moving objects data model and query language, application developers have to implement all of this themselves. In this paper, we describe the system architecture, moving objects data model, and query language of a main memory moving objects database system that overcomes the above two problems. This system uses main memory data storage and memory-optimized moving objects indexes for accelerating location update operations. We also propose a moving objects data model and a moving objects SQL with time-series and continuous moving object operators. This distinguishes our work from other moving objects research by Guting [1,2], Wolfson [8,9,10], Jensen [4], and others [5,6,7] in the respect of time-series operators: they neglect the time-series characteristics of moving objects and have focused on continuous moving object characteristics, although the time-series characteristic is important and useful in location-based applications. The rest of this paper is organized as follows. The related work and problems of moving objects databases are discussed in Section 2, and an overview of the system architecture is introduced in Section 3. The design of the moving objects components and the query language is presented in detail in Section 4 and Section 5. We also show an application for moving objects visualization in Section 6. Finally, the summary and future work are given in Section 7.
2 Problem Statements

Suppose that a taxi or transportation company wants to develop fleet management applications. It has to collect and maintain large quantities of data associated with geographic locations for tracking vehicles and optimizing transport routes. This means that new location data is inserted very frequently, and the volume of data increases massively over time, while the frequency of query operations may be much lower than the frequency of insertions. So how to efficiently manage moving object data in a database is closely associated with how to process not only the insertion of new data but also the query operations quickly. We address this problem by choosing main memory database technology. We think that our approach is one of the best ways to overcome the performance limits of disk-based systems.
Fig. 1. Representation of Moving Objects : (a) Time-series Representations in Database Storage, (b) Continuous Representations in Abstraction Layer
In the real world, moving objects occupy a part of Euclidean space over varying time. However, for system efficiency, the trajectory information is gathered at every sampling
time, as frequently as possible, and stored in the database storage. Suppose that a phone user is walking along a road. This information will be stored in the database as a sequence of positions with time tags, as in Figure 1(a). This is similar to the characteristics of time-series data such as sensor data. Although it is very useful for evaluating the variation of moving objects and managing the data sequentially, previous research on moving objects did not consider this. A problem of the time-series representation is how to obtain the position at a time point in [tx, tx+1], between the sampling times tx and tx+1. In order to query continuous moving objects, it should be estimated by a mapping function with a time parameter in an abstract representation, as shown in Figure 1(b). We therefore think that moving objects databases should support time-series view functions as well as continuous functions.
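As an illustration of the continuous view in Figure 1(b), the fragment below linearly interpolates a sampled trajectory to estimate a position between two sampling times; the sample points are invented, and a real system may use a different mapping function.

# Estimating a position between sampling times by linear interpolation (sketch).
samples = [(0.0, (10.0, 20.0)), (10.0, (14.0, 24.0)), (20.0, (20.0, 24.0))]  # (t, (x, y))

def position_at(t):
    for (t1, (x1, y1)), (t2, (x2, y2)) in zip(samples, samples[1:]):
        if t1 <= t <= t2:
            r = (t - t1) / (t2 - t1)
            return (x1 + r * (x2 - x1), y1 + r * (y2 - y1))
    raise ValueError("t outside the sampled period")

print(position_at(5.0))    # (12.0, 22.0)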
3 An Overview of the System Architecture

The Electronics and Telecommunications Research Institute (ETRI) has developed core technology for location-based services on CDMA wireless networks. The project includes the development of an LBS service platform, a geographic contents and mobility server, and LBS solutions such as emergency 119, location-based games, and a personal navigation agent. The main memory moving objects database is a core part of the ETRI location-based service technology project. This system manages the current and past moving object locations as well as geographic information. Figure 2 shows the overall architecture of the main memory moving objects database system and its applications. The moving objects database consists of the moving objects components, the moving objects SQL processor, and a main memory storage including moving objects indexes.
Fig. 2. Architecture of Moving Object Database System
The moving objects classes component is a set of temporal, geometry, and moving objects classes that implement the moving objects data model and operations. The moving
objects SQL processor processes and executes various powerful moving objects queries. The extended query language is compatible with SQL3 and supports building application systems such as moving objects data mining and customer relationship management through application programming standards such as JDBC, ODBC, and OLEDB. The main memory storage component consists of a transaction manager, lock manager, recovery manager, deadlock detector, and index manager, like other conventional DBMSs. In addition, we support current-location memory indexes and past moving objects indexes such as the R*-tree, TB*-tree, and 3DR-tree. This increases the performance of moving object access efficiently.
4 Modeling Moving Objects

In this section, we describe the class model for the moving objects components using UML. It supports time-series operations as well as continuous moving objects operations.

4.1 Moving Objects Class Hierarchy and Interfaces

To design moving objects components that can be used in location-based database applications, we need more complex data structures and operations than those of general relational database applications. Since moving objects change their positions and shapes in the temporal dimension, we should design temporal and spatial components first of all, and then the moving objects. We also need to define new operations for manipulating and retrieving the moving objects in databases. The class package for the moving objects components consists of four major components: ETRITime, ETRIGeometry, OGISGeometry, and ETRIMGeometry. The ETRITime component is a set of classes supporting temporal classes and their operations, and the ETRIGeometry component is for spatial classes. Similarly, the ETRIMGeometry component is for moving objects.
Fig. 3. Temporal Classes : (a) Temporal Class Hierarchy, (b) Interface Diagram of Temporal Component
Figure 3 shows a diagram of temporal classes. Temporal classes consist of Period, Interval, Instant, and TemporalCollection classes as shown in Figure 3(a). These
classes have the ITemporal, ITemporalRelation, and ITemporalOperator interfaces, as shown in Figure 3(b). The geometry classes consist of the Point, LineString, Polygon, GeometryCollection, Surface classes, etc. The UML modeling of the geometry classes is borrowed from the International Standard of the Open GIS Consortium [3], so we omit the detailed description of the geometry classes. Figure 4 shows the moving objects classes. The classes for moving objects consist of the TObject, MObject, MBase, MGeometry classes, etc. MObject is the super class of all moving object classes, and TObject is a unit element class for organizing moving objects, such as a tuple of time and object, as shown in Figure 4(a) and Figure 4(b).
Fig. 4. Moving Objects Classes: (a) Moving Objects Class Hierarchy (b) Interface Diagram of Moving Objects Component
MBase is the super class for non-spatial moving object classes such as moving float, moving integer, moving string, and moving boolean. For example, the moving float class can express a sequence of varying distances between two moving cellular phone users, such as [<t1, d1>, <t2, d2>, ..., <tnow, dnow>]. MGeometry is the super class for spatial moving object classes such as the moving point, moving linestring, and moving polygon classes. For example, the moving point class can be used to manage the moving positions of a cellular phone user, such as [<t1, p1>, <t2, p2>, ..., <tnow, pnow>].

4.2 Operations for Moving Objects

Time-Series View. As described in the former sections, moving objects have the characteristics of time-series data. We can define an element of a moving object as follows:
e = <t || o>
An element is a tuple of a time and a value object; the first element in Figure 1(a) is one example. A moving object is a sequence of such elements:

s = [e_1, e_2, ..., e_now]
We propose time-series operators for moving objects as shown in Table 1. The validtime operator returns the time t of an element, and the value operator returns its value object o. The first operator returns the first element e_1 of the moving object, and the last operator returns the last element e_now.

Table 1. Operations for Time-Series View
Classification   Time-Series Operations
GetTime          validtime(e_i) = t_i ; index(e_i) = i
GetFirstLast     first(s, k) = e_k ; last(s, k) = e_k
GetElement       next(e_i) = e_{i+1} ; previous(e_i) = e_{i-1} ; after(s, t_i) = e_j ; before(s, t_i) = e_j'
Slice            slice_sequence(s, k_from, k_to) = s' ; slice_sequence(s, t_from, t_to) = s'
The after operator returns the nearest element after the time t given as a parameter, and the before operator returns the nearest element before time t. The slice_sequence operators return the part of the sequence restricted by the times or indexes given as parameters.

Continuous Moving Objects View. For the continuous moving objects view, we propose three kinds of operators: snapshot, slice, and project operations, as shown in Table 2.

Table 2. Operations for Continuous Moving Objects View

Classification   Continuous Moving Objects Operations
Snapshot         snapshot_validtime(m, t_i) = v_i ; snapshot_value(m, c_i, x) ⊆ domain(Time)
Slice            slice_validtime(m, t_from, t_to) = m' ; slice_value(m, c_i, x) = m'
Project          project_validtime(m) ⊆ domain(Time) ; project_value(m, c_i) ⊆ domain(T_i)
The snapshot_-prefixed operators return a snapshot along the dimension specified in the parameter; they consist of the snapshot_validtime and snapshot_value operators. The snapshot_validtime operator returns the snapshot value object at a specific time point, while snapshot_value returns the snapshot times along the value dimension. The slice operators return sliced moving objects along the dimension specified in the parameters: slice_validtime returns the moving object within a specific time period, while slice_value returns the moving object sliced by value objects. The project operators return the objects of one dimension projected by the other dimension, time or value. The slice operators are distinguished from the snapshot operators by their return values: the former return moving objects, whereas the latter return snapshot objects in the value dimension.
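To make the time-series view concrete, the following is a minimal Java sketch of the element/sequence structure and a few of the operators of Table 1. The class and method names are ours, not the system's actual API, and the interpretation of last(s, k) as counting from the end follows the later Example 2.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: an element is a <time, value> pair and a moving object
// is the sequence [e1, ..., enow]. Names and signatures are illustrative.
class Element<V> {
    final long validTime;   // t_i
    final V value;          // o_i
    Element(long validTime, V value) { this.validTime = validTime; this.value = value; }
}

class MovingSequence<V> {
    private final List<Element<V>> elems = new ArrayList<>();

    void append(long t, V v) { elems.add(new Element<>(t, v)); }

    // first(s, k) / last(s, k): k-th element from the start / from the end
    Element<V> first(int k) { return elems.get(k); }
    Element<V> last(int k)  { return elems.get(elems.size() - 1 - k); }

    // after(s, t): nearest element strictly after time t (null if none)
    Element<V> after(long t) {
        for (Element<V> e : elems) if (e.validTime > t) return e;
        return null;
    }

    // before(s, t): nearest element strictly before time t (null if none)
    Element<V> before(long t) {
        Element<V> found = null;
        for (Element<V> e : elems) if (e.validTime < t) found = e;
        return found;
    }

    // slice_sequence(s, t_from, t_to): sub-sequence restricted by a time interval
    MovingSequence<V> sliceByTime(long from, long to) {
        MovingSequence<V> s = new MovingSequence<>();
        for (Element<V> e : elems)
            if (e.validTime >= from && e.validTime <= to) s.append(e.validTime, e.value);
        return s;
    }
}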
5 Moving Objects SQL

In this section, we extend SQL to support moving objects databases. Our extended SQL uses the time-series and moving objects operators.

5.1 Creation of a Moving Objects Table

Moving objects are managed as data types, such as mpoint, mlinestring, and mpolygon. For example, when a user wants to create a table for managing the positions of moving cellular phones in a location-based service application, he may create a table with an attribute of the moving point data type as shown in Example 1.

Example 1. Create a gmlc_user table with a moving point attribute.

CREATE TABLE gmlc_user (
  cellphone_number char(20),
  cellphone_owner char(15),
  position mpoint
    option (Data Split Mode is time 'every 1 day' index is 3DRTree)
);

The data split mode specifies how moving points are stored and consists of three elements: time, spatial, and storage. In the above example, the data would be clustered every 1 day. The index mode specifies the index type for the moving point; in this example, the position attribute will be managed by a 3DRTree index.

5.2 Time-Series Queries

The time-series operators described in Section 4.1 can be used in SQL for querying moving points, such as getting the latest inserted position, a partial sequence, or aggregating the variation of movements. Example 2 shows the latest location information of a cellular user; this query extracts the location that was inserted last into the table.

Example 2. Show the latest user's position and the count of position records for the cellular phone number 016-402-8040.

SELECT last(position, 0), count(position)
FROM gmlc_user
WHERE cellphone_number = '016-402-8040';

5.3 Continuous Moving Objects Queries

Continuous moving objects queries consist of snapshot, slice, project, and join queries. As described in Section 4.2, the snapshot operator can be used for getting a snapshot at a specific time. We show a use case of the snapshot operator in Example 3.

Example 3. Show the current positions of gmlc_users.

SELECT snapshot_validtime(position, now)
FROM gmlc_user
In the above example, we used the time variable 'now', which denotes the latest value for the cellular phone user, since the database can take the exact position at the current time. If desired, an explicit time value can be used instead of the 'now' variable to get a snapshot in the past. Slice queries are used to get the history information of moving objects. The following are examples of slice queries.

Example 4. Show the trajectories of gmlc users during a specific period.

SELECT slice_validtime(position, period('2002/05/01-2002/05/31'))
FROM gmlc_user

Example 4 shows the slice_validtime operator used in the SELECT clause; it restricts the trajectories to the user-specified range.

Example 5. Show gmlc users who passed a specific area since 2002/05/03.

SELECT cellphone_owner
FROM gmlc_user
WHERE overlaps(position, motpolygon(
    period('2002/05/03-now'),
    polygon('500 500 1000 500 1000 1000 500 1000 500 500')))

Functions for moving objects such as distance can also be used in SQL. The distance operator is very useful for extracting the spatial relationship between two moving objects. Queries like Example 6 are required in location-based service applications for moving objects databases.

Example 6. Show pairs of gmlc users who are less than 100 m apart.

SELECT A.cellphone_owner, B.cellphone_owner
FROM gmlc_user A, gmlc_user B
WHERE distance(A.position, B.position) < 100
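Since the system is stated to be reachable through standard APIs such as JDBC, a query like Example 6 could, in principle, be issued from a client as sketched below; the driver URL and credentials are placeholders, not values given by the paper.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DistanceQueryClient {
    public static void main(String[] args) throws Exception {
        // Placeholder JDBC URL and credentials: the real driver name and URL of the
        // moving-objects DBMS are not given in the paper.
        try (Connection con = DriverManager.getConnection("jdbc:modbms://localhost/lbs", "user", "pw");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT A.cellphone_owner, B.cellphone_owner " +
                 "FROM gmlc_user A, gmlc_user B " +
                 "WHERE distance(A.position, B.position) < 100")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " is near " + rs.getString(2));
            }
        }
    }
}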
6 Moving Objects Indexing

The TB*-tree is fundamentally similar to the TB-tree in that line segments are treated as parts of trajectories and trajectories are strictly preserved: a leaf node only contains segments belonging to the same trajectory. As mentioned in [8], this approach increases overlap and deteriorates the space discrimination of the index. However, trajectory preservation lets the index answer pure spatiotemporal queries efficiently. To improve insertion performance, our approach uses the AuxCache (auxiliary cache) when new data is inserted. In previous related work, a leaf node for storing the new data is searched for on every insertion, which slows down the insert operation. Moreover, insertion into the TB-tree is very expensive because the TB-tree uses an overlap query to find the segment connected to the new entry; therefore, as the size of the index increases, the cost of insertion increases dramatically. To solve this problem, we use the auxiliary structure AuxCache, which records, for each trajectory identifier, the latest node containing that trajectory. When new data is inserted, the index first looks for a leaf node in AuxCache. If
the previous trajectory information exists, the index directly accesses the found leaf node. Otherwise, it creates a new leaf node for the new segment and stores the new trajectory identifier and leaf node into AuxCache. Figure 5 shows the process of new data insertion.
Fig. 5. New Data Insertion in TB*-tree
Of course, the AuxCache needs additional system resources. Suppose there are 10,000 objects and that the trajectory identifier and page ID each take 4 bytes; then the total size of the AuxCache is only 80,000 bytes ≈ 79 KB. Moreover, the AuxCache has another merit for some trajectory queries: queries such as "find the latest several trajectory segments with a specified trajectory identifier" or "find the route of the bus with car ID '155' during the latest one hour" can be answered very fast.
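The insertion path described above can be sketched in a few lines of Java; the tree interface here is a stand-in and only the cache logic mirrors the description in the paper (node overflow and tree reorganisation are glossed over).

import java.util.HashMap;
import java.util.Map;

// Sketch of AuxCache-assisted insertion into a TB*-tree. Names are illustrative.
class AuxCache {
    // trajectory identifier -> latest leaf node (page) containing that trajectory
    private final Map<Integer, Integer> latestLeaf = new HashMap<>();

    Integer lookup(int trajectoryId)            { return latestLeaf.get(trajectoryId); }
    void remember(int trajectoryId, int leafId) { latestLeaf.put(trajectoryId, leafId); }
}

interface TrajectoryIndex {                      // stand-in for the TB*-tree
    void appendToLeaf(int leafId, int trajectoryId, double[] segment);
    int createLeaf(int trajectoryId, double[] segment);   // returns the new leaf id
}

class SegmentInserter {
    private final AuxCache cache = new AuxCache();

    void insert(TrajectoryIndex index, int trajectoryId, double[] segment) {
        Integer leaf = cache.lookup(trajectoryId);
        if (leaf != null) {
            // Previous trajectory information found: access the cached leaf directly,
            // avoiding the overlap query that makes plain TB-tree insertion expensive.
            index.appendToLeaf(leaf, trajectoryId, segment);
        } else {
            // No cached entry: create a new leaf node for the new segment.
            leaf = index.createLeaf(trajectoryId, segment);
        }
        cache.remember(trajectoryId, leaf);
    }
}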
Fig. 6. Moving Objects Visualizer : (a) Monitoring Positions of a Cellular Phone Mounted Car, (b) 3D Visualization of Moving Points.
7 Conclusion and Discussion

In location-based services, the past location states as well as the current locations of the objects are frequently of interest [2, 4, 10]. However, memory-resident management
and a time-series view of moving objects have not been addressed in previous research and development efforts. In this paper, we proposed a moving objects model and operators and applied these components to an SQL processing system. The proposed model and query language support a time-series view as well as a continuous moving objects view. We also implemented a main memory storage manager for storing moving objects and proposed a new index structure that accelerates update operations using the AuxCache. Figure 6 shows the visualization program for monitoring moving objects using our database system. The proposed main memory moving objects database system can be applied to various LBS applications, for example, vehicle routing and monitoring software designed to help fleet managers in any industry reduce costs and improve customer service.
References

1. Erwig, M., Güting, R. H., Schneider, M., and Vazirgiannis, M., "Spatio-Temporal Data Types: An Approach to Modeling and Querying Moving Objects in Databases," GeoInformatica, Vol. 3, No. 3, pp. 269-296, 1999.
2. Forlizzi, L., Güting, R. H., Nardelli, E., and Schneider, M., "A Data Model and Data Structures for Moving Objects Databases," ACM SIGMOD Conference, pp. 319-330, 2000.
3. Open GIS Consortium, Inc., OpenGIS Simple Features Specification For OLE/COM, Revision 1.1, 1999.
4. Pfoser, D., Jensen, C. S., and Theodoridis, Y., "Novel Approaches to the Indexing of Moving Object Trajectories," Proc. of the 26th Conference on VLDB, Cairo, Egypt, 2000.
5. Schneider, L. M. and Vazirgiannis, M., "A Foundation for Representing and Querying Moving Objects," ACM Transactions on Database Systems, Vol. 25, pp. 1-42, 2000.
6. Seydim, A. Y., Dunham, M. H., and Kumar, V., "Location Dependent Query Processing," Second ACM International Workshop on Data Engineering for Wireless and Mobile Access, pp. 47-53, May 20, 2001, Santa Barbara, CA, USA.
7. Shaw, S. L. and Wang, D., "Handling Disaggregate Spatiotemporal Travel Data in GIS," GeoInformatica, Vol. 4, No. 2, pp. 161-178, 2000.
8. Sistla, A. P., Wolfson, O., Chamberlain, S., and Dao, S., "Modeling and Querying Moving Objects," ICDE, pp. 422-432, 1997.
9. Wolfson, O., Chamberlain, B. X. S., and Jiang, L., "Moving Objects Databases: Issues and Solutions," 10th International Conference on Scientific and Statistical Database Management (SSDBM98), Capri, Italy, July 1-3, 1998, pp. 111-122.
10. Wolfson, O., Chamberlain, B. X. S., Sistla, P., Xu, B., and Zhou, X., "DOMINO: Databases fOr MovINg Objects tracking," ACM International Conference on SIGMOD, pp. 547-549, 1999.
A Detection Model of Direction Relations for Spatial and Spatiotemporal Objects Bao-Lin Yi and Yu-Cai Feng
Abstract. Direction is frequently used as a selection condition in spatial and spatiotemporal queries. As a result, it is important for spatial and spatiotemporal databases to provide a mechanism for modeling and processing direction queries and reasoning. Based on the direction relation matrix, an inverted direction relation matrix and the concept of direction predominance are proposed to improve the detection of direction relations between objects. Direction predicates such as "surround" and "surroundedby" are also extended. These techniques can improve the accuracy of direction queries and reasoning, and experiments show good efficiency for direction queries.
M. Egenhofer proposed a model using the direction-relation matrix to handle direction relations, and an improved deep direction relation matrix model has also been studied [5]. However, these models cannot handle precise direction relations and pay little attention to dynamic direction relations in spatiotemporal systems. The paper is organized as follows. In Section 2 we define several direction relations between objects. Section 3 discusses the predicate extensions of direction relations in spatial and spatiotemporal systems. Section 4 gives some experimental results.
2 Direction Relation Model

2.1 Direction Relation Matrix

The direction relation matrix of two polygons A and B, which are called the reference object and target object respectively [5], is a 3x3 matrix as in formula (1):

\[ dir(A,B) = \begin{pmatrix} NW_A \cap B & N_A \cap B & NE_A \cap B \\ W_A \cap B & O_A \cap B & E_A \cap B \\ SW_A \cap B & S_A \cap B & SE_A \cap B \end{pmatrix} \]    (1)
The basic directions {NW, N, NE, E, SE, S, SW, W, O} are the nine plane areas partitioned by the MBR of the reference object, and the elements of the matrix record the intersection between the target object and each partitioned area (0 denotes an empty intersection, 1 otherwise). When only one element of the matrix is non-empty, the relation is called a single-item direction relation; otherwise, it is called a multi-item direction relation. The direction relation matrix can describe the eight conventional directions as single-item direction relations. However, it cannot give a detailed direction between objects for the same-direction case and for multi-item direction relations. For example, the direction relation matrix between the mainland of P.R. China and the Taiwan area is (0,0,0,0,1,0,0,0,0), which states that Taiwan has the same direction as the mainland of China; this does not accord with the fact that Taiwan is in the southeast of China.

2.2 Inverted Direction Relation Matrix

The direction relation matrix concentrates only on the MBR of the reference object and does not consider semantic symmetry between objects. That is to say, the shape of the target object should also be considered to obtain a precise direction. This requires the inverted direction relation matrix, which has the same formulation as the direction relation matrix; the only difference is that the roles of the reference object and target object are exchanged. The inverted direction relation matrix is complementary to the direction relation matrix and can provide more precise directional information. However, neither of them can determine a unique direction between two objects. For example, the inverted direction matrix between the mainland of China and the Taiwan area is (1,1,1,1,0,0,1,0,0), which means that Taiwan does not have the same direction as the mainland of
China. As a result, direction predominance, which is based on the metric direction relation matrix (formula (2)), is used:

\[ Dir(A,B) = \begin{pmatrix} \rho(NW_A \cap B)/\rho(B) & \rho(N_A \cap B)/\rho(B) & \rho(NE_A \cap B)/\rho(B) \\ \rho(W_A \cap B)/\rho(B) & \rho(O_A \cap B)/\rho(B) & \rho(E_A \cap B)/\rho(B) \\ \rho(SW_A \cap B)/\rho(B) & \rho(S_A \cap B)/\rho(B) & \rho(SE_A \cap B)/\rho(B) \end{pmatrix} \]    (2)
In the metric direction relation matrix, each element is the percentage of the intersection between the target object and the corresponding partition of the reference object, under some metric function: the metric is area for polygons and length for line objects. If the value of one element is greater than all the other elements, the target object is said to have predominance in that direction. Direction predominance can determine direction relations: in practice, given some critical value, if the maximum value of the metric direction relation matrix is greater than the critical value, then it determines the precise direction relation. The critical value usually depends on the application.
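A minimal sketch of the predominance test is given below; the 3x3 layout and the critical-value comparison follow the description above, while the class and method names are ours.

// Sketch: decide a predominant direction from a metric direction relation matrix.
// Cells follow the 3x3 layout {NW, N, NE, W, O, E, SW, S, SE}; each value is the
// fraction of the target object falling into that partition (0..1).
public class DirectionPredominance {
    private static final String[] DIRECTIONS =
        {"NW", "N", "NE", "W", "O", "E", "SW", "S", "SE"};

    // Returns the predominant direction, or null when no cell exceeds the critical value.
    public static String predominantDirection(double[] metricMatrix, double criticalValue) {
        int best = 0;
        for (int i = 1; i < metricMatrix.length; i++)
            if (metricMatrix[i] > metricMatrix[best]) best = i;
        return metricMatrix[best] > criticalValue ? DIRECTIONS[best] : null;
    }

    public static void main(String[] args) {
        // Example: most of the target lies east of the reference object.
        double[] m = {0, 0, 0.1, 0, 0.2, 0.7, 0, 0, 0};
        System.out.println(predominantDirection(m, 0.5));   // prints "E"
    }
}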
3 Predicate Extensions of Direction Relations

3.1 Directional Relation Predicates

In the direction relation matrix, if there are at most three zero elements, all lying in the same row or column other than the second row and second column, and the other elements are non-zero, the relation is called a surround relation. The 13 kinds of surround relation are shown in Fig. 1, where the symbol "#" indicates an undetermined value; one example is the matrix (1,1,1,1,#,1,0,0,0).
Once the surround relations are determined, the same relation can be determined, and then the other relations. In particular, the cardinal directions have the same semantics, as do the diagonal directions (Table 1). In Table 1, the symbol "1*" indicates direction predominance, and "not surround" means that the matrix is not equal to any of the surround relation matrices.

3.2 Spatiotemporal Extension

In spatiotemporal systems, objects change with time, and so does the direction between them. The spatiotemporal direction extension is based on the ∃ and ∀ quantifiers [6].
Table 1. Direction Predicates

- East(A,B): dir = (0,0,0,0,0,1,0,0,0) or (#,#,#,#,#,1*,#,#,#); Dir = (0,0,0,1,0,0,0,0,0) or (#,#,#,1*,#,#,#,#,#); not surround.
- NorthEast(A,B): dir = (0,0,1,0,0,0,0,0,0) or (#,#,1*,#,#,#,#,#,#); Dir = (0,0,0,0,0,0,1,0,0) or (#,#,#,#,#,#,1*,#,#); not surround.
- Same(A,B): dir = (0,0,0,0,1,0,0,0,0) or (#,#,#,#,1*,#,#,#,#); Dir = (0,0,0,0,1,0,0,0,0) or (#,#,#,#,1*,#,#,#,#).
- Surroundedby(A,B): dir = one of the 13 surround matrices.
- Surround(A,B): Dir = one of the 13 surround matrices.
If p is a spatial direction predicate, O1 and O2 are two objects, and T is a temporal object, then the spatiotemporal direction predicates are defined as p(O1, O2, T) := ∃t: p(O1(t), O2(t)) | ∀t: p(O1(t), O2(t)), which means that spatiotemporal direction predicates can be seen as temporal restrictions of spatial ones. According to this semantic extension, spatiotemporal direction predicates can be derived from spatial ones. For example, for the query "find all fire hydrants north of building A during 1990", the SQL may look like:

Select b.name, f.name
From building b, firehydrant f
Where b.name = 'A'
And North(b.Geometry, f.Geometry, DATE "1990") = TRUE;
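The ∃/∀ temporal restriction can be sketched in Java over a sampled set of time instants; the snapshot interface and the sampling of T are assumptions, since the paper does not fix a concrete geometry API.

import java.util.List;
import java.util.function.BiPredicate;

// Sketch of the spatiotemporal extension p(O1, O2, T) := ∃t : p(O1(t), O2(t)) | ∀t : p(O1(t), O2(t)),
// evaluated over a sampled set of time instants. Names are illustrative only.
public class SpatiotemporalPredicates {

    public interface MovingObject<G> {
        G snapshotAt(long t);          // geometry of the object at time t
    }

    // ∃ t in T : p(O1(t), O2(t))
    public static <G> boolean exists(BiPredicate<G, G> p, MovingObject<G> o1,
                                     MovingObject<G> o2, List<Long> instants) {
        for (long t : instants)
            if (p.test(o1.snapshotAt(t), o2.snapshotAt(t))) return true;
        return false;
    }

    // ∀ t in T : p(O1(t), O2(t))
    public static <G> boolean forAll(BiPredicate<G, G> p, MovingObject<G> o1,
                                     MovingObject<G> o2, List<Long> instants) {
        for (long t : instants)
            if (!p.test(o1.snapshotAt(t), o2.snapshotAt(t))) return false;
        return !instants.isEmpty();
    }
}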
4 Experiments

In this section we present several experimental results that justify our argument. The main intention of the experiments is to compare conventional direction queries and predominance direction queries. The experimental data are the province districts of China, and the direction query is "find all provinces and cities that are east of Hubei province and no more than 500 km away during 2003". Using only the direction relation matrix, five areas are returned for the query: {Jiangsu, Anhui, Shanghai, Zhejiang, and Jiangxi}. However, it does not accord with conventional intuition that Jiangxi province is east of Hubei province. Using the direction predominance strategy, the Jiangxi and Jiangsu provinces are not included in the results, which are
reasonable according to conventional intuition. In fact, the predominance matrices are {(#,#,#,#,#,0.4698,#,#,#), (#,#,#,#,#,0.6766,#,#,#), (#,#,#,#,#,1,#,#,#), (#,#,#,#,#,0.6811,#,#,#), (#,#,#,#,#,0.1153,#,#,#)}, and the inverted predominance direction matrix of Hubei and Jiangxi is (#,#,#,0.1559,#,#,#,#,#), which also verifies the previous results. Here the critical value is 0.5. From the results, it is evident that direction predominance and the inverted direction relation matrix can give more reasonable results.
References

1. Cao, H., Chen, J., and Du, D.S. Qualitative Extension Description for Cardinal Directions of Spatial Objects. Acta Geodaetica et Cartographica Sinica (in Chinese), 2001, 30(2): 162-167.
2. A. Frank. Qualitative Spatial Reasoning: Cardinal Directions as an Example. International Journal of Geographical Information Systems, 1996, 10(3): 169-190.
3. D.J. Peuquet, Z. Ci-Xiang. An Algorithm to Determine the Directional Relationship between Arbitrarily-Shaped Polygons in the Plane. Pattern Recognition, 1987, 20(1): 65-74.
4. D. Papadias, Y. Theodoridis, T. Sellis. The Retrieval of Direction Relations using R-trees. Proc. 5th Int. Conf. Database and Expert Systems Applications (DEXA'94), Zürich, Switzerland. LNCS 856, Springer-Verlag, Berlin, Germany, 1994, 173-182.
5. D. Papadias, M. J. Egenhofer, J. Sharma. Hierarchical Reasoning about Direction Relations. ACM GIS'96, Rockville, MD, USA, 1996.
6. B.L. Yi, Y.C. Feng, Y.Y. Wu. An Integrated Spatiotemporal Data Model and Semantic Extension. Journal of Huazhong Univ. of Sci. and Tech. (in Chinese), to appear in 2004.
A Proactive Web Agent for Information Browsing and Extracting

Hongen Lu
Department of Computer Science and Computer Engineering, La Trobe University, Bundoora, Melbourne VIC 3086, AUSTRALIA
[email protected]

Abstract. In this paper, I present a proactive agent for semi-structured information extraction from the World Wide Web. A Hidden Markov Model (HMM) is given to recognize users' intention of browsing actions. A novel approach based on pattern comparison is described for online information extraction.
1 Introduction
Information agents are ubiquitous on the Internet, from Letizia [2] and WebWatcher [1] to SportsFinder [3]. However, most current information agents are passive, that is, they only work when their users input a command or keywords. This greatly interrupts users' browsing activities and daily routine jobs. Information agents are supposed to work autonomously on behalf of users rather than depending on their human users' instructions; being proactive has been a goal for information agents for a long time. In this paper, I present a proactive agent for semi-structured information extraction from the World Wide Web. A Hidden Markov Model (HMM) is given to recognize users' intention from browsing actions. A novel approach based on pattern comparison is described for online information extraction.
2 Browsing Activity Modeling
Definition 1. Browsing Activity: A browsing activity is a tuple <N, D, PN>, where N is the name of the activity, D is the duration for which a user carries out the activity, and PN is the parameter of the activity. For example, when a user goes to the Yahoo! website, follows the link to Yahoo!News, then goes to the Business Front page and browses on that page for 2 minutes with the mouse pointing to the Market Snapshot section, it is not difficult to guess that this user is looking for the DOW and NASDAQ stock market indexes. The above behaviour can be modeled as <StockIndex, 2min, P_StockIndex>, where P_StockIndex = <homepage = Yahoo!Business, MousePosition = MarketSnapshot>.
3 Hidden Markov Model for User Intention Recognition
A hidden Markov model is a quantization of a system's configuration space into a small number of discrete states. A generic HMM can be represented as λ = {Ξ, A, B, π}, where Ξ = {q1, q2, ..., qn} denotes the n possible states, A = {aij} denotes the transition probabilities between the hidden states, B = {bj(.)} denotes the observation symbol probability corresponding to state j, and π denotes the initial state distribution. This representation succeeds to the degree that the system fits the Markov condition: any information about the history of the process needed for future inferences must be reflected in the current state. Given a browsing activity sequence {O1, O2, ..., OT}, where T is the length of the sequence, we then want to find one model from a given intention dictionary {λ1, λ2, ..., λE} which maximizes the likelihood P(O|λ). Let

\[ Z = (p_1^1, p_1^2, \ldots, p_1^L,\; p_2^1, \ldots, p_2^L,\; \ldots,\; p_d^1, \ldots, p_d^L)^{T} \]
Then, given Z from an observation class Ω, the conditional probability P(Z|Ω) is calculated, following [4], as:

\[ P(Z \mid \Omega) = \frac{1}{(2\pi)^{N/2} |Q|^{1/2}} \exp\!\left(-\frac{1}{2}(Z-m)^{T} Q^{-1} (Z-m)\right) \]    (1)
where N = d × L and m is the mean of Z. The intention of the activity, λ_{i*}, is the one with maximum likelihood, obtained by applying Equation 1:

\[ i^{*} = \arg\max_{1 \le i \le E} P(O \mid \lambda_i) \]    (2)
Preliminary experiments show that the above model can recognize a user's intention from a series of his or her activities, even though the information the user is looking for is only a small piece of information among many browsed web pages, as in the above example. This model is more powerful than content-based user intention recognition because the user's browsing behaviour is taken into account, which in most cases carries more meaningful information about the user's intention, especially when the user intends to extract a piece of semi-structured information.
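Equation 2 amounts to scoring the observation sequence against every model in the intention dictionary and keeping the best one. A minimal sketch is shown below; the HMM likelihood computation itself is abstracted behind a function, and the names are ours.

import java.util.List;
import java.util.function.ToDoubleFunction;

// Sketch of intention recognition by maximum likelihood over an intention dictionary
// {λ1, ..., λE}. The evaluation of P(O|λ) is supplied by the caller.
public class IntentionRecognizer<M> {

    // Returns the index i* = arg max_i P(O|λi), or -1 for an empty dictionary.
    public int recognize(List<M> dictionary, ToDoubleFunction<M> likelihoodOfObservation) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < dictionary.size(); i++) {
            double score = likelihoodOfObservation.applyAsDouble(dictionary.get(i));
            if (score > bestScore) { bestScore = score; best = i; }
        }
        return best;
    }
}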
4 Semi-structured Information Extraction
After recognizing a user's browsing intention, an information agent can act on behalf of the user to extract the information from the web pages being browsed. In this project, I concentrate on semi-structured sports results, but the approach is applicable to many domains. Instead of a full natural language understanding method, I use express patterns to recognize and extract the sports scores; this amounts to a partial understanding of the text.
Fig. 1. (a) A Golf Tournament Result Page. (b) The Comparison Results of (a).
4.1 Fuzzy Pattern Comparison
This algorithm calculates the similarity of two patterns. The similarity measurement used here allows for arbitrary-length deletions and insertions; that is to say, the algorithm measures the biggest possible similarity of two patterns under certain allowed mismatches and internal deletions. Let the two express patterns be Ã = a_1 a_2 ··· a_n and B̃ = b_1 b_2 ··· b_m. A similarity s(a, b) is given between pattern elements a and b, and deletions of length k are given weight W_k. To find the high degrees of similarity, a matrix H is set up, where H_ij is the maximum similarity of two segments ending in a_i and b_j, respectively. First set H_{k0} = H_{0l} = 0 for 0 ≤ k ≤ n and 0 ≤ l ≤ m; then, for 1 ≤ i ≤ n and 1 ≤ j ≤ m, the value of H_ij is calculated as follows:

\[ H_{ij} = \max\left\{ H_{i-1,j-1} + s(a_i, b_j),\; \max_{k \ge 1}\{H_{i-k,j} - W_k\},\; \max_{l \ge 1}\{H_{i,j-l} - W_l\},\; 0 \right\} \]

1. If a_i and b_j are associated, the similarity is H_{i-1,j-1} + s(a_i, b_j).
2. If a_i is at the end of a deletion of length k, the similarity is H_{i-k,j} - W_k.
3. If b_j is at the end of a deletion of length l, the similarity is H_{i,j-l} - W_l.
4. Finally, a zero is included to prevent a calculated negative similarity, indicating no similarity up to a_i and b_j.
Definition 2. Possible Similarity: PosSim(Ã, B̃) → [0, 1] is the function that measures the biggest possible similarity of the express patterns Ã and B̃:

\[ PosSim(\tilde{A}, \tilde{B}) = \frac{\max_{0 \le i \le n,\, 0 \le j \le m} \{H_{ij}\}}{\sum_{i=0}^{n} \sum_{j=0}^{m} s(a_i, b_j) - W_{|m-n|}} \]    (3)

If PosSim(Ã, B̃) is greater than a threshold, then Ã and B̃ are considered to be the same express pattern.
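A compact Java sketch of the H-matrix recurrence and the possible-similarity score follows; the element similarity s(a, b) and the linear deletion weight (the experiment in Section 4.2 uses W_k = k/3) are supplied by the caller, and the normalisation term follows Equation 3 as reconstructed above.

// Sketch of the fuzzy pattern comparison. Names are illustrative.
public class FuzzyPatternComparison {

    public interface ElementSimilarity { double s(char a, char b); }

    // Builds the H matrix: H[i][j] is the maximum similarity of segments ending in a_i, b_j.
    public static double[][] similarityMatrix(String a, String b,
                                              ElementSimilarity sim, double wPerDeletion) {
        int n = a.length(), m = b.length();
        double[][] h = new double[n + 1][m + 1];      // H[k][0] = H[0][l] = 0 by default
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double best = Math.max(0, h[i - 1][j - 1] + sim.s(a.charAt(i - 1), b.charAt(j - 1)));
                for (int k = 1; k <= i; k++) best = Math.max(best, h[i - k][j] - wPerDeletion * k);
                for (int l = 1; l <= j; l++) best = Math.max(best, h[i][j - l] - wPerDeletion * l);
                h[i][j] = best;
            }
        }
        return h;
    }

    // PosSim: maximum H entry normalised as in Equation 3.
    public static double posSim(String a, String b, ElementSimilarity sim, double wPerDeletion) {
        double[][] h = similarityMatrix(a, b, sim, wPerDeletion);
        double maxH = 0;
        for (double[] row : h) for (double v : row) maxH = Math.max(maxH, v);
        double denom = -wPerDeletion * Math.abs(a.length() - b.length());
        for (int i = 0; i < a.length(); i++)
            for (int j = 0; j < b.length(); j++) denom += sim.s(a.charAt(i), b.charAt(j));
        return denom > 0 ? maxH / denom : 0;
    }
}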
4.2 Experimental Result
The following is one case. From the records of one user's browsing activities, the user visits a golf web page frequently and each time highlights the result line of a particular player, Chris Smith. The proposed HMM model successfully recognizes his intention as extracting the golf result of that player. The page with the results is shown in Figure 1(a). Based on this recognition, the agent acts proactively on behalf of this user to extract the position of Chris Smith in the ladder. In this experiment, W_k = 1/3 · k. Figure 1(b) shows that the above algorithm performs well on this golf web page: it recognizes most of the patterns as the same, with their possible similarities PosSim(Ã, B̃) equal to 1, while the others fall within the range of difference with values higher than 0.7, which is quite accurate despite the high penalty factor.
5 Conclusion
Information agents are supposed to work autonomously on behalf of human users rather than depending on their instructions; being proactive has been a goal for information agents for a long time. In this paper, I presented a proactive agent for semi-structured information extraction from the World Wide Web. A Hidden Markov Model (HMM) is given to recognize users' intention from browsing actions. This model is more powerful than content-based user intention recognition because the user's browsing behaviour is taken into account, which in most cases carries more meaningful information about the user's intention, especially when the user intends to extract a piece of semi-structured information. Based on the recognition of browsing intention, the agent is able to act on behalf of users to extract the required information proactively. A novel approach based on fuzzy pattern comparison is described for online semi-structured information extraction. This method can measure the biggest possible similarity of patterns under certain allowed mismatches and internal deletions.
References 1. Thorsten Joachims, Dayne Freitag, and Tom Mitchell. Webwatcher: A tour guide for the World Wide Web. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI97), Nagoya, Japan, 1997. Morgan Kaufman Publishers. 2. Henry Lieberman. Letizia: An agent that assists web browsing. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-95), pages 924–929, Montreal, Quebec Canada, August 1995. 3. Hongen Lu, Leon Sterling, and Alex Wyatt. SportsFinder: An information agent to extract sports results from the World Wide Web. In Proceedings of the Fourth International Conference on the Practical Application of Intelligent Agents and MultiAgent Technology (PAAM99), pages 255–265, London, UK, April 1999. 4. Xinding Sun, Ching-Wei Chen, and B. S. Manjunath. Probabilistic motion parameter models for human activity recognition. Proceedings of the 16th International Conference on Pattern Recognition (ICPR 2002), 2002.
Effective Indexing of Web Images with Keyword Positioning Lakshman Jayaratne, Athula Ginige, and Zhuhan Jiang School of Computing and IT, University of Western Sydney, Sydney, Australia {k.jayaratne, a.ginige, z.jiang}@wus.edu.au
Abstract. We propose a new similarity measure based on keyword positioning to search images collected from the Web, for the improvement of precision and recall performance. Our image representation model fully exploits the image semantics from the text associated with the images in the Web pages. A prototype search engine Image Search is implemented with these techniques.
our proposed method in section 2 and conduct the experimental evaluation in section 3. A concluding summary is then made in the final section.
2 Proposed Method

The design of our text engine employs a few direct techniques to associate keywords with images and to retrieve images through the use of these keywords. For this purpose, we parse the Web documents and collect the keywords for the corresponding images. More precisely, the text passes through several processing stages. In the first stage it automatically captures the image's essential semantic content from the page title, meta data, image title, ALT string, and image caption. We then remove all the stop words, which are so common in the language that they no longer provide any particular information about the content of the images, and stemming extracts the roots of the candidate keywords. This model can be seen as a representation of the semantic content of an image because it keeps only a few keywords that are highly related to the image semantically. Since the weight associated with each keyword represents the degree of relevance with which this keyword describes the image's semantic content, we have implemented a voting scheme that determines the weights of the keywords in the page title, meta tags, image title, ALT string, and image caption. Heuristically, we assign weights to these keywords according to their relevance. By allocating higher weights to the keywords in the caption, title, and ALT string of the image, we can discriminate the images from one another, in particular images from a single Web document. For example, since the caption, title, and ALT string of the image are the most important, we fix their weights to 1.0. The page title and meta tags are less important and should not exceed these weights, so we assign them 0.6 and 0.4, respectively.

2.1 Image Representation Model

The semantic representation of an image in a Web page typically consists of lower-level and high-level features, and of semantics from the text associated with the hyperlinks, with the image, and with the Web page as a whole. Extensions to the HTML specification may also allow the use of a context summary of events inside a Web page. An image inside a Web page is typically surrounded by semantic objects such as the image title, image alternate text, image caption, page title, and meta data. During the relevance ranking of Web pages, some search engines may favor pages whose meta-tag keywords match the search terms over pages that match only terms in the document body. Thus, we extract frequently appearing keywords from the meta tags for the image semantics. We exclude the body text (main text), which contains too much unrelated information that is not semantically related to the images in the HTML document, because such information only adds noise to the semantics of the images. Therefore, we use only the five items mentioned at the beginning of this paragraph to represent the image semantic content.
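The voting scheme reduces to a small table of position weights. The sketch below uses the weight values stated above (1.0 / 0.6 / 0.4); the enum and method names are ours, not the prototype's actual code.

import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Sketch of the heuristic keyword-weight assignment: caption, image title and ALT string
// receive weight 1.0, the page title 0.6 and meta tags 0.4; a keyword seen in several
// positions keeps its highest weight.
public class KeywordWeighting {

    public enum Source { IMAGE_CAPTION, IMAGE_TITLE, ALT_STRING, PAGE_TITLE, META_TAG }

    private static final Map<Source, Double> WEIGHT = new EnumMap<>(Source.class);
    static {
        WEIGHT.put(Source.IMAGE_CAPTION, 1.0);
        WEIGHT.put(Source.IMAGE_TITLE, 1.0);
        WEIGHT.put(Source.ALT_STRING, 1.0);
        WEIGHT.put(Source.PAGE_TITLE, 0.6);
        WEIGHT.put(Source.META_TAG, 0.4);
    }

    private final Map<String, Double> keywordWeights = new HashMap<>();

    public void vote(String stemmedKeyword, Source source) {
        keywordWeights.merge(stemmedKeyword, WEIGHT.get(source), Math::max);
    }

    public Map<String, Double> weights() { return keywordWeights; }
}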
2.2 Semantic Similarity Measure

In our implementation, we store the terms extracted from the page title, meta tags, image title, ALT string, and image caption that represent the semantics of each image. To compute the semantic similarity between a query and an image, we propose the following formula:

\[ \text{similarity}_{image,\,query} = \sum_{i} \frac{t_{i,j}.\mathrm{weight}}{\sum_{i'} t_{i',j}.\mathrm{weight}} \times \frac{1}{t_{i,j}.\mathrm{position}} \]
where t_{i,j} represents the ith term in the word list of the jth image that has been matched with a term in the query, and t_{i,j}.position denotes the position associated with the matched term t_{i,j}. We note that the term discrimination value, 1/t_{i,j}.position, in the above formula measures the degree to which the use of a term position helps to distinguish the images to which the term is assigned from the remainder of the collection. For each potential index term used as a content identifier for a collection of images, a discrimination value can then be computed as a function of the position ranking. The greater the difference in term positions, the more dissimilar the images become, and therefore the better the particular term functions as a discriminator. This coefficient has an impact on image ranking during the presentation of the returned images: as the position of the keyword in the list decreases, the value returned by this coefficient diminishes. Images with higher similarity measures are returned to users ahead of images with lower similarity values. Therefore more relevant images are displayed earlier, i.e., ranked higher, and this also results in a much higher proportion of all possible relevant images being recalled. We also note that in our proposed system a significant portion of the total weight of all the keywords associated with an image is contributed by a small number of keywords that are highly relevant to the semantic content of the image.

Fig. 1. Precision versus Recall
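Under the reconstruction of the similarity formula given above, a straightforward implementation sums, over the query terms matched in an image's keyword list, the normalised keyword weight damped by 1/position; the data layout below is an assumption, not the prototype's actual code.

import java.util.List;

// Sketch of the keyword-positioning similarity: for image j, sum over matched terms i of
// (t_ij.weight / Σ_i' t_i'j.weight) * (1 / t_ij.position).
public class PositionalSimilarity {

    public static class Term {
        final String word;
        final double weight;
        final int position;        // 1-based rank of the keyword in the image's word list
        public Term(String word, double weight, int position) {
            this.word = word; this.weight = weight; this.position = position;
        }
    }

    public static double similarity(List<Term> imageTerms, List<String> queryTerms) {
        double totalWeight = imageTerms.stream().mapToDouble(t -> t.weight).sum();
        if (totalWeight == 0) return 0;
        double sim = 0;
        for (Term t : imageTerms)
            if (queryTerms.contains(t.word))
                sim += (t.weight / totalWeight) * (1.0 / t.position);
        return sim;
    }
}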
3 Experimental Evaluation

Our approach rests on the hypothesis that exploiting higher-level image semantics facilitates keyword-based search and improves precision and recall performance on image collections over the Web. To study the effectiveness of the proposed approach, we have implemented a prototype
search engine, Image Search, for the proposed model as an image retrieval system, and a set of experimental evaluations was performed in which our search engine was assessed in terms of precision and recall. For this purpose we currently limit the image collection to the BBC Web site (instead of extracting images from various Web sites via Web search engines), and we collected over 2200 images from over 1000 different URLs. A set of experimental retrievals was performed in which Image Search was tested on 10 queries to validate the performance of our system in terms of precision and recall. These 10 queries Q1-Q10 are, respectively, Q1: Thanks, Q2: Jerusalem, Q3: Taj Mahal, Q4: David Beckham, Q5: World Trade Center, Q6: The Great Wall of China, Q7: Middle East Peace Process, Q8: UN Secretary General Kofi Annan, Q9: Former South African Leader Nelson Mandela, Q10: Manchester United Football Manager Sir Alex Ferguson. Fig. 1 shows the average precision at the 11 standard recall values over the 10 queries. We can clearly see that as the number of images returned by the system increases, more and more relevant images are returned. Unlike some other systems, where increasing the number of returned images may even lead to very low precision, our method proves to be more stable.
4 Conclusion

In this paper, we have presented a new model with a good semantic representation that exploits the vast power of image semantics from the text associated with an image in a Web page. The other novel feature that distinguishes the proposed semantic representation from existing ones is the semantic similarity metric: it introduces a new 'dimension', keyword 'positioning', to measure the semantic similarity of Web images in the model presented here.
Design and Simulation of Multicast Routing Protocol for Mobile Internet

Yuan Zhou¹, Guangsheng Li², Qirong Mao², Yongzhao Zhan², Yibin Hou¹
¹ School of Electron and Information Engineering, Xi'an Jiaotong University, China
² School of Computer Science and Communication Engineering, Jiangsu University, China
Abstract. In the mobile Internet, characteristics such as dynamic topology and power- and bandwidth-constrained links make multicasting extremely challenging. This paper proposes a novel multicast routing mechanism for the mobile Internet environment, which creates and maintains routes by means of a connected broadcast ring graph. Both theoretical analysis and simulation results show that the mechanism can satisfy the requirements of the Internet for multicast applications, with a higher data packet delivery ratio and less control overhead.
1 Introduction

Recently, Internet technologies have been widely applied to meet increasing information requirements, and multicasting plays an important role in using and saving Internet resources. The traditional Internet is generally a network with a fixed infrastructure. However, applications such as disaster recovery, temporary meetings, and automated battlefields need communication over a multihop wireless network. At present, several multicast routing protocols have been proposed, which can be broadly classified into two types. One type is tree-based, such as AMRoute [1] and AMRIS [2], which provide only one route between the source and the receiver. The other is mesh-based, such as ODMRP [3] and CAMP [4], which provide multiple routes between the source and the receiver for data transmission. Lee indicates that mesh-based multicast routing protocols have better performance than tree-based ones [5]. Among these multicast routing protocols, ODMRP has the best overall performance. This paper proposes a novel multicast routing protocol, DBRMP (Dynamic Broadcast Ring Based Multicast Routing Protocol), which creates and maintains routes through a broadcast ring connected graph; simulation results show that it is effective and has low control overhead.
2 Mechanism of DBRMP

When a node broadcasts packets with a limited TTL value in the IP header, the packets are transmitted within a ring whose radius is the initial TTL value. A mobile node in the ring can create a route to the ring-center node under the assumption of bi-directional links. If the hop distance between two ring-center nodes, say N1 and N2, is less than or equal to the sum of the ring radii, the two rings overlap. In this case, there exists an intermediate node in the overlapping zone which has routes to both N1 and N2. Fig. 1 illustrates the route creation when two rings overlap.
Fig. 1. Route creation between two nodes
Fig. 2. DBRMP protocol framework
In DBRMP, each group member node has a dynamic broadcast ring, and these rings form a ring graph G = (V, E), where V is the set of rings and E is the set of pairs of overlapping rings. If G is connected, routes between any two member nodes can be created; in this case, G is called a connected ring graph (CRG). DBRMP designates a core node to maintain the connectivity of the CRG; this is the source node that joined the group earliest, and its ring is called the core ring. The subgraph containing the core ring is called the Core-CRG. In addition, DBRMP assigns each ring a layer number, which is determined by the shortest ring distance to the core ring in the CRG. In the Core-CRG, there are only three kinds of layer-number relationship between a given ring and its neighbor rings: same layer, upper layer, and under layer.
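Layer numbers defined as the shortest ring distance to the core ring can be computed with a plain breadth-first search over the ring graph; the adjacency representation in the sketch below is an assumption, not part of the protocol specification.

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Sketch: assign each ring in the connected ring graph (CRG) its layer number, i.e. the
// shortest ring distance to the core ring, via breadth-first search. The adjacency list
// (ring id -> ids of overlapping rings) is an assumed representation.
public class LayerNumbering {

    public static Map<Integer, Integer> layerNumbers(Map<Integer, List<Integer>> overlaps, int coreRing) {
        Map<Integer, Integer> layer = new HashMap<>();
        Queue<Integer> queue = new ArrayDeque<>();
        layer.put(coreRing, 0);                 // the core ring is layer 0
        queue.add(coreRing);
        while (!queue.isEmpty()) {
            int ring = queue.poll();
            for (int neighbour : overlaps.getOrDefault(ring, List.of())) {
                if (!layer.containsKey(neighbour)) {
                    layer.put(neighbour, layer.get(ring) + 1);
                    queue.add(neighbour);
                }
            }
        }
        return layer;                           // rings missing from the map are not in the Core-CRG
    }
}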
3 DBRMP Protocol Framework

The framework of DBRMP is shown in Fig. 2 and is composed of the following main modules. The data packet handling module sends, forwards, or discards received data packets. When a packet delivered by an upper layer application is received, if a route exists, the packet is sent to the wireless network interface; otherwise, the module drives the route management module to create a route. When the received packet comes from the wireless network interface, the member table and forwarding table are searched to check
whether this node has joined the multicast group or is a member of the forwarding mesh. If not, the packet is discarded; otherwise, it is sent to the wireless network interface or to the upper layer application program. The route management module creates and maintains multicast routes and includes the following sub-modules. The forwarding mesh creation module works as follows: the member nodes broadcast MEMBER_QUERY messages periodically in their own broadcast rings. If a node receives a MEMBER_QUERY message, it is located in the broadcast ring of the member node that sent this message, and if a node receives two or more MEMBER_QUERY messages, it becomes a bridge node. The bridge node sends MEMBER_REPLY messages to these neighbor members to create routes between them. In DBRMP, the UPDATE_MSG message is used to update the CRG. The process of updating the CRG is as follows: (1) the core node builds an UPDATE_MSG packet periodically, in which the local layer number is set to zero and the sequence number increases monotonically, and sends the packet to its neighbor members; (2) when a member node receives an UPDATE_MSG packet, it handles the received packet, updates the sequence number, and then creates its own UPDATE_MSG packet and sends it to its neighbor members. If a member node does not receive an UPDATE_MSG packet within a certain period of time, the ring of that member is considered to have departed from the Core-CRG; in this case, the ring rejoins the Core-CRG through expanding ring searching (ERS). The purpose of the CRG optimization module is to reduce the control overhead: DBRMP decreases the radii of the rings as much as possible, under the precondition of keeping the CRG connected. The method of decreasing rings is as follows: (1) in an even update period of the Core-CRG, all odd-layer rings are decreased while all even-layer rings are fixed, and in an odd update period, all even-layer rings are decreased while all odd-layer rings are fixed, where the parity of the update period is determined by the sequence number in the UPDATE_MSG packet; (2) after decreasing, a ring must still be connected with all of its under-layer rings and with at least one upper-layer ring.
4 Protocol Simulation and Performance Analysis

To evaluate the protocol, we compare the performance of DBRMP and ODMRP in NS2. The following metrics are used: (1) data packet delivery ratio; (2) number of control packets transmitted per data packet delivered. In the experiments, we set the number of receivers to five and the maximum speed to 10 m/sec. Fig. 3 shows that DBRMP is as effective as ODMRP in data delivery ratio, with a difference of less than 3%. However, as shown in Fig. 4, when the number of sources increases, the control overhead in ODMRP increases remarkably, which degrades the scalability of the protocol. On the contrary, the control overhead in DBRMP decreases as the number of sources grows. This is because the sources do not
Fig. 3. Comparison of data delivery ratio
Fig. 4. Comparison of control overhead
need to broadcast the control message throughout the network. In DBRMP, the sources, like the receivers, only need to send the message within their dynamic broadcast rings and maintain connectivity with the Core-CRG. Therefore, the number of control packets sent depends on the number of group members (receivers and sources), not on the number of sources alone. Thus DBRMP improves scalability as the number of sources increases.
5 Conclusion

In this paper, a new multicast routing protocol is proposed, which creates and maintains routes through a connected ring graph. If the ring of a group member separates from the Core-CRG, it rejoins the Core-CRG quickly by means of ERS. In addition, DBRMP can decrease the size of the rings to reduce the control overhead, provided that the ring graph remains connected after the rings are decreased. Thus DBRMP ensures both effectiveness and efficiency.
References

1. J.-J. Xie, R.-R. Talpade, A. McAuley, and M. Liu. AMRoute: Ad Hoc Multicast Routing Protocol. Mobile Networks and Applications, Kluwer, Vol. 7, No. 6 (2002) 429-439
2. C.-W. Wu and Y.-C. Tay. AMRIS: A Multicast Protocol for Ad Hoc Wireless Networks. IEEE MILCOM 1999 (1999) 25-29
3. S.-J. Lee, M. Gerla, and C.-C. Chiang. On-Demand Multicast Routing Protocol in Multihop Wireless Mobile Networks. Mobile Networks and Applications, Kluwer, Vol. 7, No. 6 (2002) 441-453
4. J.J. Garcia-Luna-Aceves and E.L. Madruga. The Core-Assisted Mesh Protocol. IEEE Journal on Selected Areas in Communications, Vol. 17, No. 8 (1999) 1380-1394
5. S. Lee, W. Su, J. Hsu, M. Gerla, and R. Bagrodia. A Performance Comparison Study of Ad Hoc Wireless Multicast Protocols. IEEE INFOCOM 2000 (2000) 565-574
Java-Based Single Sign-On Library Supporting SAML (Security Assertion Markup Language) for Distributed Web Services

Jongil Jeong¹, Dongkyoo Shin¹*, Dongil Shin¹, and Kiyoung Moon²
¹ Department of Computer Science and Engineering, Sejong University, 98 Kunja-Dong, Kwangjin-Ku, Seoul 143-747, Korea {jijeong, shindk, dshin}@gce.sejong.ac.kr
² Electronics and Telecommunications Research Institute, 161 Kajong-Dong, Yusong-Gu, Taejon 305-350, Korea [email protected]
Abstract. SAML is an XML based single sign-on (SSO) standard which enables the exchange of authentication, authorization, and profile information between different entities. This provides interoperability between different security services in the distributed environments. In this paper, we designed and implemented a Java-Based SSO library made up of SAML APIs, and then constructed a prototype of distributed Web applications to validate these SAML APIs.
1 Introduction

Single sign-on (SSO) is a security feature which allows a user to log into many different Web services offered by distributed systems while only needing to authenticate once, or at least always in the same way [1]. Various SSO solutions have been proposed that depend on public key infrastructure (PKI), Kerberos, or password stores, which require additional infrastructure on the client's side and new administrative steps [2]. Recently a new standard for the exchange of security-related information in XML, called Security Assertion Markup Language (SAML), has been recommended by the Organization for the Advancement of Structured Information Standards (OASIS). SAML enables the exchange of authentication, authorization, and profile information between different entities to provide interoperability between different security services in distributed environments such as the Web Services proposed by the W3C. In this paper, we designed and implemented a Java-based SSO library made up of SAML Application Programming Interfaces (APIs), and then constructed a prototype of distributed applications to validate these SAML APIs.
2 Background

The basic idea of single sign-on (SSO) is to shift the complexity of the security architecture to the SSO service and release the other parts of the system from certain security obligations. For SSO implementation, token-based protocols such as cookies or SAML are used [3]: an authentication token is sent over an independently established secure channel. The main advantage of token-based protocols is that a majority of service providers already have SSL [4] server certificates and a suitable cryptographic implementation is available on all client machines via the browsers. Artifacts, which play the role of tokens, are created within a security domain and sent to other security domains for user authentication. Artifacts sent to the other domains are returned to the original security domain and removed after user authentication. The artifact identifier is attached to the Uniform Resource Locator (URL), and the message is redirected to the destination [5].

2.1 SAML (Security Assertion Markup Language)

Recently, OASIS completed SAML, a standard for exchanging authentication and authorization information between domains. SAML is designed to offer single sign-on for both automatic and manual interactions between systems. SAML enables the exchange of authentication and authorization information about users, devices, or any identifiable entity, called subjects. Using a subset of XML, SAML defines the request-response protocol by which systems accept or reject subjects based on assertions [5]. An assertion is a declaration of a certain fact about a subject. SAML defines three types of assertions:
- Authentication: indicating that a subject was authenticated previously by some means (such as a password, hardware token, or X.509 public key).
- Authorization: indicating that a subject should be granted or denied resource access.
- Attribution: indicating that the subject is associated with attributes.
SAML does not specify how much confidence should be placed in an assertion. Local systems decide whether the security levels and policies of a given application are sufficient to protect an organization if damage results from an authorization decision based on an inaccurate assertion. This characteristic of SAML is likely to spur trust relationships and operational agreements among Web-based businesses in which each agrees to adhere to a baseline level of verification before accepting an assertion. SAML can be bound to multiple communication and transport protocols; in particular, it can be carried by the Simple Object Access Protocol (SOAP) over HTTP [5].
3 Design and Implementation of Java-Based SAML APIs

We designed three basic packages, named the assertion, protocol, and messaging packages. To support the messaging function, we also designed generator, utilities, and
security packages. The implemented SAML APIs are grouped into Java packages, as shown in Figure 1. The classification of packages is based on the specification “Assertions and Protocol for the OASIS Security Assertion Markup Language (SAML)” [6].
Fig. 1. Java Packages of SAML APIs
Fig. 2. SAML Request Message Signed in Enveloped Form
Fig. 3. Generation of SAML Response Message
The function of each package is as follows.
- Assertion package: dealing with authentication, authorization, and attribution information.
- Protocol package: dealing with SAML request/response message pairs to process assertions.
- Messaging package: including the messaging frameworks which transmit assertions.
- Security package: applying digital signature and encryption to the assertions.
- Utilities package: generating UUIDs, the UTC date format, artifacts, and so on.
- Generator package: generating SAML request/response messages.
We verified the developed SAML APIs by constructing demonstrations and checked the generated messages against the SAML specifications.
When we generate SAML request messages as shown in Figure 2, we use the RequestGenerator class in the generator package and the Signature class in the security.sign package. The signature process of the Signature class follows the XML-Signature standard in the enveloped form. Figure 3 shows the generation of SAML response messages, in which the ResponseGenerator class in the generator package is used. This SAML response message is also signed using the Signature class in the security.sign package.
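The paper names the RequestGenerator and Signature classes but not their method signatures, so the following is only a hypothetical usage sketch of how such classes might be combined to build and sign a request; the stand-in interfaces, method names, and parameters are assumptions.

import java.security.PrivateKey;

// Hypothetical usage sketch: minimal stand-in interfaces are declared here for illustration,
// since the library's real signatures are not given in the paper.
public class SsoRequestDemo {

    interface RequestGenerator {                 // stand-in for the generator package class
        String createAuthenticationRequest(String subject, String issuer);
    }

    interface EnvelopedSigner {                  // stand-in for security.sign.Signature
        String signEnveloped(String samlXml, PrivateKey key);
    }

    static String buildSignedRequest(RequestGenerator generator, EnvelopedSigner signer, PrivateKey key) {
        // 1. Build a SAML request carrying an authentication query for the subject.
        String samlRequest = generator.createAuthenticationRequest("jdoe", "urn:example:sp");
        // 2. Apply an enveloped XML-Signature, as described for Figure 2.
        return signer.signEnveloped(samlRequest, key);
    }
}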
4 Conclusion

We designed and implemented an SSO library supporting the SAML standard. The implemented SAML APIs have the following features.
- Since SAML messages are transmitted through SOAP, the XML-based message structures are fully preserved, which enables valid bindings.
- Integrity and non-repudiation are guaranteed by signing the transmitted messages.
- Confidentiality is guaranteed by encrypting the transmitted messages. Since XML encryption is applied, each element can be encrypted efficiently.
Even though digitally signing a SAML message with RSA is the default and using XML Signature is optional, we fully implemented both APIs in the security package. Specific encryption methods for SAML messaging are not mentioned in the SAML specification; we implemented APIs for XML encryption, since XML encryption is a suitable candidate for the encryption of SAML messages.
References

1. Parker, T.A.: Single Sign-On Systems - The Technologies and the Products. European Convention on Security and Detection, 16-18 May (1995) 151-155
2. Pfitzmann, B.: Privacy in Enterprise Identity Federation - Policies for Liberty Single Signon. 3rd Workshop on Privacy Enhancing Technologies (PET 2003), Dresden, March (2003)
3. Pfitzmann, B., Waidner, B.: Token-Based Web Single Signon with Enabled Clients. IBM Research Report RZ 3458 (#93844), November (2002)
4. Frier, A., Karlton, P., and Kocher, P.: The SSL 3.0 Protocol. Netscape Communications Corporation, Nov 18 (1996)
5. Bindings and Profiles for the OASIS Security Assertion Markup Language (SAML) V1.1: http://www.oasis-open.org/committees/security/
6. Assertions and Protocol for the OASIS Security Assertion Markup Language (SAML) V1.0: http://www.oasis-open.org/committees/security/
Novel SVM Performance Estimators for Information Retrieval Systems

Shui Yu, Hui Song, and FanYuan Ma
Department of Computer Science and Technology, Shanghai Jiaotong University, Shanghai, P. R. China, 200030
{merlin, songhui_17, fyma}@sjtu.edu.cn
Abstract. SVMs (support vector machines) have met with significant success in the information retrieval field, especially in handling text classification tasks. Although various performance estimators for SVMs have been proposed, they only focus on accuracy based on the LOO (leave-one-out) cross-validation procedure. The information-retrieval-related performance measures are largely neglected in kernel learning methodology. In this paper, we propose a set of information-retrieval-oriented performance estimators for SVMs, which are based on the span bound of the LOO procedure. Experiments show that our proposed estimators are both effective and stable.
1 Introduction
KMs (kernel-based learning methods) are a state-of-the-art class of learning algorithms, whose best-known example is SVMs. The SVM method was introduced to ATC (automated text categorization) by Joachims [1][2] and has subsequently been used extensively by many other researchers in the information retrieval community. An interesting topic concerning SVMs is efficient performance estimation. Recently, several performance estimators for SVMs have been proposed, including the Xi-Alpha estimator [3], the GACV estimator [4], the Span estimator [5], and the Radius-Margin estimator [6]. However, all these estimators are accuracy-oriented, that is, the estimating process focuses on the accuracy of the classifier. For IR (information retrieval) tasks, however, accuracy is not the only performance measure; people often care more about other measures when constructing IR systems. So far, there have been few attempts to estimate such IR performance measures.
PRBEP requires iterative computation based on tuning multiple confidence thresholds. In practice, to obtain a single performance measure, the harmonic mean of precision and recall is commonly used; it is called the Fβ-measure, and the most common setting gives equal weight to precision and recall, widely known in information retrieval as the F1 measure. While most research on SVMs in information retrieval focuses on error rate or accuracy, little attention has been paid to IR-related measures. In this paper, we devise novel performance estimators for SVMs that are based on precision, recall, and Fβ.
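For reference, the Fβ-measure combines precision P and recall R as follows (a standard definition, supplied here for completeness rather than quoted from the text):

$$F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}, \qquad F_1 = \frac{2\,P\,R}{P + R}.$$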
3 The LOO Procedure and the Span Bound
Lunts and Brailovskiy [7] proved that the LOO estimator is almost unbiased. The LOO procedure is usually used to estimate the probability of test error of a learning algorithm, and researchers have recently proposed various theoretical error bounds for SVMs based on the LOO procedure over the training set [3,4,5,6]. However, IR performance estimators such as precision, recall, and the Fβ measure can also be derived from the LOO procedure. These measures are seldom exploited in the machine learning community, but they play a very important role in information retrieval. The span bound is a relatively new bound on the generalization ability of SVMs and has been proved tighter than some other bounds [5]. The span concept is elegant for performance estimation and, unlike the regular LOO procedure, for SVMs one needs to conduct the LOO procedure only for the support vectors: non-support vectors will be recognized correctly, since removing a point that is not a support vector does not change the decision function. Thus, the span bound can be used to construct specialized performance estimators for information retrieval systems.
4 Accurate IR Performance Estimators for SVMs
Based on V. Vapnik and O. Chapelle's original work on the span bound [5], we can develop accurate performance estimators for information retrieval systems, such as precision, recall, and the Fβ measure. Denote:
$$d_{-+} = \mathrm{Card}\{\, p : (y_p = 1) \wedge (\alpha_p^0 S_p^0 \geq f^0(\vec{X}_p)) \wedge (\vec{X}_p \in \mathrm{SVs}) \,\} \qquad (1)$$

$$d_{+-} = \mathrm{Card}\{\, p : (y_p = -1) \wedge (\alpha_p^0 S_p^0 \geq -f^0(\vec{X}_p)) \wedge (\vec{X}_p \in \mathrm{SVs}) \,\} \qquad (2)$$

$$l_+ = \mathrm{Card}\{\, p : y_p = 1 \,\} \qquad (3)$$
One can then easily derive the following span estimators for recall, precision, and the F1 measure:
$$\mathrm{Recall}^{\,l}_{\mathrm{span}}(h) = 1 - \frac{d_{-+}}{l_+} \qquad (4)$$

$$\mathrm{Precision}^{\,l}_{\mathrm{span}}(h) = \frac{l_+ - d_{-+}}{l_+ - d_{-+} + d_{+-}} \qquad (5)$$

$$F1^{\,l}_{\mathrm{span}}(h) = \frac{2l_+ - 2d_{-+}}{2l_+ - d_{-+} + d_{+-}} \qquad (6)$$
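Once the counts d_{-+}, d_{+-}, and l_+ of equations (1)-(3) are obtained from the span-based LOO check, the estimators (4)-(6) reduce to simple arithmetic. The following sketch is our own illustration (not the authors' Java implementation); the span-condition counting is assumed to be done elsewhere:

```python
def span_ir_estimators(d_neg_pos, d_pos_neg, l_pos):
    """Span-based LOO estimates of recall, precision and F1.

    d_neg_pos -- count from eq. (1): positive support vectors flagged as LOO errors
    d_pos_neg -- count from eq. (2): negative support vectors flagged as LOO errors
    l_pos     -- count from eq. (3): number of positive training examples
    """
    recall = 1.0 - d_neg_pos / l_pos                                        # eq. (4)
    precision = (l_pos - d_neg_pos) / (l_pos - d_neg_pos + d_pos_neg)       # eq. (5)
    f1 = (2 * l_pos - 2 * d_neg_pos) / (2 * l_pos - d_neg_pos + d_pos_neg)  # eq. (6)
    return recall, precision, f1
```

For example, with l_+ = 100, d_{-+} = 10, and d_{+-} = 5, the estimates are recall 0.90, precision ≈ 0.947, and F1 ≈ 0.923.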
5 Experimental Results
We implemented SVMs in Java on the Windows 2000 Pro platform, together with the proposed performance estimators. The empirical evaluation was done on two test collections. The first is the Reuters-21578 dataset; of the 135 potential topic categories, only the 10 most frequent are used. The second is the well-known WebKB collection of HTML pages; following the setup of Nigam [8], only the classes course, faculty, project, and student are used, with no preprocessing steps. To explore the estimators, we investigate the macro-average F1 value under different SVM parameter settings (Fig. 1).
Fig. 1. Span estimator versus test result: macro-average F1 value over the SVM parameter C (left: the Reuters dataset; right: the WebKB collection)
The results indicate that the proposed performance estimators are not sensitive to the hyperparameters of the support vector machines: the span estimator fits the test-result curve well across a wide range of SVM parameter settings.
6 Conclusion
We have proposed a series of accurate performance estimators for information retrieval. Based on the original span LOO bound for SVMs, we extended performance estimation to recall, precision, and the F1 measure individually, which are meaningful for information retrieval and especially for text classification tasks. Experimental results on two well-established datasets (the Reuters dataset and the WebKB collection) demonstrate the effectiveness of the proposed performance estimators.
Acknowledgements. This work was supported by the Science & Technology Committee of Shanghai Municipality Key Project Grant 02DJ14045.
References
1. T. Joachims. Text categorization with support vector machines: learning with many relevant features. Proceedings of ECML-98, 10th European Conference on Machine Learning, Berlin, 1998, 137-142.
2. T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of ICML-99, 16th International Conference on Machine Learning, Bled, Slovenia, 1999, 200-209.
3. T. Joachims. Estimating the generalization performance of an SVM efficiently. Proceedings of the International Conference on Machine Learning, San Mateo, CA, 2000, 431-438.
4. G. Wahba, Y. Lin, and H. Zhang. GACV for support vector machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.), Advances in Large Margin Classifiers, Cambridge, MA: MIT Press, 1999, 297-311.
5. V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.), Advances in Large Margin Classifiers, Cambridge, MA: MIT Press, 1999, 261-280.
6. O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46, 2002, 131-159.
7. A. Lunts and V. Brailovskiy. Evaluation of attributes obtained in statistical decision rules. Engineering Cybernetics, 3, 1967, 98-109.
8. K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to Classify Text from Labeled and Unlabeled Documents. Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), Madison, Wisconsin, 1998, 792-799.
Capturing Semantic Hierarchies to Perform Meaningful Integration in HTML Tables
Shijun Li1, Mengchi Liu2, Guoren Wang3, and Zhiyong Peng4
1 School of Computer, Wuhan University, Wuhan, China 430072, [email protected]
2 School of Computer Science, Carleton University, 1125 Colonel By Drive, Ottawa, ON, Canada K1S 5B6, [email protected]
3 Department of Computer Science and Engineering, Northeastern University, Shenyang, China 110004, [email protected]
4 State Key Lab of Software Engineering, Wuhan University, Wuhan, China 430072, [email protected]
Abstract. We present a new approach that automatically captures the semantic hierarchies in HTML tables and semi-automatically integrates HTML tables belonging to a domain. It first automatically captures the attribute-value pairs in HTML tables by normalizing the tables and recognizing their headings. After the global schema is generated manually, it learns lexical semantic sets and contexts, by which it then eliminates conflicts and solves the nondeterministic problems in mapping each source schema to the global schema in order to integrate the data in the HTML tables.
1 Introduction
As a large amount of the information available on the Web is in the form of HTML tables, capturing the information in HTML tables semantically and integrating the relevant information is a challenge. Since the cells in HTML tables may span multiple rows and columns, HTML tables may have nested headings, and it is not trivial to capture their semantic hierarchies automatically. Since the rowspan and colspan attributes specify the number of rows or columns spanned by a cell, we can normalize an HTML table by inserting redundant cells so that each row has an equal number of aligned cells, and then capture the attribute-value pairs according to the headings and their corresponding data cells. For HTML tables without headings marked via th elements, we introduce the notion of an eigenvalue of formatting information to recognize their headings. Integrating HTML tables is difficult because there may be structural, semantic, and naming conflicts. By defining what data to integrate, we can produce a list of global concepts and their hierarchies, which form the global view or global schema. Using the attribute name set of each source table that corresponds to the same global concept, we can eliminate most of the conflicts. However,
if an attribute appears in multiple lexical semantic sets, a nondeterministic problem arises. We solve such problems by comparing the context of the attribute in the source with the context of each global concept involved in the conflict to determine which global concept it corresponds to. The rest of the paper is organized as follows. Section 2 introduces the automatic approach that captures semantic hierarchies in HTML tables. Section 3 presents a semi-automatic approach that integrates HTML tables. Section 4 compares our work with related work. We conclude in Section 5.
2 Capturing Semantic Hierarchy in HTML Tables
HTML tables, marked by the table element, consist of a number of rows, and a row consists of a number of cells. Table cells generally contain heading information via the th element and data via the td element. If a th or td element contains colspan = n or rowspan = n, the cell is expanded into n−1 further columns starting from the current cell in the current row, or into the next n−1 rows in the current column, respectively. Using the rowspan and colspan attributes, we can normalize HTML tables into the form of Table 1. Based on the normalized table, we introduce a mapping rule that maps the attribute-value pairs to the corresponding XML documents, and a merging rule that merges the content of identical XML elements.

Table 1. The normalized HTML table
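Before turning to the mapping rule, the normalization step just described can be illustrated with a minimal sketch. This is our own illustration, not the authors' implementation; the (text, rowspan, colspan) cell representation is an assumption.

```python
def normalize(rows):
    """Expand rowspan/colspan so every row has the same number of aligned cells."""
    grid = {}                                    # (row, col) -> cell text
    n_cols = 0
    for i, row in enumerate(rows):
        col = 0
        for text, rowspan, colspan in row:
            while (i, col) in grid:              # skip slots already filled by
                col += 1                         # spans coming from earlier rows
            for di in range(rowspan):            # copy the cell text into every
                for dj in range(colspan):        # slot that the cell spans
                    grid[(i + di, col + dj)] = text
            col += colspan
        n_cols = max(n_cols, col)
    n_rows = max(r for r, _ in grid) + 1
    return [[grid.get((i, j), "") for j in range(n_cols)] for i in range(n_rows)]

# toy example: a heading cell spanning two columns
table = [
    [("Sales", 1, 2), ("Units", 1, 1)],
    [("2003", 1, 1), ("2004", 1, 1), ("300", 1, 1)],
]
for row in normalize(table):
    print(row)
# ['Sales', 'Sales', 'Units']
# ['2003', '2004', '300']
```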
Rule 1. Let the normalized table of an HTML table T be Table 1, the content of its caption element be c, and the rows that contain data be r_{k+1}, r_{k+2}, ..., r_m, where 0 ≤ k ≤ m. Then

ψ(T) = <c> ψ(r_{k+1}) ⊕ ψ(r_{k+2}) ⊕ ... ⊕ ψ(r_m) </c>,

where ⊕ is the operation concatenating two character strings, ψ(T) is the function converting the HTML table document to its corresponding XML document, and

ψ(r_i) = <v_{i,1}> ... <v_{i,p}>
           <h_{1,p+1}> ... <h_{k,p+1}> ψ(d_{i,p+1}) </h_{k,p+1}> ... </h_{1,p+1}>
           ...
           <h_{1,n}> ... <h_{k,n}> ψ(d_{i,n}) </h_{k,n}> ... </h_{1,n}>
         </v_{i,p}> ... </v_{i,1}>,

where i = k+1, k+2, ..., m. The merging rule is as follows:

<t1> <t2> s2 </t2> </t1> <t1> <t3> s3 </t3> </t1> = <t1> <t2> s2 </t2> <t3> s3 </t3> </t1>
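As an illustration of the mapping ψ in Rule 1, the following sketch is our own; the list-of-lists representation of a normalized table with k heading rows and p heading columns is an assumption. It nests each data cell inside its column headings and each data row inside its row-heading values:

```python
def psi_row(row, headings, k, p):
    """Convert one data row into nested XML following Rule 1."""
    out = "".join(f"<{v}>" for v in row[:p])                # open v_{i,1} ... v_{i,p}
    for j in range(p, len(row)):
        col_heads = [headings[r][j] for r in range(k)]      # h_{1,j} ... h_{k,j}
        out += "".join(f"<{h}>" for h in col_heads)
        out += str(row[j])                                  # psi(d_{i,j})
        out += "".join(f"</{h}>" for h in reversed(col_heads))
    out += "".join(f"</{v}>" for v in reversed(row[:p]))    # close row headings
    return out

def psi_table(caption, headings, data_rows, k, p):
    body = "".join(psi_row(r, headings, k, p) for r in data_rows)
    return f"<{caption}>{body}</{caption}>"

# toy table: one heading row (k=1) and one heading column (p=1)
headings = [["City", "Sales", "Units"]]
data = [["Beijing", "1200", "300"], ["Shanghai", "1500", "340"]]
print(psi_table("Report", headings, data, k=1, p=1))
# prints (wrapped here for readability):
# <Report><Beijing><Sales>1200</Sales><Units>300</Units></Beijing>
# <Shanghai><Sales>1500</Sales><Units>340</Units></Shanghai></Report>
```

The merging rule then collapses elements that share the same tag, as in the <t1> example above.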
Rule 1 covers not only two-dimensional tables but also one-dimensional tables. For HTML tables without marked headings, authors generally use formatting information to mark the headings so that users can easily recognize them visually. This suggests introducing the notion of an eigenvalue of formatting information to recognize the headings of such HTML tables.
Definition 1. In an HTML table, let the initial eigenvalue of each cell be 0 and the font size of the first cell be k. We define each cell's eigenvalue λ as follows: if the cell's font size is n, then λ = λ + (n − k) * 10; if the cell has a bold font, then λ = λ + 10; if the face of the cell is italic, then λ = λ + 1. The eigenvalue of a row or column is the average eigenvalue of all the cells in that row or column.
Based on the fact that the heading part generally has a larger eigenvalue than the data part of an HTML table, we can recognize the headings of HTML tables without marked headings by computing the differences of the eigenvalues of adjacent rows or columns.
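A minimal sketch of Definition 1 and the heading-recognition heuristic follows. It is our own illustration; the (font_size, bold, italic) cell representation and the "largest drop between adjacent rows" rule are assumptions about how the adjacent-row differences might be used.

```python
def cell_eigenvalue(font_size, bold, italic, base_size):
    """Eigenvalue of a single cell as in Definition 1."""
    lam = (font_size - base_size) * 10
    if bold:
        lam += 10
    if italic:
        lam += 1
    return lam

def row_eigenvalue(row, base_size):
    return sum(cell_eigenvalue(*c, base_size) for c in row) / len(row)

def heading_rows(table):
    """Number of leading rows recognized as headings (largest eigenvalue drop)."""
    base = table[0][0][0]                        # font size of the first cell
    evs = [row_eigenvalue(r, base) for r in table]
    diffs = [evs[i] - evs[i + 1] for i in range(len(evs) - 1)]
    return diffs.index(max(diffs)) + 1

# toy table: the first row is bold and larger, so it is recognized as the heading
table = [
    [(14, True, False), (14, True, False), (14, True, False)],
    [(12, False, False), (12, False, False), (12, False, False)],
    [(12, False, False), (12, False, False), (12, False, False)],
]
print(heading_rows(table))   # -> 1
```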
3 Integrating HTML Tables Belonging to a Domain
Integrating HTML tables belonging to a domain of interest is difficult because we must handle naming, structural, and semantic conflicts. The first step common to any schema integration methodology is defining which data sources and which data are to be integrated in the global schema. This step produces a list of global concepts and the hierarchies of these global concepts, which form the global schema or global view; it is performed manually at a high level of abstraction to decide what data are stored in the global schema. A global schema can be expressed as a tree, and we use an XML DTD to express it. To eliminate conflicts, we introduce the notion of an LSS (lexical semantic set).
Definition 2. Let the set of source tables to be integrated be Ω. The LSS of a global concept G, denoted S_G, is defined to be S_G = {T(G) | T ∈ Ω}, where T(G) denotes the attribute in T that corresponds to G.
Since it is impossible or difficult to access all the source tables to be integrated, we use example tables to learn the LSSs. Using LSSs, we can eliminate most of the conflicts. However, if an attribute of a source table belongs to more than one LSS, we cannot determine which global concept it corresponds to; we call these nondeterministic problems. To solve them, we introduce two notions: the conflict set and the context.
Definition 3. Let S_G be the LSS of a global concept G. For a source attribute A, if there exist at least two global concepts G1 and G2 such that A ∈ S_{G1} and A ∈ S_{G2} hold, then we say that attribute A has a conflict set, denoted L_A, which is defined as L_A = {G | G is a global concept and A ∈ S_G}.
We can use the LSSs learned from example tables to determine whether a source attribute has a conflict set and to compute it if it does. To solve nondeterministic problems, we introduce the notion of the context as follows:
Definition 4. Let G be a global concept, its parent element be F, and, in the global schema XML DTD, let F be defined as:
(3) Map Measures to Concrete Roles
In DLs, concrete roles are roles pointing to concrete datatypes rather than to other concepts. Provided that the cells of a cube are mapped to set concepts, the measures of a cube can be mapped to concrete roles that belong to the set concepts. Both base and computed measures in a cube can be mapped to unary concrete roles in DLs, denoted T1, …, Tn. Constraints on sets of measures can be mapped to n-ary datatype predicates and qualified number restrictions on sets of unary concrete roles, denoted [∃ | ∀ | ≥ n | ≤ n] T1, …, Tn.Pn, where Pn is a predicate name. Here we only give the semantics of ∃T1, …, Tn.Pn; the others can be given in a similar way:

$$(\exists T_1,\ldots,T_n.P_n)^{\mathcal{I}} = \{\, cs \in (\mathrm{CUBES})^{\mathcal{I}} \mid cs = cs_1 \bowtie \ldots \bowtie cs_n,\; cs_i \in (Cs_i)^{\mathcal{I}},\; Cs_i \in C_{set},\; \exists V^k \subseteq (\Delta_D)^k,\; \exists \langle v_1,\ldots,v_n \rangle \in V^k : \langle cs_i, v_i \rangle \in T_i^{\mathcal{I}},\; \langle v_1,\ldots,v_n \rangle \in P_n^{D} \,\}$$
Example 2: Retrieve the sub-cube of SALES whose sales_in_dollar is greater than $1,000 and whose Units is smaller than 300. This task can be expressed as SALES ⊓ ∃sales_in_dollar.>$1,000 ⊓ ∃Units.<300, where '>$1,000' and '<300' are datatype predicate names. Here, 'O: X' means that X is a reference to object O.
(4) Map Cube to Abstract Cube Concept
We first present the relationships among cube, record, and set. The map g: Cells → V indicates that a cube is a set FACTS = { ⟨c, v⟩ | c ∈ Cells, v ∈ V }. Each fact in FACTS is a record that has two members: a cell and an assignment. Based on this observation, a cube can be viewed as a set of records. To represent this inner structure of a cube, we need to introduce some additional notions in the TBox. This is accomplished by introducing into DLs a concept AbstractCube to represent cubes, and two concepts RecType and SetType to represent the internal structure of the cube model (Calvanese has proposed the concepts of RecType and SetType in [5]). The TBox also contains two additional predefined roles: value and member. The concepts of set, record, and cube can be presented in DLs in the following way:
Set (Set-of T) is presented as SetType ⊓ ∀member.T;
Record (Record A1:T1, …, Ak:Tk End) is presented as RecType ⊓ (∃A1.T1 ⊓ ≤1 A1) ⊓ … ⊓ (∃Ak.Tk ⊓ ≤1 Ak);
Cube (Cube CubeName Set-of T) is presented as CubeName ≡ AbstractCube ⊓ ∃value.(SetType ⊓ ∀member.RecType).
With the additional concepts AbstractCube, RecType, and SetType, the complex structure of a cube can be represented properly in DLs.
3 Further Discussions
It is possible to view the web as a cube by establishing a high-level universal logical layer on top of the current web. SHOQ(DnC), a powerful dialect of description logics, acts as the backbone of this prospective logical layer. The whole web, if encoded in SHOQ(DnC), turns into an enormous unified knowledge base, providing an infinite world for intelligent agents to run in. We argue that the research in this paper is valuable not only because it establishes a logical foundation for the vision of W-VCube, but also because it raises the application of the Semantic Web to a higher level and injects fresh power into the evolution of the current web and of information systems. First, the reasoning algorithm for SHOQ(DnC) should be studied in detail. The tableau algorithm for SHOQ(Dn) has already been studied [4], so only the complexity of the new features, such as set concepts, needs to be investigated. Second, the operational model of cubes should be mapped to operations on DLs. In this paper we focus on the structural mapping from cubes to DLs, without discussing the operational mapping; with the cube operational model mapped to DLs, operations on cubes could be translated into operations on DLs. Research on DL querying is still relatively weak, and our main future work is to develop practical methods for querying SHOQ(DnC).
References
1. Sheila A. McIlraith, Tran Cao Son. Semantic Web Services. IEEE Intelligent Systems, Vol. 16, No. 2, 2001, 46-53.
2. DAML. DAML Web Site. http://www.daml.org/services, 2002.
3. S. McIlraith, T.C. Son. Adapting Golog for Programming the Semantic Web. Proc. 5th Symposium on Logical Formalizations of Commonsense Reasoning, 2001, 195-2002.
4. Jeff Z. Pan and Ian Horrocks. Reasoning in the SHOQ(Dn) Description Logic. In Proc. of the 2002 Int. Workshop on Description Logics (DL-2002), Apr. 2002.
5. D. Calvanese, G. De Giacomo, and M. Lenzerini. Representing and reasoning on XML documents: A description logic approach. J. Logic and Computation, 9(3):295-318, 1999.
Fuzzy K-means Clustering on a High Dimensional Semantic Space
Guihong Cao1, Dawei Song2, and Peter Bruza2
1 Department of Computer Science and Engineering, Tianjin University, Tianjin 300072, China, [email protected]
2 CRC for Enterprise Distributed Systems Technology (DSTC), Level 7, G.P. South, The University of Queensland, QLD 4072, Australia, {dsong, bruza}@dstc.edu.au
Abstract. One way of representing semantics is via a high-dimensional conceptual space constructed from lexical co-occurrence. Concepts (words) are represented as vectors whose dimensions are other words. As words are represented as dimensional objects, clustering techniques can be applied to compute word clusters. Conventional clustering algorithms such as the K-means method, however, normally produce crisp clusters, i.e., an object is assigned to only one cluster, which is sometimes not desirable. Therefore, a fuzzy membership function, which models the degree to which an object belongs to a cluster, can be applied to K-means clustering. This paper introduces a fuzzy K-means clustering algorithm and shows how it is used for word clustering on the high-dimensional semantic space constructed by a cognitively motivated semantic space model, namely the Hyperspace Analogue to Language. A case study demonstrates that the method is promising.
subjects [6]. Another advantage of the HAL approach is that it is automatic and computationally tractable. The HAL model correlates with Gärdenfors' conceptual space theory [2], where information is represented geometrically in terms of a dimensional space. A domain is defined as a set of integral dimensions in the sense that a value in one dimension determines or affects the values in other dimensions. Gärdenfors and Williams (2001) state that "the ability to bundle up integral dimensions as a domain is an important part of the conceptual spaces framework" [3]. The thrust of Gärdenfors' proposal is that concepts are represented geometrically as points in a space of dimensions, and domains of integral dimensions are categorized as regions in the space. Region categorization in a semantic space can be considered as clustering the concepts (words) in that space. Clustering techniques divide a collection of data into groups based on the similarity of objects. One well-known clustering algorithm is the K-means method, which takes a desired number of clusters, K, as an input parameter and outputs a partitioning of a set of dimensional objects into K clusters. Conventional clustering algorithms, however, normally produce crisp clusters, i.e., one object can only be assigned to one cluster, which is not always desirable. A fuzzy membership function, which models the degree to which an object belongs to a cluster, can be applied to K-means clustering. The goal of this paper is to introduce a fuzzy K-means clustering algorithm for concept categorization on a high-dimensional semantic space derived via HAL. Its effectiveness is illustrated by a case study.
2 Fuzzy K-means Clustering
The fuzzy K-means algorithm [1] has been applied successfully to a wide variety of clustering problems [4]. Let X = {x_1, x_2, ..., x_N} be a set of N objects. Represented by S-dimensional weighted vectors, the objects can be viewed as a set of points in a high-dimensional space. Let d(x_j, x_i) be the distance or dissimilarity between objects x_i and x_j. Let V = {v_1, v_2, ..., v_K}, where each v_c is the mean of the c-th cluster, and let d(v_c, x_i) be the distance or dissimilarity between the object x_i and the mean of the cluster it belongs to. Fuzzy clustering partitions these objects into K overlapping clusters based on a computed minimizer of the fuzzy within-group least-squares functional:
$$J_m(U, V) = \sum_{c=1}^{K} \sum_{i=1}^{N} U^{m}(v_c, x_i)\, d(v_c, x_i) \qquad (1)$$
Minimization is performed over all v_c ∈ V. U(v_c, x_i) is the membership function for object x_i belonging to cluster v_c. To optimize (1), we alternate between optimization of J_m(U | V*) over U with V* fixed and of J_m(V | U*) over V with U* fixed, producing a sequence {U^(p), V^(p)}. Specifically, the (p+1)-st value of V = {v_1, v_2, ..., v_K} is computed using the p-th value of U in the right-hand side of:
$$v_c^{(p+1)} = \frac{\sum_{i=1}^{N} x_i\, \big[U^{(p)}(v_c^{(p)}, x_i)\big]^{m}}{\sum_{i=1}^{N} \big[U^{(p)}(v_c^{(p)}, x_i)\big]^{m}} \qquad (2)$$
The updated (p+1)-st value of V is then used to calculate the (p+1)-st value of U via:

$$U^{(p+1)}(v_k^{(p+1)}, x_i) = \frac{d(x_i, v_k^{(p+1)})^{-1/(m-1)}}{\sum_{c=1}^{K} d(x_i, v_c^{(p+1)})^{-1/(m-1)}} \qquad (3)$$

where m ∈ (1, +∞) is the so-called "fuzzifier".
Two often-used dissimilarity measures, both between two objects d(x_j, x_i) and between an object and a cluster mean d(v_c, x_i), are the Lp norm distance [4]:

$$d(v_c, x_i) = \left( \sum_{j=1}^{S} |x_{i,j} - v_{c,j}|^{p} \right)^{1/p}$$

where p ∈ [1, +∞) (p = 2 and p = 1 lead to the Euclidean and Manhattan distances, respectively), and the cosine-based dissimilarity:

$$d(v_c, x_i) = e^{-\mathrm{Sim}(v_c, x_i)}, \qquad \text{where}\quad \mathrm{Sim}(v_c, x_i) = \frac{\sum_{j=1}^{S} x_{i,j}\, v_{c,j}}{\sqrt{\sum_{j=1}^{S} x_{i,j}^{2}}\; \sqrt{\sum_{j=1}^{S} v_{c,j}^{2}}}$$
The "K-center" method is used for initialization: the first candidate is chosen as the mean over all items in X, and each successive candidate is selected so that it is most dissimilar (remote) to all the means that have already been picked.
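A compact sketch of the alternating updates (2) and (3) with the cosine-based dissimilarity and a K-center-style initialization follows. This is our own illustration in Python rather than the authors' implementation; array shapes and helper names are assumptions.

```python
import numpy as np

def dissim(v, X):
    """Cosine-based dissimilarity d(v, x) = exp(-Sim(v, x))."""
    sim = (X @ v) / (np.linalg.norm(X, axis=1) * np.linalg.norm(v) + 1e-12)
    return np.exp(-sim)

def fuzzy_kmeans(X, K, m=2.0, iters=100, eps=1e-9):
    """Alternating optimization of eqs. (2) and (3).

    X : (N, S) array of object vectors (e.g. normalized HAL vectors).
    Returns (U, V): U[c, i] is the membership of object i in cluster c,
    V[c] is the mean (prototype) of cluster c.
    """
    # "K-center"-style initialization: the global mean first, then points
    # most dissimilar to all means picked so far
    V = [X.mean(axis=0)]
    while len(V) < K:
        d = np.min([dissim(v, X) for v in V], axis=0)
        V.append(X[np.argmax(d)])
    V = np.array(V)

    for _ in range(iters):
        D = np.array([dissim(v, X) for v in V]) + eps       # (K, N) dissimilarities
        U = D ** (-1.0 / (m - 1.0))
        U = U / U.sum(axis=0, keepdims=True)                # membership update, eq. (3)
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)          # mean update, eq. (2)
    return U, V
```

The memberships U[c, i] can then be thresholded, for example at the prior 1/K, to report the members of each cluster, as done in the case study below.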
3 Word Clustering on a High Dimensional HAL Space – A Case Study
This case study aims to demonstrate the effectiveness of the fuzzy K-means approach in clustering words represented as HAL vectors. We applied the HAL method to the Reuters-21578 collection, which comprises news feeds from the mid-to-late eighties. The vocabulary is constructed by removing a list of stop words and dropping infrequent words that appear fewer than 5 times in the collection. The window size is set to 8, following previous studies of HAL [6, 7], and the HAL vectors are normalized to unit length. Twenty words (represented by HAL vectors) were then selected from the Reuters collection: airbus, Boeing, plane, Chernobyl, nuclear, disaster, computer, NEC, Japan, IBM, Contra, industry, Iran, Iraq, scandal, war, president, Reagan, white, house. These words involve approximately the following contexts in the Reuters collection: 1) airplane manufacturers; 2) the Chernobyl disaster; 3) computer companies; 4) the Reagan administration in the mid 1980s (dealing with the Iran-Iraq war and the trade war against Japan); 5) the Iran-Contra scandal (President Reagan was involved in the illegal arms sales to the Contra rebels). Note that these
words were chosen so that there would be some overlap between clusters; for example, cluster 4 shares "industry" with clusters 1 and 3, and shares "Reagan" and "Iran" with cluster 5. In order to find the best-performing parameter settings for the fuzzy K-means clustering, we developed a test bed on which a series of prior studies were conducted. The cosine function combined with a fuzzifier of 2.0 performed best in these experimental trials. When the membership value of a word belonging to a cluster is greater than the prior probability (0.2 in this experiment, since there are 5 clusters), the word is output as a member of that cluster. The following table lists the result of fuzzy K-means clustering (the number following each word is its membership value in the corresponding cluster). Cluster 1: Airbus (0.91), Boeing (0.85), Plane (0.85). The fuzzy clustering results basically reflect the overlapping contexts described above, particularly the overlap between the Reagan administration, the Iran-Iraq war, and the Iran-Contra scandal. This demonstrates that fuzzy K-means word clustering in the high-dimensional HAL space is promising.
4 Conclusion
In this paper, we have introduced a fuzzy K-means clustering algorithm and applied it to word clustering on the high-dimensional semantic space constructed by a cognitively motivated semantic space model, namely the Hyperspace Analogue to Language. A case study shows that the algorithm is promising. Since the web is a vast heterogeneous information source, many words have distinct meanings in different contexts, and fuzzy K-means could be potentially useful for uncovering these various context-sensitive semantics. In the future, we plan to conduct a formal evaluation on larger-scale collections.
Acknowledgement. The work reported in this paper has been funded by the Cooperative Research Centre for Enterprise Distributed Systems Technology (DSTC) through the Australian Federal Government's CRC Programme (Department of Education, Science and Training). The authors would like to thank Zi Huang from the
University of Queensland for her kind help in conducting the experiments. We would also like to thank Dr. Hang Li of Microsoft Research Asia for his suggestions on refining the clustering algorithm.
References
1. Bezdek, J.C. (1981) Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981.
2. Gärdenfors, P. (2000) Conceptual Spaces: The Geometry of Thought. MIT Press.
3. Gärdenfors, P. and Williams, M. (2001) Reasoning about Categories in Conceptual Spaces. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'2001), 385-392.
4. Hathaway, R.J., Bezdek, J.C., and Hu, Y. (2000) Generalized fuzzy c-means clustering strategies using Lp norm distances. IEEE Trans. on Fuzzy Systems, Vol. 8, pp. 576-582.
5. Landauer, T., and Dumais, S. (1997) A Solution to Plato's Problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240.
6. Lund, K. and Burgess, C. (1996) Producing High-dimensional Semantic Spaces from Lexical Co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203-208.
7. Song, D. and Bruza, P.D. (2003) Towards Context Sensitive Informational Inference. Journal of the American Society for Information Science and Technology (JASIST), 52(4), pp. 321-334.
N-SHOQ(D): A Nonmonotonic Extension of Description Logic SHOQ(D) Songxin Wang, Shuigeng Zhou, Fei Wang, and Aoying Zhou (Department of Computer Science and Engineering, Fudan University, Shanghai 200433) {sxwang,fwang,sgzhou,ayzhou}@fudan.edu.cn
Abstract. Ontologies play a central role in the development of the Semantic Web. One important web ontology language is DAML+OIL; it has a formal semantics and reasoning support through a mapping to the expressive description logic SHOQ(D) with the addition of inverse roles. In this paper we present a nonmonotonic extension of SHOQ(D), called N-SHOQ(D), to allow for dealing with incomplete knowledge in the Semantic Web.
1 Introduction
The development of the Semantic Web aims at making web resources more easily accessible to automated processing by annotating web pages with machine-readable information on their content. One important web ontology language is DAML+OIL [1]. A formal semantics is provided for DAML+OIL through a mapping to the expressive description logic SHOQ(D) [2] with the addition of inverse roles. Dealing with incomplete knowledge has started to play an important role in database systems and in knowledge representation and reasoning formalisms, and we expect expressing and handling incomplete knowledge to also play an important role in web ontology languages. In this paper we present a nonmonotonic extension of SHOQ(D), called N-SHOQ(D), to allow for dealing with default ontologies in the Semantic Web. N-SHOQ(D) forms a formal underpinning for a nonmonotonic extension of the DAML+OIL language.
b) it allows representing both incomplete knowledge about concepts and roles, and assertional incomplete knowledge about concept and role instances; c) it is based on the description logic SHOQ(D), which is more expressive than the basic description logic ALC. Giugno et al. [6] propose a probabilistic extension of SHOQ(D), called the probabilistic description logic P-SHOQ(D). While P-SHOQ(D) deals with the quantitative uncertainty of domain knowledge, N-SHOQ(D) deals with the qualitative uncertainty of domain knowledge.
3 Nonmonotonic Description Logic N-SHOQ(D)
We assume that readers are familiar with the description logic SHOQ(D). We first define the syntax of N-SHOQ(D).
Definition 1. A strict concept inclusion axiom, a strict role inclusion axiom, and a strict transitive axiom in N-SHOQ(D) are syntactically identical to a concept inclusion axiom, a role inclusion axiom, and a transitive axiom in SHOQ(D), respectively. A strict terminology axiom in N-SHOQ(D) is either a strict concept inclusion axiom, a strict role inclusion axiom, or a strict transitive axiom.
Definition 2. A strict knowledge base KB is a set of strict terminology axioms.
Definition 3. Let C, D be concepts of SHOQ(D); then the expression C ⊆* D is an incomplete concept inclusion axiom. A nonmonotonic terminology axiom, denoted N-F, is either a strict concept inclusion axiom or an incomplete concept inclusion axiom. The intuitive reading of an incomplete concept inclusion axiom C ⊆* D is that concept C is typically included in concept D.
Definition 4. An incomplete knowledge base I-KB is a set of incomplete concept inclusion axioms.
Definition 5. A nonmonotonic knowledge base N-KB = (KB, I-KB) consists of a strict knowledge base KB and an incomplete knowledge base I-KB.
We now define the nonmonotonic semantics of N-SHOQ(D).
Definition 6. Let L = (∆, I) be a SHOQ(D) interpretation. A κ-distribution over L is a mapping from ∆ to N ∪ {∞} satisfying κ(∆) = 0, κ(∅) = ∞, and, for any subset A of ∆, κ(A) = min{κ(ω) | ω ∈ A}.
Definition 7. A nonmonotonic interpretation N-L = (L, κ) with respect to the set of concrete datatypes D consists of a SHOQ(D) interpretation L and a κ-distribution κ.
Definition 8. Let N-L = (L, κ) be a nonmonotonic interpretation and N-F a nonmonotonic terminology axiom; then N-L ⊨ N-F is defined as follows: 1. N-L ⊨ C ⊆ D iff L ⊨ C ⊆ D; 2. N-L ⊨ C ⊆* D iff either κ(C) = ∞, or κ(I(C∩D))