Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
4256
Ling Feng Guoren Wang Cheng Zeng Ruhua Huang (Eds.)
Web Information Systems – WISE 2006 Workshops WISE 2006 International Workshops Wuhan, China, October 23-26, 2006 Proceedings
Volume Editors

Ling Feng
University of Twente, Department of Computer Science
P.O. Box 217, 7500 AE Enschede, Netherlands
E-mail: [email protected]

Guoren Wang
Northeastern University, College of Information Science and Engineering
Shenyang 110004, China
E-mail: [email protected]

Cheng Zeng
Wuhan University, State Key Lab of Software Engineering
Wuhan 430072, China
E-mail: [email protected]

Ruhua Huang
Wuhan University, School of Information Management
Wuhan 430072, China
E-mail: [email protected]

Library of Congress Control Number: 2006934581
CR Subject Classification (1998): H.4, H.3, H.2, C.2.4, I.2, H.5.1
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-540-47663-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-47663-4 Springer Berlin Heidelberg New York
This book constitutes the joint refereed proceedings of the three workshops held in conjunction with the 7th International Conference on Web Information Systems Engineering, WISE 2006, in Wuhan, China, in October 2006.

Workshop 1: Web Information Access and Digital Library. Web information access systems such as search engines provide the necessary means to access the information on the Web. However, search engines are not very effective in searching scholarly information on the Web. The digital library is a key component of national and global information infrastructure, which aims at improving and facilitating Web information access. The workshop examined various existing approaches to Web information access. Of the 41 submitted manuscripts, 14 papers were selected for publication.

Workshop 2: Web-Based Massive Data Processing. With the fast development of the Internet and communication technology, massive data have been accumulated in many Web-based applications such as deep Web applications and Web search engines. How to collect, extract, store, index, query and analyze such massive data effectively and efficiently poses grand challenges and opportunities for ongoing research. This workshop brought together researchers and industry pioneers to discuss this important frontier. Of the 39 submitted manuscripts, 13 papers were selected for publication.

Workshop 3: Advances in Web-Based Learning. Today, the Web overcomes time and space limitations in traditional school education. Teachers and students are now using the Web to access a vast amount of information and resources in cyberspace. The Web also allows educators to implement a range of new teaching and learning practices, which redefine classroom-learning experiences. This workshop invited researchers from various fields to present and discuss their ideas on Web-based learning. Of the ten submitted manuscripts, four papers were selected for publication.

We are grateful to all the many people directly or indirectly involved in the organization of the WISE 2006 workshop series. We especially thank the three workshop organizers and the reviewers for their great efforts, and all of the authors who submitted papers to the workshops. On behalf of the workshop Organizing Committee, we do hope that the results of this workshop series contribute to your work.
October 2006
Ling Feng, University of Twente, Netherlands Guoren Wang, Northeastern University, China
Web-Based Massive Data Processing Workshop Chairs’ Message
With the fast development of the Internet and communication technology, massive data have been accumulated in many Web-based applications such as deep Web applications and Web search engines. How to collect, extract, store, index, query and analyze such massive data effectively and efficiently poses grand challenges and opportunities for ongoing research. The WISE Workshop on Web-Based Massive Data Processing (WMDP 2006) brought together researchers and industry pioneers to discuss this important frontier. The present volume contains the papers that were independently peer-reviewed and selected by the WMDP Program Committee. Thirty-three percent of the paper submissions were accepted. The workshop included a mini-panel on research directions, as well as paper presentations and discussions.
Yan Jia, National University of Defense Technology, China Jian Pei, Simon Fraser University, Canada
Web Information Access and Digital Library Workshop Chairs’ Message
Information resources and technology are profoundly influencing everyone in society. The Web provides vast information in almost all areas of science, culture and the economy. The digital library is a component of the information infrastructure that aims at facilitating Web information access. How to use information technologies to improve access to information and information management, and how to make information access effective and timely on demand, are topics in both computer science and library and information science. Information systems and services must meet human needs. The workshop focused on new challenges and initiatives in Web information access and digital library technology. WIADL was designed to bring together researchers and practitioners from the interdisciplinary areas of information storage, retrieval and dissemination, Web information organization, information system design and data mining, digital rights management, and metadata and library cataloging, and to act as a platform for discussions on experiences and visions. The present volume contains the papers that were independently peer-reviewed and selected by the WIADL Program Committee. Thirty-four percent of the paper submissions were accepted.
Feicheng Ma, Wuhan University, China
Changping Hu, Wuhan University, China
Chuanfu Chen, Wuhan University, China
Zhiyong Peng, State Key Lab of Software Engineering, China
Advances in Web-Based Learning Workshop Chairs’ Message
Among various Internet technologies, the Web has become the most important tool in education, because it provides comprehensive services. The Web can provide many benefits to education. First of all, with Web-based education, students can study anytime, anywhere. Also, students can access a tremendous amount of information in cyberspace. With technical advances in Web technologies, many research efforts are focused on Web-based instruction. The aim of this workshop was to invite researchers from various fields to present and discuss their ideas on Web-based learning. Areas of interest included various aspects of Web-based learning such as learning management systems, user interface design, educational standards, quality management in Web-based learning, infrastructure of Web-based learning environments, curriculum design in Web-based learning, assessment strategy in Web-based learning, instructional design methods for Web-based learning, collaborative Web-based learning, and virtual universities. A total of ten research papers were submitted and were reviewed by an international Program Committee. Each paper was reviewed by two internationally renowned Program Committee members. Papers were rigorously examined and selected based on their significance, originality, technical quality, relevance, and clarity of presentation. Finally, four papers were selected to be presented at the workshop. I would like to take this opportunity to thank all the authors who submitted papers to the workshop. I also thank the Program Committee members. Thanks also go to the conference organizers for their support.
Woochun Jun, Seoul National University of Education, Korea
Organization
Organization Committee
Ling Feng, University of Twente, Netherlands
Guoren Wang, Northeastern University, China
Yan Jia, National University of Defense Technology, China
Jian Pei, Simon Fraser University, Canada
Feicheng Ma, Wuhan University, China
Changping Hu, Wuhan University, China
Chuanfu Chen, Wuhan University, China
Zhiyong Peng, State Key Lab of Software Engineering, China
Woochun Jun, Seoul National University of Education, Korea
Program Committee

Workshop on Web-Based Massive Data Processing
Lei Chen, Hong Kong University of Science and Technology, China
Zhiyuan Chen, University of Maryland, Baltimore County, USA
Jianlin Feng, UIUC
Ying Feng, Microsoft
Hong Gao, Harbin Institute of Technology, China
Li Guo, Institute of Computing Technology, Chinese Academy of Sciences, China
Weihong Han, National University of Defense Technology, China
Yan Huang, University of North Texas, USA
Daxin Jiang, Nanyang Technological University, Singapore
Ning Jing, National University of Defense Technology, China
Jiejun Kong, University of California, Los Angeles, USA
Quanzhong Li, IBM Research
Chun Tang, Yale University, USA
Guoren Wang, Northeastern University, China
Haiyang Wang, Shandong University, China
Haixun Wang, IBM Research
Lidong Wang, National Information Security Center of China, China
Tengjiao Wang, Peking University, China
Wei Wang, Fudan University, China
Yuqing Wu, Indiana University, USA
Shuqiang Yang, National University of Defense Technology, China
Ge Yu, Northeastern University, China
Aoying Zhou, Fudan University, China
Workshop on Web Information Access and Digital Library
Hsinchun Chen, University of Arizona, USA
Jin Zhang, University of Wisconsin-Madison, USA
Marcia Lei Zeng, Kent State University, USA
Li Chen, National Library of China, China
Longji Dai, Management Center of Chinese Academic Library and Information System, China
Jiangping Chen, University of North Texas, USA
Zongying Yang, Shanghai Jiao Tong University Library, China
Wei Liu, Shanghai Library, China
Gang Li, Wuhan University, China
Hui Dong, Wuhan University, China
Ning Zhou, Wuhan University, China
Xiangxing Shen, Wuhan University, China
Ruhua Huang, Wuhan University, China
Cheng Zeng, Wuhan University, China

Workshop on Advances in Web-Based Learning
Ilkyeun Ra, University of Colorado at Denver, USA
Yan Liu, The Hong Kong Polytechnic University, Hong Kong
Young-Cheol Bang, Korea Polytechnic University, Korea
Philip Hingston, Edith Cowan University, Australia
Leandro Balladares, National Polytechnic Institute, Mexico
Danli Wang, The Chinese Academy of Sciences, China
Table of Contents
Workshop on Web-Based Massive Data Processing

Session 1: Streaming Data

Efficient Filtering Query Indexing in Data Stream
Ying Wang, Shuo Bai, Jianlong Tan, Li Guo . . . 1

Modelling and Guaranteeing Quality of Service over Data Streams
Shanshan Wu, Yu Gu, Yanfei Lv, Ge Yu . . . 13

Research on Prediction Models over Distributed Data Streams
Li Tian, AiPing Li, Peng Zou

Mining Maximum Frequent Access Patterns in Web Logs Based on Unique Labeled Tree
Ling Zhang, Jian-ping Yin, Yu-bin Zhan . . . 73

Short Documents Clustering in Very Large Text Databases
Yongheng Wang, Yan Jia, ShuQiang Yang . . . 83

A Simple Yet Efficient Approach for Maximal Frequent Subtrees Extraction from a Collection of XML Documents
Juryon Paik, Ung Mo Kim . . . 94

Session 3: Massive Data Systems

A Web-Based Transformation System for Massive Scientific Data
Shi Feng, Jie Song, Xuhui Bai, Daling Wang, Ge Yu . . . 104

Supporting Complex Query with Structured Overlays in Schema-Based P2P System
Min Yu, Zhanhuai Li, Longbo Zhang . . . 115

DRO+: A Systemic and Economical Approach to Improve Availability of Massive Database Systems
Kai Du, Huaimin Wang, Bo Deng, Yongheng Wang, ShuQiang Yang . . . 122

Research and Implement of Real-Time Data Loading System IMIL
WeiHong Han, Yan Jia, ShuQiang Yang . . . 133

Workshop on Web Information Access and Digital Library

Session 1: Information Organization and Retrieval

Blogs in American Academic Libraries: An Overview of Their Present Status and Possible Future Use
Zhuo Fu . . . 145

Rebuilding the Library OPAC
Zhigeng Wang . . . 153

Web Content Mining for Market Intelligence Acquiring from B2C Websites
Danxiang Ai, Yufeng Zhang, Hui Zuo, Quan Wang . . . 159

Design of Chinese Word Segmentation System Based on Improved Chinese Converse Dictionary and Reverse Maximum Matching Algorithm
Liyi Zhang, Yazi Li, Jian Meng . . . 171

Cross-Media Database Retrieval System Based on TOTEM
Cheng Zeng, Haiyang Zhou, Bing Yan . . . 182

Session 2: Information System

Semantic Query Planning Mechanism on XML Based Web Information Systems
Jinguang Gu, Bo Hu, Yi Zhou . . . 194

Analysis on the Construction of Management Modes of E-Government Information Exchange in China
Cui Huang . . . 206

Aggregated Framework of Enterprise Information System Based on Synergic Theory
Gang Li, Quan Wei . . . 216

Web Personalization Based on Artificial Psychology
Quan Lu, Jing Chen, Bo Meng . . . 223

Grid Resource Management Based on Mobile Agent
Xiaobo Tang, Huang Li . . . 230

Session 3: Digital Library

The Use of Intelligent Information Access Technologies in Digital Libraries
Jiangping Chen, Yuhua Li, Gang Li . . . 239

Encryption and Program Realization of Information Resources in DRM
Chuanfu Chen, Zhiqiang Wu, Laichu Tang, Congjing Ran . . . 251

Knowledge Organization Systems and Interfaces of Digital Libraries in China
Li Si, Lixiao Xu, Hongyan Chen . . . 259

Study on Distribution Model of Internet Information Resources in China
Hui Liu . . . 269

Advances in Web-Based Learning

Collaborative User Tracking for Community Organization on Blogosphere: A Case Study of eLearning@BlogGrid
Jason J. Jung, Inay Ha, Supratip Ghose, Geun-Sik Jo . . . 276

Adaptive UI Storytelling System Using MOL
Sunghan Bae, Rohae Myung . . . 287

Construction of a Distributed Learning Resource Management System Based on RSS Technology
Chengling Zhao, Liyong Wan, Ying Yu, Qi Luo . . . 298

A Semantic Web Application to Automate the Construction of Didactic Material for Web-Based Education System
Rubén Peredo, Leandro Balladares, Iván Peredo . . . 306

Author Index . . . 319
Efficient Filtering Query Indexing in Data Stream

Ying Wang 1,2, Shuo Bai 1, Jianlong Tan 1, and Li Guo 1

1 Software Laboratory, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China
2 Graduate School, Chinese Academy of Sciences, Beijing 100039, China
{wangying, bai, tan, guoli}@software.ict.ac.cn
Abstract. Filtering queries are widely used in data stream applications. As more and more filtering queries are registered in a high-speed data stream management system, the processing efficiency becomes crucial. This paper presents an efficient query index structure based on a decision tree. The index structure makes full use of predicate indices on single attributes, as well as the conjunction relationship between predicates in a single query. It is easy to integrate various predicate indices into this structure. How to select dividing attributes during construction is crucial to the performance of the index tree. Two dividing attribute selection algorithms are described. One is based on information gain (IG) and the other is based on estimated time cost (ETC). The latter takes some sample tuples as a training data set and is able to build more efficient trees. Our experiments demonstrate this.

Keywords: filtering query, query indexing, predicate index, data stream.
P(ai) can be described equivalently by a subdomain ω of ai's domain Ωi: if ai∈ω then P(ai)=true, otherwise P(ai)=false. So P(ai) divides Ωi into 2 disjoint subsets (ω and Ωi−ω). If a filtering query does not care about attribute ai (for convenience, ai does not appear in the query), we regard P(ai) as always true.

There are usually many filtering queries registered in a DSMS. When a data tuple arrives, the DSMS should compute which queries are satisfied. A straightforward way to handle multiple queries is to evaluate the queries one by one for each arrived tuple. In a typical data stream application, there may be thousands of filtering queries and data arrive at a high speed of tens of thousands of tuples per second. The one-by-one method does not work in such high-speed applications.

Another approach to evaluate multiple filtering queries more quickly is to use query indexing. Registered queries are preprocessed to create indices before any stream data arrives. Each data tuple in the incoming stream is used to search the query index to find out quickly which queries are satisfied. There has been much work on query indexing, mainly focusing on indexing queries on a single attribute with a numeric data type. A query index on a single attribute is also called a predicate index. A predicate on a numeric attribute can always be represented by one or more intervals. Classic interval index structures include interval binary search trees (IBS-trees) [2], interval skip lists (IS-lists) [3, 4], containment-encoded intervals (CEIs) [5, 6, 7], etc. PSoup can handle queries on multiple numeric attributes [8]. PSoup creates a red-black tree for predicates on each numeric attribute and intersects the results of all attributes when a data tuple arrives. This method does not take full advantage of the conjunction relationship between predicates in a single query. There are some efficient filtering algorithms in publish/subscribe systems [9, 10, 11], whose main idea is to find common sub-queries among multiple queries so that the shared sub-queries are evaluated only once for each incoming data tuple. These pub/sub filtering algorithms do not make much use of predicate indices on single attributes, and it is not an easy task to adapt them for more complex predicates.

We present a more efficient query index structure (index tree) in this paper. The index structure is based on a decision tree. The index tree makes full use of not only the conjunction relationship between predicates in a single query, but also predicate indices on single attributes. In theory, all kinds of predicate indices on single attributes can be easily integrated into the index tree, no matter how complex the predicates are. More importantly, in the worst case, the matching time for each incoming tuple is bounded by O(Mf(N)), where M is the number of attributes, N is the number of registered queries, and f(N) is the cost to match a tuple in predicate indices on single attributes (in general, f(N)=O(logN)). During the construction of an index tree, how to select dividing attributes is crucial to the average performance of the built tree. We present two dividing attribute selection algorithms. One is based on information gain (IG) and the other is based on estimated time cost (ETC). The latter uses some sample data as a training set and is able to build a more efficient index tree. Our experiments show that in a simulated IDS environment, on average, only a few nodes need to be visited in the index tree built by ETC for each arrived data tuple.
We hope the ETC algorithm will lay the foundation for fully adaptive processing of multiple filtering queries over data streams.
Section 2 introduces our query index structure and its update algorithms. Section 3 presents two dividing attribute selection algorithms. Section 4 describes some implementation issues in our experimental system. Section 5 gives some experimental results. Section 6 concludes.
2 Query Index Tree

Our query index structure is essentially a decision tree. A decision tree is a classifier, which takes an object (usually represented by a vector) as input and labels the object with a class tag. Decision trees are widely used in the field of machine learning. The engine handling multiple filtering queries in a DSMS can be considered as a classifier. Suppose there are N registered filtering queries; we can create a classifying system with N+1 classes for all possible data tuples. If a data tuple satisfies query i, this tuple belongs to the i-th class (note that a tuple may belong to more than one class). If a data tuple satisfies no query, it belongs to the (N+1)-th class. The processing of a tuple against multiple queries can thus also be considered as a procedure of classification.

A tree node in our query index has the form {dividing-attribute, matcher, satisfied-query-set, child-pointer-array}:

dividing-attribute: when a tuple visits the current node, we test its value of the dividing-attribute. Leaf nodes have no dividing-attribute.
matcher: a predicate index on the dividing-attribute. We use the matcher to find out which child a traversing tuple should follow. Leaf nodes have no matcher.
satisfied-query-set: a set of queries that can be decided as satisfied when a tuple arrives at the current node.
child-pointer-array: an array of pointers to the children of the node.

2.1 Query Index Tree Construction

The index tree is built in a top-down manner. Each node is associated with a query set and an attribute set during the construction of the tree. We start with the root, associated with a query set including all queries and an attribute set including all attributes. A queue is used to contain tree nodes (along with their query sets and attribute sets) that will be extended. The queue is initialized with the root. Each time we pop a node from the head of the queue, extend it, and push the newly extended child nodes onto the tail, until the queue is empty.

Now we consider extending a node with a query set Qset{Q1,Q2,…,Qn} and an attribute set Aset{a1, a2, …, am}. If an attribute ai does not appear in any query of Qset (i.e., none of Qset cares about ai), we exclude ai from Aset. If a query Qi has no predicates on any attribute of Aset, we insert Qi into the node's satisfied-query-set and exclude Qi from Qset. If Qset or Aset is empty, there will be no child derived from the current node (i.e., we get to a leaf node).

At each non-leaf node, we first select a dividing attribute from Aset using some algorithm. The dividing attribute selection algorithms will be detailed in Section 3. Suppose that we have chosen attribute aj as the dividing attribute and query Qi's predicate on attribute aj is P[i, j]. P[i, j] corresponds to a subdomain ωi of attribute aj (i.e., P[i, j](aj) is true iff aj∈ωi). We use ω1,…, ωn to divide aj's domain Ω into s non-interleaved subdomains σ1,…, σs, which satisfy:
(σ1 ∪ σ2 ∪ … ∪ σs = Ω) AND (σk ≠ ∅), ∀ 1≤k≤s    (I)

σh ∩ σk = ∅, ∀ h≠k, 1≤h,k≤s    (II)

(σk ∩ ωi = σk) OR (σk ∩ ωi = ∅), ∀ 1≤k≤s, 1≤i≤n    (III)
Conditions I and II guarantee that any given value of aj will fall into one and only one subdomain σk (1≤k≤s). Condition III guarantees that for any two distinct values x and y in σk, P[i, j](x) equals P[i, j](y), for any i, j, k. In addition, we should make s as small as possible. Then we create s children for the current node, with each child corresponding to one subdomain. Each child node is associated with an attribute set of Aset−{aj}, where aj is the dividing attribute of the current node. Each child is initially associated with an empty query set. Then, for each query Qi in Qset and each subdomain σk, if ωi covers σk we add Qi to child k's query set (denoted Qset'[k] later). Note that a query may be added to more than one child's query set. Then all the children, along with their query sets and attribute sets, are pushed into the building queue. A child node's attribute set size is at least 1 less than its parent's, so the depth of the index tree is at most M, where M is the number of attributes.

Finally, we build the matcher for the node. The matcher is an index of Qset's predicates on the dividing attribute. With the matcher, we can quickly figure out which subdomain a given value falls into. All kinds of index algorithms for predicates on a single attribute are applicable here.

We give a simple example. Suppose a node with a query set of: Q1: (50
Fig. 1. An example of node extending (the dividing attribute a1 is split into the subdomains [0,50], [51,79], [80,119] and [120,65535], whose children carry query sets drawn from {Q1, Q2, Q3})
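To make the node layout and the queue-based construction concrete, here is a minimal Python sketch under simplifying assumptions of our own: predicates are sets of admissible values over small discrete domains, and the names (TreeNode, build_index_tree, pick_dividing_attribute) are illustrative choices, not the authors' implementation; a real matcher would be an interval or other predicate index rather than a value-to-child dictionary.

from collections import deque

class TreeNode:
    """One node of the query index tree: {dividing-attribute, matcher, satisfied-query-set, children}."""
    def __init__(self):
        self.dividing_attribute = None    # attribute tested at this node (None for a leaf)
        self.matcher = None               # here: dict value -> child index; in general a predicate index
        self.satisfied_query_set = set()  # queries decided as satisfied once a tuple reaches this node
        self.children = []                # child-pointer-array

def build_index_tree(queries, domains, pick_dividing_attribute):
    """queries: {qid: {attr: set of admissible values}}; domains: {attr: iterable of all values}."""
    root = TreeNode()
    work = deque([(root, set(queries), set(domains))])
    while work:
        node, qset, aset = work.popleft()
        # drop attributes that no query in qset cares about
        aset = {a for a in aset if any(a in queries[q] for q in qset)}
        # queries with no remaining predicates are satisfied at this node
        for q in list(qset):
            if not any(a in queries[q] for a in aset):
                node.satisfied_query_set.add(q)
                qset.discard(q)
        if not qset or not aset:
            continue                       # leaf node
        a = pick_dividing_attribute(qset, aset, queries)
        node.dividing_attribute = a
        # subdomains sigma_1..sigma_s: group values by the set of queries whose omega covers them
        groups = {}
        for v in domains[a]:
            covering = frozenset(q for q in qset if v in queries[q].get(a, domains[a]))
            groups.setdefault(covering, []).append(v)
        node.matcher = {}
        for child_qset, values in groups.items():
            child = TreeNode()
            node.children.append(child)
            for v in values:
                node.matcher[v] = len(node.children) - 1
            work.append((child, set(child_qset), aset - {a}))
    return root

Grouping values by the exact set of covering predicates yields a partition satisfying conditions I-III with the smallest possible s; any selection rule, such as the IG or ETC criteria of Section 3, can be plugged in as pick_dividing_attribute (e.g. a trivial placeholder is lambda qset, aset, queries: sorted(aset)[0]).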
2.2 Matching Algorithm

The matching algorithm for an incoming tuple is straightforward. The matched query set is initially empty. The tuple traverses the index tree from the root downward to a leaf node. At each node:

Step 1. Add the node's satisfied-query-set to the matched query set.
Step 2. If the node is a leaf node, then return; else go to Step 3.
Step 3. Use the tuple's value of the dividing-attribute to search the node's matcher. Suppose the value falls into the k-th subdomain; set the k-th child to be the current node and go to Step 1.

Since in Step 3 all subtrees other than the k-th subtree are ignored, only the query subset associated with the k-th child is considered in the subsequent steps. So the conjunction relationship between predicates in a single query is fully used here. At most M nodes will be tested in the matching algorithm, where M is the maximal depth of the index tree, i.e., the number of attributes. If the matcher's searching time in Step 3 is bounded by O(logN), then the matching algorithm has a time cost of O(MlogN), where N is the number of queries. In general, the searching time of a predicate index (the matcher of a node) on a single attribute can meet the bound of O(logN).

2.3 Updating Index Tree

1. Insertion. Query insertion is also done in a top-down manner from the root. At each node:

Step 1. If all predicates of the query have been checked, add the query to the satisfied-query-set of the node and return.
Step 2. Suppose the query's predicate on the dividing-attribute is equivalent to a subdomain ω; then for each child of the node, suppose the child corresponds to a subdomain σ:
(1) If there is no overlap between ω and σ, ignore the child.
(2) If ω totally covers σ, insert the query into the subtree recursively.
(3) If ω partially covers σ, divide σ into two subdomains σ1 and σ2, with σ1 totally covered by ω and σ2 without overlap with ω. Accordingly, two duplicates substitute the child node along with its subtree. Furthermore, the matcher of the node must be rebuilt. Then we recursively insert the query into the subtree corresponding to σ1.

2. Removal. Removal of a query from the index tree is similar to insertion but simpler. At each node:

Step 1. If the query is a member of the satisfied-query-set, remove it from the satisfied-query-set and return.
Step 2. Suppose the query's predicate on the dividing-attribute is equivalent to a subdomain ω; then for each child of the node, suppose the child corresponds to a subdomain σ:
(1) If ω covers σ, remove the query from the subtree recursively.
(2) Else ignore the child.

In the worst case, almost all the tree nodes would be traversed in the process of insertion or removal. Note that there is no merging of subtrees in our removal algorithm, because of its complexity. On the whole, tree updates are time costly. Fortunately, query updates are usually relatively infrequent in most data stream applications.
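Assuming nodes shaped like the construction sketch above (an illustrative layout, not the authors' code), the three matching steps of Sect. 2.2 translate directly into the following traversal:

def match_tuple(root, tuple_values):
    """Return the ids of all queries satisfied by a tuple ({attribute: value})."""
    matched = set()
    node = root
    while node is not None:
        matched |= node.satisfied_query_set            # Step 1
        if node.dividing_attribute is None:            # Step 2: leaf node, done
            break
        value = tuple_values[node.dividing_attribute]  # Step 3: ask the matcher which child to follow
        k = node.matcher.get(value)                    # in the real index every value hits exactly one subdomain
        node = node.children[k] if k is not None else None
    return matched

At most one child is followed per level, so the traversal visits at most M nodes, which is consistent with the O(Mf(N)) bound stated above.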
3 Dividing Attribute Selection

During the construction of the tree, how to select dividing attributes is crucial to the tree shape and the tree's matching performance. Matching performance is our main concern. It is difficult to measure matching performance accurately, since it is related not only to the tree structure but also to the performance of the matcher of each node. In addition, it is also related to the distribution of incoming data. We measure the matching performance by counting the average number of tree nodes that are visited during the match for each incoming tuple. We will present two dividing attribute selection algorithms, one based on information gain and the other based on estimated time cost. In the following, we discuss how to select a dividing attribute for a node associated with a query set Qset{Q1,Q2,…,Qn} and an attribute set Aset{a1, a2, …, am} during the tree construction.

3.1 Information Gain Algorithm (IG)

Kruegel et al. use decision trees to improve the performance of intrusion detection [12]. Their engine structure is similar to our query index tree. They adapt the classical decision tree construction algorithm ID3 to build the engine and get good performance. The information gain based method is also applicable to our dividing attribute selection. Supposing each query appears in Qset with the same probability, the entropy of Qset is calculated as [15]:
Entropy(Qset) = ∑i=1..|Qset| −(1/|Qset|) log2(1/|Qset|) = −log2(1/|Qset|) = log2 |Qset|    (1)
where |Qset| denotes the size of Qset. Suppose we have chosen ai as the dividing attribute. According to Section 2.1, we then divide its domain into s subdomains and create s corresponding query subsets Qset'[1..s], one for each child. Then the information gain of the split of the query set with dividing attribute ai is defined as:

Gain(Qset, ai) = Entropy(Qset) − ∑k=1..s (|Qset'[k]| / |Qset|) · Entropy(Qset'[k])    (2)
We will choose the attribute with the highest information gain as the dividing attribute.

3.2 Estimated Time Cost Algorithm (ETC)

In general, the IG algorithm will not build an optimal tree under our measurement, for it does not take the distribution of incoming data into account. We give a data-sensitive algorithm in this section. Some sample tuples are used as a training data set, which characterizes the distribution of incoming data. Now, each node is associated with a tuple set during the tree construction, besides a query set and an attribute set. We represent the tuple set as Dset{D1, D2, …, Dd}. After the division of the dividing attribute's domain and the creation of the s query subsets Qset'[1..s], we create s empty tuple sets Dset'[1..s], one for each child. Then, for each tuple Di in Dset, if Di's value of the dividing attribute falls into subdomain σk, we insert Di into Dset'[k].

Now we consider how to build an optimal index tree under our measurement. Consider the subtree rooted at the current node. Suppose attribute A is chosen as the dividing attribute; then the average number of nodes visited by each tuple in Dset can be defined as a function AvgVisit(Qset, Aset, Dset)A, which can be recursively computed as:
AvgVisit(Qset, Aset, Dset)A = 1 + ∑i=1..s pi · AvgVisit(Qset'[i], Aset − {A}, Dset'[i])    (3)
where pi = |Dset'[i]| / |Dset|, i.e. the ratio of tuples falling into the i-th child. It is easy to prove that if the tree has the minimal cost, each of its subtrees must also have the minimal cost. Therefore the minimal cost of the tree can be recursively computed as:
MinAvgVisit(Qset, Aset, Dset) = 1 + min A∈Aset ∑i=1..s pi · MinAvgVisit(Qset'[i], Aset − {A}, Dset'[i])    (4)
If we directly use the above formula to build the optimal tree, the construction time cost is extremely high. Instead, we can estimate the cost of each subtree in an approximate way. In general, the larger the query set is, the higher the time cost of the index tree will be. So we use the size of Qset'[i] to estimate the i-th subtree's time cost. Then the cost of the subtree rooted at the current node can be computed as:
EstTime(Qset, Aset, Dset)A = |Qset| + ∑i=1..s pi · |Qset'[i]|    (5)
The first term |Qset| on the right side is the same for all attributes in Aset, so it can actually be ignored. We choose the attribute with the minimal EstTime as the dividing attribute for the current node. We call this algorithm Estimated Time Cost (ETC). ETC is a greedy algorithm, as is IG. In general, ETC cannot build an optimal tree, but it is a good approximation.

Finally, we consider a special case. Suppose that with a given dividing attribute A, the node's k-th child has an empty query set (i.e. |Qset'[k]| = 0). That is, any tuple that falls into Dset'[k] will not match any query. Apparently, the more tuples in Dset fall into Dset'[k], the lower the cost of the subtree will be. We usually call pk = |Dset'[k]| / |Dset| the selectivity of attribute A. In general, the ETC algorithm will choose the most selective attribute as the dividing attribute.
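As an illustration of the ETC criterion (Eq. 5), the following hedged sketch scores each candidate attribute on a training tuple set and returns the cheapest one; it reuses the toy representation of queries from the earlier sketches (sets of admissible values per attribute), which is our simplification rather than the authors' implementation, and it assumes a non-empty training set.

def pick_dividing_attribute_etc(qset, aset, queries, domains, dset):
    """ETC selection: minimise sum_i p_i * |Qset'[i]| over candidate dividing attributes.

    queries: {qid: {attr: set of admissible values}} over small discrete domains,
    dset: non-empty list of training tuples ({attribute: value}).
    """
    best_attr, best_cost = None, float("inf")
    for a in sorted(aset):
        groups = {}   # covering query set of the subdomain a tuple falls into -> tuple count
        for t in dset:
            covering = frozenset(q for q in qset if t[a] in queries[q].get(a, domains[a]))
            groups[covering] = groups.get(covering, 0) + 1
        # EstTime of Eq. (5) with the constant |Qset| term dropped
        cost = sum(count / len(dset) * len(child_qset)
                   for child_qset, count in groups.items())
        if cost < best_cost:
            best_attr, best_cost = a, cost
    return best_attr

An IG-based variant would instead score each candidate attribute by Eq. (2) computed over Qset alone, ignoring the training tuples.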
4 Implementation Issues

4.1 Predicates and Matchers

Our experimental system supports 3 classes of predicates:

Class 1: on a numeric (integer or float) attribute ai, to decide if the attribute value falls into any of a given group of intervals.
Class 2: on a bit-field attribute ai, given a group of (mask, target) pairs, to decide if there exists any pair that satisfies (mask & ai) = target, where & is the bitwise-AND operator. This class of predicates is usually used to test if an IP address belongs to some subnet.

Class 3: on a string attribute ai, to decide if the attribute value contains all of a given group of keywords.

An example query includes all three classes of predicates: (80
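The three predicate classes can be pictured as simple boolean tests; the snippet below is only a sketch with hypothetical attribute names and constants (the example query in the original text is truncated in this copy), and a production matcher would index many such predicates at once rather than evaluate them one by one.

def interval_predicate(intervals):
    """Class 1: true if the numeric value falls into any of the given [lo, hi] intervals."""
    return lambda v: any(lo <= v <= hi for lo, hi in intervals)

def mask_predicate(pairs):
    """Class 2: true if (mask & value) == target for at least one (mask, target) pair."""
    return lambda v: any((v & mask) == target for mask, target in pairs)

def keyword_predicate(keywords):
    """Class 3: true if the string value contains all of the given keywords."""
    return lambda s: all(k in s for k in keywords)

# A hypothetical conjunctive query touching all three classes (names and constants are illustrative)
example_query = {
    "dst_port": interval_predicate([(80, 119)]),
    "src_ip":   mask_predicate([(0xFFFFFF00, 0xC0A80100)]),   # i.e. the 192.168.1.0/24 subnet
    "payload":  keyword_predicate(["GET", "passwd"]),
}

def satisfies(query, tuple_values):
    return all(pred(tuple_values[attr]) for attr, pred in query.items())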
5 Experiments

We did our experiments in an assumed IDS application environment. The well-known Snort rule set was used as our query set [14]. Each Snort rule was translated into one query, which had an SQL-like form in our experimental system. The translation was not fully equivalent, for some Snort rules have complex predicates that our system does not support, and those predicates were simplified during the translation. Some Snort rules need application-layer protocol (such as HTTP) analysis, and those rules were discarded. Therefore, we got a query set of a little more than 2,000 queries, and each query had 3 to 8 predicates.

As for the data, each tuple represents a network packet, which consists of 16 integer attributes, 5 bit-field attributes and 1 string attribute. We captured about 1 million packets from our sub-network gateway and stored them in tcpdump format. For the ETC algorithm, we uniformly sampled 10,000 packets as the training data set. To reduce the influence of reading from disk, all data were loaded into main memory before any performance test.
Fig. 2. Index tree size for increasing size of query set (x-axis: number of queries (×100); y-axis: number of nodes; curves: IG, ETC)
Fig. 3. Average number of visited nodes by each incoming tuple (x-axis: number of queries (×100); y-axis: average number of visited nodes; curves: IG, ETC)
Fig. 4. Processing speed of the index trees (x-axis: number of queries (×100), from 2 to 20; y-axis: processing speed (tuples/sec); curves: IG, ETC)
We compared the performance of the index trees built by IG and ETC, as well as their tree sizes. All index trees were built and stored on disk in advance. The IG build algorithm was always several times faster than ETC. For example, building an index tree of 2,000 queries with the ETC algorithm took about one and a half minutes, while it took only 13 seconds with IG. The build time was not our main concern.

To test the matching performance of the index trees, we loaded the index tree and the data into main memory and then processed the tuples one by one as fast as possible. When a tuple matched some query, the only action was to print the query ID on screen. We increased the size of the testing query set from 200 to 2,000, and measured the tree size (number of tree nodes), the average number of nodes visited by each tuple, and the processing speed (tuples per second) of the trees. We did the speed test 10 times and averaged the results for each query set. The experiment results are given in Fig. 2, Fig. 3 and Fig. 4, respectively.

Fig. 2 shows that the tree built by ETC is much larger than the tree built by IG for the same query set. This is because ETC always chooses the most selective attribute as the dividing-attribute at each node, and predicates on the most selective attribute tend to divide the domain into more subdomains. Therefore, the ETC trees extend much wider. In contrast, the IG algorithm prefers smaller trees [15]. Even so, the tree of 2,000 queries built by ETC has a moderately large size of about half a million nodes.

Fig. 3 shows that in the trees built by ETC, only a few nodes are visited by each incoming tuple on average. In contrast, in the trees built by IG the average number of visited nodes is several times larger. The reason is that, with the most selective attributes (i.e., attributes with the strictest filtering conditions) selected as dividing attributes, the top levels of the trees built by ETC can stop a tuple from visiting the lower levels with higher probability than those of the trees built by IG. Note that the average number of visited nodes in the trees built by IG is even larger than the number of attributes, for more than one child of a node may be visited because of the space reduction in Section 4.

Fig. 4 shows that the performance of a tree built by ETC is about 50%-90% better than that of the tree built by IG for the same query set. It also shows that the performance of both kinds of trees degrades approximately logarithmically as the number of queries increases.
6 Conclusion

Filtering queries are widely used in data stream applications. As more and more filtering queries are registered in a DSMS and the speed of data streams becomes higher and higher, the performance of the query processor is increasingly crucial. This paper presents a query index structure, which is essentially a decision tree. The index structure makes full use of predicate indices on single attributes, as well as the conjunction relationship between predicates in a single query. Various predicate indices on single attributes are easily integrated into this structure.

How to select dividing attributes during tree construction is crucial to the performance of the index tree. We give two dividing attribute selection algorithms: one based on information gain (IG) and the other based on estimated time cost (ETC). The latter takes some sample tuples as a training data set. Our experiments in a simulated IDS application demonstrate that the trees built by ETC are more efficient than the trees built by IG. The ETC algorithm takes the distribution of incoming data into account and lays the foundation for fully adaptive processing of multiple filtering queries over data streams. Our future work will focus on how to improve the index tree dynamically as the distribution of incoming data changes.
References

1. B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom. Models and Issues in Data Stream Systems. In Proc. ACM Symp. on Principles of Database Systems (2002) 1-16.
2. E. Hanson, M. Chaaboun, C.-H. Kim, and Y.-W. Wang. A predicate matching algorithm for database rule systems. In Proc. of ACM SIGMOD Int. Conf. on Management of Data (1990) 271-280.
3. E. N. Hanson, T. Johnson. The Interval Skip List: A data structure for finding all intervals that overlap a point. 2nd WADS, LNCS Vol. 519 (1991) 153-164.
4. E. Hanson and T. Johnson. Selection predicate indexing for active databases using interval skip lists. Information Systems, 21(3):269-298 (1996).
5. K.-L. Wu, S.-K. Chen, and P. S. Yu. Interval Query Indexing for Efficient Stream Processing. In Proc. of ACM CIKM (2004).
6. K.-L. Wu, S.-K. Chen, P. S. Yu. Query indexing with containment-encoded intervals for efficient stream processing. Knowl. Inf. Syst. 9(1):62-90 (2006).
7. K.-L. Wu, S.-K. Chen, and P. S. Yu. On-Demand Index for Efficient Structural Joins. IBM Research Report (2006).
8. S. Chandrasekaran, M. J. Franklin. Streaming Queries over Streaming Data. In Proc. of the 28th VLDB Conference, Hong Kong, China (2002).
9. M. K. Aguilera, R. E. Strom, D. C. Sturman, M. Astley, T. D. Chandra. Matching events in a content-based subscription system. In Proc. of the 18th ACM Symp. on Principles of Distributed Computing, Atlanta (1999) 53-61.
10. A. Campailla, S. Chaki, E. Clarke, S. Jha, H. Veith. Efficient filtering in publish-subscribe systems using binary decision diagrams. In Proc. of ICSE 2001, Toronto, IEEE Computer Society (2001) 443-452.
11. F. Fabret, H. A. Jacobsen, F. Llirbat, J. Pereira, K. A. Ross, and D. Shasha. Filtering algorithms and implementation for very fast publish/subscribe systems. In Proc. of ACM SIGMOD Int. Conf. on Management of Data (2001).
12. C. Kruegel and T. Toth. Using decision trees to improve signature-based intrusion detection. In RAID (2003).
13. G. Navarro and M. Raffinot. Flexible Pattern Matching in Strings. Cambridge University Press (2002) 49-54.
14. Snort. Open-source Network Intrusion Detection System. http://www.snort.org.
15. T. M. Mitchell. Machine Learning. McGraw-Hill (1997) 63-66.
Modelling and Guaranteeing Quality of Service over Data Streams

Shanshan Wu, Yu Gu, Yanfei Lv, and Ge Yu

School of Information Science & Engineering, Northeastern University, Shenyang 110004, P.R. China
[email protected]
Abstract. In a growing number of web-based applications, massive data takes the form of continuous data streams. Most of these applications have deterministic and sophisticated performance requirements that need to be met under unbounded, high-volume and time-varying data streams. In contrast to best-effort service provision, deterministic quality of service is generalized and modelled in this paper based on the theory of network calculus. A simple and efficient admission control mechanism is provided to determine the schedulability of several continuous queries with their QoS requirements satisfied simultaneously. Once the queries are admitted into the DSMS, our proposed QoS-Guaranteed scheduling algorithm assigns priorities to ready tuples in a way that ensures all the QoS requirements are met. Furthermore, a QoS-Guaranteed batch-processing scheme is extended to reduce the query-scheduling overhead. Experiments are presented to illustrate and validate our strategies.
1 Introduction

With the fast development of the Internet and communication technology, massive data has been generated in the form of streaming data in many web-based applications, such as performance measurements in network monitoring and traffic management, ATM operations in banks, log records generated by web servers, and a wide variety of other online monitoring of streaming data from trend-related analysis, web-click streams, intrusion detection, etc. How to manage and process such massive streaming data efficiently and effectively poses grand challenges and opportunities for the data management research community.

Stream processing applications are expected to support a wide variety of services. These services will have input characteristics and quality of service requirements that may be dramatically different from one another. For example, the control of a nuclear plant or a telemedicine application requires strict real-time response constraints, and audio/video playback or teleconferencing requires that the delay jitter be bounded by the networked multimedia application, etc. These applications present a formidable challenge to a DSMS (Data Stream Processing System), which has to efficiently allocate limited computation resources to many continuous queries by promoting sharing while also providing quality of service for each query. At present, most existing DSMSs provide best-effort services for data stream processing [1][2][3], and they can only provide statistical quality of service, such as average response delay, overall throughput, etc. The issue of providing guaranteed service for data stream processing has drawn little consideration in the data stream research field. Therefore, we dedicate ourselves to guaranteeing quality of service for data stream processing.

In this paper, we propose a novel approach for modelling and guaranteeing QoS in data stream processing. Our approach is based on service curves, which were initially used for integrated services in networks. Cruz [4] introduced the notion of service curves and proposed their use in a general framework called Network Calculus. A key feature of characterizing services with service curves is that the QoS guarantees can be expressed as simple functions of the service curves and the input constraints. These functions are independent of the service curves and the input constraints of all other queries. Another key feature of the service curve specification is that it gives greater flexibility to a service provider in allocating its resources to guarantee diverse QoS requirements. By specifying an appropriate service curve for a query, the query engine is dynamically informed and efficiently allocates resources to the query in order to meet its QoS specification. So, if reasonably adapted to data stream processing, service-curve-based network calculus can consequently help to provide a novel scheme for service specification and to improve resource (service) management of a DSMS, which may lead to strong performance guarantees for DSMSs.

In summary, we make the following contributions in this paper:
– Based on the characteristics of data stream processing, deterministic quality of service is generalized and modelled based on the theory of network calculus.
– An admission control mechanism is provided to determine the schedulability of several continuous queries with their QoS requirements satisfied simultaneously. Once admitted, queries are scheduled in accordance with our proposed QoS-Guaranteed scheduling algorithm.
– A QoS-Guaranteed batch-processing scheme is extended to reduce the query-scheduling overhead.

The rest of this paper is organized as follows. We establish the service curve based QoS model of data stream processing in Section 2. Section 3 presents our QoS-guaranteeing mechanism, including the QoS-Guaranteed scheduling algorithm and admission control. We extend our work to QoS-Guaranteed batch processing in Section 4. Experimental results are presented in Section 5. Section 6 discusses the related work. Finally, we end the paper with conclusions and future work in Section 7.

Supported by the National Natural Science Foundation of China under Grant No. 60473073, 60503036, and Fok Ying Tung Education Foundation under No. 104027.
2 Service Curve Based QoS Model

In this section, we model a DSMS as a service system. With the knowledge of network calculus, quality of service measures for continuous queries over data streams, such as response delay bound, memory requirement and throughput, can easily be expressed as simple functions of the service curves and the characteristics of the input at the entrance of the processing system.
2.1 Preliminaries of Network Calculus

Here we recall a few fundamental definitions and results from network calculus as preliminaries. Throughout this paper, we assume that time is divided into slots numbered 0, 1, 2, ...
Fig. 1. Properties of Service System
For a lossless service system S illustrated in Figure 1(a), which is viewed as a black box, S receives input data and delivers the data after a variable delay. Let the cumulative input function R(t) denote the number of data items seen on the input flow during the time interval [0, t]. Similarly, the cumulative output function R∗(t) is defined as the number of data items that have received service from the service node during the interval [0, t]. By convention, we assume that there is no data stored in the service node at the end of slot zero, i.e. R(0) = 0. Therefore, the number of data items stored in the service node at the end of slot t, called the backlog of the system at the end of slot t, is given by Equation 1:

B(t) = R(t) − R∗(t)    (1)

The virtual delay relative to t suffered by the flow, denoted by d(t), is defined in Equation 2:

d(t) = inf{τ ≥ 0 : R(t) ≤ R∗(t + τ)}    (2)

Both backlog and virtual delay are illustrated in Figure 1(b).

Consider wide-sense increasing functions of time, with nonnegative, possibly infinite values. We introduce an operation called "min-plus convolution", which is the basis of min-plus algebra.

Definition 1 (Convolution). Given wide-sense increasing functions f and g, the min-plus convolution of f and g is defined as the function f ⊗ g : R → R+ ∪ {+∞}, as given in Equation 3:

(f ⊗ g)(t) = inf_{0≤s≤t} {f(t − s) + g(s)}    (3)
Definition 2 (Service Curve). Consider a system S and a flow through S with cumulative input function and cumulative output function R and R∗ , respectively. We say that S offers to the flow a service curve β if and only if β is wide sense increasing, β(0) = 0 and R∗ ≥ R ⊗ β.
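In discrete time, the min-plus convolution of Definition 1 and the service-curve test of Definition 2 can be evaluated directly; the following short Python sketch is our own illustration, with a finite horizon standing in for the unbounded time axis.

def min_plus_convolution(f, g, horizon):
    """(f ⊗ g)(t) = min over 0 <= s <= t of f(t - s) + g(s), for t = 0 .. horizon-1."""
    return [min(f(t - s) + g(s) for s in range(t + 1)) for t in range(horizon)]

def offers_service_curve(R, R_star, beta, horizon):
    """Definition 2 checked slot by slot: R*(t) >= (R ⊗ beta)(t) for every t in the horizon."""
    lower = min_plus_convolution(R, beta, horizon)
    return all(R_star(t) >= lower[t] for t in range(horizon))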
A service curve is defined as a measure for partially characterizing the service provided by a server. Roughly speaking, a service curve is a function that lower bounds the amount of traffic departing from the server within some specific interval. An arrival curve defines the constraint on burstiness as follows:

Definition 3 (Arrival Curve). Given a wide-sense increasing function α, we say that the flow is constrained by α if and only if R(t) − R(s) ≤ α(t − s) for all s ≤ t. Function α is called an arrival curve for the flow, and the flow is also called α-smooth.

Given a service curve and an arrival curve, Theorem 1 provides the upper bounds on the delay and backlog experienced by a flow as the maximum vertical and horizontal distance, which is illustrated by Figure 1(b) and (c).

Theorem 1 (Backlog Bound and Virtual Delay Bound). Assume a flow constrained by arrival curve α traverses a service system that offers a service curve β. The backlog B(t) for all t is bounded, as shown in Equation 4:

R(t) − R∗(t) ≤ sup_{t≥0} {α(t) − β(t)}    (4)

The virtual delay d(t) is bounded by d(t) ≤ h(α, β), where h(α, β) is the horizontal deviation between the two functions, given by Equation 5:

h(α, β) = sup_{t≥0} {inf{τ : τ ≥ 0 and α(t) ≤ β(t + τ)}}    (5)
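The bounds of Theorem 1 can likewise be evaluated numerically. The sketch below uses a token-bucket arrival curve and a rate-latency service curve purely as illustrative assumptions; they are common textbook shapes, not curves taken from this paper, and the specific numbers are invented for the example.

def backlog_bound(alpha, beta, horizon):
    """Vertical deviation sup_t {alpha(t) - beta(t)} over a finite horizon (Eq. 4)."""
    return max(alpha(t) - beta(t) for t in range(horizon))

def delay_bound(alpha, beta, horizon):
    """Horizontal deviation h(alpha, beta) over a finite horizon (Eq. 5)."""
    worst = 0
    for t in range(horizon):
        tau = 0
        while alpha(t) > beta(t + tau):   # terminates because beta grows faster than alpha here
            tau += 1
        worst = max(worst, tau)
    return worst

# Assumed curves: a burst of 20 tuples plus 5 tuples/slot arriving,
# served at 10 tuples/slot after a 3-slot latency.
alpha = lambda t: 0 if t == 0 else 20 + 5 * t
beta = lambda t: max(0, 10 * (t - 3))

print(backlog_bound(alpha, beta, 200), delay_bound(alpha, beta, 200))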
2.2 Modelling QoS

In the circumstance of data stream processing, each continuous query can be modelled as a service system, as shown in Figure 2. Each query processes its input stream as a service provider, and the input stream of the query can be regarded as the service consumer. However, the output tuples of a continuous query are the results that satisfy certain query semantics. Taking a filter query as an example, the output tuples of the filter are those that satisfy the filter predicate. In other words, the tuples in the output stream do not correspond one-to-one to the tuples in the input stream, because of selectivity. As a result, when modelling a continuous query as a service system, the output should be measured by the number of tuples that have been processed by the query rather than the number of results output by the query. Terminologically, R(t) denotes the number of tuples that have arrived in the input stream ready to be processed during the interval [0, t], and R∗(t) denotes the number of tuples that have been served (processed) by the query.

As for a deterministic service guarantee, a meaningful definition of the quality of service must include a description of the characteristics of the input to be served and an unambiguous definition of the performance requirement. When provided with this set of information, the provider is capable of deriving the respective resource allocation in order to fulfill the QoS demand. With the DSMS modelled as a service system, the characteristics of the input stream and the service required can be flexibly described under the theory of network calculus as follows.
Fig. 2. Modelling a DSMS
For a continuous query Qi in a DSMS, both the arrival curve αi of the input stream of Qi and the service curve βi of Qi determine the quality of service of Qi, such as response delay and buffer requirement. Therefore, for each continuous query Qi, a QoS model is defined as QoSi = <αi, βi>. The smoothness of the input stream may be naturally induced at the source or can be imposed by the user. The service curve specification gives greater flexibility to a server in allocating its resources to meet diverse delay and throughput requirements.

As an example, suppose a query guarantees the service curve β = δd, where

δd(t) = 0, if 0 ≤ t ≤ d;  δd(t) = ∞, if t > d    (6)

In this case, it follows that for all t we have R∗(t) ≥ (R ⊗ δd)(t) = R(t − d), since δd(s) is infinite for s > d and R is wide-sense increasing, so for t ≥ d the infimum in the convolution is attained at s = d. Hence the response delay of tuples through this query is bounded by d.
3 QoS-Guaranteeing Mechanism

Given the QoS specifications of continuous queries, measures should be taken to guarantee the admitted queries. Fig. 3 presents the architecture of our QoS-Guaranteed DSMS. A user who wants to register a new continuous query with the DSMS has to explicitly specify the performance requirements that the system has to provide for the query in the form of a QoS specification. Then the admission controller determines whether the query can be supported without disrupting the service for existing queries in the system. Once the query is accepted, the system dynamically allocates resources to the various queries with a QoS-Guaranteed scheduling algorithm.
Fig. 3. Architecture of QoS-Guaranteed DSMS (components: QoS Specification, QoS Manager, Admission Controller, Scheduler, Query Engine; data and queries enter the system, results leave it)
3.1 QoS-Guaranteed Scheduling Algorithm

Without loss of generality, we assume that each stream is related to one query and each query has exactly one related stream, so that the processing of the queries in a DSMS is independent of one another. Consider that there are N continuous queries in a DSMS sharing computation resources, and suppose query Qi requires that the service curve βi be guaranteed, where i = 1, ..., N. We design an SCED-based scheduling algorithm that can guarantee the service curve for each query. SCED [5] is a non-preemptive task scheduling strategy in the theory of network calculus. It defines an output deadline allocation algorithm that guarantees, under some conditions, that flow i does have βi as a service curve.

Suppose the ni-th input tuple of query Qi arrives in front of query Qi in slot u. As shown in Equation 7 and Equation 8, the SCED algorithm requires that the tuple depart the system before time slot di(n):

Zi(t; u − 1) = min_{τ(u−1)≤s≤u−1} {Ri(s) + βi(t − s)}    (7)

di(n) = min{t : t ≥ u and Zi(t; u − 1) ≥ ni}    (8)
where τ(t) is the index of the latest slot, no larger than t, at the end of which the query engine is idle, as shown in Equation 9:

τ(t) = max{s : s ≤ t and Bi(s) = 0 for i = 1, ..., N}    (9)
We call di(n) the departure deadline of the ni-th input tuple of query Qi. Since the processing cost differs from tuple to tuple, we should take the processing delay into consideration when determining the scheduling priority. Suppose that the per-tuple processing delay for query Qi is bounded by Δi (which is determined in the next section). To ensure that departure deadlines are met, tuples should be scheduled Δi time slots ahead of their departure deadlines. Therefore, the scheduling deadline is defined in Equation 10:

SDi(n) = min{t : t ≥ u and Zi(t − Δi; u − 1) ≥ ni}    (10)
Based on the EDF strategy, the highest priority corresponds to the smallest scheduling deadline. If a tuple can be processed before its scheduling deadline, its output deadline is satisfied, and therefore the query meets the service curve corresponding to its QoS. The scheduling algorithm is given in Table 1.

3.2 Access Control for QoS Guaranteeing

Since the capability of a given query engine is limited, when a set of queries run in a single DSMS they compete with one another for computation resources. It is therefore necessary to find the condition under which the shared query engine can simultaneously serve all the queries with QoS guarantees.
Table 1. QoS-Guaranteed Scheduling Algorithm

Scheduling Algorithm
1: while (queues not empty) {
2:   for i = 1 to N {
3:     if (SDi(n) of ready tuple not calculated)
4:       calculate SDi(n);
5:     assign priorities based on SDi(n);
6:   } //endfor
7:   schedule tuple with the highest priority to execute;
8: } //endwhile
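As a minimal sketch of how the loop in Table 1 can be realized, the fragment below (our own code; the deadline computation of Eq. 10 is abstracted into a callback, and all names are ours) picks, in each iteration, the ready tuple with the smallest scheduling deadline.

# Minimal sketch of the Table 1 loop (names and the deadline callback are ours):
# earliest-scheduling-deadline-first over per-query FIFO queues.
import heapq

def run_scheduler(queues, scheduling_deadline, process):
    """queues: {query_id: list of ready tuples, oldest first}.
    scheduling_deadline(query_id, tup) plays the role of SD_i(n) in Eq. (10);
    process(query_id, tup) executes the query on one tuple."""
    while any(queues.values()):
        heap = []
        for qid, q in queues.items():               # steps 2-5: compute SDs
            if q:
                heapq.heappush(heap, (scheduling_deadline(qid, q[0]), qid))
        _, qid = heapq.heappop(heap)                # step 7: smallest deadline wins
        process(qid, queues[qid].pop(0))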
Because of their different operator functions, the processing costs of different queries differ from one another, so the service received by each query should be measured in a uniform, comparable unit rather than as the number of tuples served by each query. Therefore, the computation resource is standardized as follows when allocated to different queries. Assume the query engine is able to perform C basic operations per time slot, and that it takes Ci basic operations for query Qi to serve a tuple. Then, for a tuple input to query Qi, we have Δi = Ci / C. We assume the statistical cost of queries and the capability of the query engine are known in advance; this can be achieved by monitoring the data stream processing during a warm-up stage.

Theorem 3 (Feasible Allocation). Consider a query engine with a capacity of C basic operations per time slot that serves N continuous queries. If the input stream to query Qi is αi-smooth, then the scheduling algorithm guarantees service curve βi to query Qi for i = 1, ..., N, if Equation 11 is satisfied:

∑_{i=1}^{N} ((αi ⊗ βi)(t) × Ci) ≤ C × t,  ∀ t ≥ 0    (11)

Without any assumption on the characteristics of the input streams, a simpler sufficient condition is Equation 12:

∑_{i=1}^{N} (βi(t) × Ci) ≤ C × t,  ∀ t ≥ 0    (12)
The proof of Theorem 3 follows the proof of Theorem 2.3.1 in [6]. The feasible allocation theorem above gives us the principle for the admission control policy applied when users register continuous queries. For generality, we can extend it to a system with variable capacity, meaning that the maximum number of basic operations the system can perform during a slot varies with time. If ∑_{i=1}^{N} Bi(τ) > 0 for all τ satisfying t1 ≤ τ ≤ t2, the system is called continuously backlogged in the interval [t1, t2]. We define the service capacity as follows.

Definition 4 (Capacity Curve). We say that a DSMS has a capacity curve C(t) if, within any interval [t1, t2] in which the system is continuously backlogged, the amount of basic operations that the system performs is at least C(t2 − t1).
With the definition of the capacity curve, the right-hand sides of Equation 11 and Equation 12 in the feasible allocation condition can be generalized to C(t2 − t1). The fixed-rate server above, i.e., a system with a fixed capacity of C (a constant) basic operations per slot, is just the special case of a system with capacity curve C(t) = C × t.
4 QoS-Guaranteed Batch Processing

According to the scheduling algorithm proposed in Section 3, the query engine performs scheduling in a tuple-at-a-time (abbreviated ATAT) fashion, which has relatively low efficiency: the high volume of input tuples over data streams leads to heavy scheduling overheads. Batch scheduling, which groups several tuples of a query into one scheduling unit, is an efficient approach for real-time processing because it lowers scheduling overheads effectively compared with the ATAT strategy. Many batch scheduling strategies have been proposed for data stream processing. However, when guaranteed service must be provided, batch processing has to be reviewed carefully so that the required QoS is still satisfied. Assume the batch size is b, allocate to each batch a priority equal to the priority of the last tuple in the batch, and let the scheduler serve the batches present in the system in the order of their priorities. Then the departure time of any tuple in the system is bounded by its departure time in the ATAT system plus b × Cmax / C, where Cmax is the maximum per-tuple processing cost across all the queries and C is the system capability. In order to guarantee the pre-specified QoS requirements, each tuple in a batch should have its scheduling deadline intentionally moved b × Cmax / C ahead. In other words, because of the effect of batch processing, the service curve of each query should be modified to βi(t + b × Cmax / C) for i = 1, ..., N. Considering the constraint of the feasible allocation theorem, the batch size b can be determined with Equation 13 or Equation 14:

∑_{i=1}^{N} ((αi(t) ⊗ βi(t + b × Cmax / C)) × Ci) ≤ C × t,  ∀ t ≥ 0    (13)

∑_{i=1}^{N} (βi(t + b × Cmax / C) × Ci) ≤ C × t,  ∀ t ≥ 0    (14)
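Since no procedure for choosing b is spelled out above, the following sketch (ours) simply searches for the largest b for which the inequality of Equation 14 holds on a finite horizon of slots; the horizon and the search bound are assumptions of ours.

# Sketch (ours): largest batch size b satisfying condition (14) on a finite horizon.

def satisfies_14(b, betas, costs, C, C_max, horizon):
    """betas: list of service-curve functions beta_i; costs: per-tuple costs C_i."""
    shift = b * C_max / C
    return all(
        sum(beta(t + shift) * Ci for beta, Ci in zip(betas, costs)) <= C * t
        for t in range(horizon + 1)
    )

def max_batch_size(betas, costs, C, C_max, horizon=1000, b_limit=1000):
    best = 0
    for b in range(1, b_limit + 1):          # feasibility is monotone in b
        if satisfies_14(b, betas, costs, C, C_max, horizon):
            best = b
        else:
            break
    return best

With delay-type service curves such as δd, the condition at t = 0 simply requires the shift to stay within the latency, i.e. b × Cmax / C ≤ d.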
5 Experiment Analysis

In our experiments, we simulate three continuous queries with different QoS requirements. For simplicity, the input rates of all the queries are constant. The input rates of Q2 and Q3 are both 200 tuples per second, and we vary the input rate of Q1 from 100 tuples per second to 700 tuples per second. The QoS requirements of each query are as follows:
– Q1 requires that the output rate be at least 300 tuples per second.
– Q2 requires that the delay for any tuple be bounded by 4 ms.
– Q3 asks for a long-term output rate of 200 tuples per second; however, at the beginning of a backlogged period tuples should have a response delay of no more than 2.5 ms.
Fig. 4. Service Curves: (a) service curve of Q1; (b) service curve of Q2; (c) service curve of Q3
(1) Modelling QoS. The queries with QoS requirements can be modelled as in Table 2, and the service curves are shown in Figure 4.

Table 2. QoS Specification of Queries

       Arrival Curve                    Service Curve
Q1     α1 = r1 t (r1 = 0.1 ∼ 0.7)       β1 = 0.3t
Q2     α2 = 0.2t                        β2 = δ4
Q3     α3 = 0.2t                        β3(t) = 0.4t for 0 ≤ t ≤ 5; 0.2t + 1 for t > 5
Note that the millisecond (ms) is used as the time unit.

(2) Admission Control Check. The computation cost per tuple of all the queries is known to be 1 tuple/ms with a response delay jitter of 0.2 ms, so Δi = Ci / C = 1 for i = 1, 2, 3. The min-plus convolutions of the arrival and service curves are:

(α1 ⊗ β1)(t) = r1 t if r1 ≤ 0.3; 0.3t if r1 > 0.3
(α2 ⊗ β2)(t) = 0.2(t − 4)
(α3 ⊗ β3)(t) = 0.2t

We therefore have:

∑_{i=1}^{3} (αi ⊗ βi)(t) = (0.4 + r1)t − 0.8 if r1 ≤ 0.3; 0.7t − 0.8 if r1 > 0.3

Despite the variation of the input rate of Q1, it holds that

∑_{i=1}^{3} ((αi ⊗ βi)(t) × Ci) < C × t,  ∀ t > 0.

Therefore, according to the feasible allocation condition, we conclude that all three queries can be admitted to the system with their QoS requirements satisfied simultaneously.
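The following sketch (ours) re-checks this admission condition numerically; with Ci = C (1 tuple per ms) Equation 11 reduces to the sum of the convolutions staying below t. The clamp at zero for t < 4 is our addition.

# Sketch (ours): numerical re-check of Eq. (11) for the three example queries.
# Time unit is ms and C_i = C, so the condition reduces to sum_i (alpha_i ⊗ beta_i)(t) < t.

def conv1(t, r1):                  # (alpha1 ⊗ beta1)(t)
    return r1 * t if r1 <= 0.3 else 0.3 * t

def conv2(t):                      # (alpha2 ⊗ beta2)(t) = 0.2(t - 4); clamp is ours
    return 0.2 * max(t - 4, 0)

def conv3(t):                      # (alpha3 ⊗ beta3)(t) = 0.2 t
    return 0.2 * t

def admissible(r1, horizon=10_000):
    return all(conv1(t, r1) + conv2(t) + conv3(t) < t for t in range(1, horizon + 1))

assert all(admissible(r1 / 10) for r1 in range(1, 8))   # r1 = 0.1 ... 0.7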
(3) QoS-Guaranteed Algorithm Analysis. To evaluate our QoS-guaranteed scheduling algorithm, we also implement a FIFO algorithm for comparison.
Fig. 5. QoS-Miss Ratio of Q1
Fig. 6. QoS-Miss Ratio of Q2
The experimental results illustrated by Figures 5-7 show that the QoS requirements of all the admitted queries are guaranteed under our QoS-guaranteed scheduling algorithm. Under FIFO scheduling, by contrast, when the input rate of Q1 increases, more tuples with relatively earlier arrival times appear at the input of Q1. The query engine therefore becomes over-occupied with processing the tuples of Q1; as a result, the input tuples of Q2 and Q3 are scheduled less often, which increases their QoS-miss ratios. The experiment above suggests that the service-curve-based approach is effective for modelling and guaranteeing QoS requirements.

Fig. 7. QoS-Miss Ratio of Q3
6 Related Work

Recently, there has been considerable research activity on stream systems and data stream algorithms. Systems that incorporate stream processing include Aurora [7][8], NiagaraCQ [9], STREAM [10], TelegraphCQ [11], and many others. QoS focuses on user-perceived qualities, and QoS-aware data management has been a hot topic in the data management community [12][13]. In the data stream research community, QoS was first proposed in [8]. The QoS requirements in Aurora are two-dimensional QoS graphs, specified by the application administrator, that give the utility of the output in terms of several performance-related attributes. However, Aurora attempts to maximize the perceived QoS for the output it produces; in other words, this is a posteriori QoS, and it cannot provide deterministic performance guarantees. Moreover, Aurora's
notion of QoS is general and is not restricted to particular types of graphs. Qstream [14] suggests the idea of a priori QoS for data stream processing; however, its performance requirements are ensured by resource reservation, which incurs considerable resource idleness and is therefore less flexible. In addition, network calculus is a set of rules and results that apply min-plus algebra to packet networks. Fundamental work was pioneered by Cruz [6][15][16], and it has been extended and simplified, independently and simultaneously, in equivalent forms by Sariowan [5], who gives a formal and general treatment of the concepts of arrival curves and service curves, two fundamental network calculus tools. The mathematics involved in network calculus uses min-plus algebra, as described in [16].
7 Conclusions and Future Work

In this paper we focus on modelling and guaranteeing quality of service for data stream processing based on the theory of network calculus. A QoS-guaranteed DSMS architecture is proposed, supported by QoS modelling and QoS guarantee mechanisms, and a QoS-guaranteed batch processing scheme is described. The current work is based on query-level service provision; we will try to push it to the operator level, which might improve scheduling efficiency. Moreover, for some applications it is desirable that a query be able to renegotiate its QoS requirements on the fly, so we plan to provide QoS guarantees while renegotiating the arrival and service curves of a query on the fly. In addition, since input streams tend to be bursty and high-volume, we plan to perform arrival-curve-based load shedding, which discards tuples from the input streams to make them conform to the required burstiness constraints without breaking the negotiated QoS requirements.
References
1. D. Carney, U. Cetintemel, et al. Operator Scheduling in a Data Stream System. In Proc. of the 29th VLDB Conf., Berlin, Germany, Sep. 2003, 838-849.
2. R. Avnur and J. M. Hellerstein. Eddies: Continuously Adaptive Query Processing. In Proc. of the ACM SIGMOD, Dallas, TX, May 2000, 261-272.
3. B. Babcock, S. Babu, M. Datar, and R. Motwani. Chain: Operator Scheduling for Memory Minimization in Data Stream Systems. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, San Diego, CA, 2003, 253-264.
4. R. L. Cruz. Quality of service guarantees in virtual circuit switched networks. IEEE J. Select. Areas Commun., 1995(13): 1048-1056.
5. H. Sariowan. A service curve approach to performance guarantees in integrated service networks. Ph.D. dissertation, Univ. Calif. San Diego, 1996.
6. J.-Y. Le Boudec, P. Thiran. Network Calculus. Springer Lecture Notes in Computer Science, Vol. 2050, 2001.
7. D. J. Abadi, D. Carney, U. Cetintemel, et al. Aurora: A New Model and Architecture for Data Stream Management. VLDB Journal, 12(2), 2003.
8. D. Carney, U. Cetintemel, M. Cherniack, et al. Monitoring streams - a new class of data management applications. In Proceedings of the VLDB Conference, Hong Kong, China, August 2002, 215-226.
9. J. Chen, D. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for Internet databases. In Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, May 2000, 379-390.
10. R. Motwani, J. Widom, A. Arasu, et al. Query Processing, Resource Management, and Approximation in a Data Stream Management System. In CIDR Conference, Asilomar, CA, January 2003, 245-256.
11. S. Chandrasekaran, O. Cooper, A. Deshpande, et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In CIDR Conference, Asilomar, CA, January 2003, 269-280.
12. K. Kang, S. H. Son, J. A. Stankovic, et al. A QoS-Sensitive Approach for Timeliness and Freshness Guarantees in Real-Time Databases. In Proc. of the 14th Euromicro Conference on Real-Time Systems, 2002, 203-212.
13. Y. Liu, X. Liu, L. Xiao, et al. Location-Aware Topology Matching in P2P Systems. In Proc. of IEEE INFOCOM 2004, Hong Kong, Mar. 2004, http://www.cse.msu.edu/liuyunha/INFOCOM04-yunhao.pdf.
14. S. Schmidt, H. Berthold, and W. Lehner. Qstream: Deterministic querying of data streams (demo). In Proc. of the 30th VLDB, Toronto, Canada, August 30-September 3, 2004, 1365-1368.
15. R. L. Cruz. A calculus for network delay, part I: Network elements in isolation. IEEE Trans. Inform. Theory, vol. 37, Jan. 1991, 114-131.
16. R. L. Cruz. A calculus for network delay, part II: Network analysis. IEEE Trans. Inform. Theory, vol. 37, Jan. 1991, 132-141.
Research on Prediction Models over Distributed Data Streams Li Tian, AiPing Li, and Peng Zou National Laboratory for Parallel and Distributed Processing Changsha, Hunan, China 410073 [email protected]
Abstract. A framework is presented to provide a mechanism to maintain adaptive prediction models established both on the coordinator and remote nodes in distributed data stream processing for reducing communication consumption. The coordinator employs these models to answer registered queries, while the remote nodes check whether the prediction value is close to the actual value or not. Update messages are needed only when there’s a large deviation between prediction value and actual value. Three particular prediction models are given and compared with existent ones. Analytical and experimental evidence show that the proposed approach performs better both on overall communication cost reduction and prediction query processing. Keywords: Prediction model, Data stream, Network consumption.
transmit to the coordinator, saving a great amount of communication while still guaranteeing sufficient precision of the query results. Moreover, prediction query processing over future data streams is needed in some situations [7]. However, recent work has focused mainly on approximate queries over newly arriving data [1,2]; in this paper, prediction models are proposed to process prediction queries. The rest of this paper is organized as follows. Related work is reviewed in Section 2. In Section 3, prediction models are introduced into a distributed data stream processing architecture and a common framework is proposed. Three adaptive prediction models are presented in Section 4 and compared theoretically. In Section 5, experimental evidence shows that our approach significantly reduces the overall communication cost and performs well on prediction queries. Finally, Section 6 describes conclusions and future work.
2 Related Works

Here we describe works which, like ours, consider the trade-off between accuracy and communication for monitoring a limited class of continuous queries (at a coordinator site) over distributed streams (at remote sites). Apart from prediction methods, there are several approaches by which approximate answers of sufficient precision are computed from a small fraction of the input streams. Das et al. [8] discuss monitoring of approximate set-expression cardinalities over physically distributed element streams. An adaptive filter adjustment algorithm is developed in [4], which adapts to changing conditions to minimize stream rates while guaranteeing that all continuous queries still receive the updates necessary to provide answers of adequate precision at all times. A similar idea is proposed by Ankur Jain et al. [5], who specifically choose a Kalman filter among the linear estimation methods. The most relevant works [6,9,11] propose to use predictors in data stream processing. Cormode et al. [6] present algorithmic solutions for tracking a broad class of complex data-analysis queries over a distributed streams setting to specified accuracy, relying on a combination of general-purpose randomized sketches and concise sketch-prediction models. Several adaptive prediction techniques are studied in [9], all of them based on linear functions. An approach with prediction functions is presented in [11] to process monitoring queries in a network of sensors; the authors consider sensors that communicate their values according to a threshold policy and use prediction functions to compare tuples efficiently and generate answers even in the absence of new incoming tuples. Our work differs from all of the above mainly in proposing a general architecture that provides a mechanism to maintain adaptive prediction models which significantly reduce communication cost over the distributed environment while still guaranteeing sufficient precision of query results. The prediction models proposed in [6] are analyzed and improved in this paper with respect to their shortcomings, and the efficiency and veracity of the different models are compared both theoretically and experimentally.
3 Data Stream Processing with Prediction Models

3.1 Problem Formulation

As illustrated in Fig. 1, a distributed data stream processing architecture comprises a collection of k remote sites node(j) (j ∈ {1,..,k}) and a designated coordinator site c. Generally, direct communication between the node(j) is not allowed; instead, node(j) exchanges messages only with c. Object values f1,..,fs from corresponding domains Ui (i ∈ {1,..,s}) are monitored by the system. Assuming a global time in the system, fi,j(t) denotes the observed value of fi at node(j).
Fig. 1. Illustration of distributed data stream processing
An approach based on prediction models is proposed for the above distributed environment to reduce bandwidth consumption, and the related definitions are as follows.

Definition 1 Prediction Model. A prediction model PMi is a function from Ui* to Ui, where Ui is the domain of fi and Ui* is the set containing all the subsets of Ui.

Definition 2 Prediction Value. Using Fi(t)={fi(x)|x
Table 1. Model and algorithm symbols

Symbol        Meaning
node(j)       Remote sites (j ∈ {1,..,k})
Oi            Data object (i ∈ {1,..,s})
Ui            Domain of fi
fi,j(t)       Observed value of fi in node(j) at time t
fi,j^p(t)     Prediction value of fi in node(j) at time t
ti,j^prev     Time the last communication occurred
PMi.tbegin    Time from which PMi becomes effective
vi,j          Velocity parameter of PMi
ai,j          Acceleration parameter of PMi
θi            Upper bound on the deviation between observed value and prediction value
c             Coordinator site
fi            Concerned character of Oi
t             The current system time
PMi           Prediction model of fi
Δdi,j         Δdi,j = fi,j(t) − fi,j^p(t)
Δt            Δt = t − ti,j^prev
PMi.texp      Expiry time of PMi
tp            Update tuples transmitted
Q             A continuous query
δQ            Precision constraint of Q
3.2 Architecture

In this section, we explain the proposed architecture, shown in Fig. 2.
Fig. 2. System architecture
Remote Nodes. The remote nodes, on the right, each maintain values for one or more objects, and they generate streams of significant updates for those values according to the corresponding prediction model and the local data stream from the data sources. Since the local data stream may change as time passes, a prediction model that represents the stream exactly at a given time may be invalidated after a period of time. PM.tbegin and PM.texp indicate the beginning and expiry time of a prediction model. We propose that remote nodes transmit update tuples to the stream coordinator whenever: 1) the difference between the prediction value and the observed value exceeds a certain precision threshold (correction update), or 2) the time elapsed since the last update has exceeded the effective period of PM, that is t > PM.texp (valid update). T = PM.texp − PM.tbegin indicates the amount of time during which a prediction model can be applied: beyond that period, the prediction model is not reliable and it can be assumed that the remote node is unable to communicate new updates (e.g., it is
not alive). A valid update indicates that the corresponding prediction model describes the stream well, and therefore the validity period of the model can be extended, while a correction update adjusts the parameters of the corresponding prediction model and shrinks its validity period. Update tuples transmitted from remote nodes to the coordinator are logically of the form tp = <node(j), Oi, type, fi,j(t)>, where node(j) and Oi are the identifiers of a remote node and the concerned object, respectively; type equals either CORR or VALI, indicating whether tp is a correction update or a valid one; and fi,j(t) is the observed value of Oi in node(j) at the current time instant t.

Stream Coordinator. It receives update-tuples, manages prediction model copies, controls precision issues, and provides answers to client modules.

Prediction Manager. The main functions of the prediction manager are: 1) receiving update-tuples from remote nodes; 2) maintaining the copies of prediction models in the coordinator so that they remain valid, adaptive and consistent with the corresponding remote ones; 3) deciding which input tuples should be stored (i.e., they may be needed to answer queries with window-join constraints [11]).

Precision Manager. A precision manager is introduced to manage precision issues. We allow any number of arbitrary CQs, each with its individual precision constraint δQ. The challenge is to ensure that at all times the approximate answer to every Q is of adequate precision, i.e., can be guaranteed within δQ, while dropping local tuples as much as possible to minimize the total communication cost. In this paper, we do not delve into the details of how precision constraints are specified, nor do we make any assumption about the way precision allocations are performed [4]; these remain open problems to be studied in future work.

Buffer. There are two kinds of buffer in the stream coordinator: buffers of tuples and prediction model buffers. The former store received update-tuples that will be needed in the future (e.g., to rebuild the prediction models or to answer window-join constrained queries [11]), and the latter store copies of the prediction models used by the Continuous Query Evaluator to answer user queries.

Continuous Query Evaluator. A Continuous Query Evaluator in the stream coordinator receives users' registration requests for CQs and provides continuous approximate answers to them according to the concerned prediction models. An example of a continuous query specification is shown below:

SELECT AVG(D_s.c)
FROM Data_Stream D_s
WHERE (D_s.a < v1 and D_s.b > v2)
BEGIN begin_time
END end_time

Current research focuses mainly on queries over current and historical data and pays little attention to future data (where the begin_time is the current system time and the end_time is a future time instant), which we call prediction queries.
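A small sketch of the remote-node side of this protocol follows (our own code; the relative-deviation test anticipates the threshold policy used later in Section 5, and the pm.predict / pm.t_exp fields are assumed names).

# Sketch (ours) of the remote-node update policy: CORR when the deviation exceeds
# the precision threshold, VALI when the model's validity period has expired,
# otherwise the tuple is dropped locally.

def on_new_value(pm, node_id, obj_id, observed, now, theta):
    predicted = pm.predict(now)                       # assumed model interface
    deviation = abs(observed - predicted) / max(abs(observed), 1e-9)
    if deviation >= theta:
        return (node_id, obj_id, "CORR", observed)    # correction update
    if now > pm.t_exp:
        return (node_id, obj_id, "VALI", observed)    # valid update
    return None                                       # prediction good enough: drop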
The CQ Evaluator generates an output data stream of tuples that satisfy the required constraints (or are predicted to satisfy them in the future, depending on whether the registered query is a prediction query or not). For normal queries (those whose end_time is not a future time instant), the evaluator behaves as usual. For prediction queries, however, the format of an output tuple is <fi^p(t), δQ, VI>, where VI = MIN{PMi.texp | Oi is involved in the query's WHERE clause} is the validity mark of the output predicted tuple, indicating the time interval during which the prediction value is available. The stream coordinator processes prediction queries in this way.

Client Module. A client module receives the approximate answers released by the CQ Evaluator for further processing as needed. Our approach fits naturally with different types of client modules, such as: (1) a sampling module that periodically transforms the data stream into relations that are shown to a user as snapshots, at different time instants, of the values that satisfy the constraints; (2) a trigger processing module that requires a certain action when an event is detected; (3) other users or applications that register queries and obtain continuous approximate answers with certain precision guarantees.
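For prediction queries, the validity mark can be computed directly from the model copies held by the coordinator; a short sketch (ours, with assumed model fields) follows.

# Sketch (ours): build the prediction-query answer tuple <f_p(t), deltaQ, VI>,
# where VI is the smallest expiry time among the models touched by the query.

def prediction_answer(models, where_objects, obj, t, delta_q):
    """models: {obj_id: pm with predict(t) and t_exp}; where_objects: objects in
    the WHERE clause; obj: the object whose predicted value is returned."""
    vi = min(models[o].t_exp for o in where_objects)
    return (models[obj].predict(t), delta_q, vi)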
4 Prediction Models

In this section, we propose three adaptive prediction models: the static model, the linear model and the acceleration model. As in [6], all three models consider how the local distribution fi,j changes (as a function of time) between the time of the last communication to the coordinator, ti,j^prev, and the current time t; but the models proposed in this paper are improved versions of those in [6].

4.1 Static Model

The static model essentially assumes that fi,j remains static over time (fi,j^p(t) = fi,j(ti,j^prev)). The following analysis indicates that the static model is useful in some situations.

Theorem 1. Assume N is a random variable which denotes the number of update tuples, and suppose the tuples are observed r times. Then the probability that the r-th observed value is larger/smaller than the first r − 1 observed values is 1/r.

Theorem 2. Assume N is a random variable which denotes the number of update tuples, and suppose the tuples are observed r times. Then

lim_{r→∞} (∑_{x=1}^{r} Nx) / r = AVG(N)
where Nx is the x-th observed value of the variable N, and AVG(N) is the exact average value of N. Proofs of these theorems are omitted due to space limitations [7]. Theorem 1 indicates that the probability that a newly arrived tuple becomes the max/min value decreases as r increases. In other words, if we assume that max{N} = max{N1, N2, … , Nr},
the assumption will be imprecise with probability 1/r. When r is large, the assumption is almost always accurate. A similar analysis can be performed for the AVG aggregation operation according to Theorem 2.

4.2 Linear Model

The linear-growth model (LGM) proposed in [6] predicts the future behavior of the stream as a linear scaling of the entire history of the distribution. It imposes a linear, uniform rate of change over the entire frequency distribution vector and thus cannot capture or adapt to shifts and differing rates in the distribution of updates over the vector. However, in many real-life scenarios, only the recent history of the stream may be relevant for such predictions. Considering that fi,j(ti,j^prev) is the exact value of fi,j when t = ti,j^prev, we regard the deviation between the observed value and the prediction value (Δdi,j = fi,j(t) − fi,j^p(t)) as the accumulation, from the last communication to now (Δt = t − ti,j^prev), caused by the inaccurate change rate vi,j. Thus we propose the linear model (LM) as fi,j^p(t) = fi,j(ti,j^prev) + vi,j Δti,j, and the corresponding update algorithm is shown in Algorithm 1.

Algorithm 1. UpdateLinearModel (tp)
Input: an update tuple tp
Output: the adjusted prediction model, and prediction tuples for each query, if needed
  PM := PredictionManager.GetPM(tp.node(j), tp.Oi);
  If tp.type = CORR then   // correction update
    Δdi,j := tp.fi,j(t) − PM.fi,j(ti,j^prev);
    Δti,j := t − PM.ti,j^prev;
    PM.vi,j := PM.vi,j + Δdi,j / Δti,j;
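A coordinator-side sketch of the LM update follows (ours). Note that we compute Δd against the LM prediction, following the definition Δdi,j = fi,j(t) − fi,j^p(t) given earlier, and we reset the reference point to the new observation; the handling of VALI updates is not shown in the listing above, so that part is our assumption.

# Sketch (ours) of the LM correction update; field names (value, t_prev, v) stand
# for f_ij(t_prev), t_prev_ij and v_ij. Deviation is taken against the prediction,
# per the definition of delta-d, and the reference point moves to the observation.

def update_linear_model(pm, tp, now):
    node_id, obj_id, kind, observed = tp
    if kind == "CORR":
        dt = now - pm.t_prev
        dd = observed - (pm.value + pm.v * dt)   # deviation from the LM prediction
        pm.v += dd / dt                          # v := v + delta_d / delta_t
    pm.value, pm.t_prev = observed, now          # assumed: reset reference point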
4.3 Acceleration Model

A richer prediction model, the velocity/acceleration model (VAM), is proposed in [6]; it uses more parameters to better fit changing data distributions and can be described as fi,j^p(t) = fi,j(ti,j^prev) + vi,j Δti,j + (Δti,j)² ai,j. The authors propose to compute the velocity component by maintaining a window of the W most recent updates to fi,j in remote node(j), scaling the difference between the newest and oldest updates stored in the window by the time difference, and finally transmitting the velocity parameter to the coordinator [6]. This scheme suffers from at least two significant shortcomings. First, additional storage and CPU costs are needed for storing the W most recent updates and computing the velocity parameter, which increases the load on the remote nodes. Second, the velocity parameter has to be transmitted to the coordinator c, which increases the communication cost.
Similar to the linear model, we regard the deviation Δdi,j as the cumulative result of the inaccuracy of the model parameters (vi,j and ai,j) during the period Δti,j. The problem is how to express this deviation in terms of the model parameters. Assume that the prediction function is fi,j^p(t) = fi,j(ti,j^prev) + vi,j Δti,j + (Δti,j)² ai,j, and that the exact distribution can be described by fi,j(t) = fi,j(ti,j^prev) + vi,j* Δti,j + (Δti,j)² ai,j*. If vi,j = vi,j*, the deviation is caused only by the imprecision of the acceleration parameter. Letting Δai,j = ai,j* − ai,j, we have
Similarly, if ai,j = ai,j*, the deviation is caused only by the imprecision of the velocity parameter, and we have

ai,j* = ai,j ,  vi,j* = vi,j + 2 ai,j × Δti,j + Δdi,j / Δti,j    (3)
In general, the inaccuracies of both vi,j and ai,j contribute to the deviation Δdi,j. We make the approximating assumption that each of the two parameters accounts for a deviation of Δdi,j / 2; the update algorithm of our acceleration model (AM) then follows from formulas (1)~(3) and is shown in Algorithm 2.

Algorithm 2. UpdateAccelerationModel (tp)
Input: an update tuple tp
Output: the adjusted prediction model, and prediction tuples for each query, if needed
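Algorithm 2's body is not reproduced in the text above; the fragment below is one possible reading (ours, not the authors'), splitting the deviation evenly between the two parameters and applying the velocity correction of formula (3); the acceleration correction Δd/(2Δt²) is our inference for the symmetric case.

# One possible reading (ours) of the AM update under the delta-d/2 split; the
# acceleration term is inferred by symmetry and is not given in the text.

def update_acceleration_model(pm, observed, now):
    dt = now - pm.t_prev
    predicted = pm.value + pm.v * dt + pm.a * dt * dt
    dd = observed - predicted                     # delta-d
    pm.v += 2 * pm.a * dt + (dd / 2) / dt         # velocity part, as in formula (3)
    pm.a += (dd / 2) / (dt * dt)                  # assumed acceleration part
    pm.value, pm.t_prev = observed, now           # assumed: reset reference point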
4.4 Comparison and Analysis

Table 2 summarizes the key points of the prediction models mentioned above.
Table 2. Comparison of prediction models

Static Model: prediction value fi,j(ti,j^prev); storage cost fi,j(ti,j^prev); communication cost none.
Linear-growth model (LGM): prediction value (t / ti,j^prev) · fi,j(ti,j^prev); storage cost fi,j(ti,j^prev), ti,j^prev; communication cost none.
Velocity/Acceleration Model (VAM): prediction value fi,j(ti,j^prev) + Δti,j vi,j + (Δti,j)² ai,j; storage cost the W most recent fi,j in node(j), plus fi,j(ti,j^prev), ti,j^prev, vi,j, ai,j; communication cost vi,j.
Linear Model (LM): prediction value fi,j(ti,j^prev) + Δti,j vi,j; storage cost fi,j(ti,j^prev), ti,j^prev, vi,j; communication cost none.
Acceleration Model (AM): prediction value fi,j(ti,j^prev) + Δti,j vi,j + (Δti,j)² ai,j; storage cost fi,j(ti,j^prev), ti,j^prev, vi,j, ai,j; communication cost none.
Compared with LGM [6], the LM proposed in this paper adds the storage cost of the parameter vi,j in order to capture shifts and differing rates in the distribution of updates. The time and space complexity of LM is O(s × k), where s and k are the numbers of objects concerned and of remote nodes, respectively. Compared with VAM [6], the AM proposed in this paper avoids storing the W most recent fi,j in node(j) and reduces the space complexity from O(W × s × k) to O(s × k). The model parameters can be computed from the deviation Δdi,j and no additional communication cost is needed, saving O(s × k) network transmissions. However, the assumption that the inaccuracies of vi,j and ai,j contribute equally to the whole deviation will affect the prediction veracity to some extent.
5 Experiments

A simulation system was implemented in Delphi 7 to test the prediction veracities of all the proposed models and their effects on reducing communication cost. Experiments were performed on an Intel Pentium 4, 1.7 GHz CPU, 512 MB RAM, running Windows XP. Two kinds of data sets were used in the experiments, as shown in Table 3. Data stream 1 is synthetic, where the parameters 0 ≤ pi < 10 (i = 1..5) are randomly generated every 100 seconds to simulate streams with different distributions. Data stream 2 is ocean air temperature data from 1996 to 2005 measured by TAO (Tropical Atmosphere Ocean) [10]. A threshold policy is adopted for precision control, and update communication is needed when

(fi,j(t) − fi,j^p(t)) / fi,j(t) ≥ θ.

Table 3. Streams used during the experiments

Stream   Description
1        f(t) = p1 Δt³ + p2 Δt² + p3 Δt + p4 + p5 Δt
2        1996~2005 ocean air temperature data measured by TAO [10]
5.1 Prediction Veracity

Fig. 3 shows the effect of varying the parameter θ on the prediction veracity of the models. It can be seen that when θ < 1%, the veracity increases rapidly as the parameter grows, while when θ > 1% the increase slows down. Therefore, the following experiments are performed with θ = 1%.
Fig. 3. Relationship between θ and prediction veracity on data stream 1
We investigated the prediction veracity of the different models on different data streams.
Fig. 4. Prediction veracity of models (A) on data stream 1 (B) on data stream 2
As shown in Fig. 4 (A), due to the nonlinear character of data stream 1, the prediction veracities of the acceleration models are higher than those of the linear ones. Fig. 4 (B) indicates that the linear models perform better than the acceleration ones on data stream 2 because of its stable random distribution. The two illustrations indicate that, compared with LGM [6], the LM proposed in this paper adapts well to shifts and differing rates in the distribution of the data stream and therefore performs noticeably better on prediction veracity. However, the AM performs slightly worse on prediction veracity than VAM does because of the approximate processing mentioned above.
5.2 Communication Cost

Two factors play important roles in reducing communication cost: the prediction veracity and the update policy applied when a prediction fails. Let p denote the veracity of a prediction model PM, and let the proportion between redundant and necessary communication in an update policy be denoted by m:n. The total communication cost can then be computed by the following formula:

C_Cost_PM(p, m, n) = ((1 − p) × (m + n) / n) × 100%    (4)
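Equation (4) transcribed directly (the example values below are ours):

# Direct transcription of Eq. (4): communication cost from prediction veracity p
# and the redundant:necessary ratio m:n of an update.
def communication_cost(p, m, n):
    return (1 - p) * (m + n) / n * 100.0       # percent

print(communication_cost(0.9, 17, 64))         # e.g. VAM's m:n = 17:64 -> ~12.7%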
Object values in the experiment streams are all float values. Recall that the update-tuple is tp = <node(j), Oi, type, fi,j(t)>; in the experiments, node(j) and Oi are each stored in 16 bits, the parameter type, which is the only additional communication cost introduced, is represented by 1 bit, and fi,j(t), the observed float value, is represented by 32 bits. For the static model, LGM, LM and AM, m:n = 1:64, while for VAM the value of m:n is 17:64 because an extra parameter vi,j, stored in 16 bits, has to be transmitted to the coordinator when a correction update occurs. We investigated the communication cost of the different models on different data streams according to the above analysis. Both illustrations in Fig. 5 indicate that, benefiting from its higher prediction veracity, LM performs considerably better than LGM in reducing communication cost. Although AM is slightly worse than VAM in veracity, it omits the additional communication of the parameter vi,j and therefore has similar (on data stream 1) and sometimes even better (on data stream 2) performance in reducing communication cost.
Fig. 5. Effect of models on reducing the communication cost (A) on data stream 1 (B) on data stream 2
6 Conclusions and Future Work

In this paper we have described a framework for data stream processing with prediction models in a typical distributed environment, which provides a mechanism to maintain adaptive prediction models that significantly reduce update stream rates and process prediction queries over future data streams. Three improved adaptive prediction models are proposed; theoretical analysis and experimental evidence
indicate that our approach significantly reduces the overall communication cost and performs well on prediction queries. In future work, we plan to conduct more extensive experiments and to analyze precision control and allocation issues. Research on effective prediction models for streams with different distribution characteristics remains an open problem.

Acknowledgements. This work is supported by the National High-Tech Research and Development Plan of China ("863" plan) under Grant No. 2004AA112020 and No. 2005AA112030, and the National Weapon Equipment Pre-research Fund under Grant No. 51415010304KG0175.
References
1. Jin, C.Q., Qian, W.N., Zhou, A.Y.: Analysis and Management of Streaming Data: A Survey. Journal of Software. 15 (2004) 1172-1181
2. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. ACM SIGMOD. Madison, WI, USA (2002)
3. Cranor, C., Johnson, T., Spatscheck, O., Shkapenyuk, V.: Gigascope: A Stream Database for Network Applications. ACM SIGMOD. San Diego, California (2003)
4. Olston, C., Jiang, J., Widom, J.: Adaptive Filters for Continuous Queries over Distributed Data Streams. ACM SIGMOD. San Diego, California (2003)
5. Jain, A., Chang, E.Y., Wang, Y.F.: Adaptive Stream Resource Management Using Kalman Filters. ACM SIGMOD. Canada (2004)
6. Cormode, G., Garofalakis, M.: Sketching Streams Through the Net: Distributed Approximate Query Tracking. VLDB. Trondheim, Norway (2005)
7. Li, J.Z., Guo, L.J., Zhang, D.D., Wang, W.P.: Processing Algorithms for Predictive Aggregate Queries over Data Streams. Journal of Software. 16 (2005) 1252-1261
8. Das, A., Ganguly, S., Garofalakis, M., Rastogi, R.: Distributed Set-Expression Cardinality Estimation. VLDB. Toronto, Canada (2004)
9. Kumar, V., Cooper, B.F., Navathe, S.B.: Predictive filtering: A learning-based approach to data stream filtering. International Workshop on Data Management for Sensor Networks (2004)
10. Pacific Marine Environmental Laboratory. Tropical atmosphere ocean project. http://www.pmel.noaa.gov/tao (2005)
11. Ilarri, S., Wolfson, O., Mena, E.: Processing of data streams with prediction functions. Proceedings of the 39th Hawaii International Conference on System Sciences (2006)
A Grid-Based Subspace Clustering Algorithm for High-Dimensional Data Streams Yufen Sun and Yansheng Lu College of Computer Science & Technology, Huazhong University of Science & Technology, Wuhan, 430074, China [email protected]
Abstract. Many applications require the clustering of high-dimensional data streams. We propose a subspace clustering algorithm that can find clusters in different subspaces through one pass over a data stream. The algorithm combines the bottom-up grid-based method and the top-down grid-based method. A uniformly partitioned grid data structure is used to summarize the data stream online, and the top-down grid partition method is used to find the subspaces in which clusters are located. The errors made by the top-down partition procedure are eliminated by a mergence step in our algorithm. Our performance study with real datasets and a synthetic dataset demonstrates the efficiency and effectiveness of the proposed algorithm.
dimensionality also exists for grid-based clustering algorithms, especially for those that use connected dense grid cells in the original data space to represent clusters. Subspace algorithms have been proposed to deal with this problem [8, 9, 11]. In this paper, we propose an algorithm GSCDS (Grid-based Subspace Clustering algorithm for high-dimensional Data Streams) that can find clusters in different subspaces. The algorithm combines the advantages of the top-down grid-based method and the bottom-up grid-based method. After summarizing the stream data into a uniformly partitioned grid data structure, the algorithm tries to find the subspaces that contain clusters. For each dimension, if a cutting point on this dimension can best separate the projections of the data that belong to different clusters, this dimension is selected to span a subspace. The connected dense cells in these subspaces are identified as clusters. Finally, these clusters are checked to see whether they can be merged. Thus the algorithm can find subspaces efficiently and acquire accurate clusters. To deal with evolving data streams, the algorithm uses a fading parameter to eliminate the influence of historical data. Our performance study with real datasets and a synthetic dataset demonstrates the efficiency and effectiveness of the proposed algorithm. In the remainder of this paper, we first introduce related work in Section 2. Then we discuss two kinds of grid-based methods in Section 3. The method for subspaces identification and the algorithm GSCDS are described in Section 4. Section 5 reports our performance study, and finally Section 6 gives our conclusion.
2 Related Work

In a high-dimensional space, all pairs of points tend to be almost equidistant from one another [5]. This brings difficulties to clustering problems that define the similarity between data points by the distance between them. To address this problem, subspace clustering algorithms try to find clusters in subspaces of the original data space [8, 9, 11]. CLIQUE is the first grid-based subspace clustering algorithm that identifies clusters in different subspaces [8]. The algorithm first finds dense units in all subspaces, and then it identifies connected dense units in these subspaces as clusters. These clusters may overlap, which makes the algorithm unfit for some applications. Moreover, the running time of CLIQUE is exponential in the highest dimensionality of any dense unit. MAFIA improves CLIQUE by adaptively partitioning the grid based on the distribution of the data [9], but its running time also grows exponentially with the number of dimensions in the clusters. SURFING is a subspace clustering algorithm that tries to find interesting subspaces by checking whether the projections of the data points in each subspace are uniformly distributed [11]. If the distribution is not uniform, the algorithm tries to find clusters in the corresponding subspace. SURFING needs to compute the k-nearest neighbors of every data point in every subspace, which is not a trivial task. OptiGrid is a grid-based clustering algorithm that separates clusters in a high-dimensional dataset based on the distribution of the projections of the data points on each dimension [10]. However, the accuracy of the clusters it obtains is not very high; we discuss it in detail in the next section. Recently, several clustering algorithms for data streams have been proposed [2, 4-7]. The algorithm STREAM partitions the data stream into chunks that can be clustered in main memory [2]. The cluster centers of these chunks form higher level
chunks and are clustered again. This algorithm satisfies the basic space and time restriction of a stream algorithm. The algorithm CluStream maintains online microclusters that are defined as a temporal extension of the clustering feature (CF) vectors [4]. The micro-clusters are stored at particular moments to support offline clustering over specific time horizons. HPStream is an algorithm that is proposed to perform incremental subspace clustering on high-dimensional data streams [5]. For each cluster, HPStream selects the dimensions over which the radii of this cluster are small. CluStream and HPStream have the problem of favoring clusters with spherical shape and they are sensitive to noise and the order of input. Park et al. propose a statistical grid-based clustering algorithm for data streams [6]. Based on an assumption of the distribution of data points in a dense cell, the algorithm divides the cell until it becomes a unit cell. A cluster is a group of adjacent dense unit cells. This algorithm does not consider the problem of high dimensionality and it needs to assume the distribution of data. GCHDS is a grid-based clustering algorithm for high-dimensional data streams [7]. It analyzes the distribution of data points on each dimension to select the dimensions that are useful for clustering. All clusters it finds lie in the same subspace. But for a real data set, clusters may lie in different subspaces. The algorithm GSCDS proposed in this paper is an on-line algorithm that can find clusters with arbitrary shapes in different subspaces. GSCDS first incrementally summarizes the data in a data stream into a uniformly partitioned grid in memory. Then a top-down grid-based method is used to find the subspaces that contain clusters. Finally the algorithm uses a bottom-up grid-based method to find clusters in each subspace.
3 The Grid-Based Clustering Algorithms

Our procedure for subspace identification is inspired by OptiGrid, a grid-based clustering algorithm that aims at separating the clusters in a static data set. Before introducing our algorithm, we first discuss grid-based clustering algorithms in general. Grid-based clustering algorithms can be categorized into two classes: the bottom-up method and the top-down method. The bottom-up method first uniformly partitions the data space into mutually exclusive equal-size cells and then tries to find connected dense cells as clusters. All data points in a cell are supposed to belong to the same cluster, and only the statistical information of these data points is stored with the cell. This method needs only one scan over the dataset and it can find clusters of arbitrary shapes. The accuracy of the clusters is determined by the granularity of the cells, which is determined by user-given parameters. A finely partitioned grid with small cells produces more accurate clustering results, but at the same time causes higher computational complexity. The bottom-up method is unfit for high-dimensional data sets because the sparsely distributed data in the high-dimensional space form too many unconnected dense cells. WaveCluster [12] and CLIQUE [8] are two representatives of the bottom-up grid-based clustering algorithms. WaveCluster is designed for low-dimensional spatial data. It outperforms BIRCH, CLARANS, and DBSCAN in terms of both efficiency and clustering quality [3]. CLIQUE considers the problem of high dimensionality, but the time it needs to find the subspaces in which clusters are located is exponential in the highest dimensionality of the subspaces. MAFIA
adaptively partitions the grid by merging the adjacent intervals in a uniformly partitioned grid, and it proceeds much like CLIQUE [9]. So it is also considered as a bottom-up grid-based clustering algorithm. The top-down method does not partition the data space uniformly according to input parameters, but partitions the data space recursively according to the data distribution. This method adopts the idea of divide-and-conquer. The data space is first partitioned into several regions by cutting planes that can separate different clusters. For each region, the partition procedure is repeated until each region contains only one cluster. At last, the top-down method obtains an uneven partitioned grid in which each cell (region) contains at most one cluster. The highly populated cells in this grid are considered as clusters. By using the data distribution information on each dimension to separate the clusters, this method is insensitive to noise because noise is normally considered uniformly distributed.
Fig. 1. Errors caused by the top-down partition procedure: (a) clusters overlap on every dimension; (b) clusters can be separated by axes-parallel cutting lines
OptiGrid is a typical top-down grid-based clustering algorithm designed specifically to process high-dimensional data sets with noise. It partitions the data space recursively by cutting planes that can best separate the clusters. For each dimension, it chooses the points with local density minima of the projections to determine the cutting planes. The results it produces are regions bounded by axes-parallel planes, each containing at most one cluster. OptiGrid is easy to implement, but it has several drawbacks. First, it needs multiple scans over the database to perform the partition. Second, the algorithm only separates different clusters; it does not obtain an accurate description of the shapes of the clusters, and all the clusters it identifies are represented by hyper-rectangles. Last, OptiGrid makes some errors during the partition procedure: some clusters may be partitioned into two or more sub-clusters, and one region may contain several sub-clusters that belong to different clusters. As indicated in [10], this happens when two clusters overlap on every dimension; Fig. 1(a) shows a 2-dimensional example. In this situation, the error cannot be avoided when cutting planes are axes-parallel. But we find that even when every pair of clusters can be separated by at least one axes-parallel cutting plane (or cutting line in 2-dimensional space), a cluster may still be partitioned into two regions under the effect of other clusters. Fig. 1(b) illustrates this situation. The bold line in the figure represents
the cutting line that OptiGrid may choose. This kind of error is not considered by OptiGrid, and it may largely decrease the accuracy of the clustering results. When there are many clusters in the data space, this problem may be very serious. As mentioned above, the bottom-up grid-based method achieves good clustering quality when the partition parameters are appropriately set, but it cannot deal with high-dimensional data. The top-down grid-based method can find clusters in a high-dimensional space efficiently, but the accuracy of its clustering result is not satisfactory. By combining the advantages of these two methods, we obtain an algorithm that can deal with high-dimensional data sets effectively and efficiently. This is the motivation of our algorithm: we use the top-down method to find the subspaces that contain clusters, then the connected dense cells in these subspaces are identified as clusters, and finally some of these clusters may be merged to improve the accuracy of the clustering result.
4 The Grid-Based Subspace Clustering Algorithm for High-Dimensional Data Streams

To formalize our problem, we extend the definitions of the CLIQUE algorithm [8] to the data stream domain. Let A = A1, A2, ..., Ad be a set of bounded, totally ordered domains and S = A1 × A2 × ... × Ad be a d-dimensional data space. We refer to A1, A2, ..., Ad as the dimensions of S. The input stream data are d-dimensional points X = X1, X2, X3, ..., where Xi = (xi1, xi2, ..., xid). The j-th attribute of Xi is drawn from domain Aj, and the range of xi,j is [minj, maxj). We partition the data space S into non-overlapping equal-size cells to get a grid data structure GS. The cells are obtained by partitioning every dimension into k intervals of equal length, where k is an input parameter. Each cell Ci is the intersection of one interval from each dimension; it has the form (ci1, ci2, ..., cid), where cij = [lmj, hmj), 1 ≤ m ≤ k, is a right-open interval in the partitioning of Aj. The number of data points contained in a cell is recorded in the count of the cell, and we call it the density of the cell. There are k^d cells in the grid. When d is large, this number is very large; to ensure that the whole grid can be kept in memory, we store only the cells (ci1, ci2, ..., cid, counti) with counti > 0.

4.1 The Method for Subspaces Identification
The top-down grid-based method needs multiple scans over the data, which cannot be afforded by an algorithm that processes data streams. To execute the top-down partition procedure without scanning the data multiple times, we first summarize the stream data into a uniformly partitioned grid GS built as described above, and then perform the top-down partition procedure on this grid. The grid is partitioned into regions such that each region contains at most one cluster; the hyperplanes that form a region separate the cluster in this region from the other clusters. Each result region Ri has the form (ri1, ri2, ..., rid), where rij ⊆ [minj, maxj) is a right-open
interval. If rij equals the range of the grid on dimension j, dimension j of the region is not cut. This means that the cluster in the region can be separated from the other clusters without using the values of the data points on dimension j. For each region that contains a cluster, we can find the dimensions whose range is smaller than the range of the grid. These dimensions provide enough information to identify the cluster in the region; they span a subspace in which the cluster lies. Thus the top-down method finds the subspaces that contain clusters naturally through the grid partition procedure. In our algorithm, we partition the regions recursively until the partition procedure has been performed n times or none of the regions can be partitioned anymore, where n is an input parameter. Parameter nl represents the noise level in the data.

Algorithm Subspaces Identification
Input: d-dimensional grid GS, parameters n, nl
Output: the result region set RR, a set of dimension subsets DIs
1  Let CR = {GS} be the candidate region set. Let iteration_num = 0.
2  For each region Rl in CR, do:
   2.1  For each dimension j, 1 ≤ j ≤ d, do:
        2.1.1  Project the cells in Rl to dimension j. For an interval i on dimension j that satisfies [lij, hij) ⊆ rlj, its density is
               Dij = ∑_p {countp | Cp ∈ Rl, cpj = [lij, hij)}.
        2.1.2  Suppose Di,j is a local density minimum. If max(Dmj,j, Dmj+1,j, ..., Di−1,j) > NRl / (nl × kj) and max(Di+1,j, Di+2,j, ..., Dmj+kj−1,j) > NRl / (nl × kj), where NRl is the number of data points in Rl, kj is the number of intervals of Rl on dimension j, and intervals mj and mj + kj − 1 are the leftmost and rightmost intervals of Rl on dimension j, then insert hij into the candidate cutting point set CCP. Dij and ∑_{p=mj}^{i−1} Dpj − ∑_{p=i+1}^{mj+kj−1} Dpj are also kept.
   2.2  If CCP is nonempty, choose the candidate cutting point in it with the smallest Dij. If several candidate cutting points have the same smallest density, choose the one with the smallest value of ∑_{p=mj}^{i−1} Dpj − ∑_{p=i+1}^{mj+kj−1} Dpj to break the tie.
   2.3  If a cutting point hij has been selected from CCP, region Rl is partitioned into two regions by vertically cutting dimension j through the point hij. This partition produces two regions Rl′ and Rl″.
        CR = CR + Rl′ + Rl″ − Rl. If CCP is empty, move Rl from CR into the result region set RR.
3  iteration_num = iteration_num + 1. If iteration_num < n and CR is not empty, go to step 2.
4  If CR is not empty, move the regions in CR to RR.
5  For each region Rl in RR:
   5.1  For each dimension j, 1 ≤ j ≤ d, if rlj = [l, h) satisfies l > minj or h < maxj, insert j into the dimension subset DIl that corresponds
to R_l.
In the above algorithm, the dimensions in DI_i construct a subspace in which the cluster contained in region R_i lies. One problem should be pointed out: if all data points are projected to the subspace corresponding to DI_i, the cluster contained in R_i may not be identifiable, because the cutting hyperplanes may cut clusters. For example, the DI corresponding to the rightmost cluster in Fig. 1(b) only contains the x-axis, but that cluster cannot be identified if all data points are projected onto the x-axis. So the range of the region should be used together with the dimension subset. In the next subsection, we show how to merge the sub-clusters of a cluster that has been cut by cutting hyperplanes. In our experiments, the default value for parameter nl is 10, which means that the noise level of the data does not exceed 10% of the data; this is reasonable for most data. If a data set contains more than 10% noise, a larger value of nl can be used. The requirement for density maxima on both sides of a candidate cutting point tries to ensure that there really exist clusters on both sides of the cutting point. In step 2.2, if several candidate cutting points have the same smallest density, we choose the cutting point that separates the data points most evenly, because for a divide-and-conquer process, dividing the problem evenly is the most efficient way.

4.2 The Algorithm GSCDS
We maintain a uniformly partitioned grid data structure incrementally on a data stream to summarize the stream data in memory. When there is a request for clustering, the clustering procedure is performed on the grid to find the clusters in it. As with other bottom-up grid-based clustering algorithms, the effectiveness of our algorithm is affected by the granularity of the cells. In the next section, we discuss how to choose an appropriate grid partition parameter for a given dataset (data stream). Because a data stream evolves, the clusters existing in historical data may differ from the clusters in current data. For an application that only concerns the current data, the historical data should be discarded. Thus we multiply the count of each cell in the grid by a fading parameter ε < 1 every time a new data item flows in, and only cells with count > ξ are stored, where ξ < 1 is a parameter used to control the number of cells in memory.
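The following is a minimal sketch of this on-line maintenance step, assuming the non-empty cells are kept in a hash map keyed by their interval indices; the class and method names (StreamGrid, insert, fade) are illustrative and not taken from the paper.

import java.util.*;

// Minimal sketch (an assumption, not the authors' implementation) of the on-line grid
// maintenance used by GSCDS: non-empty cells live in a hash map, counts are faded by
// epsilon, and cells whose count drops below xi are discarded.
public class StreamGrid {
    final int k;                 // number of intervals per dimension
    final double[] min, max;     // value range of each dimension
    final double epsilon, xi;    // fading parameter and pruning threshold
    final Map<List<Integer>, Double> cells = new HashMap<>();

    StreamGrid(int k, double[] min, double[] max, double epsilon, double xi) {
        this.k = k; this.min = min; this.max = max; this.epsilon = epsilon; this.xi = xi;
    }

    // Map a point to its cell (interval index per dimension) and raise the cell count.
    void insert(double[] x) {
        List<Integer> key = new ArrayList<>(x.length);
        for (int j = 0; j < x.length; j++) {
            int idx = (int) ((x[j] - min[j]) / (max[j] - min[j]) * k);
            key.add(Math.min(idx, k - 1));        // right-open intervals; clamp the maximum value
        }
        cells.merge(key, 1.0, Double::sum);
    }

    // Fade all counts and drop cells that fall below the threshold xi.
    void fade() {
        cells.replaceAll((c, cnt) -> cnt * epsilon);
        cells.values().removeIf(cnt -> cnt < xi);
    }

    public static void main(String[] args) {
        StreamGrid g = new StreamGrid(10, new double[]{0, 0}, new double[]{1, 1}, 0.9, 0.8);
        g.insert(new double[]{0.25, 0.71});
        g.fade();
        System.out.println(g.cells);              // {[2, 7]=0.9}
    }
}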
Algorithm GSCDS
Input: the data stream DS, parameters k, ε, ξ
Output: grid GS with cluster labels
1 Partition each dimension of the data space into k intervals to construct the grid GS.
2 For a newly arrived data point X_i = (x_i1, x_i2, …, x_id) of DS, find the cell C_m that satisfies x_ij ∈ c_mj for 1 ≤ j ≤ d, and assign count_m + 1 to count_m.
3 For every stored cell, update its count: count = count × ε. Delete the cells with count < ξ.
4 If there is a request for clustering, go to step 5; else, go to step 2.
5 Run the subspaces identification algorithm to obtain the region set RR and a set of dimension subsets DIs.
6 For each region R_l in RR,
  6.1 The dimensions in DI_l construct a subspace SS_l. Project every cell C_i ∈ R_l into SS_l.
  6.2 In SS_l, label the connected cells as clusters.
  6.3 The cells in the original data space get the cluster labels of their corresponding projections in SS_l.
  6.4 For each Clu_i in R_l,
    6.4.1 For each dimension j in DI_l,
      6.4.1.1 Suppose the range of R_l on dimension j is [l_mj, h_m′j), 1 ≤ m, m′ ≤ k. If l_mj > min_j, compute CL_ijm = {C_p | C_p ∈ Clu_i, c_pj = [l_mj, h_mj)}; if h_m′j < max_j, compute CR_ijm′ = {C_p | C_p ∈ Clu_i, c_pj = [l_m′j, h_m′j)}.
      6.4.1.2 If CL_ijm is not empty, find the span of CL_ijm on each dimension except j; if CR_ijm′ is not empty, find the span of CR_ijm′ on each dimension except j.
7 For each pair of clusters Clu_i and Clu_l, if there exist nonempty sets CL_ijm and CR_ljm′ satisfying m = m′ + 1, and the spans of the sets on each dimension except j overlap, the clusters are merged.
8 Go to step 2.
In subspaces the data points are much denser, so we can identify the connected cells as clusters. Because the regions do not overlap, a cell in the original data space gets at most one cluster label. As pointed out in Section 3, the top-down grid partition procedure may cut one cluster into sub-clusters that lie in different regions. To check whether two resulting clusters belong to the same cluster, we place the clusters in the original data space. If a cluster has been cut by a hyperplane, step 6.4 finds the cells of the cluster that are adjacent to the hyperplane. If two clusters meet the conditions stated in step 7, they are considered to belong to the same cluster.
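As a concrete illustration of step 6.2, the following sketch labels connected occupied cells in a projected subspace by breadth-first search, where two cells are considered connected if their interval indices differ by one on exactly one dimension; this neighbourhood definition and all names are assumptions, not the authors' implementation.

import java.util.*;

// Minimal sketch of labelling connected cells as clusters inside a projected subspace.
public class CellLabeler {

    static Map<List<Integer>, Integer> label(Set<List<Integer>> cells) {
        Map<List<Integer>, Integer> labels = new HashMap<>();
        int next = 0;
        for (List<Integer> seed : cells) {
            if (labels.containsKey(seed)) continue;
            int id = next++;                          // start a new cluster label
            Deque<List<Integer>> queue = new ArrayDeque<>();
            queue.add(seed);
            labels.put(seed, id);
            while (!queue.isEmpty()) {                // BFS over neighbouring occupied cells
                List<Integer> c = queue.poll();
                for (int d = 0; d < c.size(); d++) {
                    for (int step = -1; step <= 1; step += 2) {
                        List<Integer> n = new ArrayList<>(c);
                        n.set(d, n.get(d) + step);
                        if (cells.contains(n) && !labels.containsKey(n)) {
                            labels.put(n, id);
                            queue.add(n);
                        }
                    }
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // two occupied cells touching on dimension 0, plus one isolated cell
        Set<List<Integer>> cells = new HashSet<>(Arrays.asList(
                Arrays.asList(3, 5), Arrays.asList(4, 5), Arrays.asList(9, 1)));
        System.out.println(label(cells));             // e.g. {[3, 5]=0, [4, 5]=0, [9, 1]=1}
    }
}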
5 Experiment Results

In this section, we empirically evaluate our GSCDS algorithm using two real data sets that have been used in [5] and one synthetic data set. We compare GSCDS with GCHDS to assess the accuracy and efficiency of GSCDS. GCHDS is our earlier algorithm that also utilizes the distribution information of data points on each dimension to select useful dimensions for clustering high-dimensional data streams [7]; all clusters it obtains lie in the same subspace. For the data set used in [7], GCHDS outperforms HPStream [5] both in effectiveness and efficiency; more details can be found in [7]. In this section, we also analyze the effect of the input parameters on GSCDS. All the experiments were performed on a PC with an AMD 2500+ processor and 512 MB memory running the Windows XP operating system. We implemented our algorithm in Microsoft Visual C++.
We use two real datasets with very different characteristics to evaluate our algorithm. The first is the KDD-CUP'99 Network Intrusion Detection stream dataset; it contains 494020 records, and each record has 34 continuous attributes. The second is the Forest CoverType dataset downloaded from the UCI machine learning repository website (http://www.ics.uci.edu/~mlearn); it contains 581012 records, and for each record the 10 quantitative attributes are used to perform the clustering. In these two datasets, each record has a class label. We also generate a 2-dimensional synthetic dataset to illustrate the validity of our method for merging the sub-clusters of a cluster that has been cut by the partition procedure.
We use the cluster purity, defined as the average percentage of the dominant class label in each identified cluster, to evaluate the clustering accuracy of GSCDS. Only the data points that arrive within a predefined window of time are used to calculate the cluster purity. The cluster purity has been used by GCHDS and HPStream to assess clustering accuracy. As in [7], the stream speed is set at 100 points per time unit and one time window contains 1000 points. To avoid decaying the data too quickly, we multiply the count of every cell by ε every time 1000 new data points have arrived. In our experiments, the default value for ε is 0.9 and for ξ is 0.8. For the subspaces identification algorithm, the parameter n is set to 20 and nl is set to 10. The algorithm GCHDS chooses 20 dimensions for the Network Intrusion dataset and 8 dimensions for the Forest CoverType dataset.
We compare the cluster purity of our results with GCHDS in Fig. 2 and Fig. 3. Fig. 2 shows the comparison for the Network Intrusion Detection dataset, and Fig. 3 shows the comparison for the Forest CoverType dataset; k is set to 20 for both datasets. We can see that the cluster purity of GSCDS is always higher than that of GCHDS. As illustrated in Fig. 3, GCHDS is unable to process datasets with complex data distributions because the dimension selection method it uses is too simple. Fig. 4 shows the clustering time of GSCDS and GCHDS. In GSCDS, the time needed for updating the grid is the same as that of GCHDS, so only the time needed for performing clustering on the grid is counted. GSCDS is slower than GCHDS because the procedure it uses to select dimensions is more complicated.
Fig. 4. Stream clustering time (Network Intrusion dataset)
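A minimal sketch of the cluster-purity measure described above is given below; the input layout (a map from cluster id to the class labels of the points assigned to it) is an assumption made for illustration.

import java.util.*;

// Sketch of cluster purity: the average percentage of the dominant class label
// within each identified cluster.
public class ClusterPurity {

    static double purity(Map<Integer, List<String>> clusters) {
        double sum = 0.0;
        for (List<String> labels : clusters.values()) {
            Map<String, Integer> freq = new HashMap<>();
            for (String l : labels) freq.merge(l, 1, Integer::sum);
            int dominant = Collections.max(freq.values());
            sum += (double) dominant / labels.size();   // fraction of the dominant label
        }
        return 100.0 * sum / clusters.size();           // average over clusters, in percent
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> clusters = new HashMap<>();
        clusters.put(0, Arrays.asList("normal", "normal", "smurf"));
        clusters.put(1, Arrays.asList("smurf", "smurf"));
        System.out.println(purity(clusters));           // (2/3 + 2/2) / 2 * 100 = 83.33...
    }
}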
In our algorithm, the parameters n, nl, ε and ξ do not affect the quality of the results much, but the effectiveness of the algorithm decreases sharply when the value of k is inappropriate. Fig. 5 shows the cluster purity of GSCDS for the Forest CoverType dataset with different settings of k.
Fig. 5. Cluster purity with different k (Forest CoverType dataset)
Fig. 6. Validation for mergence
It is obvious that the accuracy of the algorithm is low when k is too small. From our experiments on several datasets, we find that the accuracy of our algorithm is high when k is no less than ten times the number of clusters in the dataset. However, this is not a general rule for choosing the number of intervals on each dimension for a bottom-up grid-based clustering algorithm. We generate a simple 2-dimensional dataset to illustrate that a cluster cut by the top-down partition procedure can still be identified by our algorithm. Fig. 6 shows the dataset with region borders and cluster labels. Cluster 3 has been cut into two regions, but our algorithm successfully identifies it.
6 Conclusion

In this paper, we present a grid-based subspace clustering algorithm, GSCDS, for online clustering of high-dimensional data streams. We combine the top-down and bottom-up grid-based methods to design an algorithm that can find accurate clusters in different subspaces efficiently. The experimental results show that the clustering quality of the algorithm is better than that of GCHDS, a similar algorithm that finds all clusters in the same subspace.
References
1. M. R. Henzinger, et al., Computing on Data Streams. SRC Technical Note 1998-011, Digital Systems Research Center, Palo Alto, California, 1998
2. L. O'Callaghan, et al., Streaming-Data Algorithms for High-Quality Clustering. Proc. of the 18th International Conference on Data Engineering, 2002, 685-694
3. J. Han, M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001
4. C. C. Aggarwal, et al., A Framework for Clustering Evolving Data Streams. Proc. of the 29th VLDB Conference, 2003, 81-92
5. C. C. Aggarwal, et al., A Framework for Projected Clustering of High Dimensional Data Streams. Proc. of the 30th VLDB Conference, 2004, 852-863
6. N. H. Park, W. S. Lee, Statistical Grid-Based Clustering over Data Streams. ACM SIGMOD Record, 2004, 33(1): 32-37
7. Y. Lu, et al., A Grid-Based Clustering Algorithm for High-Dimensional Data Streams. Proc. of the 1st International Conference on Advanced Data Mining and Applications, 2005, 824-831
8. R. Agrawal, et al., Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'98), 1998, 94-105
9. S. Goil, et al., MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets. Technical Report No. CPDC-TR-9906-010, Center for Parallel and Distributed Computing, Department of Electrical & Computer Engineering, Northwestern University, 1999
10. A. Hinneburg, D. A. Keim, Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. Proc. of the 25th VLDB Conference, 1999, 506-517
11. C. Baumgartner, et al., Subspace Selection for Clustering High-Dimensional Data. Proc. 4th IEEE Int. Conf. on Data Mining (ICDM'04), 2004, 11-18
12. G. Sheikholeslami, S. Chatterjee, A. Zhang, WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. Proc. of the 24th VLDB Conference, 1998, 428-439
Mining h-Dimensional Enhanced Semantic Association Rule Based on Immune-Based Gene Expression Programming*

Tao Zeng1, Changjie Tang1, Yintian Liu1, Jiangtao Qiu1, Mingfang Zhu1,2, Shucheng Dai1, and Yong Xiang1,3

1 School of Computer, Sichuan Univ., Chengdu, 610065, China
  {zengtao, tangchangjie}@cs.scu.edu.cn
2 Dept. of Computer Sci. & Tech., Shaanxi Univ. of Tech., Hanzhong, 723003, China
3 Chengdu Electromechanical College, Chengdu, 610031, China
Abstract. Rule mining is very important for data mining. However, the traditional association rule is relatively weak in semantic representation. To address this, the main contributions of this paper include: (1) proposing formal concepts of the h-Dimensional Enhanced Semantic Association Rule (h-DESAR) with a self-contained set of logic operators; (2) proposing the h-DESAR mining method based on Immune-based Gene Expression Programming (ERIG); (3) presenting some novel key techniques in ERIG. Experimental results show that ERIG is feasible, effective and stable.
1 Introduction

Rule mining is an important task of data mining because rules are easier to understand than other data mining models. Association rule (AR) mining has been a hot research theme in data mining due to its broad applications in mining associations, correlations, causality, and many other important data mining tasks [1-4]. Fruitful research results for AR mining can be found in [1-4]. However, complex data mining applications require refined and rich-semantic knowledge representation, and the traditional association rule is relatively weak in semantic representation. Examples 1 and 2 show that it is difficult for traditional concepts and methods to describe and discover rich-semantic rules.

Example 1. Customers probably purchase a "laptop" if age is "30-40", title is "prof.", and address is not at "campus". To describe this fact, we need a new form of association rule:

age("30-40") ∧ title("prof.") ∧ (¬address("campus")) → purchase("laptop")    (1)
age("30-40") ∧ (title("ass.") ∨ address("campus")) → purchase("PC")    (2)
On issues like Examples 1 and 2, little related work can be found except [5]. In 2002, Zuo proposed an effective approach based on Gene Expression Programming (GEP) to mine Predicate Association Rules (PAR), named PAGEP [5]. However, PAGEP mainly mines single-dimensional rules, and it cannot always discover strong PARs stably. To address this, focusing on the multi-dimensional problem, we propose algorithms based on Immune-based GEP to mine h-Dimensional Enhanced Semantic Association Rules (h-DESAR). This distinguishes our work from [5] and other related works. The main contributions of this work are that formal concepts and properties of h-DESAR are proposed, and the h-DESAR mining algorithm based on Immune-based GEP (ERIG) is proposed, implemented and tested. The main novel techniques in ERIG include:
− The distinctive structures of immune cell and antibody, which can carry eight h-DESARs and thus reduce the computation by a factor of eight;
− The Dynamic Self-Tolerance Strategy, in which the self set can change dynamically and both invalid and redundant immune cells can be eliminated;
− The heuristic h-DESAR Reduction Criterion (EPC), that is, a strong rule is fine if and only if its contra-positive is strong too.
The remainder of the paper is organized as follows. Section 2 describes the background of the problem and our motivation. Section 3 formally introduces the problem. Section 4 proposes the ERIG algorithm, presents some distinctive methods and strategies, and discusses the time complexity. Section 5 gives experimental results. Finally, Section 6 concludes and discusses future work.
2 Background and Motivation

2.1 Background

Gene Expression Programming (GEP) [5,6,7] is a genetic computing technique introduced by Ferreira in 2001 [6]. The philosophical strategy hidden in GEP is to solve complex problems with simple code. GEP is somewhat similar to, but not the same as, Genetic Algorithms (GA) or Genetic Programming (GP).
Fig. 1. Decoding process in GEP
The chromosome of GP is directly a tree-formed structure, while that of GEP is a linear string. GP's genetic operations are therefore designed to manipulate the tree forms of chromosomes, whereas GEP's genetic operations are similar to, but simpler than, those of GA. Compared with its ancestors, GEP innovates in structure and method. It uses a very smart method to decode a gene into a formula [5,6,7]. Fig. 1 demonstrates the decoding process in GEP. As an example, if we let "a", "b" and "c" represent the atomic predicates "age(x)", "title(x)" and "address(x)" respectively, then the expression in Fig. 1 can express the logic formula "(age(x) ∨ age(x)) ∧ (title(x) ∨ ¬address(x))". Like the example above, this paper utilizes GEP to express and discover the predicate formulas that can be used to construct enhanced semantic meta-rules. Due to limited space, please refer to [5,6,7] for further details on GEP.
Artificial Immune System (AIS) [9-12] is a rapidly growing field of information processing based on immune-inspired paradigms of nonlinear dynamics. It is expected that AIS, based on immunological principles, is good at modularity, autonomy, redundancy, adaptability, distribution, diversity and so on. As a member of nature-inspired computing, AIS imitates the biological immune system, aiming not only at a better understanding of that system, but also at solving engineering problems. There are various models and techniques for AIS based on different algorithms and representations. According to [10], the main representations used include binary strings, real-valued vectors, strings from a finite alphabet, Java objects and so on.

2.2 Motivation of Proposing Immune-Based GEP

GEP is strong in representing and discovering knowledge with simple linear strings. AIS has many advantages in evolution control. It is natural to assume that embedding GEP in AIS will inherit and enhance the advantages of both AIS and GEP.
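To make the decoding step of Sect. 2.1 concrete, the following sketch performs the usual breadth-first (Karva-style) decoding of a linear gene into a logic formula; the symbol set, arities and the sample gene are our own illustrative assumptions and need not match the encoding actually used in ERIG.

import java.util.*;

// Breadth-first decoding of a linear GEP gene into an infix logic formula (illustrative sketch).
public class KarvaDecoder {

    // arity of each symbol: '&' = AND, '|' = OR, '!' = NOT; any other character is a terminal
    static int arity(char s) {
        switch (s) {
            case '&': case '|': return 2;
            case '!': return 1;
            default:  return 0;
        }
    }

    static class Node {
        char sym; Node[] kids;
        Node(char s) { sym = s; kids = new Node[arity(s)]; }
    }

    // Build the expression tree level by level, consuming gene symbols left to right.
    static Node decode(String gene) {
        Node root = new Node(gene.charAt(0));
        Deque<Node> frontier = new ArrayDeque<>();
        frontier.add(root);
        int next = 1;
        while (!frontier.isEmpty()) {
            Node n = frontier.poll();
            for (int i = 0; i < n.kids.length; i++) {
                n.kids[i] = new Node(gene.charAt(next++));
                frontier.add(n.kids[i]);
            }
        }
        return root;
    }

    static String toInfix(Node n) {
        if (n.kids.length == 0) return String.valueOf(n.sym);
        if (n.kids.length == 1) return "!" + toInfix(n.kids[0]);
        return "(" + toInfix(n.kids[0]) + n.sym + toInfix(n.kids[1]) + ")";
    }

    public static void main(String[] args) {
        // hypothetical gene encoding AND(OR(a,a), OR(b, NOT c))
        System.out.println(toInfix(decode("&||aab!c")));   // prints ((a|a)&(b|!c))
    }
}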
3 Formal Statements for h-DESAR

This section introduces some notations, presents the formal statement of the problems, and discusses their properties. Basic relational algebra notations come from [8]. Let S^m denote an m-dimensional relation instance, Attr(S^m) denote the attribute symbol set of S^m, Dom(A_i) denote the domain of attribute A_i, Ŝ^m = (A_1, A_2, …, A_m) denote the relation schema, and t = (V_{A_1}, V_{A_2}, …, V_{A_m}) denote a tuple of S^m, where A_i ⊆ Attr(S^m) and V_{A_i} ∈ Dom(A_i) for i = 1, …, m.
Let H = {y | y is a well-formed predicate formula}, Ary(h) be arity of h, and PreSymbol(h) = {z | z is the symbol of an atomic formula in h, h∈H }. For instance, if q is A(x)∨A(x)∧(B(x)∨¬C(x)), then Ary(q) = 3 and PreSymbol(q) = {A, B, C}. Let W be a set, |W| denote the size of W that is the number of elements in W, and #(Sm) denote record number of Sm.
3.1 Enhanced Semantic Meta-rule

The following formal statements on the enhanced semantic meta-rule are different from those proposed by Fu in [4].

Definition 1. An enhanced semantic meta-rule ℜ on S^m can be described as a logic formula in the form of P→Q, where
− X, Y ⊂ Attr(S^m), X ≠ φ, Y ≠ φ, X ∩ Y = φ; Ω_M = {ψ | ψ is an atomic first-order predicate whose symbol is in M, and ψ(x) means the value of attribute ψ is x}; and F = {¬, ∧, ∨}.
− P is a well-formed first-order logic formula composed of the atomic formulas in Ω_X and logic operators in F.
− Q is a well-formed first-order logic formula composed of the atomic formulas in Ω_Y and logic operators in F.
Additionally, we call P the antecedent, Q the consequent and {P, Q} the foundation set of it. □

Definition 2. We call an enhanced semantic meta-rule an h-dimensional enhanced semantic meta-rule ℜh if and only if
− 2 ≤ h ≤ m, and ary(P) + ary(Q) = h.
− P and Q have been simplified.1
− The atomic predicates in P and Q can occur only once in P and Q respectively.
Additionally, let {ℜh} denote the set of all h-dimensional enhanced semantic meta-rules on S^m. □
Remark 1. The logic operators we use include "AND", "OR" and "NOT", which form a self-contained set.

3.2 h-Dimensional Enhanced Semantic Association Rule

Let M be an attribute set, M ⊆ Attr(S^m), and let the sub-tuple GetFTuple(M, t) = Π_{A_1^M, A_2^M, …, A_{|M|}^M}(t), where A_i^M ∈ M for i = 1, …, |M|.
Definition 3. Given a tuple t = (V_{A_1}, V_{A_2}, …, V_{A_m}) ∈ S^m and ℜh, an h-Dimensional Enhanced Semantic Association Rule ℜh,t^s can be described as a logic formula in the form of P^s→Q^s, where
− P is the antecedent and Q is the consequent of ℜh.
− P^s is the substitution formula of P, in which all variables are replaced by the corresponding values in GetFTuple(P, t) according to the meanings of the atomic predicates in P.
− Q^s is the substitution formula of Q, in which all variables are replaced by the corresponding values in GetFTuple(Q, t) according to the meanings of the atomic predicates in Q.
Additionally, we call the tuple t the feature tuple, P^s and Q^s the antecedent and consequent, and {P^s, Q^s} the foundation set of the rule. □
1 Here, simplified means that the expression string is parsed to create an expression tree, and redundant brackets and redundant "¬" are eliminated from the expression string. For example, "(((a)∧b))" can be simplified to "a∧b", "¬¬¬a" to "¬a", and "¬¬¬¬a" to "a".
It is obvious that both Example 1 and Example 2 are sound 4-dimensional enhanced semantic association rules.
Given two m-dimensional tuples t1 = (V_{C_1}, V_{C_2}, …, V_{C_m}) and t2 = (V_{C_1}′, V_{C_2}′, …, V_{C_m}′), let t̄ = (f_{C_1}, f_{C_2}, …, f_{C_m}) denote the match tuple between t1 and t2, where

    f_{C_j} = true  if V_{C_j} = V_{C_j}′,
    f_{C_j} = false if V_{C_j} ≠ V_{C_j}′,    j = 1, …, m.    (3)
Definition 4. Let U be one of P^s, Q^s and P^s∧Q^s of ℜh,t^s, and let t be the feature tuple of ℜh,t^s. For any t′ ∈ S^m, we say that t′ supports U if and only if
− t̄ is the match tuple between t and t′, and t̄_P = GetFTuple(Attr(PSx), t̄);
− U^S is the boolean formula substituted for U, in which all atomic predicates are replaced by the corresponding boolean values in t̄_P according to the mapping between the attributes in t̄_P and the atomic predicates in U;
− U^S evaluates to true.
Otherwise, t′ does not support U. □
Let u ∈ {P^s, Q^s, P^s∧Q^s}, and let σ(u | S^m) denote the number of records that support u. The support degree and confidence degree are defined as follows.
− Support degree:
      sup(ℜh,t^s, S^m) = σ(P^S ∧ Q^S | S^m) / #(S^m)    (4)
− Confidence degree:
      conf(ℜh,t^s, S^m) = σ(P^S ∧ Q^S | S^m) / σ(P^S | S^m)    (5)
Let min_conf, min_sup ∈ [0, 1]. As in [1,3], ℜh,t^s is strong if and only if sup(ℜh,t^s, S^m) ≥ min_sup and conf(ℜh,t^s, S^m) ≥ min_conf.

3.3 Example

Example 3. 1) Let F_A = age(x) ∧ (title(x) ∨ ¬address(x)), F_A′ = (age(x) ∨ age(x)) ∧ (title(x) ∨ ¬address(x)) and F_B = purchase(x), where age, title, address, purchase ∈ Attr(S^m). Then both F_A→F_B and F_A′→F_B are well-formed enhanced semantic meta-rules, but only F_A→F_B complies with the h-dimensional enhanced semantic meta-rule of Definition 2, where h = 4. 2) Given a tuple r("30", "male", "campus", "prof.", "laptop") ∈ S^m and Ŝ^m = (age, gender, address, title, purchase), F_A^S → F_B^S is a ℜ4,r^s, where F_A^S = age("30") ∧ (title("prof.") ∨ ¬address("campus")) and F_B^S = purchase("laptop"). 3) Suppose that there is another tuple r′("30", "male", "not in campus", "ass.", "laptop") ∈ S^m; then the match tuple r̄ between r and r′ is (true, true, false, false, true). Because U^S_A = true ∧ (false ∨ ¬false) = true, U^S_B = true and U^S_A ∧ U^S_B = true, r′ supports F_A^S, F_B^S and F_A^S∧F_B^S. □
In this paper, we focus on mining h-DESARs in which the atomic predicates occur only once, because they are more extractive and heuristic.
3.4 Some Properties of h-DESAR

Lemma 1. If FS is a foundation set, then FS can be used to construct eight h-DESARs. They can be grouped into four pairs, and the two h-DESARs in each pair are logically equivalent to each other.
Proof. Suppose that there is a foundation set FS = {A, B}; we can construct the following eight h-DESARs: 1) A → B, 2) ¬B→¬A, 3) B → A, 4) ¬A→¬B, 5) ¬A→B, 6) ¬B→A, 7) A→¬B, and 8) B→¬A. Among them, 1) and 2), 3) and 4), 5) and 6), 7) and 8) are contra-positives of each other. Since the contra-positive is equivalent to the original statement, the two statements in each pair are equivalent to each other. □

Theorem 1. Let FS = {A, B} be a foundation set and S^m be a relation instance. If σ(A|S^m), σ(B|S^m), σ(A∧B|S^m) and #(S^m) are given, then the support degree and confidence degree of all eight h-DESARs constructed from FS can be evaluated.
Proof. Because any tuple either supports an h-DESAR or does not, we can compute the following values: 1) σ(¬A|S^m) = #(S^m) − σ(A|S^m), 2) σ(¬B|S^m) = #(S^m) − σ(B|S^m), 3) σ(A∧¬B|S^m) = σ(A|S^m) − σ(A∧B|S^m), 4) σ(¬A∧B|S^m) = σ(B|S^m) − σ(A∧B|S^m), 5) σ(¬A∧¬B|S^m) = #(S^m) − σ(A|S^m) − σ(B|S^m) + σ(A∧B|S^m). We can use these values to evaluate the support degree and confidence degree of all these h-DESARs according to Equations (4) and (5). □

Lemma 2. Given a relation instance S^m and an enhanced semantic meta-rule ℜh, let EARSet be the set of enhanced semantic association rules complying with ℜh on S^m; then |EARSet| ≤ #(S^m).
Proof. According to Definition 3, let W = PreSymbol(P^S) ∪ PreSymbol(Q^S); a sub-tuple GetFTuple(W, t) corresponds to an h-DESAR. Two cases arise: (a) if each such sub-tuple in S^m is unique, then |EARSet| = #(S^m); (b) if there exist duplicate sub-tuples in S^m, then |EARSet| < #(S^m). So |EARSet| ≤ #(S^m). □
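The following sketch illustrates Theorem 1: from σ(A), σ(B), σ(A∧B) and #(S^m) alone, the support and confidence of all eight h-DESARs built on the foundation set {A, B} can be computed. The helper names and the sample counts are hypothetical, not taken from the paper.

// Evaluating the eight h-DESARs of a foundation set {A, B} from four counts (Theorem 1).
public class FoundationSetEval {

    static void evaluate(long sA, long sB, long sAB, long n) {
        long sNotA = n - sA, sNotB = n - sB;
        long sA_NotB = sA - sAB;            // sigma(A AND !B)
        long sNotA_B = sB - sAB;            // sigma(!A AND B)
        long sNotA_NotB = n - sA - sB + sAB;

        rule("A -> B",   sAB,        sA,    n);
        rule("!B -> !A", sNotA_NotB, sNotB, n);
        rule("B -> A",   sAB,        sB,    n);
        rule("!A -> !B", sNotA_NotB, sNotA, n);
        rule("!A -> B",  sNotA_B,    sNotA, n);
        rule("!B -> A",  sA_NotB,    sNotB, n);
        rule("A -> !B",  sA_NotB,    sA,    n);
        rule("B -> !A",  sNotA_B,    sB,    n);
    }

    // support = sigma(antecedent AND consequent) / n, confidence = same / sigma(antecedent)
    static void rule(String name, long sBoth, long sAnte, long n) {
        double sup  = (double) sBoth / n;
        double conf = sAnte == 0 ? 0.0 : (double) sBoth / sAnte;
        System.out.printf("%-9s sup=%.3f conf=%.3f%n", name, sup, conf);
    }

    public static void main(String[] args) {
        evaluate(300, 250, 200, 1000);      // hypothetical counts
    }
}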
4 The ERIG Algorithm

4.1 Framework

We call our algorithm ERIG (h-DESAR mining based on Immune-based Gene Expression Programming). The AIS in ERIG is somewhat similar to a hybrid of the clonal selection principle [9-10] and the negative selection algorithm [11]. However, different from other models, the representation in our AIS is the gene of GEP, and the mutation operators come from GEP. In addition, many new techniques are proposed in ERIG. The algorithm framework is as follows.

Algorithm 1. (ERIG) The h-Dimensional Enhanced Semantic Association Rule mining based on Immune-based Gene Expression Programming.
Input: An m-dimensional relation instance S^m, a minimum support, minsup, and a minimum confidence, minconf.
Output: The set of strong h-Dimensional Enhanced Semantic Association Rules.
BEGIN
1  Initialize and set control parameters;
   // cellnum is the number of cells in every generation; an outer loop is a generation
   // hfmt is the high frequent mutation threshold
2  WHILE stop condition is not satisfied BEGIN
3    BCSet := NULL;                                   // BCSet is the immune cell set
4    count := 0;
5    WHILE BCSet.size < cellnum AND count < hfmt BEGIN
6      BCSet := GenBCSet(cellnum, Ŝ^m, F, control parameters);  // generate BCSet via GEP
7      BCSet := SelfTolerance(BCSet);                 // self tolerance
8      count++;
9    END WHILE
10   ABSet := MaturateCells(BCSet, S^m);              // produce antibody set ABSet
11   Maturateaffinity(ABSet, BCSet, minsup, minconf); // evaluate and eliminate cells and antibodies that cannot meet the requirement
12   MemorizeCells(BCSet);                            // add cells in BCSet to the elite gene pool for GEP
13   Output(ABSet);                                   // output the solution for the problem
14   CloneMutation(BCSet);
15 END WHILE
END.                                                                                □
The code is self-explanatory, but it is impossible to list all the details of ERIG here. We select some distinctive methods and strategies to show below.

4.2 Some Key Techniques in ERIG

4.2.1 Structures of Immune Cell and Antibody
Immune cells and antibodies are very important for AIS. In general, the antigen corresponds to the problem to be solved and the antibody to its solution. For the h-DESAR problem, a record in the relation instance can be an antigen and an h-DESAR can be an antibody. Through comprehensive analysis of each aspect, we designed our antibody and immune cell (B cell). The formal definitions are as follows.

Definition 5. An immune cell, BCell, is a 3-tuple (G, E, δ) where
− G = (gA, gB), called the chromosome, is a 2-tuple, where both gA and gB are genes of GEP.
− E = (eA, eB), called the dual-expression, is a 2-tuple, decoded from the genes in G according to GEP.
− δ ∈ {−1, 0, 1, 2} is the state value of the BCell, where −1, 0, 1 and 2 indicate that the cell is dead, immature, mature and memorized, respectively. □

Definition 6. An antibody is a 3-tuple (E, L, V), where
− E comes from the immune cell that produces this antibody.
− L = (lA, lB) is a 2-tuple, where lA and lB are the substitution formulas for those in E, respectively, by the attribute values of a record in the relation instance.
− The 4-tuple V = (pA, pB, pAB, ptotal) stores affinity information, where pA, pB, pAB and ptotal are the support numbers of lA, lB and lA ∧ lB and the total number of records that were tested, respectively. □
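A minimal sketch of these two structures is shown below; the field names and types are illustrative assumptions rather than the authors' code.

// Sketch of the immune-cell and antibody structures of Definitions 5 and 6.
public class ImmuneStructures {

    // Immune cell (BCell): chromosome G = (gA, gB), dual-expression E = (eA, eB),
    // and state delta in {-1, 0, 1, 2} (dead, immature, mature, memorized).
    static class BCell {
        String geneA, geneB;        // G: two linear GEP genes
        String exprA, exprB;        // E: expressions decoded from the genes
        int state;                  // delta
    }

    // Antibody: E copied from the producing cell, L = (lA, lB) the substitution formulas
    // for a feature tuple, and V = (pA, pB, pAB, pTotal) the affinity counters.
    static class Antibody {
        String exprA, exprB;        // E
        String substA, substB;      // L: formulas with attribute values substituted in
        long pA, pB, pAB, pTotal;   // V: support counts of lA, lB, lA AND lB, and #(S^m)
    }
}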
In Table 1 and Table 2, examples for BCell and Antibody are given, respectively.

Table 1. An example for BCell
Theorem 2. An antibody can represent and evaluate 8 pieces of h-DESARs. Proof. Let Ab denote an antibody and by Lemma 1, use {Ab.L.lA, Ab.L.lB} to construct 8 pieces of h-DESARs. Then, after affinity maturation, there are σ(Ab.L.lA|Sm) = Ab.V.pA, σ(Ab.L.lB|Sm) = Ab.V.pB, σ(Ab.L.lA ∧ Ab.L.lB|Sm) = Ab.V.pAB, and #(Sm) = Ab.V.ptotal. Thus we can evaluate these 8 h-DESARs by Theorem 1. □ It shows our antibody is good at representation and discovery of h-DESARs. 4.2.2 Dynamic Self-tolerance Strategy The part of self-tolerance in ERIG develops from negative select algorithm [11] and looks like that in [12]. However there are many differences among them. Our selftolerance strategy is problem-oriented. Main strategy is as follows. − Treat those immune cells that have been generated or used as self dynamically. − Let Bc be an immune cell, and SS be self-set. For ∀Bc∈SS where Bc.E=(eA,eB ), those cells are self too, if their dual-expression is one of (eB, eA), (¬eA, eB ), (eB, ¬eA), (eA, ¬eB), (¬eB, eA), (¬eA, ¬eB ) and (¬eB, ¬eA). − Inject vaccine if it is needed. And treat those cells with certain pattern as self. The function of our self-tolerance strategy is as follows. − Avoid generating redundant cells that are equivalent to represent h-DESARs. − Avoid generating any fault cells that cannot represent valid h-DESARs. − Be able to inject vaccine too. 4.2.3 Affinity Computing In course of affinity maturation, for each antibody, its affinity information for all records (antigens) will be computed. After affinity maturation, there are σ(Ab.L.lA|Sm) = Ab.V.pA, σ(Ab.L.lB|Sm) = Ab.V.pB, σ(Ab.L.lA ∧ Ab.L.lB|Sm) = Ab.V.pAB, and #(Sm) = Ab.V.ptotal. According to Theorem 1 and Theorem 2, we can scan database once but evaluate 8 times more h-DESARs than antibodies.
Additionally, because the statement and contra-positive is logically equivalent, we proposed a heuristic h-DESARs Reduction Criterion (EPC) to reduce result set, that is, a strong rule is fine if and only if the contra-positive of that is strong. 4.3 Algorithms Analysis In this section, we discuss the time complexity of ERIG. Theorem 3. Let h be a constant and #(Sm) = n, then the time complexity of each generation in ERIG depends on the number of antibodies, and it is lower than O(n2). Proof. Since operation on database is time-consuming, #(Sm) is the variable that has great impact on the time complexity. The number of cells, the high frequent mutation threshold, the size of pool and other control parameter all are limited constants. Hence, the time complexity from row 2 to 9 in ERIG is bounded above. It is O(C1). Similarly, MemorizeCells and CloneMutation are O(C2) and O(C3). Supposing the maximum number of cells of every generation is c, then the time complexity of MaturateCells is lower than O(c*n2). It is because, under worst-case condition, c cells can produce c*n antibodies with scanning database once and each of these c*n antibodies will match with n tuples to compute affinity. Finally, for Output, there are c*n antibodies to process at most. So it is O(c*n). To sum up, the total maximum time complexity is O(C1)+O(C2)+O(C3)+O(c*n2)+O(c*n) ≈ O(n2). □
5 Experiments To verify our algorithm, various cases were designed. The test platform is as follows: CPU: Intel C3 1.0GHz, memory: 320MB, hard disk: 80GB, OS: MS Windows XP Pro. SP2, compiler: JDK1.5.03. The data set we used in our experiments is cmc, with 10 dimensions and 1473 rows. It comes from UCI Machine Learning Repository2. Table 3 gives us symbol definitions for this section. 5.1 Case Test Because there has been little research on h-DESAR, the case 1 was designed to compare ERIG with traditional AR mining. Case 1. Let F = {∧}, minsup=0.5%, minconf=95%, cellnum = 20, and hfmt = 200. We run ERIG and Apriori algorithm to mine traditional multi-dimensional AR on data set “cmc” respectively to verify ERIG. Remark 2. If F = {∧} and the order of predicates not be considered, h-DESAR is equivalent to traditional multi-dimensional AR. In this case, in order to utilize Apriori algorithm to mine multi-dimensional AR, we preprocess cmc in the following way. For each value of attribute in cmc, we add a string of its attribute in front of it to construct a new value, whose type become string, then store it into a new data set cmc’. After preprocessing, in cmc’, original equal values of different dimensions in cmc became unequal. It will eliminate possible value-collision between dimensions when Apriori runs on cmc’. 2
http://www.ics.uci.edu/~mlearn/MLRepository.html
Some details of the results for Case 1 are given in Table 4. They show that the number and content of the ARs mined by ERIG on cmc are the same as those mined by Apriori on cmc'.

Table 3. Symbol definitions for Section 5
  TC    Total number of independent cells
  TSAR  Total number of strong h-DESARs
  EGN   The generation number when the program ends
  ACGE  Percentage of cell generation efficiency, where ACGE = TC/(EGN*cellnum) * 100%
  ECN   Number of cells eliminated by self tolerance
  AREC  Ratio of ECN to TC, where AREC = ECN/TC (times)
  CEE   Number of cells generated by the elite pool
  AEE   Percentage of usage of the elite pool, where AEE = CEE/TC * 100%

Table 4. Results for Case 1 (minsup = 0.5%, minconf = 95%)
  No.  h   EGN  TC    TSAR  ACGE   AREC  AEE
  1    2   3    45    35    75.0%  2289  8.9%
  2    3   19   360   744   94.7%  291   30.6%
  3    4   75   1470  3418  98.0%  34    25.6%
  4    5   197  3780  6324  95.9%  23    21.0%
  5    6   336  6510  5650  96.9%  18    13.7%
  6    7   391  7560  2470  96.7%  25    5.1%
  7    8   295  5715  478   96.9%  30    0.7%
  8    9   132  2550  22    96.6%  52    0.0%
  9    10  28   511   0     91.3%  221   0.0%
  Note: The sum of TSAR is 19141. The number and content are the same as those obtained via Apriori on cmc'.

Table 5. Results for Case 2 (minsup = 5%, minconf = 98.5%)
  No.  h       EPC  Convergence  Vaccine  EGN  TC     TSAR     ACGE    AREC   AEE
  1    3       no   yes          no       292  5760   316292   98.6%   19.26  50.0%
  2    3       yes  yes          no       297  5760   980      97.0%   24.03  20.9%
  3    4       no   no           no       500  10000  1334128  100.0%  0.00   49.8%
  4    4       yes  no           no       500  10000  6796     100.0%  0.01   30.6%
  5    5       yes  no           no       500  10000  18552    100.0%  0.00   30.6%
  6    7       yes  no           no       500  10000  36984    100.0%  0.00   16.6%
  7    2 to 6  yes  no           yes      500  10000  6431     100.0%  0.09   14.1%
  8    7       yes  no           yes      500  10000  5434     100.0%  0.01   1.7%
  Notes: In No. 7, the dimensions were restricted to the 2nd, 3rd, 4th, 6th, 7th and 8th. In No. 8, the dual-expression template is ("#", "(##) (##)(##)").
Case 2. Let F = {¬, ∧, ∨}, cellnum = 20, hfmt = 200, and the order of predicates be considered. On data set “cmc”, mine general h-DESARs, restricted dimensional
h-DESARs and the special h-DESARs generated by a fixed dual-expression template, which test the function of the vaccine, respectively. The results for Case 2 are given in Table 5. Extensive tests show that 1) our algorithm is stable, 2) the efficiency of EPC is notable by comparison, 3) the capability of generating new immune cells is strong, and 4) the function of the vaccine is sound and effective. As an example, a 5-DESAR from the results of No. 7 in Table 5 is as follows:
D8(1) → D7(3) ∨ ¬(D3(4) ∧ D4(1) ∧ D2(4))    (6)
Since the 5-DESARs (6) and (7) are equivalent to each other, they can be reduced to one 5-DESAR, where Di denotes the i-th dimension.
Fig. 2. Relationship between TC and h in Case 1 and Case 2

Fig. 3. Relationship between average running time per generation and record number of the data set in Case 3
5.2 Scalability Study In this section, we study the impact of relation instance scale on the performance of our algorithm. Case 3. Let F = {¬, ∧, ∨}, cellnum = 20, hfmt = 200, h = 4 and do the following. − Take a copy of records from row 1 to 295 in cmc as a new data set cmc1, similarly, records from row 1 to 590 as cmc2, records from row 1 to 885 as cmc3, and records from row 1 to 1180 as cmc4; − Merge 2 copies of cmc1 into a new data set cmc2’, similarly, 3 copies of cmc1 into cmc3’, 4 copies of cmc1 into cmc4’, and 5 copies of cmc1 into cmc5’; − Mine h-DESARs respectively on cmc1, cmc2, cmc3, cmc4, and cmc up to 100 generations several times. − Similarly, do it respectively on cmc1, cmc2’, cmc3’, cmc4’, and cmc5’ as baseline. Fig. 3 described the result. The number of distinct tuples from cmc2’, cmc3’, cmc4’ to cmc5’are the same as those in cmc1, besides different in the number of records, so that the number of antibodies does not change and average running time per generation increases very slowly in baseline. However, when we do step 3) in Case 3, the number of unique tuples from cmc1, cmc2, cmc3, cmc4 to cmc increases gradually
with the rise of rows. Thus, for every generation, more antibodies are generated and the running time ascends, but not steeply. This testifies to Theorem 3. In Table 6, we compare ERIG, PAGEP, and Apriori on the objectives they are able to mine.

Table 6. Comparison between ERIG, PAGEP, and Apriori
  Available objective to mine           ERIG  PAGEP  Apriori
  Traditional association rule          yes   yes    yes
  Rule with connectives beyond "∧"      yes   yes    no
  Rule with constrained pattern         yes   no     no
  Rule with constrained attributes      yes   no     no
6 Conclusion and Future Work

We have discussed the h-DESAR problem, proposed the ERIG algorithm, and presented some key techniques in ERIG. Experimental results testified to our expectations and showed that ERIG is feasible, effective and stable. Our future work includes: study of the problem space, improvement of performance, discovery of h-DESARs on data streams, and applications to Web mining and firewall log mining.
References
[1] Agrawal R, Imielinski T, Swami A. Database mining: A performance perspective. IEEE Trans. Knowledge and Data Engineering, 1993, 5: 914-925
[2] Agrawal R, Srikant R. Fast algorithms for mining association rules. Proc. 1994 International Conference on Very Large Data Bases (VLDB'94)
[3] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Beijing: Higher Education Press, 2001
[4] Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995
[5] Jie Zuo, Changjie Tang, Zhang Tianqing. Mining Predicate Association Rule by Gene Expression Programming. WAIM, 2002
[6] C. Ferreira. Gene Expression Programming: A New Adaptive Algorithm for Solving Problems. Complex Systems, 2001, 13(2): 87-129
[7] Jie Zuo. Research on the Key Techniques of Gene Expression Programming. Ph.D. dissertation, Sichuan University, 2004
[8] Silberschatz, Korth. Database System Concepts, Fourth Edition. McGraw-Hill Computer Science Series, 2001
[9] De Castro L. N., Von Zuben F. J. Artificial Immune Systems: Part I - Basic Theory and Applications. Technical Report TR-DCA 01/99, 1999
[10] Dasgupta D., Ji Z., Gonzalez F. Artificial immune system (AIS) research in the last five years. Evolutionary Computation, 2003 (CEC'03)
[11] S. Forrest, A. S. Perelson, et al. Self-Nonself Discrimination in a Computer. In Proceedings of the IEEE Symposium on Research in Security and Privacy, 1994
[12] Tao Li, Xiaojie Liu, Hongbin Li. A New Model for Dynamic Intrusion Detection. CANS 2005, LNCS 3810, pp. 72-84, 2005
An Efficient Algorithm for Distributed Incremental Updating of Frequent Item-Sets on Massive Database Jiangtao Qiu, Changjie Tang, Lei Duan, Chuan Li, Shaojie Qiao, Peng Chen, and Qihong Liu School of Computer Science, Sichuan University, Chengdu, China {qiujiangtao, tangchangjie}@cs.scu.edu.cn
Abstract. Incremental updating of frequent item-sets on a database includes three problems. In this paper, these problems are explored when database stores massive data. The main contributions include: (a) introduces the concept of Interesting Support Threshold; (b) proposes Frequent Item-sets Tree (FITr) with compact structure; (c) proposes and implements algorithm FIIU for frequent item-sets incremental updating; (d) in order to further improve performance, proposes the algorithm DFIIU for distributed incremental updating of frequent Item-sets on massive database; (e) gives extensive experiments to show that FIIU and DFIIU algorithms have better performance than traditional algorithm on massive database when the number of items is less.
In order to solve the three problems, our study makes the following contributions, based on the idea of not scanning the original database.
1) Introduces the concept of the Interesting Support Threshold Supmin.
2) Proposes a prefix tree with a compact structure, called the Frequent Item-sets Tree.
3) Proposes an efficient algorithm FIIU (Frequent Item-sets Incremental Updating).
4) In order to further improve performance, proposes the algorithm DFIIU (Distributed Frequent Item-sets Incremental Updating) for incremental mining of frequent item-sets on massive databases with multiple computers.
The remaining of this paper is organized as follows. Section 2 gives a briefly introduction to related works. Section 3 revisits description of the problem. Section 4 proposes algorithm FIIU. Section 5 proposes algorithm DFIIU. Section 6 gives a thorough performance study in comparison with FP-Growth algorithm. Section 7 summarizes our study.
2 Related Works Some efficient algorithms have been proposed for finding frequent item-sets. Apriori[2] employs iterative approach known as level-wise search, where k-item-sets are used to explore k+1-item-sets. Apriori based DHP[3] uses hash table to improve performance on mining frequent item-sets. Partition[4] employs partition strategy to find frequent item-sets with only twice database scans. Han’s FP-Growth[5] may avoid candidates generation and utilizes compact data structure, called frequent pattern tree, to generate frequent item-sets with divide-and-conquer strategy. To address problem of incremental updating of frequent item-sets, Cheung and Li first propose FUP[4] and FUP2[5]. However, these algorithms only focus on second and third problems of incremental updating of frequent item-sets, and need to scan the entire database several times. In [6], a CAST tree is proposed. CAST may be employed to find frequent item-sets when support threshold changes. FIUA2[7] and IM[8] are FP-tree based algorithm. They incrementally mine frequent item-sets with strategy that use new data to extend old FP-tree. However, the two algorithms do not analysis performance on massive data. Main approaches of mining frequent item-sets on massive data include parallel-based approach and the sampling-based approach. CD[9] is an Apriori-like parallel algorithm. DD[9] and IDD[10] divide candidate set into several sub-sets, and then send each sub-set to one processor. Quick-Mining[11] is a sampling-based algorithm.
3 Problem Descriptions Let Ik be an item, and I={I1,I2,…, Im} be a complete collection of items. A transaction database DB={T1, T2,,…, Tn} is a collection of transactions, where Ti (i=1,…,n) is a transaction which contains items in I. Sup denotes support threshold. A item-set is called frequent item-set if its support count is not less than Sup × |DB| where |DB| is the number of transactions in database DB.
Given a transaction database DB and a threshold Sup, the problem of frequent item-set mining is to mine complete collection of frequent item-sets on DB with support threshold Sup. Let db be the collection of new transactions, and |db| be the numbers of new transactions, the updated database U=DB ∪ db. The problems of incremental updating of frequent item-sets include: (1) Find out frequent item-sets ISsup’ (DB) if support threshold Sup is changed to Sup’, but database is not updated, U=DB. (2) Find out frequent item-sets ISsup(U) if database is updated, U= DB ∪ db, but support threshold is not changed. (3) Find out frequent item-sets ISsup’ (U) when database was updated, U= DB ∪ db, and support threshold is changed to Sup’. The problem of Distributed Incremental Updating of Frequent Item-sets is to find out all frequent item-sets by utilizing distributed system when support threshold is changed and database is updated.
4 Incremental Updating of Frequent Item-Sets In order to not scan original database DB when incrementally mining frequent item-sets, in our study, all item-sets in DB will be stored. However the number of item-sets in DB may be large, especially when encountering long transactions in very large DB. Therefore, firstly, we introduce a concept, called Interesting Support Threshold Supmin in section 4.1. For each item-set A in DB, if A.count is not less than Supmin × |U|, A is a frequent item-set. Furthermore, all item-sets in DB may be divided into two categories: frequent item-sets and non-frequent item-sets. All frequent item-sets are stored in a prefix tree with compact structure, called FITr, which is defined in section 4.2. Then all reduced non-frequent item-sets are stored in two-level hash-indexed linked-lists, called THL, which is defined in section 4.3. Finally, FITr and THL are materialized on disk. For first problem of Incremental Updating of Frequent Item-sets, FITr is firstly read from disk when support threshold is changed to Sup(Sup>Supmin ). Then FITr is traveled to find out each item-set whose support count is not less than Sup × |U|. For the second and the third problems, firstly, all item-sets in db will be found out. Then store them into a data structure, called WITs, which is defined in section 4.3. Finally, utilizes Algorithm FIIU, described in 4.4, to realize incremental updating of frequent item-sets. 4.1 Interesting Support Threshold It is not necessary to find out item-sets under all support thresholds when apply mining association rules to industry. The smaller support threshold is, the larger the number of frequent item-sets is, and then the more difficult generating strong association rules is. Therefore, it can be concluded that generating significant strong association rules is impossible when support threshold is too small. Many criteria [12] are proposed to mine significant association rules. In this study, we defined Interesting Support Threshold according to these criteria. Definition 1(Interesting Support Threshold Supmin). Let DB be database and R be a set of criteria for mining association rules. Supmin is called interesting support threshold
if Supmin is the minimal support threshold that satisfies R and may generate significant strong association rules on DB.

4.2 Frequent Item-Sets Tree

To design a compact data structure to store frequent item-sets, let us examine an example first.

Fig. 1. Example of frequent item-set tree
Example 1. Let {I1,I2},{I1,I3,I4},{I2,I3},{I2,I4},{I3} be frequent item-sets. They are stored in frequent item-sets tree (or FITr in short). FITr is illustrated in Fig.1. Node Ij indicates that item Ij is stored in the node. A frequent item-sets tree may be designed as follows based on example 1. Definition 2(Frequent Item-sets Tree FITr). A Frequent Item-sets Tree (or FITr in short) is a tree structure defined below. 1) FITr consists of one root labeled as head. Each node in the FITr consists of four fields: item, count, cNode, bNode, where item registers item the node saved, count registers support count of an item-set, cNode is a pointer which point to first node in child level, bNode is also a pointer which point to brother node. 2) Let node1 and node2 be nodes in FITr. We call that node1 and node2 are in same level if node1.bNode=node2 or node1.bNode.….bNode=node2 . 3) Let node1 be a child node of root, we call that node1 is a node in first level in FITr. Starts off from node1, then goes to one of its children node, and go on by this way if noden may be reached at last, we call that there exists a path between node1 and noden. The set of items of all nodes in the path represents a frequent item-set, and count field of noden represents support count of the frequent item-set. Example 2. In our running example shown in Fig.1, I1 and its brother node I2, I3 are in same level. Path of node I1 to node I4 is I1 → I3 → I4. All nodes in the path represent an item-set A= {I1, I3, I4}. Value of count field of node I4 denotes support count of A. In our study, each item is given a unique integer number id. Property 1. Let id1 and id2 be integer number of node.item and node.bnode.item respectively. Then id1 is less than id2. Property 2. Let id1 and id3 be integer number of node.item and node.cnode.item respectively. Then id1 is less than id3. Property 1 ensures that nodes in same level will be sorted in ascending order by their integer number. Furthermore, cost of inserting operation and searching operation in
FITr will be reduced efficiently. Property 2 ensures that those frequent item-sets, which have same prefix, share same prefix sub-path in FITr. Furthermore, FITr may store frequent item-sets compactly. Algorithm 1 InsertFIT: inserting frequent item-sets to FITr Input: item-set A, FITr Symbols: A.count denotes support count of the item-set; item(i) denotes No.i item in item-set A; t denotes the number of items in item-set A. Method: (1) Sort(A); (2) for(int i=1, node=head; i<=t; i++){ (3) wNode=FindNode (node,item(i)); (4) if (wNode=null) wNode=Insert(item(i)); (5) node=wNode;} (6) node.count=node.count+A.count; For all item-sets in FITr, their items should be ranked in ascending order by id as shown in Property 2. Therefore, in step 1, items of the item-set are ranked before the item-set is inserted into FITr. The function FindNode(node,item(i)) search the node whose item field is same with item(i) in node’s child level. If the node is found, return the node, otherwise return null. If null is returned, in step 4, a new node, whose item field is set to item(1) and count is set to zero, will be generated, then inserted into node’s child level by Property 1. After the last node of item-set being inserted into FITr, in step 6, count of the node will be modify because the count of the last node should be support count of A. According to algorithm 1, support count of item-set A will be updated correctly when insert A to FITr and there exists A in FITr. Theorem 1. (Compactness of FITr) Let I be collection of items, I={I1,I2,…,Im}, ID(x) be unique integer number given ∈ to x(x ∈ I), A1 and A2 be item-set, and Path(A) be path of item-set A in FITr. We call that Path(A1) ⊂ Path(A2) if A1 ⊂ A2 and (∀x)(∀y )( x ∈ A2 ∧ x ∉ A1 ∧ y ∈ A1 ∧ ID( x) > ID( y )) . Proof. We assume that A1 has been inserted into FITr and the path of A1 is from node1 located in first level to node2. Let A2=A1 ∪ {Im} and ID(Im )>ID(x) (x ∈ A1). According to Property 2, the order of all items of A2 being inserted into FITr is items in A1 ∩ A2 at first, then item Im. According to step 3 and step 5 in Algorithm 1, the path of A1 will be found after all items in A1 ∩ A2 are inserted into FITr. A new node node3 (node3.item=Im) will be inserted at child level of node2. Name the path of node3 to node1 as the path of A2, denoted as Path(A2). It can be concluded that Path (A1) ⊂ Path (A2). 4.3 Storing of Non-frequent Item-Sets and Updating of FITr In our study, a two-level hash indexed linked-lists structure, called THL, is adopted to store non-frequent item-sets. Each linked-list is given a unique name, called class-name. Each node in linked-list consists of three fields: count, next, itNum, where
count registers the support count of an item-set stored in the linked-list, next points to the next node, and itNum stores the integer value obtained by reducing the item-set. The process of item-set reduction includes the following steps: (1) derive each item of the item-set and its integer number id; (2) let an integer, called num, be zero, then set the No. id bit of num to 1. By this reduction, different item-sets are represented by different integers. For example, let {I1, I4, I6} be an item-set and 1, 4, 6 be the integer numbers of I1, I4 and I6, respectively; as a result, itNum = 41. By summing up the ids of all items in the item-set, we get an integer, denoted a. Then denote a modulo 10 as key1. Let b be the base-e logarithm of a, and denote the value of the second place after the decimal point of b as key2. The structure of THL is shown in Fig. 2, where T1 and T2 are class-names of linked-lists. The first level of the two-level hash index uses key1 as the hash key; the second level uses key2. When an item-set is given, the class-name of the linked-list it belongs to can be obtained via the two-level hash index.
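The following sketch illustrates the reduction and the two hash keys; the bit position id−1 is inferred from the example value itNum = 41 for {I1, I4, I6}, and the code is illustrative rather than the authors' implementation.

// Item-set reduction to itNum and the two-level hash keys key1/key2.
public class ItemSetReduction {

    // Reduce an item-set (given by item ids) to a single integer itNum.
    static long reduce(int[] ids) {
        long num = 0;
        for (int id : ids) num |= 1L << (id - 1);    // set the "No. id" bit
        return num;
    }

    // First-level hash key: the sum of the ids modulo 10.
    static int key1(int[] ids) {
        int a = 0;
        for (int id : ids) a += id;
        return a % 10;
    }

    // Second-level hash key: the second digit after the decimal point of ln(sum of ids).
    static int key2(int[] ids) {
        int a = 0;
        for (int id : ids) a += id;
        double b = Math.log(a);
        return (int) (b * 100) % 10;
    }

    public static void main(String[] args) {
        int[] itemSet = {1, 4, 6};                   // {I1, I4, I6}
        System.out.println(reduce(itemSet));         // 41, as in the running example
        System.out.println(key1(itemSet) + " " + key2(itemSet));  // 1 9  (ln(11) = 2.39...)
    }
}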
Fig. 2. Two-level hash indexed linked-lists

Fig. 3. Data structure of WITs
To insert an item-set into a linked-list, we first reduce the item-set to an integer, then generate a linked-list node, and finally insert the node into the linked-list in ascending order of itNum. The count of an existing node is increased if it has the same itNum as the new node. With this data structure, all non-frequent item-sets can be saved in one hundred linked-lists, and each linked-list is materialized on disk under its class-name. In our study, a data structure called WITs is adopted to store temporary data during incremental updating of frequent item-sets. WITs includes a head list and linked-lists whose structure is the same as the linked-lists in THL. A node in the head list consists of two fields: name and head, where name registers the class-name of a linked-list and head is a pointer to that linked-list. Fig. 3 shows the structure of WITs. Before inserting an item-set into WITs, we get the class-name of the linked-list the item-set belongs to by using the two-level hash index, then get the linked-list by searching the head list of WITs, and insert the item-set into the linked-list.

Algorithm 2 GetFITr: Updating of FITr
Input: FITr, WITs, THL, Supmin
Output: Incrementally updated FITr
Method: (1) While(traveling FITr){ (2) get item-set A from FITr; (3) InsertSI(A, WITs);} (4) FITr= ∅ ; (5) while(there exist unprocessed linked-lists in WITs and THL){ (6) wl=GetWL(WITs); (7) nl=GetNL(THL, wl); (8) nl=Merge(wl, nl, FITr); (9) WriteObject(nl);} (10) return FITr; From step 1 to 3, algorithm GetFITr travels FITr. Then inserts all item-sets stored in FITr to WITs. Function InsertSI(A, WITs) inserts item-set A to a linked-list in WITs. Function GetWL (WITs) gets an unprocessed linked-list in WITs, denoted as wl. Function GetNL (THL, wl) gets a linked-list of THL stored in disk, denoted as nl, which has same class-name with wl. Function Merge(wl, nl, FITr) merges wl to nl. Nodes having same itNum in both nl and wl sum up count of nodes. Then those nodes, whose counts are not less than Supmin, are inserted into a new FITr and deleted from nl. In step 9, stores nl to disk and free nl and wl from main memory. Lemma 1. Assume that a new FITr and a new THL can be derived after running algorithm GetFITr. There exists no same item-set in both the FITr and the THL, and the count of item-set is correct. Proof. i: The two-level hash index ensures that an item-set will be given a unique class-name. Therefore, it is impossible that an item-set may emerge on two linked-lists with different class-name in either THL or WITs. ii: Operation of inserting the item-set to linked-list ensures count of node in linked-list to be added when inserts same item-set to a linked-list. iii: According i and ii, the count of item-set may be added correctly when the item-set emerge in both THL and WITs because all same item-sets will be inserted into same linked-list in merging operation of Algorithm 2. iv: There will not be item-sets emerging in both FITr and THL after running algorithm 2 because an item-set will be inserted into FITr or THL by being or not being frequent. Lemma 1 means that algorithm GetFITr can create a new FITr that is incrementally updated correctly. 4.4 Algorithm of Frequent Item-Set Incremental Updating To find frequent item-sets under new support threshold Sup (Sup>Supmin), first step is to read FITr from disk. Then travels FITr to find item-sets whose support count is not less than Sup × |U|. Algorithm 3 is used to find frequent item-sets when database was updated and support threshold was changed. Algorithm 3 FIIU: Incremental updating of Frequent Item-sets Input: db Output: FITr, THL
Algorithm 3 is used to find frequent item-sets when the database has been updated and the support threshold has been changed.
Algorithm 3 FIIU: Incremental updating of frequent item-sets
Input: db
Output: FITr, THL
Method:
(1) WITs = General_FI(db);
(2) FITr = ReadFIT();   // FITr is read from disk
(3) FITr = GetFITr(FITr, WITs, THL);

In the first step of Algorithm 3, all item-sets of db are found and inserted into WITs. General_FI is a general algorithm that mines frequent item-sets in a dataset, such as Apriori [2] or FP-growth [5]. Step 3 uses Algorithm 2 to generate the new FITr, which stores all frequent item-sets of the updated database.
Lemma 2. Given an interesting support threshold Supmin, a database DB and a new dataset db, the complete collection of frequent item-sets can be derived by algorithm FIIU.
Proof. i: The complete collection of item-sets of db can be derived by a general frequent item-set mining algorithm. ii: Because THL and FITr store all item-sets of DB, the new FITr and THL derived after running FIIU include all frequent item-sets, by i and Lemma 1.
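To make the Merge operation used by GetFITr concrete, the following sketch merges a WITs linked-list into the THL linked-list with the same class-name, summing the counts of nodes with equal itNum and moving nodes whose count reaches the minimum support count into the new FITr. It reuses the illustrative Node type from the earlier sketch; minSupCount stands for Supmin expressed as an absolute count. This is an assumption-laden illustration, not the authors' code.

static Node merge(Node wl, Node nl, List<long[]> fitr, int minSupCount) {
    Node dummy = new Node(Long.MIN_VALUE);
    Node tail = dummy;
    while (wl != null || nl != null) {
        Node pick;
        if (nl == null || (wl != null && wl.itNum < nl.itNum)) { pick = wl; wl = wl.next; }
        else if (wl == null || nl.itNum < wl.itNum)            { pick = nl; nl = nl.next; }
        else { nl.count += wl.count; pick = nl; nl = nl.next; wl = wl.next; }   // same itNum: sum counts
        if (pick.count >= minSupCount) {
            fitr.add(new long[]{pick.itNum, pick.count});   // frequent: moves into the new FITr
        } else {
            tail.next = pick; tail = pick;                  // still non-frequent: stays in nl
        }
    }
    tail.next = null;
    return dummy.next;   // new THL linked-list for this class-name
}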
5 FP-Tree Based Distributed Incremental Updating of Frequent Item-Sets
The experiments in Section 6 show that the time cost of FIIU increases dramatically as the number of items in the dataset increases. As is well known, building a distributed system is a good way to improve performance. Because both the FP-tree and Algorithm 2 are suitable for a distributed architecture, we propose an algorithm, DFIIU, based on FIIU and the FP-tree, that incrementally mines frequent item-sets on multiple computers at the same time. The FP-growth algorithm mines frequent item-sets on an FP-tree as follows. First, a header item table is built. Then, the conditional FP-tree of each item in the header table is constructed and mining is performed recursively on that tree. Motivated by FP-growth, it is feasible to hand out the work of constructing the conditional FP-tree of each item and mining frequent item-sets on that tree to different computers in a distributed system. Let NCi be the i-th computer in the distributed system, and assume there are k NC in total. Distributed incremental mining of frequent item-sets proceeds as follows. First, one NC, called the server, builds the FP-tree of the new dataset db and divides the items in the header table of the FP-tree into k parts. The server then sends the FP-tree and the division to the other NC. NCi builds the conditional pattern trees of the i-th part of the items and mines frequent item-sets on these trees; the item-sets derived by NCi are stored in WITsi. NCi divides all linked-lists in WITsi into k parts, WITsi(1), …, WITsi(k). After merging the i-th parts of the linked-lists from the other computers (WITs1(i), …, WITsk(i)), NCi reads the FITr stored on the server's disk, searches it for the frequent item-sets that belong to the i-th part of the linked-lists, and inserts them into WITs(i). NCi also reads the i-th part of the linked-lists in THL, THL(i), which is stored on the server's disk. After running algorithm GetFITr, FITri is built. Finally, the server merges the frequent item trees FITr1 to FITrk, and a new FITr is derived.
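A minimal sketch of how the server might split the header-table items into k parts for the k computers; the round-robin strategy below is our assumption, since the text only states that the items are divided into k parts.

import java.util.*;

class HeaderTableSplitter {
    // Assign header-table items to k computers in round-robin fashion (assumption).
    static List<List<String>> divideItems(List<String> headerItems, int k) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < k; i++) parts.add(new ArrayList<>());
        for (int i = 0; i < headerItems.size(); i++) {
            parts.get(i % k).add(headerItems.get(i));   // item i goes to computer NC(i mod k)
        }
        return parts;
    }
}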
Algorithm DFIIU consists of a client algorithm DFIIU-C and a server algorithm DFIIU-S.
Algorithm 4 DFIIU-C
Input: fp-tree, item, class
Symbols: Let itemi be an array holding the i-th part of the items, and class be an array recording which part of WITs each linked-list belongs to; there are k computers in the distributed system.
Output: FITr
Method:
(1) IS = FP-Growth(itemi, fp-tree);
(2) for each item-set A in IS, InsertSI(A, WITsi);
(3) Divide(class, WITsi);   // divide WITsi into k parts according to class
(4) for(int j=0; j
6 Experiment and Performance Analysis
6.1 Test Environment and Test Datasets
In this section, we evaluate FIIU and DFIIU in comparison with FP-growth. The three algorithms were implemented in Java. Experiments were performed on an Intel C3 1.0 GHz
PC with 512 MB of main memory, running Windows 2000 Server and SQL Server 2000. There are two computers in the distributed system. Let total be the number of transactions and d be the number of new transactions. In the experiments, the run time of FIIU and DFIIU is compared with that of FP-growth when the support threshold is Supmin and the number of transactions is total+d. In the experiments, Supmin is 0.1%. We use synthetic datasets because real datasets are not large enough for the experiments; in addition, Section 6.2 argues that the experimental conclusions do not depend on the particular datasets. The synthetic datasets were generated by Algorithm 6.
Algorithm 6: Generating test datasets
Input: Num, the number of items in the dataset; Total, the number of transactions; Len, the length of the longest transaction in the dataset.
Output: dataset D
Method:
(1) I = CreateItems(Num);   // build the collection of items
(2) For(int i=0; i
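Purely as an illustration of how a generator with the inputs Num, Total and Len could be implemented, the following sketch draws each transaction's items uniformly at random; the uniform choices and all names are our assumptions and this is not the authors' Algorithm 6.

import java.util.*;

class SyntheticDataGenerator {
    // Generate 'total' transactions over 'num' items, each transaction holding
    // between 1 and 'len' distinct items chosen uniformly at random (assumption).
    static List<Set<Integer>> generate(int num, int total, int len, long seed) {
        Random rnd = new Random(seed);
        List<Set<Integer>> dataset = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            int size = 1 + rnd.nextInt(Math.min(len, num));   // transaction length in [1, min(len, num)]
            Set<Integer> tx = new HashSet<>();
            while (tx.size() < size) {
                tx.add(rnd.nextInt(num));                     // item ids in [0, num-1]
            }
            dataset.add(tx);
        }
        return dataset;
    }
}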
DFIIU further reduces the time cost because it hands out part of the mining work to the other computers in the distributed system, although I/O cost counteracts some of the benefit; this is confirmed by tests 1, 2 and 3.
6.3 Experiment Results
In the tests, N, L, T and d denote Num, Len, Total and |db| respectively. In test 1, shown in Fig. 4, N=15, L=10, d=5k. We report experimental results on 8 different dataset sizes. It can be observed that DFIIU and FIIU outperform FP-growth when T>200K and T>350K respectively, and the average time of DFIIU is lower than that of FIIU. In test 2, shown in Fig. 5, N=17, L=17, d=10k. Experiments were performed on 5 different dataset sizes; DFIIU and FIIU outperform FP-growth when T>700K and T>1200K respectively. In test 3, shown in Fig. 6, N=19, L=19, d=10k, experiments were performed on 6 different dataset sizes; DFIIU and FIIU outperform FP-growth when T>800K and T>1.3M respectively.
Fig. 4. N=15 L=10 d=5k (run time in seconds vs. dataset size in K; FIIU, DFIIU, FP-Growth)
Fig. 5. N=17 L=17 d=10k (run time in seconds vs. dataset size in K; FIIU, DFIIU, FP-Growth)
Fig. 6. N=19 L=19 d=10k (run time in seconds vs. dataset size in K; FIIU, DFIIU, FP-Growth)
Fig. 7. N=17 L=17 (run time in seconds vs. support threshold; FIIU and FP-Growth on D1 and D2)
Fig. 7 shows the experimental results for the first problem of incremental updating on two datasets. In the first one, denoted D1, N=17, L=17, T=100K. In D2, N=17, L=17,
T=200K. The time cost of FIIU was compared with that of FP-Growth at five different support thresholds. We can observe from Fig. 7 that the time cost of FIIU stays unchanged while that of FP-Growth increases as the size of the dataset increases. For each dataset, the average time cost of FP-Growth over the five thresholds is larger than that of FIIU. FIIU therefore has an obvious advantage over FP-Growth in solving the first problem of incremental updating.
7 Conclusions
In this paper, we have proposed efficient methods for three problems of incremental updating of frequent item-sets. The experiments show that FIIU and DFIIU perform better than FP-Growth (and other size-sensitive incremental updating algorithms) when the dataset is large. In particular, FIIU has a great advantage on the first problem of incremental updating. However, we must point out that finding all item-sets, on which FIIU and DFIIU are based, is an NP-hard problem. It may be impossible to mine all item-sets when the number of items in the dataset is large. Therefore, for the second and third problems of incremental updating, FIIU and DFIIU show better performance when the number of items in the dataset is small.
References
1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD, Washington, 1993.
2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 20th International Conference on VLDB, Santiago, 1994, pp. 487-499.
3. J. S. Park, et al. An efficient hash-based algorithm for mining association rules. In Proc. 1995 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'95), San Jose, CA.
4. A. Savasere, et al. An efficient algorithm for mining association rules in large databases. In Proc. 1995 Int. Conf. on Very Large Data Bases (VLDB'95), Zurich, Switzerland.
5. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'00).
6. W. Cheung and O. R. Zaiane. Incremental mining of frequent patterns without candidate generation or support constraint. In Proc. IDEAS 2003, Hong Kong.
7. Y. Zhu, Z. Sun, and X. Ji. Incremental updating algorithm based on frequent pattern tree for mining association rules. Chinese Journal of Computers, 2003, 26(1): 91-96.
8. Xiu-Li Ma and Yun-Hai Tong. Efficient incremental maintenance of frequent patterns with FP-tree. J. Computer Science and Technology, 2004, 19(6): 876-884.
9. R. Agrawal and J. C. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6): 962-969.
10. E. H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD'97).
11. Z. G. Zhang. Study of Data Mining Algorithms on Massive Data. Ph.D. thesis, Harbin Institute of Technology, Harbin, 2003.
12. S. Brin, R. Motwani, and C. Silverstein. Generalizing association rules to correlations. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD'97).
Mining Maximum Frequent Access Patterns in Web Logs Based on Unique Labeled Tree Ling Zhang, Jian-ping Yin, and Yu-bin Zhan School of Computer, National University of Defense Technology, Changsha, Hunan, 410073, China [email protected]
Abstract. Discovering users' Frequent Access Patterns is one of the research hotspots in web log mining. A novel Apriori-based algorithm named s-Tree is proposed for mining maximum Frequent Access Patterns. The main contributions of the s-Tree algorithm are the following. Firstly, a unique labeled tree is used to represent a user session, which enables us to mine the maximum forward reference transactions and the users' preferred access paths. Secondly, an improved method of calculating support, based on the impact factor of content pages, is proposed, which helps us to discover patterns that are more important and interesting than those found by normal methods. Thirdly, two special strategies are adopted to reduce the overhead of joining frequent patterns. Finally, experiments show that the s-Tree algorithm is scalable and more efficient than previous graph-based structural pattern mining algorithms such as AGM and FSG.
account the website topology structure. Moreover, multi-window browsers enable users to view many web pages simultaneously; the order in which these pages are visited is unimportant and uninteresting, so it is meaningless to sequence them. We therefore propose a new representation of the user session, the unique labeled tree, to solve the problems of the above two kinds of representations. Normal algorithms for mining web logs adopt the definition of support given in Definition 1 in Section 3 [3,8,9]. This definition considers every visited page equally important, but the functions of, and users' interest in, content pages, navigation pages and search pages are different, so some important and interesting patterns cannot be discovered. In order to overcome this limitation, this paper defines the impact factor of content pages on the support of a pattern and proposes a new method to calculate pattern support. Based on the unique labeled tree representation of user sessions, a novel algorithm named s-Tree is proposed to mine the maximum frequent patterns. It adopts the new method of calculating pattern support. The s-Tree algorithm improves two aspects of the Apriori algorithm: it generates candidates by joining frequent patterns with frequent edges, and it reduces overhead by pre-pruning. Our experiments show that the s-Tree algorithm is scalable and more efficient than previous graph-based structural pattern mining algorithms such as AGM [6], FSG [7], AGMG, and FSG [11]. The paper is organized as follows: in Section 2, we propose the unique labeled tree representation of user sessions; in Section 3, we analyze the characteristics of interesting access patterns, define the impact factor of content pages on the support of a pattern and propose a new method of calculating pattern support; in Section 4 we introduce the s-Tree algorithm; in Section 5 we conduct experiments and analyze the results; finally, we conclude and point out our future research directions.
2 Representing User Session by Unique Labeled Tree
The large amount of web log data generated daily by web servers is an important data source for studying users' visiting behaviors and interests. The data must be preprocessed because it cannot be applied directly to web mining. The normal preprocessing procedure can be divided into four steps [3,12]: data cleaning, user identification, session identification and path completion. After preprocessing, user sessions may be represented by sets or by sequences of visited pages. However, these two representations are unable to depict some of the user's visiting behaviors. A user logs in to a website by clicking a hyperlink or typing the URL, then regards that page as a root and views other pages through the hyperlinks on the current page. The order in which those pages are visited is unimportant and uninteresting. The user can again click the hyperlinks on those pages to visit further pages. Thus the visit procedure forms a tree, and it is in fact a "width-first" traversal tree. Meanwhile we care not only about the visited order of pages but also about the hyperlinks by which the pages are visited, i.e., the user's traversal path. So a labeled tree [13] is introduced to represent the user session, which can not only depict the user's visiting behavior but also describe the visit sequence of the pages and the user's traversal path. However, in a user session a user may visit one page several times, and then a few nodes would have the same label. This makes it hard to
distinguish isomorphic sub-trees. We assume that only the first visit to a page is normal and meaningful, so we consider only the first visit to each web page; then no two nodes of the tree have the same label. We call such a tree a unique labeled tree. In a session we neglect the second and later visits to the same page and take their successors as successors of the first visit instead. Take Fig. 1 as an example, where page C is visited twice and H-I is the successor of the second visit. We obtain Fig. 2 by taking H-I as the successor of the first visit of page C. In this section, we use a unique labeled tree to represent the user session. In Fig. 2, the user session represented by the unique labeled tree is the same as the one in Fig. 1.
Fig. 1. The labeled tree representation of user session
Fig. 2. The unique labeled tree representation of user session
Definition 1 (unique labeled tree). Given a tree T with node set N and a label set L, T is a unique labeled tree iff there exists an injection f: N → L such that for each v ∈ N, f(v) = l ∈ L. f(v) is called the label of node v. A unique labeled tree of a user session is called a session tree. The labels of the nodes represent web pages and the edges represent hyperlinks. For example, in Fig. 1 the paths A-B-G, A-C-H-I and A-D are three visited page sequences. On each path a child page node is visited using a hyperlink on its parent page node. The visited order of some pages such as B, C and D is uninteresting, yet the entry hyperlinks linking to them, which form the user's traversal path, are important. Since the user session is represented by a unique labeled tree, Frequent Access Patterns can be defined as follows.
Definition 2 (Frequent Access Pattern). Let T = {T1, T2, …, Tm} be a database of session trees. A unique labeled tree S is called a pattern, and the support of S is defined by the formula

    sup(S) = |{Ti | S ⊂ Ti}| / |T|.

Given the minimum support threshold min-sup, if sup(S) ≥ min-sup, then S is a frequent access pattern. S is called a Maximum Frequent Access Pattern if there is no frequent access pattern S* such that S ⊂ S*. Given the database of session trees and a user-specified minimum support threshold, the problem we want to solve is to find all Frequent Access Patterns. According to Definition 2, a Frequent Access Pattern in the s-Tree algorithm is represented by a unique labeled tree. This has two advantages: (1) it is easy to identify a Maximum Forward Reference transaction, which is simply a path from the root to a leaf node of a session tree; (2) it is easy to discover users' preferred traversal paths. In every
pattern discovered, every path from the root to a leaf node is a preferred traversal path of the users, which cannot be discovered by mining algorithms based on the set representation of user sessions. Algorithms based on sequences of visited pages can discover frequent access sequences, but cannot discover the traversal path based on the website's topological structure.
3 An Improved Method of Calculating Support Based on the Impact Factor of Content Pages
A website contains many web pages covering many different topics. To make it easy for users with different goals and interests to browse the website, these pages are organized hierarchically according to subject. Based on this hierarchy and their functions, web pages can be broadly divided into four classes:
1. Homepage - a page whose purpose is to be the first page that users visit.
2. Content page - a page that contains a portion of the information content that the web site is providing.
3. Navigation page - a page whose purpose is to provide links to guide users on to content pages.
4. Look-up page - a page used to provide a definition or acronym expansion.
The homepage and navigation pages are visited most frequently; however, some content pages which users are really interested in lie in the deep levels of the website and are not visited as frequently. We therefore concentrate on users' visits to content pages and on discovering association rules between content pages. These association rules become the basis for recommending interesting web pages to users. In order to mine the patterns which contain more content pages, we should either reduce their minimum support threshold or increase their support weight according to the number of content pages. This paper uses the latter method and quantifies the impact of content pages on the support of a pattern with an impact factor.
Definition 3 (impact factor). Given a pattern T, let n be the total number of pages in T and m be the number of content pages. The impact factor δ of the content pages on the support of the pattern is

    δ = 1 − e^(−m/n).

In the s-Tree algorithm we therefore take the content pages' impact factor into account: given a pattern T and the content pages' impact factor δ, the pattern T's support is sup(T) = sup_init(T) · (1 + δ), where sup_init(T) is calculated in the way defined in Definition 1.
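As a small worked example of Definition 3 (the numbers are ours, chosen only for illustration): for a pattern with n = 5 pages of which m = 2 are content pages,

    δ = 1 − e^(−2/5) ≈ 1 − 0.670 = 0.330,

so a pattern whose initial support is sup_init(T) = 0.020 receives the adjusted support sup(T) = 0.020 × (1 + 0.330) ≈ 0.027, which makes content-rich patterns more likely to pass a fixed min-sup.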
4 s-Tree Algorithm
The s-Tree algorithm, based on Apriori, employs an iterative approach known as a level-wise search, where frequent k-trees (trees with k edges) and frequent edges are used to explore frequent (k+1)-patterns. There are two input parameters to the s-Tree algorithm: the database D of session trees built from the web logs and the minimum support threshold min-sup.
The pseudo-code for the s-Tree algorithm is as follows:
Algorithm 1 s-Tree algorithm (D, min-sup)
Input: the database D of session trees; the minimum support threshold min-sup.
Output: the set of maximum frequent access patterns S.
1.  Scan the database D, find the set of frequent edges Edge.
2.  Scan the database D again and perform the pre-pruning operation on every session tree in D to obtain the new database D′.
3.  i←1; F1←Edge; S←∅.
4.  While (Fi ≠ ∅)
5.    Do { Ci+1 = Generation(Fi)   // join and generate candidate frequent trees
6.      If Ci+1 = ∅, return S.
7.      For each candidate ti+1 ∈ Ci+1
8.        { ti+1.Count←0
9.          for each T ∈ D′
10.           if (sub-tree(ti+1, T))   // judge whether ti+1 is a sub-tree of T
11.             ti+1.Count++ }
12.     Fi+1←{ti+1 ∈ Ci+1 | sup(ti+1) ≥ min-sup}
13.     delete the patterns in S that are contained in a pattern in Fi+1
14.     S←S ∪ Fi+1
15.     i←i+1 }
16. Return S
We do not adopt the strategy of joining two frequent patterns in the join step, because judging whether two frequent patterns are joinable is a time-consuming operation. Instead, we generate candidates by adding a frequent edge to a frequent pattern tree. At first the algorithm scans the database D and finds all frequent edges. Then it scans D again and pre-pruning is performed on the session trees. For each session tree in D that has non-frequent edges, we remove each non-frequent edge and its child node from the tree; if that child node has children, we take them as new children of the non-frequent edge's father node, and no other change is needed. Take the session tree in Fig. 2 as an example: if edge (C,H) is non-frequent, the algorithm removes the edge (C,H) and the child node H, and the child of H becomes the new child of node C, i.e., we add an edge (C,I) to the tree. After pruning, the session is converted into the session tree in Fig. 3; the size of the database is reduced significantly and the mining result is not changed. Finally, Edge is regarded as the set of frequent 1-patterns, denoted F1. It is joined with itself to generate the candidate set C2, which after pruning and deleting non-frequent patterns becomes F2; F2 is joined with Edge to find C3, which after pruning and deleting non-frequent patterns becomes F3, and so on, until no more frequent patterns with k edges can be found. In the join step, we generate the candidate set Ci+1 by adding corresponding frequent edges to the frequent patterns in Fi. Generation of candidates accounts for almost all of the overhead
of algorithms based on Apriori. In the s-Tree algorithm, we adopt two strategies to reduce the overhead of candidate generation. Firstly, if the level of the pattern tree is m, the algorithm chooses an edge from Edge and adds it to the tree only at the nodes at level m-1 or m. For example, in Fig. 3 the algorithm adds an edge only at the nodes B, C and D at level 2 and the nodes G and H at level 3. Let tn be a frequent n-tree whose level is m (m ≥ 2), and suppose tn+1 is a frequent (n+1)-tree generated by adding a frequent edge to tn at a level other than m-1 or m; then tn+1's level is also m. By the Apriori property, after tn+1 is stripped of an edge A linking nodes at levels m and m-1, it is still a frequent tree, denoted tn*. The level of tn* is m or m-1, and the pattern tree tn+1 can be generated by adding the edge A to the frequent pattern tn*. Therefore the algorithm does not lose any frequent patterns, and the number of candidates and the overhead of the algorithm are reduced significantly by adding only frequent edges chosen from the frequent edge set to the frequent patterns. Secondly, children of nodes in a session tree are sorted in alphabetical order, so the algorithm adds frequent edges only at the nodes that are at the same level as, and behind, the father node of the right-bottom-most edge, or at the nodes at the same level as the child node of that edge. For example, in Fig. 3 the arc (C,I) is the right-bottom-most edge, so the algorithm adds edges only at the nodes B, D, G and I, which are labeled by arrowheads. This does not lose any frequent patterns, but it reduces the number of candidate trees and the overhead of the algorithm significantly.
Fig. 3. Illustration of the incremental edge positions
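The pre-pruning step of Algorithm 1 (step 2) can be sketched as follows for a session tree whose nodes carry page labels; the tree representation, the edge encoding "parent->child", and the re-checking of reattached grandchildren are our illustrative assumptions, not the authors' code.

import java.util.*;

class SessionNode {
    String label;                                   // page label (unique within a session tree)
    List<SessionNode> children = new ArrayList<>();
    SessionNode(String label) { this.label = label; }
}

class PrePruner {
    // Remove every child reached through a non-frequent edge (parent.label, child.label)
    // and attach that child's children to the parent, as in the (C,H) example above.
    static void prune(SessionNode parent, Set<String> frequentEdges) {
        List<SessionNode> kept = new ArrayList<>();
        Deque<SessionNode> pending = new ArrayDeque<>(parent.children);
        while (!pending.isEmpty()) {
            SessionNode child = pending.poll();
            if (frequentEdges.contains(parent.label + "->" + child.label)) {
                kept.add(child);
            } else {
                pending.addAll(child.children);     // grandchildren become candidate children of parent
            }
        }
        parent.children = kept;
        for (SessionNode child : kept) prune(child, frequentEdges);
    }
}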
After generating the candidate set, the s-Tree algorithm deletes from it the candidate patterns that cannot be frequent. In the prune step, the Apriori property is used as follows: a candidate tree is not frequent if any of its sub-trees is not frequent, and so it can be removed from the candidate set. Suppose that ci+1 is generated by adding a frequent edge A to the frequent tree tn. The algorithm verifies only the sub-trees of ci+1 that contain the edge A, because all sub-trees of tn are frequent; ci+1 is removed from the candidate set if any one of them is not frequent. Each candidate in Ci is frequent only if its support satisfies the user-specified minimum support threshold. In the procedure of calculating support, we first find all the session trees with more than i edges. If the number of such sessions is less than min-sup·|D|, no candidate in Ci can be frequent and the algorithm exits. Otherwise, the algorithm must judge whether a given candidate in Ci is a sub-tree of a given session tree. After calculating the support of each candidate, we can find all
the frequent patterns in Fi+1. In Algorithm 1 we use the Boolean function sub-tree(ti+1, T) to describe this procedure. The pseudo-code for the function sub-tree(t, T) is as follows:
Algorithm 2 sub-tree(t, T)
Function: it outputs TRUE if tree t is a sub-tree of T, and FALSE otherwise.
Parameters: the input arguments t and T are trees, identified with the root nodes of the trees. The symbol children(T) denotes the set of T's children nodes and f(v) is the label of node v.
{ if f(t)=f(T) return Yessub-tree(t,T);
  else return Nosub-tree(t,T). }

Nosub-tree(t,T)
{ if (∃M∈children(T) such that f(M)=f(t))
    return Yessub-tree(t,M);
  else if (children(T)=∅)
    return FALSE;
  else { for each M∈children(T)
           if (Nosub-tree(t,M)) return Nosub-tree(t,M);
         return FALSE; } }

Yessub-tree(t,T)
{ if children(t) ⊄ children(T) return FALSE;
  else { if children(t)=∅ return TRUE;
         else for each m∈children(t)
           { find M∈children(T) such that f(m)=f(M);
             if (!Yessub-tree(m,M)) return FALSE; } }
  return TRUE; }

Distinguishing isomorphism of sub-trees is very time-consuming, but in Algorithm 2 the time cost is O(n + N) in the worst case, where n and N are the numbers of nodes of trees t and T respectively. Thus using the unique labeled tree to represent the user session significantly improves the efficiency of the s-Tree algorithm and makes it more efficient than the AGM and FSG algorithms. This is the reason why we use a unique labeled tree rather than a labeled tree or a labeled unordered tree [14] to represent the user session.
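A compact Java rendering of the sub-tree test, exploiting the fact that labels are unique within each tree; it follows the spirit of Algorithm 2 rather than reproducing it verbatim, and the map-based tree encoding is our assumption.

import java.util.*;

class LabeledTreeMatcher {
    // Trees are encoded as adjacency maps: node label -> set of child labels.
    // Returns true if pattern t occurs as a sub-tree of session tree T.
    static boolean isSubTree(Map<String, Set<String>> t, String tRoot,
                             Map<String, Set<String>> T, String TRoot) {
        if (!containsNode(T, TRoot, tRoot)) return false;   // locate t's root label somewhere in T
        return matchDown(t, tRoot, T);
    }

    private static boolean containsNode(Map<String, Set<String>> T, String root, String label) {
        if (root.equals(label)) return true;
        for (String c : T.getOrDefault(root, Set.of()))
            if (containsNode(T, c, label)) return true;
        return false;
    }

    // Because labels are unique, each pattern node matches at most one node of T,
    // so only child-set containment has to be checked level by level.
    private static boolean matchDown(Map<String, Set<String>> t, String node,
                                     Map<String, Set<String>> T) {
        Set<String> patternChildren = t.getOrDefault(node, Set.of());
        Set<String> treeChildren = T.getOrDefault(node, Set.of());
        if (!treeChildren.containsAll(patternChildren)) return false;
        for (String c : patternChildren)
            if (!matchDown(t, c, T)) return false;
        return true;
    }
}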
5 Experiment Results
In this section we show the performance of the s-Tree algorithm through experiments. The algorithm is implemented with the C++ standard library and run on a machine with an Intel Celeron 2.4 GHz CPU, 256 MB RAM, and Windows XP. The test data sets BMS-WebView-1 and BMS-WebView-2 were downloaded from the homepage of KDD-CUP 2000 (www.ecn.purdue.edu\KDDCUP). They contain several months of click-stream e-commerce data. User sessions are represented by unique labeled trees before the experiments. Table 1 compares the two test data sets in terms of size, number of transactions, total occurrences of items, number of distinct items, maximal transaction size, and average transaction size. In the experiments that follow, the run time includes reading, computation and writing the results to disk.
5.1 Scalable Experiments
We perform scalability experiments on several session sets selected from the above two datasets (the minimum support threshold is 0.3% on BMS-WebView-1 and 0.33% on BMS-WebView-2). Fig. 4 shows the relation between run time and number of sessions. The experiments show that the s-Tree algorithm is scalable and more efficient than previous graph-based structural pattern mining algorithms such as AGM and FSG.

Table 1. Characteristics of datasets

Dataset          Size    No. of Sess.   Total No. of pages   Max. Sess. size   Avg. Sess. size
BMS-WebView-1    2.07M   59602          497                  267               2.51
BMS-WebView-2    5.01M   77512          3308                 161               4.62

Fig. 4. Run time vs No. of sessions (run time in seconds vs. number of sessions ×10000; BMS-WebView-1 and BMS-WebView-2)
5.2 Analysis of Run Time
We conduct experiments on the above two datasets with different thresholds. For BMS-WebView-1, the run time decreases most sharply when the threshold is near 0.3%: as Fig. 5 shows, the run time is 113.1 seconds when the threshold is 0.24% and drops sharply to 2.12 seconds when the threshold is 0.35%. The reason is that the number of frequent patterns decreases sharply around this threshold. For BMS-WebView-2, the run time decreases most sharply when the threshold is near 0.35%.
(a) Run time (s) vs Threshold (%)   (b) Run time (s) vs Threshold (%)   (both panels: BMS-WebView-1 and BMS-WebView-2)
Fig. 5. The relation between Run time and Threshold
5.3 Comparison of Mining Results
s-Tree can mine frequent patterns that are ignored by normal algorithms, because we increase the weight of the important and interesting patterns containing more content pages. Table 2 compares the mining results of the s-Tree algorithm with those of normal algorithms when the threshold is 0.3% on BMS-WebView-1. From the comparison, we conclude that our algorithm discovers more frequent patterns than normal algorithms.

Table 2. Comparison of the results of the s-Tree algorithm and normal algorithms

No. of patterns       6    7    8    (8,10]   (10,20]   (20,∞)
Normal algorithms     8    5    5    4        0         0
s-Tree algorithm      13   7    9    7        2         0
6 Conclusion
This paper proposes a unique labeled tree to represent user sessions. It overcomes the limitations of the existing representations and allows the frequent patterns to contain more useful information such as traversal paths. A new improved method of calculating support is proposed in terms of the four types of web pages; it can discover important and interesting patterns that are ignored by normal algorithms. The s-Tree algorithm generates candidates by joining frequent edges with frequent patterns, which reduces the overhead. However, the calculation of support is somewhat rough, and the interestingness of the patterns is hard to express in the s-Tree algorithm. In the future we will concentrate on how to quantify the impact of content pages on the support of a pattern more rationally and effectively.
References
1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB 1994, Santiago, Chile, 1994, pp. 487-499.
2. J. Pei, J. Han, et al. Mining access patterns efficiently from web logs. In PAKDD'00, Kyoto, Japan, 2000.
3. J. Srivastava, R. Cooley, et al. Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations, 2000, Vol. 1.
4. R. Kosala and H. Blockeel. Web mining research: a survey. ACM SIGKDD Explorations, 2000.
5. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD 2000, 2000.
6. A. Inokuchi, T. Washio, and H. Motoda. An Apriori-based algorithm for mining frequent substructures from graph data. In PKDD 2000, Lyon, France, 2000.
7. M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM 2001, San Jose, USA, 2001.
8. Liping Sun and Xiuzhen Zhang. Efficient frequent pattern mining on web logs. In APWeb 2004, pp. 533-542.
9. Ezeife and Yi Lu. Mining web log sequential patterns with position coded pre-order linked WAP-tree. Data Mining and Knowledge Discovery, 2005, 10, 5-38.
10. Han Jiawei, Meng Xiaofeng, Wang Jing, and Li Shengen. Research on Web mining: a survey. Journal of Computer Research & Development, 2001, 38(4): 405-413.
11. Wang Wei, Zhou Haofeng, Yuan Qingqing, et al. Mining frequent patterns based on graph theory. Journal of Computer Research & Development, 2005, 38(4): 230-235.
12. Robert Cooley, Bamshad Mobasher, et al. Data preparation for mining World Wide Web browsing patterns. Knowledge and Information Systems, Vol. 1, No. 1, 1999, pp. 5-23.
13. Chen Wang, Mingsheng Hong, Wei Wang, et al. Chopper: an efficient algorithm for tree mining. Journal of Computer Science and Technology, 2004, Vol. 19(4): 309-319.
14. Yun Chi, Yi Xia, Yirong Yang, and Richard R. Muntz. Mining closed and maximal frequent subtrees from databases of labeled rooted trees. IEEE Transactions on Knowledge and Data Engineering, 2005, 27(2): 190-202.
Short Documents Clustering in Very Large Text Databases*
Yongheng Wang1, Yan Jia2, and Shuqiang Yang3
Computer School, National University of Defense Technology, Changsha, China
1 [email protected] 2 [email protected] 3 [email protected]
* This project is sponsored by the National 863 High Technology Development Foundation (No. 2004AA112020, No. 2003AA115210 and No. 2003AA111020).
Abstract. With the rapid development of the Internet and communication technology, huge amounts of data are being accumulated. Short text such as paper abstracts and e-mails is common in such data. It is useful to cluster such short documents to understand the data structure or to help build other data mining applications. However, almost all current clustering algorithms become very inefficient or even unusable when handling very large (hundreds of GB) and high-dimensional text data. It is also difficult to achieve acceptable clustering accuracy, since key words appear only a few times in short documents. In this paper, we propose a frequent-term-based parallel clustering algorithm which can be used to cluster short documents in very large text databases. A novel semantic classification method is also used to improve the accuracy of clustering. Our experimental study shows that our algorithm is more accurate and efficient than other clustering algorithms when clustering large-scale short documents. Furthermore, our algorithm has good scalability and can be used to process even larger data sets.
1 Introduction
With the rapid development of the Internet and communication technology, huge amounts of data, most of which is text, are accumulated in many companies and organizations. A vast amount of important scientific and business data appears as documents, and many of these documents are short documents with no more than 50-100 words: abstracts of papers, most e-mails, conversations in chat rooms, and so on. It is useful to cluster such short documents to understand the data structure or to help build other data mining applications. In this paper, we address the following task: clustering short documents in very large text databases efficiently and accurately. Document clustering has been studied intensively because of its wide applicability in areas such as web mining, search engines, information retrieval, and topological analysis. There are a large number of clustering algorithms that can be used for document clustering. J. Han and M. Kamber [1] categorized the major clustering methods into hierarchical methods, partitioning methods, density-based methods, grid-based methods and model-based methods. Some clustering algorithms integrate the
ideas of several clustering methods. A recent study shows that a variant of k-means called the bisecting k-means algorithm outperforms other algorithms in text clustering. When used to cluster short documents in very large text databases, traditional clustering methods mainly encounter the following new challenges: (1) text data is usually high dimensional and sparse; (2) text databases can be very large, some containing more than 100 TB of documents, and most efficient clustering algorithms become inefficient or even unusable when the dataset is very large; (3) getting an understandable description of the clusters is difficult with common clustering methods; (4) it is difficult to get acceptable precision when clustering short documents with common methods based on the vector space model, since key words appear only a few times or just once in short documents. Frequent-term-based text clustering [2,3] can be used to solve problems 1 and 3. Problem 2 can be solved by using parallel clustering algorithms [4] or disk-based algorithms instead of memory-based algorithms. In order to improve the precision of document clustering, other methods such as semantic-information-based text clustering [5,6] have been developed. Based on the observations above, in this paper we propose a new algorithm named SDC (Short Documents Clustering) to cluster short documents in very large text databases. We make the following contributions in this paper.
- We use a novel semantic-classification-based method to eliminate cluster overlapping, which is accurate for short documents. The method is also more efficient than other common semantic classification methods for short documents.
- We take advantage of the functionality of the text database to improve performance. Logical data partitioning and text indexes are used in our algorithm, which makes it efficient for processing very large text data sets.
- We develop a parallel clustering algorithm that has good scalability for very large text databases.
The remainder of this paper is organized as follows. In section 2, we give a general review of current related works, including frequent-term based clustering and semantic-based clustering. In section 3, we introduce the SDC algorithm. In section 4, we introduce our experiments on large and very large data sets. Then a discussion and conclusion is presented in section 5.
2 Related Work
Frequent-term-based text clustering is a recently introduced text clustering method using frequent term sets, which are sets of terms co-occurring in more than a threshold percentage of all documents of a database [2] and which can be efficiently discovered by association rule mining algorithms such as Apriori. Frequent-term-based text clustering is based on the following ideas: (1) frequent terms carry more information about the "cluster" they might belong to; (2) highly correlated frequent terms probably belong to the same cluster. The FTC algorithm is a frequent-term-based text clustering algorithm developed by F. Beil [2]. FTC greedily picks up the next frequent term set (representing the next cluster) to minimize the overlapping of the documents that contain both the term set
and some remaining term sets. The clustering result depends on the order of picking up term sets, which in turn depends on the greedy heuristic used. FIHC is another frequent-term-based text clustering approach, developed by Benjamin Fung [3]. There are two steps in FIHC. In the first step, for each frequent term set, an initial cluster is constructed to contain all the documents that contain this term set. Initial clusters overlap, because one document may contain several frequent term sets. In the second step, the overlapping between clusters is eliminated. A score function is defined to measure the goodness of a cluster for a document. The quality of the clusters detected by FIHC greatly depends on the score function, which is not easy to define for short documents, and the score function is not easy to compute efficiently in very large text databases. The work of this paper is based on [3], but we provide a parallel algorithm and use a semantic classification method to improve the accuracy. Traditional clustering methods treat a document as a bag of words and do not consider the semantic relationships between the words, which reduces the quality of clustering. This matters more for short documents, since key words appear fewer times in them. Recently, some clustering approaches based on semantic information have been developed to improve clustering quality. Hotho [5] introduced an ontology-based text clustering method named COSA. COSA uses a simple core ontology for restricting the set of relevant document features and for automatically proposing good aggregations. B. Choudhary [6] used a new method for the creation of document vectors in his text clustering method. This approach uses the Universal Networking Language (UNL) representation of a document. The UNL (explained in detail in [Uchida, Zhu and Della 1995]) represents the document in the form of a semantic graph with universal words as nodes and the semantic relations between them as links. These methods achieve better clustering quality in some specific areas, but they depend on domain ontologies.
3 Method Development and the SDC Algorithm
In this section, we first introduce the basic ideas of the SDC algorithm and then provide the algorithm description.
3.1 Basic Clustering Method and Cluster Representation
SDC is a frequent-term-based clustering method. We have developed an efficient algorithm for mining frequent term sets in very large text databases [9]. SDC detects clusters without overlapping. The basic clustering method contains two steps. In the first step, for each frequent term set, an initial cluster is constructed to contain all the documents that contain this term set. In the second step, a novel method based on semantic classification is used to eliminate the overlapping of clusters. In SDC we take advantage of the functionality of the text database to improve performance. Most commercial database systems provide a text index. For example, Oracle Database 10g includes a text retrieval component named Oracle Text [8]. Oracle Text indexes documents by converting all words into tokens. The general structure of an Oracle Text index is an inverted index where each token contains the list of documents (rows) that contain that token. In Oracle Text, it is efficient to get
documents that contain specified words in a very large database using a SQL statement like the following:
SELECT document FROM table WHERE CONTAINS (document, 'Dog and Cat') > 0;
In a very large text database, a cluster may contain millions of documents, so it is not feasible to return all the documents to the user as the result of the clustering. In this paper, we use the frequent term set as the label of the cluster. With the help of the text index, we can easily get the documents that contain the frequent term set. But different clusters may overlap, since one document may belong to more than one cluster. In order to represent the resulting clusters, we define the document table structure as in Table 1.
Table 1. Document table structure
Document ID    Content    Cluster ID    Cluster Number
…              …          …             …
If a document belongs to only one of the initial clusters, its cluster ID does not need to be set. If the document belongs to more than one initial cluster, the final cluster ID is filled into the cluster ID column of the document table after the overlapping is eliminated. Then, for each cluster Ci labeled with frequent term set Fi, we can get all documents that contain Fi with a SELECT statement and then remove the documents whose cluster ID is not equal to Ci.
3.2 Parallel Method for SDC
SDC is a parallel frequent-term-based short document clustering algorithm for very large text databases. It has one controlling node and n mining nodes. We cannot use a traditional data partitioning method to support our parallel algorithm, because one cluster may span many partitions. Instead we use a logical data partitioning method in which no data is actually moved, as shown in Fig. 1. We suppose the controlling node and the mining nodes are all in a cluster (of computers) and the database is available to all nodes. Each mining server processes a part of the data but need not move the data to the local machine. When eliminating cluster overlapping, the documents in each logical partition can be processed in parallel by taking advantage of the functionality of the text database.
Fig. 1. Logical partition for parallel processing
Mainly there are three steps in SDC. In the first step, the controlling node partitions the data logically and sends the frequent term sets to all mining nodes. In the second step, each mining node processes all overlapping documents and eliminates the overlapping. In the last step, the controlling node gathers the local results from all mining nodes and merges them into the global result. In step 2, each mining node first gets the documents in its logical partition that contain more than one frequent term set and then uses the method described in Section 3.3 to decide which cluster each of them should be in.
3.3 Overlapping Elimination
The cluster overlapping elimination method in this paper is based on semantic classification. As described in Section 3.1, after the first step of the clustering method we get n clusters, but they overlap. We regard the overlapping elimination problem as a classification problem: classify the overlapped short documents into the n categories. We provide a novel kNN-like classification method based on a semantic text features graph. The semantic text features graph is based on information inference [7,10]: an information inference mechanism has been proposed which automatically computes information flow through a high-dimensional conceptual space. Each concept is represented as a vector of other concepts in the conceptual space, and the information flow reflects how strongly Y is informationally contained within X, which discovers the implicit associations between concepts.
Definition 1 (Concept representation in HAL). A concept c is a vector c = <Wcp1, Wcp2, …, Wcpn>, where p1, p2, …, pn are called dimensions of c, n is the dimensionality of the HAL space, and Wcpi denotes the weight of pi in the vector representation of c. A dimension is termed a property if its weight is greater than zero. A property pi of a concept c is termed a quality property iff Wcpi > ∂, where ∂ is a nonzero threshold value. Let QP∂(c) denote the set of quality properties of concept c. QPμ(c) denotes the set of quality properties with weight above the mean value, and QP(c) is short for QP0(c).
Definition 2 (HAL-based information flow). HAL-based information flow is defined as:

    i1, i2, …, ik |- j  iff  degree(⊕ci cj) > λ                                   (1)

where ci denotes the conceptual representation of token i and λ is a threshold value. ⊕ci refers to the combination of the HAL vectors c1, …, ck into a single vector representing the combined concept.
    degree(ci cj) = ( Σ_{pl ∈ QPμ(ci) ∧ QP(cj)} Wci pl ) / ( Σ_{pk ∈ QPμ(ci)} Wci pk )        (2)
The degree of inclusion is computed in terms of the ratio of intersecting quality properties of ci and cj to the number of quality properties in the source ci as shown in equation 2.
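To make equation (2) concrete, the following sketch computes the degree of inclusion for two HAL concept vectors stored as weight maps; QPμ is taken as the properties whose weight exceeds the mean weight of the source concept, as in Definition 1. The class and method names are illustrative assumptions.

import java.util.*;

class HalInference {
    // degree(ci cj) per equation (2): ratio of the summed weights of ci's above-mean
    // quality properties that also appear in cj, to the summed weights of all of
    // ci's above-mean quality properties.
    static double degree(Map<String, Double> ci, Map<String, Double> cj) {
        double mean = ci.values().stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double numerator = 0.0, denominator = 0.0;
        for (Map.Entry<String, Double> e : ci.entrySet()) {
            if (e.getValue() > mean) {                              // quality property in QPμ(ci)
                denominator += e.getValue();
                if (cj.getOrDefault(e.getKey(), 0.0) > 0.0) {       // also a property of cj, i.e. in QP(cj)
                    numerator += e.getValue();
                }
            }
        }
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }
}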
Based on the definition of information flow, we provide a new structure named the semantic text features graph. There has been work where people tried to learn or create kernels (similarity measures) which exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity also include latent semantic indexing [12], semantic networks [13] and probabilistic methods [14]. In [15], a set of methods is presented for organizing a collection of documents into a graph, with text feature similarity defined based on the graph. In this paper, we propose a classification method based on a semantic text features graph and use it to eliminate the cluster overlapping.
Definition 3 (Semantic text features graph). Given a set of documents D, a semantic text features graph is a weighted graph G = <V, E> where V is the set of vertexes and E is the set of edges. Each vertex denotes a document in D and also the set of words that the document contains. There is an edge <v1, v2> iff v1 extends the vocabulary of v2. Here "extends" means that v1 (document 1) uses more specific terminology (on the same topic) than v2 (document 2). The weight of <v1, v2> is score(v1, v2), where the score function is defined in Definition 4. A special vertex that contains stop words acts as the root of the graph. For example, suppose we have a vertex with general "computer science" terms, which has among its children "computer architecture" and "machine learning". Each of the children extends the vocabulary of the "computer science" vertex in its own way.
score(Vi,Vj ) =
| Join(Vi, (∪ p ∈ P ( j )Vp )) | | Union(Vi , (∪ p ∈ P ( j )Vp )) |
(3)
Where ∪ p ∈ P ( j )Vp denotes a set of words from vertexes on a path from Vj to the root. The Join and Union functions are defined as: Join(D1,D2) = {t| t ∈ D1, t ∈ D2} ∪ {t1, t2| t1 ∈ D1, t2 ∈ D2, degree(t1, t2) ≥ }
(4)
Union(D1, D2) = D1 + D2 – Join(D1, D2)
(5)
where is a threshold value. The semantic text features graph depends on the ordering of the documents. In order to create an optimized graph independent with the ordering of the documents, we define the total score function. Definition 5 (Total Score). The total score function for document di is:
totalscore(di ) =
∑ score(V ,V ) i
j
(6)
Vj∈V '
Where V’ means the set of vertex Vj that satisfies score(Vi, Vj) > ( is a threshold value). At each time the best document (the document which maximizes the totalscore given by equation 6) is selected to insert into the semantic text features graph. The basic idea for our classification method is that documents belonging to the same topic (category) are closed in the semantic text features graph. This assumption
Short Documents Clustering in Very Large Text Databases
89
Fig. 2. Semantic text features graph with labeled nodes
has been validated in our experiments. We use a kNN like method to build our short document classifier. Our method needs a set of labeled short documents as training samples. We can select the documents without overlapping in each cluster as training samples. The training samples are inserted into the semantic text features graph first. During the classifying process, each short document dc is inserted into the graph as vertex Vc and we find k labeled document vertexes that have the nearest path to Vc. An example of semantic text features graph with labeled vertexes and new vertexes to be classified is shown in fig. 2. We calculate the weight that document dc belonging to category Ci using the following equation:
weight (Ci ) =
∑ length(V ,V ) j
c
(7)
dj∈Ci , dj∈Dk
Where Dk means the set of k labeled documents and the length function calculates the weighted length between two vertexes in the semantic text features graph. Then document dc belongs to category Ci that maximizes weight(Ci). The key problem in our method is how to find the k nearest labeled neighbors for vertex Vc efficiently. This is a problem of fining k vertexes with shortest paths from a given source s to a set of vertexes in a weighted graph. The traditional Dijkstra 2 algorithm [16] can solve this problem with time O(n ) in a graph with n vertexes and m edges. To get satisfied accuracy, the number of unlabeled samples must be large enough 2 in our method. So it is very inefficient to find the k labeled neighbors with time O(n ) each time a new document is processed. Many other algorithms are developed to improve the performance of the shortest path problem. Since our semantic text features graph is static, we can improve the performance based on the following remark: Remark 1 (Weighted distance between vertexes in semantic text features graph). Let Vs and Vd be the source and destination vertexes in the semantic text features graph and Vs is linked to m vertexes V1,V2,…Vm. Suppose the weighted distance from Vi (i = 1,2,…m ) to Vd is D(Vi, Vd), then the distance from Vs to Vd is the maximum value of (weight(Vs, Vi) + D(Vi, Vd)).
■
The semantic text features graph is static once all labeled vertexes are inserted. When a new document to be classified is processed, we remove it from the graph afterwards to keep the graph static. The distances between vertexes are therefore static, and we calculate the distance from each vertex to all labeled vertexes once and store the values in a distance table. When a new vertex is added to the graph, we can easily get its distance to all labeled vertexes, since D(Vi, Vd) is stored in the table.
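The classification step built on equation (7) and the precomputed distance table can be sketched as follows. The data structures and names are assumptions; 'distance' plays the role of the weighted length from the new vertex Vc to each labeled vertex.

import java.util.*;

class SemanticKnnClassifier {
    // labeledCategory: labeled vertex id -> cluster id
    // distance: labeled vertex id -> weighted length from the new vertex Vc
    // Returns the cluster id Ci maximizing weight(Ci) over the k nearest labeled vertexes.
    static int classify(Map<Integer, Integer> labeledCategory,
                        Map<Integer, Double> distance, int k) {
        List<Integer> nearest = new ArrayList<>(distance.keySet());
        nearest.sort(Comparator.comparingDouble(distance::get));   // smallest length first
        Map<Integer, Double> weight = new HashMap<>();
        for (int i = 0; i < Math.min(k, nearest.size()); i++) {
            int v = nearest.get(i);
            int cluster = labeledCategory.get(v);
            weight.merge(cluster, distance.get(v), Double::sum);    // accumulate length per cluster (eq. 7)
        }
        return weight.entrySet().stream()
                     .max(Map.Entry.comparingByValue())
                     .map(Map.Entry::getKey)
                     .orElse(-1);
    }
}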
3.4 The SDC Algorithm
Now we summarize the entire mining process and present the SDC algorithm. The frequent term set mining process and the HAL semantic space building process are not included in this algorithm.
Algorithm 1. Short documents clustering in a very large text database
Algorithm for the controlling node:
Input: (1) a very large text database D, (2) the frequent term sets F.
Output: clusters with label, document count and documents (optional).
Method:
1. Partition the text database into n logical partitions named P1, P2, …, Pn;
2. Notify each mining node to run with input (Pi, F);
3. Wait for the results of all mining servers. When all results are returned, merge them and return the resulting clusters.

Algorithm for a mining node:
Input: (1) a logical data partition Pi, (2) the frequent term sets F.
Output: the document count and label for all clusters.
Method:
1. Create an empty document table as shown in Table 1 and a cluster count table;
2. Create the semantic text features graph;
3. For each frequent term set Fi, get all documents in Pi that contain the terms in Fi and then:
   a) use the documents as the initial cluster of Fi and write the document number into the cluster count table;
   b) increase the "cluster number" column of the document table as described in Section 3.2;
4. Get the documents that belong to more than one cluster by selecting the records with "cluster number" greater than 1. For each such document:
   a) use the semantic classifier of Section 3.3 to classify the document into one of the clusters;
   b) assign the document to that cluster by setting the "cluster ID" column in the document table;
   c) for all clusters that contain the document but to which the document does not belong, decrease the cluster count in the cluster count table;
5. When the whole data partition is processed, send the cluster count table to the controlling node.
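Step 4 of the mining-node algorithm can be sketched against the document table of Table 1 as follows; the table and column names, the partition column, and the classify placeholder (which stands for the Section 3.3 classifier) are illustrative assumptions, not the authors' implementation.

import java.sql.*;

class OverlapEliminator {
    // Select the overlapping documents of one logical partition and assign each to a single cluster.
    static void eliminateOverlap(Connection con, int partitionId) throws SQLException {
        String select = "SELECT document_id, content FROM documents " +
                        "WHERE partition_id = ? AND cluster_number > 1";
        String update = "UPDATE documents SET cluster_id = ? WHERE document_id = ?";
        try (PreparedStatement sel = con.prepareStatement(select);
             PreparedStatement upd = con.prepareStatement(update)) {
            sel.setInt(1, partitionId);
            try (ResultSet rs = sel.executeQuery()) {
                while (rs.next()) {
                    long docId = rs.getLong("document_id");
                    String content = rs.getString("content");
                    int clusterId = classify(content);   // semantic classifier of Section 3.3 (placeholder)
                    upd.setInt(1, clusterId);
                    upd.setLong(2, docId);
                    upd.executeUpdate();
                }
            }
        }
    }

    private static int classify(String content) {
        return 0;   // stands in for the semantic-text-features-graph classifier
    }
}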
As we can see, SDC is communication efficient, since the mining nodes need not communicate with each other during clustering. With the help of the text index, no database scan is needed. The semantic text features graph creation algorithm is not efficient, but it runs only once and the resulting graph is then stored.
4 Experimental Evaluations A commonly used external measurement, the F-measure [11], is employed in this paper to evaluate the accuracy of the clustering algorithm. In order to evaluate the
semantic classification algorithm individually, we use interpolated 11-point average precision, initial precision and R-precision [17]. We use two types of data sets: (1) the Reuters corpus, which is commonly used for text clustering (only the articles that are uniquely assigned to exactly one topic are used); (2) short documents from our application, i.e., text data sets collected from public forums on the Internet by robots. We use a small set of the short documents (but the HAL semantic space is created from a large dataset) for the accuracy experiment, since the documents have to be clustered by hand. To compare the performance of the algorithms, we used short documents of about 400 MB in size. To evaluate the high scalability of SDC, we used short documents of 40 GB in size.

Table 2. k=10, m=100,000

n      Average precision
50     0.575
100    0.728
200    0.753
300    0.782
We use 4 HP units connected by Fast Ethernet. Each unit has 4 Itanium II 1.3 GHz processors and 48 GB of memory. The operating system is Red Hat Advanced Linux 3.0. The text data is stored in an Oracle 10g database (4-node cluster). The semantic classification method is important for the accuracy of the clustering algorithm, so we first evaluate its parameters. Let m be the number of unlabeled documents used to create the semantic text features graph and n be the number of labeled documents. The results of our experiments are shown in Tables 2, 3 and 4. From the results we find that increasing m, n and k improves the precision of the classifier, but the improvement becomes insignificant once a certain value is reached. The reason is that when the new vertex has enough neighbors to decide its category, the precision cannot be improved further. In the following experiments, we set n = 300, m = 500,000 and k = 10. We compare the accuracy of SDC with FTC and FIHC on the Reuters dataset; the results are shown in Table 5. As we can see, the accuracy of SDC is a little better than FTC and comparable with FIHC. The reason is that SDC is designed for
short documents and has no superiority when processing normal long documents. The accuracy results on short documents are shown in Table 6. The number of clusters is controlled by setting different support values for the frequent term sets. SDC outperforms the other methods on our real data, which are short documents. As described before, the reason is that key words appear fewer times in short documents, which affects the accuracy of FTC and FIHC, whereas SDC uses the semantic-information-based classification method to improve the accuracy. In the next experiment, we compare the performance of the three algorithms on a large data set, as shown in Fig. 3 (the HAL semantic space building time and the semantic text features graph building time are not included). As we can see, the performance of SDC exceeds the other algorithms, especially when the data set becomes larger. The reason is that SDC is a parallel algorithm and takes advantage of the functionality of the text database. When the data set is not very large, FTC and FIHC perform better than SDC because they are simpler. The result of the scalability experiment on a very large data set is shown in Fig. 4. We can see that SDC has good scalability for very large data sets, while the other algorithms cannot handle such data sets.
[Figure: run time (minutes) vs. number of documents (thousand) for FTC, FIHC and SDC]
Fig. 3. Performance on large data set
[Figure: run time (minutes) vs. number of documents (million)]
Fig. 4. Scalability on very large data set
From these experiments we can conclude that SDC is more efficient and accurate than the other algorithms when clustering short documents in very large text databases. Unlike FTC and FIHC, whose performance deteriorates seriously as the number of documents grows, SDC's running time is almost linear. In addition, SDC has better accuracy for short documents since it uses semantic information to eliminate cluster overlapping.
5 Discussions and Conclusion
In this paper, we have proposed a simple yet efficient novel parallel algorithm, SDC, for clustering short documents in very large text databases. As shown in our experimental results, SDC has better accuracy and scalability than the other algorithms when clustering short documents in very large text databases. SDC is designed for very large text databases; it can easily be scaled up to even larger databases by adding more mining nodes, since it is a parallel and distributed algorithm. Using semantic information is a reasonable way to improve clustering accuracy for short documents. Parallel and distributed mining is a reasonable way to cluster
documents in very large databases. More detailed study along this direction is needed, including further improvement of the semantic space model to increase clustering accuracy, and modification of the algorithm to build hierarchical clusters.
References
1. Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
2. Beil F., Ester M., Xu X. Frequent Term-Based Text Clustering. Proc. 8th Int. Conf. on Knowledge Discovery and Data Mining (KDD 2002), Edmonton, Alberta, Canada, 2002.
3. Benjamin C. M. Fung, Ke Wang, Martin Ester: Hierarchical Document Clustering using Frequent Itemsets. SDM 2003.
4. E. Jensen, S. Beitzel, A. Pilotto, N. Goharian, O. Frieder. "Parallelizing the Buckshot Algorithm for Efficient Document Clustering". ACM 11th Conference on Information and Knowledge Management (CIKM), November 2002.
5. A. Hotho, A. Mädche, S. Staab. Ontology-based Text Clustering. Workshop Text Learning: Beyond Supervision, 2001.
6. Bhoopesh Choudhary and P. Bhattacharyya. Text Clustering Using Semantics. World Wide Web Conference (WWW2002), Hawaii, USA, May 2002.
7. Song, D., Bruza, P.D. (2001) Discovering Information Flow Using a High Dimensional Conceptual Space. In Proceedings of ACM SIGIR 2001, pp. 327-333.
8. Oracle Text 10g Technical Overview. http://www.oracle.com/technology/products/text/x/10g_tech_overview.html
9. Wang Yongheng, Jia Yan and Yang Shuqiang. Parallel Mining of Top-K Frequent Itemsets in Very Large Text Database. WAIM 2005.
10. Lund, K. and Burgess, C. (1996) Producing High-dimensional Semantic Spaces from Lexical Co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203-208.
11. M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. KDD Workshop on Text Mining'00, 2000.
12. T. A. Letsche and M. W. Berry. Large-scale information retrieval with latent semantic indexing. Information Sciences 100, 1997.
13. G. Siolas and F. d'Alché-Buc. Support vector machines based on a semantic kernel for text categorization. In IEEE-IJCNN, 2000.
14. T. Hofmann. Probabilistic latent semantic indexing. In Research and Development in Information Retrieval, 1999.
15. Jure Leskovec, John Shawe-Taylor. Semantic Text Features from Small World Graphs. In Subspace, Latent Structure and Feature Selection Techniques: Statistical and Optimization Perspectives Workshop, Bohinj, Slovenia, 2005.
16. E. Dijkstra. "Two Problems in Connexion with Graphs". Numerische Mathematik, 1 (1959), 269-271.
17. Song, D., Bruza, P.D., Huang, Z., and Lau. Classifying Document Titles Based on Information Inference. Foundations of Intelligent Systems (Lecture Notes in Artificial Intelligence 2871), Proceedings of the ISMIS'03 conference, pp. 297-306.
A Simple Yet Efficient Approach for Maximal Frequent Subtrees Extraction from a Collection of XML Documents Juryon Paik and Ung Mo Kim Department of Computer Engineering, Sungkyunkwan University, 300 Chunchun-dong, Jangan-gu, Suwon, Gyeonggi-do 440-746, Republic of Korea [email protected], [email protected]
Abstract. Recently, XML has been penetrating virtually all areas of computer science and information technology, and is bringing about an unprecedented level of data exchange among heterogeneous data storage systems. With the continuous growth of online information stored, presented and exchanged using XML, the discovery of useful information from a collection of XML documents is currently one of the main research areas occupying the data mining community. The most commonly used approach to this task is to extract frequently occurring subtree patterns from trees. However, the number of frequent subtrees usually grows exponentially with the size of the trees, and therefore mining all frequent subtrees becomes infeasible for large trees. A more practical and scalable approach is to use maximal frequent subtrees, the number of which is much smaller than that of frequent subtrees. Handling maximal frequent subtrees is an interesting challenge, and represents the core of this paper. We present a novel, conceptually simple, yet effective approach that discovers maximal frequent subtrees from a database of XML trees without generating candidate subtrees. The beneficial effect of our approach is that it not only significantly reduces the number of rounds for infrequent subtree pruning, but also entirely eliminates the rounds for candidate generation by avoiding time-consuming tree join operations or tree enumerations.
1 Introduction
In recent years, XML has become a popular way of storing many datasets because the semi-structured nature of XML allows the modeling of a wide variety of databases as XML documents. With the ever-increasing amount of available XML data, the ability to extract valuable knowledge from them becomes increasingly important and desirable. The data mining community has been challenged
This work was supported in part by the Ubiquitous Autonomic Computing and Network Project, 21st Century Frontier R&D Program and by the university IT Research Center project (ITRC), funded by the Korean Ministry of Information and Communication.
to come up with an efficient and scalable method for uncovering useful information from large collections of XML data. The first step toward mining information from XML documents is to find subtrees frequently occurring in a large collection of XML trees. Frequent subtrees in a database of trees provide useful knowledge in many cases. However, the discovery of frequent subtrees appearing in a large-scaled tree-structured dataset is not an easy task to do. Due to combinatorial explosion, the number of frequent subtrees usually grows exponentially with the size of a tree. Therefore mining all frequent subtrees becomes infeasible for a large number of trees. A more practical and scalable approach is to use maximal frequent subtrees. Handling the maximal frequent subtrees is an interesting challenge, and represents the core of this paper. The rest of this paper is organized as follows. We begin by reviewing some related works in Section 2. We continue in Section 3 with a description of some terminologies used throughout the paper, and problem definitions considered in this paper. Then, the details of the proposed algorithm are given in Section 4. Finally, in Section 5 we sum up the main contributions made in this paper and discuss some of our future works.
2 Related Works
The various works on mining frequent subtrees are described in [3,14,15,17]. Wang and Liu [15] considered mining paths in ordered trees using the Apriori [2] technique. They propose mining a wider class of substructures, namely subtrees called schemas. Asai et al. [3] proposed FREQT for mining labeled ordered trees. FREQT uses the rightmost expansion notion to generate candidate trees by attaching new nodes to the rightmost edge of a tree. Zaki [17] proposes two algorithms, TreeMiner and PatternMatcher, for mining embedded subtrees from ordered labeled trees. PatternMatcher is a level-wise algorithm similar to Apriori for mining association rules. TreeMiner performs a depth-first search for frequent subtrees and uses the scope list for fast support counting. Termier et al. [14] developed TreeFinder, which uses a combination of relational descriptions for labeled trees and the θ-subsumption notion to extract frequent subtrees. Other recent works describe how to mine frequent graph patterns [9,10]. Such graph mining algorithms are likely to be too general for tree mining, as pointed out in [17]. Frequent subtrees in a database of trees provide useful knowledge in many cases, such as gaining general information about data sources, mining of association rules, classification as well as clustering, and supporting standard database indexing. However, as observed in Chi et al.'s papers [7,8], due to combinatorial explosion the number of frequent subtrees usually grows exponentially with the size (number of nodes) of a tree. This causes severe problems for the completion time of the mining algorithm and produces a huge amount of potentially uninteresting patterns. Therefore mining all frequent subtrees becomes infeasible for a large number of trees. A more practical and scalable approach is to use maximal frequent
subtrees. Informally, a maximal frequent subtree is a frequent subtree none of whose proper supertrees is frequent in the database of XML trees, and the number of maximal frequent subtrees is much smaller than that of frequent subtrees. The algorithms presented by Xiao et al. [16] and Chi et al. [6] attempt to alleviate the huge number of frequent subtrees by finding and presenting to end-users only the maximal frequent subtrees. The PathJoin algorithm proposed by Xiao et al. uses a new compact data structure, FST-Forest, to store compressed trees representing the trees in the database. The structure itself obtains maximal frequent subtrees by pruning the frequent subtrees which are not maximal. Note that all frequent subtrees are found by joining the maximal frequent paths contained in the FST-Forest. Nevertheless, the algorithm uses post-processing techniques that prune away non-maximal frequent subtrees after discovering all the frequent subtrees. Therefore, the problem of the exponential number of frequent subtrees still remains. CMTreeMiner, presented in [6], instead of looking for all frequent subtrees in a database of trees, directly aims at closed and maximal frequent subtrees only. The algorithm uses several pruning and heuristic techniques to reduce the part of the search space that does not correspond to closed and maximal frequent subtrees, and to improve the computational efficiency of generating closed and maximal frequent subtrees. However, the CMTreeMiner algorithm is based on enumeration trees: a candidate subtree is generated by extending its unique parent, which is a frequent subtree of smaller size in the enumeration tree. In this paper, we present a novel approach for efficiently finding maximal frequent subtrees from a set of rooted and ordered trees. The major idea of the proposed technique is to construct special data structures that preserve all necessary information for establishing our goal, and to extract maximal frequent subtrees without either joining two frequent subtrees or extending the unique parent frequent subtree for candidate subtree generation. The rest of this paper is organized as follows. In Section 3, we begin by describing some terminologies used throughout the paper, and the problem definitions considered in this paper. Then, the details of the proposed approach are given in Section 4. Finally, we conclude in Section 5 with a discussion of our future works and goals.
3 Preliminaries
3.1 XML Data Model
With no rigid structure, XML documents are often called semi-structured data, traditionally modeled as an Object Exchange Model (OEM) graph [1]. The OEM offers a highly flexible data structure that may be used to capture most kinds of data, and provides a substrate in which almost any other data structure may be represented [4]. However, graph structures in general have undesirable properties with regard to algorithmic complexity; no efficient algorithm is known to perform systematic enumeration of the subgraphs of a given graph, which is a common facet of a semi-structured data mining algorithm. But fortunately, it suffices to
use trees, a special class of graphs, to represent XML data; there exist no cycles in tree structures [11,5,17]. XML represents data as trees, and makes no requirement that the trees be balanced. Indeed, XML is remarkably free-form, with the only requirements being that (i) the root is the unique node denoting the whole document, (ii) the other internal nodes are labeled by tags, and (iii) the leaves are labeled by contents or attributes of tags. A rooted tree is a directed acyclic graph in which there is a special node called the root that has no entering edges, every other node has exactly one entering edge, and there is a unique path from the root to each node. Thus, any XML tree is a rooted tree. Another characteristic of an XML tree is its labels. The labels in a tree could be unique, or duplicate labels may be allowed for different nodes. In addition, because edge labels can be subsumed without loss of generality by the labels of the corresponding nodes, we ignore all edge labels in this paper. Let T = (N, E) be a tree, where N is a set of nodes and E is a set of edges. We say that the tree T is a labeled tree if there exists a labeling function that assigns a label to each node in N. Therefore, the tree of an XML document is often called a labeled rooted tree. A path in a tree is a sequence of edges, i.e., p = (v1, v2), (v2, v3), ..., (vn-1, vn), where vi ∈ N (1 ≤ i ≤ n), and n is the number of nodes on the path; p is called a path from v1 to vn and its length is n − 1. For short, the path is usually represented just by the nodes on the path, i.e., p = v1, v2, v3, ..., vn-1, vn. In a tree there is a unique path from the root to each node. The length of the path from the root r to a node v is the depth of v in T. Also, the height of a tree T, denoted by height(T), is the largest depth of any node in T. The XML model we are describing here is based on ordered collections. An ordered tree is one in which the children of each node are ordered; that is, if a node has k children, then there is a first child, a second child, ..., and a kth child. An unordered tree is one in which parent-child relationships are important, but there is no sibling order. The emphasis on order in XML stems from its origin as a document markup language. In documents, order is crucial, while in data applications unordered data is widespread. For brevity, in the remainder of this paper, unless otherwise specified, we refer to a labeled rooted ordered tree simply as a tree.
3.2 Tree Mining Problem
The primary goal of mining a set of data is to provide the information that occurs often in the dataset. Roughly speaking, frequently occurring information means data patterns frequently used by various users or applications. However, this is not as straightforward for trees as it is for traditional record-oriented data. In this subsection we briefly describe the basics of the occurrence of a pattern in a collection of trees before addressing the tree mining algorithm itself. Definition 1 (Subtree). Let T = (N, E) be a labeled tree where N is a set of labeled nodes and E is a set of edges. We say that a tree S = (NS, ES) is a subtree of T, denoted S ⪯ T, iff NS ⊆ N and, for all edges (u, v) ∈ ES, u is an ancestor of v in T.
Intuitively, a subtree S as defined in this paper must not break the ancestor-descendant relationships among the nodes of the tree T. Let D = {T1, T2, ..., Ti} be a set of trees and |D| be the number of trees in D, where 0 < i ≤ |D|.

Definition 2 (Support). Given a set of trees D and a tree S, the frequency of S with respect to D, freqD(S), is defined as ΣTi∈D freqTi(S), where freqTi(S) is 1 if S is a subtree of Ti and 0 otherwise. The support of S w.r.t. D, supD(S), is the fraction of the trees in D that have S as a subtree. That is, supD(S) = freqD(S) / |D|.

A subtree is called frequent if its support is greater than or equal to a minimum value of support specified by a user. This user-specified minimum value of support is usually called the minimum support (minsup), denoted σ. A subtree S is σ-frequent if its support is greater than or equal to σ. As mentioned in earlier sections, obtaining all frequent subtrees causes severe problems for the completion time of the mining algorithm and produces a huge amount of potentially uninteresting patterns. Therefore mining all frequent subtrees becomes infeasible for a large number of trees. In addition, even when frequent subtrees are mined by an algorithm, users have to put a lot of effort into finding interesting results among a huge number of potentially useless ones.

Definition 3 (Maximal Frequent Subtree). Given a minimum support σ, a subtree S is called maximal frequent w.r.t. D iff: i) the support of S is not less than σ, i.e., supD(S) ≥ σ; ii) there exists no other σ-frequent subtree S′ w.r.t. D such that S is a subtree of S′.

Usually, the number of maximal frequent subtrees is much smaller than the number of frequent subtrees, and we can obtain all frequent subtrees from the set of maximal frequent subtrees.

Example 1. An example of a set of XML trees D with various tags is shown in Fig. 1(a). At a glance, the contents of the three documents are different from each other and it seems that there is no similarity among them.
[Fig. 1. Maximal frequent subtrees of an XML dataset — (a) Input: a set of XML trees D (T1, T2, T3); (b) Output: the maximal 2/3-frequent subtree and the frequent subtrees, given σ = 2/3]
However, when a minimum support value of 2/3 is given, the interesting hidden information is discovered, as illustrated in Fig. 1(b). With a reliability of more than 60%, we get to know the commonly occurring information. Also, with the same reliability, we find implicit relations between tags: tag D is always obtained together with tag G.
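To make Definitions 2 and 3 above concrete, a minimal sketch of the support, σ-frequency and maximality checks is given below; the predicate is_subtree (the ancestor-preserving embedding test of Definition 1) is assumed to be given, and all names are illustrative rather than part of the proposed algorithm:

def support(S, D, is_subtree):
    # sup_D(S) = freq_D(S) / |D|, where freq_Ti(S) is 1 if S is a subtree of Ti (Definition 2).
    freq = sum(1 for T in D if is_subtree(S, T))
    return freq / len(D)

def is_sigma_frequent(S, D, sigma, is_subtree):
    return support(S, D, is_subtree) >= sigma

def is_maximal(S, candidates, D, sigma, is_subtree):
    # Definition 3: S is sigma-frequent and no other sigma-frequent subtree in
    # `candidates` has S as a subtree (candidates compared by identity, for brevity).
    if not is_sigma_frequent(S, D, sigma, is_subtree):
        return False
    others = (S2 for S2 in candidates
              if S2 is not S and is_sigma_frequent(S2, D, sigma, is_subtree))
    return not any(is_subtree(S, S2) for S2 in others)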
4 Overview of the Proposed Approach
In this section, we describe the key features of our method for extracting maximal frequent subtrees without any candidate subtree generation. The proposed method is inspired by the mining algorithm EXiT-B presented in the recent works of Paik et al. [12,13]. We modify their data structure, PairSets, because it can fail to store all frequent nodes. A large part of this problem is caused by the structure of the PairSets: it keeps only the current node's information (actually an n-bit binary code) and a list of tree indexes. It is possible to establish a relationship with a wrong parent or ancestor, since the current node has no information about either of them. In the PairSets, the relationship is decided only by the minimum support. Thus, we devise a new compact structure, the LL-Tree (Linked List-Tree), to make up for this weak point of the PairSets.
4.1 Outline
Fig. 2 shows a brief outline of the proposed approach. As stated in the figure, the algorithm has three main functions: constructLLL, buildLLTree, and findMaximals. The constructLLL function constructs a set of linked lists of tree indexes for node labels, named LLLs (Linked Lists for Labels). Each element of a single LLL, except the Head, holds the following information: the locations of its parent nodes, the index of the tree in which the current node occurs, and a pointer to the next element. The function takes as input a database of trees, and returns a collection of LLLs, denoted LLLD, one for every unique node label in the database D. Afterwards, the set LLLD is divided into two groups, F and NF, consisting of the frequent and non-frequent LLLs in LLLD respectively, by using a threshold value. Usually, this threshold value is the minimum support σ × |D|. The frequent group F contains all LLLs whose number of elements is greater than or equal to the threshold; the others go into the non-frequent group NF. The second function, buildLLTree, configures the relations of every LLL in the set F. To this end, it builds a compressed tree, named the LL-Tree. The LLLs in NF are referred to because of ancestor-descendant relationships between nodes. We look at the compressed tree in more detail in a later subsection. The last function, findMaximals, extracts maximal frequent subtrees from the LL-Tree produced by buildLLTree. Note that we do not perform any candidate subtree generation, which has usually been required to derive (maximal) frequent subtrees in previous research. The LLLs and the LL-Tree allow us to avoid that complicated, time-consuming candidate generation phase.
Algorithm
Input: D: database of trees; σ: minimum support
Output: MFT: all maximal frequent subtrees
Method:
// scan the database to construct LLLD
(1) LLLD = constructLLL(D)
// scan LLLD to divide into two groups
(2) threshold = |D| × σ
(3) F = ∅, NF = ∅
(4) for each Linked List for Label LLL ∈ LLLD do begin
(5)   if LLL.size ≥ threshold then
(6)     F = F ∪ LLL
(7)   else
(8)     NF = NF ∪ LLL
// build a LL-Tree
(9) LLT = buildLLTree(F, NF, threshold)
// extract maximal frequent subtrees
(10) MFT = findMaximals(LLT)
(11) return MFT
Fig. 2. Outline of the proposed algorithm
4.2 LL-Tree Construction
The compressed structure LL-Tree is initially built from the LLLs in F. Constructing the LL-Tree is actually a two-step process: the first step is to identify each parent node of the elements of every LLL in F, and the second step is to replace a parent node that lies in NF with an ancestor node in F or with null (/). Prior to explaining the LL-Tree building process, we first define the data structures.

Definition 4 (Head). Given a database D, a Head is defined as a unique node label, which must have a single Linked List for Label of its own.

Definition 5 (LLL). A single LLL (Linked List for Label) is a linked list structure for a Head. It usually has several elements. Each element holds the information (p, tid, ne), where p indicates the locations of the parent nodes of the current Head, tid is the index of a tree in which the current Head occurs, and ne is a pointer to the next element. The set of all LLLs for a database D is denoted by LLLD. According to a specified threshold calculated from the minimum support and the size of D, the LLLD is divided into two groups.

Definition 6 (F and NF). Given the threshold, an LLL ∈ LLLD is called frequent and classified as a member of the group F if LLL.size¹ ≥ threshold value. Otherwise, it is called non-frequent and classified as a member of the group NF.
¹ Here LLL.size means the number of elements in the single LLL.
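As an illustration of Definitions 4–6 and of lines (2)–(8) in Fig. 2, a minimal sketch of splitting LLLD into F and NF might look as follows; the representation (a dictionary mapping each Head label to a list of (parent indicator, tree index) elements, with parents named by Head labels rather than by numeric location indicators, and the next-element pointer left implicit) and the sample values are illustrative assumptions, not the paper's actual implementation:

def split_into_F_and_NF(lll_d, num_trees, sigma):
    threshold = num_trees * sigma          # threshold = |D| x sigma
    F, NF = {}, {}
    for head, elements in lll_d.items():
        if len(elements) >= threshold:     # LLL.size >= threshold -> frequent
            F[head] = elements
        else:
            NF[head] = elements
    return F, NF

# Toy usage; the element values are illustrative, not taken from the paper's figures.
lll_d = {
    "Desktop": [("/", 1), ("/", 2), ("/", 3)],
    "OS":      [("Laptop", 2)],
}
F, NF = split_into_F_and_NF(lll_d, num_trees=3, sigma=2/3)
print(sorted(F), sorted(NF))               # ['Desktop'] ['OS']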
[Figure: (a) the original database D of three XML trees T1, T2 and T3, with node labels such as Price Compare, Computer Shop, Desktop, Laptop, Information, Company, Brand, Name, Model, Price and OS; (b) F and NF derived from LLLD by σ = 2/3, where the Heads Desktop, Brand, Model and Price fall into F]
Fig. 3. Database of trees and classified LLLD
Fig. 3 shows a database of three XML trees and its LLLD already classified into the two groups F and NF by minimum support σ = 2/3. We build an LL-Tree over the data depicted in Fig. 3(b).

Definition 7 (LL-Tree). Given F and NF, a compressed data structure, the LL-Tree (Linked List Tree), is constructed to reflect the original database compactly. To build an LL-Tree, the indicators for parent nodes in each element and the threshold are required.

The LL-Tree is built by applying the following rules, illustrated in Fig. 3(b) and Fig. 4. First, identify whether each parent node (which is also the Head of another LLL) in the elements is included in F or NF by tracing the location indicators p of the current Head. If the parent is not in F (see the second element of the Brand Head, whose indicator 9 points to the LLL of the Information Head), then 1) trace every location indicator (actually the current Head's grandparent) of the elements of the parent node's LLL (in the figure, the Information Head has only one element, so trace the indicator 2); 2) check whether the node indicated by that location indicator is a member of F or null (the LLL of the Desktop Head is a member of F); 3) if not, continue with steps 1) and 2); 4) if so, the location indicator of the original current element is replaced by the indicator of the frequent ancestor or by null (the indicator 9 is replaced by the indicator 2). Second, consider multiple location indicators in an element. To resolve this case, the threshold value is applied: an indicator for a parent which is not a member of F is removed.
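A sketch of the first rule — replacing a parent indicator that points into NF by its nearest frequent ancestor (or null) — might look as follows, using the same illustrative representation as the sketch after Definitions 4–6; the helper names are hypothetical and the second rule (multiple indicators) is omitted:

def parent_of(head, tid, F, NF):
    # Return the parent indicator recorded for `head` in tree `tid` (None at the root).
    for p, t in F.get(head, []) + NF.get(head, []):
        if t == tid:
            return None if p == "/" else p
    return None

def nearest_frequent_ancestor(parent, tid, F, NF):
    # Rule 1: trace upward until a Head in F is reached, or null (None).
    current = parent
    while current is not None and current not in F:
        current = parent_of(current, tid, F, NF)    # step to the grandparent
    return current

def resolve_parents(F, NF):
    # Rewrite every parent indicator of the frequent Heads to point into F or to null.
    return {head: [(nearest_frequent_ancestor(None if p == "/" else p, tid, F, NF), tid)
                   for p, tid in elements]
            for head, elements in F.items()}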
[Figure: the LL-Tree built from F — the Heads Desktop (#2), Brand (#4), Model (#5) and Price (#6) with their element lists — and the maximal frequent subtree derived from it, consisting of the nodes Desktop, Brand, Model and Price]
Fig. 4. LL-Tree and its maximal frequent subtree
After completion of the LL-Tree, maximal frequent subtrees are derived by top-down construction over each LLL in the LL-Tree.
5 Conclusion and Future Work
We presented a new, simple yet effective approach for extracting maximal frequent subtrees from a database of trees. Unlike traditional approaches, the proposed method does not perform any candidate subtree generation. To this end, we modified the recently suggested algorithm and data structure of Paik et al., whose weak point is that it can fail to store all frequent nodes. We addressed this problem by devising the LLL structure and the LL-Tree. We scan the original database only once to build the LLLs for each unique node label. Afterwards, only the LLLs are used to construct an LL-Tree, which holds all the information needed to extract maximal frequent subtrees. The beneficial effect of our methodology is that it not only significantly reduces the number of rounds for non-frequent subtree pruning, but also entirely eliminates the rounds for candidate subtree generation by avoiding time-consuming tree join operations or tree enumerations. Currently, we are working on the analysis of the proposed algorithm as well as an evaluation of its performance compared to EXiT-B, focusing especially on how much the accuracy of the maximal frequent subtrees is increased.
References
1. Abiteboul, S., Buneman, P., Suciu, D.: Data on the Web: From Relations to Semistructured Data and XML, 1st edition, Morgan Kaufmann (1999)
2. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. Proceedings of the 12th International Conference on Very Large Databases (1994) 487–499
3. Asai, T., Abe, K., Kawasoe, S., Arimura, H., Sakamoto, H., Arikawa, S.: Efficient substructure discovery from large semi-structured data. Proceedings of the 2nd SIAM International Conference on Data Mining (2002) 158–174
4. Buneman, P.: Semistructured data. Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (1997) 117–121
5. Chi, Y., Nijssen, S., Muntz, R. R., Kok, J. N.: Frequent subtree mining — an overview. Fundamenta Informaticae 66(1–2) (2005) 161–198
6. Chi, Y., Xia, Y., Yang, Y., Muntz, R. R.: Mining closed and maximal frequent subtrees from databases of labeled rooted trees. IEEE Trans. Knowledge and Data Engineering 17(3) (2005) 190–202
7. Chi, Y., Yang, Y., Muntz, R. R.: HybridTreeMiner: An efficient algorithm for mining frequent rooted trees and free trees using canonical forms. The 16th International Conference on Scientific and Statistical Database Management (2004) 11–20
8. Chi, Y., Yang, Y., Muntz, R. R.: Canonical forms for labelled trees and their applications in frequent subtree mining. Knowledge and Information Systems 8(2) (2005) 203–234
9. Inokuchi, A., Washio, T., Motoda, H.: An Apriori-based algorithm for mining frequent substructures from graph data. Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (2000) 13–23
10. Kuramochi, M., Karypis, G.: Frequent subgraph discovery. Proceedings of the IEEE International Conference on Data Mining (2001) 313–320
11. Kilpeläinen, P.: Tree matching problems with applications to structured text databases. PhD thesis, University of Helsinki (1992)
12. Paik, J., Shin, D. R., Kim, U. M.: EFoX: a scalable method for extracting frequent subtrees. Proceedings of the 5th International Conference on Computational Science. Lecture Notes in Computer Science, Vol. 3516. Springer-Verlag, Berlin Heidelberg New York (2005) 813–817
13. Paik, J., Won, D., Fotouhi, F., Kim, U. M.: EXiT-B: a new approach for extracting maximal frequent subtrees from XML data. Proceedings of the 6th International Conference on Intelligent Data Engineering and Automated Learning. Lecture Notes in Computer Science, Vol. 3578. Springer-Verlag, Berlin Heidelberg New York (2005) 1–8
14. Termier, A., Rousset, M.-C., Sebag, M.: TreeFinder: a first step towards XML data mining. Proceedings of the IEEE International Conference on Data Mining (2002) 450–457
15. Wang, K., Liu, H.: Schema discovery for semistructured data. Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (1997) 271–274
16. Xiao, Y., Yao, J.-F., Li, Z., Dunham, M. H.: Efficient data mining for maximal frequent subtrees. Proceedings of the IEEE International Conference on Data Mining (2003) 379–386
17. Zaki, M. J.: Efficiently mining frequent trees in a forest. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2002) 71–80
A Web-Based Transformation System for Massive Scientific Data∗ Shi Feng, Jie Song, Xuhui Bai, Daling Wang, and Ge Yu College of Information Science and Engineering, Northeastern University, Shenyang 110004, P.R. China [email protected], [email protected], [email protected], {dlwang, yuge}@mail.neu.edu.cn
Abstract. In the domain of scientific research, a mass of data obtained and generated by instruments is in the form of text. How to make the best use of these data has become an important issue for both natural science researchers and computer professionals. Many of these data contain an internal logical structure, but they differ from self-describing semi-structured data in that the data are separate from their schema. Because of the great increase in data volume, the traditional way of working with these data cannot meet the needs of high performance and flexible access. A relational DBMS is a good technique for organizing and managing data. In this paper, a mapping model, STRIPE, between scientific text and relational databases is proposed. Using STRIPE, we design and implement a Web-based massive scientific data transformation system, which gives a good solution to the problem of massive scientific data management, query and exchange. The evaluation of the system shows that it can greatly improve the efficiency of scientific data transformation and offer scientists a novel platform for studying the data.
1 Introduction
Scientific instruments and computer simulations are obtaining and generating vast data stores in the domains of astronomy, oceanography, geognosy, meteorology and so on. Furthermore, the data volumes are approximately doubling each year [7]. These data require new scientific methods to analyze and organize. As the foundation of scientific research, many of these data are in the form of ASCII text. However, text data have some drawbacks, as follows.
• Low readability: without special explanation files, the text data are ambiguous or meaningless.
• Hard to integrate: many data are stored in different locations or even in different countries. It is difficult to integrate all the data in one place.
• High query cost: it is difficult to find certain contents in massive datasets, for the contents may be distributed across thousands of files.
With the development of information technology, relational DBMSs provide scientists with an efficient way to manage scientific data. Using a DBMS, scientists can get
This work is supported by National Natural Science Foundation of China (No. 60573090).
benefit from non-procedural query analysis, automatic parallelism, and sophisticated tools for associative, temporal, and spatial search. However, it is still necessary to store scientific data in text form for the following reasons. Firstly, ASCII text is the only form supported by the scientific instruments and devices. Secondly, text data are platform independent, so they can be analyzed on different operating systems. Thirdly, ASCII text is the international standard for data exchange. Thus, in the area of scientific research, text data and databases will coexist for quite a long time. Recently, how to provide an efficient way to implement massive data transformation between texts and databases has become a common concern for both computer science researchers and natural science researchers. In this paper, we analyze the scientific text data structure and model the data using a mathematical method. The main contribution of our research is a mapping model for scientific data transformation (STRIPE, Scientific Text RelatIonal maPping modEl). Using this model, we design and implement a Web-based system for transformation between massive scientific text data and relational databases, which also provides a platform for advanced applications. The evaluation of the system shows that it provides a good solution to the problem of massive scientific data management in a distributed environment. The rest of this paper is organized as follows. Section 2 gives some basic concepts about scientific data, and then introduces related work on scientific data management and several proposed data mapping models. Section 3 analyzes the non-self-describing scientific text data structure, presents a relational mapping model for the data, and then describes an implementation of this model in the oceanography setting. Section 4 proposes the Web-based framework for scientific data transformation and its applications. Section 5 gives the performance and evaluation of the system. Section 6 finally gives the conclusion and the future work of this research.
2 Theoretical Backgrounds
2.1 Non-Self-Describing Semi-structured Scientific Text Data
Most scientific text data are not self-describing, because describing tags waste plenty of space and may make the data file too big to be analyzed. In this section, we describe non-self-describing semi-structured scientific text data and how they are used in scientific research.

2.1.1 Semi-structured Data
In semi-structured data, the information that is normally associated with a schema is contained within the data, which is called "self-describing" [3]. The meaning and logical structure of semi-structured data can be expressed and identified by semantic tags. For example, XML is a standardized, extensible-markup form of semi-structured data.

2.1.2 Non-Self-Describing Semi-structured Scientific Text Data
Non-self-describing semi-structured scientific text data (scientific text for short below) is a kind of semi-structured data produced by scientific devices or recorded by researchers. Every year, huge amounts of scientific text are collected, exchanged and
analyzed. Different from normal self-describing semi-structured data such as XML [6] and BibTeX [15], scientific texts have no semantic tags to constrain their inner structure and express meaning. The meaning of scientific texts usually hides in the hierarchical structure and the positional relationships between characters. There are many international organizations that specify the norms for scientific texts.

2.2 Related Work
How to provide an efficient way of managing scientific data has become a hot topic for both natural science researchers and computer professionals. Jim Gray discusses several problems to be solved [7], and Peter Buneman et al. design a biological data transformation system [2] and propose a way to archive scientific data in relational databases [4]. There is a rich literature addressing the issue of managing semi-structured data in relational backends, and several mapping strategies have been proposed. Lorel [14] and Tsimmis [13] define self-describing semi-structured data using graph theory: semi-structured data are represented as graph-like or tree-like forms, and the data schema is defined by labeling the edges of the tree-like graph. Using this model, Jason McHugh et al. design a query and management system, Lore, for semi-structured data [9]. Lore provides a method to manage semi-structured data in a DBMS. A self-describing semi-structured data relational mapping strategy, STORED, is proposed in [5]. When a semi-structured data instance is given, a STORED mapping can be generated automatically using data mining techniques, and then data can be transformed between relational databases [1] and XML documents. Prior work on semi-structured data mostly focuses on self-describing documents. However, scientific devices and instruments are obtaining and generating massive non-self-describing scientific data every year, and moreover, these devices and instruments may be placed in different locations. Richard et al. provide several frameworks for Web development [16], but none of those is suitable for scientific data. So the purpose of this research is to develop a Web-based system for massive scientific data collection, transformation and management.
3 Modeling Scientific Text Data
3.1 Scientific Text Modeling
Scientific text data have a hierarchical structure. Sometimes there are object identifiers which can distinguish different kinds of data. The character is the minimum physical unit, and the minimum logical unit is the string that carries an underlying meaning. The meanings of scientific text data are defined in international scientific data format standards. For example, an integer in the text may represent the longitude of a certain position. Therefore, scientific text is a kind of semi-structured data with a separate schema. Actually, these text data contain many kinds of data types, such as integer, decimal, timestamp, date, string and so on. In scientific text, objects are distinguished by object types, which are identified by object identifiers or by the inner positional structure of the text. Having the same type means that these data have the same object identifier or that they are closely logically related. For
example, if a longitude value and a latitude value together describe a certain position, we consider the two data items to have the same type. As discussed above, in scientific text there are four kinds of relations between minimum logical units: (1) join with the same type within a row; (2) join with different types within a row; (3) join with the same type between rows; (4) join with different types between rows. These four kinds of relations are reflexive, symmetric and transitive, so they are equivalence relations. Assume that a scientific text data file is a set D and that R represents the logical unit relations (1) and (3) (same-type join within and between rows). For any minimum logical unit d ∈ D, according to set theory, the set [d]R = {x | x ∈ D, xRd} is the equivalence class of the relation R on the set D. By the same token, for the set Q of equivalence classes on D, Q is the quotient set of R on D. Therefore, there may be many equivalence classes consisting of the same object identifier in the quotient set Q, and the relations between the equivalence classes in Q meet the definitions of relational database theory.

3.2 STRIPE
STRIPE (Scientific Text RelatIonal maPping modEl) is a mapping model between scientific texts and relational databases, including a text model, a relational model and the mapping strategy between them. The mapping between text and relational database has two directions: the mapping from text to database (T/R mapping) and the mapping from database to text (R/T mapping). As a kind of forward engineering, in the T/R mapping procedure the database structure is generated automatically by the mapping model, and the text is parsed, relationalized, and finally saved into the database. R/T mapping is a kind of reverse mapping, in which scientific text data are generated from the relational database automatically by the mapping model. According to quotient set and relational database theory, a relational schema lies in the equivalence classes of the quotient set Q. In STRIPE, each equivalence class is mapped to a relational entity, and the meaning of a minimum logical unit is mapped to an attribute of the entity. Because of the scientific text's hierarchical structure and the relationships between rows, there are one-to-one or one-to-many relations between the mapped entities. In each equivalence class, the data divided by rows correspond to the tuples in the relational database. Since elements in an equivalence class have no order, the primary key of the tuples can be generated automatically or specified manually.

3.3 Building STRIPE with Oceanographic Data
In this section, we take oceanographic data as an example to explain how to model the data and build up STRIPE. Oceanography is a really important branch of scientific research. As scientists pay more attention to the greenhouse effect and the El Niño phenomenon, and as observation technology develops fast, massive marine information is recorded in text form every year all around the world [12]. Eventually, these data are stored in relational databases for higher efficiency of query and
management. Meanwhile, text files are also the standard for international marine data exchange. STRIPE provides an efficient method to transform massive data between oceanographic texts and relational databases. Oceanographic raw text data adopt the international ASCII character coding scheme. This kind of text uses a "row identifier" form, which means that the first character of each row identifies the row's object type. So the first character of each row is an object identifier, which divides the data into three types: Header, Data and Description. The data of the three types are in the international marine gridded data format, which is constituted by a rectangular matrix of rows and columns. One example of marine Header data is shown in Table 1. Each oceanographic data file may have hundreds of columns, and the data file created by one instrument can grow to several megabytes per day. There are dozens of such instruments in one data collection station; therefore, massive marine data are created every day.

Table 1. Oceanographic Station Header Grid Data Format (excerpt)

Parameter name        Start column   Length   Description
This row identifier   1              1        Always 1
Next row identifier   2              1        Always 2 or 3
Country code          3              2        According to IOC
Latitude Degree       5              2        Latitude (degree)
Latitude Minute       7              3        Latitude (minute)
Longitude Degree      9              2        Longitude (degree)
Longitude Minute      10             3        Longitude (minute)
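To illustrate how such a fixed-column format is consumed, a minimal sketch of slicing a Header row by its (start column, length) layout is given below; only the first few fields of the excerpt are used, and the sample line and field names are fabricated for illustration, not actual NMDIS data:

def slice_fields(line, layout):
    # layout: list of (field_name, start_column, length) with 1-based start columns.
    return {name: line[start - 1:start - 1 + length].strip()
            for name, start, length in layout}

HEADER_LAYOUT = [
    ("this_row_identifier", 1, 1),   # always 1
    ("next_row_identifier", 2, 1),   # always 2 or 3
    ("country_code",        3, 2),   # according to IOC
    ("latitude_degree",     5, 2),
]

sample = "12CN3845..."               # fabricated sample record
print(slice_fields(sample, HEADER_LAYOUT))
# {'this_row_identifier': '1', 'next_row_identifier': '2', 'country_code': 'CN', 'latitude_degree': '38'}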
From what we have discussed above, we can give the following definitions of oceanographic data.

Definition 1. Given the marine data file OceanData, the three types of oceanographic data are represented by Header, DataSection and Description, respectively. DataSection_i stands for one kind of DataSection. The operator ∨ denotes the relationship of join with different types between rows. m• denotes m lines of data with the same identifier type, i.e., data in the relationship of join with the same type. The symbol n represents the maximum number of each component. The record unit can then be defined as follows:

DataBlock = Header(1...n) ∨ (∪_{i=1}^{n} m•DataSection_i) ∨ DataBlock'(0...n) ∨ Description(0...1)    (1)
The subscripts of each part represent the minimum and maximum occurrence of this part in the same DataBlock. By the same token, a marine data file OceanData is defined as follows:

OceanData = m•DataBlock    (2)
In the DataSection, because of different ways of collecting data, each row can be divided into two parts: firstly, data that appears at fixed positions; secondly,
transactional data of fixed length but unlimited number, attached to the first kind of data as extra descriptive information. So we have the following definition:

Definition 2. CertainSection denotes the fixed data in a DataSection; UncertainSection denotes the transactional data in a DataSection. The operator + means the relationship of join with different types within a row; p∗ means several transactional data items of the same type in one row. According to Definition 2, we get:

DataSection = CertainSection + p∗UncertainSection    (3)
By Definitions 1 and 2, we can give the framework of STRIPE with oceanographic data in Fig. 1. As shown in Fig. 1, the left part shows a section of marine data which can be depicted as follows, where m, n, p represent the number of each component respectively:

DataBlock = Header ∨ m•(CertainSection1 ∨ n•(DataBlock'))    (4)

DataBlock' = CertainSection2 + p∗UncertainSection    (5)
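Read purely as a data model, equations (4) and (5) might be sketched as the following nested types; the class and field names are illustrative and not part of STRIPE itself:

from dataclasses import dataclass, field
from typing import List

@dataclass
class InnerBlock:                          # DataBlock' of equation (5)
    certain_section2: str
    uncertain_sections: List[str] = field(default_factory=list)   # p * UncertainSection

@dataclass
class Group:                               # one of the m repetitions in equation (4)
    certain_section1: str
    inner_blocks: List[InnerBlock] = field(default_factory=list)  # n * DataBlock'

@dataclass
class DataBlock:                           # equation (4): a Header followed by m groups
    header: str
    groups: List[Group] = field(default_factory=list)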
According to relational database theory, there is a relational schema in the oceanographic data. Different types of oceanographic data can be mapped to different relational entities in a database. For example, Header is mapped to the entity Header in the relational model of Fig. 1. Note that the arrows in Fig. 1 are bidirectional, which means our STRIPE is bidirectional: the data can be transformed not only from oceanographic files into databases, but also from databases to oceanographic files.
[Figure: the Scientific Text Model (Header, CertainSection1, CertainSection2, UncertainSection) mapped bidirectionally to the Relational Model entities Header, CertainSection1, CertainSection2 and UncertainSection, with one-to-many relationships between them]
Fig. 1. Oceanographic data STRIPE framework
4 Design and Implementation of a Web-Based Massive Scientific Data Transformation System
In this section, we discuss how to design and implement a Web-based massive scientific data transformation system. This system provides an efficient solution to the problem of massive scientific data management, query and exchange.

4.1 The Implementation of STRIPE
XML [6] (eXtensible Markup Language) is a markup language which allows users to define their own tags. In our system, we give an implementation of STRIPE based on XML, shown in Fig. 2.
Fig. 2. STRIPE XML schema (excerpt)
Fig. 2 is an excerpt from the XML STRIPE schema file, which describes the mapping model between scientific text data and relational databases. The STRIPE XML configuration file (STRIPE for short below) has three parts: the scientific text model, the relational database model and the mapping strategy. The "Table" part of STRIPE reflects the schema of the tables in the relational database; its nodes include the table name attribute tableName, and the child elements primary key Pk and foreign key Fk. The "DataBlock" part of STRIPE depicts the physical structure of the scientific text data. A DataBlock may have four child elements: Header, DataSection, a recursive DataBlock and Description. In each element, there are attributes describing the mapping strategy between the scientific text data and the relational database, such as the element's database counterpart table name tableName, the start position startPosition and end position endPosition in the data files, and the mapped column name columnName in the database.
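Since the schema excerpt of Fig. 2 is not reproduced here, the snippet below sketches what a configuration of the kind described might look like and how it could be read; the element and attribute names (Table, Pk, Fk, DataBlock, Header, DataSection, tableName, startPosition, endPosition, columnName) follow the description above, but the concrete structure, the Field element and the sample values are assumptions, not the paper's actual schema, and ElementTree is used here only for brevity (the paper's system uses SAX):

import xml.etree.ElementTree as ET

STRIPE_EXAMPLE = """
<STRIPE>
  <Table tableName="Header">
    <Pk>header_id</Pk>
  </Table>
  <Table tableName="DataSection">
    <Pk>section_id</Pk>
    <Fk>header_id</Fk>
  </Table>
  <DataBlock>
    <Header tableName="Header">
      <Field columnName="country_code" startPosition="3" endPosition="4"/>
    </Header>
    <DataSection tableName="DataSection">
      <Field columnName="water_temperature" startPosition="1" endPosition="6"/>
    </DataSection>
  </DataBlock>
</STRIPE>
"""

root = ET.fromstring(STRIPE_EXAMPLE)
for mapped in root.iter("Field"):                 # walk every mapped field
    print(mapped.get("columnName"), mapped.get("startPosition"), mapped.get("endPosition"))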
Building the scientific data under STRIPE can reduce the coupling degree between data files and databases, and STRIPE provides a standard method of explaining massive scientific text data; therefore, it raises the readability of the scientific data. STRIPE offers computer scientists a standard way of manipulating scientific data with a programming language, and helps science researchers understand the relational database more easily. Because of XML's strong support for Web development, it is easy to integrate STRIPE into a Web information system. In the next section, we design a Web-based massive scientific data transformation system, using STRIPE as the transformation criterion.

4.2 Designing the Web-Based Massive Scientific Data Transformation System
The Web-based massive scientific data transformation system is a Web information system for managing, querying and exchanging scientific data. Its architecture is shown in Fig. 3. Every kind of scientific text data has its corresponding STRIPE configuration XML file. The system architecture includes the components STRIPE Config Files, Import Engine, Export Engine, Relational Database and User Applications, and the STRIPE config files play an important role in the architecture. We discuss each component of the architecture as follows.

• Import Engine. By parsing the STRIPE config XML files, the Import Engine gets the physical structure of the scientific data and the schema of the relational database. Then the SQL Generator creates the insert SQL script for the database. After that, the Data Loader subcomponent gets the data from the file system according to the physical structure, and imports them into the database using the SQL script created before.

• Relational Database. The Relational Database provides a high-performance and high-security solution for the management and query of massive scientific data. In our architecture, because of the platform independence of SQL, the Relational Database can be any commercial or open source database.

• Web Server. The Web Server provides a high-efficiency container for the distributed scientific data management environment. The scientific researchers and the data collectors are usually not in the same location, so it is very useful for scientists to deliver and transform the data through the Web.

• User Applications. Scientists used to search and analyze scientific data through files. However, this is really tough work because there may be several gigabytes of data in one application. The relational database and the Web server provide a much more efficient way to query and analyze the data. Furthermore, most applications can easily be migrated to Web-based systems. As we can see from Fig. 3, data visualization, data mining, data statistics and data query can easily be deployed on our Web-based system.

• Export Engine. In reverse of the Import Engine, the purpose of the Export Engine is to export scientific text data from databases. Firstly, the Export Engine gets the query conditions from the user's
requests, and then the SQL Generator creates the query SQL script from the relational schema and the conditions. At last, the File Writer executes the query script and writes the data into files using the result set.
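A minimal sketch of the two engines' data path is given below, with Python's DB-API (sqlite3) standing in for JDBC purely for illustration; table and column names are illustrative and not the system's actual schema:

import sqlite3

def import_rows(conn, table, columns, rows):
    # Import Engine: generate one INSERT statement and batch-load the parsed rows.
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), ", ".join("?" for _ in columns))
    conn.executemany(sql, rows)                      # same-type rows inserted in one batch
    conn.commit()

def export_rows(conn, table, columns, condition="1=1", params=()):
    # Export Engine: run one query and hand the result set to a file writer.
    sql = "SELECT {} FROM {} WHERE {}".format(", ".join(columns), table, condition)
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Header (country_code TEXT, latitude_degree INTEGER)")
import_rows(conn, "Header", ["country_code", "latitude_degree"], [("CN", 38), ("CN", 39)])
print(export_rows(conn, "Header", ["country_code", "latitude_degree"], "country_code = ?", ("CN",)))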
Fig. 3. Web-based massive scientific data transformation system architecture
5 The System Performance and Evaluation
5.1 System Performance
Java technology provides a mature framework and toolkit for Web development. The whole system is developed under the Struts framework, which is based on the MVC model [10]. We use SAX for XML parsing, which has the advantages of low memory cost and high parsing speed. JDBC is our choice for database access. As all of the technology we use follows standard APIs, our system is platform independent and can be deployed on any commercial or open source database. Since massive data is processed through our system, we design accelerating algorithms to speed up import and export. As insert and query are the most costly actions in our system, the main idea of the two algorithms is to reduce the
number of database I/O operations. In the import procedure, every data file is divided into several files according to the object types inside, so data of the same type can be inserted into the database at one time, which greatly increases import efficiency. Using our import strategy, the data have the property that when the foreign key increases, the primary key also increases. In the export procedure, we use this property to lower the number of database queries. In the best case, only two queries are needed for a relational schema with two tables.

5.2 System Evaluation
A Web-based scientific text transformation and application system has been successfully deployed for the China National Marine Data and Information Service (NMDIS) [11]. The traditional way of storing oceanographic data in NMDIS is text files, whose disadvantages have been discussed in Section 1. There are many data collecting stations around China's coastal areas, which collect huge amounts of data every year. Using our system, it becomes really easy for oceanographers to integrate the data into the data center at NMDIS and exchange the appropriate data with foreign research institutes, as shown in Fig. 4. Many such data centers and collectors all over the world may constitute a network for scientific data integration and exchange. To reduce unnecessary traffic, this network may have some P2P attributes, as described in [8].
[Figure: Data Collectors 1–6 at Locations A–F connected through the Internet to Data Center 1 and Data Center 2, which exchange data with each other]
Fig. 4. Data collection and exchange in a distributed environment
As massive data is transformed every day, our accelerating algorithms demonstrate good performance. According to our statistics, import can be about ten times faster than a normal import program, and export about thirty times faster. Our system provides flexible and fast access to the scientific databases. The Web-based applications offer good visualization and plotting tools, standard data mining APIs and mature data analysis algorithms. After a period of trial running, our system runs stably and has come to play a great role in China's oceanographic research.
6 Conclusions and Future Work
In this paper, a transformation model, STRIPE, that describes scientific text data and its mapping strategy to relational databases has been proposed. Using STRIPE, a Web-based massive scientific data transformation system has been designed and implemented, which offers a solution to the problem of the management, query and exchange of massive scientific data. Several approaches for further improving and extending this work are planned. Firstly, more kinds of applications should be developed on our system. Secondly, the accelerating algorithms should be improved to meet the needs of ever-growing data. Thirdly, text is only one of many forms of scientific data in research, so more attention will be paid to finding an efficient way to manage image or binary scientific data in a Web information system.
References
1. P. Atzeni, S. Ceri, S. Paraboschi, R. Torlone: Database Systems: Concepts, Languages and Architectures. McGraw-Hill (1999)
2. P. Buneman, S.B. Davidson, K. Hart, G.C. Overton, L. Wong: A Data Transformation System for Biological Data Sources. VLDB (1995) 158-169
3. P. Buneman, S.B. Davidson, M.F. Fernandez, D. Suciu: Adding Structure to Unstructured Data. ICDT (1997) 336-350
4. P. Buneman, S. Khanna, K. Tajima, W.C. Tan: Archiving scientific data. ACM Trans. Database Syst. 29 (2004) 2-42
5. A. Deutsch, M.F. Fernandez, D. Suciu: Storing Semistructured Data with STORED. SIGMOD (1999) 431-442
6. Extensible Markup Language: http://www.w3.org/XML/
7. J. Gray, D.T. Liu, M.A. Nieto-Santisteban, A. Szalay, D.J. DeWitt, G. Heber: Scientific data management in the coming decade. SIGMOD Record 34(4) (2005) 34-41
8. Y. Liu, X. Liu, L. Xiao, L.M. Ni, X. Zhang: Location-Aware Topology Matching in P2P Systems. Proc. of IEEE INFOCOM (2004)
9. J. McHugh, S. Abiteboul, R. Goldman, D. Quass, J. Widom: Lore: A Database Management System for Semistructured Data. SIGMOD Record 26(3) (1997) 54-66
10. Model-View-Controller. http://java.sun.com/blueprints/patterns/MVC.html
11. National Marine Data and Information Service. http://www.nmdis.gov.cn/
12. National Oceanographic Data Center. http://www.nodc.noaa.gov/
13. Y. Papakonstantinou, H. Garcia-Molina, J. Widom: Object exchange across heterogeneous information sources. ICDE (1995) 251-260
14. D. Quass, A. Rajaraman, J.D. Ullman, J. Widom, Y. Sagiv: Querying Semistructured Heterogeneous Information. Journal of Systems Integration 7(3/4) (1997) 381-407
15. The BibTeX Format. http://www.ecst.csuchico.edu/~jacobsd/bib/formats/bibtex.html
16. R.T. Vidgen: Constructing a web information system development methodology. Inf. Syst. J. 12(3) (2002) 247-261
Supporting Complex Query with Structured Overlays in Schema-Based P2P System* Min Yu, Zhanhuai Li, and Longbo Zhang School of Computer, Northwestern Polytechnical University, 710072 Xi’an, P.R. China {yum, zhanglb}@mail.nwpu.edu.cn, [email protected]
Abstract. Despite their advantages in scalability and routing efficiency, structured peer-to-peer (P2P) overlay networks fail to support complex queries in a network of peers with heterogeneous schemas, which limits their use in schema-based P2P systems. By using relation keywords as the index key for schema information and partitioning tuples vertically, we design a method for indexing both schema and data with a structured overlay. Based on these two levels of indices, an algorithm supporting complex queries on multiple attributes is proposed. Qualitative analysis and comparison show that this work is closer to the goal of P2P data management than other projects. Keywords: Peer-to-Peer, Schema-Based, Complex Query.
1 Introduction Enabling query processing is a natural extension of key- and keyword-based search in existing P2P systems. It is quite challenging because P2P query processing must be effective and efficient for peers with heterogeneous schemas in a large-scale, dynamic and distributed network. We are interested in the problem of relational query processing in P2P systems. This is an important research area with applications to e-learning, P2P databases, monitoring and stream processing. A structured overlay is a good foundation for building indices because it allows very efficient access for known keys. However, it cannot support complex queries such as join and aggregation on relational data directly. PIER [1] tries to tackle the problem by partitioning relations horizontally and indexing the tuples on certain attributes, but it assumes that every node knows and adopts the same schema, which is unrealistic in large-scale P2P systems, and it cannot support multi-attribute queries effectively. Therefore, we try to support complex queries over relational data with heterogeneous schemas on top of structured overlays. The rest of the paper is organized as follows: Section 2 introduces our indexing scheme; based on it, an algorithm for complex queries is proposed and analyzed in Section 3. In Section 4, related work is discussed and compared with ours. Section 5 is devoted to concluding remarks.
* Supported by the National Natural Science Foundation of China (No. 60573096).
2 Schema and Instance Indices In this paper, each peer shares its data under its own schema. The creator of each relation is required to append some keywords to the relation and its attributes as a thesaurus describing their meaning [2]. Each peer consists of a local database (LDB), a vertical partition component (VP), an index partitioning buffer (IPB), a metadata extractor (ME), a local schema buffer (LSB) and an indexing schema buffer (ISB). First, the ME extracts the schema from the LDB and puts it, together with the thesaurus, into the LSB; this is the schema under which the peer shares data with others. Meanwhile, the ME publishes the schema into the schema index. The ISB stores the schema information assigned to this peer by the schema index. Second, the VP extracts the data to be shared from the LDB according to the shared schema and partitions it vertically into triples of the form (tid, relation name:attribute name, attribute value) [3], which are published into the structured instance index. The IPB stores the triples that the instance index assigns to this peer to maintain. We borrow the concept of soft state from PIER [1] to ensure that data and schema published by failed or departed data sources are eventually discarded.
2.1 Schema Index The schema index is mainly used to find schemas similar to the one used by the original query. As the schema index only requires efficient exact queries on strings, any structured overlay can be used for it. Each keyword of a relation is used as the key to publish the corresponding schema information to the structured overlay in the form (relation keyword, relation name, attribute1, attribute2, ..., attribute n), with a metadata file attached: (1) each attribute is annotated with its thesaurus, and (2) information about the relations that have foreign-key references with this relation is attached. The schema index only contains the list of schemas in the network, without information about the data sources, so the data sources republish their own schema information to the schema index periodically.
2.2 Instance Indices Each tuple of a relation is partitioned into triples of the form (tid, relation name:attribute name, attribute value), in which the tid (consisting of data source name, table name, primary key, etc.) identifies a tuple uniquely. For a relation R with schema (A1, A2, ..., An), each tuple (tid, v1, v2, ..., vn) is partitioned vertically into n triples (tid, R:A1, v1), (tid, R:A2, v2), ..., (tid, R:An, vn) [3]. Each triple is published to BATON [4] three times, using tid, R:Ai:vi and vi as index keys respectively. Using R:Ai:vi as the index key supports selection on attribute Ai; using tid as the index key allows the complete tuple to be reassembled from its vertical partitions; using vi as the index key supports the join operator. BATON [4] is a P2P structured overlay that extends the database B-tree to the P2P setting. For a network of N peers, it answers exact-match and range queries in O(log N) hops, which is efficient. So it is an ideal choice for the instance indices, and it supports the selection algorithm Selection(R, θ, p) naturally:
(1) If the selection condition θ has the form Ci1 ≤ Ai ≤ Ci2, p searches BATON with the range [R:Ai:Ci1, R:Ai:Ci2], which reaches the first peer in the range in O(log N) hops; scanning the rest of the range then takes only a constant number of hops per peer until all answers are found. (2) If the selection condition θ has the form Ai = Ci, p searches BATON with R:Ai:Ci as the search key, which is answered in O(log N) hops. After arriving at a peer covering the selection condition, the triples (tid, R:Ai, vi) that satisfy the condition are found. To obtain the complete tuple for a triple, BATON is searched with tid as the key, which locates the matching peer in O(log N) hops. As tid uses the data source name and relation name as a prefix, and BATON preserves the order of keys, the tuples of the same relation usually reside on nearby nodes in the index. Thus, finding all the complete tuples satisfying the selection condition does not cost many more hops than finding one of them.
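To make the scheme concrete, here is a minimal Python sketch (not from the paper) of the vertical partitioning and of a range selection over a sorted in-memory list standing in for BATON; the relation, attribute and tid formats are hypothetical, and in the real system each lookup is routed through the overlay in O(log N) hops.

```python
from bisect import bisect_left, insort

# In-memory stand-in for the BATON instance index: a sorted list of (key, triple)
# pairs. BATON keeps keys ordered, so a range maps to a contiguous run of entries.
index = []

def publish(key, triple):
    insort(index, (key, triple))

def publish_tuple(relation, attributes, tid, values):
    """Vertically partition one tuple into triples (tid, R:Ai, vi) and publish
    each triple three times: under tid, under R:Ai:vi and under vi."""
    for attr, value in zip(attributes, values):
        triple = (tid, f"{relation}:{attr}", value)
        publish(tid, triple)
        publish(f"{relation}:{attr}:{value}", triple)
        publish(value, triple)

def range_selection(relation, attr, low, high):
    """Selection(R, theta, p) for theta of the form low <= Ai <= high:
    jump to R:Ai:low, then walk the contiguous key range up to R:Ai:high.
    (Values are kept as strings, so keys compare lexicographically.)"""
    lo_key, hi_key = f"{relation}:{attr}:{low}", f"{relation}:{attr}:{high}"
    results, i = [], bisect_left(index, (lo_key, ("",)))
    while i < len(index) and index[i][0] <= hi_key:
        results.append(index[i][1])
        i += 1
    return results

def lookup(key):
    """Exact-match lookup, e.g. by tid to reassemble a complete tuple."""
    i, out = bisect_left(index, (key, ("",))), []
    while i < len(index) and index[i][0] == key:
        out.append(index[i][1])
        i += 1
    return out

# Hypothetical relation Paper(title, year); values kept as strings.
publish_tuple("Paper", ["title", "year"], "src1.Paper.1", ["BATON", "2005"])
publish_tuple("Paper", ["title", "year"], "src1.Paper.2", ["PIER", "2005"])
print(range_selection("Paper", "year", "2004", "2006"))
print(lookup("src1.Paper.1"))
```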
3 Answering Complex Queries Based on the Indices Queries are routed by messages in the network. Each query message consists of the query itself, the identifier of the peer issuing the query, the number of attributes in the query, and a bitmap indicating whether each attribute has been translated. We focus on the selection and join operators in this paper and defer aggregation operators to future work. Following standard database practice, it is better to push the selection operator below the join operator. Furthermore, the original relations are usually quite large and only parts of them are needed by the user, so queries often contain both selection and join. After a user issues a query against the schemas in the LSB and ISB of peer p, p performs two tasks at the same time: (1) it searches the instance indices directly with the original query; (2) it translates the query. For translation, it first searches the schema index with the keywords of R; each match is located in O(log N) hops if it exists. The query is then sent to that peer and translated there as far as possible. There may be several matched relations on that peer, so there may be several rewritten queries, and join operators may be introduced into them. Whenever an attribute is translated, the corresponding bit in the query message is set. If the original query refers to more than one relation, the next relation in the query is translated via the schema index in the same way. If some attributes cannot be translated after all relations have been dealt with, the unfinished rewritten query is discarded. When all attributes are translated successfully, the rewritten query is sent back to the peer that issued the original one and is processed in the same way as the original. Algorithm 1 shows the steps to answer a complex query consisting of selection and join, in which t.JR denotes the value of tuple t on attribute JR. As join is commutative, without loss of generality we assume R is more selective than S, i.e. R filters out more tuples. Our partitioning and indexing strategy lets our join algorithm implement the Fetch-Matches join with the semi-join optimization [1] easily.
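As a small illustration, here is one possible shape for the query message described above; the field names are our own and are not prescribed by the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QueryMessage:
    """Query message routed through the overlay (field names are illustrative)."""
    query: str                 # the (possibly rewritten) query text
    issuer: str                # identifier of the peer that issued the query
    num_attributes: int        # number of attributes appearing in the query
    translated: List[bool] = field(default_factory=list)  # one bit per attribute

    def __post_init__(self):
        if not self.translated:
            self.translated = [False] * self.num_attributes

    def mark_translated(self, attr_index: int) -> None:
        self.translated[attr_index] = True

    def fully_translated(self) -> bool:
        # A rewritten query is sent back to the issuer only when every bit is set.
        return all(self.translated)

msg = QueryMessage("SELECT * FROM Paper WHERE year = 2005",
                   issuer="peer-42", num_attributes=1)
msg.mark_translated(0)
print(msg.fully_translated())   # True
```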
Table 1. Pseudo-code for Algorithm 1

Algorithm 1. ComplexQuery(R, S, θR, θS, JR, JS, p)
Input: R, S: the two relations to be joined; θR, θS: selections on R and S respectively; JR, JS: the join attributes of R and S respectively; p: the peer issuing the query.
Output: results of {t | t ∈ σθR(R) ⋈R.JR=S.JS σθS(S)}.
Steps:
1.  O = {};
2.  if θR is not null then ResultR = Selection(R, θR, p);  // Section 2.2
3.  else ResultR = R;
4.  for all t ∈ ResultR do
5.  {  p searches BATON with t.JR to find node q;
6.     Resultsemi = GetSemiJoin(t.JR, JS, S, θS, q, p); }
7.  for all o ∈ Resultsemi do
8.  {  p searches BATON with o.tid to get the complete tuple of S;
9.     form the complete joined tuple tjoin;
10.    O = {tjoin} ∪ O; }
11. Return O;

Table 2. Pseudo-code for the procedure GetSemiJoin

Procedure GetSemiJoin(vR, attS, S, θS, p, poriginal)
Input: vR: the value of the join attribute of R; attS: the join attribute of S; S: a join relation; θS: the selection on S; p: the peer performing the query; poriginal: the peer receiving the outputs.
Output: the set of triples whose values on S.attS equal vR.
Steps:
1. Resultjoin = {o | o = (tid, S:attS, v) ∧ v = vR ∧ o ∈ Datap};
2. if θS applies to attS then
3.    Resultsel = {o | o satisfies θS ∧ o ∈ Resultjoin};
4. else
5. {  ResultS = Selection(S, θS, p);  // Section 2.2
6.    Resultsel = {o | o ∈ Resultjoin ∧ ∃t' ∈ ResultS (o.tid = t'.tid)}; }
7. Return Resultsel to poriginal;
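For concreteness, a minimal runnable sketch of the data flow of Algorithm 1 (selection on R, semi-join on S, then assembly of the joined tuples); here the BATON lookups are replaced by scans over in-memory lists of hypothetical tuples, so only the logic, not the O(log N) routing, is illustrated.

```python
def complex_query(R, S, theta_R, theta_S, JR, JS):
    """Sketch of Algorithm 1: sigma_thetaR(R) joined with sigma_thetaS(S) on R.JR = S.JS.
    R and S are lists of dicts; theta_* are predicates or None."""
    result_R = [t for t in R if theta_R is None or theta_R(t)]
    output = []
    for t in result_R:
        # GetSemiJoin: tuples of S whose join-attribute value equals t[JR],
        # filtered by theta_S (the semi-join optimization).
        semi = [s for s in S if s[JS] == t[JR] and (theta_S is None or theta_S(s))]
        for s in semi:
            # In the real system only matching triples travel back and the complete
            # S tuple is then fetched by tid; here s is already complete.
            joined = {**{f"R.{k}": v for k, v in t.items()},
                      **{f"S.{k}": v for k, v in s.items()}}
            output.append(joined)
    return output

# Hypothetical relations
R = [{"tid": "r1", "author": "Yu", "year": 2005},
     {"tid": "r2", "author": "Li", "year": 2003}]
S = [{"tid": "s1", "year": 2005, "venue": "VLDB"},
     {"tid": "s2", "year": 2005, "venue": "SIGMOD"}]
print(complex_query(R, S, lambda t: t["year"] >= 2004, None, "year", "year"))
```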
As a query with multiple operators is handled in a pipelined manner, with the results of one operator flowing into the next, we only consider the delay of generating the first result tuple of each operator when calculating the delay of the query. First, ComplexQuery takes O(log N) hops to find the tuples of R that satisfy θR, whether θR is an
exact-match or a range condition. For each tuple of R, it takes O(log N) hops to find the matching triples of S on the join attribute. Filtering them with θS finishes immediately in the best case and needs O(log N) hops in the worst case. GetSemiJoin can send its results back to the issuing peer directly by IP address, without routing through the overlay. The issuing peer then pays O(log N) hops to retrieve the complete result tuples. In summary, the complex query can be answered in O(log N) hops. The logarithmic relationship between the number of hops and the number of peers shows the good scalability and efficiency of our strategy.
4 Related Work and Comparison Like us, many projects focus on relational data management in P2P systems. Hyperion [5] focuses on semantic coordination between peers rather than query processing. Our work is similar to PeerDB [2] in attaching keywords to relations and attributes. PeerDB is based on an unstructured overlay network and uses a flooding query strategy limited by TTL. In contrast, we adopt a structured overlay to index both schema and data, which provides theoretical guarantees on the efficiency of query translation and query evaluation and avoids the shortcomings of unstructured overlays. PIER [1] supports range selection by applying a PHT (prefix hash tree) over a distributed hash table (DHT), which only yields approximate results and is inefficient. PIER proposed the Fetch-Matches and symmetric hash join algorithms and provided two optimization strategies, Bloom join and semi-join. In contrast, our work obtains exact results for range selections and, unlike PIER, does not need to construct a temporary hash index when the join attribute is not a primary key attribute, while implementing the semi-join naturally. [3] also partitions relations vertically, but it publishes each triple using q-grams of the attribute name and attribute value as keys to support similarity queries on attribute names and values. It assumes that similar names mean the same thing, so it does not specify the relation name in the query. But storing a huge number of q-grams is costly, and q-grams cannot express the situation where different attribute names mean similar things. Taking PeerDB's example, Protein is a similar or related attribute to Kinases, but they share no q-grams. Our method solves this problem and avoids the costly decomposition of strings into q-grams, saving attribute-matching time. The similarity join in [3] is only an approximation of the natural join, whereas we can also support equi-joins with exact answers. CON-QuerP [6], based on an unstructured overlay network, uses a structured coordination overlay network (CON) to index views and data sources. Unlike ours, the CON is not used for schema translation and does not use relation keywords as index keys. Piazza [7] focuses on XML data and the transformation between RDF and XML data; its index is centralized and therefore poor in scalability and robustness. Edutella [8], GridVine [9], et al. focus on queries over RDF data, which are much simpler to process than relational data owing to their fixed dimensionality. We now summarize the advantages of our work by qualitative analysis. First, in dealing with heterogeneity, we use keywords as a thesaurus, which is simpler than creating schema mappings for users who are not computer experts. Second, processing of the original query proceeds in parallel with query translation, which both ensures a short initial response time and improves recall by retrieving more results. Third, keywords of the
relation name, rather than the relation name itself, are used to publish the schema information to the schema index. Thus relations with the same keywords, which are considered semantically similar, are indexed to the same or nearby peers. Moreover, when a single relation cannot provide all the attributes in the query but can do so together with relations that have foreign-key references to it, the query can still be rewritten correctly. Finally, adopting soft state in the schema index simplifies maintenance: a failed or departed peer need not inform the index, and data sources need not be checked for connectivity. Table 3 shows that our work is closer to the goals of P2P data management than the others.

Table 3. Comparison of related work

Projects    | Semantic coordination | Expressiveness of query | Scalability | Needs index reconstruction for multi-attribute queries?
Hyperion    | excellent             | medium                  | medium      | no
PeerDB      | good                  | strong                  | medium      | no
PIER        | no                    | strong                  | strong      | yes
[3]         | good                  | medium                  | medium      | no
CON-QuerP   | no                    | strong                  | medium      | no
Piazza      | excellent             | strong                  | poor        | no
Edutella    | good                  | medium                  | medium      | no
GridVine    | good                  | medium                  | strong      | no
Our work    | good                  | strong                  | strong      | no
Ideal state | excellent             | strong                  | strong      | no
5 Conclusion Trading off between semantic coordination and query processing in P2P data management systems, we attach keywords to schema elements and partition tuples vertically, making it possible to index both schema and data with a structured overlay. Based on these indices, an algorithm supporting complex queries on multiple attributes is proposed. Qualitative analysis and comparison show that our approach is closer to the goals of P2P data management than other work. Quantitative evaluation is ongoing work, and supporting aggregation operators on these indices is future work.
References
1. R. Huebsch, B. Chun, J. Hellerstein, B.T. Loo, P. Maniatis, T. Roscoe, S. Shenker, I. Stoica, A.R. Yumerefendi: The Architecture of PIER: an Internet-Scale Query Processor. In: Proc. of the 2005 Conference on Innovative Data Systems Research (2005) 28-43.
2. W.S. Ng, B.C. Ooi, K.-L. Tan, A. Zhou: PeerDB: A P2P-Based System for Distributed Data Sharing. In: Proc. of the 19th International Conference on Data Engineering (2003) 633-644.
3. M. Karnstedt, K.-U. Sattler, M. Hauswirth, R. Schmidt: Similarity Queries on Structured Data in Structured Overlays. In: Proc. of NetDB 06 (2006) 32.
4. H.V. Jagadish, B.C. Ooi, Q.H. Vu: BATON: A Balanced Tree Structure for Peer-to-Peer Networks. In: Proc. of the 31st VLDB Conference (2005) 661-672.
5. A. Kementsietsidis, M. Arenas: Data Sharing through Query Translation in Autonomous Sources. In: Proc. of the 30th VLDB Conference (2004) 468-479.
6. W.N. Qian: Data Management in Peer-to-Peer Systems. PhD thesis, Fudan University (2003).
7. I. Tatarinov, A. Halevy: Efficient Query Reformulation in Peer Data Management Systems. In: Proc. of the ACM SIGMOD International Conference on Management of Data (2004) 539-550.
8. W. Nejdl, W. Siberski, M. Sintek: Design Issues and Challenges for RDF- and Schema-Based Peer-to-Peer Systems. ACM SIGMOD Record Vol. 32(3) (2003) 41-46.
9. K. Aberer, P. Cudré-Mauroux, M. Hauswirth, T.V. Pelt: GridVine: Building Internet-Scale Semantic Overlay Networks. In: Proc. of the Third International Semantic Web Conference (2004) 107-121.
Abstract. Because of their huge scale and numerous components, the availability of massive database systems has become a serious challenge. Many database replication technologies are used to increase the MTTF, but few aim at decreasing the MTTR in massive database systems, where traditional backup methods are not feasible because of their expensive human cost. Based on an analysis of the characteristics of the data in massive databases, we propose a novel approach called the Detaching Read-Only (DRO) mechanism and its variation DRO+. It decreases the MTTR by reducing the size of the physically changing data in every database, detaching data at node granularity. The analysis and experimental results show that our approach can not only reduce the MTTR by an order of magnitude, but also reduce the expensive human cost without extra hardware cost.
1 Introduction The requirement to store and query massive data has emerged in both scientific and commercial applications. Alexander Szalay and Jim Gray point out that the amount of scientific data is doubling every year and that scientific methods are evolving from paper notebooks to huge online databases [1]. By 2000, disk capacity had improved a thousandfold over the previous 15 years, consistent with Moore's law, so storage volume is no longer the primary challenge [2]. System maintenance, automation and availability have become the new great challenges [3]. Building a massive database system out of federated databases is a good idea [4]: because of the complexity of managing and maintaining a single PB (petabyte)-scale database system, partitioning it into many small federated databases is a feasible way. However, maintaining such a huge system is expensive because of the low availability caused by its large scale. There are at least two challenges in achieving high availability in massive database systems: the short MTTF (Mean Time To Failure) and the long MTTR (Mean Time To Recovery) [5] for storage failures. The former is caused by the number of
* Supported by the National High Technology Development 863 Program of China under Grant No. 2004AA112020, and the National Grand Fundamental Research 973 Program of China under Grant No. 2005CB321804.
hardware components. For example, if the availability of a 1TB database is 99.99%, the availability of a 100TB system composed of 100 such databases will only be about 99%. The latter is caused by the size of a single database, for two reasons: 1) recovering 1TB of data takes a long time even with good backup solutions such as archived and timely backups [6]; 2) carefully backing up 100 databases is a huge cost in DBA effort. The effective way to increase the overall MTTF is to increase the MTTF of each database in a massive federated system, since the number of databases cannot be decreased. Research on increasing MTTF has mainly focused on efficient synchronization mechanisms between the replicas of a cluster [8] and cares little about recovery time. Decreasing the MTTR for storage failures is a great challenge because of the database's size. Little attention has been paid to this problem in previous research, because it only arises in massive systems. The idea of reducing MTTR rather than increasing MTTF was proposed in the ROC project [9]. This paper studies how to decrease the MTTR and the human cost. From the analysis above, we aim at two objectives in a massive system:
- To improve the massive database system's availability by reducing the MTTR.
- To decrease the MTTR without extra expensive human cost.
Currently, massive storage systems are usually filled with high-rate streaming log data, such as scientific experiment data [10,11], call detail records, RFID-based freeway tracking [12], and network packet traces. All these data share the same basic features:
- The data are generated at a high rate and continuously (otherwise the information would not accumulate into a TB- or PB-scale system).
- The data are only appended into the system, and old data are never updated.
We call these features "insert-once-no-update". Leveraging them, we design a novel mechanism, DRO, and its variation DRO+ to separate the insert-once-no-update "read-only" data from the online loading data. We compare the novel mechanisms with the double replication used in an existing massive system in terms of MTTR, TCO (Total Cost of Ownership) and performance. The results show that DRO+ excels the others in most cases. The paper is organized as follows: Section 2 describes a massive database system, CDRMDB. Section 3 describes our novel replication mechanisms. Sections 4 and 5 analyze the MTTR and cost of the novel mechanisms; Section 6 presents the experiments; Section 7 discusses related work, and Section 8 concludes and outlines future work.
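As a back-of-the-envelope check of the availability example in the introduction, the following sketch (our own illustration; the MTTF/MTTR figures in the second part are hypothetical) computes the availability of a federation from the availability of its member databases, and the availability implied by a given MTTF and MTTR.

```python
def system_availability(node_availability: float, num_nodes: int) -> float:
    """Availability of a federation that needs all nodes up: A_node ** n."""
    return node_availability ** num_nodes

def availability_from_mttf_mttr(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# 100 databases of 1TB, each 99.99% available -> roughly 99% for the whole system.
print(round(system_availability(0.9999, 100), 4))     # ~0.99
# Hypothetical figures: cutting MTTR from 120h to 12h at the same MTTF of 1000h.
print(round(availability_from_mttf_mttr(1000, 120), 3))   # ~0.893
print(round(availability_from_mttf_mttr(1000, 12), 3))    # ~0.988
```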
2 A Massive Database System: CDRMDB In this section, we describe a massive database system, CDRMDB (Call Detail Records Massive DataBase), which stores high-rate, massive call detail records. It is built from 86 database nodes, each with a volume of 500GB. CDRMDB has
the following features, which are typical of many massive streaming data storage systems: 1) it stores high-rate, massive streaming data; 2) it provides a query interface to the massive data; 3) scalability and availability are two key system features. In CDRMDB, the simple primary-backup mechanism [13] is adopted for efficient loading performance. One primary and one backup replica constitute one cluster, so the system consists of 43 database clusters. The causes of the overly long MTTR are the database size and the recovery mechanism. In CDRMDB, one database's size is 500GB. During recovery, all 500GB of data must be loaded into the recovered database from the correct database through a logical export and import mechanism; this size and the logical recovery mechanism lead to the long MTTR. In addition, with double replication the failed cluster may become the bottleneck of query processing while recovery is being done, because the data export operation dramatically decreases the system's performance. To eliminate the query bottleneck of double replication, treble replication is a better choice, at half again the cost of the former. When a cluster has three replicas and one of them fails, one normal database is used to recover the data while the other continues to process queries as before. Another benefit of treble replication is that the cluster's availability is higher, because the probability of all three databases failing is an order of magnitude lower than with two.
3 Novel Replication Mechanisms: DRO and DRO+ In this section, we first uncover the underlying reasons for the overly long recovery time in massive database systems (Section 3.1). Then we propose a novel replication mechanism, DRO, and its variation DRO+ in Sections 3.2 and 3.3. 3.1 Why Is the Recovery Time so Long? Section 2 showed the conflict between decreasing the MTTR and decreasing the maintenance cost. The essential reason for the conflict is that the system's scale is so large that traditional backups cannot be used, because the DBA cost is too high. The main idea for shortening the recovery time stems from the insert-once-no-update feature described in Section 1. We can explain this idea with Fig. 1. Fig. 1 (a) and (a1) show the current state of CDRMDB: all data is in an online, changing database. The database's data is always changing, and all loading and query requests are issued against the whole database. If a database fails because of a storage failure, it needs to recover all its data, which can reach 500GB in CDRMDB; this takes several days. However, this can be avoided by leveraging the insert-once-no-update feature: we can periodically detach the history data, which will not be updated. This idea is shown in Fig. 1 (b) and (b1). Fig. 1 (b) shows logically detaching the history data from a database. Fig. 1 (b1) divides the system into two parts from the system's point of view: one part processes loading and query requests as an OLTP database, while the other stores the history data and is consolidated as an OLAP database. In addition, in order to eliminate the I/O contention between detaching and loading, it is sensible to detach data at node granularity.
(Diagram, panels: (a) all data in a changing state; (a1) all data in OLTP; (b) detaching history data to a consolidated state; (b1) detaching history data dynamically to OLAP; loading and query requests are shown against each state.)
Fig. 1. Online detaching history data
3.2 Detaching Read-Only Data Replication Detaching history data at node granularity can be implemented by dynamically deploying nodes to different roles. We call this "Detaching Read-Only (DRO) Replication". The idea stems from three basic facts: 1) a read-only database has an extremely short recovery time compared to a changing database; 2) the smaller a read-write database is, the shorter its recovery time; 3) research on read-only compressed databases [14] has shown that compressed databases are likely to do better. 3.2.1 Tasks in DRO Let us illustrate the DRO mechanism with Fig. 2. In Fig. 2, the system's running time is divided into cycles. In every cycle, two tasks are performed in parallel: loading data into the read-write database clusters and compressing read-only database clusters. In cycle 1, three database clusters called Loading Database Clusters (LDCs, marked "Loading") provide query and data loading functions. In cycle 2, these three clusters become read-only, and at the same time another three clusters are added to the system as LDCs. The three read-only clusters are compressed and all their data is collected into one cluster, called the Query Database Cluster (QDC), marked "Compressed & Query" in black. The two free clusters released after compression (boxes with dashed frames in Fig. 2) are added to the third cycle for the next cycle's loading task. In cycle 3, one fresh cluster newly added to the system (boxes with solid frames in Fig. 2) and the two free clusters from cycle 2 make up the new LDCs. Just as in cycle 2, two free clusters are released after compression and are added to cycle 4. In cycle 4 the procedure continues, and the rest can be deduced by analogy.
(Diagram: database clusters N1-N8 move between the Loading role and the Compress & Query role over cycles 1-4; legend: fresh database cluster, compressed & read-only database cluster, free database cluster.)
Fig. 2. Detaching Read-Only Data Replication Mechanism
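The cycle bookkeeping can be sketched as follows (our own illustration, using the counting rules stated later in Section 4.1: roughly NL*R1*R2 new query clusters and NL*(1 - R1*R2) freed clusters per cycle). With NL = 3 and the Section 6.1 values R1 = 0.4, R2 = 0.5, this reproduces the one new QDC and two free clusters per cycle of Fig. 2.

```python
import math

def dro_cycle_plan(NL: int, R1: float, R2: float, num_cycles: int):
    """Illustrative per-cycle bookkeeping for DRO: after a cycle, its NL loading
    clusters are compressed into ceil(NL*R1*R2) query clusters (QDCs);
    floor(NL*(1 - R1*R2)) clusters become free and are reused as loading
    clusters, so only the remainder must be fresh clusters."""
    qdcs_per_cycle = math.ceil(NL * R1 * R2)
    freed_per_cycle = math.floor(NL * (1 - R1 * R2))
    plan = []
    for cycle in range(1, num_cycles + 1):
        if cycle <= 2:
            reused = 0            # cycles 1 and 2 each start with a full set of fresh LDCs
        else:
            reused = freed_per_cycle
        plan.append({"cycle": cycle, "loading": NL, "fresh": NL - reused,
                     "reused_free": reused,
                     "new_QDCs": 0 if cycle == 1 else qdcs_per_cycle})
    return plan

for row in dro_cycle_plan(NL=3, R1=0.4, R2=0.5, num_cycles=4):
    print(row)
```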
3.2.2 Two Types of Database Clusters In an LDC, all database nodes store the same data, but not in exactly the same physical files. For example, when a tuple R is loaded into the three databases d1, d2, d3 of a cluster, it may be stored in file1 in d1, file2 in d2 and file3 in d3, although it is stored in exactly the same way logically. So when a database node breaks down because of a media failure, the database has to be rebuilt from scratch and the data imported after being exported from another healthy node. In a QDC, by contrast, all database nodes store exactly the same physical data. Once a compressed database node has been created, it can be replicated to another node by a disk copy, i.e. copying all the files into the same directories. This mechanism benefits from the data's no-update-after-insert property. Its advantage is that when a database node breaks down because of a media failure, only the faulty files need to be copied from a healthy node, which shortens the recovery time and barely affects the healthy node. 3.3 DRO+: Improved DRO
(Diagram: in DRO+, clusters N1-N6 serve as loading clusters in every cycle, while the compressed read-only data is collected into larger clusters R1-R3, one added per cycle; legend: fresh database cluster, compressed & read-only large database cluster.)
Fig. 3. DRO+: improved DRO
DRO decreases the MTTR without increasing the DBAs' backup and recovery work. However, it loses much loading performance because too few nodes load data in every cycle. In DRO, all database nodes have the same storage volume, and storage resources are wasted since each node's volume is sized for the whole online time, which is much longer than one cycle. So we can shift the storage budget to buy more loading database nodes. The procedure of DRO+ in Fig. 3 is like DRO, except that one compressed, read-only database cluster with a storage size larger than the LDCs is added in every cycle, and in every cycle the number of LDCs is equal to that of treble replication. The economic benefit of DRO+ over DRO is illustrated in Section 4.
4 MTTR Analysis Section 3 introduced four database replication mechanisms. In this section, we analyze the MTTR, performance and economic cost of these mechanisms. 4.1 MTTR in DRO In order to analyze the system's MTTR under DRO, we define the following variables: the number of database clusters is NS; the data's online time is T; the length of every cycle is C; the number of LDCs in every cycle is NL; the storage volume of one LDC is S; the required average data loading rate is V; the overall data loading performance is PL; the compression ratio is R1; the space utilization ratio of every database node, after it becomes query-only and before it is compressed, is R2. Table 1 shows the number of database clusters of each type in every cycle; the value in every column is calculated based on volume usage. The number of fresh database clusters should be an integer not less than NL*R1*R2, and the number of free database clusters should be an integer not greater than NL*(1 - R1*R2). Table 1. Number of various types of database clusters in every cycle
For example, in Fig. 2 the variables take the following values: NL = 3, R1*R2 = 1/3, NL*R1*R2 = 1, NL*(1 - R1*R2) = 2. In every cycle, one or more QDCs are created after the database clusters of the previous cycle are compressed. Because a QDC is written only once with the compressed data and may not be updated afterwards, the following two points must be guaranteed:
1) To make the best use of the storage space, a QDC should be filled as much as possible, since it can only be written once. 2) The cycle should be long enough that the volume of data loaded during the cycle by the NL LDCs can fill the storage space of one or more QDCs after compression. From the above analysis and Table 1, we can derive the following equation: NS = NL + NL + (T/C - 2)*NL*R1*R2 = (2 + (T/C - 2)*R1*R2)*NL. (1)
The data loading performance is generally considered proportional to the number of data loading clusters, so we have (with f1, f2 constant factors): PL = f1*NL. (2) V = f2*PL. (3)
The system's MTTR is of two types: MTTRQ, the MTTR of one QDC, and MTTRL, the MTTR of one LDC. Since recovering a QDC only requires copying one or more data files, its recovery time is a constant t0, so the system's MTTR is determined by MTTRL. To simplify the discussion, we assume MTTRL is linear in the data size of one LDC node (with f3 a constant factor): MTTRL = f3*V*C/NL. Substituting (2) and (3), we get: MTTRL = f1*f2*f3*NL*C/NL = f1*f2*f3*C. (4) From (1) and (4), we find the relation between MTTRL, C and NL: MTTRL = f1*f2*f3*T*R1*R2 / (2*R1*R2 - 2 + NS/NL). (5)
Theorem 1. In DRO and DRO+, assuming that the data loading performance is proportional to the number of data loading clusters and the MTTR of one database is proportional to its size, the MTTR is proportional to the cycle C. Proof. As discussed above. □
Theorem 1 indicates that C determines the MTTR. 4.2 MTTR and Performance Comparison Theorem 2. Under Treble, DRO and DRO+, assuming that the data loading performance is proportional to the number of LDCs and the MTTR of one database is proportional to its size, DRO+ has the smallest MTTR and the best loading performance. Proof. We denote the MTTR of Treble, DRO and DRO+ by MTTRT, MTTRD and MTTR+. In fact, the double and treble replication mechanisms are special cases of DRO with respect to (4): for treble replication, R1 = R2 = 1 and T = C, so NS = NL and MTTRT = f1*f2*f3*C = f1*f2*f3*T. From (4) we also get MTTRD = MTTR+ = f1*f2*f3*C. So MTTRD : MTTR+ : MTTRT = C : C : T. (6) If we assume that loading performance is linear in the number of loading clusters, we obtain the performance comparison of Treble, DRO and DRO+: PT : PD : P+ = NS : NL : NS. (7) From (6) and (7), we infer that DRO+ has the smallest MTTR and the best performance. □
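To make the formulas concrete, the sketch below (ours) evaluates (1) and (4) with the parameter values later used in the simulation of Section 6.1.

```python
def nl_from_cycle(NS, T, C, R1, R2):
    """Invert equation (1): NS = (2 + (T/C - 2)*R1*R2) * NL."""
    return NS / (2 + (T / C - 2) * R1 * R2)

def mttr(C, f1=1.0, f2=1.0, f3=0.1):
    """Equation (4): MTTR_L = f1*f2*f3*C (here in days)."""
    return f1 * f2 * f3 * C

NS, T, R1, R2 = 30, 150.0, 0.4, 0.5     # parameter values of Section 6.1
for C in (10, 15, 30, 50):               # DRO with different cycle lengths
    print(f"DRO    C={C:>3}d  NL={nl_from_cycle(NS, T, C, R1, R2):5.2f}  MTTR={mttr(C):4.1f}d")
# Treble is the special case R1 = R2 = 1, C = T, so NL = NS and MTTR = f3*T.
print(f"Treble C={int(T)}d  NL={NS:5.2f}  MTTR={mttr(T):4.1f}d")
# For C = 15 days the MTTR is 1.5 days versus 15 days for Treble, the
# order-of-magnitude reduction reported in Section 6.1.
```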
5 Cost Analysis In Section 4.2 we concluded that DRO+ has the smallest MTTR and the best performance, but its cost might be higher than the other two. In this section we show that the total cost of DRO+ is not always higher. 5.1 Original Cost Analysis The original hardware cost includes CPU, memory, storage, network switches, and so on. To simplify the cost analysis of the three replication mechanisms, we assume that every node has the same amount of CPU and memory and that the storage cost is proportional to its volume. Theorem 3. If the ratio of the storage cost to the computation cost exceeds a certain value, the original cost of DRO+ is not more than that of Treble and DRO. Proof. The original hardware costs of Treble, DRO and DRO+, denoted CT, CD and C+, are: CT = CD = 3*NS*(CPU + S). (8) C+ = 3*(2*NS*(CPU + SW) + (T/C - 1)*(CPU + SR)). (9)
Here CPU is one node's cost excluding its storage cost; the other variables are defined in Section 4.1. In (9), one loading node's storage size is SW = (C/T)*S, and one read-only node's storage size is SR = NS*SW*R1. So if we expect the cost of DRO+ to be no more than the other two, i.e. C+ <= CT = CD, we must keep the following inequality: S/CPU >= (NS + T/C - 1) / (NS*(1 - R1 + (C/T)*(R1 - 2))). (10) From (10), we can conclude that if the ratio of storage to CPU cost is larger than f(C) = (NS + T/C - 1) / (NS*(1 - R1 + (C/T)*(R1 - 2))), the original cost of DRO+ will not exceed the other two. □ Through simple analysis of f(C), we find that it has a minimum when R1 > 0.1 and T/C > 3, which holds in most cases; Section 6.2 confirms this. 5.2 TCO Analysis The total cost of ownership (TCO) [15] of an information system can be divided into two parts: the original hardware cost and the management cost. The former was discussed in Section 5.1; the latter is mainly human cost. As G. Weikum points out, the TCO of a mission-critical system is more and more dominated by the money spent on human staff [16]. The human cost of recovery can be calculated as the total recovery time: failure count * MTTR. Since the system's storage scale is the same, the failure count is the same, so the ratio of the human cost of DRO+ to Treble is CH+ : CHT = MTTR+ : MTTRT = C : T. The cost of moving online data offline is difficult to quantify, but it is at least clear that it is zero in DRO+ or DRO, as discussed in Section 3.2.2. In Treble, it is expensive because it needs substantial extra human operations, such as exporting and deleting old online data, and it sharply degrades the performance of a 24*7 production system.
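The threshold in (10) and the cost comparison can be checked numerically with the following sketch (ours); the per-node CPU and storage prices are hypothetical, chosen only so that S/CPU exceeds the threshold.

```python
def f_threshold(NS, T, C, R1):
    """Right-hand side of (10): the minimum S/CPU ratio for C+ <= CT = CD."""
    return (NS + T / C - 1) / (NS * (1 - R1 + (C / T) * (R1 - 2)))

def original_costs(NS, T, C, R1, CPU, S):
    """Equations (8) and (9); SW and SR are the loading / read-only node storage sizes."""
    SW = (C / T) * S
    SR = NS * SW * R1
    CT = 3 * NS * (CPU + S)                                        # Treble and DRO, eq. (8)
    Cplus = 3 * (2 * NS * (CPU + SW) + (T / C - 1) * (CPU + SR))   # DRO+, eq. (9)
    return CT, Cplus

NS, T, C, R1 = 30, 150.0, 15.0, 0.1
print("f(C) threshold for S/CPU:", round(f_threshold(NS, T, C, R1), 2))   # about 1.8
# Hypothetical prices: storage 2.0 units per node, everything else 1.0 unit,
# i.e. S/CPU = 2.0 > 1.8, so DRO+ should not cost more than Treble/DRO.
CT, Cplus = original_costs(NS, T, C, R1, CPU=1.0, S=2.0)
print("C_T = C_D =", round(CT, 1), " C+ =", round(Cplus, 1))
# Human cost of recovery scales as C : T, i.e. one-tenth of Treble's for C = 15, T = 150.
print("human cost ratio C:T =", C / T)
```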
6 Simulation and Case Analysis In this section, we show the differences in MTTR and cost among the four mechanisms described above by simulating several massive systems. 6.1 MTTR and Performance In this section we pay attention to MTTR and NL. Fig. 4 shows the value of NL for different cycles, with NS = 30, T = 150 days, R1 = 0.4, R2 = 0.5, f1 = f2 = 1. For treble and double replication, C = T = 150 days; for DRO, C = 10, 15, 30, 50 days and NL is calculated from (1). In Fig. 5, MTTR is calculated from (4) in Section 4.1. For Treble and DRO, f3 = 0.1, and for Double, f3 = 0.13. For Double and Treble, C = T = 150 days; for DRO, MTTR varies with the cycle. From Figs. 4 and 5, we conclude that the shorter the cycle is, the shorter the MTTR is, but the worse the data-loading performance may be. The MTTR can be decreased to one-tenth when the cycle is 15 days.
(Figs. 4 and 5: NL (number of clusters) and MTTR (days) of Double, Treble, DRO and DRO+ under different cycles.)
6.2 Cost Analysis
Fig. 6 plots the right-hand side of (10). Fig. 7 shows the human cost ratio of DRO+ to Treble in recovery, calculated as in Section 5.2. In Fig. 6, T = 150 days; the curves show that the larger NS is and the smaller R1 is, the smaller the ratio is. That is to say, the larger the system and the higher the compression ratio, the more likely it is that the hardware cost of DRO+ is no more than the others'. When C = 15, NS = 30 and R1 = 0.1, the minimum ratio reaches 1.8, and the human cost of recovery in DRO+ is only one-tenth. The value of 1.8 can easily be reached in data-intensive applications like TPC-C: the No. 1 system in the TPC-C Price/Performance column as of 27 May 2006 is Dell's PowerEdge 2800, whose S/CPU is 1.95 [17]. This is evidence that the original cost of DRO+ is quite likely to be no more than that of Treble or DRO.
(Figure: human cost ratio, 0-0.25, versus cycle, 0-35 days.)
Fig. 7. Human Cost Ratio of DRO+ to Treble
7 Related Work In the database availability research community, database replication is a primary technology. The eager approach easily leads to deadlocks and is hard to scale because of its blocking mode [18], so many lazy approaches have been proposed to improve overall performance and scalability [19]. However, none of them says much about recovering a replica from other replicas at massive scale. Another line of research builds high-availability systems on share-nothing database clusters. RAIDb [7] aims to build a highly available system, like commercial systems, on top of multiple open-source databases. It implements high availability and load scheduling through a middleware tier that caches all the SQL operations in a management node and replays them on the failed database replica. This recovery approach is not feasible when data arrives at a high rate, because the cache becomes too large and traditional backup mechanisms are not practical for each replica. Oracle RAC [20] takes another way to gain high availability on massive systems: it provides high availability through a multi-instance fault-tolerance mechanism in the query processing tier, but it cannot continue working when it encounters media failures. So it cannot easily provide high availability under the high media fault rate of a massive system. Google is a successful massive system, built from about ten thousand nodes, of which about one hundred break down every day [21]; its data scale is 4080TB. Its method of gaining high availability is data replication: all data and metadata is replicated two or three times. It focuses on the high availability of a massive file system, and its recovery granularity is physical files, so the difficult problem of data consistency in database recovery does not arise. The replication mechanisms proposed in this article focus on quick recovery from a node with a media failure, which is not covered enough in the above research; in particular, taking the total cost of recovery into account has not been addressed before.
8 Conclusions Decades of development in computing technology have made it possible to store massive data such as web logs, call detail records and sensor data. However, the short MTTF and long MTTR of massive systems, caused by their massive scale, have become a new challenge. Much work has been done to increase MTTF, but little attention has been paid to decreasing MTTR, which is a severe problem in running production systems. Based on the experience of our production system CDRMDB, we propose a novel mechanism, DRO, and its variation DRO+ from a systemic and economical point of view. The simulation shows that our approach can decrease the MTTR by an order of magnitude without any performance loss and without extra hardware or human cost.
References
1. Jim Gray, Alex Szalay: Science in an exponential world. Nature, Vol. 440, 23 (2006).
2. Jim Gray, Prashant Shenoy: Rules of Thumb in Data Engineering. ICDE 2000.
3. Jacek Becla, Daniel Wang: Lessons Learned from Managing a Petabyte. CIDR 2005.
4. Boris Gelman: V2LDB. CIDR 2005.
5. K. Nagaraja, X. Li, B. Zhang, R. Bianchini, R. Martin, T. Nguyen: Using Fault Injection and Modeling to Evaluate the Performability of Cluster-Based Services. In: Proceedings of the USENIX Symposium on Internet Technologies and Systems, Mar. 2003.
6. Abraham Silberschatz, Henry F. Korth, S. Sudarshan: Database System Concepts. 4th edition. China Machine Press, pp. 461-470.
7. Emmanuel Cecchet: C-JDBC: a Middleware Framework for Database Clustering. IEEE Computer Society Technical Committee on Data Engineering, 2004.
8. Yuri Breitbart, Raghavan Komondoor, Rajeev Rastogi, S. Seshadri, Avi Silberschatz: Update Propagation Protocols For Replicated Databases. SIGMOD 1999.
9. D.A. Patterson, A. Brown, P. Broadwell, G. Candea, M. Chen, J. Cutler, P. Enriquez, A. Fox, E. Kiciman, M. Merzbacher, D. Oppenheimer, N. Sastry, W. Tetzlaff, J. Traupman, N. Treuhaft: Recovery-Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, March 15, 2002.
10. Y. Dora Cai, Ruth Aydt, Robert J. Brunner: Optimized Data Loading for a Multi-Terabyte Sky Survey Repository. In: Proc. Supercomputing 2005.
11. A. Szalay, P. Kunszt, A. Thakar, J. Gray, R. Brunner: Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey. In: Proc. SIGMOD 2000.
12. http://www.511.org/fastrak
13. Matthias Wiesmann, Fernando Pedone, André Schiper, Bettina Kemme, Gustavo Alonso: Transaction Replication Techniques: a Three Parameter Classification. SRDS 2000.
14. Daniel J. Abadi, Samuel R. Madden, Miguel C. Ferreira: Integrating Compression and Execution in Column-Oriented Database Systems. Proceedings of SIGMOD 2006.
15. Ellis F. Hitt: Total ownership cost use in management. Digital Avionics Systems Conference, 1998.
16. Gerhard Weikum, Axel Moenkeberg, Christof Hasse, Peter Zabback: Self-tuning Database Technology and Information Services: from Wishful Thinking to Viable Engineering. VLDB 2002.
17. http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=105092601
18. Jim Gray, Pat Helland, Patrick O'Neil, Dennis Shasha: The Dangers of Replication and a Solution. ACM SIGMOD 1996.
19. A. Sousa, J. Pereira, L. Soares, A. Correia Jr., L. Rocha, R. Oliveira, F. Moura: Testing the Dependability and Performance of Group Communication Based Database Replication Protocols. Dependable Systems and Networks (DSN) 2005.
20. Building Highly Available Database Servers Using Oracle Real Application Clusters. An Oracle White Paper, May 2001.
21. Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google File System. The ACM Symposium on Operating Systems Principles (SOSP) 2003.
Research and Implement of Real-Time Data Loading System IMIL* Han WeiHong, Jia Yan, and Yang ShuQiang Computer School, National University of Defense Technology, Changsha, China [email protected], [email protected], [email protected]
Abstract. With the rapid development of the Internet and communication technology, massive data has accumulated in many Web-based applications such as deep Web applications and Web search engines. Increasing data volumes pose enormous challenges to data-loading techniques. This paper presents a real-time data loading system, IMIL (Internet Monitoring Information Loader), used in RT-IMIS (Real-time Internet Monitoring Information System), which monitors real-time Internet traffic, manages network security, and collects a mass of real-time Internet information. IMIL consists of an extensible, fault-tolerant hardware architecture, an efficient bulk data loading algorithm using SQL*Loader and the exchange-partition mechanism, optimized parallelism, and guidelines for system tuning. Performance studies show the positive effects of these techniques: the loading speed of each cluster increases from 220 million records per day to 1.2 billion per day, reaching a top loading speed of 6TB of data per day when 10 clusters work in parallel. This framework offers a promising approach for loading other large and complex databases.
loading process. The online multi-terabyte data transaction system described above poses new challenges to traditional database technology and to the speed of real-time data loading. This paper is based on RT-IMIS (Real-time Internet Monitoring Information System), which monitors real-time Internet traffic, manages network security, and collects a mass of real-time Internet information, including packet-trace data such as IP or TCP packets, user session information, network flow data and SNMP data. The characteristics of these data are as follows: 1. Variety of data sources: packet-trace data, flow-trace data, routing tables, configuration data, SNMP data, packet transmission delays from active probing, link utilization rates, and so on. 2. Sustained data generation: the network works 24 hours a day, 7 days a week, so network monitoring data is produced continuously. 3. Large and continually growing amounts of data, forming a multi-terabyte data set. RT-IMIS not only records these monitoring data into the database, but also provides analysis and query functions over the online data, including network anomaly monitoring, tracing the origins of network attacks and viruses, and query and analysis of one year of online data. RT-IMIS stores about 400TB of information per year; the average arrival rate of monitoring information is about 1TB per day, with a peak rate of 6TB. We are not aware of any other system at the same data scale as RT-IMIS: according to the 2005 Top Ten Award Winners, the largest data system in 2005 was that of the Max Planck Institute for Meteorology, at about 223TB, while Yahoo's was about 100TB and AT&T's about 94TB. These characteristics of RT-IMIS bring great challenges to data loading. First, the large data scale requires the underlying hardware architecture to be extensible. Second, the high data-collection rates and continuous data processing require real-time and fault-tolerant data loading. Finally, data loading and data query run together in RT-IMIS, which requires that data loading does not affect the usability of the database. Based on the requirements of the Internet monitoring information system, this paper presents the real-time data loading component of RT-IMIS, IMIL (Internet Monitoring Information Loader). IMIL uses the Oracle 10g database and adopts multi-node, multi-task parallelism and a range of optimized data loading techniques, allowing the data loading speed to reach the gigabyte level per second and satisfying the real-time multi-terabyte data loading requirements of RT-IMIS.
2 Related Research In recent years, research on data loading has mostly concentrated on improving the efficiency of data loading in a single database, especially the performance of index construction [5][6][7][8]. Related work includes J. Bercken's research on data loading algorithms, P. Ciaccia's research on constructing indexes in multimedia databases during bulk loading, Christian's research on constructing indexes while loading multi-dimensional data, and so on. These techniques are very useful for loading a single database, but their performance cannot satisfy the requirement of real-time multi-terabyte data loading.
Each major relational database management system (RDBMS) vendor provides a utility to load data from a flat file into a table: Oracle provides SQL*Loader, MS SQL Server provides Data Transformation Services (DTS), and IBM DB2 provides a LOAD utility. However, these are proprietary tools that only work with the vendor's own databases. Furthermore, they are primarily designed to load data quickly into a single database table without performing any data transformation. These data-loading services can serve as local database support for multi-terabyte data loading, but they cannot accomplish multi-terabyte loading by themselves. Research on multi-terabyte data loading is just beginning. The National Center for Supercomputing Applications (NCSA) and the Department of Astronomy at the University of Illinois are researching data loading used to populate a multi-table, multi-terabyte database repository for the Palomar-Quest sky survey [1]. They developed a data loading framework named SkyLoader, which improves loading performance. However, their research is aimed at sky survey data, depends on the application and the table structure, and its loading rate is not sufficient for real-time data stream loading. The research of the Sloan Digital Sky Survey (SDSS) is similar; it uses a SQL Server cluster for storing data [2][3][4]. SDSS uses a temporary database, but it does not perform as well as SkyLoader.
3 IMIL System Framework 3.1 Extensible Fault-Tolerant Hardware Architecture RT-IMIS is based on the RAC architecture of Oracle 10g. The Oracle RAC architecture is a multi-instance, single-database architecture: all Oracle instances share one storage resource. Oracle RAC is a high-availability, high-performance and low-price database architecture. If one of the Oracle instances in the cluster breaks down, the others automatically take over its tasks and try to repair it; all of these actions are transparent to users. The cluster database has automatic load balancing, and all instances share one database, so the database can serve access requests in parallel, which makes it highly effective. Moreover, the Oracle RAC structure is an expandable platform: server nodes and Oracle instances can easily be added to or removed from a RAC cluster according to service demand. The recommended number of processors per Oracle RAC is 6-8 (too many processors cause internal communication overload, and performance cannot grow linearly with the number of processors). Therefore, the IMIL system uses many Oracle RACs interconnected with each other; each Oracle RAC is composed of 6-8 processors and runs an Oracle 10g database, as Figure 1 shows. This hardware architecture is highly extensible: processors can be added within a RAC, and more RACs can be added to the system. The structure is also highly fault-tolerant: if one processor breaks down, its RAC continues to function normally, and if one RAC breaks down, another RAC can take over its tasks, so the entire system continues to function normally.
Fig. 1. IMIL hardware architecture
3.2 IMIL System Software Framework The IMIL system is composed of the data dispatch module, the preprocessing module and the bulk loading module, as Figure 2 shows. Data dispatch module: according to the data distribution strategy (which users may choose themselves), it accepts real-time data from the Internet and dispatches it to the preprocessing modules corresponding to the different databases. Preprocessing module: each preprocessing module corresponds to one local database. First, because many varieties of data are monitored on the Internet, each with a different form, the received data must be transformed into a unified representation. Second, the statistics and online analysis module is responsible for real-time analysis: discovering threats in the network (such as network attacks and viruses) in real time, handling the statistical data, updating the statistical tables promptly, and keeping the statistical data consistent with the real-time data. Last, according to the data format, the data buffer management module writes the data to the corresponding buffer files and manages those files. Bulk loading module: it is divided into three sub-modules, the data loading module, the index constructing module and the partition exchanging module. The data loading module loads data files into temporary tables; the index constructing module creates all kinds of indexes on the temporary tables; the partition exchanging module exchanges the temporary tables with partitions of the monitoring information tables. Together these modules make the new data available in the database. These operations may execute in order, out of order or in parallel. The parallel manager module manages parallel processing among all these sub-modules: it dispatches the different operations to the computing resources used in data loading, controls parallel loading, and maintains the relationships between operations.
Fig. 2. IMIL software architecture
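As a toy illustration of the dispatch and buffer management described above (our own sketch; the record types, file layout and hash-based fallback are assumptions, since the real distribution strategy is user-configurable), records are routed to a per-database buffer file according to the class of table they belong to:

```python
import json
from pathlib import Path

NUM_DATABASES = 10   # one Oracle 10g database per RAC, as in IMIL

# Hypothetical mapping from record type to table class / target database.
TABLE_CLASS = {"session": 0, "ip_packet": 1, "snmp": 2, "netflow": 3}

def dispatch(record: dict, buffer_dir: Path = Path("buffers")) -> Path:
    """Write one unified-format record to the buffer file of its target database.
    Buffer files are later bulk-loaded by SQL*Loader (Section 3.2.1)."""
    db_id = TABLE_CLASS.get(record["type"], hash(record["type"])) % NUM_DATABASES
    buffer_dir.mkdir(exist_ok=True)
    path = buffer_dir / f"db{db_id}_{record['type']}.dat"
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return path

dispatch({"type": "session", "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "bytes": 1234})
```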
3.2.1 Bulk Data Loading The traditional way to load data is to issue INSERT statements, but this cannot satisfy the demand of mass data growth such as IMIL's. So we use the Oracle bulk loading tool SQL*Loader, which can load data directly from a formatted external data file into an Oracle database table. The flow of data loading with SQL*Loader is shown in Figure 3.
(Diagram: SQL*Loader reads the input data file under the direction of a control file and loads it into the Oracle RDBMS, producing a log file, a bad file and a discard file.)
Fig. 3. Data loading flow of SQL*Loader
1. Control file: it controls how data in the external data file is mapped into the Oracle table. 2. Input data file: a data file organized by the preprocessing module according to the attributes of the Oracle table. 3. Output of SQL*Loader: the loaded Oracle table and indexes; the bad file, composed of records whose format does not meet the requirements; the discard file, composed of records that do not satisfy the conditions in the control file; and the log file of the SQL*Loader run. SQL*Loader provides two kinds of data loading, the conventional path and the direct path, as the following figure shows. Conventional-path loading is the default in the Oracle database; its merit is the space it saves, but it has defects: first, it is not as fast as direct-path loading, and second, the loading operation competes with other processes running in the database. Because there are few update and delete operations in IMIL, the waste of database space is
not very large when using direct-path loading. The data loading speed requirement in IMIL is very demanding; therefore, we choose direct-path loading in the IMIL system.
Fig. 4. Usual path data loading and direct path data loading
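To make this concrete, the sketch below shows what a SQL*Loader control file and a direct-path invocation for one buffer file might look like. It is an illustrative example only: the file names, table name, and column list are hypothetical and are not taken from the actual IMIL schema.

-- session.ctl: hypothetical SQL*Loader control file for one buffer file
LOAD DATA
INFILE      'session_buffer_001.dat'    -- data file written by the preprocessing module
BADFILE     'session_buffer_001.bad'    -- records whose format is inconsistent
DISCARDFILE 'session_buffer_001.dsc'    -- records failing the conditions in the control file
APPEND
INTO TABLE session_tmp
FIELDS TERMINATED BY ','
TRAILING NULLCOLS
( src_ip     CHAR,
  dst_ip     CHAR,
  start_time DATE "YYYY-MM-DD HH24:MI:SS",
  content    CHAR(4000)
)
-- Invocation with direct-path loading enabled (shell command):
--   sqlldr userid=imil/secret control=session.ctl log=session.log direct=true

With direct=true, SQL*Loader formats data blocks itself and writes them above the table's high-water mark instead of issuing conventional INSERT statements, which is what gives the direct path its speed advantage.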
Although SQL*Loader provides high-performance data loading, it still cannot meet the performance needs of our application environment. The main problem is that many indexes, including B-tree indexes and text indexes, have been created on the data tables to support fast query and analysis. This causes two difficulties. First, when data are inserted into a table, the database must detect conflicts between the new data and the existing index entries and at the same time update all indexes of the table. This repeated index maintenance consumes a great deal of time; when the table has a text index, updating the index can take far longer than loading the data itself. Second, the table's indexes cannot be used while SQL*Loader is loading, which delays users who access the table during that period. We therefore add an index management mechanism and an exchange partition mechanism to data loading with SQL*Loader.
3.2.2 Index Management and Exchange Partition
The usual loading method has the restriction that the data cannot be accessed until the import has completed. RT-IMIS, however, imports data continuously while queries and analyses are carried out at the same time; if queries were blocked during import, the availability of the system would suffer. So the index management and exchange partition mechanisms are applied while the data are loading.
Oracle 10g supports table partitioning, which is well suited to very large databases such as IMIL. First, partitioning reduces the impact of loading on concurrent queries: exchanging a partition is analogous to swapping a removable hard disk, with the partition playing the role of the disk. Second, because the tables in IMIL are very large, partitioning improves query performance through mechanisms such as partition pruning, which skips partitions irrelevant to a query. Third, it makes the table easier to administer: operations such as deleting and backing up data become simpler, since the large table is divided and each partition can be handled separately. Last, partitioning improves the availability of the system by localizing faults to a partition and reducing recovery time.
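As an illustration of such a layout (a hedged sketch only: the table, column, and partition names are hypothetical), a monitoring table partitioned by day could be declared as follows:

CREATE TABLE session_info (
  src_ip     VARCHAR2(15),
  dst_ip     VARCHAR2(15),
  start_time DATE,
  content    VARCHAR2(4000)
)
PARTITION BY RANGE (start_time) (
  PARTITION p20061023 VALUES LESS THAN (TO_DATE('2006-10-24', 'YYYY-MM-DD')),
  PARTITION p20061024 VALUES LESS THAN (TO_DATE('2006-10-25', 'YYYY-MM-DD'))
);

-- A new day's partition is simply appended when needed:
ALTER TABLE session_info
  ADD PARTITION p20061025 VALUES LESS THAN (TO_DATE('2006-10-26', 'YYYY-MM-DD'));

Queries restricted to a date range then touch only the relevant partitions, which is the partition-pruning effect described above.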
To the user, a partitioned table is a set of segments, and segments can be added and dropped as needed. A partitioned table can exchange data with an ordinary table in the database if their structure, column types, and keys are consistent; after the exchange, the partitioned table owns the data that was in the table. The exchange is carried out purely in the database's data dictionary and no data are physically moved, so it is fast and does not disturb other DML (Data Manipulation Language) operations. Taking the user Session table, which records communication session information, as an example, the data loading process based on SQL*Loader and partition exchange is shown in Figure 5.
Fig. 5. The process of data loading
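Expressed in Oracle SQL, the three steps enumerated below amount to roughly the following (a minimal sketch: the table, index, and partition names are hypothetical, and the SQL*Loader invocation is shown only as a comment):

-- Step 1: load the buffer file into a freshly created temporary table
CREATE TABLE session_tmp AS
  SELECT * FROM session_info WHERE 1 = 0;   -- same structure, no rows
-- sqlldr userid=imil/secret control=session.ctl direct=true   (loads session_tmp)

-- Step 2: create on the temporary table the indexes the partition needs
CREATE INDEX session_tmp_time_idx ON session_tmp (start_time);

-- Step 3: add a new partition and swap the loaded table into it
ALTER TABLE session_info
  ADD PARTITION p20061026 VALUES LESS THAN (TO_DATE('2006-10-27', 'YYYY-MM-DD'));
ALTER TABLE session_info
  EXCHANGE PARTITION p20061026 WITH TABLE session_tmp
  INCLUDING INDEXES WITHOUT VALIDATION;     -- dictionary-only operation, no data movement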
The data loading process for the Session table includes the following steps:
1. SQL*Loader loads the data from the data buffer file into a temporary Session table.
2. Indexes are created on the temporary Session table.
3. A new partition of the Session table is created, and the temporary table is exchanged with the new partition.
3.2.3 Parallel Execution of Data Loading
To meet the real-time loading requirements of the Internet monitoring information system, data loading must be highly parallel. Parallelism exists not only across the Oracle 10g databases (each RAC runs one Oracle 10g database, matching the hardware), but also within each loading task itself. To let the loading tasks of the individual databases run fully in parallel, the data dispatched by the IMIL system to the different databases should be made as independent as possible. There are 10 RACs in the IMIL system, with 10 Oracle 10g databases installed on them, and more than 100 tables. Through correlation analysis of the tables, we sort them into 10 classes and store one class in each database. In this way the degree of parallelism across loading tasks approaches 10, and data transfer among the 10 databases during query and analysis is greatly reduced.
Within each RAC, which is composed of 6 nodes, there is also parallelism among nodes. The ideal degree of parallelism is 6, but because the 6 nodes are connected to one RAID through the SAN and share a single storage system at the bottom, there is resource competition and lock waiting. When all 6 nodes load at the same time, the Oracle monitoring output observed during testing shows problems such as lock waits that degrade performance.
The tests show that the optimal number of parallel loading nodes is 4 (detailed test data and analysis are given in Figure 8).
Within each loading task there are two further levels of parallelism: parallelism of data loading among different tables, and parallelism among the sub-tasks of each table's load, organized as a pipeline. In the IMIL system, after the data transformation module has converted the data into the format required by the database tables, it writes the data for each table into a separate data file; each table has its own SQL*Loader process reading its own data file, which ensures parallel loading across tables. Each table's loading task is divided into three sub-tasks: the data loading module creates a temporary table for a data file and loads the data into it; the index constructing module creates the indexes on the temporary table; and the exchange partition module creates a partition of the target table and exchanges the temporary table into it. These sub-tasks form a pipeline: a process is created for each sub-task, and each sub-task hands its result to the next and then moves on to the next data file. The parallel task management module coordinates the parallel execution of the sub-tasks.
Through this multi-level parallel dispatch (database level, node level, table level, and loading sub-task level), the loading performance of the IMIL system is greatly improved. The loading speed of each cluster increases from 220 million records per day to 1.2 billion per day, and the system achieves a peak loading speed of 6 TB of data when the 10 clusters run in parallel.
3.3 Optimization of IMIL System
IMIL is a very large database system: each database stores some ten terabytes of data, and each table is also huge. Many common operations, such as writing redo logs and gathering optimizer statistics, therefore become long-running operations that can significantly affect system performance. The system must be tuned to the actual situation.
3.3.1 Online Log
An Oracle instance records database operations in the Redo Log Buffer in the SGA, and the background process LGWR periodically writes the buffer's contents to the redo log files. When an Oracle database is created, it assigns two redo log files of 50 MB each to every instance by default. In IMIL, bulk loading uses the redo log heavily: when a redo log file fills up, the system switches to the next log file and then performs a checkpoint to synchronize the cache with storage. Monitoring the background log and trace files showed that, under the default settings and at a loading rate of 50 million records per hour, the system performed a checkpoint every 8 minutes. Frequent checkpoints cause heavy I/O, which severely degrades performance. We therefore increased the number of redo log files per database instance to eight, each of 512 MB, grouped in pairs. This effectively lengthens the interval between checkpoints to about 30 minutes and reduces the impact of redo logging.
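For illustration, this kind of redo log reconfiguration can be carried out with statements of the following form (a hedged sketch: the group numbers and file paths are hypothetical; in a RAC environment the THREAD clause selects the instance whose log is being changed):

-- Add 512 MB groups of two members each for instance (thread) 1
ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 3
  ('/oradata/imil/redo03a.log', '/oradata/imil/redo03b.log') SIZE 512M;
ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 4
  ('/oradata/imil/redo04a.log', '/oradata/imil/redo04b.log') SIZE 512M;
-- ... groups 5 and 6 are added in the same way ...

-- Once the original 50 MB groups are inactive and archived, drop them
ALTER DATABASE DROP LOGFILE GROUP 1;
ALTER DATABASE DROP LOGFILE GROUP 2;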
3.3.2 Statistics Optimization
By default, the Oracle 10g database uses the cost-based optimizer (CBO) to build and choose query plans, and the CBO relies on statistics about tables and indexes. These statistics record the CPU and I/O resources consumed by each table and index and thereby guide the CBO's choices. The IMIL data tables carry many text indexes and other indexes, which makes gathering statistics expensive: for example, 'ANALYZE TABLE ... COMPUTE STATISTICS' needs about one day to analyze a large table with a text index (10 billion records). This is unacceptable for a real-time system such as IMIL. We therefore gather statistics for tables and indexes partition by partition and, using DBMS_JOB together with Oracle RAC, schedule the statistics task on a comparatively idle node in a timely fashion. This maintains good optimizer and index performance while reducing the impact of statistics gathering on on-line data management.
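A hedged sketch of this per-partition statistics strategy is given below. The schema, table, partition, and procedure names are hypothetical, and the job interval is only an example; DBMS_JOB can additionally bind a job to a particular RAC instance, which is how a relatively idle node can be chosen.

-- Gather statistics for a single partition instead of the whole table
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname     => 'IMIL',
    tabname     => 'TALKINGCONTENT',
    partname    => 'P20061023',
    granularity => 'PARTITION',
    cascade     => TRUE);       -- also gather statistics for the partition's indexes
END;
/

-- Schedule the task nightly with DBMS_JOB
DECLARE
  v_job BINARY_INTEGER;
BEGIN
  DBMS_JOB.SUBMIT(
    job       => v_job,
    what      => 'gather_partition_stats;',      -- hypothetical wrapper procedure
    next_date => SYSDATE,
    interval  => 'TRUNC(SYSDATE) + 1 + 3/24');   -- every day at 03:00
  COMMIT;
END;
/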
4 Performance Analysis of the IMIL System
In this section, we present various aspects of our performance tests on the IMIL system. All tests were carried out on the RT-IMIS application system, using one cluster of the hardware architecture shown in Figure 1: six nodes connected to a RAID through a SAN switch. Each node is an HP4640 server (Itanium II CPU, 1.6 GHz / 6 MB cache; 48 GB memory), the SAN switch is a Brocade Silkworm 3850, and the RAID is a StorageTek FLX240 with an overall capacity of 60 TB; the nodes, and the clusters, are interconnected through 1000 Mbps switches. The operating system is RedHat AS4, on which the Oracle 10g Release 2 database is installed.
All tests used the same data, namely real monitoring data from RT-IMIS. The table used is the TalkingContent table, which records user interaction information; its average record length is 784 B, and it has 5 indexes, one of which is a text index. A text index is characteristically slow to create. To avoid disturbing the test results, RT-IMIS performed no other work, such as query and analysis, during the tests. Two months' worth of data were already stored in the database at test time.
4.1 Bulk Data Loading and Exchange Partition
In this test we compare the performance of three modes: general data loading (inserting records one by one), bulk data loading, and bulk data loading with exchange partition. To keep the comparison fair, we use only one node for loading and do no parallel task scheduling. For the TalkingContent table containing 0 (empty), 10, 20, 30, 40, 50, and 60 days of existing data respectively, we record the time to load 2 million records; Figure 6 shows the results.
[Plot: runtime (minutes) versus the number of days of data already in the table, for general data loading, bulk data loading, and bulk data loading with exchange partition.]
Fig. 6. General Data loading, bulk data loading and exchange partition
This test shows that bulk data loading with exchange partition performs better than both general data loading and plain bulk data loading. An important further observation is that loading performance degrades as the amount of existing data grows under general data loading and plain bulk loading: in these two modes the existing index entries must be scanned when indexing the newly loaded data, so loading slows down as the data volume increases. The problem does not arise with the exchange partition mode, whose loading performance is independent of the scale of the data already in the database.
4.2 Performance of Parallelism Scheduling
The IMIL system exhibits multi-level parallelism at the database level, node level, table level, and loading sub-task level (as presented in Section 3.2.3). We next test the performance of parallelism scheduling at each of these levels.
4.2.1 Parallelism Among Loading Subtasks
To compare parallel and serial execution of the loading sub-tasks, we use a single node and measure the loading time for different data volumes; Figure 7 shows the results. From Figure 7 we can see that parallelism among sub-tasks speeds up loading when large volumes of data are loaded.
[Plot: runtime (minutes) versus the number of records loaded (millions), for serial data loading and parallel loading among subtasks.]
Fig. 7. Performance Test of Loading Subtask Parallelism
4.2.2 Parallelism Among Nodes
We now test the impact of node-level parallelism within a RAC on the loading performance of the system. We load 2 million records using
1, 2, 3, 4, 5, and 6 nodes in turn and record the loading time; Figure 8 shows the results. The system reaches its best loading performance with 5 nodes, but at the cost of more resources, and with 6 nodes performance drops because of excessive network communication and database lock waiting. Because the improvement from 4 to 5 nodes is slight while the resource consumption rises noticeably, we use 4 nodes for parallel loading in IMIL.
[Plot: runtime (minutes) versus the number of nodes.]
Fig. 8. Performance Test of Node Parallelism
5 Conclusions
The Real-time Internet Monitoring Information System (RT-IMIS) monitors Internet traffic in real time, manages network security, and collects a large volume of real-time Internet information. RT-IMIS not only records the monitoring data into a database, but also provides analysis and query functions over on-line data, including monitoring of network anomalies, tracing the origins of network attacks and viruses, and querying and analyzing up to one year of on-line data. These characteristics pose a great challenge to data loading.
Based on the requirements of the Internet monitoring information system, we presented IMIL (Internet Monitoring Information Loader), the real-time data loading component of RT-IMIS. IMIL consists of an extensible, fault-tolerant hardware architecture, an efficient bulk loading algorithm built on SQL*Loader and the exchange partition mechanism, optimized parallelism, and guidelines for system tuning. Performance studies show the positive effects of these techniques: the loading speed of each cluster increases from 220 million records per day to 1.2 billion per day, and the system achieves a peak loading speed of 6 TB of data when 10 clusters run in parallel.
Looking ahead, we will continue our collaboration on RT-IMIS. With the data loading phase under control, we will turn our attention to data query and analysis; given the massive amount of data in the database, query and analysis performance is critical to RT-IMIS. Furthermore, data loading and querying run together in RT-IMIS and share the same system resources, so when resources run short (for example, when the loading rate reaches its peak and consumes nearly all resources), how to manage system overload is another problem to be studied.
Blogs in American Academic Libraries: An Overview of Their Present Status and Possible Future Use
Fu Zhuo
James C. Kirkpatrick Library, Central Missouri State University, Warrensburg, Missouri 64093, U.S.A.
[email protected]
Abstract. Academic libraries have embraced a new media tool, the weblog or blog, for various purposes. However, some librarians have raised concerns. Are academic library blogs overly personalized and often opinionated journals? Are they constantly dynamic, challenging, newsworthy, and entertaining? How often can they be updated to add and share new content in a timely manner? How effectively are they used to market a library's resources and to publicize updates to the library's resources and services? Who controls the content so that interesting opinions and controversial issues can be presented thoroughly and fairly? What is the next step for academic library blogs? This article summarizes the value that academic librarians' blogs bring to their academic communities. Although it cannot cover all developments in the weblog phenomenon, the article aims to discuss the issues that academic library blogs face and their potential future in academic and research libraries. Keywords: Blog, weblog, academic library.
defined as "websites that are easily created and updated by those with even a minimum of technology know-how... And it's not just text. Blogs can display pictures and video, include audio and flash, and even store other files like PowerPoint presentations or Excel spreadsheets for linking" [p.10]. RSS, standing for Real Simple Syndication or Rich Site Summary, uses software called an "aggregator" or "newsfeed collector to check the sites one subscribes to, usually every hour, and it collects all the new content into a folder that is just waiting for you to come and read it" [p.13]. Both blogs and RSS are reshaping the way students and teachers use the Internet: they are no longer mere readers but writers to the Web as well. Blogs and RSS make it easier to filter and track the ever-growing volume of resources and information. As the blog gains attention and momentum, librarians have begun to examine the effectiveness of using blogs. Joyce Valenza, winner of the Best Library/Librarian Blog award in 2005, hosts a very popular blog for librarians [3]. However, not all libraries or institutions have had successful blogging stories; oftentimes, people find academic library blogs not very dynamic, challenging, or entertaining. So, what are the major problems with library blogs? What are the elements of a good blog? Should every academic library create a blog? And if an academic library launches a blog, how can it be maintained effectively so that it meets the expectations of library staff and users?
2 Blogs in Education
Ever since the blog was created, educators have quickly seized the opportunity to use it to enhance both traditional and distance learning. In general, blogs are used from middle school through higher education. Various reports and literature about the use of blogs in education have appeared in professional publications, covering topics such as student interaction, interactive pedagogy, classroom activities, teaching critical thinking, distance learning, and faculty collaboration. The potential for reaching an audience reshapes the way students approach writing assignments, journal entries, and online discussions. Carlson [4] reported on the move of college professors in the United States to use blogs as a supplement in courses in creative writing, marketing, economics, and other subjects. The author presented the example of Quinnipiac University, where some professors tried to incorporate blogs into creative writing classes: students were asked to add new entries after an assignment was posted, and to read and comment on their peers' blogs. Beeson [5] summarized the use of weblogs in U.S. journalism programs in 2005. The author reported that the Annenberg School for Communication of the University of Southern California (USC) in Los Angeles began using blogs in its online journalism course in fall 1999; the content of the blog consists of both mainstream media and blog coverage. At the Cronkite School of Journalism and Mass Communications of Arizona State University, instructors teach online media and advanced editing courses, both using blogs. Though the blog is used only for discussion rather than as a true publishing outlet, the author still believes it gives students a more realistic view of the profession and is more in tune with the way of thinking of the younger generation.
The use of blogs is not limited to writing courses. Selingo [6] reported blogs being used by second-grade students in Frederick County, Maryland, to attack tough math problems and to post observations about their science experiments and art projects. Quible [7] also provided a success story of blogs being used in the business and educational worlds: blogs and the teaching of business communication courses can be a perfect fit for helping students improve their writing skills. The advantages of using blogs, as the author summarized them, were that students became familiar with a new web tool with powerful communication capabilities and that students' writing and class information were shared electronically. Richardson [8] described the creation of a literary blog by English literature students at Hunterdon Central Regional High School (HCRHS) in Flemington, New Jersey. Their blogs have been successful as an online forum for classroom discussion and as a way to develop students' critical thinking, writing, and reading comprehension skills. Risinger [9] discussed the use of weblogs in the social studies classroom; those blogs are interactive because teachers pose questions and ask students to respond. Nardi, Schiano, Gumbrecht, and Swartz [10] reported that blogging is sometimes viewed as a new, grassroots form of journalism and a way to shape democracy outside the mass media and conventional party politics. The authors report the results of an ethnographic investigation of a sample of ordinary bloggers and concluded that blogs combine the immediacy of up-to-the-minute posts, with the most current information presented first, and a strong sense of the author's personality, passions, and point of view. Blogs can benefit teacher professional development too. Shaffer, Lackey, and Bolling [11] reported that blogs served as a venue for group activities and discussions and helped nursing teachers keep up with changes in their instruction, rapid developments in technology, and the diversity of the student population: the blog "enabled us to share our thoughts, ideas, and approaches in an open, non-threatening milieu" [p.127]. Clyde [12] discussed how some educational blogs have been useful sources of professional development information for teachers and teacher librarians seeking to enhance their skills; to further help readers, the author provided a list of useful and popular educational blogs for anyone interested in learning more. Dickey [13] pointed out a surprising advantage of this new media tool: alleviating learners' feelings of isolation, alienation, and frustration. The author found that weblogs may provide a wider range of tools for bridging learners' feelings of isolation. These findings are significant because academic libraries receive a growing number of virtual reference questions from remote users.
3 Blogs in the Academic Libraries With more success stories of blogs being utilized in education, the application of blogs in academic libraries also increases. Blogs are used to communicate with library staff as an internal journal, to update what is going on in public relations, and to even publish newsletters. Although some libraries do not favor a separate use policy for the blog, relevant guidelines have been created to regulate and control the operation of the blog in most academic libraries.
Perhaps the first and easiest way to use the blog is to take advantage of its media function. Glenn [14] discussed blogs launched and run by scholars and academics in order to muse aloud about their research. Albanese [15] reported that the University of Minnesota library launched a free blog system in April 2004; many professors indicated they would use the blog to encourage discussion and debate. The copyright and related rights to blog content were owned by the authors who contributed to the blog or replied to a post, and a mechanism allowed the authors to license the content of their blogs through a Creative Commons license. Similar success stories were echoed by other academic libraries. Blair and Cranston [16] recorded their experience of setting up a blog in the Colorado State University libraries; editorial and publication guidelines and evaluation criteria were established before the blog was launched. The University of Michigan library jumped on the blog bandwagon and launched mBlog, which offers free blogs to students and employees. Apart from serving present students and employees, the blog is also made available to alumni and friends; a guideline is in place, and service and technical FAQs have been set up to help users solve potential problems they may face when using mBlog. Caldwell [17] discussed the importance of blogs to libraries, including the advantages and disadvantages of blogs for the library community. The author cited two university library blogs, Warwick University in the UK and the University of Minnesota in the US, as examples of successful cases of using blogs to promote a campus culture of intellectual freedom. Reichardt and Harder [18] discussed the history and background of weblogs and blogging software. They found that very little literature had been published in scientific and technical publications on subject-specific blogs. The authors compared the advantages and disadvantages of email and blogs and pointed out that "blog postings and comments, unlike email, create a time-stamped and often categorized trail of discussion that can be easily searched and browsed so that former mishaps need not occur" [p.110]. In addition to subject blogs, Reichardt and Harder also mentioned reference work blogging; the reference desk blog is one alternative that can certainly help keep reference team members informed of any developments at the reference desk. Walsh and Daniels [19] mentioned the use of blogs at the Georgia State University Library and suggested sources for locating blogs, such as LISFeeds, Feedster, and NewsIsFree. In their opinion, blogs could be considered part of the scholarly communication system in academic institutions: a locally created blog could be targeted to a specific readership and could provide current and relevant information to a library's community. Vogel and Goans [20], also from the Georgia State University Library, described creating a blog to deliver information about library news, services, and resources to science faculty and students; issues and challenges were addressed to ensure that it would be a viable and successful news-delivery system compared to the static newsletter. Bell [21] also affirmed the potential of blogs for promoting library services to the community, but raised the question of how to compel readers to return regularly to the library blog. The author suggested that the library blog should be located where the readers are.
One way to do that effectively is through courseware such as Blackboard, WebCT, Angel, and Moodle.
4 Blogs in Library Instruction
Coulter and Draper [22] explored how a blog was used in library instruction for an information literacy class. To identify the effectiveness and role of the blog, the researchers surveyed students in the library instruction class and found that many students were not aware of the blog the library had. They concluded that faculty collaboration with a librarian was the key to the blog's success. Ramsay and Kinnie [23] wrote about their experiments in outreach at the University of Rhode Island. Instead of luring students into the library building, their librarians reach outward to faculty, students, and distance learners by having a librarian serve as a teaching assistant, or embedded librarian, for online classes. Part of the embedded librarian's role is to establish a blog to promote relevant library resources. McKiernan [24] introduced the use of weblogs for the enhancement of library services. The author reported an increase in the number of libraries at educational institutions using blogs to promote library services, and pointed out that several academic libraries, such as the University of Alberta Library, the University of Tennessee libraries, the Ohio University Library in Athens, the Moraine Valley Community College Library in Illinois, and the University of Winnipeg Library, have all started to use blogs to promote instructional and reference services. The use of blogs has thus broadened from a simple feed of library news to library reference and instruction.
5 Evaluation of Blogs
As in the case of blogs in education, not all librarians or library researchers favor the creation and use of blogs in academic libraries. Bell [25] argued that academic library blogs are personalized and often opinionated periodicals; he rarely found academic library blogs that were dynamic, challenging, newsworthy, regularly updated, or entertaining. Bell suggested ways to set up a quality blog, such as finding a good model to follow, avoiding being too personal, updating regularly, and staying focused on what is important to users' success in the college and in the library. Bell's voice on unsuccessful blogs is not alone. Gorman [26], current president of the American Library Association, defined the blog as "a species of interactive electronic diary by means of which the unpublishable, untrammeled by editors or the rules of grammar, can communicate their thoughts via the web" [p.44]. His comments drew tremendous disagreement from librarian bloggers, the blogging community, and technophiles generally. Fialkoff [27] and Gordon [28] rebuffed Gorman and expressed concern that the ALA leadership's inability to embrace blogs might decrease ALA membership. Other concerns about academic blogs include ethical issues and the integrity of academic blogs. Since a blog is personalized and opinionated, negative comments and posts about certain technologies or library products may draw outrage from library vendors, who naturally advocate their products. In their interviews with library bloggers, Kenney and Stephens [29] raised this question of an ethical code for blogging in the library. The library blogger's responsibility is to bring
out the truth. Any comments or posts about a library technology product should be true; otherwise, the vendor of the product may ask the blogger to correct the information.
6 Searching Academic Blogs
The rapid growth and proliferation of academic blogs raises the question of how to find a good blog. Many blog software packages are available to help anyone who wants to set up a blog. Grimes [30] discussed a variety of Web search strategies and described Blogger, Blog Spot, Blogdex, MetaFilter, and UserLand as brand-name software in the blogging market. It is up to an individual academic library to decide whether to use free blog software like Blogger.com or to purchase a more sophisticated package. Blogger.com has a search function that allows its users to search for interesting blogs within its host; since it supports several major foreign languages, blogs in other languages can be found there too. Google, Yahoo, and MSN can also help locate academic library blogs. Bates [31] commented on the blog search option of the Google Web search engine (http://blogsearch.google.com) and pointed out one potentially significant limitation: Google indexes only the content in the feed rather than "much of the content of the blog" [p.17]. Pikas [32] pointed out that knowing the general structure of blogs and blog posts enables one to design a search more effectively, and that knowing the specific structure of Web pages produced by blogging software packages allows for effective field searching. The author indicates that many so-called blog search engines really index only feeds; feeds are XML versions of the blogs that can be read using an aggregator or incorporated into portals or Web pages. The author discussed blog search results from Google, Yahoo, MSN, and Ask Jeeves as well as some blog content search tools, such as Bloglines (www.bloglines.com), Feedster (www.feedster.com), PubSub (www.pubsub.com), Technorati (www.technorati.com), and Waypath (www.waypath.com). In addition to the above search engines, some library profession databases, such as Library, Information Science & Technology Abstracts from EBSCOhost and Library Literature and Information Science Fulltext from WilsonWeb, provide an alternative way to find successful library blogs. Another way to find academic library blogs is to browse blogs hosted by librarians; for example, Amanda Etches-Johnson [33], a reference librarian at Mills Memorial Library, McMaster University, compiled a list of library blogs, including academic library blogs (http://www.blogwithoutalibrary.net/links/index.php?title=Welcome_to_the_Blogging_Libraries_Wiki).
7 Library Professional Association Blogs Professional blogs in the field of library and information science, such as LISnews (www.lisnews.com), Library Stuff (www.librarystuff.net), Handheld Librarian
(handheldlib.blogspot.com), and Peter Scott's Library News Daily blog (http://blog.xrefer.com/) keep us abreast of new developments in our field. The Laughing Librarian blog (http://www.laughinglibrarian.com/) provides library humor and related items. The Association of College and Research Libraries (ACRL) also hosts a remarkable blog (http://www.acrlblog.org/); its posted articles and comments incisively address the issues and challenges of today's academic libraries, and readers can browse posts on various concerns by category. The American Association of School Librarians and the Public Library Association have their own blogs too (http://blogs.ala.org/aasl.php and http://www.plablog.org), where news, comments, and concerns in their respective fields are posted regularly and conference information is listed as well. The Special Libraries Association (SLA) has created a blog (http://slablogger.typepad.com/infox) to serve its members; the information provided includes conference news, government information, current posts, comments, and archives. These library and information science community blogs provide a venue for readers to post messages and comments on issues that concern them.
8 Summary
More and more academic libraries have established blogs because blogs are an inexpensive, easy-to-use, and fun means of communication. Today, academic library blogs have even evolved from general to subject-specific ones. Whatever purposes academic libraries intend their blogs to serve (e.g., interaction between the library and its patrons, communication among library staff, marketing new library services and materials, or remote help in distance education), blogs certainly offer librarians an alternative way to learn what their patrons think about their resources and services, so that appropriate action can be taken to improve the library's services and collection development. What makes a good blog? After all, the libraries that pioneered the blog used it as an instrument to propagate information about the library and its resources. The potential of a library blog should not be underestimated: even though some blogs fail to reach the user population, this useful tool for marketing and promoting the library deserves further exploration. The major difficulty that some librarian bloggers experience in reaching their users is that they have little control over the content and design of the blog; the challenge for library blogs is to create more interesting, newsworthy, and entertaining content that appeals to readers. Faculty collaboration is another key to the success of library blogs: if faculty require the use of the blog and promote this media tool, the visibility of the library blog will surely increase. Furthermore, the promotion and marketing of a library blog should be continuous, and the blog link should be permanent. Only if these efforts are made can the library blog become another common means for academic libraries to enhance their operations. If blog content is kept current, substantial, and entertaining, if posts and comments are spell-checked, and if the issues addressed are focused and challenging, then academic blogs will grow in prevalence and influence.
References
1. Rainie, L.: The State of Blogging. 2006
2. Richardson, W.: Blogging AND RSS--the "what's it?" and "how to" of Powerful New Web Tools for Educators. MultiMedia & Internet@Schools, 11 (2004) 10-13
3. Blog Awards Announced. Information Searcher, 16 (2005) 2-8
4. Carlson, S.: Weblogs Come to the Classroom. Chronicle of Higher Education, 50 (2003) A33
5. Beeson, P.: Bringing Blogs into the Classroom. Quill, 93 (2005) 27-29
6. Selingo, J.: In the Classroom, Web Logs are the New Bulletin Boards. New York Times, 153 (2004) G7-G7
7. Quible, Z. K.: Blogs and Written Business Communication Courses: A Perfect Union. Journal of Education for Business, 80 (2005) 327-332
8. Richardson, W.: New Jersey High School Learns the ABCs of Blogging. T H E Journal, 32 (2005) 40-40
9. Risinger, C. F.: Using Blogs in the Classroom: A New Approach to Teaching Social Studies with the Internet. Social Education, 70 (2006) 130-132
10. Nardi, B. A., Schiano, D. J., Gumbrecht, M. et al.: Why We Blog. Commun. ACM, 47 (2004) 41-46
11. Shaffer, S. C., Lackey, S. P., Bolling, G. W.: Blogging as a Venue for Nurse Faculty Development. Nursing Education Perspectives, 27 (2006) 126-128
12. Clyde, L. A.: Educational Blogging. Teacher Librarian, 32 (2005) 43-45
13. Dickey, M.: The Impact of Web-Logs (Blogs) on Student Perceptions of Isolation and Alienation in a Web-Based Distance-Learning Environment. Open Learning, 19 (2004) 279-291
14. Glenn, D.: Scholars Who Blog. Chronicle of Higher Education, 49 (2003) A14
15. Albanese, A.: UM Library Offers Free Blogs. Libr. J., 129 (2004) 18-18
16. Blair, J., & Cranston, C.: Preparing for the Birth of our Library BLOG. Computers in Libraries, 26 (2006) 10-54
17. Caldwell, T.: Pen a Blog Buster. Information World Review, (2005) 16-17
18. Reichardt, R., & Harder, G.: Weblogs: Their Use and Application in Science and Technology Libraries. Science & Technology Libraries, 25 (2005) 105-116
19. Walsh, W., & Daniels, T.: Biz of Acq -- Issues in Scholarly Communication: Creating Your Own Blog. Against the Grain, 17 (2005) 85-88
20. Vogel, T. M., & Goans, D.: Delivering the News with Blogs: The Georgia State University Library Experience. Internet Ref. Serv. Q., 10 (2005) 5-27
21. Bell, S.: Where the Readers are. (Cover Story). Libr. J., 130 (2005) 8-8
22. Coulter, P., & Draper, L.: Blogging it into them: Weblogs in Information Literacy Instruction. (2006)
23. Ramsay, K. M., & Kinnie, J.: The Embedded Librarian. Libr. J., 131 (2006) 34-35
24. McKiernan, G.: This just in. Knowledge Quest, 33 (2005) 38-41
25. Bell, S.: The Library Blog: Innovative Idea Or Wasted Words. Library Issues, 26 (2006) 1-4
26. Gorman, M.: Revenge of the Blog People! Libr. J., 130 (2005) 44-44
27. Fialkoff, F.: The Power of Blogs. Libr. J., 130 (2005) 8-8
28. Gordon, R. S.: Revenge of the NextGen People. Libr. J., 130 (2005) 78-78
29. Kenney, B., & Stephens, M.: Talkin' Blogs. (Cover Story). Libr. J., 130 (2005) 38-41
30. Grimes, B.: Expand Your Web Search Horizons. PC World, 20 (2002) 53
31. Bates, M. E.: Bringing the Blogosphere to the Masses. EContent, 28 (2005) 17-17
32. Pikas, C. K.: BLOG Searching. (Cover Story). Online, 29 (2005) 16-21
33. Etches-Johnson, A.: Welcome to the Blogging Libraries Wiki.
Rebuilding the Library OPAC
Wang Zhigeng
Department for Serials, E-resource & Audiovisuals, National Library of China
[email protected]
Abstract. Today's Chinese library users enjoy an environment of immediate information access and retrieval provided by web search engines with user-friendly web interfaces. In contrast, it is increasingly clear that traditional library OPACs are losing importance in this information context because they do not provide the same ease of use or access to as broad a range of information. Libraries need to respond to this situation by initiating measures to enrich their MARC-based catalogues and databases with much more information than is currently available. At the same time, libraries must address the need for a new generation of OPACs that offer significantly enhanced functionality, and they must integrate their OPACs with local systems and with external systems and applications. This paper offers some specific recommendations to assist libraries in identifying and implementing appropriate responses. Keywords: Library OPAC, Catalogue Enrichment, Functional Enhancement, System Integration, National Library of China.
1 Introduction
Over the years, libraries have developed many automated systems to manage and track their physical and electronic holdings and to make those resources visible and accessible to users. The Online Public Access Catalogue (OPAC) has been built as the central point of access to library information resources, and today almost any research library's OPAC can be accessed both on site and remotely. For libraries, the OPAC is a module of the Integrated Library System (ILS), which has provided a solid foundation for both collection management and resource discovery. Until recently the place of the OPAC within the service context of libraries had not been challenged. However, times are changing quickly. We have seen the release of Google Scholar, Google Book Search, Baidu Guoxue, Baidu Zhidao, and MSN Academic Live Search. This year Google started indexing Chinese journal articles from VIP Data and WANFANG Data (both Chinese journal aggregators). Last month Baidu announced a collaboration with the Peking University Library. Over the last several years, innovations from search engines have deeply affected the library as an institution and the librarian as a profession. The information landscape is increasingly fluid, and the role of the library OPAC in this context is being questioned. Information
retrieval is no longer limited to library-controlled resources. Users prefer information access via Google and Baidu rather than the library OPACs. In order to improve user success and satisfaction, this paper will look at the pressures the changing information environment is placing on the library OPAC, and will give some specific recommendations to assist libraries in identifying and implementing appropriate responses. The focus of this article is the research library environment in China. A case study of the National Library of China (NLC) experience within the changing information environment will be outlined.
2 Catalogue Enrichment
Today's users are increasingly conditioned by the web information-seeking experiences they gain from web search engines such as Google and Baidu. They expect quick and easy discovery of and access to information, and they enjoy quick results that help them decide on the utility of the information returned. In the library service context, these expectations underscore the need for catalogue enrichments that are normally not provided at present. Thanks to developments in information technology, a range of such information could easily become a regular feature of bibliographic databases. To cite the experience of the NLC, we have successfully implemented a suite of projects aimed at linking users, catalogues, and web resources, and we are also working on pilot programs to increase the content of the catalogue record itself and to link bibliographic descriptions to related electronic resources.
The Aleph500 Library System is an automated library system used to process library materials and provide public access to a database of NLC's holdings, including over 5 million books, journals, sound recordings, video materials, maps, manuscripts, government documents, and materials on order. NLC's OPAC is also called the NLC Catalogue. Aleph went live at NLC in October 2002, replacing a group of separate library information systems.
Following the IFLA Guidelines for Online Public Access Catalogue Displays (http://www.ifla.org/VII/s13/guide/opacguide03.pdf), we first focused on providing access to authority records in NLC's OPAC, because the authority record is not only a tool for librarians to organize the bibliographic catalogue by establishing forms of names, titles, and subjects, but also a very important tool for helping users find materials effectively. This program was launched in 2003 using the MARC 21 Format for Authority Data. To date, more than 600 thousand records for names of persons, corporate bodies, and titles of works can be browsed via the web, and 200 thousand records of subject headings for Chinese books are available for searching. The next step of this program is to create authority records for serials and e-resources. In the long term, we are considering building a Chinese authority file system on the web under the ideal of the Virtual International Authority File (VIAF).
We then focused on enriching bibliographic records with tables of contents (TOC) information, because such information greatly improves the retrieval function of the OPAC and improves the user's search experience in several ways. We started by creating TOC data manually, and this project is ongoing. More than 7
million TOCs have been created, and all these data can be accessed through hyperlinks in the bibliographic records in NLC's OPAC. To facilitate user searching, another area of catalogue enrichment activity is to provide holdings information for all of NLC's collections using the MARC 21 Format for Holdings Data. Currently more than 3 million items have holdings information in NLC's OPAC, including barcode, sub-library, location, call number, and item status. Holdings information for all new items is added during the acquisition process. The retrospective program for all backfiles of bound serials will be completed within 2 years, and the retrospective program for old monographs and book series will take 10 more years. The latest catalogue enrichment activity is to provide access to scholarly annotations and book reviews for particular acquisitions. This aims to provide a direct link from the bibliographic record to the web TOC data posted by online bookshops and e-book aggregators. The program will also enable users to browse online cover images, first chapters and excerpts, TOCs, and author notes, and to view full images of book content if available. We are planning to expand in this way to include all types of non-physical items, such as e-books, e-journals, e-newspapers, and open access repositories. A new project will examine how NLC's OPAC can enhance access by linking bibliographic records directly to full-text electronic resources, including e-books and e-journals. By doing so, NLC's OPAC will remain the central, reliable source of information about the library's information resources, in all formats and regardless of their ownership or licensing.
3 Functional Enhancement
The increasing availability of access to information via the Internet has transformed the expectations of web users as well as their service preferences. The range of functionality offered by search engines such as Google and Baidu has much to do with this transformation. Web search engines have a number of features that are not typically found in library OPACs but that the next generation of OPACs should incorporate, because they add value to the experience of users seeking information. These functions include relevance-ranked output, natural-language entry, automated mapping to controlled vocabulary, spelling checks, similar pages, user popularity tracking, and browsing. The "search full view books" feature recently implemented by Google Book Search includes full-text searching of books as part of the regular keyword search; this will raise users' expectations of library catalogues even further. Libraries need to fold some of the best features of search engines into the best aspects of the OPAC. NLC has taken some of these new features into account, such as keyword searching, browsing by title/author/classification, and sorting options. The new OPAC of NLC also offers authorized users options to reserve materials of interest and to renew loans remotely via the web. In the near future we will add links to the OPAC that let users download citations into reference managers such as RefWorks or EndNote. We will also embed COinS (Context Object in Span) into OPAC search results, which will open new ways to link OPAC users to the content they need and desire.
It is clear that the new generation of OPACs should offer the significantly enhanced functionality mentioned above. But there are two even more important capabilities that the next generation of OPACs should offer: 1) pushing content out with technologies such as alerting services, blogs, and RSS feeds, which can provide a current awareness service to end users; and 2) being discoverable through the techniques of SEO (search engine optimization), which means optimizing catalogues to make high-value content more accessible to web search engine users. The next version of NLC's OPAC will enable the catalogue to be indexed by Google and other search engines: we will create mechanisms to extract bibliographic records from NLC's catalogue into a format that allows effective search engine crawling, and we will create back-links from search engine results into NLC's OPAC.
4 System Integration
Libraries have for years developed the Integrated Library System (ILS) to manage and track their physical holdings and to make those resources visible and accessible to users. With a large amount of electronic resources now available through the Internet, users no longer depend on OPACs to reach those resources; rather, they have turned to web search engines like Google and Baidu. To manage and track electronic information, several new kinds of library systems have appeared in the last several years: federated search systems, which give libraries an agent interface to their electronic resources and allow users to search multiple resources simultaneously; OpenURL link resolvers, which provide an infrastructure for context-sensitive linking among distributed electronic resources; and electronic resource management (ERM) systems, which assist the library in collecting and tracking details related to electronic resource subscriptions. These new systems enable a single search across a large number of A&I databases and full-text resources, providing comprehensive retrieval on a topic and leading users to the appropriate material. In the big metadata era (a Googlized landscape), these systems must no longer deal solely with the internal workflows of acquisitions, cataloguing, circulation, serials, and OPACs, but must be compatible and interoperable with other internal systems and, more importantly, with external systems. Libraries must integrate with other services, such as virtual learning, distance education, or web searching. The future library interface to information will be based on traditional OPAC strengths but presented in a much simpler and more user-friendly format, supplemented with rich navigation and discovery tools.
NLC has implemented Metalib as a library portal and SFX as a local OpenURL link resolver, in addition to the Aleph500 system. NLC also has other local systems and applications, including online catalogues such as the Union Catalogue and the thesis repository; information delivery systems such as IDP and VOD; storage management systems such as web archiving systems; and an access management system. NLC provides Virtual
Reference Service and Interlibrary Loan / document delivery through separate systems, and it also plans to implement an ERM system and a DRM system. NLC has integrated its Web OPAC (Aleph) with its portal (Metalib) and link resolver (SFX) using Z39.50 searching and OpenURL technology, which provides a seamless linking environment for end users seeking information. NLC has made its OPAC a resource for the library portal system: users can issue an OPAC search through the NLC Portal (Metalib) simultaneously with the same search in other A&I databases and full-text resources, and view the integrated results from the different systems in the same format. NLC has also made its OPAC a source and a target for the OpenURL link resolver: by clicking the link resolver (SFX) button on the result display screen, users obtain a service menu offering extended services including full-text linking, citation searching, web searching, virtual reference services, and direct email messages.
Search engines are becoming another channel through which users discover, and subsequently want to use, their affiliated library and its resources and services. As these tools access more and more library content, they will become the primary access point to information for the majority of users, so it is more important than ever for libraries to focus on users, who demand information in many formats and through many channels. Google Scholar provides a simple way to search broadly for scholarly literature: users can search across many disciplines and sources from one place and get ranked results, and when a user finds an interesting abstract or citation, in many cases the complete document is available through the user's library. Working with libraries to determine which journals and papers they subscribe to electronically, Google Scholar can show library links via OpenURL technology. NLC has joined the library links program, which will help our users move from Google results to the resources and services provided by NLC, and we are working with Microsoft to provide the same service. We will make all our metadata systems OpenURL-compliant and OAI-compliant. By doing so, we will understand where we stand in the context of our users and deliver the right content and services to them. As a national repository, we organize China's recorded knowledge and culture, deliver the right content experiences for our users, and improve the quality of questions. We serve many communities in China and overseas. The activities described above respond to web users who want more content and better functionality; through them, we can serve the extended information needs of users, offer structured pathways to all collections, and thereby attract users to continue to value NLC's OPAC and other services.
Web Content Mining for Market Intelligence Acquiring from B2C Websites* Danxiang Ai1, Yufeng Zhang1, Hui Zuo2, and Quan Wang1 1
School of Information Management, Wuhan University, 430072, Hubei, P.R. China {smalldenver, yfzhang9, wangquan}@163.com 2 School of Economics and Management, Wuhan University, 430072, Hubei, P.R. China [email protected]
Abstract. Market Intelligence (MI) is information and knowledge relevant to an enterprise's market decision-making process. MI acquiring is a key activity for enterprises that want to maintain an advantage in fierce market competition. The rapidly developing Internet provides abundant information resources, but effective approaches and models for MI acquiring are still lacking. In this paper, we concentrate on MI mining based on B2C websites. We develop a specialized B2C website mining model by combining techniques of web mining, knowledge representation, data warehousing and metadata. We design a web content mining algorithm integrating several web mining methods, and perform digital camera sales experiments to validate it. Keywords: Market intelligence acquiring, Web content mining.
1 Introduction
Market Intelligence (MI) is accurate, current and usable information and knowledge about the market environment. MI acquiring is a key activity for corporations to make market decisions and maintain an advantage in fierce market competition. In the past, MI acquiring often involved a lot of legwork and reams of paper. Today, the Internet, with its easy access and vast data-storage capacity, enables researchers to assemble valuable information about customers, competitors and the market environment without leaving their PCs. MI acquiring from the Internet has attracted considerable interest in recent years; however, the Internet's rapid growth and dynamic structure also pose challenges to MI acquiring, and effective approaches and models for MI acquiring from the Internet are in fact lacking. A B2C (Business to Customer) website is an essential e-commerce pattern based on the Internet. It provides a platform for online trades and communication between enterprises and customers. Some B2C websites are built by enterprises themselves,
Supported by the National Natural Science Foundation of China (Grant No. 70573082).
while others are supplied by a third party that permits various enterprises to register and occupy a piece of web space for their sales, e.g., Amazon (www.amazon.com), Ebay (www.ebay.com) and Taobao (www.taobao.com). These websites, containing a great deal of useful information about sales, are good MI resources, so in this paper we put an emphasis on the mining model and methods of B2C websites for MI acquiring. In Section 2, we analyze the general structure and characteristics of B2C websites, based on which we develop a B2C website content mining model in Section 3. In Sections 4 and 5, we further study the algorithm and methods for data content mining and present a digital camera sales case to validate the algorithm.
2 Analysis of B2C Websites Structure and Characteristics 2.1 The General Structure of B2C Websites The general B2C website structure is made up of six basic types of webpages, which is illustrated in Fig. 1.
Fig. 1. This figure shows the general B2C website structure composed of six types of webpages
Product navigation webpages contain two kinds of navigation tools, a product category and a product search engine, to help customers get the product list they want. A product list webpage contains a list of products with some common characteristics; each item in the list is a summary of a product with a linkage to its details webpage. A product details webpage contains a detailed description of a specific product, and it has linkages to the shopping cart webpage, the seller's information webpage, and the sales record webpage. A seller's information webpage contains information about a specific seller. A sales record webpage contains records of past sales behaviors with regard to a specific product. The shopping cart webpage is where the customer confirms and fulfills his purchase. Different types of webpages contain different information about commodities and sales. We analyze webpage contents and identify three groups of information fragments useful for supporting market decisions:
Product information fragments describe product features, including product name, price, discount price, product class, transport fees, etc., and are usually located in product details webpages. Seller information fragments describe seller characteristics, including seller's name, location, registry date, credit degree, etc., and are usually located in seller's information webpages. Sale record information fragments describe sales behaviors and activities, including product name, seller's name, sale volume, sale date, etc., and are usually located in sales record webpages.
2.2 Characteristics of B2C Websites
By analyzing the general structure and contents of B2C websites, we conclude that B2C websites have two sides. On the one hand, B2C websites are business information centers. Compared to the entire Internet, B2C websites not only have abundant and centralized information contents about products, sellers and sales, but also have a relatively steady structure and function, similar webpage types and data types, and a controllable information volume, which makes it convenient to explore effective web mining methods for MI acquiring. On the other hand, webpages in B2C websites, like those in most other websites, are semi-structured and interconnected; their information contents are dynamic and distributed, and not easy for machines to understand and process. In order to acquire deep contents from B2C websites, we construct a specialized mining model by combining technologies of web mining, knowledge representation, data warehouse and metadata.
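As a concrete illustration of the three groups of information fragments described above, the following minimal sketch shows how they could be represented before being loaded into the warehouse tables of Section 5; the field names and types are illustrative assumptions rather than the paper's exact schema.

import java.time.LocalDate;

// Illustrative containers for the three information fragment groups.
record ProductFragment(String productName, String productClass,
                       double price, double discountPrice, double transportFee) {}

record SellerFragment(String sellerName, String location,
                      LocalDate registryDate, int creditDegree) {}

record SaleRecordFragment(String productName, String sellerName,
                          int saleVolume, LocalDate saleDate) {}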
3 Web Content Mining Model
We present the architecture of a model we developed for discovering and extracting valuable data contents by mining B2C websites. The model, composed of three functional modules, is illustrated in Fig. 2.
3.1 Web Data Content Mining Module
The motivation of this module is to extract information contents from website resources and transform them into analytical data in the data warehouse. Generally speaking, this module needs to finish the following tasks: (1) collecting and grouping information fragments from the websites, to form an original data set; (2) unifying conflicting data and removing repetitious data, to form a clean data set; (3) transforming and integrating data into a subject-oriented structure suitable for the data warehouse.
Fig. 2. This figure shows the architecture of web content mining model.
The module comprises three subsystems. The task manager subsystem helps users select data sources and build corresponding task plans. The task processing engine subsystem accesses data resources and executes the web content mining algorithm according to the task plan. The methods manager subsystem is in charge of registering and controlling the web mining methods of crawling, classifying, parsing, data reduction, data transformation and data modeling, which can be called by the task processing engine during the mining process. Since B2C website data are far more complex than relational database data, ordinary data preprocessing approaches are not sufficient in this module; various web mining methods are applied to identify and extract useful web contents and rearrange them into a structured format. We give more discussion of the mining methods in Section 4.
3.2 Data Warehouse Module
The data warehouse module fulfills data organizing, storing and managing, and offers tools for basic generating, computing and querying. It is composed of two subsystems: the data warehouse and the data warehouse manager.
The data warehouse subsystem is "a subject-oriented, integrated, time-variant, nonvolatile collection of data". It provides architectures to systematically organize and store data. The data warehouse manager subsystem is used to support and monitor the running of the data warehouse; it is responsible for maintaining and refreshing the data warehouse.
3.3 Metadata Module
Metadata describes the content, class, quality, condition, structure and other characteristics of data, so that web mining and data mining tasks can be defined and executed more conveniently and effectively. In our model, the contents of the metadata are as follows. The ontology of information fragments is a collection of terms and definitions relevant to information fragments in B2C websites, including the names and types of the information fragments, the relations between them, and the possible HTML tags relevant to information fragments and their positions in the webpage structure. The ontology of the business field is a collection of terms and definitions relevant to business, describing the entities and relations in business activities; in our model, this ontology mainly contains descriptions of the products and sellers registered in the B2C websites. The description of the B2C website includes the document structure models and analysis frames of the different kinds of webpages in B2C websites. The description of the data warehouse includes definition information about subjects, data cubes, fact tables, dimension tables and other temporary tables.
4 Web Mining Method for Market Intelligence Acquiring
The most crucial and difficult task in our web content mining model is to deal with the dynamic, semi-structured hypertext data and extract the contents we need automatically and precisely. We integrate several web mining methods and design a web content mining algorithm to fulfill this task.
4.1 Web Content Mining Algorithm
The algorithm, which combines several web mining processes of crawling, classifying, parsing, data reduction, data transformation and data modeling, is described as follows.
Algorithm: Web Content Mining
Input: B2C Website Domain Name bwdn, Product Select Conditions PSCs, WebPage Structure Models WPSMs, WebPage Analysis Frames WPAFs
Output: Multidimensional Data Model
1. Search(bwdn, PSCs) or Browse(bwdn, PSCs), to get Product List WebPage pl
2. Scan(pl), to get New WebPage URLs NURLs
3. Classify(NURLs, WPSMs), to pick out Product Details WebPages PDWPs
4. For each webpage pdwpi in PDWPs
   i. Save(pdwpi) into Collected WebPage Set CWPS and WebPage Relations Table WPRT
   ii. Scan(pdwpi), to get New WebPage URLs NURLs
   iii. Classify(NURLs, WPSMs), to pick out Sales Record WebPage sr and Seller's Information WebPage si
   iv. Save(sr) and Save(si) into CWPS
5. For each webpage cwpi in CWPS
   if cwpi is a Product Details WebPage
      a. PDWP-Parse(cwpi, WPAFs), to extract Product Information Fragments PIFs
      b. Save(PIFs) into Product Information Table PIT
   else if cwpi is a Seller's Information WebPage
      a. SIWP-Parse(cwpi, WPAFs), to extract Seller Information Fragments SIFs
      b. Save(SIFs) into Seller Information Table SIT
   else if cwpi is a Sales Record WebPage
      a. SRWP-Parse(cwpi, WPAFs), to extract Sales Record Information Fragments SRIFs
      b. Save(SRIFs) into Sales Record Table SRT
6. Reduce(PIT), Reduce(SIT), Reduce(SRT)
7. Transform(PIT), Transform(SIT), Transform(SRT)
8. DataModelBuild(PIT, SIT, SRT, WPRT), to build the Multidimensional Data Structure as a star schema
4.2 Crawling Method
The algorithm described above starts by crawling the B2C website, namely collecting relevant webpages. First, the crawler takes advantage of the navigation tools provided by the specified B2C website to fetch the product list webpage containing products that satisfy the select conditions defined according to the mining goals. Then the crawler scans the webpage to fetch new webpage URLs, and continues to scan the fetched webpages to seek new URLs in turn. During the course of crawling, some measures should be taken to control the scanning path. Firstly, the URLs to be visited should be restricted to the domain of the given B2C website; this controls the scanning scope and prevents the crawler from wandering to non-target websites. Secondly, newfound URLs should be checked to eliminate duplicates; duplicate webpages would result in circular scanning and repeated fetches, so it is necessary to skip already-scanned URLs. Thirdly, classifying methods are used to filter fetched webpages so that only the useful webpages are saved.
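The following is a minimal sketch of such a controlled crawl (domain restriction plus duplicate elimination); it uses Java's standard HTTP client and a naive regular expression for link extraction, and only illustrates the control measures above rather than the system's actual crawler.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScopedCrawler {
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public static Set<String> crawl(String seedUrl, String allowedDomain, int maxPages) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Set<String> visited = new HashSet<>();       // duplicate elimination
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seedUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            // restrict scanning to the given B2C website's domain
            if (!url.contains(allowedDomain) || !visited.add(url)) continue;
            HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
            String html = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (!visited.contains(link)) frontier.add(link);   // newfound URL
            }
            // a classifier would filter and save the useful pages here
        }
        return visited;
    }
}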
4.3 Classifying Method
Classifying is a key step in the algorithm. We use supervised learning paradigms for webpage classification. We listed six webpage types in Section 2.1 and use three of them as the webpage classes: product details webpage, seller's information webpage and sales record webpage. We design two classifiers for B2C webpage classification, a Bayesian classifier and an inductive classifier. The Bayesian classifier is a text classifier based on Bayes' theorem. Given a set of webpage classes C = {C1, C2, C3}, each class ci has an associated prior probability Pr(ci), with Σci Pr(ci) = 1; given a webpage wp, Pr(wp) is the prior probability of wp, and there is a class-conditional webpage distribution Pr(wp|ci) for each webpage class. Thus the conditional probability that wp is generated from class ci is calculated as follows:
Pr(ci | wp) = Pr(ci) Pr(wp | ci) / Pr(wp) .    (1)
Before the classifier can be used, it must be trained with a training data set. In the training webpage set, each webpage is marked with a label of its class name, and the classifier analyzes the training webpages to learn the features of each class. Once trained, the classifier receives a set of unlabeled webpages to test whether it can determine their correct types; if the accuracy is considered acceptable, the classifier is applied to new webpages and marks each with a class label. The Bayesian classifier is a plain text classifier that does not consider the semi-structured and hypertextual features of webpages, so an inductive classifier, as a supplement to the Bayesian classifier, is designed to classify B2C webpages depending on their HTML document structures. A well-formed HTML document is a nested hierarchy of regions represented by a tree-structured Document Object Model (DOM); HTML elements, segments of text, or hyperlinks to other webpages are internal nodes related to each other in the DOM tree. Furthermore, most B2C websites adopt dynamic web technology such as ASP or JSP: they generate instance webpages by filling a webpage template with content selected from a background database. This makes webpages of the same class have similar HTML structures, which is also helpful in supervised learning. For each webpage class, there is a document structure model stored as metadata, containing the HTML tags, hyperlinks, and their relations defined as inductive rules. Some special tags such as Title, Meta, H1, and so on are assigned heuristic weights in the model. These models are used by the inductive classifier to perform first-order logic induction to confirm webpage classes. The combination of the Bayesian classifier and the inductive classifier can greatly improve the accuracy of webpage classification, but it is inevitable that the class of some webpages cannot be ascertained; these pages are discarded to ensure the efficiency of the classification. All classified webpages are stored in CWPS (Collected WebPage Set). An example of CWPS is illustrated in Table 1.
Table 1. Example of CWPS
In CWPS, a WebPage ID is the unique identification of a webpage. Each webpage is labeled with a class name: PDWP (Product Details WebPage), SIWP (Seller's Information WebPage) or SRWP (Sales Record WebPage). Besides CWPS, the connections among webpages should be tracked during the course of classification as well. We use a WebPage Relations Table WPRT to store the connections (see Table 2). In the table, a product details webpage has a linkage to at least one seller's information webpage, and to zero or one sales record webpage.
Table 2. Example of WPRT
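To illustrate the Bayesian text classifier used in this step, here is a minimal multinomial naive Bayes sketch over word counts with Laplace smoothing; it is only an illustration of Eq. (1), since the actual features and training corpus of the system are not given in the paper.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaiveBayesPageClassifier {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWordsPerClass = new HashMap<>();
    private int totalDocs = 0;
    private final int vocabularySize;

    public NaiveBayesPageClassifier(int vocabularySize) { this.vocabularySize = vocabularySize; }

    // Train with a tokenized webpage labeled PDWP, SIWP or SRWP.
    public void train(String label, List<String> tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            totalWordsPerClass.merge(label, 1, Integer::sum);
        }
    }

    // Return the class maximizing log Pr(c) + sum of log Pr(token | c); Pr(wp) is a
    // constant across classes, so it can be ignored when comparing classes.
    public String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            double score = Math.log((double) docsPerClass.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWordsPerClass.getOrDefault(label, 0);
            for (String t : tokens) {
                int c = counts.getOrDefault(t, 0);
                score += Math.log((c + 1.0) / (total + vocabularySize));  // Laplace smoothing
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }
}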
4.4 Parsing Method
To parse a webpage means to understand its structure and contents and pick out useful information. Usually, parsing is a complex job in web mining, but we reduce the complexity by identifying the webpage's class beforehand. Since webpages of the same class have almost the same appearance and semantic structure, they can be processed with a unified pattern. For each webpage class, we build an analysis frame. The frames specify the kinds of information fragments and their possible locations in the different classes of webpages. The document structure model established in the webpage classification stage can be used here too: some HTML elements and their attributes in the structure are directly relevant to information fragments, and identification of these element tags helps locate information fragments quickly. The parser applies the frame to the relevant class of webpage to identify information fragments and their contents. The ontology of information fragments is also used here to better understand the semantics of HTML tags and values during parsing.
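A minimal sketch of applying such an analysis frame is shown below; the two HTML patterns are purely hypothetical examples of frame entries, since a real B2C site would need patterns derived from its actual page templates.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FrameParser {
    // An analysis frame maps fragment names to location patterns in the page.
    private static final Map<String, Pattern> PDWP_FRAME = new LinkedHashMap<>();
    static {
        PDWP_FRAME.put("ProductName", Pattern.compile("<h1[^>]*>([^<]+)</h1>"));
        PDWP_FRAME.put("Price", Pattern.compile("class=\"price\"[^>]*>([0-9.]+)<"));
    }

    public static Map<String, String> parseProductDetails(String html) {
        Map<String, String> fragments = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : PDWP_FRAME.entrySet()) {
            Matcher m = entry.getValue().matcher(html);
            if (m.find()) fragments.put(entry.getKey(), m.group(1).trim());
        }
        return fragments;   // candidate rows for the Product Information Table (PIT)
    }
}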
All information fragments picked out are categorized according to the classes of their source webpages and saved into different tables: the Product Information Table PIT, the Seller Information Table SIT, and the Sales Record Table SRT (see Section 5).
4.5 Reduction and Transformation Methods
Data reduction is used to eliminate data redundancy and reduce the data set to a proper size, so that mining is more efficient yet produces the same result. Data transformation is applied to consolidate data into unified forms appropriate for mining, which involves normalizing data formats, smoothing data values, aggregating data at certain granularities, and computing statistics over certain data attributes. Reduction and transformation methods are applied to the Product Information Table, Seller Information Table and Sales Record Table to form a more effective data set.
4.6 Multidimensional Data Modeling Method
The last step of the algorithm is to build a multidimensional data model. The multidimensional data model views data in the form of a data cube, defined by dimensions and facts. It is the most common method of data representation in a data warehouse and is helpful for analyzing data from various points of view. The star schema is a multidimensional data modeling paradigm which contains a large central table (the fact table) and a set of smaller attendant tables (the dimension tables), one for each dimension. Both the fact table and the dimension tables are organized around subjects: the fact table provides the basic features of the primary subjects while the dimension tables record other relevant or expanded subjects. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table. In our problem, the multidimensional data model is built by reorganizing the data in PIT, SIT and SRT, and the data relations among products, sellers and sales in the data model are generated according to WPRT (see Section 5).
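As a small illustration of this star schema, the sketch below models a central sales fact table keyed to product, seller and time dimensions, mirroring the SFT, PDT, SDT and TDT tables described in Section 5; the column names are illustrative assumptions rather than the system's exact schema.

import java.time.LocalDate;

// Dimension tables: one row per product, seller and date.
record ProductDim(long productKey, String productName, String productClass, double listPrice) {}
record SellerDim(long sellerKey, String sellerName, String location, int creditDegree) {}
record TimeDim(long timeKey, LocalDate date, int month, int year) {}

// Central fact table: each row references the dimensions by key and
// carries the measures (sale volume, sale amount).
record SalesFact(long productKey, long sellerKey, long timeKey, int saleVolume, double saleAmount) {}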
5 Experimental Result
We select Taobao as the source B2C website and aim to analyze Canon A-series digital camera sales in 2005. The quantity of the three classes of webpages collected from the website is shown in Table 3.
Table 3. Quantity of Three Classes of WebPages
After parsing the collected webpages, we get three tables (Table 4, 5, 6) containing valuable information contents. Table 4. Product Information Table
Table 5. Sales Record Table
Table 6. Seller Information Table
The above information contents extracted from the B2C website can be further organized into a multidimensional model containing a sales fact table (SFT), a product dimension table (PDT), a seller dimension table (SDT) and a time dimension table (TDT). The star-schema graph is illustrated in Fig. 3.
Fig. 3. This figure shows the star-schema graph of Canon A-series digital camera sales data model
6 Conclusions
In this paper, we concentrate on web content mining of B2C websites for MI acquiring. We developed a specialized B2C website mining model, which can be used to automatically extract valuable information contents and discover market intelligence. We study the architecture and mechanism of the model in detail, design a web content mining algorithm and analyze the multiple mining methods applied in the algorithm. Through the experiment on digital camera sales mining, we believe that the results are interesting and useful for B2C web content discovery and MI acquiring.
Design of Chinese Word Segmentation System Based on Improved Chinese Converse Dictionary and Reverse Maximum Matching Algorithm Liyi Zhang, Yazi Li, and Jian Meng Center for Studies of Information Resources, Wuhan University, 430072 {lyzhang, lyzi, jianmeng}@whu.edu.cn
Abstract. The increasing interest in cross-lingual and multilingual information retrieval has posed a great challenge for designing accurate information retrieval systems for Asian languages such as Chinese, Thai and Japanese. Word segmentation is one of the most important pre-processing steps of Chinese information processing. This paper reviews some popular word segmentation algorithms. Based on an improved converse Chinese dictionary and an optimized reverse maximum matching algorithm, a Chinese word segmentation system is proposed. Experiments are carried out to demonstrate the substantially improved accuracy and speed of the system. Keywords: Chinese word segmentation, Converse Chinese dictionary, reverse maximum matching algorithm.
compared; in Sections 3 and 4, we elaborate on the design and implementation of a Chinese word segmentation system, whuSplit, based on an improved Reverse Maximum Matching (RMM) algorithm and an enhanced converse dictionary. Sections 5 and 6 present the experimental analysis and conclusions.
2 Existing Segmentation Algorithms
Currently, the most common machine matching algorithms are the Minimum Matching and Maximum Matching algorithms. In the Minimum Matching algorithm, a two-character substring is split from the left of a string or sentence, and the substring is compared with the words in the pre-defined dictionary. If the dictionary contains the substring, the substring is a Chinese word, and the new pre-processed sentence starts from the third character of the previous sentence; the next two-character substring is then split and compared, and the process repeats until all characters of the sentence have been compared. If the dictionary does not contain the substring, the new pre-processed sentence starts from the second character of the previous sentence, and its leftmost two-character substring is split and compared. For example, the sentence which in English means "it is not successful to match" can be segmented as follows:
The first two-character substring is split and compared with the words in the pre-defined dictionary. Since the dictionary contains no equal word, the pre-processed sentence moves forward by one character and the next substring is split and compared. When a substring is found in the dictionary, it is split off as a word, and the process continues until the last substring has been compared. Table 1 demonstrates the steps.
Table 1. The steps of splitting the sentence
Table 1 shows the words contained in the sentence. Although the algorithm is fast, ambiguous segmentation happens often, so Minimum Matching is not widely used. For example, the following sentence will not be segmented properly; Table 2 demonstrates the steps of splitting it. We find that the sentence is segmented into words that differ from the correct segmentation. Such incorrect segmentation is called ambiguous segmentation.
Table 2. The steps of splitting the sentence
The second segmentation algorithm is Maximum Matching, which is widely used. This algorithm consists of the forward Maximum Matching (MM) and the backward Maximum Matching (RMM) algorithms. Let us first have a quick look at the MM algorithm. We suppose that the length of the longest term in the dictionary is n; in this paper n is equal to 5, which is the threshold. First, an n-character substring is split from the left of the pre-processed sentence and compared with the terms in the dictionary. If the dictionary has a term equal to the substring, the substring is segmented, and the rest of the pre-processed sentence starts from the (n+1)th character of the previous sentence; the process repeats until all characters have been compared. If the dictionary does not have a term equal to the substring, a new (n-1)-character substring is split from the left of the previous substring and compared with the terms in the dictionary. If the length of the substring becomes less than 2, the new pre-processed sentence starts from the second character of the previous sentence. Table 3 shows an example of segmentation using MM.
Table 3. The steps of splitting the sentence
Table 3 shows the words into which the previous sentence is segmented. Although the performance of the MM algorithm is better than that of the Minimum Matching algorithm, ambiguous segmentation still happens frequently. Taking the sentence which means "resolve the problem of a machine" as an example, we show the steps in Table 4.
Table 4. The steps of splitting the sentence
From Table 4 we find that the sentence is segmented into words that differ from the correct result, so ambiguous segmentation happens. Because of this, the RMM algorithm is more widely used. The Reverse Maximum Matching (RMM) algorithm is similar to MM, but RMM splits the substring from the right of the pre-processed sentence, and if the dictionary does not contain a term equal to the substring, the leftmost character is removed from the substring and the new substring is compared. Taking the same sentence as an example, Table 5 shows the steps.
Table 5. The steps of splitting the sentence
The results shown in Table 5 are correct. In Section 4, we propose an improved RMM algorithm.
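For reference, a minimal sketch of the basic RMM procedure described above is given below, using a plain word set and a maximum word length of 5 as assumed in this paper; the improved version with the converse dictionary and stopword handling follows in Sections 3 and 4.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class BasicRmm {
    // Segment a sentence by Reverse Maximum Matching against a word set.
    public static List<String> segment(String sentence, Set<String> dictionary, int maxLen) {
        List<String> words = new ArrayList<>();
        int end = sentence.length();
        while (end > 0) {
            int len = Math.min(maxLen, end);
            // shrink the candidate from the left until it is a dictionary word
            while (len > 1 && !dictionary.contains(sentence.substring(end - len, end))) {
                len--;
            }
            words.add(0, sentence.substring(end - len, end));  // prepend to keep original order
            end -= len;
        }
        return words;
    }
}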
3 The Structure of the Converse Chinese Dictionary
A traditional dictionary is usually constructed as pure text. If the data are not organized efficiently, the time complexity of comparison is O(n) (n denotes the number of words in the dictionary), and the number of comparisons in the worst situation is

l * n .    (1)
Here l represents the length of the pre-segmented text. If the index of the dictionary is organized effectively, the number of comparisons can be decreased significantly while the space complexity is not increased. For instance, words in the dictionary can be organized by their first character or their last character. The index of the existing converse Chinese dictionary is organized by the last character of each string, so that words ending with the same character are grouped under that character as the index. Taking a sentence as an example, we first split a substring from its right end and match the substring against the words indexed by its last character in the dictionary; if the substring is not a term, a shorter substring is split from the previous substring and compared with the words in the corresponding sub-dictionary, and the loop continues until all words have been split off. The advantage of this approach is its high efficiency, because the only thing that needs to be done in the matching process is to look up the sub-dictionary, and there are only about 6,000 index characters in our system, far fewer than the total number of words in the dictionary. The example also shows that each time the pre-compared string needs to be reversed to adapt to the converse Chinese dictionary, which is a disadvantage of this approach. Firstly, the converse dictionary does not comply with human thinking,
Fig. 1. Structure of Improved converse Chinese dictionary
which makes it difficult to maintain. Secondly, the string needs to be reversed each time it is compared, which is time-consuming and affects efficiency, and it does not match human thinking either. Figure 1 shows our improved converse Chinese dictionary. It is easy to understand because all of the terms in the dictionary are arranged in natural order. The dictionary in this paper contains 324,300 words of two, three, four and more than four characters. The word distribution and frequency are shown in Table 6.

Table 6. The words distribution and frequency

Term's character number | 2      | 3      | 4      | 5     | 6     | 7     | Aggregate
Term number             | 98559  | 98904  | 118247 | 5124  | 2446  | 980   | 324300
Proportion (%)          | 30.391 | 30.498 | 36.462 | 1.580 | 0.754 | 0.315 | 100
It is critical for machine matching segmentation to construct the dictionary properly. The proposed algorithm takes advantage of a hash table to store terms: the key of the hash table is taken as the index of the dictionary, and the value is the corresponding sub-dictionary. In our experiment, the segmentation system spends less than two seconds loading the dictionary; once loaded, the dictionary resides in memory and can be reused as many times as needed. The dictionary loading function is described as follows.

while (not end of the dictionary file) {
    read a term from the file;
    take the last character char of the term as the index;
    if (the hash table contains the key char) {
        get the sub-dictionary indexed by char;
        add the term to that sub-dictionary;
    } else {
        create a new sub-dictionary indexed by char;
        add the term to the new sub-dictionary;
        add the sub-dictionary to the hash table;
    }
}
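A self-contained sketch of this structure and of how a candidate substring would be looked up in it is given below; the class and variable names are assumptions for illustration, since the paper does not give the system's actual interfaces.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ConverseDictionary {
    // Sub-dictionaries indexed by the last character of each term.
    private final Map<Character, Set<String>> dicHash = new HashMap<>();

    public void add(String term) {
        dicHash.computeIfAbsent(term.charAt(term.length() - 1), k -> new HashSet<>()).add(term);
    }

    // Only the sub-dictionary selected by the candidate's last character is searched.
    public boolean isWord(String candidate) {
        Set<String> subDict = dicHash.get(candidate.charAt(candidate.length() - 1));
        return subDict != null && subDict.contains(candidate);
    }
}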
4 The Improved RMM Algorithm and Design of the Segmentation System
4.1 Improved RMM Algorithm
We construct a stopword dictionary which contains frequently used symbols that do not help distinguish one document from another, such as spaces and other punctuation
marks. In our segmentation system, the stopword dictionary can be adjusted manually or automatically, and meaningless or unwanted terms or Arabic numerals can be added to it. With stopword removal, the number of comparisons can be reduced by (step-1)*step/2*I, where step is the threshold and I is the number of stopwords. The stopword removal function is described as follows:

while (not end of the text) {
    read a character char from the text;
    if (the stopword dictionary contains char) {
        if (the length of the pre-processed string proc_str >= 1) {
            splitString(dicHash, proc_str);   // segment the accumulated string
            proc_str = null;                  // reset the pre-processed string
        }
    } else {
        proc_str += char;                     // append char to the pre-processed string
    }
}

Removing stopwords is part of the preprocessing of the text. Fig. 2 is the flow diagram of the proposed RMM algorithm. From the performance of the algorithm, we conclude that ideally the number of comparisons is (N-I)/step, where N is the number of characters of the preprocessed text and I is the number of stopwords; in the worst situation, the numbers of comparisons are given by formulas (2) and (3). From these formulas we can conclude that it is important to remove stopwords from the pre-processed text, especially for large-scale texts.

4.2 Design of the Word Segmentation System
Any system consists of elements that collaborate with each other to achieve an objective. The model of the segmentation system whuSplitWord is as follows:
S(A, Doc, Dic, M, SD), where A represents the segmentation algorithm, Doc the pre-processed text, Dic the pre-defined dictionary, M the constraint model and SD the stopword dictionary. The function of the segmentation algorithm is described as follows.
a(d, m) = w1, w2, w3, …, wn, where a ∈ A and m ∈ M; it means that algorithm a, constrained by model m, is used to segment document d, and the segmentation result is w1, w2, w3, …, wn. Fig. 3 shows the structure of the proposed segmentation system.
Fig. 2. Flow diagram of RMM algorithm
5 Experiment and Analysis
Our experimental platform is as follows: operating system Windows 2000, 64 MB RAM, Celeron CPU. The test collection contains 100 pieces of news from the Chinese XinHua Net; for example, the sample named content_4075530 comes from the web page http://news.xinhuanet.com/legal/2006-01/20/content_4075530.htm.
Fig. 3. Structure of the proposed segmentation system
Fig. 4 shows the interface of the segmentation system WhuSplitWord, which is very easy to use. When the button "open" is clicked, the system begins to segment words and the result is shown in the textbox named "segmentation result".
Fig. 4. Interface of segmentation system
Table 7. The experiment result
Test collection | Character number | Word number | Different words | Time (ms, no word-frequency statistics) | Time (ms, with word-frequency statistics) | Speed (words/s, no statistics) | Speed (words/s, with statistics) | Segmentation accuracy (%)
content_4075930 | 1551 | 447 | 279 | 469 | 563 | 3316 | 2762 | 98.6
content_4076410 | 432 | 143 | 110 | 93 | 109 | 4646 | 3964 | 98.7
content_4062021 | 5473 | 1495 | 838 | 1563 | 1952 | 3479 | 2785.4 | 98.9
content_4074109 | 1281 | 361 | 259 | 375 | 438 | 3416 | 2924 | 98.8
…… | …… | …… | …… | …… | …… | …… | …… | ……
content_4075232 | 2234 | 567 | 391 | 453 | 531 | 4931 | 4207 | 99.51
content_4075235 | 3807 | 962 | 673 | 1203 | 1562 | 3164 | 2438 | 99.12
content_4075377 | 4249 | 1024 | 767 | 1391 | 1749 | 3054 | 2429 | 98.6
content_4075438 | 359 | 100 | 84 | 31 | 31 | 11580 | 11580 | 98.9
content_4075449 | 1414 | 383 | 253 | 485 | 579 | 2915 | 2243 | 99.43
The sentence shown in the textbox is taken from the test collection with stopwords removed, and the corresponding words are listed under the sentence. After segmentation, the result can be exported for post-processing when the button "save" is pressed. In Table 7, the unit of time is the millisecond, and the speed is measured as the number of words the system can process in a second. Word frequency statistics can be computed while the segmentation experiment is running; because the word count used then is the number of different words in the segmentation result, the total number of words can also be obtained. In our experiment, the 10 quickest and the 10 slowest of the 100 pieces processed by the system were eliminated, and the other 80 pieces are analyzed in Table 7. By analyzing the samples we reach the following conclusion: the system can process 3798 words per second when word frequency is not calculated and 2690 words per second when it is, and the precision is higher than 99%. The speed of the system greatly exceeds that of existing systems such as SDWS, ABWS, CASS, and CWSM [13].
6 Conclusions
Programmed in Java, WhuSplitWord has the advantages of portability and cross-platform use. The system can be used on a large-scale test collection containing as many as 25,000 words and can compute word frequency statistics at the same time. It is demonstrated that the speed can be improved greatly using the improved converse Chinese dictionary and the reverse maximum matching algorithm. However, we have to point out that ambiguous segmentation still remains as a result of the limits of mechanical matching algorithms. In order to achieve precision close to 100%, the dictionary should be updated and new supporting approaches should be adopted. This paper is supported by the National Social Science Foundation (No. 05BTQ022).
References 1. Kwok KL: Employing multiple representations for Chinese information retrieval. Journal of the American Society for Information, vol. 50, 8 (1999) 709-723 2. Kwok KL: Improving English and Chinese ad-hoc retrieval: A tipster text phase 3 project report. Information Retrieval, vol. 3, 4 (2000) 313-338 3. Spark-Jones K: The role of artificial intelligence in information retrieval. Journal of the American Society for Information Science, vol. 42, 8 (1991) 558-565 4. Fan, C. K., W.H.Tsai: Automatic word identification in Chinese sentence by the relaxation technique. Computer Processing of Chinese and Oriental Language. 1 (1988) 33-56. 5. Guo Jin: Critical tokenization and its properties. Computational Linguistics. 4 (1997) 569-596 6. Chen A.,He J, Xu L, Gey FC, Meggs J: Chinese text retrieval without using a dictionary. Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, (1997) 42-49 7. Lee, Kin Hong, Qin Lu: Text segmentation for Chinese spells checking. Journal of the American Society for Information Science, vol. 50, 9 (1999) 751-759 8. Huang X and Robertson S: A probability approach to Chinese information retrieval: Theory and experiments. Proceedings of the BCS-IRSG 2000, the 22nd Annual Colloquium on Information Retrieval Research, Cambridge, England 9. YUE-HENG SUN, PI-LIAN HE, SONG NIE, GUANG-YUAN WU: A Chinese Segmentation System Based on Document Self-Machine for Identifying the Unknown Words. Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi’an, 2-5 November 2003. 2080-2084 10. Dahan D., Brent M.: On the Discovery of novel word-like units from utterances: An artificial language study with implications for native-language acquisition. Journal of Experimental Psychology, vol. 128, (1999) 165-185 11. Cheng K.S., Wong K.F.: A study on word-based and integral-bit Chinese text compression algorithms. Journal of the American Society for Information Science, vol. 50, 3 (1999) 218-228 12. HONGLAN JIN, KAM-FAI WONG: a Chinese Dictionary Construction Algorithm for Information Retrieval. ACM Transactions on Asian Language Information Processing, Vol. 1, No.4, December 2002, 281-296 13. Liu Kaiying: Automatic word segmentation and tagging of Chinese texts. The Commercial Press. Beijing, (2000) 14. Packard Jerome: The Morphology of Chinese: A Linguistic and Cognitive Approach. Cambridge University Press (2000)
Cross-Media Database Retrieval System Based on TOTEM Cheng Zeng1, Haiyang Zhou1, and Bing Yan2 1
State Key Lab of Software Engineering, Wuhan University, Wuhan 430072, P.R. China 2 Hubei University of Economics, Wuhan 430205, P.R. China [email protected]
Abstract. This paper presents a model, CSRM, for realizing cross-media semantic retrieval in a multimedia digital library. Object semantics are regarded as the core of all semantic information and are mined with a maximum likelihood estimation method. Bayesian network instantiation and Markov chains are used, respectively, to recognize dynamic scene semantics and complex event semantics. Semantic templates are constructed to store the mapping relation between media features and semantic concepts so that all media in the multimedia digital library can be annotated automatically. Besides, users can retrieve cross-media information with a media example set in real time. A prototype cross-media retrieval system based on TOTEM is constructed to demonstrate the validity of CSRM and the flexibility of its data management.
example, and look for an entrance to the cross-media network by mining the information sets implied in different layers, so that various types of media can be retrieved and the retrievable proportion of media data is improved considerably. The term cross-media comes from the domain of journalism and communication, but it is a new one in computer and information science. Cross-media is a combination spanning three platforms: planar media, multidimensional media and network media. Planar media: numbers, text, images, graphs and other discrete media. Multidimensional media: 3D models, video, audio, animation and other complicated time-based media. Network media: Web pages, real-time messages, Email, Fax and other media that take the network as carrier. Cross-media is based on multimedia technology and aims at the combination of multimedia resources, information fusion and coexistence in one system spanning the three platforms. It extracts the relations, cooperating effects, complementarity and multi-dimensional interactions among heterogeneous media to the greatest extent, so that more information can be recognized, retrieved, published, reconstructed and reused for new purposes, using all kinds of media efficiently.
2 Related Works
Li Qing and the IBM Research Lab in China cooperated to present a comprehensive retrieval model which supports various media [1, 2]. The model extracts relations among homogeneous media via similarity computation between low-level features, and relations among heterogeneous media via media structure characteristics and interactive feedback. However, it does not address how to mine and reason about media semantics or how to construct cross-media relationships based directly on semantics. Moira [3] researches cross-media retrieval among digital media and real-world media, assigning an ID number to each media file and relating these IDs to semantic concepts. References [4,5,6] realize image auto-annotation by support vector machines and statistical learning in order to retrieve cross-media information based on concept descriptions, but they are limited to basic concepts of scene and object and cannot recognize more complicated semantic information. Zettsu [7] presents an aspect-based cross-media object relationship mining method, which analyses the logical structure and popularity of Web pages and mines structural semantic correlations among their sub-media. Wei-Ta [8] designs an on-line multimedia learning system which relates teaching recordings with slides by speech recognition in order to realize synchronized playing. Wu F [9] combines structure relationships and physical feature matching, which reduces the similarity problem of composite media containing various sub-media to Euclidean distance computation between homogeneous sub-media. It constructs
a multimedia semantic space, namely it stores the matching degrees between media files in a two-dimensional array, to realize cross-media retrieval to a certain extent. However, this method does not suit huge volumes of multimedia data, and each new media file requires interactive learning before it is inserted into the database.
3 Cross-Media Semantic Mining Model
At present, research on cross-media retrieval is still at an early stage. Because a cross-media retrieval model considered as a whole is lacking, what media content should be mined is not clear, its expression is not unified, the relationships between media are indeterminate, and their cooperation is inconsistent. In particular, information mining technology related to media semantics is absent, making it difficult to satisfy the current demand for efficient retrieval over media of various types and huge quantity. As a result, this paper constructs a unified cross-media semantic retrieval model, CSRM.
Fig. 1. Cross-media semantic retrieval model
CSRM regards object semantics as the core to realize the transition from low-level features to high-level semantics and, to a certain extent, cross-media retrieval. First, CSRM mines correlative object information from planar media, and tracks and estimates object regions of time-based media, via the media object-level information mining module. Semantic templates play the role of a bridge: each of them stores the correlative information required by a certain semantic concept. They are constructed by an
interactive semantic template training system, while semantic concepts are provided by several semantic concept ontology databases. All mined physical features and semantic information are stored in a three-layer cross-media metadata structure which is independent of media type and is the groundwork of latent semantic relationships in the cross-media semantic network.
3.1 Cross-Media Semantic Network
The primary problem for realizing integrated cross-media retrieval is to find a comparability measuring space that does not depend on media types, which amounts to mining the possible relationships among different types of media.
1) Relationships created by manual work. These relationships exist between example and feedback media. They are also produced indirectly between different media by the manual annotation of domain experts.
2) Relationships inherent in media physical structure. These relationships emerge in composite media such as Web pages, Email, video and so on, whose characteristic is that each medium may comprise other types of media.
3) Latent semantic relationships. These indicate semantic relationships implied in media, similar to how humans understand them. Semantic information is mined directly from low-level physical features, and the relationships between media are recognized at the semantic level. This approach, which could realize automatic semantic mining and recognition over huge numbers of media, is the most promising for future cross-media retrieval.
4) Relationships based on physical features. Like the three kinds of relationships above, these can involve any type of media, but relationships based on physical features are usually limited to homogeneous media, which is one of the key problems of traditional content-based multimedia search. We still regard them as cross-media relationships, because relationships among homogeneous media are inevitable when constructing a cross-media semantic network and they strengthen the scope of semantic reasoning.
On the basis of these four kinds of cross-media relationships, we can construct a virtual cross-media semantic network which covers different layers and contains different types of media. Each node in the network corresponds to a media file, and each edge between nodes represents a relationship, so that information fusion, reuse for new purposes and cooperative interaction of cross-media become possible.
3.2 Object-Level Information Mining
The object-level information mining module includes two main parts.
1) Planar media processing. According to a strategy that combines features and assigns weights, planar media such as images and video frames are segmented based on objects, and the feature sets of object regions [10] are collected from the extracted low-level features.
2) Time-based media processing. After adding the time dimension, it is necessary to track object regions, estimate the state transformations that may occur, and ensure the integrity of object information. Besides, audio segmentation according to the duration of audio objects is also involved in this module. For example, audience noise and whistle sounds in a sports match are regarded as two audio objects, but they may intersect or even overlap, so object segmentation technology is indispensable. We researched the processing technology of video objects in a former paper [11]; it is crucial for recognizing high-level semantics such as scenes, events and so on.
3.3 Cross-Media Semantic Concept Ontology Database
This paper constructs one cross-media object semantic concept ontology database and two high-level semantic concept ontology databases. The former does not involve all semantic concepts in the real world, which would be too difficult to realize; instead, it only stores the common concepts used in many domains, expressing a generalization of a certain thing. The semantic concept ontology databases assist in providing explanation and expression of various media and the mapping among low-level features, regions and semantic templates. They are also a common language in the interactive course, and their layered structure makes semantic reasoning more convenient and semantic expression more flexible.
3.4 Construct Semantic Template
Semantic templates include object semantic templates and high-level semantic templates. They are the bridges between semantic concepts in the real world and the object-level information and region relations mined from low-level physical features. Their aim is to compute the necessary parameters of each template with machine learning during an interactive template training course, so that, finally, media files in the cross-media database can be automatically annotated to provide semantic retrieval to a certain extent. Semantic template construction is the key to the transformation from physical features to semantics. Fig. 2 shows the training system of semantic templates. We present an object semantic mining method based on maximum likelihood estimation to compute the steadiest distribution of the probability space and the parameters of each template. In particular, we consider the relations among object semantic concepts, namely that the presence probability of one semantic object influences that of another. We optimize and update each template by feedback learning based on a decision function to improve automatic recognition of media object semantics. We regard high-level semantics as more complex semantic descriptions based on object semantics, namely the former is composed of the latter according to certain rules and relations. This paper maps media to a one- or two-dimensional expression space and defines pertinent topological relations. Then a Bayesian network is constructed which has three classes of nodes: object semantic concepts, topological relations and scene semantic concepts. We utilize network instantiation to
Fig. 2. Semantic template training system
form a 3D statistical graph, the aim of which is to obtain the probability relationships between scene semantics and their elements, so that complex dynamic or static scene semantics can be recognized automatically. We also present an event semantic mining method which divides events into simple events and complex events. The former are determined by the object semantics and the motion state of a single object (or between objects); the latter are regarded as combinations of several simple events with certain time-order relations, so that event semantic recognition can be transformed into the problem of computing the steady distribution of a Markov chain. The method is efficient for event semantic recognition involving multiple objects and multiple relations. Because the paper length is limited, we do not introduce the methods and algorithms of semantic template construction in detail.
3.5 Three Layers Meta-data Structure for Cross-Media
After mapping semantic concepts to semantic templates, we need a uniform mode to store the corresponding relationships between semantic elements, low-level features and other information, regardless of media type. The mode is required not only to satisfy the universality of feature definition and the retrieval requirements of various types of media, but also to consider the particularity of those media which have abundant
semantics and complex structure, so as to improve the degree of semantic matching between search results and examples. We think that high-level semantics such as scenes and events evolve from object semantics and their mutual relationships, so a three-layer cross-media metadata structure is defined. It stores low-level physical features, object semantic information and high-level semantic information to bridge the semantic gap; its sketch map is as follows:
Fig. 3. Three layers metadata structure and relationship of cross-media
The upper layer is the high-level semantic information metadata (SM), the middle layer is the object semantic information metadata (OM) and the lower layer is the low-level feature metadata (FM). The data structures of the three layers are as follows (see also the sketch at the end of this subsection):
SM = { SID, OSet, OSR, OTR, OSC, Hconcept, OPsm, FPsm }
OM = { OID, OFD, Fweight, Oconcept, SPom, FPom }
FM = { MID, Type, BA, CBF, Address, Description, Fset, Sset, OPfm, SPfm }
SID is a unique mark of a high-level semantic; OID and MID have a similar function. OSet denotes the set of semantic objects contained in the high-level semantic. OSR, OTR and OSC respectively denote the spatial topology information of the semantic objects, their time-order relation information, and the state transition information of each semantic object changing over time; some of them may be blank for different high-level semantics. BR is a spare relation attribute, and Hconcept stores the most suitable semantic concept corresponding to the high-level semantic; Oconcept is similar. OFD and Fweight respectively denote the feature information of the semantic object regions and their weights. The Type attribute in FM denotes the media type, BA the basic attributes of the media file, CBF the physical features, and Address the pointer to the physical address of the media file. The attribute Description logs media annotations made during interactive manipulation. OP*, FP* and SP* denote the pointer sets that store the dependence relations among the three layers of metadata. Fset and Sset store parent-child or sibling relationships among media sets and their sub-media sets for constructing the cross-media semantic network. With CBF in the FM layer, we can retrieve homogeneous media based on content or heterogeneous media based on keywords; likewise, we can retrieve based on object or high-level semantic concepts thanks to the various region relations and semantic concept descriptions. Besides, expandable retrieval can be
realized with the hierarchical concept relations in the semantic concept database. The relationship pointers in the metadata structure broaden the retrieval range in the cross-media database and remove the restriction of media type.
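As a concrete illustration of the three-layer structure, a minimal in-memory sketch is given below. The attribute names follow the SM/OM/FM definitions above, while the Python representation, field types and class names are our own illustrative choices and are not part of the original system.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class FeatureMetadata:          # FM layer: low-level features of one media file
    mid: str                    # MID: unique media identifier
    media_type: str             # Type: "text", "audio", "image" or "video"
    basic_attrs: Dict[str, str] # BA: basic attributes of the media file
    cbf: List[float]            # CBF: content-based physical feature vector
    address: str                # Address: pointer to the physical media file
    description: str = ""       # Description: annotations added interactively
    op_fm: List[str] = field(default_factory=list)  # OPfm: pointers to OM entries
    sp_fm: List[str] = field(default_factory=list)  # SPfm: pointers to SM entries

@dataclass
class ObjectMetadata:           # OM layer: one semantic object
    oid: str                    # OID: unique object identifier
    ofd: List[float]            # OFD: features of the object region
    fweight: List[float]        # Fweight: weights of those features
    oconcept: str               # Oconcept: best-matching object concept
    sp_om: List[str] = field(default_factory=list)  # SPom: pointers to SM entries
    fp_om: List[str] = field(default_factory=list)  # FPom: pointers to FM entries

@dataclass
class SemanticMetadata:         # SM layer: one high-level semantic (scene or event)
    sid: str                    # SID: unique high-level semantic identifier
    oset: List[str]             # OSet: OIDs of the objects involved
    osr: Optional[dict] = None  # OSR: spatial topology among the objects
    otr: Optional[dict] = None  # OTR: time-order relations among the objects
    osc: Optional[dict] = None  # OSC: state transitions of objects over time
    hconcept: str = ""          # Hconcept: best-matching high-level concept
    op_sm: List[str] = field(default_factory=list)  # OPsm: pointers to OM entries
    fp_sm: List[str] = field(default_factory=list)  # FPsm: pointers to FM entries

Following the pointer lists upward (sp_fm, sp_om) or downward (op_sm, fp_sm) is what allows a keyword query that hits an SM entry to reach the concrete media files stored in the FM layer, and vice versa.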
4 Cross-Media Data Management and Retrieval Based on TOTEM Storing, managing and retrieving multimedia data are key problems in multimedia digital library research. Traditional relational databases are well suited to managing structured data types such as characters and numbers, but they simply store multimedia data as binary streams. Object-oriented databases can manage multimedia data, yet the strict inheritance relation between superclass and subclass loses a lot of semantic information. We developed the object deputy database management system TOTEM[12,13], which combines the flexibility of a relational database with the ability of an object-oriented database to handle complex information. Moreover, the deputy object approach can solve problems such as multiple representations, dynamically changing characteristics and storage for multimedia, geographic and biological information. 4.1 TOTEM Database Management System 1) Creating personalized cross-media databases Traditional database retrieval often returns thousands of results, forcing users to filter them repeatedly. The deputy object approach in TOTEM is used to customize personalized databases and resolve this problem. TOTEM provides operations for creating deputy classes, such as selection, extension, projection, union, join and group, to represent complex semantic relations such as specialization, generalization, aggregation and grouping. Each original class denotes a media database, and each deputy class relies on one or more original classes. Moreover, each deputy class may have its own second-level deputy classes, which makes the management of the multimedia database very flexible with respect to different information levels, e.g. feature, semantics, basic attributes and manual description. Deputy classes in TOTEM can define their own content and rules, which differs from the inheritance mechanism of object-oriented databases. Users can retrieve data through cross-class search over different personalized databases to share their interests. 2) Automatic classification Traditional methods of automatic data classification are implemented by programming in the application layer and by storing data redundantly, which raises consistency problems. TOTEM realizes automatic classification at the database level. When new data are inserted into an original class, they are automatically evaluated against the defined rules and distributed to the deputy classes whose conditions they satisfy. Notably, deputy classes store only bidirectional pointers to the original classes plus their own additional attributes, which avoids extra storage cost; a rough sketch of this rule-based classification is given below.
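The following sketch illustrates the idea of rule-based distribution of newly inserted objects into deputy classes. It is only a simplified in-memory illustration of the mechanism described above; the class and function names are hypothetical and do not reflect TOTEM's actual interfaces.

from typing import Callable, Dict, List

class DeputyClass:
    # A deputy class keeps a selection rule and only pointers (object ids)
    # into the original class, not copies of the data.
    def __init__(self, name: str, rule: Callable[[dict], bool]):
        self.name = name
        self.rule = rule
        self.members: List[int] = []

class OriginalClass:
    # An original class stores the actual media objects.
    def __init__(self, name: str):
        self.name = name
        self.objects: Dict[int, dict] = {}
        self.deputies: List[DeputyClass] = []
        self._next_id = 0

    def insert(self, obj: dict) -> int:
        # Insert a new object and classify it into every matching deputy class.
        oid = self._next_id
        self._next_id += 1
        self.objects[oid] = obj
        for deputy in self.deputies:
            if deputy.rule(obj):
                deputy.members.append(oid)
        return oid

# Usage: a personalized deputy class that selects only images annotated with "tiger".
media = OriginalClass("media")
tiger_images = DeputyClass("tiger_images",
                           rule=lambda o: o["type"] == "image" and "tiger" in o["concepts"])
media.deputies.append(tiger_images)
media.insert({"type": "image", "concepts": ["tiger", "grass"]})
media.insert({"type": "video", "concepts": ["bear"]})
print([media.objects[i] for i in tiger_images.members])   # only the tiger image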
Fig. 4. The page of creating personalized cross-media database
3) Cross-class search Thanks to the bidirectional pointers between original classes and deputy classes in TOTEM, cross-media data can be retrieved across different deputy classes or original classes. This characteristic is well suited to storing and managing the three-layer cross-media metadata, so that retrieval can move flexibly between the feature layer and the semantic layer. 4.2 Cross-Media Retrieval System In cross-media retrieval, neither the search conditions nor the results are restricted by media type. Media are not only processed by semantic mining methods, but also related, matched, coordinated and reasoned about under a uniform semantic representation. We constructed a cross-media retrieval system prototype based on TOTEM, which combines a cross-media example set with half natural language to realize cross-media retrieval, as shown in Fig. 5. ICTCLAS[14] was extended to process the retrieval sentence: nouns in the sentence are extracted as the core retrieval content; gerunds are extracted as candidate words for event concepts; prepositions expressing negation and conjunctions are treated as restriction words; and a thesaurus dictionary is used to extend the retrieval scope. All media files in the cross-media database have been automatically annotated by the semantic templates, while semantic concepts in the example set are extracted by semantic mining in real time. Those retrieval
conditions are transformed into half natural language and evaluated over the cross-media semantic network with semantic reasoning based on cross-media relationships, which include the following:
Fig. 5. The main page of cross-media semantic retrieval
1) Reasoning over different types of relationships. Because a physical structure relationship carries a certain degree of semantic similarity, it can share information with latent semantic relationships to extend the relationships between media. For example, if a network media item C contains image A and video B, there is a physical structure relationship between A and B. If image A then acquires a latent semantic relationship with an audio file D newly inserted into the database, a new relationship emerges between B and D. 2) Reasoning over the layer relations in the concept ontology database. This kind of reasoning exploits the relations among different layers of the semantic concept ontology database. For example, when a concept set such as tiger, bear and giraffe is extracted from the retrieval example set, the system returns not only the media that match the retrieval concepts, but also related media that share a common proximate super concept, e.g. animal; these related media naturally receive a lower relevance degree. The prototype system can return four kinds of media (text, audio, image and video) ranked by semantic matching degree. The results display the semantic concepts contained in each media item and some basic information. Besides, visual
media are shown with a display thumbnail and time-based media can be played online; the corresponding page is shown in Fig. 6.
Fig. 6. The results of cross-media retrieval
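As a rough illustration of the relationship-based reasoning in 1) above, the sketch below propagates a newly added latent relationship through existing physical structure links. It is a deliberate simplification under our own assumptions: the relation labels and the confidence discount factor are hypothetical and are not taken from the prototype.

from collections import defaultdict

# media relationship graph: each edge carries a relation type and a confidence value
relations = defaultdict(dict)

def add_relation(m1, m2, rel_type, confidence):
    relations[m1][m2] = (rel_type, confidence)
    relations[m2][m1] = (rel_type, confidence)

def insert_latent_relation(new_media, existing_media, confidence, discount=0.8):
    # Add a latent semantic relation and propagate it to media that are
    # physically linked to the existing one (e.g. through a common web page).
    add_relation(new_media, existing_media, "latent", confidence)
    for neighbour, (rel_type, conf) in list(relations[existing_media].items()):
        if neighbour != new_media and rel_type == "physical":
            # derived relation with a (hypothetical) discounted confidence
            add_relation(new_media, neighbour, "derived", confidence * conf * discount)

# Example from the text: page C contains image A and video B (physical link);
# audio D is inserted with a latent link to A, so a B-D relation is derived.
add_relation("image_A", "video_B", "physical", 1.0)
insert_latent_relation("audio_D", "image_A", confidence=0.9)
print(relations["audio_D"]["video_B"])   # ('derived', ~0.72)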
5 Conclusions This paper presents a cross-media semantic retrieval model, CSRM, in which semantic templates serve as the bridge between media features and semantic concepts. The object semantic template is constructed by a maximum likelihood estimation method. Based on the semantic objects and their relation information, the dynamic scene semantic template and the complex event semantic template are constructed by Bayesian network instantiation and by a Markov chain, respectively, in order to realize automatic semantic recognition of media files. Users can retrieve cross-media information with a media example set or half natural language in real time. A prototype cross-media retrieval system based on TOTEM was built to demonstrate the validity of CSRM and the flexibility of its data management.
References 1. Jun Yang, Qing Li, Yueting Zhuang. OCTOPUS: Aggressive Search of Multi-Modality Data Using Multifaceted Knowledge Base. http://www.cs.cityu.edu.hk/~csqli/research-projects/Octopus/papers/ www02jyang.pdf 2. Ji-Rong Wen, Qing Li, Wei-Ying Ma. A Multi-paradigm Querying Approach for a Generic Multimedia Database Management System. SIGMOD Record, 2003. 32(1)
3. Moira C. Norrie, Beat Signer. Information Server for Highly-Connected Cross-media Publishing. Information Systems, 30(2005): 526–542 4. Yueting Zhuang, Congmiao Wu, Fei Wu. Improving Web-Based Learning: Automatic Annotation of Multimedia Semantics and Cross-Media Indexing. Proceedings of ICWL 2004: 255-262 5. J. Jeon, V. Lavrenko and R. Manmatha. Automatic Image Annotation and Retrieval using Cross-Media Relevance Models. The Twenty-Sixth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2003: 119-126 6. Ahmed Id Oumohmed, Max Mignotte, Jian-Yun Nie. Semantic-based Cross-media Image Retrieval. Proceedings of the Third International Conference on Advances in Pattern Recognition, 2005: 414-423 7. Zettsu Koji, Kidawara Yutaka, Tanaka Katsumi. Discovering Aspect-based Correlation of Web Contents for Cross-media Information Retrieval. IEEE International Conference on Multimedia and Expo (ICME), 2(2004): 1015-1018 8. Wei-Ta Chu, Herng-Yow Chen. Toward Better Retrieval and Presentation by Exploring Cross-media Correlations. Multimedia Systems, 2005.10(3): 183-198 9. Wu F, Yang Y, Zhuang YT, Pan YH. Understanding Multimedia Document Semantics for Cross-media Retrieval. PCM 2005, Part I, LNCS 3767, 2005: 993–1004 10. Cheng Zeng, Jiaheng Cao, Yuwei. Object-based Tree Representation Mechanism of the Image with Multi-granularity in One Scale. Computer Engineering, 2006.32(8): 55-58 11. Cheng Zeng, Jiaheng Cao, Ying Fang, Pei Du. Data Mining Based on Object in Video Flow with Dynamic Background. ADMA05, 2005: 382-390 12. Zhiyong Peng, Yahiko Kambayashi. Deputy Mechanisms for Object-Oriented Databases. IEEE 11th International Conference on Data Engineering, March 6-10, 1995 13. Zhiyong Peng, Qing Li, Ling Feng, Xuhui Li, Junqiang Liu. Using Object Deputy Model to Prepare Data for Data Warehousing. IEEE Transactions on Knowledge and Data Engineering, 2005.17(9): 1274-1288 14. ICTCLAS. http://www.nlp.org.cn/
Semantic Query Planning Mechanism on XML Based Web Information Systems
Jinguang Gu1,2, Bo Hu1, and Yi Zhou1
1 College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430081, China
2 College of Computer Science and Engineering, Southeast University, Nanjing 210096, China
[email protected], [email protected]
Abstract. To enable access to web information at the semantic level, this paper develops a semantic query planning mechanism for XML based web information systems using complex ontology mapping technology. It first discusses the patterns of complex ontology mappings, and then ontology-based query planning in a Mediator-Wrapper environment with GAV-style query requests. The extension of the XML query algebra and the XML query planning algorithm are discussed in detail.
1
Introduction
We have witnessed a rapid increase in the number of web information sources that are available online. The World-Wide Web (WWW), in particular, is a popular medium for interacting with such sources[1]. These sources include free and commercial databases, flat file services, web services, web based applications, HTML files and XML files. Owing to its ability to express semi-structured information, the XML based information system (X-WIS), an extension of the database information system, plays a key role in the web information community, and XML is rapidly becoming the language of choice to express, store and query information on the web; other kinds of web information, such as HTML-based information, can be transformed into XML based information with annotation technologies. Users can query the information with XML languages: XPath based languages such as XQuery and XUpdate are suitable for retrieving information in distributed integration systems. The problems that arise from data heterogeneity are already well known within the distributed database systems community: structural heterogeneity and semantic heterogeneity. Structural heterogeneity means that different information systems store their data in different structures; semantic heterogeneity concerns the content of an information item and its intended meaning[2]. How to access distributed information within a consistent semantic environment, and how to make the XML query mechanism semantics-enabled, are the main problems to be addressed in distributed X-WISs. The use of ontologies for the explication of implicit and hidden knowledge is a possible approach to overcoming the problem of semantic heterogeneity. Ontologies can be used to describe the semantics of the X-WIS sources and to
make the content explicit. With respect to the data sources, they can be used for the identification and association of semantically corresponding information concepts. This paper focuses on how to use ontology technology to enable semantic level querying on XML based WISs. It uses ontology mapping technology to obtain a consistent semantic environment, and extends XML based querying technologies to enable semantic querying on X-WISs. The remainder of this paper is structured as follows. Section 2 gives a general discussion of ontology enabled XML based information systems. Section 3 defines the patterns of ontology mapping. Section 4 discusses ontology enabled query planning on XML based semi-structured web information systems, including the ontology-enhanced XML query algebra and the XML query planning algorithm. Section 5 summarizes the paper.
2
The XML Based Information Systems
Because XML has become the standard language for representing web information and semantic web resources, using XML to represent web based or semi-structured information systems is a natural choice. In a distributed environment, every local site contains a local ontology based structured or semi-structured information source; this source may be a relational database, a native XML database, a web site, an XML based application or another autonomous system. 2.1
The Formal Definition of XML Based Web Information Systems
From the point of view of web based or semi-structured information processing, all local information sites can be expressed as collections of XML instances. An XML instance can be described as follows [3].
Definition 1. An XML instance is a structure Id := (Vd, Ed, δd, Td, Od, td, oidd, rootd), consisting of:
i. Id is a rooted and directed tree, where Vd is the node set of the tree, Ed is the edge set with Ed ⊆ Vd × Vd, δd is the mapping function between nodes which identifies the direction of the edges, and rootd is the root of the tree;
ii. Every node has a unique identifier belonging to Od, which means that for every node e, ∀(e){e ∈ Vd → ∃(o){o = oidd(e) ∧ o ∈ Od}};
iii. There exists a mapping function td(e, string) which maps an attribute of the node to a type τ ∈ Td, according to the following rules:
– if string = tag, it maps to the type of the tag;
– if string = content, it maps to the type of the content.
iv. Every type τ ∈ Td has a domain denoted as dom(τ);
v. δd is the edge identifier, with which parent nodes can access their child nodes; conversely, child nodes can access their parent nodes through the identifier δd−1.
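A minimal sketch of how such an instance might be held in memory is given below. The field names mirror the components of Definition 1, while the Python representation itself (and the toy document) is only our own illustrative choice.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    oid: int                          # unique identifier from O_d (via oid_d)
    tag: str                          # t_d(e, "tag"): the element tag
    content: Optional[str] = None     # t_d(e, "content"): the text content
    children: List["Node"] = field(default_factory=list)   # delta_d edges
    parent: Optional["Node"] = None                         # delta_d^{-1}

@dataclass
class XMLInstance:
    root: Node                        # root_d

    def nodes(self) -> List[Node]:
        # V_d: collect every node by walking the rooted, directed tree.
        out, stack = [], [self.root]
        while stack:
            n = stack.pop()
            out.append(n)
            stack.extend(n.children)
        return out

# A tiny instance corresponding to <paper><title>WISE</title></paper>
title = Node(oid=2, tag="title", content="WISE")
root = Node(oid=1, tag="paper", children=[title])
title.parent = root
doc = XMLInstance(root=root)
print([n.tag for n in doc.nodes()])   # ['paper', 'title']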
By an XML based information system (X-WIS) we mean S = ({Xd}, W), where {Xd} is a finite set of XML instances and W is the semantic wrapper or mediator[4]. Ontologies are used for the explicit description of the information source semantics; an ontology can be expressed as[5]:
Definition 2. A core ontology is a structure O := (C, ≤C, R, ≤R, σ) consisting of:
i. Two disjoint sets C and R whose elements are called concept identifiers and relation identifiers; we will call them simply concepts and relations;
ii. A partial order ≤C on C, called concept hierarchy or taxonomy;
iii. A function σ : R → C+ called signature;
iv. A partial order ≤R on R, called relation hierarchy, where r1 ≤R r2 implies |σ(r1)| = |σ(r2)| and πi(σ(r1)) ≤C πi(σ(r2)), for each 1 ≤ i ≤ |σ(r1)|.
By a semantic enhanced XML based information system we mean PS = ({Si}i∈I, M), where I is a set of sites, Si is a WIS for each i ∈ I, and M is a mapping relation on the set I which can be expressed as M : (S1, S2, . . . , Sn) → S0, where Si (1 ≤ i ≤ n) denotes a local WIS site, S0 denotes the global X-WIS site acting as the mediator site, and M denotes an integration procedure. Such semantic enhanced XML based web information systems were discussed in the OBSA[6] system. 2.2
The Ontology Based Semantic Adapter of X-WISs
We employ web service based components named semantic adapters to perform semantic information processing at every local site. A semantic adapter acts as the wrapper of the local site's information; different local sites have different semantic adapters. The functions of the semantic adapter can be described as follows[4]:
– Ontology establishment. With the help of domain experts, the semantic adapter creates the local ontology to supply a local semantic view that expresses the semantics of the local information source;
– Semantic mapping. The semantic adapter maintains a mapping table, which maps the local semantics to the semantics of other sites using the method introduced in paper [4];
– Query processing. The semantic adapter accepts query requests from other sites and translates them into the form required by the local information source; the local information source executes the query, and the semantic adapter transforms the result into the form needed by the other WIS sites using XSLT technology;
– Some other functions will be added in the future.
The main components of the semantic adapter are shown in Figure 1; they are the following:
Fig. 1. The Wrapper Adapter of local WIS site
1. SKC (semantic knowledge construction). The SKC constructs semantic mapping knowledge between schemata; it uses the results of schema extraction and concept matching to establish mappings between the local and global semantic schemata. The mapping knowledge is saved in the VMT;
2. MDD (metadata dictionary). It includes descriptions of the information source, such as its schema, storage path, type and provider;
3. SKB (semantic knowledge base). It includes the knowledge needed to understand ontology concepts and their attributes, such as synonyms and Chinese-English correspondences; this knowledge is crucial to concept matching. The SKB can expand automatically during the matching process;
4. VMT (vocabulary mapping table). The VMT contains the mapping list of the local ontology and its instances; one item of the list can be described as follows:
<TItem>
  <STerm>Ontology.Term</STerm>            %Description of the Concept or Ontology
  <MappingList>
    <MapItem Type="M">                    %Direct Mapping
      <Source>Source Description</Source> %IP Address, Port or Semantic Adapter Service Description
      <MTerm>Source1.Term</MTerm>
      <MRelation>Map1.Relation</MRelation>          %element name assumed
      <MConfidence>Confidence Value</MConfidence>   %element name assumed
    </MapItem>
    <MapItem Type="S">                    %Subsumption Mapping
      <Source>Source Description</Source>
      <MTerm>Source2.Term1</MTerm>
      ....
      <MTerm>Source2.Termn</MTerm>
      <MRelation>Map2.Relation</MRelation>
      <MConfidence>Confidence Value</MConfidence>
    </MapItem>
    <MapItem Type="C">                    %Composition Mapping
      <Source>Source Description</Source>
      <MTerm>Source3.Term1</MTerm>
      <MConcatenate>Term1.Concatenate</MConcatenate>
      ....
      <MTerm>Source3.Termn</MTerm>
      <MConcatenate>Termn.Concatenate</MConcatenate>
      <MRelation>Map3.Relation</MRelation>
      <MConfidence>Confidence Value</MConfidence>
    </MapItem>
  </MappingList>
</TItem>
5. MQW (mapping and querying wrapper). It acts as the wrapper for ontology mapping and information retrieval.
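Functionally, the MQW can be thought of as looking up such table items when a query arrives. The sketch below shows this lookup over a toy in-memory table; the data, field names and the resolve helper are hypothetical and only illustrate the idea, not the adapter's real interface.

# A toy in-memory view of VMT entries: each local term maps to one or more
# remote (source, term) pairs plus a mapping type and a confidence value.
vmt = {
    "Local.Address": [
        {"type": "C", "targets": [("Remote1", "Country"), ("Remote1", "City"),
                                  ("Remote1", "Street")], "confidence": 0.8},
    ],
    "Local.Person": [
        {"type": "M", "targets": [("Remote2", "Person")], "confidence": 0.95},
    ],
}

def resolve(term: str, min_confidence: float = 0.5):
    # Return the remote terms a query on `term` should be forwarded to.
    results = []
    for item in vmt.get(term, []):
        if item["confidence"] >= min_confidence:
            results.extend(item["targets"])
    return results

print(resolve("Local.Address"))
# [('Remote1', 'Country'), ('Remote1', 'City'), ('Remote1', 'Street')]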
3
The Patterns of Semantic Mapping
The major bottleneck of semantic query answering on XML based WISs is ontology mapping discovery, that is, how to find the similarities between two given ontologies, determine which concepts and properties represent similar notions, and so on. Linguistic based[7], heuristics-based, machine learning based[8] and Bayesian network based[9] methods have been discussed in recent years; a survey of ontology mapping is given in paper [10]. The patterns of ontology mapping can be categorized into four kinds: direct mapping, subsumption mapping, composition mapping and decomposition mapping[11]. A mapping can be defined as:
Definition 3. An ontology mapping is a structure M = (S, D, R, v), where S denotes the concepts of the source ontology, D denotes the concepts of the target ontology, R denotes the relation of the mapping and v denotes the confidence value of the mapping, 0 ≤ v ≤ 1.
A direct mapping relates ontology concepts in a distributed environment directly, and the cardinality of a direct mapping is one-to-one. A subsumption mapping is a 6-tuple SM = (Dm, Rm, Bm, ⊑m, Im, v), where Dm is a direct mapping expression; Rm is the first target concept, which is the most specialized ontology concept, and the mapping between the source ontology and Rm is called the root ontology concept mapping; Bm is the last target concept, which is the most generalized ontology concept, and the mapping between the source ontology and Bm is called the bottom ontology concept mapping; ⊑m is the inclusion relation between target ontology concepts; Im is the inverse mapping. Subsumption mapping is used to denote the concept inclusion relation, especially
in a multiple IS-A inclusion hierarchy. A composition mapping is a 4-tuple CM = (Fm, Am, Bm, v), where Fm is a direct mapping expression; Am is a chaining of role(s) between target ontology concepts; Bm is the last target symbol, which is the node of the chained target role(s). Composition mapping is used to map one concept to a combination of concepts. For example, the mapping address = contact(country, state, city, street, postcode) is a composition mapping, in which the concept address is mapped to the combined concept “contact, country, state, city, street, and postcode” of local schema elements. A decomposition mapping is a 4-tuple DM = (Am, Bm, Lm, v), where Am is a chaining of role(s) between source ontology concepts; Bm is the last target symbol, which is the node of the chained source role(s); Lm is a direct mapping expression. Decomposition mapping is used to map a combined concept to one local concept, and an example of decomposition mapping is the reverse of the composition example above. These four mapping patterns are illustrated in Fig. 2.
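To make the composition pattern concrete, the sketch below represents the address example above and builds a value for the source concept by concatenating the target fields. The flat tuple layout, the sample record and the confidence value are illustrative simplifications of our own; the role-chaining components Fm, Am and Bm are not modeled here.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CompositionMapping:            # simplified view of CM = (Fm, Am, Bm, v)
    source_concept: str              # the single source concept (e.g. "address")
    target_concepts: List[str]       # the combined target concepts
    confidence: float                # v, with 0 <= v <= 1

# The example from the text: address mapped to the combined local elements.
address_map = CompositionMapping(
    source_concept="address",
    target_concepts=["contact", "country", "state", "city", "street", "postcode"],
    confidence=0.9,                  # illustrative value only
)

def compose_value(mapping: CompositionMapping, record: dict) -> Tuple[str, str]:
    # Build a value for the source concept by concatenating the target fields.
    parts = [record.get(c, "") for c in mapping.target_concepts]
    return mapping.source_concept, ", ".join(p for p in parts if p)

record = {"contact": "Sales office", "country": "China", "city": "Wuhan",
          "street": "Luoshi Road", "postcode": "430081"}
print(compose_value(address_map, record))
# ('address', 'Sales office, China, Wuhan, Luoshi Road, 430081')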
This paper defines some properties of semantic mappings which are useful for semantic query planning. The first property is transitivity: for the mappings Mi−1,i = (Ci−1, Ci, R, vi−1,i) and Mi,i+1 = (Ci, Ci+1, R, vi,i+1), a new mapping Mi−1,i+1 = (Ci−1, Ci+1, R, vi−1,i+1) can be created that satisfies the mapping relation R. The second property is symmetry, which means that the mapping M = (S, D, R, v) is equal to the mapping M = (D, S, R, v). The third property is the strong mapping property, which can be described as follows.
Definition 4. A set of mappings Mi (0 ≤ i ≤ n) is strong if it satisfies the following conditions:
i). They share the same mapping relation R, and the mapping relation is transitive;
ii). For all (i, j, k), if vi, vj, vk are the confidence values of the mappings Mi, Mj, Mk, then vi ≤ vj + vk.
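To make these properties concrete, the sketch below composes two mappings through the transitivity property and checks the strong mapping condition on their confidence values. The tuple layout follows Definition 3; how the composed confidence is derived from the two original values is not specified in the text, so the product used here is only an illustrative assumption, as are the concept and relation names.

from dataclasses import dataclass
from itertools import permutations
from typing import List

@dataclass
class Mapping:                 # Definition 3: M = (S, D, R, v)
    source: str
    target: str
    relation: str
    confidence: float          # 0 <= v <= 1

def compose(m1: Mapping, m2: Mapping) -> Mapping:
    # Transitivity: from M(S, C, R, v1) and M(C, D, R, v2) derive M(S, D, R, v).
    # The combined confidence v1 * v2 is an assumption, not taken from the paper.
    assert m1.relation == m2.relation and m1.target == m2.source
    return Mapping(m1.source, m2.target, m1.relation, m1.confidence * m2.confidence)

def is_strong(mappings: List[Mapping]) -> bool:
    # Definition 4: one shared relation, and v_i <= v_j + v_k for every triple.
    if len({m.relation for m in mappings}) > 1:
        return False
    return all(a.confidence <= b.confidence + c.confidence
               for a, b, c in permutations(mappings, 3))

m01 = Mapping("O1:Car", "O2:Automobile", "is-a", 0.9)
m12 = Mapping("O2:Automobile", "O3:Vehicle", "is-a", 0.8)
m02 = compose(m01, m12)
print(m02.confidence)                  # 0.72 (up to floating point)
print(is_strong([m01, m12, m02]))      # True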
4
Semantic XML Query Planning
The semantic query process in a mediator-based WIS can be expressed as in Fig. 3. Each data source uses its local wrapper to describe its semantics and its mapping relationships with other nodes. The semantic information is described in an ontology-based language, and the global semantics in the mediator environment is constructed from the ontologies via an ontology fusion mechanism. The user's request is rewritten and modified according to the global semantics, then optimized; the corresponding operation plan is produced and passed by the wrappers to each data source node for execution. From the above description,
Fig. 3. Semantic Querying in a Mediator-based WIS
we know that this paper employs the GAV (Global as View) method to process the user's query[1]. The knowledge stored at the mediator supplies a global semantic view of the wrappers, which can be described as an ontology fusion connection list, defined as[12]:
Definition 5. A fusion connection is a structure Fc(O1 : C1, O2 : C2, . . . , On : Cn, M), where Ci denotes a concept or concept set of ontology Oi, and M denotes the mapping relationship among C1, C2, . . . , Cn.
As mentioned above, the mapping patterns include direct mapping, subsumption mapping and composition mapping, and the corresponding fusion connections can be written as Fcd, Fcs and Fcc respectively. The query can be described as a semantically enhanced XML query, expressed as an extension of the XML algebra discussed in the next subsection. Because common XML query languages such as XQuery and XUpdate can be translated into the XML query algebra, the extension is manageable. 4.1
The Extension of XML Algebra with Semantic Query Enhanced
This paper extends the XML algebra TAX[13] to enable semantic querying on mediated WISs. TAX uses a pattern tree to describe the query and a witness tree to describe the result instances that satisfy the pattern tree. The pattern tree with ontology extension is defined as follows:
Definition 6. An ontology enhanced pattern tree is a 2-tuple SPT := (T, F), where T := (V, E) is a tree with node identifiers and edge identifiers, and F is a combination of predicate expressions.
The predicate expression F supports the following atomic conditions and selection conditions[14]. An atomic condition has the form X op Y, where:
– op ∈ {=, ≠, <, ≤, >, ≥, ∼, instance of, isa, is part of, before, below, above};
– X and Y are conditional terms, which may be attributes, types, type values v : τ with v ∈ dom(τ), ontology concepts and so on;
– ∼ stands for the estimation of semantic similarity.
The selection conditions are defined as follows:
– Atomic conditions are selection conditions;
– If c1 and c2 are selection conditions, then c1 ∧ c2, c1 ∨ c2 and ¬c1 are also selection conditions;
– There are no other forms of selection conditions.
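A rough sketch of how such atomic conditions might be evaluated is given below. Only a few of the listed operators are shown, and the similarity function, the threshold and the toy concept hierarchy are placeholders of our own; the paper does not fix them.

from typing import Any, Callable

def similarity(a: str, b: str) -> float:
    # Placeholder semantic similarity; a real system would consult the ontology.
    return 1.0 if a == b else 0.0

SIM_THRESHOLD = 0.7          # assumed cut-off for the "~" operator

def eval_atomic(x: Any, op: str, y: Any, isa: Callable[[str, str], bool]) -> bool:
    # Evaluate an atomic condition 'x op y' for a few of the operators of Definition 6.
    if op == "=":
        return x == y
    if op == "!=":
        return x != y
    if op == "<":
        return x < y
    if op == "~":
        return similarity(str(x), str(y)) >= SIM_THRESHOLD
    if op == "isa":
        return isa(str(x), str(y))
    raise ValueError("operator " + op + " not handled in this sketch")

# Toy concept hierarchy used for the 'isa' operator.
hierarchy = {"tiger": "animal", "bear": "animal", "animal": None}
def toy_isa(c: str, d: str) -> bool:
    while c is not None:
        if c == d:
            return True
        c = hierarchy.get(c)
    return False

print(eval_atomic("tiger", "isa", "animal", toy_isa))   # True
print(eval_atomic("tiger", "~", "tiger", toy_isa))      # True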
4.2 XML Query Planning
The query planning is based on semantic XML query rewriting technology[15]. To simplify the discussion, this paper focuses only on the query planning mechanism for the selection operation. Briefly, a selection operation can be expressed as σ(X : S, Y) {X ⊆ Pi ∪ Po, Y ⊆ PE}, where Pi is the input pattern tree, Po is the output pattern tree, PE is the predicate list, and S denotes the site at which the query will be executed. We define the operators ∪ and ⋈ to represent the Union and Join operations respectively, define the operator ⇒ to represent the query rewriting operation, and use σ(X : S0, Y) or σ(X, Y) to denote the user's query issued at the mediator site. First, we show how to rewrite the pattern tree (the X element of the expression σ(X, Y)); there are several cases:
1. X is an element of the input or output pattern tree, and it is also a concept in the global ontology hierarchy. Let Xi (1 ≤ i ≤ n) be the corresponding concepts of the different local ontologies. X and the Xi were combined into one concept in the integrated global ontology through strong direct mappings, which means that X and the Xi match each other; then X can be rewritten as X ∪ X1 ∪ . . . ∪ Xn. The corresponding selection rewriting can be expressed as:
σ(X, Y) ⇒ σ(X, Y) ∪ σ(X1 : S1, Y) ∪ σ(X2 : S2, Y) ∪ . . . ∪ σ(Xn : Sn, Y)    (1)
2. The concept X is generated by the subsumption mapping or composition mapping of Xi (1 ≤ i ≤ n); then X can be rewritten as X1 ∪ . . . ∪ Xn. The corresponding selection rewriting can be expressed as:
σ(X, Y) ⇒ σ(X1 : S1, Y) ∪ σ(X2 : S2, Y) ∪ . . . ∪ σ(Xn : Sn, Y)    (2)
Next, we show how to rewrite the predicate expressions (the Y element of the expression σ(X, Y)). There are also several cases:
1. If several concepts Yi (1 ≤ i ≤ n) are combined into the concept Y of the global ontology, Y can be rewritten as Y ∪ Y1 ∪ . . . ∪ Yn. The corresponding selection rewriting can be described as:
σ(X, Y) ⇒ σ(X, Y) ∪ σ(X : S1, Y1) ∪ σ(X : S2, Y2) ∪ . . . ∪ σ(X : Sn, Yn)    (3)
2. If the concept Y is generated by the subsumption mapping of Yi (1 ≤ i ≤ n), Y can be rewritten as Y1 ∪ . . . ∪ Yn. The corresponding selection rewriting can be described as:
σ(X, Y) ⇒ σ(X : S1, Y1) ∪ σ(X : S2, Y2) ∪ . . . ∪ σ(X : Sn, Yn)    (4)
3. If the concept Y is generated by the composition mapping of Yi (1 ≤ i ≤ n) with composition condition F, Y can be rewritten as (Y1 + Y2 + . . . + Yn) ∩ F. The corresponding selection rewriting can be described as:
σ(X, Y) ⇒ σ(X : S1, Y1 ∧ F) ⋈ σ(X : S2, Y2 ∧ F) ⋈ . . . ⋈ σ(X : Sn, Yn ∧ F)    (5)
It is worth pointing out that the rewriting process may require recursion through the transitivity property of the semantic mappings. We now discuss the problem of reducing redundancy in the ontology-based query process. A selection is redundant if it satisfies
∃(i, j){Xi ∈ Po ∧ Xj ∈ Po ∧ Xi ∩ Xj ≠ ∅}    (6)
and the corresponding selection rewriting can be described as:
σ(X, Y) ⇒ σ(Xi : S1, Y) ∪ σ(Xj − (Xi ∩ Xj) : S2, Y)    (7)
Input: σ(X, Y) is the query to be processed; FL is the fusion connection list.
Output: P is the query planning sequence.
P ← ∅, Sq ← ∅;
foreach x ∈ X do
    switch mappings of the x node in fusion list FL do
        case direct fusion:
            P ← P + (σ(x, Y), {σ(x, Y), σ(x1 : S1, Y), σ(x2 : S2, Y), . . . , σ(xn : Sn, Y)}, ∪);
        case subsumption or composition:
            P ← P + (σ(x, Y), {σ(x1 : S1, Y), σ(x2 : S2, Y), . . . , σ(xn : Sn, Y)}, ∪);
    end
    Sq ← Sq + σ(x1 : S1, Y) + σ(x2 : S2, Y) + . . . + σ(xn : Sn, Y);
end
foreach σ(x, Y) ∈ Sq do
    foreach y ∈ Y do
        switch mappings of the y concept in fusion list FL do
            case direct fusion:
                P ← P + (σ(x, y), {σ(x, y), σ(x : S1, y1), σ(x : S2, y2), . . . , σ(x : Sn, yn)}, ∪);
            case subsumption:
                P ← P + (σ(x, y), {σ(x : S1, y1), σ(x : S2, y2), . . . , σ(x : Sn, yn)}, ∪);
            case decomposition:
                P ← P + (σ(x, y), {σ(x, y1 ∧ F), σ(x, y2 ∧ F), . . . , σ(x, yn ∧ F)}, ⋈, F);
        end
    end
end
return P;
The query plan is a sequence whose nodes can be denoted as Pn = (Qn, Sn, Cn, Fn), where Qn is the query to be rewritten, Sn is the set of sub-queries executed on different sites, Cn denotes the connection operator (in most cases ∪ or ⋈), and Fn is the predicate that expresses the connection conditions. Pn represents the rewriting of query Qn. The planning procedure for a user query σ(X, Y) is given in Algorithm 1.
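The sketch below mirrors the structure of Algorithm 1 over a toy fusion connection list. The data layout (a dictionary keyed by concept, holding a mapping kind and per-site target concepts) and all names are our own simplification for illustration; they are not the paper's data structures.

# Toy fusion connection list: concept -> (mapping kind, [(site, local concept), ...])
fusion_list = {
    "Person":  ("direct",      [("S1", "Person"), ("S2", "Individual")]),
    "Address": ("composition", [("S1", "City"), ("S1", "Street")]),
}

def plan_selection(x_concepts, y_concepts, F="F"):
    # Produce a plan: a list of (original query, sub-queries, connector) tuples.
    plan = []
    for x in x_concepts:                         # rewrite the pattern tree part X
        kind, targets = fusion_list.get(x, ("none", []))
        subs = ["sigma(" + x + ":" + site + ", Y)" for site, _ in targets]
        if kind == "direct":
            subs = ["sigma(" + x + ", Y)"] + subs    # keep the mediator query too, eq. (1)
        if subs:
            plan.append(("sigma(" + x + ", Y)", subs, "union"))
    for y in y_concepts:                         # rewrite the predicate part Y
        kind, targets = fusion_list.get(y, ("none", []))
        if kind == "composition":                # eq. (5): join under condition F
            subs = ["sigma(X:" + site + ", " + local + " and " + F + ")"
                    for site, local in targets]
            plan.append(("sigma(X, " + y + ")", subs, "join"))
        elif targets:                            # eqs. (3)/(4): union of per-site predicates
            subs = ["sigma(X:" + site + ", " + local + ")" for site, local in targets]
            plan.append(("sigma(X, " + y + ")", subs, "union"))
    return plan

for step in plan_selection(["Person"], ["Address"]):
    print(step)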
5
Discussion and Conclusion
The paper has mainly discussed the extension of query planning on XML based WISs with wrapped ontologies. It examined the complex ontology mapping patterns, such as subsumption mapping and composition mapping. It also discussed the semantic planning mechanism, which primarily extends the XML query algebra based on TAX, on WISs wrapped with local ontologies. Complex ontology mapping ensures that distributed querying can overcome semantic inconsistency, and it increases efficiency by refining the query and reducing redundancy. However, query optimization across distributed web sites was not considered in the query planning mechanism discussed in this paper; future research will focus on this topic.
Acknowledgment This work was partially supported by a grant from the Natural Science Foundation of Hubei Province of China under grant number 2005ABA235 and by a grant from the NSF of the Hubei Education Agency of China under grant number Z200511005.
References 1. Levy, A.Y., Rajaraman, A., Ordille, J.J.: Querying heterogeneous information sources using source descriptions. In: Proceedings of the 22nd VLDB Conference, Mumbai, India, Morgan Kaufmann Publishers Inc (1996) 251–262 2. Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H., Hübner, S.: Ontology-based integration of information - a survey of existing approaches. In: Proceedings of IJCAI-01 Workshop: Ontologies and Information Sharing, Seattle, WA, Springer (2001) 108–117 3. Lü, J., Wang, G., Yu, G.: Optimizing Path Expression Queries of XML Data (in Chinese). Journal of Software 14 (2003) 1615–1620 4. Gu, J., Chen, H., Chen, X.: An Ontology-based Representation Architecture of Unstructured Information. Wuhan University Journal of Natural Sciences 9 (2004) 595–600 5. Bozsak, E., Ehrig, M., Handschuh, S., et al.: KAON: Towards a Large Scale Semantic Web. In: Proceedings of EC-Web 2002, LNCS, Springer-Verlag (2002) 6. Gu, J., Chen, H., Yang, L., Zhang, L.: OBSA: Ontology-based Semantic Information Processing Architecture. In Liu, J., Cercone, N., eds.: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence 2004, Beijing, IEEE Computer Society Press (2004) 607–610 7. Qu, Y., Hu, W., Cheng, G.: Constructing virtual documents for ontology matching. In: Proceedings of the 15th international conference on World Wide Web, Edinburgh, Scotland, ACM Press (2006) 23–31 8. Doan, A., Madhavan, J., Domingos, P., Halevy, A.: Learning to map between ontologies on the semantic web. In: Proceedings of the eleventh international conference on World Wide Web, ACM Press (2002) 662–673 9. Pan, R., Ding, Z., Yu, Y., Peng, Y.: A Bayesian Network Approach to Ontology Mapping. In: Proceedings of the 4th International Semantic Web Conference (2005) 1–15 10. Kalfoglou, Y., Schorlemmer, M.: Ontology Mapping: The State of the Art. The Knowledge Engineering Review 18 (2003) 1–31
11. Kwon, J., Jeong, D., Lee, L.S., Baik, D.K.: Intelligent semantic concept mapping for semantic query rewriting/optimization in ontology-based information integration system. International Journal of Software Engineering and Knowledge Engineering 14 (2004) 519–542 12. Gu, J., Zhou, Y.: Ontology fusion with complex mapping patterns. In: Proceedings of the 10th International Conference on Knowledge-Based, Intelligent Information and Engineering Systems, Bournemouth, United Kingdom, LNCS, Springer Verlag (2006) 13. Jagadish, H.V., Lakshmanan, L.V.S., Srivastava, D., et al.: TAX: A Tree Algebra for XML. Lecture Notes in Computer Science 2379 (2001) 149–164 14. Hung, E., Deng, Y., Subrahmanian, V.S.: TOSS: An Extension of TAX with Ontologies and Similarity Queries. In Weikum, G., ed.: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, Paris, France, ACM Press (2004) 719–730 15. Gu, J., Zhou, Y.: Ontology based query rewriting on integrated XML based information systems. Studies in Computational Intelligence Series 23 (2006) 71–80
Analysis on the Construction of Management Modes of E-Government Information Exchange in China Cui Huang School of Public Policy and Management, Tsinghua University Beijing, China Postcode: 100084 [email protected]
Abstract. In this paper, the administrative impediments to E-Government information exchange in China are classified into four kinds on the basis of their common characteristics at the practical level: departmental interests, exchange costs, disordered data management and the decentralization of administrative organs. Accordingly, this paper proposes four coordination management modes designed to dispel these impediments and facilitate the smooth operation of E-Government information exchange, and illustrates these modes with exemplary cases of information exchange. Keywords: Information Exchange, Management Mode, E-Government.
1 Introduction The Electronic Government (E-Government) systems can be linked easily, but exchanging information among them is not so easy. Although the standardization of data and the application of information exchange technologies make exchange more feasible at the technical level, they cannot ensure that the whole information exchange process runs smoothly. Currently, the databases of many E-Government projects in China have been linked technically and physically, yet the exchange of information among them is still far from satisfactorily realized because of certain administrative impediments. Indeed, it is easy to see that the realization of E-Government information exchange is by nature more an administrative issue than a technical one -- it encounters various impediments at the administrative level, which are more difficult to tackle than technical ones. For example, when conflicts exist between information exchange and departmental interests, or when risk costs and data processing costs must be borne before the exchange, administrative impediments arise that discourage the exchange process.
2.1 Departmental Interests At present, there are no definite rules governing the possession and transfer of the ownership, collection rights, development rights and operational rights of E-Government information resources, nor the relevant management principles. Such absence results in many governmental departments privatizing E-Government information resources and labeling them as "department-owned". Some governmental departments even use the resources they hold to seek political and economic interests; for that purpose, they intentionally or unconsciously set barriers that block the smooth operation of E-Government information exchange. Furthermore, certain governmental departments treat the information resources in their hands as bargaining chips and engage in information rent-seeking. The term "information rent-seeking" here means that governmental departments take advantage of certain methods, such as administrative manipulations and legal rules, to impede the free flow of E-Government information so as to grab unlawful interests or protect their vested interests. Rent-seeking is not a new concept, but using information or informatized methods to practice rent-seeking is a new phenomenon in China. For example, a certain municipal bureau of public security provides the basic personal data in its Permanent Residence Information Registration and Management System to the local China Mobile and China Unicom. The buyer's personal information has to be confirmed by querying this municipal public security bureau's database after every SIM/UIM card is sold by either of these two companies, and 5 Yuan is paid to the public security bureau for each inquiry. Therefore, when the municipal government proposed to exchange information among its departments, the municipal public security bureau1 gave an utter denial. It is also very difficult to exchange information between different government departments even if there is no direct economic interest in the data. The reason lies in the existence of political interests -- some governmental departments consider that their capacity to control information reflects their influencing power, and consequently that blocking certain information resources of their own will enhance their importance. In their view, transferring information means transferring power. This is especially the case if a governmental department can obtain political benefits simply by reporting certain information to its superior organ. 2.2 Exchange Costs There is an issue of economic efficiency in E-Government information exchange. To carry out information exchange, the government has to cover the data pre-processing costs, which include high and long-lasting costs of data coding, arranging, standardizing and database maintenance. Take the exchange of enterprise basic information in a municipality as an example: at first sight, the original investment in facilities to support this enterprise-basic-information exchange among governmental departments such as the Administration of Industry and Commerce (AIC) and the Tax Department or Quality Department is not very high, usually amounting to several hundred thousand Yuan. However, when this information
1
Development and Reform Committee of Fujian Province, the Recommendation List and Case Report on the Development of Sharing Platform for Government Affairs and the Reconstruction Project of Government Affairs Resources, 2004(8), pp.26-27.
exchange becomes a long-term and stable project, the municipal AIC, as the data provider, has to devote considerable manpower and financial resources to data maintenance for years. From an economic point of view, the local AIC is caught in a dilemma: continuing to push this information exchange system forward means high costs, while under the current regulatory system no financial compensation will be made by the Tax Department, the Quality Department or the governmental budget department. Meanwhile, the complexity of the issue of information exchange costs lies in the fact that the pre-processing of data for exchange is carried out, and its cost paid, by one department, while it is another department that obtains the comprehensive benefits from the freely exchanged information. China has not yet established a reasonable compensation system for such E-Government information exchange. Thus, when only one party benefits, the information exchange may not last long. 2.3 Disordered Data Management The disordered management of E-Government data also seriously and negatively affects the feasibility of information exchange. Because China lacks overall planning and effective control over the development and construction of governmental information resources, the management and maintenance of E-Government data is in a condition of being "individually systematized". This causes problems such as incompatible data definitions, incomplete data elements and diverse data-coding methods. Solving these problems requires that information exchange interfaces be standardized, and that regulations and criteria be established for the input and output formats of the various software and hardware inside and among information systems, including interconnection criteria and communication protocols of the network, data exchange formats of the various databases, and data conversion methods between different systems. For a specific industry or government department, the emphasis should be placed on the standardization of documents, notes, account books, report formats, statistics criteria, and data transmission methods within it. The quality of an information system is closely connected to interface standardization, which is also a vital part of E-Government information exchange. Moreover, because a lot of governmental data are left almost unused for a long time and their utilization ratio and application level are very low, the management of these data remains at a primary and simple stage and cannot satisfy the necessary conditions for information exchange. For instance, in a certain municipality, when the local tax bureau requested an information exchange with the local real estate and land administration bureau with the aim of strengthening the collection of house trading tax, the exchange could not be realized simply because the house property information in the real estate and land administration bureau was still kept in paper documents, neither digitized nor standardized. 2.4 The Decentralization of Administrative Organs In China, many governmental departments are involved in the management and administration of E-Government. Among them, the National Informatization Leading
Group (NILG), the National Informatization Expert Advisory Committee (NIEAC), the State Council Informatization Office (SCIO) and the State Committee of Development and Reform (SCDR) nominally take the leading roles of general planning, coordination and supervision owing to their position in the governmental hierarchy. Their respective mandates and duties are the following: (1) The NILG is the top leading organ of China's E-Government; it is chaired by the Prime Minister and composed of heads of various ministries of the State Council and other competent organizations such as the Army and the Party. (2) The NIEAC is set up to provide advisory and evaluation opinions to the NILG. The committee consists of 48 experts and has a secretariat logistically affiliated to the SCIO. (3) The SCIO is the standing working body of the NILG. It is responsible for the strategic planning of China's E-Government, overall administrative coordination and the evaluation of E-Government project performance. (4) The SCDR is in charge of the programming and approval of E-Government projects. The majority of the money for E-Government projects in China comes from the basic infrastructure construction fund administered by the SCDR, with the remainder from the national budget managed by the Ministry of Finance. Accordingly, before an E-Government project is put into use, it must be checked and evaluated by the SCDR once construction is finished. The following chart illustrates the interrelation among China's administrative organs of E-Government. Table 1. China's Administrative Organs System of E-Government
From the above chart, it is not difficult to see that each ministry under the State Council takes concrete responsibility for constructing its own E-Government projects within its competence. This inevitably strengthens the existing role of departmental interests in allocating E-Government information resources and contradicts some general objectives and principles of E-Government: improving administrative capacity through the reconstruction of governmental operations based on E-Government, providing integrated governmental services to the public through E-Government construction, and shifting governmental administration from being procedure-oriented to being performance-oriented. Meanwhile, designating the various ministries of the State Council to construct their respective E-Government projects leaves the responsibilities of the project manager (the organ responsible for the project) vague. Because the department planning E-Government and the departments managing E-Government project construction stand at the same hierarchic rank, the former has little administrative authority either to supervise the latter or to settle wrangles and disputes among them, which renders the former's coordination function merely nominal.
3 Coordination Management Modes of the E-Government Information Exchange Faced with the above-mentioned administrative impediments, we should consider how to avoid or eliminate them by adopting appropriate coordination management modes, so as to fully realize information exchange. Based on successful instances of E-Government information exchange in China, this paper accordingly proposes four coordination management modes, each illustrated by an analysis of a successful practice. 3.1 Administrative Coordination Mode The administrative coordination mode discussed here may be defined as follows: the superior competent department discharges the responsibility of overall planning and coordination, promulgates legal rules and regulations, and adopts administrative measures to intervene in and harmonize the interest conflicts among its subordinate departments and to enforce information exchange among them. Why can a strong superior competent department successfully eliminate impediments to information interconnection through overall planning and coordination? The reason is that in China administrative intervention from a strong superior department can effectively dispel administrative resistance to breaking and reforming the dynamic balance of the original political interest system (the reform here being for the purpose of information exchange). For departments that have only a single function and lack information application advantages, information exchange weakens their functional power because of their disadvantaged situation in the E-Government environment; in contrast, for departments that have multiple functions and do not lack information application advantages, information exchange strengthens their functional power by consolidating their dominance in the E-Government environment and giving full play to their comprehensive advantages. Therefore, those disadvantaged
departments become the forces resisting the implementation of E-Government information exchange. In China, the only effective way to remove such resistance is to take a series of administrative measures to coordinate the interest conflicts among departments and remove the administrative impediments, so as to secure the smooth exchange of information. Case Study -- 44 Bureau-Level Departments in Beijing Realized Enterprise Credit Information Exchange In March 2002, the Beijing Municipality Government established a cross-department E-Government system, known as the Enterprise Credit Information Exchange, through direct administrative coordination measures. The Municipality Government first chose 15 government departments with a higher level of information application ability as pilot units. During the pilot operation, the tax collection and inspection ability of the municipal Local Tax Department was remarkably strengthened with the help of the enterprise registration information provided by the municipal Administration for Industry and Commerce. Soon after the implementation of this information exchange, the Local Tax Department handled a great number of tax-evasion cases and recovered 200 million Yuan of evaded taxes. Encouraged by these results, the Municipal Government organized another 29 subordinate bureau-level departments to join the system, and finally realized information exchange among 44 bureau-level departments under its overall planning and coordination. In August 2002, the Beijing Municipality Government issued Decree No. 106, "Administrative Rules on Enterprise Credit Information Accumulation and Publication of Beijing Administrative Organs". This Decree came into effect on 1 October 2002 and stipulated that "In accordance with the unified regulations and standards and through the special E-Government networks, the administrative organs shall provide authentic, legal and complete enterprise credit information to the user of the Enterprise Credit Information System timely and accurately, and shall bear legal liability in accordance with relevant laws. It is the municipal administrative organs' responsibility to ascertain and publicize the items, scope and standards of the information, and to gather, submit, maintain, update and manage the information needed in the system".2 From the above analysis, we can see that the final realization of information exchange among the 44 bureau-level administrative departments in Beijing relied heavily on the overall planning and coordination conducted by the Municipality Government. During the planning and implementation of this information exchange project, the municipal government took administrative measures and made mandatory regulations. The implementation of the whole project was a step-by-step process: at the beginning, the Government chose some administrative organs with a higher level of information application ability as pilots; then, after the information exchange among them succeeded, the experience was reviewed and actively popularized. 3.2 Management Mode of Constructing a Standard Conversion System Constructing a standard conversion system is a desirable mode when the information exchange providers' original data standards, their original information system
Article 10, Administrative Rules on Enterprise Credit Information Accumulation and Publication of Beijing Administrative Organs, Decree No.106 of Beijing Municipality Government.
framework and their original operational divisions should be taken into full consideration. Such a conversion system can reduce the reluctance of the information exchange participants and facilitate the successful implementation of E-Government information exchange. At present, when "information exchange" and "information integration" are mentioned, the first things considered by some governmental departments are "who is about to integrate whom?", "whose data standard will the whole system be based on?" and "whose information system will be the parent system?". Consequently, if the overall information exchange system is to take a certain government department as its center, that department will be happy to accept the system; whereas if the information exchange center is located in another department and this department is no more than a supporting player, its mood will inevitably turn from sunny to cloudy. To cope with such a stalemate, as well as to decrease the costs of information exchange, a management mode of constructing a standard conversion system, which is well explained in the following case, is recommended. Such a mode can also efficiently facilitate and improve cross-department information exchange. Case Study -- The Construction of a Standard Conversion System in China E-Port After the foundation of the China E-port (Port Electronic Enforcement System) Information Data Center in 2000, the center decided to establish a standard conversion system instead of creating a unified data standard, although the latter seemed more ideal. The main reason for this decision was the fact that the related departments had already set up information systems with different data standards before the initiation of the China E-port project. For example, the General Administration of Customs ("GAC") uses a 10-digit enterprise identification code while the State Administration of Foreign Exchange ("SAFE") uses a 9-digit code to identify an enterprise. When these two organs exchange information, we can neither request the GAC to change its 10-digit data to 9 digits, nor urge SAFE to make the change. Notably, this problem was solved by the E-port developing a data standard code conversion table, and the cost of such development was acceptable. So when this data-incompatibility problem appeared again in the process of the GAC exchanging information with the State Bureau of Quality and Technical Supervision and Quarantine, the E-port accordingly developed another data standard code conversion table. As time passed, the China E-port Information Data Center accumulated a full set of data standard code conversion tables, which ensure the free conversion of identity codes, business documents and parameter codes among the Center, enterprise users and the other 12 ministries and administrations. 3.3 Compensation Mode Because of the costs of information exchange, it is necessary to design a compensation mechanism so as to protect all participants' interests and foster a multi-win situation. During the information exchange process, the participants play different roles and thus face imbalances. On the one hand, the content and volume of information requested by each department differ -- the department that provides more and obtains less will feel the arrangement unfair, and will subsequently lose enthusiasm or even feel reluctant to exchange its information with other departments.
On the other hand, the benefits
gained from exchange are different. For example, in the aforesaid enterprise basic information exchange system, the local AIC gives the basic information of enterprises, such as registration, modification, termination (suspension) and annual review information, to the Tax Department. The Tax Department can then acquire updated information on enterprises, prevent tax evasion and increase tax revenue by strengthening market supervision. Personnel of the Tax Department will be praised and perhaps receive better salaries or rank promotions for their excellent performance. At the same time, the local AIC gets nothing, or an even worse result: losing its potential income from the "Enterprise Information Enquiry Service". This situation negatively affects the local AIC's eagerness to take part in information exchange activities. The proposed compensation mode for information exchange is conducive to solving this problem. Under this compensation mode, E-Government information may not be exchanged among certain (though of course not all) governmental agencies for free: a third party entrusted by both exchange partners, or the superior department, decides a compensation price for the exchanged information, and the beneficiary pays the information provider. Case Study -- Information Exchange Mode Between the GAC and MOFCOM The Ministry of Commerce ("MOFCOM") maintains an import and export merchandise automated monitoring system, which needs a great deal of updated basic information on import and export products. However, these basic data are controlled by the GAC. In order to administer imports and exports effectively, the GAC collects, processes and maintains almost all the original data about them with enormous financial and human inputs. If these data were exchanged with MOFCOM for free, the GAC might feel that its privileged right to provide information to the State Council and other administrations, ministries and organizations, as well as the potential interests attached to it, would be negatively affected. Under these circumstances, free information exchange would in no case be the preferred choice as far as the GAC is concerned. At present, MOFCOM buys data from the GAC every month and the price is decided through bilateral negotiation. After MOFCOM makes the payment, it obtains a user account with a password from the GAC. This account is time-limited; generally the time window is half a day, during which MOFCOM must download all the needed data to its local database. 3.4 The Coordination Mode of the Administrative Organs System of E-Government The establishment of a harmonious administrative organs system means that, on the basis of the subordinate relationships of hierarchical government organs, the power of approval, the power of construction and the power of sub-system design of E-Government projects are allocated to governmental organs at different but subordinate hierarchic ranks, so as to guarantee the smooth operation, management and supervision of the various processes in E-Government projects. Thus, the planning and coordination of E-Government information resources becomes possible. Case Study -- The Setting-Up of an E-Government Office An E-Government Office directly led by the Prime Minister is proposed to be set up to take charge of the uniform management of E-Government, on the basis of merging
the existing State Council Informatization Office and the working group on E-Government within the SCDR. As a result of the appearance of this E-Government Office, the interrelationship among China's administrative organs of E-Government may be illustrated as follows:

Table 2. Proposed Coordination Mode of the Administrative System of E-Government
The functions of this E-Government Office cover the following: (1) the administration and coordination of China's E-Government in general, including the approval of projects, the maintenance of the upper-level design, the establishment of expert panels, the appointment of project managers, the surveillance of project construction and the evaluation of project operation; (2) the publicity of E-Government, the planning of personnel training, and the coordination of government affairs reform (especially the coordination of cross-department reform projects); (3) promoting the improvement of information management functions within governmental organs and the reshaping of civil servants' attitudes toward informatization and government services.
4 Conclusions This paper proposed and discussed several practically effective management modes of E-Government information exchange. In practice, some of these coordination management modes could be incorporated into one application or project. For example, the administrative coordination mode and standard conversion system mode together with a third-party data center have been combined to construct the aforesaid
“E-port” project, and they jointly contribute to the success of this project. Firstly, during the construction of the “E-port”, the leaders of the State Council attached great importance to it and supported it, which played a key role in the construction of the project. In 2001, the State Council approved 12 departments to co-construct the E-port project and issued an official document for that purpose. This was the first information system in China promoted by the State Council through an official document. Secondly, as mentioned above, several standard conversion systems have been developed for the E-port project. At the same time, the “E-port” project itself is a public data center and data exchange platform, which aims at cracking down on smuggling, foreign currency fraud, tax evasion and other illegal activities. To achieve this goal, the General Administration of Customs, as the organizing player, launched this project in conjunction with twelve other ministries and administrations, including the Ministry of Public Security, Ministry of Commerce, State Ministry of Taxation, Ministry of Industry and Commerce, State Bureau of Quality and Technical Supervision and Quarantine, State Administration of Foreign Exchange, People’s Bank of China, Ministry of Railways, Ministry of Communications, General Administration of Civil Aviation and Ministry of Information Industry. Hence, if multiple coordination management modes are well combined and synthetically applied, better effects can be obtained in removing the massive administrative impediments to E-Government information exchange.
Aggregated Framework of Enterprise Information System Based on Synergic Theory Li Gang* and Wei Quan Information Management School, Wuhan University, Wuhan 430072 [email protected]
Abstract. Based on synergic theory and an integrated method, this paper aims to find a proper solution for aggregated information systems in manufacturing enterprises. It puts forward an integration framework for information systems according to synergic theory, based on Data Mining and Web Services technology. The framework includes two parts: the inner synergic framework, such as the EIP, and the outer one, such as the E-Commerce platform. The whole framework enhances the efficiency of communication and information transfer within an enterprise. Meanwhile, it also ensures that value is added to the enterprise by improving the performance of information collection and transfer across the sectors of the supply chain. Keywords: aggregation of information systems, synergic theory, EIP, E-Commerce.
With the fast development of information technology, more and more companies have tried to use EAI (Enterprise Application Integration) as their choice for the aggregation of enterprise information systems, in order to accumulate competitive advantages in the Internet era. This paper provides an aggregated framework for the enterprise information system, based on synergic theories and the integrated method of manufacturing enterprises. The framework utilizes Web Services to connect information systems and companies. Therefore, while improving process efficiency, we can integrate each part of the information system into the enterprise supply chain.
1 Synergic Theories Synergic theories study the relationship between parts and the whole. For a corporation, its departments are the parts and the corporation itself is the whole. For a supply chain, all the companies in it are the parts, while the supply chain is the whole. Every system is composed of separate parts, but the function of a system does not simply equal the sum of the functions provided by each part. Usually the whole system can accomplish certain functions that its parts cannot, which is called the whole effectivity. *
Project: Sponsored by Social Science Fund of China Ministry of Education (On Integrating and Sharing Information Resources. Project No. 03JB870005 ).
Because of the differences in appearance and result, the whole effectivity can be categorized into two types: the complementary effect and the synergic effect. Essentially, the complementary effect is an addition of quantity: to improve the effect of the whole, we compose the system in order to mine invisible resources and utilize them. The synergic effect, on the other hand, is an improvement of quality: to get more value from each resource, we need to utilize the recessive resources of the whole and of each part. Synergy has two meanings. First, it means synergy inside the corporation; its scope covers the business between departments, the constraints between resources, and so on, all of which need tools for cooperation and unification. Second, it means the internal and external synergy of the enterprise in the supply chain, such as the synergy among users' requirements, supplies, purchases and transactions. From the perspective of information system aggregation, the synergic effect differs under different conditions. Inside a corporation, it works when we utilize and mine the information in each department and provide evidence for decision making, so that the competitiveness of the enterprise increases. For a supply chain, the synergic effect shows in a different way: based on the mechanism of information system aggregation, we can obtain information resources in time and cooperate with each system.
2 How to Aggregate Information Systems The technology of aggregating information systems has been developed for about twenty years, and many researchers have provided architectures or solutions for information system aggregation. The aggregation of the enterprise information system can be studied along two dimensions, scope and depth; depending on the dimension chosen, the targets of the aggregation are different. Taking the scope of the aggregation, we find several methods, from the easiest to the hardest:
[1] Information system aggregation inside a department.
[2] Information system aggregation among different departments.
[3] Information system aggregation on the enterprise level.
[4] Information system aggregation with stable partners.
[5] Information system aggregation with random partners.
In the view of the depth of the aggregation, we provide the following methods, from the easiest to the hardest: [1] Data aggregation. In order to accomplish application aggregation and the aggregation of business processes, we must first resolve data and database aggregation problems. Before aggregation, we must identify the data and build the catalog, then define the metadata model. After these three steps are finished, the data can be distributed and shared across the database systems.
[2] The aggregation of application systems. We can provide real-time aggregation of data and functions between different applications. For example, we can aggregate CRM, enterprise applications and the Web during B2B aggregation and build an e-commerce site that makes full use of application resources. [3] The aggregation of business processes. When aggregating business processes, the transport of different kinds of business information among different business systems should be defined, accredited and managed, in order to improve operations, reduce cost, and speed up response. The aggregation of business processes involves business management, process simulation, the workflow composed of synergic tasks, processes, organizations and information, and the tools involved in each step of the business process. Although the methods for aggregating enterprise information systems are of various kinds and the process is complex, the basic idea remains the same. Firstly, based on data aggregation, we should accomplish information aggregation among the enterprise departments. Then, after the aggregation of application systems and of business processes, we can aggregate the internal and external information systems in the supply chain. This kind of aggregation appears as an EIP (Enterprise Information Portal) and, with the help of synergic e-commerce, it improves the supply chain.
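As a rough illustration of the data aggregation step described above (identify the data, build a catalog, define a metadata model), the following Python sketch shows one possible shape of a shared catalog; the class name, fields and sample entries are our own assumptions rather than part of the framework described in this paper.

```python
# Hypothetical sketch of a shared data catalog; names and fields are assumptions.
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    data_id: str        # identified data set
    source_system: str  # owning departmental system (ERP, CRM, ...)
    schema: dict        # metadata model: field name -> type
    subject: str        # business subject used for sharing and lookup

catalog = [
    CatalogEntry("orders-2006", "ERP", {"order_id": "str", "amount": "float"}, "sales"),
    CatalogEntry("contacts", "CRM", {"customer_id": "str", "name": "str"}, "customers"),
]

# Other departments can locate shareable data through the catalog.
sales_sets = [e.data_id for e in catalog if e.subject == "sales"]
print(sales_sets)
```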
3 The Aggregation Framework of Information Systems for the Manufacturing Corporation The development levels of manufacturing corporations differ, but the most important and basic condition for the aggregation of information systems is that the corporation has already started an information strategy. An accomplished data aggregation means that the information system for the corporation's own business has been built. In order to aggregate the information systems in a manufacturing corporation, we need to address the management of material purchasing, production, product logistics, product sales, customer service and supplier service. In this article, based on synergic theory, we put forward our solution for the aggregation of information systems in the manufacturing industry. Taking different synergic views, we can divide the synergy of the aggregation into two kinds: inner synergy and outer synergy. 3.1 The Inner Synergic Framework for the Aggregation of Information Systems The inner synergy needs an EIP platform in order to aggregate the information in the different departments or systems of the corporation, as well as ordinary information from outside the corporation. The modules of the platform comprise: e-Documents, e-CRM,
e-Logistics, e-Project, e-HRM, e-Financials, and e-Procurement. These modules are the basis of the corporation's daily work, and the EIP platform supplies data for them. The modules are the final appearance of the EIP; once the EIP is built, it aggregates information from enterprise finance, ERP, KM systems and other related MIS systems.
Fig. 1. The inner synergic framework for the aggregation of information systems
e-Documents: e-Documents provide a whole KM framework, allowing people in the corporation to share information and to place, store and create documents regardless of place and time. All documents can be found in the Web browser, and documents of different ranks are accessible to different persons. The e-Documents system makes communication within and around a corporation much easier; the corporation can build public web sites and an intranet, and provide different contents for different users. The combination of friendliness and flexibility makes this module suitable for any kind of organization. e-CRM: e-CRM covers all the functions of traditional CRM products. After aggregating customers with services, sales, products and finance, we can learn more about the customer. Using this Internet-based CRM solution, the corporation can manage the relationship information about contracts, communication, documents and requirements among the corporation, the customers and the partners. e-Logistics: e-Logistics manages products, services and prices, and shares information between the Intranet and the Internet.
e-Project: e-Project manages the activities and resources related to projects and studies the efficiency of the organization with the help of a matrix view. Because of its aggregation with the other modules, e-Project can manage all the related resources. e-HRM: Besides the functions of traditional HRM systems, e-HRM maintains the information related to roles and authorizations. With the help of the other aggregated modules, e-HRM lets employees access the system over the Internet, and each employee can easily track the products, customers, documents and requests related to his or her name. e-Financials: e-Financials can automatically collect and process data on both the Intranet and the Internet, analyze the efficiency of the organization online, and combine centralized control with local execution. e-Financials ensures the validity of the data from individuals and departments in order to produce detailed analysis reports. Taking the Internet as the communication center, data in e-Financials can be collected automatically during the transaction process. e-Financials helps corporations to analyze their financial performance, to control the company's local execution, and to supply financial reporting functions to users. e-Procurement: The combination of e-Procurement and e-HRM ensures electronic management of all processes, reduces procurement costs, provides well-designed views and reports, and fits the requirements defined by the corporation for its management and processes. By building the synergic network among the different modules and business systems, the inner synergic framework for the aggregation of information systems achieves the goal of managing the corporation's related resources as a whole, using Web Services. In order to maximize the effects, we also need to use technologies such as data warehousing and data mining to upgrade the corporation's competitiveness. 3.2 The Outer Synergic Framework for the Aggregation of Information Systems The aggregation, which appears as synergic e-commerce, builds an outer platform for B2B, B2C, logistics, CRM and other related business systems, and enhances the exchange of information among the corporations along the supply chain by releasing and storing all the internal and external information, so that value can be added to the corporation during this process. Customer Portal: This is an interactive platform for end-users that makes the interaction with users automatic and intelligent. When users' knowledge is used to help the corporation make decisions for users, the related effect can be called user intelligence. The Customer Portal supports the production, distribution and use of user knowledge, and it can form a complete user intelligence system when it aggregates with other application systems and portals. Logistics Portal: This is an interactive platform for a third-party logistics company or the logistics department of a company; it supplies the whole logistics service and provides dynamic control of the logistics process. The aggregation between the Logistics Portal and the other portals achieves unified management and adds value to the corporation during the manufacturing process.
Fig. 2. The outer synergic framework for the aggregation of information systems
Reseller Portal: The Reseller Portal is an aggregation of synergic e-commerce and the corporation's traditional marketing channels. By cooperating with DRP systems and logistics systems, it supports the upgrading of the traditional marketing channels. As the interaction platform between the corporation and its resellers, this portal should manage sales data with the help of the Customer Portal and the Logistics Portal, in order to guide the production process. Supplier Portal: This is the basis of global e-procurement and e-tendering. Using the Internet and Web Services technology, we can ensure dynamic mutual selection between suppliers and corporations and reduce manufacturing cost. Another function of the Supplier Portal is that it guarantees the stability of supplier services through cooperation rules. EC Portal: The EC Portal, which aggregates with the inner synergic framework, is the key point of the outer synergic framework. It provides a distinctive marketing service and offers each type of product a chance to be displayed. Combined with the Logistics Portal, it supplies automatic transaction services; at the same time, it provides a new way to collect information about customers' consumption behavior. All the portals, such as the Customer Portal, Logistics Portal, Reseller Portal, Supplier Portal and EC Portal, are the appearances of the outer synergic framework for the aggregation of enterprise information systems. They guarantee that the corporation's customers, suppliers and cooperators can access services and customized information in a simple way. The outer synergic framework utilizes the synergic network based on the supply chain to optimize resource allocation and enhance the supply chain's ability to improve. By mining the data related to customer services and supply chain management, we can make supply chain processes more automatic with the help of Web Services technology.
4 Conclusion There are two main problems in the aggregation of enterprise information systems, and both of them concern information: one is aggregation and the other is reuse. To aggregate data, the corporation should choose a standard format for the data; meanwhile, Web Services is the key technology in the aggregation process. Based on a uniform interface standard, Web Services access different information systems and platforms and adopt certain strategies to present all the collected information on the same platform. With the help of UDDI-based Web Services management, Web Services can be dynamically added and removed, and the number of services in the supply chain can grow. To reuse information, we should also focus on data mining technology. Its task is to mine the information in the corporation's different databases, to mine information sources outside the corporation in the supply chain, to build a data warehouse with this information and to provide reports to support the enterprise's decision making. The aggregation of enterprise information systems is an effective and widely used way to resolve the problem of "information islands". Based on synergic theory, the aggregation framework of enterprise information systems not only meets the present requirements of enterprise development, but also meets future aggregation requirements. With the help of Web Services technology, the aggregation gains dynamic features, and with data mining technology, the self-improving ability of the aggregation is enhanced. During the dynamic aggregation process in the supply chain, we can help corporations enhance their competitiveness and achieve the goal of adding value.
Web Personalization Based on Artificial Psychology Quan Lu1,3, Jing Chen2,3, and Bo Meng3 1
School Of Information Management, Wuhan University, Wuhan, 430072, P.R. China [email protected] 2 Department Of Computer Science, Central China Normal University, Wuhan, 430079, P.R. China 3 School Of Computer Science, Wuhan University, Wuhan, 430072, P.R. China
Abstract. Web personalization is one of the most popular ways of increasing traffic at web sites, and usage-based Web personalization is widely applied today. However, most usage-based Web personalization systems require many resources, such as time and memory, and are not always reasonable. In this paper we describe an approach to psychology-based Web personalization that takes into account the full spectrum of Web mining techniques and activities. Our approach is described by the architecture shown in Figure 1, which makes heavy use of data mining techniques, thus making the personalization process both automatic and dynamic, and hence up-to-date. Artificial psychology techniques are used to support more effective and more precise transaction clustering.
technologies about users' psychology have been considered in personalization. One main purpose of Artificial Psychology (AP) is to imitate human psychology with computers and, at the same time, to provide subjective evaluations of objective things. This kind of research in Web personalization is still at an early stage. In Section 2 we present a general architecture for psychology-based Web personalization, in which a cognitive expander applies Artificial Psychology to study users' psychology and to direct Web personalization, including personalized retrieval and the user interface. In Section 3, AP mathematical models applied to computer systems are analyzed and synthesized to study users' psychology: collections of adjectives are abstracted to express users' feelings about commodities; then, by determining the characters of the commodities and finding their numerical representations, the quantification relationship between the adjective pairs and the characters is constructed. In Section 4, transactions are clustered based on artificial psychology, which can be applied to direct retrieval and UI personalization. The conclusions are stated in Section 5.
2 General Architecture for Psychology-Based Web Personalization A generalized architecture for the system is depicted in Figure 1. The overall process of psychology-based Web personalization can be divided into two components. The Cognitive expander comprises the artificial psychology tasks, which result in a user psychology file, and the transaction clustering tasks based on a k-means algorithm using user psychology. We now discuss the details of the Cognitive expander. User preferences comprise the notion of relevance and the specific interests (and thus the expectations) of users. In today's systems, the characterization of such preferences in particular suffers from a lack of intuition; often they are restricted to the mere declaration of keywords. For advanced personalization of Web and information systems, cognitive knowledge has to be used. Obtaining an overall psychological utility consists of several steps [2]: 1) the internal representation, and thus notion of relevance, is needed for each user; 2) the subjective utility has to be determined by considering the intention, situation, domain, etc.; 3) the subjective probability that a preference is applicable has to be estimated; 4) the subjective expected utility can be used to compare preferences and decide how to expand the user need. Each user session in a user session file can be thought of in two ways: either as a single transaction of many page references, or as a set of many transactions each consisting of a single page reference. The goal of transaction identification is to dynamically create meaningful clusters of references for each user. Based on an underlying model of the user's browsing behavior, each page reference can be categorized as a content reference, an auxiliary (or navigational) reference, or a hybrid. In this way different types of transactions can be obtained from the user session file, including content-only transactions involving references to content pages, and navigation-content transactions involving a mix of page types. The details of the methods
for transaction identification are discussed in [8]. For the purpose of this paper we assume that each user session is viewed as a single transaction. And the session file may be filtered to remove very small transactions and very low support references (i.e., URL references that are not supported by a specified number of user transactions). This type of support filtering can be important in removing noise from the data, and can provide a form of dimensionality reduction in clustering tasks where URLs appearing in the session file are used as features.

Fig. 1. General Architecture for Psychology-Based Web Personalization (components: user transaction file, user session files, Utility Assessment, Representation Generation, Psychology Analyzer, Cognitive Enhancer, transaction clustering based on psychology, Query Builder, Web Builder, Cognitive expander, database query, Web file)
Based on the user model in [2], the user queries and answers in the user transaction file are enhanced with cognitive psychology and the subjective expected utility is assigned. Considering the utilities of different combinations of preferences within a query, a trade-off between query complexity or processing costs and the expected improvement of the query result has to be optimized. Comparing subjective expected utilities eventually leads to two decisions. One concerns the retrieval strategy, for which an adequate query can be generated; this query is processed by the database retrieval system and the results are returned to the user. The other concerns the display strategy, for which the Web composition, such as color, can be built. The essential component is the Psychology Analyzer, which refines the transaction file and passes the user psychology to the Cognitive Enhancer. The Cognitive expander aims at the implementation of a Query Builder and a Web Builder based on psychology knowledge, which allows complex query building in tight cooperation with the user and self-adaptation of the user interface according to individual psychological preferences.
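To make steps 2) to 4) above more concrete, the following Python sketch ranks candidate preferences by subjective expected utility (subjective probability times subjective utility) and keeps only as many as a query-complexity budget allows; the preference terms, the numbers and the simple product formula are illustrative assumptions of ours, not the model of [2].

```python
# Illustrative only: candidate preferences as (term, subjective utility, probability).
preferences = [
    ("lovely", 0.8, 0.9),
    ("magnificent", 0.6, 0.4),
    ("modern", 0.7, 0.7),
]
budget = 2  # hypothetical limit on extra query terms (processing cost)

# Rank by subjective expected utility and expand the query with the best terms.
ranked = sorted(preferences, key=lambda p: p[1] * p[2], reverse=True)
expansion_terms = [term for term, utility, probability in ranked[:budget]]
print(expansion_terms)
```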
3 Psychology Analyzer The Psychology Analyzer studies users' psychology using an Artificial Psychology model. To implement the personalization of the Web, the Cognitive Enhancer should remember the information (query and evaluation) of the former cases of every user, and analyze these queries and evaluations to gain experience about the user's psychology. A repository is used to record the evaluations of the former cases of each user. The Psychology Analyzer uses Artificial Psychology (AP) to analyze the user's preferences. First, the Representation Generation collects the adjectives used by the user to describe what he or she wants and whether he or she is satisfied with the service, such as "lovely-bothersome" and "magnificent-plain". However, not all adjective pairs are necessary; some adjective pairs are closely correlated, so the representative pairs should be abstracted by factor analysis. The repository records the numeric evaluation value of every case corresponding to each adjective pair. Second, the user fills in a questionnaire to describe the service that he needs, which forms a new query. Finally, the Cognitive Enhancer builds the appropriate query and Web presentation according to the user's psychology.

Table 1. Average of the representative pairs

                     Case 1   Case 2   Case 3   Case 4
Expensive-cheap      0.1      0.5      0.8      0.5
Romantic-prim        0.3      0.6      0.9      0.2
Modern-outdated      0.4      0.7      0.8      0.8
...                  ...      ...      ...      ...
Table 2. Characters determination and numerical representation

              red    green   blue   lightness   cold or warm
sample 1      65     108     213    30          30
sample 2      108    120     122    20          70
sample 3      234    67      101    40          60
sample 4      196    19      210    70          30
sample 5      23     100     255    20          10
For example, suppose a user wants to find a commodity on the Internet. First, there is a user preference value corresponding to each adjective pair; we call these the averages of the representative pairs, as shown in Table 1. The values can be read like this: for expensive-cheap, 0.1 represents cheapest, 0.5 represents neither expensive nor cheap, and 0.9 represents most expensive. Next we find the numerical representations.
Then we select red, green, blue, lightness, and cold-or-warm as the items. We measure the color values (0-255) with Photoshop; the value of lightness and the value (0-100) of cold-or-warm are obtained from the questionnaire. Table 2 gives the result, from which the corresponding reactor matrix is formed.
With the help of Quantification Theory I, each case can be written in the form of a reactor matrix. According to the formula of Quantification Theory I:

b = (X'X)^{-1} X'Y  (2)

we obtain the quantification relationship between the adjective pairs and the cases, where Y is a column vector whose entries are the averages of the previous evaluations corresponding to an adjective pair (as in Table 1), and X is the reactor matrix. All the values of b corresponding to the adjective pairs are obtained in this way. By replacing b with its expression, we can calculate the predicted evaluation \hat{Y} of a case with the predictive formula

\hat{Y} = Xb  (3)
The evaluations are stored in the Cognitive Enhancer's repository. When a query comes, the Cognitive Enhancer relies on its repository to direct the transaction clustering, which in turn drives the retrieval strategy and the UI strategy; this direction becomes more and more accurate over time. The consequence is an improving self-evolution process between the user and the web site.
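A minimal numerical sketch of formulas (2) and (3) follows. The reactor matrix X and the evaluation vector Y below are hypothetical (the X values are simply the RGB entries of Table 2's samples 1-4 rescaled to [0, 1], paired with the expensive-cheap averages of Table 1 purely for illustration), and the exact construction of the reactor matrix under Quantification Theory I is not reproduced here; the sketch only shows the least-squares computation of b and the prediction.

```python
import numpy as np

# Hypothetical reactor matrix: rows = cases, columns = characters in [0, 1].
X = np.array([
    [0.25, 0.42, 0.84],
    [0.42, 0.47, 0.48],
    [0.92, 0.26, 0.40],
    [0.77, 0.07, 0.82],
])
# Average evaluations of one adjective pair (e.g. expensive-cheap) per case.
Y = np.array([0.1, 0.5, 0.8, 0.5])

# b = (X'X)^{-1} X'Y; lstsq returns the same least-squares solution and is
# numerically safer when X'X is close to singular.
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ b  # predictive formula (3)
print(b, Y_hat)
```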
4 Transaction Clustering Based on Psychology We use a multivariate k-means algorithm to obtain transaction clusters. The evaluations computed in (3) and stored in the Cognitive Enhancer's repository act as the inputs to the k-means algorithm. K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations cause different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycenters of the clusters
resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has thus been generated; as a result of this loop, the k centroids change their location step by step until no more changes occur, in other words until the centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function:

J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2  (4)

where \| x_i^{(j)} - c_j \|^2 is a chosen distance measure between a data point x_i^{(j)} and the cluster centre c_j, and J is an indicator of the distance of the n data points from their
respective cluster centres. The algorithm is composed of the following steps: 1) Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids. 2) Assign each object to the group that has the closest centroid. 3) When all objects have been assigned, recalculate the positions of the K centroids. 4) Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
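A compact Python sketch of the four steps above is given below. The input vectors reuse the illustrative evaluation values of Table 1 as hypothetical transaction representations; the function and variable names are ours, and a production system would of course use a tuned k and a stricter convergence criterion.

```python
# Minimal k-means sketch over evaluation vectors (illustrative, not the paper's code).
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k initial centroids from the data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical evaluation vectors for four transactions (values borrowed from Table 1).
X = np.array([[0.1, 0.3, 0.4], [0.5, 0.6, 0.7], [0.8, 0.9, 0.8], [0.5, 0.2, 0.8]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```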
5 Conclusion In this paper, advanced Web personalization using transaction clustering and artificial psychology techniques is investigated. Since preferences of all types have become the centerpiece of personalized information systems, future architectures have to provide adequate UIs for modelling and inputting preferences. However, psychological experiments show that users experience difficulties with complex preference queries, and transaction clustering based on usage alone is too time-consuming and vague. Thus systems should also use cognitive techniques and artificial psychology techniques to support users effectively. In constructing the model, we synthesize data mining techniques and AP mathematical theories, and integrate them with cognitive theory to form the general architecture model for psychology-based Web personalization. The analysis of the Psychology Analyzer, together with techniques in which user preferences are automatically learned from Web usage data by a k-means algorithm using user psychology, leads to two kinds of Web personalization. One concerns the retrieval strategy, for which an adequate and precise query can be generated; the other concerns the display strategy, for which the Web composition, such as color pigmentation, can be built. These are our future works.
References 1. Wang, Z.: An Expert System of Commodity Choose Applied with Artificial Psychology. IEEE (2001) 0-7803-7087-2/01 2. Balke, W.-T.: A Roadmap to Personalized Information Systems by Cognitive Expansion of Queries. http://www.l3s.de/ 3. Wang, Z.: A Computer System Designed for Fashion Fitting Based on an Artificial Psychology Model. In: Proceedings of the 3rd World Congress on Intelligent Control and Automation. IEEE (2000) 0-7803-5995-X/00 4. Xue, W.: The Research of Artificial Psychology Model. In: Proceedings of IEEE TENCON'02. IEEE (2002) 0-7803-7490-8/02 5. Uden, L.: Designing User Interfaces Using Activity Theory. In: Proceedings of the 34th Hawaii International Conference on System Sciences. IEEE (2001) 0-7695-0981-9/01 6. Mobasher, B., Cooley, R., Srivastava, J.: Automatic Personalization Based on Web Usage Mining. Communications of the ACM (8) (2000) 142-151 7. Mobasher, B., Dai, H., Luo, T., Nakagawa, M., Sun, Y., Wiltshire, J.: Discovery of Aggregate Usage Profiles for Web Personalization. In: Proceedings of the Web Mining for E-Commerce Workshop, Boston (2000) 8. Cooley, R., Mobasher, B., Srivastava, J.: Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems 1(1) (1999)
Grid Resource Management Based on Mobile Agent Xiaobo Tang1 and Li Huang2 1
School Of Information Management, Wuhan University, 430072, Wuhan, P.R. China [email protected] 2 School Of Information Management, Wuhan University, 430072, Wuhan, P.R. China [email protected]
Abstract. In this paper, basic concepts of mobile agent and resource management system are first described, and then traditional models of grid resource management system are discussed. Finally, a new grid resource management model based on mobile agent is proposed, meanwhile, an architectural model based on mobile agent for acquisition of grid information and an optimal algorithm for resource access are also put forward.
This paper is oriented to computing grid. The discussion concerns the tasks that have large computation but little data. In these tasks, the most precious resource is CPU; therefore the computing speed should be given the most consideration in resource management. In this paper, a new grid resource management model based on mobile agent is proposed; meanwhile, an architectural model for acquisition of grid information based on mobile agent is established with description and analysis.
2 Traditional Grid Resource Management Model and Analysis Although there are many differences between domestic and overseas research on the traditional grid resource management and access model, they can be simplified as shown in Fig. 1. Messaging server: it provides functions for grid resource information such as organizing, storing, updating and retrieving, and presents a single logical view of the resources. Resource node: it stands for a resource provider, which can dynamically join or exit the grid organization. The resource information is registered with the messaging server through register agreements (usually static information such as the number of CPUs, CPU clock speed, physical RAM or the version of the operating system); then the resource node runs a dynamic information collecting program to acquire dynamic resource information (including the capacity of available RAM, the load condition, instant computing performance indexes, the on or off status of node computers, computing service permission, etc.), and submits this information regularly to the messaging server. The messaging server processes this dynamic information and forecasts the recent performance of the resources to provide a reference for resource allocation. Resource broker: it receives job information submitted by the grid user (including the job's resource needs) and queries the messaging server for the required resources. It then obtains the information, submits the jobs to the selected resource nodes for processing, receives the results and sends them back to the grid user.
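As a simple illustration of the registration and reporting just described, the sketch below shows what a resource node's static registration record and periodic dynamic report might contain; the field names and the in-memory stand-in for the messaging server are assumptions made only for this example.

```python
# Hypothetical in-memory messaging server keeping per-node static and dynamic info.
static_registry = {}
dynamic_registry = {}

def register_node(node_id, cpu_count, cpu_clock_ghz, ram_gb, os_version):
    """Static information submitted once when the node joins the grid."""
    static_registry[node_id] = {
        "cpu_count": cpu_count,
        "cpu_clock_ghz": cpu_clock_ghz,
        "ram_gb": ram_gb,
        "os_version": os_version,
    }

def report_dynamic(node_id, free_ram_gb, load, perf_index, online, service_allowed):
    """Dynamic information submitted regularly by the node's collector program."""
    dynamic_registry[node_id] = {
        "free_ram_gb": free_ram_gb,
        "load": load,
        "perf_index": perf_index,
        "online": online,
        "service_allowed": service_allowed,
    }

register_node("nodeA", cpu_count=4, cpu_clock_ghz=2.4, ram_gb=8, os_version="Linux 2.6")
report_dynamic("nodeA", free_ram_gb=5.2, load=0.35, perf_index=0.8,
               online=True, service_allowed=True)
print(static_registry["nodeA"], dynamic_registry["nodeA"])
```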
Fig. 1. Traditional grid resource management model
In the model shown in Fig. 1, the messaging server has to deal with registration requests from the resource nodes as well as query requests from the resource broker; besides, it has to store and analyze the dynamic information. As the number of resource nodes increases sharply, the performance of the information server becomes the bottleneck of the whole
system. In order to solve the problem, this article proposes the information resource acquisition model based on mobile agent.
3 Grid Resource Management Model Based on Mobile Agent Compared with previous methods of dynamic resource collection, the mobile-agent-based method transmits only a small amount of operational code and status to the resource node, executes performance analysis and forecasting for resource nodes locally, and sends the results back to the messaging server, thus reducing the network load. In addition, when grid nodes are withdrawn due to exceptions, the mobile agent will not reach them but can record this status internally. On returning to the messaging server, it signs off those nodes, which means the resource messaging server no longer has to keep all the signals of every resource node. As the membership of grid organizations grows constantly, the advantages of the model become obvious. The mobile-agent-based grid information collecting model is shown in Fig. 2 and consists of three parts:
Fig. 2. Mobile-agent-based Grid Resource Collecting Model
1. The messaging server creates a corresponding mobile agent according to the resource characteristics, and assigns a specific constraint condition and route rule for it. It then dispatches the mobile agent, making it move and collect the required information from each resource node, and finally processes the data that the agent sends back.
2. The resource node provides the mobile agent execution environment and a system agent. The execution environment accepts the mobile agent and provides access to local resources. The mobile agent and the system agent cooperate to complete the information collecting task assigned by the resource messaging server.
3. The mobile agent interacts with the system agent and moves among the resource nodes according to the route and strategy set beforehand by the resource messaging server, to process the dynamic information.
3.1 Structural Model of the Resource Node The resource node provides a safe and correct movement environment for the agent. It realizes the movement of the mobile agent, the establishment of the agent performance status, mobile agent initialization, the stipulation mechanism of the mobile agent, the fault-tolerant strategy, security control and the communication mechanism, and provides basic service modules such as: event service, yellow page service,
transaction service and domain naming service (DNS), etc. The mobile agent is required to complete the mission appointed by the information server. Meanwhile, it also has to realize such tasks as mobile semantics (namely the selection of the target host or route), security control (self-protection), communication with external resources (the mobile agent service environment and other agents), and some other functions. It may move from one mobile agent platform to another, execute tasks between them, and communicate with other mobile agents or access the services provided by the mobile agent platform via the Agent Communication Language (ACL). 3.2 Structural Model of the Mobile Agent The structural model of the mobile agent is illustrated in Fig. 3. The mobile agent is composed of six interdependent modules: security agent, environment interaction, task resolving, internal status, constraint conditions and route strategy.
Fig. 3. The Structure Model of Mobile Agent
The outer layer of the structure is the security agent, which mediates between the mobile agent and the resource node when they communicate with the environment. The security agent module executes the security strategy of the mobile agent and prevents illegal access to the mobile agent from resource nodes. The mobile agent recognizes the resource node environment and acts on it through the environment interaction module, which implements the ACL semantics and ensures correct communication and coordination between the mobile agent and agent service facilities using the same ACL; the content of the communication, however, has nothing to do with ACL. The task resolving module includes the performing module and the reasoning methods and rules
(the forecast parsing algorithm) related to the mobile agent's missions. It analyzes and forecasts the performance and behavior of the mobile agent. The internal status, which is the current condition of the mobile agent in the process of execution, influences the process of task resolving; meanwhile, task resolving also feeds back on the internal status and stores the results of analyzing and forecasting the resource node performance. The constraint conditions are formulated by the creators of the mobile agent to ensure its behavior and performance, such as the return time, the resident time and the degree of mission completion; generally speaking, only the creators have the authority to alter the constraint conditions. The route strategy determines the mobile agent's moving route; it may be a static list of service facilities, or a dynamic route based on rules to pursue the solution of complicated and indefinite missions.
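The six modules of Fig. 3 could be organized roughly as in the following Python sketch. This is not the authors' implementation; the class, the method names and the toy forecast are assumptions used only to show how the modules might relate to one another.

```python
# Illustrative sketch of the six-module mobile agent structure (names assumed).
from dataclasses import dataclass, field

@dataclass
class MobileAgent:
    constraints: dict      # constraint conditions: return time, resident time, ...
    route_strategy: list   # static host list or dynamic routing rules
    internal_status: dict = field(default_factory=dict)  # current execution state

    def secure(self, message):
        """Security agent: apply the agent's security policy to a message."""
        return message  # placeholder policy

    def interact(self, host_env, message):
        """Environment interaction: exchange ACL messages with the node's system agent."""
        return host_env.handle(self.secure(message))

    def resolve_task(self, host_metrics):
        """Task resolving: analyze/forecast node performance and record it in the status."""
        forecast = sum(host_metrics.values()) / max(len(host_metrics), 1)  # toy forecast
        self.internal_status["last_forecast"] = forecast
        return forecast

agent = MobileAgent(constraints={"return_by": "12:00"}, route_strategy=["nodeA", "nodeB"])
print(agent.resolve_task({"cpu_load": 0.4, "free_ram": 0.7}))
```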
4 Grid Resource Management Model Based on Mobile Agent As a tool for managing distributed grid resources, the mobile agent is a middleware layer between users and grid resources. For grid resources, the mobile agent establishes a resource pool to store related resource information and uses corresponding methods and tools to manage them. For users, it provides a standard visual interface to share users' resources; here, the mobile agent mainly provides users with a logical view of resources. As demonstrated in Fig. 4, the resource managing model consists of the user layer, the mobile agent resource managing layer and the wide area resources layer. In the user layer, users or application systems send requests to the mobile agent through a service-oriented view. A user's request can be used to establish one or more sessions and can be expressed in a resource description language, which describes the selection of resources, the establishment of task sessions and the control of tasks. The task assigning agent accepts the request and develops a resource allocation plan according to the various application rules, the user's request and the authority list. It then conveys the resource allocation plan and the user's request to the controlling agent, which arranges the allocation plan according to the condition of the mobile-agent-based resource management system and divides the plan into a series of functions and work procedures. The other agents coordinate their work to complete the user's request according to these work procedures. This arrangement improves the system's performance and intelligence, and improves the reliability of grid computing and the execution efficiency. Under the management of the controlling agent, the resource discovery agent gets the corresponding resource information from the wide area resource layer and conveys it to the resource organizing agent, which accepts it and informs the integrating agent. The integrating agent combines the various kinds of information and generates the corresponding services. Finally, the resource allocation agent transmits the resource information, service information and all kinds of resource distribution lists to the users or application systems via the service-oriented view. The monitor agent monitors the condition of all the agents and generates corresponding reports for the controlling agent as a reference for control and regulation.
Fig. 4. Management Model of Grid resources based on Mobile Agent
5 Grid Resource Discovery Based on Mobile Agent The moving strategy of the mobile agent mainly refers to the dynamic development of the moving path according to the task of the mobile agent and the external environment, such as the current network load and server load, so that the mobile agent can complete its task with the best performance at the least expense and time. Moving strategies may generally be divided into static routing strategies and dynamic routing strategies. In static routing, the hosts and the visiting order of the mobile agent are determined by the information server before the tasks are given. In dynamic routing, the information server generally assigns an initial route table first, and the mobile agent can modify the route table while moving along it, according to changes in the environment. The dynamic routing way manifests such features of the mobile agent as reflection,
environment-orientation, and autonomy. The common moving strategies available are the travel plan and the dynamic path plan. 5.1 Travel Plan The travel plan is a kind of static routing strategy composed of a number of procedures, each procedure being made up of 5 parts: moving way, destination, execution condition, entrance method and plan updating. The composition of a travel procedure is shown in Fig. 5.
Fig. 5. Travel Step
The movement of the agent under a travel plan may be described as follows: the agent will visit the hosts designated by "Destination" according to the approach assigned by "Migration Mode". Whenever a new host is reached, the objects of the agent are deserialized and the methods or conditions given by "Execution Condition" are evaluated. If the result is true, the method designated by "Access Method" is called and the execution of the agent is restored; otherwise, the "Access Method" is skipped and the agent continues moving to the next destination. Finally, when local execution is finished, either the agent moves further according to the plan, or a new plan is created according to the method of "Itinerary Modification". For example, when the agent moves from host A to host B, the accessibility is tested first; if B cannot be accessed, the travel plan is modified. The travel plan is merely a route table. The basic element of the travel plan is called a travel entrance, which is expressed as a simple two-tuple E(host, method). The operations to be performed on the intended hosts are defined by the methods, and the moving order of the mobile agent is determined by the order of the two-tuples. In view of different migration demands, the structured migration mechanism provides three kinds of motion semantics: sequential migration (SEQ), selected migration (SEL) and distribution (ALL). These three ways are similar to the sequential structure, branch structure and loop structure in structured programming: the SEQ way corresponds to the sequential structure, the SEL way corresponds to the branch structure, and the ALL way corresponds to the loop structure. Under the SEQ way, the agent visits all destinations sequentially, which is similar to the execution of every command in a sequential structure. Under the ALL way, the execution at n hosts by n "cloned" agents is equivalent to repeating the execution of the agent n times on different data sets, which is similar to an n-time loop. It is worth pointing out that the ALL way has potential advantages for parallel processing in mobile agent technology.
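One possible reading of the E(host, method) entries and the SEQ/SEL/ALL semantics is sketched below in Python. The SEL interpretation (stop at the first host whose method succeeds) and the sequential stand-in for ALL are our assumptions, since a real system would clone agents and run them in parallel.

```python
from typing import Callable, List, Tuple

# A travel entrance E(host, method): the two-tuple element of the travel plan.
Entry = Tuple[str, Callable[[str], bool]]

def run_itinerary(entries: List[Entry], mode: str = "SEQ") -> None:
    if mode == "SEQ":    # sequential: visit every destination in order
        for host, method in entries:
            method(host)
    elif mode == "SEL":  # branch: visit hosts until one method succeeds (assumed reading)
        for host, method in entries:
            if method(host):
                break
    elif mode == "ALL":  # loop/clone: apply the method at every host
        for host, method in entries:  # stand-in; real clones would run in parallel
            method(host)

collect = lambda host: print(f"collecting resource info at {host}") or True
run_itinerary([("nodeA", collect), ("nodeB", collect)], mode="SEQ")
```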
5.2 Dynamic Path Plan Here, a description class for the visiting path, the itinerary, is defined, which encapsulates the methods for the dynamic path plan. The getNextHost method plans the dynamic path according to the acquired task semantics, the condition of the network transmission resources, the condition of the host resources, and the travel experience (ideally, it should take parameters covering all of these aspects).
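A hedged sketch of what such a getNextHost method might look like is given below; the scoring rule, weights and dictionary parameters are assumptions of ours, not the algorithm referred to in the text.

```python
# Hypothetical scoring-based next-hop selection for the dynamic path plan.
def get_next_host(candidates, network_load, host_load, experience):
    """Pick the candidate host with the best combined score (higher is better)."""
    def score(host):
        return (experience.get(host, 0.5)      # past success on this host
                - network_load.get(host, 0.0)  # current network transmission load
                - host_load.get(host, 0.0))    # current server load
    return max(candidates, key=score) if candidates else None

next_host = get_next_host(
    ["nodeA", "nodeB"],
    network_load={"nodeA": 0.2, "nodeB": 0.6},
    host_load={"nodeA": 0.3, "nodeB": 0.1},
    experience={"nodeA": 0.9},
)
print(next_host)  # nodeA scores 0.4 versus -0.2 for nodeB
```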
6 Conclusion Grid computing is developing along with the Internet. The grid is a massive supercomputer that integrates the entire resources of the Internet and can realize the comprehensive sharing of various resources, such as computing resources, storage resources, data resources and information resources, thus eliminating resource islands. The resource server of the traditional grid resource management model must process both registration requests from resource nodes and retrieval requests from the resource agent; meanwhile it also has to store and analyze the dynamic information data. When the number of resource nodes increases rapidly, the resource information server may become the bottleneck of the whole system. The mobile-agent-based method transmits only a small amount of operational code and status to the resource node, executes performance analysis and forecasting for resource nodes locally, and sends the results back to the messaging server, thus reducing the network load. The mobile-agent-based resource managing model consists of the user layer, the mobile agent resource management layer and the wide area resources layer. This model can improve the performance of the grid resource management system and the execution efficiency. The efficiency of mobile-agent-based resource discovery is related to the moving strategy and the algorithm of the mobile agent; the common moving strategies are the travel plan and the dynamic path plan.
The Use of Intelligent Information Access Technologies in Digital Libraries Jiangping Chen1, Yuhua Li2, and Gang Li3 1
School of Library and Information Sciences, University of North Texas, P. O. Box 311068, Denton, TX 76203-1068 [email protected] 2 Information Center, Hubei Water Bureau, Wuchang, Wuhan, P.R. China, 430070 [email protected] 3 School of Information Management, Wuhan University, Wuhan, P.R. China, 430072 [email protected]
Abstract. This paper discusses the use of intelligent information access (IIA) technologies in the development and services of digital libraries. After an introduction of the definition, subfields, and the characteristics of IIA, this paper reviews IIA-related research that has been recently published at leading digital library conferences. The authors suggest that applying IIA technologies to digital libraries may have the potential to greatly speed up the development of digital libraries and to improve the quality of user services. Digital library developers should consider collaborating with IIA researchers in order to build sustainable digital libraries. The paper concludes with a framework for integrating IIA technologies into digital library development and services. Keywords: Intelligent Information Access, Digital Libraries.
initiatives in Europe (http://www.infotoday.com/cilmag/nov00/raitt.htm) as well as in Asia (http://www.ariadne.ac.uk/issue33/china/, http://www.ercim.org/publication/Ercim_News/enw27/sugimoto.html, http://www.icadl2006.org/introduction.html, http://www.cdac.in/html/diglib/dlproj.asp). Digital libraries are complex systems that contain many interrelated operations. They are also expensive projects because many of the operations, such as metadata generation, digitization, and reference services, involve substantial human effort. Sufficient financial support is the key to building sustainable digital libraries [13]. Currently, digital library projects largely depend on government funding. The big concern for many digital libraries is how to minimize the cost of development and maximize the benefits of services. Developing automatic modules or tools to perform labor-intensive tasks or to expand user services is considered important, as such tools can speed up system development, reduce costs related to system maintenance, and improve user services [6], [8]. Research communities on digital libraries are exploring and adopting various advanced computing technologies for the development of digital libraries. Many of these are intelligent information access technologies that focus on automatic text processing and understanding. The authors believe that in order to build sustainable digital libraries, developers should actively use IIA technologies. The literature shows that research on adapting IIA technologies to the development of digital libraries has been conducted since the late 1990s. However, IIA technologies are not widely used in current digital library systems; many digital library developers are not familiar with the recent advances in IIA and do not realize the potential of IIA technologies. This paper aims to summarize the current research on applying IIA to digital libraries and to propose a framework for integrating IIA technologies into the digital library. The remaining part of the paper is organized as follows: First, IIA and its subfields are introduced. Then, the current research on IIA in digital libraries is reviewed. Next, a framework for integrating IIA technologies into digital libraries is proposed and discussed. The paper concludes with a summary and the suggestion that digital library developers should consider collaborating with IIA researchers in order to build sustainable digital libraries.
2 IIA and Subfields Intelligent Information Access (IIA) is a term that has been used [1], [10], [11] but not clearly defined. In this paper, Intelligent Information Access (IIA) refers to technologies that make use of human knowledge and/or human-like intelligence to provide effective and efficient access to large, distributed, heterogeneous and multilingual (and at this time mainly text-based) information resources. In other words, any information access technologies that involve applying human knowledge to retrieve, understand, synthesize or extract information are considered Intelligent Information Access. In particular, IIA includes subfields such as automatic classification, information extraction, cross-language information retrieval, question answering, and summarization. These subfields will be briefly discussed below.
2.1 Automatic Classification
Automatic classification is the process of determining the subject category of an object in a mechanical way. The object to be classified can be anything that can be identified and processed by a computer, including documents, images, or research projects. Automatic document classification, which focuses on processing books, journal papers and other documents, was the first area developed for automatic classification, and today it remains the largest portion of automatic classification research. The subject categories can be pre-defined in a classification table or a thesaurus, or they can emerge from the process of document classification. For example, a medical book can be assigned one or more subject terms such as "Acupuncture" or "Medicine - Chinese Traditional", based on the medical subject headings of the National Library of Medicine. There are two major types of approaches to automatic classification: one is rule-based and the other is the machine-learning approach. Rule-based automatic classification utilizes human expert knowledge to create rules for the automatic identification process, while machine-learning techniques enable the system to learn from a number of examples or training patterns in order to autonomously obtain and integrate the knowledge contained in the item into searchable index entries. Machine-learning techniques have become more and more popular because of their speed and scalability.
2.2 Information Extraction (IE)
Information Extraction refers to technologies that "automatically extract structured or semi-structured information from unstructured machine-readable documents" (http://en.wikipedia.org/wiki/Information_extraction). The structured and/or semi-structured information can be organization names, person names, events related to a particular location, or noun phrases referring to the same object. Typical subtasks of IE include keyword extraction, whereby representative key terms are extracted, and named entity extraction, which identifies entity names such as names of people and organizations, place names, temporal expressions, and certain types of numerical expressions. Other IE subtasks are coreference resolution, the process of identifying chains of noun phrases that refer to the same object, and event extraction, which identifies information related to an event, such as time, location, people, and action. As automatic classification and information extraction are usually performed together in the digital library context, the combined task is generally referred to as automatic metadata generation.
2.3 Cross-Language Information Retrieval (CLIR)
CLIR provides users with access to information that is in a different language from their queries. For example, English-Chinese cross-language information retrieval (EC-CLIR) enables native English speakers to search Chinese text using English queries. The basic strategy of modern information retrieval is to match documents to queries. A transformation on either side, or on both, is necessary if the queries and
documents are not written in the same language, as in CLIR, because the match cannot be performed directly. [12] identified three basic transformation approaches to CLIR: query translation, document translation, and interlingual techniques. Query-translation-based CLIR systems translate user queries into the language in which the documents are written. Document translation is the reverse of query translation: documents are translated into the query language. The interlingual approach translates both documents and queries into a third representation. Among the three approaches, query translation is the most widely accepted and is applied by most experimental CLIR systems because of its simplicity and effectiveness. Query-translation-based CLIR systems utilize various knowledge resources, such as bilingual dictionaries, machine translation (MT) systems, parallel texts, or a combination of them, to translate queries into the language of the document collection, and then conduct monolingual information retrieval. Figure 1 shows the structure of a query-translation-based English-Chinese cross-language information retrieval system [3].
Fig. 1. Structure of an English-Chinese CLIR System (the English query, Queries (E), is translated by the Query Translation module into a Chinese query, Queries (C), which is run against the Chinese documents by the Chinese IR System; the Chinese results, Results (C), can optionally be translated into English results, Results (E), by an MT System)
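The query-translation step shown in Fig. 1 can be illustrated with a small sketch. The fragment below is not taken from the system described in [3]; it is a minimal illustration in which the toy bilingual dictionary, the stopword list and the chinese_ir_search callback are all invented stand-ins for the knowledge resources and the monolingual IR system.

```python
# A minimal sketch of dictionary-based query translation for English-Chinese CLIR.
# The dictionary and the chinese_ir_search() function are hypothetical placeholders.

TOY_DICTIONARY = {              # English term -> candidate Chinese translations
    "chinese": ["中国", "中文"],
    "medicine": ["药", "医学"],
    "treatment": ["治疗"],
    "cataract": ["白内障"],
}

STOPWORDS = {"what", "are", "the", "for", "a", "of"}

def translate_query(english_query):
    """Translate an English query term by term using the bilingual dictionary."""
    translated = []
    for term in english_query.lower().replace("?", "").replace("/", " ").split():
        if term in STOPWORDS:
            continue
        # Keep every candidate translation; real systems would disambiguate here.
        translated.extend(TOY_DICTIONARY.get(term.rstrip("s"), []))
    return translated

def clir_search(english_query, chinese_ir_search):
    """Query translation followed by monolingual Chinese retrieval."""
    chinese_terms = translate_query(english_query)
    return chinese_ir_search(chinese_terms)

# Example with a dummy Chinese IR system that just echoes the translated terms.
if __name__ == "__main__":
    results = clir_search("What are the Chinese medicines/treatments for cataract?",
                          lambda terms: {"translated_terms": terms, "hits": []})
    print(results)
```

A real system would add translation disambiguation and weighting before retrieval; the sketch only shows where the bilingual knowledge resource plugs into the pipeline of Fig. 1.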
Given an English query such as "What are the Chinese medicines/treatments for cataract?", the English-Chinese CLIR system depicted in Figure 1 will first translate the query into Chinese through the Query Translation module. The translated query (in Chinese) is then sent to the Chinese IR System to find and return relevant Chinese documents to the user. An optional MT System after the IR module can translate the retrieved Chinese documents into English if the user prefers.
2.4 Question Answering (QA)
Question Answering (QA) systems identify answers to users' natural-language questions from a large document collection or online information resources. Such systems can relieve users of the need to digest huge amounts of text. For example, the question,
"What is the primary symptom of a cataract?" can be automatically answered by a QA system from a document collection. Question answering is very challenging because of the ambiguity of questions, the complexity of the linguistic phenomena involved in the documents, and the difficulty of understanding natural language. A QA system typically contains multiple functional modules, such as question processing, document retrieval, document annotation, sentence retrieval, and answer finding, in order to locate answers in a large text collection [5]. It takes a team several years of hard work to build an effective QA system.
2.5 Text Summarization
Text summarization is "the process of distilling the most important information from a source to produce an abridged version for a particular user or task" [9]. Research in this area started in the 1960s, but the technique has become more important because of the exponential growth of documents available on the Internet and the WWW. Text summarization is a difficult problem because it requires the computer to truly understand the point of a text. The process normally involves not only semantic and discourse analysis but also inference based on world knowledge. True automatic abstraction or summarization has not been very successful. Current automatic summarization systems usually rely on sentence reduction or extraction to produce summaries. To create a summary, a system needs to identify the most important or topic-related sentences and return them to the reader.
Among the five IIA subfields discussed above, some are well investigated both theoretically and practically. Major research forums such as the Text REtrieval Conference (TREC), the Cross-Language Evaluation Forum (CLEF), the NII-NACSIS Test Collection for IR Systems Workshop (NTCIR), and the Document Understanding Conferences (DUC) conduct large-scale evaluations of tasks such as Cross-Language Information Retrieval (CLIR), monolingual Question Answering (QA), Cross-Language Question Answering (CLQA), and Text Summarization. The Association for Computational Linguistics (ACL) and the ACM Special Interest Group on Information Retrieval also hold conferences each year involving researchers from QA, CLIR, automatic classification, and other allied fields.
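As a concrete illustration of the sentence-extraction approach mentioned in Section 2.5, the sketch below scores sentences by the document-level frequency of their content words and returns the top-ranked ones in their original order. It is a generic frequency-based extractor written for this paper's exposition, not the TXTRACTOR or librarian-technique systems cited later; the stopword list and sample text are invented.

```python
# A minimal frequency-based extractive summarizer: score each sentence by the
# average document frequency of its content words and return the top-k sentences.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "for", "on"}

def summarize(text, k=2):
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        terms = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]
        return sum(freq[t] for t in terms) / len(terms) if terms else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    selected = sorted(ranked[:k])            # keep original document order
    return " ".join(sentences[i] for i in selected)

if __name__ == "__main__":
    print(summarize("Digital libraries are complex systems. They contain many "
                    "interrelated operations. Metadata generation is one costly "
                    "operation in digital libraries.", k=2))
```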
3 The Use of IIA Technologies in Digital Library Development and Services
The use of intelligent information access technologies for digital libraries was proposed as early as the start of the Digital Libraries Initiative I. [7] suggested that intelligent information retrieval and natural language processing technologies should be integrated into digital library systems to provide intelligent access to digital resources for real users. Exploring IIA for digital library development and services has been part of digital library research activities ever since. Research articles on IIA have been published in the proceedings of the leading digital library conference, the IEEE/ACM Joint Conference on Digital Libraries (JCDL), and of its predecessor, the Conference on Digital Libraries (DL).
JCDL (or DL) has been a yearly event for DL researchers since 1996. The conference proceedings are available online (http://www.jcdl.org/past-event-conf.shtml) and are also collected in the ACM digital library (http://portal.acm.org/dl.cfm). The online publication of the JCDL and DL proceedings makes them convenient to access. We extracted and examined all IIA-related research papers published at the JCDL and DL conferences held from 1998 to 2006. Our purpose in this examination is to understand the scope of this research and how specific IIA technologies have been explored for DL development and services. We identified twenty-five research papers (including full papers and short papers) in the five subfields of IIA. As mentioned in the last section, automatic classification and information extraction are usually performed together in digital library settings under the name 'automatic metadata generation'; in our analysis, we use automatic metadata generation to stand for these two subfields. Table 1 shows the distribution of the papers and the titles of some sample papers. The Appendix lists all the IIA-related papers that we identified in the JCDL or DL proceedings between 1998 and 2006.
The largest portion of the 25 IIA-related papers at JCDL and DL (17 out of 25) explores automatic metadata generation. This is reasonable because metadata generation is normally the most labor- and cost-intensive activity in building a digital library. Applying an automatic approach to generate metadata has the potential to greatly reduce the cost of digital libraries and shorten their development time. Different topics within metadata extraction or generation were explored in the 17 papers, such as automatic concept indexing, person name and relationship identification and annotation, image feature categorization, named entity recognition, title extraction, and genre analysis. The technologies used for automatic metadata extraction can be summarized as follows:
• Linguistic or rule-based approaches, which apply linguistic knowledge and features of text segments, as reflected in the original document, to determine the categories;
• Statistical methods such as associative neural networks, mutual information, and Hidden Markov Models (HMM);
• Machine-learning approaches, which apply human-annotated training data to train statistical models that are then used to process the same type of digital documents (a minimal sketch of this approach is given after this list); and
• Hybrid approaches, which combine knowledge bases such as thesauri or ontologies with statistical approaches in order to achieve the best results for different types of metadata generation.
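To make the machine-learning approach in the list above concrete, the sketch below trains a simple bag-of-words classifier on a handful of annotated titles and then assigns a subject category to a new document. scikit-learn is used here only as an assumed, illustrative toolkit (the surveyed projects used other tools, such as the Rainbow toolkit mentioned below), and the tiny training set and category labels are invented.

```python
# A minimal machine-learning sketch for subject classification of documents,
# in the spirit of the metadata-generation work surveyed here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "acupuncture points and traditional chinese medicine therapy",
    "herbal remedies in traditional medicine",
    "support vector machines for text categorization",
    "hidden markov models for sequence labeling",
]
train_labels = [
    "Medicine - Chinese Traditional",
    "Medicine - Chinese Traditional",
    "Computer Science - Machine Learning",
    "Computer Science - Machine Learning",
]

# Bag-of-words features plus a Naive Bayes classifier, trained on annotated examples.
classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)

print(classifier.predict(["a study of acupuncture treatment for cataract"]))
```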
Many of the projects evaluate the proposed metadata generation techniques on real digital libraries in different domains, such as genealogy, education, and biology, and/or on different types of digital objects, such as images, news texts, and web documents. The research shows that the performance of metadata extraction varies depending on the metadata types and many other factors. Some systems do quite a good job of extracting certain metadata in certain domains. Generally, the quality and coverage of the knowledge resources that a system can access play a very important role in its performance.
Table 1. Distribution of IIA Related Papers at JCDL or DL Conferences During 1998-2006

Automatic metadata generation (frequency: 17). Titles of sample papers:
• Automatic categorization of figures in scientific documents
• Automatic extraction of titles from general documents using machine learning
• Automated text classification using a multi-agent framework
• Developing practical automatic metadata assignment and evaluation tools for internet resources
• Automatic document metadata extraction using support vector machines
• An approach to automatic classification of text for information retrieval

Cross-Language Information Retrieval (frequency: 4). Titles of sample papers:
• Translating unknown cross-lingual queries in digital libraries using a web-based approach
• Harvesting translingual vocabulary mappings for multilingual digital libraries
• A multilingual, multimodal digital video library system
• Enhancing Usability in CITIDEL: Multimodal, Multilingual, and Interactive Visualization Interfaces

Automatic Question Answering (frequency: 2). Titles of sample papers:
• Question answering on lecture videos: a multifaceted approach
• QuASM, a system for question answering using semi-structured data

Text Summarization (frequency: 2). Titles of sample papers:
• Using sentence-selection heuristics to rank text segments in TXTRACTOR
• Using librarian techniques in automatic text summarization for information retrieval
It is interesting to note that some projects on automatic metadata generation made use of classifiers or machine-learning tools that were developed and used in other, similar settings. For example, one system used the Rainbow machine-learning classification toolkit (http://www.cs.cmu.edu/~mccallum/bow/). IIA researchers have developed IIA resources that are available for adaptation [4]. It would be a cost-effective strategy to find and use available IIA resources or toolkits for digital library research.
Four of the papers are related to cross-language information retrieval or multilingual information retrieval. One paper discusses a digital video content management system that contains a client interface in both English and Chinese. The second describes a method for creating customized bilingual dictionaries and mapping users' queries from one language to multiple languages using an online library catalog containing over 10 million multilingual records. The third article investigates the feasibility of exploiting the web as a corpus source to translate unknown query terms for CLIR in digital libraries, which can also help to create a domain-specific bilingual dictionary for CLIR. The fourth article compares three translation interfaces as used for digital libraries.
Two papers explore question answering in the digital library environment. One focuses on extending an information retrieval system to provide answers to questions that ask for specific facts such as person names, locations, and/or organization names (also known as 'factoid' questions) by using a named entity identifier. The other proposes a method that combines video transcription, question processing and answer segment extraction to answer questions about lecture videos.
There are two articles on text summarization as well. One discusses a tool that uses sentence-selection heuristics to rank text segments in order to generate summaries containing a user-defined number of sentences. The other explores the integration of librarian-expert techniques, including reference interview strategies and cataloging resources, into a text summarization system that provides post-retrieval single-document and/or multi-document summaries for users of a medical digital library.
Most of the above works were supported by funding from various government agencies, such as the NSF, DARPA, and IMLS, as well as agencies from other countries. Several projects involved collaborations between researchers and organizations in two or more countries.
4 Integrating IIA into Digital Library Development and Service: A System Design
The examination in the last section shows that investigations into applying current IIA technologies to digital library development and user services are ongoing. However, the number of papers on IIA at the JCDL and DL conferences also indicates that the research is limited. Many digital library developers seem hesitant to consider IIA for their systems, fearing that learning and using IIA can be difficult and that the performance of the technologies is sometimes not stable or predictable. However, certain IIA technologies, if appropriately applied based on the resources and conditions of a specific digital library, can greatly reduce the cost of maintaining the digital library and/or provide value-added services to its users. For example, some types of metadata may be much easier to extract automatically than others for some digital libraries. CLIR also offers considerable potential benefit: it can provide a multilingual user interface, allowing users who speak other languages to use the digital library much as native speakers do.
Fig. 2. A Framework for Digital Libraries Integrating IIA Tools (layers from top to bottom: Service Interface, with searching, browsing and other services; IIA Tools, with MetaExtracting, QA, CLIR, etc.; Development Tools, with MetaExtracting, digitization and selection; Digital Repository; Infrastructure, with hardware and network infrastructure)
We would like to propose a framework for digital libraries that integrates IIA tools, as depicted in Figure 2. A digital library with this structure has a layer of IIA tools in addition to the other components and/or layers of a traditional digital library. The modules and/or subsystems in the IIA Tools layer can accept input from, and provide output to, the other layers and the digital repository. A short description of the framework follows (a minimal code sketch of the layering is given after the list):
• Infrastructure. The computer hardware and network infrastructure that a digital library should have;
• Development Tools/Processes. The information gathering, digitization, and organization processes or tools that contribute to the construction of the digital repository. The "MetaExtracting" process in this layer refers to the manual metadata generation that is conducted for many digital libraries;
• Digital Repository. A digital collection that is well organized to serve a particular user community;
• IIA Tools. IIA modules or subsystems that can automate some of the development processes or expand the services that users receive. Given the time and effort required to develop an IIA module, digital library developers should add one tool at a time, according to user needs and system planning. The "MetaExtracting" process in this layer refers to automatic metadata generation, which combines techniques for automatic classification and information extraction;
• Service Interface. This layer interacts with the users of the digital library through various services such as searching, browsing, digital reference, and other services.
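The layering in Fig. 2 can be sketched in code as follows. The fragment is only an illustration of the idea that IIA modules are pluggable components sitting between the repository and the service interface; all class and method names are invented, not part of any existing digital library platform.

```python
# A minimal sketch of the layered framework in Fig. 2: IIA tools are pluggable
# modules between the digital repository and the service interface. Hypothetical names.

class DigitalRepository:
    def __init__(self):
        self.objects = {}            # object id -> {"text": ..., "metadata": {...}}

    def add(self, obj_id, text):
        self.objects[obj_id] = {"text": text, "metadata": {}}

class IIATool:
    """Base class for IIA modules (MetaExtracting, QA, CLIR, ...)."""
    def process(self, repository):
        raise NotImplementedError

class MetaExtractingTool(IIATool):
    """Toy automatic metadata generation: use the first five words as a title."""
    def process(self, repository):
        for obj in repository.objects.values():
            obj["metadata"]["title"] = " ".join(obj["text"].split()[:5])

class ServiceInterface:
    """Search/browse layer; it only sees the repository enriched by IIA tools."""
    def __init__(self, repository, tools):
        self.repository = repository
        self.tools = tools

    def build(self):
        for tool in self.tools:      # add one tool at a time, as argued above
            tool.process(self.repository)

    def search(self, keyword):
        return [oid for oid, obj in self.repository.objects.items()
                if keyword.lower() in obj["text"].lower()]

repo = DigitalRepository()
repo.add("doc1", "Intelligent information access technologies for digital libraries")
service = ServiceInterface(repo, tools=[MetaExtractingTool()])
service.build()
print(service.search("digital"), repo.objects["doc1"]["metadata"])
```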
The basic idea of this framework is that IIA technologies should be considered whenever possible and integrated into the development of the digital library as an enhancement or expansion of current digital library development and services.
5 Summary and Future Work
We have introduced the concept of intelligent information access (IIA), including its definition and the methods of its subfields, for the purpose of advocating the use of IIA in digital library development and user services. Current research on applying IIA to digital libraries, as published at the JCDL and DL conferences, has also been reviewed, and some of this research provides encouraging results. We have further proposed a framework for integrating IIA tools into a digital library structure. It is our belief that IIA technologies have the potential to greatly speed up the development of digital libraries as well as to improve user services; as such, we feel that they reflect a desired direction in the building of large-scale, sustainable digital libraries. To move this research forward, IIA researchers and DL developers should collaborate to explore appropriate ways of applying various IIA technologies to digital libraries. Each digital library is a special case that needs careful analysis and evaluation before any IIA module is added to the system. We therefore strongly believe that both digital libraries and IIA technologies will benefit from such collaboration.
References
1. Berry, M. W., Dumais, S. T., & Letsche, T. A. (1995). Computational methods for intelligent information access. In Proceedings of Supercomputing'95, San Diego, CA, December 1995. Retrieved February 12, 2006, from http://citeseer.ist.psu.edu/berry95computational.html.
2. Borgman, C. (1999). Discussion: "What are Digital Libraries? Competing Visions." Information Processing and Management 35(3), 227-243.
3. Chen, J. (2005). Toward a unified retrieval outcome analysis framework for cross language information retrieval. Presented at the ASIST 2005 Annual Conference, Charlotte, North Carolina, October 30-November 1, 2005. Available at: http://max.lis.unt.edu/presentations/ASIST05.pdf.
4. Chen, J., Li, F., and Xuan, C. (2006). A Preliminary Analysis of the Use of Resources in Intelligent Information Access Research. To appear: Proceedings of the ASIST 2006 Annual Conference.
5. Chen, J., Yu, P., and Ge, H. (2006). UNT 2005 TREC QA Participation: Using Lemur as IR Search Engine. Online Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005). Available at: http://trec.nist.gov/pubs/trec14/papers/unorth-texas.qa.pdf.
6. Chowdhury, G.G. (2002). Digital Libraries and Reference Services: Present and Future. Journal of Documentation 58(3): 258-283.
7. Liddy, E.D., Eisenberg, M.B., McClure, C.R., Mills, K., Mernit, S., and Luckett, J.D. (1994). Research Agenda for the Intelligent Digital Library. The First Annual Conference on the Theory and Practice of Digital Libraries, College Station, TX, USA, June 19-21, 1994.
8. Lossau, Norbert. (2004). Search Engine Technology and Digital Libraries: Libraries Need to Discover the Academic Internet. D-Lib Magazine, 10(6). Retrieved July 18, 2006, from http://www.dlib.org/dlib/june04/lossau/06lossau.html.
9. Mani, I. and Maybury, M. T. (1999). Eds. Advances in Automatic Summarization. The MIT Press.
10. Maybury, M.T. (2005). Intelligent information access: theory and practice. In Proceedings of the 2005 International Conference on Intelligence Analysis. Retrieved November 20, 2005, from https://analysis.mitre.org/proceedings/Final_Papers_Files/272_Camera_Ready_Paper.pdf.
11. Müller, M. E. (1999). Intelligent information access in the Web: ML based user modeling for high precision meta-search. In Proceedings of the Workshop on Machine Learning for Intelligent Information Access, The ECCAI Advanced Course on Artificial Intelligence (ACAI-99), Chania, Greece, 1999. Retrieved February 12, 2006, from http://citeseer.ist.psu.edu/article/muller99intelligent.html.
12. Oard, D. W., & Diekema, A. R. (1999). Cross-language information retrieval. In M. Williams (Ed.), Annual Review of Information Science and Technology, 33 (pp. 223-256).
13. Smith, A. (2003). Issues in sustainability: creating value for online users. First Monday, 8(5). http://www.firstmonday.org/issues/issue8_5/smith/.
Appendix: List of IIA Related Papers at JCDL or DL Conferences During 1998-2006

JCDL'06
• Crane, G., Jones, A. The challenge of Virginia Banks: an evaluation of named entity analysis in a 19th-century newspaper collection. Pp. 31-40.
• Lu, X., Mitra, P., Wang, J. Z., and Giles, C. Lee. Automatic categorization of figures in scientific documents. Pp. 129-138.
• Perrow, M. and Barber, D. (2006). Tagging of name records for genealogical data browsing. Pp. 316-325.
• Wang, J. and Ge, N. Automatic feature thesaurus enrichment: extracting generic terms from digital gazetteer. Pp. 326-333.

JCDL'05
• Hu, Y., Li, H., Cao, Y., Meyerzon, D., Zheng, Q. Automatic extraction of titles from general documents using machine learning. Pp. 145-154.
• Fu, Y., Ke, W., Mostafa, J. Automated text classification using a multi-agent framework. Pp. 157-158.
• Diekema, A. R., Chen, J. Experimenting with the automatic assignment of educational standards to digital library content. Pp. 223-224.
• Paynter, G. W. Developing practical automatic metadata assignment and evaluation tools for internet resources. Pp. 291-300.

JCDL'04
• Wang, J., Teng, J., Cheng, P., Lu, W., and Chien, L. Translating unknown cross-lingual queries in digital libraries using a web-based approach. Pp. 108-116.
• Cao, J. and Nunamaker, J. F. Question answering on lecture videos: a multifaceted approach. Pp. 214-215.
• Yilmazei, O., Finneran, C. M., Liddy, E. D. MetaExtract: an NLP system to automatically assign metadata. Pp. 241-242.
• Perugini, S., McDevitt, K., Richardson, R., Perez-Quinones, M., Shen, R., Ramakrishnan, N., etc. Enhancing Usability in CITIDEL: Multimodal, Multilingual, and Interactive Visualization Interfaces. Pp. 315-324.

JCDL'03
• Han, H., Giles, C. L., Manavoglu, E., Zha, H., Zhang, Z. and Fox, E. A. Automatic document metadata extraction using support vector machines. Pp. 37-48.
• Takasu, A. Bibliographic attribute extraction from erroneous references based on a statistical model. Pp. 49-60.
• Soo, V., Lee, C., Li, C., Chen, S. L. and Chen, C. Automated semantic annotation and retrieval based on sharable ontology and case-based learning techniques. Pp. 61-72.

JCDL'02
• McDonald, D. & Chen, H. Using sentence-selection heuristics to rank text segments in TXTRACTOR. Pp. 28-35.
• Kan, M. and Klavans, J. Using librarian techniques in automatic text summarization for information retrieval. Pp. 36-45.
• Pinto, D., Branstein, M., Coleman, R., Croft, W. B., King, etc. QuASM, a system for question answering using semi-structured data. Pp. 46-55.
• Cui, H., Heidorn, B., and Zhang, H. An approach to automatic classification of text for information retrieval. Pp. 96-97.
• Lyu, M. R., Yau, E., and Sze, S. A multilingual, multimodal digital video library system. Pp. 145-153.
• Larson, R. R., Gey, F., and Chen, A. Harvesting translingual vocabulary mappings for multilingual digital libraries. Pp. 185-190.

JCDL'01
• Rauber, A., Muller-Kogler, A. Integrating automatic genre analysis into digital libraries. Pp. 1-10.

JCDL'00
• Larkey, L.S., Ogilvie, P., Price, M.A., Tamilio, B. Acrophile: an automated acronym extractor and server. Pp. 205-214.

ACM DL'99
• Ardo, A., and Koch, T. Creation and automatic classification of a robot-generated subject index. Pp. 210-211.

ACM DL'98
• Chung, Y., Pottenger, W.M., and Schatz, B.R. Automatic subject indexing using an associative neural network. Pp. 59-68.
Encryption and Program Realization of Information Resources in DRM Chuanfu Chen1, Zhiqiang Wu1,2, Laichu Tang1, and Congjing Ran1 1
Abstract. With the development of digitizing techniques and computer networks, more and more information resources are distributed and communicated over the internet. The disordered nature of the internet and the uncontrolled replication of information have deprived these information resources of their copyright protection. DRM is an important means of protecting the copyright of information resources, and encryption technology is the core of DRM. This paper analyzes the principle and the main technologies used by DRM to encrypt information resources; it also reports experiments on the programming techniques used to encrypt information resources in DRM. Keywords: DRM, Information Resources, Encryption, Copyright Protection, DES, RSA.
1 Introduction
With the development of digital techniques and computer networks, more and more information resources are being transmitted and communicated through the internet. Because of the insecure nature of the internet and the uncontrolled state of information reproduction and communication, information resources communicated over the internet are vulnerable to threats such as illegal access, copying, alteration and sabotage, with the result that digital works cannot obtain proper copyright protection. In order to maintain the interests of authors (and especially of publishers), more and more people adopt DRM (Digital Rights Management) systems as their weapon for protecting copyright. Data encryption is one of the most important means in DRM: data encryption technology recodes information to hide the real meaning of its contents, so that illegitimate users cannot access the real information. This article studies the data encryption technology used in DRM to protect the copyright of information resources, and its program realization.
When information resources come into the DRM system, the system uses an encryption key to perform the encryption calculation and transforms the information into an encrypted message that can be openly transmitted on the internet. When the encrypted message is received, the user performs decryption with the decryption key and transforms the message back into legible information resources. Fig. 1 shows the basic flow of encryption and decryption of information resources in DRM:
Fig. 1. The flowchart of encryption & decryption of information resources in DRM (the information is encrypted with a key, the encrypted message is transmitted over the internet, and the receiver decrypts it with a key to recover the information)
At present, there are many ways to encrypt information resources, all realized through encryption algorithms. Among the algorithms that have been developed are DES, RSA, MD5, elliptic-curve algorithms, etc., which can be divided into symmetric encryption algorithms and asymmetric encryption algorithms [1]. The algorithms mentioned above are good ways to protect information resources; however, because of their maturity and robustness, DES and RSA are the ones mostly used in DRM. Of these two, DES is a symmetric encryption algorithm, which uses the same key for both encryption and decryption; because of this, it can encrypt information resources effectively and at high speed. RSA, on the other hand, is an asymmetric encryption algorithm, which uses different keys for encryption and decryption; it encrypts information resources with high security, but it is slow and time-consuming, so it is not suitable for encrypting massive data. In order to ensure the security of the information resources, increase the efficiency of encryption, and ultimately protect the copyright of the information resources, one of the best practices in a DRM system is to use DES to encrypt the information resources themselves and RSA to encrypt the key used for the DES encryption.
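The hybrid scheme just described can be sketched as follows. The experiment reported later in this paper was implemented in Delphi 6; the Python fragment below is only an illustrative restatement of the same idea, assuming the PyCryptodome package is available, and the choices of ECB mode for DES and OAEP padding for RSA are ours rather than the paper's.

```python
# A minimal sketch of the DES + RSA hybrid scheme: DES encrypts the content,
# RSA encrypts the 8-byte DES key. Assumes the PyCryptodome package.
from Crypto.Cipher import DES, PKCS1_OAEP
from Crypto.PublicKey import RSA
from Crypto.Random import get_random_bytes
from Crypto.Util.Padding import pad, unpad

# User side: create an RSA key pair and publish the public key to the DRM system.
rsa_key = RSA.generate(2048)
public_key, private_key = rsa_key.publickey(), rsa_key

# DRM system side: encrypt the content with DES, then the DES key with RSA.
content = "Text of the information resource to be protected.".encode("utf-8")
des_key = get_random_bytes(8)                      # 64-bit DES key (56 effective bits)
des_cipher = DES.new(des_key, DES.MODE_ECB)
encrypted_content = des_cipher.encrypt(pad(content, DES.block_size))
encrypted_des_key = PKCS1_OAEP.new(public_key).encrypt(des_key)

# User side: recover the DES key with the private RSA key, then the content.
recovered_key = PKCS1_OAEP.new(private_key).decrypt(encrypted_des_key)
plain = unpad(DES.new(recovered_key, DES.MODE_ECB).decrypt(encrypted_content),
              DES.block_size)
assert plain == content
```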
3 Encryption Technology Used on Information Resources in DRM
3.1 Encrypting the Information Resources Themselves with DES
DES rests on two basic cryptographic principles: confusion and diffusion [2]. Confusion obscures the relationship between the original text and the encrypted text, while diffusion spreads local changes in the original text over the whole encrypted text. DES takes a 64-bit secret key, of which only 56 bits are actually usable (8 bits are used for odd/even parity checking). During encryption, the original text is divided
into 64-bit blocks, which are encrypted under the secret key. The DES algorithm performs 16 rounds of encryption on each 64-bit block, and in each round a unique round key (sub-key), used to transform the data block, is generated from the 56-bit main key. The original 64-bit data block is transformed by the successive round keys into a completely different 64-bit output block, which becomes the encrypted text; during decryption, the encrypted text is reverted to its original form. Fig. 2 illustrates the DES algorithm [3], and the basic algorithm [4] is as follows:
(1) Create sub-keys from the 8-character (64-bit) secret key.
 i. Use the Permuted Choice one (PC-1) table to remove the check bits from the 64-bit secret key made up of 8 characters (take the first 8 characters if there are more), creating a 56-bit string, tempBuffer, with no check bits.
 ii. Divide the 56-bit tempBuffer into two parts, C and D, each of 28 bits.
 iii. Use the left-shift (LS) table to perform circular left shifts, creating new 28-bit C and D.
 iv. Combine the new C and D into a new 56-bit string, Outbuffer.
 v. Use the Permuted Choice two (PC-2) table to perform a compression permutation of the new 56-bit Outbuffer, creating a 48-bit sub-key.
 vi. Repeat steps ii to v 16 times to create 16 sub-keys of 48 bits each: Sub-key1, ..., Sub-key16, which are used in the 16 rounds of encryption of the original text.
(2) Use the Initial Permutation (IP) table to rearrange the 64-bit input message, and divide the rearranged 64-bit string into two 32-bit parts, L and R.
(3) Transform the L and R parts created in (2) with the function f(R,K).
 i. Use the Expansion Permutation table to expand the 32-bit R part into a 48-bit R'.
 ii. Perform the XOR operation on R' and Sub-key(i), and divide the result into 8 groups of 6 bits each, where Sub-key(i) is the sub-key for the current round i (if the current round is 1, the corresponding sub-key is Sub-key(1)).
 iii. Use the Substitution (S) boxes to substitute these 8 groups, converting each 6-bit group into a 4-bit group, so the output has only 32 bits.
 iv. Use the Permutation (P) table to transpose the 32-bit output, creating a new 32-bit R.
 v. Interchange L and R: assign the value of R to L and the value of L to R.
(4) Repeat (3) 16 times to complete the 16 rounds of encryption.
(5) Combine the L and R obtained after the 16 rounds into a 64-bit string, and use the Final Permutation (IP-1) table to perform the final transposition, arriving at the 64-bit encrypted message, which is the result of the DES encryption.
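The round structure in steps (2) to (4) is a Feistel network. The toy sketch below shows only that structure, splitting the block into L and R and applying sixteen rounds of "L, R becomes R, L XOR f(R, K)"; the round function and sub-keys are invented placeholders, not the real DES E, S-box and P tables or the PC-1/PC-2 key schedule described above.

```python
# A toy Feistel network illustrating the 16-round structure of DES.
# The round function and sub-keys are placeholders, not real DES components.

def toy_round_function(right, subkey):
    # Stand-in for DES's f(R, K): expansion, key mixing, S-boxes, permutation.
    return ((right * 31) ^ subkey) & 0xFFFFFFFF

def feistel_crypt(block64, subkeys):
    """One pass of a Feistel network; decryption is the same pass with reversed keys."""
    left, right = block64 >> 32, block64 & 0xFFFFFFFF
    for k in subkeys:                       # 16 rounds: L, R <- R, L xor f(R, K)
        left, right = right, left ^ toy_round_function(right, k)
    return (right << 32) | left             # swap the halves after the last round

subkeys = [(0x0F0F0F0F * (i + 1)) & 0xFFFFFFFF for i in range(16)]
block = 0x0123456789ABCDEF
ciphertext = feistel_crypt(block, subkeys)
recovered = feistel_crypt(ciphertext, list(reversed(subkeys)))
assert recovered == block
```

The symmetry of the Feistel structure is what makes DES decryption, discussed in Section 3.3, the same procedure with the sub-keys applied in reverse order.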
Fig. 2. The DES algorithm
3.2 Encrypting the DES Encryption Key with RSA
If disclosure of the encryption key were not a concern, encrypting the information resources with DES alone would be sufficient. However, since DES uses the same key to encrypt and decrypt, the key may be disclosed during transmission, and it is also inconvenient to distribute and manage the key. To ensure the security of the key during transmission, it is necessary to encrypt the key used in the DES encryption [5]. Because the DES encryption key has only 8 characters, it is very effective, fast and secure to encrypt it with RSA. RSA is a public-key encryption system whose basic concept is as follows: side A creates a pair of keys, releases one of them to the other side as a public key, and keeps the other as a private key. If the other side wants to send an encrypted message to side A, it uses the public key to encrypt the message and sends it to side A, and side A then uses the private key to decrypt it. This approach avoids the danger of exposing the key during transmission. The security of RSA relies on the difficulty of factoring a very large number, a well-known mathematical problem with no efficient solution known to date, and this guarantees the security of the RSA algorithm. In DRM, the RSA key pair used to encrypt the DES encryption key, consisting of a public key and a private key, is created by the user. Once the user has created these keys using the RSA algorithm, the public key is disclosed to the DRM system and the private key is kept by the user. The DRM system then uses the public key to encrypt the DES
encryption key and creates a new encrypted data string. For convenience of description, we combine the two parts together; the basic algorithm is as follows:
(1) The user creates the RSA keys [6].
 i. Randomly select two large prime numbers p and q (which must be kept secret) and calculate n:
    n = p*q.                         (1)
 ii. Calculate the Euler function of n:
    φ(n) = (p-1)*(q-1).              (2)
 iii. Randomly select a whole number e such that e and φ(n) are mutually prime; e is the public (encrypting) key.
 iv. Compute the private (decrypting) key d from the formula:
    (d*e) mod φ(n) = 1.              (3)
(2) The DRM system then encrypts the key used in the DES encryption.
 i. Divide the DES encryption key into data segments of equal length k1, k2, ..., ki. Since the DES encryption key has only 8 characters, each character can be treated as one data segment.
 ii. Use the RSA public key to encrypt each data segment and obtain the encrypted segment Ci. The formula is:
    Ci = E(ki) = ki^e mod n.         (4)
 iii. Combine the data segments Ci into C; the RSA encryption of the DES encryption key is then complete, and C is the encrypted data.
3.3 Decrypting the Encrypted Information Resources After They Are Received by the User
When the DRM system has encrypted the information resources with DES and then encrypted the DES encryption key with RSA, it has created an encrypted message of the key and an encrypted message of the information resources. These messages are transmitted to the user through the internet. When the user receives the messages, he or she must decrypt them before use. Decryption by the user consists of the following steps:
(1) Use the private key d to perform RSA decryption on the encrypted message of the DES encryption key. The decryption formula is:
    ki = D(Ci) = Ci^d mod n.         (5)
Combine the ki into a key to obtain the DES encryption key of the content, i.e., the key that was used to encrypt the information resources.
(2) With this key, use the DES algorithm to decrypt the information resources. DES decryption uses the permutations of encryption in reverse order (the Final Permutation IP-1 table and the Initial Permutation IP table are swapped) and applies the same round function f(R,K) in the 16 rounds of interchange, with the sub-keys taken in reverse order. Once the information resources have been recovered to their original state, they can be used by the user.
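A small numerical walk-through of steps (1) to (5) is given below, using deliberately tiny primes so the arithmetic is visible; real RSA keys use primes hundreds of digits long, and the 8-character DES key string is a made-up example.

```python
# A toy numerical walk-through of equations (1)-(5): generate an RSA key pair,
# encrypt each character of an 8-character DES key, then decrypt it again.
# The primes are far too small for real security; they only make the math visible.

p, q = 61, 53                      # step i: two (secret) primes
n = p * q                          # (1)  n = p*q = 3233
phi = (p - 1) * (q - 1)            # (2)  phi(n) = 3120
e = 17                             # step iii: e coprime with phi(n)  (public key)
d = pow(e, -1, phi)                # (3)  d*e mod phi(n) = 1  (private key; Python 3.8+)

des_key = "K7s#p2Qz"               # hypothetical 8-character DES key
cipher = [pow(ord(ch), e, n) for ch in des_key]          # (4)  Ci = ki^e mod n
recovered = "".join(chr(pow(c, d, n)) for c in cipher)   # (5)  ki = Ci^d mod n

assert recovered == des_key
print(n, phi, d, cipher)
```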
4 Program Realization of the Encryption of Information Resources in DRM
Based on the DES and RSA encryption principles described above, we used Delphi 6 as our development tool and performed an experiment on encrypting some information resources in DRM, with satisfactory results.1 In the experiment, the information resources encrypted are the Introduction of this article.
4.1 Encryption of Information Resources
We performed the following steps in the encryption experiment: (1) input into the DRM system the information resources to be encrypted and the key to be used for DES encryption, so that new information content is created after encryption; (2) the user randomly selects two prime numbers p and q and uses the RSA algorithm to create a public key (encryption key) and a private key (decryption key); (3) the DRM system uses the public key provided by the user to encrypt the DES encryption key, obtaining the encrypted data string after RSA encryption. The result of the encryption is shown in Fig. 3.
Fig. 3. Encryption of information

1 In the experiment, some source code is excerpted from http://www.ksaiy.com and http://ace.ulyssis.student.kuleuven.ac.be/~triade/
The right part of the figure is the output information after encryption, including both the data string of the encrypted DES encryption key created by the RSA algorithm and the encrypted information resources created by the DES algorithm. The decryption key used by the user and the number n used in decryption (the product of the prime numbers p and q) are the outcome obtained when the user creates the RSA keys. Once the information resources have been encrypted, they can be transmitted to the user through the internet; during the transmission, only the encrypted information resources and the encrypted DES-key data string are sent to the user.
4.2 Decryption of the Digitized Information
When the user receives the encrypted information, he or she has to perform decryption in order to use it. In the decryption experiment, the user (1) uses the decryption key and the number n (the product of the prime numbers p and q) to decrypt the data string of the DES encryption key, then (2) uses the decrypted DES encryption key to decrypt the information resources, and (3) finally obtains the original, usable information resources. The result of the experiment is shown in Fig. 4.
Fig. 4. Decryption of information
5 Conclusion
Encryption technology is an important tool used in DRM to protect the copyright of information resources, and it plays an important role in protecting the right of communication through information networks. However, technology should be unbiased: while protecting the copyright of the author, encryption mechanisms should not restrict the fair use of information resources. The purpose of our study of the logic of the encryption technology used in DRM is to find a principle that would be
suitable for fair use, so that encryption technology can, on the one hand, protect the copyright of the lawful author and, on the other hand, embody the principle of fair use, ultimately preserving public access to information resources.
References
1. Stallings, William: Cryptography and Network Security: Principles and Practice. 2nd ed. (Photocopy). Beijing: Tsinghua University Press. 2002
2. Wang, Hongmei: DES Methodology and the Safety Control of Digitized Information Resources. Information Science. 4 (2005) 544-549
3. Wu, Hongbing: DES Data Encryption Calculation Principle and Actualization. Computer Programming Technique and Protection. 3 (2000) 85-88
4. Wei, Xumin: DES Calculation. High Performance Calculation Technique. 4 (2001) 23-27
5. Wu, Hao: Data Encryption Proposal Based on DES Calculation and RSA Calculation. Jiao-Jor Technical School Journal (Natural Science Forum). 5 (2002) 396-397
6. Tong, Xiaoyun, etc.: Research and Actualization of Encryption Technology Based on RSA Calculation. Micro-computer. 6 (2003) 23-25
Knowledge Organization Systems and Interfaces of Digital Libraries in China Li Si, Lixiao Xu, and Hongyan Chen School of Information Management, Wuhan University Wuhan, Hubei, P.R. China {Li Si, sunnylily99new}@yahoo.com.cn
Abstract. The paper introduces the types and functions of knowledge organization systems, investigates and analyzes the knowledge organization systems and interfaces of Chinese digital libraries, and puts forward some suggestions for strengthening KOSs in Chinese digital libraries.
1 Introduction
Knowledge Organization Systems (KOSs) refers to all kinds of semantic tools that describe and organize human knowledge structures, including classifications, thesauri, semantic networks, ontologies and other information retrieval and indexing languages. At present, digital libraries, subject gateways and search engines are a hot focus of research and a construction emphasis. How to integrate KOSs into digital libraries and extend their structures and functions to a new, higher level is the research topic addressed here.
2 Types of KOSs
On the website of NKOS (Networked Knowledge Organization Systems/Services), the types of KOSs consist of the following1: (1) Term lists: authority files, glossaries, gazetteers, and dictionaries. (2) Classification and categorization: subject headings, classification schemes, categorization schemes and taxonomies. (3) Relationship groups: thesauri, semantic networks, and ontologies.
Thesauri: sets of terms representing concepts and the equivalence, hierarchical and associative relationships among them. Thesauri use BT (broader term), NT (narrower term), SY (synonym), and RT (associative or related term) to express these relationships.
Semantic networks: sets of terms expressing concepts, modeled as nodes in a network with various relationship types. Concepts are thought of as nodes, and relationships branch out from them.
Ontology: a special concept model developed by the knowledge-management community, namely concept spaces. Ontologies can describe complex relationships among objects, including rules and axioms missing from semantic networks.
1 Available at: http://nkos.slis.kent.edu (accessed 1 May 2006).
An ontology that describes knowledge in a specific area is often connected with systems for knowledge mining and knowledge management. KOSs have played a unique role in representing, indexing and retrieving knowledge ever since they appeared.
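To make the thesaurus relationship types above concrete, the following sketch stores a few entries with BT/NT/RT/SY links and uses them for simple query expansion; the terms and relations are invented examples rather than entries from any published thesaurus.

```python
# A minimal sketch of a thesaurus with BT/NT/RT/SY relationships and a simple
# query-expansion helper. The sample terms and links are invented.

THESAURUS = {
    "digital libraries": {
        "BT": ["information systems"],            # broader term
        "NT": ["institutional repositories"],     # narrower term
        "RT": ["knowledge organization systems"], # related term
        "SY": ["electronic libraries"],           # synonym
    },
    "knowledge organization systems": {
        "NT": ["thesauri", "classification schemes", "ontologies"],
        "RT": ["digital libraries"],
    },
}

def expand_query(term, relations=("SY", "NT", "RT")):
    """Return the term plus the thesaurus terms reachable via the chosen relations."""
    entry = THESAURUS.get(term, {})
    expanded = [term]
    for rel in relations:
        expanded.extend(entry.get(rel, []))
    return expanded

print(expand_query("digital libraries"))
```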
3 Functions of KOSs in the Digital Library
A KOS is an indispensable part of a digital library, and its functions have gradually attracted academic attention. Regarding the functions of KOSs in digital libraries, Ali Shiri considered2: (1) use of thesauri and classification systems for cross-browsing and cross-searching across various digital collections; (2) creation of ontologies using existing thesauri; (3) classification systems and specialised controlled vocabularies to provide a general knowledge-representation facility for digital collections with a diverse range of materials; and (4) use of taxonomies to provide unified and organized access to different digital repositories by describing different layers of the digital collections. Hill et al. noted3: (1) description: controlled sets of labels for describing an object; (2) definition: meanings associated with labels; (3) translation: mapping between equivalent representations; and (4) navigation: links within an organized structure of representation. Wang Jun described4: (1) classification and subject headings provide a knowledge structure for digital library resources; (2) they provide effective ways of using metadata in digital libraries; (3) they provide an integrated knowledge space for retrieval, services and education; (4) they form a self-enriching, self-strengthening, self-adapting knowledge system; and (5) they provide the technological basis for knowledge management.
Digital libraries overseas usually put emphasis on the design of KOSs and of navigation and retrieval interfaces. The National Digital Library Project in China, with an investment of about 400 million, will include knowledge organization as one of its standard sub-projects; it is obvious that the development of KOSs in digital libraries has become one of the key technologies for providing users with knowledge services.
4 Investigation of KOSs and Interfaces of Chinese Digital Libraries
4.1 Investigation and Analysis of the Usage of KOSs
In China, digital library practice began with the Chinese Pilot Digital Library Project (CPDL), which was launched in 1996. Since then, the theory and practice of digital libraries have made steady progress. There are three types of digital libraries5:
2 Shiri, A.: Digital library research: current developments and trends. Library Review. 5 (2003) 198-202.
3 Hill, L., et al.: Integration of knowledge organization systems into digital library architectures. New Technology of Library and Information Service. 1 (2004) 4-8.
4 Wang Jun: Knowledge organization in digital library based on classification and subject thesauri. Journal of Library Science in China. 3 (2004) 41-44, 64.
5 Xiao Long et al.: A summary on digital library study and development in China. Available at: http://www.idl.pku.edu.cn/pdf/dlsummary.pdf (accessed 30 Apr. 2006).
resource-service mode (emphasizing getting information services online as quickly as possible); service-research mode (with the double aims of research and service); and collaboration-building mode (collaborating among multiple libraries, obtaining funds together, unifying standards, and focusing on the reasonable distribution and sharing of resources). The authors have made an overall inquiry into the current status of the usage of KOSs in digital libraries in China. The study is based on a review of the literature on the application of KOSs in digital libraries, interviews with experts involved in this area, and inspection of the interfaces of digital libraries. The results of the investigation are shown in Table 1.
Table 1. Survey of KOSs in digital libraries in China
No. 1
2
3
4 5
Name of digital Library/service Ancient Documents Resources Collection of Peking University Digital Library China Academic Digital Library & Information System ( CADLIS)
Institution Peking University Library
Locally developed classification system
The Management Center of CALIS & CADAL
Locally developed main categories
China Hospital Knowledge Database (CHKD) Full-text Databases
China Academic Journals (CAJ-CD)
Locally developed classification system, fifteen categories, classification developed by CHKD (2003 edition), subject headings developed by CHKD (2003 edition)
Chinese Book Database Chinese Journal Database (CJFD)
China Academic Journals (CAJ-CD) China Academic Journals (CAJ-CD)
Full-text Full-text
6
Chinese National Science Digital Library (CSDL)
Chinese National Science Digital Library (CSDL)
7
Digital Portal of National Library
Chinese National Library
8 9
Types of KOS
Chinese
Digital Resource Navigation of the Library of Nanjing Normal University Dunhuang Science Digital Library
The Library of Nanjing Normal University Digital Library Institute of Lanzhou University
fourteen
CLC, 19 categories, 3-4 levels CLC, 9 categories Document database: locally developed classification system Indexing system: multiple classification system (e.g. CLC, PACC, Mathematics Subject Classification) Integrated journal database: CLC Science China: CLC Subject gateways: locally developed classification system based on CLC, DDC21, GB/T13745-92 Classification and code disciplines Database: locally developed classification system five categories, 2 levels Locally developed classification system, 39 categories, one level Locally developed classification system
Table 1. (continued)
10 11
12
13
14
15 16 17
18
19 20 21 22 23 24 25 26
27
Founder Apabi Digital Library Home of Sursen Digital Library Internet Information Navigation Port of the Key Disciplines (one project of China Academic Library & Information System) (CALIS) Journal Collection of Agriculture Knowledge Repository in China (education edition) Journal Collection of Agriculture Knowledge Repository in China (government edition) King Base Digital Library of Renmin University of China KOS of Digital Library Based on Classification and Subject Heading Online Reading System of Digital Library of China Sendia E-Books Shanghai Picture Collection of Shanghai Digital Library Special Database System of Liaoning Province Library Super Star Digital Library The Children Digital Library of Shanghai The Retrieval Mode of Digital Library Based on Ontology The Subject Databases of CALIS Traffic & Transportation Engineering Information Gateways Tsinghua University Architecture Digital Library Wanfang Data Info Site
Founder Electronics CO., Ltd. Beijing Sursen Company
CLC (include all its categories and levels) Adapted CLC 31 categories (4 levels)
The Management Center of CALIS
Locally developed classification system 11 categories (3 levels)
China Academic Journals (CAJ-CD)
CLC, 19 levels)
China Academic Journals (CAJ-CD)
Locally developed classification system 14 categories (2-3 levels)
Renmin University China Digital Library
Only retrieval, no navigation
of
categories
CLC, Chinese Thesaurus
Wang Jun Digital Library of China Limited liability company Beijing Sendia Technologies Co. Ltd Shanghai Library Liaoning Province Library Beijing Century Superstar Information Technology Co. Ltd The Children Library of Shanghai
(2-3
Classified
CLC (involving its all categories) CLC (4-5 levels) & locally developed subject; classification (twenty-two categories, 2 levels) Locally developed classification system Locally developed classification system Chinese Library Classification (CLC), 22 categories CLC, 1 level
Dong Hui et al.
Ontology
The Management Center of CALIS
Locally developed classification system
The Library of Wuhan University of Technology
GB/13745-92 Classification and code disciplines
Tsinghua Library
Locally developed classification system Browse by subject: CLC, 2 levels. Others: locally developed classification system
University
Wanfang Data Co., Ltd..
Table 1 shows that almost all of the 27 digital libraries surveyed have made use of KOSs; the detailed usage of KOSs is listed in Table 2. It should be noted that some
digital libraries used more than one type of KOS, so the total number of KOSs in Table 1 exceeds 27.
Table 2. Frequency of KOSs used in digital libraries (type of KOS used: frequency of digital libraries)
Chinese Library Classification (CLC): 12
Classification Developed by CHKD: 1
Dewey Decimal Classification 21 (DDC21): 1
GB/T13745-92 Classification and Code Disciplines: 1
Locally developed classification system: 13
Mathematics Subject Classification: 1
Ontology: 1
PACC (Physics Abstracts, Classification and Contents): 1
Subject Headings Developed by CHKD: 1
The Subject Terms List of Chinese Classification: 1
Table 2 indicates that classification schemes are the main type of KOS used in digital libraries, especially CLC (44.5%) and locally developed classification systems (55.1%), the latter of which are partly built on CLC. The major issues with KOSs in digital libraries are: (1) some digital libraries lack classification navigation, e.g., the "King Base Digital Library of Renmin University of China"; (2) the usage of ontologies and subject headings is very poor: only Wang Jun used the Chinese Classified Thesaurus in his research program, and the "China Hospital Knowledge Database (CHKD) Full-text Databases" used the subject terms of CHKD. By contrast, the KOSs of digital libraries overseas pay great attention to the application of subject headings. For instance, in the inquiry made by the Canadian scholars Shiri and Molberg6, five types of subject headings were used across the 33 digital libraries surveyed; of these, 15 digital libraries adopted LCSH and 6 used locally developed thesauri. The percentage of digital libraries using subject headings reaches 63.6%, compared with only 7.4% in China. Moreover, among the digital libraries surveyed here, only "The Retrieval Mode of Digital Library Based on Ontology", whose first contributor is Dong Hui, has used an ontology. Therefore, the application of ontologies in digital libraries should be strengthened.
4.2 Investigation and Analysis of Interfaces of KOSs
By browsing the interfaces of the KOSs, the authors have also investigated the search, browse and navigation functions provided by the 27 digital libraries. Some interfaces (e.g., No. 2, 16, 23 in Table 1) cannot be browsed, and some (e.g., No. 6, 9,
6 Shiri, A., Molberg, K.: Interfaces to knowledge organization systems in Canadian digital library collections. Online Information Review. 6 (2005) 604-620.
10, 20, 24 in Table 1) provide different interfaces for different types/subjects of information resources and are therefore not suitable for this statistical analysis, so the real number of interfaces investigated is 19; the interface features of the KOSs surveyed are outlined in the Appendix. Based on the Appendix, the authors summarized the browsing, searching, and multiple-access-point features of the KOSs surveyed; see Table 3.
Table 3. Browse ways, search modes, access points in digital libraries interfaces
Browse ways Number Frequency Percentage 0
Frequency Percentage Number Frequency Percentage 18
94.7
1
1
5.26
17
89.5
4-7
8
42.1
6
31.6
8-16
9
47.4
7
36.8
>16
1
5.26
From the investigation above, it is clear that most of the interfaces provide a browse function for users, and the browse ways are varied to meet different users' needs; for instance, the percentage of interfaces offering multiple (≥2) browse ways reaches 31.5%. The interfaces of the KOSs surveyed also offer various types of search, e.g., simple search, advanced search and professional search; the share offering simple & advanced search reaches as high as 89.5%. The number of access points is mostly from 8 to 16 (about 47.4%), which can basically meet users' needs for multi-way searching. But some KOSs provide only one access point, such as the "Digital Resource Navigation of the Library of Nanjing Normal University", while some provide too many access points, such as the "Founder Apabi Digital Library", which provides 32 access points, many of which are hardly used, and at the same time its interface for selecting access points becomes complex. In addition, some KOSs provide only one level of browsing classes, so the correlativity of information is low; some can be browsed by the structure of the KOS, but no information is linked to it, so the information assurance principle is not embodied.
5 Some Suggestions

Compared with the KOSs of foreign countries, the KOSs in digital libraries in China show wide gaps. We consider that the following aspects should be enhanced.

5.1 Follow General Procedures to Construct KOSs in Digital Libraries

General procedures for constructing KOSs are: analyzing thoroughly the users' needs and the types of resources and services in the digital library → locating knowledge organization systems (it is preferable to find an existing KOS, such as a classification or vocabulary) → deciding the infrastructure → maintaining and updating the KOS in a timely manner → presenting the KOS to users. In addition, the intellectual property of the KOS should be taken care of, and the link between the KOS and the digital library should be established.
5.2 Integrate All Kinds of KOSs Organically into Digital Libraries

Every kind of KOS has its particular function. For example, a controlled vocabulary plays a role in selecting retrieval terms and building retrieval expressions, while an ontology can represent all kinds of complex relationships among concepts. The KOSs of digital libraries in China are not yet complete: thesauri, gazetteers, dictionaries, semantic networks, ontologies and so on are relatively weak, and we can learn from the advanced experience of foreign countries. For instance, SOSIG (Social Science Information Gateway) in Britain embedded HASSET (Humanities and Social Science Electronic Thesaurus) into its whole retrieval system; HASSET provides not only access points but also help for users in adjusting retrieval strategies. The American Alexandria Digital Library regards its gazetteer as a unique component and uses it to realize retrieval functions.

5.3 Offer Easy-to-Use Interfaces

It is convenient to provide more ways of browsing and searching on interfaces. In general, browse and search ways such as discipline, subject, time, area and resource type should be set up according to the information contents and their characteristics. At the same time, more search modes should be provided to meet the needs of different types and levels of users. For the display of retrieval results, choices such as the sorting basis, detailed or simple description, and the quantity displayed on one screen should be provided for users to select. It is also best to set up a retrieval help column.

5.4 Strengthen Ontology Application in Digital Libraries

Ontologies are explicit representations of the concepts in a field: we abstract a field of reality into a set of concepts (such as entities, properties, procedures) and relationships among concepts, which constitutes the ontology of this field7. Because an ontology is machine-understandable, it can be the medium for exchanging information between human and machine, and between machine and machine. Some related programs and practical accomplishments have appeared, such as the American UMLS (Unified Medical Language System), ADEPT (Alexandria Digital Earth Prototype) developed by the University of California, Santa Barbara, and OpenGALEN developed by the University of Manchester. Ontology has become a focus in China, but its application study should be extended and strengthened. Moreover, in order to realize cross-database browsing and searching based on KOSs, interoperation becomes one of the key techniques that should be addressed.
References

1. Hodge, G.: Systems of knowledge organization for digital libraries: beyond traditional authority files. Available at: http://www.clir.org/pubs/reports/pub91/contents.html (accessed 30 Apr. 2006)
2. Ancient Documents Resources Collection of Peking University Digital Library. Available at: http://rbdl.calis.edu.cn/pages/Search.htm (accessed 3 May 2006)
7 Dong Hui, Du Wenhua: The retrieval mode of digital library based on ontology and multiple agents. Journal of Library Science in China 2 (2004) 6-15.
3. China Academic Digital Library & Information System (CADLIS). Available at: http://www.cadlis.edu.cn/portal/index.jsp (accessed 15 May 2006)
4. China Hospital Knowledge Database (CHKD) Full-text Databases. Available at: http://kns50.chkd.cnki.net/kns50/navigator_chkd.aspx?ID=1 (accessed 18 May 2006)
5. Chinese Book Full-text Database. Available at: http://ckrd.cnki.net/GRID20/Navaigator.aspx?ID=5 (accessed 16 May 2006)
6. Chinese Journal Full-text Database (CJFD). Available at: http://cnki.toho-shoten.co.jp/line2_gycnki_zgxsqkgpbdzzzs.htm (accessed 16 May 2006)
7. Chinese National Science Digital Library (CSDL). Available at: http://www.csdl.ac.cn/index.html (accessed 5 May 2006)
8. Dunhuang Science Digital Library. Available at: http://202.201.7.239/dunhuang/login.jsp (accessed 3 May 2006)
9. Digital Portal of Chinese National Library. Available at: http://www.nlc.gov.cn/portal/index.htm (accessed 5 May 2006)
10. Digital Resource Navigation of the Library of Nanjing Normal University. Available at: http://lib.njnu.edu.cn/sjkdhl.asp (accessed 5 May 2006)
11. Founder Apabi Digital Library. Available at: http://ebook.lib.apabi.com/ (accessed 3 May 2006)
12. Home of Sursen Digital Library. Available at: http://www.shusheng.cn (accessed 12 May 2006)
13. Internet Information Navigation Port of the Key Disciplines of CALIS. Available at: http://202.117.24.168/cm/ (accessed 2 May 2006)
14. Journal Collection of Agriculture Knowledge Repository in China (education edition). Available at: http://www.cakd.cnki.net/kns50/Navigator.aspx?ID=17 (accessed 15 May 2006)
15. Journal Collection of Agriculture Knowledge Repository in China (government edition). Available at: http://ckrd.cnki.net/Grid20/Navigator.aspx?ID=196 (accessed 15 May 2006)
16. King Base Digital Library of Renmin University of China. Available at: http://202.112.118.49/ (accessed 16 May 2006)
17. Online Reading System of Digital Library of China. Available at: http://www.6uc.com/new/index.htm (accessed 16 May 2006)
18. Sendia E-Books. Available at: http://cnlib.lib.tsinghua.edu.cn/query/query.jsp (accessed 20 May 2006)
19. Shanghai Picture Collection of Shanghai Digital Library. Available at: http://www.digilib.sh.cn/dl/td/tdwb.htm (accessed 12 May 2006)
20. Special Database System of Liaoning Province Library. Available at: http://www.lnlib.com:8080/was40/index.htm (accessed 10 May 2006)
21. Super Star Digital Library. Available at: http://www.ssreader.com/ (accessed 3 May 2006)
22. The Children Digital Library of Shanghai. Available at: http://media.sst.org.cn/ebook/index.jsp (accessed 18 May 2006)
23. The Special Databases of CALIS. Available at: http://www.calis.edu.cn/calisnew/calis_index.asp?fid=3&class=5 (accessed 2 May 2006)
24. Traffic & Transportation Engineering Information Gateways. Available at: http://lib.whut.edu.cn/index.htm (accessed 12 May 2006)
25. Tsinghua University Architecture Digital Library. Available at: http://166.111.120.55:8001/ (accessed 10 May 2006)
26. Wanfang Data Info Site. Available at: http://www.wanfangdata.com.cn/ (accessed 12 May 2006)
27. Wang Jun: KOS of Digital Library Based on Classification and Subject Heading. Available at: http://vision.pku.edu.cn/ (accessed 5 May 2006)
Appendix: Survey of Interfaces of KOSs of Digital Libraries in China

The serial numbers in the appendix table correspond to the serial numbers in Table 1, which denote the same digital libraries. Besides the access points listed in the table, number 1 provides three further access points: recommendation level, discipline, resource format; number 3 provides three: institutes, quotation, intelligent search; number 4 provides three: binding means, the age of responsibility, the nationality of responsibility; number 7 provides twelve, including price, edition explanation, illustration and other details, notation, identifier type, impression, minor responsibility, minor responsible relation(s), responsible editor, contents, and words quantity; number 14 provides four: chief responsibility, minor contributor(s), resource format, system management; number 17 provides two: other responsibility, sorting; number 18 provides one: series title; number 20 provides one: publisher; number 26 provides two: contents, updating date; and number 27 provides one: department.
[Appendix table: for each of the 19 digital library interfaces surveyed (numbered as in Table 1), the table marks which browse ways are offered (discipline, subject, time, area, type of resource, version, name in alphabetical order, publisher/database, visit type/frequency of visit) and the total number of browse ways; which search modes are offered (simple retrieval, advanced retrieval, professional retrieval, second retrieval, spatial & temporal retrieval); which access points are offered (keyword, subject terms, class number/keeping number, full-text retrieval, overall retrieval, database, title, other titles, author/contributor, first contributor, publisher, place of publication/place of shooting, publication date/shooting date, description/abstract, source, ISSN/ISBN, year and issue of journal, related resource, relationship, fund, rights, version, date, language, coverage, type of resource, identifier, references) and the number of access points; and which display options are offered (ordering, detailed/simple description, quantity choice). The checkmark matrix itself could not be recovered from the source layout.]
Study on Distribution Model of Internet Information Resources in China

Liu Hui

Zhejiang University of Finance & Economics
18 Wenyuan Rd, Xiasha, Hangzhou, 310018, P.R. China
[email protected]
Abstract. To identify the distribution model of Internet information resources in China, a statistical study based on the relevant national surveys is conducted. The result shows that the capital, Beijing, outperforms its competitors nationwide and achieves economies of scale. At the same time, five other regions accumulate significantly more resources than the remaining regions. This pattern should have a significant impact on national information resource policies.

Keywords: Internet, IRM, Information resource exploration, Statistical analysis.
1 Preface

In order to fully understand the macro distribution of Internet information resources in China and to provide a decision-making foundation for the relevant institutions to carry out macro management, several institutions have conducted national surveys of Internet information resources over the past few years1. For example, CNNIC and other organizations conducted four such surveys, in 2001, 2002, 2003 and 2004 respectively, and did exploratory work to determine the real quantity and distribution of Internet information resources in China; these surveys have become the major support for further study of the distribution of Internet resources. This article uses the most recent data, from the fourth CNNIC survey (2004), to further analyze the regional distribution characteristics of Internet resources, discuss the problems and development trends of Internet information resources in China, and promote the scientific development and utilization of national Internet information resources.
2 Data and Analysis

The CNNIC survey covers all sites with domain names registered in China (except Hong Kong, Macao and Taiwan), that is, all sites under the domain names .com, .net, .org and .cn. The definition of Internet information resources adopted by the survey is: the sum of openly released web pages and online databases on
1 The Fifteenth Report on Survey of China Internet Development (2005/1). http://news.xinhuanet.com/it/2005-01/19/content_2480485.htm (2005-11-15).
the domestic Internet, with data up to the end of 2004, covering four items of basic data on the quantity of Internet information resources in each province (see Table 1)2. This article uses SPSS for the analysis, standardizes variables of different orders of magnitude, and in particular applies Q-type hierarchical cluster analysis, which gathers samples with common characteristics together according to the degree of correlation of the observed values or variables, making it easy to analyze samples of different classes3.

Table 1. 2004 China's Internet Information Resource Distribution
2 China Internet Network Information Center: 2004 Report on Survey of China's Internet Information Resource Distribution. http://news.xinhuanet.com/it/2005-04/14/content_2830215.htm (2005-12-10).
3 Yu Jianying, He Xuhong: Data Statistics Analysis and SPSS Application. Beijing: Renmin Youdian Publisher, 2003, 251-291.
Table 1. (continued)
[The body of Table 1 could not be recovered from the source layout. The surviving fragment covers rows 25-31 (Yunnan, Xizang, Shanxi, Gansu, Qinghai, Ningxia, Xinjiang Uygur) and the national total, giving for each region the four indexes described in the text: regional distribution of domain names (%), regional web site ratio (%), provincial web page number, and provincial web page byte number.]
Source: China Internet Network Information Center, 2004 Report on Survey of China's Internet Information Resource Distribution, http://news.xinhuanet.com/it/2005-04/14/content_2830215.htm (2005-12-10).
In the hierarchical cluster analysis, Beijing differs significantly from the other regions and forms an independent class whether the data are partitioned into 2, 3 or 4 classes. The economically developed provinces and cities are rather close to each other, while the provinces and cities that lag behind are also rather close to each other, and their similarities and differences appear clearly when the data are partitioned into 3 classes. Moreover, Fig. 1 shows that the distance between class 2 and class 3 is not very large, but the distance between these two classes and the class containing Beijing is very large. This illustrates that the capital, Beijing, is the region with the largest quantity of Internet information resources in China; even compared with developed regions such as Shanghai, Jiangsu, Zhejiang, Fujian and Guangdong, it still has a quantitative advantage.
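The Q-type clustering described here can also be reproduced outside SPSS. The sketch below is a minimal illustration rather than the author's actual procedure: it assumes the four provincial indexes are available as a NumPy array, standardizes them, and applies SciPy's average-linkage agglomerative clustering (the linkage named in Fig. 1); the variable names, the sample numbers and the three-cluster cut are all illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.stats import zscore

# Hypothetical input: one row per region, columns = the four survey indexes
# (domain-name share, web-site ratio, web-page count, byte count).
provinces = ["Beijing", "Shanghai", "Guangdong", "Zhejiang", "Qinghai"]  # illustrative subset
X = np.array([
    [25.0, 20.1, 7.2e7, 8.0e5],
    [ 9.0,  8.5, 1.1e7, 1.2e5],
    [12.0, 14.0, 1.5e7, 1.6e5],
    [ 6.5,  7.8, 0.9e7, 0.9e5],
    [ 0.1,  0.1, 2.5e4, 3.0e2],
])  # made-up numbers, for shape only

# Standardize columns of very different magnitude before clustering.
Xz = zscore(X, axis=0)

# Average linkage between groups, as in the paper's dendrogram.
Z = linkage(Xz, method="average", metric="euclidean")

# Cut the tree into three classes (leader / leaders / laggards).
labels = fcluster(Z, t=3, criterion="maxclust")
for name, lab in zip(provinces, labels):
    print(f"{name}: class {lab}")

# dendrogram(Z, labels=provinces)  # draw with matplotlib if desired
```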
3 Conclusions and Discussion

According to the above analysis, we can divide the provinces and cities into roughly three classes according to their quantities of Internet information resources (Table 2): obvious leaders in quantity, leaders in quantity, and relative laggards in quantity. When calculating the kurtosis of the four indexes of Table 1, "regional distribution of domain names", "regional web site ratio", "provincial web page number" and "provincial web page byte number", we found that all the kurtosis values are larger than 0, being 2.103, 4.896, 20.258 and 22.234 respectively; the number of web pages and the number of bytes in particular show a trend of gathering in certain regions. Among the four indexes, the first two reflect the category distribution of Internet information resources and the latter two reflect the depth of their contents. This illustrates that the types of information are more complete and the contents more detailed and profound in places where Internet information resources are abundant; meanwhile, in less developed places where Internet information resources are scarce, the types of information are fewer and the contents are relatively weak, neither systematic nor profound.
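As a quick aside on how such kurtosis figures can be obtained, the fragment below is a hedged sketch and not the author's computation: SciPy's kurtosis uses Fisher's definition, so values above 0 indicate heavier-than-normal tails, i.e., concentration of an index in a few regions; the sample data are invented.

```python
from scipy.stats import kurtosis

# Invented example: web-page counts concentrated in a couple of regions.
web_pages = [35_000_000, 9_000_000, 4_000_000, 1_200_000, 800_000,
             500_000, 300_000, 200_000, 100_000, 50_000]

# Fisher definition (normal distribution -> 0); a positive value means the
# mass is concentrated, matching the paper's reading of the indexes.
print(kurtosis(web_pages, fisher=True, bias=True))
```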
[Figure residue: the dendrogram (average linkage between groups, rescaled distance cluster combine) lists the regions in merge order from Qinghai, Ningxia, Xizang, Inner Mongolia, Xinjiang and the other less developed provinces through Jiangsu, Fujian, Zhejiang, Shanghai and Guangdong, with Beijing joining last as a class of its own.]
Fig. 1. Hierarchical cluster analysis dendrogram
Table 2. Regional Internet Information Resource Category
[The table body could not be recovered from the source; as described in the text, it groups the regions into three classes by quantity of Internet information resources: the obvious leader (Beijing), the leaders (Shanghai, Jiangsu, Zhejiang, Fujian, Guangdong), and the relative laggards (the remaining regions).]
Fig. 2. Regional distribution scale boxplot of national internet domain names
[Figure residue: boxplot of provincial web page counts on a 0 to 3x10^8 scale, with Beijing and Shanghai marked as extreme outliers.]
Fig. 3. Provincial distribution boxplot of number of web pages
As the capital of China, Beijing gathers a large number of state organs, national media organizations, well-known network enterprises, famous universities, national institutions such as the Chinese Academy of Sciences, and high-tech enterprises, which provide a solid foundation for the production, collection, processing, coordination, storage and transmission of network information and ensure that the information is standard in content, complete in type and rapidly updated. According to the data in Table 1, Beijing accounts for 35% of the national number of web pages and 39% of the national number of bytes, and has therefore become the leader in the development of Internet information resources and the center of national Internet information resources. Advanced information infrastructure and upgraded network technology make it easier to break through time, space and other physical barriers to information transmission4. In today's high-speed information network environment, the Internet information resources gathered in Beijing have already become strategic information resources shared by the whole nation. Five provinces and cities, led by Shanghai, are the nation's leaders in the development of Internet information resources owing to their advantages in economy, science and technology; the proportion of domain names distributed among them is even about the same as that of Beijing (see Fig. 2). If Beijing holds the lead because it gathers national information resources and is responsible for providing Internet information to the whole nation, then these five provinces and cities secure
4 Wang Huibo: Analysis on Internet information resource effective distribution. Journal of Information, 2004(1), 61-63.
their leading positions because of their own advantages, which reflects the tight association between local economies and social development. The practice of various countries indicates that dependence on relevant technology and equipment increases day by day during the development and utilization of Internet information resources, which leads to an excessive enrichment of information resources in certain regions and gives birth to the polarization of the "Internet-information rich" and the "Internet-information poor"5. In our country, the imbalance of regional development levels causes an imbalance in the development of Internet information resources. Fig. 3 shows the extremely wide gap in the distribution of the number of web pages among the different regions of the nation, and the low numbers of the less developed provinces and cities are very worrisome, as the low degree of development of Internet information resources in these regions will affect the overall progress of local social development. In conclusion, the regional imbalance in the distribution of Internet information resources in our country is relatively significant compared with the imbalances in science and technology resources and literature resources. Information distribution across geographical restrictions through the Internet has greatly promoted the sharing of Internet information resources within the whole nation. The same information can be used by thousands of people, which reflects the idea of joint construction and sharing, and the provincial development of local Internet information resources is a necessary part of each province's opening and development strategies. The author believes that further study is needed on the construction and utilization of Internet information resources in China: the combination of centralized large-scale development and decentralized development; promotion by both administrative power and market power; fair and equal sharing of information together with a reasonable price system for special information services; and the development of local and special information resources alongside the utilization of external information resources.
References

1. The Fifteenth Survey on China Internet Development (2005/1). http://news.xinhuanet.com/it/2005-01/19/content_2480485.htm (2005-11-15)
2. China Internet Network Information Center: 2004 Survey on Chinese Internet Information Resources. 2005.2. http://tech.sina.com.cn/focus/04_net_rep/index.shtml (2006-2-15)
3. Yu Jianying, He Xuhong: Data Statistics Analysis and SPSS Application. Beijing: Renmin Youdian Publishing House, 2003, 251-291
4. Wang Huibo: Analysis on effective distribution of network information resources. Journal of Information, 2004(1), 61-63
5. Zhao Jinbao, Wang Xiaohong: On Internet information resource distribution. Journal of Handan Medical Higher School, 2004(3), 247-248
6. Ma Feicheng: Principles of Information Management. Wuhan: Wuhan University Publishing House, 2002, 62-75
5 Zhao Jinbao, Wang Xiaohong: On Internet information resource distribution. Journal of Handan Medical Higher School, 2004(3), 247-248.
Collaborative User Tracking for Community Organization on Blogosphere: A Case Study of eLearning@BlogGrid

Jason J. Jung1,2, Inay Ha2, Supratip Ghose2, and Geun-Sik Jo2

1 INRIA Rhône-Alpes
ZIRST 655 avenue de l'Europe, Montbonnot, 38334 Saint Ismier cedex, France
[email protected]
2 Intelligent E-Commerce Systems Laboratory
School of Computer and Information Engineering, Inha University
253 Yonghyun-Dong, Nam-Gu, Incheon 402-751 Korea
[email protected], [email protected], [email protected]
Abstract. The aim of this study is to recommend relevant information to users by organizing user communities in an electronic learning environment. In this paper, we propose a weblog-based approach to modeling users during the collaborative learning process. Thereby, we formulate user behaviors on the blogosphere, e.g., posting articles, linking to neighbors, and interactions between neighbors. These user models can be compared with each other to quantify similarities between users, and we apply co-occurrence analysis methods to do so. In this study, we deploy the BlogGrid platform to support an information pushing service for students. Through our experimental results, we found that the average weighting measurement scheme with co-occurrence patterns from responding activities (e.g., comments and trackbacks) is the most significant pattern for information pushing in collaborative learning.
1 Introduction
Efficient information pushing is an important facility for collaborative systems. However, it is difficult to measure how relevant a certain resource is to people. In particular, electronic learning (e-learning) systems have been concerned with information (or knowledge) sharing between students [1]. These systems have to infer the contexts of the participating students from the students' activities (e.g., social interactions) during collaborative learning. As a simple example, when a certain student is posting questions (or replies) about 'Nihilism', we can say that he is interested in 'Philosophy' and recommend new resources related to this topic to him. We have focused on two kinds of difficulties for efficient interaction: i) the resource searching problem caused by overwhelming information, and ii) the user clustering problem caused by social isolation on the blogosphere. In order to solve these problems, we regard user activities as implicit evidence representing user context, and
we apply a weblogging system to keep track of these activities. A blog (originally weblog) is a web application presented as a web page consisting of periodic posts, normally in reverse chronological order [2]. Since Jorn Barger coined the term in 1997, the usage of blogs has spread from individual diaries to arms of political campaigns, media programs and corporations, and from the writing of one occasional author to the collaboration of a large community of writers [3]. More importantly, each blogger can make explicit connections with others such as family, friends, and colleagues. Through these social activities, bloggers can indeed organize communities in the form of blogs. Hence, in terms of e-learning, we note two important features of blogs, as follows.
– Personal content management. Students can create and manage various types of contents including personal information (e.g., personal history, commentaries, photos and hyperlinks of their classmates) as well as learning materials (e.g., presentation files, examples, and webpages).
– Information propagation by social activities. More importantly, along the social links (i.e., hyperlinks), students can access the blogs of their classmates. In contrast to simple browsing of a bunch of hypertext documents, they can promptly take actions within those blogs [4]. A representative example of such activities is leaving comments on certain resources (e.g., questioning and replying) [5].
Current e-learning solutions are not sufficiently aware of the context of the learner, that is, the individual's characteristics and the organizational context [6]. In order to model user context, we formulate the user activities on the blogosphere and represent an activity as a set of features. Then, by aggregating the sets of features extracted from a sequence of activities, the user context is efficiently adapted over time. More importantly, in this paper, we propose a novel user clustering method to organize communities as sets of like-minded students during collaborative learning on the BlogGrid platform. The outline of this paper is as follows. Sect. 2 explains how to model students by their activities on the blogosphere. In Sect. 3, we describe the system architecture and the information pushing service of BlogGrid for collaboration between students. Sect. 4 shows a simple example and its experimental results. In Sect. 5, we discuss our system in relation to some previous work. Finally, Sect. 6 draws a conclusion and explains our future plans for improving the system.
2 Modeling Students on Collaborative Learning
In this paper, we assume that the actions taken by students have implicit meaning and can be used to extract useful information about their context. In order to model behaviors on the blogosphere, we need to look closely at the roles of the blogosphere and the students. Blogospheres are divided into personal blogs and community blogs [7]. In the context of the e-learning domain, community blogs correspond exactly to class blogs. Thus, an e-learning blogosphere is organized as a set of student blogs U = {u1, . . . , u|U|} and a set of
Fig. 1. Bipartite blogosphere in an e-learning environment: three class blogs and six student blogs
class blogs C = {c1, . . . , c|C|}. Basically, the i-th student ui possesses his own personal blog, which links to the set of class blogs in which he is taking part. As shown in Fig. 1, this is organized as a bipartite graph structure. We note the activities that students can conduct on the blogosphere, as follows.
– Linking between classmates, L. Linking establishes a social network by making connections with other people such as family, friends, and colleagues. More importantly, in order to share information about a particular topic, students can link to classmates who are relatively closer to them than others. The set of neighbor students of ui is given by

Li = {uα | ADJ(iα) = 1}  (1)

where ADJ is a symmetric adjacency matrix of size |U| × |U|: if user ui has a link to uj, then ADJ(ij) = 1; otherwise ADJ(ij) = 0. For simplicity, the attached weight Wi(uα ∈ Li) is given by either 1 or 0 from ADJ, rather than by the strength of the social tie between nodes.
– Posting articles, P. Posting is the most basic action on the blogosphere. By taking this action, students input various types of information and enrich the blogosphere. This action explicitly represents the corresponding blogger's preferences. The set of articles posted by user ui is given by

Pi = {p1, . . . , pN},  (2)

and the corresponding weight Wi(pα) is given by

Wi(pα) =  (3)

where the function Categoryi(pβ) returns the category labeling pβ. The weight expresses how much a user is interested in a certain topic (i.e., category); for example, a user is probably interested in "music" if he has been posting articles about recital schedules and musical instruments. In our case, we choose the user-defined category information, because content-based analysis methods (e.g., keyword-based feature extraction) are effective but very expensive, especially within information-overwhelming spaces like the blogosphere.
– Navigating, N. In order to get relevant information within a blog system, people can visit other blogs. Students can navigate the rest of the blogosphere by the following two methods.
1. Random browsing. Students can randomly jump into other blogs. In fact, this has been the only way to deal with the nepotism problem.
2. Accessing neighbors on the social network. By referring to the list of neighbors, students can easily move into their blogs.
These activities by student ui can simply be accumulated as

Ni = {uα, uβ, . . . }  (4)

where uα denotes the personal blog of the corresponding student. For frequency analysis of access patterns, we employ the sessionization method proposed in [8]. Thus, the attached weight Wi(uα ∈ Ni) is given by

Wi(uα ∈ Ni) = occur(uα) × 1 / |Session(Ni, T)|  (5)

where T is a time span (e.g., an hour, a day, or a week), and the function Session returns a set of sessions {sessiont | sessiont = ⟨uα, uβ, . . . ⟩} by segmenting Ni with time span T. The function occur checks whether a given blog address uα is located in a session, and counts all the sessions in which uα occurs. We do not need to take care of navigation on class blogs.
– Responding, R. Students can respond to a certain article while navigating the blogosphere. Their responses, such as personal opinions, sympathies, antipathies, or even apathy, are expressed in the following two ways:
1. Comment. A student can leave responses consisting of short text sentences. More than one comment can be serially and continuously attached to each posted article.
2. Trackback. In contrast to comments, a trackback allows a student to see who has seen the original post and has written another entry concerning it. It works by sending a 'ping' between the blogs, thereby providing the alert. A trackback typically appears below a blog entry and shows a summary of what has been written on the target blog, together with a URL and the name of that blog. Since its implementation by Movable Type1, most blogging systems have adopted the trackback mechanism as a generic function.
These activities by student ui can simply be accumulated as

Ri = {pα,a, pβ,b, . . . }  (6)

where pα,a denotes the a-th post in the personal blog uα. We can easily see that Ri is a subset of Ni. The weight of each responding is given by

Wi(pα,a) =  (7)

where colocated is a function counting the comments and trackbacks colocated in the same responding. Moreover, not only free-text sentences but also numeric rating formats (e.g., from 0 to 5) and voting formats (e.g., "Yes" or "No") can be applied to reflect the corresponding students' interests and opinions. Another interesting feature is that responding can be nested, which means that students can respond to a comment already attached to an article.
Overall, from the i-th student's behaviors on the blogosphere, his model is represented as

ui = ⟨L, P, N, R⟩  (8)

where the elements are assumed to be mutually exclusive of each other.
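To make the user model of Eq. (8) concrete, the sketch below shows one possible in-memory representation; it is an illustrative assumption rather than the paper's implementation, and the class, field and function names (BlogUserModel, category_weights, etc.) are invented for this example.

```python
from dataclasses import dataclass, field
from collections import Counter
from typing import Dict, List, Set

@dataclass
class BlogUserModel:
    """One student u_i = <L, P, N, R> built from blogosphere activity logs."""
    neighbors: Set[str] = field(default_factory=set)          # L: linked classmates
    post_categories: List[str] = field(default_factory=list)  # P: category of each post
    visited_blogs: List[str] = field(default_factory=list)    # N: navigation trail
    responded_posts: List[str] = field(default_factory=list)  # R: posts commented/trackbacked

    def category_weights(self) -> Dict[str, float]:
        # Weight of a category = share of the user's posts labeled with it
        # (one plausible reading of the posting weight W_i; an assumption here).
        total = len(self.post_categories) or 1
        return {c: n / total for c, n in Counter(self.post_categories).items()}

# Tiny usage example with made-up data.
u1 = BlogUserModel(neighbors={"u2", "u3"},
                   post_categories=["neural networks", "neural networks", "web services"],
                   visited_blogs=["u2", "u2", "u4"],
                   responded_posts=["u2/post-17"])
print(u1.category_weights())   # approx. {'neural networks': 0.67, 'web services': 0.33}
```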
3 Information Pushing on eLearning@BlogGrid
Here, we want to compare user models in order to organize communities. Each community consists only of like-minded students, e.g., collaborative networks [9]. This community organization plays an important role in efficiently providing information to the students participating in our system. Thereby, we apply co-occurrence analysis methods to measure the similarities between students. Then, we extend the generic BlogGrid architecture [10] to eLearning@BlogGrid to handle the collaboration problem between students.

3.1 Community Organization Based on Similarity Measurement
A simple scheme for measuring the similarity between two users ui and uj is based on the equation

Sim(ui, uj) = max_{C ∈ {L, P, N, R}} H[C_ui, C_uj, õ(C_ui, C_uj)]  (9)

where the function õ obtains the set K of elements common to both sets. The notation H indicates the heuristic functions that a system can apply to quantify the similarity between two users. In this paper, we utilize three different heuristic functions, mentioned in [11], to compare two arbitrary sets. Derived from the previous equation, we formulate these heuristics as shown in the following equations:

H[A, B, K] = |K| / max(|A|, |B|)  (10)

H[A, B, K] = max_{ki ∈ K} W(ki)  (11)

H[A, B, K] = ( Σ_{i=1..|K|} W(ki) ) / |K|  (12)

where K is õ(A, B) and ki ∈ K. While Eq. 10 simply expresses the minimum ratio of common categories, the other two use the weight value of each category: Eq. 11 chooses the common category whose weight value is maximal, and Eq. 12 computes the average weight value of the common categories.
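For readers who prefer running code, the following sketch re-expresses Eqs. 9-12 over weighted category sets. It is a minimal illustration under the assumption that each activity type has been reduced to a dict mapping categories (or neighbor ids) to weights; the function names and data layout are ours, not the authors'.

```python
from typing import Dict

Weights = Dict[str, float]   # category -> weight for one activity type

def heuristic_ratio(a: Weights, b: Weights) -> float:
    """Eq. 10: |K| / max(|A|, |B|), the minimum ratio of common categories."""
    common = a.keys() & b.keys()
    denom = max(len(a), len(b)) or 1
    return len(common) / denom

def heuristic_max(a: Weights, b: Weights) -> float:
    """Eq. 11: maximum weight among the common categories."""
    common = a.keys() & b.keys()
    return max((a[k] for k in common), default=0.0)

def heuristic_avg(a: Weights, b: Weights) -> float:
    """Eq. 12: average weight of the common categories."""
    common = a.keys() & b.keys()
    return sum(a[k] for k in common) / len(common) if common else 0.0

def similarity(user_i: Dict[str, Weights], user_j: Dict[str, Weights], h=heuristic_avg) -> float:
    """Eq. 9: best match over the four activity types L, P, N, R."""
    return max(h(user_i[c], user_j[c]) for c in ("L", "P", "N", "R"))

# Toy usage with two invented students.
ui = {"L": {"u2": 1}, "P": {"nn": 0.7, "ws": 0.3}, "N": {"u2": 0.5}, "R": {"nn": 1.0}}
uj = {"L": {"u1": 1}, "P": {"nn": 0.6, "grid": 0.4}, "N": {"u1": 0.2}, "R": {"nn": 0.8}}
print(similarity(ui, uj))
```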
Based on the similarities between two arbitrary students, we can apply a nonparametric approach, e.g., the nearest neighborhood method [12]. However, we also have to consider the physical social links ADJ built by L. Extending [9], the task is to maximize a "semantic" modularity function Q on the social network. Given the number of communities k, a social network S over a set of students can be partitioned into a set of communities (or subgroups) G = {g1, . . . , gk}. A user can be involved in more than one community, meaning that a student in gi can also be a member of gj. The modularity function Q is formulated as

Q(S) = Σ_{i=1..k} ( Σ_{ua, ub ∈ gi, ADJ(ab)=1} Sim(ua, ub) ) / |gi|  (13)

where all possible pairs of bloggers should be considered. Thus, G(S) is discovered when Q(S) is maximized. To compute this, in this paper we applied k-nearest neighborhood methods. Fig. 2 shows a simple example. The links between two nodes are weighted by the similarity between the corresponding students' behaviors. First, we find the k nodes whose similarity summation is highest
Fig. 2. Example of community organization (six student nodes a-f connected by links weighted with pairwise similarities)
(here, k = 2 and the highest nodes are c = 3.2 and f = 1.9). From these centers of the communities, the rest of the members are assigned incrementally, until every node is involved in at least one community.
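The sketch below illustrates this greedy organization step under our own simplifying assumptions (it is not the authors' code): given pairwise similarities on the social links, it picks the k nodes with the highest similarity sums as community centers and then attaches every remaining node to its most similar center.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

def organize_communities(sim: Dict[Edge, float], nodes: List[str], k: int) -> Dict[str, List[str]]:
    """Greedy variant of the k-center step described for Fig. 2 (illustrative only)."""
    def s(a: str, b: str) -> float:
        return sim.get((a, b), sim.get((b, a), 0.0))

    # 1) centers = the k nodes whose summed link similarity is highest.
    strength = {n: sum(s(n, m) for m in nodes if m != n) for n in nodes}
    centers = sorted(nodes, key=strength.get, reverse=True)[:k]

    # 2) attach every other node to its most similar center.
    communities = {c: [c] for c in centers}
    for n in nodes:
        if n in centers:
            continue
        best = max(centers, key=lambda c: s(n, c))
        communities[best].append(n)
    return communities

# Invented example weights in the spirit of Fig. 2 (not the figure's exact values).
sim = {("a", "b"): 0.9, ("a", "c"): 0.6, ("b", "c"): 1.0, ("c", "d"): 0.8,
       ("c", "e"): 0.7, ("e", "f"): 0.9, ("d", "f"): 0.3}
print(organize_communities(sim, ["a", "b", "c", "d", "e", "f"], k=2))
```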
3.2 Proactive Information Pushing Service
We refer to the linkages between students in the same community as information channels along which information can flow. Along these estimated information channels, relevant pieces of information should be pushed actively. The information pushing service proposed in this paper is remote and asynchronous, because it is based on the web environment and on information about a participant's interests extracted from his own behaviors. We embed an autonomous and proactive agent module into this system. All communication between agents is conducted without user intervention.
Moreover, while browsing the blogosphere to search for information, users can be "implicitly" given recommendations by the facilitator in the following two ways:
– By querying the facilitator for specific information. After information about a particular topic is requested, the facilitator can determine who has the maximum weight value for that topic by scanning its yellow pages.
– By broadcasting new information from like-minded student bloggers via the facilitator. Every time a student responds to a new post or comment, this fact, after normalization, is sent to the facilitator.
The students within the same community can thereby obtain information related to the common concepts in their own preferences from their neighbors.
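As an illustration of the first pushing mode, the fragment below sketches how a facilitator might answer a topic query by picking the community member with the highest weight for that topic; the data structure and names are assumptions made for this example, not the BlogGrid implementation.

```python
from typing import Dict, Optional

# community: member id -> topic weights (e.g., derived from posting/responding activity)
Community = Dict[str, Dict[str, float]]

def answer_topic_query(community: Community, topic: str) -> Optional[str]:
    """Return the member with the maximum weight for `topic`, if anyone covers it."""
    candidates = {uid: w[topic] for uid, w in community.items() if topic in w}
    return max(candidates, key=candidates.get) if candidates else None

community = {
    "u1": {"neural networks": 0.7, "web services": 0.3},
    "u2": {"neural networks": 0.4, "grid computing": 0.6},
    "u3": {"web services": 0.9},
}
print(answer_topic_query(community, "web services"))   # -> 'u3'
```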
3.3 System Architecture of eLearning@BlogGrid
As shown in Fig. 3, the whole system architecture of eLearning@BlogGrid consists of two main parts: i) in the middle of the blogosphere, the BlogGrid server containing a facilitator (or mediator) with a data repository, and ii) a client-side blogging browser which is capable of communicating with the facilitator.
Fig. 3. System architecture and graphical user interface of BlogGrid (the BlogGrid server hosts the facilitator, query generator and data repository; the client-side BlogGrid browser manages the blogger's set of links, set of categories and list of bloggers)
Through the personal agents' reporting of bloggers' responding activities, the facilitator agent can automatically generate queries (e.g., SQL queries) using the similarity between the users in the same community (Eq. 13), along with recommendations to provide to the corresponding students. Each student blogger needs a personal agent module. This agent initializes and manages the corresponding blogger's preferences; it therefore has to monitor the blogger's actions and report them to the facilitator for storage in the blogspace repository. The graphical user interface of the blogging browser is shown at the bottom left of Fig. 3.
4 Experimental Results
In this section, we explain the two main experimental issues of this paper. Firstly, we evaluated our community organization mechanism on a set of students participating in our experiments, with respect to three different heuristic functions (Eq. 10-12) and four different behaviors (Eq. 8). Secondly, we verified the performance of the recommendations generated for collaborative learning among students by interviewing the students. In order to conduct experiments on the proposed system, we built a testing environment2. We invited nine Masters students attending two classes (i.e., Neural Network and Web Services) at Inha University, Korea. Initially, they had to build their own blogs using the Blojsom platform3 and, for posting articles, collect research papers they were interested in from selected conference proceedings such as
– Proceedings of the IEEE International Joint Conference on Neural Networks 2004,
– Proceedings of the IEEE International Joint Conference on Neural Networks 2005,
– Proceedings of the IEEE International Conference on Web Services 2004, and
– Proceedings of the IEEE International Conference on Web Services 2005
by looking at four fields: title, authors, keywords, and abstract. Additionally, in order to build a social network, they were asked to declare their own friends (max. three) who were socially closer to them than others. Finally, we set each student's preference model by using questionnaires about the two classes. These datasets are available online4.
Community Organization. Each participant was notified of the set of students estimated to be in the same community, with respect to the heuristic function and the blog activities. Table 1 shows the experimental results on community organization. We computed the matching ratio against the communities organized by preference similarities from the questionnaires (vector-space model). Eventually, we found that the heuristic function used in Eq. 12 outperforms the other two functions. Moreover, rather than any other activity, responding activities should be emphasized to measure the similarity among people on the blogosphere.
Recommendation Generation. We evaluated the set of BlogGrid recommendations provided to the participants as organized by the three different heuristic functions and the blogosphere activities. For measuring the precision of recommendation, we computed the mean absolute error (MAE), given by

MAE = Number of false recommendations / Total number of recommendations
where false recommendation is discriminated by comparing with the pre-defined user preferences (e.g., a set of categories). As presented in Table 2, the community organization by responding activity and heuristic function by Eq. 10 was the best formation among students for minimizing the recommendation error (about 21.34%).
5 Discussion and Related Work
We want to discuss some significant findings uncovered in this study. Firstly, we found that the average weighting measurement scheme over co-occurrence patterns of the responding activities (e.g., comments and trackbacks) yields the most significant patterns for information pushing in collaborative learning. We believe that students take responding actions on articles in which they have "very high and reliable" interest. As a result, these activities are among the most applicable implicit behaviors for measuring the similarity between users; they have shown 24.0% and 14.77% improvements compared to posting and navigating, respectively. Moreover, with respect to the heuristic functions, the average weight ( Σ_{i=1..|K|} W(ki) ) / |K| in Eq. 12 has shown about a 28.74% improvement.
Secondly, recommendations were able to be propagated over the distributed personal information spaces. Responding activities together with the heuristic function |K| / max(|A|, |B|) (Eq. 10) have shown the best combination for organizing communities that provide recommendations. In particular, communities organized by the linking activity have shown relatively high recommendation performance.
5.1 Comparing with Client-Server Systems
Here, we mention some improvements over traditional centralized systems for collaborative e-learning. Compared with client-server e-learning systems such as HARMONY [13] and PLANETLAB [14], the major difference is dynamic
socialization. The blogosphere provides people a way to express their interests explicitly and implicitly while configuring their own social networks. Henceforth, we consider the question "Does a socialized information space improve the performance of collaborative learning among students?" The students in our system can be promptly switched into the most relevant communities in real time. Dynamic text sources such as RSS (Really Simple Syndication)5, a family of XML web feed formats used for Web syndication by (among other things) news websites, weblogs and podcasts, can also be expected to be embedded into our system.
5.2 Category Representation
Another discussion issue is the representation of blog categories. We built two classes: WS (Web Services) and NN (Neural Networks). The categories of the classes WS and NN are represented as simple lists and trees, respectively. As a result, the hierarchical structure, in the case of NN, has shown better performance: bloggers were able to express their interests in more detail. Furthermore, when measuring the similarity between two patterns in vector space, hierarchically represented patterns could reduce the uncertainty.
6 Conclusions and Future Work
The traditional paradigm of computer-aided instruction (CAI) has been shifted by diverse information systems. Our blogging-based e-learning environment has been shown to support efficient information sharing and propagation between students. In this paper, we argue that the strength of social ties between students is one of the most important criteria for collaboration in e-learning systems. Therefore, we apply the grid computing paradigm, which can support an efficient framework for information sharing between heterogeneous sources, to the blogosphere and its students. Practically, we apply co-occurrence analysis methods for measuring the similarities between activities, and we empirically evaluated three different heuristic functions. As future work, we plan to apply our system to a large set of students for scalability testing. In particular, we expect that our system can guide beginning students toward the up-to-date context of an e-class. As another important future study, similar to [15], we plan to implement information visualization tools to visualize the social interactions among students and help students improve understandability. Finally, we will combine a blackboard architecture with our grid-based system, like wiki-based learning systems [16,17].
3. Blood, R.: How blogging software reshapes the online community. Communications of the ACM 47(12) (2004) 53-55
4. Higgins, C.J., Reeves, L., Byrd, E.: Interactive online journaling: a campus-wide implementation of blogging software. In: Proceedings of the 32nd Annual ACM SIGUCCS Conference on User Services (SIGUCCS 04), New York, NY, USA, ACM Press (2004) 139-142
5. Xu, W., Kreijns, K., Hu, J.: Designing social navigation for a virtual community of practice. In Pan, Z., Aylett, R., Diener, H., Jin, X., Göbel, S., Li, L., eds.: Proceedings of the First International Conference on Technologies for E-Learning and Digital Entertainment (Edutainment 2006), April 16-19. Volume 3942 of Lecture Notes in Computer Science, Springer (2006) 27-38
6. Schmidt, A., Winterhalter, C.: User context aware delivery of e-learning material: Approach and architecture. Journal of Universal Computer Science 10(1) (2004) 28-36
7. Nardi, B.A., Schiano, D.J., Gumbrecht, M., Swartz, L.: Why we blog. Communications of the ACM 47(12) (2004) 41-46
8. Jung, J.J.: Semantic preprocessing of web request streams for web usage mining. Journal of Universal Computer Science 11(8) (2005) 1383-1396
9. Newman, M.E.J.: Fast algorithm for detecting community structure in networks. Physical Review E 69 (2004) 066133
10. Jung, J.J., Ha, I., Jo, G.: BlogGrid: Towards an efficient information pushing service on blogspace. In Zhuge, H., Fox, G., eds.: Proceedings of the 4th International Conference on Grid and Cooperative Computing, November 30 - December 3, 2005. Volume 3795 of Lecture Notes in Computer Science, Springer (2005) 178-183
11. Jung, J.J., Lee, K.S., Park, S.B., Jo, G.S.: Efficient web browsing with semantic annotation: A case study of product images in e-commerce sites. IEICE - Trans. Inf. Syst. E88-D(5) (2005) 843-850
12. Gowda, K.C., Krishna, G.: Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern Recognition 10(2) (1978) 105-112
13. Kojiri, T., Watanabe, T.: Harmony: Web-based adaptive collaborative learning environment. In: Proceedings of the International Conference on Computers in Education (ICCE/SchoolNet) (2001) 559-566
14. Chen, Z., Hu, T., Yu, Y.: Analysis and research about an on-line collaborative learning teams based grids. In Pan, Z., Aylett, R., Diener, H., Jin, X., Göbel, S., Li, L., eds.: Proceedings of the First International Conference on Technologies for E-Learning and Digital Entertainment (Edutainment), April 16-19. Volume 3942 of Lecture Notes in Computer Science, Springer (2006) 735-744
15. Tscherteu, G.: The blogosphere map: visualising microcontent dissemination inspired by Maria Montessori. In Hug, T., Lindner, M., Bruck, P.A., eds.: Proceedings of Microlearning, Innsbruck University Press (2005) 109-120
16. Bruns, A., Humphreys, S.: Wikis in teaching and assessment: the M/Cyclopedia project. In: Proceedings of the 2005 International Symposium on Wikis (WikiSym 05), New York, NY, USA, ACM Press (2005) 25-32
17. Reinhold, S., Abawi, D.F.: Concepts for extending wiki systems to supplement collaborative learning. In Pan, Z., Aylett, R., Diener, H., Jin, X., Göbel, S., Li, L., eds.: Proceedings of the First International Conference on Technologies for E-Learning and Digital Entertainment (Edutainment), April 16-19. Volume 3942 of Lecture Notes in Computer Science, Springer (2006) 755-767
Adaptive UI Storytelling System Using MOL

Sunghan Bae and Rohae Myung

User Interface Lab., Korea University, Seoul, S. Korea
[email protected], [email protected]
Abstract. In this paper, we propose an Adaptive User Interface Storytelling System using a Multimedia Object Library (MOL). It consists of three key components: the Adaptive User Interface (AUI), the Storytelling Engine, and the Multimedia Object Library (MOL). The AUI module allows users to input their descriptions of objects, guided by a selected genre template. Through a context cognition process, the storytelling engine finds proper multimedia objects in the MOL. The Multimedia Object Library contains various objects (e.g., characters, graphic objects, background images, music, etc.) that are carefully prepared for the designated application. We believe the proposed system enables various product ideas in the area of content production, including education, e-learning, and multimedia books.

Keywords: Storytelling Engine, Adaptive User Interface, Content, e-learning.
The proposed system has the following advantages. It allows users to create the designated application more easily, which is especially helpful for those who are not familiar with creating multimedia content. It also provides greater flexibility and expandability for various industries, and the quality is limited only by the provided templates and the multimedia objects in the MOL. We expect this idea can be utilized to create a great storytelling tool. This paper is organized as follows: we describe the conceptual design of the proposed system in Section 2. In Sections 3 and 4, we give detailed explanations of the system and its applications. Finally, the conclusion and future works are presented in Section 5.
2 Conceptual Design of the AUI Storytelling System

In the proposed system, we divide the modules as follows: a data input module, a database of templates for categorized variables of multimedia objects, a database of categorized story templates, the storytelling engine, the multimedia object library, and a storytelling simulation module. In the data input module, the user gets proper templates based on the target application. Each designated template comes with an adaptive user interface that takes each user's level into account. During input, the user describes what they want by typing either a short sentence or words; the storytelling engine then performs context cognition and simultaneously suggests proper multimedia objects suited to what the user wants.
[Figure residue: the diagram links the AUI (genre selection, story input, categorized-variables input template), the storytelling engine (context cognition, MOL-matching function, story editing, object editing for background/graphic/sound, voice recording), the categorized story templates, the Multimedia Object Library (MOL), and the multimedia storytelling output.]
Fig. 1. Conceptual design of the AUI Storytelling System
If users want to input a story, they can do so using the story editing utility and a story template. Object editing and voice recording utilities are also provided. Each page can then be added to build the whole story, like a storyboard. After completing all pages, the multimedia presentation is shown by the story simulation module.
3 AUI Storytelling System

3.1 Data Input Module Using the Adaptive User Interface

The Adaptive User Interface allows users to choose a background graphic, various graphic objects, a character and sound, and to write the story with ease for each scene. Even though these basic options are nearly common to all templates, each template is designed following the procedure below:
• Collecting requirements for each application
• Sampling of users
• Designing usability criteria
• Analyzing the characteristics of users through usability testing with the existing UI
• Designing and testing a prototype
• Designing the Adaptive User Interface
Users can choose the user interface at the template selection screen. In terms of usability testing, we performed both qualitative and quantitative analyses. The qualitative analysis is based on users' subjective judgments of satisfaction with the user interface. The quantitative analysis obtains the sensitivity of the interaction between the user and the interface by calculating the probabilities of hits, misses, false alarms and correct rejections based on Signal Detection Theory (SDT).
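To make the SDT measure concrete, the fragment below sketches the usual sensitivity computation (d' as the difference of the z-transformed hit and false-alarm rates). The counts are invented, and the paper does not state which exact SDT statistic was used, so treat this as an illustrative assumption.

```python
from statistics import NormalDist

def d_prime(hits: int, misses: int, false_alarms: int, correct_rejections: int) -> float:
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    # Small correction keeps rates away from 0 and 1 so z() stays finite.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return z(hit_rate) - z(fa_rate)

# Invented counts from a hypothetical interface test session.
print(d_prime(hits=42, misses=8, false_alarms=6, correct_rejections=44))
```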
Fig. 2. Sample of the Adaptable User Interface (storytelling template with object info input, story input, and a "more options" control for objects)
Figure 2 illustrates the structure of a sample interface. When the user selects a template category at the template selection window, which is activated by clicking the "more options" button, the corresponding template is opened with the adaptable UI. There are data input fields for the various objects, each with a "more options" button, and the user can specify detailed information in order to select the particular object they prefer.

3.2 Storytelling Engine

The Storytelling Engine's major functions are context cognition and matching objects in the Multimedia Object Library (MOL). It also provides story editing, object editing and voice recording. Based on our AUI design, the user can input either a short sentence or a word. The context cognition function then analyzes the user's input and finds out what the user most likely wants, using the Naïve Bayes classifier technique and a Hidden Markov Model (HMM). The Storytelling Engine then finds the proper objects in the Multimedia Object Library (MOL).
Fig. 3. Storytelling Engine architecture (context cognition, MOL-matching function, story editing, object editing for background/graphic/sound, and voice recording, connected to the Multimedia Object Library)
3.2.1 Storytelling Context Cognition Module
In this module, we apply a degree of Natural Language Processing (NLP) using a Naïve Bayes classifier. As there are many differences in language among cultures and countries, the classifiers need to be carefully reviewed and designed for each language. When the user inputs data, the Cognition Module divides sentences into separate words, eliminates unnecessary ones and reduces the original sentence to keywords. For example, if the user inputs "There is a very peaceful village." for the background selection, the module eliminates auxiliary words and reduces the sentence to "peaceful village". These keywords go into the Naïve Bayes classifier table to find a match. The matched term is sent to the Multimedia Object Library to find matching background graphic objects, and the user is provided with several choices of backgrounds.
This procedure also applies to the other user input areas, such as the selection of characters, sounds and objects. The Naïve Bayes classifier table needs to be carefully designed based on the characteristics of each application category.
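The sketch below illustrates this keyword-to-library mapping with a tiny Naïve Bayes text classifier; the training phrases, category names and use of scikit-learn are our assumptions for illustration, not the system's actual classifier table.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: short descriptions labeled with MOL categories.
phrases = ["peaceful village", "quiet town in the mountains", "sunny beach by the sea",
           "old wizard", "brave young knight", "birds singing", "howling wind at night"]
labels  = ["background", "background", "background",
           "character", "character", "sound", "sound"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(phrases, labels)

# A new user description is reduced to keywords and classified into a MOL category;
# the MOL is then searched within that category for candidate objects.
user_input = "there is a very peaceful village"
print(model.predict([user_input])[0])   # expected: 'background'
```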
Fig. 4. Cognition Module (divider, object pre-processor, Naïve Bayes classifier, HMM module and post-processor, providing a suitable object from the Multimedia Object Library)
In Figure 4, the HMM module is designed for future use in recognizing emotion from the user's story input.
Fig. 5. Procedure of searching for an object in the Multimedia Object Library (example: the background input "very peaceful village" is divided and tagged, rearranged into the keywords "peaceful" (adjective) and "village" (noun), matched against the Naïve Bayes table entry {village, town, hamlet} as "Village", and resolved in the MOL to category BG, location BG Library)
3.2.2 Multimedia Object Library Architecture
The Multimedia Object Library (MOL) is composed of the following modules: MOL Index, Tag Library, Metadata, Switchable Components and Composite Objects. Although the MOL is supposed to contain all possible objects, this is very difficult in practice, so it must be structured efficiently to keep growing in the future. A good approach is to start with an application for kids, such as a multimedia book, and then expand it to cover various industries for serious use. Sharing valuable objects among communities over the Web is also worthwhile: since many people produce objects out of interest or for commercial purposes, these can be a good source for the MOL. Object acquisition through the Web must be carefully evaluated, with guidelines established for handling royalties, copyright and so on.
Fig. 6. MOL Architecture (MOL Index, Tag Library, Metadata, Switchable Components, Composite Objects)
To provide a better Adaptive User Interface, we designed an Adaptive DB Transfer Module consisting of a User Modeling Agent (UMA), an Information Filtering Agent (IFA) and Case-Based Reasoning Management (CBRM). This module suggests objects according to the user's preferences, based on past records of which objects were preferred and how the user likes them combined and displayed. These preferences are stored in a Case Base Set and reused in later sessions.
MOL Index
After obtaining a keyword from the user's input, the system gets the matched word from the Naïve Bayes classifier and then consults the MOL Index table to determine the object's category and location. Because a user may enter different names for the same object, the Naïve Bayes classifier has to be well defined and include the possible synonyms used in the dedicated application. As indicated in Table 1, the index table needs to be structured with enough depth levels, as clearly as possible, to cover the objects a user might type in. There are different categories
in the Library Name section, such as Background, Graphic Objects, Character and Sound. We designed the first depth field for images and graphics, but it can be expanded to contain animations and videos too. For a multimedia book for kids, short fairy-tale stories can also be added to the MOL, and users can use these stories in their application together with other objects.

Table 1. MOL Index Table

Library Name      Depth1            Depth2     Depth3    Depth4                          Depth5
Background        Image             2D / 3D    Season    Spring, Summer, Fall, Winter    Mountain, ...
                                               Place     Sea, Mountain, House, School    ...
                                               SF        ...
Graphic Object    Image, Graphic    2D / 3D    ...
Character         Image, Graphic    2D / 3D    ...
Sound             Wind, Birds       ...
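As an illustration of the lookup described above, the following Java sketch maps a classifier match to a MOL index entry. The index keys, depth values and library locations are hypothetical examples, not the actual MOL contents.

import java.util.*;

// Minimal sketch of resolving a matched keyword to a MOL index entry.
public class MolIndex {
    // A single row of the MOL Index table: library name, depth path and storage location.
    static class Entry {
        final String library;        // e.g. Background, Character, Sound
        final List<String> depths;   // e.g. [Image, 2D, Place, Village]
        final String location;       // e.g. "BG Library"
        Entry(String library, List<String> depths, String location) {
            this.library = library; this.depths = depths; this.location = location;
        }
        public String toString() {
            return library + " / " + String.join(" > ", depths) + " @ " + location;
        }
    }

    // Matched word (output of the Naive Bayes classifier) -> index entry.
    private final Map<String, Entry> index = new HashMap<>();

    public void register(String matchedWord, Entry entry) {
        index.put(matchedWord.toLowerCase(), entry);
    }

    // Returns the entry for a matched word, or null if the MOL holds no such object yet.
    public Entry lookup(String matchedWord) {
        return index.get(matchedWord.toLowerCase());
    }

    public static void main(String[] args) {
        MolIndex mol = new MolIndex();
        mol.register("Village", new Entry("Background",
                Arrays.asList("Image", "2D", "Place", "Village"), "BG Library"));
        System.out.println(mol.lookup("village"));
        // Background / Image > 2D > Place > Village @ BG Library
    }
}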
Tag Library
The Tag Library is a set of tables containing the link information of the designated objects, so that the Storytelling Engine can access the metadata and retrieve the matched object.
The metadata holds various information about each object and supports efficient searching. The Storytelling Engine searches through the Tag Library and finds the appropriate objects in the MOL. The metadata can be used either on its own or in combination, producing a Composite Object that is modified according to the user's options. The Composite Object is displayed on the screen, and the user can add further objects or edit it. This gives the user more room for satisfaction, producing a good-quality display and a strong multimedia effect for the presentation.
Fig. 7. Tag Library Structure (object and image entries with tags such as alt-text, ext-link, long-desc, label, title, caption, p and copyright-statement)
Fig. 8. Accessing metadata and re-construction of Composite Objects (MOL graphic and sound metadata are accessed through the Tag Library and re-constructed into Composite Objects A, B and C)
Switchable Components
Users gain more flexibility with Switchable Components, as they can either create new looks for a character or build a more meaningful layout that expresses their intention on the screen. Switchable Components can be categorized per target application. When making a kids' multimedia book, all objects need to be designed with a kids' flavor, and the Switchable Components need to be prepared in the same style.
Fig. 9. Switchable Component Option
3.3 Storytelling Simulation Module
After completing all the pages, the user can either view the whole story or print it. Since each scene is recorded as a page, the presentation is played like a slide show. The user can also edit the final work at this stage. Although we designed the storytelling simulation feature as a slide show, it can be customized for each specific application. When interactivity is important, for example in education or a multimedia book, interactive functions can be added on an as-needed basis.
Fig. 10. Storytelling Simulation Screen
4 Applications
4.1 Multimedia Book for Kids
Preparing a multimedia book for kids is not an easy task for parents, even though some tools are available. Without learning and expert knowledge of multimedia authoring tools such as graphic editors, sound editors and animation tools, these tools are almost useless. With the proposed system, one can easily compose a multimedia book page by page using ready-made objects. What matters is only how well all the objects are prepared and designed, and that is the designer's concern, not the user's.
4.2 Storyboard
Before making films and commercials, we need to prepare a storyboard that shows the sequences and explains what the work is about. Most of the time, only an expert with good drawing skills can create a useful one. If a director can visualize his or her idea with the provided MOL using the proposed system, it will save a lot of time and money. Although some existing tools are available, they still require intensive learning and their functions are very limited.
4.3 Content Planning
Storyboarding is very important when planning content creation. All the staff working on the various parts of media creation need to understand the whole concept and feel. During idea discussions, the project manager can use the proposed system to create and modify storyboards easily and without wasting time.
4.4 Web-Based E-Learning
E-learning through the Web is now very popular in education. But most materials are created by experts rather than teachers, because creating them is not an easy task. With the proposed system, equipped with a MOL designed for education, teachers can create e-learning books easily. Students can use these multimedia learning materials during or after school.
5 Conclusion and Future Work
This paper has described an Adaptive User Interface Storytelling System that uses a Multimedia Object Library as a new approach to interactive storytelling. We designed various templates for specific applications, considering different levels of users and offering options to choose from, which gives the user more satisfaction and flexibility. In the future, we will make the user interface more adaptive, minimizing manual selection and editing. The Cognition Module is currently designed for short sentences and used for finding objects in the MOL, but we will improve context cognition
of long paragraphs, so that a story can be written automatically when the user inputs a simple synopsis. The MOL currently contains various multimedia objects that the user can employ, but stories are not yet included. If we add existing stories from the Web so that the library can grow automatically, users will be able to draw on these stories for their own storytelling as well. To create good-quality storytelling, we obviously need to provide as many good-quality objects as possible. It is very difficult for a single company or organization to prepare all the objects, but this becomes achievable if objects are shared over the Web among the general public: as prosumers (producer and consumer), people will be eager to provide their work for a small royalty. Eventually, it will be possible for a parent to type a short synopsis into the template and have the system prepare suitable stories and multimedia objects, showing and telling the child a nice multimedia book at home.
Construction of a Distributed Learning Resource Management System Based on RSS Technology Chengling Zhao, Liyong Wan, Ying Yu, and Qi Luo Department of information technology, Central China Normal University, Wu-Han, China [email protected]
Abstract. RSS is an abbreviation of Rich Site Summary or Really Simple Syndication. It is a simple and easy way to share content among different websites, and it is commonly used by news sites and other frequently updated websites such as blogs. We designed and implemented a distributed learning resource management system based on RSS technology. In this paper, we first introduce the basic principles of RSS technology and then describe the three components of the system: the cluster of distributed learning resource websites, the content aggregator and the content reader. This system helps us manage distributed learning resources simply and efficiently. Keywords: RSS, RSS reader, learning object, distributed learning resource network, IEEE-LOM.
The basic grammar of an RSS feed is as follows:

<channel>
  <title>...</title>
  <link>...</link>
  <description>...</description>
  <item>
    <title>...</title>
    <link>...</link>
    <description>...</description>
  </item>
</channel>

An RSS reader is a tool that fetches RSS files and displays the news items on its user interface. Many readers store the news items as soon as the URL of the RSS file is fetched. Reading news with such a reader is much like reading e-mail, because it gives prominence to the news items and buffers them for off-line reading. Furthermore, there are some on-line RSS readers that can bind an RSS feed to the target website. Generally speaking, the interface of a reader is made up of three sections (see Figure 1):
Channel list section: one channel corresponds to one XML address; adding an XML address generates a new channel. A channel may be a book, a column or a collection of one kind of news.
Title section: located in the upper right of the reader's main body. It lists the read status, title, publication time and author in sequence.
Content section: located in the lower right of the reader's main body. It is made up of three components: a small toolbar, an address bar and a display window for the detailed content.
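As an illustration of what such a reader does internally, the following Java sketch fetches an RSS file and lists its items using the standard DOM parser. The feed URL is a placeholder, and a real reader would add caching, error handling and support for the different RSS dialects.

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Minimal sketch of the fetch-and-display step of an RSS reader.
public class SimpleRssReader {
    public static void main(String[] args) throws Exception {
        String feedUrl = "http://example.org/feed.xml";    // placeholder feed address

        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(feedUrl);              // fetch and parse the RSS file

        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            System.out.println(text(item, "title") + "  (" + text(item, "link") + ")");
            System.out.println("   " + text(item, "description"));
        }
    }

    // Returns the text content of the first child element with the given tag name, or "".
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() > 0 ? list.item(0).getTextContent() : "";
    }
}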
Fig. 1. The user interface of RSS reader 1.0.88.0
2 Use RSS Technology to Construct a Distributed Learning Resource Management System
At present, a learning resource database usually stores resource files as binary code in a relational database. The management of learning resources is based on database operations, and all resources are stored in a structured way. This storage method manages resources efficiently and locates them accurately [2]. Nevertheless, all the information related to the resource files has to be stored in the database, so the performance requirements on the database are very high. The applications on the server side must first read the resources from the database and then send them to the client side, which also places a higher demand on network bandwidth. Furthermore, resource databases are standalone: the individual databases cannot communicate or share resources effectively. In this situation, the concept of a 'distributed learning resource network' emerges.
A distributed learning resource network is a group of learning resource websites covering different subjects and themes. The relationship between the websites is peer-to-peer. The resource center uses a resource directory system to update and manage the directory of each resource website simultaneously.
RSS defines the packaging and publishing format of content via the XML standard. Both content providers and content receivers benefit from RSS technology: it offers providers a real-time, high-efficiency, safe and low-cost information-publishing channel, and it offers receivers a personalized reading experience. Based on these advantages, we propose a distributed learning resource management framework based on RSS technology. The framework comprises three key components:
The cluster of distributed learning resource websites: learning resource websites are mainly portal websites on different subjects. Each website presents all related learning resources in the form of subject themes. Resources are organized as learning objects (LOs), and a large number of learning objects form a learning object database. Each learning resource website owns the RSS files describing its learning objects, and several learning resource websites form a large-scale learning resource website group.
Content aggregator: the content aggregator binds RSS with IEEE-LOM, using RSS files to describe learning object metadata (LOM). Each learning resource website submits the RSS files of its learning object metadata to the content aggregator. The content aggregator stores the RSS files of the learning objects and also provides a function for searching learning resources by learning object metadata.
Content reader: the content reader presents the required learning objects to the learners in the form of a channel. If a learner wants detailed information about a certain learning object, he or she can follow the link to the website and read the content in the content reader. Learners can also read the content off-line and store it on local devices.
Fig. 2. The framework of the distributed learning resource management system
As Figure 2 shows, the flow of the framework is as follows: the content designers of each learning resource website submit the RSS files of all learning object metadata to the content aggregator. The learner sends out a learning request and searches for related learning content through the learning object metadata search engine. Using metadata search technology, the search engine presents the required learning content, in the form of title, author, URL and description, in the content reader on the learner's desktop. Only after the learner chooses a relevant item and clicks its link does the content reader show the detailed information of the learning content.
3 The Key Problems in Implementing the System
In using RSS technology to construct a distributed learning resource management system, there are two key problems to solve.
The first is how to bind IEEE-LOM with RSS. Because RSS 1.0 is specified in RDF (Resource Description Framework), we need to bind IEEE-LOM with RDF. Stephen Downes proposed an RSS-LOM module in 2003; it provides a translation from IEEE-LOM to RSS 1.0 that allows learning object repositories to syndicate listings and descriptions of learning objects [3].
The other problem is how to develop the distributed learning resource websites. We suggest that the learning resource websites be classified by subject and theme: each resource website is responsible for one subject's learning resources only, and the subject's resources can be displayed by theme [4]. Each subject's resource website organizes its learning resources as learning objects, and the content designers provide every learning object with an RSS feed, which is
specified by the RSS-LOM binding rules. For example, if we develop a Chinese learning resource website for elementary school, the website includes four sub-libraries: an instruction guide library, a learning material library, a reference resource library, and an exercise and test library. The instruction guide library can be further divided into four items: knowledge structure, teaching objectives, teaching suggestions and referenced teaching materials. The learning resources contain text, graphics, audio, video and other formats. Here we list a learning object in the teaching material library of Small Boat of Venice, a lesson taught in elementary school. The metadata and the corresponding RSS description are as follows [5]:
Title: Segment of landscape
Language: Chinese
Content description: nature scenery of Venice
Subject: Chinese
Publisher: Ren Ming Education Publication Company
Interaction sort: the type of explanation
Duration: 2 minutes 10 seconds
Format: RM
Size: 10.8 MB
Location: Teaching material library
Interaction degree: Low
Learner status: Grade 5 of elementary school
Difficulty: Simple

In the RSS description, these fields are carried as XML elements of the following form (the element names are illustrative of the RSS-LOM binding):

<item>
  <title>Segment of landscape</title>
  <language>Chinese</language>
  <description>Nature scenery of Venice</description>
  <subject>Chinese</subject>
  <publisher>Ren Ming Education Publication Company</publisher>
  <interactivityType>The type of explanation</interactivityType>
  <duration>2 minutes 10 seconds</duration>
  <format>RM</format>
  <size>10.8MB</size>
  <location>Teaching material library</location>
  <interactivityLevel>Low</interactivityLevel>
  <learnerStatus>Grade 5 of Elementary school</learnerStatus>
  <difficulty>Simple</difficulty>
</item>
4 The Implementation of the System
We have developed a learning resource website based on the above analysis. The website is called "China Education Resource Website", and the top of its homepage is illustrated in Figure 3. In this website, the learning resources include courseware,
papers, instruction cases, exercise resources, media resources, learning tools and more. The resources are organized in the form of learning objects and supported by RSS-LOM binding rules.
Fig. 3. The top section of the website
To acquire the needed learning resources, learners use a learning object metadata search engine to find resources across different resource websites such as the "China Education Resource Website". We therefore developed such a search engine; its interface is illustrated in Figure 4.
Fig. 4. The interface of the search engine
Using this search engine, we can search for the target learning objects by publisher, title, date, subject and language. When the search finishes, the results are aggregated into an XML file, which is read by an RSS content reader [6]. For example, searching with the title item "lesson plan" and the learner status "primary school" yields the results presented in the RSS reader shown in Figure 5.
Fig. 5. Search results presented in RSS reader
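The aggregation step, turning search hits into an XML file that the RSS reader can display, might look like the following Java sketch. The result fields and the output file name are hypothetical, and a full implementation would emit the namespaced RSS-LOM elements described in Section 3.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Minimal sketch: aggregate search results into an RSS-like XML file for the reader.
public class ResultAggregator {
    public static void main(String[] args) throws Exception {
        // Hypothetical search hits: {title, link, description}.
        String[][] hits = {
            {"Lesson plan: Small Boat of Venice", "http://example.org/lo/123", "Grade 5 Chinese"},
            {"Lesson plan: Spring Outing",        "http://example.org/lo/456", "Grade 5 Chinese"}
        };

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element channel = doc.createElement("channel");
        doc.appendChild(channel);

        for (String[] hit : hits) {
            Element item = doc.createElement("item");
            item.appendChild(textElement(doc, "title", hit[0]));
            item.appendChild(textElement(doc, "link", hit[1]));
            item.appendChild(textElement(doc, "description", hit[2]));
            channel.appendChild(item);
        }

        // Serialize the aggregated results so the RSS content reader can subscribe to them.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        t.transform(new DOMSource(doc), new StreamResult(new File("search-results.xml")));
    }

    private static Element textElement(Document doc, String name, String value) {
        Element e = doc.createElement(name);
        e.setTextContent(value);
        return e;
    }
}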
5 Conclusions and Analysis
Compared to other distributed learning resource network frameworks, the proposed distributed learning resource management system has the following advantages:
Personalized learning: the system realizes the concept of 'push', in that the newest content is pushed into the user's content reader immediately. It also realizes the concept of 'pull', in that the user can filter and customize the learning content using the learning object metadata search engine. The XML labels carry rich meaning, so the content can be located and retrieved accurately.
Lower cost of publication and management: RSS technology separates content development from publication, so the designers of the resource websites only need to think about how to make and organize learning resources and not about publication [7]. Abstracting learning content into RSS files is easy to do and only has to be done once, so the cost of long-term publication falls to almost zero. Furthermore, the distributed storage of learning resources lessens the pressure on the manager and reduces the cost of management considerably.
Shielding junk information and managing local content easily: the learning resources on the user side depend entirely on the user's interest in subscribing to a trusted content source. The RSS reader can completely shield unsubscribed content, advertisements and other noise. In addition, the user can read offline, store the files, search resources and perform other management operations, so the reader software is not only a reader but also the learner's own resource library.
References
1. Stephen, D.: An Introduction to RSS for Educational Designers (2002) http://www.downes.ca/files/RSS_Educ.htm
2. Shengquan, Y., Lingyun, Z., Xiaoming, C.: New Development About The Management Of Education Resource. CET China Educational Technology (2003) 96-99
3. Stephen, D.: RDF Site Summary 1.0 Modules: Learning Object Metadata (2003) http://www.downes.ca/xml/rss_lom.htm
4. Jian-fang, Z., Gui-fang, L.: Construction of the Personal Information Portal Based on RSS Readers. Sci/tech Information Development & Economy 17 (2005) 237-239
5. Huier, Z.: Implementation of RSS-based Science and Technology Information Syndication System. New Technology of Library and Information Service 7 (2005) 60-63
6. Yang, J.: Metadata search engine for knowledge management in learning communities. Proceedings of the International Conference on Computers in Education, Melbourne, Australia (2004) 787-792
7. Christopher, B.: Combining ITS and eLearning Technologies: Opportunities and Challenges (2006) www.activemath.org/publications/Brooksetal-CombITSEL-ITS-2006.pdf
A Semantic Web Application to Automate the Construction of Didactic Material for Web-Based Education System Rubén Peredo, Leandro Balladares, and Iván Peredo Computer Science Research Center of the National Polytechnic Institute, Nueva Industrial Vallejo, 07738, México City, México {peredo, ballad}@cic.ipn.mx
Abstract. In this paper, a new component-based Semantic Web Application (SWA) for Web-Based Education (WBE) is presented. The SWA is based on the IEEE 1484 LTSA (Learning Technology System Architecture) specification and on the Intelligent Reusable Learning Components Object Oriented (IRLCOO) software development pattern. IRLCOO are a special type of Sharable Content Object (SCO) according to the Sharable Content Object Reference Model (SCORM); SCORM 2004 is used to create reusable and interoperable learning content. The SWA described in this paper is used to develop authoring and evaluation tools that offer application-level interoperability under the philosophy of Web Services (WS). Another technology used is AJAX (Asynchronous JavaScript And XML), which handles communication between the components and the LMS API. Jade, Struts, Servlets, JSPs and JavaBeans implement the functionality of the system under the MVC model. The new component-based model is blended with metadata (SCORM 2004, RDF, XML) to be used by the rule-based inference engine Jena and the Joseki server, implementing a semantic platform.
cessfully completed for a good result to be achieved. The steps require professionals with different profiles, from authors to electronic artists, who must learn to work in an integrated way, as a team. To involve professors of different educational institutions in the development of content, this high complexity must be reduced. Approaches to this process focus on new paradigms for producing and delivering quality content for online learning experiences, and they try to develop, revise and upgrade learning content efficiently. The work described in this paper is based on a special type of labeled material called IRLCOO, developed by Peredo et al. [1]. IRLCOO represent a kind of learning content characterized by rich multimedia, high interactivity and intense feedback, supported by a standard interface and functionality. The IRLCOO and agents are part of a newly proposed architecture based on the IEEE 1484 LTSA specification [2] and open standards such as XML [3], the Global IMS Learning Consortium [4], Advanced Distributed Learning (ADL), and SCORM [5]. This paper is organized as follows: Section 2 describes the architecture; Section 3 presents the authoring system; Section 4 presents the evaluation system and the Semantic Web Platform (SWP); finally, the conclusions and future work are discussed.
2 Agents and Component-Based Architecture
Component-Based Software Development (CBSD) focuses on building large software systems by integrating previously existing software components. By enhancing the flexibility and maintainability of systems, this approach can potentially reduce software development costs, allow systems to be assembled rapidly, and reduce the spiraling maintenance burden associated with supporting and upgrading large systems. At the foundation of this approach is the assumption that certain parts of large software systems reappear with sufficient regularity that common parts should be written once rather than many times, and that common systems should be assembled through reuse rather than rewritten over and over [6].
Components are widely seen by software engineers as an important technology for addressing the "software crisis." There are a number of important reasons why Component-Oriented Programming (COP) matters: it provides a higher level of abstraction, and there is an increasingly large number of reusable component libraries that assist in the development of applications for various domains. The three major goals of COP are conquering complexity, managing change, and reuse [7]. According to Szyperski [8], a software component is "a unit of composition with contractually specified interfaces and explicit context dependencies. A software component can be deployed independently and is subject to composition by third parties." Although in most cases this definition is acceptable, its meaning is quite generic, so it is not surprising that the term is used to mean rather different concepts.
Our architecture is based on layer 3 of the IEEE 1484 LTSA specification. It is presented in Fig. 1 and consists of five processes: the learner entity, learner agent(s), evaluation agent(s), coach agent(s) and delivery agent(s); two stores: learner records and learning resources; and fourteen information workflows.
Fig. 1. Agents and Component-Based Architecture (learner entity, learner agent(s), evaluation agent(s), coach agent(s) and delivery agent(s); stores: learner records and learning resources; flows include multimedia, interaction context, behavior, assessment, learning preferences and learner info)
The coach process has been divided into two subprocesses: the coach and the virtual coach agent. The reason is that this process has to adapt quickly to the learners' individual needs during the learning process. Some decisions about sequence, activities, examples and so on can be made manually by the coach, while in other cases these decisions can be made automatically by the virtual coach agent. The delivery agent retrieves the content and delivers it to the learner as IRLCOOs (multimedia representing learning content, etc.), to which the learner exhibits a certain behavior; this behavior is evaluated via the evaluation agent and the IRLCOOs, resulting in an assessment or learner information such as performance; learner information is stored in the learner records; and the interaction context provides the context used to interpret the learner's behavior. The learner is assisted by the learner agent, which, via the IRLCOOs, collects learner information and stores it in the learner records (registration of preferences, a new element within the IEEE 1484 LTSA model) directly, without passing through the professor. This information includes: the learner's trajectory through the materials, the time spent in each material, the most visited materials, the SCORM 2004 id and RDF id of the materials, the learner's trajectory in the evaluation, the sequences of correct and incorrect questions, completed activities, completed tasks, etc.
2.1 IRLCOO Platform
Flash is a media integrator and has a powerful programming language called ActionScript 2.0 [9]. IRLCOO were developed with Flash components; the language is completely object oriented and enables the design of client components that support multimedia content. At run time, the components load media objects and offer a programmable environment that adapts to the student's needs. Flash already provides Smart Clips (SC) for the learning elements called Learning Interactions (LI). The aim is to generate a multimedia library of IRLCOO for WBE systems, with the purpose of separating content from navigation. Thus, the components use different levels of code inside the Flash Player (FP). With this structure, it is possible
to generate specialized components that are small, reusable, and suitable for integration inside a bigger component at run time. The release of ActionScript 2.0 in Flash MX 2004 allows applications to be implemented under the object-oriented paradigm, and IRLCOO are tailored with these facilities. In addition, IRLCOO have communication functionality, through an Application Programming Interface, with the LMS, the Multi-Agent System (MAS) and different frameworks (AJAX [10], Hibernate [11], Struts [12], etc.), and they support dynamic loading of assets at run time, besides consuming certain WS such as warnings, exams, advice and help. The component architecture diagram for evaluation IRLCOOs is presented in Fig. 2. IRLCOO are meta-labeled to fulfill a function similar to supermarket bar codes, which identify products and determine certain characteristics specific to them. The meta-labeling is done with the Resource Description Framework (RDF) [13], which enables a certain degree of inference over the materials by means of the semantic platform.
2.2 Communication Between IRLCOO and Web Services
ActionScript 2.0 adds the WebServiceConnector component to connect to Web Services (WS) from the IRLCOO. The WebServiceConnector component enables access to remote methods offered by an LMS through the SOAP protocol. This gives a WS the ability to accept parameters and return a result to the script; that is, it is possible to access and join data between public or in-house WS and the IRLCOO. Programming time can be reduced, since a single instance of the WebServiceConnector component is used to make multiple calls to the same functionality within the LMS; only different instances for different functionalities are necessary. The components discover and invoke WS using SOAP and UDDI, via middleware and a jUDDI server. Placing a run-time layer between a WS client and server dramatically increases the options for writing smarter, more dynamic clients and reduces the need for hard-coded dependencies within WS clients. The WS are information databases that can be loaded using the component and deployed within an IRLCOO.
2.3 Communication Between IRLCOO and LMS
The communication model uses an asynchronous mode in the Run-Time Environment (RTE) and combines the ADL LMS communication API [5] with AJAX [10]. The ADL LMS communication API consists of a collection of standard methods for communication between the client and the LMS. AJAX is a Web development technique for creating interactive Web applications that executes on the client side. Apache Struts is an open-source framework for developing J2EE Web applications; it uses and extends the Java Servlet API to encourage developers to adopt an MVC architecture. This framework provides maintainability, performance (tag pooling, caching, etc.) and reusability (it contains tools for field validation that execute on the client or server side).
Fig. 2. Component Architecture Diagram for evaluation IRLCOOs (a Test class with question, feedback and navigation button members and handlers such as onReleaseEvaluation(), onReleaseNext() and setParam(), together with a Source class that loads XML content)
The communication model starts (1) when an IRLCOO generates an event and (2) when the browser interface makes a JavaScript call to the function FlashFileName_DoFSCommand(command, args), which handles all the FSCommand messages from the IRLCOO, the LMS communication API, and the AJAX and Struts methods. A fragment of this code is shown next:

function FlashFileName_DoFSCommand(command, args) {
  // standard ADL LMS API calls
  doInitialize();
  doSetValue(name, value);
  doTerminate();
  // hand control to the AJAX/Struts layer
  callAjaxStruts();
}

The operation of this code is as follows: (I) the communication with the LMS starts when the standard methods call the Communication Adapter; (II) the Communication Adapter implements the bidirectional LMS API communication between the client and the LMS; (III) the LMS performs the query-response handling and the business logic, i.e., the database access. The communication with the AJAX and Struts framework begins when the callAjaxStruts method is called. (3) An instance of the XMLHttpRequest object is created.
Using the open() method, the call is set up: the URL is specified along with the desired HTTP method, typically GET or POST. The request is actually triggered by a call to the send() method. The code might look something like this:

function callAjaxStruts() {
  createXMLHttpRequest();
  var url = "register.do?";
  var urlString = createUrlString();
  xmlHttp.open("POST", url, true);
  xmlHttp.onreadystatechange = processStateChange;
  xmlHttp.setRequestHeader("Content-Type",
      "application/x-www-form-urlencoded;");
  xmlHttp.send(urlString);
}
(4) A request is made to the server; in this case the Controller is a Servlet, which coordinates all application activities, such as reception of user data, (6) data validation, and control flow. The Controller is configured through an XML file. (7) The Controller calls the perform method of the Action, passing it the data values, and the Action reviews the data that correspond to the Model. (8) The business objects (JavaBeans) carry out the business logic, (9) usually a database access. (10) The Action sends the response to the Controller. (11) The Controller re-routes the request and generates the interface for the results in the View (JSPs). (12) The View queries the business objects through the appropriate interface. (13) The response is returned to the browser. The Content-Type is set to text/xml, since the XMLHttpRequest object can only process results of that type as XML. In more complex cases, the response might be quite involved and include JavaScript, DOM manipulation, or other related technologies. (14) The XMLHttpRequest object calls the callback() function when the processing returns. This function checks the readyState property of the XMLHttpRequest object and then looks at the status code returned from the server. (15) If all is as expected, the callback() function injects HTML code and does something useful on the client.
This communication model opens wide perspectives for WBE systems. For example, the LMS communication API allows us to query the database for learner information such as personal data, scores, assigned courses and trajectory, while the communication with the AJAX and Struts / Servlets / JSP / JavaBeans / XSLT frameworks makes it possible to modify the learner's trajectory according to variables from the learner records in the RTE (advanced dynamic sequencing) and to manage components (IRLCOO). Since the IRLCOOs consume XML files, this model provides a way to write, load, change and erase XML files on the server side.
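A server-side Action of the kind referred to in step (7) could look like the following Struts 1 sketch; in later Struts 1 versions the perform method was replaced by execute, which is used here. The class name, the forward name and the parameter handling are hypothetical, and the actual business logic lives in JavaBeans, as described above.

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.struts.action.Action;
import org.apache.struts.action.ActionForm;
import org.apache.struts.action.ActionForward;
import org.apache.struts.action.ActionMapping;

// Hypothetical Action behind the "register.do" request sent by the IRLCOO client.
public class RegisterAction extends Action {
    @Override
    public ActionForward execute(ActionMapping mapping, ActionForm form,
                                 HttpServletRequest request,
                                 HttpServletResponse response) throws Exception {
        // (6) receive and validate the data posted by the XMLHttpRequest call
        String learnerId = request.getParameter("learnerId");
        String metric = request.getParameter("metric");

        // (8)-(9) delegate the business logic to a JavaBean (e.g. storing learner records)
        LearnerRecordBean bean = new LearnerRecordBean();
        bean.save(learnerId, metric);

        // (11) forward to the JSP view that renders the text/xml response for the client
        request.setAttribute("record", bean);
        return mapping.findForward("success");
    }
}

// Minimal placeholder for the business object used above.
class LearnerRecordBean {
    void save(String learnerId, String metric) {
        // a real implementation would persist the metric in the learner records store
    }
}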
3 Content System
To facilitate the development of learning content, we built an authoring system called the Content System (CS). The CS is component-based, which facilitates authoring content as IRLCOOs, and the structure of the content is achieved through SCORM 2004 meta-labeling. The CS is used to construct Web-based courseware from the stored IRLCOO (learning resources) and to enhance the courseware with various component-based authoring tools. Developers choose one of the CS lesson templates and specify the desired components to be used in each item. At the moment, the CS lesson templates are based on Uskov's model [14], as depicted in Fig. 3. A metadata tool supports the generation of the XML file for the IRLCOO that make up an online course. The courseware estimates learner metrics in order to tailor the learning experience, and the IRLCOO offer a friendly interface and flexible functionality. These deliverables comply with the IRLCOO specifications and with the learning items of the SCORM 2004 models (Content Aggregation, Sequencing and Navigation, and Run-Time Environment) [5]. Metadata provide the specific description of a component and its contents, such as title, description, keywords, learning objectives, item type and rights of use. The metadata tool provides templates for entering metadata and storing each component in the CS or in another IMS/IEEE standard repository.
The CS proposes a course structure based on the idea of a compound learning item as a collection of Reusable Learning Atoms (RLA) and Reusable Information Atoms (RIA) [14]. These atoms are grouped together to teach a common task based on a single learning objective, as depicted in Fig. 3. An RLA is an elementary, atomic piece of learning built around a single learning objective; each RLA can be classified as a concept, fact, process or procedure. The RLAs provide information about the learner's behavior within the course, e.g., trajectory, times and assessments; this information is stored in the learner history database. An RIA is an atomic piece of information built from single information objects; it may contain up to seven different content items: overview, introduction, importance, objectives, prerequisites, scenario and outline. The CS implements a navigation map, or instructional and learning strategy, allowing the learner to interact with content objects throughout the learning experience. These experiences follow an instructional-teaching strategy; such strategies modify the learning content structure, and the modifications are made by the learning experience designer with the objective of providing significant learning and teaching the learners how to think [15]. The learning content can be interpreted as a Learning Content Tree.
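A minimal sketch of how a compound learning item built from RLAs and RIAs could be represented is shown below; the class and field names are illustrative, not the actual CS data model.

import java.util.ArrayList;
import java.util.List;

// Illustrative model of a compound learning item made of RLAs and RIAs.
class ReusableLearningAtom {
    enum Kind { CONCEPT, FACT, PROCESS, PROCEDURE }
    final String learningObjective;
    final Kind kind;
    ReusableLearningAtom(String learningObjective, Kind kind) {
        this.learningObjective = learningObjective;
        this.kind = kind;
    }
}

class ReusableInformationAtom {
    // one of: overview, introduction, importance, objectives, prerequisites, scenario, outline
    final String contentItem;
    final String body;
    ReusableInformationAtom(String contentItem, String body) {
        this.contentItem = contentItem;
        this.body = body;
    }
}

public class CompoundLearningItem {
    final String commonTask;                                    // single learning objective of the group
    final List<ReusableLearningAtom> rlas = new ArrayList<>();
    final List<ReusableInformationAtom> rias = new ArrayList<>();

    CompoundLearningItem(String commonTask) { this.commonTask = commonTask; }

    public static void main(String[] args) {
        CompoundLearningItem item = new CompoundLearningItem("Introduction to Web Services");
        item.rias.add(new ReusableInformationAtom("overview", "What a Web Service is"));
        item.rlas.add(new ReusableLearningAtom("Identify the parts of a SOAP message",
                ReusableLearningAtom.Kind.CONCEPT));
        System.out.println(item.commonTask + ": " + item.rlas.size() + " RLA(s), "
                + item.rias.size() + " RIA(s)");
    }
}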
4 Evaluation System
The Evaluation System (ES) for WBE is designed under the same philosophy as the CS. The functionality of the ES rests on the analysis of the learner's profile, which is built during the teaching-learning experiences. The profile is based on metrics elicited from the student's behavior at run time. These measures are stored in the
learner records that make up the profile. New course sequences are generated as a function of the results obtained, taking the adaptation level into account. The ES combines IRLCOOs, additional meta-labels and a Java agent platform. Some Artificial Intelligence technologies are also employed to recreate a Semantic Web environment. The Semantic Web aims to assist human users in their online activities and offers many advantages, such as reduced complexity for potential developers, standardization of functionality and attributes, the definition of a set of specialized APIs, and the deployment of a Semantic Web platform.
Fig. 3. A structure of a compound learning item with IRLCOOs (Welcome, Overview, Introduction, Compound Learning Item, Summary and Assessment IRLCOOs, linked by Prev/Post navigation)
The activity diagram in Fig. 4 shows the process for content generation and evaluation in the system.
Fig. 4. Activity diagram to generate IRLCOOs (creation of educational material: content IRLCOOs such as Welcome, Overview, Introduction, Compound Learning Item and Summary, and evaluation IRLCOOs such as Drag and Drop, Filling the Blank, Hot Objects, Multiple Choice and True or False; followed by structure creation, the XML file and the multimedia upload)
All resources have a Uniform Resource Identifier (URI). A URI can be a Uniform Resource Locator (URL) or some other type of unique identifier; an identifier does not necessarily enable access to a resource. The XML layer is used to define the
SCORM metadata of the IRLCOO that are used to interchange data over the Web. The XML Schema tier corresponds to the language used to define the structure of the metadata. The Resource Description Framework (RDF) level is the language used for describing all kinds of information and metadata, and the RDF Schema layer is the framework that gives meaning to the vocabulary used. The Ontology tier defines the semantics that establish how words and terms are used in the context of the vocabulary. The Logic level corresponds to the reasoning used to establish the consistency and correctness of data sets and to infer conclusions that are not explicitly stated. The Proof layer explains the steps of logical reasoning, and the Trust tier provides authentication of identity and evidence of the trustworthiness of data, services and agents [16-17].
In summary, the components and operation of the CS and ES are outlined in Fig. 5. The ES basically works in two phases. The first phase is supported by the LMS and is devoted to presenting the course and its structure. All actions are registered, the presentation of the contents is realized with content IRLCOOs, and the evaluations are done by evaluation IRLCOOs and in some cases by simulators based on IRLCOOs. These processes are deployed by the framework of Servlets, JSPs and JavaBeans. The second phase analyzes the learner's records and is carried out by the server based on the Jade MAS [18]. This agent platform has seven agents: Snooper, Buffer, Learner, Evaluation, Delivering, Coach and Info. The fundamental idea is to automate the learner analysis for the coach/virtual coach and to give partial results that can be useful for the learner's final instruction. These agents are implemented as JavaBeans programs, which are embedded in the applications running on both the client and server sides.
The Snooper Agent works as a trigger, by means of the INFORM performative, that activates the server part of the MAS. This agent is deployed in a JSP that uses a JavaBean. Once the lesson or evaluation is finished, the graphical user interface activates the Snooper Agent and sends it the behavior or evaluation metrics in FIPA-ACL to be analyzed at the server side of the MAS. The Snooper Agent activates the system; the Buffer agent manages the connection with the client and all the messages from the client, which are buffered and sent to the Coach agent. The Coach agent then asks an Evaluation agent for the previous learner-monitoring information; the Evaluation agent searches for and retrieves this information and gives it to the Coach agent, who analyzes it to determine whether the learner needs help and sends material availability notifications to Delivery agents for forwarding to Learner agents. The Coach agents can then make recommendations to the learner regarding new skills in which the learner might be interested. The Learner agent asks a Coach agent about the availability of specific learning material; the Coach agent then asks the Learner agent for the learner's preferences and uses these, together with the information on the specific learning material, to query the learning resources for the availability of the material. The Evaluation agent checks with a Learner agent to determine whether it can go ahead with monitoring the learner's performance on specific learning material; the Learner agent checks its stored learner privacy preferences and gets back to the Evaluation agent with the answer.
These agents employ dynamic sequencing to change the course or assessment sequence. The sequencing is defined by the
These agents employ the dynamic sequencing to change the course or assessment sequence. The sequencing is defined for the
instructional strategy based on Concept Graph and it employs the SCORM Sequencing/Navigation. Once the receives the necessary information (sequence, kind of IRLCOO and localization, etc.), this is represented as a string dynamically constructed by the rule-based inference engine known as Jena [19] and Joseki server [20], to generate dynamic feedback. 4.2 Platform Semantic The overall architecture of SWP, which includes three basic engine representing different aspects, is provided in Fig. 5. First, the query engine receives queries and answers them by checking the content of the databases that were filled by info agent and inference engine. Second, the database manager is the backbone of the entire systems. It receives facts from the info agent, exchanges facts as input and output with the inference engine, and provide facts to the query engine. Third, the inference engine use facts and ontologies to derive additional factual knowledge that is only provided implicated. It frees knowledge providers from the burden of specifying each fact explicitly. Again, ontologies are the overall structuring principle. The info agent uses them to extracts facts, the inference engine to infer facts, the database manager to structure the database, and query engine to provide help in formulating queries. Jena was selected as the inference engine. It is a Java framework for building SWA. It provides a programmatic environment for RDF, RDFS and OWL, SPARQL and includes a rule-based inference engine.
Fig. 5. Semantic Platform for WBE
Fig. 6. Content and evaluation IRLCOOs, respectively
Joseki was selected as the Web API and server. It is an HTTP and SOAP engine that supports the SPARQL protocol and the SPARQL RDF query language; SPARQL is developed by the W3C RDF Data Access Working Group.
5 Conclusions
The communication model is composed of the LMS communication API together with AJAX, IRLCOO, WS, the Semantic Web and jUDDI. It provides new development capabilities for WBE systems, because its constituent technologies are complementary. The CS and the ES were developed under this model to help automate the learning content process and reduce its complexity. IRLCOO and the Semantic Web Platform make it possible to develop authoring and evaluation systems that create adaptive and intelligent WBE. Our approach focuses on the reusability, accessibility, durability and interoperability of the learning content, which is built as IRLCOO, the main component for delivering learning and evaluation content. The incorporation of Semantic Web platforms helps us create intelligent and adaptive systems (bidirectional communication) according to the users' needs. The ADL schema manages dynamic sequencing, composition, content separation and navigation in the system, while our model has the same ADL advantages and adds the capacity to generate CASE tools for the desktop and the Web using the same learning and evaluation components. The IEEE 1484 LTSA model was improved so that certain student metrics are collected automatically.
6 Future Work
It is important to generate new IRLCOOs and to investigate new middleware for them. More advanced intelligent systems are needed, both in the MAS and in the SWP. In the future we will consider other theories for the CS, such as Problem-Based Learning (PBL), the case method, etc.
Acknowledgements The authors would like to thank the IPN and CIC for partial support for this work within the project 20061226. The authors would like to acknowledge all their colleagues and students participating in the design and development of the software.
References
1. Peredo, R., Balladares, L., Sheremetov, L.: Development of intelligent reusable learning objects for web-based education systems. Expert Systems with Applications 28(2) (2005) 273-283
2. IEEE 1484.1/D9 LTSA: Draft Standard for Learning Technology - Learning Technology Systems Architecture (LTSA). New York, USA (2001) http://ieee.ltsc.org/wg1
3. XML specifications (2006) http://www.w3.org/XML/
4. Global IMS Consortium (2005) http://www.imsproject.org
5. Advanced Distributed Learning Consortium (2006) http://www.adlnet.org
6. Leavens, G.T., Sitaraman, M.: Foundations of Component-Based Systems. Cambridge University Press, USA (2000)
7. Wang, A., Qian, K.: Component-Oriented Programming. John Wiley & Sons, Inc., Georgia, USA (2005) 3-5
8. Szyperski, C.: Component Software: Beyond Object-Oriented Programming. Addison-Wesley, USA (1998)
9. Macromedia, Inc. (2006) http://www.macromedia.com
10. Crane, D., Pascarello, E., James, D.: Ajax in Action. Manning Publications, Greenwich, USA (2006)
11. Peak, P., Heudecker, N.: Hibernate Quickly. Manning Publications, Greenwich, USA (2006)
12. Holmes, J.: Struts: The Complete Reference. McGraw-Hill/Osborne, California, USA (2004)
13. RDF specification (2005) http://www.w3.org/RDF/default.htm
14. Uskov, V., Uskova, M.: Reusable learning objects approach to Web-based education. International Journal of Computers & Applications 25(3) (2003)
15. Díaz-Barriga, F.: Educational strategies for a significant learning. McGraw-Hill, 2nd edition, D.F., México (2002)
16. Passin, T.: Explorer's Guide to the Semantic Web. Manning Publications Co., USA (2004)
17. Antoniou, G., Van Harmelen, F.: A Semantic Web Primer. The MIT Press, USA (2004)
18. Jade (2006) http://jade.tilab.com/
19. Jena (2006) http://jena.sourceforge.net/
20. Joseki server (2006) http://www.joseki.org/
Dai, Shucheng 49
Deng, Bo 122
Du, Kai 122
Duan, Lei 61
Feng, Shi 104
Fu, Zhuo 145
Ghose, Supratip 276
Gu, Jinguang 194
Gu, Yu 13
Guo, Li 1
Ha, Inay 276
Han, WeiHong 133
Hu, Bo 194
Huang, Cui 206
Jia, Yan 83, 133
Jo, Geun-Sik 276
Jung, Jason J. 276
Kim, Ung Mo