XML for Data Warehousing: Chances and Challenges (Extended Abstract)

Peter Fankhauser and Thomas Klement

Fraunhofer IPSI, Integrated Publication and Information Systems Institute
Dolivostr. 15, 64293 Darmstadt, Germany
{fankhaus,klement}@fraunhofer.ipsi.de
http://www.ipsi.fraunhofer.de
The prospects of XML for data warehousing are staggering. As a primary purpose of data warehouses is to store non-operational data in the long term, i.e., to exchange them over time, the key reasons for the overwhelming success of XML as an exchange format also hold for data warehouses.
– Expressive power: XML can represent relational data, EDI messages, report formats, and structured documents directly, without information loss, and with uniform syntax.
– Self-describing: XML combines data and metadata. Thereby, heterogeneous and even irregular data can be represented and processed without a fixed schema, which may become obsolete or simply get lost.
– Openness: As a text format with full support for Unicode, XML is not tied to a particular hardware or software platform, which makes it ideally suited for future-proof long-term archival.
But what can we do with an XML data warehouse beyond long-term archival? How can we make sense of these data? How can we cleanse them, validate them, aggregate them, and ultimately discover useful patterns in XML data? A natural first step is to bring the power of OLAP to XML. Unfortunately, even though in principle XML is well suited to represent multidimensional data cubes, there does not yet exist a widely agreed-upon standard for either representing data cubes or querying them. XQuery 1.0 has resisted standardizing even basic OLAP features. Grouping and aggregation require nested for-loops, which are difficult to optimize. XSLT 2.0 (XSL Transformations) has introduced basic grouping mechanisms. However, these mechanisms make it difficult to take hierarchical dimensions into account and, accordingly, to compute derived aggregations at different levels. In the first part of the talk we will introduce a small XML vocabulary for expressing OLAP queries that allows aggregation on different levels of granularity and can fully exploit the document order and nested structure of XML. Moreover, we will illustrate the main optimization and processing techniques for such queries.
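As an illustration of the grouping problem just mentioned, the following is a minimal sketch in Python (not in XQuery, and not the XML vocabulary introduced in the talk) of aggregating XML records at two levels of a hierarchical dimension; the element and attribute names (sale, region, city, amount) are invented for this example.

import xml.etree.ElementTree as ET
from collections import defaultdict

doc = ET.fromstring("""
<sales>
  <sale region="EMEA" city="Berlin" amount="120"/>
  <sale region="EMEA" city="Paris"  amount="80"/>
  <sale region="APAC" city="Tokyo"  amount="200"/>
</sales>""")

# Group and aggregate at two levels of an assumed Geography hierarchy: City and Region.
by_city, by_region = defaultdict(float), defaultdict(float)
for sale in doc.iter("sale"):
    amount = float(sale.get("amount"))
    by_city[(sale.get("region"), sale.get("city"))] += amount
    by_region[sale.get("region")] += amount   # derived aggregate at the coarser level

print(dict(by_city))    # {('EMEA', 'Berlin'): 120.0, ('EMEA', 'Paris'): 80.0, ('APAC', 'Tokyo'): 200.0}
print(dict(by_region))  # {'EMEA': 200.0, 'APAC': 200.0}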
Data cubes constitute only one possible device to deal with the key challenge of XML data warehouses. XML data are notoriously noisy. They often come without a schema or with highly heterogeneous schemas, they rarely explicate dependencies and therefore are often redundant, and they can contain missing and inconsistent values. Data mining provides a wealth of established methods to deal with this situation. In the second part of the talk, we will illustrate by way of a simple experiment how data mining techniques can help in combining multiple data sources and bringing them to effective use. We explore to which extent stable XML technology can be used to implement these techniques. The experiment deliberately focuses on data and mining techniques that cannot be readily represented and realized with standard relational technology. It combines a bilingual dictionary, a thesaurus, and a text corpus (altogether about 150 MB of data) in order to support bilingual search and thesaurus-based analysis of the text corpus. We proceeded in three steps:
None of the data sources was in XML form; therefore they needed to be structurally enriched to XML with a variety of tools. State-of-the-art schema mining combined with an off-the-shelf XML-Schema validator has proven to be very helpful to ensure quality for this initial step by ruling out enrichment errors and spurious structural variations in the initial data.
In the next step, the data were cleansed. The thesaurus contained spurious cycles and missing relationships, and the dictionary suffered from incomplete definitions. These inconsistencies significantly impeded further analysis steps. XSLT, extended with appropriate means to efficiently realize fixpoint queries guided by regular path expressions, turned out to be a quick and dirty means for this step. However, even though cleansing did not go very far, the developed stylesheets reached a considerable level of complexity, indicating the need for better models to express and detect such inconsistencies.
In the final step, the thesaurus was used to enrich the text corpus with so-called lexical chains, which cluster a text into sentence groups that contain words in sufficiently close semantic neighborhood. These chains can be used to understand the role of lexical cohesion for text structure, to deploy this structure for finer-grained document retrieval and clustering, and ultimately to enhance the thesaurus with additional relationships. Again, XSLT turned out to be a suitable means to implement the enrichment logic in an ad-hoc fashion, but the lack of higher-level abstractions for both the data structures and the analysis rules resulted in fairly complex stylesheets. On the other hand, XSLT's versatility w.r.t. expressing different structural views on XML turned out to be extremely helpful to flexibly visualize lexical chains.
The main lessons learned from this small experiment are that state-of-the-art XML technology is mature and scalable enough to realize a fairly challenging text mining application. The main benefits of XML show especially for the early steps of data cleansing and enrichment, and the late steps of interactive analysis. These steps are arguably much harder to realize with traditional data warehouse technology, which requires significantly more data cleansing and restructuring as
a prerequisite. On the other hand, the thesaurus-based analysis in Step 3 suffers from the lack of XML-based interfaces to mining methods and tools. Realizing these in XSLT, which has some deficiencies w.r.t. compositionality and expressive power, turns out to be unnecessarily complex.
CPM: A Cube Presentation Model for OLAP

Andreas Maniatis 1, Panos Vassiliadis 2, Spiros Skiadopoulos 1, Yannis Vassiliou 1

1 National Technical Univ. of Athens, Dept. of Elec. and Computer Eng., 15780 Athens, Hellas
{andreas,spiros,yv}@dblab.ece.ntua.gr
2 University of Ioannina, Dept. of Computer Science, 45110 Ioannina, Hellas
[email protected]
Abstract. On-Line Analytical Processing (OLAP) is a trend in database technology, based on the multidimensional view of data. In this paper we introduce the Cube Presentation Model (CPM), a presentational model for OLAP data which, to the best of our knowledge, is the only formal presentational model for OLAP found in the literature until today. First, our proposal extends a previous logical model for cubes, to handle more complex cases. Then, we present a novel presentational model for OLAP screens, intuitively based on the geometrical representation of a cube and its human perception in the space. Moreover, we show how the logical and the presentational models are integrated smoothly. Finally, we describe how typical OLAP operations can be easily mapped to the CPM.
1. Introduction
In the last years, On-Line Analytical Processing (OLAP) and data warehousing have become a major research area in the database community [1, 2]. An important issue faced by vendors, researchers and - mainly - users of OLAP applications is the visualization of data. Presentational models are not really a part of the classical conceptual-logical-physical hierarchy of database models; nevertheless, since OLAP is a technology facilitating decision-making, the presentation of data is of major importance. Research-wise, data visualization is presently a quickly evolving field dealing with the presentation of vast amounts of data to the users [3, 4, 5]. In the OLAP field, though, we are aware of only two approaches towards a discrete and autonomous presentation model for OLAP. In the industrial field, Microsoft has already issued a commercial standard for multidimensional databases, where the presentational issues form a major part [6]. In this approach, a powerful query language is used to provide the user with complex reports, created from several cubes (or actually subsets of existing cubes). An example is depicted in Fig. 1. The Microsoft standard, however, suffers from several problems, with two of them being the most prominent ones: First, the logical and presentational models are mixed, resulting in a complex language which is difficult to use (although powerful enough).
Secondly, the model is formalized but not thoroughly: for instance, to our knowledge, there is no definition for the schema of a multicube.

SELECT CROSSJOIN({Venk,Netz},{USA_N.Children,USA_S,Japan}) ON COLUMNS
       {Qtr1.CHILDREN,Qtr2,Qtr3,Qtr4.CHILDREN} ON ROWS
FROM SalesCube
WHERE (Sales,[1991],Products.ALL)

[The query above produces the report of Fig. 1: salesmen Venk and Netz crossed with the Geography hierarchy (the cities of USA_N, USA_S, Japan) on the columns (vertical tapes C1-C6), quarters and their months (Jan, Feb, Mar, ...) on the rows (horizontal tapes R1-R4), with Year = 1991 and Product = ALL pinned; the City level also carries the property Size(City).]
Fig. 1: Motivating example for the cube model (taken from [6]).

Apart from the industrial proposal of Microsoft, an academic approach has also been proposed [5]. However, the proposed Tape model seems to be limited in its expressive power (with respect to the Microsoft proposal) and its formal aspects are not yet publicly available.
In this paper we introduce a Cube Presentation Model (CPM). The main idea behind CPM lies in the separation of logical data retrieval (which we encapsulate in the logical layer of CPM) and data presentation (captured by the presentational layer of CPM). The logical layer that we propose is based on an extension of a previous proposal [8] to incorporate more complex cubes. Replacing the logical layer with any other model compatible with classical OLAP notions (like dimensions, hierarchies and cubes) can be easily performed. The presentational layer, at the same time, provides a formal model for OLAP screens. To our knowledge, there is no such result in the related literature. Finally, we show how typical OLAP operations like roll-up and drill-down are mapped to simple operations over the underlying presentational model.
The remainder of this paper is structured as follows. In Section 2, we present the logical layer underlying CPM. In Section 3, we introduce the presentational layer of the CPM model. In Section 4, we present a mapping from the logical to the presentational model and finally, in Section 5 we conclude our results and present topics for future work. Due to space limitations, we refer the interested reader to a long version of this report for more intuition and rigorous definitions [7].
2. The logical layer of the Cube Presentation Model
The Cube Presentation Model (CPM) is composed of two parts: (a) a logical layer, which involves the formulation of cubes and (b) a presentational layer that involves the presentation of these cubes (normally, on a 2D screen). In this section, we present
the logical layer of CPM; to this end, we extend a logical model [8] in order to compute more complex cubes. We briefly repeat the basic constructs of the logical model and refer the interested reader to [8] for a detailed presentation of this part of the model. The most basic constructs are:
− A dimension is a lattice of dimension levels (L, ≺), where ≺ is a partial order defined among the levels of L.
− A family of monotone, pairwise consistent ancestor functions anc_L1^L2 is defined, such that for each pair of levels L1 and L2 with L1 ≺ L2, the function anc_L1^L2 maps each element of dom(L1) to an element of dom(L2).
− A data set DS over a schema S=[L1,…,Ln,A1,…,Am] is a finite set of tuples over S such that [L1,…,Ln] are levels, the rest of the attributes are measures and [L1,…,Ln] is a primary key. A detailed data set DS0 is a data set where all levels are at the bottom of their hierarchies.
− A selection condition φ is a formula involving atoms and the logical connectives ∧, ∨ and ¬. The atoms involve levels, values and ancestor functions, in clauses of the form x θ y. A detailed selection condition involves levels at the bottom of their hierarchies.
− A primary cube c (over the schema [L1,…,Ln,M1,…,Mm]) is an expression of the form c=(DS0,φ,[L1,…,Ln,M1,…,Mm],[agg1(M1^0),…,aggm(Mm^0)]), where: DS0 is a detailed data set over the schema S=[L1^0,…,Ln^0,M1^0,…,Mk^0], m≤k; φ is a detailed selection condition; M1,…,Mm are measures; Li^0 and Li are levels such that Li^0 ≺ Li, 1≤i≤n; aggi∈{sum,min,max,count}, 1≤i≤m.
The limitation of primary cubes is that, although they model accurately SELECT-FROM-WHERE-GROUPBY queries, they fail to model (a) ordering, (b) computation of values through functions and (c) selection over computed or aggregate values (i.e., the HAVING clause of a SQL query). To compensate for this shortcoming, we extend the aforementioned model with the following entities:
− Let F be a set of functions mapping sets of attributes to attributes. We distinguish the following major categories of functions: property functions, arithmetic functions and control functions. For example, for the level Day, we can have the property function holiday(Day) indicating whether a day is a holiday or not. An arithmetic function is, for example, Profit=(Price-Cost)*Sold_Items.
− A secondary selection condition ψ is a formula in disjunctive normal form. An atom of the secondary selection condition is true, false or an expression of the form x θ y, where x and y can be one of the following: (a) an attribute Ai (including RANK), (b) a value l, or (c) an expression of the form fi(Ai), where Ai is a set of attributes (levels and measures), and θ is an operator from the set {>, <, =, ≥, ≤, ≠}. With this kind of formulae, we can express relationships between measures (Cost>Price), ranking and range selections (ORDER BY...; STOP after 200, RANK[20:30]), measure selections (sales>3000), and property-based selections (Color(Product)='Green').
− Assume a data set DS over the schema [A1,A2,…,Az]. Without loss of generality, suppose a non-empty subset of the schema S=[A1,…,Ak], k≤z. Then, there is a set of ordering operations O_S^θ, used to sort the values of the data set with respect to the set of attributes participating in S. θ belongs to the set {<,>,∅} in order to denote ascending, descending and no order, respectively. An ordering operation is applied over a data set and returns another data set which obligatorily encompasses the measure RANK.
− A secondary cube over the schema S=[L1,…,Ln,M1,…,Mm,Am+1,…,Am+p,RANK] is an expression of the form s=[c,[Am+1:fm+1(Am+1),…,Am+p:fm+p(Am+p)],O_A^θ,ψ], where c=(DS0,φ,[L1,…,Ln,M1,…,Mm],[agg1(M1^0),…,aggm(Mm^0)]) is a primary cube, [Am+1,…,Am+p]⊆[L1,…,Ln,M1,…,Mm], A⊆S-{RANK}, fm+1,…,fm+p are functions belonging to F and ψ is a secondary selection condition.
With these additions, primary cubes are extended to secondary cubes that incorporate: (a) computation of new attributes (Am+i) through the respective functions (fm+i), (b) ordering (O_A^θ) and (c) the HAVING clause, through the secondary selection condition ψ.
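To make these definitions concrete, the following is a minimal Python sketch (our own illustration, not part of the CPM formalism) of evaluating a primary cube over a toy detailed data set and then extending it with a derived attribute, ordering/RANK and a secondary selection; all data values, level names and helper names are invented.

from collections import defaultdict

# A toy detailed data set DS0 over [Day, City, Sales].
DS0 = [
    {"Day": "2003-01-02", "City": "Seattle", "Sales": 10},
    {"Day": "2003-01-15", "City": "Boston",  "Sales": 7},
    {"Day": "2003-04-03", "City": "Seattle", "Sales": 5},
]

# Ancestor functions anc_L1^L2, here as plain lookups.
anc_day_month = lambda d: d[:7]                                   # Day -> Month
anc_month_qtr = {"2003-01": "Qtr1", "2003-04": "Qtr2"}            # Month -> Quarter
anc_city_region = {"Seattle": "USA_N", "Boston": "USA_N"}         # City -> Region

def primary_cube(ds0, phi, roll, agg):
    """phi: detailed selection; roll: tuple -> target-level coordinates; agg: aggregation."""
    groups = defaultdict(list)
    for t in ds0:
        if phi(t):
            groups[roll(t)].append(t["Sales"])
    return {coord: agg(vals) for coord, vals in groups.items()}

# Primary cube at [Quarter, Region] with sum(Sales).
cube = primary_cube(
    DS0,
    phi=lambda t: t["Day"].startswith("2003"),
    roll=lambda t: (anc_month_qtr[anc_day_month(t["Day"])], anc_city_region[t["City"]]),
    agg=sum,
)   # {('Qtr1', 'USA_N'): 17, ('Qtr2', 'USA_N'): 5}

def secondary_cube(cube, derive, order_key, psi):
    """Adds derived attributes, an ordering (which introduces RANK) and a HAVING-like filter psi."""
    rows = [{"coord": c, "Sales": v, **derive(c, v)} for c, v in cube.items()]
    rows.sort(key=order_key, reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["RANK"] = rank
    return [r for r in rows if psi(r)]

report = secondary_cube(
    cube,
    derive=lambda c, v: {"AboveTarget": v > 10},   # a property-style derived attribute
    order_key=lambda r: r["Sales"],
    psi=lambda r: r["Sales"] > 4,                  # secondary selection, e.g. Sales > 4
)
print(report)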
3. The presentational layer of the Cube Presentation Model
In this section, we present the presentation layer of CPM. First, we will give an intuitive, informal description of the model; then we will present its formal definition. Throughout the paper, we will use the example of Fig. 1 as our reference example. The most important entities of the presentation layer of CPM include:
− Points: A point over an axis resembles the classical notion of points over axes in mathematics. Still, since we are grouping more than one attribute per axis (in order to make things presentable in a 2D screen), formally, a point is a pair comprising a set of attribute groups (with one of them acting as primary key) and a set of equality selection conditions for each of the keys.
− Axis: An axis can be viewed as a set of points. We introduce two special-purpose axes, Invisible and Content. The Invisible axis is a placeholder for the levels of the data set which are not found in the "normal" axes defining the multicube. The Content axis has a more elaborate role: in the case where no measure is found in any axis, the measure which will fill the content of the multicube is placed there.
− Multicubes: A multicube is a set of axes, such that (a) all the levels of the same dimensions are found in the same axis, (b) Invisible and Content axes are taken into account, (c) all the measures involved are tagged with an aggregate function and (d) all the dimensions of the underlying data set are present in the multicube definition. In our motivating example, the multicube MC is defined as MC={Rows,Columns,Sections,Invisible,Content}.
− 2D-slice: Consider a multicube MC, composed of K axes. A 2D-slice over MC is a set of (K-2) points, each from a separate axis. Intuitively, a 2D-slice pins the axes of
the multicube to specific points, except for 2 axes, which will be presented on the screen (or a printout). In Fig. 2, we depict such a 2D-slice over a multicube.
− Tape: Consider a 2D-slice SL over a multicube MC, composed of K axes. A tape over SL is a set of (K-1) points, where the (K-2) points are the points of SL. A tape is always parallel to a specific axis: out of the two "free" axes of the 2D-slice, we pin one of them to a specific point, which distinguishes the tape from the 2D-slice.
− Cross-join: Consider a 2D-slice SL over a multicube MC, composed of K axes, and two tapes t1 and t2 which are not parallel to the same axis. A cross-join over t1 and t2 is a set of K points, where the (K-2) points are the points of SL and each of the two remaining points is a point on a different axis of the remaining axes of the slice.
The query of Fig. 1 is a 2D-slice, say SL. In SL one can identify 4 horizontal tapes (denoted as R1, R2, R3 and R4 in Fig. 1) and 6 vertical tapes (numbered from C1 to C6). The meaning of the horizontal tapes is straightforward: they represent the Quarter dimension, expressed either as quarters or as months. The meaning of the vertical tapes is somewhat more complex: they represent the combination of the dimensions Salesman and Geography, with the latter expressed at the City, Region and Country levels. Moreover, two constraints are superimposed over these tapes: the Year dimension is pinned to a specific value and the Product dimension is ignored. In this multidimensional world of 5 axes, the tapes C1 and R1 are defined as:
C1 = [(Salesman='Venk' ∧ anc_City^Region(City)='USA_N'), (Year='1991'), (anc_Item^ALL(Products)='all'), (Sales,sum(Sales))]
R1 = [(anc_Day^Month(Month)='Qtr1' ∧ Year='1991'), (Year='1991'), (anc_Item^ALL(Products)='all'), (Sales,sum(Sales))]
One can also consider the cross-join t1 defined by the common cells of the tapes R1 and C1. Remember that City defines an attribute group along with [Size(City)].
t1 = ([SalesCube, (Salesman='Venk' ∧ anc_City^Region(City)='USA_N' ∧ anc_Day^Month(Month)='Qtr1' ∧ Year='1991' ∧ anc_Item^ALL(Products)='all'), [Salesman,City,Month,Year,Products.ALL,Sales], sum], [Size(City)], true)
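As a rough illustration of these notions (our own sketch in Python, not part of CPM), the following shows how pinning all but two axes of a multicube yields a 2D-slice, how pinning one of the two free axes yields tapes, and how pairs of points from the two free axes yield cross-joins; the point contents are simplified placeholders rather than full attribute groups.

from itertools import product

# Axes as ordered lists of points; each point is a dict of equality conditions
# (simplified: attribute groups and ancestor functions are collapsed into plain keys).
axes = {
    "Rows":      [{"Quarter": "Qtr1"}, {"Quarter": "Qtr2"}],
    "Columns":   [{"Salesman": "Venk"}, {"Salesman": "Netz"}],
    "Sections":  [{"Year": 1991}, {"Year": 1992}],
    "Invisible": [{"Products.ALL": "all"}],
    "Content":   [{"measure": "sum(Sales)"}],
}

def two_d_slice(axes, free=("Rows", "Columns"), pin=None):
    """A 2D-slice: one point from every axis except the two 'free' ones (K-2 points)."""
    pin = pin or {}
    return {a: pin.get(a, points[0]) for a, points in axes.items() if a not in free}

def tapes(axes, slice_points, other="Columns"):
    """Tapes parallel to the Rows axis: the slice points plus one point of the other free axis."""
    return [{**slice_points, other: p} for p in axes[other]]

def cross_joins(axes, slice_points, free=("Rows", "Columns")):
    """Each pair of points from the two free axes defines one cross-join (K points in total)."""
    for p1, p2 in product(axes[free[0]], axes[free[1]]):
        yield {**slice_points, free[0]: p1, free[1]: p2}

sl = two_d_slice(axes)                          # pins Sections, Invisible and Content
print(len(tapes(axes, sl)))                     # 2 tapes parallel to the Rows axis
print(len(list(cross_joins(axes, sl))))         # 2 x 2 = 4 cross-joins in this slice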
In the rest of this section, we will describe the presentation layer of CPM in its formality. First, we extend the notion of dimension to incorporate any kind of attributes (i.e., results of functions, measures, etc.). Consequently, we consider every attribute not already belonging to some dimension, to belong to a single-level dimension (with the same name as the attribute), with no ancestor functions or properties defined over it. We will distinguish between the dimensions comprising levels and functionally dependent attributes through the terms level dimensions and attribute dimensions, wherever necessary. The dimensions involving arithmetic measures will be called measure dimensions. An attribute group AG over a data set DS is a pair [A,DA], where A is a list of attributes belonging to DS (called the key of the group) and DA is a list of attributes dependent on the attributes of A. With the term dependent we mean (a) measures dependent over the respective levels of the data set and (b) function results depending
on the arguments of the function. One can consider examples of attribute groups such as ag1=([City],[Size(City)]) and ag2=([Sales,Expenses],[Profit]).

Fig. 2: The 2D-Slice SL for the example of Fig. 1. [The figure depicts the multicube axes Rows, Columns, Sections, Invisible and Content: the Sections axis with the points Year=1991 and Year=1992, the Invisible axis pinned to Products.ALL='all', the Content axis holding (Sales, sum(Sales^0), true), the Rows axis with the points anc_Day^Month(Month)=Qtr1, Quarter=Qtr2, Quarter=Qtr3, anc_Day^Month(Month)=Qtr4, and the Columns axis with the six points combining Salesman ∈ {Venk, Netz} with anc_City^Region(City)='USA_N', Region='USA_S' and Country='Japan'.]
A dimension group DG over a data set DS is a pair [D,DD], where D is a list of dimensions over DS (called the key of the dimension group) and DD is a list of dimensions dependent on the dimensions of D. With the term dependent we simply extend the respective definition of attribute groups, to cover also the respective dimensions. For reasons of brevity, wherever possible, we will denote an attribute/dimension group comprising only of its key simply by the respective attribute/dimension.
An axis schema is a pair [DG,AG], where DG is a list of K dimension groups and AG is an ordered list of K finite ordered lists of attribute groups, where the keys of each (inner) list belong to the same dimension, found in the same position in DG, where K>0. The members of each ordered list are not necessarily different. We denote an axis schema as a pair AS^K = ([DG1×DG2×…×DGK], [[ag_1^1, ag_2^1, …, ag_k1^1]×[ag_1^2, ag_2^2, …, ag_k2^2]×…×[ag_1^K, ag_2^K, …, ag_kK^K]]).
In other words, one can consider an axis schema as the Cartesian product of the respective dimension groups, instantiated at a finite number of attribute groups. For instance, in the example of Fig. 1, we can observe two axes schemata, having the following definitions:
Row_S = {[Quarter], [Month, Quarter, Quarter, Month]}
Column_S = {[Salesman×Geography], [Salesman]×[[City,Size(City)], Region, Country]}
Consider a detailed data set DS. An axis over DS is a pair comprising of an axis schema over K dimension groups, where all the keys of its attribute groups belong to DS, and an ordered list of K finite ordered lists of selection conditions (primary or
secondary), where each member of the inner lists involves only the respective key of the attribute group.
a = (AS^K, [φ^1, φ^2, ..., φ^K]), K≤N, or
a = {[DG1×DG2×…×DGK], [[ag_1^1, ag_2^1, …, ag_k1^1]×[ag_1^2, ag_2^2, …, ag_k2^2]×…×[ag_1^K, ag_2^K, …, ag_kK^K]], [[φ_1^1, φ_2^1, …, φ_k1^1]×[φ_1^2, φ_2^2, …, φ_k2^2]×...×[φ_1^K, φ_2^K, …, φ_kK^K]]}
Practically, an axis is a restriction of an axis schema to specific values, through the introduction of specific constraints for each occurrence of a level. In our motivating example, we have two axes:
Rows = {Row_S, [anc_Day^Month(Month)=Qtr1, Quarter=Qtr2, Quarter=Qtr3, anc_Day^Month(Month)=Qtr4]}
Columns = {Column_S, [[Salesman='Venk', Salesman='Netz'], [anc_City^Region(City)='USA_N', Region='USA_S', Country='Japan']]}
We will denote the set of dimension groups of each axis a by dim(a). A point over an axis is a pair comprising of a set of attribute groups and a set of equality selection conditions for each one of their keys.
p1 = ([Salesman, [City,Size(City)]], [Salesman='Venk', anc_City^Region(City)='USA_N'])
An axis can be reduced to a set of points, if one calculates the Cartesian products of the attribute groups and their respective selection conditions. In other words, a = ([DG1×DG2×...×DGK], [p1, p2, …, pl]), l = k1×k2×…×kK.
Two axes schemata are joinable over a data set if their key dimensions (a) belong to the set of dimensions of the data set and (b) are disjoint. For instance, Row_S and Column_S are joinable.
A multicube schema over a detailed data set is a finite set of axes schemata fulfilling the following constraints:
1. All the axes schemata are pair-wise joinable over the data set.
2. The key of each dimension group belongs only to one axis.
3. Similarly, from the definition of the axis schema, the attributes belonging to a dimension group are all found in the same axis.
4. Two special-purpose axes called Invisible and Content exist. The Content axis can take only measure dimensions.
5. All the measure dimensions of the multicube are found in the same axis. If more than one measure exists, they cannot be found in the Content axis.
6. If no measure is found in any of the "normal" axes, then a single measure must be found in the axis Content.
7. Each key measure is tagged with an aggregate function over a measure of the data set.
8. For each attribute participating in a group, all the members of the group are found in the same axis.
9. All the level dimensions of the data set are found in the union of the axis schemata (if some dimensions are not found in the "normal" axes, they must be found in the Invisible axis).
The role of the Invisible axis follows: it is a placeholder for the levels of the data set which are not to be taken into account in the multicube. The Content axis has a more elaborate role: in the case where no measure is found in any axis (like in the example of Fig. 1), the measure which will fill the content of the multicube is placed there. If more than one measure is found, they must be placed in the same axis (not Content), as this would cause a problem of presentation on a two-dimensional space.
A multicube over a data set is defined as a finite set of axes, whose schemata can define a multicube schema. The following constraints must be met:
1. Each point from a level dimension, not in the Invisible axis, must have an equality selection condition, returning a finite number of values.
2. The rest of the points can have arbitrary selection conditions (including "true" for the measure dimensions, for example).
For example, suppose a detailed data set SalesCube under the schema
S = [Quarter.Day, Salesman.Salesman, Geography.City, Time.Day, Product.Item, Sales, PercentChange, BudgetedSales]
Suppose also the following axes schemata over DS^0:
Row_S = {[Quarter], [Month, Quarter, Quarter, Month]}
Column_S = {[Salesman×Geography], [Salesman]×[[City,Size(City)], Region, Country]}
Section_S = {[Time], [Year]}
Invisible_S = {[Product], [Product.ALL]}
Content_S = {[Sales], [sum(Sales^0)]}
and their respective axes:
Rows = {Row_S, [anc_Day^Month(Month)=Qtr1, Quarter=Qtr2, Quarter=Qtr3, anc_Day^Month(Month)=Qtr4]}
Columns = {Column_S, [[Salesman='Venk', Salesman='Netz'], [anc_City^Region(City)='USA_N', Region='USA_S', Country='Japan']]}
Sections = {Section_S, [Year=1991, Year=1992]}
Invisible = {Invisible_S, [ALL='all']}
Content = {Content_S, [true]}
Then, a multicube MC can be defined as MC = {Rows, Columns, Sections, Invisible, Content}.
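The following is a small Python sketch (our own illustration, not part of CPM) that checks a few of the multicube-schema constraints listed above against this example; the set-based representation of dimensions per axis is a deliberate simplification.

# Dimensions of the data set and the dimensions appearing on each axis (simplified sets).
level_dims   = {"Quarter", "Salesman", "Geography", "Time", "Product"}
measure_dims = {"Sales"}
axis_dims = {
    "Rows": {"Quarter"}, "Columns": {"Salesman", "Geography"},
    "Sections": {"Time"}, "Invisible": {"Product"}, "Content": {"Sales"},
}

# The key of each dimension group belongs to exactly one axis (constraint 2).
assert all(sum(d in dims for dims in axis_dims.values()) == 1 for d in level_dims | measure_dims)
# All level dimensions appear in the union of the axes, Invisible included (constraint 9).
assert level_dims <= set().union(*axis_dims.values())
# The Content axis can only take measure dimensions (constraint 4).
assert axis_dims["Content"] <= measure_dims
print("the example multicube schema satisfies the checked constraints")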
Consider a multicube MC, composed of K axes. A 2D-slice over MC is a set of (K-2) points, each from a separate axis, where the points of the Invisible and the Content axis are comprised within the points of the 2D-slice. Intuitively, a 2D-slice pins the axes of the multicube to specific points, except for 2 axes, which will be presented on a screen (or a printout).
Consider a 2D-slice SL over a multicube MC, composed of K axes. A tape over SL is a set of (K-1) points, where the (K-2) points are the points of SL. A tape is always parallel to a specific axis: out of the two "free" axes of the 2D-slice, we pin one of them to a specific point, which distinguishes the tape from the 2D-slice. A tape is more restrictively defined with respect to the 2D-slice by a single point: we will call this point the key of the tape with respect to its 2D-slice. Moreover, if a 2D-slice has two axes a1, a2 with size(a1) and size(a2) points each, then one can define size(a1)*size(a2) tapes over this 2D-slice.
Consider a 2D-slice SL over a multicube MC, composed of K axes. Consider also two tapes t1 and t2 which are not parallel to the same axis. A cross-join over t1 and t2 is a set of K points, where the (K-2) points are the points of SL and each of the two remaining points is a point on a different axis of the remaining axes of the slice. Two tapes are joinable if they can produce a cross-join.
4. Bridging the presentation and the logical layers of CPM
Cross-joins form the bridge between the logical and the presentational model. In this section we provide a theorem proving that a cross-join is a secondary cube. Then, we show how common OLAP operations can be performed on the basis of our model. The proofs can be found at [7].
Theorem 1. A cross-join is equivalent to a secondary cube.
The only difference between a tape and a cross-join is that the cross-join restricts all of its dimensions with equality constraints, whereas the tape constrains only a subset of them. Moreover, from the definition of joinable tapes it follows that a 2D-slice contains as many cross-joins as the number of pairs of joinable tapes belonging to this particular slice. This observation also helps us to understand why a tape can also be viewed as a collection of cross-joins (or cubes). Each of these cross-joins is defined by the (K-1) points of the tape and one point from one of its joinable tapes; this point belongs to the points of the axis the tape is parallel to. Consequently, we are allowed to treat a tape as a set of cubes: t=[c1,…,ck]. Thus we have the following lemma.
Lemma 1. A tape is a finite set of secondary cubes.
We briefly describe how usual operations of OLAP tools, such as roll-up, drill-down, pivot etc., can be mapped to operations over 2D-slices and tapes.
− Roll-up. Roll-up is performed over a set of tapes. Initially, the key points of these tapes are eliminated and replaced by their ancestor values. Then tapes are also eliminated and replaced by tapes defined by the respective keys of these ancestor values. The cross-joins that emerge can be computed through the appropriate aggregation of the underlying data (a small illustrative sketch follows this list).
− Drill-down. Drill-down is exactly the opposite of the roll-up operation. The only difference is that normally, the existing tapes are not removed, but rather complemented by the tapes of the lower-level values.
− Pivot. Pivot means moving one dimension from an axis to another. The contents of the 2D-slice over which pivot is performed are not recomputed; instead they are just reorganized in their presentation.
− Selection. A selection condition (primary or secondary) is evaluated against the points of the axes, or the content of the 2D-slice. In every case, the calculation of the new 2D-slice is based on the propagation of the selection to the already computed cubes.
− Slice. Slice is a special form of roll-up, where a dimension is rolled up to the level ALL. In other words, the dimension is not taken into account any more in the groupings over the underlying data set. Slicing can also mean the reconstruction of the multicube by moving the sliced dimension to the Invisible axis.
− ROLLUP [9]. In the relational context, the ROLLUP operator takes all combinations of attributes participating in the grouping of a fact table and produces all the
possible tables, with these marginal aggregations, out of the original query. In our context, this can be done by producing all combinations of Slice operations over the levels of the underlying data set. One can even go further by combining roll-ups to all the combinations of levels in a hierarchy.
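As announced above, here is a small Python sketch of the roll-up operation (our own simplified illustration, not the CPM implementation): the key points of the tapes at the Month level are replaced by their ancestor values at the Quarter level, and the cube cells are re-aggregated; all data values are invented.

from collections import defaultdict

anc_month_qtr = {"Jan": "Qtr1", "Feb": "Qtr1", "Mar": "Qtr1", "Apr": "Qtr2"}
cells = {("Jan", "Venk"): 5, ("Feb", "Venk"): 7, ("Apr", "Venk"): 3}   # (Month, Salesman) -> Sales

def roll_up(cells, anc):
    """Replace the Month coordinate by its Quarter ancestor and aggregate the cells."""
    out = defaultdict(int)
    for (month, salesman), sales in cells.items():
        out[(anc[month], salesman)] += sales
    return dict(out)

print(roll_up(cells, anc_month_qtr))   # {('Qtr1', 'Venk'): 12, ('Qtr2', 'Venk'): 3}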
5. Conclusions and Future Work
In this paper we have introduced the Cube Presentation Model, a presentation model for OLAP data which formalizes previously proposed standards for a presentation layer and which, to the best of our knowledge, is the only formal presentational model for OLAP in the literature. Our contributions can be listed as follows: (a) we have presented an extension of a previous logical model for cubes, to handle more complex cases; (b) we have introduced a novel presentational model for OLAP screens, intuitively based on the geometrical representation of a cube and its human perception in the space; (c) we have discussed how these two models can be smoothly integrated; and (d) we have suggested how typical OLAP operations can be easily mapped to the proposed presentational model.
Next steps in our research include the introduction of suitable visualization techniques for CPM, complying with current standards and recommendations as far as usability and user interface design are concerned, and its extension to address the specific visualization requirements of mobile devices.
References
[1] S. Chaudhuri, U. Dayal: An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record, 26(1), March 1997.
[2] P. Vassiliadis, T. Sellis: A Survey of Logical Models for OLAP Databases. ACM SIGMOD Record, 28(4), Dec. 1999.
[3] D.A. Keim: Visual Data Mining. Tutorials of the 23rd International Conference on Very Large Data Bases, Athens, Greece, 1997.
[4] A. Inselberg: Visualization and Knowledge Discovery for High Dimensional Data. 2nd Workshop Proceedings UIDIS, IEEE, 2001.
[5] M. Gebhardt, M. Jarke, S. Jacobs: A Toolkit for Negotiation Support Interfaces to Multi-Dimensional Data. ACM SIGMOD 1997, pp. 348-356.
[6] Microsoft Corp.: OLEDB for OLAP, February 1998. Available at: http://www.microsoft.com/data/oledb/olap/.
[7] A. Maniatis, P. Vassiliadis, S. Skiadopoulos, Y. Vassiliou: CPM: A Cube Presentation Model (Long Version). http://www.dblab.ece.ntua.gr/~andreas/publications/CPM_dawak03.pdf
[8] P. Vassiliadis, S. Skiadopoulos: Modeling and Optimization Issues for Multidimensional Databases. Proc. of CAiSE-00, Stockholm, Sweden, 2000.
[9] J. Gray et al.: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals. Proc. of ICDE 1996.
Computation of Sparse Data Cubes with Constraints

Changqing Chen 1, Jianlin Feng 2, and Longgang Xiang 3

1 School of Software, Huazhong Univ. of Sci. & Tech., Wuhan 430074, Hubei, China
[email protected]
2 School of Computer Science, Huazhong Univ. of Sci. & Tech., Wuhan 430074, Hubei, China
[email protected]
3 School of Computer Science, Huazhong Univ. of Sci. & Tech., Wuhan 430074, Hubei, China
[email protected]
Abstract. For a data cube there are always constraints between dimensions or between attributes in a dimension, such as functional dependencies. We introduce the problem of how to use functional dependencies, when they exist, to speed up the computation of sparse data cubes. A new algorithm, CFD, is presented to satisfy this demand. CFD determines the order of dimensions by considering their cardinalities and the functional dependencies between them together. It makes dimensions with functional dependencies adjacent and lets their codes satisfy a monotonic mapping, thus reducing the number of partitions for such dimensions. It also combines partitioning from bottom to up with aggregate computation from top to bottom to speed up the computation further. In addition, CFD can efficiently compute a data cube with hierarchies from the smallest granularity to the coarsest one, and at most one attribute in a dimension takes part in the computation each time. Experiments show that CFD achieves a significant performance improvement.
1 Introduction

OLAP often pre-computes a large number of aggregates to improve the performance of aggregation queries. A new operator CUBE BY [5] was introduced to represent a set of group-by operations, i.e., to compute aggregates for all possible combinations of attributes in the CUBE BY clause. The following Example 1 shows a cube computation query on a relation SALES (employee, product, customer, quantity).
Example 1:
SELECT employee, product, customer, SUM (quantity)
FROM SALES
CUBE BY employee, product, customer
It will compute group-bys for (employee, product, customer), (employee, product), (employee, customer), (product, customer), (employee), (product), (customer) and ALL (no GROUP BY). The attributes in the CUBE BY clause are called dimensions and the attributes aggregated are called measures. For n dimensions, 2^n group-bys are
computed. The number of distinct values of a dimension is its cardinality. Each combination of attribute values from different dimensions constitutes a cell. If empty cells are a majority of the whole cube, then the cube is sparse.
Relational normal forms are hardly suitable for OLAP cubes because of the different goals of operational and OLAP databases. The main goal of operational databases is to avoid update anomalies, and the relational normal forms are well adapted to this goal. But for OLAP databases the efficiency of queries is the most important issue. So there are always constraints between dimensions or between attributes in a dimension for a cube, such as functional dependencies. Sparsity clearly depends on actual data. However, functional dependencies between dimensions may imply potential sparsity [4]. A tacit assumption of all previous algorithms is that dimensions are independent of each other, and so all these algorithms did not consider the effect of functional dependencies on computing cubes.
Algebraic functions COUNT, SUM, MIN and MAX have the key property that more detailed aggregates (i.e., more dimensions) can be used to compute less detailed aggregates (i.e., fewer dimensions). This property induces a partial ordering (i.e., a lattice) on all group-bys of the CUBE. A group-by is called a child of some parent group-by if the parent can be used to compute the child (and no other group-by is between the parent and the child). The algorithms [1, 2, 3, 6] recognize that group-bys with common attributes can share partitions, sorts, or partial sorts. The difference between them is how they exploit such properties. Among these algorithms, BUC [1] computes from bottom to up, while the others compute from top to bottom.
This paper addresses full cube computation over sparse data cubes and makes the following contributions:
1.
We introduce the problem of computation of sparse data cubes with constraints, which allows us to use such constraints to speed up the computation. A new algorithm CFD (Computation by Functional Dependencies) is presented to satisfy this demand. CFD determines the partitioning order of dimensions by considering their cardinalities and functional dependencies between them together. Therefore the correlated dimensions can share sorts.
2. CFD partitions group-bys of a data cube from bottom to up; at the same time it computes aggregate values from top to bottom by summing up the return values of smaller partitions. Even if all the dimensions are independent of each other, CFD is still faster than BUC at computing full cubes.
3. Few algorithms deal with hierarchies in dimensions. CFD can compute a sparse data cube with hierarchies in dimensions. In this situation, CFD efficiently computes from the smallest granularity to the coarsest one.
The rest of this paper is organized as follows: Section 2 presents the problem of sparse cubes with constraints. Section 3 illustrates how to decide the partitioning order of dimensions. Section 4 presents a new algorithm for the computation of sparse cubes, called CFD. Our performance analysis is described in Section 5. Related work is discussed in Section 6. Section 7 contains conclusions.
2 The Problem

Let C = (D, M) be an OLAP cube schema, where D is the set of dimensions and M the set of measures. Two attributes X and Y with a one-to-one or many-to-one relation have a functional dependency X→Y, where X is called a determining attribute and Y is called a depending attribute. Such a functional dependency can exist between two dimensions or between two attributes in a dimension. The problem is, when there are constraints (functional dependencies), how to use them to speed up the computation of sparse cubes. The dependencies considered in CFD are only those whose left and right sides each contain a single attribute. Such functional dependencies will help in data pre-processing (see Section 3.2) and in partitioning dimensions (see Section 4.1).
Functional dependencies between dimensions imply the structural sparsity of a cube [4]. With no functional dependencies, the structural sparsity is zero. Considering the cube in Example 1, if we know that one employee sells only one product, we get a functional dependency employee→product. Assume we have 6 employees, 4 customers, and 3 different products; then the size of the cube is 72 cells. Further, the total number of occupied cells in the whole cube is at most 6×4 = 24, thus the structural sparsity is 67%.
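For concreteness, the following short Python sketch (ours, not part of CFD) enumerates the 2^n group-bys of Example 1 and recomputes the structural sparsity bound implied by the dependency employee→product; the cardinalities are those assumed in the text.

from itertools import combinations

dims = ["employee", "product", "customer"]
groupbys = [c for k in range(len(dims), -1, -1) for c in combinations(dims, k)]
print(len(groupbys), groupbys)   # 2^3 = 8 group-bys, from (employee, product, customer) down to ()

card = {"employee": 6, "product": 3, "customer": 4}
cube_cells = card["employee"] * card["product"] * card["customer"]   # 72 cells in total
occupied_bound = card["employee"] * card["customer"]                 # employee -> product: at most 6*4 = 24
print(round(1 - occupied_bound / cube_cells, 2))                     # structural sparsity = 0.67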
3 Data Preprocessing

CFD partitions from bottom to up just like BUC, i.e., it first partitions on one dimension, then on two dimensions, and so on. One difference between CFD and BUC is that CFD chooses the order of dimensions by functional dependencies and cardinalities together.

3.1 Deciding the Order of Dimensions

First we build a directed graph from the functional dependencies between dimensions, called the FD graph. The graph ignores all transitive dependencies (i.e., dependencies that can be deduced from other dependencies). A node in the graph is a dimension. Once the graph has been built, we try to classify the nodes. We find the longest path in the graph in order to make the most of dependencies. The nodes in such a path form a dependency set and are deleted from the graph. This process is repeated until the graph is empty. The time complexity of this process is O(n^2), where n is the number of dimensions.
Example 2: A cube has six dimensions from A to F with descending cardinalities and functional dependencies A→C, A→D, C→E, B→F. Figure 1 is the corresponding FD graph. From Figure 1, we first get the dependency set {A, C, E}, for its nodes form the longest path, then {B, F} and at last {D}. The elements in each set are ordered by dependencies. Although there is a functional dependency between A and D, it is not considered, so the dependency set {D} contains only the dimension D itself.
After getting the dependency sets, CFD sorts them in descending order by the biggest cardinality of a dimension in each set. Then we merge each set sequentially to determine
the order of dimensions. By this approach, CFD can make the depending dimension share the sort of the determining dimension, because such two dimensions are put together. If there is no functional dependency, the partitioning order of CFD is just the same as that of BUC.

Fig. 1. FD graph (nodes A-F with edges A→C, A→D, C→E and B→F)

Fig. 2. The encoding of two dimensions with a functional dependency (employee→product). After sorting on the depending dimension product, the values are coded sequentially:
  employee: Tom=0, Ross=1, Bob=2, Smith=3, Louis=4, White=5
  product:  towel=0, soap=1, shaver=2
so that the employee and product codes satisfy a monotonic mapping.
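A possible reading of the dependency-set construction of Section 3.1 is sketched below in Python (our own illustration, not the authors' code), using the FD graph of Example 2; the concrete cardinality values are invented, only their descending order from A to F matters.

edges = {"A": ["C", "D"], "C": ["E"], "B": ["F"]}                  # FD graph of Example 2 (no transitive edges)
card  = {"A": 60, "B": 50, "C": 40, "D": 30, "E": 20, "F": 10}     # assumed cardinalities, descending A..F

def longest_path(nodes, edges):
    """Longest simple path using only the remaining nodes (fine for small FD graphs)."""
    best = []
    def dfs(node, path):
        nonlocal best
        if len(path) > len(best):
            best = path[:]
        for nxt in edges.get(node, []):
            if nxt in nodes and nxt not in path:
                dfs(nxt, path + [nxt])
    for n in nodes:
        dfs(n, [n])
    return best

nodes = set(card)
dependency_sets = []
while nodes:                                   # peel off one longest path at a time
    path = longest_path(nodes, edges)
    dependency_sets.append(path)
    nodes -= set(path)

# Sort the sets by the biggest cardinality of a member, then concatenate them.
dependency_sets.sort(key=lambda s: max(card[d] for d in s), reverse=True)
order = [d for s in dependency_sets for d in s]
print(dependency_sets, order)   # [['A', 'C', 'E'], ['B', 'F'], ['D']] -> A C E B F D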
3.2 Data Encoding

Like other algorithms for computing a data cube, CFD assumes that each dimension value is an integer between zero and its cardinality, and that the cardinality is known in advance. A usual data encoding does not consider the correlations between dimensions and simply maps each dimension value to an integer between zero and its cardinality. This operation is similar to sorting on the values of a dimension. In order to share sorts, CFD encodes adjacent dimensions with functional dependencies jointly, to make their codes satisfy a monotonic mapping. For example, let X and Y be two dimensions and f a functional dependency from X to Y. Assume there are two arbitrary values xi and xj on dimension X, and yi = f(xi) and yj = f(xj) are two values on dimension Y. If xi > xj, we have yi ≥ yj, and then y = f(x) is monotonic. Due to the functional dependency between X and Y, the approach of encoding is to sort on dimension Y first; then the values of X and Y can be mapped sequentially to zero through their cardinalities respectively. Figure 2 shows the encoding of two dimensions with a functional dependency, employee→product, from Example 1. Obviously, if the left or right side of a functional dependency has more than one attribute, it is difficult to encode like this. Note that the mapping relations can be reflected in the fact table for correlated dimensions. But for hierarchies in a dimension the mapping relations should be stored separately.
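The joint encoding can be sketched as follows (a toy Python illustration with generic employee values e1..e5 and product values p1..p3, not the paper's implementation): sorting on the depending dimension first and then assigning codes sequentially yields the monotonic mapping described above.

rows = [("e1", "p2"), ("e2", "p1"), ("e3", "p1"), ("e4", "p3"), ("e5", "p2")]   # employee -> product (toy)
rows.sort(key=lambda r: r[1])                       # sort on the depending dimension (product)
prod_code, emp_code = {}, {}
for e, p in rows:                                   # assign codes sequentially in sorted order
    prod_code.setdefault(p, len(prod_code))
    emp_code.setdefault(e, len(emp_code))
print([(emp_code[e], prod_code[p]) for e, p in rows])
# employee codes 0..4 ascend while product codes 0..2 never decrease: a monotonic mapping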
4 Algorithm CFD

We propose a new algorithm called CFD for the computation of full sparse data cubes with constraints. The idea in CFD is to take advantage of functional dependencies to share partitions and to make use of the property of algebraic functions to reduce aggregation costs. CFD was inspired by the BUC algorithm and is similar to a version of BUC except for the aggregation computation and the partition function. After data preprocessing, we can now compute a sparse data cube.
CFD(input, dim)
Inputs:
  input: The relation to aggregate.
  dim: The starting dimension to partition.
Globals:
  numdims: the total number of dimensions.
  dependentable[]: the dependency sets obtained in Section 3.1.
  hierarchy[numdims]: the height of the hierarchies in each dimension.
  cardinality[numdims][]: the cardinality of each dimension.
  dataCount[numdims]: the size of each partition.
  aggval[numdims]: sums the results of smaller partitions.
1: if (dim == numdims) aggval[dim] = Aggregate(input); // the result of the thinnest partition
2: FOR d = dim; d < numdims; d++ DO
3:   FOR h = 0; h …

Fig. 3. Algorithm CFD
The details of CFD are in Figure 3. The first step is to aggregate the entire input into aggval[numdims] when arriving at the smallest partition. For each dimension d between dim and numdims, the input is partitioned on dimension d (line 6). On return from Partition(), dataCount contains the number of records for each distinct value of the d-th dimension. Line 8 iterates through the partitions (i.e., each distinct value). The partition becomes the input relation in the next recursive call to CFD, which computes the cube on the partition for dimensions (d+1) to numdims. Upon return from the recursive call, we sum up the aggregation results from the smaller partitions (line 14) and continue with the next partition of dimension d. Once all the partitions are processed, we repeat the whole process for the next dimension.
We use the same optimization process Write-Ancestors() as in BUC (line 11). Write-Ancestors() simply outputs each of the ancestor group-bys to avoid fruitless partitioning and computation when the partition contains only one tuple. For a data cube with dimensions A, B, C, D containing a single-tuple partition (a1, b1), Write-Ancestors() directly outputs (a1, b1), (a1, b1, c), (a1, b1, c, d) and (a1, b1, d).
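Since the listing in Figure 3 is only partially reproduced above, the following Python sketch reconstructs the core recursion from the textual description (partition bottom-up as in BUC, but obtain each coarser aggregate by summing the values returned from the finer partitions, cf. lines 1, 6, 8 and 14); it omits the single-tuple Write-Ancestors() optimization, the hierarchy loop and the shared sorts, so it is an approximation, not the authors' code.

from collections import defaultdict

def cfd(tuples, dims, d0, key, out, measure="M"):
    """Compute all group-bys reachable from `key`; return the aggregate for `key` itself."""
    if d0 == len(dims):                                   # thinnest partition: aggregate the input (line 1)
        out[key] = sum(t[measure] for t in tuples)
        return out[key]
    total = 0
    for d in range(d0, len(dims)):
        parts = defaultdict(list)
        for t in tuples:                                  # partition on dimension d (line 6)
            parts[t[dims[d]]].append(t)
        s = 0
        for v, part in parts.items():                     # one recursive call per partition (line 8)
            s += cfd(part, dims, d + 1, key + ((dims[d], v),), out, measure)
        total = s                                         # same total whichever d is used (line 14)
    out[key] = total                                      # aggregate for the current group-by, no rescan
    return total

out = {}
rows = [{"A": "a1", "B": "b1", "M": 3}, {"A": "a1", "B": "b2", "M": 4}, {"A": "a2", "B": "b1", "M": 5}]
cfd(rows, ["A", "B"], 0, (), out)
print(out)   # aggregates for (A,B), (A), (B) and the empty group-by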
4.1 Partitioning

CFD partitions from the bottom of the lattice and works its way up towards the larger, less detailed aggregates. When CFD partitions, Partition() determines whether two dimensions are dependent by dependentable[] gotten from Section 3.1.

Fig. 4. Encoding and partitioning of CFD. The input of Example 3 over the dimensions A, C, B, D:
  A   C   B   D
  a1  c1  b1  d1
  a1  c1  b1  d2
  a1  c1  b2  d2
  a2  c1  b1  d2
  a3  c2  b3  d1
  a4  c3  b4  d2

Fig. 5. CFD processing tree: the 16 group-bys over A, C, B and D (from ACBD down to ALL), numbered in the order in which CFD visits them; the empty group-by ALL is produced last.
Example 3: A data cube has four dimensions from A to D with descending cardinalities and a dependency A→C. The order of dimensions is A, C, B and D. Figure 4 illustrates how the input in Example 3 is partitioned during the first calls of CFD. First CFD partitions on dimension A, producing partitions a1 to a4; then it recursively partitions the partition a1 on dimension C, then on dimension B. Because of the dependency A→C and the monotonic mapping, CFD will not sort on dimension C. This is one key optimization of CFD. CFD does not affect the efficiency of other optimizations such as finding single tuples. For independent dimensions, CountingSort, QuickSort and InsertSort can be used by CFD just like in BUC.

4.2 Data Cube Computation

Another key factor in the success of CFD is to take advantage of algebraic functions by summing up the results of the recursive calls from the smallest partitions (lines 1 and 14 of Figure 3). By this approach, CFD can reduce by half the time to scan the whole relation, and it is faster than BUC even when dimensions are independent. Figure 5 shows the CFD processing tree for Example 3. The numbers indicate the order in which CFD visits the group-bys, and CFD produces the empty group-by ALL last.
We found that if we directly use the return value of the recursive call to aggregate the results, CFD actually runs more slowly than BUC, just as reported in [1]. This may be a side effect of stack operations. Instead we use the array aggval[numdims] to record and aggregate the results, and then it really runs faster than BUC. On the negative side, CFD cannot efficiently compute holistic functions.

4.3 Data Cube with Hierarchies

Some approaches have been proposed to reduce structural sparsity risks, i.e., some dimensions with functional dependencies are decomposed into hierarchical dimensions [4, 11]. Hierarchical dimensions, which enable the user to analyze data on different levels of aggregation, are essential for OLAP.
CFD can also efficiently compute a cube with hierarchies in dimensions (shown in line 3 of Figure 3). Because the attributes in a dimension have functional dependencies from the smaller granularity (the smallest one is the key) to the coarser one, its computation is similar to that of a cube with constraints between dimensions.
4.4 Memory Requirements

The memory requirement of CFD is slightly more than that of BUC. CFD tries to fit a partition of tuples in memory as soon as possible. Say that a partition has X tuples and each tuple requires N bytes. Our implementation also uses pointers to the tuples, and CountingSort requires a second set of pointers for temporary use. Let d0, …, dn-1 be the cardinalities of each dimension for a data cube, and dmax be the maximum cardinality. CountingSort uses dmax counters, and CFD uses ∑_{i=0}^{n-1} d_i counters. In order to aggregate results from the smallest partitions, an array of n elements is needed for an aggregation function. If the counters and pointers are each four bytes and every element in the above array is also four bytes, the total memory requirement in bytes for CFD is:

(N + 8)·X + 4·∑_{i=0}^{n-1} d_i + 4·dmax + 4·n
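As a quick numeric instance of this bound, the following uses assumed values only (they are not taken from the paper's experiments):

X, N, n = 1_000_000, 40, 8          # assumed: one-million-tuple partition, 40-byte tuples, 8 dimensions
d = [10] * n                        # assumed cardinalities of 10, as in the synthetic experiments below
mem = (N + 8) * X + 4 * sum(d) + 4 * max(d) + 4 * n
print(mem / 2**20)                  # roughly 45.8 MiB under these assumptions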
5 Performance Analysis

Since MemoryCube is the first algorithm to compute sparse cubes and BUC is faster than MemoryCube, we concentrate on comparing CFD with BUC to compute full cubes. We implemented CFD for main memory only. The implementation of BUC had the same restriction. CFD can be smoothly extended to perform an external partitioning. We did not count the time to read the file and write the results. We only measured running time as the total cost, including partitioning time and aggregating time. We also neglected the cost of determining the order of dimensions.

5.1 Qualitative Analysis

First we analyze the time of aggregation computation and assume the dimensions are independent of each other. For a cube with n dimensions, 2^n group-bys need to be computed. BUC needs to scan the whole relation 2^n times. CFD uses the results of the smallest partitions to aggregate the results of coarser partitions. This means CFD can save about half of the time of aggregation computation. Due to the single-tuple optimization, CFD actually saves 3-5% of the total cost compared to BUC when dimensions are independent.
Next we analyze the partitioning time. For a cube with n dimensions, if the order is 0, 1, …, n-1, then the numbers of partitions for the dimensions are 2^0, 2^1, …, 2^{n-1}, respectively. Say that the partitioning time of each dimension is one. Considering a functional dependency d_i→d_{i+1}, it will save 2^{i+1} partitioning steps by sharing the sort of the i-th dimension when partitioning the (i+1)-th dimension. If there are k functional dependencies considered and the depending dimensions are at positions i_1, …, i_k, then the total time saved is ∑_{m=i_1}^{i_k} 2^m (k < n).
It seems that if we push the dimensions with dependencies further backward, we will save more partitioning time. But such an approach neglects the optimization of finding single tuples as early as possible. Experiments have shown that the performance may decrease if we do so.

5.2 Full Cube Computation

All tests were run on a machine with 128MB of memory and a 500MHz Pentium processor. The data was randomly generated uniformly. The sparse data cube used in our experiments had eight dimensions from A to H, and each dimension had a cardinality of 10. The whole data could be loaded in memory. Figure 6 compares CFD with BUC when computing a full cube with five sum functions and four functional dependencies, A→E, B→F, C→G, and D→H, varying the number of tuples. For 1.5 million tuples, CFD saved about 25% of the time compared to BUC.
The time saved above is caused by two factors: the time for partitioning and for aggregation computation. Figure 7 shows the computation time for one million tuples with independent dimensions, varying the number of aggregation functions. For five sum functions, CFD saved 15% compared to BUC. Figure 8 compares CFD with BUC considering functional dependencies. The data for this test were one million tuples. We considered one of the dependencies of Figure 6 each time. As the position of a dependency moves backwards, the time reduces quickly.
Fig. 6. Computation of a cube (running time in seconds vs. number of tuples, 0.5-1.5 million, for BUC and CFD)

Fig. 7. Aggregation functions (running time in seconds vs. number of sum functions, 1-5, for BUC and CFD)

Fig. 8. Different position (running time in seconds vs. position of a dependency, 0-4, for BUC and CFD)
Figure 9 compares CFD with BUC for just one sum function, assuming the dimensions are independent of each other. The data varied from 0.5 million to 1.5 million tuples. As the number of tuples increased, single tuples decreased and CFD saved more aggregation time compared to BUC. We note that CFD still saved 3% of the total time on the sparsest cube, with 0.5 million tuples. This experiment shows that CFD is more adaptive than BUC for computing data cubes with different sparsities.
Fig. 9. Dimensions independent (running time in seconds vs. number of tuples, 0.5-1.5 million, for BUC and CFD)

Fig. 10. Hierarchies in dimensions (running time in seconds vs. number of tuples, 0.2-1 million, for CFD with and without functional dependencies)
5.3 Hierarchies

This experiment explores the computation of a data cube with hierarchies, varying the number of tuples. The dimensions were ordered by cardinality: 500, 200, 100, 80, 60 and 50. The cardinality of the coarser granularity was one third of that of the adjacent smaller granularity in each dimension. There were three hierarchies in the first two dimensions and two hierarchies in the other dimensions. The results are shown in Figure 10. The bottom line shows the computation considering functional dependencies between attributes in a dimension, and the upper line shows that without considering those correlations. The running time increased 30% when functional dependencies were not considered.

5.4 Weather Data

We compared CFD with BUC on a real nine-dimensional dataset containing weather conditions at various weather stations on land for September 1985 [12]. The dataset contained 1,015,367 tuples. The attributes were ordered by cardinality: station (7037), longitude (352), solar-altitude (179), latitude (152), present-weather (101), day (30), weather-change-code (10), hour (8), and brightness (2). Some of the attributes were significantly correlated (e.g., only one station was at each (latitude, longitude)). The experiment shows that CFD is effective on a real dataset. The running time of CFD had a 3% improvement over BUC when considering the functional dependency station→latitude. When (latitude, longitude) was treated as a hierarchy of the dimension station, CFD had a 5% improvement over BUC, which does not consider dependencies.
6 Related Work

Since the introduction of the concept of data cube [5], efficient computation of data cubes has been a theme of active research, such as [1, 2, 3, 7, 8, 9, 10]. All algorithms
before did not consider the effect of constraints on the computation of a data cube. Following Lenher et al., a general principle in OLAP design is that dimensions should be independent, and inside a dimension there should be a single attribute key for the dimension [11]. But it is not always suitable to put all correlated attributes in a dimension to satisfy such a principle, as the weather data in [12] show. While some work concentrated on fast computation of a data cube, other work dealt with the size problem of a cube [7, 8, 10]. Computing iceberg cubes rather than complete cubes was also proposed in [2, 9].
7 Conclusions
We introduce the problem of computing sparse data cubes with constraints and present a novel algorithm, CFD, for this problem. CFD decides the order of dimensions by taking cardinalities and functional dependencies into account together. At the same time, CFD partitions group-bys bottom up and aggregates the results of partitions top down. CFD thus combines the advantage of bottom-up partitioning from BUC with the advantage of top-down aggregation computation from PipeSort and related algorithms. Our approach can efficiently speed up the computation of sparse data cubes as well as those with hierarchies.
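The exact dimension-ordering rule used by CFD is not spelled out in this extract. The sketch below, in Python, only illustrates the kind of heuristic discussed above, ordering dimensions by decreasing cardinality and pushing functionally determined dimensions toward the back; the function name and the rule itself are illustrative assumptions, not CFD's actual implementation.

    # Illustrative sketch (not CFD's exact rule): order cube dimensions by
    # decreasing cardinality, then push dimensions that are functionally
    # determined by another dimension toward the back of the order.
    def order_dimensions(cardinalities, fds):
        """cardinalities: dict dim -> |dim|; fds: set of (determinant, dependent) pairs."""
        order = sorted(cardinalities, key=cardinalities.get, reverse=True)
        dependents = {dep for _, dep in fds}
        free = [d for d in order if d not in dependents]
        determined = [d for d in order if d in dependents]
        return free + determined

    # Example with the synthetic 8-dimensional cube of Section 5.2
    # (dimensions A..H, cardinality 10 each, dependencies AE, BF, CG, DH).
    dims = {d: 10 for d in "ABCDEFGH"}
    fds = {("A", "E"), ("B", "F"), ("C", "G"), ("D", "H")}
    print(order_dimensions(dims, fds))   # e.g. ['A','B','C','D','E','F','G','H']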
References
1. K. Beyer, R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD'99, pages 359-370
2. K. A. Ross, D. Srivastava. Fast computation of sparse data cubes. In Proc. of the 23rd VLDB Conf., pages 116-125, Athens, Greece, 1997
3. S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. of the 22nd VLDB Conf., pages 506-521, 1996
4. T. Niemi, J. Nummenmaa, P. Thanisch. Constructing OLAP Cubes Based on Queries. DOLAP 2001, pages 1-8
5. J. Gray, A. Bosworth, A. Layman and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. ICDE'96, pages 152-159
6. Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97, pages 159-170
7. W. Wang, J. Feng, H. Lu, J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. Proc. of the 18th Int. Conf. on Data Engineering, 2002, pages 155-165
8. Y. Sismanis, A. Deligiannakis, N. Roussopoulos, Y. Kotidis. Dwarf: Shrinking the PetaCube. SIGMOD'02
9. J. Han, J. Pei, G. Dong and K. Wang. Efficient Computation of Iceberg Cubes with Complex Measures. SIGMOD'01
10. L. Lakshmanan, J. Pei, and J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. Proc. of the 28th VLDB Conf., Hong Kong, China, 2002
11. W. Lenher, J. Albrecht and H. Wedekind. Normal forms for multidimensional databases. SSDBM'98, pages 63-72
12. C. Hahn, S. Warren, and J. London. Edited synoptic cloud reports from ships and land stations over the globe, 1982-1991. http://cdiac.esd.ornl.gov/cdiac/ndps/ndp026b.html, http://cdiac.esd.ornl.gov/ftp/ndp026b/SEP85L.Z, 1994
Answering Joint Queries from Multiple Aggregate OLAP Databases

Elaheh Pourabbas¹ and Arie Shoshani²

¹ Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti" - CNR, Viale Manzoni 30, 00185 Rome, Italy
  [email protected]
² Lawrence Berkeley National Laboratory, Mailstop 50B-3238, 1 Cyclotron Road, Berkeley, CA 94720, USA
  [email protected]
Abstract. Given an OLAP query expressed over multiple source OLAP databases, we study the problem of evaluating the result OLAP target database. The problem arises when it is not possible to derive the result from a single database. The method we use is the linear indirect estimator, commonly used for statistical estimation. We examine two obvious computational methods for computing such a target database, called the ”Full-cross-product” (F) and the ”Pre-aggregation” (P) methods. We study the accuracy and computational complexity of these methods. While the method F provides a more accurate estimate, it is more expensive computationally than P. Our contribution is in proposing a third new method, called the ”Partial-Pre-aggregation” method (PP), which is significantly less expensive than F, but is just as accurate.
1 Introduction
Similar to Statistical Databases that were introduced in the 1980’s [1], OLAP databases have a data model that represents one or more ”measures” over a multidimensional space of ”dimensions”, where each dimension can be defined over a hierarchy of ”category attributes” [3],[7]. In many socio-economic type applications only summarized data is available because the base data for the summaries (called ”microdata”) are not kept or are unavailable for reasons of privacy [7]. We will refer to Statistical Databases or OLAP databases that contain summarized data as ”summary databases”, and the measures associated with them as ”summary measures”. Each summary measure must have a ”summary operator” associated with it, such as ”sum, or ”average”. In this paper, we address the problem of evaluating queries expressed over multiple summary databases [6]. That is, given that the base data is not available and that a query cannot be derived from a single summary database, we examine the process of estimating the desired result from multiple summary databases by a method of interpolation common in statistical estimation, called ”linear indirect estimator”. Essentially, this method takes advantage of the correlation between measures to perform the
estimation. For example, suppose that we have a summary database of ”totalincome by race, and sex” and another summary database of ”population by state, age, and sex”. If we know that there is a strong correlation between ”population” and ”income” of states, we can infer the result ”total-income by state”. We say that in this case the ”population” was used as a proxy measure to estimate ”total-income by state”. Similarly, the summary database ”population by state, age, and sex” is referred to as the proxy database. The problem we are addressing is to answer the query ”total-income by state” over the two summary databases. Furthermore, the fact that there is a common category attribute to both databases (Sex) can be used to achieve more accurate results, as we’ll show later. Consider the above mentioned databases written in the following notation: ”summary-measure (category-attribute,..., category-attribute)”. One obvious method of estimating this result is to aggregate each of the source databases to the maximum level. We call this the Pre-aggregation (P) method. In this case, we can aggregate Population(State, Sex, Age) over sex and age to produce Population(State)1 and Total-Income(Race,Sex) over race and sex to produce Total-Income(•), where the symbol ”•” indicates aggregation over all the category attributes. Then, we can calculate the proportional estimation using linear indirect estimation (see section 3) to produce Total-Income(State). Another possibility is to produce the full cross product: Total-Income(State,Sex,Race,Age) using Population as a proxy summary measure. Then, from this result we can aggregate over Sex, Race, Age to get Total-Income(State). We call this the Fullcross-product (F) method. This is the most accurate result that can be produced since it performs the linear indirect estimation on the most detailed cells. We use the Average Relative Error (ARE) to measure accuracy (see section 5) of these estimation methods. Using this measure we show that method P is less accurate than method F. Our main result is in proposing a method, called the Partial-pre-aggregation (PP) method that achieves the same accuracy as the F method but at a much lower computation cost. This is achieved by noticing that it is possible to pre-aggregate over all the category attributes that are not in common to the two source databases before we perform the cross product, and still achieve the same accuracy of the full cross product computation. According to this method, in our example, we first aggregate over Race in the TotalIncome(Race,Sex) database to produce Total-Income(Sex) and aggregate over Age in Population(State,Sex,Age) to produce Population(State, Sex). Then, we use the linear indirect estimator to produce Total-Income(State,Sex). Finally, we aggregate over Sex in Total-Income(State,Sex) to produce Total-Income(State). We show that this result is as accurate as F, but by performing the aggregations over the non-common attributes early we minimize the computation needed. The paper is structured as follows. Section 2 introduces syntax for defining joint queries that provides the basis for a formal analysis of the accuracy of the results. Section 3 discusses the underlying methodology for the estimation of query results. Section 4 describes the three methods for estimating the query 1
Population(State) is equivalent to Population(State, ALL, ALL), where ALL indicates the construct introduced in [2]. For the sake of brevity, we use the first notation.
results. The accuracy of these methods is investigated in Section 5, whereas, the results of performance evaluation are given in Section 6. Finally, Section 7 concludes.
2 The Joint Query Syntax
We define the syntax of a joint query on two summary databases in terms of the common and non-common category attributes of the databases. Let $M(C_{M_i})$ and $N(C_{N_j})$, with $1 \le i \le p$ and $1 \le j \le q$, be summary databases, where $M$ and $N$ are summary measures, $C_{M_i}$ and $C_{N_j}$ are category attributes, and $p$ and $q$ each represent a finite number. A joint query formulated on these summary databases is a triple defined by $M^T(C^C, C^{\bar C}, C^T)$ where: $M^T$ is a selected target summary database that can be one of $M$ or $N$; without loss of generality, suppose that $M$ was selected. Then, $C^C$ is a set of $R$ common category attributes, where $C_r^C = C_{M_i} = C_{N_j}$, with $1 \le r \le R$. $C^{\bar C}$ is a set of $S$ non-common category attributes, where $C_s^{\bar C} = C_{M_i}$ or $C_s^{\bar C} = C_{N_j}$, with $C_{M_i} \ne C_{N_j}$, $1 \le s \le S$. $C^T$ represents a set of target category attributes, which includes at least one $C_{N_j}$, and at least one of the target category attributes is not in the target summary database. According to these distinctions, the summary databases can be represented by $M(C_M^C, C_M^{\bar C})$ and $N(C_N^C, C_N^{\bar C}, C_N^T)$, where $C_N^T$ are called the target category attributes of the joint query, $N$ is called the proxy measure, $M$ is the target measure ($M^T$), and $N(C_N^C, C_N^{\bar C}, C_N^T)$ is called the proxy database. Note that the set of category attributes of the summary databases, as well as the category attributes of a joint query, can be null.
3 The Linear Indirect Estimator Method
The main idea of such an approach, known in the literature as Small Area Estimation, is to use data from surveys designed to produce estimates of variables of interest at the national or regional level, and to obtain comparable estimates at more geographically disaggregated levels such as counties. This approach is characterized by indirect estimation techniques. An indirect estimator uses values of the variable of interest from available auxiliary (called predictor or proxy) data at the local level that are related to the variable of interest [4], [5]. In this model, the population is partitioned into a large number of domains formed by cross-classification of demographic variables such as age, sex, and race. Let $i$ denote a small area. For each domain $d$, the only available variable of interest $Y$ (e.g., Income), denoted by $Y(d) = \sum_i Y(i, d)$, is calculated from the survey data. Furthermore, it is assumed that auxiliary information (e.g., Population) in the form of $X(i, d)$ is also available. An estimator $\hat{Y}(i)$ of $Y$ for small area $i$ is given by $\hat{Y}(i) = \sum_d (X(i, d)/X(d))\, Y(d)$, where $X(d) = \sum_i X(i, d)$, $X(i, d)/X(d)$ represents the proportion of the population of small area $i$ that is in each domain $d$, and $\sum_i \hat{Y}(i)$ equals the direct estimator $Y = \sum_d Y(d)$. The estimate is subject to error. The error is computed using the true values. In
our examples, we assume the knowledge of the true values in order to evaluate the error. This provides us with the means of comparing the accuracy of the results using different computational methods. This estimation method applies to any summary database where the assumption is that the characteristics of small areas are sufficiently close to the characteristics of the large areas.
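As a concrete illustration of the estimator defined above, the following Python sketch computes $\hat{Y}(i)$ from the domain totals $Y(d)$ and the auxiliary counts $X(i, d)$; the data structures and names are illustrative choices, not taken from the paper.

    # Sketch of the linear indirect estimator: Y_hat(i) = sum_d X(i,d)/X(d) * Y(d),
    # where X(d) = sum_i X(i,d). Variable names are illustrative.
    def linear_indirect_estimate(Y_by_domain, X):
        """Y_by_domain: dict d -> Y(d); X: dict (i, d) -> X(i, d)."""
        X_d = {}
        for (_, d), x in X.items():
            X_d[d] = X_d.get(d, 0) + x
        areas = {i for (i, _) in X}
        return {i: sum(X[(i, d)] / X_d[d] * Y_by_domain[d]
                       for d in Y_by_domain if (i, d) in X)
                for i in areas}

    # Toy usage: two domains (e.g. sex), three small areas.
    Y = {"male": 120.0, "female": 90.0}
    X = {("a", "male"): 2, ("a", "female"): 1,
         ("b", "male"): 3, ("b", "female"): 2,
         ("c", "male"): 5, ("c", "female"): 3}
    print(linear_indirect_estimate(Y, X))  # the estimates sum to Y(male) + Y(female)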
4 Computational Methods for Answering Joint Query
In this section, we propose three methods to compute joint queries expressed over two summary databases. They are called the Full-cross-product, Pre-aggregation, and Partial-Pre-aggregation methods. We first define them, and then we discuss their relative computational complexity. For the definition of these methods, we use the formalism defined for a joint query in Section 2.

4.1 The Full Cross Product Method (F)
The next theorem provides the estimation of the full cross product over summary databases that represent different measures.

Theorem 1. Let $M(C_M^C, C_M^{\bar C})$ and $N(C_N^C, C_N^{\bar C})$ be two summary databases. The full-cross-product summary database is obtained as follows:

$\hat{M}(C_M^C, C_M^{\bar C}, C_N^{\bar C}) = M(C_M^C, C_M^{\bar C}) \cdot \dfrac{N(C_N^C, C_N^{\bar C})}{\sum_{C_N^{\bar C}} N(C_N^C, C_N^{\bar C})}$    (1)

If we assume the summary database $N(C_N^C, C_N^{\bar C})$ as the proxy, then by linear indirect estimation the proof of the above theorem is straightforward.
Example 1. Table 1 represents the data in the summary databases Income(Race, Sex) and Population(State, Sex, Age), where by "Income" we mean "Total-Income" in the rest of the paper. Let us obtain the full-cross-product summary database by Eq. 1: $\hat{Income}(State, Sex, Age, Race) = Income(Race, Sex) \cdot \frac{Population(State, Age, Sex)}{\sum_{State, Age} Population(State, Age, Sex)}$. According to the syntax introduced in Section 2, Sex is a common category attribute, and Race, Age, and State are non-common category attributes. The summary database Population(State, Sex, Age) is the proxy database. For instance, if Income(State) is the target summary database, applying the F method we first obtain the full cross product summary database shown in Table 2 (for space limitations, only one state is shown) and then we summarize over all category attributes except the target attribute. The result is shown in Table 3, third column.
4.2 The Pre-aggregation (P) Method
The pre-aggregation method is based on summarizing the summary databases over all common and non-common category attributes before applying the linear indirect estimator method.

Definition 1. Let $M(C_M^C, C_M^{\bar C})$ and $N(C_N^C, C_N^{\bar C}, C_N^T)$ be summary databases. The target summary database $\hat{M}^T(C_N^T)$ is estimated by pre-summarizing all common and non-common category attributes in the summary databases as follows: $M(\bullet) = \sum_{C_M^C, C_M^{\bar C}} M(C_M^C, C_M^{\bar C})$, $N(C_N^T) = \sum_{C_N^C, C_N^{\bar C}} N(C_N^C, C_N^{\bar C}, C_N^T)$, and then applying the linear indirect estimation: $\hat{M}^T(C_N^T) = M(\bullet) \cdot \frac{N(C_N^T)}{\sum_{C_N^T} N(C_N^T)}$.

Example 2. Consider Table 1. To apply the P method, we first summarize all common and non-common category attributes in the summary databases Income and Population: $\sum_{Age, Sex} Population(State, Age, Sex) = Population(State)$, $\sum_{Race, Sex} Income(Race, Sex) = Income(\bullet)$. Then, by applying linear indirect estimation, we obtain $\hat{Income}(State) = Income(\bullet) \cdot \frac{Population(State)}{\sum_{State} Population(State)}$. The result of the target summary database $\hat{Income}(State)$ is shown in Table 3, fourth column. We observe that $\hat{Income}(State)_F$ and $\hat{Income}(State)_P$ are different.
4.3 The Partial-Pre-aggregation (PP) Method
This method was devised to yield the same accuracy as method F but with a lower computational complexity. The main idea is to summarize the summary databases only over the non-common category attributes, and then estimate the target summary database with the common and target category attributes.

Definition 2. Let $M(C_M^C, C_M^{\bar C})$ and $N(C_N^C, C_N^{\bar C}, C_N^T)$ be summary databases. The target summary database $\hat{M}^T(C_N^T)$ is estimated by pre-summarizing all the non-common category attributes in the summary databases as follows: $M(C_M^C) = \sum_{C_M^{\bar C}} M(C_M^C, C_M^{\bar C})$, $N(C_N^C, C_N^T) = \sum_{C_N^{\bar C}} N(C_N^C, C_N^{\bar C}, C_N^T)$, and then estimating the cross product and summarizing over the common attributes as follows: $\hat{M}^T(C_N^T) = \sum_{C_N^C} \hat{M}(C_N^C, C_N^T) = \sum_{C_N^C} M(C_M^C) \cdot \frac{N(C_N^C, C_N^T)}{\sum_{C_N^T} N(C_N^C, C_N^T)}$.

Example 3. Consider again Table 1. First, we summarize the summary databases over the non-common category attributes as follows: $\sum_{Age} Population(State, Age, Sex) = Population(State, Sex)$, $\sum_{Race} Income(Race, Sex) = Income(Sex)$. Then, by applying linear indirect estimation, we obtain $\hat{Income}(State) = \sum_{Sex} \hat{Income}(State, Sex) = \sum_{Sex} Income(Sex) \cdot \frac{Population(State, Sex)}{\sum_{State} Population(State, Sex)}$. We note that the results obtained by $\hat{Income}(State)_{PP}$ are identical to $\hat{Income}(State)_F$ shown in Table 3.
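To make the estimates concrete, the following sketch recomputes the P and PP (equivalently F) estimates of Income(State) from the Table 1 figures used in Examples 2 and 3; the code is an illustrative reimplementation, not the authors' software.

    # Income(Sex) and Population(State, Sex) from Table 1 (Age and Race already
    # summed out, which is exactly the PP pre-summarization and does not change P).
    income_by_sex = {"Male": 1585082, "Female": 1037799}
    population = {("AL", "Male"): 9,  ("AL", "Female"): 7,
                  ("CA", "Male"): 10, ("CA", "Female"): 10,
                  ("FL", "Male"): 14, ("FL", "Female"): 9,
                  ("NE", "Male"): 6,  ("NE", "Female"): 12,
                  ("TE", "Male"): 19, ("TE", "Female"): 11}
    states = sorted({s for s, _ in population})

    # P method: aggregate everything first, then distribute by Population(State).
    income_total = sum(income_by_sex.values())
    pop_by_state = {s: sum(v for (s2, _), v in population.items() if s2 == s) for s in states}
    pop_total = sum(pop_by_state.values())
    income_P = {s: income_total * pop_by_state[s] / pop_total for s in states}

    # PP method: keep the common attribute Sex, then distribute within each sex.
    pop_by_sex = {x: sum(v for (_, x2), v in population.items() if x2 == x) for x in income_by_sex}
    income_PP = {s: sum(income_by_sex[x] * population[(s, x)] / pop_by_sex[x]
                        for x in income_by_sex) for s in states}

    for s in states:
        print(s, round(income_PP[s], 2), round(income_P[s], 2))
    # e.g. AL 394218.0 392206.5, CA 485085.71 490258.13 (cf. Table 3)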
Table 1. Income(Race, Sex) and Population(State, Sex, Age)

Income
Race           Male      Female
White          629907    330567
Black          311121    241835
Hispanic       312800    192632
Non-Hispanic   331254    272765
Total          1585082   1037799

Population
State   Age          Male   Female
AL      <65 years    6      4
AL      ≥65 years    3      3
CA      <65 years    7      5
CA      ≥65 years    3      5
FL      <65 years    9      6
FL      ≥65 years    5      3
NE      <65 years    3      5
NE      ≥65 years    3      7
TE      <65 years    12     6
TE      ≥65 years    7      5
Total                58     49
Table 2. Estimated Income(State, Race, Age, Sex) (only state AL shown)

State   Race           Age          Male           Female
AL      White          <65 years    65162.7931     26985.06122
AL      White          ≥65 years    32581.396552   20238.795918
AL      Black          <65 years    32184.93103    19741.632653
AL      Black          ≥65 years    16092.46552    14806.22449
AL      Hispanic       <65 years    32358.62069    15725.061224
AL      Hispanic       ≥65 years    16179.31035    11793.79592
AL      Non-Hispanic   <65 years    34267.65517    22266.530612
AL      Non-Hispanic   ≥65 years    17133.82759    16699.897959
...     ...            ...          ...            ...
5 Accuracy Analysis of Methods
The previous definitions can be used to estimate joint queries with any number of (common, non-common) category attributes. As seen from the previous examples, the application of the F and P methods on the same data yields different results. This difference is formalized by the next theorem.
Theorem 2. The estimation of any joint query $\hat{M}^T(C_N^T)$ over $M(C_M^C, C_M^{\bar C})$ and $N(C_N^C, C_N^{\bar C}, C_N^T)$ using the methods F and P gives different results.

Proof. (sketch) We prove this by negation. Suppose methods F and P give the same result $\hat{M}^T(C_N^T)$. According to this assumption, (i) and (ii) in Eq. 2 must be equal. $C_M^C = C_N^C$, but the proportions are not equal, which contradicts the assumption.

(i) $\sum_{C_M^C, C_M^{\bar C}, C_N^{\bar C}} M(C_M^C, C_M^{\bar C}) \cdot \dfrac{N(C_N^C, C_N^{\bar C}, C_N^T)}{\sum_{C_N^{\bar C}, C_N^T} N(C_N^C, C_N^{\bar C}, C_N^T)}$

(ii) $\Big(\sum_{C_M^C, C_M^{\bar C}} M(C_M^C, C_M^{\bar C})\Big) \cdot \dfrac{\sum_{C_N^C, C_N^{\bar C}} N(C_N^C, C_N^{\bar C}, C_N^T)}{\sum_{C_N^C, C_N^{\bar C}, C_N^T} N(C_N^C, C_N^{\bar C}, C_N^T)}$    (2)
Since the methods F and P yield different results, we need a way of evaluating the accuracy of these results. A common approach to determining the accuracy is based on the calculation of the average relative error (ARE) from the "true" base values $v$, as defined in [4]: $ARE = \frac{1}{m}\sum_{i=1}^{m} \frac{|\hat{v}_i - v_i|}{v_i}$. By the linear indirect estimator method, we obtain an estimate of a given target measure for small area $i$, and then we calculate the ARE from the true value of small area $i$. In Table 3, the true value for each State as well as the values estimated by methods F (or PP) and P are shown. Note that the true values are calculated from the base data in order to obtain the error between the true values and the data estimated by each of the F, PP, and P methods. In the same table, the relative errors and the ARE are shown. As can be seen, while method P is less expensive in terms of computation than method F, method F is more precise than method P. An intermediate solution in terms of computational complexity of these two methods is method PP. As we saw in the examples above, and on the basis of the next theorem, the methods F and PP give the same results.

Theorem 3. The estimation of any joint query $\hat{M}^T(C_N^T)$ over $M(C_M^C, C_M^{\bar C})$ and $N(C_N^C, C_N^{\bar C}, C_N^T)$ using the methods F and PP gives the same results.

Proof. (sketch) It is easy to show the part F⇒PP by simple algebraic manipulation [6]. For the part PP⇒F, we consider a simplified representation of the $\hat{M}^T(C_N^T)$ obtained by the PP method: for any pair of category attributes $C_p$ and $C_q$ with $p \ne q$, it can be written as the result of summarization over $C_p$ and $C_q$ as follows:
(i) $\hat{M}^T(C_N^T) = \sum_{C_N^C} \Big(\sum_{C_p} M(C_M^C, C_p)\Big) \cdot \dfrac{\sum_{C_q} N(C_N^C, C_q, C_N^T)}{\sum_{C_q} N(C_N^C, C_q)}$,

where the inner term is equivalent to

(ii) $\sum_{C_p, C_q} \hat{M}(C_N^C, C_p, C_q, C_N^T) = \sum_{C_p, C_q} M(C_M^C, C_p) \cdot \dfrac{N(C_N^C, C_q, C_N^T)}{N(C_N^C)}$.

As can easily be seen, the summarization of the full cross product in (ii) over $C_p$ and $C_q$ gives $\hat{M}(C_N^C, C_N^T)$, which is equal to the inner term of (i).
Table 3. Results obtained by methods F (or PP) and P, and their ARE

State   Income (true)   Estimate F (PP)   Estimate P      Rel. error F (PP)   Rel. error P
AL      265039          394218.0000       392206.50467    0.487396195         0.479806763
CA      495302          485085.7143       490258.13084    0.020626377         0.010183422
FL      806255          573222.14286      563796.85047    0.289031209         0.300721421
NE      393694          418128.85714      441232.31776    0.062065607         0.120749409
TE      662591          752226.28571      735387.19626    0.135279963         0.109865960
ARE                                                       0.198879870         0.204265395
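A short sketch of the ARE computation defined above, applied to the F (or PP) estimates listed in Table 3; the function and variable names are illustrative.

    # Average relative error: ARE = (1/m) * sum_i |v_hat_i - v_i| / v_i.
    def are(estimates, true_values):
        return sum(abs(estimates[k] - true_values[k]) / true_values[k]
                   for k in true_values) / len(true_values)

    true_income = {"AL": 265039, "CA": 495302, "FL": 806255, "NE": 393694, "TE": 662591}
    income_F = {"AL": 394218.0000, "CA": 485085.7143, "FL": 573222.14286,
                "NE": 418128.85714, "TE": 752226.28571}
    print(round(are(income_F, true_income), 9))   # about 0.19888, as in Table 3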
5.1 Pre-aggregation on Category Hierarchies
Let us consider an example of a category hierarchy State → Region → Country. In the case that the result database needs to be evaluated over a higher-level category, such as Income(Region) it is more efficient to aggregate (or roll-up in OLAP terminology) the source database Population(State,Age,Sex) to the region level to produce Population(Region,Age,Sex) before applying the PP method. Intuitively, we expect to get the same level of accuracy. We verified that indeed aggregating (rolling-up) first to the higher category level produced the same accuracy as aggregating after applying the PP method [6]. Thus, we consider aggregation to the desired level of the category hierarchies as a first step of the Partial-Pre-Aggregation (PP) method.
6 Performance Evaluation
In this section, we illustrate the experimental results of the performance evaluation of the methods F and PP. We focus our attention on these methods because of their higher accuracy compared with method P. The performance evaluation is described over two and then over more than two summary databases through some examples.
6.1 Performance Evaluation over Two Summary Databases
Let the number of cells of a given summary database be defined by $X_M = \prod_{i=1,\dots,n} |C_i|$, where $|C_i|$ represents the cardinality of the domain values of the $i$-th category attribute. For instance, the total number of cells of the summary database Population(State, Age, Sex) is $X_{Population} = 20$, given that the cardinalities of the domain values of the category attributes (State, Age, Sex) are respectively 5, 2, and 2. Note that each cell requires the same space for storing the data value, and therefore the space complexity for each cell is assumed to be the same. Thus counting the number of cells is a good measure of the space required.
Definition 3. Let $M(C_M^C, C_M^{\bar C})$ and $N(C_N^C, C_N^{\bar C}, C_N^T)$ be summary databases, and let $M$ be the target summary database. The total number of cells of the target summary database $\hat{M}(C_M^C, C_M^{\bar C}, C_N^{\bar C}, C_N^T)$ is defined as follows. In the case of method F: $X_{\hat{M}} = \frac{X_M \cdot X_N}{\prod_{q=1,\dots,n} |C_q^C|}$, where $|C_q^C|$ represents the cardinality of the domain values of a category attribute which is common between the target and proxy summary databases. In the case of method PP: $X_{\hat{M}} = \frac{(X_M / |C_M^{\bar C}|)\,(X_N / |C_N^{\bar C}|)}{\prod_{q=1,\dots,n} |C_q^C|}$.
Each cell in the target summary database contributes to the cost in terms of the number of arithmetic operations performed (time) and the number of bytes (space) needed. The space cost is important in the intermediate steps of processing when methods F and PP are applied to achieve the final result. For instance, let us consider the summary databases shown in Table 1. Applying the F method, the total number of cells of the summary database $\hat{Income}$(State, Race, Age, Sex) is 80, while with the PP method the total number of cells of the summary database $\hat{Income}$(State, Sex) is 10. Concerning the arithmetic operations, we assume that the multiplication and division operations in each cell need some fixed amount of time, such as 3 µs. Therefore, in our example, the time cost for estimating the target summary database $\hat{Income}$(State) by F is 255 µs, while by PP it is 45 µs.
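A small sketch of the cell-count cost model of Definition 3, instantiated on the running example; the helper names are illustrative, and the comment about the quoted time costs is an interpretation rather than a statement taken from the paper.

    from math import prod

    # Cell counts per Definition 3: X_M is the product of the attribute cardinalities.
    def cells(cardinalities):
        return prod(cardinalities.values())

    income = {"Race": 4, "Sex": 2}                 # X_Income = 8
    population = {"State": 5, "Age": 2, "Sex": 2}  # X_Population = 20
    common = {"Sex": 2}

    x_f = cells(income) * cells(population) // prod(common.values())        # 80 cells
    noncommon_m = prod(v for k, v in income.items() if k not in common)     # |Race| = 4
    noncommon_n = prod(v for k, v in population.items()
                       if k not in common and k != "State")                 # only Age is summed out
    x_pp = (cells(income) // noncommon_m) * (cells(population) // noncommon_n) \
           // prod(common.values())                                         # 10 cells
    print(x_f, x_pp)
    # The quoted 255 us and 45 us appear to also count the 5 cells of the final
    # Income(State) result: (80 + 5) * 3 and (10 + 5) * 3.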
6.2 Performance Evaluation over Three Summary Databases
The evaluation of a joint query over more than two summary databases is performed by selecting the target summary measure of one database and using all the other databases as proxy databases. Consequently, there are many combinations for estimating the result of such a joint query, depending on the order of evaluation. As examples of our methodology for performance evaluation, we use the summary databases Population(State, Age, Sex), Income(Race, Sex, Profession), and Population_E(Race, Sex, Education-level), which are labelled a, b, and c in Figure 1. The joint query is Income(State, Education-level). In this case, the target summary measure is from b, and the target category attributes belong to a and c. In Figure 1, for example, cb indicates that between b and c, the former is the target database and the latter is the proxy database. For answering this query, we use the methods F and PP. Note that for all the cases reported in Figure 1, the solutions yield the same total number of cells for the target summary database using the methods F and PP. The target summary database $\hat{Income}$(State, Education-level) is the result of: (a) using method F, the aggregation over Race, Sex, Age, and Profession in $\hat{Income}$(State, Race, Sex, Age, Profession, Education-level), which has 320 cells; (b) using method PP, the aggregation over Race and Sex in $\hat{Income}$(State, Race, Sex, Education-level), which has 80 cells. Table 4 shows the cost of processing the intermediate steps in each solution. As can be seen from the table, the minimum space cost is achieved with solution B for both methods F and PP. The same table shows the total time cost (in µs) for estimating the target summary database. Overall, solution B with method PP provides the best performance (least amount of space and least amount of computation).
Answering Joint Queries from Multiple Aggregate OLAP Databases
a
c
b
b
a
c
b a b a,c
b a
b a,c (C)
(B)
b
a
c b c
a
b
a b
b a,c
c b c
a
b
c c b
b a
b a,c
(D)
c
b c a
b a,c
(A)
Fig. 1.
a a c
b c
33
b a,c (F)
(E)
Solutions of joint query on multiple summary databases
Table 4. Cost of Space and Time of intermediate summary databases for each solution

              Space (bytes)        Time (µs)
Solution      F       PP           F        PP
A             160     40           1440     360
B             32      16           1056     288
C             160     80           1440     480
D, E, F       192     40           1536     360
7 Conclusions
In this paper, we proposed a method, called the Partial-Pre-aggregation method, for estimating the results of a joint query over two source databases. This method is based on partitioning the category attributes of the source databases into "common", "non-common", and "target" attributes. By summarizing over the non-common attributes first, we reduce the computational and space complexity of applying the linear indirect estimator method. We have shown that the Partial-Pre-aggregation method is, in general, significantly more efficient than the Full-cross-product method commonly used by statistical software. Furthermore, we provided a way to evaluate the optimal order of pairing databases for queries over more than two source summary databases.
References
[1] Chan, P., Shoshani, A.: SUBJECT: A Directory Driven System for Organizing and Accessing Large Statistical Databases. Conference on Very Large Data Bases (1981) 553–563
[2] Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data Cube: a Relational Aggregation Operator Generalizing Group-by, Cross-tabs and Subtotals. 12th IEEE Int. Conf. on Data Engineering (1996) 152–159
[3] Codd, E. F., Codd, S. B., Salley, C. T.: Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate. Technical report (1993)
[4] Ghosh, M., Rao, J. N. K.: Small Area Estimation: An Appraisal. Journal of Statistical Science 9 (1994) 55–93
[5] Pfeffermann, D.: Small Area Estimation - New Developments and Directions. International Statistical Review 70 (2002)
[6] Pourabbas, E., Shoshani, A.: Joint Queries Estimation from Aggregate OLAP Databases. LBNL Technical Report LBNL-48750 (2001)
[7] Shoshani, A.: OLAP and Statistical Databases: Similarities and Differences. 16th ACM Symposium on Principles of Database Systems (1997) 185–196
An Approach to Enabling Spatial OLAP by Aggregating on Spatial Hierarchy

Long Zhang, Ying Li, Fangyan Rao, Xiulan Yu, Ying Chen, and Dong Liu

IBM China Research Laboratory, Beijing 100085, P. R. China
{longzh,lying,raofy,yuxl,yingch,liudong}@cn.ibm.com
Abstract. Investigation shows that a huge number of spatial data exists in current business databases. Traditional data warehousing and OLAP, however, could not exploit the spatial information to get deep insight into the business data in decision making. In this paper, we propose a novel approach to enabling spatial OLAP by aggregating on the spatial hierarchy. A spatial index mechanism is employed to derive the spatial hierarchy for pre-aggregation and materialization, which in turn are leveraged by the OLAP system to efficiently answer spatial OLAP queries. Our prototype system shows that the proposed approach could be integrated easily into the existing data warehouse and OLAP systems to support spatial analysis. Preliminary experiment results are also presented.
1 Introduction
OLAP systems provide architectures and tools for knowledge workers (executives, managers, analysts) to systematically organize, understand, and use their data to make strategic decisions. It is claimed that 80% of the overall information stored in computers is geo-spatially related, either explicitly or implicitly [4]. Currently, a large amount of spatial data has been accumulated in business information systems. How to analyze such business data associated with spatial information presents a challenge to data warehousing and OLAP systems. For example, business data, such as the growth of the surrounding neighborhood and proximate competitors, and geographical data, such as the distance to the nearest highways, can be used to choose a location for a new store. To address the above issues, two methods are commonly used to support the management of business data with spatial information: (1) GIS + DBMS. In this approach, a geographical information system (GIS) is used to model, manipulate and analyze the spatial data, and a DBMS is used to handle the business data. The disadvantage of this method is that the business data and spatial data are maintained separately and it is hard to provide a uniform view for users; (2) DBMS with spatial extensions. Some commercial database vendors, such as IBM and Oracle, provide spatial extensions in their database systems [3, 11]. These spatial extensions provide methods to execute spatial queries, namely,
to select spatial objects in specified areas. However, they do not provide spatial data analysis functionality in the OLAP tools. The data in the warehouse are often modelled as multidimensional cubes [5], which allow the data to be organized around natural business concepts that are called measures and dimensions. We adopt Han's definition of the spatial-to-spatial dimension as the spatial dimension, whose primitive level and all of its high-level generalized data are spatial [9]. The spatial dimension differs from the non-spatial dimension in that there is little a priori knowledge about the grouping hierarchy [12, 13]. In addition to some predefined regions, the user may request groupings based on a map which are computed on the fly, or which may be arbitrarily created (e.g., an arbitrary grid in a selected window). Thus the well-known pre-aggregation methods which are used to enhance performance cannot be applied. In this paper, we present an approach to enabling spatial data manipulation in OLAP systems. We extend the cube query by introducing spatial predicates and functions which explicitly express the spatial relationships among data in fact tables and dimension tables. A spatial index mechanism is employed to derive the spatial hierarchy for pre-aggregation and materialization, which in turn is leveraged by the OLAP system to efficiently answer spatial OLAP queries. Our prototype system shows that the proposed approach could be integrated easily into existing data warehousing and OLAP systems to support spatial analysis. Preliminary studies on the performance of the system are presented. The remainder of this paper is organized as follows. A motivating example is given in Section 2. The spatial index mechanism is described in Section 3. The system architecture is depicted in Section 4. Section 5 gives a detailed explanation of query processing. Preliminary experiments are given in Section 6. Related work is discussed in Section 7, and conclusions are drawn in Section 8.
2 Motivating Example
In our work, the star schema is used to map multidimensional data onto a relational database. To focus on spatial dimensions, we employ a simple data warehouse of sales from thousands of gas stations as the motivating example. The sales concerning different gas stations, gas types, and customers at different times are analyzed. The schema of the dimension tables station, gas, customer, time and the fact table transaction is as follows:

    station (station ID, location)
    gas (gas ID, unit price)
    customer (customer ID, type)
    time (time ID, day, month, year)
    transaction (station ID, gas ID, customer ID, time ID, sales, quantity)

Each tuple in the fact table is a transaction item indicating the sales, customer ID, gas type, gas station, and time involved in the transaction. Here, station is a spatial dimension. Its attribute location gives the spatial location of a gas station. A location is a point such as "(332, 5587)". The typical OLAP query
may be: "What are the total sales of each type of gas for 'taxi' customers at each gas station in Oct. 2002?", while an OLAP query involving spatial information may be "What are the total sales of each type of gas for customer 'taxi' at gas stations within the query window in Oct. 2002?" Here, a query window is a rectangle that the user draws on the map.
3 The Spatial Hierarchy and Pre-Aggregation
In OLAP system, concept hierarchy is the basis for two commonly used OLAP operators: drill-down and roll-up[8]. A concept hierarchy defines a sequence of mappings from a set of low level concepts to higher level, more general concepts. For example, the hierarchy on time dimension is day, month and year. Traditionally, users, field experts and knowledge engineers manually define the concept hierarchies of dimensions. But for the spatial dimension, there is little a priori knowledge available on how to define an appropriate hierarchy in advance. Spatial indexing has been one of the active focus areas in recent database research. There have been many data structures proposed for indexing spatial objects[6], such as Quadtree, R-tree (and its derivations such as R*-tree and R+-tree), and grid file. Among them, those indexes with tree structures provide nesting relationship between high-level nodes and low-level nodes. This relationship might be inspiring candidates for building the spatial hierarchy. Each node in R-tree stores a minimum bounding rectangle (MBR) and the MBR of higher level node encloses all MBRs of its descendants. Thus R-tree provides a naturally nested hierarchy based on the spatial shapes and placement of the indexed objects. This R-tree derived spatial hierarchy plays the same role as traditional concept hierarchies. It can be used for pre-aggregation on spatial dimension. Other tree-like indexes can also be used to derive the spatial hierarchy. Fig. 1 indicates a part of a sample R-tree for the motivating example and the spatial layout of the objects. The leaf entries are gas station IDs. Each intermediate R-tree entry has an associated MBR indicating its spatial scope. r2 r5
r1 r3 s2
r6
r3
r2
r9
s3
r4
s1
s4
r5
r6
r7
r8
r9
r10
r11 r7
...... ..................
...... r11
s1 s2 s3
r8
s4
s5
r4
s5 r10
Fig. 1. A part of sample R-tree and spatial layout of the objects
Assuming that a user draws a query window, indicated as the bold rectangle in Fig. 1, our Spatial Index Engine (introduced in Section 4) uses the R-tree to compute the query. It must return exactly the objects within the query window. The index searching algorithm tries to find higher-level nodes in the tree that satisfy the spatial
query predicates, more specifically, the within predicate. In some cases, some gas stations are contained in the query window, but the MBRs of their ancestor nodes are not completely enclosed by the query window. These gas stations should then be fetched individually instead of their ancestor nodes. Accordingly, for the query window in Fig. 1, the intermediate R-tree entries {r2, r7} and the leaf entries {s2, s3, s4} are returned. With the spatial hierarchy, pre-aggregation and materialization can be used to answer OLAP queries. Usually, the result of materialization is stored in a summary table. For our motivating example, the summary table concerning spatial dimensions is as follows:

    spatial_sum (nid, customer, gas, month, year, sales)

NIDs are the IDs of R-tree nodes. Currently, we materialize the whole cube; that is, all nodes in the spatial hierarchy (R-tree) are computed and the results are inserted into the summary table. The summary table is built up by traversing the aggregation paths. For example, in Fig. 1, gas stations s1, s2 and s3 are grouped into node r9; in turn, r7, r8 and r9 are grouped into r3, and so on. For each intermediate node in the spatial hierarchy, all other non-spatial dimensions are aggregated and materialized. DB2 provides OLAP extensions [2] to SQL, and we use GROUP BY CUBE to aggregate and materialize the non-spatial dimensions. For example, the corresponding pre-aggregation and materialization SQL statement for r9 is:

    INSERT INTO spatial_sum
    SELECT 'r9' AS nid, c.type AS customer, g.gas_id AS gas,
           t.month, t.year, SUM(tr.sales) AS sales
    FROM station s, customer c, gas g, transaction tr, time t
    WHERE tr.station_id IN ('s1', 's2', 's3')
      AND tr.station_id = s.station_id AND tr.customer_id = c.customer_id
      AND tr.gas_id = g.gas_id AND tr.time_id = t.time_id
    GROUP BY CUBE (c.type, g.gas_id, t.month, t.year)

Using this method, all nodes in the spatial hierarchy are aggregated and materialized into the table spatial_sum.
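The window-query decomposition described above, returning the fully covered intermediate nodes plus the individually covered stations under partially covered ancestors, can be sketched as follows; this is a simplified toy R-tree in Python, not the engine's actual code.

    # Simplified sketch of the window-query decomposition.
    # A node is (node_id, mbr, children); a leaf entry has children = None.
    def within(inner, outer):
        (x1, y1, x2, y2), (X1, Y1, X2, Y2) = inner, outer
        return X1 <= x1 and Y1 <= y1 and x2 <= X2 and y2 <= Y2

    def intersects(a, b):
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    def decompose(node, window, nids, leaves):
        node_id, mbr, children = node
        if not intersects(mbr, window):
            return
        if children is None:                      # leaf entry: an individual station
            if within(mbr, window):
                leaves.append(node_id)
        elif within(mbr, window):                 # whole subtree covered: reuse its summary
            nids.append(node_id)
        else:                                     # partially covered: descend
            for child in children:
                decompose(child, window, nids, leaves)

    # Toy usage with point stations (an MBR degenerates to (x, y, x, y)).
    s2 = ("s2", (150, 300, 150, 300), None)
    s3 = ("s3", (200, 400, 200, 400), None)
    s5 = ("s5", (900, 900, 900, 900), None)
    r7 = ("r7", (140, 290, 210, 410), [s2, s3])
    r8 = ("r8", (880, 880, 950, 950), [s5])
    root = ("r3", (100, 200, 1000, 1000), [r7, r8])
    nids, leaves = [], []
    decompose(root, (122, 220, 500, 768), nids, leaves)
    print(nids, leaves)   # ['r7'] []  -- r7 fully inside the window, r8 pruned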
4 System Architecture
Our prototype system executes spatial OLAP queries, accessing multidimensional data cubes stored in a relational database and presents the result in a multidimensional format. Fig. 2 depicts the architecture of our system. Warehouse Builder extracts, transforms and loads raw data from operational database systems (ODS) to build fact table and dimension tables. Before summary tables are created, Spatial Index Engine uses dimension tables containing spatial data to build spatial hierarchy. The hierarchy then is used by Warehouse Builder when generating summary tables. User query is generated in Graphic User Interface (GUI) on client/browser side, encoded in XML and then transferred to server side. Spatial OLAP Query Processor extracts all spatial predicates from the query and dispatches them to Spatial Index Engine. By investigating Spatial Index, the engine fetches all
[Fig. 2. System architecture: a GUI on the client/browser side exchanges XML queries and results with the server-side Spatial OLAP Query Processor, which dispatches spatial predicates to the Spatial Index Engine (backed by the Spatial Index) and rewritten queries to the ROLAP Engine via a Java OLAP API; the Warehouse Builder loads the fact table, dimension tables, and summary tables from the ODS.]
intermediate R-tree nodes whose MBRs satisfy the predicates and individual leaf nodes (i.e., gas station IDs in our motivating example) whose spatial locations satisfy the predicates. These nodes are returned to the Spatial OLAP Query Processor. With these data, the processor rewrites the original query and sends the rewritten query to the ROLAP Engine, which is a relational OLAP engine. The ROLAP Engine processes the query using the summary tables and returns the query results to the query processor. The query results are further reconstructed as an XML document and sent to the client/browser side. The architecture takes advantage of an independent Spatial Index Engine. Thus the system could employ other index structures, such as a Quadtree, with few modifications to the other modules. The interface between the query processor and the ROLAP Engine is a Java OLAP API following the JOLAP API specification [10]. With this API, any OLAP engine complying with it can be easily adopted.
5 Spatial OLAP Query Processing
We use the most common query region shape, a rectangle drawn on a map, to analyze spatial OLAP query processing. There are two types of queries:
– Summary query (SQ): it requests summary information on all selected spatial objects as a whole, e.g., "what are the total sales of each customer at ALL gas stations within my indicated query window during each month in 2002?"
– Individual query (IQ): it asks for individual information based on each selected spatial object, e.g., "what are the total sales of each customer at EACH gas station within my indicated query window during each month in 2002?"
Obviously, SQ queries can take advantage of the data already summarized on the spatial dimension, while IQ queries cannot exploit the summary information and must be processed as traditional OLAP queries. To deal with these two types separately, we built different cubes for them. The schema of the summary table concerning spatial information was introduced in Section 3. We established the following traditional cube, based on individual gas stations, for IQ queries:
    nonspatial_sum (station_id, customer, gas, month, year, sales)

Take a query as an example: "what are the total sales of each customer at all gas stations within my indicated query window during each month in 2002?". The indicated rectangular query window, with upper left corner (122, 220) and lower right corner (500, 768), is depicted in Fig. 1. The SQL statement is:

    Q1: SELECT c.type AS customer, month, SUM(sales)
        FROM customer c, gas g, time t, station s, transaction tr
        WHERE s.location WITHIN query_window(122, 220, 500, 768) AND t.year = 2002
          AND tr.customer_id = c.customer_id AND tr.gas_id = g.gas_id
          AND tr.station_id = s.station_id AND tr.time_id = t.time_id
        GROUP BY customer, month

When the query is processed, all the gas stations located within the window will be selected and their sales information will contribute to the query result. After the query is generated in the GUI and transferred to the server side, the Spatial OLAP Query Processor extracts the spatial predicates in the WHERE clause. The query processor then dispatches each predicate to the Spatial Index Engine. The engine evaluates the predicate and returns an NID list containing the IDs of all R-tree nodes which satisfy the predicate, and a gas station ID list containing the IDs of the separate gas stations. For example, for the predicate query_window(122, 220, 500, 768), the Spatial Index Engine returns two lists: one NID list containing all intermediate R-tree nodes whose MBRs are within the specified query window, and one list of gas station IDs containing all gas stations whose locations are also within the query window but whose ancestors' MBRs are not completely enclosed by it. Let the NID list be {23, 179, 255, 88} and the gas station ID list be {868, 3234, 843, 65, 7665}. Then Q1 can be rewritten into:

    Q2: WITH summary AS (
          SELECT nid AS id, customer, month, SUM(sales) AS sales
          FROM spatial_sum
          WHERE nid IN (23, 179, 255, 88) AND year = 2002
            AND gas IS NULL AND customer IS NOT NULL AND month IS NOT NULL
          GROUP BY nid, customer, month
          UNION ALL
          SELECT station_id AS id, customer, month, SUM(sales) AS sales
          FROM nonspatial_sum
          WHERE station_id IN (868, 3234, 843, 65, 7665) AND year = 2002
            AND gas IS NULL AND customer IS NOT NULL AND month IS NOT NULL
          GROUP BY station_id, customer, month)
        SELECT customer, month, SUM(sales) AS sales
        FROM summary
        GROUP BY customer, month

The processing of IQ queries does not investigate the summary data based on spatial information. The user is not interested in the total information on
all gas stations as one, but in each individual gas station in the query window; that is, "what are the total sales of each customer at EACH gas station within my indicated query window during each month in 2002?" The Spatial Index Engine will then return only one list, containing all individual gas stations whose locations conform to the specified spatial predicate. The corresponding query is:

    Q3: SELECT c.type AS customer, tr.station_id AS station, t.month, SUM(sales)
        FROM customer c, station s, time t, transaction tr
        WHERE s.location WITHIN query_window(300, 220, 990, 882) AND t.year = 2002
          AND tr.station_id = s.station_id AND tr.time_id = t.time_id
          AND tr.customer_id = c.customer_id
        GROUP BY customer, station, month

Assuming the gas station IDs returned by the index engine are {868, 3234, 843, 65, 7665}, Q3 can be rewritten as Q4, where the non-spatial summary table is used:

    Q4: SELECT customer, station_id AS station, month, SUM(sales) AS sales
        FROM nonspatial_sum
        WHERE station_id IN (868, 3234, 843, 65, 7665) AND year = 2002
          AND gas IS NULL AND customer IS NOT NULL AND month IS NOT NULL
        GROUP BY customer, station_id, month
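The rewrite from Q1 to Q2 is mechanical once the two ID lists are known. A sketch of that rewriting step in Python, with illustrative helper names and mirroring the form of Q2 given above, might look like this:

    # Illustrative sketch of the SQ rewrite: combine the summary rows of the fully
    # covered R-tree nodes with the rows of the remaining individual stations.
    def rewrite_summary_query(nids, station_ids, year):
        def branch(table, id_col, ids):
            id_list = ", ".join(str(i) for i in ids)
            return (f"SELECT {id_col} AS id, customer, month, SUM(sales) AS sales\n"
                    f"  FROM {table}\n"
                    f"  WHERE {id_col} IN ({id_list}) AND year = {year}\n"
                    f"    AND gas IS NULL AND customer IS NOT NULL AND month IS NOT NULL\n"
                    f"  GROUP BY {id_col}, customer, month")
        return ("WITH summary AS (\n  " + branch("spatial_sum", "nid", nids)
                + "\n  UNION ALL\n  " + branch("nonspatial_sum", "station_id", station_ids)
                + "\n)\nSELECT customer, month, SUM(sales) AS sales\n"
                  "FROM summary GROUP BY customer, month")

    print(rewrite_summary_query([23, 179, 255, 88], [868, 3234, 843, 65, 7665], 2002))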
6 Experiments
Preliminary experiments were conducted on the simplest shape: points. Our aim is to find out whether our approach is effective for spatial OLAP queries; more complex shapes such as lines and polygons will be discussed in further work. Two data sets were generated according to the motivating example. The locations of the gas stations in each data set follow a Gaussian distribution; by this, we attempted to reflect the fact that gas stations tend to be clustered around areas with a dense road network. Table 1 gives the statistics of these data sets:

Table 1. The statistic data of base tables

Set   # of gas stations   # of customers   # of transactions
1     1,000               200              8,663
2     10,000              800              34,685
A spatial index was built on each data set. We computed and materialized the whole cube by using the spatial hierarchy. The materialization result is stored in the table spatial_sum. The materialized cube for the other, non-spatial dimensions is stored in the table nonspatial_sum. The statistics are given in Table 2. DB2 UDB and the Spatial Extender were used as the database server to support storing spatial objects. Our system runs on a desktop machine with a Pentium III 750 MHz CPU and 512 MB of memory.
Table 2. The statistic data of spatial index and summary tables

Set   R-tree size   R-tree depth   Size of nonspatial_sum   Size of spatial_sum
1     1,424         6              25,120                   30,490
2     14,182        9              126,176                  195,616
We compare our approach with one in which the spatial dimension is not aggregated. For simplicity, our approach is called the spatial aggregation (SA) approach and the other one the non-spatial aggregation (NSA) approach. In order to restrict the discussion to the main concerns, we use the following typical query in the experiments: find the total sales for every gas type in every month for all gas stations located in the specified query window. Obviously, with the spatial hierarchy not aggregated, answering the spatial query requires searching the base table station in order to fetch the IDs of all gas stations located within the query window. The query for NSA is:

    SELECT gas, month, SUM(sales)
    FROM nonspatial_sum
    WHERE gas IS NOT NULL AND month IS NOT NULL
      AND customer IS NULL AND year IS NULL
      AND station_id IN (SELECT station_id FROM station
                         WHERE location..xmin > $left AND location..xmin < $right
                           AND location..ymin > $bottom AND location..ymin < $top)
    GROUP BY gas, month

Here, $left, $right, $bottom and $top indicate the borders of the specified query window. It should be mentioned that if the objects have more complex shapes, such as lines and polygons, searching for objects whose locations are within a query window is a time-consuming task. In our approach, the better part of this task has already been done when building the spatial index: the objects have been grouped into the nodes of the spatial hierarchy, so the search cost is reduced dramatically.
12000 SA NSA
1000
10000
800
8000 Number of tuple access
Number of tuple access
SA NSA
600
6000
400
4000
200
2000
0
0
10
20
30
40
50 Area%
60
70
80
90
0
10
20
30
40
50 Area%
60
70
80
90
(b) 10,000 gas stations (a) 1,000 gas stations Fig. 3. Aggregation degree
In order to investigate the aggregation ratio of our spatial hierarchy, for each data set, we generated 9 or more query sets, each containing 100 queries. The query windows in every query set have equal area and the centers of the windows
An Approach to Enabling Spatial OLAP by Aggregating on Spatial Hierarchy
43
comply with Gaussian distribution. The proportions of window area to the whole space are predefined as 10, 20, 30, ..., 90% for these query sets. Fig. 3 depicts the average numbers of tuple-access by NSA and SA methods on each set. It can be derived from Fig. 3 that with the area of query window growing, the number of tuple-access by SA approach drops dramatically, while that by NSA approach increases quickly. At any area ratio, the number of accessed tuples by SA is much less than that by NSA, usually more than an order of magnitude. Fig. 4 displays the performance comparison for NSA and SA on above query sets. It can be seen from the figure that with the window area groups, SA always performs much better than NSA. This result also complies with the trend of aggregation degree presented in Fig. 3. 1
0.14
0.9
SA NSA
0.12
SA NSA
0.8
0.7
Elapse time (s)
Elapse time (s)
0.1
0.08
0.06
0.6
0.5
0.4
0.3
0.04
0.2 0.02
0.1 0
10
20
30
40
50 Area%
60
70
80
90
0
0
10
20
30
40
50 Area%
60
70
80
90
(a) 1,000 gas stations (b) 10,000 gas stations Fig. 4. The elapse time of SA and NSA
7 Related Works
Although advances in OLAP technology have led to successful industrial decision support systems, the integration of spatial data into data warehouses to support OLAP has only recently become a topic of active research. Han et al. [9, 14] were the first to propose a framework for spatial data warehouses. They proposed an extension of the star schema, focused on spatial measures, and proposed a method for selecting spatial objects for materialization. Their work differs from ours in that they focused on spatial measures, while we concentrate on the spatial relationships and spatial query processing. Papadias et al. [12, 13] gave an approach to the problem of providing OLAP operations in spatial data warehouses, i.e., allowing the user to execute aggregation queries in groups based on the position of objects in space. Both Han's and Papadias' approaches modified the classical star schema of existing data warehouses. The work in this paper manages to preserve the star schema while representing the spatial data. Pre-aggregation and materialization are the traditional methods to enhance the performance of OLAP systems [1, 7]. Due to the lack of a priori knowledge about the spatial hierarchy, these well-known methods cannot be applied directly. Papadias' method stores aggregation results in the index. In our method, the results are simply stored in relational tables which are separated from the spatial index.
8 Conclusion
Currently, a large amount of spatial data is stored in business information systems. In order to analyze these data, traditional OLAP systems must be adjusted to handle them. From a user's perspective, it is not convenient to process spatial data separately before OLAP analysis, which is how current OLAP systems deal with spatial information. In this paper, we discussed the feasibility of enabling spatial OLAP in traditional OLAP systems by aggregating on a spatial hierarchy. In our proposed approach, the spatial hierarchy can be established automatically. The user issues queries involving spatial predicates, and the system computes these predicates automatically using the spatial index. Pre-aggregation and materialization techniques are employed with the help of the spatial hierarchy.
References
1. Baralis, E., Paraboschi, S., Teniente, E.: Materialized View Selection in a Multidimensional Database. Proceedings of VLDB Conference, 1997.
2. Colossi, N., Malloy, W., Reinwald, B.: Relational extensions for OLAP. IBM Systems Journal, Vol. 41, No. 4, 2002.
3. Adler, D. W.: DB2 Spatial Extender - Spatial data within the RDBMS. Proceedings of VLDB Conference, pp. 687-690, 2001.
4. Daratech: Geographic Information Systems Markets and Opportunities. Daratech, Inc., 2000.
5. Gray, J., Chaudhuri, S., Bosworth, A., et al.: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery Journal, Vol. 1, 29-53, 1997.
6. Gaede, V., Günther, O.: Multidimensional Access Methods. ACM Computing Surveys, 1997.
7. Gupta, H.: Selection of Views to Materialize in a Data Warehouse. Proceedings of International Conference on Database Theory, 1997.
8. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, Inc., 2001.
9. Han, J., Stefanovic, N., Koperski, K.: Selective Materialization: An Efficient Method for Spatial Data Cube Construction. Proceedings of PAKDD Conference, 1998.
10. Java OLAP Interface (JOLAP), version 0.85. Java Community Process, May 2002. Available at http://jcp.org/en/jsr/detail?id=069.
11. Kothuri, R. K. V., Ravada, S., Abugov, D.: Quadtree and R-tree Indexes in Oracle Spatial: A Comparison using GIS Data. ACM SIGMOD Conference, 2002.
12. Papadias, D., Kalnis, P., Zhang, J., Tao, Y.: Efficient OLAP operations in spatial data warehouses. Technical Report HKUST-CS01-01, Jan. 2001.
13. Papadias, D., Tao, Y., Kalnis, P., Zhang, J.: Indexing Spatio-Temporal Data Warehouses. Proceedings of International Conference on Data Engineering, 2002.
14. Stefanovic, N., Han, J., Koperski, K.: Object-Based Selective Materialization for Efficient Implementation of Spatial Data Cubes. TKDE 12(6): 938-958 (2000).
A Multidimensional Aggregation Object (MAO) Framework For Computing Distributive Aggregations

Meng-Feng Tsai and Wesley Chu

Computer Science Department, University of California, Los Angeles, 90095
[email protected], [email protected]

Abstract. Multidimensional aggregation plays an important role in decision-making systems. A conceptual Multidimensional Aggregation Object (MAO), which consists of measures, scopes, and an aggregation function, is introduced to represent relationships among aggregators on addressable subsets of data. In the MAO model, aggregations of low-level (intermediate) data can be reused for aggregations on high-level data along the same dimension. Experimental results show that caching intermediate aggregated data can significantly improve performance. Incremental compensating and full recomputing cache-updating approaches are proposed. Execution plans for deriving the aggregations from MAOs are presented. The proposed data aggregation technique can be applied to data warehousing, OLAP, and data mining tasks.
1 Introduction
In multidimensional DB, current research focuses on accessing raw data and performing simple data aggregations. In [Venk 96] and [SAgr 96], lattice structures and data caching algorithms are studied, but they can only support a few aggregators such as MAX, MIN, SUM, and AVG. [Mumi 97] studied intermediate data maintenance for these few aggregators. The multidimensional structure model was presented in [RAgr 97]. Research about derivability in multidimensional environments was presented by [Albr 99], but the focus was on the predicates of queries and did not support general aggregators. Commercial products are available for performing simple summarization for OLAP systems. Decision support systems, however, require complex aggregate functions such as variance to analyze and summarize data, which is the topic of this paper. Consider an airline enterprise that needs to evaluate the efficiency of its operations. For example, to detect abnormal operations, we need to evaluate the variances of fuel consumption, arrival and departure delays, and the number of passengers for different dates, aircrafts and flights, which requires the summarization of data values from different aspects and at different granularity levels. Due to limitations of the conventional aggregation techniques, many useful functions such as variance, higher-order moments, and combinations of
several simple aggregations cannot be supported. To remedy this shortcoming, we propose a multidimensional aggregation methodology that supports dimensional hierarchies and also systematically exploits relationships among aggregation functions, therefore allowing the customization of aggregators. This paper is organized as follows. Section 2 proposes the Multidimensional Aggregation Object (MAO) to represent the inter-relationships among aggregations at different levels. Section 3 discusses the derivation relationships among MAOs. Section 4 introduces caching methods to enhance system performance, and also includes experiments to show performance improvements. Section 5 presents methods to synchronize cached data for updates, and Section 6 provides a language to specify metadata of execution plans for processing MAOs.
2  Multidimensional Aggregation Object (MAO)
In multidimensional environments, data can be referred to by different dimensions at different levels. A "scope" is the entity representing the entire addressable data set at different levels. In the airline enterprise example, suppose there are three dimensions: a date dimension with a date and a weekday level, an aircraft dimension with only one aircraft level, and a flight dimension that classifies data at the flight-number and time-block levels. A scope is a combination of different dimensions from different levels. In the airline enterprise example, the scope (WD, AC, TB) (in Figure 1) is from the date, aircraft, and flight dimensions at the weekday, aircraft, and time-block levels. The generalization of "scopes" from top (most generalized) to bottom (most specific) was first proposed by [SAgr 96]. A lattice structure of scopes is shown in Figure 1, where (WD, AC, f#) can derive (WD, AC, TB) since time block (TB) is a generalization of flight number (f#) in the flight dimension. Aggregation functions in multidimensional environments can be categorized into three types [SAgr 96]: distributive, algebraic, and holistic. We will focus on distributive and algebraic functions. Let the input data set be S with size n; each entry in S is referred to as si. An aggregate function is distributive if the results can be derived incrementally by repeatedly applying an associative and commutative binary operation Op(s1, s2), where s1 is the ongoing accumulated result, initially an empty value, and s2 is a newly selected entry from the remaining data set. As a result, distributive aggregation is capable of computing upper-level data from lower-level data, and the aggregated results are invariant to the order of those binary operations. An algebraic function maps a finite group of values into a single value; such a function can consist of any constants or outputs of other aggregate functions. To represent relationships among complex aggregators, a conceptual "object" can be used to encapsulate the necessary information. A multidimensional aggregation object (MAO) is defined as a tuple with three elements, MAO = (M, S, F), where M = (M1, ..., Mm) is a set of measures; S = (l1, ..., ln) is a scope, composed of combinations of different dimension levels Di; and F is the aggregation function that applies to M and S.
Fig. 1. Using scopes to present different levels of granularity of data. (The figure shows the lattice of scopes from (/, /, /) at the top down to (Date, AC, f#) at the bottom; the number in brackets gives the number of possible entries for each scope, e.g. (WD, AC, TB) [392]. Legend: / = root; WD = weekday; Date = day in one month; AC = aircraft; TB = time block; f# = flight number.)
The MAO is the key entity in the derivation relationships that exist between scopes and aggregation functions. A portion of the MAO derivation relationships for the airline enterprise example is shown in Figure 2.
Fig. 2. A portion of the MAO derivation relationships for the airline enterprise example. (The nodes are MAOs of the form (#p, scope, aggregation) for the measure "number of passengers" (#p), at the scopes (WD,AC,/), (Date,AC,/), and (WD,AC,TB), with the aggregations Var, Sum, and {SqrSum, Sum}. Note: variance (Var) can be derived from SqrSum and Sum, assuming the function count is given: Variance(X) := Sum[sqr(xi − Avg(X))] = Sum[sqr(xi)] − Sum(xi)·Avg(X).)
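To make the derivation in Figure 2 concrete, the sketch below (an illustrative Python fragment, not code from the paper; the roll-up mapping and the cell layout are assumptions) shows how cached Sum, SqrSum, and Count values at a fine scope can be rolled up to a coarser scope and then combined into a variance:

```python
from collections import defaultdict

def roll_up(cells, to_coarse):
    """Roll cached (sum, sqr_sum, count) triples up to a coarser scope.
    `cells` maps a fine coordinate (e.g. (date, aircraft, time_block)) to its
    cached triple; `to_coarse` maps a fine coordinate to its coarse coordinate
    (e.g. date -> weekday), standing in for the dimension hierarchy."""
    out = defaultdict(lambda: [0.0, 0.0, 0])
    for coord, (s, sq, n) in cells.items():
        acc = out[to_coarse(coord)]
        acc[0] += s
        acc[1] += sq
        acc[2] += n
    return out

def variance(s, sq, n):
    """Var derived from Sum and SqrSum (population variance), as in Fig. 2."""
    mean = s / n
    return sq / n - mean * mean
```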
3  Invariance in MAO Derivation
If a given MAO Mi operates on a given scope with a certain measure and can be derived from another MAO Mj, then we say Mi is derivable from Mj, denoted as Mj → Mi. We will prove that any MAO derived from multiple lower-level MAOs generates identical aggregated results. Algebraic aggregation serves as a mapping function among MAOs with the same scopes; if two MAOs are related through an algebraic aggregate, then the implementation of a correct mapping function is essential. For distributive aggregations, we first prove that the results of aggregating partitions of the data set are identical regardless of how the data set is partitioned. Let the aggregated result of MAO Mj be Agj and each entry in Mi be ix. If Mi → Mj through a distributive aggregation f, and there are k partitions formed by the dimension hierarchy, then Agj = {f(i1, i2, ..., ip1), f(ip1+1, ..., ip2), ..., f(ip(k−1)+1, ..., ipk)}.
Lemma 1. Distributive aggregations using an associative binary operator can be computed by aggregating over different partitions of a given data set and yield the identical aggregated result.
Proof: Note that the distributive aggregation uses an associative and commutative binary operator ⊕. Let n be an arbitrary positive integer and let a1, a2, ..., an be elements of the set S for which the associative property holds. Then for any partition of S, say at the r-th element, we have the generalized associative property: a1 ⊕ a2 ⊕ ... ⊕ an = (a1 ⊕ a2 ⊕ ... ⊕ ar) ⊕ (ar+1 ⊕ ... ⊕ an). Repeatedly applying the generalized associative property to two arbitrary partitionings: for the first partitioning into k blocks, (a1 ⊕ ... ⊕ ap1) ⊕ (ap1+1 ⊕ ... ⊕ ap2) ⊕ ... ⊕ (ap(k−1)+1 ⊕ ... ⊕ an) = a1 ⊕ a2 ⊕ ... ⊕ an, and the same holds for another partitioning into l blocks, (a1 ⊕ ... ⊕ aq1) ⊕ (aq1+1 ⊕ ... ⊕ aq2) ⊕ ... ⊕ (aq(l−1)+1 ⊕ ... ⊕ an) = a1 ⊕ a2 ⊕ ... ⊕ an. Therefore, aggregations over any two arbitrary partitionings of the same data set generate the identical result. Based on Lemma 1 we can prove the following theorem:
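As a quick numeric illustration of Lemma 1 (ours, not from the paper), any associative, commutative operator gives the same result no matter how the input is partitioned:

```python
from functools import reduce
import operator

data = [3, 1, 4, 1, 5, 9, 2, 6]
op = operator.add            # any associative, commutative binary operator

whole = reduce(op, data)
by_two_blocks = op(reduce(op, data[:3]), reduce(op, data[3:]))
by_three_blocks = op(op(reduce(op, data[:2]), reduce(op, data[2:5])), reduce(op, data[5:]))
assert whole == by_two_blocks == by_three_blocks   # same result for every partitioning
```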
Theorem 1. For a given pair of MAOs (Mx, My), if My can be derived from Mx through multiple distributive aggregations, then all the aggregated results are identical.
Proof: Consider each path between a pair of derivable MAOs with multiple derivation paths, say Mx → My. Each resulting entry in My is derived from more detailed entries in Mx through a combination of partitions formed by the involved dimension hierarchies. Say an entry tŷ in My is derived from a set of entries Tx = {t1, t2, ..., tl} of Mx, which means the aggregated result is tŷ = f(t1, t2, ..., tl), where f is a distributive function using a binary operator ⊕. Assume there are two paths by which Mx derives tŷ, that is, Mx ⇒ Mi → My
and Mx ⇒ Mj → My. Thus tŷ can also be derived from a group of entries of Mi or of Mj. Consider Tx, which contributes a set of entries Ti = {ti1, ti2, ..., tim} to Mi and another set Tj = {tj1, tj2, ..., tjn} to Mj. Since a direct aggregation summarizes across one level of a dimension, Mi and Mj can be considered intermediate results aggregated over different partitions of Mx. Thus Ti = {f(t1, ..., tp1), f(tp1+1, ..., tp2), ..., f(tp(m−1)+1, ..., tl)}, and similarly Tj = {f(t1, ..., tq1), f(tq1+1, ..., tq2), ..., f(tq(n−1)+1, ..., tl)}. So the corresponding entries derived via Mi and via Mj are tŷ2 = f(f(t1, ..., tp1), f(tp1+1, ..., tp2), ..., f(tp(m−1)+1, ..., tl)) and tŷ3 = f(f(t1, ..., tq1), f(tq1+1, ..., tq2), ..., f(tq(n−1)+1, ..., tl)). From Lemma 1 we know tŷ = tŷ2 = tŷ3. Therefore the theorem holds under the same operator.
Corollary 1. The aggregated results of a distributive aggregation for a MAO derived from its multiple lower-level MAOs are identical.
Proof: For a given measure, all the distributive MAOs are aggregations based on different partitions of the same raw data set (the most detailed data). Suppose the MAO with the most detailed scope is Mx. Because all other MAOs are derived from Mx directly or indirectly, from Theorem 1 we know that all distributive MAOs always generate identical aggregated results.
4  Performance Improvements of Using MAO with Caching
Caching intermediate MAOs can provide significant performance improvements by using inputs from lower-level MAOs, which have fewer entries than the detailed fact data. We use the greedy algorithm of [Venk 96], which at each step selects the best candidate MAO to cache. [Venk 96] proves that the benefit achieved by the greedy algorithm on a lattice data structure is at least (e − 1)/e ≈ 63% of that of the optimal solution, where e ≈ 2.72. The MAOs in our implementation form a partially ordered set rather than a lattice; however, the proof applies to the partially ordered set as well.
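A minimal sketch of this greedy cache-placement step (illustrative Python; the benefit function is assumed to come from the cost model and is not defined in the paper excerpt):

```python
def greedy_cache_selection(candidates, benefit, limit):
    """Greedy cache placement in the spirit of [Venk 96]: at each step pick the
    MAO whose caching adds the largest benefit given what is already cached.
    `benefit(mao, cached)` is assumed to be supplied by the cost model."""
    cached = []
    for _ in range(limit):
        remaining = [m for m in candidates if m not in cached]
        if not remaining:
            break
        best = max(remaining, key=lambda m: benefit(m, cached))
        if benefit(best, cached) <= 0:
            break
        cached.append(best)
    return cached
```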
4.1  Experimental Results
The performance improvements are measured by the number of entries saved by caching. For comparison, scopes are used when MAOs are not used; both scopes and MAOs use the same caching scheme. For example, if a MAO M (or a scope, when MAOs are not used) can be derived from a cached MAO M1 (or scope) with N1 entries rather than from M2 with N2 entries (N2 > N1), the performance improvement for computing M is N2 − N1. For a given set of MAOs SM, the total performance improvement is the sum of the improvements from caching over all MAOs in SM. Thus the performance improvement increases as the number of cached MAOs increases.
We assume that four different aggregations are involved: Sum, Square Sum, Variance, and Average. For the airline enterprise example, the three dimensions are dates, aircraft, and flights, and the three measures are number of passengers, delay minutes, and fuel burn-off. For the international trading corporation database, the data are classified into three dimensions: time, which spans 12 years comprising 144 months; product type, which consists of 4 major and 18 minor categories; and area, which consists of 5 countries and 45 regions. The measures used are sale price, number of customers, and cost per item. We assume aggregations are needed for all scopes, that all scopes and MAOs are equally likely to be accessed, and that the cache limit for both scopes and MAOs is set at 15. The greedy algorithm is used for cache placement. As shown in Figure 3a), for the case without MAOs the performance improvement is insensitive to the number of cached scopes, while for the system using MAOs the performance improvements increase significantly as the number of cached MAOs increases. Similar observations can be made for the international trading company example in Figure 3b).
Fig. 3. Performance comparisons (performance improvement versus number of cached MAOs or scopes, with and without MAO) for a) the airline enterprise example and b) the international trading company example.
Because of the larger problem size, the performance improvements for the international trading company are much larger than those for the airline enterprise. Thus the performance improvement obtained by using MAOs is more significant for larger problem sizes.
5  Data Synchronization
We shall now discuss the methods for handling intermediate cached data during data updates. There are two approaches based on whether an inverse aggregation function exists.
5.1  Incremental Compensating Updates
An efficient way to compensate for updates is to incrementally update the aggregation results according to the changes in the input data. This can be accomplished if the corresponding inverse aggregation functions exist. Formally, a function g is called an inverse function of f if for any input set S, g(S) ⊕ f(S) = I, where I is the identity element, i.e., a ⊕ I = I ⊕ a = a for any element a. When the aggregation is not distributive or an inverse function does not exist, this process may still be carried out indirectly through a transformation to other distributive and invertible MAOs. For example, variance is not distributive but can be derived from the sum and the sum of squares. Both operations are distributive and have inverse functions (subtraction of the sums of data, or of the sums of squares). Therefore variances can be synchronized without recalculating the entire updated data set, provided we maintain both the sums of data and the sums of squares.
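A small sketch of this idea (illustrative Python, not the paper's implementation): maintaining Sum and SqrSum with their inverse operations keeps a derived variance in sync under insertions and deletions.

```python
class CachedSumSqr:
    """Cached Sum and SqrSum for one aggregated entry; both are distributive
    and invertible, so a derived variance stays in sync incrementally."""
    def __init__(self):
        self.count, self.total, self.sqr_total = 0, 0.0, 0.0

    def insert(self, x):          # forward operation for newly inserted data
        self.count += 1
        self.total += x
        self.sqr_total += x * x

    def delete(self, x):          # inverse operation compensating deletions
        self.count -= 1
        self.total -= x
        self.sqr_total -= x * x

    def variance(self):           # derived, non-distributive measure
        mean = self.total / self.count
        return self.sqr_total / self.count - mean * mean
```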
5.2  Complete Reaggregation
When it is impossible or inefficient to compensate for an update, we have to recompute over the entire new data set. To reduce the computation and retrieval costs, we can compute intermediate MAOs first and then use them to derive higher-level MAOs; the procedure therefore proceeds in "stages." The cached MAOs that need to be synchronized form the first stage; the procedure then considers each of their immediate descendants as candidates for the second stage. By applying the greedy approach again, we select the best candidates one by one until no cost improvement is available. Repeating this process, we finally reach the most detailed fact data, where the updates were generated. We can then derive the MAOs at each stage in reverse order until the targets of the first stage can be derived. Suppose the maximum number of disjoint cached MAOs (which cannot derive one another) is m and the depth of the MAO derivation relationships is limited to d. Then the number of stages is bounded by m and d. The greedy process for each stage has an upper bound of m², since probing the set of the next stage involves fewer than m candidates. Thus the reaggregation process is bounded by O(min(m, d) · m²).
6  An Execution Plan
The metadata that specifies how to execute the derivation and synchronization falls into three categories: establishing derivation relationships among MAOs, estimating costs for cache placement and maintenance, and synchronizing the cached intermediate data. An Execution Plan (EP) is proposed to specify this information, as shown in Figure 4. Since both computation and retrieval time are linearly related to the size of the processed data, the computation time complexity is expressed in terms of the size of the sources and of the updates.
Fig. 4. Execution Plan: language and content. (The figure gives the grammar of the <Execution_Plans> element: a derivation relationship between source and target MAO lists, with distributive derivation (a binary operator accumulating a partial result over cursors on the sources) or algebraic derivation (a mapping function over the sources); cost estimations for caching and for synchronization, given as functions of the source and update sizes split into accessing and computing costs; and a synchronization scheme covering direct compensation (a binary operator over insertion data together with its inverse over deletion data), indirect compensation via a search for indirect sources, and recomputation over the updated sources.)
An example of an EP is shown in Figure 5. The system uses this EP to establish and maintain the relationship between the source and target MAOs. The derivation relationship reflects that the square sums of total passengers by weekday, aircraft, and time block can be derived from the square sums by date, aircraft, and time block. Since square sum is a distributive function, this is a distributive relationship and enables the incremental compensation approach for data synchronization. The cost estimations provide information to guide the system in cache placement and synchronization.

((Date,AC,TB), SqrSum, # of passengers) -> ((WD,AC,TB), SqrSum, # of passengers)
Derivation_Relationship:
  Source: ((Date,AC,TB), SqrSum, # of passengers)
  Target: ((WD,AC,TB), SqrSum, # of passengers)
  ## W: cursor on target entry; X, Y: cursors on source entries; S: accumulating result
  Distributive_Aggregation:
    if sources are raw data (most detailed fact) => W = X*X + S
    else W = Y + S
Cost_Estimation:
  For_caching:
    Caching_computation: f1(size_of_source)
    Caching_retrieval: f2(size_of_source)
  For_synchronization:
    Compensation_cost:
      computation_cost: f1(size_of_inserted) + f3(size_of_deleted)
      access_cost: f2(size_of_inserted + size_of_deleted)
    Recomputation_cost:
      compute: f1(size_of_updated_source)
      access: f2(size_of_updated_source)
Synchronization_scheme:
  direct_compensation:
    if sources are raw data => W = W + I*I - D*D
    else W = W + I - D
  indirect_compensation: *pointer to source's compensation plan
  ## recomputation_scheme is the same as Distributive_Aggregation, with the source cursors set on the updated source

Fig. 5. An example of an Execution Plan.
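For illustration only, the EP of Figure 5 could be held in memory roughly as follows (a hypothetical Python structure of our own; the field names are not part of the paper's EP language), together with the direct-compensation rule it prescribes:

```python
# Hypothetical in-memory form of the EP in Fig. 5 (field names are ours).
execution_plan = {
    "source": (("Date", "AC", "TB"), "SqrSum", "# of passengers"),
    "target": (("WD", "AC", "TB"), "SqrSum", "# of passengers"),
    "kind": "distributive",          # enables incremental compensation
    "sync": "direct_compensation",
}

def direct_compensation(w, inserted, deleted, raw_data=True):
    """Apply Fig. 5's direct compensation rule: W = W + I*I - D*D on raw data,
    W = W + I - D on already-squared intermediate data."""
    for i in inserted:
        w += i * i if raw_data else i
    for d in deleted:
        w -= d * d if raw_data else d
    return w
```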
7  Conclusion
In this paper, we introduced a Multidimensional Aggregation Object (MAO) model, which consists of the aggregation function, the aggregation values, and the aggregation scope in a multidimensional environment. MAO represents the aggregated values in a multidimensional structure and provides the information needed to reuse lower-level and simpler aggregations for composite aggregations. This information can improve performance and maintain potential data dependencies. A cache placement algorithm is proposed to efficiently reuse intermediate aggregation results. Because the MAO model provides more information on
aggregation than presenting data at different levels by scope alone, caching MAOs provides significant performance improvements compared to the conventional technique of caching scopes. To maintain the cached data while the raw data are updated, two techniques can be used to synchronize the cached data. If an inverse aggregation function exists, the incremental approach should be used, which uses the inverse function to compensate the cached results; if an inverse aggregation function is not available, a full reaggregation over the newly updated data is needed. The information for processing MAOs can be specified in an Execution Plan (EP). By providing the derivation relationships, cost-estimation functions, and synchronization plans in the EP, a system can efficiently reuse and maintain intermediate data. Experimental results show that a caching method using MAOs can yield close to an order of magnitude of improvement in computation compared with a method that does not use the MAO model. By tracing the derivation relationships among the MAOs, the system provides related aggregations at all levels and can therefore be maintained systematically. Our proposed methodology thus provides a more versatile, efficient, and coherent environment for complex aggregation tasks.
References
[Albr 99] J. Albrecht, H. Günzel, W. Lehner: "Set-Derivability of Multidimensional Aggregates", Proc. DaWaK 1999, pp. 133-142.
[SAgr 96] S. Agrawal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, S. Sarawagi: "On the Computation of Multidimensional Aggregates", Proc. VLDB 1996, pp. 506-521, Bombay, India, Sept. 1996.
[RAgr 97] R. Agrawal, A. Gupta, S. Sarawagi: "Modeling Multidimensional Databases", Proc. Int. Conf. on Data Engineering (ICDE 1997), pp. 232-243, Birmingham, England, Apr. 1997.
[Mumi 97] I. S. Mumick, D. Quass, B. S. Mumick: "Maintenance of Data Cubes and Summary Tables in a Warehouse", Proc. ACM SIGMOD 1997, pp. 100-111, AZ, USA.
[JimG 96] J. Gray, A. Bosworth, A. Layman, H. Pirahesh: "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals", Proc. IEEE Data Engineering 1996, pp. 152-159.
[Amit 96] A. Shukla, P. M. Deshpande, J. F. Naughton, K. Ramasamy: "Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies", Proc. VLDB 1996, Mumbai (Bombay), India, 1996.
[Venk 96] V. Harinarayan, A. Rajaraman, J. D. Ullman: "Implementing Data Cubes Efficiently", Proc. ACM SIGMOD 1996, pp. 205-216.
[Jose 97] J. M. Hellerstein, P. J. Haas, H. J. Wang: "Online Aggregation", Proc. ACM SIGMOD 1997, pp. 171-182.
[Han 01] J. Han, M. Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, pp. 230-243.
The GMD Data Model for Multidimensional Information: A Brief Introduction Enrico Franconi and Anand Kamble Faculty of Computer Science Free Univ. of Bozen-Bolzano, Italy [email protected] [email protected]
Abstract. In this paper we introduce a novel data model for multidimensional information, GMD, generalising the MD data model first proposed by Cabibbo et al. (EDBT-98). The aim of this work is not to propose yet another multidimensional data model, but to find a general, precise formalism encompassing all the proposals for a logical data model in the data warehouse field. Our proposal is compatible with all these proposals, therefore making possible a formal comparison of the differences between the models in the literature and the study of formal properties or extensions of such data models. Starting from a logic-based definition of the semantics of the GMD data model and of the basic algebraic operations over it, we show how the most important approaches in DW modelling can be captured by it. The star and snowflake schemas, Gray's cube, Agrawal's and Vassiliadis' models, MD, and other multidimensional conceptual data models can be captured uniformly by GMD. In this way it is possible to formally understand the real differences in expressivity of the various models, their limits, and their potential.
1  Introduction
In this short paper we introduce a novel data model for multidimensional information, GMD, generalising the MD data model first proposed in [2]. The aim of this work is not to propose yet another data model, but to find the most general formalism encompassing all the proposals for a logical data model in the data warehouse field, as for example summarised in [10]. Our proposal is compatible with all these proposals, therefore making possible a formal comparison of the different expressivities of the models in the literature. We believe that the GMD data model is already very useful since it provides a precise and, in our view, elegant and uniform way to model multidimensional information. It turns out that most of the proposals in the literature make many hidden assumptions which may harm the understanding of the advantages or disadvantages of the proposal itself. An embedding in our model would make all these assumptions explicit. Y. Kambayashi, M. Mohania, W. Wöß (Eds.): DaWaK 2003, LNCS 2737, pp. 55–65, 2003. © Springer-Verlag Berlin Heidelberg 2003
So far, we have considered, together with the classical basic star and snowflake ER-based models and multidimensional cubes, the logical data models introduced in [2, 5, 1, 6, 9, 11, 3, 7, 8]. A complete account of both the GMD data model (including an extended algebra) and of the various encodings can be found in [4]; in this paper we just give a brief introduction to the basic principles of the data model. GMD is completely defined using a logic-based approach. We start by introducing a data warehouse schema, which is nothing more than a set of fact definitions that restricts (i.e., constrains) the set of legal data warehouse states associated with the schema. By systematically defining how the various operators used in a fact definition constrain the legal data warehouse states, we give a formal logic-based account of the GMD data model.
2  The Syntax of the GMD Data Model
We introduce in this Section the notion of data warehouse schema. A data warehouse schema basically introduces the structures of the cubes that will populate the warehouse, together with the types allowed for the components of the structures. The definition of a GMD schema that follows is explained step by step. Definition 1 (GMD schema). Consider the signature < F, D, L, M, V, A >, where F is a finite set of fact names, D is a finite set of dimension names, L is a finite set of level names – each one associated with a finite set of level element names, M is a finite set of measure names, V is a finite set of domain names – each one associated with a finite set of values, and A is a finite set of level attributes. ➽ We have just defined the alphabet of a data warehouse: we may have fact names (like SALES, PURCHASES), dimension names (like Date, Product), level names (like year, month, product-brand, product-category) and their level elements (like 2003, 2004, heineken, drink), measure names (like Price, UnitSales), domain names (like integers, strings), and level attributes (like is-leap, country-of-origin). A GMD schema includes: – a finite set of fact definitions of the form F ≐ E {D1|L1, . . . , Dn|Ln} : {M1|V1, . . . , Mm|Vm}, where E, F ∈ F, Di ∈ D, Li ∈ L, Mj ∈ M, Vj ∈ V. We call the fact name F a defined fact, and we say that F is based on E. A fact name not appearing on the left-hand side of a definition is called an undefined fact. We will generally call fact either a defined fact or an undefined fact. A fact based on an undefined fact is called a basic fact. A fact based on a defined fact is called an aggregated fact. A fact is dimensionless if n = 0; it is measureless if m = 0. The orderings in a defined fact among dimensions and among measures are irrelevant.
➽ We have here introduced the building block of a GMD schema: the fact definition. A basic fact corresponds to the base data of any data warehouse: it is the cube structure that contains all the data on which any other cube will be built. In the following example, BASIC-SALES is a basic fact, including base data about sale transactions, organised by date, product, and store (the dimensions of the fact), which are respectively restricted to the levels day, product, and store, and with unit sales and sale price as measures:
BASIC-SALES ≐ SALES {Date|day, Product|product, Store|store} : {UnitSales|int, SalePrice|int}
– a partial order (L, ≤) on the levels in L. We call ≺ the immediate predecessor relation on L induced by ≤.
➽ The partial order defines the taxonomy of levels. For example, day ≺ month ≺ quarter and day ≺ week; product ≺ type ≺ category.
– a finite set of roll-up partial functions between level elements, ρLi,Lj : Li → Lj, for each Li, Lj such that Li ≺ Lj. We call ρ∗Li,Lj the reflexive transitive closure of the roll-up functions, inductively defined as follows: ρ∗Li,Li = id and ρ∗Li,Lj = ∪k ρLi,Lk ∘ ρ∗Lk,Lj for each k such that Li ≺ Lk, where (ρLp,Lq ∪ ρLr,Ls)(x) = y iff ρLp,Lq(x) = ρLr,Ls(x) = y, or ρLp,Lq(x) = y and ρLr,Ls(x) = ⊥, or ρLp,Lq(x) = ⊥ and ρLr,Ls(x) = y.
➽ When various levels are introduced for a dimension in a schema, it is also necessary to introduce roll-up functions for them. A roll-up function defines how elements of one level map to elements of a superior level. Since the roll-up functions are only required to be partial, it is possible for some elements of a level to roll up to an upper level while other elements skip that upper level and are mapped to a superior one. For example, ρday,month(1/1/01) = Jan-01, ρday,month(2/1/01) = Jan-01, . . . , ρquarter,year(Qtr1-01) = 2001, ρquarter,year(Qtr2-01) = 2001, . . .
– a finite set of level attribute definitions: L ≐ {A1|V1, . . . , An|Vn}, where L ∈ L, Ai ∈ A, and Vi ∈ V for each i, 1 ≤ i ≤ n.
➽ Level attributes are properties associated with levels. For example, product ≐ {prodname|string, prodnum|int, prodsize|int, prodweight|int}
for each i ≤ p
The GMD Data Model for Multidimensional Information
59
• each measure in the aggregated fact G is computed via an aggregation function from some measure of the defined fact F it is based on: . N1 = f1 (Mj(1) )
...
. Nq = fq (Mj(q) )
Moreover, the range and the domain of the aggregation function should agree with the domains specified in the aggregated fact G and in the fact F it is based on, respectively.
➽ Here we give a more precise characterisation of an aggregated fact: its dimensions should be among the dimensions of the fact it is based on, its levels should be generalisations of the corresponding ones in the fact it is based on, and its measures should all be computed from the fact it is based on. For example, given the basic fact
BASIC-SALES ≐ SALES {Date|day, Product|product, Store|store} : {UnitSales|int, SalePrice|int}
the following SALES-BY-MONTH-AND-TYPE is an aggregated fact computed from the BASIC-SALES fact:
SALES-BY-MONTH-AND-TYPE ≐ BASIC-SALES {Date|month, Product|type} : {Total-UnitSales|int, Avg-SalePrice|real}
with the following aggregated measures: Total-UnitSales ≐ sum(UnitSales) and Avg-SalePrice ≐ average(SalePrice).
2.1  Example
The following GMD schema summarises the examples shown in the previous Section: – Signature: • F = {SALES, BASIC-SALES, SALES-BY-MONTH-AND-TYPE, PURCHASES} • M = {UnitSales, Price, Total-UnitSales, Avg-Price} • D = {Date, Product, Store} • L = {day, week, month, quarter, year, product, type, category, brand, store, city, country } day = {1/1/01, 2/1/01, . . . , 1/1/02, 2/1/02, . . . } month = {Jan-01, Feb-01, . . . , Jan-02, Feb-02, . . . } quarter = {Qtr1-01, Qtr2-01, . . . , Qtr1-02, Qtr2-02, . . . } year = {2001, 2002} ···
• V = {int, real, string}
• A = {dayname, prodname, prodsize, prodweight, storenumb}
– Partial order over levels:
• day ≺ month ≺ quarter ≺ year, day ≺ week; day is a basic level
• product ≺ type ≺ category, product ≺ brand; product is a basic level
• store ≺ city ≺ country; store is a basic level
– Roll-up functions:
ρday,month(1/1/01) = Jan-01, ρday,month(2/1/01) = Jan-01, . . .
ρmonth,quarter(Jan-01) = Qtr1-01, ρmonth,quarter(Feb-01) = Qtr1-01, . . .
ρquarter,year(Qtr1-01) = 2001, ρquarter,year(Qtr2-01) = 2001, . . .
ρ∗day,year(1/1/01) = 2001, ρ∗day,year(2/1/01) = 2001, . . .
– Level attributes:
day ≐ {dayname|string, daynum|int}
product ≐ {prodname|string, prodnum|int, prodsize|int, prodweight|int}
store ≐ {storename|string, storenum|int, address|string}
– Facts:
BASIC-SALES ≐ SALES {Date|day, Product|product, Store|store} : {UnitSales|int, SalePrice|int}
SALES-BY-MONTH-AND-TYPE ≐ BASIC-SALES {Date|month, Product|type} : {Total-UnitSales|int, Avg-SalePrice|real}
– Measures:
Total-UnitSales ≐ sum(UnitSales)
Avg-SalePrice ≐ average(SalePrice)
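As an illustration only (a Python sketch of our own; the data-structure choices are assumptions and not part of the GMD proposal), the example schema and its roll-up closure could be encoded as follows:

```python
ROLLUP = {  # partial roll-up functions between adjacent levels (fragment)
    ("day", "month"):     {"1/1/01": "Jan-01", "2/1/01": "Jan-01"},
    ("month", "quarter"): {"Jan-01": "Qtr1-01", "Feb-01": "Qtr1-01"},
    ("quarter", "year"):  {"Qtr1-01": "2001", "Qtr2-01": "2001"},
}

def roll_up_star(element, lo, hi):
    """Reflexive-transitive closure rho*_{lo,hi} obtained by composing adjacent
    roll-ups; returns None where the (partial) roll-up is undefined."""
    if lo == hi:
        return element
    for (a, b), table in ROLLUP.items():
        if a == lo and element in table:
            result = roll_up_star(table[element], b, hi)
            if result is not None:
                return result
    return None

FACTS = {
    "BASIC-SALES": {
        "based_on": "SALES",
        "dims": {"Date": "day", "Product": "product", "Store": "store"},
        "measures": {"UnitSales": "int", "SalePrice": "int"},
    },
    "SALES-BY-MONTH-AND-TYPE": {
        "based_on": "BASIC-SALES",
        "dims": {"Date": "month", "Product": "type"},
        "measures": {"Total-UnitSales": "sum(UnitSales)",
                     "Avg-SalePrice": "average(SalePrice)"},
    },
}

print(roll_up_star("1/1/01", "day", "year"))   # -> '2001'
```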
3  GMD Semantics
Having just defined the syntax of GMD schemas, we now introduce their semantics through a well-founded model theory. We define the notion of a data warehouse state, namely a specific data warehouse, and we formalise when a data warehouse state is actually in agreement with the constraints imposed by a GMD schema. Definition 2 (Data Warehouse State). A data warehouse state over a schema with the signature < F, D, L, M, V, A > is a tuple I = < ∆, Λ, Γ, ·I >, where – ∆ is a non-empty finite set of individual facts (or cells) of cardinality smaller than Ω; ➽ Elements of ∆ are the object identifiers of the cells in a multidimensional cube; we call them individual facts. – Λ is a finite set of level elements; – Γ is a finite set of domain elements;
– ·I is a function (the interpretation function) such that:
FI ⊆ ∆ for each F ∈ F, where FI is disjoint from any other EI such that E ∈ F;
LI ⊆ Λ for each L ∈ L, where LI is disjoint from any other HI such that H ∈ L;
VI ⊆ Γ for each V ∈ V, where VI is disjoint from any other WI such that W ∈ V;
DI : ∆ → Λ (a partial function) for each D ∈ D;
MI : ∆ → Γ (a partial function) for each M ∈ M;
AL(Ai)I : LI → Γ for each L ∈ L and Ai ∈ A for some i.
(Note: in the paper we will omit the ·I interpretation function applied to a symbol whenever this is unambiguous.)
➽ The interpretation function defines a specific data warehouse state given a GMD signature, regardless of any fact definition. It associates to a fact name a set of cells (individual facts), which are meant to form a cube. To each cell corresponds a level element for some dimension name: the sequence of these level elements is meant to be the "coordinate" of the cell. Moreover, to each cell corresponds a value for some measure name. Since fact definitions in the schema are not considered yet at this stage, the dimensions and the measures associated to cells are still arbitrary. In the following, we introduce the notion of legal data warehouse state, which is a data warehouse state that conforms to the constraints imposed by the fact definitions. A data warehouse state will be called legal for a given GMD schema if it is a data warehouse state in the signature of the GMD schema and it satisfies the additional conditions found in the GMD schema. A data warehouse state is legal with respect to a GMD schema if:
– for each fact F ≐ E {D1|L1, . . . , Dn|Ln} : {M1|V1, . . . , Mm|Vm} in the schema:
• the function associated to a dimension which does not appear in the fact is undefined for its cells: ∀f. F(f) → f ∉ dom(D) for each D ∈ D such that D ≠ Di for each i ≤ n.
➽ This condition states that the level elements associated to a cell of a fact should correspond only to the dimensions declared in the fact definition of the schema. That is, a cell has only the declared dimensions in any legal data warehouse state.
• each cell of a fact has a unique set of dimension values at the appropriate level: ∀f. F(f) → ∃l1, . . . , ln. D1(f) = l1 ∧ L1(l1) ∧ . . . ∧ Dn(f) = ln ∧ Ln(ln).
➽ This condition states that the level elements associated to a cell of a fact are unique for each dimension declared for the fact in the schema. So a cell has a unique value for each declared dimension in any legal data warehouse state.
• a set of dimension values identifies a unique cell within a fact: ∀f, f′, l1, . . . , ln. F(f) ∧ F(f′) ∧ D1(f) = l1 ∧ D1(f′) = l1 ∧ . . . ∧ Dn(f) = ln ∧ Dn(f′) = ln → f = f′.
➽ This condition states that the sequence of level elements associated to a cell of a fact is associated only to that cell. Therefore, the sequence of dimension values can really be seen as an identifying coordinate for the cell. In other words, these conditions enforce that a legal data warehouse state really models a cube according to the specification given in the schema.
• the function associated to a measure which does not appear in the fact is undefined for its cells: ∀f. F(f) → f ∉ dom(M) for each M ∈ M such that M ≠ Mi for each i ≤ m.
➽ This condition states that the measure values associated to a cell of a fact in a legal data warehouse state should correspond only to the measures explicitly declared in the fact definition of the schema.
• each cell of a fact has a unique set of measures: ∀f. F(f) → ∃m1, . . . , mm. M1(f) = m1 ∧ V1(m1) ∧ . . . ∧ Mm(f) = mm ∧ Vm(mm).
➽ This condition states that the measure values associated to a cell of a fact are unique for each measure explicitly declared for the fact in the schema. So a cell has a unique measure value for each declared measure in any legal data warehouse state.
– for each aggregated fact and for the defined fact it is based on in the schema,
F ≐ E {D1|L1, . . . , Dn|Ln} : {M1|V1, . . . , Mm|Vm}
G ≐ F {D1|R1, . . . , Dp|Rp} : {N1|W1, . . . , Nq|Wq}
N1 ≐ f1(Mj(1)) . . . Nq ≐ fq(Mj(q))
each aggregated measure function should actually compute the aggregation of the values of the corresponding measure of the fact the aggregation is based on: ∀g, v. Ni(g) = v ↔ ∃r1, . . . , rp. G(g) ∧ D1(g) = r1 ∧ . . . ∧ Dp(g) = rp ∧ v = fi({| Mj(i)(f) | ∃l1, . . . , lp. F(f) ∧ D1(f) = l1 ∧ . . . ∧ Dp(f) = lp ∧ ρ∗L1,R1(l1) = r1 ∧ . . . ∧ ρ∗Lp,Rp(lp) = rp |}) for each i ≤ q, where {| · |} denotes a bag.
➽ This condition guarantees that if a fact is the aggregation of another fact, then in a legal data warehouse state the measures associated to the cells of the aggregated cube are actually computed by applying the aggregation function to the measures of the corresponding cells in the original cube. The correspondence between a cell in the aggregated cube and a set of cells in the original cube is found by looking at how their coordinates – which are level elements – are mapped through the roll-up functions dimension by dimension.
According to the definition, a legal data warehouse state for a GMD schema is a collection of multidimensional cubes whose cells carry measure values. Each cube conforms to the fact definition given in the GMD schema, i.e., the coordinates are in agreement with the dimensions and levels specified, and the measures are of the correct type. If a cube is the aggregation of another cube, a legal data warehouse state enforces that the measures of the aggregated cube are correctly computed from the measures of the original cube.
3.1  Example
A possible legal data warehouse state for (part of) the previous example GMD schema is shown in the following.
BASIC-SALESI = {s1, s2, s3, s4, s5, s6, s7}
SALES-BY-MONTH-AND-TYPEI = {g1, g2, g3, g4, g5, g6}

Cells of BASIC-SALES (Date, Product, Store, UnitSales, EuroSalePrice):
s1: 1/1/01, Organic-milk-1l, Fair-trade-central, 100, 71,00
s2: 7/1/01, Organic-yogh-125g, Fair-trade-central, 500, 250,00
s3: 7/1/01, Organic-milk-1l, Ali-grocery, 230, 138,00
s4: 10/2/01, Organic-milk-1l, Barbacan-store, 300, 210,00
s5: 28/2/01, Organic-beer-6pack, Fair-trade-central, 210, 420,00
s6: 2/3/01, Organic-milk-1l, Fair-trade-central, 150, 105,00
s7: 12/3/01, Organic-beer-6pack, Ali-grocery, 100, 200,00

Cells of SALES-BY-MONTH-AND-TYPE (Date, Product, Total-UnitSales, Avg-EuroSalePrice):
g1: Jan-01, Dairy, 830, 153,00
g2: Feb-01, Dairy, 300, 210,00
g3: Jan-01, Drink, 0, 0,00
g4: Feb-01, Drink, 210, 420,00
g5: Mar-01, Dairy, 150, 105,00
g6: Mar-01, Drink, 100, 200,00

Level attribute values include, for example, daynum(day) = 1, prodweight(product) = 100gm, and storenum(store) = S101.
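To see that this state is legal, the aggregated cells can be recomputed from the base cells through the roll-ups day → month and product → type; the following Python sketch (ours, for illustration) reproduces the non-empty groups — an empty combination such as (Jan-01, Drink) simply does not occur in the grouping, whereas the state above lists it with value 0:

```python
from collections import defaultdict

# Base cells of BASIC-SALES: (date, product, store, unit_sales, euro_sale_price).
basic_sales = [
    ("1/1/01",  "Organic-milk-1l",    "Fair-trade-central", 100,  71.00),
    ("7/1/01",  "Organic-yogh-125g",  "Fair-trade-central", 500, 250.00),
    ("7/1/01",  "Organic-milk-1l",    "Ali-grocery",        230, 138.00),
    ("10/2/01", "Organic-milk-1l",    "Barbacan-store",     300, 210.00),
    ("28/2/01", "Organic-beer-6pack", "Fair-trade-central", 210, 420.00),
    ("2/3/01",  "Organic-milk-1l",    "Fair-trade-central", 150, 105.00),
    ("12/3/01", "Organic-beer-6pack", "Ali-grocery",        100, 200.00),
]
to_month = {"1/1/01": "Jan-01", "7/1/01": "Jan-01", "10/2/01": "Feb-01",
            "28/2/01": "Feb-01", "2/3/01": "Mar-01", "12/3/01": "Mar-01"}
to_type = {"Organic-milk-1l": "Dairy", "Organic-yogh-125g": "Dairy",
           "Organic-beer-6pack": "Drink"}

groups = defaultdict(list)
for date, product, _store, units, price in basic_sales:
    groups[(to_month[date], to_type[product])].append((units, price))

for (month, ptype), rows in sorted(groups.items()):
    total = sum(u for u, _ in rows)
    avg_price = sum(p for _, p in rows) / len(rows)
    print(month, ptype, total, round(avg_price, 2))
# (Jan-01, Dairy) -> 830 and 153.0, matching g1; (Feb-01, Drink) -> 210 and 420.0, matching g4; etc.
```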
4  GMD Extensions
For lack of space, in this brief report it is impossible to introduce the full GMD framework [4], which includes a full algebra in addition to the basic aggregation operation introduced in this paper. We will just mention the main extensions with respect to what has been presented here, and the main results. The full GMD schema language includes also the possibility to define aggregated measures with respect to the application of a function to a set of original
measures, much as in SQL. For example, it is possible to have an aggregated cube with a measure total-profit that is the sum of the differences between the cost and the price in the original cube; the difference is applied cell by cell in the original cube (generating a virtual profit measure), and then the aggregation computes the sum of all the profits. Two selection operators are also part of the full GMD language. The slice operation simply selects the cells of a cube corresponding to a specific value of a dimension, resulting in a cube which contains a subset of the cells of the original one and has one dimension less. The multislice allows the selection of ranges of values for a dimension, so that the resulting cube contains a subset of the cells of the original one but retains the selected dimension. A fact-join operation is defined only between cubes sharing the same dimensions and the same levels; we argue that a more general join operation is meaningless in a cube algebra, since it may lead to cubes whose measures are no longer understandable. For similar reasons we do not allow a general union operator (like the one proposed in [6]). As mentioned in the introduction, one main result is the full encoding of many data warehouse logical data models as GMD schemas. In this way we are able to give a homogeneous semantics (in terms of legal data warehouse states) to the logical models and the algebras proposed in all these different approaches, to clarify ambiguous parts, and to argue about the utility of some of the operators presented in the literature. The other main result is the proposal of a novel conceptual data model for multidimensional information that extends and clarifies the one presented in [3].
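To illustrate just the two selection operators, here is a Python sketch under our own cube representation (cells keyed by tuples of (dimension, level-element) pairs); the representation is an assumption of ours, not GMD's formal definition:

```python
def slice_cube(cells, dim, value):
    """Select cells with the fixed `value` for `dim`, then drop that dimension,
    so the resulting cube has one dimension less."""
    return {tuple((d, v) for d, v in coord if d != dim): m
            for coord, m in cells.items() if (dim, value) in coord}

def multislice(cells, dim, allowed):
    """Select a range of values for `dim`; the selected dimension is retained."""
    return {coord: m for coord, m in cells.items()
            if any(d == dim and v in allowed for d, v in coord)}
```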
References [1] R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. In Proc. of ICDE-97, 1997. 56 [2] Luca Cabibbo and Riccardo Torlone. A logical approach to multidimensional databases. In Proc. of EDBT-98, 1998. 55, 56 [3] E. Franconi and U. Sattler. A data warehouse conceptual data model for multidimensional aggregation. In Proc. of the Workshop on Design and Management of Data Warehouses (DMDW-99), 1999. 56, 64 [4] Enrico Franconi and Anand S. Kamble. The GMD data model for multidimensional information. Technical report, Free University of Bozen-Bolzano, Italy, 2003. Forthcoming. 56, 63 [5] M. Golfarelli, D. Maio, and S. Rizzi. The dimensional fact model: a conceptual model for data warehouses. IJCIS, 7(2-3):215–247, 1998. 56 [6] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: a relational aggregation operator generalizing group-by, cross-tabs and subtotals. In Proc. of ICDE-96, 1996. 56, 64 [7] M. Gyssens and L. V. S. Lakshmanan. A foundation for multi-dimensional databases. In Proc. of VLDB-97, pages 106–115, 1997. 56 [8] A. Tsois, N. Karayiannidis, and T. Sellis. MAC: Conceptual data modelling for OLAP. In Proc. of the International Workshop on Design and Management of Warehouses (DMDW-2001), pages 5–1–5–13, 2001. 56
[9] P. Vassiliadis. Modeling multidimensional databases, cubes and cube operations. In Proc. of the 10th SSDBM Conference, Capri, Italy, July 1998. 56 [10] P. Vassiliadis and T. Sellis. A survey of logical models for OLAP databases. In SIGMOD Record, volume 28, pages 64–69, December 1999. 55 [11] P. Vassiliadis and S. Skiadopoulos. Modelling and optimisation issues for multidimensional databases. In Proc. of CAiSE-2000, pages 482–497, 2000. 56
An Application of Case-Based Reasoning in Multidimensional Database Architecture*
Dragan Simić¹, Vladimir Kurbalija², Zoran Budimac²
¹ Novi Sad Fair, Hajduk Veljkova 11, 21000 Novi Sad, Yugoslavia [email protected]
² Department of Mathematics and Informatics, Fac. of Science, Univ. of Novi Sad, Trg D. Obradovića 4, 21000 Novi Sad, Yugoslavia [email protected], [email protected]
ABSTRACT. A concept of a decision support system is considered in this paper. It provides the data needed for fast, precise and good business decision making to all levels of management. The aim of the project is the development of a new online analytical processing approach oriented towards case-based reasoning (CBR), where previous experience is taken into account for every new problem. The methodological aspects have been tested in practice as part of the management information system development project of "Novi Sad Fair". A case study of an application of CBR to the prediction of future payments is discussed in the paper.
1  Introduction
In recent years, there has been an explosive growth in the use of databases for decision support systems. This phenomenon is a result of the increased availability of new technologies to support efficient storage and retrieval of large volumes of data: data warehouse and online analytical processing (OLAP) products. A data warehouse can be defined as an online repository of historical enterprise data that is used to support decision-making. OLAP refers to technologies that allow users to efficiently retrieve data from the data warehouse. In order to help an analyst focus on important data and make better decisions, case-based reasoning (CBR, an artificial intelligence technology) is introduced for making predictions based on previous cases. CBR automatically generates an answer to the problem using stored experience, thus freeing the human expert of the obligation to analyse numerical or graphical data. The use of CBR in predicting the rhythm of issuing invoices and receiving actual payments, based on the experience stored in the data warehouse, is presented in this paper. Predictions obtained in this manner are important for the future planning of a company such as the "Novi Sad Fair", because the achievement of sales plans, revenue, and company liquidity are measures of success in business.
* Research was partially supported by the Ministry of Science, Technologies and Development of the Republic of Serbia, project no. 1844: "Development of (intelligent) techniques based on software agents for application in information retrieval and workflow".
Y. Kambayashi, M. Mohania, W. Wöß (Eds.): DaWaK 2003, LNCS 2737, pp. 66-75, 2003. © Springer-Verlag Berlin Heidelberg 2003
Performed simulations show that the predictions made by CBR differ by only 8% from what actually happened. With the inclusion of more historical data in the warehouse, the system becomes better at prediction. Furthermore, the system uses not only the data warehouse but also previous cases and previous predictions in future predictions, thus learning during operation. The combination of CBR and data warehousing, i.e., making OLAP intelligent by the use of CBR, is a rarely used approach, if used at all. The system also uses a novel CBR technique to compare graphical representations of data, which greatly simplifies the explanation of the prediction process to the end-user [3]. The rest of the paper is organized as follows. The following section elaborates on the motivations and reasons for the inclusion of CBR in the decision support system; it also introduces the case study with which we shall describe the usage of our system. Section three overviews the case-based reasoning technique, while section four describes the original algorithm for searching the previous cases (curves) for the most similar one. The fifth section describes the actual application of our technique to the given problem. Section six presents the related work, while the seventh section concludes the paper.
2  User requirements for decision support system
"Novi Sad Fair" is a complex organization, considering that it is engaged in a multitude of activities. The basic Fair activity is organizing fair exhibitions, although it carries out particular activities throughout the year. Ten times a year, 27 fair exhibitions are organized in which nearly 4000 exhibitors take part, both from the country and from abroad. Although a 'classical' decision support system based on a data warehouse and OLAP was designed, the requirements of the company management clearly showed that it would not be enough for good decision making. The decision to include artificial intelligence methods in general, and CBR in particular, in the whole system was driven by the results of a survey. The survey was made on a sample of 42 individuals (users of the current management information system) divided into three groups: strategic-tactical management (9 people), operational managers (15 people), and transactional users (18 people). After a statistical evaluation of the survey [5], the following conclusions (among others) were drawn:
– Development of the decision support system should be focused on problems closely related to financial estimates and the tracking of financial market trends spanning several years.
– The key influences on business (management) are the political and economic environment of the country and region, which makes it necessary to represent those influences explicitly in the observed model (problem) and to take them into account when estimating future events.
– The behaviour of the observed case does not depend on its pre-history but only on its initial state.
Implementing this non-exact mathematical model is a very complex problem. As an example, let us look at the problem pointed out to us by company managers. During any fair exhibition the actual income amounts to only 30% to 50% of the total invoiced value. Therefore, managers want to know how high the payments for some fair services will be at some future time, with respect to the invoiced amounts. If they could predict reliably enough what will happen in the future, they could undertake important business activities to ensure the faster arrival of invoiced payments and plan future activities and exhibitions better. The classical methods cannot explain the influences on business and management well enough. There are political and economic circumstances of the country and region that cannot be successfully explained and exploited with classical methods: the war in Iraq, oil shortages, political assassinations, terrorism, the spiralling growth of the mobile telecommunication industry, and general human occupation and motivation. And this is even more true in an enterprise such as the Fair, whose success depends on many external factors. One possible approach to dealing with external influences is to observe the case histories of similar problems (cases) over a longer period of time and to make estimates according to those observations. This approach, generally speaking, represents intelligent search applied to solving new problems by adapting solutions that worked for similar problems in the past: case-based reasoning.
3  Case based reasoning
Case-based reasoning is a relatively new and promising area of artificial intelligence and is also considered a problem-solving technology (or technique). This technology is used for solving problems in domains where experience plays an important role [2]. Generally speaking, case-based reasoning is applied to solving new problems by adapting solutions that worked for similar problems in the past. The main supposition here is that similar problems have similar solutions. The basic scenario for almost all CBR applications looks as follows: in order to find a solution to an actual problem, one looks for a similar problem in an experience base, takes the solution from the past, and uses it as a starting point for finding a solution to the actual problem. In CBR systems experience is stored in the form of cases. A case is a recorded situation in which a problem was totally or partially solved, and it can be represented as an ordered pair (problem, solution). The whole experience is stored in a case base, which is a set of cases, each representing some previous episode where the problem was successfully solved. The main problem in CBR is to find a good similarity measure – the measure that tells to what extent two problems are similar. Functionally, similarity can be defined as a function sim : U × CB → [0, 1], where U refers to the universe of all objects (from a given domain), while CB refers to the case base (just those objects
which were examined in the past and saved in the case memory). The higher the value of the similarity function, the more similar the objects are [1]. A case-based reasoning system does not have the sole goal of providing solutions to problems; it also takes care of other tasks that occur when it is used in practice. The main phases of case-based reasoning activity are described by the CBR cycle (Fig. 1) [1].
Fig. 1. The CBR-Cycle after Aamodt and Plaza (1994)
In the retrieve phase, the most similar case (or the k most similar cases) to the problem case is retrieved from the case memory, while in the reuse phase some modifications are made to the retrieved case in order to provide a better solution to the problem (case adaptation). As case-based reasoning only suggests solutions, there may be a need for a correctness proof or an external validation; that is the task of the revise phase. In the retain phase, the knowledge learned from this problem is integrated into the system by modifying some knowledge containers. The main advantage of this technology is that it can be applied to almost any domain. A CBR system does not try to find rules between the parameters of the problem; it just tries to find similar problems (from the past) and to use the solutions of the similar problems as a solution to the actual problem. This approach is therefore extremely suitable for poorly understood domains – domains where the rules and connections between parameters are not known. The second very important advantage is that the CBR approach to learning and problem solving is very similar to human cognitive processes – people take into account and use past experiences to make future decisions.
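The retrieve and reuse phases can be sketched generically as follows (an illustrative Python fragment of ours; the similarity function sim and the adaptation function adapt are assumed to be supplied by the application):

```python
def retrieve(case_base, problem, sim, k=1):
    """Retrieve: rank stored (problem, solution) cases by similarity to the new problem."""
    return sorted(case_base, key=lambda case: sim(problem, case[0]), reverse=True)[:k]

def reuse(retrieved_cases, adapt):
    """Reuse: take the best past solution as a starting point and adapt it."""
    best_problem, best_solution = retrieved_cases[0]
    return adapt(best_solution, best_problem)
```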
4  CBR for predicting curves behaviour
The CBR system uses graphics to present both the problem and the cases [3]. The reason is that in many practical domains decisions depend on the behaviour of time diagrams, charts, and curves. The system therefore analyses curves, compares them to similar curves from the past, and predicts the future behaviour of the current curve on the basis of the most similar curves from the past. The main problem here, as in almost every CBR system, was to create a good similarity measure for curves, i.e., a function that can tell to what extent two curves are similar. In many practical domains data are represented as a set of points, where a point is an ordered pair (x, y). Very often the pairs are (t, v), where t represents time and v represents some value at time t. When the data are given in this way (as a set of points), they can be represented graphically. When the points are connected, they form some kind of curve. If the points are connected only with straight lines, this is linear interpolation; if smoother curves are wanted, some other kind of interpolation with polynomials must be used. There was a choice between a classical interpolating polynomial and a cubic spline. The cubic spline was chosen for two main reasons:
– Power: for n+1 points the classical interpolating polynomial has degree n, while a cubic spline is always piecewise of degree 3 (order 4).
– Oscillation: if only one point is moved (which can be the result of a bad experiment or measurement), the classical interpolating polynomial changes significantly (oscillates), while the cubic spline only changes locally (which is more appropriate for real-world domains).
Fig. 2. Surface between two curves
When the cubic splines have been calculated for the curves, one very intuitive and simple similarity (or distance, which is the dual notion of similarity¹) measure can be used.
¹ When the distance d is known, the similarity sim can easily be computed using, for example, the function sim = 1/(1+d).
The distance between two curves can be represented as the area of the surface between these curves, as seen in fig. 2. This surface can easily be calculated using the definite integral; furthermore, calculating the definite integral of polynomials is a very simple and efficient operation.
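As a hedged illustration of this distance (not the authors' implementation), two curves given as (t, v) points can each be interpolated with a cubic spline and the absolute difference integrated over their common time range; the similarity then follows from sim = 1/(1+d) as above. SciPy's CubicSpline is assumed to be available, and the integral is approximated numerically rather than computed symbolically on the spline polynomials.

import numpy as np
from scipy.interpolate import CubicSpline

def curve_distance(points_a, points_b, samples=2000):
    """Distance between two curves: the area enclosed between their
    cubic-spline interpolants over the common time range (cf. fig. 2)."""
    ta, va = zip(*sorted(points_a))
    tb, vb = zip(*sorted(points_b))
    spline_a, spline_b = CubicSpline(ta, va), CubicSpline(tb, vb)
    lo, hi = max(ta[0], tb[0]), min(ta[-1], tb[-1])
    t = np.linspace(lo, hi, samples)
    return np.trapz(np.abs(spline_a(t) - spline_b(t)), t)

def similarity(distance):
    """sim = 1 / (1 + d)."""
    return 1.0 / (1.0 + distance)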
5 Application of the system
The data warehouse of “Novi Sad Fair” contains data about the payment and invoicing processes of every exhibition over the last 3 years – between 25 and 30 exhibitions every year. Processes are represented as sets of points, where every point is given by the time of the measurement (day from the beginning of the process) and the value of payment or invoicing on that day. These processes can therefore be represented as curves. Note that the case base consists of the cases of all exhibitions and that this case base is used when solving concrete problems for concrete exhibitions. The reason is that environmental and external factors influence the business processes of the fair to a high extent. The payment and invoicing values were measured every 4 days from the beginning of the invoice process over a duration of 400 days, so every curve consists of approximately 100 points.

The analysis of these curves shows that the process of invoicing usually starts several months before the exhibition and that the value of invoicing grows rapidly up to approximately the beginning of the exhibition. After that time the value of invoicing remains approximately the same till the end of the process. The moment when the value of invoicing reaches some constant value and stays the same to the end is called the time of saturation for the invoicing process, and the corresponding value – the value of saturation.

The process of payment starts several days after the corresponding process of invoicing (the payment and invoicing processes of the same exhibition). After that the value of payment grows, but not as rapidly as the value of invoicing. At the moment of the exhibition the value of payment is between 30% and 50% of the value of invoicing. The value of payment then continues to grow until it reaches a constant value and stays approximately constant till the end of the process. That moment is called the time of saturation for the payment process, and the corresponding value – the value of saturation. The payment time of saturation is usually a couple of months after the invoice time of saturation, and the payment value of saturation is always less than or equal to the invoice value of saturation. The analysis shows that the payment value of saturation is between 80% and 100% of the invoice value of saturation. The maximum of the invoice curve represents the total of the services invoiced, and that amount is to be paid; likewise, the maximum of the payment curve represents the amount paid by regular means. The rest will be paid later by court order or other special business agreements or, perhaps, will not be paid at all (debtor bankruptcy).
Fig. 3. The curves from the data mart, as the "Old payment curve" and the "Old invoice curve"
One characteristic invoice curve and the corresponding payment curve from the ”curve base” are shown in fig. 3 as the "Old payment curve" and the "Old invoice curve". The saturation points (time and value) are represented by the emphasised points on the curves. At the beginning the system reads the input data from two data marts: one data mart contains the information about all invoice processes for every exhibition in the past 3 years, while the other contains the information about the corresponding payment processes. The system then creates splines for every curve (invoice and payment) and internally stores the curves in a list of pairs containing the invoice curve and the corresponding payment curve. In the same way the system reads the problem curves from a third data mart. The problem consists of an invoice curve and the corresponding payment curve at the moment of the exhibition. At that moment, the invoice curve has reached its saturation point, while the payment curve is still far away from its saturation point. These curves are shown as the "Actual payment curve" and the "Actual invoice curve" (fig. 4). The solution of this problem is the saturation point of the payment curve. This means that the system helps experts by suggesting and predicting the level of future payments. At the end of the total invoicing for a selected fair exhibition, the operational exhibition manager can get a prediction from the CBR system of a) the time period when the payment of a debt will be made and b) the amount that will be paid regularly.
Fig. 4. Problem payment and invoice curves, as the "Actual payment curve" and the "Actual invoice curve", and the prediction of future payments

The time point and the amount of the payment of the debt are marked on the graphic by a big red dot (fig. 4). When used with subsets of already known values, the CBR system produced predictions that differed by around 10% in time and 2% in value from what actually happened.

5.1 Calculation of saturation points and system learning

The saturation point for one prediction is calculated by using the 10% most similar payment curves from the database of previous payment processes. The similarity is calculated by using the previously described algorithm. Since the values of saturation are different for each exhibition, every curve from the database must be scaled by a particular factor so that the invoice values of saturation of the old curve and the actual curve are the same. That factor is easily calculated as:

Factor = actual_value_of_saturation / old_value_of_saturation

where the actual value of saturation is in fact the value of the invoice at the time of the exhibition. The final solution is then calculated by using the payment saturation points of the 10% most similar payment curves. The saturation points of the similar curves are multiplied by the appropriate goodness values and then summed. The values of goodness are directly proportional to the similarity between the old and actual curves, but the sum of all goodnesses must be 1. Since the system calculates the distance, the similarity is calculated as:
sim = 1 / (1 + dist)

The goodness for every old payment curve is calculated as:

goodness_i = sim_i / Σ_{all j} sim_j

At the end, the final solution – the payment saturation point – is calculated as:

sat_point = Σ_{all i} goodness_i · sat_point_i
The system draws the solution point on the diagram together with the saturation time and value. The system also supports revising and retaining solutions (fig. 1). By memorizing a) the problem, b) the suggested solution, c) the number of similar curves used for obtaining the suggestion and d) the real solution (obtained later), the system can use this information in the phase of reusing the solution for future problems. The system will then use not only the 10% most similar curves but will also inspect the previous decisions in order to find a 'better' number of similar curves that would lead to a better prediction.
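The revise/retain bookkeeping described above can be sketched as follows; the dictionary keys and the error measure (absolute error on the saturation value) are assumptions made for the sake of the example.

def retain(history, problem, predicted, k_used, real_saturation):
    """Memorize the problem, the suggested solution, the number of similar
    curves used, and the real solution observed later (retain phase, fig. 1)."""
    history.append({"problem": problem, "predicted": predicted,
                    "k": k_used, "real": real_saturation})

def best_k(history, default_k):
    """Inspect previous decisions and pick the number of similar curves that
    gave the smallest average error on the predicted saturation value."""
    if not history:
        return default_k
    errors = {}
    for record in history:
        err = abs(record["predicted"][1] - record["real"][1])  # error on the value
        errors.setdefault(record["k"], []).append(err)
    return min(errors, key=lambda k: sum(errors[k]) / len(errors[k]))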
6 Related work
The system presented in this paper represents a useful coexistence of a data warehouse and case-based reasoning, resulting in a decision support system. The data warehouse (part of the described system) has been in operation in “Novi Sad Fair” since 2001 and is described in more detail in [5][6][7]. The part of the system that uses CBR for comparing curves was developed during the stay of the second author at Humboldt University in Berlin and is described in more detail in [3]. Although CBR is successfully used in many areas (aircraft conflict resolution in air traffic control, optimizing rail transport, subway maintenance, optimal job search, support for help-desks, intelligent search on the internet) [4], it is not often used in combination with a data warehouse and in collaboration with classical OLAP, probably due to the novelty of this technique. CBR does not require a causal model or a deep understanding of a domain, and therefore it can be used in domains that are poorly defined, where information is incomplete or contradictory, or where it is difficult to obtain sufficient domain knowledge. All this is typical for business processing. Besides CBR, other possibilities are rule-based knowledge or knowledge discovery in databases, where knowledge evaluation is based on rules [1]. The rules are usually generated by combining propositions. As the complexity of the knowledge base increases, maintenance becomes problematic, because changing rules often implies a lot of reorganization of the rule-base system. On the other hand, it is easier to add or delete a case in a CBR system, which finally provides advantages in terms of learning and explicability. Applying CBR to curves and using it in decision making is also a novel approach. According to the authors' findings, using CBR to look for similarities in curves and predict future trends is by far superior to the other currently used techniques.
7 Conclusion
The paper presented a decision support system that uses CBR as an OLAP tool on top of the data warehouse. The paper described the CBR part of the system in greater detail, giving a thorough explanation of one case study. There are numerous advantages of this system. For instance, based on CBR predictions, operational managers can undertake important business activities in order to: a) make payment delays shorter, b) make the total payment amount bigger, c) secure payment guarantees on time, d) reduce the risk of payment cancellation and e) inform senior managers on time. By combining the graphical representation of the predicted values with the most similar curves from the past, the system enables a better and more focussed understanding of the predictions with respect to real data from the past. Senior managers can use these predictions to better plan possible investments and new exhibitions, based on the amount of funds and the time of their availability, as predicted by the CBR system. The presented system is not limited to this case study; it can be applied to other business values as well (expenses, investments, profit) and it guarantees the same level of success.
Acknowledgement. The CBR system that uses a graphical representation of problems and cases [3] was implemented by V. Kurbalija at Humboldt University, Berlin (AI Lab), under the leadership of Hans-Dieter Burkhard and with the sponsorship of the DAAD (German academic exchange service). The authors of this paper are grateful to Prof. Burkhard and his team for their unselfish support, without which none of this would have been possible.
References

1. Aamodt, A., Plaza, E.: Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. AI Communications, pp. 39-58, 1994.
2. Budimac, Z., Kurbalija, V.: Case-Based Reasoning – A Short Overview. Conference of Informatics and IT, Bitola, 2001.
3. Kurbalija, V.: On Similarity of Curves – Project Report. Humboldt University, AI Lab, Berlin, 2003.
4. Lenz, M., Bartsch-Spörl, B., Burkhard, H.-D., Wess, S. (eds.): Case-Based Reasoning Technology: From Foundations to Applications. Springer-Verlag, October 1998.
5. Simic, D.: Financial Prediction and Decision Support System Based on Artificial Intelligence Technology. Ph.D. thesis (draft manuscript), Novi Sad, 2003.
6. Simic, D.: Reengineering Management Information Systems – A Contemporary Information Technologies Perspective. Master thesis, Novi Sad, 2001.
7. Simic, D.: Data Warehouse and Strategic Management. Strategic Management and Decision Support Systems, Palic, 1999.
MetaCube XTM: A Multidimensional Metadata Approach for Semantic Web Warehousing Systems

Thanh Binh Nguyen1, A Min Tjoa1, and Oscar Mangisengi2

1 Institute of Software Technology (E188), Vienna University of Technology, Favoritenstr. 9-11/188, A-1040 Vienna, Austria
{binh,tjoa}@ifs.tuwien.ac.at
2 Software Competence Center Hagenberg, Hauptstrasse 99, A-4232 Hagenberg, Austria
[email protected]
Abstract. Providing access to and search among multiple, heterogeneous, distributed and autonomous data warehouses has become one of the main issues in current research. In this paper, we propose to integrate data warehouse schema information by using metadata represented in XTM (XML Topic Maps) in order to bridge possible semantic heterogeneity. A detailed description of an architecture that enables the efficient processing of user queries involving data from heterogeneous data warehouses is presented. As a result, interoperability is accomplished by a schema integration approach based on XTM. Furthermore, important implementation aspects of the MetaCube-XTM prototype, which makes use of the Meta Data Interchange Specification (MDIS) and the Open Information Model, complete the presentation of our approach.
1 Introduction
The advent of the World Wide Web (WWW) in the mid-1990s has resulted in an even greater demand for effectively managing data, information, and knowledge. Web sources constitute very large information resources that are distributed across different locations, sites, and systems. According to [15], Web warehousing is a novel and very active research area which combines two rapidly developing technologies, i.e. data warehousing and Web technology, as depicted in figure 1. However, the emerging challenge of Web warehousing is how to manage Web OLAP data warehouse sites in a unified way and to provide unified access among different Web OLAP resources [15]. Therefore, a multidimensional metadata standard or framework is necessary to enable data warehousing interoperability. As a result, we address the following issues:
Fig. 1. The hybrid of Web warehousing systems (data warehousing contributes the data management and warehousing approach; the Web contributes Web technology and text and multimedia management)
Multidimensional Metadata Standard. In the database community there exist several research efforts towards formal multidimensional data models and their corresponding query languages [1,4,6,9,12,13,19]. However, each approach presents its own view of multidimensional analysis requirements, terminology and formalism. As a result, none of the models is capable of encompassing the others.

Data Warehousing Interoperability. The relevance of interoperability for future data warehouse architectures is described in detail in [5]. Interoperability not only has to resolve the differences in data structures; it also has to deal with semantic heterogeneity. In this context, the MetaCube concept was proposed in [19] as a multidimensional metadata framework for cross-domain data warehousing systems. In a further development, the MetaCube concept was extended to MetaCube-X by using XML [20], to support interoperability for web warehousing applications. MetaCube-X is an XML (extensible markup language) instance of MetaCube, and provides a “neutral” syntax for interoperability among different web warehousing systems. In the described framework, we define a global MetaCube-X stored at the server site and local MetaCube-X(s), each of which is stored in a local Web warehouse. The issues to be handled by the global MetaCube-X are mainly the semantic heterogeneities of the local MetaCube-X(s), while the capability of accessing data at any level of complexity should still be provided by the local Web data warehouses. In this paper we extend the concept of MetaCube-X using Topic Maps (TMs) [23] (MetaCube-XTM). Research shows that topic maps can provide a sound basis for the Semantic Web. In addition, Topic Maps also build a bridge between the domains of knowledge representation and information management. The MetaCube-XTM system provides a unified view for users that addresses the semantic heterogeneities. On the other hand, it also supports data access of any level of complexity on the local data warehouses using local MetaCube-XTMs.
Prototyping. Both the MetaCube-XTM concept and the web technologies are now sufficiently mature to move from a proof of concept towards a semi-operational prototype – the MetaCube-XTM prototype.

The remainder of this paper is organized as follows. Section 2 presents related work. In Section 3 we summarize the concepts of MetaCube [19] and introduce the MetaCube-XTM protocol. Section 4 shows the implementation of the MetaCube-XTM prototype. The conclusion and future works appear in Section 5.
2 Related Works
The DARPA Agent Markup Language (DAML) [21], developed by DARPA, aims at developing a language and tools to facilitate the concept of the Semantic Web [22]: the idea of having data on the Web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications. The DAML language is being developed as an extension of XML and the Resource Description Framework (RDF). The latest extension of this language (DAML+OIL – Ontology Inference Layer) provides a rich set of constructs to create ontologies and to mark up information so that it is machine readable and understandable. The Ontology Inference Layer (OIL) [7,11] from the On-To-Knowledge project is a proposal for such a standard way of expressing ontologies, based on the use of web standards like XML Schema and RDF Schema. OIL is the first ontology representation language that is properly grounded in W3C standards such as RDF/RDF Schema and XML/XML Schema. DAML and OIL are general concepts not specifically related to database or data warehouse interoperability.

In the field of federated data warehouses, a variety of approaches to interoperability have been proposed. In [14] the authors describe the usage of XML to enable interoperability of data warehouses by an additional architectural layer used for exchanging schema metadata. Distributed DWH architectures based on CORBA [2], and centralized virtual data warehouses based on CORBA and XML [3], have been proposed recently. All of these approaches propose distributed data warehouse architectures based on a kind of restricted data and metadata interchange format using particular XML terms and RDF extensions, respectively. Basically, they achieve syntactical integration – but these concepts do not address semantic heterogeneity to enable a thorough description of mappings between federated, heterogeneous data warehouse systems. [8] presents distributed and parallel computing issues in data warehousing. [2] also presents the prototypical distributed OLAP system developed in the context of the CUBE-STAR project. In [20], MetaCube-X is proposed as an XML instance of the MetaCube concept [19] for supporting data warehouses in a federated environment. It provides a framework for supporting the integration and interoperability of data warehouses. In this paper, the MetaCube-XTM, a new MetaCube generation, addresses semantic heterogeneity for data warehousing interoperability.
3 The Concepts of MetaCube-XTM
In this section MetaCube-XTM is presented as a framework for DWH interoperability. Based on this concept, a protocol is studied and proposed as a generic framework to support data access of any complexity level on the local data warehouses.

3.1 MetaCube Conceptual Data Model
In [19], a conceptual multidimensional data model that facilitates a precise, rigorous conceptualization for OLAP has been introduced. This approach is built on basic mathematical concepts, i.e. partial orders and partially ordered sets (posets) [10]. The mathematical foundation provides the basis for handling natural hierarchical relationships among data elements along (OLAP) dimensions with many levels of complexity in their structures. We summarize the MetaCube concepts introduced in [19] as follows:

Dimension Concepts. In [19] we introduced hierarchical relationships among dimension members by means of one hierarchical domain per dimension. A hierarchical domain is a poset (partially ordered set), denoted by ⟨dom(D), ⪯_D⟩, of dimension elements dom(D) = {dm_all} ∪ {dm_1, …, dm_n}, organized in a hierarchy of levels corresponding to different levels of granularity. An example of the hierarchy domain of the dimension Time, with an unbalanced and multiple hierarchical structure, is shown in figure 2. This allows us to consider a dimension schema as a poset of levels, denoted by DSchema(D) = ⟨Levels(D), ⪯_L⟩. Figure 3 shows examples of the dimension schemas of the three dimensions Product, Geography and Time. Furthermore, a family of sets {dom(l_0), …, dom(l_h)} is a partition [10] of dom(D). In this concept, a dimension hierarchy is a path along the dimension schema, beginning at the root level and ending at a leaf level [19].
1999
Q1.1999
Jan.1999
Feb.1999
Mar.1999 W1.1999
1.Jan.1999
6.Jan.1999
1.Feb.1999
3.Feb.1999
W5.1999
W9.1999
3.Mar.1999
Fig. 2. An example of the hierarchy domain of the dimension Time with unbalanced and multiple hierarchical structure
Fig. 3. Examples of dimension schemas of three dimensions Product, Geography and Time
The Concept of Measures
[Measure Schema] A schema of a measure M is a tuple MSchema(M) = ⟨Fname, O⟩, where:
• Fname is a name of a corresponding fact,
• O ∈ Ω ∪ {NONE, COMPOSITE} is an operation type applied to a specific fact [2]. Furthermore:
  - Ω = {SUM, COUNT, MAX, MIN} is a set of aggregation functions.
  - COMPOSITE is an operation (e.g. average), where measures cannot be utilized in order to automatically derive higher aggregations.
  - NONE measures are not aggregated. In this case, the measure is the fact.
[Measure Domain] Let N be a numerical domain where a measure value is defined (e.g. N, Z, R or a union of these domains). The domain of a measure is a subset of N, denoted by dom(M) ⊂ N.
The Concept of MetaCube
First, a MetaCube schema is defined by a triple of a MetaCube name, an x-tuple of dimension schemas, and a y-tuple of measure schemas, denoted by CSchema(C) = ⟨Cname, DSchemas, MSchemas⟩. Furthermore, the hierarchy domain of a MetaCube, denoted by dom(C) = ⟨Cells(C), ⪯_C⟩, is a poset, where each data cell is an intersection among a set of dimension members and measure data values, each of which belongs to one dimension or one measure. The data cells within the MetaCube hierarchy domain are grouped into a set of associated granular groups, each of which expresses a mapping from the domains of an x-tuple of dimension levels (independent variables) to the y numerical domains of a y-tuple of numeric measures (dependent variables). Hence, a MetaCube is constructed based on a set of dimensions, consists of a MetaCube schema, and is associated with a set of groups.
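Purely as an illustrative sketch (these classes are not part of the MetaCube specification), the schema-level definitions above can be summarized with a few Python dataclasses; the example instance at the end mirrors the Sales MetaCube of figure 4, with level names taken from figure 3.

from dataclasses import dataclass, field
from typing import List

AGGREGATIONS = {"SUM", "COUNT", "MAX", "MIN"}   # the set Ω

@dataclass
class MeasureSchema:
    fname: str        # name of the corresponding fact
    operation: str    # element of Ω ∪ {"NONE", "COMPOSITE"}

@dataclass
class DimensionSchema:
    dname: str
    levels: List[str]                                            # poset of levels, root first
    hierarchies: List[List[str]] = field(default_factory=list)   # root-to-leaf paths

@dataclass
class MetaCubeSchema:
    cname: str
    dimension_schemas: List[DimensionSchema]   # the x-tuple
    measure_schemas: List[MeasureSchema]       # the y-tuple

sales = MetaCubeSchema(
    "Sales",
    [DimensionSchema("Time", ["all", "Year", "Quarter", "Month", "Week", "Day"]),
     DimensionSchema("Product", ["all", "Category", "Type", "Item"]),
     DimensionSchema("Store", ["all", "Country", "State", "City"])],
    [MeasureSchema("TotalSale", "SUM")])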
Fig. 4. Sales MetaCube is constructed from three dimensions: Store, Product and Time and one fact: TotalSale
3.2 MetaCube XTM Protocol
The MetaCube-XTM protocol is proposed to handle the design, integration, and maintenance of heterogeneous schemas of the local data warehouses. It describes each local schema, including its dimensions, dimension hierarchies, dimension levels, cubes, and measures. With the MetaCube-XTM protocol it should be possible to describe any schema represented by any multidimensional data model (i.e. star schema, snow-flake model, etc.). Furthermore, it also aims to provide interoperability search and data integration abilities among web data warehouses, as shown in Figure 5. The architecture of MetaCube-XTM systems consists of clients and the server protocol, i.e. the global MetaCube-XTM at an information server and several distributed local data warehouses with their local MetaCube-XTMs. The functionalities are as follows:

• MetaCube-XTM Services. A set of MetaCube-XTM services at the information server is intended to provide searching and navigation abilities for clients and to manage the access to local DWHs from the federated information server (figure 5).
• Global MetaCube-XTM. The global MetaCube-XTM is stored at the server and is intended to provide a multidimensional metadata framework for a class of local data warehouses managed by the MetaCube-XTM services. Thus, it has to resolve semantic heterogeneity and support the search facility over the local data warehouses.
• Local MetaCube-XTM. Each local MetaCube-XTM is used to describe the multidimensional data model of one local data warehouse based on the global MetaCube-XTM. The local MetaCube-XTM is stored in the local data warehouse.
Fig. 5. MetaCube-XTM architecture
4 MetaCube-XTM Prototype
The entire idea behind prototyping is to cut down on the complexity of implementation by eliminating parts of the full system. In this context, the MetaCube-XTM prototype has been implemented. First, UML (Unified Modelling Language) is used to model the MetaCube concept. The UML model provides a framework to implement MetaCube in XTM (XML Topic Maps). Hereafter, we describe the local MetaCube-XTM as a local presentation of DWH schemas using topic maps (bottom-up approach). Then we describe the integration of heterogeneous schemas in subsection 4.2. We are going to use only predefined XTM tags as proposed by the XTM standard (topicMap, topic, baseName, association, occurrence, topicRef, etc.). Therefore it will be possible to use tools based on the XTM standard to create, generate, and maintain such XTM descriptions easily. In this section we also present the process of the MetaCube-XTM prototype.

4.1 Modeling MetaCube-XTM with UML
The common MetaCube-XTM is a model that is used for expressing all schema objects available in the different local data warehouses. To model the MetaCube-XTM, UML is used to model dimensions, measures and data cubes in the context of the MetaCube data model (figure 6) [19,20]. The approach is implemented by a mapping into XML Schema based on the following standard specifications: the Meta Data Interchange Specification (MDIS) [16] and the Open Information Model (OIM) [17,18] of the Meta Data Coalition (MDC).

4.2 Implementation with XML Topic Maps
Topic Maps (TMs) provide a solution for organizing and navigating information resources as a unified view on the Web. In this paper we use XTM to represent the MetaCube concept, to model data at any dimensional level of complexity, to check data for structural correctness, to define new tags corresponding to a new dimension, and to show hierarchical information corresponding to dimension hierarchies. These functionalities are necessary for data warehouse schema handling and OLAP applications.
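As a hedged illustration of how the predefined XTM tags listed above combine (this is not the prototype's actual output, and the topic ids and names are hypothetical), the following Python snippet emits a tiny topic map that declares a cube topic, a dimension topic, and an association between them.

import xml.etree.ElementTree as ET

def add_topic(topic_map, topic_id, name):
    """Add a <topic> with a <baseName>/<baseNameString> to the topic map."""
    topic = ET.SubElement(topic_map, "topic", {"id": topic_id})
    base_name = ET.SubElement(topic, "baseName")
    ET.SubElement(base_name, "baseNameString").text = name
    return topic

topic_map = ET.Element("topicMap", {"xmlns": "http://www.topicmaps.org/xtm/1.0/",
                                    "xmlns:xlink": "http://www.w3.org/1999/xlink"})
add_topic(topic_map, "cube-sales", "Sales Cube")
add_topic(topic_map, "dim-time", "Time Dimension")

# Associate the cube with one of its dimensions through <topicRef> members.
association = ET.SubElement(topic_map, "association")
for href in ("#cube-sales", "#dim-time"):
    member = ET.SubElement(association, "member")
    ET.SubElement(member, "topicRef", {"xlink:href": href})

print(ET.tostring(topic_map, encoding="unicode"))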
Fig. 6. The MetaCube-XTM model with UML
Fig. 7. An example of a local MetaCube-XTM
The MetaCube-XTM is an XML Topic Maps (XTM) instance of the MetaCube concept for supporting interoperability and integration among data warehouse systems. This metadata provides a description of different multidimensional data models. It covers heterogeneity problems such as syntactical, data model, semantic, schematic, and structural heterogeneities.

4.2.1 Schema Integration
Schema integration is intended to overcome the semantic heterogeneity and to provide a global view for clients. The process of schema integration consists of the integration of the local MetaCube-XTM(s) into the global MetaCube-XTM, and merging. The following section discusses issues concerning the local MetaCube-XTM(s) and the global MetaCube-XTM.

• Local MetaCube-XTM. With reference to the MetaCube-XTM UML model given in figure 6, each local MetaCube-XTM is represented as an XTM document supporting the multidimensional data model – cube, dimension, dimension schema, hierarchy, and measures – of one data warehouse. The local MetaCube-XTM is intended to provide data access at any level of data complexity. Figure 7 shows an example of a local MetaCube-XTM describing a local Web warehouse.

• Global MetaCube-XTM. The global MetaCube-XTM is aimed at providing a common framework for describing a multidimensional data model for Web warehouses. Therefore, the global MetaCube-XTM is the result of the integration of the local MetaCube-XTMs. In the integration process, merging tools resolve heterogeneity problems (i.e., naming conflicts among different local MetaCube-XTMs). The merging process is based on the subject of the topics available among the local MetaCube-XTMs. The global MetaCube-XTM provides the
logic to reconcile differences and to drive Web warehousing systems conforming to a global schema. In addition, the global MetaCube-XTM represents metadata that is used for query processing. If a query is posted by a user, the MetaCube-XTM service receives the query, parses it, checks it, compares it with the global MetaCube-XTM, and distributes it to the selected local Web warehouses. Thus, in this model the global MetaCube-XTM must be able to represent the heterogeneity of dimensions and measures from the local Web warehouses in relation to the MetaCube-XTM model. An example of a global MetaCube-XTM is given in figure 8.
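A hedged sketch of this query-processing flow is given below; the LocalWarehouse class, its fields, and the way a query is reduced to a set of requested concepts are all assumptions made for the example, not the prototype's actual interface.

from dataclasses import dataclass
from typing import Callable, List, Set

@dataclass
class LocalWarehouse:
    name: str
    local_topics: Set[str]               # concepts described by its local MetaCube-XTM
    execute: Callable[[Set[str]], list]  # hypothetical hook that runs the query locally

def route_query(requested: Set[str], global_topics: Set[str],
                warehouses: List[LocalWarehouse]) -> list:
    """Check the requested concepts against the global MetaCube-XTM and
    distribute the query to the local warehouses whose schemas cover it."""
    unknown = requested - global_topics
    if unknown:
        raise ValueError(f"concepts not in the global MetaCube-XTM: {unknown}")
    answers = []
    for warehouse in warehouses:
        if requested <= warehouse.local_topics:   # the local schema covers the query
            answers.extend(warehouse.execute(requested))
    return answers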
4.2.2 Prototyping

To demonstrate the capability and efficiency of the proposed concept we use the prototype for the International Union of Forest Research Organizations (IUFRO) data warehouses, which are distributed in different Asian, African and European countries. Because of the genesis of the local (national) data warehouses, they are by nature heterogeneous. Local topic maps are used as an additional layer for the representation of local schema information, whereas in the global layer topic maps act as a mediator to hide the conflicts of the different local schemas. Currently we have implemented an incremental prototype with full data warehousing functionality as a proof of concept for the feasibility of our approach of an XTM-based data warehouse federation (http://iufro.ifs.tuwien.ac.at/metacube2.0/). In detail, the MetaCube-XTM prototype functions as follows. Each time a query is posed to the MetaCube-XTM system, the following six steps are required (see figure 9). Steps 1-3 belong to the build-time stage at the global MetaCube-XTM server. This stage covers the modeling and the design of the search metadata; in this stage all definitions of the metadata required to search the local DWHs are completed. Step 4 belongs to the run-time stage, i.e. the searching processes at the local DWH systems by means of the local MetaCube-XTMs. Steps 5 and 6 are used for displaying the retrieved information. All steps are described below in more detail.

Fig. 8. An example of the global MetaCube-XTM
• MetaCube-XTM Definitions. Depending on the characteristics of the required data, different global MetaCube-XTM structures can be defined at the MetaCube-XTM server. In this step, the user can select a number of dimensions and measures to define a MetaCube-XTM schema.
• MetaCube-XTM Browser. Based on the tree representation of the selected dimension domains, the user can roll up or drill down along each dimension to select elements for searching. These selected dimension elements will be used to query the local Web warehouses.
• Local MetaCube-XTM DWH Selections. This step provides flexibility in support of interoperability searching among multiple heterogeneous DWHs.
Fig. 9. MetaCube-XTM Multi Host Search Processes
5 Conclusion and Future Works
In this paper we have presented the concept of MetaCube-XTM, which is an XML Topic Maps instance of the MetaCube concept [19]. MetaCube-XTM provides a framework for achieving interoperability between heterogeneous schema models, which enables the joint querying of distributed web data warehouse (OLAP) systems. We have also described how to use topic maps to deal with these issues. Local topic maps are used as an additional layer for the representation of local schema information, whereas in the global layer topic maps act as a mediator to hide the conflicts of the different local schemas. This concept facilitates semantic integration.
Acknowledgment The authors are very indebted to IUFRO for supporting our approach from the very beginning in the framework of GFIS (Global Forest Information Systems http://www.gfis.net).
References

[1] Agrawal, R., Gupta, A., Sarawagi, A.: Modeling Multidimensional Databases. IBM Research Report, IBM Almaden Research Center, September 1995.
[2] Albrecht, J., Lehner, W.: On-Line Analytical Processing in Distributed Data Warehouses. International Database Engineering and Applications Symposium (IDEAS), Cardiff, Wales, U.K., July 8-10, 1998.
[3] Ammoura, A., Zaiane, O., Goebel, R.: Towards a Novel OLAP Interface for Distributed Data Warehouses. Proc. of DaWaK 2001, Springer LNCS 2114, pp. 174-185, Munich, Germany, Sept. 2001.
[4] Blaschka, M., Sapia, C., Höfling, G., Dinter, B.: Finding Your Way through Multidimensional Data Models. In Proceedings of the 9th International DEXA Workshop, Vienna, Austria, August 1998.
[5] Bruckner, R.M., Ling, T.W., Mangisengi, O., Tjoa, A M.: A Framework for a Multidimensional OLAP Model Using Topic Maps. In Proceedings of the Second International Conference on Web Information Systems Engineering (WISE 2001), Web Semantics Workshop, Vol. 2, pp. 109-118, IEEE Computer Society Press, Kyoto, Japan, December 2001.
[6] Chaudhuri, S., Dayal, U.: An Overview of Data Warehousing and OLAP Technology. SIGMOD Record, Volume 26, Number 1, September 1997.
[7] Fensel, D., Horrocks, I., van Harmelen, F., Decker, S., Erdmann, M., Klein, M.: OIL in a Nutshell. In: Knowledge Acquisition, Modeling, and Management, Proc. of the 12th European Knowledge Acquisition Conference (EKAW 2000), R. Dieng et al. (eds.), Springer-Verlag LNAI 1937, pp. 1-16, Oct. 2000.
[8] Garcia-Molina, H., Labio, W., Wiener, J.L., Zhuge, Y.: Distributed and Parallel Computing Issues in Data Warehousing. In Proceedings of the ACM Principles of Distributed Computing Conference, 1999. Invited Talk.
[9] Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tabs, and Sub-Totals. Proceedings of ICDE '96, New Orleans, February 1996.
[10] Gross, J., Yellen, J.: Graph Theory and Its Applications. CRC Press, 1999.
[11] Horrocks, I., Fensel, D., Broekstra, J., Decker, S., Erdmann, M., Goble, C., van Harmelen, F., Klein, M., Staab, S., Studer, R., Motta, E.: The Ontology Inference Layer OIL.
[12] Li, C., Wang, X.S.: A Data Model for Supporting On-Line Analytical Processing. CIKM 1996.
[13] Mangisengi, O., Tjoa, A M., Wagner, R.R.: Multidimensional Modelling Approaches for OLAP. Proceedings of the Ninth International Database Conference “Heterogeneous and Internet Databases” 1999, ISBN 962-937-046-8, Ed. J. Fong, Hong Kong, 1999.
[14] Mangisengi, O., Huber, J., Hawel, Ch., Essmayr, W.: A Framework for Supporting Interoperability of Data Warehouse Islands Using XML. Proc. of DaWaK 2001, Springer LNCS 2114, pp. 328-338, Munich, Germany, Sept. 2001.
[15] Mattison, R.: Web Warehousing and Knowledge Management. McGraw-Hill, 1999.
[16] Meta Data Coalition: Metadata Interchange Specification (MDIS) Version 1.1, August 1997.
[17] Meta Data Coalition: Open Information Model XML Encoding. Version 1.0, December 1999. http://www.mdcinfo.com/.
[18] Meta Data Coalition: Open Information Model. Version 1.1, August 1999. http://www.mdcinfo.com/.
[19] Nguyen, T.B., Tjoa, A M., Wagner, R.R.: Conceptual Multidimensional Data Model Based on MetaCube. In Proc. of the First Biennial International Conference on Advances in Information Systems (ADVIS 2000), Izmir, Turkey, October 2000. Lecture Notes in Computer Science (LNCS), Springer, 2000.
[20] Nguyen, T.B., Tjoa, A M., Mangisengi, O.: MetaCube-X: An XML Metadata Foundation for Interoperability Search among Web Warehouses. In Proceedings of the 3rd Intl. Workshop DMDW'2001, Interlaken, Switzerland, June 4, 2001.
[21] The DARPA Agent Markup Language Homepage. http://daml.semanticweb.org/.
[22] The Semantic Web Homepage. http://www.semanticweb.org/.
[23] XML Topic Maps (XTM) 1.0 Specification. http://www.topicmaps.org/xtm/1.0/.
Designing Web Warehouses from XML Schemas

Boris Vrdoljak1, Marko Banek1, and Stefano Rizzi2

1 FER – University of Zagreb, Unska 3, HR-10000 Zagreb, Croatia
{boris.vrdoljak,marko.banek}@fer.hr
2 DEIS – University of Bologna, Viale Risorgimento 2, 40136 Bologna, Italy
[email protected]
Abstract. Web warehousing plays a key role in providing managers with up-to-date and comprehensive information about their business domain. On the other hand, since XML is now a de facto standard for the exchange of semi-structured data, integrating XML data into web warehouses is a hot topic. In this paper we propose a semi-automated methodology for designing web warehouses from XML sources modeled by XML Schemas. In the proposed methodology, design is carried out by first creating a schema graph, then navigating its arcs in order to derive a correct multidimensional representation. Differently from previous approaches in the literature, particular relevance is given to the problem of detecting shared hierarchies and convergence of dependencies, and of modeling many-to-many relationships. The approach is implemented in a prototype that reads an XML Schema and produces in output the logical schema of the warehouse.
1 Introduction
The possibility of integrating data extracted from the web into data warehouses (which in this case will be more properly called web warehouses [1]) is playing a key role in providing enterprise managers with up-to-date and comprehensive information about their business domain. On the other hand, the Extensible Markup Language (XML) has become a standard for the exchange of semi-structured data, and large volumes of XML data already exist. Therefore, integrating XML data into web warehouses is a hot topic. Designing a data/web warehouse entails transforming the schema that describes the source operational data into a multidimensional schema for modeling the information that will be analyzed and queried by business users. In this paper we propose a semi-automated methodology for designing web warehouses from XML sources modeled by XML Schemas, which offer facilities for describing the structure and constraining the content of XML documents. As HTML documents do not contain a semantic description of the data, but only its presentation, automating design from HTML sources is unfeasible. XML models semi-structured data, so the main issue arising is that not
all the information needed for design can be safely derived. In the proposed methodology, design is carried out by first creating a schema graph, and then navigating its arcs in order to derive a correct multidimensional representation in the form of a dependency graph whose arcs represent inter-attribute relationships. The problem of correctly inferring the needed information is solved by querying the source XML documents and, if necessary, by asking for the designer's help. Some approaches concerning related issues have been proposed in the literature. In [4] a technique for conceptual design starting from DTDs [12] is outlined. That approach is now partially outdated due to the increasing popularity of XML Schema; besides, some complex modeling situations were not specifically addressed in the paper. In [5] and [6] DTDs are used as a source for designing multidimensional schemas (modeled in UML). Though that approach bears some resemblance to ours, the unknown cardinalities of relationships are not verified against actual XML data, but are always arbitrarily assumed to be to-one. Besides, the id/idref mechanism used in DTDs is less expressive than key/keyref in XML Schema. The approach described in [8] is focused on populating multidimensional cubes by collecting XML data, but assumes that the multidimensional schema is known in advance (i.e., that conceptual design has already been carried out). In [9], the author shows how to use XML to directly model multidimensional data, without addressing the problem of how to derive the multidimensional schema. Differently from previous approaches in the literature, in our paper particular relevance is given to the problem of detecting shared hierarchies and convergence of dependencies, and of modeling many-to-many relationships within hierarchies. The approach is implemented in a prototype that reads an XML Schema and produces in output the star schema for the web warehouse.
2 Relationships in XML Schema
The structure of XML data can be visualized by using a schema graph (SG) derived from the Schema describing the data. The method is adopted from [10], where the simpler, but less efficient, DTD is still used as a grammar. The SG for the XML Schema describing a purchase order, taken from the W3C's document [14] and slightly extended, is shown in Fig. 1. In addition to the SG vertices that correspond to elements and attributes in the XML Schema, the operators inherited from the DTD element type declarations are also used because of their simplicity. They determine whether the sub-element or attribute may appear one or more (“+”), zero or more (“*”), or zero or one times (“?”). The default cardinality is exactly one, and in that case no operator is shown. Attributes and sub-elements are not distinguished in the graph. Since our design methodology is primarily based on detecting many-to-one relationships, in the following we focus on the way those relationships can be expressed. There are two different ways of specifying relationships in XML Schemas.

• First, relationships can be specified by sub-elements with different cardinalities. However, given an XML Schema, we can express only the cardinality of the relationship from an element to its sub-elements and attributes. The cardinality in the opposite direction cannot be discovered by exploring the Schema; only by exploring the data that conform to the Schema, or by having some knowledge about the described domain, can the cardinality in the direction from a child element to its parent be determined (a data-driven check of this kind is sketched after this list).
• Second, the key and keyref elements can be used for defining keys and their references. The key element indicates that every attribute or element value must be unique within a certain scope and not null. If the key is an element, it should be of a simple type. By using keyref elements, keys can be referenced. Not just attribute values, but also element content and their combinations, can be declared to be keys, provided that the order and type of those elements and attributes are the same in both the key and keyref definitions. In contrast to the id/idref mechanism in DTDs, key and keyref elements are specified to hold within the scope of particular elements.
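In this illustrative Python sketch (not part of the prototype described later), the documents are scanned for a given parent/child element pair and the number of distinct parents per child value is counted; a maximum of one indicates a to-one relationship from the child to its parent.

import xml.etree.ElementTree as ET
from collections import defaultdict

def child_to_parent_cardinality(xml_files, parent_tag, child_tag):
    """Return 'to-one' if every distinct child value occurs under exactly one
    distinct parent element, and 'to-many' otherwise. The Schema alone cannot
    reveal this direction, so the documents themselves are examined."""
    parents_per_child = defaultdict(set)
    for path in xml_files:
        root = ET.parse(path).getroot()
        for parent in root.iter(parent_tag):
            parent_id = ET.tostring(parent)        # crude identifier for the parent
            for child in parent.iter(child_tag):
                parents_per_child[(child.text or "").strip()].add(parent_id)
    worst = max((len(parents) for parents in parents_per_child.values()), default=0)
    return "to-one" if worst <= 1 else "to-many"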
3 From XML Schema to Multidimensional Schema
In this section we propose a semi-automatic approach for designing a web warehouse starting from an XML Schema. The methodology consists of the following steps:
1. Preprocessing the XML Schema.
2. Creating and transforming the SG.
3. Choosing facts.
4. For each fact:
   4.1 Building the dependency graph from the SG.
   4.2 Rearranging the dependency graph.
   4.3 Defining dimensions and measures.
   4.4 Creating the logical schema.
Given a fact, the dependency graph (DG) is an intermediate structure used to provide a multidimensional representation of the data describing the fact. In particular, it is a directed rooted graph whose vertices are a subset of the element and attribute vertices of the SG, and whose arcs represent associations between vertices. The root of the DG corresponds to the fact.

Fig. 1. The Schema Graph
While in most cases the hierarchies included in the multidimensional schema represent only to-one associations (sometimes called roll-up relationships since they support the roll-up OLAP operator), in some applications it is important to model also many-to-many associations. For instance, suppose the fact to be modeled is the sales of books, so book is one of the dimensions. Although books that have many authors certainly exist, it would be interesting to aggregate the sales by author. It is remarkable that summarizability is maintained through many-to-many associations, if a normalized weight is introduced [7]. Besides, some specific solutions for logical design in presence of many-to-many associations were devised [11]. However, since modeling many-to-many associations in a warehouse should be considered an exception, their inclusion in the DG is subject to the judgment of the designer, who is supposed to be an expert of the business domain being modeled. After the DG has been derived from the SG, it may be rearranged (typically, by dropping some uninteresting attributes). This phase of design necessarily depends on the user requirements and cannot be carried out automatically; since it has already been investigated (for instance in [2]), it is considered to be outside the scope of this paper. Finally, after the designer has selected dimensions and measures among the vertices of the DG, a logical schema can be immediately derived from it.

3.1 Choosing Facts and Building Dependency Graphs
The relationships in the Schema can be specified in a complex and redundant way. Therefore, we transform some structures to simplify the Schema, similarly to the way the DTD was simplified in [10] and [6]. A common example of Schema simplification concerns the choice element, which denotes that exactly one of the sub-elements must appear in a document conforming to that Schema. The choice element is removed from the schema and a minOccurs attribute with value 0 is added to each of its sub-elements. The resulting simplified structure, although not equivalent to the choice expression, preserves all the needed information about the cardinalities of the relationships. After the initial SG has been created [10], it must undergo two transformations. First, all the key attributes or elements are located and swapped with their parent vertex in order to explicitly express the functional dependency relating the key to the other attributes and elements. Second, some vertices that do not store any value are eliminated. A typical case is an element that has only one sub-element of complex type and no attributes, and whose relationship with its sub-element is to-many. We name such an element a container. Note that, when a vertex v is deleted, the parent of v inherits all the children of v and their cardinalities. The next step is choosing the fact. The designer chooses the fact among all the vertices and arcs of the SG. An arc can be chosen as a fact if it represents a many-to-many relationship. For the purchase order SG presented in Fig. 1, after the items element has been eliminated as a container, the relationship between purchaseOrder and item is chosen as the fact, as in Fig. 2. For each fact f, the corresponding DG must be built by including a subset of the vertices of the SG. The DG is initialized with the root f, to be enlarged by recursively navigating the relationships between vertices in the SG. After a vertex v of the SG is inserted in the DG, navigation takes place in two steps:
Fig. 2. Choosing a fact
1. For each vertex w that is a child of v in the SG: When examining relationships in the direction expressed by the arcs of the SG, the cardinality information is expressed either explicitly by “?”, “*” and “+” vertices, or implicitly by their absence. If w corresponds to an element or attribute in the Schema, it is added to the DG as a child of v; if w is a “?” operator, its child is added to the DG as a child of v. If w is a “*” or “+” operator, the cardinality of the relationship from u, the child of w, to v is checked by querying the XML documents (see Section 3.2): if it is to-many, the designer decides whether the many-to-many relationship between v and u is interesting enough to be inserted into the DG or not.
2. For each vertex z that is a parent of v in the SG: When examining relationships in this direction, vertices corresponding to “?”, “*” and “+” operators are skipped, since they only express the cardinality in the opposite direction. Since the Schema yields no further information about the relationship cardinality, it is necessary to examine the actual data by querying the XML documents conforming to the Schema (see Section 3.2). If a to-one relationship is detected, z is included in the DG.

Both navigation steps are sketched in the code below.
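In the following sketch, the SG and DG interfaces (children, parents, add_child, add_many_to_many), the document-querying predicate is_to_one and the interactive ask_designer callback are hypothetical names introduced only for illustration; this is not the prototype's code.

def expand_vertex(v, sg, dg, is_to_one, ask_designer):
    """Grow the dependency graph around vertex v by examining its
    relationships in the schema graph (the two navigation steps above)."""
    # Step 1: children of v; op is the DTD-style operator on the arc.
    for w, op in sg.children(v):           # op in {None, "?", "*", "+"}
        if op in (None, "?"):
            dg.add_child(v, w)             # to-one association: include directly
        else:                              # "*" or "+": v -> w is to-many
            if not is_to_one(w, v):        # w -> v is to-many as well?
                if ask_designer(f"model many-to-many between {v} and {w}?"):
                    dg.add_many_to_many(v, w)
            # a plain one-to-many association is not useful for aggregation
    # Step 2: parents of v; operator vertices are skipped.
    for z in sg.parents(v):
        if is_to_one(v, z):                # verified against the XML documents
            dg.add_child(v, z)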
Whenever a vertex corresponding to a keyref element is reached, the navigation algorithm “jumps” to its associated key vertex, so that the descendants of the key become descendants of the keyref element. A similar approach is used in [3], where the operational sources are represented by a relational schema, when a foreign key is met during the navigation of relations. See for instance Fig. 3, showing the resulting DG for the purchase order example. From the fact, following a to-one relationship, the item vertex is added to the DG. The vertex productCode is defined to be a key (Fig. 1). It is swapped with product, which is then dropped since it carries no value. The partNum vertex is a child of item and is defined as a key reference to the productCode attribute. size, weight and brand, the children of productCode, become descendants of the partNum attribute in the DG.

3.2 Querying XML Documents
In our approach, XQuery [15] is used to query the XML documents in three different situations:
1. examination of convergence and shared hierarchies;
2. searching for many-to-many relationships between the descendants of the fact in the SG;
3. searching for to-many relationships towards the ancestors of the fact in the SG.
Fig. 3. The DG for the purchase order example
Note that, since in all three cases querying the documents is aimed at counting how many distinct values of an attribute v are associated to a single value of an attribute w, it is always preliminarily necessary to determine a valid identifier for both v and w. To this end, if no key is specified for an attribute, the designer is asked to define an identifier by selecting a subset of its non-optional sub-elements.

Convergence and Shared Hierarchies. Whenever a complex type has more than one instance in the SG, and all of the instances have a common ancestor vertex, either a convergence or a shared hierarchy may be implied in the DG. A convergence holds if an attribute is functionally determined by another attribute along two or more distinct paths of to-one associations. On the other hand, it often happens that whole parts of hierarchies are replicated two or more times. In this case we talk of a shared hierarchy, to emphasize that there is no convergence. In our approach, the examination is made by querying the available XML documents conforming to the given Schema. In the purchase order example, following a to-one relationship from the fact, the purchaseOrder vertex is added to the DG. It has two children, shipTo and billTo (Fig. 1), that have the same complex type USAddress. The purchaseOrder element is the closest common ancestor of shipTo and billTo, thus all the instances of the purchaseOrder element have to be retrieved. For each purchaseOrder instance, the content of the first child, shipTo, is compared to the content of the second one, billTo, using the deep-equal XQuery operator as shown in Fig. 4.
(: count the purchase orders whose shipTo and billTo contents differ :)
let $x :=
  for $c in $retValue
  where not(deep-equal($c/first/content, $c/second/content))
  return $c
return count($x)

Fig. 4. A part of the XQuery query for distinguishing convergence from shared hierarchy
By using the COUNT function, the query returns the number of couples with different contents. If at least one couple with different contents is counted, a shared hierarchy is introduced. Otherwise, since in principle there is still a possibility that documents in which the contents of the complex type instances are not equal will exist, the designer has to decide about the existence of a convergence by leaning on her knowledge of the application domain. In our example, supposing it is found that shipTo and billTo have different values in some cases, a shared hierarchy is introduced.

Many-to-Many Relationships between the Descendants of the Fact. While in most cases only to-one associations are included in the DG, there are situations in which it is useful to model many-to-many associations. Consider the SG in Fig. 5, modeling the sales of books, where the bookSale vertex is chosen as the fact. After the book vertex is included in the DG, a to-many relationship between book and author is detected. Since including a one-to-many association would be useless for aggregation, the available XML documents conforming to the bookSale Schema are examined using XQuery to find out whether the same author can write multiple books. A part of the query is presented in Fig. 6: it counts the number of distinct books (i.e. different parent elements) for each author (child) and returns the maximum number. If the returned number is greater than one, the relationship is many-to-many, and the designer may choose whether it should be included in the DG or not. If the examination of the available XML documents has not proved that the relationship is many-to-many, the designer can still, leaning on his or her knowledge, state the relationship as many-to-many and decide whether it is interesting for aggregation.

Fig. 5. The book sale example
max(
  ...
  (: for each distinct child value, count the distinct parent elements it occurs in :)
  for $c in distinct-values($retValue/child)
  let $p :=
    for $exp in $retValue
    where deep-equal($exp/child, $c)
    return $exp/parent
  return count(distinct-values($p))
)

Fig. 6. A part of a query for examining many-to-many relationships
Fig. 7. The star schema for the purchase order example:
PURCHASE_ORDER (fact table): shipToCustomerKey, billToCustomerKey, orderDateKey, productKey, USPrice, quantity, income
TIME: timeKey, orderDate, dayOfWeek, holiday, month
PRODUCT: productKey, partNum, productName, size, weight, brand
CUSTOMER: customerKey, customer, name, street, zip, city, state, country
To-Many Relationships towards the Ancestors of the Fact. This search is needed because the ancestors of the fact element in the SG will not always form a hierarchically organized dimension, in spite of the nesting structure of XML. When navigating the SG upwards from the fact, the relationships must be examined by XQuery, since we have no information about the relationship cardinality, which is not necessarily to-one. The query is similar to the one for examining many-to-many relationships, and counts the number of distinct values of the parent element corresponding to each value of the child element.
3.3 Creating the Logical Schema
Once the DG has been created, it may be rearranged as discussed in [3]. Considering for instance the DG in Fig. 3, we observe that there is no need for the existence of both purchaseOrder and purchaseOrder-item, so only the former is kept. Considering item and partNum, only the latter is kept. The comment and shipDate attributes are dropped to eliminate unnecessary details. Finally, attribute USAddress is renamed to customer in order to clarify its role. The final steps of building a multidimensional schema include the choice of dimensions and measures, as described in [2]. In the purchase order example, USPrice and quantity are chosen as measures, while orderDate, partNum, shipToCustomer, and billToCustomer are the dimensions. Finally, the logical schema is easily obtained by including the measures in the fact table and creating a dimension table for each hierarchy in the DG. Fig. 7 shows the resulting star schema corresponding to the DG in Fig. 3; note how the shared hierarchy on customer is represented in the logical model by only one dimension table, named CUSTOMER, and how a derived measure, income, has been defined by combining quantity and USPrice. In the presence of many-to-many relationships, one of the logical design solutions proposed in [11] is adopted.
4 Conclusion
In this paper we described an approach to design a web warehouse starting from the XML Schema describing the operational source. As compared to previous approaches
based on DTDs, the higher expressiveness of XML Schema allows more effective modeling. Particular relevance is given to the problem of detecting shared hierarchies and convergences; besides, many-to-many relationships within hierarchies can be modeled. The approach is implemented in a Java-based prototype that reads an XML Schema and produces as output the star schema for the web warehouse. Since not all the needed information can be inferred from the XML Schema, in some cases the source XML documents are queried using the XQuery language and, if necessary, the designer is asked for help. The prototype automates several parts of the design process: preprocessing the XML Schema, creating and transforming the schema graph, building the dependency graph, and querying XML documents. All phases are controlled and monitored by the designer through a graphical interface that also allows some restructuring interventions on the dependency graph.
References
[1] S. S. Bhowmick, S. K. Madria, W.-K. Ng, and E. P. Lim, "Web Warehousing: Design and Issues", Proc. DWDM'98, Singapore, 1998.
[2] M. Golfarelli, D. Maio, and S. Rizzi, "Conceptual design of data warehouses from E/R schemes", Proc. HICSS-31, vol. VII, Kona, Hawaii, pp. 334-343, 1998.
[3] M. Golfarelli, D. Maio, and S. Rizzi, "The Dimensional Fact Model: a Conceptual Model for Data Warehouses", International Journal of Cooperative Information Systems, vol. 7, n. 2&3, pp. 215-247, 1998.
[4] M. Golfarelli, S. Rizzi, and B. Vrdoljak, "Data warehouse design from XML sources", Proc. DOLAP'01, Atlanta, pp. 40-47, 2001.
[5] M. Jensen, T. Møller, and T.B. Pedersen, "Specifying OLAP Cubes On XML Data", Journal of Intelligent Information Systems, 2001.
[6] M. Jensen, T. Møller, and T.B. Pedersen, "Converting XML Data To UML Diagrams For Conceptual Data Integration", Proc. DIWeb'01, Interlaken, 2001.
[7] R. Kimball, "The data warehouse toolkit", John Wiley & Sons, 1996.
[8] T. Niemi, M. Niinimäki, J. Nummenmaa, and P. Thanisch, "Constructing an OLAP cube from distributed XML data", Proc. DOLAP'02, McLean, 2002.
[9] J. Pokorny, "Modeling stars using XML", Proc. DOLAP'01, 2001.
[10] J. Shanmugasundaram et al., "Relational Databases for Querying XML Documents: Limitations and Opportunities", Proc. 25th VLDB, Edinburgh, 1999.
[11] I.Y. Song, W. Rowen, C. Medsker, and E. Ewen, "An analysis of many-to-many relationships between fact and dimension tables in dimensional modeling", Proc. DMDW, Interlaken, Switzerland, pp. 6.1-6.13, 2001.
[12] World Wide Web Consortium (W3C), "XML 1.0 Specification", http://www.w3.org/TR/2000/REC-xml-20001006.
[13] World Wide Web Consortium (W3C), "XML Schema", http://www.w3.org/XML/Schema.
[14] World Wide Web Consortium (W3C), "XML Schema Part 0: Primer", http://www.w3.org/TR/xmlschema-0/.
[15] World Wide Web Consortium (W3C), "XQuery 1.0: An XML Query Language (Working Draft)", http://www.w3.org/TR/xquery/.
Building XML Data Warehouse Based on Frequent Patterns in User Queries
Ji Zhang 1, Tok Wang Ling 1, Robert M. Bruckner 2, A Min Tjoa 2
1 Department of Computer Science, National University of Singapore, Singapore 117543
2 Institute of Software Technology, Vienna University of Technology, Favoritenstr. 9/188, A-1040 Vienna, Austria
{zhangji, lingtw}@comp.nus.edu.sg
{bruckner, tjoa}@ifs.tuwien.ac.at
Abstract. With the proliferation of XML-based data sources available across the Internet, it is increasingly important to provide users with a data warehouse of XML data sources to facilitate decision-making processes. Due to the extremely large amount of XML data available on the web, unguided warehousing of XML data turns out to be highly costly and usually cannot accommodate users' needs in XML data acquisition well. In this paper, we propose an approach to materialize XML data warehouses based on frequent query patterns discovered from historical queries issued by users. The schemas of the integrated XML documents in the warehouse are built using these frequent query patterns, represented as Frequent Query Pattern Trees (FreqQPTs). By using a hierarchical clustering technique, the integration approach in the data warehouse is flexible with respect to obtaining and maintaining XML documents. Experiments show that processing queries issued against the global schema becomes much more efficient by using the XML data warehouse built than by directly searching the multiple data sources.
1. Introduction
A data warehouse (DWH) is a repository of data that has been extracted, transformed, and integrated from multiple and independent data sources such as operational databases and external systems [1]. A data warehouse system, together with its associated technologies and tools, enables knowledge workers to acquire, integrate, and analyze information from different data sources. Recently, XML has rapidly emerged as a standardized data format to represent and exchange data on the web. The traditional DWH is gradually giving way to the XML-based DWH, which is becoming the mainstream framework. Building an XML data warehouse is appealing since it provides users with a collection of semantically consistent, clean, and concrete XML-based data suitable for efficient query and analysis purposes. However, the major drawback of building an enterprise-wide XML data warehouse system is that it is usually so time- and cost-consuming that it is unlikely to be successful [10]. Furthermore, without proper guidance on which information is to be stored, the resulting data warehouse cannot really accommodate users' needs in XML data acquisition well. In order to overcome this problem, we propose a novel XML data warehouse approach that takes advantage of the frequent patterns underlying the query
Y. Kambayashi, M. Mohania, W. Wöß (Eds.): DaWaK 2003, LNCS 2737, pp. 99-108, 2003. © Springer-Verlag Berlin Heidelberg 2003
history of users. The historical user queries can ideally provide us with guidance regarding which XML data sources are accessed by users more frequently than others. The general idea of our approach is as follows: given multiple distributed XML data sources and their globally integrated schema represented as a DTD (document type definition) tree, we build an XML data warehouse based on the frequent query patterns revealed in the query history. In doing so, the frequent query patterns, each represented as a Frequent Query Pattern Tree (FreqQPT), are discovered by applying a rule-mining algorithm. Then, FreqQPTs are clustered and merged to generate a specified number of integrated XML documents. By construction, the schema of the integrated XML documents in the warehouse is only a subset of the global schema, and the size of this warehouse is usually much smaller than the total size of all distributed data sources. A smaller data warehouse not only saves storage space but also enables query processing to be performed more efficiently. Furthermore, this approach is more user-oriented and better tailored to the users' needs and interests. There has been some research in the field of building and managing XML data warehouses. The authors of [2] present a semi-automated approach to building a conceptual schema for a data mart starting from XML sources. The work in [3] uses XML to establish an Internet-based data warehouse system to overcome the defects of client/server data warehouse systems. [4] presents a framework for supporting interoperability among data warehouse islands in federated environments based on XML. A change-centric method to manage versions in a web warehouse of XML data is published in [5]. Integration strategies and their application to XML Schema integration have been discussed in [6]. The author of [8] introduces a dynamic warehouse, which supports evaluation, change control, and data integration of XML data. The remainder of this paper is organized as follows. Section 2 discusses the generation of XML data warehouses based on frequent query patterns of users' queries. In Section 3, query processing using the data warehouse is discussed. Experimental results are reported in Section 4. The final section concludes this paper.
2. Building an XML DWH Based on Frequent Query Patterns
2.1. Transforming Users' Queries into Query Path Transactions
XQuery is a flexible language commonly used to query a broad spectrum of XML information sources, including both databases and documents [7]. The following XQuery-formatted query aims to extract the ISBN, Title, Author, and Price of books with a price over 20 dollars from a set of XML documents about book-related information. The global DTD tree is shown in Figure 1.

    FOR $a IN DOCUMENT (book XML documents)/book
    SATISFIES $a/Price/data() > 20
    RETURN {$a/ISBN, $a/Title, $a/Author, $a/Price}
[Fig. 1 shows the global DTD tree of the book documents: Book has children ISBN, Title, Author+, Section+, Publisher, Price, and Year; Author has Name and Affiliation; Section has Title, Para*, and Figure*; Figure has Title and Image.]
Fig. 1. Global DTD Tree of multiple XML documents.
QP1: Book/ISBN
QP2: Book/Title
QP3: Book/Author/Name
QP4: Book/Author/Affiliation
QP5: Book/Price
Fig. 2. QPs of the XQuery sample.
A Query Path (QP) is a path expression over the DTD tree that starts at the root of the tree. QPs can be obtained from queries expressed as XQuery statements. The sample query above can be decomposed into five QPs, as shown in Figure 2. The root of a QP is denoted as Root(QP), and all QPs in a query have the same root. Please note that two QPs with different roots are regarded as different QPs, although the two paths may have some common nodes. This is because different roots of paths often indicate dissimilar contexts of the queries. For example, the two queries Author/Name and Book/Author/Name are different because Root(Author/Name) = Author while Root(Book/Author/Name) = Book. A query can be expressed using a set of QPs which includes all the QPs that this query consists of. For example, the above sample query, denoted as Q, can be expressed using a QP set such as Q = {QP1, QP2, QP3, QP4, QP5}. By transforming all the queries into QP sets, we obtain a database containing all these QP sets of queries, denoted as DQPS. We will then apply a rule-mining technique to discover significant rules among the users' query patterns.
2.2. Discovering Frequent Query Path Sets in DQPS
The aim of applying a rule-mining technique to DQPS is to discover Frequent Query Path Sets (FreqQPSs) in DQPS. A FreqQPS contains frequent QPs that jointly occur in DQPS. Frequent Query Pattern Trees (FreqQPTs) are built from these FreqQPSs and serve as building blocks of the schemas of the integrated XML documents in the data warehouse. Formal definitions of these concepts are given as follows. Definition 1. Frequent Query Path Set (FreqQPS): From all the occurring QPs in the DQPS transformed from users' queries, a Frequent Query Path Set (FreqQPS) is a set of QPs {QP1, QP2, ..., QPn} that satisfies the following two requirements: (1) Support requirement: Support({QP1, QP2, ..., QPn}) ≥ minsup; (2) Confidence requirement: for each QPi, Freq({QP1, QP2, ..., QPn}) / Freq(QPi) ≥ minconf, where Freq(s) counts the occurrences of set s in DQPS. In (1), Support({QP1, QP2, ..., QPn}) = Freq({QP1, QP2, ..., QPn}) / N(DQPS), where N(DQPS) is the total number of QPs in DQPS. The constants minsup and minconf are the minimum support and confidence thresholds, specified by the user. A FreqQPS that consists of n QPs is termed an n-itemset FreqQPS. The definition of a FreqQPS is similar to that of association rules. The support requirement is identical to the traditional definition of large association rules. The confidence requirement is, however, more rigid than the traditional definition. Setting a more rigid confidence requirement ensures that the joint occurrence of the QPs in a FreqQPS is significant enough with respect to the individual occurrence of any constituent QP.
Since the number of QPs in a FreqQPS is unknown in advance, we mine all FreqQPSs containing various numbers of itemsets. The FreqQPS mining algorithm is presented in Figure 3. The n-itemset QPS candidates are generated by joining (n-1)-itemset FreqQPSs. A pruning mechanism is devised to delete those n-itemset QPS candidates that do not have n (n-1)-itemset subsets in the (n-1)-itemset FreqQPS list. The reason is that if one or more (n-1)-subsets of an n-itemset QPS candidate are missing from the (n-1)-itemset FreqQPS list, this n-itemset QPS cannot become a FreqQPS. This is obviously more rigid than the pruning mechanism used in conventional association rule mining. For example, if one or more of the 2-itemset QPSs {QP1, QP2}, {QP1, QP3} and {QP2, QP3} are not frequent, then the 3-itemset QPS {QP1, QP2, QP3} cannot become a frequent QPS. The proof of this pruning mechanism is given below. After the pruning, the n-itemset QPS candidates are evaluated against the support and confidence requirements to decide whether or not they are FreqQPSs. The (n-1)-itemset FreqQPSs are finally deleted if they are subsets of some n-itemset FreqQPS. For example, the 2-itemset FreqQPS {QP1, QP2} will be deleted from the 2-itemset FreqQPS list if the 3-itemset {QP1, QP2, QP3} exists in the 3-itemset FreqQPS list.

    Algorithm MineFreqQPS
    Input: DQPS, minsup, minconf.
    Output: FreqQPSs of varied numbers of itemsets.
    FreqQPS1 = {QP in DQPS | SatisfySup(QP) = true};
    i = 2;
    WHILE (FreqQPSi-1 is not empty) {
        CanQPSi = CanQPSGen(FreqQPSi-1);
        CanQPSi = CanQPSi - {QPSi | NoSubSet(QPSi, FreqQPSi-1)};
        FreqQPSi = {QPSi in CanQPSi | SatisfySup(QPSi) and SatisfyConf(QPSi)};
        FreqQPSi-1 = FreqQPSi-1 - {QPSi-1 | QPSi-1 is a subset of some QPSi in FreqQPSi};
        i = i + 1;
    }
    RETURN the union of all FreqQPS lists;
Fig. 3. Algorithm MineFreqQPS
Proof: Suppose an n-itemset QPS has only p (n-1)-itemset subsets QPS^(n-1)_i, 1 ≤ i ≤ p, meaning that (n-p) subsets of QPSn are missing from the (n-1)-itemset QPS list. These missing (n-p) subsets of QPSn, denoted as QPS^(n-1)_i, p+1 ≤ i ≤ n, are definitely not FreqQPSs: they fail to satisfy the support requirement, the confidence requirement, or both. Specifically, (1) if QPS^(n-1)_i, p+1 ≤ i ≤ n, does not satisfy the support requirement, then Support(QP1, QP2, ..., QPn-1) < minsup. Because Support(QP1, QP2, ..., QPn) ≤ Support(QP1, QP2, ..., QPn-1), it follows that Support(QP1, QP2, ..., QPn) < minsup, meaning that QPSn cannot become an n-itemset FreqQPS; (2) if QPS^(n-1)_i, p+1 ≤ i ≤ n, does not satisfy the confidence requirement, then for a certain QPi, Freq(QP1, QP2, ..., QPn-1) / Freq(QPi) < minconf. Because Support(QP1, QP2, ..., QPn) ≤ Support(QP1, QP2, ..., QPn-1), and hence Freq(QP1, QP2, ..., QPn) ≤ Freq(QP1, QP2, ..., QPn-1), for this QPi we have Freq(QP1, QP2, ..., QPn) / Freq(QPi) < minconf, meaning that QPSn cannot become an n-itemset FreqQPS.
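For illustration, the following Python fragment renders Definition 1 and the level-wise search of Fig. 3; DQPS is modelled as a list of frozensets of QPs, and the normalisation of support by the total number of QPs follows the text. The function names and data layout are assumptions of this sketch, not the authors' implementation.

    from itertools import combinations

    def freq(qps, dqps):
        return sum(1 for trans in dqps if qps <= trans)

    def is_freq_qps(qps, dqps, minsup, minconf):
        n_total = sum(len(trans) for trans in dqps)   # N(DQPS): total number of QPs
        f = freq(qps, dqps)
        if f / n_total < minsup:                      # support requirement
            return False
        # confidence requirement, checked against every constituent QP
        return all(f / freq(frozenset([qp]), dqps) >= minconf for qp in qps)

    def mine_freq_qps(dqps, minsup, minconf):
        level = [frozenset([qp]) for qp in set().union(*dqps)
                 if is_freq_qps(frozenset([qp]), dqps, minsup, minconf)]
        result = []
        while level:
            # join step: combine sets that differ in exactly one QP
            cands = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
            # pruning: every (n-1)-subset of a candidate must itself be frequent
            cands = [c for c in cands
                     if all(frozenset(s) in level for s in combinations(c, len(c) - 1))]
            nxt = [c for c in cands if is_freq_qps(c, dqps, minsup, minconf)]
            # drop (n-1)-itemsets that are subsumed by an n-itemset FreqQPS
            result.extend(s for s in level if not any(s < c for c in nxt))
            level = nxt
        return result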
After we have obtained a number of FreqQPSs, their corresponding Frequent Query Pattern Trees (FreqQPTs) are built. Definition 2. Frequent Query Pattern Tree (FreqQPT): Given a FreqQPS, its corresponding Frequent Query Pattern Tree (FreqQPT) is a rooted tree FreqQPT = ⟨V, E⟩, where V and E denote its vertex and edge sets, which are the unions of the vertices and edges of the QPs in this FreqQPS, respectively. The root of a FreqQPT, denoted as Root(FreqQPT), is the root of its constituent QPs. For example, suppose a FreqQPS has two QPs: Book/Title and Book/Author/Name. The resulting FreqQPT is shown in Fig. 4.
[Fig. 4 shows the FreqQPS {Book/Title, Book/Author/Name} and the FreqQPT built from it: a tree rooted at Book with children Title and Author, and Name as the child of Author.]
Fig. 4. Build a FreqQPT for a FreqQPS.
2.3. Generating Schemas of Integrated XML Documents
When all FreqQPTs have been mined, the schemas of the integrated XML documents will be built. We have noticed that a larger integrated XML document usually requires more space when it is loaded into main memory. In order to alleviate this problem, we choose to build a few, rather than only one, integrated XML documents from the mined FreqQPTs, making the integration more flexible. The exact number of integrated XML documents to be obtained is user-specified. The basic idea is to use a clustering technique to find a pre-specified number of clusters of FreqQPTs. The integration of the FreqQPTs is performed within each of the clusters.
Similarity measurement of FreqQPTs
We need to measure the similarity between two FreqQPTs in order to find the closest pair in each step of the clustering process. It is noticed that the complexity of merging two FreqQPTs depends on the distance between the roots of the FreqQPTs involved, rather than on the other nodes in the FreqQPTs. Intuitively, the closer the two roots are to each other, the easier the merging can be done, and vice versa. To measure the similarity between the roots of two FreqQPTs, we first have to discuss the similarity between two nodes in the hierarchy of the global schema. In our work, the similarity computation between two nodes in the hierarchy is based on the edge-counting method. We measure the similarity of nodes by first computing the distance between two nodes, since the distance can be easily obtained by edge counting. Naturally, the larger the number of edges between two nodes, the further apart the two nodes are. The distance between two nodes n1 and n2, denoted as NodeDist(n1, n2), is computed as NodeDist(n1, n2) = Nedge(n1, n2), where Nedge() returns the number of edges between n1 and n2. This distance can be normalized by dividing it by the maximum possible distance between two nodes in the hierarchy, denoted by LongestDist. The normalized distance between n1 and n2, denoted as NodeDistN(n1, n2), is computed as follows:
NodeDistN(n1, n2) = Nedge(n1, n2) / LongestDist
Thus the similarity between n1 and n2 is computed as:
NodeSimN(n1, n2) = 1 - NodeDistN(n1, n2)
We now give an example to show how the similarity between the roots of two FreqQPTs is computed. Suppose there are two QPs, QP1: Book/Price and QP2:
Section/Figure/Image, as shown in Figure 5. We need to compute the similarity between the roots of these two QPs, namely Book and Section. The maximum distance between two nodes in the hierarchy shown in Figure 1 is 5 (from Name or Affiliation to Title or Image). Thus NodeSimN(Book, Section) = 1 - NodeDistN(Book, Section) = 1 - 1/5 = 4/5 = 0.8.
[Fig. 5 depicts the two QPs Book/Price and Section/Figure/Image.]
Fig. 5. Similarity between two QPs.
Merging of FreqQPTs
When the nearest pair of FreqQPTs is found in each step of the clustering, these two FreqQPTs are merged. Let FreqQPT1 = ⟨V1, E1⟩, FreqQPT2 = ⟨V2, E2⟩, Root(FreqQPT1) = root1, Root(FreqQPT2) = root2, and let FreqQPTM be the new FreqQPT merged from FreqQPT1 and FreqQPT2. We will now present the definition of the Nearest Common Ancestor Node (NCAN) of two nodes in the DTD tree before we give the details of FreqQPT merging. Definition 3. Nearest Common Ancestor Node (NCAN): The NCAN of the root nodes root1 and root2 of two FreqQPTs in the hierarchical structure of a global DTD tree H, denoted as NCANH(root1, root2), is the common ancestor node in H that is closest to both root1 and root2. To merge the two closest FreqQPTs, the NCAN of root1 and root2 has to be found, so that these two FreqQPTs can be connected. We denote the vertex and edge sets of the path between NCANH(root1, root2) and root1 as VNCAN→root1 and ENCAN→root1, and those between NCANH(root1, root2) and root2 as VNCAN→root2 and ENCAN→root2. The merged tree can then be expressed as FreqQPTM = ⟨Union(V1, V2, VNCAN→root1, VNCAN→root2), Union(E1, E2, ENCAN→root1, ENCAN→root2)⟩ with Root(FreqQPTM) = NCANH(root1, root2). Specifically, there are three scenarios in merging two FreqQPTs: (1) the two FreqQPTs have the same root; (2) the root of one FreqQPT is an ancestor node of the other FreqQPT's root; (3) all other cases. Figure 6 (a)-(c) gives an example for each of these cases. The dot-lined edges in the integrated schema, if any, are the extra edges that have to be included in the integrated schema when merging the two separate FreqQPTs.
[Fig. 6 gives one merging example for each case: (a) two FreqQPTs that share the root Book; (b) a FreqQPT rooted at Book merged with one rooted at Section; (c) a FreqQPT rooted at Author merged with one rooted at Figure, connected through their nearest common ancestor Book. The dot-lined edges mark the extra edges added to the integrated schema.]
Fig. 6 (a)-(c). Examples of FreqQPT merging.
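The merge rule just described can be read as the following small Python sketch, in which a FreqQPT is represented as a pair (root, set of parent-child edges) and the relevant part of the global DTD tree is again given as a child-to-parent map; these representations are assumptions of the sketch, not the authors' implementation.

    # Hypothetical rendering of FreqQPT merging via the nearest common ancestor (NCAN).
    PARENT = {"Title": "Book", "Author": "Book", "Section": "Book", "Price": "Book",
              "Name": "Author", "Affiliation": "Author",
              "Figure": "Section", "Image": "Figure"}

    def ancestors(node):
        chain = [node]
        while node in PARENT:
            node = PARENT[node]
            chain.append(node)
        return chain

    def ncan(root1, root2):
        """Nearest common ancestor node of two roots in the DTD tree."""
        others = ancestors(root2)
        return next(a for a in ancestors(root1) if a in others)

    def path_edges(ancestor, node):
        """Edges on the path from the ancestor down to the node."""
        edges = set()
        while node != ancestor:
            edges.add((PARENT[node], node))
            node = PARENT[node]
        return edges

    def merge(tree1, tree2):
        (r1, e1), (r2, e2) = tree1, tree2
        r = ncan(r1, r2)
        # union of both edge sets plus the connecting paths NCAN -> root1 and NCAN -> root2
        return r, e1 | e2 | path_edges(r, r1) | path_edges(r, r2)

    # Case (c) of Fig. 6: a FreqQPT rooted at Author merged with one rooted at Figure.
    t_author = ("Author", {("Author", "Name"), ("Author", "Affiliation")})
    t_figure = ("Figure", {("Figure", "Image")})
    print(merge(t_author, t_figure))
    # root Book, with the extra edges (Book, Author), (Book, Section), (Section, Figure)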
Clustering of FreqQPTs
The aim of clustering FreqQPTs is to group similar FreqQPTs together for further integration. Merging two closer FreqQPTs is cheaper and requires fewer restructuring operations than merging two FreqQPTs that are far apart from each other. In our work, we utilize the agglomerative hierarchical clustering paradigm. The basic idea of agglomerative hierarchical clustering is to begin with each FreqQPT as a distinct cluster and merge the two closest clusters in each subsequent step until a stopping criterion is met. The stopping criterion is typically either a similarity threshold or the number of clusters to be obtained. We choose to specify the number of clusters, since it is more intuitive and easier to specify than a similarity threshold, which is typically not known before the clustering process. Please note that k, the specified number of clusters to be obtained, should not be larger than the number of FreqQPTs; otherwise an error message is returned. This is because the QPs in the same FreqQPT are not allowed to be split further. In each step, the closest pair of FreqQPTs is found and merged into one FreqQPT, and the number of current clusters is decreased by 1 accordingly. This clustering process terminates when k clusters are obtained, as sketched below.
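A compact sketch of this clustering step follows; the similarity and merge functions are passed in as parameters (they are assumed to be the NodeSimN measure and the NCAN-based merge sketched earlier), so the fragment stays self-contained.

    # Agglomerative grouping of FreqQPTs: repeatedly merge the closest pair
    # (closeness measured between the trees' roots) until k integrated trees remain.
    from itertools import combinations

    def integrate_freq_qpts(trees, k, root_sim, merge):
        if k > len(trees):
            raise ValueError("k must not exceed the number of FreqQPTs")
        trees = list(trees)
        while len(trees) > k:
            i, j = max(combinations(range(len(trees)), 2),
                       key=lambda pair: root_sim(trees[pair[0]][0], trees[pair[1]][0]))
            merged = merge(trees[i], trees[j])
            trees = [t for idx, t in enumerate(trees) if idx not in (i, j)] + [merged]
        return trees  # schemas of the k integrated XML documents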
2.4. Acquire Data to Feed the Warehouse
The last step of building the XML data warehouse is to read data from the XML data sources once the schemas of the integrated XML documents are ready. Coming from different data sources across the Internet, these data may be incomplete, noisy, inconsistent, and duplicated. Processing efforts such as standardization, data cleaning, and conflict resolution need to be performed to make the data in the warehouse more consistent, clean, and concrete.
3 Processing of Queries Using the Data Warehouse
One of the main purposes of building a data warehouse is to facilitate query processing. When there is no data warehouse, queries are processed using the single-mediator architecture (shown in Figure 7), in which all queries are processed by this mediator and directed to the multiple XML data sources. When the data warehouse has been built, a dual-mediator architecture is adopted (shown in Figure 8). Mediator 1 processes all the incoming queries from users, and each query will be directed to the data warehouse, to mediator 2 (which is responsible for further directing the queries to the XML data sources), or to both.
Specifically, let QPSdwh be the QP set of the integrated XML documents in the data warehouse, and let QPTra(q) be the QP transaction of the query q. (i) If QPTra(q) ⊆ QPSdwh, meaning that all the QPs of q can be found in the schemas of the integrated XML documents, the query can be answered by using the data warehouse alone, so q is directed by mediator 1 only to the XML data warehouse. (ii) If QPTra(q) ⊄ QPSdwh and QPTra(q) ∩ QPSdwh is not empty, meaning that not all QPs of q can be found in the schemas of the integrated XML documents and the data warehouse does not contain enough information to answer q, then q is directed by mediator 1 to both the data warehouse and mediator 2. (iii) If QPTra(q) ∩ QPSdwh is empty, indicating that the information needed to answer q is not contained in the warehouse, q is directed by mediator 1 only to mediator 2.
[Figs. 7 and 8 sketch the two architectures: without the data warehouse, a single mediator directs user queries to the XML data sources; with the data warehouse, mediator 1 directs user queries to the XML data warehouse and/or to mediator 2, which accesses the XML data sources.]
Fig. 7. Query processing without data warehouse
Fig. 8. Query processing with data warehouse
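A minimal sketch of the routing rule applied by mediator 1, following cases (i)-(iii) above; the return values are illustrative labels and not part of the authors' system.

    def route(query_qps, qps_dwh):
        """Decide where mediator 1 directs a query, given its QP transaction."""
        qps = set(query_qps)
        if qps <= qps_dwh:
            return ["data warehouse"]                   # case (i)
        if qps & qps_dwh:
            return ["data warehouse", "mediator 2"]     # case (ii)
        return ["mediator 2"]                           # case (iii)

    print(route({"Book/Title", "Book/Price"},
                {"Book/Title", "Book/Price", "Book/ISBN"}))   # ['data warehouse']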
4. Experimental Results
In this section, we conduct experiments to evaluate the efficiency of constructing the schema of the integrated XML data and the speedup of query processing by means of the data warehouse we have built. We use a set of 50 XML documents about book information and generate their global DTD tree. A Zipfian distribution is employed to produce the transaction file of queries, because web queries and surfing patterns typically conform to Zipf's law [9]. In our work, the query transaction file contains 500 such synthetic queries, based on which the data warehouse is built. All experiments are carried out on a 900 MHz PC with 256 megabytes of main memory running Windows 2000.
4.1 Construction of the Data Warehouse Schema under Varying Number of Queries
Fig. 9. Efficiency of constructing the data warehouse schema under varying number of queries
Fig. 10. Comparative study on query answering time
First, we evaluate the time spent in constructing the schema of the integrated XML data of the warehouse under a varying number of queries from which frequent query patterns are extracted. The number of queries used is varied from 100 to 1,000. As shown in Figure 9, the time increases at an approximately exponential rate, since the number of FreqQPS candidates generated increases exponentially as the number of queries goes up.
4.2 Speedup of Query Processing Using the Data Warehouse
The major benefits of building a data warehouse based on frequent query patterns are not only obtaining a smaller but more concrete and clean subset of the original XML data sources, but also speeding up query processing. In this experiment, we measure the response time for answering queries with and without the aid of the data warehouse, respectively. The number of queries to be answered ranges from 100 to 1,000. The results shown in Figure 10 confirm that, by using the data warehouse we have built, query answering is faster than in the case when there is no such data warehouse. This is because the portion of information contained in the data
warehouse is smaller in size than that stored in the original data sources, reducing the volume of data that needs to be scanned in query answering. In addition, the data has undergone processing such as standardization, data cleaning, and conflict resolution, so the duplication of data is lower. The smaller size and lower duplication of the data in the warehouse contribute to the higher efficiency in query answering.
5. Conclusions
In this paper, we propose a novel approach to XML data warehousing based on the frequent query patterns discovered from historical users' queries. A specific rule-mining technique is employed to discover these frequent query patterns, based on which the schemas of the integrated XML documents are built. Frequent query patterns are represented using Frequent Query Pattern Trees (FreqQPTs), which are clustered using a hierarchical clustering technique according to the integration specification to build the schemas of the integrated XML documents. Experimental results show that query answering time is reduced compared to the case when there is no such data warehouse.
References
[1] H. Garcia-Molina, W. Labio, J. L. Wiener, and Y. Zhuge: Distributed and Parallel Computing Issues in Data Warehousing. In Proc. of ACM Principles of Distributed Computing Conference (PODS), Puerto Vallarta, Mexico, 1998.
[2] M. Golfarelli, S. Rizzi, and B. Vrdoljak: Data Warehouse Design from XML Sources. In Proc. of ACM DOLAP'01, Atlanta, Georgia, USA, Nov. 2001.
[3] S. M. Huang and C. H. Su: The Development of an XML-based Data Warehouse System. In Proc. of 3rd Intl. Conf. of Intelligent Data Engineering and Automated Learning (IDEAL'02), Springer LNCS 2412, pp. 206-212, Manchester, UK, Aug. 2002.
[4] O. Mangisengi, J. Huber, C. Hawel, and W. Essmayr: A Framework for Supporting Interoperability of Data Warehouse Islands using XML. In Proc. of 3rd Intl. Conf. DaWaK'01, Springer LNCS 2114, pp. 328-338, Munich, Germany, Sept. 2001.
[5] A. Marian, S. Abiteboul, G. Cobena, and L. Mignet: Change-centric Management of Versions in an XML Warehouse. In Proc. of Intl. Conf. on Very Large Data Bases (VLDB'01), pp. 581-590, Roma, Italy, Sept. 2001.
[6] K. Passi, L. Lane, S. Madria, B. C. Sakamuri, M. Mohania, and S. Bhowmick: A Model for XML Schema Integration. In Proc. of 3rd Intl. Conf. EC-Web, Springer LNCS 2455, pp. 193-202, Aix-en-Provence, France, Sept. 2002.
[7] XQuery Language 1.0. http://www.w3.org/TR/xquery/.
[8] L. Xyleme: A Dynamic Warehouse for XML Data of the Web. IEEE Data Engineering Bulletin, Vol. 24(2), pp. 40-47, 2001.
[9] L. H. Yang, M. L. Lee, W. Hsu, and S. Acharya: Mining Frequent Query Patterns from XML Queries. In Proc. of 8th Intl. Symp. on Database Systems for Advanced Applications (DASFAA'03), Kyoto, Japan, March 2003.
[10] L. Garber: Michael Stonebraker on the Importance of Data Integration. IT Professional, Vol. 1, No. 3, pp. 80, 77-79, 1999.
A Temporal Study of Data Sources to Load a Corporate Data Warehouse
Carme Martín and Alberto Abelló
Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya
Jordi Girona Salgado 1-3, E-08034 Barcelona, Catalunya
{martin,aabello}@lsi.upc.es
Abstract. The input data of the corporate data warehouse is provided by the data sources, that are integrated. In the temporal database research area, a bitemporal database is a database supporting valid time and transaction time. Valid time is the time when the fact is true in the modeled reality, while transaction time is the time when the fact is stored in the database. Defining a data warehouse as a bitemporal database containing integrated and subject-oriented data in support of the decision making process, transaction time in the data warehouse can always be obtained, because it is internal to a given storage system. When an event is loaded into the data warehouse, its valid time is transformed into a bitemporal element, adding transaction time, generated by the database management system of the data warehouse. However, depending on whether the data sources manage transaction time and valid time or not, we could obtain the valid time for the data warehouse or not. The aim of this paper is to present a temporal study of the different kinds of data sources to load a corporate data warehouse, using a bitemporal storage structure.
1 Introduction
As defined in [4], a data warehouse (DW) is an architectural structure that supports the management of "Subject-oriented", "Integrated", "Time-variant", and "Nonvolatile" data. A temporal database (TDB) is introduced in [9] as a database that supports "Valid time" (VT) (i.e. the time when the fact becomes effective in reality), or "Transaction time" (TT) (i.e. the time when the fact is stored in the database), or both times. Note that this definition excludes "User-defined time", which is an uninterpreted attribute domain of time directly managed by the user and not by the database system. A "bitemporal database" is a database that supports VT and TT. The affinity between both concepts (i.e. DW and TDB) may not be obvious. However, time references are essential in business decisions, and the dissection of both definitions shows their closeness.
Y. Kambayashi, M. Mohania, W. Wöß (Eds.): DaWaK 2003, LNCS 2737, pp. 109-118, 2003. Springer-Verlag Berlin Heidelberg 2003
Fig. 1. A Data Warehouse as a Bitemporal Database
We consider that the accepted definition of a DW in [4] could be rewritten in terms of TDB concepts. Firstly, "Time-variance" simply specifies that every record in the DW is accurate relative to some moment in time. On the other hand, the definition of VT in [6] states that it is the time when the fact is true in the modeled reality. Therefore, both outline the importance of showing when data is correct and exactly corresponds to reality. Moreover, "Non-volatility" refers to the fact that changes in the DW are captured in the form of a "time-variant snapshot". Instead of true updates, a new "snapshot" is added to the DW in order to reflect changes. This concept can be clearly identified with that of TT, defined in [6] as the time when the fact is current in the database. In [1], we define a DW as a bitemporal database containing integrated and subject-oriented data in support of the decision making process, as sketched in figure 1. The first implication of this definition is that TT is entirely maintained by the system, and no user is allowed to change it. Moreover, the system should also provide specific management mechanisms for VT. The importance of this temporal conception is also outlined in [8], which asks for DW systems to support advanced temporal concepts. Data in the DW comes from independent heterogeneous sources. Therefore, the TT and, especially, the VT introduced in the DW depend on which times are provided by the data sources. In [1], we propose a storage structure to implement a bitemporal DW. Our previous work concentrates on the most common case of data sources, i.e., specific and logged sources. In this paper, we explain the behaviour of our bitemporal storage structure for all the different kinds of data sources, to show that it can be used with any of them. The paper is organized as follows. The next section describes the bitemporal storage structure used throughout the paper. Section 3 explains the temporal study of the
different kinds of data sources and the implementation of these in the bitemporal storage structure. Finally, section 4 provides some conclusions.
2 A Bitemporal Storage Structure
In [1], we present a bitemporal storage structure, named Current/Historical, consisting of a Current table that reflects current data and a set of Historical tables that show the historical evolution of the data, as depicted in figure 2. The set of Current tables in the DW gives rise to the Operational Data Store, defined in [4]. A bitemporal event occurs at a starting VT and is true until an ending VT. However, if the ending VT is not known at this moment, since the data is currently valid in the sources, we use the special VT value "Now", whose semantics are explained in [3]. For example, if we hire an employee, the starting VT will be her/his hiring date, and the ending VT will be the value "Now", until s/he is fired. DW insertions initialize the starting TT to the current TT and the ending TT to the value "Until_changed" (UC). As the current time inexorably advances, the value of UC always reflects the current time. Moreover, the current TT is denoted by "Current". The Current table only contains two temporal values for each attribute (Ai), i.e. the starting VT (Vis) and the starting TT (Tis). The other time values for current data always contain "Now" for the ending VT and "UC" for the ending TT, so they do not need to be stored. Moreover, we have one Historical table for each set of attributes with the same temporal behaviour. In this way, each Historical table contains only one semantic concept, and all its attributes change at the same time. For example, if we store employee information (only one subject), we should use a different structure for home address and telephone, and another one for work address and telephone (two different temporal behaviours). If the company relocates somebody to a new room, only her/his job data will change. If s/he moves to a bigger house, only home data will change. Even though all data regards the same subject, it shows different time behaviour. Without loss of generality, in this paper we will assume that every table contains at most one attribute. Therefore, Historical tables contain the history of each attribute, with the starting VT (Vs), the ending VT (Ve), the TT when the insert was processed (Ti) and the TT when the delete was processed (Td). From these four temporal values, we can easily reconstruct the whole history of the value.
Fig. 2. Current and Historical tables in the bitemporal DW
In the TDB research area, insertion, deletion, and modification operations are defined in [7]. In addition, in [10], temporal insertion, deletion, and modification operations are explained for relational databases with time support. However, our bitemporal storage structure requires a redefinition of the TDB insertion, deletion and modification operations. On loading the DW, all we need to do to process an insert is to add a new record to the Current table. The processing of a delete is not much more difficult: the corresponding record is removed from the Current table, and a new one is added to each of the Historical tables. A modification replaces the old values in the Current table with the new values and adds a new record to the Historical table of the modified attribute. This behaviour can easily be inferred from the deletion and insertion behaviours.
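To make the redefined operations concrete, the following is a toy Python rendering of the Current/Historical structure, assuming one attribute per table as in the rest of the paper; the dictionary/list representation and the choice of the new starting VT as the ending VT of the closed record are assumptions of this sketch, not the authors' design.

    current = {}      # oid -> (value, Vs, Ts); "Now" and "UC" are implicit
    historical = []   # records (oid, value, Vs, Ve, Ti, Td)

    def dw_insert(oid, value, vs, ts):
        # an insert only adds a new record to the Current table
        current[oid] = (value, vs, ts)

    def dw_delete(oid, ve, td):
        # a delete removes the Current record and adds one record to the Historical table
        value, vs, ti = current.pop(oid)
        historical.append((oid, value, vs, ve, ti, td))

    def dw_modify(oid, new_value, vs_new, ts_new):
        # a modification closes the old value in the Historical table and
        # replaces the Current record with the new value
        dw_delete(oid, ve=vs_new, td=ts_new)
        dw_insert(oid, new_value, vs_new, ts_new)

    # loading the databases course, then changing its cost later on
    dw_insert("C1", ("databases", 100), vs=1, ts=1)
    dw_modify("C1", ("databases", 120), vs_new=5, ts_new=5)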
3 Temporal Study of Data Sources
The input data of the DW is provided by the data sources, which are integrated. Depending on whether the data sources manage TT and VT or not, we may or may not be able to obtain the VT for the DW. The TT in the DW can always be obtained, because it is internal to a given storage system. When an event is loaded into the DW, its VT, supplied by the "Extraction, Transformation and Load" (ETL) module, is transformed into a bitemporal element, adding the TT generated by the DW DBMS. In the next sections, we study the different kinds of data sources proposed in [5] and how they are handled in our bitemporal storage structure. Throughout the paper we consider, as an example, a bitemporal DW with a courses relation having OID, Name and Cost attributes.
3.1 Snapshot and Queryable Sources
Snapshot sources are sources whose content can only be accessed through a dump of their data. A queryable source is a data source that offers a query interface. From snapshot and queryable sources that do not keep any kind of time, we can only store the TT in the DW. If such sources do not have any temporal information, we can only consider timestamping the data while we extract them. In the absence of true VT, all we can do is approximate it by the DW TT. In figure 3, we show the insertion of a databases course with a 100 euro cost into the courses relation. On inserting this course into the bitemporal DW, the starting TT is recorded. Given the bitemporal nature of the DW, all we can do in this case is approximate the starting VT by means of TT information. The best approximation we can obtain for the starting VT is the starting point of the "update window", because all we know is that the event occurred before the beginning of the DW load. Moreover, since queries are not allowed during the "update window", we can consider that the load is atomic, in the sense that there is no temporal order among the operations. Therefore, the TT can also be fixed at the beginning of the "update window". Thus, TT and VT will have the same value.
Fig. 3. The databases course insertion in snapshot and queryable sources
Fig. 4. The databases course deletion in snapshot and queryable sources
In figure 4, the deletion process of the databases course is described. The deletion operation eliminates the bitemporal element from the Current table and adds a new one to each Historical table. Similarly to the insertion operation, we now approximate the ending VT by means of the TT.
3.2 Specific and Logged Sources
Specific sources are able to write "delta files" that describe the system actions. Logged sources have a "log file" where all their actions are registered. From specific and logged sources (those able to keep track of the performed operations), if they timestamp the entries with the source TT, we can approximate the VT by means of it. If no other information exists, the data can be considered valid while it is current in the operational database. This is the most usual environment for a DW.
Fig. 5. The databases course insertion in specific and logged sources
Fig. 6. The databases course deletion in specific and logged sources
In figure 5, we show the insertion of a databases course with a 100 euro cost and a source TT value of 1. We can see that the source TT is converted into VT, and we also have the TT of the DW. In figure 6, the deletion process of the databases course is described. When a deletion operation comes from the data source, it has an ending TT value to be converted into an ending VT value in the DW. As shown, a logical deletion generates the physical removal of the existing bitemporal element from the Current table. The ending TT of the Historical table is the TT of the DW. However, this is not enough, and a new bitemporal element is added to the Historical tables, which expresses that from now on we know the ending VT value, i.e. the timestamp of the data source.
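The following fragment contrasts how an ETL step might derive the starting VT and TT recorded in the Current table for the two kinds of sources discussed so far; the entry layout and the update_window_start parameter are assumptions of this sketch.

    def starting_times(entry, dw_tt, update_window_start):
        if entry.get("source_tt") is not None:
            # specific/logged source: the source TT approximates the VT (figure 5)
            return {"Vs": entry["source_tt"], "Ts": dw_tt}
        # snapshot/queryable source: no temporal information is available, so the VT
        # is approximated by the start of the update window; since the load is
        # treated as atomic, the TT is fixed at the same instant (figure 3)
        return {"Vs": update_window_start, "Ts": update_window_start}

    print(starting_times({"oid": "C1", "value": 100, "source_tt": 1}, dw_tt=7,
                         update_window_start=7))   # {'Vs': 1, 'Ts': 7}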
3.3 Callback and Internal Action Sources
Callback sources are sources that provide triggers or other active capabilities. Internal action sources are similar to callback sources, except that defining triggers in the data source requires creating auxiliary relations called "delta tables". In these
"cooperative" sources, the TT in the sources needs to be used again to approximate the VT. However, this kind of data source offers two different kinds of DW load:
Deferred Load. The most common possibility is to use the triggers to generate "delta files", and later on use these files to load the DW. Thus, the temporal information in the "delta files" will be the source TT. We have already described this case in section 3.2.
Real-Time Load. Another possibility would be that both repositories (i.e. the data source and the DW) are updated at the same time. Then, the TT of the DW corresponds to the source TT. Notice that this is the same assumption that we have considered for snapshot and queryable sources. Therefore, this case has the same temporal behaviour as snapshot and queryable sources, explained in section 3.1. Nevertheless, they do not have the same temporal knowledge: in callback and internal action sources we really have a source TT to approximate the VT, while in snapshot and queryable sources we have no temporal information in the source, so this is a better approximation.
3.4 Bitemporal Data Sources
Bitemporal data sources are sources whose data are stored in a bitemporal database. From bitemporal data sources (not considered in [5]), we can obtain true VT besides TT. Moreover, we also know the TT of the DW. Therefore, we should choose one of those two temporal attributes of the sources to be used as VT in the DW (the other one will be managed as a user-defined time attribute):
Source TT Used as VT. If the VT of the DW is obtained from the source TT, the source VT could be an additional user-defined time attribute to be considered. In this case, bitemporal data sources have the same temporal behaviour as specific and logged sources (explained in section 3.2), with an additional user-defined time attribute to express true valid time information. The temporal information provided to the ETL process will be: the source TT (which will give rise to the VT of the DW), and the source VT (which will be treated as a user-defined time attribute). Considering temporal attributes in this way allows the Current/Historical storage structure to keep being used effectively.
Source VT Used as VT. If the VT of the DW is obtained from the source VT, the source TT could be an additional user-defined time attribute to be considered. These bitemporal sources provide the following temporal information: the source VT (which will give rise to the VT of the DW), and the source TT (which will be treated as a user-defined time attribute). Since the VT in the data source is an interval, we need to add another temporal attribute to the Current table to record the ending VT. Figure 7 presents this new possibility for bitemporal sources. Two new attributes need to be added to the Current table for every attribute: one for the ending VT and another one for the user-defined time representing the source TT. Notice that the Current table represents current data in the source. Therefore, the ending source TT
will always be UC so that it is not necessary to record it. In this example, starting and ending VT are later than the load of the DW. Figure 8 shows the deletion of the databases course inserted in figure 7. Regarding the Historical tables, also two new user-defined time attributes need to be added to each table: one for starting source TT and another one for ending TT. We already had two attributes for starting and ending VT.
Fig. 7. The databases course insertion in bitemporal sources
Fig. 8. The databases course deletion in bitemporal sources
Fig. 9. A coalescing example
The main problem in this case (not taking into account the increase in the number of attributes of the different tables) is that the OID is no longer an identifier by itself for the Current table. Moreover, we cannot guarantee that there is the same number of current values for each attribute, so we should divide the Current table into independent tables for every attribute. Thus, we would have the same number of Current and Historical tables, which would worsen the performance of this storage technique. For this kind of source, in order to reduce storage space, when two time intervals overlap or are adjacent, the coalescing operation of TDBs [2] could be applied either to Current or to Historical tables. When two tuples have identical non-temporal attribute values and coincide on one of the two time intervals (either VT or TT), the coalescing operation can collapse them into one tuple if the other time intervals overlap or if the ending point of one of them is the starting point of the other. Even if the bitemporal source uses the coalescing operation, we can still find tuples to be coalesced in the DW if they come from different sources. As an example, in figure 9 we see that the databases course information could be coalesced into only one tuple.
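A small sketch of the coalescing step, restricted for readability to a single time dimension (closed integer intervals standing in for valid time); this representation is an assumption of the sketch.

    def coalesce(tuples):
        """Merge tuples with the same value whose intervals overlap or are adjacent."""
        out = []
        for value, start, end in sorted(tuples):
            if out and out[-1][0] == value and start <= out[-1][2] + 1:
                prev_value, prev_start, prev_end = out.pop()
                out.append((value, prev_start, max(prev_end, end)))
            else:
                out.append((value, start, end))
        return out

    print(coalesce([("databases", 1, 4), ("databases", 5, 9), ("xml", 12, 15)]))
    # -> [('databases', 1, 9), ('xml', 12, 15)]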
4 Conclusions
In this paper, we have presented a temporal study of the different kinds of data sources used to load a DW. The correspondences between temporal attributes in the data sources and those in the DW have been analyzed. In the corporate DW, we have identified the two existing orthogonal temporal dimensions: the valid time dimension and the transaction time dimension. In this bitemporal DW environment, we have used our bitemporal storage structure [1] to represent all the temporal knowledge obtained from the different kinds of data sources. Analyzing these different data sources, presented in [5] with an additional type, i.e. bitemporal sources, we found that in some cases snapshot and queryable sources could have the same temporal behaviour as callback and internal action sources (just the latter being more
precise). In general, these "cooperative" sources would behave like specific and logged sources. Regarding bitemporal data sources, they are more difficult to manage and, in general, would need an ad hoc storage structure. However, using an appropriate interpretation, they can also be treated as specific and logged sources. Thus, we can use the Current/Historical bitemporal storage structure to warehouse any kind of data sources.
Acknowledgments The authors would like to thank Núria Castell for the support she has given to this work. This work has been partially supported by the Spanish Research Program PRONTIC under project TIC2000-1723-C02-01.
References
[1] Abelló, A.; Martín, C.: "A Bitemporal Storage Structure for a Corporate Data Warehouse". Proc. of the 5th Int. Conf. on Enterprise Information Systems (ICEIS), pages 177-183, 2003.
[2] Böhlen, M.; Snodgrass, R.T.; Soo, M.D.: "Coalescing in Temporal Databases". In Proc. of the 22nd Int. Conf. on Very Large Data Bases (VLDB), pages 180-191, 1996.
[3] Clifford, J.; Dyreson, C.; Isakowitz, T.; Jensen, C.S.; Snodgrass, R.T.: "On the Semantics of 'Now' in Databases". ACM Transactions on Database Systems, 22(2):171-214, 1997.
[4] Inmon, W.H.; Imhoff, C.; Sousa, R.: "Corporate Information Factory". John Wiley & Sons, second edition, 1998.
[5] Jarke, M.; Lenzerini, M.; Vassilious, Y.; Vassiliadis, P. (editors): "Fundamentals of Data Warehousing". Springer-Verlag, 2000.
[6] Jensen, C.S.; Clifford, J.; Gadia, S.K.; Segev, A.; Snodgrass, R.T.: "A Glossary of Temporal Database Concepts". ACM SIGMOD Record, 21(3):35-43, 1992.
[7] Jensen, C.S.; Soo, M.D.; Snodgrass, R.T.: "Unifying Temporal Data Models via a Conceptual Model". Information Systems, 19(7):513-547, 1994.
[8] Pedersen, T.B.; Jensen, C.S.: "Research Issues in Clinical Data Warehousing". In Proc. of the 10th Int. Conf. on Statistical and Scientific Database Management (SSDBM), pages 43-52, 1998.
[9] Snodgrass, R.T.; Ahn, I.: "Temporal Databases". IEEE Computer, 19(9):35-42, 1986.
[10] Snodgrass, R.T.: "Developing Time-Oriented Database Applications in SQL". Morgan Kaufmann Publishers, 2000.
Automatic Detection of Structural Changes in Data Warehouses
Johann Eder, Christian Koncilia, and Dieter Mitsche§
University of Klagenfurt, Dep. of Informatics-Systems
{eder,koncilia}@isys.uni-klu.ac.at
§ [email protected]
Abstract. Data warehouses provide sophisticated tools for analyzing complex data online, in particular by aggregating data along dimensions spanned by master data. Changes to these master data are a frequent threat to the correctness of OLAP results, in particular for multi-period data analysis, trend calculations, etc. As dimension data might change in underlying data sources without the data warehouse being notified, we explore the application of data mining techniques for detecting such changes, and thereby contribute to avoiding incorrect results of OLAP queries.
1 Introduction and Motivation
A data warehouse is a collection of data stemming from different, frequently heterogeneous data sources and is optimized for complex data analysis operations rather than for transaction processing. The most popular architecture is the multidimensional data warehouse (data cube), where facts (transaction data) are "indexed" by several orthogonal dimensions representing a hierarchical organization of master data. OLAP (on-line analytical processing) tools allow the analysis of this data, in particular by aggregating data along the dimensions with different consolidation functions. Although data warehouses are typically deployed to analyse data from a longer time period than transactional databases, they are not well prepared for changes in the structure of the dimension data. This surprising observation originates in the (implicit) assumption that the dimensions of data warehouses ought to be orthogonal, which, in the case of the dimension time, means that all other dimensions ought to be time-invariant. In this paper we address another important issue: how can such structural changes be recognized, even if the sources do not notify the data warehouse about the changes? Such "hidden" changes are a problem, because (usually) such changes are not modifications of the schema of the data source. E.g., inserting the data of a new product or a new employee is a modification on the instance level. However, in the data warehouse such changes result in a modification of its structure. Of course, this defensive strategy of recognizing structural changes can only be an aid to avoid some problems; it is not a replacement for adequate means for managing knowledge about changes.
Y. Kambayashi, M. Mohania, W. Wöß (Eds.): DaWaK 2003, LNCS 2737, pp. 119-128, 2003. © Springer-Verlag Berlin Heidelberg 2003
Nevertheless, in several practical situations we were able to trace erroneous results of OLAP queries back to structural changes not known to the analysts and the data warehouse operators, erroneous in the sense that the resulting data did not correctly represent the state of affairs in the real world. As a means for detecting such changes we propose the use of data mining techniques. In a nutshell, the problem can be described as a multidimensional outlier detection problem. In this paper we report on some experiments we conducted to analyse
– which data mining techniques might be applied for the detection of structural changes,
– how these techniques are best applied,
– whether these techniques effectively detect structural changes,
– whether these techniques scale up to large data warehouses.
To the best of our knowledge, this is the first time that the problem of changes in dimensions of data warehouses is addressed with data mining techniques. The problems related to the effects of structural changes in data warehouses and approaches to overcome the problems they cause were the subject of several projects [Yan01, BSH99, Vai01, CS99], including our own efforts [EK01, EKM02] to build a temporal data warehouse structure with means to transform data between structural versions such that OLAP tools work on data cleaned of the effects of structural changes. The remainder of this paper is organized as follows: in section 2 we give basic definitions and discuss the notion of structural changes in data warehouses. In section 3 we briefly introduce the data mining techniques we analyzed for the detection of structural changes and introduce a procedure for applying these techniques. In section 4 we briefly discuss the experiments we conducted as proof of concept. Finally, in section 5 we draw some conclusions.
2 Structural Changes
We will now briefly discuss different types of structural changes. Furthermore, we will argue why some of these structural changes do not need to be detected automatically. In [EK01] we showed how the basic operations INSERT, UPDATE and DELETE have to be adapted for a temporal data warehouse. With respect to dimension members, i.e., the instances of dimensions, these basic operations may be combined to represent the following complex operations, where Q is the chronon, i.e., "a non-decomposable time interval of some fixed, minimal duration" [JD98] (a sketch of the translation of the first two operations is given after the list): i.) SPLIT: One dimension member M splits into n dimension members M1, ..., Mn. This operation translates into a DELETE(M, Ts − Q) and a set of insert operations INSERT(Mi, Ts).
[Fig. 1 depicts four structure versions of a dimension hierarchy: SV1, the initial outline; SV2, after inserting a new Div. A and changing the name of SubDiv. C; SV3, after splitting Div. A into Div. A1 and Div. A2; and SV4, after merging Div. A1 and A2 and deleting SubDiv. X. Mapping functions for the facts F1, ..., Fn connect consecutive structure versions.]
Fig. 1. An example of structural changes
For instance, Figure 1 shows a split operation between the structure versions SV2 and SV3 where a division "Div. A" splits up into two divisions "Div. A1" and "Div. A2". We would need one delete operation ("Div. A") and two inserts ("Div. A1" and "Div. A2") to cope with this.
ii.) MERGE: n dimension members M1, ..., Mn are merged together into one dimension member M. This operation translates into a set of delete operations DELETE(Mi, Ts − Q) and an insert operation INSERT(M, Ts). A merge is the opposite of a split, i.e., a split in one direction of time is always a merge in the opposite direction of time. Consider, for the example given, that these modifications occur at timepoint T. For each analysis that requires measures from a timepoint before T for the structure version which is valid at timepoint T, we would call these modifications "a split". For each analysis that requires measures from timepoint T for a structure version valid before timepoint T, these modifications would be called "a merge".
iii.) CHANGE: An attribute of a dimension member changes, for example, if the product number (key) or the name of a department (user-defined attribute) changes. Such a modification can be carried out by using the update operation defined above. With respect to dimension members representing measures, CHANGE could mean that the way the measure is computed changes (for example, the way the unemployment rate is computed changed in Austria when it joined the European Union in 1995) or that the unit of the facts changes (for instance, from Austrian Schillings to EURO).
iv.) MOVE: A dimension member moves from one parent to another, i.e., we modify the hierarchical position of a dimension member. For instance, a product P no longer belongs to product group GA but to product group GB. This can be done by changing the DMid (parent ID) of the corresponding dimension member with an update operation.
v.) NEW-MEMBER: A new dimension member is inserted, for example, if a new product becomes part of the product spectrum. This modification can be done by using an insert operation.
vi.) DELETE-MEMBER: A dimension member is deleted, for instance, if a branch disbands. Just as a merge and a split are related depending on the direction of time, the same holds for the operations NEW-MEMBER and DELETE-MEMBER. In contrast to a NEW-MEMBER operation, there is a relation between the deleted dimension member and the following structure version. Consider, for example, that we delete a dimension member "Subdivision B" at timepoint T. If, for the structure version valid at timepoint T, we requested measures from a timepoint before T, we could still get valid results by simply subtracting the measures for "Subdivision B" from its parent. For two of these operations, namely NEW-MEMBER and DELETE-MEMBER, there is no need to use data mining techniques to detect the modifications automatically. When loading data from the data sources for a dimension member which is new in the data source but does not yet exist in the warehouse, the NEW-MEMBER operation is detected automatically by the ETL tool (extraction, transformation and loading tool). On the other hand, the ETL tool automatically detects when no fact values are available in the data source for deleted dimension members.
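The detection of these two operations reduces to a comparison of the set of dimension member keys delivered by the source with the set already stored in the warehouse dimension. The following Python fragment is a minimal sketch of this check as an ETL tool might perform it; the function name and the example keys are illustrative and not taken from the paper.

def detect_member_changes(source_keys, warehouse_keys):
    """Compare the dimension member keys found in the data source
    with those already known to the warehouse dimension."""
    source_keys = set(source_keys)
    warehouse_keys = set(warehouse_keys)
    new_members = source_keys - warehouse_keys       # NEW-MEMBER candidates
    deleted_members = warehouse_keys - source_keys   # DELETE-MEMBER candidates
    return new_members, deleted_members

# Example: product P4 appears for the first time, P2 no longer occurs in the source.
new, deleted = detect_member_changes(["P1", "P3", "P4"], ["P1", "P2", "P3"])
# new == {"P4"}, deleted == {"P2"}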
3 Data Mining Techniques
In this section a selection of different data mining techniques for the automatic detection of structural changes is presented. Whereas the first subsection gives an overview of some possible data mining techniques, the second subsection focuses on multidimensional extensions of the methods. Finally, a stepwise approach to detect structural changes at different layers is proposed.
3.1 Possible data mining methods
The simplest method for detecting structural changes is the calculation of deviation matrices. Absolute and relative differences between consecutive values, and differences in the shares of each dimension member between two chronons, can be easily computed; the runtime of this approach is clearly linear in the number of analyzed values. Since this method is very fast, it should be used as a first sieve. A second approach, whose runtime complexity is of the same order as the calculation of traditional deviation matrices, is the attempt to model a given data set with a stepwise constant differential equation (or perhaps with a simple functional equation). This model, however, only makes sense if there exists some rudimentary, basic knowledge about the factors that could have caused the development of certain members (but not exact knowledge, since in that case no data mining would be needed anymore). After having solved the equation (for solution techniques of stepwise differential equations refer to [Dia00]), the relative and absolute differences between the predicted value and the actual value can be considered to detect structural changes.
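As an illustration, the following Python sketch computes the deviation matrices mentioned above (absolute differences, relative differences, and differences in shares between consecutive chronons) for a set of dimension members. The data layout and variable names are assumptions made for the sketch, and the measures are assumed to be fully additive; as claimed above, the runtime is linear in the number of analyzed values.

def deviation_matrices(values):
    """values: dict mapping a dimension member to its list of measures,
    one value per chronon (all lists of equal length)."""
    chronons = len(next(iter(values.values())))
    totals = [sum(v[t] for v in values.values()) for t in range(chronons)]
    absolute, relative, share_diff = {}, {}, {}
    for member, series in values.items():
        absolute[member] = [series[t + 1] - series[t] for t in range(chronons - 1)]
        relative[member] = [
            (series[t + 1] - series[t]) / series[t] if series[t] else float("inf")
            for t in range(chronons - 1)
        ]
        share_diff[member] = [
            series[t + 1] / (totals[t + 1] or 1) - series[t] / (totals[t] or 1)
            for t in range(chronons - 1)
        ]
    return absolute, relative, share_diff

# Illustrative usage with two dimension members observed over three chronons:
# members = {"SD21": [100, 20, 60], "SD22": [500, 500, 700]}
# abs_d, rel_d, share_d = deviation_matrices(members)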
Other techniques that can be used for detecting structural changes are mostly techniques that are also used for time-series analysis:
– Autoregression: a significantly high absolute and relative difference between a dimension member's actual value and its value predicted via a simple ARMA (AutoRegressive Moving Average)(p,q) model (or, if necessary, an ARIMA (AutoRegressive Integrated Moving Average)(p,d,q) model, perhaps even with extensions for seasonal periods) is an indicator of a structural change of that dimension member.
– Autocorrelation: the usage of this method is similar to the method of autoregression. The results of this method, however, can be easily visualized with the help of correlograms.
– Crosscorrelation and regression: these methods can be used to detect significant dependencies between two different members. In particular, a very low correlation coefficient (or a very inaccurate prediction with a simple regression model) can point to the roots of a structural change.
– Discrete Fourier transform (DFT), discrete cosine transform (DCT), different types of discrete wavelet transforms: the maximum difference (scaled by the mean of the vector) as well as the overall difference (scaled by the mean and length of the vector) of the coefficients of the transforms of two dimension members can be used to detect structural changes.
– Singular value decomposition (SVD): unusually high differences in singular values can be used for detecting changes in the measure dimension when analyzing the whole data matrix. If single dimension members are compared, the differences of the eigenvalues of the covariance matrices of the dimension members (i.e., principal component analysis) can be used in the same way.
In this paper, due to lack of space, no detailed explanation of these methods is given; for details refer to [Atk89] (Fourier transform), [Vid99] (wavelet transforms), [BD02] (autoregression and autocorrelation), [Hol02] (SVD and principal component analysis), and [Wei85] (linear regression and crosscorrelation).
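As a hedged sketch of one of these techniques, the fragment below compares the discrete Fourier transform coefficients of two dimension members' time series and returns the two scaled differences mentioned above. The exact scaling is one plausible reading of the description, not the authors' formulation, and numpy is assumed to be available.

import numpy as np

def dft_differences(series_a, series_b):
    """Compare the DFT coefficients of two dimension members' time series.
    Both series are assumed to cover the same chronons. Returns the maximum
    coefficient difference scaled by the mean of the vectors and the overall
    difference scaled by mean and length."""
    a = np.asarray(series_a, dtype=float)
    b = np.asarray(series_b, dtype=float)
    coeff_a, coeff_b = np.fft.fft(a), np.fft.fft(b)
    diff = np.abs(coeff_a - coeff_b)
    scale = np.mean(np.abs(np.concatenate([a, b]))) or 1.0
    max_diff = diff.max() / scale
    overall_diff = diff.sum() / (scale * len(a))
    return max_diff, overall_diff

# Members whose scaled differences exceed a chosen threshold are flagged as
# candidates for a structural change; the threshold itself is not prescribed here.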
3.2 Multidimensional Structures
Since in data warehouses there is usually a multidimensional view of the data, the techniques shown in the previous section have to be applied carefully. If all structure dimensions are considered simultaneously and a structural change occurred in one structure dimension, it is impossible to detect the dimension that was responsible for this change. Therefore, the values of the data warehouse have to be grouped along one structure dimension (we considered only fully-additive measures that can be summed along each structure dimension; it is one aspect of further work to check the approach for semi-additive and non-additive measures). On this two-dimensional view the methods of the previous section can then be applied to detect the changed structure dimension. If, however, a change happens in two or more structure dimensions at the same time, the analysis of the values grouped along one structure dimension will not be successful: either
a few (or even all) structure dimensions will show big volatilities in their data, or not a single structure dimension will show significant changes. Hence, if a change in the structure dimensions is assumed, and none can be detected with the methods described above, the values have to be grouped along two structure dimensions. The methods can then be applied on this view; if changes still cannot be detected, the values are grouped by three structure dimensions, and so on. The approach of analyzing structural changes in data warehouses by grouping values by just one structure dimension in the initial step was chosen for two reasons: 1.) In the vast majority of all structural changes only one structure dimension will be affected. 2.) The runtime and memory complexity of the analysis is much smaller when values are grouped by just one structure dimension: let Di denote the number of elements in the i-th structure dimension (i = 1 ... n; D1 ≥ D2 ≥ ... ≥ Dn); then in the first step only D1 + D2 + ... + Dn = O(D1) different values have to be analyzed, in the second step already D1·D2 + ... + D1·Dn + ... + Dn−1·Dn = O(D1^2), and in the i-th step therefore O(D1^i), i = 1 ... n.
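The grouping step just described can be sketched with pandas as follows; the column names are illustrative assumptions. Grouping the facts along a single structure dimension yields the two-dimensional view (dimension members by chronons) to which the methods of Section 3.1 can be applied; grouping along pairs of dimensions (the O(D1^2) step) simply uses a list of two dimensions as the index.

import pandas as pd

def grouped_views(facts, structure_dims, chronon_col="year", measure_col="value"):
    """facts: DataFrame with one column per structure dimension, a chronon
    column and a fully additive measure column. Returns, for every structure
    dimension, a pivoted view with one row per dimension member and one
    column per chronon."""
    return {dim: facts.pivot_table(index=dim, columns=chronon_col,
                                   values=measure_col, aggfunc="sum")
            for dim in structure_dims}

# views = grouped_views(facts, ["SD1", "SD2", "SD3", "SD4"])
# Each views[dim] can now be scanned with the deviation matrices of Section 3.1.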
3.3 Stepwise Approach
As a conclusion of the previous sections we propose a stepwise approach to detect different types of structural changes at different layers: 1.) In the first step the whole data matrix of each measure in the data warehouse is analyzed to detect changes in the measure dimension (change of the calculation formula of a measure, change of the metric unit of a measure). Primarily, a simple deviation matrix that calculates the differences of the sums of all absolute values between two consecutive chronons can be applied here. If these differences between two chronons are substantially bigger than those between other chronons, then this is an indicator of a change in the measure dimension. If runtime performance is not too critical, SVD and DCT can also be carried out to detect changes. Changes detected at this level must be either corrected or eliminated; otherwise the results in the following steps will be biased by these errors. 2.) In the next step the data are grouped by one structure dimension. The deviation matrices described above can be applied here to detect dimension members that were affected by structural changes. 3.) If the data grouped by one structure dimension can be adequately modelled with a stepwise constant differential equation (or a simple functional equation), then the deviation matrices that calculate the absolute and relative difference between the model-estimated value and the actual value should also be used. 4.) In each structure dimension where one dimension member is known that definitely remained unchanged throughout all chronons (fairly stable, so that it can be considered a dimension member with an average development,
mostly a dimension member with rather big absolute values), other data mining techniques such as autocorrelation, autoregression, discrete Fourier transform, discrete wavelet transform, principal component analysis, crosscorrelation and linear regression can be used to compare this 'average' dimension member with any other dimension member detected in steps 2 and 3. If one of the methods shows big differences between the average dimension member and the previously detected dimension member, then this is an indicator of a structural change of the latter. Hence, these methods are used on the one hand to narrow down the selection of detected dimension members, and on the other hand to 'prove' the results of the previous steps. However, none of these methods should be applied to a dimension member that is lacking values, whose data are too volatile, or whose values are often zero. If no 'average' dimension member is known, the dimension members detected in previous steps can also be compared with the sum of the absolute values of all dimension members. In any case, for performance reasons it is recommended to use the method of autocorrelation first; among all wavelet transforms the Haar transform is the fastest. 5.) If in steps 2, 3 and 4 no (or not all) structural changes are detected and one still assumes structural changes, then the values are grouped by i+1 structure dimensions, where i (i = 1 ... n − 1, n = number of structure dimensions) is the number of structure dimensions that were used for grouping values in the current step. Again, steps 2, 3 and 4 can be applied.
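As a minimal sketch of step 4, the fragment below compares a dimension member detected in steps 2 and 3 against a stable 'average' member using correlation and a simple linear regression; the thresholds are illustrative assumptions rather than values proposed by the authors.

import numpy as np

def compare_with_average(candidate, average, threshold=0.8):
    """Flag a structural change when the candidate series correlates poorly
    with the stable 'average' member, or when a simple linear regression of
    the candidate on the average member predicts it badly. Both series are
    assumed to cover the same chronons."""
    x = np.asarray(average, dtype=float)
    y = np.asarray(candidate, dtype=float)
    corr = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, 1)          # simple linear regression
    residuals = y - (slope * x + intercept)
    rel_error = np.abs(residuals).mean() / (np.abs(y).mean() or 1.0)
    suspicious = corr < threshold or rel_error > (1 - threshold)
    return suspicious, corr, rel_error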
4 Experiments
The stepwise approach proposed in the previous section was tested on many small data sets and one larger sample data set. Here one small example is given to show the usefulness of the stepwise approach. Consider the data warehouse with one measure dimension and four structure dimensions given in Table 1, in which three structural changes and one measure change are hidden: between year1 and year2 dimension member SD21 is reduced to 20% of its original value (UPDATE, MOVE or SPLIT); between year2 and year3 dimension members SD31 and SD32 swap their structure (UPDATE or MOVE); between year3 and year4 dimension member SD41 loses 70% of its value to SD42 (MOVE); and between year3 and year4 the currency of the measure changes (from EURO to ATS, values multiplied by 14). In this case one might detect the measure change in the last chronon simply by visual inspection of all values, without calculating deviation matrices; in large data warehouses, however, this becomes infeasible. In the first step the data matrix is checked for changes in the measure dimension: the differences of the sums of all absolute values of two consecutive chronons are calculated. As can be seen from iteration one of Table 2, the difference between year3 and year4 (122,720) is by far the biggest, a very strong indicator of a measure change between these years. When asked for possible explanations, domain experts should recognize the introduction of a new currency. To be able to continue the analysis, all values in the data warehouse have to be expressed in the same currency.
SD1   SD2   SD3   SD4    year1   year2   year3    year4
SD11  SD21  SD31  SD41     100      20      60      252
SD11  SD21  SD31  SD42     200      40      80    1,708
SD11  SD21  SD32  SD41     300      60      20       84
SD11  SD21  SD32  SD42     400      80      40      756
SD11  SD22  SD31  SD41     500     500     700    2,940
SD11  SD22  SD31  SD42     600     600     800   18,060
SD11  SD22  SD32  SD41     700     700     500    2,100
SD11  SD22  SD32  SD42     800     800     600   13,300
SD12  SD21  SD31  SD41     900     180     220      924
SD12  SD21  SD31  SD42   1,000     200     240    5,516
SD12  SD21  SD32  SD41   1,100     220     180      756
SD12  SD21  SD32  SD42   1,200     240     200    4,564
SD12  SD22  SD31  SD41   1,300   1,300   1,500    6,300
SD12  SD22  SD31  SD42   1,400   1,400   1,600   37,100
SD12  SD22  SD32  SD41   1,500   1,500   1,300    5,460
SD12  SD22  SD32  SD42   1,600   1,600   1,400   32,340
SD = structure dimension, SDij = j-th dimension member in structure dimension i
Table 1. Structural changes in a data warehouse with four structure dimensions
Diff          year12   year23    year34
iteration 1   -4,160        0   122,720
iteration 2   -4,160        0         0
Diff = absolute difference, yearmn = comparison of year m with year n
Table 2. Detection of changes in the measure dimension
Therefore, all values in the last column (year4) are divided by 14. Having corrected the problem of different currencies, the biggest remaining difference is -4,160 between year1 and year2 (see the iteration 2 row in Table 2). According to the domain experts, this difference cannot be explained by changes in the measure dimension, and hence the approach can be continued with the analysis of changes in the structure dimensions. In the next step the values in the data warehouse are grouped by one structure dimension; on the resulting view the differences of the shares of the dimension members are calculated (this deviation matrix was chosen because it shows the outliers most clearly in this case). As can be seen from Table 3, the formerly 'hidden' structural changes become obvious: all three structural changes are detected. (In this example, with just two dimension members per structure dimension, a change in one member is necessarily mirrored in the other; it is therefore not known whether between year1 and year2 dimension member SD21 or SD22 changed. In real-world data warehouses with many more dimension members, however, it usually is clear which dimension member changed.) Here, due to lack of space, steps 3 and 4 are omitted; if one assumes further structural changes, the detected structural changes have to be corrected, and the above deviation
∆(%)    year12    year23    year34
SD11     3.19%        0%        0%
SD12    -3.19%        0%        0%
SD21   -27.22%        0%        0%
SD22    27.22%        0%        0%
SD31      0.8%    10.17%        0%
SD32     -0.8%   -10.17%        0%
SD41      0.4%        0%   -33.22%
SD42     -0.4%        0%    33.22%
SD = structure dimension, SDij = j-th dimension member in structure dimension i, ∆(%) = change in the share of a dimension member between two consecutive years, yearmn = comparison of shares of different dimension members between year m and year n
Table 3. Detection of changes in the structure dimension
matrix can be calculated once again. In this case, however, all differences of all dimension members between all years are zero; all dimension members stay unchanged throughout the four years. Hence, a further analysis of combined structural changes is unnecessary. On a large sample data set (40 GB) we tested the performance of our proposed approach: the methods showed good scalability; all methods took less than six seconds (Pentium III, 866 MHz, 128 MB SDRAM), except SVD and DCT on the whole data matrix, which took six minutes. This example, however, also showed that the quality of the results of the different methods depends very much on the quality and the volatility of the original data.
5 Conclusions
Unknown structural changes lead to incorrect results of OLAP queries, to analysis reports with wrong data, and probably to suboptimal decisions based on these data. Analysts need not even see such changes in the data: the changes might be hidden in the lower levels of the dimension hierarchies, which are typically only looked at in drill-down operations, but of course they influence the data derived on higher levels. Pro-active steps are necessary to avoid incorrect results due to neglected structural changes. Some changes might be detected when data is loaded into the database, or when change-logs of the sources are forwarded to the data warehouse. Nevertheless, changes stemming from different sources might appear unnoticed in data warehouses. Therefore, we propose to apply data mining techniques for detecting such changes, or more precisely for detecting suspicious, unexpected data characteristics which might originate from unknown structural changes. It is clear that any such technique will face the problem that it might not be able to detect all such changes, in particular when the data looks perfectly feasible. On the other hand, the techniques might indicate a change due to
characteristics of the data which, however, correctly represent reality, i.e., no change in the dimension data took place. We showed that several data mining techniques can be used for this analysis. We propose a procedure which uses several techniques in turn and which, in our opinion, is a good combination of efficiency and effectiveness. We were able to show that the techniques we propose actually detect structural changes in data warehouses and that these techniques also scale up to large data warehouses. The application of the data mining techniques, however, requires good quality of the data in the data warehouse, because otherwise errors of the first and second kind increase. It is also necessary to fine-tune the parameters of the data mining techniques, in particular to take the volatility of the data in the data warehouse into account. Here, further research is expected to lead to self-adaptive methods.
References
[Atk89] K. E. Atkinson. An Introduction to Numerical Analysis. John Wiley, New York, USA, 1989.
[BD02] P. J. Brockwell and R. A. Davis. Introduction to Time Series and Forecasting. Springer Verlag, New York, USA, 2002.
[BSH99] M. Blaschka, C. Sapia, and G. Höfling. On Schema Evolution in Multidimensional Databases. In Proceedings of the DaWaK99 Conference, Florence, Italy, 1999.
[CS99] P. Chamoni and S. Stock. Temporal Structures in Data Warehousing. In Proceedings of the 1st International Conference on Data Warehousing and Knowledge Discovery (DaWaK'99), pages 353–358, Florence, Italy, 1999.
[Dia00] F. Diacu. An Introduction to Differential Equations - Order and Chaos. W. H. Freeman, New York, USA, 2000.
[EK01] J. Eder and C. Koncilia. Changes of Dimension Data in Temporal Data Warehouses. In Proceedings of the 3rd International Conference on Data Warehousing and Knowledge Discovery (DaWaK'01), Munich, Germany, 2001. Springer Verlag (LNCS 2114).
[EKM02] J. Eder, C. Koncilia, and T. Morzy. The COMET Metamodel for Temporal Data Warehouses. In Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE'02), Toronto, Canada, 2002. Springer Verlag (LNCS 2348).
[Hol02] J. Hollmen. Principal component analysis, 2002. URL: http://www.cis.hut.fi/jhollmen/dippa/node30.html.
[JD98] C. S. Jensen and C. E. Dyreson, editors. A Consensus Glossary of Temporal Database Concepts - Feb. 1998 Version, pages 367–405. Springer-Verlag, 1998. In [EJS98].
[Vai01] A. Vaisman. Updates, View Maintenance and Time Management in Multidimensional Databases. Ph.D. Thesis, Universidad de Buenos Aires, 2001.
[Vid99] B. Vidakovic. Statistical Modeling by Wavelets. John Wiley, New York, USA, 1999.
[Wei85] S. Weisberg. Applied Linear Regression. John Wiley, New York, USA, 1985.
[Yan01] J. Yang. Temporal Data Warehousing. Ph.D. Thesis, Stanford University, June 2001.
Performance Tests in Data Warehousing ETLM Process for Detection of Changes in Data Origin Rosana L. A. Rocha, Leonardo Figueiredo Cardoso, and Jano Moreira de Souza COPPE/UFRJ, Federal University of Rio de Janeiro PO BOX 68511, Rio de Janeiro, RJ - Brazil (55)(21) 2590-2552 {rosana,cardoso,jano}@cos.ufrj.br
Abstract. In a data warehouse (DW) environment, when the operational environment does not possess or does not want to provide the data about the changes that occurred, controls have to be implemented to enable the detection of these changes and to reflect them in the DW environment. The main scenarios are: i) the impossibility of instrumenting the DBMS (triggers, transaction log, stored procedures, replication, materialized views, old and new versions of data, etc.) due to security policies, data ownership or performance issues; ii) the lack of instrumentation resources on the DBMS; iii) the use of legacy technologies such as file systems or semi-structured data; iv) application proprietary databases and ERP systems. In another article [1], we presented the development and implementation of a technique derived from the comparison of database snapshots, in which we use signatures to mark and detect changes. The technique is simple and can be applied to all four scenarios above. To prove the efficiency of our technique, in this article we carry out comparative performance tests between these approaches. We performed two benchmarks: the first one using synthetic data and the second one using real data from a case study in the data warehouse project developed for Rio Sul Airlines, a regional aviation company belonging to the Brazil-based Varig group. We also describe the main approaches to solve the detection of changes in data origin.
1 Introduction
The detection of changes in data origin is a well-known issue in the DW area. As mentioned by DO, DREW et al. [2], most research work on DW update focuses on the problem of how to refresh the DW efficiently given a differential relation, in an approach defined by ÖZSU & VALDURIEZ [3]. This approach captures the before and after images for all lines affected by each operation. Some of these works are based on the existence of an instrumentation resource for differential relations on the DBMS. The difference between them occurs in terms of DW capabilities, such
as: convergent DW consistency [4; 5], replication of some source relations [6], full replication [7], versioning [8], etc. According to INMON & KELLEY [9], a DW is a repository of integrated information, available for queries and analysis (e.g., decision support, or data mining). The DW emerged to meet the demand for fast analysis of business information. In this paper, to prove the efficiency of our technique, we carry out comparative performance tests between two database snapshot approaches (with and without signatures) for the detection of changes in data origin. We performed two benchmarks: the first one using synthetic data, and the second one using real data. The rest of the paper is organized as follows. In Section 2, we describe the main approaches which may be used for solving the problem of detecting changes in data origin. In Section 3, we describe the implementation of our technique. In Section 4, we present performance tests comparing our technique with the database snapshots approach. Finally, in Section 5, we present our final considerations.
2 Detection of Changes
There are several challenges that must be overcome in the development of a DW environment. In this article, however, the main focus is the detection of changes in data origin. According to LABIO, YERNENI et al. [10], one of the main problems is updating derived data when the remote information sources change. For the cases in which the operational environment does not store, or does not want to expose, the history of the changes taking place in the data, controls have to be implemented in order to meet this need. There are several approaches which may be implemented for the detection of changes in data origin. In order to choose among them, we have to take into consideration their advantages and disadvantages, and the existence of the necessary features in the operational and DW environments. Below, we describe some of the most used approaches for the solution of the change detection problem in data origin. These approaches have to be applied to each table in the data origin area for which the mapping of changes is necessary. We use as a basis for this description the framework proposed for the DW environment in [1]. In Figure 1, we have an example using this framework.
2.1 New Table with All the Performed Changes
Operational: we create a "daughter table" related to the originating table, identifying the change operation which was performed, when it occurred, and, in the case of updates, which fields have been altered and their values prior to the change.
Necessary features: the data from the origin area must be stored in a DBMS; the DBMS must provide a trigger mechanism for the operations insert, update and delete, with the approach of "old" and "new" versions implemented (this approach was mentioned by WIDOM & CERI [11]); the use of triggers must be permitted.
Advantages: assurance that the entire mapping of the changes performed in the originating table is stored; ease and greater speed in the extraction process, as only the lines changed since the last extraction have to be consulted.
Disadvantages: overload on the data origin environment, due to the additional inserts needed to record each change taking place; increase in the storage space needed for this new data and its indexes; need for periodic cleaning management of the mapping tables, which can grow very large.
When to use: when the mapping of all changes performed in the data origin is really necessary.
2.2 Marking on Originating Table for Storing the Last Change Carried Out
Operational: we create some columns identifying the change operation performed and when it took place. In the case of an update operation, we only mark the line when there has been a change in one of the fields relevant to the DW extraction process. Approach presented in [12].
Necessary features: as in approach 2.1.
Advantages: decrease in the data origin environment overload compared to approach 2.1; decrease in the storage space needed compared to approach 2.1; greater speed in the extraction process, as only the lines changed since the last extraction have to be consulted.
Disadvantages: the entire mapping of the changes taking place in the originating table is not stored, only the last change; increase in the storage space needed for these new fields; the removals must be logical; need for periodic cleaning management on the originating table for the lines marked as logical removals which have already been extracted to the DW.
When to use: when the mapping of all changes performed in the data origin is not necessary, only the last change.
2.3 Interpretation of Transaction Log on the DBMS
Operational: from the DBMS transaction log, we interpret all transactions carried out. Approach presented in [2; 12; 13].
Necessary features: the data from the origin area must be stored in a DBMS; this DBMS must implement transaction log control; the translation of the DBMS transaction log into SQL commands for insert, update and delete must be possible. This translation may be performed through a specific tool from the DBMS or through the development of a proprietary translator.
Advantages: there is no overload on the data origin environment; decreased need for storage space, network and processing; greater speed in the extraction process, as only the lines changed since the last extraction have to be consulted.
Disadvantages: the DBA needs to control the size of the transaction log area in order to prevent the loss of transactions. This may occur when the transaction log cycles, that is, when the area of the transaction log holding transactions that have already committed is reused by the DBMS; there is also the possibility of a physical or human failure in the copy of the transaction log file before the beginning of the backup process.
When to use: when the mapping of all or part of the changes that occurred in the data origin is necessary; note, however, the risk of losing changes because of human or physical failure or an unexpected operation by the DBA.
2.4 Comparison of Database Snapshots
A number of methods may be used to copy the data from the origin area to the work area, as required by this approach; examples are bulk copy, replication and triggers (two-phase commit protocol). We present below the items common to all the methods of this approach. This approach was mentioned in [12-15].
Operational: we perform a copy of the tables from the origin area to the work area. In this copy only the fields pertinent to the DW extraction process are considered. We identify this new table with the suffix "_current". If the extraction process is being carried out for at least the second time, there will also be an analogous table identified as "_previous". These two tables are used to verify the changes taking place, which are found by comparing their data. If the extraction process is being performed for the first time, it is not necessary to carry out the comparisons.
Necessary features: the data from the origin area must be stored in a DBMS; a large storage space is needed in the work area for the two versions of the tables, "current" and "previous", for each table in which the mapping of changes is necessary.
Disadvantages: use of processing and recording time for the data copy from the origin area to the work area and for the data comparison that detects the changes; use of a large storage space in the work area for the two versions of the tables, "current" and "previous".
When to use: in environments in which the DBMS in the origin area does not provide trigger mechanisms; or when the implementation of triggers is not allowed; or where there is no transaction log in the DBMS; or where it is not possible to translate the transaction log; or when there are problems and/or low-performance access to the data origin area.
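As an illustration of this approach, the following Python sketch compares a current and a previous snapshot row by row over all relevant fields; the data layout (dictionaries keyed by the primary key) is an assumption made to keep the sketch self-contained.

def diff_snapshots(current, previous):
    """current, previous: dicts mapping the primary key of a row to a tuple
    of all fields relevant to the DW extraction process."""
    inserted = [k for k in current if k not in previous]
    removed  = [k for k in previous if k not in current]
    updated  = [k for k in current
                if k in previous and current[k] != previous[k]]
    return inserted, updated, removed

The signature technique described in Section 3 replaces the tuple of fields stored for each previous row by a single hash value, which is what reduces both the storage space and the comparison cost.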
2.5 Refresh Tables
Operational: for each table in the data origin area for which the mapping of changes is necessary, we remove all data from the fact and dimension tables and perform the entire process of extraction and load again. This approach was mentioned in [13; 16; 17].
Necessary features: none in particular.
Advantages: there is no need for any change detection control in either the origin or the work area.
Disadvantages: each new ETLM process (extraction, transformation/validation, load and materialization) takes a large processing time, as the processes are redone each time; the mapping of the changes performed is not complete, as we only have the last position; loss of DW information history, as at each new ETLM process we recreate its data from the last position in the origin area.
When to use: when the mapping of all changes performed in the data origin is not actually necessary; when there is no need to keep the DW information history; when there is no time restriction, as the entire DW environment is recreated at each new processing.
3 An Improved Approach
In our DW development and implementation at Rio Sul Airlines, we used as a basis the framework presented for the DW environment in [1]. For the choice of the approach to be implemented for the solution of the problem of change detection in data origin, we took into consideration the features found in the Rio Sul Airlines operational and DW environments. The main feature of this environment refers to the DBMS in which the originating data was stored. The system has serious limitations, both technical and policy-wise, such as: i) the inexistence of instrumentation resources such as triggers, replication and a transaction log; ii) low performance in query processing, leading to overload in the operational environment; iii) limitations in data access via ODBC; iv) concurrency problems; v) a proprietary application database; vi) great difficulty in DBMS maintenance; vii) high data volatility; viii) uninterrupted use of the DBMS by users (24x7).
[Figure 1 shows the framework as four phases mapped to the origin, work, dimensional and multidimensional areas, together with the steps of the ETLM process: step A) return backup (from the OLTP database to an OLTP copy); step B) extraction into the staging area; step C) transformation and validation, producing valid data tables and rejected data tables; step D) signature generation and verification of modifications, producing the "new", "altered", "removed" and "all signatures" tables; step E) load; step F) materialization.]
Fig. 1. The DW environment framework used at Rio Sul Airlines
The technique we developed and implemented derives from the comparison of database snapshots (approach 2.4), and uses signatures to mark and detect changes. We describe below, step by step, how we solved each of the problems found. In this environment, we use a smaller machine. Every weekend, we took the backup from the previous night and restored it on this new machine (step A, Figure 1). From this data, we started the extraction process (step B, Figure 1) using the bulk copy method (approach 2.4, bulk copy method). At the end of this process, we had the tables and the data relevant to the DW project in the relational model, without foreign keys, and with some indexes created, aiming at facilitating the queries to be performed next. From this data, we carried out the data validation and transformation processes (step C, Figure 1), obtaining as a result both the valid and the rejected lines. For the valid lines, we marked the lines which had passed all validations; for this, we created only one new field in the tables loaded in the extraction process. For the rejected lines, we created a new table with the relevant information about the problematic table/line, identifying also the reason for rejection, aiming at facilitating the later adjustment of the incorrect data in the origin environment. Due to the limitations of the DBMS, we used the approach of comparison of database snapshots (approach 2.4). This approach could be used because there was no need to map all the changes which had occurred in the data origin; only the last situation met the company's needs. With the purpose of mitigating the two main disadvantages of this approach, namely i) the processing time for the comparisons in the verification of the changes that have occurred and ii) the space for storage of the table copies, we implemented the following improvement: for each table which is the origin of data for fact and dimension tables, and in which the detection of changes in data origin was necessary, we created a new table containing the primary key of the data originating table and a signature field. In our development the signature is calculated using the BCS (Binary Checksum) algorithm over the concatenation of all relevant fields of each table, although any hash algorithm could be used to calculate the signatures. In our implementation, the occurrence of collisions was very unlikely, as the queries performed on the tables with the suffix "all signatures" were always carried out on the primary key of the table, and not on the signatures created. Therefore, the signatures were only used to detect whether there had been any change to the original data. We describe, as follows, how the process of calculating and verifying signatures works (step D, Figure 1). For each table in which the detection of changes in data origin was necessary, we obtained the lines marked as valid and calculated their signature using the BCS algorithm. For each line, we checked the reference table with the suffix "all signatures", using the primary key as the basis for the query. If the query does not return any line, we include the data in the reference table with the suffix "new". If it returns a line, we compare the calculated signature with the signature stored in the reference table with the suffix "all signatures". If the signature is different, we include a new line in the reference
table with the suffix "altered"; otherwise the line is the same as in the last load. Finally, we move to the next line. At the end of the process for checking the inserts and updates, we start the process of identifying the removals. For all tables with the suffix "all signatures", we query, based on the primary key, whether the line still exists in the corresponding data origin table. If the query does not return any value, this means that the original line has been removed, and we include a new line in the reference table with the suffix "removed". At the end of this process, based on the data existing in the tables with the suffixes "new", "altered" and "removed", we start the load process (step E, Figure 1). In this process, we followed a logical order, aiming at avoiding failures, for the adjustments to be performed on the dimensional model data, according to each case. The lines are processed as follows: i) new (dimensions); ii) altered (dimensions); iii) new (facts); iv) altered (facts); v) removed (facts); vi) removed (dimensions).
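A minimal Python sketch of this signature step might look as follows. It uses zlib.crc32 in place of the BCS (Binary Checksum) algorithm (any hash function could be used, as noted above), and in-memory dictionaries stand in for the database tables; all names are illustrative.

import zlib

def row_signature(fields):
    """Signature over the concatenation of all relevant fields of a row."""
    payload = "|".join("" if f is None else str(f) for f in fields)
    return zlib.crc32(payload.encode("utf-8"))

def verify_modifications(valid_rows, all_signatures):
    """valid_rows: dict primary key -> tuple of relevant fields (current load).
    all_signatures: dict primary key -> signature stored from the last load.
    Returns the contents of the 'new', 'altered' and 'removed' tables."""
    new, altered, removed = {}, {}, []
    for key, fields in valid_rows.items():
        sig = row_signature(fields)
        if key not in all_signatures:
            new[key] = sig                      # line never loaded before
        elif all_signatures[key] != sig:
            altered[key] = sig                  # signature changed since the last load
        # identical signature: line unchanged, nothing to do
    for key in all_signatures:
        if key not in valid_rows:
            removed.append(key)                 # line no longer exists in the origin
    return new, altered, removed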
4 Performance Tests
To address the two main disadvantages of the existing approach of comparison of database snapshots (approach 2.4), we implemented an improvement using signatures. To verify the better results of our approach, we present our considerations regarding time and space. Firstly, regarding the storage space for the table copies, the improvement is clear: in each table involved, instead of storing all the fields on which changes have to be verified, we store a single field with their signature. Secondly, regarding the processing time for the comparisons in the verification of changes, we carried out performance tests to demonstrate the improvement reached. In these tests we performed two benchmarks: the first using synthetic data, and the second using the real data from Rio Sul Airlines. The tests were done to detect updates; inclusions and removals follow the same operational process in both approaches, so performance tests for them were not necessary. The performance tests were done using a Pentium III 600 MHz computer with 256 MB of RAM and the SQL Server 2000 DBMS. For the tests using synthetic data, we prepared two tables named "t10" and "t20", using the structures presented in Figure 2. Each has one field as primary key and, respectively, nine and nineteen other fields as complementary data. These tests were done in three steps: i) for each of the tables, current and previous, we loaded 100,000, 200,000 and 500,000 lines; ii) we updated 10%, 20% and 90% of the lines in the current table; iii) we performed the comparisons for the verification of changes. For the tests using real data, we used one table named "treal". This table has one char(12) field as primary key, nine decimal(19,4) fields and seven char fields, all of them as complementary data. These tests were done in three steps: i) for each of the tables, current and previous, we had around 19,000,000 lines; ii) we updated 1%, 2% and 9% of the lines in the current table; iii) we performed the comparisons for the verification of changes.
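The synthetic test tables can be reproduced with a few lines of Python; the sketch below only mirrors the setup described here (an integer key, N varchar-like fields, and a given fraction of lines updated in a single field of the current copy) and makes no claim about the original test scripts.

import random

def build_test_tables(n_lines, n_fields, update_rate, seed=0):
    """Build 'previous' and 'current' snapshots with update_rate of the lines changed."""
    rng = random.Random(seed)
    previous = {i: tuple(f"v{i}_{j}" for j in range(n_fields)) for i in range(n_lines)}
    current = dict(previous)
    for i in rng.sample(range(n_lines), int(n_lines * update_rate)):
        row = list(current[i])
        row[1] = row[1] + "_changed"     # update a single field, as in the tests
        current[i] = tuple(row)
    return current, previous

# current, previous = build_test_tables(100_000, 10, 0.10)   # table "t10", 10% of updates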
[Figure 2 shows the table structures used in the performance tests. For the "all fields" approach, both the current and the previous table contain a primary key Id (integer, not null) and the fields Field 1, ..., Field N (varchar(20), not null). For the "signatures" approach, the current table has the same structure, while the previous table contains only the primary key Id (integer, not null) and a Signature field (integer, not null).]
Fig. 2. Structures used on performance tests
To organize the presentation of our performance tests, we refer to our approach as "signatures" and to the comparison of database snapshots approach as "all fields". The pattern queries used by each approach for the comparisons in the verification of changes were:
i) All fields approach: performs a number of comparisons according to the number of fields of the table.
SELECT count(C.id) FROM table_current AS C, table_previous AS P
WHERE (C.id = P.id)
AND ((C.field1 <> P.field1) OR (C.field2 <> P.field2) OR ... OR (C.fieldN <> P.fieldN));
ii) Signatures approach: performs a number of comparisons according to the number of fields of the primary key, plus one function execution to calculate the signature.
SELECT count(C.id) FROM table_current AS C, table_previous AS P
WHERE (C.id = P.id)
AND (P.signature <> F_CalculateSignature(C.Field 1, ..., C.Field N));
The use of the count(C.id) clause has the objective of obtaining the time spent on the comparisons only after all results have been computed; in a real situation, we would retrieve the fields that compose the primary key. In our tests we performed the updates in only one field, namely the second field used in the WHERE clause. The main objective of this scenario is to always allow the best case in the difference comparisons for the "all fields" approach. The results of our tests are presented and explained as follows.
4.1 Graphics
According to the graphs depicted in Figure 3, we may observe that the "signatures" approach is practically constant for each modification rate; this occurs because the algorithm always performs the same number of comparisons and function calls. The "all fields" approach tends to become slower when a smaller number of lines has changed, because it always has to compare all fields to discover whether a modification occurred, even when in fact none did. The better performance of the "signatures" approach also comes from the fact that the "previous" table is considerably smaller than in the "all fields" approach and can easily fit in main memory for the join process.
[Figure 3 plots the comparison times for table "t10" (times for 100,000, 200,000 and 500,000 lines, respectively):
t10 all fields (10%):   00:00:07.30   00:00:35.02   00:03:16.24
t10 signatures (10%):   00:00:01.97   00:00:04.04   00:00:11.13
t10 all fields (20%):   00:00:03.93   00:00:21.37   00:02:59.26
t10 signatures (20%):   00:00:01.99   00:00:04.07   00:00:11.55
t10 all fields (90%):   00:00:02.49   00:00:06.96   00:02:29.39
t10 signatures (90%):   00:00:02.04   00:00:04.29   00:00:11.80
With the use of the "signatures" approach, the gain is approximately:
Table t10       10% of updates   20% of updates   90% of updates
100,000 lines         73%              50%              18%
200,000 lines         88%              81%              38%
500,000 lines         96%              95%              95%]
Fig. 3. "t10" with 10%, 20% and 90% of updates
[Figure 4 plots the comparison times for table "t20" (times for 100,000, 200,000 and 500,000 lines, respectively):
t20 all fields (10%):   00:00:07.78   00:02:59.99   00:14:41.45
t20 signatures (10%):   00:00:03.40   00:00:06.89   00:05:34.03
t20 all fields (20%):   00:00:07.49   00:02:17.45   00:12:12.77
t20 signatures (20%):   00:00:03.42   00:00:06.97   00:05:35.97
t20 all fields (90%):   00:00:04.23   00:02:11.65   00:09:51.95
t20 signatures (90%):   00:00:03.48   00:00:07.06   00:05:37.82
With the use of the "signatures" approach, the gain is approximately:
Table t20       10% of updates   20% of updates   90% of updates
100,000 lines         53%              54%              18%
200,000 lines         98%              97%              96%
500,000 lines         63%              56%              43%]
Fig. 4. "t20" with 10%, 20% and 90% of updates
According to the graphs depicted in Figure 4, the results show that the processing time of the "signatures" approach is directly proportional to the number of fields involved in the comparisons and to the number of lines in the table. According to the graph depicted in Figure 5, the tests with real data only confirm the better performance of the "signatures" approach. Although the computed times are considerable, we can notice, for a huge table, a gain of approximately: i) 48% for 1% of updates; ii) 33% for 2% of updates; iii) 30% for 9% of updates. In our real DW environment we have a great number of such huge tables, so the accumulated gain of time becomes even more important with the use of the "signatures" approach.
[Figure 5 plots the comparison times for the real table "treal" with 19,768,400 lines, comparing the "all fields" approach at 1%, 2% and 9% of updates with the "signatures" approach, whose time is essentially the same for the three update rates.]
Fig. 5. "treal" with 1%, 2% and 9% of updates
5 Conclusion
According to LABIO, YERNENI et al. [10], due to the constantly increasing size of warehouses and the rapid rates of change, there is increasing pressure to reduce the time taken for warehouse updates. To meet this demand, the main purpose of this article was to present performance tests that prove the efficiency of our technique, with signatures, against the approach of comparing database snapshots. From our performance tests we can conclude that, in the greater part of the cases, our technique had better performance, especially in the cases in which the modifications occurred in a small number of lines. In real situations, the number of modifications is usually smaller than presented in our tests, favoring the use of our "signatures" approach. Thus, we conclude and show that our "signatures" approach is more efficient and more economical with respect to storage space and time, and should be used in situations where the use of instrumentation resources on the DBMS in the operational environment is not possible.
References
[1] ROCHA, R. L. A., CARDOSO, L. F., SOUZA, J. M., 2003, An Improved Approach in Data Warehousing ETLM Process for Detection of Changes in Data Origin. COPPE/UFRJ, Report Nº ES-593/03. http://www.cos.ufrj.br/publicacoes/reltec/es59303.pdf.
[2] DO, L., DREW, P., JIN, W., et al., 1998, "Issues in Developing Very Large Databases". In: Proceedings of the 24th VLDB Conference, pp. 633-636, New York, USA, August.
[3] ÖZSU, M. T., VALDURIEZ, P., 1991, Principles of Distributed Database Systems. 1st Ed., New Jersey, USA, Prentice Hall Inc.
[4] ZHUGE, Y., GARCIA-MOLINA, H., HAMMER, J., et al., 1995, "View Maintenance in a Warehousing Environment". In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 316-327, San Jose, California, USA, June.
[5] ZHUGE, Y., GARCIA-MOLINA, H., WIENER, J. L., 1996, "The Strobe Algorithms for Multi-Source Warehouse Consistency". In: Proceedings on Parallel and Distributed Information Systems, pp. 146-157, Miami Beach, Florida, USA, December.
[6] QUASS, D., WIDOM, J., 1997, "On-Line Warehouse View Maintenance". In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 405-416, Tucson, Arizona, USA, May.
[7] HULL, R., ZHOU, G., 1996, "Towards the Study of Performance Trade-offs Between Materialized and Virtual Integrated Views". In: Proc. Workshop on Materialized Views: Techniques and Applications (VIEWS 96), pp. 91-102, Canada, June.
[8] QUASS, D., GUPTA, A., MUMICK, I. S., et al., 1996, "Making Views Self-Maintainable for Data Warehousing". In: Proceedings on Parallel and Distributed Information Systems, pp. 158-169, Miami Beach, Florida, USA, December.
[9] INMON, W. H., KELLEY, C., 1993, Rdb/VMS: Developing the Data Warehouse, Boston, QED Pub. Group.
[10] LABIO, W. J., YERNENI, R., GARCIA-MOLINA, H., 1999, "Shrinking the Warehouse Update Window". In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 383-394, Philadelphia, USA, June.
[11] WIDOM, J., CERI, S., 1996, Active Database Systems: Triggers and Rules for Advanced Database Processing, San Francisco, California, USA.
[12] CRAIG, R. S., VIVONA, J. A., BERKOVITCH, D., 1999, Microsoft Data Warehousing: Building Distributed Decision Support Systems, New York, Wiley.
[13] WIDOM, J., 1995, "Research Problems in Data Warehousing". In: Proceedings of the ACM CIKM International Conference, pp. 25-30, USA, November.
[14] HAMMER, J., GARCIA-MOLINA, H., WIDOM, J., et al., 1995, "The Stanford Data Warehousing Project", IEEE Quarterly Bulletin on Data Engineering; Special Issue on Materialized Views and Data Warehousing, v. 18, n. 2, pp. 41-48.
[15] CHAWATHE, S. S., GARCIA-MOLINA, H., 1997, "Meaningful Change Detection in Structured Data". In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 26-37, Arizona, USA, May.
[16] KIMBALL, R., 1996, The Data Warehouse Toolkit, New York, USA, John Wiley & Sons, Inc.
[17] KIMBALL, R., 1998, The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses, New York, USA, John Wiley & Sons, Inc.
Recent Developments in Web Usage Mining Research Federico Michele Facca and Pier Luca Lanzi Artificial Intelligence and Robotics Laboratory Dipartimento di Elettronica e Informazione, Politecnico di Milano
Abstract. Web Usage Mining is that area of Web Mining which deals with the extraction of interesting knowledge from logging information produced by web servers. In this paper, we present a survey of the recent developments in this area, which is receiving increasing attention from the Data Mining community.
1 Introduction
Web Mining [29] is that area of Data Mining which deals with the extraction of interesting knowledge from the World Wide Web. More precisely [40], Web Content Mining is that part of Web Mining which focuses on the raw information available in web pages; source data mainly consist of textual data in web pages (e.g., words, but also tags); typical applications are content-based categorization and content-based ranking of web pages. Web Structure Mining is that part of Web Mining which focuses on the structure of web sites; source data mainly consist of the structural information in web pages (e.g., links to other pages); typical applications are link-based categorization of web pages, ranking of web pages through a combination of content and structure (e.g., [20]), and reverse engineering of web site models. Web Usage Mining is that part of Web Mining which deals with the extraction of knowledge from server log files; source data mainly consist of the (textual) logs that are collected when users access web servers and that might be represented in standard formats; typical applications are those based on user modeling techniques, such as web personalization, adaptive web sites, and user modeling. The recent years have seen a flourishing of research in the area of Web Mining and specifically of Web Usage Mining. Since the early papers published in the mid 1990s, more than 400 papers on Web Mining have been published; roughly 150 of the overall 400 were published before 2001, and around 50% of these papers regarded Web Usage Mining. The first workshop entirely devoted to this topic, WebKDD, was held in 1999. Since 2000, more than 150 papers on Web Usage Mining have been published, showing a dramatic increase of interest in this area. This paper is a survey of the recent developments in the area of Web Usage Mining. It is based on the more than 150 papers published since 2000 on the topic of Web Usage Mining; see the on-line bibliography on the web site of the cInQ project [1].
Contact Author: Pier Luca Lanzi, [email protected].
2 Data Sources
Web Usage Mining applications are based on data collected from three main sources [58]: (i) web servers, (ii) proxy servers, and (iii) web clients.
The Server Side. Web servers are surely the richest and the most common source of data. They can collect large amounts of information in their log files and in the log files of the databases they use. These logs usually contain basic information, e.g., the name and IP address of the remote host, the date and time of the request, the request line exactly as it came from the client, etc. This information is usually represented in a standard format, e.g., the Common Log Format [2], the Extended Log Format [3], or LogML [53]. When exploiting log information from web servers, the major issue is the identification of users' sessions (see Section 3). Apart from web logs, users' behavior can also be tracked down on the server side by means of TCP/IP packet sniffers. Even in this case the identification of users' sessions is still an issue, but the use of packet sniffers provides some advantages [52]. In fact: (i) data are collected in real time; (ii) information coming from different web servers can be easily merged together into a unique log; (iii) the use of special buttons (e.g., the stop button) can be detected, so as to collect information usually unavailable in log files. Packet sniffers are rarely used in practice because they raise scalability issues on web servers with high traffic [52], and because of the impossibility of accessing encrypted packets like those used in secure commercial transactions, a quite severe limitation when applying web usage mining to e-businesses [13]. Probably the best approach for tracking web usage consists of directly accessing the server application layer, as proposed in [14]. Unfortunately, this is not always possible.
The Proxy Side. Many internet service providers (ISPs) offer their customers proxy server services to improve navigation speed through caching. In many respects, collecting navigation data at the proxy level is basically the same as collecting data at the server level. The main difference in this case is that proxy servers collect data of groups of users accessing huge groups of web servers.
The Client Side. Usage data can also be tracked on the client side by using JavaScript, Java applets [56], or even modified browsers [22]. These techniques avoid the problems of users' session identification and the problems caused by caching (like the use of the back button). In addition, they provide detailed information about actual user behaviors [30]. However, these approaches rely heavily on the users' cooperation and raise many issues concerning the privacy laws, which are quite strict.
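As a small illustration of the kind of records involved, the following Python sketch parses one line in the Common Log Format [2]; the regular expression covers only the basic fields and is an assumption made for the sketch rather than part of the survey.

import re

CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_clf_line(line):
    """Return the fields of a Common Log Format line as a dict, or None."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None

entry = parse_clf_line(
    '192.168.0.1 - - [10/Oct/2003:13:55:36 +0200] "GET /index.html HTTP/1.0" 200 2326'
)
# entry["host"] -> '192.168.0.1', entry["request"] -> 'GET /index.html HTTP/1.0'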
3 Preprocessing
Data preprocessing has a fundamental role in Web Usage Mining applications. The preprocessing of web logs is usually complex and time demanding. It comprises four different tasks: (i) data cleaning, (ii) the identification and reconstruction of users' sessions, (iii) the retrieval of information about page content and structure, and (iv) data formatting.

Data Cleaning. This step consists of removing all the data tracked in web logs that are useless for mining purposes [27, 12], e.g.: requests for graphical page content (e.g., jpg and gif images); requests for any other file which might be included in a web page; or even navigation sessions performed by robots and web spiders. While requests for graphical content and files are easy to eliminate, robots' and web spiders' navigation patterns must be explicitly identified. This is usually done, for instance, by referring to the remote hostname, by referring to the user agent, or by checking access to the robots.txt file. However, some robots actually send a false user agent in the HTTP request; in these cases, a heuristic based on navigational behavior can be used to separate robot sessions from actual users' sessions (see [60, 61]).

Session Identification and Reconstruction. This step consists of (i) identifying the different users' sessions from the usually very poor information available in log files and (ii) reconstructing the users' navigation paths within the identified sessions. Most of the problems encountered in this phase are caused by the caching performed either by proxy servers or by browsers. Proxy caching causes a single IP address (the one belonging to the proxy server) to be associated with different users' sessions, so that it becomes impossible to use IP addresses as user identifiers. This problem can be partially solved by the use of cookies [25], by URL rewriting, or by requiring the user to log in when entering the web site [12]. Web browser caching is a more complex issue: logs from web servers cannot include any information about the use of the back button, which can generate inconsistent navigation paths in the users' sessions. However, by using additional information about the web site structure it is still possible to reconstruct a consistent path by means of heuristics. Because the HTTP protocol is stateless, it is virtually impossible to determine when a user actually leaves the web site, and hence when a session should be considered finished. This problem is referred to as sessionization. [17] described and compared three heuristics for the identification of session termination; two were based on the time between users' page requests, one was based on information about the referrer. [24] proposed an adaptive timeout heuristic, and [26] proposed a technique to infer the timeout threshold for a specific web site. Other authors proposed different thresholds for time-oriented heuristics based on empirical experiments.

Content and Structure Retrieving. The vast majority of Web Usage Mining applications use the visited URLs as the main source of information for mining
purposes. URLs are, however, a poor source of information since, for instance, they do not convey any information about the actual page content. [26] was the first to employ content-based information to enrich the web log data. If an adequate classification is not known in advance, Web Structure Mining techniques can be employed to develop one. As in search engines, web pages are classified according to their semantic areas by means of Web Content Mining techniques; this classification information can then be used to enrich the information extracted from logs. For instance, [59] proposes to use the Semantic Web for Web Usage Mining: web pages are mapped into ontologies to add meaning to the observed frequent paths. [15] introduces concept-based paths as an alternative to the usual user navigation paths; concept-based paths are a high-level generalization of usual paths in which common concepts are extracted by means of the intersection of raw user paths and similarity measures.

Data Formatting. This is the final step of preprocessing. Once the previous phases have been completed, data are properly formatted before applying mining techniques. [11] stores data extracted from web logs in a relational database using a click fact schema, so as to provide better support for log querying aimed at frequent pattern mining. [47] introduces a method based on signature trees to index logs stored in databases for efficient pattern queries. A tree structure, the WAP-tree, is also introduced in [51] to register access sequences to web pages; this structure is optimized for the sequence mining algorithm developed by the same authors [51].
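As a concrete illustration of the cleaning and timeout-based session identification steps described in this section, the following Python sketch filters image and robot requests and splits each host's requests into sessions. The record format, the crawler test and the 30-minute threshold are illustrative assumptions, not a prescribed method.

```python
from datetime import datetime, timedelta

# Hypothetical pre-parsed log records: (host, timestamp, requested URL, user agent).
records = [
    ("10.0.0.1", datetime(2003, 5, 1, 10, 0, 0),  "/index.html", "Mozilla/4.0"),
    ("10.0.0.1", datetime(2003, 5, 1, 10, 0, 2),  "/logo.gif",   "Mozilla/4.0"),
    ("10.0.0.1", datetime(2003, 5, 1, 10, 45, 0), "/news.html",  "Mozilla/4.0"),
    ("10.0.0.2", datetime(2003, 5, 1, 10, 1, 0),  "/robots.txt", "crawler/1.0"),
]

IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")
TIMEOUT = timedelta(minutes=30)      # illustrative threshold; the cited heuristics tune it

def clean(records):
    """Drop graphical content and obvious robot traffic (robots.txt access, known agents)."""
    robot_hosts = {h for h, _, url, agent in records
                   if url == "/robots.txt" or "crawler" in agent.lower()}
    return [r for r in records
            if not r[2].lower().endswith(IMAGE_SUFFIXES) and r[0] not in robot_hosts]

def sessionize(records, timeout=TIMEOUT):
    """Group requests per host; start a new session after `timeout` of inactivity."""
    sessions, last_seen, current = [], {}, {}
    for host, t, url, _ in sorted(records, key=lambda r: (r[0], r[1])):
        if host in last_seen and t - last_seen[host] > timeout:
            sessions.append(current.pop(host))   # close the previous session for this host
        current.setdefault(host, []).append(url)
        last_seen[host] = t
    sessions.extend(current.values())
    return sessions

print(sessionize(clean(records)))    # -> [['/index.html'], ['/news.html']]
```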
4 Techniques
Most of the commercial applications of Web Usage Mining exploit consolidated statistical analysis techniques. In contrast, research in this area is mainly focused on the development of knowledge discovery techniques specifically designed for the analysis of web usage data. Most of this research effort focuses on three main paradigms: association rules, sequential patterns, and clustering (see [32] for a detailed description of these techniques). Association Rules. are probably the most elementary data mining technique and, at the same time, the most used technique in Web Usage Mining. When applied to Web Usage Mining, association rules are used to find associations among web pages that frequently appear together in users’ sessions. The typical result has the form “A.html, B.html ⇒ C.html” which states that if a user has visited page A.html and page B.html, it is very likely that in the same session, the same user has also visited page C.html. This type of result is for instance produced by [38] and [46] by using a modification of the Apriori algorithm [32]. [37] proposes and evaluates some interestingness measures to evaluate the association rules mined from web usage data. [21] exploits a mixed technique of association rules and fuzzy logic to extract fuzzy association rules from web logs.
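As an illustration of this kind of result, the following sketch counts pairwise supports over toy sessions and keeps the rules above fixed support and confidence thresholds. It is plain support/confidence counting, not the Apriori variants or interestingness measures cited above, and all page names and thresholds are made up.

```python
from itertools import combinations
from collections import Counter

# Illustrative sessions (sets of visited pages), not real log data.
sessions = [
    {"A.html", "B.html", "C.html"},
    {"A.html", "B.html"},
    {"A.html", "B.html", "C.html", "D.html"},
    {"B.html", "D.html"},
]
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.8

n = len(sessions)
item_count = Counter(page for s in sessions for page in s)
pair_count = Counter(frozenset(p) for s in sessions for p in combinations(sorted(s), 2))

rules = []
for pair, count in pair_count.items():
    if count / n < MIN_SUPPORT:
        continue                                    # prune infrequent page pairs
    for antecedent in pair:
        consequent = next(iter(pair - {antecedent}))
        confidence = count / item_count[antecedent]
        if confidence >= MIN_CONFIDENCE:
            rules.append((antecedent, consequent, count / n, confidence))

for a, c, sup, conf in rules:
    print(f"{a} => {c}  (support {sup:.2f}, confidence {conf:.2f})")
```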
Sequential Patterns. are used to discover frequent subsequences in large amounts of sequential data. In Web Usage Mining, sequential patterns are exploited to find sequential navigation patterns that frequently appear in users' sessions. The typical sequential pattern has the form [45]: 70% of users who first visited A.html and then visited B.html afterwards have, in the same session, also accessed page C.html. Sequential patterns might appear syntactically similar to association rules, and in fact algorithms to extract association rules can also be used for sequential pattern mining. There are essentially two classes of algorithms used to extract sequential patterns: one includes methods based on association rule mining; the other includes methods based on the use of tree structures, data projection techniques, and Markov chains to mine navigation patterns. Some well-known algorithms for mining association rules have been modified to extract sequential patterns. [44] presents a comparison of different sequential pattern algorithms applied to Web Usage Mining; the comparison includes PSP+, FreeSpan, and PrefixSpan. While PSP+ is an evolution of GSP, based on the candidate generation-and-test heuristic, FreeSpan and the newly proposed PrefixSpan use a data projection based approach. According to [44], PrefixSpan outperforms the other two algorithms and offers very good performance even on long sequences. [54] proposes a hybrid method: data are stored in a database according to a so-called Click Fact Schema; a Hypertext Probabilistic Grammar (HPG) is generated by querying the database; HPGs represent transitions among web pages through a model which shares many similarities with Markov chains. The frequent sequential patterns are mined through a breadth-first search over the hypertext probabilistic grammar. HPGs were first proposed in [18] and later improved in [54], where some scalability issues of the original proposal were solved.

Clustering. techniques look for groups of similar items in large amounts of data, based on a distance function which computes the similarity between groups. Clustering has been widely used in Web Usage Mining to group together similar sessions [56, 34, 36, 15]. [65] was the first to suggest that the focus of web usage mining should be shifted from single user sessions to groups of user sessions, and also the first to apply clustering to identify such clusters of similar sessions. [15] proposes a similarity graph, in conjunction with the time spent on web pages, to estimate group similarity in concept-based clustering. [33] uses sequence alignment to measure similarity, while [65] exploits belief functions. [57] uses Genetic Algorithms [35] to improve the results of clustering through user feedback. [48] couples a Fuzzy Artificial Immune System with clustering techniques to improve the users' profiles obtained through clustering. [34] applies multimodal clustering, a technique which builds clusters by using multiple information data features. [49] presents an application of matrix clustering to web usage data.
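To give a flavour of the Markov-chain view of navigation mentioned above, the sketch below merely tabulates first-order transition frequencies between consecutively visited pages. It is not the HPG of [18, 54] nor a sequential-pattern miner, and the example paths are invented.

```python
from collections import defaultdict

# Illustrative ordered navigation paths (one per session); not real data.
paths = [
    ["A.html", "B.html", "C.html"],
    ["A.html", "B.html", "D.html"],
    ["B.html", "C.html"],
]

# First-order transition counts between consecutively visited pages.
counts = defaultdict(lambda: defaultdict(int))
for path in paths:
    for src, dst in zip(path, path[1:]):
        counts[src][dst] += 1

# Normalize each row of counts into transition probabilities.
transition = {src: {dst: c / sum(dsts.values()) for dst, c in dsts.items()}
              for src, dsts in counts.items()}
print(transition)   # e.g. B.html -> C.html with probability 2/3, -> D.html with 1/3
```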
5 Applications
The general goal of Web Usage Mining is to gather interesting information about users' navigation patterns (i.e., to characterize web users). This information can later be exploited to improve the web site from the users' viewpoint. The results produced by the mining of web logs can be used for various purposes [58]: (i) to personalize the delivery of web content; (ii) to improve user navigation through prefetching and caching; (iii) to improve web design; or, in e-commerce sites, (iv) to improve customer satisfaction.

Personalization of Web Content. Web Usage Mining techniques can be used to provide a personalized web user experience. For instance, it is possible to anticipate, in real time, the user behavior by comparing the current navigation pattern with typical patterns extracted from past web logs. In this area, recommendation systems are the most common application; their aim is to recommend interesting links to products which could be interesting to users [10, 21, 63, 43]. Personalized Site Maps [62] are an example of a recommendation system for links (see also [45]). [50] proposed an adaptive technique to reorganize the product catalog according to the forecasted user profile. A survey of existing commercial recommendation systems, implemented in e-commerce web sites, is presented in [55].

Prefetching and Caching. The results produced by Web Usage Mining can be exploited to improve the performance of web servers and web-based applications. Typically, Web Usage Mining can be used to develop proper prefetching and caching strategies so as to reduce the server response time, as done in [23, 41, 42, 46, 64].

Support to the Design. Usability is one of the major issues in the design and implementation of web sites. The results produced by Web Usage Mining techniques can provide guidelines for improving the design of web applications. [16] uses stratograms to evaluate the organization and the efficiency of web sites from the users' viewpoint. [31] exploits Web Usage Mining techniques to suggest proper modifications to web sites. Adaptive web sites represent a further step: in this case, the content and the structure of the web site can be dynamically reorganized according to the data mined from the users' behavior [39, 66].

E-commerce. Mining business intelligence from web usage data is dramatically important for e-commerce web-based companies. Customer Relationship Management (CRM) can gain an effective advantage from the use of Web Usage Mining techniques. In this case, the focus is on business-specific issues such as customer attraction, customer retention, cross sales, and customer departure [19, 14, 28].
6 Software
There are many commercial tools which perform analyses on log data collected from web servers. Most of these tools are based on statistical analysis techniques, while only a few products exploit Data Mining techniques. [28] provides an up-to-date review of the available commercial tools for web usage mining. With respect to Web Mining commercial tools, it is worth noting that since the review made in [58] the number of existing products has almost doubled. Some companies which sold Web Usage Mining products in the past have disappeared (e.g., Andromeda's Aria); others have been bought by other companies. In most cases, Web Usage Mining tools are part of integrated Customer Relationship Management (CRM) solutions for e-commerce (e.g., [8] and [4]); sometimes, these tools are simple web log analyzers (e.g., [6, 7, 5]). One software package developed in a research environment, WUM [9], appears to have reached an interesting maturity level; WUM has currently reached version 7.0.

We presented a survey of the recent developments in the area of Web Usage Mining, based on the more than 150 papers published since 2000 on this topic. Because it was not possible to cite all the papers here, we refer the interested reader to the on-line bibliography provided on the web site of the cInQ project [1].
Acknowledgements This work has been supported by the consortium on discovering knowledge with Inductive Queries (cInQ) [1], a project funded by the Future and Emerging Technologies arm of the IST Programme (Contr. no. IST-2000-26469). The authors wish to thank Maristella Matera for discussions.
References [1] consortium on discovering knowledge with Inductive Queries (cInQ). Project funded by the European Commission under the Information Society Technologies Programme (1998-2002) Future and Emerging Technologies arm. Contract no. IST-2000-26469. http://www.cinq-project.org. Bibliography on Web Usage Mining available at http://www.cinq-project.org/intranet/polimi/. 140, 146 [2] Configuration File of W3C httpd, 1995. http://www.w3.org/Daemon/User/Config/. 141 [3] W3C Extended Log File Format, 1996. http://www.w3.org/TR/WD-logfile.html. 141 [4] Accrue, 2003. http://www.accrue.com. 146 [5] Funnel Web Analyzer, 2003. http://www.quest.com. 146 [6] NetIQ WebTrends Log Analyzer, 2003. http://www.netiq.com. 146 [7] Sane NetTracker, 2003. http://www.sane.com/products/NetTracker. 146 [8] WebSideStory HitBox, 2003. http://www.websidestory.com. 146 [9] WUM: A Web Utilization Miner, 2003. http://wum.wiwi.hu-berlin.de. 146
[10] Gediminas Adomavicius and Alexander Tuzhilin. Extending recommender systems: A multidimensional approach. 145 [11] Jesper Andersen, Anders Giversen, Allan H. Jensen, Rune S. Larsen, Torben Bach Pedersen, and Janne Skyt. Analyzing clickstreams using subsessions. In International Workshop on Data Warehousing and OLAP (DOLAP 2000), 2000. 143 [12] Corin R. Anderson. A Machine Learning Approach to Web Personalization. PhD thesis, University of Washington, 2002. 142 [13] Suhail Ansari, Ron Kohavi, Llew Mason, and Zijian Zheng. Integrating ecommerce and data mining: Architecture and challenges. In WEBKDD 2000 - Web Mining for E-Commerce – Challenges and Opportunities, Second International Workshop, August 2000. 141 [14] Suhail Ansari, Ron Kohavi, Llew Mason, and Zijian Zheng. Integrating ecommerce and data mining: Architecture and challenges. In Nick Cercone, Tsau Young Lin, and Xindong Wu, editors, Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001). IEEE Computer Society, 2001. 141, 145 [15] A. Banerjee and J. Ghosh. Clickstream clustering using weighted longest common subsequences. In Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining, 2001. 143, 144 [16] Bettina Berendt. Using site semantics to analyze, visualize, and support navigation. Data Mining and Knowledge Discovery, 6(1):37–59, 2002. 145 [17] Bettina Berendt, Bamshad Mobasher, Miki Nakagawa, and Myra Spiliopoulou. The impact of site structure and user environment on session reconstruction in web usage analysis. In Proceedings of the 4th WebKDD 2002 Workshop, at the ACM-SIGKDD Conference on Knowledge Discovery in Databases (KDD’2002), 2002. 142 [18] Jose Borges. A Data Mining Model to Capture UserWeb Navigation Patterns. PhD thesis, Department of Computer Science University College London, 2000. 144 [19] Catherine Bounsaythip and Esa Rinta-Runsala. Overview of data mining for customer behavior modeling. Technical Report TTE1-2001-18, VTT Information Technology, 2001. 145 [20] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998. 140 [21] S. Shiu C. Wong and S. Pal. Mining fuzzy association rules for web access case adaptation. In Case-Based Reasoning Research and Development: Proceedings of the Fourth International Conference on Case-Based Reasoning, pages ?–?, 2001. 143, 145 [22] Lara D. Catledge and James E. Pitkow. Characterizing browsing strategies in the World-Wide Web. Computer Networks and ISDN Systems, 27(6):1065–1073, 1995. 141 [23] Cheng-Yue Chang and Ming-Syan Chen. A new cache replacement algorithm for the integration of web caching and prefectching. In Proceedings of the eleventh international conference on Information and knowledge management, pages 632– 634. ACM Press, 2002. 145 [24] Mao Chen, Andrea S. LaPaugh, and Jaswinder Pal Singh. Predicting category accesses for a user in a structured information space. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 65–72, 2002. 142
[25] R. Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. PhD thesis, University of Minnesota, 2000. 142 [26] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems, 1(1):5–32, 1999. 142, 143 [27] Boris Diebold and Michael Kaufmann. Usage-based visualization of web localities. In Australian symposium on Information visualisation, pages 159–164, 2001. 142 [28] Magdalini Eirinaki and Michalis Vazirgiannis. Web mining for web personalization. ACM Transactions on Internet Technology (TOIT), 3(1):1–27, 2003. 145, 146 [29] Oren Etzioni. The world-wide web: Quagmire or gold mine? Communications of the ACM, 39(11):65–68, 1996. 140 [30] Kurt D. Fenstermacher and Mark Ginsburg. Mining client-side activity for personalization. In Fourth IEEE International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems (WECWIS’02), pages 205– 212, 2002. 141 [31] Yongjian Fu, Mario Creado, and Chunhua Ju. Reorganizing web sites based on user access patterns. In Proceedings of the tenth international conference on Information and knowledge management, pages 583–585. ACM Press, 2001. 145 [32] Jiawei Han and Micheline Kamber. Data Mining Concepts and Techniques. Morgan Kaufmann, 2001. 143 [33] Birgit Hay, Geert Wets, and Koen Vanhoof. Clustering navigation patterns on a website using a sequence alignment method. 144 [34] Jeffrey Heer and Ed H. Chi. Mining the structure of user activity using cluster stability. In Proceedings of the Workshop on Web Analytics, Second SIAM Conference on Data Mining. ACM Press, 2002. 144 [35] John H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975. Republished by the MIT press, 1992. 144 [36] Joshua Zhexue Huang, Michael Ng, Wai-Ki Ching, Joe Ng, and David Cheung. A cube model and cluster analysis for web access sessions. In R. Kohavi, B. Masand, M. Spiliopoulou, and J. Srivastava, editors, WEBKDD 2001 - Mining Web Log Data Across All Customers Touch Points, Third International Workshop, San Francisco, CA, USA, August 26, 2001. Revised Papers, volume 2356 of Lecture Notes in Computer Science, pages 48–67. Springer, 2002. 144 [37] Xiangji Huang, Nick Cercone, and Aijun An. Comparison of interestingness functions for learning web usage patterns. In Proceedings of the eleventh international conference on Information and knowledge management, pages 617–620. ACM Press, 2002. 143 [38] Karuna P. Joshi, Anupam Joshi, and Yelena Yesha. On using a warehouse to analyze web logs. Distributed and Parallel Databases, 13(2):161–180, 2003. 143 [39] Tapan Kamdar. Creating adaptive web servers using incremental web log mining. Master’s thesis, Computer Science Department, University of Maryland, Baltimore County, 2001. 145 [40] Kosala and Blockeel. Web mining research: A survey. SIGKDD: SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining, ACM, 2(1):1–15, 2000. 140 [41] Bin Lan, Stephane Bressan, Beng Chin Ooi, and Kian-Lee Tan. Rule-assisted prefetching in web-server caching. In Proceedings of the ninth international conference on Information and knowledge management (CIKM 2000), pages 504– 511. ACM Press, 2000. 145
[42] Tianyi Li. Web-document prediction and presending using association rule sequential classifiers. Master’s thesis, Simon Fraser University, 2001. 145 [43] Bamshad Mobasher, Honghua Dai, Tao Luo, and Miki Nakagawa. Effective personalization based on association rule discovery from web usage data. In Web Information and Data Management, pages 9–15, 2001. 145 [44] Behzad Mortazavi-Asl. Discovering and mining user web-page traversal patterns. Master’s thesis, Simon Fraser University, 2001. 144 [45] Eleni Stroulia Nan Niu and Mohammad El-Ramly. Understanding web usage for dynamic web-site adaptation: A case study. In Proceedings of the Fourth International Workshop on Web Site Evolution (WSE’02), pages 53–64. IEEE, 2002. 144, 145 [46] Alexandros Nanopoulos, Dimitrios Katsaros, and Yannis Manolopoulos. Exploiting web log mining for web cache enhancement. In R. Kohavi, B. Masand, M. Spiliopoulou, and J. Srivastava, editors, WEBKDD 2001 - Mining Web Log Data Across All Customers Touch Points, Third International Workshop, San Francisco, CA, USA, August 26, 2001. Revised Papers, volume 2356 of Lecture Notes in Computer Science, pages 68–87. Springer, 2002. 143, 145 [47] Alexandros Nanopoulos, Maciej Zakrzewicz, Tadeusz Morzy, and Yannis Manolopoulos. Indexing web access-logs for pattern queries. In Fourth ACM CIKM International Workshop on Web Information and Data Management (WIDM’02), 2002. 143 [48] O. Nasraoui, F. Gonzalez, and D. Dasgupta. The fuzzy artificial immune system: Motivations, basic concepts, and application to clustering and web profiling. In Proceedings of the World Congress on Computational Intelligence (WCCI) and IEEE International Conference on Fuzzy Systems, pages 711–716, 2002. 144 [49] Shigeru Oyanagi, Kazuto Kubota, and Akihiko Nakase. Application of matrix clustering to web log analysis and access prediction. In WEBKDD 2001 - Mining Web Log Data Across All Customers Touch Points, Third International Workshop, 2001. 144 [50] Hye-Young Paik, Boualem Benatallah, and Rachid Hamadi. Dynamic restructuring of e-catalog communities based on user interaction patterns. World Wide Web, 5(4):325–366, 2002. 145 [51] Jian Pei, Jiawei Han, Behzad Mortazavi-asl, and Hua Zhu. Mining access patterns efficiently from web logs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 396–407, 2000. 143 [52] Pilot Software. Web Site Analysis, Going Beyond Traffic Analysis http://www.marketwave.com/products solutions/hitlist.html, 2002. 141 [53] John R. Punin, Mukkai S. Krishnamoorthy, and Mohammed J. Zaki. Logml: Log markup language for web usage mining. In R. Kohavi, B. Masand, M. Spiliopoulou, and J. Srivastava, editors, WEBKDD 2001 - Mining Web Log Data Across All Customers Touch Points, Third International Workshop, San Francisco, CA, USA, August 26, 2001. Revised Papers, volume 2356 of Lecture Notes in Computer Science, pages 88–112. Springer, 2002. 141 [54] T. B. Pedersen S. Jespersen and J. Thorhauge. A hybrid approach to web usage mining. Technical Report R02-5002, Department of Computer Science Aalborg University, 2002. 144 [55] J. Ben Schafer, Joseph A. Konstan, and John Riedl. E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5(1-2):115–153, 2001. 145
[56] Cyrus Shahabi and Farnoush Banaei-Kashani. A framework for efficient and anonymous web usage mining based on client-side tracking. In R. Kohavi, B. Masand, M. Spiliopoulou, and J. Srivastava, editors, WEBKDD 2001 - Mining Web Log Data Across All Customers Touch Points, Third International Workshop, San Francisco, CA, USA, August 26, 2001. Revised Papers, volume 2356 of Lecture Notes in Computer Science, pages 113–144. Springer, 2002. 141, 144 [57] Cyrus Shahabi and Yi-Shin Chen. Improving user profiles for e-commerce by genetic algorithms. E-Commerce and Intelligent Methods Studies in Fuzziness and Soft Computing, 105(8), 2002. 144 [58] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12–23, 2000. 141, 145, 146 [59] G. Stumme, A. Hotho, and B. Berendt. Usage mining for and on the semantic web. In National Science Foundation Workshop on Next Generation Data Mining, 2002. 143 [60] Pang-Ning Tan and Vipin Kumar. Modeling of web robot navigational patterns. In WEBKDD 2000 - Web Mining for E-Commerce – Challenges and Opportunities, Second International Workshop, August 2000. 142 [61] Pang-Ning Tan and Vipin Kumar. Discovery of web robot sessions based on their navigational patterns. Data Mining and Knowledge Discovery, 6(1):9–35, 2002. 142 [62] Fergus Toolan and Nicholas Kushmerick. Mining web logs for personalized site maps. 145 [63] Debra VanderMeer, Kaushik Dutta, and Anindya Datta. Enabling scalable online personalization on the web. In Proceedings of the 2nd ACM E-Commerce Conference (EC’00), pages 185–196. ACM Press, 2000. 145 [64] Yi-Hung Wu and Arbee L. P. Chen. Prediction of web page accesses by proxy server log. World Wide Web, 5(1):67–88, 2002. 145 [65] Yunjuan Xie and Vir V. Phoha. Web user clustering from access log using belief function. In Proceedings of the First International Conference on Knowledge Capture (K-CAP 2001), pages 202–208. ACM Press, 2001. 144 [66] Osmar R. Za¨ıane. Web usage mining for a better web-based learning environment. In Proceedings of Conference on Advanced Technology for Education, pages 450–455, 2001. 145
Parallel Vector Computing Technique for Discovering Communities on the Very Large Scale Web Graph
Kikuko Kawase1, Minoru Kawahara2, Takeshi Iwashita2, Hiroyuki Kawano1, and Masanori Kawazawa2
1 Department of Systems Science, Graduate School of Informatics, Kyoto University
2 Academic Center for Computing and Media Studies, Kyoto University
Abstract. The discovery of authoritative pages and communities from the enormous Web content has attracted many researchers. One of the link-based analysis methods, the HITS algorithm, calculates authority scores as the eigenvector of an adjacency matrix created from the Web graph. It was previously considered impossible to compute this eigenvector for a very large scale Web graph, because the calculation requires an enormous amount of memory. We make it possible using data compression and parallel computation.
1 Introduction

The ISC (Internet Software Consortium) [10] reported the existence of over 160 million Web servers constituting the Web (World Wide Web) on the Internet as of July 2002, so it is easy to guess that there is a huge number of Web pages. Some Web search engines collect many Web pages: for example, Google [8] in the United States indexed 2,500 million pages as of November 2002, and AlltheWeb.com [1] of Norway holds 2,100 million pages for retrieval. Additionally, Openfind [17], a Web search engine in Taiwan, is in a beta test to retrieve over 3,500 million Web pages. However, it is difficult to find useful Web pages using only standard retrieval techniques such as full-text search, because of the huge number of Web pages and of retrieval results. Many studies have tried to find useful Web pages among the retrieval results [19]. In particular, many studies have used the link structure of the Web to evaluate the importance of Web pages and to find strongly connected groups of Web pages (Web communities) [9, 12-15]. A popular algorithm called the "HITS (Hyperlink-Induced Topic Search) algorithm" [12, 13] makes use of the link structure of the Web and evaluates the importance of each Web page, and there have been many studies of Web community discovery using this algorithm [3, 6]. For example, PageRank [18], which is used to rank Web pages in the Web search engine Google, is an evaluation method based on the HITS algorithm [8, 11]. In the HITS algorithm, which is described in detail in Section 2, a link between two Web pages is regarded as an edge of a directed graph, and the Web graph can then be denoted as an adjacency matrix. When the Web structure is denoted by an adjacency matrix, the evaluation of the importance of Web pages resolves itself into an eigenvalue problem for the matrix. But there are two
problems with this calculation. One is that a huge memory area is required to hold the huge adjacency matrix corresponding to large-scale Web pages. The other is that high calculation performance is required to process the huge matrix. According to a simple calculation for n Web pages, at least 8 × n^2 bytes of memory are required for a square matrix of side n, where each element holds a double-precision floating-point value occupying 8 bytes. For example, a memory area of 8 × 10^16 bytes = 80 PB would be required for 100 million (= 10^8) Web pages, so it is impossible to evaluate such a huge set of Web pages in this way. We note that almost all Web pages hold few links compared with the total number of Web pages, so the adjacency matrix becomes a huge sparse matrix. In this paper, we approach high-speed calculation of the importance of a huge number of Web pages through a sparse matrix storing method [7], which compresses the data and stores it in main memory, and we apply the high-speed calculation techniques of vectorization and parallelization on a parallel vector computer. The paper is organized as follows. Section 2 describes a calculation method for the importance of Web pages. Section 3 describes a compression method for the data and a high-speed calculation method. Section 4 evaluates our proposed method. Finally, Section 5 makes concluding remarks and describes future subjects.
2 Calculation method for the importance of Web pages

2.1 Authority and hub

A graph consists of points (nodes) and lines (edges) which connect two nodes. If Web pages (p_1, p_2, ...) are thought of as nodes and the set of nodes is defined as V = {p_1, p_2, ...}, and a directed link (p_i, p_j) from a Web page p_i to a Web page p_j is thought of as an edge and the set of links is defined as E, then the Web structure can be denoted by a directed graph G = (V, E). The graph can be translated into an adjacency matrix A by setting the element a_{ij} of A to 1 if a directed link (p_i, p_j) exists and to 0 if it does not, where i denotes a row and j denotes a column. An authority is a page which is considered to be useful, and a hub is a page which has many hyperlinks to valuable pages. A good hub links to many good authorities, and a good authority is linked from many good hubs. HITS is one of the algorithms based on this idea. Useful pages can be found as follows [5]. Assuming that p and q are Web pages, the authority weight x_p and the hub weight y_p of a page p are defined as:

$$x_p = \sum_{q:(q,p)\in E} y_q, \qquad y_p = \sum_{q:(p,q)\in E} x_q, \qquad \text{where } \sum_p (x_p)^2 = 1 \text{ and } \sum_p (y_p)^2 = 1.$$
The hub weight and the authority weight are initialized to the same non-negative value. Then x and y converge to x* and y* respectively by iterative updating of these weights for each page, and x* and y* are thought to show the usefulness of each Web page. Furthermore, one iteration of the update is x ← A^t y, y ← A x using the adjacency
matrix A. Suppose the initial value is z; the authority weight x_k and the hub weight y_k after k iterations are expressed as x_k = (A^t A)^{k-1} A^t z and y_k = (A A^t)^k z. When x_k and y_k have converged, we obtain x* and y*, where x* is the eigenvector of A^t A and y* is the eigenvector of A A^t. Although all of the eigenvectors and eigenvalues of A A^t and A^t A could be calculated by the singular value decomposition of A, this method cannot be used here, since it requires too much memory for the large-scale Web graph treated in this paper. Meanwhile, retrieval research [13] indicates that a Web graph can be assumed to have a single dominant eigenvalue, and so can A A^t and A^t A. Therefore, according to this assumption, we calculate the principal eigenvectors of A A^t and A^t A using the power method in order to discover authorities and hubs in the whole Web.

2.2 Power Method

The power method is an effective technique for calculating the maximum eigenvalue of a matrix and its eigenvector. Since we need only the principal eigenvector, this method is very suitable. Suppose matrix A has eigenvalues λ_1, λ_2, ..., λ_n (|λ_1| > |λ_2| ≥ ... ≥ |λ_n|) with corresponding eigenvectors ξ_1, ξ_2, ..., ξ_n. Any vector u_0 can then be written as a linear combination of ξ_1, ξ_2, ..., ξ_n, where α_1, α_2, ..., α_n are real numbers:

$$u_0 = \sum_{i=1}^{n} \alpha_i \xi_i.$$

Now consider $A^S u_0$, obtained by multiplying $u_0$ by A S times. Using the above formula and $A\xi_i = \lambda_i \xi_i$,

$$A^S u_0 = A^{S-1}\sum_{i=1}^{n} \alpha_i \lambda_i \xi_i = \cdots = \sum_{i=1}^{n} \alpha_i \lambda_i^S \xi_i = \lambda_1^S\left(\alpha_1 \xi_1 + \sum_{i=2}^{n} \alpha_i \left(\frac{\lambda_i}{\lambda_1}\right)^S \xi_i\right).$$

As S gets large enough, the second term inside the parentheses approaches zero; thus $A^S u_0 \approx \alpha_1 \lambda_1^S \xi_1$. So we can obtain the principal eigenvalue λ_1 from the ratio of $A^{S+1} u_0$ to $A^S u_0$. The iteration proceeds as follows:

$$v_{S+1} = A u_S, \qquad u_{S+1} = \frac{v_{S+1}}{b_{S+1}} \quad (S = 0, 1, 2, \cdots),$$
where $b_{S+1}$ is the element of maximum absolute value of the vector $v_{S+1}$. Therefore

$$u_{S+1} = \frac{v_{S+1}}{b_{S+1}} = \frac{A u_S}{b_{S+1}} = \frac{A v_S}{b_{S+1} b_S} = \cdots = \frac{A^{S+1} u_0}{b_{S+1} \cdots b_1}.$$

Although some matrices may converge very slowly, the power method converges after a few iterations for the adjacency matrix of the Web graph, because its principal eigenvalue is much larger than the next eigenvalue [9]. In this paper, the matrix A above corresponds to the product of the adjacency matrix and its transpose. With A denoting the adjacency matrix, the iteration becomes

$$v_{S+1} = A A^t u_S, \qquad u_{S+1} = \frac{v_{S+1}}{b_{S+1}} \quad (S = 0, 1, 2, \cdots).$$

There are two ways to perform this calculation: one computes A^t u_S first [Method 1], and the other computes A A^t first [Method 2]. When there are many iterations and the calculation cost per iteration is low, [Method 2] has merit.
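To make the iteration concrete, here is a small Python sketch (not the authors' vectorized implementation) of [Method 1]'s power iteration using SciPy sparse matrices; the toy adjacency matrix, tolerance and iteration limit are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp

def power_method_hits(A, tol=1e-5, max_iter=1000):
    """Hub weights as the principal eigenvector of A A^t, computed as in [Method 1]:
    multiply by A^t first, then by A, so A A^t is never formed explicitly."""
    n = A.shape[0]
    u = np.ones(n)                      # same non-negative initial value for every page
    for _ in range(max_iter):
        v = A @ (A.T @ u)               # v_{S+1} = A (A^t u_S)
        b = np.abs(v).max()             # element of maximum absolute value
        u_next = v / b
        if np.abs(u_next - u).max() < tol:
            return u_next, b            # eigenvector estimate and eigenvalue estimate
        u = u_next
    return u, b

# Toy usage: a tiny directed graph on pages 0..3.
A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                            [0, 0, 1, 0],
                            [0, 0, 0, 1],
                            [1, 0, 0, 0]], dtype=float))
hub, lam = power_method_hits(A)
authority = A.T @ hub                   # authority weights follow from the hub weights
```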
2.3 Data compression storing method for large-scale matrix

The adjacency matrix of the Web graph is a sparse matrix of 0s and 1s, so we use the Compressed Row Storage (CRS) format [2] for data compression. This format can be used for any matrix structure, and it does not hold any 0 elements; it puts the consecutive nonzero elements of the matrix rows in adjacent memory locations. Assuming we have a sparse matrix A, we need three vectors: one of floating-point numbers (val) and two of integers (col_ind, row_ptr). Although the val vector holds the values of the nonzero elements of A as they are traversed in a row-wise fashion, we can omit this vector in this paper because all nonzero elements equal 1. The col_ind vector holds the column indexes of the nonzero elements, and the row_ptr vector holds the positions in col_ind where each row starts. By convention, we define row_ptr(n + 1) = e + 1, where e is the number of nonzero elements in A. Therefore this storing method requires only e + n + 1 storage locations instead of n × n elements. Besides the CRS format, there are other data compression methods such as the Compressed Column Storage (CCS) format. The CCS format is the same as the CRS format except that the columns of A are stored; in other words, the CCS format can be seen as the CRS format for A^t [2]. In [Method 2], it is required to store the calculation result of A A^t in main memory in addition to the area for A. Although A A^t is a symmetric matrix, which is suitable for parallel processing, there is no guarantee that A A^t is sparse, and a huge memory area may be required.
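The storage scheme itself is easy to mimic. The following sketch (an illustration, 0-indexed unlike the 1-based convention above, and not the authors' code) keeps only col_ind and row_ptr for a binary matrix and performs the two matrix-vector products needed by [Method 1] without decompressing anything.

```python
import numpy as np

# CRS storage of a binary adjacency matrix: the `val` array is omitted because every
# stored element equals 1. Example 4x4 matrix: row 0 -> cols 1,2; row 1 -> col 2;
# row 2 -> col 3; row 3 -> col 0.
col_ind = np.array([1, 2, 2, 3, 0])
row_ptr = np.array([0, 2, 3, 4, 5])     # row i occupies col_ind[row_ptr[i]:row_ptr[i+1]]

def matvec_A(col_ind, row_ptr, x):
    """y = A x without materializing A."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        y[i] = x[col_ind[row_ptr[i]:row_ptr[i + 1]]].sum()
    return y

def matvec_At(col_ind, row_ptr, x):
    """y = A^t x: scatter each row's contribution to the columns it references."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        y[col_ind[row_ptr[i]:row_ptr[i + 1]]] += x[i]
    return y

u = np.ones(4)
v = matvec_A(col_ind, row_ptr, matvec_At(col_ind, row_ptr, u))   # one [Method 1] step
```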
Fig. 1. The parallel calculation of b ← A^t u with 4 PEs.

Fig. 2. The parallel calculation of v ← Ab with 4 PEs.
3 Implementation on a distributed memory parallel vector computer

In order to evaluate our method, we implemented it on the Fujitsu VPP800/63 parallel vector computer of the Academic Center for Computing and Media Studies, Kyoto University, which includes 63 PEs (Processing Elements); each PE has 8 GFLOPS of computing power and 8 GB of memory.
3.1 Parallel processing with MPI

We use MPI (Message Passing Interface) for the implementation. MPI is one of the most general programming techniques for parallel computing on a distributed memory parallel computer, based on transmitting and receiving messages between processors. Here we explain the distributed parallel processing with [Method 1], which does not calculate A A^t, in order to save storage area. In this paper, all of the one-dimensional arrays for the matrix A, the eigenvector, and the work areas are distributed over the main memory. Fig. 1 and Fig. 2 show the distribution of data using 4 processors (r0, r1, r2, r3). Since we put only the matrix A in CRS format on the main memory, A^t is stored in CCS format as in Fig. 1. More specifically, we explain the calculation of v_{S+1} = A(A^t u_S) using 4 processors. First, to calculate b ← A^t u_S as in Fig. 1, each processor computes the sum of products of the part of A^t and u_S that it holds; as a result, each processor obtains an n-dimensional vector b_i, and we obtain b by computing Σ b_i = b with communication between the processors. Then, to calculate v_{S+1} ← A b, each processor communicates its part of the vector b using MPI, computes the sum of products with its own part of the matrix A, and stores the result as its part of the vector v_{S+1}. Meanwhile, [Method 2] makes parallel processing of A A^t easy thanks to the way the matrix is stored on the distributed memory.
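A rough mpi4py/SciPy sketch of one such iteration with a row-block distribution is given below. It only illustrates the communication pattern (partial sums of b combined with an all-reduce, then a local multiplication by the owned rows of A); it is not the authors' implementation, and the block construction, density and toy dimensions are assumptions.

```python
from mpi4py import MPI
import numpy as np
import scipy.sparse as sp

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1000                                        # toy total number of pages
my_rows = np.array_split(np.arange(n), size)[rank]
A_local = sp.random(len(my_rows), n, density=0.01, format="csr", random_state=rank)
A_local.data[:] = 1.0                           # binary row block of A owned by this PE
u_local = np.ones(len(my_rows))                 # this PE's slice of u_S

# b <- A^t u_S : each PE forms its partial n-vector b_i, then the b_i are summed.
b_partial = A_local.T @ u_local
b = np.empty(n)
comm.Allreduce(b_partial, b, op=MPI.SUM)

# v_{S+1} <- A b : each PE multiplies its own rows of A by the complete vector b.
v_local = A_local @ b

# Normalize by the element of globally maximum absolute value, as in the power method.
local_max = np.array([np.abs(v_local).max() if v_local.size else 0.0])
global_max = np.empty(1)
comm.Allreduce(local_max, global_max, op=MPI.MAX)
u_local = v_local / global_max[0]
```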
Fig. 3. The product of matrices compressed by CRS format and CCS format.

Fig. 4. Distribution of hub weights and authority weights (number of pages versus weight value).
3.2 Vector calculation

When operations can be carried out independently, high-speed processing of array operations is possible with vector calculation. In order to increase the efficiency of vector calculation over the array col_ind stored in the CRS format in [Method 1], the calculation proceeds in the order of the values stored in the array row_ptr; that is, each row can be calculated independently, and vectorization becomes possible by processing the elements of each row sequentially from the leftmost one. Even in [Method 2], when calculating the product of a matrix stored in the CRS format and a matrix stored in the CCS format, vector calculation of the matrix multiplication is possible in compressed form by using two additional pointers. For example, suppose that the inner product of column 1 of A^t and row c of A is calculated as shown in Fig. 3. The nonzero elements are at columns n_1, n_2, n_3, n_4 in row c of A and at rows m_1, m_2, m_3 in column 1 of A^t, stored in CRS and CCS format respectively. We introduce check_A and check_At, which point to positions in these two index lists; they are initialized to 1 and point to a_{c,n_1} and a_{m_1,1}. If the current column index in row c of A is smaller than the current row index in column 1 of A^t, check_A is incremented; if it is larger, check_At is incremented; if the two indices are equal, the product of the two elements is accumulated and both check_A and check_At are incremented. The calculation is complete when either pointer runs past the end of its list. This calculation can be done independently for different values of c, so vector calculation is possible. Furthermore, the product can be calculated without decompressing the data, and the main memory can be used efficiently.
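A minimal Python sketch of this pointer merge for one (row c, column 1) pair is given below; indices are 0-based, and the lists are assumed sorted, which the CRS/CCS storage guarantees.

```python
def sparse_dot(cols_of_row_c, rows_of_col_1):
    """Inner product of row c of A (CRS: sorted column indices) and column 1 of A^t
    (CCS: sorted row indices) for a 0/1 matrix: each common index contributes 1."""
    check_A, check_At = 0, 0
    total = 0
    while check_A < len(cols_of_row_c) and check_At < len(rows_of_col_1):
        n_k = cols_of_row_c[check_A]
        m_k = rows_of_col_1[check_At]
        if n_k < m_k:
            check_A += 1                 # advance the pointer with the smaller index
        elif n_k > m_k:
            check_At += 1
        else:
            total += 1                   # matching index: both stored values are 1
            check_A += 1
            check_At += 1
    return total

print(sparse_dot([1, 2, 5, 7], [2, 3, 7]))   # -> 2 (common indices 2 and 7)
```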
4 Performance evaluation

In a preliminary experiment using [Method 1] and [Method 2], [Method 1] was about 100 times better than [Method 2] in terms of speed and memory occupancy, so we use [Method 1] for the evaluation.
Fig. 5. The links of real data and test data: "Graph Structure in the Web" of IBM Almaden Research Institute, October 1999.

Table 1. The relation between pages and links

  Number of pages          1x10^4   1x10^5   1x10^6   1x10^7   1x10^8   1x10^9   2x10^9
  Maximum number of links  68       160      374      873      2,036    4,747    6,125
4.1 Evaluation using actual data

As real data for the performance evaluation, we used the link data of about 15 million Web pages under "jp" domains, collected for the National Institute of Informatics NTCIR-3 Web retrieval tasks [16]. In this link data, the average number of links is 7.2 and the maximum number of links is 1,258. As a result of the experiment using 10 PEs of the VPP800, the lapsed time was 550 seconds and the occupied memory area was 757 MB. Fig. 4 shows the calculated authority weights and hub weights. In Fig. 4 there are some pages whose authority weight and hub weight are almost 1; upon investigation, these are Web pages that have many self links. Although this phenomenon of the HITS algorithm was pointed out in past research [3], it can be avoided by adding some improvements to the algorithm [3, 18]. When actually applying this approach to Web link analysis, it is necessary to use such an improved algorithm.

4.2 Evaluation using test data

According to "Graph Structure in the Web" of the IBM Almaden Research Institute [4], the average number of links for the 200 million Web pages as of October 1999 was 16.1, so we assume an average of 20 links, estimated a little higher. We also assumed, based on Fig. 5, that the Web page with the most links has a number of links equal to (the number of all pages x 10) raised to the power 1/2.72. Table 1 shows the resulting relation between the assumed maximum number of links and the total number of pages.
Fig. 6. The relation between the number of pages and the lapsed time.

Fig. 7. The number effect of PEs (speed-up versus the number of PEs).

Fig. 8. The relation between the lapsed time and the convergence condition.

Fig. 9. The relation between the average number of links and the lapsed time.
Moreover, considering possible increases in the average number of links in the future, we also measured the calculation time when the average number of links increases. We generated a Web graph using random numbers so that the average number of links is 20, and the convergence condition for the eigenvector was set to 10^-5. In Fig. 6 we varied the number of pages from 10,000 to 100 million and measured the lapsed time using 2 PEs and 10 PEs; the lapsed time is roughly proportional to the 1.3rd power of the number of pages. Fig. 7 shows the effect of the number of PEs: we evaluated 2, 5, 10, 20 and 40 PEs and measured the speed-up relative to the time taken by one PE, and the effect is larger for larger data. Fig. 8 shows the lapsed time as the convergence condition of the power method is made stricter; the lapsed time is roughly proportional to the number of decimal places of the convergence condition. In Fig. 9 we measured the lapsed time while changing the average number of links to 20, 40, 60, 80 and 100. Even when the average number of links increased 5 times, the lapsed time increased only by a factor of about 2.8 in the case of 1 million pages using 2 PEs and about 1.8 in the case of 10 million pages using 10 PEs. Moreover, in some cases the lapsed time even decreased although the average number of links increased. Therefore, even if the average number of links increases in the future, the lapsed time will not grow excessively.

The variables used by the program of the experiment need 120 B of main memory per Web page. On the current system, the user area of the main memory is limited to 7 GB per PE, so calculation of about 50 million Web pages is possible using 1 PE. We therefore calculated 2 billion Web pages using 40 PEs, which is the maximum number that a user can use. Table 2 shows the lapsed time, the used main memory area, and the communication time. Since the lapsed time for real data is longer than the lapsed time for test data at the same accuracy, we corrected the number of power-method iterations on the test data using the number of iterations observed on the real data to obtain the lapsed time and communication time; the communication time is the time spent in message communication by MPI.

Table 2. The used resources of the computer

  Number of PEs   Number of pages   Lapsed time          Used memory   Communication time
  40              2 billion         12 hours 2 minutes   280 GB        20 minutes

In addition, in this paper a 32-bit integer variable is used to store each link, so if the main memory space for storing the matrix data can be secured, it should be possible to treat up to about 4 billion Web pages. Beyond that it is necessary to use an integer type with a longer bit length; when 64-bit integers are used, the memory used for the links becomes 2 x 8 x n bytes instead of 2 x 4 x n bytes with 32-bit integers, which is not such a big increase.
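For orientation, the arithmetic behind these figures can be checked with a few lines. The 120 B per page, the 7 GB per-PE limit, and the 2 x 4 x n / 2 x 8 x n link-storage estimates are the paper's; the snippet itself is only a back-of-the-envelope illustration.

```python
# Rough check of the resource figures quoted above (assumptions stated in the text).
GB = 1024 ** 3
pages_per_pe = (7 * GB) // 120       # ~63 million; the paper conservatively quotes about 50 million
n = 2_000_000_000                    # 2 billion pages, as in Table 2
links_32bit_gb = 2 * 4 * n / GB      # ~14.9 GB of link indices with 32-bit integers
links_64bit_gb = 2 * 8 * n / GB      # ~29.8 GB with 64-bit integers: doubled, still moderate
print(pages_per_pe, round(links_32bit_gb, 1), round(links_64bit_gb, 1))
```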
5 Conclusion

In this paper, in order to analyze the Web link structure, we proposed a technique for calculating the importance of Web pages by representing a large-scale Web graph as a huge sparse adjacency matrix and solving the eigenvalue problem in the main memory of the computer using the CRS format. We confirmed that the calculation finishes in realistic time on the currently available computer system by using the vectorization and parallel computing methods. This shows that link-structure analysis of Web pages on a world scale is possible by applying the technique of this paper. Although this paper described how to calculate the importance of Web pages using the HITS algorithm, the technique is applicable to other Web graph analysis techniques [3, 18]; new algorithms applicable to large-scale data can be developed by using the sparse matrix storing method for the Web graph and the high-speed calculation techniques shown in this paper. Although the experiment on actual Web pages in this paper was a small-scale one of about 15 million pages, we think it is necessary to conduct an evaluation on large-scale
actual Web pages. Moreover, it is necessary to clarify the relation between the memory usage and the calculation time of the various algorithms.
Acknowledgments A part of this work was supported by grants for Scientific Research (13680482, 14213101, 15017248, 15500065) from the Ministry of Education, Science, Sports and Culture of Japan, and a part by a grant from the Mazda Foundation. We would like to thank the NTCIR (NII-NACSIS Test Collection for IR Systems) Project of the National Institute of Informatics in Japan.
References 1. AlltheWeb.com: http://www.alltheweb.com/ . 2. Barrett, R., Chan, T., Donato, J., Berry, M. and Demmel, J.: Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition, SIAM (1994). 3. Bharat, K. and Henzinger, M.: Improved Algorithm for Topic Distillation in a Hyperlinked Environment, Proc. of ACM SIGIR’98, Melbourne, Australia, pp. 104–111 (1998). 4. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. and Wiener, J.: Graph Structure in the Web, Computer Networks, pp. 309–320 (2000). 5. Chakrabarti, S., Dom, B., Kumar, S., Raghavan, P., Rajagopalan, S., Tomkins, A., Gibson, D. and Kleinberg, J.: Mining the Web’s Link Structure, Computer, pp. 60–67 (1999). 6. Dean, J. and Henzinger, M.: Finding Related Pages in the World Wide Web, Proc. of the 8th World-Wide Web Conference, Amsterdam, Netherlands (1999). 7. Fuller, L. and Bechtel, R.: Introduction to matrix algebra, Dickenson Pub. Co (1967). 8. Google: http://www.google.com/ . 9. Hirokawa, S. and Ikeda, D.: Structural Analysis of Web Graph, Transactions of the Japanses Society for Artificial Intelligence, Vol. 16, No. 4, pp. 525–529 (2001). 10. Internet Software Consortium: http://www.isc.org/ . 11. Kazama, K. and Harada, M.: Advanced Web Search Engine Technologies, Transactions of the Japanses Society for Artificial Intelligence, Vol. 16, No. 4, pp. 503–508 (2001). 12. Kleinberg, J.: Authoritative Sources in a Hyperlinked Environment, Journal of the ACM, pp. 604–632 (1999). 13. Kleinberg, J., Kumar, R., Raghavan, P., Rajagopalan, S. and Tomkins, A.: The Web as a Graph: Measurements, Models, and Methods, Computing and Combinatorics. 5th Annual International Conference, COCOON’99, pp. 1–17 (1999). 14. Kumar, R., Raghavan, P., Rajagopalan, S. and Tomkins, A.: Extracting Large-scale Knowledge Bases from the Web, Proceedings of the 25th International Conference on Very Large Databases, Edinburgh, UK, pp. 639–650 (1999). 15. Murata, T.: Discovery of Web Communities Based on Co-occurrence of References, Transactions of the Japanses Society for Artificial Intelligence, Vol. 16, No. 3, pp. 316–323 (2001). 16. NII-NACSIS Test Collection for IR Systems Project: http://research.nii.ac.jp/ ntcadm/indexen.html. 17. Openfind: http://www.openfind.com/ . 18. Page, L., Brin, S., Motwani, R. and Winograd, T.: The PageRank Citation Ranking: Bringing Order to the Web, Stanford Digital Library Project, http://dbpubs.stanford.edu/pub/1999-66, No. 1999-66 (1999). 19. Yamada, S., Murata, T. and Kitamura, Y.: Intelligent Web Information System, Transactions of the Japanses Society for Artificial Intelligence, Vol. 16, No. 4, pp. 495–502 (2001).
Ordinal Association Rules towards Association Rules
Sylvie Guillaume
Laboratoire LIMOS UMR 6158 CNRS, Université Blaise Pascal, Complexe scientifique des Cézeaux, 63177 AUBIERE Cedex, France
[email protected]
Abstract. Intensity of inclination, an objective rule-interest measure, allows us to extract implications from databases without having to go through the step of transforming the initial set of attributes into binary attributes, thereby avoiding a prohibitive number of rules of little significance with many redundancies. This new kind of rule, ordinal association rules, reveals the overall behavior of the population, and it is vital that this study be extended by exploring specific ordinal association rules in order to refine our analysis and to extract behaviors in sub-sets. This paper focuses on the mining of association rules based on extracted ordinal association rules in order to, on the one hand, remove the discretization step of numeric attributes and the step of complete disjunctive coding, and, on the other hand, obtain a variable discretization of numeric attributes, i.e. one dependent on the association of attributes. The study ends with an evaluation of an application to some banking data.
Keywords: Association rules, interestingness measures, implicative analysis and numeric attributes
1 Introduction
Association rules [1] have demonstrated the ability to detect interesting associations between binary attributes in a database. However, they use frequency statistics (support and confidence) [2] [3] and hence have limited utility for ordinal analyses (i.e. analyses with numeric and ordinal categorical attributes). Srikant and Agrawal [4] describe techniques that can be used for automatic discretization of numeric attributes and that allow previous algorithms to be used; how to determine the basic intervals is also the problem at the center of [5]. Aumann and Lindel [6] and also G.I. Webb [7] propose a new kind of rule, impact rules, which extract useful interactions between combinations of binary attributes on the left-hand side of the rule and a numeric attribute on the right-hand side of the rule. [8] discuss association rules where the left-hand side contains two numeric attributes and the right-hand side contains one boolean attribute. Finally, Guillaume [9] also proposes a new kind of rule, ordinal
association rules, which extract implications between conjunctions of any type of attributes. These rules are based on an objective measure, intensity of inclination, which evaluates the "smallness" of the number of transactions which contradict the rule. This measure removes the transformation step of the data (i.e. the discretization step of numeric attributes and the step of complete disjunctive coding), thereby avoiding a prohibitive number of rules which have little significance and many redundancies. The ordinal association rules extracted by this measure reveal the overall behavior of transactions in the database. However, this is not sufficient, and we also need to know the behavior of sub-sets of the database. This paper focuses on a technique for mining specific rules based on extracted ordinal association rules in order to, on the one hand, remove the transformation step of the initial attributes, and, on the other hand, obtain a variable discretization of numeric attributes, i.e. one dependent on the association of attributes. [10] present a technique for finding frequent itemsets in numeric databases without needing to discretize the attributes; this discretization depends on itemsets but is then invariable when association rules are extracted, i.e. these attributes have the same intervals whether they are on the left-hand side or the right-hand side of the rule, contrary to our technique. Our technique is split into two steps, each step going towards a more significant degree of specialization: the first step allows us to obtain specific ordinal association rules from ordinal association rules, and the second to extract association rules from specific ordinal association rules. Moreover, this technique is particularly suitable for sparse data since it does not seek frequent itemsets and therefore does not use support. The remainder of the paper is organized as follows. In section 2 we explain the principle and the criteria which make it possible for the intensity of inclination to retain an ordinal association rule; this is done in order to justify the technique selected in section 3 to extract rules on sub-sets of the database. In section 3 we define how to mine specific rules from ordinal association rules. In section 4 this technique of extraction is evaluated on banking data, and we conclude with a summary in section 5.
2
Intensity of Inclination
First, we explain the principle of this measure (i.e. why the rule holds with this measure) in order to understand the technique for mining specific rules presented in section 3. Intensity of inclination [9], which is an objective rule-interest measure evaluating the implication between conjunctions of ordinal attributes, is a generalization of intensity of propensity [11] (measure evaluating the implicative interaction between two numeric attributes taking values in an interval [0..1]). Let X and Y be respectively two conjunctions of p and q ordinal attributes. We suppose that X = X1,..,Xp and Y = Y1,..,Yq, where X1, .., Xp, Y1, .., Yq are ordinal attributes taking values x 1i , .., x pi , y 1i , .., y qi (i∈{1..N}, N representing the number of transactions in the database) respectively in [ x1min ..x1max ] , .., [ x pmin ..x pmax ] ,
Ordinal Association Rules towards Association Rules [ y1 min .. y1 max ] ,
163
.., [ y qmin .. y qmax ] . In order to take account of ordinal categorical attributes,
an appropriate coding into numeric attributes is required. p
q
∑y
Let xi =∑ x ji and yi = j =1
ki
be respectively values of attributes X and Y for
k =1 p
q
∑y
transaction ei and let xmin =∑ x jmin and ymax = j =1
be respectively the minimum and
kmax
k =1
maximum values taken by the attributes X and Y in the database. Let r and s be respectively the number of values taken by attributes X and Y in the database. We assume that values taken by attributes X and Y are ordered in this way : (x(1)=xmin) < x(2) < … < x(r-1) < (x( r )=xmax) and (y(1)=ymin) < y(2) < … < y(s-1) < (y(s)=ymax). The intensity of inclination evaluates the ″smallness″ of the number N
to = ∑ (xi -xmin )(ymax -yi ) of negative transactions, therefore a comparison between i =1
observed frequencies1 nij and expected frequencies s
r
k =1
k =1
must be carried out (with
n i. n .j N
n i. =∑nik and n .j =∑nkj ). A complete definition of intensity of inclination is presented
in [9]. Table 1 gives the contingency distribution of differences between observed and expected frequencies for attributes X and Y in the database according to the numbers r and s of values taken respectively by X and Y. Remark. The sum of all the lines and all the columns in the contingency table of differences is equal to zero :
∀i∈{1,..,r}
n
∀j∈{1,..,s}
n
i1
−
1j
−
n n .1
N n n .j
N
i.
1.
+ ... + n
ij
−
+ ... + n
ij
−
n n .j
N n n .j
N
i.
i.
+ ... + n
is
−
+ ... + n
rj
−
n n .s
N n n .j
N
i.
r.
=n =n
i.
.j
− −
n
i.
N n
(n .j
N
.1
(n
+ ... + n
1.
.j
+ ... + n
+ ... + n ) = 0
i.
.s
+ ... + n
r.
) = 0
Intensity of inclination guarantees that observed frequencies deviate significantly from expected frequencies particularly in the lower left part of table 1 and it is possible that gradually decreases as we move away from this part (see the gradation of grey in table 1 which represents the strengths of differences that we must associate). Example. In order to illustrate our remarks, we take a real-life example from some banking data where the database is presented in section 4. The ordinal association rule ″X1=stocks, X2=permanent overdraft facility → Y1=house purchase saving plan″ has been extracted with an intensity of inclination equal to 0.9563. Attributes X1, X2 and Y1 represent respectively the number of accounts ″stocks″, the number of accounts 1
Observed frequency nij represents the number of transactions which verify the value x(i) of attribute X (i.e. X=x(i)) and the value y(j) of attribute Y (i.e. Y=y(j)).
164
Sylvie Guillaume
″permanent overdraft facility″ and the number of ″house purchase saving plans″ opened by a household. Table 2 represents the contingency table of attributes X (X=X1,X2) and Y (Y=Y1). Values of attributes X1 and X2 are respectively in intervals [0..5] and [0..2], and values of Y are in [0..6]. No transaction exists with the following three combinations : (4 stocks accounts, 2 permanent overdraft facility accounts), (5 stocks, 1 permanent overdraft facility) and (5 stocks, 2 permanent overdraft facility), this is why we do not find values 6 and 7 for attribute X. Table 3 represents the contingency table of differences between observed and expected frequencies of attributes X and Y in banking data. Table 1. Contingency table of differences between observed and expected frequencies for attributes X and Y (1)
x min=x … X(i) … xmax=x( r )
ymin=y(1) n11-n.1n1./N … ni1-n.1ni./N … nr1-n.1nr./N 0
… … … … … … …
y(j) n1j-n.jn1./N … nij-n.jni./N … nrj-n.jnr./N 0
… … … … … … …
ymax=y(s) n1s-n.sn1./N … nis-n.sni./N … nrs-n.snr./N 0
0 … 0 … 0 0
Table 2. Contingency table for the banking example
X=0 X=1 X=2 X=3 X=4 X=5 Total
Y=0 33,473 8,563 1,218 149 7 1 43,411
Y=1 1,565 813 184 22 2 1 2,587
Y=2 538 277 57 12 1 1 886
Y=3 Y=4 Y=5 Y=6 Total 89 26 3 0 35,694 55 16 1 2 9,727 23 6 1 0 1,489 4 2 0 0 189 0 0 0 0 10 0 0 0 0 3 171 50 5 2 47,112
Table 3. Contingency table of differences for the banking example
X=0 X=1 X=2 X=3 X=4 X=5
Y=0 583 -400 -154 -25 -2 -2 0
Y=1 -395 279 102 12 1 1 0
Y=2 -133 94 29 8 1 1 0
Y=3 -41 20 18 3 0 0 0
Y=4 -12 6 4 2 0 0 0
Y=5 -1 0 1 0 0 0 0
Y=6 -2 2 0 0 0 0 0
-1 1 0 0 0 0 0
Ordinal Association Rules towards Association Rules
165
Y=y(j) Z1 +w
X=x
(i)
-w
Zone of negative differences
-w 0
+w
Zone of positive differences Z2
0
Fig. 1. Form of contingency table of differences for attributes X and Y in the simplest cases
Thus, for X=5 and Y=0 there are two fewer transactions compared to what could be expected and on the contrary, for X=1 and Y=1, there are 279 more transactions compared to what could be expected under the assumption that X and Y are independent.
3
Ordinal Association Rules towards Association Rules
In this section, we present the technique for discovering significant relationships on subsets of the database. We start with ordinal association rules extracted from the overall database [9] to go towards specialized rules i.e. verified by subsets of the database. This top-down discovery is split into two steps, each step going towards a more significant degree of specialization. The first one allows us to obtain specific ordinal association rules and the second one to extract association rules. 3.1
First Specialization Step
The first step allows us to obtain specific ordinal association rules i.e. rules of the following form X=[x(i1)..x(i2)] → Y=[y(j1)..y(j2)] with (x(i1), x(i2))∈ [x(1)..x(r)]² and (y(j1), y(j2))∈[y(1)..y(s)]². Principle. Thanks to the properties of the contingency table of differences (i.e. the sum of each line and each column being equal to zero), we find that, schematically, we have a table like that shown in figure 1, obviously in the simplest cases i.e. when the number of distinct values for respectively X and Y is approximately ten. In the other cases, we also find this kind of table but with a more significant number of zones (i.e. subsets of the contingency table of differences) of positive and negative differences. Let Nep be the total number of the positive differences between observed and expected frequencies and m be the number of zones Zk (k∈{1,..,m}) of positive differences. For the table represented by figure 1, Nep=2w and m=2 and for table 3, Nep=1167 et m=2. Thanks to the experiments carried out, the table of differences has many cases (X=x(i), Y=y(j)) whose value is equal to zero. This is why we try to find in the zone Zk, the largest rectangle Rk which does not contain any zero value and which has a support Pep (Pep=nek/Nep where nek represents the number of positive differences of the
166
Sylvie Guillaume
rectangle Rk) superior to a user-specified minimal support minep. We define this rectangle Rk by the two following points : the higher left point Pg (X=x(i1),Y=y(j1)) and the lower right point Pd (X=x(i2),Y=y(j2)). The knowledge of these two points allows us to obtain the specific ordinal association rule X=[x(i1).. x(i2)]→Y=[y(j1).. y(j2)]. To refine our analysis, we calculate the support Tc of the rule which is equal to
(X =[x
..x(i2)])ei∈Ω∩(Y =[y(j1)..y(j2)])ei∈Ω . N
(i1)
The RRAOS algorithm for mining specific ordinal association rules is described below. input : Set RAO of ordinal association rules ROS=∅ %initialization of the set of specific ordinal association rules for each ordinal association rule of RAO Calculation of the contingency table of differences Search for m zones Zk for each zone Zk Search for the rectangle (i1) (j1) (i2) (j2) Rk [Pg(X=x ,Y=y ), Pd (X=x ,Y=y )] if Pep ≥ minep (i1) (i2) (j1) (j2) RO=[RO ; X=[x .. x ] → Y=[y .. y ] Pep Tc] end if % Pep ≥ minep end for %each zone Zk end for %each ordinal association rule of RAO output : Set RAOS of specific ordinal association rules Example. If we take the banking example again, the RRAOS algorithm will extract two specific ordinal rules which are summarized in table 4. A post-processing step (i.e. use of subjective measures) would remove the uninteresting rule "X=0→Y=0". 3.2
Second Specialization Step
The second step allows us to obtain association rules starting from specific ordinal association rules in order to obtain more specific rules with initial attributes Xi (i∈{1..p}). Table 4. RRAOS algorithm for the banking example
Input : RAO={″X1= stocks, X2= permanent overdraft facility → Y1= house purchase saving plan ″} Zones Zk Rectangles Rk Pep Tc Z1 R1 [Pg (X=0,Y=0),Pd (X=0,Y=0)] 0.50 0.71 Z2 R2 [Pg (X=1,Y=1),Pd (X=3,Y=4)] 0.49 0.03 Output : RAOS={″X=0 → Y=0″ Pep=0.50 Tc=0.71 ; Tc=0.03 } ″X=[1..3] → Y=[1..4]″ Pep=0.49
Ordinal Association Rules towards Association Rules
167
Let r1, .., rp be respectively the number of values taken by attributes X1, .., Xp and s be the number of values taken by attribute Y. This step allows us to obtain rules of the following form : X1=[x1(i1)..x1(i2)]∧..∧ Xp=[xp(i1)..xp(i2)] → Y=[y(j1)..y(j2)] with (x1(i1), x1(i2))∈[x1(1)..x1(r1)]², …, (xp(i1), xp(i2))∈[xp(1)..xp(rp)]² and (y(j1), y(j2))∈[y(1)..y(s)]². Principle. We expand the contingency table of attributes X and Y in order to obtain initial variables Xi (i∈{1..p}). Let c(i) be the number of combinations for the line X=x(i) (i∈{1..r}) of the contingency table of attributes X and Y such that the sum of the p values X1=x1(i1)k, .., Xp=xp(ip)k is equal to x(i) : p
∀k ∈ {1..c(i)} ∑ x j (i j ) = x (i ) with i1∈{1..r1} … ip∈{1..rp}. j =1
Table 6 expands the contingency table of the banking example (i.e. table 2). We have only expanded the values X=1, X=2 and X=3 because we are interested in the specific ordinal association rule X=[1..3] → Y=[1..4]. Table 5. Expanding of the line X=x(i) of the contingency table
c(i) combinations X1=x1 … Xp=xp(ip)1 … … … X1=x1(i1)k … Xp=xp(ip)k … … … X1=x1(i1)c(i) … Xp=xp(ip)c(i) Total (i1)1
X=x(i)
Y=y(1) ni11 … ni1k … ni1c(i) ni1
… … … … … … …
Y=y(j) nij1 … nijk … nijc(i) nij
… … … … … … …
Y=y(s) nis1 … nisk … nisc(i) nis
Table 6. Contingency table partially expanded for the banking example
X=0 X1=0, X2=1 X1=1, X2=0 X1=0, X2=2 X1=1, X2=1 X1=2, X2=0 X1=1, X2=2 X1=2, X2=1 X1=3, X2=0
X=1 X=2 X=3 X=4 X=5 Total
Y=0 33,473 5,432 3,131 0 509 709 0 104 45 7 1 43,411
Y=1 Y=2 Y=3 Y=4 Y=5 Y=6 Total 1,565 538 89 26 3 0 35,694 564 186 36 11 0 1 6,230 249 91 19 5 1 1 3,497 1 0 0 0 0 0 1 98 34 17 2 1 0 661 85 23 6 4 0 0 827 0 0 0 0 0 0 0 16 8 3 2 0 0 133 6 4 1 0 0 0 56 2 1 0 0 0 0 10 1 1 0 0 0 0 3 2,587 886 171 50 5 2 47,112
168
Sylvie Guillaume
Let II1 be the value of intensity of inclination for the specific ordinal association rule X=[x(i1)..x(i2)] → Y=[y(j1)..y(j2)]. The aim is to determine for the line X=x(i), combinations X1=x1(i1)k, .., Xp=xp(ip)k which are significant when the specific ordinal association rule appears. We therefore remove each combination from the expanded contingency table and we calculate the new value II2 of the intensity of inclination. If this new value is sufficiently inferior to II1, this means that this combination is significant for the rule. In order to measure this difference between the two values of the intensity of inclination, we calculate the ratio II1/II2. This ratio must be higher than a user-specified minimal ratio minr (the minimal ratio must be higher than 1). We can refine our analysis by calculating the support Tc of the association rule. The RRA algorithm for finding association rules is described below. input : Set RAOS of specific ordinal association rules RA=∅ %initialization of the set of association rules for each rule of RAOS where II1 represents the intensity of inclination Calculation of the expanded contingency table (i) for each line X=x (i) for each combination of the line X=x Removal of the combination from the contingency table Calculation of the intensity of inclination II2 if II1/II2 ≥ minr (i1) (ip) (j1) (j2) RA=[RA ; X1=x1 , .., Xp=xp → Y=[y ..y ] Tc ] end if % II1/II2 ≥ minr (i) end for %each combination of the line X=x (i) end for %each line X=x end for %each specific ordinal association rule of RAOS output : Set RA of association rules Example. RRA algorithm has been developed on the banking example and is summarized in table 7. The minimal ratio minr is equal to 1.005. Thus, the association rule X1=[0..2] ∧X2=1→Y=[1..4] has been extracted with a support Tc equal to 0.1491. Table 7. RRA Algorithm developed on the banking example
Combination X1=0,X2=1 X1=1,X2=0 X1=0,X2=2 X1=1,X2=1 X1=2,X2=0 X1=1,X2=2 X1=2,X2=1 X1=3,X2=0
II2 0.9213 0.9530 0.9561 0.9218 0.9533 0.9563 0.9487 0.9547
II1/II2 1.0380 1.0034 1.0002 1.0374 1.0032 1.0000 1.0080 1.0017
Association rule X1=0∧X2=1→Y=[1..4] X1=1∧X2=1→Y=[1..4] X1=2∧X2=1→Y=[1..4]
Ordinal Association Rules towards Association Rules
4
169
Evaluation on the Banking Data
In this section, we shall present association rules discovered in the banking data. First, we shall describe the banking database and then we shall give some association rules which have been discovered. The banking database consists of 47,112 transactions described by 44 numeric attributes. Attributes can be broken down into three categories : -
information about customers (age, number of years with bank, …), information about various accounts opened with the bank (bonds, mortgages, savings accounts, …) and statistics about various accounts (rate of indebtedness, total income, …).
The characteristic of this database is that data is sparse i.e. the percentage of customers who make use of a financial service is small. This can be seen in the banking example of section 3 since 71.05% of the customers do not have X and Y, 75.76% do not have X and 92.14% do not have Y. The extraction of the association rules is based on support and confidence, which means we have a prohibitive number of uninteresting rules like for example ″X=0→Y=0″, ″X=1→Y=0″, ″X=2→Y=0″, ″Y=1→X=0″, … For this kind of data, it is necessary to apply other methods. 708 implications will be admitted at a 95% level of confidence and the minimal ratio minr will be equal to 1.005. We have only extracted rules with a maximum of three attributes on the left hand side of the rule. Thus for example, we have discovered the ordinal association rule ″X1=bonds, X2=stocks → Y1=LER-PER-PEP Insurance″ with an intensity of inclination equal to 0.9632. The attributes X1, X2 et Y1 represent respectively the number of accounts opened by customers for the financial services mentioned above and these attributes take their values in respectively [0..4]∪{6}, [0..5] and [0..11]. Then, the specific ordinal association rule ″X=[1..4] →Y1=[1..3]″ has been extracted with a support equal to 0.0381. This former rule has been specialized and the two following association rules have been extracted : ″X1=0, X2=[1,2] →Y1=[1..3]″ with Tc=0.0189 and ″X1=1, X2=[0..2] →Y1=[1..3]″ with Tc=0.0184. We verify that the ″stocks″ attribute has an interval (i.e. [1,2]) which is different from the banking example used to illustrate the technique for mining specific rules in section 3. Another example, the ordinal rule "X1=LER-PER-PEP Insurance, X2=stocks, X3=Credimatic → Y1=house purchase saving plan" has been discovered with an intensity of inclination equal to 0.9605. Values of these attributes are respectively in intervals [0..11], [0..5], [0..6] and [0..6]. Then, the specific ordinal rule "X=[1..5] →Y1=[1..3]" has been extracted with a support equal to 0.0371 and finally, the association rules "X1=0, X2=0, X3=[1..2] → Y1=[1..3]" with Tc=0.0127 and ″X1=1, X2=1, X3=0 → Y1=[1..3]" with Tc=0.0022 have been extracted.
170
Sylvie Guillaume
5
Conclusion and Further Work
Unsupervised usual algorithms for the discovery of association rules require a transformation of initial attributes into binary attributes. As the complexity of these algorithms increases exponentially with the number of attributes, this transformation step can lead us, on the one hand to a combinatorial explosion and on the other hand to a prohibitive number of rules of little significance with many redundancies. Moreover, we have a discretization of numeric attributes independent of association with the other attributes since this takes place in the pre-processing step. This is why we have proposed a technique for rule discovery, which prunes out the discretization step of numeric attributes before extraction in order to obtain a discretization based on the other attributes, thereby avoiding a prohibitive number of rules since we first search for general ordinal association rules and then, we refine our analysis by discovering specific rules. This technique is particularly suitable for discrete numeric attributes and the study has to be extended from taking continuous numeric attributes into account. It also has to be extended with the proposal of a new interest-rule measure, called coefficient of inclination. This would enable us to better take account the specificity of the ordinal categorical attributes. A comparison with others works has to be carried out.
References [1]
[2]
[3]
[4]
[5] [6]
AGRAWAL R., IMIELINSKI T. and SWAMI A.,″Mining Association Rules between Sets of Items in Large Databases″, In Proceedings of the 1993 ACMSIGMOD International Conference on Management of Data (SIGMOD'93), Washington, D.C., ACM Press, 207-216, May 1993. MANNILA H., TOIVONEN H. and VERKAMO A.I., ″Efficient algorithms for Discovering Association Rules″. In Usama M. Fayyad and Ramasamy Uthurusamy, editors, AAAI Workshop on Knowledge Discovery in Databases, 181-192, Seattle, Washington, 1994. AGRAWAL R., MANNILA H., SRIKANT R., TOIVONEN H. and VERKAMO A.I.,″Fast Discovery of Association Rules″, In Fayyad U.M., Piatetsky-Shapiro G., Smyth P. and Uthurusamy R. eds., Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press., 307-328, 1996. SRIKANT R., AGRAWAL R., ″Mining quantitative association rules in large relational tables″, Proceedings 1996 ACM-SIGMOD International Conference Management of Data, Montréal, Canada, June 1996. MILLER R.J., YANG Y., ″Association rules over interval data″, In Proceedings of ACM SIGMOD International Conference Management of Data, 452-461, Tucson, AZ, 1997. AUMAN Y., LINDELL Y., ″A Statistical Theory for Quantitative Association Rules″, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 99), 261-270, 1999.
Ordinal Association Rules towards Association Rules
[7] [8]
[9] [10]
[11]
171
WEBB G.I., ″Discovering associations with numeric variables″, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13-23, Montréal, Canada, 1996. FUKUDA T., MORIMOTO Y., MORISHITA S., and TOKUYAMA T., ″Data mining using two-dimensional optimized association rules : Scheme, algorithms, and visualization″, In Proceedings of ACM SIGMOD International Conference Management of Data, 452-461, Tucson, AZ, 1997. GUILLAUME S., ″Discovery of Ordinal Association Rules″, In Proceedings of the Sixth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'02), 322-327, Taipei, Taiwan, 6-8 May 2002. MATA J., ALVAREZ J-L. and RIQUELME J-C., ″Discovering Numeric Association Rules via Evolutionary Algorithm″, In Proc. of the Sixth PacificAsia Conference on Knowledge Discovery and Data Mining (PAKDD'02), 4051, Taipei, Taiwan, 6-8 May 2002. LAGRANGE J.B., Analyse Implicative d'un Ensemble de Variables Numériques, Application au Traitement d'un Questionnaire à Réponses Modales Ordonnées, rapport interne Institut de Recherche Mathématique de Rennes, Prépublication 97-32 Implication Statistique, Décembre1997.
Rough Set based Decision Tree Model for Classification Sonajharia Minz 1 and Rajni Jain 1, 2 1
School of Computers and Systems Sciences, Jawaharlal Nehru University, New Delhi, India 110067 [email protected] 2 National Centre for Agricultural Economics and Policy Research, Library Avenue, Pusa, New Delhi, India 110012 [email protected], [email protected]
Abstract. Decision tree, a commonly used classification model, is constructed recursively following a top down approach (from the general concepts to particular examples) by repeatedly splitting the training data set. ID3 is a greedy algorithm that considers one attribute at a time for splitting at a node. In C4.5, all attributes, barring the nominal attributes used at the parent nodes, are retained for further computation. This leads to extra overheads of memory and computational efforts. Rough Set theory (RS) simplifies the search for dominant attributes in the information systems. In this paper, Rough set based Decision Tree (RDT) model combining the RS tools with classical DT capabilities, is proposed to address the issue of computational overheads. The experiments compare the performance of RDT with RS approach and ID3 algorithm. The performance of RDT over RS approach is observed better in accuracy and rule complexity while RDT and ID3 are comparable. Keywords: Rough set, supervised learning, decision tree, feature selection, classification, data mining.
1 Introduction Decision Tree (DT) has increasingly gained popularity and is commonly used classification model. Following a top down approach, DT is constructed recursively by splitting the given set of examples [7]. In many real time situations, there are far too many attributes to be handled by learning schemes, majority of them are redundant. Reducing the dimension of the data reduces the size of the hypothesis-space and allows algorithms to operate faster and more effectively. In some cases, accuracy of future classification can be improved while in others the target concept is more compact and is easily interpreted [3]. For DT induction, ID3 and its successor C4.5 [7] by Quinlan, are widely used. Although both attempt to select attributes appropriately, all attributes, barring the nominal attributes that have been used at the parent nodes, are retained for further computation involved in splitting criteria. Irrelevant attributes are not filtered until the example reaches the leaf of the decision tree. This leads to extra overhead in terms of memory and computational efforts. Its performance can be improved by prior selection of attributes [3]. We propose to use Rough Set theory (RS) [6] introduced by Pawlak in early 80’s, for attribute subset selection. The intent of RS Y. Kambayashi, M. Mohania, W. W¨oß (Eds.): DaWaK 2003, LNCS 2737, pp. 172-181, 2003. c Springer-Verlag Berlin Heidelberg 2003
Rough Set Based Decision Tree Model for Classification
173
is based on the fact that in real life while dealing with sets, due to limited resolution of our perception mechanism, we can distinguish only classes of elements rather than individuals. Elements within classes are indistinguishable. RS offers a simplified search for dominant attributes in data sets [6,11]. This is used in the proposed RS based Decision Tree (RDT) model for classification. RDT combines merits of both RS and DT induction algorithm. Thus, it aims to improve efficiency, simplicity and generalization capability of both the base algorithms. The paper is organized as follows. In section 2, the relevant concepts of rough set theory are illustrated. The proposed RDT model is described in section 3. Section 4 formulates the performance evaluation criteria. After discussing experimental results in section 5, we conclude with directions for future work in section 6.
2 Rough Set Theory 2.1 Concepts A brief review of some concepts of RS [6,11], used for mining classification rules are presented in this section. 2.1.1 Information System and Decision Table. In RS, knowledge is a collection of facts expressed in terms of the values of attributes that describe the objects. These facts are represented in the form of a data table. Entries in a row represent an object. A data table described as above is called an information system. Formally, an information system S is a 4-tuple, S = (U, Q, V, f) where, U a nonempty, finite set of objects is called the universe; Q a finite set of attributes; V= ∪Vq, ∀q ∈ Q and Vq being the domain of attribute q; and f : U × Q →V, f be the information function assigning values from the universe U to each of the attributes q for every object in the set of examples. A decision table A = (U, C ∪ D), is an information system where Q = (C ∪ D). C is the set of categorical attributes and D is a set of decision attributes. In RS, the decision table represents either a full or partial dependency occurring in data. 2.1.2 Indiscernibility Relation. For a subset P ⊆ Q of attributes of an information system S, a relation called indiscernibility relation denoted by IND is defined as, INDs (P)={ (x, y) ∈ U × U : f(x, a)=f(y, a) ∀ a∈ P}. If (x, y) ∈ INDs(P) then objects x and y are called indiscernible with respect to P. The subscript s may be omitted if information system is implied from the context. IND(P) is an equivalence relation that partitions U into equivalence classes, the sets of objects indiscernible with respect to P. Set of such partitions are denoted by U/IND(P). 2.1.3 Approximation of Sets. Let X ⊆ U be a subset of the universe. A description for X is desired that can determine the membership status of each object in U with respect to X. Indiscernibility relation is used for this purpose. If a partition defined by IND(P) partially overlaps with the set X, the objects in such an equivalence class can not be determined without ambiguity. Consequently, description of such a set X may
174
Sonajharia Minz and Rajni Jain
not be possible. Therefore, the description of X is defined in terms of P-lower approximation (denoted as P ) and P-upper approximation (denoted as P ), for P ⊆ Q
P X = ∪{Y ∈ U / IND ( P ) : Y ⊆ X }
(1)
P X = ∪{Y ∈ U / IND ( P ) : Y ∩ X ≠ φ }
(2)
A set X for which P X = P X is called as exact set otherwise it is called rough set with respect to P. 2.1.4 Dependency of Attributes. RS introduces a measure of dependency of two subsets of attributes P, R ⊆ Q. The measure is called a degree of dependency of P on R, denoted by γ R(P). It is defined as,
γ R ( P) =
card ( POS R ( P )) card (U )
, where
POS R ( P) =
∪
X ∈U / IND ( P )
RX
(3)
The set POSR(P), positive region, is the set of all the elements of U that can be uniquely classified into partitions U/IND(P) by R. The coefficient γ R (P ) represents the fraction of the number of objects in the universe which can be properly classified. If P totally depends on R then γ R (P) = 1, else γ R (P) < 1. 2.1.6 Reduct. The minimum set of attributes that preserves the indiscernibility relation is called reduct. The relative reduct of the attribute set P, P ⊂ Q, with respect to the dependency γ P (Q) is defined as a subset RED(P,Q) ⊆ P such that: 1. γ RED( P ,Q ) (Q ) = γ P (Q ) , i.e. relative reduct preserves the degree of inter 2.
attribute dependency, For any attribute a ∈ RED(P,Q), γ RED( P ,Q ) −{a} (Q ) < γ P (Q ) , i.e. the relative
reduct is a minimal subset with respect to property 1. A single relative reduct can be computed in linear time. Genetic algorithms are also used for simultaneous computation of many reducts [1,8,10]. 2.1.7 Rule Discovery. Rules can be perceived as data patterns that represent relationships between attribute values. RS theory provides mechanism to generate rules directly from examples [11]. In this paper, rules are produced by reading the attribute values from the reduced decision table. 2.2 Illustrations Example 2.2.1. Using Table 1 [9] some concepts of the information system as described in 2.1 are: U={X1, X2, X3, X4, X5, X6, X7, X8} Q={Hair, Height, Weight, Lotion, Sunburn} VHair={blonde, brown, red}, VHeight={tall, average, short}, VWeight={light, average, heavy}, VLotion={no, yes} f(X1, Hair)=blonde, i.e. value of the attribute Hair for object X1 is blonde For R={Lotion} ⊆ Q, U/IND(R)={{X1, X4, X5, X6, X7},{X2,X3,X8}}
Rough Set Based Decision Tree Model for Classification
175
The lower and upper approximation with reference to R={Lotion} for objects with decision attribute Sunburn = yes, i.e. X={X1, X4, X5} Using equation (1) and (2),
R X=φ and R X={X1, X4, X5, X6, X7} Table 1. Sunburn data set
ID
Hair
Height
Weight
Lotion
Sunburn
X1 X2 X3 X4 X5 X6 X7 X8
blonde blonde brown blonde red brown brown blonde
average tall short short average tall average short
light average average average heavy heavy heavy light
no yes yes no no no no yes
yes no no yes yes no no no
Table 2. Weather data set ID X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
Outlook sunny sunny overcast rain rain rain overcast sunny sunny rain sunny overcast overcast rain
Temperature Hot Hot Hot mild cool cool cool mild cool mild mild mild Hot mild
Humidity High High High High Normal Normal Normal High Normal Normal Normal High Normal High
Wind weak strong weak weak weak strong strong weak weak weak strong strong weak strong
Play no no yes yes yes no yes no yes yes yes yes yes no
Example 2.2.2. Let P={Sunburn} and R={Lotion} then U/IND(P)={{X1, X4, X5}, {X2, X3, X6, X7, X8}}, POSR(P)= φ ∪ {X2, X3, X8}, the dependency of Sunburn on Lotion i.e. γ R(P)=3/8. Example 2.2.3. For Table 2, two reducts R1 ={Outlook, Temperature, Wind} and R2={Outlook, Humidity, Wind} exist. One sample of the twelve rules generated using R1 is , If Outlook=sunny and Temperature=mild and Wind=weak then Play=No; On closer observation of the rules generated in the above case there exist scope for improvement on the number of selectors and the number of rules. This issue is addressed in the next section.
176
Sonajharia Minz and Rajni Jain
Decision Table
Reduct Computation Algorithm
Reduct
Remove attributes absent in reduct
Reduced Training Data
ID3 algorithm
Decision Tree Fig. 1. The Architecture of RDT model
3 Rough Decision Trees (RDT) Model The proposed RDT model combines the RS-based approaches [11] and decision tree learning capabilities [7,9]. The issues related to the greediness of the ID3 algorithm and complexity of rules in RS approach, are addressed by the proposed model. The architecture of RDT model is presented in Fig. 1. The implementation of the architecture is presented by algorithm RDT. Algorithm RDT: Rough Decision Tree Induction 1. Input the training data set T1. 2. Discretize the numeric or continuous attributes if any, and label the modified data set as T2. 3. Obtain the minimal decision relative reduct of T2, say R. 4. Reduce T2 based on reduct R and label reduced data set as T3. 5. Apply ID3 algorithm on T3 to induce decision tree. 6. If needed, convert decision tree to rules by traversing all the possible paths from root to each leaf node. The training data is a collection of examples used for supervised learning to develop the classification model. In step 2, continuous attributes in data set be discretized. A number of algorithms are available in the literature for discretization [2,3]. Any local discretization algorithm may be used as per the requirement. The two data sets used in this paper have nominal attributes. The next step involves computation of reduct. The reduct distinguishes between examples belonging to
Rough Set Based Decision Tree Model for Classification
177
different decision classes. The reduct also assists in reducing the training data, which is finally used for decision tree induction. In this paper, Johnson’s algorithm (as implemented in Rosetta software [8]) is used for computation of a single reduct. GA based algorithms have also been reported in the literature for computing the population of reducts [1,10]. This provides flexibility to the data miner for choosing the desired set of attributes in the induction of the decision tree. For example, the reducts could be ranked in terms of the cost of obtaining the values of the required attributes. A reduct score can also be computed based on the cardinality of each attribute of the reduct. The reduct with the minimum score could be preferred for further steps [11]. In step 4, by removing columns of attributes, not present in the reduct, the dimension of the learning examples are reduced. DT learning algorithm is then applied to the reduced examples in step 5. In this paper, an implementation of ID3 algorithm as proposed by Quinlan is used for inducing decision tree. Step 6 maps the tree to the rules. Outlook sunny
overcast
Temperature mild
hot no
no
Wind
yes
weak
cool yes
Wind weak
rain
yes
strong no
strong yes Fig. 2. RDT classifier for Weather data set
Example 3.1 For Table 2 the RDT produces the reduct R={Outlook, Temperature, Wind}. It generates a decision tree (Fig. 2), which is mapped to the following decision rules, 1. If Outlook=sunny and Temperature=hot then Play=no; 2. If Outlook=sunny and Temperature=mild and Wind =weak then Play =no; 3. If Outlook=sunny and Temp=mild and Wind =strong then Play =yes; 4. If Outlook=sunny and Temperature=cool then Play =yes; 5. If Outlook=overcast then Play =yes; 6. If Outlook=rain and Wind = weak then Play =yes; 7. If Outlook=rain and Wind=strong then Play =no;
178
Sonajharia Minz and Rajni Jain
It is observed that rules are less in number as well as more generalized compared to those generated in Section 2. On using GA-based algorithm two reducts (Example 2.2.3), R1={Outlook, Temperature, Wind} and R2={Outlook, Humidity, Wind} would provide different trees. R2 generates simpler tree with better accuracy than that of R1. Issue of relevant reduct selection is not addressed in this paper.
4 Evaluation Criteria for RDT To evaluate the performance of the RDT model, classification accuracy and rule complexity are considered. Using these two measures the behavior and the performances of the three models namely RDT, ID3 algorithm and RS approach, are compared. Classification accuracy is assessed by applying the algorithms to the examples not used for rule induction and is measured by the fraction of the examples for which decision class is correctly predicted by the model [4]. Fraction of instances for which no prediction can be made by the model is called uncertainty. Classification error refers to fraction of test examples, which are misclassified by the system. A set of rules induced for classification is called rule-set. A condition of the form attribute=value is called a selector. The size of rule-set may not be appropriate criterion for evaluation hence total number of selectors in a rule-set is used as a measure of complexity of the rule-set [4].
5 Results and Discussion
Complexity
Num. of Rules
40 30
10
20
5
10 0
0 Sunburn
Weather
Sunburn RS
Weather ID3
RDT
Fig.3. Comparison of RS, ID3, RDT algorithms w.r.t. complexity and number of rules for training dataset of Sunburn and Weather
The aim of this paper is to introduce the RDT model. The model has been implemented on very small sample data sets. The results from these pilot data sets can neither be used to claim nor disprove the validity and usefulness of the model over other approaches. However, the results could indicate whether or not to explore it further for data mining. Subsequent sections 5.1 and 5.2 discuss the results of the
Rough Set Based Decision Tree Model for Classification
179
three approaches mentioned in the paper for training and test data. Num. of Rules (Sunburn) 8
Num. of Rules (Weather) 12 8
4 4 0
0 1
2
3
4
1
5 Avg.
Complexity (Sunburn)
2
3
4
5 Avg.
Complexity (Weather)
12 30 8
20
4
10
0
0 1
2
3
4
5 Avg.
1
2 RS
3
4 ID3
5 Avg. RDT
Fig. 4. Comparison of RS, ID3, RDT algorithms w.r.t. number of rules, complexity for training:test::70:30 of Sunburn and Weather
5.1 Training Data Each of the three algorithms RS, ID3 and RDT were applied to the data sets mentioned in Table 1 and Table 2. It was observed that for each of the data sets accuracy is 1, thus uncertainty and error are 0. The results regarding number of rules and the complexity of rule-set are presented in Fig. 3. The size and the complexity of rule-set induced by RDT model is significantly less as compared to that of RS for both data sets. On comparing RDT with ID3, it was observed that for Sunburn data set the complexity of rules induced by the two are equal however, for Weather data set, complexity of rule-set is greater for RDT. This is attributed to the computation of a single reduct. As mentioned in examples in earlier sections, there are two reducts namely R1:{Outlook, Temperature, Wind} and R2:{Outlook, Humidity, Wind} but only a single reduct is used in the model. For a system, if only one reduct is filtered out, it is possible that alternate reduct, if any, could generate less complex rule-set under RDT. This issue may be addressed by using some measure of reduct fitness or ranking of reducts in the reduct population obtained by GA based tools for reduct computation. However, this is not dealt with in this paper.
180
Sonajharia Minz and Rajni Jain
Accuracy (Sunburn)
Accuracy (Weather)
1
1
0.5
0.5
0
0 1
2
3
4
5 Avg.
1
Error (Sunburn) 1
0.3
0.5
0
0 2
3
4
3
4
5
Avg.
5
Avg.
Error (Weather)
0.6
1
2
5
1
Avg.
2
3
4
Uncertainty (Weather)
Uncertainty (Sunburn) 0.6 0.8
0.4
0.4
0.2
0
0 1
2
3
4
5
Avg.
1
2 3 4 5 Avg. RS ID3 RDT
Fig. 5. Comparison of RS, Id3, RDT algorithms w.r.t accuracy, error, uncertainty for Training:Test::70:30 of Sunburn and Weather
5.2 Test Data For each of the three algorithms, the results over five trials were averaged for the two domains. In each trial, 70% of training examples were selected at random from entire data set and the remaining were used for testing. The training data, is used for induction of classification rules or tree while the test data is used for evaluating the
Rough Set Based Decision Tree Model for Classification
181
performance of the induced model. Each of the algorithms was implemented on the same training-test partition. These results are presented in Fig. 4 and Fig. 5. It is observed that for Sunburn data set, RDT performs better than RS in terms of complexity of rule-set, accuracy and other performance parameters while RDT model is comparable to ID3. For Weather data set, average accuracy of rules generated from RDT model has improved over RS as well as ID3 while average complexity of rules for RDT is improved over RS approach but is little more than that of ID3. 6 Conclusions The results from the experiments on the small data sets neither claim nor disprove the usefulness of the model as compared to other approaches. However, they suggest that RDT can serve as a model for classification as it generates simpler rules and removes irrelevant attributes at a stage prior to tree induction. This facilitates less memory requirements for the subsequent steps while executing the model and for classifying the test data as well as actual examples. For real data sets, at times number of reducts (may be hundreds) exist and at some other times no reduct may exist. This provides potential for further refinement of RDT model. Availability of many reducts offers scope to generate a tree avoiding evaluation of an attribute that is difficult or impossible to measure. It also offers options of using low cost decision trees. Problem related to absence of reduct for noisy domains or inconsistent data sets can be handled by computing approximate reducts using variable precision RS model [12]. Further research is being pursued to handle such real time data sets. References 1.
Bjorvand, A. T., Komorowski, J.: Practical Applications of Genetic Algorithms for Efficient Reduct Computation., Wissenschaft & Technik Verlag, 4 (1997) 601-606. 2. Grzymala-Busse, J. W., Stefanowski, Jerzy: Three Discretization Methods for Rule Induction. IJIS, 16(1) (2001) 29-38 3. Hall, Mark A., Holmes, G.: Benchmarking Attribute Selection Techniques for Discrete Class Data Mining. IEEE TKDE 20 (2002) 1-16 4. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, (2001), 279-325 5. Murthy, S.K.: Automatic Construction of decision trees from Data: A Multidisciplinary Survey. Data Mining and Knowledge Discovery 2 (1998) 345-389 6. Pawlak, Z.: Drawing Conclusions from Data-The Rough Set Way. IJIS, 16 (2001) 3-11 7. Quinlan, J. R.: C4.5: Programs for Machine Learning. Morgan Kauffman (1993) 8. Rosetta, Rough set toolkit for analysis of data available at http://www.idi.ntnu.no/~aleks/rosetta/. 9. Winston, P.H.: Artificial Intelligence. Addison-Wesley Third Edition (1992) 10. Wroblewski, J.: Finding Minimal Reduct Using Genetic Algorithms. Warsaw University of Technology- Institute of Computer Science- Reports – 16/95 (1995) 11. Ziarko, W.: Discovery through Rough Set Theory. Comm. of ACM, 42(11) (1999) 55-57 12. Ziarko, W.: Variable Precision Rough Set Model. Jr. of Computer and System Sciences, 46(1), Feb, (1993) 39-59b
Inference Based Classifier: E!cient Construction of Decision Trees for Sparse Categorical Attributes Shih-Hsiang Lo, Jian-Chih Ou, and Ming-Syan Chen Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, ROC {lodavid,alex}@arbor.ee.ntu.edu.tw, [email protected]
Abstract. Classification is an important problem in data mining and machine learning, and the decision tree approach has been identified as an e!cient means for classification. According to our observation on real data, the distribution of attributes with respect to information gain is very sparse because only a few attributes are major discriminating attributes where a discriminating attribute is an attribute, by whose value we are likely to distinguish one tuple from another. In this paper, we propose an e!cient decision tree classifier for categorical attribute of sparse distribution. In essence, the proposed Inference Based Classifier (abbreviated as IBC ) can alleviate the “overfitting” problem of conventional decision tree classifiers. Also, IBC has the advantage of deciding the splitting number automatically based on the generated partitions. LEF is empirically compared to F4=5, VOLT and K-means based classifiers. The experimental results show that LEF significantly outperforms the companion methods in execution e!ciency for dataset with categorical attributes of sparse distribution while attaining approximately the same classification accuracies. Consequently, LEF is considered as an accurate and e!cient classifier for sparse categorical attributes.
1
Introduction
Classification is an important issue both in data mining and machine learning, with such important techniques as Bayesian classification [3], neural networks [16], genetic algorithms [7] and decision trees [1][12]. Decision tree classifiers have been identified as e!cient methods for classification. In addition, it was proven that a decision tree with scale-up and parallel capability is very e!cient and suitable for large training sets [9][17]. Also, decision tree generation algorithms do not require additional information than that already contained in the training data [5][12]. Eventually, decision trees earn similar and sometimes better accuracy compared to other classification methods [11]. Numerous decision tree algorithms have been developed over the years, e.g., ID3 [13], C4.5 [12], CART [1], VOLT [9], SPRINT [17]. However, even being capable of processing both numerical and categorical attributes, most existing decision tree classifiers are not explicitly designed for categorical attributes Y. Kambayashi, M. Mohania, W. W¨oß (Eds.): DaWaK 2003, LNCS 2737, pp. 182-191, 2003. c Springer-Verlag Berlin Heidelberg 2003
Inference Based Classifier ID
age
income
credit-rating
1
<=30
low
fair
2
<=30
low
excellent
3
31...40
high
excellent
4
>40
med
fair
5
>40
med
fair
6
31…40
high
excellent
7
<=30
low
fair
183
Fig. 1. A small credit-rating dataset
[9][10][17]. Thus, their performance on categorical data is not particularly optimized. Consequently, we propose in this paper an e!cient decision tree classifier which not only considers on the categorical attribute’s characteristics in practical datasets but also alleviates the “overfitting” problem of traditional decision tree classifiers. According to our observation on real data, the distribution of attributes with respect to information gain is very sparse because only a few attributes are major discriminating attributes where a discriminating attribute is an attribute, by whose value we are likely to distinguish one tuple from another. As shown by our experiments without dealing with the descriminating attribute separately, prior works are not suitable for processing data set with sparse attributes. We call the attribute corresponding to the target label to classify the target attribute. An attribute which is not a target attribute is called an ordinary attribute. For example, the target attribute in Figure 1 is “credit-rateing,” and two ordinary attributes are “age” and “income.” It is first observed that in many real-life datasets, such as customers’ credit-rating data of banks and credit-card companies, medical diagnosis data and document categorization data, the attributes are mostly categorical attributes and the value of an attribute usually implies one target class. An inference class is defined as the target class to which the majority of an attribute value belongs. For example in Figure 1, the target classes refer to the values of attribute “credit-rating.” The value “?= 30” of attribute “age” has the inference class “fair.” Then, note that after mapping each ordinary attribute value to its influence class, it would be better to divide the ordinary attributes according to their inference classes instead of their original values before proceeding to perform the goodness function, e.g., GINI index, computation for node splitting. As will be shown later, by doing so the execution e!ciency is improved and the overfitting problem can be alleviated. A detailed example will be given in Section 3.1 later. Note that the number of wdujhw fodvvhv in real-life datasets is usually less than that of an attribute. For the example in Figure 1, the number of wdujhw fodvvhv, 2, is less than that of attribute “age,” 3 and also that of attribute “income,” 3. This fact is indeed instrumental to the e!cient execution of the proposed algorithm as will be validated by our experimental studies.
184
Shih-Hsiang Lo et al.
Explicitly, the decision tree classifier, Inference Based Classifier (abbreviated as IBC ) proposed, is a two phases decision tree classifier for categorical attributes. In the first phase, IBC partitions each attribute’s values according to their inference classes. By partitioning the attribute’s value based on its inference class, IBC can identify the major discriminating attribute from sparse categorical attributes and also alleviate the overfitting problem which most conventional decision tree classifiers suer. Recall that some “overfitting” problem might be induced by small data in the training dataset which do not appear in the testing dataset at all. This phenomenon can be alleviated by using the inference class to do the classification. In the second phase of LEF, by evaluating the goodness function for each attribute, LEF selects the best splitting attribute from all attributes as the splitting node of a decision tree. In addition to alleviating the overfitting problem, LEF has the advantage of deciding the splitting number automatically based on the generated partitions, since unlike prior work [4] no additional clustering on the attributes is needed to determine the splitting by IBC. In the experimental study, we compare LEF with C4.5, VOLT and K-mean based classifiers for categorical attributes. The experimental results show that LEF significantly outperforms companion schemes significantly in execution e!ciency while attaining approximately the same classification accuracy. Consequently, LEF is considered as a practical and e!cient categorical attribute classifier. The remainder of the paper is organized as follows. Preliminaries are given in Section 2. In Section 3, we present the new developed decision tree for categorical attributes. In Section 4, we conduct the experiments to access the performance of LEF. Finally, we give a conclusion in Section 5.
2
Preliminaries
Suppose D is an attribute and {d1 , d2 > ===, dp } are p possible values of attribute D. The domain of the target classes w is represented by domain(w)= {f1 , f2 > ===, f|grpdlq(w)| }. The inference class for a value dl of attribute D, denoted by fdl , is the target class to which most tuples with their attribute D= dl imply. Explicitly, use qD (dl > fn ) to denote the number of tuples which imply to fn and have a value of dl in their attribute D. Then, we have qD (dl > fdl ) =
max
fn 5grpdlq(w)
{qD (dl > fn )} =
(1)
The inference class for each value of attribute D can hence be obtained. For the example profile in Figure 1, if D is “age” with value “?=30,” then domain(w)= {fair, excellent}, and qD (?=30, fair )=2, and qD (?=30, excellent)=1. “fair ” is therefore the inference class for the value “?=30” of the attribute “age.” Definition 1: The unique target class to which most tuples with their attribute D=dl imply is called the lqi huhqfh fodvv for a value dl of attribute D.
Inference Based Classifier
185
If the target class to which most tuples with their attribute D = dl imply is not unique, we say attribute value dl is associated with a neutral class. Also, we call that value dl is a neural attribute value. As will be seen later, by replacing the original attribute value with its inference class in performing the node-splitting, IBC is able to build the decision tree very e!ciently without comprising the quality of classification.
3
Inference Based Classifier
In essence, LEF is a decision tree classifier that refines the splitting criteria for categorical attributes in the building phase in order to reveal the major discriminating attribute from sparse attributes. Also, IBC can alleviate the overfitting problem and improve the overall execution e!ciency. Note that information gain and GINI index are common measurements for selecting the best split node. Without loss of generality, we adopt information gain as a measurement to identify the sparsity of attributes and GINI index as the measurement for node splitting criterion. 3.1
Algorithm of IBC
As described earlier, LEF is divided into two major phases, i.e., partitioning values of an attribute according to their inference class, to be presented in Section 3.2, and selecting the best splitting attribute with the lowest JLQ L index value from these attributes, to be presented in Section 3.3. 3.2
Inference Class Identification Phase
Algorithm LEF : Inference Based Classifier Lqlwldo S kdvh: MakeDecisionTree(Training Data W ) 1. BuildNode(W ) S kdvh Rqh : Lqi huhqfh Fodvv Lghqwli lfdwlrq S kdvh EvaluateSplit(Data V) 1. begin for each attribute D in V 2. begin for value dl in D do 3. if dl is a neural attribute value then 4. Categorize dl to the partition of NEURAL 5. else 6. Categorize dl to the partition of dl ’s inference class 7. end 8. if there are two or more partitions then 9. Compute the gini index for these partitions 10. else 11. Return no gini index 12. end
186
Shih-Hsiang Lo et al.
In the Initial Phase of IBC, the training dataset is input for the tree building. Before the evaluation for the best split, the first phase of IBC, Inference Class Identification Phase, scans each attribute in data V from Step 3 to Step 12. From Step 2 to Step 7, IBC first assigns an inference class to each attribute value and groups the attribute’s values into partitions according to their inference classes. If there are two or more partitions, the JLQ L index is calculated and returned in Step 9. Otherwise, IBC returns no JLQ L index in Step 11. No 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Outlook Temperature Humidity Sunny Hot High Sunny Hot High Overcast Hot High Rain Mild High Rain Cool Normal Rain Cool Normal Overcast Cool Normal Sunny Mild High Sunny Cool Normal Rain Mild Normal Sunny Mild Normal Overcast Mild High Overcast Hot Normal Rain Mild High
Windy Class False N True N False P False P False P True N True P False N False P False P True P True P False P True N
Table 1. A training dataset
For illustrative purposes, the following example uses the training data in Table 1 which has four categorical attributes, i.e., Outlook, Temperature, Humidity and Windy, and one target attribute, i.e., Class. For each node of a decision tree, LEF first counts the target class occurrences of each attribute value and assigns an lqi huhqfh fodvv to each attribute value. Then, LEF partitions the values of an attribute into groups according to their inference classes and calculates the JLQ L index for each attributes. Finally, LEF selects the attribute with the lowest JLQ L index as the splitting attribute for the decision tree node. For attribute Outlook in Table 1, LEF evaluates the counts of target classes for each attribute value shown in Table 2. Next, LEF partitions these attributes into two groups according to their inference classes and calculates the corresponding JLQ L index of attribute Outlook. For other attributes, Temperature, Humidity and Windy, the counts of target classes and inference class of attribute values are listed in Table 3, 4 and 5 accordingly. 9 7 5 2 2 3 )(1 ( )2 ( )2 ) + ( )(1 ( )2 ( )2 ) = 0=3937 (2) 14 9 9 14 5 5 For other attributes, Temperature, Humidity and Windy, the counts of target classes and inference class of attribute values are listed in Table 3, 4 and 5 jlqlRxworrn = (
Inference Based Classifier
187
accordingly. Note that the attribute value, Hot, of attribute, Temperature, is a neural attribute value whose lqi huhqfh fodvv is NEURAL. Further, the JLQ L indexes of these attributes are calculated as follows. jlqlW hpshudwxuh = (
jlqlKxplglw| = (
jlqlZ lqg| = (
4 2 10 7 2 3 )(1( )2 ( )2 )+( )(1( )2 ( )2 ) = 0=4429 (3) 14 4 4 14 10 10
7 3 7 6 4 1 )(1 ( )2 ( )2 ) + ( )(1 ( )2 ( )2 ) = 0=3673 (4) 14 7 7 14 7 7
6 3 8 6 3 2 )(1 ( )2 ( )2 ) + ( )(1 ( )2 ( )2 ) = 0=4286 14 6 6 14 8 8
(5)
jlqlmin = min(jlqlrxwrrn > jlqlW hpshudwxuh > jlqlKxplglw| > jlqlZ lqg| ) = jlqlKxplglw| (6) attribute values Sunny Overcast Rain
class counts lqi huhqfh N P fodvv 3 2 N 0 4 P 2 3 P
Table 2 Attribute Outlook attribute values High Normal
class counts lqi huhqfh N P fodvv 4 3 N 1 6 P
Table 4 Attribute Humidity
3.3
attribute values Hot Mild Cool
class counts lqi huhqfh N P fodvv 2 2 NEURAL 2 4 P 1 3 P
Table 3 Attribute Temperature attribute values True False
class counts lqi huhqfh N P fodvv 3 3 NEURAL 2 6 P
Table 5 Attribute Windy
Node Split Phase
S kdvh W zr : Q rgh Vsolw S kdvh BuildNode(Data V) 1. If (all tuples in V are in the same class) then return 2. Call EvaluateSplit(V) function to evaluate splits for each attribute D in V 3. Use the best split found to partition V into V1 ===Vq where q is the number of partitions 4. begin for each partition Vl in V where l=1 to q 5. BuildNode(Vl )
188
Shih-Hsiang Lo et al. Humidity High
Normal
Outlook Sunny N(1)
P(0.86)
Overcast
Rain
P(1)
Windy False P(1)
True N(1)
Fig. 2. The decision tree for Table 1 by IBC
6. end The second phase of LEF, i.e., Q rgh Vsolw S kdvh, checks if all tuples in V are in the same target class in Step 1. If all tuples are in the same class, LEF does not split these tuples and returns the target class. Otherwise, LEF uses attributes’ evaluation values returned by first phase to select the best splitting attribute in V and partitions V into subpartitions in Step 2 and Step 3. Then, the child nodes of decision trees grow accordingly from Step 4 to Step 6. For the example in Section 3.1.1, the Node Split Phase of LEF chooses the attribute Humidity with the lowest JLQ L index value as the best splitting attribute for the decision tree node. Then, IBC partitions Table 1 into two subtables which consist of one table where the value of attribute Humidity is High and the other one where the value of attribute Humidity is Normal. Following a similar procedure of LEF for these subtables, the whole decision tree is built as depicted in Figure 2 where the purity is also examined in each leaf.
4
Performance Studies
To assess the performance of algorithms proposed, we perform several experiments on a computer with a FS X clock rate of 700 MHz and 256 MB of main memory. The characteristics of the real-life datasets are described in Section 4.1. Experimental studies on LEF and others schemes are conducted in Section 4.2. Results on execution e!ciency are presented in Section 4.3. 4.1
Real-life Datasets
We experimented with six real-life datasets from the UCI Machine Learning Repository. These datasets are used by the machine learning community for the empirical analysis of machine learning algorithms. We use a small portion of data
as the training dataset and the rest of the data is used as the testing dataset. Note that the attributes in these selected data are categorical attributes. In addition, we calculate the information gain of the categorical attributes for all datasets. Based on these information gains, we further obtain the mean, variance and standard deviation in order to understand the distribution of categorical attributes among the datasets. Table 6 lists the characteristics of our training and testing datasets, which are sorted by their standard deviation with respect to the information gain of attributes. Note that CreditCard, BreastCancer and LiverDisorders are considered to be the data sets with sparse attributes and the others are not.

Table 6  The characteristics of data sets

Data set             CreditCard  BreastCancer  LiverDisorders   Heart  Mushroom  SoyBean
No. of Attributes        11           9              6            13      22        35
No. of Classes            2           2              2             2       2        19
No. of Training set    8745         500            230           180    5416       450
No. of Testing set     4373         199            115            90    2708       233
Info. Gain             1.83        4.61           1.34          2.72    4.61     27.97
Mean                 0.1661      0.5122         0.2226        0.2091  0.2094    0.7992
Variance             0.0163      0.0169         0.0186        0.0329  0.0467    0.1078
Stand. Deviation     0.1278      0.1301         0.1364        0.1813  0.2160    0.3282
4.2
Experiment One: Classification Accuracy
In this experiment, we compare accuracy results with tree pruning and without tree pruning, where the MDL pruning technique was applied in the tree pruning phase. From the results shown in Table 7 without tree pruning, IBC is a clear winner and has the highest accuracy in 4 cases, because IBC is designed for data sets with sparse categorical attributes. In Table 8, the overall accuracies were improved by applying tree pruning to alleviate the overfitting problem. However, the IBC accuracies were a little lower than the others after tree pruning. The reason is that IBC considers and alleviates the overfitting problem in the tree building phase whereas the others do not. So, the other methods require tree pruning techniques to alleviate the overfitting problem and improve the accuracy.

Table 7  Accuracy comparison on real-life data sets (without tree pruning)

Data set   CreditCard  BreastCancer  LiverDisorders   Heart  Mushroom  SoyBean
IBC          0.8218      0.9748        0.6522        0.4667   0.9435    0.9055
K-mean       0.7567      0.9296        0.5621        0.5146   0.9213    0.8645
SLIQ         0.5988      0.9447        0.5913        0.6633   0.9435    0.8927
C4.5         0.7070      0.92017       0.6184        0.7713   1.00      0.8841
Table 8  Accuracy comparison on real-life data sets (with tree pruning)

Data set   CreditCard  BreastCancer  LiverDisorders   Heart  Mushroom  SoyBean
IBC          0.8278      0.9397        0.6434        0.5778   0.9321    0.8969
K-mean       0.7793      0.9246        0.5478        0.5333   0.9439    0.8841
SLIQ         0.7001      0.9347        0.6434        0.6667   0.9468    0.9098
C4.5         0.7390      0.9327        0.6184        0.7713   1.00      0.9013
4.3
Experiment Two: Execution Time in Scale-Up Experiments for data set of sparse categorical attributes
Because SLIQ was shown to outperform C4.5 in [9], we only compare SLIQ, the K-mean based method and IBC in the scale-up experiments. Before the scale-up experiments, we briefly explain the complexity of the three methods. In the general case, the complexity of SLIQ is O(n q^2) [9], the complexity of the K-mean based method is O(n (q n)^2) [8] and the complexity of IBC is O(n q), where n is the number of attributes and q is the size of the data set. For the scale-up experiments, we selected the credit-card dataset and divided it into different sizes of training set to show the execution efficiency for data sets of sparse categorical attributes. The size of the dataset increases from 1,000 to 12,000. The results of the scale-up experiments are shown in Fig. 3 and Fig. 4. In Fig. 3, the value on the y-axis corresponds to the ratio of the execution time of SLIQ to that of IBC (presented in log scale), showing that IBC outperforms SLIQ significantly in execution efficiency. In addition, IBC is approximately twice as fast as the K-mean based algorithm as the size of the dataset increases in Fig. 4. These results in fact agree with the time complexities pointed out earlier. Consequently, the experimental results indicate that IBC is an efficient decision tree classifier with good classification quality for sparse categorical attributes.
Fig. 3. SLIQ and IBC (figure not reproduced: ratio of SLIQ execution time to IBC execution time, on a log scale, for dataset sizes from 1K to 12K)

Fig. 4. K-mean based and IBC (figure not reproduced: execution time in seconds for dataset sizes from 1K to 12K)

5
Conclusion
According to our observation on real data, the distribution of attributes with respect to information gain was very sparse because only a few attributes are
major discriminating attributes, where a discriminating attribute is an attribute by whose value we are likely to distinguish one tuple from another. In this paper, we proposed an efficient decision tree classifier for categorical attributes of sparse distribution. In essence, the proposed Inference Based Classifier can alleviate the “overfitting” problem of conventional decision tree classifiers. Also, IBC has the advantage of deciding the splitting number automatically based on the generated partitions. IBC was empirically compared to the C4.5, SLIQ and K-means based classifiers. The experimental results showed that IBC significantly outperformed the companion methods in execution efficiency for datasets with categorical attributes of sparse distribution while attaining approximately the same classification accuracies. Consequently, IBC can be considered an accurate and efficient classifier for sparse categorical attributes.
References
1. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.
2. NASA Ames Research Center. Introduction to IND Version 2.1. GA23-2475-02 edition, 1992.
3. P. Cheeseman, J. Kelly, M. Self, et al. AutoClass: A Bayesian classification system. In 5th Int'l Conf. on Machine Learning. Morgan Kaufmann, 1988.
4. P. A. Chou. Optimal Partitioning for Classification and Regression Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 13, No. 4, 1991.
5. U. Fayyad. On the Induction of Decision Trees for Multiple Concept Learning. PhD thesis, The University of Michigan, Ann Arbor, 1991.
6. U. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proc. of the 13th International Joint Conference on Artificial Intelligence, 1993.
7. D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Morgan Kaufmann, 1989.
8. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
9. M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In EDBT 96, Avignon, France, 1996.
10. M. Mehta, J. Rissanen, and R. Agrawal. MDL-based decision tree pruning. In Int'l Conference on Knowledge Discovery in Databases and Data Mining, 1995.
11. D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
12. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
13. J. R. Quinlan. Induction of decision trees. Machine Learning, 1986.
14. J. R. Quinlan and R. L. Rivest. Inferring decision trees using the minimum description length principle. Information and Computation, 1989.
15. R. Rastogi and K. Shim. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. In Proceedings of the 24th International Conference on Very Large Data Bases, New York City, New York, USA, August 24-27, 1998.
16. B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 1996.
17. J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of the VLDB Conference, Bombay, India, 1996.
Generating Effective Classifiers with Supervised Learning of Genetic Programming Been-Chian Chien1, Jui-Hsiang Yang1, and Wen-Yang Lin2 1
Institute of Information Engineering I-Shou University 1, Section 1, Hsueh-Cheng Rd., Ta-Hsu Hsiang, Kaohsiung County Taiwan, 840, R.O.C. {cbc, m9003012}@isu.edu.tw 2 Department of Information Management I-Shou University 1, Section 1, Hsueh-Cheng Rd., Ta-Hsu Hsiang, Kaohsiung County Taiwan, 840, R.O.C. [email protected]
Abstract. A new approach of learning classifiers using genetic programming has been developed recently. Most of the previous researches generate classification rules to classify data. However, the generation of rules is time consuming and the recognition accuracy is limited. In this paper, an approach of learning classification functions by genetic programming is proposed for classification. Since a classification function deals with numerical attributes only, the proposed scheme first transforms the nominal data into numerical values by rough membership functions. Then, the learning technique of genetic programming is used to generate classification functions. For the purpose of improving the accuracy of classification, we proposed an adaptive interval fitness function. Combining the learned classification functions with training samples, an effective classification method is presented. Numbers of data sets selected from UCI Machine Learning repository are used to show the effectiveness of the proposed method and compare with other classifiers.
1 Introduction
Classification is one of the important tasks in machine learning. A classification problem is a supervised learning problem in which a data set with pre-defined classes, referred to as training samples, is given; classification rules, decision trees, or mathematical functions are then learned from the training samples to classify future data of unknown class. Owing to the versatility of human activities and the unpredictability of data, such a mission is a challenge. For solving classification problems, many different methods have been proposed. Most of the previous classification methods are based on mathematical models or theories. For example, the probability-based classification methods are built on Bayesian decision theory [5][7]. The Bayesian network is one of the important classification methods based on a statistical model.
Many improvements of Naïve Bayes like NBTree [7] and SNNB [16] also provide good classification results. Another well-known approach is neural network [17]. In the approach of neural network, a multi-layered network with m inputs and n outputs is trained with a given training set. We give an input vector to the network, and an ndimensional output vector is obtained from the outputs of the network. Then the given vector is assigned to the class with the maximum output. The other type of classification approach uses the decision tree, such as ID3 and C4.5 [14]. A decision tree is a flow-chart-like tree structure, which each internal node denotes a decision on an attribute, each branch represents an outcome of the decision and leaf nodes represent classes. Generally, a classification problem can be represented in a decision tree clearly. Recently, some modern computational techniques start to be applied by few researchers to develop new classifiers. As an example, CBA [10] employs data mining techniques to develop a hybrid rule-based classification approach by integrating classification rules mining with association rules mining. The evolutionary computation is the other one interesting technique. The most common techniques of evolutionary computing are the genetic algorithm and the genetic programming [4][6]. For solving a classification problem, the genetic algorithm first encodes a random set of classification rules to a sequence of bit strings. Then the bit strings will be replaced by new bit strings after applying the evolution operators such as reproduction, crossover and mutation. After a number of generations are evolved, the bit strings with good fitness will be generated. Thus a set of effective classification rules can be obtained from the final set of bit strings satisfying the fitness function. For the genetic programming, either classification rules [4] or classification functions [6] can be learned to accomplish the task of classification. The main advantage of classifying by functions instead of rules is concise and efficient, because computation of functions is easier than rules induction. The technique of genetic programming (GP) was proposed by Koza [8][9] in 1987. The genetic programming has been applied to several applications like symbolic regression, the robot control programs, and classification, etc. Genetic programming can discover underlying data relationships and presents these relationships by expressions. The algorithm of genetic programming begins with a population that is a set of randomly created individuals. Each individual represents a potential solution that is represented as a binary tree. Each binary tree is constructed by all possible compositions of the sets of functions and terminals. A fitness value of each tree is calculated by a suitable fitness function. According to the fitness value, a set of individuals having better fitness will be selected. These individuals are used to generate new population in next generation with genetic operators. Genetic operators generally include reproduction, crossover and mutation [8]. After the evolution of a number of generations, we can obtain an individual with good fitness value. The previous researches on classification using genetic programming have shown the feasibility of learning classification functions by designing an accuracy-based fitness function [6][11] and special evolution operations [2]. However, there are two main disadvantages in the previous work. 
First, only numerical attributes can be calculated in classification functions. It is difficult for genetic programming to handle the cases with nominal attributes containing categorical data. The second
drawback is that classification functions may conflict one another. The result of conflicting will decrease the accuracy of classification. In this paper, we propose a new learning scheme that defines a rough attribute membership function to solve the problem of nominal attributes and gives a new fitness function for genetic programming to generate a function-based classifier. Fifteen data sets are selected from UCI data repository to show the performance of the proposed scheme and compare the results with other approaches. This paper is organized as follows: Section 2 introduces the concepts of rough set theory and rough membership functions. In Section 3, we discuss the proposed learning algorithm based on rough attribute membership and genetic programming. Section 4 presents the classification algorithm. The experimental results is shown and compared with other classifiers in Section 5. Conclusions are made in Section 6.
2 Rough Membership Functions
Rough set theory, introduced by Pawlak [12], is a powerful tool for the identification of common attributes in data sets. The mathematical foundation of rough set theory is based on the set approximation of a partition space on sets. Rough set theory has been successfully applied to knowledge discovery in databases. This theory provides a powerful foundation to reveal and discover important structures in data and to classify complex objects. An attribute-oriented rough set technique can reduce the computational complexity of learning processes and eliminate unimportant or irrelevant attributes so that the knowledge can be learned from large databases efficiently. The idea of rough sets is based on the establishment of equivalence classes on the given data set S and supports two approximations called the lower approximation and the upper approximation. The lower approximation of a concept X contains the equivalence classes that are certain to belong to X without ambiguity. The upper approximation of a concept X contains the equivalence classes that cannot be described as not belonging to X. A vague concept description can contain boundary-line objects from the universe, which cannot be classified with absolute certainty as satisfying the description of the concept. Such uncertainty is related to the idea of membership of an element in a concept X. We use the following definitions to describe the membership of a concept X on a specified set of attributes B [13].
Definition 1: Let U = (S, A) be an information system, where S is a non-empty, finite set of objects and A is a non-empty, finite set of attributes. For each B ⊆ A, there is an equivalence relation E_A(B) such that E_A(B) = {(x, x') ∈ S^2 | ∀a ∈ B, a(x) = a(x')}. If (x, x') ∈ E_A(B), we say that objects x and x' are indiscernible.
Definition 2: apr = (S, E) is called an approximation space. The object x ∈ S belongs to one and only one equivalence class. Let [x]_B = {y | x E_A(B) y, x, y ∈ S} and [S]_B = {[x]_B | x ∈ S}. The notation [x]_B denotes the equivalence class of E_A(B) containing x, and [S]_B denotes the set of all equivalence classes [x]_B for x ∈ S.
Definition 3: For a given concept X ⊆ S, a rough attribute membership function of X on the set of attributes B is defined as

µ_B^X(x) = |[x]_B ∩ X| / |[x]_B|,

where |[x]_B| denotes the cardinality of the equivalence class [x]_B and |[x]_B ∩ X| denotes the cardinality of the set [x]_B ∩ X. The value of µ_B^X(x) is in the range [0, 1].
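As a small, hypothetical illustration of Definition 3 (our own sketch, not code from the paper), the rough membership of an object in a concept X with respect to an attribute set B can be computed directly from the indiscernibility classes:

```python
def rough_membership(i, X_idx, universe, B):
    """mu_B^X(x_i) = |[x_i]_B ∩ X| / |[x_i]_B| for a universe given as a list of attribute dicts."""
    sig = tuple(universe[i][a] for a in B)          # the values on B determine [x_i]_B
    block = {j for j, u in enumerate(universe) if tuple(u[a] for a in B) == sig}
    return len(block & X_idx) / len(block)

# e.g. X_idx = {j for j, u in enumerate(universe) if u["Class"] == "P"} for a concept given by a class label
```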
3 The Learning Algorithm of Classification Functions
3.1 The Classification Functions
Consider a given data set S with n attributes A1, A2, ..., An. Let Di be the domain of Ai, Di ⊆ R for 1 ≤ i ≤ n, and A = {A1, A2, ..., An}. For each data xj in S, xj = (vj1, vj2, ..., vjn), where vjt ∈ Dt stands for the value of attribute At in data xj. Let C = {C1, C2, ..., CK} be the set of K predefined classes. We say that <xj, cj> is a sample if the data xj belongs to class cj, cj ∈ C. We define a training set (TS) to be a set of samples, TS = {<xj, cj> | xj ∈ S, cj ∈ C, 1 ≤ j ≤ m}, where m = |TS| is the number of samples in TS. Let mi be the number of samples belonging to the class Ci; we have m = m1 + m2 + ... + mK, 1 ≤ i ≤ K. A classification function for the class Ci is defined as fi : R^n → R, where fi is a function that can determine whether a data xj belongs to the class Ci or not. Since the image of a classification function is a real number, we can decide whether a data xj belongs to the specified class by means of the specific range onto which xj is mapped. If we find the set of functions that can recognize all K classes, a classifier is constructed. We define a classifier F for the set of K predefined classes C as follows: F = {fi | fi : R^n → R, 1 ≤ i ≤ K}.
3.2 The Transformation of Rough Attribute Membership
The classification function defined in Section 3.1 has a limitation on attributes: since the calculation of functions accepts only numerical values, classification functions cannot work if datasets contain nominal attributes. In order for genetic programming to be able to train on all data sets, we make use of rough attribute membership as defined in Section 2 to transform the nominal attributes into a set of numerical attributes. Let A = {A1, A2, ..., An} be the set of n attributes of a data set S, let Di be the domain of Ai, and let xj ∈ S, xj = (vj1, vj2, ..., vjn), vji ∈ Di. If Di ⊆ R, Ai is a numerical attribute; we have Ãi = Ai, and letting wjk be the value of xj in Ãi, wjk = vji. If Ai is a nominal attribute, we assume that S is partitioned into pi equivalence classes by attribute Ai. Let [sl]_Ai denote the l-th equivalence class partitioned by attribute Ai, where pi is the number of equivalence classes partitioned by the attribute Ai. Thus, we have

[S]_Ai = ∪_{l=1}^{pi} [sl]_Ai,  where pi = |[S]_Ai|.
We transform the original nominal attribute Ai into K numerical attributes Ãi. Let Ãi = (Ai1, Ai2, ..., AiK), where K is the number of predefined classes C as defined in Section 3.1 and the domains of Ai1, Ai2, ..., AiK are in [0, 1]. The method of transformation is as follows: for a data xj ∈ S, xj = (vj1, vj2, ..., vjn), if vji ∈ Di and Ai is a nominal attribute, the vji will be transformed into (wjk, wj(k+1), ..., wj(k+K-1)), wjk ∈ [0, 1], where wjk = µ_{Ai}^{C1}(xj), wj(k+1) = µ_{Ai}^{C2}(xj), ..., wj(k+K-1) = µ_{Ai}^{CK}(xj), and

µ_{Ai}^{Ck}(xj) = |[sl]_Ai ∩ [xj]_Ck| / |[sl]_Ai|,  if xji ∈ [sl]_Ai.

After the transformation, we get the new data set S' with attributes Ã = {Ã1, Ã2, ..., Ãn}, and a data xj ∈ S, xj = (vj1, vj2, ..., vjn) will be transformed into yj, yj = (wj1, wj2, ..., wjn'), where n' = (n - q) + qK and q is the number of nominal attributes in A. Thus, the new training set becomes TS' = {<yj, cj> | yj ∈ S', cj ∈ C, 1 ≤ j ≤ m}.

3.3 The Adaptive Interval Fitness Function
The fitness function is important for genetic programming to generate effective solutions. As described in Section 3.1, a classification function fi for the class Ci should be able to distinguish positive instances from a set of data by mapping the data yj to a specific range. To find such a function, we define an adaptive interval fitness function. The mapping interval for positive instances is defined in the following. Let <yj, cj> be a positive instance of the training set TS' for the class cj = Ci. We urge the values of fi(yj) for positive instances to fall in the interval [X̄i^(gen) - ri, X̄i^(gen) + ri]. At the same time, we also wish that the values of fi(yj) for negative instances are mapped out of the interval [X̄i^(gen) - ri, X̄i^(gen) + ri]. X̄i^(gen) is the mean value of an individual fi over the positive instances in the gen-th generation of evolution,

X̄i^(gen) = ( Σ_{<yj,cj> ∈ TS', cj = Ci} fi(yj) ) / mi,  1 ≤ j ≤ mi, 1 ≤ i ≤ K.

Let ri be the maximum distance between X̄i^(gen) and the positive instances for 1 ≤ j ≤ mi. That is,

ri = max_{1 ≤ j ≤ mi} |X̄i^(gen) - fi(yj)|,  1 ≤ i ≤ K.

We measure the error of a positive instance for the (gen+1)-th generation by

Dp = 0 if cj = Ci and |fi(yj) - X̄i^(gen)| ≤ ri;  Dp = 1 if cj = Ci and |fi(yj) - X̄i^(gen)| > ri,

and the error of a negative instance by

Dn = 1 if cj ≠ Ci and |fi(yj) - X̄i^(gen)| ≤ ri;  Dn = 0 if cj ≠ Ci and |fi(yj) - X̄i^(gen)| > ri.

The fitness value of an individual fi is then evaluated by the following fitness function:

fitness(fi, TS') = Σ_{j=1}^{m} (Dp + Dn).
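A compact sketch of this fitness evaluation (our own Python rendering of the formulas above; `f_i` is any candidate classification function and `samples` is the transformed training set TS'):

```python
def adaptive_interval_fitness(f_i, samples, C_i):
    """fitness(f_i, TS') = sum of D_p + D_n over all samples; smaller is better."""
    pos = [f_i(y) for y, c in samples if c == C_i]        # positive instances of class C_i
    mean_i = sum(pos) / len(pos)                          # X-bar_i for the current generation
    r_i = max(abs(mean_i - v) for v in pos)               # largest distance to a positive instance
    errors = 0
    for y, c in samples:
        inside = abs(f_i(y) - mean_i) <= r_i
        if c == C_i:
            errors += 0 if inside else 1                  # D_p: positive instance fell outside the interval
        else:
            errors += 1 if inside else 0                  # D_n: negative instance fell inside the interval
    return errors
```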
The fitness value of an individual represents the degree of error between the target function and the individual, so we want the fitness value to be as small as possible.

3.4 The Learning Algorithm
The classification function learning algorithm using genetic programming is described in detail as follows:

Algorithm: The learning of classification functions
Input: The training set TS.
Output: A classification function.
Step 1: Initial values i = 1, k = 1.
Step 2: Transform nominal attributes into rough attribute membership values. For a data <xj, cj> ∈ TS, xj = (vj1, vj2, ..., vjn), 1 ≤ j ≤ m: if Ai is a numerical attribute, wjk = vji, k = k + 1; if Ai is a nominal attribute, wjk = µ_{Ai}^{C1}(xj), wj(k+1) = µ_{Ai}^{C2}(xj), ..., wj(k+K-1) = µ_{Ai}^{CK}(xj), k = k + K. Repeat Step 2 until yj = (wj1, wj2, ..., wjn') is generated, where n' = (n - q) + qK and q is the number of nominal attributes in A.
Step 3: j = j + 1; if j ≤ m, go to Step 1.
Step 4: Generate the new training set TS' = {<yj, cj> | yj = (wj1, wj2, ..., wjn'), cj ∈ C, 1 ≤ j ≤ m}.
Step 5: Initialize the population. Let gen = 1 and generate the set of individuals Ω^1 = {h_1^1, h_2^1, ..., h_p^1} initially, where Ω^(gen) is the population in generation gen and h_i^(gen) stands for the i-th individual of generation gen.
Step 6: Evaluate the fitness value of each individual on the training set. For all h_i^(gen) ∈ Ω^(gen), compute the fitness values E_i^(gen) = fitness(h_i^(gen), TS'), where the fitness evaluating function fitness() is defined as in Section 3.3.
Step 7: Check the conditions of termination. If the best fitness value E_i^(gen) satisfies the conditions of termination (E_i^(gen) = 0) or gen is equal to the specified maximum generation, the h_i^(gen) with the best fitness value is returned and the algorithm halts; otherwise, gen = gen + 1.
Step 8: Generate the next generation of individuals. The new population of the next generation Ω^(gen) is generated according to the ratios Pr, Pc and Pm, where Pr, Pc and Pm represent the probabilities of the reproduction, crossover and mutation operations, respectively; then go to Step 6.
4 The Classification Algorithm
After the learning phase, we obtain a set of classification functions F that can recognize the classes in TS'. However, these functions may still conflict with each other in practical cases. To avoid the situations of conflict and rejection, we proposed a
scheme based on the Z-score of a statistical test. In the classification phase, we calculate the Z-values of every classification function for the unknown-class data and compare these Z-values. If the Z-value of an unknown-class object yj for classification function fi is minimum, then yj belongs to the class Ci. We present the classification approach in the following. Consider a classification function fi ∈ F corresponding to the class Ci and the positive instances of TS' with cj = Ci. Let X̄i be the mean of the values of fi(yj) as defined in Section 3.3. The standard deviation of the values of fi(yj), 1 ≤ j ≤ mi, is defined as

σi = sqrt( Σ_{<yj,cj> ∈ TS', cj = Ci} (fi(yj) - X̄i)^2 / mi ),  1 ≤ j ≤ mi, 1 ≤ i ≤ K.

For a data x ∈ S and a classification function fi, let y ∈ S' be the data with all numerical values transformed from x using rough attribute membership. The Z-value of data y for fi is defined as

Zi(y) = |fi(y) - X̄i| / σi,

where 1 ≤ i ≤ K. We use the Z-value to determine the class to which the data should be assigned. The detailed classification algorithm is listed as follows.

Algorithm: The classification algorithm
Input: A data x.
Output: The class Ck to which x is assigned.
Step 1: Initial value k = 1.
Step 2: Transform the nominal attributes of x into numerical attributes. Assume that the data x ∈ S, x = (v1, v2, ..., vn). If Ai is a numerical attribute, wjk = vji, k = k + 1. If Ai is a nominal attribute, wjk = µ_{Ai}^{C1}(xj), wj(k+1) = µ_{Ai}^{C2}(xj), ..., wj(k+K-1) = µ_{Ai}^{CK}(xj), k = k + K. Repeat Step 2 until yj = (wj1, wj2, ..., wjn') is generated, where n' = (n - q) + qK and q is the number of nominal attributes in A.
Step 3: Initially, i = 1.
Step 4: Compute Zi(y) with classification function fi(y).
Step 5: If i < K, then i = i + 1, go to Step 4. Otherwise, go to Step 6.
Step 6: Find k = argmin_{1 ≤ i ≤ K} {Zi(y)}; the data x will be assigned to the class Ck.
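The classification phase then amounts to the following sketch (our own illustration): keep, for each class, the mean and standard deviation of its function's values on its own training instances, and assign an unseen (already transformed) vector y to the class with the smallest Z-value.

```python
import math

def class_statistics(functions, samples):
    """For each class C_i: (mean, sigma) of f_i over the training instances of C_i."""
    stats = {}
    for C_i, f_i in functions.items():
        vals = [f_i(y) for y, c in samples if c == C_i]
        mean = sum(vals) / len(vals)
        sigma = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        stats[C_i] = (mean, sigma)                      # assumes sigma > 0
    return stats

def classify(y, functions, stats):
    """Assign y to the class whose Z-value |f_i(y) - mean_i| / sigma_i is smallest."""
    return min(functions, key=lambda C_i: abs(functions[C_i](y) - stats[C_i][0]) / stats[C_i][1])
```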
Table 1. The parameters of GPQuick used in the experiments

Parameters                   Values        Parameters                   Values
Node mutate weight           43.5%         Mutation weight              8%
Mutate constant weight       43.5%         Mutation weight annealing    40%
Mutate shrink weight         13%           Population size              1000
Selection method             Tournament    Set of functions             {+, -, ×, ÷}
Tournament size              7             Initial value of X̄i          0
Crossover weight             28%           Initial value of ri          10
Crossover weight annealing   20%           Generations                  10000
5 The Experimental Results
The proposed learning algorithm based on genetic programming is implemented by modifying GPQuick 2.1 [15]. Since GPQuick is open source on the Internet, it gives us more confidence in building and evaluating the learning algorithm for classifiers. The parameters used in our experiments are listed in Table 1. We define only four basic operations {+, -, ×, ÷} for the final functions; that is, each classification function contains only the four basic operations. The population size is set to 1000 and the maximum number of generations of evolution is set to 10000 for all data sets. Although the number of generations is high, GPQuick still has good performance in computation time. The experimental data sets are selected from the UCI Machine Learning repository [1]. We take 15 data sets from the repository in total, including three nominal data sets, four composite data sets (with nominal and numeric attributes), and eight numeric data sets. The size of data and the number of attributes in the data sets are quite diverse. The related information on the selected data sets is summarized in Table 2. Since GPQuick is fast in the process of evolving, the training of each classification function for 10000 generations can be done in a few seconds or minutes, depending on the number of cases in the training data sets. The proposed learning algorithm is efficient compared with the training times of more than an hour reported in [11]. We don't know why GPQuick is so powerful in evolving speed, but it is easy for everyone to get the source [15] and modify the problem class to reproduce the results. The performance of the proposed classification scheme is evaluated by the average classification error rate of 10-fold cross validation over 10 runs. We report the experimental results and compare the effectiveness with different classification models in Table 3. These models include statistical models like Naïve Bayes [3], NBTree [7] and SNNB [16], the decision tree based classifier C4.5 [14] and the association rule-based classifier CBA [10]. The related error rates in Table 3 are cited from [16], except for the GP-based classifier. Since the proposed GP-based classifier is randomized, we also show the standard deviations in the table for the reference of readers. From the experimental results, we observed that the proposed method obtains lower error rates than CBA in 12 out of the 15 domains, and higher error rates in three domains. It obtains lower error rates than C4.5 rules in 13 domains; only one domain has a higher error rate and the other one results in a draw. When comparing our method with NBTree and SNNB, the results are also better in most cases. When comparing with Naïve Bayes, the proposed method wins in 13 domains and loses in 2 domains. Generally, the classification results of the proposed method are better than the others on average. However, on some data sets the GP-based results are much worse than the others; for example, on the "labor" data we found that the average error rate is 20.1%. The main reason for the terribly high error rate in this case is the small number of samples in the data set. The "labor" data set contains only 57 data in total and is divided into two classes. When data of such small size are tested in 10-fold cross validation, overfitting occurs in both of the two classification functions. That is also why rule-based classifiers like C4.5 and CBA have classification results similar to ours on the labor data set.
6 Conclusions
Classification is an important task in many applications. The technique of classification using genetic programming is a new classification approach developed recently. However, how to handle nominal attributes in genetic programming is a difficult problem. In this paper we proposed a scheme based on the rough membership function to classify data with nominal attributes using genetic programming. Further, for improving the accuracy of classification, we proposed an adaptive interval fitness function and use the minimum Z-value to determine the class to which the data should be assigned. The experimental results demonstrate that the proposed scheme is feasible and effective. We are trying to reduce the dimensions of attributes for any possible data sets and to cope with data having missing values in the future.

Table 2. The information of data sets

Data set     Nominal attrs  Numeric attrs  Classes  Cases
australian        8              6            2      690
german           13              7            2     1000
glass             0              9            7      214
heart             7              6            2      270
ionosphere        0             34            2      351
iris              0              4            3      150
labor             8              8            2       57
led7              7              0           10     3200
lymph            18              0            4      148
pima              0              8            2      768
sonar             0             60            2      208
tic-tac-toe       9              0            2      958
vehicle           0             18            4      846
waveform          0             21            3     5000
wine              0             13            3      178
Table 3. The average error rates (%) for compared classifiers

Data sets     NB    NBTree  SNNB   C4.5   CBA   GP-Ave.  S.D.
australian   14.1    14.5   14.8   15.3   14.6    9.5    1.2
german       24.5    24.5   26.2   27.7   26.5   16.7    2.2
glass        28.5    28.0   28.0   31.3   26.1   22.1    2.9
heart        18.1    17.4   18.9   19.2   18.1   11.9    2.7
ionosphere   10.5    12.0   10.5   10.0    7.7    7.2    2.3
iris          5.3     7.3    5.3    4.7    5.3    4.7    1.0
labor         5.0    12.3    3.3   20.7   13.7   20.1    3.0
led7         26.7    26.7   26.5   26.5   28.1   18.7    2.7
lymph        19.0    17.6   17.0   26.5   22.1   13.7    1.5
pima         24.5    24.9   25.1   24.5   27.1   18.3    2.8
sonar        21.6    22.6   16.8   29.8   22.5    5.6    1.9
tic-tac-toe  30.1    17.0   15.4    0.6    0.4    5.2    1.6
vehicle      40.0    29.5   28.4   27.4   31     24.7    2.4
waveform     19.3    16.1   17.4   21.9   20.3   11.7    1.8
wine          1.7     2.8    1.7    7.3    5.0    4.5    0.7
References
1. Blake, C., Keogh, E. and Merz, C. J.: UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, University of California, Department of Information and Computer Science (1998)
2. Brameier, M. and Banzhaf, W.: A Comparison of Linear Genetic Programming and Neural Networks in Medical Data Mining, IEEE Transactions on Evolutionary Computation, 5, 1 (2001) 17-26
3. Duda, R. O. and Hart, P. E.: Pattern Classification and Scene Analysis, New York: Wiley, John and Sons Incorporated Publishers (1973)
4. Freitas, A. A.: A Genetic Programming Framework for Two Data Mining Tasks: Classification and Generalized Rule Induction, Proc. the 2nd Annual Conf. Genetic Programming. Stanford University, CA, USA: Morgan Kaufmann Publishers (1997) 96-101
5. Heckerman, D. M. and Wellman, P.: Bayesian Networks, Communications of the ACM, 38, 3 (1995) 27-30
6. Kishore, J. K., Patnaik, L. M., and Agrawal, V. K.: Application of Genetic Programming for Multicategory Pattern Classification, IEEE Transactions on Evolutionary Computation, 4, 3 (2000) 242-258
7. Kohavi, R.: Scaling Up the Accuracy of Naïve-Bayes Classifiers: a Decision-Tree Hybrid. Proc. Int. Conf. Knowledge Discovery & Data Mining. Cambridge/Menlo Park: AAAI Press/MIT Press Publishers (1996) 202-207
8. Koza, J. R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press Publishers (1992)
9. Koza, J. R.: Introductory Genetic Programming Tutorial, Proc. the First Annual Conf. Genetic Programming, Stanford University. Cambridge, MA: MIT Press Publishers (1996)
10. Liu, B., Hsu, W., and Ma, Y.: Integrating Classification and Association Rule Mining. Proc. the Fourth Int. Conf. Knowledge Discovery and Data Mining. Menlo Park, CA, AAAI Press Publishers (1998) 443-447
11. Loveard, T. and Ciesielski, V.: Representing Classification Problems in Genetic Programming, Proc. the Congress on Evolutionary Computation. COEX Center, Seoul, Korea (2001) 1070-1077
12. Pawlak, Z.: Rough Sets, International Journal of Computer and Information Sciences, 11 (1982) 341-356
13. Pawlak, Z. and Skowron, A.: Rough Membership Functions, in: R. R. Yager, M. Fedrizzi and J. Kacprzyk (Eds.), Advances in the Dempster-Shafer Theory of Evidence (1994) 251-271
14. Quinlan, J. R.: C4.5: Programs for Machine Learning, San Mateo, California, Morgan Kaufmann Publishers (1993)
15. Singleton, A.: Genetic Programming with C++, http://www.byte.com/art/9402/sec10/art1.htm, Byte, Feb. (1994) 171-176
16. Xie, Z., Hsu, W., Liu, Z., and Lee, M. L.: SNNB: A Selective Neighborhood Based Naïve Bayes for Lazy Learning, Proc. the Sixth Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining (2002) 104-114
17. Zhang, G. P.: Neural Networks for Classification: a Survey, IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, 30, 4 (2000) 451-462
Clustering by Regression Analysis Masahiro Motoyoshi1, Takao Miura1 , and Isamu Shioya2 1
2
Dept.of Elect.& Elect. Engr., HOSEI University 3-7-2 KajinoCho, Koganei, Tokyo, 184–8584 Japan {i02r3243,miurat}@k.hosei.ac.jp Dept.of Management and Informatics, SANNO University 1573 Kamikasuya, Isehara, Kanagawa 259–1197 Japan [email protected]
Abstract. In data clustering, many approaches have been proposed such as K-means method and hierarchical method. One of the problems is that the results depend heavily on initial values and criterion to combine clusters. In this investigation, we propose a new method to avoid this deficiency. Here we assume there exists aspects of local regression in data. Then we develop our theory to combine clusters using F values by regression analysis as criterion. We examine experiments and show how well the theory works. Keywords: Data Mining, Multivariable Analysis, Regression Analysis, Clustering
1
Introduction
It is well-known that stocks in a securities market are properly classified according to industry genre (classification of industries). Such genre appears very often in security investment. The movement of the descriptions would be similar with each other, but this classification should be maintained according to economical situation, trends and activities in our societies and regulations. Sometimes we see some mismatch between the classification and real society. When an analyst tries classifying using more effective criterion, she/he will try a quantitative classification. Cluster analysis is one of the method based on multivariate analysis which performs a quantitative classification. Cluster analysis is a general term of algorithms to classify similar objects into groups (clusters) where each object in one cluster shares heterogeneous feature. We can say that, in very research activity, a researcher is faced to a problem how observed data should be systematically organized, that is, how to classify. Generally the higher similarity of objects in a cluster and the lower similarity between clusters we see, the better clustering we have. This means quality of clustering depends on definition of similarity and the calculation complexity. There is no guarantee to see whether we can interpret similarity easily or not.
So is true for similarity from the view point of analysts. It is an analyst’s responsibility to apply methods accurately to specific applications. The point is how to find out hidden patterns. Roughly speaking, cluster analysis has divided into hierarchical methods and non-hierarchical methods[4]. In non-hierarchical methods, data are decomposed into k clusters each of which satisfies evaluation standard most. To obtain best solutions, we have to look for all the possible patterns, it takes much time. Heuristic methods have been investigated. K-means method[6] generates clusters based on their centers. Fuzzy K-means method[1] takes an approach based on fuzzy classification. AutoClass[3] automatically determines the number of clusters and classifies the data stochastically. Recently an interesting approach[2] has been proposed, called ”Local Dimensionality Reduction”. In this approach, data are assumed to have correlation locally same as our case. But the clustering technique is based on Principal Component Analysis (PCA) and they propose completely different algorithm from ours. In this investigation, we assume there exists aspects of local regression in data, i.e., we assume observed data structure of local sub-linear space. We propose a new clustering method using variance and F value by regression analysis as a criterion to make suitable clusters. In the next section we discuss reasons why conventional approaches are not suitable to our situation. In section 3 we give some definitions and discuss about preliminary processing of data. Section 4 contains a method to combine clusters and the criterion. In section 5, we examine experimental results and the comparative experiments by K-mean method. After reviewing related works in section 6, we conclude our work in section 7.
2
Clustering Data with Local Trends
As previously mentioned, we assume a collection of data in which we see several trends at the same time. Such kind of data could be regressed locally by using partial linear functions and the result forms an elliptic cluster in multidimensional space. Of course, these clusters may cross each other. The naive solution is to put clusters together by using nearest neighbor method found in hierarchical approach. However, when clusters cross, the result doesn’t become nice; they will be divided at a crossing. If the clusters have different trends but they are close to each other, they could be combined. Similarly an approach based on general Minkowski distance have the same problem. In K-mean method, a collection of objects is represented by its center of gravity. Thus every textbook says that it is not suitable for non-convex clusters. The notion of center comes from a notion of variance, but if we look for points to improve linearity of the two clusters by moving objects, we can’t always obtain such a point. More serious is that we should decide the number of clusters in advance.
These two approaches share a common problem, how to define similarity between clusters. In our case, we want to capture local aspects of sub linearity, thus new techniques should satisfy the following properties: (1) similarity to classify sub linear space. (2) convergence on suitable level (i.e., the number of clusters) which can be interpreted easily. Regression analysis is one of techniques of multivariable analysis by which we can predict future phenomenon in form of mathematical functions. We introduce F value as a criterion of the similarity to combine clusters, and that we can consider a cluster as a line, that is, our approach is clustering by line while K-mean method is clustering by point. In this investigation, by examining F value (as similarity measure), we combine linear clusters in one by one manner, in fact, we take an approach of restoring to target clusters.
3
The Choice of Initial Clusters
In this section, let us explain the difference between our approach and hierarchical clustering by existing agglomerative nesting. An object is a thing in the world of interest. Data is a set of objects which have several variables. We assume that all variables are inputs given from the surroundings and that there are no other external criteria for a classification. There are two kinds of variables. A criterion variable is an attribute which plays the role of the criterion of regression analysis given by analysts, and the others are called explanatory variables. In this investigation, we discuss only numerical data. As for categorical data, readers could think about quantification theory or dummy variables. We deal with data by a matrix (X|Y). An object is represented as a row of the matrix, while criterion/explanatory variables are described as columns. We denote the explanatory variables and the criterion variable by x1, x2, ..., xm and y respectively, and the number of objects by n:

          | x11 ... x1m  y1 |
          |  :        :   : |
(X|Y) =   | xk1 ... xkm  yk |     ∈ R^{n×(m+1)}        (1)
          |  :        :   : |
          | xn1 ... xnm  yn |

where X denotes the explanatory variables and Y denotes the criterion variable. Each variable is assumed to be normalized (called Z score). An initial cluster is a set of objects where each object is exclusively contained in the initial cluster.
In the first stage of agglomerative nesting, each object represents its own cluster. Similarity is defined as the distance between objects. However, we would like to assume that every cluster has a variance, because we deal with "data as a line". We pose this assumption on the initial clusters. To make "our" initial clusters, we divide the objects into small groups. We obtain initial clusters dynamically by an inner product (cosine), which measures the difference of angle between two vectors, as the following algorithm shows:
0. Let the input vectors be s1, s2, ..., sn.
1. Let the first input vector s1 be the center of cluster C1 and let s1 be a member of C1.
2. Calculate the similarity between sk and the existing clusters C1 ... Ci by (2). If every similarity is below a given threshold Θ, generate a new cluster and let sk be its center. Otherwise, let sk be a member of the cluster which has the highest similarity. By using (3), recalculate the center of the cluster to which the member was added.
3. Repeat until all the assignments are completed.
4. Remove clusters which have no F value and fewer than m + 2 members.

where

cos(k, j) = (sk · cj) / (|sk| |cj|)        (2)

cj = ( Σ_{sk ∈ Cj} sk ) / Mj               (3)
Note Mj means the number of members in Cj and m means the number of explanatory variables.
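The initialization can be sketched as follows (our own NumPy rendering; recomputing the center as the mean of the members is one direct reading of Eq. (3), and clusters that are too small to regress are dropped as in step 4):

```python
import numpy as np

def initial_clusters(vectors, theta, m):
    """Single-pass cosine assignment of input vectors; keeps clusters with at least m + 2 members."""
    centers, members = [], []
    for s in vectors:
        sims = [float(s @ c) / (np.linalg.norm(s) * np.linalg.norm(c)) for c in centers]
        if not sims or max(sims) < theta:
            centers.append(np.array(s, dtype=float))     # s becomes the center of a new cluster
            members.append([s])
        else:
            j = int(np.argmax(sims))
            members[j].append(s)
            centers[j] = np.mean(members[j], axis=0)     # Eq. (3): center = mean of the members
    return [np.array(ms) for ms in members if len(ms) >= m + 2]
```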
4
Combining Clusters
Now let us define similarities between clusters, and describe how the similarity criterion relates to combining. We define the similarity between clusters from two aspects. One of the aspects is the distance between the centers of clusters. We simply take the Euclidean distance as the distance measure (as well as a least square method used by regression analysis):

d(i, j) = sqrt( |xi1 - xj1|^2 + ... + |xim - xjm|^2 + |yi - yj|^2 )        (4)

Then we define the non-similarity matrix as follows:

|   0                                    |
| d(2,1)   0                             |
| d(3,1)  d(3,2)   0                     |     ∈ R^{n×n}        (5)
|   :       :      :    ...              |
| d(n,1)  d(n,2)  ...  d(n,n-1)   0      |
Clearly one of the candidate clusters to combine is the one with the smallest distance and we have to examine whether it is suitable or not in our case. For this purpose we define a new similarity by F value of the regression to keep effectiveness. In the following let us review quickly F test and presumption by regression based on least square method in multiple regression analysis. Given clusters represented by data matrix like (1), we define a model of multiple regression analysis which is corresponded to the clusters as follows: y = b1 x1 + b2 x2 + . . . + bm xm + ei
(6)
The least squares estimator b̃i of bi is given by

B = (b̃1, b̃2, ..., b̃m) = (X^T X)^{-1} X^T Y        (7)

This is called the regression coefficient. Actually it is a standardised partial regression coefficient, because it is based on z-scores. Let y be an observed value and Y be a predicted value based on the regression coefficient. Then, for the variation due to the regression, the sum of squares SR and mean square VR are defined as

SR = Σ_{k=1}^{n} (Yk - Ȳ)^2 ;   VR = SR / m        (8)

For the variation due to the residual, the sum of squares SE and mean square VE are

SE = Σ_{k=1}^{n} (yk - Yk)^2 ;   VE = SE / (n - m - 1)        (9)

Then we define the F value F0 by:

F0 = VR / VE        (10)

It is well known that F0 obeys the F distribution whose first and second degrees of freedom are m and n - m - 1, respectively. Given clusters A and B whose numbers of members are a and b respectively, the data matrix of the combined cluster A ∪ B is described as follows:

          | xA11 ... xA1m  yA1 |
          |   :        :     : |
          | xAa1 ... xAam  yAa |
(X|Y) =   | xB11 ... xB1m  yB1 |     ∈ R^{n×(m+1)}        (11)
          |   :        :     : |
          | xBb1 ... xBbm  yBb |
where n = a + b. As previously mentioned, we can calculate regression by (7) and F value by (10). Let us examine the relationship between two F values of clusters before/after combining. Let A, B be two clusters, FA , FB the two F values and F the F value after combining A and B. Then we have some interesting properties as shown in examples. Property 1. FA > F , FB > F When F decreases, the gradient is significantly different. Thus we can say that the similarity between A and B is low and linearity of the cluster decreases. In the case of FA = FB , F = 0, both A and B have same number objects and coordinates and the regressions are orthogonal at center of gravity. Property 2. FA ≤ F , FB ≤ F When F increases, the gradient isn’t significantly different and the similarity between A and B is high. Linearity of the cluster increases. When FA = FB , F = 2 × FA , we see A and B have same number of the objects and coordinates. Property 3. FA ≤ F, FB > F , or FB ≤ F, FA > F One of FA , FB increases while another decreases, when there exists big difference between the variances of A and B, or between FA and FB . We can’t say anything about combining. Thus we can say we’d better combine clusters if F is bigger than both FA and FB . Non-similarity using Euclidean distance is one of the nice ideas to prohibit from combining clusters that have the distance bigger than local ones. Since our algorithm proceeds based on a criterion using F values, the process continues to look for candidate clusters by decreasing distance criterion until the process satisfies our F value criterion. But we may have difficulties in the case of defective initial clusters, or in the case of no cluster to regress locally; the process might combine clusters that should not be combined. To overcome such problem, we assume a threshold ∆ to a distance. That is, we have ∆ as a criterion of variance. ∆ > (Var(A) + V ar(B)) × D where Var(A), Var(B) mean variance of A, B respectively and D means the distance between the centers of the gravity. When A and B satisfy both criterion of F value and ∆, we can combine the two clusters. In our experiments, we give ∆ the average of the internal variances of initial clusters as a default. By ∆ we manage the internal variances of clusters to avoid combine far clusters.
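The F-value comparison used above can be computed directly; the following is a sketch of Eqs. (7)-(10), assuming the cluster's columns are already standardised and using `numpy.linalg.lstsq` instead of the explicit inverse for numerical stability:

```python
import numpy as np

def f_value(X, y):
    """F0 = V_R / V_E for the least-squares regression of y on X (Eqs. (7)-(10))."""
    n, m = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)       # regression coefficients, Eq. (7)
    pred = X @ b
    S_R = np.sum((pred - pred.mean()) ** 2)         # variation explained by the regression
    S_E = np.sum((y - pred) ** 2)                   # residual variation
    return (S_R / m) / (S_E / (n - m - 1))          # Eq. (10)
```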
Now our algorithm is given as follows.
1. Standardize the data.
2. Calculate initial clusters that satisfy Θ. Remove clusters in which the number of members does not reach the number of explanatory variables.
3. Calculate the center of gravity, variance, regression coefficient and F value for each cluster, and the distances between their centers of gravity.
4. Choose close clusters as candidates for combining. Standardize the pair. Calculate the regression coefficient and F value again.
5. Combine the pair if the F value of the combined cluster is bigger than the F value of each cluster and if it satisfies ∆. Otherwise, go to step 4 to choose other candidates. If there is no candidate any more, then stop.
6. Calculate the center of gravity of each cluster and the distances between their centers of gravity again, and go to step 4.
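Steps 4 and 5 of the algorithm reduce to a test on each candidate pair; a sketch of that test is given below (it reuses `f_value` from the previous snippet, treats the last column as the criterion variable, and uses one plausible reading of the ∆ criterion; all names are illustrative):

```python
import numpy as np

def should_combine(A, B, delta):
    """Combine clusters A and B only if the merged F value exceeds both and Delta is respected."""
    merged = np.vstack([A, B])
    F_A = f_value(A[:, :-1], A[:, -1])               # f_value from the previous sketch
    F_B = f_value(B[:, :-1], B[:, -1])
    F_AB = f_value(merged[:, :-1], merged[:, -1])
    d = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))      # distance between centers of gravity
    delta_ok = delta > (A.var() + B.var()) * d               # Delta criterion of Section 4
    return F_AB > F_A and F_AB > F_B and delta_ok
```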
5
Experiments
In this section, let us show some experiments to demonstrate the feasibility of our theory. We use Weather Data in Japan [5]: the data of two meteorological observatories, Wakkanai in Hokkaido (northern part of Japan) and Niigata in Honshu (middle part of Japan), measured in January 1997. Each meteorological observatory contributes 744 records. To apply our method under the assumption that there are clusters to regress locally, we simply joined them, which gives 1488 records of 180 KB. Each data instance contains 22 attributes observed every hour. We utilize "day" (day), "hour" (hour), "pressure" (hPa), "sea-level pressure" (hPa), "air temperature" (°C), "dew point" (°C), "steam pressure" (hPa) and "relative humidity" (%) as candidate variables among the 22 attributes. All of them are numerical without any missing value. We additionally use "observation point number" only for the purpose of evaluation. Table 1 contains examples of the data. Before processing, we standardized all variables to be analyzed by our algorithm. We take "air temperature" as the criterion variable and the other values as explanatory variables.
Table 1. Weather Data

Point  Day  Hour  Pressure  Sea pressure  Temperature  ...
604     1    1     1019.2      1020            5       ...
604     1    2     1018.6      1019.4          5.2     ...
604     1    3     1018.3      1019.1          5.4     ...
 :      :    :        :           :            :
401    31   24     1014.6      1016           -5.8     ...
Let Θ = 0.8 and ∆ = 15. Table 2 shows the results. In this experiment, we have obtained 40 initial clusters from 1364 objects by using the inner product. We have excluded the other 124 objects because they were classified into the small clusters. It took us 35 loops for convergence. And eventually we have got 5 clusters. Let us go into more detail of our results. Cluster 1 has been obtained by combining 19 initial clusters. On the other hand, cluster 2 and cluster 3 contain no combining. Cluster 4 and cluster 5 contain 10 and 9 initial clusters respectively. Generally the results seem to reflect features of the observation points. In fact, cluster 1 contains 519 objects (69.8%) of the 744 objects of the Niigata point. Cluster 5 holds 469 objects (63.0%) of the 744 objects of the Wakkanai point. Thus we can say cluster 1 reflects the peculiar trends of the Niigata point well, and cluster 5 reflects the peculiar trends of the Wakkanai point well. For example, in Table 3, we see both "pressure" and "temperature" of cluster 1 are higher than in other clusters. Thus the cluster contains objects that were observed in a region of high altitude and high temperature. Also "temperature" and "humidity" in cluster 5 are relatively low. And we see the cluster contains objects observed in a region of low precipitation and low temperature. In the case of cluster 4, the "day" value is high since it was observed in January. The "pressure" is low, and "humidity" is also high. We can say that cluster 4 is unrelated to the observation region. We might be able to characterize cluster 4 by the state of the weather, such as low pressures. In fact, the cluster contains almost the same number of objects of the Niigata and Wakkanai points. In Table 4, the absolute values of the regression coefficients in cluster 5 are overall high compared with clusters 1 and 4. Compared to the change of weather, the change of temperature is large. That is, temperature varies in a wide range. Since Hokkaido is the region that has the maximum annual difference of temperature in Japan¹, our results go well with the actual classification. Also cluster 1 is similar to cluster 5, but the gradient is smaller. Thus, temperature in cluster 1 doesn't vary very much. Cluster 4 is clearly different from the other clusters; the cluster has correlation only for dew point and relative humidity. Let us summarize our experiment. We got 5 clusters. Especially, we have extracted regional features from clusters 1 and 5. It is evident from the information on observation points in Table 2 that the clustering has classified the objects suitably. This fact means that the results in our experiment satisfy the initial condition. Let us discuss some more aspects to compare our technique with others. We have analyzed the same data by using the K-means method with the statistics application tool SPSS. We gave the centers of the initial clusters by random numbers. We specified 10 as the maximum number of iterations. Then we have analyzed the two cases of k = 2 and 3. Let us show the results of k = 2 in Table 5, and the results of k = 3 in Table 6.
¹ In the Hokkaido area, the lowest temperature decreases to about -20 °C in winter time, and the maximum air temperature exceeds 30 °C in summer time.
Table 2. Final clusters (Θ = 0.8, ∆ = 15)

           Variance   F-value   Contained clusters  Niigata  Wakkanai
Cluster1    4.958     8613.28          19             519       74
Cluster2    2.12926   2043.62           1              29        1
Cluster3    2.42196     78.2235         1              45        0
Cluster4    5.1034   85603.6           10             135      189
Cluster5    5.50085  17964.9            9              11      469

Table 3. Center of gravity for each cluster

                       Cluster1     Cluster4     Cluster5
Day                  -0.0103696    0.487242    -0.0987597
Hour                  0.0622433   -0.148476     0.0358148
Pressure              0.599712    -0.899464    -0.0542843
Sea-level pressure    0.580779    -0.902211    -0.023735
Dew point             0.512113     0.294745    -1.05024
Steam pressure        0.468126     0.245389    -0.993099
Relative humidity    -0.179054     0.975926    -0.392923
Air temperature       0.692525    -0.234597    -0.976859

Table 4. Standardised regression coefficients of clusters

                       Cluster1     Cluster4     Cluster5
Day                  -0.0163907   -0.0029574    0.0108929
Hour                 -0.00316974  -0.00182585  -0.00965708
Pressure              1.18154      0.0357092    1.77679
Sea-level pressure   -1.15683     -0.0393822   -1.77494
Dew point             0.909799     1.103        1.22524
Steam pressure        0.421526    -0.016361     0.212142
Relative humidity    -1.25678     -0.36817     -0.804252
Table 5. Clustering by K-mean method (k=2)

           Niigata  Wakkanai
Cluster1     496      359
Cluster2     248      385
In case of k = 2, it seems hard for readers (and for us) to extract significant differences of the two final clusters with respect to observation points. Similarly, in case of k = 3, we can’t extract sharp feature from the results. Thus, our technique can be an alternative when it is not possible to cluster well by K-means method.
Table 6. Clustering by K-mean method (k=3)

           Niigata  Wakkanai
Cluster1     293      149
Cluster2     158      353
Cluster3     293      243

6
Conclusion
In this investigation, we have discussed clustering for data where objects with a different local trend existed together. We have proposed how to extract trend of clusters by using regression analysis and similarity of the cluster by F-value of a regression. We have introduced threshold of distance between clusters to keep precision of the cluster. By examining the data, we have shown that we can extract clusters of a moderate number to interpret it and the feature by center of gravity and regression coefficient. We have examined some experimental results and compared our method with other methods to show the feasibility of our approach. We had already discussed how to mine Temporal Class Schemes to model a collection of time series data[7], and we are now developing integrated methodologies to time series data and stream data.
Acknowledgements We would like to acknowledge the financial support by Grant-in-Aid for Scientific Research (C)(2) (No.14580392).
References
[1] Bezdek, J. C.: "Numerical taxonomy with fuzzy sets", Journal of Mathematical Biology, Vol. 1, pp. 57-71, 1974.
[2] Chakrabarti, K. and Mehrotra, S.: "Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces", Proc. VLDB, 2000.
[3] Cheeseman, P., et al.: "Bayesian classification", Proc. ACM Artificial Intelligence, 1988, pp. 607-611.
[4] Jain, A. K., Murty, M. N. and Flynn, P. J.: "Data Clustering – A Review", ACM Computing Surveys, Vol. 31-3, 1999, pp. 264-323.
[5] Japan Weather Association: "Weather Data HIMAWARI", Maruzen, 1998.
[6] MacQueen, J. B.: "Some methods for classification and analysis of multivariate observations", Proc. Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, 1967.
[7] Motoyoshi, M., Miura, T., Watanabe, K., Shioya, I.: "Mining Temporal Classes from Time Series Data", Proc. ACM Conf. on Information and Knowledge Management (CIKM), 2002.
[8] Wallace, C. S. and Dowe, D. L.: "Intrinsic classification by MML – the Snob program", Proc. 7th Australian Joint Conference on Artificial Intelligence, 1994, pp. 37-44.
Handling Large Workloads by Profiling and Clustering Matteo Golfarelli DEIS - University of Bologna 40136 Bologna - Italy [email protected]
Abstract. View materialization is recognized to be one of the most effective ways to increase the Data Warehouse performance; nevertheless, due to the computational complexity of the techniques aimed at choosing the best set of views to be materialized, this task is mainly carried out manually when large workloads are involved. In this paper we propose a set of statistical indicators that can be used by the designer to characterize the workload of the Data Warehouse, thus driving the logical and physical optimization tasks; furthermore we propose a clustering algorithm that allows the cardinality of the workload to be reduced and uses these indicators for measuring the quality of the reduced workload. Using the reduced workload as the input to a view materialization algorithm allows large workloads to be efficiently handled.
1 Introduction

During the design of a data warehouse (DW), the phases aimed at improving the system performance are mainly the logical and physical ones. One of the most effective ways to achieve this goal during logical design is view materialization. The so-called view materialization problem consists of choosing the best subset of the possible (candidate) views to be precomputed and stored in the database while respecting a set of system and user constraints (see [8] for a survey). Even if the most important constraint is the disk space available for storing aggregated data, the quality of the result is usually measured in terms of the number of disk pages necessary to answer a given workload. Despite the efforts made by research in recent years, view materialization remains a task whose success depends on the experience of the designer who, adopting rules of thumb and a trial-and-error approach, may reach acceptable solutions. Unlike other issues in the DW field, the large set of techniques available in the literature has not been engineered and included in commercial tools; understanding why is fundamental to solving the problem. Of course the main reason is the computational complexity of view materialization, which makes all the proposed approaches unsuitable for workloads larger than about forty queries. Unfortunately, real workloads are much larger and are usually not available during the DW design but only when the system is on-line. Nevertheless, the designer can estimate the core of the workload at the design phase, but such a rough approximation will lead to a largely sub-optimal solution.
We believe that the best solution is to carry out a rough optimization at design time and to refine the solution by tuning it, manually or automatically, when the system is on-line, on the basis of the real workload. The main difficulty with this approach is the huge size of the workload, which cannot be handled by the algorithms known in the literature. In this context the contribution of the paper is twofold: first, we propose a technique for profiling large workloads that can be obtained from the log file produced by the DBMS when the DW is on-line. The statistical indicators obtained can be used by the designer to characterize the DW workload, thus driving the logical and physical optimization tasks. The second contribution concerns a clustering algorithm that allows the cardinality of the workload to be reduced and that uses the indicators in order to measure the quality of the reduced workload. Using the reduced workload as the input to a view materialization algorithm allows large workloads to be efficiently handled. Since clustering is an independent preprocessing step, all the algorithms presented in the literature can be adopted during the view selection phase. Figure 1 shows the framework we assume for our approach: OLAP applications generate SQL queries whose logs are periodically processed to determine the statistical indicators and a clustered workload that can be handled by a view materialization algorithm that produces new patterns to be materialized.
Fig. 1. Overall framework for the view materialization process
To the best of our knowledge, only a few works have directly faced the workload size problem; in particular, in [5] the authors proposed a polynomial-time algorithm that explores only a subset of the candidate views and delivers a solution whose quality is comparable with other techniques that run in exponential time. In [1] the authors propose a heuristic reduction technique that is based on the functional dependencies between attributes and excludes from the search space those views that are "similar" to other ones already considered. Unlike ours, this approach does not produce any representative workload to be used for further optimizations. Clustering of queries in the field of DWs has recently been used to reduce the complexity of the plan selection task [2]: each cluster has a representative for which the execution plan, as determined by the optimizer, is persistently stored. Here the concept of similarity is based on a complex set of features that must capture when different queries can be efficiently solved using the same execution plan. This idea has been implicitly used in several previous works where a global optimization plan was obtained given a set of queries [7].
The rest of the paper is organized as follows: Section 2 presents the necessary background; Section 3 defines the statistical indicators for workload profiling; Section 4 presents the algorithm for query clustering, while in Section 5 a set of experiments aimed at proving its effectiveness is reported. Finally, in Section 6 the conclusions are drawn.
2 Background

It is recognized that DWs lean on the multidimensional model to represent data, meaning that the indicators that measure a fact of interest are organized according to a set of dimensions of analysis; for example, sales can be measured by the quantity sold and the price of each sale of a given product that took place in a given store on a given day. Each dimension is usually related to a set of attributes describing it at different aggregation levels; the attributes are organized in a hierarchy defined according to a set of functional dependencies. For example, a product can be characterized by the attributes PName, Type, Category and Brand, among which the following functional dependencies are defined: PName→Type, Type→Category and PName→Brand; on the other hand, stores can be described by their geographical and commercial location: SName→City, City→Country, SName→CommArea, CommArea→CommZone. In relational solutions, the multidimensional nature of data is implemented on the logical model by adopting the so-called star scheme, composed of a set of fully denormalized dimension tables, one for each dimension of analysis, and a fact table whose primary key is obtained by composing the foreign keys referencing the dimension tables. The most common class of queries used to extract information from a star schema is the class of GPSJ queries [3], which consist of a selection over a generalized projection over a selection over a join between the fact table and the dimension tables involved. It is easy to understand that grouping heavily contributes to the global query cost and that such a cost can be reduced by precomputing (materializing) the aggregated information that is useful to answer a given workload. Unfortunately, in real applications, the size of such views never fits the constraint given by the available disk space, and it is very hard to choose the best subset to be actually materialized. When working on a single fact scheme, and assuming that all the measures contained in the elemental fact table are replicated in the aggregated views, a view is completely defined by its aggregation level.

Definition 1 The pattern of a view consists of a set of dimension table attributes such that no functional dependency exists between attributes in the same pattern.

Possible patterns for the sales fact are: P1 = {Month, Country, Category}, P2 = {Year, SName}, P3 = {Brand}. In the following we will use the terms pattern and view interchangeably, and we will refer to the query pattern as the coarsest pattern that can be used to answer the query.

Definition 2 Given two views Vi, Vj with patterns Pi, Pj respectively, we say that Vi can be derived from Vj (Vi ≤ Vj) if the data in Vi can be calculated from the data in Vj.
Derivability determines a partial-order relationship between the views, and thus between the patterns, of a fact scheme. Such a partial order can be represented by the so-called multidimensional lattice [1], whose nodes are the patterns and whose arcs show a direct derivability relationship between patterns.

Definition 3 We denote with Pi ⊕ Pj the least upper bound (ancestor) of two patterns in the multidimensional lattice.

In other words, the ancestor of two patterns corresponds to the coarsest one from which both can be derived. Given a set of queries, the ancestor operator can be used to determine the set of views that are potentially useful to reduce the workload cost (candidate views). The candidate set can be obtained, starting from the workload queries, by iteratively adding to the set the ancestors of each couple of patterns until a fixed point is reached. Most approaches to view materialization try first to determine the candidate views, and then to choose the best subset that fits the constraints. Both problems have an exponential complexity.
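To make Definitions 2 and 3 concrete, here is a small Python sketch (ours, not taken from the paper) that encodes the product and store hierarchies of the Sales example as functional dependencies and computes derivability and the ancestor operator; the function names and data layout are assumptions made purely for illustration.

    # Hierarchies encoded as functional dependencies: attribute -> coarser attributes.
    FDS = {
        "PName": ["Type", "Brand"], "Type": ["Category"],
        "SName": ["City", "CommArea"], "City": ["Country"], "CommArea": ["CommZone"],
    }
    DIMENSIONS = {  # all attributes of each hierarchy, dimension key first
        "product": ["PName", "Type", "Category", "Brand"],
        "store":   ["SName", "City", "Country", "CommArea", "CommZone"],
    }

    def closure(attr):
        """All attributes functionally determined by attr (attr itself included)."""
        seen, todo = {attr}, [attr]
        while todo:
            for parent in FDS.get(todo.pop(), []):
                if parent not in seen:
                    seen.add(parent); todo.append(parent)
        return seen

    def derivable(p_coarse, p_fine):
        """Definition 2: the view with pattern p_coarse can be computed from p_fine."""
        return set(p_coarse) <= set().union(*(closure(a) for a in p_fine))

    def dim_of(attr):
        return next(d for d, attrs in DIMENSIONS.items() if attr in attrs)

    def ancestor(p1, p2):
        """Definition 3: per dimension, the coarsest attribute that determines
        the attributes of both patterns."""
        result = []
        for dim, attrs in DIMENSIONS.items():
            needed = [a for a in p1 + p2 if dim_of(a) == dim]
            if not needed:
                continue
            candidates = [x for x in attrs if all(a in closure(x) for a in needed)]
            # the coarsest candidate is the one every other candidate determines
            result.append(next(x for x in candidates
                               if all(x in closure(y) for y in candidates)))
        return tuple(result)

    print(ancestor(("Category", "City"), ("Type", "Country")))       # ('Type', 'City')
    print(ancestor(("Category", "Country"), ("Brand", "CommZone")))  # ('PName', 'SName')
    print(derivable(("Category", "Country"), ("Type", "City")))      # True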
3 Profiling the workload

Profiling means determining a set of indicators that captures the workload features that have an impact on the effectiveness of different optimization techniques. In particular, we are interested in those relevant to the problem of view materialization and that help the designer to answer questions like: "How suitable to materialization is the workload?", "How much space do I need to obtain good results?". In the following we propose four indicators that have proved to properly capture all the relevant aspects and that can be used as guidance by the designer who manually tunes the DW, or as input to an optimization algorithm for materialized view selection. All the indicators are based on the concept of the cardinality of the view associated to a given pattern, which can be estimated knowing the data volume of the fact scheme, which we assume to contain the cardinality of the base fact table and the number of distinct values of each attribute in the dimension tables. The cardinality of an aggregate view can be estimated using Cardenas' formula. In our case the objects are the tuples in the elemental fact table with pattern P0 (whose number |P0| is assumed to be known), while the number of buckets is the maximum number of tuples, |P|Max, that can be stored in a view with pattern P, which can be easily calculated given the cardinalities of the attributes belonging to the pattern; thus

Card(P) = Φ(|P|Max, |P0|)    (1)
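Formula (1) is easy to evaluate once |P|Max and |P0| are known. The sketch below (ours) assumes the usual form of Cardenas' formula, Φ(b, r) = b·(1 − (1 − 1/b)^r), i.e. the expected number of distinct buckets hit when r tuples fall into b buckets; the cardinalities in the example are made up.

    def cardenas(buckets: float, tuples: int) -> float:
        """Phi(b, r): expected number of distinct buckets hit by r tuples."""
        return buckets * (1.0 - (1.0 - 1.0 / buckets) ** tuples)

    def card(pattern_max_tuples: float, base_fact_tuples: int) -> float:
        """Estimated view cardinality (1): Card(P) = Phi(|P|Max, |P0|)."""
        return cardenas(pattern_max_tuples, base_fact_tuples)

    # |P|Max is the product of the attribute cardinalities of the pattern,
    # e.g. a pattern {Month, Category} with 36 months and 50 categories:
    print(round(card(36 * 50, 1_000_000)))   # close to 1800 for a large fact table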
3.1 Aggregation level of the workload The aggregation level of a pattern P is calculated as:
Agg(P) = 1 − Card(P) / |P0|    (2)
Agg(P) ranges in [0,1[; the higher the value, the coarser the pattern. The average aggregation level (AAL) of the full workload W = {Q1, …, Qn} can be calculated as

AAL = (1/n) Σ_{i=1..n} Agg(Pi)    (3)

where Pi is the pattern of query Qi. In order to partially capture how the queries are distributed at different aggregation levels, we also include the aggregation level standard deviation (ALSD), which is the standard deviation of the aggregation levels around AAL:

ALSD = sqrt( (1/n) Σ_{i=1..n} (Agg(Pi) − AAL)² )    (4)
AAL and ALSD characterize to what extent the information required by the users is aggregated and express the willingness of the workload to be optimized using materialized views. Intuitively, workloads with high values of AAL will be efficiently optimized using materialized views since they determine a strong reduction of the number of tuples to be read. Furthermore, the limited size of such tables allows a higher number of views to be materialized. On the other hand, a low value for ALSD denotes that most of the views share the same aggregation level, further improving the usefulness of view materialization.

3.2 Skewness of the workload

Measuring the aggregation level is not sufficient to characterize the workload; in fact workloads with similar values of AAL and ALSD can behave differently, with respect to materialization, depending on the attributes involved in the queries. Consider for example two workloads W1 = {Q1, Q2} and W2 = {Q3, Q4} formulated on the Sales fact, and the patterns of their queries:

− P1 = {Category, City}       Card(P1) = 2100
− P2 = {Type, Country}        Card(P2) = 1450
− P3 = {Category, Country}    Card(P3) = 380
− P4 = {Brand, CommZone}      Card(P4) = 680
Materializing a single view to answer both queries in the workload is much more useful for W1 than for W2, since in the first case the ancestor is very "close" to the queries (P1 ⊕ P2 = {Type, City}) and still coarse, while in the second case it is "far" and fine (P3 ⊕ P4 = {SName, PName}). This difference is captured by the distance between two patterns, which we calculate as

Dist(Pi, Pj) = Agg(Pi) + Agg(Pj) − 2 Agg(Pi ⊕ Pj)    (5)
Dist(Pi, Pj) is calculated in terms of the distance of Pi and Pj from their ancestor, which is the point of the multidimensional lattice closest to both views. Figure 2 shows two different situations on the same multidimensional lattice: even if the aggregation levels of the patterns are similar, the distance between each couple changes significantly. The average skewness (ASK) of the full workload W = {Q1, …, Qn} can be calculated as
ASK = ( 2 / (n·(n−1)) ) Σ_{i=1..n−1} Σ_{j=i+1..n} Dist(Pi, Pj)    (6)

where Pz is the pattern of query Qz. ASK ranges in [0,2[¹. Also for the skewness indicator it is useful to calculate the standard deviation (Skewness Standard Deviation, SKSD) in order to evaluate how the distances between queries are distributed with respect to their mean value:

SKSD = sqrt( ( 2 / (n·(n−1)) ) Σ_{i=1..n−1} Σ_{j=i+1..n} (Dist(Pi, Pj) − ASK)² )    (7)
Intuitively, workloads with low values for ASK will be efficiently optimized using materialized views since the similarity of the query patterns makes it possible to materialize few views to optimize several queries.

Fig. 2. Distance between close and far patterns
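The indicators (2)-(7) can be computed directly from the query patterns. The following Python sketch (ours, not the authors' implementation) does so; it assumes a cardinality estimator card(P) such as the Cardenas-based one above and an ancestor operator like the one sketched in Section 2, both passed in as callables.

    from math import sqrt
    from itertools import combinations

    def agg(pattern, card, base_cardinality):
        """Aggregation level (2): Agg(P) = 1 - Card(P) / |P0|."""
        return 1.0 - card(pattern) / base_cardinality

    def dist(p_i, p_j, agg_of, ancestor):
        """Distance (5) between two patterns via their ancestor."""
        return agg_of(p_i) + agg_of(p_j) - 2.0 * agg_of(ancestor(p_i, p_j))

    def profile(workload, agg_of, ancestor):
        """AAL, ALSD, ASK, SKSD (3)-(4), (6)-(7) for a list of query patterns."""
        n = len(workload)
        levels = [agg_of(p) for p in workload]
        aal = sum(levels) / n
        alsd = sqrt(sum((a - aal) ** 2 for a in levels) / n)
        pair_dists = [dist(pi, pj, agg_of, ancestor)
                      for pi, pj in combinations(workload, 2)]
        ask = sum(pair_dists) / len(pair_dists)
        sksd = sqrt(sum((d - ask) ** 2 for d in pair_dists) / len(pair_dists))
        return aal, alsd, ask, sksd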
¹ The maximum value for ASK depends on the cardinalities of the attributes and on the functional dependencies defined on the hierarchies, thus it cannot be defined without considering the specific star schema.

4 Clustering of queries

Clustering is one of the most common techniques for classifying features into groups. Several algorithms have been proposed in the literature (see [4] for a survey), each suitable for a specific class of problems. In this paper we adopted the hierarchical approach, which recursively agglomerates the two most similar clusters, forming a dendrogram whose creation can be stopped at different levels to yield different clusterings of the data, each related to a different level of similarity that will be evaluated using the statistical indicators introduced so far. Each initial cluster contains a single query of the workload, which represents it. At each step the algorithm looks
for the two most similar clusters, which are collapsed to form a new one that is represented by the query whose pattern is the ancestor of their representatives. Figure 3 shows the output of this process. With a little abuse of terminology we write qx ⊕ qy meaning that the ancestor operator is applied to the patterns of the queries.
Fig. 3. A possible dendrogram for a workload with 6 queries
Similarity between clusters is expressed in terms of the distance, as defined in Section 3.2, between the patterns of their representatives. Each cluster is represented by the ancestor of all the queries belonging to it and is labeled with the sum of the frequencies of its queries. This simple but effective solution reflects the criteria adopted by the view materialization algorithms, which rely on the ancestor concept when choosing one view to answer several queries. The main drawback here is that the value of AAL tends to decrease when the initial workload is strongly aggregated. Nevertheless the ancestor solution is the only one ensuring that the cluster representative effectively characterizes its queries with respect to materialization (i.e. all the queries in the cluster can be answered on a view on which the representative can also be answered). Adding new queries to a cluster inevitably induces heterogeneity in the aggregation level of its queries, thus reducing its capability to represent all of them. Given a clustering Wc = {C1, …, Cm}, we measure the compactness of the clusters in terms of the similarity of the aggregation levels of the queries in each cluster as

IntraALSD = (1/m) Σ_{i=1..m} ALSDi    (8)

where ALSDi is the standard deviation of the aggregation level for the queries in cluster Ci. The lower IntraALSD, the closer the queries in the clusters. As to the behavior of ASK, it tends to increase when the number of clusters decreases, since closer queries are collapsed earlier than far ones. While this is an obvious effect of clustering, a second relevant measure of the compactness of the clusters in Wc = {C1, …, Cm} can be expressed in terms of internal skewness:
IntraASK = (1/m) Σ_{i=1..m} ASKi    (9)
where ASKi is the skewness of the queries in cluster Ci. The lower IntraASK, the closer the queries in the clusters. The ratio between the statistical indicators and the corresponding intra-cluster ones can be used to evaluate how well the clustering models the original workload; in particular we adopted this technique to define when the clustering process must be stopped. The stop rule we adopt is as follows:

Stop if  AAL / IntraAAL > TAL  ∨  ASK / IntraASK > TSK

In our tests both TAL and TSK have been set to 5.
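A rough Python sketch of this clustering loop is given below (ours, not the authors' code). It reuses the dist, profile and ancestor helpers sketched earlier, and it approximates the intra-cluster indicators with simple per-cluster averages, since the exact definition of IntraAAL is not reproduced in this excerpt; treat it as an illustration of the approach under those assumptions.

    def intra_indicators(clusters, agg_of, ancestor):
        """Average per-cluster aggregation level and skewness (approximation of (8)-(9))."""
        per_cluster_aal = [sum(agg_of(p) for p in members) / len(members)
                           for _, _, members in clusters]
        multi = [members for _, _, members in clusters if len(members) > 1]
        per_cluster_ask = [profile(members, agg_of, ancestor)[2] for members in multi]
        intra_aal = sum(per_cluster_aal) / len(per_cluster_aal)
        intra_ask = sum(per_cluster_ask) / len(per_cluster_ask) if per_cluster_ask else 0.0
        return intra_aal, intra_ask

    def cluster_workload(queries, agg_of, ancestor, t_al=5.0, t_sk=5.0):
        """Agglomerative clustering of (pattern, frequency) pairs with ancestor representatives."""
        clusters = [(p, f, [p]) for p, f in queries]
        aal, _, ask, _ = profile([p for p, _ in queries], agg_of, ancestor)
        while len(clusters) > 1:
            # merge the two clusters whose representatives are closest (distance (5))
            pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
            i, j = min(pairs, key=lambda ij: dist(clusters[ij[0]][0], clusters[ij[1]][0],
                                                  agg_of, ancestor))
            a, b = clusters[i], clusters[j]
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append((ancestor(a[0], b[0]), a[1] + b[1], a[2] + b[2]))
            # stop rule: compare workload indicators with their intra-cluster counterparts
            intra_aal, intra_ask = intra_indicators(clusters, agg_of, ancestor)
            if intra_aal and intra_ask and (aal / intra_aal > t_al or ask / intra_ask > t_sk):
                break
        return clusters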
5 Tests and discussion

In this section we present four different tests aimed at proving the effectiveness of both profiling and clustering. The tests have been carried out on the LINEITEM fact scheme described in the TPC-H/R benchmark [9] using a set of generated workloads. Since selections are rarely taken into account by view materialization algorithms, our queries do not contain any selection clause. As to the materialization algorithm, we adopted the classic one proposed by Baralis et al. [1]; the algorithm first determines the set of candidate views and then heuristically chooses the best subset that fits the given space constraint. Splitting the process into two phases allows us to estimate both the difficulty of the problem, which we measure in terms of the number of candidate views, and the effectiveness of materialization, which is calculated in terms of the number of disk pages saved by materialization. The cost function we adopted computes the cost of a query Q on a star schema S composed of a fact table FT and a set {DT1, …, DTn} of dimension tables as

Cost(Q, S) = Size(FT) + Σ_{i ∈ Dim(Q)} ( Size(DTi) + Size(PKi) )    (10)
where the Size() function returns the size of a table/index expressed in disk pages, Dim(Q) returns the indexes of the dimension tables involved in Q, and PKi is the primary index on DTi. This cost function assumes the execution plan adopted by Redbrick 6.0 when no selection conditions are present in a query on a star schema.
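Cost function (10) can be transcribed directly; in the sketch below (ours) the page sizes are invented for illustration and would normally be taken from the data volume.

    def query_cost(fact_table_pages, dim_tables, query_dims):
        """Cost(Q, S) = Size(FT) + sum over dimensions touched by Q of (Size(DT_i) + Size(PK_i))."""
        return fact_table_pages + sum(dim_tables[d]["table"] + dim_tables[d]["pk_index"]
                                      for d in query_dims)

    dims = {"product": {"table": 120, "pk_index": 15}, "store": {"table": 40, "pk_index": 8}}
    print(query_cost(50_000, dims, ["product", "store"]))   # 50183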
5.1 Workload features fitting

The first test shows that the statistical indicators proposed in Section 3 effectively summarize the features of a workload. Four workloads, each made up of 20 queries, have been generated with different values for the indicators. Table 1 reports the values of the indicators and the resulting number of candidate views, which confirms the considerations made in Section 3. The complexity of the problem mainly depends on the value of ASK and is only slightly influenced by AAL. The simplest workloads
to be elaborated will be those with highly aggregated queries with similar patterns, while the most complex will be those with very different patterns and a low aggregation level. It should be noted that, on increasing the size of the workloads, those with a "nice" profile still perform well, while the others quickly become too complex. For example, workloads WKL5 and WKL6 in Table 1, whose profiles follow those of WKL1 and WKL4 respectively, contain 30 queries: while the number of candidate views remains low for WKL5, it explodes for WKL6. Actually, we stopped the algorithm after two days of computation on a Pentium IV CPU (1 GHz). The profile is also useful to evaluate how well the workload will behave with respect to view materialization. Figure 4.a shows that, regardless of the difficulty of the problems, workloads with high values of AAL are strongly optimized even when a limited disk space is available for storing materialized views. This behavior is induced by the size, and thus by the number, of the materialized views that fit the space constraint, as can be verified in Figure 4.b.

Table 1. Number of candidate views for workloads with different profiles
Name   AAL     ALSD    ASK     SKSD    N. Candidate views
WKL1   0.835   0.307   0.348   0.393   97
WKL2   0.186   0.245   0.327   0.269   124
WKL3   0.790   0.278   0.810   0.391   596
WKL4   0.384   0.153   0.751   0.216   868
WKL5   0.884   0.297   0.316   0.371   99
WKL6   0.352   0.276   0.668   0.354   > 36158
Fig. 4. Cost of the workloads (a) and number of materialized views (b) on varying the disk space constraint for the workloads in Table 1
5.2 Clustering suboptimality

The second test is aimed at proving that clustering produces a good approximation of the input workload, meaning that applying view materialization to the original and to the clustered workload does not induce too heavy a suboptimality. With reference to the workloads in Table 1, Table 2 shows how the behavior and the effectiveness of the view materialization algorithm change for an increasing level of clustering. It
should be noted that the number of candidate views can be strongly reduced while inducing, in most cases, a limited suboptimality. By comparing the suboptimality percentages with the statistical indicator trends presented in Figure 5, it is clear that suboptimality arises earlier for workloads where IntraALSD and IntraASK increase earlier.

5.3 Handling large workloads

When workloads with hundreds of queries are considered, it is not possible to measure the suboptimality induced by the clustered solution since the original workloads cannot be directly optimized. On the other hand, it is still possible to compare the increase in performance with respect to the case with no materialized views, and it is also interesting to show how the workload costs change depending on the number of queries included in the clustered workload and how the cost is related to the statistical indicators.

Table 2. Effects of clustering on the view materialization algorithm applied to the workloads in Table 1
WKL    # Cluster   # Cand. Views   # Mat. Views   % SubOpt   Stop rule at
WKL1   15          90              12             0.001      3
       10          68              7              0.308
       5           25              3              40.511
WKL2   15          79              2              0.000      6
       10          38              2              2.561
       5           6               2              4.564
WKL3   15          549             10             1.186      7
       10          156             7              22.146
       5           16              4              65.407
WKL4   15          321             2              0.0        4
       10          129             2              0.0
       5           17              2              0.0
Table 3 reports the view materialization results for two workloads, WKL 7 (AAL:0.915, ALSD:0.266, ASK: 0.209, SKSD: 0.398) - WKL 8 (AAL: 0.377, ALSD: 0.250, ASK: 0.738, SKSD: 0.345), containing 200 queries. The data in the table and the graphs in Figure 6 confirm the behaviors deduced from previous tests: the effectiveness of view materialization is higher for workloads with high value of AAL and low value of ASK. Also the capability of the clustering algorithm to capture the features of the original workload depends on its profile, in fact workloads with higher values of ASK require more queries (7 for WKL7 vs. 20 for WKL8) in the clustered workload to effectively model the original one. On the other hand it is not useful to excessively increase the clustered workload cardinality since the performance improvement is much lower than the increase of the computation time.
Fig. 5. Trends of the statistical indicators for increasing levels of clustering and for different workloads.

Table 3. Effects of clustering on the view materialization algorithm applied to the workloads WKL7 and WKL8
WKL    # Cluster   # Cand. Views   # Mat. Views   % Cost Reduction   Comp. Time (sec.)   Stop rule at
WKL7   30          12506           17             90.6               43984               6
       20          4744            15             89.0               439
       10          384             9              83.3               39
       7           64              6              38.9               24
WKL8   30          17579           5              19.1               78427               25
       20          2125            5              17.8               304
       10          129             2              2.4                19
6 Conclusions

In this paper we have discussed two techniques that make it possible to carry out view materialization when the high cardinality of the workload does not allow the problem to be faced directly. In particular, the set of statistical indicators proposed has proved to capture those workload features that are relevant to the view materialization problem, thus driving the designer's choices. The clustering algorithm allows large workloads to be handled by automatic techniques for view materialization since it reduces their cardinality while only slightly corrupting the original characteristics. We believe that the information carried by the statistical indicators we proposed can be
profitably used to increase the effectiveness of the optimization algorithms used in both logical and physical design. For example, in [6] the authors propose a technique for splitting a given quantity of disk space into two parts used for creating views and indexes, respectively. Since that technique takes into account only information relative to a single query, our indicators can improve the solution by indicating the inclination of the workload to be optimized by indexing or by view materialization.
Fig. 6. Trends of the statistical indicators for increasing levels of clustering and for different workloads.
References
[1] E. Baralis, S. Paraboschi and E. Teniente. Materialized view selection in a multidimensional database. In Proc. 23rd VLDB, Greece, 1997.
[2] A. Ghosh, J. Parikh, V. S. Sengar and J. R. Haritsa. Plan Selection Based on Query Clustering. In Proc. 28th VLDB, Hong Kong, China, 2002.
[3] A. Gupta, V. Harinarayan and D. Quass. Aggregate-query processing in data-warehousing environments. In Proc. 21st VLDB, Switzerland, 1995.
[4] A. K. Jain, M. N. Murty and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys, Vol. 31, N. 3, September 1999.
[5] T. P. Nadeau and T. J. Teorey. Achieving scalability in OLAP materialized view selection. In Proc. DOLAP'02, Virginia, USA, 2002.
[6] S. Rizzi and E. Saltarelli. View materialization vs. Indexing: balancing space constraints in Data Warehouse Design. To appear in Proc. CAiSE'03, Austria, 2003.
[7] T. K. Sellis. Global query Optimization. In Proc. SIGMOD Conference, Washington D.C., 1986, pp. 191-205.
[8] D. Theodoratos, M. Bouzeghoub. A General Framework for the View Selection Problem for Data Warehouse Design and Evolution. In Proc. DOLAP'00, Washington D.C., USA, 2000.
[9] Transaction Processing Performance Council. TPC Benchmark H (Decision Support) Standard Specification, Revision 1.1.0, 1998, http://www.tpc.org.
Incremental OPTICS: Efficient Computation of Updates in a Hierarchical Cluster Ordering Hans-Peter Kriegel, Peer Kröger, and Irina Gotlibovich Institute for Computer Science University of Munich, Germany {kriegel,kroegerp,gotlibov}@dbs.informatik.uni-muenchen.de
Abstract. Data warehouses are a challenging field of application for data mining tasks such as clustering. Usually, updates are collected and applied to the data warehouse periodically in a batch mode. As a consequence, all mined patterns discovered in the data warehouse (e.g. clustering structures) have to be updated as well. In this paper, we present a method for incrementally updating the clustering structure computed by the hierarchical clustering algorithm OPTICS. We determine the parts of the cluster ordering that are affected by update operations and develop efficient algorithms that incrementally update an existing cluster ordering. A performance evaluation of incremental OPTICS based on synthetic datasets as well as on a real-world dataset demonstrates that incremental OPTICS gains significant speed-up factors over OPTICS for update operations.
1
Introduction
Many companies gather a vast amount of corporate data. This data is typically distributed over several local databases. Since the knowledge hidden in this data is usually of great strategic importance, more and more companies integrate their corporate data into a common data warehouse. In this paper, we do not anticipate any special warehousing architecture but simply address an environment which provides derived information for the purpose of analysis and which is dynamic, i.e. many updates occur. Usually manual or semi-automatic analysis such as OLAP cannot make use of the entire information stored in a data warehouse. Automatic data mining techniques are more appropriate to fully exploit the knowledge hidden in the data. In this paper, we focus on clustering, which is the data mining task of grouping the objects of a database into classes such that objects within one class are similar and objects from different classes are not (according to an appropriate similarity measure). In recent years, several clustering algorithms have been proposed [1,2,3,4,5]. A data warehouse is typically not updated immediately when insertions or deletions on a member database occur. Usually updates are collected locally and applied to the common data warehouse periodically in a batch mode, e.g. each night. As a consequence, all clusters explored by clustering methods have to be updated as well. The update of the mined patterns has to be efficient because it should be finished when the warehouse has to be available for its users again, e.g. in the next morning. Since a warehouse usually
stores a large amount of data, it is highly desirable to perform updates incrementally [6]. Instead of recomputing the clusters by applying the algorithm to the entire (very large) updated database, only the old clusters and the objects inserted or deleted during a given period are considered. In this paper, we present an incremental version of OPTICS [5] which is an efficient clustering algorithm for metric databases. OPTICS combines a density-based clustering notion with the advantages of hierarchical approaches. Due to the density-based nature of OPTICS, the insertion or deletion of an object usually causes expensive computations only in the neighborhood of this object. A reorganization of the cluster structure thus affects only a limited set of database objects. We demonstrate the advantage of the incremental version of OPTICS based on a thorough performance evaluation using several synthetic and a real-world dataset. The remainder of this paper is organized as follows. We review related work in Section 2. Section 3 briefly introduces the clustering algorithm OPTICS. The incremental algorithms for insertions and deletions are presented in Section 4. In Section 5, the results of our performance evaluation are reported. Conclusions are presented in Section 6.
2
Related Work
Beside the tremendous amount of clustering algorithms (e.g. [1,2,3,4,5]), the problem of incrementally updating mined patterns is a rather new area of research. Most work has been done in the area of developing incremental algorithms for the task of mining association rules, e.g. [7]. In [8] algorithms for incremental attribute-oriented generalization are presented. The only algorithm for incrementally updating clusters detected by a clustering algorithm is IncrementalDBSCAN proposed in [6]. It is based on the algorithm DBSCAN [4] which models clusters as density-connected sets. Due to the density-based nature of DBSCAN, the insertion or deletion of an object affects the current clustering only in the neighborhood of this object. Based on these observations, IncrementalDBSCAN yields a significant speed-up over DBSCAN [6]. In this paper, we propose IncOPTICS an incremental version of OPTICS [5] which combines the density-based clustering notion of DBSCAN with the advantages of hierarchical clustering concepts. Since OPTICS is an extension to DBSCAN and yields much more information about the clustering structure of a database, IncOPTICS is much more complex than IncrementalDBSCAN. However, IncOPTICS yields an accurate speed-up over OPTICS without any loss of effectiveness, i.e. quality.
3
Density-Based Hierarchical Clustering
In the following, we assume that D is a database of n objects, dist : D × D → ℝ is a metric distance function on objects in D, and Nε(p) := {q ∈ D | dist(p, q) ≤ ε} denotes the ε-neighborhood of p ∈ D, where ε ∈ ℝ. OPTICS extends the density-connected clustering notion of DBSCAN [4] by hierarchical concepts. In contrast to DBSCAN, OPTICS does not assign cluster memberships
but computes a cluster order in which the objects are processed and additionally generates the information which would be used by an extended DBSCAN algorithm to assign cluster memberships. This information consists of only two values for each object, the core-level and the reachability-distance (or short: reachability).
Definition 1 (core-level). Let p ∈ D, MinPts ∈ ℕ, ε ∈ ℝ, and let MinPts-dist(p) be the distance from p to its MinPts-nearest neighbor. The core-level of p is defined as follows: CLev(p) := ∞ if |Nε(p)| < MinPts, and CLev(p) := MinPts-dist(p) otherwise.
Definition 2 (reachability). Let p, q ∈ D, MinPts ∈ ℕ, and ε ∈ ℝ. The reachability of p wrt. q is defined as RDist(p, q) := max{CLev(q), dist(q, p)}.
Definition 3 (cluster ordering). Let MinPts ∈ ℕ, ε ∈ ℝ, and let CO be a totally ordered permutation of the n objects of D. Each o ∈ D has additional attributes Pos(o), Core(o) and Reach(o), where Pos(o) symbolizes the position of o in CO. We call CO a cluster ordering wrt. ε and MinPts if the following three conditions hold:
(1) ∀p ∈ CO : Core(p) = CLev(p)
(2) ∀o, x, y ∈ CO : Pos(x) < Pos(o) ∧ Pos(y) > Pos(o) ⇒ RDist(y, x) ≥ RDist(o, x)
(3) ∀p, o ∈ CO : Reach(p) = min{RDist(p, o) | Pos(o) < Pos(p)}, where min ∅ = ∞.
Intuitively, Condition (2) states that the order is built by selecting at each position i in CO the object o having the minimum reachability to any object before i. A cluster ordering is a powerful tool to extract flat, density-based decompositions for any ε′ ≤ ε. It is also useful for analyzing the hierarchical clustering structure when plotting the reachability values for each object in the cluster ordering (cf. Fig. 1(a)). Like DBSCAN, OPTICS uses one pass over D and computes the ε-neighborhood for each object of D to determine the core-levels and reachabilities and to compute the cluster ordering. The choice of the starting object does not affect the quality of the result [5]. The runtime of OPTICS is actually higher than that of DBSCAN because the computation of a cluster ordering is more complex than simply assigning cluster memberships, and the choice of the parameter ε affects the runtime of the range queries (for OPTICS, ε typically has to be chosen significantly higher than for DBSCAN).
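For concreteness, the following small Python sketch (ours, not the authors' code) computes core-levels and reachability distances as in Definitions 1 and 2 over a plain list of points with Euclidean distance; the toy data and parameter values are made up.

    from math import dist as euclid, inf

    def eps_neighborhood(points, p, eps):
        return [q for q in points if euclid(p, q) <= eps]

    def core_level(points, p, eps, min_pts):
        """CLev(p): the MinPts-distance of p, or infinity if p has too few eps-neighbors."""
        neighbors = eps_neighborhood(points, p, eps)
        if len(neighbors) < min_pts:
            return inf
        distances = sorted(euclid(p, q) for q in neighbors)
        return distances[min_pts - 1]          # p itself is counted at distance 0

    def reachability(points, p, q, eps, min_pts):
        """RDist(p, q) = max{CLev(q), dist(q, p)}."""
        return max(core_level(points, q, eps, min_pts), euclid(q, p))

    pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
    print(core_level(pts, (0.0, 0.0), eps=0.5, min_pts=3))
    print(reachability(pts, (5.0, 5.0), (0.0, 0.0), eps=0.5, min_pts=3))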
4
Incremental OPTICS
The key observation is that the core-level of some objects may change due to an update. As a consequence, the reachability values of some objects have to be updated as well. Therefore, condition (2) of Def. 3 may be violated, i.e. an object may have to move to another position in the cluster ordering. We will have to reorganize the cluster ordering such that condition (2) of Def. 3 is re-established. The general idea for an incremental version of OPTICS is not to recompute the ε-neighborhood for each object in D but restrict the reorganization on a limited subset of the objects (cf. Fig. 1(b)). Although it cannot be ensured in general, it is very likely that the reorganization is bounded to a limited part of the cluster ordering due to the density-based nature of
Fig. 1. (a) Visual analysis of the cluster ordering: clusters are valleys in the corresponding reachability plot. (b) Schema of the reorganization procedure.
OPTICS. IncOPTICS therefore proceeds in two major steps. First, the starting point for the reorganization is determined. Second, the reorganization of the cluster ordering is worked out until a valid cluster ordering is re-established. In the following, we will first discuss how to determine the frontiers of the reorganization, i.e. the starting point and the criteria for termination. We will determine two sets of objects affected by an update operation. One set called mutating objects, contains objects that may change its core-level due to an update operation. The second set of affected objects contains objects that move forward/backwards in the cluster ordering to re-establish condition (2) of Def. 3. Movement of objects may be caused by changing reachabilities — as an effect of changing core-levels — or by moving predecessors/successors in the cluster ordering. Since we can easily compute a set of all objects possibly moving, we call this set moving objects, containing all objects that may move forward/backwards in the cluster ordering due to an update. 4.1
Mutating Objects
Obviously, an object o may change its core-level only if the update operation affects the ε-neighborhood of o. From Def. 1 it follows that if the inserted/deleted object is one of o’s MinPts-nearest neighbors, Core(o) increases in case of a deletion and decreases in case of an insertion. This observation led us to the definition of the set M UTATING(p) of mutating objects: Definition 4 (mutating objects). Let p be an arbitrary object either in or not in the cluster ordering CO. The set of objects in CO possibly mutating their core-level after the insertion/deletion of p is defined as: M UTATING(p) := {q | p ∈ Nε (q)}. Let us note that p ∈ M UTATING(p) since p ∈ Nε (p). In fact, M UTATING(p) can be computed rather easily. Lemma 1. ∀p ∈ D : M UTATING(p) = Nε (p).
Proof. Since dist is a metric, the following conclusions hold: ∀ q ∈ Nε (p) : dist(q, p) ≤ ε ⇔ dist(p, q) ≤ ε ⇔ p ∈ Nε (q) ⇔ q ∈ M UTATING(p). Lemma 2. Let C be a cluster ordering and p ∈ CO. M UTATING(p) is a superset of the objects that change their core-level due to an insertion/deletion of p into/from CO. Proof. (Sketch) Let q ∈ M UTATING(p): Core(q) changes if p is one of q’s MinPts-nearest neighbors. Let q ∈ M UTATING(p): According to Lemma 1, p ∈ Nε (q) and thus p either cannot be any of q’s MinPts-nearest neighbors or Core(q) = ∞ remains due to Def. 1. Due to Lemma 2, we have to test for each object o ∈ M UTATING(p) whether Core(o) increases/decreases or not by computing Nε (o) (one range query). 4.2
Moving Objects
The second type of affected objects move forward or backwards in the cluster ordering after an update operation. In order to determine the objects that may move forward or backwards after an update operation occurs, we first define the predecessor and the set of successors of an object:

Definition 5 (predecessor). Let CO be a cluster ordering and o ∈ CO. For each entry p ∈ CO the predecessor is defined as Pre(p) = o if Reach(p) = RDist(o, p), and Pre(p) = UNDEFINED if Reach(p) = ∞.

Intuitively, Pre(p) is the object in CO from which p has been reached.

Definition 6 (successors). Let CO be a cluster ordering. For each object p ∈ CO the set of successors is defined as Suc(p) := {q ∈ CO | Pre(q) = p}.

Lemma 3. Let CO be a cluster ordering and p ∈ CO. If Core(p) changes due to an update operation, then each object o ∈ Suc(p) may change its reachability value.

Proof. ∀o ∈ CO: o ∈ Suc(p) ⇒ (by Def. 6) Pre(o) = p ⇒ (by Def. 5) Reach(o) = RDist(o, p) ⇒ (by Def. 2) Reach(o) = max{Core(p), dist(p, o)}. Since the value Core(p) has changed, Reach(o) may also have changed.

As a consequence of a changed reachability value, objects may move in the cluster ordering. If the reachability-distance of an object decreases, this object may move forward such that Condition (2) of Def. 3 is not violated. On the other hand, if the reachability-distance of an object increases, this object may move backwards for the same reason. In addition, if an object has moved, all successors of this object may also move although their reachabilities remain unchanged. All such objects that may move after an insertion or deletion of p are called moving objects:
Definition 7 (moving objects). Let p be an arbitrary object either in or not in the cluster ordering CO. The set of objects possibly moving forward/backwards in CO after insertion/deletion of p is defined recursively: (1) If x ∈ M UTATING(p) and q ∈ Suc(x) then q ∈ M OVING(p). (2) If y ∈ M OVING(p) and q ∈ Suc(y) then q ∈ M OVING(p). (3) If y ∈ M OVING(p) and q ∈ Pre(y) then q ∈ M OVING(p). Case (1) states, that if an object is a successor of a mutating object, it is a moving object. The other two cases state, that if an object is a successor or predecessor of a moving object it is also a moving object. Case (3) is needed, if a successor of an object o is moved to a position before o during reorganization. For the reorganization of moving objects we do not have to compute range queries. We solely need to compare the old reachability values to decide whether these objects have to move or not. 4.3
Limits of Reorganization
We are now able to determine between which bounds the cluster ordering must be reorganized to re-establish a valid cluster ordering according to Def. 3.

Lemma 4. Let CO be a cluster ordering and p be an object either in or not in CO. The set of objects that have to be reorganized due to an insertion or deletion of p is a subset of MUTATING(p) ∪ MOVING(p).

Proof. (Sketch) Let o be an object which has to be reorganized. If o has to be reorganized due to a change of Core(o), then o ∈ MUTATING(p). Else o has to be reorganized due to a changed reachability or due to moving predecessors/successors. Then o ∈ MOVING(p).

Since OPTICS is based on the formalisms of DBSCAN, the determination of the start position for reorganization is rather easy. We simply have to determine the first object in the cluster ordering whose core-level changes after the insertion or deletion, because reorganization is only initiated by changing core-levels.

Lemma 5. Let CO be a cluster ordering which is updated by an insertion or deletion of object p. The object o ∈ D is the start object of the reorganization if the following conditions hold: (1) o ∈ MUTATING(p). (2) ∀q ∈ MUTATING(p), o ≠ q : Pos(o) ≤ Pos(q).

Proof. Since reorganization is caused by changing core-levels, the start object must change its core-level due to the update. (1) follows from Def. 4. According to Def. 7, each q ∈ Suc(p) can be affected by the reorganization. To ensure that no object is lost by the reorganization procedure, o has to be the first object whose core-level has changed (⇒ (2)). In addition, all objects before o are neither elements of MUTATING(p) nor of MOVING(p). Therefore, they do not have to be reorganized.
WHILE NOT Seeds.isEmpty() DO
    // Decide which object is added next to COnew
    IF currObj.reach > Seeds.first().reach THEN
        COnew.add(Seeds.first()); Seeds.removeFirst();
    ELSE
        COnew.add(currObj);
        currObj = next object in COold which has not yet been inserted into COnew;
    // Decide which objects are inserted into Seeds
    q = COnew.lastInsertedObject();
    IF q ∈ MUTATING(p) THEN
        FOR EACH o ∈ Nε(p) which has not yet been inserted into COnew DO
            Seeds.insert(o, max{q.core, dist(q, o)});
    ELSE IF q ∈ MOVING(p) THEN
        FOR EACH o ∈ Pre(p) OR o ∈ Suc(p) which has not yet been inserted into COnew DO
            Seeds.insert(o, o.reach);

Fig. 2. IncOPTICS: Reorganization of the cluster ordering
4.4
Reorganizing a Cluster Ordering
In the following, COold denotes the old cluster ordering before the update and COnew denotes the updated cluster ordering which is computed by IncOPTICS. After the start object so has been determined according to Lemma 5, all objects q ∈ COold with Pos(q) < Pos(so) can be copied into COnew (cf. Fig. 1(b)) because up to the position of so COold is a valid cluster ordering. The reorganization of CO begins at so and imitates OPTICS. The pseudo-code of the procedure is depicted in Fig. 2. It is assumed that each not yet handled o ∈ Nε (so) is inserted into the priority queue Seeds which manages all not yet handled objects from M OVING(p) ∪ M UTATING(p) (i.e. all o ∈ M OVING(p) ∪ M UTATING(p) with Pos(o) ≥ Pos(so)) sorted in the order of ascending reachabilities. In each step of the reorganization loop, the reachability of the first object in Seeds is compared with the reachability of the current object in COold . The entry with the smallest reachability is inserted into the next free position of COnew . In case of a delete operation, this step is skipped if the considered object is the update object. After this insertion, Seeds has to be updated depending on which object has recently been inserted. If the inserted object is an element of M UTATING(p), all neighbors that are currently not yet handled may change their reachabilities. If the inserted object is an element of M OVING(p), all predecessors and successors that are currently not yet handled may move. In both cases, the corresponding objects are inserted into Seeds using the method Seeds::insert which inserts an object with its current reachability or updates the reachability of an object if it is already in the priority queue. If a predecessor is inserted into Seeds, its reachability has to be recomputed (which means a distance calculation in the worst-case) because RDist(., .) is not symmetric. According to Lemma 4, the reorganization terminates if there are no more objects in Seeds, i.e. all objects in M OVING(p) ∪ M UTATING(p), that have to be processed, are
Fig. 3. Runtime of OPTICS vs. average and maximum runtime of IncOPTICS for (a) insertion and (b) deletion.
handled. COnew is filled with all objects from COold which are not yet handled (and thus need not to be considered by the reorganization) maintaining the order determined by COold (cf. Fig. 1(b)). The resulting COnew is valid according to Def. 3.
5
Experimental Evaluation
We evaluated IncOPTICS using four synthetic datasets consisting of 100,000, 200,000, 300,000, and 500,000 2-dimensional points and a real-world dataset consisting of 112,361 TV snapshots encoded as 64-dimensional color histograms. All experiments were run on a workstation featuring a 2 GHz CPU and 3.5 GB RAM. An X-Tree was used to speed up the range queries computed by OPTICS and IncOPTICS. We performed 100 random updates (insertions and deletions) on each of the synthetic datasets and compared the runtime of OPTICS with the maximum and average runtimes of IncOPTICS (insert/delete) on the random updates. The results are depicted in Fig. 3. We observed average speed-up factors of about 45 and 25 and worst-case speed-up factors of about 20 and 17 in case of insertion and deletion, respectively. A similar observation, but on a lower level, can be made when evaluating the performance of OPTICS and IncOPTICS applied to the real-world dataset. The worst speed-up factor ever observed for the real-world dataset was 3. In Fig. 5(a) the average runtimes of IncOPTICS for the best 10 inserted and deleted objects are compared with the runtime of OPTICS using the TV dataset. A possible reason for the large speed-up is that IncOPTICS saves a lot of range queries. This is shown in Fig. 4(a) and 4(b), where we compared the average and maximum number of range queries and moved objects, respectively. The cardinality of the set MUTATING(p) is depicted as "RQ" and the cardinality of the set MOVING(p) is depicted as "MO" in the figures. It can be seen that IncOPTICS saves a lot of range queries compared to OPTICS. For high-dimensional data this observation is even more important since the logarithmic runtime of most index structures for a single range query degenerates to a linear runtime. Fig. 5(b), presenting the average cardinality of
Fig. 4. Comparison of average and maximum cardinalities of MOVING(p) vs. MUTATING(p) for (a) insertion and (b) deletion.
the sets of mutating objects and moving objects of incremental insertion/deletion, illustrates this effect. Since the number of objects which have to be reorganized is rather high in case of insertion or deletion the runtime speed-up is caused by the strong reduction of range queries (cf. bars “IncInsert RQ” and “IncDelete RQ” in Fig. 5(b)). We separately analyzed the objects o whose insertions/deletions caused the highest runtime. Thereby, we found out that the biggest part of the high runtimes originated from the reorganization step due to a high cardinality of the set M OVING(o). We further observed that these objects causing high update runtimes usually are located between two clusters and objects in M UTATING(o) belong to more than one cluster. Since spatially neighboring clusters need not to be adjacent in the cluster ordering, the reorganization affects a lot more objects. This observation is important because it indicates that the runtimes are more likely near the average case than near the worst case especially for insert operations since most inserted objects will probably reproduce the distribution of the already existing data. Let us note, that since the tests on the TV Dataset were run using unfavourable objects, the performance results are less impressive than the results on the synthetic datasets.
6
Conclusions
In this paper, we proposed an incremental algorithm for mining hierarchical clustering structures based on OPTICS. Due to the density-based notion of OPTICS, insertions and deletions affect only a limited subset of objects directly, i.e. a change of their corelevel may occur. We identified a second set of objects which are indirectly affected by update operations and thus they may move forward or backwards in the cluster ordering. Based on these considerations, efficient algorithms for incremental insertions and deletions of a cluster ordering were suggested. A performance evaluation of IncOPTICS using synthetic as well as real-world databases demonstrated the efficiency of the proposed algorithm.
Fig. 5. (a) Runtimes and (b) affected objects of IncOPTICS vs. OPTICS applied on the TV data.
Comparing these results to the performance of IncrementalDBSCAN which achieves much higher speed-up factors over DBSCAN, it should be mentioned that incremental hierarchical clustering is much more complex than incremental “flat” clustering. In fact, OPTICS generates considerably more information than DBSCAN and thus IncOPTICS is suitable for a much broader range of applications compared to IncrementalDBSCAN.
References
1. MacQueen, J.: "Some Methods for Classification and Analysis of Multivariate Observations". In: 5th Berkeley Symp. Math. Statist. Prob. Volume 1. (1967) 281–297
2. Ng, R., Han, J.: "Efficient and Effective Clustering Methods for Spatial Data Mining". In: Proc. 20th Int. Conf. on Very Large Databases (VLDB'94), Santiago, Chile. (1994) 144–155
3. Zhang, T., Ramakrishnan, R., Livny, M.: "BIRCH: An Efficient Data Clustering Method for Very Large Databases". In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'96), Montreal, Canada. (1996) 103–114
4. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". In: Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD'96), Portland, OR, AAAI Press (1996) 291–316
5. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: "OPTICS: Ordering Points to Identify the Clustering Structure". In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'99), Philadelphia, PA. (1999) 49–60
6. Ester, M., Kriegel, H.P., Sander, J., Wimmer, M., Xu, X.: "Incremental Clustering for Mining in a Data Warehousing Environment". In: Proc. 24th Int. Conf. on Very Large Databases (VLDB'98). (1998) 323–333
7. Feldman, R., Aumann, Y., Amir, A., Mannila, H.: "Efficient Algorithms for Discovering Frequent Sets in Incremental Databases". In: Proc. ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, AZ. (1997) 59–66
8. Ester, M., Wittman, R.: "Incremental Generalization for Mining in a Data Warehousing Environment". In: Proc. 6th Int. Conf. on Extending Database Technology, Valencia, Spain. Volume 1377 of Lecture Notes in Computer Science (LNCS), Springer (1998) 135–152
On Complementarity of Cluster and Outlier Detection Schemes
Zhixiang Chen¹, Ada Wai-Chee Fu², and Jian Tang²
¹ Department of Computer Science, University of Texas-Pan American, Edinburg TX 78539, USA. [email protected]
² Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, N.T., Hong Kong. [email protected], [email protected]
1 Introduction We are interested in the problem of outlier detection, which is the discovery of data that deviate a lot from other data patterns. Hawkins [7] characterizes an outlier in a quite intuitive way as follows: An outlier is an observation that
deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.
Most methods in the early work that detect outliers independently have been developed in the field of statistics [3]. These methods normally assume that the distribution of a data set is known in advance. A large amount of the work was done under the general topic of clustering [6,12,15,8,17]. These algorithms can also generate outliers as by-products. Recently, researchers have proposed distance-based, density-based, and connectivity-based outlier detection schemes [10,11,13,4,16], which distinguish objects that are likely to be outliers from those that are not based on the number of objects in the neighborhood of an object. These schemes do not make assumptions about the data distribution. In this paper, we want to find out whether the indirect clustering approach such as [6] and the direct approach such as [10,4,16] are similar in their effects on outlier detection, and also the cases where they may differ. When a direct approach and an indirect one have similar effects, we say that they are complementary. We consider the comparison of the DBSCAN clustering method and the DB-Outlier, LOF and COF definitions of outliers. These methods are chosen based on their superior power in handling clusters or patterns of different shapes with no a priori distribution assumption. We believe that these methods are better equipped to handle the varieties of outlier natures. Some interesting discoveries are made. First, we show that the DBSCAN and DB-Outlier approaches are almost complementary, and we also show an extension of the DB-Outlier scheme so that it is complementary with DBSCAN. Second, we show that the DBSCAN approach is complementary with density-based and connectivity-based outlier schemes within a density cluster or far away from some clusters. Finally, we show that there are cases where the DBSCAN approach is not complementary with density-based and connectivity-based outlier schemes.
On Complementarity of Cluster and Outlier Detection Schemes
235
2 Density Based Clustering and Outlier Detection Let D be a data set of points in a data space. For any p 2 D and any positive value v, the v-neighborhood of p is Nv (p) = fo : dist(p; o) v & o 2 Dg. For a given threshold n > 0; p is called a core point with respect to v and n (or core point if no confusion arises) if jNv (p)j n. Given v > 0 and n > 0, a point p is directly density-reachable from a point q with respect to v and n, if p 2 Nv (q) and q is a core point. A point p is densityreachable from a point q with respect to v and n, if there is a chain of points p1; p2; : : :; ps, p1 = q, ps = p such that pi+1 is directly density-reachable from pi. A point p is density-connected to a point q with respect to v and n, if there is a point o such that both p and q are density-reachable from o with respect to v and n.
De nition 1 (DBSCAN Clusters and Outliers [6]). Let D be a set of objects. A cluster C with respect to v and n in D is a non-empty subset of D satisfying the following conditions: (1) Maximality: 8p; q 2 D, if p 2 C and q is density-reachable from p with respect to v and n, the also q 2 C . (2) Connectivity: 8p; q 2 C , q is density-connected to p with respect to v and n. Finally, every object not contained in any cluster is an outlier (or a noise).
3 Outlier Detection Schemes
Distance Based Outliers:Knorr and Ng [10] proposed a distance-based scheme, called DB(n; v)-outliers. Let D be a data set. For any p 2 D and any positive values v and n, p is a DB(n; v)-outlier if jNv (p)j n, otherwise it is not. The weakness is that it is not powerful enough to cope with certain scenarios with dierent densities in data clusters [4]. Similar weakness is found in the scheme proposed by Ramaswamy, et al. [13], which is actually a special case of DB(n; v)outlier. Density Based Outliers: Breuning, et al. [4] proposed a density-based outlier detection scheme as follows. Let p; o 2 D. Let k be a positive integer, and k-distance(o) be the distance from o to its k-th nearest neighbor. The reachability distance of p with respect to o for k is reach-diskk (p; o) = maxfk-distance(o); dist(p; o)g: The reachability distance smoothes the uctuation of the distances between p and its \close" neighbors. The local reachability density of p for k is:
P
o2N
k (p) =
p) (p)
k-distance(
k (p; o)
reach-dist
!;
1
: jN p (p)j That is, lrdk (p) is the inverse of the average reachability distance from p to the objects in its k-distance neighborhood. The local outlier factor of p is lrd
k-distance(
k (p) =
LOF
)
P
o2N
p) (p)
k-distance(
jN
p) (p)j
k-distance(
o p
lrdk ( ) lrdk ( )
:
236
Zhixiang Chen et al.
The LOF value measures how strong an object can be an outlier. A threshold on LOF value can be set to de ne outliers. Connectivity Based Outliers: The connectivity based outlier detection scheme was proposed in Tang et al. [16]. This scheme is based on the idea of dierentiating \low density" from \isolativity". While low density normally refers to the fact that the number of points in the \close" neighborhood of an object is (relatively) normalsize, isolativity refers to the degree that an object is \connected" to other objects. As a result, isolation can imply low density, but the other direction is not always true.
De nition 2 Let P; Q D, P \ Q = ; and P; Q 6= ;. We de ne dist(P; Q) = minfdist(x; y) : x 2 P & y 2 Qg, and call dist(P; Q) the distance between P and Q. For any given q 2 Q, we say that q is the nearest neighbor of P in Q if there is a p 2 P such that dist(p; q) = dist(P; Q): De nition 3 Let G = fp ; p ; : : :; pr g be a subset of D. A set-based nearest path, or SBN-path, from p on G is a sequence hp ; p ; : : :; pr i such that for all 1 i r ; 1; pi is a nearest neighbor of set fp ; : : :; pig in fpi ; : : :; pr g. 1
2
1
1
+1
2
1
+1
In the above, if the nearest neighbor is not unique, we can impose a prede ned order among the neighbors to break the tie. Thus an SBN-path is uniquely determined.
De nition 4 Let s = hp ; p ; : : :; pr i be an SBN-path. A set-based nearest trail, or SBN-trail, with respect to s is a sequence he ; : : :; er; i such that for all 1 i r ; 1, ei = (oi ; pi ) where oi 2 fp ; : : :; pig, and dist(ei ) = dist(oi ; pi ) = dist(fp ; : : :; pi g; fpi ; : : :; pr g). 1
2
1
+1
1
1
1
+1
+1
Again, if oi is not uniquely determined, we should break the tie by a prede ned order. Thus the SBN-trail is unique for any SBN-path.
De nition 5 Let G = fp ; p ; : : :; pr g be a subset of D. Let s = hp ; p ; : : :; pr i be an SBN-path from p and e = he ; : : :; er; i be the SBN-trail with respect to s. The average chaining distance from p to G ; fp g, denoted by ac-distG (p ), 1
2
1
1
1
1
is de ned as
G(p1 ) =
ac-dist
2
1
1
1
r; X 2(r ; i) 1
i=1
( ; 1) dist(ei ):
r r
De nition 6 Let p 2 D and k be a positive integer. The connectivity-based outlier factor (COF) at p with respect to its k-neighborhood is de ned as COF
k (p) =
jNk (p)j ac-distNk p (p) : P ac-dist (o) ( )
o2Nk (p)
Nk (o)
A threshold on COF can be set to de ne outliers.
On Complementarity of Cluster and Outlier Detection Schemes
237
4 Complementarity of DB-outlier and DBSCAN When the clustering approach and outlier detection approach both give the same result about a data point (as outlier or non-outlier) we say that they are complementary. Since both techniques typically require some parameter settings, it is of interest to see if there exist some parameter settings for each approach so that the methods are complementary. In this section, we rst show that the DBSCAN clustering scheme and the DB-outlier detection scheme are almost complementary. We then propose an extended DB-outlier detection scheme and show that it is complementary with the DBSCAN clustering scheme.
Theorem 1 If there is a parameter setting for DBSCAN clustering scheme to detect clusters and outliers, then there is a parameter setting for the DB-outlier detection scheme such that the following is true: For any object p 2 D, if DBSCAN identi es p as an outlier then DB-outlier detection scheme also identi es it as an outlier. (Note that this implies that if DB-outlier scheme detection scheme identi es p not to be an outlier, then DBSCAN identi es it not to be an outlier (i.e., inside some cluster).)
Proof. Let a parameter setting for DBSCAN be v and n. For any object p 2 D, if DBSCAN identi es p as an outlier, then we have by de nition that jNv (p)j < n. When we choose the same parameter setting v and n for the DBoutlier detection scheme, it identi es p as an outlier too, because jNv (p)j < n. It is easy to see that objects identi ed by DBSCAN to be inside clusters can be identi ed as outliers by the DB-outlier detection scheme with the same parameter setting. In order to avoid such in-complementarity on border objects, we propose the following extension of the DB-outlier detection scheme:
De nition 7 (EDB-Outliers). Given any object p in a data set D, p is an extended distance-based outlier, denoted as EDB -outlier, with respect to v and n, if jNv (p)j < n and 8q 2 Nv (p), jNv (q)j < n. Following the work in [10], one can easily design an EDB-outlier detection scheme to detect EDB-outliers with respect to the parameter setting of v and n. The following result shows that EDB-outliers and DBSCAN-outliers are complementary.
Theorem 2 If there is a parameter setting for DBSCAN to detect clusters and outliers, then there is a parameter setting for EDB-outlier detection scheme such that the following is true: For any object p 2 D, DBSCAN identi es p as an
outlier if and only if EDB-outlier detection scheme also identi es it as an outlier.
Proof. Let a parameter setting for DBSCAN be v and n. For any object p 2 C , if DBSCAN identi es p as an outlier, then we have by de nition jNv (p)j < n
238
Zhixiang Chen et al.
and p is not density reachable from any core object in D with respect to v and n. The latter property means that 8q 2 Nv (p), q is not a core object, i.e., jNv (q)j < n. Hence, when we choose the same parameter setting of v and n for the EDB-outlier detection scheme, it identi es p as an outlier too. On the other hand, if the EDB-outlier detection scheme identi es p as an outlier with respect to v and n, then we have by de nition jNv (p)j < n and 8q 2 Nv (p), jNv (q)j < n. Now consider that the same parameter setting of v and n is chosen for DBSCAN. Suppose that p is inside a cluster C, then by de nition there is a core object q 2 C such that p is density-reachable from q with respect to v and n. I.e., there is sequence of objects q1 = q; q2; : : :; qs; qs+1 = p, qi 2 C, such that qi+1 is directly density-reachable from qi. In particular, p is directly density-reachable from qs . By de nition, jNv (qs)j n and p 2 Nv (qs). Thus, we have jNv (qs)j n and qs 2 Nv (p), a contradiction to the given fact that p is an EDB-outlier with respect to v and n. The above argument implies that p must not be in any cluster with respect to v and n. Therefore, DBSCAN will identify p as an outlier with respect to v and n.
5 Complementarity of COF and LOF Let D be a data set. For any integer k > 0 and any object p 2 D, we use Nk (p) to denote Nreach-distk (p) (p) for convenience. In the following lemma, C can be viewed as a cluster. We rst consider complementarity inside a density cluster. Lemma 3 Given a subset C of a data set D and an integer k > 0, assume that Nk (p) C and there is a positive value d such that 8p 2 C , d ; reach-distk (p) d + for a very normalsize xed positive value 0 < < d. Assume further that there exists a positive value f > 0 such that for any p; q 2 C , f ; dist(p; q) f + for some positive normalsize value with 0 < < f . Then, 8p 2 C , we have d; LOFk (p) dd +; ; and ff ;+ COFk (p) ff ;+ : (1) d+ Proof. 8p 2 C, since Np C, we have 8q 2 NkP (p), q 2 C, hence d ; reach-distk (p;q) : reach-distk (q) d + . By de nition, lrdk1(p) = P q2Nk p jNk (p)j ( )
lrdk (q)
k p ; we have Hence, d ; lrdk (p) d + . Since LOFk (p) = q2NjNkkp(p)lrd j d; LOFk (p) d+ , this implies the left part of (1). d+ d; Given any p 2 C, let s = fe1; : : :; er;1 g be the SBN-trail with respect to the SBN-path from p on Nk (p). It follows from the de nition and the given conditions that f ; dist(ei ) f + for i = 1; 2; : : :; r ; 1. Hence, by de nition we have r; r; X 2(r ; i) dist(e ) X 2(r ; i) (f + ) = f + : ac-distk (p) = i r (r ; 1) r (r ; 1) i i ( )
1
1
=1
=1
( )
On Complementarity of Cluster and Outlier Detection Schemes
239
Similarly, we have k (p) =
ac-dist
r; X 2(r ; i)
r; X 2(r ; i)
i=1
i=1
1
dist(ei) r (r ; 1)
1
( ; 1) (f ; ) = f ;
r r
:
Thus, k (p) =
COF
PjNk (p)j
k (p) k (o)
ac-dist
o2Nk (p)
ac-dist
ff +; ;
k (p) =
COF
PjNk (p)j
k (p) k (o)
ac-dist
o2Nk (p)
ac-dist
ff ;+
Hence, these together implies the right part of (1).
Theorem 4 Let C be any cluster in a data set D satisfying the conditions in
Lemma 1. We can choose a parameter setting d + and k for the DBSCAN clustering scheme so that all points in C will be identi ed as cluster points, i.e., non-outliers. We can choose a parameter setting of k and dd+; for the LOFoutlier detection scheme so that it identi es all points in C as cluster points. Finally, we can choose a parameter setting of k and ff ;+ for the COF-outlier detection scheme so that it identi es all points in C as cluster points as well.
Proof. For any p 2 C, it follows from the conditions of Lemma 3 that
reach-distk (p) d + . By the de nition of reachability distance, Nk (p) will
have at least k objects. Since Nk (p) Nd+ (p), we have jNd+ (p)j k. Hence, p is a core object with respect to d + and k. It follows from the de nition of DBSCAN clusters that p will be identi ed as an object inside a cluster, hence a non-outlier. From (1) of Lemma 1 we have dd;+ LOFk (p)and ff ;+ COFk (p): This means that when the parameter setting of k and dd+; is chosen for the LOFoutlier detection scheme, (Precisely, k is used to de ne the reachability distance, and dd;+ is a threshold for the LOF values to select outliers.) p will be identi ed as a non-outlier. Similarly, when the parameter setting of k and ff ;+ is chosen for the COF-outlier detection scheme, p will be identi ed a non-outlier as well.
Next we show some cases where LOF and COF are both complementary with DBSCAN in detecting points outside some clusters as outliers. For any two sets of objects A and B, let dist(A; B) = minfdist(x; y) : x 2 A & y 2 B g. Again, for any object p, we let Nm (p) denote the m-reachability neighborhood Nreach-distm (p) of p.
Theorem 5 Given two subsets O and C of a data set D, let d = minfdist(x; y) : x 2 O&y 2 C g, and m = jOj. Assume that 8o 2 O, Nm (o) = O, and N m (o) ; O C . Moreover, 8p 2 C , lrd m (p) d and ac-dist m (p) < 1; and for 8p 2 O, d ac-dist m (p) d. Then, there exist parameter settings for DBSCAN, LOF 2 3
2
2
4
2
2
and COF respectively such that each of the three methods will identify O as outliers.
240
Zhixiang Chen et al.
Proof. First, let r be the diameter of O, i.e., r = maxfdist(x; y) : x 2 O&y 2 Og. Since for any o 2 O, Nm (o) = O, we have reach-distm (o) r and Nm (o) = Nr (o) = O. When the parameter setting of r and m + 1 is chosen for the DBSCAN clustering scheme, then o is not a core object. Since Nr (o) = O, o is not directly reachable from any object outside O with the distance r. Hence, DBSCAN will identi es O as outliers. For any o 2 O, given conditions Nm (o) = O and N2m (o) ; O C it follows that, for any p 2 N2m (o), reach-dist2m (o; p) = maxf2m-distance(p); dist(o; p)g jN m (o)j 1 d. Thus, lrd2m (o) = P reach-dist m (p;o) d : Hence, 2
p2N2m (o)
2
P lrd2m (p) d(P lrd2m (p) + p 2 N p 2 O ;f o g p2C \N2m (o) lrd2m (p)) 2m (o) lrd2m (o) LOF2m (o) = = jN2m (o)j P jN2m (o)j P 4 d( p2O ;fog 1d + ) p2C \N2m (o) d = jO ; fogj + 4jC \ N2m (o)j > 1: jN2m (o)j jN2m (o)j P
Therefore, when the parameter setting of 2m and 1 is chosen for the LOFoutlier detection scheme, each object in O will be identi ed as an outlier. Similarly, For any o 2 O, let y = jN2m (o)j, then y 2m. It follows from the given conditions that COF2m (o) 2d y P 3 ac-dist (p) + 2 m p2O;fog p2N m (o)\C ac-dist2m (p)
jN2m (o)j ac-dist2m (o) P =P ac-dist (p) p2N2m (o)
d
m
2
2
(m ; 1)d +y (y ; m + 1) 1; when d 6: Hence, when the parameter setting of 2m and 1 is chosen for the COF-outlier detection scheme, every object in O will be identi ed as outliers. 2 3
6 Non-Complementarity In this section, we shall show two non-complementarity results of DBSCAN cluster and outlier schemes LOF and COF, which reveal that in general DBSCAN scheme is not complementary with the LOF-outlier detection scheme, nor with the COF-outlier detection scheme. We use two approaches to obtain our results, one is by actual computation and the other is detailed analysis. The computing environment of our computation is a Dell Precision 530 Workstation with dual Xeon 1.5 GHz processors. In order to have more precise results, all arithmetic operations were carried out with 10 decimal digit precision. Example 1. Let us consider a data set D1 consisting of data objects as shown in Fig. 1. A is the set composed of objects on the line patterns, and B is the set composed of all objects in the disk pattern. o refers to the object p outside B. A has 402 objects such that the 1-distance of any object in it isp 2. B has 44 objects such that the 1-distance of any object in it is less than p 2 and the distance between A and B (which are respectively, p and q) is 2. Finally, the
On Complementarity of Cluster and Outlier Detection Schemes
241
A p
LOF
B
o q
(a) Data set Fig. 1.
COF 1.50 1.40 1.30 1.20 1.10 1.00 0.90 0.80
1.80 1.60 1.40 1.20 1.00 0.80
(b) LOF Values
(c) COF Values
LOF and COF Values for Example 1 (k = 3)
p
distance p between o and B is 2, and the distance between o and A is greater than 2. Non-Complementarity Result 1. When the parameter setting of k = 3 and threshold = 1:4 is used, the LOF-outlier detection scheme identi es o as an outliers, and all objects in A or B as non-outliers. When the parameter setting of k = 3 and threshold = 1:13 is used, the COF-outlier detection scheme identi es o as an outliers, and all objects in A or B as non-outliers. Finally, for any parameter setting of v and n, the DBSCAN scheme will not be able to identify o as an outlier and objects in A or B as non-outliers. The rst two results are obtained through actual computation. We have implemented the LOF outlier detection algorithm in [4] and the COF outlier detection algorithm in [16]. We used k=3 to compute LOF values and the COF values. The results are shown in Figure 1. We found that the LOF value of o is 1.8146844215 and all other objects have LOF values the same or almost the same as 1, except several objects have LOF values larger than 1 and less than 1.4. Hence, the threshold 1.4 enables the LOF-outlier detection scheme to distinguish the outlier o from all the non-outliers in A or B. We also found that the COF value of o is 1.2057529817, and all the other objects have COF values the same or almost the same as 1, except several objects have COF values larger than 1 but less than 1.13. Hence, the threshold value 1:13 enables the COF-outlier detection scheme to distinguish the outlier o from all the other non-outliers. p Recall thatp the distance between any two adjacent objects in A is 2 and p dist(A; B) = 2. If the radius parameter v is less than 2, then the DBSCAN scheme will identify all objects in A as outliers. In order to identify p objects in A as non-outliers, the radius parameter v must be greater than 2. Note that p the distance between o and B is 2 and the 1-distance of any object in B is p p less than 2. Let w 2 B such that dist(o; w) = 3. For any radius parameter p v > 2, o is directly reachable from w with respect to v. Hence, if DBSCAN scheme identi es w as an non-outliers then it also identi es o as an non-outlier. In summary, for any parameter setting of v and n, the DBSCAN scheme cannot
242
Zhixiang Chen et al.
p
distinguish p o from objects in A if v 2; it cannot distinguish o from objects in B if v > 2. Therefore, no parameter setting of v and n enables the DBSCAN scheme to identi es outliers and clusters for objects in D. Outlier q=(-88,0) Non-outlier p=(67,45) Outlier o=(26,0) Non-outlier w=(22,0)
14.00
l 1 ... l2
12.00
p 10.00
D LOF
C q
w
o
8.00
6.00
4.00
2.00
l4 l3 ...
(a) Data Set of Example 2 Fig. 2.
20
40
60
80
100
120
140
160
180
k
(b) LOF Values of Four Objects
Illustrations of Example 2
Example 2. Let us consider a data set D2 as shown in Fig. 2(a). D2 has 192 objects. Set C composed of eight objects on the border p of a diamond pattern such that any two adjacent objects have a distance 2. Set D composed of 182 objects on four lines l1 ; l2; l3 and l3 . l1 and l2 lie 45 degrees above the horizontal line, and l3 and l4 lie 45 degrees below the horizontal line. l2 . l1 and l3 meet at (20; 0), and l2 and l4 meet at wp= (22; 0). Any two adjacent objects on any of the four lines have a distance of 2. Two additional objects lie below the l2 , the rst is (22; 4) and the second is o = (26; 0). Let q = (;88; 0) and p = (67; 45) as shown in the gure. According to Hawkin's de nition, is obvious that q and o are outliers, but p and w are not. p Non-Complementarity Result 2. When a parameter setting of v = 2 and n = 4 is chosen, the DBSCAN scheme identi es objects in C (including q) and o as outliers and all other objects as non-outliers. However, for any parameter setting of k and threshold, the LOF-outlier detection scheme cannot identify
outliers and non-outliers correctly.
Because pobjects in C lie at the border of a diamond shape and an equal distance p of 2 separates any two adjacent objects in the diamond shape in C, the 2-neighborhood of any object inpC has exactly 3 objects. Thus, every object in C is an outlier p with respect to v = 2 and n = 4. It follows from the condition of D2 , the 2-neighborhood of any object in D is exactly 4 except the six end objects (20; 0); w; q; (66;46); (66; ;46) and (67; ;45). But, those six objects are reachable from some other points on the lines with respect to v, and so does the object (22; 4). Hence, p the DBSCAN identi ed all objects in D and (22; 4) as nonoutliers. Since 2-neighborhood of o has exactly 3 objects and o is not reachable from any core object with respect to v and pct (o is reachable from the non-core object (22; 4)), the DBSCAN p identi es o as an outlier. In summary, with the parameter setting of v = 2 and pct = 4, the DBSCAN identi es outliers and clusters in D2 correctly.
On Complementarity of Cluster and Outlier Detection Schemes
243
In order to obtain the result for LOF-outlier detection scheme, we ran the LOF outlier detection algorithm for k = 1; 2; 3; : ::; 191 (191 = jD2j ; 1) to compute LOF values for the four objects p; q; o and w. The LOF values are shown in Fig. 2(b). We found the following results: for 1 k 7, LOFk (q) LOFk (p); for 8 k 182, LOFk (o) LOFk (p); and for 183 k 191, LOFk (o) LOFk (w). This implies that any given parameter threshold cannot separate both q and o from p and w. Hence, for any parameter setting of k and threshold, the LOF-outlier detection scheme cannot identify outliers and nonoutliers in D2 correctly.
References 1. M. Ankerst, M. Breunig, H.P. Kriegel, and J. Sander: \OPTICS: Ordering points to identify the cluster structure", Proc. of ACM-SIGMOD Conf., pp. 49-60, 1999. 2. A. Arning, R. Agrawal, P. Raghavan: "A Linear Method for Deviation detection in Large Databases", Proc. of 2nd Intl. Conf. On Knowledge Discovery and Data Mining, 1996, pp 164 - 169. 3. V. Barnett, T. Lewis: "Outliers in Statistical Data", John Wiley, 1994. 4. M. Breuning, Hans-Peter Kriegel, R. Ng, J. Sander: "LOF: Identifying densitybased Local Outliers", Proc. of the ACM SIGMOD Conf., 2000. 5. W. DuMouchel, M. Schonlau: "A Fast Computer Intrusion Detection Algorithm based on Hypothesis Testing of Command Transition Probabilities", Proc.of 4th Intl. Conf. On Knowledge Discovery and Data Mining, 1998, pp. 189 - 193. 6. M. Ester, H. Kriegel, J. Sander, X. Xu: "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. of 2nd Intl. Conf. On Knowledge Discovery and Data Mining, 1996, pp 226 - 231. 7. T. Fawcett, F. Provost: "Adaptive Fraud Detection", Data Mining and Knowledge Discovery Journal, Kluwer Academic Publishers, Vol. 1, No. 3, 1997, pp 291 - 316. 8. S. Guha, R. Rastogi, K. Shim: "Cure: An Ecient Clustering Algorithm for Large Databases", In Proc. of the ACM SIGMOD Conf., 1998, pp 73 - 84. 9. D. Hawkins: "Identi cation of Outliers", Chapman and Hall, London, 1980. 10. E. Knorr, R. Ng: "Algorithms for Mining Distance-based Outliers in Large Datasets", Proc. of 24th Intl. Conf. On VLDB, 1998, pp 392 - 403. 11. E. Knorr, R. Ng: "Finding Intensional Knowledge of Distance-based Outliers", Proc. of 25th Intl. Conf. On VLDB, 1999, pp 211 - 222. 12. R. Ng, J. Han: "Ecient and Eective Clustering Methods for Spatial Data Mining", Proc. of 20th Intl. Conf. On Very Large Data Bases, 1994, pp 144 - 155. 13. S. Ramaswamy, R. Rastogi, S. Kyuseok: "Ecient Algorithms for Mining Outliers from Large Data Sets", Proc. of ACM SIGMOD Conf., 2000, pp 427 - 438. 14. N. Roussopoulos, S. Kelley, F. Vincent, "Nearest Neighbor Queries", Proc. of ACM SIGMOD Conf., 1995, pp 71 - 79. 15. G. Sheikholeslami, S. Chatterjee, A. Zhang: "WaveCluster: A multi-Resolution Clustering Approach for Very Large Spatial Databases", Proc. of 24th Intl. Conf. On Very Large Data Bases, 1998, pp 428 - 439. 16. J. Tang, Z. Chen, A. Fu, D. Cheung, \A Robust Outlier Detection Scheme in Large Data Sets", PAKDD, 2002. 17. T. Zhang, R. Ramakrishnan, M. Linvy: "BIRCH: An Ecient Data Clustering Method for Very Large Databases", Proc. of ACM SIGMOD Intl. Conf., , 1996, pp 103 - 114.
Cluster Validity Using Support Vector Machines Vladimir Estivill-Castro1 and Jianhua Yang2 1
2
Griffith University, Brisbane QLD 4111, Australia The University of Western Sydney, Campbelltown, NSW 2560, Australia
Abstract. Gaining confidence that a clustering algorithm has produced meaningful results and not an accident of its usually heuristic optimization is central to data analysis. This is the issue of validity and we propose here a method by which Support Vector Machines are used to evaluate the separation in the clustering results. However, we not only obtain a method to compare clustering results from different algorithms or different runs of the same algorithm, but we can also filter noise and outliers. Thus, for a fixed data set we can identify what is the most robust and potentially meaningful clustering result. A set of experiments illustrates the steps of our approach.
1
Introduction
Clustering is challenging because normally there is no a priori information about structure in the data or about potential parameters, like the number of clusters. Thus, assumptions make possible to select a model to fit to the data. For instance, k-Means fits mixture models of normals with covariance matrices set to the identity matrix. k-Means is widely applied because of its speed; but, because of its simplicity, it is statistically biased and statistically inconsistent, and thus it may produce poor (invalid) results. Hence, clustering depends significantly on the data and the way the algorithm represents (models) structure for the data [8]. The purpose of clustering validity is to increase the confidence about groups proposed by a clustering algorithm. The validity of results is up-most importance, since patterns in data will be far from useful if they were invalid [7]. Validity is a certain amount of confidence that the clusters found are actually somehow significant [6]. That is, the hypothetical structure postulated as the result of a clustering algorithm must be tested to gain confidence that it actually exists in the data. A fundamental way is to measure how “natural” are the resulting clusters. Here, formalizing how “natural” a partition is, implies fitting metrics between the clusters and the data structure [8]. Compactness and separation are two main criteria proposed for comparing clustering schemes [17]. Compactness means the members of each cluster should be as close to each other as possible. Separation means the clusters themselves should be widely spaced. Novelty detection and concepts of maximizing margins based on Support Vector Machines (SVMs) and related kernel methods make them favorable for verifying that there is a separation (a margin) between the clusters of an algorithm’s output. In this sense, we propose to use SVMs for validating data models, Y. Kambayashi, M. Mohania, W. W¨ oß (Eds.): DaWaK 2003, LNCS 2737, pp. 244–256, 2003. c Springer-Verlag Berlin Heidelberg 2003
Cluster Validity Using Support Vector Machines
245
and confirm that the structure revealed in clustering results is indeed of some significance. We propose that an analysis of magnitude of margins and (relative) number of Support Vectors increases the confidence that a clustering output does separate clusters and creates meaningful groups. The confirmation of separation in the results can be gradually realized by controlling training parameters. At a minimum, our approach is able to discriminate between two outputs of two clustering algorithms and identify the more significant one. Section 2 presents relevant aspects of Support Vector Machines for our clustering validity approach. Section 3 presents our techniques. Section 4 presents experimental results. We then conclude with a discussion of related work.
2
Support Vector Machines
Our cluster validity method measures margins and analyzes the number of Support Vectors. Thus, a summary of Support Vector Machines (SVMs) is necessary. The foundations of SVMs were developed by Vapnik [16] and are gaining popularity due to promising empirical performance [9]. The approach is systematic, reproducible, and motivated by statistical learning theory. The training formulation embodies optimization of a convex cost function, thus all local minima are global minimum in the learning process [1]. SVMs can provide good generalization performance on data mining tasks without incorporating problem domain knowledge. SVM have been successfully extended from basic classification tasks to handle regression, operator inversion, density estimation, novelty detection, clustering and to include other desirable properties, such as invariance under symmetries and robustness in the presence of noise [15, 1, 16]. In addition to their accuracy, a key characteristic of SVMs is their mathematical tractability and geometric interpretation. Consider the supervised problem of finding a separator for a set of training samples {(xi , yi )}li=1 belonging to two classes, where xi is the input vector for the ith example and yi is the target output. We assume that for the positive subset yi = +1 while for the negative subset yi = −1. If positive and negative examples are “linearly separable”, the convex hulls of positive and negative examples are disjoint. Those closest pair of points in respective convex hulls lie on the hyper-planes wT x + b = ±1. The separation between the hyper-plane and the closest data point is called the margin of separation and is denoted by γ. The goal of SVMs is to choose the hyper-plane whose parameters w and b maximize γ = 1/w; essentially a quadratic minimization problem (minimize w). Under these conditions, the decision surface w T x + b is referred to as the optimal hyper-plane. The particular data points (xi , yi ) that satisfy yi [w t xi +b] = 1 are called Support Vectors, hence the name “Support Vector Machines”. In conceptual terms, the Support Vectors are those data points that lie closest to the decision surface and are the most difficult to classify. As such, they directly influence the location of the decision surface [10].
246
Vladimir Estivill-Castro and Jianhua Yang
If the two classes are nonlinearly separable, the variants called φ-machines map the input space S = {x1 , . . . , xl } into a high-dimensional feature space F = {φ(x)|i = 1, . . . , l}. By choosing an adequate mapping φ, the input samples become linearly or mostly linearly separable in feature space. SVMs are capable of providing good generalization for high dimensional training data, since the complexity of optimal hyper-plane can be carefully controlled independently of the number of dimensions [5]. SVMs can deal with arbitrary boundaries in data space, and are not limited to linear discriminants. For our cluster validity, we make use of the features of ν-Support Vector Machine (ν-SVM). The ν-SVM is a new class of SVMs that has the advantage of using a parameter ν on effectively controlling the number of Support Vectors [14, 18, 4]. Again consider training vectors xi ∈ d , i = 1, · · · , l labeled in two classes by a label vector y ∈ l such that yi ∈ {1, −1}. As a primal problem for ν-Support Vector Classification (ν-SVC), we consider the following minimization: Minimize 12 w2 − νρ + 1l li=1 ξi subject to yi (wT φ(xi ) + b) ≥ ρ − ξi , ξi ≥ 0, i = 1, · · · , l, ρ ≥ 0,
(1)
where 1. Training vectors xi are mapped into a higher dimensional feature space through the function φ, and 2. Non-negative slack variables ξi for soft margin control are penalized in the objective function. The parameter ρ is such that when ξT = (ξ1 , · · · , ξl ) = 0, the margin of separation is γ = ρ/w. The parameter ν ∈ [0, 1] has been shown to be an upper bound of the fraction of margin errors and a lower bound of the fraction of Support Vectors [18, 4]. In practice, the above prime problem is usually solved through its dual by introducing Lagrangian multipliers and incorporating kernels: Minimize 12 αT (Q + yy T )α subject to 0 ≤ αi ≤ 1/l, i = 1, · · · , l (2) eT α ≥ ν where Q is a positive semidefinite matrix, Qij ≡ yi yj k(xi , xj ), and k(xi , xj ) = φ(xi )T · φ(xj ) is a kernel, e is a vector of all ones. The context for solving this dual problem is presented in [18, 4], some conclusions are useful for our cluster validity approach. Proposition 1. Suppose ν-SVC leads to ρ > 0, then regular C-SVC with parameter C set a priori to 1/ρ, leads to the same decision function. Lemma 1. Optimization problem (2) is feasible if and only if ν ≤ νmax , where νmax = 2 min(#yi = 1, #yi = −1)/l, and (#yi = 1), (#yi = −1) denote the number of elements in the first and second classes respectively. Corollary 1. If Q is positive definite, then the training data are separable.
Cluster Validity Using Support Vector Machines
247
Thus, we note that νl is a lower bound of the number of Support Vectors(SVs) and an upper bound of the number of misclassified training data. These misclassified data are treated as outliers and called Bounded Support Vectors(BSVs). The larger we select ν, the more points are allowed to lie inside the margin; if ν is smaller, the total number of Support Vectors decreases accordingly. Proposition 1 describes the relation between standard C-SVC and ν-SVC, and an interesting interpretation of the regularization parameter C. The increase of C in C-SVC is like the decrease of ν in ν-SVC. Lemma 1 shows that the size of νmax depends on how balanced the training set is. If the numbers of positive and negative examples match, then νmax = 1. Corollary 1 helps us verify whether a training problem under extent kernels is separable. We do not assume the original cluster results are separable, but, it is favorable to use balls to describe the data in feature space by choosing RBF kernels. If the RBF kernel is used, Q is positive definite [4]. Also, RBF kernels yield an appropriate tight contour representations of a cluster [15]. Again, we can try to put most of the data into a small ball to maximize the classification problem, and the bound of the probability of points falling outside the ball can be controlled by the parameter ν. For a kernel k(x, x ) that only depends on x − x , k(x, x) is constant, so the linear term in the dual target function is constant. This simplifies computation. So in our cluster validity approach,
2
we will use the Gaussian kernels kq (x, x ) = eqx−x with width parameter −1 q = 2σ 2 (note q < 0). In this situation, the number of Support Vectors depends on both ν and q. When q’s magnitude increases, boundaries become rough (the derivative oscillates more), since a large fraction of the data turns into SVs, especially those potential outliers that are broken off from core data points in the form of SVs. But no outliers will be allowed, if ν = 0. By increasing ν, more SVs will be turned into outliers or BSVs. Parameters ν and p will be used alternatively in the following sections.
3
Cluster Validity Using SVMs
We apply SVMs to the output of clustering algorithms, and show they learn the structure inherent in clustering results. By checking the complexity of boundaries, we are able to verify if there are significant “valleys” between data clusters and how outliers are distributed. All these are readily computable from the data in an supervised manner through SVMs training. Our approach is based on three properties of clustering results. First, good clustering results should separate clusters well; thus in good clustering results we should find separation (relative large margins between clusters). Second, there should be high density concentration in the core of the cluster (what has been named compactness). Third, removing a few points in the core shall not affect their shape. However, points in cluster boundaries are in sparse region and perturbing them does change the shape of boundaries.
248
Vladimir Estivill-Castro and Jianhua Yang
To verify separation pairwise, we learn the margin γ from SVMs training; then we choose the top ranked SVs (we propose 5) from a pair of clusters and their k (also 5) nearest neighbors. We measure the average distance of these SVs from their projected neighbors from each cluster (projected along the normal of the optimal hyper-plane). We let these average be γ1 for the first cluster in a pair and we denote it as γ2 for the second cluster. We compare γ with γ i . Given scalars t1 and t2 , the relations between local measures and margin is evaluated by analyzing if any of the following conditions holds: Condition 1: γ1 < t1 · γ or γ2 < t1 · γ; Condition 2: γ1 > t2 · γ or γ2 > t2 · γ. (3) If either of them holds for carefully selected control parameters t1 and t2 , the clusters are separable; otherwise they are not separable (we recommend t1 = 0.5 and t2 = 2). This separation test can discriminate between two results of a clustering algorithm. That is, when facing two results, maybe because the algorithm is randomized or because two clustering methods are applied, we increase the confidence (and thus the preference to believe one is more valid than the other) by selecting the clustering result that shows less pairs of non-separable classes. To verify the compactness of each cluster, we control the number of SVs and BSVs. As mentioned before, the parameter q of the Gaussian kernel determines the scale at which the data is probed, and as its magnitude increases, more SVs result - especially potential outliers tend to appear isolated as BSVs. However to allow for BSVs, the parameter ν should be greater than 0. This parameter enables analyzing points that are hard to assign to a class because they are away from high density areas. We refer to these as noise or outliers, and they will usually host BSVs. As shown by the theorems cited above, controlling q and ν provides us a mechanism for verifying compactness of clusters. We verify robustness by checking the stability of cluster assignment. After removing a fraction of BSVs, if reclustering results in repeatable assignments, we conclude that the cores of classes exist and outliers have been detected. We test the confidence of the result in applying an arbitrary clustering algorithm A to a data set as follows. If the clustering result is repeatable (compact and robust to our removal of BSVs and their nearest neighbors) and separable (in the sense of having a margin a faction larger than the average distance between SVs), this maximizes our confidence that the data does reflect this clustering and is not an artifact of the clustering algorithm. We say the clustering result has a maximum sense of validity. On the other hand, if reclustering results are not quite repeatable but well separable, or repeatable but not quite separable, we still call the current run a valid run. Our approach may still find valid clusters. However, if reclustering shows output that is neither separable nor repeatable, we call the current run an invalid run. In this case, the BSVs removed in the last run may not be outliers, and they should be recovered for a reclustering. We discriminate runs further by repeating the above validity test, for several rounds. If consecutive clustering results converge to a stable assignment (i.e. the result from each run is repeatable and separable), we claim that potential outliers have been removed, and cores of clusters have emerged. If repetition of
Cluster Validity Using Support Vector Machines
249
the analysis still produces invalid runs, (clustering solutions differ across runs without good separation) the clustering results are not interesting. In order to set the parameters of our method we conducted a series of experiments we summarize here 1 . We determined parameters for separation and compactness checking first. The data sets used were in different shapes to ensure generality. The LibSVM [3] SVMs library has been used in our implementation of our cluster validity scheme. The first evaluation of separation accurately measured the margin between two clusters. To ensure the lower error bound, we use a hard margin training strategy by setting ν = 0.01 and q = 0.001. This allows for few BSVs. In this evaluation, six data sets each with 972 points uniformly and randomly generated in 2 boxes were used. The margin between the boxes is decreasing across data sets. To verify the separation of a pair of clusters, we calculated the values of γ1 and γ2 . Our process compared them with the margin γ and inspected the difference. The experiment showed that the larger the discrepancies between γ1 and γ (or γ2 and γ), the more separable the clusters are. In general, if γ1 < 0.5γ or γ2 < 0.5γ, the two clusters are separable. Thus, the choice of value for t1 . Secondly, we analyzed other possible cases of the separation test. This included (a) both γ1 and γ2 much larger than γ; (b) a small difference between γ1 and γ, but the difference between γ2 and γ is significant (c) significant difference between γ1 and γ, although there is no much difference between γ2 and γ. Again, we set t1 = 0.5 and t2 = 2 for this test. Then, according to the verification rules of separation (in Equation (3)), all of these examples were declared separable coinciding with our expectation. Third, we tested noisy situation and non-convex clusters. Occasionally clustering results might not accurately describe the groups in the data or are hard to interpret because noise is present and outliers may mask data models. When these potential outliers are tested and removed, the cores of clusters appear. We performed a test that showed that, in the presence of noise, our approach works as a filter and the structure or model fit to the data becomes clearer. A ringshaped cluster with 558 points surrounded by noise and another spherical cluster were in the dataset. A ν-SVC trained with ν = 0.1 and q = 0.001 results in 51 BSVs. After filtering these BSVs (outliers are more likely to become BSVs), our method showed a clear data model that has two significantly isolated dense clusters. Moreover, if a ν-SVC is trained again with ν = 0.05 and q = 0.001 on the clearer model, fewer BSVs (17) are generated (see Fig. 1)3 . As we discussed, the existence of outliers complicates clustering results. These may be valid, but separation and compactness are also distorted. The repeated performance of a clustering algorithm depends on the previous clustering results. If these results have recognized compact clusters with cores, then they become robust to our removal of BSVs. There are two cases. In the first case, the last two consecutive runs of algorithm A (separated by an application of BSVs removal) are consistent. That is, the clustering results are repeatable. The alternative 1
The reader can obtain an extended version of this submission with large figures in www.cit.gu.edu.au/˜s2130677/publications.html
250
Vladimir Estivill-Castro and Jianhua Yang
(a)
(b)
(c)
Fig. 1. Illustration of outlier checking. Circled points are SVs
(a) Clustering structure C1
(b) SVs in circles
(c) Clustering structure C2
Fig. 2. For an initial clustering (produced by k-Means) that gives non-compact classes, reclustering results are not repeated when outliers are removed. 2(a) Results of the original first run. 2(b) Test for outliers. 2(c) Reclustering results; R = 0.5077, J = 0.3924, F M = 0.5637 case is that reclustering with A after BSVs removal is not concordant with the previous result. Our check for repeated performance of clustering results verifies this. We experimented with 1000 points drawn from a mixture data model3 and training parameters for ν-SVC are set to ν = 0.05 and q = 0.005, we showed that the reclustering results can become repeatable leading to valid results (see Figs. 3(a), 3(c) and 3(d))3 . However we also showed cases, where an initial invalid clustering does not lead to repeatable results (see Figs. 2(a), 2(b) and 2(c))3 . To measure the degree of repeated performance between clustering results of two different runs, we adopt indexes of external criteria used in cluster validity. External criteria are usually used for comparing a clustering structures C with a predetermined partition P for a given data set X. Instead of referring to a predetermined partition P of X, we measure the degree of agreement between two consecutively produced clustering structures C1 and C2 . The indexes we use are the rand statistic R, the Jaccard coefficient J and the Fowlkes-Mallows index F M [12]. The values of these three statistics are between 0 and 1. The larger their value, the higher degree to which C1 matches C2 .
Cluster Validity Using Support Vector Machines
4
251
Experimental Results
First, we use a 2D dataset for a detailed illustration of our cluster validity testing using SVMs (Fig. 3). The 2D data set is from a mixture model and consists of 1000 points. The k -medoids algorithm assigns two clusters. The validity process will be conducted in several rounds. Each round consists of reclustering and our SVMs analysis (compactness checking, separation verification, and outliers splitting and filtering). The process stops when a clear clustering structure appears (this is identified because it is separable and repeatable), or after several rounds (we recommend six). Several runs that do not suggest a valid result indicate the clustering method is not finding reasonable clusters in the data. For the separation test in this example, we train ν-SVC with parameters ν = 0.01 and q = 0.0005. To filter potential outliers, we conduct ν-SVC with ν = 0.05 but different q in every round. The first round starts with q = 0.005, and q will be doubled in each following round. Fig. 3(b) and Fig. 3(c)3 show separation test and compactness evaluation respectively corresponding to the first round. We observed that the cluster results are separable. Fig. 3(b) indicates γ1 > 2γ and γ2 > 2γ. Fig. 3(c) shows the SVs generated, where 39 BSVs will be filtered as potential outliers. We perform reclustering after filtering outliers, and match the current cluster structure to previous clustering clustering structure. The values of indexes R = 1 (J = 1 and F M = 1) indicate compactness. Similarly, the second round up to the fourth round also show repeatable and separable clustering structure. We conclude that the original cluster results can be considered valid. We now show our cluster validity testing using SVMs on a 3D data set (see Fig. 4)3 . The data set is from a mixture model and consists of 2000 points. The algorithm k-Means assigns three clusters. The validity process is similar to that in 2D example. After five rounds of reclustering and SVMs analysis, the validity process stops, and a clear clustering structure appears. For the separation test in this example, we train ν-SVC with parameters ν = 0.01 and q = 0.0005. To filter potential outliers, we conduct ν-SVC with ν = 0.05 but different q in every round. The first round starts with q = 0.005, and q will be doubled in each following round. In the figure, we show the effect of a round with a 3D view of the data followed by the separation test and the compactness verification. To give a 3D view effect, we construct convex hulls of clusters. For the separation and the compactness checking, we use projections along z axis. Because of pairwise analysis, we denote by γi,j the margin between clusters i and j, while γ i(i,j) is the neighborhood dispersion measure of SVs in cluster i with respect to the pair of clusters i and j. Fig. 4(a) illustrates a 3D view of original clustering result. Fig. 4(b) and Fig. 4(c)3 show separation test and compactness evaluation respectively corresponding to the first round. Fig. 4(b) indicates γ 1(1,2) /γ1,2 = 6.8, γ 1(1,3) /γ1,3 = 11.2 and γ 2(2,3) /γ2,3 = 21.2. Thus, we conclude that the cluster results are separable in the first run. Fig. 4(c) shows the SVs generated, where 63 BSVs will be filtered as potential outliers. We perform reclustering after filtering outliers, and match the current cluster structure to previous clustering structure. Index values
252
Vladimir Estivill-Castro and Jianhua Yang
(a) Original clustering structure C1
(d) Structure C2 from reclustering
(h) BSVs=41, R=J=FM=1.
(b) γ = 0.019004 γ1 = 0.038670 γ2 = 0.055341
(c) SVs in circles, BSVs=39, R=J=FM=1.
(e) γ = 0.062401 γ1 = 0.002313 γ2 = 0.003085
(f) BSVs=39, R=J=FM=1.
(g) γ = 0.070210 γ1 = 0.002349 γ2 = 0.002081
(i) γ = 0.071086 γ1 = 0.005766 γ2 = 0.004546
(j) BSVs=41, R=J=FM=1.
(k) γ = 0.071159 γ1 = 0.002585 γ2 = 0.003663
Fig. 3. A 2D example of cluster validity through SMVs approach. Circled points are SVs. Original first run results in compact classes 3(a). 3(c) Test for outliers. 3(d) Reclustering results; R = 1.0, J = 1.0, F M = 1.0. 3(b) and 3(c) separation check and compactness verification of the first round. 3(e) and 3(f) separation check and compactness verification of the second round. 3(g) and 3(h) separation check and compactness verification of the third round. 3(i) and 3(j) separation check and compactness verification of the fourth round. 3(i) clearly separable and repeatable clustering structure
Cluster Validity Using Support Vector Machines
253
R = 1 indicate the compactness of the result in previous run. Similarly, the second round up to the fifth round also show repeatable and separable clustering structure. Thus the original cluster results can be considered valid.
5
Related Work and Discussion
Various methods have been proposed for clustering validity. The most common approaches are formal indexes of cohesion or separation (and their distribution with respect to a null hypothesis). In [11, 17], a clear and comprehensive description of these statistical tools is available. These tools have been designed to carry out hypothesis testing to increase the confidence that the results of clustering algorithms are actual structure in the data (structure understood as discrepancy from the null hypothesis). However, even these mathematically defined indexes face many difficulties. In almost all practical settings, this statistic-based methodology for validity faces challenging computation of the probability density function of indexes that complicates the hypothesis testing approach around the null hypothesis [17]. Bezdek [2] realized that it seemed impossible to formulate a theoretical null hypothesis used to substantiate or repudiate the validity of algorithmically suggested clusters. The information contained in data models can also be captured using concepts from information theory [8]. In specialized cases, like conceptual schema clustering, formal validation has been used for suggesting and verifying certain properties [19]. While formal validity guarantees the consistency of clustering operations in some special cases like information system modeling, it is not a general-purpose method. On the other hand, if the use of more sophisticated mathematics requires more specific assumptions about the model, and if these assumptions are not satisfied by the application, performance of such validity test could degrade beyond usefulness. In addition to theoretical indexes, empirical evaluation methods [13] are also used in some cases where sample datasets with similar known patterns are available. The major drawback of empirical evaluation is the lack of benchmarks and unified methodology. In addition, in practice it is sometimes not so simple to get reliable and accurate ground truth. External validity [17] is common practice amongst researchers. But it is hard to contrast algorithms whose results are produced in different data sets from different applications. The nature of clustering is exploratory, rather than confirmatory. The task of data mining is that we are to find novel patterns. Intuitively, if clusters are isolated from each other and each cluster is compact, the clustering results are somehow natural. Cluster validity is a certain amount of confidence that the cluster structure found is significant. In this paper, we have applied Support Vector Machines and related kernel methods to cluster validity. SVMs training based on clustering results can obtain insight into the structure inherent in data. By analyzing the complexity of boundaries through support information, we can verify separation performance and potential outliers. After several rounds of reclustering and outlier filtering, we will confirm clearer clustering structures
254
Vladimir Estivill-Castro and Jianhua Yang
(a) Original clustering result
(b)
(c) SV s = 184 BSV s = 63
(d) Reclustering R = 1
(e) γ 1(1,2) /γ1,2 = 0.47 γ 1(1,3) /γ1,3 = 0.25 γ 2(2,3) /γ2,3 = 0.17
(f)
(g)
(h)
SVs=155 BSV s = 57
Reclustering R = 1
γ 1(1,2) /γ1,2 = 0.12 γ 1(1,3) /γ1,3 = 0.02 γ 2(2,3) /γ2,3 = 0.01
(i)
(j)
(k)
SV s = 125 BSV s = 44
Reclustering R = 1
γ 1(1,2) /γ1,2 = 0.06 γ 1(1,3) /γ1,3 = 0.09 γ 2(2,3) /γ2,3 = 0.31
(l) SV s = 105 BSV s = 36
(m) Reclustering R = 1
(n)
(o) SV s = 98 BSV s = 26
γ 1(1,2) /γ1,2 = 6.8 γ 1(1,3) /γ1,3 = 11.2 γ 2(2,3) /γ2,3 = 21.2
γ 1(1,2) /γ1,2 = 0.02 γ 1(1,3) /γ1,3 = 0.08 γ 2(2,3) /γ2,3 = 0.18
(p) Reclustering R = 1
Fig. 4. 3D example of cluster validity through SMVs. SVs as circled points. 4(a) 3D view of the original clustering result. 4(b), 4(c) and 4(d) is 1st run. 4(e), 4(f) and 4(g) is 2nd run. 4(h), 4(i) and 4(j) is 3rd run. 4(k), 4(l) and 4(m) is 4th run. 4(n), 4(o) and 4(p) is 5th run arriving at clearly separable and repeatable clustering structure. Separation tests in 4(b), 4(e), 4(h), 4(k) and 4(n). Compactness verification in 4(c), 4(f), 4(i), 4(l) and 4(o). 3D view of reclustering result in 4(d), 4(g), 4(j) and 4(m)
Cluster Validity Using Support Vector Machines
255
when we observe they are repeatable and compact. Counting the number of valid runs and match results from different rounds in our process contributes to verifying the goodness of clustering result. This provides us a novel mechanism for cluster evaluation. Our approach provides a novel mechanism to address cluster validity problems for more elaborate analysis. This is required by a number of clustering applications. The intuitive interpretability of support information and boundary complexity makes it easy to operate practical cluster validity.
References [1] K. P. Bennett and C. Campbell. Support vector machines: Hype or hallelujah. SIGKDD Explorations, 2(2):1–13, 2000. 245 [2] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, NY, 1981. 253 [3] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001. 249 [4] C. C. Chang and C. J. Lin. Training ν-support vector classifiers: Theory and algorithms. Neural Computation, 13(9):2119–2147, 2001. 246, 247 [5] V. Cherkassky and F. Muller. Learning from Data — Concept, Theory and Methods. Wiley, NY, USA, 1998. 246 [6] R. C. Dubes. Cluster analysis and related issues. C. H. Chen, L. F. Pau, and P. S. P. Wang, eds., Handbook of Pattern Recognition and Computer Vision, 3–32, NJ, 1993. World Scientific. Chapter 1.1. 244 [7] V. Estivill-Castro. Why so many clustering algorithms - a position paper. SIGKDD Explorations, 4(1):65–75, June 2002. 244 [8] E. Gokcay and J. Principe. A new clustering evaluation function using Renyi’s information potential. R. O. Wells, J. Tian, R. G. Baraniuk, D. M. Tan, and H. R. Wu, eds., Proc. of IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP 2000), 3490–3493, Istanbul, 2000. 244, 253 [9] S. Gunn. Support vector machines for classification and regression. Tech. Report ISIS-1-98, Univ. of Southampton, Dept. of Electronics and Computer Science, 1998. 245 [10] S. S. Haykin. Neural networks: a comprehensive foundation. PrenticeHall, NJ, 1999. 245 [11] A. K. Jain & R. C. Dubes. Algorithms for Clustering Data. PrenticeHall, NJ, 1998. 253 [12] R. Koschke and T. Eisenbarth. A framework for experimental evaluation of clustering techniques. Proc. Int. Workshop on Program Comprehension, 2000. 250 [13] A. Rauber, J. Paralic, and E. Pampalk. Empirical evaluation of clustering algorithms. M. Malekovic and A. Lorencic, eds., 11th Int. Conf. Information and Intelligent Systems (IIS’2000), Varazdin, Croatia, Sep. 20 - 22 2000. Univ. of Zagreb. 253 [14] B. Sch¨ olkopf, R. C. Williamson, A. J. Smola, and J. Shawe-Taylor. SV estimation of a distribution’s support. T.K Leen, S. A. Solla, and K. R. M¨ uller, eds., Advances in Neural Information Processing Systems 12. MIT Press, forthcomming. mlg.anu.edu.au/ smola/publications.html. 246 [15] H. Siegelmann, A. Ben-Hur, D. Horn, and V. Vapnik. Support vector clustering. J. Machine Learning Research, 2:125–137, 2001. 245, 247
256
Vladimir Estivill-Castro and Jianhua Yang
[16] V. N. Vapnik. The nature of statistical learning theory. Springer Verlag, Heidelberg, 1995. 245 [17] M. Vazirgiannis, M Halkidi, and Y. Batistakis. On clustering validation techniques. Intelligent Information Systems J. 17(2):107–145, 2001. 244, 253 [18] R. Williamson, B. Sch¨ olkopf, A. Smola, and P. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000. 246 [19] R. Winter. Formal validation of schema clustering for large information systems. Proc. First American Conference on Information Systems, 1995. 253
FSSM: Fast Construction of the Optimized Segment Support Map Kok-Leong Ong, Wee-Keong Ng, and Ee-Peng Lim Centre for Advanced Information Systems, Nanyang Technological University, Nanyang Avenue, N4-B3C-13, Singapore 639798, SINGAPORE [email protected]
Abstract. Computing the frequency of a pattern is one of the key operations in data mining algorithms. Recently, the Optimized Segment Support Map (OSSM) was introduced as a simple but powerful way of speeding up any form of frequency counting satisfying the monotonicity condition. However, the construction cost to obtain the ideal OSSM is high, and makes it less attractive in practice. In this paper, we propose the FSSM, a novel algorithm that constructs the OSSM quickly using a FP-Tree. Given a user-defined segment size, the FSSM is able to construct the OSSM at a fraction of the time required by the algorithm previously proposed. More importantly, this fast construction time is achieved without compromising the quality of the OSSM. Our experimental results confirm that the FSSM is a promising solution for constructing the best OSSM within user given constraints.
1 Introduction
Frequent set (or pattern) mining plays a pivotal role in many data mining tasks including associations [1] and its variants [2, 4, 7, 13], sequential patterns [12] and episodes [9], constrained frequent sets [11], emerging patterns [3], and many others. At the core of discovering frequent sets is the task of computing the frequency (or support) of a given pattern. In all cases above, we have the following abstract problem for computing support. Given a collection I of atomic patterns or conditions, compute for collections C ⊆ I the support σ(C) of C, where the monotonicity condition σ(C) ≤ σ({c}) holds for all c ∈ C. Typically, the frequencies of patterns are computed in a collection of transactions, i.e., D = {T1, . . . , Ti}, where a transaction can be a set of items, a sequence of events in a sliding time window, or a collection of spatial objects. One class of algorithms finds the above patterns by generating candidate patterns C1, . . . , Cj, and then checking them against D. This process is known to be tedious and time-consuming. Thus, novel algorithms and data structures were proposed to improve the efficiency of frequency counting. However, most solutions do not address the problem in a holistic manner. As a result, extensive effort is often needed to incorporate a particular solution into an existing algorithm.
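To make the abstract counting problem concrete, the following small Python sketch (an illustration, not taken from the paper; the toy data and helper name are hypothetical) computes the support σ(C) of a candidate pattern over a list of transactions and checks the monotonicity condition.

    # Illustrative sketch: support counting over a transaction database.
    def support(candidate, transactions):
        """sigma(C): number of transactions that contain every item of C."""
        c = set(candidate)
        return sum(1 for t in transactions if c.issubset(t))

    D = [{"a"}, {"a", "b"}, {"a"}, {"a"}, {"b"}, {"b"}]   # toy transactions
    C = {"a", "b"}
    print(support(C, D))                                  # 1
    # monotonicity: sigma(C) <= sigma({c}) for every c in C
    assert all(support(C, D) <= support({c}, D) for c in C)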
This work was supported by SingAREN under Project M48020004.
Recently, the Optimized Segment Support Map (OSSM) [8, 10] was introduced as a simple yet powerful way of speeding up any form of frequency counting satisfying the monotonicity condition. It is a light-weight, easy-to-compute structure that partitions D into n segments, i.e., D = S1 ∪ . . . ∪ Sn and Sp ∩ Sq = ∅ for p ≠ q, with the goal of reducing the number of candidate patterns for which frequency counting is required. The idea of the OSSM is simple: the frequencies of patterns in different parts of the data are different. Therefore, computing the frequencies separately in different parts of the data makes it possible to obtain tighter support bounds for the frequencies of the collections of patterns. This enables one to prune more effectively, thus improving the speed of counting. Although the OSSM is an attractive solution for a large class of algorithms, it suffers from one major problem: the construction cost to obtain the best OSSM of a user-defined segment size for a given large collection is high. This makes the OSSM much less attractive in practice. For practicality, the authors proposed hybrid algorithms that use heuristics to contain the runtime, and to construct the “next best” OSSM. Although the solution guarantees an OSSM that improves performance, the quality of estimation is sub-optimal. This translates to a weaker support bound estimated for a given pattern and hence reduces the probability of pruning an infrequent pattern. Our contribution to the above is to show the possibility of constructing the best OSSM within limited time for a given segment size and a large collection. Our proposal, called the FSSM, is an algorithm that constructs the OSSM from the FP-Tree. With the FSSM, we need not compromise the quality of estimation in favor of a shorter construction time. The FSSM may therefore make obsolete the sub-optimal algorithms originally proposed. Our experimental results support these claims.
2 Background
The OSSM is a light-weight structure that holds the support of all singleton itemsets in each segment of the database D. A segment in D is a partition containing a set of transactions such that D = S1 ∪ . . . ∪ Sn and Sp ∩ Sq = ∅ for p ≠ q. In each segment, the support of each singleton itemset is registered and thus, the support of an item ‘c’ can be obtained by ∑_{i=1}^{n} σi({c}). While the OSSM contains only segment supports of singleton itemsets, it can be used to give an upper bound (σ̂) on the support of any itemset C using the formula given below, where On is the OSSM constructed with n segments.

σ̂(C, On) = ∑_{i=1}^{n} min({σi({c}) | c ∈ C})
Let us consider the example in Figure 1. Assume in this configuration, each segment has exactly two transactions. Then, we have the OSSM (right table) where the frequency of each item in each segment is registered. By the equation above, the estimated support of the itemset C = {a, b} would be
TID  Contents  Segment
1    {a}       1
2    {a, b}    1
3    {a}       2
4    {a}       2
5    {b}       3
6    {b}       3

       S1   S2   S3   D = S1 ∪ S2 ∪ S3
{a}    2    2    0    4
{b}    1    0    2    3

Fig. 1. A collection of transactions (left) and its corresponding OSSM (right). The OSSM is constructed with a user-defined segment size of n = 3.
σ̂(C, On) = min(2, 1) + min(2, 0) + min(0, 2) = 1. Although this estimate is the support bound of C, it turns out to be the actual support of C for this particular configuration of segments. Suppose we now switch T1 and T5 in the OSSM, i.e., S1 = {T2, T5} and S3 = {T1, T6}; then σ̂(C, On) = 2! This observation suggests that the way transactions are selected into a segment can affect the quality of estimation. Clearly, if each segment contains only one transaction, then the estimate will be optimal and equals the actual support. However, this number of segments will be practically infeasible. The ideal alternative is to use a minimum number of segments to maintain the optimality of our estimate. This leads to the following problem formulation.

Definition 1. Given a collection of transactions, the segment minimization problem is to determine the minimum value nm for the number of segments in the OSSM Onm, such that σ̂(C, Onm) = σ(C) for all itemsets C, i.e., the upper bound on the support for any itemset C is exactly its actual support.

With the FSSM, the minimum number of segments can be obtained quickly in two passes of the database. However, knowing the minimum number of segments is at best a problem of academic interest. In practice, this number is still too large to consider the OSSM as light-weight. It is thus desirable to construct the OSSM based on a user-defined segment size nu. And since nu ≤ nm, we expect a drop in the accuracy of the estimate. The goal then is to find the best configuration of segments, such that the quality of every estimate is the best within the bounds of nu. This problem is formally stated as follows.

Definition 2. Given a collection of transactions and a user-defined segment size nu ≤ nm to be formed, the constrained segmentation problem is to determine the best composition of the nu segments that minimizes the loss of accuracy in the estimate.
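As a cross-check of this example, the upper-bound computation can be written in a few lines of Python (an illustrative sketch, not part of the paper; the function name is hypothetical). It reproduces the estimate of 1 for the OSSM of Figure 1 and the estimate of 2 after T1 and T5 are switched.

    def ossm_estimate(C, ossm):
        """Upper bound sigma_hat(C, O_n) from per-segment singleton counts.
        ossm is a list of dictionaries, one per segment, mapping item -> count."""
        return sum(min(seg.get(c, 0) for c in C) for seg in ossm)

    # OSSM of Figure 1 (three segments of two transactions each).
    O = [{"a": 2, "b": 1}, {"a": 2, "b": 0}, {"a": 0, "b": 2}]
    print(ossm_estimate({"a", "b"}, O))            # 1

    # After switching T1 and T5: S1 = {T2, T5}, S3 = {T1, T6}.
    O_switched = [{"a": 1, "b": 2}, {"a": 2, "b": 0}, {"a": 1, "b": 1}]
    print(ossm_estimate({"a", "b"}, O_switched))   # 2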
3 FSSM: Algorithm for Fast OSSM Construction
In this section, we present our solutions to the above problems. For ease of discussion, we assume the reader is familiar with the FP-Tree and the OSSM. If not, a proper treatment can be obtained in [5, 10].
3.1 Constructing the Ideal OSSM
Earlier, we mentioned that the FSSM constructs the optimal OSSM from the FP-Tree. Therefore, we begin by showing the relationship between the two.

Lemma 1. Let Si and Sj be two segments of the same configuration from a collection of transactions. If we merge Si and Sj into one segment Sm, then Sm has the same configuration, and σ̂(C, Sm) = σ̂(C, Si) + σ̂(C, Sj).

The term configuration refers to the characteristic of a segment that is described by the descending frequency order of the singleton itemsets. As an example, suppose the database has three unique items and two segments, i.e., S1 = {b(4), a(1), c(0)} and S2 = {b(3), a(2), c(2)}, where the number in the parentheses is the frequency of each item in the segment. In this case, both segments are described by the same configuration ⟨σ({b}) ≥ σ({a}) ≥ σ({c})⟩, and therefore can be merged (by Lemma 1) without losing accuracy. In a more general case, the lemma solves the segment minimization problem. Suppose each segment begins with a single transaction, i.e., the singleton frequency registered in each segment is either ‘1’ or ‘0’. We begin by merging two single-transaction segments of the same configuration. From this merged segment, we continue merging other single-transaction segments as long as the configuration is not altered. When no other single-transaction segments can be merged without losing accuracy, we repeat the process on another configuration. The number of segments found after processing all distinct configurations is the minimum number of segments required to build the optimal OSSM.

Theorem 1. The minimum number of segments required for the upper bound on σ(C) to be exact for all C is the number of segments with distinct configurations. Proof: As shown in [10].

Notice that the process of merging two segments is very similar to the process of FP-Tree construction. First, the criterion used to order items in a transaction is the same as that used to determine the configuration of a segment (specifically a single-transaction segment). Second, the merging criterion of two segments is implicitly carried out by the overlaying of a transaction on an existing unique path¹ in the FP-Tree. An example will illustrate this observation. Let T1 = {f, a, m, p}, T2 = {f, a, m} and T3 = {f, b, m} such that the transactions are already ordered, and σ({b}) ≤ σ({a}). Based on FP-Tree characteristics, T1 and T2 will share the same path in the FP-Tree, while T3 will have a path of its own. The two transactions overlaid on the same path actually have the same configuration, ⟨σ({f}) ≥ σ({a}) ≥ σ({m}) ≥ σ({p}) ≥ σ({b}) ≥ . . .⟩, since σ({b}) = 0 in both T1 and T2 and σ({p}) = 0 for T2. For T3, the configuration is ⟨σ({f}) ≥ σ({b}) ≥ σ({m}) ≥ σ({a}) ≥ σ({p}) ≥ . . .⟩, where σ({a}) = σ({p}) = 0. Clearly, this is a different configuration from that of T1 and T2 and hence a different path in the FP-Tree.
¹ A unique path in the FP-Tree is a distinct path that starts from the root node and ends at one of the leaf nodes in the FP-Tree.
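As a small illustration of Lemma 1 (a sketch, not part of the paper), a segment's configuration can be computed as the descending-frequency ordering of its singleton counts, and two segments may be merged losslessly exactly when these orderings coincide. The tie-breaking by item name is an assumption of this sketch, since the paper does not specify one.

    def configuration(counts, items):
        """Descending-frequency ordering of the singleton counts (ties broken by item name)."""
        return tuple(sorted(items, key=lambda c: (-counts.get(c, 0), c)))

    def merge_segments(si, sj, items):
        """Lemma 1: merging two segments of the same configuration loses no accuracy."""
        assert configuration(si, items) == configuration(sj, items)
        return {c: si.get(c, 0) + sj.get(c, 0) for c in items}

    items = ["a", "b", "c"]
    S1 = {"b": 4, "a": 1, "c": 0}
    S2 = {"b": 3, "a": 2, "c": 2}
    print(configuration(S1, items))        # ('b', 'a', 'c')
    print(configuration(S2, items))        # ('b', 'a', 'c')
    print(merge_segments(S1, S2, items))   # {'a': 3, 'b': 7, 'c': 2}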
Theorem 2. Given an FP-Tree constructed from some collection, the number of unique paths (or leaf nodes) in the FP-Tree is the minimum number of segments achievable without compromising the accuracy of the OSSM. Proof: Suppose the number of unique paths in the FP-Tree is not the minimum number of segments required to build the optimal OSSM. Then, there will be at least one unique path that has the same configuration as another path in the FP-Tree. However, two paths Pi and Pj in the FP-Tree can have the same configuration if and only if there exist transactions in both paths that have the same configuration. If Ti ∈ Pi and Tj ∈ Pj are of the same configuration, they must satisfy the condition Ti ⊆ Tj and ∀c ∈ Tj − Ti, σ({c}) ≤ σ({x_{|Ti|}}), x_{|Ti|} ∈ Ti, or vice versa. However, by the principle of FP-Tree construction, if Ti and Tj satisfy the above condition, then they must be overlaid on the same path. Therefore, each unique path in the FP-Tree must be of a distinct configuration. Hence, we may now apply Theorem 1 to complete the proof of Theorem 2.

Corollary 1. The transactions that are fully contained in each unique path of the FP-Tree are the set of transactions that constitutes a distinct segment in the optimal OSSM. Proof: By Theorem 2, every unique path in the FP-Tree must have a distinct configuration, and all transactions contained in a unique path are transactions with the same configuration. In addition, since every transaction in the collection must lie completely along one of the paths in the FP-Tree, it follows that there is an implicit and complete partition of the collection by the unique path to which each transaction belongs. By this observation, every unique path and its set of transactions must therefore correspond to a distinct segment in the optimal OSSM. Hence, we have the above corollary of Theorem 2.

Based on Theorem 1, we shall give an algorithmic sketch of the construction algorithm for the optimal OSSM. Although this has little practical utility, its result is an intermediate step towards the goal of finding the optimal OSSM within the bounds of the user-defined segment size. Hence, its efficient construction is still important. The algorithm to construct the optimal OSSM is given in Figure 2. Notice that the process is very much based on FP-Tree construction. In fact, the entire FP-Tree is constructed along with the optimal OSSM. Therefore, the efficiency of the algorithm is bounded by the time needed to construct the FP-Tree, i.e., within two scans of the database. The results above are important for solving the constrained segmentation problem. As we will show in the next subsection, the overlapping of unique paths in the FP-Tree has an important property that will allow us to construct the best OSSM within the bounds of the user-defined segment size. As before, we shall present the formal discussion that leads to the algorithm.
3.2 Constructing OSSM with User-Defined Segment Size
Essentially, Theorem 1 states the lower bound nm on the number of segments allowable before the OSSM becomes sub-optimal in its estimation.
Algorithm BuildOptimalOSSM(Set of transactions D)
begin
    Find the singleton frequency of each item in D;        // Pass 1
    foreach transaction T ∈ D do                           // Pass 2
        Sort T according to descending frequency order;
        if (T can be inserted completely along an existing path Pi in the FP-Tree) then
            Increment the counter in segment Si for each item in T;
        else
            Create the new path Pj in the FP-Tree, and the new segment Sj;
            Initialize the counter in segment Sj for each item in T to 1;
        endif
    endfor
    return optimal OSSM and FP-Tree;
end

Fig. 2. Algorithm to construct the optimal OSSM via FP-Tree construction.
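A compact Python rendering of Figure 2 (a sketch; it keeps each unique path as a plain item tuple rather than an explicit FP-Tree, and the helper names and toy data are illustrative assumptions) makes the two passes explicit. Because the toy frequencies are computed from only three transactions, the item order differs from the paper's running example, but T1 and T2 still end up on one path while T3 gets its own.

    def build_optimal_ossm(D):
        """Sketch of Fig. 2: one segment per unique FP-Tree path.
        A transaction joins an existing path if it lies completely along it."""
        freq = {}                                   # Pass 1: singleton frequencies
        for T in D:
            for c in T:
                freq[c] = freq.get(c, 0) + 1
        paths, segments = [], []                    # parallel lists
        for T in D:                                 # Pass 2
            t = tuple(sorted(T, key=lambda c: (-freq[c], c)))
            for k, p in enumerate(paths):
                if t[:len(p)] == p or p[:len(t)] == t:     # t lies along path p
                    paths[k] = max(p, t, key=len)          # the path may grow
                    for c in t:
                        segments[k][c] = segments[k].get(c, 0) + 1
                    break
            else:
                paths.append(t)
                segments.append({c: 1 for c in t})
        return paths, segments

    D = [{"f", "a", "m", "p"}, {"f", "a", "m"}, {"f", "b", "m"}]
    paths, segments = build_optimal_ossm(D)
    print(len(paths))      # 2 unique paths, hence 2 segments for this toy collection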
Also mentioned is that the value of nm is too high to construct the OSSM as a light-weight and easy-to-compute structure. The alternative, as proposed, is to introduce a user-defined segment size nu where nu ≤ nm. Clearly, when nu < nm, the accuracy can no longer be maintained. This means merging segments of different configurations so as to reach the user-defined segment size. Of course, the simplest approach is to randomly merge any distinct configurations. However, this will result in an OSSM with poor pattern pruning efficiency. As such, we are interested in constructing the best OSSM within the bounds of the user-defined segment size. Towards this goal, the following measure was proposed.

SubOp(S) = ∑_{(ci,cj)} [ σ̂({ci, cj}, O1) − σ̂({ci, cj}, Ov) ]
In the equation, S = {S1, . . . , Sv} is a set of segments with v ≥ 2. The first term is the upper bound on σ({ci, cj}) based on O1, which consists of one combined segment formed by merging all v segments in S. The second term is the upper bound based on Ov, which keeps the v segments separated. The difference between the two terms quantifies the amount of sub-optimality introduced in the estimation on the set {ci, cj} by having the v segments merged, and the sum over all pairs of items measures the total loss. Generally, if all v segments are of the same configuration, then SubOp(S) = 0, and if there are at least two segments with different configurations, then SubOp(S) > 0. What this means is that we would like to merge segments having smaller sub-optimality values, i.e., segments that incur a smaller loss when the v segments are merged. This measure is the basis of operation for the algorithms proposed by the authors. Clearly, this approach is expensive. First, computing a single sub-optimality value requires a sum over all pairs of items in the segment; if there are k items, then there are k·(k−1)/2 terms to be summed. Second, the number of distinct segments for which the sub-optimality is to be computed is also very large. As a result, the runtime to construct the best OSSM within the bounds of the user-defined segment size becomes very high.
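The measure can be computed directly from per-segment singleton counts. The following Python sketch (illustrative, not from the paper; the function names are hypothetical) does so and, as a check, evaluates the loss of merging the three segments of Figure 1 into a single segment.

    from itertools import combinations

    def estimate(C, segments):
        """sigma_hat(C, O): sum over segments of the minimum singleton count in C."""
        return sum(min(seg.get(c, 0) for c in C) for seg in segments)

    def sub_opt(segments, items):
        """SubOp(S): accuracy lost by merging all the segments of S into one."""
        merged = {c: sum(seg.get(c, 0) for seg in segments) for c in items}
        return sum(estimate({ci, cj}, [merged]) - estimate({ci, cj}, segments)
                   for ci, cj in combinations(items, 2))

    S = [{"a": 2, "b": 1}, {"a": 2, "b": 0}, {"a": 0, "b": 2}]   # Figure 1
    print(sub_opt(S, ["a", "b"]))   # 2: merged bound 3 versus segmented bound 1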
To contain the runtime, hybrid algorithms were proposed. These algorithms first create segments of larger granularity by randomly merging existing segments before the sub-optimality measure is used to reach the user-defined segment size. The consequence is an OSSM with an estimation accuracy that cannot be predetermined, and that is often not the best possible for the given user-defined segment size.

With regard to the above, the FP-Tree has some interesting characteristics. Recall from Theorem 2 that segments having the same configuration share the same unique path. Likewise, it is not difficult to observe that two unique paths are similar in configuration if they have a high degree of overlap (i.e., sharing of prefixes). In other words, as the overlap increases, the sub-optimality value approaches zero. To illustrate this, suppose T1 = {f, a, m}, T2 = {f, a, c, p} and T3 = {f, a, c, q}. An FP-Tree constructed over these transactions will have three unique paths due to their distinct configurations. Assuming that T2 is to be merged with either T1 or T3, we observe that T2 should be merged with T3. This is because T3 shares a longer prefix with T2 than T1 does, i.e., there is more overlap between the two paths. This can be confirmed by calculating the sub-optimality: SubOp(T1, T2) = 2 and SubOp(T2, T3) = 1.

Lemma 2. Given a segment Si and its corresponding unique path Pi in the FP-Tree, the segment(s) that have the lowest sub-optimality value (i.e., the most similar configuration) with respect to Si are the segment(s) whose unique path has the most overlap with Pi in the FP-Tree.

Proof: Let Pj be a unique path with a configuration distinct from that of Pi. Without loss of distinction in the configurations, let the first k items in both configurations share the same item and frequency ordering. Then, the sub-optimality computed with or without the k items will be the same, since computing all pairs of the first k items (of the same configuration) contributes a zero result. Furthermore, the sub-optimality of Pi and Pj has to be non-zero. Therefore, a non-zero sub-optimality depends on the remaining L = max(|Pi|, |Pj|) − k items, where each pair formed from the L items contributes a non-zero partial sum. As k tends towards L, the number of pairs that can be formed from the L items reduces, and the sub-optimality thus approaches zero. Clearly, max(|Pi|, |Pj|) > k > 0 when Pi and Pj partially overlap one another in the FP-Tree, and k = 0 when they do not overlap at all. Hence, with more overlap between the two paths, i.e., a larger k, there is less overall loss in accuracy; hence Lemma 2.

Figure 3 shows the FSSM algorithm that constructs the best OSSM based on the user-defined segment size nu. Instead of creating segments of larger granularity by randomly merging existing ones, we begin with the nm segments in the optimal OSSM constructed earlier. From these nm segments, we merge two segments at a time such that the loss of accuracy is minimized. Clearly, this is costly if we compare each segment against every other, as proposed in [10]. Rather, we utilize Lemma 2 to cut the search space down to comparing only a few segments. More importantly, the FSSM begins with the optimal OSSM and will always merge segments with minimum loss of accuracy.
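Using the sub_opt sketch given earlier, the two values quoted in this example can be verified by treating T1, T2 and T3 as single-transaction segments (a small usage check, not part of the paper):

    T1, T2, T3 = {"f", "a", "m"}, {"f", "a", "c", "p"}, {"f", "a", "c", "q"}
    items = ["f", "a", "m", "c", "p", "q"]
    as_segment = lambda T: {c: 1 for c in T}
    print(sub_opt([as_segment(T1), as_segment(T2)], items))   # 2
    print(sub_opt([as_segment(T2), as_segment(T3)], items))   # 1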
Algorithm BuildBestOSSM(FP-Tree T, Segment Size nu, Optimal OSSM Om)
begin
    while (number of segments in Om > nu) do
        select node N from lookup table H
            where N is the next furthest from the root of T and has > 1 child nodes;
        foreach possible pair of direct child nodes (ci, cj) of N do
            Let Si/Sj be the segment for path Pi/Pj containing ci/cj respectively;
            Compute the sub-optimality as a result of merging Si and Sj;
        endfor
        Merge the pair Sp and Sq whose sub-optimality value is smallest;
        Create unique path Ppq in T by merging Pp and Pq;
    endwhile
    return best OSSM with nu segments;
end

Fig. 3. FSSM: algorithm to build the best OSSM for any given segment size nu < nm.
This ensures that the best OSSM is always constructed for any value of nu. Each pass through the while-loop merges two segments at a time, and this continues until the OSSM of nm segments is reduced to nu segments. At the start of each pass, we first find the set of unique paths having the longest common prefix (i.e., the biggest k value). This is satisfied by the condition in the select-where statement, which returns N, the last node in the common prefix. This node is important because, together with its direct children, we can derive the set of all unique paths sharing this common prefix. The for-loop then computes the sub-optimality for each pair of segments in this set of unique paths. Instead of searching the FP-Tree (which would be inefficient), our implementation uses a lookup table H to find N. Each entry in H records the distance of a node having more than one child, and a reference to the actual node in the FP-Tree. All entries in H are then ordered by their distance so that the select-where statement can find the next furthest node by iterating through H. Although the number of pairs of segments to process is substantially reduced, the efficiency of the for-loop can be further enhanced with a more efficient method of computing the sub-optimality. As shown in the proof of Lemma 2, the first k items in the common prefix do not contribute to a non-zero sub-optimality. By the same rationale, we can also exclude the h items whose singleton frequencies are zero in both segments. Hence, the sub-optimality can be computed by considering only the remaining |I| − k − h, or max(|Pi|, |Pj|) − k, items. After comparing all segments under N, we merge the two segments represented by the two unique paths with the least loss in accuracy. Finally, we merge the two unique paths whose segments were combined earlier. This new path will then correspond to the merged segment in the OSSM, where all nodes in the path are arranged according to their descending singleton frequency. The rationale for merging the two paths is to consistently reflect the state of the OSSM required for the subsequent passes.
[Figure 4 comprises two plots over the number of segments (20–300): (a) construction runtime in seconds, on a logarithmic scale, for FSSM, Random-RC and Greedy; (b) speedup relative to Apriori without the OSSM, for FSSM/Greedy and Random-RC.]
Fig. 4. (a) Runtime performance comparison for constructing the OSSM based on a number of given segment sizes. (b) Corresponding speedup achieved for Apriori using the OSSMs constructed in the first set of experiments.
4 Experimental Results
The objective of our experiments is to evaluate the cost effectiveness of our approach against the Greedy and Random-RC algorithms proposed in [10]. We conducted two sets of experiments using a real data set, BMS-POS [6], which has 515,597 transactions. In the first set of experiments, we compare the FSSM against the Greedy and Random-RC algorithms in terms of the time taken to construct the OSSM for different user-defined segment sizes. In the second set of experiments, we compare the speedup that the OSSMs constructed by the three algorithms contribute to Apriori at varying segment sizes. Figure 4(a) shows the results of the first set of experiments. As we expected from our previous discussion, the Greedy algorithm exhibits extremely poor runtime when it comes to constructing the best OSSM within the bounds of the given segment size. Compared to the Greedy algorithm, the FSSM produces the same results in significantly less time, showing the feasibility of pursuing the best OSSM in a practical context. Interestingly, our algorithm is even able to outperform Random-RC on larger user-defined segment sizes. This can be explained by the fact that Random-RC first randomly merges segments into larger-granularity segments before constructing the OSSM based on the sub-optimality measure. As the user-defined segment size becomes larger, the granularity of each segment formed from random merging becomes finer. With more combinations of segments, the cost of finding the best segments to merge in turn becomes higher. Although we construct the OSSM at the performance level of the Random-RC algorithm, this does not mean that the OSSM produced is of poor quality. As a matter of fact, the FSSM guarantees the best OSSM by the same principle that the Greedy algorithm uses to build the best OSSM for the given user-defined segment size. Having shown this by a theoretical discussion, our experimental results in Figure 4(b) provide the empirical evidence. While Random-RC takes approximately the same amount of time as the FSSM during construction, it fails to deliver the same level of speedup as the FSSM in
all cases of our experiments. On the other hand, our FSSM is able to construct the OSSM very quickly, and yet deliver the same level of speedup as the OSSM produced by the Greedy algorithm.
5 Conclusions
In this paper, we present an important observation about the construction of an optimal OSSM with respect to the FP-Tree. We show, by means of formal analysis, the relationship between them, and how the characteristics of the FP-Tree can be exploited to construct high-quality OSSMs. We demonstrated, both theoretically and empirically, that our proposal is able to consistently produce the best OSSM within limited time for any given segment size. More importantly, with the best OSSM within reach, the various compromises suggested to balance construction time and speedup become unnecessary.
References
1. R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. of VLDB, pages 487–499, Santiago, Chile, August 1994.
2. C. H. Cai, Ada W. C. Fu, C. H. Cheng, and W. W. Kwong. Mining Association Rules with Weighted Items. In Proc. of IDEAS Symp., August 1998.
3. G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In Proc. of ACM SIGKDD, San Diego, CA, USA, August 1999.
4. J. Han and Y. Fu. Discovery of Multiple-Level Association Rules from Large Databases. In Proc. of VLDB, Zurich, Switzerland, 1995.
5. J. Han, J. Pei, Y. Yin, and R. Mao. Mining Frequent Patterns without Candidate Generation: A Frequent-pattern Tree Approach. J. of Data Mining and Knowledge Discovery, 7(3/4), 2003.
6. R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 organizers' report: Peeling the onion. SIGKDD Explorations, 2(2):86–98, 2000.
7. K. Koperski and J. Han. Discovery of Spatial Association Rules in Geographic Information Databases. In Proc. of the 14th Int. Symp. on Large Spatial Databases, Maine, August 1995.
8. L. Lakshmanan, K.-S. Leung, and R. T. Ng. The Segment Support Map: Scalable Mining of Frequent Itemsets. SIGKDD Explorations, 2:21–27, December 2000.
9. H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering Frequent Episodes in Sequences. In Proc. of ACM SIGKDD, Montreal, Canada, August 1995.
10. C. K.-S. Leung, R. T. Ng, and H. Mannila. OSSM: A Segmentation Approach to Optimize Frequency Counting. In Proc. of IEEE Int. Conf. on Data Engineering, pages 583–592, San Jose, CA, USA, February 2002.
11. R. T. Ng, L. V. S. Lakshmanan, and J. Han. Exploratory Mining and Pruning Optimizations of Constrained Association Rules. In Proc. of SIGMOD, Washington, USA, June 1998.
12. R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proc. of the 5th Int. Conf. on Extending Database Technology, Avignon, France, March 1996.
13. O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. In Proc. of ICDE, San Diego, March 2000.
Using a Connectionist Approach for Enhancing Domain Ontologies: Self-Organizing Word Category Maps Revisited
Michael Dittenbach¹, Dieter Merkl¹,², and Helmut Berger¹
¹ E-Commerce Competence Center – EC3, Donau-City-Straße 1, A–1220 Wien, Austria
² Institut für Softwaretechnik, Technische Universität Wien, Favoritenstraße 9–11/188, A–1040 Wien, Austria
{michael.dittenbach,dieter.merkl,helmut.berger}@ec3.at
Abstract. In this paper, we present an approach based on neural networks for organizing words of a specific domain according to their semantic relations. The terms, which are extracted from domain-specific text documents, are mapped onto a two-dimensional map to provide an intuitive interface displaying semantically similar words in spatially similar regions. This representation of a domain vocabulary supports the construction and enrichment of domain ontologies by making relevant concepts and their relations evident.
1 Introduction
Ontologies have gained increasing importance in many fields of computer science. Especially for information retrieval systems, ontologies can be a valuable means of representing and modeling domain knowledge to deliver search results of higher quality. However, a crucial problem is an ontology's increasing complexity with the growing size of the application domain. In this paper, we present an approach based on a neural network to assist domain engineers in creating or enhancing ontologies for information retrieval systems. We show an example from the tourism domain, where free-form text descriptions of accommodations are used as a basis to enrich the ontology of a tourism information retrieval system with highly specialized terms that are hardly found in general-purpose thesauri or dictionaries. We exploit information inherent in the textual descriptions that is accessible but separated from the structured information the search engine operates on. The vector representations of the terms are created by generating statistics about the local contexts of the words occurring in the natural language descriptions of accommodations. These descriptions have in common that words belonging together with respect to their semantics are found spatially close together regarding their position in the text, even though the descriptions are written by different authors, i.e. the accommodation providers themselves in the case of our application. Therefore, we think that the approach presented in this paper can be applied to a variety of domains, since,
for instance, product descriptions generally have similarly structured content. Consider, for example, typical computer hardware descriptions, where information about, say, storage devices is normally grouped together rather than being intertwined with input and display devices. More specifically, we use the self-organizing map to cluster terms relevant to the application domain in order to provide an intuitive representation of their semantic relations. With this kind of representation at hand, finding synonyms, adding new relations between concepts or detecting new concepts that should be added to the ontology is facilitated. More traditional clustering techniques are used in the DARE system [3] as methods supporting combined top-down and bottom-up ontology engineering [11]. The remainder of the paper is structured as follows. In Section 2 we provide a brief review of our natural language tourism information retrieval system along with some results of a field trial in which the interface was made publicly accessible. Section 3 gives an overview of the SOM and how it can be used to create a word category map. Following a description of our experiments in Section 4, we provide some concluding remarks in Section 5.
2 A Tourism Information Retrieval System
2.1 System Architecture
We have developed a natural language interface for the largest Austrian web-based tourism platform Tiscover (http://www.tiscover.com) [12]. Tiscover is a well-known tourism information system and booking service in Europe that already covers more than 50,000 accommodations in Austria, Germany, Liechtenstein, Switzerland and Italy. In contrast to the original form-based Tiscover interface, our natural language interface allows users to search for accommodations throughout Austria by formulating the query in natural language sentences either in German or English. The language of the query is automatically detected and the result is presented accordingly. For the task of natural language query analysis we followed the assumption that shallow natural language processing is sufficient in restricted and well-defined domains. In particular, our approach relies on the selection of query concepts, which are modeled in a domain ontology, followed by syntactic and semantic analysis of the parts of the query where the concepts appear. To improve the retrieval performance, we used a phonetic algorithm to find and correct orthographic errors and misspellings. It is furthermore an important issue to automatically identify proper names consisting of more than one word, e.g. “Gries am Brenner”, without requiring the user to enclose them in quotes. This also applies to phrases and multi-word denominations like “city center” or “youth hostel”, to name but a few. In the next query processing step, the relevant concepts and modifiers are tagged. For this purpose, we have developed an XML-based ontology covering the semantics of domain-specific concepts and modifiers and describing linguistic concepts like synonymy. Additionally,
a lightweight grammar describes how particular concepts may be modified by prepositions and adverbial or adjectival structures that are also specified in the ontology. Finally, the query is transformed into an SQL statement to retrieve information from the database. The tagged concepts and modifiers, together with the rule set and parameterized SQL fragments also defined in the knowledge base, are used to create the complete SQL statement reflecting the natural language query. A generic XML description of the matching accommodations is transformed into a device-dependent output, customized to fit screen size and bandwidth. Our information retrieval system covers a part of the Tiscover database which, as of October 2001, provides access to information about 13,117 Austrian accommodations. These are described by a large number of characteristics including the respective numbers of various room types, different facilities and services provided in the accommodation, or the type of food. The accommodations are located in 1,923 towns and cities that are again described by various features, mainly information about sports activities offered, e.g. mountain biking or skiing, but also the number of inhabitants or the elevation above sea level. The federal states of Austria are the higher-level geographical units. For a more detailed report on the system we refer to [2].
2.2 A Field Trial and Its Implications
The field trial was carried out over ten days in March 2002. During this time our natural language interface was promoted on and linked from the main Tiscover page. We obtained 1,425 unique queries through our interface, i.e. identical queries from the same client host were reduced to one entry in the query log to eliminate a possible bias in our evaluation of the query complexity. In more than half of the queries, users formulated complete, grammatically correct sentences, about one fifth were partial sentences and the remaining set were keyword-type queries. Several of the queries consisted of more than one sentence. This confirms our assumption that users accept the natural language interface and are willing to type more than just a few keywords to search for information. More than this, a substantial portion of the users type complete sentences to express their information needs. To inspect the complexity of the queries, we considered the number of concepts and the usage of modifiers like “and”, “or”, “not”, “near” and some combinations of those as quantitative measures. We found that the level of sentence complexity is not very high. This confirms our assumption that shallow text parsing is sufficient to analyze the queries emerging in a limited domain like tourism. Even more important for the research described in this paper, we found that regions and local attractions are essential information that has to be integrated into such systems. We also noticed that users' queries contained vague or highly subjective criteria like “romantic”, “cheap” or “within walking distance to”. Even “wellness”, a term broadly used in tourism nowadays, is far from being exactly defined. A more detailed evaluation of the results of the field
trial can be found in [1]. It furthermore turned out that a deficiency of our ontology was the lack of diversity in its terminology. To provide better quality search results, it is necessary to enrich the ontology with additional synonyms. Besides the structured information about the accommodations, the web pages describing the accommodations offer a lot more information in the form of natural language descriptions. Hence, the words occurring in these texts constitute a very specialized vocabulary for this domain. The next obvious step is to exploit this information to enhance the domain ontology for the information retrieval system. Due to the size of this vocabulary, some intelligent form of representation is necessary to express semantic relations between the words.
3 Word Categorization
3.1 Encoding the Semantic Contexts
Ritter and Kohonen [13] have shown that it is possible to cluster terms according to their syntactic category by encoding word contexts of terms in an artificial data set of three-word sentences that consist of nouns, verbs and adverbs, such as, e.g. “Jim speaks often” and “Mary buys meat”. The resulting maps clearly showed three main clusters corresponding to the three word classes. It should furthermore be noted that within each cluster, the words of a class were arranged according to their semantic relation. For example, the adverbs poorly and well were located closer together on the map than poorly and much; the latter was located spatially close to little. An example from a different cluster would be the verbs likes and hates. Other experiments using a collection of fairy tales by the Grimm Brothers have shown that this method works well with real-world text documents [5]. The terms on the SOM were divided into three clusters, namely nouns, verbs and all other word classes. Again, inside these clusters, semantic similarities between words were mirrored. The results of these experiments were later elaborated to reduce the vector dimensionality for document clustering in the WEBSOM project [6]. Here, a word category map was trained with the terms occurring in the document collection to subsume words with similar contexts under one semantic category. These categories, obviously fewer than the number of all words of the document collection, were then used to create document vectors for clustering. Since new methods of dimensionality reduction have been developed, the word category map has been dropped for this particular purpose [9]. Nevertheless, since our objective is to disclose semantic relations between words, we decided to use word category maps. For training a self-organizing map to organize terms according to their semantic similarity, these terms have to be encoded as n-dimensional numerical vectors. As shown in [4], random vectors are quasi-orthogonal when n is large enough. Thus, unwanted geometrical dependence of the word representation can be avoided. This is a necessary condition, because otherwise the clustering result could be dominated by random effects overriding the semantic similarity of words.
We assume that, in textual descriptions dominated by enumerations, semantic similarity is captured by contextual closeness within the description. For example, when arguing about the attractions offered for children, things like a playground, a sandbox or the availability of a baby-sitter will be enumerated together. Analogously, the same is true for recreation equipment like a sauna, a steam bath or an infrared cabin. To capture this contextual closeness, we use word windows where a particular word i is described by the set of words that appear a fixed number of words before and after word i in the textual description. Given that every word is represented by a unique n-dimensional random vector, the context vector of a word i is built as the concatenation of the averages of all words preceding as well as succeeding word i. Technically speaking, an n × N-dimensional vector xi representing word i is a concatenation of vectors xi^(dj) denoting the mean vectors of terms occurring at the set of displacements {d1, . . . , dN} of the term, as given in Equation 1. Consequently, the dimensionality of xi is n × N. This kind of representation has the effect that words appearing in similar contexts are represented by similar vectors in a high-dimensional space.

xi = [ xi^(d1), . . . , xi^(dN) ]    (1)

With this method, a statistical model of word contexts is created. Consider, for example, the term Skifahren (skiing). The set of words occurring directly before the term at displacement −1 consists of words like Langlaufen (cross-country skiing), Rodeln (toboggan), Pulverschnee (powder snow) or Winter, to name but a few. By averaging the respective vectors representing these terms, a statistical model of word contexts is created.
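A minimal numpy sketch (illustrative; the dimensionality, the token list and the window handling are simplified assumptions, not the authors' implementation) shows how such context vectors can be assembled following Equation 1.

    import numpy as np

    def context_vectors(tokens, n=90, displacements=(-2, -1, 1, 2), seed=0):
        """Build x_i by concatenating, for each displacement d, the mean of the
        random vectors of the words observed at offset d from word i (Equation 1)."""
        rng = np.random.default_rng(seed)
        vocab = sorted(set(tokens))
        rand = {w: rng.normal(size=n) for w in vocab}    # quasi-orthogonal for large n
        ctx = {}
        for w in vocab:
            positions = [i for i, t in enumerate(tokens) if t == w]
            parts = []
            for d in displacements:
                vecs = [rand[tokens[i + d]] for i in positions if 0 <= i + d < len(tokens)]
                parts.append(np.mean(vecs, axis=0) if vecs else np.zeros(n))
            ctx[w] = np.concatenate(parts)               # dimensionality n x N
        return ctx

    tokens = "Pulverschnee Skifahren Winter Rodeln Skifahren Langlaufen".split()
    print(context_vectors(tokens, n=10)["Skifahren"].shape)   # (40,) here, i.e. n x N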
3.2 Self-Organizing Map Algorithm
The self-organizing map (SOM) [7, 8] is an unsupervised neural network providing a mapping from a high-dimensional input space to a usually two-dimensional output space while preserving topological relations as faithfully as possible. The SOM consists of a set of units arranged in a two-dimensional grid, with a weight vector mi ∈ ℝⁿ attached to each unit i. Data from the high-dimensional input space, referred to as input vectors x ∈ ℝⁿ, are presented to the SOM and the activation of each unit for the presented input vector is calculated using an activation function. Commonly, the Euclidean distance between the weight vector of the unit and the input vector serves as the activation function, i.e. the smaller the Euclidean distance, the higher the activation. In the next step the weight vector of the unit showing the highest activation is selected as the winner and is modified so as to more closely resemble the presented input vector. Pragmatically speaking, the weight vector of the winner is moved towards the presented input by a certain fraction of the Euclidean distance, as indicated by a time-decreasing learning rate α(t), as shown in Equation 2.
mi(t + 1) = mi(t) + α(t) · hci(t) · [x(t) − mi(t)]    (2)
Thus, this unit's activation will be even higher the next time the same input signal is presented. Furthermore, the weight vectors of units in the neighborhood of the winner are modified accordingly, as described by a neighborhood function hci(t) (cf. Equation 3), yet to a lesser extent than for the winner. The strength of adaptation depends on the Euclidean distance ||rc − ri|| between the winner c and a unit i regarding their respective locations rc, ri ∈ ℝ² on the two-dimensional map, and on a time-decreasing parameter σ.

hci(t) = exp( − ||rc − ri||² / (2 · σ²(t)) )    (3)

Starting with a rather large neighborhood for a general organization of the weight vectors, this learning procedure finally leads to a fine-grained, topologically ordered mapping of the presented input signals. Similar input data are mapped onto neighboring regions on the map.
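The update rules of Equations 2 and 3 translate almost directly into code. The following numpy sketch (illustrative; the linear decay schedules for α and σ and the map size are assumptions, not the settings of the paper) trains a small map on a matrix of input vectors.

    import numpy as np

    def train_som(data, rows=10, cols=10, epochs=20, alpha0=0.5, sigma0=3.0, seed=0):
        rng = np.random.default_rng(seed)
        weights = rng.normal(size=(rows * cols, data.shape[1]))
        coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
        steps, t = epochs * len(data), 0
        for _ in range(epochs):
            for x in data:
                alpha = alpha0 * (1 - t / steps)          # time-decreasing learning rate
                sigma = sigma0 * (1 - t / steps) + 1e-3   # time-decreasing neighborhood width
                winner = np.argmin(np.linalg.norm(weights - x, axis=1))   # highest activation
                d2 = np.sum((coords - coords[winner]) ** 2, axis=1)
                h = np.exp(-d2 / (2 * sigma ** 2))        # Equation 3
                weights += alpha * h[:, None] * (x - weights)             # Equation 2
                t += 1
        return weights.reshape(rows, cols, -1)

    # e.g. train_som(np.stack(list(context_vectors(tokens).values())))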
4 Experiments
4.1 Data
The data provided by Tiscover consist, on the one hand, of structured information as described in Section 2, and, on the other hand, of free-form texts describing the accommodations. Because accommodation providers themselves enter the data into the system, the descriptions vary in length and style and are not uniform or even quality-controlled regarding spelling. HTML tags, which are allowed to format the descriptions, had to be removed to obtain plain-text files for further processing. For the experiments presented hereafter, we used the German descriptions of the accommodations since they are more comprehensive than the English ones. Especially small and medium-sized accommodations provide only a very rudimentary English description, many being far from correctly spelled. It has been shown with a text collection consisting of fairy tales that, with free-form text documents, the word categories dominate the cluster structure of such a map [5]. To create semantic maps primarily reflecting the semantic similarity of words rather than categorizing word classes, we removed words other than nouns and proper names. Therefore, we used the characteristic, unique to the German language, that nouns start with a capital letter to filter the nouns and proper names occurring in the texts. Obviously, using this method, some other words like adjectives, verbs or adverbs at the beginning of sentences or in improperly written documents are inadvertently included as well. Conversely, some nouns can be missed, too. A different method of determining nouns or other relevant word classes, especially for languages other than German, would be part-of-speech (POS) taggers. But even
Die Ferienwohnung Lage Stadtrand Wien Bezirk Mauer In Gehminuten Schnellbahn Fahrminuten Wien Mitte Stadt Die Wohnung Wohn Eßraum Kamin SAT TV Küche Geschirrspüler Schlafzimmer Einzelbetten Einbettzimmer Badezimmer Wanne Doppelwaschbecken Dusche Extra WC Terrasse Sitzgarnitur Ruhebetten Die Ferienwohnung Aufenthalt Wünsche
Fig. 1. A sample description of a holiday flat in a suburb of Vienna after removing almost all words not being nouns or proper names the(fem.) , holiday flat, location, outskirts, Vienna, district, Mauer, in, minutes to walk, urban railway, minutes to drive, Wien Mitte (station name), city, the(fem.) , flat, living, dining room, fireplace, satellite tv, kitchen, dishwasher, sleeping room, single beds, single-bed room, bathroom, bathtub, double washbasin, shower, separate toilet, terrace, chairs and table, couches, the(fem.) , holiday flat, stay, wishes
Fig. 2. English translation of the description shown in Figure 1
state-of-the-art POS taggers do not reach an accuracy of 100% [10]. For the rest of this section, the numbers and figures presented, refer to the already preprocessed documents, if not stated otherwise. The collection consists of 12,471 documents with a total number of 481,580 words, i.e. on average, a description contains about 39 words. For the curious reader we shall note that not all of the 13,117 accommodations in the database provide a textual description. The vocabulary of the document collection comprises 35,873 unique terms, but for the sake of readability of the maps we reduced the number of terms by excluding those occurring less than ten times in the whole collection. Consequently, we used 3,662 terms for creating the semantic maps. In Figure 1, a natural language description of a holiday flat in Vienna is shown. Beginning with the location of the flat, the accessibility by public transport is mentioned, followed by some terms describing the dining and living room together with enumerations of the respective furniture and fixtures. Other parts of the flat are the sleeping room, a single bedroom and the bathroom. In this particular example, the only words not being nouns or proper names are the determiner Die and the preposition In at the beginning of sentences. For the sake of convenience, we have provided an English translation in Figure 2. 4.2
Semantic Map
For encoding the terms we have chosen 90-dimensional random vectors. The vectors used for training the semantic map depicted in Figure 3 were created by using a context window of length four, i.e. two words before and two words after a term. But instead of treating all four sets of context terms separately, we have put terms at displacements −2 and −1 as well as those at displacements +1 and +2 together. Then the average vectors of both sets were calculated and
274
Michael Dittenbach et al.
finally concatenated to create the 180-dimensional context vectors. Further experiments have shown that this setting yielded the best result. For example, using a context window of length four but considering all displacements separately, i.e. the final context vector has length 360, has led to a map where the clusters were not as coherent as on the map shown below. A smaller context window of length two, taking only the surrounding words at displacements −1 and +1 into account, had a similar effect. This indicates that the amount of text available for creating such a statistical model is crucial for the quality of the resulting map. By subsuming the context words at displacements before as well as after the word, the disadvantage of having an insufficient amount of text can be alleviated, because having twice the number of contexts with displacements −1 and +1 is simulated. Due to the enumerative nature of the accommodation descriptions, the exact position of the context terms can be disregarded. The self-organizing map depicted in Figure 3 consists of 20 × 20 units. Due to space considerations, only a few clusters can be detailed in this description and enumerations of terms in a cluster will only be exemplary. The semantic clusters shaded gray have been determined by manual inspection. They consist of very homogeneous sets of terms related to distinct aspects of the domain. The parts of the right half of the map that have not been shaded, mainly contain proper names of places, lakes, mountains, cities or accommodations. However, it shall be noted, that e.g. names of lakes or mountains are homogeneously grouped in separate clusters. In the upper left corner, mostly verbs, adverbs, adjectives or conjunctions are located. These are terms that have been inadvertently included in the set of relevant terms as described in the previous subsection. In the upper part of the map, a cluster containing terms related to pricing, fees and reductions can be found. Other clusters in this area predominantly deal with words describing types of accommodation and, in the top-right corner a strong cluster of accommodation names can be found. On the right-hand border of the map, geographical locations, such as central, outskirts, or close to a forest have been mapped, and a cluster containing skiing- and mountaineering-related terms is also located there. A dominant cluster containing words that describe room types, furnishing and fixtures can be found in the lower left corner of the map. The cluster labeled advertising terms in the bottom-right corner of the map, predominately contains words that are found at the beginning of the documents where the pleasures awaiting the potential customer are described. Interesting inter-cluster relations showing the semantic ordering of the terms can be found in the bottom part of the map. The cluster labeled farm contains terms describing, amongst other things, typical goods produced on farms like, organic products, jam, grape juice or schnaps. In the upper left corner of the cluster, names of farm animals (e.g. pig, cow, chicken) as well as animals usually found in a petting zoo (e.g. donkey, dwarf goats, cats, calves) are located. This cluster describing animals adjoins a cluster primarily containing terms related
Using a Connectionist Approach for Enhancing Domain Ontologies
verbs
types of prices, rates,fees reductions adjectives adverbs conjunctions determiner
types of private accomm.
275
proper names (accomm.)
proper names (farms)
group travel
swimming location
proper names wellness
view types of travelers
sports outdoor sports games children
skiing animals
mountaineering
kitchen advertising terms
farm room types, furnishing and fixtures food proper names (cities)
Fig. 3. A self-organizing semantic map of terms in the tourism domain with labels denoting general semantic clusters. The cluster boundaries have been drawn manually
to children, toys and games. Some terms are playroom, tabletop soccer, sandbox and volleyball, to name but a few. This representation of a domain vocabulary supports the construction and enrichment of domain ontologies by making relevant concepts and their relations evident. To provide an example, we found a wealth of terms describing sauna-like recreational facilities having in common that the vacationer sojourns in a closed room with a well-tempered atmosphere, e.g. sauna, tepidarium, bio sauna, herbal sauna, Finnish sauna, steam sauna, thermarium or infrared cabin. On the one hand, major semantic categories identified by inspecting and evaluating the semantic map can be used as a basis for a top-down ontology engineering approach. On the other hand, the clustered terms, extracted from domain-relevant documents, can be used for bottom-up engineering of an existing ontology.
5 Conclusions
In this paper, we have presented a method, based on the self-organizing map, to support the construction and enrichment of domain ontologies. The words occurring in free-form text documents from the application domain are clustered according to their semantic similarity based on statistical context analysis. More precisely, we have shown that when a word is described by the words that appear within a fixed-size context window, semantic relations of words unfold in the self-organizing map. Thus, words that refer to similar objects can be found in neighboring parts of the map. The two-dimensional map representation provides an intuitive interface for browsing through the vocabulary to discover new concepts or relations between concepts that are still missing in the ontology. We illustrated this approach with an example from the tourism domain. The clustering results revealed a number of relevant tourism-related terms that can now be integrated into the ontology to provide better retrieval results when searching for accommodations. We achieved this by analyzing self-descriptions written by accommodation providers, thus substantially assisting the costly and time-consuming process of ontology engineering.
References
[1] M. Dittenbach, D. Merkl, and H. Berger. What customers really want from tourism information systems but never dared to ask. In Proc. of the 5th Int'l Conference on Electronic Commerce Research (ICECR-5), Montreal, Canada, 2002.
[2] M. Dittenbach, D. Merkl, and H. Berger. A natural language query interface for tourism information. In A. J. Frew, M. Hitz, and P. O'Connor, editors, Proceedings of the 10th International Conference on Information Technologies in Tourism (ENTER 2003), pages 152–162, Helsinki, Finland, 2003. Springer-Verlag.
[3] W. Frakes, R. Prieto-Díaz, and C. Fox. DARE: Domain analysis and reuse environment. Annals of Software Engineering, Kluwer, 5:125–141, 1998.
[4] T. Honkela. Self-Organizing Maps in Natural Language Processing. PhD thesis, Helsinki University of Technology, 1997.
[5] T. Honkela, V. Pulkki, and T. Kohonen. Contextual relations of words in Grimm tales, analyzed by self-organizing map. In F. Fogelman-Soulie and P. Gallinari, editors, Proceedings of the International Conference on Artificial Neural Networks (ICANN 1995), pages 3–7, Paris, France, 1995. EC2 et Cie.
[6] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. WEBSOM – self-organizing maps of document collections. Neurocomputing, Elsevier, 21:101–117, November 1998.
[7] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 1982.
[8] T. Kohonen. Self-organizing maps. Springer-Verlag, Berlin, 1995.
[9] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela. Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3):574–585, May 2000.
[10] C. Manning and H. Schütze. Foundations of statistical natural language processing. MIT Press, 2000.
[11] R. Prieto-Díaz. A faceted approach to building ontologies. In S. Spaccapietra, S. T. March, and Y. Kambayashi, editors, Proc. of the 21st Int'l Conf. on Conceptual Modeling (ER 2002), LNCS, Tampere, Finland, 2002. Springer-Verlag.
[12] B. Pröll, W. Retschitzegger, R. Wagner, and A. Ebner. Beyond traditional tourism information systems – TIScover. Information Technology and Tourism, 1, 1998.
[13] H. Ritter and T. Kohonen. Self-organizing semantic maps. Biological Cybernetics, 61(4):241–254, 1989.
Parameterless Data Compression and Noise Filtering Using Association Rule Mining
Yew-Kwong Woon1, Xiang Li2, Wee-Keong Ng1, and Wen-Feng Lu2,3
1 Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore
2 Singapore Institute of Manufacturing Technology, 71 Nanyang Drive, Singapore 638075, Singapore
3 Singapore-MIT Alliance
Abstract. The explosion of raw data in our information age necessitates the use of unsupervised knowledge discovery techniques to understand mountains of data. Cluster analysis is suitable for this task because of its ability to discover natural groupings of objects without human intervention. However, noise in the data greatly affects clustering results. Existing clustering techniques use density-based, grid-based or resolution-based methods to handle noise, but they require the fine-tuning of complex parameters. Moreover, for high-dimensional data that cannot be visualized by humans, this fine-tuning process is greatly impaired. There are several noise/outlier detection techniques, but they too need suitable parameters. In this paper, we present a novel parameterless method of filtering noise using ideas borrowed from association rule mining. We term our technique FLUID (FiLtering Using Itemset Discovery). FLUID automatically discovers representative points in the dataset without any input parameter by mapping the dataset into a form suitable for frequent itemset discovery. After frequent itemsets are discovered, they are mapped back to their original form and become representative points of the original dataset. As such, FLUID accomplishes both data and noise reduction simultaneously, making it an ideal preprocessing step for cluster analysis. Experiments involving a prominent synthetic dataset prove the effectiveness and efficiency of FLUID.
1 Introduction
The information age was hastily ushered in by the birth of the World Wide Web (Web) in 1990. All of a sudden, an abundance of information, in the form of web pages and digital libraries, was available at the fingertips of anyone who was connected to the Web. Researchers from the Online Computer Library Center found that there were 7 million unique sites in the year 2000, and the Web was predicted to continue its fast expansion [1]. Data mining becomes important because traditional statistical techniques are no longer feasible for handling such immense data. Cluster analysis, or clustering, becomes the data mining technique of choice because of its ability to function with little human supervision. Clustering is the process of grouping a set of physical/abstract objects
into classes of similar objects. It has been found to be useful for a wide variety of applications such as web usage mining [2], manufacturing [3], personalization of web pages [4] and digital libraries [5]. Researchers have begun to analyze traditional clustering techniques in an attempt to adapt them to current needs. One such technique is the classic k-means algorithm [6]. It is fast but very sensitive to the parameter k and to noise. Recent clustering techniques that attempt to handle noise more effectively include density-based techniques [7], grid-based techniques [8] and resolution-based techniques [9, 10]. However, all of them require the fine-tuning of complex parameters to remove the adverse effects of noise. Empirical studies show that many adjustments need to be made and an optimal solution is not always guaranteed [10]. Moreover, for high-dimensional data that cannot be visualized by humans, this fine-tuning process is greatly impaired. Since most data, such as digital library documents, web logs and manufacturing specifications, have many features or dimensions, this shortcoming is unacceptable. There are also several works on outlier/noise detection, but they too require the setting of non-intuitive parameters [11, 12]. In this paper, we present a novel unsupervised method of filtering noise using ideas borrowed from association rule mining (ARM) [13]. We term our technique FLUID (FiLtering Using Itemset Discovery). FLUID first maps the dataset into a set of items using binning. Next, ARM is applied to it to discover frequent itemsets. As there has been sustained intense interest in ARM since its conception in 1993, ARM algorithms have improved by leaps and bounds. Any ARM algorithm can be used by FLUID, and this allows the leveraging of the efficiency of the latest ARM methods. After frequent itemsets are found, they are mapped back to become representative points of the original dataset. This capability of FLUID not only eliminates the problematic need for noise removal in existing clustering algorithms but also improves their efficiency and scalability because the size of the dataset is significantly reduced. Experiments involving a prominent synthetic dataset prove the effectiveness and efficiency of FLUID. The rest of the paper is organized as follows. The next section reviews related work in the areas of clustering, outlier detection and ARM, while Section 3 presents the FLUID algorithm. Experiments are conducted on both real and synthetic datasets to assess the feasibility of FLUID in Section 4. Finally, the paper is concluded in Section 5.
2 Related Work
In this section, we review prominent works in the areas of clustering and outlier detection. The problem of ARM and its representative algorithms are discussed as well.
2.1 Clustering and Outlier Detection
The k-means algorithm is the pioneering algorithm in clustering [6]. It begins by randomly generating k cluster centers known as centroids. Objects are iteratively
assigned to the cluster where the distance between itself and the cluster's centroid is the shortest. It is fast but sensitive to the parameter k and noise. Density-based methods are more noise-resistant and are based on the notion that dense regions are interesting regions. DBSCAN (Density Based Spatial Clustering of Applications with Noise) is the pioneering density-based technique [7]. It uses two input parameters to define what constitutes the neighborhood of an object and whether its neighborhood is dense enough to be considered. Grid-based techniques can also handle noise. They partition the search space into a number of cells/units and perform clustering on such units. CLIQUE (CLustering In QUEst) considers a unit to be dense if the number of objects in it exceeds a density threshold and uses an apriori-like technique to iteratively derive higher-dimensional dense units. CLIQUE requires the user to specify a density threshold and the size of grids. Recently, resolution-based techniques have been proposed and applied successfully on noisy datasets. The basic idea is that when viewed at different resolutions, the dataset reveals different clusters, and by visualization or change detection of certain statistics, the correct resolution at which noise is minimal can be chosen. WaveCluster is a resolution-based algorithm that uses wavelet transformation to distinguish clusters from noise [9]. Users must first determine the best quantization scheme for the dataset and then decide on the number of times to apply the wavelet transform. The TURN* algorithm is another recent resolution-based algorithm [10]. It iteratively scales the data to various resolutions. To determine the ideal resolution, it uses the third differential of the series of cluster feature statistics to detect an abrupt change in the trend. However, it is unclear how certain parameters such as the closeness threshold and the step size of resolution scaling are chosen. Outlier detection is another means of tackling noise. One classic notion is that of DB (Distance-Based) outliers [11]. An object is considered to be a DB-outlier if a certain fraction f of the dataset lies at a distance greater than D from it. A recent enhancement of it involves the use of the concept of k-nearest neighbors [12]; the top n points with the largest Dk (the distance of the k-th nearest neighbor of a point) are treated as outliers. The parameters f, D, k, n must be supplied by the user. In summary, there is currently no ideal solution to the problem of noise, and existing clustering algorithms require much parameter tweaking, which becomes difficult for high-dimensional datasets. Even if somehow their parameters can be optimally set for a particular dataset, there is no guarantee that the same settings will work for other datasets. The problem is similar in the area of outlier detection.
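As a concrete illustration of the DB(f, D)-outlier notion just described, the following is a brute-force sketch; the function name and the parameter values are illustrative placeholders, not the formulation of [11].

```python
# Brute-force sketch of DB(f, D)-outlier detection as described above;
# parameter values are placeholders, not recommendations.
import numpy as np

def db_outliers(points, f=0.95, d=2.0):
    """Return indices of objects for which at least a fraction f of the
    dataset lies at distance greater than d."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    outliers = []
    for i, p in enumerate(points):
        dist = np.linalg.norm(points - p, axis=1)
        far = np.count_nonzero(dist > d) / (n - 1) if n > 1 else 0.0
        if far >= f:
            outliers.append(i)
    return outliers
```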
2.2 Association Rule Mining
Since the concept of ARM is central to FLUID, we formally define ARM and then survey existing ARM algorithms in this section. A formal description of ARM is as follows. Let the universal itemset I = {a1, a2, ..., aU} be a set of literals called items. Let Dt be a database of transactions, where each transaction T contains a set of items such that T ⊆ I. A j-itemset is a set of j unique items.
For a given itemset X ⊆ I and a given transaction T, T contains X if and only if X ⊆ T. Let ψX be the support count of an itemset X, which is the number of transactions in Dt that contain X. Let s be the support threshold and |Dt| be the number of transactions in Dt. An itemset X is frequent if ψX ≥ |Dt| × s%. An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The association rule X ⇒ Y holds in Dt with confidence c% if no less than c% of the transactions in Dt that contain X also contain Y. The association rule X ⇒ Y has support s% in Dt if ψX∪Y = |Dt| × s%. The problem of mining association rules is to discover rules that have confidence and support greater than the thresholds. It consists of two main tasks: the discovery of frequent itemsets and the generation of association rules from frequent itemsets. Researchers usually tackle the first task only because it is more computationally expensive. Hence, current algorithms are designed to efficiently discover frequent itemsets. We will leverage the ability of ARM algorithms to rapidly discover frequent itemsets in FLUID. Introduced in 1994, the Apriori algorithm is the first successful algorithm for mining association rules [13]. Since its introduction, it has popularized ARM. It introduces a method to generate candidate itemsets in a pass using only frequent itemsets from the previous pass. The idea, known as the apriori property, rests on the fact that any subset of a frequent itemset must be frequent as well. The FP-growth (Frequent Pattern-growth) algorithm is a recent ARM algorithm that achieves impressive results by removing the need to generate candidate itemsets, which is the main bottleneck in Apriori [14]. It uses a compact tree structure called a Frequent Pattern-tree (FP-tree) to store information about frequent itemsets. This compact structure also removes the need for multiple database scans, and it is constructed using only two scans. The items in the transactions are first sorted and then used to construct the FP-tree. Next, FP-growth proceeds to recursively mine FP-trees of decreasing size to generate frequent itemsets. Recently, we presented a novel trie-based data structure known as the Support-Ordered Trie ITemset (SOTrieIT) to store support counts of 1-itemsets and 2-itemsets [15, 16]. The SOTrieIT is designed to be used efficiently by our algorithm, FOLDARM (Fast OnLine Dynamic Association Rule Mining) [16]. In our recent work on ARM, we propose a new algorithm, FOLD-growth (Fast OnLine Dynamic-growth), which is an optimized hybrid version of FOLDARM and FP-growth [17]. FOLD-growth first builds a set of SOTrieITs from the database and uses them to prune the database before building FP-trees. FOLD-growth is shown to outperform FP-growth by up to two orders of magnitude.
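To fix ideas, the sketch below is a minimal level-wise (Apriori-style) frequent-itemset miner that illustrates the support threshold and the apriori property. It is for exposition only and is not the SOTrieIT/FOLD-growth implementation used by the authors.

```python
# Minimal level-wise frequent-itemset miner illustrating the support threshold
# and the apriori property. A sketch for exposition, not the authors' implementation.
from itertools import combinations

def frequent_itemsets(transactions, s=0.5):
    """Return {itemset: support_count} for all itemsets with support >= s."""
    transactions = [frozenset(t) for t in transactions]
    min_count = s * len(transactions)
    counts = {}
    for t in transactions:                      # count 1-itemsets
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {x: c for x, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = list(frequent)
        # join frequent (k-1)-itemsets, then prune using the apriori property
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result

# Example: frequent_itemsets([{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}], s=0.5)
```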
3 Filtering Using Itemset Discovery (FLUID)
3.1 Algorithm
Given a d-dimensional dataset Do consisting of n objects o1, o2, ..., on, FLUID discovers a set of representative objects O1, O2, ..., Om, where m ≪ n, in three main steps:
1. Convert dataset Do into a transactional database Dt using procedure MapDB
2. Mine Dt for frequent itemsets using procedure MineDB
3. Convert the discovered frequent itemsets back to their original object form using procedure MapItemset

Procedure MapDB
1. Sort each dimension of Do in ascending order
2. Compute the mean µx and standard deviation σx of the nearest-object distance in each dimension x by checking the left and right neighbors of each object
3. Find the range of values rx for each dimension x
4. Compute the number of bins βx for each dimension x: βx = rx / ((µx + 3 × σx) × 0.005 × n)
5. Map each bin to a unique item a ∈ I
6. Convert each object oi in Do into a transaction Ti with exactly d items by binning its feature values, yielding a transactional database Dt
Procedure MapDB tries to discretize the features of dataset Do in a way that minimizes the number of required bins without losing the pertinent structural information of Do. Every dimension has its own distribution of values and thus it is necessary to compute the bin sizes of each dimension/feature separately. Discretization is itself a massive area, but experiments reveal that MapDB is good enough to remove noise efficiently and effectively. To understand the data distribution in each dimension, the mean and standard deviation of the closest-neighbor distance of every object in every dimension are computed. Assuming that all dimensions follow a Normal distribution, an object should have one neighboring object within three standard deviations of the mean nearest-neighbor distance. To avoid having too many bins, there is a need to ensure that each bin would contain a certain number of objects (0.5% of the dataset size), and this is accomplished in step 4. In the event that the values are spread out too widely, i.e. the standard deviation is much larger than the mean, the number of standard deviations used in step 4 is reduced to 1 instead of 3. Note that if a particular dimension has fewer than 100 unique values, steps 2-4 would be unnecessary and the number of bins would be the number of unique values. As mentioned in step 6, each object becomes a transaction with exactly d items because each item represents one feature of the object. The transactions do not have duplicated items because every feature has its own unique set of bins. Once Do is mapped into transactions with unique items, it is in a form that can be mined by any association rule mining algorithm.
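A minimal sketch of this discretization step follows, assuming the dataset is a numpy array of shape (n, d). The helper names are ours, and the low-cardinality shortcut and the sigma-much-larger-than-mean fallback from the text appear only in simplified form.

```python
# Sketch of the MapDB discretization step. D is assumed to be a numpy array of
# shape (n, d); helper names and the simplified shortcuts are our assumptions.
import numpy as np

def map_db(D):
    n, d = D.shape
    bin_edges, transactions = [], []
    next_item = 0
    for x in range(d):
        col = np.sort(D[:, x])                          # step 1: sort the dimension
        gaps = np.diff(col)
        nearest = np.minimum(np.append(gaps, np.inf),
                             np.insert(gaps, 0, np.inf))  # step 2: nearest-neighbor distances
        finite = nearest[np.isfinite(nearest)]
        mu, sigma = finite.mean(), finite.std()
        k = 3 if sigma <= mu else 1                     # fallback when values are spread out
        rng = col[-1] - col[0]                          # step 3: range of values
        width = (mu + k * sigma) * 0.005 * n            # step 4: target bin width
        n_bins = max(int(np.ceil(rng / width)), 1) if width > 0 else 1
        if len(np.unique(col)) < 100:                   # shortcut for low-cardinality dimensions
            n_bins = len(np.unique(col))
        edges = np.linspace(col[0], col[-1], n_bins + 1)
        bin_edges.append((edges, next_item))            # step 5: each bin maps to a unique item id
        next_item += n_bins
    for obj in D:                                        # step 6: one transaction per object
        t = []
        for x in range(d):
            edges, base = bin_edges[x]
            b = min(np.searchsorted(edges, obj[x], side="right") - 1, len(edges) - 2)
            t.append(base + max(b, 0))
        transactions.append(frozenset(t))
    return transactions, bin_edges
```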
Procedure MineDB
1. Set support threshold s = 0.1 (10%)
2. Set the number of required frequent d-itemsets k = n
3. Let δ(A, B) be the distance between two j-itemsets A = (a1, ..., aj) and B = (b1, ..., bj): δ(A, B) = Σi=1..j (ai − bi)
4. An itemset A is a loner itemset if δ(A, Z) > 1 for all Z ∈ L with Z ≠ A
5. Repeat
6.   Repeat
7.     Use an association rule mining algorithm to discover a set of frequent itemsets L from Dt
8.     Remove itemsets with fewer than d items from L
9.     Adjust s using a variable step size to bring |L| closer to k
10.   Until |L| = k or |L| stabilizes
11.   Set k = (1/2)|L|
12.   Set s = 0.1
13.   Remove loner itemsets from L
14. Until an abrupt change in the number of loner itemsets occurs
MineDB is the most time-consuming and complex step of FLUID. The key idea here is to discover the optimal set of frequent itemsets that represents the important characteristics of the original dataset; we consider important characteristics to be dense regions in the original dataset. In this case, the support threshold s is akin to the density threshold used by density-based clustering algorithms and thus can be used to remove regions with low density (itemsets with low support counts). The crucial point here is how to automate the fine-tuning of s. This is done by checking the number of loner itemsets after each iteration (steps 6-14). Loner itemsets represent points with no neighboring points in the discretized d-dimensional feature space. Therefore, an abrupt change in the number of loner itemsets indicates that the current support threshold value has been reduced to a point where dense regions in the original dataset are being divided into too many sparse regions. This point is made more evident in Section 4, where its effect can be visually observed. The number of desired frequent d-itemsets (frequent itemsets with exactly d items), k, is initially set to the size of the original dataset, as seen in step 2. The goal is to obtain the finest resolution of the dataset that is attainable after its transformation. The algorithm then proceeds to derive coarser resolutions in an exponential fashion in order to quickly discover a good representation of the original dataset. This is done at step 11, where k is reduced to half of |L|. The amount of reduction can certainly be lowered to get more resolutions, but this would incur longer processing time and may not be necessary. Experiments have revealed that our choice suffices for a good approximation of the representative points of a dataset. In step 8, notice that itemsets with fewer than d items are removed. This is because association rule mining discovers frequent itemsets of various sizes, but we are only interested in frequent itemsets containing items that represent all the features of the dataset. In step 9, the support threshold s is incremented/decremented by a variable step size. The step size is variable as it must be made smaller in order to zoom in on the best possible s to obtain the required number of frequent d-itemsets, k. In most situations, it is quite unlikely that |L| can be adjusted to equal k exactly; thus, if |L| stabilizes or fluctuates between similar values, its closest approximation to k is considered the best solution, as seen in step 10.

Procedure MapItemset
1. For each frequent itemset A ∈ L do
2.   For each item i ∈ A do
3.     Assign the center of the bin represented by i as its new value
4.   End for
5. End for
The final step of FLUID is the simplest: it involves mapping the frequent itemsets back to their original object form. The filtered dataset now contains representative points of the original dataset, excluding most of the noise. Note that the filtering is only an approximation, but it is sufficient to remove most of the noise in the data and retain the pertinent structural characteristics of the data. Subsequent data mining tasks such as clustering can then be used to extract knowledge from the filtered and compressed dataset efficiently, with little complication from noise. Note also that the types of clusters discovered depend mainly on the clustering algorithm used and not on FLUID.
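A condensed sketch of MineDB and MapItemset is given below. It assumes the `frequent_itemsets()` miner and the `map_db()` output sketched earlier; the step-size schedule, the interpretation of δ as a sum of absolute item-id differences, and the "abrupt change" test are simplified placeholders, not the authors' exact heuristics.

```python
# Condensed sketch of MineDB + MapItemset, building on the earlier sketches.
# The schedules and the abrupt-change test are placeholders, not the paper's heuristics.

def loner_count(itemsets):
    """Count itemsets with no neighbor at distance <= 1 (sum of absolute item-id differences)."""
    def dist(a, b):
        return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b)))
    return sum(1 for a in itemsets
               if all(dist(a, z) > 1 for z in itemsets if z != a))

def mine_db(transactions, d, n):
    s, k = 0.1, n
    prev_loners, best = None, []
    while k > 0:
        for _ in range(20):                       # bounded stand-in for "until |L| stabilizes"
            L = [x for x in frequent_itemsets(transactions, s) if len(x) == d]
            if len(L) == k:
                break
            s *= 1.1 if len(L) > k else 0.9       # variable step size, simplified
        loners = loner_count(L)
        if prev_loners is not None and loners > 2 * max(prev_loners, 1):
            break                                  # placeholder "abrupt change" test
        best, prev_loners = L, loners
        k, s = len(L) // 2, 0.1                    # steps 11-12: halve k, reset s
    return best

def map_itemset(L, bin_edges):
    """MapItemset: replace each item by the center of the bin it represents."""
    centers = {}
    for edges, base in bin_edges:
        for b in range(len(edges) - 1):
            centers[base + b] = (edges[b] + edges[b + 1]) / 2.0
    return [tuple(centers[i] for i in sorted(itemset)) for itemset in L]
```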
3.2 Complexity Analysis
The following are the time complexities of the three main steps of FLUID:
1. MapDB: The main processing time is taken by step 1 and hence its time complexity is O(n log n).
2. MineDB: As the total number of iterations used by the loops in the procedure is very small, the bulk of the processing time is attributed to the time to perform association rule mining, given by TA.
3. MapItemset: The processing time depends on the number of resultant representative points |L| and thus it has a time complexity of O(n).
Hence, the overall time complexity of FLUID is O(n log n + TA + n).
3.3 Strengths and Weaknesses
The main strength of FLUID is its independence from user-supplied parameters. Unlike its predecessors, FLUID does not require any human supervision. Not only does it remove noise/outliers, it also compresses the dataset into a set of representative points without any loss of pertinent structural information of the original dataset. In addition, it is reasonably scalable with respect to both the size and dimensionality of the dataset as it inherits the efficient characteristics of existing association rule mining algorithms. Hence, it is an attractive preprocessing tool for clustering or other data mining tasks. Ironically, its weakness also stems from its use of association rule mining techniques. This is because association rule mining algorithms do not scale as well as resolution-based algorithms in terms of dataset dimensionality. Fortunately, since ARM is still receiving much attention from the research community, it is possible that more efficient ARM algorithms will become available to FLUID. Another weakness is that FLUID spends much redundant processing time in finding and storing frequent itemsets that have fewer than d items. This problem is inherent in association rule mining because larger frequent itemsets are usually formed from smaller frequent itemsets. Efficiency and scalability can certainly be improved greatly if there is a way to directly discover frequent d-itemsets.

Fig. 1. Results of executing FLUID on a synthetic dataset (panels (a)-(d)).
4 Experiments
This section evaluates the viability of FLUID by conducting experiments on a Pentium-4 machine with a CPU clock rate of 2 GHz and 1 GB of main memory. We shall use FOLD-growth as our ARM algorithm in our experiments as it is fast, incremental and scalable [17]. All algorithms are implemented in Java.
The synthetic dataset (named t7.10k.dat) used here tests the ability of FLUID to discover clusters of various sizes and shapes amidst much noise; it has been used as a benchmarking test for several clustering algorithms [10]. It has been shown that prominent algorithms like k-means [6], DBSCAN [7], CHAMELEON [18] and WaveCluster [9] are unable to properly find the nine visually obvious clusters and remove noise, even with exhaustive parameter adjustments [10]. Only TURN* [10] manages to find the correct clusters, but it requires user-supplied parameters, as mentioned in Section 2.1. Figure 1(a) shows the dataset with 10,000 points in nine arbitrarily shaped clusters interspersed with random noise. Figure 1 shows the results of running FLUID on the dataset. FLUID stops at the iteration when Figure 1(c) is obtained, but we show the rest of the results to illustrate the effect of loner itemsets. It is clear that Figure 1(c) is the optimal result, as most of the noise is removed while the nine clusters remain intact. Figure 1(d) loses much of the pertinent information of the dataset. The numbers of loner itemsets for Figures 1(b), (c) and (d) are 155, 55 and 136 respectively. Figure 1(b) has the most loner itemsets because of the presence of noise in the original dataset. It is the finest representation of the dataset in terms of resolution. There is a sharp drop in the number of loner itemsets in Figure 1(c), followed by a sharp increase in the number of loner itemsets in Figure 1(d). The sharp drop can be explained by the fact that most noise is removed, leaving behind objects that are closely grouped together. In contrast, the sharp increase in loner itemsets is caused by too low a support threshold. This means that only very dense regions are captured, and this causes the disintegration of the nine clusters as seen in Figure 1(d). Hence, a change in the trend of the number of loner itemsets indicates that the structural characteristics of the dataset have changed. FLUID took a mere 6 s to compress the dataset into 1,650 representative points with much of the noise removed. The dataset is reduced by more than 80% without affecting its inherent structure, that is, the shapes of its nine clusters are retained. Therefore, this experiment shows that FLUID can filter away noise even in a noisy dataset with sophisticated clusters, without any user parameters and with impressive efficiency.
5 Conclusions
Clustering is an important data mining task, especially in our information age where raw data is abundant. Several existing clustering methods cannot handle noise effectively because they require the user to set complex parameters properly. We propose FLUID, a noise-filtering and parameterless algorithm based on association rule mining, to overcome the problem of noise as well as to compress the dataset. Experiments on a benchmarking synthetic dataset show the effectiveness of our approach. In our future work, we will improve and provide rigorous proofs of our approach and design a clustering algorithm that can integrate efficiently with FLUID. In addition, the problem of handling high-dimensional datasets will be addressed. Finally, more experiments involving larger datasets with more dimensions will be conducted to affirm the practicality of FLUID.
References
1. Dean, N., ed.: OCLC Researchers Measure the World Wide Web. Number 248. Online Computer Library Center (OCLC) Newsletter (2000)
2. Srivastava, J., Cooley, R., Deshpande, M., Tan, P.N.: Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations 1 (2000) 12-23
3. Gardner, M., Bieker, J.: Data mining solves tough semiconductor manufacturing problems. In: Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, Massachusetts, United States (2000) 376-383
4. Mobasher, B., Dai, H., Luo, T., Nakagawa, M., Sun, Y., Wiltshire, J.: Discovery of aggregate usage profiles for web personalization. In: Proc. Workshop on Web Mining for E-Commerce - Challenges and Opportunities, Boston, MA, USA (2000)
5. Sun, A., Lim, E.P., Ng, W.K.: Personalized classification for keyword-based category profiles. In: Proc. 6th European Conf. on Research and Advanced Technology for Digital Libraries, Rome, Italy (2002) 61-74
6. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability (1967) 281-297
7. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, Oregon (1996) 226-231
8. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: Proc. ACM SIGMOD Conf., Seattle, WA (1998) 94-105
9. Sheikholeslami, G., Chatterjee, S., Zhang, A.: WaveCluster: A wavelet based clustering approach for spatial data in very large databases. VLDB Journal 8 (2000) 289-304
10. Foss, A., Zaiane, O.R.: A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In: Proc. Int. Conf. on Data Mining, Maebashi City, Japan (2002) 179-186
11. Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In: Proc. 24th Int. Conf. on Very Large Data Bases (1998) 392-403
12. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proc. ACM SIGMOD Conf., Dallas, Texas (2000) 427-438
13. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. on Very Large Databases, Santiago, Chile (1994) 487-499
14. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc. ACM SIGMOD Conf., Dallas, Texas (2000) 1-12
15. Das, A., Ng, W.K., Woon, Y.K.: Rapid association rule mining. In: Proc. 10th Int. Conf. on Information and Knowledge Management, Atlanta, Georgia (2001) 474-481
16. Woon, Y.K., Ng, W.K., Das, A.: Fast online dynamic association rule mining. In: Proc. 2nd Int. Conf. on Web Information Systems Engineering, Kyoto, Japan (2001) 278-287
17. Woon, Y.K., Ng, W.K., Lim, E.P.: Preprocessing optimization structures for association rule mining. Technical Report CAIS-TR-02-48, School of Computer Engineering, Nanyang Technological University, Singapore (2002)
18. Karypis, G., Han, E.H., Kumar, V.: Chameleon: Hierarchical clustering using dynamic modeling. Computer 32 (1999) 68-75
Performance Evaluation of SQL-OR Variants for Association Rule Mining*
P. Mishra and S. Chakravarthy
Information and Technology Laboratory and CSE Department, The University of Texas at Arlington, Arlington, TX 76019
{pmishra,sharma}@cse.uta.edu
Abstract. In this paper, we focus on the SQL-OR approaches. We study several additional optimizations for the SQL-OR approaches (Vertical Tid, Gather-join, and Gather count) and evaluate them using DB2 and Oracle RDBMSs. We evaluate the approaches analytically and compare their performance on large data sets. Finally, we summarize the results and indicate the conditions for which the individual optimizations are useful.
1 Introduction
The work on association rule mining started with the development of the AIS algorithm [1] and then some of its modifications, as discussed in [2]. Since then, there have been continuous attempts at improving the performance of these algorithms [3, 4, 5]. However, most of these algorithms are applicable to data present in flat files. SETM [6] showed how data stored in an RDBMS can be mined using SQL and the corresponding performance gain achieved by optimizing these queries. Recent research in the field of mining over databases has been in integrating the mining functions with the database. The Data Mining Query Language DMQL [7] proposed a collection of such operators for classification rules, association rules, etc. [8] proposed the MineRule operator for generating general/clustered/ordered association rules. [9] presents a methodology for tightly-coupled integration of data mining applications with a relational database system. In [10] and [11], the authors have tried to highlight the implications of various architectural alternatives for coupling data mining with relational database systems. Some of the research has focused on the development of SQL-based formulations for association rule mining. Relative performances and all possible combinations for optimizations of the k-way join are addressed in [13, 14]. In this paper, we will analyze the characteristics of these optimizations in detail, both analytically and experimentally. We conclude why certain optimizations are always useful and why some perceived optimizations do not seem to work as intended.
* This work was supported, in part, by NSF grants IIS-0097517, IIS-0123730 and ITR 0121297.
1.1 Focus of This Paper
With more and more use of RDBMSs to store and manipulate data, mining directly on RDBMSs is critical. The goal of this paper is to study all aspects of the basic SQL-OR approaches for association rule mining and then explore additional performance optimizations to them. The other goal of our work is to use the results obtained from mining various relations to make the optimizer mining-aware. Also, the results collected from the performance evaluations of these algorithms are critical for developing a knowledge base that can be used for selecting appropriate approaches as well as optimizations within a given approach. The rest of the paper is organized as follows. Section 3 covers in detail various SQL-OR approaches for support counting and their performance analysis. Section 4 considers the optimizations and reports the main results only, due to space limitations. The details can be found in [13], available on the web. In Section 5 we compile the summary of results obtained from mining various datasets. We conclude and present the future work in Section 6.
2 Association Rules
The problem of association rule mining was formally defined in [2]. In short, it can be stated as: Let I be the collection of all the items and D be the set of transactions. Let T be a single transaction involving some of the items from the set I. An association rule is of the form A ⇒ B (where A and B are itemsets). If the support of itemset AB is 30%, it means "30% of all the transactions contain both itemsets – itemset A and itemset B". And if the confidence of the rule A ⇒ B is 70%, it means "70% of all the transactions that contain itemset A also contain itemset B".
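The tiny computation below illustrates the 30% support / 70% confidence example just given; the transaction counts are made up only so that the numbers work out.

```python
# Tiny illustration of the 30% support / 70% confidence example above,
# with made-up counts chosen only so the numbers work out.
transactions = 100          # |D|
contain_A = 43              # transactions containing itemset A (illustrative)
contain_AB = 30             # transactions containing both A and B

support_AB = contain_AB / transactions        # 0.30 -> 30% support for A => B
confidence_AB = contain_AB / contain_A        # ~0.70 -> 70% confidence for A => B
print(f"support={support_AB:.0%}, confidence={confidence_AB:.0%}")
```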
3 SQL-OR Based Approaches
The nomenclature of these datasets is of the form TxxIyyDzzzK, where xx denotes the average number of items present per transaction, yy denotes the average support of each item in the dataset, and zzzK denotes the total number of transactions in thousands. The experiments have been performed on Oracle 8i (installed on a Solaris machine with 384 MB of RAM) and IBM DB2/UDB (over Windows NT with 256 MB of RAM). Each experiment has been performed 4 times. The values from the first run are ignored so as to avoid the effect of the previous experiments and other database setups. The average of the next 3 results is taken and used for analysis. This is done so as to avoid any false reporting of time due to system overload or any other factors. For most of the experiments, we have found that the percentage difference of each run with respect to the average is less than one percent. Before feeding the input to the mining algorithm, if it is not in the (tid, item) format, it is converted to that format (using the algorithm and the approach presented in [12]). On completion of the mining, the results are remapped to their original values. Since the time taken for
mapping, rule generation and re-mapping the results to their original descriptions is not very significant, these times are not reported. For the purpose of reporting the experimental results in this paper, for most of the optimizations we show the results only for three datasets – T5I2D500K, T5I2D1000K and T10I4D100K. Wherever there is a marked difference between the results for Oracle and IBM DB2/UDB, both are shown; otherwise, the result from any one of the RDBMSs has been included.
3.1 VerticalTid Approach (Vtid)
This approach makes use of two procedures – SaveTid and CountAndK. The SaveTid procedure is called once to create CLOBs (character large objects) representing lists of transactions. This procedure scans the input table once and, for every unique item id, generates a CLOB containing the list of transactions in which that item occurs (TidList). These item ids, along with their corresponding TidLists, are then inserted in the TidListTable relation, which has the following schema: (Item: number, TidList: CLOB). Once the TidListTable is generated, this relation is used for support counting in all the passes. Figure 1 shows the time for mining the relation T5I2D100K with different support values on DB2. Figure 2 shows the same for Oracle. A pass-wise analysis of these figures shows that the second pass consumes most of the time. This is where the TidLists of the items constituting the 2-itemsets are compared for finding the common transactions in them. Though the counting process seems very straightforward, the process of reading and intersecting these CLOBs is time consuming. As the number of 2-candidate itemsets is very large, the total time taken for support counting in pass 2 is very high. We also checked how this approach scales up as the size of the datasets increases, for support values of 0.20%, 0.15% and 0.10% on DB2 and Oracle respectively. From these figures [13] it is clear that Vertical Tid does not do well as the size of the datasets increases.
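The sketch below conveys the Vertical Tid idea with Python sets standing in for the CLOB TidLists: the support of an itemset is the size of the intersection of the TidLists of its items, and pass 2 is where the many pairwise intersections occur. The function and variable names are ours.

```python
# Vertical Tid sketch: Python sets stand in for CLOB TidLists.
from itertools import combinations

def build_tidlists(rows):
    """rows: iterable of (tid, item) pairs -> {item: set of tids}."""
    tidlists = {}
    for tid, item in rows:
        tidlists.setdefault(item, set()).add(tid)
    return tidlists

def frequent_pairs(tidlists, min_support_count):
    """Pass 2: intersect the TidLists of every pair of frequent items."""
    items = [i for i, tids in tidlists.items() if len(tids) >= min_support_count]
    result = {}
    for a, b in combinations(sorted(items), 2):
        common = tidlists[a] & tidlists[b]      # the expensive CLOB intersection
        if len(common) >= min_support_count:
            result[(a, b)] = len(common)
    return result
```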
Fig. 1. VertTid on T5I2D100K (DB2)
Fig. 2. VertTid on T5I2D100K (Oracle)
3.2 Gather Join Approach (Gjn)
In this approach, for candidate itemset generation, Thomas [11], Dudgikar [12], and our implementation for DB2 use the SaveItem procedure. This procedure is similar to the SaveTid procedure, the only difference being that here a CLOB object represents a list of item ids. The SaveItem procedure scans the input dataset and, for every unique transaction, generates a CLOB object to represent the list of items bought in that transaction (called ItemList). The transaction along with its corresponding ItemList is then inserted into the ItemListTable relation, which has the following schema: (Tid: number, ItemList: CLOB). The ItemList column is then read in every pass for the generation of k-candidate itemsets. In our implementation for Oracle, we skip the generation of the ItemListTable and the CombinationK stored procedure has been modified. The CombinationK udf for DB2 uses the ItemList column from the ItemListTable to generate k-candidate itemsets, while in Oracle, in any pass k, this stored procedure reads the input dataset ordered by the "Tid" column and inserts all item ids corresponding to a particular transaction into a vector. This vector is then used to generate all the possible k-candidate itemsets. This is done to avoid the usage of CLOBs, as working on CLOBs in Oracle has been found to be very time consuming, and also because the implementation in Oracle had to be done as a stored procedure, which does not necessarily need the inputs as CLOBs. In pass 2 and pass 3, the Combination2 and Combination3 stored procedures read the input dataset and generate candidate itemsets of length 2 and length 3 respectively. For DB2, the process of candidate itemset generation is as follows: in any pass k, for each tuple of the ItemListTable, the CombinationK udf is invoked. This udf receives the ItemList as input and returns all k-item combinations. Figure 3 and Figure 4 show the time taken for mining the dataset T5I2D100K with different support values, using this approach, on Oracle and DB2 respectively. The legend "ItemLT" corresponds to the time taken in building the ItemListTable. Since the building of the ItemListTable is skipped in our Oracle implementation, the time taken for building the ItemListTable for Oracle is zero.
3.3 Gather Count Approach (Gcnt)
This approach has been implemented for Oracle only. It is a slight modification of the Gather Join approach. Here, the support of candidate itemsets is counted directly in memory, so as to save the time spent in materializing the candidate itemsets and then counting their support. In pass 2, Gcnt uses the GatherCount2 procedure, which is a modification of the Combination2 procedure. In the second pass, instead of simply generating all the candidate itemsets of length 2 (as is done in the Combination2 procedure in Gjn), the GatherCount2 procedure uses a 2-dimensional array to count the occurrence of each itemset, and then only those itemsets whose support count exceeds the user-specified minimum support value are inserted in the frequent itemsets table. This reduces the time taken for generating frequent itemsets of length 2, as it skips the materialization of the C2 relation. The way it is done is that in pass 2 a 2-D array of dimensions [# of items] × [# of items] is built. All the cells of this array are initialized to zero. The GatherCount2 procedure generates all 2-item combinations (similar to the way it was done in the Combination2 procedure of Gjn) and increments the count of the itemset in the array. Thus, if an itemset {2,3} is generated, the value in the cell
[Item2][Item3] is incremented by 1. As the itemsets are generated in such a way that the item in position 1 is less than the item in position 2, half of the cells in the 2-D array will always be zero. However, this method of support counting cannot be used for higher passes, because building an array of 3 or more dimensions would cost a whole lot of memory.
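A minimal sketch of this pass-2 in-memory counting follows, with the upper triangle of a 2-D array used because combinations are generated with the first item smaller than the second. Names and interfaces are ours, and item ids are assumed to be integers in [0, n_items).

```python
# Sketch of pass-2 in-memory counting in the style of GatherCount2.
import numpy as np
from itertools import combinations

def gather_count2(itemlists, n_items, min_support_count):
    counts = np.zeros((n_items, n_items), dtype=np.int64)
    for items in itemlists:                         # items bought in one transaction
        for a, b in combinations(sorted(set(items)), 2):
            counts[a, b] += 1                       # a < b, so the lower triangle stays zero
    frequent2 = [((a, b), int(counts[a, b]))
                 for a in range(n_items) for b in range(a + 1, n_items)
                 if counts[a, b] >= min_support_count]
    return frequent2
```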
Fig. 3. Gather Join on T5I2D100K (O)
Fig. 4. Gather Join on T5I2D100K (DB2)
Fig. 5. Naïve SQL-OR Based Approaches (O)
Fig. 6. Ck and Fk for Gjn (DB2)
4 Analysis and Optimizations to the SQL-OR Based Approaches
Figure 5 compares the time taken for mining by the naïve SQL-OR based approaches for a support value of 0.10% on datasets T5I2D10K, T5I2D100K, T5I2D500K and T5I2D1000K on Oracle. From this figure it is very clear that, of the 3 approaches, Vertical Tid has the worst performance. This is because Vtid blows up at the second pass, where the overall time taken in support counting of all the 2-itemsets by intersecting their TidLists is very large. So the optimization to Vtid would be to reduce the number of TidLists processed by the CountAndK procedure in each pass. This optimization is explained in more detail in Section 4.1. The other two approaches, though they complete for large datasets, take a lot of time. The difference in the candidate itemset generation process, as is done in
these approaches and the way it is done for any SQL-92 based approach is that here, in any pass k, all the items bought in a transaction (the complete ItemList) are used for the generation of candidate itemsets, whereas in the SQL-92 based approaches, in the kth pass, only frequent itemsets of length k−1 are extended. The significance of this lies in the number of candidate itemsets that are generated at each pass and the way support counting is done. In SQL-92 based approaches, frequent itemsets of length k−1 are used to generate candidate itemsets of length k, and then additional joins are done to consider only those candidate itemsets whose subsets of length k−1 are also frequent (because of the subset property). This reduces the number of candidate itemsets that are generated at each pass significantly. But then, for support counting, the input dataset had to be joined k times with an additional join condition to ensure that these items (constituting an itemset) were coming from the same transaction. In Gjn and Gcnt, since the candidate itemsets are generated from the complete ItemList of a transaction, there is no need to join the input dataset. Just a single group-by on the items constituting an itemset, with a having clause, is sufficient to identify all those candidate itemsets that are frequent. However, in any pass k, there is no easy way to identify the frequent itemsets of length k−1 and use them selectively to generate candidate itemsets of length k; rather, the entire ItemList is used for the generation of k-candidate itemsets. This generates a huge number of unwanted candidate itemsets and hence an equivalent increase in the time for support counting. Figure 6 compares the time taken in the generation of these candidate itemsets and their support counting for each pass for dataset T5I2D100K, for a support value of 0.10% on DB2. These figures suggest that most of the time is taken in the generation of a large number of candidate itemsets. So a way to optimize this would be to reduce the number of candidate itemsets. This optimization is explained in detail in Sections 4.2 and 4.3.
4.1 Improved VerticalTid Approach (IM_Vtid)
In the Vtid approach, for support counting, in any pass k, the TidList of each item constituting an itemset is passed to the CountAndK procedure. As the length of the itemsets increases, the number of TidLists passed as parameters to the CountAndK procedure also increases (in pass k, the CountAndK procedure receives k TidLists).
Fig. 7. % Gain of Im_Vtid over Vtid
Fig. 8. IM_Vtid on T5I2D1000K
So, to enhance the process of support counting, this optimization does the following. In pass 2, frequent itemsets of length two are generated directly by performing a self-join of the input dataset, the join condition being that the item from the first copy is less than the item from the second copy and that both items belong to the same Tid. From pass 3 onwards, for those itemsets whose count exceeds the minimum support value, the CountAndK procedure again builds a list of transactions (as a CLOB) that have been found common in all the TidLists, to represent that itemset as a whole. (We have implemented this for Oracle only and have modified the CountAndK stored procedure to reflect the above change; hence, for this optimization, the CountAndK procedure is used only in reference to the implementation for Oracle.) In pass k, the itemset along with its TidList is materialized in an intermediate relation. In the next pass (pass k+1), during the support counting of the candidate itemsets (which are one-item extensions of the frequent itemsets of length k that were materialized in pass k), there is no need to pass the TidLists of all the items constituting the itemset. Instead, just two TidLists are passed – one representing the k-itemset and the other representing the item extending this itemset. This saves a whole lot of time in searching for the list of common transactions in the TidLists received by the CountAndK procedure. Figure 7 shows the performance gained (in percentages) by using IM_Vtid over Vtid for datasets T5I2D10K and T5I2D100K for support values of 0.20%, 0.15% and 0.10% (for the other datasets Vtid did not complete). Figure 8 shows the overall time taken for mining the relation T5I2D1000K with the IM_Vtid approach for different support values on Oracle. The legend TidLT represents the time taken in building the TidListTable from the input dataset (T5I2D1000K). This phase basically represents the time taken in building the TidList (a CLOB object) for each item id. From Figure 8 it is clear that the time taken in building the TidListTable is a huge overhead. It accounts for nearly 60 to 80 percent of the total time spent on mining. Though this optimization is very effective, the time taken for building the TidListTable still shows that the efficiency of the RDBMS in manipulating CLOBs is a bottleneck.
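For illustration, the pass-2 self-join described above could be expressed as an SQL query of the following shape, shown here as a string inside Python; the table name input_data(tid, item) and the :min_count placeholder are assumptions, not the authors' actual schema.

```python
# Illustration of the pass-2 self-join described above; the table name
# input_data(tid, item) and the :min_count bind variable are assumptions.
PASS2_SELF_JOIN = """
SELECT  t1.item AS item1,
        t2.item AS item2,
        COUNT(*) AS support_count
FROM    input_data t1
JOIN    input_data t2
  ON    t1.tid = t2.tid          -- both items come from the same transaction
 AND    t1.item < t2.item        -- avoid duplicate/reversed pairs
GROUP BY t1.item, t2.item
HAVING  COUNT(*) >= :min_count   -- keep only frequent 2-itemsets
"""
```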
4.2 Improved Gather Join Approach (IM_Gjn)
In the Gjn approach, in any pass k, all the items that occur in a transaction are used for the generation of candidate itemsets of length k. In subsequent passes, the items which did not participate in the generation of frequent itemsets are not eliminated from the list of items for that transaction. There is no easy way of scanning and eliminating all those items from the ItemList of a transaction that did not participate in the formation of frequent itemsets in any pass. As there is no pruning of the items, a huge number of unwanted candidate itemsets are generated in every pass. One possible way to optimize this is, in any pass k, to use the tuples of only those transactions (instead of the entire input table) which have contributed to the generation of frequent itemsets in pass k−1. For this we use an intermediate relation FComb. In any pass k, this relation contains the tuples of only those transactions whose items have contributed to the formation of frequent itemsets in pass k−1. This is done by joining the candidate itemsets table (Ck−1) with the frequent itemsets table (Fk−1). But to identify the candidate itemsets that belong to the same transaction, the CombinationK stored procedure has been modified to insert the transaction id along with the item combinations that were generated from the ItemList of that transaction into the Ck
relation. In any pass k, the FComb table is thus generated and then used by the CombinationK stored procedure (instead of the input dataset) to generate candidate itemsets of length k. Figure 9 compares the time required for mining relation T5I2D100K on Oracle when the FComb table is materialized (IM_Gjn) and used for the generation of candidate itemsets and when the input table is used as is (Gjn) for the generation of candidate itemsets. We see that the total mining time using the FComb relation is considerably less than the total mining time using the input dataset as is. Also, in Gjn, for different support values (0.20%, 0.15% and 0.10%) the time taken in each pass is nearly the same. This is because in Gjn there is no pruning of candidate itemsets, and so, irrespective of the user-specified support values, the entire ItemList is used for generating all the candidate itemsets of length k. Figure 10 compares the number of candidate itemsets generated for relation T5I2D100K when the input relation and when the FComb relation are used by the CombinationK stored procedure, for a support value of 0.10%. From this figure, we see that in higher passes, when the input relation is used, the number of candidate itemsets is significantly larger than when the FComb relation is used, which accounts for the difference in the total time taken for mining by these two methods. Figure 11 shows the performance gained (in percentages) by using IM_Gjn over Gjn on datasets T5I2D10K, T5I2D100K, T5I2D500K and T5I2D1000K for different support values. From this figure we see that, on average, the gain for different support values is 1500% on the different datasets.
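As a hedged sketch of the idea, the FComb relation for pass k could be derived with a query of roughly the following shape, joining the candidate table from pass k−1 (which, after the modification above, carries the tid) with the frequent-itemset table from pass k−1; all table and column names are assumptions based on the description, not the actual implementation.

```python
# Hedged sketch of deriving FComb for pass k; table and column names are assumptions.
FCOMB_FOR_PASS_K = """
INSERT INTO fcomb (tid, item)
SELECT DISTINCT c.tid, i.item
FROM   c_prev  c                 -- candidate (k-1)-itemsets with their tids
JOIN   f_prev  f                 -- frequent (k-1)-itemsets
  ON   c.item1 = f.item1 AND c.item2 = f.item2   -- schematically, one predicate per position
JOIN   input_data i
  ON   i.tid = c.tid             -- pull back the full item lists of the surviving transactions
"""
```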
4.3 Improved Gather Count Approach (IM_Gcnt)
As the Gcnt approach is a slight modification of the Gjn approach, the optimization suggested for Gjn can be used for this approach also. In the Gcnt approach, the second pass uses a 2-dimensional array to count the occurrence of all item combinations of length 2, and those item combinations whose count exceeds the user-specified support value are directly inserted in the frequent itemsets relation (F2). The materialization of the candidate itemsets of length 2 (C2) at this step is skipped, and in the third pass, F2 is joined with two copies of the input dataset to generate FComb, which is then used by the modified Combination3 stored procedure. For subsequent passes, the materialization of the FComb relation is done in the same manner as for the IM_Gjn approach.
Fig. 9. Gjn & IM_Gjn on T5I2D100K (O)
Fig. 10. Size of Ck (Gjn & IM_Gjn)
Fig. 11. Performance Gain for IM_Gjn
Fig. 13. Gjn & Gcnt for T5I2D1000K (O)
Fig. 12. Vtid, Gjn, Gcnt on T5I2D100K (O)
Fig. 14. Performance Gain for IM_Gcnt
Figure 13 compares the mining time for dataset T5I2D1000K on Oracle using the IM_Gcnt approach for different support values and also compares it with the IM_Gjn approach. This figure shows that, of the two approaches, IM_Gcnt performs better than IM_Gjn. This is because of the time saved in the second pass of the IM_Gcnt approach. For the rest of the passes, the time taken by both of them is almost the same, as both use the same modified CombinationK stored procedure for the generation of candidate itemsets. Thus, if memory is available for building the 2-D array, performance can be improved by counting the support in memory. Remember that the size of the array needed would be of the order of n², where n is the number of distinct items in the dataset. Figure 14 shows the performance gained (in percentages) by using IM_Gcnt over Gcnt on datasets T5I2D10K, T5I2D100K, T5I2D500K and T5I2D1000K for different support values. From this figure we see that, on average, the gain for different support values is 2500% on the different datasets.
5 Summary of Results
The SQL-OR based approaches use a simple approach to candidate itemset generation and support counting. But when compared with SQL-92 based approaches [14], they do not even come close. The time taken by the naïve SQL-OR based approaches, using stored procedures and udfs, is much more than that of the basic k-way join approach for support counting. In the SQL-OR approaches, although the use of complex data structures makes the process of mining simpler, it also makes it quite inefficient. Among the naïve SQL-OR approaches, we found that the Gather Count approach is the best while the VerticalTid approach has the worst performance. Figure 12 shows this for dataset T5I2D100K on Oracle, and Figure 5 compares the total time taken by these approaches for different datasets. Gather Count outperforms the Gather Join approach because in the second pass it uses main memory to do the support counting and hence skips the generation of candidate itemsets of length 2. The other optimizations (IM_Gjn and IM_Gcnt), as implemented in Oracle, avoid the usage of CLOB objects, and hence these improved versions seem to be very promising. The Gather Count approach, which makes use of system memory in the second pass for support counting, is an improvement over the optimization for the Gather Join approach. Figure 8 and Figure 13 show the performance of IM_Vtid, IM_Gjn and IM_Gcnt for dataset T5I2D1000K for different support values. From these figures it is clear that IM_Gcnt is the best of the three SQL-OR approaches and their optimizations discussed in this paper. We have compiled the results obtained from mining different relations into a tabular format. This can be converted into metadata and made available to the mining optimizer so that it can use these values as a cue for choosing a particular optimization for mining a given input relation.
6 Conclusion and Future Work
In SQL-OR based approaches, if we have enough memory to build a 2-dimensional array for counting support in the second pass, then the Gather Count approach has been found to be the best of all the naïve SQL-OR based approaches. If building an in-memory 2-dimensional array is a problem, then Gather Join is a better alternative. The same applies when we have enough space to materialize intermediate relations (on disk). Hence, when the optimizations to the SQL-OR based approaches are considered, the optimized Gather Count approach (IM_Gcnt) is the best of all the optimizations. Also, in most of the cases, IM_Gcnt has been found to be the best of all the approaches and their optimizations (including those for the SQL-92 based approaches).
References
[1] Agrawal, R., T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD, 1993.
[2] Agrawal, R. and R. Srikant. Fast algorithms for mining association rules. In 20th Int'l Conference on Very Large Databases (VLDB), 1994.
[3] Savasere, A., E. Omiecinsky, and S. Navathe. An efficient algorithm for mining association rules in large databases. In 21st Int'l Conf. on Very Large Databases (VLDB), 1995.
[4] Shenoy, P., et al. Turbo-charging vertical mining of large databases. In ACM SIGMOD, 2000.
[5] Han, J., J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD, 2000.
[6] Houtsma, M. and A. Swami. Set-oriented mining for association rules in relational databases. In ICDE, 1995.
[7] Han, J., et al. DMQL: A data mining query language for relational databases. In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, 1996.
[8] Meo, R., G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In Proc. of the 22nd VLDB Conference, India, 1996.
[9] Agrawal, R. and K. Shim. Developing tightly-coupled data mining applications on a relational database system. IBM Report, 1995.
[10] Sarawagi, S., S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In ACM SIGMOD, 1998.
[11] Thomas, S. Architectures and optimizations for integrating data mining algorithms with database systems. CSE, University of Florida, 1998.
[12] Dudgikar, M. A layered optimizer for mining association rules over RDBMS. CSE Department, University of Florida, Gainesville, 2000.
[13] Mishra, P. Evaluation of K-way join and its variants for association rule mining. MS thesis, Information and Technology Lab and CSE Department, UT Arlington, TX, 2002.
[14] Mishra, P. and Chakravarthy, S. "Performance evaluation and analysis of SQL-92 approaches for association rule mining". In BNCOD Proc., 2003.
A Distance-Based Approach to Find Interesting Patterns
Chen Zheng1 and Yanfen Zhao2
1 Department of Computer Science, National University of Singapore, 3 Science Drive 2, Singapore 117543, [email protected]
2 China Construction Bank, No. 142, Guping Road, Fujian, P.R. China 350003, [email protected]
Abstract. One of the major problems in knowledge discovery is producing too many trivial and uninteresting patterns. The measurement of interestingness is divided into subjective and objective measures and used to address the problem. In this paper, we propose a novel method to discover interesting patterns by incorporating the domain user's preconceived knowledge. The prior knowledge constitutes a set of hypothesis about the domain. A new parameter called the distance is proposed to measure the gap between the user's existing hypothesis and system-generated knowledge. To evaluate the practicality of our approach, we apply the proposed approach through some real-life data sets and present our findings.
1 Introduction
In the field of knowledge discovery in databases, most previous research has focused on the validity of the discovered patterns; little consideration has been given to the interestingness problem. Among the huge number of patterns in a database, most are useless or common-sense rules, and it is difficult for domain users to identify manually the patterns that are interesting to them. To address this problem, researchers have proposed a number of useful and novel approaches based on different understandings of what makes a pattern interesting. In [14], interestingness is defined in probabilistic terms as unexpectedness: patterns are interesting if they can affect the degree of the users' beliefs. In [5, 6], the definition of interestingness is based on a syntactic comparison between system-generated rules and beliefs. In [9], a new definition of interestingness is given in terms of logical contradiction between a rule and a belief. In this paper, we follow the line of research on subjective measures and give a new definition of interestingness in terms of the distance between the discovered knowledge and an initial set of user hypotheses. We believe that interesting knowledge takes the form of surprising patterns, which deviate from the generally conforming rules. Thus, the
further the distance between a generated rule and the user's hypotheses, the more interesting the pattern will be. To calculate the distance, we first transform the original data set into (fuzzy linguistic variable, linguistic term) pairs according to different levels of certainty. The existing hypotheses also form a set of fuzzy rules, since domain users usually have only vague ideas about the domain beforehand. The distance is then calculated between the hypotheses and the rules generated from the transformed data set. The rest of this paper is organized as follows. Section 2 describes related work on different measures of interestingness. Section 3 describes our proposed fuzzy distance measure and the methodology to find interesting patterns. Section 4 describes our implementation and presents the experimental results. Finally, Section 5 concludes our work.
2 Related Work
Generally speaking, there are two categories of interestingness measures: objective and subjective. Objective measures aim to find interesting patterns by exploring the data and its underlying structure during the discovery process; such measures include the J-measure [13], the certainty factor [2] and strength [16]. However, interestingness also depends on the users who examine the pattern: a pattern that is interesting to one group of users may not make any sense to another group, and even the same user may feel differently about the same rule as time passes. Thus, subjective measures are useful and necessary. In the field of data mining, subjective interestingness has been identified as an important problem in [3, 7, 11, 12]. The domain-specific system KEFIR [1] is one example; it uses actionability to measure interestingness and analyzes healthcare insurance claims to uncover "key findings". In [14], probabilistic beliefs are used to describe subjective interestingness. In [10, 15], the authors propose two subjective measures of interestingness: unexpectedness and actionability, the latter meaning that the pattern can help users do something to their advantage. Liu et al. reported a technique for analyzing rules against user expectations [5]; the technique is based on syntactic comparisons between a rule and a belief and requires the user to provide precise knowledge, which may be difficult to supply in real-life situations. In [6], Liu et al. analyze discovered classification rules against a set of general impressions that are specified using a special representation language; unexpected rules are defined as those that fail to conform to the general impressions. Different from the above approaches, our proposed method is domain-independent: it uses a fuzzy α-level cut to transform the original data set and then compares the generated fuzzy rules with the fuzzy hypotheses. A new measure is defined to calculate the degree of interestingness.
3 The Proposed Approach
3.1 Specifying Interesting Patterns
Let R be the set of system-generated knowledge and H be the set of user hypotheses. Our proposed method calculates the distance between R and H; based on this distance (Section 3.2), the discovered rules are classified into four sub-groups: conforming rules, similar rules, covered rules and deviated rules. Below, we give the definition of each sub-group.
Definition 1 (Conforming Rules). A discovered rule r (r ∈ R) is said to be a conforming rule w.r.t. a hypothesis h (h ∈ H) if both the antecedent and the consequent parts of the two rules are exactly the same. In this situation, r and h have no distance.
Definition 2 (Similar Rules). A discovered rule r (r ∈ R) is said to be a similar rule w.r.t. a hypothesis h (h ∈ H) if they have similar attribute values in the antecedent parts and the same consequent. In this situation we say that they are "close" to each other.
Definition 3 (Covered Rules). A discovered rule r (r ∈ R) is said to be a covered rule w.r.t. a hypothesis h (h ∈ H) if the antecedent part of h is a subset of that of r, and r and h have the same attribute values on the common attributes and the same consequent. In this situation, r can be inferred from h, and they have no distance.
Definition 4 (Deviated Rules). A discovered rule r (r ∈ R) is said to be a deviated rule w.r.t. a hypothesis h (h ∈ H) in three situations: (1) Same antecedent, different consequent: r and h have the same conditions but different class labels; r has a surprising result for the user. (2) Different antecedent, same consequent: r and h have different conditions and the same class label, and r is not covered by h; r has a surprising reason for the user. (3) Different antecedent, different consequent: r and h have different class labels as well as different conditions. "Different" means they can differ in attribute values, attribute names or both.
Among these four sub-groups, since some knowledge and interests are embedded in the user's expected hypotheses, whether a pattern is interesting depends on how far apart the system-generated knowledge and the user's hypotheses are. Rules that are far from the hypotheses surprise the user, contradict user expectations and trigger the user to investigate further, i.e., they are more interesting than the trivial rules, which are common sense, similar to the hypotheses or derivable from them.
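As a concrete reading of Definitions 1-4, the following sketch classifies a discovered rule against a single hypothesis. The rule representation (an antecedent dictionary plus a class label) and the simplified treatment of "similar attribute values" are assumptions of this illustration, not part of the paper.

```python
# Sketch of the four sub-groups of Definitions 1-4. A rule is represented as
# (antecedent dict, consequent); "similar attribute values" is simplified to
# "same antecedent attributes" here. All names are illustrative.

def categorize(rule, hyp):
    r_ant, r_cons = rule
    h_ant, h_cons = hyp
    if r_cons == h_cons:
        if r_ant == h_ant:
            return "conforming"      # Definition 1: identical antecedent and consequent
        if set(h_ant) <= set(r_ant) and all(r_ant[a] == h_ant[a] for a in h_ant):
            return "covered"         # Definition 3: h's antecedent is contained in r's
        if set(r_ant) == set(h_ant):
            return "similar"         # Definition 2 (value closeness not checked here)
    return "deviated"                # Definition 4: surprising condition and/or class

r = ({"age": "young", "sex": "male", "salary": "low", "occupation": "student"}, "risk=high")
h = ({"salary": "low"}, "risk=high")
print(categorize(r, h))              # -> "covered"
```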
3.2 Measures of Interestingness
We use the distance measure to identify interesting patterns. The computation of the distance between a rule ri in the system-generated knowledge base R and the hypotheses consists of three steps: (1) calculate the attribute distance; (2) calculate the tuple distance; (3) calculate the rule distance.
3.2.1 Attribute Distance
The value of an attribute distance varies from 0 to 1, where 1 represents complete difference and 0 represents no difference; the higher the value, the more the rule and the hypothesis differ on that attribute. Suppose an attribute K in rule r has a value r.k and a hypothesis h is given; distK(r, h) denotes the distance between r and h on attribute K. The following factors are considered during attribute comparison: attribute type, attribute name, attribute value, and class label difference.
Discrete attribute value distance: The distance between discrete attribute values is either 0 or 1: "0" for the same attribute value, "1" for different attribute values.
Continuous attribute value distance: Since we have transformed the original data tuples into (linguistic variable, linguistic term) pairs, we assign an ordered list {l1, l2, ..., li} to the linguistic term set, where l1 < l2 < ... < li, lj ∈ [0, 1] and j ∈ [1, i]. The distance between linguistic terms termj and termk of a continuous attribute is |lj − lk|. For example, given the term set (short, middle, tall) of the linguistic variable "height", we can assign the list [0, 0.5, 1]; the distance between short and middle is then 0.5.
Attribute name distance: Let the antecedent parts of rule r and hypothesis h be r.ant and h.ant, respectively. The set of attributes common to the antecedent parts of r and h is denoted IS (for intersection), i.e., IS(r, h) = r.ant ∩ h.ant. Let |r.ant| be the number of attributes in r.ant. The distance between the attribute names in r and h, denoted distname(r, h), is computed as

distname(r, h) = (|r.ant| − |IS(r, h)|) / (|r.ant| + |h.ant|)    (1)
Class distance: For classification rule generation, the class attribute name of each tuple is the same. The distance between the class attribute values in r and h, denoted distclass(r, h), is either 0 or 1. Since the class attribute is an important attribute, we use the maximum attribute weight for it, i.e., wmax = max(w1, w2, ..., wn), given n attributes (excluding the class attribute) in the original data set.
3.2.2 Tuple Distance
The tuple distance is computed from the attribute distances and the attribute weights. We introduce attribute weights to indicate the relative importance of individual attributes in the calculation of the tuple distance between r and h. For example, in credit card risk analysis, we may consider the "salary" attribute more important than the "sex" attribute, so that it contributes more to the interestingness of a rule. We assume a data set with attributes attr1, attr2, ..., attrn and attribute weights w1, w2, ..., wn, respectively; the weights are given by the users and sum to 1. Given a rule r and a hypothesis h, let dist1(r, h), dist2(r, h), ..., distn(r, h) be the attribute value distances between r and h. A purely syntactic comparison between r and h cannot identify covered rules, which are redundant. For example, for the rule r: age=young, sex=male, salary=low, occupation=student → risk=high and the hypothesis h: salary=low → risk=high, r is covered by h although they have different attribute names and values; r is not surprising if we already know h. The tuple distance between r and h is therefore defined for two situations; the first case of formula (2) gives the distance between a covered rule and the hypothesis:

d(r, h) = distclass(r, h) × wmax,   if distk(r, h) = 0 for all k ∈ IS(r, h)

d(r, h) = distclass(r, h) × wmax + ( Σk∈IS(r,h) distk(r, h) × wk ) / |IS(r, h)| + distname(r, h),   otherwise    (2)
3.2.3 Rule Distance
Finally, we calculate the average tuple distance between a rule ri and the set of existing user hypotheses. Let R and H be the system-generated knowledge base and the existing hypotheses, respectively, and let |H| denote the size of H. Given a rule ri ∈ R, the distance between ri and H, denoted Di, is defined as

Di = ( Σj=1..|H| d(ri, hj) ) / |H|    (3)
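The three-step computation can be summarized in a short sketch. It assumes rules are stored as an antecedent dictionary plus a class label, that term_order supplies the list positions of the linguistic terms, and that the user-given weights sum to 1; all identifiers are illustrative.

```python
# Sketch of the three-step distance computation (formulas (1)-(3)).
# A rule is (antecedent dict, class label); term_order maps a continuous
# attribute to a dict giving each linguistic term its position in [0, 1];
# weights maps attribute names to user-given weights. Names are illustrative.

def attr_dist(attr, v_r, v_h, term_order):
    if attr in term_order:                         # continuous: |lj - lk|
        return abs(term_order[attr][v_r] - term_order[attr][v_h])
    return 0.0 if v_r == v_h else 1.0              # discrete: 0 or 1

def name_dist(r_ant, h_ant):                       # formula (1)
    common = set(r_ant) & set(h_ant)
    return (len(r_ant) - len(common)) / (len(r_ant) + len(h_ant))

def tuple_dist(rule, hyp, term_order, weights, w_max):   # formula (2)
    (r_ant, r_cls), (h_ant, h_cls) = rule, hyp
    d_class = 0.0 if r_cls == h_cls else 1.0
    common = set(r_ant) & set(h_ant)
    dists = {a: attr_dist(a, r_ant[a], h_ant[a], term_order) for a in common}
    if common and all(d == 0.0 for d in dists.values()):
        return d_class * w_max                     # covered rule: first case
    avg = (sum(dists[a] * weights[a] for a in common) / len(common)) if common else 0.0
    return d_class * w_max + avg + name_dist(r_ant, h_ant)

def rule_dist(rule, hypotheses, term_order, weights, w_max):   # formula (3)
    return sum(tuple_dist(rule, h, term_order, weights, w_max)
               for h in hypotheses) / len(hypotheses)

# Example: the covered rule from Section 3.2.2 has distance 0 to its hypothesis.
term_order = {}                                    # only discrete attributes here
weights = {"age": 0.2, "sex": 0.1, "salary": 0.4, "occupation": 0.3}
r = ({"age": "young", "sex": "male", "salary": "low", "occupation": "student"}, "high")
h = ({"salary": "low"}, "high")
print(tuple_dist(r, h, term_order, weights, w_max=0.4))   # -> 0.0
```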
3.3 Discovery Strategy
The key idea of our proposed approach is to use a fuzzy α-cut to transform the original data set and to generate fuzzy certainty rules from the transformed data. The user's imprecise prior knowledge is expressed as a set of fuzzy rules, so the distance is calculated by comparing rules of the same format. Let us first review the definitions of some fuzzy terms: linguistic variable, linguistic term, degree of membership, α-level cut, and fuzzy certainty rule. According to the definition given by Zimmermann, a fuzzy linguistic variable is a quintuple (x, T(x), U, G, M̃) in which x is the name of the variable; T(x) denotes the term set of x, that is, the set of names
of linguistic values of x, with each value being a fuzzy variable denoted generically by x and ranging over a universe of discourse U; G is a syntactic rule for generating the names X of values of x; and M̃ is a semantic rule associating with each value X its meaning M̃(X), which is a fuzzy subset of U. A particular X, that is, a name generated by G, is called a term [18]. For example, given the linguistic variable "age", the term set T(x) could be "very young", "young", "middle-age", "old". The base variable u is the age in years of life, and µF(u) is interpreted as the degree of membership of u in the fuzzy set F. M̃(X) assigns a meaning to the fuzzy terms. For example, M̃(old) can be defined as M̃(old) = {(u, µold(u)) | u ∈ [0, 120]}, where µold(u) denotes the membership function of the term "old": µold(u) = 0 for u ∈ [0, 70] and µold(u) = (1 + ((u − 70)/5)^(−2))^(−1) for u ∈ (70, 120]. Given a certainty level α (α ∈ [0, 1]), we can define the set Fα [17] as Fα = {u ∈ U | µF(u) ≥ α}; Fα is called the α-level cut and contains all the elements of U that are compatible with the fuzzy set F above the level α. The syntax of the fuzzy certainty rule A → B is "If X is A, then Y is B with certainty α", where A and B are fuzzy sets. Compared with traditional classification rules, our method uses the fuzzy α-cut concept to generate certainty rules. We now present an overview of our approach, which consists of the four phases below.
Step 1. Given a certainty level α, transform the continuous attributes in the original data set into (linguistic variable, linguistic term) pairs according to the α-level cut, and keep the categorical attributes unchanged. This yields the transformed data set T.
Step 2. Generate the fuzzy certainty rules R based on T, compare them with the hypotheses given by the users, and calculate the distance according to the formulas given in Section 3.2.
Step 3. Sort the fuzzy certainty rules according to their distance and choose the fuzzy rules with distance larger than the threshold δ.
Step 4. Check the α-level cuts of the linguistic terms and defuzzify the fuzzy certainty rules into crisp if-then rules.
Given the original data set D, consider each tuple d ∈ D. For each continuous attribute Ai in D, we first specify the linguistic term set LiK for Ai, given K linguistic terms; we then generate the membership of Ai in d for every element Lij ∈ LiK according to user specification or the methods in [8]. After that, a certainty level α is given and we construct the α-cut (denoted Lijα) of the linguistic term Lij. If the value of Ai in tuple d falls into Lijα, we say that Lij is a possible linguistic term. The original tuple d is split and inserted into the transformed data set T according to the combinations of the possible linguistic terms of the different attributes. Traditional data mining tools, for example [4], are then applied to the transformed data set T to generate the fuzzy certainty rules. The next step is to use the formulas in Section 3.2 to calculate the distance Di between each rule ri (ri ∈ R) and all the hypothesis rules h ∈ H. We specify a distance threshold δ to identify the interesting rules r*, which have Di greater than δ. The user then chooses the explainable interesting rules from r* and updates the hypothesis rule base H. Finally, we check the α-level cut of each linguistic term Lijα, return the data points belonging to Lijα, and defuzzify the fuzzy certainty rules into crisp if-then rules. Similarly, we can generate rules at different certainty levels and compare them with the hypotheses.
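A minimal sketch of the transformation in Step 1 is given below. The membership function for "old" follows the definition above; the membership function for "young" and the restriction to a single fuzzy attribute are simplifying assumptions of this illustration.

```python
# Sketch of Step 1: find the possible linguistic terms of a continuous value
# under a certainty level alpha and split the tuple accordingly. mu_old follows
# the definition above; mu_young and the single-attribute handling are
# illustrative simplifications.

def mu_old(u):
    return 0.0 if u <= 70 else 1.0 / (1.0 + ((u - 70) / 5.0) ** -2)

def mu_young(u):                                  # purely illustrative shape
    return 1.0 if u <= 25 else max(0.0, 1.0 - (u - 25) / 20.0)

MEMBERSHIP = {"age": {"young": mu_young, "old": mu_old}}

def possible_terms(attr, value, alpha):
    """Terms whose alpha-level cut contains the value, i.e. mu(value) >= alpha."""
    return [t for t, mu in MEMBERSHIP[attr].items() if mu(value) >= alpha]

def transform(row, alpha):
    """Split a tuple into one row per possible term (one fuzzy attribute only)."""
    rows = []
    for attr, value in row.items():
        if attr in MEMBERSHIP:
            for term in possible_terms(attr, value, alpha):
                rows.append({**row, attr: term})
    return rows or [row]

print(transform({"age": 73, "risk": "high"}, alpha=0.2))
# -> [{'age': 'old', 'risk': 'high'}]
```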
4 Experiments
To evaluate the effectiveness of our approach, we implemented it in the "credit card risk analysis" system of China Construction Bank, the third biggest bank in China. We have four years of historic customer credit card information. The bank users knew that the number of malicious overdraft cases had steadily gone up over the years. In order to decide on suitable actions, they were interested in whether there were specific groups of people responsible for this or whether such cases happened randomly. In particular, they wanted to find the unexpected patterns among thousands of rules. We let the users input ten to twenty hypotheses; after generating the fuzzy rules, we use the Oracle 8 database system to store the discovered rules and the user hypotheses. The system runs on Windows 2000, and the experiments on the database are performed on a 500 MHz Pentium II machine with 128 MB memory. Table 1 summarizes the numbers of conforming and interesting rules discovered at several certainty levels (minimum support 1%, minimum confidence 50%, δ = 0.5). Column "Cert." shows the certainty level, column "Rul." the number of fuzzy certainty rules, column "Conf." the number of conforming, similar and covered rules, column "#num" the number of interesting rules, column "#expl." the number of explainable interesting rules, and column "false" the number of interesting rules that are not surprising to the user. After we showed the results to our users, they all agreed that our fuzzy rules are concise and intuitive compared to CBA crisp rules, and the hypotheses are verified by the generated conforming rules. Part of the interesting rules are explainable, and users showed great interest in investigating the unexpected rules; a few rules that are not interesting are misidentified because they have statistical significance (large support, confidence and distance values). Figure 1 shows, for the sub_99 data, the total execution time for different minimum support thresholds on the X-axis. Figure 2 shows the total execution time with respect to different data sizes sampled from the year 2000 data, which
contains 270000 tuples; "50k" means that we sample 50000 tuples to perform our task. The legend in the right corner specifies the different certainty levels.
Table 1. Results of conforming and deviated rule mining
Data     #attr  #rec    Cert.  Rul.   Conf.  #num  #expl.  false
Sub_98   27     4600    0.7    365    232    133   57      9
Sub_98   27     4600    0.5    448    291    157   62      5
Sub_99   31     12031   0.8    511    397    114   79      13
Sub_99   31     12031   0.4    821    693    128   87      25
Sub_00   28     13167   0.8    773    487    286   53      32
Sub_00   28     13167   0.6    1015   688    327   78      11
Sub_01   28     17210   0.7    776    559    217   65      17
Sub_01   28     17210   0.9    341    220    121   45      3
(The last three columns are the interesting rules: #num, #expl. and false.)

Fig. 1. Total execution time (sec.) with respect to different minsup values (0.5% to 3.0%) for the sub_99 data at certainty levels 0.6 to 0.9.
Fig. 2. Total execution time (sec.) with respect to different data sizes (50k to 250k tuples) at certainty levels 0.6 to 0.9.
5 Conclusion
This paper proposes a novel domain-independent approach to help domain users find conforming rules and identify interesting patterns. The system transforms the data set based on different levels of belief, and the generated fuzzy certainty rules are compared with the imprecise knowledge of the same format given by the users. The distance measure considers both semantic and syntactic factors during the comparison. Since users have different background knowledge, interests and hypotheses, our new approach is flexible enough to satisfy their needs. In future work, we will carry out more experiments and compare our algorithm with other methods of computing interestingness.
References
[1] C. J. Matheus, G. Piatetsky-Shapiro, and D. McNeil. An application of KEFIR to the analysis of healthcare information. In Proceedings of the AAAI-94 Workshop on Knowledge Discovery in Databases, 1994.
[2] J. Hong and C. Mao. Incremental discovery of rules and structure by hierarchical and parallel clustering. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
[3] Klemettinen, M., Mannila, H., et al. Finding interesting rules from large sets of discovered association rules. In Proceedings of the Third International Conference on Information and Knowledge Management, 401-407, 1994.
[4] Liu, B., et al. Integrating Classification and Association Rule Mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 80-86, 1998.
[5] Liu, B. and Hsu, W. Post-Analysis of Learned Rules. In Proc. of the Thirteenth National Conf. on Artificial Intelligence (AAAI'96), 828-834, 1996.
[6] Liu, B., Hsu, W., and Chen, S. Using General Impressions to Analyze Discovered Classification Rules. In Proc. of the Third Intl. Conf. on Knowledge Discovery and Data Mining, 31-36, 1997.
[7] Major, J. and Mangano, J. Selecting among rules induced from a hurricane database. KDD-93, 28-41, 1993.
[8] M. Kaya, et al. Efficient Automated Mining of Fuzzy Association Rules. DEXA, 133-142, 2002.
[9] Padmanabhan, B. and Tuzhilin, A. On the Discovery of Unexpected Rules in Data Mining Applications. In Procs. of the Workshop on Information Technology and Systems, 81-90, 1997.
[10] Padmanabhan, B. and Tuzhilin, A. A belief-driven method for discovering unexpected patterns. In Proc. of the Fourth International Conference on Knowledge Discovery and Data Mining, 27-31, 1998.
[11] Piatetsky-Shapiro, G. and Matheus, C. The interestingness of deviations. KDD-94, 25-36, 1994.
[12] Piatetsky-Shapiro, G., Matheus, C., Smyth, P., and Uthurusamy, R. KDD-93: progress and challenges. AI Magazine, Fall, 77-87, 1994.
[13] P. Smyth and R. M. Goodman. Rule induction using information theory. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
[14] Silberschatz, A. and Tuzhilin, A. On Subjective Measures of Interestingness in Knowledge Discovery. In Proc. of the First International Conference on Knowledge Discovery and Data Mining, 275-281, 1995.
[15] Silberschatz, A. and Tuzhilin, A. What Makes Patterns Interesting in Knowledge Discovery Systems. IEEE Trans. on Knowledge and Data Engineering, Special Issue on Data Mining, vol. 5, no. 6, 970-974, 1996.
[16] V. Dhar and A. Tuzhilin. Abstract-driven pattern discovery in databases. IEEE Transactions on Knowledge and Data Engineering, 5(6), 1993.
[17] Zadeh, L. A. Similarity relations and fuzzy orderings. Inf. Sci., 3, 159-176, 1971.
[18] Zimmermann, H. J. Fuzzy set theory and its applications. Kluwer Academic Publishers, 1991.
Similarity Search in Structured Data
Hans-Peter Kriegel and Stefan Schönauer
University of Munich, Institute for Computer Science
{kriegel, schoenauer}@informatik.uni-muenchen.de
Abstract. Recently, structured data has become more and more important in database applications such as molecular biology, image retrieval or XML document retrieval. Attributed graphs are a natural model for the structured data in those applications. For the clustering and classification of such structured data, a similarity measure for attributed graphs is necessary. All known similarity measures for attributed graphs are either limited to a special type of graph or computationally extremely complex, i.e. NP-complete, and are therefore unsuitable for data mining in large databases. In this paper, we present a new similarity measure for attributed graphs, called the matching distance. We demonstrate how the matching distance can be used for efficient similarity search in attributed graphs. Furthermore, we propose a filter-refinement architecture and an accompanying set of filter methods to reduce the number of necessary distance calculations during similarity search. Our experiments show that the matching distance is a meaningful similarity measure for attributed graphs and that it enables efficient clustering of structured data.
1 Introduction
Modern database applications, like molecular biology, image retrieval or XML document retrieval, are mainly based on complex structured objects. Those objects have an internal structure that is usually modeled using graphs or trees, which are then enriched with attribute information (cf. figure 1). In addition to the data objects, those modern database applications can also be characterized by their most important operations, which are extracting new knowledge from the database, or in other words data mining. The data mining tasks in this context require some notion of similarity or dissimilarity of objects in the database. A common approach is to extract a vector of features from the database objects and then use the Euclidean distance or some other Lp-norm between those feature vectors as similarity measure. But often this results in very high-dimensional feature vectors, which even index structures for high-dimensional feature vectors, like the X-tree [1], the IQ-tree [2] or the VA-file [3], can no longer handle efficiently due to a number of effects usually described by the term 'curse of dimensionality'.
Fig. 1. Examples of attributed graphs: an image together with its graph and the graph of a molecule.
For graph-modeled data especially, the additional problem arises of how to include the structural information in the feature vector. As the structure of a graph cannot be modeled by a low-dimensional feature vector, the dimensionality problem gets even worse. A way out of this dilemma is to define similarity directly for attributed graphs; consequently, there is a strong need for similarity measures for attributed graphs. Several approaches to this problem have been proposed recently. Unfortunately, all of them have certain drawbacks, like being restricted to special graph types or having NP-complete time complexity, which makes them unusable for data mining applications. Therefore, we present a new similarity measure for attributed graphs, called the edge matching distance, which is not restricted to special graph types and can be evaluated efficiently. Additionally, we propose a filter-refinement architecture for efficient query processing and provide a set of filter methods for the edge matching distance. The paper is organized as follows: In the next section, we describe the existing similarity measures for attributed graphs and discuss their strengths and weaknesses. The edge matching distance and its properties are presented in section 3, before the query architecture and the filter methods are introduced in section 4. In section 5, the effectiveness and efficiency of our methods are demonstrated in experiments with real data from the domain of image retrieval, before we finish with a short conclusion.
2 Related Work
As graphs are a very general object model, graph similarity has been studied in many fields. Similarity measures for graphs have been used in systems for shape retrieval [4], object recognition [5] or face recognition [6]. All those measures exploit graph features specific to the graphs in the application in order to define graph similarity. Examples of such features are a given one-to-one mapping between the vertices of different graphs or the requirement that all graphs are of the same order. A very common similarity measure for graphs is the edit distance. It uses the same principle as the well-known edit distance for strings [7, 8]. The idea is to determine the minimal number of insertions and deletions of vertices and edges
to make the compared graphs isomorphic. In [9], Sanfeliu and Fu extended this principle to attributed graphs by introducing vertex relabeling as a third basic operation besides insertions and deletions. In [10], this measure is used for data mining in a graph. Unfortunately, the edit distance is a very time-complex measure: Zhang, Statman and Shasha proved in [11] that the edit distance for unordered labeled trees is NP-complete. Consequently, a restricted edit distance for connected acyclic graphs, i.e. trees, was introduced in [12]. Papadopoulos and Manolopoulos presented another similarity measure for graphs in [13]. Their measure is based on histograms of the degree sequences of graphs and can be computed in linear time, but it does not take the attribute information of vertices and edges into account. In the field of image retrieval, the similarity of attributed graphs is sometimes described as an assignment problem [14], where the similarity distance between two graphs is defined as the minimal cost for mapping the vertices of one graph to those of another graph. With an appropriate cost function for the assignment of vertices, this measure takes the vertex attributes into account and can be evaluated in polynomial time. This assignment measure, which we will call the vertex matching distance in the rest of the paper, completely ignores the structure of the graphs, i.e. they are just treated as sets of vertices.
3 The Edge Matching Distance
As we just described, all the known similarity measures for attributed graphs have certain drawbacks. Starting from the edit distance and the vertex matching distance, we propose a new method to measure the similarity of attributed graphs. This method solves the problems mentioned above and is useful in the context of large databases of structured objects.
3.1 Similarity of Structured Data
The similarity of attributed graphs has several major aspects. The first one is the structural similarity of the graphs and the second one is the similarity of the attributes. Additionally, the weighting of these two aspects is significant, because it is highly application dependent to what extent the structural similarity determines the object similarity and to what extent the attribute similarity has to be considered. With the edit distance between attributed graphs there exists a similarity measure that fulfills all those conditions. Unfortunately, the computational complexity of this measure is too high to use it for clustering databases of arbitrary size. The vertex matching distance, on the other hand, can be evaluated in polynomial time, but it does not take the structural relationships between the vertices into account, which results in too coarse a model for the similarity of attributed graphs. For our similarity measure, called the edge matching
Fig. 2. An example of an edge matching between the graphs G1 and G2.
distance, we also rely on the principle of graph matching. But instead of matching the vertices of two graphs, we propose a cost function for the matching of edges and then derive a minimal-weight maximal matching between the edge sets of the two graphs. This way, not only the attribute distribution but also the structural relationships of the vertices are taken into account. Figure 2 illustrates the idea behind our measure, while the formal definition of the edge matching distance is as follows:
Definition 1 (edge matching, edge matching distance). Let G1(V1, E1) and G2(V2, E2) be two attributed graphs. Without loss of generality, we assume that |E1| ≥ |E2|. The complete bipartite graph Gem(Vem = E1 ∪ E2 ∪ {∆}, E1 × (E2 ∪ {∆})), where ∆ represents an empty dummy edge, is called the edge matching graph of G1 and G2. An edge matching between G1 and G2 is defined as a maximal matching in Gem. Let there be a non-negative metric cost function c : E1 × (E2 ∪ {∆}) → R0+. We define the matching distance between G1 and G2, denoted by dmatch(G1, G2), as the cost of the minimum-weight edge matching between G1 and G2 with respect to the cost function c.
Through the use of an appropriate cost function, it is possible to adapt the edge matching distance to the particular application needs. This includes how individual attributes are weighted and how the structural similarity is weighted relative to the attribute similarity.
3.2 Properties of the Edge Matching Distance
In order to use the edge matching distance for the clustering of attributed graphs, we need to investigate a few properties of this measure. The time complexity of the measure is of great importance for its applicability in data mining applications. Additionally, the proof of the following theorem provides an algorithm for computing the matching distance efficiently.
Theorem 1. The matching distance can be calculated in O(n³) time in the worst case.
Proof. To calculate the matching distance between two attributed graphs G1 and G2, a minimum-weight edge matching between the two graphs has to be determined. This is equivalent to determining a minimum-weight maximal matching
in the edge matching graph of G1 and G2. To achieve this, the method of Kuhn [15] and Munkres [16] can be used. This algorithm, also known as the Hungarian method, has a worst-case complexity of O(n³), where n is the number of edges in the larger of the two graphs.
Apart from the complexity of the edge matching distance itself, it is also important that there are efficient search algorithms and index structures to support its use in large databases. In the context of similarity search, two query types are most important: range queries and (k-)nearest-neighbor queries. Especially for k-nearest-neighbor search, Roussopoulos, Kelley and Vincent [17] and Hjaltason and Samet [18] proposed efficient algorithms. Both of these require that the similarity measure is a metric. Additionally, those algorithms rely on an index structure for the metric objects, such as the M-tree [19]. Therefore, the following theorem is of great importance for the practical application of the edge matching distance.
Theorem 2. The edge matching distance for attributed graphs is a metric.
Proof. To show that the edge matching distance is a metric, we have to prove the three metric properties for this similarity measure.
1. dmatch(G1, G2) ≥ 0: The edge matching distance between two graphs is the sum of the costs of the individual edge matchings. As the cost function is non-negative, any sum of cost values is also non-negative.
2. dmatch(G1, G2) = dmatch(G2, G1): The minimum-weight maximal matching in a bipartite graph is symmetric if the edges in the bipartite graph are undirected, which is equivalent to the cost function being symmetric. As the cost function is a metric, the cost for matching two edges is symmetric. Therefore, the edge matching distance is symmetric.
3. dmatch(G1, G3) ≤ dmatch(G1, G2) + dmatch(G2, G3): As the cost function is a metric, the triangle inequality holds for each triple of edges in G1, G2 and G3 and for those edges that are mapped to an empty edge. The edge matching distance is the sum of the costs of matching the individual edges. Therefore, the triangle inequality also holds for the edge matching distance.
Definition 1 does not require the two graphs to be isomorphic in order to have a matching distance of zero. But the matching of the edges, together with an appropriate cost function, ensures that graphs with a matching distance of zero have a very high structural similarity. Even if the application requires that only isomorphic graphs are considered identical, the matching distance is still of great use: the following lemma allows the matching distance between two graphs to be used as a filter for the edit distance in a filter-refinement architecture, as described in section 4.1. This way, the number of expensive edit distance calculations during query processing can be greatly reduced.
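A sketch of the computation from the proof of Theorem 1 is shown below, using SciPy's linear_sum_assignment as a stand-in for the Kuhn/Munkres procedure; the cost function and the dummy-edge cost are assumed to be supplied by the application.

```python
# Sketch of the edge matching distance via minimum-weight assignment.
# linear_sum_assignment stands in for the Hungarian method; edge_cost is an
# application-supplied metric cost function and dummy_cost the cost c(e, Delta)
# of matching an edge to the empty edge.
import numpy as np
from scipy.optimize import linear_sum_assignment

def edge_matching_distance(edges1, edges2, edge_cost, dummy_cost):
    # Ensure edges1 is the larger set, pad edges2 with dummy edges (None).
    if len(edges1) < len(edges2):
        edges1, edges2 = edges2, edges1
    padded = list(edges2) + [None] * (len(edges1) - len(edges2))
    cost = np.array([[dummy_cost if e2 is None else edge_cost(e1, e2)
                      for e2 in padded] for e1 in edges1])
    rows, cols = linear_sum_assignment(cost)     # O(n^3) worst case
    return cost[rows, cols].sum()

# Toy usage with scalar edge attributes and unit dummy cost.
d = edge_matching_distance([0.2, 0.5, 0.9], [0.1, 0.45],
                           edge_cost=lambda a, b: abs(a - b), dummy_cost=1.0)
print(d)   # 0.1 + 0.05 + 1.0 = 1.15
```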
Lemma 1. Given a cost function for the edge matching which is always less than or equal to the cost for editing an edge, the matching distance between attributed graphs is a lower bound for the edit distance between attributed graphs: ∀G1, G2 : dmatch(G1, G2) ≤ dED(G1, G2).
Proof. The edit distance between two graphs is the number of edit operations which are necessary to make those graphs isomorphic. To be isomorphic, the two graphs have to have identical edge sets and identical vertex sets. As the cost function for the edge matching distance is always less than or equal to the cost of transforming two edges into each other through an edit operation, the edge matching distance is a lower bound for the number of edit operations necessary to make the two edge sets identical. As the cost for making the vertex sets identical is not covered by the edge matching distance, it follows that the edge matching distance is a lower bound for the edit distance between attributed graphs.
4 Efficient Query Processing Using the Edge Matching Distance
While the edge matching distance already has polynomial time complexity, as compared to the exponential time complexity of the edit distance, a matching distance calculation is still a complex operation. Therefore, it makes sense to try to reduce the number of distance calculations during query processing. This goal can be achieved by using a filter-refinement architecture.
4.1 Multi-Step Query Processing
Query processing in a filter-refinement architecture is performed in two or more steps, where the first steps are filter steps that return a number of candidate objects from the database. For those candidate objects, the exact similarity distance is determined in the refinement step and the objects fulfilling the query predicate are reported. To reduce the overall search time, the filter steps have to be easy to perform and a substantial part of the database objects has to be filtered out. Additionally, the completeness of the filter step is essential, i.e. there must be no false drops during the filter steps. Available similarity search algorithms guarantee completeness if the distance function in the filter step fulfills the lower-bounding property, which means that the filter distance between two objects must always be less than or equal to their exact distance. Using a multi-step query architecture requires efficient algorithms which actually make use of the filter step. Agrawal, Faloutsos and Swami proposed such an algorithm for range search [20]. In [21] and [22], multi-step algorithms for k-nearest-neighbor search were presented which are optimal in the number of exact distance calculations necessary during query processing. Therefore, we employ the latter algorithms in our experiments.
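For a range query, the filter-refinement principle can be sketched in a few lines; the optimal multi-step k-nearest-neighbor algorithms cited above follow the same idea but interleave filtering and refinement.

```python
# Minimal filter-refinement range query: d_filter must lower-bound d_exact,
# so discarding objects with d_filter > eps causes no false drops.
def range_query(query, database, eps, d_filter, d_exact):
    candidates = [obj for obj in database if d_filter(query, obj) <= eps]
    return [obj for obj in candidates if d_exact(query, obj) <= eps]
```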
4.2 A Filter for the Edge Matching Distance
To employ a filter-refinement architecture, we need filters for the edge matching distance which cover the structural as well as the attribute properties of the graphs in order to be effective. A way to derive a filter for a similarity measure is to approximate the database objects and then determine the similarity of those approximations. As an approximation of the structure of a graph G we use the size of that graph, denoted by s(G), i.e. the number of edges in the graph. We define the following similarity measure for our structural approximation of attributed graphs:

dstruct(G1, G2) = |s(G1) − s(G2)| · wmismatch

Here wmismatch is the cost for matching an edge with the empty edge ∆. When the edge matching distance between two graphs is determined, all edges of the larger graph which are not mapped onto an edge of the smaller graph are mapped onto an empty dummy edge ∆. Therefore, the above measure fulfills the lower-bounding property, i.e. ∀G1, G2 : dstruct(G1, G2) ≤ dmatch(G1, G2).
Our filters for the attribute part of graphs are based on the observation that the difference between the attribute distributions of two graphs influences their edge matching distance. This is due to the fact that, during the distance calculation, edges of the two graphs are assigned to each other. Consequently, the more edges with the same attribute values two graphs have, i.e. the more similar their attribute value distributions are, the smaller their edge matching distance. Obviously, it is too complex to determine the exact difference of the attribute distributions of two graphs in order to use this as a filter, so an approximation of those distributions is needed. We propose a filter for the attribute part of graphs which exploits the fact that |x − y| ≥ ||x| − |y||. For attributes which are associated with edges, we add all the absolute values of an attribute in a graph. For two graphs G1 and G2 with s(G1) = s(G2), the difference between those sums, denoted by da(G1, G2), is the minimum total difference between G1 and G2 for the respective attribute. Weighted appropriately according to the cost function that is used, this is a lower bound for the edge matching distance. For graphs of different size, this is no longer true, as an edge causing the attribute difference could also be assigned to an empty edge. Therefore, the difference in size of the graphs multiplied by the maximum cost for this attribute has to be subtracted from da(G1, G2) in order to be lower-bounding in all cases. When considering attributes that are associated with vertices in the graphs, we have to take into account that during the distance calculation a vertex v is compared with several vertices of the second graph, namely exactly degree(v) many vertices. To account for this effect, the absolute attribute value of a vertex attribute has to be multiplied by the degree of the vertex which carries this attribute value, before the attribute values are added in the same manner as for edge attributes. Again, the appropriately weighted size difference has to be subtracted in order to achieve a lower-bounding filter value for a vertex attribute.
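The filter just described can be sketched as follows; the graph representation, the clamping of negative attribute terms to zero, and the omission of per-attribute cost weighting are simplifications assumed for this illustration.

```python
# Sketch of the filter vector and filter distance described above. Graphs are
# assumed to be (vertices, edges) with attribute dicts; per-attribute cost
# weighting is omitted and negative attribute terms are clamped to zero,
# which keeps the value a lower bound. All names are illustrative.

def filter_vector(graph, edge_attrs, vertex_attrs):
    vertices, edges = graph     # vertices: {v: {attr: val}}, edges: [(u, v, {attr: val})]
    degree = {v: 0 for v in vertices}
    for u, v, _ in edges:
        degree[u] += 1
        degree[v] += 1
    vec = {"size": len(edges)}
    for a in edge_attrs:
        vec[a] = sum(abs(attrs[a]) for _, _, attrs in edges)
    for a in vertex_attrs:
        vec[a] = sum(abs(attrs[a]) * degree[v] for v, attrs in vertices.items())
    return vec

def filter_distance(vec1, vec2, w_mismatch, max_cost):
    size_diff = abs(vec1["size"] - vec2["size"])
    d = size_diff * w_mismatch                   # structural part d_struct
    for a, c_max in max_cost.items():            # size-corrected attribute parts
        d += max(0.0, abs(vec1[a] - vec2[a]) - size_diff * c_max)
    return d
```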
Fig. 3. Result of a 10-nearest-neighbor query for the pictograph dataset. The query object is shown on top, the result for the vertex matching distance is in the middle row and the result for the edge matching distance is in the bottom row.
With the above methods it is ensured that the sum of the structural filter distance plus all attribute filter distances is still a lower bound for the edge matching distance between two graphs. Furthermore, it is possible to precompute the structural and all attribute filter values and store them in a single vector. This supports efficient filtering during query processing.
5 Experimental Evaluation
To evaluate our new methods, we chose an image retrieval application and ran tests on a number of real-world data sets:
– 705 black-and-white pictographs
– 9818 full-color TV images
To extract graphs from the images, they were segmented with a region-growing technique and neighboring segments were connected by edges to represent the neighborhood relationship. Each segment was assigned four attribute values: the size, the height and width of the bounding box, and the color of the segment. The values of the first three attributes were expressed as a percentage relative to the image size, height and width in order to make the measure invariant to scaling. We implemented all methods in Java 1.4 and performed our tests on a workstation with a 2.4 GHz Xeon processor and 4 GB RAM. To calculate the cost for matching two edges, we add the differences between the values of the attributes of the corresponding terminal vertices of the two edges, each divided by the maximal possible difference for the respective attribute. This way, relatively small differences in the attribute values of the vertices result in a small matching cost for the compared edges. The cost for matching an edge with an empty edge is equal to the maximal cost for matching two edges. This results in a cost function which fulfills the metric properties.
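A sketch of this cost function is given below; the attribute ranges, the grey-value range for color, and the choice of the cheaper vertex orientation for undirected edges are assumptions of the sketch, not details from the paper.

```python
# Sketch of the edge cost function used in the experiments: normalized vertex
# attribute differences are summed over the two terminal vertices; the dummy
# cost is the maximal possible matching cost. The attribute ranges, the color
# range and the orientation handling are assumptions of this sketch.
ATTR_RANGES = {"size": 100.0, "height": 100.0, "width": 100.0, "color": 255.0}

def vertex_cost(v1, v2):
    return sum(abs(v1[a] - v2[a]) / r for a, r in ATTR_RANGES.items())

def edge_cost(e1, e2):
    (a1, b1), (a2, b2) = e1, e2          # an edge is a pair of vertex-attribute dicts
    # The graphs are undirected, so take the cheaper pairing of terminal vertices.
    return min(vertex_cost(a1, a2) + vertex_cost(b1, b2),
               vertex_cost(a1, b2) + vertex_cost(b1, a2))

DUMMY_COST = 2 * len(ATTR_RANGES)        # maximal cost of matching two edges
```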
Fig. 4. A cluster of portraits in the TV-images.
Figure 3 shows a comparison between the results of a 10-nearest-neighbor query in the pictograph dataset with the edge matching distance and with the vertex matching distance. As one can see, the result obtained with the edge matching distance contains fewer false positives, due to the fact that the structural properties of the images are taken into account more by this measure. It is important to note that this better result was obtained even though the runtime of the query processing increases by as little as 5%. To demonstrate the usefulness of the edge matching distance for data mining tasks, we determined clusterings of the TV-images using the density-based clustering algorithm DBSCAN [23]. In figure 4, one cluster found with the edge matching distance is depicted. Although the cluster contains some other objects, it clearly consists mainly of portraits. When clustering with the vertex matching distance, we found no comparable cluster, i.e. this cluster could only be found with the edge matching distance as the similarity measure. To measure the selectivity of our filter method, we implemented a filter-refinement architecture as described in [21]. For each of our datasets, we measured the average filter selectivity for 100 queries which retrieved various fractions of the database. The results for the experiment using the full-color TV-images are depicted in figure 5(a). They show that the selectivity of our filter is very good: e.g., for a query result which is 5% of the database size, more than 87% of the database objects are filtered out. The results for the pictograph dataset, as shown in figure 5(b), underline the good selectivity of the filter method: even for a quite large result size of 10%, more than 82% of the database objects are removed by the filter. As the calculation of the edge matching distance is far more complex than that of the filter distance, it is not surprising that the reduction in runtime resulting from filter use was proportional to the number of database objects which were filtered out.
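Such a clustering can be reproduced in outline with a precomputed distance matrix; scikit-learn's DBSCAN stands in for the authors' implementation, and the parameter values are illustrative.

```python
# Sketch: cluster graphs with DBSCAN on a precomputed edge matching distance
# matrix. scikit-learn stands in for the authors' Java implementation; eps and
# min_samples are illustrative parameters.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_graphs(graphs, distance):
    n = len(graphs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = distance(graphs[i], graphs[j])
    labels = DBSCAN(eps=1.5, min_samples=5, metric="precomputed").fit_predict(dist)
    return labels                        # label -1 marks noise objects
```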
6 Conclusions
In this paper, we presented a new similarity measure for data modeled as attributed graphs. Starting from the vertex matching distance, well known from the field of image retrieval, we developed the so-called edge matching distance, which
Fig. 5. Average filter selectivity for the TV-image dataset (a) and the pictograph dataset (b).
is based on a minimum-weight maximal matching of the edge sets of the graphs. This measure takes the structural and the attribute properties of the attributed graphs into account and can be calculated in O(n³) time in the worst case, which, unlike the common edit distance, allows it to be used in data mining applications. In our experiments, we demonstrate that the edge matching distance reflects the similarity of graph-modeled objects better than the related vertex matching distance, while having an almost identical runtime. Furthermore, we devised a filter-refinement architecture and a filter method for the edge matching distance. Our experiments show that this architecture reduces the number of necessary distance calculations during query processing by between 87% and 93%. In future work, we will investigate different cost functions for the edge matching distance as well as their usefulness for different applications. This especially includes the field of molecular biology, where we plan to apply our methods to the problem of similarity search in protein databases.
7 Acknowledgement
Finally let us acknowledge the help of Stefan Brecheisen, who implemented part of our code.
References
1. Berchtold, S., Keim, D., Kriegel, H.P.: The X-tree: An index structure for high-dimensional data. In: Proc. 22nd VLDB Conf., Bombay, India (1996) 28–39
2. Berchtold, S., Böhm, C., Jagadish, H., Kriegel, H.P., Sander, J.: Independent quantization: An index compression technique for high-dimensional data spaces. In: Proc. of the 16th ICDE (2000) 577–588
3. Weber, R., Schek, H.J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proc. 24th VLDB Conf. (1998) 194–205
4. Huet, B., Cross, A., Hancock, E.: Shape retrieval by inexact graph matching. In: Proc. IEEE Int. Conf. on Multimedia Computing Systems. Volume 2, IEEE Computer Society Press (1999) 40–44
5. Kubicka, E., Kubicki, G., Vakalis, I.: Using graph distance in object recognition. In: Proc. ACM Computer Science Conference (1990) 43–48
6. Wiskott, L., Fellous, J.M., Krüger, N., von der Malsburg, C.: Face recognition by elastic bunch graph matching. IEEE PAMI 19 (1997) 775–779
7. Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics-Doklady 10 (1966) 707–710
8. Wagner, R.A., Fisher, M.J.: The string-to-string correction problem. Journal of the ACM 21 (1974) 168–173
9. Sanfeliu, A., Fu, K.S.: A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man and Cybernetics 13 (1983) 353–362
10. Cook, D.J., Holder, L.B.: Graph-based data mining. IEEE Intelligent Systems 15 (2000) 32–41
11. Zhang, K., Statman, R., Shasha, D.: On the editing distance between unordered labeled trees. Information Processing Letters 42 (1992) 133–139
12. Zhang, K., Wang, J., Shasha, D.: On the editing distance between undirected acyclic graphs. International Journal of Foundations of Computer Science 7 (1996) 43–57
13. Papadopoulos, A., Manolopoulos, Y.: Structure-based similarity search with graph histograms. In: Proc. DEXA/IWOSS Int. Workshop on Similarity Search, IEEE Computer Society Press (1999) 174–178
14. Petrakis, E.: Design and evaluation of spatial similarity approaches for image retrieval. Image and Vision Computing 20 (2002) 59–76
15. Kuhn, H.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2 (1955) 83–97
16. Munkres, J.: Algorithms for the assignment and transportation problems. Journal of the SIAM 6 (1957) 32–38
17. Roussopoulos, N., Kelley, S., Vincent, F.: Nearest neighbor queries. In: Proc. ACM SIGMOD, ACM Press (1995) 71–79
18. Hjaltason, G.R., Samet, H.: Ranking in spatial databases. In: Advances in Spatial Databases, 4th International Symposium, SSD'95, Portland, Maine. Volume 951 of Lecture Notes in Computer Science, Springer (1995) 83–95
19. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. In: Proc. of 23rd VLDB Conf. (1997) 426–435
20. Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequence databases. In: Proc. of the 4th Int. Conf. of Foundations of Data Organization and Algorithms (FODO), Springer Verlag (1993) 69–84
21. Seidl, T., Kriegel, H.P.: Optimal multi-step k-nearest neighbor search. In: Proc. ACM SIGMOD, ACM Press (1998) 154–165
22. Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., Protopapas, Z.: Fast and effective retrieval of medical tumor shapes. IEEE TKDE 10 (1998) 889–904
23. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, AAAI Press (1996) 226–231
Using an Interest Ontology for Improved Support in Rule Mining
Xiaoming Chen (1), Xuan Zhou (1), Richard Scherl (2), and James Geller (1)
(1) CS Dept., New Jersey Institute of Technology, Newark, NJ 07102
(2) Monmouth University, West Long Branch, New Jersey 07764
Abstract. This paper describes the use of a concept hierarchy for improving the results of association rule mining. Given a large set of tuples with demographic information and personal interest information, association rules can be derived that associate ages and gender with interests. However, there are two problems: some data sets are too sparse to come up with rules of high support, and some data sets with abstract interests do not represent the actual interests well. To overcome these problems, we preprocess the data tuples using an ontology of interests. Interests within tuples that are very specific are replaced by more general interests retrieved from the interest ontology. This results in many more tuples at a more general level. Feeding those tuples to an association rule miner results in rules that have better support and that better represent reality.
(This research was supported by the NJ Commission for Science and Technology. Contact author: James Geller, [email protected].)
1 Introduction
Data mining has become an important research tool for the purpose of marketing. It makes it possible to draw far-reaching conclusions from existing customer databases about connections between different products purchased. If demographic data are available, data mining also allows the generation of rules that connect them with products. However, companies are not just interested in the behavior of their existing customers; they would like to find out about potential customers. Typically, there is no information about potential customers available in a company database that can be used for data mining. It is possible to perform data mining on potential customers if one makes the following two adjustments: (1) instead of looking at products already purchased, we may look at the interests of a customer; (2) many people express their interests freely and explicitly on their Web home pages. The process of mining data of potential customers thus becomes a process of Web mining. In this project, we are extracting raw data from home pages on the Web. In the second stage, we raise specific but sparse data to higher levels, to make it denser. In the third stage, we apply traditional rule mining algorithms to the data. When mining real data, what is available is often too sparse to produce rules with reasonable support. In this paper, we describe how to
improve the support of mined rules by using a large ontology of interests that are related to the extracted raw data.
2 Description of Project, Data and Mining
Our Web Marketing system consists of six modules. (1) The Web search module extracts home pages of users from several portal sites. Currently, the following portal sites are used: LiveJournal, ICQ and Yahoo, as well as a few major universities. (2) The Object-Relational database stores the cleaned results of this search. (3) The data mining module uses the WEKA [13] package for extracting association rules from the table data. (4) The ontology is the main knowledge representation of this project [4, 11]. It consists of interest hierarchies based on Yahoo and ICQ. (5) The advanced extraction component processes Web pages which do not follow simple structure rules. (6) The front end is a user-friendly, Web-based GUI that allows users with no knowledge of SQL to query both the raw data in the tables and the derived rules. The data that we are using for data mining consists of records of real personal data that contain either demographic data and expressed interest data or two different items of interest data. In most cases, we are using triples of age, gender and one interest as input for data mining. In other cases we are using pairs of interests. Interests are derived from one of sixteen top level interest categories. These interest categories are called interests at level 1. Examples of level 1 interests (according to Yahoo) include RECREATION SPORTS, HEALTH WELLNESS, GOVERNMENT POLITICS, etc. Interests are organized as a DAG (Directed Acyclic Graph) hierarchy. As a result of the large size of the database, the available data goes well beyond the capacity of the data mining program. Thus, the data sets had to be broken into smaller data sets. A convenient way to do this is to perform data mining on the categories divided at level 1 (top level) or the children of level 1. Thus there are 16 interest categories at level 1, and the interest GOVERNMENT POLITICS has 20 children, including LAW, MILITARY, ETHICS, TAXES, etc. At the time when we extracted the data, ENTERTAINMENT ARTS was the largest data file at level 1. It had 176218 data items, which is not too large to be handled by the data mining program. WEKA generates association rules [1] using the Apriori algorithm first presented by [2]. Since WEKA only works with clean data converted to a fixed format, called .arff format, we have created customized programs to do data selection and data cleaning.
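As an illustration of the data selection and cleaning step, the following sketch writes (age range, gender, interest) triples into WEKA's .arff format; the relation name, file name and value lists are illustrative, and interest values are shown with underscores to avoid quoting nominal values.

```python
# Sketch of converting (age range, gender, interest) triples to WEKA's .arff
# format. The relation name, file name and interest list are illustrative;
# interests are written with underscores to avoid quoting nominal values.
def write_arff(path, triples, interests):
    ages = ["A", "B", "C", "D", "E"]     # A = 10-19, B = 20-29, ...
    with open(path, "w") as f:
        f.write("@relation interests\n\n")
        f.write("@attribute age {%s}\n" % ",".join(ages))
        f.write("@attribute gender {M,F}\n")
        f.write("@attribute interest {%s}\n\n" % ",".join(interests))
        f.write("@data\n")
        for age, gender, interest in triples:
            f.write("%s,%s,%s\n" % (age, gender, interest))

write_arff("business_finance.arff",
           [("B", "M", "BUSINESS_FINANCE"), ("D", "F", "COMPUTERS")],
           ["BUSINESS_FINANCE", "COMPUTERS"])
```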
3 Using Raising for Improved Support
A concept hierarchy is present in many databases either explicitly or implicitly. Some previous work utilizes a hierarchy for data mining. Han [5] discusses data mining at multiple concept levels. His approach is to use discovered associations at one level (e.g., milk → bread) to direct the search for associations at a different level (e.g., milk of brand X → bread of brand Y). As most of our data mining involves only one interest, our problem setting is quite different. Han et al. [6] introduce a top-down progressive deepening method for mining multiple-level association rules. They utilize the hierarchy to collect large item sets at different concept levels. Our approach utilizes an interest ontology to improve support in rule mining by means of concept raising. Fortin et al. [3] use an object-oriented representation for data mining. Their interest is in deriving multi-level association rules. As we are typically using only one data item in each tuple for raising, the possibility of multi-level rules does not arise in our problem setting. Srikant et al. [12] present Cumulative and EstMerge algorithms to find associations between items at any level by adding all ancestors of each item to the transaction. In our work, items of different levels do not coexist in any step of mining. Psaila et al. [9] describe a method to improve association rule mining by using a generalization hierarchy. Their hierarchy is extracted from the schema of the database and used together with mining queries [7]. In our approach, we are making use of a large pre-existing concept hierarchy, which contains concepts from the data tuples. Páircéir et al. also differ from our work in that they are mining multi-level rules that associate items spanning several levels of a concept hierarchy [10]. Joshi et al. [8] are interested in situations where rare instances are really the most interesting ones, e.g., in intrusion detection. They present a two-phase data mining method with a good balance of precision and recall. For us, rare instances are not important by themselves; they are only important because, together with other rare instances, they add up to frequently occurring instances for data mining.

There are 11 levels in the Yahoo interest hierarchy. Every extracted interest belongs somewhere in the hierarchy, and is at a certain level. The lower the level value, the higher up it is in the hierarchy. Level 0 is the root. Level 1 is the top level, which includes 16 interests. For example, FAMILY HOME is an interest at level 1. PARENTING is an interest at level 2. PARENTING is a child of FAMILY HOME in the hierarchy. If a person expressed an interest in PARENTING, it is common sense that he or she is interested in FAMILY HOME. Therefore, at level 1, when we count those who are interested in FAMILY HOME, it is reasonable to count those who are interested in PARENTING. This idea applies in the same way to lower levels. A big problem in the derivation of association rules is that available data is sometimes very sparse and biased as a result of the interest hierarchy. For example, among over a million interest records in our database only 11 people expressed an interest in RECREATION SPORTS, and nobody expressed an interest in SCIENCE. The fact that people did not express interests with more general terms does not mean they are not interested. The data file of
RECREATION SPORTS has 62734 data items. In other words, 62734 interest expressions of individuals are in the category of RECREATION SPORTS. Instead of saying “I’m interested in Recreation and Sports,” people prefer saying “I’m interested in basketball and fishing.” They tend to be more specific with their interests. We analyzed the 16 top level categories of the interest hierarchy. We found users expressing interests at the top level only in two categories, MUSIC and RECREATION SPORTS. When mining data at higher levels, it is important to include data at lower levels, in order to gain data accuracy and higher support.

In the following examples, the first letter stands for an age range. The age range from 10 to 19 is represented by A, 20 to 29 is B, 30 to 39 is C, 40 to 49 is D, etc. The second letter stands for Male or Female. Text after a double slash (//) is not part of the data. It contains explanatory remarks.

Original Data File:
B,M,BUSINESS FINANCE //level=1
D,F,METRICOM INC //level=7
E,M,BUSINESS SCHOOLS //level=2
C,F,ALUMNI //level=3
B,M,MAKERS //level=4
B,F,INDUSTRY ASSOCIATIONS //level=2
C,M,AOL INSTANT MESSENGER //level=6
D,M,INTRACOMPANY GROUPS //level=3
C,M,MORE ABOUT ME //wrong data

The levels below 7 do not have any data in this example. Raising will process the data level-by-level starting at level 1. It is easiest to see what happens if we look at the processing of level 3. First the result is initialized with the data at level 3 contained in the source file. With our data shown above, that means that the result is initialized with the following two lines.
C,F,ALUMNI
D,M,INTRACOMPANY GROUPS

In order to perform the raising we need to find ancestors at level 3 of the interests in our data. Table 1 shows all ancestors of our interests from levels 4, 5, 6, 7, such that the ancestors are at level 3. The following lines are now added to our result.
D,F,COMMUNICATIONS AND NETWORKING // raised from level=7 (1st ancestor)
D,F,COMPUTERS // raised from level=7 (2nd ancestor)
B,M,ELECTRONICS // raised from level=4
C,M,COMPUTERS // raised from level=6

That means, after raising we have the following occurrence counts at level 3.
ALUMNI: 1
INTRACOMPANY GROUPS: 1
COMMUNICATIONS AND NETWORKING: 1
COMPUTERS: 2
ELECTRONICS: 1

Before raising, we only had two items at level 3. Now, we have six items at level 3. That means that we now have more data as input for data mining than before raising. Thus, the results of data mining will have better support and will much better reflect the actual interests of people.

Table 1. Relevant Ancestors

Interest Name              Its Ancestor(s) at Level 3
METRICOM INC               COMMUNICATIONS AND NETWORKING
METRICOM INC               COMPUTERS
MAKERS                     ELECTRONICS
AOL INSTANT MESSENGER      COMPUTERS
Due to the existence of multiple parents and common ancestors, the precise method of raising is very important. There are different ways to raise a data file. One way is to get the data file of the lowest level, and raise interests bottom-up, one level at a time, until we finish at level 1. The data raised from lower levels is combined with the original data from the given level to form the data file at that level. If an interest has multiple parents, we include these different parents in the raised data. However, if those parents have the same ancestor at some higher level, duplicates of data appear at the level of common ancestors. This problem is solved by adopting a different method: we are raising directly to the target level, without raising to any intermediate level. After raising to a certain level, all data at this level can be deleted and never have to be considered again for lower levels. This method solves the problem of duplicates caused by multiple parents and common ancestors. The data file also becomes smaller when the destination level becomes lower. In summary, the raising algorithm is implemented as follows: Raise the original data to level 1. Do data mining. Delete all data at level 1 from the original data file. Raise the remaining data file to level 2. Do data mining. Delete all data at level 2 from the data file, etc. Continue until there’s no more valid data. The remaining data in the data file are wrong data.
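A minimal sketch of this raise-mine-delete loop is given below. The helpers level_of, ancestors_at and mine are hypothetical stand-ins for the hierarchy lookup and the WEKA mining step; the tuple representation is an assumption for illustration.

    # Illustrative sketch of raising directly to a target level (not the project's actual code).
    def raise_to_level(tuples, level, level_of, ancestors_at):
        """tuples: list of (age, gender, interest); returns the raised data set for `level`."""
        raised = []
        for age, gender, interest in tuples:
            if level_of(interest) == level:
                raised.append((age, gender, interest))        # already at the target level
            elif level_of(interest) > level:
                for anc in ancestors_at(interest, level):     # raise directly to the target level
                    raised.append((age, gender, anc))
            # interests above the target level (and wrong data) are ignored here
        return raised

    def mine_all_levels(tuples, max_level, level_of, ancestors_at, mine):
        data = list(tuples)
        for level in range(1, max_level + 1):
            mine(raise_to_level(data, level, level_of, ancestors_at))  # e.g. run the rule miner
            data = [t for t in data if level_of(t[2]) != level]        # delete data at this level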
4 Results
The quality of association rules is normally measured by specifying support and confidence. Support may be given in two different ways [13], as absolute support and as relative support. Witten et al. write:
The coverage of an association rule is the number of instances for which it predicts correctly – this is often called its support. . . . It may also be convenient to specify coverage as a percentage of the total number of instances instead. (p. 64)

For our purposes, we are most interested in the total number of tuples that can be used for deriving association rules, thus we will use the absolute number of support only. The data support is substantially improved by means of raising. Following are two rules from RECREATION SPORTS at level 2 without raising:

age=B interest=AVIATION 70 ⇒ gender=M 55 conf:(0.79) (1)
age=C interest=OUTDOORS 370 ⇒ gender=M 228 conf:(0.62) (2)

Following are two rules from RECREATION SPORTS at level 2 with raising:

age=A gender=F 13773 ⇒ interest=SPORTS 10834 conf:(0.79) (3)
age=C interest=OUTDOORS 8284 ⇒ gender=M 5598 conf:(0.68) (4)

Rule (2) and Rule (4) have the same attributes and rule structure. Without raising, the absolute support is 228, while with raising it becomes 5598. The improvement of the absolute support of this rule is 2355%. Not all rules for the same category and level have the same attributes and structure. For example, rule (1) appeared in the rules without raising, but not in the rules with raising. Without raising, 70 people are of age category B and choose AVIATION as their interest. Among them, 55 are male. The confidence for this rule is 0.79. After raising, there is no rule about AVIATION, because the support is too small compared with other interests such as SPORTS and OUTDOORS. In other words, one effect of raising is that rules that appear in the result of WEKA before raising might not appear after raising and vice versa.

There is a combination of two factors why rules may disappear after raising. First, this may be a result of how WEKA orders the rules that it finds by confidence and support. WEKA primarily uses confidence for ordering the rules. There is a cut-off parameter, so that only the top N rules are returned. Thus, by raising, a rule in the top N might drop below the top N. There is a second factor that affects the change of order of the mined rules. Although the Yahoo ontology ranks both AVIATION and SPORTS as level-2 interests, the hierarchy structure underneath them is not balanced. According to the hierarchy, AVIATION has 21 descendants, while SPORTS has 2120 descendants, which is about 100 times more. After raising to level 2, all nodes below level 2 are replaced by their ancestors at level 2. As a result, SPORTS becomes an interest with overwhelmingly high support, whereas the improvement rate for AVIATION is so small that it disappeared from the rule set after raising.

There is another positive effect of raising. Rule (3) above appeared in the rules with raising. After raising, 13773 people are of age category A and gender category F. Among them, 10834 are interested in SPORTS. The confidence is 0.79. These data look good enough to generate a convincing rule. However, there were no rules about SPORTS before raising. Thus, we have uncovered a rule with strong support that also agrees with our intuition. However, without raising, this
rule was not in the result of WEKA. Thus, raising can uncover new rules that agree well with our intuition and that also have better absolute support.

To evaluate our method, we compared the support and confidence of raised and unraised rules. The improvement of support is substantial. Table 2 compares support and confidence for the same rules before and after raising for RECREATION SPORTS at level 2. There are 58 3-attribute rules without raising, and 55 3-attribute rules with raising. 18 rules are the same in both results. Their support and confidence are compared in the table. The average support is 170 before raising, and 4527 after raising. The average improvement is 2898%. Thus, there is a substantial improvement in absolute support. After raising, the lower average confidence is a result of expanded data. Raising affects not only the data that contributes to a rule, but all other data as well. Thus, confidence was expected to drop. Even though the confidence is lower, the improvement in support by far outstrips this unwanted effect.

Table 2. Support and Confidence Before and After Raising (int = interest, gen = gender)

Rule                              Supp. w/o rais.  Supp. w/ rais.  Improv. of supp.  Conf. w/o rais.  Conf. w/ rais.  Improv. of conf.
age=C int=AUTOMOTIVE ⇒ gen=M            57             3183            5484%              80               73             -7%
age=B int=AUTOMOTIVE ⇒ gen=M           124             4140            3238%              73               65             -8%
age=C int=OUTDOORS ⇒ gen=M             228             5598            2355%              62               68              6%
age=D int=OUTDOORS ⇒ gen=M             100             3274            3174%              58               67              9%
age=B int=OUTDOORS ⇒ gen=M             242             5792            2293%              54               61              7%
age=C gen=M ⇒ int=OUTDOORS             228             5598            2355%              51               23            -28%
gen=M int=AUTOMOTIVE ⇒ age=B           124             4140            3238%              47               37            -10%
age=D gen=M ⇒ int=OUTDOORS             100             3274            3174%              46               27            -19%
age=B int=OUTDOORS ⇒ gen=F             205             3660            1685%              46               39             -7%
age=B gen=M ⇒ int=OUTDOORS             242             5792            2293%              44               18            -26%
gen=F int=OUTDOORS ⇒ age=B             205             3660            1685%              42               39             -3%
gen=M int=OUTDOORS ⇒ age=B             242             5792            2293%              38               34             -4%
int=AUTOMOTIVE ⇒ age=B gen=M           124             4140            3238%              35               25            -10%
gen=M int=OUTDOORS ⇒ age=C             228             5598            2355%              35               33             -2%
age=D ⇒ gen=M int=OUTDOORS             100             3274            3174%              29               19            -10%
gen=M int=AUTOMOTIVE ⇒ age=C            57             3183            5484%              22               28              6%
int=OUTDOORS ⇒ age=B gen=M             242             5792            2293%              21               22              1%
int=OUTDOORS ⇒ age=C gen=M             228             5598            2355%              20               21              1%
Table 3 shows the comparison of all rules that are the same before and after raising. The average improvement of support is calculated at level 2, level 3, level 4, level 5 and level 6 for each of the 16 categories. As explained in Sect. 3, few people expressed an interest at level 1, because these interest names are too general. Before raising, there are only 11 level-1 tuples with the interest RECREATION SPORTS and 278 tuples with the interest MUSIC. In the other
14 categories, there are no tuples at level 1 at all. However, after raising, there are 6,119 to 174,916 tuples at level 1, because each valid interest in the original data can be represented by its ancestor at level 1, no matter how low the interest is in the hierarchy. All the 16 categories have data down to level 6. However, COMPUTERS INTERNET, FAMILY HOME and HEALTH WELLNESS have no data at level 7. In general, data below level 6 is very sparse and does not contribute a great deal to the results. Therefore, we present the comparison of rules from level 2 through level 5 only. Some rules generated by WEKA are the same with and without raising. Some are different. In some cases, there is not a single rule in common between the rule sets with and without raising. The comparison is therefore not applicable. Those conditions are denoted by “N/A” in the table.
Table 3. Support Improvement Rate of Common Rules

Category                 Level 2   Level 3   Level 4   Level 5
BUSINESS FINANCE           122%      284%       0%      409%
COMPUTERS INTERNET         363%      121%      11%        0%
CULTURES COMMUNITY          N/A      439%      N/A      435%
ENTERTAINMENT ARTS          N/A       N/A      N/A       N/A
FAMILY HOME                148%       33%       0%        0%
GAMES                      488%       N/A     108%        0%
GOVERNMENT POLITICS        333%      586%       0%       N/A
HEALTH WELLNESS            472%      275%     100%      277%
HOBBIES CRAFTS              N/A        0%       0%        0%
MUSIC                       N/A     2852%      N/A        0%
RECREATION SPORTS         2898%       N/A      76%       N/A
REGIONAL                  6196%      123%      N/A        0%
RELIGION BELIEFS           270%       88%     634%        0%
ROMANCE RELATIONSHIPS      224%      246%      N/A       17%
SCHOOLS EDUCATION          295%      578%      N/A      297%
SCIENCE                   1231%        0%     111%      284%
Average Improvement       1086%      432%     104%      132%
Table 4 shows the average improvement of support of all rules after raising to level 2, level 3, level 4 and level 5 within the 16 interest categories. This is computed as follows. We sum the support values for all rules before raising and divide them by the number of rules, i.e., we compute the average support before raising, Sb. Similarly, we compute the average support Sa of all the rules after raising. Then the improvement rate R is computed as:

R = ((Sa − Sb) / Sb) × 100 [percent]    (1)
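As a hypothetical illustration (the numbers are not taken from our tables): if the average support in a category is Sb = 150 before raising and Sa = 600 after raising, then R = ((600 − 150) / 150) × 100 = 300%.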
The average improvement rate for level 2 through level 5 is, respectively, 279%, 152%, 68% and 20%. WEKA ranks the rules according to the confidence, and discards rules with lower confidence even though the support may be higher. In Tab. 4 there are three values where the improvement rate R is negative. This may happen if the total average relative support becomes lower after raising. That in turn can happen, because, as mentioned before, the rules before and after raising may be different rules. The choice of rules by WEKA is primarily made based on relative support and confidence values.

Table 4. Support Improvement Rate of All Rules

Category                 Level 2   Level 3   Level 4   Level 5
BUSINESS FINANCE           231%      574%     -26%      228%
COMPUTERS INTERNET         361%      195%      74%      -59%
CULTURES COMMUNITY        1751%      444%     254%      798%
ENTERTAINMENT ARTS        4471%     2438%    1101%      332%
FAMILY HOME                 77%       26%      56%       57%
GAMES                      551%     1057%     188%      208%
GOVERNMENT POLITICS        622%      495%     167%     1400%
HEALTH WELLNESS            526%      383%     515%      229%
HOBBIES CRAFTS           13266%        2%       7%       60%
MUSIC                    13576%     3514%      97%       62%
RECREATION SPORTS         6717%      314%      85%      222%
REGIONAL                  7484%      170%     242%      -50%
RELIGION BELIEFS           285%       86%     627%      383%
ROMANCE RELATIONSHIPS      173%      145%    2861%       87%
SCHOOLS EDUCATION          225%      550%    1925%      156%
SCIENCE                    890%      925%     302%      317%
Average Improvement        279%      152%      68%       20%

5 Conclusions and Future Work
In this paper, we showed that the combination of an ontology of the mined concepts with a standard rule mining algorithm can be used to generate data sets with orders of magnitude more tuples at higher levels. Generating rules from these tuples results in much larger (absolute) support values. In addition, raising often produces rules that, according to our intuition, better represent the domain than rules found without raising. Formalizing this intuition is a subject of future work. According to our extensive experiments with tuples derived from Yahoo interest data, data mining with raising can improve absolute support for rules up to over 6000% (averaged over all common rules in one interest category). Improvements in support may be even larger for individual rules. When averaging
over all support improvements for all 16 top level categories and levels 2 to 5, we get a value of 438%. Future work includes using other data mining algorithms, and integrating the raising process directly into the rule mining algorithm. Besides mining for association rules, we can also perform classification and clustering at different levels of the raised data. The rule mining algorithm itself needs adaptation to our domain. For instance, there are over 31,000 interests in our version of the interest hierarchy. Yahoo has meanwhile added many more interests. Finding interest – interest associations becomes difficult using WEKA, as interests of persons appear as sets, which are hard to map onto the .arff format.
References

1. R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., 1993.
2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann, 1994.
3. S. Fortin and L. Liu. An object-oriented approach to multi-level association rule mining. In Proceedings of the Fifth International Conference on Information and Knowledge Management, pages 65–72. ACM Press, 1996.
4. J. Geller, R. Scherl, and Y. Perl. Mining the web for target marketing information. In Proceedings of CollECTeR, Toulouse, France, 2002.
5. J. Han. Mining knowledge at multiple concept levels. In CIKM, pages 19–24, 1995.
6. J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. of 1995 Int'l Conf. on Very Large Data Bases (VLDB'95), Zürich, Switzerland, September 1995, pages 420–431, 1995.
7. J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane. DMQL: A data mining query language for relational databases, 1996.
8. M. V. Joshi, R. C. Agarwal, and V. Kumar. Mining needle in a haystack: classifying rare classes via two-phase rule induction. SIGMOD Record (ACM Special Interest Group on Management of Data), 30(2):91–102, 2001.
9. G. Psaila and P. L. Lanzi. Hierarchy-based mining of association rules in data warehouses. In Proceedings of the 2000 ACM Symposium on Applied Computing, pages 307–312. ACM Press, 2000.
10. R. Páircéir, S. McClean, and B. Scotney. Discovery of multi-level rules and exceptions from a distributed database. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 523–532. ACM Press, 2000.
11. R. Scherl and J. Geller. Global communities, marketing and web mining. Journal of Doing Business Across Borders, 1(2):141–150, 2002. http://www.newcastle.edu.au/journal/dbab/images/dbab 1(2).pdf.
12. R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of 1995 Int'l Conf. on Very Large Data Bases (VLDB'95), Zürich, Switzerland, September 1995, pages 407–419, 1995.
13. I. H. Witten and E. Frank. Data Mining. Morgan Kaufmann Publishers, San Francisco, 2000.
Fraud Formalization and Detection

Bharat Bhargava, Yuhui Zhong, and Yunhua Lu

Center for Education and Research in Information Assurance and Security (CERIAS) and Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA
{bb,zhong,luy}@cs.purdue.edu
Abstract. A fraudster can be an impersonator or a swindler. An impersonator is an illegitimate user who steals resources from the victims by “taking over” their accounts. A swindler is a legitimate user who intentionally harms the system or other users by deception. Previous research efforts in fraud detection concentrate on identifying frauds caused by impersonators. Detecting frauds conducted by swindlers is a challenging issue. We propose an architecture to catch swindlers. It consists of four components: profile-based anomaly detector, state transition analysis, deceiving intention predictor, and decision-making component. Profile-based anomaly detector outputs fraud confidence indicating the possibility of fraud when there is a sharp deviation from usual patterns. State transition analysis provides state description to users when an activity results in entering a dangerous state leading to fraud. Deceiving intention predictor discovers malicious intentions. Three types of deceiving intentions, namely uncovered deceiving intention, trapping intention, and illusive intention, are defined. A deceiving intention prediction algorithm is developed. A user-configurable risk evaluation function is used for decision making. A fraud alarm is raised when the expected risk is greater than the fraud investigation cost.
1 Introduction
Fraudsters can be classified into two categories: impersonators and swindlers. An impersonator is an illegitimate user who steals resources from the victims by “taking over” their accounts. A swindler, on the other hand, is a legitimate user who intentionally harms the system or other users by deception. Taking superimposition fraud in telecommunication [7] as an example, impersonators impose their usage on the accounts of legitimate users by using cloned phones with Mobile Identification Numbers (MIN) and Equipment Serial Numbers (ESN) stolen from the victims. Swindlers obtain legitimate accounts and use the services without the intention to pay the bills, which is called subscription fraud. Impersonators can be forestalled by utilizing cryptographic technologies that provide strong protection to users' authentication information. The idea of separation of duty may be applied to reduce the impact of a swindler.
This research is supported by NSF grant IIS-0209059.
The essence is to restrict the power an entity (e.g., a transaction partner) can have, to prevent him from abusing it. An empirical example of this idea is that laws are set, enforced and interpreted by different parties. Separation of duty can be implemented by using access control mechanisms such as role-based access control or the lattice-based access control model [8]. Separation of duty policies and other mechanisms, like dual-log bookkeeping [8], reduce frauds but cannot eliminate them. For example, in online auctions such as eBay, sellers and buyers have restricted knowledge about the other side. Although eBay, as a trusted third party, has authentication services to check the information provided by sellers and buyers (e.g. phone numbers), it is impossible to verify all of it due to the high volume of online transactions. Fraud is a persistent issue in such an environment.

In this paper, we concentrate on swindler detection. Three approaches are considered: (a) detecting an entity's activities that deviate from normal patterns, which may imply the existence of a fraud; (b) constructing state transition graphs for existing fraud scenarios and detecting fraud attempts similar to the known ones; and (c) discovering an entity's intention based on his behavior. The first two approaches can also be used to detect frauds conducted by impersonators. The last one is applicable only to swindler detection.

The rest of this paper is organized as follows. Section 2 introduces the related work. Definitions for fraud and deceiving intentions are presented in Section 3. An architecture for swindler detection is proposed in Section 4. It consists of a profile-based anomaly detector, a state transition analysis component, a deceiving intention predictor, and a decision-making component. The functionalities and design considerations for each component are discussed. An algorithm for predicting deceiving intentions is designed and studied via experiments. Section 5 concludes the paper.
2 Related Work
Fraud detection systems are widely used in telecommunications, online transactions, the insurance industry, computer and network security [1, 3, 6, 11]. The majority of research efforts addresses detecting impersonators (e.g. detecting superimposition fraud in telecommunications). Effective fraud detection uses both fraud rules and pattern analysis. Fawcett and Provost proposed an adaptive rule-based detection framework [4]. Rosset et al. pointed out that standard classification and rule generation were not appropriate for fraud detection [7]. The generation and selection of a rule set should combine both user-level and behavior-level attributes. Burge and Shawe-Taylor developed a neural network technique [2]. The probability distributions for current behavior profiles and behavior profile histories are compared using Hellinger distances. Larger distances indicate more suspicion of fraud. Several criteria exist to evaluate the performance of fraud detection engines. ROC (Receiver Operating Characteristics) is a widely used one [10, 5]. Rosset et al. use accuracy and fraud coverage as criteria [7]. Accuracy is the number
of detected instances of fraud over the total number of classified frauds. Fraud coverage is the number of detected frauds over the total number of frauds. Stolfo et al. use a cost-based metric in commercial fraud detection systems [9]. If the loss resulting from a fraud is smaller than the investigation cost, this fraud is ignored. This metric is not suitable in circumstances where such a fraud happens frequently and causes a significant accumulative loss.
3 Formal Definitions
Frauds by swindlers occur in cooperations where each entity makes a commitment. A swindler is an entity that has no intention to keep his commitment.

Commitment is the set of integrity constraints, assumptions, and conditions an entity promises to satisfy in a process of cooperation. A commitment is described by using a conjunction of expressions. An expression is (a) an equality with an attribute variable on the left hand side and a constant representing the expected value on the right hand side, or (b) a user-defined predicate that represents certain complex constraints, assumptions and conditions. A user-defined Boolean function is associated with the predicate to check whether the constraints, assumptions and conditions hold.

Outcome is the actual result of a cooperation. Each expression in a commitment has a corresponding one in the outcome. For an equality expression, the actual value of the attribute is on the right hand side. For a predicate expression, if the user-defined function is true, the predicate itself is in the outcome. Otherwise, the negation of the predicate is included.

Example: A commitment of a seller for selling a vase is (Received by = 04/01) ∧ (Price = $1000) ∧ (Quality = A) ∧ ReturnIfAnyQualityProblem. This commitment says that the seller promises to send out one “A” quality vase at the price of $1000. The vase should be received by April 1st. If there is a quality problem, the buyer can return the vase. A possible outcome is (Received by = 04/05) ∧ (Price = $1000) ∧ (Quality = B) ∧ ¬ReturnIfAnyQualityProblem. This outcome shows that the vase, of quality “B”, was received on April 5th. The return request was refused. We may conclude that the seller is a swindler.

Predicates and attribute variables play different roles in detecting a swindler. We define two properties, namely intention-testifying and intention-dependent.

Intention-testifying: A predicate P is intention-testifying if the presence of ¬P in an outcome leads to the conclusion that a partner is a swindler. An attribute variable V is intention-testifying if one can conclude that a partner is a swindler when V's expected value is more desirable than the actual value.

Intention-dependent: A predicate P is intention-dependent if it is possible that a partner is a swindler when ¬P appears in an outcome. An attribute variable V is intention-dependent if it is possible that a partner is a swindler when its expected value is more desirable than the actual value.

An intention-testifying variable or predicate is intention-dependent. The opposite direction is not necessarily true.
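As a hedged illustration of these definitions, the following Python sketch encodes the vase commitment and outcome as simple dictionaries and reports which expressions are violated. The representation (including the identifier Received_by) is an assumption for this example, not the paper's formalism; for simplicity it treats any deviation from the expected value as a violation.

    # Illustrative encoding of a commitment and an outcome (not the authors' implementation).
    # Equality expressions map attribute -> expected value; predicates map name -> whether they hold.
    commitment = {
        "equalities": {"Received_by": "04/01", "Price": 1000, "Quality": "A"},
        "predicates": {"ReturnIfAnyQualityProblem": True},
    }
    outcome = {
        "equalities": {"Received_by": "04/05", "Price": 1000, "Quality": "B"},
        "predicates": {"ReturnIfAnyQualityProblem": False},   # negated predicate in the outcome
    }

    def violated_expressions(commitment, outcome):
        """Return the expressions whose actual value differs from the expected one."""
        bad = [a for a, expected in commitment["equalities"].items()
               if outcome["equalities"].get(a) != expected]
        bad += [p for p, _ in commitment["predicates"].items()
                if not outcome["predicates"].get(p, False)]
        return bad

    print(violated_expressions(commitment, outcome))
    # ['Received_by', 'Quality', 'ReturnIfAnyQualityProblem']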
Fig. 1. Deceiving intention: satisfaction rating plotted against the number of observations for (a) uncovered deceiving intention, (b) trapping intention, and (c) illusive intention.
In the above example, ReturnIfAnyQualityProblem can be intention-testifying or intention-dependent. The decision is up to the user. Price is intention-testifying since, if the seller charges more money, we believe that he is a swindler. Quality and Received by are defined as intention-dependent variables, considering that a seller may not have full control over them.

3.1 Deceiving Intentions
Since the intention-testifying property is usually too strong in real applications, variables and predicates are specified as intention-dependent. A conclusion that a partner is a swindler cannot be drawn with 100% certainty based on one intention-dependent variable or predicate in one outcome. Two approaches can be used to increase the confidence: (a) consider multiple variables or predicates in one outcome; and (b) consider one variable or predicate in multiple outcomes. The second approach is applied in this paper.

Assume a satisfaction rating ranging from 0 to 1 is given for the actual value of each intention-dependent variable in an outcome. The higher the rating is, the more satisfied the user is. The value of 0 means totally unacceptable and the value of 1 indicates that the actual value is not worse than the expected value. For example, if the quality of the received vase is B, the rating is 0.5. If the quality is C, the rating drops to 0.2. For each intention-dependent predicate P, the rating is 0 if ¬P appears. Otherwise, the rating is 1.

A satisfaction rating is related to an entity's deceiving intention as well as some unpredictable factors. It is modelled by using random variables with a normal distribution. The mean function fm(n) determines the mean value of the normal distribution at the nth rating. Three types of deceiving intentions are identified.

Uncovered deceiving intention: The satisfaction ratings associated with a swindler having uncovered deceiving intention are stably low. The ratings vary in a small range over time. The mean function is defined as fm(n) = M, where M is a constant. Figure 1a shows satisfaction ratings with fm(n) = 0.2. The fluctuation of ratings results from the unpredictable factors.
Trapping intention: The rating sequence can be divided into two phases: preparing and trapping. A swindler behaves well to achieve a trustworthy image before he conducts frauds. The mean function can be defined as fm(n) = m_high if n ≤ n0, and fm(n) = m_low otherwise, where n0 is the turning point. Figure 1b shows satisfaction ratings for a swindler with trapping intention. fm(n) is 0.8 for the first 50 interactions and 0.2 afterwards.

Illusive intention: A smart swindler with illusive intention, instead of misbehaving continuously, attempts to cover the bad effects by intentionally doing something good after misbehaviors. He repeats the process of preparing and trapping, so fm(n) is a periodic function. For simplicity, we assume the period is N; the mean function is defined as fm(n) = m_high if (n mod N) < n0, and fm(n) = m_low otherwise. Figure 1c shows satisfaction ratings with a period of 20. In each period, fm(n) is 0.8 for the first 15 interactions and 0.2 for the last five.
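A small sketch of how such rating sequences can be simulated is given below, assuming normally distributed ratings around fm(n) clipped to [0, 1]. The standard deviation and parameter values are assumptions chosen to mimic Fig. 1.

    import random

    def ratings(mean_fn, n_obs=150, sigma=0.05):
        """Simulate satisfaction ratings: normal noise around the mean function, clipped to [0, 1]."""
        return [min(1.0, max(0.0, random.gauss(mean_fn(n), sigma))) for n in range(1, n_obs + 1)]

    uncovered = ratings(lambda n: 0.2)                              # fm(n) = M = 0.2
    trapping  = ratings(lambda n: 0.8 if n <= 50 else 0.2)          # turning point n0 = 50
    illusive  = ratings(lambda n: 0.8 if (n % 20) < 15 else 0.2)    # period N = 20, n0 = 15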
4 Architecture for Swindler Detection
Swindler detection consists of four components: a profile-based anomaly detector, state transition analysis, a deceiving intention predictor, and decision-making. The profile-based anomaly detector monitors suspicious actions based upon the established patterns of an entity. It outputs fraud confidence indicating the possibility of a fraud. State transition analysis builds a state transition graph that provides a state description to users when an activity results in entering a dangerous state leading
Fig. 2. Architecture for swindler detection: a record preprocessor feeds satisfaction ratings to the profile-based anomaly detector (fraud confidence), state transition analysis (state description), and deceiving intention predictor (DI-confidence), whose outputs are passed to the decision-making component.
to a fraud. The deceiving intention predictor discovers deceiving intentions based on satisfaction ratings. It outputs DI-confidence to characterize the belief that the target entity has a deceiving intention. DI-confidence is a real number ranging over [0,1]. The higher the value is, the greater the belief is. The outputs of these components are fed into the decision-making component, which assists users to reach decisions based on predefined policies. The decision-making component passes warnings from state transition analysis to the user and displays the description of the next potential state in a readable format. The expected risk is computed as follows:

f(fraud confidence, DI-confidence, estimated cost) = max(fraud confidence, DI-confidence) × estimated cost

Users can replace this function according to their specific requirements. A fraud alarm is raised when the expected risk is greater than the fraud investigation cost. In the rest of this section, we concentrate on the other three components.
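A minimal sketch of this default risk evaluation and alarm rule follows; the function names and the example numbers are assumptions for illustration.

    def expected_risk(fraud_confidence, di_confidence, estimated_cost):
        """Default, user-replaceable risk function: max of the two confidences times the estimated cost."""
        return max(fraud_confidence, di_confidence) * estimated_cost

    def raise_alarm(fraud_confidence, di_confidence, estimated_cost, investigation_cost):
        """A fraud alarm is raised when the expected risk exceeds the fraud investigation cost."""
        return expected_risk(fraud_confidence, di_confidence, estimated_cost) > investigation_cost

    # Example: expected risk 0.7 * 5000 = 3500 > 1000, so an alarm would be raised.
    print(raise_alarm(fraud_confidence=0.3, di_confidence=0.7, estimated_cost=5000, investigation_cost=1000))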
4.1 Profile-Based Anomaly Detector
As illustrated in Fig. 3, the profile-based anomaly detector consists of rule generation and weighting, user profiling, and online detection.

Rule generation and weighting: Data mining techniques such as association rule mining are applied to generate fraud rules. The generated rules are assigned weights according to their frequency of occurrence. Both entity-level and behavior-level attributes are used in mining fraud rules and weighting. Normally, a large volume of rules will be generated.

User profiling: Profile information characterizes both the entity-level information (e.g. financial status) and an entity's behavior patterns (e.g. interested products). There are two sets of profiling data, one for history profiles and the other for current profiles. Two steps, variable selection followed by data filtering, are used for user profiling. The first step chooses variables characterizing the normal behavior. Selected variables need to be comparable among different entities.
Fig. 3. Profile-based anomaly detector: the record preprocessor feeds case selection into rule generation and weighting and into user profiling; the online detection component retrieves the selected rules and patterns and outputs the fraud confidence.
The profile of a selected variable must show a pattern under normal conditions. These variables need to be sensitive to anomalies (i.e., at least one of these patterns is not matched when an anomaly occurs). The objective of data filtering for history profiles is data homogenization (i.e. grouping similar entities). The current profile set will be dynamically updated according to behaviors. As behavior-level data is large, decay is needed to reduce the data volume. This part also involves rule selection for a specific entity based on profiling results and rules. The rule selection triggers the measurements of normal behaviors regarding the rules. These statistics are stored in history profiles for online detection.

Online detection: The detection engine retrieves the related rules from the profiling component when an activity occurs. It may retrieve the entity's current behavior patterns and behavior pattern history as well. Analysis methods such as the Hellinger distance can be used to calculate the deviation of current profile patterns from profile history patterns. These results are combined to determine the fraud confidence.
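For concreteness, the Hellinger distance between a current behavior profile and the profile history can be computed as in this sketch, assuming both profiles are represented as discrete probability distributions over the same behavior categories (the category names and numbers are made up).

    from math import sqrt

    def hellinger(p, q):
        """Hellinger distance between two discrete distributions given as dicts of probabilities."""
        keys = set(p) | set(q)
        return sqrt(0.5 * sum((sqrt(p.get(k, 0.0)) - sqrt(q.get(k, 0.0))) ** 2 for k in keys))

    history = {"small_purchase": 0.7, "large_purchase": 0.2, "refund": 0.1}
    current = {"small_purchase": 0.2, "large_purchase": 0.6, "refund": 0.2}
    print(hellinger(history, current))   # larger distances indicate more suspicion of fraud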
4.2 State Transition Analysis
State transition analysis models fraud scenarios as series of states changing from an initial secure state to a final compromised state. The initial state is the start state prior to the actions that lead to a fraud. The final state is the resulting state of completion of the fraud. There may be several intermediate states between them. The action which causes one state to transit to another is called a signature action. Signature actions are the minimum actions that lead to the final state. Without such actions, the fraud scenario will not be completed. This model requires collecting fraud scenarios and identifying the initial states and the final states. The signature actions for a scenario are identified in the backward direction. The fraud scenario is represented as a state transition graph by the states and signature actions.

A danger factor is associated with each state. It is defined by the distance from the current state to a final state. If one state leads to several final states, the minimum distance is used. For each activity, state transition analysis checks the potential next states. If the maximum value of the danger factors associated with the potential states exceeds a threshold, a warning is raised and a detailed state description is sent to the decision-making component.
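One way to realize the distance underlying the danger factor is a breadth-first search over the state transition graph, as in the sketch below. The graph encoding and the example states are assumptions, not the paper's representation.

    from collections import deque

    def distances_to_final(graph, final_states):
        """Minimum number of signature actions from each state to any final (compromised) state.
        graph: dict mapping state -> list of successor states reached by signature actions."""
        reverse = {}
        for s, succs in graph.items():
            for t in succs:
                reverse.setdefault(t, []).append(s)
        dist = {s: 0 for s in final_states}
        queue = deque(final_states)
        while queue:                                # BFS backwards from the final states
            t = queue.popleft()
            for s in reverse.get(t, []):
                if s not in dist:
                    dist[s] = dist[t] + 1
                    queue.append(s)
        return dist   # a smaller distance corresponds to a more dangerous state

    graph = {"initial": ["s1"], "s1": ["s2", "final"], "s2": ["final"], "final": []}
    print(distances_to_final(graph, ["final"]))   # {'final': 0, 's1': 1, 's2': 1, 'initial': 2}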
4.3 Deceiving Intention Predictor
The kernel of the predictor is the deceiving intention prediction (DIP) algorithm. DIP views the belief of deceiving intention as the complement of the trust belief. The trust belief about an entity is evaluated based on the satisfaction rating sequence R1, ..., Rn, where Rn is the most recent rating, which contributes a portion of α to the trust belief. The remaining portion comes from the previous trust belief, which is determined recursively. For each entity, DIP maintains a pair of factors (i.e. the current construction factor Wc and the current destruction factor Wd). If integrating Rn will increase the trust belief, α = Wc. Otherwise, α = Wd. Wc and
Wd satisfy the constraint Wc < Wd, which implies that more effort is needed to gain the same amount of trust than to lose it [12]. Wc and Wd are modified when a foul event is triggered by the fact that the incoming satisfaction rating is lower than a user-defined threshold. Upon a foul event, the target entity is put under supervision. His Wc is decreased and Wd is increased. If the entity does not conduct any foul event during the supervision period, Wc and Wd are reset to their initial values. Otherwise, they are further decreased and increased respectively. The current supervision period of an entity increases each time he conducts a foul event, so that he will be punished longer next time, which means an entity with a worse history is treated more harshly. The DI-confidence is computed as 1 − current trust belief.

The DIP algorithm accepts seven input parameters: initial construction factor Wc and destruction factor Wd; initial supervision period p; initial penalty ratios for the construction factor, destruction factor and supervision period r1, r2 and r3, such that r1, r2 ∈ (0, 1) and r3 > 1; and the foul event threshold fThreshold. For each entity k, we maintain a profile P(k) consisting of five fields: current trust value tValue, current construction factor Wc, current destruction factor Wd, current supervision period cPeriod, and rest of supervision period sRest.

DIP algorithm (Input parameters: Wd, Wc, r1, r2, r3, p, fThreshold; Output: DI-confidence)
    Initialize P(k) with input parameters
    while there is a new rating R
        if R <= fThreshold then            // put under supervision
            P(k).Wd = P(k).Wd + r1 * (1 - P(k).Wd)
            P(k).Wc = r2 * P(k).Wc
            P(k).sRest = P(k).sRest + P(k).cPeriod
            P(k).cPeriod = r3 * P(k).cPeriod
        end if
        if R <= P(k).tValue then           // update tValue
            W = P(k).Wd
        else
            W = P(k).Wc
        end if
        P(k).tValue = P(k).tValue * (1 - W) + R * W
        if P(k).sRest > 0 and R > fThreshold then
            P(k).sRest = P(k).sRest - 1
            if P(k).sRest = 0 then         // restore Wc and Wd
                P(k).Wd = Wd and P(k).Wc = Wc
            end if
        end if
        return (1 - P(k).tValue)
    end while

Experimental Study. DIP's capability of discovering the deceiving intentions defined in Section 3.1 is investigated through experiments.
Fig. 4. Experiments to discover deceiving intentions: DI-confidence plotted against the number of ratings for (a) discovery of uncovered deceiving intention, (b) discovery of trapping intention, and (c) discovery of illusive intention.
The initial construction factor is 0.05. The initial destruction factor is 0.1. The penalty ratios for the construction factor, destruction factor and supervision period are 0.9, 0.1 and 2 respectively. The threshold for a foul event is 0.18. The results are shown in Fig. 4. The x-axis of each figure is the number of ratings. The y-axis is the DI-confidence.

Swindler with uncovered deceiving intention: The satisfaction rating sequence of the generated swindler is shown in Fig. 1a. The result is illustrated in Fig. 4a. Since the possibility for the swindler to conduct foul events is high, he is under supervision most of the time. The construction and destruction factors become close to 0 and 1 respectively because of the punishment for foul events. The trust values are close to the minimum rating of the interactions, which is 0.1, and the DI-confidence is around 0.9.

Swindler with trapping intention: The satisfaction rating sequence of the generated swindler is shown in Fig. 1b. As illustrated in Fig. 4b, DIP responds to the sharp drop of fm(n) very quickly. After fm(n) changes from 0.8 to 0.2, it takes only 6 interactions for the DI-confidence to increase from 0.2239 to 0.7592.

Swindler with illusive intention: The satisfaction rating sequence of the generated swindler is shown in Fig. 1c. As illustrated in Fig. 4c, when the mean function fm(n) changes from 0.8 to 0.2, the DI-confidence increases. When fm(n) changes back from 0.2 to 0.8, the DI-confidence decreases. DIP is able to catch this smart swindler in the sense that his DI-confidence eventually increases to about 0.9. The swindler's effort to cover a fraud with good behaviors has less and less effect as the number of frauds grows.
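For concreteness, the following is a compact Python rendering of the DIP update described in Section 4.3. It is a sketch, not the authors' implementation: the initial trust value, the initial supervision period, and the exact mapping of the reported penalty ratios onto Wc and Wd are assumptions.

    class DIPProfile:
        """Sketch of the DIP update for one entity; defaults follow the reported settings where given."""
        def __init__(self, Wc=0.05, Wd=0.1, f_threshold=0.18,
                     wd_penalty=0.1, wc_penalty=0.9, period_growth=2,
                     init_trust=0.5, init_period=5):            # init_trust and init_period are assumed
            self.Wc0, self.Wd0 = Wc, Wd
            self.Wc, self.Wd = Wc, Wd
            self.f_threshold = f_threshold
            self.wd_penalty, self.wc_penalty, self.period_growth = wd_penalty, wc_penalty, period_growth
            self.t_value = init_trust
            self.c_period = init_period
            self.s_rest = 0

        def update(self, rating):
            """Process one satisfaction rating and return the current DI-confidence."""
            if rating <= self.f_threshold:                      # foul event: put under supervision
                self.Wd = self.Wd + self.wd_penalty * (1 - self.Wd)
                self.Wc = self.wc_penalty * self.Wc
                self.s_rest += self.c_period
                self.c_period *= self.period_growth
            w = self.Wd if rating <= self.t_value else self.Wc  # destruction vs. construction weight
            self.t_value = self.t_value * (1 - w) + rating * w
            if self.s_rest > 0 and rating > self.f_threshold:
                self.s_rest -= 1
                if self.s_rest == 0:                            # supervision over: restore factors
                    self.Wc, self.Wd = self.Wc0, self.Wd0
            return 1 - self.t_value                             # DI-confidence

    profile = DIPProfile()
    for r in [0.8, 0.8, 0.1, 0.15, 0.9]:
        print(round(profile.update(r), 3))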
5 Conclusions
In this paper, we classify fraudsters as impersonators and swindlers and present a mechanism to detect swindlers. The concepts relevant to frauds conducted by swindlers are formally defined. Uncovered deceiving intention, trapping intention, and illusive intention are identified. We propose an approach for swindler detection, which integrates the ideas of anomaly detection, state transition analysis, and history-based intention prediction. An architecture that realizes this approach is presented. The experiment results show that the proposed deceiving
intention prediction (DIP) algorithm accurately detects the uncovered deceiving intention. Trapping intention is captured promptly, in about 6 interactions after a swindler enters the trapping phase. The illusive intention of a swindler, who attempts to cover frauds with good behaviors, can also be caught by DIP.
References

[1] R. J. Bolton and D. J. Hand. Statistical fraud detection: A review. Statistical Science, 17(3):235–255, 2002.
[2] P. Burge and J. Shawe-Taylor. Detecting cellular fraud using adaptive prototypes. In Proceedings of the AAAI-97 Workshop on AI Approaches to Fraud Detection and Risk Management, 1997.
[3] M. Cahill, F. Chen, D. Lambert, J. Pinheiro, and D. Sun. Detecting fraud in the real world. In Handbook of Massive Datasets, pages 911–930. Kluwer Academic Publishers, 2002.
[4] T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1997.
[5] J. Hollmén and V. Tresp. Call-based fraud detection in mobile communication networks using a hierarchical regime-switching model. In Proceedings of Advances in Neural Information Processing Systems (NIPS'11), 1998.
[6] B. B. Little, W. L. Johnston, A. C. Lovell, R. M. Rejesus, and S. A. Steed. Collusion in the U.S. crop insurance program: applied data mining. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 594–598. ACM Press, 2002.
[7] S. Rosset, U. Murad, E. Neumann, Y. Idan, and G. Pinkas. Discovery of fraud rules for telecommunications: challenges and solutions. In Proceedings of the Fifth ACM SIGKDD, pages 409–413. ACM Press, 1999.
[8] R. Sandhu. Lattice-based access control models. IEEE Computer, 26(11):9–19, 1993.
[9] S. J. Stolfo, W. Lee, P. K. Chan, W. Fan, and E. Eskin. Data mining-based intrusion detectors: an overview of the Columbia IDS project. ACM SIGMOD Record, 30(4):5–14, 2001.
[10] M. Taniguchi, J. Hollmén, M. Haft, and V. Tresp. Fraud detection in communications networks using neural and probabilistic methods. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1998.
[11] D. Wagner and P. Soto. Mimicry attacks on host-based intrusion detection systems. In Proceedings of the 9th ACM Conference on Computer and Communications Security, pages 255–264. ACM Press, 2002.
[12] Y. Zhong, Y. Lu, and B. Bhargava. Dynamic trust production based on interaction sequence. Technical Report CSD-TR 03-006, Department of Computer Sciences, Purdue University, 2003.
Combining Noise Correction with Feature Selection

Choh Man Teng

Institute for Human and Machine Cognition, University of West Florida
40 South Alcaniz Street, Pensacola FL 32501, USA
[email protected]
Abstract. Polishing is a noise correction mechanism which makes use of the inter-relationship between attribute and class values in the data set to identify and selectively correct components that are noisy. We applied polishing to a data set of amino acid sequences and associated information of point mutations of the gene COLIA1 for the classification of the phenotypes of the genetic collagenous disease Osteogenesis Imperfecta (OI). OI is associated with mutations in one or both of the genes COLIA1 and COLIA2. There are at least four known phenotypes of OI, of which type II is the severest and often lethal. Preliminary results of polishing suggest that it can lead to a higher classification accuracy. We further investigated the use of polishing as a scoring mechanism for feature selection, and the effect of the features so derived on the resulting classifier. Our experiments on the OI data set suggest that combining polishing and feature selection is a viable mechanism for improving data quality.
1 Approaches to Noise Handling
Imperfections in data can arise from many sources, for instance, faulty measuring devices, transcription errors, and transmission irregularities. Except in the most structured and synthetic environment, it is almost inevitable that there is some noise in any data we have collected. Data quality is a prime concern for many tasks in learning and induction. The utility of a procedure is limited by the quality of the data we have access to. For a classification task, for instance, a classifier built from a noisy training set might be less accurate and less compact than one built from the noise-free version of the same data set using an identical algorithm.

Imperfections in a data set can be dealt with in three broad ways. We may leave the noise in, filter it out, or correct it. On the first approach, the data set is taken as is, with the noisy instances left in place. Algorithms that make use of the data are designed to be robust; that is, they can tolerate a certain amount of noise in the data. Robustness is typically accomplished by avoiding overfitting,
This work was supported by NASA NCC2-1239 and ONR N00014-03-1-0516.
so that the resulting classifier is not overly specialized to account for the noise. This approach is taken by, for example, c4.5 [Quinlan, 1987] and CN2 [Clark and Niblett, 1989]. On the second approach, the data is filtered before being used. Instances that are suspected of being noisy according to certain evaluation criteria are discarded [John, 1995; Gamberger et al., 1996; Brodley and Friedl, 1999]. A classifier is then built using only the retained instances in the smaller but cleaner data set. Similar ideas can be found in robust regression and outlier detection techniques in statistics [Rousseeuw and Leroy, 1987].

On the first approach, robust algorithms do not require preprocessing of the data, but the noise in the data may interfere with the mechanism, and a classifier built from a noisy data set may be of less utility than it could have been if the data were not noisy. On the second approach, by filtering out the noisy instances from the data, there is a tradeoff between the amount of information available for building the classifier and the amount of noise retained in the data set. Filtering is not information-efficient; the more noisy instances we discard, the less data remains. On the third approach, the noisy instances are identified, but instead of tossing these instances out, they are repaired by replacing the corrupted values with more appropriate ones. These corrected instances are then reintroduced into the data set. Noise correction has been shown to give better results than simply removing the noise from the data set in some cases [Drastal, 1991; Teng, 2001].

We have developed a data correction method called polishing [Teng, 1999]. Data polishing, when carried out correctly, would preserve the maximal information available in the data set, approximating the noise-free ideal situation. A classifier built from this corrected data should have a higher predictive power and a more streamlined representation. Polishing has been shown to improve the performance of classifiers in a number of situations [Teng, 1999; Teng, 2000]. In this paper we study in more detail a research problem in the biomedical domain, using a data set which describes the genetic collagenous disease Osteogenesis Imperfecta (OI). We have previously applied polishing to this data set, with some improvement in the accuracy and size of the resulting classifiers [Teng, 2003]. Here we in addition explore the selection and use of relevant features in the data set in conjunction with noise correction.
2 Feature Selection
Feature selection is concerned with the problem of identifying a set of features or attributes that are relevant or useful to the task at hand [Liu and Motoda, 1998, for example]. Spurious variables, either irrelevant or redundant, can affect the performance of the induced classifier. In addition, concentrating on a reduced set of features improves the readability of the classifier, which is desirable when our
goal is to achieve not only a high predictive accuracy but also an understanding of the underlying structure relating the attributes and the prediction. There are several approaches to feature selection. The utility of the features can be scored using a variety of statistical and experimental measures, for instance, correlation and information entropy [Kira and Rendell, 1992; Koller and Sahami, 1996]. The wrapper approach uses the learning algorithm itself to iteratively search for sets of features that can improve the performance of the algorithm [Kohavi and John, 1997]. Feature scoring is typically faster and the resulting data set is independent of the particular learning algorithm to be used, since the selection of the features is based on scores computed using the characteristics of the data set alone. The wrapper approach in addition takes into account the bias of the learning algorithm to be deployed by utilizing the algorithm itself in the estimation of the relevance of the features.

We study the effect of feature selection when combined with noise correction. The polishing mechanism was used in part to score the features in the data set, and the reduced and polished data set was compared to the unreduced and/or unpolished data sets. In the following sections we will first describe the polishing mechanism and the application domain (the classification of the genetic disease OI), and then we will discuss the experimental procedure together with the feature selection method employed.
3 Polishing
Machine learning methods such as the naive Bayes classifier typically assume that different components of a data set are (conditionally) independent. It has often been pointed out that this assumption is a gross oversimplification of the actual relationship between the attributes; hence the word “naive” [Mitchell, 1997, for example]. Extensions to the naive Bayes classifier have been introduced to loosen the independence criterion [Kononenko, 1991; Langley et al., 1992], but some have also investigated alternative explanations for the success of this classifier [Domingos and Pazzani, 1996]. Controversy aside, most will agree that in many cases there is a definite relationship within the data; otherwise any effort to mine knowledge or patterns from the data would be ill-advised.

Polishing takes advantage of this interdependency between the components of a data set to identify the noisy elements and suggest appropriate replacements. Rather than utilizing the features only to predict the target concept, we can as well turn the process around and utilize the target together with selected features to predict the value of another feature. This provides a means to identify noisy elements together with their correct values. Note that except for totally irrelevant elements, each feature would be at least related to some extent to the target concept, even if not to any other features.

The basic algorithm of polishing consists of two phases: prediction and adjustment. In the prediction phase, elements in the data that are suspected of
being noisy are identified together with a nominated replacement value. In the adjustment phase, we selectively incorporate the nominated changes into the data set.

In the first phase, the predictions are carried out by systematically swapping the target and particular features of the data set, and performing a ten-fold classification using a chosen classification algorithm for the prediction of the feature values. If the predicted value of a feature in an instance is different from the stated value in the data set, the location of the discrepancy is flagged and recorded together with the predicted value. This information is passed on to the next phase, where we institute the actual adjustments. Since the polishing process itself is based on imperfect data, the predictions obtained in the first phase can contain errors as well. We should not indiscriminately incorporate all the nominated changes. Rather, in the second phase, the adjustment phase, we selectively adopt appropriate changes from those predicted in the first phase, using a number of strategies to identify the best combination of changes that would improve the fitness of a datum.

Given a training set, we try to identify suspect attributes and classes and replace their values according to the polishing procedure. The bare-bones description of polishing is given in Figure 1. Polishing makes use of a procedure flip to recursively try out selective combinations of attribute changes. The function classify(Classifiers, xj, c) returns the number of classifiers in the set Classifiers which classify the instance xj as belonging to class c. Further details of polishing can be found in [Teng, 1999; Teng, 2000; Teng, 2001].
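A rough sketch of the prediction phase is given below, assuming scikit-learn-style classifiers and a numerically encoded tabular data set. The learner and library are assumptions chosen for illustration; they are not the setup used in the paper, and the sketch only conveys the attribute-swapping idea, not the exact procedure of Figure 1.

    # Rough sketch of polishing's prediction phase (illustrative assumptions throughout).
    # data is an (n_instances x n_attributes) array of categorical codes; target is the class column.
    import numpy as np
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeClassifier

    def prediction_phase(data, target, cv=10):
        """For each attribute a_i, predict a_i from (the other attributes + class) by cross-validation
        and record the instances where the prediction disagrees with the stated value."""
        nominations = {}   # attribute index -> list of (instance index, predicted value)
        for i in range(data.shape[1]):
            X = np.column_stack([np.delete(data, i, axis=1), target])   # swap a_i and the class
            y = data[:, i]
            predicted = cross_val_predict(DecisionTreeClassifier(), X, y, cv=cv)
            mismatched = np.nonzero(predicted != y)[0]
            nominations[i] = [(j, predicted[j]) for j in mismatched]
        return nominations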
4 Osteogenesis Imperfecta
Osteogenesis Imperfecta (OI), commonly known as brittle bone disease, is a genetic disorder characterized by bones that fracture easily for little or no reason. This disorder is associated with mutations in one or both of the genes COLIA1 and COLIA2, which are associated with the production of peptides of type I collagen. Type I collagen is a protein found in the connective tissues in the body. A mutation in COLIA1 or COLIA2 may lead to a change in the structure and expression of the type I collagen molecules produced, which in turn affects the bone structure. There are at least four known phenotypes of osteogenesis imperfecta, namely, types I, II, III, and IV. Of these four, type II is the severest form of OI and is often lethal. At least 70 different kinds of point mutations in COLIA1 and COLIA2 have been found to be associated with OI, and of these approximately half of the mutations are related to type II, the lethal form of OI [Hunter and Klein, 1993]. While OI may be diagnosed with collagenous or DNA tests, determining the relevant structure and the relationship between the point mutations and the types of OI remains an open research area [Klein and Wong, 1992; Mooney et al., 2001].
Polishing(OldData, votes, changes, cutoff)
  Input:  OldData: (possibly) noisy data
          votes: #classifiers that need to agree
          changes: max #changes per instance
          cutoff: size of attribute subset considered
  Output: NewData: polished data

  for each attribute a_i
    AttList_i <- {};
    tmpData <- swap a_i and class c in OldData;
    10-fold cross-validation of tmpData;
    for each instance x_j misclassified
      new <- value of a_i predicted for x_j;
      AttList_i <- AttList_i ∪ {<j, new>};
    end
  end
  NewData <- {};
  AttSorted <- relevant attributes sorted in ascending order of |AttList_i|;
  Classifiers <- classifiers from 10-fold cross-validation of OldData;
  for each instance x_j
    for k from 0 to changes
      adjusted <- flip(j, votes, k, cutoff, 0);
      if adjusted then break;
    end
    if (not adjusted) then NewData <- NewData ∪ {x_j};
  end
  return NewData;

flip(j, votes, k, cutoff, start_i)
  Input:  j: index of the instance to be adjusted
          votes: #classifiers that need to agree
          k: #changes yet to be made
          cutoff: size of attribute subset considered
          start_i: index of AttSorted containing first attribute to be adjusted
  Output: true/false: whether a change has been made (also modifies NewData)

  if k = 0 then
    if classify(Classifiers, x_j, class of x_j) >= votes
      then NewData <- NewData ∪ {x_j}; return true;
      else return false;
    end
  else
    for i from start_i to cutoff
      a_i' <- AttSorted[i];
      if <j, new> ∈ AttList_i' then
        attribute a_i' of x_j <- new;
        adjusted <- flip(j, votes, k - 1, cutoff, i + 1);
        if adjusted then return true;
        reset a_i' of x_j;
      end
    end
    return false;
  end

Fig. 1. The polishing algorithm.
4.1 Data Description

Below we will describe a data set consisting of information on sequences of amino acids, each with a point mutation in COLIA1. The sequences are divided into lethal (type II) and non-lethal (types I, III, and IV) forms of OI. The objective is to generate a classification scheme that will help us understand and differentiate between lethal and non-lethal forms of OI. Each instance in the data set contains the following attributes.
A_1, ..., A_29: a sequence of 29 amino acids. These are the amino acids at and around the site of the mutation. The mutated residue is centered at A_15, with 14 amino acids on each side in the sequence. Each attribute A_i can take on one of 21 values: each of the 20 regular amino acids (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V), and hydroxyproline (O), a modified proline (P) which can be found in collagen molecules. Four attributes provide supplementary information regarding hydrogen bonds in the molecules.
– S-W: number of solute hydrogen bonds, wild type;
– S-M: number of solute hydrogen bonds, mutated type;
– SS-W: number of solute-solvent hydrogen bonds, wild type;
– SS-M: number of solute-solvent hydrogen bonds, mutated type.

These are the number of hydrogen bonds of the specified types that are present in the wild (un-mutated) and mutated protein molecules more than 80% of the time. The class of each instance can be one of two values.
– y: lethal OI (type II);
– n: non-lethal OI (types I, III, or IV).

Thus, each instance contains 33 attributes and a binary classification.
4.2 Data Characteristics

A number of characteristics of the OI data set suggest that it is an appropriate candidate for polishing and feature selection. First of all, the amino acid sequence and associated information are prone to noise arising from the clinical procedures. Thus, there is a need for an effective measure for noise handling. The number of possible values for many of the attributes is fairly large, resulting in a data set that is sparse with little redundancy. This makes it undesirable to use an information-inefficient mechanism such as filtering for noise handling, since discarding any data instance is likely to lose some valuable information that is not duplicated in the remaining portion of the data set. While the precise relationship between the different amino acid blocks is not clear, we do know that they interact, and this inter-relationship between amino acids in a sequence can be exploited to nominate replacement values for the attributes using the polishing mechanism. In addition, the conformation of collagen molecules is exceptionally linear, and thus we can expect that each attribute may be predicted to a certain extent by considering only the values of the adjacent attributes in the sequence. Furthermore, we are interested not only in the predictive accuracy of the classifier but also in identifying the relevant features contributing to the lethal phenotype of OI and the relationship between these features. We have previously observed that many of the attributes may not be relevant [Hewett et al., 2002; Teng, 2003], in the sense that the classifier may make use of only a few of the available attributes. This makes it desirable to incorporate a feature selection procedure that may increase the intelligibility of the resulting classifier as well as improve the accuracy of the prediction by removing potentially confounding attributes.
5 Experiments
We used the decision tree builder C4.5 [Quinlan, 1993] to provide our basic classifiers, and performed ten-fold cross-validation on the OI data set described in the previous section. In each trial, nine parts of the data were used for training and a tenth part was reserved for testing. The training data was polished and the polished data was
Table 1. Average classification accuracy and size of the decision trees constructed from the unpolished and polished data. (a) All attributes were used. The difference between the classification accuracies of the pruned unpolished and polished cases is significant at the 0.05 level. (b) Only those attributes reported in Table 2 were used. The differences between the corresponding classification accuracies of the unpruned trees in (a) and (b) are significant at the 0.05 level.

(a) Using all attributes
              Unpruned              Pruned
              Accuracy  Tree Size   Accuracy  Tree Size
  Unpolished  46.5%     91.4        60.0%     11.6
  Polished    53.0%     94.8        66.0%     11.4

(b) Using only attributes reported in Table 2
              Unpruned              Pruned
              Accuracy  Tree Size   Accuracy  Tree Size
  Unpolished  71.0%     34.4        62.0%     16.7
  Polished    73.5%     76.4        66.0%     8.8
then used to construct a decision tree. The unseen (and unpolished) instances in the test data set were classified according to this tree. For each trial a tree was also constructed from the original unpolished training data for comparison purposes. Below we analyze a number of aspects of the results obtained from the experiments, namely, the classifier characteristics (accuracy and size) and the list of relevant attributes selected by the classifiers. We observed that few of the attributes were considered relevant according to this procedure. The experiments were rerun using only the selected attributes as input, and the results were compared to those obtained using all the attributes in the original data set, with and without polishing.
5.1 Classifier Characteristics

The average classification accuracy and size of the decision trees constructed from the unpolished and polished data, using all the available attributes as input, are reported in Table 1(a). The difference between the classification accuracies of the pruned trees constructed from unpolished and polished data is statistically significant at the 0.05 level, using a one-tailed paired t-test. Even though previously we found that polishing led to a decrease in tree size [Teng, 1999; Teng, 2000], in this study the tree sizes resulting from the two approaches do not differ much.
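The significance test mentioned above can be reproduced as in the following sketch; the per-fold accuracy arrays are purely illustrative placeholders, and SciPy's paired t-test stands in for whatever tool was originally used.

from scipy import stats

# Hypothetical per-fold accuracies from the ten cross-validation trials.
acc_polished   = [0.70, 0.65, 0.60, 0.68, 0.62, 0.66, 0.64, 0.71, 0.63, 0.69]
acc_unpolished = [0.60, 0.58, 0.55, 0.61, 0.57, 0.62, 0.59, 0.64, 0.56, 0.63]

# One-tailed paired t-test: is the polished accuracy significantly higher?
t_stat, p_value = stats.ttest_rel(acc_polished, acc_unpolished,
                                  alternative="greater")
print(p_value < 0.05)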
5.2 Relevant Attributes

We looked at the attributes used in constructing the unpolished and polished trees, as these attributes were the ones that were considered predictive of the OI phenotype in the decision tree setting.
Table 2. Relevant attributes, in decreasing order of the average percentage of occurrence in the decision trees.

  Unpolished                        Polished
  Attribute        % Occurrence     Attribute   % Occurrence
  S-M              33.3%            A15         50.0%
  S-W              16.7%            A11, A14    25.0%
  A15, A20, A22    16.7%
We used the number of trees involving a particular attribute as an indicator of the relevance of that attribute. Table 2 gives the normalized percentages of occurrence of the attributes, averaged over the cross-validation trials, obtained from the trees constructed using the unpolished and polished data sets respectively. The relevant attributes picked out from using the unpolished and polished data are similar, although the rank orders and the percentages of occurrence differ to some extent. We expected A15, the attribute denoting the mutated amino acid in the molecule, to play a significant role in the classification of OI disease types. This was supported by the findings in Table 2. We also noted that A15 was used more frequently in the decision trees constructed from the polished data than in those constructed from the unpolished data. The stronger emphasis placed on this attribute may partially account for the increase in the classification accuracy resulting from polishing. Other attributes that were ranked high in both the unpolished and polished cases include S-M (the number of solute hydrogen bonds, mutated) and S-W (the number of solute hydrogen bonds, wild). The amino acids in the sequence that were of interest in the unpolished and polished trees differed. Domain expertise is needed to further interpret the implications of these results.
5.3 Rebuilding with Selected Attributes

As we discussed above, the results in Table 2 indicated that only a few of the attributes were used in the decision trees. Even though the rest of the attributes were not retained in the pruned trees, they nonetheless entered into the computation, and could have had a distracting effect on the tree building process. We used as a feature scoring mechanism the decision trees built using all the attributes as input. This was similar to the approach taken in [Cardie, 1993], although in our case the same learning method was used for both feature selection and the final classifier induction. We adopted a binary scoring scheme: all and only those attributes that were used in the construction of the trees were selected. These were the attributes reported in Table 2. The classification accuracy and size of the decision trees built using only the features selected from the unpolished and polished data are reported in Table 1(b). The differences between the corresponding classification accuracies
of the unpruned trees in Tables 1(a) and (b) are significant at the 0.05 level, using a one-tailed paired t-test. The accuracy and size of the pruned trees constructed using only the selected attributes do not differ much from those obtained by using all the attributes as input. Pruning was not helpful in this particular set of experiments, perhaps because the data set had already been "cleaned" to some extent by the various preprocessing techniques. In both the unpolished and polished cases, using only the selected attributes gave rise to trees with significantly higher classification accuracy and smaller size than those obtained when all the attributes were included. This suggests that the additional refinement of thinning out the irrelevant attributes is beneficial. In addition, using the polished data as a basis for feature selection can improve to some extent the performance of the learning algorithm over the use of unpolished data for the same task.
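A minimal sketch of the binary scoring scheme of Section 5.3 is given below. It is an illustration under stated assumptions rather than the study's code: scikit-learn's DecisionTreeClassifier stands in for C4.5, and select_used_features is a hypothetical helper name.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold

def select_used_features(X, y, n_folds=10):
    """Keep exactly the attributes that appear in trees built on the
    training portion of each cross-validation fold (binary scoring)."""
    used = set()
    for train_idx, _ in KFold(n_splits=n_folds).split(X):
        tree = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
        used |= {int(f) for f in tree.tree_.feature if f >= 0}  # -2 marks leaves
    return sorted(used)

# The experiments would then be rerun on the reduced matrix:
# X_reduced = X[:, select_used_features(X, y)]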
6 Remarks
We investigated the effects of polishing and feature selection on a data set describing the genetic disease osteogenesis imperfecta. Both mechanisms, when applied individually, were shown to improve the predictive accuracy of the resulting classifiers. Better performance was obtained by combining the two techniques so that the relevant features were selected based on classifiers built from a polished data set. This suggests that the two methods combined can have a positive impact on the data quality by both correcting noisy values and removing irrelevant and redundant attributes from the input.
References

[Brodley and Friedl, 1999] Carla E. Brodley and Mark A. Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999.
[Cardie, 1993] Claire Cardie. Using decision trees to improve case-based learning. In Proceedings of the Tenth International Conference on Machine Learning, pages 25–32, 1993.
[Clark and Niblett, 1989] P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3(4):261–283, 1989.
[Domingos and Pazzani, 1996] Pedro Domingos and Michael Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 105–112, 1996.
[Drastal, 1991] George Drastal. Informed pruning in constructive induction. In Proceedings of the Eighth International Workshop on Machine Learning, pages 132–136, 1991.
[Gamberger et al., 1996] Dragan Gamberger, Nada Lavrac, and Saso Dzeroski. Noise elimination in inductive concept learning: A case study in medical diagnosis. In Proceedings of the Seventh International Workshop on Algorithmic Learning Theory, pages 199–212, 1996.
[Hewett et al., 2002] Rattikorn Hewett, John Leuchner, Choh Man Teng, Sean D. Mooney, and Teri E. Klein. Compression-based induction and genome data. In Proceedings of the Fifteenth International Florida Artificial Intelligence Research Society Conference, pages 344–348, 2002.
[Hunter and Klein, 1993] Lawrence Hunter and Teri E. Klein. Finding relevant biomolecular features. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pages 190–197, 1993.
[John, 1995] George H. John. Robust decision trees: Removing outliers from databases. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 174–179, 1995.
[Kira and Rendell, 1992] Kenji Kira and Larry A. Rendell. A practical approach to feature selection. In Proceedings of the Ninth International Conference on Machine Learning, pages 249–256, 1992.
[Klein and Wong, 1992] Teri E. Klein and E. Wong. Neural networks applied to the collagenous disease osteogenesis imperfecta. In Proceedings of the Hawaii International Conference on System Sciences, volume I, pages 697–705, 1992.
[Kohavi and John, 1997] Ron Kohavi and George H. John. Wrappers for feature selection. Artificial Intelligence, 97(1–2):273–324, 1997.
[Koller and Sahami, 1996] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 284–292, 1996.
[Kononenko, 1991] Igor Kononenko. Semi-naive Bayesian classifier. In Proceedings of the Sixth European Working Session on Learning, pages 206–219, 1991.
[Langley et al., 1992] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 223–228, 1992.
[Liu and Motoda, 1998] Huan Liu and Hiroshi Motoda, editors. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1998.
[Mitchell, 1997] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[Mooney et al., 2001] Sean D. Mooney, Conrad C. Huang, Peter A. Kollman, and Teri E. Klein. Computed free energy differences between point mutations in a collagen-like peptide. Biopolymers, 58:347–353, 2001.
[Quinlan, 1987] J. Ross Quinlan. Simplifying decision trees. International Journal of Man-Machine Studies, 27(3):221–234, 1987.
[Quinlan, 1993] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[Rousseeuw and Leroy, 1987] Peter J. Rousseeuw and Annick M. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, 1987.
[Teng, 1999] Choh Man Teng. Correcting noisy data. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 239–248, 1999.
[Teng, 2000] Choh Man Teng. Evaluating noise correction. In Proceedings of the Sixth Pacific Rim International Conference on Artificial Intelligence. Springer-Verlag, 2000.
[Teng, 2001] Choh Man Teng. A comparison of noise handling techniques. In Proceedings of the Fourteenth International Florida Artificial Intelligence Research Society Conference, pages 269–273, 2001.
[Teng, 2003] Choh Man Teng. Noise correction in genomic data. In Proceedings of the International Conference on Intelligent Data Engineering and Automated Learning. Springer-Verlag, 2003. To appear.
Pre-computing Approximate Hierarchical Range Queries in a Tree-Like Histogram

Francesco Buccafurri and Gianluca Lax

DIMET, Università degli Studi Mediterranea di Reggio Calabria, Via Graziella, Località Feo di Vito, 89060 Reggio Calabria, Italy
{bucca,lax}@ing.unirc.it
Abstract. Histograms are a lossy compression technique widely applied in various application contexts, like query optimization, statistical and temporal databases, OLAP applications, and so on. This paper presents a new histogram based on a hierarchical decomposition of the original data distribution kept in a complete binary tree. This tree, thus containing a set of pre-computed hierarchical queries, is encoded in a compressed form using bit saving in representing integer numbers. The approach, extending a recently proposed technique based on the application of such a decomposition to the buckets of a pre-existing histogram, is shown by several experiments to improve the accuracy of the state-of-the-art histograms.
1 Introduction
Histograms are a lossy compression technique widely applied in various application contexts, like query optimization [9], statistical [5] and temporal databases [12], and, more recently, OLAP applications [4, 10]. In OLAP, compression allows us to obtain fast approximate answers by evaluating queries on reduced data in place of the original data. Histograms are well suited to this purpose, especially in the case of range queries. Indeed, buckets of histograms basically correspond to a set of pre-computed range queries, allowing us to estimate the remaining possible range queries. Estimation is needed when a range query partially overlaps a bucket. As a consequence, the problem of minimizing the estimation error becomes crucial in the context of OLAP applications. In this work we propose a new histogram, extending the approach used in [2] for the estimation inside the bucket. The histogram, called nLT, consists of a tree-like index, with a number of levels depending on the fixed compression ratio. Nodes of the index contain, hierarchically, pre-computed range queries, stored by an approximate (via bit saving) encoding. Compression derives both from the aggregation implemented by the leaves of the tree, and from the saving of bits obtained by representing range queries with less than 32 bits (assumed enough for an exact representation). The number of bits used for representing range queries decreases for increasing levels of the tree. Peculiar characteristics of our histogram are the following:
1. Due to bit saving, the number of pre-computed range queries embedded in our histogram is larger than in a bucket-based histogram occupying the same storage space. Observe that such queries are stored in an approximate form. However, the hierarchical organization of the index allows us to express the value of a range query as a fraction of the range query including it (i.e., corresponding to the parent node in the tree), and this allows us to maintain a low numeric approximation. In the absence of the tree, values of range queries would be expressed as a fraction of the maximum value (i.e., the query involving the entire domain).
2. The histogram directly supports hierarchical range queries, which represent a meaningful type of query in the OLAP context [4].
3. The evaluation of a range query can be executed by visiting the tree from the root to a leaf (in the worst case), thus with a cost logarithmic in the number of smallest pre-computed range queries (this number is the counterpart of the number of buckets of a classic histogram, on which the cost of evaluating a query depends linearly).
4. The update of the histogram (we refer here to the case of the change of a single occurrence frequency) can be performed without reconstructing the entire tree, but only by updating the nodes on the path connecting the leaf involved by the change with the root of the tree. This task is hence also feasible in logarithmic time.

While the last three points above describe evidently positive characteristics of the proposed method, the first point needs some kind of validation to be considered effectively a point in favor of our proposal. Indeed, it is not a priori clear whether having a larger set of approximate pre-computed queries (even if this approximation is reduced by the hierarchical organization) is better than having a smaller set of exact pre-computed range queries. In this work we try to give an answer to this question through experimental comparison with the most relevant histograms proposed in the literature. Thus, the main contribution of the paper is to conclude that keeping pre-computed hierarchical range queries (with a suitable numerical approximation done by bit saving) improves the accuracy of histograms, not only when hierarchical decomposition is applied to the buckets of pre-existing histograms (as shown in [2]), but also when the technique is applied to the entire data distribution. The paper is organized as follows. In the next section we illustrate histograms. Our histogram is presented in Section 3. Section 4 reports results of experiments conducted on our histogram and several other ones. Finally, we give conclusions in Section 5.
2 Histograms
Histograms are used for reducing relations in order to give approximate answers to range queries on such relations. Let X be an attribute of a relation R. W.l.o.g., we assume that the domain U of the attribute X is the interval of integer numbers from 1 to |U| (|U| denotes the cardinality of the set U). The set of frequencies is the set F = {f(1), ..., f(|U|)}, where f(i)
is the number of occurrences of the value i in the relation R, for each 1 ≤ i ≤ |U|. The set of values is V = {i ∈ U such that f(i) > 0}. From now on, consider given R, X, F and V. A bucket B on X is a 4-tuple ⟨lb, ub, t, c⟩, with 1 ≤ lb < ub ≤ |U|, t = |V ∩ [lb, ub]| and c = \sum_{i=lb}^{ub} f(i). lb and ub are said, respectively, lower bound and upper bound of B, t is said number of non-null values of B and c is the sum of frequencies of B. A histogram H on X is an h-tuple ⟨B_1, ..., B_h⟩ of buckets such that (1) ∀ 1 ≤ i < h, the upper bound of B_i precedes the lower bound of B_{i+1}, and (2) ∀ j with 1 ≤ j ≤ |U|, (f(j) > 0) ⇒ ∃ i ∈ [1, h] such that j ∈ B_i. Given a histogram H and a range query Q, it is possible to return an estimation of the answer to Q using the information contained in H. At this point the following problem arises: how to partition the domain U into b buckets in order to minimize the estimation error? According to the criterion used for partitioning the domain, there are different classes of histograms (we report here only the most important ones):

1. Equi-sum Histograms [9]: buckets are obtained in such a way that the sum of occurrences in each bucket is equal to 1/b times the total sum of occurrences.
2. MaxDiff Histograms [9, 8]: each bucket has its upper bound at V_i ∈ V (the set of attribute values actually appearing in the relation R) if |φ(V_i) − φ(V_{i+1})| is one of the b − 1 highest computed values, for each i. φ(V_i) is called the area and is obtained as f(V_i) · (V_{i+1} − V_i).
3. V-Optimal Histograms [6]: the boundaries of each bucket, say lb_i and ub_i (with 1 ≤ i ≤ b), are fixed in such a way that \sum_{i=1}^{b} SSE_i is minimum, where SSE_i = \sum_{j=lb_i}^{ub_i} (f(j) − avg_i)^2 and avg_i is the average of the frequencies occurring in the i-th bucket.

In the part of the work devoted to experiments (see Section 4), among the above presented bucket-based histograms, we have considered only MaxDiff and V-Optimal histograms, as it was shown in the literature that they have the best performances in terms of accuracy. In addition, we will also consider two further bucket-based histograms, called MaxDiff4LT and V-Optimal4LT. Such methods have been proposed in [2], and consist of adding a 32-bit tree-like index, called 4LT, to each bucket of either a MaxDiff or a V-Optimal histogram. The 4LT is used for computing, in an approximate way, the frequency sums of 8 non-overlapping sub-ranges of the bucket. We observe that the idea underlying the proposal presented in this paper takes its origin just from the 4LT method, extending the application of such an approach to the construction of the entire histogram instead of single buckets. There are other kinds of histograms whose construction is not driven by the search for a suitable partition of the attribute domain and, further, their structure is more complex than simply a set of buckets. We call such histograms non bucket-based. Two important examples of histograms of this type are wavelet-based and binary-tree histograms. Wavelets are mathematical transformations implementing hierarchical decomposition of functions originally used in different
research and application contexts, like image and signal processing [7, 13]. Recent studies have shown the applicability of wavelets to selectivity estimation [6] as well as the approximation of OLAP range queries over datacubes [14, 15]. A wavelet-based histogram is not a set of buckets; it consists of a set of wavelet coefficients and a set of indices by which the original frequency set can be reconstructed. Histograms are obtained by applying one of these transformations to the original cumulative frequency set (extended over the entire attribute domain) and selecting, among the N wavelet coefficients, the m < N most significant coefficients, for m corresponding to the desired storage usage. The binary-tree histogram [1] is also based on a hierarchical multiresolution decomposition of the data distribution operating in a quad-tree fashion, adapted to the mono-dimensional case. Besides the bucket-based histograms, both of the above types of histograms are compared experimentally in this paper with our histogram, which is a non bucket-based histogram too.
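As a concrete illustration of the MaxDiff partitioning criterion recalled above, the following sketch selects bucket upper bounds from the b − 1 largest area differences. It is a hedged reading of the definition, not code from [9, 8]; V is assumed to be the sorted list of attribute values that actually occur and f a mapping from each value to its frequency.

def maxdiff_boundaries(V, f, b):
    """Return up to b - 1 values of V chosen as bucket upper bounds."""
    # Area of each value: phi(V_i) = f(V_i) * (V_{i+1} - V_i).
    areas = [f[V[i]] * (V[i + 1] - V[i]) for i in range(len(V) - 1)]
    # Differences between areas of adjacent values, tagged with the boundary V_i.
    diffs = [(abs(areas[i] - areas[i + 1]), V[i]) for i in range(len(areas) - 1)]
    diffs.sort(reverse=True)
    return sorted(v for _, v in diffs[:b - 1])

# Example: maxdiff_boundaries([1, 3, 4, 8, 9], {1: 5, 3: 2, 4: 7, 8: 1, 9: 4}, 3) -> [3, 4]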
3 The nLT Histogram
In this section we describe the proposed histogram, called nLT. Like the wavelet and binary-tree histograms, nLT is a non bucket-based histogram. Given a positive integer n, an nLT histogram on the attribute X is a full binary tree with n levels such that each node N is a 3-tuple ⟨l(N), u(N), val(N)⟩, where 1 ≤ l(N) < u(N) ≤ |U| and val(N) = \sum_{i=l(N)}^{u(N)} f(i). l(N) and u(N) are said, respectively, lower bound and upper bound of N, and val(N) is said value of N. Observe that the interval of the domain of X with boundaries l(N) and u(N) is associated to N. We denote by r(N) such an interval. Moreover, val(N) is the sum of the occurrence frequencies of X within such an interval. The root node, denoted by N_0, is such that l(N_0) = 1 and u(N_0) = |U|. Given a non-leaf node N, the left-hand child node, say N_fs, is such that l(N_fs) = l(N) and u(N_fs) = ⌊(u(N) + l(N))/2⌋, while the right-hand child node, say N_fd, is such that l(N_fd) = ⌊(u(N) + l(N))/2⌋ + 1 and u(N_fd) = u(N) (⌊x⌋ denotes the application of the floor operator to x). Concerning the implementation of the nLT, we observe that it is not necessary to keep the lower and upper bounds of the nodes, since they can be derived from the knowledge of n and the position of the node in the tree. Moreover, we do not have to keep the value of any right-hand child node either, since such a value can be obtained as the difference between the value of the parent node and the value of the sibling node. In Figure 1 an example of an nLT with n = 3 is reported. The nLT of this example refers to a domain of size 12 with 3 null elements. For each node (represented as a box), we have reported the boundaries of the associated interval (on the left side and on the right side, respectively) and the value of the node (inside the box). Grey nodes can be derived from the white nodes; thus, they are not stored.
Fig. 1. Example of nLT
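To make the structure concrete, here is a hedged sketch of an exact nLT and of the estimation of prefix range queries of the form X ≤ d (the query set used later in the experiments). It uses 0-based indexing and hypothetical function names; it is an illustration of the description above, not the authors' implementation.

def build_nlt(f, lb, ub, level, n_levels):
    """Build a full binary tree of n_levels over f[lb..ub] (inclusive bounds)."""
    node = {"lb": lb, "ub": ub, "val": sum(f[lb:ub + 1])}
    if level < n_levels - 1 and lb < ub:
        mid = (lb + ub) // 2
        node["left"] = build_nlt(f, lb, mid, level + 1, n_levels)
        node["right"] = build_nlt(f, mid + 1, ub, level + 1, n_levels)
    return node

def estimate_prefix(node, d):
    """Estimate sum(f[0..d]) by descending the tree; inside a leaf the
    answer is obtained by linear interpolation, as described for the nLT."""
    if d < node["lb"]:
        return 0.0
    if d >= node["ub"]:
        return float(node["val"])
    if "left" not in node:  # leaf only partially covered by the query
        width = node["ub"] - node["lb"] + 1
        return node["val"] * (d - node["lb"] + 1) / width
    return estimate_prefix(node["left"], d) + estimate_prefix(node["right"], d)

# Usage: root = build_nlt(f, 0, len(f) - 1, 0, 9); answer = estimate_prefix(root, d)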
The storage space required by the nLT, in case integers are encoded using t bits, is t · 2^{n−1}. We assume that t = 32 is enough for representing integer values with no scaling approximation. In the following we will refer to this kind of nLT implementation as the exact implementation of the nLT or, for short, the exact nLT. In the next section, we will illustrate how to reduce the storage space by varying the number of bits used for encoding the value of the nodes. Of course, on top of the lossy compression due to the linear interpolation needed for retrieving all the non pre-computed range queries, the bit saving adds a further source of lossy compression.

3.1 Approximate nLT
In this section we describe the approximate nLT, that is, an implementation of the nLT which uses a length-variable encoding of integer numbers. In particular, all nodes which belong to the same level in the tree are represented with the same number of bits. When we go down to the next lower level, we reduce by 1 the number of bits used for representing the nodes of this level. This bit saving allows us to increase the nLT depth (w.r.t. the exact nLT), once the total storage space is fixed, and to have a larger set of pre-computed range queries and thus a higher resolution. Substantially, the approach is based on the assumption that, on average, the sum of occurrences of a given interval of the frequency vector is twice the sum of the occurrences of each half of such an interval. This assumption is chosen as a heuristic criterion for designing the approximate nLT, and it explains the choice of reducing by 1 per level the number of bits used for representing numbers. Clearly, the sum contained in a given node is represented as a fraction of the sum contained in the parent node. Observe that, in principle, a representation allowing a possibly different number of bits for nodes belonging to the same level, depending on the actual values contained in the nodes, could also be used. However, we would then have to deal with the spatial overhead due to these variable codes. The reduction of 1 bit per level appears to be a reasonable compromise. We describe now in more detail how to encode with a certain number of bits, say k, the value of a given node N, denoting by P the parent node of N.
With such a representation, the value val(N) of the node will in general not be recovered exactly; it will be affected by a certain scaling approximation. We denote by \widetilde{val}_k(N) the encoding of val(N) done with k bits and by val_k(N) the approximation of val(N) obtained from \widetilde{val}_k(N). We have that:

  \widetilde{val}_k(N) = Round( (val(N) / val(P)) · (2^k − 1) )

Clearly, 0 ≤ \widetilde{val}_k(N) ≤ 2^k − 1. Concerning the approximation of val(N), it results:

  val_k(N) = ( \widetilde{val}_k(N) / (2^k − 1) ) · val(P)

The absolute error due to the k-bit encoding of the node N, with parent node P, is:

  ε_a(val(N), val(P), k) = |val(N) − val_k(N)|.

It can be easily verified that 0 ≤ ε_a(val(N), val(P), k) ≤ val(P) / 2^{k+1}. The relative error is defined as:

  ε_r(val(N), val(P), k) = ε_a(val(N), val(P), k) / val(N).

Define now the average relative error (for variable value of the node N) as:

  \bar{ε}_r(val(P), k) = (1 / val(P)) · \sum_{i=1}^{val(P)} ε_r(i, val(P), k).

We observe that, for the root node N_0, we use 32 bits. Thus, no scaling error arises for such a node, i.e., val(N_0) = val_k(N_0). It can be proven that the average relative error is null until val(P) reaches the value 2^k, and then, after a number of decreasing oscillations, it converges to a value independent of val(P) and depending on k. Before proceeding to the implementation of an nLT, we have to set the two parameters n and k, that is, we recall, the number of levels of the nLT and the number of bits used for encoding each child node of the root (for the successive levels, as already mentioned, we drop 1 bit per level). Observe that, according to the above observation about the average relative error, setting the parameter k also fixes the average relative error due to the scaling approximation. Thus, in order to reduce such an error, we should set k to a value as large as possible. However, for a fixed compression ratio, this may limit the depth of the tree and thus the resolution of the leaves. As a consequence, the error arising from the linear interpolation done inside leaf nodes increases. The choice of k thus has to resolve this trade-off. The size of an approximate nLT is thus:

  size(nLT) = 32 + \sum_{h=0}^{n−2} (k − h) · 2^h        (1)

recalling that the root node is encoded with 32 bits. For instance, an nLT with n = 4 and k = 11 uses 32 + 2^0·11 + 2^1·10 + 2^2·9 = 99 bits for representing its nodes.
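The k-bit scaling encoding just described can be sketched as follows; the function names are hypothetical and the snippet assumes val_parent > 0 and 0 ≤ val_node ≤ val_parent.

def encode_value(val_node, val_parent, k):
    """k-bit code of a node value expressed as a fraction of its parent's value."""
    return round(val_node / val_parent * (2 ** k - 1))

def decode_value(code, val_parent, k):
    """Approximate node value recovered from its k-bit code and the parent value."""
    return code / (2 ** k - 1) * val_parent

# Example: encode_value(878, 11161, 11) -> 161; decode_value(161, 11161, 11) -> about 877.8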
4 Experiments on Histograms
In this section we conduct several experiments on synthetic data in order to compare the effectiveness of several histograms in estimating range queries.

Available Storage: For our experiments, we use a storage space of 42 four-byte numbers, to be in line with the experiments reported in [9], which we replicate.

Techniques: We compare nLT with 6 new and old histograms, fixing the total space required by each technique:
– MaxDiff (MD) and V-Optimal (VO) produce 21 buckets; for each bucket both the upper bound and the value are stored.
– MaxDiff with 4LT (MD4LT) and V-Optimal with 4LT (VO4LT) produce 14 buckets; for each bucket the upper bound, the value and the 4LT index are stored.
– Wavelet (WA) histograms are constructed using the bi-orthogonal 2.2 decomposition of the MATLAB 5.2 wavelet toolbox. The wavelet approach needs 21 four-byte wavelet coefficients plus another 21 four-byte numbers for storing coefficient positions. We have stored the 21 largest (in absolute value) wavelet coefficients and, in the reconstruction phase, we have set the remaining coefficients to 0.
– Binary-Tree (BT) produces 19 terminal buckets (for reproducing the experiments reported in [1]).
– nLT is obtained by fixing n = 9 and k = 11. Using (1) shown in Section 3.1, the stored space is about 41 four-byte numbers. The choice of k = 11 and, consequently, of n = 9 is done by fixing the average relative error of the highest level of the tree to about 0.15%.

Data Distributions: A data distribution is characterized by a distribution for frequencies and a distribution for spreads. The frequency set and the value set are generated independently; then frequencies are randomly assigned to the elements of the value set. We consider 3 data distributions: (1) D1: Zipf-cusp max(0.5,1.0). (2) D2 = Zipf-zrand(0.5,1.0): Frequencies are distributed according to a Zipf distribution with the z parameter equal to 0.5. Spreads follow a ZRand distribution [8] with z parameter equal to 1.0 (i.e., spreads following a Zipf distribution with z parameter equal to 1.0 are randomly assigned to attribute values). (3) D3 = Gauss-rand: Frequencies are distributed according to a Gauss distribution. Spreads are randomly distributed.

Histogram Populations: A population is characterized by the value of three parameters, T, D and t, and represents the set of histograms storing a relation of cardinality T, attribute domain size D and value set size t (i.e., number of non-null attribute values).
method/popul.   P1     P2     P3     avg
WA              3.50   3.42   2.99   3.30
MD              4.30   5.78   8.37   6.15
VO              1.43   1.68   1.77   1.63
MD4LT           0.70   0.80   0.70   0.73
VO4LT           0.29   0.32   0.32   0.31
BT              0.26   0.27   0.27   0.27
nLT             0.24   0.24   0.22   0.23
(a)

method/popul.   P1     P2     P3     avg
WA              13.09  13.06  6.08   10.71
MD              19.35  16.04  2.89   12.76
VO              5.55   5.96   2.16   4.56
MD4LT           1.57   1.60   0.59   1.25
VO4LT           1.33   1.41   0.56   1.10
BT              1.12   1.15   0.44   0.90
nLT             0.63   0.69   0.26   0.53
(b)

Fig. 2. (a): Errors for distribution 1. (b): Errors for distribution 2
Population P1. This population is characterized by the following values for the parameters: D = 4100, t = 500 and T = 100000.
Population P2. This population is characterized by the following values for the parameters: D = 4100, t = 500 and T = 500000.
Population P3. This population is characterized by the following values for the parameters: D = 4100, t = 1000 and T = 500000.

Data Sets: Each data set included in the experiments is obtained by generating, under one of the above described data distributions, 10 histograms belonging to one of the populations specified above. We consider the 9 data sets that are generated by combining all data distributions and all populations. All queries belonging to the query set below are evaluated over the histograms of each data set.

Query Set and Error Metric: In our experiments, we use the query set {X ≤ d : 1 ≤ d ≤ D} (recall, X is the histogram attribute and 1..D is its domain) for evaluating the effectiveness of the various methods. We measure the error of approximation made by histograms on the above query set by using the average of the relative error, (1/Q) \sum_{i=1}^{Q} e_i^{rel}, where Q is the cardinality of the query set and e_i^{rel} is the relative error, i.e., e_i^{rel} = |S_i − Ŝ_i| / S_i, where S_i and Ŝ_i are the actual answer and the estimated answer of the i-th query of the query set. For each population and distribution we have calculated the average relative error. The table in Figure 2(a) shows good accuracy on the distribution Zipf max for all index-based methods. In particular, nLT has the best performance, even if the gap w.r.t. the other methods is not large. The error is considerably low for nLT (less than 0.25%) although the compression ratio is very high (i.e., about 100). With the second distribution, Zipf rand (see Figure 2(b)), the behaviour of the methods becomes more varied: Wavelet and MaxDiff show an unsatisfactory accuracy, V-Optimal has better performance but still high errors, while index-based methods show very low errors. Once again, nLT reports the minimum
method/popul.   P1     P2     P3     avg
WA              14.53  5.55   5.06   8.38
MD              11.65  6.65   3.30   7.20
VO              10.60  6.16   2.82   6.53
MD4LT           3.14   2.32   1.33   2.26
VO4LT           2.32   4.85   1.24   2.80
BT              1.51   3.50   0.91   1.97
nLT             1.38   0.87   0.70   0.99
Fig. 3. Errors for distribution 3

Fig. 4. Experimental results: average relative error (%) versus data density (%) in the left-hand graph and versus storage space in the right-hand graph, for Wavelet, MaxDiff, V-Optimal and nLT
error. In Figure 3 we report the results of experiments performed on Gauss data. Due to the high variance, all methods become worse. nLT also presents a slightly higher error w.r.t. Zipf data, but still less than 1% (on average), and still less than the error of the other methods. In Figure 4, the average relative error versus data density and versus histogram size are plotted (in the left-hand graph and right-hand graph, respectively). By data density we mean the ratio |V|/|U| between the cardinality of the non-null value set and the cardinality of the attribute domain. By histogram size we mean the amount of 4-byte numbers used for storing the histogram. This measure is hence related to the compression ratio. In both cases nLT, compared with classical bucket-based histograms, shows the best performance with a considerable improvement gap.
5 Conclusion
In this paper we have presented a new non bucket-based histogram, which we have called nLT. It is based on a hierarchical decomposition of the data
distribution kept in a complete n-level binary tree. Nodes of the tree store, in an approximate form (via bit saving), pre-computed range queries on the original data distribution. Besides the capability of the histogram to directly support hierarchical range queries and efficient updating and query answering, we have shown experimentally that it significantly improves on the state of the art in terms of accuracy in estimating range queries.
References

[1] F. Buccafurri, L. Pontieri, D. Rosaci, D. Saccà. Binary-tree Histograms with Tree Indices. DEXA 2002, Aix-en-Provence, France.
[2] F. Buccafurri, L. Pontieri, D. Rosaci, D. Saccà. Improving Range Query Estimation on Histograms. ICDE 2002, San Jose (CA), USA.
[3] Buccafurri, F., Rosaci, D., Saccà, D., Compressed datacubes for fast OLAP applications, DaWaK 1999, Florence, 65-77.
[4] Koudas, N., Muthukrishnan, S., Srivastava, D., Optimal Histograms for Hierarchical Range Queries, Proc. of Symposium on Principles of Database Systems - PODS, pp. 196-204, Dallas, Texas, 2000.
[5] Malvestuto, F., A Universal-Scheme Approach to Statistical Databases Containing Homogeneous Summary Tables, ACM TODS, 18(4), 678-708, December 1993.
[6] Y. Matias, J. S. Vitter, M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.
[7] Natsev, A., Rastogi, R., Shim, K., WALRUS: A Similarity Retrieval Algorithm for Image Databases, In Proceedings of the 1999 ACM SIGMOD Conference on Management of Data, 1999.
[8] V. Poosala. Histogram-based Estimation Techniques in Database Systems. PhD dissertation, University of Wisconsin-Madison, 1997.
[9] V. Poosala, Y. E. Ioannidis, P. J. Haas, E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 294-305, 1996.
[10] Poosala, V., Ganti, V., Ioannidis, Y. E., Approximate Query Answering using Histograms, IEEE Data Engineering Bulletin, Vol. 22, March 1999.
[11] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. T. Price. Access path selection in a relational database management system. In Proc. of ACM SIGMOD International Conference, pages 23-24, 1979.
[12] Sitzmann, I., Stuckey, P. J., Improving Temporal Joins Using Histograms, Proc. of the Int. Conference on Database and Expert Systems Applications - DEXA 2000.
[13] E. J. Stollnitz, T. D. Derose, and D. H. Salesin. Wavelets for Computer Graphics. Morgan Kaufmann, 1996.
[14] J. S. Vitter, M. Wang, B. Iyer. Data Cube Approximation and Histograms via Wavelets. In Proceedings of the 1998 CIKM International Conference on Information and Knowledge Management, Washington, 1998.
[15] J. S. Vitter, M. Wang, Approximate Computation of Multidimensional Aggregates of Sparse Data using Wavelets, In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999.
Comprehensive Log Compression with Frequent Patterns

Kimmo Hätönen¹, Jean-François Boulicaut², Mika Klemettinen¹, Markus Miettinen¹, and Cyrille Masson²

¹ Nokia Research Center, P.O. Box 407, FIN-00045 Nokia Group, Finland
{kimmo.hatonen,mika.klemettinen,markus.miettinen}@nokia.com
² INSA de Lyon, LIRIS CNRS FRE 2672, F-69621 Villeurbanne, France
{Jean-Francois.Boulicaut,Cyrille.Masson}@insa-lyon.fr
Abstract. In this paper we present a comprehensive log compression (CLC) method that uses frequent patterns and their condensed representations to identify repetitive information from large log files generated by communications networks. We also show how the identified information can be used to separate and filter out frequently occurring events that hide other, unique or only a few times occurring events. The identification can be done without any prior knowledge about the domain or the events. For example, no pre-defined patterns or value combinations are needed. This separation makes it easier for a human observer to perceive and analyse large amounts of log data. The applicability of the CLC method is demonstrated with real-world examples from data communication networks.
1 Introduction
In the near future telecommunication networks will deploy an open packet-based infrastructure which has been originally developed for data communication networks. The monitoring of this new packet-based infrastructure will be a challenge for operators. The old networks will remain up and running for still some time. At the same time the rollout of the new infrastructure will take place introducing many new information sources, between which the information needed in, e.g., security monitoring and fault analysis will be scattered. These sources can include different kinds of event logs, e.g., firewall logs, operating systems' system logs and different application server logs to name a few. The problem is becoming worse every day as operators are adding new tools for logging and monitoring their networks. As the requirements for the quality of service perceived by customers gain more importance, the operators are starting to seriously utilise information that is hidden in these logs. Their interest towards analysing their own processes and operation of their network increases concurrently. Data mining and knowledge discovery methods are a promising alternative for operators to gain more out of their data. Based on our experience, however,
simple-minded use of discovery algorithms in network analysis poses problems with the amount of generated information and its relevance. In the KDD process [6, 10, 9], it is often reasonable or even necessary to constrain the discovery using background knowledge. If no constraints are applied, the discovered result set of, say, association rules [1, 2] might become huge and contain mostly trivial and uninteresting rules. Also, association and episode rule mining techniques can only capture frequently recurring events according to some frequency and confidence thresholds. This is needed to restrict the search space and thus for computational tractability. Clearly, the thresholds that can be used are not necessarily the ones that denote objective interestingness from the user's point of view. Indeed, rare combinations can be extremely interesting. When considering previously unknown domains, explicit background knowledge is missing, e.g., about the possible or reasonable values of attributes and their relationships. When it is difficult or impossible to define and maintain a priori knowledge about the system, there is still a possibility to use meta information that can be extracted from the logs. Meta information characterizes different types of log entries and log entry combinations. It can not only be used to help an expert in filtering and browsing the logs manually but also to automatically identify and filter out insignificant log entries. It is possible to reduce the size of an analysed data set to a fraction of its original size without losing any critical information. One type of meta information is frequent patterns. They capture the common value combinations that occur in the logs. Furthermore, such meta information can be condensed by means of, e.g., the closed frequent itemsets [12, 3]. Closed sets form natural inclusion graphs between different covering sets. This type of presentation is quite understandable for an expert and can be used to create hierarchical views. These condensed representations can be extracted directly from highly correlated and/or dense data, i.e., in contexts where the approaches that compute the whole collection of the frequent patterns FS are intractable [12, 3, 17, 13]. They can also be used to regenerate efficiently the whole FS collection, possibly partially and on the fly. We propose here our Comprehensive Log Compression (CLC) method. It is based on the computation of frequent pattern condensed representations and we use this presentation as an entry point to the data. The method provides a way to dynamically characterize and combine log data entries before they are shown to a human observer. It finds frequently occurring patterns from dense log data and links the patterns to the data as a data directory. It is also possible to separate recurring data and analyse it separately. In most cases, this reduces the amount of data needed to be evaluated by an expert to a fraction of the original volume. This type of representation is general w.r.t. different log types. Frequent sets can be generated from most of the logs that have structure and contain repeating symbolic values in their fields, e.g., in Web Usage Mining applications [11, 16]. The main difference between the proposed method and those applications is the objective setting of the mining task. Most of the web usage applications try to identify and somehow validate common access patterns in web sites. These patterns are then used to do some sort of optimization of the site. The proposed
method, however, does not say anything about the semantic correctness of or the relations between the found frequent patterns. It only summarizes the most frequent value combinations in the entries. This gives either a human expert or computationally more intensive algorithms a chance to continue with data that does not contain the most common and trivial entries. Based on our experience with real-life log data, e.g., large application and firewall logs, the original data set of tens of thousands of rows can often be represented by just a couple of identified patterns and the exceptions not matching these patterns.

...
777;11May2000; 0:00:23;a_daemon;B1;12.12.123.12;tcp;;
778;11May2000; 0:00:31;a_daemon;B1;12.12.123.12;tcp;;
779;11May2000; 0:00:32;1234;B1;255.255.255.255;udp;;
781;11May2000; 0:00:43;a_daemon;B1;12.12.123.12;tcp;;
782;11May2000; 0:00:51;a_daemon;B1;12.12.123.12;tcp;;
...

Fig. 1. An example of a firewall log
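The raw lines of Fig. 1 can be turned into the (field, value) form used by the later figures with a small parser. The field order below is an assumption inferred from the sample (the real Firewall-1 export may differ), and parse_entry is a hypothetical helper.

ASSUMED_FIELDS = ["No", "Date", "Time", "Service", "Src", "Destination", "Proto", "SPort"]

def parse_entry(line):
    """Map one ';'-separated log line to a set of (field, value) pairs."""
    values = [v.strip() for v in line.split(";")]
    return {(f, v) for f, v in zip(ASSUMED_FIELDS, values)}

entry = parse_entry("777;11May2000; 0:00:23;a_daemon;B1;12.12.123.12;tcp;;")
print(("Proto", "tcp") in entry)  # -> True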
2 Log Data and Log Data Analysis
Log data consists of entries that represent a specific condition or an event that has occurred somewhere in the system. The entries have several fields, which are called variables from now on. The structure of entries might change over time from one entry to another, although some variables are common to all of them. Each variable has a set of possible values called a value space. Values of one value space can be considered as binary attributes. The value spaces of different variables are separate. A small example of log data is given in Figure 1. It shows a sample from a log file produced by CheckPoint's Firewall-1. In a data set the value range of a variable value space might be very large or very limited. For example, there may be only a few firewalls in an enterprise, but every IP address on the internet might try to contact the enterprise domain. There are also several variables that have such a large value space but contain only a fraction of the possible values. Therefore, it is impractical and almost impossible to fix the size of the value spaces as a priori knowledge. A log file may be very large. During one day, millions of lines might accumulate in a log file. A solution for browsing the data is either to search for patterns that are known to be interesting with high probability or to filter out patterns that most probably are uninteresting. A system can assist in this, but the evaluation of interestingness is left to an expert. To be able to make the evaluation, an expert has to check the found log entries by hand. He has to return to the original log file and iteratively check all those probably interesting entries and their surroundings. Many of the most dangerous attacks are new and unseen by an enterprise defense system. Therefore, when the data exploration is limited only to known patterns, it may be impossible to find the new attacks. Comprehensive Log Compression (CLC) is an operation where meta information is extracted from the log entries and used to summarize redundant entries
{Proto:tcp, Service:a_daemon, Src:B1} 11161
{Proto:tcp, SPort:, Src:B1} 11161
{Proto:tcp, SPort:, Service:a_daemon} 11161
{SPort:, Service:a_daemon, Src:B1} 11161
...
{Destination:123.12.123.12, SPort:, Service:a_daemon, Src:B1} 10283
{Destination:123.12.123.12, Proto:tcp, Service:a_daemon, Src:B1} 10283
{Destination:123.12.123.12, Proto:tcp, SPort:, Src:B1} 10283
{Destination:123.12.123.12, Proto:tcp, SPort:, Service:a_daemon} 10283
{Proto:tcp, SPort:, Service:a_daemon, Src:B1} 11161
...
{Destination:123.12.123.12, Proto:tcp, SPort:, Service:a_daemon, Src:B1} 10283

Fig. 2. A sample of frequent sets extracted from a firewall log
without losing any important information. By combining log entries with their frequencies and identifying recurring patterns, we are able to separate correlating entries from infrequent ones and display them with accompanying information. Thus, an expert gets a broader overview of the logged system and can identify interesting phenomena and concentrate on his analysis. The summary has to be understandable for an expert and must contain all the relevant information that is available in the original log. The presentation also has to provide a mechanism to move back and forth between the summary and the original logs. Summarization can be done by finding correlating value combinations from a large amount of log entries. Due to the nature of the logging mechanism, there are always several value combinations that are common to a large number of the entries. When these patterns are combined with information about how the uncorrelated values change w.r.t. these correlating patterns, this gives a comprehensive description of the contents of the logs. In many cases it is possible to detect such patterns by browsing the log data, but unfortunately it is also tedious. E.g., a clever attack against a firewall cluster of an enterprise is scattered over all of its firewalls and executed slowly from several different IP addresses using all the possible protocols alternately. Figure 2 provides a sample of frequent sets extracted from the data introduced in Figure 1. In Figure 2, the last pattern, which contains five attributes, has five subpatterns out of which four have the same frequency as the longer pattern and only one has a larger frequency. In fact, many frequent patterns have the same frequency, and it is the key idea of the frequent closed set mining technique to consider only some representative patterns, i.e., the frequent closed itemsets (see the next section for a formalization). Figure 3 gives a sample of frequent closed sets that correspond to the frequent patterns shown in Figure 2. An example of the results of applying the CLC method to a firewall log data set can be seen in Table 1. It shows the three patterns with the highest coverage values found from the firewall log introduced in Figure 1. If the supports of these patterns are combined, then 91% of the data in the log is covered. The blank fields in the table are intentionally left empty in the original log data. The fields marked with '*' can have varying values. For example, in pattern 1 the field
{Proto:tcp, SPort:, Service:a_daemon, Src:B1} 11161
{Destination:123.12.123.12, Proto:tcp, SPort:, Service:a_daemon, Src:B1} 10283
{Destination:123.12.123.13, Proto:tcp, SPort:, Service:a_daemon, Src:B1} 878

Fig. 3. A sample of closed sets extracted from a firewall log

Table 1. The three most frequent patterns found from a firewall log

No  Destination      Proto  SPort  Service   Src  Count
1.  *                tcp           A daemon  B1   11161
2.  255.255.255.255  udp           1234      *    1437
3.  123.12.123.12    udp           B-dgm     *    1607
For example, in pattern 1 the field 'Destination' takes two different values on the lines it matches, as shown in Figure 3.
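To make the closure idea concrete, the following minimal Python sketch (an illustration, not the authors' implementation) models log entries as sets of (field, value) pairs and computes the closure of a pattern as the intersection of all entries that support it, i.e., the maximal pattern with the same frequency. The two example entries are stand-ins for rows of the firewall log of Figure 1.

def supports(entry, pattern):
    # An entry supports a pattern if every (field, value) item of the pattern occurs in it.
    return pattern <= entry

def closure(pattern, log):
    # The closure is the maximal superset of the pattern with the same support:
    # the items common to all entries that support the pattern.
    supporting = [entry for entry in log if supports(entry, pattern)]
    if not supporting:
        return set(pattern)
    closed = set(supporting[0])
    for entry in supporting[1:]:
        closed &= entry
    return closed

log = [
    frozenset({("Proto", "tcp"), ("SPort", ""), ("Service", "a_daemon"),
               ("Src", "B1"), ("Destination", "123.12.123.12")}),
    frozenset({("Proto", "tcp"), ("SPort", ""), ("Service", "a_daemon"),
               ("Src", "B1"), ("Destination", "123.12.123.13")}),
]

print(closure({("Proto", "tcp"), ("Src", "B1")}, log))
# SPort and Service are added to the pattern, but Destination is not, because it varies.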
3 Formalization
The definition of the LOG pattern domain consists of the definition of a language of patterns L, evaluation functions that assign a description to each pattern in a given log r, and languages for primitive constraints that specify the desired patterns. We first introduce some notation used for defining the LOG pattern domain. A so-called log contains the data in the form of log entries, and patterns are so-called itemsets, which are sets of (field, value) pairs of log entries.

Definition 1 (Log). Assume that Items is a finite set of (field, value) pairs denoted by the field name combined with the value, e.g., Items = {A:a_i, B:b_j, C:c_k, ...}. A log entry e is a subset of Items. A log r is a finite and non-empty multiset r = {e_1, e_2, ..., e_n} of log entries.

Definition 2 (Itemsets). An itemset is a subset of Items. The language of patterns for itemsets is L = 2^Items.

Definition 3 (Constraint). If T denotes the set of all logs and 2^Items the set of all itemsets, an itemset constraint C is a predicate over 2^Items × T. An itemset S ∈ 2^Items satisfies a constraint C in the database r ∈ T iff C(S, r) = true. When it is clear from the context, we write C(S).

Evaluation functions return information about the properties of a given itemset in a given log. These functions provide an expert with information about the events and conditions in the network. They also form a basis for summary creation and are used to select proper entry points into the log data.

Definition 4 (Support for Itemsets). A log entry e supports an itemset S if every item in S belongs to e, i.e., S ⊆ e. The support of an itemset S, denoted support(S, r), is the multiset of all log entries of r that support S (e.g., support(∅) = r).
Definition 5 (Frequency). The frequency of an itemset S in a log r is defined by F(S, r) = |support(S, r)|, where |.| denotes the cardinality of the multiset.

Definition 6 (Coverage). The coverage of an itemset S in a log r is defined by Cov(S, r) = F(S, r) · |S|, where |.| denotes the cardinality of the itemset S.

Definition 7 (Perfectness). The perfectness of an itemset S in a log r is defined by Perf(S, r) = Cov(S, r) / Σ_{i=0}^{F(S,r)} |e_i|, where e_i ∈ support(S, r) and |e_i| denotes the cardinality of log entry e_i. Notice that if the cardinality of all log entries is constant, then Perf(S, r) = Cov(S, r) / (F(S, r) · |e|), where e is an arbitrary log entry.

Primitive constraints are a tool set that is used to create and control summaries. For instance, the summaries are composed by using the frequent (closed) sets, i.e., sets that satisfy a conjunction of a minimal frequency constraint and the closeness constraint, plus the original data.

Definition 8 (Minimal Frequency). Given an itemset S, a log r, and a frequency threshold γ ∈ [1, |r|], Cminfreq(S, r) ≡ F(S, r) ≥ γ. Itemsets that satisfy Cminfreq are called γ-frequent or frequent in r.

Definition 9 (Minimal Perfectness). Given an itemset S, a log r, and a perfectness threshold π ∈ [0, 1], Cminperf(S, r) ≡ Perf(S, r) ≥ π. Itemsets that satisfy Cminperf are called π-perfect or perfect in r.

Definition 10 (Closures, Closed Itemsets and Constraint Cclose). The closure of an itemset S in r, denoted closure(S, r), is the maximal (for set inclusion) superset of S which has the same support as S. In other terms, the closure of S is the set of items that are common to all the log entries which support S. A closed itemset is an itemset that is equal to its closure in r, i.e., we define Cclose(S, r) ≡ closure(S, r) = S.

Closed itemsets are maximal sets of items that are supported by a multiset of log entries. If we consider the equivalence classes that group all the itemsets that have the same closure (and thus the same frequency), the closed sets are the maximal elements of each equivalence class. Thus, when the collection of the frequent itemsets FS is available, a simple post-processing technique can be applied to compute only the frequent closed itemsets. When the data is sparse, it is possible to compute FS, e.g., by using Apriori-like algorithms [2]. However, the number of frequent itemsets can be extremely large, especially in dense logs that contain many highly correlated field values. In that case, computing FS might not be feasible, while the frequent closed sets CFS can often be computed for the same frequency threshold or even a lower one: CFS = {φ ∈ L | Cminfreq(φ, r) ∧ Cclose(φ, r) are satisfied}. On one hand, FS can be efficiently derived from CFS without scanning the data again [12, 3]. On the other hand, CFS is a compact representation of the information about every frequent set and its frequency and thus fulfills the needs of CLC. Several algorithms can compute the frequent closed sets efficiently.
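As a concrete reading of Definitions 4-9, the sketch below (a minimal illustration, under the assumption that a log is given as a list of itemsets, i.e., sets of (field, value) pairs) computes the evaluation functions and the two primitive constraints.

def support(S, log):
    # Definition 4: the multiset of log entries that contain every item of S.
    return [entry for entry in log if S <= entry]

def frequency(S, log):
    # Definition 5: F(S, r) = |support(S, r)|.
    return len(support(S, log))

def coverage(S, log):
    # Definition 6: Cov(S, r) = F(S, r) * |S|.
    return frequency(S, log) * len(S)

def perfectness(S, log):
    # Definition 7: coverage divided by the total size of the supporting entries.
    total = sum(len(entry) for entry in support(S, log))
    return coverage(S, log) / total if total else 0.0

def c_minfreq(S, log, gamma):
    # Definition 8: S is gamma-frequent.
    return frequency(S, log) >= gamma

def c_minperf(S, log, pi):
    # Definition 9: S is pi-perfect.
    return perfectness(S, log) >= pi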
In this work, we compute the frequent closed sets by computing the frequent free sets and providing their closures [4, 5]. This is efficient since the freeness property is anti-monotonic, which is a key property for an efficient traversal of the search space.

For a user, the display of adequate information is the most important phase of the CLC method. This phase gets the original log file and a condensed set of frequent patterns as input. An objective of the method is to select the most informative patterns as starting points for navigating the condensed set of patterns and the data. As has been shown [12], the frequent closed sets give rise to a lattice structure, ordered by set inclusion. These inclusion relations between patterns can be used as navigational links.

Which patterns are most informative depends on the application and the task at hand. There are at least three measures that can be used to sort the patterns: frequency, i.e., on how many lines the pattern occurs in the data set; perfectness, i.e., how large a part of a line is fixed by the pattern; and coverage, i.e., how large a part of the database is covered by the pattern. Coverage balances the trade-off between patterns that are short but have a high frequency and patterns that are long but have a lower frequency. Selection of the most informative patterns can also be based on optimality w.r.t. coverage: an expert may wish to see only the n most covering patterns, or the most covering patterns that together cover more than m% of the data. Examples of optimality constraints are considered in [14, 15].

An interesting issue is the treatment of patterns whose perfectness is close to zero. It is often the case that the support of such a small pattern is almost entirely covered by the supports of larger patterns of which it is a subset. The most interesting property of the corresponding lines is the possibility to find those rare and exceptional entries that are not covered by any of the frequent patterns. In the domain that we are working on, log entries of telecommunication applications, we have found that coverage and perfectness are very good measures for finding informative starting points for pattern and data browsing. This is probably because, if there are too many fields without fixed values, the meaning of the entry is not clear and such patterns are not understandable for an expert. On the other hand, these logs contain many repeating patterns whose coverage is high and whose perfectness is close to 100 percent.
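One possible realization of the selection step described above is a greedy ranking by coverage. The snippet below is a rough sketch, not the authors' tool; it recomputes supports directly for clarity instead of reusing the counts produced by the closed-set miner.

def support_lines(S, log):
    # Indices of the log lines covered by pattern S.
    return {i for i, entry in enumerate(log) if S <= entry}

def select_by_coverage(patterns, log, n=None, min_fraction=None):
    # Rank closed patterns by coverage and keep either the n most covering ones
    # or enough of them to cover at least min_fraction of the log.
    ranked = sorted(patterns,
                    key=lambda S: len(support_lines(S, log)) * len(S),
                    reverse=True)
    selected, covered = [], set()
    for S in ranked:
        if n is not None and len(selected) >= n:
            break
        if min_fraction is not None and len(covered) >= min_fraction * len(log):
            break
        selected.append(S)
        covered |= support_lines(S, log)
    return selected, (len(covered) / len(log) if log else 0.0)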
4 Experiments
Our experiments were done with two separate log sets. The first of them was a firewall log that was divided into several files, so that each file contained the entries logged during one day. From this collection we selected the logs of four days and executed the CLC method on them with different frequency thresholds. The purpose of this test was to find out how large a portion of the original log can be covered with the discovered patterns and what the optimal value for the frequency threshold would be.
Table 2. Summary of the CLC experiments with firewall data

      Day 1                          Day 2                          Day 3                          Day 4
Sup   Freq   Clsd  Sel  Lines  %     Freq   Clsd  Sel  Lines  %     Freq   Clsd  Sel  Lines  %     Freq   Clsd  Sel  Lines  %
100   8655    48    5   5162   96.3  9151    54    5   15366  98.6  10572   82    7   12287  97.1  8001    37    4   4902   97.3
50    9213    55    6   5224   97.5  9771    66    7   15457  99.2  11880   95   11   12427  98.2  8315    42    5   4911   97.5
10    11381   74   12   5347   99.8  12580   88   12   15537  99.7  19897  155   19   12552  99.2  10079   58    8   4999   99.2
5     13013   82   13   5351   99.9  14346  104   14   15569  99.9  22887  208   20   12573  99.3  12183   69   10   5036   99.9
Tot                     5358                            15588                          12656                          5039
In Table 2, a summary of the experiment results is presented. Table 2 shows, for each daily firewall log file, the number of frequent sets (Freq), the number of closed sets derived from them (Clsd), the number of selected closed sets (Sel), the number of lines that the selected sets cover (Lines), and how large a part of the log these lines cover (%). The tests were executed with several frequency thresholds (Sup). The pattern selection was based on the coverage of each pattern. As can be seen from the results, the coverage percentage is high already with the rather high frequency threshold of 50 lines. With this threshold there were, e.g., only 229 (1.8%) uncovered lines in the log file of day 3, basically because there was an exceptionally well distributed port scan during that day. Those entries were so fragmented that they escaped the CLC algorithm, but they were clearly visible once all the other information was taken away. Table 2 also allows the sizes of the different representations to be compared. As can be seen, the reduction from the number of frequent sets to the number of closed sets is remarkable. Moreover, by selecting the most covering patterns, it is possible to reduce the number of shown patterns to very few without losing the descriptive power of the representation.

The second data set used to test our method was an application log of a large software system. The log contains information about the execution of different application modules. The main purpose of the log is to provide information for system operation, maintenance and debugging. The log entries form a continuous flow of data, not the occasional bursts that are typical of firewall entries. The interesting part of the flow is the possible error messages, which are rare and often hidden in the mass. The size of the application log was more than 105 000 lines, collected during a period of 42 days. From these entries, with a frequency threshold of 1000 lines (about 1%), the CLC method was able to identify 13 interesting patterns that covered 91.5% of the data. When the frequency threshold was further lowered to 50 lines, the coverage rose to 95.8%. With that threshold value, 33 patterns were found; the resulting patterns, however, started to be so fragmented that they were not very useful anymore.

These experiments show the usefulness of the condensed representation of the frequent itemsets by means of the frequent closed itemsets. In a data set like a firewall log, it is possible to select only a few of the most covering frequent closed sets and still cover the majority of the data. After this bulk has been
removed from the log, it is much easier for any human expert to inspect the rest of the log, even manually.

Notice also that the computation of our results has been easy. This is partly because the test data sets reported here are not very large, the largest being a little over 100 000 lines. In a real environment of a large corporation, however, the daily firewall logs might contain millions of lines and many more variables. The amount of data, both the number of lines and the number of variables, will continue to grow in the future as the number of service types, the services themselves, and their use grow. The scalability of the algorithms that compute the frequent closed sets is quite good compared to the Apriori approach: fewer data scans are needed and the search space can be drastically reduced in the case of dense data [12, 3, 5]. In particular, we have done preliminary testing with the ac-miner designed by A. Bykowski [5]. It discovers free sets, from which it is straightforward to compute closed sets. These tests have shown promising results w.r.t. execution times, and this approach seems to scale up more easily than the search for the whole set of frequent sets. Other condensed representations have also been proposed recently, such as the δ-free sets, the ∨-free sets, and the Non-Derivable Itemsets [5, 7, 8]. They could be used in even more difficult contexts (very dense and highly correlated data). Note, however, that from the end user's point of view these representations do not have the intuitive semantics of the closed itemsets.
5 Conclusions and Future Work
The Comprehensive Log Compression (CLC) method provides a powerful tool for any analysis that inspects data with a lot of redundancy. Only very little a priori knowledge is needed to perform the analysis: a minimum frequency threshold for the discovery of closed sets and, e.g., the number of displayed patterns to guide the selection of the most covering ones. The method provides a mechanism to separate different information types from each other. The CLC method identifies frequent repetitive patterns in a log database and can be used to emphasize either the normal course of actions or exceptional log entries and events within it. This is especially useful for gaining knowledge about previously unknown domains or for analyzing logs that are used to record unstructured and unclassified information. In the future we are interested in generalizing and testing the described method with frequent episodes, in particular in how to utilize relations between the selected closed sets. Other interesting issues concern the theoretical foundations of the CLC method as well as ways to utilize the method in different real-world applications.
Acknowledgements
The authors have partly been supported by the Nokia Foundation and the consortium on discovering knowledge with Inductive Queries (cInQ), a project
funded by the Future and Emerging Technologies arm of the IST Programme (Contract no. IST-2000-26469).
References
[1] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In SIGMOD'93, pages 207–216, Washington, USA, May 1993. ACM Press.
[2] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press, 1996.
[3] Jean-François Boulicaut and Artur Bykowski. Frequent closures as a concise representation for binary data mining. In PAKDD'00, volume 1805 of LNAI, pages 62–73, Kyoto, JP, April 2000. Springer-Verlag.
[4] Jean-François Boulicaut, Artur Bykowski, and Christophe Rigotti. Approximation of frequency queries by means of free-sets. In PKDD'00, volume 1910 of LNAI, pages 75–85, Lyon, F, September 2000. Springer-Verlag.
[5] Jean-François Boulicaut, Artur Bykowski, and Christophe Rigotti. Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Mining and Knowledge Discovery, 7(1):5–22, 2003.
[6] Ronald J. Brachman and Tej Anand. The process of knowledge discovery in databases: A first sketch. In Advances in Knowledge Discovery and Data Mining, July 1994.
[7] Artur Bykowski and Christophe Rigotti. A condensed representation to find frequent patterns. In PODS'01, pages 267–273. ACM Press, May 2001.
[8] Toon Calders and Bart Goethals. Mining all non-derivable frequent itemsets. In PKDD'02, volume 2431 of LNAI, pages 74–83, Helsinki, FIN, August 2002. Springer-Verlag.
[9] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34, November 1996.
[10] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, pages 1–34. AAAI Press, Menlo Park, CA, 1996.
[11] R. Kosala and H. Blockeel. Web mining research: A survey. SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining, ACM, 2(1):1–15, 2000.
[12] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1):25–46, January 1999.
[13] Jian Pei, Jiawei Han, and Runying Mao. CLOSET: an efficient algorithm for mining frequent closed itemsets. In SIGMOD Workshop DMKD'00, Dallas, USA, May 2000.
[14] Tobias Scheffer. Finding association rules that trade support optimally against confidence. In PKDD'01, volume 2168 of LNCS, pages 424–435, Freiburg, D, September 2001. Springer-Verlag.
[15] Jun Sese and Shinichi Morishita. Answering the most correlated N association rules efficiently. In PKDD'02, volume 2431 of LNAI, pages 410–422, Helsinki, FIN, August 2002. Springer-Verlag.
[16] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12–23, 2000.
[17] Mohammed Javeed Zaki. Generating non-redundant association rules. In SIGKDD'00, pages 34–43, Boston, USA, August 2000. ACM Press.
Non Recursive Generation of Frequent K-itemsets from Frequent Pattern Tree Representations
Mohammad El-Hajj and Osmar R. Zaïane
Department of Computing Science, University of Alberta, Edmonton AB, Canada
{mohammad, zaiane}@cs.ualberta.ca

Abstract. Existing association rule mining algorithms suffer from many problems when mining massive transactional datasets. One major problem is the high memory dependency: the gigantic data structures built are assumed to fit in main memory, and, in addition, the recursive mining process used to mine these structures is also too voracious in memory resources. This paper proposes a new association rule mining algorithm based on the frequent pattern tree data structure. Our algorithm does not use much more memory over and above the memory used by the data structure. For each frequent item, a relatively small independent tree called a COFI-tree is built summarizing co-occurrences. Finally, a simple and non-recursive mining process mines the COFI-trees. Experimental studies reveal that our approach is efficient and allows the mining of larger datasets than those limited by FP-Tree.
1 Introduction
Recent years have witnessed an explosive growth of data generation in all fields of science, business, medicine, the military, etc. The processing power available for evaluating and analyzing the data has not followed this massive growth, and due to this phenomenon a tremendous volume of data is still kept without being studied. Data mining, a research field that tries to ease this problem, proposes solutions for the extraction of significant and potentially useful patterns from these large collections of data. One of the canonical tasks in data mining is the discovery of association rules. Discovering association rules, considered one of the most important tasks, has been the focus of many studies in the last few years. Many solutions have been proposed using sequential or parallel paradigms. However, the existing algorithms depend heavily on massive computation that might cause high dependency on the memory size or repeated I/O scans of the data sets. Association rule mining algorithms currently proposed in the literature are not sufficient for extremely large datasets, and new solutions that are less reliant on memory size still have to be found.
1.1 Problem Statement
The problem consists of finding associations between items or itemsets in transactional data. The data could be retail sales in the form of customer transactions
or any collection of sets of observations. Formally, as defined in [2], the problem is stated as follows: Let I = {i_1, i_2, ..., i_m} be a set of literals, called items; m is considered the dimensionality of the problem. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. A unique identifier TID is given to each transaction. A transaction T is said to contain X, a set of items in I, if X ⊆ T. An association rule is an implication of the form "X ⇒ Y", where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. An itemset X is said to be large or frequent if its support s is greater than or equal to a given minimum support threshold σ. The rule X ⇒ Y has a support s in the transaction set D if s% of the transactions in D contain X ∪ Y; in other words, the support of the rule is the probability that X and Y hold together among all the possible presented cases. The rule X ⇒ Y is said to hold in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y; in other words, the confidence of the rule is the conditional probability that the consequent Y is true under the condition of the antecedent X. The problem of discovering all association rules from a set of transactions D consists of generating the rules that have a support and confidence greater than given thresholds. These rules are called strong rules. This association mining task can be broken into two steps:
1. A step for finding all frequent k-itemsets, known for its extreme I/O scan expense and massive computational costs;
2. A straightforward step for generating strong rules.
In this paper, we are mainly interested in the first step.
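The two measures can be written down directly; the following minimal Python sketch (with a toy transaction database, not data from the paper) computes the support of an itemset and the confidence of a rule X ⇒ Y.

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    # Conditional probability of Y given X: support(X ∪ Y) / support(X).
    return support(X | Y, transactions) / support(X, transactions)

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E"}]
print(support({"A", "C"}, transactions))       # 0.5
print(confidence({"A"}, {"C"}, transactions))  # ~0.67: two of the three transactions with A also contain C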
1.2 Related Work
Several algorithms have been proposed in the literature to address the problem of mining association rules [2, 6]. One of the key algorithms, which seems to be the most popular in many applications for enumerating frequent itemsets, is the Apriori algorithm [2]. This algorithm also forms the foundation of most known algorithms. It uses a monotone property stating that for a k-itemset to be frequent, all its (k-1)-itemsets have to be frequent. The use of this fundamental property reduces the computational cost of candidate frequent itemset generation. However, in the case of extremely large input sets with a big frequent 1-itemset, the Apriori algorithm still suffers from two main problems: repeated I/O scanning and high computational cost. One major hurdle observed with most real datasets is the sheer size of the candidate frequent 2-itemsets and 3-itemsets.

TreeProjection is an efficient algorithm presented in [1]. This algorithm builds a lexicographic tree in which each node presents a frequent pattern. The authors report that their algorithm is one order of magnitude faster than the existing techniques in the literature. Another innovative approach to discovering frequent patterns in transactional databases, FP-Growth, was proposed by Han et al. in [6]. This algorithm creates a compact tree structure, the FP-Tree, representing frequent patterns, which alleviates the multi-scan problem and improves candidate itemset generation. The algorithm requires only two full I/O scans of the dataset to build the prefix tree in main memory and then mines this structure directly. The authors
of this algorithm report that it is faster than both the Apriori and the TreeProjection algorithms. Mining the FP-Tree structure is done recursively by building conditional trees that are of the same order of magnitude in number as the frequent patterns. This massive creation of conditional trees makes the algorithm not scalable to mine large datasets beyond a few million transactions. [7] proposes a new algorithm, H-mine, that invokes FP-Tree to mine condensed data. This algorithm is still not scalable, as reported by its authors in [8].
1.3 Preliminaries, Motivations and Contributions
The Co-Occurrence Frequent Item Tree (or COFI-tree for short) algorithm that we present in this paper is based on the core idea of the FP-Growth algorithm proposed by Han et al. in [6]. A compact tree structure, the FP-Tree, is built based on an ordered list of the frequent 1-itemsets present in the transactional database. However, rather than using FP-Growth, which recursively builds a large number of relatively large trees called conditional trees [6] from the built FP-Tree, we successively build one small tree (called a COFI-tree) for each frequent 1-itemset and mine the trees with simple non-recursive traversals. We keep only one such COFI-tree in main memory at a time. The COFI-tree approach is a divide-and-conquer approach, in which we do not seek to find all frequent patterns at once, but independently find all frequent patterns related to each item in the frequent 1-itemset. The main differences between our approach and the FP-Growth approach are the following: (1) we build only one COFI-tree for each frequent item A, and this COFI-tree is non-recursively traversed to generate all frequent patterns related to item A; (2) only one COFI-tree resides in memory at any time, and it is discarded as soon as it is mined to make room for the next COFI-tree.

FP-Tree-based algorithms depend heavily on memory size, as the memory size plays an important role in defining the size of the problem that can be handled. Memory is needed not only to store the data structure itself, but also to recursively generate the set of conditional trees during the mining process. This phenomenon is often overlooked and, as argued by the authors of the algorithm, it is a serious constraint [8]. Other approaches, such as [7], build yet another data structure from which the FP-Tree is generated, thus doubling the need for main memory. The current association rule mining algorithms handle only relatively small sizes with low dimensions. Most of them scale up to only a couple of million transactions and a few thousand dimensions [8, 5]. None of the existing algorithms scales beyond 15 million transactions and hundreds of thousands of dimensions in which each transaction has an average of at least a couple of dozen items.

The remainder of this paper is organized as follows: Section 2 describes the Frequent Pattern tree, its design and construction. Section 3 illustrates the design, construction and mining of the Co-Occurrence Frequent Item trees. Experimental results are given in Section 4. Finally, Section 5 concludes by discussing some issues and highlights our future work.
2 Frequent Pattern Tree: Design and Construction
The COFI-tree approach we propose consists of two main stages. Stage one is the construction of the Frequent Pattern tree, and stage two is the actual mining of this data structure, much like the FP-Growth algorithm.
2.1 Construction of the Frequent Pattern Tree
The goal of this stage is to build the compact data structure called the Frequent Pattern Tree [6]. This construction is done in two phases, where each phase requires a full I/O scan of the dataset. A first initial scan of the database identifies the frequent 1-itemsets. The goal is to generate an ordered list of frequent items that will be used when building the tree in the second phase. This phase starts by enumerating the items appearing in the transactions. After enumeration (i.e., after reading the whole dataset), infrequent items with a support less than the support threshold are weeded out and the remaining frequent items are sorted by their frequency. This list is organized in a table, called the header table, where the items and their respective supports are stored along with pointers to the first occurrence of each item in the frequent pattern tree. Phase 2 then constructs the frequent pattern tree.

Table 1. Transactional database (transactions T1-T18)
Item frequencies after step 1: A 11, B 10, C 10, D 9, E 8, F 7, G 4, H 3, M 3, N 3, O 3, P 3, I 3, K 3, L 3, J 3, Q 2, R 2
Frequent items after step 2: A 11, B 10, C 10, D 9, E 8, F 7
Sorted frequent items after step 3: F 7, E 8, D 9, C 10, B 10, A 11

Fig. 1. Steps of phase 1.
Phase 2 of constructing the Frequent Pattern tree structure is the actual building of this compact tree. This phase requires a second complete I/O scan
from the dataset. For each transaction read, only the set of frequent items present in the header table is collected and sorted in descending order according to their frequency. These sorted transaction items are used in constructing the FP-Tree as follows: for the first item in the sorted transaction, check if it exists as one of the children of the root. If it exists, increment the support of this node; otherwise, add a new node for this item as a child of the root node with a support of 1. Then consider the current item node as the new temporary root and repeat the same procedure with the next item in the sorted transaction. During the process of adding any new item-node to the FP-Tree, a link is maintained between this item-node in the tree and its entry in the header table. The header table holds one pointer per item that points to the first occurrence of this item in the FP-Tree structure.

For illustration, we use an example with the transactions shown in Table 1, and let the minimum support threshold be set to 4. Phase 1 starts by accumulating the support of all items that occur in the transactions. Step 2 of phase 1 removes all non-frequent items, in our example (G, H, I, J, K, L, M, N, O, P, Q and R), leaving only the frequent items (A, B, C, D, E, and F). Finally, all frequent items are sorted according to their support to generate the sorted frequent 1-itemset. This last step ends phase 1 of the COFI-tree algorithm and starts the second phase. In phase 2, the first transaction (A, G, D, C, B) read is filtered to consider only the frequent items that occur in the header table (i.e., A, D, C and B). This frequent list is sorted according to the items' supports (A, B, C and D). This ordered transaction generates the first path of the FP-Tree, with all item-node supports initially equal to 1. A link is established between each item-node in the tree and its corresponding item entry in the header table. The same procedure is executed for the second transaction (B, C, H, E, and D), which yields the sorted frequent item list (B, C, D, E) that forms the second path of the FP-Tree. Transaction 3 (B, D, E, A, and M) yields the sorted frequent item list (A, B, D, E) that shares the same prefix (A, B) with an existing path in the tree. The supports of the item-nodes (A and B) are incremented by 1, making the supports of (A) and (B) equal to 2, and a new sub-path is created with the remaining items of the list (D, E), all with support equal to 1. The same process occurs for all transactions until the FP-Tree for the transactions given in Table 1 is built. Figure 2 shows the result of the tree building process.
Fig. 2. Frequent Pattern Tree.
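The two phases described above can be condensed into a short sketch. The code below is a simplified illustration of the construction (not the authors' implementation), with the header table kept as plain lists of node references.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_fptree(transactions, min_support):
    # Phase 1: count items, drop infrequent ones, order the rest by descending frequency.
    counts = Counter(item for t in transactions for item in t)
    counts = {item: c for item, c in counts.items() if c >= min_support}
    order = sorted(counts, key=counts.get, reverse=True)
    rank = {item: r for r, item in enumerate(order)}

    # Phase 2: insert each transaction, filtered and sorted, into the prefix tree.
    root = FPNode(None, None)
    header = {item: [] for item in order}   # item -> chain of nodes labelled with it
    for t in transactions:
        path = sorted((item for item in t if item in rank), key=rank.get)
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            else:
                child.count += 1
            node = child
    return root, header, order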
3 Co-Occurrence Frequent-Item-trees: Construction and Mining
Our approach for computing frequencies relies first on building independent, relatively small trees for each frequent item in the header table of the FP-Tree, called COFI-trees. We then mine each of the trees separately as soon as it is built, minimizing candidate generation and without building conditional sub-trees recursively. The trees are discarded as soon as they are mined. At any given time, only one COFI-tree is present in main memory.
3.1 Construction of the Co-Occurrence Frequent-Item-trees
The small COFI-trees we build are similar to the conditional FP-trees in the sense that they have a header with ordered frequent items and horizontal pointers pointing to a succession of nodes containing the same frequent item, and the prefix tree per se with paths representing sub-transactions. However, the COFI-trees have bidirectional links in the tree, allowing bottom-up scanning as well, and the nodes contain not only the item label and a frequency counter, but also a participation counter, as explained later in this section. The COFI-tree for a given frequent item x contains only nodes labeled with items that are more frequent than or as frequent as x.

To illustrate the idea of the COFI-trees, we explain step by step the process of creating COFI-trees for the FP-Tree of Figure 2. In our example, the first Co-Occurrence Frequent Item tree is built for item F, as it is the least frequent item in the header table. In this tree for F, all frequent items which are more frequent than F and share transactions with F participate in building the tree. They can be found by following the chain of item F in the FP-Tree structure. The F-COFI-tree starts with a root node containing the item in question, F. For each sub-transaction or branch in the FP-Tree containing item F together with frequent items that are more frequent than F (i.e., parent nodes of F), a branch is formed starting from the root node F. The support of this branch is equal to the support of the F node in its corresponding branch of the FP-Tree. If multiple frequent items share the same prefix, they are merged into one branch and a counter for each node of the tree is adjusted accordingly. Figure 3 illustrates all COFI-trees for the frequent items of Figure 2.

In Figure 3, the rectangular nodes are tree nodes with an item label and two counters. The first counter is a support-count for that node, while the second counter, called the participation-count, is initialized to 0 and is used by the mining algorithm discussed later. Each node also has a horizontal link which points to the next node with the same item-name in the tree, and a bidirectional vertical link that links a child node with its parent and a parent with its child. The bidirectional pointers facilitate the mining process by making the traversal of the tree easier. The squares are cells of the header table, as with the FP-Tree. This is a list of all frequent items that participate in building the tree structure, sorted in ascending order of their global support.
Fig. 3. COFI-trees
Each entry in this list contains the item-name, an item-counter, and a pointer to the first node in the tree that has the same item-name.

To explain the COFI-tree building process, we highlight the building steps for the F-COFI-tree in Figure 3. Frequent item F is read from the header table and its first location in the FP-Tree is found using the pointer in the header table. The first location of item F indicates that it shares a branch with item A, with support = 1 for this branch, as the support of the F item is considered the support of this branch (following the upward links for this item). Two nodes are created, for F and A, giving FA:1. The second location of F indicates a new branch FECA:2, as the support of F is 2. Three nodes are created for items E, C and A with support = 2, and the support of the F node is incremented by 2. The third location indicates the sub-transaction FEB:1. Nodes for F and E already exist, and only a new node for B is created as another child of E. The supports of all these nodes are incremented by 1: B becomes 1, E becomes 3 and F becomes 4. FEDB:1 is read after that; the FE branch already exists, and a new child branch for DB is created as a child of E with support = 1. The support of the E node becomes 4, and F becomes 5. Finally FC:2 is read, a new node for item C is created with support = 2, and the support of F becomes 7. As with FP-Trees, the header constitutes a list of all frequent items and maintains the location of the first entry for each item in the COFI-tree. A link is also kept from each node in the tree to the next location of the same item in the tree, if it exists. The mining process is the last step done on the F-COFI-tree before removing it and creating the next COFI-tree for the next item in the header table.
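The construction just described can be condensed into a short sketch. The code below is a simplified illustration (not the published implementation); it assumes the FPNode objects and header chains produced by the earlier FP-Tree sketch, and it keeps each node's participation counter for the mining step.

class CofiNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.support = 0          # support-count
        self.participation = 0    # participation-count, used during mining
        self.children = {}

def build_cofi_tree(x, fp_header):
    # Build the COFI-tree for frequent item x by following its chain in the FP-Tree.
    root = CofiNode(x, None)
    local_header = {}             # item -> chain of COFI nodes labelled with it
    for fp_node in fp_header[x]:
        branch_support = fp_node.count
        # Ancestors of this occurrence of x, from the closest (least frequent) upwards.
        path = []
        node = fp_node.parent
        while node is not None and node.item is not None:
            path.append(node.item)
            node = node.parent
        # Merge the path into the COFI-tree, accumulating the branch support.
        root.support += branch_support
        current = root
        for item in path:
            child = current.children.get(item)
            if child is None:
                child = CofiNode(item, current)
                current.children[item] = child
                local_header.setdefault(item, []).append(child)
            child.support += branch_support
            current = child
    return root, local_header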
3.2 Mining the COFI-trees
The COFI-trees of all frequent items are not constructed together. Each tree is built, mined, and then discarded before the next COFI-tree is built.
Fig. 4. Steps needed to generate frequent patterns related to item E
The mining process is done for each tree independently, with the purpose of finding all frequent k-itemset patterns in which the item at the root of the tree participates. The steps needed to produce the frequent patterns related to item E, for example, are illustrated in Figure 4. From each branch of the tree, using the support-count and the participation-count, candidate frequent patterns are identified and stored temporarily in a list. The non-frequent ones are discarded at the end, when all branches have been processed.

The mining process for the E-COFI-tree starts from the most locally frequent item in the header table of the tree, which is item B. Item B exists in three branches of the E-COFI-tree, namely (B:1, C:1, D:5 and E:8), (B:4, D:5, and E:8) and (B:1, and E:8). The frequency of each branch is the frequency of the first item in the branch minus the participation value of the same node. Item B in the first branch has a frequency value of 1 and a participation value of 0, which makes the frequency of the first pattern, EDB, equal to 1. The participation values of all nodes in this branch are incremented by 1, which is the frequency of this pattern. For the first pattern EDB:1 we need to generate all sub-patterns in which item E participates, which are ED:1, EB:1 and EDB:1. The second branch that contains B generates the pattern EDB:4, as the frequency of B on this branch is 4 and its participation value is 0. All participation values on these nodes are incremented by 4. Sub-patterns are also generated from the EDB pattern, namely ED:4, EB:4, and EDB:4. All of these patterns already exist with a support value of 1, so only their support values need to be updated, to 5. The last branch, EB:1, generates only one pattern, EB:1, and consequently the support of EB is updated to 6. The second locally frequent item in this tree, D, exists in one branch (D:5 and E:8) with a participation value of 5 for the D node. Since the participation value of this node equals its support value, no patterns can be generated from it. Finally, all non-frequent patterns are omitted, leaving only the frequent patterns in which item E participates, which are ED:5, EB:6 and EBD:5. The COFI-tree of item E can be removed at this time
and another tree can be generated and tested to produce all the frequent patterns related to its root node. The same process is executed to generate the remaining frequent patterns. The D-COFI-tree is created after the E-COFI-tree; mining this tree generates the frequent patterns DB:8, DA:5, and DBA:5. The C-COFI-tree generates one frequent pattern, CA:6. Finally, the B-COFI-tree is created and the frequent pattern BA:6 is generated.
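A minimal sketch of the non-recursive mining loop follows; it assumes the CofiNode objects and local header from the construction sketch above, and, unlike the full method, it does not prune locally infrequent items before enumerating sub-patterns, relying instead on the final frequency filter.

from itertools import combinations

def mine_cofi_tree(root, local_header, min_support):
    counts = {}
    # Process locally frequent items from the most frequent downwards, as in the example.
    items = sorted(local_header,
                   key=lambda i: sum(n.support for n in local_header[i]),
                   reverse=True)
    for item in items:
        for node in local_header[item]:
            freq = node.support - node.participation
            if freq <= 0:
                continue
            # Collect the branch from this node up to (but excluding) the root,
            # and mark this frequency as used on every node of the branch.
            branch, n = [], node
            while n is not root:
                branch.append(n.item)
                n.participation += freq
                n = n.parent
            root.participation += freq
            # Every sub-pattern of the branch, combined with the root item, gets this frequency.
            for k in range(1, len(branch) + 1):
                for combo in combinations(branch, k):
                    pattern = frozenset(combo) | {root.item}
                    counts[pattern] = counts.get(pattern, 0) + freq
    return {p: c for p, c in counts.items() if c >= min_support}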
4 Experimental Evaluations and Performance Study
To test the efficiency of the COFI-tree approach, we conducted experiments comparing our approach with two well-known algorithms, namely Apriori and FP-Growth. To avoid implementation bias, a third-party Apriori implementation by Christian Borgelt [4] and the FP-Growth implementation [6] written by its original authors are used. The experiments were run on a 733-MHz machine with a relatively small RAM of 256MB. Transactions were generated using the IBM synthetic data generator [3]. We conducted different experiments to test the COFI-tree algorithm when mining extremely large transactional databases, assessing both its applicability and its scalability. In one of these experiments, we mined transactional databases of sizes ranging from 1 million to 25 million transactions with an average transaction length of 24 items, using a support threshold of 0.01%. The dimensionality of the 1 and 2 million transaction datasets was 10,000 items, while the datasets ranging from 5 million to 25 million transactions had a dimensionality of 100,000 unique items. Figure 5A illustrates the comparative results obtained with Apriori, FP-Growth and the COFI-tree. Apriori failed to mine the 5 million transaction database, and FP-Growth could not mine beyond the 5 million transaction mark. The COFI-tree, however, demonstrates good scalability, as it mines 25 million transactions in 2921s (about 48 minutes). None of the tested algorithms, or results reported in the literature, reaches such a size. To test the behavior of the COFI-tree vis-à-vis different support thresholds, a set of experiments was conducted on a database of one million transactions, with 10,000 items and an average transaction length of 24 items. The mining process tested different support levels: 0.0025%, which revealed almost 125K frequent patterns; 0.005%, which revealed nearly 70K frequent patterns; 0.0075%, which generated 32K frequent patterns; and 0.01%, which returned 17K frequent patterns. Figure 5B depicts the time needed in seconds for each of these runs. The results show that the COFI-tree algorithm outperforms both the Apriori and FP-Growth algorithms in all cases.
5 Discussion and Future Work
Finding scalable algorithms for association rule mining in extremely large databases is the main goal of our research. To reach this goal, we propose a new algorithm that is FP-Tree based.
Fig. 5. Computational performance and scalability: time in seconds for Apriori, FP-Growth and the COFI-tree algorithm, (A) for database sizes from 1M to 25M transactions and (B) for support thresholds from 0.0025% to 0.01%.
This algorithm addresses the main problem of the FP-Growth algorithm, which is the recursive creation and mining of many conditional pattern trees, equal in number to the distinct frequent patterns generated. We have replaced this step by creating one COFI-tree for each frequent item. A simple non-recursive mining process is applied to generate all frequent patterns related to the tested COFI-tree. The experiments we conducted showed that our algorithm is scalable to mine tens of millions of transactions, if not more. We are currently studying the possibility of parallelizing the COFI-tree algorithm to investigate the opportunity of mining hundreds of millions of transactions in a reasonable time and with acceptable resources.
References
1. R. Agarwal, C. Aggarwal, and V. Prasad. A tree projection algorithm for generation of frequent itemsets. Parallel and Distributed Computing, 2000.
2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487–499, Santiago, Chile, September 1994.
3. IBM Almaden. Quest synthetic data generation code. http://www.almaden.ibm.com/cs/quest/syndata.html.
4. C. Borgelt. Apriori implementation. http://fuzzy.cs.uni-magdeburg.de/~borgelt/apriori/apriori.html.
5. E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. Transactions on Knowledge and Data Engineering, 12(3):337–352, May-June 2000.
6. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM-SIGMOD, Dallas, 2000.
7. H. Huang, X. Wu, and R. Relue. Association analysis with one scan of databases. In IEEE International Conference on Data Mining, pages 629–636, December 2002.
8. J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. In Eighth ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pages 229–238, Edmonton, Alberta, August 2002.
A New Computation Model for Rough Set Theory Based on Database Systems
Jianchao Han (1), Xiaohua Hu (2), T. Y. Lin (3)
(1) Dept. of Computer Science, California State University Dominguez Hills, 1000 E. Victoria St., Carson, CA 90747, USA
(2) College of Information Science and Technology, Drexel University, 3141 Chestnut St., Philadelphia, PA 19104, USA
(3) Dept. of Computer Science, San Jose State University, One Washington Square, San Jose, CA 94403, USA

Abstract. We propose a new computation model for rough set theory using relational algebra operations in this paper. We present the necessary and sufficient conditions on data tables under which an attribute is a core attribute and those under which a subset of condition attributes is a reduct, respectively. With this model, two algorithms for core attribute computation and reduct generation are suggested. The correctness of both algorithms is proved and their time complexity is analyzed. Since relational algebra operations have been efficiently implemented in most widely-used database systems, the algorithms presented can be extensively applied to these database systems and adapted to a wide range of real-life applications with very large data sets.
1 Introduction
Rough sets theory was first introduced by Pawlak in the 1980's [10] and has since been widely applied in different real applications such as machine learning, knowledge discovery, and expert systems [2, 6, 7, 11]. Rough sets theory is especially useful for domains where the data collected are imprecise and/or inconsistent. It provides a powerful tool for data analysis and data mining from imprecise and ambiguous data. Many rough set models have been developed in the rough set community [7, 8]. Some of them have been applied in industrial data mining projects such as stock market prediction, patient symptom diagnosis, telecommunication churner prediction, and financial bank customer attrition analysis to solve challenging business problems. These rough set models focus on extensions of the original model proposed by Pawlak [10, 11] and attempt to deal with its limitations, but have not paid much attention to the efficiency of the model implementation, such as core and reduct generation. One of the serious drawbacks of existing rough set models is the inefficiency and unscalability of their implementations for computing the core and reducts and identifying the dispensable attributes, which limits their suitability in data mining applications with large data sets. Further investigation reveals that existing rough set methods perform the computations of core and reducts on flat files rather than integrating with efficient and high-performance relational database set
operations, while some authors have proposed ideas to reduce data using relational database system techniques [4, 6]. To overcome the problem, we propose a new computation model of rough set theory to efficiently compute the core and reducts by means of relational database set-oriented operations such as Cardinality and Projection. We prove and demonstrate that our computation model is equivalent to the traditional rough set model, but much more efficient and scalable. The rest of the paper is organized as follows: We briefly overview the traditional rough set theory in Section 2. A new computation model of rough set theory by means of relational database set-oriented operations is proposed in Section 3. In Section 4, we describe our new algorithms to compute core attributes, construct reducts based on our new model, and analyze their time complexity. Related works are discussed in Section 5. Finally, Section 6 is the conclusion and future work.
2 Overview of Rough Set Theory
An information system IS is defined as IS = <U, C, D, {V_a}_{a ∈ C∪D}, f>, where U = {u_1, u_2, ..., u_n} is a non-empty set of tuples, called the data set or data table, C is a non-empty set of condition attributes, and D is a non-empty set of decision attributes with C ∩ D = ∅. V_a is the domain of attribute a with at least two elements. f is a function U × (C ∪ D) → V = ∪_{a ∈ C∪D} V_a, which maps each pair of tuple and attribute to an attribute value.

Let A ⊆ C ∪ D and t_i, t_j ∈ U. We define a binary relation R_A, called an indiscernibility relation, as follows: R_A = {<t_i, t_j> ∈ U × U : ∀a ∈ A, t_i[a] = t_j[a]}, where t[a] indicates the value of attribute a ∈ A of the tuple t. The indiscernibility relation, denoted IND, is an equivalence relation on U. The ordered pair <U, IND> is called an approximation space. It partitions U into equivalence classes, each of which is labeled by a description A_i and called an elementary set. Any finite union of elementary sets is called a definable set in <U, IND>.

Definition 1. Let X be a subset of U representing a concept. Assume A is a subset of attributes, A ⊆ C ∪ D, and [A] = {A_1, A_2, ..., A_m} is the set of elementary sets based on A. The lower approximation of X based on A, denoted Lower_A(X), is defined as Lower_A(X) = ∪{A_i ∈ [A] | A_i ⊆ X, 1 ≤ i ≤ m}; it contains all the tuples in U that can definitely be classified to X, and is therefore called the positive region of X w.r.t. A. The upper approximation of X based on A, denoted Upper_A(X), is defined as Upper_A(X) = ∪{A_i ∈ [A] | A_i ∩ X ≠ ∅, 1 ≤ i ≤ m}; it contains those tuples in U that can possibly be classified to X. The set of those tuples that can possibly but not definitely be classified to X is called the boundary area of X, denoted Boundary_A(X), and defined as Boundary_A(X) = Upper_A(X) − Lower_A(X). The negative region of X is defined as Negative_A(X) = ∪{A_i ∈ [A] | A_i ⊆ U − X, 1 ≤ i ≤ m}; it contains the tuples that cannot be classified to X. □
Thus, the positive and negative regions encompass positive and negative examples of the concept X, respectively, while the boundary region contains the uncertain examples. If Lower_A(X) = Upper_A(X), then the boundary region of the set X disappears and the rough set becomes equivalent to a standard set. Generally, for any concept X, we can derive two kinds of classification rules from the lower and upper approximations of X based on a subset of condition attributes. The former are deterministic, because they definitely determine that the tuples satisfying the rule condition must be in the target concept, while the latter are non-deterministic, because the tuples satisfying the rule condition are only possibly in the target concept. Specifically, let [D] = {D_1, D_2, ..., D_k} be the set of elementary sets based on the decision attribute set D. Assume A is a subset of condition attributes, A ⊆ C, and [A] = {A_1, A_2, ..., A_h} is the set of elementary sets based on A.
Definition 2. ∀ D_j ∈ [D], 1 ≤ j ≤ k, the lower approximation of D_j based on A, denoted Lower_A(D_j), is defined as Lower_A(D_j) = ∪{A_i | A_i ⊆ D_j, 1 ≤ i ≤ h}. All tuples in Lower_A(D_j) can certainly be classified to D_j. The lower approximation of [D], denoted Lower_A([D]), is defined as Lower_A([D]) = ∪_{j=1}^{k} Lower_A(D_j). All tuples in Lower_A([D]) can certainly be classified. Similarly, ∀ D_j ∈ [D], 1 ≤ j ≤ k, the upper approximation of D_j based on A, denoted Upper_A(D_j), is defined as Upper_A(D_j) = ∪{A_i | A_i ∩ D_j ≠ ∅, 1 ≤ i ≤ h}. All tuples in Upper_A(D_j) can probably be classified to D_j. The upper approximation of [D], denoted Upper_A([D]), is defined as Upper_A([D]) = ∪_{j=1}^{k} Upper_A(D_j). All tuples in Upper_A([D]) can probably be classified. The boundary of [D] based on A ⊆ C, denoted Boundary_A([D]), is defined as Boundary_A([D]) = Upper_A([D]) − Lower_A([D]). The tuples in Boundary_A([D]) cannot be classified in terms of A and D. □

Rough sets theory can tell us whether the information for the classification of tuples is consistent based on the data table itself. If the data is inconsistent, it suggests that more information about the tuples needs to be collected in order to build a good classification model for all tuples. If there exists a pair of tuples in U that have the same condition attribute values but different decision attribute values, U is said to contain contradictory tuples.
Definition 3. U is consistent if no contradictory pair of tuples exists in U, that is, ∀ t_1, t_2 ∈ U, if t_1[D] ≠ t_2[D], then t_1[C] ≠ t_2[C]. □
Usually, the existence of contradictory tuples indicates that the information contained in U is not enough to classify all tuples, and there must be some contradictory tuples contained in the boundary area; see Proposition 1. On the other hand, if the data is consistent, rough sets theory can also determine whether there is more than sufficient or redundant information in the data and provide approaches to finding the minimum data needed for a classification model. This property of rough sets theory is very important for applications where domain knowledge is limited or data collection is expensive or laborious, because it ensures the data collected is right (not more or less) to build a good
classification model without sacrificing the accuracy of the classification model or wasting time and effort to gather extra information. Furthermore, rough sets theory classifies all the attributes into three categories: core attributes, reduct attributes, and dispensable attributes. Core attributes carry the essential information needed to make correct classifications for the data set and should be retained in the data set; dispensable attributes are the redundant ones in the data set and should be eliminated without loss of any useful information; reduct attributes lie in between, and a reduct attribute may or may not be essential.
Definition 4. A condition attribute a ∈ C is a dispensable attribute of C in U w.r.t. D if Lower_C([D]) = Lower_{C−{a}}([D]). Otherwise, a ∈ C is called a core attribute of C w.r.t. D. □
A reduct of the condition attributes set is a minimum subset of the entire condition attributes set that has the same classification capability as the original attributes set.
Definition 5. A subset R of C, R ⊆ C, is defined as a reduct of C in U w.r.t. D if Lower_R([D]) = Lower_C([D]) and ∀ B ⊂ R, Lower_B([D]) ≠ Lower_C([D]). A condition attribute a ∈ C is said to be a reduct attribute if ∃ R ⊆ C such that R is a reduct of C and a ∈ R. □

For a given data table, there may exist more than one reduct. Finding all reducts of the condition attributes set is NP-hard [2].
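A small sketch may help fix the definitions; the code below is an illustration over a decision table represented as a list of dictionaries (not part of the paper), computing the equivalence classes, the lower approximation of the decision classes, and the core-attribute test of Definition 4.

from collections import defaultdict

def partition(table, attrs):
    # Equivalence classes (as sets of row indices) of the indiscernibility relation induced by attrs.
    classes = defaultdict(list)
    for i, row in enumerate(table):
        classes[tuple(row[a] for a in attrs)].append(i)
    return [set(ids) for ids in classes.values()]

def lower_approximation(table, cond_attrs, dec_attrs):
    # Union of the condition classes fully contained in some decision class (Definition 2).
    dec_classes = partition(table, dec_attrs)
    lower = set()
    for cls in partition(table, cond_attrs):
        if any(cls <= d for d in dec_classes):
            lower |= cls
    return lower

def is_core_attribute(table, a, cond_attrs, dec_attrs):
    # Definition 4: a is a core attribute iff dropping it changes the lower approximation.
    reduced = [c for c in cond_attrs if c != a]
    return (lower_approximation(table, cond_attrs, dec_attrs)
            != lower_approximation(table, reduced, dec_attrs))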
3 A New Computation Model for Rough Set Theory
Some limitations of rough sets theory have been pointed out [7, 8] which restrict its suitability in practice. One of these limitations is the inefficiency of the computation of core attributes and reducts, which limits its suitability for large data sets. In order to find core attributes, dispensable attributes, or reducts, the rough set model needs to construct all the equivalence classes based on the values of the condition and decision attributes of all tuples in the data set. This is very time-consuming and often infeasible, since most data mining applications require efficient algorithms that can deal with scalable data sets. Our experience and investigation show that current implementations of the rough set model are based on flat-file-oriented computations of the core and reducts. As is well known, however, set-oriented operations in existing relational database systems such as Oracle, Sybase, and DB2 are efficient and scalable for large data sets. These high-performance set-oriented operations can be integrated with the rough set model to improve the efficiency of the various operations of rough sets theory.

In this section we propose a computation model based on relational algebra, which provides the necessary and sufficient conditions, expressed with database operations, for computing core attributes and constructing reducts, and
then describe, in the next section, the algorithms that compute the core attributes and generate reducts of the given attribute sets. For simplicity and convenience, we adopt the following conventions. Let a ∈ C ∪ D be an attribute and t ∈ U be a tuple; t[a] denotes t's value of attribute a. If t1 ∈ U and t2 ∈ U are two tuples and t1[a] = t2[a], this is denoted t1 ~a t2. Let A = {a1, a2, ..., ak} ⊆ C ∪ D be a subset of attributes and t ∈ U be a tuple; t[A] denotes the sequence <t[a1], t[a2], ..., t[ak]>. For t1, t2 ∈ U, we say t1[A] = t2[A], denoted t1 ~A t2, if and only if t1[ai] = t2[ai], i = 1, 2, ..., k. To start with, let us review two set-oriented operations available in relational database systems: Count and Projection [5]. Assume Y is a data table. Count (Cardinality): Card(Y) is the number of distinct tuples in Y. Projection: Assume Y has columns C and E ⊆ C; Π_E(Y) is a data table that contains all tuples of Y but only the columns in E.
Proposition 1. The data table U is consistent if and only if U = Lower_C([D]) = Upper_C([D]) and Boundary_C([D]) = ∅.
Proof. Let [C] = {C1, C2, ..., Cm} and [D] = {D1, D2, ..., Dn} be the sets of equivalence classes induced by C and D, respectively. Assume U is consistent. On the one hand, by Definitions 1 and 2, it is obvious that Lower_C([D]) ⊆ Upper_C([D]) ⊆ U. On the other hand, ∀ t ∈ U = ⋃_{i=1..n} Di, ∃ 1 ≤ j ≤ n such that t ∈ Dj. Similarly, ∃ 1 ≤ i ≤ m such that t ∈ Ci, for U = ⋃_{i=1..m} Ci. ∀ t' ∈ Ci, t[C] = t'[C]. By Definition 3, t[D] = t'[D]. So t' ∈ Dj, for t ∈ Dj. Thus, Ci ⊆ Dj, which leads to t ∈ Lower_C(Dj), and t ∈ Lower_C([D]). Hence, U ⊆ Lower_C([D]), and therefore U = Lower_C([D]) = Upper_C([D]). Furthermore, Boundary_C([D]) = Upper_C([D]) − Lower_C([D]) = ∅. □
Theorem 1. U is consistent if and only if Card(Π_C(U)) = Card(Π_{C+D}(U)).
Proof. By Proposition 1, U is consistent if and only if Boundary_C([D]) = ∅, if and only if ∀ t, s ∈ U, t[C] = s[C] is equivalent to t[C+D] = s[C+D], if and only if Card(Π_C(U)) = Card(Π_{C+D}(U)). □

Proposition 2. Let A ⊆ B ⊆ C. Assume [A] = {A1, A2, ..., Am} and [B] = {B1, B2, ..., Bn} are the sets of equivalence classes induced by A and B, respectively. Then ∀ Bi ∈ [B], i = 1, 2, ..., n, and Aj ∈ [A], j = 1, 2, ..., m, either Bi ∩ Aj = ∅ or Bi ⊆ Aj. [B] is said to be a refinement of [A]. □

Proposition 3. If U is consistent, then ∀ A ⊆ C, Card(Π_A(U)) ≤ Card(Π_{A+D}(U)).
Proof. ∀ t, s ∈ U, if t and s are projected to be the same in Π_{A+D}(U), then they must be projected to be the same in Π_A(U). □

Theorem 2. If U is consistent, then ∀ A ⊆ C, Lower_C([D]) ≠ Lower_{C−A}([D]) if and only if Card(Π_{C−A}(U)) ≠ Card(Π_{C−A+D}(U)).
Proof. Let [C] = {C1, C2, ..., Cm} and [C−A] = {C'1, C'2, ..., C'k} be the sets of equivalence classes induced by C and C−A, respectively, and let [D] = {D1, D2, ..., Dn} be the set of equivalence classes induced by D.
According to Definition 2, for a given 1 ≤ j ≤ n, we have Lower_{C−A}(Dj) = ⋃{C'q | C'q ⊆ Dj, 1 ≤ q ≤ k}. Thus, ∀ t ∈ Lower_{C−A}(Dj), ∃ 1 ≤ q ≤ k such that t ∈ C'q and C'q ⊆ Dj. Because U = ⋃_{i=1..k} C'i = ⋃_{i=1..m} Ci, ∃ 1 ≤ p ≤ m with t ∈ Cp. Hence, t ∈ C'q ∩ Cp ≠ ∅. By Proposition 2, it can easily be seen that Cp ⊆ C'q ⊆ Dj because C−A ⊆ C. Hence t ∈ ⋃{Ci | Ci ⊆ Dj, 1 ≤ i ≤ m} = Lower_C(Dj). Therefore, Lower_{C−A}(Dj) ⊆ Lower_C(Dj) and thus Lower_{C−A}([D]) ⊆ Lower_C([D]).
Because Lower_{C−A}([D]) ≠ Lower_C([D]) by the given condition, we must have Lower_{C−A}([D]) ⊂ Lower_C([D]). So it can be inferred that ∃ t0 ∈ U such that t0 ∈ Lower_C([D]) and t0 ∉ Lower_{C−A}([D]). Thus, ∃ Dj, 1 ≤ j ≤ n, such that t0 ∈ Lower_C(Dj), which means ∃ Cp, 1 ≤ p ≤ m, t0 ∈ Cp ⊆ Dj. And ∀ 1 ≤ i ≤ n, t0 ∉ Lower_{C−A}(Di), that is, t0 ∉ ⋃{C'q | C'q ⊆ Di, 1 ≤ q ≤ k}. However, t0 ∈ U = ⋃_{q=1..k} C'q. Hence ∃ 1 ≤ q ≤ k with t0 ∈ C'q, but ∀ 1 ≤ i ≤ n, C'q ⊄ Di. It is known that t0 ∈ Dj. Thus, ∃ t0 ∈ U with t0 ∈ C'q ∩ Dj ≠ ∅ and C'q ⊄ Dj, which means ∃ t ∈ U such that t ∈ C'q but t ∉ Dj. Because U = ⋃_{i=1..n} Di, ∃ 1 ≤ s ≤ n such that t ∈ Ds, s ≠ j. Thus, t ∈ C'q ∩ Ds, s ≠ j. Therefore, we obtain t0 ~_{C−A} t, that is, t0[C−A] = t[C−A], for t0 ∈ C'q and t ∈ C'q; but t0 and t are not equivalent w.r.t. C−A+D, that is, t0[C−A+D] ≠ t[C−A+D], for t0 ∈ Dj and t ∈ Ds, s ≠ j. From the above, one can see that t0 and t are projected to be the same by Π_{C−A}(U) but different by Π_{C−A+D}(U). Thus, Π_{C−A+D}(U) has at least one more distinct tuple than Π_{C−A}(U), which means Card(Π_{C−A}(U)) < Card(Π_{C−A+D}(U)).
On the other hand, if Card(Π_{C−A}(U)) ≠ Card(Π_{C−A+D}(U)), one can infer Card(Π_{C−A}(U)) < Card(Π_{C−A+D}(U)) by Proposition 3. Hence, ∃ t, s ∈ U such that t and s are projected to be the same by Π_{C−A}(U) but distinct by Π_{C−A+D}(U), that is, t[C−A] = s[C−A] and t[C−A+D] ≠ s[C−A+D]. Thus, t[D] ≠ s[D], that is, t and s are not equivalent w.r.t. D. Therefore, ∃ 1 ≤ q ≤ k such that t, s ∈ C'q, and 1 ≤ i ≠ j ≤ n such that t ∈ Di and s ∈ Dj. So ∀ 1 ≤ p ≤ n, C'q ⊄ Dp (otherwise t, s ∈ Dp). By Definition 2, we have ∀ 1 ≤ p ≤ n, t, s ∉ Lower_{C−A}(Dp). Thus, t, s ∉ Lower_{C−A}([D]). U is consistent, however. By Definition 3 and Proposition 1, t, s ∈ U = Lower_C([D]), which leads to Lower_C([D]) ≠ Lower_{C−A}([D]). □
Corollary 1. If U is consistent, then a ∈ C is a core attribute of C in U w.r.t. D if and only if Card(Π_{C−{a}}(U)) ≠ Card(Π_{C−{a}+D}(U)). □
Corollary 2. If U is consistent, then a ∈ C is a dispensable attribute of C in U w.r.t. D if and only if Card(Π_{C−{a}+D}(U)) = Card(Π_{C−{a}}(U)). □
Corollary 3. If U is consistent, then ∀ A ⊆ C, Lower_C([D]) = Lower_{C−A}([D]) if and only if Card(Π_{C−A+D}(U)) = Card(Π_{C−A}(U)). □
Thus, in order to check whether an attribute a ∈ C is a core attribute, we only need to take two projections of the table, one on C−{a}+D and the other on C−{a}, and then count the distinct tuples in each projection. If the two cardinalities are the same, then no information is lost by removing the dispensable attribute a; otherwise, a is a core attribute. Put more formally, in database terms, the cardinalities of the two projections differ if and only if there exist at least two tuples x and y such that ∀ c ∈ C−{a}, x[c] = y[c], but x[a] ≠ y[a] and x[D] ≠ y[D]. In this case, the number of distinct tuples in the projection on C−{a} will be one fewer than that in the projection on C−{a}+D, for x and y are identical in the former, while they are still distinguishable in the latter. So eliminating attribute a loses the ability to distinguish tuples x and y. Intuitively, this means that some classification information would be lost if a were eliminated.
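To make the counting technique concrete, the following is a minimal sketch (ours, not the paper's) of the consistency test of Theorem 1 and the single-attribute core test of Corollary 1, using COUNT over DISTINCT projections; the table name u and its columns are hypothetical.

    import sqlite3

    def distinct_count(conn, table, cols):
        # Card(pi_cols(table)): number of distinct tuples in the projection
        q = f"SELECT COUNT(*) FROM (SELECT DISTINCT {', '.join(cols)} FROM {table})"
        return conn.execute(q).fetchone()[0]

    def is_consistent(conn, table, cond, dec):
        # Theorem 1: U is consistent iff Card(pi_C(U)) = Card(pi_{C+D}(U))
        return distinct_count(conn, table, cond) == distinct_count(conn, table, cond + dec)

    def is_core(conn, table, cond, dec, a):
        # Corollary 1: a is core iff the two projection counts differ
        rest = [c for c in cond if c != a]
        return distinct_count(conn, table, rest) != distinct_count(conn, table, rest + dec)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE u (headache, temperature, flu)")
    conn.executemany("INSERT INTO u VALUES (?, ?, ?)",
                     [("yes", "high", "yes"), ("no", "high", "yes"), ("no", "normal", "no")])
    print(is_consistent(conn, "u", ["headache", "temperature"], ["flu"]))          # True
    print(is_core(conn, "u", ["headache", "temperature"], ["flu"], "temperature"))  # True

In this toy table, dropping temperature would make headache = no map to both flu values, which is exactly what the diverging counts detect.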
Definition 6. Let B ⊆ C. The degree of dependency between B and D in the data table U, denoted K(B,D), is defined as K(B,D) = Card(Π_B(U)) / Card(Π_{B+D}(U)). □

Proposition 4. If U is consistent, then ∀ B ⊆ C, 0 < K(B,D) ≤ 1, and K(C,D) = 1.
Proof. By Proposition 3 and Definition 6, one can infer K(B,D) ≤ 1. By Theorem 1, Card(Π_C(U)) = Card(Π_{C+D}(U)). Therefore, K(C,D) = 1. □
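As an illustration only (our own helper, with a hypothetical table), the degree of dependency reduces to a ratio of two distinct-projection counts:

    import sqlite3

    def k_dependency(conn, table, b_attrs, d_attrs):
        def card(cols):
            q = f"SELECT COUNT(*) FROM (SELECT DISTINCT {', '.join(cols)} FROM {table})"
            return conn.execute(q).fetchone()[0]
        # Definition 6: K(B, D) = Card(pi_B(U)) / Card(pi_{B+D}(U))
        return card(list(b_attrs)) / card(list(b_attrs) + list(d_attrs))

    # By Proposition 4, k_dependency(conn, "u", C, D) == 1.0 whenever U is consistent.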
Theorem 3. If U is consistent, then R ⊆ C is a reduct of C w.r.t. D if and only if K(R,D) = K(C,D) and ∀ B ⊂ R, K(B,D) ≠ K(C,D).
Proof. K(R,D) = K(C,D) if and only if, by Proposition 4, K(R,D) = 1, if and only if, by Definition 6, Card(Π_R(U)) = Card(Π_{R+D}(U)), if and only if, by Corollary 3, Lower_R([D]) = Lower_C([D]). Similarly, ∀ B ⊂ R, K(B,D) ≠ K(C,D) if and only if Lower_B([D]) ≠ Lower_C([D]). By Definition 5, one can see that the theorem holds. □

4 Algorithms for Finding Core Attributes and Reducts
In classification, two kinds of attributes are generally perceived as unnecessary: attributes that are irrelevant to the target concept (like a customer ID), and attributes that are redundant given other attributes. These unnecessary attributes can exist simultaneously, but the redundant attributes are more difficult to eliminate because of the correlations between them. In the rough set community, unnecessary attributes are eliminated by constructing reducts of the condition attributes. As proved in [10], a reduct of the condition attributes set C must contain all core attributes of C, so it is important to develop an efficient algorithm that finds all core attributes in order to generate a reduct. In traditional rough set models, this is achieved by constructing a decision matrix and then finding all entries of the matrix that contain only one attribute; the corresponding attributes are core attributes [2]. This method is inefficient,
and it is not realistic to construct a decision matrix for millions of tuples, which is a typical situation in data mining applications. Before presenting the algorithms, we review the implementation of Count and Projection in relational database systems using SQL statements. One can verify that both of them run in O(n) time [5].
– Card(Π_{C+D}(U)): SELECT COUNT(*) FROM (SELECT DISTINCT * FROM U)
– Card(Π_X(U)): SELECT COUNT(*) FROM (SELECT DISTINCT X FROM U)
Algorithm 1 FindCore: Find the set of core attributes of a data table
Input: A consistent data table U with conditional attributes set C and decision attributes set D
Output: Core – the set of core attributes of C w.r.t. D in U
1. Set Core ← ∅
2. For each attribute a ∈ C
3.   If Card(Π_{C−{a}}(U)) < Card(Π_{C−{a}+D}(U))
4.     Then Core ← Core ∪ {a}
5. Return Core
Theorem 2 ensures that the outcome Core of the algorithm FindCore contains all core attributes and only those attributes.
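A minimal implementation sketch of FindCore on top of a SQL engine (here sqlite3) might look as follows; the table and attribute names passed in are assumptions for illustration, not part of the paper.

    import sqlite3

    def card(conn, table, cols):
        # Card(pi_cols(U)) as a COUNT over a DISTINCT projection (one O(n) query)
        q = f"SELECT COUNT(*) FROM (SELECT DISTINCT {', '.join(cols)} FROM {table})"
        return conn.execute(q).fetchone()[0]

    def find_core(conn, table, cond, dec):
        core = set()
        for a in cond:                                  # m iterations
            rest = [c for c in cond if c != a]          # C - {a}
            if card(conn, table, rest) < card(conn, table, rest + dec):
                core.add(a)                             # dropping a would lose information
        return core

Each iteration issues two counting queries over the table, which matches the O(mn) bound of Theorem 4.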
Theorem 4. The algorithm FindCore can be implemented in O(mn) time, where m is the number of attributes and n is the number of tuples (rows).
Proof. The For loop is executed m times, and inside each loop, finding the cardinality takes O(n). Therefore, the total running time is O(mn). □

Algorithm 2 FindReduct: Find a reduct of the conditional attributes set
Input: A consistent data table U with conditional attributes set C and decision attributes set D, and the Core of C w.r.t. D in U
Output: REDU – a reduct of the conditional attributes set C w.r.t. D in U
1. REDU ← C; DISP ← C − Core
2. For each attribute a ∈ DISP Do
3.   If K(REDU − {a}, D) = 1 Then
4.     REDU ← REDU − {a}
5. Return REDU
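A corresponding sketch of FindReduct, again under hypothetical table and attribute names (and with the same COUNT-DISTINCT implementation of Card as in the FindCore sketch):

    def find_reduct(conn, table, cond, dec, core):
        def card(cols):
            q = f"SELECT COUNT(*) FROM (SELECT DISTINCT {', '.join(cols)} FROM {table})"
            return conn.execute(q).fetchone()[0]

        def k(b):                                       # K(B, D), Definition 6
            return card(b) / card(b + dec)

        redu = list(cond)
        for a in [c for c in cond if c not in core]:    # DISP = C - Core
            trial = [c for c in redu if c != a]
            if k(trial) == 1.0:                         # a is dispensable w.r.t. current REDU
                redu = trial
        return redu

Which reduct is returned depends on the order in which the dispensable attributes are examined, a point discussed further below.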
Proposition 5. Assume U is consistent and R ⊆ C. If K(R,D) < 1 then ∀ B ⊆ R, K(B,D) < 1.
Proof. Since K(R,D) < 1, we have Card(Π_R(U)) < Card(Π_{R+D}(U)) by Definition 6. Thus, ∃ t, s ∈ U such that t[R] = s[R] but t[R+D] ≠ s[R+D], so t[D] ≠ s[D], and ∀ B ⊆ R, t[B] = s[B]. Therefore, Card(Π_B(U)) < Card(Π_{B+D}(U)) and K(B,D) < 1 by Definition 6. □
Theorem 5. The outcome of Algorithm 2 is a reduct of C w.r.t. D in U.
Proof. Assume the output of Algorithm 2 is REDU. From the algorithm it can easily be observed that K(REDU, D) = 1, and ∀ a ∈ REDU, K(REDU − {a}, D) < 1. By Proposition 5, one can see that ∀ B ⊂ REDU, K(B,D) < 1. Therefore, by Proposition 4 and Theorem 3, we conclude that REDU is a reduct of C w.r.t. D in U. □
Theorem 6. Algorithm 2 runs in O(mn) time, where m is the number of attributes in U and n is the number of tuples in U.
Proof. The For loop executes at most m times and each loop takes O(n) time to calculate K(REDU − {a}, D). Thus, the total running time of the algorithm is O(mn). □
One may note that the outcome of the algorithm FindReduct is an arbitrary reduct of the condition attributes set C, if C has more than one reduct. Which reduct is generated depends on the order in which attributes are checked for dispensability in Step 2 of the FindReduct algorithm. Some authors propose algorithms for constructing the best reduct, but what counts as best depends on how the criteria are defined, such as the number of attributes in the reduct, the number of possible values of the attributes, etc. Given such criteria, FindReduct can easily be adapted to construct the best reduct, simply by choosing the next attribute to check for dispensability according to the criteria. This is left as future work.
5 Related Work
Currently, there are few papers on algorithms for finding core attributes. The traditional method is to construct a decision matrix and then search all the entries in the matrix: if an entry contains only one attribute, that attribute is a core attribute [2]. Constructing the decision matrix, however, is not realistic in real-world applications. Our method for finding all core attributes is much more efficient and scalable, especially when used with relational database systems, and takes only O(mn) time. There are algorithms for finding reducts in the literature, although finding all reducts is NP-hard [11]. Feature selection algorithms for constructing classifiers have been proposed [1, 3, 9], which are strongly related to finding reducts; however, very little of that literature addresses the time complexity of the algorithms. The algorithm for finding a reduct proposed in [1] takes O(m²n²), while the four algorithms for finding attribute subsets developed in [3] each take O(m³n²). Our algorithm for finding a reduct runs in only O(mn) time. Moreover, our algorithm utilizes relational database system operations and is thus much more scalable. The work presented in this paper was originally motivated by [4, 6], both of which propose new techniques for using relational database systems to implement some rough set operations.
6 Concluding Remarks
Most existing rough set models do not integrate with database systems but perform computationally intensive operations such as generating cores, reducts, and induced rules on flat files, which limits their applicability to the large data sets found in data mining applications. In order to take advantage of the efficient data structures and algorithms developed in database systems, we proposed a new computation model for rough set theory using relational algebra operations. Two algorithms, for computing core attributes and for constructing reducts, were presented; we proved their correctness and analyzed their time complexity. Since relational algebra operations are efficiently implemented in most widely used database systems, the algorithms presented here can be applied directly on top of these systems and adapted to a wide range of real-life applications. Moreover, our algorithms are scalable, because existing database systems have demonstrated the capability of efficiently processing very large data sets. However, the FindReduct algorithm can only generate an arbitrary reduct, which may not be the best one. To find the best reduct, we must figure out how to define the selection criteria, which usually depend on the application and its bias. Our future work will focus on the following two aspects: defining reduct selection criteria and finding the best reduct in terms of those criteria; and applying this model to feature selection and rule induction for knowledge discovery.
References
1. Bell, D., Guan, J., Computational methods for rough classification and discovery, J. of ASIS 49:5, pp. 403-414, 1998.
2. Cercone, N., Ziarko, W., Hu, X., Rule Discovery from Databases: A Decision Matrix Approach, Proc. Int'l Sym. on Methodologies for Intelligent Systems, 1996.
3. Deogun, J., Choubey, S., Raghavan, V., Sever, H., Feature selection and effective classifiers, J. of ASIS 49:5, pp. 423-434, 1998.
4. Hu, X., Lin, T. Y., Han, J., A New Rough Sets Model Based on Database Systems, Proc. of the 9th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, 2003.
5. Garcia-Molina, H., Ullman, J. D., Widom, J., Database System Implementation, Prentice Hall, 2000.
6. Kumar, A., A New Technique for Data Reduction in a Database System for Knowledge Discovery Applications, J. of Intelligent Systems, 10(3).
7. Lin, T. Y. and Cercone, N., Applications of Rough Sets Theory and Data Mining, Kluwer Academic Publishers, 1997.
8. Lin, T. Y., Yao, Y. Y., and Zadeh, L. A., Data Mining, Rough Sets and Granular Computing, Physica-Verlag, 2002.
9. Modrzejewski, M., Feature Selection Using Rough Sets Theory, in Proc. ECML, pp. 213-226, 1993.
10. Pawlak, Z., Rough Sets, International Journal of Computer and Information Sciences, 11(5), pp. 341-356, 1982.
11. Pawlak, Z., Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishers, 1991.
Computing SQL Queries with Boolean Aggregates
Antonio Badia
Computer Engineering and Computer Science Department, University of Louisville
Abstract. We introduce a new method for optimization of SQL queries with nested subqueries. The method is based on the idea of Boolean aggregates, aggregates that compute the conjunction or disjunction of a set of conditions. When combined with grouping, Boolean aggregates allow us to compute all types of non-aggregated subqueries in a uniform manner. The resulting query trees are simple and amenable to further optimization. Our approach can be combined with other optimization techniques and can be implemented with a minimum of changes in any cost-based optimizer.
1 Introduction
Due to the importance of query optimization, there exists a large body of research in the subject, especially for the case of nested subqueries ([10, 5, 13, 7, 8, 17]). It is considered nowadays that existing approaches can deal with all types of SQL subqueries through unnesting. However, practical implementation lags behind the theory, since some transformations are quite complex to implement. In particular, subqueries where the linking condition (the condition connecting query and subquery) is one of NOT IN, NOT EXISTS or a comparison with ALL seem to present problems to current optimizers. These cases are assumed to be translated, or are dealt with using antijoins. However, the usual translation does not work in the presence of nulls, and even when fixed it adds some overhead to the original query. On the other hand, antijoins introduce yet another operator that cannot be moved in the query tree, thus making the job of the optimizer more difficult. When a query has several levels, the complexity grows rapidly (an example is given below). In this paper we introduce a variant of traditional unnesting methods that deals with all types of linking conditions in a simple, uniform manner. The query tree created is simple, and the approach extends neatly to several levels of nesting and several subqueries at the same level. The approach is based on the concept of Boolean aggregates, which are an extension of the idea of aggregate function in SQL ([12]). Intuitively, Boolean aggregates are applied to a set of predicates and combine the truth values resulting from evaluation of the predicates. We show how two simple Boolean predicates can take care of any type of SQL subquery in
This research was sponsored by NSF under grant IIS-0091928.
a uniform manner. The resulting query trees are simple and amenable to further optimization. Our approach can be combined with other optimization techniques and can be implemented with a minimum of changes in any cost-based optimizer. In section 2 we describe in more detail related research on query optimization and motivate our approach with an example. In section 3 we introduce the concept of Boolean aggregates and show its use in query unnesting. We then apply our approach to the example and discuss the differences with standard unnesting. Finally, in section 4 we offer some preliminary conclusions and discuss further research.
2 Related Research and Motivation
We study SQL queries that contain correlated subqueries¹. Such subqueries contain a correlated predicate, a condition in their WHERE clause introducing the correlation. The attribute in the correlated predicate provided by a relation in an outer block is called the correlation attribute; the other attribute is called the correlated attribute. The condition connecting query and subquery is called the linking condition. There are basically four types of linking condition in SQL: comparisons between an attribute and an aggregation (called the linking aggregate); IN and NOT IN comparisons; EXISTS and NOT EXISTS comparisons; and quantified comparisons between an attribute and a set of attributes through the use of SOME and ALL. We call linking conditions involving an aggregate, IN, EXISTS, and comparisons with SOME positive linking conditions, and the rest (those involving NOT IN, NOT EXISTS, and comparisons with ALL) negative linking conditions. All nested correlated subqueries are nowadays executed by some variation of unnesting. In the original approach ([10]), the correlation predicate is seen as a join; if the subquery is aggregated, the aggregate is computed in advance and then the join is used. Kim's approach had a number of shortcomings; among them, it assumed that the correlation predicate always used equality and that the linking condition was a positive one. Dayal's ([5]) and Muralikrishna's ([13]) work solved these shortcomings; Dayal introduced the idea of using an outerjoin instead of a join (so values with no match would not be lost) and proceeds with the aggregate computation after the outerjoin. Muralikrishna generalizes the approach and points out that negative linking conditions can be dealt with using antijoins or by translating them into positive linking conditions. These approaches introduce shortcomings of their own. First, outerjoins and antijoins do not commute with regular joins or selections; therefore, a query tree with all these operators does not offer many degrees of freedom to the optimizer. The work of [6] and [16] has studied conditions under which outerjoins and antijoins can be moved, partially alleviating this problem. Another problem with this approach is that by carrying out the (outer)join corresponding to the correlation predicate, other predicates in the WHERE clause of the main query, which may restrict the total computation to be carried out, are postponed. The magic sets
The approach is applicable to non-correlated subqueries as well, but does not provide any substantial gains in that case.
approach ([17, 18, 20]) pushes these predicates down past the (outer)join by identifying the minimal set of values that the correlating attributes can take (the magic set) and computing it in advance. This minimizes the size of other computations, but comes at the cost of building the magic set in advance. However, all approaches in the literature assume positive linking conditions (and all examples shown in [5, 13, 19, 20, 18] involve positive linking conditions). Negative linking conditions are not given much attention; it is considered that queries can be rewritten to avoid them, or that they can be dealt with directly using antijoins. But both approaches are problematic. Regarding the former, we point out that the standard translation does not work if nulls are present. Assume, for instance, the condition attr > ALL Q, where Q is a subquery with attr2 as the linked attribute. It is usually assumed that a (left) antijoin with condition attr ≤ attr2 is a correct translation of this condition, since for a tuple t to be in the antijoin, it cannot be the case that t.attr ≤ attr2 for any value of attr2 (or any value in a given group, if the subquery is correlated). Unfortunately, this equivalence only holds for 2-valued logics, not for the 3-valued logic that SQL uses to evaluate predicates when null is present. The condition attr ≤ attr2 will fail if attr is not null and no value of attr2 is greater than or equal to attr, which may happen because attr2 has the right value or because attr2 is null. Hence, a tuple t will be in the antijoin in the latter case as well, and t will qualify for the result. Even though one could argue that this can be solved by changing the condition in the antijoin (and indeed a correct rewrite is possible, but it is more complex than usually considered ([1])), a larger problem with this approach is that it produces plans with outerjoins and antijoins, which are very difficult to move around in the query tree. Even though recent research has shown that outerjoins ([6]) and antijoins ([16]) can be moved under limited circumstances, this still constrains the alternatives that can be generated for a given query plan, and it is up to the optimizer to check that the necessary conditions are met. Hence, proliferation of these operators makes the task of the query optimizer difficult. As an example of the problems of the traditional approach, assume tables R(A,B,C,D), S(E,F,G,H,I), U(J,K,L), and consider the query
Select * From R
Where R.A > 10 and
      R.B NOT IN (Select S.E From S
                  Where S.F = 5 and R.D = S.G and
                        S.H > ALL (Select U.J From U
                                   Where U.K = R.C and U.L != S.I))
Unnesting this query with the traditional approach has the problem of introducing several outerjoins and antijoins that cannot be moved, as well as extra operations (see Fig. 1).
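The divergence between SQL's 3-valued semantics and the naive antijoin rewrite of ALL can be reproduced in a few lines of Python (our own illustration; None models SQL's NULL):

    def gt3(x, y):                        # SQL's ">" under three-valued logic
        if x is None or y is None:
            return None                   # unknown
        return x > y

    def greater_than_all(attr, values):   # SQL semantics of: attr > ALL values
        results = [gt3(attr, v) for v in values]
        if any(r is False for r in results):
            return False
        if any(r is None for r in results):
            return None                   # unknown: the row does NOT qualify
        return True

    def antijoin_keeps(attr, values):     # naive rewrite: keep t if no attr2 with attr <= attr2
        return not any((attr <= v) if v is not None else False for v in values)

    print(greater_than_all(5, [3, None]))   # None (unknown): filtered out by WHERE
    print(antijoin_keeps(5, [3, None]))     # True: the antijoin plan wrongly keeps the tuple

Under SQL semantics the row does not qualify (the ALL comparison is unknown), yet the antijoin-based plan retains it.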
Fig. 1. Standard unnesting approach applied to the example
To see why, note that we must outerjoin U with S and R, and then group by the keys of R and S, to determine which tuples of U must be tested for the ALL linking condition. However, should the set of tuples of U in a group fail the test, we cannot throw the whole group away, for that means that some tuples in S fail to qualify for an answer, making the NOT IN linking condition true and hence qualifying the R tuple. Thus, tuples in S and U should be antijoined separately to determine which tuples in S pass or fail the ALL test. Then the result should be separately antijoined with R to determine which tuples in R pass or fail the NOT IN test. The result is shown in figure 1, with LOJ denoting a left outer join and AJ denoting an antijoin (note that the tree is actually a graph!). Even though Muralikrishna ([13]) proposes to extract (left) antijoins from (left) outerjoins, we note that in general such reuse may not be possible: here, the outerjoin is introduced to deal with the correlation and the antijoin with the linking condition, and therefore they have distinct, independent conditions attached to them (and such approaches transform the query tree into a query graph, making it harder for the optimizer to consider alternatives). Also, magic sets would be able to improve on the above plan by pushing selections down to the relations; however, this does not improve the overall situation, with outerjoins and antijoins still present. Clearly, what is called for is an approach that uniformly deals with all types of linking conditions without introducing undue complexity.
3 Boolean Aggregates
We seek a uniform method that will work for all linking conditions. In order to achieve this, we define Boolean aggregates AND and OR, which take as input a comparison, a set of values (or tuples), and return a Boolean (true or false) as output. Let attr be an attribute, θ a comparison operator and S a set of values.
Then

    AND(S, attr, θ) = ⋀_{attr2 ∈ S} (attr θ attr2)
We define AND(∅, attr, θ) to be true for any attr, θ. Also,

    OR(S, attr, θ) = ⋁_{attr2 ∈ S} (attr θ attr2)
We define OR(∅, attr, θ) to be false for any attr, θ. It is important to point out that each individual comparison is subject to the semantics of SQL's WHERE clause; in particular, comparisons with null values return unknown. The usual behavior of unknown with respect to conjunction and disjunction is followed ([12]). Note also that the set S will be implicit in normal use. When the Boolean aggregates are used alone, S is the input relation to the aggregate; when used in conjunction with a GROUP-BY operator, each group provides the input set. Thus, we will write GB_{A, AND(B,θ)}(R), where A is a subset of attributes of the schema of R, B is an attribute from the schema of R, and θ is a comparison operator; and similarly for OR. The intended meaning is that, as with other aggregates, AND is applied to each group created by the grouping. We use Boolean aggregates to compute any linking condition which does not use a (regular) aggregate, as follows: after a join or outerjoin connecting query and subquery is introduced by the unnesting, a group by is executed. The grouping attributes are any key of the relation from the outer block; the Boolean aggregate used depends on the linking condition. For attr θ SOME Q, where Q is a correlated subquery, the aggregate used is OR(attr, θ). For attr IN Q, the linking condition is treated as attr = SOME Q. For EXISTS Q, the aggregate used is OR(1, 1, =)². For attr θ ALL Q, where Q is a correlated subquery, the aggregate used is AND(attr, θ). For attr NOT IN Q, the linking condition is treated as attr ≠ ALL Q. Finally, for NOT EXISTS Q, the aggregate used is AND(1, 1, ≠). After the grouping and aggregation, the Boolean aggregates leave a truth value in each group of the grouped relation. A selection then must be used to pick out those tuples where the Boolean is set to true. Note that most of this work can be optimized in the implementation, an issue that we discuss in the next subsection. Clearly, implementing a Boolean aggregate is very similar to implementing a regular aggregate. The usual way to compute the traditional SQL aggregates (min, max, sum, count, avg) is to use an accumulator variable in which to store temporary results, and to update it as more values arrive. For min and max, for instance, any new value is compared to the value in the accumulator and replaces it if it is smaller (larger). Sum and count initialize the accumulator to 0 and increase it with each new value (using the value, for sum, and using 1, for count).
Note that technically this formulation is not correct since we are using a constant instead of attr, but the meaning is clear.
Likewise, a Boolean accumulator is used for Boolean aggregates. For ALL, the accumulator is started as true; for SOME, as false. As new values arrive, a comparison is carried out, and the result is ANDed (for AND) or ORed (for OR) with the accumulator. There is, however, a problem with this straightforward approach. When an outerjoin is used to deal with the correlation, tuples in the outer block that have no match appear in the result exactly once, padded on the attributes of the inner block with nulls. Thus, when a group by is done, these tuples become their own group. Hence, tuples with no match actually have one (null) match in the outer join. The Boolean aggregate will then iterate over this single tuple and, finding a null value in it, will deposit a value of unknown in the accumulator. But when a tuple has no matches, the ALL test should be considered successful. The problem is that the outer join marks the lack of a match with a null; while this null is meant to be read as no value occurs, SQL is incapable of distinguishing this interpretation from others, like value unknown (for which the 3-valued semantics makes sense). Note also that the value of attr2 may genuinely be null, if such a null existed in the original data. Thus, what is needed is a way to mark the tuples that have been added as padding by the outer join. We stipulate that outer joins will pad tuples without a match not with nulls, but with a different marker, called an empty marker, which is different from any possible value and from the null marker itself. Then a program like the following can be used to implement the AND aggregate:

    acc = True;
    while (not(empty(S))) {
        t = first(S);
        if (t.attr2 != emptymark)
            acc = acc AND (t.attr comp t.attr2);
        S = rest(S);
    }

Note that this program implements the semantics given for the operator, since a single tuple with the empty marker represents the empty set in the relational framework³.
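As a concrete, purely illustrative rendering of this iterator in Python, with our own names for the empty marker and the three-valued connectives:

    import operator

    EMPTY = object()      # padding marker used by the outer join instead of NULL
    UNKNOWN = None        # SQL's unknown truth value

    def compare(a, op, b):
        if a is None or b is None:
            return UNKNOWN
        return op(a, b)

    def and3(x, y):       # three-valued AND
        if x is False or y is False:
            return False
        if x is UNKNOWN or y is UNKNOWN:
            return UNKNOWN
        return True

    def boolean_and(group, attr, op, attr2):
        acc = True                                # AND over the empty set is true
        for t in group:
            if t[attr2] is EMPTY:                 # padded tuple: the outer tuple had no real match
                continue
            acc = and3(acc, compare(t[attr], op, t[attr2]))
        return acc

    print(boolean_and([{"b": 5, "e": 3}, {"b": 5, "e": None}], "b", operator.ne, "e"))  # UNKNOWN
    print(boolean_and([{"b": 5, "e": EMPTY}], "b", operator.ne, "e"))                   # True

The first group evaluates to unknown because of the null comparison, so it would not pass the later selection on true; the padded group evaluates to true, as the NOT IN semantics requires.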
3.1 Query Unnesting
We unnest using an approach that we call quasi-magic. First, at every query level the WHERE clause, with the exception of any linking condition(s), is transformed into a query tree. This allows us to push selections before any unnesting, as in the magic approach, but we do not compute the magic set, just the complementary set ([17, 18, 20]). This way, we avoid the overhead associated with the magic method. Then, correlated queries are treated as in Dayal’s approach, by adding 3
The change of padding in the outer join should be of no consequence to the rest of query processing. Right after the application of the Boolean aggregate, a selection will pick up only those tuples with a value of true in the accumulator. This includes tuples with the marker; however, no other operator up the query tree operates on the values carrying the marker. In the standard setting, they would contain nulls, and hence no useful operation can be carried out on these values.
a join (or outerjoin, if necessary), followed by a group by on key attributes of the outer relation. At this point, we apply Boolean aggregates using the linking condition, as outlined above. In our previous example, a tree (call it T1) will be formed to deal with the outer block: σ_{A>10}(R). A second tree (call it T2) is formed for the nested query block at the first level: σ_{F=5}(S). Finally, a third tree is formed for the innermost block: U (note that this is a trivial tree because, at every level, we are excluding linking conditions, and there is nothing but linking conditions in the WHERE clause of the innermost block of our example). Using these trees as building blocks, a tree for the whole query is built as follows:
1. First, construct a graph where each tree formed so far is a node and there is a direct link from node Ti to node Tj if there is a correlation in the Tj block with the value of the correlation coming from a relation in the Ti block; the link is annotated with the correlation predicate. Then, we start our tree by left outerjoining any two nodes that have a link between them (the left input corresponding to the block in the outer query), using the condition in the annotation of the link, starting with graph sources (because of SQL semantics, these correspond to outermost blocks that are not correlated) and finishing with sinks (because of SQL semantics, these correspond to innermost blocks that are correlated). Thus, we outerjoin from the outside in. An exception is made for a link between Ti and Tj if there is another path in the graph between Ti and Tj. In the example above, our graph will have three nodes, T1, T2 and T3, with links from T1 to T2, T1 to T3 and T2 to T3. We will create a left outerjoin between T2 and T3 first, and then another left outerjoin of T1 with the previous result. In a situation like this, the link from T1 to T3 becomes just another condition when we outerjoin T1 with the result of the previous outerjoin.
2. On top of the tree obtained in the previous step, we add GROUP BY nodes, with the grouping attributes corresponding to keys of relations in the left argument of the left outerjoins. On each GROUP BY, the appropriate (Boolean) aggregate is used, followed by a SELECT looking for tuples with true (for Boolean aggregates) or applying the linking condition (for regular aggregates). Note that these nodes are applied from the inside out, i.e., the first (bottom) one corresponds to the innermost linking condition, and so on.
3. A projection, if needed, is placed on top of the tree.
The following optimization is applied automatically: every outerjoin is considered to see if it can be transformed into a join. This is not possible for negative linking conditions (NOT IN, NOT EXISTS, ALL), but it is possible for positive linking conditions and all aggregates except COUNT(*)⁴.
This rule coincides with some of Galindo-Legaria's rules ([6]), in that we know that for positive linking conditions and aggregates we are going to have selections that are null-intolerant and, therefore, the outerjoin is equivalent to a join.
Fig. 2. Our approach applied to the example
After this process, the tree is passed on to the query optimizer to see if further optimization is possible. Note that inside each subtree Ti there may be some optimization work to do; note also that, since all operators in the tree are joins and outerjoins, the optimizer may be able to move some operators around. Also, some GROUP BY nodes may be pulled up or pushed down ([2, 3, 8, 9]). We show the final result applied to our example in figure 2. Note that in our example the outerjoins cannot be transformed into joins; however, the group bys may be pushed down depending on the keys of the relations (which we did not specify). Also, even if groupings cannot be pushed down, note that the first one groups the temporary relation by the keys of R and S, while the second one groups by the keys of R alone. Clearly, this second grouping is trivial; the whole operation (grouping and aggregate) can be done in one scan of the input. Compare this tree with the one obtained by standard unnesting (shown in figure 1), and it is clear that our approach is more uniform and simple, while using the ideas behind standard unnesting to its advantage. Again, magic sets could be applied to Dayal's approach, to push down the selections on R and S as we did. However, in this case additional steps would be needed (for the creation of the complementary and magic sets), and the need for outerjoins and antijoins does not disappear. In our approach, the complementary set is always produced by our decision to first process operations at the same level, collapsing each query block (with the exception of linking conditions) to one relation (this is the reason we call our approach a quasi-magic strategy). As more levels, more subqueries, and more correlations are added, the simplicity and clarity of our approach becomes more evident.
3.2 Optimizations
Besides algebraic optimizations, there are some particular optimizations that can be applied to Boolean aggregates. Obviously, AND evaluation can stop as soon as some predicate evaluates to false (with final result false), and OR evaluation can stop as soon as some predicate evaluates to true (with final result true). The later selection on Boolean values can be done on the fly: since we know that the selection condition is going to look for groups with a value of true, groups with a value of false can be thrown away directly, in essence pipelining the selection into the GROUP-BY. Note also that by pipelining the selection, we eliminate the need for a Boolean attribute. In our example, once both left outer joins have been carried out, the first GROUP-BY is executed by either sorting or hashing on the keys of R and S. On each group, the Boolean aggregate AND is computed as tuples arrive. As soon as a comparison returns false, computation of the Boolean aggregate is stopped, and the group is marked so that any further tuples belonging to it are ignored; no output is produced for that group. Groups that do not fail the test are added to the output. Once this temporary result is created, it is read again and scanned looking only at the values of the keys of R to create the groups; the second Boolean aggregate is computed as before. Also as before, as soon as a comparison returns false, the group is flagged for dismissal. The output is composed of the groups that were not flagged when the input was exhausted. Therefore, the cost of our plan, considering only operations above the second left outer join, is that of grouping the temporary relation by the keys of R and S, writing the output to disk, and reading this output into memory again. In traditional unnesting, the cost after the second left outer join is that of executing two antijoins, which is on the order of executing two joins.
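A sketch of this pipelined evaluation with hash-based grouping (the helper names and the three-valued predicate interface are our own, not the paper's):

    def and3(x, y):
        if x is False or y is False:
            return False
        if x is None or y is None:
            return None
        return True

    def grouped_and_select(tuples, key_attrs, pred3):
        """pred3(t) returns True, False or None (unknown), SQL-style."""
        acc, rows, dead = {}, {}, set()
        for t in tuples:
            k = tuple(t[a] for a in key_attrs)
            if k in dead:
                continue                          # group already failed: ignore its tuples
            acc[k] = and3(acc.get(k, True), pred3(t))
            if acc[k] is False:                   # early termination for this group
                dead.add(k)
                acc.pop(k)
                rows.pop(k, None)
            else:
                rows.setdefault(k, []).append(t)
        # the pipelined selection keeps only groups whose aggregate ended up true
        return {k: g for k, g in rows.items() if acc[k] is True}

Failed groups are dismissed as soon as one comparison returns false, so no Boolean attribute ever needs to be materialized.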
4 Conclusion and Further Work
We have proposed an approach to unnesting SQL subqueries which builds on top of existing approaches. Therefore, our proposal is very easy to implement in existing query optimization and query execution engines, as it requires very little in the way of new operations, cost calculations, or implementation in the back-end. The approach allows us to treat all SQL subqueries in a uniform and simplified manner, and it meshes well with existing approaches, letting the optimizer move operators around and apply advanced optimization techniques (like outerjoin reduction and push down/pull up of GROUP BY nodes). Further, because it extends to several levels easily, it simplifies the resulting query trees. Optimizers are becoming quite sophisticated and complex; a simple and uniform treatment of all queries is certainly worth examining. We have argued that our approach yields better performance than traditional approaches when negative linking conditions are present. We plan to analyze the performance of our approach by implementing Boolean aggregates on a DBMS and/or developing a detailed cost model, to offer further support for the conclusions reached in this paper.
References
[1] Cao, Bin and Badia, A., Subquery Rewriting for Optimization of SQL Queries, submitted for publication.
[2] Chaudhuri, S. and Shim, K., Including Group-By in Query Optimization, in Proceedings of the 20th VLDB Conference, 1994.
[3] Chaudhuri, S. and Shim, K., An Overview of Cost-Based Optimization of Queries with Aggregates, Data Engineering Bulletin, 18(3), 1995.
[4] Cohen, S., Nutt, W. and Serebrenik, A., Algorithms for Rewriting Aggregate Queries using Views, Proceedings of the Design and Management of Data Warehouses Conference, 1999.
[5] Dayal, U., Of Nests and Trees: A Unified Approach to Processing Queries That Contain Nested Subqueries, Aggregates, and Quantifiers, in Proceedings of the VLDB Conference, 1987.
[6] Galindo-Legaria, C. and Rosenthal, A., Outerjoin Simplification and Reordering for Query Optimization, ACM TODS, vol. 22, n. 1, 1997.
[7] Ganski, R. and Wong, H., Optimization of Nested SQL Queries Revisited, in Proceedings of the ACM SIGMOD Conference, 1987.
[8] Goel, P. and Iyer, B., SQL Query Optimization: Reordering for a General Class of Queries, in Proceedings of the 1996 ACM SIGMOD Conference.
[9] Gupta, A., Harinarayan, V. and Quass, D., Aggregate-Query Processing in Data Warehousing Environments, in Proceedings of the VLDB Conference, 1995.
[10] Kim, W., On Optimizing an SQL-Like Nested Query, ACM Transactions on Database Systems, vol. 7, n. 3, September 1982.
[11] Materialized Views: Techniques, Implementations and Applications, A. Gupta and I. S. Mumick, eds., MIT Press, 1999.
[12] Melton, J., Advanced SQL: 1999, Understanding Object-Relational and Other Advanced Features, Morgan Kaufmann, 2003.
[13] Muralikrishna, M., Improving Unnesting Algorithms for Join Aggregate Queries in SQL, in Proceedings of the VLDB Conference, 1992.
[14] Ross, K. and Rao, J., Reusing Invariants: A New Strategy for Correlated Queries, in Proceedings of the ACM SIGMOD Conference, 1998.
[15] Ross, K. and Chatziantoniou, D., Groupwise Processing of Relational Queries, in Proceedings of the 23rd VLDB Conference, 1997.
[16] Rao, J., Lindsay, B., Lohman, G., Pirahesh, H. and Simmen, D., Using EELs, a Practical Approach to Outerjoin and Antijoin Reordering, in Proceedings of ICDE 2001.
[17] Seshadri, P., Pirahesh, H. and Leung, T. Y. C., Complex Query Decorrelation, in Proceedings of ICDE 1996, pages 450-458.
[18] Seshadri, P., Hellerstein, J. M., Pirahesh, H., Leung, T. Y. C., Ramakrishnan, R., Srivastava, D., Stuckey, P. J. and Sudarshan, S., Cost-Based Optimization for Magic: Algebra and Implementation, in Proceedings of the SIGMOD Conference, 1996, pages 435-446.
[19] Mumick, I. S. and Pirahesh, H., Implementation of Magic-sets in a Relational Database System, in Proceedings of the SIGMOD Conference 1994, pages 103-114.
[20] Mumick, I. S., Finkelstein, S. J., Pirahesh, H. and Ramakrishnan, R., Magic is Relevant, in Proceedings of the SIGMOD Conference, 1990, pages 247-258.
Fighting Redundancy in SQL
Antonio Badia and Dev Anand
Computer Engineering and Computer Science Department, University of Louisville, Louisville KY 40292
Abstract. Many SQL queries with aggregated subqueries exhibit redundancy (overlap in FROM and WHERE clauses). We propose a method, called the for-loop, to optimize such queries by ensuring that redundant computations are done only once. We specify a procedure to build a query plan implementing our method, give an example of its use and argue that it offers performance advantages over traditional approaches.
1 Introduction
In this paper, we study a class of decision-support SQL queries, characterize them, and show how to process them in an improved manner. In particular, we analyze queries containing subqueries where the subquery is aggregated (type-A and type-JA in [8]). In many of these queries, SQL exhibits redundancy in that the FROM and WHERE clauses of query and subquery show a great deal of overlap. We argue that these patterns are currently not well supported by relational query processors. The following example gives some intuition about the problem; the query used is Query 2 from the TPC-H benchmark ([18]), which we will refer to as query TPCH2:
select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
from part, supplier, partsupp, nation, region
where p_partkey = ps_partkey and s_suppkey = ps_suppkey and
      p_size = 15 and p_type like '%BRASS' and r_name = 'EUROPE' and
      s_nationkey = n_nationkey and n_regionkey = r_regionkey and
      ps_supplycost = (select min(ps_supplycost)
                       from partsupp, supplier, nation, region
                       where p_partkey = ps_partkey and s_suppkey = ps_suppkey and
                             s_nationkey = n_nationkey and n_regionkey = r_regionkey and
                             r_name = 'EUROPE')
order by s_acctbal desc, n_name, s_name, p_partkey;
This research was sponsored by NSF under grant IIS-0091928.
This query is executed in most systems by using unnesting techniques. However, the commonality between query and subquery will not be detected, and all operations (including the common joins and selections) will be repeated (see the in-depth discussion of this example in subsection 2.3). Our goal is to avoid this duplication of effort. For lack of space, we will not discuss related research in query optimization ([3, 11, 6, 7, 8, 15]); we simply point out that detecting and dealing with redundancy is not attempted in this body of work. Our method applies only to aggregated subqueries whose WHERE clauses overlap with the main query's WHERE clause. This may seem a very narrow class of queries until one realizes that all types of SQL subqueries can be rewritten as aggregated subqueries (EXISTS, for instance, can be rewritten as a subquery with COUNT; all other types of subqueries can be rewritten similarly ([2])). Therefore, the approach is potentially applicable to any SQL query with subqueries. Also, it is important to point out that the redundancy is present because of the structure of SQL, which necessitates a subquery in order to state the aggregation to be computed declaratively. Thus, we argue that such redundancy is not infrequent ([10]). We describe an optimization method geared towards detecting and optimizing this redundancy. Our method not only ensures that the redundant part is computed only once, but also proposes a new special operator to compute the rest of the query very effectively. In section 2 we describe our approach and the new operator in more detail: we formally define the operator (subsection 2.1), show how query trees with the operator can be generated for a given SQL query (subsection 2.2), and describe an experiment run in the context of the TPC-H benchmark ([18]) (subsection 2.3). Finally, in section 3 we propose some further research.
2 Optimization of Redundancy
In this section we define patterns which detect redundancy in SQL queries. We then show how to use the matching between patterns and SQL queries to produce a query plan which avoids repeating computations. We represent SQL queries in a schematic form or pattern. With the keywords SELECT ... FROM ... WHERE we will use L, L1, L2, ... as variables over lists of attributes; T, T1, T2, ... as variables over lists of relations; F, F1, F2, ... as variables over aggregate functions; and ∆, ∆1, ∆2, ... as variables over (complex) conditions. Attributes will be represented by attr, attr1, attr2, .... If there is a condition in the WHERE clause of the subquery which introduces a correlation, it will be shown explicitly; this is called the correlation condition. The table to which the correlated attribute belongs is called the correlation table and is said to introduce the correlation; the attribute compared to the correlated attribute is called the correlating attribute. The condition that connects query and subquery (called the linking condition) is also shown explicitly. The operator in the linking condition is called the linking operator, the attributes the linking attributes, and the aggregate function on the subquery side is called the linking aggregate. We will say that a pattern
matches an SQL query when there is a correspondence g between the variables in the pattern and the elements of the query. As an example, the pattern
SELECT L
FROM T
WHERE ∆1 AND attr1 θ (SELECT F(attr2) FROM T WHERE ∆2)
would match query TPCH2 by setting g(∆1) = {p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 15 and p_type like '%BRASS' and r_name = 'EUROPE' and s_nationkey = n_nationkey and n_regionkey = r_regionkey}, g(∆2) = {p_partkey = ps_partkey and s_suppkey = ps_suppkey and r_name = 'EUROPE' and s_nationkey = n_nationkey and n_regionkey = r_regionkey}, g(T) = {part, supplier, partsupp, nation, region}, g(F) = min and g(attr1) = g(attr2) = ps_supplycost. Note that the T symbol appears twice, so the pattern forces the query to have the same FROM clauses in the main query and in the subquery¹. The correlation condition is p_partkey = ps_partkey; the correlation table is part, and ps_partkey is the correlating attribute. The linking condition here is ps_supplycost = min(ps_supplycost); thus ps_supplycost is the linking attribute, '=' the linking operator and min the linking aggregate. The basic idea of our approach is to divide the work to be done into three parts: one that is common to query and subquery, one that belongs only to the subquery, and one that belongs only to the main query². The part that is common to both query and subquery can be done only once; however, as we argue in subsection 2.3, in most systems today it would be done twice. We calculate the three parts as follows: the common part is g(∆1) ∩ g(∆2); the part proper to the main query is g(∆1) − g(∆2); and the part proper to the subquery is g(∆2) − g(∆1). For query TPCH2, this yields {p_partkey = ps_partkey and s_suppkey = ps_suppkey and r_name = 'EUROPE' and s_nationkey = n_nationkey and n_regionkey = r_regionkey}, {p_size = 15 and p_type like '%BRASS'} and ∅, respectively. We use this matching to construct a program that computes the query. The process is explained in the next subsection.
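For illustration only (treating each conjunct of the WHERE clauses as a string in a Python set, which is not the paper's notation), the three-way split for TPCH2 is plain set algebra:

    delta1 = {"p_partkey = ps_partkey", "s_suppkey = ps_suppkey", "p_size = 15",
              "p_type like '%BRASS'", "r_name = 'EUROPE'",
              "s_nationkey = n_nationkey", "n_regionkey = r_regionkey"}
    delta2 = {"p_partkey = ps_partkey", "s_suppkey = ps_suppkey",
              "r_name = 'EUROPE'", "s_nationkey = n_nationkey",
              "n_regionkey = r_regionkey"}

    common        = delta1 & delta2   # evaluated once, in the base relation
    main_only     = delta1 - delta2   # {"p_size = 15", "p_type like '%BRASS'"}
    subquery_only = delta2 - delta1   # empty for TPCH2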
2.1 The For-Loop Operator
We start out with the common part, called the base relation, in order to ensure that it is not done twice. The base relation can be expressed as an SPJ query. Our strategy is to compute the rest of the query starting from this base relation. This strategy faces two difficulties. First, if we simply divide the query based 1 2
For correlated subqueries, the correlation table is counted as present in the FROM clause of the subquery. We are assuming that all relations mentioned in a query are connected; i.e. that there are no Cartesian products present, only joins. Therefore, when there is overlap between query and subquery FROM clause, we are very likely to find common conditions in both WHERE clauses (at least the joins).
on common parts, we obtain a plan where redundancy is eliminated at the price of fixing the order of some operations. In particular, some selections not in the common part would not be pushed down. Hence, it is unclear whether this strategy would provide significant improvements by itself (this situation is similar to that of [13]). Second, when starting from the base relation, we face the problem that this relation has to be used for two different purposes: it must be used to compute an aggregate after finishing up the WHERE clause in the subquery (i.e., after computing g(∆2) − g(∆1)); and it must be used to finish up the WHERE clause in the main query (i.e., to compute g(∆1) − g(∆2)) and then, using the result of the previous step, compute the final answer to the query. However, it is extremely hard in relational algebra to combine the operators involved. For instance, the computation of an aggregate must be done before the aggregate can be used in a selection condition. In order to solve this problem, we define a new operator, called the for-loop, which combines several relational operators into a new one (i.e., a macro-operator). The approach is based on the observation that some basic operations appear frequently together and could be implemented more efficiently as a whole. In our particular case, we show in the next subsection that there is an efficient implementation of the for-loop operator which allows it, in some cases, to compute several basic operators with one pass over the data, thus saving considerable disk I/O.
Definition 1. Let R be a relation, sch(R) the schema of R, L ⊆ sch(R), A ∈ sch(R), F an aggregate function, α a condition on R (i.e., involving only attributes of sch(R)) and β a condition on sch(R) ∪ {F(A)} (i.e., involving attributes of sch(R) and possibly F(A)). Then the for-loop operator is defined as either one of the following:
1. FL_{L,F(A),α,β}(R). The meaning of the operator is defined as follows: let Temp be the relation GB_{L,F(A)}(σ_α(R)) (GB is used to indicate a group-by operation). Then the for-loop yields the relation σ_β(R ⋈_{R.L=Temp.L} Temp), where the condition of the join is understood as the pairwise equality of each attribute in L. This is called a grouped for-loop.
2. FL_{F(A),α,β}(R). The meaning of the operator is given by σ_β(AGG_{F(A)}(σ_α(R)) × R), where AGG_{F(A)}(R) indicates the aggregate F computed over all A values of R. This is called a flat for-loop.
Note that β may contain aggregated attributes as part of a condition; in fact, in the typical use in our approach, it does contain an aggregation. The main use of a for-loop is to calculate the linking condition of a query with an aggregated subquery on the fly, possibly together with additional selections. Thus, for instance, for query TPCH2 the for-loop would take the grouped form FL_{p_partkey, min(ps_supplycost), ∅, p_size=15 ∧ p_type LIKE '%BRASS' ∧ ps_supplycost=min(ps_supplycost)}(R), where R is the relation obtained by computing the base relation³. The for-loop is equivalent to the relational expression σ_{p_size=15 ∧ p_type LIKE '%BRASS' ∧ ps_supplycost=min(ps_supplycost)}(AGG_{min(ps_supplycost)}(R) × R).
Again, note that the base relation contains the correlation as a join.
It can be seen that this expression computes the original SQL query: the aggregation computes the aggregate function of the subquery (the conditions in the WHERE clause of the subquery have already been applied in R, since in this case ∆2 ⊆ ∆1 and hence ∆2 − ∆1 = ∅), and the Cartesian product puts a copy of this aggregate on each tuple, allowing the linking condition to be stated as a regular condition over the resulting relation. Note that this expression may not be better, from a cost point of view, than other plans produced by standard optimization. What makes this plan attractive is that the for-loop operator can be implemented in such a way that it computes its output with one pass over the data. In particular, the implementation does not carry out any Cartesian product, which is used only to explain the semantics of the operator. The operator is written as an iterator that loops over the input implementing a simple program (hence the name). The basic idea is simple: in some cases, computing an aggregation and using the aggregate result in a selection can be done at the same time. This is due to the behavior of some aggregates and the semantics of the conditions involved. Assume, for instance, that we have a comparison of the type attr = min(attr2), where both attr and attr2 are attributes of some table R. In this case, as we compute the minimum over a series of values, we can actually decide, as we iterate over R, whether a tuple can ever make the condition true. This is due to the fact that min is monotonically non-increasing, i.e., as we iterate over R and carry a current minimum, this value will always stay the same or decrease, never increase. Since equality imposes a very strict constraint, we can take a decision on the current tuple t based on the values of t.attr and the current minimum, as follows: if t.attr is greater than the current minimum, we can safely get rid of it. If t.attr is equal to the current minimum, we should keep it, at least for now, in a temporary result temp1. If t.attr is less than the current minimum, we should keep it, in case our current minimum changes, in a temporary result temp2. Whenever the current minimum changes, we know that temp1 should be deleted, i.e., the tuples there cannot be part of a solution. On the other hand, temp2 should be filtered: some tuples there may be thrown away, some may go into a new temp1, and some may remain in temp2. At the end of the iteration, the set temp1 gives us the correct solution. Of course, as we go over the tuples in R we may keep some tuples that we need to get rid of later on; but the important point is that we never have to go back and recover a tuple that we dismissed, thanks to the monotonic behavior of min. This behavior generalizes to max, sum, and count, since they are all monotonically non-decreasing (for sum, it is assumed that all values in the domain are positive numbers); however, average is not monotonic in either direction. For this reason, our approach does not apply to average. For the other aggregates, though, we argue that we can successfully take decisions on the fly without having to recover discarded tuples later on.
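A sketch of this one-pass strategy for the equality/min case, with our own function and variable names (rest_of_beta stands for the remaining per-tuple conditions of β and is an assumption for illustration):

    def flat_forloop_min_eq(rows, attr, attr2, rest_of_beta=lambda t: True):
        """One-pass evaluation of sigma_{attr = min(attr2) and rest_of_beta}(R)."""
        current_min = None
        temp1, temp2 = [], []       # attr == current_min / attr < current_min
        for t in rows:
            v = t[attr2]
            if current_min is None or v < current_min:
                current_min = v
                temp1 = []                          # old "equal" tuples can never qualify now
                still_smaller = []
                for s in temp2:                     # re-filter the "smaller" tuples
                    if s[attr] == current_min:
                        temp1.append(s)
                    elif s[attr] < current_min:
                        still_smaller.append(s)
                temp2 = still_smaller
            if not rest_of_beta(t):
                continue
            if t[attr] == current_min:
                temp1.append(t)
            elif t[attr] < current_min:
                temp2.append(t)
            # if t[attr] > current_min, t can never equal the final minimum: discard it
        return temp1

    rows = [{"cost": 10, "min_col": 12}, {"cost": 7, "min_col": 7}, {"cost": 9, "min_col": 11}]
    print(flat_forloop_min_eq(rows, "cost", "min_col"))   # [{'cost': 7, 'min_col': 7}]

For query TPCH2 both columns would be ps_supplycost (per p_partkey group in the grouped variant), so the condition degenerates to selecting the tuples that achieve the minimum.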
2.2
Query Transformation
The general strategy to produce a query plan with for-loops for a given SQL query q is as follows: we classify q into one of two categories, according to q's structure. For each category, a pattern p is given. As before, if q fits into p there is a mapping g between constants in q and variables in p. Associated with each pattern there is a for-loop program template t. A template is different from a program in that it has variables and options. Using the information in the mapping g (including the particular linking aggregate and linking condition in q), a concrete for-loop program is generated from t. The process to produce a query tree containing a for-loop operator is then simple: our patterns allow us to identify the part common to query and subquery (i.e. the base relation), which is used to start the query tree. Standard relational optimization techniques can be applied to this part. Then a for-loop operator which takes the base relation as input is added to the query tree, and its parameters are determined. We describe each step separately.

We distinguish between two types of queries: type A queries, in which the subquery is not correlated (this corresponds to type J in [8]); and type B queries, where the subquery is correlated (this corresponds to type JA in [8]). Queries of type A are interesting in that usual optimization techniques cannot do anything to improve them (obviously, unnesting does not apply to them). Thus, our approach, whenever applicable, offers a chance to create an improved query plan. In contrast, queries of type B have been dealt with extensively in the literature ([8, 3, 6, 11, 17, 16, 15]). As we will see, our approach is closely related to other unnesting techniques, but it is the only one that considers redundancy between query and subquery and its optimization.

The general pattern a type A query must fit is given below:

SELECT L
FROM T
WHERE ∆1 and attr1 θ (SELECT F(attr2) FROM T WHERE ∆2)
{GROUP BY L2}

The braces around the GROUP BY clause indicate that the clause is optional (obviously, SQL syntax requires that L2 ⊆ L, where L and L2 are lists of attributes; in the following, we assume that queries are well formed). We create a query plan for this query in two steps:
1. A base relation is defined by applying the selection g(∆1) ∩ g(∆2) to g(T). Note that this is an SPJ query, which can be optimized by standard techniques.
2. We apply a for-loop operator defined by FL_{g(F(attr2)), g(∆2) − g(∆1), g(∆1) − g(∆2) ∧ g(attr3 θ F2(attr4))}.

It can be seen that this query plan computes the correct result for this query by using the definition of the for-loop operator. Here, the aggregate is F(attr2),
α is g(∆2 − ∆1) and β is g(∆1) − g(∆2) ∧ g(attr θ F(attr2)). Thus, this plan will first apply ∆1 ∩ ∆2 to T, in order to generate the base relation. Then the for-loop will compute the aggregate F(attr2) on the result of selecting g(∆2 − ∆1) on the base relation. Note that (∆2 − ∆1) ∪ (∆1 ∩ ∆2) = ∆2, and hence the aggregate is computed over the conditions in the subquery only, as it should be. The result of this aggregate is then “appended” to every tuple in the base relation by the Cartesian product (again, note that this description is purely conceptual). After that, the selection on g(∆1) − g(∆2) ∧ g(attr3 θ F2(attr4)) is applied. Here we have that (∆1 − ∆2) ∪ (∆1 ∩ ∆2) = ∆1, and hence we are applying all the conditions in the main clause. We are also applying the linking condition attr3 θ F(attr2), which can now be considered a regular condition because F(attr2) is present in every tuple. Thus, the for-loop operator computes the query correctly. This for-loop operator will be implemented by a program that carries out all needed operators with one scan of the input relation. Clearly, the concrete program is going to depend on the linking operator (θ, assumed to be one of {=, <=, >=, <, >}) and the aggregate function (F, assumed to be one of min, max, sum, count, avg).

The general pattern for type B queries is given next:

SELECT L
FROM T1
WHERE ∆1 and attr1 θ (SELECT F1(attr2) FROM T2 WHERE ∆2 and S.attr3 θ R.attr4)
{GROUP BY L2}

where R ∈ T1 − T2, S ∈ T2, and we are assuming that T1 − {R} = T2 − {S} (i.e. the FROM clauses contain the same relations except the one introducing the correlated attribute, called R, and the one introducing the correlation attribute, called S). We call T = T1 − {R}. As before, a GROUP BY clause is optional. In our approach, we consider the table containing the correlated attribute as part of the FROM clause of the subquery too (i.e. we effectively decorrelate the subquery). Thus, the outer join is always part of our common part. Our plan has two steps:
1. Compute the base relation, given by applying g(∆1 ∩ ∆2) to T ∪ {R, S}; this includes the outer join of R and S.
2. Compute a grouped for-loop defined by FL_{attr6, F(attr2), ∆2 − ∆1, ∆1 − ∆2 ∧ attr1 θ F(attr2)}, which computes the rest of the query.

Our plan has two main differences from traditional unnesting: the parts common to query and subquery are computed only once, at the beginning of the plan, and computing the aggregate, the linking predicate, and possibly some selections is carried out by the for-loop operator in one step. Thus, we potentially deal with larger temporary results, as some selections (those not in ∆1 ∩ ∆2) are not pushed down, but may be able to effect several computations at once
Fig. 1. Standard query plan (joins of Part, PartSupp, Supplier, Nation and Region with the selections name="Europe", size=15 and type LIKE %BRASS pushed down, a group-by on ps_partkey with min(ps_supplycost), and a final selection ps_supplycost = min(ps_supplycost))
(and do not repeat any computation). Clearly, which plan is better depends on the amount of redundancy between query and subquery, the linking condition (which determines how efficient the for-loop operator is), and traditional optimization parameters, like the size of the input relations and the selectivity of the different conditions.

2.3
Example and Analytical Comparison
We apply our approach to query TPCH2; this is a typical type B query. For our experiment we created a TPC-H benchmark database of the smallest size (1 GB) using two leading commercial DBMS. We created indices on all primary and foreign keys, updated system statistics, and captured the query plan for query 2 on each system. Both query plans were very similar, and they are represented by the query tree in Figure 1. Note that the query is unnested based on Kim's approach (i.e. first group and then join). Note also that all selections are pushed all the way down; they were executed by pipelining with the joins. The main differences between the two systems were the choices of implementations for the joins and different join orderings (to make sure that the particular linking condition was not an issue, the query was changed to use different linking aggregates and linking operators; the query plan remained the same, except that for operators other than equality Dayal's approach was used instead of Kim's. Also, memory size was varied from a minimum of 64 MB to a maximum of 512 MB, to determine if memory size was an issue; again, the query plan remained the same through all memory sizes). For our concern, the main observation about this query plan is that operations in query and subquery are repeated, even though there clearly is a large amount of repetition (we have disregarded the final Sort needed to complete the query, as this would be necessary in any approach, including ours). We created a query plan for this query, based on our approach (shown in Figure 2). Note that our approach does not dictate how the base relation is optimized; the particular plan shown uses the same tree as the original query tree to facilitate comparisons. It is easy to see that
Fig. 2. For-loop query plan (base relation: joins of Part, PartSupp, Supplier, Nation and Region with the selection name="Europe", followed by FL_{p_partkey, min(ps_supplycost), ∅, p_size=15 ∧ p_type LIKE %BRASS ∧ ps_supplycost=min(ps_supplycost)})
our approach avoids any duplication of work. However, this comes at the cost of fixing the order of some operations (i.e. operations in ∆1 ∩ ∆2 must be done before other operations). In particular, some selections get pushed up because they do not belong to the common part, which increases the size of the relation created as input for the for-loop. Here, TPCH2 returns 460 rows, while the intermediate relation that the for-loop takes as input has 158,960 tuples. Thus, executing the for-loop may cost more than other operations because of its larger input. However, grouping and aggregating took both systems about 10% of the total time (this and all other data about time come from measuring the performance of appropriate SQL queries executed against the TPC-H database on both systems; details are left out for lack of space). Another observation is that the duplicated operations do not take double the time, because of cache usage. But this can be attributed to the excellent main memory/database size ratio in our setup; with a more realistic setup this effect is likely to be diminished. Nevertheless, our approach avoids duplicated computation and does result in some time improvement (it takes about 70% of the time of the standard approach). In any case, it is clear that a plan using the for-loop is not guaranteed to be superior to traditional plans under all circumstances. Thus, it is very important to note that we assume a cost-based optimizer which will generate a for-loop plan if at least some amount of redundancy is detected, and will compare the for-loop plan to others based on cost.
3
Conclusions and Further Research
We have argued that decision-support SQL queries tend to contain redundancy between query and subquery, and that this redundancy is not detected and optimized by relational processors. We have introduced a new optimization mechanism to deal with this redundancy, the for-loop operator, and an implementation for it, the for-loop program. We developed a transformation process that takes us from SQL queries to for-loop programs. A comparative analysis with standard relational optimization was shown. The for-loop approach promises a more efficient implementation for queries falling into the patterns given. For simplicity, and for lack of space, the approach is introduced here applied to a very restricted
class of queries. However, we have already worked out extensions to widen its scope (mainly, the approach can work with overlapping (not just identical) FROM clauses in query and subquery, and with different classes of linking conditions). We are currently developing a precise cost model, in order to compare the approach with traditional query optimization using different degrees of overlap, different linking conditions, and different data distributions as parameters. We are also working on extending the approach to several levels of nesting, and studying its applicability to OQL.
References [1] Badia, A. and Niehues, M. Optimization of Sequences of Relational Queries in Decision-Support Environments, in Proceedings of DAWAK’99, LNCS n. 1676, Springer-Verlag. [2] Cao, Bin and Badia, A. Subquery Rewriting for Optimization of SQL Queries, submitted for publication. 402 [3] Dayal, U. Of Nests and Trees: A Unified Approach to Processing Queries That Contain Nested Subqueries, Aggregates, and Quantifiers, in Proceedings of the VLDB Conference, 1987. 402, 406 [4] Fegaras, L. and Maier, D. Optimizing Queries Using an Effective Calculus, ACM TODS, vol. 25, n. 4, 2000. [5] Freytag, J. and Goodman, N. On the Translation of Relational Queries into Iterative Programs, ACM Transactions on Database Systems, vol. 14, no. 1, March 1989. [6] Ganski, R. and Wong, H. Optimization of Nested SQL Queries Revisited, in Proceedings of the ACM SIGMOD Conference, 1987. 402, 406 [7] Goel, P. and Iyer, B. SQL Query Optimization: Reordering for a General Class of Queries, in Proceedings of the 1996 ACM SIGMOD Conference. 402 [8] Kim, W. On Optimizing an SQL-Like Nested Query, ACM Transactions On Database Systems, vol. 7, n.3, September 1982. 401, 402, 406 [9] Lieuwen, D. and DeWitt, D. A Transformation-Based Approach to Optimizing Loops in database Programming Languages, in Proceedings of the ACM SIGMOD Conference, 1992. [10] Lu, H., Chan, H. C. and Wei, K. K. A Survey on Usage of SQL, SIGMOD Record, 1993. 402 [11] Muralikrishna, M. Improving Unnesting Algorithms for Join Aggregate Queries in SQL, in Proceedings of the VLDB Conference, 1992. 402, 406 [12] Park, J. and Segev, A. Using common subexpressions to optimize multiple queries, in Proceedings of the 1988 IEEE CS ICDE. [13] Ross, K. and Rao, J. Reusing Invariants: A New Strategy for Correlated Queries, in Proceedings of the ACM SIGMOD Conference, 1998. 404 [14] Jun Rao, Bruce Lindsay, Guy Lohman, Hamid Pirahesh and David Simmen, Using EELs, a Practical Approach to Outerjoin and Antijoin Reordering, in Proceedings of ICDE 2001. [15] Praveen Seshadri, Hamid Pirahesh, T. Y. Cliff Leung Complex Query Decorrelation, in Proceedings of ICDE 1996. 402, 406 [16] Praveen Seshadri, Joseph M. Hellerstein, Hamid Pirahesh, T. Y. Cliff Leung, Raghu Ramakrishnan, Divesh Srivastava, Peter J. Stuckey, and S. Sudarshan Cost-Based Optimization for Magic: Algebra and Implementation, in Proceedings of the SIGMOD Conference, 1996. 406
[17] Inderpal Singh Mumick and Hamid Pirahesh Implementation of Magic-sets in a Relational Database System, in Proceedings of the SIGMOD Conference 1994. 406 [18] TPC-H Benchmark, TPC Council, http://www.tpc.org/home.page.html. 401, 402
“On-the-fly” VS Materialized Sampling and Heuristics

Pedro Furtado
Centro de Informática e Sistemas (DEI-CISUC), Universidade de Coimbra
[email protected], http://www.dei.uc.pt/~pnf
Abstract. Aggregation queries can take hours to return answers in large data warehouses (DW). A user interested in exploring data in several iterative steps using decision support or data mining tools may be frustrated by such long response times. The ability to return fast approximate answers accurately and efficiently is important to these applications. Samples for use in query answering can be obtained “on-the-fly” (OS) or from a materialized summary of samples (MS). While MS are typically faster than OS summaries, they have the limitation that sampling rates are predefined upon construction. This paper analyzes the use of OS versus MS for approximate answering of aggregation queries and proposes a sampling heuristic that chooses the appropriate sampling rate to provide answers as fast as possible while simultaneously guaranteeing accuracy targets. The experimental section compares OS to MS, analyzing response time and accuracy (TPC-H benchmark), and shows the heuristic strategy in action.
1
Introduction
Applications that analyze data in today's large organizations typically access very large volumes of data, pushing the limits of traditional database management systems in performance and scalability. Sampling summaries return fast approximate answers to aggregation queries, can easily be implemented in a DBMS with no or only minor changes, and make use of the query processing and optimization strategies and structures of the DBMS. Materialized sampling (MS), such as AQUA [6], implies that summaries are constructed in one phase and used subsequently. Although these summaries can be very fast, they have an important limitation: the summary size must be defined at summary construction time. The statistical answer estimation strategy used by sampling means that, while a very detailed query pattern can only be answered accurately with a large number of samples, more aggregated patterns can be answered with very small, extremely fast summaries. Therefore, it is useful to be able to choose a specific sampling rate for a specific query. Sampling can also be achieved using a common SAMPLE operator that extracts a percentage of rows from a table randomly using, for instance, a sequential one-pass
strategy [10] over a table directory or index. This operator exists typically for the collection of statistics over schema objects for cost-based-optimization (e.g. Oracle 9i SAMPLE operator). It is based on specifying the desired sampling rate (e.g. SAMPLE 1%), scanning only a subset of the table blocks and extracting samples from those blocks. A faster but less uniform sampling alternative uses all the tuples from each scanned block as samples (e.g. SAMPLE BLOCK 1%). Materialized Sampling (MS) has an important advantage over “on-the-fly” (OS) and “online aggregation” (OA) [9] in that while OS and OA retrieve random samples, requiring non-sequential I/O, MS can use faster sequential scans over the materialized samples. In this paper we analyze the use of the SAMPLE operator for OS approximate answering of aggregation queries and compare with MS, revealing the advantages and shortcomings of OS. We also propose sampling heuristics to choose the appropriate sampling rate to provide answers as fast as possible while guaranteeing accuracy targets simultaneously. The paper is organized as follows: section 2 discusses related work. Section 3 discusses summarizing approaches and heuristics issues. Section 4 presents experimental analysis and comparison using the TPC-H decision support benchmark. Section 5 contains concluding remarks.
2
Related Work
There are several recent works in approximate query answering strategies, which include [9, 8, 2, 1]. There has also been considerable amount of work in developing statistical techniques for data reduction in large data warehouses, as can be seen in the survey [3]. Summaries reduce immensely the amount of data that must be processed. Materialized views (MVs) can also achieve this by pre-computing quantities, and they are quite useful for instance to obtain pre-defined reports. However, while summaries work well in any ad-hoc environment, MVs have a more limited, pre-defined scope. The Approximate Query Answering (AQUA) system [6, 7] provides approximate answers using small, pre-computed synopsis of the underlying base data. The system provides probabilistic error/confidence bounds on the answer [2, 8]. [9] proposed a technique for online aggregation, in which the base data is scanned in random order at query time and the approximate answer is continuously updated as the scan proceeds, until all tuples are processed or the user is satisfied with the answer. A graphical display depicts the answer and a confidence interval as the scan proceeds, so that the user may stop the process at any time. In order to achieve the random order scanning, there must be an index on the base tuples ordering by the grouping columns (typically a large index), and a specific functionality that scans this index iteratively (a possibly enormous number of runs) end-to-end in order to retrieve individual tuples from each group in each run. The authors claim that, with appropriate buffering, index striding is at least as fast as scanning a relation via an unclustered index, with each tuple of the relation being fetched only once, although each fetch requires a random I/O which is typically much slower than a full table scan with sequential I/O.
avg = avg(samples) ± 1.65 × σ(l_quantity)/√(count(*))
count = count(samples)/SP ± 1.65 × √(count(*))/SP
sum = sum(samples)/SP
max = max(samples)
min = min(samples)

SELECT brand, yearmonth, avg(l_quantity), sum(l_quantity)/SP, count(*)/SP
FROM lineitem, part
WHERE l_partkey = p_partkey
GROUP BY yearmonth, brand SAMPLE SP;

Fig. 1. Estimation and Query Rewriting
On-the-fly sampling (OS) also retrieves samples online, but uses a common SAMPLE operator and no specific structures or functionality over such structures. For this operator to be used in approximate answering, the appropriate sampling rate must be used depending on the aggregation pattern and it should deliver estimations and accuracy measures. The insufficiency of samples in summaries is an important issue in the determination of sampling rates and our previous work includes a strategy for appropriate sampling rate choice [5]. This problem has also been the main driver of proposals on improving the representation capability of summaries based on query workloads [2, 1, 4].
3
Summarizing Approaches and Heuristics
In this section we describe the structure and procedure for “on-the-fly” (OS) and materialized (MS) sampling, comparing the approaches, and develop the heuristic strategy used to choose a convenient sampling rate.

3.1
Sampling Rate and Accuracy/Speed (A/S) Limitations
Summary approaches use a software middle layer that analyzes each query, rewrites it to execute against a sampled data set and returns a very fast estimation. The sampling strategy itself is based either on pre-computed materialized samples (MS) or on “on-the-fly” sampling (OS) from the DW. The pre-computed MS summary is put into a structure similar to the DW (typically, a set of star schemas), but facts are replaced with samples taken at a given sampling rate (or sampling percentage, SP) (e.g. a summary can have SP = 1% of the original data) and dimensions contain the subset of rows that are referenced by the fact samples. “On-the-fly” sampling, on the other hand, is obtained by specifying a sampling rate “SAMPLE SP%” at the end of the query, which is then submitted against the DW. The estimation procedure is based on producing an individual estimation and error bound for each group of a typical group aggregation query. Figure 1 shows the formulas used and a rewritten query to estimate from samples and provide confidence intervals. Intuitively, or from the analysis of the query rewrites, it is possible to conclude that the summaries do not involve any complex computations and hold the promise of extremely fast response times against a much smaller data set than the original DW.
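As a rough illustration of this estimation step (ours, following the formulas reproduced in Fig. 1, with z = 1.65; the group keys and sampling rate are hypothetical), the per-group estimates and confidence bounds can be computed from the retrieved samples as follows:

import math

def estimate_group(sample_values, sp, z=1.65):
    """sample_values: measure values of one group's samples; sp: sampling rate in (0, 1]."""
    n = len(sample_values)
    mean = sum(sample_values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in sample_values) / n)
    return {
        "avg":   (mean, z * sigma / math.sqrt(n)),      # estimate, CI half-width
        "count": (n / sp, z * math.sqrt(n) / sp),
        "sum":   (sum(sample_values) / sp, None),        # bound not given in Fig. 1
        "max":   (max(sample_values), None),
        "min":   (min(sample_values), None),
    }

# e.g. estimate_group(l_quantity_samples_for_one_brand_month, sp=0.01)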
Fig. 2. Materialized Summary Construction
For either OS or MS with a set of summary sizes, an appropriate sampling rate should be determined. The sampling rate should be as small as possible for fast response times. However, it is crucial to have enough samples in order to return accurate estimations. Going back to the rewritten query of figure 1, the estimation procedure is applied individually within each group aggregated in the query. A certain sampling rate can estimate the sales of some brands by year but fail completely to estimate by month or week. Additionally, groups can have different sizes (and data distribution as well), so that some may lack samples. The heuristics strategy proposes solutions not only to determine the best SP but also to deal with these issues.

3.2
Structure and Comparison of Materialized and “On-the-fly” Sampling
MS: Figure 2 shows the Materialized samples (MS) construction strategy. MS can be obtained easily by setting up a schema similar to the base schema and then sampling the base fact table(s) into the MS fact table(s). Dimensions are then populated from the base dimensions by importing the rows that are referenced by the MS fact(s), resulting in facts with SP% of the base fact and dimensions typically much smaller than the base dimensions as well. The query that would be submitted against the base DW is rewritten by replacing the fact and dimension table names by the corresponding summary fact and dimensions, proceeding then with expression substitution (query rewriting). OS: Figure 3 shows the basic strategy used to obtain “on-the-fly” samples (OS) to answer the query. The fact table source is replaced by a sub-expression selecting samples from that fact source. Query expressions are also substituted exactly as with MS but, unlike MS, the base dimensions are maintained in the query. The query processor samples the fact table by selecting tuples randomly. In order to be more efficient, this sampling should be done over a row directory or index structure to avoid scanning the whole table.
Fig. 3. Answering Queries with On-the-fly Sampling
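As a toy sketch of this middle-layer rewriting (ours, not the paper's implementation; the table names, measure names and the exact SAMPLE syntax are only illustrative and vary by DBMS), the MS case swaps base tables for their summary counterparts, while the OS case samples the fact source on the fly; in both cases extensive aggregates are scaled by 1/SP, as in Fig. 1.

def rewrite_for_ms(sql, table_map):
    # MS: point the query at the materialized summary schema, e.g.
    # {"lineitem": "ms_lineitem", "part": "ms_part"}; naive textual substitution.
    # The aggregate scaling of rewrite_for_os applies here as well.
    for base, summary in table_map.items():
        sql = sql.replace(base, summary)
    return sql

def rewrite_for_os(sql, fact_table, sp):
    # OS: sample only the fact table, keep the base dimensions, and scale
    # SUM/COUNT estimates by 1/SP (AVG, MIN and MAX need no scaling).
    sql = sql.replace(fact_table, f"{fact_table} SAMPLE({sp * 100})")
    sql = sql.replace("sum(l_quantity)", f"sum(l_quantity)/{sp}")
    sql = sql.replace("count(*)", f"count(*)/{sp}")
    return sql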
From the previous description, it is easy to see why materialized summaries (MS) are typically faster than “on-the-fly” sampling (OS) with the same sampling rate. In MS the summary facts are available for fast sequential scanning and the dimensions are smaller than the original data warehouse dimensions, while OS must retrieve samples using non-sequential I/O and join them with complete base dimensions. The exact speedup difference between MS and OS depends on a set of factors related to the schema and size of facts and dimensions, but the difference is frequently large, as we show in the experimental section. How can we reduce the response time disadvantage of OS? It is advantageous to reduce I/O by sampling blocks rather than individual rows, but then samples will not be completely random. The overhead of joining the sample rows with complete (large) dimensions in OS, instead of joining with the subset of dimension rows corresponding to the summary facts (MS), is more relevant in many situations. The only straightforward way to reduce this problem would be to materialize a reasonably small summary (MS) and then sample that summary “on-the-fly” (OS) for a smaller sampling rate.

3.3
Sampling Rate Decision
The objective of the sampling heuristic (SH) is simple: to find the most appropriate sampling rate (SPQ) for a query Q. This is typically the fastest (therefore smallest) summary that is still capable of answering within a desired accuracy target. If OS is being used, the heuristic then uses SPQ to sample the base data; otherwise (MS) it chooses the (larger) summary size closest to SPQ. The accuracy target can be defined by parameters (CI%, FG%). The CI% value is a confidence interval target CI% = CIestimation/estimation (e.g. the error should be within 10% of the estimated value). The fraction of groups that must be within CI% (FG%) is important to enable sampling even when a few groups are too badly represented (have too few samples). Without this parameter (or equivalently when FG% = 100%), the accuracy target would have to be met by all groups in the response set, including the smallest one, which can result in large sampling rates. For instance, (CI% = 10%, FG% = 90%) means that at least 90% of groups are expected to answer within the 10% CI% target. Minimum and maximum sampling rates can be useful to enclose the range of possible choices for SPQ (e.g. SPmin = 0.05%, SPmax = 30%). The sampling rate SPmax is a value above which it is not very advantageous to sample, and the minimum is a very fast summary. In practice, the sampling rate SPmax would depend on a specific query and should be modified accordingly. For instance, a full table scan on a base fact is as fast as a sampling rate that requires every block of the table to be read (SP = 1/average number of tuples per block). However, queries with heavy joining can still improve execution time immensely with that sampling rate. Given a query Q, the heuristic must find the suitable sampling rate SPQ in the spectrum of Figure 4, based on the accuracy targets CI% and FG%. If SPQ is below SPmin, it is replaced by SPmin, which provides additional accuracy without a large response time penalty. Otherwise, SPQ is used unless SPQ > SPmax, in which case it is better to go to the DW directly.
Fig. 4. SP ranges and query processing choices
3.4
Determining SPQ from Accuracy Targets
If we know the minimum number of samples needed to estimate within a given accuracy target (nmin) and we also know the number of values in the most demanding aggregation group (ng), then a sampling rate of SPQ = nmin/ng should be enough. For instance, if nmin = 45 and the number of elements of that group is 4500, then SPQ ≥ 1%. Instead of the most demanding group, we can determine ng as the FG% percentile of the distribution of the number of elements. For instance, for FG = 75%, ng is a number such that 75% of the aggregation groups have at least ng samples. Then SPQ = nmin/ng should be able to estimate at least 75% of the group results within the accuracy target CI%. We call ng the characteristic number of elements, as it is a statistical measure of the number of elements in groups. Next we show how ng and nmin are determined. There are three alternatives for the determination of SPQ or ng:
• Manual: the user can specify SPQ manually;
• Selectivity: ng can be estimated using count statistics;
• Trial: a trial SPQ (SPtry) can be used and, if the result is not sufficiently accurate, another SPQ is estimated from statistics on ng collected during the execution of SPtry.
The determination of ng based on statistical data is a selectivity estimation problem, with the particularity that what must be estimated is the number of elements within each aggregation group. Selectivity estimation is a recurring theme in RDBMS, with many alternative strategies. We propose that statistics be collected when queries are executed so that they become available later on. Count statistics are collected and put into a structure identifying the aggregation and containing percentiles of the cumulative distribution of the number of elements (e.g. in the following example 75% of the groups have at least 6000 elements):

brand/month: 10% = 17000, 25% = 9400, 50% = 6700, 75% = 6000, 90% = 1750, max = 20000, min = 1000, stdev = 710, SPS = 100%

These statistics are useful to determine the ng value that should be used based on the minimum fraction of groups (FG%) that are expected to return confidence intervals below CI%. For instance, in the above example, supposing that nmin = 45, if FG% = 75% then SPQ = 45/6000 = 0.75%, whereas if FG% = 100%, SPQ = 45/1000 = 4.5%. The system should be able to collect this information when the query is executed against a summary, as this is the most frequent situation. In that case, if the sampling rate used to query was SPS, this value should be stored together with the statistics, to be used in inferring the probable number of elements (using 1/SP × each value). If a query has not been executed before and it is impossible to estimate the group selectivity, the strategy uses a trial approach. The query is executed against a reasonably
small and fast sampling rate SPtry in a first step (SPtry should be predefined). Either way, response statistics are collected on the number of elements (ngtry) for later queries. If the answer from this first try is not sufficiently accurate, a second try uses SP2 = nmin/(ngtry/SPtry), or this value multiplied by a factor for additional guarantees (e.g. 1.1 × nmin/(ngtry/SPtry)). Iterations go on until the accuracy targets are met. If SPtry is too small for a query pattern, the system simply reiterates using the inferral process until the CI% accuracy target is guaranteed. The other parameter that must be determined is nmin. This value is obtained from the confidence interval formulas by solving for the number of samples and considering the relative confidence interval ci. For instance, for the average and count functions:

nmin(avg) = (zp/ci)² · (σ/µ)²
nmin(count) = (zp/ci)²
The unknown expression (σ/µ) is typically replaced by 50% in statistics works, for estimation purposes. The minimum number of samples varies between different aggregation functions. If an expression contains all the previous aggregation functions, nmin = min[nmin(AVG), nmin(SUM), nmin(COUNT)].
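The sampling-rate choice of Sections 3.3 and 3.4 can be condensed into a short sketch (ours; the parameter values are the examples used in the text, and the percentile structure is the hypothetical brand/month statistics above):

def n_min(ci, kind="avg", z=1.65, sigma_over_mu=0.5):
    # Minimum samples per group for a relative CI target ci (e.g. ci = 0.10),
    # following nmin(avg) = (z/ci)^2 (sigma/mu)^2 and nmin(count) = (z/ci)^2.
    base = (z / ci) ** 2
    return base if kind == "count" else base * sigma_over_mu ** 2

def choose_sp(ng_percentiles, fg, ci, sp_min=0.0005, sp_max=0.30):
    # ng_percentiles maps a fraction of groups to the characteristic number of
    # elements, e.g. {0.75: 6000, 1.00: 1000}; fg is the required FG% (e.g. 0.90).
    ng = ng_percentiles[fg]
    sp = n_min(ci) / ng                  # SPQ = nmin / ng
    if sp < sp_min:
        return sp_min                    # extra accuracy at negligible cost
    if sp > sp_max:
        return None                      # signal: better to go to the DW directly
    return sp

def retry_sp(ng_try, sp_try, ci, safety=1.1):
    # Trial approach, second iteration: SP2 = nmin / (ngtry / SPtry),
    # with a safety factor for additional guarantees.
    return safety * n_min(ci) / (ng_try / sp_try)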
4
Experimental Analysis and Comparison
This section analyses experimental results on an Intel Pentium III 800 MHz CPU with 256 MB of RAM, running the Oracle 9i DBMS and the TPC-H benchmark with a scale factor (SF) of 5 (5 GB). The Oracle SAMPLE operator was used directly. We have used template aggregation queries over TPC-H (Figure 5), with different time granularities. Query Qa (above) involves joining only two tables, whereas query Qb (below) engages in extensive joining, including the very large ORDERS table. The base fact table that was sampled in OS was LINEITEM.

4.1
Response Time Results
Our objective in this section is to evaluate the response time improvement using OS and MS summaries, in order to have a comparison and measure of the effectiveness of these approaches. The response time is very dependent on several factors, including the query plan or the amount of memory available for sorting or for hash joining. We ran experiments with exactly the same conditions and repeatedly. We have optimized the execution plan for queries Qa (19 min, 20 mins) and Qb (47 mins, 52 mins) for monthly and yearky aggregations respectively. SELECT p_brand, year_month, avg(l_quantity), sum(l_quantity), count(*) FROM lineitem, part WHERE l_partkey=p_partkey GROUP BY to_char(l_shipdate,'yyyy-mm'), p_brand; SELECT n_name, year_month, avg(l_extendedprice), sum(l_extendedprice), count(*) FROM lineitem, customer, orders, supplier, nation, region WHERE <join conditions> GROUP BY n_name, to_char(l_shipdate,’yyyy-mm’); Fig. 5. Qeries Qa (left) and Qb (right) used in the experiments
Fig. 6. % of Resp. Time VS % Sampling Rate for Qa and Qb Using OS (response time as a percentage of the base-data response time versus sampling rate, for Qa and Qb with monthly and yearly aggregation; the right-hand plot is a 0-1% detail)
Figure 6 displays the query response time using OS as a percentage of the response time over the base data (Y axis) for each sampling rate (X axis). Linear speedup (1/SP) is indicated in the picture as a solid line. The right picture is a 0-1% detail. The most important observation from the figure is that the speedup is typically much less than linear: for instance, a summary with SP = 1% takes about 12% of the DW response time. Two further comments: the speedup-to-sampling-rate ratio improves as the sampling rate increases, and query Qb (with heavy joining) exhibited a worse ratio than query Qa for SP below 1% (detail) and a better ratio for larger SP. Figure 7 compares on-the-fly summaries (OS) to materialized summaries (MS) in the same setup as the previous experiment (the right picture is a detail for SP < 1%).

Fig. 7. Comparing OS to MS Resp. Times vs Sampling Rates for query Qa (monthly and yearly aggregation; the right-hand plot is a detail for SP < 1%)
As we would expect, MS summaries are extremely fast, obtaining an almost linear speedup (1/SP). This means, for instance, that an MS summary with 0.5% of the base fact data takes about 0.5% of the response time the base data would take. An OS summary with the same SP (0.5%) took almost 10% of the base data response time!

4.2
Applying the Heuristics
In this section we use the experimental data sets to test the sampling rate heuristics. We submitted Qa and Qb with monthly and yearly aggregations. The accuracy target CI% was set at 10% (LCT). As a result, nmin(AVG)=68 and nmin(COUNT)=271. Figure 8 shows, using the nmin/ng expression, the minimum SPQ necessary for queries to be answered within (CI%,FG%).
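For concreteness (our arithmetic, using z_p = 1.65 and σ/µ = 50% as in Section 3.4): nmin(AVG) = (1.65/0.10)² · (0.5)² ≈ 68 and nmin(COUNT) = (1.65/0.10)² ≈ 272, matching the values of 68 and 271 used above up to rounding.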
Fig. 8. Count Statistics for Qa and Qb, Year and Month: (a) Qa(month), (b) Qb(month), (c) Qa(year), (d) Qb(year)
With this information, it is possible to see that only very large summaries can estimate Qb(month) with the necessary accuracy. The remaining queries can be answered within the accuracy target with small summaries. For instance, a summary with 0.49% would be sufficient to answer Qa(month) for FG%=90%. Figure 9 shows the results from using the try approach (SPtry=0.1%), SPQ = nmin/(ngtry/SPtry).
Fig. 9. SP 0.1% Inferral after first iteration considering FG%=90%
The SPQ for Qa(year) was very well estimated because all groups had many samples; Qa(month) was also reasonably well represented, so that the estimation of SPQ by the first try is also very good, although not perfect; Qb(year) groups were not as well represented, but the query returned the correct number of groups, so that the
result was only slightly overshot. The only case in which the estimation is not good is Qb(month). This is because, as we explained before, several result groups were missing for lack of representation. While Qb(month) returned 2078 groups, using SP 0.1% only 966 groups were able to answer. However, it is very interesting to see that the SP that was inferred is about half of the real one (because only the upper half of the groups answered) and the procedure still works quite well because the second iteration with those SP values would infer the correct SPQ.
5
Conclusions
The main conclusions from this work are that MS is very advantageous with respect to response time, while the strength of OS is its complete flexibility in the choice of sampling rate, at the cost of weaker response times. We also proposed a sampling heuristic that chooses the appropriate sampling rate within accuracy targets. We have supported our claims experimentally using the TPC-H benchmark.
Acknowledgments I would like to thank João Pedro Costa for his comments and discussions on the materialized versus on-the-fly issue.
References [1]
S. Acharaya, P.B. Gibbons, and V. Poosala. “Congressional Samples for Approximate Answering of Group-By Queries”, ACM SIGMOD Int. Conference on Management of Data, pp.487-498, June 2000. [2] S. Acharaya et al. “Join synopses for approximate query answering”, ACM SIGMOD Int. Conference on Management of Data, pp.275-286, June 1999. [3] D. Barbara et al. The New Jersey data reduction report. Bulletin of the Technical Committee on Data Engineering, 20(4):3–45, 1997. [4] Pedro Furtado, João Pedro Costa: Time-Interval Sampling for Improved Estimations in Data Warehouses. DaWaK 2002: 327-338. [5] Pedro Furtado, João Pedro Costa: The BofS Solution to Limitations of Approximate Summaries. DASFAA 2003. [6] P. B. Gibbons, Y. Matias, and V. Poosala. Aqua project white paper. Technical report, Bell Laboratories, Murray Hill, New Jersey, December 1997. [7] P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proc. ACM SIGMOD Int. Conference on Management of Data, pp.331–342, June 1998. [8] P. J. Haas. Large-sample and deterministic confidence intervals for online aggregation. In Proc. 9th Intl. Conf. Scientific and Statistical Database Management, August 1997. [9] J.M. Hellerstein, P.J. Haas, and H.J. Wang. “Online aggregation”, ACM SIGMOD Int. Conference on Management of Data, pp.171-182, May 1997. [10] J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37-57, 1985.
Incremental and Decremental Proximal Support Vector Classification using Decay Coefficients

Amund Tveit, Magnus Lie Hetland and Håvard Engum
Department of Computer and Information Science, Norwegian University of Science and Technology, N-7491 Trondheim, Norway
{amundt,mlh,havare}@idi.ntnu.no
Abstract. This paper presents an efficient approach for supporting decremental learning for incremental proximal support vector machines (SVM). The presented decremental algorithm, based on decay coefficients, is compared with an existing window-based decremental algorithm and is shown to perform at a similar level of accuracy while providing significantly better computational performance.
1
Introduction
Support Vector Machines (SVMs) are an exceptionally efficient data mining approach for classification, clustering and time series analysis [5, 12, 4]. This is primarily due to SVMs' highly accurate results, which are competitive with other data mining approaches, e.g. artificial neural networks (ANNs) and evolutionary algorithms (EAs). In recent years the tremendous growth in the amount of data gathered (e.g. user clickstreams on the web, in e-commerce and in intrusion detection systems) has shifted the focus of SVM classifier algorithms to not only provide accurate results, but also to enable online learning, i.e. incremental and decremental learning, in order to handle concept drift of classes [2, 13]. Fung and Mangasarian introduced the Incremental and Decremental Linear Proximal Support Vector Machine (PSVM) for binary classification [10], and showed that it can be trained extremely fast, i.e. with 1 billion examples (500 increments of 2 million) in 2 hours and 26 minutes on relatively low-end hardware (a 400 MHz Pentium II). This has later been extended to efficiently support incremental multicategorical classification [16]. Proximal SVMs have also been shown to perform at a similar level of accuracy as regular SVMs while being significantly faster [9]. In this paper we propose a computationally efficient algorithm that enables decremental support for incremental PSVMs using a weight decay coefficient. The suggested approach is compared with the current time-window based approach proposed by Fung and Mangasarian [10].
2
Background Theory
The basic idea of Support Vector Machine classification is to find an optimal maximal-margin separating hyperplane between two classes. Support Vector Machines use an implicit nonlinear mapping from input space to a higher-dimensional feature space using kernel functions, in order to find a separating hyperplane for problems that are not linearly separable in input space [7, 18]. Classifying multiple classes is commonly performed by combining several binary SVM classifiers in a tournament manner, either one-against-all or one-against-one, the latter approach requiring substantially more computational effort [11]. The standard binary SVM classification problem with soft margin (allowing some errors) is shown visually in Fig. 1(a). Intuitively, the problem is to maximize the margin between the solid planes and at the same time permit as few errors as possible, errors being positive class points on the negative side (of the solid line) or vice versa.
Fig. 1. SVM and PSVM: (a) Standard SVM classifier; (b) Proximal SVM classifier (both panels show the two point classes A+ and A−, the separating plane x'w = γ, the planes x'w = γ + 1 and x'w = γ − 1, and the corresponding margin)
The standard SVM problem can be stated as a quadratic optimization problem with constraints, as shown in (1):

min_{(w, γ, y) ∈ R^{n+1+m}}  ν e'y + (1/2) w'w
s.t.  D(Aw − eγ) + y ≥ e,  y ≥ 0                                             (1)

where A ∈ R^{m×n}, D ∈ {−1, +1}^{m×1}, e = 1^{m×1}.
Fung and Mangasarian [8] replaced the inequality constraint in (1) with an equality constraint. This changed the binary classification problem, because the points in Fig. 1(b) are no longer bounded by the planes, but are clustered around
them. By solving the equation for y and inserting the result into the expression to be minimized, one gets the following unconstrained optimization problem:

min_{(w, γ) ∈ R^{n+1}}  f(w, γ) = (ν/2) ‖D(Aw − eγ) − e‖² + (1/2)(w'w + γ²)          (2)

Setting ∇f = (∂f/∂w, ∂f/∂γ) = 0 one gets:

[w; γ] = (I/ν + E'E)^{-1} E'De = [ A'A + I/ν   −A'e ; −e'A   1/ν + m ]^{-1} [ A'De ; −e'De ]          (3)

where E = [A  −e], E ∈ R^{m×(n+1)}. Agarwal has shown that the Proximal SVM is directly transferable to a ridge regression expression [1]. Fung and Mangasarian [10] later showed that (3) can be rewritten to handle increments (E^i, d^i) and decrements (E^d, d^d), as shown in (4). This decremental approach is based on time windows:

[w; γ] = (I/ν + E'E + (E^i)'E^i − (E^d)'E^d)^{-1} (E'd + (E^i)'d^i − (E^d)'d^d)          (4)

where d = De.
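The following NumPy sketch (ours, not the authors' C++ implementation) shows the incremental bookkeeping behind (3) and (4): only the (n+1)×(n+1) matrix E'E and the (n+1)-vector E'd have to be stored and updated, and the window-based decrement simply subtracts the stored contribution of an old increment.

import numpy as np

class IncrementalPSVM:
    """Linear proximal SVM kept as the sufficient statistics E'E and E'd (eq. 3-4)."""
    def __init__(self, n_features, nu=1.0):
        self.nu = nu
        self.EtE = np.zeros((n_features + 1, n_features + 1))
        self.Etd = np.zeros(n_features + 1)

    def _blocks(self, A, labels):
        E = np.hstack([A, -np.ones((A.shape[0], 1))])   # E = [A  -e]
        d = labels.astype(float)                         # d = De, labels in {-1, +1}
        return E.T @ E, E.T @ d

    def increment(self, A, labels):
        dEtE, dEtd = self._blocks(A, labels)
        self.EtE += dEtE
        self.Etd += dEtd

    def decrement(self, A, labels):                      # window-based unlearning (eq. 4)
        dEtE, dEtd = self._blocks(A, labels)
        self.EtE -= dEtE
        self.Etd -= dEtd

    def solve(self):
        n1 = self.EtE.shape[0]
        x = np.linalg.solve(np.eye(n1) / self.nu + self.EtE, self.Etd)
        return x[:-1], x[-1]                              # w, gamma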
3
PSVM Decremental Learning using Weight Decay Coefficient
The basic idea is to reduce the effect of the existing (old) accumulated training knowledge E'E with an exponential weight decay coefficient α:

[w; γ] = (I/ν + α·E'E + (E^i)'E^i)^{-1} (α·E'd + (E^i)'d^i),   α ∈ (0, 1]          (5)
As opposed to the decremental approach in expression (4), the presented weight decay approach does not require storage of the increments ((E^i)'E^i, (E^i)'d^i) to be retrieved later as decrements ((E^d)'E^d, (E^d)'d^d). A hybrid approach is shown in expression (6), where one has both a soft decremental effect, using the weight decay coefficient α, and a hard decremental effect, using a fixed window of size W increments:

[w; γ] = (I/ν + α·E'E + (E^i)'E^i − α^W·(E^d)'E^d)^{-1} (α·E'd + (E^i)'d^i − α^W·(E^d)'d^d),   α ∈ (0, 1]          (6)
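On top of the same representation, the weight-decay step of (5) is a small change (again our illustration, reusing the IncrementalPSVM sketch above): scale the accumulated statistics by α before adding a new increment, so that old increments fade exponentially without ever being stored or retrieved.

def decay_increment(model, A, labels, alpha=0.9):
    # Soft decremental step of eq. (5): decay old knowledge by alpha in (0, 1],
    # then add the new increment and re-solve.
    dEtE, dEtd = model._blocks(A, labels)
    model.EtE = alpha * model.EtE + dEtE
    model.Etd = alpha * model.Etd + dEtd
    return model.solve()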
4
Related Work
Syed et al. presented an approach for handling concept drift with SVM [2]. Their approach trains on data, and keeps only the support vectors representing the data before (exact) training with new data and the previous support vectors. Klinkenberg and Joachims presented a window adjustment based SVM method for detecting and handling concept drift [13]. Cauwenberghs and Poggio proposed an incremental and decremental SVM method based on a different approximation than used by us [6].
5
Empirical Results
In order to test and compare our suggested decremental PSVM learning approach with the existing window-based approach, we created synthetic binary classification data sets with simulated concept drift. These were created by sampling feature values from a multivariate normal distribution with covariance matrix Ω = I (the identity matrix) and a mean vector µ sliding linearly from all +1 values to all −1 values for the positive class, and vice versa for the negative class [14], as shown in Algorithm 1.
Algorithm 1 simConceptDrift(nFeat, nSteps, nExPerStep, start)
Require: nFeat, nSteps, nExPerStep ∈ N and start ∈ R
Ensure: Linear stochastic drift in nSteps from start to −start
1: center = [start, . . . , start]  {vector of length nFeat}
2: origcenter = center
3: for all step in {0, . . . , nSteps − 1} do
4:   for all synthExampleCount in {0, . . . , nExPerStep − 1} do
5:     sample an example from a multivariate Gaussian distribution with µ = center and σ²'s = 1
6:   end for
7:   center = origcenter · (1 − 2 · (step + 1)/(nSteps − 1))  {concept drift}
8: end for
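A NumPy version of Algorithm 1 might look as follows (our sketch; it returns the feature matrix for one class and leaves labels to the caller):

import numpy as np

def sim_concept_drift(n_feat, n_steps, n_ex_per_step, start, rng=None):
    """Linear stochastic drift of the class mean from [start, ...] to [-start, ...]."""
    rng = rng or np.random.default_rng()
    orig_center = np.full(n_feat, float(start))
    examples = []
    for step in range(n_steps):
        center = orig_center * (1 - 2 * step / (n_steps - 1))   # drifted mean
        examples.append(rng.normal(loc=center, scale=1.0,
                                   size=(n_ex_per_step, n_feat)))
    return np.vstack(examples)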
5.1
Classification Accuracy
For the small concept drift test (20000 examples with 10 features and 40 increments of 500 examples, figure 2(a)), the weight decay of α = 0.1 performs slightly better in terms of unlearning than a window size of W = 5, and a weight
decay of α = 0.9 performs between unlearning with W = 10 and W = 20; the unlearning performance varies quite a bit with α. For the medium concept drift test (200000 examples with 10 features and 400 increments of 500 examples, figure 2(b)), the value of α matters less; this is due to more increments being shown and the faster exponential effect of the weight decay coefficient than in the small concept drift test. As seen in both figures 2(a) and 2(b), there is a “dip” in classification performance around their respective center points (increment number ≈ 20 and 200). This is caused by the concept drift, i.e. at that point the features of the positive and negative class are indiscernible.
Fig. 2. Classification Accuracy under Concept Drift: (a) short timespan (20000 examples), (b) medium timespan (200 000 examples); classification accuracy (%) versus increment number for the non-decremental baseline, window-based unlearning (W = 5, 10, 20, 100) and weight decay (α = 0.1, 0.9)
5.2
Computational Performance
As shown in Figure 3, the computational performance (measured in wall-clock time) of the weight decay based approach is almost twice as good as that of the window-based approach, except for large windows (e.g. W = 1000). The performance difference seems to decrease with increasing increment size; this is supported by the P-values from the T-test comparisons. 21 out of 27 T-tests (Tables 1-3) showed a significant difference in favor of the weight decay based approach over the window-based approach. The performed T-tests were based on the timing of ten repeated runs of each presented configuration of α, W and increment size.
Fig. 3. Computational Performance (Long timespan): average time (seconds) versus increment size for 2 000 000 examples, for α ∈ {0.1, 0.5, 0.9} and W ∈ {10, 100, 1000}
         w=10   w=100   w=1000
α=0.1    0.00   0.00    0.01
α=0.5    0.00   0.00    0.00
α=0.9    0.00   0.00    0.00

Table 1. P-values for increment size 50 (Comp. Perf.)
5.3
Implementation and Test Environment
The incremental and decremental proximal SVM has been implemented in C++ using the CLapack and ATLAS libraries [3, 19]. Support for Python and Java interfaces to the library is currently under development using the “Simplified Wrapper and Interface Generator”[15]. A Linux cluster (Athlon 1.4-1.66 GHz nodes, Sorceror Linux) has served as the test environment.
Acknowledgements We would like to thank Professor Mihhail Matskin and Professor Arne Halaas. This work is supported by the Norwegian Research Council.
6
Conclusion and Future Work
We have introduced a weight decay based decremental approach for proximal SVMs and shown that it can replace the current window-based approach.
         w=10   w=100   w=1000
α=0.1    0.00   0.00    0.25
α=0.5    0.00   0.00    0.39
α=0.9    0.00   0.00    0.67

Table 2. P-values for increment size 500 (Comp. Perf.)
         w=10   w=100   w=1000
α=0.1    0.00   0.00    0.67
α=0.5    0.00   0.00    0.79
α=0.9    0.00   0.00    0.57

Table 3. P-values for increment size 5000 (Comp. Perf.)
The weight decay based approach is significantly faster than the window-based approach (due to lower I/O requirements) for small-to-medium increment and window sizes; this is supported by the simulations and the p-values from the T-tests. Future work includes applying the approach to demanding incremental classification and prediction problems, e.g. game usage mining [17]. Algorithmic improvements that need to be done include 1) developing incremental multiclass balancing mechanisms, 2) investigating the appropriateness of parallelized incremental proximal SVMs, and 3) strengthening the implementation with support for tuning sets and kernels.
References 1. Deepak K. Agarwal. Shrinkage Estimator Generalizations of Proximal Support Vector Machines. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 173–182. ACM Press, 2002. 2. Nadeem Ahmed, Huan Liu, and Kah Kay Sung. Handling Concept Drifts in Incremental Learning with Support Vector Machines. In Proceedings of the fifth International Conference on Knowledge Discovery and Data Mining, pages 317–321. ACM Press, 1999. 3. E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999. 4. Asa Ben-Hur, David Horn, Hava T. Siegelmann, and Vladimir Vapnik. Support Vector Clustering. Journal of Machine Learning Research, 2:125–137, 2001. 5. Robert Burbidge and Bernhard F. Buxton. An introduction to support vector machines for data mining. In M. Sheppee, editor, Keynote Papers, Young OR12, pages 3–15, University of Nottingham, March 2001. Operational Research Society, Operational Research Society.
6. Gert Cauwenberghs and Tomaso Poggio. Incremental and Decremental Support Vector Machine Learning. In Advances in Neural Information Processing Systems (NIPS’2000), volume 13, pages 409–415. MIT Press, 2001. 7. Nello Christiani and John Shawe-Taylor. An Introduction to Support Vector Machines and other kernel-based learning methods, chapter 6, pages 93–111. Cambridge University Press, 1st edition, 2000. 8. Glenn Fung and Olvi L. Mangasarian. Multicategory Proximal Support Vector Classifiers. Submitted to Machine Learning Journal, 2001. 9. Glenn Fung and Olvi L. Mangasarian. Proximal support vector machine classifiers. In Proceedings of the 7th ACM Conference on Knowledge Discovery and Data Mining, pages 77–86. ACM, 2001. 10. Glenn Fung and Olvi L. Mangasarian. Incremental Support Vector Machine Classification. In R. Grossman, H. Mannila, and R. Motwani, editors, Proceedings of the Second SIAM International Conference on Data Mining, pages 247–260. SIAM, April 2002. 11. Chih-Wei Hsu and Chih-Jen Lin. A Comparison of Methods for Multi-class Support Vector Machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002. 12. Jeffrey Huang, Xuhui Shao, and Harry Wechsler. Face pose discrimination using support vector machines (svm). In Proceedings of 14th Int’l Conf. on Pattern Recognition (ICPR’98), pages 154–156. IEEE, 1998. 13. Ralf Klinkenberg and Thorsten Joachims. Detecting Concept Drift with Support Vector Machines. In Pat Langley, editor, Proceedings of the Seventeenth International Conference on Machine Learning (ICML). Morgan Kaufmann, 2000. 14. Kenneth Lange. Numerical Analysis for Statisticians, chapter 7.3, pages 80–81. Springer-Verlag, 1999. 15. Simplified wrapper and interface generator. Online, http://www.swig.org/, March 2003. 16. Amund Tveit and Magnus Lie Hetland. Multicategory Incremental Proximal Support Vector Classifiers. In Proceedings of the 7th International Conference on Knowledge-Based Information & Engineering Systems (forthcoming), Lecture Notes in Artificial Intelligence. Springer-Verlag, 2003. 17. Amund Tveit and Gisle B. Tveit. Game Usage Mining: Information Gathering for Knowledge Discovery in Massive Multiplayer Games. In Proceedings of the International Conference on Internet Computing (IC’2002), session on Web Mining. CSREA Press, June 2002. 18. Vladimir N. Vapnik. The Nature of Statistical Learning Theory, chapter 5, pages 138–146. Springer-Verlag, 2nd edition, 1999. 19. Richard C. Whaley, Antoine Petitet, and Jack J. Dongarra. Automated Empirical Optimization of Software and the ATLAS Project”. Parallel Computing, 27(12):3–25, 2001.