Flexible Query Languages for Relational Databases: An Overview

Antonio Rosado (1), Rita A. Ribeiro (2), Slawomir Zadrozny (3), Janusz Kacprzyk (3)

(1) Universidade do Algarve, Campus de Gambelas, 8000 Faro, Portugal
    [email protected]
(2) UNINOVA, Campus of New University of Lisbon, Caparica 2929-516, Portugal
    [email protected]
(3) Systems Research Institute, Polish Academy of Sciences, 01-447 Warszawa, Poland
    {kacprzyk,zadrozny}@ibspan.waw.pl
1 Introduction

Managers rely more and more on databases to obtain insights and up-to-date information on the activities of their institutions and companies. More and more people, from experts to non-experts, depend on information from databases to fulfill everyday tasks, notably those related to decision making. Basically, the content of a database describes selected aspects of the real world relevant for a given company, institution, etc. Often, our knowledge about the entities represented in a database, as well as our preferences as to what should be retrieved from it, are imperfect or imprecise. This raises the question of properly modeling imperfect information in the context of database management systems (DBMSs).

The focus of this paper is on flexible query languages (FQL) for databases that are based on fuzzy sets theory. Since there are many contributions in this field, we propose two taxonomies to help and guide database designers and users. These taxonomies address FQLs for crisp relational databases and for fuzzy relational databases, respectively. The approaches mentioned in these taxonomies do not exhaust the huge literature, but they are quite representative. We believe that these two taxonomies provide a better understanding of the field and can help select the best approaches to solve specific problems.
Dubois and Prade [16] enumerate two reasons for using fuzzy sets theory [51] to make querying more flexible. First, fuzzy sets provide a better representation of the user's preferences. For example, in a query asking for an apartment "not too expensive and not too far from downtown", the user may feel much more comfortable using linguistic terms instead of precisely specified numerical constraints. Moreover, these linguistic terms express exactly what the user's preferences are, in contrast to, for example, a crisply specified interval to which the price would have to belong. The linguistic terms clearly suggest that there is a smooth transition between acceptable and unacceptable prices. Thus, a price may definitely match or definitely not match the user's request, but it may also match to a certain degree. Another important aspect of using fuzzy sets theory is a direct consequence of the previous one: as soon as we have matching degrees, answers can be ranked according to the user's requirements [16].

According to many authors, such as Bosc and Pivert [5],[8], Kacprzyk and Zadrozny [21], Takahashi [46], and Medina et al. [32], there are two main lines of research on the use of fuzzy sets theory in the DBMS context. The first one assumes a conventional database and, essentially, develops a fuzzy querying interface using fuzzy sets, possibility theory, fuzzy logic, etc. Among the authors who have contributed to this research are Bosc and Pivert [3],[5],[7], Dubois and Prade [16], Tahani [44], Takahashi [45],[46], Kacprzyk, Zadrozny and Ziolkowski [27], and Ribeiro and Moreira [39]. In Section 4, we describe their works and propose a taxonomy for these approaches. The second line of research uses fuzzy or possibilistic elements to develop a fuzzy database model that accounts for imprecision and vagueness in the data. Here, of course, querying also constitutes an important element of the model. Some relevant concepts are presented, e.g., in [1], [8], [9], [12], [13], [18], [20], [32], [34], [35], [42], [43]. These approaches are described in Section 5. There are also other issues in the use of fuzzy sets theory in relational databases, such as the efficiency of fuzzy query execution, fuzzy functional dependencies and constraints, and fuzzy logical databases, but they are beyond the scope of this paper.

In Section 2, we review the fundamental concepts of the relational data model, including its main querying formalisms: the relational algebra, the relational calculus and the SQL language. Next, in Section 3, we review the main concepts of fuzzy sets theory that will be used in this paper. Sections 2 and 3 provide the theoretical base that makes the paper self-contained. In Section 4, we examine in detail the main approaches proposed in the literature within the first line of research mentioned above, i.e., flexible query languages for the crisp relational data model, and we propose a taxonomy to organize these approaches, thus providing an overall picture of the main research done. The last part, Section 5, is devoted to the second line of research mentioned above, i.e., flexible query languages for fuzzy relational databases. Again, a taxonomy for the different approaches is proposed. Finally, in Section 6, we present some conclusions about this work.
2 Brief introduction to the relational data model

A relational database is a collection of relations, defined according to the relational database model [14]. A relation may be understood as the relation schema or the relation instance. The relation schema has the following form:

R(A1:D1, …, An:Dn)
(1)
where R is the name of the relation, Ai (1 ≤ i ≤ n) is the i-th attribute (also called a column or field) and Di (1 ≤ i ≤ n) is the domain corresponding to the attribute Ai. Each Di defines the set of possible values for the attribute Ai. Often, we refer to a schema by just indicating the set of attribute names, R(X), X = {A1, …, An}. A relation instance of a given relation schema is a set of tuples, each composed of values of the attributes belonging to the relation schema:

{<d1, …, dn> | (d1 ∈ D1), …, (dn ∈ Dn)}
(2)
where di (1 ≤ i ≤ n) is the value of the tuple corresponding to attribute Ai; this value must belong to the set Di. Usually the term relation instance is abbreviated to relation whenever there is no confusion with other aspects of the relation. A relation is here denoted R, its n attributes are denoted A1, …, An and D1, …, Dn are their domains. The m tuples of a relation are denoted t1, …, tm and dij represents the value of the j-th attribute in tuple ti. Relations are often referred to as tables, with columns and rows corresponding to the attributes and tuples of the relation.

The typical operations on data in a database include insertion, deletion, updating and retrieval. The latter is the most important for our considerations. Usually not all data is retrieved; only the data matching certain criteria is required. From the operational point of view there are two approaches to devising a query language, i.e., a language allowing one to express the range of required data. The first one consists in the use of a restricted version of the predicate/relational calculus of mathematical logic to specify the requirements the retrieved data should meet. In this approach the actual way of retrieving the data is completely left to the database management system employed. In the second approach, a query is a sequence of operations that should be executed on the database to obtain the required data. These operations correspond to the underlying data model, i.e., to the operations of relational algebra. From the theoretical point of view both approaches are equivalent in the sense of their expressive power. In Sections 2.1 and 2.2, we present a brief overview of both formalisms. In relational database management system implementations a hybrid approach is usually adopted, notably exemplified by the SQL language.

Whichever querying approach is assumed, the most important part of a query is the set of conditions (criteria) determining which rows will be selected to be included in
an answer to the query. Thus, it is interesting to study the retrieval process from the perspective where a query is meant to define a prototype of data to be retrieved. Then, during the retrieval process for every row a matching degree of its content and the prototype is calculated. In the classical crisp approach this matching degree is binary: a row matches the prototype or not. In real situations, the description of the prototype may be imprecise and this leads to a partial matching degree. This line of reasoning, adopted by many authors, provides an interesting basis for the analysis of flexible (fuzzy) querying languages, which we will explore next. To clarify the understanding of some proposals in the literature, we will use, whenever necessary, the following relational scheme example: Employees(#emp, name, #dep, age, job, salary, commission, town)
(3)
Departments(#dep, budget, size, city)
(4)
with the instances of the relations Employees and Departments given in Fig. 1. This example will also make it possible to explain, in a simple way, the main concepts of relational algebra and relational calculus.

2.1 Relational algebra

Relational algebra defines a set of operations to manipulate data in relations [37]. The list of basic operations includes:

• the usual set-theoretic operations, including the union, difference and Cartesian product. The Cartesian product is slightly modified to fit the database context and produces a relation whose scheme is the union of the schemes of the argument relations.
• the selection, σP(R), which gives the tuples of relation R that satisfy a Boolean expression P defined over the scheme of R.
• the projection, πY(R), which gives the relation obtained when all attributes from the set X−Y are removed, where R(X) is the scheme of the relation R. Thus, the scheme of the resulting relation comprises only the subset Y of the attributes of R, and some tuples of the original relation are also removed (duplicates, i.e., tuples with identical values of the attributes belonging to Y).
Employees:

#emp  name     #dep  age  job                  sal    commission  town
22    Arthur   1     30   programmer           $1500  $500        New York
29    John     3     35   accounter            $1800  $400        Boston
31    Mary     7     40   sales manager        $2300  $200        San Francisco
32    Peter    1     39   systems analyst      $2000  $300        New York
58    Barbara  7     39   marketing manager    $2500  $500        Los Angeles
64    Mary     7     27   product manager      $2000  $400        New York
71    Michael  8     30   research assistant   $1500  $100        S. Diego
74    Jude     9     35   secretary            $1600  $300        New York
85    Horatio  9     50   technical assistant  $2400  $400        New York
95    Ken      3     55   controller           $2800  $500        Boston

Departments:

#dep  budget   size  city
1     $130000  30    S. Francisco
7     $90000   20    New York
9     $110000  11    Washington
8     $50000   7     S. Francisco
3     $40000   11    New York

Fig. 1. Example of a crisp relational database

Other operations can also be used, but they can be expressed in terms of the five operations mentioned above. For example, the popular join operation, joinAθB(R,S), is a combination of the Cartesian product and the selection operation. Another operation, often discussed within the subject of "fuzzification", is the division R ÷ S. If X and Y are the schemes of R and S, respectively, with Y ⊂ X, then R ÷ S gives the maximal (in the sense of "⊂") relation T, having the scheme X−Y, such that T × S ⊂ R, i.e.:

R ÷ S = {t : ∀u ∈ S, (t,u) ∈ R}
(5)
The division operation may also be expressed as a combination of the projection, Cartesian product and difference operations. We complete this section by providing an example of a query expressed in the relational algebra. The query [7]:

Q1 - Find the employees younger than 35 who work in a department whose budget is higher than $100000

may be expressed in relational algebra as:
π#emp( joinEmp.#dep=Dep.#dep( σage<35(Emp), σbudget>100000(Dep) ) )
(6)
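To make the algebraic reading of query Q1 concrete, the following Python sketch evaluates the same selection-join-projection pipeline over small in-memory relations; the helper names select, join and project are ours, and only a subset of the Fig. 1 data is reproduced.

```python
# Relations as lists of dicts (a small subset of the data in Fig. 1).
employees = [
    {"#emp": 22, "name": "Arthur", "#dep": 1, "age": 30},
    {"#emp": 29, "name": "John",   "#dep": 3, "age": 35},
    {"#emp": 64, "name": "Mary",   "#dep": 7, "age": 27},
]
departments = [
    {"#dep": 1, "budget": 130000},
    {"#dep": 3, "budget": 40000},
    {"#dep": 7, "budget": 90000},
]

def select(relation, predicate):
    """Selection sigma_P(R): keep the tuples satisfying the Boolean predicate."""
    return [t for t in relation if predicate(t)]

def join(r, s, condition):
    """Theta-join: Cartesian product followed by a selection on the pairs."""
    return [{**t, **u} for t in r for u in s if condition(t, u)]

def project(relation, attributes):
    """Projection pi_Y(R): keep only the listed attributes and drop duplicates."""
    seen, result = set(), []
    for t in relation:
        reduced = tuple((a, t[a]) for a in attributes)
        if reduced not in seen:
            seen.add(reduced)
            result.append(dict(reduced))
    return result

# Q1: employees younger than 35 working in a department with budget > 100000.
q1 = project(
    join(select(employees, lambda t: t["age"] < 35),
         select(departments, lambda u: u["budget"] > 100000),
         lambda t, u: t["#dep"] == u["#dep"]),
    ["#emp"])
print(q1)   # [{'#emp': 22}], consistent with the data of Fig. 1
```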
2.2 Relational calculus

Relational calculus comprises the domain relational calculus (DRC) and the tuple relational calculus (TRC) [37]. DRC and TRC are declarative query languages based on first-order predicate logic. The main idea is to describe what is sought rather than to define how to get it, as is done in relational algebra. Next, we briefly review the DRC language because it is the basis for several flexible query languages (FQLs) described in Sections 4 and 5. A DRC query is an expression based on the language of the first-order predicate calculus. The general form of such a query is as follows:

{(x1,…,xn) | φ(x1,…,xn)}
(7)
where φ(x1,…,xn) is a formula of the language. An answer to such a query is a set (possibly empty) of tuples (a1,…,an) such that when substituting the ai's for the xi's in φ a true formula is obtained. The building blocks of formulae are atomic formulae. In the DRC two classes of atomic formulae are usually distinguished:

R(x1, x2, …, xn)
x1 θ x2                                                             (8)

where xi (1 ≤ i ≤ n) is either a domain variable or a constant, R is an n-ary predicate, and θ is a comparison operator from the set {<, >, =, ≤, ≥, ≠}. Finally, a formula is:

• an atomic formula,
• ¬ψ, ψ1 ∧ ψ2, ψ1 ∨ ψ2, ψ1 ⇒ ψ2,
• ∃x (ψ(x)),
• ∀x (ψ(x))                                                         (9)
where ψ, ψ1, ψ2 are formulas and ψ(x) is a formula containing a domain variable x. An example of a DRC query equivalent to the example of the relational algebra query of Section 2.1 (see Eq. (6)) is the following:
{(x1) | ∃x2, x3, x4, x5, x6 (EMP(x1, x2, x3, x4, …) ∧ x4 < 35 ∧ DEP(x5, x6, …) ∧ x6 > 100000 ∧ x5 = x3)}          (10)

In the tuple relational calculus (TRC) we use a syntax similar to that of Eqs. (8)-(9), but with tuple variables replacing the domain variables. For example, (10) may be expressed in TRC as:
{t | t ∈ Employees ∧ t.age < 35 ∧ ∃s ∈ Departments (s.#dep = t.#dep ∧ s.budget > 100000)}
(11)
2.3 SQL

SQL is a de facto industry-standard command language for relational database management systems. SQL has commands to deal with all aspects of the creation, maintenance and use of a database, such as the creation of tables, the insertion of rows, querying the database, security issues, etc. In this section, we are only interested in what SQL provides for querying a database. The syntax of a basic query in SQL is [37]:

SELECT select_list
FROM from_list
WHERE conditions
(12)
Such a query retrieves the required data from some tables and builds a new table. The select_list specifies the expressions (often, just column names) whose values are to populate the columns of the new table. The tables from which the columns used in these expressions are drawn have to be listed in the from_list. The expressions of the SELECT clause are calculated only for those rows of the tables in the FROM clause that meet the conditions specified in the WHERE clause. The condition is a Boolean combination of atomic conditions created using the logical connectives AND, OR, NOT. Each atomic condition has the syntax "expression op expression", where "op" is one of the comparison operators (<, <=, =, <>, >=, >) and "expression" is a column name, a constant or an expression (numeric or string). A basic query (12) corresponds to an expression in relational algebra involving the operations of projection, selection and Cartesian product. For example, the following query:

Q2 - Find the names and ages of all employees

is expressed in SQL as:

SELECT name, age
FROM Employees
(13)
and the result is a table with two columns (name,age), and as many rows as in the table Employees. A way to extend the basic form of a query is to use nested queries (subqueries). Usually, the nested queries are used along with the set operators IN, NOT IN, EXISTS, NOT EXISTS, op ANY, op ALL as exemplified by the query:
Q3 - Find the names of employees who work in New York

may be expressed as:
SELECT name
FROM Employees
WHERE #dep IN
      (SELECT #dep FROM Departments WHERE city = 'New York')
(14)
The subquery retrieves the set of departments that are located in New York. The main query retrieves the names of employees such that their department is in this set. The set operator IN allows to test whether a value belongs to a set or not.
3 A brief introduction to fuzzy sets theory

Fuzzy sets theory [51] is an attempt to model the inherent vagueness of natural language. Almost any concept expressed in natural language, like young people, implies that the elements of the universe of discourse (the particular people) satisfy it, e.g., are young, to a certain degree. Note that the concept of young is context dependent. Such graduality is modeled by a membership function µA that assigns to each person x a value µA(x) from the interval [0,1], representing the degree to which person x belongs to the set A. In our example, µYOUNG(x) represents the degree to which x is considered young.

The cardinality |A| of a fuzzy set A, defined on a finite universe X, is given by the sum of the membership values of all elements of X in A; it is sometimes called the scalar cardinality, to distinguish it from other types of cardinality (see e.g. [31]):
|A| = ∑x∈X µA(x)                                                    (15)
The relative cardinality ||A|| of a fuzzy set A, in a finite universe set X, is defined by:
||A|| = |A| / |X|                                                   (16)
where |X| is the cardinality of the universal set X. The counterparts of the classic operations of the complement, union and intersection for fuzzy sets A and B, are defined as follows:
µ¬A(x) = 1 - µA(x)
(17)
µA∪B(x) = max [µA(x), µB(x)]
(18)
µA∩B(x) = min [µA(x), µB(x)]
(19)
The classic correspondence of set theoretical operations and logical connectives is preserved. Thus, (17)-(19) provide also interpretation for the connectives of negation, disjunction and conjunction. Several fuzzy implication operators have also been proposed in the literature [17]. The most commonly used are: Kleene-Dienes: I(x,y) = max(1 – x, y)
(20)
Lukasiewicz: I(x,y) = min(1,(1-x+ y))
(21)
Gödel: I(x,y) = 1 if x ≤ y, and I(x,y) = y otherwise
(22)
Goguen: I(x,y) = min(y/x,1)
(23)
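As a small numeric illustration of Eqs. (15)-(23), the sketch below (in Python, with invented membership values) represents fuzzy sets as dictionaries of membership degrees and applies the operations pointwise.

```python
# Two fuzzy sets over a small finite universe, given by their membership degrees.
A = {"u1": 1.0, "u2": 0.7, "u3": 0.2, "u4": 0.0}
B = {"u1": 0.4, "u2": 0.9, "u3": 0.5, "u4": 0.1}

card_A = sum(A.values())                          # scalar cardinality |A|, Eq. (15)
rel_card_A = card_A / len(A)                      # relative cardinality ||A||, Eq. (16)

complement_A = {u: 1 - A[u] for u in A}           # Eq. (17)
union = {u: max(A[u], B[u]) for u in A}           # Eq. (18)
intersection = {u: min(A[u], B[u]) for u in A}    # Eq. (19)

# Fuzzy implication operators of Eqs. (20)-(23), applied to single degrees.
kleene_dienes = lambda x, y: max(1 - x, y)
lukasiewicz   = lambda x, y: min(1.0, 1 - x + y)
goedel        = lambda x, y: 1.0 if x <= y else y
goguen        = lambda x, y: 1.0 if x == 0 else min(y / x, 1.0)

print(card_A, rel_card_A)                  # 1.9 0.475
print(goedel(0.7, 0.4), goguen(0.7, 0.4))  # 0.4 0.5714...
```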
Fuzzy sets theory was conceived primarily as a formalism to represent the meaning of natural language expressions. In the following subsections we briefly review some concepts relevant for this topic.

3.1 Linguistic variable

Basically, a linguistic variable is a variable assuming linguistic values instead of numerical ones. Formally, a linguistic variable [53] is a quintuple (H, T, U, G, M), where H is the name of the variable, T is the set of linguistic names (called terms) that can be assigned to the variable, U is the universe of values that are used to define the meaning M of each linguistic value in T, and G is a grammar specifying the values allowed in T. The meaning M(X) of a term X ∈ T is specified as a fuzzy subset of U. The terms may be atomic terms, such as "young", or composite terms, which result, for example, from applying modifiers (see the next subsection) and logical connectives to atomic terms. For example [53], a linguistic variable called age (H = age) may have the term set T = {old, very old, not old, more or less young, quite young, not very old and not very young, …}, which, for simplicity, is defined here in an informal way, without defining the grammar G. In this example, young and old are the atomic terms. The universe of discourse might be U = [0,100] and the meaning of the term young, M(young), could be given by a fuzzy set such that:
µM(young)(u) = 1                               for u ∈ [0,25]
µM(young)(u) = 1 / (1 + ((u − 25)/5)²)          for u ∈ ]25,100]                (24)
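The membership function of Eq. (24) translates directly into code; the sketch below reproduces it and evaluates a few sample ages (the ages chosen are ours).

```python
def mu_young(u):
    """Membership function of the term 'young' from Eq. (24), u in [0, 100]."""
    if u <= 25:
        return 1.0
    return 1.0 / (1.0 + ((u - 25.0) / 5.0) ** 2)

for age in (20, 25, 30, 40):
    print(age, round(mu_young(age), 3))   # 1.0, 1.0, 0.5, 0.1
```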
3.2 Modifiers

A linguistic modifier can be modeled by using an operator that acts on the fuzzy set corresponding to the linguistic term to which the modifier is applied. For example [41], the linguistic modifier very in the linguistic expression "very young" intensifies the meaning expressed by the fuzzy term young. Hence, the effect of very is to decrease the membership degrees of the values belonging to the fuzzy set YOUNG. The concentration operator produces this effect:
µCON(A)(x) = (µA(x))²
(25)
Conversely, the dilation operator can be used to define modifiers such as slightly; it is modeled as:
µDIL(A)(x) = (µA(x))^(1/2)
(26)
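A brief sketch of Eqs. (25)-(26): "very young" and "slightly young" are obtained by squaring and taking the square root of the membership degree of young; the function mu_young below just restates Eq. (24) so the snippet is self-contained.

```python
def mu_young(u):                      # same as Eq. (24)
    return 1.0 if u <= 25 else 1.0 / (1.0 + ((u - 25.0) / 5.0) ** 2)

def mu_very_young(u):
    """'very young' via the concentration operator, Eq. (25)."""
    return mu_young(u) ** 2

def mu_slightly_young(u):
    """'slightly young' via the dilation operator, Eq. (26)."""
    return mu_young(u) ** 0.5

print(round(mu_young(30), 2), round(mu_very_young(30), 2), round(mu_slightly_young(30), 2))
# 0.5 0.25 0.71
```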
Another common modifier is not, which is modeled by the complement operator. There are many more operators used to model linguistic modifiers, cf., e.g., [28].

3.3 Fuzzy (linguistic) quantifiers

Classical logic recognizes two quantifiers, expressing that all objects possess a certain property (the general quantifier) or that at least one object possesses a certain property (the existential quantifier), respectively. However, natural languages offer many more forms of quantifiers. For example, quite often one says that most of the objects possess a certain property. Basically, there are two types of fuzzy (linguistic) quantifiers [50],[54]: absolute, such as "approximately 3" and "several", and proportional, such as "most" and "a few". There are also two general types of propositions referring to linguistic quantifiers:

1. Q X's are A's (type I)
2. Q B's are A's (type II)
where Q is a linguistic quantifier, and A and B are fuzzy sets modeling certain fuzzy properties of the objects of the universe X. In what follows we briefly discuss how linguistic quantifiers may be formalized.

Zadeh's calculus of linguistically quantified propositions

Zadeh [54] proposed an interpretation of fuzzy quantified statements of both types I and II based on the concepts of the cardinality (Eq. (15)) and the relative cardinality (Eq. (16)) of a fuzzy set. A fuzzy proposition "Q X's are A's" has a truth degree T that is computed using the following equations [54]:

T = Qabsolute(|A|) = Qabsolute(∑i µA(xi))
T = Qrelative(||A||) = Qrelative((∑i µA(xi)) / |X|)                  (27)

where Qabsolute and Qrelative denote, respectively, an absolute and a relative quantifier. For fuzzy propositions "Q B's are A's", where both A and B are fuzzy sets, we have [54]:

T = Qabsolute(|A∩B|) = Qabsolute(∑i µA(xi) ∧ µB(xi))                 (28)

T = Qrelative(|A∩B| / |B|) = Qrelative((∑i µA(xi) ∧ µB(xi)) / ∑i µB(xi))   (29)
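The following sketch evaluates Zadeh's Eqs. (27)-(29) on invented membership values, using min for the conjunction and an assumed piecewise-linear shape for the relative quantifier "most".

```python
# Degrees to which five employees are 'young' (A) and 'well paid' (B) -- invented values.
mu_A = [0.9, 0.7, 1.0, 0.3, 0.0]      # young
mu_B = [1.0, 0.8, 0.4, 0.9, 0.6]      # well paid

def most(r):
    """Relative quantifier 'most': 0 below 0.3, 1 above 0.8, linear in between."""
    return min(1.0, max(0.0, (r - 0.3) / 0.5))

# Type I proposition "most X's are young", Eq. (27) with the relative cardinality.
truth_type1 = most(sum(mu_A) / len(mu_A))

# Type II proposition "most well-paid employees are young", Eq. (29).
num = sum(min(a, b) for a, b in zip(mu_A, mu_B))
den = sum(mu_B)
truth_type2 = most(num / den)

print(round(truth_type1, 2), round(truth_type2, 2))   # 0.56 0.64
```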
OWA operators and Yager's calculus of linguistically quantified propositions

Yager [50] proposed the use of Ordered Weighted Averaging (OWA) operators for the evaluation of linguistically quantified propositions, to overcome some problems of Zadeh's proposal (see [7] for an example comparing Zadeh's approach with Yager's approach). An OWA operator of dimension n is a mapping f that performs an aggregation of its n arguments a1, …, an [50], such that:

f(a1, …, an) = ∑j=1,…,n bj wj                                        (30)

where ai ∈ [0,1], bj is the j-th largest among the ai, and the wj (wj ∈ [0,1]) are weights such that ∑i wi = 1. The classical "AND" and "OR" may be expressed as special OWA operators:

for wn = 1 and wj = 0 (∀j < n) we obtain f1(a1, …, an) = bn = mini ai
for w1 = 1 and wj = 0 (∀j > 1) we obtain f2(a1, …, an) = b1 = maxi ai       (31)
Moreover, any OWA operator lies somewhere between the “OR” operator and the “AND” operator [50] in the sense that: mini ai ≤ f(a1, …, an) ≤ maxi ai
(32)
It is also possible to define an OWA operator which may be interpreted as a linguistic quantifier. Yager [50] proposed a scheme of defining an OWA operator corresponding to a linguistic quantifier in the sense of Zadeh. Namely, starting with a linguistic quantifier Q that is monotone and regular (Q(0) = 0; Q(1) = 1) we set the weights of corresponding OWA operator as follows: wi = Qrelative(i/n) – Qrelative((i-1)/n) ∀ i=1,…,n
(33)
Then, the degree of truth T of a quantified proposition “Q X’s are A” is computed using this OWA operator [50]: T = f (µA(x1), …, µA(xn))
(34)
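A sketch of Yager's construction: the OWA weights are derived from the quantifier "most" via Eq. (33) and then used to aggregate a vector of matching degrees as in Eqs. (30) and (34); the quantifier shape and the input degrees are illustrative assumptions.

```python
def most(r):
    """A regular monotone quantifier 'most' (illustrative shape)."""
    return min(1.0, max(0.0, (r - 0.3) / 0.5))

def owa_weights(quantifier, n):
    """Eq. (33): w_i = Q(i/n) - Q((i-1)/n)."""
    return [quantifier(i / n) - quantifier((i - 1) / n) for i in range(1, n + 1)]

def owa(degrees, weights):
    """Eq. (30): weighted sum of the degrees sorted in decreasing order."""
    ordered = sorted(degrees, reverse=True)
    return sum(b * w for b, w in zip(ordered, weights))

degrees = [0.9, 0.4, 0.8, 0.6]                 # partial matching degrees (invented)
w = owa_weights(most, len(degrees))
print([round(x, 2) for x in w])                # [0.0, 0.4, 0.5, 0.1]
print(round(owa(degrees, w), 2))               # 0.66: truth of "most conditions are met"
```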
3.4 Possibility distributions

Consider a vague proposition "X is young". This is an imprecise proposition because it does not assign a particular value to X (more precisely, to X's attribute age). Instead, it associates with each possible value of X its possibility degree, a number in the interval [0,1] [29]. We can say that the proposition p = "X is young" induces a possibility distribution π (the notation πX is often used to indicate which variable is considered) on the domain of the attribute age:

X is young → π = YOUNG
(35)
or, equivalently:

∀u ∈ U,   π(u) = µYOUNG(u)
(36)
that is, the possibility that a certain u ∈ U is the actual value of X is equal to u's membership degree in the fuzzy set YOUNG, which models the linguistic term young. Knowing the possibility distribution πX, we may also be interested in determining the possibility that X's value belongs to a set A ⊆ U. This leads to the concept of the possibility measure, i.e., a function Π such that:
Π: 2^U → [0,1]                                                      (37)
From the postulated properties of possibility measures it is assumed that (in fact, usually we start with the concept and properties of the possibility measure and only then the notion of the possibility distribution is introduced):
Π(A) = supu∈A π(u)                                                  (38)
The possibility measure alone does not tell us enough about the location of the actual value of X: outside or inside A. Thus, it is usually argued that it should be accompanied by the possibility measure of the complement of A. More precisely, the necessity measure, N, expressing the "impossibility" of the complement of A, is defined as:

N(A) = 1 − Π(Ā) = infu∉A (1 − π(u))                                 (39)
The formulae of Eqs. (38)-(39) are extended to the case where A is a fuzzy set in the following way:

Poss(X is A) = Π(A) = supu∈U min(π(u), µA(u))                       (40)

and:

Nec(X is A) = N(A) = infu∈U max(1 − π(u), µA(u))                    (41)
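On a finite domain the sup and inf of Eqs. (40)-(41) become max and min, as in the following sketch, which evaluates "X is about 21" for a variable whose value is only known to be "young"; both the distribution and the fuzzy set are invented shapes.

```python
domain = range(18, 31)

# Possibility distribution of X's age induced by "X is young" (illustrative values).
pi = {u: 1.0 if u <= 25 else 1.0 / (1.0 + ((u - 25) / 5) ** 2) for u in domain}

# Fuzzy set A = "about 21" (illustrative triangular shape).
mu_A = {u: max(0.0, 1.0 - abs(u - 21) / 3.0) for u in domain}

poss = max(min(pi[u], mu_A[u]) for u in domain)          # Eq. (40)
nec  = min(max(1 - pi[u], mu_A[u]) for u in domain)      # Eq. (41)

print(round(poss, 2), round(nec, 2))   # 1.0 0.0
```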
Now, if we know that the possibility distribution of X's value is π, then the degree to which the actual value of X belongs to A (often denoted as "X is A") belongs to the interval [N(A), Π(A)]. Eqs. (40)-(41) for the possibility and necessity measures are directly employed when the matching degree is computed in the context of querying possibilistic fuzzy databases; see Section 5. Actually, the interpretation of more advanced queries calls for more sophisticated formulae. Namely, let us assume that we have two variables X and Y, defined on the same universe U, and that we know the possibility distributions of their values, πX and πY, respectively. The question is: what is the possibility that the actual values of these variables are equal? In order to answer this question consistently we proceed as follows. First, we observe that πX and πY jointly represent a possibility distribution πXY on U × U:
πXY(u,w) = min (πX(u), πY(w))
(42)
Second, the possibility (necessity) measure associated with πXY will be denoted ΠXY (NXY). Third, the answers to our question are the values of the possibility and necessity measures for the set of pairs of identical elements from U, i.e.:

Poss(X = Y) = ΠXY({(u, u) : u ∈ U}) = supu∈U min(πX(u), πY(u))            (43)

Nec(X = Y) = NXY({(u, u) : u ∈ U}) = infu∈U max(1 − πX(u), 1 − πY(u))     (44)
In Section 5, we discuss Eqs. (40)-(41) for a more general case, when, roughly speaking, the "is" and "=" operators are replaced with other crisp or fuzzy relational operators. Now, we would like to distinguish another concept, useful in the context of fuzzy database querying: the similarity of possibility distributions. If we know the possibility distributions of two variables, X and Y, we may be interested in finding out how similar the two distributions are. Similarity is meant here in a very broad sense. Obviously, Eq. (43) provides an assessment of this similarity, but other measures are also applicable. In Section 5 we discuss this concept in more detail.

3.5 Fuzzy relations

A crisp relation R(x1, x2, ..., xn) defined on crisp sets X1, X2, …, Xn is a subset of the Cartesian product X1 × X2 × … × Xn. Similarly, a fuzzy relation R is a fuzzy set defined on the Cartesian product X1 × X2 × … × Xn. The membership degrees µR(x1, x2, …, xn) represent the strength of the relation between the elements of the tuples (x1, x2, …, xn). In relational database terminology we will say that R is defined over the schema (X1, X2, …, Xn). The composition of two binary crisp relations P(X,Y) and Q(Y,Z), denoted P(X,Y) ∘ Q(Y,Z), is a crisp binary relation R(X,Z) such that:

R(X,Z) = {(x,z) ∈ X × Z | ∃y ∈ Y, (x,y) ∈ P ∧ (y,z) ∈ Q}
(45)
The composition of two fuzzy binary relations P(X,Y) and Q(Y,Z) may be defined in several ways. The most commonly used definitions are the following [29]:
µP∘Q(x, z) = maxy∈Y min[µP(x, y), µQ(y, z)]                          (46)

µP•Q(x, z) = maxy∈Y [µP(x, y) × µQ(y, z)]                            (47)
respectively for max-min composition and max-product composition. A fuzzy relation R defined on X × X, verifying the following properties: [29]
1. reflexivity: ∀x ∈ X, µR(x,x) = 1
(48)
2. symmetry: ∀x,y ∈ X, µR(x,y) = µR(y,x)
(49)
3. max-min transitivity: ∀x,z ∈ X, µR(x,z) ≥ maxy∈X min[µR(x,y), µR(y,z)]
(50)
is called a similarity relation R(X,X). If R(X,X) is only reflexive and symmetric, then R is called a proximity relation [29].

An algebra for fuzzy relations

Relational algebra (see Section 2) may be extended in order to provide the same type of operations for fuzzy relations. Next, we present the definitions of the operations of such an extended algebra [3]. The set operations union, intersection, difference and Cartesian product of two fuzzy relations R and S are direct applications of the operations on fuzzy sets, i.e.:

µR∪S(x) = max(µR(x), µS(x))                                          (51)
µ R ∩ S (x) = min (µR(x), µS(x))
(52)
µR−S(x) = µR∩S̄(x) = min(µR(x), 1 − µS(x))
(53)
µ R x S (xy) = min (µR(x), µS(y))
(54)
where x and y are tuples of relations R and S. Moreover, the operations selection, projection and join are defined as:
µσP(R)(x) = min(µR(x), µP(x))                                        (55)

µπZ(R)(z) = maxv µR(zv)                                              (56)

µjoinAθB(R,S)(xy) = min(µR(x), µS(y), µθ(x.A, y.B))
where P is a fuzzy condition, z and v are tuples defined over schema Z and V, respectively, such that Z ∪ V = X and Z ∩ V = ∅; A and B are subsets of X and Y respectively, and θ is a fuzzy comparator, i.e. it is a fuzzy relation, defined on sets A and B, such as approximately equal. Let us briefly explain the three formulas above. First, for selecting the tuples x in a fuzzy relation R that satisfy a fuzzy condition P, we compute the matching degree of each tuple against P, that is, µP(x). The membership degree of each x in the resulting relation must take into account two values: µP(x) and its original
membership degree µR(x). Projecting a fuzzy relation R onto one of its subschemas Z, that is, Z ⊂ X, requires forming the sub-tuples z corresponding to Z and taking as their membership degrees the highest membership degree of the tuples x that include z as a sub-tuple. Finally, joining two fuzzy relations R and S using a fuzzy condition A θ B requires the concatenation of pairs of tuples x and y. The membership degree of each new tuple xy is the minimum of its matching degree with respect to A θ B and the original membership values of the tuples x and y.

The relational algebra operation of division (see Eq. (5)) can also be extended using fuzzy sets (fuzzy relations) instead of crisp sets (crisp relations). Dubois and Prade [16] proposed the following extension:

µR÷S(t) = minu (µS(u) → µR(t,u))                                     (57)

where the symbol → denotes a fuzzy logic implication.
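A compact sketch of the extended operations (55)-(57), with a fuzzy relation stored as a dictionary mapping tuples to membership degrees; the data are invented and the Gödel implication is chosen, among the possible implications, for the division.

```python
# A fuzzy relation R over schema (X, Y): tuples -> membership degrees (invented).
R = {("a", 1): 0.9, ("a", 2): 0.6, ("b", 1): 0.4, ("b", 2): 1.0}
S = {(1,): 0.7, (2,): 0.5}                       # fuzzy relation over schema (Y,)

def fuzzy_select(rel, mu_P):
    """Eq. (55): min of the original degree and the degree of the fuzzy condition."""
    return {t: min(m, mu_P(t)) for t, m in rel.items()}

def fuzzy_project(rel, positions):
    """Eq. (56): keep the given positions, take the max degree over the removed ones."""
    out = {}
    for t, m in rel.items():
        z = tuple(t[i] for i in positions)
        out[z] = max(out.get(z, 0.0), m)
    return out

def fuzzy_divide(r, s, implication=lambda x, y: 1.0 if x <= y else y):
    """Eq. (57) with the Goedel implication: min over u of mu_S(u) -> mu_R(t, u)."""
    xs = {t[0] for t in r}
    return {x: min(implication(m, r.get((x,) + u, 0.0)) for u, m in s.items())
            for x in xs}

mu_P = lambda t: {1: 0.9, 2: 0.3}[t[1]]          # an illustrative fuzzy condition on Y
print(fuzzy_select(R, mu_P))
print(fuzzy_project(R, (0,)))                    # {('a',): 0.9, ('b',): 1.0}
print(fuzzy_divide(R, S))                        # {'a': 1.0, 'b': 0.4}
```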
3.6 PRUF

Above we summarized how the semantics of propositions may be expressed in terms of possibility distributions (more generally, of possibility theory). So far, we have only discussed simple propositions exemplified by Eq. (35); however, natural languages are syntactically much richer. Zadeh [53] classified the fuzzy propositions of natural languages into five types: simple fuzzy propositions (type I), modified fuzzy propositions (type II), composed fuzzy propositions (type III), quantified fuzzy propositions (type IV), and qualified fuzzy propositions (type V). He proposed a semantic interpretation for each type in terms of possibility distributions, in a way analogous to Eq. (36). In Fig. 2 below we collect examples of propositions belonging to each type together with the corresponding induced possibility distributions.

Type  Example                                                        Induced possibility distribution
I     Maria is young                                                 πage = YOUNG
II    Bob is very tall                                               πheight = VERY_TALL
III   Maria is young and Bob is very tall                            π(age, height) = YOUNG × VERY_TALL
IV    Most students are young                                        πcount(age) = MOST, YOUNG
V     It is quite true (probable or impossible) that Maria is young  πage = QUITE_TRUE, YOUNG
Fig. 2. Classification of fuzzy propositions [53]
We may observe that: QUITE_TRUE, YOUNG denotes the function µQUITE_TRUE(µYOUNG(x)); YOUNG × VERY_TALL is a Cartesian product of fuzzy set YOUNG and the modified fuzzy set VERY_TALL; and MOST is a quantifier.
4 Taxonomy of flexible query languages (FQLs) for crisp relational databases

In this section we propose a taxonomy for the main proposals found in the literature related to FQLs for crisp relational databases. The purpose of this taxonomy is to offer a structured view of the topic and to highlight the main differences and similarities between the various approaches. Further, we believe that this taxonomy can offer some guidance and clarification about the most relevant proposals in this research area.

4.1 Taxonomy of FQLs for crisp databases

In Fig. 4 we present the main approaches to this topic, grouped in four categories. Group 1, denoted basic fuzzy predicates, includes the first approach that used fuzzy predicates in queries. Group 2, denoted flexible aggregation operators, presents the proposals that study the flexible aggregation of partial matching degrees via linguistic quantifiers and importance weights. Group 3, denoted SQL extensions, presents proposals introducing elements of the fuzzy querying paradigm into SQL. Finally, Group 4, denoted PRUF and flexible querying, describes proposals that focus on the interpretation of natural language expressions for the purposes of querying.

FQL Taxonomy                        Proposals of FQLs for crisp relational databases
G1. Basic fuzzy predicates          P1. Tahani [44]
G2. Flexible aggregation operators  P2. Kacprzyk, Zadrozny and Ziolkowski [26],[27]
                                    P3. Bosc and Pivert [6]
                                    P4. Dubois and Prade [16]
G3. SQL extensions                  P5. (SQLf) Bosc and Pivert [3],[5],[7]
                                    P6. (FQUERY) Kacprzyk and Zadrozny [21]
G4. PRUF and flexible querying      P7. Takahashi [45],[46]

Fig. 4. Main approaches to Flexible Query Languages (FQLs) for crisp relational databases
Since the SQL SELECT command is the standard for querying crisp relational databases, we will also use it for illustrative purposes and to clearly distinguish the taxonomy groups. The first two groups (G1-G2) of proposals, basic fuzzy predicates and flexible aggregation operators, have a theoretical character and essentially extend the condition of the SELECT command's WHERE clause by incorporating linguistic (fuzzy) terms and flexible aggregation operators (connectives). The third group (G3) of proposals, SQL extensions, comprises more practical approaches that embed fuzzy predicates in the syntax of standard SQL. For example, one of the proposals (FQUERY for Access [21]) implements fuzzy predicates in the WHERE clause, while another (SQLf [7]) specifies a new language that extends SQL by incorporating fuzzy predicates not only in the WHERE clause but wherever it makes sense. The fourth group (G4) of proposals, PRUF and flexible querying, instead of modifying the SQL SELECT command proposes a set of natural language query types that includes an adequate subset of fuzzy propositions. The proposals included in G4 are based on the PRUF language [53], and the mismatch between the natural language queries and the information stored in the crisp relational database is bridged by translation rules that perform the respective conversion.

Next, we present a summary of the proposals by the authors listed under each group of the taxonomy. We will use Px to refer to each author's proposal, where x is the sequential number of the author in the taxonomy. The description by thematic group highlights the resemblances of the approaches in terms of the topic of study and should simplify the reader's navigation of such a vast literature. Although many other authors have also contributed to the advancement of flexible querying in crisp databases, in this work we selected the most representative for each topic.

G1 Basic fuzzy predicates

P1. Tahani [44] was the first author to propose the use of fuzzy sets theory to improve the flexibility of crisp database queries. He proposed a formal approach and architecture to deal with simple fuzzy queries for crisp relational databases. The queries are based on the SEQUEL language. The idea may be best illustrated by the query:

Q4 - Find the names and department numbers of employees who are young and have a high salary OR those who are young and have low commission

Using our database scheme defined in Eqs. (3)-(4) and Tahani's proposal, such a query may be expressed as:
SELECT name, #dep
FROM Employees
WHERE age = YOUNG AND (sal = HIGH OR commission = LOW)
                                                                     (58)
Thus, Tahani proposed to use in the query condition vague terms typical for natural language. Syntactically, they are represented as fuzzy predicates. Their semantics is provided by appropriate fuzzy sets. Then, the main question is how the matching degree for each particular row is computed. For that purpose Tahani defines the matching function γ. For a tuple ti and a simple query P of type A = v, where A is an attribute (e.g., age) and v is a vague (fuzzy) term (e.g., young), the value of the function γ is:
γ (P,ti) = µv(u)
(59)
where u is the value of the attribute A in tuple ti. For example:
γ(AGE = young, <22, Arthur, 1, 30, programmer, 1500, 50, New York>) = 0.5      (60)

if µyoung(30) = 0.5.
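In code, the matching function of Eqs. (59)-(60) simply evaluates the membership function of the fuzzy term on the stored attribute value, as in the following sketch (the shape of young is an illustrative assumption):

```python
def mu_young(age):
    """Illustrative membership function for the fuzzy term 'young'."""
    return 1.0 if age <= 25 else max(0.0, 1.0 - (age - 25) / 10.0)

def gamma_simple(attribute, mu_term, tuple_):
    """Eq. (59): matching degree of a simple condition 'attribute = term'."""
    return mu_term(tuple_[attribute])

t = {"#emp": 22, "name": "Arthur", "#dep": 1, "age": 30,
     "job": "programmer", "sal": 1500, "commission": 50, "town": "New York"}

print(gamma_simple("age", mu_young, t))   # 0.5, as in Eq. (60)
# Compound conditions combine such degrees with min, max and 1 - (.),
# following Eqs. (61)-(63) below.
```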
γ(P1 AND P2, ti) = min (γ(P1, ti), γ(P2, ti))
(61)
γ (P1 OR P2, ti) = max (γ(P1, ti), γ(P2, ti))
(62)
γ(¬ P, ti) = 1-γ(P, ti)
(63)
where P, P1, P2 are queries. Thus, the logical connectives in the queries are interpreted as the original Zadeh fuzzy connectives.

G2 Flexible aggregation operators

P2. Kacprzyk, Zadrozny and Ziolkowski [26],[27] were the first to propose that the aggregation of partial queries (predicates, conditions) be guided by a linguistic quantifier (see Section 3). Thus, they proposed to extend the querying language so that the selection operator's condition may be expressed as:

P = Q out of {P1, …, Pn}
(64)
where Q is a linguistic (fuzzy) quantifier and Pi are partial conditions to be aggregated. Thus, the overall matching degree may be computed using any of the approaches discussed in Section 3. In [27] and [26] the original Zadeh’s approach has been adopted, but later in [22] the authors used the OWA operators as the model for the linguistic quantifier. Both type I and type II linguistically quantified propositions (see Section 3) were studied in this context by the authors. In the latter case query (64) may be extended to: P = Q important out of {P1, …, Pn},
(65)
where the importance is represented by a fuzzy set over the Pi's, in the sense that the value of the membership function for a given Pi is equal to its importance weight. Thus, in (65) importance corresponds to B in "Q B's are A's" of Section 3.

P3. Another scheme for the aggregation of fuzzy conditions of varying importance has been studied by Bosc and Pivert [6]. They proposed a fuzzy operator for the hierarchical aggregation of fuzzy conditions, which extends the concept of hierarchical aggregation given by Lacroix and Lavency for crisp conditions [30]. Lacroix and Lavency proposed to extend the concept of classical crisp queries in the following way. A query has two parts: a selection part, S, and a set of crisp conditions, PRF, called preferences. The semantics of this query is the following: select the tuples satisfying S and rank them according to PRF. More precisely, if there are no tuples satisfying condition S then the answer to the query is empty. Otherwise, the answer comprises the tuples that verify S and at the same time best satisfy PRF. In the latter case, various assumptions on the interrelation of the conditions belonging to PRF may be made. Two cases are considered: (a) the conditions are equally important; (b) there is a (linear) hierarchy of conditions, where those higher in the hierarchy are more important. Thus, in the second case the importance of the conditions is imposed not by numerical weights, but by their position in the hierarchy. The ranking of the tuples depends on which assumption is made: (a) or (b). In the first case, the count of the conditions in PRF that are satisfied by a tuple is taken into account. In the second case, the lexicographic ordering of the tuples according to their fulfillment of the particular conditions belonging to PRF (taken in the order imposed by the hierarchy) is employed.

Bosc and Pivert [6] proposed a fuzzy operator N to model the hierarchical aggregation described above, in which the contribution of a condition Pi ∈ PRF is less than or equal to the contribution of the conditions higher in the hierarchy. Let us assume that the conditions are ordered according to the hierarchy, i.e., if i < j then Pi is higher in the hierarchy (is more important) than Pj. The fuzzy operator proposed is defined as a combination of two operators. The first, denoted below by µ´Pi,
limits the contribution of condition Pj relative to the contributions of all preceding conditions Pi (i < j), while the second combines all contributions to obtain the final value for the aggregation of the fuzzy conditions:
µN(P1,…,Pn)(t) = (1/n) ∑i=1,…,n µ´Pi(t)                              (66)
where µ´Pi(t) = minj≤i µPj(t). This operator expresses the degree to which a tuple t satisfies the hierarchical aggregation of the fuzzy conditions. The authors also considered another version of their operator, replacing the arithmetic mean by a weighted average. Bosc and Pivert adopt a different interpretation of the hierarchy of conditions than originally assumed by Lacroix and Lavency. Namely, in the latter case, if no tuple satisfies a condition from, e.g., the k-th level of the hierarchy, then the conditions of the lower levels do play a role in the ranking of tuples. In the former approach, all these lower levels are neglected.

P4. Dubois and Prade (cf. [16]) studied the question of conditions Pi with varying degrees of importance forming together a compound condition P via conjunction. The first model associates an importance weight wi with each elementary condition Pi, and the matching degree of the weighted condition Pi against a value u of the corresponding attribute is given by the following equation:
µPi*(u) = max(µPi(u), 1 − wi)                                        (67)
where Pi* denotes the condition Pi with an importance associated to it. Then, the matching degree of the condition P is calculated using the standard min operator:
µP(u) = min i=1,…,n µPi*(u) = min i=1,…,n max(µPi(u), 1 − wi)        (68)
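A numeric sketch of Eqs. (67)-(68): each elementary degree is first relaxed by max(., 1 − wi) and the relaxed degrees are then combined with min; all degrees and weights below are invented.

```python
def weighted_min(degrees, weights):
    """Eqs. (67)-(68): importance-weighted conjunction of elementary conditions."""
    relaxed = [max(d, 1.0 - w) for d, w in zip(degrees, weights)]   # Eq. (67)
    return min(relaxed)                                             # Eq. (68)

degrees = [0.9, 0.3, 0.6]          # matching degrees of P1, P2, P3 (invented)
weights = [1.0, 0.4, 0.7]          # importances: P1 is vital, P2 is secondary

print(round(weighted_min(degrees, weights), 2))            # 0.6
# With w2 = 0.4 the poor score of P2 is softened to max(0.3, 0.6) = 0.6;
# with w2 = 1.0 it would drag the overall degree down to 0.3.
print(round(weighted_min(degrees, [1.0, 1.0, 0.7]), 2))    # 0.3
```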
When the importance is minimal (wi = 0), the condition Pi is not considered in the evaluation. On the other hand, with wi = 1, the evaluation of condition Pi is vital for the evaluation of condition P. This model has been refined (see in [16]) to deal with a variable importance wi, depending on the matching degree of the associated elementary condition. For example, in a specific context it may be useful to assume wi constant for relatively high satisfaction of the elementary condition, but extremely low satisfaction should be reflected more strongly in the overall matching degree by automatically increasing wi. For example, if we want to buy a car and one of our criteria is a moderate price, but it is not our primary criterion (condition), then we assume an importance weight smaller than 1.0. However, if a particular car has a very high price, the price criterion becomes more important (wi = 1) and the car is rejected by having a very low satisfaction membership value.

The second model, originally proposed by Yager in 1984 (see in [16]), considers a threshold αi for each elementary condition Pi. If condition Pi is satisfied to a degree above the threshold αi, that is, µPi(u) ≥ αi, the resulting partial matching degree becomes 1, i.e., µPi*(u) = 1. On the other hand, if the threshold is not reached, i.e., µPi(u) < αi, then we may consider two ways of evaluating the condition: (a) µPi*(u) = µPi(u) or (b) µPi*(u) = µPi(u)/αi. It turns out that both ways may be
expressed by a formula:
µP(u) = min i=1,…,n µPi*(u) = min i=1,…,n (αi → µPi(u))              (69)
where → is a fuzzy implication operator. Then, using the Gödel implication (see Eq. (22)) and the Goguen implication (see Eq. (23)) we obtain (a) and (b), respectively. Note that the first model of importance proposed by Dubois and Prade (see in [16]) and formalized by Eq. (67) is also covered by the general formula (69) when the Kleene-Dienes implication (see Eq. (20)) is assumed. Still another model of importance applicable to the aggregation of partial matching degrees has been proposed by Dubois and Prade [16], in which they use conditional requirements P1 → P2 to provide an interpretation for the hierarchical aggregation of fuzzy predicates. The authors consider a context similar to that of the paper by Lacroix and Lavency [30]. An overall condition P is considered to be a sequence of elementary conditions Pi, i = 1, …, n, accompanied by importance weights (called here priorities). It is interpreted in such a way that "P1 should be satisfied (with priority 1) and among the solutions meeting P1 (if any) the ones satisfying P2 are preferred (with priority α2), and among those satisfying both P1 and P2, those satisfying P3 are preferred with priority α3 (α3 < α2 < 1) and so on". This may be interpreted as nested implication operators: P1 → (P2 → (P3 → …)). The overall matching degree (the result of the aggregation) may thus be represented by the following membership function, defining a fuzzy set of elements (rows) satisfying P (when P consists of 3 partial predicates):
µP(u) = min( µP1(u),
             max(µP2(u), 1 − min(µP1(u), α2)),
             max(µP3(u), 1 − min(µP1(u), µP2(u), α3)) )              (70)
where min(µP1(u), α2) and min(µP1(u), µP2(u), α3) are the priority levels (corresponding to wi in Eq. (67)) of the fuzzy conditions P2 and P3, respectively. Hence, concerning the predicate P2, its priority is α2 if P1 is fully satisfied and is zero if P1 is not at all satisfied, which reflects the fact that P2 is conditioned by P1. This is another example of a variable importance weight, but this time depending on the satisfaction of the "preceding" partial condition. Notice that the hierarchy (nesting) of the conditions is used here in the same sense as in Bosc and Pivert's approach rather than in Lacroix and Lavency's sense.

G3. SQL extensions

In G1 and G2 we reviewed proposals for the application of fuzzy conditions in the context of crisp relational database querying. In terms of relational algebra we considered them as extensions to the selection operation, making it possible to use linguistic terms and flexible aggregation operators in conditions. Here, we discuss the two most popular proposals extending the de facto standard querying language of relational databases, i.e., SQL. Notice that already in Section G1 we discussed Tahani's approach, which refers directly to SQL (or SEQUEL). However, both approaches discussed here, SQLf and FQUERY for Access, differ from Tahani's approach. The first is an extension to the SQL syntax introducing linguistic (fuzzy) terms wherever it makes sense, and the second is an example of an implementation of a specific "fuzzy extension" to SQL for Microsoft Access®, a popular desktop DBMS.

P5. The previously discussed approaches concentrated on the fuzzification of conditions appearing in the WHERE clause of SQL's SELECT instruction. Bosc and Pivert [5],[7],[8] proposed a new language, called SQLf, that is a much more comprehensive and complete fuzzy extension of the crisp SQL language. In SQLf linguistic terms may appear as:

1. fuzzy values, relations and quantifiers (as aggregation operators) in the WHERE and HAVING clauses;
2. linguistic quantifiers, in addition to the classical EXISTS and ALL quantifiers, used together with subqueries.

Moreover, the authors observe that in the case of complex SQL queries involving linguistic (fuzzy) terms the partial results are fuzzy relations. Thus, all operations of relational algebra (implicitly or explicitly used in SQL's SELECT instruction) have to be redefined to properly process fuzzy relations. Hence, the union, intersection and difference operations are considered. Special attention is also paid to the division operation, which may be interpreted in different ways due to the many possible versions of the implication available in fuzzy logic. Other typical operations of SQL require a redefinition as well, including the partition of relations (fuzzy
relations) with the operator (clause) GROUP BY and the operators IN and NOT IN used together with subqueries. All the features of SQL just mentioned were extended so as to preserve the equivalences that occur in crisp SQL. To illustrate this work, we describe below an extension of the nesting operator IN and how the partition of relations is adapted to the case of a fuzzy relation in SQLf.

SQLf allows fuzzy conditions as described in G2. For example, the query [7]:

Q5 - Find the young employees who work in a high-budget department

can be expressed in SQLf as [7]:

SELECT #emp
FROM Employees
WHERE age = 'young' AND #dep IN
      (SELECT #dep FROM Departments WHERE budget = 'high')
(71)
where the result of the subquery is a fuzzy relation. Consequently, the meaning of the condition a IN E, where a is an element and E is a crisp relation, must be extended to deal with fuzzy relations. Fuzzy sets theory suggests the following definition for the predicate IN [7]:
µIN(a, A) = supb∈support(A) min(µ=(a, b), µA(b))                     (72)
where a is an element and A is a fuzzy set. In the case of the classical identity relation "=" this boils down to µIN(a, A) = µA(a). Bosc and Pivert [7] propose to obtain more flexibility by replacing "=" with another operator referring to the similarity between elements. This leads to the concept of a fuzzy membership operator, INF. For example, the query [7]:

Q6 - Find the employees who work in a department whose budget is about 1000 times their own salary

can be expressed in SQLf as [7]:

SELECT #emp
FROM Employees
WHERE sal * 1000 INF
      (SELECT budget FROM Departments WHERE #dep = Employees.#dep)
(73)
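The semantics of Eq. (72) can be sketched as follows for a subquery result represented as a fuzzy set; replacing the crisp equality by an assumed similarity relation on numbers mimics the behaviour of the INF operator used in query (73).

```python
def fuzzy_in(a, fuzzy_set, mu_eq=lambda a, b: 1.0 if a == b else 0.0):
    """Eq. (72): sup over the support of min(mu_=(a, b), mu_A(b))."""
    return max((min(mu_eq(a, b), m) for b, m in fuzzy_set.items() if m > 0),
               default=0.0)

# Fuzzy relation returned by the subquery "budget = 'high'" (degrees are invented).
high_budget_deps = {130000: 1.0, 110000: 0.8, 90000: 0.3}

# Crisp IN: does the value literally belong to the (fuzzy) set of budgets?
print(fuzzy_in(110000, high_budget_deps))                    # 0.8

# INF: replace "=" by an approximate-equality relation on numbers (illustrative).
approx = lambda a, b: max(0.0, 1.0 - abs(a - b) / 20000.0)
print(round(fuzzy_in(95000, high_budget_deps, approx), 2))   # 0.3
```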
In SQLf, the HAVING clause is extended in two ways: with a fuzzy condition and/or fuzzy quantified proposition. As an example of a fuzzy condition, we can
replace the identity operator "=" with the similarity operator ≈. As an example of using fuzzy quantified propositions, the query [7]:

Q7 - Find the departments where most of the young employees are well paid

may be expressed in SQLf as [7]:

SELECT #dep
FROM Employees
GROUP BY #dep
HAVING most (age = 'young') ARE (sal = 'well paid')
(74)
Recently, the authors of SQLf have been working on the interpretation of SQL's aggregate functions, such as MAX and AVG, for the case of fuzzy relations. For example, it is not clear what should be the answer to the query "Find the maximum salary of young employees". For a discussion of this topic see, e.g., [15],[10],[11].

P6. Kacprzyk and Zadrozny [21] started with the syntax of the SQL language as implemented in the Microsoft Access® DBMS. The authors proposed to include in the language linguistic (fuzzy) terms (predicates) in the spirit of the approaches discussed in Section G1. More specifically, they proposed to take into account the following types of linguistic terms:

• fuzzy values (e.g. YOUNG)
• fuzzy comparators (e.g. MUCH GREATER THAN)
• fuzzy quantifiers (e.g. MOST)

The matching degree of relevant rows is calculated according to the previously discussed semantics of fuzzy conditions and linguistically quantified propositions. Kacprzyk and Zadrozny [21] have implemented this extension to SQL as an add-in, called FQUERY for Access, for the Microsoft Access package, thus extending the native Access querying interface with the capability of manipulating linguistic terms. In FQUERY for Access, the user composes a query using a Query By Example type of user interface provided by the host environment, i.e., Microsoft Access. The query is executed more or less as a regular Access query, while FQUERY is responsible for the calculation of the matching degrees of the rows, interpreting linguistic terms in an appropriate way. The resulting rows of the answer relation are ordered nonincreasingly with respect to the matching degree. FQUERY for Access was one of the first implementations demonstrating the usefulness of fuzzy querying features for a crisp database.

Besides the syntax and semantics of the extended SQL, the authors also proposed a scheme for the elicitation and manipulation of the linguistic terms to be used in queries. The problem has been solved in accordance with the relational data model paradigm. Linguistic terms are maintained by FQUERY in a
dictionary, de facto another system table storing metadata, as in regular relational database management systems. The concept of FQUERY for Access has later been developed in two directions. In [55] and [23] the very same concept has been applied in the environment of the Internet WWW service. Another interesting line of development [24],[25] boils down to the addition of some data mining capabilities to the existing fuzzy querying interface. Such a combined interface partially exploits the same modules and data structures and seems to be a promising direction for the development of advanced data analysis tools.

G4. PRUF and flexible querying

P7. Takahashi [45],[46] proposed a flexible query language (FQL) that is an extension of the domain relational calculus (DRC). Thus, in fact, he proposed to use a fuzzy logic language instead of a classical logic language to express the conditions that requested data should meet. The author follows to some extent the idea of Zadeh's PRUF (see Section 3.6). However, it seems that the grammar proposed by the author is unnecessarily complicated. As in the case of the crisp DRC, the result of a query expressed as in (7) is a set of rows satisfying a formula. Note that these rows may come directly from the relations defined in a database or can be "constructed" from existing relations (e.g., as in the algebraic join operation). There are two problems with Takahashi's language. First, he does not discuss any extension of the concept of a safe formula [47] for his language, so some queries (formulas) may produce an infinite number of rows as an answer. Second, Takahashi refers to Zadeh's PRUF, which is not fully sound in the context of a querying language. Basically, PRUF provides semantics for a subset of natural language propositions. This semantics is based on possibility theory and is thus more appropriate for the representation of imprecise facts in a database than for the interpretation of the meaning of a query. Anyway, the calculations overlap to some extent, and those proposed by Takahashi for the assessment of matching degrees are still valid. An experimental application for querying a business database using the PRUF translation rules (Section 3), which showed the flexibility of the PRUF rules, was developed in [40].
5 Flexible query languages (FQLs) for fuzzy relational databases

Several fuzzy querying models for incomplete information in fuzzy relational databases have been proposed in the literature (see for instance [3]). Usually, each model requires a specialised querying formalism. In this section we propose a taxonomy for the main proposals found in the literature related to fuzzy
relational databases. The objective of this taxonomy is to offer a structured view of the topic. We hope to shed some light on the main differences and similarities between the particular approaches. Further, we hope the taxonomy can offer some guidance and clarification about the most relevant proposals in the area.

5.1 Taxonomy of FQLs for fuzzy databases

In Fig. 5 we propose a classification of the different FQLs proposed in the literature for fuzzy relational databases.
Taxonomy:
G1. Possibilistic model
G2. Similarity-based model
G3. Hybrid models

Proposals of FQLs for fuzzy databases:
P1. Prade and Testemale [34],[35]
P2. Baldwin, Coyne and Martin [1]
P3. Bosc and Pivert [8]
P4. Buckles and Petry [12]
P5. Buckles, Petry and Sachar [13]
P6. Shenoi, Melton and Fan [42],[43]
P7. Medina, Pons and Vila [32]
P8. Galindo, Medina and Aranda [18]
P9. Galindo, Medina, Pons and Cubero [20]
Fig. 5. Taxonomy for Flexible Query Languages (FQLs) for fuzzy databases

The proposed taxonomy includes three main groups. Group 1 (G1) is devoted to proposals related to possibilistic fuzzy databases. Group 2 (G2) includes proposals relevant for similarity-based models. Group 3 (G3) presents proposals for hybrid models, i.e. combined possibilistic and similarity-based models. The literature on fuzzy databases is much richer and includes, among others, [48], [49], [56]. We selected some representative proposals, as listed in Fig. 5, and now examine them in more detail.

G1 Possibilistic model

P1. Prade and Testemale [34] generalize the concept of a relational database in such a way that the value dij of the j-th attribute in tuple ti may be given as a possibility distribution. This makes it possible to store incomplete or imprecise information. The idea may be best illustrated with an example [34]. Let us consider the PERSON relation (Fig. 6) storing information about students, where M1
corresponds to the grade in mathematics during the first quarter, and NAME and AGE represent the name and age of a student. All numerical and non-numerical values in the columns AGE and M1 may be easily represented using appropriate possibility distributions. For instance, the value used for Jill's AGE attribute means nothing more than that our knowledge about her age is drawn from the proposition "Jill is young". This proposition induces a possibility distribution on the domain of the AGE attribute. Because we do not know what Jill's exact age is, we can only assign a possibility degree to each potential number representing her age. This distinction between a fuzzy set and an induced possibility distribution is important for approaches dealing with relational databases.
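As an illustration only (the membership shapes below are assumptions, not data taken from [34]), the heterogeneous AGE values appearing in Fig. 6 can all be encoded as possibility distributions over one discrete domain, for instance in Python:

    AGES = range(18, 36)

    def crisp(v):                    # e.g. David's age 20
        return {a: 1.0 if a == v else 0.0 for a in AGES}

    def interval(lo, hi):            # e.g. Jack's age [22, 25]
        return {a: 1.0 if lo <= a <= hi else 0.0 for a in AGES}

    def about(v, spread=2):          # e.g. Jane's age about_21 (assumed triangular shape)
        return {a: max(0.0, 1 - abs(a - v) / spread) for a in AGES}

    def young():                     # e.g. Jill's age Young (assumed shape)
        return {a: 1.0 if a <= 20 else max(0.0, 1 - (a - 20) / 5) for a in AGES}

    age_column = {'Tom': young(), 'David': crisp(20), 'Jane': about(21), 'Jack': interval(22, 25)}
    print(age_column['Jack'])        # completely possible only for 22..25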
Person:

Name    Age         M1
Tom     Young       15
David   20          Rather_bad
Bob     22          bad_to_very_bad
Jane    about_21    Rather_good
Jill    Young       around_10
Joe     about_23    [14, 16]
Jack    [22, 25]    Unknown
Fig. 6. Example from Prade and Testemale [34]

Prade and Testemale [34] proposed to adapt the classical relational algebra to the case of the possibilistic database. Thus, all standard operations of selection, Cartesian product, join (see [4] for some problems concerning the possibilistic join), projection, union and intersection are extended. In order to illustrate the algebra, we discuss the selection operation and give a relevant example of this type of query. The selection σP(R) of a relation R upon a condition P may refer to two types of atomic conditions:
(a) Aj θ a, where Aj is the name of an attribute, θ is a comparison operator (fuzzy or not) and a is a constant (fuzzy or not);
(b) Aj θ Ak, where Ak is also an attribute name.
More complex conditions can be built from the above atomic conditions and the logical connectives of negation, disjunction and conjunction. The matching degree of an atomic condition and a tuple ti is computed as a pair: the possibility and necessity measures (with respect to the possibility distributions dij and dik) of relevant sets. In case (a) it is the set, crisp or fuzzy, of the elements belonging to the domain of Aj and being in relation θ (crisp or fuzzy) with the constant a. In case (b) it is the subset of the Cartesian product of
domains of Aj and Ak containing only the pairs of elements being in relation θ. In this case a joint possibility distribution over the Cartesian product of the domains of Aj and Ak is used. Formally, the matching degree for case (a) is computed as follows. Let us denote by F a set (in general fuzzy) whose possibility and necessity measures have to be computed. Its membership function for the elements of the domain of attribute Aj is:

µ_F(d) = sup_{d'∈D} min(µ_θ(d, d'), µ_a(d')),   d ∈ Dom(Aj)        (75)

Now, the possibility and necessity measures of the set F with respect to the possibility distribution π_dij being the value of dij are computed as:

Π_dij(F) = sup_{d∈D} min(π_dij(d), µ_F(d))        (76)

N_dij(F) = inf_{d∈D} max(1 − π_dij(d), µ_F(d))        (77)

In case (b) the set F comprises the pairs of elements (d, d'), d ∈ Dom(Aj), d' ∈ Dom(Ak), such that d θ d' is satisfied. Thus, its membership function is identical to that of θ:

µ_F(d, d') = µ_θ(d, d')        (78)

Now we compute the possibility and necessity measures with respect to a joint possibility distribution π_(dij,dik):

π_(dij,dik)(d, d') = min(π_dij(d), π_dik(d')),   ∀(d, d') ∈ D × D        (79)

Then, the possibility and necessity measures are computed as previously:

Π_(dij,dik)(F) = sup_{(d,d')∈D×D} min(π_(dij,dik)(d, d'), µ_F(d, d'))        (80)

N_(dij,dik)(F) = inf_{(d,d')∈D×D} max(1 − π_(dij,dik)(d, d'), µ_F(d, d'))        (81)
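A minimal sketch of the case (a) computation, Eqs. (75)-(77), may help fix ideas. It assumes finite domains and dictionary-encoded fuzzy sets; the "young" and "about 21" shapes below are illustrative assumptions, not data from [34], and the code is not Prade and Testemale's implementation.

    def mu_F(domain, mu_theta, mu_a):
        # Eq. (75): elements of the domain in (fuzzy) relation theta with constant a
        return {d: max((min(mu_theta(d, d2), m) for d2, m in mu_a.items()), default=0.0)
                for d in domain}

    def possibility(pi, f):
        # Eq. (76): Pi(F) = sup_d min(pi(d), mu_F(d))
        return max(min(pi[d], f[d]) for d in f)

    def necessity(pi, f):
        # Eq. (77): N(F) = inf_d max(1 - pi(d), mu_F(d))
        return min(max(1.0 - pi[d], f[d]) for d in f)

    domain = list(range(18, 31))
    jill_age = {a: 1.0 if a <= 20 else max(0.0, 1 - (a - 20) / 5) for a in domain}  # "young"
    about_21 = {a: max(0.0, 1 - abs(a - 21) / 2) for a in domain}                   # fuzzy constant a
    equals = lambda d, d2: 1.0 if d == d2 else 0.0                                  # crisp theta "="
    f = mu_F(domain, equals, about_21)
    print(possibility(jill_age, f), necessity(jill_age, f))   # e.g. 0.8 and 0.0

With these assumed shapes, the condition "AGE = about_21" applied to Jill is possible to degree 0.8 but not at all certain, which is exactly the kind of (Π, N) pair used as the matching degree of a tuple.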
P2. Baldwin and his collaborators [1] implemented a system for querying a fuzzy relational database that uses semantic unification and the evidential logic rule. The value of an attribute in the database may be either a crisp value or a possibility
distribution of values. The queries are composed of one or more conditions (corresponding to attributes of the database), each with an importance weight, and they are further characterized by a filtering function (similar to the notion of a quantifier) and a threshold. The specific feature of their work [1] is the process of semantic unification used for matching the fuzzy values of the conditions with the possibility distributions of the corresponding attributes of a tuple in the database. That process is based on the mass assignment theory developed by Baldwin [2], which gives, for the matching of two fuzzy sets, an interval [n, p], where n (necessity) is the certain degree of matching and p (possibility) is the maximum possible degree of matching. We describe next how that interval is computed. The mass assignment theory [2] provides a bridge between two forms of uncertainty: probability and fuzziness. A fuzzy set induces a family of probability distributions that can be represented by a function called a mass assignment. The interpretation of this translation (fuzzy set to mass assignment) may be briefly explained as follows [38]. A mass assignment is a function m: 2^X → [0,1] such that m(Ai) ≥ 0 and

∑_i m(Ai) = 1        (82)

where the Ai are subsets of a set X = {x1, x2, …, xn} such that Ai = {x1, …, xi}; hence A1 ⊂ A2 ⊂ … ⊂ An. Note that these expressions mean that m represents a family of probability distributions. A fuzzy set is converted to a mass assignment in the following way. The normalized fuzzy set A = x1/µ(x1) + x2/µ(x2) + … + xn/µ(xn), where µ(x1) = 1 and µ(x1) ≥ µ(x2) ≥ … ≥ µ(xn), induces a possibility distribution π such that πx(xi) = µA(xi) (see Eq. (36)). Then, the mass assignment m associated with πx is defined over the subsets Ai as:

m(Ai) = mi        (83)

where mi = πx(xi) − πx(xi+1), πx(x1) = 1 and πx(xn+1) = 0. Semantic unification computes the matching of two fuzzy sets by calculating the mass assignment m(A|A'), which represents a conditional probability distribution of A given A' over the truth set {t, f, u}. The resulting support pair (Sn, Sp) is obtained as the sum of the masses of the true values (t), in the case of Sn, and the sum of the masses of the true (t) and uncertain (u) values, in the case of Sp. For example [1], the support pair for matching the two fuzzy sets

cheap = 10/1 + 20/0.5 + 25/0.25 + 30/0.01
average = 20/0.01 + 25/0.5 + 30/1        (84)
is computed by calculating the following matrix of mass assignments m(cheap | average):

{Mi}: mi                 {30}: 0.5    {30, 25}: 0.49   {30, 25, 20}: 0.01
{10}: 0.5                f            f                f
{10, 20}: 0.25           f            f                u: 0.0025
{10, 20, 25}: 0.24       f            u: 0.1176        u: 0.0024
{10, 20, 25, 30}: 0.01   t: 0.005     t: 0.0049        t: 0.0001

Fig. 7. Example from Baldwin et al. [1]

to which corresponds the following support pair:

(Sn, Sp) = (0.005 + 0.0049 + 0.0001, 0.0025 + 0.1176 + 0.0024 + Sn) = (0.01, 0.1325)

Afterwards, they combine the different matching degrees of the conditions with their importances using a process called the evidential support logic rule. This rule uses a function called an S function that acts as an aggregation operator OR or AND, or even as something between these two (similar to quantifiers). Formally, the support pair (Sn, Sp) for a tuple is computed in the following way:
(Sn, Sp) = ( S(∑_{j=1}^{n} wj αj), S(∑_{j=1}^{n} wj βj) )        (85)

where (αj, βj) are the support pairs for the n conditions, wj are their importances, and ∑_{j=1}^{n} wj = 1.
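Before the worked example below, the conversion of a fuzzy set into a mass assignment (Eqs. (82)-(83)) and a point-valued semantic unification can be sketched as follows. This is an illustration under simplifying assumptions (finite fuzzy sets, crisp subset/overlap tests for t/u/f), not the FRIL implementation of [1],[2]:

    def mass_assignment(fuzzy_set):
        # fuzzy_set: dict value -> membership, assumed normalized (max membership 1.0)
        items = sorted(fuzzy_set.items(), key=lambda kv: -kv[1])
        masses = []
        for i, (_, mu) in enumerate(items):
            nxt = items[i + 1][1] if i + 1 < len(items) else 0.0
            if mu - nxt > 0:
                masses.append((frozenset(v for v, _ in items[:i + 1]), mu - nxt))  # Eq. (83)
        return masses

    def semantic_unification(query_fs, data_fs):
        # support pair (Sn, Sp) for m(query | data)
        sn = sp = 0.0
        for qset, qm in mass_assignment(query_fs):
            for dset, dm in mass_assignment(data_fs):
                if dset <= qset:            # data focal set inside query focal set -> true
                    sn += qm * dm
                    sp += qm * dm
                elif qset & dset:           # partial overlap -> uncertain
                    sp += qm * dm
        return sn, sp

    cheap = {10: 1.0, 20: 0.5, 25: 0.25, 30: 0.01}
    average = {20: 0.01, 25: 0.5, 30: 1.0}
    print(semantic_unification(cheap, average))   # roughly (0.01, 0.1325), as in Fig. 7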
Next, we present an example of using semantic unification for querying a fuzzy relational database. The following relation is a simplification of the one presented in [1], with some attributes removed.

Common_name   Upper_fur                                           Body_length
pine_marten   ({brown:1, black:0.4}(very_dark))                   (average pine_marten)
polecat       ({brown:0.7, black:0.6}(very_dark))                 (average polecat)
ferret        ({brown:0.5, black:0.7}(dark))                      (average polecat)
mink          ({brown:1, black:0.7, chocolate:0.8}(very_dark))    (average mink)

Fig. 8. Example from Baldwin et al. [1]

The attribute common_name is crisp, whereas the attributes upper_fur and body_length are fuzzy. The attribute upper_fur is a compound fuzzy attribute, which means it is composed of two possibility distributions: the first is discrete
(e.g., {brown: 1, black: 0.4}) and the second is continuous (e.g., very_dark). Body_length is defined using continuous possibility distributions such as average pine_marten, which specifies that the length is average in the context of pine_marten mammals. The answer for the query [1]:

Selection criteria. Threshold = 0.
Body_length: (average polecat), Importance: high
Upper_fur: ({brown: 1, black: 0.7}(very_dark)), Importance: low

is:

mink has support (0.40175, 1)
ferret has support (0.3, 0.728)
polecat has support (0.426, 0.88)
pine_marten has support (0.116, 0.953846)
The above answer shows that two mammals compete for the best solution, the mink and the polecat; the mink has a higher possible support and the polecat has a higher necessary support. Note that the threshold supplied by the user is a necessity threshold, which means that all mammals with a necessary support greater than or equal to zero appear in the answer.

P3. Bosc and Pivert [9] propose a new type of query for possibilistic databases that does not rely on fuzzy pattern matching (see P1). Instead, queries of this new type refer to the representation of the data (possibility distributions). In this case, an answer is either a crisp or a fuzzy set, depending on whether the query is crisp or fuzzy. Moreover, these new queries increase the expressive power of the associated FQL: besides the atomic conditions of types Aj θ a and Aj θ Ak, FQLs are enriched with new conditions, which we describe next. First, let us consider the new queries referring to only one possibility distribution. In order to express their conditions, Bosc and Pivert [9] define the following three functions:

Poss(A, {d1, …, dn}) = min(πA(d1), …, πA(dn))        (86)

Card_cut(A, λ) = |{d ∈ D: πA(d) ≥ λ}|        (87)

Card_supp(A) = |{d ∈ D: πA(d) > 0}|        (88)
where A is an attribute in the possibilistic database; d1, …,dn are values in the domain, D, of attribute A; πA is a possibility distribution representing the value of A, and λ is a number from the interval [0,1]. Function Poss(A,{d1,…,dn}) supplies
the truth degree of the statement "all the values d1, …, dn are possible for A". Functions Card_cut(A, λ) and Card_supp(A) supply the number of values that are possible for A with a possibility degree greater than or equal to λ and strictly greater than 0, respectively. Therefore, we can easily express queries of the following type [9]: "find the houses for which the price value $100.000 is considered more possible than the value $80.000" or "find the houses for which $100.000 is the only price value which is completely possible", as Poss(PRICE,{100.000}) ≥ Poss(PRICE,{80.000}) and Poss(PRICE,{100.000}) = 1 and Card_cut(PRICE,1) = 1, respectively. These conditions are Boolean and, hence, the respective answers are crisp relations. However, these conditions can be fuzzified, and then the respective answers are fuzzy relations (this corresponds to the case of fuzzy queries against crisp data, which was detailed in Section 4). In order to perform a syntactical comparison between two possibility distributions, various comparison techniques may be employed, cf. [36]. Bosc and Pivert [9] use an extended resemblance relation defined on the interval [0,1], called a fuzzy equality measure, which is defined as follows:
µ_EQ(π, π') = min_{u∈D} ψ(π(u), π'(u))        (89)
where π, π' are the two possibility distributions to be compared and ψ is a resemblance relation, i.e., a reflexive and symmetric relation defined on [0,1]. They also propose an extended formula based both on a resemblance relation over D, denoted RES, and a resemblance relation over [0,1] (a proximity measure, pr), defined as:

µ_s(π,π')(u) = sup_{v∈supp(π')} min(µ_RES(u, v), µ_pr(π(u), π'(v)))        (90)

The equation given above measures the degree to which the possibility distribution π can be replaced by the possibility distribution π' with respect to an element u belonging to the support of π. Such a replacement is acceptable (the computed degree is high) if there exists v belonging to the support of π' such that u and v are similar (in the sense of RES) and the values π(u) and π'(v) are similar (in the sense of pr). Then, the degree to which we can replace a possibility distribution π with a possibility distribution π' with respect to the whole support of π is given by the following equation [9]:

µ_repl(π,π') = inf_{u∈supp(π)} max(1 − π(u), µ_s(π,π')(u))        (91)
where the resulting degree is the weighted combination of the degrees µ_s(π,π')(u) for all the elements u in the support of π and the values π(u) are weights (see Eq. (67)). The query conditions involving two representations can be expressed as:

REP(A1) ≈ REP(A2)        (92)
where REP(A1) and REP(A2) are the "shapes" of the possibility distributions to be compared. For example [9], the condition REP(AGE) ≈ REP(middle_aged) evaluates the extent to which a value of attribute AGE (a possibility distribution) is syntactically similar to the possibility distribution induced by the fuzzy set corresponding to the middle_aged concept. Notice that in the case of fuzzy pattern matching a similar query may be used. However, it would produce the possibility/necessity measures of the event that the value of the AGE attribute belongs to the fuzzy set middle_aged, given that all we know about the age is a possibility distribution.

G2 Similarity-based model

P4. Buckles and Petry [12], [13], [33] introduced a similarity-based model for a fuzzy database in which:
1. Each domain Dj is equipped with a similarity relation Sj (see Fig. 9b)), which extends the identity relation used in the crisp relational model. It means that two values of an attribute match not only when they are identical but also when they are similar enough. Similarity relations support basic features of fuzzy querying: a query requesting a given attribute value will also be satisfied by other, similar attribute values.
2. The value of tuple i for attribute j, dij, may be any valid (that is, respecting the semantics of the relation) subset of its domain Dj except for the empty set. That is:
dij ∈ 2^Dj,  dij ≠ ∅.

This definition helps represent uncertainty within the tuples and is also a consequence of the similarity relations introduced. Fig. 9c) illustrates a relation in this model; the domain of each attribute and the corresponding similarity relations are shown in Fig. 9a) and Fig. 9b), respectively. The authors also proposed an extension of the relational algebra. The idea of this extension is illustrated on the example of the selection operation (see Fig. 9d)). The syntax of the selection operation [13] is slightly richer than the one introduced in Section 2 (note that here we use our notation, not the original one):

σP(R) with <level condition>        (93)
A = {a, b, c, d, e}      M = {α, β, χ, δ}

a) Domain sets

S1:  a    b    c    d    e          S2:  α    β    χ    δ
a    1    0.6  0.3  0.6  0.5        α    1    0.7  0.4  0.5
b    0.6  1    0.3  0.7  0.5        β    0.7  1    0.4  0.5
c    0.3  0.3  1    0.3  0.3        χ    0.4  0.4  1    0.4
d    0.6  0.7  0.3  1    0.5        δ    0.5  0.5  0.4  1
e    0.5  0.5  0.3  0.5  1

b) Similarity relations: S1 (left), S2 (right)

R2:  A       M
     a       α, δ
     c, e    β
     a       β
     a       χ

c) Relations (one in this example)

R2':  A    M              R:  A    M
      a    α, δ               a    α, β, δ
      a    β                  a    χ
      a    χ

d) R ← (σA=a(R2) with α(M) = 0.5)
Fig. 9. Example from [12]: components of a fuzzy relational database (a-c); a fuzzy relational algebra operation: selection (d).

where R and P denote, as previously, the relation and a Boolean expression, while the level condition specifies a similarity threshold, a number belonging to [0,1], for the domain Dj of an attribute Aj appearing in P. This operation selects the tuples from R that satisfy condition P, but if a subcondition of P is of the form Aj = a, where Aj is an attribute and a is a constant, then it is interpreted as "a is similar to an element of the value of Aj at least to the degree expressed by the level condition" (remember that the values of attributes are, in general, sets). See Fig. 9d), where R2' is the intermediate result of the query, obtained from the tuples satisfying the condition A = a, where A is an attribute name and a belongs to A's domain. Then, the result R is obtained by removing redundant tuples, which are identified using the following definition. Two tuples ti and tk are redundant if [12]:

I(dij ∪ dkj) ≥ α(Dj),   j = 1, 2, …, m        (94)
where dij is the value of tuple i for attribute j; Dj is the domain of attribute j; α (Dj) is the similarity threshold and I is a similarity index defined as:
I(H) = min_{x,y∈H} Sj(x, y),   H ⊆ Dj        (95)
where Sj(⋅,⋅) is the similarity relation associated with the domain Dj. In summary, two tuples are redundant if the values of all corresponding attributes are similar. Two values dij and dkj of an attribute Aj are similar if the minimum similarity degree between pairs of elements in dij ∪ dkj, I(dij ∪ dkj), is greater than or equal to the pre-specified level for this attribute, α(Dj). All queries should pre-specify these levels for all attributes involved (by default each level is assumed to be 1.0). In the example of Fig. 9d), the tuples (a,{α, δ}) and (a,β) in relation R2' are redundant because a is obviously similar to a (the similarity relation is reflexive) and β is similar to both α and δ, that is:

min{S2(α,β), S2(δ,β)} ≥ α(M) = 0.5        (96)
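This redundancy test and the merging it licenses can be sketched as follows (a minimal Python illustration with an assumed set-based encoding of attribute values; it is not Buckles and Petry's implementation):

    def sim_index(values, sim):
        # Eq. (95): minimum similarity over all pairs of elements in the set
        return min(sim[(x, y)] for x in values for y in values)

    def redundant(t1, t2, sims, alpha):
        # Eq. (94): tuples are redundant if every merged attribute value stays similar
        return all(sim_index(t1[j] | t2[j], sims[j]) >= alpha[j] for j in range(len(t1)))

    def merge_redundant(tuples, sims, alpha):
        result = []
        for t in tuples:
            for i, r in enumerate(result):
                if redundant(r, t, sims, alpha):
                    result[i] = tuple(r[j] | t[j] for j in range(len(t)))
                    break
            else:
                result.append(t)
        return result

    # The R2' example of Fig. 9: S1 restricted to {a}, S2 as in Fig. 9b)
    S1 = {('a', 'a'): 1.0}
    S2 = {(x, y): v for x, row in zip('αβχδ', [[1, .7, .4, .5], [.7, 1, .4, .5],
                                               [.4, .4, 1, .4], [.5, .5, .4, 1]])
          for y, v in zip('αβχδ', row)}
    R2p = [({'a'}, {'α', 'δ'}), ({'a'}, {'β'}), ({'a'}, {'χ'})]
    print(merge_redundant(R2p, [S1, S2], [1.0, 0.5]))
    # result corresponds to relation R of Fig. 9d): ({a},{α,β,δ}) and ({a},{χ})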
Hence, the tuples (a,{α,δ}) and (a,β) are merged, producing the tuple (a,{α,β,δ}) in the resulting relation R.

P5. Buckles et al. [13] adapted the DRC (Section 2) to the similarity-based model. Syntactically, this adaptation manifests itself in the addition of a with clause following each formula. This clause is to be interpreted in a way analogous to the case of their extended relational algebra. Thus, starting with the standard DRC query {(X1, …, Xn) | Ψ(X1, …, Xn)}, Buckles and Petry's fuzzy domain calculus comprises the following atomic formulae [13]:

R(X1, X2, …, Xn) with <domain level conditions>        (97)

X1 θ X2 with <operator level condition>        (98)
where R is a database relation, Xi is a constant or a domain variable and the domain level conditions are level conditions for all attributes in R; θ is a comparison operator and the operator level condition is a level condition applying to θ (when omitted its value is 1 by default). In case of Eq. (97) the variables X1, X2, …, Xn are instantiated from the values of a tuple ti = (di1, …, din), producing the following vector of matching degrees [13]:

<S(di1, X1), S(di2, X2), …, S(din, Xn)>        (99)
that is, the matching degree, γ(R(.),ti), of R(X1,X2, …, Xn) against tuple ti. The values S(dij,Xj) are defined as follows (provided all variables Xi are instantiated from the same tuple - otherwise S(dij,Xj) = 0 ∀j):
S(dij, Xj) = min_{u∈dij, v=Xj} Sj(u, v),   if Xj is a constant
S(dij, Xj) = min_{u∈dij, v∈xj} Sj(u, v),   if xj is an instantiation of Xj        (100)
Tuple ti satisfies Eq. (97) above if, for each j, S(dij,Xj) is greater than or equal to the similarity threshold corresponding to the attribute Aj. In the case of Eq. (98) the matching degree, γ(X1 θ X2, ti), of the formula X1 θ X2 against tuple ti is the minimum value θ(x1,x2) over the pairs (x1,x2), where x1 ∈ di1, x2 ∈ di2 and the dij correspond to a constant or an instantiation of the variable. Tuple ti satisfies Eq. (98) above if this matching degree is greater than or equal to the threshold specified by the operator level condition. Further, a formula in Buckles and Petry's fuzzy domain calculus is defined similarly to a domain calculus formula (Section 2). Specifically, it is one of the following expressions [13]:
1. an atomic formula
2. ψ1 ∧ ψ2 with <level condition>, ψ1 ∨ ψ2 with <level condition>, ¬ψ with <level condition>
3. ∃X(A) ψ with <level condition>, ∀X(A) ψ with <level condition>
4. (ψ), [ψ]
where ψ, ψ1, ψ2 are formulas, X is a domain variable associated with attribute A, and the level condition applies to any free or bound variable in the formulas ψi. The authors also introduce the concept of a safe formula in a way analogous to the crisp case. Then, they prove that the expressive power of their DRC is at least the same as that of the previously discussed fuzzy relational algebra (see Proposal P4).

P6. Shenoi and Melton [42] extended the fuzzy relational database model of Buckles and Petry (see Proposals P4 and P5) by relaxing the transitivity property required of the similarity relation. They replaced the similarity relation with a proximity relation and showed that the general properties of the model are preserved. This provides the user with the freedom to define the closeness among the different elements of the domain. For example [42], if one is close to two with a degree of 0.8 and two is close to three also with a degree of 0.8, then a degree of 0.6 between one and three would not be allowed if we used a similarity relation: the max-min transitivity would impose a value equal to or higher than 0.8, which would contradict common sense. Further, Shenoi et al. [43] proposed a more general model, an equivalence-classes model. The model is based on the following two assumptions [43]:
1. the existence of partitions at the desired level of precision for each non-empty subset of a domain
2. the total ordering of the partitions related to the same domain but with different precision levels, provided by the concepts of finer and coarser partitions.
Assumption 1 is motivated by the postulate that a partitioning of a domain should be defined separately for each subset of elements currently appearing in the database. The authors call such a subset a temporal domain (it is also called an active domain). Assumption 2 expresses the intuitive requirement that elements indistinguishable (i.e., falling into the same equivalence class) at a given level of precision should still be indistinguishable at any lower level of precision. Thus, any two partitions of the same temporal domain taken at different precision levels should be in a coarser/finer relation. This is an abstract model [43] because it does not specify how to partition the domain: the users have the freedom to choose how to define the equivalence classes (information chunks). For instance, Buckles and Petry [12] used similarity relations on scalar domains, while Shenoi and Melton [42] used proximity relations on scalar domains. In fact, Shenoi et al. [43] show that any partitions satisfying assumptions 1 and 2 guarantee that the fundamental properties of the relational model are preserved. They also proposed a fuzzy relational algebra for this model. For example [43], Fig. 10 shows a fuzzy relation of candidates in a political election, storing the name, age and political view of the candidates, and illustrates the associated algebra. The scheme of a fuzzy relation is defined by adding the set of precision levels αR used for each attribute:

αR = (αName, αAge, αView)        (101)
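Before walking through Fig. 10, the partition machinery behind assumptions 1 and 2 can be sketched as follows. This is only an illustration, restricted to the Age and View attributes and using an assumed set-based encoding rather than Shenoi et al.'s formalism: a partition is a list of frozensets, and tuples whose attribute values fall into the same classes of the coarser partitions are merged.

    def coarser_or_equal(p_coarse, p_fine):
        # assumption 2: every class of the finer partition lies inside a coarser class
        return all(any(c <= C for C in p_coarse) for c in p_fine)

    def merge_under(tuples, partitions):
        # merge tuples whose (set-valued) attribute values lie in the same classes
        def key(t):
            return tuple(next(i for i, C in enumerate(P) if v <= C)
                         for v, P in zip(t, partitions))
        merged = {}
        for t in tuples:
            k = key(t)
            merged[k] = tuple(a | b for a, b in zip(merged[k], t)) if k in merged else t
        return list(merged.values())

    P_age  = [frozenset({39, 42}), frozenset({50, 54, 58}), frozenset({65, 68, 73})]
    P_view = [frozenset({'Ultra_Liberal', 'Liberal'}), frozenset({'Moderate'}),
              frozenset({'Conservative', 'Arch_Conservative'})]
    fine_age = [frozenset({x}) for x in (39, 42, 50, 54, 58, 65, 68, 73)]
    print(coarser_or_equal(P_age, fine_age))            # True: P_age is coarser

    kane = (frozenset({54}), frozenset({'Moderate'}))
    mann = (frozenset({50}), frozenset({'Moderate'}))
    rudd = (frozenset({58}), frozenset({'Conservative'}))
    print(merge_under([kane, mann, rudd], [P_age, P_view]))
    # Kane and Mann merge; Rudd stays separate (different View class)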
In Fig. 10(a-b), the partition of temporal domains into equivalence classes of identical elements results in partitions whose equivalence classes are sets with only one element, corresponding to the special case of the relational model. In fact, since each component of a tuple is a subset of an equivalence class in the partition of its respective attribute, each attribute value in the relation candidates must have only one element. For example, for attribute Name we can have the following partition: PName : {{Cook},{Dean},{Hall},{Kane},{Luce}, {Mann}, {Page}, {Rudd}}
(102)
a) candidates (α = (αName, αAge, αView)):

Name     Age    View
{Cook}   {39}   {Arch_Conservative}
{Dean}   {68}   {Ultra_Liberal}
{Hall}   {42}   {Conservative}
{Kane}   {54}   {Moderate}
{Luce}   {73}   {Liberal}
{Mann}   {50}   {Moderate}
{Page}   {65}   {Liberal}
{Rudd}   {58}   {Conservative}

b) questionable_candidates (α = (αName, αAge, αView)):

Name     Age    View
{Cook}   {39}   {Arch_Conservative}

c) abstract_candidates (α' = (α'Name, α'Age, α'View)):

Name                  Age            View
{Cook, Hall}          {39, 42}       {Arch_Conservative, Conservative}
{Dean, Luce, Page}    {65, 68, 73}   {Ultra_Liberal, Liberal}
{Kane, Mann}          {50, 54}       {Moderate}
{Rudd}                {58}           {Conservative}

d) revised_candidates (αR = (αName, αAge, αView)) ← candidates − questionable_candidates:

Name     Age    View
{Dean}   {68}   {Ultra_Liberal}
{Kane}   {54}   {Moderate}
{Luce}   {73}   {Liberal}
{Mann}   {50}   {Moderate}
{Page}   {65}   {Liberal}
{Rudd}   {58}   {Conservative}
Fig. 10. Example from Shenoi et al. [43]: algebra for the equivalence-classes model

Partitions PName, PAge, PView have implicit precisions, respectively, αName, αAge, αView (see relation candidates in Fig. 10a)). On the other hand, the following partitions:

P'Name : {{Cook, Dean, Hall, Kane, Luce, Mann, Page, Rudd}}        (103)

P'Age : {{39, 42}, {50, 54, 58}, {65, 68, 73}}        (104)

P'View : {{Ultra_Liberal, Liberal}, {Moderate}, {Conservative, Arch_Conservative}}        (105)
create equivalence classes of more or less identical elements. We can see that the precision of P'Name, α'Name, makes all elements of the temporal domain DName indistinguishable, whereas the precisions of the partitions P'Age and P'View, α'Age and α'View, split the temporal domains DAge and DView into three equivalence classes each. Clearly, α' ≤ α means that the corresponding partition P' is coarser than the corresponding partition P. Then, if we decrease the precisions α (identical) for the attributes of relation candidates (see Fig. 10a)) to the new precisions α' (more or less identical), redundant tuples appear, which must be merged. The fuzzy relation abstract_candidates (see Fig. 10c)) shows the result of merging the tuples in
relation candidates considering the new precisions α'Name, α'Age, α'View. We can see, for example, that candidates Kane and Mann are equivalent and hence merged, because their names, ages and political views are considered equivalent (see partitions P'Name, P'Age and P'View), but Rudd is not equivalent to Kane and Mann because Rudd's political view (conservative) is not equivalent to Kane and Mann's political view (moderate). Note that moderate and conservative do not belong to the same equivalence class in partition P'View. Therefore, Rudd is not redundant and is not merged with any other candidate in the candidates relation. Finally, the difference operation between the fuzzy relations candidates (see Fig. 10a)) and questionable_candidates (see Fig. 10b)), using the precision levels α'Name, α'Age, α'View, is shown in Fig. 10d). We can see that two tuples, corresponding to candidates Cook and Hall, from the fuzzy relation candidates are missing in the result, i.e. the fuzzy relation revised_candidates. Besides the tuple appearing in relation questionable_candidates (Cook's tuple), an extra tuple corresponding to candidate Hall is also removed from candidates, because candidate Hall is equivalent to candidate Cook under the more or less identical precision (see P'Name, P'Age, P'View).

G3 Hybrid models

P7. Medina et al. [32] proposed a fuzzy database model, GEFRED (generalized fuzzy relational database), that tries to integrate features of both the possibilistic and similarity-based models. The data is represented with generalized fuzzy relations that take into account imprecision as well as uncertainty of information. The latter is dealt with via a compatibility degree associated with each attribute value. More precisely, a generalized fuzzy relation R is composed of two sets [32], R = (H, B), where H (Head) is the set:

H = {(A1 : DG1 [,C1]), (A2 : DG2 [,C2]), ..., (An : DGn [,Cn])}
(106)
and B (Body) is the set: B = {(A1 : di1 [,ci1]), (A2 : di2 [,ci2]), ..., (An : din [,cin]) }
(107)
where Aj (j = 1…n) is the j-th attribute, DGj (j = 1…n) is the family of all possibility distributions defined over the domain of attribute Aj, which is called a generalized fuzzy domain; Cj is the compatibility attribute of attribute Aj, which may be optional; and dij is the value of attribute Aj in tuple i and cij (j = 1…n) is the compatibility degree of value dij, which is a value in the interval [0,1]. For example, Fig. 11b) (adopted from [32]) shows a generalized fuzzy relation with attributes NAME, ADDRESS, AGE, PRODUCTIVITY, SALARY and compatibility
attribute CAGE. The compatibility attributes associated with other attributes are equal to one and, therefore, they are not shown. Medina et al [32] defined an algebra, a generalized fuzzy relational algebra, to manipulate information stored in the fuzzy database. Next, we will illustrate the selection operation, which is called a generalized fuzzy selection. It is based on a simple condition θG(Aj,a)≥α, where θG is a comparison operator (generalized fuzzy comparator); α is a compatibility threshold (α ∈ [0,1]), and a is a constant. Comparison operator θG is defined as:
θG : DGj × DGj → [0,1]        (108)

θG(π, π') = sup_{(d1,d2)∈Dj×Dj} min(θ(d1, d2), π(d1), π'(d2))        (109)
where π and π' are possibility distributions and θ can be:
• a classical comparison operator such as =, ≠, >, ≥, <, ≤;
• a fuzzy comparison operator such as approximately equal, much greater than, etc.;
• a similarity comparison operator, defined using a similarity relation on scalar data.
In fact, Eq. (109) is similar to Eq. (80), used when querying possibilistic databases. A new element here is an explicitly stated threshold value, which resembles the queries of the similarity-based model. The compatibility degrees of all involved attributes require a special treatment in the matching degree calculation. In fact, these compatibility degrees correspond to the partial matching degrees that, in other models, are immediately aggregated into an overall matching degree. For example, the selection σ=(AGE,Old) ≥ 0.6(Emp) of relation Emp (see Fig. 11a)), using the condition =(AGE,Old) ≥ 0.6, results in relation S (see Fig. 11b)). We can see from the definition of the label OLD (see Fig. 11d)) that the tuples corresponding to employees Luis (his age is 30 or 31) and Javier (his age is between 30 and 35) do not belong to S, because their compatibility degree CAGE is zero and hence the condition is not satisfied. The same is true for the tuples corresponding to employees Juan Carlos and Julia, who are YOUNG (see Fig. 11d)). Finally, the compatibility degrees for the tuples corresponding to employees Antonio and Francisco are computed from Eq. (109).

P8. Galindo et al. [18] extended the GEFRED model with a fuzzy domain relational calculus (FDRC) for querying fuzzy relational databases. The FDRC language is described next. Galindo et al.'s fuzzy domain calculus comprises the following atomic formulas [18]:
a) The generalized fuzzy relation Emp:

Name          Address           Age             Productivity   Salary   Department
Antonio       Reyes Católicos   Middle          Fair           100000   Production
Francisco     P. A. Alarcón     Old             Excellent      150000   Comercial
Luis          Recogidas         {.8/30, 1/31}   Good           110000   Production
Juan Carlos   Camino Ronda      Young           Bad            90000    Production
Julia         Puerta Real       Young           Good           130000   Comercial
Javier        Gran Vía          [30, 35]        Fair           105000   Human Resources

b) Relation S = σ=(AGE,Old) ≥ 0.6 (Emp):

Name        Address           Age      CAge   Productivity   Salary
Antonio     Reyes Católicos   Middle   0.75   Fair           100000
Francisco   P. A. Alarcón     Old      1      Excellent      150000

c) Relation R: departments with at least one bad (≥ 0.5) employee:

Department        CDepartment
Production        1
Human resources   0.5

d) Labels definitions for attribute Age: trapezoidal possibility distributions YOUNG, MIDDLE and OLD defined over the domain 16–80.
Fig. 11. Example from [32]: algebra for the GEFRED model

1. R(X1, X2, ..., Xn) ≥ α, where R is a predicate symbol corresponding to a generalized fuzzy relation with n attributes, and each Xi is a constant or a domain variable. This atom requires that the tuple (X1, X2, ..., Xn) belongs to the relation corresponding to R to a degree higher than or equal to α. For a given instantiation of the variables Xi this degree of membership is computed as follows:

R(X1, X2, ..., Xn) = max_{r=1,…,m} min_{c=1,…,n} =(drc, Xc)        (110)
where m is the number of tuples in relation corresponding to R, = is a generalized fuzzy comparator defined by Eq. (109). Thus, the membership degree of a given tuple (X1, X2, ..., Xn) – note that now Xi denotes a constant originally present in the atomic formula or a value substituting the domain
variable – is computed by comparing this tuple with all tuples of the relation corresponding to the predicate symbol R. This comparison is done using a fuzzy comparator for each attribute separately; drc denotes the value of the c-th attribute in the r-th tuple. Then, the total result of the comparison is taken as the minimum of these per-attribute comparisons. A tuple of the relation for which this maximum value is attained is referred to as the most similar tuple. The fulfillment threshold α is a value in the interval [0,1] that is the minimum value admissible for R(X1, X2, ..., Xn) in order to make the atom true.
2. θG(X,Y) ≥ α, where θG is a generalized fuzzy comparator, and X and Y are constants or domain variables. This atom expresses that the value X is related to the value Y by the fuzzy comparator θG to a certain truth degree, which is greater than or equal to α.
Examples of fuzzy atoms are R(X1, Good, X3) and =(X1, Good) ≥ 0.9. In the first case, the threshold is omitted, which means that its value is equal to one. Further, a formula in Galindo et al.'s fuzzy domain calculus is defined in a similar way to a domain calculus formula (Section 2). Specifically, it is either an atomic formula or one of the following expressions: ¬ψ1, ψ1 ∧ ψ2, ψ1 ∨ ψ2, ψ1 ⇒ ψ2, ∃X ψ1(X), ∀X ψ1(X), where ψ1 and ψ2 are fuzzy formulas and X is a domain variable. Afterwards, they demonstrate the expressive power of the FDRC by proving that any expression in the fuzzy relational algebra has an equivalent expression in the FDRC. Observe that so far we are still operating within the framework of classical logic – due to the threshold value α in an atomic formula, atoms are true or false. A possible partial matching is preserved using the concept of a compatibility degree. The result of a query is the set of tuples satisfying it – a new generalized relation. Each attribute value of this new relation may be associated with a compatibility degree expressing how well this specific value matches the query. This degree is computed by a matching function γ which takes three arguments: a formula (query) ψ, a tuple ti = (di1, di2, …, din), and X, a domain variable (attribute) for which the compatibility degree is to be computed. This may be formally described as:

γ(ψ, ti, X) ∈ [0,1] ∪ {λ}        (111)
Value λ is a value not belonging to [0,1] that asserts that the degree γ is not applicable or meaningless. Function γ(ψ, ti, X) is defined depending on the structure of ψ. There are four cases: ψ is an atomic formula, a negation, a disjunction or a formula with an existential quantifier [18]. When ψ is an atomic formula of type R(X1, X2, ..., Xn, K) ≥ α we have:

γ(R(X1, X2, ..., Xn, K) ≥ α, ti, X) = R(K),               if there are no variables in ψ
                                    = min{cij, R(ti)},    if X = Aj
                                    = λ,                  otherwise        (112)
where K is the list of constants present in ψ; the values R(K) and R(ti) are computed using Eq. (110); X = Aj indicates that the variable X is an attribute Aj in R; cij is the value of the compatibility attribute Cj for the tuple most similar to ti; if there is no cij associated with this most similar tuple, then cij is assumed to be equal to 1.0. The atomic formula θG(Xj, Y) ≥ α is evaluated as:

γ(θG(Xj, Y) ≥ α, ti, X) = θG(dij, Y),    if Xj = X
                        = θG(Xj, Y),     if Xj is a constant
                        = λ,             otherwise        (113)
The evaluation of the negation and disjunction expressions is done using the complement and maximum operators, respectively (see (17)-(18)). On the other hand, for a formula ψ expressed as ∃Xn+1 (ψ1(X1, X2, ..., Xn, Xn+1)):

γ(ψ(X1, ..., Xn), ti, X) = max_{di(n+1)∈DOM(ψ)} γ(ψ1(X1, ..., Xn, di(n+1)), ti, X)        (114)

where DOM(ψ) is the set of all symbols that appear in formula ψ or in a tuple of a relation appearing in ψ. The remaining expressions, that is, the conjunction, the implication, and the universal quantifier, can be expressed in an equivalent form using the negation, the disjunction and the existential quantifier. Then, the result we obtain for a general query {X1, X2, …, Xn | ψ(X1, X2, ..., Xn)} is a generalized relation R (see Eqs. (106)-(107)) that is computed using the following two steps [18]:
1. Compute all the tuples (dr1, …, drn) that make the formula ψ(dr1, …, drn) true;
2. The compatibility values crj (j = 1, …, n) for each compatibility attribute Cj, corresponding to the r (r = 1, …, m) tuples of R computed in step 1, are computed as:

crj = γ(ψ(X1, …, Xn), tr, Xj)        (115)
where tr = (dr1, …, drn) is the r-th tuple of relation R. If crj = λ or crj = 1 for all r = 1, …, m, the attribute Cj is removed from R. For example, the query "show the departments with at least one bad employee (with a degree greater than or equal to 0.5)" may be expressed in FDRC as:
{d | ∃n, a, ag, p, s (Emp(n, a, ag, p, s, d) ∧ =(p, Bad) ≥ 0.5)}        (116)
which, considering the fuzzy relation in Fig. 11a), produces the resulting fuzzy relation in Fig. 11c). Further, the value C1Department is computed in the following way:

C1Department = γ(ψ, t1, d)
  = ∃n, a, ag, p, s (γ(Emp(n, a, ag, p, s, d), t1, d) ∧ γ(=(p, Bad) ≥ 0.5, t1, d))
  = max { γ(Emp(Antonio, Reyes Católicos, Middle, Fair, 100000, d), t1, d) ∧ γ(=(Fair, Bad) ≥ 0.5, t1, d),
          γ(Emp(Luis, Recogidas, {.8/30, 1/31}, Good, 110000, d), t1, d) ∧ γ(=(Good, Bad) ≥ 0.5, t1, d),
          γ(Emp(Juan Carlos, Camino Ronda, Young, Bad, 90000, d), t1, d) ∧ γ(=(Bad, Bad) ≥ 0.5, t1, d) }
  = max {min(1, 0.5), min(1, 0), min(1, 1)} = max {0.5, 0, 1} = 1

Note that the existential quantifier (see Eq. (114)) is replaced with the values of the tuples such that department = Production (t1). The matching degree is then computed for those (three) tuples and, finally, the maximum of these degrees is taken. Galindo et al. [19] also propose the inclusion of fuzzy quantifiers in the FDRC language just described.

P9. Galindo et al. [20] implemented the FRDB model GEFRED on top of the crisp DBMS Oracle, together with an FQL, fuzzy SQL (FSQL). They extended the SELECT command of SQL in order to allow for more flexible conditions: choosing between possibility and necessity within fuzzy comparators, retrieving the most (least) important tuples using fulfillment thresholds, or allowing fuzzy constants on the right-hand side of a condition. More precisely, the main functionalities added are [20]:
1. Linguistic labels. These labels can be defined for two types of attributes: attributes with an ordered domain or attributes with a scalar domain. In the first case, the labels are defined as trapezoidal possibility distributions and, in the second case, a similarity relation between the labels of the attribute is defined.
2. Fuzzy comparators. They extend the usual comparators =, >, ≥, <, ≤, providing comparators that have two forms, corresponding to the possibility and the necessity cases. For example, the operator FEQ evaluates the possibility of two attributes (or one attribute and a constant) being equal using Eq. (80) (or Eq. (76)), and the NFEQ operator evaluates the necessity of two attributes (or one attribute and a constant) being equal using Eq. (81) (or Eq. (77)). Note that FEQ and NFEQ are instances of the generalized fuzzy comparator θG. Additionally,
they provide the fuzzy comparators much greater than (MGT, NMGT) and much less than (MLT, NMLT).
3. Fulfillment thresholds (α). They are specified with the syntax:

[THOLD] α        (117)

which is equivalent to having THOLD replaced by ≥; the reserved word THOLD may be substituted by a crisp comparator (=, <, …), modifying the meaning of the condition.
4. Function CDEG(). It shows a column with the compatibility degree for the argument attribute, corresponding to the compatibility degree of the conditions in which the attribute appears. Function CDEG(*) shows the compatibility degrees for all the fuzzy attributes appearing in the condition. If we want to see the compatibility columns for all attributes and the compatibility degree of the whole tuples, we use the % character in the SELECT clause (as SQL does with the * character).
5. Fuzzy constants. FSQL allows for the use of the following constants: UNKNOWN, UNDEFINED, NULL, $[a,b,c,d], $label, [n,m], #n. $[a,b,c,d] is a fuzzy trapezoid with a ≤ b ≤ c ≤ d; $label is a linguistic label (as described above); [n,m] is an interval, for which a = b = n and c = d = m; and #n means approximately n, for which the trapezoid is reduced to a triangle with b = c = n and n − a = d − n.
6. Condition with IS. A condition

IS [NOT] (UNKNOWN | UNDEFINED | NULL)        (118)

is true (without NOT) when the fuzzy attribute value is equal to the fuzzy constant on the right. For example, the query Q8 – Find the Spanish cities with more than "around 500 thousand" inhabitants – may be expressed in FSQL as [20]:

SELECT city, CDEG(inhabitants)
FROM Population
WHERE country = 'Spain'
  AND inhabitants FGEQ $[200,350,650,800] .75
  AND inhabitants IS NOT UNKNOWN        (119)
where FGEQ is the fuzzy comparator extending the crisp ≥ comparator, $[200,350,650,800] is a fuzzy constant, and .75 is a fulfillment threshold requiring that the condition on the number of inhabitants be satisfied at least to the degree 0.75. Note that two columns would be displayed: the name of a city and the degree to which its number of inhabitants is greater than or equal to around 500 thousand. The cities with an unknown number of inhabitants are excluded.
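As a rough illustration of how such a condition could be evaluated, the following Python sketch computes the compatibility degree of one stored, imprecise attribute value against the FGEQ condition of query (119), taking FGEQ as the possibility of "≥" in the spirit of Eqs. (76)/(80). The stored trapezoid, the discretized domain and the "thousands" unit convention are assumptions introduced only for the example; this is not the FSQL server of [20].

    def trapezoid(a, b, c, d):
        def mu(x):
            if x < a or x > d:
                return 0.0
            if b <= x <= c:
                return 1.0
            return (x - a) / (b - a) if x < b else (d - x) / (d - c)
        return mu

    def poss_fgeq(pi_value, pi_const, domain):
        # sup over pairs (u, v) with u >= v of min(pi_value(u), pi_const(v))
        return max(min(pi_value(u), pi_const(v))
                   for u in domain for v in domain if u >= v)

    domain = range(0, 1001)                       # thousands of inhabitants (assumed unit)
    stored = trapezoid(380, 450, 500, 560)        # hypothetical imprecise population value
    around_500 = trapezoid(200, 350, 650, 800)    # the constant $[200,350,650,800]
    cdeg = poss_fgeq(stored, around_500, domain)
    print(cdeg, cdeg >= 0.75)                     # keep the row if the threshold .75 is met

The printed degree is what CDEG(inhabitants) would expose for that row, and the threshold test mirrors the role of .75 in the query above.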
6 Conclusion

In the last two decades, imprecision (fuzziness) and uncertainty have been studied in the context of relational DBMSs, in particular in the area of querying and in the area of modeling and storing imprecise and uncertain data. The first area has led to the appearance of increasingly flexible query languages (FQLs) which, by using fuzzy sets theory, provide more human-consistent interfaces than the classical query languages. In the second area, fuzzy sets theory is used to extend the relational database model, leading to what are usually called fuzzy relational database models, and to the flexible querying of the resulting models. We introduced two taxonomies for FQLs within the context of relational database models to organize the field and to offer a structured view of the topic. One taxonomy organizes the research on FQLs for crisp relational databases and the other organizes the research on fuzzy relational databases. Both taxonomies provide a structured view of the main research topics studied and highlight the main differences and similarities between approaches. We believe that our contribution in organizing the field of flexible query languages for relational databases can shed some light on the most relevant proposals in the area, as well as guide designers and interested users in understanding and selecting the best approaches to suit their aims.
References [1] Baldwin JF, Coyne MR, Martin TP (1993) Querying a database with fuzzy attribute values by iterative updating of the selection criteria. In: Proceedings of International Joint Conference on Artificial Intelligence (IJCAI'93) [2] Baldwin JF, Martin TP, Pilsworth BW (1995) FRIL - Fuzzy and evidential reasoning in artificial intelligence. John Wiley & Sons, Inc, New York [3] Bosc P (1999) Fuzzy Databases. In: Bezdek J, Dubois D, Prade H (eds) Fuzzy sets in approximate reasoning and information systems, The Handbooks of Fuzzy Sets Series. Kluwer Academic Publishers, pp 403-468
[4] Bosc P, Lietard L, Pivert O (2000) About ill-known data and equi-join operations. In: Larsen HL, Kacprzyk J, Zadrozny S, Andreasen T, Christiansen H (eds) Flexible query answering systems. Recent advances. Physica-Verlag, Heidelberg New York, pp 65-74 [5] Bosc P, Pivert O (1992) Fuzzy querying in conventional databases. In: Zadeh LA, Kacprzyk J (eds) Fuzzy logic for the management of uncertainty. John Wiley & Sons, pp 645-671 [6] Bosc P, Pivert O (1993) An approach for a hierarchical aggregation of fuzzy predicates. In: Proceedings of 2nd IEEE International Conference on Fuzzy Systems (FUZZ-IEEE´93), USA, pp 1231-1236 [7] Bosc P, Pivert O (1995) SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems 3: 1-17 [8] Bosc P, Pivert O (1997) Fuzzy queries against regular and fuzzy databases. In: Andreasen T, Christiansen H, Larsen HL (eds) Flexible query answering systems. Kluwer Academic Publishers, pp 187-208 [9] Bosc P, Pivert O (1997) On representation-based querying of databases containing ill-known values. In: Proceedings of International Symposium on Methodologies for Intelligent Systems (ISMIS' 97), pp 477-486 [10] Bosc P, Pivert O, Lietard L (2001) Aggregate operators in database flexible querying. In: Proceedings of 9th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE'2001), Melbourne, Australia, pp 1231-1234 [11] Bosc P, Pivert O, Lietard L (2003) On the comparison of aggregates over fuzzy sets. In: Bouchon-Meunier B, Foulloy L, Yager RR (eds) Intelligent systems for information processing. From representation to applications. Elsevier, pp 141-152 [12] Buckles BP, Petry FE (1985) Query languages for fuzzy databases. In: Kacprzyk J, Yager RR (eds) Management decision support systems using fuzzy sets and possibility theory. Verlag, TUV Rheinland, pp 241-251 [13] Buckles BP, Petry FE, Sachar HS (1986) Design of similarity-based relational databases. In: Prade H, Negoita CV (eds) Fuzzy logic in knowledge engineering. Verlag TUV Rheinland, pp 3-7 [14] Codd EF (1970) A relational model of data for large shared data banks. Communications of the ACM 13(6): 377-387 [15] Dubois D, Prade H (1990) Measuring properties of fuzzy sets: A general technique and its use in fuzzy query evaluation. Fuzzy Sets and Systems 38(2): 137-152 [16] Dubois D, Prade H (1997) Using fuzzy sets in flexible querying: why and how?. In: Andreasen T, Christiansen H, Larsen HL (eds) Flexible query answering systems. Kluwer Academic Publishers, pp 45-60 [17] Fodor J, Yager RR (2000) Fuzzy set-theoretic operators and quantifiers. In: Dubois D, Prade H (eds) Fundamentals of fuzzy sets. Kluwer Academic Publishers, 125-193
[18] Galindo J, Medina JM, Aranda GMC (1999) Querying fuzzy relational databases through fuzzy domain calculus. International Journal of Intelligent Systems 14(4): 375-411 [19] Galindo J, Medina JM, Cubero JC, García MT (2000) Fuzzy quantifiers in fuzzy domain calculus. In: Proceedings of 8th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU'2000), Spain, pp 1697-1702 [20] Galindo J, Medina JM, Pons O, Cubero JC (1998) A server for fuzzy SQL queries. In: Andreasen T, Christiansen H, Larsen HL (eds) Flexible query answering systems. LNAI: 1495, Springer, pp 164-174 [21] Kacprzyk J, Zadrozny S (1995) FQUERY for Access: fuzzy querying for Windows-based DBMS. In: Bosc P, Kacprzyk J (eds) Fuzziness in database management systems. Physica-Verlag, Heidelberg, pp 415-433 [22] Kacprzyk J, Zadrozny S (1997) Implementation of OWA operators in fuzzy querying for Microsoft Access. In: Yager RR, Kacprzyk J (eds) The ordered weighted averaging operators: theory and applications. Kluwer, Boston, pp 293-306 [23] Kacprzyk J, Zadrozny S (1999) Fuzzy querying via WWW: implementational issues. In: Proceedings of 7th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE'1999), Seoul, Korea, pp 603-608 [24] Kacprzyk J, Zadrozny S (2000) On a fuzzy querying and data mining interface. Kybernetika 36: 657-670 [25] Kacprzyk J, Zadrozny S (2000) On combining intelligent querying and data mining using fuzzy logic concepts. In: Bordogna G, Pasi G (eds) Recent research issues on the management of fuzziness in databases. Physica-Verlag, Heidelberg New York, pp 67-81 [26] Kacprzyk J, Zadrozny S, Ziolkowski A (1989) FQUERY III+: a 'human-consistent' database querying system based on fuzzy logic with linguistic quantifiers. Information Systems 6: 443-453 [27] Kacprzyk J, Ziolkowski A (1986) Database queries with fuzzy linguistic quantifiers. IEEE Transactions on Systems, Man and Cybernetics SMC 16: 474-479 [28] Kerre EE, De Cock M (1999) Linguistic modifiers: an overview. In: Chen G, Ying M, Cai K-Y (eds) Fuzzy logic and soft computing. Kluwer Academic Publishers, pp 69-85 [29] Klir GJ, Folger TA (1988) Fuzzy sets, uncertainty and information, Prentice-Hall. [30] Lacroix M, Lavency P (1987) Preferences: putting more knowledge into queries. In: Proceedings of 13th International Conference on Very Large Databases (VLDB'87), Brighton (GB), pp 217-225 [31] Liu Y, Kerre EE (1998) An overview of fuzzy quantifiers (I). Interpretations. Fuzzy Sets and Systems 95: 1-21
[32] Medina JM, Pons O, Vila MA (1994) GEFRED: a generalized model of fuzzy relational databases. Information Sciences 76(1-2): 87-109 [33] Petry FE (1996) Fuzzy databases: principles and applications. Kluwer Academic Publishers [34] Prade H, Testemale C (1984) Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences 34: 115-143 [35] Prade H, Testemale C (1987) Representation of soft constraints and fuzzy attribute values by means of possibility distributions in databases. In: Bezdek JC (ed) Analysis of fuzzy information, vol. II, CRC Press, pp 213-229 [36] Raju KVSVN, Majumdar AK (1988) Fuzzy functional dependencies and lossless join decomposition of fuzzy relational database systems. ACM Transactions on Database Systems 13: 129-166 [37] Ramakrishnan R, Gehrke J (2000) Database management systems, McGraw-Hill. [38] Ribeiro RA (1993) Application of support logic theory to fuzzy multiple attribute decision problems, University of Bristol, UK [39] Ribeiro RA, Moreira AM (1999) Intelligent query model for business characteristics. In: Proceedings of IEEE/WSES/IMACS CSCC'99 Conference, Greece [40] Ribeiro RA, Moreira AM (2003) Fuzzy query interface for a business database. International Journal of Human Computer Studies 58(4): 363-391 [41] Schmucker KJ (1984) Fuzzy sets, natural language computations, and risk analysis, Computer Science Press [42] Shenoi S, Melton A (1989) Proximity relations in the fuzzy relational database model. Fuzzy Sets and Systems 31: 285-296 [43] Shenoi S, Melton A, Fan LT (1990) An equivalence classes model of fuzzy relational databases. Fuzzy Sets and Systems 38: 153-170 [44] Tahani V (1977) A conceptual framework for fuzzy query processing: a step toward very intelligent database systems. Information Processing and Management 13: 289-303 [45] Takahashi Y (1991) A fuzzy query language for relational databases. IEEE Transactions on Systems, Man and Cybernetics SMC 21: 1576-1579 [46] Takahashi Y (1995) A fuzzy query language for relational databases. In: Bosc P, Kacprzyk J (eds) Fuzziness in database management systems. Physica-Verlag, Heidelberg, pp 365-384 [47] Ullman JD (1982) Principles of database systems, Computer Science Press. [48] Umano M (1982) FREEDOM-0: a fuzzy database system. In: Gupta M, Sanchez E (eds) Fuzzy information and decision processes. North-Holland, Amsterdam, pp 339-347
[49] Umano M, Fukami S (1994) Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems 3: 7-27 [50] Yager RR (1994) Interpreting linguistically quantified propositions. International Journal of Intelligent Systems 9: 541-569 [51] Zadeh LA (1965) Fuzzy sets. Information and Control 8: 338-353 [52] Zadeh LA (1975) The concept of a linguistic variable and its application to approximate reasoning - II. Information Sciences 8: 219-269 [53] Zadeh LA (1978) PRUF - a meaning representation language for natural languages. International Journal of Man-Machine Studies 10: 395-460 [54] Zadeh LA (1983) A computational approach to fuzzy quantifiers in natural languages. Computational Mathematics Applications 9: 149-184 [55] Zadrozny S, Kacprzyk J (1998) Implementing fuzzy querying via the Internet/WWW: Java applets, ActiveX controls and cookies. In: Andreasen T, Christiansen H, Larsen HL (eds) Flexible query answering systems. LNAI: 1495, Springer, Berlin Heidelberg, pp 382-392 [56] Zemankova-Leech M, Kandel A (1984) Fuzzy relational databases - A key to expert systems. Koln, Germany, TUV Rheinland.
Vacuity-Oriented Generalized Yes/No Queries Addressed to Possibilistic Databases

Patrick Bosc and Olivier Pivert

IRISA-ENSSAT, Technopole Anticipa, BP 80518, 22305 Lannion, France
bosc, [email protected]
1 Introduction

In this chapter, extended relational databases are considered where some attribute values are imprecisely known. The need to handle imperfect data is more and more recognized, and imprecise information can appear in diverse situations such as data warehouses, forecasts, incomplete archives, structured data extracted from texts, or systems where information issued from automated recognition procedures is stored. Different formalisms can be used to represent imprecise information (see [7] for instance), and the possibilistic setting is assumed in the rest of the chapter. A key question is to define a sound semantics for queries addressed to imprecise databases. Since imprecise data are represented as (possibly infinite) sets of acceptable candidates, an imprecise database can be seen as a set of regular databases, called worlds, each associated with a choice for each attribute value. This approach provides a rational starting point for the definition of a query in the sense that its result is the set of the results obtained for each world (or interpretation). Unfortunately, such an approach is intractable, obviously in the case of an infinite number of worlds, but also due to the possibly huge number of worlds when it is finite. This observation leads to considering only specific queries which can be processed directly against the possibilistic database (the processing is then called "compact"), while delivering a result equivalent to the one defined in terms of worlds. The principle of the approach advocated is summarized in figure 1. A compact calculus valid for a subset of the relational algebra has been devised (see [4] for details). One of its characteristics is to deal with queries containing binary operations allowing for the composition of relations. In this context, the result of a query is a possibilistic relation whose interpretations correspond to more or less possible results, equivalent to those which would have been obtained with a calculus applied to the worlds. This achievement is
interesting from a methodological point of view, but the use of the delivered result by an end user is probably somewhat delicate. So, one can find it convenient to define queries which are more specialized (or targeted) to fit user needs. Vacuity-oriented generalized yes/no queries, considered in the rest of this chapter, are intended to meet this goal. Their generic form is: "is the answer to Q non-empty?". In the presence of precise data, the answer to such a query is either "yes" (if the result of Q is a non-empty relation), or "no" (in the opposite situation). However, in the possibilistic setting, imprecise data come into play, and the answer is tainted with uncertainty. It can be characterized by a double degree of possibility and certainty, both depending on the degrees of the worlds in which the result is empty or not. To reflect the uncertain nature of the result, vacuity-oriented generalized yes/no queries are reformulated as follows: "to what extent is it possible and certain that the answer to Q is not empty?".
Fig. 1. Principle for a sound compact processing against a possibilistic database.
To answer such a query, a two-step processing will take place: • first, Q is evaluated in a compact fashion (i.e., directly against the possibilistic database, see figure 1),
• then, the two desired degrees are computed from the resulting relation obtained.
The structure of the rest of the chapter is the following. In section 2, some basic notions on possibility theory are recalled and the notion of possibilistic relational databases is introduced. The family of yes/no queries considered in the chapter is presented in section 3. Then, the data model required for a valid compact processing of algebraic queries is described in section 4. It is based on two simple appropriate extensions of the basic model advocated in [8]. Section 5 is devoted to the presentation of the two-step processing of vacuity-oriented generalized yes/no queries. Finally, the conclusion summarizes the contributions of the chapter and outlines some perspectives for future work.
2 Possibilistic relational databases

2.1 A brief reminder about possibility theory

Possibility theory [9] provides an ordinal model for uncertainty where imprecision is represented by means of a preference relation coded by a total order over the possible situations. This framework is closely related to fuzzy set theory [10] since the idea is to constrain the values taken by a variable thanks to a (normalized) fuzzy set called a possibility distribution π. π(a) expresses the degree to which a is a possible value for the considered variable. The normalization condition imposes that at least one of the values of the domain (a0) is completely possible, i.e., π(a0) = 1. This setting is particularly suited to take into account uncertainty represented by linguistic terms such as "tall", "medium" or "recent". In the discrete case assumed later on, a possibility distribution is written {π1/a1, ..., πn/an} where ai is a candidate value and πi its degree of possibility. Any event E defined on the power-set of X is characterized by two measures: Π its possibility and N its certainty. The necessity N of E is defined as N(E) = 1 − Π(Ē), where Ē is the event opposite to E. The following results, where E, E1 and E2 denote events, are of interest in the context of this chapter (see [6] for more details):

Π(E1 ∪ E2) = max(Π(E1), Π(E2))   (1)
Π(E1 ∩ E2) = min(Π(E1), Π(E2)) if E1 and E2 are non-interactive   (2)
N(E1 ∩ E2) = min(N(E1), N(E2))   (3)
N(E1 ∪ E2) = max(N(E1), N(E2)) if E1 and E2 are non-interactive   (4)
Π(E) < 1 ⇒ N(E) = 0   (5)
The two measures Π and N provide a total order over the set of regular (non fuzzy) events. They can be ordered according to Π for those which are not at all certain and according to N for those which are completely possible. This capability for rank-ordering events is important in the context of the problem dealt with in this chapter as it will be seen in section 5.
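To make these measures concrete, the following short Python sketch (our own illustration, not part of the chapter; the distribution and events are invented) computes Π and N for crisp events over a discrete, normalized possibility distribution and can be used to check properties (1), (3) and (5).

def possibility(pi, event):
    # Pi(E): highest possibility degree among the values belonging to E
    return max((d for a, d in pi.items() if a in event), default=0.0)

def necessity(pi, event):
    # N(E) = 1 - Pi(complement of E)
    return 1.0 - possibility(pi, set(pi) - set(event))

pi = {"a1": 1.0, "a2": 0.7, "a3": 0.3}   # normalized: a1 is completely possible
E1, E2 = {"a1", "a2"}, {"a2", "a3"}
print(possibility(pi, E1 | E2) == max(possibility(pi, E1), possibility(pi, E2)))  # property (1)
print(necessity(pi, E1 & E2) == min(necessity(pi, E1), necessity(pi, E2)))        # property (3)
print(possibility(pi, {"a3"}), necessity(pi, {"a3"}))  # 0.3 and 0.0, illustrating property (5)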
2.2 Possibilistic relational databases

In contrast to a regular database, a possibilistic relational database D may have some attributes which take imprecise values. In such a case, a possibility distribution is used to represent all the more or less acceptable candidate values for the attribute. The first version of a possibilistic database model was introduced in [8]. From a semantic point of view, a possibilistic database D can be interpreted as a set of usual databases (also called worlds) W1, ..., Wp, denoted by rep(D), each of which is more or less possible. This view establishes a semantic connection between possibilistic and regular databases. This relationship is particularly interesting since it offers a canonical approach to the definition of queries addressed to possibilistic databases, as will be seen later. Any world Wi is obtained by choosing a candidate value in each possibility distribution appearing in D. One of these (regular) databases, let us say Wk, is supposed to correspond to the actual state of the universe modeled. Any world Wi corresponds to a conjunction of independent choices and, according to formula (2), the degree assigned to it is the minimum of the degrees tied to each of the chosen candidate values in the original possibilistic database D. Therefore, at least one of the worlds is completely possible, i.e., is assigned the degree of possibility Π = 1.

Example 1. Let us consider the possibilistic database D involving two relations: im and pl whose respective schemas are IM(#i, ac, date, place) and PL(ac, lg, msp). The relation im describes satellite images of aircrafts and each image, identified by a number (#i), taken at a certain location (place) on a given day (date), is supposed to include a single plane (ac). The relation pl gives the length (lg) and maximal speed (msp) of each aircraft. With the extensions of im and pl given hereafter, four worlds can be drawn, W1, W2, W3 and W4, since there are two candidates for date in the first tuple of im and two candidates for ac in the second one.

im
#i   ac                date             place
i1   a1                {1/d1, 0.7/d3}   c1
i3   {1/a3, 0.3/a4}    d1               c2

pl
ac   lg   msp
a1   20   1000
a2   25   1200
a3   18   800
a4   20   1200
Each of these worlds involves the relation pl which has only precise values (and then has a single interpretation) and one of the four following regular relations im1 to im4, issued from the possibilistic relation im:
im1 (Π = 1)              im2 (Π = 0.7)
#i   ac   date   place   #i   ac   date   place
i1   a1   d1     c1      i1   a1   d3     c1
i3   a3   d1     c2      i3   a3   d1     c2

im3 (Π = 0.3)            im4 (Π = 0.3)
#i   ac   date   place   #i   ac   date   place
i1   a1   d1     c1      i1   a1   d3     c1
i3   a4   d1     c2      i3   a4   d1     c2
The value Π specified is that of each world Wi made of the pair of relations (imi, pl) and, as expected, one of them is completely possible. It is worth mentioning that, as far as a query Q concerns the values taken by the attributes (not their description, see [2]), the answer to Q becomes uncertain (even if, in certain cases, it happens that no uncertainty occurs). This is due to the fact that the elements appearing in the answer generally depend on the choices that are made, i.e., vary with the world to which the query is applied. This observation holds in particular for yes/no queries when imprecise values come into play (see next sections).

Example 2. Let us come back to the database of example 1 and consider the query looking for the places where an aircraft whose maximal speed is over 900 km/h has been seen. The answer contains c1 for sure (Π = N = 1), since the aircraft a1, whose maximal speed is 1000 km/h, has been seen there. On the contrary, some uncertainty affects c2, since it is in the answer only if the aircraft seen there is a4, which is possible at the degree Π = 0.3 and therefore completely unsure (N = 0), since it is completely possible that a3 (which does not match the selection imposed as to the maximal speed) has been seen instead of a4.
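The world-based reading of example 2 can be reproduced by brute force on this toy database. The sketch below is only an illustration of this semantics (not the compact processing advocated in the chapter); relation and attribute names follow example 1.

from itertools import product

im = [  # candidate values of the possibilistic relation im
    {"#i": "i1", "ac": {"a1": 1.0}, "date": {"d1": 1.0, "d3": 0.7}, "place": "c1"},
    {"#i": "i3", "ac": {"a3": 1.0, "a4": 0.3}, "date": {"d1": 1.0}, "place": "c2"},
]
msp = {"a1": 1000, "a2": 1200, "a3": 800, "a4": 1200}  # maximal speeds from relation pl

def worlds(rel):
    # enumerate the regular databases (worlds) together with their possibility degree
    per_tuple = []
    for t in rel:
        per_tuple.append([(min(da, dd), {**t, "ac": a, "date": d})
                          for a, da in t["ac"].items() for d, dd in t["date"].items()])
    for combo in product(*per_tuple):
        yield min(d for d, _ in combo), [t for _, t in combo]

def c2_in_answer(world):
    # query: places where an aircraft with msp > 900 has been seen; does c2 belong to it?
    return any(t["place"] == "c2" and msp[t["ac"]] > 900 for t in world)

pi_yes = max((d for d, w in worlds(im) if c2_in_answer(w)), default=0.0)
pi_no = max((d for d, w in worlds(im) if not c2_in_answer(w)), default=0.0)
print(pi_yes, 1 - pi_no)  # 0.3 and 0.0, i.e. Pi = 0.3 and N = 0 as in example 2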
3 Vacuity-oriented generalized yes/no queries

As mentioned in the introduction, the format of vacuity-oriented generalized yes/no queries is:
Q1 := "to what extent is it possible and certain that the answer to Q is non-empty?"
where Q is a relational query. They generalize queries of the form:
Q2 := "to what extent is it possible and certain that t belongs to the answer to Q?"
where t is a specified tuple, studied in [5]. Let us mention that Q2-type queries have been studied in [1] in the context of null values (the difference being that
their generic formulation is "is it true that tuple t belongs to the answer to Q?"). In queries of type Q2, the specification of a desired target tuple is assumed, and this may be impossible in some situations, contrary to Q1-type queries. On the other hand, it is always possible to transform a Q2-type query into a Q1-type one, by inserting additional conditions (namely those corresponding to the selection of tuple t) into Q.

Example 3. Let us come back again to the database of example 1. If one wants to know the extent to which it is possible and certain that an image was taken at date d1 with an aircraft whose length is 20, one can use the query: "to what extent is it possible and certain that (d1, 20) belongs to the answer to Q?", where Q is the algebraic query:
project(equi-join(im, pl, {ac}, {ac}), {date, lg}),
where project(r, Y) denotes the projection of relation r onto the set of attributes Y and equi-join(r, s, X, Y) stands for the join of relations r and s with the joining condition X = Y. But it is also possible to call on the query: "to what extent is it possible and certain that the answer to Q' is non-empty?", where Q' is the following algebraic query:
equi-join(select(im, date = 'd1'), select(pl, lg = 20), {ac}, {ac}).
On the contrary, if one looks for the existence of images with aircrafts whose maximal speed is under 900 km/h and different from "ATR42", the desired result of the underlying algebraic query cannot be characterized as a single tuple (indeed, it corresponds to a set of acceptable tuples, which can be specified only if all the names of aircrafts and all the possible speeds under 900 km/h are known). It is then necessary to call on a Q1-type query, namely: "to what extent is it possible and certain that the answer to Q'' is non-empty?", where Q'' is the algebraic expression:
equi-join(select(im, ac ≠ 'ATR42'), select(pl, msp < 900), {ac}, {ac}).
The result of a vacuity-oriented generalized yes/no query (Q1-type above) depends on the nature (empty or not) of the answer to query Q in the worlds associated with the considered possibilistic database. Due to property (5), four cases can be distinguished:
• Π = 0: it is certain that the answer to Q is empty, i.e., whatever the world chosen, Q delivers an empty answer,
• 0 < Π < 1: it is not at all certain that the answer to Q is not empty (there is at least one completely possible world where the answer to Q is empty), but this is somewhat possible (there is one world possible at the degree Π where the answer to Q is not empty, and no such world possible at a larger degree),
• 0 < N < 1: it is somewhat certain that the answer to Q is not empty (the most possible world where the answer to Q is empty is possible at the degree 1 − N) and this is completely possible (there is one completely possible world where the answer to Q is not empty),
• N = 1: it is completely certain (or sure) that the answer to Q is not empty (the answer to Q is not empty in any world).
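In other words, the pair (Π, N) boils down to two maxima over the worlds. The tiny helper below (ours, not the chapter's compact method) makes the four cases explicit from a list of worlds given as (possibility degree, is-the-answer-empty) pairs.

def vacuity_degrees(worlds):
    # Pi and N that "the answer to Q is not empty", from world-based results
    pi_nonempty = max((d for d, empty in worlds if not empty), default=0.0)
    pi_empty = max((d for d, empty in worlds if empty), default=0.0)
    return pi_nonempty, 1.0 - pi_empty

print(vacuity_degrees([(1.0, True), (0.7, False)]))    # (0.7, 0.0): second case above
print(vacuity_degrees([(1.0, False), (0.4, True)]))    # (1.0, 0.6): third case above
print(vacuity_degrees([(1.0, False), (0.3, False)]))   # (1.0, 1.0): fourth case (never empty)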
4 An extended possibilistic data model

4.1 Objective

As mentioned before, a calculus based on the processing of the query (Q) against every more or less possible database (world) associated with the possibilistic database is indeed intractable. Then, a compact approach to the calculus of the desired possibility and necessity degrees must be found. In order to reach this goal, it is necessary to be provided with a data model and operations which have good properties:
• the data model must be closed for the considered operations, i.e., any operation takes elements of the model as input(s) and returns such an element,
• any compact query (applying to the possibilistic database D) must behave so that the interpretations of its result correspond exactly to the results of this query applied to all interpretations (worlds) drawn from D.
This latter point is the characteristic property of data models called strong representation systems and this can be formalized as follows. Let O be a set of algebraic operators {op1, ..., opn} and Q be a query calling only on operators of O. A data model associated with a possibilistic database D is a strong representation system for O if:
rep(Qc(D)) = Q(rep(D))   (P1)

where rep(D) denotes the set of worlds associated with D and Qc stands for the query calling on the compact version of the operations. It turns out that the initial relational possibilistic model suffers from deficiencies which prevent it from complying with property P1 in two respects:
• the handling of "missing tuples", i.e., tuples of a possibilistic relation which may have no representative in some worlds,
• the accounting for dependencies between candidate values, since up to now it is assumed that candidate values are independent. This means that the set of all choices is obtained as the Cartesian product of all the candidates, or alternatively that the choice of a given candidate in a distribution does not depend on any choice in any other distribution.
These two aspects are detailed in the two following subsections.

4.2 Representing possibly missing tuples

There is a need at the compact level for expressing that some tuples have no representative in some worlds. Indeed, some operations (e.g., selections, joins) lead to discarding candidate values from a distribution in order to be compatible with the world-based interpretation (contrary to what is done in the approach suggested by Prade and Testemale, [8]). However, one must be able to compute the degree of all the worlds that can be derived from the answer, including those in which some tuples are not represented (because some of their candidate values have been discarded). As a consequence, it is mandatory to have some information in the model serving as a basis for deriving all the (more or less possible) worlds with their exact associated degree. A solution compatible with the relational data model is to introduce a new attribute, denoted by N, which states whether or not it is legal to build worlds where the corresponding tuple is not represented, and, if so, the degree of possibility tied to this choice. The value t.N of N in a tuple t is meant to express the certainty of the presence of a representative of tuple t in any world. So doing, it is possible to generate the worlds in which a tuple is not represented, by taking into account the degree of possibility of its absence, which, according to possibility theory, is given by (1 − t.N). An extended possibilistic relation is a set of tuples {<N1/t1>, ..., <Np/tp>} where Ni stands for the certainty of the event "tuple ti is represented in any interpretation of the relation". It is worth mentioning that Ni equals 1 for tuples of initial possibilistic relations, which makes initial relations compatible with the model.
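As a purely illustrative encoding (the field names are ours, not the chapter's), such an extended tuple can be held as its certainty of presence N together with attribute values that are either crisp or discrete possibility distributions; the same encoding is reused in a later sketch of the compact selection.

# the tuple u of example 4, before the selection "A in {a1, a2}"
u = {
    "N": 1.0,
    "values": {"A": {"a2": 1.0, "a1": 0.7, "a3": 0.7, "a4": 0.4, "a5": 0.1},
               "B": "b", "C": "c"},
}
# worlds may omit a representative of this tuple with possibility 1 - N
print(1 - u["N"])  # 0.0 for an initial relation: the tuple is certainly represented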
Example 4. Let us consider a relation r whose schema is R(A, B, C). According to the previous notation, N is not considered a regular attribute and does not appear in the relation schemas. Let us assume that a selection applies to r with the condition "A ∈ {a1, a2}". Let u be a tuple of r which takes an imprecise value on A, e.g.:

u = 1/<{1/a2, 0.7/a1, 0.7/a3, 0.4/a4, 0.1/a5}, b, c>.

To conform with the selection condition, the candidate values a3, a4 and a5 must be removed from the tuple resulting from the selection. According to the approach proposed before, some track is also kept (thanks to the special attribute N) of the fact that worlds do exist where the values for A were not solely a1 and a2. Hence, the resulting tuple will be represented as:

0.3/<{1/a2, 0.7/a1}, b, c>

stating that the choice associated with the absence of a representative of this tuple in a given world is possible at the degree 0.7 = 1 − 0.3. One may wonder about the fact that only the possibility of the most possible candidate value is taken into account. This is indeed sufficient because, in possibility theory, if a given event has several occurrences (with possibly different grades), these occurrences are merged into a single one which is assigned the highest of the degrees.

Example 5. Let us consider a relation r whose schema is R(A, B, C) with the extension given hereafter:

r
A                         B    C    N
{1/a1, 0.6/a2, 0.3/a3}    b1   c1   1
a1                        b2   c2   1
a2                        b2   c3   1

and the selection condition "A = a1". The resulting relation is:

res
A    B    C    N
a1   b1   c1   0.4
a1   b2   c2   1
This relation has two interpretations (worlds): one with the two tuples, which is completely possible (min(min(1, 1, 1), min(1, 1, 1)) = 1), and one with only the second tuple, which is possible at the degree min(1 − 0.4, min(1, 1, 1)) = 0.6. From the three worlds that can be issued from r,

1/r1             0.6/r2           0.3/r3
A    B    C      A    B    C      A    B    C
a1   b1   c1     a2   b1   c1     a3   b1   c1
a1   b2   c2     a1   b2   c2     a1   b2   c2
a2   b2   c3     a2   b2   c3     a2   b2   c3

the following resulting relations would be obtained after selection:

1/r'1            0.6/r'2          0.3/r'3
A    B    C      A    B    C      A    B    C
a1   b1   c1     a1   b2   c2     a1   b2   c2
a1   b2   c2

which reduce to only two different worlds that are exactly those derived from the compact relation res, namely:

A    B    C    (Π = 1)      A    B    C    (Π = 0.6)
a1   b1   c1                a1   b2   c2
a1   b2   c2

and it can be observed that the two relations r'2 and r'3 have the same content and they are merged into a single one with the degree of possibility max(0.6, 0.3) = 0.6.
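A minimal sketch of this behaviour (our own reconstruction of what examples 4 and 5 illustrate, restricted to a condition on a single level-one attribute, using the hypothetical tuple encoding introduced earlier): candidate values violating the condition are discarded, a tuple with no surviving candidate is dropped, and N becomes min(t.N, 1 − πM), πM being the degree of the most possible discarded candidate.

def compact_select(relation, attr, pred):
    result = []
    for t in relation:
        dist = t["values"][attr]
        dist = dist if isinstance(dist, dict) else {dist: 1.0}  # crisp value as a singleton
        kept = {a: d for a, d in dist.items() if pred(a)}
        if not kept:
            continue  # no candidate satisfies the condition: no output tuple
        removed = [d for a, d in dist.items() if not pred(a)]
        new_n = min(t["N"], 1 - max(removed)) if removed else t["N"]
        new_values = dict(t["values"], **{attr: kept if len(kept) > 1 else next(iter(kept))})
        result.append({"N": new_n, "values": new_values})
    return result

# relation r of example 5; the selection "A = a1" yields the certainties 0.4 and 1
r = [{"N": 1.0, "values": {"A": {"a1": 1.0, "a2": 0.6, "a3": 0.3}, "B": "b1", "C": "c1"}},
     {"N": 1.0, "values": {"A": "a1", "B": "b2", "C": "c2"}},
     {"N": 1.0, "values": {"A": "a2", "B": "b2", "C": "c3"}}]
print([t["N"] for t in compact_select(r, "A", lambda a: a == "a1")])  # [0.4, 1.0]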
4.3 Possibility distributions over several attributes

Another aspect of the data model which needs to be revisited is related to the fact that it is sometimes necessary to express relationships (dependencies) between candidate values coming from different attributes of a same tuple. For example, let us consider a given tuple t where the two attributes A and B take the imprecise values t.A = {a1, a2} and t.B = {b1, b2, b3}, whatever the associated grades, which are omitted here. If an operation retains only the pairs (a1, b1) and (a2, b3), it is impossible to represent this situation with a Cartesian product of subsets of t.A on the one hand and t.B on the other hand. In other words, A and B values cannot be kept separate (which would mean that they are independent) and the correct associations must be explicitly represented. This requires that the model incorporates attribute values defined as possibility distributions over several domains. This is feasible in the scope of the relational data model thanks to the concept of a nested relation. More precisely, candidates will be represented as weighted tuples which have a disjunctive semantics, i.e., they are mutually exclusive. Therefore, non-nested tuples keep their conjunctive meaning, whereas nested ones have a disjunctive interpretation since they represent alternative candidates. In the remainder, the notation R(A1, ..., Am, X1(Ap, ..., Aq), ..., Xn(Ak, ..., Ar)) stands for a schema where A1 to Am are elementary attributes (also called level-one attributes)
whose values are either precise or not (possibility distributions) and Xi(Ah, ..., Aj) represents a "structured" attribute Xi whose values are possibility distributions made of tuples built over attributes Ah to Aj, which are called "nested" attributes. Obviously, such relations have an interpretation in terms of worlds, as is the case for ordinary possibilistic relations. When one moves to a given world, a structured candidate value is split into atomic values and the schema becomes a non-nested one.

Example 6. Let us consider a relation r whose schema is R(A, B, C, D), with the extension:

r
A    B                        C               D    N
a1   {1/8, 0.8/10, 0.4/12}    {1/9, 0.7/15}   d1   1
a2   5                        {1/11, 1/12}    d3   1

and the selection condition "B < C". The resulting relation will include a nested relation gathering the attributes B and C and the result is:

res
A    X(B, C)                                           D    N
a1   {1/<8,9>, 0.7/<8,15>, 0.7/<10,15>, 0.4/<12,15>}   d1   0.2
a2   {1/<5,11>, 1/<5,12>}                              d3   1
Two points are worthy of comments:
• the certainty degree of the first tuple is 0.2 since the candidates 0.8/<10,9> and 0.4/<12,9> have been removed and the degree of the most possible one is 0.8,
• although no candidate is discarded in the second tuple, its representation must be modified in order to cope with the structure of the resulting relation.
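The nested distribution of example 6 can be built in the same illustrative style (a sketch under the same hypothetical encoding as before): pairs of candidates are weighted by the minimum of their degrees, those violating the condition are discarded, and N is updated from the most possible discarded pair.

def nested_select(t, attr1, attr2, pred):
    d1, d2 = (t["values"][a] for a in (attr1, attr2))
    d1 = d1 if isinstance(d1, dict) else {d1: 1.0}
    d2 = d2 if isinstance(d2, dict) else {d2: 1.0}
    pairs = {(a, b): min(p, q) for a, p in d1.items() for b, q in d2.items()}
    kept = {ab: d for ab, d in pairs.items() if pred(*ab)}
    if not kept:
        return None
    removed = [d for ab, d in pairs.items() if not pred(*ab)]
    new_n = min(t["N"], 1 - max(removed)) if removed else t["N"]
    return {"N": new_n, "nested": kept}

# first tuple of example 6 with the condition "B < C": (10, 9) and (12, 9) are discarded
t = {"N": 1.0, "values": {"B": {8: 1.0, 10: 0.8, 12: 0.4}, "C": {9: 1.0, 15: 0.7}}}
out = nested_select(t, "B", "C", lambda b, c: b < c)
print(round(out["N"], 2), out["nested"])
# 0.2 {(8, 9): 1.0, (8, 15): 0.7, (10, 15): 0.7, (12, 15): 0.4}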
5 Processing vacuity-oriented generalized yes/no queries

5.1 Principle of the approach

The principle of the method described hereafter is outlined in figure 2, which is an adaptation of figure 1 to extended yes/no queries. By construction, the compact processing, which is sound, general and tractable, is based on a two-step mechanism:
• a compact calculus of an algebraic query such that the equivalence with the individual worlds is guaranteed, which builds a compact possibilistic relation (complying with the data model presented before),
Fig. 2. Principle for processing vacuity-oriented generalized yes/no queries.
• a post-processing aimed at the calculation of the two degrees constituting the final answer to the user's query.
The first step relies on a processing strategy that has been pointed out for dealing with algebraic queries [4]. Of course, the price to pay for the compliance with property P1 is that only a restricted number of operations are allowed in Q with respect to the usual relational algebra.

5.2 Compact version of relational operators

It turns out that only four operators can be processed in a compact way for the following reason. In order to comply with property P1, any operator o must necessarily be such that the following property holds (P2): when a tuple t1 of an input relation is used by o, any part of t1 involving imprecise information must appear in at most one tuple of the output relation.
Proof. Let us consider a tuple u of a relation r whose A-component is imprecise and described by means of a possibility distribution π = {d1/a1, ..., dn/an}. Let us assume that u.A (i.e., π) appears in two tuples v1 = N1/<x1, π, y1>, v2 = N2/<x2, π, y2> of the output relation (whose schema is (X, A, Y)). Tuples v1 and v2 are supposed to stem from u (and possibly some other tuples from another relation). Due to the independence hypothesis assumed when worlds are built, it is possible to derive the following world from the output relation res:
X     A     Y
...   ...   ...
x1    ai    y1
x2    aj    y2
...   ...   ...
However, this world is illegal (if ai ≠ aj), since in any original world the value of A is either ai or aj. The consequence of the preceding constraint (P2) is that Cartesian products (and then joins), intersections and differences (all of them in their most general form) cannot be expected to be acceptable. Now, the objective is mainly to present the key points governing the four legal operators. The interested reader will find more details about them in [4], in particular in terms of proofs of compliance with properties P1 and P2.

The selection

When imprecise data come into play, two aspects must be handled for a selection:
• the determination of the schema of the output relation, which depends on the schema of the input relation and the nature of the selection condition. The schema of the resulting relation remains unchanged for a condition of the form "attribute θ constant" (θ ∈ {=, ≠, >, <, ≥, ≤}), as well as for a conjunction of such conditions. On the other hand, if the condition is of the form "attribute1 θ attribute2" or "attribute1 θ1 const1 or attribute2 θ2 const2", or more generally "cond1(attribute1) or cond2(attribute2)", it is necessary to introduce distributions defined over both attribute1 and attribute2;
• for each tuple t of the input relation, a tuple of the result is produced according to the following procedure (provided that at least one candidate value satisfies the selection condition):
– removal of any candidate value of t (possibly defined over several attributes in the case of nested relations) concerned with the selection condition which does not satisfy it,
– update of N by taking into account the most possible candidate value that has been removed (whose possibility degree is denoted by πM). More precisely, the value of N in the resulting tuple is the minimum of 1 − πM and its value in the initial tuple t (t.N).
The selection is illustrated in example 5.

The projection

The regular projection is intended for the removal of "useless" attributes. Four aspects have to be taken into account for the projection of a relation r on a subset Y of its attributes in the presence of imprecise values:
• the impact of the projection on the value of the attribute N in each tuple: indeed, this value is not affected by a projection and it remains unchanged in the output tuple;
• the role of duplicates: their handling is a little bit more delicate because tuples in level-one relations are conjunctive, whereas they are mutually exclusive in a nested relation where they represent alternative candidates. The appropriate way of proceeding is the following:
– duplicate tuples may appear among level-one tuples, because possibility distributions are representations and not values strictly speaking. This is necessary to recover all the legal interpretations of the projected relation. For instance, if relation r whose schema is R(A, B) contains the tuples <{π1/a1, ..., πn/an}, b1> and <{π1/a1, ..., πn/an}, b2>, the projection over A results in a relation with two identical tuples. This makes it possible to have a world with two tuples <ai> and <aj>, which would be quite impossible with a single tuple kept in the projection,
– duplicates are removed inside nested relations, since these tuples represent alternative choices for which duplicates are meaningless. According to possibility theory, the highest possibility degree is assigned in such a case;
• the structure of the resulting relation: two cases involving a nested relation are worthy of comments:
– if all the attributes of a nested relation are suppressed, the nested relation totally disappears,
– if only one of the attributes of a nested relation is retained, the nested relation becomes a level-one attribute;
• the grades: when an attribute with an imprecise value or a nested relation disappears, the possibility of the most possible candidate is forwarded (by means of a minimum) to one of the remaining attributes. So doing, it is guaranteed that no world can be drawn with a possibility greater than that of the corresponding world before projection.

Example 7. Let us consider a relation of schema (A, X(B, C), D) involving a nested relation:

r
A    X(B, C)                                              D          N
a1   {1/<b1,c1>, 0.7/<b1,c2>, 0.7/<b2,c2>, 0.4/<b3,c2>}   {0.6/d1}   0.2
a2   {1/<b1,c4>, 1/<b1,c2>}                               d3         1
The projection of r on the attributes A and C produces the following result:

res
A          C                N
{0.6/a1}   {1/c1, 0.7/c2}   0.2
a2         {1/c4, 1/c2}     1

It is worth noticing that the degree 0.6 initially attached to d1 has impacted the attribute A in the first tuple of the result (it would have been possible to forward it to C as well, or even to both A and C).
The foreign-key join (fk-join)

Although there is no hope for defining an acceptable compact version of the join in general, it turns out that the necessary condition highlighted at the beginning of subsection 5.2 (P2) can be satisfied by a specific join, namely the fk-join. The fk-join is a binary operator which allows for the composition of two relations. Let us consider the relation schemas R(W, Z) and S(W, Y), a possibilistic relation r(R) where W and Z may take imprecise values and a regular relation s(S) where the functional dependency W → Y is valid. W is then the key of s and a foreign key of r. The principle of the fk-join consists in completing relation r by adding the image of the W-component (via W → Y) into each tuple. By definition, this leads to a resulting relation whose schema (X(W, Y), Z) involves a nested relation, which "connects" the pairs of candidates over W and Y. The degree of possibility of any structured candidate value is the one of the value that has been completed. Similarly to the selection, the attribute N is updated to keep track of the most possible candidate value for which no match occurred. This operator works in a similar way when the attribute(s) involved are already part of a nested relation. For instance, with relations r and s whose respective schemas are R(X1(W, T), V) and S(W, Z), the schema of the resulting relation is (X2(W, Z, T), V), i.e., the set of attributes Z is inserted in the nested relation of the result. An illustration of the fk-join is given in the example worked out in 5.4.

The union

An interesting point is the fact that there is no interaction between tuples to produce an element of the union of two compatible relations (i.e., whose attributes are pairwise defined on the same domain). Then, the necessary condition (P2) evoked before is satisfied and the only remaining condition in order to comply with property P1 concerns the independence of the two input relations (i.e., choices of candidates among tuples can be made without any restriction). If this assumption is valid, two points are worthy of comments:
• the possible presence of nested relations inside the input relations. It may happen that two possibilistic relations have compatible attributes, but involved in nested relations whose schemas differ. In such a situation, it is necessary to define a common schema, which will be that of the resulting relation. For instance, the union of relations r and s whose respective schemas are (X1(A, B), C, D, X2(E, F), G) and (A, B, C, X3(D, E), F,
G) makes sense and the common schema is (X4(A, B), C, X5(D, E, F), G);
• the handling of duplicates. For reasons similar to those explained for the projection, duplicates must not be removed in order to be able to recover all the possible worlds (interpretations).
Finally, when attributes of the relations may take imprecise values, the union is defined by putting all their tuples into the result (along with their degree of certainty) after a possible homogenization of the schema (if necessary).

5.3 Computation of the two final degrees

Once the algebraic component Q involved in a vacuity-oriented generalized yes/no query has been processed according to the previous strategy, a compact resulting relation res is produced. The degrees of possibility and necessity, which constitute the answer to the vacuity-oriented generalized yes/no query, can be deduced from res in the following way:
• if res is empty, then the possibility degree equals 0. Due to formula (5), N also equals 0. In other words, it is certain that the answer to Q is empty;
• if res contains at least one tuple tj whose N-value (Nj) is strictly positive, the necessity degree that the answer to Q is not empty is given by max_{ti ∈ res} Ni.
Proof. Let res be made of the tuples {N1/t1, ..., Np/tp}. The only empty world that can be derived from res is the one where neither t1, ..., nor tp has a representative. According to formula (2), the degree of possibility of such a world is min(1 − N1, ..., 1 − Np), since the events considered are non-interactive. In particular, there is no such world if one of the Ni's equals 1. Thus: N(the answer to Q is not empty) = 1 − Π(res has an empty representative) = 1 − min(1 − N1, ..., 1 − Np) = max(N1, ..., Np). The contraposition of formula (5) (N > 0 ⇒ Π = 1) allows one to state that the possibility degree is 1. In other words, it is completely possible that the answer to Q is not empty (and it may be certain in the special case where N itself is 1);
• if the N-value of all the tuples of res equals 0, the certainty that the answer to Q is not empty is 0 (cf. the previous proof) and, according to formula (1), its possibility is given by the possibility of the highest possible interpretation of the tuples involved in res, i.e., the one obtained by taking the highest degree attached to the tuples appearing in res.

5.4 An example

This example is intended to illustrate the overall functioning of the procedure designed for answering vacuity-oriented generalized yes/no queries
addressed to possibilistic databases. Let us consider the possibilistic database composed of the relations im1(IM), im2(IM) and pl(PL), where IM and PL are the schemas introduced in example 1. The two relations im1 and im2 are assumed to contain images of aircrafts taken by two distinct sources (e.g., satellites). Let us take the query: "to what extent is it possible and certain that there exists at least one shot with an aircraft of length 20 and maximal speed greater than 900 km/h, taken on date d1?". More formally this query is: "to what extent is it possible and certain that the result of Q is non-empty?", where
Q = fk-join(union(select(im1, date = 'd1'), select(im2, date = 'd1')), select(pl, lg = 20 and msp > 900), {ac}, {ac}),
to which, optionally, a projection on the attribute #i can be added. With the extensions of relations im1, im2 and pl given hereafter:
im1
#i   ac               date             place   N
i1   a3               {1/d1, 0.7/d3}   c1      1
i2   {1/a2, 0.7/a1}   d1               c2      1

im2
#i   ac               date             place   N
i3   {1/a5, 0.4/a3}   {0.6/d4, 1/d1}   c3      1

pl
ac   lg   msp
a1   20   1000
a2   25   1200
a3   18   600
a4   20   1200
a5   20   1000
the selection on im1 (resp. im2, pl) creates the intermediate relation im11 (resp. im21, pl1):

im11
#i   ac               date     place   N
i1   a3               {1/d1}   c1      0.3
i2   {1/a2, 0.7/a1}   d1       c2      1

im21
#i   ac               date     place   N
i3   {1/a5, 0.4/a3}   {1/d1}   c3      0.4

pl1
ac   lg   msp
a1   20   1000
a4   20   1200
a5   20   1000
After the union of im11 and im21, the fk-join with pl1 delivers the result:

res
#i   X(ac, lg, msp)         date     place   N
i2   {0.7/<a1, 20, 1000>}   d1       c2      0
i3   {1/<a5, 20, 1000>}     {1/d1}   c3      0.4

From this table, it can be deduced that the possibility and the certainty that the answer to Q is not empty are respectively: Π = 1, N = 0.4.
If a final projection appears in Q, the resulting table obtained is:

#i         N
{0.7/i2}   0
i3         0.4

and the same values of possibility and necessity are derived from it.
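The post-processing step of section 5.3 reduces to two scans of res. In the minimal sketch below (our own; the field "best" is a hypothetical name for the degree of a tuple's most possible interpretation), the two tuples of the fk-join result above yield Π = 1 and N = 0.4 as expected.

def final_degrees(res):
    # (Pi, N) that "the answer to Q is not empty", from the compact result res
    if not res:
        return 0.0, 0.0                       # certainly empty
    n = max(t["N"] for t in res)
    if n > 0:
        return 1.0, n                         # completely possible, somewhat certain
    return max(t["best"] for t in res), 0.0   # possible to some degree, not at all certain

res = [{"N": 0.0, "best": 0.7}, {"N": 0.4, "best": 1.0}]
print(final_degrees(res))  # (1.0, 0.4)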
6 Conclusion

This chapter addresses the issue of querying relational databases where some attribute values are imprecisely known and represented as possibility distributions. For tractability reasons, a specific family of queries, called vacuity-oriented generalized yes/no queries, has been investigated. The general format of such queries is: "to what extent is it possible and certain that the answer to Q is non-empty?", where Q is an algebraic query. It has been shown that these queries can be processed in a compact (and then reasonable from a performance viewpoint) way at the price of some restrictions as to the operations present in Q. Four of the regular algebraic operators have been extended to work directly on possibilistic relations. To this aim, it is necessary to enrich the initial possibilistic relational data model. In the more specific case of queries calling only on the possibility, another compact processing strategy can be devised where a wider range of algebraic operations are allowed [3]. This work opens several lines for future research. One of them is related to the influence of the model of uncertainty chosen to represent ill-known information. We conjecture that it should be feasible to replace possibility distributions by probability distributions, because the processing strategy is founded on the equivalence between the treatment on the possibilistic database D and the one on the worlds, i.e., the regular databases, associated with D. Of course, the definition of the compact operators, as well as the final calculus of degrees, should be adapted appropriately to the probabilistic framework. Another subject of interest would concern the introduction of fuzzy algebraic queries in the body of vacuity-oriented generalized yes/no queries. Since fuzzy sets and possibility theory share the same semantics of degrees (i.e., preferences over candidates), there is no fundamental obstacle to dealing with the two aspects: imprecise data represented as possibility distributions and flexible queries with gradual predicates and connectors modeled in the framework of fuzzy sets.
References

[1] S. Abiteboul, P. Kanellakis, and G. Grahne. On the representation and querying of sets of possible worlds. Theoretical Computer Science, 78:159-187, 1991.
[2] P. Bosc, L. Duval, and O. Pivert. Value-based and representation-based querying of possibilistic databases. In Recent Issues of Fuzzy Databases, pages 3-27. Physica Verlag, 2000.
[3] P. Bosc, L. Duval, and O. Pivert. An initial approach to the evaluation of possibilistic queries addressed to possibilistic databases. Fuzzy Sets and Systems, 140:151-166, 2003.
[4] P. Bosc and O. Pivert. Towards an algebraic query language for possibilistic databases. In Proc. of the 12th Conference on Fuzzy Systems (FUZZ-IEEE'03), 2003.
[5] P. Bosc and O. Pivert. About yes/no queries against possibilistic databases. Journal of Intelligent Systems, to appear, 2005.
[6] D. Dubois and H. Prade. Fuzzy Sets and Systems: Theory and Applications. Academic Press, 1980.
[7] T. Imielinski and W. Lipski. Incomplete information in relational databases. Journal of the ACM, 31:761-791, 1984.
[8] H. Prade and C. Testemale. Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34:115-143, 1984.
[9] L. A. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1:3-28, 1978.
[10] L. A. Zadeh. Fuzzy sets. Information and Control, 8:338-353, 1965.
On the Applicability of Extended Possibilistic Truth Values in Flexible Database Modelling and Querying

Guy de Tré and Rita de Caluwe

Ghent University, Department of Telecommunications and Information Processing, Sint-Pietersnieuwstraat 41, B-9000 Ghent (Belgium)
{Guy.DETRE,Rita.DECALUWE}@UGent.be
1 Introduction

The mathematical logic supporting a database model largely determines the behavior and interaction of the model with respect to constraint satisfaction and database querying. For traditional database models that only handle perfectly described data and do not explicitly deal with missing information, the classical two-valued Boolean logic is well suited: both integrity constraints and querying criteria can be described by Boolean expressions, which should evaluate to true in order to be satisfied. If pseudo-descriptions, commonly known as 'null' values, are admitted into a database model to cope with missing information, a two-valued logic is no longer sufficient to define the impact of transformations and modifications in the presence of nulls [1; 2]. In his approach, Edgar Codd extends the relational database model based on an underlying three-valued logic [2] in order to formalize the semantics of null values in traditional relational databases. This approach has even been further extended by using a four-valued logic, hereby making an explicit distinction between nulls that represent unknown information (i.e. information that exists but is not available) and nulls that represent inapplicable information (i.e. information that is missing because it is not applicable) [4]. Due to the fact that the law of the excluded middle no longer holds for many-valued logics, such extensions have been the subject of criticism [5]. If one wants to model imperfect information more efficiently, or if one wants to enhance the database access facilities by dealing with flexible database querying, the demand for a more elaborate logic to support database modelling becomes even more stringent. In these situations, query satisfaction and constraint satisfaction become a matter of degree and require a logical
framework that allows us to reflect the knowledge of the actual truth of a proposition in a rather natural way. In this chapter, it is discussed how a many-valued logic, based on extended possibilistic truth values (EPTVs), meets these requirements and moreover helps to overcome some of the problems experienced in the null value approach in a more natural way. The main research results on EPTVs are brought together in the framework of flexible database modelling and compared with other traditional and ‘fuzzy’ approaches. The remainder of the chapter is organized as follows: In Section 2, preliminary definitions on EPTVs are presented. In Section 3, the suitability of the presented logic to support (flexible) data and database modelling is discussed. Section 4 deals with the support of flexible querying. An elaborate example, illustrating both database modelling and database querying, is given in Section 5. In Section 6 a comparison with other approaches is given. Finally, conclusions are given in Section 7.
2 Some preliminaries on EPTVs

The concept of an EPTV [7] is an extension of the concept of a possibilistic truth value (PTV) as originally presented by Henri Prade [15] and further developed by Gert De Cooman [6]. The basic idea behind EPTVs is that one considers a universe I* of three truth values: true (T), false (F) and undefined (⊥), the latter being a pseudo-description for those situations where information is inapplicable and a Boolean truth value could not be used without a loss of information. For example, with this consideration the truth value of the proposition 'salary is higher than 2.000 Euro' for a retired person is ⊥, which denotes that a (condition on) salary is not applicable for retired persons. A truth value F would in this case be less informative and just indicate that the person does not have a salary that is higher than 2.000 Euro. EPTVs are defined as disjunctive fuzzy sets [17] over I*, where the membership grades are interpreted as degrees of possibility. This means that each EPTV has the form {(T, µT), (F, µF), (⊥, µ⊥)} where µT, µF, µ⊥ ∈ [0, 1]. Furthermore, µT denotes the possibility that the proposition is true, µF denotes the possibility that it is false and µ⊥ denotes the possibility that it is undefined. For example, the EPTV {(T, 1), (F, 0.2)} expresses 'either completely possibly true or less possibly (with possibility 0.2) false'. The associated membership grades allow for a gradual representation of truth and uncertainty about truth and provide more flexibility to support flexible database modelling and flexible database querying. Special cases of EPTVs are given in Table 1. The formal definition of EPTVs and an overview of their most important properties and logical operators are given in [7].
Table 1. Special cases of EPTVs.

EPTV                         Semantics
{(T, 1)}                     True
{(F, 1)}                     False
{(T, 1), (F, 1)}             Unknown (either true or false)
{(⊥, 1)}                     Inapplicable
{(T, 1), (F, 1), (⊥, 1)}     No information
In order to be of practical usefulness, definitions of the basic logical operators for conjunction (∧̃), disjunction (∨̃) and negation (¬̃) have to be provided. A notation with a tilde is used here to explicitly denote that these operators are defined on EPTVs. This way, they can easily be distinguished from their counterparts in a classical three-valued logic. In order to preserve compatibility with classical logic, it is important that truth-functionality is as complete as possible (complete truth-functionality cannot be obtained in many-valued logics [14]). This means that the behavior of a logical operator should be mirrored in a logical function combining EPTVs, i.e., with the understanding that P represents the universe of all propositions, t : P → {T, F} is the mapping that associates a traditional truth value with each p ∈ P and t̃* : P → ℘̃(I*) is the mapping that associates an EPTV with each p ∈ P (where ℘̃(I*) denotes the set of fuzzy sets on I*), the following rules should hold:
• p, q ∈ P, q ≠ ¬p : t̃*(p ∧ q) = t̃*(p) ∧̃ t̃*(q)
• p, q ∈ P, q ≠ ¬p : t̃*(p ∨ q) = t̃*(p) ∨̃ t̃*(q)
• p ∈ P : t̃*(¬p) = ¬̃ t̃*(p)
Several definitions for the logical operators have been developed, all of them with their own advantages and drawbacks. An overview and comparison of definitions that are respectively based on Zadeh's extension principle, the minimum t-norm operation and the algebraic t-norm operation can be found in [11]. As an illustration, the definitions based on the minimum t-norm operation are given. With the understanding that ℘̃(I*) denotes the set of fuzzy sets on I*, the conjunctive aggregation operator

∧̃ : ℘̃(I*) × ℘̃(I*) → ℘̃(I*) : (Ũ, Ṽ) → Ũ ∧̃ Ṽ

is defined by
• µ_{Ũ ∧̃ Ṽ}(T) = min(µ_Ũ(T), µ_Ṽ(T))
• µ_{Ũ ∧̃ Ṽ}(F) = max(µ_Ũ(F), µ_Ṽ(F))
• µ_{Ũ ∧̃ Ṽ}(⊥) = max(min(µ_Ũ(T), µ_Ṽ(⊥)), min(µ_Ũ(⊥), µ_Ṽ(T)), min(µ_Ũ(⊥), µ_Ṽ(⊥)))
The disjunctive aggregation operator

∨̃ : ℘̃(I*) × ℘̃(I*) → ℘̃(I*) : (Ũ, Ṽ) → Ũ ∨̃ Ṽ

is defined by
• µ_{Ũ ∨̃ Ṽ}(T) = max(µ_Ũ(T), µ_Ṽ(T))
• µ_{Ũ ∨̃ Ṽ}(F) = min(µ_Ũ(F), µ_Ṽ(F))
• µ_{Ũ ∨̃ Ṽ}(⊥) = max(min(µ_Ũ(F), µ_Ṽ(⊥)), min(µ_Ũ(⊥), µ_Ṽ(F)), min(µ_Ũ(⊥), µ_Ṽ(⊥)))

Finally, the negation operator

¬̃ : ℘̃(I*) → ℘̃(I*) : Ũ → ¬̃ Ũ

is defined by
• µ_{¬̃ Ũ}(T) = µ_Ũ(F)
• µ_{¬̃ Ũ}(F) = µ_Ũ(T)
• µ_{¬̃ Ũ}(⊥) = µ_Ũ(⊥)
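Assuming the minimum t-norm definitions above, the three operators translate directly into code. In this sketch (ours, not the authors'), an EPTV is a plain dict over the keys 'T', 'F' and 'U' (the latter standing for ⊥); missing keys mean possibility 0.

def g(v, x):  # membership grade lookup, absent key means possibility 0
    return v.get(x, 0.0)

def conj(u, v):
    return {"T": min(g(u, "T"), g(v, "T")),
            "F": max(g(u, "F"), g(v, "F")),
            "U": max(min(g(u, "T"), g(v, "U")),
                     min(g(u, "U"), g(v, "T")),
                     min(g(u, "U"), g(v, "U")))}

def disj(u, v):
    return {"T": max(g(u, "T"), g(v, "T")),
            "F": min(g(u, "F"), g(v, "F")),
            "U": max(min(g(u, "F"), g(v, "U")),
                     min(g(u, "U"), g(v, "F")),
                     min(g(u, "U"), g(v, "U")))}

def neg(u):
    return {"T": g(u, "F"), "F": g(u, "T"), "U": g(u, "U")}

unknown = {"T": 1.0, "F": 1.0}   # {(T,1),(F,1)}
false_  = {"F": 1.0}             # {(F,1)}
print(conj(unknown, false_))     # {'T': 0.0, 'F': 1.0, 'U': 0.0}: unknown AND false = false
print(disj(unknown, false_))     # {'T': 1.0, 'F': 1.0, 'U': 0.0}: unknown OR false = unknown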
3 Supporting flexible database modelling Equipped with operators that almost preserve truth-functionality, EPTVs are well suited to support the modelling of imperfect and missing information in databases. This is especially the case if attribute domains are extended with domain specific symbols (⊥t ) that are used to model the inapplicability of a regular domain value (and thus support nulls that represent inapplicability) and if possibility distributions, defined over the attribute domains, are used to model vague, imprecise or uncertain attribute values, as outlined in [16]. In such cases, completely unknown information can be modelled by uniform possibility distributions that associate the value one with each regular domain value. In fact, a seamless integration of EPTVs in the type system of a database model is obtained by extending the domain of the Boolean type with a truth value ⊥ and by allowing the use of possibility distributions that are defined over this extended domain. Additionally, the definitions of all operators within the type system should be extended in order to cope with the extra domain values ⊥t . In most cases this could be done as follows op(v1 , ⊥t2 ) = op(⊥t1 , v2 ) = ⊥t where op is an operator that has to be extended, ⊥ti , i = 1, 2 is the extra value of the domain domti of argument type ti , vi , i = 1, 2 is a value of domti and ⊥t is the extra value of the domain of the result type t of op.
EPTVs are also well suited to express constraint satisfaction and therefore support the use of (flexible) integrity constraints in database modelling. Traditionally, integrity constraints are seen as properties of the universe and, as such, they impact database records when updates are performed. As any other constraint, an integrity constraint can be softened. For some database management tasks, as for example data integration, this can be advantageous [10]. Typical for flexible integrity constraints is that they can impose soft restrictions on the database, which can be useful in situations where one cannot express the semantic constraints on the data by means of a strict restriction. With respect to an update, insertion or deletion of a database record, three situations can be considered:
1. All (flexible) integrity constraints are completely satisfied. In this case, the constraints behave as if they were regular integrity constraints. The update, insertion or deletion can be safely executed on the database.
2. At least one of the flexible integrity constraints is partially satisfied. In this case, deletion should not be allowed, as it would make the database invalid. Insertion and update are allowed and have as effect that the record under consideration should only partially belong to the database. This can be modelled by means of EPTVs as described and illustrated below.
3. None of the (flexible) integrity constraints is satisfied at all. In this case, the constraints also behave as regular integrity constraints. The update, insertion or deletion operation is not allowed.
For example, instead of limiting the valid temperatures in Belgium to the range [−25 °C, 40 °C], a soft restriction allows one, e.g., to limit these temperatures to the more flexible range from 'about −20 °C' to 'about 40 °C'. Constraint satisfaction in the presence of soft restrictions may not be known with certainty and therefore requires a gradual representation. EPTVs can be used for this purpose. If an EPTV {(T, µT), (F, µF), (⊥, µ⊥)} is used to express constraint satisfaction, µT denotes the possibility that the constraint is satisfied, µF denotes the possibility that the constraint is not satisfied and µ⊥ denotes the possibility that the constraint does not apply to the considered database record. Furthermore, to each record in a database an EPTV, resulting from the aggregation of the EPTVs of all integrity constraints that are defined for the record, can be attached. Such an EPTV then expresses to what extent the record satisfies all integrity constraints that have been defined for it. In the next section, it will be illustrated how these attached EPTVs can be calculated and aggregated. With the description and illustration given above, it should be clear that EPTVs are suited to support flexible database modelling. More details about
the generality of the presented approach and a further description of different kinds of flexible integrity constraints are beyond the scope of this chapter, but can be found in [9].
4 Supporting flexible database querying

Because a (soft) selection condition imposed by a (flexible) query can be seen as a special case of a constraint, EPTVs can also be used to express query satisfaction. What is initially needed are appropriate evaluation functions for comparison operators. The basic comparison operators can be defined as follows: assume that op ∈ {=, ≠, ≤, ≥, <, >} denotes a comparison operator and that Ũ and Ṽ are fuzzy sets that are defined over the domain domt of an attribute with associated type t. Furthermore, assume that domt contains the extra value ⊥t, which is used to model inapplicability. Then the EPTV of the comparison Ũ op Ṽ is defined by
• µ_{Ũ op Ṽ}(T) = sup_{(x,y) ∈ (domt)², x op y} min(µ_Ũ(x), µ_Ṽ(y))
• µ_{Ũ op Ṽ}(F) = sup_{(x,y) ∈ (domt \ {⊥t})², ¬(x op y)} min(µ_Ũ(x), µ_Ṽ(y))
• µ_{Ũ op Ṽ}(⊥) = min(µ_Ũ(⊥t), µ_Ṽ(⊥t))
Another frequently used comparison operator is the so-called 'is' operator. This operator allows one to compare stored (labels of) possibility distributions with (labels of) fuzzy sets provided by the user in a query. The general form of the 'is' operator is

A is L

where A is the name of a database attribute and L is a linguistic term that is given by the user. For example, if A is the Salary attribute of an employee record whose current value is denoted by the label 'about 3.000 Euro', which is modelled by a possibility distribution function π_'about 3.000 Euro', and L is the label of the fuzzy set that represents 'high salary', then the calculated EPTV of the proposition 'Salary is high salary' will express to what extent 'about 3.000 Euro' is a 'high salary'. From the possibilistic logic viewpoint, the 'is' operator should be interpreted as follows. L is represented by a fuzzy set with membership function µL and πA is a possibility distribution being the value of the attribute A at a given record. p = A is L is treated as a proposition of a multi-valued logic, thus possessing a fuzzy set of models, M. On the other hand, πA is a possibility distribution on the space of interpretations. An interpretation is meant here as an assignment of a value to the attribute A. Evaluation of the proposition p corresponds to the computation of an EPTV and is, in accordance with [13], defined by
• µ_{A is L}(T) = sup_{x ∈ domt} min(πA(x), µL(x))
• µ_{A is L}(F) = sup_{x ∈ domt \ {⊥t}} min(πA(x), 1 − µL(x))
• µ_{A is L}(⊥) = min(πA(⊥t), 1 − µL(⊥t))

In flexible querying, weights may be associated with the query criteria, hereby denoting that some criteria might be more relevant (or stringent) than others. In order to cope with such weights, some more advanced aggregation operators are needed. From a conceptual point of view, a distinction can be made between static weights and dynamic weights. Static weights are fixed, known in advance and can be directly derived from the formulation of the query. These weights are independent of the values of the record(s) on which the query criteria act and are not allowed to change during query processing. A further distinction can be made between static weight assignments, where it is also known in advance with which constraint a weight is associated (e.g. in a situation where the user explicitly states his/her preferences), and dynamic weight assignments, where the associations between weights and criteria depend on the actual attribute values of the record(s) on which the query criteria act (e.g. in a situation where most criteria have to be satisfied, but it is not important which ones). In aggregation with dynamic weights, neither the weights nor the associations between weights and criteria are known in advance; both depend on the attribute values of the record(s) on which the query criteria act. This kind of flexibility is required to avoid some unnatural behavior of the query evaluation in cases where, e.g., a criterion is of limited importance only within a given range of values, for example, if the criterion 'high salary' is not important unless the salary value is extremely high. Several advanced aggregation operators for EPTVs have been studied by the authors. An example of an advanced aggregation operator for EPTVs that deals with fixed, statically assigned weights is the operator based on residual (co)implicator functions presented in [8]. However, more research on aggregation operators within the framework of EPTVs is necessary to be able to cope with all the different kinds of weight handling described above. Another important aspect of flexible querying is the ranking of the possible alternatives in the result of a query. In fact, because query satisfaction is no longer a matter of bivalence (true/false) but a matter of degree, the supporting logical framework should provide adequate ranking facilities. Such facilities are necessary for the efficient interpretation and satisfactory representation of the (relative importance of the) results of a query. The simplest way to rank EPTVs is by using the so-called analytical ranking function ra that maps each EPTV {(T, µT), (F, µF), (⊥, µ⊥)} onto its associated ranking value

ra = (1 + (µT − µF)(1 − µ⊥/2)) / 2 ∈ [0, 1].

An alternative approach is to assume the following initial ordering of the 'basic' truth values
{(T, µT)} > {(⊥, µ⊥)} > {(F, µF)}

and to consider the impact of the (F)alse component on the ranking process to be more negative than the impact of the ⊥-component. This can be obtained by introducing a threshold value α for the ⊥-component and by subsequently ranking the EPTVs as if they were PTVs. The EPTVs are then ranked with decreasing possibilities of being true (i.e. µT), with the additional restriction that their respective possibilities of being inapplicable (i.e. µ⊥) must not be greater than α. The resulting mapping function rα then maps each EPTV {(T, µT), (F, µF), (⊥, µ⊥)} onto its associated ranking value

rα = (µT + (1 − µF)) / 2 if µ⊥ ≤ α, and rα = 0 if µ⊥ > α.
More information on ranking functions for EPTVs, including a third ranking method, has been presented in [12]. In this section it is illustrated that EPTVs are suited to support flexible database querying. To preserve generality and to be independent of any database model, only selection conditions are dealt with. However, when dealing with a specific database model, as for example the relational database model [3], other data manipulation operators need to be considered. This is beyond the scope of this chapter. More details about the handling and behavior of such operators within a logical framework based on EPTVs can be found in [9].
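As a sketch of the querying machinery of this section (our own toy data; the salary distribution and the membership grades for 'high salary' are invented), the 'is' operator and the analytical ranking function ra can be written as follows.

def is_eptv(pi_A, mu_L, bottom="UNDEF"):
    # EPTV of "A is L" for a stored possibility distribution pi_A and a fuzzy set mu_L
    regular = [x for x in pi_A if x != bottom]
    return {"T": max((min(pi_A[x], mu_L.get(x, 0.0)) for x in pi_A), default=0.0),
            "F": max((min(pi_A[x], 1 - mu_L.get(x, 0.0)) for x in regular), default=0.0),
            "U": min(pi_A.get(bottom, 0.0), 1 - mu_L.get(bottom, 0.0))}

def rank_analytical(v):
    # ra = (1 + (mu_T - mu_F)(1 - mu_U / 2)) / 2
    return (1 + (v["T"] - v["F"]) * (1 - v["U"] / 2)) / 2

# "Salary is high salary" for a salary known as 'about 3000 Euro' (toy grades)
pi_salary = {2900: 0.5, 3000: 1.0, 3100: 0.5}
mu_high = {2900: 0.6, 3000: 0.7, 3100: 0.8}
v = is_eptv(pi_salary, mu_high)
print(v, round(rank_analytical(v), 2))  # {'T': 0.7, 'F': 0.4, 'U': 0.0} and ra = 0.65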
5 An example

As an illustrative example a simple relational database [3] consisting of a single relation ‘High buildings’ is considered.

5.1 Flexible database modelling

The relation High buildings contains some facts about buildings, which can generally be categorized as being high. Each tuple t in High buildings represents a building and is characterized by a unique building name (name), which is the primary key attribute, the city where the building is located (location), the height of the building in metres (height), the number of storeys in the building (storeys) and the year when the construction of the building was finished (year). To assure that the relation only contains tuples that represent high buildings, a soft integrity constraint ‘height IS high’
Fig. 1. The modelling of ‘high’ buildings.
is defined for the relation. Hereby, high is a linguistic term being represented by the fuzzy set with membership function as depicted in Figure 1. As extra information, an EPTV expressing the truth of the proposition that the tuple belongs to the relation is attached to each tuple (Contains). In fact, this EPTV expresses the integrity constraint satisfaction of each tuple. Each attribute Ai has an associated domain dom_Ai of valid values. Each associated domain dom_Ai contains a domain-specific ‘undefined’ element ⊥_Ai that is used to model cases where a regular domain value is not applicable, as described in [16].

Table 2. Example of instances of relation High buildings.

Name               Location      Height      Storeys  Year      Contains
Empire State       New York      381         102      1931      {(T, 1)}
Petronas Towers    Kuala Lumpur  452         88       1998      {(T, 1)}
Taipei 101         Taipei        509         101      2004      {(T, 1)}
Chrysler Building  New York      319         77       1930      {(T, 1)}
Eiffel Tower       Paris         300.5±0.2   3        1889      {(T, 1)}
Central Plaza      Hong Kong     374         78       1992      {(T, 1)}
Sears Tower        Chicago       442         110      1974      {(T, 1)}
Great Pyramid      Gizeh         140±5       N/A      ±2570BC   {(T, 0.2), (F, 0.8)}
Trump Building     New York      283         72       1930      {(T, 0.9), (F, 0.1)}
Warsaw TV Mast     Warsaw        646         N/A      1973      {(T, 1)}
CITIC Plaza        Guangzhou     391         80       1996      {(T, 1)}
Table 2 is a tabular representation of an example of High buildings (each row represents a tuple, also called an instance of the relation). The values of the ‘height’- and ‘year’-attributes are labels, each of them being characterized by a possibility distribution function that is associated with the attribute and is defined over the domain of the attribute. For example, a label denoting a ‘crisp’ number like 381 corresponds to a possibility distribution function with a discrete support, labels like ‘300.5±0.2’ correspond to triangular possibility distribution functions. These possibility distributions model that the attribute value is imprecise. For example, the height of the Eiffel Tower is known to be 300.51 metres (± 15cm depending on temperature). Likewise, at construction
the Great Pyramid was about 146 metres high, but due to erosion its current height is about 137 metres, which is modelled by the label ‘140±5’. The label ‘N/A’ stands for ‘not applicable’ and corresponds to the possibility distribution function

Π_N/A(x) = 0, if x ∈ dom_Ai \ {⊥_Ai}
Π_N/A(x) = 1, if x = ⊥_Ai

This label is used to denote that the storeys attribute does not apply for the ‘Great Pyramid’ and the ‘Warsaw TV Mast’. In a similar way, one can consider a label ‘UNK’ that stands for ‘unknown (but applicable)’ and corresponds to the (rectangular) possibility distribution function

Π_UNK(x) = 1, if x ∈ dom_Ai \ {⊥_Ai}
Π_UNK(x) = 0, if x = ⊥_Ai

and a label ‘UNA’ that stands for ‘unavailable’ and consequently corresponds to the possibility distribution function

Π_UNA(x) = 1, ∀x ∈ dom_Ai

Due to the soft integrity constraint ‘height IS high’ the ‘Great Pyramid’ is still considered as being a high building, but to a lesser extent than other, more recent buildings. This is reflected by the associated EPTV in the Contains attribute field, which is calculated by applying the formula for ‘IS’ predicates, i.e.

• µ_{140±5 is high}(T) = sup_{x ∈ dom_height} min(π_140±5(x), µ_high(x)) = 0.2
• µ_{140±5 is high}(F) = sup_{x ∈ dom_height \ {⊥_height}} min(π_140±5(x), 1 − µ_high(x)) = 0.8
• µ_{140±5 is high}(⊥) = min(π_140±5(⊥_height), 1 − µ_high(⊥_height)) = 0

For the same reasons the ‘Trump Building’ is almost considered as being a high building, with associated EPTV:

• µ_{283 is high}(T) = sup_{x ∈ dom_height} min(π_283(x), µ_high(x)) = 0.9
• µ_{283 is high}(F) = sup_{x ∈ dom_height \ {⊥_height}} min(π_283(x), 1 − µ_high(x)) = 0.1
• µ_{283 is high}(⊥) = min(π_283(⊥_height), 1 − µ_high(⊥_height)) = 0

5.2 Flexible database querying

In flexible database requests, the query criteria themselves may be flexible and contain labels. Consider, for example, the following query with a
non-compound flexible query condition:

Query 1:
SELECT name, location, height
FROM High buildings
WHERE height IS About 300m

This query selects the name, location and height of all buildings that are about 300 metres high. For the sake of the example, it is assumed that the linguistic term ‘About 300m’ stands for 300m±50m and is modelled by a fuzzy set with a triangular membership function as depicted in Figure 2.
Fig. 2. The modelling of ‘About 300m’.
The tuples belonging to the result set of Query 1 are given in Table 3. This result set illustrates that EPTVs can usefully be applied to express the ‘degree’ of query satisfaction and, moreover, can adequately handle the occurrence of missing information in flexible database requests.

Table 3. Result set of Query 1.

Name               Location      Height      Contains                  ra
Eiffel Tower       Paris         300.5±0.1   {(T, 0.99), (F, 0.01)}    0.99
Trump Building     New York      283         {(T, 0.66), (F, 0.34)}    0.66
Chrysler Building  New York      319         {(T, 0.62), (F, 0.38)}    0.62
Empire State       New York      381         {(F, 1)}                  0
Petronas Towers    Kuala Lumpur  452         {(F, 1)}                  0
Taipei 101         Taipei        509         {(F, 1)}                  0
Central Plaza      Hong Kong     374         {(F, 1)}                  0
Sears Tower        Chicago       442         {(F, 1)}                  0
Great Pyramid      Gizeh         140±5       {(F, 1)}                  0
Warsaw TV Mast     Warsaw        646         {(F, 1)}                  0
CITIC Plaza        Guangzhou     391         {(F, 1)}                  0
With this example it is illustrated that EPTVs allow modelling of the partial satisfaction of a flexible query condition. Tuple ‘Eiffel Tower’ almost completely satisfies the query condition, whereas tuples ‘Trump Building’ and ‘Chrysler Building’ only partly satisfy the query condition. All other tuples do
not satisfy the condition but are shown for completeness. If required, threshold values on the associated EPTVs can be used to limit the query results. The associated EPTVs in the result set all result from the conjunction of the associated EPTV from the original relation and the EPTV that results from the evaluation of the query condition. Hereby, the conjunction operator based on the minimum t-norm has been applied. For example, the associated EPTV for the tuple ‘Trump Building’ is obtained from the conjunction

{(T, 0.9), (F, 0.1)} ∧̃ {(T, 0.66), (F, 0.34)}

where {(T, 0.9), (F, 0.1)} is the EPTV from the relation High buildings and {(T, 0.66), (F, 0.34)} is the EPTV that results from the evaluation of the query condition ‘height IS About 300m’, calculated by applying the formula for ‘IS’ predicates, i.e.

• µ_{283 is About 300m}(T) = sup_{x ∈ dom_height} min(π_283(x), µ_About 300m(x)) = 0.66
• µ_{283 is About 300m}(F) = sup_{x ∈ dom_height \ {⊥_height}} min(π_283(x), 1 − µ_About 300m(x)) = 0.34
• µ_{283 is About 300m}(⊥) = min(π_283(⊥_height), 1 − µ_About 300m(⊥_height)) = 0

As extra information for the reader, the derived analytical ranking value ra is added in the last column of the result set. Usually, this derived information should not be stored in the database.

Queries with compound query conditions (crisp or flexible) are handled in the same way as queries with compound crisp query conditions: the EPTVs of the tuples in the result set are calculated from the EPTVs of the individual conditions in the composition, using the logical operators ‘¬̃’, ‘∧̃’ and ‘∨̃’.

In order to illustrate the use of the third truth value ⊥, consider the following query:

Query 2:
SELECT name, location, storeys
FROM High buildings
WHERE storeys > 100

This query selects the name, location and number of storeys of all buildings that have more than 100 storeys. The tuples belonging to the result set of Query 2 are given in Table 4. As with the previous example, the associated EPTVs again result from the conjunction of the associated EPTV from the original relation and the EPTV that results from the evaluation of the query condition. Again, the conjunction operator based on the minimum t-norm has been applied. By looking at the result set it can be observed that the framework makes a clear distinction between the situation where data exist, but the query condition is not satisfied, and the situation where the query data does not apply.
Table 4. Result set of Query 2.

Name               Location      Storeys  Contains                ra
Empire State       New York      102      {(T, 1)}                1
Taipei 101         Taipei        101      {(T, 1)}                1
Sears Tower        Chicago       110      {(T, 1)}                1
Warsaw TV Mast     Warsaw        N/A      {(⊥, 1)}                0.5
Great Pyramid      Gizeh         N/A      {(F, 0.8), (⊥, 0.2)}    0.14
Eiffel Tower       Paris         3        {(F, 1)}                0
Trump Building     New York      72       {(F, 1)}                0
Chrysler Building  New York      77       {(F, 1)}                0
Petronas Towers    Kuala Lumpur  88       {(F, 1)}                0
Central Plaza      Hong Kong     78       {(F, 1)}                0
CITIC Plaza        Guangzhou     80       {(F, 1)}                0
For example, the facts that the ‘Great Pyramid’ and the ‘Warsaw TV Mast’ do not have ‘storeys’ are communicated to the users by means of the extra truth value ⊥, whose membership degree differs from zero. The ‘Great Pyramid’ only satisfies the query condition to a minimal extent because it does not fully belong to the original relation. The associated EPTV {(F, 0.8), (⊥, 0.2)} is in fact the result of the conjunction

{(T, 0.2), (F, 0.8)} ∧̃ {(⊥, 1)}

where {(T, 0.2), (F, 0.8)} is the EPTV from the relation High buildings and {(⊥, 1)} is the EPTV that results from the evaluation of the query condition ‘storeys > 100’.
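The conjunction of EPTVs used in both examples can be sketched via the extension principle and the minimum t-norm. The three-valued conjunction table assumed below (F dominates, ⊥ propagates otherwise) is an assumption that reproduces the worked examples; see [7] for the authors' exact operator definitions.

def and3(x, y):
    # assumed three-valued conjunction table: F dominates, then BOT
    if x == 'F' or y == 'F':
        return 'F'
    if x == 'BOT' or y == 'BOT':
        return 'BOT'
    return 'T'

def eptv_and(p, q):
    # p, q: EPTVs given as dicts mapping 'T', 'F', 'BOT' to membership degrees
    out = {'T': 0.0, 'F': 0.0, 'BOT': 0.0}
    for x, mx in p.items():
        for y, my in q.items():
            z = and3(x, y)
            out[z] = max(out[z], min(mx, my))
    return out

# eptv_and({'T': 0.2, 'F': 0.8}, {'BOT': 1.0}) -> {'T': 0.0, 'F': 0.8, 'BOT': 0.2},
# the EPTV of the 'Great Pyramid' in Table 4.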
6 Comparison with other approaches

Classical two-valued Boolean logic does not have the semantic richness that is necessary to efficiently support flexible database modelling and flexible database querying. Extensions like three-valued Kleene logic and four-valued Belnap logic provide an extra truth value to represent unknown information and can be used to support null values. However, as described in [5], such approaches introduce problems of incomplete truth-functionality. Neither do they satisfy the requirements of the more advanced techniques for the modelling and querying of imperfect information described in the previous sections.

In more advanced (fuzzy) database modelling, the most frequently used framework is based on a [0, 1]-valued possibility measure (and an associated necessity measure) [13]. Typical for this approach is that a possibility measure Pos and an associated necessity measure Nec are defined over the Cartesian product dom_t1 × dom_t2 × · · · × dom_tm of the domains of the associated types of the characterizing attributes A1, A2, . . . , Am of the resulting record set R
of a query. As a consequence, every record r of the result set has been assigned a possibility Pos({r}) ∈ [0, 1] and a necessity Nec({r}) ∈ [0, 1], which respectively indicate to what extent it is possible that r belongs to the result set and to what extent it is certain (or necessary) that r belongs to the result set. Disjunction of possibility measures and conjunction of necessity measures are obtained by

Pos(A ∪ B) = max(Pos(A), Pos(B)) and Nec(A ∩ B) = min(Nec(A), Nec(B)).

Furthermore, ∀ A, Pos(A) = 1 − Nec(Ā), where Ā denotes the complement of A.

The main advantage of using this approach is that possibility theory provides mathematical tools and evaluation functions that allow one to calculate the possibility and necessity measures stemming from the comparison of possibility distributions that model uncertain, imprecise or vague information [13]. The approach presented in this chapter is an extension of this framework, which additionally allows one to cope explicitly with the inapplicability of information, which regularly occurs in database applications. In fact, the similarity between both approaches becomes clear if the approach with EPTVs is restricted to the regular truth values T and F. If the EPTV t̃∗(p) represents the truth of a proposition p, then the membership grade µ_{t̃∗(p)}(T) is the possibility that the proposition is true and can be interpreted as the possibility of the proposition, whereas the membership grade µ_{t̃∗(p)}(F) is the possibility that the proposition is false, which might be interpreted as the complement of the necessity of the proposition.

Besides the main advantage of being able to cope explicitly with the inapplicability of information, this new approach also provides the user with an efficient and convenient way to handle expressions. Expressions can be evaluated just by applying the extended conjunction, disjunction, negation, implication and equivalence operators as presented in [7]. EPTVs provide the user with the maximum of information available: if some parts of an expression are not applicable at evaluation time, this is clearly expressed by means of the third component ⊥. In those cases where a ranking of EPTVs is required, a ranking function as presented in Section 4 can be applied in order to obtain a single ranking value. This ranking value needs to be interpreted as extra information, associated with the EPTV, and does not replace the EPTV. The EPTV itself remains connected with the database records, as illustrated in the examples of the previous section, thereby providing more information about the truth value under consideration than possibility and necessity measures do.
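The correspondence just described can be made concrete in a few lines (a sketch under the stated restriction to the regular truth values T and F; not the authors' formulation):

def pos_nec(mu_T, mu_F):
    # possibility and necessity of a proposition p, read off from an EPTV
    # restricted to the regular truth values T and F
    pos = mu_T            # possibility that p is true
    nec = 1.0 - mu_F      # necessity of p, as 1 minus the possibility of 'p is false'
    return pos, nec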
7 Conclusions

In this chapter, the usefulness of a logical framework based on EPTVs for supporting flexible database modelling and querying has been discussed. This logical framework is based on possibility theory and has specific facilities to deal with the inapplicability of information, the inapplicability of integrity constraints and the inapplicability of querying criteria. Although many-valued logics cannot be fully truth-functional, the truth-functionality of the presented framework is as complete as possible. This allows for better support for the handling of imperfect and missing information. Moreover, because EPTVs require no specific notations to model unknown information, some of the problems of traditional approaches based on a three- or four-valued logic with a specific truth value ‘unknown’ have been solved.

As indicated in this chapter, further research on EPTVs is necessary, especially within the field of advanced aggregation. Future work of our research group will further explore this field. Another topic of interest concerns the further study and development of more advanced querying mechanisms, including other operators and evaluation functions. Up to now the focus of our research has been on the applicability of EPTVs within object oriented database models. In the near future attention will be paid to the incorporation of the presented framework in both the regular relational database model and possibilistic relational database models.
References

[1] J. Biskup. A formal approach to null values in database relations. In H. Gallaire, J. Minker, and J. Nicolas, editors, Advances in Data Base Theory, pages 299–341. Plenum Press, New York, USA, 1981.
[2] E.F. Codd. Missing information (applicable and inapplicable) in relational databases. ACM SIGMOD Record, 15(4):53–78, 1986.
[3] E.F. Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6):377–387, 1970. Republished in (1983) Communications of the ACM 26(1).
[4] E.F. Codd. More commentary on missing information in relational databases (applicable and inapplicable information). ACM SIGMOD Record, 16(1):42–50, 1987.
[5] C.J. Date. Null values in database management. In Relational Database: Selected Writings, pages 313–334. Addison-Wesley Publishing Company, Massachusetts, USA, 1986.
[6] G. de Cooman. From possibilistic information to Kleene's strong multi-valued logics. In D. Dubois et al., editors, Fuzzy Sets, Logics and Reasoning about Knowledge, pages 315–323. Kluwer Academic Publishers, Boston, USA, 1999.
[7] G. de Tré. Extended possibilistic truth values. International Journal of Intelligent Systems, 17:427–446, 2002.
[8] G. de Tré and B. De Baets. Aggregating constraint satisfaction degrees expressed by possibilistic truth values. IEEE Transactions on Fuzzy Systems, 11(3):361–368, 2003.
[9] G. de Tré and R. de Caluwe. A constraint based fuzzy object oriented database model. In Z. Ma, editor, Advances in Fuzzy Object-Oriented Databases: Modeling and Applications, pages 1–45. Idea Group Publishing, Hershey, USA, 2005.
[10] G. de Tré, R. de Caluwe, and H. Prade. Null values revisited in prospect of data integration. Lecture Notes in Computer Science, 3226:79–90, 2004.
[11] G. de Tré, R. de Caluwe, J. Verstraete, and A. Hallez. Conjunctive aggregation of possibilistic truth values and flexible database querying. Lecture Notes in Artificial Intelligence, 2522:344–355, 2002.
[12] G. de Tré, T. Matthé, K. Tourné, and B. Callens. Ranking the possible alternatives in flexible querying: An extended possibilistic approach. Lecture Notes in Computer Science, 2869:204–211, 2003.
[13] D. Dubois and H. Prade. Fuzzy Sets and Systems: Theory and Applications. Academic Press, 1980.
[14] D. Dubois and H. Prade. Possibility theory, probability theory and multiple-valued logics: A clarification. Annals of Mathematics and Artificial Intelligence, 32(1–4):35–66, 2001.
[15] H. Prade. Possibility sets, fuzzy sets and their relation to Łukasiewicz logic. In Proc. of the 12th International Symposium on Multiple-Valued Logic, pages 223–227, 1982.
[16] H. Prade and C. Testemale. Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34:115–143, 1984.
[17] L.A. Zadeh. Fuzzy sets. Information and Control, 8:338–353, 1965.
Preference based Quality Assessment and Presentation of Query Results

Stefan Fischer, Werner Kießling and Timotheus Preisinger
Chair for Databases and Information Systems, University of Augsburg, D-86139 Augsburg
{fischer, kiessling, preisinger}@informatik.uni-augsburg.de
1 Introduction

As long as there have been database search engines there has been the problem of what to present to the user when there is no perfect match, and how to present that query result to the user. Respecting the user's search preferences is the suitable way to search for best matching alternatives. Modelling such preferences as strict partial orders with "A is better than B" semantics has proven to be user intuitive in various internet applications. The better the search result, the greater the psychological advantage of the presenter. Thus, there is the necessity to know the quality of the search result with respect to the search preferences.

This chapter introduces a novel personalized and situated quality assessment for query results. Based on a human comprehensible linguistic model of five quality categories, a very intuitive framework for valuations is defined for numerical as well as for categorical search preferences. These quality valuations provide human comprehensible presentation arguments. Moreover, they are used to compute the situated overall quality of a search result. For delivery of the results a flexible and situated filter decides which results to present, e.g. by respecting quality requirements of the user. A so-called presentation preference determines which results are predestined to be especially pointed out to a user. Finally, it is evaluated how e-commerce applications can profit from the use of a preference based search in combination with the introduced human comprehensible quality assessment.

Considering the procurement of goods via the internet, the idea is simple. A customer expects to have at least the service he or she has when directly contacting a human sales person. That means the customer wants to be treated individually according to his or her needs. But the misery already begins with the first step, the usage of the search engine.
S. Fischer et al.: Preference based Quality Assessment and Presentation of Query Results, Studfuzz 203, 91–121 (2006) c Springer-Verlag Berlin Heidelberg 2006 www.springerlink.com
1.1 Database search engines – a continuing misery

The lack of effectiveness of database search engines is as old as database search engines themselves. If there is no perfect match with respect to the search conditions, a best alternative must be delivered. Even Amazon.com, market leader in the B2C domains of books and audio CDs, is not able to present a simple alternative to the desired book “Diary for Robin” by the author “James Patterson”, although there is a book by this author with the similar title “Diary for Nicholas” (see Fig. 1). This phenomenon is known as the empty result effect. As a solution, by now most search engines are equipped with the option
Fig. 1. Amazon’s failing search engine.
to combine the search conditions with a logical or. E.g. the search engine of the company B2B-Perfect.com uses this simple technology and promises a powerful search. Especially in B2B, where the goods are much more complex than books, the effect is clear: a flooding effect with lots of irrelevant results. Fortunately, the misery can be stopped by respecting the customer's search preferences as soft conditions. Modelling preferences as strict partial orders with "A is better than B" semantics [15] has proven to be user intuitive in various internet applications [19].

1.2 Quality of a query result

But more than this, it is necessary to provide a similarly good service as a human sales person. Since a customer wants to understand why the presented results are the best available ones for him, it is necessary to know about the quality of the query results w.r.t. the customer's wishes. Normally, to be convinced of the presented pre-selection, as stated by the sales psychologist Becker [2], a customer wants an impression of the search quality of the presented query results as decision support. Some advanced search engines deliver ad hoc and very often not intuitively computed alternatives. The presentation of these results is often equipped with a valuation that is not human comprehensible and that aims to tell how close the result is to the search conditions. E.g. the search engine for the documentation of the Oracle database system scores the results with values between 1% and 100% and orders the results in descending order.
We will briefly leave the field of querying standard databases and have a look at structured and semi-structured data. The search engine used by the portal of the scientific association ACM.org does a little better by valuating the results within five categories (see Fig. 2), which according to Zadeh [26] is the intuitive number of categories a human being normally differentiates, a thesis also supported by psychologist Miller's research on perception [20].
Fig. 2. Quality of search results at ACM.
But the user of this search engine can hardly understand why there are ratings of the second and fourth category when searching only for the word “Kießling”. A closer look at the first result shows that Kießling's works are often cited, which seems to be the reason for the applied second highest valuation. But it is not comprehensible why the second result only gets a rating of the fourth category although Kießling is coauthor of the paper and his work is also cited very often therein. The problem lies in the non-personalized and non-situated measurement of key data used in such technologies, which obviously fail. Maybe one user focusses on the author, another one on the number of citations. Such problems have been widely discussed in the context of information retrieval. It seems to be a typical problem of this domain not directly connected to usual database searches, but it suffers from the same problems as common keyword search does in ordinary databases: On the one hand, users can hardly understand which methodology is used for result computation. On the other hand, there are barely any possibilities for a personalization of the search request, although this is fundamental for satisfying search results.
A further and very interesting aspect is that in a sales scenario, of course, the better the search result, the greater the psychological advantage of the vendor. Therefore, an internet store should have information about the quality of the search results for being able to provide good reasoning when offering the results. All the well known sales psychology models (e.g. [21; 9; 11]) emphasize that knowledge about the quality of the offered goods with respect to the customer's preferences is a major factor for a sales dialog and for consumer choice behavior. The preferences of each of the customers of course differ. They are even different for one customer in various situations, e.g. someone may in general prefer the color green, but not when considering the color of a car. These personality and situation dependent preferences are also known as conditional preferences in [4], although the approach used there rather focuses on situations. Of course, in a sales scenario the question is not only how to valuate a query result, but also in which order the results should be presented. Is there a result to pro-actively point out? Is there a possibility to influence the consumer choice behavior?

To address these questions, this chapter is organized as follows. Next, the foundations of the underlying intuitive preference model will be summarized. In section 3 a novel approach for the quality assessment of query results will be introduced, including psychological aspects of query result presentation regarding consumer choice behavior. Experiences with human customers, gathered with a prototype, will demonstrate the valuable effects in section 4. A summary and outlook will conclude this chapter.
2 Fundamental preference concepts revisited

To handle user preferences, a comprehensive and theoretically well-founded framework is necessary for a flexible and intuitive usage. Therefore, the foundations of Kießling [15; 16] for preferences in databases are briefly repeated. They are the basis for an intuitive and flexible valuation of search results. Afterwards we will have a look at some other preference models.

2.1 Preference modelling – foundations

People naturally express their wishes in terms like “I like A better than B”. This kind of preference modelling is universally applied in any kind of situation and understood by everybody. Even just thinking about having tea or coffee for breakfast reveals such a preference. People are intuitively used to dealing with such preferences. As shown and designed in [15; 16], all these preferences can be formulated as strict partial orders and even be engineered into complex, multidimensional preference constructs without loss of intuitive semantics.
Preferences and their engineering

In [15; 16] a preference is formulated on a set of attribute names with an associated domain of values. When combining preferences P1 and P2, the attributes of P1 and P2 may overlap, allowing multiple and even conflicting preferences on the same attributes.

Definition 1. Domain values of a set of attributes
Let A := {A1, A2, . . . , Ak} denote a non-empty set of attribute names Ai associated with domains of values dom(Ai), 1 ≤ i ≤ k. The domain of A is defined as dom(A) := ×_{Ai ∈ A} dom(Ai).

Definition 2. A preference as strict partial order
Given a set A of attribute names, a preference P is a strict partial order P := (A, <_P), where <_P ⊆ dom(A) × dom(A); x <_P y is interpreted as "y is better than x".
will see some of the various intuitive preference constructs introduced in [15]. The following preference constructs enable one to easily model the most frequent expressions.

Definition 4. Categorical or non-numerical base preferences
Given a finite POS-set S ⊆ dom(A) of preferred values, P := POS(A, S) is called a POS preference if

∀x, y ∈ dom(A) : x <_P y ⇔ y ∈ S ∧ x ∉ S.
Given an interval [low, up] ∈ dom(A) × dom(A), the distance is defined as

∀v ∈ dom(A) : dist(v, [low, up]) := 0 if v ∈ [low, up], low − v if v < low, and v − up otherwise.

P := BETWEEN(A, [low, up]) is a BETWEEN preference if

∀x, y ∈ dom(A) : x <_P y ⇔ dist(x, [low, up]) > dist(y, [low, up]).

The user specifies an interval with his preferred values. P := AROUND(A, z) is called an AROUND preference if P := BETWEEN(A, [z, z]). The user specifies a single preferred value.

Given a function f : dom(A) → R and the ‘less-than’ order ‘<’ on R, P := SCORE(A, f) is called a SCORE preference if

∀x, y ∈ dom(A) : x <_P y ⇔ f(x) < f(y).
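These numerical base constructs can be sketched as comparator factories, where better(x, y) is True iff y is preferred over x, i.e. x <_P y. This is an illustrative Python sketch with hypothetical helper names, not the Preference SQL implementation.

def dist(v, low, up):
    # distance of a value v to the interval [low, up], as in Definition 5
    if low <= v <= up:
        return 0
    return low - v if v < low else v - up

def between(low, up):
    return lambda x, y: dist(x, low, up) > dist(y, low, up)

def around(z):
    return between(z, z)

def score(f):
    return lambda x, y: f(x) < f(y)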
2 are obvious.

Definition 6. Pareto preference
Given two preferences P1, P2 on two domains A1, A2 and two tuples (x1, x2), (y1, y2) ∈ dom(A1) × dom(A2). A preference P := P1 ⊗ P2 is called a Pareto preference if

(x1, x2) <_{P1 ⊗ P2} (y1, y2) ⇔ (x1 <_{P1} y1 ∧ (x2 <_{P2} y2 ∨ x2 = y2)) ∨ (x2 <_{P2} y2 ∧ (x1 <_{P1} y1 ∨ x1 = y1)).
maximum temperature during winter days should be between 10 and 16 degrees Celsius. These three preferences can be expressed as P := P1 ⊗ P2 ⊗ P3, where P1 := AROUND(population, 500000), P2 := POS(state capital, yes), P3 := BETWEEN(degreesCelsius, [10, 16]). A graph with edges xi → xj for every two xi, xj with xi <_P xj is depicted in Fig. 3; a dominance check for this example is sketched after the figure.
Fig. 3. Graph for Lisa's Pareto preference, with nodes Las Vegas (500000, no, 16), Phoenix (1400000, yes, 21) and San Diego (1300000, no, 18).
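The Pareto semantics of Lisa's wish can be sketched as follows; the construct semantics are those assumed in the sketch above (AROUND via the distance to a point, POS via set membership, BETWEEN via the distance to an interval), and a tuple y dominates a tuple x if y is better or equal in every sub-preference and strictly better in at least one. All identifiers are hypothetical.

def dist(v, low, up):
    if low <= v <= up:
        return 0
    return low - v if v < low else v - up

def better(p, x, y):
    # True iff y is strictly better than x w.r.t. the base preference p
    kind = p[0]
    if kind == 'AROUND':
        return dist(x, p[1], p[1]) > dist(y, p[1], p[1])
    if kind == 'BETWEEN':
        return dist(x, p[1], p[2]) > dist(y, p[1], p[2])
    if kind == 'POS':
        return y in p[1] and x not in p[1]
    raise ValueError(kind)

def dominates(prefs, x, y):
    # True iff tuple y Pareto-dominates tuple x
    strictly = False
    for p, xi, yi in zip(prefs, x, y):
        if better(p, xi, yi):
            strictly = True
        elif xi != yi:
            return False
    return strictly

prefs = [('AROUND', 500000), ('POS', {'yes'}), ('BETWEEN', 10, 16)]
cities = {'Las Vegas': (500000, 'no', 16),
          'Phoenix': (1400000, 'yes', 21),
          'San Diego': (1300000, 'no', 18)}
# Under these assumptions, dominates(prefs, cities['San Diego'], cities['Las Vegas'])
# evaluates to True, while Las Vegas and Phoenix remain uncomparable.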
If a preference is more important than another, then the following prioritized preference supports the intuitive modeling.

Definition 7. Prioritized preference
Given two preferences P1, P2 on two domains A1, A2 and two tuples (x1, x2), (y1, y2) ∈ dom(A1) × dom(A2). A preference P := P1 & P2 is called a prioritized preference if

(x1, x2) <_{P1 & P2} (y1, y2) ⇔ x1 <_{P1} y1 ∨ (x1 = y1 ∧ x2 <_{P2} y2).
In a prioritization, a tuple y is better than a tuple x if it is better in the more important contained preference. Given an equal rating there, y is rated better than x if it is better in the less important preference. If they are rated uncomparable in the more important contained preference, the overall rating is uncomparable as well.

2.2 The BMO query model

Preferences are defined in terms of values from dom(A), representing the realm of wishes. In database applications it is assumed that the real world is mapped into appropriate instances that are called database sets. A database set R may e.g. be a view or a base relation in SQL or a DTD-instance in XML. Whether preferences can be satisfied depends on the current database contents. Thus a match-making between wishes and data has to be made. For this purpose the “Best Matches Only” (BMO) query model has been introduced in [15; 16].
Definition 8. BMO result
The Best Matches Only result set contains only the best matches w.r.t. the strict partial order of a preference P. It is a selection of unordered result tuples. All tuples t, t′ ∈ BMO have equal or uncomparable values regarding the preference P.

For prioritized preferences this BMO model can lead to a possible disregard of less important preferences. When comparing tuples, a specific preference is only evaluated if all the more important preferences show equal results. In contrast to number based preference models, we do not imply any kind of order on the elements of the BMO set. For no two of them can we say which one is better according to the user's preferences at this point. This will be the subject of section 3. Chomicki's winnow operator [5] provides similar semantics. Currently, there are two query languages extended by soft conditions based on the presented preference model, namely Preference SQL [19] for relational databases and Preference XPath [18] for XML structured data.

2.3 Other preference models

Basically, there are two different ways of expressing preferences in databases: quantitative and qualitative. The quantitative approach uses scoring functions for computing a numeric score for every tuple. The results are returned in order of this score. Agrawal and Wimmers present numerical value based preferences in [1]. A preference is modelled as a floating point value between 0 and 1. Although this approach enables a user to express preferences at a very fine-grained level, it is not easily understandable for many people. People are not used to rating anything on a floating point number base [20]. The combination of different preferences they propose is a kind of prioritization between these.

Using a fuzzy set approach, Bosc and Prade not only use soft query conditions but also propose a way of storing imprecise data in [3]. Our focus, however, lies on product catalog search and recommendations. As product descriptions in catalogs do not contain uncertain data, we do not have to deal with this issue. The soft conditions in [3] are also based on floating point values between 0 and 1. Dubois and Prade use this fuzzy preference model as a base for their research in [7]. For the connection of equally important preferences they propose different ways of computing, including the use of minimum and maximum functions and averaging. Although not explicitly stated, the selection of the computation type should be made following personalized and situated principles. Using five discrete levels of fulfilling search conditions, their approach combines result selection and a kind of quality assignment.

Qualitative preference models like ours use directly specified preferences between tuples. Users are enabled to express their preferences declaratively.
This is easy to accomplish especially for inexperienced users and therefore provides a very intuitive interface.
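Building on a dominance test such as the one sketched above, the BMO selection of Definition 8 can be outlined in a few lines (illustrative only; dominates(x, y) is assumed to return True iff y is better than x w.r.t. the preference's strict partial order):

def bmo(tuples, dominates):
    # keep exactly the tuples that no other tuple dominates
    return [t for t in tuples
            if not any(dominates(t, other) for other in tuples if other is not t)]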
3 Personalized quality assessment and presentation of query results

Operating the search engine of a database is a very powerful position, and its design is a decisive question [6]. E.g., considering the power of the decision which results to present first at Google.com's search engine, the first listed links receive many more visits, with all the positive consequences of this. When setting up the search engine in a database, there are five steps:

1. Preparing the sources
2. Indexing
3. Interpreting the query
4. Searching
5. Presentation of the results
Steps 1 to 4 are part of the database management system. To improve the search quality by avoiding the empty result effect and the flooding effect, a preference search technology (see subsection 2.2) should be used for step 3. The importance of step 5 is still nearly ignored. Especially in a sales scenario, it is very important which results and alternatives to present at all, which ones to put in front, or even to recommend. Here the question is also whether to respect presenter preferences. There are attempts to cover these problems, e.g. the provider of e-catalog technology Poet.com allows specifying “favorites” by flags in the database. Whenever favorites are in the result set they are placed at the top of the result table. But that respects the presenter preferences only statically. Furthermore, there is no regard for any customer preferences or for aspects of specific situations. One major and still unsolved key issue in these difficulties is the quality of the search result with respect to the customer preferences. The knowledge about the qualities of search results is absolutely necessary for good reasoning when presenting results. In this section a novel personalized and situated presentation framework is introduced. A valuation of query results is, of course, necessary only for advanced search engines that are able to present alternatives when there are no perfect matches. In the sequel, the BMO result model as described in Definition 8 is applied, because it is very user intuitive and has been shown to be suitable for a cooperative product search in various applications [19].

3.1 Design principles and workflow

For a successful deal, the product presentation is a decisive factor. An example of an advanced search result presentation is shown in the following scenario.
Example 3. Sample scenario of advanced search result presentation
Marge is a computer dealer and tells her vendor Apu: “I am interested in notebooks. The clock frequency must be at least 2 GHz. The order quantity should be around 40. It is equally important that the main memory capacity should be between 512 and 1024 MB-RAM, and it is equally important that the price should be between 100 and 1200 Euro. Equally important, my preferred manufacturers are Toshiba and Hewlett Packard.”

An answer from the vendor side might be the result set shown in Table 1.

Table 1. Sample BMO result set

     Make        CPU-GHz  MB-RAM  Quantity  Price per unit (Euro)
t1   Elitegroup  2.0      256     40        1150
t2   Gerion      2.0      374     50        1199
t3   HP          2.2      512     50        1249
t4   Toshiba     2.4      768     40        1378
Even in this not too complex example the customer would be glad about advice and assistance. Additionally, a proactive recommendation of a product supports an effective selling process [2]. A cooperative result presentation could now pick out one result. This could be the Toshiba notebooks, because the vendor's utility function of maximizing the turnover is supported and, besides the hard condition on the CPU-GHz, three of four preferences are perfectly fulfilled. This delivers the following sales arguments: “There are four best matches with respect to your preferences. I recommend the Toshiba notebook position, which meets your preferences very well. It perfectly matches your desired manufacturer and order quantity, has an even faster CPU and perfectly matches the MB-RAM you demanded. I think the moderately higher price is acceptable for this high-quality product.”

The first aspect of the following design principles serves to avoid incomprehensible and non-personalized ad hoc attempts of presenting search results with incorrect valuations and disrespect of the preferences of customer and presenter, i.e. the vendor side. Moreover, the second aspect serves to enable advanced search engines to do a cooperative and effective presentation of the search results following the strategy of the presenter:

1. Consideration of user preferences as well as presenter preferences
From the search to the final presentation the preferences of the user as well as the preferences of the presenter must be flexibly combinable with respect to the specific situation.

2. Providing plausible arguments about the quality of the search results
For an effective presentation of search results the vendor needs arguments to underpin his decision why to present or point out the specific results. Therefore, it is necessary to know about the search quality with respect
to single search conditions as well as about an overall impression of the quality of a search result for the specific situation.

3. Consideration of filter criteria applied to the search result set
In various cases not all computed alternatives should be shown to the user. The reason may be e.g. search quality requirements of the user or results the presenter only wants to show under specific circumstances and in different situations. Therefore it must be possible to apply flexible filter conditions to the search result set.

4. Semantically well-founded marked favorites within the result set
To enable the vendor to point out one or more of the results, these have to be marked following flexible and situated criteria such as e.g. a utility function of a vendor or the result quality with respect to a customer's search preferences.

Definition 9 describes the major steps of the workflow of a personalized advanced query result presentation based on the BMO query model as described above, respecting the just mentioned design principles.

Definition 9. Workflow of a result presentation
The presentation of BMO query results is defined in seven steps:
1. Query composition with search preferences
2. Preference search
3. Computing the quality valuations of the BMO results
4. Computing aggregated quality valuations
5. Applying “But-only” filter conditions to the BMO results
6. Marking out favorites via presentation preferences
7. Result presentation and consideration of customer feedback

In the sequel the steps of this Definition are outlined.

1. Query composition with search preferences: Search preferences must be known before starting a qualitative product search. While customers may express theirs in a search mask, vendor preferences can be regarded as a-priori filters. Both types can be intuitively expressed in our preference model described in Definitions 2 and 3.

2. Preference search: With a given search preference, best matches can be queried from a database using a preference query language (see section 2.2). The results are only best matches according to the BMO model of Definition 8. This provides a clearly arranged pre-selection without bothering the customer with lots of search iterations and adjustments of the search criteria.

3./4. Computing the quality valuations of the BMO results and aggregated quality valuations: Knowing the quality of a search result with respect to each occurring base preference provides valuable presentation arguments. People normally do not think in numbers when rating a quality. As Zadeh found out [26], they think in so-called linguistic variables. We use a system of five quality levels
each mapped to a linguistic variable. Of course our approach to quality assessment would work as well with more or fewer levels. Thereby we get a basis for an objective quality computation with respect to complex preferences and therefore for an overall quality computation for every search result. The following Definition 10 specifies the quality function of search results for the underlying situation.

Definition 10. Quality function QUAL
Let V := (‘perfect’, ‘very good’, ‘good’, ‘acceptable’, ‘sufficient’) be a descendingly ordered list of 5 linguistic quality terms and let C(s) := {C1(s), . . . , C5(s)} be a partition of dom(A) into 5 parts depending on the situation s. Then the quality function QUAL_{P,s} : BMO → V for a tuple t ∈ BMO with t[A] ∈ dom(A) regarding a preference P := (A, <_P) maps t onto the i-th element of V iff t[A] ∈ Ci(s).
Via online request tuple t3 is offered to the customer with a quality of q1 . After thinking about the offer he decides to order the notebook but in this moment the notebook t3 is sold out. The vendor offers t2 as a further alternative and argues with quality q2 for this result, where q1 < q2 . The
customer would not understand why the obviously better result regarding his preferences would be valuated worse than the worse alternative. For him the situation has not changed. Therefore, QUAL_{P,s} must fulfill the following postulate of Definition 11.

Definition 11. Quality postulate
For a given preference P := (A, <_P) in a situation s it must hold for all tuples t, t′ with t[A], t′[A] ∈ dom(A) that t <_P t′ implies that QUAL_{P,s}(t) is not better than QUAL_{P,s}(t′) w.r.t. the order of quality terms of Definition 10.
Definition 13. “But-only” filter applied to query results
Let BOF specify a hard filter condition over BMO++. Then the result set of the selection on the extended pre-selection BMO++ with application of the “but-only” filter BOF is declared as BMO∗ := σ_BOF(BMO++), where σ_condition(R) denotes the well-known hard selection over the relation R.

A condition can be stated in terms of quality, e.g. results which only have a very low overall quality should be disregarded. Another condition might concern the vendor's utility function, e.g. only show the five results with the highest profit. Of course, any combinations are possible as well.

6. Marking favorites via presentation preferences: BMO∗ will be presented to the user, but for a promising presentation the system should proactively point out a result and give reasons why this is an appropriate result. In a sales scenario, a vendor would act smart and according to the rules of sales psychology, personalized to each customer. There is a wide range of rules on how to decide the presentation order. Thus, presentation preferences PP need to be flexible to enable the use of different sales strategies as well as vendor preferences, e.g. the maximization of profit. The selection for determination of the favorites with the presentation preference PP can be declared as follows.

Definition 14. Presentation preference over BMO∗
Let PP be the presentation preference over the enhanced result set BMO∗. Then FAV := [PP](BMO∗) denotes the result set of favorite query results according to the presentation preference PP.

If FAV consists of more than one result then a random pick out of this result set is a proper choice to determine which result to present in the first row. Yet, the empty result effect never occurs, since the selection of favorites is defined as a preference.

7. Result presentation and consideration of customer feedback: Finally, if there is no agreement on a result there might be several reasons. On the one side, if the number of results is still too large, then a filtering within the results should be done according to the customer's feedback. The results of this selection are computed with the customer filter criterion CFC by σ_CFC(BMO∗). Because of the very good filter effect of the BMO query model (see [17]) this will not often be necessary. On the other side, if the customer is not satisfied with this selection then according to his feedback a further preference query should be started or results hidden due to step 5 could be presented. If the customer has correctly expressed his search preferences this will not often be necessary, because he can be sure that all (but the possibly hidden) relevant results have been presented.

Considering these 7 steps of the workflow of Definition 9, steps 1 and 2 were discussed in section 2 in detail. Step 7 depends on the application and shall not be the focus here. For step 3, the next three subsections introduce an
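Steps 5 and 6 of the workflow can be sketched as follows; the concrete filter predicate and presentation preference shown in the comments are purely illustrative assumptions, not part of the framework.

def but_only(bmo_pp, bof):
    # BMO* := sigma_BOF(BMO++): keep only tuples satisfying the hard filter BOF
    return [t for t in bmo_pp if bof(t)]

def favorites(bmo_star, pp_better):
    # FAV: the maxima of BMO* w.r.t. the presentation preference PP,
    # where pp_better(x, y) is True iff y is preferred over x
    return [t for t in bmo_star
            if not any(pp_better(t, other) for other in bmo_star if other is not t)]

# e.g. bof       = lambda t: t['quality'] != 'sufficient'   (hide very low quality)
#      pp_better = lambda x, y: x['profit'] < y['profit']   (higher profit preferred)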
intuitive and situative model for quality valuation of query results regarding preferences. Step 4 can easily be managed by applying the aggregate functions of the used query language and therefore needs no further consideration. In subsections 3.4 and 3.5 approaches for “but-only” filters and presentation preferences for a personalized and smart presentation dialog are introduced.

3.2 Quality information for base preferences

In contrast to ad hoc valuations and non-intuitive approaches, in the sequel a universal and semantics based approach is developed. The preference model of [15; 16] provides semantic information according to various intuitive constructs to formulate a human's wish. Based on this framework, the quality of a search result is determined under consideration of this semantics. Thus, the obtained qualities are available with an according context and so they provide comprehensible, suitable arguments for the presentation of search results.

Categorizing the qualities of a result tuple within its queried base preferences is a very sensitive design decision and depends on many factors of the respective situation. The quality valuation function QUAL_{P,s} of Definition 10 will be specified in a well-formed way for each mentioned base preference. Because of the sensitivity of the quality valuations to the current situation, partitioning parameters bi(s) ∈ R are defined, depending on the situation. Please note that consequently QUAL_{P,s} is instantiated for a BMO result set as defined in Definition 10. In [10] it is shown that the following valuation approach is not restricted to the intuitive BMO model.

SCORE preference

Definitely a perfect match for a preference P := SCORE(A, f) is a match where the scoring function is maximal. But often the scoring function f has no upper limit in a specific domain. Thus for a quality valuation additional knowledge about the situation must be regarded. Depending on the specific situation the knowledge engineer has to formulate QUAL_{P,s} as declared in Definition 15.

Definition 15. QUAL_{P,s}-function of a SCORE preference
For a given preference P := SCORE(A, f) in a situation s and a result tuple t ∈ BMO, QUAL_{P,s} is defined as follows, where b1(s) ≥ b2(s) ≥ b3(s) ≥ b4(s):

QUAL_{P,s}(t) := ‘perfect’     if b1(s) ≤ f(t[A])
                 ‘very good’   if b2(s) ≤ f(t[A]) < b1(s)
                 ‘good’        if b3(s) ≤ f(t[A]) < b2(s)
                 ‘acceptable’  if b4(s) ≤ f(t[A]) < b3(s)
                 ‘sufficient’  if f(t[A]) < b4(s)
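A minimal sketch of Definition 15, where b1 ≥ b2 ≥ b3 ≥ b4 are the situation-dependent partitioning parameters bi(s) chosen by the knowledge engineer (illustrative only):

def qual_score(f_value, b1, b2, b3, b4):
    if f_value >= b1:
        return 'perfect'
    if f_value >= b2:
        return 'very good'
    if f_value >= b3:
        return 'good'
    if f_value >= b4:
        return 'acceptable'
    return 'sufficient'

The quality functions of the remaining numerical constructs below follow the same thresholding pattern, applied either to dist(t[A], [l, u]) or to t[A] directly.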
BETWEEN preference

Regarding a preference P := BETWEEN(A, [low, up]), a search result t is marked as a perfect match if and only if t[A] ∈ [low, up]. For the quality classifications other than ‘perfect’ the knowledge engineer has to declare the suitable ranges, under involvement of the respective situation. There, a result with a distance d > 0 from the upper bound of the optimal range intuitively must be of the same quality as a result with distance d from the lower bound. Therefore, the following symmetric partition is defined.

Definition 16. QUAL_{P,s}-function of a BETWEEN preference
For a given preference P := BETWEEN(A, [l, u]) in a situation s for a result tuple t ∈ BMO, QUAL_{P,s} is of the following form, where 0 ≤ b1(s) ≤ b2(s) ≤ b3(s) and the function dist(v, [low, up]) is known from Definition 5:

QUAL_{P,s}(t) := ‘perfect’     if dist(t[A], [l, u]) = 0
                 ‘very good’   if 0 < dist(t[A], [l, u]) ≤ b1(s)
                 ‘good’        if b1(s) < dist(t[A], [l, u]) ≤ b2(s)
                 ‘acceptable’  if b2(s) < dist(t[A], [l, u]) ≤ b3(s)
                 ‘sufficient’  if b3(s) < dist(t[A], [l, u])

AROUND preference

The AROUND preference is a special case of the BETWEEN preference. Its QUAL_{P,s}-function can be described equally to the one of the BETWEEN preference when setting up = low =: z. It is clear that a result tuple t is denoted as a perfect match iff it exactly has the value z.

HIGHEST preference

Just like for the SCORE preference, in a HIGHEST preference the biggest values are best. And likewise there need not be an upper limit for the values. So the knowledge engineer has to decide which quality valuation a search result deserves. With knowledge about the situation he can design the following well-formed QUAL_{P,s}-function.

Definition 17. QUAL_{P,s}-function of a HIGHEST preference
For a given preference P := HIGHEST(A) in a situation s for a result tuple t ∈ BMO, QUAL_{P,s} is of the following form, where b1(s) ≥ b2(s) ≥ b3(s) ≥ b4(s):

QUAL_{P,s}(t) := ‘perfect’     if b1(s) ≤ t[A]
                 ‘very good’   if b2(s) ≤ t[A] < b1(s)
                 ‘good’        if b3(s) ≤ t[A] < b2(s)
                 ‘acceptable’  if b4(s) ≤ t[A] < b3(s)
                 ‘sufficient’  if t[A] < b4(s)
The LOWEST preference, being dual to the HIGHEST preference, has a nearly identical definition of the QUAL_{P,s}-function. The function for LOWEST preferences is defined by replacing ‘<’ by ‘>’ and analogously ‘≤’ by ‘≥’ in Definition 17, so that smaller values are rated better.

POS/NEG preference

A result tuple t is a perfect match referring to a preference P := POS/NEG(A, POS-set; NEG-set) iff t[A] ∈ POS-set and it must therefore be marked as being of ‘perfect’ quality. Accordingly, if t[A] ∈ NEG-set, then this result t has to be declared of ‘sufficient’ quality. The knowledge engineer's only decision is how to valuate results with t[A] ∉ POS-set ∪ NEG-set. Once again, there is no wisest selection. The method to be used depends on user and situation.

Definition 18. QUAL_{P,s}-function of a POS/NEG preference
For a given preference P := POS/NEG(A, POS-set; NEG-set) in a situation s for a result tuple t ∈ BMO, QUAL_{P,s} is of the following form:

QUAL_{P,s}(t) := ‘perfect’     if t[A] ∈ POS-set
                 ‘very good’   if t[A] ∈ M
                 ‘good’        if t[A] ∈ N
                 ‘acceptable’  if t[A] ∈ O
                 ‘sufficient’  if t[A] ∈ NEG-set

There are three possible approaches for the values t[A] ∉ POS-set ∪ NEG-set:
1. Optimistic: M := dom(A) \ (POS-set ∪ NEG-set), N := ∅, O := ∅.
2. Moderate: N := dom(A) \ (POS-set ∪ NEG-set), M := ∅, O := ∅.
3. Pessimistic: O := dom(A) \ (POS-set ∪ NEG-set), M := ∅, N := ∅.

Example 4. Sample QUAL_{P,s}-function for a POS/NEG preference
In Example 1 we learned that Marge had a POS/NEG preference for cities with Minnesota as the only element of the POS-set and a NEG-set containing Los Angeles and Orlando. Assuming an optimistic quality valuation, we get the quality values stated in Table 3.

Table 3. Sample result quality for a POS/NEG preference

City         Quality rating
Minnesota    perfect
New York     very good
Los Angeles  sufficient
It depends on the situation which approach a knowledge engineer selects. Sales psychology tells a more pessimistic valuation may be more believable for some customers. For other customers a more offensive presentation with an optimistic valuation leads to a successful deal.
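A short sketch of Definition 18 with the three treatments of unmentioned values (illustrative only; function and parameter names are hypothetical):

def qual_pos_neg(value, pos_set, neg_set, approach='optimistic'):
    if value in pos_set:
        return 'perfect'
    if value in neg_set:
        return 'sufficient'
    return {'optimistic': 'very good',
            'moderate': 'good',
            'pessimistic': 'acceptable'}[approach]

# qual_pos_neg('New York', {'Minnesota'}, {'Los Angeles', 'Orlando'}) -> 'very good',
# as in Table 3.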
POS/POS preference

A tuple t with t[A] ∈ POS1-set is a perfect match. If t[A] ∉ POS1-set ∪ POS2-set then this t[A] is not mentioned by the customer and therefore the quality valuation must be ‘sufficient’. Only for values t with t[A] ∈ POS2-set does the knowledge engineer have to decide which quality valuation to use.

Definition 19. QUAL_{P,s}-function of a POS/POS preference
For a given preference P := POS/POS(A, POS1-set; POS2-set) in a situation s for a result tuple t ∈ BMO, QUAL_{P,s} is of the following form:

QUAL_{P,s}(t) := ‘perfect’     if t[A] ∈ POS1-set
                 ‘very good’   if t[A] ∈ M
                 ‘good’        if t[A] ∈ N
                 ‘acceptable’  if t[A] ∈ O
                 ‘sufficient’  if t[A] ∉ POS1-set ∪ POS2-set

There are three possible approaches for the values t[A] ∈ POS2-set:
1. Optimistic: M := POS2-set, N := ∅, O := ∅.
2. Moderate: N := POS2-set, M := ∅, O := ∅.
3. Pessimistic: O := POS2-set, M := ∅, N := ∅.

NEG preference

The quality valuation for a result t ∈ BMO with respect to a NEG preference P := NEG(A, NEG-set) is intuitively clear and leaves no scope for a knowledge engineer. A result t has to be declared of ‘perfect’ quality when t[A] ∉ NEG-set, because that is exactly what the user desires. If t[A] ∈ NEG-set this result must be stated as ‘sufficient’, according to the preference model philosophy.

Definition 20. QUAL_{P,s}-function of a NEG preference
For a given preference P := NEG(A, NEG-set) in a situation s for a result tuple t ∈ BMO, QUAL_{P,s} is of the following form:

QUAL_{P,s}(t) := ‘perfect’     if t[A] ∉ NEG-set
                 ‘sufficient’  if t[A] ∈ NEG-set

POS preference

The dual of the NEG preference is the POS preference. Thus, the design modelling is intuitively the dual design of the NEG preference: elements of the POS-set are rated ‘perfect’, all others ‘sufficient’.

Definition 21. QUAL_{P,s}-function of a POS preference
For a given preference P := POS(A, POS-set) in a situation s for a result tuple t ∈ BMO, QUAL_{P,s} is of the following form:

QUAL_{P,s}(t) := ‘perfect’     if t[A] ∈ POS-set
                 ‘sufficient’  if t[A] ∉ POS-set
Lemma 1. For the QUAL_{P,s}-functions of all mentioned base preferences it holds that
a) ∀ t, t′ | t[A], t′[A] ∈ dom(A) : t <_P t′ implies that QUAL_{P,s}(t) is not better than QUAL_{P,s}(t′), i.e. the quality postulate of Definition 11 is fulfilled.
P2 := P3 ⊗ P4 is considered to be in the form P := P1 ⊗ P3 ⊗ P4. This reflects that each of the equally important preferences P1, P3, and P4 must have the same impact on the quality valuation. Obviously, in a numerical average valuation the first consideration would lead to a wrong quality.

Definition 22. Optimistic valuation of a Pareto preference result
Assume a Pareto preference P := P1 ⊗ . . . ⊗ Pd, d ≥ 2, in a situation s where for all Pj, j = 1, . . . , d, it holds that Pj is no Pareto preference and QUAL_{Pj,s}(t) fulfills the postulate of Definition 11. The optimistic valuation for a result tuple t ∈ BMO is defined as follows regarding the order of quality terms of Definition 10:

QUAL_{P,s}(t) := max{{QUAL_{Pj,s}(t) | j = 1, . . . , d}},

where {{. . . }} denotes a multi-set.

Thus, for the optimistic valuation the best quality valuation of all Pareto combined preferences Pj within P is decisive. The contrary approach, the pessimistic valuation, defines the quality of a result tuple t as the worst quality valuation of all preferences Pj in P.

Example 5. Sample optimistic and pessimistic quality valuation
Marge and Homer are interested in a bundle of notebooks and consider seven equally important characteristics. Formulated in preference algebra they have the Pareto preference P := Pmake ⊗ Pmhz ⊗ Pram ⊗ Poquan ⊗ Pdeltime ⊗ Pweight ⊗ Pprice. For these seven base preferences Marge and Homer have the same situational parameters and the same opinions on whether these characteristics are suitably fulfilled or not. In Fig. 4 the preference quality information for each Pj is given for one result tuple t with regard to the notebook characteristics. In this figure, five stars for one Pj denote a ‘perfect’ quality, four stars denote a ‘very good’ quality, and so on, until one star denotes a ‘sufficient’ quality for the QUAL_{Pj,s}-function. Thus, e.g. QUAL_{Pmake,s}(t) = ‘very good’.

Homer is the carefree customer and only considers the positive aspects of a search result. Therefore, for Homer the optimistic quality valuation would be appropriate. With the valuation of Definition 22 the quality for P is computed as follows:

QUAL_{P,s}(t) = max{{‘very good’, ‘acceptable’, ‘very good’, ‘very good’, ‘very good’, ‘acceptable’, ‘good’}} = ‘very good’

Marge, in contrast to Homer, is always very skeptical. For her, the pessimistic quality valuation might be suitable. Analogously to Definition 22 it is calculated as follows:

QUAL_{P,s}(t) = min{{‘very good’, ‘acceptable’, ‘very good’, ‘very good’, ‘very good’, ‘acceptable’, ‘good’}} = ‘acceptable’
Fig. 4. Sample qualities of notebook characteristics for one result tuple t.
The qualities for these two examples are visualized in Fig. 4. Besides these two valuations, a more moderate approach is often appropriate. Therefore, a statistically more robust technique is introduced next. Even more ways of computation can be found in [10].
Definition 23. Equidistant linguistic average valuation of a Pareto preference result
Assume a Pareto preference P := P1 ⊗ . . . ⊗ Pd, d ≥ 2, in a situation s where for all Pj, j = 1, . . . , d, it holds that Pj is no Pareto preference and QUAL_{Pj,s}(t) fulfills the postulate of Definition 11. The equidistant linguistic average valuation for a result tuple t is defined as follows with respect to the order of quality terms of Definition 10:
QUAL_{P,s}(t) := 'perfect' if y = 5, 'very good' if y = 4, 'good' if y = 3, 'acceptable' if y = 2, 'sufficient' if y = 1,
where
y := round( (1/d) · Σ_{j=1}^{d} g(QUAL_{Pj,s}(t)) )
and g maps the quality terms to grades: g('perfect') = 5, g('very good') = 4, g('good') = 3, g('acceptable') = 2, g('sufficient') = 1.
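Definition 23 thus amounts to averaging the numeric grades of the component qualities and rounding back to a linguistic term. A small sketch under that reading (names are ours):

```python
QUALITY_ORDER = ["sufficient", "acceptable", "good", "very good", "perfect"]

def pareto_linguistic_average(quals):
    """Equidistant linguistic average valuation of a Pareto preference result."""
    grades = [QUALITY_ORDER.index(q) + 1 for q in quals]  # g(x) in 1..5
    y = round(sum(grades) / len(grades))                  # arithmetic mean, rounded
    return QUALITY_ORDER[y - 1]

# The notebook example of Example 5: mean grade 23/7 ~ 3.3, rounded to 3 -> 'good'.
print(pareto_linguistic_average(
    ["very good", "acceptable", "very good", "very good",
     "very good", "acceptable", "good"]))
```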
Quality valuation of a prioritized preference result According to the philosophy of a prioritized preference P1 & . . . & Pd the preference P1 is the dominating preference and thus must have the decisive impact. As P2 , . . . , Pd are subordinated, P1 as the only decisive factor leads to the following valuation.
Definition 24. Quality valuation of a prioritized preference result
Assume a prioritized preference P1 & . . . & Pd, d ≥ 2, in a situation s where for all Pj, j = 1, . . . , d, it holds that Pj is no prioritized preference and QUAL_{Pj,s}(t) fulfills the postulate of Definition 11. The valuation for a result tuple t is defined as follows:
QUAL_{P,s}(t) := QUAL_{P1,s}(t)
There is no further correct valuation of a prioritized preference result [10]. Please note that the information of QUAL_{P2,s}(t), . . . , QUAL_{Pd,s}(t) of a prioritized preference is not needed for the quality of the prioritized preference, but it provides valuable arguments for a result presentation. In the last two sections the quality valuation functions were defined for each base and complex preference in the form QUAL_{P,s}(t). It is obvious that the structure of the complex preferences can be used to compute the quality of each combined preference recursively. Thus, all presentation arguments regarding the quality can be computed. With this, the complex attribute Q of BMO+ of Definition 12 can be calculated.
3.4 Filter criterion "but-only"
The quality valuations with respect to the base preferences as well as the complex preferences have now been constructed. Based on them, the decision which results to present can be made. These results will be visible to the user, while the others will be hidden and either never presented or presented only in a second phase. There are various reasons for filter criteria BOF ("but-only" filter) to compute BMO* := σ_BOF(BMO++) (see Definition 14). Three obvious criteria are hidden preferences of the knowledge engineer, issues of presentation style, and quality claims of a user, but of course the framework is not limited to these. For example, in a web shop several vendor preferences could be integrated into a search preference P. Many criteria can only be handled after the soft selection; for instance, a vendor may want to present only one result per manufacturer, but with the highest possible quality. Many other and combined approaches are imaginable.
3.5 Selection criterion for pointing out a search result
Suppose the decision has been made which results to present to the customer. Still, in many scenarios it is necessary to point out one or more results. The question which results are the best ones to present first is highly situation and domain dependent. Agrawal and Wimmers [1] suggest bringing the results into a total order by rating the results with a sum over weighted preferences. But there is only limited expressiveness in their search technology/preference
model and also limited semantics for a smart presentation order. As Pareto preferences and prioritized preferences cannot in general be transformed into total orders [16; 5], several different presentation strategies will be pointed out. It becomes clear that the preference model of [15; 16] and the introduced quality valuation of search results form a very powerful framework for an easy and intuitive declaration of presentation preferences. By avoiding the empty-result effect when using this model for presentation, we can also make sure that there will always be at least one result predestined to be pointed out.
General selection criteria
One straightforward approach is to pick a result with the best available overall quality, namely the highest AQoverall. Assume a result set BMO*. The presentation preference regarding the highest overall quality within BMO* is defined as PP := HIGHEST(AQoverall). If this presentation preference PP delivers more than one result, then a random pick among them is appropriate. To get many presentation arguments for good reasoning, it is very useful to pick the results with the highest number of high-quality single characteristics, i.e. with the most perfectly fulfilled base preferences. This leads to the next selection criterion. Once again assume a result set BMO* with the aggregated attributes AQ'perfect', AQ'very good', AQ'good', and AQ'acceptable', counting the quality valuations with respect to the base preferences for a result t. Then the selection criterion for the highest quantity of most positive result characteristics is defined as the following complex presentation preference (a small sketch of both criteria is given after this subsection):
PP := HIGHEST(AQ'perfect') & HIGHEST(AQ'very good') & HIGHEST(AQ'good') & HIGHEST(AQ'acceptable')
This selection criterion first picks the results with the most 'perfectly' satisfied base preferences. If several results have an equal number of 'perfect' valuations, then among these the results with the most 'very good' valuations are delivered, and so forth. Of course, there are many further intuitive criteria. In the next section the focus is on criteria for a sales scenario.
Presentation criteria in sales scenarios
As a special domain, selection criteria for a presentation within a sales scenario are discussed in this section. Especially in a sales dialog it is very important to choose adequate products to point out, i.e. to follow a special sales strategy [22; 12]. The style of a good vendor is to actively recommend one or more results and to argue why each is an appropriate product. The vendor can improve his credibility when he is able to adequately assess the alternatives [2]. Of course, the better the quality of a search result, the higher the psychological advantage for the vendor within a result presentation.
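Referring back to the general selection criteria above, the criterion for the highest quantity of most positive result characteristics is a lexicographic comparison of the counts AQ'perfect', AQ'very good', AQ'good', AQ'acceptable'. The following sketch is illustrative only (not the COSIMA implementation; data and names are invented).

```python
from collections import Counter

def argument_counts(base_quals):
    """Aggregated attributes AQ'perfect', ..., AQ'acceptable' for one result tuple."""
    c = Counter(base_quals)
    return (c["perfect"], c["very good"], c["good"], c["acceptable"])

def recommend(results):
    """Return the result id(s) with the lexicographically best argument counts.

    `results` maps a result id to its list of base-preference quality terms.
    Ties are kept; a random pick among them would follow the text above.
    """
    best = max(argument_counts(q) for q in results.values())
    return [rid for rid, q in results.items() if argument_counts(q) == best]

bmo_star = {
    "t1": ["perfect", "perfect", "good", "acceptable"],
    "t2": ["perfect", "very good", "very good", "good"],
}
print(recommend(bmo_star))  # -> ['t1']: two 'perfect' arguments beat one
```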
There is no single perfect sales strategy. Many cognitive and environmental factors have an impact on the success of a sales scenario [2]. Some people can be convinced by a more blatant offering. Others react positively to an offer when positive and negative aspects of the search result are discussed. Vendors, too, have different utility functions. There can even be conflicting aspects, e.g. when one result in the result set has lots of high-quality presentation arguments and a further result yields a higher profit for the vendor. Each selection criterion is declaratively defined as a preference according to the model of [15; 16]. As shown, these preferences can flexibly be combined into more complex preferences, even if there are conflicts among the preferences. With the following criteria a framework is given to support well-known aspects of sales strategies concerning the selection of one item out of the result set to be actively recommended. This is a prime example for the usage of soft selections. In the literature about consumer choice behavior, all models treat the quality of the offered goods with respect to the customer's preferences as a major cognitive factor [21; 9; 11]. Thus, the rules discussed in the last subsection that focus on good arguments regarding the preferences are also highly relevant for a sales-strategy-driven query result presentation. But it is not always suitable to present the best alternatives first. Hansen [11] emphasizes that it depends on the situation, i.e. on the person involved, the location, and many other factors. He also points out that it is very important for a customer to consider alternatives. It depends on the customer whether he wants the best available alternatives presented to him or prefers to discover them by himself. Possibly, the selection criterion PP := AROUND(AQoverall) might be useful, e.g. in order to start with an alternative of middle quality. A further very important selection criterion according to the sales psychologist Bänsch [2] is the price. He argues that, on the one hand, recommending a result with a high price shows the customer a high appreciation of his financial strength. On the other hand, this is embarrassing if the customer actually does not have this financial strength. Thus, the rule is: when the customer is known to prefer the highest price, the result with the highest price, and hence the item of supposedly highest product quality, should be presented. If there is no knowledge about such financial strength or about the will to buy the most expensive alternative, then an alternative of middle to high price, or even the one with the second highest price, should be recommended. This shows appreciation of the customer, but it is not embarrassing for him if he cannot afford this product. And, in case of financial strength, there is even the possibility for the customer to step up to a product of higher quality and price, which boosts his ego. Of course, all the discussed presentation preferences can be combined with vendor preferences like a maximal turnover, a maximal profit, or the promotion of slow sellers. Moreover, every imaginable result set could be used as input within this framework. This framework is universal, because it satisfies the
postulate of Definition 11, and the valuation itself is based on the flexible, intuitive and semantics-based preference model of [15; 16].
4 Evaluation by means of an automated sales agent
The quality assessment framework, including numerous presentation preferences, was realized as a Java component and evaluated within the automated sales agent COSIMAB2B.
4.1 The prototype COSIMAB2B
COSIMAB2B is the prototype of an autonomous sales agent that automates a cost-intensive e-procurement process. Together with industrial partners, a typical B2B use case was modelled. The product domain comprises boxes, in particular storage, transport, and waste containers. On the client side the customer interface is additionally equipped with some optional features. A female character agent named COSIMA embodies the electronic sales agent. COSIMA talks to the customer via speech synthesis in real time. Text templates are used for the text generation to realize speech output, e.g. for different sales strategies, and to dynamically integrate the discussed quality knowledge about the search results. Upon the start of COSIMAB2B the embodied character agent welcomes the customer. Then the customer iteratively composes the content of his shopping cart by searching the product database. In Figure 5, for example, the customer is searching for a red storage container made of polyethylene with a volume of about three liters and a width of 100 to 150 millimeters. Actually, there is no perfect match for these search preferences in the product database. Thus, the best alternatives are offered. As shown in Fig. 5, COSIMA smartly presents the search results. Following a given sales strategy, COSIMA presents the article with the most perfectly fulfilled base preferences. Here she especially emphasizes the perfectly matched red color and fairly mentions the nearly matched volume of 2.7 liters. As the width is perfectly in the customer's preferred range and the material is exactly the desired one, COSIMA completes her argumentation by emphasizing the 'perfect' overall quality. Finally, COSIMA proactively draws the customer's attention to optional accessories (cross selling).
4.2 Evaluation
In an experimental setting, 30 test customers had to act as purchasers for a company. They had five tasks, i.e. to buy five exactly specified storage boxes. If no such box was available, they had to select a best alternative on their own responsibility. For the quality valuations, for all customers the same
Fig. 5. Smart search result presentation.
adjustments were used. All but two people deemed the quality valuations correct. Thus, a personalization of these parameters is necessary; obviously, an approach of user modelling, e.g. of preference patterns [25], would be appropriate for most of the customers. To apply different sales strategies, the customers were divided into two groups of 15 people. In the first group the strategy "best overall quality before most perfect arguments" (strategy b) was applied. For the second group the strategy "second highest price before most perfect arguments" (strategy s) was in use. Please note that the result set is the same for both groups, but the presentation order may differ. Querying the properties of the first task, COSIMA delivered seven boxes in exactly the same order for both strategies. She always recommended the first result. As Fig. 5 shows, five results are visible at first sight. It is remarkable that even the very carefully deciding test customers only briefly scrolled to lower-listed products, but did not consider their properties. As illustrated in Fig. 6, only results visible at first sight were selected. In contrast to the people with strategy b, two more people followed the "higher quality has a higher price" argumentation of strategy s (product 1 is more expensive than products 2, 3, and 5) by selecting the recommended result. This finding is also supported by the results of tasks 2, 3, and 5, which are not illustrated here. Finally, the experiences of task 4 in this experimental setting are illustrated in Fig. 7. Here the different sales strategies produced a different order of the resulting products, named V, W, X, and Y. The arguments about the quality of the search result with respect to the search preferences (strategy b) lead to selections of the first-ordered products, whereas the "higher product
Fig. 6. Selections regarding the presentation order of task 1.
quality has a higher price” argumentation (strategy s) leads to more selections of the product W. Here product Y is the most expensive one.
Fig. 7. Selections regarding the presentation orders of task 4.
Summarizing, it can be stated that different sales strategies indeed had different impacts on the test customers. Not every customer appreciated every argumentation; e.g. some liked to hear only positive arguments, while others expected to hear at least one not perfectly fulfilled search preference in the argumentation. Thus, a personalization is necessary. If there is no knowledge about the customer, moderate strategies like 'second highest price' should be applied. As shown in Fig. 6, this is a powerful instrument to direct the customer's choice behavior. Moreover, a good argumentation was perceived as believable, and the test customers showed a satisfied attitude during their shopping, which is very important for a successful customer relationship [21; 9; 11].
5 Summary and outlook
After analyzing the continuing misery of today's search engines and the problem of search result presentation, a review of the preference model of [15; 16] has been given. Moreover, it has been shown that knowledge about the quality of search results with respect to search preferences is a major issue with regard to several aspects, e.g. for consumer choice behavior in e-commerce applications [21; 9; 11]. A novel, effective approach of quality assessment for search result presentations has been developed. Following the philosophy of [26], a linguistic model for the search result quality has been introduced. An intuitive framework has been elaborated for
situated quality valuations, satisfying the postulate that worse results must not have a higher valuation than better results. By defining quality valuation functions for all base preference constructors of [15; 16] with an output of linguistic quality terms, an extensible framework for a human-comprehensible search result presentation was created. Based on these results, various situated quality valuations for complex preferences [15; 16] have been introduced. After respecting a so-called "but-only" filter, i.e. filter criteria applied over the search results, various intuitive selection criteria have been introduced for deciding which result to present first or which results to proactively offer to a customer, respectively. Especially for the case of a sales process, selection criteria have been presented. For the first time this enables the flexible application of personalized sales strategies [22; 12] within an e-commerce sales process. This can be done in a declarative manner, even coping with conflicting presentation preferences. In combination with a preference search engine, the search and presentation process can be managed very efficiently for users/customers. They do not have to iterate through several pages and, moreover, they get helpful arguments why the results are the suitable ones. Experiments showed the applicability of the introduced quality assessment approaches with respect to their effect on human users. Fields of application for the quality assessment could be comparison shops like COSIMA [17]. In that platform a fixed heuristic determining the presentation order is implemented equally for all customers. With the introduced quality assessment, each customer of such an application could easily and flexibly be treated according to his search quality claims and with respect to his presentation preferences. Moreover, some strategies of the service provider could be respected, e.g. to flexibly push sponsored results without conflicting with the customer's preferences. Obviously, this technology is also predestined for managing personalized decision support. The adjustment of the partitioning parameters is at the moment a manual task for the knowledge engineer of the search engine. Many influences may change the quality sensation of a customer, and preferences may vary with the respective situation. This dynamic aspect of the definition of a situation is discussed, e.g., by the sociologist Thomas [24] or the human factors researcher Endsley [8]. For the detection of partitioning parameters for the quality functions an automated process would be very helpful. An imaginable solution could be a technique like Preference Mining [14; 13].
References
[1] R. Agrawal and E. Wimmers. A framework for expressing and combining preferences. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 297–306, 2000.
[2] W. Becker. Verkaufspsychologie – Theoretische Grundlagen und praktische Anwendungen. Profil Verlag, 1998.
[3] P. Bosc and H. Prade. An introduction to the fuzzy set and possibility theory-based treatment of flexible queries and uncertain or imprecise databases. In Uncertainty Management in Information Systems, pages 284–324. Kluwer, Dordrecht, 1994.
[4] L. Cholvy and C. Garion. An attempt to adapt a logic of conditional preferences for reasoning with contrary-to-duties. Fundamenta Informaticae, 48(2-3):183–204, 2001.
[5] J. Chomicki. Preference formulas in relational queries. ACM Transactions on Database Systems (TODS), 28(4):427–466, 2003.
[6] R. Dewan, M. Freimer, and P. Nelson. Impact of search engine ownership on underlying market for goods and services. In Thirty-Fourth Annual Hawaii International Conference on System Sciences (HICSS), Maui, Hawaii, USA, 2001.
[7] D. Dubois and H. Prade. Using fuzzy sets in flexible querying: Why and how? In Flexible Query Answering Systems. Kluwer Academic Publishers, Norwell, Massachusetts, USA, 1997.
[8] M. R. Endsley. Toward a theory of situation awareness in dynamic systems. Human Factors, 37(1):32–64, 1995.
[9] J. F. Engel, R. D. Blackwell, and D. T. Kollat. Consumer Behavior. Dryden Press, USA, 1978.
[10] S. Fischer. Personalized query result presentation and offer composition for e-procurement applications. Master's thesis, University of Augsburg, Augsburg, Germany, 2004.
[11] F. Hansen. Consumer Choice Behavior – A Cognitive Theory. The Free Press, 1972.
[12] S. E. Heiman and D. Sanchez. New Strategic Selling: Unique Sales System Proven Successful by the World's Largest Companies. Warner Books, 1998.
[13] S. Holland, M. Ester, and W. Kießling. A novel approach on mining user preferences for personalized applications. In Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD '03), pages 204–216, Dubrovnik, Croatia, 2003.
[14] S. Holland and W. Kießling. Situated preferences and preference repositories for personalized database applications. In 23rd International Conference on Conceptual Modeling (ER 2004), pages 511–523, Shanghai, China, 2004.
[15] W. Kießling. Foundations of preferences in database systems. In Proceedings of the Twenty-Eighth International Conference on Very Large Data Bases, pages 311–322, Hong Kong, China, 2002.
[16] W. Kießling. Preference queries with SV-semantics. In Proceedings of the 11th International Conference on Management of Data (COMAD 2005), pages 15–26, Goa, India, 2005.
[17] W. Kießling, S. Fischer, S. Holland, and T. Ehm. A smart and speaking e-sales assistant. In Proceedings of the Third International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems (WECWIS 2001), pages 21–30, San Jose, CA, USA, 2001.
[18] W. Kießling, B. Hafenrichter, S. Fischer, and S. Holland. Preference XPath: A query language for e-commerce. In Internationale Tagung Wirtschaftsinformatik 2001, pages 427–440, Augsburg, Germany, 2001.
[19] W. Kießling and G. Köstler. Preference SQL – design, implementation, experiences. In Proceedings of the Twenty-Eighth International Conference on Very Large Data Bases, pages 990–1001, Hong Kong, China, 2002.
[20] G. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. The Psychological Review, 63:81–97, 1956.
[21] F. M. Nicosia. Consumer Decision Processes. Prentice-Hall, Inc., NJ, USA, 1966.
[22] N. Rackham. Major Account Sales Strategy. McGraw-Hill Trade, 1989.
[23] C. P. Simon and L. Blume. Mathematics for Economists. W. W. Norton & Company, 1994.
[24] W. I. Thomas. The Unadjusted Girl. Little, Brown, and Co., Boston, 1923.
[25] Q. Wang, W. T. Balke, W. Kießling, and A. Huhn. P-News: Deeply personalized news dissemination for MPEG-7 based digital libraries. In 8th European Conference on Digital Libraries (ECDL 2004), pages 256–268, Bath, UK, 2004.
[26] L. A. Zadeh. The Concept of a Linguistic Variable and Its Application to Approximate Reasoning. Elsevier Pub. Co., New York, NY, USA, 1973.
Modeling Fuzzy Information in the EER and Nested Relational Database Models Zongmin M. Ma Department of Computer Science and Engineering, Northeastern University, Shenyang, Liaoning 110004, China
1 Introduction
A major goal of database research has been the incorporation of additional semantics into database models. It is recognized that the relational database model has semantic and structural drawbacks when it comes to modeling some emerging applications such as computer aided design (CAD), geographical information systems (GIS), and artificial intelligence. In response to this problem, some attempts have been made to relax the first normal form (1NF) limitation, which is the most fundamental normalization constraint in relational databases, and one kind of data model, called the non-first-normal-form (or nested) relational database model, has been introduced. Informally, the nested relational database model allows attribute values in relational instances to be atomic, set-valued, or even relations themselves. In addition, the next generation of database models takes the form of rich data models such as the object-oriented database model and the semantic (conceptual) data models. In real-world applications, information is often vague or ambiguous. Therefore, different kinds of imperfect information have been extensively introduced into relational databases [8, 18, 25, 26]. However, the classical relational database model and its extensions with imprecision and uncertainty do not satisfy the need of modeling complex objects with imprecise and uncertain information. So, much research has concentrated on conceptual data models [10, 31, 38] and object-oriented databases (OODB) [2, 12-17, 22] to deal with complex objects and uncertain data together. It should be pointed out that, being an extension of the relational database model, the nested relational database model is able to handle complex-valued attributes and may be better suited to some complex applications (e.g., office automation systems, information retrieval systems and expert database systems [33]). In [33], an extended nested relational database model (also known as an NF² data model)
was introduced for representing and manipulating complex and uncertain data in databases, and an extended algebra and an extended SQL-like query language were thereby defined. The physical data representation of the model and the core operations that the model provides were also introduced in [33], in which the uncertain data were represented based on similarity relationship [8]. In [20], based on possibility distribution and the semantic measure of fuzzy data developed in [18], a fuzzy nested relational database model was briefly presented to cope with imperfect as well as complex objects in the real world. Generally speaking, classical relational databases are designed by first developing high-level conceptual data models, and then the developed conceptual models are mapped to actual implementations, i.e., database models. The ER model [11], for example, is extensively used for the conceptual design of relational databases [29]. In addition, conceptual data models such as EER, IFO [1] and UML have been used for the conceptual design of object-oriented databases [23, 24]. It should be noticed, however, that less research has been done on developing design methodologies for implementing databases with imprecise and uncertain information, although the conceptual data models and logical database models have been extended for representing and handling imperfect information. In this chapter, we focus on fuzzy data modeling at the conceptual level, in the EER model, a data model that is extensively used in conceptual data modeling, and, at the logical level, in nested relational databases based on possibility distribution. A complete extension of nested relational databases for extended possibility-based fuzzy data is presented in the chapter. In particular, a formal approach to mapping a fuzzy EER model to a fuzzy nested relational database (FNRDB) schema is developed. The remainder of this chapter is organized as follows. Section 2 describes related work. The fuzzy extended entity-relationship (FEER) model and fuzzy nested relational databases are introduced in Section 3. In Section 4, the approaches to mapping the fuzzy EER model to the fuzzy nested relational schema are developed. Section 5 concludes this chapter.
2 Related work
Fuzzy information has been extensively investigated in the context of the relational database model [8, 18, 25, 26] since Zadeh proposed fuzzy set theory [34]. In connection with the representation of fuzzy data, two basic extended data models for fuzzy relational databases can be identified: one is based on possibility distribution [25, 26] and the other is based on similarity (proximity or resemblance) relation [8]. The former can further be classified into two categories, i.e., tuples associated with possibility degrees and attribute values represented by possibility distributions, which respectively correspond to the type-1 fuzzy relational data model and the type-2 fuzzy relational data model [26]. The form of an n-tuple in each of the above-mentioned fuzzy relational models can be expressed, respectively, as:
t1 = &lt;a1, a2, ..., ai, ..., an, d&gt;, t2 = &lt;πA1, πA2, ..., πAi, ..., πAn&gt; and t3 = &lt;p1, p2, ..., pi, ..., pn&gt;, where pi ⊆ Di with Di being the domain of attribute Ai, ai ∈ Di, d ∈ (0, 1], πAi is the possibility distribution of attribute Ai on its domain Di, and πAi(x), x ∈ Di, denotes the possibility that x is the actual value of t[Ai]. The notation d is used to denote the membership degree of the tuple in the corresponding relation. The value d of the tuple may be interpreted either as a possibility measure of association of the attribute values or as a truth value of a fuzzy predicate associated with the relation. Viewed from the expressive format, pi is a set value containing some elements of Di, in which these elements have similarity relations. In other words, pi is a fuzzy datum represented by similarity relations on the elements of Di. It is clear that, based on the above-mentioned basic fuzzy relational models, there should be one kind of hybrid fuzzy relational model, called the extended possibility-based fuzzy relational model [13], in which possibility distribution and similarity (proximity or resemblance) relation arise in a relational database simultaneously. Based on various fuzzy relational database models, many studies have been done on data integrity constraints [6, 7, 26, 27]. There have also been research studies on fuzzy query languages [5, 28] and fuzzy relational algebra [21, 30]. In [5], an existing query language, namely SQL, was extended for fuzzy queries and some fuzzy aggregation operators were developed. In [36], a fuzzy relational database (FRDB) model architecture and query language were presented and the possible applications of the FRDB in imprecise information processing were discussed. In order to model complex objects with imprecise and uncertain information, current efforts have concentrated on some fuzzy non-traditional data models for databases such as the conceptual data models [10, 31, 38], the nested relational model [20, 33], and the object-oriented data model [2, 12-17, 22]. The conceptual data models can serve as tools for designing databases. In [33], null values, set values, range values (partial values and value intervals), and fuzzy values were all modeled in the NF² data model based on similarity relationship, and the extended NF² algebra was developed. In [20], fuzzy data were introduced into nested relational databases to model complex objects with imprecise and uncertain information, where fuzzy data are the extended possibility-based fuzzy data expressed by possibility distributions over a universe of discourse while a resemblance relation on the universe is in effect; the structure of the fuzzy nested relational model was briefly presented there. Regarding the modeling of imprecise and uncertain information in object-oriented databases (OODBs), Zicari and Milano first introduced incomplete information, namely null values, in [37], where incomplete schemas and incomplete objects can be distinguished. From then on, the incorporation of
imprecise and uncertain information into object-oriented databases has received increasing attention, and fuzziness is witnessed at the levels of object instances and class hierarchies. Based on similarity relationship, in [16] the range of attribute values was used to represent the set of allowed values for an attribute of a given class. Depending on the inclusion of the actual attribute values of the given object in the range of the attributes for the class, the membership degree of an object to a class can be calculated. Weak and strong class hierarchies were defined based on the monotone increase or decrease of the membership of a subclass in its superclass. Based on the extension of a graph-based object model, a fuzzy object-oriented data model was defined in [2]. The notion of strength, expressed by linguistic qualifiers, which can be associated with the instance relationship as well as with an object belonging to a class, was proposed. Fuzzy classes and fuzzy class hierarchies were thus modeled in OODBs. In [3], graph-based operations to select and browse a fuzzy object-oriented database were further defined for the fuzzy graph-based model that manages both crisp and fuzzy information. The UFO (uncertainty and fuzziness in an object-oriented) database model was proposed in [17] to model fuzziness and uncertainty by means of fuzzy sets and generalized fuzzy sets, respectively. The fact that the behavior and structure of the object are incompletely defined results in a gradual nature of the instantiation of an object. Partial inheritance, conditional inheritance, and multiple inheritance were permitted in fuzzy hierarchies. Based on possibility theory, vagueness and uncertainty were represented in class hierarchies in [15], where the fuzzy ranges of the subclass attributes define restrictions on those of the superclass attributes, and the degree of inclusion of a subclass in the superclass then depends on the inclusion between the fuzzy ranges of their attributes. Focusing on fuzzy types, Marín, Pons and Vila [22] discussed two different strategies for adding fuzzy types to an object-oriented database system and presented how the typical classes of an OODB can be used to represent a fuzzy type and how the mechanisms of instantiation and inheritance can be modeled using this new kind of type in an OODB. Recent efforts have been devoted to the establishment of a consistent framework for a fuzzy object-oriented model based on the Object Data Management Group (ODMG) standard object data model [12, 13]. In [14], an object-oriented database modeling technique was presented based on the concept of a 'level-2 fuzzy set' to deal with a uniform and advantageous representation of both perfect and imperfect 'real world' information. It was also illustrated and discussed how the ODMG data model can be generalized to handle 'real world' data in a more advantageous way. Concerning the conceptual data modeling of fuzzy information, Zvieli and Chen [38] introduced three levels of fuzziness in the ER model based on fuzzy set theory. At the first level, entity sets, relationships, and attribute sets may be fuzzy, namely, they have membership degrees in the data model. The second
level relates to the fuzzy occurrences of entities and relationships. The third level concerns the fuzzy values of attributes of special entities and relationships. The generic rules for mapping the fuzzy ER schema to fuzzy relational databases were introduced in [9]. Also, based on the three levels of fuzziness in the ER model, a fuzzy extension of several major EER concepts was introduced in [10] using fuzzy set theory, including superclass/subclass, generalization/specialization, category, and the subclass with multiple superclasses. The graphical representations of the fuzzy EER model and the design methodologies for fuzzy object-oriented databases were developed in [19]. In addition, the IFO data model proposed in [1], which is a conceptual data model that incorporates the fundamental principles of semantic database modeling within a graph-based representational framework, was extended to deal with fuzzy information in [31, 32]. In [31], several types of imprecision and uncertainty, such as values without semantic representation, values with semantic representation and disjunctive meaning, values with semantic representation and conjunctive meaning, and the representation of uncertain information, were incorporated into the attribute domain of the object-based data model. In [32], a mapping process was described to transform the conceptual schemas of the ExIFO (Extended IFO) model into the similarity-based fuzzy extended NF² relations of [33], including the uncertain properties that are represented in both models.
3 Fuzzy database modeling
In this section, we concentrate on fuzzy database modeling at the conceptual and logical levels. First we introduce some notions and notations of the fuzzy ER/EER model proposed in [10, 19, 38] and then we give a complete extension of the extended possibility-based fuzzy nested relational databases.
3.1 Fuzzy extended entity-relationship (EER) model
Attribute. In the fuzzy ER/EER model, the following levels of fuzziness can be found in attributes. The first level of fuzziness means that attributes may be fuzzy with respect to the data model; in other words, they have their membership degree. The third level of fuzziness means that attributes may take fuzzy values. The membership degree here means uncertainty. The membership degree connected with an attribute means that the attribute may or may not belong to the corresponding entity. These two levels of fuzziness can appear in single-valued and multivalued attributes, and should be distinguished from crisp
attributes in the data model. A composite attribute consists of any kind of the attribute types mentioned above. The graphical representations of these types of attributes are shown in Figure 3.1. Here, for an attribute with a membership degree, we place the membership degree inside the diagram of the attribute in the fuzzy ER model. For example, let A be an attribute type and µ(A) be its membership degree in the model. Then "µ(A)/A" is enclosed in the rectangle. If µ(A) = 1, "1/A" is usually written simply as "A". In a similar way, entity types and relationship types with membership degrees can be represented. Entity and Relationship. The following levels of fuzziness can be found in the entities of the fuzzy ER/EER model: the first level of fuzziness means that entities may be fuzzy with respect to the data model, i.e., they have their membership degree; the second level of fuzziness relates to the instances of special entities and relationships, which means that an entity instance belongs to the corresponding entity or relationship fuzzily, respectively. If entities are connected with attributes with the third level of fuzziness, we say that such entities exhibit the third level of fuzziness. These three levels of fuzziness can also be found in the relationships of the fuzzy ER/EER model. Here the membership degree connected with an entity (or relationship) means that the entity (or relationship) may or may not belong to the corresponding data model. It should be noted that an entity type can be a weak entity type and a relationship type may be an ownership relationship.
(a) Single-valued attribute type (b) Multivalued attribute type (c) Disjunctive fuzzy value attribute type (d) Conjunctive fuzzy value attribute type (e) Single-valued attribute type with membership degree (f) Multivalued attribute type with membership degree (g) Disjunctive fuzzy value attribute type with membership degree (h) Conjunctive fuzzy value attribute type with membership degree
Fig. 3.1 Attributes in Fuzzy ER/EER Model
The graphical representations of the entities and relationships in the fuzzy ER/EER model are shown in Figure 3.2. Generalization/Specialization. Consider two entity types E and S on the universe of discourse U that are fuzzy sets with membership functions µE and µS, respectively. Then S is a fuzzy subclass of E and E is a fuzzy superclass of S if and only if the following is true:
(∀ e) (e ∈ U ∧ µS (e) ≤ µE (e))
(a) Entity with membership degree (b) Entity with fuzzy instances
(c) Relationship with membership degree (d) Relationship with fuzzy instances
Fig. 3.2 Entities and Relationships in Fuzzy ER/EER Model
A generalization defines a superclass from several entity types, generally sharing some common features, while a specialization defines several subclasses from an entity type according to a certain feature. Considering a fuzzy superclass E and its fuzzy subclasses S1, S2, …, Sn with membership functions µE, µS1, µS2, ..., and µSn, respectively, the following relationship holds:
(∀ e) (∀ S) (e ∈ U ∧ S ∈ {S1, S2, …, Sn} ∧ µS (e) ≤ µE (e))
S1, S2, …, Sn are a fuzzy total specialization of E if
(∀ e) (∃ S) (e ∈ E ∧ S ∈ {S1, S2, …, Sn} ∧ µS (e) > 0 ∧ µS (e) ≤ µE (e))
S1, S2, …, Sn are a fuzzy partial specialization of E if
(∃ e) (∀ S) (e ∈ E ∧ S ∈ {S1, S2, …, Sn} ∧ µE (e) > 0 ∧ µS (e) = 0)
In addition, S1, S2, …, Sn are disjoint if
(∄ e) (∀ S) (∀ S') (e ∈ U ∧ S ∈ {S1, S2, …, Sn} ∧ S' ∈ {S1, S2, …, Sn} ∧ min (µS (e), µS' (e)) > 0)
S1, S2, …, Sn are overlapping if
(∃ e) (∀ S) (∀ S') (e ∈ U ∧ S ∈ {S1, S2, …, Sn} ∧ S' ∈ {S1, S2, …, Sn} ∧ min (µS (e), µS' (e)) > 0)
Generalization is the inverse process of specialization. Therefore, according to the discussion above, we have fuzzy total generalization, fuzzy disjoint generalization, and fuzzy overlapping generalization. It should be noted that there is no fuzzy partial generalization, which is the same as the situation in the classical EER model.
Category. Let S1, S2, …, Sn, and E be fuzzy entity sets with membership functions µS1, µS2, ..., µSn, and µE, respectively. Then E is a fuzzy category of S1, S2, …, Sn if
(∀ e) (∃ S) (e ∈ E ∧ S ∈ {S1, S2, …, Sn} ∧ µE (e) > 0 ∧ µS (e) ≥ µE (e))
Note the difference between the fuzzy category and the fuzzy subclass with more than one fuzzy superclass. Let E be a fuzzy subclass and S1, S2, …, Sn be its fuzzy superclasses, whose membership functions are, respectively, µE, µS1, µS2, ..., and µSn. Then
(∀ e) (∀ S) (e ∈ E ∧ S ∈ {S1, S2, …, Sn} ∧ µE (e) > 0 ∧ µS (e) ≥ µE (e))
Aggregation. Let E be a fuzzy aggregation of fuzzy entity sets S1, S2, …, and Sn, whose membership functions are µE, µS1, µS2, ..., and µSn, respectively. Then
(∃ e) (∃ e1) (∃ e2) ... (∃ en) (e ∈ E ∧ e1 ∈ S1 ∧ e2 ∈ S2 ∧ … ∧ en ∈ Sn ∧ µE (e) = µS1 (e1) × µS2 (e2) × ... × µSn (en))
The diagrammatic notations in Figure 3.3 are used to represent fuzzy specialization, category and aggregation in the fuzzy EER model, respectively.
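On a finite universe of discourse these constraints can be checked directly. The following Python sketch is our illustration only (ad-hoc membership dictionaries, not part of the original model); it tests the fuzzy subclass condition and the fuzzy category condition given above.

```python
def is_fuzzy_subclass(mu_S, mu_E, universe):
    """S is a fuzzy subclass of E iff mu_S(e) <= mu_E(e) for every e in U."""
    return all(mu_S.get(e, 0.0) <= mu_E.get(e, 0.0) for e in universe)

def is_fuzzy_category(mu_E, superclass_mus, universe):
    """E is a fuzzy category of S1..Sn iff every e with mu_E(e) > 0 is covered
    by at least one Si with mu_Si(e) >= mu_E(e)."""
    return all(
        any(mu_S.get(e, 0.0) >= mu_E[e] for mu_S in superclass_mus)
        for e in universe if mu_E.get(e, 0.0) > 0
    )

U = ["e1", "e2", "e3"]
mu_employee = {"e1": 1.0, "e2": 0.8, "e3": 0.6}
mu_manager = {"e1": 0.7, "e2": 0.8, "e3": 0.0}
print(is_fuzzy_subclass(mu_manager, mu_employee, U))    # True
print(is_fuzzy_category(mu_manager, [mu_employee], U))  # True
```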
(a) Fuzzy total & disjoint specialization (b) Fuzzy total & overlapping specialization (c) Fuzzy partial & disjoint specialization (d) Fuzzy partial & overlapping specialization (e) Fuzzy subclass with fuzzy multi-superclasses (f) Fuzzy category ( )F i
Fig. 3.3 Specialization, Category and Aggregation in Fuzzy EER Model
3.2 Extended possibility-based fuzzy nested relational databases
Model structure
A fuzzy NF² relational schema is a set of attributes (A1, A2, ..., An, pM) whose domains are D1, D2, ..., Dn, D0, respectively, where Di (1 ≤ i ≤ n) can be one of the following:
(1) The set of atomic values. Each element ai ∈ Di is a typical simple crisp attribute value.
(2) The set of null values, denoted ndom, where null values may be unk, inap, nin, and onul.
(3) The set of fuzzy subsets. The corresponding attribute value is an extended possibility-based fuzzy datum.
(4) The power set of the set in (1). The corresponding attribute value, say ai, is a multivalued one of the form {ai1, ai2, ..., aik}.
(5) The set of relation values. The corresponding attribute value, say ai, is a tuple of the form &lt;ai1, ai2, ..., aim&gt;, which is an element of Di1 × Di2 × ... × Dim (m > 1 and 1 ≤ i ≤ n), where each Dij (1 ≤ j ≤ m) may be a domain of kind (1), (2), (3), or (4), or even a set of relation values.
The domain D0 is a set of atomic values and each value is a crisp one from the range [0, 1], representing the possibility degree that the corresponding tuple is true in the NF² relation. We assume in this chapter that the possibilities of all tuples are precisely one. Then for an attribute Ai ∈ R (1 ≤ i ≤ n), its attribute domain is formally represented as follows: τi = dom | ndom | fdom | sdom | &lt;B1, B2, …, Bm&gt;, where B1, B2, …, Bm are attributes.
A relational instance r over the fuzzy NF² schema (A1 : τ1, A2 : τ2, ..., An : τn) is a subset of the Cartesian product τ1 × τ2 × ... × τn. A tuple in r, of the form &lt;a1, a2, ..., an&gt;, consists of n components. Each component ai (1 ≤ i ≤ n) may be an atomic value, a null value, a set value, a fuzzy value, or another tuple. An example of a fuzzy NF² relation is shown in Table 3.1. It can be seen that Tank_Id and Start_Date are crisp atomic-valued attributes, Tank_body is a relation-valued attribute, and Responsibility is a set-valued attribute. Within the attribute Tank_body, the two component attributes Volume and Capacity are fuzzy ones.
Table 3.1. Pressured air tank relation

Tank_Id | Tank_body (Body_Id, Material, Volume, Capacity)  | Start_Date | Responsibility
TA1     | (BO01, Alloy, about 2.5e+03, about 1.0e+06)      | 01/12/99   | John
TA2     | (BO02, Steel, about 2.5e+04, about 1.0e+07)      | 28/03/00   | {Tom, Mary}
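Purely as an illustration (not part of the model definition), the first tuple of Table 3.1 could be held as a nested Python structure in which fuzzy attribute values are possibility distributions, i.e., mappings from domain elements to possibility degrees; all concrete degrees below are invented for the example.

```python
# One tuple of the pressured-air-tank relation of Table 3.1; fuzzy values are
# possibility distributions (domain element -> possibility degree); the concrete
# degrees below are invented for illustration.
tank_ta1 = {
    "Tank_Id": "TA1",
    "Tank_body": {                                         # relation-valued attribute
        "Body_Id": "BO01",
        "Material": "Alloy",
        "Volume": {2.4e3: 0.8, 2.5e3: 1.0, 2.6e3: 0.8},    # "about 2.5e+03"
        "Capacity": {0.9e6: 0.7, 1.0e6: 1.0, 1.1e6: 0.7},  # "about 1.0e+06"
    },
    "Start_Date": "01/12/99",
    "Responsibility": {"John"},                            # set-valued attribute
}
print(tank_ta1["Tank_body"]["Volume"])
```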
Assessment of data redundancies and removal
In order to manipulate fuzzy data in the fuzzy NF² relational databases, it is necessary to evaluate the semantic measure of fuzzy data. On this basis, data redundancies can be assessed and removed. For two extended possibility-based fuzzy data, say πA and πB, there exist four types of semantic relationships between them: equivalence, inclusion, intersection, and irrelevancy. The notion of semantic inclusion degree was proposed in [19] in order to assess these semantic relationships.
Definition 3.1. Let U = {u1, u2, …, un} be a universe of discourse. Let πA and πB be two fuzzy data on U based on possibility distribution. The semantic inclusion degree SID(πA, πB), which measures the degree to which πA semantically includes πB, is defined as follows:
SID(πA, πB) = ( Σ_{i=1}^{n} min(πB(ui), πA(ui)) ) / ( Σ_{i=1}^{n} πB(ui) )
Following Definition 3.1 and a given threshold β, we have:
(1) πA semantically includes πB iff SID(πA, πB) ≥ β.
(2) πA and πB are semantically equivalent to each other iff SID(πA, πB) ≥ β and SID(πB, πA) ≥ β.
(3) πA and πB are semantically irrelevant to each other iff SID(πA, πB) < β and SID(πB, πA) < β.
Here, we only focus on whether πA and πB are semantically equivalent to each other.
Therefore, the notion of equivalence degree for two fuzzy data is given as follows.
Definition 3.2. Let πA and πB be two fuzzy data and SID(πA, πB) be the degree to which πA semantically includes πB. The semantic equivalence degree SE(πA, πB), denoting the degree to which πA and πB are equivalent to each other, is defined as follows:
SE(πA, πB) = min(SID(πA, πB), SID(πB, πA))
Two fuzzy data πA and πB are considered β-redundant if and only if SE(πA, πB) ≥ β. For two crisp data, atomic or set-valued, their equivalence degree is one if they are equal to each other, where identical set-valued data are considered equal. Assume that πA and πB are β-redundant to each other. The elimination of the duplicate can be achieved by merging πA and πB into a new fuzzy datum πC, where πA, πB, and πC are three fuzzy data on U = {u1, u2, …, un} based on possibility distribution and there is a resemblance relation Res on U. We define the following merging operation according to Zadeh's extension principle [35]:
πC = πA ∪f πB = {πC(w)/w | (∃ πA(ui)/ui) (∃ πB(vj)/vj) (πC(w) = Max(πA(ui), πB(vj)) ∧ (w = ui | πC(w) = πA(ui) ∨ w = vj | πC(w) = πB(vj)) ∧ Res(ui, vj) ≥ α ∧ ui, vj ∈ U ∧ 1 ≤ i, j ≤ n) ∨ (∃ πA(ui)/ui) (∀ πB(vj)/vj) (πC(w) = πA(ui) ∧ w = ui ∧ Res(ui, vj) < α ∧ ui, vj ∈ U ∧ 1 ≤ i, j ≤ n) ∨ (∃ πB(vj)/vj) (∀ πA(ui)/ui) (πC(w) = πB(vj) ∧ w = vj ∧ Res(ui, vj) < α ∧ ui, vj ∈ U ∧ 1 ≤ i, j ≤ n)}
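Assuming a finite universe of discourse, Definitions 3.1 and 3.2 translate directly into code. The sketch below is ours; the possibility distributions are represented as Python dictionaries and the sample data are invented.

```python
def sid(pi_a, pi_b, universe):
    """Semantic inclusion degree SID(pi_A, pi_B): degree to which pi_A includes pi_B."""
    num = sum(min(pi_b.get(u, 0.0), pi_a.get(u, 0.0)) for u in universe)
    den = sum(pi_b.get(u, 0.0) for u in universe)
    return num / den if den else 0.0

def se(pi_a, pi_b, universe):
    """Semantic equivalence degree SE = min(SID(A, B), SID(B, A)) (Definition 3.2)."""
    return min(sid(pi_a, pi_b, universe), sid(pi_b, pi_a, universe))

U = [2.3e3, 2.4e3, 2.5e3, 2.6e3]
about_25 = {2.4e3: 0.8, 2.5e3: 1.0, 2.6e3: 0.8}
roughly_25 = {2.3e3: 0.5, 2.4e3: 0.9, 2.5e3: 1.0, 2.6e3: 0.9}
# The two fuzzy data are beta-redundant iff se(...) >= beta.
print(se(about_25, roughly_25, U))
```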
Now let us concentrate on the redundancies of tuples in a fuzzy NF² relation. First, let us look at two values of a structured attribute, aj = (Aj1: πAj1, Aj2: πAj2, …, Ajm: πAjm) and aj' = (Aj1: πAj1', Aj2: πAj2', …, Ajm: πAjm'), which consist of simple attribute values, crisp (atomic or set-valued) or fuzzy, on the schema R(Aj1, Aj2, ..., Ajm). There is a resemblance relation on each attribute domain Djk (1 ≤ k ≤ m) and αjk ∈ [0, 1] (1 ≤ k ≤ m) is the threshold on that resemblance relation.
Let β ∈ [0, 1] be a given threshold. aj and aj' are α-β-redundant if and only if min_{1≤k≤m} SEαjk(πAjk, πAjk') ≥ β holds; min_{1≤k≤m} SEαjk(πAjk, πAjk') is called the equivalence degree of structured attribute values. Consequently, the notion of the equivalence degree of structured attribute values can be extended to the tuples of fuzzy nested relations to assess tuple redundancies. Informally, any two tuples in a nested relation are redundant if, for each pair of corresponding attribute values, the equivalence degree is greater than or equal to the threshold value. If a pair of corresponding attribute values is simple, the equivalence degree is the one defined for two simple values; if the pair consists of structured attribute values, it is the equivalence degree of structured attribute values defined above. Two redundant tuples t and t' are written t ≡ t'.
Fuzzy nested relational operations
Union and Difference. Let r and s be two union-compatible fuzzy nested relations. Then
r ∪ s = min({t | t ∈ r ∨ t ∈ s}) and r − s = {t | t ∈ r ∧ (∀ v ∈ s) (t ≢ v)}
Here, the operation min() means removing the fuzzy redundant tuples in r and s. Of course, the threshold value should be provided for this purpose.
Cartesian Product. Let r and s be two fuzzy nested relations on schemas R and S, respectively. Then r × s is a fuzzy nested relation with the schema R ∪ S. The formal definition of the Cartesian product operation is as follows:
r × s = {t | t(R) ∈ r ∧ t(S) ∈ s}
Projection. Let r be a fuzzy nested relation on the schema R and S ⊂ R. Then the projection of r on the schema S is formally defined as follows:
ΠS(r) = min({t | (∃ v ∈ r) (t = v(S))})
Here, an attribute in S may be of the form B.C, in which B is a structured attribute and C is its component attribute. Like the union operation, the projection operation also needs to remove fuzzy redundant tuples from the result relation.
Selection. In classical relational databases, the selection condition is of the form X θ Y, where X is an attribute, Y is an attribute or a constant value, and θ ∈ {=, ≠, >, ≥, <, ≤}. In order to implement fuzzy queries for fuzzy relational databases, "θ" should be fuzzy, denoted ≈, ≉, ≻, ⪰, ≺, and ⪯. In addition, X
is only a simple attribute or a simple component attribute of a structured attribute, but Y may be one of the following: (a) a constant, crisp or fuzzy; (b) a simple attribute; (c) a simple component attribute of a structured attribute, having the form A.B, where A is a structured attribute and B is its simple component attribute. Assume that there is a resemblance relation on the universe of discourse and α is the threshold on it. Then the fuzzy comparison operations are defined as follows:
(1) X ≈ Y iff SEα(X, Y) ≥ β, where β is a selected cut (the same in the following).
(2) X ≉ Y iff SEα(X, Y) < β.
(3) X ≻ Y iff X ≉ Y and min(Supp(X)) > min(Supp(Y)).
(4) X ⪰ Y iff X ≈ Y or X ≻ Y.
(5) X ≺ Y iff X ≉ Y and min(Supp(X)) < min(Supp(Y)).
(6) X ⪯ Y iff X ≈ Y or X ≺ Y.
Depending on Y, the following situations can be identified for the selection condition X θ Y. Let X be the attribute Ai: τi in a fuzzy nested relation.
(1) Ai θ c, where c is a crisp constant. According to τi, the definition of Ai θ c is as follows:
• if τi is dom, Ai θ c is a traditional comparison and θ ∈ {=, ≠, >, <, ≥, ≤},
• if τi is fdom, Ai θ c is a fuzzy comparison and θ ∈ {≈, ≉, ≻, ⪰, ≺, ⪯},
• if τi is ndom, Ai θ c is a null comparison and regarded as a special fuzzy comparison,
• if τi is sdom, Ai θ c is an element-set comparison. Then Ai θ c holds if c and any element of the value of Ai of a tuple satisfy "θ".
(2) Ai θ f, where f is a fuzzy value.
• if τi is dom, fdom, or ndom, Ai θ f is a fuzzy comparison and θ ∈ {≈, ≉, ≻, ⪰, ≺, ⪯},
• if τi is sdom, Ai θ f is a fuzzy set comparison. Then Ai θ f holds if f and any element of the value of Ai of a tuple satisfy the fuzzy "θ", where θ ∈ {≈, ≉, ≻, ⪰, ≺, ⪯}.
(3) Ai θ Aj, where Aj: τj is a simple attribute and i ≠ j.
• if τi and τj are both dom, Ai θ Aj is a traditional comparison,
• if τi and τj are dom and fdom, fdom and fdom, or ndom and fdom, Ai θ Aj is a fuzzy comparison,
• if τi and τj are dom and ndom, Ai θ Aj is a null comparison,
• if τi and τj are dom and sdom, Ai θ Aj is an element-set comparison,
• if τi and τj are fdom and sdom, Ai θ Aj is a fuzzy set comparison,
• if τi and τj are both ndom, Ai θ Aj is a null-null comparison. Then Ai θ Aj holds if they have the same null values on the same universe of discourse,
• if τi and τj are ndom and sdom, Ai θ Aj is a null-set comparison and regarded as a special element-set comparison,
• if τi and τj are both sdom, Ai θ Aj is a set-set comparison and regarded as a special element-set comparison.
(4) Ai θ Aj.B, where Aj is a structured attribute (i ≠ j) and B is a simple attribute. The situations are the same as those in case (3) above.
In the fuzzy nested relational databases, the selection condition is similar to the selection condition in the fuzzy relational databases except that the attribute may be of the form B.C, where B is a structured attribute and C is its component attribute. Let Q be a predicate denoting the selection condition. The selection operation for a fuzzy nested relation r is defined as follows:
σQ(r) = {t | t ∈ r ∧ Q(t)}
In addition to these traditional relational operations, two restructuring operations, called Nest and Unnest (also called Pack and Unpack, or Merge and Unmerge, in the literature), are crucial in the fuzzy nested relational databases. The Nest operator produces a nested relation with structured attributes. The Unnest operator is used to flatten a nested relation; that is, it takes a nested relation on a set of attributes and disaggregates it, creating a "flatter" structure.
Nest Operation. Let r be a fuzzy nested relation with the schema R = {A1, A2, …, Ai, …, Ak, …, An}, where 1 ≤ i, k ≤ n. Now Y = {Ai, …, Ak} is merged into a structured attribute B and a new fuzzy nested relation s is formed, whose schema is S = {A1, A2, …, Ai-1, B, Ak+1, …, An}. The following notation is used to represent the Nest operation:
s(S) = Γ_{Y → B}(r(R)) = {ω[(R − Y) ∪ B] | (∃u) (∀v) (u ∈ r ∧ v ∈ r ∧ SE(u[R − Y], v[R − Y]) < β ∧ ω[R − Y] = u[R − Y] ∧ ω[B] = u[Y]) ∨ (∀u) (∀v) (u ∈ r ∧ v ∈ r ∧ SE(u[R − Y], v[R − Y]) ≥ β ∧ ω[R − Y] = u[R − Y] ∪f v[R − Y] ∧ ω[B] = u[Y] ∪ v[Y])}
It can be seen that in the process of the Nest operation from attribute set Y to B, multiple tuples of r which are fuzzily equivalent on the attribute set R − Y are merged to form one tuple of s. Such a merging operation is performed on the attribute sets R − Y and Y, respectively. On R − Y the fuzzy union ∪f is used, and for an attribute C ∈ R − Y the value of C of the created tuple is an atomic value, crisp
or fuzzy. The value of an attribute B.C ∈ Y of the created tuple, however, is a set value and the ordinary union is used. Another restructuring operation, called Unnest, is an inverse of Nest under certain conditions. For a classical nested relation, this condition is that the nested relation is in Partitioned Normal Form (PNF). A relation is in PNF if and only if (a) a subset of its simple attributes forms a relation key and (b) every sub-relation is in PNF.
Unnest Operation. Let s be a fuzzy nested relation with the schema S = {A1, A2, …, Ai-1, B, Ak+1, …, An}, where B is a structured attribute and B : {Ai, …, Ak}. The Unnest operation produces a new fuzzy nested relation r, whose schema is R = {A1, A2, …, Ai-1, Ai, …, Ak, Ak+1, …, An}, i.e., R = (S − B) ∪ {Ai, …, Ak}. The following notation is used to represent the Unnest operation:
r(R) = Ξ_B(s(S)) = {t[(R − B) ∪ {Ai, …, Ak}] | (∀u) (u ∈ s ∧ t[R − B] = u[R − B] ∧ t[Ai … Ak] ∈ u[B])}
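For intuition only, a strongly simplified sketch of the Nest operator is given below: it groups tuples by crisp equality on R − Y instead of the fuzzy equivalence SE ≥ β and omits the fuzzy union ∪f, but it shows the structural effect of collecting the Y-projections into a set-valued attribute B. All names are ours.

```python
def nest(tuples, group_attrs, nested_attrs, new_attr):
    """Simplified Nest operator: tuples that agree (crisply) on group_attrs are
    merged, and their projections on nested_attrs are collected as a set under
    the structured attribute new_attr.  The fuzzy version would instead merge
    tuples whose equivalence degree on group_attrs is >= beta, using fuzzy union."""
    groups = {}
    for t in tuples:
        key = tuple(t[a] for a in group_attrs)
        groups.setdefault(key, set()).add(tuple(t[a] for a in nested_attrs))
    return [dict(zip(group_attrs, key), **{new_attr: vals})
            for key, vals in groups.items()]

r = [{"Tank_Id": "TA2", "Resp": "Tom"}, {"Tank_Id": "TA2", "Resp": "Mary"}]
print(nest(r, ["Tank_Id"], ["Resp"], "Resp_set"))
# -> [{'Tank_Id': 'TA2', 'Resp_set': {('Tom',), ('Mary',)}}]
```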
4 Mapping a fuzzy EER model to a fuzzy nested relational schema
After constructing the fuzzy EER (FEER) model, we can map it to the fuzzy nested relational database (FNRDB) model. For this purpose, we should first know to what extent the fuzzy nested relational databases can support the fuzzy EER model. Three levels of fuzziness can be found in the fuzzy EER model, the first of which occurs at the level of metadata. The fuzzy nested relational databases do not support this level of fuzziness; they can only model the fuzziness that occurs at the schema/instance and instance levels. In addition, an entity in the EER model generally corresponds to a relation in the nested relational databases, and the attributes of an entity correspond to the attributes of the relation. The relationships in the EER model can be mapped into attributes of the relations created from the entities connected by the relationships. The details of the formal mapping are given as follows.
4.1 Transformation of entities
Generally speaking, an entity in the FEER model can be transformed to a relation in the FNRDB, and the attributes of the entity may directly be considered as the attributes of the mapped relation. Note that the multivalued attributes of the FEER model are mapped into set-valued attributes in the FNRDB.
According to the discussion above, we can distinguish four kinds of entities in the FEER model: (a) entities without any fuzziness at the three levels, (b) entities with fuzziness only at the third level, (c) entities with fuzziness at the second level, and (d) entities with fuzziness at the first level. The entities in (a) and (b) can be transformed directly into relations. For the entities in (c), however, an additional attribute must be added to each relation transformed from the corresponding entity; it denotes the possibility that the tuples belong to the relation. It should be noted that, for entities with a membership degree and entities whose attributes have a membership degree, the fuzzy nested relational model cannot represent this first level of fuzziness, so the transformations from the entities to the relations and from the attributes of the entities to the attributes of the relations cannot be completed. Figure 4.1 shows the transformation of the entities, and a small illustrative sketch is given below.
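As an informal illustration (not part of the original formalism), the following Python fragment sketches how an entity of kind (c) might be turned into a relation schema with an extra attribute, here called pD as in Figure 4.1, holding the membership possibility of each tuple. The naming conventions and the trailing-asterisk marker for multivalued attributes are assumptions made for this example.

```python
def entity_to_relation(name, attributes, second_level_fuzzy=False):
    """Map an FEER entity to an FNRDB relation schema.
    Multivalued attributes (marked here with a trailing '*') become set-valued;
    second-level fuzziness adds a membership attribute pD."""
    schema = []
    for a in attributes:
        if a.endswith("*"):                 # convention assumed for this sketch
            schema.append((a.rstrip("*"), "set"))
        else:
            schema.append((a, "atomic"))
    if second_level_fuzzy:
        schema.append(("pD", "membership degree"))  # possibility of the tuple
    return {"relation": name, "schema": schema}

# Entity of kind (c): tuples only possibly belong to the relation.
good_partner = entity_to_relation("GoodPartner", ["Number", "Site", "State"],
                                  second_level_fuzzy=True)
```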
Fig. 4.1 Transformations of Entities
Note that the above-mentioned entities are not subclass entities, whose mapping will be discussed below. In addition, a weak entity in the FEER model depends on its owner entity; such a weak entity can be mapped to a relation-valued attribute.
4.2 Transformation of relationships
A relationship in the EER model should be mapped into an association relation whose attributes serve as a group of pointers providing an explicit reference from one tuple to another. Considering the cardinality constraint, such attributes in the two associated tuples can be single-valued or multivalued. Let entity E1 with attributes {K1, A1, …, Am} and entity E2 with attributes {K2, B1, …, Bn} be connected by relationship R {X1, …, Xk}, where K1 and K2 are the key attributes of E1 and E2, respectively, and R may be a one-to-one, one-to-many, or many-to-many relationship. In addition to the transformation process
of the entities given above, the relationship R is mapped into a relational schema with attribute set {K1, K2, X1, …, Xk}, where K1 and K2 are key attributes. If the cardinality constraint is one-to-many, i.e., R is a one-to-many relationship from E1 to E2, then K2 is a set-valued attribute of the relation r1 created from E1, and K1 is a single-valued attribute of the relation r2 created from E2. Considering the fuzziness in the FEER model, we can distinguish four kinds of relationships: (a) relationships without any fuzziness at the three levels, (b) relationships with fuzziness only at the third level (meaning that the attributes of the relationship may take fuzzy values), (c) relationships with fuzziness at the second level, and (d) relationships with fuzziness at the first level. For a relationship R in (a) or (b), the transformation can be conducted according to the rules given above. For a relationship R in (c), an additional attribute denoting the possibility that the relationship instance belongs to the relation should be added to the relation r created from R. It should be noted that, for relationships with a membership degree and relationships whose attributes have a membership degree, the fuzzy nested relational model cannot represent this first level of fuzziness and the transformation of such relationships cannot be completed. Figure 4.2 shows the transformation of the relationships; a small sketch is given below.
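The following Python fragment is an illustration only: the attribute names and the one-to-many handling follow the description above, while everything else (the helper name, the dictionary encoding, the treatment of the remaining cardinalities) is assumed for the example. It sketches the association relation built for a relationship R between E1 and E2, together with the reference attributes added to r1 and r2.

```python
def relationship_to_relations(r_name, k1, k2, r_attrs,
                              cardinality="many-to-many",
                              second_level_fuzzy=False):
    """Build the association relation for R and record how the key of the
    other entity is referenced in r1 and r2 (set-valued or single-valued)."""
    assoc = {"relation": r_name, "schema": [k1, k2] + list(r_attrs)}
    if second_level_fuzzy:
        assoc["schema"].append("pD")        # possibility of the relationship instance
    if cardinality == "one-to-many":        # one E1 instance relates to many E2 instances
        refs = {"r1": (k2, "set-valued"), "r2": (k1, "single-valued")}
    else:                                   # assumption for the other cardinalities
        refs = {"r1": (k2, "set-valued"), "r2": (k1, "set-valued")}
    return assoc, refs

assoc, refs = relationship_to_relations("R", "K1", "K2", ["X1"],
                                        cardinality="one-to-many",
                                        second_level_fuzzy=True)
```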
Fig. 4.2 Transformations of Relationships
4.3 Transformation of generalizations
Let E1 with attributes {K1, A1, A2, …, Ak} and E2 with attributes {K2, B1, B2, …, Bm} be generalized to form the supertype S. Assume {A1, A2, …, Ak} ∩ {B1, B2, …, Bm} = {C1, C2, …, Cn}. Generally speaking, E1 and E2 are mapped into schemas
{K1, A1, A2, …, Ak} − {C1, C2, …, Cn} and {K2, B1, B2, …, Bm} − {C1, C2, …, Cn}, respectively. As to the transformation of S, depending on K1 and K2, we distinguish the following two cases. (a) K1 and K2 are identical. Then S is mapped into the schema {K, C1, C2, …, Cn}, where K is the same as K1/K2. (b) K1 and K2 are different. Then S is mapped into the relational schema {K, C1, C2, …, Cn}, where K is a surrogate key created from K1 and K2 [32]. Considering the fuzziness in the entities, the following cases for the transformation of the generalization are distinguished: (a) E1 and E2 are both crisp. Then E1 and E2 are transformed to relations r1 and r2 with attributes {K1, A1, A2, …, Ak} − {C1, C2, …, Cn} and {K2, B1, B2, …, Bm} − {C1, C2, …, Cn}, respectively, and S is transformed to a relation r with attributes {K, C1, C2, …, Cn}. (b) When there is fuzziness at the instance/schema level in E1 and/or E2, relation r, as well as relations r1 and r2, are formed as in case (a). Note that r, r1 and/or r2 created from E1 and/or E2 with the instance/schema level of fuzziness should include the attribute pD. (c) When there is fuzziness at the schema level in E1 and/or E2, relation r, as well as relations r1 and r2, are formed, but the fuzziness at this level cannot be represented in the created relations. Figure 4.3 shows the transformations of the generalization; a small sketch follows the figure caption below.
Fig. 4.3 Transformation of Generalization
4.4 Transformation of specializations
Let S be an entity type with attributes {K, A1, A2, …, An}, where K is the key. Let entity type S1 with attributes {A11, A12, …, A1k} and entity type S2 with attributes {A21, A22, …, A2m} be subclasses of S. Since S1 and S2 are subclasses of S, there are no keys in S1 and S2. Then S is mapped into the relational schema {K, A1, A2, …, An}, and S1 and S2 are mapped into the schemas {K, A11, A12, …, A1k} and {K, A21, A22, …, A2m}, respectively. Figure 4.4 shows the transformation of specialization.
4.5 Transformation of categorizations
The categorization in the EER model is concerned with the issue of selective inheritance. Essentially, a categorization reflects the uncertainty about which entity of the categorization will actually take its place in the schema. The entities in the categorization, fuzzy or not, can each be mapped into relations following the methods given above. The categorization entity follows the same transformation; some additional attributes, however, should be added to the corresponding relation, namely the set of all attributes of the entities in the categorization.
Fig. 4.4 Transformation of Specialization
4.6 Transformation of aggregations
Each aggregation in the fuzzy EER model can be mapped into a relation of the fuzzy nested relational schema with relation-valued attributes. Depending on the
component entities, the aggregation entity may be crisp or fuzzy. As noted above, there are four kinds of entities in the fuzzy EER model. Fuzziness of the component entities only on attribute values does not influence the relation mapped from the aggregation entity. If the component entities carry fuzziness at the instance/schema level, namely the second level of fuzziness, an additional attribute must be added to the relation mapped from the aggregation entity, indicating the aggregation degree of the tuples. Fuzziness of the component entities at the schema level, i.e., the first level of fuzziness, cannot be modeled in the relation mapped from the aggregation entity. Figure 4.5 shows the transformation of aggregation; a small sketch is given below.
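A minimal Python sketch of this last mapping (the helper name, the flag and the Car/Chassis/Engine attribute lists, taken loosely from Fig. 4.5, are assumptions for illustration only): component entities become relation-valued attributes of the aggregation relation, and a pD attribute is added when a component carries second-level fuzziness.

```python
def aggregation_to_relation(name, own_attrs, components):
    """components: list of (component name, attribute list, second_level_fuzzy)."""
    schema = [(a, "atomic") for a in own_attrs]
    add_pd = False
    for comp_name, comp_attrs, fuzzy2 in components:
        schema.append((comp_name, {"relation-valued": comp_attrs}))
        add_pd = add_pd or fuzzy2          # second-level fuzziness in a component
    if add_pd:
        schema.append(("pD", "aggregation degree"))
    return {"relation": name, "schema": schema}

car = aggregation_to_relation(
    "Car", ["CarID", "Model", "Name"],
    [("Chassis", ["ChassisID", "Size"], False),
     ("Engine", ["EngineID"], True)])
```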
5 Conclusion
The incorporation of imprecise and uncertain information into database models has been an important topic of database research, because such information exists extensively in data- and knowledge-intensive applications such as expert systems, decision making, and CAD/CAM. In addition, complex object structures are another characteristic of these systems. The classical relational database model and its extensions for imprecision and uncertainty do not satisfy the needs of modeling complex objects with imprecise and uncertain information. In this chapter, we have presented the fuzzy extended entity-relationship (FEER) model to cope with imperfect as well as complex objects in the real world at a conceptual level. We have also extended the nested relational database model for modeling extended possibility-based fuzzy data. In particular, we have provided a formal approach to mapping a fuzzy EER model to a fuzzy nested relational database schema.
Acknowledgements
The authors wish to thank the anonymous referees for their valuable comments and suggestions, which improved the technical content and the presentation of the paper. This work was supported in part by the MOE Funds for Doctoral Programs (20050145024).
Fig. 4.5 Transformation of Aggregation
References
[1] Abiteboul, S. and Hull, R. (1987) IFO: A Formal Semantic Database Model, ACM Transactions on Database Systems, 12(4): 525-565.
[2] Bordogna, G., Pasi, G. and Lucarella, D. (1999) A Fuzzy Object-Oriented Data Model for Managing Vague and Uncertain Information, International Journal of Intelligent Systems, 14: 623-651.
[3] Bordogna, G. and Pasi, G. (2001) Graph-Based Interaction in a Fuzzy Object Oriented Database, International Journal of Intelligent Systems, 16(7): 821-841.
[4] Bosc, P. and Prade, H. (1997) An Introduction to Fuzzy Set and Possibility Theory-based Approaches to the Treatment of Uncertainty and Imprecision in Data Base Management Systems. In: Uncertainty Management in Information Systems: from Needs to Solutions (A. Motro and P. Smets, Eds), pp. 285-324, Kluwer Academic Publishers.
[5] Bosc, P. and Pivert, O. (1995) SQLf: A Relational Database Language for Fuzzy Querying, IEEE Transactions on Fuzzy Systems, 3(1): 1-17.
[6] Bosc, P. and Pivert, O. (2003) On the Impact of Regular Functional Dependencies When Moving to a Possibilistic Database Framework, Fuzzy Sets and Systems, 140(1): 207-227.
[7] Bosc, P., Dubois, D. and Prade, H. (1998) Fuzzy Functional Dependencies and Redundancy Elimination, Journal of the American Society for Information Science, 49(3): 217-235.
[8] Buckles, B. P. and Petry, F. E. (1982) A Fuzzy Representation of Data for Relational Database, Fuzzy Sets and Systems, 7(3): 213-226.
[9] Chaudhry, N. A., Moyne, J. R. and Rundensteiner, E. A. (1999) An Extended Database Design Methodology for Uncertain Data Management, Information Sciences, 121(1-2): 83-112.
[10] Chen, G. Q. and Kerre, E. E. (1998) Extending ER/EER Concepts towards Fuzzy Conceptual Data Modeling, In: Proc. of the 1998 IEEE International Conference on Fuzzy Systems, 2: 1320-1325.
[11] Chen, P. P. (1976) The Entity-Relationship Model: Toward a Unified View of Data, ACM Transactions on Database Systems, 1(1): 9-36.
[12] Cross, V., Caluwe, R. and Vangyseghem, N. (1997) A Perspective from the Fuzzy Object Data Management Group (FODMG), In: Proc. of the 1997 IEEE International Conference on Fuzzy Systems, 2: 721-728.
[13] Cross, V. and Firat, A. (2000) Fuzzy Objects for Geographical Information Systems, Fuzzy Sets and Systems, 113: 19-36.
[14] De Tré, G. and De Caluwe, R. (2003) Level-2 Fuzzy Sets and Their Usefulness in Object-Oriented Database Modelling, Fuzzy Sets and Systems, 140(1): 29-49.
[15] Dubois, D., Prade, H. and Rossazza, J. P. (1991) Vagueness, Typicality, and Uncertainty in Class Hierarchies, International Journal of Intelligent Systems, 6: 167-183.
[16] George, R., Srikanth, R., Petry, F. E. and Buckles, B. P. (1996) Uncertainty Management Issues in the Object-Oriented Data Model, IEEE Transactions on Fuzzy Systems, 4(2): 179-192.
[17] Gyseghem, N. V. and Caluwe, R. D. (1998) Imprecision and Uncertainty in UFO Database Model, Journal of the American Society for Information Science, 49(3): 236-252.
[18] Ma, Z. M., Zhang, W. J. and Ma, W. Y. (2000) Semantic Measure of Fuzzy Data in Extended Possibility-Based Fuzzy Relational Databases, International Journal of Intelligent Systems, 15(8): 705-716.
[19] Ma, Z. M., Zhang, W. J., Ma, W. Y. and Chen, G. Q. (2001) Conceptual Design of Fuzzy Object-Oriented Databases Using Extended Entity-Relationship Model, International Journal of Intelligent Systems, 16: 697-711.
[20] Ma, Z. M. and Mili, F. (2002) An Extended Possibility-Based Fuzzy Nested Relational Database Model and Algebra, IFIP International Federation for Information Processing (Kluwer Academic Publishers), 221: 285-288.
[21] Ma, Z. M. and Mili, F. (2002) Handling Fuzzy Information in Extended Possibility-Based Fuzzy Relational Databases, International Journal of Intelligent Systems, 17(10): 925-942.
[22] Marín, N., Pons, O. and Vila, M. A. (2001) A Strategy for Adding Fuzzy Types to an Object-Oriented Database System, International Journal of Intelligent Systems, 16(7): 863-880.
[23] Naiburg, E. (2000) Database Modeling and Design Using Rational Rose, http://www.therationaledge.com/rosearchitect/mag/current/spring00/f5.html.
[24] Poncelet, P., Teisseire, M., Cicchetti, R. and Lakhal, L. (1993) Towards a Formal Approach for Object Database Design, In: Proc. of the 19th Very Large Data Bases Conference, 278-289.
[25] Prade, H. and Testemale, C. (1984) Generalizing Database Relational Algebra for the Treatment of Incomplete or Uncertain Information and Vague Queries, Information Sciences, 34: 115-143.
[26] Raju, K. V. S. V. N. and Majumdar, A. K. (1988) Fuzzy Functional Dependencies and Lossless Join Decomposition of Fuzzy Relational Database System, ACM Transactions on Database Systems, 13(2): 129-166.
[27] Sözat, M. I. and Yazici, A. (2001) A Complete Axiomatization for Fuzzy Functional and Multivalued Dependencies in Fuzzy Database Relations, Fuzzy Sets and Systems, 117(2): 161-181.
[28] Takahashi, Y. (1993) Fuzzy Database Query Languages and Their Relational Completeness Theorem, IEEE Transactions on Knowledge and Data Engineering, 5(1): 122-125.
[29] Teorey, T. J., Yang, D. Q. and Fry, J. P. (1986) A Logical Design Methodology for Relational Databases Using the Extended Entity-Relationship Model, ACM Computing Surveys, 18(2): 197-222.
[30] Umano, M. and Fukami, S. (1994) Fuzzy Relational Algebra for Possibility-Distribution-Fuzzy-Relational Model of Fuzzy Data, Journal of Intelligent Information Systems, 3: 7-27.
[31] Vila, M. A., Cubero, J. C., Medina, J. M. and Pons, O. (1996) A Conceptual Approach for Dealing with Imprecision and Uncertainty in Object-Based Data Models, International Journal of Intelligent Systems, 11: 791-806.
[32] Yazici, A., Buckles, B. P. and Petry, F. E. (1999) Handling Complex and Uncertain Information in the ExIFO and NF2 Data Models, IEEE Transactions on Fuzzy Systems, 7(6): 659-676.
[33] Yazici, A., Soysal, A., Buckles, B. P. and Petry, F. E. (1999) Uncertainty in a Nested Relational Database Model, Data & Knowledge Engineering, 30: 275-301.
[34] Zadeh, L. A. (1965) Fuzzy Sets, Information and Control, 8(3): 338-353.
[35] Zadeh, L. A. (1978) Fuzzy Sets as a Basis for a Theory of Possibility, Fuzzy Sets and Systems, 1(1): 3-28.
[36] Zemankova, M. and Kandel, A. (1985) Implementing Imprecision in Information Systems, Information Sciences, 37(1-3): 107-141.
[37] Zicari, R. and Milano, P. (1990) Incomplete Information in Object-Oriented Databases, ACM SIGMOD Record, 19(3): 5-16.
[38] Zvieli, A. and Chen, P. P. (1986) Entity-Relationship Modeling and Fuzzy Databases, In: Proc. of the 1986 IEEE International Conference on Data Engineering, 320-327.
Querying Contradictory Databases by Taking into Account Their Reliability and Their Number
Laurence Cholvy
ONERA Centre de Toulouse, 2 av Ed. Belin, 31055 Toulouse, France
[email protected]
1 Introduction
Database integration, federated databases, multidatabases and database merging ([3], [1], [2], [17], [5], [12], [14], [10], [4]) are closely related problems whose aim is to query several independent databases viewed as a single one. A first difficulty is to define a global schema which provides the user with a unified view of the different databases, together with mappings between this global schema and the different local schemas. This problem is intensively studied in the database community and has led to two main paradigms: the "Local-as-view" paradigm (which describes each data source as a view of the global schema) and the "Global-as-view" paradigm (which defines global relations as views of the local ones). See for instance [10] and [4] for more details. The present work assumes that this first difficulty has already been solved and that the different databases share a common data description language, i.e. a common set of relation symbols. The remaining difficulty is that the information stored in the different databases may be contradictory, especially in the presence of some integrity constraints (key constraints, foreign key constraints, etc.) that may or may not be common to the different databases. The point is then to provide consistent answers to queries. In order to illustrate that point, consider a user facing three databases which store information about flights taking off from Paris Orly airport. Assume that R denotes a relation which stores, for each flight, its number, the name of the company which manages it and the airport terminal it takes off from. The first database stores the facts R(1, AF, Orly-West), R(2, TWA, Orly-West), the second one stores R(1, AF, Orly-South), R(3, AF, Orly-South) and the last one stores R(1, AF, Orly-West), R(2, TWA, Orly-South). These facts respectively mean that flight number 1 is an Air-France flight which takes
off from Orly West terminal, flight number 2 is a TWA flight which takes off from Orly West terminal, flight number 1 is an Air-France flight which takes off from Orly South terminal, etc. What answer can be provided to the user if she wants to know which terminal flight number 1 takes off from? There is not a single way to answer this query. In fact, depending on the information one has about the databases, different answers can be provided. Firstly, we can adopt a cautious attitude which consists in making the disjunction of all the contradictory answers. In the example, this attitude would lead to answer that flight number 1 takes off from the Orly-West terminal or from the Orly-South terminal. But this way of answering is questionable since it introduces disjunctions in the answers even if the initial data are not disjunctive. If we know the relative reliability of the databases, then this information can be taken into account (with an ordinal method or a cardinal one) for providing complete answers ([15], [11], [5], [16]...). For instance, if we know that the first database is managed by Paris-Orly airport while the others are not, it can be assumed to be the most reliable and we can answer that flight number 1 takes off from the Orly-West terminal. A more precise method consists in considering that database reliability depends on the topics of the data ([6]). For instance, if we know that the second database is managed by the Air-France company, then we can assume that, as far as Air-France flights are concerned, it is the most reliable, and we can answer that flight number 1 takes off from the Orly-South terminal. When the reliability of the databases is not known, or when all the databases are equally reliable, an alternative is to provide an answer which agrees with the greatest number of databases ([9], [13], [7]). In the previous example, this amounts to answering that flight number 1 takes off from the Orly-West terminal. Finally, a last method for answering queries is a combination of the previous ones and consists in taking into account both the number of databases which support a piece of information and their reliability. More specifically, in such a method, stating that a fact is true requires comparing the reliability degrees of the different databases which support that fact with the reliability degrees of the different databases which support its negation. In the present work, we focus on such a method, in which the reliability degrees of the databases are integers, so that the bigger the integer, the higher the database's reliability. We define a query evaluator for consistently answering queries by taking into account both the number of databases that support the answer and their reliability degrees. This chapter is organized as follows. Section 2 defines a query evaluator for consistently answering queries in the case of propositional databases. We prove that this query evaluator corresponds to a weighted majority merging operator. Section 3 extends the query evaluator to first order databases. Two
examples are given in section 4. We show in section 5 that the present work generalizes previous ones. Finally, section 6 is devoted to a discussion.
2 A query-evaluator in the propositional setting
2.1 Assumptions and vocabulary
We consider a propositional language L, and several databases db1, . . . , dbn called primitive databases, whose contents are finite and consistent sets of literals of L (a literal is a positive or negative atom). If db is a primitive database and α is a positive integer, then α.db denotes a set called a weighted database (the binary operation . will formally be defined in the following sections). If db1 and db2 are two databases which are not primitive, then (db1 ∗ db2) denotes a set called a merged database (the binary operation ∗ will formally be defined in the following sections). For instance, if we face three primitive databases db1, db2 and db3, then 3.db1 denotes a weighted database, and (2.db1 ∗ 1.db2) and (3.db1 ∗ (1.db2 ∗ 2.db3)) denote merged databases. Intuitively speaking, the notation (3.db1 ∗ (1.db2 ∗ 2.db3)) will be used to represent the fact that queries are addressed to the three databases db1, db2 and db3, assuming that their respective reliability degrees are 3, 1 and 2. The query evaluator for answering queries is described as a set of formulas of a logical language, named ML and defined in the following.
2.2 The logical language
The meta-language ML is defined by:
• the constant symbols of ML are the propositional atoms of L, the names of the databases, a constant symbol denoted nil, and constant symbols denoting the integers 1, 2, etc.
• a binary function, ., which will be used to denote the weighted databases.
• a binary function, ∗, which will be used to denote the merged databases.
• a binary function denoted + which is the sum of integers.
• the binary meta-predicate symbols are In, B, neg, = and >.
• a ternary meta-predicate symbol is Occur.
The intuitive semantics of the predicates is the following:
- neg(l, l′) is true if l and l′ are two literals of L and l′ is the negation of l.
- In(l, db) is true if l is a literal of L, db is a primitive database and l belongs to db.
- Occur(db, l, i) is true if l is a literal of L, i is an integer, db denotes a database (primitive, weighted or merged) and l appears i times in db.
- B(db, l) is true if l is a literal of L, db is a database (primitive, weighted or merged) and we can conclude that l is true from db.
2.3 The program
If we consider n databases db1..dbn, let META(db1, ..., dbn) be the following set of ML formulas, in which db, db′ denote primitive, weighted or merged databases, l, l′ denote literals of L, and α, i, j, k, k′ denote integers.
(1) In(l, db) if db is a primitive database and l belongs to db
(2) Occur(db, l, k) ∧ k′ = α.k → Occur(α.db, l, k′)
(3) ¬(db′ = nil) ∧ Occur(db, l, i) ∧ Occur(db′, l, j) ∧ (k = i + j) → Occur(db ∗ db′, l, k)
(4) In(l, db) → Occur(db ∗ nil, l, 1)
(5) ¬In(l, db) → Occur(db ∗ nil, l, 0)
(6) Occur(db, l, i) ∧ Occur(db, l′, j) ∧ neg(l, l′) ∧ (i > j) → B(db, l)
(7) neg(l, l′) where l and l′ are complementary literals of L
(8) k = (r + l) and (r + l) = k for any k in {1...n}, for any r in {1...k} and for any l such that k = l + r
(9) k′ = α.k for any integers α ∈ {1...n}, k ∈ {1...n} and k′ ∈ {1...n²} such that k′ = α.k
(10) k > r for any k in {1...n} and for any r in {1...k − 1} such that k > r
Axioms of type (1) are used to list the contents of the different primitive databases. According to axiom (2), the number of occurrences of a literal l in the weighted database α.db is α times the number of occurrences of l in db. According to axiom (3), the number of occurrences of a literal l in the merged database db ∗ db′ is the sum of the number of occurrences of l in db and the number of occurrences of l in db′. Axiom (4) states that if a literal l belongs to the primitive database db then the number of occurrences of l in db ∗ nil is 1. Axiom (5) states that if a literal l does not belong to the primitive database db then the number of occurrences of l in db ∗ nil is 0. Axiom (6) states that we can conclude l from a database db if the number of occurrences of l in db is strictly greater than the number of occurrences of its negation in db. Axioms (7) relate any literal of L with its negation. Finally, axioms (8), (9) and (10) define, in extension, the sum, product and comparison of integers needed in the previous axioms. Notice that there is a finite number of them.
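To give an operational feel for this counting mechanism, here is a small Python sketch (not the chapter's PROLOG meta-program; the names and data representation are assumptions): primitive databases are sets of literal strings, a leading "-" marks negation, weighted occurrences are summed, and a literal is believed when it strictly outweighs its negation, exactly as in axiom (6).

```python
def negate(literal):
    """Complementary literal: 'a' <-> '-a' (plays the role of neg)."""
    return literal[1:] if literal.startswith("-") else "-" + literal

def occur(weighted_dbs, literal):
    """Weighted number of occurrences of a literal across the primitive
    databases, i.e. Occur applied to alpha1.db1 * ... * alphan.dbn * nil."""
    return sum(alpha for alpha, db in weighted_dbs if literal in db)

def believes(weighted_dbs, literal):
    """Axiom (6): l is concluded when its weighted count strictly exceeds
    the weighted count of its negation."""
    return occur(weighted_dbs, literal) > occur(weighted_dbs, negate(literal))

# The three-database example used in section 5: db1 = {a, b}, db2 = {a, -c},
# db3 = {-a, c}, queried with reliability degrees 1, 1 and 1.
dbs = [(1, {"a", "b"}), (1, {"a", "-c"}), (1, {"-a", "c"})]
assert believes(dbs, "a") and believes(dbs, "b")
assert not believes(dbs, "c") and not believes(dbs, "-c")
```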
2.4 Relation with a weighted majority merging operator
In [9], Konieczny and Pino-Pérez introduced a majority merging operator as follows (we slightly change the presentation of these definitions to remain coherent with what has already been presented). Let db1...dbn be n information sources to be merged. A majority merging operator, denoted ∆Σ, is defined such that the models of the information source obtained from merging db1...dbn with this operator are semantically characterized by:
Mod(∆Σ([db1, ..., dbn])) = min≤Σ[db1...dbn](W)
where W denotes the set of all the interpretations of the language L (the propositional language used to describe the contents of the information sources), and ≤Σ[db1...dbn] is a total pre-order on W defined by:
w ≤Σ[db1...dbn] w′ iff dΣ(w, [db1...dbn]) ≤ dΣ(w′, [db1...dbn])
with
dΣ(w, [db1...dbn]) = Σ_{i=1..n} min_{w′ ∈ Mod(dbi)} d(w, w′)
where Mod(dbi) is the set of models of dbi and d(w, w′) is the Hamming distance. In other words, when merging db1...dbn with the operator ∆Σ, the result is semantically characterized by the interpretations which are minimal according to the pre-order ≤Σ[db1,...,dbn].
Now, let us consider that the knowledge bases db1...dbn are associated with weights α1...αn which are integers. We can extend the previous definitions and define a new merging operator, ∆Σ^{α1...αn}, such that the models of the information source obtained from merging db1...dbn with this operator are semantically characterized by:
Mod(∆Σ^{α1...αn}([db1, ..., dbn])) = min≤Σ,α1...αn[db1...dbn](W)
where ≤Σ,α1...αn[db1...dbn] is a total pre-order on W defined by:
w ≤Σ,α1...αn[db1...dbn] w′ iff dΣ,α1...αn(w, [db1...dbn]) ≤ dΣ,α1...αn(w′, [db1...dbn])
with
dΣ,α1...αn(w, [db1...dbn]) = Σ_{i=1..n} αi · min_{w′ ∈ Mod(dbi)} d(w, w′)
In other words, when merging db1...dbn with the operator ∆Σ^{α1...αn}, the result is semantically characterized by the interpretations which are minimal according to the pre-order ≤Σ,α1...αn[db1,...,dbn]. ∆Σ^{α1...αn} is thus a weighted majority merging operator. The following proposition proves that the query evaluator defined in section 2.3 corresponds to such a merging operator.
Proposition 1. Let db1...dbn be primitive databases, let α1...αn be integers and let l be a literal of L. Then PROLOG proves B(α1.db1 ∗ (... ∗ (αn.dbn ∗ nil)), l) in the meta-program META(db1, ..., dbn) iff ∆Σ^{α1...αn}([db1...dbn]) |= l.
Proof. Notation: in the following, [α1.db1...αn.dbn] will denote the multi-set [db1...(α1 times)...db1, ..., dbn...(αn times)...dbn].
(Only if) Assume that PROLOG proves B(α1.db1 ∗ (... ∗ (αn.dbn ∗ nil)), l) in the meta-program META(db1, ..., dbn). Assume that there exists w1 ∈ Mod(∆Σ^{α1...αn}([db1...dbn])) such that w1 ⊭ l. Let us then consider the interpretation w2 which is identical to w1 except that the valuation of l in w2 is true, i.e., w2 |= l. Since PROLOG proves B(α1.db1 ∗ (... ∗ (αn.dbn ∗ nil)), l), axiom (6) implies that PROLOG also proves Occur(α1.db1 ∗ (... ∗ (αn.dbn ∗ nil)), l, i), neg(l, l′), Occur(α1.db1 ∗ (... ∗ (αn.dbn ∗ nil)), l′, j) and i > j, i.e., that in α1.db1 ∗ (... ∗ (αn.dbn ∗ nil)) the number of occurrences of l is strictly greater than the number of occurrences of its negation. Thus, by definition of axioms (1)...(5), in the multi-set [α1.db1...αn.dbn] the number of occurrences of l is strictly greater than the number of occurrences of its negation. Thus dsum(w2, [α1.db1...αn.dbn]) < dsum(w1, [α1.db1...αn.dbn]) and, finally, dΣ,α1...αn(w2, [db1...dbn]) < dΣ,α1...αn(w1, [db1...dbn]). This shows that w1 ∉ Mod(∆Σ^{α1...αn}([db1...dbn])) and thus leads to a contradiction. Finally, we conclude that ∀w, w ∈ Mod(∆Σ^{α1...αn}([db1...dbn])) =⇒ w |= l.
(If) Let us assume that ∆Σ^{α1...αn}([db1...dbn]) |= l. This implies that, in [α1.db1...αn.dbn], the number of occurrences of l is strictly greater than the number of occurrences of its negation. Thus, by definition of axioms (1)...(5), in α1.db1 ∗ (... ∗ (αn.dbn ∗ nil)) the number of occurrences of l is strictly greater than the number of occurrences of its negation. Thus, by axiom (6), PROLOG proves B(α1.db1 ∗ (... ∗ (αn.dbn ∗ nil)), l) in the meta-program META(db1, ..., dbn). (End of proof)
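For readers who want to experiment with the model-based side of this equivalence, the following self-contained Python sketch (an illustration under the same literal encoding as the earlier snippet, not the chapter's implementation) computes ∆Σ^{α1...αn} by enumerating interpretations and minimizing the weighted sum of minimal Hamming distances.

```python
from itertools import product

def models(db, atoms):
    """Interpretations (dicts atom -> bool) satisfying a set of literals."""
    result = []
    for values in product([False, True], repeat=len(atoms)):
        w = dict(zip(atoms, values))
        if all(w[l.lstrip("-")] == (not l.startswith("-")) for l in db):
            result.append(w)
    return result

def hamming(w1, w2):
    return sum(w1[a] != w2[a] for a in w1)

def weighted_merge(dbs, weights, atoms):
    """Models of the weighted majority merging of dbs with the given weights."""
    worlds = [dict(zip(atoms, v)) for v in product([False, True], repeat=len(atoms))]
    def dist(w):
        return sum(alpha * min(hamming(w, m) for m in models(db, atoms))
                   for db, alpha in zip(dbs, weights))
    best = min(dist(w) for w in worlds)
    return [w for w in worlds if dist(w) == best]

# Same example as before: every minimal interpretation satisfies a and b,
# while c remains undecided, matching the syntactic counting evaluator.
result = weighted_merge([{"a", "b"}, {"a", "-c"}, {"-a", "c"}], [1, 1, 1],
                        ["a", "b", "c"])
assert all(w["a"] and w["b"] for w in result)
```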
2.5 Other properties
The following two propositions show that, when answering a query addressed to several databases, these databases play symmetrical roles, i.e. querying several databases is equivalent to querying any permutation of them.
Proposition 2. Let db1...dbn be primitive databases. Let α1...αn be integers and l be a literal of L. In META(db1...dbn), PROLOG proves B(α1.db1 ∗ (α2.db2 ∗ nil), l) iff it also proves B(α2.db2 ∗ (α1.db1 ∗ nil), l).
Proof. This proposition is a corollary of proposition 1.
Proposition 3. Let db1...dbn be primitive databases. Let α1...αn be integers and l be a literal of L. In META(db1...dbn), PROLOG proves B(α1.db1 ∗ (α2.db2 ∗ (α3.db3 ∗ nil)), l) iff it also proves B(α3.db3 ∗ (α1.db1 ∗ (α2.db2 ∗ nil)), l).
Proof. This proposition is a corollary of proposition 1.
Due to these results, parentheses can be omitted when designating the different databases we query. For instance, 2.db1 ∗ 1.db2 ∗ 2.db3 will denote (2.db1 ∗ (1.db2 ∗ (2.db3 ∗ nil))) as well as (2.db3 ∗ (2.db1 ∗ (1.db2 ∗ nil))).
Proposition 4. Let db1...dbn be primitive databases. Let α1...αn be integers and l be a literal of L. The three following cases are exclusive:
(1) In META(db1...dbn), PROLOG proves B((α1.db1 ∗ . . . ∗ αn.dbn), l)
(2) In META(db1...dbn), PROLOG proves B((α1.db1 ∗ . . . ∗ αn.dbn), l′) such that it also proves neg(l, l′).
(3) In META(db1...dbn), PROLOG proves neither B((α1.db1 ∗ . . . ∗ αn.dbn), l) nor B((α1.db1 ∗ . . . ∗ αn.dbn), l′) such that it also proves neg(l, l′).
Proof. This proposition is a corollary of proposition 1.
3 Application to first-order databases
In this section we extend the previous program in order to specify a query evaluator which consistently answers queries addressed to several first order databases which may be contradictory. But, in order to export these results to first order databases, we only consider databases which are "equivalent to sets of ground literals" ([5]). Such databases are defined below.
3.1 Databases equivalent to sets of ground literals
Let LO be a first order language.
Definition 1. A database is a pair DB = <EDB, IDB> ("EDB" stands for "extensional database" and "IDB" stands for "intensional database") such that EDB is a non-empty and finite set of positive or negative ground literals of
LO and IDB is a finite and consistent set of clauses of LO written without function symbols. Notice that literals in EDB can be positive or negative.
Definition 2. Let DB = <EDB, IDB> be a database. Let a1...an (resp. P1...Pk) be the constant (resp. predicate) symbols which appear in the formulas of EDB ∪ IDB. The Herbrand base is the set of positive literals written with the Pi and the aj. A Herbrand interpretation of DB is an interpretation whose domain is {a1, ..., an}. A Herbrand model of DB is a Herbrand interpretation which satisfies EDB ∪ IDB.
Definition 3. Let HM1...HMn be the Herbrand models of EDB ∪ IDB. Let L = {l : l is a literal of the Herbrand base such that ∃HMi ∃HMj, HMi |= l and HMj |= ¬l}. The database DB = <EDB, IDB> is equivalent to a set of ground literals iff for any satisfiable conjunction l1 ∧ ... ∧ lm where, for all i ∈ {1...m}, li ∈ L or ¬li ∈ L, there exists HMi0 such that HMi0 |= l1 ∧ ... ∧ lm.
Example. Consider DB1 = <EDB1, IDB1> with EDB1 = {p(a)} and IDB1 = {¬p(x) ∨ ¬q(x), p(x) ∨ r(x)}. The Herbrand models of DB1 are {p(a)} and {p(a), r(a)} (a model is denoted here by the set of its positive facts). We have L = {r(a)}. We can check that ¬r(a) is satisfied in the first Herbrand model and that r(a) is satisfied in the second. So DB1 is equivalent to a set of ground literals. Consider now DB2 = <EDB2, IDB2> with EDB2 = {p(a)} and IDB2 = {¬p(x) ∨ q(x) ∨ r(x)}. The Herbrand models of DB2 are {p(a), q(a)}, {p(a), r(a)} and {p(a), q(a), r(a)}. We have L = {r(a), q(a)}. We can check that none of the Herbrand models satisfies ¬q(a) ∧ ¬r(a). Thus DB2 is not equivalent to a set of ground literals.
In [5] we have proved the following result:
Proposition 5. Let DB = <EDB, IDB> be a database which is equivalent to a set of ground literals. Let l1...ln be ground literals of LO such that l1 ∨ ... ∨ ln is not a tautology. Then EDB ∪ IDB |= l1 ∨ ... ∨ ln iff ∃i0 ∈ {1...n} EDB ∪ IDB |= li0.
This result ensures that, in a database equivalent to a set of ground literals, a disjunction of ground literals which is not a tautology is deducible from the database iff one of these literals is deducible from the database. This implies that there are no truly disjunctive data deducible from these databases.
3.2 Taking IDB into account
In this section, following [5], we extend META(db1...dbn) in order to take the clauses of IDB into account. Let us denote by h the function which associates with any clause of IDB a set of formulas in the following way:
h(l1 ∨ ... ∨ ln) = {(¬l1 ∧ ... ∧ ¬li−1 ∧ ¬li+1 ∧ ... ∧ ¬ln) → li, i ∈ {1, ..., n}}
Then the axiom (1) of META(db1...dbn) is replaced by the following ones:
(1.1) EDB(db, l) if the ground literal l is in the EDB part of the database db.
(1.2) IDB(db, f) if f ∈ h(c), where c is a clause in the IDB part of db.
(1.3) EDB(db, l) → In(l, db)
(1.4) IDB(db, (r → l)) ∧ Bconj(db, r) → In(l, db)
(1.5) Bconj(db, nil)
(1.6) In(l1, db) ∧ Bconj(db, r1) → Bconj(db, l1 ∧ r1)
We can prove:
Proposition 6. Let db = <EDB, IDB> be such that IDB is not recursive. Let l be a ground literal. Then PROLOG proves In(l, db) iff EDB ∪ IDB |= l. (The proof is similar to the one in [5].)
This result ensures that, if IDB is not recursive, axiom (1) can be replaced by axioms (1.1)...(1.6). Thus, if IDB is not recursive, the meta-program defined for databases which are sets of propositional literals can be used in the case of first order databases which are equivalent to sets of ground literals.
3.3 Definition of answers
Definition 4 (Answers for closed queries). Let db1...dbn be n databases, each of them being equivalent to a set of literals. Let α1...αn be integers. Let F be a ground literal. The answer to the query "Can it be concluded that F is true from the databases db1...dbn when assuming that their respective degrees of reliability are α1...αn?" is defined by:
answer((α1.db1 ∗ . . . ∗ αn.dbn), F) = YES iff PROLOG proves B((α1.db1 ∗ . . . ∗ αn.dbn), F) in META(db1...dbn)
answer((α1.db1 ∗ . . . ∗ αn.dbn), F) = NO iff PROLOG proves neg(F, F′) and B((α1.db1 ∗ . . . ∗ αn.dbn), F′) in META(db1...dbn)
answer((α1.db1 ∗ . . . ∗ αn.dbn), F) = ? otherwise
Notice that we are in an open world approach: primitive databases are not necessarily complete. This explains the last case, which corresponds to the situation where the two goals B((α1.db1 ∗ . . . ∗ αn.dbn), F) and B((α1.db1 ∗ . . . ∗ αn.dbn), F′) finitely fail in PROLOG.
Definition 5 (Answers for open queries). Let db1...dbn be n databases, α1...αn be integers and F(X) be an open literal. Consider the query "What are the X which satisfy F?" when it is addressed to the databases db1...dbn whose respective degrees of reliability are α1...αn. Its answer is defined by:
answer((α1.db1 ∗ ... ∗ αn.dbn), F(X)) = {A : tuple of constant symbols such that PROLOG proves B((α1.db1 ∗ ... ∗ αn.dbn), F(A))}
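A minimal Python sketch of these two answer forms follows; it repeats the counting helpers of the section 2 sketch so that it is self-contained, and the literal encoding is again an illustrative assumption rather than the chapter's PROLOG evaluator. Only explicitly stored facts are counted here; facts deducible through IDB rules would have to be derived beforehand.

```python
def negate(l):
    """Complementary literal, as in the section 2 sketch."""
    return l[1:] if l.startswith("-") else "-" + l

def believes(weighted_dbs, l):
    count = lambda x: sum(a for a, db in weighted_dbs if x in db)
    return count(l) > count(negate(l))

def answer_closed(weighted_dbs, fact):
    """Definition 4: YES / NO / '?' for a ground literal."""
    if believes(weighted_dbs, fact):
        return "YES"
    if believes(weighted_dbs, negate(fact)):
        return "NO"
    return "?"

def answer_open(weighted_dbs, candidates, instantiate):
    """Definition 5: tuples A such that the instantiated literal F(A) is believed."""
    return {a for a in candidates if believes(weighted_dbs, instantiate(a))}

# Closed query, reliability degrees 1 and 2:
dbs = [(1, {"terminal(1,OrlyW)"}), (2, {"-terminal(1,OrlyW)"})]
print(answer_closed(dbs, "terminal(1,OrlyW)"))                      # NO

# Open query: which x satisfy terminal(x, OrlyW)?
dbs2 = [(1, {"terminal(1,OrlyW)", "terminal(2,OrlyW)"}),
        (2, {"terminal(1,OrlyW)", "-terminal(2,OrlyW)"})]
print(answer_open(dbs2, {1, 2}, lambda x: f"terminal({x},OrlyW)"))  # {1}
```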
3.4 Properties
Following these definitions and according to proposition 4, we can ensure that the query evaluator generates consistent answers to any query addressed to several databases, even if they are contradictory.
Proposition 7. Let db1...dbn be n databases, each of them being equivalent to a set of literals. Let α1...αn be integers and F be a ground literal which represents a closed query. Then the three following cases are exclusive:
(1) answer((α1.db1 ∗ . . . ∗ αn.dbn), F) = YES
(2) answer((α1.db1 ∗ . . . ∗ αn.dbn), F) = NO
(3) answer((α1.db1 ∗ . . . ∗ αn.dbn), F) = ?
Proof. This is a consequence of proposition 4.
Proposition 8. Let db1...dbn be n databases, each of them being equivalent to a set of literals. Let α1...αn be integers and let F(X) be a literal which represents an open query. Let A be a tuple of constants. Then the two following cases are exclusive:
(1) A ∈ answer((α1.db1 ∗ . . . ∗ αn.dbn), F(X))
(2) A ∈ answer((α1.db1 ∗ . . . ∗ αn.dbn), ¬F(X))
Proof. This is a consequence of proposition 4.
These two propositions show that the query-evaluator defined in this work consistently answers queries, even if the databases are contradictory.
4 Illustrative examples
This section illustrates the query-evaluator defined previously on two examples. The two examples differ in that, in the first one, the databases share a common set of rules, while this is not the case in the second one.
4.1 The databases share a common set of rules
We consider three databases which store information about flights taking off from Paris Orly airport. The first database, Agency, is managed by a travel agency, the second, Orly, is managed by Paris Orly airport, and the third, AF, by the Air-France company. The information we focus on comprises flight identification numbers, flight destinations, departure times, departure terminals and companies. Let these databases be denoted
Agency = <EDBAgency, IDB>, Orly = <EDBOrly, IDB>, Air France = <EDBAF, IDB>
with:
EDBAgency = {flight(1, Toulouse, 0825, AF, national), flight(2, London, 1130, AF, international), flight(3, London, 1430, BAW, international), flight(4, NewYork, 1945, TWA, international), flight(5, Washington, 2200, TWA, international)}
EDBOrly = {terminal(1, OrlyW), terminal(2, OrlyW), terminal(3, OrlyW), terminal(4, OrlyS), terminal(5, OrlyW)}
EDBAF = {flight(1, Toulouse, 0825, AF, national), flight(2, London, 1130, AF, international), terminal(1, OrlyW), terminal(2, OrlyS)}
IDB = {∀x∀y∀z flight(x, y, z, AF, national) → terminal(x, OrlyW),
∀x∀y∀z flight(x, y, z, AF, international) → terminal(x, OrlyS),
∀x∀y∀z∀t flight(x, y, z, BAW, t) → terminal(x, OrlyS),
∀x∀y∀z∀t flight(x, y, z, TWA, t) → terminal(x, OrlyS),
∀x ¬terminal(x, OrlyS) ∨ ¬terminal(x, OrlyW)}
Thus, according to the Agency database, flight number 1 is a national Air-France flight to Toulouse and its departure time is 8.25; flight number 2 is an international Air-France flight to London and its departure time is 11.30; flight number 3 is an international British Airways flight to London and its departure time is 14.30; flight number 4 is a TWA international flight to New York and its departure time is 19.45; flight number 5 is a TWA international flight to Washington and its departure time is 22.00. According to the Orly database, the departure terminal of flights number 1, 2, 3 and 5 is Orly-West and the departure terminal of flight number 4 is Orly-South. According to the AF database, flight number 1 is a national Air-France flight to Toulouse, its departure time is 8.25 am and its departure terminal is Orly-West; flight number 2 is an international Air-France flight to London, its departure time is 11.30 am and its departure terminal is Orly-South. Finally, it is known that Orly-West is the departure terminal of national Air-France flights, while Orly-South is the departure terminal of international Air-France flights, British Airways flights and TWA flights. And obviously, Orly-South and Orly-West are two different terminals.
Here are some queries and the answers generated by the evaluator.
• When querying the three databases and assuming that they are equally reliable, is it true that flight number 2 starts at 11.30 am to London and is an international flight?
answer((1.Agency ∗ 1.Orly ∗ 1.AF), flight(2, London, 1130, AF, international)) = YES
Indeed, on the one hand, two databases (Agency and AF) support the
fact that flight number 2 starts at 11.30 am to London and is an international flight. On the other hand, according to the Orly database, flight number 2 starts from Orly-West; thus, with the rules, the Orly database supports the fact that flight number 2 is not an international flight. Finally, since the three databases are supposed to be equally reliable, this means that there are two supports (or two proofs) for the fact that flight number 2 starts at 11.30 am to London and is an international flight and only one support for the contrary. By majority, we conclude that flight number 2 starts at 11.30 am to London and is an international flight.
• When querying the three databases and assuming that their respective reliability degrees are 1, 3 and 1, is it true that flight number 2 starts at 11.30 am to London and is an international flight?
answer((1.Agency ∗ 3.Orly ∗ 1.AF), flight(2, London, 1130, AF, international)) = NO
Assuming that the reliability degrees of the three databases are 1, 3 and 1 still implies that there are two supports for the fact that flight number 2 starts at 11.30 am to London and is an international flight. But this also implies that there are three supports for the contrary. Thus, by majority, we conclude that it is not true that flight number 2 starts at 11.30 am to London and is an international flight.
• When querying the three databases and assuming that their respective reliability degrees are 1, 3 and 2, is it true that flight number 2 starts at 11.30 am to London and is an international flight?
answer((1.Agency ∗ 3.Orly ∗ 2.AF), flight(2, London, 1130, AF, international)) = ?
Here, assuming that the reliability degrees of the three databases are 1, 3 and 2 implies that there are three supports for the fact that flight number 2 starts at 11.30 am to London and is an international flight, but also that there are three supports for the contrary. Thus, we cannot decide whether the fact is true or not.
• When querying the three databases and assuming that they are equally reliable, what are the number, the destination, the departure time and the company of the international flights?
answer((1.Agency ∗ 1.Orly ∗ 1.AF), flight(x, y, z, t, international)) = {(2 LONDON 1130 AF), (4 NEWYORK 1945 TWA)}
The reasons why flight number 2 belongs to the answer were given previously (see the first query). As for flight number 4, it belongs to the answer because there is no opposition to it. Finally, flight number 3 (resp. 5) does not belong to the answer because, according to the Agency database, there is one support for the fact that it is an international flight,
and, according to the Orly database (and the rules), there is also one support for the contrary.
• When querying the three databases and assuming that their respective reliability degrees are 1, 3 and 1, what are the number, the destination, the departure time and the company of the international flights?
answer((1.Agency ∗ 3.Orly ∗ 1.AF), flight(x, y, z, t, international)) = {(4 NEWYORK 1945 TWA)}
Flight number 2 no longer belongs to the answer because we can conclude that it is not true that it is an international flight (see the second query).
• When querying the three databases and assuming that their respective reliability degrees are 1, 3 and 2, what are the number, the destination, the departure time and the company of the international flights?
answer((1.Agency ∗ 3.Orly ∗ 2.AF), flight(x, y, z, t, international)) = {(4 NEWYORK 1945 TWA)}
Flight number 2 does not belong to the answer because we cannot decide whether it is an international flight or not (see the third query).
• When querying the Agency database and the Orly database and assuming that their respective reliability degrees are 1 and 2, what are the numbers of the flights whose departure terminal is Orly-West?
answer((1.Agency ∗ 2.Orly), terminal(x, OrlyW)) = {1, 2, 3, 5}
Flight number 4 does not belong to the answer because, since the reliability degrees of Agency and Orly are respectively 1 and 2, there is one support for the fact that flight number 2 starts from Orly-West and two supports to the contrary.
• When querying the three databases and assuming that their respective reliability degrees are 1, 2 and 3, what are the numbers of the flights whose departure terminal is Orly-West?
answer((1.Agency ∗ 2.Orly ∗ 3.AF), terminal(x, OrlyW)) = {1, 3, 5}
Flight number 2 no longer belongs to the answer because now, even though there are still two supports for the fact that its departure terminal is Orly-West, there are four supports to the contrary.
4.2 The databases do not share a common set of rules
We now consider that the three databases are:
Agency = <EDBAgency, IDBAgency>, Orly = <EDBOrly, IDBOrly>, Air France = <EDBAF, IDBAF>
with:
IDBAgency = {∀x∀y∀z∀t flight(x, y, z, BAW, t) → terminal(x, OrlyS), ∀x∀y∀z∀t flight(x, y, z, TWA, t) → terminal(x, OrlyS), ∀x ¬terminal(x, OrlyS) ∨ ¬terminal(x, OrlyW)}
IDBOrly = {∀x∀y∀z flight(x, y, z, AF, national) → terminal(x, OrlyW), ∀x∀y∀z flight(x, y, z, AF, international) → terminal(x, OrlyS), ∀x ¬terminal(x, OrlyS) ∨ ¬terminal(x, OrlyW)}
IDBAF = {∀x ¬terminal(x, OrlyS) ∨ ¬terminal(x, OrlyW)}
Notice that now ∀x ¬terminal(x, OrlyS) ∨ ¬terminal(x, OrlyW) is the only rule the three databases have in common. The queries for which the answer has changed from one example to the other are the following:
• When querying the three databases and assuming that they are equally reliable, what are the number, the destination, the departure time and the company of the international flights?
answer((1.Agency ∗ 1.Orly ∗ 1.AF), flight(x, y, z, t, international)) = {(2 LONDON 1130 AF), (3 LONDON 1430 BAW), (4 NEWYORK 1945 TWA), (5 WASHINGTON 2200 TWA)}
Flights number 3 and number 5 now belong to the answer because, even if, as before, the Agency database supports the fact that they are international flights, the Orly database now does not support the contrary.
• When querying the three databases and assuming that their respective reliability degrees are 1, 3 and 1, what are the number, the destination, the departure time and the company of the international flights?
answer((1.Agency ∗ 3.Orly ∗ 1.AF), flight(x, y, z, t, international)) = {(3 LONDON 1430 BAW), (4 NEWYORK 1945 TWA), (5 WASHINGTON 2200 TWA)}
Flights number 3 and number 5 now belong to the answer (see the previous query).
• When querying the three databases and assuming that their respective reliability degrees are 1, 3 and 2, what are the number, the destination, the departure time and the company of the international flights?
answer((1.Agency ∗ 3.Orly ∗ 2.AF), flight(x, y, z, t, international)) = {(3 LONDON 1430 BAW), (4 NEWYORK 1945 TWA), (5 WASHINGTON 2200 TWA)}
Flights number 3 and number 5 now belong to the answer (see the previous query).
• When querying the three databases and assuming that their respective reliability degrees are 1, 2 and 3, what are the numbers of the flights whose departure terminal is Orly-West?
answer((1.Agency ∗ 2.Orly ∗ 3.AF), terminal(x, OrlyW)) = {1, 2, 3, 5}
Flight number 2 now belongs to the answer because Orly no longer provides support to the contrary.
5 Particular cases
In this section, we show that the query-evaluator defined in the present chapter is a generalization of two other query evaluators that we have defined in previous work.
5.1 Taking into account the relative reliability of the databases
In [5], we have defined a query-evaluator for answering queries addressed to several databases by taking into account their relative reliability. Let us denote it here META1(db1...dbn). Its axioms were defined so that PROLOG proves B(dbi1 > dbi2 > ... > dbin, l) iff we can conclude l when we consider that database dbi1 is strictly more reliable than database dbi2, itself strictly more reliable than dbi3, etc. Detailing META1(db1...dbn) is out of the scope of this chapter; however, let us recall that its main axioms were:
(A1) In(l, db) → B(db, l) for any primitive database db and any literal l which belongs to db.
(A2) B(O, l) → B(O > db, l) for any primitive database db and any order O on some primitive databases other than db.
(A3) In(l, db) ∧ neg(l, l′) ∧ ¬B(O, l′) → B(O > db, l)
Example. Consider db1 = {a, b}, db2 = {a, ¬c}, db3 = {¬a, c}. In META1(db1...dbn), PROLOG proves B(db1 > db2 > db3, a), B(db1 > db2 > db3, b) and B(db1 > db2 > db3, ¬c). This means that, if we consider that db1 is more reliable than db2, itself more reliable than db3, then we can conclude a, b and ¬c. In META1(db1...dbn), PROLOG can also prove, for instance, B(db3 > db2 > db1, ¬a), B(db3 > db2 > db1, b) and B(db3 > db2 > db1, c). This means that, if we consider that db3 is more reliable than db2, itself more reliable than db1, then we can conclude ¬a, b and c.
The following proposition ensures that reasoning with total orders between databases, as is done in META1(db1...dbn), can similarly be done in META(db1...dbn).
Proposition 9. Let db1...dbn be primitive databases and l be a literal. PROLOG proves B(dbi1 > dbi2 > ... > dbin, l) in META1(db1...dbn) iff PROLOG proves B(2^{n−1}.dbi1 ∗ 2^{n−2}.dbi2 ∗ ... ∗ 2^0.dbin, l) in META(db1...dbn).
Proof. First, notice that PROLOG proves B(2^{n−1}.dbi1 ∗ 2^{n−2}.dbi2 ∗ ... ∗ 2^0.dbin, l) in META(db1...dbn) iff, in [2^{n−1}.dbi1, 2^{n−2}.dbi2, ..., 2^0.dbin], the number of occurrences of l is strictly greater than the number of occurrences of its negation.
(If) Assume that PROLOG proves B(dbi1 > dbi2 > ... > dbin, l) in META1(db1...dbn). Then ∃j such that l ∈ dbij and, for all k < j, ¬l ∉ dbik. Thus, in [2^{n−1}.dbi1, 2^{n−2}.dbi2, ..., 2^0.dbin], the number of occurrences of l is at least 2^{n−j} and the number of occurrences of ¬l is at most 2^{n−j−1} + ... + 2^0. Since 2^{n−j} > 2^{n−j−1} + ... + 2^0, this implies that, in [2^{n−1}.dbi1, 2^{n−2}.dbi2, ..., 2^0.dbin], the number of occurrences of l is strictly greater than the number of occurrences of its negation. Thus, PROLOG proves B(2^{n−1}.dbi1 ∗ 2^{n−2}.dbi2 ∗ ... ∗ 2^0.dbin, l) in META(db1...dbn).
(Only if) Assume that PROLOG proves B(2^{n−1}.dbi1 ∗ 2^{n−2}.dbi2 ∗ ... ∗ 2^0.dbin, l) in META(db1...dbn). Then, in [2^{n−1}.dbi1, 2^{n−2}.dbi2, ..., 2^0.dbin], the number of occurrences of l is strictly greater than the number of occurrences of its negation. In the list dbi1, dbi2, ..., dbin, let dbij be the first database l belongs to. Then, for all k < j, ¬l ∉ dbik (if it were not the case, then in [2^{n−1}.dbi1, 2^{n−2}.dbi2, ..., 2^0.dbin] the number of occurrences of l would not be strictly greater than the number of occurrences of its negation). Thus, by (A1) and (A2), in META1(db1...dbn) PROLOG proves B(dbij, l). And, by axiom (A3), this implies that in META1(db1...dbn) PROLOG proves B(dbi1 > ... > dbij, l). Thus, by axiom (A2), this implies that in META1(db1...dbn) PROLOG proves B(dbi1 > ... > dbij > ... > dbin, l). (End of proof)
5.2 Taking into account the number of databases
In [7], we have defined a query-evaluator for answering queries addressed to several databases by applying a majority vote. Let us denote it here META2(db1...dbn). Its axioms were defined so that PROLOG proves B(db1 ∗ ... ∗ dbn, l) iff we can conclude l when making a majority vote between the databases db1...dbn. Detailing this evaluator is out of the scope of this
chapter. However, let us recall that its main axioms were:
(B1) Occur(db, l, i) ∧ Occur(db′, l, j) → Occur(db ∗ db′, l, k) where k = i + j
(B2) neg(l, l′) ∧ Occur(db, l, i) ∧ Occur(db, l′, j) ∧ (i > j) → B(db, l)
(B1) states that if in database db the number of occurrences of l is i and in database db′ the number of occurrences of l is j, then in database db ∗ db′ the number of occurrences of l is i + j. (B2) states that if in database db the number of occurrences of l is strictly greater than the number of occurrences of its negation, then we consider that l is true in db.
Example. Consider again db1 = {a, b}, db2 = {a, ¬c}, db3 = {¬a, c}. In META2(db1...dbn), PROLOG proves B(db1 ∗ db2 ∗ db3, a) and B(db1 ∗ db2 ∗ db3, b), and proves neither B(db1 ∗ db2 ∗ db3, c) nor B(db1 ∗ db2 ∗ db3, ¬c). This means that, from a majority vote between the three databases, we can derive a and b but neither c nor ¬c. The following proposition proves that META2(db1...dbn) is a particular case of META(db1...dbn).
Proposition 10. Let db1...dbn be primitive databases and l be a literal. In META2(db1...dbn), PROLOG proves B(db1 ∗ ... ∗ dbn, l) iff in META(db1...dbn) it proves B(α.db1 ∗ ... ∗ α.dbn, l), where α is any integer.
Proof. Assume that in META2(db1...dbn), PROLOG proves B(db1 ∗ ... ∗ dbn, l). This means (by axiom (B2)) that in [db1, ..., dbn] the number of occurrences of l is strictly greater than the number of occurrences of its negation. Thus, for any integer α, in [α.db1, ..., α.dbn] the number of occurrences of l is strictly greater than the number of occurrences of its negation. This means that in META(db1...dbn), PROLOG proves B(α.db1 ∗ ... ∗ α.dbn, l). (End of proof)
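Using the Python counting sketch introduced in section 2, both special cases can be reproduced directly (again as an illustration, not the chapter's PROLOG programs): weights 2^{n−1}, ..., 2^0 encode a total reliability order as in Proposition 9, and equal weights give the simple majority vote of META2 as in Proposition 10.

```python
def negate(l):
    """Complementary literal, as in the earlier sketches."""
    return l[1:] if l.startswith("-") else "-" + l

def believes(weighted_dbs, l):
    count = lambda x: sum(a for a, db in weighted_dbs if x in db)
    return count(l) > count(negate(l))

db1, db2, db3 = {"a", "b"}, {"a", "-c"}, {"-a", "c"}

# Total order db1 > db2 > db3 encoded by weights 4, 2, 1 (Proposition 9):
ordered = [(4, db1), (2, db2), (1, db3)]
assert believes(ordered, "a") and believes(ordered, "b") and believes(ordered, "-c")

# Plain majority vote encoded by equal weights (Proposition 10):
vote = [(1, db1), (1, db2), (1, db3)]
assert believes(vote, "a") and believes(vote, "b")
assert not believes(vote, "c") and not believes(vote, "-c")
```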
6 Concluding remarks
This chapter presented a query-evaluator which consistently answers queries addressed to several databases, in the case when:
• The different databases share a common description language.
• They may have different sets of constraints.
• Each database is equivalent to a set of ground literals.
The method underlying this query evaluator consists in taking into account the number of databases that support a fact and their reliability degrees as
well. This means that, when formulating a query, the user specifies which databases she wants to query and the reliability degrees she assumes they have.

It has been proved that, given a query addressed to several databases associated with some reliability degrees, the answer computed by the previous evaluator is the same as the one that would be computed by a classical evaluator if that query were addressed to the database obtained by merging the databases according to a weighted majority attitude. However, here, the merging is never computed and the database contents are unchanged after the querying. This implies, for instance, that the user may address a query to a group of databases with particular degrees of reliability and later on address a query to the same databases with different degrees of reliability, or even to different databases.

This query evaluator has been specified by a set of logical formulas. It has been implemented in PROLOG and the previous example has been run.

Two main extensions have been made to this query evaluator, but they are not detailed here due to space restrictions. First, following [7], the evaluator has been extended in order to answer more complex queries, i.e., queries which are closed or open formulas written in conjunctive normal form. Thus, for instance, coming back to the example of Section 4.1, the evaluator can answer queries like the following:

• When querying the three databases and assuming that their reliability degrees are respectively 3, 2 and 1, what are the flights which fly to New York or to Washington?
answer((3.Agency ∗ 2.Orly ∗ 1.AF), flight(x, NewYork, z, t, u) ∨ flight(x, Washington, z, t, u)) = {(4 NEWYORK 1945 TWA INTERNATIONAL), (5 WASHINGTON 2200 TWA INTERNATIONAL)}

• When querying the three databases and assuming that their reliability degrees are respectively 3, 2 and 1, what are the flights which fly to London and which are not AirFrance flights?
answer((3.Agency ∗ 2.Orly ∗ 1.AF), flight(x, London, z, t, u) ∧ ¬flight(x, y, z, AF, u)) = {(3 LONDON 1430 BAW INTERNATIONAL)}

Secondly, following [8], we have extended the evaluator in order to provide answers with explanations, by distinguishing: when the answer is YES (resp. NO) due to the weighted majority vote between the databases (such an answer is called "YES (resp. NO) by weighted majority"); when the answer is YES (resp. NO) because at least one database allows to deduce it and no other database contradicts it (it is called "YES (resp. NO) unchallenged"); when the answer is "?" because, according to the weighted majority vote, the weight of data which allow to conclude YES is identical to the weight of data which allow to conclude NO (it is called "? balanced inconsistency"); or finally when the
answer is "?" because there is no information at all for providing a positive or a negative answer (it is called "complete lack of information"). For instance, running the example of Section 4.1, the answers would now be:

• answer((1.Agency ∗ 1.Orly ∗ 1.AF), flight(2, London, 1130, AF, international)) = YES (Weighted majority)
• answer((1.Agency ∗ 3.Orly ∗ 1.AF), flight(2, London, 1130, AF, international)) = NO (Weighted majority)
• answer((1.Agency ∗ 3.Orly ∗ 2.AF), flight(2, London, 1130, AF, international)) = ? (Balanced inconsistency)
• answer((1.Agency ∗ 1.Orly ∗ 1.AF), flight(x, y, z, t, international)) = {(2 LONDON 1130 AF) (Weighted majority), (4 NEWYORK 1945 TWA) (Unchallenged)}
• answer((1.Agency ∗ 3.Orly ∗ 1.AF), flight(x, y, z, t, international)) = {(4 NEWYORK 1945 TWA) (Unchallenged)}
• answer((1.Agency ∗ 3.Orly ∗ 2.AF), flight(x, y, z, t, international)) = {(4 NEWYORK 1945 TWA) (Unchallenged)}
• answer((1.Agency ∗ 2.Orly), terminal(x, OrlyW)) = {1 (Unchallenged), 2 (Weighted majority), 3 (Weighted majority), 5 (Weighted majority)}
• answer((1.Agency ∗ 2.Orly ∗ 3.AF), terminal(x, OrlyW)) = {1 (Unchallenged), 3 (Weighted majority), 5 (Weighted majority)}
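To make the explanation categories concrete, the following Python sketch (an illustration under stated assumptions, not the chapter's PROLOG implementation; the function name and signature are hypothetical) classifies an answer from the weighted counts of occurrences supporting and contradicting a fact.

```python
def classify_answer(weight_for: int, weight_against: int) -> str:
    """Classify an answer from the weighted occurrence counts of a fact
    and of its negation, mirroring the four categories described above."""
    if weight_for == 0 and weight_against == 0:
        return "? (complete lack of information)"
    if weight_for > 0 and weight_against == 0:
        return "YES (unchallenged)"
    if weight_against > 0 and weight_for == 0:
        return "NO (unchallenged)"
    if weight_for == weight_against:
        return "? (balanced inconsistency)"
    return ("YES (weighted majority)" if weight_for > weight_against
            else "NO (weighted majority)")

# 1.Agency * 3.Orly * 2.AF for flight(2, London, 1130, AF, international);
# assuming Agency and AF support the fact while Orly contains its negation
# (one possible reading of the running example, not stated explicitly here),
# the weights balance:
print(classify_answer(weight_for=1 + 2, weight_against=3))  # ? (balanced inconsistency)
```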
Acknowledgments
This work has been supported by ONERA (grant 1042801F).
References
[1] C. Baral, S. Kraus, J. Minker, and V.S. Subrahmanian. Combining multiple knowledge bases. IEEE Trans. on Knowledge and Data Engineering, 3(2), 1991.
[2] C. Baral, S. Kraus, J. Minker, and V.S. Subrahmanian. Combining knowledge bases consisting of first order theories. Computational Intelligence, 8(1), 1992.
[3] C. Batini, M. Lenzerini, and S.B. Navathe. A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), 1986.
[4] L. Bertossi and J. Chomicki. Query answering in inconsistent databases. In Logics for Emerging Applications of Databases, pages 43–83, 2003.
[5] L. Cholvy. Reasoning with data provided by federated databases. Journal of Intelligent Information Systems, 10(1), 1998.
[6] L. Cholvy and R. Demolombe. Reasoning with information sources ordered by topics. In Proceedings of the 6th International Conference on Artificial Intelligence: Methodologies, Systems, Applications (AIMSA'94), pages 151–162, Sofia, September 1994. World Scientific.
[7] L. Cholvy and Ch. Garion. Answering queries addressed to several databases according to a majority merging approach. Journal of Intelligent Information Systems, 22(2), 2004.
[8] L. Cholvy and Ch. Garion. Querying several conflicting databases. Journal of Applied Non-Classical Logics, 14(3), 2004.
[9] S. Konieczny and R. Pino-Pérez. On the logic of merging. In Proc. of KR'98, 1998.
[10] D. Lembo, M. Lenzerini, and R. Rosati. Source inconsistency and incompleteness in data integration. In Proceedings of KRDB'02, 2002.
[11] J. Lin. Integration of weighted knowledge bases. Artificial Intelligence, 83:363–378, 1996.
[12] J. Lin and A.O. Mendelzon. Merging databases under constraints. International Journal of Cooperative Information Systems, 7(1), 1998.
[13] J. Lin and A.O. Mendelzon. Knowledge base merging by majority. In Dynamic Worlds: From the Frame Problem to Knowledge Management. Kluwer Academic Publ., 1999.
[14] A. Motro. Multiplex: A formal model for multidatabase and its implementation. In Proc. of NGITS-99, the Fourth Int. Workshop on Next Generation Information Technologies and Systems, Israel, July 1999. Springer-Verlag.
[15] F. Sadri. Reliability of answers to queries in relational databases. IEEE Transactions on Knowledge and Data Engineering, 3(2):245–252, 1991.
[16] S. Benferhat, D. Dubois, J. Lang, H. Prade, A. Saffiotti, and P. Smets. A general approach for inconsistency handling and merging information in prioritized knowledge bases. In Proc. of KR'98, Trento, 1998.
[17] V.S. Subrahmanian. Amalgamating knowledge bases. ACM Transactions on Database Systems, 19(2):291–331, 1994.
Database Integrations in Distributed Enterprise Information Systems: A Database Model with Imprecise Information and Query Processing
Li Yan
Northeastern University, Shenyang, Liaoning 110004, China
1 Introduction
To increase product competitiveness, today's manufacturing enterprises have to deliver their products at reduced cost and high quality in a short time. The change from a sellers' market to a buyers' market results in a steady decrease in product life cycle time and in demands for tailor-made and small-batch products. All these changes require that manufacturing enterprises respond quickly to market changes. Traditional production patterns and manufacturing technologies may find it difficult to satisfy the requirements of current product development. Many types of advanced manufacturing techniques, such as Computer Integrated Manufacturing, Agile Manufacturing, Concurrent Engineering, and Virtual Enterprise based on global manufacturing, have been proposed to meet these requirements. One of the foundational supporting strategies is computer-based information technology. Information systems have become the nerve center of current manufacturing systems.

It should be noted that, for various organizational and technological reasons, the multiple information systems used in information-based manufacturing enterprises are independently developed, locally administered, and different in logical or physical design [10, 24]. Therefore, a fundamental challenge in heterogeneous and distributed enterprise information management is the sharing of information for enterprise users across organizational boundaries [20]. Databases are an essential part of modern information systems. Therefore, some studies have concentrated on conceptual data modeling of distributed enterprise information systems [18, 25, 43]. But there is little research on the modeling and design of database systems for distributed enterprise information systems. This is especially true for information sharing in such database systems, although it is essential. As we know, integrating heterogeneous databases into one system is a major means by which information in distributed information sources can be shared. In
the field of databases, the necessity of being able to integrate databases from multiple organizations into a single database has been clearly pointed out by Silberschatz, Stonebraker and Ullman [38]. In the last two decades, heterogeneous database integration has been studied extensively, both from a formal and from a practical point of view [2, 14, 17, 21, 29, 37]. More recently, much of the research on integration has focused on integrating conceptual data models and/or object-oriented databases [5, 34] and on integrating data sources on the Web [3]. It should be noted, however, that relational databases are still the leading product among commercial database systems, and most enterprise information systems are developed on top of relational databases. In addition, XML documents for Web-based applications are generally stored, manipulated, and queried in relational databases [4, 39]. Therefore, we focus on multiple relational database integration in this chapter.

A core task in database integration is the identification and resolution of a number of conflicts and contradictions in existing component databases. As a result, the integrated databases may contain imprecise information even if the component databases are precise [1, 12, 27, 40]. It has been identified that information integration in multidatabases is one reason that databases contain imprecise information [42]. Traditional databases are not able to represent and manipulate imperfect values. So it has been pointed out that the next generation of database systems should address the problems of incompleteness and inconsistency arising from the integration of distributed heterogeneous databases [38]. A number of approaches to querying imprecise data have been investigated in central database environments [6, 28, 35, 33, 41] or database integration environments [12, 15, 19, 22, 31, 32, 40]. However, generally only one kind of imprecise information is considered in these studies. Furthermore, the methods for query processing are mainly based on qualitative measures. Quantitative computations for query processing have not been investigated further, although they can provide users with more information.

Engineering applications are data- and knowledge-intensive applications. Rich database schemas and data types in component databases may result in several types of imprecise information in integrated databases, including null values, set values, and value intervals, as shown in the next section. In this chapter, we introduce an extended relational database model that can be applied as the integrated database schema for a unified representation of these three types of imprecise data. Based on the extended relational databases with imprecise information, some relational operations for query processing are defined. In particular, we differentiate and develop two kinds of query strategies in this chapter: qualitative operations and quantitative operations.

The remainder of the chapter is organized as follows. An example scenario is presented in Section 2. We show that virtual enterprise manufacturing systems require information integration for their production activities and that the resolution of several kinds of conflicts results in different kinds of imprecise values
simultaneously. In Section 3, an extended relational database model is introduced as the integrated database schema to represent imprecise information. Based on the extended relational databases, in Section 4, some relational operations are defined according to qualitative and quantitative query strategies, respectively. Also, several properties of these relational operations are discussed in Section 4. Section 5 provides a brief overview of related work. Section 6 concludes this chapter.
2 An example scenario
Today's enterprise information systems are generally distributed and heterogeneous. Information sharing for enterprise users across organizational boundaries is essential. In this section, we present virtual enterprise information systems as an example scenario, since they typically have the features of today's enterprise information systems mentioned above.

2.1 Organization structures
A virtual enterprise consists of a number of units geographically dispersed but managed as one total unit, although the sub-units may be under separate management [36]. In a virtual enterprise, there is a master company, which may not have its own fabrication facilities and develops products by relying on other manufacturing companies, called partner companies or partners. Moreover, the organizational structure of a virtual enterprise is product-oriented, which implies that the relationship between master and partners is not permanent. In a virtual enterprise, product/configuration design and manufacturing form a procedure in which the master company selects or synthesizes the partner companies and monitors their production activities by deriving the information concerned.

Four types of information have been identified in a virtual enterprise [43]: product data, part manufacturing data, partner company data, and project progress monitoring data. Partner data can be divided into partner structure data and behavior and performance data. Partner structure data include the number and working status of manufacturing equipment, the workloads of a partner, etc. Partner behavior and performance data include the types of products producible, the types of manufacture, lead-time, prices, quality level, etc. Project progress monitoring data include the person who is responsible for the development of a particular part, the starting and ending dates for the development of a part, etc. Among these four types of data, the first is within the master company and used only by itself, and the latter three are within the partner companies. But the partner company data and project progress monitoring data are used by the master company as well. It is clear
that organization and production management of a virtual enterprise place an essential requirement on information integration: the master company shares the information from the corresponding partner companies during its engineering activities [20].

2.2 Database integration and conflict resolutions
The master company needs to access and derive data from multiple data sources in partner companies to organize production activities effectively. The data sources in partner companies, however, are independently developed, locally administered, and different in logical or physical design from the other subsystems. So the features of the information systems in virtual enterprises can be generalized as heterogeneous and distributed. Databases are the major data repositories of information systems, and relational databases, being the leading product of commercial database systems so far, are extensively employed in the development of information systems, including enterprise information systems. Therefore, viewed from the database perspective, integrating multiple information systems is essentially integrating multiple databases, which are heterogeneous, distributed and independent. Figure 1 shows the architecture of multiple databases in virtual enterprise information systems, in which the CDB and PDB residing in a partner company are the databases of partner company data and project progress data, respectively. The CDB and PDB residing in the master company are obtained through integrating the CDBs and PDBs residing in the partner companies, respectively.
Fig. 1: Architecture of Multiple Databases in Virtual Enterprise Information Systems (each partner company holds a CDB and a PDB, connected over an extranet to the CDB and PDB of the master company)

A core task in database integrations is to identify and resolve a large number of incompatibilities, i.e., semantic conflicts, which exist in different databases
for semantically related data. There is a huge literature on this topic, and the following semantic conflicts have been widely identified.
(a) Naming conflicts. This type of conflict has two aspects: semantically related data items may be named differently, and semantically unrelated data items may be named identically.
(b) Data representation conflicts. These occur when semantically related data items are represented in different data types.
(c) Data scaling conflicts. These occur when semantically related data items are represented in different databases using different units of measure.
(d) Missing data. This occurs when the schemas of the integrating databases have different attribute sets.

A basic approach to conflict resolution is mapping and conversion. That means that semantically related integrating data items are mapped and converted into an integrated data item (see Figure 2), which corresponds to an attribute in the integrated schema (called a virtual attribute) and a value on that attribute. There are three kinds of mappings for conflict resolution: one-to-one mapping, many-to-one mapping, and one-to-many mapping. Using one-to-one mapping, naming conflicts can be resolved by mapping semantically related data items (semantically unrelated data items) to a common virtual attribute (to different attributes). Also using one-to-one mapping, data representation conflicts can be resolved by mapping semantically related data items to a common virtual attribute. Data scaling conflicts, however, can be resolved by mapping semantically related data items to a common virtual attribute using either many-to-one mapping or one-to-many mapping. As for missing data conflicts, they can be resolved using the outerjoin operation [16].

2.3 Information imprecision
Conflict resolution in database integration generally results in information imprecision in the integrated databases. It has been identified that information integration in multidatabases is one reason that databases contain imprecise information [42]. Let us focus on one-to-many mapping. It is clear that the attribute values in the integrating data items are mapped into range values on the virtual attribute because of the one-to-many mapping. Depending on whether these attributes are discrete or continuous, the range values may be set values or value intervals. In addition, when the outerjoin operation is used for missing data conflicts, the missing attribute values become null values on the virtual attribute in the result databases.

Let us look at an example. There are two databases r and s shown in Table 1, which belong to two partner companies at different sites. Both have the same key attribute Part ID. To provide the master company with both sites' information about price, lead-time, and transport of the products, these two
databases should be integrated. It is clear that the attribute values on Price of r and the attribute values on Cost of s are semantically related, but they are named differently and represented in different data types. The attribute values on Lead-time of r and on Lead-time of s, as well as the attribute values on Transport of r and on Transport of s, are represented using different units of measure. In addition, attribute Material of r is a missing attribute with respect to s.
Fig. 2: Strategy of Database Schema Integration (attribute A of the partner company component database r at Site 1 and attribute B of the partner company component database s at Site 2 are mapped, via one-to-one, many-to-one, or one-to-many mappings, into the virtual attribute K of the master company integrated database u at Site 3)

Table 1: Databases of partner companies

Site 1: Partner Database r
Part ID   Price (Dollars)   Lead-time (Weeks)   Material   Transport
10010     50                2                   Fe         Land
10011     55                3                   Cu         Air
10013     67                3                   Alloy      Land

Site 2: Partner Database s
Part ID   Cost (Dollars)   Lead-time (Days)   Transport
10010     50.30            12                 Train
10012     58.80            18                 Vessel
10013     67.47            21                 Truck
Suppose we have two facilities for land transport (train and truck) and one facility for air transport (plane). In addition, lead-time in weeks is considered to correspond to value intervals; two weeks, for example, mean more than 7 days but less than or equal to 14 days. Following the integration strategies and approaches above, we have the integrated relation shown in Table 2, where the notations [], {}, and ϕ stand for value intervals, set values, and null values, respectively.

Table 2: Relational database with imprecise values
Part ID   Price (Dollars)   Lead-time (Days)   Material   Transport
10010     [50, 50]          [8, 14]            {Fe}       {Train, Truck}
10011     [55, 55]          [15, 21]           {Cu}       {Plane}
10012     [59, 59]          [18, 18]           ϕ          {Vessel}
10013     [67, 67]          [15, 21]           {Alloy}    {Train, Truck}
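To make the conversions behind Table 2 concrete, here is a small Python sketch (a hypothetical illustration, not code from the chapter; all function names are assumptions) of one-to-many conversion functions that turn a lead-time in weeks into a day interval and a coarse transport category into the set of concrete facilities, which is how the value intervals and set values above can arise.

```python
from typing import Set, Tuple

def weeks_to_day_interval(weeks: int) -> Tuple[int, int]:
    """A lead-time of w weeks is read as 'more than 7*(w-1) but at most 7*w
    days', i.e. the interval [7*(w-1)+1, 7*w] in whole days."""
    return (7 * (weeks - 1) + 1, 7 * weeks)

# Coarse transport categories of database r map to sets of concrete facilities.
TRANSPORT_FACILITIES = {"Land": {"Train", "Truck"}, "Air": {"Plane"}}

def transport_to_set(category: str) -> Set[str]:
    return TRANSPORT_FACILITIES[category]

print(weeks_to_day_interval(2))  # (8, 14) -> the value interval [8, 14] of Table 2
print(transport_to_set("Land"))  # {'Train', 'Truck'}
```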
It has been shown that several types of imprecise information, such as null values, set values, and value intervals, may occur in integrated relational databases because of conflict resolution. One of the goals of integrating multiple databases is to obtain more information from the integrated databases. Traditional databases are not able to represent and manipulate these imperfect values. So, in the following, we present an extended relational database model that can represent these types of imprecise values. We also develop qualitative and quantitative operations for query processing over the integrated databases with imprecise information.
3 An extended relational database model
In classical relational databases, a relational schema R is a finite set {A1, A2, ..., An}, denoted by R = {A1, A2, ..., An} or R (A1, A2, ..., An), where Ai (1 ≤ i ≤ n) is called an attribute. Each attribute Ai is associated with a domain of values, denoted by Dom (Ai) or Di, which is the set of possible values that can be taken by attribute values and may be discrete or continuous. A relation instance r on schema R, denoted by r (R), is a set of tuples; a tuple t ∈ r consists of attribute values t [Ai] (1 ≤ i ≤ n), where t [Ai] ∈ Dom (Ai). In classical relational databases, all attribute values of tuples must be atomic.

An imprecise value in databases corresponds to a range of values, in which the true value is restricted to a specific set of possible values. Viewed from the format of imprecise values, an imprecise value may be a set of possible values
(for a discrete universe of discourse) or a value interval (for a continuous universe of discourse). When the value range of an imprecise value is the same as the corresponding attribute domain, it is generally called a null value. Formally, imprecise values are defined as follows.

Definition 3.1. An imprecise value on an attribute A corresponds to a range of possible values, in which exactly one value in the range should be the true value. Formally, it is denoted by {a1, a2, …, am} for a discrete attribute domain Dom (A), called a set value, or by [a1, an] for a continuous attribute domain Dom (A), called a value interval. Here, {a1, a2, …, am} (or [a1, an]) ⊆ Dom (A). When an imprecise value corresponds to the whole attribute domain, it is essentially a null value, denoted by ϕ.

Note that crisp attribute values can be viewed as special cases of imprecise values. A crisp value, for example p, on a discrete universe of discourse (or a continuous universe of discourse) can be represented as {p} (or [p, p]). Moreover, the imprecise value not containing any element is called an empty imprecise value, denoted by ⊥. In fact, the symbol ⊥ means an inapplicable missing data item [11].

Definition 3.2. For a set value η = {a1, a2, …, am}, the number of possible values in η is called the modular of η, whereas the interval length of a value interval η = [a1, an] is called the modular of η. The modular of an imprecise value η is denoted by ||η||.

Considering crisp values and empty imprecise values, we define the modular of an imprecise value as follows.

Definition 3.3. Let η be an imprecise value on the universe of discourse U. Then
(a) ||η|| = 0 if η = ⊥, i.e., η is an empty imprecise value.
(b) ||η|| = 1 if U is a discrete universe of discourse and η = {p}.
(c) ||η|| = δ if U is a continuous universe of discourse and η = [p, p]. Here, δ is a very small number compared with ||U||; generally, δ can be viewed as being close to 0.
(d) ||η|| = b − a if η = [a, b] and b ≠ a.
(e) ||η|| is equal to the number of possible values in η if η is a set value on U.

It is clear that a relational database model that is able to accommodate imprecise values, including set values, value intervals, and null values, must be an extended one. Let R (A1, A2, ..., An) be a relational schema and D1, D2, ..., Dn the corresponding attribute domains. Let 2^Di denote the power set of Di. Then, instead of Di, 2^Di will be the new domain of attribute Ai in the extended relational
database model, called the extended attribute domain and denoted by Dom* (Ai) to differentiate it from the traditional attribute domain. Based on the extended relational database model above, a relation instance r (R) is defined as a subset of the Cartesian product Dom* (A1) × Dom* (A2) × ... × Dom* (An). A tuple t ∈ r consists of attribute values t [Ai] (1 ≤ i ≤ n), where t [Ai] may be a crisp value, a set value/value interval, ϕ, or ⊥. In any case, t [Ai] ∈ Dom* (Ai). It should be noted, however, that since the primary key plays the role of an identifier of the tuples in a relation, imprecise values are not permitted as attribute values on the primary key. An example relation with imprecise values based on the extended relational database model is shown in Table 2.
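As an illustration of Definitions 3.1 to 3.3, the following Python sketch (hypothetical code, not part of the chapter; class and parameter names are assumptions) represents set values, value intervals, and null/empty values and computes the modular ||η||.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

@dataclass
class Imprecise:
    """An imprecise value: either a set of candidate values (discrete domain)
    or an interval [low, high] (continuous domain). If both fields are None,
    the value is read as the null value ϕ; an empty candidate set models ⊥."""
    candidates: Optional[FrozenSet] = None
    interval: Optional[Tuple[float, float]] = None

    def modular(self, domain_size: float, delta: float = 1e-9) -> float:
        if self.candidates is not None:
            # Definition 3.3 (a), (b), (e): discrete case, including ⊥
            return float(len(self.candidates))
        if self.interval is not None:
            low, high = self.interval
            # Definition 3.3 (c), (d): continuous case
            return (high - low) if high != low else delta
        # Null value ϕ: its range is the whole attribute domain
        return domain_size

# Example values taken from Table 2
lead_time = Imprecise(interval=(8, 14))
transport = Imprecise(candidates=frozenset({"Train", "Truck"}))
print(lead_time.modular(domain_size=60))  # 6.0
print(transport.modular(domain_size=3))   # 2.0
```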
4 Relational operations
4.1 Semantic relationships between imprecise values
It is crucial to measure the semantic relationships between imprecise values in order to manipulate such values in the extended relational databases. We introduce two types of semantic measures, namely a qualitative one and a quantitative one.

Qualitative measures
Definition 4.1. Let t and s be two tuples in relation r (R) and A ∈ R. Then
(a) t [A] and s [A] are equivalent to each other, denoted by t [A] ≡ s [A], if t [A] = s [A], i.e., they are two identical crisp values,
(b) t [A] and s [A] are inclusive, denoted by t [A] ≈ s [A], if t [A] ⊆ s [A] or t [A] ⊇ s [A], and
(c) t [A] and s [A] are interrelated, denoted by t [A] ~ s [A], if t [A] ∩ s [A] ≠ Φ, where Φ denotes the empty set.

The above-mentioned semantic relationships between two attribute values are called equivalence, inclusion, and interrelation, respectively. It is clear that, for any attribute, a null value and any crisp value, set value, or value interval on that attribute are inclusive. In addition, it can be seen that the equivalence relationship is reflexive, symmetric, and transitive, the inclusion relationship is reflexive and transitive, and the interrelation relationship is reflexive and symmetric. For two imprecise data items, say η1 and η2, η1 ≡ η2 ⇒ η1 ≈ η2, and η1 ≈ η2 ⇒ η1 ~ η2, where ⇒ indicates implication. The direction of these inferences is irreversible.
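A small Python sketch of Definition 4.1 follows (hypothetical code; imprecise values are modeled simply as Python sets of possible values, with intervals discretized to whole days, an assumption made only for illustration):

```python
def equivalent(x: set, y: set) -> bool:
    # Definition 4.1(a): two identical crisp values
    return len(x) == 1 and x == y

def inclusive(x: set, y: set) -> bool:
    # Definition 4.1(b): one value range contains the other
    return x <= y or y <= x

def interrelated(x: set, y: set) -> bool:
    # Definition 4.1(c): the value ranges overlap
    return len(x & y) > 0

# Lead-times from Table 2, discretized to whole days
eta1 = set(range(8, 15))    # [8, 14]
eta2 = set(range(15, 22))   # [15, 21]
eta3 = set(range(8, 22))    # [8, 21]
print(inclusive(eta1, eta3), interrelated(eta1, eta2))  # True False
```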
Definition 4.2. Let t1 and t2 be two tuples in relation r (R), where R = {A1, A2, ..., An}. If t1 (Ai) ≡ t2 (Ai) (1 ≤ i ≤ n), t1 and t2 are equivalent to each other, denoted by t1 ≡ t2. If t1 (Ai) ≈ t2 (Ai) (1 ≤ i ≤ n), t1 and t2 are inclusive, denoted by t1 ≈ t2. If t1 (Ai) ~ t2 (Ai) (1 ≤ i ≤ n), t1 and t2 are interrelated to each other, denoted by t1 ~ t2.

Based on Definition 4.2, two kinds of data redundancies can be identified in the extended relational databases with imprecise values. The first is due to tuples that are equivalent to each other, and the second is due to tuples that are inclusive or interrelated to each other. For the former, we simply remove duplicate tuples. For the latter, all tuples that are redundant with one another should be combined into one tuple whose attribute values are the unions of all corresponding attribute values of the combined redundant tuples. We call these two strategies for eliminating redundancies definite and maybe, respectively. An operation Reduce (r, type) can then be defined to eliminate data redundancies, in which r is a relation and type can be definite or maybe.

Quantitative measures
The notions of equivalence, inclusion, and interrelation are actually used to differentiate degrees of information "similarity". Let η1 = [11, 13], η2 = [10, 15], and η3 = [12, 18] be three value intervals. According to Definition 4.1, we have η1 ≈ η2, η1 ~ η3, and η2 ~ η3. Intuitively, η1 is more similar to η2 than to η3. But what are their degrees of similarity? In particular, both η1 and η2 are similar to η3, but which one is more similar to η3? The notions given above cannot answer these questions because they only qualitatively describe semantic relationships between data items. Therefore, a new notion, the equivalence degree, is introduced to quantitatively measure the semantic relationship between imprecise values; it stems from the idea of the nearness measure and semantic proximity in fuzzy relational databases [7, 26, 30].

Definition 4.3. Let η1, η2 be two imprecise values on attribute A. The equivalence degree of η1 and η2, denoted by ED (η1, η2), is defined as follows, where Dom (A) and ||η|| denote the traditional attribute domain and the modular of imprecise value η, respectively.

ED (η1, η2) = ||η1 ∩ η2||/||η1 ∪ η2|| − ||η1 ∩ η2||/||Dom (A)||

Example 4.1. Let Dom (A) = {a1, a2, …, a20}, so ||Dom (A)|| = 20. Then
(a) Suppose η1 = {a11} and η2 = {a11}. Then ED (η1, η2) = 1/1 − 1/20 = 19/20.
(b) Suppose η1 = {a12, a13, a14, a15} and η2 = {a16, a17}. Then ED (η1, η2) = 0/6 − 0/20 = 0.
(c) Suppose η1 = {a14, a15, a16}, η2 = {a14, a15, a16}, η3 = {a14, a15, a16, a17, a18}, and η4 = {a14, a15, a16, a17, a18}. Then ED (η1, η2) = 3/3 − 3/20 = 0.85 and ED (η3, η4) = 5/5 − 5/20 = 0.75. We have ED (η1, η2) > ED (η3, η4).
(d) Suppose η1 = {a11, a12, a13}, η2 = {a10, a11, a12, a13, a14, a15}, and η3 = {a12, a13, a14, a15, a16, a17, a18}. Then ED (η1, η2) = 3/6 − 3/20 = 0.35 and ED (η1, η3) = 2/8 − 2/20 = 0.15. We have ED (η1, η2) > ED (η1, η3): η1 is more similar to η2 than to η3.

Intuitively, the more two imprecise values overlap, the more similar they are. Besides, the semantic equivalence degree of two imprecise values is also related to their modulars; case (c) in the example above illustrates this well.

The notion of semantic equivalence degree of attribute values can be extended to tuples. Let ti and tj be two tuples in r (R), where R = {A1, A2, …, An}. The semantic equivalence degree of tuples ti and tj is defined as ED (ti, tj) = min (ED (ti [A1], tj [A1]), ED (ti [A2], tj [A2]), …, ED (ti [An], tj [An])). We can then define the notion of duplicate degree of tuples to evaluate and eliminate duplicate tuples quantitatively.

Definition 4.4. Let r (R) be a relation over R = (A1, A2, …, An) and let ti and tj be two tuples in r. Sr (ti, tj) = max_{t ∈ r} (min (ED (ti, t), ED (tj, t))) is called the duplicate degree of ti and tj, where t may be ti. If a threshold λ (0 ≤ λ ≤ 1) is given and Sr (ti, tj) ≥ λ, we say that ti and tj are duplicate tuples.

Based on the notion of duplicate degree of tuples, we can define an operator Reduce_λ (r) that eliminates those duplicate tuples in r whose duplicate degrees are greater than or equal to λ.

4.2 Relational operations for schema integration
In multidatabase integration, there are two crucial operations. The first is an operation that can resolve semantic conflicts such as naming conflicts, data representation conflicts, and data scaling conflicts; the mapping operation is introduced for this purpose. On this basis, the integration operation can then be defined.

Mapping
Let r (K, A, X) be a source relation, where K is the primary key, A is an attribute, and X is a set of attributes. Then the mapping of r (K, A, X) into s (K, V, X), denoted ρA→V (r), is defined as follows, where V is the virtual attribute corresponding to attribute A.
ρA→V (r) = {t | (∃ u) (u ∈ r ∧ t [K] = u [K] ∧ t [V] = f (u [A]) ∧ t [X] = u [X])}

Here, f () is a conversion function. For the one-to-one and many-to-one mappings, an atomic value u [A] is converted into another atomic value t [V] uniquely. For the one-to-many mapping, however, an atomic value u [A] is converted into a range value t [V] (a value interval or a set value).

Integration
Let r (K, A, X) and s (K, A, Y) be two source relations, where K is the primary key. Then the integration of r and s, denoted r ⊕ s, is defined as follows.

r ⊕ s = {t | (∃ u) (∀ v) (u ∈ r ∧ v ∈ s ∧ u [K] ≠ v [K] ∧ t [K] = u [K] ∧ t [A] = u [A] ∧ t [X] = u [X] ∧ t [Y] = ϕ)
∨ (∃ v) (∀ u) (v ∈ s ∧ u ∈ r ∧ v [K] ≠ u [K] ∧ t [K] = v [K] ∧ t [A] = v [A] ∧ t [Y] = v [Y] ∧ t [X] = ϕ)
∨ (∃ u) (∃ v) (u ∈ r ∧ v ∈ s ∧ t [K] = u [K] = v [K] ∧ t [X] = u [X] ∧ t [Y] = v [Y] ∧ (∀ Q) (Q ∈ A ∧ t [Q] = u [Q] ∪ v [Q]))}

From the definition above, it can be seen that the result relation r ⊕ s consists of three kinds of tuples. The first kind comes directly from r; for such a tuple u in r, there is no tuple in s that has the same key value(s) as u. The second kind comes directly from s; again, for such a tuple v in s, there is no tuple in r that has the same key value(s) as v. The third kind of tuple in r ⊕ s is formed by merging tuples from r and s that have the same key value(s). Since there may be schema inconsistency between r and s, namely the missing data conflict, the result relation of integrating r and s may contain null values.
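The following Python sketch (a hypothetical illustration of the ⊕ operator as described above, not code from the chapter; names and the dictionary-based representation are assumptions) merges two keyed relations: tuples whose key appears in only one relation are padded with nulls on the other relation's attributes, and tuples with matching keys have their shared attribute values combined.

```python
NULL = None  # stands for the null value ϕ

def integrate(r, s, shared, x_attrs, y_attrs):
    """r and s map a key to a dict of attribute values; shared attribute
    values are assumed to be sets so they can be unioned; x_attrs/y_attrs
    are the attributes found only in r and only in s, respectively."""
    result = {}
    for k in set(r) | set(s):
        t = {}
        if k in r and k in s:
            # same key in both relations: union the shared attribute values
            for a in shared:
                t[a] = set(r[k][a]) | set(s[k][a])
            t.update({a: r[k][a] for a in x_attrs})
            t.update({a: s[k][a] for a in y_attrs})
        elif k in r:
            t.update({a: r[k][a] for a in shared + x_attrs})
            t.update({a: NULL for a in y_attrs})   # missing data -> null
        else:
            t.update({a: s[k][a] for a in shared + y_attrs})
            t.update({a: NULL for a in x_attrs})   # missing data -> null
        result[k] = t
    return result

# Toy fragment of the Table 1 example, after mapping both sources
r = {"10011": {"Transport": {"Plane"}, "Material": "Cu"}}
s = {"10012": {"Transport": {"Vessel"}}}
print(integrate(r, s, shared=["Transport"], x_attrs=["Material"], y_attrs=[]))
# part 10012 gets Material = None, i.e. the null value ϕ
```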
4.3 Relational operations for query processing
Qualitative operations
Traditional two-valued logic (true and false) is not able to represent the results of conditional expressions over imprecise information. We adopt three-valued logic (3VL), which was first proposed in [11] to handle null values in relational databases. The 3VL consists of three logical values: true, false, and maybe. Based on the 3VL and the qualitative semantic relationships between imprecise values, several relational operations are defined as follows.
Union. Let r (R) and s (R) be two relations with imprecise values on the same schema. We have two kinds of strategies for the union operation: definite and maybe. Considering the possible data redundancies, we have

r ∪_type s = Reduce (r ∪ s, type),

where ∪ is the common union operator of the classical setting, and type represents the operation strategy, which may be definite or maybe.

Projection. Let r (R) be a relation with imprecise values. Again we have two kinds of strategies for the projection operation: definite and maybe. Let X ⊆ R. Then we have

Π^type_X (r) = Reduce (Π_X (r), type),

where Π_X (r) is the common projection operator of the classical setting, and type represents the operation strategy, which may be definite or maybe.

Difference. The difference between two relations r (R) and s (R) is realized by removing from r the tuples that are the same as those in s. Considering the qualitative semantic relationships between tuples with imprecise information, we have

r −_definite s = {t | (∃ u) (∀ v) (u ∈ r ∧ v ∈ s ∧ (t = u) ∧ ((u = v) and (u ≈ v) are both false))}

and

r −_maybe s = {t | (∃ u) (∃ v) (u ∈ r ∧ v ∈ s ∧ ((∀ X) (X ∈ R ∧ t [X] = u [X] − v [X])) ∧ ((u ~ v) ∨ (v ≈ u)))}.

Natural join. Natural join is a particular kind of join operation. It requires that the two joining relations have a common attribute subset, and only
two tuples that belong respectively to these two relations and whose values are the same on the common attribute subset can be joined together. Let r (R) and s (S) be two relations and X = R ∩ S. Then we have

r ⋈_definite s = {t | (∃ u) (∃ v) (u ∈ r ∧ v ∈ s ∧ (t [R] = u [R]) ∧ (t [S − X] = v [S − X]) ∧ (u [X] = v [X] is true))}

and

r ⋈_maybe s = {t | (∃ u) (∃ v) (u ∈ r ∧ v ∈ s ∧ (t [R − X] = u [R − X]) ∧ (t [S − X] = v [S − X]) ∧ (t [X] = u [X] ∩ v [X]) ∧ (u [X] = v [X] is maybe))}.

Selection. Let r be a relation with imprecise values and let C be a condition expression. Then the selection with clause C from r is defined as follows.

σC (r, definite) = {t | t ∈ r and (C is true for t)}
σC (r, maybe) = {t | t ∈ r and (C is maybe for t)}

Example 4.2. Consider the relation with imprecise values shown in Table 2 and the query: SELECT Part ID, Price, Lead-time, Transport FROM Table 2 WHERE Lead-time ≤ 20. That is, we would like to find all tuples (parts) from Table 2 whose lead-time is at most 20 days. Here we have two kinds of query strategies, definite and maybe, which correspond to the strategies defined in the relational operations above. The query answers are shown in Table 3 and Table 4, respectively. When the query contains the two conditions (a) Lead-time < 20 and (b) Transport = "Truck", there is no definite answer for the query; the maybe answer is shown in Table 5. It is clear that there are two kinds of strategies for qualitatively querying databases with imprecise values. In order to obtain the different query answers, SQL should be extended.
Table 3: Definite query answer under single condition
Part ID   Price (Dollars)   Lead-time (Days)   Transport
10010     [50, 50]          [8, 14]            {Train, Truck}
10012     [59, 59]          [18, 18]           {Vessel}

Table 4: Maybe query answer under single condition
Part ID   Price (Dollars)   Lead-time (Days)   Transport
10011     [55, 55]          [15, 21]           {Plane}
10013     [67, 67]          [15, 21]           {Train, Truck}

Table 5: Maybe query answer under complex condition
Part ID   Price       Lead-time   Transport
10013     [67, 67]    [15, 21]    {Train, Truck}
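For illustration, here is a small Python sketch (hypothetical code, not part of the chapter) of how the definite and maybe answers in Tables 3 and 4 can be computed for an interval-valued condition such as Lead-time ≤ 20: the condition is definitely true when every possible value satisfies it and maybe true when at least one possible value does.

```python
def leq_condition(interval, bound):
    """Three-valued evaluation of 'value <= bound' for a value interval."""
    low, high = interval
    if high <= bound:
        return "true"     # all possible values satisfy the condition
    if low <= bound:
        return "maybe"    # some, but not all, possible values satisfy it
    return "false"

lead_times = {"10010": (8, 14), "10011": (15, 21), "10012": (18, 18), "10013": (15, 21)}

definite = [p for p, lt in lead_times.items() if leq_condition(lt, 20) == "true"]
maybe = [p for p, lt in lead_times.items() if leq_condition(lt, 20) == "maybe"]
print(definite)  # ['10010', '10012'], as in Table 3
print(maybe)     # ['10011', '10013'], as in Table 4
```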
The following syntax is used for the select statement:

SELECT <attribute list> FROM <relation list> WHERE <selection condition> WITH <type>

Here, a new statement, WITH <type>, is attached to SQL, in which type may be definite or maybe.

Quantitative operations
Based on the quantitative semantic relationships between imprecise values and a given threshold, several relational operations are defined as follows.

Union. Let r (R) and s (R) be two relations with imprecise values on the same schema. Let λ (0 < λ ≤ 1) be the given threshold. Considering the possible data redundancies under the given threshold, we have

r ∪_λ s = Reduce_λ ({t | t ∈ r ∨ t ∈ s})

It is clear that the union here is a λ-union.

Projection. Let r (R) be a relation with imprecise values and X ⊆ R. Let λ (0 < λ ≤ 1) be the given threshold. Then the projection is a λ-projection and is defined as follows:
Π^λ_X (r) = Reduce_λ ({t | (∃ u) (u ∈ r ∧ t = u [X])})

Difference. Let r (R) and s (R) be two relations with imprecise values on the same schema. Let λ (0 < λ ≤ 1) be the given threshold. Then the λ-difference is defined as follows.

r −_λ s = {t | t ∈ r ∧ (∀ v) (v ∈ s ∧ S_s (t [X], v [X]) < 1 − λ)}

Selection. Let r (R) be a relation with imprecise values and λ (0 < λ ≤ 1) be the given threshold. Let X = Y be a condition expression, where X is an attribute and Y is a constant or another attribute. Then the λ-selection is defined as follows.
σ^λ_{X=Y} (r) = {t | t ∈ r ∧ ED (t [X], Y) > λ}
Natural join. The natural join of relations with imprecise values can be defined using Cartesian product and selection. Let r (R) and s (S) be two relations and X = R ∩ S. Let λ (0 < λ ≤ 1) be the given threshold. Then

r ⋈_λ s = σ^λ_{r[X]=s[X]} (r × s)
Example 4.3. In order to obtain query answers with different possibilities, SQL should be extended. A new statement, namely WITH POSSIBILITY <threshold>, is attached to SQL, where 0 < threshold ≤ 1. Note that the WITH POSSIBILITY statement can be omitted; the default threshold is exactly 1. One may then pose the following query against the relation shown in Table 2:

SELECT * FROM Table 2 WHERE Lead-time = [9, 16] WITH POSSIBILITY 0.5

Note that this query is a flexible one. Let Dom (Lead-time) = [0, 60]. Following Definition 4.3, we have
ED ([8, 14], [9, 16]) = 5/8 − 5/60 = 0.542 > 0.5
ED ([15, 21], [9, 16]) = 1/12 − 1/60 = 0.067
ED ([18, 18], [9, 16]) = 0.0
ED ([15, 21], [9, 16]) = 1/12 − 1/60 = 0.067
Therefore, we have the query answer shown in Table 6.
Table 6: Quantitative query result
Part ID   Price (Dollars)   Lead-time (Days)   Material   Transport
10010     [50, 50]          [8, 14]            {Fe}       {Train, Truck}
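The following Python sketch (hypothetical code, not from the chapter) implements the equivalence degree ED of Definition 4.3 for value intervals and uses it for the λ-selection of Example 4.3; interval lengths are taken as high − low, as in the example.

```python
def interval_length(iv):
    low, high = iv
    return high - low

def intersect(iv1, iv2):
    low, high = max(iv1[0], iv2[0]), min(iv1[1], iv2[1])
    return (low, high) if low <= high else None

def ed(iv1, iv2, domain_length):
    """Equivalence degree ED(eta1, eta2) = ||∩||/||∪|| - ||∩||/||Dom||."""
    inter = intersect(iv1, iv2)
    if inter is None:
        return 0.0
    inter_len = interval_length(inter)
    union_len = interval_length((min(iv1[0], iv2[0]), max(iv1[1], iv2[1])))
    return inter_len / union_len - inter_len / domain_length

lead_times = {"10010": (8, 14), "10011": (15, 21), "10012": (18, 18), "10013": (15, 21)}
query, dom_len, lam = (9, 16), 60, 0.5
answer = [p for p, lt in lead_times.items() if ed(lt, query, dom_len) > lam]
print(answer)  # ['10010'], matching Table 6
```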
Properties of relational operations
In this section, we focus on several properties of the relational operations defined above. As in classical relational databases, the proposed relational operations are sound; in other words, they are closed. This means that the results of all operations are valid relations. In detail, the result relations produced by the relational operations satisfy the following three criteria: (a) the attribute values must come from an appropriate attribute domain, (b) there are no duplicate tuples in a relation, and (c) the relation must be a finite set of tuples. In the following, we investigate how the relational operations defined above satisfy these three criteria; here, the relational operations refer to both the qualitative and the quantitative ones.

First, let us look at how the first criterion is satisfied. Projection and selection each take out a part of the source relation, in the column direction or the row direction, respectively. Because the
attribute values in the source relation must belong to the appropriate attribute domain, the attribute values in these two result relations must also come from the appropriate attribute domain. Union and difference are conducted under the union-compatibility condition, which satisfies the first criterion. In natural join and Cartesian product, the attribute values in the result relations come from the two source relations, respectively, and they must be within the appropriate attribute domains.

For selection, if there are no redundant tuples in the source relation, there are no redundant tuples in the result relation. There are no redundant tuples in the result relations of union, difference, natural join, Cartesian product, and projection either; this is ensured by the definitions of those operations, because the removal of redundancies has been taken into account. Therefore, the second criterion is satisfied.

Now let us look at how the third criterion is satisfied. Let r and s be two relations with imprecise values, and let |r| and |s| denote the numbers of tuples in r and s, respectively. Let σP (r) denote the selection operation. It is clear that 0 ≤ |σP (r)| ≤ |r| for selection. This implies that when no tuple in r satisfies the selection condition, the number of tuples in the result relation is 0, and that when all tuples in r satisfy the selection condition, one obtains |σP (r)| = |r|. When part of the tuples in r satisfy the selection condition, |σP (r)| must be greater than 0 and less than |r|. For the projection ΠX (r), if all tuples in r are redundant after projecting, then |ΠX (r)| = 1. If there is no redundancy in r after projecting, then |ΠX (r)| = |r|. In the other situations, |ΠX (r)| must be greater than 1 and less than |r|. Additionally, |r ∪ s| must not be greater than |r| + |s|, |r − s| must not be greater than |r|, and |r ⋈ s| and |r × s| must not be greater than |r| × |s|. Since the number of tuples in the result relation is closely related to the source relations and the source relations are finite, the result relations must be finite.

In addition, the set operations defined above have properties similar to those of classical set operations. Let r, s, and u be three union-compatible relations. Then
(a) r ∪_type s = s ∪_type r and r ∪_λ s = s ∪_λ r, (commutativity)
(b) r ∪_type r = r and r ∪_λ r = r, (idempotence)
(c) (r ∪_type s) ∪_type u = r ∪_type (s ∪_type u) and (r ∪_λ s) ∪_λ u = r ∪_λ (s ∪_λ u), and (associativity)
(d) r ∪_type s = r ∪ (s −_type r) and r ∪_λ s = r ∪ (s −_λ r).
The following properties also hold for the relational operations. Let r (R), s (R), and u (Q) be relations with imprecise values, and let P be a selection predicate involving attributes of R. Then
(a) u ⋈_type1 (r ∪_type2 s) = (u ⋈_type1 r) ∪_type2 (u ⋈_type1 s) and u ⋈_type1 (r −_type2 s) = (u ⋈_type1 r) −_type2 (u ⋈_type1 s),
(b) u ⋈_λ (r ∪_λ s) = (u ⋈_λ r) ∪_λ (u ⋈_λ s) and u ⋈_λ (r −_λ s) = (u ⋈_λ r) −_λ (u ⋈_λ s),
(c) σP (r ∪_type1 s, type2) = σP (r, type2) ∪_type1 σP (s, type2) and σP (r −_type1 s, type2) = σP (r, type2) −_type1 σP (s, type2),
(d) σ^λ_P (r ∪_λ s) = σ^λ_P (r) ∪_λ σ^λ_P (s) and σ^λ_P (r −_λ s) = σ^λ_P (r) −_λ σ^λ_P (s),
(e) σP (u ⋈_type1 r, type2) = u ⋈_type1 σP (r, type2), and
(f) σ^λ_P (u ⋈_λ r) = u ⋈_λ σ^λ_P (r).
It should be noted that type1 and type2 above may each be definite or maybe. All these properties can be proven from the definitions of the relational operations above.
5 Related work
Incomplete information query processing has been studied extensively in relational databases. Allowing an attribute value to be a set value, Lipski [28] provided two different interpretations of queries, the internal and the external. The internal interpretation ignores all imprecise information, referring only to what is known to the system. In contrast, the external interpretation refers to the real world, which is modeled in an incomplete way by the system, and imprecise information is taken into account. Reiter [35] proposed an extended relational theory to formulate relational databases with null values and presented a query evaluation algorithm for such databases. However, due to the indefinite information brought in by null values, his algorithm is sound but not complete. Sound and complete query evaluation algorithms for relational databases with null values and disjunctive information were developed by Yuan and Chiang [41]. It should be noted that all these developments are based on deductive databases. Considering several different types of null values (including set values), Candan, Grant and Subrahmanian [6] gave a unified treatment of null values in relational databases using constraints, where each tuple t has an associated constraint Ct.
Based on such conditional tables, null-valued algebraic operations were defined and several properties of their algebra were studied. Allowing any attribute value to be a missing value, an inapplicable value, a set value, or a value interval, Morrissey [33] investigated uncertain query evaluation and classified two sets of query answers: the set of objects that are known to satisfy the query with complete certainty and the set that possibly satisfies the query with various degrees of uncertainty. However, relational operations for querying were not defined in that paper. In addition, it was noticed by Morrissey [33] that measures of uncertainty are useful and necessary for ranking objects for presentation to a user; two methods of estimating this uncertainty, based on information theory, were thereby proposed. Indeed, quantitative computations for handling imprecise and uncertain information in databases are receiving considerable attention. Probability theory and fuzzy set theory, for example, are, on the one hand, extensively used to represent, store, and manipulate such information. On the other hand, some methods based on artificial intelligence have been proposed in the literature that estimate null values for approximating answers to queries against incomplete relational databases [8, 9, 13, 23].

In the context of database integration, the data integration problem has received considerable attention. Originally, in [12], set values (called partial values there) were introduced in integrated relational databases to resolve mismatched domains, and a set of extended relational operators was defined based on definite and maybe strategies. Since then, a number of approaches to querying imprecise data have been investigated. In [22], the Information Manifold project gave an algorithm for computing answers to queries posed over the global schema. Grahne and Mendelzon [19] considered the uncertainty introduced by multiple sources and defined a set of possible global databases consistent with a collection of sources, in which some of the sources are sound, some are complete, and some are both sound and complete. In their work, they gave upper and lower approximations to query answers and proved that the answer computed by the Information Manifold algorithm coincided with the certain answer. Tseng, Chen and Yang [40] took a different view for quantitative query answers with degrees of uncertainty. They used a probabilistic relational database model to represent uncertainty arising from data heterogeneity, where probabilities arose at the tuple level and at the attribute level. Based on probabilistic relational databases, they developed a set of extended relational operations. Also, Florescu, Koller and Levy [15] used probabilistic knowledge in data integration. There, information about the completeness and relative overlap of data sources was used in ordering the accesses to sources in order to maximize the likelihood of obtaining answers early in the evaluation. Since quantitative computing can provide more information when handling imprecise and uncertain information, some computing methods have recently been proposed for imprecise query answers in data integration. An approach was
presented in [31] for estimating loss of information based on navigation of ontological terms. They defined measures for loss of information based on intensional information as well as on well-established metrics like precision and recall based on extensional information. These measures were used to select results having the desired quality of information. Mendelzon and Mihaila [32] considered the problem of querying collections of sources with incomplete and partially sound data. They provided a method for checking the consistency of a source collection and proposed a probabilistic semantics for query answers.
6 Conclusion
Nowadays, successful manufacturing systems based on distributed organization principles rely heavily on the development and management of distributed information systems. Manufacturing enterprises based on distributed information systems make it possible to organize production activities effectively across organizational boundaries and to respond quickly to dynamic changes in the market. An important issue that needs to be addressed in distributed enterprise information systems is information sharing. Database integration is one of the important means by which information can be shared.

In this chapter, we concentrate on database integration in the context of distributed enterprise information systems in virtual enterprises. We show that several types of imprecise information, including null values, set values, and value intervals, may arise in integrated databases simultaneously, since there are rich database schemas and data types in the component databases in such environments and a number of conflicts have to be resolved. We therefore introduce an extended relational database model as the integrated database schema for a unified representation of these types of imprecise information. Based on the extended relational databases with imprecise information, some relational operations for qualitative and quantitative query processing are defined in the chapter.

It should be pointed out, however, that the database integrations considered in the chapter assume that the integrating relational databases have the same keys; in other words, there is no incompatible key conflict in the integrations. It is possible that component databases developed and maintained independently have different keys. This is especially true for distributed enterprise information systems. How to identify entities for integration in this situation is therefore very interesting. In addition, the component databases considered in the chapter are all relational databases. However, there are various data sources in current distributed enterprise information systems, including structured relational databases and object-oriented databases as well as semistructured XML files. Integrating heterogeneous information sources has been a challenge in the
management of distributed enterprise information systems. Our future work will focus on these issues.
References
[1] Altareva, E. and Conrad, S., 2001, The Problem of Uncertainty and Database Integration, Proc. of the 4th International Workshop on Engineering Federated Information Systems, 92-99.
[2] Batini, C., Lenzerini, M. and Navathe, S. B., 1986, A Comparative Analysis of Methodologies for Database Schema Integration, ACM Computing Surveys, 18 (4), 323-364.
[3] Bergamaschi, S., Castano, S., Vincini, M. and Beneventano, D., 2001, Semantic integration of heterogeneous information sources, Data & Knowledge Engineering, 36 (3), 215-249.
[4] Bourett, R., 2001, XML and Databases, http://www.rpbourret.com/xml/XMLAndDatabases.htm
[5] Bukhres, O. A. and Elmagarmid, A. K., 1996, Object-Oriented Multidatabase Systems: A Solution for Advanced Applications, Prentice-Hall, 1996.
[6] Candan, K. S., Grant, J. and Subrahmanian, V. S., 1997, A Unified Treatment of Null Values Using Constraints, Information Sciences, 98 (1-4), 99-156.
[7] Chen, G. Q., Vandenbulcke, J. and Kerre, E. E., 1992, A General Treatment of Data Redundancy in a Fuzzy Relational Data Model, JASIS, 43 (4), 304-311.
[8] Chen, S. M. and Lee, S. W., 2003, A New Method to Generate Fuzzy Rules From Relational Database Systems for Estimating Null Values, Cybernetics and Systems: An International Journal, 34 (1), 33-57.
[9] Chen, S. M. and Yeh, M. S., 1997, Generating Fuzzy Rules From Relational Database Systems for Estimating Null Values, Cybernetics and Systems: An International Journal, 28 (8), 695-723.
[10] Cheung, W. M. and Hsu, C., 1996, The Model-Assisted Global Query System for Multiple Databases in Distributed Enterprises, ACM Trans. on Information Systems, 14 (4), 421-470.
[11] Codd, E. F., 1987, More Commentary on Missing Information in Relational Databases (Applicable and Inapplicable Information), SIGMOD Record, 16 (1), 42-50.
[12] DeMichiel, L. G., 1989, Resolving Database Incompatibility: An Approach to Performing Relational Operations over Mismatched Domains, IEEE Trans. on Knowledge and Data Engineering, 1 (4), 485-493.
[13] Dutta, S., 1991, Approximate Reasoning by Analogy to Null Queries, International Journal of Approximate Reasoning, 5, 373-398.
[14] Elmagarmid, A. K., Rusinkiewicz, M. and Sheth, A., 1998, Management of Heterogeneous and Autonomous Database Systems, Morgan Kaufmann Publishers, CA.
[15] Florescu, D., Koller, D. and Levy, A. Y., 1997, Using Probabilistic Information in Data Integration, Proc. of the 23rd International Conference on Very Large Data Bases, 216-225.
[16] Galindo-Legaria, C. A., 1994, Outerjoins as Disjunctions, Proc. of the 1994 ACM SIGMOD International Conference on Management of Data, 348-358.
[17] García-Solaco, M., Saltor, F. and Castellanos, M., 1995, A Structure Based Schema Integration Methodology, Proc. of the Eleventh International Conference on Data Engineering, 505-512.
[18] Giachetti, R. E., 1999, A Standard Manufacturing Information Model to Support Design for Manufacturing in Virtual Enterprises, Journal of Intelligent Manufacturing, 10, 49-60.
[19] Grahne, G. and Mendelzon, A. O., 1999, Tableau Techniques for Querying Information Sources Through Global Schemas, Proc. of the 7th International Conference on Database Theory, Lecture Notes in Computer Science, 1540, 332-347.
[20] Hardwick, M., Spooner, D. L., Rando, T. and Morris, K. C., 1996, Sharing Manufacturing Information in Virtual Enterprises, Communications of the ACM, 39 (2), 46-54.
[21] Hull, R., 1997, Managing Semantic Heterogeneity in Databases: A Theoretical Perspective, Proc. of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 51-61.
[22] Kirk, T., Levy, A. Y., Sagiv, Y. and Srivastava, D., 1995, The Information Manifold, Proc. of the AAAI Spring Symposium on Information Gathering in Distributed Heterogeneous Environments.
[23] Klein, H. J., 1999, Efficient Algorithms for Approximating Answers to Queries Against Incomplete Relational Databases, Proc. of the 6th International Workshop on Knowledge Representation meets Databases, 26-30.
[24] Krishnakumar, N. and Sheth, A., 1995, Managing Heterogeneous Multisystem Tasks to Support Enterprise-wide Operations, Distributed and Parallel Databases Journal, 3 (2), 155-186.
[25] Li, Q., Zhang, W. J. and Tso, S. K., 2000, Generalization of Strategies for Product Data Modeling with Special Reference to Instance-As-Type Problem, Computers in Industry, 41 (1), 25-34.
[26] Liao, S. Y., Wang, H. Q. and Liu, W. Y., 1999, Functional Dependencies with Null Values, Fuzzy Values, and Crisp Values, IEEE Trans. on Fuzzy Systems, 7 (1), 97-103.
Database Integrations in Distributed Enterprise Information Systems
191
[27] Lim, E. P., Srivastava, J. and Shekhar, S., 1994, Resolving Attribute
[28] [29] [30]
[31]
[32]
[33] [34] [35]
[36] [37]
[38]
[39]
[40]
[41]
Incompatibility in Database Integration: An Evidential Reasoning Approach, Proc. of the Tenth International Conference on Data Engineering, 154-163. Lipski, W., 1979, On Semantic Issues Connected with Incomplete Information Databases, ACM Trans.s on Database Systems, 4 (3), 262-296. Litwin, W., Mark, L. and Roussopoulos, N., 1990, Interoperability of Multiple Autonomous Databases, ACM Computing Surveys, 22(3), 267-293. Ma, Z. M., Zhang, W. J. and Ma, W. Y., 2000, Semantic Measure of Fuzzy Data in Extended Possibility-Based Fuzzy Relational Databases, International Journal of Intelligent Systems, 15 (8), 705-716. Mena, E., Kashyap, V., Illarramendi, A. and Sheth A., 2000, Imprecise Answers in Distributed Environments: Estimation of Information Loss for Multi-Ontology Based Query Processing, International Journal of Cooperative Information Systems, 9 (4), 403-425. Mendelzon, A. O. and Mihaila, G. A., 2001, Querying Partially Sound and Complete Data Sources, Proc. of the Fifteenth ACM SIGACT-SIGMODSIGART Symposium on Principles of Database Systems, 162-170. Morrissey, J. M., 1990, Imprecise Information and Uncertainty in Information Systems, ACM Trans. on Information Systems, 8 (2), 157-180. Parent, C. and Spaccapietra, S., 1998, Issues and Approaches of Database Integration, Communications of the ACM, 41 (5): 166-178. Reiter, R., 1986, A Sound and Sometimes Complete Query Evaluation Algorithm for Relational Databases with Null Values, Journal of the Association for Computing Machinery, 33 (2), pp. 349-370. Rolstadas, A., 1995, Enterprise Modeling for Competitive Manufacturing, International Journal of Control Engineering Practice, 3 (1), 43-50. Sheth, A. P. and Larson, J. A., 1990, Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases, ACM Computing Surveys, 22 (3), 183–236 Silberschatz, A., Stonebraker, M. and Ullman, J. D., 1991, Database Systems: Achievements and Opportunities, Communications of the ACM, 34 (10): 110-120. Tatarinov, I., Viglas, S., Beyer, K. S., Shanmugasundaram, J., Shekita, E. J. and Zhang, C., 2002, Storing and Querying Ordered XML Using a Relational Database System, Proc. of the 2002 ACM SIGMOD International Conference on Management of Data, 204 – 215. Tseng, F. S. C., Chen, A. L. P. and Yang, W. P., 1993, Answering Heterogeneous Database Queries with Degrees of Uncertainty, Distributed and Parallel Databases: An International Journal, 1 (3), 281-302. Yuan, L. Y. and Chiang, D. A., 1989, A Sound and Complete Query Evaluation Algorithm for Relational Databases with Disjunctive
192
Li Yan th
Information, Proc. of the 14 ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 66-74. [42] Zaniolo, C., Faloutsos, S. and Subrahmanian, Z., 1997, Advanced Database Systems, Morgan Kaufmann Publishers. [43] Zhang, W. J. and Li, Q., 1999, Information Modeling for Made-to-Order Virtual Enterprise Manufacturing Systems, Computer Aided Design, 31 (10), 611-619.
Integrating Data Doubtfully

Ander de Keijzer

Faculty of EEMCS, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands
[email protected]
1 Introduction

Currently, the number of available information sources is enormous and still growing. Businesses have databases containing product, client, supplier and payment information. Even at home, people have multiple information sources, e.g. address books on PDAs, mobile phones and desktops. All of these information sources contain information on related items, or objects. The user usually only requires the actual information without having to search for it, or be aware that the information is divided over several independent information sources. Integrating the information sources would help the user in his or her task, because the information is presented as if it were contained in one information source. As an example, we will use the two address books of Mark. One address book is located at home, while the other is located at work. Subsets of these address books are presented in Table 1.

Table 1. Subsets of two address books

Name          | Street          | Number | Phone
Ed King       | Fifth Avenue    | 10     | 555-1234
Mark Hamburg  | Central Square  | 5      | 555-4321
Joe Rough     | Broadway        | 442    | 555-2341

Name          | Address             | Areacode | Phone
Ed King       | Fifth Avenue 10     | 555      | 1234
Mark Hamburg  | Central Square 7    | 555      | 4321
Stan Choice   | Amsterdam Avenue 2  | 555      | 3412
For Mark, it is not important from which address book an address is taken, as long as it is correct, so the answer
A. de Keijzer: Integrating Data Doubtfully, Studfuzz 203, 193–217 (2006) c Springer-Verlag Berlin Heidelberg 2006 www.springerlink.com
Ed King   Fifth Avenue 10   555-1234
could have come from either of the address books. However, it is important that all addresses are available at any given time.

1.1 Data integration

Without loss of generality, we assume only two information sources are used as input. There are two different methods to achieve data integration. The existing information sources can be integrated and the result can be stored as a different information source. This method is referred to as materialized integration. After integration, the original sources are no longer needed. Another approach is to integrate the information sources at query time, keeping the original information sources. These sources can be updated independently and can also be used by other applications as autonomous information sources. In this case, the integrated information source is a mediated view over the original sources. In the remainder of this chapter, we will only consider materialized integration.

The data in the original information sources can be about the same real world object (rwo). However, this information can conflict. For example, the house number of Mark Hamburg in Table 1 differs in both address books. This conflict has to be resolved before the information sources can actually be integrated. There are several reasons why the numbers differ. One of the entries could contain a typing error, or even both could. The person in the first address book could be a different one from the person in the second address book, but with the same name (probably relatives). By just choosing one of the possibilities from the address books, we could end up with an error in our end result.

1.2 Uncertain data

Instead of choosing one of the possibilities from the original information sources for the resulting information source, we include all possibilities. However, we associate a probability with each of the possibilities to indicate the likelihood that it actually occurs in the real world. If the associated probability is 0, the possibility is left out of the resulting information source. As a result, the integrated information source will contain uncertain data, but no data will be lost due to the integration process.

1.3 Outline

The remainder of this chapter is organized as follows. In Section 2 the concept of possible worlds is described. In Section 3 probabilistic relational data is introduced as a means to store uncertain data. This model is extended to probabilistic XML, which is shown in Section 4. Querying probabilistic data is the topic of Section 5. Section 6 shows how probabilistic XML is used for data integration, and Section 7 concludes the chapter.
2 Possible worlds

A possible world is a collection of all real world objects, where every rwo has one specific appearance. There is, for example, a possible world where Mark Hamburg has house number 5 and another possible world where his house number is 7. Every possible world pW has a probability with which it exists, denoted by P(pW).
Fig. 1. Possible Worlds derived from different objects.
In the address book example, there is a total of 8 possible worlds, because there are 2 possibilities for Mark Hamburg, Joe Rough and Stan Choice, while there is just one possibility for Ed King. This results in 2 × 2 × 2 × 1 = 8 different worlds. The concept of possible worlds is illustrated in Figure 1; here two objects (indicated by squares), each with two possibilities, are represented in a total of four possible worlds (circles). A possible appearance of a real world object is called a possible real world object, or possibility. Note that the definition of possibility in this chapter is different from that in other chapters.
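To make the counting concrete, the following short Python sketch (ours, for illustration only; the value lists and probabilities simply restate the address book example) enumerates the possible worlds as the cross product of the per-object possibilities.

```python
from itertools import product

# Possibilities per real world object; None stands for a missing value.
possibilities = {
    "Ed King":      [("Fifth Avenue 10", 1.0)],
    "Mark Hamburg": [("Central Square 5", 0.7), ("Central Square 7", 0.3)],
    "Joe Rough":    [("Broadway 442", 0.9), (None, 0.1)],
    "Stan Choice":  [("Amsterdam Avenue 2", 0.9), (None, 0.1)],
}

# A possible world picks exactly one possibility for every real world object.
names = list(possibilities)
worlds = []
for combo in product(*(possibilities[n] for n in names)):
    prob = 1.0
    world = {}
    for name, (address, p) in zip(names, combo):
        world[name] = address
        prob *= p
    worlds.append((prob, world))

print(len(worlds))                            # 8 possible worlds
print(round(sum(p for p, _ in worlds), 10))   # their probabilities sum to 1.0
```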
3 Probabilistic relational data

In [2] we defined real world objects, relations and their probabilistic counterparts as follows. A real world object can be represented by a tuple $T \in D_1 \times \cdots \times D_n$, where $D_i$ is the domain of attribute $i$. A set of real world objects can then be described by a relation $R \in \mathcal{P}(D_1 \times \cdots \times D_n)$. Finally, we represent a real world by a database $DB \in \mathcal{P}\mathcal{P}(D_1 \times \cdots \times D_n)$. Without loss of generality, we assume DB to consist of one relation.

A probabilistic attribute is a traditional attribute with an associated probability, i.e., its domain is $\tilde{D}_i = [0,1] \times D_i$. We can now define a probabilistic tuple as $pT \in \tilde{D}_1 \times \cdots \times \tilde{D}_n$. pT defines a possible description of a real world object, also called a possibility. Let $P_i(pT)$ be the probability part of attribute $i$ in tuple pT and $\pi_i(pT)$ its value part.

Let $pR \in \mathcal{P}(\tilde{D}_1 \times \cdots \times \tilde{D}_n)$ be a probabilistic relation. Table 2 shows an example of a probabilistic relation. We require pR to have a primary key k (or key for short). Without loss of generality, we assume the key of a probabilistic
Table 2. Probabilistic relation addressbook

Name               | Address                  | Phone
Ed King      [1.0] | Fifth Avenue 10    [1.0] | 555-1234 [1.0]
Mark Hamburg [1.0] | Central Square 5   [0.7] | 555-4321 [1.0]
                   | Central Square 7   [0.3] |
Joe Rough    [1.0] | Broadway 442       [0.9] | 555-2341 [0.9]
                   | -                  [0.1] | -        [0.1]
Stan Choice  [1.0] | Amsterdam Avenue 2 [0.9] | 555-3412 [0.9]
                   | -                  [0.1] | -        [0.1]
relation to consist of one attribute. Key k uniquely identifies the real world object. As a consequence, the associated probability needs to be 1. Observe that the key value itself is not unique in pR in the traditional sense ('name' in Table 2). For convenience, we introduce the following notation:

$\Pi^v(pR) = \{pT \in pR \mid \pi_k(pT) = v\}$
$D_k\,pR = \{v \in D_i \mid (\exists pT \in pR) \bullet \pi_i(pT) = v\}$

$\Pi^v(pR)$ is the set of all tuples from pR where the key equals v. $D_i\,pR$ is domain $D_i$ restricted to values occurring in pR. Two constraints should hold for a probabilistic relation:

1. $(\forall v \in D_k\,pR) \bullet \sum_{pT \in \Pi^v(pR)} \Bigl( \prod_{1 \le i \le n} P_i(pT) \Bigr) = 1$
2. $(\forall v \in D_k\,pR)(\forall i \ne k)(\forall pT, pT' \in \Pi^v(pR)) \bullet \pi_i(pT) = \pi_i(pT') \Rightarrow P_i(pT) = P_i(pT')$

The first constraint states that for each real world object, the total probability of all possibilities equals one. This implies that we assume a closed world. The second constraint states that all attributes are independent.

3.1 Possible world model

If we consider a normal relation R to be a representation of the real world, then we can consider a probabilistic relation pR to be a representation of several possible worlds. Since the probabilities of the individual attributes are independent, we obtain the probability of a description pT of a real world object (i.e., a possibility) by multiplying the probabilities of the associated attributes: $P(pT) = \prod_{1 \le i \le n} P_i(pT)$.
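As a small illustration of these definitions (a sketch under our own assumptions about how a probabilistic tuple might be laid out in memory; it is not the authors' implementation), the following Python fragment computes the probability of a possibility as the product of its attribute probabilities and checks constraint 1 for every key value.

```python
from math import prod, isclose

# A probabilistic tuple: one (probability, value) pair per attribute.
# Attribute 0 is the key 'name', so its probability is always 1.0.
pR = [
    ((1.0, "Mark Hamburg"), (0.7, "Central Square 5"), (1.0, "555-4321")),
    ((1.0, "Mark Hamburg"), (0.3, "Central Square 7"), (1.0, "555-4321")),
    ((1.0, "Joe Rough"),    (0.9, "Broadway 442"),     (0.9, "555-2341")),
    ((1.0, "Joe Rough"),    (0.9, "Broadway 442"),     (0.1, None)),
    ((1.0, "Joe Rough"),    (0.1, None),               (0.9, "555-2341")),
    ((1.0, "Joe Rough"),    (0.1, None),               (0.1, None)),
]

def tuple_prob(pT):
    # P(pT): product of the attribute probabilities (constraint 2 makes them independent)
    return prod(p for p, _ in pT)

# Constraint 1: per key value, the probabilities of all possibilities sum to 1.
totals = {}
for pT in pR:
    key_value = pT[0][1]
    totals[key_value] = totals.get(key_value, 0.0) + tuple_prob(pT)

assert all(isclose(t, 1.0) for t in totals.values())
print(tuple_prob(pR[0]))   # 0.7
```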
The set of tuples representing one possibility for each real world object is called a possible world pW ⊆ pR, defined by
Table 3. Addressbook with distances

(a) non-1NF representation
name            | address                 | distance
'Mark Hamburg'  | Central Square 5 [0.7]  | 50 [0.7]
                | Central Square 7 [0.3]  | 60 [0.3]
'Joe Rough'     | Broadway 442     [0.9]  | 18 [0.9]
                | -                [0.1]  |  0 [0.1]

(b) 1NF representation
name            | address           | address prob | distance | distance prob
'Mark Hamburg'  | Central Square 5  | 0.7          | 50       | 0.7
'Mark Hamburg'  | Central Square 5  | 0.7          | 60       | 0.3
'Mark Hamburg'  | Central Square 7  | 0.3          | 50       | 0.7
'Mark Hamburg'  | Central Square 7  | 0.3          | 60       | 0.3
'Joe Rough'     | Broadway 442      | 0.9          | 18       | 0.9
'Joe Rough'     | Broadway 442      | 0.9          |  0       | 0.1
'Joe Rough'     | -                 | 0.1          | 18       | 0.9
'Joe Rough'     | -                 | 0.1          |  0       | 0.1

(c) 3NF representation
MAIN:     name            | id
          'Mark Hamburg'  | 1
          'Joe Rough'     | 2

ADDRESS:  id | address           | prob
          1  | Central Square 5  | 0.7
          1  | Central Square 7  | 0.3
          2  | Broadway 442      | 0.9
          2  | -                 | 0.1

DISTANCE: id | distance | prob
          1  | 50       | 0.7
          1  | 60       | 0.3
          2  | 18       | 0.9
          2  |  0       | 0.1
1. $(\forall v \in D_k\,pR) \bullet |\Pi^v(pW)| = 1$
2. $(\forall pT \in pR)(\exists pT' \in pW) \bullet \pi_k(pT) = \pi_k(pT')$

The first constraint ensures that only one possibility is contained in pW for each real world object. The second constraint ensures that each real world object from pR is contained in pW at least once. The set of all possible worlds, also called the universe, is denoted by $\mathrm{PWR}_{pR} = \{pW \subseteq pR \mid pW \text{ is a possible world}\}$.

3.2 Storage scheme

We developed a prototype system on top of an ordinary RDBMS. Probabilistic tables can be stored in several ways corresponding with the normal form to which one likes to adhere. Table 3 shows different representations of the same probabilistic relation addressbook: a representation with non-atomic
attributes (Table 3(a)), the representation brought in first normal form (Table 3(b)), and the representation brought in third normal form (Table 3(c)). The 1NF representation of probabilistic data, however, may lead to massive replication of attribute values, hence, data inconsistency. The 3NF representation, on the other hand, may require many joins, hence, may perform less efficiently. We will discuss both representations.

1NF representation

A probabilistic attribute $\tilde{D}$ is not atomic, since it consists of a probability part and a value part. The representation of a probabilistic relation $pR \in \mathcal{P}(\tilde{D}_1 \times \cdots \times \tilde{D}_n)$ can be transformed to be represented only by atomic attributes. We write $pR \in \mathcal{P}([0,1] \times D_1 \times \cdots \times [0,1] \times D_n)$. pR is now in 1NF. Probability attributes which have only values of 0 or 1 are omitted, since these are not probabilistic, which is the case for at least the key of the relation. If we bring Table 3(a) into 1NF, this results in 2 × 2 + 2 × 2 = 8 rows (see Table 3(b)). Each row has 5 attributes: name, address, address prob, distance and distance prob.

3NF representation

The 1NF representation may massively replicate attribute values, hence may lead to data inconsistency if not handled properly. Normalization to 3NF prevents this. We start by observing that the probabilistic relation in Table 3(a) is modelled as $pR \in D_k \times \mathcal{P}\tilde{D}_2 \times \cdots \times \mathcal{P}\tilde{D}_n$. Each non-atomic attribute can, however, be modelled as a separate table. Therefore, we introduce a main table for each probabilistic relation with the key and a unique identifier id (to avoid replicating the key value). Furthermore, we introduce a separate table with attributes id, value and probability for each non-atomic attribute; the key of these tables consists of the first two attributes. The main table has functional dependency $k \to id$ and the separate tables have functional dependencies $id,\,i \to p$, where p is the probability attribute with domain [0,1]. These functional dependencies satisfy 3NF. The 3NF representation of Table 3(a) is shown in Table 3(c).

Dependent attributes

The attributes address and distance in Table 3 are independent, which means that any given distance can occur with any given address. This becomes especially clear in the 3NF representation, where both attributes are stored in different tables. In our example, however, the distance is likely to depend on the address. The 3NF representation can model this dependency by storing dependent attributes in the same table: the separate address and distance tables are then replaced by a single table for the combined attribute address distance.
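The step from Table 3(a) to Table 3(b) can be sketched as follows (illustrative Python only; the dictionary layout is our assumption): every address possibility is paired with every distance possibility, giving the 2 × 2 + 2 × 2 = 8 rows of the 1NF representation.

```python
from itertools import product

# Table 3(a): per real world object, a list of (value, prob) possibilities
# per non-atomic attribute. None marks a missing value.
non_1nf = {
    "Mark Hamburg": {
        "address":  [("Central Square 5", 0.7), ("Central Square 7", 0.3)],
        "distance": [(50, 0.7), (60, 0.3)],
    },
    "Joe Rough": {
        "address":  [("Broadway 442", 0.9), (None, 0.1)],
        "distance": [(18, 0.9), (0, 0.1)],
    },
}

# Table 3(b): one row per combination of attribute possibilities.
rows_1nf = []
for name, attrs in non_1nf.items():
    for (addr, p_addr), (dist, p_dist) in product(attrs["address"], attrs["distance"]):
        rows_1nf.append((name, addr, p_addr, dist, p_dist))

print(len(rows_1nf))   # 2*2 + 2*2 = 8 rows, as in Table 3(b)
```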
4 Probabilistic XML

As seen in the previous section, uncertain relational data has the following shortcomings:
• All probabilities must be independent
• Mutual exclusion between attributes is not possible
• All real world objects must exist in every possible world
• Ignorance cannot be modelled, i.e. a closed world is assumed.
We introduce uncertainty into the XML data model [8], resulting in probabilistic XML. We will show that most of the issues arising in the relational model can be solved by using probabilistic XML.

4.1 Formalization of probabilistic XML

Since order is important in XML, we first introduce some notation for handling sequences.

Notational convention 1. Analogous to the powerset notation $\mathcal{P}$, we use a power sequence notation $\mathcal{S}A$ to denote the domain of all possible sequences built up of elements of A. We use the notation $[a_1, \ldots, a_n]$ for a particular sequence. We use set operations for sequences, such as $\cup$, $\exists$, $\in$, whenever definitions remain unambiguous.

We start by defining the notions of tree and subtree as abstractions of an XML document and fragment. We model a tree as a node and a sequence of child subtrees. We abstract from the details of nodes. We assume that they incorporate properties like tag name, node identity, node kind, and, if appropriate, attribute or text value. Equality on nodes is defined as equality on the properties.

Definition 1. Let N be the set of nodes. Let $T_i$ be the set of trees with maximum level i, inductively defined as follows:

$T_0 = \{(n, \emptyset) \mid n \in N\}$
$T_{i+1} = T_i \cup \{(n, ST) \mid n \in N \wedge ST \in \mathcal{S}T_i \wedge (\forall T' \in ST \bullet n \notin N^{T'}) \wedge (\forall T', T'' \in ST \bullet T' \ne T'' \Rightarrow N^{T'} \cap N^{T''} = \emptyset)\}$

where $N^T = \{n\} \cup \bigcup_{T' \in ST} N^{T'}$. Let $T_{fin}$ be the set of finite trees, i.e., $T \in T_{fin} \Leftrightarrow \exists i \in \mathbb{N} \bullet T \in T_i$. In the sequel, we only work with finite trees. We sometimes use n to indicate the root node of a tree.
[Figure 2 here: a probability-node root above a 'persons' element, with possibilities .7 (one person) and .3 (two persons); each person has 'nm' and 'nr' children.]
Fig. 2. Example probabilistic XML tree.
We obtain a subtree from a tree T by indicating a node n' in T which is the root node of the desired subtree. We also define a convenience function child that returns the child nodes of a given node in a tree.

Definition 2. Let subtree(T, n') be the subtree within T = (n, ST) rooted at n':

subtree(T, n') = T if n = n', and subtree(T', n') otherwise, where T' is the tree in ST such that n' ∈ N^{T'}.

For subtree(T, n') = (n', [(n_1, ST_1), ..., (n_m, ST_m)]), let child(T, n') = [n_1, ..., n_m].
The central notion in this chapter is the probabilistic tree. In an ordinary XML document, all information is certain. When two XML data sources are integrated, they may conflict on information about certain real world objects. Therefore, after data integration, there may exist more than one possibility for a certain text node, or in general, for entire subtrees. We model this uncertainty in a probabilistic tree by introducing two special kinds of nodes:
1. probability nodes, and
2. possibility nodes, depicted as ◦, which have an associated probability.
The children of a probability node enumerate all possibilities with a combined probability of 1. Ordinary XML nodes are depicted as •. A probabilistic tree is well-structured if the children of a probability node are possibility nodes, the children of a possibility node are XML nodes, and the children of XML nodes are probability nodes. In this way, on each level of the tree, you only find one kind of node. Figure 2 shows an example of a probabilistic XML tree. The tree represents an XML document with a root node 'persons' (which exists with certainty). The root node has either one or two child nodes 'person' (with probabilities .7 and .3, respectively). In the one-child case, the name of the person is 'Mark'
and the house number is either '5' or '7' (with equal probability). In the more unlikely case of two children, the information of both persons is certain, i.e., they both have name 'Mark' and one has house number '5' and the other '7'. This document is a possible result of two documents having been integrated, one document stating the house number of a person named 'Mark' to be '5', and the other stating the house number of a person named 'Mark' to be '7'. It is uncertain if both talk about the same person. A data integration matching rule apparently determined that, with a probability of .7, they represent the same person. Therefore, the combined knowledge of the real world is described accurately by the given tree.

A probabilistic tree is defined as a tree, a kind function that assigns node kinds, and a prob function that assigns probabilities to possibility nodes. The root node is defined to always be a probability node. A special type of probabilistic tree is a certain one, which means that all information in it is certain, i.e., there is no more than one possibility for any node.

Definition 3. A probabilistic tree PT = (T, kind, prob) is defined as follows:
• $kind \in N \to \{prob, poss, xml\}$
• $N^T_k = \{n \in N^T \mid kind(n) = k\}$
• $kind(n) = prob$, where $T = (n, ST)$
• $\forall n \in N^T_{prob} \bullet \forall n' \in child(T, n) \bullet n' \in N^T_{poss}$
• $\forall n \in N^T_{poss} \bullet \forall n' \in child(T, n) \bullet n' \in N^T_{xml}$
• $\forall n \in N^T_{xml} \bullet \forall n' \in child(T, n) \bullet n' \in N^T_{prob}$
• $prob \in N^T_{poss} \to [0, 1]$
• $\forall n \in N^T_{prob} \bullet \bigl( \sum_{n' \in child(T,n)} prob(n') \bigr) = 1$.

A probabilistic tree PT = (T, kind, prob) is certain iff there is only one possibility node for each probability node, i.e., $certain(PT) \Leftrightarrow \forall n \in N^T_{prob} \bullet |child(T, n)| = 1$. To clarify definitions, we use b to denote a probability node, s to denote a possibility node, and x to denote an XML node.
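A minimal sketch of this structure in Python (the class name, field names and text encoding of labels are ours, not the prototype's): each node carries a kind, possibility nodes carry a probability, and a small check enforces the alternation of node kinds and the sum-to-one condition of Definition 3 on the tree of Figure 2.

```python
class Node:
    def __init__(self, kind, label=None, prob=None, children=()):
        self.kind = kind          # 'prob', 'poss' or 'xml'
        self.label = label        # tag name / text value, only for XML nodes
        self.prob = prob          # probability, only for possibility nodes
        self.children = list(children)

ALLOWED_CHILD = {"prob": "poss", "poss": "xml", "xml": "prob"}

def well_structured(node):
    # children alternate prob -> poss -> xml -> prob, and the probabilities
    # of the possibilities under every probability node sum to 1
    if any(c.kind != ALLOWED_CHILD[node.kind] for c in node.children):
        return False
    if node.kind == "prob" and node.children:
        if abs(sum(c.prob for c in node.children) - 1.0) > 1e-9:
            return False
    return all(well_structured(c) for c in node.children)

def certain(xml_node):
    # wrap an XML node in a probability node with a single certain possibility
    return Node("prob", children=[Node("poss", prob=1.0, children=[xml_node])])

def person(name, nr_possibilities):
    return Node("xml", label="person", children=[
        certain(Node("xml", label="nm: " + name)),
        Node("prob", children=nr_possibilities)])

# The tree of Figure 2: one person (prob .7) whose house number is 5 or 7,
# or two persons (prob .3) with house numbers 5 and 7.
nr5 = Node("poss", prob=1.0, children=[Node("xml", label="nr: 5")])
nr7 = Node("poss", prob=1.0, children=[Node("xml", label="nr: 7")])
nr5or7 = [Node("poss", prob=0.5, children=[Node("xml", label="nr: 5")]),
          Node("poss", prob=0.5, children=[Node("xml", label="nr: 7")])]

one_person = Node("poss", prob=0.7, children=[person("Mark", nr5or7)])
two_persons = Node("poss", prob=0.3,
                   children=[person("Mark", [nr5]), person("Mark", [nr7])])

fig2 = certain(Node("xml", label="persons",
                    children=[Node("prob", children=[one_person, two_persons])]))

print(well_structured(fig2))   # True
```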
Subtrees under probability nodes denote local possibilities. In the one-person case of Figure 2, there are two local possibilities for the house number: it is either '5' or '7'. The other uncertainty in the tree is whether there are one or two persons. Viewed globally and from the perspective of a device with this data in its database, the real world could either look like
• one person with name 'Mark' and house number '5' (probability .5 × .7 = .35),
• one person with name 'Mark' and house number '7' (probability .5 × .7 = .35), or
• two persons with name 'Mark' and respective house numbers '5' and '7' (probability .3).
These are called possible worlds.
[Figure 3 here: (a) PT1 — a root with two possibilities (.8 and .2), each containing a certain 'nm'/'nr' pair; (b) PT2 — a certain 'nm' node and an 'nr' node with two possibilities (.8 for '5', .2 for '7').]
Fig. 3. Probabilistic XML tree equivalence.
Definition 4. A certain probabilistic tree PT' is a possible world of another probabilistic tree PT, i.e., pw(PT', PT), with probability pwprob(PT', PT) iff
• PT = (T, kind, prob) ∧ PT' = (T', kind', prob')
• $T = (n, ST_n)$ ∧ $T' = (n, ST'_n)$
• $\exists s \in child(T, n) \bullet child(T', n) = [s]$
• $X = child(T, s) = child(T', s)$
• $\forall x \in X \bullet child(T, x) = child(T', x)$
• $B = \bigcup_{x \in X} child(T, x)$
• $\forall b \in B \bullet PT_b = subtree(PT, b) \wedge PT'_b = subtree(PT', b) \wedge pw(PT'_b, PT_b)$
• $\forall b \in B \bullet p_b = pwprob(PT'_b, PT_b)$
• $pwprob(PT', PT) = prob(s) \times \prod_{b \in B} p_b$

The set of all possible worlds of a probabilistic tree PT is $\mathrm{PWR}_{PT} = \{PT' \mid pw(PT', PT)\}$.

A probabilistic tree is a compact representation of the set of all possible worlds, but there is more than one possible representation. The optimal representation is the one with the least number of nodes, obtained through a process called simplification.

Definition 5. Two probabilistic trees $PT_1$ and $PT_2$ are equivalent iff $\mathrm{PWR}_{PT_1} = \mathrm{PWR}_{PT_2}$. $PT_1$ is more compact than $PT_2$ if $|N^{PT_1}| < |N^{PT_2}|$. The transformation of a probabilistic tree to an equivalent more compact one is called simplification.

Figure 3 shows an example of two equivalent probabilistic trees. They both denote the set of possible worlds containing trees with
• two nodes 'nm' and 'nr' with child text nodes 'Mark' and '5' respectively (probability .8) and
[Figure 4 here: (a) Independence — 'nm' has possibilities 'Mark'/'Marc' and 'nr' has possibilities '5'/'7'; (b) Dependence — only the combinations 'Mark'/'5' and 'Marc'/'7' occur; (c) Uncertainty about existence — one possibility of the 'person' element is empty.]
Fig. 4. Expressiveness of probabilistic tree model.
• two nodes ‘nm’ and ‘nr’ with child text nodes ‘Mark’ and ‘7’ respectively (probability .2). It is obvious that in Figure 3, PT 2 is a simplification of PT 1 . As mentioned earlier, relational approaches often disallow dependencies among attributes. The higher expressiveness of the probabilistic tree makes such a restriction unnecessary. Figure 4 illustrates three common patterns: independence between attributes (Figure 4(a)): any combination of ‘nm’ and ‘nr’ is possible, dependency between attributes (Figure 4(b)): only the combinations ‘Mark’/‘5’ and ‘Marc’/‘7’ are possible, and uncertainty about the existence of an object (Figure 4(c)): one possibility is empty, i.e., has no subtree. These patterns can occur on any level in the tree, which allows a much larger range of situations to be expressed.
5 Querying uncertain data

For relational probabilistic databases, several extensions to relational algebra to support uncertain data have been proposed [1; 7; 4; 6; 5]. Based on [9], we argued in [2] that thinking in terms of possible worlds is powerful in determining a proper semantics of queries. Uncertainty can be treated as having more than one possible instantiation describing a particular real world object. Choosing one possible instantiation, or possibility for short, for each real world object results in a possible world. Analogous to the notion of parallel universes, all possible worlds co-exist in the database and a query should, therefore, be evaluated in every possible world separately. This approach is not specific to relational databases, so we have adopted it for probabilistic XML as well [8].

5.1 Querying relational data

A probabilistic query should return a different kind of result than a normal relational query. Consider a query such as the one posed below on the addressbook of Table 2.
SELECT name, address FROM addressbook WHERE name='Mark Hamburg'

Seen from a traditional relational point of view, the answer

Mark Hamburg   Central Square 5
Mark Hamburg   Central Square 7
makes no sense: because attribute name is a key, only one tuple containing the name 'Mark Hamburg' is expected. From the perspective of probabilistic relations, however, the returned result is a somewhat correct answer, because uncertainty about addresses in the data will inevitably lead to uncertainty in the answer to a question about addresses. The only problem is that the answer does not specify the probabilities, which it should. This calls for a review of the semantics of relational algebra operators and, as a consequence, their redefinition, in order to support querying on probabilistic data. With the possible world approach, we would expect the answer

[0.7]   Mark Hamburg   Central Square 5
[0.3]   Mark Hamburg   Central Square 7
which indicates that there is a 70% chance the possibility ('Mark Hamburg', 'Central Square 5') is correct and a 30% chance the possibility ('Mark Hamburg', 'Central Square 7') is correct.

Operators

Select

The traditional select operator $\sigma_c(R)$ selects tuples from R according to the selection condition c. For a probabilistic select operator, the selection condition c can also refer to probabilities of attributes. We, therefore, introduce a function P(f) which returns the probability of attribute f. The semantics of the probabilistic select operator $\sigma_c(pR)$ should be that it selects possibilities from pR. Note, however, that the result of a selection corresponds to a different universe than the one stored in the database. All probabilities of attributes f in the answer are of the form P(f | c). For example, to obtain data that is highly likely from the addressbook relation, highly likely being a probability of more than 60%, we could write $\sigma_{P(address)>0.6}(addressbook)$. The result of this expression is {(1.0, 'Mark Hamburg', 'Central Square 5')}, which states that in the universe where the probability of an address is higher than 60%, Mark Hamburg certainly has the address Central Square 5, hence the probability 1.0 in the answer. In other words, the original probabilities have to be normalized in the result. Note that this is a proper probabilistic relation according to Section 3. Furthermore, observe that dependency among attributes may occur.
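A sketch of this renormalization (our own simplification of the select operator, restricted to one fixed condition and reusing the tuple layout from the sketch in Section 3):

```python
def select_address_prob_gt(pR, threshold):
    # sigma_{P(address) > threshold}(pR): keep the possibilities whose address
    # probability exceeds the threshold, then renormalize per key value
    kept = [pT for pT in pR if pT[1][0] > threshold]      # attribute 1 = address
    totals = {}
    for pT in kept:
        totals[pT[0][1]] = totals.get(pT[0][1], 0.0) + pT[1][0]
    result = []
    for pT in kept:
        (pk, key), (pa, addr), rest = pT[0], pT[1], pT[2:]
        result.append(((pk, key), (pa / totals[key], addr)) + rest)
    return result

mark = [((1.0, "Mark Hamburg"), (0.7, "Central Square 5")),
        ((1.0, "Mark Hamburg"), (0.3, "Central Square 7"))]
print(select_address_prob_gt(mark, 0.6))
# [((1.0, 'Mark Hamburg'), (1.0, 'Central Square 5'))]
```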
Project

The traditional project operator $\pi_f(R)$ selects all attributes f from all tuples in R, leaving out other attributes not in f. The semantics of a probabilistic project operator $\pi_f(pR)$ should be that it selects all attributes f from all possibilities in pR. Since all attributes are independent, projection does not affect the associated probabilities. However, if the key of pR is projected out, the individual objects can no longer be identified and the result is meaningless. Therefore, k always has to be part of attribute list f.

Table 4. Determining the probabilities associated with a union result by enumerating possible worlds ($\breve{P} = 1 - P$).

A            | B            | A ∪ B
e ∈ A [PA]   | e ∈ B [PB]   | e ∈ A ∪ B [PA·PB]
e ∈ A [PA]   | e ∉ B [P̆B]  | e ∈ A ∪ B [PA·P̆B]
e ∉ A [P̆A]  | e ∈ B [PB]   | e ∈ A ∪ B [P̆A·PB]
e ∉ A [P̆A]  | e ∉ B [P̆B]  | e ∉ A ∪ B [P̆A·P̆B]
⇒ e ∈ A ∪ B with probability PA + PB − PA·PB; e ∉ A ∪ B with probability 1 − (PA + PB − PA·PB)
Union

The probabilistic union operator merges two probabilistic relations possibly containing possibilities for the same real world objects. To properly calculate the probabilities in the answer, it is beneficial to enumerate the possible worlds, i.e., consider each possibility of an element existing or not in the operand sets (see Table 4). The probability of the element occurring in the result is the sum of the three possibilities where the element occurred in either operand. The intersection and difference can be determined analogously.

Cartesian product

The semantics of a cartesian product is the simultaneous occurrence of two possibilities. Probabilities can be calculated according to this semantics.

Aggregate functions

Aggregate functions combine the values of a set of attribute values into one value. Examples are the total price of a collection or the average mark of students. Observe that a traditional aggregate operator works in one world
where everything is certain. To come up with a proper semantics for aggregate operators in the context of probabilistic relations, we use the strategy of enumerating possible worlds. For each possible world pW with associated probability P(pW), we apply the traditional aggregate operator aggr. The results (aggr, P(pW)) together form the resulting probabilistic relation.

Table 5. Partial addressbook with distances

Name               | Address                 | Distance
Mark Hamburg [1.0] | Central Square 5 [0.7]  | 50 [0.7]
                   | Central Square 7 [0.3]  | 60 [0.3]
Joe Rough    [1.0] | Broadway 442     [0.9]  | 18 [0.9]
                   | -                [0.1]  |  0 [0.1]

Given the partial addressbook relation of Table 5 and the query below:

SELECT MAX(distance) FROM addressbook

This would traditionally return the number 60. This is, however, only true for the world where 'Mark Hamburg' has the address 'Central Square 7', but there are three other possible worlds. These possible worlds should contribute to the result. The correct result

{(0.63, 50), (0.27, 60), (0.07, 50), (0.03, 60)} = {(0.7, 50), (0.3, 60)}

reflects the existence of four possible worlds, two possibilities for both 'Mark Hamburg' and 'Joe Rough', each with its own maximum distance of either 50 or 60. The associated probabilities of worlds with equal maximum values can be combined as shown. In more general terms, the probabilistic aggregate operators (MAX, MIN, SUM, AVG) are defined by $aggr_f(pR) \in \mathcal{P}([0,1] \times \mathbb{R})$, where f indicates a field name. We use the notation $aggr_f(pW)$ for the traditional counterpart of an aggregate operator evaluated in possible world pW. The aggregate function $aggr_f$ is defined by

$aggr_f(pR) = \{(P(pW), aggr_f(pW)) \mid pW \in \mathrm{PWR}_{pR}\}$

A more elaborate explanation of the aggregate functions can be found in [3].
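The possible-worlds evaluation of MAX can be sketched as follows (illustrative Python; the per-object possibility lists restate Table 5, and worlds with the same maximum are merged by summing their probabilities):

```python
from itertools import product

# Distance possibilities per real world object, from Table 5.
distances = {
    "Mark Hamburg": [(50, 0.7), (60, 0.3)],
    "Joe Rough":    [(18, 0.9), (0, 0.1)],
}

def probabilistic_max(distances):
    result = {}
    for combo in product(*distances.values()):   # one possibility per object
        world_prob = 1.0
        world_max = None
        for value, p in combo:
            world_prob *= p
            world_max = value if world_max is None else max(world_max, value)
        result[world_max] = result.get(world_max, 0.0) + world_prob
    return result

print(probabilistic_max(distances))   # {50: 0.7, 60: 0.3} (up to float rounding)
```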
EXP function

All aggregate operators take the existence of possible worlds into account. However, the system should be able to predict information about the real world. We, therefore, introduce a new aggregate function, EXP, which returns the expected value of a numerical field: $EXP_f(pR) \in \mathbb{R}$. It is defined by

$EXP_f(pR) = \dfrac{\sum_{pT \in pR} \pi_f(pT) \times P_f(pT)}{\sum_{pT \in pR} P_f(pT)}$
The function calculates the weighted average of the field over all possible worlds. In the presence of a GROUP BY clause, we assume pR to represent one group. For example,

SELECT EXP(distance) FROM addressbook GROUP BY name

returns {53, 16.2}, the expected distances to all persons in our addressbook. EXP can be used in combination with other aggregate functions. The expected maximum distance is obtained by

SELECT EXP(MAX(distance)) FROM addressbook

5.2 Querying XML data

The parallel universes analogy illustrates that a query on uncertain data may produce a set of possible answers, indicating that the answer is uncertain as well. One obtains one possible answer per possible world. The probability of a possible answer is simply the probability of the possible world that gave rise to it. In short, the correct answer to a query on a probabilistic XML document can be obtained as follows:
1. enumerate all n possible worlds for the probabilistic tree,
2. evaluate the query for each possible world according to ordinary semantics,
3. the resulting n possible answers can be simplified by merging (partially) equal possible answers.
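These three steps can be phrased directly as a small evaluation loop (a sketch; enumerate_worlds, evaluate and the hashable answer representation are assumed helpers, not the prototype's API, and only fully equal answers are merged here):

```python
def query_probabilistic(doc, query, enumerate_worlds, evaluate):
    # (1) enumerate the possible worlds of the probabilistic document,
    # (2) evaluate the query in each world with ordinary, certain semantics,
    # (3) merge equal answers by summing the probabilities of their worlds.
    answers = {}
    for prob, world in enumerate_worlds(doc):
        answer = evaluate(query, world)
        answers[answer] = answers.get(answer, 0.0) + prob
    return answers    # possible answers with their probabilities
```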
[Figure 5 here: (a) Query answer — three possible answer sequences with probabilities .35, .35 and .3; (b) Compact answer — an equivalent representation in which the first two possibilities are merged into one sequence whose 'nr' element is '5' or '7' with probability .5 each.]
Fig. 5. Query //person[nm="Mark"]/nr.
This approach applies to queries in any query language. In XPath or XQuery, answers to queries are sequences. Queries on uncertain data, hence, result in sets of possible sequences. Assuming that a sequence is represented by a special sequence node seq, a probabilistic tree can be used to represent the set of possible answers, and simplification can be used to obtain a compact representation of the answer. Note that this is just a statement about the semantics of a query. An implementation can obviously use a more efficient algorithm for answering queries than by enumerating all possible worlds. Concerning query evaluation, our prototype implementation focuses on obtaining insight into how and to what degree possibilities in the data propagate to possibilities in the answer. It, therefore, simply follows the three-step strategy presented above. Performance of query evaluation is a topic for future research and, hence, beyond the scope of this chapter. We only sketch how the prototype enumerates all possible worlds below.

Figure 5 shows an answer to an example query on the data of Figure 2. We have seen that there are three possible worlds. Evaluating the query in each of these possible worlds gives rise to the answer in Figure 5(a). Intuitively, it states that the answer to house numbers of people named "Mark" is uncertain: it is either "5" or "7", or there are actually two Marks with a house number: one with "5" and one with "7". Figure 5(b) shows how we can compactly represent this answer by merging the first two possibilities, i.e., making the possibilities local, since in both cases it is certain that the sequence contains one element. Note that to applications, it may be more logical to present the answer to a query as a set of possible answers where each answer is an ordinary XML sequence with ordinary non-probabilistic XML fragments as elements. Such a representation coincides with an enumeration of the possible worlds of the answer. On the other hand, an application prepared for consuming probabilistic answers may benefit in performance from the compact representation. The design of an API is still an open issue.
Enumerating possible worlds

Generating all possible worlds from a compact probabilistic XML tree can be done recursively. The approach of our prototype implementation is as follows. Given a node in the tree, the function produces all possible worlds for the subtree rooted at that node. Contrary to the formalization, a possible world is represented by a normal XML tree instead of a certain probabilistic tree, i.e., it contains only XML nodes and no probability and possibility nodes. For producing the set of possible worlds for a certain node, the sets of possible worlds from its children need to be combined. There are two ways to combine sets of possible worlds:
• The product of two sets of possible worlds $W_1 = \{A_1, \ldots, A_n\}$ and $W_2 = \{B_1, \ldots, B_m\}$ is $W_1 \otimes W_2 = \{A_1B_1, A_1B_2, \ldots, A_1B_m, A_2B_1, A_2B_2, \ldots, A_nB_m\}$, where AB means a world in which both A and B hold. For example, A and B may correspond to two distinct children of a certain element, hence, a possible world combines one possibility for each child.
• The sum of two sets of possible worlds is $W_1 \oplus W_2 = \{A_1, \ldots, A_n, B_1, \ldots, B_m\}$.

Let $W_1, \ldots, W_n$ be the sets of possible worlds for each child of a given node. The resulting set of possible worlds for that node depends on the node kind:
• Probability node. The result is $W_1 \oplus \ldots \oplus W_n$.
• Possibility node. The result is $W_1 \otimes \ldots \otimes W_n$.
• XML node. The result is constructed by placing the given XML node as parent (i.e., root) in each world in $W_1 \otimes \ldots \otimes W_n$.
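A compact Python rendering of this recursion (a sketch building on the Node class and the Figure 2 tree from the earlier sketch in Section 4; the representation of a world as a tuple of (label, children) pairs is our choice, and probabilities are not tracked):

```python
def worlds(node):
    # Return the list of possible worlds for the subtree rooted at `node`.
    # A world is an ordinary XML fragment: a tuple of (label, children) pairs.
    child_worlds = [worlds(c) for c in node.children]

    def w_product(sets):                       # W1 (x) W2 (x) ... (x) Wn
        result = [()]
        for ws in sets:
            result = [acc + w for acc in result for w in ws]
        return result

    if node.kind == "prob":                    # W1 (+) W2 (+) ... (+) Wn
        return [w for ws in child_worlds for w in ws]
    if node.kind == "poss":                    # W1 (x) W2 (x) ... (x) Wn
        return w_product(child_worlds)
    # XML node: place it as parent of each combined world of its children
    return [((node.label, w),) for w in w_product(child_worlds)]

print(len(worlds(fig2)))                       # 3 possible worlds for Figure 2
```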
6 Data integration

In this section we will show how probabilistic XML can be used to perform data integration.

6.1 General approach

A device's database is a probabilistic XML document. When data integration with a foreign probabilistic XML document is initiated, the foreign document is considered to be a source of 'new' information on real world objects that the device either already knows about or not. New information on 'new' real world objects is simply added to the database. Any differences in information on 'existing' real world objects are regarded as different possibilities for that object. Note that we disregard possibilities concerning order: new information on 'new' real world objects is simply considered to come after information on known objects in document order.
Since it is often not possible to determine with certainty that two specific XML elements correspond to the same real world object, we assume the existence of a rule engine that determines the probability of two elements referring to the same real world object. In special cases, this rule engine may obviously decide on a probability of 0 (with certainty not the same real world object) or 1 (with certainty the same real world object). In this chapter, we abstract from the details of this rule engine, but imagine that it uses schema information to rule out possibilities. Or it may, for example, consult a digital street map to declare a certain street name very improbable as there exists no such street in that city. Or it may use Semantic Web techniques to reason away possibilities. In our current prototype, the rule engine is very simple and uses only some basic schema information. It distinguishes two cases:
• The schema states that a certain element can appear only once. We assume that this means that the elements of both documents refer to the same real world object, hence, the subtrees are correspondingly merged. For example, if two person elements refer to the same real world person, their descendant elements (e.g., nm and tel) are merged. If two corresponding descendant elements differ, we store this as two possibilities for that element.
• The schema states that a certain element can appear multiple times. We assume that this means that the foreign document may contain new elements for this list. For example, the database contains knowledge about two persons "Mark" and "Joe". The foreign document holds information on a person "Marc". Note that "Marc" may be the same person as "Mark", only misspelled, or it may refer to a different person. The data integrator will store both possibilities, i.e., one whereby it merges "Mark" and "Marc", and one whereby it adds a new person element.
Each possibility is assigned a probability by the rule engine. For example, it is not unthinkable that "Marc" and "Mark" are actually the same person. On the other hand, it is rather improbable that "Joe" and "Marc" are the same person. Our current prototype is not that clever though, i.e., it assigns the same probability to the likelihood that two person elements refer to the same real-life person as to the likelihood of both referring to different real-life persons. The reason for this is that we need it as a baseline to determine the effectiveness of a more clever rule engine. Moreover, we would like to focus on the data integration mechanism first. The minimal set of rules used by our prototype also includes that there can only be one root in an XML document and that the schemas of integrated documents are the same, so different tag names are assumed to refer to different real world objects.

6.2 Integrating sequences

In general, integrating sequences produces possibilities for all elements referring to either the same or different real world objects. Since we made an
Table 6. Possibilities for merging sequences X = [A, B] and Y = [C, D]

Referral to real world object   | Resulting sequence
A ≠ B ≠ C ≠ D                   | A, B, C, D
A = C, B ≠ C ≠ D                | A/C, B, D
A = D, B ≠ C ≠ D                | A/D, B, C
A ≠ C ≠ D, B = C                | A, B/C, D
A ≠ C ≠ D, B = D                | A, B/D, C
A = C, B = D                    | A/C, B/D
A = D, B = C                    | A/D, B/C
assumption that the schemas are the same and that elements with different tag names refer to different real world objects, many of those possibilities are ruled out. However, this rule does not limit the possibilities for sequences of elements with the same tag name. Take, for example, the integration of address information of people. We are confronted with integrating sequences of person elements. Because we initially chose a rather dumb rule engine, any two elements, one from each sequence, possibly refer to the same real world object. Therefore, when merging two sequences, X and Y, the resulting number of possibilities can be huge. Let, for example, X = [A, B] and Y = [C, D]. The possibilities to be generated during integration of X and Y are listed in Table 6. In the table, A = C indicates that A and C are considered to refer to the same real world object, hence, they should result in a single possibility where A and C are merged: A/C. Since the database already represents all possibilities explicitly, we do not need to consider two elements from one sequence to refer to the same real world object, so A = B and C = D are not valid possibilities.

When all elements of X and Y refer to different real world objects, the number of resulting possible worlds is 1. But, when one element from X refers to the same real world object as an element from Y, there are x × y possible ways in which this can be done, since every element from X can in principle be matched with every element from Y. In general, the number of possible ways to merge i elements from X with i elements from Y can be computed as follows. Choose i different elements from X, where the order of choosing the elements is unimportant, but an element cannot be chosen more than once. This can be done in $\frac{x!}{(x-i)!\,i!} = \binom{x}{i}$ ways. Then, we choose i elements from Y to merge with those chosen from X. Since the first chosen element from X should be merged with the first element chosen from Y, order is important when choosing elements from Y. The number of ways to choose the i elements from Y is $\frac{y!}{(y-i)!}$.

Let x and y be the lengths of X and Y, respectively. Since the process of merging sequences is commutative, we assume x ≤ y. In determining all possibilities, any i (0 ≤ i ≤ x) elements of X may refer to the same real world objects as elements of Y. Therefore, the resulting total number of possibilities for a
merged sequence is

$\sum_{i=0}^{x} \binom{x}{i} \times \frac{y!}{(y-i)!}$
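A direct check of this count (illustrative Python):

```python
from math import comb, factorial

def merge_possibilities(x, y):
    # sum over i = 0..x of C(x, i) * y!/(y-i)!   (x <= y assumed)
    return sum(comb(x, i) * factorial(y) // factorial(y - i) for i in range(x + 1))

print(merge_possibilities(2, 2))   # 7, the seven possibilities of Table 6
print(merge_possibilities(5, 5))   # 1546
```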
We see from this formula that merged documents can become huge quite rapidly. If we take, for example, x = 5 and y = 5, then the maximum number of possibilities is 1546. The rule engine, however, may rule out certain possibilities. For example, if, in the case of Table 6, A refers to a person named "Mark" and C to a person named "Joe", the rule engine may assign probability 0 to the likelihood that A = C. In this way, it rules out two of the seven possibilities.

6.3 Data integration is an equivalence preserving operation

An interesting property of the data integration approach described above is whether it preserves equivalence. Let D1 and D2 be two probabilistic XML documents and A = integrate(D1, D2) the result of integrating them. Suppose D1' and D2' are equivalent to D1 and D2 respectively. Is then A' = integrate(D1', D2') equivalent to A? A full proof goes beyond the scope of this chapter, but below we try to show that it is plausible.

There is a special case for which this property is especially interesting. The set of possible worlds can be represented as a probabilistic tree with one probability node as root and all possible worlds as possibilities directly below it. Figure 3(a) is of this form. Since the above property holds, integrating two probabilistic trees amounts to integrating all combinations of possible worlds of both trees.

Below, we first show an example of integrating two certain trees to illustrate the recursive process. We then show the integration of a compact tree with two possibilities with a certain tree. Finally, we show the integration of an equivalent tree in set-of-possible-worlds representation with the same certain tree. The data integration function integrate takes two parameters D1 and D2. It returns the integration result as a probabilistic XML tree. In the diagrams below, we have omitted probability and possibility nodes whenever there is only one possibility. The example below shows how we can recursively integrate two certain trees.

[Tree diagram: integrate() applied to two certain trees, one containing a 'person' element named 'Mark' and one containing a 'person' element named 'Joe'.]
After the first integration step, we obtain:
[Tree diagram: a probability node with two possibilities — one in which the two persons are taken to be the same, so that integrate('Mark', 'Joe') still has to be resolved inside a single 'person' element, and one in which two separate 'person' elements, 'Mark' and 'Joe', coexist.]
The second integration step, integrate('Mark', 'Joe'), obviously results in:

[Tree diagram: a probability node with two possibilities, one whose value is 'Mark' and one whose value is 'Joe'.]
Observe the difference between integrating person elements, which are specified as being part of a sequence, and other elements, including text nodes, for which there can only be one. The former produces an additional possibility for the case that there exist two persons. In general, text nodes are also part of a sequence (e.g., paragraphs in a text document). Concatenating names of persons, however, does not make sense, so our rule engine decides that, for example, the name of a person cannot be "MarkJoe".

Integrating a probabilistic tree and a certain one proceeds as follows:

[Tree diagram: integrate() applied to a probabilistic tree containing a 'person' whose name is either 'Mark' or 'Marc', and a certain tree containing a 'person' named 'Joe'.]

We would first integrate both person elements:

[Tree diagram: a probability node with two possibilities — a single merged 'person' whose name still has to be integrated, and two separate 'person' elements]

where
[Tree diagram: integrate() of the name possibilities 'Mark'/'Marc' with 'Joe']

intuitively leads to

[Tree diagram: a probability node with three possibilities for the name — 'Mark', 'Marc' and 'Joe'.]

The entire resulting tree looks like:

[Tree diagram: a probability node with two possibilities — a single 'person' whose name is 'Mark', 'Marc' or 'Joe', and two 'person' elements, one named 'Mark' or 'Marc' and one named 'Joe'.]
The resulting tree can be described as:

(Joe ∨ Mark ∨ Marc) ∨ ((Mark ∨ Marc) ∧ Joe)     (1)
Moving the local possibility upwards in the tree, we get an equivalent, less compact tree that is in all-possible-worlds representation. The integrate function now behaves as being applied to each possible world separately.

[Tree diagram: integrate() applied to the all-possible-worlds representation — one possible world with a 'person' named 'Mark' and one with a 'person' named 'Marc' — and the certain tree with a 'person' named 'Joe'.]
We integrate the person named “Joe” over both possibilities resulting in the following:
[Tree diagram: a probability node with two possibilities, each still containing an integrate() of two certain trees — one integrating the persons 'Mark' and 'Joe', the other integrating the persons 'Marc' and 'Joe'.]
The final result is:

[Tree diagram: a probability node with two branches; in each branch, either the two persons coexist or there is a single person carrying one of the two names — 'Mark' and 'Joe', or 'Mark' or 'Joe', in one branch, and 'Marc' and 'Joe', or 'Marc' or 'Joe', in the other.]
The boolean representation is

((Mark ∧ Joe) ∨ (Mark ∨ Joe)) ∨ ((Marc ∧ Joe) ∨ (Marc ∨ Joe))     (2)
Note that this is equivalent to the earlier obtained boolean representation. The trees are equivalent.

6.4 Assigning probabilities and confidence scores

As explained earlier, the rule engine determines in some way the probability of the various possibilities. Intelligence and the use of information from schema and other data sources can be used to limit the number of possibilities, but also to better assign probabilities. The simplest scheme for assigning probabilities would be the following. Whenever the database conflicts with a foreign document on some element, we assign probability .5 to each resulting possibility. This approach has a severe drawback. For example, when a mobile device has once heard from another device that a person's name is "Mark" and it meets another device which says this name is "Marc", the scheme assigns the possibilities "Mark" and "Marc" each a probability of .5. But, when it has already heard from ninety-nine different devices that the name of the person is "Mark", it should be very suspicious when it meets a device that says this person's name is "Marc". Therefore, it should give the possibility "Marc" a very small probability of, say, .01.
The rather basic rule engine of our prototype has a still simple but sufficient scheme for assigning probabilities. It is based on the premise that what you have seen twice, is twice as likely to be correct. In other words, a confidence score should be kept in the data. This factor is an indication of how certain we are about the data, which helps in assigning probabilities when integrating new data. Every time a different device claims a certain possibility is true, the confidence score is increased by one.
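A sketch of this scheme (our reading of it; the chapter does not spell out the exact update rule of the prototype): each possibility keeps a confidence count, and probabilities are taken proportional to the counts.

```python
def assign_probabilities(confidence):
    # confidence: possibility -> number of independent devices that claimed it.
    # "Seen twice is twice as likely": probabilities proportional to the counts.
    total = sum(confidence.values())
    return {value: count / total for value, count in confidence.items()}

counts = {"Mark": 99, "Marc": 1}
print(assign_probabilities(counts))   # {'Mark': 0.99, 'Marc': 0.01}
```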
7 Conclusions

In this chapter we described a probabilistic data model to store relational data. The query language was extended to support the probabilities associated with attributes using the possible world approach. Also a new aggregate function, EXP, was introduced. This function calculates the expected value of a numerical field and can be considered a bridge between the possible worlds and the real world. With this relational model, however, we were unable to model mutual exclusion, dependent probabilities, open worlds or possible worlds with different real world objects. To overcome some of the problems faced in the relational model, the probabilistic model was adapted to be used with XML. With XML, mutual exclusion, dependent probabilities and possible worlds with different real world objects can be modelled. Open worlds could easily be modelled using XML; however, querying the tree of an open world is still an open issue.
References

[1] Daniel Barbará, Hector Garcia-Molina, and Daryl Porter. A probabilistic relational data model. In François Bancilhon, Costantino Thanos, and Dennis Tsichritzis, editors, Advances in Database Technology - EDBT'90. International Conference on Extending Database Technology, Venice, Italy, March 26-30, 1990, Proceedings, volume 416 of Lecture Notes in Computer Science, pages 60–74. Springer, 1990.
[2] Ander de Keijzer and Maurice van Keulen. A possible world approach to uncertain relational data. In Supporting Imprecision and Uncertainty in Flexible Databases, 2004.
[3] Ander de Keijzer and Maurice van Keulen. A probabilistic database extension. Technical report, CTIT, TR-04-21, 2004.
[4] Debabrata Dey and Sumit Sarkar. A probabilistic relational model and algebra. ACM Trans. Database Syst., 21(3):339–369, 1996.
[5] Thomas Eiter, Thomas Lukasiewicz, and Michael Walter. A data model and algebra for probabilistic complex values. Annals of Mathematics and Artificial Intelligence, 33(2-4):205–252, 2001.
[6] Norbert Fuhr and Thomas Rölleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 15(1):32–66, 1997.
[7] Laks V. S. Lakshmanan, Nicola Leone, Robert Ross, and V. S. Subrahmanian. ProbView: a flexible probabilistic database system. ACM Transactions on Database Systems, 22(3):419–469, 1997.
[8] M. van Keulen, A. de Keijzer, and W. Alink. A probabilistic XML approach to data integration. In Proceedings of the International Conference on Data Engineering, ICDE 2005, 2005.
[9] E. Zimanyi and A. Pirotte. Imperfect information in relational databases. In A. Motro and P. Smets, editors, Uncertainty Management in Information Systems, 1997.
An Overview of Imperfection Representation in Semistructured Data

Matteo Magnani (1) and Danilo Montesi (2)

(1) Department of Computer Science, University of Bologna, Mura Anteo Zamboni 7, 40127 Bologna, Italy
[email protected]
(2) Department of Mathematics and Informatics, University of Camerino, Via Madonna delle Carceri 9, 62032 Camerino MC, Italy
[email protected]
1 Introduction Today, many important database applications have to deal with data that is both semistructured and either imprecise or uncertain. As an example, scientific databases are likely to contain data affected by some types of imperfection, with a schema that is not well-defined a priori [6, ch.10]. A different area with similar features is that of structured information retrieval, which has become very popular after the spread of XML documents [15]. As a consequence, several models for imperfect semistructured data have been proposed by the scientific community [22; 2; 13; 12; 18]. These models are similar in spirit to the probabilistic extensions of the relational model developed in the Nineties [3; 26; 9; 11; 17]. In particular, probabilities have been attached to nodes instead of tuples or attribute values. As we will show in this chapter, this attitude is useful to model only a few kinds of imperfection, because of the greater complexity of semistructured data. Many kinds of imperfection, such as uncertainty and imprecision, may coexist in the same data instance, and may affect several data representation dimensions. Data representation dimensions are the constituents of a model used to represent data, e.g., nodes, arcs, order, contents and labels in semistructured data. In the first part of the chapter we provide a classification of possible dimensions of imperfection that are likely to affect semistructured data. Dimensions of imperfection are obtained by combining data representation dimensions with relevant kinds of imperfection. For example, the content of a node may be affected by uncertainty, as well as imprecision, labels can be missing, and so on. Afterwards, we use our classification to review existing models for imperfect semistructured data. In particular, we focus on the representation of imperfection, and not on its manipulation. The reader interested in query languages for imperfect XML data will find some useful publications in the area of structured information retrieval (IR). However, these languages M. Magnani and D. Montesi: An Overview of Imperfection Representation in Semistructured Data, Studfuzz 203, 221–239 (2006) c Springer-Verlag Berlin Heidelberg 2006 www.springerlink.com
are used to manipulate data that is not imperfect, therefore we do not address them in this survey. The objective of this contribution is not the definition of a new model, but a preliminary analysis that should be performed before the definition of new models. In the aforementioned proposals, a formalism to represent imperfection is chosen a priori, without clarifying what kind of imperfection it is supposed to manage. On the contrary, we claim that we should first understand how data can be affected by imperfection, that is the aim of this chapter, and only subsequently we can choose the most appropriate formalism(s) to manage it. The main motivation of this contribution is the need for a clear understanding of how imperfection can affect semistructured data. To this aim, in Sect. 2 we discuss existing taxonomies of imperfection. Knowledge about different kinds of imperfection is very important, as they usually require the application of different methods and formalisms [29]. Then, we reconsider the identified classes of imperfection in the context of data management. In particular, in Sect. 3 we define a general data structure to specify data representation dimensions for semistructured data. The definition of a data structure is necessary, because we cannot understand how information is missing without defining what information can and cannot be represented by our model. However, it is general enough to consider our results extensively valuable, simple to understand, and its formal definition allows us to clearly identify its data representation dimensions. In Sect. 4 we combine classes of imperfection and data representation dimensions, to identify possible dimensions of imperfection in semistructured data. In particular, in Sect. 4.1 we check which types of imperfection can locally affect each data representation dimension, e.g., labels, nodes, and arcs. Then, in Sect. 4.2, we identify possible relationships between different dimensions of imperfection. We conclude Sect. 4 with an example, explaining why previous work on the relational model has identified only one main dimension of uncertainty, determining a similar approach also on early proposals for semistructured data models. Finally, Sect. 5 is devoted to the comparison of existing models for imperfect semistructured data, using our classification. We conclude the chapter highlighting the results of our investigation. Part of this chapter is an extended version of [19].
2 A taxonomy of imperfect information

When we do not know everything we would like or need to know about a real-world event, we are ignorant, and our knowledge is imperfect. Ignorance is a state of incomplete knowledge. In particular, ignorance and knowledge bound each other, and information may be viewed as a reduction of ignorance [16]. When we deal with complex systems, ignorance is often unavoidable, and it is sometimes necessary in order to reduce the complexity of information. When data
is affected by ignorance, we say it is imperfect. In the following we will use imperfect data and data affected by ignorance as synonyms. Parsons points out that the importance of a taxonomy is not to give a definitive characterization of imperfection (which he calls uncertainty), but to allow distinctions to be drawn [24]. Put another way, a classification is not right or wrong, but it can be more or less useful. Similarly, Smets states that a full inventory of types of ignorance is not practically possible [28]. In the literature, many taxonomies have been proposed, and we can notice a general agreement on some high-level classes [4; 30; 21; 8; 28; 23]. At a first glance, these taxonomies may seem incompatible, but this is mainly due to linguistic ambiguities. For example, we may say that we are certain that Tom is either 12 or 13 years old. However, at the same time we are not certain about Tom's age: is he 12 or 13? In fact, the term uncertainty has been used to indicate different kinds of imperfection in different taxonomies. In the presence of such ambiguities, we need a characterization of imperfection that is as independent of natural language semantics as possible. This can be done in many different ways [10], and we will use a set-based approach, which is probably the most typical choice.

As we have already said, ignorance can be thought of as a lack of knowledge. Therefore, the way in which knowledge is missing may be used as a principle to classify it. To define knowledge, we must assume the existence of someone (or something) who knows. We call it a cognitive being. A cognitive being can believe in something that we may call events, ideas, propositions, or in general beliefs, depending on the context. A taxonomy of ignorance depends on the way in which we define what a belief is. Many alternative definitions are possible, and their effectiveness depends on the number and kind of classes they can discriminate. Throughout this chapter we will use the following scenario, illustrated in Fig. 1. Perfect information is defined to be an element of a set U of mutually exclusive elementary events (∀e1, e2 ∈ U, ¬(e1 ∧ e2)). In the following, we will call this element t. For example, if knowledge concerns the color of a monochrome object, U can be the set {black, blue, yellow, . . . } whose elements are all the colors listed in some English dictionary. Notice that we could also use the set of all RGB colors (an RGB color is defined by three integers, representing the amounts of red, green and blue); this depends on the application requirements. Cognitive beings do not always know all possible alternative events. Based on the subset of U known by the cognitive being, which we call KU, we can represent knowledge as an assignment of some quantified belief to a finite set of propositions built over KU. For example, knowledge corresponding to "The object is either blue or yellow" is represented by proposition p1 = {blue, yellow}, and we can believe p1 to be possible, certain, or probable. Other examples of propositions are p2 = {green} and p3 = dark; in this last case, p3 can be represented using a fuzzy set. Different kinds of imperfection are characterized by different settings of the constituents of knowledge, i.e., quantified beliefs and propositions, KU and U.
Fig. 1. A scenario for the classification of ignorance. A cognitive being (CB), whose knowledge is described by this scenario, assigns some quantified belief (QB) to a finite set P of propositions, built over a finite set KU (Known Universe) of elementary events. U (Universe) is the set of all possible alternative elementary events.
Quantified beliefs (Bel) are functions from the finite set of propositions P to an ordered set of values representing ascending degrees of belief. When some belief is assigned to a proposition p, we say that p is a supported proposition, and we call the union of the supported propositions the Core C of a belief function. As examples of belief scales, we may consider the set {impossible, unlikely, probable, sure}, the interval [0, 1], or the interval [0, 100]. The range may vary depending on formalisms and applications, and we will use 0Bel and 1Bel to indicate the smallest and highest values of a function Bel. The kind of quantified belief is defined by constraints on P and on the beliefs that can be assigned to its elements. Depending on the interpretation of these functions, quantified beliefs can represent subjective probability, possibility, necessity, belief, and plausibility, which are real-world modalities of belief and should not be confused with the mathematical theories with the same names [10; 27]. In this chapter we do not deal with these theories, because our analysis is preliminary to their choice. The data models discussed in the second part of the chapter use probability theory, without specifying any interpretation. Therefore, we will focus on a general kind of quantified belief that assigns to each proposition a degree of certainty, ranging from "not believed" to "completely believed, or certain".

One of the most typical kinds of imperfection is characterized by partial beliefs: the cognitive being assigns to some proposition p only part of the maximum amount of belief that can be assigned to a single proposition, which would indicate certainty. This can be due to several reasons: we may not trust the source of information supporting p, or p may represent a forecast, or p may derive from a generalization of past events (induction), just to mention a
few examples. For instance, we may say that "Perhaps, John is grey-haired". This kind of imperfection is usually called uncertainty (UN), but this term should be used carefully, because it is often intended in a more general sense. Uncertainty is independent of the kind of proposition to which the belief is assigned, but is related to our trust in it. More formally, we have uncertainty when: ∃p ∈ P | 0Bel < Bel(p) < 1Bel.

If no belief is assigned to any proposition, we have absence (ABS). A very unusual and extreme kind of absence corresponds to KU = ∅, where we cannot even formulate hypotheses (define propositions) because we do not know any possible elementary events. Absence is sometimes called incompleteness, but this term is also used with a more general meaning to indicate "incomplete information", and thus ignorance. A less strict kind of absence corresponds to the situation where we know some events, but we do not believe in any of them, although we would be able to assign some beliefs. For example, "We do not know the color of John's hair". When we know that KU = U, there is no difference between assigning no belief at all and assigning all the belief to KU. This case lies in the middle between absence and non-specificity, which we introduce in the next paragraph, and it is frequently present in databases, where domain constraints may represent this kind of knowledge in practical situations. Formally, we have absence when: ∄p ∈ P | Bel(p) > 0Bel.

Uncertainty and absence depend on the way beliefs are assigned to propositions. Other types of imperfection are characterized by the kinds of proposition supported by some belief. In particular, the number of elementary events composing a supported proposition defines the precision of our knowledge. If this number is greater than one, information is imprecise (IMP). There are different varieties of imprecision, depending on the way elementary events are grouped together. When propositions are disjunctions of elementary events, i.e., they are crisp sets, imprecision is usually called non-specificity (NS). For example, assume that U = {black, brown, blonde, white} represents all possible colors of John's hair. "John has black or brown hair" is a form of non-specificity. In this case, p = {black, brown}. Notice that non-specificity depends on our knowledge, but also on the level of detail of U. We can express non-specificity as follows: ∃p ∈ P, ∃e1, e2 ∈ KU, e1 ≠ e2 | Bel(p) > 0Bel ∧ e1, e2 ∈ p.

When a set aggregating elementary events is ill-defined, we have vagueness (VAG). Vagueness and non-specificity are respectively the ill-defined and the crisp aspect of the same type of imperfection, i.e., imprecision. For example, we may say that "John has dark hair". If U is the previously defined set, this knowledge is vague. In principle, we may have an ill-defined set of only one
element, and in this case vagueness would not be associated with imprecision. However, we cannot identify any practical utility of this case, so we will not consider it. In general vagueness is not associated with a particular formalism. For example, we may represent the knowledge "John has dark hair" with the fuzzy set D = {black(1.0), brown(0.8)}, meaning that brown is quite dark (fuzziness). Or we may use the rough set with lower approximation {black} and upper approximation {black, brown} (the approximations of the set of dark colors D based on the color C), suggesting that sometimes brown hair may not be defined as dark [25]. In the following definition, the predicate "ill-defined(p)" returns true when p is a proper fuzzy set, rough set, or any other kind of ill-defined set: ∃p ∈ P | Bel(p) > 0Bel ∧ ill-defined(p).

Inconsistency (INC) is characterized by a positive belief assigned to the empty proposition. For example, we can assign some belief to the proposition "John is 13 and 15 years old". Many theories do not take inconsistency into account, requiring the belief assigned to the empty set to be equal to zero. However, inconsistency is not unusual in real situations. Formally, Bel(∅) > 0Bel.

Finally, our knowledge can be wrong. Information is erroneous (ERR) when the true elementary event does not belong to any supported proposition. This can be due to the fact that the true elementary event does not belong to KU, or to the fact that we do not believe in it. Formally, t ∉ C.

In Table 1 we have listed the classes of imperfection defined in this section that we will use in the remainder of the chapter. Notice that they are not disjoint. Ignorance can be present in different forms at the same time. When there is absence, we know nothing, so we cannot have uncertainty, vagueness or non-specificity. Otherwise, we may have uncertain but precise, imprecise but certain, or both uncertain and imprecise data. For example, we can believe that "perhaps, John is between 180 and 190 cm. tall", or "perhaps, John is very tall", combining imprecision with uncertainty.
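To make these set-based conditions concrete, the following Python sketch (ours, not part of the chapter; the encoding of propositions and all function names are illustrative assumptions) tests the defining condition of each class of imperfection against a quantified belief assignment, with crisp propositions represented as frozensets of elementary events and vague ones as frozensets of (event, degree) pairs:

# Belief assignment: proposition -> degree of belief in [0, 1] (0Bel = 0, 1Bel = 1).
# Crisp propositions are frozensets of events; a vague proposition is modeled here
# as a frozenset of (event, membership degree) pairs, a crude stand-in for a fuzzy set.
def is_vague(p):
    return any(isinstance(e, tuple) for e in p)

def events(p):
    return {e[0] if isinstance(e, tuple) else e for e in p}

def uncertainty(bel):        # exists p with 0Bel < Bel(p) < 1Bel
    return any(0.0 < b < 1.0 for b in bel.values())

def absence(bel):            # no proposition receives positive belief
    return not any(b > 0.0 for b in bel.values())

def non_specificity(bel):    # a supported proposition contains two distinct events
    return any(b > 0.0 and len(events(p)) > 1 for p, b in bel.items())

def vagueness(bel):          # a supported proposition is ill-defined
    return any(b > 0.0 and is_vague(p) for p, b in bel.items())

def inconsistency(bel):      # positive belief on the empty proposition
    return bel.get(frozenset(), 0.0) > 0.0

def error(bel, true_event):  # the true event lies outside the Core C
    core = set()
    for p, b in bel.items():
        if b > 0.0:
            core |= events(p)
    return true_event not in core

# "Perhaps John has black or brown hair": uncertain and non-specific, not vague.
bel = {frozenset({"black", "brown"}): 0.7}
print(uncertainty(bel), non_specificity(bel), vagueness(bel), error(bel, "black"))

Running the last lines prints True True False False, i.e., the example combines uncertainty with non-specificity, as in the prose above.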
3 A reference semistructured data model

In the previous section we have defined some classes of imperfection based on quantified beliefs, propositions, and elementary events. However, we have not specified a semantics for the elements of U. In this way, our classes of imperfection are independent of the application context. To apply them to
Table 1. The classes of imperfection we consider in this chapter, with abbreviations and significant examples. In the real world, John is 183 cm. tall, and U can be the set {0, 1, . . . , 271, 272}, according to http://www.google.com/search?q=the+tallest+man

Class             Abbr.   Example: John's tallness
Absence           ABS     Not known.
Non-Specificity   NS      Between 180 and 190 cm. 183 or 184 or 185 cm.
Vagueness         VAG     Not very tall.
Uncertainty       UN      Perhaps, 183 cm.
Inconsistency     INC     183 and 184 and 185 cm.
Error             ERR     170 cm.
semistructured data modelling, we need to specify what U represents in this context, i.e., what knowledge is about. A data structure defines a projection of the reality on a finite set of data representation dimensions. These dimensions clearly identify everything we can know. As a consequence, they can be used to specify the elements of U. A formal definition of a semistructured data structure is then necessary to identify possible dimensions of imperfection. However, it must be general enough to subsume the ones adopted by existing data models. As it is typical, we use a (directed acyclic) graph extended with additional data [1; 5; 31].

Definition 1. A data graph is a tuple d = (V, E, ≤, λ, τ, δ), where:
• V = {v1, . . . , vn} is a finite set of vertices;
• E ⊂ {(vi, vj) | vi, vj ∈ V} and (V, E) is a directed acyclic graph (DAG);
• ≤ is a possibly empty partial order on V;
• λ : V → (L ∪ {null}), where L is a set of names;
• τ : V → T, where T is a set of types;
• ∀v ∈ V, δ(v) = ⊕(c(v), ≤) if outdegree(v) ≥ 1, and δ(v) = δ′(v) otherwise,
where ⊕ is a parametric concatenation operator, c(v) is the set of v's children, and δ′ is a content function defined on leaf nodes.
The intuition about an edge (vi, vj) is "vj is part of vi". Therefore, inside the model the sub-structure of leaves is not stored, but they are not ontologically different from internal nodes. Nodes may have labels (λ) and they may be ordered (≤). Every node has a type (τ) and a content (δ). As nodes represent different levels of detail of the same reality, a content function may be defined for every node, and it must be an aggregation of the content of its children. The function δ is parameterized, because of the generality of semistructured data models. In particular, its recursive definition is based on an elementary function (δ′), which extracts the content of simple-typed nodes
(leaves), and a concatenation mechanism. Both these parameters depend on the peculiar application of the model. For example, in the context of an XML application we can define:

⊕≤({v1, . . . , vm}) = δ(vj1) · ␣ · . . . · ␣ · δ(vjm),   δ′(v) = fn:string(v)   (1)
where ∀i, p ∈ [1, m] (i ≤ p ⇒ vji ≤ vjp). Here ≤ is the document order, · is the string concatenation operator, ␣ is the blank character and fn:string(v) returns the content of v converted as string. This function has a semantics similar to the one with the same name defined in the XQuery Functions and Operators specification, when applied to leaf nodes [20]. As another example, consider a graph representing overlapping regions, where nodes represent space regions and can be subregions (children) of several other nodes. Here, (1) would lead to an undesired result, because the graph is not ordered, and we do not want to consider the same region many times. A possible instantiation for δ is:

⊕({v1, . . . , vm}) = {o | o ∈ δ(v1) ∨ · · · ∨ o ∈ δ(vm)},   δ′(v) = {λ(v)}   (2)
where λ extracts the name of the regions corresponding to the leaves of the graph. Notice that the computation of δ converges, because data graphs are both acyclic and finite. Data graphs can be organized into collections, folders, or classes, but in this chapter we focus on their internal imperfection. However, when a data graph is embedded in a super-structure, it can be useful to consider uncertainty about whether it belongs to that structure or not.

Example 1 (XML document repository). Let x be the following XML document:

<book>
  <author>Hugo</author>
  <title>Notre-Dame de Paris</title>
</book>
We can model it using a data graph containing the following data tree:
• V = {v0, v1, v2, v3, v4, v5}
• E = {(v0, v1), (v1, v2), (v1, v4), (v2, v3), (v4, v5)}
• ≤ = {(vi, vj) | i ≤ j}
• λ(v0) = null, λ(v1) = book, λ(v2) = author, λ(v4) = title, λ(v3) = λ(v5) = null
• τ(v0) = document, τ(v1) = τ(v2) = τ(v4) = element, τ(v3) = τ(v5) = text (or other types defined by a DTD/XML Schema)
• δ(v0) = δ(v1) = 'Hugo Notre-Dame de Paris', δ(v2) = δ(v3) = 'Hugo', δ(v4) = δ(v5) = 'Notre-Dame de Paris'.
Here we have used the content aggregation equation (1). Many XML documents may be contained inside a collection, as illustrated in Fig. 2.
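For readers who prefer running code, the following Python sketch (our illustration; the class layout and the encoding of document order by node index are assumptions, not part of the chapter) implements the data graph of Example 1 together with the content aggregation of equation (1):

class DataGraph:
    def __init__(self, vertices, edges, labels, types, leaf_content):
        self.vertices = vertices            # V
        self.edges = edges                  # E: (parent, child) means "child is part of parent"
        self.labels = labels                # lambda
        self.types = types                  # tau
        self.leaf_content = leaf_content    # delta' on leaves

    def children(self, v):
        # Document order is encoded here simply by the node index (v0 <= v1 <= ...).
        return sorted(w for (u, w) in self.edges if u == v)

    def delta(self, v):
        # Equation (1): concatenate children's contents in document order,
        # separated by a blank; leaves return their own textual content.
        kids = self.children(v)
        if kids:
            return " ".join(self.delta(w) for w in kids)
        return self.leaf_content[v]

g = DataGraph(
    vertices={"v0", "v1", "v2", "v3", "v4", "v5"},
    edges={("v0", "v1"), ("v1", "v2"), ("v1", "v4"), ("v2", "v3"), ("v4", "v5")},
    labels={"v0": None, "v1": "book", "v2": "author", "v4": "title", "v3": None, "v5": None},
    types={"v0": "document", "v1": "element", "v2": "element", "v4": "element",
           "v3": "text", "v5": "text"},
    leaf_content={"v3": "Hugo", "v5": "Notre-Dame de Paris"},
)
print(g.delta("v1"))   # 'Hugo Notre-Dame de Paris'

Calling g.delta("v1") returns 'Hugo Notre-Dame de Paris', matching the value of δ(v1) listed above.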
Fig. 2. An XML document repository; roots are document nodes. The repository contains two document trees: a book with author Hugo and title Notre-Dame de Paris, and a book with author Manzoni and title I promessi sposi.
4 Dimensions of imperfection in semistructured data

Each data representation dimension can be affected by different kinds of ignorance. In Sect. 4.1 we discuss which classes of imperfection can be significantly associated with each given data representation dimension, through an analysis of the elements of U. We call the resulting combinations (local) dimensions of imperfection. In Sect. 4.2 we analyze possible dependencies between different dimensions of imperfection. For example, uncertainty on node existence induces uncertainty on the arcs, as the existence of an arc depends on the existence of the nodes it connects.

4.1 Local dimensions of imperfection

As we have already said, dimensions of ignorance in a semistructured data context are obtained by combining data representation dimensions (V, E, ≤, λ, τ, δ) and types of uncertainty (ABS, NS, VAG, UN, INC, ERR), but not all combinations are significant: for example, the concept of a "vague type" (here, consistently with our definition of knowledge, something that is a type only "up to a certain degree", not a type containing vague instances) is not used in real applications. The following considerations are illustrated in Table 2. A first general consideration is that our knowledge can be wrong, independently of the data representation dimension we are considering.
Table 2. Relevant dimensions of imperfection for semistructured data. The symbol × means that the combination of the corresponding data representation dimension and type of ignorance represents a significant dimension of imperfection. Empty cells indicate that the dimension is not significant.

      ABS   NS   VAG   UN   NS-UN   VAG-UN   INC   ERR
V           ×          ×     ×                      ×
E     ×     ×          ×     ×                      ×
≤     ×     ×          ×     ×               ×      ×
λ     ×     ×          ×     ×                      ×
τ           ×          ×     ×                      ×
δ     ×     ×    ×     ×     ×       ×       ×      ×
This is the reason why the last column of Table 2 is all checked. Inconsistency is usually due to the complexity of the language used to express propositions. If we know that a proposition corresponds to the empty set, we will not assign any belief to it. Therefore, inconsistency is not typical in nodes (V), arcs (E) and types (τ). On the contrary, it is more typical to express contents and labels with complex expressions. Similarly, in expressing the relative order of nodes, it is possible to provide equations with no solutions. However, inconsistency is often easy to identify and correct, and rarely affects labels.

Now, we consider the ontological objects of the model, i.e., nodes (V), arcs (E), and order (≤). We call them ontological, because they are based on being, or characterized by existence. V represents the existence of a data graph. E and ≤ concern the structure of the graph. V is the most important set of ontological objects, because the other data representation dimensions are based on it. In this case, the elements of U are sets of nodes W ⊆ V, where W means that every v ∈ W exists and every v ∉ W does not exist. Even if a given data instance is composed of a finite number of nodes, we assume not to know V (ontological ignorance). This is due to the nature of semistructured data, where we do not know a priori if new nodes will join an existing instance. Both non-specificity and uncertainty can affect V. In fact, we can say that "a data graph consists of three or four nodes", as in Fig. 3, or that "we are quite sure that a data graph is composed of two nodes". On the contrary, absence would be possible in theory, but we do not take it into consideration: if we know nothing about a data graph, we have no reason to consider it. Remember that absence does not mean that we do not know if a particular node exists (this is NS). Additionally, vagueness is meaningless for ontological objects, because existence is a crisp concept, so we do not consider it. Figure 3 shows some possible elements of KU, when we consider vertexes. E and ≤ have a behavior similar to V. The only difference is that absence may be significant. Anyway, it is unusual, because absence means that we do not know structural information, and we do not even know if there is a structure.
Fig. 3. An example of elements of KU for vertexes. They refer to data graphs with, respectively, three and four nodes
Examples of possible elements of KU for arcs and order are represented in Figs. 4 and 5.
Fig. 4. An example of elements of KU for arcs between nodes.
Fig. 5. An example of elements of KU for the relative order of nodes.
The other dimensions of the model, i.e., λ, τ, and δ, are functional properties with domain V. Possible ranges of these functions are illustrated in Figs. 6, 7, and 8. In this case, for every function f(v), U represents its range. The types of ignorance that may affect these dimensions depend on the semantics of the range. Absence is not relevant for τ, in the sense that it can be managed outside the model: if we do not know anything about a type, we may specify it using any type (⊥). In principle, vagueness can affect types, but vague types are not used in real applications and, to the best of our knowledge, there is no significant theoretical framework about them. As a consequence, we do not consider them in our taxonomy. On the contrary, non-specificity and uncertainty may affect our knowledge about types. For instance, we may know that a node is either a MSOffice or an OpenOffice file (NS). λ is similar to τ, but in this case absence is not usually managed outside the model. It is worth noticing that absence means that we do not know anything about a label, while a null value indicates that we know that there is no label. δ is the only dimension that can reasonably be affected by all types of ignorance.
Fig. 6. An example of elements of KU for types (here: JPEG, GIF, BMP, TIFF).
Fig. 7. An example of elements of KU for labels (here: null, P, H1, BODY).
In particular, we may know nothing about it (ABS), or we may know a set of possible values (NS), sometimes ill-defined (VAG); we may not be certain about it (UN), and our knowledge can be partially inconsistent (INC) or completely wrong (ERR). The examples of Table 1 are typical cases of δ-ignorance. Additionally, the definition of δ states that the content of internal nodes may depend on that of leaves. This means that a model for imperfect semistructured data should provide a mechanism to propagate ignorance from leaves to internal nodes.
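A minimal sketch of such a propagation mechanism, under the simplifying assumption that leaf distributions are independent (the sketch and its names are ours, not a construction proposed in the chapter), combines probability distributions over leaf contents into a distribution over the concatenated content of their parent:

from itertools import product

def propagate(leaf_dists):
    # Combine independent distributions over leaf contents (given in document
    # order) into a distribution over the parent's concatenated content.
    combined = {}
    for choice in product(*(d.items() for d in leaf_dists)):
        content = " ".join(value for value, _ in choice)
        prob = 1.0
        for _, p in choice:
            prob *= p
        combined[content] = combined.get(content, 0.0) + prob
    return combined

# The author is certainly 'Hugo'; the title is uncertain between two spellings.
author = {"Hugo": 1.0}
title = {"Notre-Dame de Paris": 0.8, "Notre Dame de Paris": 0.2}
print(propagate([author, title]))
# {'Hugo Notre-Dame de Paris': 0.8, 'Hugo Notre Dame de Paris': 0.2}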
Fig. 8. An example of elements of KU for contents (here: 1, 2, . . . , 272).
4.2 Global imperfection dependencies

In the previous sections we have identified several dimensions of imperfection. Sometimes we find that only one of them is present in a specific data instance. However, it may happen that many dimensions co-exist. In particular, some of them depend on others. Figure 9 illustrates this point. In Fig. 9(a) we have represented a semistructured instance with uncertainty on arc (v2, v3). As already pointed out, the content of internal nodes depends on that of leaves. If we are not sure about the existence of arc (v2, v3), we cannot assert that node v5 contributes to the content of v2, even if we have complete knowledge of δ(v5) and δ(v6). Therefore, we can say that ignorance on the arcs induces ignorance on δ. If there is ignorance on E and also on δ′, the propagation mechanism we mentioned at the end of the previous section should combine them. Moreover, ignorance on every dimension of the model is influenced by ignorance on the set of nodes (V). In fact, if a node does not exist, we cannot have structural or functional properties on it. Figures 9(b) and 9(c) show two possible interpretations of ignorance on the existence of node v3. In the first case, non-existence of a node implies that all its subparts that are not subparts of other nodes do not exist. Figure 9(c) offers a different interpretation, where arcs are redistributed and transitive relationships of "being subpart of" are maintained.
This example should have clarified that different dimensions of ignorance may influence each other. We can distinguish between two kinds of dependencies: casual and model dependencies. Model dependencies are induced by the definition of the data model, and they are always present. For example, uncertainty on a node always induces uncertainty on the other dimensions, independently of the application or semantics of the data graph. This happens because the definition of the other dimensions is based on V. As a consequence, we may state that I(V) ⇝ I(E, ≤, τ, λ, δ) and I(E, ≤) ⇝ I(δ), where I(X) ⇝ I(Y) reads "ignorance on X induces (or leads to) ignorance on Y". In addition, there may be casual dependencies. For example, in the context of a particular application the fact that a node has content "male" may determine the absence of a husband sibling. In this case, ignorance on labels or types can influence other dimensions of imperfection. Anyway, this is context-dependent, so we do not address this problem here (even if an investigation of all possible dependencies could be interesting).
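The two interpretations of node non-existence in Fig. 9 can also be phrased operationally; the following Python sketch (ours, with a hypothetical edge set, since the chapter does not commit to either semantics) removes a possibly non-existent node either together with the subparts reachable only through it, or by re-attaching its children to its parents:

def drop_with_subparts(edges, node, roots):
    # Fig. 9(b)-style semantics: the node disappears, together with every
    # descendant that is no longer reachable from a root without it.
    kept = {(u, w) for (u, w) in edges if node not in (u, w)}
    reachable, frontier = set(roots), set(roots)
    while frontier:
        frontier = {w for (u, w) in kept if u in frontier} - reachable
        reachable |= frontier
    return {(u, w) for (u, w) in kept if u in reachable}

def reattach_children(edges, node):
    # Fig. 9(c)-style semantics: arcs are redistributed, so the node's children
    # become children of its parents and transitive part-of links are preserved.
    parents = {u for (u, w) in edges if w == node}
    children = {w for (u, w) in edges if u == node}
    kept = {(u, w) for (u, w) in edges if node not in (u, w)}
    return kept | {(p, c) for p in parents for c in children}

# A small hypothetical data graph: v5 is shared between v2 and v3, v6 is not.
edges = {("v1", "v2"), ("v1", "v3"), ("v2", "v5"), ("v3", "v5"), ("v3", "v6")}
print(sorted(drop_with_subparts(edges, "v3", roots={"v1"})))
# [('v1', 'v2'), ('v2', 'v5')]
print(sorted(reattach_children(edges, "v3")))
# [('v1', 'v2'), ('v1', 'v5'), ('v1', 'v6'), ('v2', 'v5')]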
Fig. 9. Ignorance interactions.
4.3 The relational case

As an example, we show what happens when we restrict our model to the relational one. Here, a relation on attributes (A1 . . . An) can be represented by a collection of data graphs, where each tuple t = t1, . . . , tn corresponds to a data graph d = (V, E, ≤, λ, τ, δ) with:
• V = {v1, . . . , vn}
• E = ∅
• ≤ = ∅
• λ(vi) = Ai
• τ(vi) = Dom(Ai)
• δ(vi) = ti.
Notice that δ is compatible with our definition of data graph. We also assume that τ and Dom refer to the same set of values.
In this case, only four dimensions of the model are active: V, λ, τ, and δ. On the contrary, E and ≤ are empty. Moreover, additional restrictions further limit the presence of ignorance. The attribute names and the domains (λ and τ) are not retrieved by typical relational query languages; additionally, the schema restricts them to be fixed. Therefore, they are not affected by ignorance. The set of nodes (V) is fixed, due to the strict tabular structure of the relational model, so we assume that all nodes always exist. The only exception is when we use the null value to indicate that an attribute does not exist; anyway, this interpretation of null as ontological absence is not distinguished from other interpretations related to the value of the attributes, so this case is not treated in a specific way. Finally, we may have ignorance on the values of the nodes (δ), which is the only relevant dimension of ignorance at tuple level and has been studied in many scientific papers. To sum up, the fact that ignorance in the relational model is mono-dimensional is due to its simplicity. When the data structure we have introduced to represent semistructured data is used in a more general way, many other dimensions can be worth managing.
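A small sketch of the mapping just described (ours; the dictionary-based representation and the function name are illustrative assumptions) turns a relational tuple into the degenerate data graph of this section, in which E and ≤ are empty and δ coincides with δ′:

def tuple_to_data_graph(attributes, domains, values):
    # One node per attribute; no arcs and no order, so every node is a leaf.
    n = len(attributes)
    return {
        "V": [f"v{i}" for i in range(1, n + 1)],
        "E": set(),
        "order": set(),
        "lambda": {f"v{i}": attributes[i - 1] for i in range(1, n + 1)},
        "tau": {f"v{i}": domains[i - 1] for i in range(1, n + 1)},
        "delta": {f"v{i}": values[i - 1] for i in range(1, n + 1)},
    }

g = tuple_to_data_graph(["name", "height"], ["string", "int"], ["John", 183])
print(g["lambda"]["v1"], g["delta"]["v2"])   # name 183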
5 Existing models for imperfect semistructured data

In this section we briefly present the main existing models for imperfect semistructured data, in the light of the dimensions of imperfection identified in Sect. 4. The objective of these models is to extend semistructured data with probabilities, and not to represent specific dimensions of imperfection. Therefore, the fact that some dimensions are not managed by them should not be seen as a limitation. However, the following analysis will be useful to clearly identify possible extensions of these models, and to change or broaden their applicability. Additionally, the adoption of probabilities makes it difficult to state with certainty if some dimensions of imperfection can be modelled or not. Probability theory can be interpreted in several different ways, and the fact that an interpretation has not been specified for these models makes them general, but also difficult to classify. Where possible, we refer to the interpretation implicitly suggested by the motivating examples used by the authors to present their proposals. Table 3 summarizes the following discussion.

The fact that the models described in this section are based on probabilities allows us to discuss some features that are common to all of them. First, consider inconsistency and error. Inconsistency cannot be represented by probability theory, because the probability of the empty set must be equal to zero (a corollary of Kolmogorov's axioms). This does not represent a severe limitation in applications where inconsistency is not typical, or is simple to identify and remove. However, using probabilistic models we cannot represent someone's generic belief about a semistructured data instance. Inconsistency and error are intersecting classes of imperfection. We can have inconsistency without error, if part of the belief is assigned to the
correct event and part to an inconsistent proposition, and we can have erroneous knowledge that is not inconsistent. However, inconsistency can be erroneous, so the models under consideration cannot manage error in general. When erroneous knowledge is consistent, we can represent it as if it was not erroneous, but probability theory does not specify how to combine two descriptions of the same reality where probabilities support disjoint alternatives. As a final general remark, we do not regard probability theory as the best formalism to represent vagueness. While vagueness refers to features that are graded in the real world, probability concerns events that are either true or false. As probability is an axiomatic theory, it can also be interpreted to adapt to vagueness. However, this interpretation has not been discussed in the works we are considering.

5.1 Semistructured probabilistic object data model

To the best of our knowledge, the first published proposal of a semistructured probabilistic data model goes back to the year 2001 (the Semistructured Probabilistic Object (SPO) data model) [7]. It is considerably different from the other models cited in this section: it is based on relations, where tuples may have a variable number of attributes, and attributes represent random variables. While a probabilistic semistructured database can be represented in XML, the structural features of XML are not directly supported by this model [32]. In particular, the term "semistructured" refers to the variable number of attributes. Therefore, this model can only represent uncertainty on values, and not on structure. As the content of internal XML nodes cannot be represented, uncertain node content is only partially supported.

5.2 ProTDB and TIX

The first XML data model extended to manage uncertain information was TAX [14]. Two relevant extensions of TAX have been proposed. The first defines a probabilistic XML database, ProTDB [22]. The second, TIX, investigates the use of probabilistic information and new algebraic operators for XML information retrieval tasks [2]. Like ProTDB, TIX (an algebra for querying Text In XML) uses attributes to attach probabilities to XML trees. TIX could also be used to represent probabilistic data. However, we do not consider it in Table 3, because [2] focuses on information retrieval applications, where probabilities represent the relevance of XML fragments: the data is perfect, but we do not know if we are interested in it. ProTDB (Probabilistic Tree Data Base) shows how to annotate an XML document with probabilistic information [22]. This can be viewed as a probabilistic model, moving probability annotations from the XML data representation to the underlying model. TAX-like pattern matching can be applied to this model. Compared with the SPO data model, ProTDB can represent
uncertainty on the structure of XML trees. However, the approach used to represent probabilities is based on the assumption that node existence (probability) is independent of the other nodes. The authors have tried to overcome this problem by defining ad hoc XML elements indicating specific probability distributions, but the representation of uncertainty on nodes is still partial. The same approach is applied to the content of leaf nodes. As probabilities concern nodes, uncertainty on arcs is not directly managed by ProTDB.

5.3 PXML and PIXML

PXML is another model for probabilistic semistructured data [13]. Given a node, it can express any discrete probability distribution over its possible children. Simple distributions may be represented in a compact way, reducing their complexity. PXML distinguishes between leaves and internal nodes, and only leaves have a content. Therefore, the management of uncertainty on content is limited to leaves. This model does not assume node independence, so it has no limitations in representing uncertainty on the structure of XML trees. PIXML (Probabilistic Interval XML) is another version of PXML, where the model is generalized to manage interval probabilities [12]. Using interval probabilities, PIXML can represent more dimensions of imperfection than the models based on simple probabilities. In particular, non-specificity on nodes and arcs can be managed, combined or not with uncertainty. PIXML can also partially represent absence, corresponding to the assignment of a probability interval [0, 1] to every possible alternative instance. However, the set of alternative instances is strictly determined. In fact, both PXML and PIXML are based on a sort of schema, called weak instance, that defines the valid number of children for every node. Therefore, this can be considered a case of non-specificity. Uncertainty on leaves is not managed: this model consistently focuses on uncertain structures.

5.4 A model based on evidence theory

We conclude our overview with a model based on Dempster–Shafer's theory (also known as evidence theory) [18]. Although this model provides a query algebra to manipulate imperfect trees, from the point of view of imperfection representation it is no more expressive than PIXML. In fact, evidence theory is used to represent interval probabilities, and both models are based on a "possible worlds" semantics.
6 Concluding remarks

In this chapter we have studied how imperfection can affect semistructured data. In particular, we have defined a taxonomy of dimensions of imperfection, based on widely accepted classes of ignorance and semistructured data
Table 3. Relevant dimensions of imperfection managed by the models cited in this section. A (−) near the name of a model means that the corresponding dimension of imperfection is only partially supported. The names of the models have been abbreviated as follows: SP (SPO data model), PR (ProTDB), PX (PXML), PI (PIXML), and DS (the model based on Dempster–Shafer's theory).

      ABS   NS       VAG   UN                    NS-UN    VAG-UN   INC   ERR
V           PI, DS         PR(−), PX, PI, DS     PI, DS
E           PI, DS         PX, PI, DS            PI, DS
≤
λ
τ
δ                          SP(−), PR(−), PX(−)
representation dimensions. We have also pointed out that several dimensions of imperfection can co-exist inside the same data instance, with possible interactions between them. This has three important consequences. First, it can be necessary to integrate different formalisms to manage different kinds of imperfect data. Existing data models can already deal with more than one dimension of imperfection, but several others have not been considered yet. The second consequence concerns the dependencies between different dimensions of imperfection we have mentioned in Sect. 4.2. Future data models should provide a way to propagate δ-ignorance, and take into account model dependencies. Casual dependencies can probably be managed at schema level, through the application of specific constraints. However, we have not investigated this issue, which lies outside the scope of our contribution. The final consequence is the need for a way to evaluate the overall imperfection of a data instance. In fact, final users may need to judge the relevance or likelihood of the results of a query, and to compare different data instances we need to map them to a totally ordered domain. Therefore, a combination strategy is needed, which takes all the active dimensions of imperfection as input, and produces a single indicator.
References
[1] S. Abiteboul. Querying semi-structured data. In International Conference on Database Theory (ICDT), pages 1–18, Delphi, Greece, 1997.
[2] S. Al-Khalifa, C. Yu, and H. V. Jagadish. Querying structured text in an XML database. In International Conference on Management of Data (SIGMOD), pages 4–15, June 2003.
[3] D. Barbara, H. Garcia-Molina, and D. Porter. The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering, 4(5):487–501, 1992.
[4] P. P. Bonissone and R. M. Tong. Editorial: Reasoning with uncertainty in expert systems. International Journal of Man–Machine Studies, 22(3):241–250, 1985.
[5] P. Buneman. Semistructured data. In ACM Symposium on Principles of Database Systems (PODS), pages 117–121, Tucson, Arizona, May 1997.
[6] A. B. Chaudhri, A. Rashid, and R. Zicari, editors. XML Data Management: Native XML and XML-Enabled Database Systems. Addison Wesley Professional, 2003.
[7] A. Dekhtyar, J. Goldsmith, and S. R. Hawkes. Semistructured probabilistic databases. In Statistical and Scientific Database Management, pages 36–45, 2001.
[8] R. Demolombe. Uncertainty in intelligent databases. In A. Motro and C. Thanos, editors, Uncertainty Management in Information Systems, pages 89–154. Kluwer, 1997.
[9] D. Dey and S. Sarkar. A probabilistic relational model and algebra. ACM Transactions on Database Systems, 21(3):339–369, 1996.
[10] D. Dubois and H. Prade. Possibility Theory: An Approach to Computerized Processing of Uncertainty. Plenum Press, 1988.
[11] N. Fuhr and T. Rölleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems, 15(1):32–66, 1997.
[12] E. Hung, L. Getoor, and V. S. Subrahmanian. Probabilistic interval XML. In International Conference on Database Theory (ICDT), Italy, January 2003.
[13] E. Hung, L. Getoor, and V. S. Subrahmanian. PXML: A probabilistic semistructured data model and algebra. In International Conference on Data Engineering (ICDE), Bangalore, India, March 2003.
[14] H. Jagadish, L. Lakshmanan, D. Srivastava, and K. Thompson. TAX: A tree algebra for XML. In International Workshop on Database Programming Languages (DBPL), September 2001.
[15] G. Kazai, N. Gövert, M. Lalmas, and N. Fuhr. The INEX evaluation initiative. Lecture Notes in Computer Science, 2818:279–293, 2003. Intelligent Search on XML Data.
[16] G. Klir and R. M. Smith. On measuring uncertainty and uncertainty-based information: Recent developments. Annals of Mathematics and Artificial Intelligence, 32:5–33, 2001.
[17] L. V. S. Lakshmanan, N. Leone, R. Ross, and V. S. Subrahmanian. ProbView: A flexible probabilistic database system. ACM Transactions on Database Systems, 22(3):419–469, September 1997.
[18] M. Magnani and D. Montesi. A model for imperfect XML data based on Dempster–Shafer's theory of evidence. Technical Report UBLCS-2005-19, University of Bologna, 2005.
[19] M. Magnani and D. Montesi. Dimensions of ignorance in a semistructured data model. In DEXA Workshops, pages 933–937, 2004.
[20] A. Malhotra, J. Melton, and N. Walsh. XQuery 1.0 and XPath 2.0 functions and operators. World Wide Web Consortium, Candidate Recommendation CR-xpath-functions-20051103, November 2005.
[21] A. Motro. Imprecision and uncertainty in database systems. In P. Bosc and J. Kacprzyk, editors, Fuzziness in Database Management Systems. Physica-Verlag, 1995.
[22] A. Nierman and H. V. Jagadish. ProTDB: Probabilistic data in XML. In VLDB, pages 646–657, 2002.
[23] N. R. Pal. On quantification of different facets of uncertainty. Fuzzy Sets and Systems, 107:81–91, 1999.
[24] S. Parsons. Current approaches to handling imperfect information in data and knowledge bases. IEEE Transactions on Knowledge and Data Engineering, 8(3):353–372, June 1996.
[25] Z. Pawlak. Rough sets. International Journal of Computer and Information Sciences, 11:341–356, 1982.
[26] M. Pittarelli. An algebra for probabilistic databases. IEEE Transactions on Knowledge and Data Engineering, 6(2):293–303, April 1994.
[27] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, 1976.
[28] Ph. Smets. Imperfect information: Imprecision - uncertainty. In A. Motro and Ph. Smets, editors, Uncertainty Management in Information Systems: From Needs to Solutions, pages 225–254. Kluwer Academic Publishers, 1997.
[29] Ph. Smets. Probability, possibility, belief: Which and where? In Ph. Smets, editor, Handbook of Defeasible Reasoning and Uncertainty Management Systems, Volume 1: Quantified Representation of Uncertainty and Imprecision, pages 1–24. Kluwer Academic Publishers, Dordrecht, 1998.
[30] M. Smithson. Ignorance and Uncertainty: Emerging Paradigms. Springer-Verlag, Berlin, 1989.
[31] D. Suciu. Semistructured data and XML. In FODO, pages 1–12, Kobe, Japan, November 1998.
[32] W. Zhao, A. Dekhtyar, and J. Goldsmith. Representing probabilistic information in XML. Technical Report 770-03, Department of Computer Science, University of Kentucky, 2003.
A Synopsis based Approach for XML Fast Approximate Querying Sara Comai, Stefania Marrara, and Letizia Tanca Politecnico di Milano, Dipartimento di Elettronica e Informazione Piazza L. Da Vinci 32, I-20133 Milano, Italy {comai, marrara, tanca}@elet.polimi.it
1 Introduction and motivation

In the last few years, XML has spread in many application fields and today it is used as a format to exchange data on the web, to ensure inter-operability among applications. Due to this success, the W3C has proposed a new query language, XQuery [25], specifically designed to query XML data. XQuery is a well-defined but rather complex language [14]. In this work we propose a new approach to overcome the problem of the high computational costs required by aggregate queries over massive XML data collections. In traditional relational warehouses [11] a similar problem is solved by means of fast approximate queries, which use concise data statistics based on histograms or on other statistical techniques. Their most common application is for aggregate queries in modern decision support systems, where large volumes of data need to be queried and quick, interactive responses from the DBMS are required, e.g., to analyze the data in the warehouse in order to get trend information to evaluate marketing strategies. In such applications, users are often more interested in obtaining an approximate answer computed in a short time rather than an exact one obtained in some minutes or, at worst, hours. In this work we show how to extend one such approach also to query massive XML data sets, to obtain approximate answers in very short computation times.

The basic idea for approximate answers is to store pre-computed summaries of the XML warehouse, also called synopses (concise data collections), and to query them instead of the original database, thus saving time and computational costs. For this reason we attempt to obtain a synopsis with a structure as similar as possible to the original one, in order to combine conciseness of the data set and ease of query formulation. In our scenario, the user poses a query to the system, which uses the synopsis instead of the original data to answer the query. Consequently, the answer comes within a much shorter time, at the price of a loss in precision. The research work's purpose involves techniques for such transparent query
transformation, as well as evaluation methods for estimating the error produced by the evaluation of the query on the synopsis instead of the original data. The structure of the work is as follows. Section 2 introduces a general overview of the full approach and an example of XML collection that will be used during the exposition of the approach. Section 3 presents some basic definitions. In Section 4 we describe the structure of the synopsis and define how to automatically create the XQuery query to construct the synopsis of a given structure. Section 5 describes how to transform a query in XQuery on the original data set into the corresponding query on the synopsis. Section 6 contains considerations about the quality of the approach. Section 7 shows the results of some experiments generated by means of a prototype tool, Section 8 presents some literature works about synopses, and, finally, in Section 9 we outline our conclusions.
2 Overview of the approach

Figure 1 shows the basic steps of our approach: initially, a collection of XML documents sharing a unique DTD or schema is condensed into a smaller group of XML documents, each document representing a summarization of the data contained in the original collection by means of statistical techniques. This new collection is called synopsis, and the summarization is obtained by means of an XQuery transformation TR from the original data. The XML synopsis conforms to a new schema (e.g., DTD) generated from the original schema by a step-by-step XQuery transformation. When the user poses an aggregate query Q over the original XML collection, his/her query is transformed by an XQuery transformation QTR into a query Q′ on the synopsis collection. Finally, the answer (probably approximate) is computed and returned to the user.

In the general case, the collection of XML documents we suppose to deal with is composed of several different groups, each of them sharing the same DTD or the same XML Schema. In the sequel we will use the word schema to refer both to DTDs and XML Schemas. In our approach, each group of documents is summarized using a statistical technique, creating one synopsis. The union of all the synopses generated from each group of documents composes the entire synopsis of the XML document collection. We use only one summarizing technique for the entire data collection (e.g., equi-width histograms), so that it is possible to use just one implementation of the statistical aggregate functions to query the resulting synopsis. This choice explains why in this work we prefer to deal with DTDs and not XML Schemas: DTDs are a weaker formalism than XML Schema, but they give the necessary knowledge about the structure of the XML document tree. Since we do not use different statistical techniques inside the same synopsis, we do not really need a deep knowledge of the data types used in the original data collection
Fig. 1. The XQuery approach for XML synopsis. In the batch phase, the transformation TR turns the XML document collection into the synopsis; at run time, a user query Q is rewritten by QTR into a query Q′, which an XQuery engine evaluates on the synopsis to produce the result returned to the user.
and then we do not need the more complex XML Schema formalism. Moreover, there exist many collections of data that could benefit from our methodology and that do not have a real schema, so the DTD is created a posteriori. Naturally, the choice of the statistical technique is application-dependent and can be different for each collection we analyze; hence, in some applications XML Schemas could be more useful than DTDs. In this work we focus on histograms because they seem the most suited statistics for the mixed content of a generic XML document, which typically stores categorical elements (e.g., names or colors) and numerical elements (e.g., ages or prices) together in the same document. Indeed, it is possible to use wavelets to summarize only numerical data, while samples work well with uniform distributions, which are not common in XML.

2.1 Running example

During the exposition of the approach, we consider a schema (DTD) describing a car store: cars are characterized by a color and by selling details including the model, the customer's city, the price, and the optional features of the sold car. Fig. 2 shows a sample XML document conforming to this schema.
Fig. 2. A sample XML document: a list of four car elements, each with color white; the selling details record a Fiat Brava sold in Milano (optionals: ABS, airbag, electronic closure key), a Fiat Brava sold in Roma (optional: ABS), an Opel Corsa sold in Milano (optionals: airbag, electronic closure key), and a Fiat Marea sold in Milano.
3 Representing summarized XML data

Following a common use, we represent an XML document as a tree T = (V, E), where V is the node set comprising both nodes representing tags and nodes representing text content and attributes. Attributes are not explicitly handled in this work because, if a literal semantics [12] for the representation of XML documents is adopted, they can be treated as a particular case of PCDATA elements. E ⊆ V × V represents element and text containment arcs. The target document collection can be composed of groups of documents having different schemata; in this case, the proposed approach is applied separately to each group, while the final synopsis is the union of all the synopses generated by the different groups.
We represent XML elements with their paths from the root of the document they belong to, using XPath 1.0 [24] notation, in order to distinguish elements with the same tag name but with different internal meaning (e.g., a person's home address is different from the address of the company the person belongs to). The entire methodology to create an XML synopsis can be described by the following steps:
• Initially, the designer investigates the target data collection, in order to identify the most common aggregate queries in the application. On the basis of this set of queries, he/she identifies a collection of frequently queried aggregations and the parameters of each histogram involved;
• Then the synopsis is automatically created by an XQuery engine on the basis of the structure decided in the previous step;
• The data collection is ready for querying: each query is transformed and redirected to the synopsis, obtaining an answer (possibly approximate) in a much shorter time than querying the original data set.

At the beginning of the design of a synopsis, we identify the descriptor of the synopsis as a collection of frequently queried aggregations. The most natural set of synopses for an approximate query engine would include an aggregate element for each leaf element in the document tree. We refer to documents containing those aggregate elements as base synopses. We would like to evaluate the relationship between two leaf elements in the tree by means of some kind of combination of the base synopses of the target elements. This approach is similar to the problem of evaluating a join between two relations by trying to combine the base synopses of the relations themselves. This problem has been discussed in the literature in [3], where the authors prove that the use of base samples to estimate the output of a join of two or more relations can produce a poor-quality approximation. The reasons that motivate this claim are also valid in the histogram domain:
• non-uniform result distribution: in general, the join of two histograms does not represent the real data distribution of the output of the join, since each original data item can be collected into one or more histograms depending on the dimensions chosen to build the statistics;
• inaccurate result composition: the join of two histograms typically composes bucket frequencies instead of the actual data. This can lead to both inaccurate answers and very poor confidence bounds, since they compose the errors committed when we consider a frequency value instead of the real data value.

These considerations motivate our choice to construct the set of synopses taking into account the most common aggregate queries and the most interesting relationships among the elements of the document structure. In this scenario each synopsis can answer a limited (but interesting to the user) set of queries, but the accuracy is guaranteed.
Definition 1 [Synopsis descriptor] The descriptor of an XML synopsis is defined by a set of pairs Hs = {(summ_e, <crit_g1, . . . , crit_gn>)}, where
• summ_e is the path expression of the element to be summarized, and
• <crit_g1, . . . , crit_gn> (the grouping criteria) is a (possibly empty) sequence of path expressions of the elements by which the element given by summ_e is grouped in the summarization process.

As histograms are constructed over elements and not over paths, in the sequel we will use the name summ_e just to indicate the leaf element of the path named summ_e, and crit_g for the leaf element of the path crit_g. As an instance, referring to the running example, suppose that the user is interested in storing a summarized collection about the models of the cars w.r.t. the color and the city where the cars were sold. In this case, Hs is represented as follows: Hs = {(summ_e, <crit_g1, crit_g2>)} = {(list/car/selling/details/model, <list/car/color, list/car/selling/details/city>)}.

Once the descriptor of the synopsis document has been defined, another important decision is the set of parameters (e.g., number of buckets or boundary values) of the histogram that will store the data of summ_e.

Definition 2 [Parameter document] The parameter document P is an XML document containing the boundaries of the buckets of the histogram to be obtained by the summarization described by a pair (summ_e, <crit_g1, . . . , crit_gn>). The document can have the following structure:
• if summ_e ends with a categorical element (e.g., the element color), an interval content of a sample bucket of its histogram has the form <interval><bv>value</bv></interval>, where bv stands for boundary value;
• if summ_e ends with a numerical element, the interval can have the form <interval><bvmin>value</bvmin><bvmax>value</bvmax></interval>.

For example, if summ_e ends with the element color (a categorical element) of the running example, an interval content of a sample bucket of its histogram has the form <interval><bv>blue</bv></interval>, while if summ_e ends with a numerical value, the interval can have the form <interval><bvmin>0</bvmin><bvmax>99</bvmax></interval>.
Fig. 3. Transformation diagram between the original data collection and the corresponding synopsis, and between the aggregate queries, exact and approximate. In the figure, x and x’ are the result of the original query and the result of the approximate one respectively.
Given Hs and P, the XQuery transformations involved are expressed as follows:

Definition 3 [TR] Given a set {T1, . . . , Tn} of XML documents sharing the same schema, we call TR : ({T1, . . . , Tn}, Hs, P) → {T′1, . . . , T′k}, k ≤ n (often k ≪ n), the transformation able to construct the XML synopsis (DATAsyn). {T′1, . . . , T′k} is a (small) set of XML documents conforming to one new schema, which we call the synopsis schema.

Definition 4 [SchemaTransf] Let {S1, . . . , Sn} be the set of schemata of the target XML data set. We call SchemaTransf : ({S1, . . . , Sn}, Hs) → {S′1, . . . , S′k} the transformation on the schemata of the initial set of documents that constructs the corresponding synopsis schemata {S′1, . . . , S′k}.

Definition 5 [QTR] Let {Q} be a set of aggregate queries on the original data set and {Q′} be the set of corresponding synopsis queries. Then, QTR : {Q} → {Q′} is the transformation that returns, for each original query Q ∈ {Q}, a new query Q′ ∈ {Q′}.

If we consider aggregate queries, let x ∈ R be the exact numerical answer of an aggregate query Q and let x′ ∈ R be the corresponding estimated answer on the synopsis. Then, the approximation error of Q can be measured in absolute, relative or combined terms [18] as follows:
• Absolute error of a query: err_abs = |x − x′|.
• Relative error of a query: err_rel = err_abs / x = |x − x′| / x, for x > 0.
• Combined error of a query: err_comb = min{α · err_abs, β · err_rel}, where α and β are positive constants, used to tune the relative importance of the two errors w.r.t. each other. If x = 0, then err_comb = α · err_abs.
For a more detailed explanation of the approximation error measures see [18]. In Section 6 we apply these definitions to our methodology and study the degree of approximation obtained by querying our synopses instead of the original data.
4 Structure of the synopsis
In this section we describe the methodology for the design and construction of a synopsis for a given set of XML documents. To simplify the synopsis construction we suppose that for each element e ∈ Hs a single XML synopsis document is obtained. Since we consider documents containing both categorical and numerical data, in this work we summarize the XML data collection by means of equi-width histograms. Equi-width histograms group contiguous ranges of the element values into B buckets with the criterion that the sum of the spreads of the values in one bucket is approximately equal to 1/B times the sum of the spreads of all values. In order to construct the equi-width histograms of the synopsis the designer must fix the boundary values of each bucket, deciding the content of the parameter document P. To describe P we use the following definition:
Definition 6 [Active Domain] We call active domain D of an element e the set of distinct element values of the domain of e that actually appear in the target XML document collection.
The synopsis histogram is constructed according to the following rules:
• Given e ∈ Hs, the histogram collects the data represented by summ_e.
• If summ_e is a leaf node, each boundary value represents
  - a value of the active domain of the summ_e leaf element if this is a categorical element (e.g., color or city),
  - an interval in the active domain of the summ_e leaf element if it holds a numerical value.
• If summ_e is a non-leaf element, and A1, ..., An are the active domains of the descendant leaf elements of summ_e, then the summ_e histogram holds the Cartesian product BV = A1 × ... × An.
• The frequency value freq represents the number of summ_e values that satisfy the grouping conditions expressed by <crit_g1, ..., crit_gn> and belong to the bucket defined by the boundary value.
• If we consider histograms of numerical data, the buckets must not overlap, so that the same item of the document collection is never counted twice in the histogram.
As an example, referring to the schema of the running example in Sect. 2.1, consider the synopsis where summ_e is list/car/color, i.e., a categorical leaf
element. In this case, let us suppose that the boundary values are blue, red, black and grey; hence the parameter document P is the one shown in Figure 4.
Fig. 4. The parameter document for a categorical element histogram.
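In textual form, such a parameter document could look roughly as follows (the tag names hist, bucket, interval and bv follow Definition 2 and the figure; the exact nesting in the original figure may differ slightly):

<hist>
  <bucket><interval><bv>blue</bv></interval></bucket>
  <bucket><interval><bv>red</bv></interval></bucket>
  <bucket><interval><bv>black</bv></interval></bucket>
  <bucket><interval><bv>grey</bv></interval></bucket>
</hist>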
If summ_e is list/car/selling/details/price, i.e., a numerical leaf element, the boundary values can range from the $5000 of the cheapest car to the $65000 of the most expensive one sold in the store. Therefore, a possible parameter document P of the histogram of this element is the one shown in Figure 5.
Fig. 5. The parameter document for a numerical element histogram.
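Again in textual form, and with the same caveat about the exact tag layout, the price histogram of Figure 5 could be described by a parameter document of this shape (six buckets between $5000 and $65000, with the intermediate buckets elided):

<hist>
  <bucket><interval><bvmin>5000</bvmin><bvmax>14999</bvmax></interval></bucket>
  <bucket><interval><bvmin>15000</bvmin><bvmax>24999</bvmax></interval></bucket>
  ...
  <bucket><interval><bvmin>55000</bvmin><bvmax>65000</bvmax></interval></bucket>
</hist>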
Using equi-width histograms, the parameter document of a numerical element can be automatically constructed by providing as parameters the total range of the possible values of the element (e.g., 5000 to 65000) and the number of buckets to construct (e.g., 6). If summ_e is list/car/selling/details, i.e., a non-leaf element, the values to be considered in the histogram are obtained as combinations of the leaf descendant values list/car/selling/details/model, list/car/selling/details/city, list/car/selling/details/price, and list/car/selling/details/optional. For example, one possible value for details is the tuple (Fiat Brava, Rome, $12000, ABS). Let us now consider three possible situations:
• All the leaf descendants of summ_e have cardinality (1:1) w.r.t. summ_e. In this case a possible parameter document P is the one shown in Figure 6.
Fig. 6. The parameter document for an element having leaf descendants with cardinality 1:1.
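For instance, one bucket of the parameter document of Figure 6 could be written as the following sketch (element names and values are read off the figure; the exact layout in the figure may differ slightly):

<bucket>
  <model><interval><bv>Fiat Brava</bv></interval></model>
  <city><interval><bv>Rome</bv></interval></city>
  <price><interval><bvmin>15000</bvmin><bvmax>24999</bvmax></interval></price>
  <optional><interval><bv>ABS</bv></interval></optional>
</bucket>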
Note that the path nodes between summ_e and the leaf descendant values considered as boundary values of the histogram appear in each bucket structure between the node <bucket> and the node <interval>.
• If one or more leaf descendants of summ_e are optional elements (cardinality (0:1)), we overcome the problem by adding to the active domain of the optional element(s) the value ND (i.e., Not Determined) in order to account for the elements that do not contain the optional element in the document. As an example, let us suppose that summ_e is list/car/selling/details and city is optional with active domain A = {Rome, Milan, Verona}. The new active domain of city used for constructing the histogram is A = {Rome, Milan, Verona, ND}; therefore a possible parameter document will also contain the buckets dealing with the absence of a city value in some documents of the original collection (see, for instance, Figure 7).
• If one or more leaf descendants of summ_e can appear more than once in the summ_e sub-tree (cardinality (0:n)), then the number of buckets composing the histogram explodes. Indeed, we should consider any possible combination of the children element values of summ_e, taking into account the possibility that the same element can appear more than once in the same document. Therefore this case is deprecated, and we strongly suggest not choosing as summ_e an element containing child elements with cardinality n. A possible solution is to choose as summarized element one of the children of the element we would like to summarize with cardinality 1:1 w.r.t. summ_e. For instance, consider the P document fragment in Fig. 8, constructed supposing that the same car model can be sold in Rome or in Milan or in both towns.
Fig. 7. The parameter document for an element having leaf descendants with cardinality 0:1.
Fig. 8. The parameter document for an element having leaf descendants with cardinality 0:N.
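A textual sketch of the fragment of Figure 8, with the same caveat about the exact tag layout, could be the following, where a single model (Fiat Brava) already needs three buckets, one for each possible combination of cities:

<hist>
  <bucket><model><interval><bv>Fiat Brava</bv></interval></model>
          <city><interval><bv>Rome</bv></interval></city></bucket>
  <bucket><model><interval><bv>Fiat Brava</bv></interval></model>
          <city><interval><bv>Milan</bv></interval></city></bucket>
  <bucket><model><interval><bv>Fiat Brava</bv></interval></model>
          <city><interval><bv>Rome</bv></interval></city>
          <city><interval><bv>Milan</bv></interval></city></bucket>
</hist>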
Obviously, if we consider a more complex case where the number of possible cities is n ≫ 2, then the number of buckets in each histogram (one for each model) increases more than exponentially.
4.1 XQuery rules for synopsis computation
We now define the transformation TR to construct the standard synopsis from the original data collection. There are two main cases:
• construction of the histogram of a leaf element (leaf summ_e);
• construction of the histogram of a non-leaf element (non-leaf summ_e).
Case of leaf summarized element
The histogram of each e = (summ_e, <crit_g1, ..., crit_gn>), where summ_e is a leaf element, can be computed using the XQuery code Q_hist, where
www.doc.com is a URI containing the set of documents to summarize, while www.parameters.com specifies the parameter document P of the element to summarize. The code Q_hist is the following:
1.  <hist>
2.  LET $V := document("www.doc.com")/summ_e
3.  FOR $b IN document("www.parameters.com")/interval
4.  RETURN
5.  <bucket>
6.  <interval>
7.  FOR $bv in $b//bvmin
8.  <bvmin> $b//bvmin[text()] </bvmin>
9.  <bvmax> $b//bvmax[text()] </bvmax>
10. FOR $bv2 in $b//bv
11. <bv> $bv2[text()] </bv>
12. </interval>
13. <freq> count( IF /interval//bvmin
14.        THEN $V[text()>$b//bvmin/text() AND
15.                text()<$b//bvmax/text()]
16.        ELSE $V[text()=$b//bv/text()])
17. </freq>
18. </bucket>
19. </hist>
The construction of the histogram is based on two FOR clauses (lines 7 and 10), selecting for each bucket the appropriate boundary values (numerical or categorical); then, function count in line 13 computes the frequency of the bucket represented by the chosen boundary value(s).
Case of non-leaf summarized element
If summ_e is a non-leaf element, the Cartesian product of its descendant leaf elements must be computed; indeed the tuple of values contained in its descendant leaf elements represents the content value of summ_e, and Q_hist becomes more complex. Indeed, in this case we do not have a fixed XQuery query as in the previous case, but we need a function able to construct the query case by case. In the following we show the pseudo-code of this function, named create_hist. Inputs of the function are the element to be summarized summ_e, the grouping criterion sequence <crit_g1, ..., crit_gn>, the list of leaf children of summ_e, the target document collection doc and the histogram parameter document P. The function returns the query string for the construction of the histogram of summ_e.

string create_hist(summ_e, <crit_g1,...,crit_gn>, leaf_children(summ_e), doc, P)
{
  string q, t;
  element el;
  q.insert("LET $V := document(", doc, ")/", summ_e, endline);
  q.insert("FOR $b IN document(", P, ")/interval", endline,
           "RETURN <bucket> <interval>", endline);
  while (leaf_children(summ_e) is not empty) {
    el = first element of leaf_children(summ_e);
    if (el is a numerical element) {
      // open the element node and copy the numerical boundary values of the bucket
      q.insert("<", el, ">", endline);
      q.insert("FOR $bv IN $b//", el, "/bv_min", endline,
               "<bvmin> $b//", el, "/bv_min[text()] </bvmin>", endline,
               "<bvmax> $b//", el, "/bv_max[text()] </bvmax>", endline);
      // condition on this element, used later to compute the frequency of the bucket
      t.insert("text()>$b//", el, "/bv_min[text()] AND text()<$b//", el, "/bv_max[text()]");
      remove el from leaf_children(summ_e);
      if (leaf_children(summ_e) is not empty) t.insert(" AND ");
    } else if (el is a categorical element) {
      // open the element node and copy the categorical boundary value of the bucket
      q.insert("<", el, ">", endline);
      q.insert("FOR $bv2 IN $b//", el, "/bv", endline,
               "<bv> $bv2[text()] </bv>", endline);
      t.insert("text()=$b//", el, "/bv[text()]");
      remove el from leaf_children(summ_e);
      if (leaf_children(summ_e) is not empty) t.insert(" AND ");
    }
    q.insert("</", el, ">", endline);   // close the element node of the current leaf child
  }
  q.insert("</interval>", endline);
  q.insert("<freq> count($V[", t, "])", endline);
  q.insert("</freq>", endline, "</bucket>");
  return q;
}
The function create_hist uses two strings, q and t: q stores the query, while t stores the boundary values of each bucket in order to compute the frequency of the bucket itself. For each bucket, the while cycle creates the structure of the bucket and stores the boundary values given in P in the string t, which is used at the end of the function to compute the frequency. For example, the query constructed to create the histogram of the element list/car/selling/details (with children model, city, price and optional) is:

LET $V := document("www.doc.com")/list/car/selling/details
FOR $b IN document("www.parameters.com")/interval
RETURN
<bucket> <interval>
FOR $bv2 IN $b//model/bv
<model> $bv2[text()] </model>
FOR $bv2 IN $b//city/bv
<city> $bv2[text()] </city>
FOR $bv IN $b//price/bv_min
<price> $b//price/bv_min[text()] $b//price/bv_max[text()] </price>
FOR $bv2 IN $b//optional/bv
<optional> $bv2[text()] </optional>
</interval>
<freq> count($V[text()=$b//model/bv[text()] AND
              text()=$b//city/bv[text()] AND
              text()>$b//price/bv_min[text()] AND
              text()<$b//price/bv_max[text()] AND
              text()=$b//optional/bv[text()])
</freq> </bucket>
Construction of the synopsis document structure
To define the transformation for computing the whole synopsis, we first need a preliminary definition:
Definition 7 [Lowest Common Ancestor] For all i ∈ {1, ..., n}, the lowest common ancestor (LCA) Ai is the deepest node that is an ancestor both of summ_e and of the elements in <crit_gi, ..., crit_gn>, where
• if i ∈ [1, ..., n−1] then <crit_gi, ..., crit_gn> ⊆ <crit_g1, ..., crit_gn>, else
• <crit_gi, ..., crit_gn> = <crit_gn> if i = n.
Fig. 9 shows an example of the tree of a synopsis and the LCAs of summ_e and <crit_g1, crit_g2>. In this figure we can see that A1 is the LCA of summ_e and crit_g1, while A2 is the LCA of summ_e and <crit_g1, crit_g2>. Note that, by construction, each common ancestor Ai is the root element of the smallest subgraph that contains the histogram derived from one value of the active domain of crit_gi. Since we have n histograms to be stored, it follows that we need exactly n subgraphs and consequently A will be repeated n times in the synopsis document; thus the following theorem holds:
Theorem 1 Given the elements summ_e and crit_g, the corresponding LCA A is repeated inside the synopsis document n times, where n = |D| and D is the active domain of crit_g.
Fig. 9. LCA example structure.
The synopsis histogram is constructed from the original data collection starting from the set of paths defining the histogram itself (summ_e, <crit_g1, ..., crit_gn>) and the tuple of parameters in P. The construction of the synopsis is based on a set of rules that define the transformation TR and have been detailed in [17]. In this work we show an example of the XQuery code obtained to collect the models (representing summ_e) w.r.t. color and city (representing <crit_g1, crit_g2>) from the running example of Section 2.1.
1.  <list>
2.  FOR $g1 IN LIST/CAR/COLOR
3.  <car>                   (: first common ancestor of color and model :)
4.  <color> $g1 </color>    (: first crit_g :)
5.  <selling>               (: element that connects A1 and A2 :)
6.  FOR $g2 IN LIST/CAR/SELLING/DETAILS/CITY   (: the LET is inside the most nested FOR :)
7.  LET $V := document("www.doc.com")/LIST/CAR/MODEL
8.          [../CAR/COLOR=$g1 and ../CAR/Y/X/CITY=$g2]   (: binding of the crit_g criteria :)
9.  RETURN
10. <details>               (: A2 :)
11. <model> - histogram - </model>
12. <city> $g2 </city>      (: last crit_g :)
13. </details>
14. </selling>
15. </car>
16. </list>
An example of the rules that define TR is the following:
Fig. 10. A sample XML synopsis document.
TR-Rule 1 The query body is composed of n nested FOR clauses (n = |{crit_g}|), following the same order of the elements as in <crit_g1, ..., crit_gn>. Each FOR clause generates a branch in the synopsis tree ready to store the histogram of the summarized element path summ_e w.r.t. the condition expressed by crit_g. The structure is recursive, because each branch born from the condition crit_gi ∈ <crit_g1, ..., crit_gn>, i = 1, ..., n, is divided into new branches, one for each active domain value of crit_gi+1 ∈ <crit_g1, ..., crit_gn>. In this way each sub-branch can store the histograms of the data which respect both crit_gi and crit_gi+1, and so on. In the example code this rule creates the FOR clauses in lines 2 and 6 of the query.
A sample synopsis document constructed for a document collection respecting the structure of the DTD in Section 2.1 is shown in Figure 10. The structure of the synopsis chosen for this example is a collection of the models of the cars w.r.t. the color and the city where the cars were sold. Therefore, Hs is represented as follows: Hs = {(summ_e, <crit_g1, crit_g2>)} = {(list/car/selling/details/model, <list/car/color, list/car/selling/details/city>)}.
The other rules describe what appears in the clauses of the query, how many FOR cycles the query needs depending on the structure of the document tree, which branches of the document tree have to appear in the synopsis structure, and at which point of the synopsis tree the histograms are stored (see the details in [17]).
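In textual form, the sample synopsis of Figure 10 could look roughly like the following sketch. The tag names hist, bucket, freq and bv, and the exact nesting, are our reading of the figure and of the path model/hist/bucket/bv used by the query transformation in the next section, so the layout is indicative only:

<list>
  <car>
    <color>white</color>
    <selling>
      <details>
        <model>
          <hist>
            <bucket><freq>10</freq><bv>Fiat Brava</bv></bucket>
            <bucket><freq>15</freq><bv>Fiat Punto</bv></bucket>
            <bucket><freq>8</freq><bv>Fiat Marea</bv></bucket>
          </hist>
        </model>
        <city>Milan</city>
      </details>
      <details>
        <model>
          <hist>
            <bucket><freq>12</freq><bv>Fiat Brava</bv></bucket>
          </hist>
        </model>
        <city>Rome</city>
      </details>
    </selling>
  </car>
  <car>
    <color>blue</color>
    <selling> ... </selling>
  </car>
</list>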
5 Querying the XML synopsis
XML synopses have been proposed basically to answer aggregate queries in a fast and effective way, and the language chosen to query XML data is the new standard proposal XQuery. Expressing aggregate queries in the relational data query language SQL is done by means of the GROUP-BY clause, but XQuery does not have such a powerful operator and aggregates are computed by (usually very complex) nested FOR clauses. The approach for the automatic translation of XQuery to the synopses considers users without any knowledge of the synopsis collection structure. In this case the translation derives directly from the original, complex and multi-nested query and cannot be optimized very well. Obviously, a more expert user is free to query the synopsis directly, writing the best query for the synopsis structure.
In the XQuery language, the aggregate functions available are COUNT, SUM, AVG, MAX and MIN. We develop our analysis considering the same set of functions and the presence of different kinds of data that, connected by a hierarchical structure, define different kinds of grouping. In the synopsis querying approach, we define new functions created to compute the aggregates over the histograms instead of the original data: these functions are count_hist, sum_hist, avg_hist, max_hist and min_hist. The structure of the query on the synopsis basically follows the structure of the original query; here are some general observations:
• in the construction of the synopsis, we have eliminated some elements from the document; therefore, we can find an answer only to queries involving elements that exist in the synopsis graph, otherwise we are forced to query the original data collection;
• since the construction of the synopsis does not change the path structure of the crit_g elements and stores all the values of their active domain, queries asking for values stored as grouping elements can find a precise answer (not an approximate one);
• queries involving histograms of categories (e.g., names or colors) always find a precise answer, because each boundary value represents an exact value of the element active domain and the frequency of the buckets represents the exact count of this value in the target collection;
• the answer to a query involving histograms of numeric data (e.g., ages) is usually approximate, because the boundary values of the histograms are not exact values but intervals of values of the element active domain.
As an example, consider the synopsis shown in Fig. 10. In the synopsis, the crit_g element is the element list/car/color, which appears in the synopsis file with all its active domain values. In this case a query asking for the colors of the sold cars will find all the possible values of the element as if it were performed on the original data collection. Instead, consider the parameter document P constructed for the element price in Sect. 4: if we look for the number of cars that cost less than $12000, the answer on the synopsis will
be approximate, because it should use one of the buckets of the histogram of prices only partially. The rules that define the transformation QTR have been detailed in [17]. As an example, consider the following rule:
QTR-Rule 1 If the aggregate function in the original query is applied to a leaf element summ_e, the synopsis query uses the corresponding function on histograms (e.g., count_hist, avg_hist).
An example of the application of this rule is the following query, which asks for the number of Fiat Brava sold in Milan:

for $det in doc("cars.xml")/list/car/selling/details
where $det/city = "Milan"
return count($det/model = "Fiat Brava")

This query is transformed into:

for $det in doc("syn-cars.xml")/list/car/selling/details
where $det/city = "Milan"
return count_hist($det/model/hist/bucket/bv = "Fiat Brava")

The other rules describe what appears in the clauses of the query, which function has to be used in case of a categorical or numerical aggregate element, and what happens in case of reverse steps in the query (see the details in [17]).
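As a further, purely illustrative sketch (this query pair is ours and is not taken from [17]; we only assume that the numerical case follows the same pattern, using the hist functions named above), a query asking for the average price of the cars sold in Milan could be transformed along the same lines:

for $det in doc("cars.xml")/list/car/selling/details
where $det/city = "Milan"
return avg($det/price)

which, on the synopsis, would become something like:

for $det in doc("syn-cars.xml")/list/car/selling/details
where $det/city = "Milan"
return avg_hist($det/price/hist/bucket)

where avg_hist estimates the average from the bucket boundaries and frequencies, as discussed in Section 6.1.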
6 Approximation analysis
In this section we further detail the considerations made at the beginning of Sect. 5 about the approximation obtained in the synopsis querying process. The aim of this section is to analyze the approach in order to evaluate the degree of approximation during the querying phase. We analyze our synopsis approach according to the following dimensions:
• Coverage: in this section we detail some general observations about the set of queries that can be answered using the synopses constructed with our methodology.
• Answer quality: the accuracy and confidence of its (approximate) answers to queries in {Q}.
Two further dimensions, the Footprint and the Query Time of the resulting synopsis, will be presented by examples in Section 7 during the analysis of the results obtained with the prototype tool.
6.1 Coverage and answer quality
The structure of the query on the synopsis basically follows the structure of the original query, hence when querying the synopsis some general observations hold:
1. In the construction of the synopsis, we have eliminated some elements from the document; therefore, we can find an answer only to queries involving elements that exist in the synopsis graph, otherwise we are forced to query the original data collection.
2. Since the construction of the synopsis does not change the path structure of the crit_g elements and stores all the values of their active domain, queries asking for the existence of values stored as grouping elements can find a precise answer (not an approximate one); of course this statement does not hold if we are looking for aggregates (e.g., the number of...) involving crit_g elements.
3. Queries involving histograms of categories (e.g., names or colors) always find a precise answer, because each boundary value represents an exact value of the element active domain and the frequency of the buckets represents the exact count of this value in the target collection.
4. The answer to a query involving histograms of numeric data (e.g., ages) is usually approximate, because the boundary values of the histograms are not exact values but intervals of values of the active domain of the element.
A histogram should reduce the data by describing the data distributions. For the sake of simplicity, in this analysis we concentrate on leaf elements; the considerations for non-leaf elements are similar, but involve multi-dimensional histograms. If histograms are available, we can use the synopsis to accelerate aggregate queries, using some ad-hoc formulas to build the aggregate functions count_hist, sum_hist, avg_hist, max_hist, and min_hist. For a detailed review of these formulas see [17]; a simple illustrative sketch is given at the end of this subsection. Moreover, in [17] we consider the problem of refreshing the synopsis histograms when the target XML document collection is updated. Unfortunately, the W3C has not yet proposed an XQuery syntax for updates, although such an extension is strongly needed. Only in the last few years has the scientific community started to formally deal with the problem of updates in XQuery; some preliminary works are [26] and the activities reported in [1]. For this reason, we can only make some general considerations about the update of the synopsis, but a deep study of this problem will be addressed as soon as XQuery has a well-defined update syntax.
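As a purely illustrative sketch of such an ad-hoc formula (the actual formulas used by count_hist and the other functions are given in [17]; here we only assume the usual uniform-distribution assumption inside each bucket), a count over a numerical range [a, b] can be estimated from the buckets as

count_hist([a, b]) ≈ Σ_i freq_i · |[a, b] ∩ [bvmin_i, bvmax_i]| / (bvmax_i − bvmin_i + 1),

where the sum ranges over the buckets of the histogram and |·| denotes the width of the intersection of the two intervals. For the price parameter document of Figure 5, a query for the cars that cost less than $12000 would thus use only the fraction (12000 − 5000)/(14999 − 5000 + 1) = 0.7 of the frequency of the first bucket [5000, 14999], which is exactly the partial use of a bucket mentioned in Section 5.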
7 Experimental results
A prototype tool (see [17]) has been used to test the idea on a huge set of XML data in order to experimentally measure the approximation error and
the benefits of using synopses in terms of computational costs and times. The prototype consists of two independent parts: SynGenerator, a tool for creating the synopsis and its DTD, and ApproXquery, a tool for querying the synopsis. Both tools have been developed in Java. The first experiments allow us to make some preliminary observations: in our toy benchmark, constructed by using the example DTD in Section 2.1, each car document file occupies about 500 bytes. If we construct the synopsis shown in Section 3, we see that it can occupy (in the worst case, in which there is not more than one car model for each color and each town) about 750 bytes. Note that once a certain model-color-town group has been inserted in the synopsis, the dimensions of the synopsis itself do not change as the number of cars belonging to this group grows. If we consider a store able to sell 10 cars of a given color and model in a certain town, the original document collection will occupy 5000 bytes while the synopsis still occupies 750 bytes, i.e., 15% of the original size. Obviously, the percentage space occupation decreases as the number of cars belonging to a certain group grows. OLAP applications work with quantities of data enormously bigger than our toy examples; hence it is easy to infer that synopsis collections can occupy a tiny fraction of the original collection space, allowing very good performance gains. In Figure 11 we show our car synopsis collection footprint and some comparative query execution times obtained by running some range queries presented in [17]: the first column of the table shows the execution times of each original query on the target collection, the second column shows the execution times of the query to construct the synopsis, and the third column shows the execution times necessary to perform the transformed query on the synopsis. Even if the structures of the original and of the synopsis queries are very similar (they have been transformed automatically by the transformation QTR), the query execution times differ considerably. This is basically motivated by the dimensions of the data collections that have to be queried: indeed the synopsis has a footprint of 17 KB while the original collection occupies 5 MB.
8 Related work
In this section we present an overview of the main techniques used in the literature to construct data synopses.
8.1 Synopsis data structures for relational data warehouses
A good starting point for an overview of synopsis data structures for relational data warehouses is [4], which describes the state of the art in data reduction techniques, for reducing massive data sets down to a "big picture" and for providing quick approximate answers to queries. The list of data structures that could be considered synopsis data structures is extensive. For example,
Query execution times

Query      Query on the original collection   Query that creates the synopsis   Query on the synopsis
Q1         40.51 sec                          29.33 sec                         0.43 sec
Q2         51.10 sec                          1 min 26 sec                      0.65 sec
Q3         1 min 13 sec                       49.58 sec                         0.59 sec
Q4         1 min 02 sec                       1 min 17 sec                      0.61 sec
Footprint  5 MB                               /                                 17 KB (for all synopses)

(Intel Pentium 3, 1 GHz, with 256 MB RAM)
Fig. 11. Comparative table of queries results.
Krishnan et al. [15] proposed and studied the use of a compact suffix tree-based structure for estimating the selectivity of an alphanumeric predicate with wildcards. Manber [16] considered the use of concise "signatures" to find similarities among files. Broder et al. [5] studied the use of (approximate) min-wise independent families of permutations for signatures in a related context, namely, detecting and filtering near-duplicate documents. Other works include the use of multi-fractals and wavelets for synopsis data structures [7; 18] and join samples for queries on the join of multiple sets [11]. For the purpose of this work we concentrate our review only on synopsis data structures used for fast approximate query answering. In this setting, a good survey is [9], which describes a context for algorithmic work relevant to massive data and a framework for evaluating such work. Moreover, the paper overviews the literature about synopses up to 1999 and highlights results on some important problem domains from the database literature: frequency moments, hot list queries, and histograms and quantiles. In the last 30 years, there has been a huge amount of work on synopsis data structures applied in approximate answering approaches, whose main contributions are: 1) histograms [10; 13; 22], which partition the attribute value domain into a set of buckets; 2) samples [19], which are based on the idea that a small random sample S of the data often represents the entire data well; 3) wavelets [23], which are a mathematical tool for the hierarchical decomposition of functions/signals. Multi-dimensional data synopses are used to approximate the joint data distribution of multiple attributes [3]. They are used for the selectivity estimation of queries with multiple attributes and for approximating OLAP data cubes and general relations.
XML differs from relational data in several aspects:
• documents present a hierarchical structure, where the position of each element carries useful information;
• being semi-structured data, some items can be repeated or missing without a predefined document structure;
• there can be mixed content, numerical and categorical, stored in the same document.
The synopsis approach we have proposed in this work is constructed with the aim of preserving the information stored in the hierarchical structure of the XML document during the summarization process and of taking into account the problems and the advantages given by the semi-structured nature of XML; it uses histograms, among all the techniques proposed in the literature, because this technique seems the most suited to the mixed content of XML.
8.2 XML data synopses
A first approach in the extension of synopsis approaches to XML is given by different techniques [8; 6; 20; 2] that have recently been proposed for building statistics for XML data with the aim of estimating the selectivity of path expressions; some examples are: StatiX, an XML Schema-aware statistics framework that exploits the structure derived by regular expressions in the XML Schema in order to develop an efficient and accurate XML query result estimator; the work of Chen et al., who build statistics used to estimate the selectivity of tree pattern queries (also called twig queries), or branching path expressions; XSketch, a synopsis-graph model addressing the optimization of XML queries posed over large volumes of XML data, where the authors construct synopsis structures enabling the estimation of the path and branching distribution in the data graph, to be used for the optimization of the original query. [2] presents a technique for building on-line XML statistics by observing the XPath queries issued to the data source and their result sizes. This technique stores the path expressions and the information about their selectivity for use in estimating the selectivity of future XPath queries. Instead, the approach we propose uses synopsis techniques to execute aggregate queries saving computational costs, paying, when necessary, a little loss in precision. To our knowledge, ours is the first work where synopses are used to find approximate answers to aggregate XML queries. The most recent, and to our knowledge unique, work on XML approximate query answers at the time of this work is [21]: the authors study approximate answers for XML queries focusing only on twig queries with branching path expressions, i.e., they consider the structural part of the problem and, with respect to our work, ignore the value content of the document. Their approach is based on a structural XML synopsis, termed TREESKETCH, that captures, in limited space, the key properties of the underlying path distribution and enables approximate answers for one class of XML queries.
Another difference between our work and the XML synopsis literature is that our synopses are constructed to answer a set of very frequent queries, using a methodology guided by the application.
9 Conclusions
In this work we have described the construction of a synopsis from an XML document collection and outlined the most important characteristics of the synopsis querying process. In this chapter we have supposed that one synopsis graph (summ_e and the corresponding <crit_g1, ..., crit_gn>) is stored in each synopsis XML document, and we have used equi-width histograms as a general-purpose technique for summarizing XML data. In the near future we plan to focus on specific applications of the methodology in order to study other, better-performing statistical techniques. For instance, for collections of documents containing numerical data only, we could use wavelets. Moreover, we plan to extend our approach with a complete study of the problem of updating our synopses using XQuery, as soon as this language has a standardized update syntax.
References
[1] The Galax project. http://www.galaxquery.org/.
[2] A. Aboulnaga and J. F. Naughton. Building XML statistics for the hidden web. In Proc. CIKM'03 Conference, New Orleans, Louisiana, USA, 2003.
[3] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. SIGMOD Record (ACM Special Interest Group on Management of Data), 28:275-286, 1999.
[4] D. Barbarà, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. Bulletin of the Technical Committee on Data Engineering, 20(4):3-45, 1997.
[5] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In Proc. 30th ACM Symp. on the Theory of Computing, pages 327-336, 1998.
[6] Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S. Muthukrishnan, R. T. Ng, and D. Srivastava. Counting twig matches in a tree. In ICDE, pages 595-604, 2001.
[7] C. Faloutsos, Y. Matias, and A. Silberschatz. Modeling skewed distributions using multifractals and the '80-20' law. In Proc. 22nd International Conf. on Very Large Data Bases, pages 299-310, 1996.
[8] J. Freire, J. R. Haritsa, M. Ramanath, P. Roy, and J. Siméon. StatiX: Making XML count. In ACM SIGMOD, Madison, Wisconsin, June 4-6, 2002.
[9] P. B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. DIMACS: Series in Discrete Mathematics and Theoretical Computer Science: Special Issue on External Memory Algorithms and Visualization, vol. A, 1999.
[10] P. B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. In Proc. of Very Large Data Bases, 1997.
[11] P. B. Gibbons, V. Poosala, S. Acharya, Y. Bartal, Y. Matias, S. Muthukrishnan, S. Ramaswamy, and T. Suel. AQUA: System and techniques for approximate query answering. Technical Report, Murray Hill, New Jersey, 1998.
[12] R. Goldman, J. McHugh, and J. Widom. From semistructured data to XML: Migrating the Lore data model and query language. In Proc. WebDB, pages 25-30, 1999.
[13] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In ACM SIGMOD, 2001.
[14] J. Hidders, J. Paredaens, and D. Van Gucht. A light but formal introduction to XQuery. In Second International XML Database Symposium, 2004.
[15] P. Krishnan, J. S. Vitter, and B. Iyer. Estimating alphanumeric selectivity in the presence of wildcards. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 282-293, 1996.
[16] U. Manber. Finding similar files in a large file system. In Proc. Usenix Winter 1994 Technical Conf., pages 1-10, 1994.
[17] S. Marrara. Aggregate queries in XQuery. PhD thesis, Politecnico di Milano, XVII PhD School Edition, 2005.
[18] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proc. of ACM SIGMOD Conference, pages 448-459, 1998.
[19] F. Olken. Random sampling from databases. PhD thesis, U.C. Berkeley, 1993.
[20] N. Polyzotis and M. Garofalakis. Statistical synopses for graph-structured XML databases. In Proc. ACM SIGMOD Conference, Madison, Wisconsin, USA, 2002.
[21] N. Polyzotis, M. Garofalakis, and Y. Ioannidis. Approximate XML query answers. In SIGMOD, 2004.
[22] V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved histograms for selectivity estimation of range predicates. In Proc. ACM SIGMOD, 1996.
[23] J. S. Vitter, M. Wang, and B. Iyer. Data cube approximation and histograms via wavelets. In Proc. of the 7th Int. Conf. on Information and Knowledge Management, 1998.
[24] W3C. XML Path Language (XPath) version 1.0, 1999. http://www.w3.org/TR/xpath.
[25] W3C. XML Query (XQuery) version 1.0, 2004. http://www.w3.org/XML/Query.
[26] L. Wang and E. A. Rundensteiner. Updating XQuery views.
Maximum Entropy Inference for Geographical Information Systems

Hykel Hosni 1,2, Maria Vittoria Masserotti 1, Chiara Renso 1

1 KDDLAB, ISTI-CNR, Pisa, Italy. {Hykel.Hosni;Maria.Vittoria.Masserotti;Chiara.Renso}@isti.cnr.it
2 Department of Mathematics, Manchester University
1 Introduction
An immediate problem in approaching GIS (Geographic Information Systems) consists in giving a sufficiently agreed definition of what GIS actually are. For present purposes it seems reasonable to consider GIS as being characterized by a twofold nature. On the one hand, GIS consist of a technology used for certain purposes. From this perspective, the crucial issues in GIS research amount to computing problems, both on the hardware and on the software level. On the other hand, however, GIS research is increasingly focussed on theoretical issues concerning the representation of geographic information. According to the latter point of view GIS problems include, at the very least, issues of knowledge representation and reasoning. In this chapter we investigate some of the consequences deriving from approaching GIS from the latter point of view. In particular, we will be insisting on the fact that its 'conceptual side', so to speak, commits GIS research to achieving scientific goals which happen to be closely related to some of those pursued in Artificial Intelligence (AI) research.3 In doing so, we adopt a perspective according to which GIS are essentially construed as artificial intelligent agents reasoning about certain classes of natural environments. The reason for our focussing on the issues of knowledge representation and reasoning in relation to GIS is that this perspective provides us with a promising methodological framework for tackling a critical issue in contemporary GIS science: the management of imperfect information. As we will be insisting in what follows, by adopting this perspective we can provide and then exploit a strong link between GIS science and the logico-mathematical discipline of uncertain reasoning.
3 Notice that the 'obvious' connection between GIS and AI research is to be found in the computational essence of the two areas. As we will see, however, much more fine-grained similarities are worth investigating.
H. Hosni et al.: Maximum Entropy Inference for Geographical Information Systems, Studfuzz 203, 269–291 (2006) c Springer-Verlag Berlin Heidelberg 2006 www.springerlink.com
The purpose of this chapter is to put forward a methodologically robust framework for dealing with uncertainty in GIS. We envisage a key role for maximum entropy reasoning and we argue that fundamental questions in current GIS research can be addressed within such a framework. In order to be able to see this, however, a clarification on the perspective we are adopting on GIS is needed. In realistic environments, the information available for (geographic) reasoning will necessarily be imperfect, in various ways, and for various reasons[25]. This is a key factor determining the uncertainty of the conclusions of the corresponding reasoning processes. It is important to realize, however, that talking about uncertainty in GIS makes full sense when we assume an agent-perspective on them, as explained in section 2. Uncertainty, in fact, characterizes the epistemic state [4; 31] of an agent reasoning (and acting) on the basis of the imperfect information it possesses. In particular, a key source of uncertainty is incompleteness: the information that agents are able to collect and retrieve can never be exhaustive on the assumption that they operate in realistic environments. Thus, incompleteness is a typically agent-independent source of uncertainty [25] that puts agents in a condition of background ignorance, the phenomenon we are focussing on here. The agent-perspective on GIS is fundamental in order to understand the sort of (methodological) constraints that have to be met in the formalization of “commonsensical” reasoning under background ignorance. This will be illustrated in section 5 by drawing the main lines of a case study on the applicability of GIS to the investigation of animal behavior. Section 6 is devoted to the introduction of maximum entropy reasoning, its centrality in the modelling of adequate reasoning under background ignorance as well as its pre-eminence in the spectrum of formal models of uncertain reasoning. The viability of the maximum entropy solution for background ignorance modelling in GIS is further illustrated in section 7 where some aspects of its current implementations are discussed. Section 8 sums up the chapter and puts forward directions for future work.
2 Geographical information systems and artificial intelligence
We have mentioned above the fact that GIS are characterized by a twofold nature. In the practice of GIS it seems that this fact has been implicitly assumed over the past two decades, as witnessed by the following definitions.
Duecker 1979 [15] "A geographic information system is a special case of information systems where the database consists of observations on spatially distributed features, activities or events, which are definable in space as points, lines, or areas. A geographic information system manipulates data about these points, lines, and areas to retrieve data for ad hoc queries and analysis."
Burrough 1986 [5] “A powerful set of tools for storing and retrieving at will, transforming and displaying spatial data from the real world for a particular set of purposes.” Clarke 1995 [9] “Automated systems for the capture, storage, retrieval, analysis, and display of spatial data.” Chrisman 1997 [8] “The organized activity by which people: measure aspects of geographic phenomena and processes; represent these measurements, usually in the form of a computer database, to emphasize spatial themes, entities, and relationships; operate upon these representations to produce more measurements and to discover new relationships by integrating disparate sources; transform these representations to conform to other frameworks of entities and relationships. These activities reflect the larger context (institutions and cultures) in which these people carry out their work. In turn, the GIS may influence these structures.” What is the common path followed by these definitions? It is quite clear that the fil rouge among them all is to be found in the characterization of GIS goals: devising computing systems capable of collecting, representing, storing, manipulating and retrieving geographic information. On the basis of such a tentative definition, we can explore some relevant commonalities between GIS and AI research. In order to appreciate how logical theories of reasoning under imperfect information could be deployed in modelling “intelligent GIS”, however, we have to make explicit the fact we are thinking of GIS as essentially autonomous agents. Among the above mentioned characterizations of what GIS actually are, only the one due to Chrisman considers explicitly the (human) agent acquiring and manipulating information as an essential part of the geographic information system.4 Our interest, however, goes somehow beyond Chrisman’s definition, for we think of GIS as being centered around autonomous artificial agents. For this reason, throughout this chapter, we will be calling such broader systems GIS-agents. A major achievement of the past two decades in AI research has been the introduction of the ‘agents metaphor’ in the study of intelligent artificial systems.5 Among its many consequences one of particular interest to us is the anchorage of the rather abstract notion of “reasoning” to several types of environments. In a nutshell, an agent is understood as an embodied entity, that is to say an entity whose reasoning is about an environment which is subject to its (and other agents’) actions. Were we able to carefully explain how this interaction takes place, most of the fundamental questions of AI 4
4 Notice however that the other definitions do not appear to be inconsistent with such an idea.
5 Due to obvious space constraints we plainly assume here the agents perspective in AI. Complete details can be found in [27] and in the more recent [54].
would be answered. It goes without saying that we set ourselves a much more modest aim here: accounting for the relation between GIS-agents and some aspects of the formalization of reasoning under imperfect information. The main reason we argue in favor of the necessity of the agents perspective is easily explained by noticing that were GIS mere tool-boxes, no sensible relevant issue concerning GIS reasoning would even arise, let alone reasoning under imperfect information. But there are further methodological reasons for turning our attention to the relation between GIS and (multi-)agent systems research. We have stressed in the previous section that the practice of GIS requires a variety of agents (both human and not) to interact.6 Therefore it is extremely natural to think of GIS as multi-agent systems in which the various tasks characterizing the diverse nature of GIS are distributed, say among agents responsible for acquiring, storing, and manipulating relevant information. As a consequence, approaching theoretical GIS research from the agents perspective opens up a variety of computing issues. We limit ourselves to listing the major ones:
communication: i.e. the sharing of information among agents;
cooperation: i.e. the joint effort of several agents aimed at achieving some goal which would go beyond their individual capabilities;
negotiation: i.e. the activity by means of which several agents (who normally have competing goals) achieve agreement.
Such issues are being extensively studied in the broader domain of multi-agent systems (see e.g. [54] for a recent reference work). Thus it seems quite clear that when talking about the role of multi-agent systems in theoretical GIS research, the 'direction of fit' is understood to be from the general to the particular, that is to say, the general framework developed in the context of multi-agent systems will have to be adapted to the particular case of GIS-agents.7 Since this study is essentially devoted to foundational issues, we postpone further details on this until subsequent stages of the research underway.
3 Geographical information systems and uncertainty
Like any other agent operating in realistic environments (that is to say, environments complicated enough so as to be seen as portions of the "real world"),
6 In the perspective of AI, we clearly aim at delegating as much workload as possible to artificial agents, putting ourselves in the comfortable position of 'users'. As discussed in section 3.3 [16], this is one of the motivations underlying the 'naïve geography' approach.
7 There are surely different sorts of constraints characterizing the ontology of GIS. The abstract framework that we set out to develop, however, aims at capturing the problem of uncertainty management in GIS in its full generality.
GIS ought to face the inevitability of imperfect information [10; 14]. But why is it the case? This is best explained by briefly recalling what sort of features characterize realistic agents. As we have just seen, agents make full sense only in terms of embodied entities. Then, whether they have a physical (robots) or software (softbot) nature, agents are necessarily limited entities. This has crucial consequences that we now briefly consider by looking at the sort of things we expect a realistic agent like a GIS-agent to do. First of all, an agent acquires information by means of perceiving its environment. Then perceived information (or data) is suitably represented and eventually stored in the system’s memory. Let’s call the set of stored representations of information, the agent’s knowledge base.8 We take, for present purposes, agents’ reasoning as a function of their knowledge base whose output consists of new pieces of information (or equivalently, old ones made more specific). From this point of view, as it will be exemplified in the following, the modelling of GIS-agents reasoning arises from the specification of suitable constraints on their inference process. Agents’ reasoning process might be triggered by a variety of different circumstances which might span from performing particular actions to answering certain queries. Whatever the goal reasoning is to fulfill, we expect intelligent agents to make the best possible use of the information available to them, given their essential limitations. As we are about to discuss, it is just because of such limitations that agents cannot undertake the sort of reasoning formally captured by classical logic. Having drawn this rough picture of the GIS as an agent we now tackle the crucial question: Where does all the uncertainty come from? As far as realistic agents are concerned, there are essentially two sources of imperfection which GIS research has to care about and which we call “agentindependent” and “agent-dependent” respectively (similarly called “populating GIS with data or analyzing it” in [14]). Those sources are simultaneously active at the perception, representation and reasoning levels and can purposefully be distinguished. The reason for so doing is that of distinguishing between technological and scientific problems, that is to say distinguishing among problems which should be tackled at the hardware (or software) level and those for which an abstract, theoretical solution is needed.9 Thus whereas agent-dependent sources of imperfection are generally to be tackled technologically, the agent-independent ones require geographic information systems
8 Within GIS research, the term database is often preferred. Databases, usually, correspond to particularly structured pieces of information. At this stage, however, it seems more appropriate to deal with a more general notion.
9 Unlike "abstract", in the context of our study, "theoretical" will essentially mean "logico-mathematical".
to satisfy (some of) the rational/commonsensical10 constraints on agents’ reasoning as studied in the context of uncertain reasoning research. This distinction between ”agent-dependent” and ”agent-independent” imperfection is, in some sense, orthogonal to the well known ontology of imperfection as described in [14]. Here, imperfection in GIS has been defined by two distinct orthogonal components: inaccuracy or error that means lack of correlation between observations and reality. Imprecision concerns a lack of specificity or a lack of detail in an observation or representation. A special case of imprecision is vagueness that is related to the presence of borderline cases [14]. The topic of uncertainty has been dealt with in AI for a long time and, especially in GIScience, it has been widely studied in the literature for many years from different points of view: from error to inaccuracy, ambiguity, granularity, vagueness and so on [6; 2; 14]. The several different approaches in the literature face specifically with different aspects of uncertainty (or imperfection in GIS). As an example, Couclelis in [10] offers a good point of view of the role of uncertainty in GIS and also from the point of view of mathematics and physics, along the history. This yields author to the conclusion that uncertainty is inevitable and therefore it is extremely important to tackle uncertainty into consideration when representing and reasoning with geographic world. A good starting point for a survey on all aspects on uncertainty in GIS is Worboys and Duckham book chapter on Uncertainty [14]. However, as we will point out later, this chapter focuses on an approach to a specific kind of geographic imperfection that is called information incompleteness. 3.1 Agent-dependent vs agent-independent limitations To the purpose of briefly illustrating the contrast we are making here, let’s take the case of collecting data for GIS. Data acquisition, depending from the specific application field, can be performed by different activities. Consider the case of collecting data from a marine environment such activities are denoted as sensing and measuring. Both activities inevitably generate imperfection in the resulting representations independently of the agents’ representational systems but dependently on the nature of agents and their environment. Sensing is clearly subject to the actual physical constraints of the particular agents at hand. Although there will always be a limited number of objects that the agent can simultaneously perceive from a finite (and predetermined) number of distinct perspectives, the number of such objects and perspectives can clearly be maximized by means of appropriate technological solutions.11 10
10 We do not distinguish here between the (slightly different) notions of "rationality" and "commonsense".
11 A point closely related to the latter has been investigated within the context of naïve geography, to which our approach is compared in section 3.3.
Nonetheless, as just remarked, it is evident that –even in principle– the information agents are able to collect cannot be exhaustive if they operate in realistic environments. This leads to what we will be calling incomplete knowledge-bases. For the present we take the latter to mean that there is some aspect of the agent’s environment which is not represented in the agent’s knowledge base. Such an issue cannot be solved by means of improving, say the perceptual processes of the agent itself. Rather, what is needed, in order for the GIS to make the most of the available information, is a clear understanding of what sort of constraints that should be imposed on the agent’s reasoning function so as to allow it to draw commonsensical conclusions given its incomplete knowledge base. Having said that, still one might be inclined to think that such imperfect knowledge-bases might amount to incomplete yet accurate (i.e. crisp) representations of the agents’ environment(s). Due to the intrinsic granularity of measuring however, this does not turn out to be the case.12 In a nutshell, the problem of granularity consists in the fact that when measuring say, through specific application dependent devices, agents transform essentially continuous entities into essentially discrete representations. This, in general, affects the representation of spatio-temporal relations, leading to the phenomenon of vagueness. The implications of incomplete and vague knowledge bases are described in more details in section 4, whereas in section 6 we focus on one particularly important case of incomplete knowledge bases leading to background ignorance. The imperfection which originates as a byproduct of data acquisition clearly interferes with the subsequent manipulation of the stored representations. Therefore the imperfection which is so ‘inherited’ affects the capability that agents have of drawing conclusions from their knowledge bases, like e.g. planning some appropriate actions or answering some particular queries. We have briefly pointed out above that although they might enable better performances on the whole, technological solutions cannot annihilate such interferences: Due to the presence of agent-independent constraints, this cannot be the case, even in principle. Therefore, we claim, in order for GIS-agents to overcome the difficulties in handling such imperfect knowledge bases, they have to satisfy some commonsense constraints. Acknowledging the inevitability of imperfect information, our proposal suggests understanding how imperfect information can be managed by means of commonsensical reasoning. 3.2 The need for theoretical solutions Consider again the problems related to sensing and measuring. Their negative effects –i.e. the amount of imperfection generated– can be reduced by means of suitable hardware/software solutions. For example the range of data and 12
Notice also that in normal circumstances sensing will also lead to inaccurate representations due to various forms of noise.
the amount of details acquired crucially depends on the nature of sensors: the more efficient the sensors, the less the imperfection generated by them. To see why technological improvements cannot guarantee the required solution to the problem of imperfect information management we need to focus on the nature of realistic environments. Of particular interest for our purposes is to notice that any portion of environment (e.g. terrestrial or marine) corresponds to a inhomogeneous and proactive environment. As an example, marine environments are inhomogeneous to the extent that the behavior on the shoreline is considerably different from the one on open sea, with respect to, say, marine currents. Of a similar nature is the problem constituted by temperature fronts, that is to say areas where a significant change of temperature is recorded within tiny distances. Analogous cases can be found in terrestrial environments (wooden area is much different from urban areas, quite similar but not identical to rural areas etc). Thus the inhomogeneity of environments gives rise to the phenomenon of context dependent information, that is to say, information pertinent to a specific area may not be adequate for other –even ‘similar’– areas. Of course, the fact that environment information is essentially layered adds in further variables. On the other hand, proactiveness captures the fact that environments change over time independently of agents’ actions in a way that agents cannot, in general, predict. Thus, proactiveness is much stronger than dynamism. Proactiveness clearly makes any knowledge base whatsoever imperfect to the extent it puts the agent in a condition that we’ll be calling background ignorance: at any point in time the agent cannot be taken to have complete information about its environment. As discussed later on, the phenomenon of background ignorance is to be found in any GIS application. Let us put to work what has been said so far. By means of briefly analyzing the GIS tasks it emerged that there are essentially two possibilities in order to cope with imperfect information in GIS: either by improving the mechanisms of data acquisition/representation –and this, we have suggested, would essentially remain within the domain of GIS technology– or by isolating constraints that would enable commonsensical reasoning in GIS. The existence of agent-independent sources of imperfection, however, rules out the former as a satisfactory –in terms of generality– solution to our main problem. This clearly leaves open to us only the latter option. 3.3 GIS and common sense: the Na¨ıve Geography approach Those pointed out above are not the only reasons for imposing commonsense constraints on GIS-agents overall behavior. With respect to its main goal – enabling commonsense reasoning in GIS– the na¨ıve geography approach [16] is indeed notably close to the perspective we are suggesting in the present study, though the motivations underlying the na¨ıve geography programme are somehow different.
Naïve geography is defined as "the body of knowledge that people have about the surrounding geographical world" [16]. The main motivation for the formalization (and hence the implementation) of "naïve geographic knowledge" consists in allowing people without specific training to make the most of their interaction with geographic information systems. We now focus on the most important (in our view) commonalities while postponing further discussion of the differences until the end of this section.
Spatial and temporal reasoning. A cornerstone of the framework we are suggesting consists in taking GIS to be autonomous agents, that is to say, spatially and temporally embodied entities. Spatial and temporal reasoning, on the other hand, are "central to naïve geography" [16]. Notice, however, that we have not directly argued for specific spatio/temporal constraints for commonsensical GIS. Rather we have limited ourselves to noticing that –already at a purely abstract level– taking GIS as (multi)agent systems is sufficient to bring spatially and temporally embedded reasoning into the picture. Recall, from section 3.2, that without agents' reasoning being temporal no notion of proactiveness could make sense, pretty much in the same way no issue of context-dependence can arise in non-spatial reasoning. It must be stressed that spatio/temporal reasoning enters explicitly into the picture only by means of appropriate formal constraints.
Realistic environments. The fact that GIS-agents are embedded in realistic environments constitutes a key assumption underpinning our framework. We have seen before how this, combined with their necessary physical limitations, results in agents being able to perceive a limited number of objects from a limited number of different perspectives. Those are precisely the features that characterize "geographic space" in the naïve geography approach.
Qualitative reasoning. We have paid particular attention to the phenomenon of background ignorance, stressing that it corresponds, by and large, to the usual epistemic or informational state of the agent. Although this is a topic for the next sections, it is appropriate to recall here that a very similar concern is put forward in the naïve geography approach through the notion of "qualitative reasoning":
In qualitative reasoning a situation is characterized by variables that can only take a small, predetermined number of values and inference rules that use these values in lieu of numerical quantities approximating them. Qualitative reasoning enables one to deal with partial information, which is particularly important for spatial applications when only incomplete data sets are available [16].
We conclude this section by considering some significant differences between our approach and the one developed in the context of naïve geography. In the formalization of geographical commonsense reasoning as proposed by the latter, geographical space is two-dimensional. In the general context of GIS, however, this cannot be the case. Indeed –as we have stressed above– geographical space is extremely inhomogeneous, requiring a fortiori GIS-agents
to make sense of both partial and contextual information. Related to the original motivations is the second main point of divergence. By aiming at making the deployment of GIS technologies by non-experts easier, naïve geography is focussed on human-machine interaction to an extent that our approach is not.
4 Characterizing imperfect information management

Unfortunately, acknowledging a tight interaction between GIS and uncertain reasoning research does not provide the former with a cut-and-dried account of what it is to manage imperfect information. To date, it is by and large impossible to satisfactorily represent all the interesting features of reasoning under imperfect information within a unique mathematical framework. This is not to say, however, that no feature at all can be captured. It is precisely in this spirit that we focus on rational/commonsensical constraints on agents' reasoning. We have seen above that reasoning about (and hence acting upon) geographic domains already requires the ability to face background ignorance, context-dependency and vagueness. The purpose of this section is to investigate the sort of formal constraints that GIS-agents need to satisfy if they are to exhibit commonsense. We propose the following slogan: Commonsensical GIS-agents must be able to implement nonmonotonic and fuzzy reasoning. We now briefly discuss why this has to be the case and then move on to focussing on the management of background ignorance.

4.1 Managing background ignorance: non-monotonic reasoning

The discussion of section 3.2 led us to the conclusion that if we are to model commonsensical GIS-agents, then we ought to account for the phenomenon of background ignorance. This must be the case since the proactiveness of real environments implies that at no point in time will the agent's knowledge base ever be complete.13 Classical logic is monotonic to the extent that the conclusions that can be drawn from a given set of premisses have to be valid at any time and in any place where at least such premisses hold. This means that nothing the agent might be in a position to learn afterwards can affect the (logical) status of such conclusions. Put very loosely (and somewhat circularly), monotonic reasoning can never lead to the revision of an agent's knowledge base. Straightforwardly, this cannot be a desirable pattern of reasoning for GIS-agents.14
13 Similarly, we have seen that the essentially layered nature of information implies the context-dependency of agents' knowledge bases.
14 Immunity to new information is only compatible with omniscient agents or absolutely stable environments, neither of which is compatible with the underlying notion of GIS-agents.
Therefore it is easily seen that abstract models of GIS with commonsense cannot be provided by means of monotonic logical systems. Different ways of getting rid of this undesirable form of reasoning have given rise to different formal approaches to non-monotonic reasoning. The first formal account of a type of non-monotonic reasoning was given in Reiter's [47], where the so-called "closed world assumption" is introduced.15 According to it, the agent is allowed to draw conclusions as if it had complete knowledge of the world, with the proviso that the agent must withdraw those very conclusions in case it acquires new information contradicting them. This idea culminated in the formalization of Default Logic [48]. Since the first steps, a tremendous number of logical systems rejecting (unconstrained) monotonicity have arisen (see [4] for a recent comprehensive survey and [11] for a computationally-oriented account of the main 'survivors'). Therefore we cannot properly speak of a single non-monotonic logic, but we should consider it as a family of logics. The approach based on consequence relations, pursued by [18; 35; 32], makes rather transparent the reciprocal relations between such distinct systems, hence qualifying it as a suitable framework for investigating commonsensical constraints on GIS-agents.16 Non-monotonicity is also required for dealing with property-inheritance in inhomogeneous environments. As discussed above, context-dependent information, as the result of, say, temperature fronts, is likely to make the inheritance of some properties undesirable. The problem of blocking undesirable inheritance has been extensively treated in the early literature on non-monotonic reasoning (see, e.g. [19]).

4.2 Managing vagueness: fuzzy reasoning

Space and time are obviously critical issues to be accounted for when modelling GIS-agents' reasoning. One of the difficulties they generate in modelling geographic reasoning is that they correspond to non-discrete entities. Representing information about non-discrete (or equivalently, continuous) entities typically gives rise to vague statements. One example which has been studied in the GIS perspective is the nearness relation [55; 13]. It clearly follows that GIS-agents with commonsense must be able to deal with relations such as the nearness one. This is even more compelling if we agree with the programme of naïve geography, and we seek an easier non-expert-human/GIS interaction.
15 It is perhaps more than a curious historical accident that, after Aristotle, the interest in qualitative non-monotonic forms of reasoning was revived by issues in the theory of databases.
16 Recent mathematical developments, like [17; 31; 24], show the inevitability of non-monotonic reasoning in the characterization of rational reasoning under imperfect information.
This book focuses on research advances in approaches for incorporating explicit handling of uncertainty, especially by fuzzy sets, to address geographic problems. In order to better appreciate the requirement of "fuzziness" (see [44] for the most recent advances in handling uncertainty by fuzzy sets) for commonsensical GIS, we need to recall that the main consequence of vagueness is the impossibility of defining truth values without generating paradoxes, like, for example, the most famous Sorites one. Put the other way round, the only way of determining the truth value of a vague statement (i.e. a statement involving a vague predicate or relation) is that of allowing (possibly infinitely many) degrees of truth in place of the standard binary values. Back in the 1920s Lukasiewicz provided the first suggestions for devising logical systems capable of making sense of degrees of truth, or equivalently, many-valued logics.17 The topic of many-valued logics had a tremendous revival in the mid-Sixties, essentially due to Zadeh's introduction of the notion of fuzzy set in his seminal [56].18 Zadeh's work has given rise to a fairly big number of formal approaches for modelling fuzziness, the most well-studied among them being the domain of possibility measures (see, e.g. [12] for a survey).

4.3 Qualitative vs. quantitative approaches

When it comes to formal theories of reasoning under imperfect information, the debate arises about which side of the great divide better serves the purposes of the theory. We now briefly deal with the issue by addressing the following question: What sort of conceptual difference mirrors the distinction between qualitative and quantitative methods? We first distinguish between problems which are essentially quantitative and those which are not. It is clear that if we are dealing with an essentially quantitative problem, no debate can reasonably arise: only quantitative solutions can be adequate. A case in point is given by many-valued semantics. It is clearly quantitative to the extent its intended goal –as remarked just above– is that of capturing the notion of "degrees of truth", which is clearly a genuinely quantitative issue. A case in which the problem is not necessarily quantitative is that of characterizing agents' belief-formation, that is to say –in our present context– characterizing the kind of flexible and sensible reasoning we expect from commonsensical GIS-agents.
17 Notice that, pretty much for the same reasons given above for non-monotonic logics, we speak of a family of many-valued logics, rather than a single many-valued logic. This essentially depends on the fact that there are slightly distinct ways in which the notion of degree of truth can be captured. See [3] for a discussion on that.
18 The enthusiasm of the early days contributed to the rather rhapsodic development of the subject. See [23] for a brief yet compelling conceptual review, and [22; 3] for the advanced mathematical details with respect to the formalization of reasoning under vagueness.
The point is better illustrated by means of an example. Let us consider Pearl et al.'s ε-semantics. In a nutshell, ε-semantics uses infinitesimal probabilities to give meaning to default rules.19 In other words, defaults like "θ typically implies φ" are taken to hold just if the probability of the corresponding conditional ("φ given θ") is infinitesimally close to 1. Now, despite using numbers,20 ε-semantics clearly captures a qualitative aspect of reasoning under imperfect information. This follows from the fact that the expressive power of ε-semantics does not go beyond what the agent takes to be "extremely probable". Put in other words, if we interpret ε-semantics as giving a formal meaning to agents' beliefs, then no degrees of belief can be captured by ε-semantics. This clearly contrasts with the case of standard Bayesian probability accounts, where there are infinitely (actually uncountably) many values a belief function can range over.21 Thus, from this point of view, there is a purely extensional difference between non-monotonic reasoning and classical probabilistic reasoning, and it lies in the expressive power of their characterization of belief. Again, this is easily seen by considering the fact that non-monotonic reasoning can be given a probabilistic semantics provided that no degrees of probability values are allowed. This is nothing but an easy example showing how, in general, qualitative reasoning can be considered to be a very special case of the quantitative one. Another remark on the expressive power of formal models is appropriate. In standard Bayesian models of reasoning under imperfect information, probability values are taken to be point-valued. In other words the value of a probability (belief) function is an exact –precise– numeric value in the interval [0, 1]. Giving such a numerical value, however, would not be possible for realistic agents in a number of cases of interest. Those are the cases in which indeed information is imprecise.22 This requirement of precision has been relaxed in the imprecise probability model fully described in [52] (but see also the more manageable [53]).
19 See [43] for an introductory discussion of ε-semantics in the context of Default Logic.
20 A particularly unfruitful aspect of the debate consists in taking the distinction qualitative vs. quantitative to be mirrored by the symbolic vs. numeric one.
21 See [39] for the mathematical details on the Bayesian approach and [26] for the related epistemological issues.
22 The difference between imprecision and vagueness might be recalled intuitively by pointing out that whereas the former is an attribute of degrees of belief, the latter is an attribute of degrees of truth.
5 Thorny queries

In order to have a more tangible GIS scenario at hand, let us focus on a case-study of animal behavior. Here, we very briefly describe the research on Hystrix cristata (crested porcupine) in the Natural Park of Maremma, Tuscany (details can be found in [7]), giving a sketch of a possible formalization of a problem raised by biologists. The study of the animals' behavior starts from the collection of (relevant) data about them. This operation is mainly performed by means of radio tracking techniques, where individuals are provided with a radio collar and their localization is fixed at a given time interval by means of a technique called "bearing", which relies on radio signals emitted from fixed stations matching the signal coming from the animal radio collar. Each localization is called a fix and is characterized by the identifier of the tracked animal Id, the spatial coordinates X and Y, and the time of the bearing T. What behavioral ecologists expect to gain from the analysis of tracked animals is essentially an accurate account of certain observed behaviors of the porcupine in relation to its environment. As an example of such kind of analysis, the study of the behavior of the animal with respect to its den plays a crucial role in the overall understanding of the porcupine characteristics. Typically, scientists will be interested in understanding what (if any) are the individuals' preferences concerning the location of the den, or how much time is being spent inside the den by the animal. Moreover, discovering which individuals share the same den can be relevant for understanding their mating system. The census of the dens can be done by scientists by means of a technique called homing-in, which amounts to constructing a database of the dens after their physical location has been ascertained by following radio-collar signals. Homing-in leads to a localization of the dens which, even if almost free of errors (just the minimal error caused by instruments or possible human mistakes), nonetheless turns out to be very expensive in terms of human effort. As a consequence, it can only be repeated on a bi-monthly basis. Since animals typically change their dens over time, the homing-in database is subject to a number of "gaps" resulting from the ignorance of the actual (individual, den) pairs. It can easily be seen that GIS designed to accomplish tasks of this sort need to be capable of handling the background ignorance caused by those "gaps". Considering the problem of dens localization, every inference relating certain particular individuals with their own dens is clearly to be performed under ignorance of the possible changes that might have taken place since "the last physical measurement". Abstracting from the particular example, reasoning under background ignorance is perhaps the most fundamental issue in uncertainty management for GIS. Every dynamic, and indeed pro-active [25], system (be it a society of porcupines in their environment, a fragment of coastal territory, or numerous other contexts) in fact puts agents reasoning about the system in a condition of background ignorance.
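To fix ideas on the data involved, the sketch below shows one possible way of representing fixes and homing-in census records as plain data structures. It is purely illustrative: the field names (animal_id, x, y, t, den_id, occupant_id, census_date) are our own choices and do not reflect the schema actually used in the Maremma study [7].

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Fix:
    """A single radio-tracking localization of a collared animal."""
    animal_id: str      # identifier of the tracked animal (Id)
    x: float            # spatial coordinate X
    y: float            # spatial coordinate Y
    t: datetime         # time of the bearing (T)

@dataclass
class DenRecord:
    """A homing-in census entry: a den whose position was physically verified."""
    den_id: str
    x: float
    y: float
    census_date: datetime
    occupant_id: Optional[str] = None   # unknown ("gap") if the pairing was not observed

# A "gap" in the homing-in database is simply a DenRecord whose occupant is unknown,
# or an animal whose current den does not appear in the census at all.
fixes = [Fix("p01", 42.3, 11.1, datetime(2004, 5, 3, 23, 15))]
dens = [DenRecord("d07", 42.1, 11.0, datetime(2004, 4, 30))]
```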
We are now in a position to see how the phenomenon of background ignorance comes into the foreground in the use of GIS for the animal behavior case study. Questions about the localization of unknown dens could be formalized as:
Q: Given a number of known dens positions and a set of animal fixes, infer the position of the dens given that there are "gaps".
If we abstract from the nature of the parameters occurring in a successful dens localization, we can see that the above query embeds a very general pattern of what is normally called "inductive inference", the kind of inference –to put it crudely– that leads an agent to draw principled conclusions about as yet unobserved phenomena from the available observations/data. And this pattern of reasoning is where the agent perspective on GIS becomes essential. In such a framework we can, in fact, understand the above query as:
P-Q: Given the information possessed concerning a number of dens positions and the animal fixes, determine the probability that the dens d_1, ..., d_n will be in the locations (x_i, y_i).
Notice that the probabilistic nature of the query is extremely natural both as a direct representation of the intrinsic uncertainty of the GIS-agent answer and in relation to the general information provided by domain experts. Solving queries of this sort, in fact, among other things involves deploying domain experts' specific knowledge (in this case concerning the animal's "den-behavior") to construct a reliable hypothesis that can accommodate the possible changes occurring within the dynamic system. But domain-specific knowledge cannot be taken to be "perfect" or otherwise "certain" knowledge. Rather it is best represented in probabilistic form. So, for example, the conjecture that the main activity of porcupines during the time spent outside the den is feeding [7] could be informatively represented by ω(feeding | outside_den) = 0.8, where ω is a probability function (see below). Within this framework it is reasonable to require that an adequate solution to our query should provide "the most accurate" probability evaluation consistent both with the information possessed by the GIS-agent and with the domain expert's general hypotheses. A natural way of formalizing the idea of "the most accurate" evaluation consists in requiring that the construction of the probability assignment (i.e. the solution to the probabilistic query) should, while satisfying the consistency requirements, introduce as little arbitrary information as possible, that is to say information that goes beyond the collected data and the domain-expert evaluations. Moreover, given the underlying uncertainty, the solution must adhere to (i.e. be justified by) principles of rational reasoning. From a logico-mathematical point of view, these two requirements amount to maximizing the entropy of the solution [39; 40; 42].
6 MaxEnt and background ignorance

We recall briefly some of the main ideas behind Maximum Entropy (MaxEnt) inference, relating them to our underlying problem. There is an extremely rich literature on MaxEnt in uncertain reasoning. An accessible introduction for the non-specialist is [40], whereas for more advanced presentations see [39; 42; 31]. The latter works also provide extensive references to the relevant literature. Recall that a ubiquitous feature of imperfection in geo-referenced information is the presence of "gaps" related to the periodicity of relevant measurements. It is worth emphasizing that there is virtually no way around the resulting background ignorance, no matter how efficient our measurement and recording methods are. MaxEnt inference allows us to define meaningful and methodologically robust notions of logical consequence for dealing with background ignorance, a fundamental requirement if we are willing to take GIS as agents, that is "reasoners". This allows us to characterize GIS, as far as their reasoning capabilities are concerned, by means of an appropriate inference process. The particular instance of inference process discussed below consists in probabilistic logic programs under MaxEnt. In such a framework, knowledge (intended as the information, or data, possessed by an agent) is represented probabilistically, indeed by a finite set of (linear) constraints on the agent's subjective probability function. Knowledge being represented by subjective degrees of probability amounts to requiring –as seems perfectly appropriate for GIS– that agents' answers must be based only on the information available to them, with the explicit assumption that it is all the knowledge possessed by the agent (what is called Watts Assumption in [39]). In order to see how this could be captured and hence applied for the solution of our P-Query, we need a bit of formal setting-up, which we take from [39]. Let SL be the set of sentences (denoted by θ, φ, etc.) of a finite propositional language L = {p_1, ..., p_n} and let K_G denote the probabilistic knowledge possessed by a given GIS-agent G at a fixed point in time. Therefore knowledge of a set of fixes might be thought of as a very special case of one such K. We can conveniently think of K as a finite set of conditional constraints of the form ω(θ | φ) = x, where ω is a probability function on SL and x ∈ [0, 1]. Note that unconditional constraints are obtained by taking the conditioning sentence (φ) to be any tautology. Moreover we assume that ω(φ) > 0. An Inference Process on L, then, is defined as a function N such that, for K a consistent finite set of linear constraints of the form ω(θ_i | φ_i) = x_i, with x_i ∈ [0, 1], N(K) is a probability function ω on SL satisfying K. It should be noted that, from the representational point of view, we assume no qualitative difference between the data possessed by the agent and the domain-specific knowledge supplied by the expert. As a consequence it makes perfect sense to consider both sources of knowledge as defining the constraints against which the consistency of the solution should be checked.
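As a toy illustration of the setting just described, the sketch below represents a knowledge base K_G of conditional constraints ω(θ | φ) = x over a two-variable language and checks whether a candidate probability function, given as a distribution over the atoms (the truth-value assignments to the propositional variables), satisfies it. The encoding of sentences as Boolean functions of atoms is our own simplification and is not taken from [39] or [30].

```python
from itertools import product

VARIABLES = ["feeding", "outside_den"]                        # the finite language L
ATOMS = list(product([True, False], repeat=len(VARIABLES)))   # 2^n truth assignments

def prob(omega, sentence):
    """Probability of a sentence under omega, where omega maps atoms to [0,1]
    and a sentence is a Boolean function of a variable assignment."""
    return sum(p for atom, p in omega.items() if sentence(dict(zip(VARIABLES, atom))))

def satisfies(omega, constraints, tol=1e-9):
    """Check whether omega satisfies every conditional constraint w(theta | phi) = x."""
    for theta, phi, x in constraints:
        p_phi = prob(omega, phi)
        if p_phi <= 0:                      # we assume omega(phi) > 0, as in the text
            return False
        if abs(prob(omega, lambda a: theta(a) and phi(a)) / p_phi - x) > tol:
            return False
    return True

# K_G with the single domain-expert constraint  w(feeding | outside_den) = 0.8
K_G = [(lambda a: a["feeding"], lambda a: a["outside_den"], 0.8)]

# One candidate probability function over the four atoms (TT, TF, FT, FF)
omega = {(True, True): 0.4, (True, False): 0.1,
         (False, True): 0.1, (False, False): 0.4}
print(satisfies(omega, K_G))   # True: 0.4 / (0.4 + 0.1) = 0.8
```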
Given a finite consistent set of constraints K_G there will, in general, be many formally consistent ways of extending the information possessed by G to the query at hand. What is needed, then, is a principled way of discarding those solutions that, though logically consistent, would nonetheless fall short of being "rational" or "commonsensical". This is obtained by identifying the solution to our query with the one provided by the MaxEnt Inference Process which, for K_G as above, is that unique probability function N^ME(K_G) consistent with K_G as well as with a number of principles of rationality, and for which the Shannon entropy
−∑_{i=1}^{J} x_i log x_i
is maximal. Shannon entropy [51] is a measure of the "uncertainty of information", so that its maximization leads to an inference process that introduces as little arbitrary information as possible, where arbitrary loosely means "departing from what is given in K_G" [39; 28]. Hence, we can see how the MaxEnt inference process captures formally the two intuitive constraints on the solution of our P-Query above. There are many logico-mathematical realizations of MaxEnt inference processes. A particularly interesting one, for GIS research, is the framework of probabilistic logic programming under MaxEnt recently investigated by G. Kern-Isberner and T. Lukasiewicz [30], where the finite set of consistent linear constraints K_G is expressed in terms of a consistent probabilistic logic program. Probabilistic logic programs under MaxEnt, then, bridge between the theoretical results concerning MaxEnt inference processes and the corresponding computational logic (the full formal details can be found in [34; 30]). Moreover, as illustrated in [30], there exists a strong (formal) relation between probabilistic logic programming under MaxEnt and the normative properties of non-monotonic consequence relations. It is shown that, among other properties, the former satisfies the rules/conditions of the theory of Rational Consequence Relations, which is usually regarded as capturing the "core aspects" of non-monotonic reasoning (see, e.g. [32; 33; 31]). It is worth recalling here that the theory of Rational Consequence Relations is both formally and conceptually tied to the standard AGM paradigm for belief revision. See [50] for an extensive discussion on this. Let us now focus on the key properties satisfied by N^ME(K_G). A crucial result in this area (see chapter 7 of [39]) is that N^ME(K_G) is the only inference process which satisfies (along with the consistency requirements introduced above) a number of rationality principles. We discuss here just two of them, namely Irrelevance and Obstinacy. To illustrate the intuitive idea underlying the former, suppose that G is to solve a query concerning the localization of one particular den. It would be irrational for the agent to answer on the grounds, say, of the proportion of male researchers in the Italian Research Council. The resulting conclusion would have to be considered fallacious (i.e.
unjustified) due to such an irrelevance. Moreover, if we allowed GIS-agents to overlook the relevance of the constraints being used, the computational task of solving the query would almost inevitably suffer an unmotivated increase in complexity. Now, despite the fact that characterizing (ir)relevance in all its subtleties amounts to solving one of the hardest problems in the formal characterization of intelligent behavior [20; 45; 37], MaxEnt is consistent with a very natural formalization of relevance for finite sets of consistent linear constraints:
IRRELEVANCE. If K_G1 and K_G2 are finite consistent sets of linear constraints in L, θ ∈ SL, but no propositional variable appearing in θ or in K_G1 appears also in K_G2, then N(K_G1 + K_G2)(θ) = N(K_G1)(θ).
Data acquisition is, in many respects, an expensive business. An important consequence of this, as we have recalled above, is the fact that the actual physical dens localization cannot be performed very frequently. This is clearly to be generalized to the whole domain of GIS. It is therefore of the greatest importance that the process of revising probability assignments should be constrained in such a way as to preserve as much information as possible, that is to say, by avoiding unnecessary revisions. This is the idea underlying the principle of Obstinacy, which is captured formally by:
OBSTINACY. If K_G1 and K_G2 are finite consistent sets of linear constraints in L and N(K_G1) satisfies K_G2, then N(K_G1 + K_G2) = N(K_G1).
Obstinacy is a particular case of constrained monotonicity, that is, a principle that constrains the enlargement of the set of premises from which a conclusion has been previously drawn. So, for example, suppose that for a particular den the GIS-agent provides a solution to the P-Query. Such a solution will have the form of a probability assignment (consistent with K_G) to the (relevant) sentences of SL. Now, if we expand K_G with a new constraint that is already satisfied by the solution to our P-Query, then the solution for the enlarged set of constraints should not diverge from the one previously obtained. The intuition captured by Obstinacy is clearly deeply related to non-monotonic reasoning. Indeed, as shown by [30], all the key properties of non-monotonic consequence relations are satisfied in their framework for probabilistic logic programming under MaxEnt. Other aspects concerning the connection between non-monotonic reasoning and MaxEnt are discussed in [24].
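To see N^ME at work on a small example, the sketch below computes a maximum entropy distribution over the four atoms of the toy language used above, subject to the constraint ω(feeding | outside_den) = 0.8 rewritten as a linear equation on the atom probabilities. This is a brute-force numerical illustration only; it is neither the SPIRIT engine nor the probabilistic logic programming machinery of [30; 34].

```python
import numpy as np
from scipy.optimize import minimize

# Atoms of L = {feeding, outside_den}, ordered (T,T), (T,F), (F,T), (F,F).
# Linear constraints A @ x = b on the atom probabilities x:
#   normalization, and  w(feeding | outside_den) = 0.8  <=>  0.2*x_TT - 0.8*x_FT = 0.
A = np.array([[1.0, 1.0, 1.0, 1.0],
              [0.2, 0.0, -0.8, 0.0]])
b = np.array([1.0, 0.0])

def neg_entropy(x):
    x = np.clip(x, 1e-12, 1.0)               # avoid log(0)
    return float(np.sum(x * np.log(x)))      # minimize negative Shannon entropy

result = minimize(
    neg_entropy,
    x0=np.full(4, 0.25),                      # start from the uniform distribution
    bounds=[(0.0, 1.0)] * 4,
    constraints=[{"type": "eq", "fun": lambda x: A @ x - b}],
    method="SLSQP",
)
print(np.round(result.x, 3))   # the MaxEnt atom probabilities satisfying K_G
```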
7 Implementing MaxEnt

Despite the numerous arguments in favor of reasoning under MaxEnt, there is still a theoretical issue that might discourage AI practitioners [41]. Unsurprisingly, this is computational complexity. As proven in chapter 10 of [39], in fact, the problems of checking the satisfiability of a set of linear constraints K_G, and of computing an approximation of ω(θ) consistent with K_G, where K_G is a satisfiable set of constraints and θ ∈ SL, are in general infeasible.
Recent developments in the computational techniques for maximum entropy reasoning, however, provide good evidence for the claim that –in many practical circumstances– ME reasoning is indeed feasible. The key to reducing complexity consists in exploiting the (decidedly subtle) notion of "irrelevance" in the solution of appropriate linear systems (namely the probabilistic logic programs). The expert system shell SPIRIT [1; 38; 49] provides an efficient computational engine for the solution of such problems. SPIRIT optimizes the complexity of ME reasoning by constructing a dependency-graph for the problem at hand [38]. Roughly, this involves, firstly, introducing a (distinct) vertex for each (distinct) propositional variable occurring in the probabilistic logic program under consideration and, secondly, connecting any two vertices such that the corresponding propositional variables appear in the same constraint in K_G. The dependency-graph is then used to the effect that the actual probability evaluation is performed on it rather than on the set of all possible atoms. Note that if the constraints are all logically independent, then the probability evaluation must be performed on all the 2^n possible atoms, where n is the cardinality of L, though this situation hardly arises in practice. Once the constraints have been met, the MaxEnt distribution is immediately propagated to the queries, providing the required solution. Although this chapter is aimed at laying down the foundations of an agent-based approach to geographical information systems, we see a possible exploitation of maximum entropy inference also in the near future for today's GIS. Indeed, approaches in the literature [36; 21] propose to add a reasoning component to GIS, to perform inferences on geographical data. The idea for a short-term ME inference process within GIS is to follow the lines drawn in such proposals to design a ME inference component on top of a "today's GIS".
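The dependency-graph construction just outlined can be sketched in a few lines: one vertex per propositional variable, an edge between variables occurring in the same constraint, and the connected components then delimit independent sub-problems over which the probability evaluation actually has to be carried out. This is our own simplified reading of the idea, not SPIRIT's actual data structures.

```python
from collections import defaultdict

def dependency_graph(constraints):
    """Adjacency map: one vertex per variable, an edge between any two variables
    that occur together in some constraint (each constraint given as a set of variables)."""
    adj = defaultdict(set)
    for vars_in_constraint in constraints:
        for v in vars_in_constraint:
            adj[v] |= set(vars_in_constraint) - {v}
    return adj

def connected_components(adj):
    """Connected components of the dependency graph: independent sub-problems,
    each of which can be evaluated over far fewer than 2^n atoms."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        components.append(comp)
    return components

# Hypothetical constraints over variables of a den-localization program
constraints = [{"feeding", "outside_den"}, {"outside_den", "near_den_d07"}, {"rain"}]
print(connected_components(dependency_graph(constraints)))
# e.g. [{'feeding', 'outside_den', 'near_den_d07'}, {'rain'}]
```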
8 Conclusions and future research

E. Jaynes wrote, on MaxEnt, that one evades it only at the cost of getting results that can be shown to be defective or incomplete, in that they either fail to use all the relevant information or assume false information [29]. We have argued that what MaxEnt inference guarantees is just what is needed for solving adequately a certain wide class of queries that arise naturally in GIS research. To motivate and better illustrate our approach we have drawn on some ideas taken from a real case-study on animal behavior. We have then outlined how the path between those very much needed methodological requirements and the actual implementation of MaxEnt reasoning for GIS is connected by a robust layer of computational logics. Within this framework it is shown that key properties of non-monotonic reasoning and belief revision are satisfied. In particular, ME logic programming defines a preferential consequence relation: a sound and complete non-monotonic logical system on which most of the research on non-monotonic logic converges. Despite discouraging
results on the complexity of ME inference, the studies on the corresponding computational logics have resulted in some efficient implementations, chief among them the probabilistic expert system shell SPIRIT. To make the most of such formal connections, a new framework for GIS research is to be taken into account, in which GIS are viewed as agents who possess their own knowledge and are expected to make the most out of it. This study is part of a much larger project. Although maximum entropy logic programming can be extended to handle probability intervals [30], its relations with many-valued and fuzzy reasoning are still very much terra incognita. Further research is then to be focussed on the formal investigation of the adaptation of MaxEnt logic programming to geo-referenced data and in particular on its relation with logic programming paradigms enriched with both spatial and temporal constructs, like MuTACLP and STACLP [36; 46].
Acknowledgments This research has been supported by the REV!GIS Project IST-1999-14189 Revision of the Uncertain Geographic Information. We would like to thank Jeff Paris for useful conversations on this topic.
References
[1] SPIRIT: Symmetrical probabilistic intensional reasoning in inference networks in transition. http://www.fernunihagen.de/BWLOR/spirithome.html.
[2] J. Aerts, M.F. Goodchild, and G.B.M. Heuvelink. Accounting for spatial uncertainty in optimization with spatial decision support systems. Transactions in GIS, 7(2):211–230, 2003.
[3] A. D. C. Bennett, Jeff B. Paris, and Alena Vencovská. A new criterion for comparing fuzzy logics for uncertain reasoning. Journal of Logic, Language and Information, 9(1):31–63, 2000.
[4] A. Bochman. A Logical Theory of Nonmonotonic Inference and Belief Change. Springer, 2001.
[5] P. A. Burrough. Principles of Geographical Information Systems for Land Resources Assessment. Oxford, 1988.
[6] P. A. Burrough and R. A. Mcdonnel. Principles of Geographical Information Systems. Oxford University Press, 1998.
[7] T. Ceccarelli, D. Centeno, F. Giannotti, A. Massolo, C. Parent, A. Raffaetà, C. Renso, S. Spaccapietra, and F. Turini. Experimenting advanced spatio-temporal formalisms: an application to behavioural ecology. Geoinformatica, to appear.
[8] N. Chrisman. Exploring Geographic Information Systems. John Wiley and Sons, 1997.
[9] K. C. Clarke. Analytical and Computer Cartography. Prentice-Hall, Englewood Cliffs, NJ, second edition, 1995.
[10] E. Couclelis. The certainty of uncertainty: GIS and the limits of geographic knowledge. Transactions in GIS, 7(2):165–175, 2003.
[11] J. Dix, U. Furbach, and I. Niemelä. Nonmonotonic Reasoning: Towards Efficient Calculi and Implementations. In A. Voronkov and A. Robinson, editors, Handbook of Automated Reasoning, volume 2, chapter 18, pages 1121–1234. Elsevier-Science-Press, 2001.
[12] Didier Dubois and Henri Prade. Possibility theory: Qualitative and quantitative aspects. In Dov M. Gabbay and Philippe Smets, editors, Handbook of Defeasible Reasoning and Uncertainty Management Systems, Volume 1, pages 169–226. Kluwer Academic Publishers, Dordrecht, 1998.
[13] M. Duckham, K. Mason, J. Stell, and M. Worboys. A formal ontological approach to imperfection in geographic information. In Proceedings GISRUK 2000, 2000.
[14] M. Duckham and M. Worboys. GIS: A Computing Perspective – Second Edition. CRC Press, 2004.
[15] K.J. Duecker. Land resource information systems: a review of fifteen years experience. Geo-processing, 1:105–128, 1979.
[16] M.J. Egenhofer and D.M. Mark. Naïve geography. In Proceedings of COSIT 95, volume 988 of Lecture Notes in Computer Science, pages 1–15. Springer, 1995.
[17] N. Friedman and J. Halpern. Plausibility measures and default reasoning. Journal of the ACM, 48(4):648–685, 2001.
[18] D. M. Gabbay. Theoretical foundations for non-monotonic reasoning in expert systems. In Proceedings NATO Advanced Institute on Logics and Models of Concurrent Systems, pages 439–457, Berlin, 1985. Springer.
[19] D. M. Gabbay, C. Hogger, and J. Robinson, editors. Handbook of Logic in Artificial Intelligence and Logic Programming. Oxford University Press, 1994.
[20] D. M. Gabbay and J. Woods. Agenda Relevance: A Study in Formal Pragmatics. North-Holland, 2003.
[21] S. Grumbach, P. Rigaux, M. Scholl, and L. Segoufin. DEDALE, a spatial constraint database. In Proc. of Intl. Workshop on Database Programming Languages, volume 1369 of Lecture Notes in Computer Science, pages 38–59, 1998.
[22] P. Hájek. Metamathematics of Fuzzy Logic. Kluwer Academic Publishers, Dordrecht, 1998.
[23] P. Hájek. Ten questions and one problem on fuzzy logic. Annals of Pure and Applied Logic, 96(1-3):157–165, 1999.
[24] Hill and Paris. When maximizing entropy gives the rational closure. Journal of Logic and Computation, 13, 2003.
[25] H. Hosni, M. V. Masserotti, and C. Renso. Imperfect information management in geographic information systems. Internal report, ISTI-CNR, Pisa, 2004.
[26] Colin Howson. The Bayesian approach. In Dov M. Gabbay and Philippe Smets, editors, Handbook of Defeasible Reasoning and Uncertainty Management Systems, Volume 1: Quantified Representation of Uncertainty and Imprecision, pages 111–134. Kluwer Academic Publishers, Dordrecht, 1998.
[27] A. N. Huhns and M. P. Singh. Readings in Agents. Morgan Kaufmann Publishers Inc., 1998.
[28] E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[29] E. T. Jaynes. The relation of Bayesian and Maximum Entropy methods. In G.J. Erickson and C.R. Smith, editors, Maximum-Entropy and Bayesian Methods in Science and Engineering, volume 1. Kluwer, Dordrecht, 1988.
[30] G. Kern-Isberner and T. Lukasiewicz. Combining probabilistic logic programming with the power of maximum entropy. Technical Report 184302-12, Institut für Informationssysteme, TUW, Wien, 2002.
[31] Gabriele Kern-Isberner. Conditionals in Nonmonotonic Reasoning and Belief Revision - Considering Conditionals as Agents, volume 2087 of Lecture Notes in Computer Science. Springer, 2001.
[32] S. Kraus, D. J. Lehmann, and M. Magidor. Nonmonotonic reasoning, preferential models and cumulative logics. Artificial Intelligence, 44:167–207, 1990.
[33] D. Lehmann and M. Magidor. What does a conditional knowledge base entail? Artificial Intelligence, 55(1):1–60, 1992.
[34] T. Lukasiewicz and G. Kern-Isberner. Probabilistic logic programming under maximum entropy. In Anthony Hunter and Simon Parsons, editors, Proceedings of the 5th European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty (ECSQARU-99), volume 1638 of LNAI, pages 279–292, Berlin, July 5–9 1999. Springer.
[35] D. Makinson. General patterns in nonmonotonic reasoning. In D. M. Gabbay, C. Hogger, and J. Robinson, editors, Handbook of Logic in Artificial Intelligence and Logic Programming, pages 35–110. Oxford University Press, 1994.
[36] P. Mancarella, A. Raffaetà, C. Renso, and F. Turini. Integrating knowledge representation and reasoning in geographical information systems. Journal of Geographical Information Science, 18(4):417–446, 2004.
[37] J. McCarthy. From here to human-level AI. In L. Carlucci Aiello, J. Doyle, and S. Shapiro, editors, Fifth Int. Conf. on Principles of Knowledge Representation and Reasoning, pages 640–646. Morgan Kaufmann, 1996.
[38] C. Meyer and W. Rödder. Coherent knowledge processing at maximum entropy by SPIRIT. In Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence, pages 470–476, Portland, Oregon, USA, 1996.
[39] J. B. Paris. The Uncertain Reasoner's Companion: A Mathematical Perspective. Cambridge University Press, Cambridge, England, 1994.
[40] J. B. Paris. Common sense and maximum entropy. Synthese, 117(1):75–93, 1999.
[41] J. B. Paris and A. Vencovská. In defense of the maximum entropy inference process. International Journal of Approximate Reasoning, 17:77–103, 1997.
[42] J. B. Paris and A. Vencovská. In defence of the maximum entropy inference process. In D. Corfield and J. Williamson, editors, Foundations of Bayesianism, pages 203–240. Kluwer Academic Press, 2001.
[43] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, Los Altos, California, 1988.
[44] F. E. Petry, V. B. Robinson, and M. A. Cobb. Fuzzy Modeling with Spatial Information for Geographic Problems. Springer-Verlag, New York, 2005.
[45] David Plaisted and Adnan Yahya. A relevance restriction strategy for automated deduction. Artificial Intelligence, 144(1–2):59–93, 2003.
[46] A. Raffaetà, C. Renso, and F. Turini. Qualitative spatial reasoning in a logical framework. In Convegno Nazionale di Intelligenza Artificiale, September 2003.
[47] R. Reiter. On closed world databases. In Gallaire and Minker, editors, Logic and Databases, pages 55–76. Plenum Press, New York, 1978.
[48] R. Reiter. A logic for default reasoning. Artificial Intelligence, 13:81–132, 1980.
[49] Wilhelm Rödder. On the measurability of knowledge acquisition, query processing. Int. J. Approx. Reasoning, 33(2):203–218, 2003.
[50] H. Rott. Change, Choice and Inference: A Study of Belief Revision and Nonmonotonic Reasoning. Oxford University Press, 2001.
[51] C.E. Shannon and W. Weaver. The Mathematical Theory of Communication. Technical report, University of Illinois Press, Urbana, Illinois, 1964.
[52] P. Walley. Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, London, 1991.
[53] Peter Walley. Measures of uncertainty in expert systems. Artificial Intelligence, 83(1):1–58, 1996.
[54] M. Wooldridge. An Introduction to MultiAgent Systems. Wiley, Chichester, 2002.
[55] Michael F. Worboys. Nearness relations in environmental space. International Journal of Geographical Information Science, 15(7):633–651, 2001.
[56] L.A. Zadeh. Fuzzy sets. Information and Control, 8:338–353, 1965.
Storage and Manipulation of Vague Spatial Objects Using Existing GIS Functionality

Arta Dilo1, Pieter Bos2, Pawalai Kraipeerapun3, and Rolf A. de By1

1 International Institute for Geo-Information Science and Earth Observation (ITC), PO Box 6, 7500 AA Enschede, The Netherlands. [dilo,deby]@itc.nl
2 Department of Computer Science, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands. [email protected]
3 School of Information Technology, Murdoch University, South Street, Murdoch, Western Australia 6150, Australia. [email protected]
1 Introduction

We collect and store data to derive information and make judgments about a world of our interest. Ideally, the data should indicate in a unique and certain way which possible world corresponds to the actual world [17]. Imperfection arises when this is not possible. Imprecision is a type of imperfection that is often encountered. Data are imprecise if we cannot precisely define the actual world, i.e. several worlds satisfy the data. A specific type of imprecision is vagueness [17; 22], which is the focus of this study. A concept is vague if objects exist that cannot be classified either to the concept or to its complement. Vagueness arises in the presence of borderline cases [18]. It is often present in collected spatial information, such as forest inventories, or geological, soil, and vegetation maps. Soil or vegetation classes are such that they cannot be defined sharply. The change from one class to another is gradual. This is in conflict with current geographical information systems (GIS), which assume that spatial objects are precisely defined, sharp objects, using points, lines, and polygons as representations. Several theoretical models have been proposed to represent and handle vague objects. They can be divided into two groups. One group [2; 3; 4; 12] deals only with regions, called broad boundary regions. The boundary of such a region is not a sharp line but a zone of transition, which is considered to be homogeneous. The other group [13; 14; 23] considers gradual changes in the transition zone, and models objects by employing fuzzy sets. Schneider [14] defines fuzzy points, fuzzy lines and fuzzy regions, based on a finite collection of elements of a regular grid, which form a partition of ℝ². The model is
directly implementable in raster data format. The other theoretical models have not been followed up by implementations, and (to our knowledge) there is no other implementation of vague objects. The work presented in this chapter follows our previous work [7], which provides formal definitions of vague object types and their operators. This work is an implementation of these vague types and operators in GIS software. Its objective is to store and manipulate vague spatial information, by extending existing GIS functionality. GRASS, an open source GIS software, was selected for the implementation, to allow the use of existing spatial data handling capabilities. The rest of the chapter is organized as follows: Section 2 informally describes vague objects and operators we deal with, giving the intuition of the definitions provided in [7]. Section 3 is dedicated to the creation of vague objects from input data points, and to their storage. Section 4 describes the visualization techniques used to display vague objects. Section 5 provides for set operators on vague objects. Sections 6 and 7 close the chapter with discussions and conclusions.
2 Vague spatial types and operators

Vagueness in spatial information could be positional, meaning the location of a certain object is vaguely described, or it could be thematic, meaning properties of an object are vaguely described, e.g. in natural language terms, which are generally vague. The vague types that we have provided in [7] deal with thematic vagueness. A spatial object described by one of these types has a known location, but its properties can only be expressed in vague terms. For example:
• 'Densely populated' residential centers are represented by points with precise location, which have different degrees of population density. Their property of being 'densely populated' is a vague term.
• A traffic congestion is described by a property on a road network (whose location is precise for our purpose): the 'congestion' level, which can only be expressed in vague terms. Part of the road is completely blocked, hence certainly belongs to the traffic congestion, whereas away from the cause of the congestion the car build-up becomes less severe, so it is part of the traffic congestion only to a certain degree.
• Agricultural land 'suitability' is a property of land (space) described by a vague term. Suitability of land for a given kind of agriculture is decided on the basis of a combination of precise criteria built on natural linguistic terms [16]. There exist locations that are certainly suitable for agriculture, whereas other locations are suitable only to a certain degree.
The types we provide are either simple or general ones. A simple type is used to represent a basic object, i.e. the simplest identifiable object. A general type is used to represent a set of basic objects that belong to the same vague class (of
a certain classification). Partitioning of space based on a given classification is important for many spatial applications. A classification consisting of vague classes does not lead to a crisp partition of space. We introduce a type vague partition that allows some kind of overlapping between classes, still giving a meaningful classification of space. A vague object of simple or general type is a fuzzy set in the real plane, satisfying specific properties. It is represented by its membership function µ : ℝ² → [0, 1]. Simple types are simple vague point, simple vague line, and simple vague region. Figure 1 illustrates objects of these types.
Fig. 1. (a) simple vague point, (b) simple vague line, and (c) simple vague region. Dark tone indicates high membership value, light tone indicates low membership value.
A simple vague point represents a site with a known location but with uncertain existence (to a phenomenon of interest). It has a positive membership value only at that location; the membership value represents the certainty of existence of the site. A simple vague line represents a linear feature with known position but with an uncertain extent, i.e. any point of the line participates in the line to some degree. A simple vague line is a continuous line with mostly gradual transitions of membership values between neighboring points on the line. Membership values are positive at every point of the line, except perhaps at the end points. A simple vague region is a region with a broad boundary. Locations in the broad boundary typically have different positive membership values, which change mostly gradually between neighboring points in the region. It does not have cuts, punctures, isolated lines or isolated points. Its support set (i.e. the set of locations with positive membership value) is a single-component set, possibly with holes. The core (i.e. the set of locations with membership value equal to 1), however, can be composed of several components, possibly containing holes. The general types are vague point, vague line, and vague region. A vague point is a finite set of disjoint simple vague points. A vague line is a finite set of simple vague lines that intersect only at their end nodes, and have the same membership value at any common end node. A vague region is a finite set of disjoint simple vague regions.
Fig. 2. (a) vague point, (b) vague line, and (c) vague region.
All these finite sets are potentially empty. Figure 2 illustrates these objects. A vague partition is a set of classes, where a class is of type vague region. Objects belonging to different classes could overlap only at their uncertain parts. The certain part of one object is disjoint from all objects in any other class. The basic operators we provide are union, intersection, and difference. These are binary operators defined on general types. General types are closed under these vague spatial operators, meaning the result of an operator is of one of these types. Every operator takes as arguments (two) objects of the same type, and returns an object of that type, except for the intersection of vague lines, which returns a vague point. The union between two vague regions can be used to generalize a classification: two classes can be joined to create a new class that is more general than the previous two. The intersection of vague regions can be used to combine two classifications into a new one: two classes from different classifications can be combined to form a more refined class. Union, intersection, and difference between vague objects are defined using fuzzy set operators – fuzzy union, fuzzy intersection, and fuzzy difference, respectively. Union of two objects µ1, µ2 gives a new object µ of the same type, of which the membership value at every location is taken as the maximum of membership values of the input objects: µ(P) = max{µ1(P), µ2(P)}. Intersection between two vague lines gives a vague point, whereas intersection between vague points and between vague regions gives, respectively, a vague point and a vague region. The intersection of two vague objects µ1 and µ2 results in a vague object µ of which the membership value at every location is the minimum of membership values of input objects at that location: µ(P) = min{µ1(P), µ2(P)}. The difference µ between two vague objects µ1 and µ2 is the intersection of the first object with the complement of the second. The membership value at each location is taken as µ(P) = min{µ1(P), 1 − µ2(P)}.
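Because the three operators are defined pointwise, they are easy to apply to any sampled representation of the membership functions. The sketch below applies them to two membership rasters held as NumPy arrays; the rasters are made-up toy data, and the code only illustrates the max/min definitions above, not the GRASS-based implementation described later in the chapter.

```python
import numpy as np

def vague_union(mu1, mu2):
    """Fuzzy union: membership is the pointwise maximum."""
    return np.maximum(mu1, mu2)

def vague_intersection(mu1, mu2):
    """Fuzzy intersection: membership is the pointwise minimum."""
    return np.minimum(mu1, mu2)

def vague_difference(mu1, mu2):
    """Fuzzy difference: intersection of mu1 with the complement of mu2."""
    return np.minimum(mu1, 1.0 - mu2)

# Two toy membership rasters over the same 3x3 grid of locations
mu_a = np.array([[1.0, 0.8, 0.2],
                 [0.9, 0.5, 0.1],
                 [0.3, 0.0, 0.0]])
mu_b = np.array([[0.0, 0.4, 0.9],
                 [0.2, 0.5, 0.7],
                 [0.6, 0.8, 1.0]])

print(vague_union(mu_a, mu_b))         # pointwise max
print(vague_intersection(mu_a, mu_b))  # pointwise min
print(vague_difference(mu_a, mu_b))    # min(mu_a, 1 - mu_b)
```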
3 Storage of vague objects

In this section we discuss our storage model of vague objects. An object of a general vague type is a collection of simple vague type objects. Simple type objects are the basic objects that we need to store. An identifier is given to such an object and used to compile the complete information at any time the object is needed. In this section we show how to store simple vague points, simple vague lines, and simple vague regions. The data we can collect and store can only be finite, while a simple vague line and a simple vague region represent infinite point sets. An interpolation method is used in both cases to complete (approximate) information about objects. The GRASS vector format4 is used to store vague objects. Feature types supported by GRASS are point, line, boundary, area (without holes), and centroid. All data is stored using line representations. A line is a sequence of (x, y) coordinates forming types point, line, boundary, or centroid. A point or a centroid is constructed as a sequence of two identical elements. A line or a boundary is constructed as a sequence with at least two elements. A line type is used to store linear features, whereas a boundary type is used to store areal features. An area is formed by a set of lines of type boundary, which constitute its boundary. GRASS allows the storage of three-dimensional (3D) features, i.e. a line can be a series of (x, y, z) coordinates. It can only build the complete topology for 2D features; it builds connectivity of 3D linear features, but ignores the third coordinate when creating (and building the topology of) areal features. We use that third coordinate to store membership values. Objects within a theme, e.g. road lines, vegetation classes, form what we call a data layer. A data layer is physically stored as a directory that contains several files, coor, topo, sidx, etc., each containing specific information. For example, vegetation data is stored in a directory vegetation. Location information is stored in the coor file in that directory, while topology information is stored in the topo file. One or more attributes can be attached to objects of a data layer. Our simple vague point is implemented in GRASS by an SVpoint that is a triple (x, y, mv), where (x, y) ∈ ℝ² provides the location and mv ∈ (0, 1] provides the membership value. A simple vague line is implemented by an SVline that is a sequence of triples (x1, y1, mv1), ..., (xn, yn, mvn), each triple providing the x, y location of a point in the line, associated with the membership value of that point to the line. An approximation of the simple vague line is achieved by linear interpolation between consecutive points. A simple vague region is a surface embedded in ℝ². Triangulations can be used to represent it, e.g. TIN structures in GIS software. A triangulation is the division of a surface or a polygon into a set of triangles, such that each triangle edge is shared by two adjacent triangles. A triangulation method consists of two parts: creating the triangulation, and performing an interpolation within each triangle.
4 We use GRASS version 6.0.
A simple vague region is implemented by an SVregion that is a triangulation. An approximation of the simple vague region is achieved by a linear interpolation within each triangle of the SVregion. A vague point is implemented by a Vpoint, which is a set of triples (x, y, mv) having different locations. A vague line is implemented by a Vline, which is a collection of SVline objects intersecting only at end points. A vague region is implemented by a Vregion, which is a set of triangulations that do not overlap each other. Objects of a simple type belonging to the same class are stored in a data layer. We call it a vague data layer, for containing vague information. It is of one of the implementation types: Vpoint, Vline, or Vregion. For example, all SVregion objects belonging to the class 'forest' of a vegetation theme are stored in a vague (data) layer of type Vregion. A vague layer is created from input containing membership information, or such information is derived from attributes in the input data by applying membership functions. We provide a module v.vague.membership that applies trapezoidal membership functions to a numerical attribute of a data layer. A vague point layer, i.e. a layer of type Vpoint, may derive from any point data containing membership information, or an attribute to which we could apply membership functions. Each input point associated with a membership value creates an SVpoint object in the vague point layer. A vague line layer is derived from measurements along linear features, e.g. the level of congestion at locations along roads in a road network data layer. Such measurements can be direct membership values, or functions can be applied to them to derive membership values. An SVline object is created in the vague layer for each line object in the input data. We assume that data about a vague region layer comes from points associated with membership values, or points having an attribute to which we apply a membership function. These points may be irregularly distributed, e.g. coming from measurements. They may also be regularly distributed, e.g. coming from processed satellite images. The only information we can get from such input is a membership value to a certain vague class, which may indeed be composed of several SVregion objects. The input needs to be interpreted to create the simple objects. We cluster input points, and consider that each cluster establishes a separate object. Points of each cluster are then used to create a triangulation for each simple object identified. The next two sections, Section 3.1 and Section 3.2, are dedicated, respectively, to clustering the input to form separate SVregion objects, and to creating objects from the triangulation of clusters. Up to here, we have discussed the creation and storage of a vague layer, which is the implementation of a vague class. It is convenient to bind together all classes belonging to the same theme. This is discussed in Section 3.3.
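As an illustration of how membership values could be derived from a numerical attribute, the sketch below implements a generic trapezoidal membership function of the kind the text says v.vague.membership applies. The parameter names (a, b, c, d) and the example thresholds are our own illustrative choices and do not reflect the actual interface of that module.

```python
def trapezoidal(value, a, b, c, d):
    """Trapezoidal membership: 0 below a, rising on [a, b], 1 on [b, c],
    falling on [c, d], 0 above d. Requires a <= b <= c <= d."""
    if value <= a or value >= d:
        return 0.0
    if b <= value <= c:
        return 1.0
    if value < b:                       # rising edge
        return (value - a) / (b - a)
    return (d - value) / (d - c)        # falling edge

# Hypothetical use: membership in the class 'suitable for crop X', derived from
# a soil-pH attribute (fully suitable roughly between pH 5.5 and 7.5).
for ph in (4.0, 5.0, 6.5, 7.8, 9.0):
    print(ph, round(trapezoidal(ph, a=4.5, b=5.5, c=7.5, d=8.5), 2))
```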
3.1 Clustering input and delineating boundaries

Several techniques exist that can be used for clustering points in space: K-means (K-median) clustering, self-organizing maps, hierarchical clustering, and alpha shapes (see [1] for an overview). All techniques are based on some distance measure between points. The first two techniques require the number of clusters to be determined beforehand, which is usually not known in our case. Hierarchical clustering with Euclidean distance could be used for clustering the input. The technique is used for different inputs, not just points in Euclidean space, by employing different distance measures or different group (linkage) distances. It is quite general, but slow. Alpha shapes (α-shapes from here onwards) work only with points in Euclidean space. They detect separate clusters in a given point set and delineate the boundaries of the detected point clusters. The technique is faster than hierarchical clustering, and its output is richer, as it contains a boundary for each cluster. We use α-shapes to cluster the input data and delineate the boundary of each cluster.
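For comparison only (the chapter ultimately uses α-shapes), the hierarchical clustering alternative might be sketched as follows with SciPy; the distance threshold max_dist is our assumption, and boundary delineation would still require a separate step.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_clusters(points: np.ndarray, max_dist: float) -> np.ndarray:
    """Cluster 2D points; returns an array of cluster labels (1..k)."""
    Z = linkage(points, method='single')           # single-linkage on Euclidean distance
    return fcluster(Z, t=max_dist, criterion='distance')

# Example (hypothetical data):
# labels = hierarchical_clusters(np.random.rand(100, 2), max_dist=0.1)
```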
Fig. 3. Positive α-hull on the left, negative α-hull on the right (taken from [8]).
The α-shapes are a generalization of convex hulls. The convex hull of a point set S may be defined as the intersection of all closed half-planes that contain all points of S. This notion is generalized to α-hulls in [8]. For positive (yet sufficiently small) α, the α-hull of S is the intersection of all closed discs with radius 1/α that contain all points of S. Larger values of α produce a curved hull whose boundary consists of parts of circles (with radius 1/α) that pass through extreme points of S. As α approaches zero, the curved hull converges to the convex hull. An α-hull is smoother than the convex hull, but its approximation of the intuitive shape of the points is coarser than that of the convex hull. The shape of the hull can be refined by considering negative α's. For negative α, the α-hull of S is taken to be the intersection of all closed complements of discs with radius −1/α that contain all points of S. Figure 3 shows the positive α-hull (left) and the negative α-hull (right) of a point set resembling the letter 'A'.
The α-hull can be defined more concisely using the notion of a generalized disc. A generalized disc of radius 1/α is a disc of radius 1/α for α > 0; it is the complement of a disc of radius −1/α for α < 0, and a half-plane for α = 0. For a point set S and a real value α, the α-hull of S is the intersection of all closed generalized discs of radius 1/α that contain all points of S. Two other concepts, α-extreme and α-neighbor, are used in [8] to define α-shapes. A point p of a set S is termed an α-extreme in S if there exists a closed generalized disc of radius 1/α such that p lies on its boundary and the disc contains all points of S. Two α-extremes p and q are said to be α-neighbors if there exists a closed generalized disc of radius 1/α with both points on its boundary, and which contains all points of S. An α-shape of S is then the straight line graph whose vertices are the α-extremes and whose edges connect the α-neighbors [8]. The assumption is that no four points of S are co-circular and no three points are collinear.
Fig. 4. (a) input points, (b) α-shape for a positive α, (c) α-shape for α equal to 0, (d) α-shape for a negative α.
Figures 4(b)–(d) illustrate α-shapes produced for decreasing α values on the same point set, shown in Fig. 4(a). A decreasing α value results in a finer shape. Depending on the α value, a single point can be a cluster (and its own boundary at the same time), as is the case in Fig. 4(c) for four points. The set of α-extremes becomes larger when α decreases: the set of α1-extreme points of a point set S is a subset of the α2-extreme points of S if α1 > α2 [8]. We are interested in shapes finer than the convex hull, therefore we work only with negative α values. This simplifies the checks for α-extremes and α-neighbors, and also the complete algorithm for building the α-shape. A point is an α-extreme of a point set if and only if a circle with radius −1/α (α-circle hereafter) can be constructed such that the point is on the circle and no other point of the set lies inside it. Figure 5 illustrates α-extremes in a point set A. Point m from A is an α-extreme. Point n is not an α-extreme; any α-circle passing through n contains at least one other point of A.
Fig. 5. Examples of α-extremes in a set A: point m is an α-extreme, point n is not.
We use the same set A for the illustrations in Figs. 5–9 in this section. The set is chosen such that it covers all cases treated by the α-shape algorithm.
Fig. 6. Examples of α-neighbors: v and w are α-neighbors, s and v are not.
Two points are α-neighbors if and only if an α-circle can be constructed through them that contains no other point of the set. Figure 6 illustrates α-neighbors in A. Points v and w are α-neighbors: there is a circle passing through v and w that does not contain any other point of A. Points s and v are not α-neighbors: there are only two circles with radius −1/α passing through both of them, and both circles contain point t. The tests for building the α-shape of a point set are based on the Delaunay triangulation of the set, its dual (in the graph theoretical sense), the Voronoi diagram, and relations between the two. A Delaunay triangulation is a triangulation that maximizes the minimum angle [5]. A Voronoi diagram of a point set is a partition of the plane into convex polygons, one for each point of the set, such that each polygon contains only one point from the set, called its central point, and every point in a polygon is closer to its central point than to any other point of the set.
Figure 7 shows the Delaunay triangulation (in light grey) and the Voronoi diagram (in thick light grey lines) of the point set A. The α-shape of a point set S is a subset of the Delaunay triangulation of S [8]. It is built by first constructing the Delaunay triangulation of S, and then testing for each Delaunay edge whether it is in the α-shape. The complete algorithm is:

create a Delaunay triangulation DT of S
create Shape as an empty set of edges
for every edge (p, q) in DT
    if p is α-extreme and q is α-extreme and p and q are α-neighbors
        add (p, q) to Shape
    fi
rof

To build the Delaunay triangulation we use TRIANGLE, an open source tool created by Shewchuk [15]. Testing for α-extremes and α-neighbors of S, based on relations between the Delaunay triangulation and the Voronoi diagram of S, is explained in the next few paragraphs. The tests are translated into properties of the Delaunay triangulation and its convex hull, therefore this is the only structure needed for constructing the α-shape.
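A hedged Python sketch of the same construction, using SciPy's Delaunay triangulation instead of TRIANGLE. Rather than testing α-extremes and α-neighbors directly, it uses the equivalent circumradius criterion cited later in this section (a triangle is in the α-shape interior iff its circumradius is at most −1/α) and returns the edges bounding the kept triangles; the function and variable names are ours.

```python
import numpy as np
from scipy.spatial import Delaunay

def circumradius(a, b, c):
    """Circumradius of the triangle with 2D vertices a, b, c."""
    la = np.linalg.norm(b - c)
    lb = np.linalg.norm(c - a)
    lc = np.linalg.norm(a - b)
    area = 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
    return np.inf if area == 0 else (la * lb * lc) / (4.0 * area)

def alpha_shape_edges(points: np.ndarray, alpha: float):
    """Boundary edges of the α-complex of `points` for a negative alpha."""
    radius = -1.0 / alpha                       # α < 0, so −1/α is the circle radius
    dt = Delaunay(points)
    edge_count = {}                             # edge -> number of kept triangles using it
    for ia, ib, ic in dt.simplices:
        if circumradius(points[ia], points[ib], points[ic]) <= radius:
            for e in ((ia, ib), (ib, ic), (ic, ia)):
                e = tuple(sorted(e))
                edge_count[e] = edge_count.get(e, 0) + 1
    # Edges used by exactly one kept triangle bound the shape (clusters and holes).
    return [e for e, n in edge_count.items() if n == 1]
```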
Fig. 7. Test for α-extremes: unbounded Voronoi polygon Vm and an α-circle (in grey) passing through m; bounded Voronoi polygon Vt , an α-circle passing through t, and the maximal circle of t (with dashed line).
An α-extreme of a point set is a point on the boundary of the convex hull of the set, or a point for which the maximal circumradius5 of the triangles of which the point is a vertex is larger than the radius −1/α. These properties are used to test for α-extremes. We explain them using Fig. 7. Points on the boundary of the convex hull of a point set have unbounded Voronoi polygons; any other point has a bounded Voronoi polygon. The point m is on the boundary of the convex hull of A; its Voronoi polygon Vm is unbounded. Any α-circle centered inside Vm that passes through m does not contain any other point of A. Point t is inside the convex hull of A; it has a bounded Voronoi polygon Vt. The Voronoi polygon Vt of t contains all points that are closer to t than to any other point of A. This means that any circle passing through t and centered at a point inside Vt does not contain any other point of A. The maximal circle having this property (shown with a dashed line) is the one centered at the Voronoi vertex c_t^max furthest from t. An α-circle passing through t, and lying inside the maximal circle of t, does not contain other points of A. The vertices of the Voronoi polygon Vt are the circumcentres of the Delaunay triangles of which t is a vertex. The maximal circle of t is one of the circumcircles of the triangles t is a vertex of; it is the circumcircle that has the maximal radius.
Fig. 8. Delaunay triangles sharing edge g and their circumcircles (in light grey); the corresponding Voronoi edges; and circles centered inside and outside edge vg passing through end points of g (with dashed and dotted line, respectively).
The test for α-neighbors is based on the relation between Delaunay edges (D-edges) and Voronoi edges (V-edges). Let us first explain this relation using Fig. 8, which shows a partial Delaunay triangulation and Voronoi diagram of A. D-edges are shown in light grey, V-edges are shown with thick light grey lines. Every D-edge has a corresponding V-edge, e.g. D-edge g has its corresponding V-edge vg. V-edge vg joins the centers of the circumcircles of the two Delaunay triangles of which g is an edge. The circumcircles of the Delaunay triangles are shown in light grey. Any circle passing through the end points r and u of g has its center on l, the line perpendicular to g at its midpoint (the dashed grey line).
5 The circumradius is the radius of the triangle's circumscribed circle, i.e., the unique circle that passes through each of the triangle's vertices. The circumscribed circle is also called the circumcircle of the triangle.
The V-edge vg lies on this line. A Delaunay triangle has the property that its circumcircle does not contain any other point of the set. The two circumcircles of the triangles sharing edge g have their centers at the end points of edge vg. Any circle passing through the end points of g and centered inside vg contains no other point of A, e.g. the circle drawn with a dashed line. Moving the center of the circle along line l, outside vg, produces a circle that encloses m or n, and possibly other points of A, e.g. the circle drawn with a dotted line in Fig. 8 encloses point m.
Fig. 9. Testing Delaunay edges for being in the α-shape by using min and max circles (with dotted and dashed lines, respectively), and α-circles passing through end points of D-edges (in grey).
Two α-neighbors are connected through a D-edge, and the center of one of the α-circles passing through them is on the corresponding V-edge; no other point of the set lies inside the α-circle that has its center on the V-edge. This property is used to test whether the end points of a D-edge are α-neighbors, that is, whether the D-edge is in the α-shape. The center of an α-circle passing through the end points of a D-edge is on the corresponding V-edge if the radius −1/α is between the minimum and maximum distance from any of these points to the V-edge. If a D-edge is on the boundary of the convex hull, its corresponding V-edge is a half line, in which case the maximum distance is infinite. The test then reduces to checking whether the radius −1/α is bigger than the minimum distance. Figure 9 shows three different cases for calculating the minimum and maximum distance for a D-edge.
Edge d is on the boundary of the convex hull of A; its corresponding V-edge vd is a half line. The minimum distance from an endpoint of d to the V-edge vd is the circumradius of the triangle of which d is an edge. The circumcircle of this triangle is shown with a dotted line. D-edge g intersects its V-edge vg. The minimum distance from a g endpoint to vg is half the length of g (vg is perpendicular to g at its midpoint). The minimum circle is shown with a dotted line. The maximum distance is the distance from an endpoint of g to one of the vg endpoints. The endpoints of vg are the circumcenters of the triangles of which g is an edge. The maximum distance is thus the maximum circumradius of the two triangles. The maximum (circum)circle is shown with a dashed line. D-edge f does not intersect its V-edge vf. The minimum and maximum distance for f are, respectively, the minimum and maximum circumradius of the two triangles of which f is an edge. The minimum and maximum (circum)circles of the triangles sharing edge f are shown in dotted and dashed line, respectively. The α-circles and their centers are shown in grey. Only D-edge f is in the α-shape.
Fig. 10. (a) α-shape for the initial α-value, (b) α-shape for decreased α-value.
The value of α defines the level of detail of the α-shape. The criterion we use to set an initial α value is that every point is a vertex of at least one triangle in the α-shape interior. Figure 10(a) shows the α-shape obtained with such an α value, and Fig. 10(b) shows the shape after decreasing the α value. In Fig. 10(b) we see that there are loose points, not part of any shaped object. A triangle is part of the α-shape if and only if the radius of its circumscribed circle is smaller than or equal to −1/α [8].
To set an initial α value for a point set S, we calculate for every point s ∈ S the minimum radius rs of the circumscribed circles of all Delaunay triangles that have s as a vertex. The radius −1/α is set to the maximum of the rs values over all points of S. The α value calculated in this way is used to estimate the shape. This initial value can be further adjusted by the user, if needed.
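A hedged sketch of this initialization, reusing the circumradius helper from the earlier fragment: for each point it takes the minimum circumradius over its incident Delaunay triangles, and sets −1/α to the maximum of these minima.

```python
import numpy as np
from scipy.spatial import Delaunay

def initial_alpha(points: np.ndarray) -> float:
    """Initial (negative) alpha such that every point is a vertex of at least
    one triangle in the alpha-shape interior."""
    dt = Delaunay(points)
    min_r = np.full(len(points), np.inf)
    for ia, ib, ic in dt.simplices:
        r = circumradius(points[ia], points[ib], points[ic])  # helper defined above
        for i in (ia, ib, ic):
            min_r[i] = min(min_r[i], r)
    radius = float(np.max(min_r))      # this is −1/α
    return -1.0 / radius
```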
Fig. 11. (a) input from a regularly distributed point set, (b) object boundaries detected with the initial α-value.
The α-shape of regularly distributed point data (grid data) resulting from the use of the initial α value is shown in Fig. 11. Grid points with a membership value equal to zero have been excluded from the input.

3.2 Creating and storing triangulations

The output of the α-shapes consists of boundary lines together with the input data points. The lines constitute the boundaries of the support sets of simple vague regions. They are processed to identify objects by their boundary. An identifier is assigned to every simple region object detected. Each simple region is then built from the constrained Delaunay triangulation performed on the boundaries of the region and the points inside the boundary. Its triangulation data is stored together with the identifier. Module v.vague.triangle performs the whole procedure. Each step of the procedure is explained in more detail in the following paragraphs.

Simple vague regions are identified from cluster boundaries. Each region may contain holes. Because GRASS does not support areas with holes, we have to check for them.
A hole inside a simple vague region may contain other regions. Therefore, from the cluster boundaries we have to detect which boundary is an outer boundary of a region and which is a hole boundary. We check for every area boundary whether it lies inside other areas. It is the outer boundary of a simple vague region if it lies inside an even number of areas. It is the boundary of a hole if it lies inside an odd number of areas. Figure 12 shows two different configurations of outer boundaries and holes: area A is inside area B, so it is a hole. Area C is inside area D, which in turn is inside area E. The boundary of E and the boundary of D form (the boundaries of) a simple vague region. The boundary of C gives another simple vague region.
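A hedged sketch of this even/odd classification, using Shapely as a stand-in for the GRASS topology (an assumption; the actual module works on GRASS's own structures): each closed boundary is classified by counting how many other boundaries enclose it.

```python
from shapely.geometry import Polygon

def classify_boundaries(rings):
    """rings: list of closed coordinate sequences (cluster boundaries).
    Returns two lists of indices: outer boundaries and hole boundaries."""
    polys = [Polygon(r) for r in rings]
    outers, holes = [], []
    for i, p in enumerate(polys):
        depth = sum(1 for j, q in enumerate(polys)
                    if j != i and q.contains(p))   # number of areas enclosing this one
        (outers if depth % 2 == 0 else holes).append(i)
    return outers, holes
```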
Fig. 12. Identifying simple vague regions (shown in grey) from areal boundaries: one simple vague region on the left; two simple vague regions on the right, one enclosing the other.
After identifying all simple vague regions, we perform a constrained Delaunay triangulation for every region. A constrained Delaunay triangulation is a triangulation of vertices with predefined edges. It consists of four steps [15]:

1. creating a Delaunay triangulation from the input vertices;
2. inserting missing line segments from the boundary and deleting the Delaunay edges that overlap with them;
3. removing triangles at concavities and holes;
4. adding more points in order to improve the quality of the triangulation.

Figure 13 illustrates the first three steps of the constrained Delaunay triangulation. Figure 13(a) shows the boundaries of a simple vague region. For simplicity of illustration we consider a constrained Delaunay triangulation with only the line boundaries as input. Figures 13(b)–(d) show the results of the first, second and third step, respectively. To visualize the change in the different steps, the boundary edges are always drawn in black. In the two intermediate steps, the other edges are shown in dark grey, and the edges to be removed are shown with dashed grey lines.

We use TRIANGLE by Shewchuk [15] to perform the constrained Delaunay triangulation. GRASS data is transformed to the TRIANGLE data format, the program is run on the transformed data, and its output is transformed back to GRASS vector format.
Fig. 13. Results of the first three steps of the constrained Delaunay triangulation of a simple vague region. (a) region boundaries, (b) Delaunay triangulation, (c) insertion of missing boundary lines, (d) removal of triangles outside the boundary and inside holes.
The program cannot handle holes when the input is big (i.e. more than 50,000 points). We therefore use TRIANGLE to perform the triangulation constrained only on the outer boundaries, and then remove all edges inside holes. We expect the core of a region to contain many input points, therefore many flat triangles will be created. We remove the redundant core triangles by first constructing the boundary of the core from all flat triangles with value 1, and then performing the constrained Delaunay triangulation on that boundary. Figure 14 shows the triangulation of a simple vague region after the simplification of its core triangulation. Points are very dense outside the core, which makes the triangulation there very dense. The core triangles are visible after the simplification phase.

For each simple vague region, triangulation edges are stored as boundary lines with (x, y, z) coordinates in the coor file. An attribute is used to store the identifier of the simple vague region to which the edge belongs. Topology data of triangles and holes, e.g. indices of line boundaries, is stored in the topo file. The output of module v.vague.triangle is a data layer that contains information about a vague class. The complete algorithm of the module is presented below:

create a list SVR of simple vague regions from cluster boundaries
for every region r in SVR
    find all points that are inside its boundary
    build a constrained Delaunay triangulation from r's boundary and points
    if r has holes
        remove edges inside every hole
    fi
    store triangulation edges together with r's index in SVR
    store topology for triangles and holes
rof
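Outside GRASS, the constrained Delaunay step used by the algorithm above can be sketched through the Python `triangle` wrapper around Shewchuk's TRIANGLE (an assumption about available tooling; the chapter's module converts data to TRIANGLE's native format instead). Only the outer boundary is passed as a constraint, matching the workaround described above; edges inside holes would still have to be removed afterwards.

```python
import numpy as np
import triangle  # Python wrapper around Shewchuk's TRIANGLE

def constrained_delaunay(points: np.ndarray, boundary_segments: np.ndarray):
    """points: (n, 2) vertex coordinates; boundary_segments: (m, 2) index pairs
    describing the outer boundary. Returns (vertices, triangles)."""
    pslg = {'vertices': points, 'segments': boundary_segments}
    result = triangle.triangulate(pslg, 'p')   # 'p' = triangulate a planar straight line graph
    return result['vertices'], result['triangles']
```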
Fig. 14. Triangulation of a simple vague region after reducing core triangles. Triangulation is very dense outside the core; only core triangles are distinguishable.
Every time a layer of simple vague regions is used, its information is compiled from the stored data and put in memory in a list Vague region of simple vague regions. Every element of Vague region contains a list of triangles and a list of holes. Information about triangles and holes is built using several GRASS data structures that are inside a main structure, Map info [11]. Every time a vector layer is used, data from the coor and topo files is read and put into these structures. The indexed lists Area and Line inside a Map info structure contain topology information of areas and lines, respectively. Each element of Area contains a list of indices of the lines that constitute its boundary. The index of the Line list is used to connect every element with the corresponding element in another list that contains attribute values. Figure 15 shows the relations between these data structures.

3.3 Creating themes from several layers

A data layer keeps information about one class. For example, a layer forest keeps data about simple vague regions that belong to a class 'forest' of a 'vegetation' theme. The vegetation theme consists of several classes: forest, grassland, shrubs, etc. Vague classes may overlap with each other, e.g. forest and shrubs, in which case the triangulations representing them will overlap. Overlapping lines or areas are not allowed (handled) by the topology, which we need for compiling object information.
Fig. 15. GRASS structures (inside Map info) used to store information about simple vague regions.
Therefore we cannot store all classes together. However, we often need to have all classes of a theme together, to operate with all of them or a selected subset. We create two tables to store the relation between a class and the theme it belongs to. Figure 16 shows their schemas. The tables are stored in DBF format, which is a format integrated in GRASS, meaning that no connection to an external database is needed. Table Themes.dbf stores the theme name and description, respectively in Name and Description. Name is the identifier. The table Classes.dbf stores the class name, the name of the theme it belongs to, the name of the GRASS folder where the (layer) data is stored, and the class description, respectively in Name, Theme, GRASSfolder, and Description. The combination Name, Theme is the identifier of the table. Theme refers to the Themes.dbf table.

Module v.vague.combine binds several layers into a theme. The created theme is added to the Themes.dbf table and its list of layers is added to the Classes.dbf table (each layer as a new record in the table).
Fig. 16. Tables that keep information about themes and classes belonging to each theme.
The module allows users to add or remove layers from an existing theme, or to delete a theme. Changes are then reflected in the Classes.dbf and Themes.dbf tables. A theme of vague region layers forms a vague partition. Objects in a layer do not overlap with each other; this is assured by the triangulation process. When adding layers to a theme with v.vague.combine, we check whether objects from different layers overlap only in their uncertain part. We give a warning when this criterion is violated, and store a report for the violating cases (object identifiers, the layers they belong to, and the theme).
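A hedged sketch of the overlap criterion checked by v.vague.combine, again using Shapely as a stand-in for the GRASS structures and treating each layer as a flat list of triangles with (x, y, mv) vertices (our assumed representation): two layers may overlap only where at least one of them is uncertain, so their cores (flat triangles with membership value 1) must be disjoint.

```python
from shapely.geometry import Polygon
from shapely.ops import unary_union

def core(layer):
    """Union of all triangles whose three vertices have membership value 1.
    `layer` is a list of triangles, each a tuple of three (x, y, mv) vertices."""
    flat = [Polygon([(x, y) for x, y, _ in tri]) for tri in layer
            if all(mv == 1.0 for _, _, mv in tri)]
    return unary_union(flat) if flat else Polygon()

def violates_partition(layer_a, layer_b) -> bool:
    """True if the two layers overlap in their certain (core) parts."""
    return core(layer_a).intersection(core(layer_b)).area > 0
```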
4 Visualizing objects and displaying information

Visualization of objects in a layer is done using color brightness to display levels of membership values, e.g. using grey scales. Different layers are displayed using a different color hue for each layer, and brightness for the membership value within each layer. Objects can be selected by clicking. Selected objects are shown in a separate colour, not used for displaying layers. Figure 17 shows a theme of vague regions having two classes. A different color hue is used for each class. A darker color shows a higher membership value. An object is selected and shown in yellow (in the figure, in light grey).

Module v.vague.what performs the visualization of layers. It allows users to select objects from a layer, and displays information for a given location in a layer. A theme (created by the v.vague.combine module) is the input for this module. Vague regions can be displayed by drawing only the triangle edges of their triangulations. The set of triangles is drawn with a different color for each layer. The interpolated values inside triangles can be used to color the full extent of the objects in a layer, as is shown in Fig. 17. Section 4.1 explains the techniques used for the visualization more thoroughly.

The interface of v.vague.what uses two windows: the 'control window' that contains the list of layers, one layer per tab, and the 'display window' for drawing the layers. Layers are first drawn in the order in which they are stored in the theme. The selected layer (tab) in the control window becomes the top layer in the display window. An object of the top layer can be selected by clicking at a location inside the object. The triangle containing the given location is found first, then the object the triangle belongs to.
Fig. 17. Visualization of several layers, each drawn in a separate color hue. A selected object (indicated by the arrow) is shown in light grey.
The object is highlighted. The information for the given location is displayed in the control window. This information contains the membership value at the location, the directory storing the layer data, and the identifier of the object it falls in. The membership value is calculated by linear interpolation inside the triangle containing the location. The user can select several objects and output them to a new layer.

4.1 Visualization techniques

A GRASS module, d.vect, is used to draw vague region layers with triangle edges. Each layer is drawn in a different color. A new module, d.vague, was created to fill triangles with interpolated values and draw them. The module uses a scanline rendering algorithm [20]. The rest of this section describes how the module works. GRASS functions are used to map the coordinates of triangle vertices to screen coordinates. A screen is a raster map consisting of pixels. Every pixel has a red, green, and blue value, specifying its color. Scanline rendering fills a triangle by drawing one line at a time, starting at the top of the triangle. All pixels of the line that are part of the triangle are drawn. The line is then moved down and the drawing is repeated until the bottom of the triangle is reached. Figure 18 illustrates how the algorithm draws a triangle.
Fig. 18. Scan-line visualization technique.
Linear interpolation is used to calculate the membership value at every location. Every triangle lies on a plane defined by the three triangle vertices. Any point (x, y, z) of the plane satisfies the equation z = ax + by + c. The a, b, and c values are calculated from the coordinates of the three triangle vertices. The membership value for any point in the triangle is calculated by substituting its (x, y) location into the above equation.

A different color is used for each layer, selecting only colors that have the same saturation, so that no color draws more attention than the others. The brightest color is specified for every layer. The saturation and hue are kept constant within a layer. The brightest color is used for the lowest membership value of the layer. The color with brightness equal to 0 is used for the membership value equal to 1. The corresponding brightness value for any other membership value is calculated by first inverting the [0, 1] interval (of membership values) and then stretching it linearly to the range of brightness values. Because monitors work with RGB colors, we use a function that applies this idea in an RGB model. The function that maps membership values to RGB colors is f : [0, 1] → IR3 such that for every λ ∈ [0, 1], f(λ) = (1 − λ) × (Rmax, Gmax, Bmax), where (Rmax, Gmax, Bmax) is the specified brightest color of the layer.

If multiple layers overlap, transparency is used to draw them together. This is calculated with alpha-blending [20]. The color (R, G, B)new at every location in the overlapping part is calculated as (R, G, B)new = 0.5 × (R, G, B)S1 + 0.5 × (R, G, B)S2, where (R, G, B)S1 and (R, G, B)S2 are the colors at that location for the top and bottom layer, respectively. When more than two layers are to be drawn, the two bottom layers are blended first. Then for every layer to be added on top of them, the above formula is reused, replacing (R, G, B)S2 with the calculated color of the previous layers, and (R, G, B)S1 with the color of the new layer.
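A hedged sketch (ours, not the d.vague source) of the three numerical ingredients described above: the plane coefficients for membership interpolation inside a triangle, the mapping of a membership value to an RGB color, and the 50/50 blending of two layers. The example brightest color is an assumption.

```python
import numpy as np

def plane_coefficients(p1, p2, p3):
    """Solve z = a*x + b*y + c from three (x, y, z) triangle vertices."""
    A = np.array([[p1[0], p1[1], 1.0],
                  [p2[0], p2[1], 1.0],
                  [p3[0], p3[1], 1.0]])
    z = np.array([p1[2], p2[2], p3[2]])
    return np.linalg.solve(A, z)            # (a, b, c)

def membership_at(x, y, coeffs):
    a, b, c = coeffs
    return a * x + b * y + c

def membership_to_rgb(mv, brightest=(255, 200, 200)):
    """f(lam) = (1 - lam) * (Rmax, Gmax, Bmax): mv = 0 -> brightest, mv = 1 -> black."""
    return tuple(int((1.0 - mv) * channel) for channel in brightest)

def blend(top_rgb, bottom_rgb):
    """Equal-weight alpha blending of two overlapping layers."""
    return tuple(int(0.5 * t + 0.5 * b) for t, b in zip(top_rgb, bottom_rgb))
```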
5 Operators on vague objects

The operators we implemented are union, intersection, and difference. This section describes these operators briefly for vague points and vague lines, which are simple, and concentrates on the operators for vague regions.

The operators for vague points take two Vpoint layers as input, and output a new Vpoint layer. The three operators first check for simple point objects in the two input layers that have identical locations. The union operator selects from each pair (of identical-location points) the point with the higher membership value, and puts it in the result layer. It also adds all other (unmatched) points from both layers to the result layer. The intersection operator selects the point with the lower membership value from each pair of matched objects, and inserts these points in the result layer. For each pair of matched points, the difference operator calculates a new membership value (using the formula provided in Sect. 2), and inserts in the result layer a point with the common location and the calculated membership value. It also adds to the result layer all unmatched points from the first layer.
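A hedged sketch of the three point operators, treating a Vpoint layer as a dictionary from (x, y) locations to membership values (our assumed representation). The difference formula from Sect. 2 is not reproduced in this chapter; as a stand-in we use min(mv1, 1 − mv2), i.e., intersection with the complement, which is how the difference of vague regions is characterized later in this section.

```python
def vpoint_union(a: dict, b: dict) -> dict:
    """Matched locations take the higher membership; unmatched points of both layers are kept."""
    out = dict(a)
    for loc, mv in b.items():
        out[loc] = max(out.get(loc, 0.0), mv)
    return out

def vpoint_intersection(a: dict, b: dict) -> dict:
    """Only matched locations are kept, with the lower membership."""
    return {loc: min(mv, b[loc]) for loc, mv in a.items() if loc in b}

def vpoint_difference(a: dict, b: dict) -> dict:
    """Matched locations get min(mv_a, 1 - mv_b); unmatched points of the first layer are kept."""
    return {loc: min(mv, 1.0 - b[loc]) if loc in b else mv for loc, mv in a.items()}
```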
Fig. 19. Union of two simple vague lines.
The union operator for vague lines takes two Vline layers as input, and outputs a new Vline layer. It first checks for simple lines that intersect. The operator splits the intersecting lines at the intersection point, which becomes a common node of the newly created lines. The membership value at the common node is the maximum of the membership values at this location in the two initial lines. The new lines are inserted in the result layer. Lines that do not intersect are added directly to the result layer. Figure 19 illustrates the union of vague lines: lines L1 and L2 from the input layers intersect; lines L3, L4, L5, L6 are created and added to the result layer. The intersection operator for vague lines takes two Vline layers as input, and outputs a Vpoint layer. It checks for intersecting lines, creates a point at the intersection location with a membership value equal to the minimum of the two lines' values at this location, and inserts these points in the result layer.

Operators for vague regions take two Vregion layers as input, and output a Vregion layer. A Vregion layer is a surface that consists of several non-overlapping triangulations. Triangulations belonging to different surfaces might overlap. The overlapping zone between two surfaces is important for the operators, because they treat triangles inside and outside the overlapping zone differently.
The overlapping zone may consist of several separate areas. The operators start by detecting the intersection line between the triangulations. The line separates the input triangulations into parts that are used to construct the output surface. The union is built by taking the higher triangulation inside each overlapping area, and adding unchanged the triangulation parts that are outside the overlapping zone. The intersection is built by taking the lower triangulation inside any overlapping area. The difference operator re-calculates values for the triangulations inside an overlapping area, and adds unchanged the triangulations of the first surface that are outside the overlapping zone. The basic steps for the operators are:
1. Add vertical faces along the boundaries of the triangulations on both surfaces;
2. Detect the intersection line;
3. Re-triangulate both surfaces with this intersection line;
4. Select the right triangles from the re-triangulated surfaces to build the result.
The first three steps are the same for the union and intersection operators. The fourth step, selecting triangles for the output, is different. We explain the first three steps, and give the algorithm of the fourth step for the union. This algorithm can be used for the intersection with only a few changes, which are described briefly. The difference operator requires specific treatment in most of the steps, therefore we explain it separately.

To add vertical faces along a triangulation boundary, we consider every edge of the boundary. Two vertical triangles are created for each edge and added to the triangulation. Suppose the edge is defined by points p1 = (x1, y1, z1) and p2 = (x2, y2, z2). The face determined by p1, p2 and their projections onto the plane, p3 = (x1, y1, 0) and p4 = (x2, y2, 0), is split into two triangles, and these are added to the triangulation.

Detection of the intersection line is performed using the GNU triangulated surface library (http://gts.sourceforge.net). It produces the set of looped lines where the two surfaces intersect. The intersection between two triangles can be a line segment, or a polygon if the triangles lie in the same plane. Therefore, the intersection line between two surfaces consists only of straight-line segments.

The intersection line is added as a constraint to the triangulations of both surfaces. On each surface, only triangulations containing a part of the intersection line need to be re-triangulated. For every triangulation (to be changed), the triangles containing a line segment from the intersection are split into several new triangles. The other triangles are added unchanged to the new triangulation. The splitting of triangles is done by greedy triangulation, as described in [21]. Figure 20 illustrates the splitting of three triangles. The algorithm below explains how the splitting is done. Let Vi be the set of vertices of triangle i, and Li the set of line segments of the intersection line passing through triangle i.
Fig. 20. Re-triangulation: (a) old triangles and intersection line in grey, (b) new triangles after re-triangulation.
Let us denote by Ei the set of edges of the new triangulation in triangle i, and by Pi the union of Vi with the end points of Li.

set Ei = Li
for all points p from Pi
    create Dp as the edges from p to any other point in Pi, sorted in ascending order on length
    for every edge d = (p, q) in Dp
        if d ∉ Ei and no edge e ∈ Ei intersects with d and no point r ∈ Pi − {p, q} lies on d
            add d to Ei
        fi
    rof
rof

After the re-triangulation, the relation of a triangle from one surface to the other surface is one of three cases:

• The three vertices of the triangle are on the other surface: the triangle is part of an area that is contained in both surfaces.
• One or two vertices are on the other surface: the triangle intersects the other surface only at a point or along a line (the edge joining the two vertices). The rest of the triangle is above or below the other surface.
• No vertex of the triangle is on the other surface: the triangle does not intersect the other surface. It is completely above or below the other surface.

A point is above a surface when its z coordinate is higher than the z value of the surface at the point's location. A point in the interior of a triangle from one surface determines the relation of this triangle with the other surface. When an interior point is above, on, or below the other surface, the whole triangle is above, on, or below the other surface, respectively. The center of the incircle of a triangle6 is always in the interior of the triangle, and can therefore be used for such testing.
6 The incircle is the inscribed circle of a triangle, i.e. the unique circle that is tangent to each of the triangle's edges.
The intersection line consists of several looped lines. Triangles of one surface that are inside a looped line are all above the other surface, or all below the other surface. So the union of two surfaces is formed by groups of triangles bounded by these looped lines. The algorithm that performs the union of two re-triangulated surfaces S1 and S2 and outputs a surface S is given below:

set S to an empty surface
for any triangle t from surface S1
    generate a point p(x, y, z) in the interior of t
    if S2 exists at location (x, y)
        if p is on or above S2
            add t to S
        fi
    else
        add t to S
    fi
rof
for any triangle t from surface S2
    generate a point p(x, y, z) in the interior of t
    if S1 exists at location (x, y)
        if p is above S1
            add t to S
        fi
    else
        add t to S
    fi
rof

After the output surface is created, separate triangulations are detected and given an object identifier. They are the SVregion objects of the result layer. The algorithm for calculating the intersection is quite similar. The difference is that a triangle of one surface is discarded if the other surface does not exist at the interior location of the triangle. Also, the test on the z value of a point against a surface is reversed: the point should be on or below the surface.

The difference between surfaces S1 and S2 is the intersection of S1 with the complement of S2. Outside S2, the complement is equal to 1. The values of S1 are everywhere smaller than or equal to 1, therefore the intersection outside S2 is equal to S1. Thus, we only need to build the complement of S2 inside its boundaries. The complement surface S̄2 is built from the S2 triangulation by inverting the values of each triangle vertex. The core of S2 becomes a hole of S̄2, therefore core triangles are not put in S̄2. Holes of S2 constitute the core of S̄2. The boundary of each hole is triangulated and included in S̄2, all as flat triangles with value 1.

To build the result surface we pass through the same steps as for union and intersection. First, vertical faces are added along the boundaries of S1 and S̄2. For S1, faces are added from the boundary line down to its projection in the horizontal plane, that is, down to membership value 0. For S̄2, vertical faces are built along the boundary up to membership value 1. The intersection line between the two surfaces (extended by the vertical faces) is detected, and both surfaces are re-triangulated with this line.
The fourth step, selecting the right triangles, is quite similar to that of the intersection operator, except that every triangle of the first surface is included in the output if the second surface does not exist at its location. We keep the notations S1 and S̄2 to denote the surfaces taken after the re-triangulation. The algorithm for the last step, selecting the right triangles for the output surface, is:

set S to an empty surface
for any triangle t from surface S1
    generate a point p(x, y, z) in the interior of t
    if S̄2 exists at location (x, y)
        if p is on or below S̄2
            add t to S
        fi
    else
        add t to S
    fi
rof
for any triangle t from surface S̄2
    generate a point p(x, y, z) in the interior of t
    if S1 exists at location (x, y)
        if p is below S1
            add t to S
        fi
    fi
rof

As for the other operators, after the output surface is created, separate triangulations are detected and given an object identifier. The triangle edges of every triangulation are stored together with this identifier in the new layer.
6 Discussions

The current implementation of α-shapes gives good results for data sets with a more or less regular sample density. If the sample density changes too much, the α-shapes do not work very well. Density-scaled α-shapes [19] provide a solution to this. Implementation of their algorithm would make the clustering process more robust. On the other hand, density-based clustering algorithms (pointed out by a reviewer), e.g. DBSCAN or OPTICS, would be another possible solution for clustering the data points. It seems, though, that these techniques aim at clustering points, and do not provide for the delineation of cluster boundaries.

The only input type we consider for vague regions is data points. Data about vague regions could also be (vague) lines, e.g. lines having the same membership value. Part of the procedure to create vague regions from line input would be the same as for point input; both need the constrained Delaunay triangulation. The identification of objects would need another technique.
The operators we implemented are only part of a complete set of spatial operators, e.g. those offered by the ROSE algebra [9; 10]. They are the operators returning spatial objects (types). From this group we left out the difference between vague lines, because of its complexity. Besides, we consider it less important than the operators on vague regions, therefore we gave priority to their implementation. Other spatial operators would make use of these implemented basic operators. Several spatial relations (whose definitions we have provided in [6]) are based on the intersection of vague objects. Other spatial relations are based on operators like bounded difference and absolute difference. They would therefore need an extension of these basic operators for their implementation.

The membership function of a simple vague region can have discontinuities along lines. These discontinuities result in vertical faces in the triangulations. The work presented in this chapter does not consider discontinuity lines. We build and store the topology of triangulations using the GRASS topology, which ignores vertical faces and the areas adjacent to them. We do not need to store vertical faces, but we do need the triangles adjacent to them. To allow discontinuity lines we could build separate functions for the topology of triangulations from the GRASS topology functions, modifying the latter. GRASS topology builds the connectivity of 3D lines (through their nodes) correctly, but considers only their x, y coordinates when building areas from lines. We can use line connectivity to build the triangles adjacent to vertical faces, changing the existing functions to consider the special cases we need. Operators on vague regions would have to be modified in order to consider discontinuities.
7 Conclusions

This chapter shows how vague spatial objects can be stored and manipulated using existing GIS functionality for the vector data format. Known spatial data types and structures were employed to construct vague object types. Points, lines, and triangulations were used to store simple vague points, lines, and regions, respectively. These simple vague types represent identifiable objects. Classes of simple objects were stored in separate vague layers. Classes of a certain theme can be bound together via relations saved in database tables. A theme of vague region classes forms a vague partition, which allows for a soft classification of space that is important for many spatial applications.

A few modules were provided to handle vague objects: a module that creates layers of vague objects from input data points; a module that visualizes vague layers on the screen, and allows users to retrieve and display information about their objects; and some modules that perform different operations on vague layers. Union, intersection, and difference operators were implemented for vague objects.
These are basic operators, on which other spatial operators can be built.
References

[1] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, Inc., San Jose, 2002.
[2] Peter A. Burrough and Andrew U. Frank. Geographic Objects with Indeterminate Boundaries. Number 2 in GISDATA. Taylor & Francis, London, 1996.
[3] Eliseo Clementini and Paolino di Felice. A spatial model for complex objects with a broad boundary supporting queries on uncertain data. Data & Knowledge Engineering, 37(3):285–305, June 2001.
[4] Anthony G. Cohn and Nicholas Mark Gotts. Representing spatial vagueness: A mereological approach. In Luigia Carlucci Aiello, Jon Doyle, and Stuart Shapiro, editors, Principles of Knowledge Representation and Reasoning (KR'96), pages 230–241. Morgan Kaufmann, 1996.
[5] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer, Berlin, 2nd edition, 1997.
[6] Arta Dilo, Rolf A. de By, and Alfred Stein. A proposal for spatial relations between vague objects. In Lun Wu, Wenzhong Shi, Yu Fang, and Qingxi Tong, editors, Proceedings of the International Symposium on Spatial Data Quality ISSDQ'05, Beijing, China, pages 50–59, Peking University, Beijing, China, August 2005. The Hong Kong Polytechnic University.
[7] Arta Dilo, Rolf A. de By, and Alfred Stein. A spatial algebra for vague objects. Submitted for publication, January 2005.
[8] H. Edelsbrunner, D. G. Kirkpatrick, and R. Seidel. On the shape of a set of points in the plane. IEEE Transactions on Information Theory, IT-29(4):551–559, July 1983.
[9] Ralf Hartmut Güting. An introduction to spatial database systems. VLDB Journal, 3(4):357–399, 1994.
[10] Ralf Hartmut Güting and Markus Schneider. Realm-based spatial data types: The ROSE algebra. VLDB Journal, 4(2):243–286, 1995.
[11] Pawalai Kraipeerapun. Implementation of vague spatial objects. Master's thesis, International Institute for Geo-information Science and Earth Observation (ITC), March 2004.
[12] Antony J. Roy and John G. Stell. Spatial relations between indeterminate regions. International Journal of Approximate Reasoning, 27(3):205–234, September 2001.
[13] Markus Schneider. Uncertainty management for spatial data in databases: Fuzzy spatial data types. In SSD '99: Proceedings of the 6th International Symposium on Advances in Spatial Databases, volume 1651 of Lecture Notes in Computer Science, pages 330–351. Springer-Verlag, 1999.
[14] Markus Schneider. Design and implementation of finite resolution crisp and fuzzy spatial objects. Data & Knowledge Engineering, 44(1):81–108, January 2003.
[15] Jonathan Richard Shewchuk. Triangle: Engineering a 2D quality mesh generator and Delaunay triangulator. In First Workshop on Applied Computational Geometry, pages 124–133, Philadelphia, Pennsylvania, May 1996.
[16] Rodrigo S. Sicat, Emmanuel John M. Carranza, and Uday Bhaskar Nidumolu. Fuzzy modelling of farmers' knowledge for land suitability classification. Agricultural Systems, 83(1):49–75, January 2005.
[17] Philippe Smets. Theories of uncertainty. In Enrique Ruspini, Piero Bonissone, and Witold Pedrycz, editors, Handbook of Fuzzy Computation. Institute of Physics Publishing, May 1999.
[18] Roy Sorensen. Vagueness. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. The Metaphysics Research Lab, CSLI, fall 2003 edition, 2003.
[19] M. Teichmann and M. Capps. Surface reconstruction with anisotropic density-scaled alpha-shapes. In IEEE Visualization '98 Proceedings, pages 67–72, San Francisco, CA, October 1998. ACM/SIGGRAPH Press.
[20] A. Watt and F. Policarpo. 3D Games: Real-time Rendering and Software Technology. Pearson Education Limited, Essex, England, 1st edition, 2001.
[21] Michael F. Worboys. GIS: A Computing Perspective. Taylor & Francis, 1st edition, 1995.
[22] Michael F. Worboys and Eliseo Clementini. Integration of imperfect spatial information. Journal of Visual Languages & Computing, 12(1):61–80, February 2001.
[23] Benjamin F. Zhan. Topological relations between fuzzy regions. In Proceedings of the 1997 ACM Symposium on Applied Computing, pages 192–196. ACM Press, 1997.
Spatial SQL with Customizable Soft Selection Conditions

Gloria Bordogna1, Marco Pagani1, Giuseppe Psaila2

1 CNR Institute for the Dynamics of Environmental Processes, via Pasubio 5, I-24044 Dalmine (BG) (Italy), {name.surname}@idpa.cnr.it
2 Bergamo University, Faculty of Engineering, via Marconi 5, I-24044 Dalmine (BG) (Italy), [email protected]
1 Introduction

Spatial database systems have become a popular research area since they find applications in diverse fields where there is a need to manage geometric, geographic or spatial data, that is, data related to space such as the physical world (geography, urban planning, astronomy), parts of living organisms (body anatomy), engineering design (VLSI circuits, molecular structures), etc. When the earth's surface and subsurface are the space of interest, the systems are called Geographic Information Systems (GIS). A distinguishing feature of spatial database systems is the management of sets of entities with a spatial reference, which means that spatial database systems provide operations to deal with the extent, location, and relationships of the spatial elements.

However, it is generally recognized that current spatial database systems are inadequate to support typical applications of GIS and CAD, and also potential applications such as data warehousing. This is the reason that motivates current research aimed at improving the functionalities and the performance of the available spatial database management systems, including the modelling of continuous fields, the management of large data sets, and the design of user interfaces for simplifying the user-system interaction [1; 7; 26; 27]. Specifically, current spatial query languages are inadequate to perform spatial analysis for many reasons [3; 7; 14; 20; 25; 27]: one reason is that they force users to formulate their often vague requests by means of crisp selection conditions on spatial data. For many categories of users, such as planners and resource managers, the possibility to express tolerant conditions on conventional and spatial data, and to retrieve discriminated spatial information in decreasing order of relevance, can greatly simplify their spatial analysis task, which generally is a sequence of trial-and-error phases aimed at identifying spatial features of interest [1; 7; 25].
In this chapter, we present our ideas for defining a flexible spatial query language that simplifies spatial analysis when using traditional GISs. However, the proposal is also suited to querying any type of spatial database, since we define a user-customizable query language in which the semantics of soft selection conditions can be easily specified through an operator of the query language itself. We tackle the problem having in mind the objectives of defining a language that can be easily and cheaply incorporated into existing GISs, and that can be well accepted by the GIS community.

On the one hand, SQL is becoming more and more common in the GIS community. In fact, people find it easier to learn its extensions instead of a completely novel language. In this respect, there is an effort to define spatial extensions of SQL for standardization purposes, and several proposals are now available [11; 14; 17; 21; 24]. Specifically, SQL has been indicated as the standard query language by the Open Geospatial Consortium (OGC) [11]. However, this proposal does not allow the specification of soft (tolerant) selection conditions. Nevertheless, the need for flexible querying in GISs has been addressed by some researchers, who mainly considered it in relation to the problem of querying geographic entities with unsharp boundaries, such as natural ecosystems [7]. In this case, the definition of appropriate spatial operators for evaluating topological or directional relationships between fuzzy geographic entities has been considered [7; 9; 12; 31].

On the other hand, fuzzy set theory [29; 30] has been successfully applied to define flexible query languages for databases [4; 25; 6], and several approaches are now available [5; 22]. For a survey on this topic see Chapter 1 in this volume. Some authors have shown the utility of fuzzy extensions of SQL in the context of spatial databases [3; 25]; however, they regarded it as a special case of flexible querying of textual data, thus not considering the peculiarities that the management of the spatial component imposes.

Based on these considerations, we propose to define a fuzzy extension of spatial SQL that can be incorporated into traditional GISs. Spatial functions are proposed that make it possible to compute gradual geometrical, topological and directional properties between crisp geographic entities, on which soft conditions can be evaluated. Then, we discuss how to customize soft conditions (expressed by linguistic predicates) to the specific application and query context through a new SQL-like operator. We will show that the resulting proposal is easy to understand and easy to use, provides a very high level of flexibility to users, gives them full control of the query semantics, which is not offered by other proposals, and leaves the original spatial databases untouched.
2 Related works

Many researchers have considered the problem of standardizing the query language for GISs, and several proposals have indicated SQL as the basic language for querying geographic information in spatial databases [14; 17; 21]. There are many reasons for this choice. Basically, SQL is recognized as a standard query language, it is very common and well known, and extensions can be learned faster than a completely new language. The spatial extensions of SQL provide syntactical means to define Abstract Spatial Data Types (ASDTs) and to formulate queries involving only spatial properties, queries involving thematic properties, and queries that combine both types of properties. Several proposals of a spatial query language have been defined that extend the SQL language to deal with spatial data [11; 14; 17; 20; 21; 23; 24; 27; 28]. Some commercial products are now available that use spatial extensions of SQL [23]. However, none of the proposed approaches addresses the problem of flexible querying: the predicates of the relational calculus are extended to operate on the spatial domain, but they are still defined as crisp, i.e., binary, predicates admitting only full or no satisfaction.

On the other hand, some researchers addressed the need for flexible querying in GISs by proposing the use of fuzzy SQL without analyzing the peculiar extensions that the management of spatial data imposes [25]. For example, they do not discuss the need for a fuzzy algebra to compute gradual topological relationship values between spatial entities, so as to be able to evaluate soft selection conditions such as "almost completely overlapped", "almost included", etc. Other researchers investigated this need mainly in relation to the querying of geographic entities with unsharp boundaries. Representing ecosystems and regions characterized by high variability, so that a precise boundary can only be guessed, has led to extensions of the spatial data models based on exact objects. This made possible the representation of objects with a fuzzy boundary, i.e., regions with thick borders. A number of proposals have then been put forward to define spatial operators between such regions with thick boundaries [2; 9; 12; 14; 18; 31]. Some of these proposals considered that the topological and directional relationships between regions with fuzzy boundaries would be better defined as fuzzy relationships, i.e., if degrees of satisfaction of the relationship were accounted for [2]. For example, knowing that two regions with fuzzy boundaries overlap is not at all informative about the extent of their overlap; more information could be added by distinguishing whether only the borders of the regions overlap or also their inner parts.

The proposal of this chapter is founded on the idea that flexible spatial and non-spatial operations are a useful means for simplifying spatial analysis also in the case in which the geographic entities are represented precisely, with crisp boundaries, as in classic GISs. Further, we are convinced that a proposal for a flexible spatial query language should consider the state of the art of current GISs and the emerging standards for spatial querying [11]. Finally,
we think that, in order to be effective, a flexible query language must allow users to have total control over the soft selection conditions that they specify, giving them the means to define the semantics of the conditions that they find most adequate in the considered context. Based on these ideas, we propose a flexible spatial query language that incorporates the characteristics of fuzzy SQL for expressing soft selection conditions both on geometric attributes and geometrically derived attributes. Some new spatial (metric, topological and directional) functions are defined to compute the degree of satisfaction of topological and directional relationships by spatial entities.
3 Characteristics of a spatial query language

As an example of a spatial analysis phase made possible by a current GIS, let us take a look at the user manual of the commercial product ARCVIEW [19]. It illustrates, as the first example of its quick start tutorial, the querying of spatially referenced data with the objective of "finding the best location for a new showroom". The user is the manager of a company that wants to optimize logistics and shop placements in the Southeast USA under these specific requirements:

• the manager wants to identify big cities in the Southeast USA where the company might locate a new showroom;
• the manager wants to make sure that customers placing orders at the showroom can receive their merchandise the next day, i.e., the new showroom must be located within one day truck drive from the regional distribution center in Atlanta, Georgia;
• finally, in order to try to boost sales, the place of the new showroom must be in a state with weak sales in the last year.

A candidate place for the new showroom can be selected using ARCVIEW, provided that the previous vague criteria are translated into precise selection conditions that do not tolerate any under-satisfaction. So the ARCVIEW user manual suggests the following crisp interpretations of the vague terms: a "big city" is "a city with a population greater than 80,000 people"; "within one day truck drive from the regional distribution center" means a city within 300 miles from Atlanta; "weak sales" are sales below a total amount of 20,000$.

Then, the user has to follow a rather long interaction with the system to identify the candidate cities for the showroom, which are selected with equal relevance, i.e., with no indication of their degree of satisfaction of the selection criteria. Further, the user cannot be sure that the identified set is complete, i.e., the result can miss cities slightly under-satisfying one of the crisp conditions. For example, a city that satisfies all the conditions but is just 301 miles from Atlanta would be missed. For the manager it would be easier to directly formulate the query through linguistic terms interpreted as soft selection conditions
admitting degrees of satisfaction, avoiding the translation phase and obtaining discriminated results that reflect the different levels of satisfaction of the soft selection conditions. The mapping on the spatial domain of the candidate cities with graded colors would immediately allow the best candidates to be identified.
Fig. 1. Geometry Class Hierarchy defined in the OGC Simple Features Specification for SQL [11].
Now, let us analyze the basic characteristics that a spatial query language should provide in order to be extended to make the formulation of flexible spatial queries possible. In fact, due to the peculiarities of spatial data, a spatial query language is more complex than a query language dealing with traditional data only. For example, it must include facilities for all the essential operations of querying spatial and non-spatial data.

The first requirement for a spatial query language is the definition of Abstract Spatial Data Types for representing both spatial entities and spatial operators [14; 20]. ASDTs, or a spatial algebra, constitute the fundamental abstractions for points, lines and regions, together with relationships between them and operations for composition (e.g., computing the intersection and union of regions). Several proposals of spatial query languages have been defined: most of them extend the SQL language for dealing with spatial data [14; 17; 18; 21; 24].
CITIES (id: string; Name: string; population: Integer; location: Polygon)
REGION (id: string; Name: string; sales: integer; location: MultiPolygon)
SHOWROOMS (id: string; Name: string; sales: integer; location: Point)

Fig. 2. Relations with a geometry type attribute.
In 1999, the OGC (OpenGIS Consortium) released the OpenGIS Simple Features Specification for SQL [11]. The purpose of this specification is to define a standard SQL schema that supports storage, retrieval, query and update of a collection of geospatial features. In the specification, geometric features are represented based on 2D geometry with linear interpolation between vertices. SQL has been extended with ASDTs, named Geometry Types, so as to extend the domains of the relational calculus by including the spatial domain that provides an abstraction of spatial data [11]. Each geometry is associated with both a spatial reference system, which must be unique in the database, and a geometry class chosen from a geometry class hierarchy (see Figure 1). For example, the three relations reported in Figure 2 have the attribute location, which is a spatial attribute of type geometry.

The second characteristic of a spatial query language is the availability of a minimal set of spatial functions to operate on spatial data. Although there is still no consensus on a complete set of spatial operators to be included in a spatial query language, some fundamental and commonly used ones have been defined [17; 27].
• Functions evaluating spatial relationships are the most important ones. They denote whether a spatial relationship holds between two objects. For example, they allow testing which spatial entities stand in a given relationship with a query object, e.g., a window. One can distinguish several classes of relationships:
  – Topological relationships, such as Equal, Disjoint, Intersects, Touches, Crosses, Within, Contains, Overlaps; they are invariant under topological transformations like translation, scaling, and rotation.
  – Directional relationships, such as above, below, or north of, southwest of, etc.
  – Metric relationships, e.g., "distance < 100".
Among these, the topological relationships have been studied in depth [13; 14; 15; 16]. Specifically, the OGC specification for standard SQL gives the definitions of topological spatial relationships based on the interiors, exteriors, and boundaries of the participating spatial entities, i.e., the so-called Dimensionally Extended 9-intersection model [8; 10]. According to this definition, the topological relationships can assume a value in the domain {true, false, unknown}.
• Basic spatial operations for computing properties of spatial entities: for example, they allow the computation of the length, area, perimeter, and dimension of a shape.
• Set operators, intended as operators that create new spatial entities; they are the classical intersection, union, difference of two spatial entities, or
transformations such as rotation, translation or scaling; there are also operations that compute other useful spatial entities, such as the Minimum Bounding Box (Envelope), Boundary, Interior, Closure, Buffer, Centroid, Convex Hull, etc. of a spatial entity. For a complete list of the spatial operations defined in SQL see [11].

There are important issues related to the definition of spatial operations in ASDTs.
• There is general agreement that the definition of types (and particularly of operations) is application-dependent. So, it must be possible to define additional or alternative types and operations, which leads to the requirement of extensibility for the system architecture [20].
• The second requirement is the closure property of the spatial operators: it must be possible to combine the results of queries yielding heterogeneous geometries. This requirement is generally not satisfied by SQL. In fact, in SQL there is only a limited number of basic functions operating on all ASDTs, while other specific operations depend on the spatial entity's representation (either vector objects or raster fields) and on the ASDT.
• A spatial algebra should offer, besides operations on atomic geometric objects, also operations on spatially related sets of objects, for example, a partition (thematic map, tessellation) [28]. Example operations are the overlay of two partitions, fusion (merging adjacent areas in a partition if other attributes are equal), or finding, in a set of objects, the one closest to a query object. These operations are more complex than atomic operations.
• On the other side, there are requirements related to the graphical visualization of the query results, such as the graphical combination (overlay) of several query results, or the use of pointing devices to select objects within a picture or subareas (zooming in). However, these features can be fulfilled by the GUI of the system implementing the spatial algebra.

When using the OGC specification of spatial SQL to carry out some spatial analysis, one generally proceeds step by step. First, properties of spatial entities, i.e., attribute values, have to be accessed. Then, new spatial properties have to be computed, for example by applying some metric or topological operator; in some cases new spatial entities must be identified based on the computation of some set operation. An intermediate relation combining all the results of the spatial operations is obtained, on which the selection condition is applied, and finally the desired attributes are projected. This procedure reflects the evaluation steps of a basic spatial SQL query:

SELECT attribute-list
FROM source-relation
WHERE condition

in which the FROM clause is extended to create an intermediate single relation that is the result of some spatial operations. On this intermediate relation, the selection in the WHERE clause and the projection in the SELECT clause are then applied.
Hence, in the FROM clause, nested queries might generate spatial attributes which are the result of spatial operations applied to the results of other subqueries. For example, the request for the most suitable place to open a new showroom presented previously can be translated into the following SQL query:

SELECT C.name, R.name, R.sales
FROM (SELECT *,
             Distance(Centroid(C.location),
                      Centroid(SELECT A.location FROM Cities AS A
                               WHERE A.name="Atlanta")) AS dist-from-Atlanta,
             Within(C.location, R.location) AS In
      FROM Cities AS C, Regions AS R) AS INT-R
WHERE C.population > 80.000 AND dist-from-Atlanta < 300
      AND R.sales < 20.000 AND In=True

We can see that there are three nested SQL queries. The intermediate Spatial SQL query is the one that has been extended with the spatial operations Distance (metric operation) and Within (topological function), and it is aimed at computing the single intermediate relation INT-R:

INT-R (C.id, C.name, C.population, C.location, R.id, R.name, R.sales, R.location, dist-from-Atlanta, In)

that is supplied as argument to the outermost SQL FROM clause. Notice that the function evaluating the topological relationship Within computes a value in {true, false, unknown} [10; 11].
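To make the effect of the crisp thresholds concrete, the following minimal Python sketch (with entirely hypothetical city data and attribute names) mimics the selection performed by the query above; note how a city 301 miles from Atlanta is silently dropped, as discussed in Section 3.

# Toy sketch of the crisp selection above; all figures are invented for illustration.
cities = [
    {"name": "CityA", "population": 650000, "dist_from_atlanta": 245, "region_sales": 18000, "within_region": True},
    {"name": "CityB", "population": 600000, "dist_from_atlanta": 301, "region_sales": 12000, "within_region": True},
    {"name": "CityC", "population": 50000,  "dist_from_atlanta": 120, "region_sales": 15000, "within_region": True},
]
selected = [c["name"] for c in cities
            if c["population"] > 80000
            and c["dist_from_atlanta"] < 300
            and c["region_sales"] < 20000
            and c["within_region"]]
print(selected)  # ['CityA'] -- CityB, at 301 miles, is missed despite being a near match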
4 Definition of the flexible spatial SQL

In this section, we present our ideas for the definition of a flexible spatial query language for a conventional GIS. A flexible spatial query language should allow the expression of preferences inside conditions involving attributes of both geometry and non-geometry type. To this end, we adopt the OGC specification for spatial SQL [11] as the starting point of our definition. This choice is basically motivated by the fact that we want to found our language on a solid basis, in order to implement it as an extension of any system adhering to the SQL specification [23]. We incorporate flexibility into SQL by adopting the ideas in [5] underlying the definition of SQLf and by taking into account issues related to the peculiarities of the spatial dimension. Not least, practical aspects are also considered, related to the need to provide users with full control over the semantics of their queries, depending on the application and the query context. This last characteristic is mandatory when considering a spatial database, in which the semantics of terms such as
close, far, round-shaped, big, small, referred to spatial entities, can have very different interpretations depending on the application or even on the current settings of the working environment. To clarify, let us consider the term close and the different situations in which it changes meaning.
• Different Databases. The meaning of close changes if we change the database, because the application context for which the database has been defined is different. For example, close buildings has a different semantics when referred to buildings in a cadastral database and when referred to buildings in a service network database.
• Within the Same Database. Within the same database, it is possible to have different meanings for the same linguistic terms, depending on the semantics of the spatial entities to select. For example, close between cities is different from close between regions, although both data sets are stored in the same administrative database.
• Working Space Settings. The semantics of a linguistic term can be influenced by the current settings of the working space. One such setting is the current zooming factor of the map: if the map of Europe is currently visualized on the screen and one specifies close between cities, the interpretation of the linguistic constraint is likely to be different than in the case in which the map visualized on the screen is that of a single country. For example, Milan and Rome can be considered close on the map of Europe when one has to fly from one city to the other, while they can be judged not very close on the map of Italy when one is driving. Consequently, close can be interpreted as a relative soft condition, whose interpretation varies depending on the zooming factor: the higher the zooming factor, the stricter the interpretation.

In our Flexible Spatial SQL, preferences in selection conditions are represented by soft conditions as in SQLf [5], but in order to support customizable soft conditions, we followed these design guidelines.
• The semantics of the soft selection conditions must be formalized, to make clear how the soft selection conditions interact with the other clauses of the SQL SELECT operator.
• Soft selection conditions should be easily customizable; the language must provide some way to define the semantics of linguistic predicates and to customize it to the query context.
• The closure property of the SQL SELECT operator must be strictly preserved; in other words, a SELECT statement takes a fuzzy relation as input and generates a fuzzy relation as output.

The result is both an extension of the basic SQL SELECT operator and a new SQL-like operator, named CREATE TERM-SET, for defining the semantics, i.e., the meaning, of the soft conditions which can be made available to query the
spatial database. In this way, the user can have full control of the flexible query language, being able to fully customize the query; in addition, the user can use a linguistic value with distinct meanings in the same application, depending on the chosen reference attribute and query context.

Like in SQLf, soft conditions are here expressed through linguistic predicates identifying fuzzy subsets of the attribute domains and are specified in the WHERE clause of the extended SQL query, thus producing a fuzzy relation as the result of their evaluation. Besides, in the soft condition also the context of the linguistic predicate is specified, so that it is possible to choose the proper interpretation of the linguistic value. Soft conditions can be expressed on classic attributes, on geometry type attributes, and on geometry-derived attributes, i.e., attributes derived by the computation of spatial operations. These last two kinds of conditions are called soft spatial conditions. Since soft conditions need for their definition either a continuous or a discrete domain of values, in order to make possible the specification of soft spatial conditions such as almost completely within, almost equal, more or less north of, etc., it was necessary to extend the set of functions evaluating spatial relationships between geometries with new functions evaluating degrees of satisfaction of the spatial relationships. A set of new, fundamental spatial functions evaluating gradual spatial relationships is then defined.

In the following subsections, first the basic query of the flexible spatial SQL language is defined; then, some fundamental soft spatial conditions are introduced and new spatial functions are defined. Finally, the operator for defining and customizing the semantics of the soft conditions is presented.

4.1 Structure of a basic flexible SQL query

We propose the following basic structure of a Flexible Spatial SQL query:

SELECT (N | T | N, T) A1, ..., An
FROM source-relation
WHERE soft-selection-condition

in which
• N is the maximum number of desired tuples with the highest membership degrees (N ≥ 1).
• T is the minimum value that the tuple membership degree must reach in order for the tuple to be retrieved (0 < T ≤ 1).
• source-relation is the source relation on which the query is applied; it may contain spatial attributes, it can be obtained by means of a JOIN expression, and it can be a nested query.
• A1, ..., An are either classical or spatial attributes produced by the query; the query can derive new spatial attributes through the application of some spatial function; they can be either of geometry type, such as those
resulting from the spatial function Centroid, or not, such as the result of the Distance function.
• soft-selection-condition is a soft condition composed of crisp and linguistic predicates connected by AND and OR and negated by NOT. In this way, the soft spatial conditions are specified in the WHERE clause, like in SQLf.

Crisp predicates are based on the classical comparison operators applied to classical data types. Linguistic predicates have the form:

spatial-tuple or numerical-expression IS linguistic-value IN Term-set [ZOOMING: zf]

where spatial-tuple is a tuple whose elements are of geometry type, and linguistic-value is defined in the specified Term-set. For example, the following predicate

(a,b) IS 'close' IN City_Distances

selects a tuple in the source relation if the spatial attributes a and b describe two spatial objects which are 'close' to each other, according to the meaning of 'close' in the term set City_Distances. Note that (a,b) is a tuple composed of the attributes a and b. The numerical-expression can be computed by means of spatial operations. For example, the following linguistic predicate is equivalent to the previous one,

Distance(a,b) IS 'close' IN City_Distances

in which Distance is a spatial function computing the distance between the geometries a and b.

When ZOOMING is specified, zf specifies the zooming factor to apply when evaluating the query: this value is used to modify the interpretation of the linguistic value, depending on the query context. For example, a zooming factor of 1 might be exploited for queries concerning continents, while a zooming factor of 10 might be exploited for queries concerning city quarters: in the two cases, the meaning of 'close' is clearly very different. For example, suppose that on a range between 0 and 10000 km the linguistic value 'close' is defined such that two points are close when their distance is less than 2500 km. The following predicate
Distance(a,b) IS 'close' IN City_Distances
ZOOMING: 5
considers a spatial domain range 5 times smaller. Consequently, on a range between 0 and 2000 km, two points are close if their distance is less than 500 km.
ZOOMING is useful in the case in which one is willing to interpret the semantics of the linguistic-value in a relative way with respect to the current query context. Here, by query context we mean the current visualization status of a thematic map layer, and we represent it by the active zooming factor; consequently, the value of the zooming factor might be the same as that exploited by the visualization environment. The type of modification depends on the semantics of the linguistic predicate and on user expectations; so, in order to provide maximum flexibility, we let the user specify the kind of modification (scaling) when defining a term set (see the CREATE TERM-SET operator, Section 4.3).

4.2 Definition of soft spatial conditions

Soft spatial conditions are soft conditions on geometry type attributes or on spatially derived attributes. Their definition depends on several factors, among which are the application domain, the scope of the database, the user, and the context of the query. In the following, the soft spatial conditions that can be useful in a generic database are classified depending on the semantics of the attribute values to which they can be applied:
1. Soft Geometric conditions: these conditions express a geometric constraint on the values of either geometry type attributes (e.g., boundary, minimum bounding box, centroid, etc.) or spatially derived attributes (e.g., area, perimeter, etc.) of spatial entities. They can be expressed by means of linguistic values, for example, on the boundary of an entity (such as circularly shaped, almost symmetric, indented boundary) or on the area of an entity (such as big, small, etc.).
2. Soft Topological conditions: these conditions express a soft topological constraint on the values of topological relationships between pairs of spatial entities. For example, they can be useful to select the spatial entities which are more or less within a given region (a window) or which consistently overlap each other. In order to allow their specification, it is necessary to include in the flexible spatial query language new spatial functions evaluating degrees of satisfaction of the topological relationships.
3. Soft Directional conditions: these conditions express a soft directional constraint on the values of directional relationships between pairs of spatial entities. For example, they can be useful to select the spatial entities which are more or less north, almost completely west, etc., with respect to another entity. As in the previous case, it is necessary to include in the flexible spatial query language new spatial functions evaluating degrees of satisfaction of the directional relationships.
4. Soft Metric conditions: they express soft metric constraints on the values of metric relationships and can be specified by linguistic terms such as close, far, very far applied to the distance between entities.
In the following, we show by means of some examples how soft spatial conditions can be defined. The soft topological, directional and metric conditions that we define constitute a minimal basic set of soft spatial conditions that can be useful for querying any general spatial database, although we are well aware of their incompleteness.

The definition of soft spatial conditions requires, first, the definition of the set of possible linguistic values by which they can be specified, i.e., the term set. Second, it is necessary to specify the semantics of each value in the term set through the definition of the membership functions. The membership functions have the set of possible values of the considered attribute as domain, and the unit interval [0, 1] as codomain. However, the semantics of the soft conditions can be different depending on both the spatial entities and the application; so, in each application we allow distinct term sets to be defined for each soft spatial condition, so as to be able to choose the proper semantic interpretation when formulating a query. To this aim, the CREATE TERM-SET command is defined in the flexible spatial SQL.

To illustrate the problem of defining soft geometric conditions, let us consider the case of the following geometric condition on the shape of a spatial entity a:

a IS 'almost circular' IN Shape_factors

This soft condition constrains the shape of the spatial entity referenced by the spatial attribute a to be almost circular. This can be evaluated by means of a spatial function SHAPE_FACTOR (with codomain [0, 1]) denoting the roundness of a polygon; the function extends the spatial operations defined in SQL. This function is defined as follows:

SHAPE_FACTOR(s) := 4π * Area(s) / (Perimeter(s))^2

in which s is the geometry of a spatial entity, and Area and Perimeter are geometric functions defined in spatial SQL, computing the area and the perimeter respectively. So, given an implementation of the spatial functions Area and Perimeter, for any attribute s of type geometry polygon one can easily compute SHAPE_FACTOR(s). It can be noticed that as the shape of s approximates a circle, the value of SHAPE_FACTOR(s) approaches 1, while as the shape diverges from that of a circle, SHAPE_FACTOR(s) tends to 0. This suggests defining the linguistic value 'almost circular' in the term set Shape_factors = {'circular', 'almost circular', 'not very circular', 'not circular'} with a semantics specified by a fuzzy subset on [0, 1] (the codomain of SHAPE_FACTOR) with a non-decreasing membership function µ'almost circular' (see Figure 3).
Fig. 3. Semantics of the soft geometric condition almost circular.
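As an illustration, the following Python sketch computes SHAPE_FACTOR and evaluates an 'almost circular' membership; it relies on the shapely library for area and perimeter, and the breakpoints 0.6 and 0.9 of the non-decreasing membership function are assumptions made here for the example, not values prescribed by the chapter.

import math
from shapely.geometry import Polygon

def shape_factor(polygon):
    # 4*pi*Area / Perimeter^2: close to 1 for a circle, small for elongated shapes
    return 4 * math.pi * polygon.area / polygon.length ** 2

def mu_almost_circular(sf, lb=0.6, lt=0.9):
    # non-decreasing membership on [0, 1], in the spirit of Figure 3
    if sf >= lt:
        return 1.0
    if sf <= lb:
        return 0.0
    return (sf - lb) / (lt - lb)

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
strip = Polygon([(0, 0), (10, 0), (10, 0.1), (0, 0.1)])
print(mu_almost_circular(shape_factor(square)))  # ~0.62: a square is fairly round
print(mu_almost_circular(shape_factor(strip)))   # 0.0: a thin strip is not circular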
To support the definition of soft topological conditions, two new spatial functions, named OVERLAPPING_DEGREE(s1, s2) and INCLUSION_DEGREE(s1, s2) (both with codomain [0, 1]), are defined, which relax the crisp topological functions Overlap and Within of SQL, respectively. These new functions compute degrees of satisfaction of the topological relationships Overlap and Within, thus measuring the degree of overlapping and of inclusion between two geometries, respectively.

Specifically, OVERLAPPING_DEGREE is defined based on a similarity measure between two geometries. It is computed as the ratio of the area of the set intersection to the area of the set union of the two geometries s1 and s2 [2]:

OVERLAPPING_DEGREE(s1,s2) = Area(Intersection(s1,s2)) / Area(Union(s1,s2))

in which Area computes the area of a spatial entity, and Intersection and Union are set operations generating the set intersection and the set union of two geometries. These are basic functions of spatial SQL, so the new function can be easily implemented. It can be observed that as the intersection of the two input geometries approximates their union, i.e., the closer they are to being equal, the closer the value returned by OVERLAPPING_DEGREE is to 1; conversely, as the intersection decreases with respect to the union, i.e., the more the two entities are disjoint, the closer the result is to 0. Now, given the term set Overlapping_Degrees = {'very high', 'high', 'medium', 'low', 'very low'}, the linguistic values 'very high' and 'high' can be defined with non-decreasing membership functions µ'very high' and µ'high', the linguistic values 'low' and 'very low' can be defined with non-increasing membership functions µ'low' and µ'very low', and the linguistic value 'medium' can be defined with a unimodal membership function µ'medium'. These fuzzy subsets on the codomain of OVERLAPPING_DEGREE can be defined with triangular membership functions as depicted in Figure 4.

The function INCLUSION_DEGREE(s1, s2) is defined based on a fuzzy inclusion measure between two geometries s1 and s2 [2]:

INCLUSION_DEGREE(s1,s2) = Area(Intersection(s1,s2)) / Area(s1)

It can be observed that as s1 becomes more and more included in s2, the value of INCLUSION_DEGREE(s1,s2) becomes closer to 1. On the contrary, as s1 is more and more disjoint from s2, the value of INCLUSION_DEGREE(s1,s2) tends to zero.
Fig. 4. Semantics of the soft topological conditions between geometries.
Fig. 5. Semantics of two soft metric conditions between geometries.
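The two gradual topological functions can be implemented directly on top of the set operations they are defined from. The following Python sketch (a minimal illustration, not the chapter's implementation) uses the shapely library for Intersection, Union and Area, with toy geometries:

from shapely.geometry import Polygon

def overlapping_degree(s1, s2):
    # Area(Intersection) / Area(Union): 1 when the geometries coincide, 0 when disjoint
    union_area = s1.union(s2).area
    return s1.intersection(s2).area / union_area if union_area else 0.0

def inclusion_degree(s1, s2):
    # Area(Intersection) / Area(s1): 1 when s1 lies entirely within s2
    return s1.intersection(s2).area / s1.area if s1.area else 0.0

a = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])  # 4 x 4 square
b = Polygon([(2, 0), (6, 0), (6, 4), (2, 4)])  # the same square shifted by 2
print(overlapping_degree(a, b))  # 8 / 24 = 0.333...
print(inclusion_degree(a, b))    # 8 / 16 = 0.5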
The linguistic values in the term set Inclusion_Degrees = {'high', 'very high', 'medium', 'low', 'very low'} can be defined by triangular fuzzy subsets on the codomain of the function INCLUSION_DEGREE(s1, s2), as depicted in Figure 4.

As far as the soft metric conditions are concerned, expressed by linguistic values such as 'far', 'close', 'very close', etc., the basic domain of the linguistic distance values is the domain [0, 1] of a normalized distance measure, and the values can be defined as depicted in Figure 5. The meaning of soft metric conditions is strongly dependent on the application and, within the application, on the semantics of the spatial entities that are considered: 'close' between two stars in an astronomical database is different from 'close' between two cities in an administrative database, and within the same administrative database it is different from 'close' between two buildings in the same city. For these reasons, a normalization factor for the distances that depends on the spatial entities under evaluation must be determined. For example, it can be defined as the maximum possible distance on the spatial domain between any two such entities, and it can be set by the user during the definition of the term sets of linguistic values through the CREATE TERM-SET command. So, in this case, it can be useful in a single application to define several term sets of linguistic values with distinct normalization parameters depending on the entities.
Fig. 6. Semantics of soft directional conditions between geometries.
Soft directional conditions can be expressed by linguistic values such as south, south-west, on the right, more or less above, etc. In this case, it is difficult to define a common set of linguistic values that can be useful in any application: in GISs it can be useful to specify approximate cardinal points, while in CAD applications it is much more common to use terms such as 'above', 'below', 'right of', 'left of'. As an example, we define linguistic cardinal points by fuzzy subsets with triangular membership functions on the domain [0°, 360°], as depicted in Figure 6.

To show the use of soft spatial conditions, let us reformulate the request for the best place where to locate a new showroom, illustrated in Section 3, by the following Flexible Spatial SQL query.

SELECT 5, 0.6 C.name
FROM (SELECT *,
             Distance(Centroid(C.location),
                      Centroid(SELECT A.location FROM CITIES AS A
                               WHERE A.name="Atlanta")) AS dist-from-Atlanta,
             INCLUSION_DEGREE(C.location, S.location) AS inclusion-degree
      FROM CITIES AS C, REGIONS AS S) AS INT-R
WHERE C.population IS 'big' IN City_Populations
      AND dist-from-Atlanta IS 'close' IN City_Distances ZOOMING: 4
      AND S.sales IS 'weak' IN Sales
      AND inclusion-degree IS 'high' IN Inclusion_Degrees

in which the term sets named City_Populations, City_Distances, Sales and Inclusion_Degrees, and the spatial functions Distance and INCLUSION_DEGREE, are exploited to compose soft predicates. This query retrieves the names of the cities belonging to the first 5 tuples with membership degree above 0.6, the degree reflecting the global satisfaction of the ANDed conditions: being a 'big' city, 'close' to Atlanta, and highly within a region with 'weak' sales. Notice that for the linguistic predicate on the distance attribute the zooming factor (ZOOMING keyword) is specified; this means that the semantics of the linguistic value 'close' is affected by a zooming factor of 4 applied to the map.
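The soft directional conditions above presuppose a function returning an angle to which the fuzzy cardinal points of Figure 6 can be applied. The following Python sketch is one possible, purely illustrative realization: the direction of a with respect to b is taken as the bearing between their centroids (0° = north, clockwise), and 'north' is given a triangular membership with an assumed half-width of 45°; neither convention is prescribed by the chapter.

import math
from shapely.geometry import Polygon

def bearing_degrees(a, b):
    # bearing of the centroid of a as seen from the centroid of b (0 deg = north, clockwise)
    ca, cb = a.centroid, b.centroid
    dx, dy = ca.x - cb.x, ca.y - cb.y
    return math.degrees(math.atan2(dx, dy)) % 360.0

def mu_north(angle, half_width=45.0):
    # triangular membership centred at 0/360 degrees, handling the wrap-around
    diff = min(angle, 360.0 - angle)
    return max(0.0, 1.0 - diff / half_width)

a = Polygon([(0, 10), (1, 10), (1, 11), (0, 11)])  # roughly north of b
b = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
print(mu_north(bearing_degrees(a, b)))  # 1.0: a is fully 'north of' b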
4.3 Customizing the flexible spatial SQL: definition of the soft conditions' semantics

In order to make the flexible spatial query language adaptable to user needs, we define a new SQL-like operator for defining term sets (and the meanings of their linguistic values), which can then be used in the WHERE clause to specify linguistic predicates by means of the IS .. IN operator. This operator is named CREATE TERM-SET and can be used any time one wants to enrich the flexible spatial SQL to allow the specification of new soft conditions. This way, the semantics of linguistic predicates can be easily customized to the application and, within the application, to the spatial entities to which it applies.

CREATE TERM-SET name
[ NORMALIZED WITHIN ( min, max ) ]
[ SCALING ]
( EVALUATE evaluation-function WITH PARAMS list-of-params )+
VALUES list-of-linguistic-value-definitions

• name is the name of the term set under definition.
• The VALUES clause defines the set of linguistic values within the term set, where list-of-linguistic-value-definitions is a non-empty list of linguistic-value-definitions. Each linguistic-value-definition is a pair (linguistic-value, meaning):
  – linguistic-value is a string identifying a linguistic value, for example, 'very far'.
  – meaning is a 4-tuple (left-bottom-corner, left-top-corner, right-top-corner, right-bottom-corner) of ordered values in the range [0, 1] (it must hold that left-bottom-corner ≤ left-top-corner ≤ right-top-corner ≤ right-bottom-corner), and it defines the trapezoidal membership function of the fuzzy set identified by linguistic-value. The domain of the membership function is normalized within the range [0, 1]. For example, we can define meaning = (0.5, 0.75, 1, 1) for the linguistic value 'far', which is the trapezoidal membership function depicted in Figure 5.
• The same term set can be evaluated by the IS .. IN operator over different data types. This is allowed by the non-empty list (in the syntax denoted as "( ... )+") of EVALUATE clauses: each occurrence defines a specific evaluation function (in the syntax denoted as evaluation-function) to be computed, whose values are "compared" with the linguistic values. This way, depending on the data type of the left operand appearing in the IS .. IN operator, the proper function is applied. Note that the WITH PARAMS sub-clause defines the list of formal parameters appearing in the expression.
• The possibly missing (in the syntax denoted as "[ ... ]") NORMALIZED WITHIN clause specifies how to normalize the value v obtained by evaluating the evaluation-function. If the NORMALIZED WITHIN clause is specified, the specified range [min, max] is mapped to the range [0, 1], and v is mapped accordingly; if v is less than min (greater than max), it is evaluated as min (max). If the NORMALIZED WITHIN clause is not present, by default min = 0 and max = 1.
• The possibly missing SCALING clause specifies whether to scale the evaluation range depending on the zooming factor zf specified in the query. Given a zooming factor zf (with zf ≥ 1), the new scaled evaluation range becomes min' = min/zf and max' = max/zf. If the SCALING clause is not specified, then it is assumed that zf = 1, and the ZOOMING keyword in the IS .. IN operator does not have any effect on the evaluation range.

For example, consider the term set Inclusion_Degrees, which allows the expression of soft conditions on the inclusion degree between geometries.

CREATE TERM-SET Inclusion_Degrees
EVALUATE INC_DEGREE WITH PARAMS INC_DEGREE AS FLOAT
VALUES ('very high', (0.8, 1, 1, 1)),
       ('high', (0.55, 1, 1, 1)),
       ('medium', (0.2, 0.5, 0.5, 0.8)),
       ('low', (0, 0, 0, 0.45)),
       ('very low', (0, 0, 0, 0.2))

Notice that the default evaluation range [0, 1] is exploited, since the inclusion degree is a value less than or equal to 1. Furthermore, no scaling is necessary, since this property is independent of the particular zooming factor. In contrast, consider the term set City_Distances used to evaluate soft metric conditions.

CREATE TERM-SET City_Distances
NORMALIZED WITHIN (0, 10000)
SCALING
EVALUATE Distance(s1, s2) WITH PARAMS s1 AS Point, s2 AS Point
EVALUATE DIST WITH PARAMS DIST AS FLOAT
VALUES ('very far', (0.8, 0.9, 1, 1)),
       ('far', (0.5, 0.75, 1, 1)),
       ('medium distance', (0.1, 0.45, 0.55, 0.7)),
       ('close', (0, 0, 0.025, 0.1)),
       ('very close', (0, 0, 0.01, 0.02))
The term set provides two evaluation functions (EVALUATE clauses): the first one directly evaluates the distance between two points (it exploits the predefined spatial Distance function); the second one evaluates linguistic values against a numerical value (the parameter DIST). Notice that distances between cities are normalized between 0 km and 10000 km: this way, we define the basic meaning of the concept of city distance with respect to continents; however, the SCALING clause is specified, so that the meaning of city distance can be adapted to the particular zooming factor applied in the query. For example, a zooming factor of 5 means that the evaluation range for a distance is [0, 2000], which corresponds to the width of a wide European country; hence, considering the definition of the 'far' linguistic value, two cities are considered fully far if their distance is greater than 2000 * 0.75 = 1500 km. This way, the joint exploitation of normalization and scaling, together with the specification of the zooming factor in the query, allows concepts to be flexibly reused, adapting them to the particular context in which the query is performed.
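As a check of this arithmetic, the following short Python sketch (with helper names assumed here for illustration) reproduces the evaluation of 'far' under NORMALIZED WITHIN (0, 10000), SCALING and ZOOMING: 5:

def mu_far(distance_km, zf=5, base_max_km=10000.0, lb=0.5, lt=0.75):
    # scale the evaluation range by the zooming factor, then clip and normalize
    scaled_max = base_max_km / zf                              # [0, 2000] km for zf = 5
    v = min(max(distance_km, 0.0), scaled_max) / scaled_max    # normalized distance in [0, 1]
    if v >= lt:                                                # 'far' = (0.5, 0.75, 1, 1)
        return 1.0
    if v <= lb:
        return 0.0
    return (v - lb) / (lt - lb)

print(mu_far(1500))  # 1.0 -- fully 'far' from 1500 km upwards, as computed above
print(mu_far(1200))  # 0.4 -- only partially 'far'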
5 Formal semantics

Membership functions of linguistic values. The membership function of a linguistic value lv is defined with a trapezoidal shape by a 4-tuple (lb, lt, rt, rb), where:

µlv(x) =
  0                      if 0 ≤ x ≤ lb
  (x − lb) / (lt − lb)   if lb < x < lt
  1                      if lt ≤ x ≤ rt
  (rb − x) / (rb − rt)   if rt < x < rb
  0                      if rb ≤ x ≤ 1

When lt = rt, the membership function has a triangular shape. When lb = lt and rt = rb, the membership function has a rectangular shape, and then the associated linguistic term specifies a range condition. When lb = lt = rt = rb, the membership function is punctual, and then the associated linguistic term specifies a crisp condition.

The IS .. IN operator. A predicate based on the IS .. IN .. [ZOOMING] operator has the form:

t IS lv IN TS [ZOOMING: zf]

where t is a tuple of (possibly spatial) attributes and lv is a linguistic value in the term set TS, i.e., lv ∈ TS. If the ZOOMING option is not set, the zooming factor zf is by default set to 1. Consider now the evaluation range [min, max] defined for the term set TS. Depending on the presence of the SCALING clause, the scaled evaluation range [min', max'] is computed as:
min' = min / zf
max' = max / zf

in which zf = 1 if SCALING is not specified in the CREATE TERM-SET or if the ZOOMING option is not specified in the query. The following function restricts a value v to the specified (possibly scaled) evaluation range:

Within(v, [min', max']) =
  min'   if v ≤ min'
  v      if min' < v < max'
  max'   if max' ≤ v

The following function normalizes a value v from the evaluation range to the basic range [0, 1]:

Normalize(v, [min', max']) = (v − min') / (max' − min')

Finally, we are ready to define the semantics of the IS .. IN operator. With eFunc(t) we denote the application of the evaluation-function to a tuple t, and µlv(v) is the trapezoidal membership function for the linguistic value lv:

µ(t IS lv IN TS ZOOMING: zf)(t) = µlv(Normalize(Within(eFunc(t), [min', max']), [min', max']))

Recall that the membership values generated by crisp operators are only 0 and 1.

The WHERE clause. We assume that each tuple t in a source crisp spatial relation R has a membership degree µ(R)(t) = 1. The WHERE clause specifies a soft condition φ, which assigns to t a new membership degree, for the purpose of selection. Applying associative properties, φ can be seen either as φ = sub-cond1 lop sub-cond2, where lop is a logical operator AND or OR, or as a negated condition φ = NOT(sub-cond), where sub-cond1 and sub-cond2 can be either composed conditions or simple predicates. We define the semantics of the three logical operators in accordance with the usual definition of the AND, OR and NOT operators in fuzzy set theory, i.e., as min, max and complement (1 −), respectively.
If φ = sub-cond1 AND sub-cond2, then µφ(t) = min(µsub-cond1(t), µsub-cond2(t))
If φ = sub-cond1 OR sub-cond2, then µφ(t) = max(µsub-cond1(t), µsub-cond2(t))
If φ = NOT(sub-cond), then µφ(t) = 1 − µsub-cond(t)

Then, the membership degree of t after the evaluation of φ is:

µ(t) = min(µ(R)(t), µφ(t))

Finally, we define the semantics of the overall SELECT instruction, as far as selection is concerned. Given a source fuzzy spatial relation R (recall that the membership degree of tuples in non-fuzzy spatial relations is 1), consider the membership degree µ(t) of a tuple t after the evaluation of φ. The SELECT instruction generates a fuzzy spatial relation containing only those tuples t ∈ R such that µ(t) ≥ T, where T is the minimum membership degree threshold specified in the SELECT clause, or only the N tuples with the highest µ(t), when N is specified.
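To make the semantics above concrete, the following Python sketch (illustrative only: the function names, the dictionary encoding of term sets and the example values are assumptions, not part of the language definition) implements the trapezoidal membership function, the Within and Normalize steps, the IS .. IN evaluation with optional scaling, the fuzzy connectives of the WHERE clause, and the final selection based on T and N.

def trapezoid_mu(x, lb, lt, rt, rb):
    # trapezoidal membership function defined by the 4-tuple (lb, lt, rt, rb)
    if lt <= x <= rt:
        return 1.0
    if lb < x < lt:
        return (x - lb) / (lt - lb)
    if rt < x < rb:
        return (rb - x) / (rb - rt)
    return 0.0

def within(v, lo, hi):
    # restrict v to the (possibly scaled) evaluation range
    return min(max(v, lo), hi)

def normalize(v, lo, hi):
    # map the evaluation range onto the basic range [0, 1]
    return (v - lo) / (hi - lo)

def is_in(value, term, term_set, zf=1.0):
    # evaluation of: value IS term IN term_set [ZOOMING: zf]
    lo, hi = term_set["range"]
    if term_set.get("scaling"):
        lo, hi = lo / zf, hi / zf
    return trapezoid_mu(normalize(within(value, lo, hi), lo, hi),
                        *term_set["values"][term])

# fuzzy connectives used in the WHERE clause
AND, OR, NOT = min, max, (lambda m: 1.0 - m)

# a term set mirroring Inclusion_Degrees of Section 4.3 (no scaling needed)
inclusion_degrees = {"range": (0.0, 1.0), "scaling": False,
                     "values": {"high": (0.55, 1, 1, 1), "low": (0, 0, 0, 0.45)}}

mu = AND(is_in(0.8, "high", inclusion_degrees),
         NOT(is_in(0.8, "low", inclusion_degrees)))
print(mu)  # ~0.56: the tuple satisfies the composed soft condition to degree 0.56

# selection performed by the SELECT clause: threshold T and/or the N best tuples
tuples = [("t1", 0.9), ("t2", 0.55), ("t3", 0.7)]
T, N = 0.6, 2
print(sorted([t for t in tuples if t[1] >= T], key=lambda t: -t[1])[:N])
# [('t1', 0.9), ('t3', 0.7)]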
6 Conclusions

In this chapter, we proposed a solution towards the definition of a flexible spatial query language, in order to allow the expression of flexible queries on spatial data. We founded our work on the existing proposals of the OGC Simple Features Specification for SQL [11], the proposed standard extending SQL for specifying spatial queries, and on SQLf [4; 5], an extension of SQL for expressing flexible queries based on soft conditions. In the definition of the flexible spatial SQL we gave particular emphasis to the customization of the language and to the control of the semantics of the terms used to specify soft conditions, which is of great importance when querying a spatial database. The basic structure of a flexible spatial SQL query is defined so as to allow the expression of soft conditions in the WHERE clause and the yielding of fuzzy relations. In order to express soft conditions on spatial data, we showed that it can sometimes be necessary to extend the set of spatial functions of SQL. As an example, we have proposed a minimal set of fundamental spatial functions computing degrees of satisfaction of topological and directional relationships, which are useful for supporting the definition of soft topological and directional conditions. Based on these spatial functions, linguistic predicates can be expressed by means of a new comparison operator, named IS .. IN, allowed in the WHERE clause of the SELECT operator. This operator can be used in a very flexible way, since term sets can be freely defined by users by means of the new
CREATE TERM-SET operator. This operator allows full control over the semantics of the soft conditions defined in the language and makes it possible to suit the linguistic values to the application and, within the application, to the query context. The combined use of the features provided by the CREATE TERM-SET operator and the IS .. IN operator enables the user to express complex linguistic predicates very easily and flexibly.
References

[1] E. Bertino and G. Vantini. Advanced database systems and geographical information systems. In Proc. of the II Workshop on AITSES, pages 1–12, 1996.
[2] G. Bordogna, S. Chiesa, and D. Geneletti. Linguistic modelling of imperfect spatial information as a basis for simplifying spatial analysis. Information Sciences, 176(4):366–389, 2006.
[3] G. Bordogna and G. Psaila. Fuzzy spatial SQL. In Proceedings of FQAS 2004, LNAI 3055, pages 24–26, 2004.
[4] P. Bosc, B. Buckles, F.E. Petry, and O. Pivert. Fuzzy databases. In J.C. Bezdek, D. Dubois, and H. Prade, editors, Fuzzy Sets in Approximate Reasoning and Information Systems: The Handbook of Fuzzy Sets Series, pages 404–468. Kluwer Academic Publishers, 1999.
[5] P. Bosc and O. Pivert. SQLf: a relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3:1–17, 1995.
[6] P. Bosc and H. Prade. An introduction to the fuzzy set and possibility theory-based treatment of flexible queries and uncertain and imprecise databases. In A. Motro and P. Smets, editors, Uncertainty Management in Information Systems: From Needs to Solutions. Kluwer Academic Publishers, 1994.
[7] P.A. Burrough and A.U. Frank. Geographic Objects with Indeterminate Boundaries. GISDATA series, Taylor & Francis, 1996.
[8] E. Clementini and P. Di Felice. A comparison of methods for representing topological relationships. Information Sciences, 80:1–34, 1994.
[9] E. Clementini and P. Di Felice. An algebraic model for spatial objects with indeterminate boundaries. In P.A. Burrough and A.U. Frank, editors, Geographic Objects with Indeterminate Boundaries, GISDATA. Taylor & Francis, 1996.
[10] E. Clementini and P. Di Felice. A model for representing topological relationships between complex geometric features in spatial databases. Information Sciences, 90(1–4):121–136, 1996.
[11] OGC (Open Geospatial Consortium). OpenGIS Simple Features Implementation Specification for SQL, 2005.
[12] H. Couclelis. Towards an operational typology of geographic entities with ill-defined boundaries. In P.A. Burrough and A.U. Frank, editors, Geographic Objects with Indeterminate Boundaries, GISDATA. Taylor & Francis, 1996.
[13] M.J. Egenhofer. A formal definition of binary topological relationships. Lecture Notes in Computer Science, 367:457–472, 1989.
[14] M.J. Egenhofer. Spatial SQL: A query and presentation language. IEEE Transactions on Knowledge and Data Engineering, 6(1):86–95, 1994.
[15] M.J. Egenhofer and R.D. Franzosa. Point-set topological spatial relations. International Journal of Geographical Information Systems, 5(2):161–174, 1991.
[16] M.J. Egenhofer and J. Herring. A mathematical framework for the definition of topological relationships. In 4th Intl. Symposium on Spatial Data Handling, pages 803–813, Zürich, 1990.
[17] M.J. Egenhofer and A.U. Frank. Query languages for geographic information systems. Technical report, NCGIA, 2000.
[18] M. Erwig, R.H. Güting, M. Schneider, and M. Vazirgiannis. Spatio-temporal data types: An approach to modeling and querying moving objects. Geoinformatica, 3(3):269–296, 1999.
[19] ESRI Inc. Using ArcView GIS.
[20] R.H. Güting. Special issue on spatial database systems: An introduction to spatial database systems. The VLDB Journal, 3(4):357–399, October 1994.
[21] B. Huang and H. Lin. Design of a query language for accessing spatial analysis in the web environment. Geoinformatica: an International Journal on Advances of Computer Science for Geographic Information Systems, 3(2):165–183, 1999.
[22] J. Kacprzyk and S. Zadrozny. FQUERY for Access: fuzzy querying for a Windows-based DBMS. In P. Bosc and J. Kacprzyk, editors, Fuzziness in Database Management Systems. Physica-Verlag, 1995.
[23] OGC (Open Geospatial Consortium). OpenGIS registered products: compliant and implementing products. http://www.opengeospatial.org/resources/?page=products.
[24] OGC (Open Geospatial Consortium). OpenGIS Simple Features Specification for SQL. http://www.opengeospatial.org/specs/?page=specs.
[25] F.E. Petry. Fuzzy Databases. Kluwer Academic Publishers, 1996.
[26] P. Rigaux, M. Scholl, and A. Voisard. Spatial Databases with Application to GIS. Morgan Kaufmann, 2002.
[27] S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu, and C. Lu. Spatial databases – accomplishments and research needs. IEEE Transactions on Knowledge and Data Engineering, 11(1):45–55, 1999.
[28] P. Svensson and Z. Huang. Geo-SAL: A query language for spatial data analysis. In Proc. 2nd Intl. Symposium on Large Spatial Databases, pages 119–140, Zürich, 1991.
[29] L.A. Zadeh. The concept of a linguistic variable and its application to approximate reasoning, parts I, II. Information Sciences, 8:199–249, 301–357, 1975.
[30] L.A. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1:3–28, 1978.
[31] F.B. Zhan. Topological relations between fuzzy regions. In Proceedings of the ACM SAC, pages 192–196, 1997.