tsb = <Smart, Mercedes, black>, tmr = <Mini, Austin, red>, tmb = <Mini, Austin, black>. A preference for red cars implies, in the totalitarian semantics, that all the red cars are preferred to all the black cars, regardless of the other features of the cars: tsb
of product indicates that the price of product p is pr euros. Assume that a user is interested in the stores which have ordered about 50 copies of at least all the products priced under 20 euros. With a regular DBMS and query language (typically SQL), it is mandatory to translate the term “about 50 copies” in terms of a Boolean condition, for instance, an interval of the type [50 – a, 50 + a]. However, it is worth noticing that a small increase (respectively decrease) of the interval (variation of a) may lead to an undesirable behavior: elements initially selected (respectively discarded) are rejected (respectively accepted) due
to a larger (respectively smaller) divisor. Calling on preferences may be a convenient way to counter this abrupt behavior. So, instead of the interval [45, 55], one will specify that 49, 50, and 51 are ideal values, 48 and 52 very satisfactory, …, 44 and 56 borderline, and the others unacceptable. In general, preferences may apply to both the dividend and the divisor relations, and it is of particular interest to study their impact on the resulting relation, as will be discussed in the next section. Now, let us suppose that the divisor relation involves 20 elements (a1, …, a20) and that the dividend relation contains the pairs: {<x, a1>, <x, a6>, <x, a20>,
Neither x, nor y, nor z is satisfactory as to the division of r by s. Nevertheless, it seems legitimate to think that if x is definitely inadequate, y and z
Versatility of Fuzzy Sets for Modeling Flexible Queries
are almost satisfactory since they are associated with respectively 19 and 18 of the ai of the divisor. Thus, one may be interested in distinguishing between these quite different situations through tolerance to exceptions. An "all or nothing" approach will accept y and z provided that a 10% ratio of exceptions is allowed. It is also possible to adopt a graded view according to which exceptions are a matter of preferences and then their ratio (or number) a matter of degree. For instance, full satisfaction is maintained if exceptions are under 8%; above 15%, it becomes zero; and in-between, satisfaction decreases in a linear way. In the preceding case, exceptions are treated on a quantitative basis, that is, according to their number. So, it is impossible to compensate a large number of small exceptions, that is, to account for the notion of low-level exceptions which may occur in the context of fuzzy relations. For instance, let us consider the fuzzy relations:

r = {1/…

With Zadeh's (all or nothing) inclusion of E in F defined as:

E ⊆ F ⇔ ∀x ∈ X, µE(x) ≤ µF(x)   (4)

we could say that s is almost included in Ωr(x) since the grades almost agree on the previous condition. Of course, the notion of qualitative exceptions may be dealt with in a gradual way (i.e., an exception is more or less a low-level one) so as to prevent a sharp behavior of the tolerance mechanism. Tolerance may also come into play in the following case. Let us assume dividend and divisor relations where the common attribute is provided with a resemblance (or proximity) relation. For instance, the divisor contains value a, while the dividend does not involve the pair <x, a>, but <x, b> where a and b are close to each other. In such a case, one might consider that <x, b> is a somewhat acceptable substitute for <x, a>, which is required for a strict division.

Tolerant strategies, which are studied in more depth later, may be adopted either directly or because the initial query (with a nontolerant division) has led to an empty answer. In the latter case, the new (tolerant) division represents a weakened form of the regular one, and a nonempty answer can then be expected.

Division of Fuzzy Relations

Objectives and Basic Tools

Now, one considers fuzzy relations in the sense given previously, that is, whose tuples are weighted. In this context, one can envisage queries similar to that of example 7, for instance: "to what extent does each candidate have all the highly important skills required for the position with a medium level."

If selection and projection are those defined before, this query can be algebraically expressed thanks to a division of fuzzy relations as:

div(project(select(c, level = "medium"), {#i, apt}), project(select(p, importance = "high"), {apt}), {apt}, {apt})
provided that this operation is appropriately extended (i.e., is defined when the operand relations become fuzzy). Such an extension is based on an adaptation of formula (2) inside which the Boolean inclusion is replaced so that (1) its arguments may be fuzzy sets and (2) its result is graded (i.e., it delivers a degree of inclusion). Several ways of extension can be devised, among which:

deg(E ⊆ F) = minx ∈ X µE(x) ⇒f µF(x)   (5)
where ⇒f denotes a fuzzy implication intended to generalize the regular one, that is, a mapping from [0, 1] × [0, 1] into [0, 1] obeying a number of properties, among which: (1) 0 ⇒f a = 1, (2) a ⇒f 1 = 1, (3) 1 ⇒f a = a, and (4) decreasing (respectively increasing) monotonicity with respect to the first (respectively second) argument;

deg(E ⊆ F) = card(E ∩ F)/card(E) = Σx ∈ X ⊤(µE(x), µF(x)) / Σx ∈ X µE(x)   (6)

where ⊤ is a triangular norm. Formula (5) conveys a logical view, whereas formula (6) is cardinality-based; the extended divisions issued from these two approaches are presented hereafter.
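To make the two graded inclusions concrete, here is a small illustrative sketch (our code, not the authors'; fuzzy sets are represented as dictionaries of membership grades) of formulas (5) and (6):

```python
# Illustrative sketch of the two graded inclusions of formulas (5) and (6).
# A fuzzy set is a dict mapping elements to membership grades in [0, 1].

def godel_impl(p, q):
    # Goedel R-implication: maximal as soon as the conclusion reaches p.
    return 1.0 if p <= q else q

def kleene_dienes_impl(p, q):
    # Kleene-Dienes S-implication: max(1 - p, q).
    return max(1.0 - p, q)

def incl_logical(E, F, impl=godel_impl):
    # Formula (5): deg(E incl F) = min over x of impl(mu_E(x), mu_F(x)).
    return min(impl(mu, F.get(x, 0.0)) for x, mu in E.items())

def incl_cardinality(E, F, tnorm=min):
    # Formula (6): relative cardinality of E inter F in E (t-norm = min here).
    return sum(tnorm(mu, F.get(x, 0.0)) for x, mu in E.items()) / sum(E.values())

E = {"a": 1.0, "b": 0.6}
F = {"a": 0.8, "b": 1.0}
print(incl_logical(E, F))                # Goedel: min(0.8, 1) = 0.8
print(round(incl_cardinality(E, F), 3))  # (0.8 + 0.6) / 1.6 = 0.875
```

Note how (5) is driven by the worst element only, whereas (6) lets good elements partially compensate for bad ones.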
The Logical Approach

If r is a fuzzy relation, from:

Ωr(x) = {µ/a | µ/<a, x> ∈ r}

and formula (2), the definition of the division of fuzzy relations obtained is:

∀x ∈ supp(project(r, X)), µdiv(r, s, A, B)(x) = mina ∈ supp(s) µs(a) ⇒f µr(a, x)   (7)

where supp(E) denotes the support of a fuzzy set E, that is, the set of elements with a strictly positive degree. Two types of fuzzy implications are considered in the rest of this chapter due to their clear meaning and properties: R-implications and S-implications. Let us recall that these two types of implications can both be written in a "residuated" form (Dubois & Prade, 1984) as:

p ⇒f q = sup {u ∈ [0, 1] | cnj(p, u) ≤ q}   (8)

where cnj(a, b), the generator of the considered implication, is a continuous triangular norm for an R-implication and a continuous noncommutative conjunction for an S-implication. One can notice that the regular division (formula (2)) is recovered from formula (7) in the presence of regular relations, due to the fact that any fuzzy implication coincides with the usual one in that case (in particular 1 ⇒f 0 = 0 and 1 ⇒f 1 = 1). Such a definition guarantees that the result obtained by the division is a quotient (Bosc, Pivert, & Rocacher, 2007). In effect, using this generator, the Cartesian product of the divisor and the result t of the division is included (in Zadeh's sense) in the dividend r and it is maximal, that is:

∀x, x ∈ project(supp(t), X) and µt(x) = d ⇒ s × {d/<x>} ⊆ r   (9a)
∀x, x ∈ project(supp(r), X) and µt(x) = d and d1 > d ⇒ s × {d1/<x>} ⊄ r   (9b)

Of course, the use of an R-implication or an S-implication in formula (7) has a strong impact on the semantics of the obtained division. It turns out that R-implications can be rewritten as:

p ⇒R-i q = 1 if p ≤ q, f(p, q) otherwise

where f(p, q) accounts for the penalty applied when the conclusion q does not reach the antecedent p. It is worth noticing that, using an R-implication, if E is included in F in Zadeh's sense, formula (5) returns the maximal degree of inclusion. The degree attached to an element of the divisor (relation s) plays the role of a threshold: the higher it is in a tuple of the divisor, the higher it should be in the corresponding tuple of the dividend in order for x to get the maximal grade 1. Similarly, S-implications can be alternatively formulated as:

p ⇒S-i q = ⊥(1 – p, q)   (10)

where ⊥ stands for a triangular co-norm extending the usual disjunction. As a consequence, the antecedent p may be considered as playing the role of a degree of importance, and (1 – p) is a guaranteed level of satisfaction. Here, one may remark that, in general, the maximal degree of inclusion is not obtained with formula (5) when E is included in F
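As a concrete illustration of formula (7), here is a sketch (our code, under the chapter's definitions; the data are those of Tables 10a and 10b used in example 8 below):

```python
# Division of fuzzy relations, formula (7):
#   mu_div(x) = min over a in supp(s) of (mu_s(a) =>f mu_r(a, x)).

def godel_impl(p, q):
    # Goedel R-implication: the divisor grades act as thresholds.
    return 1.0 if p <= q else q

def fuzzy_division(r, s, impl):
    # r maps (a, x) pairs to grades, s maps a-values to grades.
    xs = sorted({x for (_, x) in r})
    return {x: min(impl(mu_a, r.get((a, x), 0.0)) for a, mu_a in s.items())
            for x in xs}

c_rl = {("A", "c1"): 1.0, ("B", "c1"): 0.6, ("C", "c1"): 0.4,
        ("A", "c2"): 0.8, ("B", "c2"): 1.0}          # Table 10a
p_hw = {"A": 1.0, "B": 0.8, "C": 0.2}                # Table 10b

print(fuzzy_division(c_rl, p_hw, godel_impl))
# {'c1': 0.6, 'c2': 0.0}: c2 fails because <C, c2> is absent (0.2 => 0 gives 0)
```

With Gödel implication, a missing pair of the dividend immediately yields 0, which illustrates why tolerant (approximate) divisions are introduced later.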
in the usual sense. In fact, the notion of inclusion conveyed by an S-implication is more drastic. For instance, with Kleene-Dienes implication (p ⇒K-D q = max(1 – p, q)) or Reichenbach implication (p ⇒S-i q = 1 – p + pq), 1 is obtained if the support of E (elements with a positive degree) is included in the core (elements with the degree 1) of F. Note that this does not mean that a degree of inclusion based on an S-implication is less than one based on an R-implication, as illustrated in example 8. In this context, the higher the degree of a in the divisor, the more important the degree of <a, x> in the dividend (i.e., the more this degree influences the grade assigned to x).

Example 8. Let us consider the relations curriculum and profile of example 7 and the query looking for the candidates who have all the highly weighted skills with a reasonable level. This query leads to the division of the two relations c-rl and p-hw and, with the extensions of Tables 10a and 10b, one gets: 0.6/<c1>, … is the generator of Kleene-Dienes implication) leads to: pc2 = {0.6/…
Table 10a. Relation c-rl

    #i    skill    µ
    c1      A      1
    c1      B      0.6
    c1      C      0.4
    c2      A      0.8
    c2      B      1

Table 10b. Relation p-hw

    skill    µ
      A      1
      B      0.8
      C      0.2
… provided that the containment operator is appropriately parameterized.♦
The Cardinality-Based Approach

As mentioned above, a degree of inclusion can be built from formula (6), which, when used in the definition of the division, yields:

∀x ∈ supp(project(r, X)), µdiv(r, s, A, B)(x) = Σa ∈ supp(s) ⊤(µr(a, x), µs(a)) / Σa ∈ supp(s) µs(a)
where ⊤ denotes a triangular norm. While such an approach seems legitimate, its definite validation depends on whether or not the result is a quotient. It turns out that this is not the case. In effect, it may happen (Bosc et al., 2007) that: (1) the Cartesian product (using the smallest triangular norm) of the divisor and the smallest result of such a division (i.e., using the smallest norm in the above expression) is not included in the dividend, and (2) the Cartesian product (using the largest noncommutative conjunction) of the divisor and the largest result of such a division (i.e., using the largest norm, min, in the above expression) is not maximal.
Exception-Based Approximate Division As mentioned earlier, an approach to extending the division consists in the tolerance to exceptions, which leads to an approximate division. This can be understood in two different ways depending on the exceptions which can be of a quantitative or qualitative nature. Obviously, the result delivered by any approximate division must be a superset of the one returned by the regular division.
Quantitative Approach

The definition of a quantitative approximate division is based on the allowance for the existence of some elements of the divisor (s) not connected in the dividend (r) with the value x under consideration. In other words, a certain number of values
of s can be more or less ignored depending on the authorized level of relaxation. The principle adopted is to weaken the universal quantifier into the relative fuzzy quantifier "almost all" (Kerre & Liu, 1998; Zadeh, 1983), modeled as a function from the unit interval into itself. In so doing, degrees are associated with the weakened quantifier and the result is a fuzzy relation, although the input relations may be nonfuzzy ones. These grades convey a natural semantics, namely the degree of satisfaction obtained when a given number of values are ignored. Let us remark that a similar approach is adopted in Galindo et al. (2001) in the context of a division of relations involving imprecise data represented as possibility distributions.
Quantitative Approximate Division of Regular Relations

Such an approximate division is a way of answering queries like: "to what extent does each candidate possess, with a level over 3, almost all the skills whose importance is greater than 4." The definition of a quantitative approximate division of regular relations is based on the allowance for some elements of the divisor (s) not (at all) connected in the dividend r (i.e., absent) with the value x under consideration. In other words, a certain number of values of s can be more or less ignored depending on the authorized level of relaxation. The grades are defined as follows: µalmost all(0) = 0, µalmost all(1) = 1, and µalmost all(1 – i/n) = wi expresses the degree of satisfaction when i out of the n elements of the referential are ignored. By definition:

1 = w0 ≥ w1 ≥ … ≥ wn = 0

and if we denote k1 = sup {j | wj = 1}, k2 = sup {j | wj > 0},
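The weights wi and the bounds k1 and k2 can be sketched as follows (our code; the quantifier shape is the one used in example 9 below: 0 under 75%, 1 above 95%, linear in-between):

```python
# Sketch of the weights w_i attached to a relative quantifier "almost all",
# together with the bounds k1 and k2 defined above.

def almost_all(f):
    if f <= 0.75:
        return 0.0
    if f >= 0.95:
        return 1.0
    return (f - 0.75) / 0.2

def weights(mu_q, n):
    # w_i = mu_q(1 - i/n), for i = 0 .. n ignored elements.
    return [round(mu_q(1 - i / n), 2) for i in range(n + 1)]

w = weights(almost_all, 10)
k1 = max(j for j, wj in enumerate(w) if wj == 1)   # total ignorance bound
k2 = max(j for j, wj in enumerate(w) if wj > 0)    # partial ignorance bound
print(w[:4], k1, k2)  # [1.0, 0.75, 0.25, 0.0] 0 2
```

With this quantifier and a 10-element divisor, up to two exceptions can be partially ignored, which is exactly what example 9 exploits.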
the quantifier allows for the total ignorance of k1 exceptions (and the partial ignorance of up to k2 exceptions). The quantitative approximate division of relation r(A, X) by s(B) is obtained by:

∀x ∈ project(r, X), µquant-app-div(r, s, A, B)(x) = wi   (11)

with i = card({a | <a> ∈ s and <x, a> ∉ r}) and n = card(s). If the quantifier "almost all" is Boolean, the result obtained is a Boolean one, since then one is completely satisfied for a number of exceptions under (or equal to) k1 = k2. Moreover, formula (11) generalizes formula (2), which is recovered with the universal quantifier characterized by w0 = 1, w1 = … = wn = 0, with which no exception is accepted. The resulting relation t can be shown to be a quotient according to the following characterization formulas:

∀x, x ∈ project(t, X) and µt(x) = d ⇒ s × {d/<x>} ⊆a r   (12a)
∀x, x ∈ project(r, X) and µt(x) = d and d1 > d ⇒ s × {d1/<x>} ⊄a r   (12b)

where the inclusion operator used (⊆a) is an approximate one accounting for the tolerance which took place during the division (see Bosc et al., 2007, for details).

Example 9. Let us take the relations r and s respectively defined over the schemas LessThan3km(#hotel, #site) and Guide(#site) with the following extensions:

r = {<h1, s1>, …, <h1, s8>, <h2, s1>, …, <h2, s9>, <h3, s1>, …, <h3, s10>}
s = {s1, …, s10}
and the quantifier “almost all” is defined as: µalmost all (f) = 0 if f ∈ [0, 0.75], µalmost all (f) = 1 if f ∈ [0.95, 1], µalmost all (f) linearly increasing if f ∈ [0.75, 0.95].
The query looking for hotels located at less than 3 km from almost all sites described in the guide is based on the approximate division of r by s. In r, hotel h1 is associated with 8 sites of s, h2 with 9, and h3 with all of the 10 sites of s. Since the referential has 10 elements, the quantifier allows for somewhat ignoring up to 2 elements (k1 = 0, k2 = 2). So, the result t of the approximate division of r by s using formula (11) is:

µquant-app-div(r, s, A, B)(h1) = 0.25, µquant-app-div(r, s, A, B)(h2) = 0.75, µquant-app-div(r, s, A, B)(h3) = 1,

while that of the regular division involves only h3. Clearly, the Cartesian product of s by t ({0.25/<h1, s1>, …, 0.25/<h1, s10>, 0.75/<h2, s1>, …, 0.75/<h2, s10>, 1/<h3, s1>, …, 1/<h3, s10>}) is not included in r, since some exceptions occurred for h1 and h2, which have been partly ignored. However, one may observe that the two (respectively one) extra elements relative to h1 (respectively h2) have a degree which corresponds to w2 (respectively w1), that is, the ignorance of 2 (respectively 1) elements, and this is the rationale of the approximate inclusion used in formulas (12a) and (12b).♦
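The computation of example 9 can be replayed with a short sketch of formula (11) (our code, illustrative only):

```python
# Formula (11) on the data of example 9: h1, h2, h3 are within 3 km of
# 8, 9 and 10 of the 10 guide sites, respectively.

def almost_all(f):
    if f <= 0.75:
        return 0.0
    if f >= 0.95:
        return 1.0
    return (f - 0.75) / 0.2

sites = ["s%d" % i for i in range(1, 11)]
r = ({("h1", a) for a in sites[:8]} |
     {("h2", a) for a in sites[:9]} |
     {("h3", a) for a in sites})

def quant_app_div(r, s, mu_q):
    n = len(s)
    hotels = sorted({x for (x, _) in r})
    # grade of x is w_i, i being its number of exceptions among s
    return {x: round(mu_q(1 - sum((x, a) not in r for a in s) / n), 2)
            for x in hotels}

print(quant_app_div(r, sites, almost_all))
# {'h1': 0.25, 'h2': 0.75, 'h3': 1.0}
```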
Quantitative Approximate Division of Fuzzy Relations

The objective of the approximate division of fuzzy relations is to answer queries such as: "to what extent does each candidate possess with a medium level almost all the fairly important skills." As in the previous subsection, one considers that the presence of values of the divisor insufficiently (or not at all) connected with x in the dividend r may be more or less compensated by the weakening of the universal quantifier into "almost all." The idea is to ignore to a certain extent the result of a fuzzy implication provided that there is a strictly greater grade issued from the quantifier. Here again, the reasoning is made at a quantitative level, and it is based on a number of more or less acceptable exceptions (to the complete inclusion according
to the type of fuzzy implication used). It follows that the quantitative approximate division of the fuzzy relation r(A, X) by the fuzzy relation s(B) is defined as:

∀x ∈ proj(supp(r), X), µquant-app-div(r, s, A, B)(x) = infi max(αi, wi)   (13)

where αi is the ith smallest implication degree (µs(aj) ⇒f µr(aj, x)) and wi = µalmost all(1 – i/n) is the degree of ignorance issued from the quantifier for relation s, whose cardinality is n; that is, n implication values intervene in formula (13). It is of particular interest to notice that this formula is recovered if one adopts the view according to which one searches for the best k such that k elements of the divisor s are connected with x in the dividend and k is compatible with "almost all." One observes that an insufficiently satisfactory implication value is replaced by the degree of satisfaction corresponding to the number of values ignored so far (according to "almost all"). When wi is 1, total ignorance takes place, whereas if wi is 0, the associated value αi is completely taken into account as such. It appears that degrees of ignorance act inversely with respect to levels of importance (used for instance in the weighted conjunction discussed in Dubois & Prade, 1986). They define degrees of guaranteed satisfaction, that is, if p implication values are ignored, the satisfaction level is at least wp. Furthermore, associating the largest ignorance degree with the smallest implication value, the second largest ignorance degree with the second smallest implication value, and so on, is optimal as to the grade assigned to an element x in the resulting relation. Obviously, the user can choose the fuzzy implication to be used in formula (13) so as to specify the role played by the degrees of the divisor (threshold or importance). Furthermore, if the quantifier Q1 is included in Q2 (according to formula (4)), the result of the approximate division founded on Q1 is included in that of the approximate division based on Q2. In the presence of regular relations, formula
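Formula (13) can be sketched as follows (our code; the implication degrees used are those computed for hotel h1 in example 10 below):

```python
# Formula (13): the i-th smallest implication degree alpha_i is
# lower-bounded by the ignorance degree w_i issued from "almost all".

def almost_all(f):
    if f <= 0.75:
        return 0.0
    if f >= 0.95:
        return 1.0
    return (f - 0.75) / 0.2

def quant_app_div_fuzzy(impl_degrees, mu_q):
    n = len(impl_degrees)
    alphas = sorted(impl_degrees)                   # alpha_1 <= ... <= alpha_n
    w = [mu_q(1 - i / n) for i in range(1, n + 1)]  # w_1 .. w_n
    return min(max(a, wi) for a, wi in zip(alphas, w))

# implication degrees of example 10 for hotel h1 (Goedel implication):
h1 = [0.1, 0.1, 0.2, 0.5, 0.7, 1, 1, 1, 1, 1]
print(quant_app_div_fuzzy(h1, almost_all))  # 0.2
```

Sorting the implication degrees pairs the largest ignorance degrees with the smallest implication values, which the text notes is the optimal association.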
(13) delivers the same result as formula (11), and if the universal quantifier (w0 = 1, w1 = … = wn = 0) is used in formula (13), formula (7) is recovered. Last, it is proved in Bosc et al. (2007) that the result of this division is a quotient provided that: (1) an appropriate approximate Boolean inclusion is used in the characterization formulas, and (2) either the divisor is a normalized relation (i.e., at least one element has the maximal degree 1), or the relaxed quantifier is nonfuzzy.

Example 10. Let r with schema Close(#hotel, #site) and s with schema InterestingSites(#site) be two fuzzy relations. The tuple µ/<st, ht> of relation r gives the extent to which site st is close to hotel ht. Similarly, the tuple µ/<st> expresses the interestingness of site st. One considers the quantitative approximate division of r by s with the quantifier "almost all" of example 9, that is, one looks for hotels more or less close to almost all sites of interest. In this perspective, the degrees issued from the quantifier are once again w1 = 0.75, w2 = 0.25, w3 = … = w10 = 0 if s contains 10 tuples. Using Gödel implication and the following extensions of r and s:

r = {0.1/<…>, 0.1/<…>, 0.2/<…>, 0.5/<…>, 0.7/<…>, 0.9/<…>, 1/<…>, 1/<…>, 0.2/<…>, 0.5/<…>, 0.3/<…>, …}
the result of the division is:

for h1: min(max(0.1, 0.75), max(0.1, 0.25), max(0.2, 0), max(0.5, 0), max(0.7, 0), max(1, 0), …, max(1, 0)) = 0.2
for h2: min(max(0.3, 0.75), max(0.3, 0.25), max(1, 0), …, max(1, 0)) = 0.3.

For h1, the two implication values 0.1 are ignored thanks to w1 and w2. For h2, only the first implication value 0.3 is ignored (thanks to w1). The result of the approximate division of r by s is t = {0.2/<h1>, 0.3/<h2>}, while that of the regular one is t' = {0.1/<h1>, 0.3/<h2>}, strictly included in t.♦
Qualitative Approach

In the previous subsection, exceptions have been dealt with in a quantitative way. In this context, the quantitative inclusion of E in F expresses that "almost all elements of E are included in F according to the chosen implication." An alternative approach is to take a qualitative view and to consider a qualitative inclusion operator expressing "all elements of E are almost included in F according to the chosen implication." Exceptions are then taken into account according to the idea of "almost inclusion," which leads to a qualitative view. In other words, the idea is to (more or less) compensate the initial value of the implication when it expresses a sufficiently "low-level" exception. Intuitively, the idea is to consider exceptions with respect to the inclusion in the following sense: if one looks for the inclusion of E in F, compensation takes place for an element x such that µE(x) and µF(x) are sufficiently close to the situation of inclusion. It seems reasonable to consider that closeness in this situation is a matter of degree rather than based on a crisp boundary. For instance, one may think that if we must be close to a, a ± 0.1 is totally acceptable, a shift beyond 0.3 cannot be tolerated, and the satisfaction is linear in-between. Of course, this does not make sense for regular relations, for which exceptions correspond to the case where µE(x) equals 1 and µF(x) is zero. Due to their specificity, R-implications and S-implications must be considered separately.
Use of an S-Implication

Looking at formula (10), one may observe that any S-implication is all the more satisfied as the antecedent is close to 0 or the conclusion close to 1, since the co-norm generalizes the notion of disjunction. From this, the situation of inclusion can be formulated as follows: the antecedent is close to 0 or the conclusion is close to 1. Then, the qualitative approximate division is expressed similarly to formula (7). One could think that the way compensation acts is similar to what is expressed
by formula (13). It turns out that this does not work, in the sense that it is not possible to characterize the resulting relation in terms of a quotient (this holds regardless of the type of implication used). This is why the qualitative approximate division is defined as:

∀x ∈ project(r, X), µqual-Si-app-div(r, s, A, B)(x) = mina ∈ supp(s) (µs(a) – δ1) ⇒S-i (µr(a, x) + δ2)   (14)

where
δ1 = µs(a) if µs(a) ≤ α, 0 if µs(a) ≥ β, linear in-between,
δ2 = 1 – µr(a, x) if µr(a, x) ≥ 1 – α, 0 if µr(a, x) ≤ 1 – β, linear in-between,

letting α and β be the lower and upper bounds of an interval of [0, 1]. It is easy to see that with such a division, it is possible to characterize the resulting relation as a quotient provided that the characterization accounts for the tolerance introduced in the division. This leads to two conditions like (9a) and (9b), namely:

∀x, x ∈ project(supp(t), X) and µt(x) = d ⇒ s' × {d/<x>} ⊆ r'   (15a)
∀x, x ∈ project(supp(r), X) and µt(x) = d and d1 > d ⇒ s' × {d1/<x>} ⊄ r'   (15b)

where r' and s' are defined as follows:

µr'(u) = 1 if µr(u) ≥ 1 – α, µr(u) if µr(u) ≤ 1 – β, linear in-between,
µs'(u) = 0 if µs(u) ≤ α, µs(u) if µs(u) ≥ β, linear in-between.

Example 11. Let us take relations r and s with respective schemas (A, X) and (B) and their extensions given in Tables 11a and 11b. If the values of the lower and upper bounds are α = 0.1 and β = 0.25, the qualitative approximate division (formula (14)) of r and s with Kleene-Dienes implication returns:
Table 11a. Relation r of example 11

    A     X    µ
    a1    y    0.8
    a2    y    1
    a3    y    0.4

Table 11b. Relation s of example 11

    B     µ
    a1    1
    a2    0.7
    a3    0.08
µqual-Si-app-div(r, s, A, B)(y) = min(1 ⇒K-D (0.8 + 0.033), 0.7 ⇒K-D 1, (0.08 – 0.08) ⇒K-D 0.4) = min(0.833, 1, 1) = 0.833,
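This computation can be checked with a small sketch of formula (14) (our code; α = 0.1, β = 0.25, Kleene-Dienes implication, data of Tables 11a and 11b):

```python
# Qualitative approximate division based on an S-implication, formula (14).
ALPHA, BETA = 0.1, 0.25

def kleene_dienes(p, q):
    return max(1.0 - p, q)

def delta1(p, a=ALPHA, b=BETA):
    # reduction of the divisor grade: p if p <= alpha, 0 if p >= beta
    if p <= a:
        return p
    if p >= b:
        return 0.0
    return a * (b - p) / (b - a)        # linear in-between

def delta2(q, a=ALPHA, b=BETA):
    # enlargement of the dividend grade: 1-q if q >= 1-alpha, 0 if q <= 1-beta
    if q >= 1 - a:
        return 1.0 - q
    if q <= 1 - b:
        return 0.0
    return a * (q - (1 - b)) / (b - a)  # linear in-between

def qual_si_app_div(r, s, x):
    return min(kleene_dienes(mu - delta1(mu),
                             r.get((a, x), 0.0) + delta2(r.get((a, x), 0.0)))
               for a, mu in s.items())

r = {("a1", "y"): 0.8, ("a2", "y"): 1.0, ("a3", "y"): 0.4}  # Table 11a
s = {"a1": 1.0, "a2": 0.7, "a3": 0.08}                      # Table 11b
print(round(qual_si_app_div(r, s, "y"), 3))  # 0.833
```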
while the result of the regular division would assign the grade 0.8 to y. It is easy to check that the Cartesian product of s' ({1/<a1>, 0.7/<a2>}) by the result is included (in Zadeh's sense) in r'.♦
Use of an R-Implication

When an R-implication comes into play, the general idea of qualitative exception (then of compensation) is the same as for S-implications. This is still a matter of intensity of the exceptions. Let us recall that any R-implication is completely satisfied if the conclusion attains the antecedent. Consequently, the intensity of an exception depends on the difference between µs(a) and µr(a, x). Intuitively, if this difference is positive but small enough, that is a low-intensity exception, which is somewhat tolerable. So, the definition of the qualitative approximate division in this case is analogous to formulas (7) and (14) (see Equation 16). The principle adopted here consists in splitting the compensation mechanism between the antecedent and the consequent part of the R-implication used. This general form is motivated by its similarity with what is done for S-implications in formula (14), where in fact both the divisor and the dividend are susceptible to be modified. Special cases are obtained letting δ = δ1, δ2 = 0, or δ = δ2, δ1 = 0; the latter choice is the one suggested in Bosc and Pivert (2006). Expressions similar to formulas (15a) and (15b) can be pointed out in order to characterize the result of this approximate division as a quotient.

Example 12. Let us take the fuzzy relations with the same schemas as in example 11 with the extensions of Tables 12a and 12b. The usual division of these two relations using Gödel implication delivers an empty result (since the tuple <a2, y> is absent from r).
Equation 16.

∀x ∈ project(r, X), µqual-Ri-app-div(r, s, A, B)(x) = mina ∈ supp(s) (µs(a) – δ1) ⇒R-i (µr(a, x) + δ2)   (16)

where δ = δ1 + δ2 = 0 if µs(a) – µr(a, x) ≥ β, µs(a) – µr(a, x) if µs(a) – µr(a, x) ≤ α, linear in-between.
Table 12a. Relation r of example 12

    A     X    µ
    a1    y    0.7
    a3    y    0.4

Table 12b. Relation s of example 12

    B     µ
    a1    1
    a2    0.1
    a3    0.6
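The computation of example 12 can be replayed with a short sketch of Equation (16) with Gödel implication (our code; the compensations δ are taken as they appear in the result given in the text, applied on the dividend side):

```python
# Equation (16) with Goedel implication on the data of Tables 12a/12b.
def godel(p, q):
    return 1.0 if p <= q else q

r = {("a1", "y"): 0.7, ("a3", "y"): 0.4}     # Table 12a
s = {"a1": 1.0, "a2": 0.1, "a3": 0.6}        # Table 12b
delta = {"a1": 0.0, "a2": 0.1, "a3": 0.05}   # compensations used in the text

grade = round(min(godel(s[a], r.get((a, "y"), 0.0) + delta[a])
                  for a in sorted(s)), 2)
print(grade)  # min(0.7, 1, 0.45) = 0.45
```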
t2 = {min(1 ⇒Gö 0.7, 0.1 ⇒Gö (0 + 0.1), 0.6 ⇒Gö (0.4 + 0.05))/<y>} = {0.45/<y>}.

Here (as well as in the previous example), as expected, the result delivered by the approximate division is a superset of that returned by the regular division.♦

Proximity-Based Approximate Division

This approach is suggested in Bosc et al. (2007) in the context of scalar continuous domains, where a pair of inverse operations, dilation and erosion, is defined. The key points of these operators are presented hereafter.

Principle

If one looks at formulas (14) and (16) used for the definition of the qualitative approximate division, it appears that the underlying graded inclusion obeys the following template:

deg(s ⊆app Ωr(x)) ⇔ deg(red(s) ⊆ enl(Ωr(x)))

where ⊆app denotes an approximate inclusion, red(F) is a reduction of F (thus delivering a subset of F), and enl(F) is an enlargement of F (thus delivering a superset of F). One can see the approximate inclusion as a regular inclusion whose arguments (fuzzy sets) are modified. This leads to think of a general family of approximate inclusions based on dilation and erosion operations, provided that these operations have a sound rationale. One such rationale is based on the use of a proximity relation expressing that elements of a domain are (more or less, if the relation is a graded one) close to each other from a semantic viewpoint. The dilation of F is naturally definable in terms of the adjunction of any element of the referential which is close to an element of F. This is the idea put forward in one of the motivating examples of the second section in order to consider that an element <x, a> missing in the dividend can be replaced by another pair <x, b> which is present and such that b is close to a. From this starting point, one can define the erosion as the inverse operation, such that: ero(dil(F)) = dil(ero(F)) = F.

Proximity-Based Dilation and Erosion

A proximity relation (Dubois, Hadjali, & Prade, 2001) is a fuzzy relation R on a scalar domain U, such that:

• ∀u ∈ U, R(u, u) = 1 (reflexivity),
• for any pair (u, v) ∈ U × U, R(u, v) = R(v, u) (symmetry).

The quantity R(u, v) can be viewed as a grade of "approximate equality" of u with v. An absolute proximity relation is an approximate equality relation E which can be modeled by a fuzzy relation of the form: E: U × U → [0, 1], (u, v) → E(u, v) = Z(u − v), which only depends on the value of the difference (u − v). Z, called a tolerance indicator, is a fuzzy interval centered in 0, such that:
• Z(r) = Z(−r), i.e., Z = −Z,
• Z(0) = 1,
• the support of Z is bounded and of the form [−A, A], where A is a positive real number.
In terms of a trapezoidal membership function (t.m.f.), the parameter Z can be expressed by (−z, z, δ, δ) with A = z + δ, and [−z, z] represents the core of Z. From a fuzzy set F on the scalar domain U and the absolute proximity relation E[Z], where Z is a tolerance indicator, it is possible to define a superset of F by dilation and a subset of F by erosion in the following way. The dilation operation delivers the set dil(F), which gathers the elements of F and those outside of F which are somewhat close to an element of F in the sense of the proximity relation E[Z]:

µdil(F, Z)(x) = supy ∈ U ⊤(µF(y), µE[Z](x, y)) = supy ∈ U ⊤(µF(y), µZ(x − y))   (17)
where ⊤ is a triangular norm. The erosion operation is defined as the inverse of the dilation:

µero(F, Z)(x) = infy ∈ U µE[Z](x, y) ⇒⊤ µF(y) = infy ∈ U µZ(x − y) ⇒⊤ µF(y)   (18)
where ⇒⊤ is the R-implication generated by the norm ⊤ used in formula (17). The set ero(F) involves any element of F such that all of its "neighbors," that is, those which are somewhat close to it, are in F.

Example 13. Let us consider the fuzzy set F over the real line represented by the t.m.f. (45, 50, 5, 5). This means that values of the interval [45, 50] are full members of F and that values between 45 and 40 on the one hand and 50 and 55 on the other hand have linearly decreasing membership grades in F. Let us take the proximity relation represented by (−1, 1, 3, 3), which expresses that if |u − v| is at most 1, u and v are completely close to each other and that if |u − v| is between 1 and 4, they are somewhat neighbors. Using the norm minimum in (17) and Gödel implication in (18), one has:

F1 = dil(F, Z) = (44, 51, 8, 8), F2 = ero(F, Z) = (46, 49, 2, 2).

It is easy to check that: ero(F1, Z) = dil(F2, Z) = F.♦
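For trapezoidal membership functions, and with the choices of example 13 (norm min in (17), Gödel implication in (18)), dilation and erosion amount to widening or narrowing the core and the spreads; a sketch (our code, under those assumptions):

```python
# Dilation/erosion of a t.m.f. (a, b, l, r): core [a, b], support
# [a - l, b + r]. With tolerance indicator Z = (-z, z, d, d), the norm min
# and Goedel implication, they shift the core by z and the spreads by d.

def dilate(tmf, z, d):
    a, b, l, r = tmf
    return (a - z, b + z, l + d, r + d)

def erode(tmf, z, d):
    a, b, l, r = tmf
    return (a + z, b - z, l - d, r - d)

F = (45, 50, 5, 5)      # fuzzy set of example 13
print(dilate(F, 1, 3))  # (44, 51, 8, 8) = F1
print(erode(F, 1, 3))   # (46, 49, 2, 2) = F2

# on the real line, erosion inverts dilation:
assert erode(dilate(F, 1, 3), 1, 3) == F == dilate(erode(F, 1, 3), 1, 3)
```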
Towards a Proximity-Based Approximate Division

The concepts of dilation and erosion presented before can serve as a basis for a new kind of tolerant division, provided that (at least) one proximity relation over the domain subject to the division is available. Due to the fact that in practice the domains used for divisions are rarely numeric, it is mandatory to go beyond the case of scalar domains and to work with any type of domain. Moreover, it must be noticed that with discrete domains, the property that dilation and erosion are inverses of each other is generally lost. If r and s are two fuzzy relations of respective schemas R(A, X) and S(B), two basic types of proximity-based approximate divisions of r(A, X) by s(B) can be devised (combinations of them can also be thought of), one where the divisor is eroded:

µprox-app-div1(r, s, A, B)(x) = mina ∈ supp(s) µero(s)(a) ⇒f µr(a, x)   (19)

and the other where the dividend is dilated:

µprox-app-div2(r, s, A, B)(x) = mina ∈ supp(s) µs(a) ⇒f µdil(r)(a, x)   (20)

These two approaches are not equivalent, and they have fairly different semantics. In the first
Versatility of Fuzzy Sets for Modeling Flexible Queries
case, the idea is to keep in the (final) divisor only the values which are "strong representatives" (in the sense that the initial divisor contains all of their close elements according to the proximity relation). The approximate division of r(A, X) by s(B) then looks for the X-values associated in r with all the "strong" members of the divisor. The more demanding the fuzzy implication used in formula (18), the smaller the final divisor obtained, and thus the greater the result of the approximate division. In the second situation, the dividend is extended with all the elements somewhat close (in the sense of the proximity relation) to the values initially present, thus compensating for initially missing elements, which are introduced thanks to the proximity mechanism. The approximate division of r(A, X) by s(B) then looks for the X-values associated in r with all the elements of the divisor or their substitutes. The less demanding the norm used in formula (17), the larger the final dividend obtained, and thus the greater the result of the approximate division. Due to the very nature of formulas (19) and (20), the proximity-based approximate division returns a quotient. In effect, one may observe that this division comes down to a regular division where either the divisor or the dividend is modified. For instance, the result t obtained with formula (19) can be characterized as a quotient since it is obviously maximal and the Cartesian product of ero(s) and any element x of t is included in r, provided that the conjunction used is the generator of the fuzzy implication used. Example 14. Let us consider the proximity relation defined over colors shown in Table 13 and the fuzzy relations: r = {1/
A tuple d/
Table 13. The proximity relation of example 14

             red    carmine    vermillion    orange
red          1      0.7        0.2           0
carmine      0.7    1          0             0
vermillion   0.2    0          1             0.8
orange       0      0          0.8           1
two tuples of s represent reference colors which are of interest for the user. Using Gödel implication, the erosion of s returns the empty relation, whereas if Lukasiewicz implication (p ⇒Lu q = 1 if p ≤ q, 1 – p + q otherwise) is used, the result is: ero(s) = {0.3/carmine, 0.2/orange}.
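The erosion step and formula (19) can be sketched on discrete domains as follows. The proximity degrees are those of Table 13; since the chapter's relations r and s are not fully reproduced here, the divisor s and the dividend r below are invented (s is chosen so that its Lukasiewicz erosion matches the stated result {0.3/carmine, 0.2/orange}).

```python
# Discrete-domain sketch of erosion and of the first proximity-based
# approximate division (formula (19)). Relations r and s are assumed examples.

COLORS = ['red', 'carmine', 'vermillion', 'orange']
PROX = {('red', 'carmine'): 0.7, ('red', 'vermillion'): 0.2,
        ('vermillion', 'orange'): 0.8}          # Table 13, nonzero off-diagonal entries

def prox(a, b):
    # symmetric, reflexive proximity relation
    if a == b:
        return 1.0
    return PROX.get((a, b), PROX.get((b, a), 0.0))

def godel(p, q):
    return 1.0 if p <= q else q

def lukasiewicz(p, q):
    return 1.0 if p <= q else 1.0 - p + q

def erode(s, imp):
    """mu_ero(s)(a) = min_b  prox(a, b) => mu_s(b)."""
    return {a: min(imp(prox(a, b), s.get(b, 0.0)) for b in COLORS) for a in s}

def div1(r, s, imp):
    """Formula (19): divide r by the erosion of s."""
    es = erode(s, imp)
    xs = {x for (_, x) in r}
    return {x: min(imp(es[a], r.get((a, x), 0.0)) for a in es) for x in xs}

s = {'carmine': 1.0, 'orange': 1.0}                       # assumed divisor
r = {('carmine', 'x'): 0.8, ('orange', 'x'): 1.0,         # assumed dividend
     ('carmine', 'y'): 0.4}
```

With Gödel implication, erode(s) comes out with all grades 0 (the empty relation), as stated in the example; with Lukasiewicz, it is {0.3/carmine, 0.2/orange}, so x fully satisfies the division while y satisfies it only partially.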
The dilation of r with the norm minimum (the largest one) yields: dil(r) = {1/
The proximity-based approximate division of r by the previous erosion of s with Gödel (respectively Lukasiewicz) implication in formula (19) delivers an empty result (respectively t1 = {0.8/
Conclusion

In this chapter, the use of fuzzy sets for modeling flexible queries has been advocated. After a brief overview of the algebraic framework that can be used to this end and of an SQL-like query language, the focus has been put on the division operation. This operator has the advantage of being nontrivial, thus opening the door to a variety of extensions intended to overcome the limitations sometimes observed when regular queries are used, in particular the "all or nothing" effect. Beyond the case where the operand relations become fuzzy due to the use of fuzzy predicates to define the divisor and/or the dividend, diverse types of approximate division operators have been envisaged. The key idea is to introduce some tolerance with respect to the usual crisp requirement of connection between the set of values associated with a value of the dividend and the set of values present in the divisor. One line is to allow for exceptions, either in a quantitative or a qualitative way. Another approach is to introduce a proximity relation used to erode the divisor or to dilate the dividend.

Throughout the chapter, particular attention has been paid to whether the result produced by the various families of extended divisions qualifies as a quotient. This property has been sought in a reasonable way, in the sense that the characterization formulas remain close to the usual ones and merely account for the changes made in the division. Furthermore, the fact that this property is not satisfied has led us to reject some lines of extension. Future research mainly concerns: (1) the implementation of these operators and the measurement of their performance compared with the usual division, and (2) the design of new semantically founded extensions, for instance, mixing quantitative and qualitative approaches to exceptions or refining the erosion and dilation mechanisms.
References

Börzsönyi, S., Kossmann, D., & Stocker, K. (2001). The Skyline operator. Paper presented at the 17th International Conference on Data Engineering (pp. 421-430).

Bosc, P., Dubois, D., Pivert, O., & Prade, H. (1997). Flexible queries in relational databases: The example of the division operator. Theoretical Computer Science, 171, 281-301.

Bosc, P., Hadjali, A., & Pivert, O. (2007). On a proximity-based tolerant inclusion. Paper presented at the Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT'07).

Bosc, P., & Liétard, L. (2004). Non monotonic aggregates applying to fuzzy sets in flexible querying. Paper presented at the Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (pp. 11-18).

Bosc, P., Liétard, L., & Pivert, O. (2003). Sugeno fuzzy integral as a basis for the interpretation of flexible queries involving monotonic aggregates. Information Processing and Management, 39, 287-306.

Bosc, P., & Pivert, O. (1992). Some approaches to relational databases flexible querying. Journal of Intelligent Information Systems, 1, 323-354.

Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.

Bosc, P., & Pivert, O. (2006). On a qualitative approximate inclusion: Application to the division of fuzzy relations. Paper presented at the International Workshop on Flexible Database and Information Systems Technology (FlexDBIST'06) (pp. 430-434).

Bosc, P., Pivert, O., & Rocacher, D. (2007). About quotient and division of crisp and fuzzy relations. Journal of Intelligent Information Systems, 29, 185-210.

Bouchon-Meunier, B., & Yao, J. (1992). Linguistic modifiers and imprecise categories. International Journal of Intelligent Systems, 7, 25-36.

Bruno, N., Chaudhuri, S., & Gravano, L. (2002). Top-k selection queries over relational databases: Mapping strategies and performance evaluation. ACM Transactions on Database Systems, 27, 153-187.

Chang, C. L. (1982). Decision support in an imperfect world (Research Rep. No. RJ3421). San José, CA: IBM.

Chomicki, J. (2003). Preference formulas in relational queries. ACM Transactions on Database Systems, 28, 427-466.

Cubero, J. C., Medina, J. M., Pons, O., & Vila, M. A. (1994). The generalized selection: An alternative way for the quotient operations in fuzzy relational databases. Paper presented at the Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (pp. 23-30).

Dubois, D., Hadjali, A., & Prade, H. (2001). Fuzzy qualitative reasoning with words. In P. P. Wang (Ed.), Computing with words (vol. 3, pp. 347-366). John Wiley & Sons.

Dubois, D., Nakata, M., & Prade, H. (2000). Extended divisions for flexible queries in relational databases. In O. Pons, M. A. Vila, & J. Kacprzyk (Eds.), Knowledge management in fuzzy databases (pp. 105-121). Physica-Verlag.

Dubois, D., & Prade, H. (1984). A theorem on implication functions defined from triangular norms. Stochastica, 8, 267-279. Also in D. Dubois, H. Prade, & R. R. Yager (Eds.). (1993). Readings in fuzzy sets for intelligent systems (pp. 105-112). Morgan Kaufmann.

Dubois, D., & Prade, H. (1986). Weighted minimum and maximum operations in fuzzy set theory. Information Sciences, 39, 205-210.

Dubois, D., & Prade, H. (1996). Semantics of quotient operators in fuzzy relational databases. Fuzzy Sets and Systems, 78, 89-94.

Fodor, J., & Yager, R. R. (1999). Fuzzy-set theoretic operators and quantifiers. In D. Dubois & H. Prade (Eds.), Fundamentals of fuzzy sets: The handbook of fuzzy sets series (pp. 125-193). Kluwer Academic Publishers.

Friedman, J. H., Baskett, F., & Shustek, L. J. (1975). An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 1001-1006.

Galindo, J., Medina, J. M., Cubero, J. C., & Garcia, M. T. (2001). Relaxing the universal quantifier of the division in fuzzy relational databases. International Journal of Intelligent Systems, 16, 713-742.

Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Idea Group Publishing.

Ichikawa, T., & Hirakawa, M. (1986). ARES: A relational database with the capability of performing flexible interpretation of queries. IEEE Transactions on Software Engineering, 12, 624-634.

Kacprzyk, J., & Ziolkowski, A. (1986). Database queries with fuzzy linguistic quantifiers. IEEE Transactions on Systems, Man and Cybernetics, 16, 474-478.

Kerre, E. E., & Liu, Y. (1998). An overview of fuzzy quantifiers: Interpretations. Fuzzy Sets and Systems, 95, 1-22.

Kießling, W., & Köstler, G. (2002). Preference SQL: Design, implementation, experiences. Paper presented at the 28th Conference on Very Large Data Bases (pp. 990-1001).

Lacroix, M., & Lavency, P. (1987). Preferences: Putting more knowledge into queries. Paper presented at the 13th Conference on Very Large Data Bases (pp. 217-225).

Motro, A. (1988). VAGUE: A user interface to relational databases that permits vague queries. ACM Transactions on Office Automation Systems, 6, 187-214.

Mouaddib, N. (1993). The nuanced relational division. Paper presented at the 2nd IEEE International Conference on Fuzzy Systems (pp. 419-424).

Sugeno, M. (1974). Theory of fuzzy integrals and its applications. Doctoral thesis, Tokyo Institute of Technology.

Tahani, V. (1977). A conceptual framework for fuzzy query processing: A step toward very intelligent database systems. Information Processing and Management, 13, 289-303.

Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy-relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-28.

Yager, R. R. (1988). On ordered weighted averaging operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man and Cybernetics, 18, 183-190.

Yager, R. R. (1991). Fuzzy quotient operators for fuzzy relational databases. Paper presented at the International Fuzzy Engineering Symposium (pp. 289-296).

Zadeh, L. A. (1983). A computational approach to fuzzy quantifiers in natural languages. Computer Mathematics with Applications, 9, 149-183.
Key Terms

Approximate Division: Extended version of the division where some idea of tolerance is introduced.
Flexible Querying: Approach where users include preferences in their queries so as to get a result made of discriminated elements.

Fuzzy Implication: Operator generalizing the usual material implication, whose arguments and result are valued in the unit interval [0, 1].

Fuzzy Querying: Fuzzy set-based querying approach where each element of the result is assigned a degree of satisfaction valued in the unit interval [0, 1].

Fuzzy Relation: Relation whose members have a grade of membership expressing the extent to which they comply with the concept conveyed by the relation.

Proximity Relation: Binary relation expressing the extent to which two values are approximately equal.

Regular Relational Database: Database where information is precise and modeled in a relational way.

Relational Division: Binary operation whose arguments are relations with a common attribute and which delivers a result having the property of a quotient.
Chapter VII
Flexible Querying Techniques Based on CBR

Guy de Tré, Ghent University, Belgium
Marysa Demoor, Ghent University, Belgium
Bert Callens, Ghent University, Belgium
Lise Gosseye, Ghent University, Belgium
Abstract

In case-based reasoning (CBR), a new untreated case is compared to cases that have been treated earlier, after which data from the similar cases (if found) are used to predict the corresponding unknown data values for the new case. Because case comparisons will seldom result in an exact-similarity matching of cases and the conventional CBR approaches do not efficiently deal with such imperfections, more advanced approaches that adequately cope with these imperfections can help to enhance CBR. Moreover, CBR in its turn can be used to enhance flexible querying. In this chapter, we describe how fuzzy set theory can be used to model a gradation in similarity of the cases and how the inevitable uncertainty that occurs when predictions are made can be handled using possibility theory, resulting in what we call flexible CBR. Furthermore, we present how and under which conditions flexible CBR can be used to enhance flexible querying of regular databases.
Introduction

In case-based reasoning (CBR), knowledge is deduced from the characteristics of a collection of past cases, rather than induced from a set of knowledge rules that are stored in a knowledge
base. In this way, CBR can be applied to find the solution of a given problem on the basis of the known solutions to similar problems. Of course, this can only be done if it holds that: ‘Similar problems have similar solutions’
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
This statement is the underlying hypothesis of CBR (Aamodt & Plaza, 1994). Without this hypothesis, CBR cannot be used. Starting from this hypothesis, the problem of finding the solution or outcome of a new case is solved by matching the characteristics of this case against those of similar cases. The result of this matching process then allows predicting the solution or outcome. At a later stage, when more information becomes available, the predictions could be tested as to their correctness, and if necessary, the process to find similar cases could be revised. The hypothesis thus allows solving problems without necessitating the explicit modeling of expert knowledge because in CBR this knowledge is implicit in the solution. CBR techniques have been successfully applied in many application fields. Under specific circumstances, they can also be applied to enhance and enrich database querying (de Calmès, Dubois, Hüllermeier, Prade, & Sedes, 2003; Ellman, 1995; Shimazu, Kitano, & Shibata, 1993). This chapter deals with the application of CBR in order to enhance the querying and accessibility of regular databases. A precondition for applying CBR techniques in database querying is that the database must contain comparable descriptions of real cases that all relate to the same topic and have common characteristics. In the remainder of the chapter, it is, without a loss of generality, assumed that each case description consists of attribute values that each describe a characteristic of the case. For new cases, some of the attribute values might be unknown. This can, for example, be due to the fact that the case was not completely handled/described at the time when it was initially stored in the database.
If the underlying hypothesis of CBR holds for the problem of value prediction for an attribute, CBR techniques can also be used to predict the unknown values for that attribute. This capability to predict unknown attribute values can be fully exploited to enhance flexible querying of both regular and fuzzy databases. If users are looking for case descriptions for which an attribute takes a given value or has a value that is within a given range of values, then value
prediction also allows finding those cases which might have the requested value in the future. Of course, this kind of flexibility requires an enriched querying mechanism that allows the modeling of uncertain query results (uncertain because they are obtained from a prediction). Because similarity between two cases is rarely a matter of all or nothing, but rather a matter of degree, such an enriched CBR mechanism or CBR-based querying mechanism should also support the modeling of imprecision stemming from case comparison. Fuzzy set theory (Dubois & Prade, 2000; Pedrycz & Gomide, 1998; Zadeh, 1965) can be used to model such kinds of imprecision. This is especially the case because there is a close connection between fuzzy-set based approximate reasoning and the underlying inference principle of CBR (Dubois, Esteva, Garcia, Godo, Lopez de Mantaras, & Prade, 1998; Yager, 1997). Moreover, using fuzzy set theory also has the advantage that the related possibility theory (Dubois & Prade, 1988; Zadeh, 1978) can be used to model the uncertainty that is inherent to prediction (Dubois, Hüllermeier, & Prade, 2000). In this chapter, we describe an enhanced CBR-based approach for flexible querying of regular databases. The approach is based on fuzzy set theory and possibility theory and enables the prediction of unknown attribute values of case descriptions that are inserted in a database system, on condition that the underlying hypothesis of CBR holds in the context of the prediction problem. As an example, consider a regular relational database in which information about juridical complaints, as registered by lawyers after interaction with the aggrieved party, is stored. In a juridical context, it holds that 'similar complaints must be dealt with in a similar way' and thus, by consequence, the underlying hypothesis of CBR also holds.
The user has to select the 'descriptive' attributes on which the case comparison will be based. In the complaint database, the 'descriptive' attributes could be the attributes that are used to classify the complaint, the age of the victim, the gender of the victim, the address of the victim, the job of the victim, the marital status of the victim, and so forth. The values of the descriptive attributes will be used in the search process for similar cases. The approach itself is also flexible because a flexible similarity range is associated with each 'descriptive' attribute at the initialization of the approach. Such a similarity range restricts the domain values that are considered to be close enough to an attribute value. For example, a range '±3' could be associated with the 'descriptive' attribute representing the age of the victim. This means that all cases for which the age differs by not more than 3 years from the age in the case under consideration will be considered as having a similar attribute value for 'age of the victim.' Furthermore, a weight is associated with each 'descriptive' attribute. Together, these weights indicate the relative importance of the attributes within the comparison process. The attributes 'classification' and 'gender of the victim' could be modeled so that they are more important than the attribute 'age of the victim' for case comparison purposes. Based on the results of the case comparison process, the prediction process predicts the selected 'predictable' unknown attribute values of the case under consideration. In the example, a 'predictable' attribute could be the attribute that describes the judicial decision. Finally, there is some revision mechanism that compares the predicted values with the actual values when these become available. Revision information is then used to relax or to strengthen the similarity ranges in order to improve predictions for new cases. Although all phases of our approach, namely, case description, case comparison, prediction, and revision, are described in the chapter, the main focus is on the case comparison phase, as this is the phase where our approach mainly distinguishes itself from other approaches. The remainder of the chapter is organized as follows.
In the next section, some preliminaries are set. A short overview of related work and the current state of the art is given. In A Flexible CBR Approach for Information Retrieval section, we describe the different phases and aspects of our CBR
approach. In the Enhancing Flexible Database Querying section, we describe how flexible database querying can be extended with CBR techniques. In A Real-World Application: The Gender Claim Database section, the practical usefulness of our approach is illustrated by means of a brief description of a real-world application. Thus, attention is paid to the description of the application field, the problem description, and the proposed solution to the problem. Finally, the achieved results are summarized, some conclusions are given, and some ideas for future research are discussed.
Some Preliminaries

Current State of the Art and Related Work

CBR is a methodology which solves new problems by investigating, adapting, and reusing solutions to previously solved, similar problems. As such, CBR has been applied in a wide range of real-world applications in fields such as text retrieval, health sciences, system diagnosis, Web searches, and database querying. Of special interest with respect to the work presented in this chapter is the initial work on the enhancement of database querying, which is, for example, proposed by Shimazu et al. (1993), where the tuples in the result set of a relational database query are ordered on the basis of their similarity with a given target tuple, and by Ellman (1995), where CBR techniques are applied in an object-oriented telecommunication service database to find service objects that match users' service requirements. Most of the applications, especially database applications, deal with real-world information which is often imperfect due to, for example, imprecision, uncertainty, and incompleteness. As a consequence, the search for similar problems in CBR often needs to cope with imperfections. Recently, the CBR community has recognised that advanced techniques for efficiently handling imperfections will be beneficial for the development of better CBR methods (Richter, 2006). Work has already been done studying the use of Dempster-Shafer theory (Richter, 1995) and the use of probability theory (Faltings, 1997) in CBR. Furthermore, some pioneering work on the applicability of fuzzy set theory for dealing with imperfections in CBR has been presented (Dubois et al., 1998; Plaza, Esteva, Garcia, Godo, & López de Màntaras, 1996; Yager, 1997). In Dubois et al. (2000) and de Calmès et al. (2003), an approach to enhance flexible database querying with CBR techniques is presented. Here, fuzzy set theory is used to handle imprecision, and possibility theory is used to model uncertainty. The work presented in this chapter builds further on these results by refining the CBR process and using extended possibilistic truth values (EPTVs) instead of possibility and necessity measures to model uncertainty in the result sets of flexible database queries. The latter allows for explicitly modeling the situations where cases exist in which some attribute values are missing because they do not exist or apply.
EPTVs

The concept 'extended possibilistic truth value' is defined as an extension of the concept 'possibilistic truth value' (PTV), which was originally introduced by Prade (1982) and further developed in de Cooman (1995) and de Cooman (1999). EPTVs provide an epistemological representation of the truth of a proposition, which allows reflecting knowledge about the actual truth. Their semantics is defined in terms of a possibility distribution (de Tré, 2002). In the remainder of this subsection, we describe the basics of EPTVs because these are used as the underlying framework for flexible querying in the section titled Enhancing Flexible Database Querying.

Consider the three truth values 'T' (true), 'F' (false), and '⊥' (undefined). With the understanding that $P$ represents the universe of all propositions and $\tilde{\wp}(I^*)$ denotes the set of all ordinary fuzzy sets in the universe $I^* = \{T, F, \bot\}$, the EPTV $\tilde{t}^*(p)$ of a proposition $p \in P$ is formally defined by the mapping:

$$\tilde{t}^*: P \to \tilde{\wp}(I^*) \qquad (1)$$

which associates a fuzzy set $\tilde{t}^*(p)$ with each $p \in P$. The semantics of the associated fuzzy set $\tilde{t}^*(p)$ are defined in terms of a possibility distribution, with the understanding that:

$$t^*: P \to I^* \qquad (2)$$

is the mapping which associates the value T with p if p is true, the value F with p if p is false, and the value ⊥ with p if (some of) the elements of p are not applicable, undefined, or not supplied. This means that:

$$(\forall x \in I^*)(\pi_{t^*(p)}(x) = \mu_{\tilde{t}^*(p)}(x)), \text{ that is, } (\forall p \in P)(\pi_{t^*(p)} = \tilde{t}^*(p)) \qquad (3)$$

where $\pi_{t^*(p)}(x)$ denotes the possibility that the value of $t^*(p)$ conforms to $x$, and $\mu_{\tilde{t}^*(p)}(x)$ is the membership grade of $x$ within the fuzzy set $\tilde{t}^*(p)$. In general, an EPTV has the following format:

$$\tilde{t}^*(p) = \{(T, \mu_{\tilde{t}^*(p)}(T)), (F, \mu_{\tilde{t}^*(p)}(F)), (\bot, \mu_{\tilde{t}^*(p)}(\bot))\} \qquad (4)$$

Hereby $\mu_{\tilde{t}^*(p)}(T)$ denotes the possibility that p is true, $\mu_{\tilde{t}^*(p)}(F)$ the possibility that p is false, and $\mu_{\tilde{t}^*(p)}(\bot)$ the possibility that some elements of p are not applicable, undefined, or not supplied. An EPTV $\tilde{t}^*(p)$ is normalized if at least one of the membership grades $\mu_{\tilde{t}^*(p)}(T)$, $\mu_{\tilde{t}^*(p)}(F)$, and $\mu_{\tilde{t}^*(p)}(\bot)$ is equal to 1. Due to the possibilistic interpretation, normalization implies that at least one of the considered truth values should be completely possible. Special cases of EPTVs are presented in Table 1:
Table 1. Special cases of EPTVs

  $\tilde{t}^*(p)$                  Interpretation
  {(T, 1)}                          p is true
  {(F, 1)}                          p is false
  {(T, 1), (F, 1)}                  p is unknown
  {(⊥, 1)}                          p is inapplicable
  {(T, 1), (F, 1), (⊥, 1)}          information about p is not available
These cases are verified as follows:

• If it is completely possible that the proposition is true and no other truth values are possible, then it means that the proposition is true.
• If it is completely possible that the proposition is false and no other truth values are possible, then it means that the proposition is false.
• If it is completely possible that the proposition is true, it is completely possible that the proposition is false, and it is not possible that the proposition is inapplicable, then it means that the proposition is applicable, but unknown. This truth value will shortly be called 'unknown.'
• If it is completely possible that the proposition is inapplicable and no other truth values are possible, then it means that the proposition is inapplicable.
• If all truth values are completely possible, then this means that no information about the truth of the proposition is available. The proposition might be inapplicable, but might also be true or false. This truth value will shortly be called 'unavailable.'
This interpretation and verification is in accordance with the findings of Umano and Fukami (1994).
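The rules for NOT, AND, and OR given below can be derived mechanically. The sketch that follows uses an assumed representation of EPTVs as dictionaries mapping the truth values to possibility grades, applies Zadeh's extension principle to the strong Kleene operators, and reproduces, for instance, the collapse of 'unknown AND false' to 'false.'

```python
# EPTV connectives via the extension principle over strong Kleene logic
# (illustrative representation; dictionaries map 'T'/'F'/'⊥' to grades).

def kleene_and(x, y):
    if 'F' in (x, y):
        return 'F'
    return '⊥' if '⊥' in (x, y) else 'T'

def kleene_or(x, y):
    if 'T' in (x, y):
        return 'T'
    return '⊥' if '⊥' in (x, y) else 'F'

def kleene_not(x):
    return {'T': 'F', 'F': 'T', '⊥': '⊥'}[x]

I = ('T', 'F', '⊥')

def ext_unary(op, v):
    """Extension principle for a unary Kleene operator."""
    out = {w: 0.0 for w in I}
    for x in I:
        out[op(x)] = max(out[op(x)], v.get(x, 0.0))
    return out

def ext_binary(op, u, v):
    """Extension principle for a binary Kleene operator (min for conjunction of grades)."""
    out = {w: 0.0 for w in I}
    for x in I:
        for y in I:
            out[op(x, y)] = max(out[op(x, y)], min(u.get(x, 0.0), v.get(y, 0.0)))
    return out

unknown = {'T': 1.0, 'F': 1.0}     # the 'unknown' EPTV of Table 1
true_ = {'T': 1.0}
false_ = {'F': 1.0}
```

Evaluating `ext_binary(kleene_and, unknown, false_)` yields the EPTV of a false proposition, matching the truth-functional behaviour described in the text.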
New propositions can be constructed from existing propositions using logical operators. A unary operator '$\tilde{\neg}$' is provided for the negation (NOT) of a proposition, and binary operators '$\tilde{\wedge}$,' '$\tilde{\vee}$,' '$\tilde{\Rightarrow}$,' and '$\tilde{\Leftrightarrow}$' are respectively provided for the conjunction (AND), disjunction (OR), implication (IF THEN), and equivalence (IFF) of propositions. The arithmetic rules to calculate the EPTV of a composite proposition and the algebraic properties of EPTVs are presented in de Tré (2002). The rules for negation, conjunction, and disjunction can be summarized as:

• Rule for negation: $\forall p \in P: \tilde{t}^*(\text{NOT } p) = \tilde{\neg}(\tilde{t}^*(p))$, where

$$\tilde{\neg}: \tilde{\wp}(I^*) \to \tilde{\wp}(I^*): \tilde{V} \mapsto \tilde{\neg}(\tilde{V}) \qquad (5)$$

is defined by
  o $\mu_{\tilde{\neg}(\tilde{V})}(T) = \mu_{\tilde{V}}(F)$
  o $\mu_{\tilde{\neg}(\tilde{V})}(F) = \mu_{\tilde{V}}(T)$
  o $\mu_{\tilde{\neg}(\tilde{V})}(\bot) = \mu_{\tilde{V}}(\bot)$

• Rule for conjunction: $\forall p, q \in P: \tilde{t}^*(p \text{ AND } q) = \tilde{t}^*(p) \,\tilde{\wedge}\, \tilde{t}^*(q)$, where

$$\tilde{\wedge}: \tilde{\wp}(I^*) \times \tilde{\wp}(I^*) \to \tilde{\wp}(I^*): (\tilde{U}, \tilde{V}) \mapsto \tilde{U} \tilde{\wedge} \tilde{V} \qquad (6)$$

is defined by
  o $\mu_{\tilde{U}\tilde{\wedge}\tilde{V}}(T) = \min(\mu_{\tilde{U}}(T), \mu_{\tilde{V}}(T))$
  o $\mu_{\tilde{U}\tilde{\wedge}\tilde{V}}(F) = \max(\min(\mu_{\tilde{U}}(F), \mu_{\tilde{V}}(T)), \min(\mu_{\tilde{U}}(F), \mu_{\tilde{V}}(F)), \min(\mu_{\tilde{U}}(F), \mu_{\tilde{V}}(\bot)), \min(\mu_{\tilde{U}}(T), \mu_{\tilde{V}}(F)), \min(\mu_{\tilde{U}}(\bot), \mu_{\tilde{V}}(F)))$
  o $\mu_{\tilde{U}\tilde{\wedge}\tilde{V}}(\bot) = \max(\min(\mu_{\tilde{U}}(T), \mu_{\tilde{V}}(\bot)), \min(\mu_{\tilde{U}}(\bot), \mu_{\tilde{V}}(T)), \min(\mu_{\tilde{U}}(\bot), \mu_{\tilde{V}}(\bot)))$

• Rule for disjunction: $\forall p, q \in P: \tilde{t}^*(p \text{ OR } q) = \tilde{t}^*(p) \,\tilde{\vee}\, \tilde{t}^*(q)$, where

$$\tilde{\vee}: \tilde{\wp}(I^*) \times \tilde{\wp}(I^*) \to \tilde{\wp}(I^*): (\tilde{U}, \tilde{V}) \mapsto \tilde{U} \tilde{\vee} \tilde{V} \qquad (7)$$

is defined by
  o $\mu_{\tilde{U}\tilde{\vee}\tilde{V}}(T) = \max(\min(\mu_{\tilde{U}}(T), \mu_{\tilde{V}}(T)), \min(\mu_{\tilde{U}}(T), \mu_{\tilde{V}}(F)), \min(\mu_{\tilde{U}}(T), \mu_{\tilde{V}}(\bot)), \min(\mu_{\tilde{U}}(F), \mu_{\tilde{V}}(T)), \min(\mu_{\tilde{U}}(\bot), \mu_{\tilde{V}}(T)))$
  o $\mu_{\tilde{U}\tilde{\vee}\tilde{V}}(F) = \min(\mu_{\tilde{U}}(F), \mu_{\tilde{V}}(F))$
  o $\mu_{\tilde{U}\tilde{\vee}\tilde{V}}(\bot) = \max(\min(\mu_{\tilde{U}}(F), \mu_{\tilde{V}}(\bot)), \min(\mu_{\tilde{U}}(\bot), \mu_{\tilde{V}}(F)), \min(\mu_{\tilde{U}}(\bot), \mu_{\tilde{V}}(\bot)))$

These rules are obtained by applying Zadeh's (1975) extension principle to the operators of the strong three-valued Kleene logic (Resher, 1969). Kleene logics are truth-functional, which means that according to these systems, the behaviour of a logical operator is mirrored in a logical function combining Kleene truth values. Therefore, the extended truth value of every composed proposition can be calculated as a function of the extended truth values of its original propositions.

A Flexible CBR Approach for Information Retrieval

Like most CBR methodologies, the flexible CBR approach for information retrieval presented in this chapter consists of four main processes: case description, case comparison, prediction, and revision. These processes are all depicted in Figure 1 and briefly introduced below.

• Case description. Here, the so-called 'descriptive' attributes of the case descriptions are identified and their associated weights are determined. These are the attributes that are identified by the user as being relevant for the case comparison process. For each 'descriptive' attribute, a specification of the range of values that are considered to be compatible with the actual attribute value (similarity range) is given. To start, the user must provide the necessary initialization parameter values. Furthermore, the 'predictable' attributes must also be identified. These are the attributes for which no data are available in new cases and for which the data will be predicted by the prediction process.
• Case comparison. This process requires the description of a new case and is responsible for the retrieval of the relevant 'descriptive' attribute values of similar cases in the database. Here, the weights and similarity ranges provided by the case description process are applied: the database is queried for cases with
attribute values that are within the accepted similarity ranges. Next, the (global) similarity of each of the retrieved cases is calculated (aggregated), using the provided weights. Finally, the fuzzy set of similar cases is built on the basis of the global similarities. If no similar cases are found, this is communicated to the revision process.
• Prediction. Based on the data in the fuzzy set of similar cases, a prediction model is built for each of the 'predictable' attribute values. Each prediction model represents the predicted approximation of the future value of its associated 'predictable' attribute. These models are forwarded to the user and to the revision process.
• Revision. This process is activated when no similar cases are found in the case comparison or when the actual values for the attributes involved in the prediction process become available. The latter typically occurs when the case has been further processed by the users and the new data have been entered in the database. All extra information is processed. Eventually, a request to modify the parameter settings (similarity ranges and weights) is generated and sent to the case description process.

Figure 1. Processes involved in the presented CBR approach for information retrieval
[Diagram: the case description process, fed by initialization parameters and by revision, supplies weights and similarity ranges to case comparison; the input of a new case goes to case comparison, which queries the database and passes the similar cases to prediction; prediction outputs the predicted values; revision receives the actual values when they become available.]

In the following subsections, each of these processes is described in more detail and illustrated with some examples.

Case Description
Without loss of generality, we consider that the case database is a relational database (Codd, 1970), which means that its database schema consists of a finite number r of relations:
R_i(A_{i1}:T_{i1}, A_{i2}:T_{i2}, ..., A_{im_i}:T_{im_i}), 1 ≤ i ≤ r   (8)
where A_{ij}:T_{ij}, 1 ≤ j ≤ m_i are the attributes of the relation R_i, with A_{ij} being the attribute's name and T_{ij} being the data type of the attribute. The domain dom_{T_{ij}} of the data type T_{ij} defines the allowed values for the attribute. To simplify the notations, it is assumed that each attribute name A_{ij} is unique within the database. The relations R_i, 1 ≤ i ≤ r are interrelated with each other via foreign keys and together contain all the data in the case database.

With respect to the example of a relational database for juridical complaints that was briefly introduced in the introduction, the relation schemes in Exhibit 1 could, among others, be considered. The primary keys of the relations are respectively {ComplaintID} and {PersonID}. Relation 'Complaint' has two foreign keys, {VictimID} and {SuspectedPersonID}, that both refer to relation 'Person.'

For the case description process, the user has to select a finite number of attributes from the relations of the database scheme. These attributes must allow the system to identify and describe a case, and all data for these attributes must be available for the new case that will be used as input for the case comparison process. To distinguish them from the other attributes in the relations of the database, they are in this chapter called the 'descriptive' attributes of the case. A 'descriptive' attribute is either a regular attribute stored in the database or a derived attribute, whose value can be calculated from the regular attributes. For derived attributes, the calculation method for the values must be
Exhibit 1.
Complaint(ComplaintID:varchar, Category:varchar, Victim:varchar, SuspectedPerson:varchar, RegistrationDate:date, JudicialDecision:varchar, DecisionDate:date)
Person(PersonID:varchar, Name:varchar, Birthdate:date, Gender:char(1), Address:varchar, Job:varchar, MaritalStatus:varchar)
Exhibit 2. D = {Category:varchar, VictimAge: year(Registration) – year(Birthdate), Gender:char(1), Job:varchar, MaritalStatus:varchar}
Exhibit 3. P = {JudicialDecision:varchar, Duration: DecisionDate – RegistrationDate}
specified. After the identification, a finite set D of 'descriptive' attributes is obtained:

D = {A_1:e_1, A_2:e_2, ..., A_m:e_m}   (9)
where for each attribute A_j:e_j, 1 ≤ j ≤ m, A_j is the attribute's name and e_j either denotes the data type of the attribute or the expression to calculate the attribute's value. For the example database for juridical complaints, a possible choice for the set D could be as shown in Exhibit 2.

Additionally, the user must also indicate the attributes for which the future values must be predicted. This results in a finite set of 'predictable' attributes:

P = {A'_1:e'_1, A'_2:e'_2, ..., A'_p:e'_p}   (10)
As is the case with the 'descriptive' attributes, the 'predictable' attributes could also be derived. For example, the time it took to treat and finish a case can be irrelevant to the description of the case itself, but it might be very useful for predicting the possible duration of new cases. If this is the case,
a derived ‘predictable’ attribute ‘duration’ could be considered for which the values are obtained by calculating the time distance between the moment that a new case is entered in the database and the moment that the case is assigned a status ‘finished.’ For the example database for juridical complaints, a possible choice for the set P could be as shown in Exhibit 3.
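As a small illustration (a sketch of our own, not code from the chapter), such a derived 'Duration' value can be computed from the two date attributes:

```python
from datetime import date

def duration_days(registration: date, decision: date) -> int:
    """Derived 'Duration' attribute: DecisionDate - RegistrationDate,
    expressed as a number of days."""
    return (decision - registration).days

# A complaint registered on 2005-01-10 and decided on 2005-11-06:
duration_days(date(2005, 1, 10), date(2005, 11, 6))  # 300 days
```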
Similarity Ranges

For each attribute A:e ∈ D, the user must provide an initial similarity range Range_A:e. This range defines the acceptable values for the attribute, which are considered to be similar to the actual value v_A^new of the attribute in the description of the new case C^new. Clearly, all the allowed values must be elements of the domain dom_e of the data type of e. For a regular attribute A, this is the associated data type of A; for a derived attribute A, this is the data type of the result of the evaluation of the expression e. Because, in general, some of these allowed values could be considered more similar to v_A^new than others, the similarity ranges will be
defined by means of fuzzy set theory. As v_A^new is usually unknown at the moment of initialisation, it is important to provide relative range definitions, from which the absolute range can be derived at the moment that v_A^new is given. For the range determination, different situations are distinguished, depending on the data type of e:

• If e has an alphanumeric type or an enumeration type, the similarity range Range_A:e is defined as a fuzzy similarity relation over dom_e, that is:

Range_A:e : dom_e × dom_e → [0,1]   (11)

which satisfies the following properties:

  o Reflexivity. ∀x ∈ dom_e : Range_A:e(x, x) = 1.   (12)
  o Symmetry. ∀x, y ∈ dom_e : Range_A:e(x, y) = Range_A:e(y, x).   (13)
  o Transitivity. ∀x, z ∈ dom_e : Range_A:e(x, z) ≥ max{min(Range_A:e(x, y), Range_A:e(y, z)) | y ∈ dom_e}.   (14)

Hereby, the grade Range_A:e(x, y) denotes the degree of similarity between the domain values x and y (Plaza et al., 1996).

• If e has a numeric type, the similarity range Range_A:e is defined by a fuzzy set over the domain dom_e. For the sake of simplicity, only trapezoidal membership functions are considered. Such membership functions are determined by four parameters a, b, c, and d, where [b, c] defines the core of the membership function and [a, d] defines its support. Because of the need for a relative range definition, the range Range_A:e is defined by the four relative distances d(a, v_A^new), d(b, v_A^new), d(v_A^new, c), and d(v_A^new, d). Once the actual value of v_A^new is known, the absolute similarity range can be determined from these distances as illustrated in Figure 2. (With the notation Range_A:e(v_A^new, .), it is denoted that the membership function of Range_A:e is parameterised by v_A^new.)

Figure 2. Similarity ranges for numeric types (trapezoidal membership function Range_A:e(v_A^new, .) over dom_e, determined by the relative distances d(a, v_A^new), d(b, v_A^new), d(v_A^new, c), and d(v_A^new, d))

Note that, for the sake of simplicity, attributes with a text type are in this chapter excluded from the set D. In a more advanced approach, text attributes could also be allowed and the similarity between two texts could be measured by applying text comparison techniques as can be found in (fuzzy) information retrieval approaches (Baeza-Yates & Ribeiro-Neto, 1999).

The ranges for the 'descriptive' attributes 'Category:varchar,' 'Gender:char(1),' and 'VictimAge: year(Registration) – year(Birthdate)' of the example could be defined as:

1. Attribute 'Category:varchar':

   RangeCategory:varchar   verbal violence   stalking   sexual violence   injuries
   verbal violence         1                 0.6        0                 0
   stalking                0.6               1          0                 0
   sexual violence         0                 0          1                 0
   injuries                0                 0          0                 1

2. Attribute 'Gender:char(1)':

   RangeGender:char(1)   M   F
   M                     1   0
   F                     0   1

3. Attribute 'VictimAge: year(Registration) – year(Birthdate)':

   d(a, v_VictimAge^new) = 3, d(b, v_VictimAge^new) = 0, d(v_VictimAge^new, c) = 0, and d(v_VictimAge^new, d) = 3

This means that the fuzzy set has a triangular membership function with its top in the actual value v_VictimAge^new of the attribute in the new complaint description. With respect to similarity, a deviation of at most 3 years for this value is allowed.

Analogously as with the elements of D, a similarity range Range_A':e' is associated with each attribute A':e' ∈ P. These ranges will mainly be used to determine the similarity between the predicted values and the actual values, when these become known. The ranges for the 'predictable' attributes 'JudicialDecision:varchar' and 'Duration: DecisionDate – RegistrationDate' of the example could be defined as:

1. Attribute 'JudicialDecision:varchar':

   RangeJudicialDecision:varchar   insusceptible   fine   provisional imprisonment   effective imprisonment
   insusceptible                   1               0      0                          0
   fine                            0               1      0.6                        0.1
   provisional imprisonment        0               0.6    1                          0.3
   effective imprisonment          0               0.1    0.3                        1

2. Attribute 'Duration: DecisionDate – RegistrationDate':

   d(a, v_Duration^new) = 300, d(b, v_Duration^new) = 50, d(v_Duration^new, c) = 50, and d(v_Duration^new, d) = 300

This means that the fuzzy set has a trapezoidal membership function with, as central value, the new value v_Duration^new that will be obtained once the new complaint has been completely dealt with. A deviation of 50 days will be considered completely similar, whereas a deviation of 300 days is the maximum deviation allowed.
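The relative trapezoidal range definition and its instantiation for the 'VictimAge' example can be sketched in Python; the function names are illustrative, not taken from the chapter:

```python
def make_absolute_range(d_a, d_b, d_c, d_d):
    """Build Range_{A:e}(v_new, .) from the four relative distances
    d(a, v_new), d(b, v_new), d(v_new, c), and d(v_new, d)."""
    def range_for(v_new):
        # Absolute trapezoid parameters derived from the actual value.
        a, b = v_new - d_a, v_new - d_b   # left support / core bounds
        c, d = v_new + d_c, v_new + d_d   # right core / support bounds
        def mu(x):
            if x < a or x > d:
                return 0.0                    # outside the support [a, d]
            if b <= x <= c:
                return 1.0                    # inside the core [b, c]
            if x < b:
                return (x - a) / (b - a)      # rising edge
            return (d - x) / (d - c)          # falling edge
        return mu
    return range_for

# The 'VictimAge' example: d(a,v) = 3, d(b,v) = 0, d(v,c) = 0, d(v,d) = 3
# yields a triangular function with its top in the actual value.
victim_age_range = make_absolute_range(3, 0, 0, 3)
mu = victim_age_range(33)   # absolute range once v_new = 33 is known
```

With v_new = 33, mu yields grade 1 at 33 and roughly 0.66 and 0.33 at one and two years' deviation, matching the fuzzy set computed later in the chapter.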
Weights

Finally, the user must assign an importance weight w_A:e ∈ [0,1] to each attribute A:e ∈ D. This weight denotes the relative importance of the attribute within the similarity determination process. A weight w_A:e = 0 denotes 'not important at all,' whereas conversely w_A:e = 1 means 'fully important.' To make sense, the weights must satisfy the following semantic conditions (Dubois, Fargier, & Prade, 1997):

• In order to have an appropriate scaling, max_i w_{A_i:e_i} = 1 must hold.
• If w_A:e = 1, that is, the weight is fully important, and the actual value v_A^new of the new case is not 'similar' at all to the stored value v_A^stored of the case in the database, then the weight may not influence the similarity between v_A^new and v_A^stored.
• If w_A:e = 1, that is, the weight is fully important, and the actual value v_A^new is completely 'similar' to the stored value v_A^stored, then again the weight may not influence the similarity between v_A^new and v_A^stored.
• Last, if w_A:e = 0, that is, the weight is not important at all, then the result of the weighting should be the neutral element of the used aggregation operator. This is true (or completely similar) for conjunction (∧) and false (or not similar at all) for disjunction (∨).

For example, in order to reflect that the attributes 'Category:varchar' and 'Gender:char(1)' are more important than the other 'descriptive' attributes in the case comparison, the following weights could be associated with the elements of D:

Category: weight = 1, VictimAge: weight = 0.7, Gender: weight = 1, Job: weight = 0.7, MaritalStatus: weight = 0.7

As with the similarity ranges, a weight w_A':e' ∈ [0,1] must also be provided for each attribute A':e' ∈ P. For example, in order to reflect that the attribute 'JudicialDecision:varchar' is more important than the attribute 'Duration: DecisionDate – RegistrationDate' in the comparison for revision purposes, the following weights could be associated with the elements of P:

JudicialDecision: weight = 1 and Duration: weight = 0.4

Case Comparison

This process is responsible for the retrieval of similar cases from the database. To do this, it considers the 'descriptive' attributes:

A_1:e_1, A_2:e_2, ..., A_m:e_m   (15)

Because some of these attributes might be related to each other via a (primary key, foreign key)-relationship, there could be a one-to-many correspondence between their actual values. Therefore, in general, the actual values of the 'descriptive' attributes of a single case have to be modeled by a non-normalized (derived) relation:

R_D(A_1:e_1, A_2:e_2, ..., A_m:e_m)   (16)

with tuples (rows)

t_j = <v_{A_1,j}, v_{A_2,j}, ..., v_{A_m,j}>, 1 ≤ j ≤ n, where v_{A_i,j} ∈ dom_{e_i}, 1 ≤ i ≤ m   (17)

Together, these tuples model the case. Because all tuples in R_D relate to the same case, a set of relevant attribute values can be considered for each attribute A_i. These sets are defined by:

V_{A_i} = {v_{A_i,j} | 1 ≤ j ≤ n}, 1 ≤ i ≤ m   (18)

These considerations hold for each new case C^new, as well as for all the cases C_j, 1 ≤ j ≤ l stored in the database: for each attribute A_i, the sets of attribute values

V_{A_i}^{C^new}, resp. V_{A_i}^{C_j}, 1 ≤ i ≤ m, 1 ≤ j ≤ l

and their corresponding similarity ranges and weights

Range_{A_1:e_1}, Range_{A_2:e_2}, ..., Range_{A_m:e_m} and w_{A_1:e_1}, w_{A_2:e_2}, ..., w_{A_m:e_m}

can be determined. In our simplified juridical complaint database example, the normalized relation has the relation schema shown in Exhibit 4 and consists of the single tuple:

t = <'Stalking', 33, 'F', 'Housewife', 'Divorced'>

Because there are no one-to-many relationships in the database, this single tuple describes a new complaint C^new, and in this simple case, the sets of relevant attribute values are all singletons, that is:

V_Category^{C^new} = {'Stalking'}, V_VictimAge^{C^new} = {33}, V_Gender^{C^new} = {'F'}, V_Job^{C^new} = {'Housewife'}, and V_MaritalStatus^{C^new} = {'Divorced'}.
Similarity Between the Values for Individual Attributes

The first step in the case comparison is the determination of the similarity between the values of V_A^{C^new} and the values of V_A^{C_j} of the attribute A of a stored case C_j. To take into account the attribute's similarity range Range_A:e, the set V_A^{C^new} is replaced
Exhibit 4. R_D(Category:varchar, VictimAge: year(Registration) – year(Birthdate), Gender:char(1), Job:varchar, MaritalStatus:varchar)
by a fuzzy set Ṽ_A^{C^new}:

• If e has an alphanumeric type or an enumeration type:

Ṽ_A^{C^new} = {(x, max_{v ∈ V_A^{C^new}} Range_A:e(x, v)) | x ∈ dom_e ∧ max_{v ∈ V_A^{C^new}} Range_A:e(x, v) > 0}   (19)

For example, the fuzzy set for the alphanumeric attribute 'Category:varchar' becomes:

Ṽ_Category^{C^new} = {('stalking', 1), ('verbal violence', 0.6)}.

• If e has a numeric type, for each v ∈ V_A^{C^new}, the absolute similarity range S̃_v is determined from the relative distances of Range_A:e (cf. Figure 2). Ṽ_A^{C^new} is then obtained as the union of these ranges, that is:

Ṽ_A^{C^new} = {(x, max_{v ∈ V_A^{C^new}} S̃_v(x)) | x ∈ dom_e ∧ max_{v ∈ V_A^{C^new}} S̃_v(x) > 0}   (20)

Taking into account the corresponding ranges, the fuzzy set for the numeric attribute 'VictimAge: year(Registration) – year(Birthdate)' becomes:

Ṽ_VictimAge^{C^new} = {(31, 0.33), (32, 0.66), (33, 1), (34, 0.66), (35, 0.33)}.

Furthermore, the set V_A^{C_j} is replaced by its fuzzy counterpart:

Ṽ_A^{C_j} = {(x, 1) | x ∈ V_A^{C_j}}   (21)

The similarity between V_A^{C^new} and V_A^{C_j} is calculated by:

sim(V_A^{C^new}, V_A^{C_j}) = |Ṽ_A^{C^new} ∩ Ṽ_A^{C_j}| / |Ṽ_A^{C^new} ∪ Ṽ_A^{C_j}|
                            = Σ_{x ∈ dom_e} min(μ_{Ṽ_A^{C^new}}(x), μ_{Ṽ_A^{C_j}}(x)) / Σ_{x ∈ dom_e} max(μ_{Ṽ_A^{C^new}}(x), μ_{Ṽ_A^{C_j}}(x))   (22)
This calculation is based on the fuzzification of the Jaccard similarity measure as described in (Miyamoto, 2000).
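A minimal Python sketch of this fuzzified Jaccard measure (Equation 22), representing a fuzzy set as a dict from domain values to membership grades (the representation is our choice, not the chapter's):

```python
def fuzzy_jaccard(va_new, va_j):
    """Similarity between two fuzzy value sets (Equation 22):
    sum of pointwise min grades over sum of pointwise max grades."""
    domain = set(va_new) | set(va_j)
    num = sum(min(va_new.get(x, 0.0), va_j.get(x, 0.0)) for x in domain)
    den = sum(max(va_new.get(x, 0.0), va_j.get(x, 0.0)) for x in domain)
    return num / den if den else 0.0

# 'Category' example: fuzzy set of the new case vs. a crisp stored value.
v_new = {'stalking': 1.0, 'verbal violence': 0.6}
v_stored = {'stalking': 1.0}
sim = fuzzy_jaccard(v_new, v_stored)   # 1.0 / 1.6 = 0.625
```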
Similarity Between Two Cases

The similarity between two cases C^new and C_j is obtained by the weighted aggregation of the similarities of all 'descriptive' attributes, taking into account the weights that are associated with the attributes. In this chapter, we only consider conjunctive aggregation. Therefore, we can apply an implicator operator for the modeling of the impact of the weights (de Tré, de Caluwe, Tourné, & Matthé, 2003):

f_im^∧ : [0,1] × [0,1] → [0,1]
(w_A:e, sim(V_A^{C^new}, V_A^{C_j})) ↦ max(1 − w_A:e, sim(V_A^{C^new}, V_A^{C_j}))   (23)

Note that with this definition, the semantic conditions for weights as proposed in Dubois et al. (1997) are satisfied. The similarity between the two cases is then obtained by:

sim(C^new, C_j) = min_{A:e ∈ D} f_im^∧(w_A:e, sim(V_A^{C^new}, V_A^{C_j}))   (24)
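Equations (23) and (24) can be sketched as follows; the attribute similarities used in the usage example are made-up illustration values, not results computed in the chapter:

```python
def f_im(weight, similarity):
    """Implicator-based weighting (Equation 23)."""
    return max(1.0 - weight, similarity)

def case_similarity(attr_sims, weights):
    """Similarity between two cases (Equation 24): conjunctive
    (minimum) aggregation of the weighted attribute similarities."""
    return min(f_im(weights[a], s) for a, s in attr_sims.items())

# Illustrative attribute similarities and the example weights for D.
sims = {'Category': 0.6, 'VictimAge': 0.2, 'Gender': 1.0,
        'Job': 0.0, 'MaritalStatus': 1.0}
ws = {'Category': 1.0, 'VictimAge': 0.7, 'Gender': 1.0,
      'Job': 0.7, 'MaritalStatus': 0.7}
sim = case_similarity(sims, ws)   # min(0.6, 0.3, 1.0, 0.3, 1.0) = 0.3
```

Note how a weight of 0 turns an attribute's contribution into 1, the neutral element of the minimum, as required by the semantic conditions of Dubois et al. (1997).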
Retrieval of Similar Cases

The cases in the database that are similar to the case C^new are retrieved by calculating the similarity:

sim(C^new, C_j), 1 ≤ j ≤ l   (25)

for all stored cases C_j. In order to retrieve only the most similar cases, the user can provide a threshold value τ: only cases for which the similarity is not lower than τ are provided in the result. Finally, the fuzzy set S̃_{C^new} of cases that are similar to the case C^new is obtained by:

S̃_{C^new} = {(C_j, sim(C^new, C_j)) | 1 ≤ j ≤ l ∧ sim(C^new, C_j) ≥ τ}   (26)
If this fuzzy set is empty, a message will be sent to the revision process.
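A sketch of this retrieval step (Equations 25-26); the stored cases and the one-attribute similarity function in the example are invented for illustration:

```python
def retrieve_similar(new_case, stored_cases, similarity, tau):
    """Fuzzy set of similar cases (Equations 25-26): keep every stored
    case whose similarity to the new case reaches the threshold tau."""
    result = {}
    for case_id, case in stored_cases.items():
        s = similarity(new_case, case)
        if s >= tau:
            result[case_id] = s
    return result   # an empty dict signals the revision process

stored = {'C01': {'VictimAge': 33}, 'C02': {'VictimAge': 60}}
sim_by_age = lambda a, b: max(0.0, 1 - abs(a['VictimAge'] - b['VictimAge']) / 3)
similar = retrieve_similar({'VictimAge': 32}, stored, sim_by_age, tau=0.5)
# only 'C01' passes the threshold
```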
Prediction

The prediction process aims to predict the values for the 'predictable' attributes:

P = {A'_1:e'_1, A'_2:e'_2, ..., A'_p:e'_p}

of the new case. For the same reason as with the 'descriptive' attributes, each 'predictable' attribute A' has an associated set of actual values for each case C_j, 1 ≤ j ≤ l stored in the database (cf. Equations [17]-[18]):

V_{A'}^{C_j}   (27)

Furthermore, the prediction process will associate a fuzzy set of predicted values:

Ṽ_{A'}^{C^new}   (28)
with the attribute. Possibility theory is used to determine the possible elements of this set and thus defines the semantics of the membership grades ~ new of VAC′ as degrees of uncertainty. Here, the CBR hypothesis is interpreted as (Dubois et al., 1998,
2000): 'the more similar two cases are, the more possible it is that their corresponding "predictable" attribute values are similar.'

In a straightforward approach, Ṽ_{A'}^{C^new} can be obtained by:

Ṽ_{A'}^{C^new} = {(x, μ_{Ṽ_{A'}^{C^new}}(x)) | x ∈ dom_{e'} ∧ μ_{Ṽ_{A'}^{C^new}}(x) > 0}   (29)

where

μ_{Ṽ_{A'}^{C^new}}(x) = max {μ_{S̃_{C^new}}(C) | μ_{S̃_{C^new}}(C) > 0 ∧ C[A'] = x}
Hereby, C[A'] denotes the actual value of attribute A' for case C. By using Equation (29), each distinct value of attribute A' that occurs in a case C that is an element of the fuzzy set S̃_{C^new} of similar cases is considered to be a possible value for A' in C^new. Its degree of possibility is obtained as the maximum of the membership grades μ_{S̃_{C^new}}(C) of all cases C in S̃_{C^new} that have the value as attribute value for A'. More advanced techniques that also deal with the similarities between the values of A' in the similar cases of S̃_{C^new} can be used here (Dubois et al., 2000). These topics are outside the scope of this chapter.
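The straightforward prediction of Equation (29) can be sketched as follows; the case identifiers, grades, and decisions are invented for illustration:

```python
def predict_values(similar_cases, stored_values):
    """Straightforward prediction (Equation 29): each value of the
    'predictable' attribute occurring in a similar case is possible to
    the degree of the maximal membership grade among the cases having it."""
    possibility = {}
    for case_id, grade in similar_cases.items():
        x = stored_values[case_id]          # C[A'] for this case
        possibility[x] = max(possibility.get(x, 0.0), grade)
    return possibility

# Fuzzy set of similar cases and their stored 'JudicialDecision' values.
similar = {'C01': 1.0, 'C07': 0.6, 'C09': 0.6}
decisions = {'C01': 'fine', 'C07': 'fine', 'C09': 'provisional imprisonment'}
predicted = predict_values(similar, decisions)
# {'fine': 1.0, 'provisional imprisonment': 0.6}
```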
Revision

The revision process gets input from both the case comparison and the prediction processes. On the one hand, it can occur that the fuzzy set S̃_{C^new} is empty. This means that no cases with similar characteristics are found in the database when considering the given similarity ranges, weights, and threshold value τ. Because too stringent conditions might have caused the empty query result, feedback from the user is necessary. On the other hand, it might also be the case that the predicted values prove to be incorrect, which might be caused by conditions that are too
soft. Therefore, as soon as the actual values of the 'predictable' attributes in P become available and are entered in the system, the new values are compared with the predicted values by calculating their similarity. Hereby, the same techniques as in the case comparison process can be applied, but now using the following ranges and weights:

Range_{A'_1:e'_1}, Range_{A'_2:e'_2}, ..., Range_{A'_p:e'_p} and w_{A'_1:e'_1}, w_{A'_2:e'_2}, ..., w_{A'_p:e'_p}

The similarity between the completed 'predicted' case C^new_completed and the case C^new as originally entered in the database is obtained by the following counterpart of Equation (24):

sim(C^new_completed, C^new) = min_{A':e' ∈ P} f_im^∧(w_{A':e'}, sim(V_{A'}^{C^new_completed}, V_{A'}^{C^new}))   (30)

If this similarity is lower than the threshold value τ, then the prediction is considered inadequate, and again feedback from the user is necessary. If user feedback is necessary, the process will interact with the user and provide all the information that is available. More specifically, the cause of the interaction will be communicated, together with information about all attribute range values, weights, and (intermediate) results in the calculation of the similarities. By comparing the intermediate results, the process can determine for which 'descriptive' attribute(s) the values have the highest and the lowest similarity. This information, together with the feedback of the user, can help the user to decide to adapt (some of) the parameters of the case description process. Alternatively, the parameters can also be automatically adapted by the process. This can be done, for example, by proportionally decreasing or increasing the weight and/or the relative distances of the definition of the similarity range of the attribute that performs best or worst. More details on this are outside the scope of this chapter.

Enhancing Flexible Database Querying

In this section, we describe how the CBR approach presented in the previous section can be applied to enhance the flexible querying of (conventional) relational databases.

Flexible Querying

For many years, an emphasis has been put on research that aims to make database systems more flexible and better accessible. An important aspect of flexibility is the ability to deal with imperfections of information, like imprecision, vagueness, uncertainty, or incompleteness. Imperfection of information can be dealt with at the level of data modeling, the level of database querying, or both. The key idea in flexible querying is to introduce preferences inside database queries (Bosc, Kraft, & Petry, 2005). This can be done at two levels: inside elementary query conditions and between query conditions. Preferences inside query conditions allow for expressing that some values are more adequate than others, whereas preferences between query conditions are used to associate different levels of importance with the conditions.

To support preferences, query languages like SQL and OQL and their underlying algebraic frameworks have been generalized. Hereby, the possible extensions and flexible counterparts of the algebraic data manipulation operators have been studied (Bosc & Pivert, 1992, 1995; de Tré, Verstraete, Hallez, Matthé, & de Caluwe, 2006; Galindo, Medina, Pons, & Cubero, 1998; Galindo, Urrutia, & Piattini, 2006; Umano & Fukami, 1994; Zadrozny & Kacprzyk, 1996). As the main objective of flexible querying is to refine Boolean conditions, which are either completely true or completely false, it is sufficient that the underlying logical framework supports some notion of 'degree of satisfaction.' Alternatively, an underlying logical framework based on possibility and necessity measures can be used to express certainty about query satisfaction. This approach,
as originally presented in Prade and Testemale (1984), does not discuss the inapplicability of information at the logical level, nor does it offer a formal framework for coping with it together with other null values, despite the fact that inapplicability is handled with a special domain value (⊥) in the data model. As illustrated in de Tré and de Caluwe (2003), extended possibilistic truth values (EPTVs) can be used to express (un)certainty about query satisfaction in flexible database querying: the EPTV representing the extent to which it is (un)certain that a given database record belongs to the result of a flexible query can be obtained by aggregating the calculated EPTVs that denote the extents to which it is (un)certain that the record satisfies the different criteria imposed by the query. Moreover, the logical framework based on EPTVs extends the approach presented in Prade and Testemale (1984) and explicitly deals with the inapplicability of information during the evaluation of the query conditions: if some part of the query conditions is inapplicable, this will be reflected in the resulting EPTV. An extension of SQL that copes with EPTVs has been described in de Tré et al. (2006).
Extending Flexible Querying Systems with Extra CBR Facilities

In a first approach, a flexible querying system could be extended with a CBR system for instance-based prediction (Dubois et al., 2000). Such an extra facility additionally allows users to examine the database for predicted values for a set of given attributes. Of course, in order to be usable, the underlying CBR hypothesis, 'the more two database entities are similar, the more possible the similarity of associated attribute values,' must hold. By using the facility, a CBR technique as described in the section titled 'A Flexible CBR Approach for Information Retrieval' will be applied. After having initialised the CBR system, the user has to enter the relevant attribute values describing the case under consideration. The unknown attribute values will then be predicted by the prediction process and returned to the user by the CBR system.
Embedding CBR Facilities in a Flexible Querying Language

Rather than being provided as an extra stand-alone facility, CBR can also be embedded in existing (flexible) querying systems. To do this, the query language must be extended with an extra facility 'PREDICT' that allows the prediction of the unknown values of specified attributes. Without such a facility, those unknown values will in most systems be represented by the pseudo-description null (Codd, 1979; Vassiliou, 1979). In the next subsections, we describe such a predict facility that could be embedded in a flexible querying language for conventional, relational databases supported by a logical framework of EPTVs.
Flexible Querying Using a Logical Framework of EPTVs

In order to use EPTVs for expressing query satisfaction in flexible querying of regular relational databases, the relational model and relational algebra (Codd, 1972) must be extended with some additional facilities. To start with, the definition of a relation R is extended so that it contains an extra attribute 'Contains:T_EPTV' with a corresponding data type T_EPTV that has EPTVs as allowed values. As such, each relation R_i, 1 ≤ i ≤ r in a database schema has the following schema:
R_i(A_{i1}:T_{i1}, A_{i2}:T_{i2}, ..., A_{im_i}:T_{im_i}, Contains:T_EPTV)   (31)
The extra attribute with name ‘Contains’ is used to express the extent to which the tuples of the relation belong to the relation. Hereby, it is implicitly assumed that the schema of a relation corresponds to a predicate and all tuples that belong to the relation are propositions that should not evaluate to false, that is, that have an associated EPTV that differs from {(F, 1)}. If no more information is available, for simplification, it can be assumed that all tuples initially have the value {(T, 1)} as associated with
EPTV. The value {(T, 1)} could, for example, be the default value that is assigned to the tuple on insertion. In a more general approach, users might be allowed to assign their own truth values, hereby expressing that the tuple belongs to the relation only to the given extent.

In order to guarantee the relational closure property of the set of relational algebra operators (Codd, 1972), the definitions of the operators must also be extended such that the results of the queries are also extended relations that have an extra attribute 'Contains.' The value of the extra attribute then expresses the extent to which a tuple belongs to the answer set of the query. In fact, the EPTV expresses the certainty about the compatibility of the tuple with the results expected by the user. This certainty is calculated during query processing, as is presented below for the selection, projection, and join operators (de Tré et al., 2006).

Illustrative database. The relational database used to illustrate the proposed flexible querying approach is a simplification of the one introduced in the section titled 'A Flexible CBR Approach for Information Retrieval.' It consists of two relations named 'Victim' and 'Complaint,' as shown in Figure 3. Each tuple in Victim represents information about a victim of some crime for which an official complaint is registered in the database and is characterized by a unique victim ID (VID), which is the primary key attribute, and an age attribute (Age). Each tuple in Complaint represents information
Figure 3. Example of the relations Complaint and Victim (relation Complaint has the attributes CID, VID, Duration, and Contains; relation Victim has the attributes VID, Age, and Contains; all tuples initially have the EPTV {(T, 1)} as 'Contains' value, and the Duration of one complaint is the undefined value ⊥Integer)
about a juridical complaint and is characterized by a unique complaint ID (CID), which is the primary key attribute; the corresponding victim ID of the victim (VID), which is a foreign key that refers to relation Victim; and the total duration of the complaint handling (Duration).

The associated domains dom_T of the considered attributes contain a domain-specific 'undefined' element ⊥_T that is used to model cases where a regular domain value is not applicable (cf. Prade & Testemale, 1984). In this way, the attribute domain of Duration contains an element ⊥_Integer which denotes that a regular value for the duration is not applicable, which could be due to the fact that the complaint has been withdrawn. For example, this is the case for complaint 'C04'.

The selection operation. In relational algebra (Codd, 1972), the selection operation, also called the restriction operation, is written in the following general format:

a WHERE e   (32)
where a denotes a database relation and e is a truth-valued function, also called the restriction condition, whose parameters are some subset of the attributes of a. The selection operation restricts relation a by discarding all tuples of a that do not satisfy e at all, that is, that have a calculated truth value equal to false. The resulting relation contains the same attributes as relation a.

In the proposed extension, the truth-valued function e is further generalised to a function that evaluates to an EPTV. Examples of such functions are the 'IS' function and the generalisations of the comparison operators like '=,' '≠,' '<,' and '>.' As an illustration, only the definition of the 'IS' function is described below. Definitions for the comparison operators are given in de Tré and de Caluwe (2004). With the understanding that v is the crisp stored value of attribute A and μ_L is the membership function of the fuzzy set L that represents the values desired by the user, the EPTV of the proposition 'A IS L' is defined by:
{(T ,
T
max(
T,
F,
⊥)
), ( F ,
max(
F
T,
F,
⊥)
), (⊥,
⊥
max(
T,
where •
T
•
F
•
⊥
ed:
=
L
F,
⊥)
)}
(33)
(v )
1 − L (v ) if v ≠⊥ T = if v =⊥T 0 1 − L (⊥T ) if v =⊥T = 0 if v ≠⊥ T
With Equation (33)��������������������������� , the following is reflect-
• • •
The resulting EPTV must be normalized. This is guaranteed by the division by max( T , F , ⊥ ) . If v is inapplicable (� v =⊥T ) and the fuzzy set L refers to the value ⊥T , the truth value T is possible to the calculated extent. The possibility of the truth value ⊥ is 1, if the
Figure 4. Resulting relations of the considered queries Complaint {CID, Duration} Complaint where Duration IS Long CID
VID
CID
Duration
Contains
Duration
Contains
C0
{(T,)}
C0 V0
{(T,)}
C0
{(T,)}
C0 V0 C0 V0
{(T,0.0),(F,)}
C0
0
{(T,)}
0
{(T,),(F,0.)}
C0
⊥Integer
{(T,)}
C0 V0
⊥Integer
{(⊥,)}
C0
{(T,)}
(a)
In practice, the fuzzy set L can be labeled by a linguistic term. The result set of the selection operation is obtained by evaluating e and by calculating the EPTVs that are associated with the resulting tuples. Hereby, it is important and necessary that the EPTVs of the original relation a are appropriately dealt with. Therefore, the conjunction operator ∧̃, presented in the preliminaries, is applied to the original EPTV and the EPTV that is obtained from the evaluation of e. Only tuples with a resulting EPTV that differs from {(F, 1)} belong to the resulting relation. As an example, consider the following query: Complaint WHERE Duration IS Long This query selects all 'Complaint'-tuples with a Duration that is compatible with the fuzzy set that is labeled with the linguistic term 'Long.' The membership function of this fuzzy set is given as:

μLong: domInteger → [0, 1]

μLong(x) = 0 if x = ⊥Integer
μLong(x) = 0 if x < 300
μLong(x) = (x − 300) / 300 if 300 ≤ x ≤ 600
μLong(x) = 1 if x > 600
The tuples belonging to the result set of the query are given in Figure 4(a). For every tuple in this result set, the corresponding EPTV is calculated as the conjunction of the EPTV associated with the tuple in the original relation and the EPTV that is obtained by applying the 'IS' function as defined above in Equation (33), that is:
• Tuple 'C01':
{(T, 1)} ∧̃ {(T, 1/max(1, 0, 0)), (F, 0/max(1, 0, 0)), (⊥, 0/max(1, 0, 0))} = {(T, 1)} ∧̃ {(T, 1)} = {(T, 1)}

• Tuple 'C02':
{(T, 1)} ∧̃ {(T, 0.04/max(0.04, 0.96, 0)), (F, 0.96/max(0.04, 0.96, 0)), (⊥, 0/max(0.04, 0.96, 0))} = {(T, 1)} ∧̃ {(T, 0.04), (F, 1)} = {(T, 0.04), (F, 1)}

• Tuple 'C03':
{(T, 1)} ∧̃ {(T, 0.8/max(0.8, 0.2, 0)), (F, 0.2/max(0.8, 0.2, 0)), (⊥, 0/max(0.8, 0.2, 0))} = {(T, 1)} ∧̃ {(T, 1), (F, 0.25)} = {(T, 1), (F, 0.25)}

• Tuple 'C04':
{(T, 1)} ∧̃ {(T, 0/max(0, 0, 1)), (F, 0/max(0, 0, 1)), (⊥, 1/max(0, 0, 1))} = {(T, 1)} ∧̃ {(⊥, 1)} = {(⊥, 1)}

• Tuple 'C05':
{(T, 1)} ∧̃ {(T, 0/max(0, 1, 0)), (F, 1/max(0, 1, 0)), (⊥, 0/max(0, 1, 0))} = {(T, 1)} ∧̃ {(F, 1)} = {(F, 1)}
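The worked results above can be reproduced mechanically. The operator ∧̃ is defined in the chapter's preliminaries, which are not part of this excerpt; the sketch below assumes a common definition, namely the extension principle applied to Kleene's strong three-valued conjunction, followed by renormalization:

```python
def kleene_and(x, y):
    # Kleene's strong conjunction: F dominates, T is neutral, otherwise ⊥ (written 'U')
    if x == "F" or y == "F":
        return "F"
    return "T" if x == "T" and y == "T" else "U"

def conj(p, q):
    """Extension-principle conjunction of two EPTVs given as {truth value: possibility}."""
    out = {}
    for x, a in p.items():
        for y, b in q.items():
            z = kleene_and(x, y)
            out[z] = max(out.get(z, 0.0), min(a, b))
    norm = max(out.values())  # renormalize so the largest possibility is 1
    return {z: d / norm for z, d in out.items() if d > 0}

# tuple 'C02' from the running example
print(conj({"T": 1.0}, {"T": 0.04, "F": 1.0}))  # {'T': 0.04, 'F': 1.0}
```

Because every original tuple carries the EPTV {(T, 1)} in this example, the conjunction simply reproduces the EPTV of the 'IS' evaluation, as in the five cases above.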
EPTVs allow the modeling of the partial satisfaction of a flexible query condition (tuples ‘C02’ and ‘C03’). It also might be the case that the flexible condition is completely satisfied (tuple ‘C01’) or not satisfied at all (tuple ‘C05’); this tuple is not in the result set of the query. If some part of the data is not defined, for example, due to the fact that the complaint has been withdrawn (tuple ‘C04’), this is explicitly reflected in the associated EPTV. The projection operation. The algebraic projection operation is written in the following general format (Codd, 1972): a {X, Y, …, Z}
(34)
where a denotes a database relation and X, Y, …, Z are regular attributes of a. The result of the projection operation is a relation with a heading derived from the heading of a by removing all attributes not mentioned in the set {X, Y, …, Z} and a body consisting of all tuples of a, restricted to the values of the attributes {X, Y, …, Z}. Hereby, repeated tuples are deleted. In the proposed extension, the extra attribute Contains is also added to the resulting relation. For each tuple t in the body of the resulting relation, the corresponding EPTV is calculated as the disjunction of all EPTVs that are associated with the tuples t′ in a that have the same attribute values for the attributes in {X, Y, …, Z}, that is:

t(Contains) = ∨̃ { t′(Contains) | t′ ∈ a ∧ t(X, Y, …, Z) = t′(X, Y, …, Z) }
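A minimal sketch of this extended projection: tuples collapsing to the same projected value get their EPTVs merged with a disjunction, assumed here (as for the conjunction) to be the extension principle applied to Kleene's strong disjunction. The row layout and all names are illustrative:

```python
def kleene_or(x, y):
    # Kleene's strong disjunction: T dominates, F is neutral, otherwise ⊥ ('U')
    if x == "T" or y == "T":
        return "T"
    return "F" if x == "F" and y == "F" else "U"

def disj(p, q):
    """Extension-principle disjunction of two EPTVs, renormalized."""
    out = {}
    for x, a in p.items():
        for y, b in q.items():
            z = kleene_or(x, y)
            out[z] = max(out.get(z, 0.0), min(a, b))
    norm = max(out.values())
    return {z: d / norm for z, d in out.items() if d > 0}

def project(rows, attrs):
    """Projection on attrs; duplicate tuples merge and their EPTVs are OR-ed."""
    merged = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        merged[key] = disj(merged[key], row["Contains"]) if key in merged else dict(row["Contains"])
    return [dict(zip(attrs, key), Contains=eptv) for key, eptv in merged.items()]

rows = [
    {"CID": "C01", "VID": "V01", "Contains": {"T": 0.4, "F": 1.0}},
    {"CID": "C01", "VID": "V02", "Contains": {"T": 1.0}},
]
print(project(rows, ["CID"]))  # one merged tuple; the stronger evidence wins
```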
In this way, the relational closure property is guaranteed with respect to the projection operator. As an example, consider the query: Complaint {CID, Duration} This query selects all 'Complaint'-tuples, but restricts their tuple values to the values of the attributes CID and Duration. The corresponding EPTVs of the original Complaint relation are copied to the resulting relation. The tuples belonging to the result set of the query are given in Figure 4(b). The join operation. Consider the two relations a and b with respective attribute sets {XY{Contains_a}} and {YZ{Contains_b}}, where X = {X1, X2, …, Xm}, Y = {Y1, Y2, …, Yn}, and Z = {Z1, Z2, …, Zp}.
This means that the attributes Y1, Y2, …, Yn are common to the two relations, X1, X2, …, Xm, Contains_a are the other attributes of a, and Z1, Z2, …, Zp, Contains_b are the other attributes of b. Contains_a is the extra attribute for the associated EPTVs in a, whereas Contains_b is the extra attribute for the associated EPTVs in b. The algebraic (natural) join operation is written in the following format (Codd, 1972): a JOIN b
(35)
The resulting relation is a relation with heading {XYZ{Contains}} and a body consisting of all tuples that can be obtained by 'combining' tuples of a and b that have the same values for all the attributes they have in common, that is, the attributes in Y. Within the extended approach, the associated EPTV in Contains is calculated by aggregating (combining) the corresponding EPTVs in Contains_a and Contains_b using the conjunction operator ∧̃ that is presented in the preliminaries. As an example, consider the query: (Complaint WHERE Duration IS Long) JOIN Victim
This query joins the relations (Complaint WHERE Duration IS Long), presented in Figure 4(a), and Victim; the resulting EPTV of each tuple in the result is calculated by applying the conjunction operator ∧̃ to the EPTVs of both tuples that are combined to obtain the resulting tuple. The tuples belonging to the result set of the query are given in Figure 4(c).
The Predict Operation

In order to illustrate the embedding of CBR facilities in a flexible querying system, the approach presented in the previous subsection is extended with an extra operation 'PREDICT.' This operation can only be meaningfully applied if the underlying CBR hypothesis "The more two database entities are similar, the more possible the similarity of associated attribute values" holds. In its simplest form, the format of this operation is as follows: a PREDICT X
(36)
where a denotes a database relation and X is an attribute of a. The result of the predict operation is a new relation that contains the same attributes as relation a. The tuples of the result set are obtained from the tuples of a by replacing any null value that occurs for the attribute X by a predicted value (if this value can be calculated). These predicted values are obtained by applying a CBR technique as described in the section titled A Flexible CBR Approach for Information Retrieval. Because a regular relational database is considered and because the data type of the attribute X does not change, only domain values of the data type of X are allowed as predicted values. Consequently, only one predicted value out of the fuzzy set ṼXt of predicted values (if not empty) can be completed in the tuple t. In the presented approach, the most possible value is chosen. This is the value x with the maximum associated membership grade in ṼXt:

μṼXt(x) = max { μṼXt(y) | y ∈ domX ∧ μṼXt(y) > 0 }   (37)
If more than one value has the maximum membership grade, then one of these values is chosen arbitrarily as approximation. Of course, because of the lost information, this is not an ideal situation. When working with a fuzzy database, the fuzzy set ṼXt could be stored as the value for X, hereby representing the predicted value as adequately as possible, without a loss of information. For each tuple in the result set for which a null value has been replaced by a predicted value x, the associated EPTV is calculated by the conjunction of the EPTV:

{(T, μṼXt(x) / max(μṼXt(x), 1 − μṼXt(x))), (F, (1 − μṼXt(x)) / max(μṼXt(x), 1 − μṼXt(x)))}   (38)
and the EPTV that was originally associated with the tuple. Again, the conjunction operator ∧̃ presented in the preliminaries is used for this purpose. For all other tuples, the associated EPTV remains unchanged. As an example, consider the following query as applied on the relation depicted in Figure 5(a): Complaint PREDICT Duration This query predicts all null values that occur in the Duration attribute of the relation Complaint. The result of the query is presented in Figure 5(b). All tuples, except the tuple with CID 'C03,' remain the same. For tuple 'C03,' the original Duration value was null. With the assumption that the fuzzy set of predicted values for 'C03,' returned by the CBR approach, is:
Figure 5. An illustration of the PREDICT operation: (a) the relation Complaint, with attributes {CID, VID, Duration, Contains}; (b) the result of Complaint PREDICT Duration
Ṽ_Duration^C03 = {(550, 0.8), (520, 0.6), (583, 0.4)}
the predicted value of Duration in 'C03' becomes 550. With:

C03[Duration] = {(T, 0.8/0.8), (F, 0.2/0.8)} = {(T, 1), (F, 0.25)}

the associated EPTV becomes:

C03[Duration] ∧̃ {(T, 1)} = {(T, 1), (F, 0.25)} ∧̃ {(T, 1)} = {(T, 1), (F, 0.25)}
A Real-World Application: The Gender Claim Database

Within a juridical context, the availability of and easy access to information regarding similar cases is useful with a view to treating complaints. Such cases can help jurists to detect potential pitfalls in time or to make assessments about future developments. The CBR approach and querying techniques presented in the previous sections can be used to predict future developments with respect to the handling of new complaints entered in a database for gender claim handling. The approach allows one to accommodate future attribute values, like the total duration of the complaint handling and the potential (intermediate) results of the actions undertaken by the jurists. Predictions are obtained by comparing new complaints with similar complaints that are stored in the database. Such a database application, called the 'gender claim database,' has been developed for the Belgian
Federal Institute of Equality of Women and Men and is meant to register, to preserve, and to process complaints about direct or indirect discrimination on the basis of gender, harassments (if these relate to the sex of the victim), and unwanted sexual behaviour, which are part of the authority of the institute. The database system is intended to support the way in which the complaints are dealt with as well as the way in which they will be reported to the authorities. A team of jurists is responsible for the complaint handling. The complaint handling system would offer jurists an important surplus value if it could support their database searches for similar cases and could help them make assessments about future developments in the handling of a newly entered complaint. For example, from similar cases, the jurist can learn more about the most likely options for that sort of complaint. Such facilities would be useful because these could help jurists to detect potential pitfalls in time and provide a means for a better exploitation of the database. The underlying idea is that the retrieved information is not intended to replace the knowledge of the jurist, but is additional to it. For that reason, interaction with the jurist is supported and encouraged, especially within the revision process.
Conclusion and Future Trends

In a CBR approach for information retrieval, four main processes can be identified: case description, case comparison, prediction, and revision. In order
to be applicable, the CBR hypothesis that “similar problems have similar solutions” must hold. In the first part of this chapter, it has been described how fuzzy set theory and its related possibility theory can be applied to efficiently deal with the imperfections that are inherent to these processes. For the sake of argumentation, a conventional relational case database has been considered. More specifically, it has been illustrated how the case description process in the case of a relational case database can be made more flexible by providing similarity ranges and weights for the considered attributes. These similarity ranges define the acceptable values for the attributes, whereas the weights denote the relative importance of the attributes within the similarity determination process. It has also been illustrated how a flexible similarity measure for the comparison of two cases can be set up in the case comparison process. This makes sense because case comparisons will seldom result in an exact similarity matching of cases, and fuzzy set theory allows modeling a gradation of similarity of cases. Furthermore, it has also been illustrated how the inevitable uncertainty that occurs when predictions are made can be handled using possibility theory and how the revision process can help to fine-tune the system. Because of the added flexibility, the resulting approach is called a flexible CBR approach. In the second part of the chapter, it has been explained how a flexible CBR approach can be used to enhance flexible database querying. Two approaches have been distinguished. In the first approach, a flexible querying system is extended with a CBR system for instance-based prediction. In the second approach, CBR is embedded in an existing flexible querying system. For the sake of illustration, such a flexible querying approach for regular relational databases has been presented. 
The approach uses a logical framework based on EPTVs and has an embedded CBR-based prediction facility that allows predicting unknown data. To illustrate the practical usefulness of the approach, a real-world application for information retrieval in a juridical database for gender-claim handling has been briefly introduced.
Future work will focus on the further enhancement and development of the presented techniques. Among others, the incorporation of text retrieval mechanisms, alternative aggregation techniques for the comparison process, more advanced techniques for value prediction, and the further (semi-)automation of the revision process will be studied. Another field of ongoing research is the generalization of the approach towards 'fuzzy' databases, that is, databases that can contain imperfect (imprecise, vague, incomplete, or uncertain) data.
References

Aamodt, A., & Plaza, E. (1994). Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications, 7(1), 39-59.
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Essex, UK: ACM Press/Addison-Wesley.
Bosc, P., Kraft, D., & Petry, F. E. (2005). Fuzzy sets in database and information systems: Status and opportunities. Fuzzy Sets and Systems, 153(3), 418-426.
Bosc, P., & Pivert, O. (1992). Some approaches for relational databases flexible querying. International Journal on Intelligent Information Systems, 1, 323-354.
Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.
Codd, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13(6).
Codd, E. F. (1972). Relational completeness of data base sublanguages. In R. J. Rustin (Ed.), Data base systems. Englewood Cliffs, NJ: Prentice Hall.
Codd, E. F. (1979). RM/T: Extending the relational model to capture more meaning. ACM Transactions on Database Systems, 4(4).
de Calmès, M., Dubois, D., Hüllermeier, E., Prade, H., & Sedes, F. (2003). Flexibility and fuzzy case-based evaluation in querying: An illustration in an experimental setting. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 11(1), 43-66.
de Cooman, G. (1995). Towards a possibilistic logic. In D. Ruan (Ed.), Fuzzy set theory and advanced mathematical applications (pp. 89-133). Boston: Kluwer Academic Publishers.
de Cooman, G. (1999). From possibilistic information to Kleene's strong multi-valued logics. In D. Dubois, E. P. Klement, & H. Prade (Eds.), Fuzzy sets, logics and reasoning about knowledge (pp. 315-323). Boston: Kluwer Academic Publishers.
de Tré, G. (2002). Extended possibilistic truth values. International Journal of Intelligent Systems, 17, 427-446.
de Tré, G., & de Caluwe, R. (2003). Modeling uncertainty in multimedia database systems: An extended possibilistic approach. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 11(1), 5-22.
de Tré, G., & de Caluwe, R. (2004). Towards more flexible database systems: A logical framework based on extended possibilistic truth values. In Proceedings of the 15th International Workshop on Database and Expert Systems Applications DEXA 2004 (pp. 900-904), Zaragoza, Spain.
de Tré, G., de Caluwe, R., Tourné, K., & Matthé, T. (2003). Theoretical considerations ensuing from experiments with flexible querying. In Proceedings of the 10th International Fuzzy Systems Association (IFSA) World Congress (pp. 388-391), Istanbul, Turkey.
de Tré, G., Verstraete, J., Hallez, A., Matthé, T., & de Caluwe, R. (2006). The handling of select-project-join operations in a relational framework supported by possibilistic logic. In Proceedings of the 11th International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU) (pp. 2181-2188), Paris, France.
Dubois, D., Esteva, F., Garcia, P., Godo, L., Lopez de Mantaras, R., & Prade, H. (1998). Fuzzy set modeling in case-based reasoning. International Journal on Intelligent Systems, 13, 345-373.
Dubois, D., Fargier, H., & Prade, H. (1997). Beyond min aggregation in multicriteria decision: (Ordered) weighted min, discri-min and leximin. In R. R. Yager & J. Kacprzyk (Eds.), The ordered weighted averaging operators: Theory and applications (pp. 181-192). Boston: Kluwer Academic Publishers.
Dubois, D., Hüllermeier, E., & Prade, H. (2000). Flexible control of case-based prediction in the framework of possibility theory. Lecture Notes in Artificial Intelligence, 1898, 61-73. Berlin/Heidelberg: Springer-Verlag.
Dubois, D., & Prade, H. (1988). Possibility theory. New York: Plenum Press.
Dubois, D., & Prade, H. (Eds.). (2000). Fundamentals of fuzzy sets. Dordrecht, The Netherlands: Kluwer Academic Publishers Group.
Ellman, J. (1995). An application of case based reasoning to object-oriented database retrieval. In Proceedings of the 1st UK Case Based Reasoning Workshop, Salford, UK.
Faltings, B. (1997). Probabilistic indexing for case-based prediction. Lecture Notes in Artificial Intelligence, 1266, 611-622. Berlin/Heidelberg: Springer-Verlag.
Galindo, J., Medina, J., Pons, O., & Cubero, J. (1998). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible querying and answering systems (pp. 164-174). Dordrecht: Kluwer Academic Publishers.
Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing.
Miyamoto, S. (2000). Fuzzy multisets and their generalizations. Lecture Notes in Computer Science, 2235, 225-236. Berlin/Heidelberg: Springer-Verlag.
Pedrycz, W., & Gomide, F. (1998). An introduction to fuzzy sets: Analysis and design. The MIT Press.
Plaza, E., Esteva, F., Garcia, P., Godo, L., & López de Màntaras, R. (1996). A logical approach to case-based reasoning using fuzzy similarity relations. Information Sciences, 106, 105-122.
Prade, H. (1982). Possibility sets, fuzzy sets and their relation to Lukasiewicz logic. In Proceedings of the 12th International Symposium on Multiple-Valued Logic (pp. 223-227).
Prade, H., & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34, 115-143.
Rescher, N. (1969). Many-valued logic. New York: McGraw-Hill.
Richter, M. M. (1995). On the notion of similarity in case-based reasoning. In G. della Riccia, R. Kruse, & R. Viertl (Eds.), Mathematical and statistical methods in artificial intelligence (pp. 171-184). Heidelberg: Springer-Verlag.
Richter, M. M. (2006). Modeling uncertainty and similarity-based reasoning: Challenges. In Workshop Proceedings of the 8th European Conference on Case-Based Reasoning ECCBR 2006 (pp. 191-199), Ölüdeniz/Fethiye, Turkey.
Shimazu, H., Kitano, H., & Shibata, A. (1993). Retrieving cases from relational databases: Another stride towards corporate-wide case-base systems. In Proceedings of the 13th International Joint Conference on Artificial Intelligence IJCAI (pp. 909-915), Chambéry, France.
Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-27.
Vassiliou, Y. (1979). Null values in data base management: A denotational semantics approach. In Proceedings of the Special Interest Group on Management of Data (SIGMOD) Conference (pp. 162-169).
Yager, R. R. (1997). Case-based reasoning, fuzzy systems modeling and solution composition. In Proceedings of the Case-Based Reasoning Research and Development Second International Conference ICCBR-97 (pp. 633-643), Rhode Island, USA.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338-353.
Zadeh, L. A. (1975). The concept of linguistic variable and its application to approximate reasoning (parts I, II, and III). Information Sciences, 8, 199-251, 301-357; 9, 43-80.
Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1), 3-28.
Zadrozny, S., & Kacprzyk, J. (1996). FQUERY for Access: Towards human consistent querying user interface. In Proceedings of the 1996 ACM Symposium on Applied Computing (SAC) (pp. 532-536), Philadelphia, PA.
Key Terms

Case Based Reasoning: Case based reasoning (CBR) is a methodology where new problems are solved by investigating, adapting, and reusing solutions to previously solved, similar problems. Hereby, knowledge is deduced from the characteristics of a collection of past cases, rather than induced from a set of knowledge rules that are stored in a knowledge base.

Case Comparison: This CBR process is responsible for the retrieval of cases in the database that are adequately similar to the case for which some values need to be predicted. Hereby, the fuzzy preferences and/or fuzzy conditions provided in the case description process are applied.

Case Description: In this CBR process, it is identified how the cases are structured and can
be extracted from the database. Furthermore, the parameters required for case comparison purposes are set. For example, in case of a fuzzy case comparison, the fuzzy preferences and/or fuzzy conditions are specified.

Database: A collection of persistent data. In a database, data are modeled in accordance with a database model. This model defines the structure of the data, the constraints for integrity and security, and the behavior of the data.

Flexible Querying: Searching for data in a database is called querying. Modern database systems provide a query language to support querying. Relational databases are usually queried using SQL (Structured Query Language). Regular database querying can be made more user friendly by applying techniques for self-correction of syntax and semantic errors, database navigation, or "indirect" answers like summaries, conditional answers, and contextual background information for (empty) results. This is called flexible querying. A special subcategory of flexible querying techniques is based on the introduction of fuzzy preferences and/or fuzzy conditions in queries. This is sometimes called fuzzy querying.

Flexible Querying Techniques Based on CBR: CBR techniques can be used for flexible database querying purposes. More specifically, CBR techniques can be used for instance-based
prediction with which unknown data values can be approximated. Hereby, four main processes can be identified: case description, case comparison, prediction, and revision.

Prediction: Based on the data in similar cases, a prediction model is built for each of the unknown data values that must be predicted. Each prediction model represents the predicted approximation of the unknown value. These models are forwarded to the user and to the revision process.

Relational Database: A relational database is a database that is modeled in accordance with the relational database model. In the relational database model, the data are structured in relations that are represented by tables. The behavior of the data is defined in terms of the relational algebra, which originally consists of eight operators (union, intersection, difference, cross product, join, selection, projection, and division), or in terms of the relational calculus, which is of a declarative nature.

Revision: This process is activated when no similar cases are found in the case comparison or when the actual values for the attributes involved in the prediction process become available. The latter typically occurs when the case has been further processed by the users and the new data have been entered in the database. All extra information is processed. Eventually, a request to modify the parameter settings is generated and sent to the case description process.
Chapter VIII
Customizable Flexible Querying in Classical Relational Databases

Gloria Bordogna, CNR IDPA, Italy
Giuseppe Psaila, University of Bergamo, Italy
Abstract

In this chapter, we present the Soft-SQL project, whose goal is to define a rich extension of SQL aimed at effectively exploiting the flexibility offered by fuzzy set theory to solve practical issues when querying classic relational databases. The Soft-SQL language is based on previous approaches that introduced soft conditions on tuples in the classical relational database model. We retain the main features of these approaches and focus on the need to provide tools allowing users to directly specify the context-dependent semantics of soft conditions. To this end, Soft-SQL provides a command (named CREATE TERM-SET) to define the semantics of linguistic values with respect to a context represented by a linguistic variable (Zadeh, 1975); the SELECT command is extended in order to support soft predicates based on the user-defined term sets, the semantics of grouping and aggregation can be modified, and, finally, the clauses in the SELECT command can be combined effectively.
Introduction

The need to flexibly query relational databases has been widely recognized as a means to improve the effectiveness of retrieval in current systems using SQL for expressing information needs. The main inadequacy of the SQL language is caused by the crisp algebra on which it is founded, which does not support the ranking of the results with respect to their relevance to user needs. In this book, the chapter by Kacprzyk et al. provides an extensive survey on flexible querying approaches.
For many categories of users, the possibility to express tolerant conditions and to retrieve discriminated answers in decreasing order of relevance can greatly simplify users’ tasks that generally are performed through a sequence of trial and error phases. The problem of false drops when querying databases by specifying crisp selection conditions is well known. Several approaches have been proposed either based on preference specifications or on soft conditions tolerating degrees of undersatisfaction to overcome this drawback of SQL language use (Bosc & Pivert, 1992; Dubois
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
& Prade, 1997; Eduardo, Goncalves, & Tineo, 2004; Kießling, 2002, 2003; Petry, 1996; Rosado, Ribeiro, Zadrozny, & Kacprzyk, 2006; Tineo, 2000). The foundations of Kießling for preferences in databases are the basis for an intuitive valuation of search results, in which it is assumed that people naturally express their requests in terms like "I like A better than B." All these preferences can be formulated as strict partial orders. Based on this formulation, the Preference SQL language (Kießling, 2002, 2003) has been defined as an extension of SQL. Several built-in base preference types, combined with the adherence to a declarative SQL programming style, guarantee great programming productivity. Further, the Preference SQL optimizer does an efficient rewriting into standard SQL. Another approach for the specification of preferences in queries is based on soft constraints, that is, tolerant selection conditions formalized within fuzzy set theory (Zadeh, 1965). Several extensions of SQL to allow the specification of soft selection conditions in queries have been proposed. A rich taxonomy that helps in understanding the various proposals of extension of SQL by fuzzy set theory is outlined in Rosado et al. (2006). In Dubois and Prade (1997), two reasons for using fuzzy set theory (Zadeh, 1965) to make querying more flexible are discussed. First, fuzzy sets provide a better representation of the user's preferences.
One reason is that users feel much more comfortable using linguistic terms instead of precisely specified numerical constraints when expressing in a query some condition, such as when asking for some hotel "not too expensive and not too far from the beach." Furthermore, the semantics of these linguistic terms can be exactly "precisiated" (i.e., after Zadeh, 1999, defined as a function on the basic domain of a variable) by fuzzy sets (Zadeh, 1965), so that we can have a price definitely matching or definitely not matching the user's request, but also a price that matches to a certain degree. The second reason is that a direct consequence of having a matching degree is that answers can be ranked according to users' requirements. Furthermore,
the possibility to "precisiate" the semantics of the linguistic terms makes it possible to implement mechanisms that offer users a full control on the semantics of their flexible queries (Bordogna & Psaila, 2005). According to many authors (Bosc & Pivert, 1992, 1995; Kacprzyk & Zadrozny, 1995; Medina, Pons, & Vila, 1994; Petry, 1996), there are two main lines of research in the use of fuzzy set theory in the database management system (DBMS) context. The first one assumes a conventional database and, essentially, develops a flexible querying interface using fuzzy sets, possibility theory, fuzzy logic, and so forth (Bosc & Pivert, 1992, 1995; Bosc, Buckles, Petry, & Pivert, 1999; Dubois & Prade, 1997; Galindo, Medina, Cubero, & García, 2000; Goncalves & Tineo, 2003, 2005; Kacprzyk, Zadrozny, & Ziolkowski, 1989; Ribeiro & Moreira, 1999; Tahani, 1977; Takahashi, 1991, 1995; Tineo, 2000). In the chapter of this book by Urrutia et al., a review of two extensions of SQL, namely FSQL (Galindo, Urrutia, & Piattini, 2006) and SQLf (Bosc & Pivert, 1995), is presented. The second line of research uses fuzzy or possibilistic elements for developing a fuzzy database model to manage imprecise and vague data (Bosc & Prade, 1994; Umano & Fukami, 1994). Also in this case, querying constitutes an important element of the model (Baldwin, Coyne, & Martin, 1993; Bosc & Pivert, 1997a, 1997b; Buckles & Petry, 1985; Buckles, Petry, & Sachar, 1986; Galindo, Medina, & Aranda, 1999; Galindo, Medina, Pons, & Cubero, 1998; Galindo et al., 2006; Prade & Testemale, 1984, 1987; Shenoi, Melton, & Fan, 1990). For a description of a fuzzy extension of SQL that works on crisp and fuzzy relations, see Galindo et al. (2006). However, these proposals missed addressing some practical and not negligible aspects related to the effective usage of the flexible query language.
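Both points can be sketched together: a membership function "precisiates" a linguistic term, and the matching degrees it yields induce a ranking of the answers. The term 'cheap' and its breakpoints below are purely illustrative:

```python
def cheap(price, full_until=60.0, zero_from=120.0):
    # membership of "cheap": 1 up to full_until, 0 from zero_from, linear in between
    if price <= full_until:
        return 1.0
    if price >= zero_from:
        return 0.0
    return (zero_from - price) / (zero_from - full_until)

hotels = [("H1", 55.0), ("H2", 90.0), ("H3", 150.0)]
ranked = sorted(((name, cheap(price)) for name, price in hotels), key=lambda t: -t[1])
print(ranked)  # [('H1', 1.0), ('H2', 0.5), ('H3', 0.0)]
```

A crisp condition such as price < 100 would accept H2 and H1 indistinguishably and reject H3 abruptly; the graded version orders them instead.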
Mainly, they do not exploit one of the main features offered by fuzzy set modeling, that is, the possibility to “precisiate” the semantics of linguistic terms (Zadeh, 1999) used in the flexible queries, thus not making users capable of having
full control of the semantics of the soft conditions they use. In Bordogna and Psaila (2004, 2005) and Galindo et al. (2006), attempts have been made in this direction. However, these proposals do not focus on the need for specific tools to define context-dependent linguistic predicate semantics, so as to adapt the query language to the application context. Let us think about the many interpretations of the term close when used in the context of a spatial database: it can vary depending on the scale of the map (close on a map with a scale 1:1000 vs. close on a scale 1:10.000), on the entities to which it is applied (close between countries vs. close between cities), or on the database itself (close in a cadastral database vs. close in an astronomical database). This consideration is valid for many terms such as cheap, which has a different meaning when buying a ticket for a theater performance or a ticket for a cinema. It may also depend on the intention of the query: cheap for a house to buy is different with respect to cheap for a house to rent. The context of usage of a linguistic term heavily determines its meaning. The proposals defined so far usually assume that linguistic predicates are somehow defined "a priori" outside the query language (usually an extension of the classical SQL SELECT command). Even when user-defined fuzzy predicates can be specified, like in SQLf, there are no specific commands in the query language itself to customize the meaning of the terms. Further, the meaning of the fuzzy predicates is fixed once and for all and cannot be modified depending on the context of its usage. This leads to SQL extensions that are hardly useful from a practical point of view, since they do not provide direct means to the users for explicitly changing or customizing the semantics of linguistic predicates according to the user needs and indications. The proposal in this chapter is about flexible querying in conventional relational databases.
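One way to obtain such context dependence, sketched here with invented numbers, is to make the term definition a function of the context, so that the same word 'close' denotes different membership functions at different map scales:

```python
def make_close(scale):
    """Build a 'close' membership function whose tolerance depends on the map scale."""
    limit = scale / 100.0  # hypothetical rule: fully close up to scale/100 metres
    def close(distance):
        if distance <= limit:
            return 1.0
        if distance >= 2 * limit:
            return 0.0
        return (2 * limit - distance) / limit
    return close

close_at_1000 = make_close(1000)    # fully close up to 10 m
close_at_10000 = make_close(10000)  # fully close up to 100 m
print(close_at_1000(50), close_at_10000(50))  # 0.0 1.0
```

The same distance of 50 m is "not close at all" on the detailed map but "fully close" on the coarse one; a CREATE TERM-SET-style command would let the user declare such parameterizations instead of hard-coding them.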
We addressed practical issues to define an effective flexible query language, specifically:

• To extend SQL so as to allow users to define the semantics of their linguistic predicates;
• To allow a contextualization of the meaning of the defined linguistic terms, so as to automatically modify their semantics depending on the context;
• Finally, to allow the specification of flexible queries by extending the SQL language, drawing on the experience of previous proposals and in particular of SQLf (Bosc & Pivert, 1995).

These objectives are achieved by defining Soft-SQL, an extension of the SQL language for customized flexible querying in classic relational databases. Its main and distinguishing features are the following:

• Queries operate on standard relations and produce standard relations as a result. The attribute "membership degree" of a tuple, which can be used to rank the items reflecting their degree of satisfaction of the query conditions, is dealt with as any other attribute; this allows closure to be achieved;
• Specific commands are provided to define sets of linguistic values (the command named CREATE TERM-SET) and sets of linguistic quantifiers for groups of tuples and complex selection conditions (the command named CREATE LINGUISTIC QUANTIFIER), together with their semantics;
• Furthermore, mechanisms allowing the easy and transparent contextualization of the linguistic values are defined. The SELECT command is extended in order to support context-dependent soft predicates based on the user-defined term sets, to modify the semantics of grouping and aggregation based on basic and user-defined quantifiers, and, finally, to effectively combine some or all the clauses in the SELECT command to achieve a high degree of flexibility and effectiveness.
Examples of usage of these commands will be provided.
Background of the Proposed Soft-SQL

The starting point of our Soft-SQL is the approach defined for the extension of SQL in the conventional relational data model within fuzzy set theory, named SQLf, based on soft conditions on attribute values (Bosc & Pivert, 1995). In SQLf, a soft condition is expressed by means of a linguistic predicate represented by a fuzzy set. SQLf has been defined by extending the relational algebra so as to operate on fuzzy relations. The introduction of soft conditions in SQL and the relational database model is achieved by generalizing a relation r, defined as a subset of D = D1 × D2 × ... × Dn, to a fuzzy relation rf, defined as a fuzzy subset of D; that is, each tuple d of rf is associated with a membership degree µrf(d) in [0,1]. µrf(d) is interpreted as the degree of satisfaction of the linguistic predicates in the query. In order to satisfy the closure property, a regular relation can be seen as a kind of fuzzy relation with all the tuples having the same membership degree equal to 1. For their formal definition, refer to Bosc and Pivert (1995). This is the first difference with respect to our proposal of Soft-SQL, which works on regular relations. A basic block SQLf query first allows specifying a regulation mechanism over the query result in order to control the number of desired items (tuples). This can be done by specifying the maximum desired number N of tuples, a minimum threshold T that each tuple's membership degree must exceed to be included in the result, or both of these values. Further, the basic SQLf (Bosc & Pivert, 1995) query allows specifying soft conditions in the WHERE clause as follows:

select [N | T | N, T] (attributes)
from (relations)
where (fuzzy condition);
in which (fuzzy condition) can combine fuzzy and Boolean basic conditions, linked by connectors (AND, OR, or even a linguistic
quantifier such as most). The use of fuzzy quantifiers to define compound selection conditions has also been proposed in other fuzzy extensions of the SQL language (Galindo et al., 2000; Kacprzyk & Zadrozny, 1997; Kacprzyk & Ziolkowski, 1986; Tineo, 2000). Fuzzy joins are also possible, allowing multiblock queries. In most applications, the membership functions of linguistic predicates such as big and cheap are defined by trapezoidal functions. These definitions are coded in the application. Compound soft conditions can be expressed in the form of logical expressions of elementary conditions, for example, "big AND cheap" or "cheap AND close to the center," which are represented by fuzzy set operations, or by elementary conditions aggregated by a linguistic quantifier, such as "most of (cheap, close to the center)." To illustrate, consider table FLAT, which describes flats in cities and their properties:

FLAT(id: Integer, NumberOfRooms: Integer, City: String, Inhabitants: Integer, DistanceFromCenter: float, Price: float)
An example of an SQLf query on the relation FLAT is:

SELECT 5, 0.6 C.Id
FROM FLAT AS C
WHERE most of (C.Price IS cheap, NumberOfRooms IS big, C.DistanceFromCenter IS close to the center);
in which the soft conditions are expressed by “IS linguistic term.” It imposes the evaluation of the degree of satisfaction of the linguistic predicates, for example, IS cheap, by the values of the attribute C.Price of relation FLAT. The intermediate fuzzy relation resulting from the evaluation of the soft condition is projected on the attribute C.Id and the best five tuples over the threshold 0.6 are returned to the user as a result. The linguistic quantifier most of is evaluated once all the
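This calibration step (best N tuples above a threshold) can be sketched in a few lines. The helper name `calibrate` and the flat identifiers are hypothetical; this is an illustration of the SQLf regulation mechanism, not its actual implementation:

```python
def calibrate(rows, n=None, threshold=None):
    """SQLf-style result calibration (sketch): keep tuples whose membership
    degree passes the optional threshold, rank them by decreasing degree,
    and optionally keep only the best n. `rows` is a list of (tuple, degree)."""
    kept = [(t, d) for t, d in rows
            if d > 0 and (threshold is None or d >= threshold)]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if n is None else kept[:n]

# Hypothetical flat identifiers with their membership degrees:
flats = [("f1", 0.9), ("f2", 0.55), ("f3", 0.7), ("f4", 0.65)]
best = calibrate(flats, n=2, threshold=0.6)  # the two best tuples over 0.6
```

Here `SELECT 5, 0.6 ...` corresponds to `n=2` being replaced by `n=5` and `threshold=0.6`: first discard tuples below the threshold, then keep the best ones.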
single soft conditions in parentheses have been evaluated; it aggregates their degrees of satisfaction to produce an overall degree that is returned as the ranking degree of the tuple. Notice that in this fuzzy extension of SQL, linguistic values such as cheap are specified independently of the context to which they are applied. This can create problems, since the semantics of some linguistic values might depend on the context; for example, the notion of a cheap flat in Milan is not the same as in Paris or New York. Further, users cannot explicitly define the semantics of the linguistic values, since no SQLf command is defined specifically for this purpose. This is a severe limitation of SQLf that makes it inadequate from a practical point of view: as will be discussed in the next section, users may want to control and modify the semantics of linguistic predicates when querying a database. In SQLf, it is also possible to express queries which work on sets of tuples using the GROUP BY and HAVING clauses. This type of query allows expressing queries involving aggregate functions (MIN, MAX, SUM, etc.). In SQL, each partition groups tuples that have the same value on the grouping attribute(s). This functionality is retained in SQLf, where the HAVING clause can be used along with a fuzzy set condition aimed at the selection of partitions. In this respect, various conditions can be formulated, from simple conditions involving aggregate functions to more complex ones involving fuzzy quantifiers (Tineo, 2000). So, for instance, an SQLf query such as the following one can be specified: it looks for the cities with few inhabitants, such that most of the flats in the city are cheap.

SELECT C.City
FROM FLAT AS C
WHERE C.Inhabitants IS few
GROUP BY C.City
HAVING most of C.Price IS cheap;
Also in this case, the semantics of the fuzzy quantifier most of is defined "a priori" in SQLf. This query selects the small cities having most of their flats with cheap prices. One could also ask for flats in the same city having an average price of around 100000 euros, for example, by replacing the last two rows of the previous query with the following ones:

GROUP BY C.City
HAVING avg(C.Price) ≈ 100000;
The basic block SQLf query can also be nested so as to generate queries with an arbitrary level of nesting.
Main Focus of the Chapter

Distinguishing Characteristics of Soft-SQL

In this section, we present our ideas for the definition of a flexible and customizable query language for a conventional relational database management system. We incorporate flexibility in SQL by assuming the extensions of Bosc and Pivert (1995) at the basis of the definition of SQLf and by taking into account issues related to practical aspects of its usage, specifically, the need to provide users with full control over the semantics of their queries, depending on the application and the query context. This last characteristic is mandatory when considering a database in which the semantics of linguistic terms such as cheap, high, far, big, and small refer to different attributes of distinct relations. To clarify, let us consider the terms cheap and close, and the different situations in which their meanings can change.

• Different databases. The meaning of cheap can change if we change the database, because the application context for which the database has been defined is different. For example, cheap buildings has a different semantics when referring to buildings in a cadastral database and when referring to buildings in an estate agency database.
• Within the same database. Within the same database, it is possible to have different meanings for the same linguistic terms, depending on the semantics of the tuples to select. For example, cheap for flats has a different meaning than cheap for villas, although both data sets are stored in the same database, in distinct relations.
• Current selected tuples. The semantics of a linguistic term can be influenced by the currently selected tuples. If one has selected flats in a small city like Bergamo (northern Italy) and specifies a further soft selection condition close to the city center or cheap, the interpretation of the linguistic constraints is likely to be different than in the case in which one is formulating the same selection on flats located in a big and expensive city like Milan. A flat that is 4 km from the center of Milan can be considered close to the center, while 4 km from the center of Bergamo can be considered not completely close. A cheap flat in Milan is likely to be considered very expensive in Bergamo. Further, one user may have in mind to drive from the flat to the center while another may consider walking, and these subjective settings of the human mind also influence the interpretation of closeness. Consequently, close and cheap can be interpreted as relative soft conditions, whose interpretation varies depending on the scope for which the query is formulated. We represent this concept of "query scope" by a parameter that we hereafter name the zooming factor. This notion of zooming factor is intuitive in geographic information systems (GIS), where it indicates the scaling factor of a map visualized on the screen. The higher the zooming factor, the stricter the interpretation of closeness. We can generalize this concept to any linguistic term by taking care of the fact that the zooming factor parameter affects the semantics of the linguistic value so as to make it stricter as the zooming factor increases. So, if we want to modify the interpretation of cheap to reflect the fact that in a small city like Bergamo it is stricter than in a big city like Milan, we can associate Bergamo with a higher zooming factor than Milan. The idea is to derive the zooming factor automatically from the actual values of another attribute of the tuples. In the case of the example, we could compute the zooming factor by applying a function (for example, the average) to the values of the attribute inhabitants of the city, so that a small city would have a lower average population than a big city.

While the first two situations are modeled in FSQL (Galindo et al., 2006), the third one is not considered. In Soft-SQL, preferences on selection conditions are represented by soft conditions as in SQLf, but in order to support customizable context-dependent soft conditions, we designed the language having in mind the following guidelines:

• The semantics of the soft selection conditions must be formalized depending on the context.
• Soft selection conditions should be easily customizable; the language must provide some way to define the semantics of linguistic predicates.
• The closure property of the SQL SELECT command must be strictly preserved; in other words, a SELECT statement takes relations as input and generates a relation as output. No special meaning is attributed to the membership value of a tuple; it is dealt with just as any other attribute.
Soft-SQL allows the user to specify the context-dependent semantics of linguistic predicates by means of two commands: CREATE TERM-SET, for defining the semantics of linguistic values used to specify simple soft conditions, and CREATE LINGUISTIC QUANTIFIER, for defining the semantics of linguistic quantifiers used to specify compound soft conditions and also fuzzy aggregation functions. These user-defined linguistic predicates can be specified at distinct levels in Soft-SQL queries on classic relational databases: in the extended basic SQL SELECT command, in the Soft-SQL GROUP BY clause, in the extended SQL HAVING clause, and in the extended aggregate functions (such as the Soft-SQL COUNT). This way, the user has full control of the flexible query language, being able to fully customize the query; in addition, the user can use a linguistic value with distinct meanings in the same application, depending on the chosen reference attribute and query scope. In Soft-SQL, like in SQLf, soft conditions are expressed through linguistic predicates identifying fuzzy subsets of the attribute domains and are specified in the WHERE clause of the extended SQL query. Differently from SQLf, we do not produce fuzzy relations as results of queries, but ordinary relations. This way, the membership degree of a tuple is dealt with as any other attribute of a tuple. Besides, the context of the linguistic predicates is also specified in the soft condition, so that it is possible to choose the proper interpretation of the linguistic value. In the following, we first introduce the definition of the command to define linguistic values and customize their semantics. Then, we introduce the command to define linguistic quantifiers. Finally, we define the SELECT query command.
Customized Linguistic Predicates

Suppose the user wishes to query the database based on a linguistic concept, for example, "the price is cheap." The main problem that arises is: how is it possible to define the semantics of the linguistic concept "cheap" for prices? In this case, the key to flexibility is the possibility of defining linguistic concepts appropriate for the specific application context. Then, consider the case of storing data about flats for sale in the database. Suppose the user wishes to define a set of linguistic terms for price levels, such as "expensive" and "cheap." The semantics of a linguistic term might be defined by a trapezoidal function, normalized within the range [0, 1] (see Figure 1a). When defining the linguistic terms, we can consider that (in Europe and North America) price levels might be considered between 0 and 1 million euros. Soft-SQL provides commands for defining term sets and linguistic predicates. The term set named PriceLevels, following the previous considerations, can be defined as follows:

CREATE TERM-SET PriceLevels
NORMALIZED WITHIN (0, 1000000)
EVALUATE Price WITH PARAMS Price AS FLOAT,
VALUES ('expensive', (0.6, 0.7, 0.85, 1), *)
       ('cheap', (0, 0, 0.2, 0.4), *);
Figure 1. Trapezoidal membership function of the linguistic term cheap: (a) cheap is defined on the unit interval; (b) cheap is rescaled on the absolute domain [0, 1 million €].
The NORMALIZED WITHIN clause defines the normalization range; in this term set, the evaluation range is normalized between 0 and 1000000 euro; values less than 0 are treated as 0, while values greater than 1000000 are treated as 1000000. The EVALUATE clause specifies the type of the parameter that, after normalization, is subjected to the soft condition specified by the defined linguistic terms. In this case, the type of the argument of the soft condition is a floating point value named Price. Finally, the two linguistic terms “expensive” and “cheap” are defined: the name is followed by four values (in the range [0, 1]) that are the x coordinates of, respectively, the bottom left side corner, the top left side corner, the top right side corner, and the bottom right side corner of the trapezoidal function (see the trapezoidal function associated with cheap in Figure 1 and the section titled Definition of the CREATE TERM-SET Command for more details). A linguistic term is exploited to query tables by specifying a soft predicate in conditions through the SELECT command (in the WHERE and HAVING clauses). For instance, suppose the user wishes to query table Flat in order to find cheap flats. The WHERE clause might be the following: WHERE Price IS ’cheap’ IN PriceLevels
The new IS ... IN operator allows specifying soft predicates. The condition reported above says that values of the attribute Price are checked against the trapezoidal function defined for the linguistic value 'cheap' in the term set PriceLevels. Suppose a flat's price is 700000 euros: based on the definition of the linguistic term (the function is depicted in Figure 1) and on the normalization range, the degree of satisfaction is 0, so the flat is not cheap. If the price is 100000 euros, the satisfaction degree is 1, so the flat is truly cheap. If the price is 300000 euros, the satisfaction degree is 0.5, so the flat is partially cheap. However, the same linguistic terms might be exploited for a more sophisticated evaluation mechanism. For instance, suppose the user looking for flats wants to compare
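The degrees just mentioned follow from straightforward trapezoidal arithmetic. The sketch below is an illustrative reconstruction (not the chapter's implementation): it clamps a raw value to the normalization range, rescales it into [0, 1], and evaluates the trapezoid defined for cheap.

```python
def trapezoid(x, a, b, c, d):
    """Degree of membership of x in the trapezoid with corners (a, b, c, d):
    rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

def term_degree(value, lo, hi, corners):
    """Clamp value to the normalization range [lo, hi], rescale to [0, 1],
    then evaluate the linguistic term's trapezoidal function."""
    clamped = min(max(value, lo), hi)
    return trapezoid((clamped - lo) / (hi - lo), *corners)

CHEAP = (0.0, 0.0, 0.2, 0.4)  # 'cheap' from the PriceLevels term set
```

With the (0, 1000000) normalization range this reproduces the degrees discussed in the text: 700000 → 0, 100000 → 1, 300000 → 0.5.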
the difference between the price and a base level in order to know whether the difference in price is cheap or expensive. In practice, a function f2(price, base) = price − base can be defined; the value provided by this function is normalized and then checked against the trapezoidal function associated with the linguistic values. Thus, the EVALUATE clause actually defines one or more evaluation functions, as in the following enriched definition of the term set PriceLevels:

CREATE TERM-SET PriceLevels
NORMALIZED WITHIN (0, 1000000)
EVALUATE Price WITH PARAMS Price AS FLOAT
EVALUATE (price - base) WITH PARAMS price AS FLOAT, base AS FLOAT,
VALUES ('expensive', (0.6, 0.8, 1, 1), *)
       ('cheap', (0, 0, 0.2, 0.4), *);
Two evaluation functions are defined: the first one is simple; the value of the parameter Price is evaluated as it is against the trapezoidal function. The second evaluation function is based on two parameters (price and base), and the difference between them is normalized and evaluated. When the term set is exploited in queries (within the SELECT command by means of the IS ... IN operator), the system checks the number and the type of the parameters, determining which evaluation function to apply. The "*" in the definition of the linguistic value semantics is associated with a modifier function (in the specific case of "*", a product; two other modifier functions are possible, "-" and "+", which define a left or right translation of the trapezoidal function on its domain) and is used to specify, in the basic block Soft-SQL query, that the semantics of the linguistic value must be made dependent on the context, that is, on the zooming factor (a detailed description of modifiers is in the section titled Definition of the CREATE TERM-SET Command).
For example, the following WHERE clause exploits the second evaluation function in order to obtain flats for which it is necessary to add a cheap amount of money w.r.t. the base price of 300000 euros.

WHERE (Price, 300000) IS 'cheap' IN PriceLevels
The system matches the pair (Price, 300000) with the evaluation functions defined in the term set and applies the one (if defined) compatible with the pair.
Soft-SQL Basic Queries and Term Sets

Consider now a simple, but complete, query written by means of the extended SELECT command. We want to select cheap flats in Rome.

SELECT Id, Price
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels;
The WHERE clause now specifies a compound soft condition in which a crisp predicate taking values in {0,1} is conjoined with a soft predicate taking values in [0,1]. Based on the fuzzy AND, evaluated as the minimum of the two membership degrees (the maximum, in case of disjunction), we compute the membership value of tuples; then, only tuples having a membership degree greater than 0 are selected; finally, these tuples are projected on the attributes Id and Price. Thus, the result of the query is the set of flats that are in Rome and are cheap (fully or partially). This way, the query is flexible: the user obtains not only flats with price less than or equal to 200000 euros, but also flats with prices such as 250000 or 350000 euros: they are not exactly cheap, but their price is still close to being cheap as far as the concept of a cheap flat in a city goes, and the user might perhaps find them interesting. However, the membership degree of selected tuples, which might
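Under this min/max semantics, the combination of crisp and soft predicates can be sketched as follows (illustrative only; the literal values stand in for the FLAT example):

```python
# Fuzzy connectives under the standard min/max semantics.
def fuzzy_and(*degrees):
    return min(degrees)

def fuzzy_or(*degrees):
    return max(degrees)

def crisp(condition):
    """A crisp predicate yields a degree in {0, 1}."""
    return 1.0 if condition else 0.0

# A flat in Rome whose price is cheap to degree 0.5:
degree_in_rome = fuzzy_and(crisp("Rome" == "Rome"), 0.5)
# A flat outside Rome is rejected regardless of its price degree:
degree_elsewhere = fuzzy_and(crisp("Milan" == "Rome"), 0.9)
```

A crisp predicate thus acts as a hard filter (degree 0 eliminates the tuple), while the soft predicate grades the surviving tuples.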
add useful information, is not produced by the previous query, and tuples are not ordered with respect to their membership degree as occurs in SQLf. In Soft-SQL, we followed the approach that queries generate classical relational tables. So, if the user wants to obtain the membership degree of tuples, the user should obtain this value as a classical attribute, by using a special keyword, as shown in the following query.

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels;
The keyword DEGREE refers to the membership degree of tuples. Consequently, the generated table contains three attributes, that is, the flat identifier, the flat price, and the degree of satisfaction (attribute D) of the selection condition. For example, a flat with price 300000 euros has 0.5 as membership degree, meaning that the price is not exactly cheap, but still close to being considered cheap. The membership degree is an important measure and can be exploited to better select tuples: in effect, it can be used to rank tuples, denoting how much they satisfy the selection conditions. To this end, the ORDER BY clause can be exploited, as the following query does, taking, for instance, the five best results.

SELECT TOP 5 Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels
ORDER BY DEGREE DESC;
Observe that, w.r.t. the syntax of SQLf, the query is based on a specific keyword, the TOP keyword. When TOP n is specified, the query takes the first n sorted tuples in the result table. The TOP keyword is general and not specifically designed to deal with membership degrees. Therefore, no special parameters must be added to the query, since ORDER BY and TOP operate on crisp relations as well.
Let us now formulate a query in which we want to modify the semantics of cheap depending on the context: given that Rome is a big city, we want to dilate the definition of cheap with respect to its standard definition, so as to be able to consider as cheap also prices for flats that are commonly not considered cheap. This can be done by specifying the optional parameter ZOOMING as follows:

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels ZOOMING 0.5;
Having specified a factor of ZOOMING=0.5, together with the modifier "*" in the definition of the linguistic value semantics, we indicate that the actual price must be multiplied by 0.5 before evaluating its satisfaction of the soft condition cheap. So, if in the common case a price of 300000 euros is cheap to a degree of 0.5, with ZOOMING=0.5 we can say that a price of 600000 euros is still cheap to the degree 0.5. Conversely, by specifying ZOOMING=2, we restrict the concept of cheap, so that a price of 150000 is cheap only to the degree 0.5. Finally, it is possible to select only tuples having a membership degree greater than or equal to a specified threshold by means of the DEGREE THRESHOLD subclause, as shown in the following query.

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels ZOOMING 0.5
DEGREE THRESHOLD 0.8;
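With the "*" modifier, zooming simply rescales the raw value before the normalized trapezoid is evaluated. The following sketch is an illustrative reconstruction under that assumption (the "-" and "+" modifiers, which translate the trapezoid instead, are omitted):

```python
def trapezoid(x, a, b, c, d):
    # Trapezoidal membership function on the normalized domain [0, 1].
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

def zoomed_degree(value, zoom, lo, hi, corners):
    """Apply the "*" modifier: multiply the raw value by the zooming
    factor, clamp to [lo, hi], normalize, and evaluate the trapezoid.
    zoom > 1 makes the term stricter; zoom < 1 dilates it."""
    v = min(max(value * zoom, lo), hi)
    return trapezoid((v - lo) / (hi - lo), *corners)

CHEAP = (0.0, 0.0, 0.2, 0.4)  # 'cheap' from PriceLevels
```

This reproduces the text's figures: with ZOOMING 0.5, a price of 600000 euros is cheap to degree 0.5; with ZOOMING 2, a price of 150000 euros is cheap only to degree 0.5.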
After the selection condition is evaluated (and a membership degree is associated with the tuples), only tuples having membership greater than or equal to 0.8 are selected. For reader information, FSQL (Galindo et al., 2006) allows specifying threshold degrees for single conditions or groups
of conditions. We prefer to apply the threshold degree to a tuple when the whole condition has been evaluated, in order to obtain a more natural and intuitive extension of the classical SQL.
Linguistic Quantifiers

Suppose now the user wants to select tuples by evaluating their membership degree in a more flexible way; for instance, the user might want to select flats for which most of the following three conditions C1, C2, and C3 are satisfied:

C1: the flat is cheap;
C2: the flat is big (in terms of number of rooms);
C3: the flat is close to the center.

The concept of quantified condition (most of a set of conditions are satisfied) is also available in SQLf and is based on the concept of linguistic quantifier. Thus, to improve flexibility, the user should be provided with a command to define linguistic quantifiers. We therefore introduced the CREATE LINGUISTIC QUANTIFIER command, which allows defining relative linguistic quantifiers as defined by Zadeh (1983).

CREATE LINGUISTIC QUANTIFIER most VALUES (0.45, 0.65, 1, 1);
CREATE LINGUISTIC QUANTIFIER almost_all VALUES (0.9, 1, 1, 1);
The two instructions above define two quantifiers, named most and almost_all, respectively. The four values following the VALUES keyword define, as for linguistic values in term sets, a trapezoidal membership function µquantifier normalized within the range [0, 1] that represents the semantics of the quantifier (for instance, the function µalmost_all is the membership function for the quantifier almost_all). We rely on the OWA semantics introduced by Yager (1988) for the evaluation of quantified soft conditions in the WHERE clause. This choice is motivated by the fact that in this context the quantifier is used to aggregate conditions, and thus this definition is more adequate than Zadeh's definition. To derive the weighting vector W = [w1, ..., wn] of the OWA operator, given the membership function µQ of the relative nondecreasing quantifier defined by the CREATE LINGUISTIC QUANTIFIER command and n, the number of soft conditions to aggregate, we compute the following (Yager, 1994):

wi = µQ(i/n) − µQ((i−1)/n), with i = 1, …, n

Then we apply the OWA operator defined by the weighting vector W to the degrees of satisfaction µc1(t), …, µcn(t) of the soft conditions c1, …, cn by each tuple t:

OWAQ(µc1(t), …, µcn(t)) = ∑i=1,…,n wi · bi

with bi being the i-th greatest of the set µc1(t), …, µcn(t). As in SQLf, quantifiers can be exploited in queries. The question on which we based the above example is then expressed by means of the following query:

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND QUANTIFIED most (
    Price IS 'Cheap' IN PriceLevels ZOOMING 0.5,
    NumberOfRooms IS 'Big' IN RoomNumbers ZOOMING 2,
    DistanceFromCenter IS 'close' IN CityDistances ZOOMING 0.5)
ORDER BY DEGREE DESC;
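The weight derivation and aggregation can be sketched as follows (an illustrative reconstruction; most uses the trapezoid (0.45, 0.65, 1, 1) defined above):

```python
def trapezoid(x, a, b, c, d):
    # Trapezoidal membership function on [0, 1].
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

def owa(quantifier, degrees):
    """Yager's OWA aggregation guided by a relative nondecreasing
    quantifier: w_i = uQ(i/n) - uQ((i-1)/n), applied to the degrees
    sorted in decreasing order."""
    n = len(degrees)
    weights = [trapezoid(i / n, *quantifier) - trapezoid((i - 1) / n, *quantifier)
               for i in range(1, n + 1)]
    ranked = sorted(degrees, reverse=True)
    return sum(w * b for w, b in zip(weights, ranked))

MOST = (0.45, 0.65, 1.0, 1.0)
```

For three conditions, µQ(1/3) = 0 and µQ(2/3) = 1, so W = [0, 1, 0]: with this definition of most, the aggregate is simply the second-highest satisfaction degree.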
The QUANTIFIED predicate evaluates the most quantifier on the three conditions listed in parentheses. Further, the QUANTIFIED predicate can be freely composed with other predicates. Observe that, since we are interested in flats in a big and expensive city such as Rome, we have modified the standard definitions of the linguistic values to fit the specific context: specifically, we have dilated the semantics of cheap for prices, made stricter the definition of big for flats, and dilated the definition of close.
Groups

Soft-SQL extends the GROUP BY clause in order to cope with membership degrees of tuples. The question is: what happens when tuples are grouped together? What is the membership degree of the overall group? Soft-SQL provides different semantics, named SAFE, OPTIMISTIC, and AVERAGE. The SAFE semantics assigns a group the minimum membership degree of its tuples; the OPTIMISTIC semantics assigns a group the maximum membership degree of its tuples; the AVERAGE semantics assigns a group the average of the membership degrees of its tuples. The different semantics give different relevance to groups. The SAFE semantics behaves as a conjunction and can be used to obtain a strict evaluation of groups, based on the worst representative. For instance, consider the following query (notice the WITH SAFE DEGREE option).

SELECT City, DEGREE AS D
FROM FLAT
WHERE Price IS 'Cheap' IN PriceLevels DEGREE THRESHOLD 0.8
GROUP BY City WITH SAFE DEGREE;
This can also be expressed in SQLf. Given a city, the minimum degree among the tuples describing flats in that city is taken as the degree of the overall group; consequently, if the degree of one city (let us denote it as city1) is 0.85 and the degree of a second city (city2) is 0.95, this means that all flats in city2 have a membership degree of at least 0.95; the user might find city2 more interesting than city1, since flats available in city2 are generally cheaper than flats available in city1. If we change the GROUP BY clause of the previous query to:

GROUP BY City WITH OPTIMISTIC DEGREE
we adopt the optimistic semantics: in this case, the membership degree of the group is the membership degree of the best representative of the group. Consider again the two sample cities city1 and city2: with the optimistic semantics we might obtain, for instance, a membership degree of 1 for city1 and of 0.98 for city2. This means that city1 has at least one fully cheap flat, while city2 does not have a fully cheap flat. Finally, the AVERAGE semantics considers the average membership degree of the tuples in a group as the membership degree of the overall group. This is very useful to evaluate the average strength of a group. Observe that this semantics corresponds to the notion of (relative) cardinality of a fuzzy set. Nevertheless, one could also exploit a linguistic quantifier such as most to compute the semantics of the GROUP BY clause based on a trade-off between a risk-taking and a risk-averse attitude, such as:

GROUP BY City WITH most DEGREE
in which most is the linguistic quantifier previously defined. The semantics of these Soft-SQL queries can also be expressed in SQLf. The difference with respect to SQLf is that, in the context of the GROUP BY, we do not evaluate the linguistic quantifier based on the OWA operator, since it is too costly, given that each group can generally contain a large number of tuples. We adopt the OWA definition of linguistic quantifiers in the context of the QUANTIFIED predicates, where the number of soft condition satisfaction degrees to be aggregated by the OWA operator is limited. In contrast, in the GROUP BY, it is more intuitive to directly use the definition of linguistic quantifiers given by Zadeh (1983) and to evaluate the trapezoidal membership function µQ associated with the quantifier Q on the numeric fuzzy cardinality of each group: the average membership degree Σ of the tuples in the group is computed (Σ is the relative cardinality of the fuzzy set); then, the overall membership degree of the group is given by µQ(Σ).
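The four group-degree semantics can be sketched in a few lines (illustrative only; the quantifier branch follows Zadeh's evaluation, i.e., µQ applied to the mean degree of the group):

```python
def trapezoid(x, a, b, c, d):
    # Trapezoidal membership function on [0, 1].
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

def group_degree(degrees, semantics="SAFE", quantifier=None):
    """Membership degree of a group of tuples under Soft-SQL's
    SAFE / OPTIMISTIC / AVERAGE / quantifier semantics (sketch)."""
    if semantics == "SAFE":
        return min(degrees)             # worst representative
    if semantics == "OPTIMISTIC":
        return max(degrees)             # best representative
    mean = sum(degrees) / len(degrees)  # relative fuzzy cardinality
    if semantics == "AVERAGE":
        return mean
    return trapezoid(mean, *quantifier)  # Zadeh: uQ(mean)

MOST = (0.45, 0.65, 1.0, 1.0)
```

For example, a group with degrees (0.2, 0.6, 0.7) has mean 0.5, so under WITH most DEGREE its overall degree is µmost(0.5) = 0.25.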
HAVING Clause

Soft-SQL redefines the semantics of the HAVING clause along the lines of SQLf. Similarly to the WHERE clause, we might have soft predicates based on linguistic predicates (by means of the IS ... IN operator); furthermore, the membership degree of a group before group selection might not be 1. In this case, the membership degree of a group, after group selection, is the minimum value between the original membership degree and the one obtained by evaluating the having condition. In fact, the HAVING clause plays the role of a further selection of groups: after grouping, groups are further evaluated and selected based on the clause; consequently, it is intuitive to take the minimum membership degree between the original one and the one obtained by the condition (like an AND). Again, as a straightforward extension, we allow the user to specify a membership degree threshold for groups (similarly to the WHERE clause): if no minimum threshold is specified, groups with a membership degree greater than 0 are selected; otherwise, groups with a membership degree greater than or equal to the specified threshold are selected. This way, the HAVING clause is coherent with its role: it is a group selection condition. Consequently, since groups have a membership degree, both predicates based on the IS ... IN operator and on aggregate functions (see next section) can be expressed, and groups can be selected based on the specified minimum threshold for group membership degrees. The result is that the WHERE and HAVING clauses are fully orthogonal and can be freely composed to write complex queries.
Flexible Aggregate Functions

When defining Soft-SQL, we considered aggregate functions as well. What happens when aggregate functions are applied to a set of tuples with membership degrees? What is the membership degree of the aggregation? We found the answer in the different semantics introduced for groups: it is possible to choose whether to evaluate the membership degree by means of the SAFE, OPTIMISTIC, or AVERAGE semantics, or of whatever linguistic quantifier Q defined by the user. To illustrate, consider the following basic query, that selects cheap flats in Rome.

SELECT Id, Price, DEGREE AS D
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels;

If we want to count the number of retrieved flats, we may transform the query as follows by means of the COUNT aggregate function.

SELECT COUNT(*) AS C
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels;

But counted tuples may have a membership degree less than 1; thus, what is the membership degree of the overall set of tuples? By applying the SAFE semantics, the degree is the degree of the worst representative. In the specific case, a degree of 0.5 might say that the set of flats is not satisfactory, since at least one flat is not very cheap. By applying the OPTIMISTIC semantics, the degree is the degree of the best representative. In the specific case, the same set of tuples might have an optimistic degree of 1, meaning that the user is fully satisfied because there is at least one fully cheap flat in Rome. The AVERAGE semantics considers the average of the membership degrees as the degree of the aggregate set of tuples, that is, an average measure of the relevance of the set of tuples. In the specific case, the same set of tuples might have an average degree of 0.8, meaning that, on average, the found flats are quite cheap, or even fully cheap. Finally, in the most general case of the trade-off semantics defined by a user-defined quantifier Q, the degree of the counted tuples is computed as for the groups described in the previous section. In the specific case, the same set of tuples might have a degree of 0.6, meaning that Q of the found flats are quite cheap, or even fully cheap.

To denote the chosen semantics, the syntax of an aggregate function has been extended. Then, the following aggregate functions:

COUNT(* WITH SAFE DEGREE)
COUNT(* WITH OPTIMISTIC DEGREE)
COUNT(* WITH AVERAGE DEGREE)
COUNT(* WITH Q DEGREE)

obtain the number of tuples in the set of tuples with the associated SAFE, OPTIMISTIC, AVERAGE, and Q-quantified membership degree, respectively. The behavior of the other aggregate functions is straightforward; however, when aggregate functions consider specific values, only the degrees of tuples having those values are considered. For example:

COUNT(Price WITH SAFE DEGREE)
SUM(Price WITH SAFE DEGREE)
AVG(Price WITH SAFE DEGREE)

consider, for computing the overall membership degree, the degrees of tuples with a not null value for attribute Price. Furthermore, for computing the overall membership degree, the functions

MIN(Price WITH SAFE DEGREE)
MAX(Price WITH SAFE DEGREE)
consider only the degrees of the tuples having the minimum (respectively, maximum) value for attribute Price. This choice is coherent with the notion of aggregate function: because the overall set of tuples is represented by one single value that corresponds to the minimum (respectively, maximum) value, only the degrees of representative tuples are considered. This characteristic of Soft-SQL is novel and not present in previous extensions of SQL.
Table 1. Example of a relation

Id    DistanceFromCenter    DEGREE
101   2.5                   0.9
102   5.2                   0.8
103   2.5                   1.0
Examples

Suppose we have the set of tuples shown in Table 1, with membership degrees DEGREE, obtained after the application of a soft selection condition. Table 2 shows the set of aggregate functions and the returned values with membership degree µ. Observe that the degree for the MIN aggregate function is computed considering only the tuples having the minimum value, while for the COUNT function all tuples are considered. Also notice the meaning of the different degrees: the SAFE quantifier summarizes the worst satisfaction of the selection conditions, the OPTIMISTIC quantifier summarizes the best satisfaction (there is at least one tuple fully satisfying the selection conditions), and the AVERAGE quantifier shows the average behavior of the tuples w.r.t. the selection conditions. We could even specify a user-defined degree through a linguistic quantifier Q.
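The degrees in Table 2 can be reproduced mechanically from the Table 1 rows; a small Python sketch, assuming the representative-tuple rule described above for MIN:

```python
# Rows of Table 1: (Id, DistanceFromCenter, DEGREE)
rows = [(101, 2.5, 0.9), (102, 5.2, 0.8), (103, 2.5, 1.0)]

def combine(degrees, quantifier):
    """Degree of an aggregate value under the three basic quantifiers."""
    if quantifier == "SAFE":
        return min(degrees)               # worst representative
    if quantifier == "OPTIMISTIC":
        return max(degrees)               # best representative
    return sum(degrees) / len(degrees)    # AVERAGE

# COUNT(*): every selected tuple contributes its degree.
count_deg = [mu for _, _, mu in rows]
# MIN(DistanceFromCenter): only tuples holding the minimum value contribute.
m = min(dist for _, dist, _ in rows)
min_deg = [mu for _, dist, mu in rows if dist == m]

for q in ("SAFE", "AVERAGE", "OPTIMISTIC"):
    print(q, combine(min_deg, q), combine(count_deg, q))
```

With these rows, MIN yields 0.9 / 0.95 / 1.0 and COUNT yields 0.8 / 0.9 / 1.0, matching Table 2 (up to floating-point rounding).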
Examples

Suppose the user wants to know how many quite cheap flats there are in Rome (note the degree threshold 0.8, which captures the idea of a "quite cheap" flat).
SELECT COUNT(* WITH OPTIMISTIC DEGREE) AS items, DEGREE
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels ZOOMING 2
DEGREE THRESHOLD 0.8;
Since we are interested in understanding the goodness of the selected items, we specify the OPTIMISTIC quantifier: we obtain the maximum membership degree, which denotes the degree of the best selected item w.r.t. the selection condition. Suppose now we want to know the number of selected flats and the minimum distance from the center among tuples satisfying the query with a membership degree of at least 0.8.

SELECT COUNT(* WITH OPTIMISTIC DEGREE) AS items,
       MIN(DistanceFromCenter WITH SAFE DEGREE) AS MinDist,
       DEGREE WITH AVERAGE DEGREE
FROM FLAT
WHERE City='Rome' AND Price IS 'Cheap' IN PriceLevels
DEGREE THRESHOLD 0.8;
W.r.t. the previous query, we added the aggregate function that computes the minimum distance from the center with the SAFE quantifier. This means that we consider the distance from the center to be an important parameter; thus, the strength of the overall set of selected tuples depends on
Table 2. Example of membership degrees for aggregate functions

Function                                          Value   µ
MIN(DistanceFromCenter WITH SAFE DEGREE)          2.5     0.9
MIN(DistanceFromCenter WITH AVERAGE DEGREE)       2.5     0.95
MIN(DistanceFromCenter WITH OPTIMISTIC DEGREE)    2.5     1.0
COUNT(* WITH SAFE DEGREE)                         3       0.8
COUNT(* WITH AVERAGE DEGREE)                      3       0.9
COUNT(* WITH OPTIMISTIC DEGREE)                   3       1.0
the minimum degree of tuples having the lowest distance. Then, we have to choose the final degree. In these situations, we can again decide which semantics to apply, that is, SAFE, OPTIMISTIC, or AVERAGE, because this situation can be seen as an aggregation as well. The chosen semantics is applied to the degrees of the aggregate functions appearing in the SELECT clause. In the example, we choose the AVERAGE degree, because we evaluate the strength of the set of selected tuples by combining both parameters. If the selected set of tuples were the one shown in the example in Table 1, the final membership degree would be 0.95, that is, the average between 1.0 (the COUNT function) and 0.9 (the MIN function). Flexible aggregation semantics can be exploited when the GROUP BY clause appears in the query as well. Consider the following query.

SELECT City
FROM FLAT
WHERE Price IS 'Cheap' IN PriceLevels
  AND DistanceFromCenter IS 'Close' IN Distances
GROUP BY City WITH SAFE DEGREE
HAVING COUNT(* WITH AVERAGE DEGREE) <= 10
DEGREE THRESHOLD 0.8;
The query selects all flats that are approximately cheap and close to the city center, independently of the city. Then, it groups the selected flats by city name. At this point, we are interested in cities in which all tuples (see the SAFE quantifier) satisfy the selection criteria, but the number of selected flats in each city must not exceed 10. We want a city to be represented by the average membership degree, and we keep cities with a membership degree of at least 0.8; this way, we obtain cities where it is possible to find a limited number of purchase proposals for flats, such that the average strength of these proposals is rather high. Notice that the membership degree of the HAVING clause is determined by the aggregate function: the COUNT function is associated with the average membership degree of tuples in a group; let us
suppose this membership degree is 0.9. Then, the group membership degree (obtained by applying the SAFE semantics after grouping), let us say 0.85, is combined with 0.9, and the new membership degree for the group is 0.85 (the minimum of the two). Thus, the group is selected (because 0.85 is greater than the specified minimum threshold). The reader can notice that the degree of groups and of tuples is directly affected by quantifiers, aggregations, and condition evaluations. In effect, we do not allow the use of the special DEGREE attribute in clauses other than the SELECT and ORDER BY clauses. But even in the SELECT clause, it cannot be used in expressions or aggregate functions, but only to generate the result table (this is why it is not possible to write MAX(DEGREE) in the HAVING and SELECT clauses). To better understand this point, and to illustrate the use of aggregate functions, consider the following query.

SELECT City,
       MAX(Price WITH SAFE DEGREE) AS P,
       MIN(Rooms WITH OPTIMISTIC DEGREE) AS R,
       DEGREE WITH AVERAGE DEGREE
FROM FLAT
WHERE Price IS 'Cheap' IN PriceLevels
  AND DistanceFromCenter IS 'Close' IN Distances
GROUP BY City WITH SAFE DEGREE
HAVING COUNT(* WITH AVERAGE DEGREE) <= 10
DEGREE THRESHOLD 0.8;
The query is obtained from the previous one by adding the two aggregate functions to the SELECT clause. Furthermore, the AVERAGE semantics is chosen for obtaining the final degree. Consider again the selected group, having 0.85 as membership degree. Suppose MAX(Price WITH SAFE DEGREE) has a membership degree of 0.95 (the minimum membership degree of tuples with the maximum value for attribute Price). Then, suppose MIN(Rooms WITH OPTIMISTIC DEGREE) has a membership degree of 0.9 (the maximum membership degree of tuples
having the minimum value for attribute Rooms). Computing the average of 0.85, 0.95, and 0.9, the new membership degree for the tuples generated in the result table is 0.9. In practice, this way it is possible to evaluate a combination of features, obtaining a membership degree (in the specific case based on the AVERAGE semantics) that considers them all together.
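The combination just illustrated is simple arithmetic over the group degree and the aggregate-function degrees; a sketch reproducing the numbers above (quantifier names as in Soft-SQL):

```python
def final_degree(group_mu, agg_mus, quantifier="AVERAGE"):
    """Combine a group's degree with the degrees of its aggregate functions."""
    degrees = [group_mu, *agg_mus]
    if quantifier == "SAFE":
        return min(degrees)
    if quantifier == "OPTIMISTIC":
        return max(degrees)
    return sum(degrees) / len(degrees)   # AVERAGE

# Group degree 0.85; MAX(Price WITH SAFE DEGREE) -> 0.95;
# MIN(Rooms WITH OPTIMISTIC DEGREE) -> 0.9; the AVERAGE of the three is 0.9.
print(final_degree(0.85, [0.95, 0.9]))
```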
Definition of the CREATE TERM-SET Command

Let us now define the CREATE TERM-SET command, that allows the definition of term sets with linguistic values and their semantics as trapezoidal functions.

CREATE TERM-SET name
[ NORMALIZED WITHIN ( min, max ) ]
( EVALUATE evaluation-function WITH PARAMS list-of-params )+
VALUES list-of-linguistic-value-definitions
• name is the name of the term set under definition.
• The VALUES clause defines the set of linguistic values within the term set, where list-of-linguistic-value-definitions is a non-empty list of linguistic-value-definition. Each linguistic-value-definition is a triple:

(linguistic-value, meaning, modifier)

• linguistic-value is a string identifying a linguistic value; for example, 'very-far'.
• meaning is a 4-tuple

(left-bottom-corner, left-top-corner, right-top-corner, right-bottom-corner)

of ordered values in the range [0,1] (it must be left-bottom-corner ≤ left-top-corner ≤ right-top-corner ≤ right-bottom-corner), and it defines the trapezoidal membership function µlv of the fuzzy set identified by linguistic-value. The domain of the membership function is normalized within the range [0,1]. For example, we can define meaning = (0.45, 0.75, 1, 1) for the linguistic value 'far'.
• modifier is the name of a modifier operator, that is, a string chosen in the predefined set {'*', '+', '-'}, and identifies a modifier operator that can be either a product or a translation operator applied to the values to be evaluated. It specifies how the values to be evaluated by the membership function µlv associated with linguistic-value must be modified by the current zooming factor x (optionally specified in the IS .. IN .. [ZOOMING x] operator) before being evaluated.
The same term set can be evaluated in the query (by the IS .. IN .. [ZOOMING x] operator) over different data types. This is allowed by the non-empty list (in the syntax denoted as “( … )+”) of EVALUATE clauses: each occurrence defines a specific evaluation function (in the syntax denoted as evaluation-function) to be computed, whose values are “compared” with the linguistic values. This way, depending on the data type of the expression to be evaluated against the linguistic value, the proper function is applied. Note that the WITH PARAMS subclause defines the list of formal parameters appearing in the expression. The possibly missing (in the syntax denoted as “[ … ]”) NORMALIZED WITHIN clause specifies how to normalize the value v obtained by evaluating the evaluation-function. If the NORMALIZED WITHIN clause is specified, the specified range [min, max] is mapped to the range [0,1], and v is mapped accordingly; if v is less than (greater than) min (max), it is always evaluated as min (max). If the NORMALIZED WITHIN clause is not present, by default it is min=0 and max=1.
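The NORMALIZED WITHIN behavior just described (clamp the evaluated value to [min, max], then rescale that range onto [0,1]) can be sketched as follows; the price range is a hypothetical example, not one from the chapter:

```python
def normalized_within(v, lo=0.0, hi=1.0):
    """Clamp v to [lo, hi], then map that range linearly onto [0, 1]."""
    v = max(lo, min(v, hi))
    return (v - lo) / (hi - lo)

# Hypothetical term set over prices, NORMALIZED WITHIN (0, 500000):
print(normalized_within(200000, 0, 500000))  # 0.4
print(normalized_within(600000, 0, 500000))  # 1.0 (clamped to max)
```

The defaults lo=0, hi=1 mirror the behavior when the NORMALIZED WITHIN clause is absent.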
Definition of the CREATE LINGUISTIC QUANTIFIER Command

Let us now define the CREATE LINGUISTIC QUANTIFIER command, that allows the definition of relative monotone nondecreasing linguistic quantifiers on the unit domain [0,1]. We do not allow absolute linguistic quantifiers (Zadeh, 1983), since the use of relative quantifiers is much more flexible and intuitive, especially in aggregate functions. Its syntax is the following:

CREATE LINGUISTIC QUANTIFIER name
VALUES (left-bottom-corner, left-top-corner, right-top-corner, right-bottom-corner)
The semantics µQ of a relative nondecreasing linguistic quantifier is defined in terms of a trapezoidal function on the domain [0,1]. Similarly to the case of linguistic values, left-bottom-corner, left-top-corner, right-top-corner, and right-bottom-corner are the coordinates on the x-axis of the corners of the trapezoid, but right-top-corner and right-bottom-corner are always set to one; in fact, the function must be monotone nondecreasing. The weighting vector WQ of the OWAQ operator associated with a linguistic quantifier Q with a monotone nondecreasing2 membership function µQ can be derived automatically, as defined previously in the section titled Linguistic Quantifiers. Then, we apply the OWA operator to the n satisfaction degrees µc1(t), …, µcn(t) of the soft conditions c1, …, cn by each tuple t. In the following sections, given the four parameters lb, lt, rt, rb defining µQ, we denote by OWAQ[lb, lt, rt, rb](.) the application of the corresponding OWA operator with the weighting vector WQ.
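The derivation of WQ from µQ referred to here is, in the standard (Yager) construction, wi = µQ(i/n) − µQ((i−1)/n); since rt = rb = 1, only the two left corners matter. A hedged Python sketch, with a hypothetical quantifier whose corners are (0.3, 0.8, 1, 1):

```python
def mu_q(x, lb, lt):
    """Monotone nondecreasing trapezoidal quantifier (rt = rb = 1)."""
    if x <= lb:
        return 0.0
    if x >= lt:
        return 1.0
    return (x - lb) / (lt - lb)

def owa(degrees, lb, lt):
    """Apply OWA_Q to n satisfaction degrees using the derived weights."""
    n = len(degrees)
    weights = [mu_q(i / n, lb, lt) - mu_q((i - 1) / n, lb, lt)
               for i in range(1, n + 1)]
    ordered = sorted(degrees, reverse=True)  # OWA reorders its arguments
    return sum(w * b for w, b in zip(weights, ordered))

print(owa([1.0, 0.8, 0.6, 0.4], 0.3, 0.8))  # weighted trade-off, ~0.66
```

The weights always sum to 1 because µQ(0) = 0 and µQ(1) = 1, so the OWA result stays within the range of the input degrees.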
Definition of the SELECT Command We now define the SELECT command and the semantics of each clause. In the rest of this section, we introduce the concepts by discussing groups of homogeneous clauses (in the following called blocks), defining their formal semantics and using examples to explain the concepts.
The FROM-WHERE Block

Let us consider the FROM-WHERE block of clauses, whose syntax is the following:

FROM source-relation
[ WHERE soft-selection-condition
  [ DEGREE THRESHOLD dtf ] ]
Let us present the clauses in detail.
FROM clause. As in the classical SELECT command, the FROM clause specifies the table on which the query is performed, which can be the result of a join operation. Since tuples come from crisp relations (our extension of the SELECT command is designed to work on the classical crisp relational model), the membership degree assigned to each tuple is 1 by default. However, the source tables might be the result of some previous soft query, having an attribute that is the degree computed by the previous query, which must be used as the membership value in the current query. To cope with this problem, we do not have to change the general syntax of the FROM clause; thus, source-relation is defined as follows:

table-spec ( join-op table-spec ON join-condition )*

In place of join-op, all types of join operators are allowed. We redefine table specifications table-spec, which become as follows:

table-spec := table [AS table-alias] [WITH expr AS DEGREE]

W.r.t. standard SQL, we allow an additional optional WITH expr AS DEGREE subclause: if not specified, the membership degree for tuples coming from the specified table is set to 1; if it is specified, for each tuple, the value assumed by the expression expr (based on attributes contained in the specified table) becomes the membership degree of the tuple. Notice that table is either a table name or a nested query. The WITH subclause is necessary because, by
hypothesis, the underlying model is the classical relational model; thus, SELECT produces classical relational tables. Consequently, if we store the result of a soft query, or if we want to use a soft query as a nested query in the FROM clause, to exploit the membership degree we need a means to specify that an expression (typically a simple attribute) must be considered as the membership degree of tuples in that table. Consider the following sketch as an example.

SELECT …
FROM (SELECT …, DEGREE AS D1 …) AS T1 WITH D1 AS DEGREE
  INNER JOIN
  (SELECT …, DEGREE AS D2 …) AS T2 WITH D2 AS DEGREE
  ON …
in which two nested queries are joined in the FROM clause; however, since both nested queries are soft queries, they generate an attribute (named D1 and D2, respectively) corresponding to the membership degrees. The WITH subclause allows the user to reuse these two attributes as membership degrees in the join operation (which takes the minimum membership degree when two tuples are joined together).
WHERE Clause. The WHERE clause presents significant changes w.r.t. the standard; in particular, soft-selection-condition is a soft condition composed of crisp and/or linguistic predicates connected by AND and OR, negated by NOT, or aggregated by a quantifier. In this way, the soft conditions are specified in the WHERE clause, like in SQLf. Crisp predicates are based on the classical comparison operators applied to classical data types. Linguistic predicates have the form:

tuple-or-numerical-expression IS linguistic-value IN term-set

where tuple is a tuple of expressions and where linguistic-value is defined in the specified term set. For example, the following predicate:
(Price, 200000) IS 'cheap' IN PriceLevels
selects a tuple in the source relation if its value of attribute Price minus the constant value 200000 satisfies the soft condition cheap, according to the meaning of 'cheap' in the term set PriceLevels. Note that (Price, 200000) is a tuple composed of the attribute Price and the constant value 200000. The numerical-expression can be directly a value or an expression. For example, the following linguistic predicate is equivalent to the previous one.

(Price-200000) IS 'cheap' IN PriceLevels
Semantics

Let us define the semantics of the FROM-WHERE block.
Membership Function of Linguistic Values. The membership function of a linguistic value lv is defined with a trapezoidal shape by a 4-tuple (lb, lt, rt, rb), where:

µlv(x) = 0                      if 0 ≤ x ≤ lb
µlv(x) = (x - lb)/(lt - lb)     if lb < x < lt
µlv(x) = 1                      if lt ≤ x ≤ rt
µlv(x) = (rb - x)/(rb - rt)     if rt < x < rb
µlv(x) = 0                      if rb ≤ x ≤ 1
When lt = rt, the membership function has a triangular shape. When lb = lt and rt = rb, the membership function has a rectangular shape, and then the associated linguistic term specifies a range condition. When lb = lt = rt = rb, the membership function is punctual, and then the associated linguistic term specifies a crisp condition.
The IS .. IN Operator. A predicate based on the IS .. IN operator has the form:

t IS lv IN TS [ ZOOMING x ]
where t is a tuple of expressions, and lv is a linguistic value in the term set TS, that is, lv ∈ TS. The following function restricts a value v to the specified evaluation range:

Within(v, [min, max]) = min   if v ≤ min
Within(v, [min, max]) = v     if min < v < max
Within(v, [min, max]) = max   if v ≥ max
The following function normalizes a value v from the evaluation range to the basic range [0,1]:

Normalize(v, [min, max]) = (v - min) / (max - min)

We are now ready to define the semantics of the IS .. IN operator. With eFunc(t), we denote the application of the evaluation-function (specified in the EVALUATE subclause of the CREATE TERM-SET command) to a tuple t, and µlv is the trapezoidal membership function for the linguistic value lv. If the ZOOMING option is not set:

µ(t IS lv IN TS)(t) = µlv(Normalize(Within(eFunc(t), [min, max]), [min, max]))

Recall that the membership values generated by crisp operators are only 0 and 1. If the ZOOMING x option is set, we have to modify the semantics of µlv by applying the modifier function modifier, specified with the linguistic value, to the values to evaluate:

µ(t IS lv IN TS ZOOMING x)(t) = µlv(Normalize(Within(modifier(eFunc(t), x), [min, max]), [min, max]))

modifier is either a product (*) or a translation operator (+ or -) and can act as a dilator or a concentrator depending on the value of x: for example, if x > 1, modifier = * acts as a concentrator, while if 0 < x < 1, it acts as a dilator.
The QUANTIFIED Operator. A predicate based on the QUANTIFIED operator has the form:
QUANTIFIED Q (list-of-soft-conditions)
where Q is a user-defined quantifier. Given the set of membership degrees µc1(t), …, µcn(t) of the conditions in list-of-soft-conditions, it is:

µ(QUANTIFIED Q (list-of-soft-conditions))(t) = OWAQ[lb,lt,rt,rb](µc1(t), …, µcn(t))

where (lb, lt, rt, rb) is the tuple defining the trapezoidal membership function for the linguistic quantifier.
The WHERE Clause. We assume that each tuple t in a source crisp relation R has a membership degree µR(t) = 1. The WHERE clause specifies a soft condition φ, which assigns t a new membership degree, for the purpose of selection. Applying associative properties, φ can be seen either as φ = sub-cond1 lop sub-cond2, where lop is a logical operator (AND, OR), or as a negated condition φ = NOT(sub-cond), where sub-cond1, sub-cond2, and sub-cond can be either composed conditions or simple predicates. We define the semantics of the three logical operators in accordance with the usual definition of the AND, OR, and NOT operators in fuzzy set theory, as the min, the max, and the complement to 1, respectively.

If φ = sub-cond1 AND sub-cond2, then µφ(t) = Min(µsub-cond1(t), µsub-cond2(t))
If φ = sub-cond1 OR sub-cond2, then µφ(t) = Max(µsub-cond1(t), µsub-cond2(t))
If φ = NOT(sub-cond), then µφ(t) = 1 − µsub-cond(t)

Then, the membership degree of t after the evaluation of φ is:
µ(t) = Min(µR(t), µφ(t))

The DEGREE THRESHOLD Subclause. Consider the set of tuples T’FW produced by the FROM clause and possibly filtered by the WHERE clause, whose membership degrees µ(t) (with t ∈ T’FW) have been computed as previously discussed. If the subclause DEGREE THRESHOLD dtf is specified, the final set of tuples TFW produced by the FROM-WHERE block is defined as follows:

TFW = { t ∈ T’FW | µ(t) ≥ dtf }

while in case the DEGREE THRESHOLD subclause is not specified, it is defined as follows:

TFW = { t ∈ T’FW | µ(t) > 0 }
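Putting the block's pieces together (linguistic predicate via the trapezoid on a normalized value, fuzzy connectives, the Min with µR(t), and the threshold filter), a hedged end-to-end sketch; the 'Cheap' corners, the normalization range, and the flat data are all illustrative assumptions:

```python
def trapezoid(x, lb, lt, rt, rb):
    if lt <= x <= rt:
        return 1.0
    if x <= lb or x >= rb:
        return 0.0
    return (x - lb) / (lt - lb) if x < lt else (rb - x) / (rb - rt)

def is_in(value, corners, lo, hi):
    """IS .. IN: clamp, normalize to [0, 1], then apply the trapezoid."""
    v = (max(lo, min(value, hi)) - lo) / (hi - lo)
    return trapezoid(v, *corners)

AND, OR, NOT = min, max, lambda a: 1.0 - a

# Hypothetical 'Cheap' term over prices normalized within (0, 500000).
CHEAP = (0.0, 0.0, 0.3, 0.6)
flats = [("f1", 100000), ("f2", 250000), ("f3", 400000)]
selected = []
for fid, price in flats:
    mu = AND(1.0, is_in(price, CHEAP, 0, 500000))  # µ(t) = Min(µR(t), µφ(t))
    if mu >= 0.5:                                  # DEGREE THRESHOLD 0.5
        selected.append((fid, mu))
print(selected)  # [('f1', 1.0)]
```

Without the threshold, the filter would keep every tuple with a degree greater than 0, mirroring the default case above.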
Flexible Aggregate Functions

The syntax of aggregate functions is as follows:

COUNT( * [ WITH quantifier-spec DEGREE ] )
COUNT( attr-name [ WITH quantifier-spec DEGREE ] )
COUNT( DISTINCT attr-name [ WITH quantifier-spec DEGREE ] )
MIN( attr-name [ WITH quantifier-spec DEGREE ] )
MAX( attr-name [ WITH quantifier-spec DEGREE ] )
AVG( attr-name [ WITH quantifier-spec DEGREE ] )
SUM( attr-name [ WITH quantifier-spec DEGREE ] )
where quantifier-spec is one of SAFE, AVERAGE, and OPTIMISTIC, or a user-defined linguistic quantifier; if the WITH option is not specified, the SAFE quantifier is adopted by default. Given a set of tuples T’ used to compute the aggregate value av by an aggregate function af, the membership degree associated with av, denoted as µ(av), is the minimum, the average, or the maximum membership degree associated with tuples in T’, depending on which basic quantifier is specified; in the case of a user-defined quantifier, the trapezoidal membership function µQ associated with the quantifier is evaluated on the average membership degree of tuples, as previously described. In particular, function
COUNT(*) operates on all selected tuples, while the other functions operate only on tuples having a not null value for the specified attribute.
Semantics

Consider the set T of tuples on which to apply an aggregate function af. If af is the COUNT(*) function, T’ = T; otherwise, an attribute attr is specified and T’ = { t ∈ T | attr is not null }. Given the aggregate value av = af(T’) generated by the aggregate function af, its membership degree is:

µ(av) = Min t’∈T’ { µ(t’) }   if the specified quantifier is SAFE;
µ(av) = Avg t’∈T’ { µ(t’) }   if the specified quantifier is AVERAGE;
µ(av) = Max t’∈T’ { µ(t’) }   if the specified quantifier is OPTIMISTIC.

If the specified quantifier is a user-defined quantifier Q, µ(av) = µQ( Avg t’∈T’ { µ(t’) } ).
The SELECT-ORDER BY Block

Consider now the SELECT and ORDER BY clauses. Their syntax is the following:

SELECT [ TOP n ] result-schema [ WITH quantifier-name DEGREE ]
FROM-WHERE-Block
[ ORDER BY list-of-ordering-features ]
SELECT Clause. As in the standard SQL SELECT command, the attribute list appearing in the SELECT clause (i.e., result-schema) defines the schema for the table generated by the SELECT statement; here, we extend SQL by allowing the use of the special keyword DEGREE as an attribute name, whose value is the membership degree of tuples (if the GROUP BY clause is not present) or groups (if the GROUP BY clause is present). This special attribute is motivated by the fact that the resulting table is a classical relational table, with no membership degree associated with tuples. If the user wishes to know the relevance of tuples w.r.t. the specified selections, this attribute might be used to add a column with the membership degree to the output table. The extended semantics for aggregate functions (see the section titled Flexible Aggregate Functions) that can be exploited in the result-schema requires the introduction of a mechanism to choose the final membership degree of tuples in the result. For this reason, the optional subclause WITH quantifier-name DEGREE is introduced for the SELECT clause as well. It allows the specification of a quantifier (SAFE, AVERAGE, OPTIMISTIC, or user-defined); by means of this subclause, it is possible to choose the degree of a tuple in the presence of aggregate functions. In the case of a user-defined quantifier quantifier-name = Q, the Zadeh evaluation will be applied, which means that the AVERAGE of the degrees is computed first, and then the value µQ(AVERAGE) of the trapezoidal function defined by (lb, lt, rt, rb).
ORDER BY Clause. As in the standard SELECT command, the ORDER BY clause sorts the tuples in the result table; we allow the user to specify the DEGREE special attribute as a sort key. The TOP n subclause in the SELECT clause inserts only the first n sorted tuples into the result table.
Semantics

We define the semantics of the SELECT clause, as far as the generation of the schema for the result table is concerned. Consider at first the case in which the GROUP BY-HAVING block is not specified. The clause operates on the set of tuples TFW.

• If no aggregate functions are specified in the SELECT clause, the membership degree of each tuple t ∈ TFW is µ(t) (and the reserved attribute DEGREE assumes this value).
• If aggregate functions are defined in the SELECT clause, no attributes are allowed in the clause (as per the usual SQL constraint); in this case, the membership degree depends on the quantifier specified for the clause (if it is not specified, the SAFE quantifier is applied by default). Thus, one single tuple summarizing the entire set of tuples is generated, and its membership degree is defined as follows.

With af i, we denote the i-th aggregate function in the SELECT clause; with avi = af i(TFW), we denote the value returned by the i-th aggregate function; with µ(avi), we denote the membership value obtained by the i-th aggregate function.

If the quantifier is SAFE, µ(TFW) = Min i=1…n { µ(avi) } ∀ avi = af i(TFW), with af i in the clause
If the quantifier is AVERAGE, µ(TFW) = Avg i=1…n { µ(avi) } ∀ avi = af i(TFW), with af i in the clause
If the quantifier is OPTIMISTIC, µ(TFW) = Max i=1…n { µ(avi) } ∀ avi = af i(TFW), with af i in the clause
If the quantifier Q is defined by the user with (lb, lt, rt, rb), µ(TFW) = µQ( Avg i=1…n { µ(avi) } ) ∀ avi = af i(TFW), with af i in the clause

where µQ denotes the function specified by the quadruple (lb, lt, rt, rb). If the GROUP BY-HAVING block is specified, the semantics of the SELECT clause is slightly different.
The GROUP BY-HAVING Block

Consider now the GROUP BY-HAVING block of clauses. Its syntax is the following:

[ GROUP BY list-of-grouping-attributes
  [ WITH quantifier-spec DEGREE ]
  [ HAVING soft-group-selection-condition ]
  [ DEGREE THRESHOLD dtg ] ]
GROUP BY Clause. The GROUP BY clause behaves similarly to standard SQL, but each group has a membership degree as well, which is obtained by applying a quantifier (basic or user-defined) by means of the optional subclause WITH quantifier-spec DEGREE. If this subclause is not specified, the default SAFE quantifier is applied, which computes the minimum of the tuples’ membership degrees as the membership degree for the group. The subclause specifying the application of a quantifier is:

WITH quantifier-spec DEGREE

where quantifier-spec specifies the quantifier to apply, that is, either SAFE, AVERAGE, or OPTIMISTIC. For example, if the membership degree of a group is to be computed in an optimistic way (i.e., as the maximum of the tuples’ membership degrees), it is specified as:

GROUP BY attr WITH OPTIMISTIC DEGREE
In the HAVING clause, soft-group-selection-condition allows predicates over aggregate functions and grouping attributes, as well as the specification of soft conditions by means of the IS .. IN operator. Notice that the semantics of aggregate functions has been extended as well, in order to cope with membership degrees. The optional clause DEGREE THRESHOLD dtg that follows the HAVING clause allows the user to specify a filtering threshold over the membership degree of each group: if present, only groups with membership degree greater than or equal to dtg are selected; otherwise, only groups with membership degree greater than 0 are selected. dtg is greater than 0 and less than 1.
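This grouping machinery can be sketched as: compute a group degree with the chosen quantifier, take the Min with the HAVING condition degree, then apply the group threshold. All degrees below are illustrative:

```python
def group_membership(degrees, quantifier="SAFE"):
    """Membership degree of a group, from its tuples' degrees."""
    if quantifier == "SAFE":
        return min(degrees)
    if quantifier == "OPTIMISTIC":
        return max(degrees)
    return sum(degrees) / len(degrees)     # AVERAGE

def keep_group(tuple_degrees, having_mu, quantifier="SAFE", dtg=None):
    """Return the group's final degree, or None if it is filtered out."""
    mu = min(group_membership(tuple_degrees, quantifier), having_mu)
    ok = mu > 0 if dtg is None else mu >= dtg
    return mu if ok else None

# A group whose tuples have degrees 0.9, 0.85, 1.0; HAVING condition degree 0.9.
print(keep_group([0.9, 0.85, 1.0], 0.9, "SAFE", dtg=0.8))  # 0.85 -> group kept
```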
The Complete SELECT Command

The complete syntax for the command is then the following:

SELECT [ TOP n ] result-schema [ WITH quantifier-spec DEGREE ]
FROM source-relation
[ WHERE soft-selection-condition ]
[ DEGREE THRESHOLD dtf ]
[ GROUP BY list-of-grouping-attributes
  [ WITH quantifier-spec DEGREE ]
  [ HAVING soft-group-selection-condition ]
  [ DEGREE THRESHOLD dtg ] ]
[ ORDER BY list-of-ordering-features ]
Semantics

We can now define the semantics for the clauses in the GROUP BY-HAVING block.
The GROUP BY Clause. Consider a group of tuples g (grouped together based on the values of the grouping attributes) and a function GroupMembership(g, quantifier) that computes the membership degree for the group, depending on the specified quantifier. The membership degree of the group is µ(g) = GroupMembership(g, quantifier). For the predefined quantifiers:

GroupMembership(g, SAFE) = Min(µ(t)), ∀t∈g
GroupMembership(g, AVERAGE) = Avg(µ(t)), ∀t∈g
GroupMembership(g, OPTIMISTIC) = Max(µ(t)), ∀t∈g

If a user-defined linguistic quantifier Q is specified, the quantifier has associated the tuple (left-bottom-corner, left-top-corner, right-top-corner, right-bottom-corner) that defines the trapezoidal function µQ. In this case, it is GroupMembership(g, Q) = µQ( Avg t’∈g { µ(t’) } ).
The HAVING Clause. The HAVING clause can in turn be a soft condition based on the IS .. IN operator, applied only to grouping attributes. Thus, it is evaluated by the computation of a membership degree. Similarly to what was defined for the WHERE clause, if we denote the group selection condition by φ, the membership degree for group g is:
Customizable Flexible Querying
µ’(g) = Min(µ(g), µφ(g))

We adopted the Min semantics because the group selection condition is a further selection applied to the group, and it can be seen as a conjunction with the previous evaluations that gave rise to the group membership degree. The HAVING clause allows the specification of aggregate functions in comparison expressions. Thus, if we denote a comparison expression with aggregate functions as AggrCompExpr, its membership degree is:

µ(AggrCompExpr) = Min i=1…n { µ(avi) }, where avi = afi(g) and afi is the i-th aggregate function appearing in AggrCompExpr

Again, we used the Min semantics because, in a comparison expression containing two aggregate functions, it is natural for the one with the lower membership degree to determine the membership degree of the comparison. In case an aggregate function is specified in the expression on which the IS .. IN operator is evaluated, the membership degree obtained by the IS .. IN operator is considered. As an example, consider the condition:

MAX(Price) IS ‘cheap’ IN PriceLevels

That is not allowed in the WHERE clause, because an aggregate function is used. In the HAVING clause it is allowed, but both the aggregate function and the IS .. IN operator yield a membership degree. We choose to consider the membership degree returned by the IS .. IN operator as the overall membership degree of the expression.

The DEGREE THRESHOLD Subclause. Consider the set of groups GGH produced by the GROUP BY clause and possibly filtered by the HAVING clause, whose membership degrees µ(g) (with g ∈ GGH) have been computed as previously discussed. If the subclause DEGREE THRESHOLD dtg is specified, the final set of groups G’GH produced by the GROUP BY-HAVING block is defined as follows:

G’GH = { g ∈ GGH | µ(g) ≥ dtg }

while in case the DEGREE THRESHOLD subclause is not specified, it is defined as follows:

G’GH = { g ∈ GGH | µ(g) > 0 }

The SELECT Clause. The semantics of the SELECT clause changes when the GROUP BY-HAVING block is specified: it generates a tuple for each group g ∈ G’GH. If no aggregate functions are specified, the membership degree is µ(g), as previously defined. If aggregate functions are specified, the semantics depends on the quantifier specified for the overall clause (if not specified, the SAFE quantifier is assumed by default). Let µ0 = µ(g) and, for 1 ≤ i ≤ n, let µi = µ(avi) with avi = afi(g), where afi is the i-th aggregate function in the SELECT clause. Then the degree of the generated tuple is Min(µi), 0 ≤ i ≤ n, if the quantifier is SAFE; Avg(µi) if the quantifier is AVERAGE; and Max(µi) if the quantifier is OPTIMISTIC.

Conclusion

In this chapter, we presented the current results of the Soft-SQL project, whose goal is to define and implement a flexible query language as an extension of classical SQL within fuzzy set theory. The work takes ideas from previous well-known approaches such as SQLf (Bosc & Pivert, 1995) and is motivated by practical issues, mainly the intent of defining a user-customizable, context-dependent, and fully controllable query language, exploiting features of classic SQL as far as possible in order to allow the expression of flexible queries on classical relational databases. The proposal introduces several novel concepts. First, it works on regular relations, and no special meaning is attributed to the membership degree. The user must explicitly use the SQL ORDER BY clause to rank the tuples of a relation with respect
to their membership degree attribute values. In this way, Soft-SQL is really an extension of SQL: it completely subsumes it, satisfying the closure property. By means of the notion of user-defined term sets, and through the new command named CREATE TERM-SET, the user can define and customize the semantics of sets of linguistic values with respect to a given context. Thus, linguistic terms used to express soft conditions can assume different semantics according to the attribute to which they are applied, as occurs when using terms in natural language. To this end, the SELECT command has been redefined, in order to let the user specify flexible queries based on context-dependent soft selection conditions. Furthermore, a rich set of options has been introduced in the SELECT command to allow the precise definition of the query semantics and to adapt it to the specific context of the query, in order to obtain results that really meet users’ needs. In this way, we achieve a level of flexibility of the language not reached by previous extensions of SQL based on fuzzy sets. Following the same approach, the user is allowed to define linguistic quantifiers: the new command CREATE LINGUISTIC QUANTIFIER allows specifying and customizing the semantics of newly created relative linguistic quantifiers, which can be exploited both in the WHERE clause and in the GROUP BY clause.
References

Baldwin, J. F., Coyne, M. R., & Martin, T. P. (1993). Querying a database with fuzzy attribute values by iterative updating of the selection criteria. In International Joint Conference on Artificial Intelligence (IJCAI’93).

Bosc, P., Buckles, B., Petry, F. E., & Pivert, O. (1999). Fuzzy databases. In J. C. Bezdek, D. Dubois, & H. Prade (Eds.), Fuzzy sets in approximate reasoning and information systems: The handbook of fuzzy set series (pp. 404-468). Kluwer Academic Publishers.
Bosc, P., & Pivert, O. (1992). Fuzzy querying in conventional databases. In L. A. Zadeh & J. Kacprzyk (Eds.), Fuzzy logic for the management of uncertainty. John Wiley & Sons.

Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.

Bosc, P., & Pivert, O. (1997a). Fuzzy queries against regular and fuzzy databases. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (pp. 187-208). Kluwer Academic Publishers.

Bosc, P., & Pivert, O. (1997b). On representation-based querying of databases containing ill-known values. In Proceedings of the International Symposium on Methodologies for Intelligent Systems (ISMIS’97) (pp. 477-486).

Bosc, P., & Prade, H. (1994). An introduction to the fuzzy set and possibility theory-based treatment of flexible queries and uncertain and imprecise databases. In A. Motro & P. Smets (Eds.), Uncertainty management in information systems: From needs to solutions. Kluwer Academic Publishers.

Bordogna, G., & Psaila, G. (2004, June 24-26). Fuzzy spatial SQL. In Proceedings of Flexible Querying Answering Systems (FQAS04) (LNAI 3055), Lyon, France. Springer-Verlag.

Bordogna, G., & Psaila, G. (2005, March 15-16). Extending SQL with customizable soft selection conditions. In Proceedings of the ACM-SAC Track on Information Access, Santa Fe, NM.

Buckles, B. P., & Petry, F. E. (1985). Query languages for fuzzy databases. In J. Kacprzyk & R. R. Yager (Eds.), Management decision support systems using fuzzy sets and possibility theory (pp. 241-251). TUV Rheinland: Verlag.

Buckles, B. P., Petry, F. E., & Sachar, H. S. (1986). Design of similarity-based relational databases. In H. Prade & C. V. Negoita (Eds.), Fuzzy logic in knowledge engineering (pp. 3-7). TUV Rheinland: Verlag.
Dubois, D., & Prade, H. (1997). Using fuzzy sets in flexible querying: Why and how? In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (pp. 45-60). Kluwer Academic Publishers.

Eduardo, J., Goncalves, M., & Tineo, L. (2004, September 27-October 1). A fuzzy querying system based on SQLf2 and SQLf3. In Proceedings of the XXX Conferencia Latinoamericana de Informática (CLEI 2004), Arequipa, Peru.

Galindo, J., Medina, J. M., & Aranda, G. M. C. (1999). Querying fuzzy relational databases through fuzzy domain calculus. International Journal of Intelligent Systems, 14, 375-411.

Galindo, J., Medina, J. M., Cubero, J. C., & García, M. T. (2000). Fuzzy quantifiers in fuzzy domain calculus. In Proceedings of the 8th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU’2000) (pp. 1697-1702), Spain.

Galindo, J., Medina, J. M., Pons, O., & Cubero, J. C. (1998). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems. Springer-Verlag.

Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Idea Group Publishing.

Goncalves, M., & Tineo, L. (2003, March 1-4). Derivation principle in SQLf2 algebra operators. In Proceedings of the 1st International Conference on Fuzzy Information Processing Theories and Application (FIP-2003), Beijing, China.

Goncalves, M., & Tineo, L. (2005, May 22-25). Derivation principle in advanced fuzzy queries. In Proceedings of the 14th IEEE International Conference on Fuzzy Systems (Fuzz-IEEE 2005), Reno, NV.

Kacprzyk, J., & Zadrozny, S. (1995). FQUERY for Access: Fuzzy querying for a Windows-based DBMS.
In P. Bosc & J. Kacprzyk (Eds.), Fuzziness in database management systems. Physica-Verlag.

Kacprzyk, J., & Zadrozny, S. (1997). Implementation of OWA operators in fuzzy querying for Microsoft Access. In R. R. Yager & J. Kacprzyk (Eds.), The ordered weighted averaging operators: Theory and applications (pp. 293-306). Boston: Kluwer.

Kacprzyk, J., Zadrozny, S., & Ziolkowski, A. (1989). FQUERY III+: A “human-consistent” database querying system based on fuzzy logic with linguistic quantifiers. Information Systems, 6, 443-453.

Kacprzyk, J., & Ziolkowski, A. (1986). Database queries with fuzzy linguistic quantifiers. IEEE Transactions on Systems, Man and Cybernetics, 16, 474-479.

Kießling, W. (2002). Foundations of preferences in database systems. In Proceedings of the 28th International Conference on Very Large Databases.

Kießling, W. (2003). Preference queries with SV-semantics. In Proceedings of the 11th International Conference on Management of Data (COMAD 2005).

Medina, J. M., Pons, O., & Vila, M. A. (1994). GEFRED: A generalized model of fuzzy relational databases. Information Sciences, 76, 87-109.

Petry, F. E. (1996). Fuzzy databases. Kluwer Academic Publisher.

Prade, H., & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34, 115-143.

Prade, H., & Testemale, C. (1987). Representation of soft constraints and fuzzy attribute values by means of possibility distributions in databases. In J. C. Bezdek (Ed.), Analysis of fuzzy information (vol. II, pp. 213-229). CRC Press.

Ribeiro, R. A., & Moreira, A. M. (1999). Intelligent query model for business characteristics. In
Proceedings of the IEEE/WSES/IMACS CSCC’99 Conference.

Rosado, A., Ribeiro, R. A., Zadrozny, S., & Kacprzyk, J. (2006). Flexible query languages for relational databases: An overview. In G. Bordogna & G. Psaila (Eds.), Flexible databases supporting imprecision and uncertainty. Springer-Verlag.

Shenoi, S., Melton, A., & Fan, L. T. (1990). An equivalence classes model of fuzzy relational databases. Fuzzy Sets and Systems, 38, 153-170.

Tahani, V. (1977). A conceptual framework for fuzzy query processing: A step toward very intelligent database systems. Information Processing and Management, 13, 289-303.

Takahashi, Y. (1991). A fuzzy query language for relational databases. IEEE Transactions on Systems, Man and Cybernetics, 21, 1576-1579.

Tineo, L. (2000). Extending RDBMS for allowing fuzzy quantified queries. In M. Kung (Ed.), DEXA Proceedings (LNCS 1873, pp. 407-416). Springer-Verlag.

Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-27.

Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Transactions on Systems, Man and Cybernetics, 18, 183-190.

Yager, R. R. (1994). Interpreting linguistically quantified propositions. International Journal of Intelligent Systems, 9, 541-569.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.

Zadeh, L. A. (1975). The concept of a linguistic variable and its application to approximate reasoning (I-II). Information Sciences, 8, 199-249, 301-357.

Zadeh, L. A. (1983). A computational approach to fuzzy quantifiers in natural languages. Computers & Mathematics with Applications, 9, 149-184.

Zadeh, L. A. (1999). From computing with numbers to computing with words: From manipulation of measurements to manipulation of perceptions. IEEE Transactions on Circuits and Systems, 45(1), 105-119.

Key Terms
Aggregate Function: A function of the SQL language that works on sets of tuples rather than on single tuples and returns a single value as the result of its evaluation.

Flexible Query: A query allowing the specification of some kind of preferences on selection conditions and/or priorities among conditions. Within the fuzzy context, flexible queries are also named fuzzy queries: preferences in fuzzy queries are defined by soft conditions expressed by linguistic predicates such as young, while priorities among conditions are expressed by numeric values expressing the degrees of priority.

Fulfillment Degree: A value in [0,1] that expresses the satisfaction degree of a tuple of a relation subjected to a flexible query in a relational database. When it is zero, the tuple does not satisfy the flexible query at all; when it is one, the tuple fully satisfies the query. Intermediate values in (0,1) indicate partial satisfaction of the query by the tuple. The fulfillment degree is also named the membership degree of the tuple in SQLf queries.

Linguistic Quantifier: Linguistic quantifiers extend the set of quantifiers of classical logic. They can be either crisp (such as all, at least 1, at least k, half) or fuzzy quantifiers (such as most, several, some, approximately k). Formally, Zadeh (1983) first defined fuzzy quantifiers as fuzzy subsets and identified two types of quantifiers: absolute and relative. Absolute quantifiers, such as about 7, almost 6, and so forth, are defined as fuzzy sets with membership
function on a subset of positive integers. Relative quantifiers are defined as fuzzy sets with membership function defined on [0,1].

OWA Operator: Ordered weighted averaging operators, defined by Yager (1988), are a family of mean-like operators that allow aggregations in between the two extremes of AND and OR, corresponding to the minimum and the maximum of the operands, respectively.

Soft Condition: A tolerant selection condition admitting degrees of satisfaction, defined by a fuzzy set on the domain of a linguistic variable, such as Age, and specified by linguistic terms, such as young, old, and so forth.
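The OWA definition above can be made concrete with a small sketch (our illustration; the operator applies the weights to the operands re-ordered decreasingly, as in Yager's definition):

```python
# Sketch of Yager's OWA operator: the weight vector alone moves the result
# between the OR extreme (Max, weights (1,0,...,0)) and the AND extreme
# (Min, weights (0,...,0,1)).

def owa(values, weights):
    assert len(values) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-9   # weights must sum to 1
    ordered = sorted(values, reverse=True)  # re-order operands decreasingly
    return sum(w * v for w, v in zip(weights, ordered))

degrees = [0.6, 0.9, 0.3]
owa(degrees, [1, 0, 0])          # = Max of the operands
owa(degrees, [0, 0, 1])          # = Min of the operands
owa(degrees, [1/3, 1/3, 1/3])    # = arithmetic mean
```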
Soft-SQL: Indicates the fuzzy extension of SQL defined in this chapter.

SQL: Structured Query Language used in relational databases.

SQLf: Indicates a fuzzy extension of the SQL language (Bosc & Pivert, 1995). Another extension is named FSQL (Galindo et al., 2006).

Term Set: Name of the set of values that a linguistic variable can assume.
Chapter IX
Qualifying Objects in Classical Relational Database Querying Cornelia Tudorie University Dunarea de Jos, Galati, Romania
Abstract

The topic presented in this chapter refers to qualifying objects in some kinds of vague queries sent to relational databases. We want to compute a fulfillment degree in order to measure the quality of objects when searching for them in databases. After a discussion of various kinds of object linguistic qualification, with different kinds of fuzzy conditions in a fuzzy query, a new particular situation is proposed for inclusion in this subject: the relative object qualification as a query selection criterion, that is, queries with two conditions in which the first one depends on the results of the second one. It is another way to express the user’s preferences in a flexible query. In connection with this, a new fuzzy aggregation operator, AMONG, is defined. We also propose an algorithm to evaluate this kind of query, together with some definitions that make it applicable and efficient (dynamic modeling of the linguistic values and a unified model of the context). We demonstrate these ideas with software already implemented in our lab.
Introduction

Database querying by various selection criteria often confronts a major limitation: the difficulty of formulating and expressing precise criteria for locating the information. This happens because people do not always think and speak in precise terms, or they do not have details on the data range. The research community has recently proposed a new way to query databases, more expressive and flexible than the classical one. It relies on vague queries, for example: “retrieve the well paid persons who live not too far from the office”, of course formulated in an adequate query language. The main reason to use the vague predicates well paid and not too far is to express the user’s preferences more flexibly and, at the same time, to rank the selected tuples by a degree of criteria satisfaction. A precise criterion, like “salary > 500 and distance home-office < 200”, may return an empty list, even if there are many persons having attribute values very close to the specified ones. Equally, the same precise criterion may return a complete list of all persons, without any helpful ordering. So, it would be useful to provide intelligent interfaces to databases, able to interpret and evaluate imprecise criteria in queries.
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Some important advantages resulting from including vague criteria in a database query may be:

• Easy to express queries
• The possibility to classify database objects by selecting them based on a linguistic qualification
• The possibility to refine the result by assigning to each tuple the corresponding fulfillment degree (the degree of criteria satisfaction); in other words, to provide a ranked answer according to the user’s preferences
Under this circumstance, when vague queries are accepted, fuzzy set membership functions are convenient tools for modeling the user’s preferences in many aspects. Fuzzy set theory (Bouchon-Meunier, 1995; Dubois, Ostasiewicz, & Prade, 1999; Yager & Zadeh, 1992; Zadeh, 1965) is accepted as one of the most adequate formal frameworks to model and to manage vague expressions. Two research areas are important for fuzzy theory applied in the database field: fuzzy querying of regular databases and storing fuzzy information in databases. There are many scientific works regarding database fuzzy querying: general reference books (e.g., Galindo, Urrutia, & Piattini, 2006), but also many articles in journals and communications at conferences. Some of them propose fuzzy extensions of the standard query language for relational databases (SQL), able to interpret and evaluate fuzzy selection criteria; others propose intelligent interfaces for fuzzy querying of classical databases. The most important of those include:

• SQLf (Bosc & Pivert, 1995; Goncalves & Tineo, 2001a, 2001b; Projet BADINS, 1995, 1997) and FSQL (Galindo et al., 2006), extensions of the SQL language allowing flexible querying.
• FQUERY (Kacprzyk & Zadrozny, 1995, 2001) and FuzzyBase (Gazzotti, Piancastelli, Sartori, & Beneventano, 1995), fuzzy querying engines for relational databases.
In this book, the reader can find a chapter by Urrutia, Tineo, and González studying the SQLf and FSQL languages. There is also another chapter, written by Kacprzyk, Zadrożny, de Tré, and de Caluwe, that includes a review of flexible querying. Other works have developed new data models able to take into account imperfect information. Fundamental contributions have been made by Buckles and Petry (1982); Medina, Pons, and Vila (1994); and Prade and Testemale (1984). Galindo et al. (2006) also define a running fuzzy database, which stores imperfect data represented by fuzzy possibilistic distributions, fuzzy degrees, and so forth. From the beginning, it is important to remark that this chapter deals with relational database fuzzy querying (fuzzy queries on crisp data), not fuzzy database querying (queries on fuzzy data). The first presented items have already been discussed in Project BADINS (1995, 1997); Bosc and Pivert (1992); Bosc and Prade (1997); Dubois and Prade (1996); Kacprzyk and Zadrozny (2001); and many others, but we consider it useful to rediscuss them in order to propose an original classification of the various kinds of object linguistic qualifications. In this context, the relative object qualification will be proposed as a new kind of selection criterion and will be included in this classification. Some practical examples inspired us to develop this study. Let us compare the queries:

Retrieve the cars having the speed greater than 240
Retrieve the cars having high speed
Retrieve the inexpensive and high speed cars
Retrieve the inexpensive cars among the high speed ones

They are increasingly more complex: they start with a classical crisp query and move to more complex queries, including vague terms in the selection
criteria. They correspond to different kinds of object linguistic qualifications, to different ways to model their semantics in fuzzy set style (i.e., fuzzy models), and to different ways to compute the fulfillment degree of the selection criterion. All these will be discussed in the following sections. A great part of the chapter (the section titled Relative Object Qualification) is devoted to presenting a new kind of fuzzy query with two conditions, in which the first one depends on the results of the second one. In connection with this, a new fuzzy aggregation operator, AMONG, is defined. Some variants, particular cases of the queries based on relative qualification, are analyzed. The section titled Dynamic Modeling of the Linguistic Values proposes a procedure to define the linguistic terms by automatically extracting their fuzzy models from the actual content of the database. This procedure is generally useful, but it is mandatory in the relative qualification case. In order to make the query evaluation process more efficient, we propose, in the section titled The Unified Model of the Context, a solution to incorporate the knowledge base (containing the fuzzy models of the linguistic terms, or at least their metadescriptions) into the target database. Both kinds of pieces of knowledge are taken into discussion: effective models of the linguistic values as static definitions, but also metadescriptions of the linguistic values for a dynamic modeling process. Finally, we present several laboratory implementations, the conclusions, and future trends.
Absolute Object Qualification

Querying a relational database in a classical system means selecting data (table rows) satisfying Boolean criteria. For example, the following crisp query is sent to a database including Table 1:

Retrieve the cars having the speed greater than 240
The answer consists of a table, containing the database rows which satisfy the Boolean formula. So, the criterion max speed>240 is evaluated and the answer is in Table 2. The classical query searches the database objects having a certain property, expressed by a Boolean predicate: if “B Coupe” is selected, that means it has the property to have the speed superior to 240. The fuzzy predicate is an affirmation that may be more or less true, depending on the argument
Table 1. A relational database table (car)

Name      Max Speed   Price   …
AA        236         46000
AA4       221         28450
B3        226         31562
B7        243         57200
B Coupe   250         39000
C 300M    230         32000
IO        145         24000
LRD       130         28000
MBS       240         69154
MC        190         18200
M L200    145         19095
NV        132         15883
OA        186         16042
OCS       120         26259
OF        192         43615
OV        208         20669
OZ        178         18364
P 206     170         10466
P 607     222         31268
P 806     177         20633
P 911 C   280         65000
RC        186         12138
Table 2. The cars having speed greater than 240

Name      Max Speed   Price   …
P 911 C   280         65000
B Coupe   250         39000
B7        243         57200
value: for example, Max Speed (“B Coupe”, high). It is an extension of the classical logical predicate, which can be either definitely true or definitely false. The truth-value of the fuzzy predicate may be expressed as a number in [0,1], 1 standing for absolutely true and 0 for absolutely false. In a context related to query selection criteria, a fuzzy predicate is useful to model a gradual property: if “B Coupe” is selected, that means it has the property of having a high maximum speed. Moreover, accepting a certain meaning of the term “high” (for example, a fuzzy set as semantic model), the fuzzy query evaluation consists in computing a corresponding fulfillment degree of the “high speed” property. Including a gradual property in a database vague query, like “x are Р”, gives a qualification to the objects; that means selecting a number of objects (x) from the database that satisfy, to a certain degree, the gradual property (Р). More examples of fuzzy selection criteria like “x Р” may be: high speed cars, inexpensive cars, expensive cars, good students, big salary, young people, and so forth. When a query selection criterion is expressed by a gradual property, a fulfillment degree is computed for each tuple, starting from the definition of the fuzzy predicate. This is equal to the value of the membership function corresponding to the attribute value in the current tuple.

Definition 1. Let R[A1, A2, …, An] be a table of a relational database, that is, a set of tuples t:

R ⊂ { t | t ∈ D1 × D2 × … × Dn }

where Di are the domains of the attributes Ai, accepted (within this chapter) as intervals [ai, bi].
Then the fulfillment degree of a vague criterion referring to an attribute A with domain D = [a,b], or, in other words, the fulfillment degree of a gradual property Р referring to an attribute A, is defined by the membership function of the fuzzy predicate:

μР : D → [0,1] or μР : [a,b] → [0,1], v ↦ μР(v)

The fulfillment degree of the gradual property Р, associated with the attribute A, may be considered a characteristic of each tuple, so that the fulfillment degree may also be defined on the table R:

μР : R → [0,1], t ↦ μР(t) = μР(t.A)

where t.A is the crisp value of the attribute A for the tuple t, t.A ∈ [a, b].

Definition 2. If a crisp query on a database table R, based on a condition P (Boolean predicate) referring to an attribute A, is the application:

Q : 2^(D1 × D2 × … × Dn) → 2^(D1 × D2 × … × Dn), R ↦ { t ∈ R | P(t) }

then a vague query based on a gradual property Р associated with the attribute A is the application:

QР : 2^(D1 × D2 × … × Dn) → 2^(D1 × D2 × … × Dn × (0,1]), R ↦ { (t, μР(t)) | t ∈ R ∧ μР(t) > 0 }

where μР(t) = μР(t.A).

Usually, several such gradual properties, expressed by linguistic labels, can be linked to the same database attribute. They are named linguistic values, and the set of these labels may be the definition set of a linguistic variable. The definition of linguistic variable can be found in Zadeh (1975) and in another chapter of this book, written by Xexeo. The set of linguistic values makes up the linguistic domain of the database attribute (more details in Tudorie & Dumitriu, 2004). In order to
evaluate any vague query sent to a database, it is necessary to define both the crisp domain and the linguistic one for each attribute frequently used in searching operations. For example, [120, 280] is the crisp domain, and {low, medium, high} is the linguistic domain for the Max Speed attribute of the car table. Each linguistic value can be considered a gradual property and modeled as a fuzzy predicate defined on the attribute crisp domain as referential set (like in Figure 1). For example, according to the definitions in Figure 1, the answer to the vague query

Retrieve the cars having high speed

applied to the car table is in Table 3. In other words, we found the cars with the property of having high speed. The fulfillment degree (μ) expresses the intensity of the property: between the 0 degree (i.e., a not high speed car) and the 1 degree (i.e., an absolutely high speed car).
The AND Operator: Multiqualification

In the classical (precise) context, a compound selection criterion is a Boolean expression containing comparisons and logical operators. In a vague context, the operators AND, OR, and NOT are extended to fuzzy aggregation connectives. They are able to compute a global fulfillment degree for each database tuple, starting with the fulfillment

Figure 1. Linguistic values defined on the Max Speed and Price attribute domains
[Figure 1 not reproduced. It shows trapezoidal membership functions for low, medium, and high over the Max Speed domain, with breakpoints at 120, 160, 180, 200, 220, 240, and 280, and for inexpensive, medium, and expensive over the Price domain, with breakpoints at 10466, 25138, 32474, 39810, 47146, 54482, and 69154.]
degrees of each fuzzy condition and observing certain models for the fuzzy connectives. Usually, the Min and Max functions stand for the fuzzy conjunctive and disjunctive connectives, and the complement stands for the fuzzy negation connective, but there are many other proposals in the literature for defining aggregation connectives (Grabisch, Orlovski, & Yager, 1998; Yager, 1991). Let us take for example a query based on a complex fuzzy selection criterion applied to the car table:

Retrieve the inexpensive and high speed cars.

The evaluation of this query, according to the definitions in Figure 1 and to the content of the car table (Table 1), generates the answer in Table 4. For each table row, the fulfillment degree of each linguistic value is computed, and the arithmetical min function is used to implement the fuzzy conjunction between them. The answer produces the table rows (cars) having a significant global fulfillment degree.

Definition 3. The fuzzy model of the conjunction AND(Р, S) of two gradual properties Р and S, associated with two attributes A1 and A2, is defined by the mapping:

µР AND S : D1 × D2 → [0,1] or µР AND S : [a1,b1] × [a2,b2] → [0,1], (v1,v2) ↦ min(μР(v1), μS(v2))

The same fulfillment degree defined on a database table R is:

µР AND S : R → [0,1], t ↦ min(μР(t), μS(t)) = min(μР(t.A1), μS(t.A2))

where t is a tuple and µР and µS are the membership functions defining the two gradual properties. Any conjunctive operator is a triangular norm (or t-norm), as defined in fuzzy set theory (Yager, 1991), with the following properties:
Qualifying Objects in Classical Relational Database Querying
Table 3. The “high speed cars” table
fuzzy selection criterion. The objects selected by the query are defined by a double qualification: to be “inexpensive” and at the same time to have “high speed.” It is important to remark that the two qualifications are independent of each other, and they have the same significance for the user’s preferences.
Max Speed
Price
P 911 C
280
65000
1
B Coupe
250
39000
1
B7
243
57200
1
MBS
240
69154
1
AA
236
46000
0.80
C 300M
230
32000
0.50
B3
226
31562
0.30
P 607
222
31268
0.10
AA4
221
28450
0.05
Name
…
...
µ
Table 4. The “high speed and inexpensive cars” table Max Speed
Price
µ high
µ inexpensive
µ
B3
226
31562
0.3
0.12
0.12
Name
P 607
222
31268
0.1
0.16
0.1
C 300M
230
32000
0.5
0.06
0.06
AA4
221
28450
0.05
0.54
0.05
1. ������������� commutativity: AND(Р, S) = AND(S, Р ) 2. ������������� associativity: AND(Р, AND(S, T )) = AND(AND(Р, S), T ) 3. �������� monotony: AND(Р, S) ≤ AND(Р’, S’) if Р ≤ Р’ and S ≤ S’ 4. ���� unit element: AND(Р , 1) = Р It is obvious that min operator is a t-norm. A particular list of various t-norm functions is presented in Dubois and Prade (1996) and Galindo ����������� et al. (2006, p. 20��������������������������������� ). When using these functions as AND connective to database querying, they are modeling different linguistic expressions and, of course, different logical meanings of the selection criterion. The queries like the above-mentioned one include two (or more) gradual properties in the
Definition 4. The vague query based on a double qualification is an application:

QP,S : 2^(D1 × D2 × … × Dn) → 2^(D1 × D2 × … × Dn × (0,1])
R ↦ { (t, min( μP(t), μS(t) )) | t ∈ R ∧ μP(t) > 0 ∧ μS(t) > 0 }

Similarly, a multiqualification (multiple conjunctions) can be expressed as a criterion in database vague queries.
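Definitions 3 and 4 can be sketched in a few lines of Python. The trapezoidal definitions of high and inexpensive below are assumptions standing in for the chapter's Figure 1 (a uniform covering of the speed domain, taken here as [120, 280], and of the price domain [10466, 69154]); with them, the min conjunction reproduces the degrees of Table 4.

```python
# Hypothetical re-creation of Definitions 3/4: min as the fuzzy AND over a table.
# The domain limits and trapezoid shapes are assumptions mimicking Figure 1.

def mu_l1(v, lo, hi):
    """'Low' value of a uniform covering: alpha=(hi-lo)/8, beta=2*alpha."""
    a = (hi - lo) / 8.0
    b = 2 * a
    return max(0.0, min(1.0, 1.0 - (v - (lo + b)) / a))

def mu_l3(v, lo, hi):
    """'High' value of a uniform covering: ramps up on [lo+5a, lo+6a]."""
    a = (hi - lo) / 8.0
    return max(0.0, min(1.0, (v - (lo + 5 * a)) / a))

def mu_high(speed):        return mu_l3(speed, 120, 280)
def mu_inexpensive(price): return mu_l1(price, 10466, 69154)

def and_query(rows):
    """Definition 4: keep tuples with both degrees > 0, paired with the min t-norm."""
    result = [(name, min(mu_high(s), mu_inexpensive(p)))
              for name, s, p in rows
              if mu_high(s) > 0 and mu_inexpensive(p) > 0]
    return sorted(result, key=lambda pair: -pair[1])
```

For example, `and_query([("B3", 226, 31562), ("P 911 C", 280, 65000)])` keeps only B3, with degree min(0.30, ≈0.12) ≈ 0.12 as in Table 4; P 911 C is filtered out because its inexpensive degree is 0.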
Relative Object Qualification

People use an enormous number of expressions in their common language for requesting information. This fact motivated us to search for the most accurate model for as many queries as possible, so that the computational treatment and the response may be as adequate as possible. There are in the literature several approaches to modeling user preferences, which solve different situations, such as accepting tolerance, accepting different weights of importance for the requirements in a selection criterion, accepting conditional requirements, and so forth (e.g., Dubois & Prade, 1996). Our study has found a new class of problems, which require partitioning a limited subset of an attribute domain instead of the whole domain, in situations where dynamic modeling of the linguistic values is necessary. This is the case when the selection criteria are not independent but combined in a way that expresses a user’s preference. Two gradual properties are combined in a complex selection criterion such that the second one is applied to a subset of database rows already selected by the first one. We assume
that the second gradual property is expressed by a linguistic value of a database attribute, that is, a label from the attribute’s linguistic domain. In this case, modeling the linguistic domain of the second attribute requires taking into account not the whole crisp attribute domain, but a limited subset, characteristic of the database rows selected by the first criterion.
The AMONG Operator: Relative Qualification to Other Gradual Property

Let us consider as an example the following query, based on a complex fuzzy selection criterion addressed to the car table: Retrieve the inexpensive cars among the high speed ones. The query evaluation procedure observes the following steps:

1. The selection criterion high speed cars is evaluated, taking into account the definition in Figure 1; an intermediate result is obtained, containing the rows where the condition µhigh(t) > 0 is satisfied (Table 3).
2. The underlying interval containing the prices of the selected cars forms the Price subdomain [28450, 69154]; this one is considered from now on, instead of [10466, 69154].
3. The linguistic value set {inexpensive, medium, expensive} is scaled to fit this subdomain (Figure 2; to mark the difference, the new definitions are labeled in capital letters).
4. The selection criterion inexpensive cars is evaluated taking into account the definition in Figure 2. The fulfillment degree μINEXPENSIVE is computed for each row of the intermediate result from step 1.
5. The global fulfillment degree (μ) results for each tuple; the tuples with μ(t) > 0 are selected (the shaded rows in Table 5).

Figure 2. Linguistic values {INEXPENSIVE, MEDIUM, EXPENSIVE} defined on the Price subdomain [28450, 69154].
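The steps above can be sketched in Python (a hypothetical rendering: the rows are the Table 3 result, and the rescaled trapezoid assumes the uniform-covering scheme used throughout this chapter):

```python
# Steps 2-5 of the AMONG evaluation, starting from the Table 3 intermediate result.

high_speed_cars = [           # (name, price, mu_high)
    ("P 911 C", 65000, 1.0), ("B Coupe", 39000, 1.0), ("B7", 57200, 1.0),
    ("MBS", 69154, 1.0), ("AA", 46000, 0.80), ("C 300M", 32000, 0.50),
    ("B3", 31562, 0.30), ("P 607", 31268, 0.10), ("AA4", 28450, 0.05),
]

# Step 2: the Price subdomain spanned by the pre-selected rows.
lo = min(p for _, p, _ in high_speed_cars)        # 28450
hi = max(p for _, p, _ in high_speed_cars)        # 69154

# Step 3: INEXPENSIVE is the "low" trapezoid rescaled onto [lo, hi].
def mu_INEXPENSIVE(price):
    a = (hi - lo) / 8.0
    b = 2 * a
    return max(0.0, min(1.0, 1.0 - (price - (lo + b)) / a))

# Steps 4-5: global degree = min(mu_high, mu_INEXPENSIVE), kept when positive.
answer = {name: min(h, mu_INEXPENSIVE(p))
          for name, p, h in high_speed_cars if min(h, mu_INEXPENSIVE(p)) > 0}
```

Under these assumptions, `answer` reproduces the shaded rows of Table 5: C 300M gets 0.50, B Coupe about 0.92, and P 911 C drops out because its INEXPENSIVE degree is 0.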
At this point, a new fuzzy aggregation operator can be defined in order to model the relative qualification in queries like “P AMONG S.”

Definition 5. The fuzzy model of the relative conjunction, AMONG(P, S), of two gradual properties, P and S, associated with two attributes, A1 and A2, is defined by the mapping:

µP AMONG S : D1 × D2 → [0,1], or µP AMONG S : [a1,b1] × [a2,b2] → [0,1],
(v1, v2) ↦ min( μP/S(v1), μS(v2) )

The same fulfillment degree, defined on a database table R, is:

µP AMONG S : R → [0,1], t ↦ min( μP/S(t), μS(t) ) = min( μP/S(t.A1), μS(t.A2) )

where t is a tuple, µS is the membership function defining the S gradual property, and μP/S is the fulfillment degree of the first criterion (P) relative to the second one (S). In Table 5, μINEXPENSIVE stands for μP/S, that is, μinexpensive/high, and µ stands for the global selection criterion computed as µAMONG. The membership function μP/S is a transformation of the initial membership function µP, obtained by translation and compression, as follows. After the first selection, based on the property S associated with the attribute A2, the initial domain [a,b] of the attribute A1 becomes more limited, that is, the interval [a′,b′] (Figure 3).
Table 5. The “inexpensive cars among the high speed ones” table

| Name | Max Speed | Price | µ high | µ INEXPENSIVE | µ |
|------|-----------|-------|--------|---------------|---|
| P 911 C | 280 | 65000 | 1 | 0.00 | 0.00 |
| B Coupe | 250 | 39000 | 1 | 0.92 | 0.92 |
| B7 | 243 | 57200 | 1 | 0.00 | 0.00 |
| MBS | 240 | 69154 | 1 | 0.00 | 0.00 |
| AA | 236 | 46000 | 0.80 | 0.00 | 0.00 |
| C 300M | 230 | 32000 | 0.50 | 1 | 0.50 |
| B3 | 226 | 31562 | 0.30 | 1 | 0.30 |
| P 607 | 222 | 31268 | 0.10 | 1 | 0.10 |
| AA4 | 221 | 28450 | 0.05 | 1 | 0.05 |
Figure 3. Restriction of the attribute domain for a relative qualification: µP on [a, b] is transformed into µP/S on [a′, b′].

μP/S(v) = μP( a + ((b − a)/(b′ − a′)) · (v − a′) )

where f is the transformation f : [a′, b′] → [a, b],

f(x) = a + ((b − a)/(b′ − a′)) · (x − a′)   (1)

Thus, if μP : [a, b] → [0,1], then μP/S : [a′, b′] → [0,1], so that μP/S = μP ∘ f   (2)

For the particular cases:
a. identical transformation: a′ = a, b′ = b ⇒ f(x) = x
b. interval limits: x = a′ ⇒ f(a′) = a; x = b′ ⇒ f(b′) = b

Note. The expression of the membership function μP/S is defined based only on the original membership function µP and does not depend on the algorithm for modeling it. Consequently, the defining method of the linguistic values is preserved by the transformation f.
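Formulae (1) and (2) translate directly into a higher-order helper (a sketch; mu_P here is any membership function on [a, b], and a2/b2 play the role of a′/b′):

```python
# mu_{P/S} = mu_P o f, where f linearly stretches [a2, b2] back onto [a, b].

def relative_membership(mu_P, a, b, a2, b2):
    def f(x):                                    # formula (1)
        return a + (b - a) / (b2 - a2) * (x - a2)
    return lambda v: mu_P(f(v))                  # formula (2)
```

The particular cases above hold by construction: f(a2) = a, f(b2) = b, and when a2 = a, b2 = b the transformation is the identity.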
Definition 6. The algebraic model of the AMONG operator is:

µP AMONG S : R → [0,1]
µP AMONG S (t) = min( μP( a1 + ((b1 − a1)/(b1′ − a1′)) · (t.A1 − a1′) ), μS(t.A2) )   (3)

where [a1′, b1′] ⊆ [a1, b1] is the sub-interval of the A1 attribute values corresponding to the table QS(R) (obtained
by the first selection, on the attribute A2, using property S). The new operator stands for the model of a certain fuzzy conjunctive aggregation, but it cannot be considered a triangular norm. Regarding the properties of a triangular norm, one can remark:

i. Commutativity is not satisfied by the AMONG operator: AMONG(P, S) ≠ AMONG(S, P), because μP/S(t) ≠ μS/P(t) and μS(t) ≠ μP(t) ⇒ min( μP/S(t), μS(t) ) ≠ min( μS/P(t), μP(t) ), ∀t, and because, semantically, such queries cannot be compared (see remark ii below).

ii. Associativity is satisfied by the AMONG operator (see Exhibit A), and because, semantically, such queries reflect the same idea. For example: Retrieve the (inexpensive cars among the high speed ones) selected from the low fuel consumption ones. Retrieve the inexpensive cars selected from (the high speed cars among the low fuel consumption ones).

iii. Monotonicity is not satisfied by the AMONG operator: ¬( AMONG(P, S) ≤ AMONG(P′, S′) if P ≤ P′ and S ≤ S′ ), because, although P ≤ P′ (as fuzzy models) ⇒ μP(t) ≤ μP′(t), ∀t ⇒ μP/S(t) ≤ μP′/S(t), ∀t, ∀S, and S ≤ S′ (as fuzzy models) ⇒ μS(t) ≤ μS′(t), ∀t ⇒ [a1′, b1′]S ⊆ [a1′, b1′]S′, the comparison μP/S(t) ≤ μP/S′(t), ∀t is not always true, even if [a1′, b1′]S ⊆ [a1′, b1′]S′. We denote by [a1′, b1′]S and [a1′, b1′]S′ the sub-intervals of the A1 attribute values from the tables QS(R) and QS′(R) (obtained by the first selections, on the attribute A2, using property S and property S′).

iv. The unit element is satisfied by the AMONG operator: AMONG(P, 1) = P, because min( μP/1(t), 1 ) = min( μP(t), 1 ) = μP(t), ∀t.

Queries like the one above include two gradual properties in the fuzzy selection criterion, but in
Exhibit A. AMONG(P, AMONG(S, T)) = AMONG(AMONG(P, S), T), because
min( μP/(S/T)(t), min( μS/T(t), μT(t) ) ) = min( min( μ(P/S)/T(t), μS/T(t) ), μT(t) ) = min( μP/S/T(t), μS/T(t), μT(t) ), ∀t

Exhibit B. QP/S : 2^(D1 × D2 × … × Dn) → 2^(D1 × D2 × … × Dn × (0,1])
R ↦ { (t, min( μP/S(t), μS(t) )) | t ∈ R ∧ μP/S(t) > 0 ∧ μS(t) > 0 }
a special relationship: the second one refers to objects selected by the first one. That means the objects are selected by a qualification relative to another gradual property.

Definition 7. The vague query based on a qualification relative to another gradual property is an application (see Exhibit B).

Some remarks are interesting and very important:

i. A quite different query expression: Retrieve the most inexpensive cars among the high speed ones can be assimilated with the previous one, so it can be submitted to the same evaluation procedure. The most inexpensive criterion is not equivalent to the relational aggregation MIN operation on the
whole car table (Table 1), but it corresponds to a fuzzy selection on a fuzzy table. Moreover, this query expression may be even more suggestive for the database user and semantically adequate to the response in Table 5.

ii. The new aggregation operator, AMONG, is not commutative: the inversion of the two criteria leads to a different query answer. Actually, when thinking of the semantics of the operation, the AMONG operator models exactly the importance level of the criteria, according to the user’s preference. For example, let us compare the queries: Retrieve the inexpensive cars among the high speed ones. (µ = min(µinexpensive/high, µhigh)) and
Table 6. The “high speed cars among the inexpensive ones” table

| Name | Max Speed | Price | µ inexpensive | µ HIGH | µ |
|------|-----------|-------|---------------|--------|---|
| IO | 145 | 24000 | 1 | 0.00 | 0.00 |
| MC | 190 | 18200 | 1 | 0.09 | 0.09 |
| M L200 | 145 | 19095 | 1 | 0.00 | 0.00 |
| NV | 132 | 15883 | 1 | 0.00 | 0.00 |
| OA | 186 | 16042 | 1 | 0.00 | 0.00 |
| OV | 208 | 20669 | 1 | 1 | 1 |
| OZ | 178 | 18364 | 1 | 0.00 | 0.00 |
| P 206 | 170 | 10466 | 1 | 0.00 | 0.00 |
| P 806 | 177 | 20633 | 1 | 0.00 | 0.00 |
| RC | 186 | 12138 | 1 | 0.00 | 0.00 |
| OCS | 120 | 26259 | 0.85 | 0.00 | 0.00 |
| LRD | 130 | 28000 | 0.61 | 0.00 | 0.00 |
| AA 4 | 221 | 28450 | 0.54 | 1 | 0.54 |
| P 607 | 222 | 31268 | 0.16 | 1 | 0.16 |
| B3 | 226 | 31562 | 0.13 | 1 | 0.13 |
| C 300M | 230 | 32000 | 0.07 | 1 | 0.07 |
Retrieve the high speed cars among the inexpensive ones. (µ = min(µhigh/inexpensive, µinexpensive))

The difference is evident when comparing Tables 5 and 6 (the finally selected rows are shaded).

iii. When comparing Tables 4 and 5, one can observe the difference between the conjunctive criterion “P AND S” (multiqualification) and the new kind of selection criterion “P AMONG S” (relative qualification). Semantically, the AND operator combines in one selection two independent criteria, having the same importance (priority) for the user’s preferences. On the contrary, the AMONG operator has to evaluate the second criterion prior to the first one. This is a supplementary argument for the non-commutativity of the AMONG operator.

iv. After a practical study of the use of relative qualification, we observed:
a. Generally, a query combining the two properties searches for quite disjoint object categories. The answer of an AND conjunction is sometimes empty. On the contrary, the AMONG operator evaluates the second selection on a non-empty set of objects by adapting the model of the gradual property to the already selected objects. Obviously, the answer will be non-empty and adequate to the user’s expectation.
b. The AMONG operator does not give spectacular answers when the two properties refer to approximately the same objects. For example, the query:
Retrieve the expensive cars among the high speed ones. (the shaded rows in Table 7)
compared to the query
Retrieve the expensive and high speed cars. (the shaded rows in Table 8)
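The contrast between these two answers can be reproduced with a small sketch (assumed uniform-covering trapezoids; the speed domain [120, 280] and price domain [10466, 69154] are read off the car table):

```python
# AND vs AMONG for "expensive" / "high speed". Under AMONG, "expensive" is
# re-modeled on the price subdomain of the high-speed cars before taking min.

def mu_l3(v, lo, hi):          # the "high"-side trapezoid of the uniform covering
    a = (hi - lo) / 8.0
    return max(0.0, min(1.0, (v - (lo + 5 * a)) / a))

cars = [("P 911 C", 280, 65000), ("B Coupe", 250, 39000), ("B7", 243, 57200),
        ("MBS", 240, 69154), ("AA", 236, 46000), ("C 300M", 230, 32000),
        ("B3", 226, 31562), ("P 607", 222, 31268), ("AA4", 221, 28450)]

def mu_high(s):      return mu_l3(s, 120, 280)
def mu_expensive(p): return mu_l3(p, 10466, 69154)

# "expensive AND high speed": both properties on their full domains.
and_answer = {n: min(mu_high(s), mu_expensive(p)) for n, s, p in cars}

# "expensive AMONG high speed": rescale "expensive" on the pre-selected rows.
pre = [(n, s, p) for n, s, p in cars if mu_high(s) > 0]
lo = min(p for _, _, p in pre)
hi = max(p for _, _, p in pre)
among_answer = {n: min(mu_l3(p, lo, hi), mu_high(s)) for n, s, p in pre}
```

B7 illustrates the difference: it is fully “expensive” on the whole domain (AND degree 1), but only about 0.65 expensive among the high-speed cars, matching Table 7 versus Table 8.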
v. Therefore, the most typical situation where this evaluation method is applicable is when the two criteria are in a special semantic relationship: the first selection brings a hard limitation of the class of objects, so that dynamic modeling of the linguistic values for the second selection becomes useful. The above procedure is not difficult to implement if we consider it a sequence of several operations. But an original and ef-
Table 7. The “expensive cars among the high speed ones” table

| Name | Max Speed | Price | µ high | µ expensive/high | µ |
|------|-----------|-------|--------|------------------|---|
| P 911 C | 280 | 65000 | 1 | 1 | 1 |
| B Coupe | 250 | 39000 | 1 | 0.00 | 0.00 |
| B7 | 243 | 57200 | 1 | 0.65 | 0.65 |
| MBS | 240 | 69154 | 1 | 1 | 1 |
| AA | 236 | 46000 | 0.80 | 0.00 | 0.00 |
| C 300M | 230 | 32000 | 0.50 | 0.00 | 0.00 |
| B3 | 226 | 31562 | 0.30 | 0.00 | 0.00 |
| P 607 | 222 | 31268 | 0.10 | 0.00 | 0.00 |
| AA 4 | 221 | 28450 | 0.05 | 0.00 | 0.00 |
Table 8. The “expensive and high speed cars” table

| Name | Max Speed | Price | µ high | µ expensive | µ |
|------|-----------|-------|--------|-------------|---|
| P 911 C | 280 | 65000 | 1 | 1 | 1 |
| B Coupe | 250 | 39000 | 1 | 0.00 | 0.00 |
| B7 | 243 | 57200 | 1 | 1 | 1 |
| MBS | 240 | 69154 | 1 | 1 | 1 |
| AA | 236 | 46000 | 0.80 | 0.00 | 0.00 |
| C 300M | 230 | 32000 | 0.50 | 0.00 | 0.00 |
| B3 | 226 | 31562 | 0.30 | 0.00 | 0.00 |
| P 607 | 222 | 31268 | 0.10 | 0.00 | 0.00 |
| AA 4 | 221 | 28450 | 0.05 | 0.00 | 0.00 |
ficient method to evaluate this kind of query is proposed in the section titled The Unified Model of the Context, where the knowledge base (the fuzzy definitions of the vague linguistic terms) is incorporated in the database.

vi. The fuzzy predicates used to evaluate the criterion at the first step of the procedure can be defined in advance, but this is not mandatory. On the contrary, at step 3, an algorithm has to be used in order to automatically obtain the adapted definitions. Various algorithms for dynamically defining linguistic values of database attributes are proposed in the section titled Dynamic Modeling of the Linguistic Values.

vii. Similar procedures can be used to evaluate more complex queries including relative qualification, for example:
How many inexpensive cars are among the high speed ones?

We need to mention that the aggregate computation on groups (how many) is not the subject of the present chapter (see, e.g., Blanco, Delgado, Martin-Bautista, Sánchez, & Vila, 2002; Delgado, Sánchez, & Vila, 2000; Rundensteiner & Bic, 1991), but only the fuzzy aggregation implementing the relative qualification (the AMONG operator).

Relative Qualification to Other Crisp Attribute

At least one more situation is relatively frequent: the linguistic values must be dynamically defined on an attribute subdomain obtained after a crisp selection. This concerns a complex selection criterion that includes a gradual property referring to the database rows already selected by a crisp value. Let us imagine a table (Table 9) containing all the sales transactions of a national company. The following query must take into account that, generally, the amount of sales differs between cities (from the biggest to the smallest):

Retrieve the clients in Galati which get large quantities of our product

The selection criterion “large quantity” has a different meaning for different cities. The query evaluation procedure follows the same steps as in the previous section:
Table 9. The transactions in sales table

| Client | ... | Quantity | City |
|--------|-----|----------|------|
| AA | | 70 | Galati |
| AA4 | | 21 | Tecuci |
| B3 | | 67 | Galati |
| B7 | | 200 | Bucharest |
| BC | | 30 | Galati |
| CM | | 230 | Bucharest |
| IO | | 145 | Bucharest |
| LRD | | 130 | Galati |
| MBS | | 24 | Tecuci |
| MC | | 90 | Tecuci |
| ML | | 145 | Bucharest |
| NV | | 132 | Galati |
| OA | | 86 | Tecuci |
| OCS | | 120 | Galati |
| OF | | 102 | Galati |
| OV | | 8 | Tecuci |
| OZ | | 17 | Tecuci |
| P2 | | 166 | Galati |
| P6 | | 222 | Bucharest |
| P8 | | 177 | Bucharest |
| P9C | | 28 | Tecuci |
| RC | | 186 | Bucharest |
1. The crisp selection criterion city = ‘Galati’ is classically evaluated and an intermediate result is obtained (Table 10).
2. The interval containing the quantities of the selected sales forms the quantity subdomain [30, 166], instead of [8, 230].
3. The linguistic value set {small, medium, large} is defined on the new subdomain (Figure 4).
4. The fuzzy selection criterion large quantity is evaluated according to the new definitions, and the fulfillment degree results for each tuple (Table 11).

Table 10. The transactions in Galati city table

| Client | ... | Quantity | City |
|--------|-----|----------|------|
| AA | | 70 | Galati |
| B3 | | 67 | Galati |
| BC | | 30 | Galati |
| LRD | | 130 | Galati |
| NV | | 132 | Galati |
| OCS | | 120 | Galati |
| OF | | 102 | Galati |
| P2 | | 166 | Galati |

One can remark that a large quantity at Galati (for example, 130) is less than the minimum at Bucharest (i.e., 145). This is the reason the definitions of the linguistic values must be adapted to the context; that is, the qualification (large quantity) is relative to the other, crisp attribute (city). The presented situation is a special case of the previous one. The evaluation procedure is the same, using the same AMONG operator; the simplification consists in the character of the second property (S), which is a crisp and not a gradual property, and for which the classical selection operation is enough. In this case, the simplified algebraic model of the AMONG operator becomes:

µP AMONG S : R → [0,1], t ↦ μP/S(t.A1)
µP AMONG S (t) = μP( a1 + ((b1 − a1)/(b1′ − a1′)) · (t.A1 − a1′) )   (4)

where [a1′, b1′] ⊆ [a1, b1] is the sub-interval of the attribute A1 values in the table QS(R) (obtained by the first selection, on the attribute A2, using property S). One can observe that, this time, the property S contributes to the criteria satisfaction degree only by limiting the domain of the property P: [a1, b1] becomes [a1′, b1′].
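A minimal sketch of this simplified case (the large trapezoid again assumes the uniform covering of formulae (5)-(6); the data is the Galati subset of Table 9, plus a few other rows):

```python
# Crisp first selection (city = 'Galati'), then "large" defined on the narrowed
# quantity subdomain: the simplified AMONG of formula (4).

sales = [("AA", 70, "Galati"), ("B3", 67, "Galati"), ("BC", 30, "Galati"),
         ("LRD", 130, "Galati"), ("NV", 132, "Galati"), ("OCS", 120, "Galati"),
         ("OF", 102, "Galati"), ("P2", 166, "Galati"),
         ("B7", 200, "Bucharest"), ("CM", 230, "Bucharest"), ("OV", 8, "Tecuci")]

def mu_large(v, lo, hi):
    a = (hi - lo) / 8.0
    return max(0.0, min(1.0, (v - (lo + 5 * a)) / a))

galati = [(c, q) for c, q, city in sales if city == "Galati"]   # step 1 (crisp)
lo = min(q for _, q in galati)                                  # step 2: [30, 166]
hi = max(q for _, q in galati)
answer = {c: mu_large(q, lo, hi) for c, q in galati if mu_large(q, lo, hi) > 0}
```

Under these assumptions, `answer` matches Table 11: P2 and NV get 1, LRD about 0.88, OCS about 0.29, so a quantity of 130, merely average nationwide, counts as large in Galati.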
Figure 4. Linguistic values {SMALL, MEDIUM, LARGE} defined on the quantity subdomain [30, 166] (axis marks at 30, 64, 81, 108, 115, 132, and 166).
Table 11. Transactions of large quantities in Galati city

| Client | ... | Quantity | City | µ |
|--------|-----|----------|------|---|
| P2 | | 166 | Galati | 1 |
| NV | | 132 | Galati | 1 |
| LRD | | 130 | Galati | 0.88 |
| OCS | | 120 | Galati | 0.29 |

Note. If the above query is interpreted as a multiqualification and not as a relative qualification, the answer will be Table 12, absolutely different from Table 11. Therefore, a more suggestive formulation of the query would be:

Retrieve the clients which get large quantities of our product among the clients in Galati

Table 12. Transactions of large quantities AND in Galati city

| Client | ... | Quantity | City |
|--------|-----|----------|------|
| P2 | | 166 | Galati |

Relative Qualification to Group on Other Attribute

Let us start with an example. The queries in the previous paragraph assume that all sales refer to the same product (“our product”), so the quantities can be compared. But let us now consider that the sales of different products are stored in the database (Table 13). In this case, the query

Retrieve the clients which get large quantities of soap

expresses a qualification relative to a crisp attribute and can be evaluated as above. But if the query is:

Retrieve the clients which get large quantities

then the values of the quantity attribute for different products cannot be compared. This example suggests evaluating the large quantity criterion one product at a time, that is:

Retrieve the clients which get large quantities of some product

For each product, the linguistic value is defined on the interval of the quantity values existing in the database for that product only. According to the definitions in Figure 5, the answer is in Table 14. One can remark the “higher weight” of the 11 vacuum cleaners compared to the 162 envelopes.
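The per-product evaluation can be sketched as follows (the exact degrees depend on the definitions in Figure 5, so only the sub-intervals and the relative ordering are checked here; the trapezoid is an assumed uniform covering):

```python
# Group-relative qualification: "large" is re-defined on the quantity
# sub-interval of each product before any degree is computed.

sales = [("AA", 70, "soap"), ("AA4", 11, "vacuum cleaner"), ("B3", 6, "soap"),
         ("BC", 30, "soap"), ("CM", 230, "envelope"), ("IO", 145, "envelope"),
         ("MBS", 4, "vacuum cleaner"), ("ML", 162, "envelope"), ("NV", 10, "soap"),
         ("OA", 14, "vacuum cleaner"), ("OCS", 2, "soap"), ("OF", 102, "soap"),
         ("OV", 1, "vacuum cleaner"), ("OZ", 1, "vacuum cleaner"),
         ("P6", 200, "envelope"), ("P8", 70, "envelope"),
         ("P9C", 2, "vacuum cleaner"), ("RC", 18, "envelope")]

def mu_large(v, lo, hi):
    a = (hi - lo) / 8.0
    return max(0.0, min(1.0, (v - (lo + 5 * a)) / a))

# Per-product quantity sub-intervals, discovered from the data itself.
subdomains = {}
for _, q, prod in sales:
    lo, hi = subdomains.get(prod, (q, q))
    subdomains[prod] = (min(lo, q), max(hi, q))

answer = {client: mu_large(q, *subdomains[prod]) for client, q, prod in sales}
```

The “higher weight” remark holds: 11 vacuum cleaners score a full 1.0 on the [1, 14] sub-interval, above the 162 envelopes on [18, 230].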
Dynamic Modeling of the Linguistic Values

Our study has found a new class of queries, where two fuzzy criteria are combined in a complex selection criterion such that the second fuzzy criterion is applied to a subset of database rows already selected by the first one. We assume that the second fuzzy criterion is expressed by a linguistic value of a database attribute, which is a gradual property; not an absolute property, but a relative one. In this case, modeling the linguistic domain of the second attribute requires taking into account not the whole crisp attribute domain, but a limited subset, characteristic of the database rows selected by the first criterion. Actually, the main problem of relative qualification is how to dynamically define the linguistic values on the subdomains (step 3 of the above algorithm), depending on an instant context?
Table 13. Transactions of various product sales

| Client | ... | Quantity | Product | ... |
|--------|-----|----------|---------|-----|
| AA | | 70 | soap | |
| AA4 | | 11 | vacuum cleaner | |
| B3 | | 6 | soap | |
| BC | | 30 | soap | |
| CM | | 230 | envelope | |
| IO | | 145 | envelope | |
| MBS | | 4 | vacuum cleaner | |
| ML | | 162 | envelope | |
| NV | | 10 | soap | |
| OA | | 14 | vacuum cleaner | |
| OCS | | 2 | soap | |
| OF | | 102 | soap | |
| OV | | 1 | vacuum cleaner | |
| OZ | | 1 | vacuum cleaner | |
| P6 | | 200 | envelope | |
| P8 | | 70 | envelope | |
| P9C | | 2 | vacuum cleaner | |
| RC | | 18 | envelope | |
Some procedures for automatically discovering the definitions of the linguistic values can be implemented, with a great advantage: details regarding the effective attribute domain limits, or the distribution of the values, can be easily obtained thanks to the direct connection to the database (more details in Tudorie, 2004; Tudorie & Dumitriu, 2004). Two examples of algorithms are presented in the following. We assume that there are usually three linguistic values, modeled as trapezoidal membership functions.

The first algorithm (by uniform domain covering). In most applications, the defined set of linguistic values covers the referential domain almost uniformly (Figure 6).
• Obtaining the definitions of the three linguistic values l1, l2, and l3 on a database attribute starts from the predefined values α and β and from the attribute crisp domain limits, I and S; the latter come from the database content. For example:

α = (1/8)(S − I) and β = 2α = (1/4)(S − I)   (5)

Table 14. Transactions of large quantities sales

| Client | ... | Quantity | Product | ... | µ |
|--------|-----|----------|---------|-----|---|
| CM | | 230 | envelope | | 1 |
| P6 | | 200 | envelope | | 1 |
| OF | | 102 | soap | | 1 |
| OA | | 14 | vacuum cleaner | | 1 |
| AA4 | | 11 | vacuum cleaner | | 1 |
| ML | | 162 | envelope | | 0.53 |
| AA | | 70 | soap | | 0.46 |
Figure 5. Linguistic values {small, medium, large} defined on subdomains of the quantity attribute for each product (separate partitions for soap, vacuum cleaner, and envelope).
Figure 6. A set of linguistic values l1, l2, l3 uniformly distributed on an attribute domain [I, S].

The membership functions for l1, l2, and l3 are:

µl1(v) = 1, for I ≤ v ≤ I + β
         1 − (v − (I + β))/α, for I + β ≤ v ≤ I + β + α
         0, for v ≥ I + β + α

µl2(v) = 0, for I ≤ v ≤ I + β
         1 − (I + β + α − v)/α, for I + β ≤ v ≤ I + β + α
         1, for I + β + α ≤ v ≤ I + 2β + α
         1 − (v − (I + 2β + α))/α, for I + 2β + α ≤ v ≤ I + 2β + 2α
         0, for v ≥ I + 2β + 2α

µl3(v) = 0, for I ≤ v ≤ I + 2β + α
         1 − (I + 2β + 2α − v)/α, for I + 2β + α ≤ v ≤ I + 2β + 2α
         1, for v ≥ I + 2β + 2α
   (6)
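Equation (6), with the α and β of (5), can be coded directly; a convenient sanity check is that the three values form a partition (they sum to 1 everywhere on [I, S]). A sketch:

```python
# The uniform-covering trapezoids l1, l2, l3 of equation (6), parameterized by
# the crisp domain limits I and S taken from the database content.

def linguistic_values(I, S):
    a = (S - I) / 8.0          # alpha, formula (5)
    b = 2 * a                  # beta
    def l1(v):
        return max(0.0, min(1.0, 1.0 - (v - (I + b)) / a))
    def l3(v):
        return max(0.0, min(1.0, (v - (I + 2 * b + a)) / a))
    def l2(v):                 # the middle trapezium is exactly what is left over
        return 1.0 - l1(v) - l3(v)
    return l1, l2, l3
```

For the Galati quantity subdomain [30, 166], l1 is flat up to 64, l2 peaks between 81 and 115, and l3 is flat from 132 on, matching the breakpoints of Figure 4.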
where v = t.A is a value in the domain D = [I, S] of an attribute A of a table R. Based on this idea, a software interface, the FuzzyKAA system, presented in Tudorie (2006a, 2006b), is able to assist the user in defining linguistic values in a database context. Starting from a uniform partitioning of the attribute domain, the user can adjust the shapes either by changing the numerical coordinates of graphical points or by directly manipulating them.

•
The second algorithm (statistical-mean-based algorithm) takes into account the real distribution of the attribute values in the database. The idea is to center the middle trapezium on the statistical mean of the attribute values (M). The other membership functions are
Figure 7. Linguistic values l1, l2, l3 defined on the basis of the statistical mean M of the attribute values within [I, S].
distributed on the left and on the right over the rest of the interval (Figure 7). In this case, the basic data used to determine the fuzzy models are the attribute crisp domain limits (I and S), together with the statistical mean value within the [I, S] interval:

M = ( Σ_{i=1..n} t_i.A ) / n,

where n is the cardinality of the relation R and t_i ∈ R is a tuple. The values of α, β, and α′ are based on I, S, and M; they can be:

α = (1/4) · min(M − I, S − M)
β = 2α = (1/2) · min(M − I, S − M)
α′ = (S − I) − (7/4) · min(M − I, S − M)

If 0 < α < (1/8)(S − I) and 0 < β < (1/4)(S − I), then (1/8)(S − I) < α′ < (S − I).

The formulae of the membership functions for the linguistic values l1, l2, and l3 depend on the asymmetry of the statistical mean value within the [I, S] interval. They are obtained in a similar way to the above algorithm. The fuzzy models obtained by this method seem to be closer to the meaning accepted by the user’s mind. It is important to remark that the same online method to model the linguistic domain of a database attribute can be used at any time, instead of an
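The parameters of this second algorithm can be sketched as follows (a hypothetical helper; a, b, a2 stand for α, β, α′):

```python
# Statistical-mean-based parameters: the middle trapezium is centered on M,
# and a2 (alpha') absorbs the remainder of the interval.

def mean_based_params(values):
    I, S = min(values), max(values)
    M = sum(values) / len(values)
    m = min(M - I, S - M)
    a = m / 4.0                       # alpha
    b = 2.0 * a                       # beta = m / 2
    a2 = (S - I) - 7.0 / 4.0 * m      # alpha'
    return I, S, M, a, b, a2
```

The stated bound follows directly: since a2 = (S − I) − 7α, whenever 0 < α < (S − I)/8 we get (S − I)/8 < a2 < S − I.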
off-line process of knowledge acquisition from a human expert.
The Unified Model of the Context

Generally, an intelligent interface for flexible database querying is an extra layer using its own data (a knowledge base) containing the fuzzy model of the linguistic terms included in vague queries. Such an interface must be conceived as generally as possible, that is, able to connect to any database, assuming that the corresponding knowledge base is already available (Figure 8). FSQL (Galindo et al., 2006), for example, uses an FMB (Fuzzy Metaknowledge Base) with the definitions of labels, quantifiers, and more information about the fuzzy capabilities. Fuzzy query evaluation is made possible by building an equivalent crisp query. The knowledge (the fuzzy model of the linguistic terms) is used first for building the SQL query and afterwards for computing the fulfillment degree of each tuple. The context is defined in this case as the pair formed by the database and its corresponding knowledge base. The functionality and utility of such an intelligent interface have been proved in practice by several software systems presented in the next section. One of the most important points of an interface to databases is performance, more specifically, the response time in query evaluation. In order to obtain good performance, an efficient solution needs to model the context in a uniform way, as a single database incorporating the fuzzy model of the linguistic terms, or their description, in the target database. So, a unified model of the context is proposed in the following. There are two possibilities to model a unified context:

•
Static Context: including in the database the static definitions of the linguistic terms (their fuzzy models), established a priori, before the querying process.
Figure 8. The flexible interface integrated in the querying system: a database server hosts Database 1 … Database n, each paired with its fuzzy knowledge base (KB 1 … KB n), accessed through the flexible interface for database querying.
• Dynamic Context: including in the database only the data necessary to dynamically define the linguistic terms at the moment of (or during) the querying process.

In the first case, only absolute qualification or multiqualification queries can be processed. On the contrary, in order to evaluate relative qualification queries, the second model must be adopted; in other words, the fulfillment degree is dynamically computed by taking into account the subdomains of the attributes.

Static Model of the Context

The fuzzy model of the linguistic terms can be described by various methods. Some complex graphical interfaces were developed in the Computer Science Department of the “Dunarea de Jos” University and are presented in Tudorie (2006a, 2006b) and Tudorie, Neacsu, and Manolache (2005) and in the next section of this chapter. Usually, the shape of the membership function of a fuzzy set is trapezoidal. However, we chose a more general model, a polygonal shape (Figure 9). In this case, the knowledge base can be modeled as a set of tables, which can be incorporated in the database. One possible unified model of the context is presented in Figure 10. Table1, Table2, …, Tablen are the tables of the target database; Terms and Points contain the description of the linguistic value shapes.

Dynamic Model of the Context

The section titled Relative Object Qualification has presented certain types of queries that require dynamically defining the linguistic values by partitioning an attribute subdomain already obtained by a previous selection. Moreover, if we accept that the dynamic definition is suitable to the user’s perception, then even the initial acquisition of the knowledge can be realized dynamically, with minimal involvement of the user (knowledge engineer). In all these situations, the above model is no longer functional; this time, the complexity is transferred to expressions evaluated at querying time, which stand for the linguistic values model. At querying time, only the labels of the linguistic domain need to be known. Therefore, the Points table no longer exists in the context model; only the Terms table does. According to the proposed model of the context, vague query evaluation consists in building a single crisp SQL query that provides the searched database objects and, at the same time, the degree of criteria satisfaction for each of them. Various situations will be analyzed for each of the two proposed models of the context, thus:
Figure 9. A possible model of the linguistic domain of a database attribute A: each linguistic value is a polygonal membership function described by a list of points (pi11, pf11, pf12, …, pi41, pf41, pf42).

Figure 10. The unified context model, based on static definitions of the linguistic values.
• an absolute qualification criterion can be evaluated in a static context or in a dynamic context;
• a multiqualification criterion can also be evaluated in both kinds of context;
• on the contrary, a relative qualification can be evaluated only according to the dynamic context model.

In the following, we assume three linguistic values, represented by trapezoidal membership functions on the database attribute domain. For the dynamic definitions, we adopt the first algorithm proposed in the section titled Dynamic Modeling of the Linguistic Values. Any generalization is possible.

Evaluation of an Absolute Qualification Criterion

Let us consider a pseudo-SQL query generated from the user’s interface (any kind of interface style: natural language, graphical, command language, etc.). The general form is:

SELECT * FROM table WHERE attribute = #'term'

where the symbol #, preceding a linguistic term, denotes a gradual property. The crisp SQL query corresponding to the user’s request, observing the context model based on static definitions, can be as shown in Equation 7. The EXPRESSION has to be replaced with the algebraic model of the fulfillment degree, taking into account the context (data and knowledge) in Figure 10 (see Equation 8). The crisp SQL query corresponding to the user’s request, observing the context model based on dynamic definitions, can be as shown in Equation 9. The EXPRESSION has to be replaced with the algebraic model of the fulfillment degree, taking into account the context (data and knowledge) and formulae (5)–(6). For example, the expression for the first linguistic value is:

µl1(v) = σ(v − I) − ((v − I − β)/α) · σ(v − I − β) − σ(v − I − β − α) · (1 − (v − I − β)/α)   (10)
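Formula (10) can be verified numerically against the piecewise definition of µl1 (a sketch; the exact step function σ is used here, whereas the sign-based approximation (13) differs only at the breakpoints, where sign(0) = 0 yields 1/2):

```python
# Closed form (10) vs the piecewise trapezoid l1: identical away from breakpoints.

def sigma(v):
    return 1.0 if v > 0 else 0.0      # exact step function

def l1_closed(v, I, a, b):            # formula (10); a = alpha, b = beta
    r = (v - I - b) / a
    return sigma(v - I) - r * sigma(v - I - b) - sigma(v - I - b - a) * (1.0 - r)

def l1_piecewise(v, I, a, b):
    if v <= I + b:
        return 1.0
    if v >= I + b + a:
        return 0.0
    return 1.0 - (v - I - b) / a
```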
Equation 7. SELECT r.*, EXPRESSION AS ”degree”
FROM table r, terms t, points p
WHERE t.table=’table‘ AND t.attribute=’attribute’ AND t.label=’term’ AND
t.ID_t = p.ID_t AND r.attribute >= p.pi AND r.attribute <= p.pf AND degree > 0 ORDER BY degree DESC;
Equation 8.
SELECT r.*, p.gi + (r.attribute - p.pi) / (p.pf - p.pi) * (p.gf - p.gi) AS "degree"
FROM table r, terms t, points p
WHERE t.table='table' AND t.attribute='attribute' AND t.label='term' AND
t.ID_t = p.ID_t AND r.attribute >= p.pi AND r.attribute <= p.pf AND
degree > 0
ORDER BY degree DESC;
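The degree computed in Equation 8 is simply a linear interpolation between the two stored points (pi, gi) and (pf, gf) of a membership-function segment. As a rough illustration outside SQL (the function name and the sample segment below are ours, not the chapter's):

```python
def static_degree(v, pi, pf, gi, gf):
    """Linear interpolation of the fulfillment degree between the
    stored points (pi, gi) and (pf, gf), as in Equation 8."""
    if not (pi <= v <= pf):
        return 0.0  # outside the stored segment, as filtered by the WHERE clause
    return gi + (v - pi) / (pf - pi) * (gf - gi)

# Example: a segment rising from degree 0 at value 10 to degree 1 at value 20
print(static_degree(15, 10, 20, 0.0, 1.0))  # 0.5
```

Each trapezoid is stored as several such segments in the points table, so the join on t.ID_t = p.ID_t picks the right segment for each row.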
Equation 9.
SELECT r.*, EXPRESSION AS "degree"
FROM table r, terms t
WHERE t.table='table' AND t.attribute='attribute' AND t.label='term' AND
degree > 0
ORDER BY degree DESC;
where v = r.attribute, and I and S are the limits of the attribute domain:

I = (SELECT MIN(r.attribute) FROM table r)
S = (SELECT MAX(r.attribute) FROM table r)

α and β are:

α = (1/8)·(S − I) = 1/8*(SELECT MAX(r.attribute) - MIN(r.attribute) FROM table r)   (11)

β = 2α = (1/4)·(S − I) = 1/4*(SELECT MAX(r.attribute) - MIN(r.attribute) FROM table r)   (12)

σ is the step function:

σ(v) = 0 if v ≤ 0; 1 if v > 0

and can be approximated by the sign function:

sign(v) = −1 if v < 0; 0 if v = 0; 1 if v > 0

so that:

σ(v) = (1/2)·(1 + sign(v))   (13)
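Under the dynamic model, formula (10) can thus be computed directly from the domain limits. The following sketch (function names and the sample domain are ours) implements σ through the sign approximation of formula (13) and evaluates the first linguistic value:

```python
def sign(v):
    return -1 if v < 0 else (0 if v == 0 else 1)

def step(v):
    # sigma(v) approximated as (1 + sign(v)) / 2, as in formula (13)
    return (1 + sign(v)) / 2

def mu_l1(v, I, S):
    # First linguistic value under dynamic definitions, formula (10):
    # alpha = (S - I)/8, beta = 2*alpha
    a = (S - I) / 8
    b = 2 * a
    return (step(v - I)
            - (v - I - b) / a * step(v - I - b)
            - step(v - I - b - a) * (1 - (v - I - b) / a))

# Domain [0, 8] gives alpha = 1, beta = 2: full membership up to v = 2,
# then a linear descent reaching 0 at v = 3
print(mu_l1(1.0, 0, 8))   # 1.0
print(mu_l1(2.5, 0, 8))   # 0.5
print(mu_l1(4.0, 0, 8))   # 0.0
```

Note that the sign approximation yields σ(0) = 1/2 rather than 0, which is the price of expressing the step function with Oracle's SIGN.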
By expressing all three membership functions (μl1, μl2, and μl3), the SQL command is as shown in Equation 14.
Evaluation of a Multiqualification Criterion: The Model of the AND Operator

In order to evaluate such a query, the expression of "degree" will be:
Equation 14.
SELECT r.*, DECODE(t.order,
1, 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)))
   - (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 1/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
     / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
     * 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 1/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
   - 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 3/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
     * (1 - (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 1/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
       / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))),
2, (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 1/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
     / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
     * 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 1/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
   + 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 3/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
     * (1 - (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 3/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
       / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
   - 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 5/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
     * (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 5/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
       / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
   - 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 3/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
     * (1 - (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 5/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
       / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))),
3, (r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 5/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
     / (1/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))
     * (1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 5/8*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
      - 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 3/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r))))
   + 1/2*(1+SIGN(r.attribute - (SELECT MIN(r.attribute) FROM table r)
      - 3/4*(SELECT MAX(r.attribute)-MIN(r.attribute) FROM table r)))
   - 1/2*(1+SIGN(r.attribute - (SELECT MAX(r.attribute) FROM table r)))
) AS "degree"
FROM table r, terms t
WHERE t.table='table' AND t.attribute='attribute' AND t.label='term' AND
degree > 0
ORDER BY degree DESC;

DECODE is a selection-like function (in Oracle).
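Since DECODE may be unfamiliar outside Oracle, here is a minimal Python analogue of its selection behavior (an illustrative sketch only; Oracle's DECODE has additional semantics, such as treating two NULLs as equal):

```python
def decode(expr, *args):
    """Minimal analogue of Oracle's DECODE: compare expr to each search
    value in turn and return the matching result; an odd trailing
    argument acts as the default."""
    pairs, default = args, None
    if len(args) % 2 == 1:
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if expr == search:
            return result
    return default

# Selecting a membership expression by the order of the linguistic term:
print(decode(2, 1, "mu_l1", 2, "mu_l2", 3, "mu_l3"))  # mu_l2
```

In Equation 14, the search values 1, 2, 3 are the ranks of the three linguistic values (t.order), and the results are the three membership expressions.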
Equation 15.
SELECT * FROM table WHERE attribute1 = # 'term1' AND attribute2 = # 'term2'
the SQL query according to a static context is:

SELECT r.*, LEAST(EXPRESSION1, EXPRESSION2) AS "degree"
FROM table r, terms t1, terms t2, points p1, points p2
WHERE t1.table='table' AND t1.attribute='attribute1' AND t1.label='term1' AND
t1.ID_t = p1.ID_t AND r.attribute1 >= p1.pi AND r.attribute1 <= p1.pf AND
t2.table='table' AND t2.attribute='attribute2' AND t2.label='term2' AND
t2.ID_t = p2.ID_t AND r.attribute2 >= p2.pi AND r.attribute2 <= p2.pf AND
degree > 0
ORDER BY degree DESC;
Equation 16.
SELECT r.*, LEAST(EXPRESSION1, EXPRESSION2) AS "degree"
FROM table r, terms t1, terms t2
WHERE t1.table='table' AND t1.attribute='attribute1' AND t1.label='term1' AND
t2.table='table' AND t2.attribute='attribute2' AND t2.label='term2' AND
degree > 0
ORDER BY degree DESC;
LEAST(EXPRESSION1, EXPRESSION2)

where EXPRESSION1 and EXPRESSION2 are the satisfaction degrees of the two criteria, and LEAST corresponds to the mathematical min function (in Oracle). Thus, for a query like Equation 15, the SQL query according to a dynamic context is as shown in Equation 16. These command lines can be considered the algorithmic model of the AND conjunction operator in a database fuzzy querying context.
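The min-based conjunction can be sketched outside SQL as follows (the row names and degrees below are invented for illustration):

```python
def and_degree(expr1, expr2):
    # Conjunction of two fuzzy criteria modeled by the minimum,
    # as LEAST(EXPRESSION1, EXPRESSION2) does in the SQL commands above
    return min(expr1, expr2)

# A row satisfying 'term1' to degree 0.8 and 'term2' to degree 0.3:
print(and_degree(0.8, 0.3))  # 0.3

# Rows are kept only when degree > 0 and returned best first,
# mirroring the WHERE degree > 0 ... ORDER BY degree DESC clauses:
rows = [("r1", 0.8, 0.3), ("r2", 0.6, 0.9), ("r3", 0.0, 1.0)]
answers = sorted(
    ((name, and_degree(d1, d2)) for name, d1, d2 in rows
     if and_degree(d1, d2) > 0),
    key=lambda x: -x[1])
print(answers)  # [('r2', 0.6), ('r1', 0.3)]
```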
Evaluation of a Relative Qualification Criterion: The Model of the AMONG Operator

This time, only the dynamic context can be considered, because the second criterion is always evaluated by taking into account the selection already obtained after the first criterion, and the linguistic values are always dynamically modeled on subdomains of the attribute. Thus, for a query like

SELECT * FROM table WHERE attribute1 = # 'term1' AMONG attribute2 = # 'term2'
the SQL query according to a dynamic context is as shown in Equation 17. The expression EXPRESSION2 corresponds to the first criterion and observes the above model, represented in formulae (10)-(13). But EXPRESSION1, which corresponds to the criterion evaluated second, refers to the attribute subdomain obtained by the first selection. So, in EXPRESSION1, the table will be replaced by the answer table Q, and the new values of the parameters will be:
Equation 17.
SELECT r.*, LEAST(EXPRESSION1, Q.firstdegree) AS "degree"
FROM ( SELECT r.*, EXPRESSION2 AS "firstdegree"
       FROM table r, terms t
       WHERE t.table='table' AND t.attribute='attribute2' AND t.label='term2' AND
       firstdegree > 0 ) AS Q,
     table r, terms t
WHERE t.table='table' AND t.attribute='attribute1' AND t.label='term1' AND
degree > 0
ORDER BY degree DESC;
Figure 11. FuzzyQE interface for linguistic domain defining and database complex querying
I = (SELECT MIN(Q.attribute1) FROM Q)
S = (SELECT MAX(Q.attribute1) FROM Q)
α = 1/8*(SELECT MAX(Q.attribute1)-MIN(Q.attribute1) FROM Q)
β = 1/4*(SELECT MAX(Q.attribute1)-MIN(Q.attribute1) FROM Q)
(18)
Equation 17 can be considered the algorithmic model of the AMONG operator for the relative qualification in a database fuzzy querying context.
Retrieve the inexpensive cars among the high speed ones
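The three steps of the AMONG evaluation (first selection, redefinition of the second term on the resulting subdomain as in Equation 18, then min-combination) can be sketched for this car query. The helper names, the simple ramp-shaped terms, and the car data below are illustrative assumptions, not the chapter's exact trapezoidal definitions:

```python
def rising(I, S):
    # A linguistic value increasing over the domain [I, S] (e.g., 'high speed')
    return lambda v: max(0.0, min(1.0, (v - I) / (S - I)))

def falling(I, S):
    # A linguistic value decreasing over the domain [I, S] (e.g., 'inexpensive')
    return lambda v: max(0.0, min(1.0, (S - v) / (S - I)))

# "Retrieve the inexpensive cars among the high speed ones"
cars = [{"id": "c1", "speed": 220, "price": 30000},
        {"id": "c2", "speed": 200, "price": 18000},
        {"id": "c3", "speed": 120, "price": 9000}]

# Step 1: evaluate 'high speed' on the whole speed domain (inner query of Eq. 17)
speeds = [c["speed"] for c in cars]
mu_high = rising(min(speeds), max(speeds))
q = [(c, mu_high(c["speed"])) for c in cars if mu_high(c["speed"]) > 0]

# Step 2: redefine 'inexpensive' on the price subdomain of the selected cars (Eq. 18)
prices = [c["price"] for c, _ in q]
mu_cheap = falling(min(prices), max(prices))

# Step 3: combine with LEAST/min, keep degrees > 0, rank best first
answers = sorted(((c["id"], min(mu_cheap(c["price"]), d))
                  for c, d in q if min(mu_cheap(c["price"]), d) > 0),
                 key=lambda x: -x[1])
print(answers)  # [('c2', 0.8)]
```

Note how c1, although the fastest car, drops out: on the subdomain of high-speed cars it is the most expensive one, so its 'inexpensive' degree is 0.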
Laboratory Software Tools for Database Flexible Querying

Several software tools have been developed for scientific purposes, to be used for studying flexible queries (Tudorie, 2006a, 2006b; Tudorie et al., 2005). These multifunctional systems enable the analysis of many aspects. The most important is the opportunity of studying the relative qualification phenomenon and of validating the proposed algorithms for fuzzy query evaluation. All these systems allow acquiring fuzzy models of the linguistic terms through graphical interfaces. Some of them are also able to interpret natural language (Romanian) queries. It is important to
remark that these systems are general enough to allow connection, at any time, to any database and its associated knowledge base, whether newly created or already in use. Here are three examples:
Interface for Fuzzy Knowledge Acquisition and Database Fuzzy Querying (FuzzyQE System)

Goal: This software tool connects the user to any database and assists the user in defining linguistic values and fuzzy queries in that database context. The system proposes a uniform partitioning of the attribute domain; the definitions implicitly obtained can then be adjusted, either by changing the numerical coordinates of graphical points or by directly manipulating them. Simple queries (absolute object qualification) against existing definitions, but also complex queries (relative object qualification), are evaluated by implementing the AMONG operator (Figure 11).
Multi-User System for Linguistic Values Modeling and Database Fuzzy Querying (MultiDef System)

Goal: This software tool is able to connect several users (e.g., knowledge engineers) to the same database; each of them has the possibility to describe each linguistic value of the database attributes. A defining process starts with an initial implicit model; the user may modify it according to his own semantics for the current linguistic term. An administrator, having his own interface, monitors and manages all this activity; at any moment he has a complete view of all the membership functions drawn by the users for the same linguistic terms on the same attribute domain (Figure 12). Many types of queries can be evaluated.
Flexible Interface for Linguistic Values Modeling and Database Fuzzy Querying (CALIF System)

Goal: This software tool enables the connection to any database via a graphical interface, and the modeling of linguistic values (as fuzzy sets) by a choice of algorithms. Three main types of queries (simple, conjunction, and relative) can be evaluated. The interface is very flexible, providing many ways of adjusting parameters, definitions, and options (Figure 13).
Conclusion

This chapter formulates a number of new problems, not very complicated, but referring to quite frequent situations, which had not been discussed so far. The main aim of the chapter, and its originality, is the introduction of the concept of relative qualification of objects in the context of relational database querying. In this setting, some important new problems, strongly related to relative qualification, were developed (dynamic modeling of the linguistic values and the unified context based on dynamic definitions). Moreover, in an extended framework, this chapter discussed the problems of linguistic qualification of objects and of context modeling in all their aspects. More precisely, the complex criteria we studied (relative qualification) include two vague conditions in a special relationship: the first gradual property, expressed by a linguistic qualifier, is interpreted and evaluated relative to the second one; accordingly, the fulfillment degree is computed in a particular way. The main idea of the evaluation procedure is to dynamically define sets of linguistic values on limited attribute domains, determined by previous fuzzy selections. This is the reason why it is not useful to create the knowledge base with the fuzzy definitions a priori, but rather to define the vague terms included in queries each time they are needed. One more reason is the great advantage of connecting directly to the database: details regarding effective attribute domain limits, or distributions of the values, can be easily obtained. With this in mind, we developed the problem of dynamic modeling of linguistic values. Methods for automatic extraction of the linguistic values' definitions from the actual database attribute values and solutions for
Figure 12. MultiDef interface for linguistic domain defining and database complex querying
Figure 13. CALIF interface for linguistic domain defining and database complex querying
uniformly modeling the context (database and knowledge base) were proposed. The theoretical contribution consists in the new fuzzy aggregation operator AMONG, defined in this chapter; it stands for the model of the relative selection criterion in database fuzzy queries. A detailed discussion of its semantics and properties, along with other remarks, is presented in the chapter. Some implementations that validate all these ideas were also briefly presented. They were developed and are running in the laboratory of our department. Future work will explore the implications of the newly proposed kind of query in real fields like business intelligence, OLAP, or data mining, but also other application fields of the new connective AMONG, like fuzzy control, or fuzzy databases, where fuzzy values are involved.
Acknowledgment The author is thankful for all the remarks made by the reviewers, particularly the editor. Many thanks also to the “English reviewer.”
References

Blanco, I., Delgado, M., Martín-Bautista, M. J., Sánchez, D., & Vila, M. A. (2002). Quantifier guided aggregation of fuzzy criteria with associated importances. In T. Calvo, R. Mesiar, & G. Mayor (Eds.), Aggregation operators: New trends and applications (Studies in Fuzziness and Soft Computing Series 97, pp. 272-290). Physica-Verlag.

Bosc, P., & Pivert, O. (1992). Fuzzy querying in conventional databases. In L. A. Zadeh & J. Kacprzyk (Eds.), Fuzzy logic for the management of uncertainty (pp. 645-671). New York: Wiley.

Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.

Bosc, P., & Prade, H. (1997). An introduction to fuzzy set and possibility theory-based approaches to the treatment of uncertainty and imprecision in data base management systems. In A. Motro & P. Smets (Eds.), Uncertainty management in information systems: From needs to solutions (pp. 285-324). Kluwer Academic Publishers.

Bouchon-Meunier, B. (1995). La logique floue et ses applications. Paris: Addison-Wesley.

Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational databases. Fuzzy Sets and Systems, 7(3), 213-226.

Delgado, M., Sanchez, D., & Vila, M. A. (2000). Fuzzy cardinality based evaluation of quantified sentences. International Journal of Approximate Reasoning, 23, 23-66.

Dubois, D., Ostasiewicz, W., & Prade, H. (1999). Fuzzy sets: History and basic notions (Tech. Rep. No. IRIT/99-27 R). Toulouse, France: Institut de Recherche en Informatique.

Dubois, D., & Prade, H. (1996). Using fuzzy sets in flexible querying: Why and how? In H. Christiansen, H. L. Larsen, & T. Andreasen (Eds.), Workshop on flexible query-answering systems (pp. 89-103), Roskilde, Denmark.

Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing.

Gazzotti, D., Piancastelli, L., Sartori, C., & Beneventano, D. (1995). FuzzyBase: A fuzzy logic aid for relational database queries. Paper presented at the 6th International Conference on Database and Expert Systems Applications, DEXA'95 (pp. 385-394), London, UK.

Goncalves, M., & Tineo, L. (2001a). SQLf flexible querying language extension by means of the norm SQL2. Paper presented at the 10th IEEE International Conference on Fuzzy Systems, Fuzz-IEEE 2001 (Vol. 1), Melbourne, Australia.

Goncalves, M., & Tineo, L. (2001b). SQLf3: An extension of SQLf with SQL3 features. Paper presented at the 10th IEEE International Conference on Fuzzy Systems, Fuzz-IEEE 2001 (Vol. 1), Melbourne, Australia.

Grabisch, M., Orlovski, S. A., & Yager, R. R. (1998). Fuzzy aggregation of numerical preferences. In R. Slowinski (Ed.), Fuzzy sets in decision analysis, operations research and statistics (pp. 31-68). Boston: Kluwer Academic Publishers.

Kacprzyk, J., & Zadrozny, S. (1995). FQUERY for ACCESS: Fuzzy querying for a Windows-based DBMS. In P. Bosc & J. Kacprzyk (Eds.), Fuzziness in database management systems (pp. 415-433). Heidelberg: Physica-Verlag.

Kacprzyk, J., & Zadrozny, S. (2001). Computing with words in intelligent database querying: Standalone and Internet-based applications. Information Sciences, 134, 71-109.

Medina, J. M., Pons, O., & Vila, M. A. (1994). GEFRED: A generalized model for fuzzy relational databases. Information Sciences, 76, 87-109.

Prade, H., & Testemale, C. (1984). Generalizing database relational algebra for the treatment of incomplete or uncertain information and vague queries. Information Sciences, 34, 115-143.
Projet BADINS: Bases de données multimédia et interrogation souple. (1995). Rennes, France: Institut de recherche en informatique et systèmes aléatoires.

Projet BADINS: Bases de données multimédia et interrogation souple. (1997). Rennes, France: Institut de recherche en informatique et systèmes aléatoires.

Rundensteiner, E., & Bic, L. (1991). Evaluating aggregates in possibilistic relational databases. Data & Knowledge Engineering, 7, 239-267.

Tudorie, C. (2004). Linguistic values on attribute subdomains in vague database querying. Journal on Transactions on Systems, 3(2), 646-650.

Tudorie, C. (2006a). Contributions to interfaces for database flexible querying. Doctoral thesis, University "Dunărea de Jos," Galaţi, Romania.

Tudorie, C. (2006b). Laboratory software tools for database flexible querying. Paper presented at the 2006 International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems, IPMU'06 (pp. 112-115).

Tudorie, C., & Dumitriu, L. (2004). How are the attribute linguistic domains involved in database fuzzy queries evaluation. Scientific Bulletin of "Politehnica" University of Timisoara, 49(63), 61-64.

Tudorie, C., Neacsu, C., & Manolache, I. (2005). Fuzzy queries in Romanian language: An intelligent interface. Annals of "Dunarea de Jos," III, 45-53.

Yager, R. R. (1991). Connectives and quantifiers in fuzzy sets. Fuzzy Sets and Systems, 40(1), 39-75. Elsevier Science.
Zadeh, L. A. (1975). The concept of linguistic variable and its application to approximate reasoning (parts I, II, and III). Information Sciences, 8, 199-251, 301-357; 9, 43-80.
Yager, R. R., & Zadeh, L. A. (Eds.). (1992). Introduction to fuzzy logic applications in intelligent systems. Kluwer Academic Publishers.

Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.

Key Terms

Absolute Object Qualification: Including one gradual property (simple qualification) or a conjunction of gradual properties (multiqualification) in a query selection criterion. The fulfillment degree is computed according to the fuzzy sets describing the meaning of the linguistic qualifiers. A conjunctive aggregation operator is used to model the connective AND (in the case of multiqualification).

AMONG Operator: Fuzzy aggregation operator used for the evaluation of a query selection criterion based on a relative qualification. The algebraic model of the AMONG operator is:

μP AMONG S : R → [0,1]
μP AMONG S (t) = min( μP( a1 + ((b1 − a1)/(b1' − a1')) · (t.A1 − a1') ), μS(t.A2) )

where R is a relation, A1 and A2 are two attributes of the R relation, A1 is defined on the interval [a1, b1], t is a tuple, P and S are two gradual properties corresponding to the attributes A1 and A2, μP and μS are the membership functions defining the P and S gradual properties, and [a1', b1'] ⊆ [a1, b1] is the sub-interval of A1 corresponding to the table QS(R) (obtained by the first selection, on the attribute A2, using property S).

Context for Fuzzy Querying Interface: The pair: the target database and the knowledge base (containing the fuzzy model of the linguistic terms) corresponding to it.

Dynamic Model of the Context: Including in the database only the data necessary to dynamically define the linguistic terms, at the moment of (or during) the querying process.

Dynamic Model of the Linguistic Value: Automatic discovery of the linguistic values' definitions from the actual content of the database. Appropriate algorithms can be implemented, based on a great advantage: by directly connecting to the database, one can easily obtain details regarding the effective attribute domain limits, or the distributions of the values. This procedure is generally useful, instead of an off-line process of knowledge acquisition from a human expert; it is mandatory in the relative qualification case.

Multiqualification: Including several gradual properties in a query selection criterion. They are independent of each other and have the same significance for the user's preferences. The fulfillment degree is computed according to the fuzzy sets describing the meaning of the linguistic qualifiers and to the conjunctive aggregation operator, as a model of the connective AND.

Relative Object Qualification: Two gradual properties, as fuzzy conditions, are combined in a complex selection criterion, such that one of them is applied on a subset of database rows already selected by the other one; a dynamic definition of the linguistic value corresponding to the secondly evaluated condition is needed. The fulfillment degree is computed according to the fuzzy sets describing the meaning of the linguistic qualifiers and to the AMONG aggregation operator.

Static Model of the Context: Including in the database the static definitions of the linguistic terms (their fuzzy models), established a priori, before the querying process.

Unified Model of the Context: Modeling the context in a uniform style, as a single database, i.e., incorporating the fuzzy model of the linguistic terms, or their description, in the target database.
Chapter X
Evaluation of Quantified Statements Using Gradual Numbers

Ludovic Liétard, IRISA/IUT & IRISA/ENSSAT, France
Daniel Rocacher, IRISA/IUT & IRISA/ENSSAT, France
Abstract

This chapter is devoted to the evaluation of quantified statements, which can be found in many applications such as decision making, expert systems, or flexible querying of relational databases using fuzzy set theory. Its contribution is to introduce the main techniques to evaluate such statements and to propose a new theoretical background for the evaluation of quantified statements of type "Q X are A" and "Q B X are A." In this context, quantified statements are interpreted using an arithmetic on gradual numbers from ℕf, ℤf, and ℚf. It is shown that the context of fuzzy numbers provides a framework which unifies previous approaches and can be the basis for the definition of new approaches.
Introduction

Linguistic quantifiers are quantifiers defined by linguistic expressions like "around 5" or "most of," and many types of linguistic quantifiers can be found in the literature (Diaz-Hermida, Bugarin, & Barro, 2003; Glockner, 1997, 2004a, 2004b; Losada, Díaz-Hermida, & Bugarín, 2006) (such as semi-fuzzy quantifiers, which allow modeling expressions like "there are twice as many men as women"). We limit this presentation to the original linguistic quantifiers defined by Zadeh (1983) and
the two types of quantified statements he proposes. Such linguistic quantifiers allow an intermediate attitude between the conjunction (expressed by the universal quantifier ∀) and the disjunction (expressed by the existential quantifier ∃). Two types of quantified statements can be distinguished. A statement of the first type is denoted “Q X are A” where Q is a linguistic quantifier, X is a crisp set and A is a fuzzy predicate. Such a statement means that “Q elements belonging to X satisfy A.” An example is provided by “most of employees are well-paid” where Q is most of and X is a set
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
of employees, whereas A is the condition to be well-paid. In this first type of quantified statement, the referential (denoted by X) for the linguistic quantifier is a crisp set (a set of employees in the example). A second type of quantified statement can be defined where the linguistic quantifier applies to a fuzzy referential. This is the case of the statement "most of young employees are well-paid," since most of applies to the fuzzy referential made of young employees. This statement means that most of the elements from this fuzzy referential (most of the young employees) can be considered well-paid. Such a quantified statement is written "Q B X are A" where A and B are two fuzzy predicates (when referring to the previous example, Q is most of, X is a set of employees, B is to be young, while A is to be well-paid). Linguistic quantifiers can be used in many fields, and we briefly recall their use in multicriteria decision making, expert systems, linguistic summaries of data, and flexible querying of relational databases; some minor applications of linguistic quantifiers, as in machine learning (Kacprzyck & Iwanski, 1992) or neural networks (Yager, 1992), are not dealt with. Multicriteria decision making consists mainly in finding optimal solutions to a problem defined by objectives and constraints. A solution must fulfill all objectives and must satisfy all constraints. The use of linguistic quantifiers in decision making (Fan & Chen, 2005; Kacprzyck, 1991; Malczewski & Rinner, 2005; Yager, 1983a) aims at retrieving solutions fulfilling Q objectives with respect to Q' constraints, where Q and Q' are either a linguistic quantifier or the universal quantifier. A typical formulation is then "find the solution where almost all objectives are achieved and where all constraints are satisfied." The use of linguistic quantifiers in expert systems concerns mainly the expression and handling of logical propositions.
An example is provided by logical statements accepting exceptions. A typical statement accepting exceptions is the proposition "all Swedes are tall," which can be turned into "almost all Swedes are tall," involving the linguistic quantifier "almost all." Many inferences involving
quantified statements are possible (Dubois, Godo, De Mantaras, & Prade, 1993; Dubois & Prade, 1988a; Laurent, Marsala, & Bouchon-Meunier, 2003; Loureiro Ralha & Ghedini Ralha, 2004; Mizumoto, Fukami, & Tanaka, 1979; Sanchez, 1988). It is possible to consider the following one, set in the probabilistic framework: if I know that "Karl is a Swede" and that "almost all Swedes are tall," it is then possible to infer that the event "Karl is tall" is probable. The challenge is then to compute the degree of probability (which may be imprecise) attached to the event "Karl is tall." Data summarization (Kacprzyck, Yager, & Zadrozny, 2006; Sicilia, Díaz, Aedo, & García, 2002) is another field where linguistic quantifiers can be helpful. Yager (1982) defines summaries expressed by expressions involving linguistic quantifiers (the summary of a database could be "almost the half of young employees are well-paid"). The SummarySQL language (Rasmussen & Yager, 1997) has been proposed to define and evaluate linguistic summaries of data defined by quantified statements. As an example, it is possible to use this language to determine the validity (represented by a degree), on a given database, of the linguistic summary "almost the half of young employees are well-paid." Flexible querying of relational databases aims at expressing preferences in queries instead of Boolean requirements, as is the case for regular (or crisp) querying. Consequently, a flexible query returns a set of discriminated answers to the user (from the best answers to the less preferred). Many approaches to defining flexible queries have been proposed, and it has been shown that the fuzzy-set-based approach is the most general (Bosc & Pivert, 1992).
Extensions of the SQL language, namely SQLf (Bosc & Pivert, 1995) and FSQL (Galindo, 2005, 2007; Galindo, Medina, Pons, & Cubero, 1998; Galindo, Urrutia, & Piattini, 2006), have been proposed to define sophisticated flexible queries calling on fuzzy sets (in this book, the reader can find a chapter by Urrutia, Tineo, and Gonzalez including a comparison between FSQL and SQLf). In this context, predicates are defined by fuzzy sets and are called fuzzy predicates, and they can be combined using various operators
such as generalized conjunctions and generalized disjunctions (respectively expressed by t-norms and t-conorms) or using more sophisticated operators such as averages. A fuzzy predicate can also be defined by a quantified statement, as in the query "retrieve the firms where most of employees are well-paid." After query evaluation, each firm is associated with a degree in [0,1] expressing its satisfaction with respect to the quantified statement of the first type: "most of employees are well-paid." The higher this degree, the better an answer the firm is. To evaluate a quantified statement is to determine the extent to which it is true. This chapter proposes a new theoretical framework to evaluate quantified statements of type "Q X are A" and "Q B X are A." The propositions are based on the handling of gradual integers (from ℕf and ℤf) (Rocacher & Bosc, 2003a, 2003b) and gradual rational numbers (from ℚf), as defined in Rocacher and Bosc (2003c, 2005). These specific numbers express well-known but gradual quantities and differ from usual fuzzy numbers, which define imprecise (ill-known) numbers. The section titled Linguistic Quantifiers and Quantified Statements introduces the definition of quantified statements, while the section titled Previous Proposals for the Interpretation of Quantified Statements is a brief overview of the propositions made for the evaluation of quantified statements. Gradual numbers are introduced in the section titled Gradual Numbers and Gradual Truth Value, and the section titled Interpretation of Quantified Statements Using Gradual Numbers proposes to evaluate quantified statements using gradual numbers. In the following, we denote by A(X) the fuzzy set made of elements from a crisp set X which satisfy a fuzzy predicate A (A(X) being defined by X ∩ A).
Linguistic Quantifiers and Quantified Statements

First-order logic involves two quantifiers, the universal quantifier (∀) and the existential one (∃), which are too limited to model all natural language quantified sentences. For this reason, fuzzy quantifiers (Zadeh, 1983) have been introduced to represent linguistic expressions (many of, at least 3, etc.) and to refer to gradual quantities. It is possible to distinguish between absolute quantifiers (which refer to an absolute number such as about 3, at least 2, etc.) and relative quantifiers (which refer to a proportion such as about half, at least a quarter, etc.). An absolute (resp. relative) quantifier Q in the statement "Q X are A" means that the number (resp. proportion) of elements satisfying condition A is compatible with Q. A linguistic quantifier can be increasing (resp. decreasing) (Yager, 1988), which means that an increase in the satisfaction of condition A cannot decrease (resp. increase) the truth value of the statement "Q X are A." At least 3 and almost all (resp. at most 2, at most half) are examples of increasing (resp. decreasing) quantifiers. A quantifier is monotonic when it is either increasing or decreasing; it is also possible to point out unimodal quantifiers which refer to a quantity, such as about half, about 4, and so forth. The representation of an absolute quantifier is a fuzzy subset of the real line, while a relative quantifier is defined by a fuzzy subset of the unit interval [0,1]. In both cases, the membership degree µQ(j) represents the truth value of the statement "Q X are A" when j elements in X completely satisfy A, whereas A is fully unsatisfied by the others (j being a number or a proportion). In other words, the definition of a linguistic quantifier provides the evaluation of "Q X are A" in the case of a Boolean predicate. Consequently, the representation of an increasing (resp. decreasing) linguistic quantifier is an increasing (resp. decreasing) function µQ such that µQ(0) = 0 (resp. µQ(0) = 1) and ∃ k such that µQ(k) = 1 (resp. ∃ k such that µQ(k) = 0).

Example. Figure 1 describes the increasing relative linguistic quantifier almost all.
It is worth mentioning that, in case of an absolute quantifier, a quantified statement of type “Q B X are A” reverts to the quantified statement of the other
Evaluation of Quantified Statements Using Gradual Numbers
Figure 1. A representation for the quantifier almost all (µalmost all(p) plotted against the proportion p: 0 up to p = 0.7, reaching 1 at p = 0.9)
type: "Q X are (A and B)." As an example, "at least 3 young employees are well-paid" is equivalent to "at least 3 employees are (young and well-paid)." As a consequence, when dealing with quantified statements of type "Q B X are A," this chapter considers only relative quantifiers.
Previous Proposals for the Interpretation of Quantified Statements

In this section, the main propositions suggested to determine the truth value of quantified statements are briefly overviewed. An in-depth study of quantified statement interpretations can be found in Liu and Kerre (1998a, 1998b); Delgado, Sanchez, and Vila (2000); Barro, Bugarin, Cariñena, and Diaz-Hermida (2003); or Diaz-Hermida, Bugarin, Cariñena, and Barro (2004). The subsection titled Quantified Statements of Type "Q X are A" is devoted to the evaluation of quantified statements of type "Q X are A," whereas the subsection titled Quantified Statements of Type "Q B X are A" is devoted to the evaluation of quantified statements of type "Q B X are A." A short conclusion about these proposals is provided in the subsection titled About the Proposed Approaches to Evaluate Quantified Statements.
Quantified Statements of Type “Q X are A”
Relative quantifiers are assumed hereafter; the adaptation to absolute quantifiers requires only the change of the quantity µQ(i/n) into µQ(i), with n the cardinality of the set X involved in the quantified statement. In the particular case of a Boolean predicate A, the evaluation of "Q X are A" is given by µQ(c), where c is the number of elements satisfying A. Some approaches (interpretations based on a precise or an imprecise cardinality) extend this definition to a fuzzy predicate A, assuming that the cardinality of a fuzzy set can be computed. Other approaches (using an OWA operator or a Sugeno fuzzy integral) are based on a relaxation principle which implies neglecting some elements. As an example, the interpretation of "almost all employees are young" means that some of the oldest employees can be (more or less) neglected before assessing the extent to which the remaining employees are young.

Interpretation Based on a Precise Cardinality

Zadeh (1983) suggests computing the precise cardinality of fuzzy set A (called sigma-count and denoted ΣCount(A)). The sigma-count is defined as the sum of the membership degrees, and the degree of truth of "Q X are A" is then µQ(ΣCount(A)/n), with n the cardinality of set X. The definition of ΣCount(A) implies that a large number of small µA(x) values has the same effect on the result as a small number of large µA(x) values. As a consequence, many drawbacks can be found, such as the one shown by the next example.

Example. Set X = {x1, x2, ..., x10} is such that ∀i, µA(xi) = 0.1. In this case, the result for "∃ X are A" is expected to be 0.1 (or at least extremely low). The existential quantifier ∃ is defined by µ∃(0) = 0 and ∀i > 0, µ∃(i) = 1, and the absolute quantified statement is evaluated by µ∃(ΣCount(A)). Computations give ΣCount(A) = 1, which implies that the expression "∃ X are A" is entirely true (µ∃(1) = 1). This result is very far from the expected one. ♦
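As a sketch, the sigma-count computation and the counter-example above can be reproduced in a few lines of Python (the function names are ours; the quantifier ∃ follows the definition given in the example):

```python
def sigma_count(degrees):
    """Zadeh's precise cardinality: the sum of the membership degrees."""
    return sum(degrees)

def evaluate_absolute(mu_q, degrees):
    """Evaluate 'Q X are A' for an absolute quantifier given as a function mu_q."""
    return mu_q(sigma_count(degrees))

# The existential quantifier: mu(0) = 0 and mu(i) = 1 for i > 0.
def mu_exists(c):
    return 0.0 if c <= 0 else 1.0

# Ten elements, each satisfying A only at degree 0.1.
degrees = [0.1] * 10
print(evaluate_absolute(mu_exists, degrees))  # 1.0, although about 0.1 was expected
```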
Interpretation Based on an Imprecise Cardinality

The method proposed in Prade (1990) involves two steps. The first one computes the imprecise cardinality πc of the set made of the elements from X which satisfy A (it is a fuzzy number represented by a possibility distribution over integers). Then, quantifier Q is considered a vague predicate serving as a basis for a matching with πc. The result is a couple of degrees, the possibility and the necessity of the fuzzy event "πc is compatible with Q." The imprecise cardinality of the set F of elements from X which satisfy A is given by the following possibility distribution (Dubois & Prade, 1985; Prade, 1990). Let k be the number of elements of F whose degree is 1 (k may equal 0):

πc(k) = 1,
∀i < k, πc(i) = 0,
∀j > k, πc(j) is the j-th largest value µF(x).
In the particular case where F is a usual set, πc describes a precise value (πc(k) = 1 and πc(i) = 0 ∀i ≠ k) which is the usual cardinality of this set. The possibility Π(Q ; πc) and the necessity N(Q ; πc) of the fuzzy event "πc is compatible with Q" are (Dubois, Prade, & Testemale, 1988b):

Π(Q ; πc) = max_{1 ≤ i ≤ n} min(µQ(i/n), πc(i))

and

N(Q ; πc) = min_{1 ≤ i ≤ n} max(µQ(i/n), 1 − πc(i)).
Figure 2. A representation for the quantifier almost all (0 up to p = 0.7, 0.25 at p = 0.8, 1 from p = 0.9)
Example. Let Q be the increasing relative quantifier almost all defined in Figure 2 and X = {x1, x2, x3, x4, x5, x6 , x7, x8 , x9, x10} with µA(x1) = µA(x2) = ... = µA(x7) = 1, µA(x8) = 0.9, µA(x9) = 0.7, µA(x10) = 0. We have: πc(7) = 1, πc(8) = 0.9 and πc(9) = 0.7, with: µalmost all(1/10) = ... = µalmost all(7/10) = 0, µalmost all(8/10) = 0.25, µalmost all(9/10) = µalmost all(1) = 1. The interpretation of “almost all X are A” leads to Exhibit A.
Interpretation by the OWA Operator

We assume that X = {x1, ..., xn} and µA(x1) ≥ µA(x2) ≥ ... ≥ µA(xn). The interpretation of "Q X are A" (Q being increasing) by an ordered weighted average (OWA operator) is given by (Yager, 1988):

∑_{i=1}^{n} (wi × µA(xi)),

where wi = µQ(i/n) − µQ((i−1)/n). Each weight wi represents the increase of satisfaction when comparing a situation where (i−1) elements are entirely A with a situation where i elements are entirely A (and the others are not at all A). This operator conveys a semantics of relaxation since the smaller wi, the more neglected µA(xi). An extension of the use of the OWA operator to decreasing quantifiers has been proposed by Yager (1993) and Bosc and Liétard (1993). The extension is based on the equivalence:

"Q X are A" ⇔ "Q′ X are ¬A,"

where Q′ is the antonym of the decreasing quantifier Q (Q′ is then an increasing quantifier given
Exhibit A.

Π(Q ; πc) = max(min(µalmost all(7/10), πc(7)), min(µalmost all(8/10), πc(8)), min(µalmost all(9/10), πc(9)))
= max(min(0, 1), min(0.25, 0.9), min(1, 0.7)) = 0.7,

N(Q ; πc) = min(max(µalmost all(7/10), 1 − πc(7)), max(µalmost all(8/10), 1 − πc(8)), max(µalmost all(9/10), 1 − πc(9)))
= min(max(0, 0), max(0.25, 0.1), max(1, 0.3)) = 0 ♦

(The terms for the other values of i do not affect the result: they contribute 0 to the maximum and 1 to the minimum.)
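The two steps (building πc, then matching it against Q) can be sketched in Python; the piecewise-linear shape used below for almost all (0 at 0.7, 0.25 at 0.8, 1 at 0.9) is only one plausible reading of Figure 2:

```python
def imprecise_cardinality(degrees):
    """Possibility distribution pi_c over candidate cardinalities (Prade, 1990)."""
    d = sorted(degrees, reverse=True)
    k = sum(1 for v in d if v == 1.0)            # number of fully satisfying elements
    pi = {i: 0.0 for i in range(len(d) + 1)}
    pi[k] = 1.0
    for j in range(k + 1, len(d) + 1):
        pi[j] = d[j - 1]                         # j-th largest membership degree
    return pi

def possibility_necessity(mu_q, degrees):
    """Pi(Q; pi_c) and N(Q; pi_c) for a relative quantifier mu_q on [0,1]."""
    n = len(degrees)
    pi = imprecise_cardinality(degrees)
    Pi = max(min(mu_q(i / n), pi[i]) for i in range(1, n + 1))
    N = min(max(mu_q(i / n), 1 - pi[i]) for i in range(1, n + 1))
    return Pi, N

def almost_all(p):
    # An assumed piecewise-linear reading of Figure 2: 0 at 0.7, 0.25 at 0.8, 1 at 0.9.
    if p <= 0.7: return 0.0
    if p <= 0.8: return (p - 0.7) * 2.5
    if p <= 0.9: return 0.25 + (p - 0.8) * 7.5
    return 1.0

degrees = [1, 1, 1, 1, 1, 1, 1, 0.9, 0.7, 0]
print(possibility_necessity(almost_all, degrees))  # (0.7, 0.0)
```

The result reproduces Exhibit A: a possibility of 0.7 and a necessity of 0.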
by ∀p ∈ [0,1], µQ′(p) = µQ(1−p)). It is then possible to use the initial proposition to interpret "Q′ X are ¬A." In addition, when Q is not monotonic, this approach leads to the GD method introduced in the section titled The Probabilistic Approach (GD Method).
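The OWA evaluation itself is a simple weighted sum once the degrees are sorted; the following sketch (with the same assumed shape for almost all) applies it to the data of the previous example:

```python
def owa_increasing(mu_q, degrees):
    """Yager's OWA evaluation of 'Q X are A' for an increasing relative quantifier."""
    d = sorted(degrees, reverse=True)              # mu_A(x1) >= ... >= mu_A(xn)
    n = len(d)
    w = [mu_q(i / n) - mu_q((i - 1) / n) for i in range(1, n + 1)]
    return sum(wi * di for wi, di in zip(w, d))

def almost_all(p):
    # One plausible piecewise-linear reading of Figure 2 (an assumption):
    # 0 up to 0.7, 0.25 at 0.8, 1 from 0.9 on.
    if p <= 0.7: return 0.0
    if p <= 0.8: return (p - 0.7) * 2.5
    if p <= 0.9: return 0.25 + (p - 0.8) * 7.5
    return 1.0

degrees = [1, 1, 1, 1, 1, 1, 1, 0.9, 0.7, 0]
print(round(owa_increasing(almost_all, degrees), 3))  # 0.75
```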
The Probabilistic Approach (GD Method)

This method (Delgado et al., 2000) is based on the following imprecise cardinality of the fuzzy set A(X): ∀k ∈ {0, 1, 2, ..., n}, p(k) = bk − bk+1, where n is the cardinality of set X and bk is the k-th largest degree of membership of an element in the fuzzy set A(X) (with b0 = 1 and bn+1 = 0). A value p(k) can be interpreted as the probability that set A(X) contains k elements. The evaluation of a "Q X are A" statement with an absolute quantifier is:

∑_{k=0}^{n} p(k) × µQ(k).

When Q is relative, the evaluation becomes:

∑_{k=0}^{n} p(k) × µQ(k/n).
This interpretation is clearly the average value of the different values taken by the linguistic quantifier.
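A minimal sketch of the GD evaluation (function names are ours):

```python
def gd_evaluation(mu_q, degrees, relative=True):
    """GD method: the expected value of Q over the probabilistic cardinality p(k)."""
    n = len(degrees)
    b = [1.0] + sorted(degrees, reverse=True) + [0.0]   # b_0 = 1, b_{n+1} = 0
    p = [b[k] - b[k + 1] for k in range(n + 1)]         # p(k) = b_k - b_{k+1}
    return sum(p[k] * (mu_q(k / n) if relative else mu_q(k)) for k in range(n + 1))

def mu_exists(c):
    return 0.0 if c <= 0 else 1.0

# The sigma-count counter-example now behaves as expected:
print(gd_evaluation(mu_exists, [0.1] * 10, relative=False))  # 0.1
```

On the sigma-count counter-example it returns 0.1, the intuitively expected value.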
The ZS Method

The ZS method proposed in Delgado et al. (2000) considers the following fuzzy cardinality π of the fuzzy set A(X) of elements which satisfy predicate A: π(k) = 0 if there is no level cut α such that |A(X)α| = k; otherwise π(k) = sup{α such that |A(X)α| = k}. This fuzzy cardinality can be interpreted as a possibility distribution. The interpretation δ of the quantified statement "Q X are A" is the compatibility of the fuzzy quantifier Q with that fuzzy cardinality:

δ = max_{1 ≤ k ≤ n} min(µQ(k), π(k)),

where n is the cardinality of set X. This evaluation clearly provides the possibility of the event "the cardinality satisfies Q" (as in the interpretation based on an imprecise cardinality introduced above). In addition, it is a generalization (Delgado et al., 2000) of the Sugeno fuzzy integral approach: when Q is increasing, the ZS and the Sugeno integral methods lead to the same result.
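A sketch of the ZS computation; the triangular shape assumed for the absolute quantifier about 3 is hypothetical:

```python
def zs_evaluation(mu_q, degrees):
    """ZS method: compatibility of Q with the alpha-cut-based fuzzy cardinality pi."""
    levels = sorted({d for d in degrees if d > 0}, reverse=True)
    pi = {}
    for alpha in levels:
        k = sum(1 for d in degrees if d >= alpha)   # |A(X)_alpha|
        pi[k] = max(pi.get(k, 0.0), alpha)          # sup of the levels giving cardinality k
    return max(min(mu_q(k), p) for k, p in pi.items())

def about_3(c):
    # A triangular absolute quantifier (assumed shape): 1 at 3, 0 at 1 and 5.
    return max(0.0, 1 - abs(c - 3) / 2)

print(zs_evaluation(about_3, [1, 1, 0.8, 0.6]))  # 0.8
```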
Interpretation Based on a Sugeno Fuzzy Integral

The interpretation of "Q X are A" (Q being increasing) by a Sugeno fuzzy integral (Bosc & Liétard, 1994a, 1994b; Ying, 2006) is given by:
δ = max_{1 ≤ i ≤ n} min(µQ(i/n), µA(xi)),

where µA(x1) ≥ µA(x2) ≥ ... ≥ µA(xn). Due to the properties of the Sugeno fuzzy integral, δ states the existence of a subset C of X such that:

• each element in C is A with some concrete degree,
• subset C is in agreement with the linguistic quantifier Q.

Since Q is increasing, the more these two aspects are met, the higher the truth value for "Q X are A." As an example, "almost all employees are young" is evaluated by the existence of a subset of young employees which gathers almost all the employees. More precisely, δ can also be defined by:

δ = max_{C ∈ P(X)} min(p1(C), p2(C)),

where P(X) denotes the powerset of X, p1(C) is defined by min_{x ∈ C} µA(x), and p2(C) is given by µQ(|C|/n) with n the cardinality of set X. In addition, it can be demonstrated (Dubois et al., 1988b) that this interpretation can also be given by a weighted conjunction:

δ = min_{1 ≤ i ≤ n} max(1 − wi, µA(xi)),

where wi = 1 − µQ((i−1)/n) is the importance given to degree µA(xi). Here again, the smaller wi, the more neglected µA(xi). This Sugeno fuzzy integral based evaluation is a particular case of a proposition (Bosc & Liétard, 2005) made in a more general framework to evaluate the extent to which an aggregate (computed on a fuzzy set; the cardinality in the case of a quantified statement) is confronted with a fuzzy predicate (a linguistic quantifier). So, it can easily be extended to any kind of linguistic quantifier.
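A sketch of the Sugeno-integral evaluation under the same assumed shape for almost all; on the data of the imprecise-cardinality example it returns 0.7, the same value as the possibility degree, which illustrates the generalization result mentioned for the ZS method:

```python
def sugeno_evaluation(mu_q, degrees):
    """Sugeno-integral evaluation of 'Q X are A' (Q increasing and relative)."""
    d = sorted(degrees, reverse=True)
    n = len(d)
    return max(min(mu_q(i / n), d[i - 1]) for i in range(1, n + 1))

def almost_all(p):
    # Assumed piecewise-linear reading of Figure 2: 0 at 0.7, 0.25 at 0.8, 1 at 0.9.
    if p <= 0.7: return 0.0
    if p <= 0.8: return (p - 0.7) * 2.5
    if p <= 0.9: return 0.25 + (p - 0.8) * 7.5
    return 1.0

degrees = [1, 1, 1, 1, 1, 1, 1, 0.9, 0.7, 0]
print(sugeno_evaluation(almost_all, degrees))  # 0.7
```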
Quantified Statements of Type "Q B X are A"

This section presents the previous propositions for the interpretation of fuzzy quantified statements of type "Q B X are A." Here again, a relative quantifier Q is considered.
Interpretation with an OWA Operator

Yager (1988) suggests interpreting the expression "Q B X are A" by an ordered weighted averaging. Let X = {x1, ..., xn} with µB(x1) ≤ µB(x2) ≤ ... ≤ µB(xn) and:

∑_{i=1}^{n} µB(xi) = d.

The weights of the average are defined by wi = µQ(Si) − µQ(Si−1), with Si = (∑_{j=1}^{i} µB(xj))/d and S0 = 0. This operator aggregates the values of the implication µB(x) →K-D µA(x), where →K-D denotes the Kleene-Dienes implication (a →K-D b = max(1 − a, b)). If the implication values ci are sorted in decreasing order c1 ≥ c2 ≥ ... ≥ cn, the interpretation of "Q B X are A" is:

∑_{i=1}^{n} (ci × wi).
This calculus uses an OWA operator to aggregate implication values. As an example, the truth value obtained for “most of young employees are well-paid” is that of “for most of the employees, to be young implies to be well-paid.” The obtained result is far from the original meaning of the quantified statement.
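The weight and implication computations can be sketched as follows; the piecewise-linear shape for almost all interpolates the values µ(0) = 0, µ(1/3) = 0.2, µ(2/3) = 0.8, µ(1) = 1 used later in the text and is otherwise an assumption:

```python
def owa_qbx(mu_q, mu_b, mu_a):
    """Yager's OWA reading of 'Q B X are A' via Kleene-Dienes implication values."""
    n = len(mu_b)
    pairs = sorted(zip(mu_b, mu_a))                 # mu_B(x1) <= ... <= mu_B(xn)
    d = sum(b for b, _ in pairs)
    s = [0.0]
    for b, _ in pairs:
        s.append(s[-1] + b / d)                     # S_i = (sum_{j<=i} mu_B(xj)) / d
    w = [mu_q(s[i]) - mu_q(s[i - 1]) for i in range(1, n + 1)]
    impl = sorted((max(1 - b, a) for b, a in pairs), reverse=True)   # c1 >= ... >= cn
    return sum(wi * ci for wi, ci in zip(w, impl))

def almost_all(p):
    # Linear through (0,0), (1/3,0.2), (2/3,0.8), (1,1) -- an assumed shape.
    if p <= 1 / 3: return 0.6 * p
    if p <= 2 / 3: return 0.2 + 1.8 * (p - 1 / 3)
    return 0.8 + 0.6 * (p - 2 / 3)

# Table 1 data: B = [1, 1, 1], A = [1, 0, 0].
print(round(owa_qbx(almost_all, [1, 1, 1], [1, 0, 0]), 3))  # 0.2
```

On the data of Table 1 (B crisp, one A element out of three) the OWA reading yields 0.2.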
Interpretation by Decomposition

The interpretation by decomposition described in Yager (1983, 1984) is limited to increasing quantifiers. The proposition "Q B X are A" is true if an ordinary subset C of X satisfies the conditions p1 and p2 given hereafter:

p1: there are Q elements B in C,
p2: each element x of C satisfies the implication: (x is B) → (x is A).

The truth value of the proposition "Q B X are A" is then defined by:

sup_{C ∈ P(X)} min(p1(C), p2(C)),

where p1(C) (resp. p2(C)) denotes the degree of satisfaction of C with respect to the condition p1 (resp. p2). The value p1(C) is defined by µQ(h), where h is the proportion of B elements in set C. Yager suggests the following definition of h (using ΣCounts):

h = (∑_{x∈C} µB(x)) / (∑_{x∈X} µB(x)).

Table 1. Satisfaction degrees with respect to B and A

      x1   x2   x3
B     1    1    1
A     1    0    0
The value of p2(C) is ∧_{x ∈ C} (µB(x) → µA(x)), where ∧ is any triangular norm and → a fuzzy implication. This interpretation leads to evaluating the quantified statement by an aggregation of implication values µB(x) → µA(x). Similarly to the OWA-based interpretation of "Q B X are A," this interpretation is far from the original meaning of "Q B X are A."
Proposition of Vila, Cubero, Medina, and Pons

According to this proposition (Vila et al., 1997), the degree of truth for "Q B X are A" is defined by:

δ = α × max_{x∈X} min(µA(x), µB(x)) + (1 − α) × min_{x∈X} max(µA(x), 1 − µB(x)),

where α is a degree of orness (Yager & Kacprzyck, 1997) computed from the linguistic quantifier:

α = ∑_{i=1}^{n} ((n − i)/(n − 1)) × (µQ(i/n) − µQ((i − 1)/n)).
The interpretation of "Q B X are A" is a degree set between the truth value of "∃ B X are A" (given by max_{x∈X} min(µA(x), µB(x))) and that of "∀ B X are A" (given by min_{x∈X} max(µA(x), 1 − µB(x))). The closer α is to one, the more "Q B X are A" is interpreted as "∃ B X are A."

Example. Let us consider X = {x1, x2, x3} where the satisfaction degrees with respect to predicates B and A are given by Table 1. The value of α is given by:

α = 1 × (µalmost all(1/3) − µalmost all(0)) + 1/2 × (µalmost all(2/3) − µalmost all(1/3)) + 0 × (µalmost all(1) − µalmost all(2/3)).

The linguistic quantifier almost all is such that µalmost all(0) = 0, µalmost all(1/3) = 0.2, µalmost all(2/3) = 0.8 and µalmost all(1) = 1, and we get:

α = 1 × (0.2 − 0) + 1/2 × (0.8 − 0.2) + 0 × (1 − 0.8) = 0.2 + 0.3 = 0.5.

The final result is then:

δ = α × max_{x∈X} min(µA(x), µB(x)) + (1 − α) × min_{x∈X} max(µA(x), 1 − µB(x)) = 0.5 × 1 + (1 − 0.5) × 0 = 0.5.
As a consequence, “almost all B X are A” is true at degree 0.5 which is far from the expected result (since the proportion of A elements among the B elements is 1/3 and µalmost all(1/3) = 0.2). ♦
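The whole example can be checked with a short sketch (the quantifier linearly interpolates the four values given above):

```python
def vila_evaluation(mu_q, mu_b, mu_a):
    """Vila et al.: an orness-weighted mix of the existential and universal readings."""
    n = len(mu_b)
    orness = sum(((n - i) / (n - 1)) * (mu_q(i / n) - mu_q((i - 1) / n))
                 for i in range(1, n + 1))
    exist = max(min(a, b) for a, b in zip(mu_a, mu_b))
    univ = min(max(a, 1 - b) for a, b in zip(mu_a, mu_b))
    return orness * exist + (1 - orness) * univ

def almost_all(p):
    # Linear through the values given in the example: (0,0), (1/3,0.2), (2/3,0.8), (1,1).
    if p <= 1 / 3: return 0.6 * p
    if p <= 2 / 3: return 0.2 + 1.8 * (p - 1 / 3)
    return 0.8 + 0.6 * (p - 2 / 3)

print(round(vila_evaluation(almost_all, [1, 1, 1], [1, 0, 0]), 3))  # 0.5
```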
The GD Method for "Q B X are A" Statements

Delgado et al. (2000) propose a probabilistic view of the proportion of A elements among the B elements. Computations are related to the two fuzzy sets B(X) and A(X) ∩ B(X); when these sets are not normal, they should first be normalized (using any technique). Let S = {α1, α2, …, αm} be the set of the different satisfaction degrees of elements from X with respect to the fuzzy conditions B and A ∩ B (it is considered that 1 = α1 > α2 > … > αm > αm+1 = 0), and let P be the set of the different proportions provided by the α-cuts:

P = { |A(X)α ∩ B(X)α| / |B(X)α| such that α is in S }.

If we denote by P⁻¹(c) the set of levels from S having c as relative cardinality (c being in P):

P⁻¹(c) = {αi from S such that |A(X)αi ∩ B(X)αi| / |B(X)αi| = c},

the probability p(c) for a proportion c (in [0,1]) to represent |A(X) ∩ B(X)| / |B(X)| is defined by:

p(c) = ∑_{αi in P⁻¹(c)} (αi − αi+1).

The evaluation of "Q B X are A" is then:

∑_{c in P} p(c) × µQ(c).

About the Proposed Approaches to Evaluate Quantified Statements

Some properties to be verified by any technique to evaluate quantified statements of type "Q X are A" and "Q B X are A" have been proposed in the literature (Blanco, Delgado, Martín-Bautista, Sánchez, & Vila, 2002; Delgado et al., 2000), and it is possible to situate the different propositions with respect to these properties. At first, these properties are introduced, and then the evaluations of quantified statements are discussed. Concerning "Q X are A" statements, the following properties can be considered:
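A sketch of the GD method for "Q B X are A" (min is used as the t-norm, and B(X) is assumed normalized):

```python
def gd_qbx(mu_q, mu_b, mu_a):
    """GD method for 'Q B X are A' (min as the t-norm; assumes B(X) is normalized)."""
    inter = [min(a, b) for a, b in zip(mu_a, mu_b)]           # (A ∩ B)(X)
    levels = sorted({1.0} | {d for d in mu_b + inter if d > 0}, reverse=True)
    probs = {}
    for i, alpha in enumerate(levels):
        nxt = levels[i + 1] if i + 1 < len(levels) else 0.0
        b_card = sum(1 for d in mu_b if d >= alpha)           # |B(X)_alpha|
        c = sum(1 for d in inter if d >= alpha) / b_card      # proportion at this cut
        probs[c] = probs.get(c, 0.0) + (alpha - nxt)          # p(c) = sum (alpha_i - alpha_i+1)
    return sum(p * mu_q(c) for c, p in probs.items())

def almost_all(p):
    # Linear through (0,0), (1/3,0.2), (2/3,0.8), (1,1), as in the Vila example.
    if p <= 1 / 3: return 0.6 * p
    if p <= 2 / 3: return 0.2 + 1.8 * (p - 1 / 3)
    return 0.8 + 0.6 * (p - 2 / 3)

# Table 1 data again: the only alpha-cut proportion is 1/3.
print(round(gd_qbx(almost_all, [1, 1, 1], [1, 0, 0]), 3))  # 0.2
```

On the data of Table 1 it returns µalmost all(1/3) = 0.2, the value the text presents as the expected one, in contrast with the 0.5 obtained by the Vila et al. proposition.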
Property 1. If predicate A is crisp, the evaluation must deliver µQ(|A(X)|) in case of an absolute quantifier and µQ(|A(X)|/n) in case of a relative quantifier (where A(X) is the crisp set made of the elements from X which satisfy A, and n is the cardinality of the crisp set X).

Property 2. The evaluation is coherent with the universal and existential quantifiers. It means the evaluation of "Q X are A" is ∨_{x∈X} µA(x) when Q is ∃ and ∧_{x∈X} µA(x) when Q is ∀ (∨ and ∧ being respectively a co-norm and a norm).

Property 3. The evaluation is coherent with quantifier inclusion. Given two quantifiers Q and Q′ such that Q ⊆ Q′ (∀x, µQ(x) ≤ µQ′(x)), the evaluation of "Q X are A" cannot be larger than that of "Q′ X are A."

Concerning the "Q B X are A" statements, it is possible to recall:

Property 4. If A and B are crisp and Q is relative, the evaluation must deliver µQ(|A(X) ∩ B(X)|/|B(X)|), where A(X) (resp. B(X)) is the set made of the elements from X which satisfy A (resp. B).
Property 5. When B is a Boolean predicate, the evaluation of "Q B X are A" is similar to that of "Q B(X) are A," where B(X) is the (crisp) set made of elements from X which satisfy B.

Property 6. If the set of elements which are B is included in the set of A elements, Q is relative, and B is normalized, then the evaluation of "Q B X are A" is µQ(1) (since 100% of the B elements are A due to the inclusion).

Property 7. If A(X) ∩ B(X) = ∅ (where A(X) (resp. B(X)) is the set made of elements from X which satisfy A (resp. B)), then the evaluation must return the value µQ(0).

When considering the evaluation of "Q X are A" statements, the approaches based on cardinalities deliver a result which can be difficult to interpret. In the case of a precise cardinality, the main drawback is that a large number of elements with small membership degrees may have the same effect on the result as a small number of elements with large membership degrees. As a consequence, property 2 cannot be satisfied (this behavior is demonstrated in Delgado et al., 2000); on the other hand, as shown in Delgado et al. (2000), properties 1 and 3 are satisfied. In the case of an imprecise cardinality, the result of the interpretation is imprecise since it takes the form of two indices: a degree of possibility and a degree of necessity. This imprecision tied to the result is difficult to justify because the computations take as input a precise quantifier and precise degrees of satisfaction, so why should they deliver an imprecise result? In contrast, the approaches to evaluate "Q X are A" using a relaxation mechanism provide a result with a clear meaning which is easy to interpret. These approaches (including the ZS technique) satisfy properties 1, 2, and 3 (Delgado et al., 2000). When considering the evaluation of "Q B X are A" statements, the approaches based on the OWA operator and on a decomposition technique modify the meaning of the quantified statement since "Q B X are A" is interpreted as "for
Q elements in X, to satisfy B implies to satisfy A." These two approaches satisfy properties 4 and 5, while properties 6 and 7 are not fulfilled (Delgado et al., 2000). The approach proposed by Vila et al. (1997) interprets the quantified statement as a compromise between "∃ B X are A" and "∀ B X are A." As a consequence, it may lead to a result which does not fit the quantifier's definition, and none of the properties introduced in this section can be satisfied (Delgado et al., 2000). The GD method satisfies all of the properties (4, 5, 6, and 7). The next sections show that the framework of gradual numbers offers powerful tools to evaluate quantified statements. This context allows unifying the previous propositions made to evaluate quantified statements of type "Q X are A" (those based on a relaxation mechanism). In addition, gradual numbers offer new techniques to evaluate "Q B X are A" statements.
Gradual Numbers and Gradual Truth Value

It has been shown (Rocacher, 2003) that dealing with both quantification and preferences defined by fuzzy sets leads to defining gradual natural integers (elements of ℕf) corresponding to fuzzy cardinalities. Then, ℕf has been extended to ℤf (the set of gradual relative integers) and ℚf (the set of gradual rationals) in order to deal with queries based on difference or division operations (Rocacher & Bosc, 2005). These new frameworks provide arithmetic foundations where the difference or ratio between gradual quantities can be evaluated. As a consequence, gradual numbers are essential, in particular, for dealing with flexible queries using absolute or relative fuzzy quantifiers. This is the reason why this section briefly introduces ℕf, the set of gradual integers, and its extensions ℤf and ℚf. Then, it is shown that applying a fuzzy predicate to a gradual number provides a specific truth value which is also gradual.
Gradual Natural Integers

The fuzzy cardinality |F| of a fuzzy set F, as proposed by Zadeh (1983), is a fuzzy set on ℕ, called FGCount(F), defined by:

∀ n ∈ ℕ, µ|F|(n) = sup{α | |Fα| ≥ n},

where Fα denotes an α-cut of the fuzzy set F. The degree α associated with a number n in the fuzzy cardinality |F| is interpreted as the extent to which F has at least n elements. It is a normalized fuzzy set of integers and the associated characteristic function is nonincreasing.

Example. The fuzzy cardinality of the fuzzy set F = {1/x1, 1/x2, 0.8/x3, 0.6/x4} is |F| = {1/0, 1/1, 1/2, 0.8/3, 0.6/4}. The amount of data in F is completely and exactly described by {1/0, 1/1, 1/2, 0.8/3, 0.6/4}. Degree 0.8 is the extent to which F contains at least three elements. ♦

It is very important to notice that we do not interpret a fuzzy cardinality as a fuzzy number based on a possibility distribution (which has a disjunctive interpretation). In fact, the knowledge of the cardinalities of all the different α-cuts of a fuzzy set F provides an exact characterization of the number of elements belonging to F. Consequently, |F| must be viewed as a conjunctive fuzzy set of integers. As a matter of fact, the considered fuzzy set F represents a perfectly known collection of data (without uncertainty), so its cardinality |F| is also perfectly known. We think that it is more convenient to qualify such a cardinality as a "gradual" number rather than a "fuzzy" number. Other fuzzy cardinalities based on the definition of FGCounts, such as FLCounts or FECounts, have been defined by Zadeh (1983) or Wygralak (1999). Dubois and Prade (1985) and Delgado et al. (2000) have adopted a possibilistic point of view where a fuzzy cardinality is interpreted as a possibility distribution over α-cuts corresponding to a fuzzy number (Dubois & Prade, 1987). The rest of this chapter is based on such a fuzzy cardinality defined as the FGCount,
and the set of all fuzzy cardinalities is called ℕf (the set of gradual natural integers). The α-cut xα of a gradual natural integer x is an integer defined as the highest integer value appearing in the description of x associated with a degree at least equal to α. In other words, it is the largest integer appearing in the α-level cut of its representation:

xα = max{c ∈ ℕ | µx(c) ≥ α}.

When x describes the FGCount of a fuzzy set A, the following equality holds: xα = |Aα|. This approach is along the line presented by Dubois and Prade (2005), where they introduce the concept of a fuzzy element e in a set S, defined as an assignment function ae from a complete lattice L − {0} to S. Following this view, a gradual natural integer x belonging to ℕf can be defined by an assignment function ax from ]0, 1] to ℕ such that:

∀ α ∈ ]0, 1], ax(α) = xα.

If x is identified with the fuzzy cardinality |F| of a fuzzy set F, then ax(α) is the cardinality of the α-level cut of F.

Example. |F| = {1/0, 1/1, 1/2, 0.8/3, 0.6/4} is a gradual natural integer defined by an assignment function a|F| graphically represented by Figure 3. As an example, a|F|(0.7) = |F0.7| = 3. ♦

Figure 3. The assignment function of a fuzzy cardinality (a step function from ]0, 1] to ℕ: a|F|(α) = 4 for α ∈ ]0, 0.6], 3 for α ∈ ]0.6, 0.8], 2 for α ∈ ]0.8, 1])

Any operation # between two natural integers can then be extended to gradual natural integers x and y (Rocacher & Bosc, 2005) by defining the corresponding assignment function ax#y as follows:

∀ α ∈ ]0, 1], ax#y(α) = ax(α) # ay(α) = xα # yα.

Due to the specific characterization of gradual integers, it can easily be shown that ℕf is a semiring. The addition and product operations satisfy the following properties: (ℕf, +) is a commutative monoïd (+ is closed and associative) with the neutral element {1/0}; (ℕf, ×) is a monoïd with the neutral element {1/0, 1/1}; the product is distributive over the addition.

Gradual Relative Integers

In ℕf the difference between two gradual natural integers may not be defined. As a consequence, ℕf has to be extended to ℤf in order to build up a group structure. The set of gradual relative integers ℤf is defined by the quotient set (ℕf × ℕf) / ℛ of all equivalence classes on (ℕf × ℕf) with regard to the equivalence relation ℛ characterized by:

∀ (x+, x−) ∈ ℕf × ℕf, ∀ (y+, y−) ∈ ℕf × ℕf, (x+, x−) ℛ (y+, y−) iff x+ + y− = x− + y+.

The α-cut of a fuzzy relative integer (x+, x−) is defined as the relative integer (x+α − x−α). Any fuzzy relative integer x has a unique canonical representative xc which can be obtained by enumerating the values of its different α-cuts on ℤ:

xc = ∑ αi / (x+αi − x−αi),

where the αi correspond to the different degrees appearing in the representations of x+ and x−. Each value xα can be computed from the canonical representation: xα is the value associated in xc with β, the smallest degree αi larger than or equal to α. The assignment function ax of x is a function from ]0, 1] to ℤ such that:

∀ α ∈ ]0, 1], ax(α) = x+α − x−α = xα.

Example. The compact denotation of the fuzzy relative integer (x, y) (with x = {1/0, 1/1, 0.8/2, 0.5/3, 0.2/4} and y = {1/0, 1/1, 0.9/2}) is:

(x, y)c = {1/0, 0.9/−1, 0.8/0, 0.5/1, 0.2/2}c.

As an example, for a level of 0.9 we get x0.9 = 1 while y0.9 = 2. As a consequence, the α-cut of (x, y) at level 0.9 is x0.9 − y0.9 = −1. The assignment function of (x, y) is represented by Figure 4. ♦

Figure 4. Assignment function of the gradual relative integer (x, y)

If x and y are two gradual relative integers, the addition x + y and the multiplication x × y are respectively defined by the classes (x+ + y+, x− + y−) and ((x+ × y+) + (x− × y−), (x+ × y−) + (x− × y+)). The addition is commutative, associative, and has a neutral element, denoted by 0ℤf, defined by the class {(x, x) | x ∈ ℕf}. Each fuzzy relative integer (x+, x−) has an opposite, denoted by −x = (x−, x+). This is remarkable because in the framework of usual fuzzy numbers this property is not always satisfied. It can be easily checked that the product in ℤf is commutative, associative, and distributive over the addition. The neutral element is the fuzzy relative integer ({1/0, 1/1}, {1/0}). Therefore we conclude that (ℤf, +, ×) forms a ring.
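The α-cut characterization makes gradual integers easy to sketch in Python; the class and helper below are ours:

```python
class GradualInt:
    """A gradual natural integer given by its assignment function alpha -> |F_alpha|."""
    def __init__(self, membership_degrees):
        self.degrees = sorted(membership_degrees, reverse=True)

    def cut(self, alpha):
        """x_alpha: the cardinality of the alpha-cut (number of degrees >= alpha)."""
        return sum(1 for d in self.degrees if d >= alpha)

def extend(op, x, y):
    """Extend an integer operation # to gradual integers: (x # y)_alpha = x_alpha # y_alpha."""
    return lambda alpha: op(x.cut(alpha), y.cut(alpha))

# |F| for F = {1/x1, 1/x2, 0.8/x3, 0.6/x4}: the FGCount {1/0, 1/1, 1/2, 0.8/3, 0.6/4}.
F = GradualInt([1, 1, 0.8, 0.6])
print(F.cut(0.7))                               # 3, as in the example above
add = extend(lambda a, b: a + b, F, F)
print(add(0.7))                                 # 6
```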
Gradual Rational Numbers

The question is now to define an inverse for each gradual integer and to build up the set of gradual rational numbers. We define ℤf* as the set of gradual integers x such that ∀α ∈ ]0, 1], xα ≠ 0, and ℛ' as the equivalence relation such that: ∀(x, y) and (x', y') ∈ ℤf × ℤf*, [x, y] ℛ' [x', y'] iff x × y' = x' × y. The set of gradual rational numbers ℚf is defined by the quotient set (ℤf × ℤf*) / ℛ'. A gradual rational number x can also be represented by a simpler compact representation (denoted by xc) obtained by enumerating the values, which are rationals, associated with the different α-cuts. The assignment function ax of x is a function from ]0, 1] to ℚ defined by:

∀ α ∈ ]0, 1], ax(α) = reduce((xn+α − xn−α) ÷ (xd+α − xd−α)),

where xn and xd denote the numerator and denominator of x, and the operator reduce means that the rational is reduced to its canonical form.

Gradual Truth Value

This section proposes a computation to determine the truth value obtained when applying a fuzzy predicate on a gradual number. Let x be an element of ℕf, ℤf, or ℚf; its assignment function ax is defined by ∀ α ∈ ]0, 1], ax(α) = xα. If T is a fuzzy predicate, the application of the predicate T on x produces a global satisfaction S (called a gradual truth value) characterized by the assignment function defined by:

∀ α ∈ ]0, 1], aS(α) = T(xα) = T(ax(α)).

For a given level α, aS(α) represents the satisfaction of the corresponding α-cut of the gradual number. In other words, for a given level α, the gradual number satisfies predicate T at degree aS(α).

Figure 5. The fuzzy predicate high

Example. We consider the fuzzy predicate high defined by Figure 5. Suppose the number of young employees is the gradual integer x = {1/15, 0.7/20, 0.2/25} (which means that 15 employees are completely young, 5 employees have the same age and are young at level 0.7, whereas 5 other people are rather not young since their level of youth is estimated at 0.2). The assignment function for x is the following:

∀α ∈ ]0, 0.2], ax(α) = 25, ∀α ∈ ]0.2, 0.7], ax(α) = 20, ∀α ∈ ]0.7, 1], ax(α) = 15.

The application of the predicate high on the gradual integer x produces a global satisfaction S whose assignment function is defined by ∀ α ∈ ]0, 1], aS(α) = T(xα) = T(ax(α)). We get the gradual truth value given by Figure 6. ♦

Figure 6. Gradual truth value corresponding to a global satisfaction S

This gradual truth value shows the different results associated with the different α-cuts. When
referring to the previous example and considering level 0.8, the fuzzy cardinality x states that the cardinality of this α-cut is 15 (x0.8 = 15). Since µhigh(15) = 0.25, this cardinality satisfies high at degree 0.25. It can be checked that aS(0.8) = 0.25.
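The level-by-level application of a predicate can be sketched directly from the assignment functions; the shape chosen for high is an assumption, made consistent with the degree 0.25 obtained for 15 employees in the text:

```python
def gradual_truth(predicate, assignment, levels):
    """a_S(alpha) = T(x_alpha): apply a fuzzy predicate to a gradual number, level by level."""
    return {alpha: predicate(assignment(alpha)) for alpha in levels}

def high(c):
    # An assumed shape for the predicate 'high', consistent with mu_high(15) = 0.25:
    # 0 below 10, rising linearly to 1 at 30.
    return min(1.0, max(0.0, (c - 10) / 20))

def x_assign(alpha):
    # x = {1/15, 0.7/20, 0.2/25}: 25 for alpha <= 0.2, 20 for alpha <= 0.7, 15 otherwise.
    return 25 if alpha <= 0.2 else 20 if alpha <= 0.7 else 15

S = gradual_truth(high, x_assign, [0.2, 0.7, 0.8, 1.0])
print(S)  # {0.2: 0.75, 0.7: 0.5, 0.8: 0.25, 1.0: 0.25}
```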
Interpretation of Quantified Statements Using Gradual Numbers

The section titled Quantified Statements of Type "Q X are A" considers the evaluation of a quantified statement of type "Q X are A," while the section titled Quantified Statements of Type "Q B X are A" Where Q Is Relative is interested in the evaluation of statements of type "Q B X are A," where Q is relative. Each of these computations provides a gradual truth value. As a consequence, the section titled A Scalar Truth Value for the Interpretation proposes a scalar interpretation computed from this gradual truth value.
In case of a relative linguistic quantifier, the truth value of “Q X are A” is given by the satisfaction of the linguistic quantifier into the proportion of elements which are A. We get: ∀ α ∈ [0, 1], µS(α) = µQ (c(α)/n) = µQ (|A(X)α|/n), where n is the cardinality of set X. Example. We consider the statement “about 3 X are A” where X = {x1, x2 , x3, x4} such that µA(x1) = µA(x2) = 1, µA(x3) = 0.8, µA(x4) = 0.6. The linguistic quantifier about 3 is given by Figure 7. The gradual truth value for “about 3 X are A” (defined by: ∀ α ∈ [0, 1], µS (α) = µQ (c(α))) is given by Figure 8. This gradual truth value provides the satisfaction obtained for the different α-cuts of A(X) (set made of elements from X which satisfy fuzzy condition A). As an example, µS(0.7) = µQ (|A(X)0.7|) = µQ (3) = 1. ♦
Quantified Statements of Type “Q X are A” The gradual cardinality of the fuzzy set A(X) made of elements from X which satisfy A is a FGCount denoted c and belongs to ℕf. When Q is absolute, the gradual truth value for “Q X are A” is given by the satisfaction of a fuzzy condition (a constraint represented by the quantifier) for that gradual number. As described in the section titled Gradual �������� Truth Value��������� , we get:
Figure 7. A representation for the quantifier about 3 (µabout 3(n) plotted against n)
∀ α ∈ [0, 1], µS(α) = µQ (c(α)). From the definition of the FGCount, we get: ∀ α ∈ [0, 1], µS(α) = µQ (|A(X)α|). In other words, the fuzzy truth value S expresses the satisfaction of each α-cut of A(X) with respect to the linguistic quantifier.
Figure 8. A fuzzy truth value for "about 3 X are A" (µS(α) plotted against α)
Evaluation of Quantified Statements Using Gradual Numbers
Quantified Statements of Type “Q B X are A” Where Q Is Relative
∀ α ∈ [0, max x∈X µB(x)], µS(α) = µQ(p(α)) = µQ(|(A∩B)(X)α|/|B(X)α|).
The truth value of "Q B X are A" (Q being relative) is given by the satisfaction of the linguistic quantifier by the proportion of elements which are A among the elements which satisfy B. This proportion is a ratio between two gradual integers:
The fuzzy truth value S expresses the satisfaction of each α-cut of A(X) and (A∩B)(X) with respect to the linguistic quantifier. The value α is viewed as a quality threshold for the satisfactions with respect to A and B. When the minimum is chosen as the norm to define (A∩B)(X), the value of µS(α) states that: "among the elements which satisfy B at least at level α, the proportion of elements x with µA(x) ≥ α is in agreement with Q" (since we have (A∩B)(X)α = A(X)α∩B(X)α). In other words, µS(α) is the truth value of the quantified statement when considering the two interpretations A(X)α and B(X)α. In addition, the fuzzy truth value S is not defined when α > max x∈X µB(x). A first attitude is to normalize B and A∩B, or to employ the degree of orness defined by Yager and Kacprzyk (1997) so that µS(α) = orness(Q). A second attitude, which will be considered in this chapter, is to assume that µS(α) = 0 in that case.
p = c/d, where:
• c is the cardinality (FGCount) of the fuzzy set (A∩B)(X) made of elements from X which satisfy fuzzy condition A and fuzzy condition B (∀x in X, µA∩B(x) = min(µA(x), µB(x))),
• d is the cardinality (FGCount) of the fuzzy set B(X) made of elements from X which satisfy fuzzy condition B.
The gradual rational number c/d is defined by the couple (c, d). A canonical representation for c/d is: ∀ α ∈ [0, 1], p(α) = c(α)/d(α). This canonical definition is defined only when d(α) ≠ 0. The cardinality c (resp. d) being that of the fuzzy set (A∩B)(X) (resp. B(X)), we get: ∀ α ∈ [0, 1], p(α) = |(A∩B)(X)α|/|B(X)α|, where |B(X)α| ≠ 0. It means that p(α) is not defined for α > max x∈X µB(x), and we can write: ∀ α ∈ [0, max x∈X µB(x)], p(α) = |(A∩B)(X)α|/|B(X)α|.
The gradual truth value for "Q B X are A" is given by the satisfaction of the constraint represented by the quantifier for that gradual proportion. According to the results introduced in the section titled Gradual Truth Value, a gradual truth value S is obtained:
Example. We consider the statement "about half B X are A" where X = {x1, x2, x3, x4}. The satisfaction degrees are given by Table 2. The linguistic quantifier about half is given by Figure 9. The gradual truth value for "about half B X are A" is given by Figure 10. As an example, we get µS(0.6) = 1/3 because |(A∩B)(X)0.6|/|B(X)0.6| = 2/3 and µQ(2/3) = 1/3. The truth value of the statement "about half elements in {x such that µB(x) ≥ 0.6} are in {x such that µA(x) ≥ 0.6}" is 1/3. ♦
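For illustration, the gradual truth value of this example can be recomputed from Table 2 in Python (a sketch, not part of the chapter). The analytic form of about half (triangular with support [0.25, 0.75] and peak 0.5) is an assumption chosen so that µQ(2/3) = 1/3, as in the example:

```python
# Illustrative sketch of the gradual truth value for "about half B X are A"
# over the data of Table 2; the minimum is used as the norm for A∩B.

MU_B = {"x1": 1.0, "x2": 0.9, "x3": 0.7, "x4": 0.3}
MU_A = {"x1": 0.8, "x2": 0.3, "x3": 1.0, "x4": 1.0}
MU_AB = {x: min(MU_A[x], MU_B[x]) for x in MU_B}

def mu_q(p):
    """Assumed 'about half': triangular, support [0.25, 0.75], peak 0.5."""
    return max(0.0, 1.0 - 4.0 * abs(p - 0.5))

def cut_card(mu, alpha):
    """Cardinality of the alpha-cut of a fuzzy set."""
    return sum(1 for d in mu.values() if d >= alpha)

def mu_s(alpha):
    """mu_S(alpha); by convention 0 when the alpha-cut of B(X) is empty."""
    nb = cut_card(MU_B, alpha)
    return mu_q(cut_card(MU_AB, alpha) / nb) if nb else 0.0

print(round(mu_s(0.6), 4))  # |(A∩B)(X)_0.6| / |B(X)_0.6| = 2/3 -> 0.3333
```

Scanning mu_s over the levels of Table 2 reproduces the stepwise truth value of Figure 10.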
A Scalar Truth Value for the Interpretation The fuzzy truth value S computed in the previous section gathers the satisfactions of the different α-cuts with respect to the linguistic quantifier. This fuzzy truth value can be defuzzified in order to obtain a scalar evaluation (set in [0, 1]). Various
Table 2. Satisfaction with respect to B and A

            x1     x2     x3     x4
µB(xi)      1      0.9    0.7    0.3
µA(xi)      0.8    0.3    1      1
µA∩B(xi)    0.8    0.3    0.7    0.3
Figure 9. A representation for the quantifier about half (µabout half(p) plotted against proportion p)
Figure 10. A fuzzy truth value for “about half B X are A”
interpretations can be associated to this defuzzification, and we consider the following one (since it is the most natural): "the more α-cuts highly satisfy the constraint defined by the linguistic quantifier, the higher the scalar interpretation." Obviously, when the scalar interpretation is 1, each α-cut fully satisfies the constraint. When dealing with a quantified statement of type "Q X are A," a scalar evaluation of 1 means that whatever the chosen interpretation for A(X) (the set made of elements from X which satisfy A), its cardinality is in agreement with the linguistic quantifier (i.e., ∀ α, µQ(|A(X)α|) = 1 or µQ(|A(X)α|/n) = 1). Otherwise, the higher the scalar evaluation, the more interpretations of A(X) exist with a high satisfaction with respect to the linguistic quantifier. When dealing with a quantified statement of type "Q B X are A," the scalar evaluation is also interpreted in terms of α-cuts, that is, in terms of interpretations of fuzzy sets. For a given level α, the degree µS(α) provided by the gradual truth value represents the satisfaction of the quantifier with respect to the proportion |(A∩B)(X)α|/|B(X)α| (µS(α) is the truth value of the quantified statement when considering the two interpretations (A∩B)(X)α and B(X)α). The scalar value aggregates the different satisfactions provided by the different levels, and a scalar evaluation of 1 means that whatever the chosen quality threshold α, the proportion is in complete agreement with Q. Otherwise, the higher the scalar evaluation, the more quality thresholds exist such that the proportion highly satisfies Q. In the section titled A Quantitative Approach, we consider a quantitative defuzzification (since it is based on an additive measure, a surface), while in the section titled A Qualitative Approach, we consider a qualitative defuzzification (since it is based on a non-additive process). The section titled Satisfaction of Properties situates the results provided by these two defuzzifications with respect to the properties introduced in the section titled About the Proposed Approaches to Evaluate Quantified Statements.
A Quantitative Approach
In this approach, the surface of the fuzzy truth value is delivered to the user. The scalar interpretation is then (Liétard & Rocacher, 2005):

δ = (∫₀¹ µS(α) · p · α^(p−1) dα)^(1/p).
When p = 1, the value δ is the area delimited by the function µS. Since this function is a stepwise function, we get: δ = (α1 – 0) * µS(α1) + (α2 – α1) * µS(α2) + ... + (1 – αn) * µS(1), where the discontinuity points are (α1, µS(α1)), ..., (αn, µS(αn)) with α1 < α2 < ... < αn. Example. We consider the statement "about half B X are A" and the fuzzy truth value given by Figure 10. We compute: δ = (0.7 – 0.3) * 1/3 + (0.8 – 0.7) * 1 = 0.233. The scalar result is rather low. When referring to Table 2, it seems that the proportion of elements which are A among the B elements is close to 2/3. A low result for "about half B X are A" is coherent since the proportion 2/3 poorly satisfies the constraint about half. ♦ It has been shown (Liétard & Rocacher, 2005) that, when dealing with quantified statements of type "Q X are A," this approach is a generalization of the OWA based interpretation (introduced in Interpretation by the OWA Operator). In addition, the next proof shows that when considering "Q B X are A" statements and when B is normalized, this defuzzification leads to the GD method introduced in The GD Method for "Q B X are A" Statements (when B is not normalized, the two methods differ since the GD method imposes normalizing B, while the gradual truth value associates the value 0 when the α-cut of B(X) is not defined). Proof. In case of a "Q B X are A" statement, the discontinuity points (αi, µS(αi)) of the gradual truth value are associated to the αi values where the quantities µS(αi) vary. In other words:
• the αi values come from the set D = {µA∩B(x) where x is in X} ∪ {µB(x) where x is in X},
• µS(αi) = µQ(|(A∩B)(X)αi|/|B(X)αi|).
The defuzzification gives: δ = (α1 – 0) * µQ(|(A∩B)(X)α1|/|B(X)α1|) + (α2 – α1) * µQ(|(A∩B)(X)α2|/|B(X)α2|) + ... + (αn – αn-1) * µS(1), where the values from D are denoted α1 < α2 < ... < αn. This expression is clearly that of an interpretation using the GD method (cf. The GD Method for "Q B X are A" Statements).
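As an illustrative sketch (not from the chapter), the stepwise computation of δ for p = 1 can be coded directly from the discontinuity points. The points below encode the fuzzy truth value of "about half B X are A" (steps 0, 1/3, 1, 0); these step values are reconstructed from Table 2 and are stated here as an assumption:

```python
# Quantitative defuzzification for p = 1: the area under the stepwise
# function mu_S, given its discontinuity points (alpha_i, mu_S(alpha_i)).
# mu_S is constant on each left-open interval ]alpha_{i-1}, alpha_i].

def quantitative_delta(points):
    """points: (alpha_i, mu_S(alpha_i)) sorted by alpha, ending at alpha = 1."""
    delta, prev = 0.0, 0.0
    for alpha, mu in points:
        delta += (alpha - prev) * mu  # width of the step times its height
        prev = alpha
    return delta

# Assumed fuzzy truth value of "about half B X are A" (example above).
points = [(0.3, 0.0), (0.7, 1/3), (0.8, 1.0), (1.0, 0.0)]
print(round(quantitative_delta(points), 3))  # (0.7-0.3)*1/3 + (0.8-0.7)*1 = 0.233
```

This reproduces the δ = 0.233 computed in the example.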
A Qualitative Approach
According to this defuzzification, the scalar interpretation takes into consideration two aspects:
• a guaranteed (minimal) satisfaction value β associated with the α-cuts (β must be as high as possible),
• the repartition of β among the α-cuts (β should be attained by as many α-cuts as possible).
Obviously, these two aspects are in opposition since, in general, the higher β, the smaller the repartition. The scalar interpretation δ reflects a compromise between these two aspects and we get: δ = max β in [0,1] min(β, each(β)), where each(β) means “for each level α, µS (α) ≥ β.” A definition of each(β) delivering a degree is the more convenient (Bosc & Liétard, 2005) and we propose to sum the lengths of intervals (of levels) where the threshold β is reached: each(
)=
∑(
] i , j ] such that ∀ ∈ ] i , j ], S (
j
− i ). ) ≥
The higher each(β), the more numerous the levels α for which µS (α) ≥ β. In particular, each(β) equals 1 means that for each level α, µS (α) is larger than (or equal to) β. In addition, from a computational point of view, the definition of δ needs to handle an infinity of
values β. However, it is possible (Bosc & Liétard, 2005) to restrict the computations to the β values belonging to the set of "effective" µS(α) values: δ = max {β | ∃ α such that β = µS(α)} min(β, each(β)).
Example. We consider the statement "about half B X are A" and the fuzzy truth value given by Figure 10. The values β to be considered are 1/3 and 1. Furthermore: each(1/3) = 0.5, each(1) = 0.1.
The discontinuity points of the gradual truth value are: (α1, µS(α1)), ..., (αn, µS(αn)) with α1 < α2 < ... < αn. We first demonstrate the validity of the properties related to "Q X are A" statements, then that of the properties related to "Q B X are A" statements.
Properties Related to “Q X are A” Statements
We get δ = max (min(1/3, 0.5), min(1, 0.1)) = 1/3. As in the previous example, a low result for “about half B X are A” is coherent. ♦ It has been shown (Bosc & Liétard, 2005) that when dealing with quantified statements of type “Q X are A” (Q increasing), this defuzzification leads to the Sugeno fuzzy integral based approach introduced in the section titled The Probabilistic Approach (GD Method).
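The qualitative compromise can be sketched the same way (an illustration, not from the chapter; the discontinuity points again encode the "about half B X are A" truth value reconstructed from Table 2 and are an assumption). Only the effective values β = µS(α) need to be scanned:

```python
# Qualitative defuzzification: delta = max over beta of min(beta, each(beta)),
# where each(beta) sums the lengths of the level intervals on which mu_S
# reaches beta. mu_S is constant on each interval ]alpha_{i-1}, alpha_i].

def qualitative_delta(points):
    """points: (alpha_i, mu_S(alpha_i)) sorted by alpha, ending at alpha = 1."""
    best = 0.0
    for beta in {mu for _, mu in points if mu > 0}:  # effective beta values
        each, prev = 0.0, 0.0
        for alpha, mu in points:
            if mu >= beta:
                each += alpha - prev  # interval ]prev, alpha] reaches beta
            prev = alpha
        best = max(best, min(beta, each))
    return best

points = [(0.3, 0.0), (0.7, 1/3), (0.8, 1.0), (1.0, 0.0)]
print(round(qualitative_delta(points), 3))  # max(min(1/3, 0.5), min(1, 0.1)) = 0.333
```

This reproduces each(1/3) = 0.5, each(1) = 0.1, and δ = 1/3 from the example.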
Satisfaction of Properties
This section situates the results provided by these two defuzzifications with respect to the properties introduced in About the Proposed Approaches to Evaluate Quantified Statements. All the properties are satisfied, except property 7, which holds only when the set made of elements from X which satisfy B is normalized. We recall that the quantitative approach delivers:
In case of a "Q X are A" statement, the αi values come from the set D = {µA(x) where x is in X} and µS(αi) = µQ(|A(X)αi|) (or µQ(|A(X)αi|/n) in case of a relative quantifier, with n the cardinality of set X). The different properties to be satisfied are:
Property 1. If predicate A is crisp, the evaluation must deliver µQ(|A(X)|) in case of an absolute quantifier and µQ(|A(X)|/n) in case of a relative quantifier (where A(X) is the crisp set made of elements from X which satisfy A and n is the cardinality of the crisp set X).
Proof. When A is crisp, D is a singleton ({1}) and the only discontinuity point of the gradual truth value (cf. Figure 11) is (1, µQ(|A(X)|)) (or (1, µQ(|A(X)|/n))).
Figure 11. The gradual truth value associated to property 1
δ = (α1 – 0) * µS(α1) + (α2 – α1) * µS(α2) + ... + (1 – αn) * µS(1), while the qualitative approach delivers δ = max {β | ∃ α such that β = µS(α)} min(β, each(β)), where each(β) is defined as previously.
The quantitative approach delivers (1 – 0) * µQ(|A(X)|) (or (1 – 0) * µQ(|A(X)|/n) when Q is relative) and property 1 holds. Concerning the qualitative approach, we demonstrate the validity of property 1 only in the case of an absolute quantifier. In case of a relative quantifier, it is necessary to change each expression µQ(|A(X)|) into µQ(|A(X)|/n) and the demonstration remains valid. When dealing with the qualitative approach, there is only one value β to be considered. This value equals µQ(|A(X)|) and each(β) = 1 (since for every level α in [0,1], µS(α) = µQ(|A(X)α|) = µQ(|A(X)|) = β). As a consequence, the result of the qualitative approach is min(β, each(β)) = min(µQ(|A(X)|), 1) and the property is valid.
Property 2. The evaluation is coherent with the universal and existential quantifiers. It means the evaluation of "Q X are A" is ∨x∈X µA(x) when Q is ∃ and ∧x∈X µA(x) when Q is ∀ (∨ and ∧ being respectively a co-norm and a norm).
Proof. The universal quantifier is relative and defined by µ∀(1) = 1 and, for any k in [0,1[, µ∀(k) = 0. The gradual truth value is defined by:
• µS(α) = µ∀(|A(X)α|/n) = 1 when |A(X)α|/n = 1, which means when α is smaller than the minimum of the membership degrees (denoted α1). This value α1 can be equal to 0 (when there exists at least one element x with µA(x) = 0),
Figure 12. The gradual truth value associated to the universal quantifier
• µS(α) = 0 otherwise.
As a consequence, we obtain the gradual truth value given by Figure 12. The fuzzy truth value has a unique discontinuity point (α1, 1) and the quantitative approach delivers δ = (α1 – 0) * 1 = α1, which is the minimum of the membership degrees. Property 2 is then satisfied using the minimum as a norm. When dealing with the qualitative approach, there is only one value β to be considered. This value equals β = 1 with each(β) = α1. As a consequence, the result is min(β, each(β)) = α1 and the property is valid. The existential quantifier is absolute and defined by µ∃(0) = 0 and, for any k ≠ 0, µ∃(k) = 1. The discontinuity points of the gradual truth value are: (α1, 1), ..., (αn, 1) (see Figure 13), where αn is the highest degree among the µA(x)s. The quantitative approach delivers: δ = (α1 – 0) * 1 + (α2 – α1) * 1 + ... + (αn – αn-1) * 1 + (1 – αn) * 0 = αn, which is the maximum of the membership degrees. Property 2 is then satisfied using the maximum as a co-norm. When dealing with the qualitative approach, there is only one value β to be considered. This value equals β = 1 with each(β) = αn. As a consequence, the result is min(β, each(β)) = αn and the property is valid.
Property 3. The evaluation is coherent with quantifier inclusion. Given two quantifiers Q and Q' such that Q ⊆ Q' (∀x, µQ(x) ≤ µQ'(x)), the
Figure 13. The gradual truth value corresponding to the existential quantifier
evaluation of "Q X are A" cannot be larger than that of "Q' X are A."
Proof. The gradual truth value for "Q X are A" is denoted S, while that associated to "Q' X are A" is denoted S'. Since we have ∀x, µQ(x) ≤ µQ'(x), it implies ∀ α in [0, 1], µS(α) ≤ µS'(α) (since the two quantified statements are dealing with the same set X and the same fuzzy predicate A). We denote δ and δ' the respective evaluations of "Q X are A" and "Q' X are A." If the quantitative approach is chosen, we have
δ = ∫₀¹ µS(α) dα and δ' = ∫₀¹ µS'(α) dα.

Figure 14. The gradual truth value associated to property 4
As a consequence, δ ≤ δ’ and property 3 is valid. If the qualitative approach is chosen: δ = ma x each(
β i n [0,1]
)=
min(β, each(β)), with
∑(
] i , j ] such that ∀ ∈ ] i , j ], S (
j
− i) ) ≥
δ'= max β in [0,1] min(β, each’(β)),with each' (
)=
αi values are coming from the set D ={µA∩B(x) where x is in X}∪ �{µB(x) where x is in X}, µS(αi) = µQ(|(A∩B)(X)αi| /|B(X)αi|).
∑(
] i , j ] such that ∀ ∈ ] i , j ], S' (
j
− i) ) ≥
Since ∀ α in [0, 1], µS(α) ≤ µS'(α), we have each(β) ≤ each'(β), which gives δ ≤ δ' and property 3 is demonstrated.
Properties Related to "Q B X are A" Statements
In case of a "Q B X are A" statement, the discontinuity points (αi, µS(αi)) of the gradual truth value are associated to the αi values where the quantities µS(αi) vary. In other words:
• the αi values come from the set D = {µA∩B(x) where x is in X} ∪ {µB(x) where x is in X},
• µS(αi) = µQ(|(A∩B)(X)αi|/|B(X)αi|).
The different properties to be satisfied are:
Property 4. If A and B are crisp and Q is relative, the evaluation must deliver µQ(|A(X) ∩ B(X)|/|B(X)|), where A(X) (resp. B(X)) is the set made of elements from X which satisfy A (resp. B).
Proof. When A and B are crisp, D is a singleton ({1}) and the only discontinuity point of the gradual truth value is (1, µQ(|(A∩B)(X)|/|B(X)|)), where (A∩B)(X) and B(X) are crisp sets (see Figure 14). The quantitative approach delivers (1 – 0) * µQ(|(A∩B)(X)|/|B(X)|) and property 4 holds. When dealing with the qualitative approach, there is only one value β to be considered. This value β equals µQ(|(A∩B)(X)|/|B(X)|) and each(β) = 1. As a consequence, the result of the qualitative approach is min(β, each(β)) = min(µQ(|(A∩B)(X)|/|B(X)|), 1) and the property is valid.
Property 5. When B is a Boolean predicate, the evaluation of "Q B X are A" is similar to that of "Q B(X) are A," where B(X) is the (crisp) set made of elements from X which satisfy B.
Proof. This proof shows that the gradual truth value S associated to "Q B X are A" and the gradual
truth value S' associated to "Q B(X) are A" are exactly the same: ∀ α in [0,1], µS(α) = µS'(α). When B is a Boolean predicate, B(X)α is the crisp set B(X) for any level α. As a consequence, µS(α) = µQ(|(A∩B)(X)α|/|B(X)|). Since (A∩B)(X)α can be rewritten A(X)α∩B(X), we have:
Conclusion
Property 7. If A(X) ∩ B(X) = ∅ (where A(X) (resp. B(X)) is the set made of elements from X which satisfy A (resp. B)), then the evaluation must return the value µQ(0).
This chapter takes place at the crossroads of quantified statement evaluation and the fuzzy arithmetic introduced in Rocacher and Bosc (2003a, 2003b, 2003c, 2005). It shows that fuzzy arithmetic allows the evaluation of quantified statements of type "Q X are A" and "Q B X are A." The evaluation can be either a fuzzy truth value or a scalar value obtained by the defuzzification of the fuzzy value. Two types of scalar values can be distinguished: the first one corresponds to a quantitative view of the fuzzy value, the second one to a qualitative view. When dealing with quantified statements of type "Q X are A," the two scalar values are respectively generalizations of the OWA based interpretation and the Sugeno integral based interpretation. When dealing with "Q B X are A" statements, our approach presents the advantage of providing a theoretical framework for computation. It is the first attempt to set this evaluation in the framework of an extended arithmetic and algebra. This aspect is very important since the properties provided by the algebraic framework hold, and we expect to obtain more interesting properties for the qualitative and quantitative approaches (in addition to the ones already stated in this chapter). As a consequence, further studies may concern the comparison of the qualitative and quantitative approaches in terms of properties. In addition, since they are both summaries of the same evaluation (in the form of a gradual number), they should not differ significantly.
We show this property holds only when B(X) is normalized.
References
µS(α) = µQ(|A(X)α∩B(X)|/|B(X)|). It means that µS(α) is restricted to the elements belonging to B(X), as is the case for the "Q B(X) are A" statement, and we obviously get: ∀α in [0,1], µS(α) = µS'(α).
Property 6. If the set of elements which are B is included in the set of A elements, Q is relative and B is normalized, then the evaluation of "Q B X are A" is µQ(1) (since 100% of the B elements are A due to the inclusion).
Proof. When the set of elements which are B is included in the set of A elements, we have µB(x) ≤ µA(x) for any element x from X. As a consequence, B(X)α ⊆ A(X)α for any level α, and hence ∀α in [0, max x∈X µB(x)], µS(α) = µQ(1). Since B(X) is normalized, ∀α in [0,1], µS(α) = µQ(1) and it is obvious to show that the two defuzzifications give µQ(1) as final result.
Proof. When A(X) ∩ B(X) = ∅, we have (A∩B)(X)α = ∅ for any level α in [0,1]. As a consequence, ∀α in [0, max x∈X µB(x)], µS(α) = µQ(0). When B(X) is a normalized fuzzy set, we get ∀α in [0,1], µS(α) = µQ(0) and it is obvious to show that the two defuzzifications give µQ(0) as final result.
Barro, S., Bugarin, A., Cariñena, P., & Diaz-Hermida, F. (2003). A framework for fuzzy quantification models analysis. IEEE Transactions on Fuzzy Systems, 11, 89-99.
Blanco, I., Delgado, M., Martín-Bautista, M. J., Sánchez, D., & Vila, M. P. (2002). Quantifier guided aggregation of fuzzy criteria with associated importances. In T. Calvo, R. Mesiar, & G. Mayor (Eds.),
Aggregation operators: New trends and applications (Studies on Fuzziness and Soft Computing Series, pp. 272-290). Physica-Verlag.
Bosc, P., & Liétard, L. (1993). On the extension of the OWA operator to evaluate some quantifications. In Proceedings of the 1st European Congress on Fuzzy and Intelligent Technologies (EUFIT'93) (pp. 332-338), Aachen, Germany.
Bosc, P., & Liétard, L. (1994a). Monotonous quantifications and Sugeno fuzzy integrals. In Proceedings of the 5th IPMU Conference (pp. 1281-1286), Paris, France.
Bosc, P., & Liétard, L. (1994b). Monotonic quantified statements and fuzzy integrals. In NAFIPS/IFIS/NASA'94 Joint Conference (pp. 8-12), San Antonio, Texas.
Bosc, P., & Liétard, L. (2005). A general technique to measure gradual properties of fuzzy sets. In Proceedings of the 10th International Fuzzy Systems Association (IFSA) Congress, Beijing, China.
Bosc, P., & Pivert, O. (1992). Some approaches for relational databases flexible querying. Journal of Intelligent Information Systems, 1, 323-354.
Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17.
Delgado, M., Sanchez, D., & Amparo M. V. (2002). A probabilistic definition of a nonconvex fuzzy cardinality. Fuzzy Sets and Systems, 126, 177-190.
Delgado, M., Sanchez, D., & Vila, M. P. (2000). Fuzzy cardinality based evaluation of quantified sentences. International Journal of Approximate Reasoning, 23, 23-66.
Diaz-Hermida, F., Bugarin, A., & Barro, S. (2003). Definition and classification of semi-fuzzy quantifiers for the evaluation of fuzzy quantified sentences. International Journal of Approximate Reasoning, 34, 49-88.
Diaz-Hermida, F., Bugarin, A., Cariñena, P., & Barro, S. (2004). Voting-model based evaluation of fuzzy
quantified sentences: A general framework. Fuzzy Sets and Systems, 146(1), 97-120.
Dubois, D., Godo, L., De Mantaras, R. L., & Prade, H. (1993). Qualitative reasoning with imprecise probabilities. Journal of Intelligent Information Systems, 2, 319-363.
Dubois, D., & Prade, H. (1985). Fuzzy cardinality and the modeling of imprecise quantification. Fuzzy Sets and Systems, 16, 199-230.
Dubois, D., & Prade, H. (1987). Fuzzy numbers: An overview. Analysis of Fuzzy Information, Mathematics and Logics, I, 3-39.
Dubois, D., & Prade, H. (1988a). On fuzzy syllogisms. Computational Intelligence, 4, 171-179.
Dubois, D., & Prade, H. (2005). Fuzzy elements in a fuzzy set. In Proceedings of the 10th International Fuzzy Systems Association (IFSA) Congress, Beijing, China.
Dubois, D., Prade, H., & Testemale, C. (1988b). Weighted fuzzy pattern matching. Fuzzy Sets and Systems, 28, 315-331.
Fan, Z. P., & Chen, X. (2005). Consensus measures and adjusting inconsistency of linguistic preference relations in group decision making. In Fuzzy Systems and Knowledge Discovery (pp. 130-139). Berlin/Heidelberg: Springer.
Galindo, J. (2005). New characteristics in FSQL, a fuzzy SQL for fuzzy databases. WSEAS Transactions on Information Science and Applications, 2(2), 161-169.
Galindo, J. (2007). FSQL (fuzzy SQL): A fuzzy query language. Retrieved February 6, 2008, from http://www.lcc.uma.es/~ppgg/FSQL
Galindo, J., Medina, J. M., Pons, O., & Cubero, J. C. (1998). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (pp. 164-174). Springer.
Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: IGI Publishing.
Glockner, I. (1997). DFS: An axiomatic approach to fuzzy quantification (Tech. Rep. No. TR97-06). University of Bielefeld.
Glockner, I. (2004a). Fuzzy quantifiers in natural language: Semantics and computational models. Der Andere Verlag (Germany).
Glockner, I. (2004b). Evaluation of quantified propositions in generalized models of fuzzy quantification. International Journal of Approximate Reasoning, 37, 93-126.
Kacprzyck, J. (1991). Fuzzy linguistic quantifiers in decision making and control. In Proceedings of the International Fuzzy Engineering Symposium (IFES'91) (pp. 800-811), Yokohama, Japan.
Kacprzyck, J., & Iwanski, C. (1992). Fuzzy logic with linguistic quantifiers in inductive learning. In L. A. Zadeh & J. Kacprzyk (Eds.), Fuzzy logic for the management of uncertainty (pp. 465-478). John Wiley and Sons.
Kacprzyck, J., Yager, R. R., & Zadrozny, S. (2006). Fuzzy linguistic summaries of databases for an efficient business data analysis and decision support. In Knowledge Discovery for Business Information Systems (pp. 129-159). The Netherlands: Springer.
Laurent, A., Marsala, C., & Bouchon-Meunier, B. (2003). Improvement of the interpretability of fuzzy rule based systems: Quantifiers, similarities and aggregators. In Modelling with words (pp. 102-123). Berlin/Heidelberg: Springer.
Liétard, L., & Rocacher, D. (2005). A generalization of the OWA operator to evaluate non monotonic quantifiers. In Proceedings of the 2005 Rencontres Francophones sur la Logique Floue et ses Applications (LFA'05).
Liu, Y., & Kerre, E. (1998a). An overview of fuzzy quantifiers (I): Interpretations. Fuzzy Sets and Systems, 95, 1-21.
Liu, Y., & Kerre, E. (1998b). An overview of fuzzy quantifiers (II): Reasoning and applications. Fuzzy Sets and Systems, 95, 135-146.
Losada, D. E., Díaz-Hermida, F., & Bugarín, A. (2006). Semi-fuzzy quantifiers for information retrieval. In Soft Computing in Web Information Retrieval (pp. 195-220). Berlin/Heidelberg: Springer.
Loureiro Ralha, J. C., & Ghedini Ralha, C. (2004). Towards a natural way of reasoning. In Advances in Artificial Intelligence–SBIA 2004 (pp. 114-123). Berlin/Heidelberg: Springer.
Malczewski, J., & Rinner, C. (2005). Exploring multicriteria decision strategies in GIS with linguistic quantifiers: A case study of residential quality evaluation. Journal of Geographical Systems, 7(2), 249-268.
Mizumoto, M., Fukami, S., & Tanaka, K. (1979). Fuzzy conditional inferences and fuzzy inferences with fuzzy quantifiers. In Proceedings of the 6th International Joint Conference on Artificial Intelligence (pp. 589-591), Tokyo, Japan.
Prade, H. (1990). A two-layer fuzzy pattern matching procedure for the evaluation of conditions involving vague quantifiers. Journal of Intelligent and Robotic Systems, 3, 93-101.
Rasmussen, D., & Yager, R. R. (1997). A fuzzy SQL summary language for data discovery. In D. Dubois, H. Prade, & R. R. Yager (Eds.), Fuzzy information engineering: A guided tour of applications (pp. 253-264). New York: Wiley.
Rocacher, D. (2003). On fuzzy bags and their application to flexible querying. Fuzzy Sets and Systems, 140(1), 93-110.
Rocacher, D., & Bosc, P. (2003a). About Zf, the set of fuzzy relative integers, and the definition of fuzzy bags on Zf. Lecture Notes in Computer Science, 2715, 95-102. Springer-Verlag.
Rocacher, D., & Bosc, P. (2003b). Entiers relatifs flous et multi-ensembles flous. In Rencontres Francophones sur la Logique Floue et ses Applications (LFA'03) (pp. 253-260).
Rocacher, D., & Bosc, P. (2003c). Sur la définition des nombres rationnels flous. In Rencontres Francophones sur la Logique Floue et ses Applications (LFA'03) (pp. 261-268).
Rocacher, D., & Bosc, P. (2005). The set of fuzzy rational numbers and flexible querying. Fuzzy Sets and Systems, 155(3), 317-339.
Yager, R. R., & Kacprzyk, J. (1997). The ordered weighted averaging operators: Theory and applications. Boston: Kluwer.
Sanchez, E. (1988). Fuzzy quantifiers in syllogisms, direct versus inverse computation. Fuzzy Sets and Systems, 28, 305-312.
Ying, M. (2006). Linguistic quantifiers modeled by Sugeno integrals. Artificial Intel., 170, 581-606.
Sicilia, M. A., Díaz, P., Aedo, I., & García, E. (2002). Fuzzy linguistic summaries in rule-based adaptive hypermedia systems. In Adaptive Hypermedia and Adaptive Web-Based Systems Second Int. Conference, Malaga, Spain. Vila, M. A., Cubero, J. C., Medina, J. M., & Pons, O. (1997). Using OWA operator in flexible query processing. In The Ordered Weighted Averaging Operators: Theory, Methodology and Applications (pp. 258-274). Wygralak, M. (1999). Questions of cardinality of finite fuzzy sets. Fuzzy Sets and Systems, 102, 185-210. Yager, R. R (1982). A new approach to the summarization of data. Information Sciences, 28, 69-86. Yager, R. R. (1983a). Quantifiers in the formulation of multiple objective decision functions. Information Sciences, 31, 107-139. Yager, R. R. (1983b). Quantified propositions in a linguistic logic. International Journal of Man-Machine Studies, 19, 195-227. Yager, R. R. (1984). General multiple-objective decision functions and linguistically quantified statements. International Journal of Man-Machine Studies, 21, 389-400. Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man, and Cybernetics, 18, 183-190. Yager, R. R. (1992). On a semantics for neural networks based on fuzzy quantifiers. International Journal of Intelligent Systems, 7, 765-786. Yager, R. R. (1993). Families of OWA operators. Fuzzy Sets and Systems, 59, 125‑148.
Zadeh, L. A. (1983). A computational approach to fuzzy quantifiers in natural languages. Computers and Mathematics with Applications, 9, 149-183.
Key Terms
Fuzzy Predicate: Predicate defined by a fuzzy set. A fuzzy predicate delivers a degree of satisfaction.
Gradual Integer: Integer which takes the form of a fuzzy subset of the set of naturals (interpreted as a conjunction). Such integers differ from fuzzy numbers, which are interpreted as disjunctions of candidates.
Gradual Rational Number: Gradual number interpreted as a conjunction and defined as the ratio of two gradual relative integers.
Gradual Relative Integer: Gradual number represented by a fuzzy subset of the set of relatives (interpreted as a conjunction). It is defined as the subtraction of two gradual integers.
Linguistic Quantifiers: Quantifiers defined by linguistic expressions like "around 5" or "most of." Such quantifiers allow an intermediate attitude between the conjunction (expressed by the universal quantifier ∀) and the disjunction (expressed by the existential quantifier ∃).
OWA Operator: Ordered Weighted Average Operator. The inputs are assumed to be sorted, and the weights of this average are associated to the input data depending on their rank (weight w1 is associated to the largest input, weight w2 to the second largest input, and so forth).
Sugeno Fuzzy Integral: Aggregation operator which can be viewed as a compromise between two aspects: (1) a certain quantity (a fuzzy measure) and (2) a quality of information (a fuzzy set).
Chapter XI
FSQL and SQLf: Towards a Standard in Fuzzy Databases

Angélica Urrutia, Universidad Católica del Maule, Chile
Leonid Tineo, Universidad Simón Bolivar, Venezuela
Claudia Gonzalez, Universidad Simón Bolivar, Venezuela
Abstract

At present, FSQL and SQLf are the main fuzzy-logic-based extensions proposed for SQL. It would be very interesting to integrate them into a standard for fuzzy databases. The issue is what to take from each proposal. In this chapter, we analyze FSQL and SQLf, comparing them in several ways: approach direction, fuzzy components, system architecture, satisfaction degree, evaluation mechanisms, and experimental performance. We observe that there are powerful and interesting features in both proposals that could be combined in a unified language for fuzzy relational databases.
Introduction

In order to give greater flexibility to relational database management systems (RDBMS), different languages and models have been conceived with the incorporation of fuzzy logic concepts into information treatment. Two outstanding proposals in the application of fuzzy logic to databases are FSQL (Galindo, 1999, 2007) and SQLf (Bosc & Pivert, 1995a). This chapter presents a comparison of these two proposals from different points of view.
FSQL was created in order to allow the treatment of uncertainty in fuzzy RDBMS. It allows the representation and manipulation of precise and vague data. It distinguishes three data categories: crisp, referential ordered, and referential not ordered. It uses possibility distributions and similarity relations for the representation of vague data, following the GEFRED model (Medina, 1994; Medina, Pons, & Vila, 1994). For the manipulation of these data, FSQL extends some components of SQL with elements of fuzzy logic. It includes the use of possibility and necessity measures.
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
A catalogue named FMB has been conceived around FSQL to represent vague data and linguistic terms in a relational database. Additionally, FuzzyEER, an extension of the EER model (Extended Entity-Relationship), has been conceived to allow the conceptual design of databases that incorporate vague data (Urrutia, 2003; Urrutia, Galindo, Jiménez, & Piattini, 2006; Urrutia, Galindo, & Piattini, 2002). A mechanism for the translation of a conceptual scheme in FuzzyEER to FMB has been established (Galindo, Urrutia, & Piattini, 2006). There exist two known implementations of FSQL at the present time, one in Oracle (Galindo, 1999, 2007) and the other in PostgreSQL (Maraboli & Abarzua, 2006). SQLf was conceived in order to represent vague requirements in queries to relational databases. It includes extensions based on fuzzy logic for all the elements of the SQL standards up to SQL3. In this language, query conditions may involve diverse linguistic user-defined terms that are specified through an extension of the DDL. SQLf allows fuzzy queries over precise data, producing discriminated answers. That is to say, each row in the answer has associated with it its satisfaction degree of the vague requirement represented by the query. In order to evaluate queries in SQLf, it has been proposed to take advantage of the existing connections between fuzzy and classic sets. From a fuzzy query, the derivation principle allows obtaining a derived precise query. The processing of the fuzzy query is made on the result of the derived query. There are two known SQLf implementations, both on Oracle. The comparison made in this work concerns the following aspects: variety in the use of fuzzy logic elements; satisfaction degree semantics of the answer set; evaluation mechanisms for query processing; proposed architectures for the implementation; and experimental performance analysis with current prototypes.
With the research work presented in this chapter, we open the way for the integration of FSQL and SQLf towards a new standard for fuzziness treatment in databases. This chapter has been organized as follows:
The next section gives a basic background on fuzzy sets. You can read more about this in the first chapter of this handbook. In the following section, we present the approach directions of FSQL and SQLf prior to pointing out the fuzzy components of both languages. Then, we give a general view of the architecture of the SQLf and FSQL implementations. We also explain in one section the use of satisfaction (fulfillment) degrees in these languages. Evaluation mechanisms for fuzzy queries, according to the two proposals, are discussed, along with an experimental performance analysis of existing prototypes. Finally, we present some conclusions and future trends of this work.
Fuzzy Sets Background Fuzzy sets were introduced in Zadeh (1965) to model fuzzy classes in control systems, and their use has been expanded to different domains: mathematics, classification, pattern matching, artificial intelligence, and so forth. In the first chapter of this volume, Galindo introduces fuzzy logic and fuzzy databases. See also the overview chapter by Kacprzyk, Zadrozny, De Tré, and De Caluwe in this book about fuzzy approaches to flexible database querying.
Fuzzy Sets

Fuzzy set theory stems from classic set theory, adding a membership function to the elements of the set, defined in such a way that each element is assigned a real number between 0 and 1 (Zadeh, 1965, 1978). A fuzzy set A over the universe of discourse U is defined by means of a membership function µA:U → [0,1]. This function indicates the degree to which the element u is included in the concept represented by the fuzzy set. The degree 0 means that the element is completely excluded from the set, while the degree 1 means that it is completely included. It is also possible to represent a fuzzy set with a set of pairs.
A = {µA(u)/u : u ∈ U, µA(u) ∈ [0,1]}    (1)
Linguistic Label

A linguistic label is a natural language word that expresses or identifies a fuzzy set. In our everyday life we use many linguistic labels to express abstract concepts such as "young," "old," "cold," "hot," "cheap," "expensive," and so forth. This intuitive definition not only varies from one person to another and depends on the moment, but also varies with the context in which it is applied. For example, the linguistic label "high" does not mean the same in the phrases "a high person" and "a high building."

Example 1: Suppose we express the qualitative concept "young" by means of a fuzzy set, where the X axis represents the universe of discourse "age" (in natural numbers) and the Y axis represents the membership degrees in the interval [0,1]. The fuzzy set representing that concept could be expressed in the following way (considering a discrete universe):

Young = 1/0 + ... + 1/25 + 0.9/26 + 0.8/27 + 0.7/28 + 0.6/29 + 0.5/30 + ... + 0.1/34
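Read as a piecewise-linear function, the "young" set of Example 1 can be sketched as follows (the function name and the implied cut-off at age 35 are our reading of the ellipsis, not part of the original definition):

```python
def mu_young(age: int) -> float:
    """Membership of `age` in the fuzzy set "young" from Example 1:
    fully young up to 25, decreasing linearly to 0 at age 35 (assumed)."""
    if age <= 25:
        return 1.0
    if age >= 35:
        return 0.0
    return round((35 - age) / 10, 1)
```

For instance, this reproduces the listed degrees 0.9/26 through 0.1/34.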
Fuzzy Number

The concept of fuzzy number was introduced in Zadeh (1978) with the purpose of analyzing and manipulating approximate numeric values, for example, "near 0" and "almost 5." The concept has been refined (Dubois & Prade, 1985, 1998), and several definitions exist.

Definition 1: Let A be a fuzzy set in X and µA : X → [0,1] its membership function. A is a fuzzy number if µA is convex and upper semicontinuous and the support of A is bounded. These requirements can be relaxed. Some authors add the constraint of being normalized
in the definition, that is, sup(µA(x)) = 1. The general form of the membership function of a fuzzy number A can be seen in Figure 1 and can be defined as:

µA(x) = rA(x)   if x ∈ [α, β)
        h       if x ∈ [β, γ]
        sA(x)   if x ∈ (γ, δ]
        0       otherwise
where rA:X → [0,1], sA:X → [0,1], rA is increasing, sA is decreasing, and rA(β) = h = sA(γ) with h ∈ (0,1] and α, β, γ, δ ∈ X. The number h is called the height of the fuzzy number, and the interval [β, γ] is the kernel. A particular case of fuzzy numbers is obtained when we consider the functions rA and sA as linear functions. This type of function is often used. We call this type of fuzzy number triangular or trapezoidal. We will usually work with normalized fuzzy numbers, for which h = 1, and in this case we will be able to characterize a normalized trapezoidal fuzzy number A using only the four necessary numbers: A ≡ (α, β, γ, δ).
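A normalized trapezoidal fuzzy number A ≡ (α, β, γ, δ) with linear edges can be sketched directly from the four numbers (the argument names a-d stand for α-δ):

```python
def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Membership in a normalized trapezoidal fuzzy number A = (a, b, c, d):
    rises linearly on [a, b), equals 1 on [b, c], falls linearly on (c, d]."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:                       # rising edge r_A
        return (x - a) / (b - a)
    return (d - x) / (d - c)        # falling edge s_A
```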
Fuzzy Logic

Fuzzy set theory is the base of fuzzy logic. In this logic, the truth-value of a sentence (or satisfaction degree) lies in the real interval [0,1]. The value 0 represents completely false, and 1 completely true. The truth-value of a sentence s will be denoted as µ(s). This logic allows giving an interpretation to linguistic terms:
Figure 1. General fuzzy number (height h, support [α, δ], kernel [β, γ])
•	Predicates (synonyms of linguistic labels) are the atomic components of this logic, defined by a membership function on a fuzzy set. For example, linguistic terms such as "young," "tall," "heavy," and "low" are predicates.
•	Modifiers, linguistic terms that allow defining modified fuzzy predicates, are interpreted by means of transformations of the membership function. In this category are the natural language adverbs, for example, "very," "relatively," and "extremely."
•	Comparators, kinds of fuzzy predicates defined on pairs of elements, establish fuzzy comparisons; for example, "much greater than," "approximately equal to," and "close to" are fuzzy comparators.
•	Connectors are operators defined for combining fuzzy sentences. Fuzzy negation, conjunction, and disjunction are extensions of the classical ones. They preserve the existing correspondence with the set operations complement (negation), intersection, and union, respectively. Connectors may be classified by the number of their operands as unary (such as negation), binary (such as implication), or multi-ary (such as average).
•	Quantifiers are terms describing quantities such as "most of," "about a half," and "around 20." They are an extension of the classical existential and universal quantifiers. Two types of fuzzy quantifiers are distinguished: absolute and proportional (relative). Absolute quantifiers represent amounts that are absolute in nature, such as "about 5" or "more than 20." An absolute quantifier can be represented by a fuzzy subset Q, such that for any non-negative real p ∈ R+, the membership grade of p in Q (denoted by µQ(p)) indicates the degree to which the amount p is compatible with the quantifier represented by Q. Proportional or relative quantifiers, such as "at least half" or "most," can be represented by fuzzy subsets defined in the unit interval [0,1]. For any proportion p ∈ [0,1], µQ(p) indicates the degree to which the proportion p is compatible with the meaning of the quantifier.
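The absolute/proportional distinction can be made concrete with two illustrative membership functions (the breakpoints chosen below are our own, not values prescribed by either language):

```python
def about_5(p: float) -> float:
    """Absolute quantifier "about 5": triangular, peak at 5, support [3, 7]."""
    return max(0.0, 1 - abs(p - 5) / 2)

def most(p: float) -> float:
    """Proportional quantifier "most" over a proportion p in [0, 1]:
    0 up to 0.3, 1 from 0.8, linear in between."""
    return min(1.0, max(0.0, (p - 0.3) / 0.5))
```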
Definition 2: Functionally, linguistic quantifiers are usually of one of three types: increasing quantifiers (such as "at least n," "all," "most") are characterized by Q(a) ≤ Q(b) for all a ≤ b
the relevance degrees of the fuzzy conditions A1, ..., An, respectively. For example, "Clark satisfies around 3 of: good salary, great account, excellent interpersonal skills, very high preparation, relatively young age, with relevances 1, 1, 0.5, 0.8, 0.4." The evaluation of quantified statements is studied in this handbook in the chapter by Liétard and Rocacher.

The satisfaction degree, or truth-value, of basic sentences may be obtained in a simple way from the membership functions of predicates, modified predicates, or comparators. In the case of combined sentences, the truth-value is obtained by applying the operators defining the connectors to the results of the simpler conditions. For quantified sentences, the satisfaction degree calculation is more complex. In fact, there are several interpretations of these sentences that lead to different degrees for a given sentence (Bosc & Pivert, 1995; Delgado, Sánchez, & Vila, 2000; Dubois & Prade, 1998; Galindo, Urrutia, Carrasco, & Piattini, 2004b; Galindo et al., 2006; Tineo, 2006; Zadeh, 1995).

This logic is a multivalued logic whose main characteristics are (Zadeh, 1978):

•	In fuzzy logic, exact reasoning is considered a specific case of approximate reasoning. Any logical system can be converted into terms of fuzzy logic.
•	In fuzzy logic, knowledge is interpreted as a set of flexible or fuzzy restrictions over a set of variables. Inference is considered a process of propagation of those restrictions: the process by which a result is reached, consequences are obtained, or one thing is deduced from another.
•	In fuzzy logic, all problems are problems of degree.

From this simple concept, a complete mathematical and computing theory has been developed which facilitates the solution of certain problems (Blanco, 2001; Medina, 1994; Pedrycz & Gomide, 1998).
Fuzzy logic has been applied to a multitude of objectives such as control systems, modeling, simulation, pattern recognition, information or knowledge systems, computer vision, artificial intelligence, artificial life, and so forth.
Possibility Theory

A fuzzy set may also be used to represent the imprecise value of a data item. In this case, the membership function of the fuzzy set is said to be a possibility distribution, measuring the possibility of the actual values. When predicates or comparators are applied to fuzzy numbers or fuzzy labels, the truth-value of a fuzzy logic sentence becomes an imprecise value. In this case, we say that we use a possibilistic logic. The satisfaction degree is obtained with the application of the extension principle. This principle extends any operator over a domain for use over possibility distributions (imprecise data values) in this same domain (Medina et al., 1994; Tineo, 1998). In possibilistic logic, it is also possible to use two measures for the truth-value of a sentence, instead of an imprecise value. These are the possibility and necessity measures. Possibility is a measure of belief, while necessity is a measure of certainty.

Definition 3: The possibility measure of a logic sentence s, Π(s), is the maximum value in the possibility distribution for the sentence's fuzzy truth-value µ(s), that is, Π(s) = sup(µ(s)). The necessity measure of a logic sentence s, Ν(s), is the complement to one of the maximum value in the possibility distribution for the fuzzy truth-value of the negation of the given sentence µ(not s), that is, Ν(s) = 1 − sup(µ(not s)).

In the case of sentences using only precise values with fuzzy logic terms, the possibility measure in possibilistic logic coincides with the satisfaction degree in fuzzy logic. Extended explanations about the possibility and necessity measures and the extension principle are included in the first chapter of this handbook.
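For imprecise data, Definition 3 can be instantiated over a discrete possibility distribution using the standard sup-min construction (a simplified sketch; the distribution and the "young" predicate below are illustrative assumptions):

```python
def possibility(pi: dict, pred) -> float:
    """Pi(pred) = sup_x min(pi(x), mu_pred(x)): how possible it is that the
    imprecisely known value (possibility distribution pi) satisfies pred."""
    return max(min(p, pred(x)) for x, p in pi.items())

def necessity(pi: dict, pred) -> float:
    """N(pred) = 1 - Pi(not pred): how certain the satisfaction is."""
    return 1.0 - max(min(p, 1.0 - pred(x)) for x, p in pi.items())

# Example: an age known only as "possibly 24 (degree 1.0) or 28 (degree 0.6)".
age_pi = {24: 1.0, 28: 0.6}
young = lambda x: 1.0 if x <= 25 else max(0.0, (35 - x) / 10)
```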
Approach Direction

Database implementation and use comprises three levels: the conceptual, the logical, and the
physical. The conceptual level deals with the abstraction of data in the universe of the database. The logical level consists of the representation of these concepts in a database management system's native language or model, and the high-level specification of user requirements. The physical level corresponds to the storage structures and access paths with their tuning parameters. Both FSQL (Galindo, 1999, 2007; Galindo, Medina, Pons, & Cubero, 1998a; Galindo, Medina, Pons, Vila, & Cubero, 1998b; Galindo, Medina, Vila, & Pons, 1998c; Galindo et al., 2006) and SQLf (Bosc & Pivert, 1992, 1994, 1995a; Goncalves & Tineo, 2001a, 2001b, 2006a, 2006b; Tineo, 2005, 2006) are database proposals at the logical level, since they are extensions of SQL. But their focuses lie in different approaches. One possible approach is to deal with concept representation in a data model. Another approach is to deal with user requirement specification. Both approaches could be mixed to different degrees in the FSQL and SQLf proposals. There are two main ways of adding fuzzy-set-based capabilities to database systems. The first is to allow the representation of imprecise data (Bosc & Pivert, 1994). The second is to allow the expression of flexible requirements over regular databases (Bosc & Pivert, 1991). It is also possible to combine these two ways, that is, to support both fuzzy data and fuzzy queries. Let us talk first about the SQLf approach: many systems in massive public and private use could benefit from fuzzy query capabilities. We might mention criminal justice systems, voter registration, international banking information, and tourism service systems. For some authors (Cox, 1995; Fagin, 1999), it seems very useful to allow flexible querying on data contained in existing databases, which contain precise data. This is the focus of SQLf.
SQLf has been conceived to solve the rigidity problem of classical database querying systems. SQLf is intended to provide more flexibility in querying by means of fuzzy sets. This approach has been proved to be the most general for user preference
based querying (Bosc & Pivert, 1992; Goncalves & Tineo, 2006a, 2006b). The design of SQLf is oriented to providing a large variety of preference-based querying structures. This language allows the use of fuzzy conditions in any place where SQL (ANSI, 1986, 1989) allows a classic (Boolean) condition. Moreover, fuzzy conditions in SQLf may use any kind of linguistic term with a fuzzy-set-based interpretation. These terms are predicates, modifiers, comparators, connectors, and quantifiers. The SQLf definition has been updated in order to support features introduced in the standards SQL-92 (ANSI, 1992; Goncalves & Tineo, 2001a) and SQL-99 (Goncalves & Tineo, 2001b; Melton, 1993). SQLf gives high flexibility to users in the following sense: they may create or define (Tineo, 1998) their own fuzzy terms in an arbitrary way and use them in querying any database. There is no intervention of a database designer or administrator in the creation and use of fuzzy terms. This is consistent with the intention of providing a language for querying existing databases, providing fuzzy querying capabilities over classic relational databases. On the other hand, the approach of FSQL is the following: the fuzzy relational model uses a fuzzy degree in each row or tuple. The model is based on the similarity relations of Buckles and Petry (1992, 1984) and the relational models with possibility distributions of Umano, Fukami, Mizumoto, and Tanaka (1980) and other authors (Prade & Testemale, 1987; Zemankova-Leech & Kaendel, 1984). Medina's (1994) doctoral thesis embraced generalizations of fuzzy models. Medina proposes a conceptual framework for fuzzy representation called GEFRED (GEneralized Model for Fuzzy Relational Databases) and a language called FSQL (Fuzzy SQL).
In the same research group, a young computer engineer, José Galindo (1999), started his doctoral research under the supervision of Medina, in order to improve the relational algebra of the GEFRED model, to define a fuzzy relational calculus, and to implement an FSQL server. In fact, the possibility and
necessity measures, shown by Dubois and Prade (1985, 1998), do not only allow the construction of two fuzzy comparators; 14 of them are defined in FSQL. In Medina (1994) and Medina et al. (1994), the GEFRED model was proposed for FRDB. The GEFRED model represents a synthesis among the different models which have appeared to deal with the problem of the representation and management of fuzzy information in relational databases. One of the main advantages of this model is that it consists of a general abstraction that allows dealing with different approaches, even when these may seem disparate. The GEFRED model is based on the definitions of Generalized Fuzzy Domain (D) and Generalized Fuzzy Relation (R).

Definition 4: Let U be the discourse domain or universe and ℘(U) the set of all possibility distributions defined over U, including those which define the Unknown and Undefined types. Let NULL be another type. The Generalized Fuzzy Domain is D ⊆ ℘(U) ∪ NULL. The Unknown, Undefined, and NULL types are defined according to the Umano-Fukami model.

Definition 5: A Generalized Fuzzy Relation R is given by two sets, "Head" (H) and "Body" (B), R = (H, B), defined as follows. The "Head" consists of a fixed set of attribute-domain-compatibility terms (where the last is optional), H = {(A1: D1 [, C1]), ..., (An: Dn [, Cn])}, where each attribute Aj has an underlying fuzzy domain, not necessarily different, Dj (j = 1, 2, ..., n). Cj is a "compatibility attribute" which takes values in the range [0,1]. The "Body" consists of a set of different generalized fuzzy tuples, where each tuple is composed of a set of attribute-value-degree terms (the degree is optional), B = {(A1: di1 [, ci1]), ..., (An: din [, cin])} with i = 1, 2, ..., m, where m is the number of tuples in the relation, and where dij represents the domain value for tuple i and attribute Aj, and cij is the compatibility degree associated with this value.

Also, the GEFRED model defines fuzzy comparators, which are general comparators based on
any existing classical comparator (>, <, =, etc.), but it does not consolidate the definition of each one. The only requirement established is that the fuzzy comparator should respect the classical comparator outcomes when comparing possibility distributions expressing crisp values (like 1/x with x belonging to X). FSQL is built considering the components of GEFRED. The FSQL language extends the SQL language (Galindo et al., 1998a, 2006) to allow flexible queries. The SELECT command has been extended in order to express flexible queries including: linguistic labels, fuzzy comparators, fulfillment thresholds (THOLD clause), the function CDEG, the character %, fuzzy constants, conditions with IS, and so forth. A great part of these components will be discussed in this work and in the chapter by Ben Hassine et al. in this book. FSQL extends the data types of a traditional database: fuzzy attribute values are stored in the relations with a special format. The fuzzy attributes are classified by the system into four types:

•	Fuzzy Attributes Type 1: These attributes are totally crisp (traditional), but they have some linguistic trapezoidal labels defined on them, which allow us to make query conditions on these attributes more flexible.
•	Fuzzy Attributes Type 2: These attributes admit crisp data as well as possibility distributions over an ordered underlying domain.
•	Fuzzy Attributes Type 3: These attributes do not have an ordered underlying domain, for instance, hair color. On these attributes, some labels are defined, and on these labels, a similarity relation has to be defined.
•	Fuzzy Attributes Type 4: These attributes are like Type 3 attributes but without a similarity relation.

Also in Galindo et al. (2006), another five data types incorporating fuzzy degrees are presented.
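As a rough illustration of Definition 5 (and of Type 2 attributes holding possibility distributions), a generalized fuzzy tuple can be sketched as attribute-to-(value, optional compatibility degree) pairs; the schema, names, and values below are entirely invented for illustration:

```python
# Each attribute stores a (value, compatibility degree) pair, per Definition 5.
# A degree of None stands for an omitted c_ij. All names/values are invented.
employee = {
    "name":   ("Clark", None),            # crisp value, no degree stored
    "salary": ("high", 0.8),              # linguistic label, compatibility 0.8
    "age":    ({24: 1.0, 28: 0.6}, 0.6),  # Type 2 possibility distribution
}

def compatibility(row: dict, attr: str) -> float:
    """Return the compatibility degree c_ij, defaulting to 1.0 if omitted."""
    _value, degree = row[attr]
    return 1.0 if degree is None else degree
```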
Fuzzy Logic Components

Fuzzy logic allows giving an interpretation to linguistic terms known as fuzzy predicates, modifiers, fuzzy comparators, connectors, and fuzzy quantifiers. Fuzzy predicates are often named fuzzy labels. These fuzzy logic components are defined in FSQL and SQLf with some differences that we present hereafter. The main difference between SQLf and FSQL is in the intervention of a database designer for linguistic terms. In SQLf, the user is free to create arbitrary linguistic terms and use them in any place where this kind of term may be used, without any other semantic restriction. In FSQL, some linguistic terms are built into the system, and the others must be defined at database design time, specifying their context of use. SQLf is inclined to give the user more flexibility, while FSQL is inclined to give more coherent semantics to the use of fuzzy terms from the design. SQLf is oriented to querying classic databases, while FSQL is oriented to querying, building, and managing fuzzy or classic databases.
Fuzzy Predicates or Linguistic Labels

SQLf provides three forms to define a predicate: through a trapezium, by extension, and by arithmetic expression. The trapezium-shaped function is defined by giving the four x values of its inflection points. The extension-defined function is given by a finite set of pairs µ(x)/x. The arithmetic-expression-defined functions are specified with an arbitrary expression over the variable x in the domain, whose value is in the real unit interval. SQLf allows fuzzy predicates to be defined over structured data such as row (or tuple) types and time stamps from SQL2. FSQL also allows the definition of predicates with trapezium- or extension-specified membership functions, but it does not consider membership functions defined by general arithmetic expressions. In the FSQL bibliography, fuzzy predicates are rather called linguistic labels. FSQL allows the use of these labels in any place where a data value is allowed, for example, using any comparison operator on either side, or as the value of an imprecise datum. On the other hand, SQLf allows their use only on the right side of an equality comparison, which is interpreted as the measure in which the value meets the predicate.

Fuzzy Modifiers

SQLf has two predefined modifiers:

•	The predefined modifier NOT is interpreted as µNOT P(x) = 1 − µP(x).
•	Antonyms like "small" and "big" or "young" and "old" are related through the predefined modifier ant, whose semantics is µant P(x) = µP(M − x) with x ∈ [0, M].

In SQLf, the user may create fuzzy modifiers defined by powers, translations, and triangular norms or co-norms:

•	When a modifier is defined as a power of the membership function, the modified predicate is defined as µmod P(x) = (µP(x))^n.
•	When the modifier is defined by a triangular norm or co-norm θ, it must be non-idempotent, and the modified predicate is defined as µmod P(x) = µPn(x) = (P θ ... θ P). If θ is a triangular norm, the modifier is a contractor. On the other hand, if θ is a triangular co-norm, the modifier is a dilator.
•	When the modifier is seen in terms of a translation, it is defined as µmod P(x) = µP(x + δ) with δ ∈ ℜ; this represents a translation in the x axis.

FSQL has just four built-in modifiers:

•	Norm: this modifier normalizes the fuzzy value, dividing the original membership function by the height of the fuzzy value.
•	Conc_Dilat: this modifier is parameterized with a real number argument p. If p > 1, this function returns a concentrated version of the fuzzy value; the membership function of this version takes on relatively smaller values, being raised to the power p. Usually, p = 2. If p ∈ (0, 1), this modifier dilates the fuzzy value; the membership function of this version takes on relatively greater values, being raised to the power p. Usually, p = 0.5 (the square root).
•	More_Contrast: also uses a parameter p. This modifier is the contrast intensification function, and it returns the fuzzy value with more contrast. Membership values lower than 0.5 are diminished, while grades of membership above 0.5 are elevated.
•	Fuzzification: this function has a complementary effect to that of intensification. These operations are defined with a parameter p (usually p = 2).
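The concentration/dilation and contrast behaviors can be sketched for a single membership degree as follows (the function names are ours, and the piecewise contrast formula follows the usual textbook definition; FSQL applies these to whole fuzzy values, not single degrees):

```python
def conc_dilat(mu: float, p: float = 2.0) -> float:
    """Power modifier: p > 1 concentrates (degrees shrink), p in (0,1)
    dilates (degrees grow). Sketch of FSQL's Conc_Dilat idea."""
    return mu ** p

def more_contrast(mu: float, p: float = 2.0) -> float:
    """Contrast intensification: degrees below 0.5 are pushed down,
    degrees above 0.5 are pushed up (classic textbook form)."""
    if mu <= 0.5:
        return (2 * mu) ** p / 2
    return 1 - (2 * (1 - mu)) ** p / 2
```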
Fuzzy Comparators

In addition to the typical comparators (=, >, >=, ...) (EQ, GT, GEQ), SQLf and FSQL include fuzzy comparators. The fuzzy comparators for SQLf are user-defined. In the case of a numeric domain, a fuzzy comparator may be defined by means of a distance measure. Allowed distance measures are the difference and the quotient. The satisfaction degree of the comparison is given by the membership of this distance in a user-given fuzzy set. In the case of a scalar domain, it is also possible to define fuzzy comparators by extension, listing the related pairs with their corresponding satisfaction degrees. Comparison is always established between regular (crisp) data values. SQLf allows fuzzy comparators to be defined over structured data such as row (or tuple) types and time stamps from SQL2. A useful user-defined comparator is "similar" in the colors domain. The definition of the similar comparator consists mainly in defining the "similarity" grades between some colors = {Black, Brown, White} with the following data:

        Black  Brown  White
Black   1      0.7    0
Brown   0.7    1      0.1
White   0      0.1    1
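The matrix above can be read directly as a symmetric lookup table; a minimal sketch of the user-defined "similar" comparator:

```python
# Similarity relation for the colors domain above; only one triangle is
# stored and lookups are made symmetric.
SIMILAR = {
    ("Black", "Black"): 1.0, ("Black", "Brown"): 0.7, ("Black", "White"): 0.0,
    ("Brown", "Brown"): 1.0, ("Brown", "White"): 0.1,
    ("White", "White"): 1.0,
}

def similar(a: str, b: str) -> float:
    """Symmetric lookup of the similarity degree between two colors."""
    return SIMILAR.get((a, b), SIMILAR.get((b, a), 0.0))
```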
On the other hand, FSQL allows fuzzy comparison between both crisp and fuzzy data items.
FSQL has 18 built-in fuzzy comparators (see Table 4 in the chapter by Ben Hassine et al. in this volume). Six of them are defined as possibility measures for the extended versions of =, >, >=, <, <=, <> (FEQ, FGT, FGEQ, FLT, FLEQ, FDIF). Two are particularly fuzzy: >> and <<, much greater than (MGT) and much less than (MLT). Eight other comparators have been conceived as the necessity-measure counterparts of the preceding possibility comparators (NFEQ, NFGT, NFGEQ, NFLT, NFLEQ, NFDIF, NMGT, NMLT). Additionally, there are the inclusion comparators: included-in (INCL) and fuzzy-included-in (FINCL). The comparison of fuzzy data types with a non-ordered referential (Types 3 and 4) is carried out in FSQL with a special version of fuzzy equal (FEQ) and fuzzy different (FDIF). To define the "similar" comparator of the previous example in FSQL, the similarity relation is necessary. It allows specifying the grades defined above. Then, the FEQ and FDIF comparators can be used between colors.
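A possibility-based comparator such as FEQ is commonly computed, for discrete representations, as the supremum of the pairwise minimum of the two membership functions; the sketch below assumes this standard sup-min formulation (the actual FSQL implementation may differ):

```python
def feq(a: dict, b: dict) -> float:
    """Possibility-based fuzzy equality over discrete fuzzy values given as
    {x: degree} mappings: Pi(A = B) = sup_x min(muA(x), muB(x))."""
    common = set(a) & set(b)
    return max((min(a[x], b[x]) for x in common), default=0.0)
```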
Connectors

In SQLf, the usual connectors conjunction and disjunction are predefined with the triangular norm min and its respective co-norm max; the unary connector negation is also provided, interpreted as the complement to one. SQLf allows using mean operators as multi-ary connectors. Included means are: arithmetic mean, geometric mean, harmonic mean, weighted mean, and ordered weighted mean. SQLf provides users the capability of creating their own connectors. These are specified using an arithmetic expression on the variables x and y for the left-side and right-side operands, respectively. Users may define, for example, the implication, adopting their preferred interpretation. Another interesting example is the use of triangular norms and co-norms different from the usual min and max. FSQL has only conjunction, disjunction, and negation as connectors. Nevertheless, users may choose the interpretation to be used for them. Negation must be a negation function. Conjunction may use triangular norms: "minimum," "product," "drastic product," "bounded product p," "Einstein product," "Hamacher product p," and so forth. Disjunction may use triangular co-norms: "maximum," "sum-product," "drastic sum," "bounded sum p," "Einstein sum," and so forth. It should be noted that, for the sake of simplicity, if the norm needs some argument p, it is included after the name. It seems very useful to have predefined triangular norms and co-norms. By default, the usual min, max, and complement-to-one operators are used. The most important t-norms and t-conorms are defined in Chapter I of this handbook.
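Choosing an interpretation for conjunction and disjunction amounts to choosing a t-norm/t-conorm pair; a small illustrative table (names here are generic, not FSQL's exact identifiers):

```python
T_NORMS = {                        # conjunction interpretations
    "minimum": min,
    "product": lambda x, y: x * y,
    "bounded product": lambda x, y: max(0.0, x + y - 1),
}
T_CONORMS = {                      # disjunction interpretations
    "maximum": max,
    "probabilistic sum": lambda x, y: x + y - x * y,
    "bounded sum": lambda x, y: min(1.0, x + y),
}
```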
Quantifiers

SQLf has a set of predefined quantifiers such as at least x, at most x, most, a few, around x and y, about x, at least half, and about a half, where x and y are user-chosen numeric parameters. If they wish, users can overwrite these fuzzy quantifiers in a context. Traditional quantifiers (exists, any, all) are also predefined and may be used in fuzzy quantified sentences. Users may also define their own fuzzy quantifiers. In any case, the representation of a fuzzy quantifier is a fuzzy number with a trapezium-shaped membership function. SQLf allows fuzzy quantifiers of any nature (absolute or proportional) and any behavior (increasing, decreasing, or unimodal). FSQL has many predefined quantifiers. They are fuzzy exists, most, almost all, about half, minority, about half of x, approximately x, twice x, approximately the xth part, less than the xth part, more than the xth part, approximately between half x and half y, approximately between twice x and twice y, and approximately between the xth and yth part. FSQL also allows absolute and proportional (relative) fuzzy quantifiers with any behavior. The database user may define specific fuzzy quantifiers, and these should be associated with an attribute, a table, or the whole system. The other classification is related to the behavior (increasing, decreasing, or unimodal) of the membership function and classifies it as triangular, singleton,
L, gamma, S, Gaussian, pseudo-exponential, trapezoidal, and extended trapezoidal. Although, theoretically FSQL accepts any of them, it only has implemented the extended trapezoidal function.
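A minimal sketch of such a trapezium-shaped quantifier follows; the labels and breakpoints are illustrative, not the predefined quantifiers of either language:

```python
def trapezium(a, b, c, d):
    """Membership function of a trapezoidal fuzzy quantifier:
    0 up to a, rising on [a, b], 1 on [b, c], falling on [c, d], 0 beyond d."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if x <= a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu

# A proportional, increasing quantifier like "most": its argument is a ratio in [0, 1].
most = trapezium(0.3, 0.8, 1.0, 1.0)

# An absolute, unimodal quantifier like "about 50": its argument is a count.
about_50 = trapezium(40, 48, 52, 60)
```

For example, about_50(50) is 1 (fully satisfactory) while about_50(44) is 0.5 (borderline), which matches the graded reading of "about 50" rather than a crisp interval.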
Combiner

A combiner operator receives one or more fuzzy predicates and returns a fuzzy predicate. FSQL defines combiner operators for union and intersection purposes:

UNION(fuzzy_values [, t-conorm]): returns the union of the fuzzy values (separated by commas), computed with the t-conorm indicated in the last argument. This last argument is optional; by default the maximum t-conorm is used. For example, UNION(Quality, $[4,5,6,7], "maximum").

INTERSECTION(fuzzy_values [, t-norm]): returns the intersection of the fuzzy values, computed with the t-norm indicated in the optional last argument. If this argument does not appear, the intersection uses the minimum t-norm.

FSQL also defines the fuzzy UNION, INTERSECT, and MINUS operations between two subqueries.
CARD(fuzzy_value)

This function returns the cardinality of a fuzzy value. For example, in order to retrieve the rows whose value in an attribute has less fuzziness than a fuzzy constant, a SELECT statement may include the following condition: CARD(Quality) < CARD(3+-2).
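A sketch of how union, intersection, and cardinality could operate on discrete fuzzy values: here a fuzzy value is represented as a dict from domain elements to membership degrees, and the cardinality is taken as the sigma-count (the sum of degrees), which is one common choice and not necessarily FSQL's internal one:

```python
def fuzzy_union(a, b, t_conorm=max):
    """Pointwise union of two discrete fuzzy values; default t-conorm is maximum."""
    return {x: t_conorm(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def fuzzy_intersection(a, b, t_norm=min):
    """Pointwise intersection; default t-norm is minimum."""
    return {x: t_norm(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def card(a):
    """Sigma-count cardinality: the sum of the membership degrees."""
    return sum(a.values())

# Invented example values over a small numeric domain.
quality = {3: 0.5, 4: 1.0, 5: 0.5}
label   = {4: 0.8, 5: 1.0, 6: 0.8}
u = fuzzy_union(quality, label)
i = fuzzy_intersection(quality, label)
```

Under the sigma-count reading, a narrower (less fuzzy) value has a smaller CARD, which is what the comparison CARD(Quality) < CARD(3+-2) exploits.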
Summary

Some fuzzy characteristics of SQLf and FSQL are summarized in Table 1. On the other hand, Jiménez et al. (2005) show a relationship between the fuzzy attribute types of SQLf and FSQL. FSQL has some other characteristics, defined in Galindo et al. (2006), that should be mentioned: DML statements other than SELECT (INSERT, DELETE, and UPDATE), different fuzzy constants that may be used in fuzzy queries, different formats for queries with fuzzy quantifiers, fuzzy quantifiers associated with columns, tables, or the system, fuzzy comparators in temporal databases (FSQL defines 18 fuzzy comparators for fuzzy time, extending the temporal comparators of TSQL2), the use of classical comparators with fuzzy attributes, and a DDL sublanguage with new objects that can be created (LABELS, QUANTIFIERS, etc.).

Table 1. Comparison of some characteristics in SQLf and FSQL

Fuzzy term | Characteristic | SQLf | FSQL
Predicates (user defined) | Trapezium | Yes | Yes
Predicates (user defined) | Extension | Yes | Yes
Predicates (user defined) | Arithmetic expressions | Yes | No
Predicates (predefined) | Complex | Yes | No
Modifiers | NOT | Yes | Yes
Modifiers | Antonym | Yes | No
Modifiers | POWER or CONC_DILAT | Yes | Yes
Modifiers | Norm | Yes | Yes
Modifiers | Translation | Yes | No
Modifiers | MORE_CONTRAST | No | Yes
Modifiers | Fuzzyfication | No | Yes
Comparators (predefined) | Possibility / Necessity | No | Yes
Comparators (predefined) | Long distance (MGT/MLT) | No | Yes
Comparators (predefined) | (Fuzzy) Inclusion | No | Yes
Comparators (user defined) | Distance | Yes | No
Comparators (user defined) | Extension | Yes | No
Connectors (predefined) | AND as min | Yes | Yes
Connectors (predefined) | OR as max | Yes | Yes
Connectors (predefined) | AND as other t-norm | No | Yes
Connectors (predefined) | OR as other t-conorm | No | Yes
Connectors (predefined) | NOT | Yes | Yes
Connectors (user defined) | Means | Yes | No
Connectors (user defined) | Arithmetic expression | Yes | No
Quantifiers (predefined) | Trapezium | Yes | Yes
Quantifiers (user defined) | Trapezium | No | Yes
Quantifiers (user defined) | Others | Yes | Yes
Fuzzy constants | Trapezium, approximate values, intervals, possibility distributions… | No | Yes
Fuzzy operators (predefined) | Union, Intersection, Minus | No | Yes
Fuzzy operators (predefined) | Card | No | Yes
System Architecture

The fuzzy language definition (SQLf or FSQL) is independent of the implementation. However, in this section we study the architectures of the current implementations. According to Blanco (2001) and Timarán (2001), extensions to SQL may be incorporated into an RDBMS through different types of coupling architectures, namely loose, mild, and tight:

• In a loose coupling architecture, the new features are integrated through a software layer on top of the RDBMS. Its main advantage is portability, which allows connecting to any RDBMS. Its disadvantages are scalability and performance. The scalability issue stems from the fact that tools with this architecture load the whole data set into memory, which limits the management of large amounts of data. The low performance results from records being copied one by one from the database address space to the tool address space.

• In a mild coupling, the new features are integrated through stored procedures written in a procedural language for relational databases, such as Oracle PL/SQL, or through external function calls. The advantage of this architecture is that it exploits the data scalability, administration, and manipulation capabilities of the RDBMS. It achieves better performance because it does not need to communicate with an external layer: data are managed directly inside the RDBMS. Its disadvantages are complexity, architecture development costs, the need to define and install the stored procedures and user-defined functions each time a new database instance is created, and low efficiency for intense calculations involving a large number of relations. In addition, portability is null when another RDBMS does not provide the same procedural language or does not accept external function calls.

• In a tight coupling, the new features are incorporated into the RDBMS inner core. In this architecture, it is necessary to extend the parser, rewriter, planner, and executor so that the RDBMS is able to compile, transform, optimize, and execute an extended SQL query. The main advantage of this architecture is that it solves the scalability and performance problems of the other architectures. Its disadvantages mainly concern portability, since this architecture is tied to a specific RDBMS, whose engine has to be extended.
The proposed implementation architecture for SQLf sits on top of an existing RDBMS (Bosc & Pivert, 1995b; Tineo, 2006) with a loose coupling. Extensions are integrated through a software layer implemented in a generic language outside the RDBMS. The RDBMS receives and executes the SQL query and sends the resulting data to the tool, where they are processed. The main advantage is portability, which allows connecting to any RDBMS; the disadvantages are scalability and performance. The actual implementation of SQLf runs on the Oracle 9i RDBMS and is programmed in Java. Its compatibility with the major operating systems (e.g., Linux, Solaris, and Windows) has been proved. The SQLf architecture is composed of several components organized in three layers: interface, logic, and data. The main components of this architecture are the following (Figure 2):

• Client Interface. It receives the user's fuzzy queries and term definitions, and it shows the final results of user operations: the fuzzy query answer.

• Instructions Dispatcher. It is responsible for delivering to the remaining modules the structures necessary for the execution of an instruction and for receiving their respective answers.

• Fuzzy Terms Catalog. It allows the specification of user-defined fuzzy terms and retrieves the definition of such terms so that they can be used by the sentence analysis and evaluation mechanisms. These terms are stored in a database.

• Sentence Analyzer. It analyzes the sentences introduced by the user, checking the syntactic and semantic correctness of the statements.
It performs the translations needed for the execution of the statements. In the case of fuzzy queries, it builds a tree structure of the fuzzy query that is used in the evaluation process.

• Evaluator/Calibrator. It evaluates fuzzy queries and interacts with the RDBMS to retrieve the database elements relevant to the query processing. To do so, it uses a regular SQL query provided by the Sentence Analyzer, computes the satisfaction degrees, and calibrates the answer of the fuzzy query, using the result of the regular query and the tree structure of the fuzzy query.

• Query Translator SQLf → SQL-92 and SQL-99. It applies the derivation principle query transformations, obtaining Boolean queries from fuzzy SQLf ones. These queries are written in standard SQL according to the SQL-92 and SQL-99 norms.

• Translation Instructions SQL-92 and SQL-99 → SQL-RDBMS. We have chosen to express regular queries in the SQL-92 and SQL-99 standards in order to make our system as portable as possible. Therefore, we need to translate regular queries into the specific RDBMS query language.
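The loose-coupling evaluation flow of these components can be sketched end to end; everything below is illustrative (the RDBMS is stubbed with an in-memory list, and all names are invented for the sketch):

```python
# Sketch of loose-coupling evaluation: a layer outside the RDBMS derives a
# regular (Boolean) query, lets the RDBMS execute it, then computes the
# satisfaction degrees and calibrates the answer.

def execute_regular_query(rows, predicate):
    """Stub standing in for the RDBMS: evaluates a Boolean predicate."""
    return [r for r in rows if predicate(r)]

def evaluate_fuzzy_query(rows, boolean_predicate, membership, threshold):
    # 1. The derived Boolean query retrieves the relevant rows.
    candidates = execute_regular_query(rows, boolean_predicate)
    # 2. The software layer computes satisfaction degrees outside the RDBMS.
    graded = [(membership(r), r) for r in candidates]
    # 3. Calibration: keep degrees >= threshold, in decreasing order.
    return sorted(((d, r) for d, r in graded if d >= threshold),
                  key=lambda pair: pair[0], reverse=True)

# Toy data and an invented fuzzy predicate "young".
people = [{"name": "a", "age": 22}, {"name": "b", "age": 33},
          {"name": "c", "age": 50}]
young = lambda r: 1.0 if r["age"] <= 25 else max(0.0, (40 - r["age"]) / 15)
answer = evaluate_fuzzy_query(people, lambda r: r["age"] < 40, young, 0.25)
```

The division of labor mirrors the components above: the Boolean predicate plays the role of the derived query, and steps 2 and 3 correspond to the Evaluator/Calibrator.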
The SQLf catalog (see Figure 3) stores information about fuzzy terms and other objects provided with fuzziness, such as fuzzy views and fuzzy assertions. This catalog must also keep information about users, databases, groups, and privileges. It has been conceptually conceived in the EER model and logically implemented in a relational database. On the other hand, FSQL (Galindo, 1999, 2007; Galindo et al., 2006) has been conceived with a mild coupling architecture: FSQL is integrated through stored procedures and functions. The advantage of this architecture is that it exploits the data scalability, administration, and manipulation capabilities of the RDBMS. It should have better performance, since it does not need to communicate with an external layer: data are managed directly inside the RDBMS.
Figure 2. SQLf implementation architecture
Figure 3. SQLf catalog
Its disadvantages include complexity, architecture development costs, the definition and installation of the stored procedures or functions each time a new database instance is created, and low efficiency for intense calculations with a large number of relations. In addition, portability is null in a mild coupling architecture. Actual implementations of FSQL run on Oracle and PostgreSQL. FSQL has a final interface, FQ, for Windows, developed in Visual Basic (Galindo, 1999, 2007), and other visual interfaces in Java, Visual FSQL (Galindo, Oliva, & Carrasco, 2004a; Oliva, 2003), for many other platforms. The components of the FSQL architecture are the following (Figure 4):

• Traditional Database: The data of our relations, with a special format to store the fuzzy attribute values. The fuzzy attributes are classified by the system into different data types, as explained above.

• Fuzzy Meta-Knowledge Base (FMB): It stores information about the fuzzy relational database (FRDB) in a relational format. It stores the attributes that admit fuzzy treatment and different information on each of them, depending on their fuzzy type.

• FSQL Server: It has been programmed entirely in SQL, in Oracle PL/SQL (Galindo, 1999) and PostgreSQL (Maraboli & Abarzua, 2006), and it includes three kinds of functions. The Translation Function (FSQL2SQL) carries out a lexical, syntactic, and semantic analysis of the FSQL query and translates it into a classic SQL sentence; the resulting SQL sentence includes references to the following kinds of functions. Representation Functions are used to show the fuzzy attributes in a way comprehensible to the user, not in the internally used format. Fuzzy Comparison Functions are used to compare fuzzy values and to calculate the compatibility degrees (CDEG function).

• FSQL Client: A simple and independent program that serves as an interface between the user and the FSQL Server. The user introduces the FSQL query, and the client program communicates with the server and the database in order to obtain the final results. Examples of general FSQL clients are FQ and Visual FSQL. An FSQL client for one concrete application was built by Barranco, Campaña, Medina, and Pons (2004).

Figure 4. FSQL implementation architecture (FRDB)

The FIRST-2 (Fuzzy Interface for Relation SysTems, v. 2) implements the FMB. A summary of these objects is shown in Table 2, giving the set of tables and views of the FMB and the utility of each one. The attributes and a detailed definition can be found in Galindo et al. (2006). Summarizing this analysis of the SQLf and FSQL architectures, we can see that each proposal has its benefits and disadvantages; the main trade-off is scalability and performance vs. portability. Each of the architectures is adequate to the approach of the respective SQL extension. It could be interesting to propose an architecture implementing these languages with tight coupling, that is, incorporating the extensions into the RDBMS inner core or query engine. In such an architecture, it is necessary to extend the parser, rewriter, planner, and executor so that the RDBMS is able to compile, transform, optimize, and execute an extended SQL query. The main advantage of this architecture is that it solves the scalability and performance problems of the other architectures; its disadvantages mainly relate to portability. In any case, SQLf and FSQL are both compatible with the main operating systems. On the other hand, the object-relational database model may be another interesting option, as shown in this book in the chapter by Barranco, Campaña, and Medina.
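As an illustration of the kind of computation performed by the fuzzy comparison functions mentioned above, the compatibility degree of a fuzzy-equal comparison between two trapezoidal possibility distributions can be sketched as the supremum of the minimum of their memberships (a standard possibility measure, approximated here by crude sampling; this is not FSQL's actual implementation, and the example values are invented):

```python
def trap_mu(t):
    """Membership function of a trapezoid t = (a, b, c, d)."""
    a, b, c, d = t
    def mu(x):
        if b <= x <= c:
            return 1.0
        if x <= a or x >= d:
            return 0.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu

def cdeg_feq(t1, t2, steps=2000):
    """Fuzzy-equal compatibility degree: sup over x of min(mu1(x), mu2(x)),
    approximated by sampling the union of both supports."""
    lo = min(t1[0], t2[0])
    hi = max(t1[3], t2[3])
    mu1, mu2 = trap_mu(t1), trap_mu(t2)
    return max(min(mu1(lo + (hi - lo) * k / steps),
                   mu2(lo + (hi - lo) * k / steps)) for k in range(steps + 1))

# Invented trapezoids for the labels "tall" and "about 175 cm".
tall = (170, 180, 200, 200)
about_175 = (170, 174, 176, 180)
degree = cdeg_feq(tall, about_175)   # partial overlap, so 0 < degree < 1
```

The exact supremum for these two trapezoids lies where the rising side of one crosses the falling side of the other (about 0.714 here); the sampled value approaches it as steps grows.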
Table 2. Tables and views of FIRST-2

N. | Table / View | Utility
1 | T. FUZZY_COL_LIST | List of fuzzy columns (or attributes).
2 | T. FUZZY_DEGRE_SIG | Fuzzy degree significances (or meanings).
3 | T. FUZZY_OBJECT_LIST | List of fuzzy objects of columns.
4 | T. FUZZY_LABEL_DEF | Trapezoidal definitions (for labels).
5 | T. FUZZY_APPROX_MUCH | Margin and M values (Types 1 and 2).
6 | T. FUZZY_NEARNESS_DEF | Similarity relations (Type 3).
7 | T. FUZZY_COMPATIBLE_COL | Compatible fuzzy attributes.
8 | T. FUZZY_QUALIFIERS_DEF | Qualifiers definition.
9 | T. FUZZY_DEGRE_COLS | Columns with an associated fuzzy degree.
10 | T. FUZZY_DEGRE_TABLE | Information about fuzzy degrees of tables.
11 | T. FUZZY_TABLE_QUANTIFIERS | Fuzzy quantifiers associated to tables.
12 | T. FUZZY_SYSTEM_QUANTIFIERS | Fuzzy quantifiers associated to the system.
13 | V. LABLES_FOR_OBJCOL | Trapezoidal labels for each attribute.
14 | V. LABELS_OBJ_T3 | Labels for types 3 and 4.
15 | V. ALL_COMPATIBLES_T34 | Compatible attributes of types 3 and 4.
Satisfaction Degree

Classical database querying systems suffer from a rigidity problem: users must specify their requirements by means of Boolean conditions in a query language such as SQL. In consequence, two problems arise. The first is known as "nondiscriminated answers": the result set contains the rows satisfying the query condition, but there is no discrimination between them, so users may be helpless in choosing the preferred answers. The second is known as "lost answers": crisp conditions usually leave out elements on the frontier. It is also possible to retrieve so many elements that the answer becomes useless to the user. SQLf (Bosc & Pivert, 1995a) has been conceived for solving these rigidity problems by using fuzzy sets in querying to express user preferences. Thus, the solution of a SQLf query is always a fuzzy result set or, in other terms, a fuzzy relation. Answer rows are always provided with membership degrees in the real interval (0, 1]. Rows in the answer are given in decreasing order of satisfaction degree; this order cannot be changed through an ORDER BY clause. These degrees are obtained as the satisfaction degree of each row with respect to the SQLf fuzzy query. The user does not specify how to calculate satisfaction degrees; that is specified by the SQLf semantics. Users need not explicitly ask for the satisfaction degree to be computed; it is done automatically by SQLf. Neither may users inhibit the degree calculation; it is part of the intrinsic fuzzy query semantics. Note that rows not satisfying the fuzzy query are completely excluded from the answer: they do not appear in the result set, since their satisfaction degrees are equal to 0. Despite this restriction, a fuzzy query might return a large number of rows, which could be undesirable for the user. Therefore, SQLf provides an answer calibration mechanism. It consists of a final optional clause in the SQLf query, the WITH CALIBRATION clause (Goncalves & Tineo, 2001a, 2001b). This clause may specify two kinds of answer calibration:

• Qualitative calibration consists of a threshold that specifies the minimum satisfaction degree a row must have in order to be in the result set. The threshold is, of course, a real number in the interval (0, 1]. For a given query, only one threshold may be specified; it applies to the result of the query and is not individually specified for each involved condition or predicate. For SQLf, there are evaluation mechanisms based on the distribution of this threshold over the components of the query structure. These mechanisms are based on the derivation principle (Tineo, 1998, 2006; Bosc & Pivert, 1995b).

• Quantitative calibration defines the number of desired answers. Let this number be denoted by k; then the result set contains the top k, in other words, the best k rows according to the satisfaction degree. As the user may imagine, the quantitative calibration is specified with a natural number.

In the original definition of SQLf by Bosc and Pivert (1995a), the calibration is specified in the SELECT clause. In a later work, Goncalves and Tineo (2001a, 2001b) extended the definition, adding the WITH CALIBRATION clause for the sake of language orthogonality. This clause may contain a real number in the interval (0, 1] (qualitative calibration), a natural number (quantitative calibration), or both. In the latter case, the answer is calibrated in both senses, qualitative and quantitative. FSQL works a little differently: the satisfaction degree of a condition is not explicitly given in the answer unless the user demands it. The following function exists for demanding the satisfaction degree: Function CDEG (
to the * of SQL, but this one also includes the columns for the satisfaction degrees of the attributes for which they are relevant. In the result, you will also find the function CDEG applied to each and every one of the fuzzy attributes appearing in the condition. FSQL allows the user to specify fulfillment thresholds (THOLD clause) at different levels of the query condition: for each simple fuzzy condition, a fulfillment threshold may be established (the default is 1) with the format
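The qualitative (threshold) and quantitative (top-k) calibrations described earlier reduce to filtering and truncating a degree-ordered answer; a minimal sketch:

```python
def calibrate(graded_rows, threshold=None, k=None):
    """graded_rows: list of (degree, row) pairs with degree in (0, 1].
    threshold: qualitative calibration, a real number in (0, 1];
    k: quantitative calibration, keep only the best k rows;
    both may be combined, as in the WITH CALIBRATION clause."""
    result = sorted(graded_rows, key=lambda pair: pair[0], reverse=True)
    if threshold is not None:
        result = [(d, r) for d, r in result if d >= threshold]
    if k is not None:
        result = result[:k]
    return result

rows = [(0.9, "r1"), (0.4, "r2"), (0.75, "r3"), (0.2, "r4")]
qualitative = calibrate(rows, threshold=0.5)      # rows with degree >= 0.5
quantitative = calibrate(rows, k=3)               # the best three rows
combined = calibrate(rows, threshold=0.5, k=1)    # both calibrations at once
```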
Evaluation Mechanism

For SQLf, four evaluation mechanisms have been proposed for fuzzy queries:

• Naive Program Strategy: has a high processing cost for query evaluation because it scans the whole relations involved in the query, without taking advantage of the query conditions.

• Sugeno Program Strategy: improves fuzzy quantified query processing by means of two failure conditions. This strategy has a better processing cost for query evaluation than the naive strategy because it uses heuristics to avoid a complete scan of the database.

• Program Derivation Strategy: based on an external program, like that of the naive strategy, but using derived queries to retrieve the relevant rows. This mechanism may be applied to any SQLf querying structure. It performs better than the Naive and Sugeno Program strategies because it further restricts the number of rows to be accessed.

• Query Derivation Strategy: takes advantage of the relationship between fuzzy conditions and regular ones, leaving the whole evaluation in the hands of the DBMS. This strategy is of mild coupling: it exploits the existing evaluation paths and optimization technology. Previous studies have shown experimentally the benefits of this strategy with respect to the others. Nevertheless, the current SQLf prototype does not use this strategy.
These two latter evaluation mechanisms are based on the derivation principle, proposed by Bosc and Pivert (1995a, 1995b) in order to define evaluation mechanisms that keep low the number of rows accessed in fuzzy querying. The main idea of these strategies is to take advantage of the existing relationships between fuzzy conditions and Boolean ones. Such relationships come from the concepts of support and α-cut of fuzzy sets. Later, Tineo (2006) studied the application of this principle to all SQLf querying structures, in both theoretical and practical ways. The principle states that, given a fuzzy query in SQLf, it is possible to derive a regular query obtaining the support of the fuzzy query result, or a close superset of it. In the case of a qualitative calibration with threshold α, the principle applies with the α-cut, which is more restrictive, instead of the support. For FSQL, evaluation is a more complex problem due to the presence of fuzzy data. In this case, efficient evaluation mechanisms such as one based on the SQLf derivation principle have not yet been conceived: the evaluation of FSQL is done with a naïve translation into a traditional SQL query, with the corresponding function calls for dealing with satisfaction degrees. It would be very interesting to study the application of the derivation principle in the case of fuzzy data.
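For a trapezoidal predicate, the derivation principle can be made concrete: the α-cut of a trapezoid (a, b, c, d) is the crisp interval [a + α(b − a), d − α(d − c)], which yields an ordinary Boolean BETWEEN condition that the RDBMS can evaluate. The code below is illustrative, not either prototype's implementation, and the trapezoid for "about 50" is one possible encoding of the graded example discussed earlier:

```python
def alpha_cut(trapezoid, alpha):
    """Crisp interval of values whose membership degree is >= alpha (0 < alpha <= 1)."""
    a, b, c, d = trapezoid
    return (a + alpha * (b - a), d - alpha * (d - c))

def derive_boolean_condition(column, trapezoid, alpha):
    """Derived regular condition for 'column IS trapezoid WITH CALIBRATION alpha'."""
    lo, hi = alpha_cut(trapezoid, alpha)
    return f"{column} BETWEEN {lo} AND {hi}"

# "about 50 copies": fully satisfactory on [49, 51], unacceptable outside [44, 56].
about_50 = (44, 49, 51, 56)
condition = derive_boolean_condition("quantity", about_50, 0.75)
```

With α = 1 the derived interval collapses to the core [49, 51]; with smaller α it widens toward the support, exactly the "support or close superset" behavior the principle describes.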
Performance Analysis

In order to present a comparative analysis of the common fuzzy logic elements included in both SQL extensions, an experimental study was performed. We carried out an initial performance analysis using the formal statistical model method. The idea of this method is to explain the influence of several considered factors on the values observed in the experiments. The importance of a factor is measured by the proportion of the total variation in the response that the factor explains. The FSQL and SQLf prototypes were studied in order to present a performance analysis of the logic components included in the implementations. Therefore, the queries include fuzzy predicates, modifiers, comparators, and connectors. Quantifiers could not be included because the prototypes did not implement a common subset of them. The results are those collected by the Oracle traces. These results do not take into account the time required to translate the fuzzy query into a relational one; they only measure the number of block accesses, the CPU time, and the elapsed time required to answer the relational query. The results were grouped according to their experimental types.
Linear Statistical Model

We have chosen a full factorial design for our experimental study. That is, we consider all the mentioned factors and all their levels. This kind of design allows studying the influence of each factor and of all factor interactions. We must take a value of the response variable for each possible combination of the factors at all their levels. The model is an expression of the observed values of the response variable as a combination of the influences of the experimental factor levels. A factor is a represented variable together with the different instances the variable can take. Each factor contributes to the observed response variable; these contributions are known as effects, and their measurement unit is the same as that of the response variable. The model expresses an observed value as the sum of the average of the observed values and the corresponding effects of the factor levels and of the combinations of factor levels. A general statistical linear model has the form:

y_i = C_0 + C_1 x_i1 + C_2 x_i2 + ... + C_n x_in + C_{n+1} x_i1 x_i2 + C_{n+2} x_i1 x_i3 + ... + C_m x_i(n-1) x_in + E_i    (2)

where:
y_i is the i-th observation of the response variable;
C_j is a constant that measures the j-th effect (the variable's relevance);
x_ij is the i-th observation of the j-th factor;
x_ik x_ij is the interaction between the k-th and j-th factors for the i-th observation;
E_i is the error of the i-th observation.

This model attempts to explain the influence of all factors and their interactions on the experimental
results. Nevertheless, some factors or interactions might have no significant influence. Therefore, this model can be adjusted during the analysis of the experimental results in order to obtain a model that better explains the experimental behavior. As explained before, replicas are not considered in the model. The stochastic analysis is performed using a statistical software tool.
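Such a linear model, with main effects and pairwise interactions, can be fitted by ordinary least squares. The following sketch uses pure Python and synthetic data (the factor names mirror those of this study, but the observations are made up, not the study's measurements):

```python
import itertools

def fit_linear_model(X, y):
    """Ordinary least squares via the normal equations (X^T X) coef = X^T y.
    X: list of observation rows, each starting with 1.0 for the intercept;
    y: list of observed responses. Solved by Gaussian elimination."""
    n = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n)]
    for col in range(n):                         # forward elimination with pivoting
        pivot = max(range(col, n), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for r in range(col + 1, n):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, n):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    coef = [0.0] * n                             # back substitution
    for r in reversed(range(n)):
        coef[r] = (xty[r] - sum(xtx[r][c] * coef[c] for c in range(r + 1, n))) / xtx[r][r]
    return coef

def design_row(tct, dban, conn):
    """Intercept, main effects, and all pairwise interactions."""
    main = [tct, dban, conn]
    inter = [a * b for a, b in itertools.combinations(main, 2)]
    return [1.0] + main + inter

# Synthetic check: generate observations from known coefficients and recover them.
true_coef = [2.0, 1.0, 0.5, -1.0, 0.1, 0.0, 0.2]
points = [(t, d, c) for t in (1.0, 2.0) for d in (10.0, 20.0) for c in (1.0, 5.0)]
X = [design_row(*p) for p in points]
y = [sum(c * v for c, v in zip(true_coef, row)) for row in X]
estimated = fit_linear_model(X, y)
```

In practice a statistical package would also report R-square, AIC, and residuals, the criteria used in the following sections to judge model adequacy.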
Experimental Design

The data for the experiment were generated using a random generator with a uniform distribution for each attribute of the table PEOPLE, with 2,700 records. This table contains some important characteristics of a group of people, stored through the classical attributes shown in Table 3. The experimental design consists of five experiments, one for each query type; two representative queries were designed for each query type to avoid the bias that could be introduced by the use of distinct fuzzy terms or operators. The five experiments are named:

1. Basic One-Relational Block with Simple Condition
2. Basic One-Relational Block with Conjunctive Condition
3. Basic One-Relational Block with Disjunctive Condition
4. Basic Multirelational Block with Complex Condition
5. Nested Block

Therefore, the experimental model used is full factorial, with the following parameters:

1) Observed variables: total elapsed time (TET), total CPU time (TCT), opened connections (Conn), disk block access number (DBAN)
2) Considered factors: engine (SQLfi, FSQL); threshold (low, 0.25; high, 0.75)
3) Repetitions: 2
4) Total number of experiments: for each query type, 2 queries * 2 engines * 2 thresholds = 8; for the 5 query types, a total of 40

Platform

A computer with a 2.0 GHz dual-core processor, 2 GB of RAM, Windows XP, and the Oracle 9i RDBMS was used to perform the experiments. The FSQL engine is implemented in Oracle 9i PL/SQL, while the SQLf engine is implemented in Java as a layer over Oracle 9i. Final user queries are addressed to the engines via a client application interface: for FSQL, we used the client interface made with Visual Basic, while the SQLf interface was made with Java. In order to take measures, we used the Oracle 9i trace and the tkprof utility. At present, there are several prototypes of SQLf developed by Tineo's team at Universidad Simón Bolívar (Venezuela). We used SQLfi V.4.0 here, and we will simply write SQLfi to refer to the SQLf engine used.

Experimental Results

We present the queries in FSQL syntax (Q1); their translation to SQLf is straightforward (Q2). These fuzzy queries are examples of Single Relation with Simple Fuzzy Condition Queries (see Exhibit 1). Tables 4 and 5 present a summary of the results collected from the experiments. The total elapsed time is plotted in Figure 5. This graphic presents the results for all the experiments, each identified with a label from E1 to E5; in each one, the user threshold is distinguished. It is easy to observe the difference in time between the two evaluated engines. Explanations of these times, with regard to other aspects presented in the previous tables, will be given in the following sections, where we present the linear model that explains the observed behavior of each experiment.
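The 40-run full factorial design described above can be enumerated directly:

```python
from itertools import product

query_types = ["E1", "E2", "E3", "E4", "E5"]
queries     = ["Q1", "Q2"]          # two representative queries per type
engines     = ["SQLfi", "FSQL"]
thresholds  = [0.25, 0.75]

# Every combination of factor levels: 5 * 2 * 2 * 2 = 40 experimental runs.
runs = list(product(query_types, queries, engines, thresholds))
```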
Basic One-Relational Block with Simple Condition

The following model adjusts the data collected from these experiments at 100%, according to the R-square and AIC statistical parameters and the residuals analysis:

TET_SQLfi = 5.68 + 3.62 TCT - 0.063 DBAN - 1.04 Conn - 0.24 Grade - 0.0096 TCT*DBAN + 0.01 DBAN*Conn

TET_FSQL = 37.42 + 3.62 TCT - 0.063 DBAN - 1.04 Conn - 0.24 Grade - 0.0096 TCT*DBAN + 0.01 DBAN*Conn

For experiment E1, Figure 5 shows a similar behavior for all the queries evaluated in the FSQL engine. It is remarkable that the average total elapsed time required by the FSQL engine for the queries with a calibration of 0.75 is less than that required for the queries with a calibration of 0.25. The total elapsed time required for the queries with calibration 0.75 is quite similar in both engines; instead, queries with calibration 0.25 achieve better performance in SQLfi. The number of accessed blocks and the CPU time required by the FSQL engine are always greater than those required by SQLfi, because the architecture of SQLfi has a strong logic layer, implemented in Java, that is responsible for degree calculation and filtering. FSQL, on the other hand, uses Oracle's PL/SQL to perform these operations; therefore, the CPU time consumed by the FSQL engine is greater than that required by SQLfi. Additionally, the metadata dictionary of the FSQL engine must include information for the storage of the fuzzy data types and must provide a general treatment of the fuzzy operations on these data types. Hence, when processing a query, the FSQL engine must consult several relational tables to perform the appropriate treatment of the data; the SQLfi engine avoids these operations because it only processes crisp relational data.
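Because the two fitted models share every slope and differ only in the intercept, the predicted engine effect on total elapsed time is a constant gap, independent of the regressor values; this can be checked directly (the regressor values below, including the Grade factor, are arbitrary illustrations):

```python
def tet(intercept, tct, dban, conn, grade):
    """Fitted total-elapsed-time model of the simple-condition experiment;
    coefficients taken from the model given in the text."""
    return (intercept + 3.62 * tct - 0.063 * dban - 1.04 * conn
            - 0.24 * grade - 0.0096 * tct * dban + 0.01 * dban * conn)

# Same (arbitrary) regressor values, the two engine intercepts from the text.
args = dict(tct=1.0, dban=100.0, conn=5.0, grade=0.5)
gap = tet(37.42, **args) - tet(5.68, **args)
# gap equals 37.42 - 5.68 = 31.74 seconds, whatever the regressors are
```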
Table 3. Data for the experiment

Attribute | Domain | Semantic
LANGUAGE | VARCHAR(10) | The language. The possible values are: English, French, Spanish, German, and Italian.
FNAME | VARCHAR(10) | First name.
LANAME | VARCHAR(10) | Last name.
DATE_OF_BIRTH | DATE | Date of birth.
COUNTRY | VARCHAR(10) | Country where the person resides.
EMAIL | VARCHAR(40) | E-mail.
SEX | VARCHAR(1) | Sex. The possible values are: M (male) and F (female).
AGE | NUMERIC(3) | Age.
CIVIL_STATE | VARCHAR(10) | Civil state. The possible initial values are: divorced, married, single, and widowed.
EYES | VARCHAR(20) | Eye color. The possible initial values are: black, blue, brown, and green.
HAIR | VARCHAR(20) | Hair color. The possible initial values are: black, blond, chesnutbrown, and redhaired.
WEIGHT | NUMERIC(10, 3) | Weight in kilograms.
STATURE | NUMERIC(10, 3) | Stature in centimeters.
RACE | VARCHAR(20) | Race. The possible initial values are: afroamerican, corn-coloured, indian, and white.
Exhibit 1.

FSQL Q1:
SELECT *, CDEG(*) FROM PEOPLE WHERE Age FEQ $Ancient THOLD 0.75;

SQLf Q2:
SELECT * FROM PEOPLE WHERE Age = Ancient WITH CALIBRATION 0.75;

FSQL Q1:
SELECT * FROM PEOPLE WHERE Weight FEQ $Weighted THOLD 0.75 AND Stature FEQ $Tall THOLD 0.75;

SQLf Q2:
SELECT * FROM PEOPLE WHERE Weight = Weighted AND Stature = Tall WITH CALIBRATION 0.75;
Table 4. Results obtained with a threshold of 0.25

Experiment | Repetition | Engine | TET (s) | TCT (s) | DBAN | Conn
E1 | 1 | SQLfi | 5.84 | 0.8 | 185 | 9
E1 | 1 | FSQL | 3.79 | 1.88 | 569 | 1
E1 | 2 | SQLfi | 0.92 | 0.48 | 71 | 9
E1 | 2 | FSQL | 3.96 | 1.94 | 565 | 1
E2 | 1 | SQLfi | 4.99 | 0.2 | 64 | 11
E2 | 1 | FSQL | 3.38 | 1.77 | 566 | 1
E2 | 2 | SQLfi | 4.79 | 0.18 | 103 | 11
E2 | 2 | FSQL | 3.71 | 2.03 | 563 | 1
E3 | 1 | SQLfi | 0.52 | 0.29 | 79 | 9
E3 | 1 | FSQL | 4.23 | 1.91 | 563 | 1
E3 | 2 | SQLfi | 0.28 | 0.19 | 23 | 7
E3 | 2 | FSQL | 3.91 | 1.99 | 565 | 1
E4 | 1 | SQLfi | 1.02 | 0.61 | 132 | 579
E4 | 1 | FSQL | 5.12 | 3.1 | 565 | 1
E4 | 2 | SQLfi | 0.78 | 0.48 | 125 | 213
E4 | 2 | FSQL | 5.26 | 3.3 | 565 | 1
E5 | 1 | SQLfi | 0.94 | 0.33 | 105 | 13
E5 | 1 | FSQL | 4.46 | 2.89 | 543 | 1
E5 | 2 | SQLfi | 1.06 | 0.49 | 127 | 19
E5 | 2 | FSQL | 5.07 | 3.13 | 558 | 1
Table 5. Results obtained with a threshold of 0.75

Experiment | Repetition | Engine | TET (s) | TCT (s) | DBAN | Conn
E1 | 1 | SQLfi | 0.98 | 0.38 | 75 | 7
E1 | 1 | FSQL | 4 | 1.89 | 564 | 1
E1 | 2 | SQLfi | 0.38 | 0.22 | 67 | 7
E1 | 2 | FSQL | 3.75 | 1.99 | 565 | 1
E2 | 1 | SQLfi | 0.99 | 0.25 | 81 | 9
E2 | 1 | FSQL | 3.61 | 1.94 | 542 | 1
E2 | 2 | SQLfi | 0.36 | 0.32 | 72 | 11
E2 | 2 | FSQL | 3.79 | 2.11 | 563 | 1
E3 | 1 | SQLfi | 0.07 | 0.05 | 56 | 7
E3 | 1 | FSQL | 3.96 | 1.94 | 565 | 1
E3 | 2 | SQLfi | 0.07 | 0.04 | 56 | 7
E3 | 2 | FSQL | 5.19 | 2.08 | 559 | 1
E4 | 1 | SQLfi | 0.55 | 0.29 | 59 | 137
E4 | 1 | FSQL | 4.98 | 3.3 | 564 | 1
E4 | 2 | SQLfi | 0.91 | 0.52 | 108 | 551
E4 | 2 | FSQL | 5.26 | 3.24 | 563 | 1
E5 | 1 | SQLfi | 1.19 | 0.96 | 132 | 541
E5 | 1 | FSQL | 10.1 | 1.44 | 507 | 1
E5 | 2 | SQLfi | 0.81 | 0.45 | 119 | 213
E5 | 2 | FSQL | 5.21 | 3.05 | 569 | 1
Basic One-Relational Block with Conjunctive Condition

For the experiments with the basic one-relational block with conjunctive condition, the following model fits the data collected from these experiments at 100%, according to the R-square and AIC statistical parameters and the residuals analysis:

TET_SQLfi = 15.5 - 42.57*TCT - 0.16*Conn - 0.018*DBAN + 0.018*Grade + 0.0778*TCT*DBAN + 0.011*DBAN*Conn
TET_FSQL = 86.44 - 42.57*TCT - 0.16*Conn - 0.018*DBAN + 0.018*Grade + 0.0778*TCT*DBAN + 0.011*DBAN*Conn
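Since the two fitted models share every slope coefficient and differ only in the intercept, the engine effect is a constant shift of the predicted total elapsed time. A small sketch evaluating the models as stated in the chapter (the argument names mirror the variables TCT, DBAN, Conn, and Grade; the sample values passed to it are illustrative):

```python
def predicted_tet(engine, tct, dban, conn, grade):
    """Evaluate the fitted regression for the conjunctive-condition block.
    Coefficients are taken from the chapter; only the intercept depends
    on the engine."""
    intercept = {"SQLfi": 15.5, "FSQL": 86.44}[engine]
    return (intercept - 42.57 * tct - 0.16 * conn - 0.018 * dban
            + 0.018 * grade + 0.0778 * tct * dban + 0.011 * dban * conn)

# At identical covariates, the two engines differ by a constant
# 86.44 - 15.5 = 70.94 seconds in this model.
gap = (predicted_tet("FSQL", 1.0, 100, 1, 0.75)
       - predicted_tet("SQLfi", 1.0, 100, 1, 0.75))
```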
Similarly, the behavior of both engines is quite similar in terms of total elapsed time for the queries with a calibration of 0.75, but in this case FSQL shows better performance than SQLfi. This could be explained by the SQLfi implementation, in which a database connection is opened and closed for each subquery. In this context, a subquery is every relational query that is needed to answer a fuzzy query, including queries to the metadata catalog or to user data. For this reason, even when few data-block accesses and little CPU processing are required, the time needed to establish the database connections must be considered. On the other hand, SQLfi solves queries with a calibration of 0.25 faster than FSQL. The reason is the architecture and the number of
Figure 5. Total elapsed time (in seconds) of the FSQL and SQLfi engines for experiments E1 to E5 with thresholds 0.25 and 0.75
returned rows. Queries with a calibration of 0.25 have a higher cardinality, and therefore the elapsed times of the FSQL engine include the time required to consult the metadata dictionary and to process the statements through PL/SQL procedures. Finally, it is remarkable that, in this and the previous experiments, the CPU time and the elapsed time are similar for all the queries evaluated in FSQL, but that is not the case for SQLfi.
Basic One-Relational Block with Disjunctive Condition

TET_SQLfi = 14.82 - 0.079*TCT - 0.22*DBAN - 2.027*Conn - 0.17*Grade + 0.0014*TCT*DBAN + 0.030*DBAN*Conn
TET_FSQL = 112.70 - 0.079*TCT - 0.22*DBAN - 2.027*Conn - 0.17*Grade + 0.0014*TCT*DBAN + 0.030*DBAN*Conn

This model also fits the data collected from these experiments at 100%, according to the R-square and AIC statistical parameters and the residuals
analysis. In these queries, the processing requirements of the FSQL engine are greater than those of SQLfi because of the overhead of the FSQL data catalog and of the engines' architectures. Additionally, in this experiment, the CPU time and the elapsed time for the SQLfi engine look similar because, although SQLfi generally requires more connections than FSQL, many of those connections were not required here.
Basic Multirelational Block with Complex Condition

TET_SQLfi = -7.05 + 45.98*TCT - 0.067*DBAN - 0.0016*Conn + 1.22*Grade - 0.08*TCT*DBAN - 0.00006*DBAN*Conn
TET_FSQL = 39.97 + 45.98*TCT - 0.067*DBAN - 0.0016*Conn + 1.22*Grade - 0.08*TCT*DBAN - 0.00006*DBAN*Conn

The model fits the data collected from these experiments at 100%, according to the R-square and AIC statistical parameters and the residuals analysis. The behavior of these queries is very
similar to the previous ones, but it is remarkable that the higher the complexity of the conditions, the smaller the distance between the processing times required by the two engines.
Nested Block

TET_SQLfi = -69.63 - 21.9*TCT - 0.13*DBAN - 0.043*Conn + 5.49*Grade + 0.022*TCT*DBAN + 0.00048*DBAN*Conn
TET_FSQL = -39.8 - 21.9*TCT - 0.13*DBAN - 0.043*Conn + 5.49*Grade + 0.022*TCT*DBAN + 0.00048*DBAN*Conn

The model fits the data collected from these experiments at 100%, according to the R-square and AIC statistical parameters and the residuals analysis. These results are interesting because of the remarkable nearness between the total elapsed times required by both engines for the second query with calibration 0.25. These results were achieved even though the FSQL engine performs its heaviest logical operations in Oracle, and they suggest a very good performance for this kind of query. This could be explained by the good evaluation mechanism for non-correlated queries provided by Oracle, since in FSQL this responsibility is delegated to the DBMS. Instead, the SQLfi engine creates a new query from the inner query for each row belonging to the outer query, because the naïve strategy was used in the implementation of nested queries; therefore, it is recommended that the derivation mechanism be used for nested queries in future works.
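The contrast between the naïve strategy and the single-evaluation treatment of a non-correlated subquery can be sketched schematically. The row format and the inner-query callables below are illustrative, not SQLfi's actual interfaces:

```python
def naive_nested(outer_rows, inner_for):
    """Naïve strategy: a fresh inner query is built and run for every
    outer row, so each outer row costs one extra round trip."""
    kept, inner_runs = [], 0
    for row in outer_rows:
        inner_runs += 1
        if row["id"] in inner_for(row):
            kept.append(row)
    return kept, inner_runs

def uncorrelated_nested(outer_rows, inner_once):
    """Non-correlated case: run the inner query once and reuse its result."""
    ids = inner_once()
    return [row for row in outer_rows if row["id"] in ids], 1

outer = [{"id": 1}, {"id": 2}, {"id": 3}]
kept_naive, runs_naive = naive_nested(outer, lambda row: {1, 3})
kept_once, runs_once = uncorrelated_nested(outer, lambda: {1, 3})
```

Both strategies return the same rows, but the naïve one issues as many inner evaluations as there are outer rows, which is the cost the derivation mechanism is expected to avoid.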
Summary

This performance analysis shows that it is sufficient to describe the total elapsed time required by the query evaluations in terms of the CPU time, the number of accessed blocks, the number of connections made to the database, the fuzzy engine, and the calibration. It is remarkable that these variables remain constant across all the statistical models; additionally, the coefficients of the interaction parameters are always positive and, although they are the smallest, their effect is strong because of their domain. The analysis suggests that the engines have different scopes. SQLfi provides better performance for non-complex queries with a calibration of 0.25, and its performance appears to depend on the query features. Instead, FSQL promises good performance for nested queries and suggests a stable performance. It also seems that the more complex a query, the smaller the distance between FSQL and SQLfi. Additionally, the wide range of fuzzy data that can be stored and processed in the FSQL engine generates an overhead in the treatment of fuzzy queries over crisp data, because more metadata dictionary tables must be consulted during query processing. On the other hand, the multiple connections created by the SQLfi engine, and their interaction with the number of accessed blocks, also influence the total elapsed time required by the engine, because many subqueries are generated and more resources are needed to answer one logical fuzzy query.
Conclusion and Future Trends

In this chapter, we have studied the two main proposals for fuzzy-set-based extensions to SQL: FSQL and SQLf. The comparison has been made from several points of view. We noticed that both proposals are quite solid. The differences found derive from the different approaches of the two languages. The approach of SQLf is to provide users with greater flexibility in querying precise data stored in relational databases. On the other hand, the FSQL approach is to manipulate imprecise information in a relational database. In this sense, SQLf gives more freedom to the user, whereas FSQL gives more strength to the coherence in the design and use of fuzzy data elements. Both languages allow the use of a large variety of fuzzy elements: predicates, modifiers, comparators, connectors, quantifiers, and operators. Most SQLf fuzzy logic elements are user defined, while most FSQL ones are built into the system. It would be useful to have a language with the large variety of built-in elements of FSQL but also with the great flexibility of SQLf. Although FSQL is devoted to the manipulation of fuzzy data, query answers are relational tables, and satisfaction degrees may be computed on user demand as projected attributes. On the other hand, in SQLf any query returns a fuzzy relation over well-known data. In SQLf, the satisfaction degree is always computed and is returned as an implicit quality of the answer, not as a projected attribute. It would be interesting to mix both proposals in order to allow fuzzy databases with both imprecise data and gradual membership. Nevertheless, it would also be desirable to allow the user to specify that, in some queries, the satisfaction degree of the answers is not relevant, but just the selected rows in a relational way. Both FSQL and SQLf allow users to specify desired thresholds for the satisfaction degree. SQLf does that in a global way, whereas FSQL allows this specification in each fuzzy condition. The latter seems to give more flexibility; nevertheless, it might not be clear what the combination of expressions with different thresholds means. The problem of query processing evaluation mechanisms has been deeply studied for SQLf. The main contribution in this direction has been the definition and application of the derivation principle. This principle takes advantage of existing connections between fuzzy sets and classic sets in order to keep the number of accessed rows low in query evaluation. Unfortunately, at the present time there are no studies of efficient evaluation mechanisms for FSQL. In future works, it will be convenient to explore the application of the derivation principle in the presence of fuzzy data.
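As an illustration of the derivation principle mentioned above, a fuzzy condition with a threshold can be rewritten into a crisp range condition that the DBMS evaluates natively, so that only candidate rows are fetched. The sketch below assumes a trapezoidal predicate; the predicate's parameters and the table and column names are hypothetical:

```python
def alpha_cut_interval(a, b, c, d, lam):
    """Crisp interval on which a trapezoidal predicate (a, b, c, d)
    reaches membership degree >= lam (with 0 < lam <= 1).  This is the
    kind of fuzzy-to-crisp rewriting the derivation principle relies on."""
    return a + lam * (b - a), d - lam * (d - c)

# A fuzzy condition "qty is about 50" with threshold 0.75 is derived
# into a crisp BETWEEN, pushed down to the DBMS (illustrative schema).
lo, hi = alpha_cut_interval(40, 48, 52, 60, 0.75)
sql = f"SELECT * FROM orders WHERE qty BETWEEN {lo} AND {hi}"
```

Only rows inside the derived interval can have a degree above the threshold, so the fuzzy degree needs to be computed for these rows alone.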
At the present time, the proposed architecture for the implementation of SQLf is a loosely coupled architecture that has advantages in portability but serious problems with scalability and performance. On the other hand, the proposed architecture for
FSQL is a mildly coupled one that has advantages in scalability but problems in portability and performance. It would be useful to propose tightly coupled architectures for future implementations of fuzzy database languages. The initial experimental performance analysis presented here has shown better performance of the SQLf prototype with respect to FSQL; this is probably due to the fact that SQLf queries are evaluated applying the derivation principle. Nevertheless, the SQLf prototype has shown less scalability than the FSQL implementation, probably because of its loose-coupling strategy. The application of the derivation principle within a mildly coupled architecture would provide better performance and scalability; the problem then would be portability. In future works, we will integrate SQLf and FSQL into a single language that incorporates the elements of both languages, according to the analysis made here. We will also update this integrated language to cover all the features of the emerging standard SQL:200n. A very important problem to study in this fuzzy database language is the application of the derivation principle and its implementation with different coupling architectures.
Acknowledgment

The authors would like to thank Dr. José Galindo (University of Málaga, Spanish projects TIN2006-14285, TIN2006-07262 and TIC-1570) for his useful revision of this article, as well as his comments, which contributed to the final formulation of this work. Supported by internal project number 81201 (2006-2007) of the Catholic University of the Maule, Chile. This work is supported in part by the Venezuelan Governmental Foundation for Science, Innovation and Technology, FONACIT Grant G-2005000278. The main purpose of this work is the glory of God: So whether you eat or drink or whatever you do.
References

ANSI. (1986). American national standard for information systems: Database language SQL (ANSI X3.135-1986). New York: American National Standards Institute.
ANSI. (1989). Database language SQL with integrity enhancement (ANSI X3.135-1989). New York: American National Standards Institute.
ANSI. (1992). Database language SQL (ANSI X3.135-1992). New York: American National Standards Institute.
Barranco, C. D., Campaña, J., Medina, J. M., & Pons, O. (2004). ImmoSoftWeb: A Web based fuzzy application for real estate management. In Proceedings of the Advances in Web Intelligence 2nd International Atlantic Web Intelligence Conference, AWIC 2004 (LNCS 3034, pp. 196-206). Heidelberg: Springer-Verlag.
Blanco, I. (2001). Deducción en bases de datos relacionales difusas. Doctoral thesis, Universidad de Granada, España.
Bosc, P., & Pivert, O. (1991). Fuzzy querying in conventional databases. In L. Zadeh & J. Kacprzyk (Eds.), Fuzzy logic for the management of uncertainty (pp. 645-671). John Wiley.
Bosc, P., & Pivert, O. (1992). Some approaches for relational databases flexible querying. International Journal of Intelligent Systems, 1(3-4), 323-354.
Bosc, P., & Pivert, O. (1994). Imprecise data management and flexible querying in databases. In R. Yager & L. Zadeh (Eds.), Fuzzy sets, neural networks and soft computing (pp. 386-395). Van Nostrand Reinhold.
Bosc, P., & Pivert, O. (1995a). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3(1).
Bosc, P., & Pivert, O. (1995b). On the efficiency of the alpha-cut distribution method to evaluate simple fuzzy relational queries. In B. Bouchon-Meunier, R. R. Yager, & L. A. Zadeh (Eds.), Advances in fuzzy systems: Applications and theory (Vol. 4: Fuzzy logic and soft computing, pp. 251-260). World Scientific.
Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational databases. Fuzzy Sets and Systems, 7, 213-226.
Buckles, B. P., & Petry, F. E. (1984). Extending the fuzzy database with fuzzy numbers. Information Sciences, 34, 145-155.
Cox, E. (1995). Relational database queries using fuzzy logic. Artificial Intelligent Expert, pp. 23-29.
Delgado, M., Sánchez, D., & Vila, A. (2000). Fuzzy cardinality based evaluation of quantified sentences. International Journal of Approximate Reasoning, 23, 23-66.
Dubois, D., & Prade, H. (1985). Fuzzy cardinality and the modeling of imprecise quantification. Fuzzy Sets and Systems, 16, 190-230.
Dubois, D., & Prade, H. (1998). Théorie des possibilités: Applications à la représentation des connaissances en informatique (2nd ed.). Masson.
Fagin, R. (1999). Combining fuzzy information from multiple systems. Journal of Computer and System Sciences, 58, 83-99.
Galindo, J. (1999). Tratamiento de la imprecisión en bases de datos relacionales: Extensión del modelo y adaptación de los SGBD actuales. Doctoral thesis, University of Granada, Spain. Retrieved February 7, 2008, from http://www.lcc.uma.es
Galindo, J. (2007). FSQL (fuzzy SQL): A fuzzy query language. Retrieved February 7, 2008, from http://www.lcc.uma.es/~ppgg/FSQL
Galindo, J., Medina, J. M., Pons, O., & Cubero, J. C. (1998a). A server for fuzzy SQL queries. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (LNAI 1495, pp. 164-174). Springer. Retrieved February 7, 2008, from http://www.springerlink.com/content/ddyttjwpn31hxer4/
Galindo, J., Medina, J. M., Pons, O., Vila, M. A., & Cubero, J. C. (1998b, March). A prototype for a fuzzy relational database. Demo session in the 8th International Conference on Extending Database Technology, Valencia, Spain.
Galindo, J., Medina, J. M., Vila, M. A., & Pons, O. (1998c, October). Fuzzy comparators for flexible queries to databases. In Proceedings of the 6th Iberoamerican Conference on Artificial Intelligence, IBERAMIA'98 (pp. 29-41), Lisbon, Portugal.
Galindo, J., Oliva, R. F., & Carrasco, R. A. (2004a, September). Acceso Web a bases de datos difusas: Un cliente visual de fuzzy SQL. Paper presented at the XIII Congreso Español sobre Tecnologías y Lógica Fuzzy (ESTYLF'2004), Jaén, Spain.
Galindo, J., Urrutia, A., Carrasco, R., & Piattini, M. (2004b, December). Relaxing constraints in enhanced entity-relationship models using fuzzy quantifiers. IEEE Transactions on Fuzzy Systems, 12(6), 780-796.
Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing.
Goncalves, M., & Tineo, L. (2001a). SQLf flexible querying language extension by means of the norm SQL2. In Proceedings of the 10th IEEE International Conference on Fuzzy Systems, Fuzz-IEEE 2001 (Vol. 1), Melbourne, Australia.
Goncalves, M., & Tineo, L. (2001b). SQLf3: An extension of SQLf with SQL3 features. In Proceedings of the 10th IEEE International Conference on Fuzzy Systems, Fuzz-IEEE 2001 (Vol. 1), Melbourne, Australia.
Goncalves, M., & Tineo, L. (2006a). SQLf vs. Skyline: Expressivity and performance. In Proceedings of the 15th IEEE International Conference on Fuzzy Systems, Fuzz-IEEE 2006 (pp. 2062-2067), Vancouver, Canada.
Goncalves, M., & Tineo, L. (2006b). Towards flexible Skyline queries. In Proceedings of the XXX
Conferencia Latinoamericana de Informática, CLEI 2006, Santiago, Chile.
Jiménez, L., Urrutia, A., Galindo, J., & Zaraté, P. (2005). Implementación de una base de datos relacional difusa: Un caso en la industria del cartón. Revista Colombiana de Computación, 6(2), 48-58. Retrieved February 7, 2008, from http://www.unab.edu.co/editorialunab/revistas/rcc/rev62.htm
Maraboli, R., & Abarzua, J. (2006). FSQL-f: Representación y consulta por medio del lenguaje PL/PGSQL de información imperfecta. Doctoral thesis, Universidad Católica del Maule, Chile.
Medina, J. M. (1994, May). Bases de datos relacionales difusas: Modelo teórico y aspectos de su implementación. Doctoral thesis, University of Granada.
Medina, J., Pons, O., & Vila, M. (1994). GEFRED: A generalized model of fuzzy relational databases. Information Sciences, 77(6), 87-109.
Melton, J. (1993). ISO/ANSI working draft: Database language SQL (SQL3, X3H2-93-091/ISO DBL YOK-003). ISO/ANSI.
Oliva, R. F. (2003). Visual FSQL: Gestión visual de bases de datos difusas en ORACLE a través de Internet usando FSQL. Proyecto Fin de Carrera, directed by J. Galindo, Ingeniería Superior en Informática, University of Málaga. Retrieved February 7, 2008, from http://www.lcc.uma.es/~ppgg/PFC
Pedrycz, W., & Gomide, F. (1998). An introduction to fuzzy sets: Analysis and design (A Bradford Book). The MIT Press.
Prade, H., & Testemale, C. (1987). Fuzzy relational databases: Representational issues and reduction using similarity measures. Journal of the American Society of Information Sciences, 38(2), 118-128.
Timarán, R. (2001). Arquitecturas de integración del proceso de descubrimiento de conocimiento con sistemas de gestión de bases de datos: Un estado del arte. Revista Ingeniería y Competitividad, 3(2). Universidad del Valle, Colombia.
Tineo, L. (1998). Interrogaciones flexibles en bases de datos relacionales. Trabajo de ascenso para optar a la categoría de Profesor Agregado, Universidad Simón Bolívar, Venezuela.
Tineo, L. (2000). Extending RDBMS for allowing fuzzy quantified queries. In M. Revell (Ed.), Lecture Notes in Computer Science (Vol. 1873, pp. 407-416). Berlin: Springer-Verlag.
Tineo, L. (2005). Una contribución a la interrogación flexible de bases de datos: Evaluación de consultas cuantificadas difusas. Doctoral thesis, Universidad Simón Bolívar, Sartenejas, Venezuela. Retrieved February 7, 2008, from http://xica.bd.cesma.usb.ve/sqlfiv4
Tineo, L. (2006). Una contribución a la interrogación flexible de bases de datos relacionales: Evaluación de consultas cuantificadas. Doctoral thesis, Universidad Simón Bolívar, Caracas, Venezuela.
Umano, M., Fukami, S., Mizumoto, M., & Tanaka, K. (1980). Retrieval processing from fuzzy databases (Tech. Rep. of IECE of Japan, Vol. 80, No. 204 on automata and languages, pp. 45-54, AL80-50). IECE.
Urrutia, A. (2003). Definición de un modelo conceptual para bases de datos difusas. Doctoral thesis, University of Castilla-La Mancha, Spain.
Urrutia, A., Galindo, J., Jiménez, L., & Piattini, M. (2006). Data modeling: Dealing with uncertainty in fuzzy logic (IFIP, Vol. 214, pp. 201-217). Springer Science and Business Media.
Urrutia, A., Galindo, J., & Piattini, M. (2002). Modeling data using fuzzy attributes. In Proceedings of the XXII International Conference of the Chilean Computer Science Society (pp. 117-123). IEEE Computer Society Press.
Urrutia, A., Galindo, J., & Piattini, M. (2003). Propuesta de un modelo conceptual difuso. Libro de Ingeniería de Software (pp. 51-76). Ediciones Cyted Ritos2.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.
Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1, 3-28.
Zemankova-Leech, M., & Kandel, A. (1984). Fuzzy relational databases: A key to expert systems. Köln: Verlag TÜV Rheinland.
Key Terms

CDEG (Compatibility Degree): FSQL predefined function intended for the computation of the satisfaction degrees of fuzzy conditions. In queries involving fuzzy data items, this function may be used in order to project the satisfaction degree as part of the resulting table.

DDL (Data Definition Language): The statements of this language enable the creation and modification of the structures in which the data will be stored. Examples of DDL statements are CREATE (to create objects of the database: tables, views, etc.), DROP (to remove objects), ALTER (to modify objects), and statements for security controls, indexes, and the control of the physical storage of the data.

Derivation Principle: SQLf's theoretical and practical basis for evaluation strategies that keep the extra cost of fuzzy query processing low, taking advantage of existing relations between fuzzy conditions and crisp ones.

DML (Data Manipulation Language): The DML statements (or sentences) enable the querying and the modification of the data stored in the database. Examples of this kind of statement are SELECT, INSERT, DELETE, and UPDATE.

FSQL: Extension of SQL with fuzzy-set-based features that allows the storage of fuzzy data values and their use in any place where SQL allows using crisp data values, providing imprecise and uncertain data manipulation in classic and fuzzy relational databases.
Fuzzy Condition: Condition using fuzzy values and/or fuzzy comparators, whose truth is fuzzy. The fuzzy values may be fuzzy attributes or fuzzy constants, like linguistic labels or "approximately 8" (expressed by #8 in FSQL). Fuzzy comparators express fuzzy relations between two values, for example, "approximately equal," "fuzzy greater than," "much greater than," and so on. The expression of preferences may be considered another kind of fuzzy condition. The fulfillment of a fuzzy condition is usually a value in the interval [0,1], giving a fuzzy fulfillment degree for each fuzzy condition.

Fuzzy Degrees: Fuzzy attributes whose domain is usually the interval [0,1], although other values are also permitted, such as possibility distributions (usually over this unit interval), which, in turn, may be related to specific linguistic labels (like "a lot," "normal," etc.). In order to keep it simple, usually only degrees in the interval [0,1] are used, because the other option offers no great advantage and a greater technical and semantic complexity.

SQL (Structured Query Language): Relational database querying language, as essentially developed by Chamberlin et al. (1974, 1976). In 1986, the American National Standards Institute (ANSI) and the International Standards Organization (ISO) published the standard SQL-86 or SQL1 (ANSI, 1986). In 1989, an extension of the SQL standard, called SQL-89, was published, and SQL2 or SQL-92 was published in 1992 (ANSI, 1992).

SQLf: Extension of SQL with fuzzy-set-based features that allows using a fuzzy condition in any place where SQL allows a Boolean condition, providing fuzzy querying capabilities on classic relational databases and giving fuzzy result sets as query answers.
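As a small illustration of the fuzzy constants mentioned in the Fuzzy Condition entry, one possible (hypothetical) reading of "#8" is a triangular membership function centred on 8; the margin chosen below is illustrative, not an FSQL default:

```python
def approx(n, margin=2.0):
    """Hypothetical reading of a fuzzy constant #n ("approximately n")
    as a triangular membership function centred on n."""
    def mu(x):
        return max(0.0, 1.0 - abs(x - n) / margin)
    return mu

mu8 = approx(8)
degree = mu8(7)  # degree to which the stored crisp value 7 fulfils "#8"
```

A stored value of exactly 8 fulfils the condition with degree 1, while values farther than the margin from 8 fulfil it with degree 0, which matches the [0,1] fulfillment degrees described above.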
Chapter XII
Hierarchical Fuzzy Sets to Query Possibilistic Databases Rallou Thomopoulos INRA, France Patrice Buche INRA, France Ollivier Haemmerlé IRIT, France
Abstract

Within the framework of flexible querying of possibilistic databases, based on the fuzzy set theory, this chapter focuses on the case where the vocabulary used both in the querying language and in the data is hierarchically organized, which occurs in systems that use ontologies. We give an overview of previous works concerning two issues: first, flexible querying of imprecise data in the relational model; second, the introduction of fuzziness in hierarchies. Concerning the latter point, we develop an aspect where there is a lack of study in the current literature: fuzzy sets whose definition domains are hierarchies. Hence, we propose the concept of hierarchical fuzzy set and present its properties. We present its application in the MIEL flexible querying system for the querying of two imprecise relational databases, including user interfaces and experimental results.
Introduction

In flexible querying systems, fuzzy sets are used to represent preferences in selection criteria. For instance, in the framework of a database about microbiological risk assessment in foods, the users may ask for milk as a first choice or yogurt as a second choice. In possibilistic databases, an imprecise datum is represented by a possibility distribution. For instance, in some kinds of human diseases,
the bacterium Escherichia coli is suspected to be responsible, but other bacteria like Listeria are not excluded. Behind those two different purposes, the same homogeneous formalism is used: the fuzzy set theory. In both cases, a relation order is defined on a domain of values. In this chapter, we study the case when the domain of values is not “flat” but hierarchically organized, using the “kind of” relation. For instance, food products, like milk or yogurt, are part of a hierarchy of substrates, in
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
which whole milk is a kind of milk. In the same way, the bacteria Escherichia coli and Shigella are part of a hierarchy of micro-organisms. We call a fuzzy set defined on a hierarchy a hierarchical fuzzy set (HFS). Contrary to the classical case when the domain of values is "flat," in this case, the assumption that the values are independent does not hold. Two order relations (the preference/possibility order relation and the "kind of" relation) must be made consistent. Several issues thus have to be addressed:

•	Does the preference/possibility degree associated with a given value in a fuzzy set have implications on the degrees associated with other values of the domain, particularly more specific or more general values?
•	What would be the meaning of two comparable values (in the sense of the "kind of" relation) associated with different preference/possibility degrees?
•	Can the "kind of" relation be used to enlarge the user's query in order to obtain more answers while respecting the preference order defined by the user in the selection criteria?
We have designed and realized two instances (for two different relational databases) of a flexible querying system, called MIEL,1 involving hierarchical fuzzy sets. Both databases contain imprecise data and deal with risk assessment in food, respectively, microbial risk and chemical risk. The need for flexible querying, imprecise data representation, and studying fuzzy sets when the domain of values is hierarchically organized is justified, in both databases, by three characteristics of the data:

•	Although composed of several thousand entries (10 thousand for the microbial database and 50 thousand for the chemical database), data are not abundant enough to answer every query, and therefore there is a need for flexible querying in order to complement exact answers with pertinent (i.e., semantically close) answers.
•	Data include imprecise values. For instance, the level of contamination of a given food by a given contaminant is not precisely known, but is included in a given interval or is inferior to a given threshold.
•	Symbolic data are often organized in taxonomies: for example, taxonomies of food products (Ireland & Moller, 2000), of bacteria (Ballows, Truper, Dworkin, Harder, & Schleifer, 1992), and so forth.
The MIEL fuzzy querying system has been especially designed for end users who are not specialists in computer science. They express their queries through a set of prewritten queries we call views. These views can be complemented by the users through the simple graphical user interface of the MIEL system. That interface allows the users to specify their projection attributes and their selection criteria. The taxonomies of the symbolic data can also be browsed by the end users in order to express their selection criteria as hierarchical fuzzy sets. In this chapter, we first provide some background on the topic and recall some broad definitions useful for understanding the main focus of the chapter. Second, we define and explain the concept of hierarchical fuzzy set and compare it to the bibliography. Third, we present the MIEL flexible querying system, which uses the concept of hierarchical fuzzy set. Fourth, we present the instantiations of the MIEL system for the querying of two imprecise databases in the field of risk assessment in food and give some experimental results. Fifth, current projects and future trends are presented, and then we conclude this chapter.
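A minimal sketch of the query-enlargement idea behind hierarchical fuzzy sets: a preference degree attached to a term of the taxonomy also applies to its specializations through the "kind of" relation. The hierarchy and degrees below are illustrative, not the chapter's actual data or algorithm.

```python
# Illustrative "kind of" hierarchy: each term points to its direct generalization.
kind_of = {"whole milk": "milk", "skim milk": "milk",
           "milk": "dairy", "yogurt": "dairy"}

def is_kind_of(term, ancestor):
    """True if term equals ancestor or specializes it via 'kind of' links."""
    while term is not None:
        if term == ancestor:
            return True
        term = kind_of.get(term)
    return False

def enlarge(preferences, domain):
    """Give each term of the domain the best degree inherited from a
    preferred ancestor (0.0 when no preferred ancestor exists)."""
    return {t: max((deg for p, deg in preferences.items() if is_kind_of(t, p)),
                   default=0.0)
            for t in domain}

prefs = {"milk": 1.0, "yogurt": 0.5}  # milk as first choice, else yogurt
degrees = enlarge(prefs, ["whole milk", "skim milk", "yogurt", "dairy"])
```

Under this propagation, whole milk and skim milk inherit the full preference given to milk, while the more general term dairy, which specializes nothing preferred, gets degree 0.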
Background

We are concerned in this chapter with the combination of two topics: first, flexible querying of imprecise data, which includes flexible querying techniques, the representation of imprecise data, and the combination of both previous topics in the
framework of the relational model; second, the introduction of fuzziness in hierarchies. We will finish this section by recalling the basics of fuzzy sets required to understand this chapter.
Flexible Querying of Imprecise Data

Flexible Querying Techniques

The classical implicit assumption in database management systems is the closed world assumption: a fact which is not present in the database is assumed to be false. For example, let us consider the following facts stored in a given database: "Whole milk is contaminated by Listeria and Salmonella" and "Skim milk is contaminated by Escherichia coli." The query "Which are the contaminants not present in whole milk?" will retrieve "Escherichia coli." This assumption is problematic because, in real applications, it is often impossible to gather all the available information on a given subject. For example, in the field of risk assessment in food, it is difficult to gather information, in particular because of confidentiality problems. Consequently, it is important to be able to "relax" the closed world assumption in order to consider that the lack of an answer does not mean that the answer is negative, but rather that it is unknown: this corresponds to the open world assumption (OWA). It comes down to considering that a database can be incomplete and that some queries may have an empty answer as a result. To avoid this drawback, one may propose to the user:

•	Querying tools which retrieve information that is semantically close from the database;
•	Models, parameterized with semantically close information found in the database, to estimate lacking information.
The first proposal, which is the one we consider in this chapter, has been studied in two different ways: by the expression of preferences in the selection criteria of a query and by the generalization
of the selection criteria. Those mechanisms permit one to complement an exact answer, potentially empty, with semantically close answers which have been judged pertinent. In the first family of approaches, the querying system does not check if information stored in the database verifies a selection criterion, but to what extent it satisfies the selection criterion. This implies an ordering of the answers. Three kinds of works have been proposed to solve this problem: the use of secondary criteria in Lacroix and Lavency (1987), the definition of similarity distances (Ichikawa & Hirakawa, 1986; Motro, 1988), and the expression of linguistic preferences in Rabitti and Savino (1990). It has been shown (Bosc & Pivert, 1992; Bosc, Lietard, & Pivert, 1994) that all those proposals can be restated in a unique formalism: the expression of preferences by fuzzy sets. This formalism permits the user to distinguish ideal values from acceptable values for a given criterion. A pertinence degree is associated with each answer corresponding to the query: it measures the adequacy degree of the answer to the fuzzy selection criteria of the query. In the second family of approaches, the query is modified to become more general (Motro, 1984). Consequently, the querying system retrieves the exact answers completed by other pertinent answers. In a first category of works, a hierarchy of concepts is used to generalize the query when the answer is empty (Fargues, 1989; Bidault, Froidevaux, & Safar, 2000). In a second category of works, when the selection criterion is expressed by a fuzzy set, several techniques have been proposed to generalize it. Dubois and Prade (1995) propose to use a similarity relation defined on the domain of values.
If the fuzzy set is defined on a numerical domain, Bosc, HadjAli, and Pivert (2004) propose a fuzzy generalization operator using a proximity relation between two values based on the calculus of their quotient. Motivated by applications in the field of food risk, where information is structured according to hierarchical symbolic data, we propose to combine both families of approaches, which are complementary. We will develop this idea in the
Hierarchical Fuzzy Sets to Query Possibilistic Databases
concept of hierarchical fuzzy set presented in the main focus of the chapter.
Representation of Imprecise Data

In the context of database management systems, Codd (1979) has been one of the first to take into account the notion of imprecise datum in the framework of the relational model. He introduced the concept of null value, representing the value of an attribute which is unknown or has no sense in the record where it is stored. Lipski (1979, 1981) has extended Codd's approach, which was binary (complete knowledge or complete ignorance), in order to be able to express partial knowledge. He introduced the notion of plausible values, represented by an exclusive disjunction of possible values. The theory of possibility (Zadeh, 1978) has been used in the framework of the relational model by Prade (1984) and Prade and Testemale (1984) to extend Codd's and Lipski's approaches by introducing an order on the possible values. In our system, we propose a representation of imprecise data in relational databases based on the theory of possibility, close to the representation used in FSQL (Galindo, Medina, Pons, & Cubero, 1998).
Fuzzy Querying in the Framework of the Relational Model

The expression of queries using fuzzy values has already been studied in the framework of the relational database model. Theoretical studies have been proposed to extend the SQL language by introducing fuzzy predicates processed on crisp information (Bosc & Pivert, 1995), and implementations have been proposed, such as the FQUERY97 system (Zadrozny & Kacprzyk, 1998) under the QBE-like Microsoft Access graphical environment and the FSQL system (Galindo et al., 1998) under the Oracle Relational Database Management System (RDBMS). Moreover, as the FSQL system permits the representation of imprecise data, its querying system is able to compare a fuzzy predicate with an imprecise datum. In those previous works, the user
has to build the query flexibility: for instance, in FSQL, the user has to specify in the query whether a fuzzy join or a standard one is used. Those systems are more or less dedicated to computer science specialists, even if the FSQL system, for example, interprets some fuzzy concepts such as "approximate," "interval," or "crisp" in a very understandable way. As we mentioned in the introduction, the aim of our system is to help any user make a fuzzy query against a database schema. This is the reason why we have decided to develop our own fuzzy querying system, MIEL, which will be presented in the main focus of this chapter.
Introducing Fuzziness in Hierarchies

Introducing fuzziness in a hierarchy can be seen in different ways. In our case, the issue is to be able to define an order relation (represented by degrees that express preferences or possibility) on a hierarchically organized set of elements, on which a partial order is thus already defined by the "kind of" relation. The aim is thus to properly define, and reason with, a fuzzy set whose definition domain is a hierarchy. This issue is not trivial, since the degree associated with an element must be coherent with those associated with its sub-elements or super-elements, and the literature is currently lacking on this subject. In the bibliography concerning fuzzy methods, we have identified three main categories of papers which present some similarities with our work; two are quite distant from our concern, and the third one is closer and confirms some of the ideas we propose in this chapter. We can distinguish, especially in recent research:

•	the use of linguistic labels in ontologies. In studies about possibilistic ontologies (Loiseau, Boughanem, & Prade, 2005), each term of an ontology is considered a linguistic label and has an associated fuzzy description. Fuzzy pattern matching between different ontologies is then computed using these fuzzy descriptions. This approach is related to those concerning the introduction of fuzzy attribute values in the object model (Rossazza, Dubois, & Prade, 1998);
•	the use of fuzzy relations between the terms of a thesaurus. Studies about fuzzy thesauri have discussed different natures of relations between concepts, where relations are gradual and moderated by degrees. Fuzzy thesauri have been considered, for instance, in Miyamoto and Nakayama (1986) and De Cock and Nikravesh (2004). In this approach, a query composed of a set of terms is enlarged to similar terms thanks to fuzzy pseudothesauri. Similarity is based on the co-occurrence frequency of terms in a given set of documents;
•	the use of a fuzzy conceptual structure for document indexing and user query expression in the framework of information retrieval (Baziz, Boughanem, Prade, & Pasi, 2006; Boughanem, Pasi, & Prade, 2004). The conceptual structure is hierarchical and encodes the knowledge of the topical domain of the considered documents. In this approach, the evaluation of conjunctive queries is based on the comparison of the minimal subtrees containing the two sets of nodes corresponding to the concepts expressed in the document and the query, respectively.
However, in our context, the terms of the hierarchy and the relations between terms are not fuzzy, contrary to the first two categories of papers. Therefore, we could not draw inspiration from those works to solve the questions we mentioned at the beginning of the introduction. We found more analogies with the third category of papers, where, as in our approach, the terms of the hierarchy and the relations between terms are not fuzzy. Even if the interpretation of the weights is different, and therefore leads to a different evaluation procedure, these authors show that the completion of the fuzzy sets representing the query and the document description, using the "kind of" relation of the conceptual structure, leads to better results. This idea is close to the notion
of generalization of HFS we will present in this chapter.
Basics of Fuzzy Sets

We briefly present fuzzy sets, which will be used in the following to represent the required values in a flexible query or the possible values in an imprecise datum. We also introduce comparisons between fuzzy sets that will be used to compare an imprecise datum to a flexible query. Fuzzy sets (Zadeh, 1965) were introduced to represent concepts that are not strictly delimited, like "young" or "far." Unlike the case of a classic set, an element may belong partially to a fuzzy set.

Definition 1: A fuzzy set A on a domain X is defined by a membership function μA from X to [0, 1] that associates with each element x of X the degree to which x belongs to A. The domain X may be continuous or discrete.

Figure 1 presents two examples: the fuzzy sets ProductPreferences and ResponsibleBacterium. They are also denoted, respectively, 1/Milk + 0.5/Yoghourt and 1/Escherichia coli + 0.7/Shigella, a notation which indicates the degree associated with each element. These fuzzy sets are user-defined, during the choice of the querying selection criteria (ProductPreferences) or during the entry of an imprecise datum (ResponsibleBacterium). In the following, we focus on two different comparisons between fuzzy sets: the inclusion relation, which we use to determine in a binary way whether an imprecise datum is an answer to a flexible query or not, and fuzzy pattern matching, which allows one to determine in a gradual way whether an imprecise datum somehow answers a flexible query. In the most commonly used inclusion relation between fuzzy sets, a fuzzy set A (in our case, an imprecise datum) is included in B (in our case, a flexible query) if its membership function is "below" the membership function of B, that is, if each element that somehow belongs to A belongs at least as much to B. More formally:
Figure 1. The fuzzy sets ProductPreferences and ResponsibleBacterium
Definition 2: Let A and B be two fuzzy sets defined on a domain X. A is included in B (denoted A ⊆ B) if and only if their membership functions μA and μB satisfy the condition: ∀x ∈ X, μA(x) ≤ μB(x).
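As an illustration (our own sketch, not code from the chapter), discrete fuzzy sets can be represented as dictionaries mapping each element to its membership degree, with absent elements implicitly at degree 0; the inclusion test of Definition 2 then becomes a pointwise comparison:

```python
# Our own sketch (not the chapter's implementation): discrete fuzzy sets as
# dicts mapping elements to membership degrees; missing elements have degree 0.

def included(a, b):
    """Definition 2 (Zadeh inclusion): A is included in B iff
    mu_A(x) <= mu_B(x) for every x of the domain."""
    return all(a.get(x, 0.0) <= b.get(x, 0.0) for x in set(a) | set(b))

# The two fuzzy sets of Figure 1:
product_preferences = {"Milk": 1.0, "Yoghourt": 0.5}
responsible_bacterium = {"Escherichia coli": 1.0, "Shigella": 0.7}

print(included({"Milk": 0.4}, product_preferences))  # True
print(included(product_preferences, {"Milk": 0.4}))  # False
```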
In fuzzy pattern matching (Dubois & Prade, 1995), two scalar measures are classically used to evaluate the compatibility between an imprecise datum and a flexible query: (1) a possibility degree of matching (Zadeh, 1978) and (2) a necessity degree of matching (Dubois & Prade, 1988).

Definition 3: Let Q and D be two fuzzy sets defined on a domain X, representing respectively a flexible query and an imprecise datum. D is compatible with Q with the possibility degree Π(Q, D) and the necessity degree N(Q, D):

•	the possibility degree of matching between Q and D, denoted Π(Q, D), is an "optimistic" degree of overlapping that measures the maximum compatibility between Q and D, and is defined by Π(Q, D) = sup_{x∈X} min(μQ(x), μD(x));
•	the necessity degree of matching between Q and D, denoted N(Q, D), is a "pessimistic" degree of inclusion that estimates the extent to which it is certain that D is compatible with Q, and is defined by N(Q, D) = inf_{x∈X} max(μQ(x), 1 − μD(x)).
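Definition 3 can be sketched with the same dictionary encoding of fuzzy sets (our own illustration, not the chapter's code):

```python
# A sketch of Definition 3 (not the chapter's code): fuzzy sets as dicts
# from element to degree, with degree 0 for absent elements.

def possibility(q, d):
    """Pi(Q, D) = sup_x min(mu_Q(x), mu_D(x)) -- optimistic overlap."""
    xs = set(q) | set(d)
    return max((min(q.get(x, 0.0), d.get(x, 0.0)) for x in xs), default=0.0)

def necessity(q, d):
    """N(Q, D) = inf_x max(mu_Q(x), 1 - mu_D(x)) -- pessimistic inclusion.
    Elements outside both supports contribute max(0, 1) = 1, so iterating
    over the union of the supports is enough."""
    xs = set(q) | set(d)
    return min((max(q.get(x, 0.0), 1.0 - d.get(x, 0.0)) for x in xs), default=1.0)

query = {"Milk": 1.0, "Yoghourt": 0.5}  # flexible query Q
datum = {"Milk": 0.6, "Cheese": 0.4}    # imprecise datum D
print(round(possibility(query, datum), 6))  # 0.6
print(round(necessity(query, datum), 6))    # 0.6
```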
Main Focus of the Chapter

In the first part of this section, we introduce the concept of hierarchical fuzzy set. Then, in the second part, we present the MIEL flexible querying system, which uses the concept of HFS. In the third part, we present two applications that rely on the MIEL querying system, including examples of Graphical User Interfaces (GUIs) illustrating use cases of the MIEL querying system and experimental results.
Hierarchical Fuzzy Set

In this part, we propose a definition of a hierarchical fuzzy set (HFS) and specify its semantics. Then, we explain why and how we compute the closure of a HFS. Next, we extend to HFS the comparison operations between fuzzy sets recalled in the background section. We show that the notion of closure makes it possible to group hierarchical fuzzy sets into equivalence classes and that each equivalence class has a unique representative, called minimal. Finally, we propose a method of generalization of a HFS based on the minimal HFS.
Definition and Semantics

For a given selection attribute whose domain of values is hierarchized, users always express their preferences on a subset of the domain. Indeed, they only choose the elements they are interested in and implicitly consider that: (1) the elements more specific than those they have chosen must
Figure 2. Example of a hierarchy
be taken into account by the system and (2) the other elements must not be taken into account (the noncomparable elements, for example). In the following, we say that an element elt of the domain is more general than an element elt' (denoted by elt' ≤ elt) if elt' is a predecessor of elt in the partial order induced by the hierarchy. An example of such a hierarchy is given in Figure 2 (for instance, Meat ≤ Substrate). A hierarchical fuzzy set is then defined as follows:

Definition 4: A hierarchical fuzzy set is a fuzzy set whose definition domain is a subset of the elements of a finite hierarchy partially ordered by the "kind of" relation.

Figure 3. Example of a HFS

Example 1: The example of HFS shown in Figure 3 has for definition domain the set of elements {Whole milk, Half-skim milk, Skim milk}, which is a subset of the set of the elements belonging to
the hierarchy presented in Figure 2. This HFS may also be noted 1.0/Whole milk + 0.9/Half-skim milk + 0.8/Skim milk. We can note that no restriction has been imposed concerning the elements that compose the definition domain of a hierarchical fuzzy set. In particular, the user may associate a given degree d with an element elt and another degree d' with an element elt’ more specific than elt. d' ≤ d represents a semantic of restriction for elt’ compared to elt, whereas d' ≥ d represents a semantic of reinforcement for elt’ compared to elt. For example, if there is a particular interest in Skim milk because the user studies the properties of low fat products, but also wants to retrieve complementary information about other kinds of milk, these preferences can be expressed using, for instance, the following fuzzy set: 1/Skim milk + 0.5/Milk. In this example, the element Skim milk has a greater degree than the more general element Milk, which corresponds to a semantic of reinforcement for Skim milk compared to Milk. On the contrary, if the user is interested in all kinds of milk, but to a lesser extent in Condensed milk because of its smaller water content, the preferences can be expressed using the following fuzzy set: 1/Milk + 0.2/Condensed milk. In this case, the element Condensed milk has a smaller degree than the more general element Milk, which corresponds to a semantic of restriction for Condensed milk compared to Milk.
Closure

We can make two remarks concerning the use of hierarchical fuzzy sets:

•	The first one is semantic. Let 1/Skim milk + 0.5/Milk be an expression of preferences in a query. We can note that this hierarchical fuzzy set implicitly gives information about elements of the hierarchy other than Skim milk and Milk. For instance, one can deduce that the user does not expect results concerning products like meat or vegetable, even if the degree 0 has not explicitly been associated with these products. One may also assume that any kind of skim milk (sterilized, pasteurized, raw skim milk, for example) interests the user with the degree 1.
•	The second one is operational. The problem arising from Definition 4 is that two different fuzzy sets on the same hierarchy do not necessarily have the same definition domain, which means they cannot be compared using the classic comparison operations of fuzzy set theory (see Definitions 2 and 3). For example, 1/Skim milk + 0.5/Milk and 1/Milk + 0.2/Condensed milk are defined on two different subsets of the hierarchy of Figure 2 and thus are not comparable.
These remarks led us to introduce the concept of closure of a hierarchical fuzzy set, which is a developed form defined on the whole hierarchy. Intuitively, in the closure of a hierarchical fuzzy set, the "kind of" relation is taken into account by propagating the degree associated with an element to its sub-elements (more specific elements) in the hierarchy. For instance, in a query, if the user is interested in the element Milk, we consider that all kinds of Milk (Whole milk, Skim milk, Pasteurized milk, and so forth) are of interest. On the contrary, we consider that the super-elements (more general elements) of Milk in the hierarchy (Milk product, Substrate, and so forth) are too general to be relevant for the user's query.
Definition 5: Let F be a hierarchical fuzzy set defined on a subset D of the elements of a hierarchy H. Its membership function is denoted μF. The closure of F, denoted clos(F), is a hierarchical fuzzy set defined on the whole set of elements of H, and its membership function μclos(F) is defined as follows. For each element elt of H, let Eelt = {elt1, ..., eltn} be the set of the closest super-elements of elt in D (in the broad sense, i.e., elti ≥ elt):

•	if Eelt is not empty, μclos(F)(elt) = max_{1≤i≤n} μF(elti);
•	otherwise, μclos(F)(elt) = 0.

In other words, the closure of a hierarchical fuzzy set F is built according to the following rules. For each element elt of H:

•	if elt belongs to F, then elt keeps the same degree in the closure of F (case where Eelt = {elt});
•	if elt has a unique smallest super-element elt1 in F, then the degree associated with elt1 is propagated to elt in the closure of F (case where Eelt = {elt1} with elt1 > elt);
•	if elt has several smallest super-elements {elt1, ..., eltn} in F, with different degrees, a choice has to be made concerning the degree that will be associated with elt in the closure. The proposition made in Definition 5 consists in choosing the maximum of the degrees associated with {elt1, ..., eltn}. This choice is discussed in the following;
•	all the other elements of H, that is, those that are more general than, or not comparable with, the elements of F, are considered irrelevant. The degree 0 is associated with them (case where Eelt = ∅).
Example 2: Figure 4 shows an example of closure presented on the hierarchy. The elements of the HFS and their associated membership degrees appear in bold italic.
Figure 4. Closure of the hierarchical fuzzy set 0.8/Milk + 1/Whole milk + 0.3/Condensed milk
In the hierarchical fuzzy set 0.8/Milk + 1/Whole milk + 0.3/Condensed milk of Figure 4, the user has associated the degree 1 with Whole milk but only 0.3 with Condensed milk. The maximum of these two degrees is thus associated with their common sub-element Condensed whole milk in the closure. The case of Sweetened condensed milk is different: the user has associated the degree 0.8 with Milk but has given a restriction on the more specific element Condensed milk (degree 0.3). As Sweetened condensed milk is a kind of Condensed milk, it inherits the degree associated with Condensed milk, that is, 0.3. In the case where an element elt of the hierarchy, which does not appear in the initial hierarchical fuzzy set, has several smallest super-elements that appear in the hierarchical fuzzy set with different degrees, associating the maximum of these degrees with elt in the closure is a choice that may be discussed. We distinguish two cases:

•	if the hierarchical fuzzy set expresses preferences in a query, the choice of the maximum allows us not to exclude any possible answer (the possibility and the necessity degrees of matching can be higher). In real cases, the lack of answers to a query generally makes this choice preferable, because it consists in enlarging the query rather than restricting it. This is actually the case in our project;
•	if the hierarchical fuzzy set represents an ill-known datum, the choice of the maximum allows us to preserve all the possible values of the datum, but it also makes the datum less specific. We chose this solution in order to homogenize the treatment of queries and data. In a way, it also participates in enlarging the query, as a less specific datum may share more common values with the query (the possibility degree of matching can thus be higher, although the necessity degree can decrease).
We have shown in Thomopoulos, Buche, and Haemmerlé (2006) that computing the closure clos(F) of a fuzzy set F defined on a domain dom(F) ⊂ H has a complexity in O(|H|·|dom(F)|²), provided that the comparison of two elements of the hierarchy can be done in constant time. Generally, the definition domain of F is limited to a few elements, so that the actual computing time remains moderate. The closure operation has been implemented in the MIEL querying system. For a given query, MIEL computes the closures of the HFS associated with the selection attributes before submitting the query to the RDBMS.
Comparisons of HFS

The introduction of the concept of closure allows all the fuzzy sets that are defined on a given hierarchy to have the same definition domain (the whole hierarchy) and thus to be compared using the classical comparison operations between fuzzy sets.

Definition 6: Let F1 and F2 be two hierarchical fuzzy sets defined on the same hierarchy. Then:

1.	F1 ⊆ F2 if clos(F1) ⊆ clos(F2);
2.	the possibility degree of matching between F1 and F2, Π(F1, F2), is defined as Π(clos(F1), clos(F2));
3.	the necessity degree of matching between F1 and F2, N(F1, F2), is defined as N(clos(F1), clos(F2)).
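Point 1 of Definition 6 can be checked on the fuzzy sets of Example 3 (a self-contained sketch with our own helper names; the closure computation of Definition 5 is restated so the snippet runs on its own):

```python
# Hypothetical fragment of the Figure 2 hierarchy (child -> direct supers);
# the encoding and helper names are ours, not the chapter's implementation.
PARENTS = {"Substrate": [], "Milk": ["Substrate"],
           "Skim milk": ["Milk"], "Whole milk": ["Milk"],
           "Condensed milk": ["Milk"]}

def supers(elt):
    seen, stack = set(), [elt]
    while stack:
        e = stack.pop()
        if e not in seen:
            seen.add(e)
            stack.extend(PARENTS[e])
    return seen

def closure(f):
    """Definition 5: degree of elt = max over its closest super-elements in dom(F)."""
    out = {}
    for elt in PARENTS:
        cands = [d for d in f if d in supers(elt)]
        closest = [c for c in cands
                   if not any(o != c and c in supers(o) for o in cands)]
        out[elt] = max((f[c] for c in closest), default=0.0)
    return out

def hfs_included(f1, f2):
    """Definition 6(1): F1 is included in F2 iff clos(F1) is included in clos(F2)."""
    c1, c2 = closure(f1), closure(f2)
    return all(c1[x] <= c2[x] for x in PARENTS)

f1 = {"Skim milk": 1.0, "Milk": 0.2}
f2 = {"Milk": 1.0, "Condensed milk": 0.5}
print(hfs_included(f1, f2))  # True, as in Example 3
print(hfs_included(f2, f1))  # False
```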
Example 3: We present in Figure 5 the closures of the hierarchical fuzzy sets 1/Skim milk + 0.2/Milk and 1/Milk + 0.5/Condensed milk. The elements of
the HFS and their associated membership degrees appear in bold italic. Their comparison shows that 1/Skim milk + 0.2/Milk is included in 1/Milk + 0.5/Condensed milk because the membership function of the former associates lower degrees with every element of the hierarchy.
Minimal HFS

In the previous section, we saw that each hierarchical fuzzy set has an associated closure that is defined on the whole hierarchy. We now focus on the fact that two different hierarchical fuzzy sets, defined on the same hierarchy, can have the same closure, as in the following examples. The hierarchical fuzzy sets Substrate1 = 1/Milk and Substrate2 = 1/Milk + 1/Skim milk have the same closure: the degree 1 is associated with Milk and every more specific element, and the degree 0 is associated with all the other elements of the hierarchy.
Figure 5. The closures of the hierarchical fuzzy sets 1/Skim milk + 0.2/Milk (upper part) and 1/Milk + 0.5/Condensed milk (lower part)
Figure 6. Common closure of the hierarchical fuzzy sets Substrate3 and Substrate4
The hierarchical fuzzy sets Substrate3 = 1/Milk + 0.8/Whole milk + 1/Pasteurized milk and Substrate4 = 1/Milk + 0.8/Whole milk + 1/Whole pasteurized milk also have the same closure, represented in Figure 6. Such hierarchical fuzzy sets form equivalence classes with respect to their closures. We can note that Substrate2 contains the same element as Substrate1, with the same degree, and also one more element (Skim milk, with the degree 1). The degree associated with this additional element is the same as in the closure of Substrate1. We say that the element Skim milk is deducible in Substrate2.

Definition 7: Let F be a hierarchical fuzzy set, with dom(F) = {elt1, ..., eltj, ..., eltn}, and F−j the fuzzy set resulting from the restriction of F to the domain dom(F) \ {eltj}. eltj is deducible in F if μclos(F−j)(eltj) = μF(eltj).

As a first intuition, we could say that removing a deducible element from a hierarchical fuzzy set allows one to eliminate redundant information. But an element being deducible in F does not necessarily mean that removing it from F will have no consequence on the closure: removing elt from F
will not impact the degree associated with elt itself in the closure, but it may impact the degrees of the sub-elements of elt in the closure. For instance, the element Pasteurized milk is deducible in Substrate3, according to Definition 7. Removing 1/Pasteurized milk from Substrate3 would not modify the degree of Pasteurized milk itself in the resulting closure, but it would modify the degree of its sub-element Whole pasteurized milk (which would have the degree 0.8 instead of 1). Thus, this remark leads us to the following definition of a minimal hierarchical fuzzy set.

Definition 8: In a given equivalence class (that is, for a given closure C), a hierarchical fuzzy set is said to be minimal if its closure is C and if none of the elements of its domain is deducible (here the term "minimal" does not refer to cardinality).

Example 4: The hierarchical fuzzy sets Substrate1 and Substrate4 are minimal (none of their elements is deducible), contrary to Substrate2 and Substrate3.

We propose an algorithm, given in Exhibit 1, to calculate a minimal hierarchical fuzzy set.

Exhibit 1. Calculation of a minimal fuzzy set mnl having a given closure C

Begin
	mnl ← ∅
	If clos(mnl) = C Then
		stop (case where C is the hierarchical fuzzy set that associates the degree 0 with every element of the hierarchy)
	Else
		let lin be an order such that each element of the hierarchy is examined after its super-elements (that is, a linear extension of the opposite of the order induced by the "kind of" relation)
		Repeat
			elt ← next element according to lin
			If μclos(mnl)(elt) ≠ μC(elt) Then
				mnl ← mnl ∪ {elt}
				μmnl(elt) ← μC(elt)
			Endif
		Until clos(mnl) = C
	Endif
End

We have proven in Thomopoulos et al. (2006) that the stopping condition of this algorithm is always reached and that the HFS obtained with this algorithm is minimal. Computing the minimal fuzzy set mnl of a given closure C defined on a hierarchy H has a complexity in O(|H|·|dom(mnl)|²). Moreover, we have also proven in Thomopoulos et al. (2006) that the minimal HFS is unique for a given closure.
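Exhibit 1 can be sketched in Python on a small hypothetical hierarchy (our own encoding, not the chapter's code; the closure function restates Definition 5 so the snippet is self-contained):

```python
# A sketch of Exhibit 1 on a toy hierarchy (child -> direct super-elements).
PARENTS = {"Substrate": [], "Milk": ["Substrate"],
           "Skim milk": ["Milk"], "Whole milk": ["Milk"]}

def supers(elt):
    seen, stack = set(), [elt]
    while stack:
        e = stack.pop()
        if e not in seen:
            seen.add(e)
            stack.extend(PARENTS[e])
    return seen

def closure(f):
    """Definition 5, restated so this snippet runs on its own."""
    out = {}
    for elt in PARENTS:
        cands = [d for d in f if d in supers(elt)]
        closest = [c for c in cands
                   if not any(o != c and c in supers(o) for o in cands)]
        out[elt] = max((f[c] for c in closest), default=0.0)
    return out

def minimal(c):
    """Exhibit 1: rebuild the unique minimal HFS whose closure is c, scanning
    elements so that every super-element is examined before its sub-elements."""
    order, left = [], set(PARENTS)
    while left:  # a linear extension: parents before children
        e = next(x for x in sorted(left)
                 if all(p not in left for p in PARENTS[x]))
        order.append(e)
        left.discard(e)
    mnl = {}
    for elt in order:
        if closure(mnl) == c:  # stopping condition of Exhibit 1
            break
        if closure(mnl)[elt] != c[elt]:
            mnl[elt] = c[elt]
    return mnl

# 1/Milk + 1/Skim milk contains the deducible element Skim milk;
# its minimal representative is 1/Milk.
print(minimal(closure({"Milk": 1.0, "Skim milk": 1.0})))  # {'Milk': 1.0}
```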
Generalization of a HFS

Using a HFS representing preferences in a query does not guarantee retrieving an adequate number of answers. A complementary solution to retrieve pertinent answers in addition to the exact answers consists in generalizing the HFS. The approaches proposed in the bibliography to generalize fuzzy sets on flat domains, already presented in the background section, are not well adapted to HFS. Some approaches only concern fuzzy sets defined on a numerical domain (Bosc et al., 2004; Bouchon-Meunier & Yao, 1992). Tolerant fuzzy pattern matching (Dubois & Prade, 1995) uses a
similarity relation between elements to enlarge the preferences, but it does not take into account the case of hierarchically organized domains. For instance, elements may be added to the support of the fuzzy set in the enlargement mechanism, while elements more specific than those may remain outside of it, which is a major drawback for hierarchical domains (see Buche, Dervin, Haemmerlé, & Thomopoulos, 2005, for more details). In this section, rather than a unique solution, we propose a methodology to generalize a hierarchical fuzzy set expressing preferences. First, we define an elementary generalization operation of a HFS. Then, we introduce the notion of generalization rule, which permits parameterizing the generalization of a HFS using several criteria. Finally, we propose a generalization operation which applies several elementary generalizations iteratively.

Elementary generalization of a HFS: The elementary generalization of a HFS consists in
creating, given a hierarchical fuzzy set F, a more general hierarchical fuzzy set Fg, with the meaning of the inclusion relation extended to HFS. The proof of this property can be found in Thomopoulos et al. (2006).

Definition 9: The elementary generalization of a HFS F is an operation that creates from F a hierarchical fuzzy set Fg obtained by adding a super-element of an element elt of dom(F), denoted eltg, with a given membership degree dg. The element eltg must satisfy the following condition: eltg may neither be an element of dom(F) nor be more specific than any element of dom(F).

Example 5: Let F be the following hierarchical fuzzy set: F = 1/Condensed whole milk + 0.5/Cheese. For elt = Condensed whole milk, we consider in the hierarchy of Figure 2 the super-element eltg = Milk and dg = 0.2. We obtain: Fg = 1/Condensed whole milk + 0.5/Cheese + 0.2/Milk.

Generalization rule: We consider that the generalization of a HFS F essentially depends on three parameters: (1) which elements of F will be generalized, and in which order; (2) for a given element of F, which super-elements will be considered for the generalization; and (3) how the membership degree associated with such a super-element is determined. A generalization rule determines those three parameters.

Definition 10: A generalization rule Rg is a 3-tuple (ord, gen, calc), where:

•	ord is a total traversal order through the elements of a hierarchical fuzzy set F defined on a hierarchy H;
•	gen is a mapping that associates a set of more general elements in H with each element elt in dom(F);
•	calc is a mapping that associates a degree between 0 and 1 with each pair (elt, eltg) such that elt ∈ dom(F) and eltg ∈ gen(elt).
Example 6:

•	ord may be, for instance, an order through the elements of F by decreasing degrees. This choice allows one to generalize in priority the elements of F that have the highest degrees, that is, the elements for which the user has expressed the highest preference;
•	gen(elt) may be, for instance, the set of smallest super-elements of elt in the hierarchy; this choice permits minimizing the risk of obtaining too general answers;
•	calc(elt, eltg) = (min {μF(x) | x ∈ dom(F), μF(x) > 0}) × μF(elt) × 0.9 is an example of a mapping that permits retrieving in priority the elements specified by the user.
Each element of F does not necessarily have a more general element that may be added to F by the generalization operation: as we saw previously in Definition 9, this more general element must satisfy a condition. Hence we define the notion of generalizable element of F, according to a given generalization rule.

Definition 11: Let F be a hierarchical fuzzy set. An element elt of dom(F) is said to be generalizable in F, according to a generalization rule Rg, if elt has a more general element eltg in gen(elt) that satisfies the condition: eltg may neither be an element of dom(F) nor be more specific than any element of dom(F).

Generalization of a HFS: As we saw in the section titled Closure, the MIEL querying system computes the closures of the HFS belonging to a query before submitting it to the RDBMS. Consequently, two queries using two different HFS which belong to the same equivalence class retrieve the same answer. In order to preserve this property when MIEL performs the generalization of a HFS, this operation is not directly processed on the HFS, but on the minimal HFS, the unique representative of the equivalence class to which the HFS belongs.
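The calc mapping of Example 6 is easy to check numerically (a sketch with an assumed dict representation of F; it reproduces the degrees used in Example 7):

```python
# A numeric check of the calc mapping from Example 6 (the dict encoding of F
# is our own assumption): calc multiplies the smallest positive degree of F
# by the degree of the element being generalized, damped by 0.9.

def calc(f, elt):
    return min(v for v in f.values() if v > 0) * f[elt] * 0.9

f = {"Whole milk": 1.0, "Half skim milk": 0.8, "Yoghourt": 0.2}
print(round(calc(f, "Whole milk"), 3))  # 0.18, the degree given to Milk in Example 7
print(round(calc(f, "Yoghourt"), 3))    # 0.036, the degree given to Milk product
```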
Definition 12: The generalization of a hierarchical fuzzy set F, according to a generalization rule Rg, denoted gen(F), is an operation that provides a hierarchical fuzzy set Fg obtained as follows:

•	we call 0-degree generalization of F, denoted F0, the minimal fuzzy set that is equivalent to F;
•	let Fn be the n-degree generalization of F: if there exists an element elt, first element (with the meaning of the order ord) of dom(F0) ⊆ dom(Fn), generalizable in Fn according to Rg, then Fn+1 is obtained by an elementary generalization of Fn according to Rg, in which elt is that first element of dom(F0) ⊆ dom(Fn) generalizable in Fn and dg = calc(elt, eltg); if not, the generalization of F is the fuzzy set Fg = Fn.
Example 7: Let Rg be the generalization rule proposed in Example 6 and F the following hierarchical fuzzy set: F = 1/Whole milk + 1/Condensed whole milk + 0.8/Half skim milk + 0.2/Yoghourt.
We have proven in Thomopoulos et al. (2006) that the number of iterations of the generalization operation is finite and that the fuzzy set Fg obtained by generalization is more general than F with the meaning of the inclusion relation extended to HFS.
MIEL Querying System

In this section, we present the MIEL flexible querying system, which uses the concept of HFS. The MIEL graphical user interface allows the users to specify a query. Such a query is expressed in a view (selected by the user from a list of available views). The users also specify in the query a set of projection attributes and a set of selection criteria. Then the MIEL user interface sends the MIEL query to the relational subsystem. The relational subsystem adapts the query to the formalism it uses (an SQL query), then asks the RDBMS query processor to execute the query. Finally, the answers to the query are returned to the MIEL interface, which presents them to the users. First, we present the choices we made in the design of the MIEL data model. Then, we successively present the MIEL query language and the MIEL query processing.
• F0, the minimal fuzzy set that is equivalent to F, is the following: F0 = 1/Whole milk + 0.8/Half skim milk + 0.2/Yoghourt;
• the first generalizable element of F0, in the order ord, is Whole milk, as calc(Whole milk, Milk) = min{x∈dom(F), F(x)>0} F(x) × F(Whole milk) × 0.9 = 0.2 × 1.0 × 0.9 = 0.18; the generalization provides F1 = 1/Whole milk + 0.8/Half skim milk + 0.2/Yoghourt + 0.18/Milk;
• the first element of dom(F0) generalizable in F1 is Yoghourt, as calc(Yoghourt, Milk product) = 0.2 × 0.2 × 0.9 = 0.036; the generalization provides F2 = 1/Whole milk + 0.8/Half skim milk + 0.2/Yoghourt + 0.18/Milk + 0.036/Milk product;
• there is no element of dom(F0) generalizable in F2, so Fg = F2.

MIEL Data Model

The MIEL data model is composed of an abstract data model, called the ontology, and several concrete data models which depend on the actual data model chosen to store the data (RDB or other formalisms). In this chapter, we are only concerned with the ontology and with the RDB concrete data model, that is, the RDB schema.

Ontology of the MIEL Data Model: The ontology contains the knowledge of the domain used by the MIEL system. The basic notion of the ontology is the concept of attribute, which must be understood in its classic database meaning. In order to take into account the imprecision of the values stored in the data of the MIEL system, we propose to use, instead of crisp values, imprecise values expressed as possibility distributions represented by fuzzy sets (see Zadeh, 1978). A variation domain and a definition domain are associated with each attribute. The variation domain corresponds to the universe of discourse; the definition domain is the set of fuzzy sets which can be defined on the variation domain: it corresponds to the actual domain in the classical database meaning.

Definition 13: A is the finite set of attributes of the MIEL data model. Each attribute a ∈ A is characterized by its type Type(a), its variation domain domv(a), and its definition domain dom(a). The type Type(a) of an attribute a can be numerical, symbolic, or hierarchized. Depending on its type, the variation domain domv(a) of an attribute a is:

• if Type(a) is numerical, domv(a) is defined as a subset of ℜ, the set of the real values;
• if Type(a) is symbolic, domv(a) is defined as a set of symbolic constants;
• if Type(a) is hierarchized, domv(a) is defined as a set of symbolic constants together with a partial order defined on it.

Figure 7. A part of the variation domain of the attribute Substrate [hierarchy: Substrate covers Milk and Meat; Milk covers Pasteurized milk, Whole milk, Skim milk, and Half skim milk; Pasteurized milk and Whole milk cover Pasteurized whole milk; Meat covers Beef, Poultry, and Pork]
Figure 8. Two examples of imprecise values [pH_value: a trapezoidal fuzzy set on domv(pH) with characteristic points 4, 5, 7, 9; Substrate_value: a discrete fuzzy set assigning degrees 1, 0.7, and 0.5 to Milk, Whole Milk, and Skim Milk]
In all cases, dom(a) is defined as the set of all the possible fuzzy sets on domv(a).

Definition 14: The value of an attribute a belongs to dom(a) and is denoted t(a). It is a map π of domv(a) to [0,1]. We denote π(x) the degree of possibility that the effective value of a is x (x ∈ domv(a)).

Example 8: The variation domain of the numerical attribute pH is the interval [0,14] on ℜ. The variation domain of the symbolic attribute Author could be the set {S.Ajjarapu, C.P.Rivituso, M.Zwietering}. A part of the variation domain of the hierarchized attribute Substrate is represented in Figure 7. The value pH_value of Figure 8 schematizes an example of value for the attribute pH (that value belongs to dom(pH): it is a map of domv(pH) into [0,1]). The value Substrate_value of Figure 8 schematizes an example of value for the attribute Substrate (that value belongs to dom(Substrate): it is a map of domv(Substrate) into [0,1]; the elements of domv(Substrate) having a degree equal to 0 are not represented).

Note that in our data model, we consider that all the values are imprecise values. The case of a crisp value for an attribute a is a particular case of an imprecise value, such that ∃x ∈ domv(a) [π(x) = 1 and ∀y ≠ x, π(y) = 0]. For simplicity, and since it corresponds to the application needs, we chose to limit the representation of numerical values to trapezoidal functions in the actual database. These trapezoidal functions are stored by means of four characteristic points defining the limits of the support and the kernel of the fuzzy set. In the example of Figure 8, these four characteristic points are [4, 5, 7, 9].

Schema of the relational database: In the following, we do not present in detail the relational database schema (which is a classic RDB schema), but we focus on the choices we have made in order
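The trapezoidal representation can be made concrete with a small Python sketch: a numerical fuzzy value is stored as its four characteristic points and its membership map π is reconstructed on demand (the function names are illustrative, not part of the MIEL system).

```python
# Minimal sketch of the trapezoidal representation of numerical values:
# (MinSupp, MinKer, MaxKer, MaxSupp) are the four characteristic points
# delimiting the support and the kernel of the fuzzy set.

def trapezoidal(min_supp, min_ker, max_ker, max_supp):
    def pi(x):
        if x < min_supp or x > max_supp:
            return 0.0          # outside the support
        if min_ker <= x <= max_ker:
            return 1.0          # inside the kernel
        if x < min_ker:         # rising edge of the trapezoid
            return (x - min_supp) / (min_ker - min_supp)
        return (max_supp - x) / (max_supp - max_ker)  # falling edge
    return pi

# pH_value of Figure 8, with characteristic points [4, 5, 7, 9]
ph_value = trapezoidal(4, 5, 7, 9)
print(ph_value(6))    # 1.0: 6 lies in the kernel [5, 7]
print(ph_value(4.5))  # 0.5: halfway up the rising edge [4, 5]
print(ph_value(8))    # 0.5: halfway down the falling edge [7, 9]
```

A crisp value is the degenerate case where the four points coincide, e.g. trapezoidal(6, 6, 6, 6), matching the second row of Figure 9.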
Figure 9. The upper table presents an example of relation referencing numerical fuzzy values. The lower table contains a part of the relation NumericalFuzzySet which stores the actual numerical fuzzy sets (the second row corresponds to a crisp value in this example).

ExpeId | Substrate | FuzzyPHId
10 | Pork | 200
11 | Skim Milk | 231

FuzzySetId | MinSupp | MinKer | MaxKer | MaxSupp
200 | 4 | 5 | 6 | 7
231 | 6 | 6 | 6 | 6
to map the ontology of the MIEL data model presented previously onto the RDB schema. We present how the attributes and their variation domains are represented in the RDB. We successively consider the way of representing an attribute belonging to the ontology of the MIEL data model, when that attribute is respectively of type numerical, symbolic, or hierarchized.
Representation of a numerical attribute: The representation of a value of a numerical attribute in the relational schema is done by means of a row of an additional table which contains the unique identifier of the numerical fuzzy set and four attributes which correspond to the four characteristic points of the trapezoidal function. Existing techniques (see Galindo, Urrutia, & Piattini, 2006) could be used to represent other kinds of values (such as ranges, unknown, undefined, etc.).

Example 9: The tables in Figure 9 present an example of a numerical attribute represented in the relational database.

Representation of a symbolic attribute: The representation of a value of a symbolic attribute a of A in the relational schema is done by means of one or several rows of an additional table which contains three columns: the unique identifier of the fuzzy set, an element of domv(a), and its associated membership degree in that fuzzy set. Recall that a fuzzy set on a symbolic variation domain is defined as a set of pairs (element, degree). In addition, the variation domain of each attribute a of A of type symbolic used in the relational schema is stored in a reference table which contains all the possible values that compose domv(a).

Example 10: The tables in Figure 10 present an example of the value of a symbolic attribute represented in the relational database.

Representation of a hierarchized attribute: The representation of a value of a hierarchized attribute of A in the relational schema is done in exactly the same way as the representation of a symbolic attribute (see above). The variation domain of each attribute a of A of type hierarchized used in the relational schema is stored in two specific tables: a table which contains all the possible values that compose domv(a) and a table which contains all the pairs {vi, vj} of the cover relation of the partial order of domv(a).

Example 11: The tables presented in Figure 11 are partial instances of relations Ref_Substrate and Hier_Substrate describing the hierarchized variation domain for substrates in the relational schema.

When an attribute is known to hold only "crisp" values (for example, the substrate in Figure 9), the database designers have used classic database attributes of type real, integer, or string instead of fuzzy values. As presented above, the ontology of the MIEL data model is stored in specific tables of the relational database schema, because we have to enforce referential integrity control on the data we store.

Figure 10. The upper table presents an example of relation referencing symbolic fuzzy values. The bottom table contains a part of the relation SubstrateOrigin which contains symbolic fuzzy sets (only one fuzzy set is represented in this example).

Substrate | FuzzyOriginId
Pork | 100

FuzzyOriginId | Country | Degree
100 | USA | 1
100 | Germany | 1
100 | France | 0.8

Figure 11. The upper table presents a part of relation Ref_Substrate. The bottom table presents a part of relation Hier_Substrate.

Substrate
Milk
Full milk
Pasteurized milk
Pasteurized full milk

SubstrateSup | SubstrateInf
Milk | Full milk
Milk | Pasteurized milk
Pasteurized milk | Pasteurized full milk
Full milk | Pasteurized full milk

Flexible Query Language

A query asked on the MIEL system is expressed in the MIEL query language through the MIEL graphical user interface. In the following, we present the notions we use in a way close to domain relational calculus (Ullman, 1988). We use this query language in a data integration system in which data are stored using different models (the relational model, the conceptual graph model, or the XML model presented in the section titled Future Trends). This is the reason why a MIEL query is always expressed in a view. It provides two advantages: (1) the user always interacts with a unique querying language and does not need to know which models are used to store the data; and (2) the data integration system is extensible: to add a new data model, one only has to design a new mediator which translates a MIEL query expressed in a view into a query adapted to the data model.

The notion of view: A view is a usual notion in relational databases: it is a virtual table built from the actual tables of the relational database schema by means of a query. In the MIEL system, a set of views (which are prewritten queries) is proposed to the users in order to hide the complexity of the database schema.

Definition 15: A view V on n (n > 0) queryable attributes a1, …, an of the MIEL ontology is defined by V = {a1, …, an | PV(a1, …, an)}, where PV is a predicate which characterizes the construction of the view.

Example 12: The view OneFactorExperience is defined on five attributes:
OneFactorExperience = {Substrate, PathogenicGerm, pH, Factor, ResponseType | POneFactorExperience(Substrate, PathogenicGerm, pH, Factor, ResponseType)}. The predicate POneFactorExperience defines the way the attributes involved in the view are linked together. That view characterizes the result of experimentations in which only one factor is controlled (for example, the temperature). The response type can be, for example, the growth speed of a pathogenic germ in a given substrate.
Expression of a query: A query in the MIEL system is a specialization of a given view by the end user, who specifies a set of projection attributes as a subset of the queryable attributes of the view and a set of conjunctive selection criteria on some other attributes.

Definition 16: A query Q asked on a view V is defined by:

Q = {a1, …, ap | ∃aq+1, …, an (PV(a1, …, an) ∧ (ap+1 ≈ vp+1) ∧ … ∧ (aq ≈ vq))}

where PV is the predicate which characterizes the view V, a1, …, ap are the projection attributes, ap+1, …, aq are the selection attributes and vp+1, …, vq their respective values given as selection values by the user, gp+1, …, gq are Boolean values specifying whether the fuzzy sets must be generalized or not, and (Πmin, Nmin) ∈ [0,1]² is its minimum possibility and necessity degrees. The attributes aq+1, …, an are the queryable attributes of the view which are not used in that query.

In the previous definition, the comparator ≈ stands for "approximately equal" and will be interpreted in the answer by the two classical scalar measures used in fuzzy pattern matching (Dubois & Prade, 1995) to evaluate the compatibility between an imprecise datum and a fuzzy set representing the selection values.

Example 13: The query Q is expressed in the view OneFactorExperience:

Q = {Substrate, PathogenicGerm, pH, Factor, ResponseType | POneFactorExperience(Substrate, PathogenicGerm, pH, Factor, ResponseType) ∧ (Substrate ≈ SubstratePreferences) ∧ (pH ≈ pHPreferences)}

where the fuzzy sets SubstratePreferences and pHPreferences are presented in Figure 12 and their associated Boolean values in the query are false (no generalization). The minimum possibility and necessity degrees are set to Πmin = 0.8 and Nmin = 0.0.

Figure 12. Preferences expressed by the user [pHPreferences: the trapezoidal fuzzy set with characteristic points 4, 5, 6, 7; SubstratePreferences: degrees 1, 0.9, and 0.8 for Whole milk, Skim milk, and Half skim milk]

The answers: An answer A to a query Q in the MIEL system is a set of tuples. Each tuple is composed of values (which are fuzzy sets, as presented in Definition 14) and satisfies the selection criteria of the query. If the generalization of a selection value is required in the query, the way this operation is processed depends on the type of the selection attribute. For an attribute of type numerical or symbolic, the generalization is processed as in Dubois and Prade (1995). When the attribute is of type hierarchized, the generalization of the hierarchical fuzzy set is processed as described in Definition 12 of the section titled Generalization of a HFS. In the following definition, for any type of selection attribute, we denote by gen(F) the generalization of the fuzzy set F used as fuzzy selection value.

Definition 17: Let Q = {a1, …, ap | ∃aq+1, …, an (PV(a1, …, an) ∧ (ap+1 ≈ vp+1) ∧ … ∧ (aq ≈ vq))} be a query, gp+1, …, gq be its Boolean values, and (Πmin, Nmin) be its minimum possibility and necessity degrees. The answer A to the query Q is A = {t1, …, tr}, the set of tuples of the form {ti[a1], …, ti[ap]}1≤i≤r, such that every tuple ti of A satisfies all the selection criteria of Q, with

πi = t(Π(fp+1(vp+1), π(ap+1)), …, Π(fq(vq), π(aq)))

and

ni = t(N(fp+1(vp+1), π(ap+1)), …, N(fq(vq), π(aq)))

their respective possibility and necessity degrees of matching (as defined in Definition 3), such that πi ≥ Πmin and ni ≥ Nmin, where t is a t-norm and fj(vj) = gen(vj) if gj is true and vj otherwise. We have chosen the min operation to implement the t-norm t, which is a classical choice to represent the conjunction, but of course other t-norms may be used.

Example 14: An example of answer corresponding to the query of Example 13 expressed in the view OneFactorExperience is given in Figure 13.
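The two classical scalar measures interpreting ≈ can be sketched directly. On a finite domain, the possibility of matching is Π(f, π) = max_x min(f(x), π(x)) and the necessity is N(f, π) = min_x max(f(x), 1 − π(x)) (Dubois & Prade, 1995). The dict-based encoding of fuzzy sets below is an illustrative assumption of ours.

```python
# Sketch of the two scalar measures of fuzzy pattern matching used to
# interpret the comparator "approximately equal" between a fuzzy
# selection value f and an imprecise datum pi on a finite domain.

def possibility(f, pi, domain):
    return max(min(f.get(x, 0.0), pi.get(x, 0.0)) for x in domain)

def necessity(f, pi, domain):
    return min(max(f.get(x, 0.0), 1.0 - pi.get(x, 0.0)) for x in domain)

domain = ["Whole milk", "Skim milk", "Half skim milk", "Pork"]
# SubstratePreferences of Figure 12
preferences = {"Whole milk": 1.0, "Skim milk": 0.9, "Half skim milk": 0.8}
datum = {"Half skim milk": 1.0}  # a crisp imprecise datum
print(possibility(preferences, datum, domain))  # 0.8
print(necessity(preferences, datum, domain))    # 0.8
```

For a crisp datum the two measures coincide with the preference degree of the stored value, which is why a crisp Half skim milk datum scores 0.8 against SubstratePreferences.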
Query Processing

The MIEL user interface sends the MIEL query to the relational subsystem. The relational subsystem adapts the MIEL query to the formalism it uses: an SQL query. The views in the relational database of the MIEL system are SQL queries. In the actual implementation of the MIEL system, the views are stored in a specific table of the database called LViews, in which each tuple represents a view and is composed of four columns: IdView is the unique identifier of the view, SelectPart is the list of projection attributes, FromPart is the list of relations involved in the view, and WherePart is the list of join predicates between those relations. We did not use the SQL notion of view because it is not implemented in all RDBMS, notably MySQL, which is very popular.

Example 15: The SQL query which corresponds to the view SubstrateList defined in Figure 14 is: select P.Title, S.Substrate from Publication P, Substrate S where P.IdPub = S.IdPub.

The query processing in the database subsystem proceeds as follows:

1. Selection of the view corresponding to the query;
2. For each selection attribute of type hierarchized, computation of the fuzzy set closure and optionally, if it is requested in the query, computation of the generalization of the hierarchical fuzzy set;
3. For each selection attribute of type numerical or symbolic, optionally, if it is requested in the query, computation of the generalization of the fuzzy set;
Figure 13. Part of an answer

(Π, N) | Substrate | PathogenicGerm | pH [min, max] | Factor | ResponseType
(1.0, 1.0) | Whole milk | Bacillus Cereus | [5.1, 5.2] | Temperature | Temporal cinetic
(0.9, 0.9) | Half skim milk | Listeria | [5.0, 5.4] | Temperature | Growth speed
(0.8, 0.0) | Skim milk | Listeria | [6.0, 8.0] | Temperature | Level of contamination
Figure 14. A partial instance of relation LViews

IdView | SelectPart | FromPart | WherePart
SubstrateList | P.Title, S.Substrate | Publication P, Substrate S | P.IdPub=S.IdPub

4. Transformation of the fuzzy values of the selection criteria into classic SQL conditions (we call that process "defuzzification"). This transformation depends on the type of the selection attribute. If the type is hierarchized, a list of queried values is built from the list of elements which belong to the fuzzy set closure. If the type is symbolic, a list of queried values is built from the list of elements which belong to the associated fuzzy set. If the type is numerical, a list of Boolean conditions checks both cases of overlapping between the fuzzy set which represents the selection criterion and the one representing the imprecise datum (see Haemmerlé, Buche, & Thomopoulos, 2007, for more details);
5. Completion of the SQL query corresponding to the view in order to build the actual "defuzzified" SQL query;
6. Submission of the SQL query to a standard relational database management system (Oracle or PostgreSQL in the present version);
7. Computation of the adequation degree of each tuple of the answer.

Example 16: We assume that the query asked through the MIEL graphical user interface is: {Title, Substrate | SubstrateList(Title, Substrate) ∧ (Substrate ≈ SubstratePreferences)}. The selected view is that of Example 15 and the fuzzy set SubstratePreferences is that of Example 13. Using the hierarchy of Figure 7, the fuzzy set closure and the defuzzification of the selection criterion lead to the actual selection criterion: Substrate in ('Whole milk', 'Skim milk', 'Half skim milk', 'Pasteurized whole milk'). The SQL query submitted to the RDBMS query processor is then: select P.Title, S.Substrate from Publication P, Substrate S where P.IdPub = S.IdPub and S.Substrate in ('Whole milk', 'Skim milk', 'Half skim milk', 'Pasteurized whole milk').
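Steps 4-5 for a hierarchized criterion can be sketched as follows: the elements of the fuzzy set closure become a SQL IN list appended to the stored view parts. This is an illustrative sketch, not the MIEL implementation; the helper name defuzzify_sql is ours and quoting/escaping is deliberately simplified.

```python
# Illustrative "defuzzification" of a hierarchized selection criterion:
# complete the SELECT/FROM/WHERE parts of an LViews row with an IN list
# built from the elements of the fuzzy set closure.

def defuzzify_sql(view, closure_elements, column):
    values = ", ".join("'%s'" % v for v in closure_elements)
    return ("select %s from %s where %s and %s in (%s)"
            % (view["SelectPart"], view["FromPart"],
               view["WherePart"], column, values))

substrate_list = {  # the LViews row of Figure 14
    "SelectPart": "P.Title, S.Substrate",
    "FromPart": "Publication P, Substrate S",
    "WherePart": "P.IdPub = S.IdPub",
}
closure = ["Whole milk", "Skim milk", "Half skim milk",
           "Pasteurized whole milk"]
sql = defuzzify_sql(substrate_list, closure, "S.Substrate")
# sql is the "defuzzified" SQL query of Example 16
```

Note that the matching degrees of step 7 cannot be computed by the RDBMS itself: the degrees attached to the closure elements are kept aside and reattached to the returned tuples.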
Applications and Experimentation

In this part, we present two applications based on the MIEL querying system. Then examples of GUI illustrating a use case are given. Finally, we give some experimental results about the closure and the generalization of a HFS.
Presentation of Two Applications

We have designed and implemented two instances of the MIEL flexible querying system, involving hierarchical fuzzy sets, for two different relational databases:

• the Sym'Previus (http://www.symprevius.org/) database, which contains around 10,000 data entries about the behavior of microbial contaminants in foods. This system has been developed with industrial partners (Danone, Bongrain, Pernod-Ricard, …) and governmental institutions (the French Ministry of Agriculture and Fisheries);
• the Mét@risk (http://metarisk.inapg.inra.fr/) database, which contains around 50,000 data entries about chemical contamination in foods. This system has been developed by the Mét@risk INRA research unit with a national governmental institution (the French Ministry of Agriculture and Fisheries) and an international institution (the WHO Gems Food).
Both systems are operational and a copy is accessible from the Web:
• Sym'Previus database: http://www.symprevius.org/ (contact olivier.couvert@adria.tm.fr to obtain an access)
• Mét@risk database: http://idot.inapg.fr/mielContaminant/web/ (contact Buche@agroparistech.fr to obtain an access)
The Sym'Previus ontology contains two attributes of type hierarchized. For the attribute Substrate, the ontology contains 507 terms with a maximum specialization depth of 7. For the attribute MicrobialContaminant (microorganism), the ontology contains 167 terms, also with a maximum specialization depth of 7. The Mét@risk ontology contains seven attributes of type hierarchized. Six attributes describe the food: ProductType (481 terms), FoodSource (2239 terms), CookingMethod (42 terms), TreatmentApplied (628 terms), PackingMedium (37 terms), and ContainerOrWrapping (285 terms). For those attributes, the maximum specialization depth is 6. For the attribute ChemicalContaminant, the ontology contains 222 terms with a specialization depth of 1.
Examples of GUI

We present some dialogues with the user through an example of query. Although those dialogs could of course be enhanced to help users introduce fuzzy sets, they have been judged sufficiently "intuitive" by our microbiologist partners, who helped us design the GUI. The user wants to query the view OneFactorExperience (see Example 12). Five queryable attributes are available: Substrate, PathogenicGerm, pH, Factor, ResponseType. The user expresses preferences about the Substrate using the HFS (1.0/Cheese: soft + 0.9/Cheese) and about the pH using the numerical fuzzy set [4, 5, 6, 7]. Figure 15 shows the window in which the user expresses the HFS about the Substrate. The part of the ontology concerning the attribute Substrate can be accessed by the user in the frame Hierarchy of food products. The user's preferences are registered in the list boxes of the frame Set of
values. The list boxes also give access to the ontology by alphabetic order. In the frame Hierarchy of food products, the button Vizualise the choice asks the MIEL system to color in red the terms of the ontology which belong to the fuzzy set closure and in purple the terms obtained by a generalization of the fuzzy set. This last operation is computed by the system when the check box Extended selection of the frame Set of values is marked. Figure 16 shows the window which permits one to define a fuzzy set of a numerical type. It is used in the example to specify pH preferences. A part of the answer to the query is presented in Figure 17. The first column contains the possibility degree of matching (see Definition 17) of each tuple. For instance, the first answer is about the behavior of Escherichia coli in Cheese and has been published in Reitsma (1996). Data have been collected in a farm in the USA. The second answer is about the behavior of Bacillus cereus in Processed cheese and has been collected by an anonymous industrial partner of the Sym’Previus network. As both answers correspond to a cheese or a kind of cheese (Processed cheese), they are associated with the degree 0.9. Answers can be downloaded on the user’s computer in Excel format for further manipulations.
Figure 15. Graphical user interface to register a HFS

Figure 16. Graphical user interface to register a fuzzy set of type numerical

Figure 17. An example of answer to a query in the view OneFactorExperience

Experimentation

We have conducted an experiment in order to evaluate the efficiency of the closure and the generalization operations. With the experts of the domain, we have defined 7 test queries which cover around 10% of the Sym'Previus database entries (1,132 data entries among 10,000). The same query has been executed three times: without the computation of the closure (denoted standard queries), with the computation of the closure, and with the computation of the closure and the generalization. Exact answers correspond to the answers obtained either by the standard queries or by the queries with the computation of the closure. All the answers obtained with the computation of the closure have been considered as correct by the experts. They represent 99% of the exact answers (1% of the exact answers correspond to the standard queries). This is an excellent result for the closure operation. Among the results obtained by the generalization operation, the answers judged pertinent by the experts (80% of the total number of answers obtained by generalization) are those which have the highest matching degrees (between 0.6 and 0.8), whereas the answers judged nonpertinent (20%) have degrees that go from 0.6 down to 0.2. An essential remark is that the value 0.6 can thus be considered as a threshold above which results are classified as pertinent, and below which results are classified as nonpertinent by the experts. The evaluation results are thus also very good for the generalization method, as (1) pertinent results can be clearly identified using their matching degrees and (2) generalization results bring an important amount of complementary results (56% of the total number of results regarded as exact or pertinent).
Future Trends

In this section, we expose in more detail the continuation of this project within our own research team. In this chapter, we have presented the concept of hierarchical fuzzy set and its application to querying a possibilistic database, thanks to the MIEL flexible querying system, in the framework of the relational model. The concept of hierarchical fuzzy set has also been implemented in two other representation formalisms, the conceptual graph model and XML, in the framework of the design of a data integration system. Our data
integration system integrates three subsystems: a relational database (RDB) subsystem, a conceptual graph base (CGDB) subsystem and an XML base (XMLDB) subsystem. The relational database contains the stable, well-structured part of the information. The conceptual graph base contains the weakly structured pieces of information which do not fit the relational schema. As changing a relational schema is quite an expensive operation, we decided to use an additional base in order to store information that was not expected when the schema of the database was designed, but that is useful nevertheless. We chose to use the conceptual graph model for many reasons, including (1) its graph structure, which appeared as a flexible way of representing complementary information, and (2) its readability for a nonspecialist. The XML base contains information found semi-automatically on the Web by the AQWEB tool. AQWEB scans the Web, retrieves and filters documents which “look like” scientific publications. Tables containing scientific data are extracted automatically from each document and stored in an XML document. In order to be able to query efficiently the XML documents containing data tables, those tables are annotated using the domain ontology. For instance, AQWEB tries to match each term of the table with the closest terms of the ontology. XML documents including annotations are stored in an XML native database to enhance their querying. The data stored in the three bases are expressed in different formalisms, but conform to a single domain ontology. That domain ontology consists of a set of attributes and their associated variation domain (see Definition 13). The user expresses a query in the MIEL language. This query is then sent simultaneously to the three subsystems which transform the MIEL query into a formalism adapted to each subsystem (an SQL query in the RDB subsystem, a conceptual graph query in the CGDB subsystem, and an Xquery query in the XMLDB subsystem). 
The mediators which process this task have been presented in Haemmerlé et al. (2007) for the CGDB subsystem and in Buche, Dibie-Bartélemy, Haemmerlé, and Hignette (2006) for the first version of the XMLDB subsystem. The query is executed by each subsystem and the answers are retrieved by the MIEL graphical interface to be presented in a homogeneous way to the user. In the framework of the French national project WebContent (http://www.webcontent.fr), we are currently working on a new version of the AQWEB tool. In particular, we produce a new annotation represented as a fuzzy set, associating terms of the ontology with their similarity to the term of the Web table (Hignette, Buche, Dibie-Bartélemy, & Haemmerlé, 2005). Consequently, we will have to consider that a hierarchical fuzzy set associated with an attribute of type hierarchized may have not only two but three different semantics: a semantics of preference or of possibility distribution (the ones we studied in this chapter), but also a semantics of similarity. We will have to take into account all the consequences of this extension in the new version of the MIEL XMLDB mediator, which will be developed in the framework of the WebContent project.
Conclusion

Fuzzy sets are used both in flexible querying, to allow the expression of user preferences, and in possibilistic databases, to represent imprecise data by means of possibility distributions. In this chapter, we have focused on the case where these fuzzy sets are defined on hierarchically organized domains. Such domains are widely used in ontology-based systems. Defining fuzzy sets on hierarchically organized domains is not a trivial issue, since the degree associated with an element in such a fuzzy set must be coherent with those associated with sub-elements or super-elements, and compatible with reasoning using fuzzy set operations. We first proposed an overview of existing works in two fields whose combination is the core of the chapter: flexible querying of imprecise data and fuzziness in hierarchies. Then we introduced the
concept of hierarchical fuzzy set and presented its properties: two ways of defining it, on a part of a hierarchy or by computing its closure on the whole hierarchy; the extension of fuzzy set operations to hierarchical fuzzy sets, based on the HFS closures; the existence of equivalence classes composed of hierarchical fuzzy sets that share the same closure; the existence of a unique representative which has a property of minimality within each equivalence class; the generalization of a hierarchical fuzzy set, based on its equivalent minimal fuzzy set and useful for flexible querying purposes. We presented the framework of the MIEL flexible querying system, its data model, its query language, and its query processing. It implements the concept of hierarchical fuzzy set and is used for the querying of two different relational databases, containing imprecise data, in the domain of risk assessment in food products. We illustrated the chapter with the examples of the two applications, their graphical user interface, and an experimental evaluation. This evaluation shows that 99% of the total number of exact answers is obtained thanks to the closure computation and that 56% of the total number of results regarded as exact or pertinent is obtained by the generalization mechanisms presented in the chapter. Finally, we outlined some future trends concerning both hierarchical fuzzy sets and the MIEL flexible querying system. The essential point to retain from this chapter appears to us as being the relevance of hierarchical fuzzy sets for ontology-based systems, which extends their use beyond the framework of a particular application or a particular representation formalism.
Key Terms

Flexible Querying: Methods for querying a database that enhance standard querying expressiveness in various ways, such as the expression of user's preferences, query generalization, and so forth, in order to facilitate the extraction of relevant data.

Fuzzy Set: A mapping from a universe of discourse—the definition domain of the fuzzy set—into the interval [0,1]. The concept of fuzzy set extends the notion of Boolean membership to a set to the notion of degree of membership.

Hierarchical Fuzzy Set: A fuzzy set whose definition domain is a part of a hierarchy.

Hierarchy: A set of elements that are partially ordered by the "kind of" relation.

MIEL Language: A flexible querying language which permits expressing a conjunctive query in a given view. Current implementations have been done under the Oracle and PostgreSQL RDBMSs.

MIEL Query: A conjunctive query where the selection value associated with a queried attribute is expressed by a fuzzy set representing preferences.

Ontology: A formalization of the description of domain knowledge at a conceptual level.

Possibilistic Database: A database that contains ill-known data represented by means of possibility theory.

Possibility Distribution: A fuzzy set whose semantics represents the possible ordered values of an imprecise datum; only one of these values is the effective—but ill-known—value of the datum.

Query Generalization: An operation that creates, from a given query Q1, a query Q2 such that Q1 is included in Q2; that is, the answers to Q1 are included in the answers to Q2 for any database.
Endnotes
1. MIEL is a French acronym for Extended Database Search Tool.
2. With the meaning of the « kind of » relation.
3. World Health Organization.
Chapter XIII

Query Expansion by Taxonomy

Troels Andreasen, Roskilde University, Denmark
Henrik Bulskov, Roskilde University, Denmark
Abstract The use of taxonomies and ontologies as a foundation for enhancing textual information base access has recently gained increased attention in the field of information retrieval. The objective is to provide a domain model of an application domain where key concepts are organized and related. If queries and information base objects can be mapped to this, then the ontology may provide a valuable basis for a means of query evaluation that matches conceptual content rather than just strings, words, and numbers. This chapter presents an overview of the use of taxonomies and ontologies in querying with a special emphasis on similarity derived from the ontology. The notion of ontology is briefly introduced and similarity is surveyed. The former can be considered a generalization of taxonomy, while the latter can be seen as an interpretation where aspects of formal reasoning are ignored and replaced by measures reflecting how close concepts are connected, thereby significantly enhancing performance. In turn, similarity measures can be used in conceptual querying. Queries can be expanded with similar concepts, thereby causing query evaluation to be based on concepts from the domain model rather than on words in the query.
Introduction Information retrieval (IR) deals with access to information as well as its representation, storage, and organization. The overall goal of an IR process is to retrieve the information relevant to a given request. The criteria for complete success are the retrieval of all the relevant documents1 stored in a given system and the rejection of all the nonrelevant ones. Thus, the notion of relevance is at the heart of IR, and the retrieved documents are
those found to be most relevant to the given request under the conditions available (representation, search strategy, and query). The set of relevant documents includes the documents that are likely to contain the information desired by the user and the selection of these is typically based merely on bag-of-words descriptions of the documents. Thus, finding relevant documents in an information retrieval system (IRS) obviously involves uncertainty, not only with regard to the interpretations that document descriptions represent but also due to possible interpretations of user requests.
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
A number of different approaches have been introduced over the years in order to handle the problem of uncertainty in IR. Among these are variants of the vector retrieval model, the probabilistic retrieval model, and the extended Boolean retrieval model. One important and very promising extension of the Boolean model is the fuzzy retrieval model. In this model, documents may be more or less relevant, given a query, and "uncertainty is an inherent part of the decision problem since the criteria for determining the answer are not altogether clear" (Lucarella, 1990). To a large extent, everyday users, as well as researchers and developers, have recognized the limitations of standard keyword-based IRSs, for example, search engines on the Internet. Even when advanced search options designed to facilitate increased precision and recall are available, they are not of much help to average users, who generally avoid them due to poor usability and high perceived difficulty of use (Bandos & Resnick, 2002). One obvious alternative to standard keyword-based search is a less rigid natural language interpretation of queries, an idea that goes almost as far back as the idea of natural language processing does. Many natural language query approaches have been presented and it appears that recent approaches applying shallow parsing, for example, Penev and Wong (2006), might contribute to improved search. Another important direction concerns domain knowledge processing. Most prominent among knowledge-based approaches is probably the use of taxonomies and ontologies, which recently has gained increased attention in the field of IR.2 Taxonomies and ontologies organize key concepts of an application domain and provide semantics through relations connecting concepts.
Taxonomies, which can be considered controlled vocabularies that describe objects and relations between objects, are special cases of more general ontologies that include richer semantic relationships as well as rules for specification and formation. The main focus in this chapter is query evaluation based on domain knowledge captured by
taxonomies and ontologies. The general idea is to provide a mapping of concepts extracted from queries and documents into an ontology and to utilize this during query evaluation to obtain a matching on conceptual content rather than on just strings, words, and numbers. The emphasis in this chapter is more on how queries are interpreted and evaluated and less on how queries are expressed. Query expressions may simply be a set of keywords or concepts, may apply some more advanced operators, or may consist of controlled natural language. The important issues are the identification of concepts in queries and documents as well as the conceptual level at which they are compared. Thus, to provide ontology-based querying, we need a methodology where we can extract key concepts from queries and documents as well as a technique to compare descriptions to evaluate the degree to which they match. The objective of the former is to provide conceptual descriptions of queries and documents and the purpose of the latter is to permit query evaluation based on these conceptual descriptions. As regards extraction of key concepts from queries and documents as descriptions that indicate semantic content, it should be noted that when refraining from full semantic analysis, it is possible to produce parsers that can perform efficiently on large volumes of data. A very simplified two-phase processing principle, for example, was implemented in the OntoQuery project (Andreasen, Jensen, Nilsson, Paggio, Pedersen, & Thomsen, 2002) where the first phase was noun phrase bracketing and the second phase an extract of concepts from the individual noun phrases. A naive but powerful second phase is to extract nouns and adjectives only and combine them into “noun CHR adjective” pattern concepts (where CHR represents a “characterized by” relation). 
Thus, for instance, for the phrase "the black dog," the parser may produce the following concept: "dog CHR black." We will not go into detail on key concept extraction in this chapter but rather turn our focus to evaluation with special attention to comparison of descriptions. In order to compare conceptual descriptions, a means for measuring the extent of concept correspondence is needed. Concept correspondence is typically measured by means of similarity measures reflecting connectivity in a taxonomy/ontology that models the domain. Description correspondence calls for some kind of combination or aggregation over concept correspondence. A common approach to combining concept correspondence into description correspondence is expansion of queries, that is, adding to the initial query terms a collection of the most similar terms and then evaluating the expanded query. The query answer will then, in addition to exactly matching documents, include documents with descriptions that are only similar to the initial query description. The remainder of this chapter is organized as follows. First, a brief introduction to ontology provides the background for the main focus, conceptual expansion. Special attention is given specifically to concept algebra, an ontology representation formalism for representing so-called generative ontologies. Next, we focus on conceptual similarity, where a range of similarity measures based on taxonomies/ontologies are described, categorized, and discussed. Based on the conceptual similarity introduced, we then focus on principles for the expansion and evaluation of queries at a conceptual rather than a word-based level. Finally, we conclude with a summary. Although the general aim in this chapter is flexibility in querying, our focus is narrowed down to ontology-based similarity and the use of this in expansion of queries to document databases. For a general introduction to flexible querying, we refer to the chapter on this topic authored by Kacprzyk, Zadrożny, de Tré and de Caluwe.
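As a toy illustration of the two-phase extraction idea (the chapter gives no implementation; the adjective lexicon, stop-word list, and function name below are our own simplifications standing in for real noun-phrase bracketing and part-of-speech tagging):

```python
# Toy sketch of two-phase extraction: strip determiners from a noun phrase,
# then combine adjectives and the head noun into "noun CHR adjective" concepts.
# The tiny ADJECTIVES lexicon is a hypothetical stand-in for a real tagger.
ADJECTIVES = {"black", "old", "large", "big"}
STOPWORDS = {"the", "a", "an"}

def extract_concept(noun_phrase: str) -> str:
    """Map a simple noun phrase to a concept of the form 'noun[CHR: adj, ...]'."""
    words = [w for w in noun_phrase.lower().split() if w not in STOPWORDS]
    adjs = [w for w in words if w in ADJECTIVES]
    nouns = [w for w in words if w not in ADJECTIVES]
    head = nouns[-1]  # take the last remaining noun as the head of the phrase
    if not adjs:
        return head
    return f"{head}[{', '.join('CHR: ' + a for a in adjs)}]"

print(extract_concept("the black dog"))  # dog[CHR: black]
print(extract_concept("an old large fortification"))
```

For phrases with several adjectives, each contributes its own CHR attribution, mirroring the multiple-attribution notation used later in the chapter.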
Ontology

In the field of philosophy, where metaphysics is the study of being or existence, ontology refers to a systematic explanation that includes the definition and classification of entities and their properties. In recent years, ontology has appeared as a new concept in knowledge engineering and in computer
science contexts, but with a modified and narrow usage that refers to a data model for a given domain comprising a set of concepts and the relationships between them. To avoid confusion, Guarino and Giaretta (1995) suggest distinguishing between two concepts, with “Ontology” referring to the philosophical definition and “ontology” referring to the knowledge engineering definition. For the latter, the most widely used definition of the term is from (Gruber, 1993), “an explicit specification of a conceptualization,” or the slightly modified version by Borst (1997), “a formal specification of a shared conceptualization.” In the literature on ontology applications, the terms “ontology” and “taxonomy” are often used interchangeably, which indicates that no general consensus exists regarding what is required to justify these terms. There does, however, seem to be agreement that taxonomy is a special case of ontology. To understand more closely what actually constitutes an ontology, consult the literature on ontology classification. Lassila and McGuinness (2001) introduce a distinction spanning from simple linguistic resources, such as controlled vocabularies, on the one end, to expressive formal systems that can express general logical constraints on the other. The more and more widely used WordNet (Fellbaum, 1998; Miller, 1995), a simple terminological ontology, is closer to the linguistic resource end of the spectrum, while the well-known Cyc3 (Lenat, 1995; Lenat & Guha, 1990) is an example of a more expressive and axiomatized formal system with definitions stated in logic. In addition, WordNet has only a small number of relations, while Cyc has thousands. For different levels of detail, expressiveness, and formality, various languages for ontology representation have been proposed. 
Most of them introduce some kind of specialization/generalization hierarchy for classes (or concepts) and share structural similarities, including notions for the representation of instances, attributes, and relations. Consequently, due to the large number of different languages, several attempts have been made to produce generic representation models.
One of these is Open Knowledge Base Connectivity (OKBC), a protocol for accessing knowledge bases that provides a generic framework compatible with many existing knowledge-representation systems (Chaudhri, Farquhar, Fikes, Karp, & Rice, 1998). In OKBC, classes are sets or collections of instances and are connected by specialization/generalization relations that form a taxonomy. The properties of a class, denoting attributes and relationships are called slots, which are obviously inherited from superclasses to subclasses. In addition to instances, classes, and slots, OKBC introduces facets for constraining expressions, for instance, to restrict the set of allowed values for a slot. In essence, the most important distinction among ontology languages is whether the framework is axiomatic or not, and the kind of reasoning that can be performed. When targeting query applications, however, this difference is not necessarily of much importance. Common for many approaches to ontology applications related to large-scale query evaluation is that the reasoning capabilities of the ontology framework are ignored. The most important aspect of the ontology that can contribute to enhanced querying is obviously similarity as derived from the ontology. Indications or measures of this can of course be derived from the reasoning in the ontology, for instance, subsumption reasoning. However, since any ontology structured around a specialization/generalization hierarchy can be simplified to a graph or a network reflecting connectivity among concepts, we can shortcut the derivation of similarity by directly evaluating the graph connectivity; for instance, a shorter path indicates greater similarity. Thus, in connection with querying and the derivation of similarity measures, we do not need to deal with detailed aspects of the ontology formalism, which means the choice of framework and representation language becomes less important. 
The tendency is to refrain from reasoning and use a heuristic approach instead in the interpretation of the ontology (for instance, rather than subsumption reasoning, the heuristics, “the shorter the connecting path between two concepts is, the more similar they are considered to be,” is applied).
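The shortest-path heuristic can be sketched as a breadth-first search over an undirected IS-A graph; the tiny edge set and the 1/(1 + distance) transform below are illustrative assumptions of ours, not prescriptions from the text:

```python
from collections import deque

# Hypothetical toy IS-A graph; each pair connects a concept to its direct
# parent. Treating edges as undirected, path length serves as a distance.
EDGES = {
    ("cathedral", "church"), ("abbey", "church"),
    ("church", "building"), ("fortress", "building"),
}

def path_similarity(a: str, b: str) -> float:
    """Heuristic: the shorter the connecting path, the more similar (1/(1+d))."""
    graph = {}
    for x, y in EDGES:
        graph.setdefault(x, set()).add(y)
        graph.setdefault(y, set()).add(x)
    seen, frontier, dist = {a}, deque([(a, 0)]), None
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            dist = d
            break
        for n in graph.get(node, ()):
            if n not in seen:
                seen.add(n)
                frontier.append((n, d + 1))
    return 0.0 if dist is None else 1.0 / (1.0 + dist)

print(path_similarity("cathedral", "abbey"))  # path of length 2 -> 1/3
```

Siblings under a common parent (cathedral, abbey) thus come out more similar than concepts connected only through a distant ancestor (cathedral, fortress).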
One specific problem we have to deal with is that some of the concepts found in queries and documents do not directly appear in the ontology, and thus cannot be mapped to the ontology. For instance, if the concept “a black dog” appears in a document, but “dog” appears only without this specialization in the ontology, we cannot map it to the ontology. We can of course just ignore the color property and only consider what is found as an instance of the dog class. However, if the concepts “black” and “dog” are explicitly specified in the ontology, one solution is to introduce “generativity” and infer the concept “a black dog” from the concepts available, and thus avoid losing important information. An ontology framework is defined as “generative” when newly discovered concepts can be represented and situated in the ontology. Whether the ontology is generative or not is probably independent of the exact representation formalism chosen. However, in some frameworks, this is an inherent aspect. One such framework is presented below.
An Algebraic Representation of Ontologies

The foundation of the generative ontology framework we define is a basis ontology that defines concepts and relations among concepts of the domain in use. This ontology can be obtained from, for instance, knowledge engineers, known ontologies like WordNet, and so forth. The basis ontology situates a set of atomic term concepts, A, in a concept inclusion lattice ordered by the concept inclusion relation, also called IS-A. A concept language (description language) defines a set of well-formed concepts, including both atomic and compound term concepts, where compound concepts are defined as concepts created by relating concepts in the ontology together. The concept language used here, Ontolog (Nilsson, 2001), is a lattice-algebraic description language. Its basic elements are concepts and binary relations between concepts. The algebra introduces
two closed operations on concept expressions ϕ and ψ:

• Conceptual sum (ϕ + ψ), interpreted as the concept being ϕ or ψ,
• Conceptual product (ϕ × ψ), interpreted as the concept being ϕ and ψ,

also called join and meet, respectively. Relationships between concepts are introduced algebraically by means of a binary operator (:) known as the Peirce product (r : ϕ), which combines a relation r with an expression ϕ. The Peirce product is used as a factor in conceptual products, as in x × (r : y), which can be rewritten to form the feature structure x[r : y], where [r : y] is an attribution of the concept x. Compound concepts can then be formed by attribution. Given atomic concepts A and semantic relations R, the set of well-formed terms L is:

L = A ∪ {x[r1 : y1, ..., rn : yn] | x ∈ A, ri ∈ R, yi ∈ L}.   (1)

Compound concepts can thus have multiple as well as nested attributions. For instance, with R = {WRT, CHR, CBY, TMP, LOC, …} and A = {entity, physical entity, abstract entity, location, town, cathedral, old}, we get:

L = {entity, physical entity, abstract entity, location, town, cathedral, …, cathedral[LOC: town; CHR: old], cathedral[LOC: town[CHR: old]], …}.

Sources for knowledge base ontologies may have various forms. Typically, a taxonomy can be supplemented with, for instance, word and term lists as well as dictionaries for the definition of vocabularies and for handling morphology. WordNet, well known and widely used, is an interesting and useful resource for general ontologies. We will not go into detail here on the modeling, but will simply assume the presence of a taxonomy T over the set of atomic concepts A. T and A express the domain and world knowledge provided. Given a concept inclusion lattice ordered by the IS-A relation where, for instance, a IS-A b and b IS-A c, transitivity means that we can infer that a IS-A c. The transitive closure over such a concept inclusion lattice would then be all the transitive relations we can infer from the lattice, for example, {a IS-A b, b IS-A c, a IS-A c}, for all concepts. Based on T', the transitive closure of T, we can generalize to an inclusion relation "≤" over all well-formed terms of the language L with the following:

"≤" = T' ∪ {<x[..., r : z], y[...]> | <x[...], y[...]> ∈ T'}
     ∪ {<x[..., r : z], y[..., r : z]> | <x[...], y[...]> ∈ T'}
     ∪ {<x[..., r : z], x[..., r : w]> | <z, w> ∈ T'},   (2)
where ... denotes zero or more attributes of the form ri : ci. The general ontology O = (L, ≤, R) thus encompasses a set of well-formed expressions, L, derived in the concept language from a set of atomic concepts, A, an inclusion relation generalized from the IS-A relation in T, and a supplementary set of semantic relations R. For r ∈ R, x[r : y] ≤ x, and x[r : y] is in relation r to y. Observe that O is generative and that L is therefore potentially infinite. The indexing process extracts concepts from documents and maps them into the general ontology, resulting in an ontology comprised of concepts from the base ontology plus the compound concepts found in the documents and the relations between the concepts. One interesting subset from this ontology is the ontology (subontology) that consists of only concepts found in the document base and their upwards expansion. By upwards expansion, we mean all the concepts reachable upwards in the ontology from a given concept or set of concepts. We denote this subontology as an instantiated ontology due to the fact that the foundation is a set of concept instances. Given a general ontology, O = (L, ≤, R), and a set of concepts, C, the instantiated ontology OC = (LC, ≤C, R) is a restriction of O to cover only the concepts in C and corresponds to the upwards expansion LC of C in O:

LC = C ∪ {x | y ∈ C, y ≤ x},   (3)

and the inclusion relation, ≤C, for OC:

≤C = {<x, y> | x, y ∈ LC, x ≤ y}.   (4)
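Equation (3), the upwards expansion, can be sketched directly; the toy IS-A table below (child to direct parents) follows the WordNet-style abbey example discussed in the text, and the function and table names are ours:

```python
# Illustrative sketch: compute the upwards expansion L_C of a concept set C
# over a toy IS-A taxonomy, i.e., C plus every concept reachable upwards,
# as in equation (3).
IS_A = {  # child -> set of direct parents
    "abbey": {"church"},
    "cathedral": {"church"},
    "church": {"place_of_worship"},
    "place_of_worship": {"building"},
    "building": {"structure"},
    "structure": {"artifact"},
    "artifact": {"physical_entity"},
    "physical_entity": {"entity"},
    "fortress": {"fortification"},
    "fortification": {"structure"},
}

def upward_expansion(concepts: set[str]) -> set[str]:
    """Return C union all concepts reachable upwards via IS-A (eq. 3)."""
    expanded, todo = set(concepts), list(concepts)
    while todo:
        for parent in IS_A.get(todo.pop(), ()):
            if parent not in expanded:
                expanded.add(parent)
                todo.append(parent)
    return expanded

print(sorted(upward_expansion({"abbey"})))
```

For {"abbey"} this yields exactly the set listed in the chapter's example: abbey, church, place_of_worship, building, structure, artifact, physical_entity, entity.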
Figure 1 shows an example of an instantiated ontology, where solid lines show the IS-A relation, dotted lines show attribution with semantic
relations, and the solid grayed lines indicate that concepts were removed on the IS-A path to save space (in the figure). The general ontology is based on (and includes) WordNet and the ontology shown is "instantiated" with respect to the following set of concepts: C = {cathedral[LOC:town[CHR:old]], abbey, fortification[CHR:large, CHR:old], stockade, fortress[CHR:big]}. LC defines all the concepts and all the connections in the instantiated ontology shown in Figure 1. That is, each of the concepts in C is expanded upwards; for instance, for the concept "abbey" we get the concepts: {"abbey", "church", "place_of_worship", "building", "structure", "artifact", "physical_entity", "entity"}. We then union all the sets found from each of the concepts in C to obtain LC, and connect them by the relations found in the general ontology to obtain ≤C. Instantiated ontologies can be formed over any set of concepts. Above, it is defined for the concepts found in the document base, but a query, a query answer, and so forth could also form the basis for instantiating a general ontology. One interesting aspect along this line is the possibility of using instantiated ontologies for visualization, although this will not be discussed further in this chapter.

Figure 1. An instantiated ontology based on a WordNet ontology over the set of instantiated concepts {cathedral[LOC:town[CHR:old]], abbey, fortress[CHR:big], stockade, fortification[CHR:large, CHR:old]}. Dotted edges show attribution with semantic relations and the grayed lines indicate that concepts were removed to save space.
Similarity

In the context of information retrieval, the most significant challenge is to connect relevant stored documents to posed queries. In classical information retrieval where the need is to compare a bag-of-words (corresponding to a query as well as document descriptions), we need a single uniform means for comparing these structures. In the simplest (strict) form, the only relevant documents to a given query are those that contain all the properties (keywords) of the query. However, in modern information retrieval systems, the interpretation is normally not strict in this sense and documents are instead graduated and ranked according to degrees of resemblance to the query. Thus, in classical information retrieval, the foremost need is a similarity measure that compares a query with a set of documents. The introduction of concepts and ontologies in information retrieval necessitates, however, different views on the measurement of similarity
between queries and documents. The specialization/generalization hierarchy and the semantic relations of ontologies should influence the relevance judgment of documents in relation to queries, which similarity measures should thus reflect. Obviously, this calls for similarity measuring degrees of resemblance between concepts. Once established, similarities among concepts can be combined into resemblances between queries and documents, that is, description similarity. Below, we consider concept similarity, and in the following section, we return to descriptions. The problem of formalizing and quantifying the intuitive notion of semantic similarity between concepts has a long history going back at least to Aristotle in philosophy, psychology, and artificial intelligence (Budanitsky & Hirst, 2006). A variety of different synonyms are used sometimes interchangeably and sometimes with different meanings in the literature, for instance, likeness, resemblance, affinity, relatedness, and closeness. Some attempts have been made to differentiate between some of these terms, for example, by Resnik (1995), who defines similarity as a special case of relatedness, but in this chapter, we will refrain from this distinction and only use similarity and its opposite: distance. Major sources from which concept similarity can be derived are taxonomic structure and corpus-based statistics. Any ontology structured around a specialization/generalization hierarchy can be simplified to a graph or a network reflecting connectivity among concepts. Thus, we can obviously shortcut the derivation of similarity by evaluating graph connectivity directly, and this, sometimes supported by statistics, is exactly what most similarity measures do. Statistics in this connection are mainly frequency based probabilities and correlate concepts from a suitable text corpus (co-occurrence).
Similarity Measures Below we present and discuss diverse similarity measures based on ontologies. In the presentation,
Query Expansion by Taxonomy
we use a broad categorization, mainly into taxonomic structure-based and corpus-based measures, where the former ignores the corpus and the latter takes the corpus, and to some degree, also the taxonomic structure into account. This categorization is inspired by other similar categorizations, for example, Cross (2004). The issue here is to use conceptual similarity measures to expand queries with similar concepts, and one approach is to expand queries into fuzzy sets using similarity as membership. If in addition, documents are represented by fuzzy sets of the concepts extracted, then queries can be evaluated by comparison of fuzzy sets, for instance, by the relative sigma count (Lucarella, 1990):
eval(d, q) = ∑c∈C min(μd(c), μq(c)) / ∑c∈C μq(c),  (5)
where C is the set of concepts in the ontology and μd(c) and μq(c) are the membership functions for the fuzzy sets of concepts representing document d and query q, respectively. In general, concepts in queries can then be expanded by the function: similar(c) = ∑c′∈C sim(c, c′)/c′,
(6)
where sim(c1, c2) is some normalized similarity measure. Naturally, there are a number of different techniques for normalizing a similarity measure, as we shall see in the following sections, but a simple and general normalization is, for example:
sim(c1, c2) = SIM(c1, c2) / max x,y∈C (SIM(x, y)),  (7)

where SIM(c1, c2) is the un-normalized measure. Similarly, any measure of distance, DIST, can be normalized by:

dist(c1, c2) = DIST(c1, c2) / max x,y∈C (DIST(x, y)),  (8)
and transformed into a measure of similarity by sim(c1 , c2 ) = 1 − dist (c1 , c2 ) . Thus, we now have a general expansion mechanism for use with any similarity or distance measure based on taxonomies/ontologies and a simple query evaluation technique to match fuzzy sets. Improvements of the latter will be discussed later, but in the following two sections, we present and discuss a number of similarity measures. The variety of approaches for evaluating measures of similarity can be grouped roughly into theoretical studies, comparison to human judgment, and applicability in specific natural language applications. In this chapter, we will look briefly at the former by comparing different properties among the measures. We will discuss measures below in light of a range of basic intuitive properties, but the intent here is not to provide a complete evaluation that holds every property against every measure, but rather to give a characterization of each described measure, so most properties will be introduced together with the measures. Several properties for similarity and distance measures have been described in the literature. Not all appear to be clear modeling guidelines for the development of measures, but rather appear to be rationalized after recognized facts. Lin (1998) attempts to create a measure of similarity that is both universally applicable to arbitrary objects and theoretically justified, and hence not tied to particular applications, domains, resources, or a specific knowledge representation. In this regard, he introduces the properties of commonality, difference, and identity. These are quite general properties—the more common the more similar, the more different the less similar, and identity
leads to the maximum possible similarity. The properties are only rarely violated, but similarity measures are in most cases based on either commonality or difference, which means one of them is not obeyed, or is undefined. Before continuing with the description of measures, we would like to emphasize two important properties that require special attention. First, similarity is often considered to be inherently symmetric. Tversky (1977) argues against this consideration, stating that similarity judgments can be regarded as extensions of similarity statements, that is, statements of the type "a is like b," which is obviously directional. He gives a number of examples: "the portrait resembles the person" and not "the person resembles the portrait," "the son resembles the father" and not "the father resembles the son." In addition, it seems that when similarity is derived from a partial order, as in taxonomic inclusion, symmetry even becomes counterintuitive for cases where the order is defined. If two concepts are connected such that one is a specialization of the other, A < B (A specializes B), then we would expect that sim(A, B) < sim(B, A) instead. If "plankton" is a specialization of "organism," then we would fully accept any kind of plankton as a kind of organism, while we cannot expect any kind of organism to be plankton. Hence, we will view similarity measures in light of whether they comply with the property introduced in Andreasen, Bulskov, and Knappe (2003): •
Generalization property: Concept inclusion implies reduced similarity in the direction of the inclusion.
Second, relying on edges in the taxonomy to represent uniform distances is a widely acknowledged problem. Consider the two pairs of concepts: (1) “pot plant” and “garden plant” and (2) “physical entity” and “abstract entity.” Intuitively, we would expect similarity for the first pair to be higher than the similarity for the second, since the first pair of concepts is much more specific (Sussna, 1993). This means that the distance represented by an
Figure 2. The generalization property implies that sim(D,B) < sim(B,D). The depth-relative property implies that sim(D,E) ≥ sim(C,F).
edge should be reduced with the increasing depth (number of edges from the top) of the location of the edge: •
Depth property: The distance represented by an edge is influenced by the depth of the location of the edge in the ontology—deeper locations means shorter distance.
Both of the above-mentioned properties are illustrated in Figure 2. Below, we will discuss taxonomic and corpus-based measures in light of these two properties as well as other more specific properties.
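Taken together, equations (5)-(8) form a small pipeline: normalize a raw measure, expand a concept into a fuzzy set of similar concepts, and evaluate a document against an expanded query. A minimal sketch in Python, where all names (normalize, similar, eval_doc) and the dictionary encoding of fuzzy sets are our own illustrative choices:

```python
# Sketch of the pipeline in equations (5)-(8); a fuzzy set is represented
# as a dict mapping concepts to membership degrees.

def normalize(SIM, concepts):
    """Equation (7): divide by the largest un-normalized similarity."""
    m = max(SIM(x, y) for x in concepts for y in concepts)
    return lambda c1, c2: SIM(c1, c2) / m

def similar(c, concepts, sim):
    """Equation (6): expand concept c into the fuzzy set {sim(c, c')/c'}."""
    return {c2: sim(c, c2) for c2 in concepts}

def eval_doc(mu_d, mu_q):
    """Equation (5): relative sigma count of document d against query q."""
    overlap = sum(min(mu_d.get(c, 0.0), mu_q.get(c, 0.0)) for c in mu_q)
    return overlap / sum(mu_q.values())
```

Any un-normalized SIM can be plugged in; the same sketch works for a distance measure after the inversion sim = 1 − dist.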
Taxonomic Structure-Based Measures The similarity measures presented in this section can roughly be divided into two groups: first, a number of measures based on distance, with the number of edges on the shortest path between concepts as the basic idea, and second, two measures that in principle consider all connecting paths. Shortest path length. One obvious way to measure similarity in a taxonomy given its graphical representation is simply to evaluate the distance
between the nodes corresponding to the items being compared, where a shorter distance implies higher similarity. In Rada, Mili, Bicknell, and Blettner (1989), a simple approach based on the shortest path length is presented. The principal assumption is that the number of edges between terms in a taxonomy (only the IS-A relation) is a measure of conceptual distance between concepts:

DISTRada(ci, cj) = minimal number of edges on a path from ci to cj  (9)
The shortest path length similarity measure complies with neither the generalization property nor the depth property, since the measure is symmetric and is based on the notion of a uniform length of edges. As an example of this, consider the ontology in Figure 1, where DISTRada ( fortress, cathedral ) = DISTRada (cathedral , fortress ) = 6 ,
and DISTRada ( fortress, cathedral ) = DISTRada (region, property ) ,
indicating symmetry and no depth property, respectively. Hirst and St-Onge. Another measure based on the shortest path length is presented by Hirst and St-Onge (1998). The measure uses not only the IS-A relation, but also PART-OF (each in both directions, corresponding to all relations between nouns in WordNet (Miller, 1990), that is, hypernymy, hyponymy, holonymy, and meronymy). Their measure is also influenced by the number of changes in direction on the paths. The weight of a path is expressed by the following formula:

SIMHirst&St-Onge(ci, cj) = C − path length − k × number of changes in direction  (10)
where C is an arbitrary fixed constant corresponding to the maximum accepted value for similarity, path length is the shortest path length, and k is a reduction weight, the cost, so to speak, of turns on the path. Two concepts are semantically similar if they are connected by a path that is not longer than the fixed constant C and whose direction does not change too often. This measure is basically a shortest path length measure and thus complies with the same properties as the shortest path length measure. If we again consider the ontology in Figure 1 and use C = 8 and k = 1, then

SIMHirst&St-Onge(fortress, cathedral) = 8 − 6 − 1 × 1 = 1,

due to a shortest path length of 6 and one shift in direction, while the same shortest path length without any shift in direction
due to a shortest path length = 6 and one shift in direction, while the same shortest path length without any shift in direction SIM Hirst&St −Onge ( physical _ entity , cathedral ) = 8 − 6 − 1 × 0 = 2
has a higher similarity. Weighted shortest path. One simple generalization of the shortest path length, the weighted shortest path, is presented in Bulskov, Knappe, and Andreasen (2002). This measure assigns weights to the IS-A relation and thus, in order to obey the generalization property, provides the possibility of differentiating between generalizations and specializations. It is argued that in information retrieval, concept inclusion IS-A intuitively implies strong similarity in the opposite direction of inclusion (specialization), but also that generalization should contribute to similarity. This can be achieved by assigning different weights to the two directions of the IS-A relation. The distinction is expressed in two parameters σ, γ ∈ [0,1] controlling similarity of immediate specialization and generalization, respectively. For a path P = (p1, …, pn) between the nodes (concepts) c1 and c2, with c1 = p1 and c2 = pn, the number of specializations is s(P) = |{i | pi ISA pi+1}| and the number of generalizations is g(P) = |{i | pi+1 ISA pi}|. If P1, …, Pm are all paths connecting c1 and c2, then the degree to which c2 is similar to c1 is defined as:
Figure 3. An ontology transformed into a directed weighted graph with the immediate specialization and generalization similarity values σ = 0.9 and γ = 0.4 as weights. Similarity is derived as the maximum (multiplicative) weighted path length, and thus simWSP("poodle", "alsatian") = 0.4 × 0.9 = 0.36.
simWSP(c1, c2) = max j=1,…,m { σ^s(Pj) × γ^g(Pj) }  (11)

An example ontology is given in Figure 3 with σ = 0.9 and γ = 0.4. The weighted shortest path measure is a generalization of the shortest path length measure and would therefore hold the same properties, but the weighting of edges puts the measure in accordance with the generalization property. Consider Figure 1 with σ = 0.9 and γ = 0.4; then:

simWSP(cathedral, fortress) = 0.9² × 0.4⁴ = 0.021,

while

simWSP(fortress, cathedral) = 0.9⁴ × 0.4² = 0.105,
and thus the generalization property is obeyed. Sussna's depth-relative scaling. In his depth-relative scaling approach, Sussna (1993) considers for every relation r also its inverse, r′, as a separate relation. Each relation r from concept c1 to c2 is weighted with a value in the range [min_r, max_r]. This is the so-called type-specific fanout factor w, which depends on the number n_r(c1) of edges of type r leaving c1:

w(c1 →r c2) = max_r − (max_r − min_r) / n_r(c1).  (12)

The type-specific fanout factor reflects the dilution of the strength of the connotation between the source and the target concept; thus, the more edges that leave a node, the less that node's edges contribute to similarity. The two inverse weights are averaged and scaled by the depth d of the edge in the overall taxonomy, which is motivated by the observation that sibling concepts deeper in the taxonomy appear to be more closely related than those higher in the taxonomy. The distance between adjacent nodes c1 and c2 is computed as:

DISTsussna(c1, c2) = (w(c1 →r c2) + w(c2 →r′ c1)) / (2 × max(d(c1), d(c2)))  (13)

where r is the relation that holds between c1 and c2, and r′ is its inverse. The semantic distance between two arbitrary concepts c1 and c2 is computed as the sum of distances between the pairs of adjacent concepts along the shortest path connecting them. The depth property was introduced by Sussna in connection with this measure, so obviously this property is obeyed. However, since this is a variation of shortest path length, the generalization property is violated. Given that min_IS-A = 1 and max_IS-A = 2, the distance between "cathedral" and "fortress" in Figure 1 is computed by first finding DISTsussna for all adjacent concepts along the shortest path connecting "cathedral" and "fortress":
DISTsussna(cathedral, church) = (2 − 1/1 + 2 − 1/2) / (2 × 7) = .178
DISTsussna(church, place_of_worship) = (2 − 1/1 + 2 − 1/1) / (2 × 6) = .167
DISTsussna(place_of_worship, building) = (2 − 1/1 + 2 − 1/1) / (2 × 5) = .200
DISTsussna(building, structure) = (2 − 1/1 + 2 − 1/2) / (2 × 4) = .312
DISTsussna(structure, defensive_structure) = (2 − 1/2 + 2 − 1/1) / (2 × 5) = .250
DISTsussna(defensive_structure, fortress) = (2 − 1/2 + 2 − 1/1) / (2 × 6) = .208

and then summing these edge distances: .178 + .167 + .200 + .312 + .250 + .208 = 1.315. Wu and Palmer's conceptual similarity. Wu and Palmer (1994) propose conceptual similarity in their paper on the semantic representation of verbs in computer systems and its impact on lexical selection problems in machine translation. Wu and Palmer define conceptual similarity between a pair of concepts c1 and c2 as:

simWu&Palmer(c1, c2) = 2 × N3 / (N1 + N2 + 2 × N3),  (14)

where N1 and N2 are the numbers of nodes on a path from c1, respectively c2, to their least upper bound concept, while N3 is the number of nodes from the latter to the topmost node in the tree. Notice that sim(c, c) = 1 and similarity reduces with an increase of the denominator, that is, with the increase of the path connecting c1 and c2. Due to N3, the measure is also depth-relative, but like Sussna's measure, it is not in accordance with the generalization property. As an example, the similarity between "cathedral" and "fortress" in Figure 1 is:

simWu&Palmer(cathedral, fortress) = 2 × 4 / (5 + 3 + 2 × 4) = .500
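Wu and Palmer's measure (14) can be sketched as follows for a tree-shaped taxonomy given as a child-to-parent map. Counting conventions (nodes versus edges) vary in the literature; this sketch counts edges from each concept to the least upper bound and nodes from there to the root, so absolute values may differ slightly from the worked example above. All names are our own:

```python
# Sketch of Wu and Palmer's conceptual similarity (14) on a tree taxonomy.

def path_to_root(c, parent):
    """Node sequence from c up to the root of the tree."""
    path = [c]
    while c in parent:
        c = parent[c]
        path.append(c)
    return path

def sim_wu_palmer(c1, c2, parent):
    p1, p2 = path_to_root(c1, parent), path_to_root(c2, parent)
    lub = next(c for c in p1 if c in set(p2))  # least upper bound
    n1 = p1.index(lub)                         # edges from c1 up to lub
    n2 = p2.index(lub)                         # edges from c2 up to lub
    n3 = len(path_to_root(lub, parent))        # nodes from lub to the root
    return 2 * n3 / (n1 + n2 + 2 * n3)
```

With this convention sim(c, c) = 1 holds, and similarity decreases as the connecting path grows, as noted above.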
Leacock and Chodorow’s normalized path length. Leacock and Chodorow (1998) propose
normalized path length for measuring semantic similarity as the shortest path using IS-A hierarchies for nouns in WordNet (Miller, 1990). The proposed measure determines the semantic similarity between two synsets (concepts) by finding the shortest path and scaling by the depth of the taxonomy:

SIMLeacock&Chodorow(c1, c2) = −log( Np(c1, c2) / (2D) ),  (15)

where c1 and c2 represent the two concepts, Np(c1, c2) denotes the shortest path length between the synsets (measured in nodes), and D is the maximum depth of the taxonomy. Neither the generalization nor the depth property holds for this measure. The computation of the similarity between "cathedral" and "fortress" in Figure 1 is:

SIMLeacock&Chodorow(cathedral, fortress) = −log( 7 / (2 × 10) ) = .456
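Both Rada's distance (9) and Leacock and Chodorow's measure (15) reduce to a shortest path computation, sketched here as a breadth-first search over an undirected IS-A graph. The adjacency-list representation and function names are our own; the example above uses base-10 logarithms:

```python
# Sketch of shortest-path-based measures: Rada's edge count (9) via BFS,
# and Leacock and Chodorow's log-scaled variant (15) on top of it.
from collections import deque
import math

def shortest_path_edges(graph, c1, c2):
    """Equation (9): minimal number of edges between c1 and c2."""
    seen, frontier = {c1}, deque([(c1, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == c2:
            return dist
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return None  # not connected

def sim_leacock_chodorow(graph, c1, c2, max_depth):
    """Equation (15); the chapter measures Np in nodes (= edges + 1)."""
    np_nodes = shortest_path_edges(graph, c1, c2) + 1
    return -math.log10(np_nodes / (2 * max_depth))
```

For the Figure 1 example, Np = 7 nodes and D = 10 give −log10(7/20) ≈ .456, matching the value above.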
Shared nodes. All the approaches that have been presented until now only take into account one path in measuring similarity. Consequently, when two concepts are connected by multiple paths, only one path, typically the shortest, contributes to their similarity. Especially in cases where semantic relations are used in addition to inclusion, their influence appears to be significant so that the sharing of attributes or properties can contribute to similarity. One obvious approach is to consider all possible connections (no matter what the type of relation) between concepts c1 and c2. Concepts, for instance, may be connected directly through inclusion and also through an attribute dimension, as in cat[CHR: black] and poodle[CHR: black], or we might have multiple paths due to multiple inheritance. A simplified approach that still reflects the existence of multiple paths is shown in the following. In the shared nodes similarity measure introduced in Knappe, Bulskov, and Andreasen (2006), concepts are compared on the basis of the more general concepts they share rather than on all the
paths connecting them. What is shared is simply the intersection between the sets of upwards reachable nodes in the network for the concepts being compared. The intuition is that the more nodes they share, the more similar they are. The "decomposition" contribution τ(c1) to the set of upwards reachable nodes is:

τ(c1) = {c1} ∪ {c2 | c1 ≤ c2[…, r:c3] ∨ c1 ≤ c3[…, r:c2], c2 ∈ L, c3 ∈ L, r ∈ R}  (16)

while the "taxonomic" contribution can be derived from ω(C), the transitive closure of a set of concepts C with respect to ≤:

ω(C) = {c2 | c2 ∈ C ∨ (c1 ∈ C ∧ c1 ISA c2)}  (17)

with α(c1) = ω(τ(c1)) as the set of nodes (upwards) reachable from c1 in an ontology. α(c1) ∩ α(c2) is the set of reachable nodes shared by c1 and c2 and is thus an indication of how similar c1 and c2 are. The suggested parameterized similarity is:

simSharedNodes(c1, c2) = δ × |α(c1) ∩ α(c2)| / |α(c1)| + (1 − δ) × |α(c1) ∩ α(c2)| / |α(c2)|  (18)

where δ ∈ [0,1] determines a bias towards more or less influence from each of the compared nodes. If δ = 0, the location (depth) of c1 is ignored, and if δ = 1, the location of c2 is ignored. For the concepts "cathedral" and "fortress" in Figure 1 with δ = .8, the similarity is:

|α(cathedral)| = |{cathedral, …, structure, artifact, physical_entity, entity}| = 8
|α(fortress)| = |{fortress, …, structure, artifact, physical_entity, entity}| = 6
|α(cathedral) ∩ α(fortress)| = |{structure, artifact, physical_entity, entity}| = 4

simSharedNodes(cathedral, fortress) = .8 × 4/8 + (1 − .8) × 4/6 = .533

Weighted shared nodes (WSN). Intuition tells us that when deriving similarity using the notion of shared nodes, not all nodes are equally important. If we want two concepts to be more similar when they have an immediate subsuming concept (e.g., cat[CHR:black] and cat[CHR:brown]) than when they only share an attribute (e.g., cat[CHR:black] and dog[CHR:black]), we must differentiate and cannot simply define α(c) as a crisp set. The following is a generalization to fuzzy set based similarity (Andreasen, Knappe, & Bulskov, 2005), denoted weighted shared nodes (WSN). First, notice that α(c) can be derived as follows. Let the triple (c1, c2, r) be the edge of type r from concept c1 to concept c2, let E be the set of all edges in the ontology, and let T be the top concept; then α can be expressed:

α(T) = {T}
α(c1) = {c1} ∪ (∪(c1,c2,r)∈E α(c2))  (19)

A simple modification that generalizes α(c) to a fuzzy set is obtained through a function weight(r) ∈ [0,1] that attaches a weight to each relation type r. With this function, we can generalize to:

α(T) = {1/T}
α(c1) = {1/c1} ∪ (∪(c1,c2,r)∈E ∑c′∈α(c2) weight(r) × μα(c2)(c′) / c′)  (20)

α(c) is thus the fuzzy set of nodes reachable from concept c, modified by the relation weights weight(r). The measure of semantic similarity between two concepts is then defined to be proportional to the number of nodes shared by the concepts, but where nodes are weighted according to the semantic relation by which they are reached. For instance, from the ontology in Figure 4, assuming relation weights weight(IS-A) = 1, weight(CHR) = 0.5, and weight(CBY) = 0.5, see Box 1. For concept similarity, the parameterized expression (18) can still be used, applying the minimum for fuzzy intersection and the sum for fuzzy cardinality. Thus, we have, for instance:
Figure 4. Dotted edges show attribution with semantic relations in the above ontology
Box 1.

α(dog[CHR:black]) = {1/dog[CHR:black] + 1/dog + 1/animal + 0.5/black + 0.5/color + 1/anything}
α(cat[CHR:black]) = {1/cat[CHR:black] + 1/cat + 1/animal + 0.5/black + 0.5/color + 1/anything}
|α(dog[CHR:black])| = |α(cat[CHR:black])| = 5

α(dog[CHR:black]) ∩ α(cat[CHR:black]) = {0.5/black + 0.5/color + 1/animal + 1/anything}
|α(dog[CHR:black]) ∩ α(cat[CHR:black])| = 3.0

Thus, for the concepts "dog[CHR:black]" and "cat[CHR:black]" in Figure 4 with δ = .8, the similarity is:

simWSN(dog[CHR:black], cat[CHR:black]) = .8 × 3/5 + (1 − .8) × 3/5 = .6
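The Box 1 computation can be sketched with the ontology given as a set of typed edges and memberships propagated multiplicatively along the best upward path. The representation and names are our own, but with the relation weights above the sketch reproduces the Box 1 figures:

```python
# Sketch of weighted shared nodes (18)-(20): alpha(c) is the fuzzy set of
# upwards reachable nodes, each weighted by the relations on its best path.

def alpha(c, edges, weight):
    reached, frontier = {c: 1.0}, [c]
    while frontier:
        c1 = frontier.pop()
        for (src, c2, r) in edges:
            m = reached[c1] * weight[r]
            if src == c1 and reached.get(c2, 0.0) < m:
                reached[c2] = m
                frontier.append(c2)
    return reached

def sim_wsn(c1, c2, edges, weight, delta=0.8):
    """Equation (18) with fuzzy intersection (min) and sigma count (sum)."""
    a1, a2 = alpha(c1, edges, weight), alpha(c2, edges, weight)
    shared = sum(min(a1[c], a2[c]) for c in a1.keys() & a2.keys())
    return delta * shared / sum(a1.values()) + (1 - delta) * shared / sum(a2.values())
```

On the Figure 4 fragment with weight(IS-A) = 1 and weight(CHR) = 0.5, both cardinalities come out as 5, the shared part as 3.0, and the similarity as .6, as in Box 1.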
The weighting of edges may immediately be difficult to determine, but most importantly it permits differentiating between the key ordering relation, IS-A, and the other semantic relations when calculating similarity. The weighted shared nodes measure complies with all the defined properties discussed.
Corpus-Based Measures The similarity measures presented so far use knowledge solely captured by the ontology (or taxonomy) to compute a measure of similarity. In this section, we present three approaches that incorporate corpus analysis as an additional and
qualitatively different knowledge source. The knowledge revealed by the corpus analysis is used to augment the information already present in the ontologies or taxonomies. Resnik's information content. Resnik (1999) argues that one problem with edge-counting approaches is that they typically rely on edges representing uniform distances. One indication of similarity between two concepts is the extent to which they share information, which for a taxonomy can be determined by the relative position of their least upper bound. This indication seems to be captured by edge-counting approaches, for instance, the shortest path length approach presented above. However, edge-counting approaches in general do not comply with the depth property, since edges typically reflect uniform distances and the position in the hierarchy of the least upper bound is not taken into account. Resnik combines the taxonomic structure with empirical probability estimates in his measure of information content, applying knowledge from a corpus about the probabilities (based on frequencies) of senses to express non-uniform distances. Let C denote the set of concepts in a taxonomy that permits multiple inheritance, and associate with each concept c ∈ C the probability p(c) of encountering an instance of concept c. Following the standard definition from Shannon and Weaver's (1949) information theory, the information content of c is then −log p(c). For a pair of concepts c1 and c2, their similarity can be defined as:

simresnik(c1, c2) = max c∈S(c1,c2) [−log p(c)],  (21)

where S(c1, c2) is the set of least upper bounds in the taxonomy of c1 and c2. p(c) is monotonically nondecreasing as one moves up in the taxonomy, and if c1 IS-A c2, then p(c1) ≤ p(c2). Resnik's measure is depth-relative but ignores the actual path length (it considers only the least upper bound), which makes it only weakly dependent on the taxonomic structure.
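Resnik's measure (21) can be sketched as follows, with corpus-derived probabilities supplied directly and multiple inheritance supported through a parents map (our own representation). Since p(c) is nondecreasing upwards, maximizing −log p(c) over all common ancestors selects the same value as restricting to the least upper bounds:

```python
# Sketch of Resnik's information content measure (21); p maps each concept
# to its corpus-estimated probability.
import math

def ancestors(c, parents):
    """c together with every concept reachable upwards (multiple inheritance)."""
    out, stack = {c}, [c]
    while stack:
        for a in parents.get(stack.pop(), ()):
            if a not in out:
                out.add(a)
                stack.append(a)
    return out

def sim_resnik(c1, c2, parents, p):
    common = ancestors(c1, parents) & ancestors(c2, parents)
    return max(-math.log10(p[c]) for c in common)
```

With probabilities chosen so that −log p(person) = 2.005, both the "doctor1"/"nurse2" and the "adult"/"nurse2" pairs of the Figure 5 example get similarity 2.005, illustrating the measure's insensitivity to path length.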
Consider the ontology given in Figure 5 and the concepts "doctor1" and "nurse2." Their similarity is equal to the information content of their least upper bound, "person," which is 2.005. Observe that the similarity between, for instance, "adult" and "nurse2" is also 2.005, since their least upper bound is also "person." Jiang and Conrath's combined approach. Another approach based on Resnik's idea is suggested by Jiang and Conrath (1997). This approach synthesizes edge-counting and information content into a combined model by adding the latter as a corrective factor. The general formula for the edge weight between a child concept cc and a parent concept cp, taking into account factors such as local density in the taxonomy, node depth, and link type, is:

wt(cc, cp) = (β + (1 − β) × Ē/E(cp)) × ((d(cp) + 1)/d(cp))^α × LS(cc, cp) × T(cc, cp)  (22)
where d(cp) is the depth of the concept cp in the taxonomy, E(cp) is the number of children of cp (the local density), Ē is the average density in the entire taxonomy, LS(cc, cp) is the strength of the edge between cc and cp, and T(cc, cp) is the edge relation/type factor. The parameters α (α ≥ 0) and β (0 ≤ β ≤ 1) control the influence of concept depth and density, respectively. The semantic distance between two nodes is then defined as the summation of edge weights along the shortest path between them:

distJiang&Conrath(c1, c2) = ∑c∈{path(c1,c2) − LSuper(c1,c2)} wt(c, parent(c))  (23)
where path(c1, c2) is the set of all nodes along the shortest path between concepts c1 and c2, parent(c) is the parent node of c, and LSuper(c1, c2) is the lowest superordinate (least upper bound) on the path between c1 and c2. The computation of this
Figure 5. A fragment from WordNet (Resnik, 1999)
measure is similar to how Sussna's measure is computed, as a summation of edge weights along the shortest path. If we consider the ontology in Figure 5, use α = .5 and β = .3, and assume that the depth of "person" is 3 and that the average density Ē is 4, the edge weights from "doctor1" to "nurse2" are:

distJiang&Conrath(doctor1, health_professional) = (.3 + (1 − .3) × 4/2) × ((6 + 1)/6)^.5 = 1.836
distJiang&Conrath(health_professional, professional) = (.3 + (1 − .3) × 4/2) × ((5 + 1)/5)^.5 = 1.862
distJiang&Conrath(professional, adult) = (.3 + (1 − .3) × 4/1) × ((4 + 1)/4)^.5 = 3.466
distJiang&Conrath(adult, person) = (.3 + (1 − .3) × 4/5) × ((3 + 1)/3)^.5 = 0.993
distJiang&Conrath(guardian, person) = (.3 + (1 − .3) × 4/5) × ((3 + 1)/3)^.5 = 0.993
distJiang&Conrath(nurse2, guardian) = (.3 + (1 − .3) × 4/1) × ((5 + 1)/5)^.5 = 3.460

and the distance = 1.836 + 1.862 + 3.466 + 0.993 + 0.993 + 3.460 = 12.610. But since this measure uses the structure, the distance between "adult" and "nurse2" = 0.993 + 0.993 + 3.460 = 5.446 is less than the distance for "doctor1" and "nurse2."

Lin's universal measure. Lin (1997, 1998) defines a measure of similarity claimed to be both universally applicable to arbitrary objects and theoretically justified. Upon recognizing that known measures generally are tied to a particular application domain or resource, he argues for the need for a measure that does not presume a specific kind of knowledge representation and that is derived from a set of assumptions rather than directly from a formula. His measure of similarity between two concepts in a taxonomy is defined as:
simLin(c1, c2) = 2 × log p(LUB(c1, c2)) / (log p(c1) + log p(c2)),  (24)
where LUB(c1, c2) is the least upper bound of c1 and c2 and where p(x) can be estimated based on statistics from a sense tagged corpus (e.g., Resnik’s information content). Compared to Resnik, Lin’s measure is also influenced by the actual connecting path.
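On pre-computed probabilities, Lin's measure (24) is a one-liner; the function name and the assumption that the probabilities and the least upper bound are given directly are ours:

```python
# Sketch of Lin's universal measure (24); p(c1), p(c2), and p(LUB(c1, c2))
# are corpus-estimated probabilities supplied by the caller.
import math

def sim_lin(p_c1, p_c2, p_lub):
    """Equation (24) on pre-computed probabilities (base-10 logs)."""
    return 2 * math.log10(p_lub) / (math.log10(p_c1) + math.log10(p_c2))
```

With the Figure 5 values, sim_lin(.0018, .0001, .2491) reproduces the .179 of the worked example below (up to rounding).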
Consider again the ontology given in Figure 5 and the concepts "doctor1" and "nurse2." Their similarity is equal to:

simLin(doctor1, nurse2) = 2 × log .2491 / (log .0018 + log .0001) = .179,
while the similarity between “adult” and “nurse2” is:
simLin(adult, nurse2) = 2 × log .2491 / (log .0208 + log .0001) = .215,
and reflect that “adult” is closer (more similar) to “nurse2” than “doctor1.” A generic instance-based approach. In the previous section, we defined instantiated ontology as the restriction of a general ontology to a given set of concepts. If this set of concepts is exactly the same as the ones appearing in our corpus (= all concepts in the set of targeted documents, when speaking about information retrieval), then the instantiated ontology can certainly be claimed to be corpus-based. Thus, any purely taxonomic structure type similarity measure modeled on top of this instantiated ontology becomes also corpus-based. The presented measures in the above sections vary by the properties they possess, the sources they are drawn from, and, of course, by their elegance. But, first and foremost, deciding the most appropriate measure for a given domain and a given application is an issue for the knowledge engineer. It appears that fuzzification sometimes almost suggests itself as a convenient formalism when speaking about similarity, a tendency which is even stronger when turning to description correspondence as derived from concept similarity since this is mainly a matter of aggregation. This is the topic of the next section.
Comparing Measures The measures presented above include similarity as well as distance measures, normalized as well as un-normalized. As indicated in the introduction to this section, when a common scale is given, normalization is straightforward, and once measures are normalized, switching between similarity and distance is a simple matter of inversion (for instance, when normalized to [0,1], we have similarity = 1 − distance). However, it should be noted that normalization does not necessarily lead to a comparable scale; being similar to the degree 0.8 under one measure is not necessarily the same as being similar to the degree 0.8 under another. The only, and of course most interesting, way to compare measures is therefore the order in which they rank a set of terms as similar to a given term, that is, with reference to (6), the descending order of the fuzzy set similar(x) for a given term x. It appears that fuzzification sometimes almost suggests itself as a convenient formalism when speaking about similarity, a tendency which is even stronger when turning to description correspondence as derived from concept similarity, since this is mainly a matter of aggregation. This is the topic of the next section.
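The ranking-based comparison described above can be sketched as follows; the function names and the encoding of measures as plain functions are our own:

```python
# Sketch of comparing measures by the orderings they induce: two measures
# agree on a term x if they rank all other concepts identically.

def ranking(x, concepts, sim):
    """Descending order of the fuzzy set similar(x), cf. equation (6)."""
    return sorted((c for c in concepts if c != x),
                  key=lambda c: sim(x, c), reverse=True)

def same_order(x, concepts, sim_a, sim_b):
    """True if the two measures induce the same ranking around x."""
    return ranking(x, concepts, sim_a) == ranking(x, concepts, sim_b)
```

A measure rescaled by a constant induces the same ranking as the original, whereas a measure with a genuinely different ordering does not, which is exactly the distinction drawn above between comparable scales and comparable orders.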
Expansion In the present approach, ontology-based querying relies on the comparison of a description of the query with descriptions of texts from the database. Queries and texts are mapped onto descriptors organized in structures called "descriptions." The processing of queries is facilitated by the matching of descriptions. The key issue when dealing with evaluation approaches that take uncertainty into account is a relaxed interpretation of the request, considering not only explicitly specified but also similar terms/concepts/descriptions, in order to retrieve not only exactly matching but also similar objects/documents.
Many factors influence what is in the end considered similar. When ontologies are the origin of similarity measures, the interpretation of the ontology is also an issue. For instance, in some of the measures described in the previous section, the exact weighting of edges in the ontology has great influence on what is considered similar. Some approaches impose even more refined interpretations of ontologies. In this connection, it is especially worth mentioning the pattern matching approach suggested by Loiseau, Boughanem, and Prade (2005), which deals with possibilistic ontologies that allow distinguishing necessity and possibility in the determination of similarity. Another aspect is the exact part of the ontology taken into consideration in determining what is similar. We have briefly mentioned the notion of the instantiated ontology above as a subject for ontology-based evaluation. This is covered in more detail in Andreasen, Bulskov, and Knappe (2005). An approach where similarity is based on degrees of inclusion among weighted subtrees is described in Baziz, Boughanem, Loiseau, and Prade (2007). In this approach, a query subtree is compared with a set of target document subtrees. Below, we will restrict ourselves to what is probably the most common approach leading from similar terms to similar objects, namely query expansion. When dealing with generative ontologies allowing compound concepts, descriptions are not unique and may vary in level of detail, ability to combine, and structure. For instance, for the phrase "The noisy black dog is chasing the cat," the following, listed according to increasing accuracy, are possible descriptions: {noise, black, dog, cat} {{noise, black, dog}, {cat}} {{noise, dog[CHR:black]}, {cat}} {noise[CBY:dog[CHR:black]], cat}. Thus, not surprisingly, with the same aspects represented, we obtain more accurate descriptions by combining these into compound descriptors.
To index the information base, we have to decide on the underlying description structure. A
straightforward approach is to define descriptors as single concepts rather than sets (as in the first and the last description example above): D = {d1, … , dn},
(25)
where d1, … , dn are single concepts. This description structure applies for queries as well, and a query thus has the form Q = {q1, … , qn}. The general idea is to capture similarity reflecting the domain knowledge from the ontology in query evaluation, and for this purpose, to use the derived similarity measures rather than to reason on the ontology. A simple approach is to employ a function

similar(x) = ∑y sim(x, y)/y

denoting a fuzzy set of concepts similar to x. The function similar can be applied either to the descriptors of D or, preferably, to the descriptors of Q (since there are many Ds and only one Q). Now, the first objective is to introduce appropriate principles for similarity evaluation and for aggregation. Before continuing the discussion of general evaluation principles, this issue is examined in the next subsection.
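As a concrete illustration, the similar function can be sketched as a mapping from a concept to a fuzzy set of neighbours, here represented as a Python dict of membership degrees. The similarity table and concept names below are invented for illustration; they are not the chapter's WSN measure.

```python
def similar(x, sim, universe, threshold=0.0):
    """Fuzzy set of concepts similar to x: {y: sim(x, y)},
    keeping only degrees at or above the threshold."""
    return {y: sim(x, y) for y in universe if sim(x, y) >= threshold}

# Toy symmetric similarity table (hypothetical values).
_SIM = {
    ("dog", "dog"): 1.0,
    ("dog", "cat"): 0.45,
    ("dog", "animal"): 0.52,
}

def toy_sim(x, y):
    """Look up the pair in either order; unrelated concepts get 0."""
    return _SIM.get((x, y), _SIM.get((y, x), 0.0))

universe = ["dog", "cat", "animal"]
print(similar("dog", toy_sim, universe, threshold=0.4))
# {'dog': 1.0, 'cat': 0.45, 'animal': 0.52}
```

Raising the threshold prunes the expansion, mirroring the 0.4 cutoff used in the Box 2 example later in the text.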
Aggregation

A query Q is represented by a description {q1, … , qn} and we assume that the value qi(D) ∈ [0,1] is the degree to which the text object with description D satisfies the descriptor qi. The overall valuation ValQ(D) of object D with respect to Q is obtained as an aggregation of {q1(D), … , qn(D)}. An obvious choice here is to adopt the ordered weighted averaging (OWA) introduced in Yager (1988). OWA aggregates n values a1, …, an by means of an ordering vector W = [w1, …, wn], applying w1 to the highest value among a1, …, an, w2 to the next highest value, and so forth. The weights are restricted to wj ∈ [0,1] and ∑j=1..n wj = 1, and the aggregation of the values a1, …, an is:

FW(a1, …, an) = ∑j=1..n wj bj,
(26)
where bj is the j'th largest among a1, …, an, and b1, …, bn is thus the descending ordering of the values a1, …, an. By modifying W, we can obtain different aggregations; for instance, F(1,0,0,…) corresponds to the maximum, F(1/n,1/n,…) becomes the average, and F(0,0,…,1) the minimum. The OWA aggregation principle is very flexible and may be further refined by including importance weighting in the form of an n-vector M = <m1, …, mn>, mj ∈ [0,1], where, for instance, M = <1, 0.8, 0.8, …> gives more importance to the first argument, while there is no distinction with M = <1, 1, …>. With reference to the order weighting W = [w1, …, wn], a simple approach to applying importance to the aggregate of the values a1, …, an is to multiply these by the importance weights m1, …, mn before aggregation:

FM,W(a1, …, an) = FW(m1·a1, …, mn·an). (27)
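A minimal sketch of OWA aggregation with importance weighting as in Formulas (26) and (27); the input degrees below are arbitrary illustrative values.

```python
def owa(values, weights):
    """Ordered weighted averaging, Formula (26): apply the order weights
    to the values sorted in descending order."""
    assert abs(sum(weights) - 1.0) < 1e-9  # weights must sum to 1
    b = sorted(values, reverse=True)
    return sum(w * x for w, x in zip(weights, b))

def owa_with_importance(values, importances, weights):
    """Formula (27): scale each value by its importance before OWA."""
    return owa([m * a for m, a in zip(importances, values)], weights)

vals = [0.2, 0.9, 0.5]
print(owa(vals, [1, 0, 0]))                 # maximum -> 0.9
print(round(owa(vals, [1/3, 1/3, 1/3]), 3)) # average -> 0.533
print(owa(vals, [0, 0, 1]))                 # minimum -> 0.2
```

The three weight vectors reproduce the special cases named in the text: maximum, average, and minimum.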
OWA aggregation conveniently accommodates “linguistic quantifiers,” modeled by an increasing function K: [0,1] → [0,1] with K(0) = 0 and K(1) = 1, such that the order weights are prescribed as:

wj = K(j/n) − K((j−1)/n),
(28)
A quantifier, EXISTS, can, for instance, be modeled by K(x) = 1 for x > 0, FOR-ALL by K(x) = 0 for x < 1, and SOME by K(x) = x, while one possibility (of many) to introduce MOST is by a power of SOME, for example, K(x) = x³. Thus, we assume the general query expression:

Q = <q1, …, qn : M : K>,
(29)
where q1 , …, qn are the query descriptors, M specifies their importance weighting, and K specifies a linguistic quantifier, thereby indicating an order weighting. So with qi(D) as the degrees to which D satisfies the descriptor qi, the corresponding generalized valuation function is (compare with Formula (27)):
ValQ (D) = FM,w(K)(q1(D) , …, qn(D)),
(30)
where w is a function that maps a quantifier K to an n-vector w(K) ∈ [0,1]ⁿ of order weights (for instance, w(ALL) = (0, …, 0, 1)). A hierarchical approach to aggregation, generalizing OWA, is introduced in Yager (2000). Basically, hierarchical aggregation extends OWA to capture nested expressions. Query attributes may be grouped for individual aggregation, and the language is orthogonal in the sense that aggregated values may appear as arguments to aggregations. Thus, queries may be viewed as hierarchies. To illustrate, consider the following nested query expression:

< q1(D),
  < q2(D), q3(D),
    < q4(D), q5(D), q6(D) : M3 : K3 >
  : M2 : K2 >
: M1 : K1 >
(31)
Again, qi(D) ∈ [0,1] measures the degree to which descriptor qi conforms to the text object with description D, while Mj and Kj are the importance and quantifier applied in the j'th aggregate. In the expression above, M1 : K1 parameterizes aggregation at the outermost level, over the two components q1(D) and the middle aggregate. M2 : K2 parameterizes aggregation of the three components q2(D), q3(D), and the innermost aggregate, while M3 : K3 parameterizes aggregation of the three components q4(D), q5(D), and q6(D).
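The quantifier-derived order weights of Formula (28) and the hierarchical evaluation of nested expressions like (31) can be sketched together: a node is evaluated bottom-up, with its order weights derived from its quantifier. The quantifier definitions follow the text; the degrees are illustrative values, not taken from the chapter.

```python
def quantifier_weights(K, n):
    """Formula (28): w_j = K(j/n) - K((j-1)/n), for j = 1..n."""
    return [K(j / n) - K((j - 1) / n) for j in range(1, n + 1)]

EXIST = lambda x: 1.0 if x > 0 else 0.0     # order weights -> maximum
SOME = lambda x: x                          # order weights -> average
FOR_ALL = lambda x: 0.0 if x < 1 else 1.0   # order weights -> minimum

def owa(values, weights):
    b = sorted(values, reverse=True)
    return sum(w * x for w, x in zip(weights, b))

def evaluate(node):
    """Evaluate a nested aggregate like expression (31).
    A node is either a number (a degree q_i(D)) or a tuple
    (children, importances M, quantifier K)."""
    if isinstance(node, (int, float)):
        return node
    children, M, K = node
    scaled = [m * evaluate(c) for m, c in zip(M, children)]
    return owa(scaled, quantifier_weights(K, len(scaled)))

# Innermost EXIST (max), middle SOME (average), outermost FOR_ALL (min).
inner = ([0.9, 0.6, 0.3], [1, 1, 1], EXIST)    # -> 0.9
middle = ([0.5, 0.7, inner], [1, 1, 1], SOME)  # -> (0.5+0.7+0.9)/3 = 0.7
outer = ([0.8, middle], [1, 1], FOR_ALL)       # -> min(0.8, 0.7) = 0.7
print(round(evaluate(outer), 4))  # 0.7
```

Each aggregate can thus mix its own quantifier and importance vector, exactly the orthogonality that the hierarchical language provides.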
Query Evaluation Approaches

On top of the OWA aggregation principle and the extended hierarchical version of OWA, we can distinguish two major cases of description structures: simple un-nested sets and nested sets, the former perfectly handled by OWA aggregation and the latter by hierarchical aggregation.
Aggregation on Un-Nested Descriptions

The simple set-of-descriptors structure for descriptions in Formula (25) admits a straightforward valuation approach for a similarity query:

Qsim = <q1, …, qn : (1,1,…) : SOME>. (32)

The aggregation here is simple in that importance is not distinguished, and SOME, corresponding to the simple average, is used as quantifier. An example of a valuation is:

ValQsim(D) = F(1,1,…),w(SOME)(q1(D), …, qn(D)), (33)

with individual query-descriptor valuation functions:

qi(D) = maximumj {μ : μ/dj ∈ similar(qi), dj ∈ D}. (34)

To illustrate, assume a weighted shared node similarity and consider again Figure 4. In continuation of the WSN example from the previous section, assume ρ = 0.8 and consider the query Q = <dog[CHR:black], noise>. With a threshold for similar of 0.4, we have what is shown in Box 2.
With the example valuation function (33), thus giving all query terms equal importance and taking the simple arithmetic average as aggregation, the following are examples of valuations of the query Q = <dog[CHR:black], noise>:

ValQsim({noise[CBY:dog]}) = 0.90
ValQsim({noise[CBY:dog[CHR:black]]}) = 0.87
ValQsim({dog, noise}) = 0.84
ValQsim({black, dog, noise}) = 0.72.

That ValQsim({noise[CBY:dog]}) = 0.90 is derived according to Formula (34) as max(0.42, 0.90) = 0.9, while ValQsim({dog, noise}) = 0.84 is the average (according to (33)) of the degrees to which dog[CHR:black] and noise, respectively, correspond to the document represented by {dog, noise}:

ValQsim({dog, noise}) = (max(0.68, 0.42) + max(0.47, 1))/2 = 0.84.
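The {dog, noise} computation above can be replayed in a few lines; the similar sets below copy only the Box 2 entries needed for this document, and formula numbers refer to (33) and (34) in the text.

```python
# Only the Box 2 entries relevant to the document {dog, noise}.
similar_sets = {
    "dog[CHR:black]": {"dog": 0.68, "noise[CBY:dog]": 0.42},
    "noise": {"noise": 1.0, "dog": 0.47},
}

def q_valuation(qi, D):
    """Formula (34): best degree of any document descriptor in similar(qi)."""
    degrees = [deg for d, deg in similar_sets[qi].items() if d in D]
    return max(degrees, default=0.0)

def val_some(query, D):
    """Formula (33) with SOME: simple average of the descriptor valuations."""
    return sum(q_valuation(qi, D) for qi in query) / len(query)

D = {"dog", "noise"}
result = val_some(["dog[CHR:black]", "noise"], D)
print(round(result, 2))  # 0.84, i.e., (max(0.68, 0.42) + max(0.47, 1))/2
```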
Nested Aggregation on Un-Nested Descriptions

An alternative is to expand the query Q to a nested expression:
Box 2. similar(dog [CHR:black]) = 1.00/dog [CHR:black]+0.7/dog [CHR:brown]+ 0.68/dog+0.6/cat [CHR:black]+ 0.58/noise [CBY:dog [CHR:black]]+0.52/animal+ 0.45/cat +0.45/black+0.42/noise [CBY:dog] similar(noise) = 1.00/noise+0.90/noise [CBY:dog]+ 0.87/noise [CBY:dog [CHR:black]]+ 0.60/anything+0.50/animal+0.50/color+ 0.47/cat +0.47/black+0.47/dog+0.47/brown+ 0.44/cat [CHR:black]+0.44/dog [CHR:black]+ 0.44/dog [CHR:brown].
ValQsim (D) = << q11(D) , …, q1k1 (D) : M1 : K1 >, < q21(D) , …, q2k2 (D) : M2 : K2 >, … , < qn1(D) , …, qnkn (D) : Mn : Kn >, : M0 : K0 >,
(35)
where for each qi we set <μi1/qi1, …, μiki/qiki> = similar(qi) and use as individual valuation:

qij(D) = μij when qij ∈ {d1, …, dm}, and 0 otherwise.
(36)
In the event that we use equal importance and the following combination of quantifiers:

ValQsim(D) = << q11(D), …, q1k1(D) : (1,1,…) : EXIST >, < q21(D), …, q2k2(D) : (1,1,…) : EXIST >, …, < qn1(D), …, qnkn(D) : (1,1,…) : EXIST > : (1,1,…) : SOME >, (37)

we get a valuation identical to that of Formula (33). Nested expressions, however, facilitate importance adjustment in connection with query expansion, according to the kinds of relations contributing to the expansion. Assigning 1.0 importance to IS-A and 0.5 importance to CHR would, for the query Q = <dog[CHR:black], noise>, lead to the following expansion (compare with Figure 4):

ValQsim(D) = << qdog[CHR:black](D), qdog(D), qblack(D), … : (1,1,0.5,…) : EXIST >, < qnoise(D), … : (1,1,…) : EXIST > : (1,1,…) : SOME >. (38)

Nested expressions are thus a way of distinguishing the influence of different kinds of relations on similarity.
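A sketch of the relation-weighted expansion behind (38): each expanded term carries the relation that produced it, and that relation's importance scales the term's degree before the EXIST (max) aggregation. The relation names, importance values, and expansion degrees are assumptions for illustration.

```python
# Hypothetical importance per expansion relation, as in the 1.0/0.5 example.
RELATION_IMPORTANCE = {"orig": 1.0, "isa": 1.0, "chr": 0.5}

# Expansion of "dog[CHR:black]": (term, degree, producing relation).
expansion = [
    ("dog[CHR:black]", 1.0, "orig"),
    ("dog", 0.68, "isa"),
    ("black", 0.45, "chr"),
]

def exist_valuation(expansion, D):
    """EXIST over the expansion: max of the importance-scaled degrees
    of those expanded terms that occur in the document descriptors D."""
    degrees = [RELATION_IMPORTANCE[rel] * deg
               for term, deg, rel in expansion if term in D]
    return max(degrees, default=0.0)

print(exist_valuation(expansion, {"black", "cat"}))  # 0.5 * 0.45 = 0.225
```

A match obtained only through the CHR relation thus counts half as much as one obtained through IS-A, which is exactly the discrimination the nested form makes possible.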
Aggregation on Nested Descriptions In some cases, when text is processed by partial analysis as indicated earlier, an intrinsic structure appears as the most obvious choice for the description. The parser used in the project reported
on here is a two-phase parser, grouping the words of a sentence into groups corresponding to noun phrases in the first phase, and deriving compound descriptors from the words in each noun phrase individually in the second phase. Thus, we have as an intrinsic structure from the first phase a set of sets (or lists) of words. If we could always extract a unique compound concept as descriptor from an inner set, the resulting intrinsic structure from the second phase would be the single set as assumed above. However, in many cases this is not possible, and we would therefore lose information by flattening to a single set. This suggests that descriptions should be sets of sets of descriptors, such that the query structure becomes:

Q = <Q1, …, Qn> = <<q11, …, q1k1>, …, <qn1, …, qnkn>>, (39)

where the Qi's are sets of descriptors qij, j = 1, …, ki, and a text index is:

D = {D1, …, Dm} = {{d11, …, d1l1}, …, {dm1, …, dmlm}},
(40)
where the Di's are sets of descriptors dij, j = 1, …, li. This, however, demands a modified valuation, and since, in this case, the initial query expression is nested, a valuation over a nested aggregation also becomes the obvious choice. Note first that the grouping of descriptors in descriptions has the obvious interpretation of a closer binding of descriptors within a group compared to across different groups. So we cannot individually evaluate each qij(D), but have to compare at the level of the groups, for instance, by a restrictive quantification over qi1(Dj), …, qiki(Dj) and an EXIST quantification over j to get the best matching Dj for a given Qi. A valuation can thus be:

ValQsim(D) = <<< q11(D1), …, q1k1(D1) : M11 : MOST >,
…, < qn1(D1), …, qnkn(D1) : Mn1 : MOST > : M1 : EXIST >, …, << q11(Dm), …, q1k1(Dm) : M11 : MOST >, …, < qn1(Dm), …, qnkn(Dm) : Mn1 : MOST > : Mm : EXIST > : M0 : SOME >. (41)

The individual query-descriptor valuation functions can be set to:

qij(Dk) = maximuml {μ : μ/dkl ∈ similar(qij)}. (42)

As opposed to the single-set description example above, the qij's in this instance are the original descriptors from the query. While the choices of inner quantifiers are significant for a correct interpretation, the choice of SOME at the outer level for the component description is just one of many possible choices for reflecting the user's preference for the overall aggregation.
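Following the prose interpretation above (MOST within a query group, EXIST over the target groups Dj to find the best match for each query group Qi, SOME across query groups), a valuation in the spirit of (41)-(42) can be sketched as follows. The equality-based similarity and the example descriptors are placeholders, and the EXIST-over-j reading is one of the possible readings of (41).

```python
def owa(values, weights):
    b = sorted(values, reverse=True)
    return sum(w * x for w, x in zip(weights, b))

def most_weights(n):
    """Order weights for MOST, with K(x) = x^3 as suggested in the text."""
    K = lambda x: x ** 3
    return [K(j / n) - K((j - 1) / n) for j in range(1, n + 1)]

def group_match(Qi, Dj, sim):
    """MOST aggregation of the per-descriptor best matches (formula 42)."""
    vals = [max((sim(q, d) for d in Dj), default=0.0) for q in Qi]
    return owa(vals, most_weights(len(vals)))

def val_nested(Q, D, sim):
    """EXIST (max) over the target groups D_j for each query group Q_i,
    then SOME (average) over the query groups."""
    per_group = [max(group_match(Qi, Dj, sim) for Dj in D) for Qi in Q]
    return sum(per_group) / len(per_group)

# Placeholder similarity: 1.0 on identical descriptors, 0 otherwise.
sim = lambda x, y: 1.0 if x == y else 0.0
Q = [["dog", "black"], ["noise"]]
D = [["dog", "black", "cat"], ["noise"]]
print(val_nested(Q, D, sim))  # 1.0: every query group is fully matched by some D_j
```

Replacing the equality-based sim with an ontology-derived measure (and the outer SOME with another quantifier) yields the other variants the text alludes to.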
Conclusion

The emphasis in this chapter has been on a specific application of knowledge structures like taxonomies and ontologies, namely the expansion of queries. Ontologies, as a generalization of taxonomies, were briefly surveyed; similarity was introduced as the key to avoiding reasoning while still reflecting ontological knowledge; and approaches to query expansion and comparison at the level of descriptions were discussed. The general idea is to provide retrieval guided by domain-specific knowledge as comprised by the ontology. As far as ontologies are concerned, we have, in addition to the survey, presented a specific lattice-algebraic representation formalism. We consider this formalism appropriate for the purpose, partly because it easily generalizes to generative ontologies and also captures derived concepts (generativity is in fact inherent), and partly because
so-called instantiated ontologies can be derived by simple means using this formalism. When it comes to similarity measures, there are many alternatives, which is also the case for the properties proposed to characterize these measures. In conclusion, it is our view that taxonomic structure should play a key role as a source for similarity. The fact that Resnik ignores the path length below the least upper bound, for instance, appears to be too coarse-grained. Corpus statistics, on the other hand, should be taken into account whenever available. Regarding the simple generic approach presented in this chapter, taking instantiated ontologies as a source for similarity would most probably in many cases give better results, regardless of the taxonomic measure applied. The connectivity in the taxonomic structure becomes especially interesting in connection with document retrieval when this structure reflects the actual content of the document base. However, it appears that more sophisticated statistics, like Resnik's original idea of applying information theory, have great potential, especially in combination with more thorough taxonomic excerpts. Moreover, it is probably also worth considering alternatives that include regular distributional similarity, as discussed in Mohammad and Hirst (2006) and Weeds and Weir (2005), as well as considering possibilities for combining these approaches with ontology-based approaches. Query expansion is first of all a matter of comparison at the level of descriptions. The query is represented by a single description, which is then compared with descriptions in the information base referring to documents. The most obvious way to realize this is by expanding the query to embed similar concepts, provided, of course, that the evaluation principle can aggregate the degree of match appropriately.
It appears, however, that a more detailed interpretation of the query expression leading to a description reflecting structure (formal language queries) and/or semantics (NL queries) has interesting potential and should be investigated further. The flexible hierarchical aggregation
can be applied to embed the quantification, logical connectives, and importance specification of a formal query language, as well as syntax and semantics for the NL query, such as noun phrase structuring and part-of-speech, opening up a more refined interpretation.
References

Andreasen, T., Bulskov, H., & Knappe, R. (2003). Similarity for conceptual querying. Paper presented at the 18th International Symposium on Computer and Information Sciences, Antalya, Turkey (pp. 268-275).

Andreasen, T., Bulskov, H., & Knappe, R. (2005). On automatic modeling and use of domain-specific ontologies. Paper presented at the 15th International Symposium on Methodologies for Intelligent Systems, Saratoga Springs, New York (pp. 74-82).

Andreasen, T., Jensen, P. A., Nilsson, J. F., Paggio, P., Pedersen, B. S., & Thomsen, H. E. (2002). Ontological extraction of content for text querying. Paper presented at the 6th International Conference on Applications of Natural Language to Information Systems-Revised Papers, Stockholm, Sweden (pp. 123-136).

Andreasen, T., Knappe, R., & Bulskov, H. (2005). Domain-specific similarity and retrieval. Paper presented at the 11th International Fuzzy Systems Association World Congress, Beijing, China (pp. 496-502).

Baeza-Yates, R. A., & Ribeiro-Neto, B. A. (1999). Modern information retrieval. ACM Press/Addison-Wesley.

Bandos, J. A., & Resnick, M. L. (2002). Understanding query formation in the use of Internet search engines. Paper presented at the Human Factors and Ergonomics Society 46th Annual Meeting (pp. 1291-1296).

Baziz, M., Boughanem, M., Loiseau, Y., & Prade, H. (2007). Fuzzy logic and ontology-based information retrieval. In P. P. Wang, D. Ruan, & E. E. Kerre (Eds.), Fuzzy logic: A spectrum of theoretical and practical issues (pp. 193-218). Springer.

Berners-Lee, T. (1998). Semantic Web roadmap. Retrieved February 8, 2008, from http://www.w3.org/DesignIssues/Semantic.html

Borst, W. N. (1997). Construction of engineering ontologies for knowledge sharing and reuse. Enschede, The Netherlands: Centre for Telematics and Information Technology.

Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 1-35.

Bulskov, H., Knappe, R., & Andreasen, T. (2002). On measuring similarity for conceptual querying. Paper presented at the 5th International Conference on Flexible Query Answering Systems, Copenhagen, Denmark (pp. 100-111).

Chaudhri, V. K., Farquhar, A., Fikes, R., Karp, P. D., & Rice, J. P. (1998). OKBC: A programmatic foundation for knowledge base interoperability. Paper presented at the 15th National Conference on Artificial Intelligence, Madison, Wisconsin (pp. 600-607).

Cross, V. (2004). Fuzzy semantic distance measures between ontological concepts. Paper presented at the International Conference of the North American Fuzzy Information Processing Society.

Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. The MIT Press.

Gruber, T. R. (1993). A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2), 199-220.

Guarino, N., & Giaretta, P. (1995, April). Ontologies and knowledge bases: Towards a terminological clarification. Paper presented at Towards Very Large Knowledge Bases, Amsterdam, The Netherlands (pp. 25-32).

Hirst, G., & St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of malapropisms. In C. Fellbaum (Ed.), WordNet: An electronic lexical database. MIT Press.

Jiang, J., & Conrath, D. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, Taiwan (pp. 19-33).

Knappe, R., Bulskov, H., & Andreasen, T. (2006). Perspectives on ontology-based querying. International Journal of Intelligent Systems.

Lassila, O., & McGuinness, D. (2001). The role of frame-based representation on the Semantic Web (Tech. Rep. No. KLS-01-02). Stanford, CA: Knowledge Systems Laboratory, Stanford University.

Leacock, C., & Chodorow, M. (1998). Combining local context and WordNet similarity for word sense identification. In C. Fellbaum (Ed.), WordNet: An electronic lexical database. MIT Press.

Lenat, D. B. (1995). CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11), 33-39.

Lenat, D., & Guha, R. V. (1990). Building large knowledge-based systems: Representation and inference in the Cyc Project. Addison-Wesley.

Lin, D. (1997). Using syntactic dependency as local context to resolve word sense ambiguity. Paper presented at the Annual Meeting of the Association for Computational Linguistics (pp. 64-71).

Lin, D. (1998). An information-theoretic definition of similarity. Paper presented at the International Conference on Machine Learning (pp. 296-304).

Loiseau, Y., Boughanem, M., & Prade, H. (2005). Evaluation of term-based queries using possibilistic ontologies. In E. Herrera-Viedma, G. Pasi, & F. Crestani (Eds.), Soft computing for information retrieval on the Web. Springer-Verlag.

Lucarella, D. (1990). Uncertainty in information retrieval: An approach based on fuzzy sets. In Proceedings of the International Conference on Computers and Communications (pp. 809-814).

Miller, G. A. (1990). WordNet: An online lexical database. International Journal of Lexicography, 3(4).

Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39-41.

Mohammad, S., & Hirst, G. (2006). Distributional measures of concept-distance: A task-oriented evaluation. Paper presented at the Conference on Empirical Methods in Natural Language Processing (pp. 35-43).

Nilsson, J. F. (2001). A logico-algebraic framework for ontologies: ONTOLOG. In Proceedings of the First International OntoQuery Workshop, Department of Business Communication and Information Science, Kolding, Denmark (pp. 11-38).

Penev, A., & Wong, R. (2006). Shallow NLP techniques for Internet search. Paper presented at the 29th Australasian Computer Science Conference, Hobart, Tasmania, Australia (pp. 167-176).

Rada, R., Mili, H., Bicknell, E., & Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1), 17-30.

Resnik, P. (1995). Using information content to evaluate semantic similarity in a taxonomy. Paper presented at the International Joint Conference on Artificial Intelligence (pp. 448-453).

Resnik, P. (1999). Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11, 95-130.

Shannon, C. E., & Weaver, W. (1949). A mathematical theory of communication. Urbana, IL: University of Illinois Press.

Sussna, M. (1993, November). Word sense disambiguation for free-text indexing using a massive semantic network. In Proceedings of the 2nd International Conference on Information and Knowledge Management, New York, NY (pp. 67-74).

Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327-352.

Weeds, J., & Weir, D. (2005). Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics, 31(4), 439-476.

Wu, Z., & Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Morristown, NJ (pp. 133-138).

Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multicriteria decisionmaking. IEEE Transactions on Systems, Man, and Cybernetics, 18(1), 183-190.

Yager, R. R. (2000). A hierarchical document retrieval language. Information Retrieval, 3(4), 357-377.
Key Terms

Description: A description for a text unit (document, paragraph, sentence) is the set of index terms related to it.

Ontology: An ontology specifies a conceptualization, that is, a structure of related concepts for a given domain.

Ontology-Based Querying: Evaluation of queries against a database utilizing an ontology describing the domain of the database.

Precision: The proportion of retrieved documents that are relevant, out of all documents retrieved.

Query Expansion: Given a similarity relation over query terms, expansion of a query refers to the addition of similar terms to the query, leading to a relaxed query and an extended answer.

Recall: The proportion of relevant documents that are retrieved, out of all relevant documents available.

Similarity: Similarity refers to the nearness or proximity of concepts.

Taxonomy: A taxonomy is a hierarchical structure displaying parent-child relationships (a classification). A taxonomy extends a vocabulary and is a special case of the more general ontology.
Section III
Implementation, Data Models, Fuzzy Attributes, and Applications
Chapter XIV
How to Achieve Fuzzy Relational Databases Managing Fuzzy Data and Metadata

Mohamed Ali Ben Hassine, Tunis El Manar University, Tunisia
Amel Grissa Touzi, Tunis El Manar University, Tunisia
José Galindo, University of Málaga, Spain
Habib Ounelli, Tunis El Manar University, Tunisia
Abstract

Fuzzy relational databases have been introduced to deal with uncertain or incomplete information, demonstrating the efficiency of processing fuzzy queries. For these reasons, many organizations aim to integrate flexible querying to handle imprecise data or to use fuzzy data mining tools, minimizing the transformation costs. The best solution is to offer a smooth migration towards this technology. This chapter presents a migration approach from relational databases towards fuzzy relational databases. This migration is divided into three strategies. The first one, named “partial migration,” is useful basically to include fuzzy queries in classic databases without changing existing data. It needs some definitions (fuzzy metaknowledge) in order to treat fuzzy queries written in the FSQL language (Fuzzy SQL). The second one, named “total migration,” offers, in addition to flexible querying, a real fuzzy database, with the possibility to store imprecise data. This strategy requires a modification of schemas, data, and eventually programs. The third strategy is a mixture of the previous strategies, generally used as a temporary step, easier and faster than the total migration.
Introduction

New enterprise information systems are requested to be flexible and efficient in order to cope with rapidly changing business environments and advancement of services. An information system that develops its structure and functionality in a continuous, self-organized, adaptive, and interactive way can use many sources of incoming information and can perform intelligent tasks such as language
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
learning, reasoning with uncertainty, decision making, and more. According to Bellman and Zadeh (1970), “much of the decision making in the real world takes place in an environment in which the goals, the constraints, and the consequences of possible actions are not known precisely.” Management often makes decisions based on incomplete, vague, or uncertain information. In our context, the data which are processed by the application system and accumulated over the lifetime of the system may be inconsistent and may not express reality. In fact, one of the features of human reasoning is that it may use imprecise or incomplete information, and in the real world there exists a lot of this kind of fuzzy information. Hence, we can assert that in our everyday life we use several linguistic labels to express abstract concepts such as young, old, cold, hot, cheap, and so forth. Therefore, human-computer interfaces should be able to understand fuzzy information, which is very usual in many human applications. However, the majority of existing information systems deal with crisp data through crisp database systems (Elmasri & Navathe, 2006; Silberschatz, Korth, & Sudarshan, 2006). In this scenario, fuzzy techniques have proven to be successful principles for modeling such imprecise data and also for effective data retrieval. Accordingly, fuzzy databases (FDBs) have been introduced to deal with uncertain or incomplete information in many applications, demonstrating the efficiency of processing fuzzy queries even in classical or regular databases. Besides, FDBs allow storing fuzzy values, and of course, they should allow fuzzy queries using fuzzy or nonfuzzy data (Bosc, 1999; De Caluwe & De Tré, 2007; Galindo, Urrutia, & Piattini, 2006; Petry, 1996). Facing this situation, many organizations aim to integrate flexible querying to handle imprecise data or to use fuzzy data mining tools, minimizing the transformation costs.
A solution for the existing (old) systems is migration, that is, moving the applications and the database to a new platform and technologies. Migration of old systems, or legacy systems, may be an expensive and complex process. It allows legacy systems to be moved to
new environments with the new business requirements, while retaining the functionality and data of the original legacy systems. In this context, the migration towards FDBs, which constitutes a step to introduce imprecise data in an information system, does not only constitute the adoption of a new technology, but also, and especially, the adoption of a new paradigm. Consequently, it constitutes a new culture of development of information systems, and this book is evidence of the current interest and the promising future of this paradigm and its multiple fields. However, with important amounts invested in the development of relational systems, in the enrollment and training of “traditional” programmers, and so forth, enterprises appear reticent to invest important sums in the mastery of a new fuzzy paradigm. The best solution is to offer a smooth migration toward this technology, allowing them to keep the existing data, schemas, and applications, while integrating the different fuzzy concepts to benefit from fuzzy information processing. It will lower the costs of the transformations and will encourage the enterprises to adopt the concept of fuzzy relational databases (FRDBs). Moreover, although the migration of information systems constitutes a very important research domain, there is a limited number of migration methods between two specific systems. We mention some examples (e.g., Behm, Geppert, & Dittrich, 1997; Henrard, Hick, Thiran, & Hainaut, 2002; Menhoudj & OuHalima, 1996). To our knowledge, the migration of relational databases (RDB) towards FRDB has not yet been studied. FDBs allow storing fuzzy values and, besides, they allow making fuzzy queries using fuzzy or nonfuzzy data. It should be noted that classic querying is qualified as “Boolean querying,” although some systems use a trivalued logic with the three values true, false, and null, where null indicates that the condition result is unknown because some data is unknown.
The user formulates a query usually with a condition, for example, in SQL, which returns a list of rows, when the condition is true. This querying system constitutes a hindrance for
several applications because we cannot know if one row satisfies the query better than another row. Besides, traditional querying does not make it possible for the end user to use vague linguistic terms in the query condition or to use fuzzy quantifiers such as “almost all” or “approximately the half.” Many works have been proposed in the literature to introduce flexibility into database querying, both in crisp and fuzzy databases (Bosc, Liétard, & Pivert, 1998; Bosc & Pivert, 1995, 1997, 2000; Dubois & Prade, 1997; Galindo, Medina, & Aranda, 1999; Galindo, Medina, Pons, & Cubero, 1998; Galindo et al., 2006; Kacprzyk & Zadrożny, 1995, 2001; Tahani, 1977; Umano & Fukami, 1994). The essential idea in these works consists in adding an additional layer to the classic DBMS (database management systems) to evaluate fuzzy predicates. In this book, the reader can find a chapter by Zadrożny, de Tré, de Caluwe, and Kacprzyk with an interesting review of fuzzy querying proposals. Also, this book includes other chapters with new applications and new advances in the field of fuzzy queries. Some examples are the chapter by Takači and Škrbić about priorities in queries, the chapter by Dubois and Prade about bipolar queries, and the chapter by Barranco, Campaña, and Medina using a fuzzy object-relational database model. Among various published propositions for different fuzzy database models, we mention the one by Medina, Pons, and Vila (1995), who introduced the GEFRED model, an eclectic synthesis of other previous models. In 1995, Bosc and Pivert introduced the first version of a language handling flexible queries, named SQLf.
In their turn, Medina, Pons, and Vila (1994b) proposed the FSQL language, which was later extended (Galindo, 1999, 2005; Galindo et al., 1998, 2006). Although the basic target of FSQL is similar to that of the SQLf language, FSQL allows fuzzy queries both in crisp and fuzzy databases, and it presents new definitions such as many fuzzy comparators, fuzzy attributes (including fuzzy time), and fuzzy constants. It allows the creation of new fuzzy objects such as labels, quantifiers, and so forth. There is another
chapter by Urrutia, Tineo, and González studying both proposals. This chapter presents a new approach for the migration from RDB towards FRDB with FSQL. The aim of this migration is to permit an easy mapping of the existing data, schemas, and programs, while integrating the different fuzzy concepts. Therefore, all valid SQL queries remain useful in the fuzzy query language FSQL (fuzzy SQL). This approach studies the RDB transformations essentially at the level of the schemas (physical and conceptual), the data, and, less specifically, the applications. First, we present a very brief overview of fuzzy sets, and then we present basic concepts about FRDB. Afterwards, we present our three migration strategies. The first one, named "partial migration," is useful only to include fuzzy queries in classic databases without changing existing data. The second one, named "total migration," offers, in addition to flexible querying, the possibility to store imprecise data. The third strategy is a mixture of the previous two. Finally, we outline some conclusions and suggest some future research lines.
Introduction to Fuzzy Sets

The theory of fuzzy sets stems from the classic theory of sets, adding a membership function to the set, defined in such a way that each element is assigned a real number between 0 and 1. In 1965, professor L.A. Zadeh defined the concept of fuzzy sets, and since then many works and applications have appeared (Pedrycz & Gomide, 1998). We give here only the most basic notions; for a better introduction, read the first chapter of this handbook. A fuzzy set (or fuzzy subset) A is defined by means of a membership function µA(u), which indicates the degree to which the element u is included in the concept represented by A. The fuzzy set A over a universe of discourse U can also be represented as a set of pairs given by:

A = {µA(u)/u : u ∈ U, µA(u) ∈ [0,1]}
(1)
How to Achieve Fuzzy Relational Databases Managing Fuzzy Data and Metadata
where µA is the membership function and µA(u) is the membership degree of the element u in the fuzzy set A. If µA(u)=0, u does not belong at all to the fuzzy set A; if µA(u)=1, u belongs totally to the fuzzy set A. For example, if we consider the linguistic variable height_of_a_person, then three fuzzy subsets could be defined, identified by three labels, Short, Medium-height, and Tall, with membership functions µShort(u), µMedium-height(u), and µTall(u), respectively, where u takes values in the referential of this attribute (or underlying domain), which would be the positive real numbers (expressing the height in centimetres). On the other hand, for domains with a non-ordered referential, a similarity function can be defined, which can be used to measure the similarity or resemblance between every two elements of the domain. Usually, the similarity values are normalized in the interval [0,1], where 0 means "totally different" and 1 means "totally alike" (or equal). Thus, a similarity relationship is a fuzzy relation that can be seen as a function sr, so that:

sr : D×D → [0,1], with sr(di, dj) ∈ [0,1] for all di, dj ∈ D

(2)
where D is the domain of the defined labels. We can assume that sr is a symmetrical function, that is, sr(di, dj) = sr(dj, di), as this is the most usual case, although it does not necessarily have to be this way. We can also construct possibility distributions (or fuzzy sets) on the labels of D, extending the possibilities for expressing imprecise values (Zadeh, 1978), in such a way that each value di ∈ D has a degree of truth or possibility pi associated to it, obtaining expressions for specific values that can be expressed generically as:

{pi/di : pi ∈ [0,1], di ∈ D}
(3)
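The three notions above — a trapezoidal membership function over an ordered referential, a symmetric similarity relation sr over a non-ordered label domain, and a possibility distribution over labels as in expression (3) — can be sketched in a few lines of Python. All concrete labels, breakpoints, and similarity values below are illustrative assumptions, not taken from this chapter:

```python
def trapezoid(a, b, c, d):
    """Return a trapezoidal membership function defined by a <= b <= c <= d:
    degree 0 outside [a, d], degree 1 on [b, c], linear in between."""
    def mu(u):
        if u <= a or u >= d:
            return 0.0
        if b <= u <= c:
            return 1.0
        if u < b:                       # rising edge between a and b
            return (u - a) / (b - a)
        return (d - u) / (d - c)        # falling edge between c and d

    return mu

# Hypothetical label for the linguistic variable height_of_a_person (in cm).
mu_tall = trapezoid(170, 180, 250, 251)

# A similarity relation sr over a non-ordered domain of hair-colour labels;
# the degrees are made up for the example.
sr = {("dark", "brown"): 0.6, ("dark", "blond"): 0.1, ("brown", "blond"): 0.4}

def similarity(x, y):
    """Symmetric similarity: sr(x, y) = sr(y, x), and sr(x, x) = 1."""
    if x == y:
        return 1.0
    return sr.get((x, y), sr.get((y, x), 0.0))

# A possibility distribution over labels, in the shape of expression (3).
hair = {"dark": 1.0, "brown": 0.4}
```

For instance, `mu_tall(175)` evaluates to 0.5 on the rising edge, while `similarity("brown", "dark")` looks up the symmetric pair and returns 0.6.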
The domains with ordered and non-ordered referentials can adequately represent concepts of
"imprecision" using fuzzy sets theory. It should be noted that many of these natural concepts depend, to a greater or lesser degree, on the context and on the person that expresses them. From this simple concept, a complete mathematical and computing theory has been developed which facilitates the solution of certain problems. Fuzzy logic has been applied to a multitude of objectives such as control systems, modeling, simulation, pattern recognition, information or knowledge systems (databases, expert systems, etc.), computer vision, artificial intelligence, artificial life, and so forth.
Introduction to Fuzzy Relational Databases

The first chapter of this handbook includes a brief introduction to this topic, explaining some basic models. We give here a brief overview in order to facilitate the reading of this chapter. "The term imprecision encompasses various meanings, which might be interesting to highlight. It alludes to the facts that the information available can be incomplete (vague), that we don't know whether the information is true (uncertainty), that we are totally unaware of the information (unknown), or that such information is not applicable to a given entity (undefined). Usually, the total ignorance is represented with a NULL value. Sometimes these meanings are not disjunctive and can be combined in certain types of information" (Galindo et al., 2006, p. 45). This imprecision has been studied in order to elaborate systems, databases, and, consequently, applications which support this kind of information. Most works studying imprecision in information have used possibility, similarity, and fuzzy techniques. Research on FDBs has been conducted for about 20 years and has concentrated mainly on the following areas: flexible querying in classical databases, extending classical data models in order to achieve fuzzy databases (including, of course, fuzzy queries on these fuzzy databases and fuzzy conceptual modeling tools), fuzzy data mining
techniques, and applications of these advances in real databases. All these different issues have been studied in different chapters of this volume and also in many other publications (De Caluwe & De Tré, 2007; Bosc, 1999; Bosc et al., 1998; Galindo et al., 2006; Petry, 1996). The querying of an FRDB, contrary to classical querying, allows users to use fuzzy linguistic labels (also named linguistic terms) and to express their preferences to better qualify the data they wish to get. An example of a flexible query, also named in this context a fuzzy query, would be "list of the young employees, well paid and working in a department with a big budget." This query contains the fuzzy linguistic labels "young," "well paid," and "big budget." These labels are words, in natural language, that express or identify a fuzzy set that may or may not be formally defined. In fact, the flexibility of a query reflects the preferences of the end user. This is manifested by using a fuzzy set representation to express a flexible selection criterion. The extent to which an object in the database satisfies a request then becomes a matter of degree. The end user provides a set of attribute values (fuzzy labels), which are fully acceptable for the user, and a list of minimum thresholds for each of these attributes. With these elements, a fuzzy condition is built for the fuzzy query. Then, the fuzzy querying system ranks the retrieved items according to their fulfillment degree or level of acceptability. Some approaches, the so-called bipolar queries, need both the fuzzy condition (or fuzzy constraint) and the positive preferences or wishes, which are less compulsory. (A very interesting chapter about bipolar queries, by Dubois and Prade, may be found in this volume.) Hence, the interests of fuzzy queries for a user are twofold:

1. A better representation of the user's preferences while allowing the use of imprecise predicates.
2. Obtaining the necessary information in order to rank the answers contained in the database according to the degree to which they satisfy
the query. This helps to avoid empty sets of answers when queries are too restrictive, as well as overly large, unordered sets of answers when queries are too permissive. This preface led us to establish the definition of FRDB as an extension of RDB. This extension introduces fuzzy predicates or fuzzy conditions in the shape of linguistic expressions that, in flexible querying, permit a range of answers (each one with its membership degree) in order to offer the user all the intermediate variations between the completely satisfactory answers and the completely unsatisfactory ones (Bosc et al., 1998). Yoshikane Takahashi (1993, p. 122) defined an FRDB as "an enhanced RDB that allows fuzzy attribute values and fuzzy truth values; both of these are expressed as fuzzy sets." Thus, a fuzzy database is a database which is able to deal with uncertain or incomplete information using fuzzy logic. There are many ways of adding flexibility to fuzzy databases. The simplest technique is to add a fuzzy membership degree to each record, that is, an attribute in the range [0,1]. However, there are other kinds of databases allowing fuzzy values to be stored in a fuzzy attribute, using fuzzy sets or possibility distributions, or fuzzy degrees associated to some attributes and with different meanings (membership degree, importance degree, fulfillment degree...). The main models are those of Prade-Testemale (1987), Umano-Fukami (Umano, 1982; Umano & Fukami, 1994), Buckles-Petry (1982), Zemankova-Kandel (1985), and GEFRED by Medina-Pons-Vila (1994a). This chapter deals mainly with the GEFRED model (GEneralised model for Fuzzy RElational Databases) and some later extensions (Galindo et al., 2006). This model constitutes an eclectic synthesis of the various models published so far, with the aim of dealing with the problem of representation and treatment of fuzzy information by using RDB.
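The ranking step described above — score every row, drop those below the user's minimum threshold, and return the rest best-first — can be sketched as follows. The employee data, the attribute names, and the membership function chosen for "young" are all illustrative assumptions:

```python
def fuzzy_rank(rows, degree_fn, threshold=0.0):
    """Score each row with a fulfillment degree in [0, 1], discard rows
    below the threshold (and fully unsatisfactory rows), rank best-first."""
    scored = [(degree_fn(row), row) for row in rows]
    kept = [(d, row) for d, row in scored if d >= threshold and d > 0]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)

employees = [{"name": "Ana", "age": 28}, {"name": "Luis", "age": 47}]

# A made-up decreasing membership function for the label "young":
# full membership below age 25, none above age 40.
young = lambda r: max(0.0, min(1.0, (40 - r["age"]) / 15))

ranking = fuzzy_rank(employees, young, threshold=0.3)
```

With this data, only Ana passes the threshold, with a fulfillment degree of 0.8; Luis is excluded rather than returned with a degree of zero, which is exactly the behavior that distinguishes a fuzzy ranking from a crisp Boolean filter.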
One of the major advantages of this model is that it consists of a general abstraction that allows for the use of various approaches,
Table 1. Data types in the GEFRED model

1. A single scalar (e.g., Behavior=Good, represented by the possibility distribution 1/Good).
2. A single number (e.g., Age=28, represented by the possibility distribution 1/28).
3. A set of mutually exclusive possible scalar assignations (e.g., Behavior={Bad, Good}, represented by {1/Bad, 1/Good}).
4. A set of mutually exclusive possible numeric assignations (e.g., Age={20, 21}, represented by {1/20, 1/21}).
5. A possibility distribution in a scalar domain (e.g., Behavior={0.6/Bad, 1.0/Regular}).
6. A possibility distribution in a numeric domain (e.g., Age={0.4/23, 1.0/24, 0.8/25}, fuzzy numbers or linguistic labels).
7. A real number belonging to [0, 1], referring to a degree of matching (e.g., Quality=0.9).
8. UNKNOWN value, with possibility distribution {1/u : u ∈ U}, where U is the considered domain.
9. UNDEFINED value, with possibility distribution {0/u : u ∈ U}, where U is the considered domain.
10. NULL value, given by NULL={1/Unknown, 1/Undefined}.
regardless of how different they might look. In fact, it is based on the generalized fuzzy domain and the generalized fuzzy relation, which respectively include classic domains and classic relations. The original data types supported by this model are shown in Table 1.
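Some of the data types in Table 1 can be sketched as Python dictionaries mapping domain values to possibility degrees; the finite domain U used here is an assumption for the example:

```python
# An illustrative finite scalar domain for the attribute Behavior.
U = ["Bad", "Regular", "Good"]

# Rows 8 and 9 of Table 1: constant possibility distributions.
UNKNOWN = {u: 1.0 for u in U}    # every value of U is fully possible
UNDEFINED = {u: 0.0 for u in U}  # no value of U is applicable

behavior_scalar = {"Good": 1.0}                  # row 1: 1/Good
behavior_set = {"Bad": 1.0, "Good": 1.0}         # row 3: {1/Bad, 1/Good}
behavior_dist = {"Bad": 0.6, "Regular": 1.0}     # row 5: {0.6/Bad, 1.0/Regular}
```

This representation makes the intuition behind rows 8-10 visible: UNKNOWN says "it could be anything," UNDEFINED says "no value applies," and NULL (total ignorance) combines both possibilities.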
Preliminary Concepts

In order to implement a system which represents and manipulates "imprecise" information, Medina et al. (1995) developed the FIRST (fuzzy interface for relational systems) architecture, which has since been enhanced as FIRST-2 (Galindo, Urrutia, & Piattini, 2004b, 2006). It has been built on top of client-server DBMS architectures, such as Oracle1 and PostgreSQL2 (Galindo, 2007; Maraboli & Abarzua, 2006). It extends the existing structure and adds new components to handle fuzzy information. This architecture adds a server, named the FSQL server, which translates flexible queries written in FSQL into a language comprehensible to the host DBMS (SQL). FSQL is an extension of the popular SQL language designed to express fuzzy characteristics, especially in fuzzy queries, with many fuzzy concepts (fuzzy conditions, fuzzy comparators, fulfillment degrees, fuzzy constants, fuzzy quantifiers, fuzzy attributes, etc.). The first
versions of FSQL were developed during the last decade of the 20th century (Galindo et al., 1998; Medina et al., 1994b), and the most recent version is defined by Galindo et al. (2006). In the following subsections, we present this language and the supported fuzzy attribute types. The RDBMS (relational DBMS) dictionary or catalog, the part of the system that stores information about the data collected in the database and other information (such as users, data structures, data control, etc.), is extended in order to collect the necessary information related to the imprecise nature of the new data (fuzzy attributes, their types, their objects such as labels, quantifiers, etc.). This extension, named the fuzzy metaknowledge base (FMB), is organized following the prevailing philosophy of the host RDBMS catalog. In this chapter, we designate by fuzzy RDBMS (FRDBMS) the addition of the FSQL server and the FIRST-2 methodology to the RDBMS.
Fuzzy Attributes

In order to model fuzzy attributes, we distinguish between two classes: fuzzy attributes whose fuzzy values are fuzzy sets (or possibility distributions), and fuzzy attributes whose values are fuzzy degrees. Each class includes some
different fuzzy data types (Galindo et al., 2006; Urrutia, Galindo, & Piattini, 2002).
Fuzzy Sets as Fuzzy Values

These fuzzy attributes may be classified into four data types. This classification is performed taking into account the type of referential or underlying domain. In all of them, the values Unknown, Undefined, and Null are included:

• Fuzzy Attributes Type 1 (FTYPE1): These are attributes with "precise data," classic or crisp (traditional, with no imprecision). However, we can define linguistic labels over them, and we can use them in fuzzy queries. This type of attribute is represented in the same way as precise data, but it can be transformed or manipulated using fuzzy conditions. This type is useful for extending a traditional database, allowing fuzzy queries to be made about classic data, for example, enquiries of the kind "Give me employees that earn a lot more than the minimum salary."
• Fuzzy Attributes Type 2 (FTYPE2): These are attributes that gather "imprecise data over an ordered referential." These attributes admit, as Table 2 shows, both crisp and fuzzy data, in the form of possibility distributions over an underlying ordered domain (fuzzy sets). This is an extension of Type 1 that does now allow the storage of imprecise information, such as "he is approximately 2 metres tall." For the sake of simplicity, the most complex of these fuzzy sets are supposed to be trapezoidal functions (Figure 1).

Figure 1. Trapezoidal, linear, and normalized distribution function (membership rises linearly from 0 at a to 1 at b, stays at 1 between b and c, and falls linearly back to 0 at d, over the universe U)

• Fuzzy Attributes Type 3 (FTYPE3): These are attributes over "data of discrete non-ordered domain with analogy." In these attributes, some labels are defined (e.g., "blond," "red," "brown," etc.) that are scalars with a similarity (or proximity) relationship defined over them, so that this relationship indicates to what extent each pair of labels resemble each other. They also allow possibility distributions (or fuzzy sets) over this domain, for example, the value (1/dark, 0.4/brown), which expresses that a certain person is more likely to be dark-haired than brown-haired. Note that the underlying domain of these fuzzy sets is the set of labels, and this set is non-ordered.
• Fuzzy Attributes Type 4 (FTYPE4): These attributes are defined in the same way as Type 3 attributes, without it being necessary for a similarity relationship to exist between the labels.
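One plausible way to compare two Type 3 values (possibility distributions over a non-ordered label domain) for fuzzy equality is to take the best-matching pair of labels, limited both by the two possibility degrees and by the similarity between the labels. This is a sketch under that assumption, not the exact FSQL definition; the similarity degrees are made up:

```python
# Illustrative similarity relation over hair-colour labels.
SR = {("dark", "brown"): 0.6, ("dark", "blond"): 0.1, ("brown", "blond"): 0.4}

def similarity(l1, l2):
    if l1 == l2:
        return 1.0
    return SR.get((l1, l2), SR.get((l2, l1), 0.0))

def fuzzy_equal_type3(v1, v2):
    """Sup over label pairs of min(p1, p2, sr(l1, l2)): how possible it is
    that both distributions denote (similar) values."""
    return max(
        min(p1, p2, similarity(l1, l2))
        for l1, p1 in v1.items()
        for l2, p2 in v2.items()
    )

degree = fuzzy_equal_type3({"dark": 1.0, "brown": 0.4}, {"brown": 1.0})
```

For the example value (1/dark, 0.4/brown) compared against a crisp "brown," the best pair is (dark, brown) limited by their similarity 0.6, which beats the exact match on "brown" limited by its possibility 0.4.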
Fuzzy Degrees as Fuzzy Values

The domain of these degrees is usually the interval [0,1], although other values are also permitted, such as a possibility distribution (usually over this unit interval). The meanings of these degrees are varied and depend on their use, and the processing of the data will differ depending on the meaning. The most important possible meanings of the degrees used by some authors are the fulfillment degree, uncertainty degree, possibility degree, and importance degree. The most typical kind of degree is one associated to each tuple in a relation (Type 7), with the meaning of the membership degree of each tuple to the relation. Another typical degree is the fulfillment degree associated to each tuple in the resulting relation after a fuzzy query. In this volume, there are some chapters about these kinds of relations (see, for example, the ranked tables in the chapter by Belohlavek and Vychodil, or the fulfillment degrees in the chapter by Voglozin, Raschia, Ughetto, and Mouaddib).
Sometimes it is useful to associate a fuzzy degree to only one attribute (Type 5) or to only a concrete set of attributes (Type 6), for example, in order to measure the truth, the importance, or the vagueness. Finally, in some applications, a fuzzy degree with its own fuzzy meaning (Type 8) is useful in order to measure a fuzzy characteristic of each item in the relation, such as the danger of a medicine or the brightness of a concrete material.
Representation of Fuzzy Attributes The representation is different according to the fuzzy attribute type. Fuzzy attributes Type 1 are represented as usual attributes because they do not allow fuzzy values. Fuzzy attributes Type 2 need five (or more) classic attributes: One stores the kind of value (Table 2), and the other four store the crisp values representing the fuzzy value. Note
Table 2. Kind of values of fuzzy attributes Type 2

Number          Kind of values
0, 1, 2         UNKNOWN, UNDEFINED, NULL
3               Crisp: d
4               Label: label_identifier
5               Interval: [n, m]
6               Approximate value: d
7               Trapezoidal value: [a, b, c, d]
8               Approx. value with explicit margin: d±m
9, 10, 11, 12   Possibility distributions (different formats)
Table 3. Kind of values of fuzzy attributes Types 3 and 4

Number    Kind of values
0, 1, 2   UNKNOWN, UNDEFINED, NULL
3         Simple value: Degree/Label
4         Possibility distribution: Degree1/Label1 + ... + Degreen/Labeln
in Table 2 that trapezoidal fuzzy values (Figure 1) need the other four values. An approximate value (approximately d, d±margin) is represented with a triangular function centered in d (degree 1) and with degree 0 in d–margin and d+margin, where the value margin depends on the context, as we will see later. Other approximate values (number 8) use their own margin m. Finally, we can also represent possibility distributions in Type 2 attributes. Some of them (numbers 9 and 10) use only the four attributes defined previously, but we define here two new and more flexible possibilities:

• Number 11: Discontinuous possibility distribution, given as a list of points with the format p1/v1, …, pn/vn, where the pi are the possibility degrees and the vi are the values with such degrees. Note that we need 2n attributes (instead of the four) for storing a possibility distribution with n terms. The rest of the values have a degree of zero.
• Number 12: Continuous possibility distribution, given as a list of points with the same format p1/v1, …, pn/vn. Again, 2n attributes are needed for storing a possibility distribution with n terms, but now the stored possibility distribution represents a continuous, piecewise linear function: between vi and vi+1 there is a straight line joining each two consecutive points.
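The column layout described above — one classic attribute for the kind of value (Table 2) plus four value columns — can be sketched with a small encoder. The type codes follow Table 2, but which column holds which parameter for each code is an assumption made for illustration:

```python
def encode_type2(kind, *args, margin=5):
    """Encode a fuzzy value of a Type 2 attribute into five classic columns:
    (type_code, F1, F2, F3, F4). Layout per code is hypothetical."""
    if kind == "crisp":           # code 3: exact value d, other columns unused
        (d,) = args
        return (3, d, None, None, None)
    if kind == "approx":          # code 6: triangle d-margin, d, d+margin,
        (d,) = args               # with a context-dependent margin
        return (6, d - margin, d, d, d + margin)
    if kind == "trapezoid":       # code 7: the four points [a, b, c, d] of Figure 1
        a, b, c, d = args
        return (7, a, b, c, d)
    raise ValueError("unsupported kind: " + kind)

# "He is approximately 2 metres tall", stored in centimetres.
row = encode_type2("approx", 200)
```

Storing fuzzy values as a type code plus a fixed set of crisp columns is what lets the FMB approach sit on top of an unmodified relational engine: the host RDBMS only ever sees ordinary numeric columns.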
Fuzzy attributes Type 3 need 2n+1 attributes: one stores the kind of value (Table 3), and the others (2n) may store a possibility distribution, where n is the maximum number of elements (degree/label). Note in Table 3 that number 3 needs only two values, but number 4 needs 2n values. The value n must be defined for each fuzzy attribute Type 3, and it is stored in the FMB (see the following section). Fuzzy attributes Type 4 are represented just like Type 3; the difference between them is shown in the next section. Fuzzy degrees (Types 5, 6, 7, and 8) are represented using a classic numeric attribute because their domain is the interval [0,1].
The FSQL Language

The FSQL language (Galindo, 2005; Galindo et al., 2006; Galindo, Aranda, Caro, Guevara, & Aguayo, 2002) is a true extension of SQL which allows fuzzy data manipulation, such as fuzzy queries. This means that all statements valid in SQL are also valid in FSQL. In addition, FSQL incorporates some novelties to permit the inexact processing of information. This chapter will only provide a summary of the main extensions added to this language:

• Linguistic labels: If an attribute is capable of fuzzy treatment, then linguistic labels can be defined on it. These labels are preceded by the symbol $ to distinguish them easily. There are two types of labels, used with different fuzzy attribute types:
  1. Labels for attributes with an ordered underlying domain (fuzzy attributes Types 1 and 2): every label of this type has an associated trapezoidal possibility distribution in the FMB. This possibility distribution is generally trapezoidal, linear, and normalized, as shown in Figure 1.
  2. Labels for attributes with a non-ordered fuzzy domain (fuzzy attributes Types 3 and 4): here, a similarity relation may be defined between each two labels in the domain, and it should be stored in the FMB.
• Fuzzy comparators: Besides the typical comparators (=, >, etc.), FSQL includes all the fuzzy comparators shown in Table 4. As in SQL, fuzzy comparators compare one column with one constant or two columns of the same (or compatible) type. As possibility comparators are more general (less restrictive) than necessity comparators, necessity comparators retrieve fewer tuples, and these tuples necessarily comply with the condition (whereas with possibility comparators, the tuples only possibly comply with the condition, without any absolute certainty). It is necessary to note that fuzzy attributes Type 2 can be compared with crisp values, but always with the FSQL language.
• Function CDEG: The function CDEG (compatibility degree) may be used with an attribute as its argument. It computes the fulfillment degree of the condition of the query for the specific attribute in the argument. We can use CDEG(*) to obtain the fulfillment degree of each tuple (with all of its attributes, not just one of them) in the condition. If logic operators (NOT, AND, OR) appear in the condition, the calculation of this compatibility degree is carried out, by default, using the traditional negation, the minimum t-norm,
Table 4. Fuzzy comparators for FSQL (fuzzy SQL): 16 in the possibility/necessity family and 2 in the inclusion family

Possibility         Necessity             Significance
FEQ or F=           NFEQ or NF=           Possibly/Necessarily Fuzzy Equal to…
FDIF, F!= or F<>    NFDIF, NF!= or NF<>   Possibly/Necessarily Fuzzy Different from…
FGT or F>           NFGT or NF>           Possibly/Necessarily Fuzzy Greater Than…
FGEQ or F>=         NFGEQ or NF>=         Possibly/Necessarily Fuzzy Greater than or Equal to…
FLT or F<           NFLT or NF<           Possibly/Necessarily Fuzzy Less Than…
FLEQ or F<=         NFLEQ or NF<=         Possibly/Necessarily Fuzzy Less than or Equal to…
MGT or F>>          NMGT or NF>>          Possibly/Necessarily Much Greater Than…
MLT or F<<          NMLT or NF<<          Possibly/Necessarily Much Less Than…
FINCL               INCL                  Fuzzy Included in… / Included in…
and the maximum s-norm (or t-conorm), but the user may change these operators.

• Fulfillment thresholds: For each simple condition, a fulfillment threshold τ may be established (the default is 1).
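The default connectives used by CDEG for compound conditions can be sketched directly: traditional negation, the minimum t-norm for AND, and the maximum s-norm for OR. The per-predicate degrees below are illustrative values, as if already computed for one tuple:

```python
# Default CDEG connectives for compound fuzzy conditions.
def cdeg_not(x):
    return 1.0 - x          # traditional negation

def cdeg_and(x, y):
    return min(x, y)        # minimum t-norm

def cdeg_or(x, y):
    return max(x, y)        # maximum s-norm (t-conorm)

# Illustrative degrees for one tuple's simple predicates.
young, well_paid, big_budget = 0.7, 0.4, 0.9

# CDEG of: young AND (well_paid OR NOT big_budget)
degree = cdeg_and(young, cdeg_or(well_paid, cdeg_not(big_budget)))
```

Here the negated predicate contributes only 0.1, so the OR resolves to 0.4, and the final compatibility degree is min(0.7, 0.4) = 0.4; with a fulfillment threshold above 0.4, this tuple would be excluded from the answer.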
The FuzzyEER Model

A database design methodology has three phases: conceptual design, logical design, and physical design. This study looks at conceptual design, the first phase, during which the analysis of the database requirements takes place. The base requirements are independent of the data model that we use. For the conceptual design, the entity/relationship (ER) model or the enhanced ER (EER) model is usually used. Originally, the
conceptual level allows the use of elementary types of data which are called classical or crisp. These data types include numerical, alphanumerical, and binary data. However, the conceptual model does not always include these data types because they are usually not very important, so their definition is normally included in the data dictionary model. On the other hand, several works have been proposed in the literature to introduce fuzzy concepts into database modeling. The conceptual modeling tool used in this work is the fuzzy enhanced entity relationship (FuzzyEER) model (Galindo, Urrutia, Carrasco, & Piattini, 2004c; Galindo, Urrutia, & Piattini, 2004a, 2006; Urrutia et al., 2002). This model extends the enhanced entity relationship (EER) model with fuzzy semantics and fuzzy notations to represent imprecision and uncertainty in entities, attributes, and relationships using fuzzy sets and necessity-possibility measures. The basic concepts introduced in this model are fuzzy attributes, fuzzy entities, fuzzy relations, fuzzy degrees, and fuzzy constraints, to mention only a few here. We present in Example 5 (Figure 9) a simplified FuzzyEER conceptual schema. The book Fuzzy Databases: Modeling, Design and Implementation (Galindo et al., 2006) presents more details about this model.
A Migration Approach Towards FRDBs Designing a system that is able to make use of quantitative and qualitative data for real world applications is a challenging problem. Traditional systems produce representational descriptions that are often not very useful to the human expert. Using classic logic, it is possible to deal only with information that is totally true or totally false; it is not possible to handle information inherent to a problem that is imprecise or incomplete, but this type of information contains data that would allow a better solution to the problem. In the section titled Introduction to Fuzzy Sets, we saw that fuzzy logic is an extension of
the classic systems (Zadeh, 1992). Fuzzy logic is the logic behind approximate, rather than exact, reasoning. Its importance lies in the fact that many types of human reasoning, particularly reasoning based on expert knowledge, are by nature approximate. Note the great potential of membership degrees: they allow something qualitative (fuzzy) to be expressed quantitatively. Besides, better communication can be attained through fuzzy logic because of its ability to utilize natural language in the form of linguistic variables (Zadeh, 1975, 1983). Closer to our context, database technology has an extremely successful track record as a backbone of information technology throughout the last three decades. To introduce imprecise data, global information should be managed as fuzzy, and the best solution is to offer a smooth migration toward this technology. The migration towards FDBs, or fuzzy migration, does not only constitute the adoption of a new technology but also, and especially, the adoption of a new paradigm. Consequently, it constitutes a new culture of development of information systems. In fact, the fuzzy migration of information systems consists in modifying or replacing one or more of their components: database, architecture, interfaces, applications, and so forth; generally, the modification of one of these components can require modifications of some others. This chapter is about the migration from crisp (relational) databases towards FDBs in order to introduce imprecise information into current information systems. This fuzzy migration consists in deriving a new database from a legacy database and in adapting data, metadata, and the software components accordingly. This migration, generally due to the appearance of new needs in the enterprise, must answer these requirements while keeping the content of the information unaltered.
Once the legacy database has been migrated, the legacy programs must be changed in such a way that they access the new database instead of the old data. The definition of the fuzzy migration concept cited above may involve several problems, such as:
• Modifying the schemas requires very detailed knowledge of the data organization (data types, constraints, etc.).
• The source database is generally poorly documented.
• Establishing the correspondences between the two databases is difficult.
• The source and target database models can be incompatible.
• The values in the FMB (metadata) must be chosen after thorough studies.
• The communication protocols between the database and its applications are generally hidden.
• The administrator and at least some database users need some knowledge about fuzzy logic.
• Software using fuzzy information must be designed with care, especially if it will be used by regular users.
Related Work

Although information systems migration constitutes a very important research domain, there is a limited number of migration methods. For example, Tilley and Smith (1996) discuss current issues and trends in legacy system re-engineering from several perspectives (engineering, system, software, managerial, evolutionary, and maintenance). The authors propose a framework to place re-engineering in the context of evolutionary systems. The butterfly methodology (Wu, Lawless, Bisbal, Richardson, Grimson, Wade, & O'Sullivan, 1997) provides a migration methodology and a generic toolkit to aid engineers in the process of migrating legacy systems. Unlike the incremental strategy, this methodology eliminates the need for interoperability between the legacy and target systems. Closer to our subject, the Varlet project (Jahnke & Wadsack, 1999) adopts a process that consists of two phases. In the first one, the different parts of the original database are analyzed to obtain a logical schema for the implemented physical schema. In the second phase, this logical schema
is transformed into a conceptual one, which is the basis for modification or migration activities. The approach of Jeusfeld and Johnen (1994) is divided into three parts: mapping of the original schema into a metamodel, rearrangement of the intermediate representation, and production of the target schema. Some works also address the migration between two specific systems. Among those, Menhoudj and Ou-Halima (1996) present a method to migrate the data of a legacy system into an RDB management system. Behm et al. (1997) describe a general approach to migrate RDBs to object technology. Henrard et al. (2002) and Cleve (2004) present strategies to migrate data-intensive applications from a legacy data management system to a modern one, taking as an example the conversion of COBOL files into an SQL database. In the context of fuzzy databases, few fuzzy database implementations have been developed in real and running systems (Galindo et al., 2006; Goncalves & Tineo, 2006; Kacprzyk & Zadrożny, 1995, 1999, 2000, 2001; Tineo, 2000). More information about these approaches is available in this handbook in the chapters by Zadrożny et al. and Urrutia et al. However, we are not aware of studies about the migration to fuzzy databases from classical or regular ones (Ben Hassine, Ounelli, Touzi, & Galindo, 2007). This chapter covers this gap.
Presentation of Our Approach Basically, our approach consists in achieving some fuzzy characteristics in any existing database, mainly fuzzy queries. We study how to optionally migrate the data stored in RDBs towards FRDBs. This approach is addressed mainly to database administrators (DBA), and it is intended to meet the following requirements:

• to provide methodical support for the migration process,
• to assist the DBA in transforming relational schemas and databases,
• to allow the DBA to choose the attributes able to store imprecise data and/or be interrogated with flexible queries,
• to assist the DBA with the list of required metadata,
• to exploit the full set of FRDB features,
• to cover properties of the migration itself such as correctness and completeness.
We adopted three strategies in our migration approach, answering users’ needs:

• Partial migration: Its goal is to keep the existing data, schema, and applications. The main benefit of this migration is flexible querying, but some fuzzy data mining methods could also be implemented on crisp data.
• Total migration: Its goal is to store imprecise values and to benefit from flexible querying and fuzzy data mining on fuzzy data.
• Easy total migration: Its goal is to store imprecise values, to benefit from flexible querying and fuzzy data mining on fuzzy data, and to keep the existing data and applications with the minimum required modifications.
Consequently, all users’ needs are covered. In the first strategy, the existing schemas and data of the database remain unchanged; only the FMB and the FSQL server are added, to process the flexible queries. The second strategy aims at two operations in one: modeling imprecise data and interrogating the database with flexible queries. It requires a modification of the schemas, the data, and possibly the programs of the database. The third strategy is a mix of the previous strategies; the basic idea is simply to add the required fuzzy attributes, without replacing the previously existing classic attributes. This introduces a redundancy problem, but if space is not an issue, it is easy to manage.
Figure 2. Definition of labels on the FTYPE1 attribute Salary
Partial Migration The principle of this migration is based on the fact that everything that is valid in SQL remains valid in FSQL. This migration is destined for designers who want to keep their RDB while taking advantage of fuzzy queries. The existing schemas and data of the database remain unchanged. There is only the addition of at least two elements (Galindo et al., 2006):

1. The metabase, named FMB, consisting of 12 tables with metadata about the fuzzy information (for some applications, we can use fewer than these 12 tables).
2. The FSQL server, to process fuzzy queries. This server assures the translation of FSQL statements into SQL, a language the DBMS understands.
Managing Fuzzy Metadata At the level of the database, the tables of the FMB are created in order to store the information about the attributes likely to be interrogated by flexible queries. These attributes, and only these attributes, must be declared in the FMB as type FTYPE1. The choice of this type is justified by two reasons:
[Figure 2 plots the trapezoidal labels $Low, $Medium, and $High over the axis SALARY × 1000 €, with breakpoints around 0.85, 1.2, 1.7, and 2.2.]
A. Consistency: FTYPE1 attributes store only the previously existing crisp data. We must respect the following rules:

• All the attributes remain unchanged, including the identifier attributes.
• The quantifiable attributes (with an ordered domain on which trapezoidal possibility distributions can be defined, that is, of type number, real, etc.) which will be interrogated by fuzzy queries must be declared as FTYPE1 in the FMB: The attributes are not modified, but they must be included in the FMB as FTYPE1.
• The fuzzy attribute characteristics, detailed in Table 6, rows 1 and 2, should be included in the FMB. This is optional, but it is mandatory if we want to use such characteristics.

B. Flexibility: Fuzzy attributes FTYPE1 permit fuzzy queries allowing the use of:

• Fuzzy constants (see The FSQL Language section) like linguistic labels ($Hot), approximate values (#30), the
values UNKNOWN, UNDEFINED, and NULL, and trapezoidal possibility distributions, as well as, of course, crisp values.
• Fuzzy comparators, useful in this kind of attribute (the entire Table 4).
• Fulfillment thresholds, fuzzy set operators (fuzzy union, fuzzy intersection, and fuzzy minus), and functions to modify fuzzy constants (concentration, dilatation, contrast intensification, etc.).

Table 5. Examples of fuzzy querying on a fuzzy attribute Type 1 (Salary)

Type          | Query                                          | Meaning
------------- | ---------------------------------------------- | --------------------------------------------------------
Classic (SQL) | SELECT * FROM Employee WHERE Salary = 2200;    | List of the employees with a salary equal to 2200 euros.
Fuzzy (FSQL)  | SELECT * FROM Employee WHERE Salary FEQ #2200; | List of the employees with a salary near to 2200 euros.
Fuzzy (FSQL)  | SELECT % FROM Employee WHERE Salary FEQ $High; | List of the employees with a high salary.
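The semantics of these fuzzy constants can be made concrete with a small sketch. This is not the FSQL implementation, only an illustration of how a trapezoidal label and an approximate constant (#v) could be evaluated; the parameters chosen for $High below are invented for the example:

```python
def trapezoid(x, a, b, c, d):
    """Membership degree of x in the trapezoidal distribution (a, b, c, d):
    0 outside [a, d], 1 on [b, c], linear on both slopes."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def feq_approx(x, v, margin):
    """Degree to which x is 'approximately equal' to v (the #v constant),
    modeled here as a triangular distribution of half-width margin."""
    return max(0.0, 1.0 - abs(x - v) / margin)

# Hypothetical label $High over salaries in euros (parameters assumed,
# with the right side left open via a very large bound).
def high(salary):
    return trapezoid(salary, 1700, 2200, 10**9, 2 * 10**9)

print(high(2000))                 # partial membership on the rising slope
print(feq_approx(2150, 2200, 100))
```

A query like `Salary FEQ $High THOLD 0.5` would then retain only the rows whose computed degree reaches the fulfillment threshold.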
It is necessary to note at this level that the use of these different concepts is governed by several rules. For example, the linguistic labels defined for FTYPE1 attributes must belong to a numerical domain, which permits defining the associated trapezoidal possibility distributions in the FMB.

Example 2: The attribute Salary can be transformed into a fuzzy attribute Type 1. We can also define the labels $Low, $Medium, and $High, for example, as the trapezoidal possibility distributions described in Figure 2. This attribute is quantifiable and has, for example, euros as its unit of measure. An attribute Productivity may not be quantifiable: It can take the values “bad,” “regular,” and “good,” which cannot be represented by trapezoidal functions. For this reason, we cannot transform it into a fuzzy attribute FTYPE1. We will show in the following section that this attribute can be FTYPE3.

Figure 3. FDB architecture
[Classic statements go directly to the RDBMS; flexible statements pass through the FSQL server; the database comprises the DB and the FMB.]
Example 3: Table 5 shows three types of queries (classic and fuzzy) that can be applied to the attribute Salary (FTYPE1) of Example 2. It should be noted that the FSQL queries must be preceded by the creation of the label $High and of the margin for approximate values defined on this attribute in the FMB. Finally, the partial migration allows three extra characteristics that may be very useful for many enterprises:

• Fuzzy degrees (Table 6, row 4): We can add different fuzzy degrees to each table. These attributes were explained in the subsection titled Fuzzy Degrees As Fuzzy Values.
• Fuzzy quantifiers (Table 6, row 5): We could answer questions like “Give me the departments in which most of their employees have a high salary,” and we can add a minimum threshold to the condition and also to the quantifier. Note that in this example, the quantifier “most” is associated with the context of the employee table. FSQL defines four ways to use fuzzy quantifiers. See Galindo et al. (2006) for details about fuzzy quantifiers in FSQL.
• Fuzzy data mining: The definition of fuzzy attributes Type 1 allows using many fuzzy data mining methods. In this volume, there is a chapter by Feil and Abonyi summarizing these methods. In particular, FSQL is a powerful tool for these purposes (Carrasco, Vila, & Galindo, 2003; Galindo et al., 2006), as shown in the chapter by Carrasco et al. in this handbook.
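To illustrate how such a quantified condition might be evaluated, here is a sketch based on Zadeh's sigma-count approach. The definition of "most" below is an assumption for the example, not FSQL's stored definition:

```python
def most(p):
    """Hypothetical relative quantifier 'most': unfulfilled below a
    proportion of 0.3, fully fulfilled above 0.8, linear in between."""
    return min(1.0, max(0.0, (p - 0.3) / 0.5))

def department_degree(high_salary_degrees):
    """Degree to which 'most employees have a high salary' holds for one
    department: apply the quantifier to the sigma-count proportion
    (sum of membership degrees divided by the cardinality)."""
    p = sum(high_salary_degrees) / len(high_salary_degrees)
    return most(p)

# Membership of each employee's salary in $High for one department.
print(department_degree([1.0, 0.9, 0.8, 0.4, 0.0]))  # proportion 0.62 -> ~0.64
```

A threshold on the quantifier would then discard departments whose resulting degree is too low.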
The FSQL Server At the level of the DBMS, the FSQL server is placed on top of the DBMS to process the flexible statements (queries, deletions, updates, etc.) written in the FSQL language. Figure 3 shows the DBMS architecture with the FSQL server.
Total Migration This strategy offers, in addition to flexible querying, the possibility to store imprecise data at the level of the fuzzy attributes. Therefore, it will be a total migration towards the FRDB. Contrary to the previous case, the attributes likely to store imprecise data are of types FTYPE2, FTYPE3, and FTYPE4, and besides, some degrees may be included (types 5-8). The modification concerns only the tables that define or reference these attributes. This strategy comprises three main steps: (Step 1) schemas conversion, (Step 2) data conversion, and (Step 3) programs conversion. Note that this type of decomposition has already been used by Henrard et al. (2002); Henrard, Cleve, and Hainaut (2004); Cleve (2004); and Cleve, Henrard, and Hainaut (2005) to integrate some systems at the time of the reengineering of a database. Figure 4 shows, in the left part, the main parts of the legacy system, comprising programs that interact with the legacy data through the legacy DBMS and through the legacy schema. The right part shows the state of the new system after the legacy DBMS has been extended with the FSQL server (fuzzy DBMS) and the database with the FMB. The new database comprises the converted schema and data that have been transformed and migrated according to the new schema. Legacy programs have been transformed in such a way that they now access the data through the API of the new technology and through the new schema. When the converted system is deployed, new programs can be developed that use the database through the native interface of the new fuzzy DBMS. Later on, if and when needed, the legacy programs could be rewritten according to the new technology.

Figure 4. System conversion
[Left: legacy programs access the legacy data (crisp) through the legacy DBMS and the legacy schema. Right: legacy and new programs access the new data (crisp + fuzzy) and the FMB through the fuzzy DBMS (FSQL server + RDBMS) and the new schema.]
Step 1: Database Schemas Migration The schema conversion is the translation of the legacy database structure, or schema, into an equivalent database structure expressed in the new technology (Henrard et al., 2002). In our context, it consists in modifying the table schemas with fuzzy attributes which are going to store fuzzy values. Moreover, the FMB, which stores the fuzzy
Figure 5. Physical schema conversion
[The source DDL script (SQL) is analyzed to extract the source physical schema (SPS); following Algorithm 1, the fuzzy attributes are modified and the FMB tables are updated, yielding the fuzzy physical schema (FPS), which is coded into the target DDL script (FSQL).]
attributes information (linguistic labels, similarity relations, etc.), is created. In fact, the FMB is joined to the data dictionary in order to know the data types, the fuzzy objects defined over the fuzzy attributes, and so forth. Hence, the FMB extends the data dictionary in order to treat the new fuzzy attributes and their objects. During this process, the source physical schema (SPS) of the RDB is extracted and then transformed into a corresponding physical schema in the fuzzy DBMS. The new physical schema is used to produce the DDL (data definition language) code of the new FRDB. In this section, we present two strategies of transformation (physical schema conversion and conceptual schema conversion).
Physical Schema Conversion The physical schema conversion consists of analyzing the DDL code of the source RDB in order to find its physical schema. The DDL code may be obtained from the data dictionary, and it expresses the source physical schema. This relational schema will be converted into the fuzzy DBMS by modifying attributes and adding the FMB. Table 6 shows the information stored in the FMB for each fuzzy attribute. Figure 5 illustrates the process of this transformation: The source DDL script written in SQL is parsed to extract the physical schema. This schema includes all the data structures and constraints explicitly declared in the DDL code, which will be converted to the target “fuzzy” physical schema (FPS) according to Algorithm 1. This conversion returns the target DDL script, the new schema, which is easy to create using SQL. The latter is then executed to generate the FRDB. During this conversion, some classic attributes are transformed into fuzzy ones. Algorithm 1 presents general ideas about this transformation at the level of the database (modification of the structure of tables with fuzzy attributes) and at the level of the FMB (updating tables with fuzzy metadata: labels, similarity relations, quantifiers, etc.). Note that this strategy can use simple tools only, such as a DDL parser (to extract the SPS), an elementary schema converter (to transform the SPS into the FPS), and a DDL generator.
Table 6. Fuzzy metadata stored in the FMB

1. Information concerning fuzzy attributes (see Fuzzy Attributes section):
   • Fuzzy type, unit of measurement, comments, etc.
2. Fuzzy attributes FTYPE1 and FTYPE2:
   • Fuzzy objects defined on these attributes: trapezoidal labels, qualifiers, quantifiers, etc.
   • Margin for approximate values (the meaning of #n in the context of each attribute).
   • MUCH value: The minimum distance in order to consider two values as very separated; this is necessary for the comparators MGT, NMGT, MLT, and NMLT (see Table 4).
   • n value: Maximum number of points in the possibility distributions of the new types 11 and 12 of Table 2.
3. Fuzzy attributes FTYPE3 and FTYPE4:
   • Fuzzy objects defined on these attributes: linguistic labels, qualifiers, quantifiers, etc.
   • n value: Maximum number of elements in the possibility distributions (see Table 3).
   • Compatible attributes.
   • Similarity measures between labels, only for FTYPE3.
4. Fuzzy degrees:
   • Meanings of these fuzzy degrees (importance, membership, etc.).
   • Association: table, column, set of columns, or without association.
5. Quantifiers associated with tables or general quantifiers for the whole system.
Algorithm 1. Physical schema transformation

Input: SPS (RDB)
Output: FPS (FRDB)
Begin
  Create the FMB tables.
  for each attribute A of SPS do
    if A remains classic then
      no modification in its definition;
    else { /* This treatment is divided in two sub-treatments: in the database and in the FMB */
      switch (type of A) { /* Modify the table structures according to each fuzzy type */
        case FTYPE1:
          A remains unchanged.
        case FTYPE2:
          create at least 5 attributes with the following names:
            concatenate the first one with the letter 'T': AT;
            concatenate the other ones respectively with 1, 2, 3, 4, ..., 2n: A1, A2, A3, A4, ...;
            /* Note that 2n must be at least 4. Then the minimum value for n is 2. */
        case FTYPE3 and FTYPE4:
          create 2n+1 attributes with the following names:
            concatenate the first one with the letter 'T' (AT);
            for each i = 1, ..., n do { /* n = max. number of elements in the pos. distributions */
              concatenate with Pi and with i; /* (AP1, A1, AP2, A2, ..., APn, An) */
            }
      }
      Update the FMB tables with the metadata about all fuzzy attributes: see Table 6.
    }
End.
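The column-renaming scheme of Algorithm 1 can be sketched as a small helper function. This is an illustration of the naming rules above, not part of any FSQL tool:

```python
def fuzzy_columns(name, ftype, n=2):
    """Return the column names Algorithm 1 creates for one attribute.
    FTYPE1 keeps the column unchanged; FTYPE2 gets AT plus 2n value
    columns (with 2n >= 4); FTYPE3/FTYPE4 get AT plus n (degree, value)
    pairs AP1, A1, ..., APn, An."""
    if ftype == 1:
        return [name]
    if ftype == 2:
        return [name + "T"] + [f"{name}{i}" for i in range(1, max(2 * n, 4) + 1)]
    if ftype in (3, 4):
        cols = [name + "T"]
        for i in range(1, n + 1):
            cols += [f"{name}P{i}", f"{name}{i}"]
        return cols
    raise ValueError(f"unknown fuzzy type: {ftype}")

print(fuzzy_columns("Age", 2))                # ['AgeT', 'Age1', 'Age2', 'Age3', 'Age4']
print(fuzzy_columns("Productivity", 3, n=1))  # ['ProductivityT', 'ProductivityP1', 'Productivity1']
```

The two printed results match the converted EMPLOYEE schema of Figure 6.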
Figure 6. Example of physical schema conversion at the level of the database

Relational physical schema (RDB):
  EMPLOYEE (Matriculate, Name, Salary, Age, Productivity)
  PROJECT (Num_project, Name_project, Budget)
  WORKS_ON (Matriculate, Num_project, Nb_hours)

Fuzzy relational physical schema (FRDB):
  EMPLOYEE (Matriculate, Name, Salary, AgeT, Age1, Age2, Age3, Age4, ProductivityT, ProductivityP1, Productivity1)
  PROJECT (Num_project, Name_project, BudgetT, Budget1, Budget2, Budget3, Budget4)
  WORKS_ON (Matriculate, Num_project, Nb_hours)
Algorithm 1 modifies the database schema for each fuzzy attribute. Note that we do not use the DDL statements of FSQL, because we want to show the inner schema in this conversion.

Example 4: Suppose the physical schema is constituted of the tables EMPLOYEE, PROJECT, and WORKS_ON. Figure 6 shows the modifications done at the level of the fuzzy attributes:

• Salary: FTYPE1 with the linguistic terms low, medium, and high.
• Age: FTYPE2 with the linguistic terms young, adult, and old.
• Budget: FTYPE2 with the linguistic terms small, medium, and big.
• Productivity: FTYPE3 with the linguistic terms bad, regular, and good.
Note that, for example, the attribute Budget is now represented with five classic attributes, and the Productivity attribute is transformed into three attributes. Figure 7 shows the modifications done
Figure 7. Example of physical schema conversion at the level of the FMB
Figure 8. Conceptual schema conversion
[DBRE path: the source DDL script (SQL) is analyzed to extract the SPS; schema refinement and conceptualization yield the source conceptual schema (SCS). FDB design path: schema conversion (fuzzification of some attributes) produces the fuzzy conceptual schema (FCS), which is converted to the fuzzy physical schema (FPS) and coded into the target DDL script (FSQL).]
Figure 9. Examples of conceptual schema conversion
[(a) Legacy conceptual schema with entities EMPLOYEE (Matriculate, Name, Salary, Age, Productivity), DEPARTMENT (Num_dep, Name_dep), and PROJECT (Num_project, Name_project, Budget), linked by the relationships Works_For, Controls, and Works_On with their (min, max) cardinalities. (b) FuzzyEER schema: Salary becomes T1 {low, medium, high}, Age becomes T2 {young, adult, old}, Productivity becomes T3 {bad, regular, good}, and Budget becomes T2 {small, medium, big}; the relationship Works_On carries the fuzzy (min, max) constraints (0, approx 20 [0.25, 0.75]) on the EMPLOYEE side and (approx 2, approx 30 [0.25]) on the PROJECT side.]
Figure 10. Data migration
[(1) Data extraction from the RDB; (2) data conversion (fuzzification of some data), driven by the conceptual schema conversion, new needs, and an expert; (3) data storage in the FRDB.]
in the FMB tables. We include here a brief explanation of these tables: Table FCL stores all the fuzzy attributes (type, unit of measurement, etc.). Table FAM stores the margin for approximate values and the minimum distance for two values to be considered very separated, for all FTYPE1 and FTYPE2 attributes. Table FOL includes all fuzzy objects belonging to all fuzzy attribute types (only labels in our example). Table FLD includes the definition of trapezoidal labels for FTYPE1 and FTYPE2 attributes (the four basic values, as in Figure 1). Table FND stores the similarity relations between the labels of each FTYPE3 attribute. Of course, the FMB needs more tables with different information (Galindo et al., 2006).
Conceptual Schema Conversion The conceptual schema conversion consists in extracting the physical schema of the legacy database (SPS) and transforming it into its corresponding conceptual schema through a database reverse engineering (DBRE)3 process. Figure 8 describes this process. First of all, the source DDL script written in SQL is parsed in order to extract the source physical schema (SPS). The latter is refined through an in-depth inspection of the way the programs use and manage the data. The final DBRE step is the conceptualization, which interprets the physical schema into the source conceptual schema (SCS). Then, an FDB design phase performs a schema conversion, introducing new concepts (fuzzy constraints, fuzzy attributes, etc.) (Galindo et al., 2004a, 2004c, 2006; Urrutia et al., 2002). This transformation produces the fuzzy conceptual schema (FCS), and then the fuzzy physical schema (FPS), which is coded in the final DDL script of the FRDB. At this level, we have two options:
Algorithm 2. Data transformation

Input: Source data (RDB)
Output: Target data (FRDB)
Begin
  for each attribute A of the DML script do
    for each value v in A (for each row in the database table) do
      switch (type of A) {
        case Classic and FTYPE1:
          insert v in its reserved field (no change);
        case FTYPE2:
          if v = NULL then AT=2 and A1=A2=A3=A4=NULL;
          else /* v is a crisp value */ AT=3, A1=v and A2=A3=A4=NULL;
        case FTYPE3 and FTYPE4:
          if v = NULL then AT=2 and AP1=A1=NULL;
          else /* v is a crisp value as text-label */ AT=3, AP1=1 and A1=v;
      }
End.
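Algorithm 2's case analysis can be sketched as a small conversion function. This is a minimal illustration of the algorithm above (the AT codes 2 = NULL and 3 = crisp follow Table 2 as used in the text; `None` stands for NULL):

```python
def convert_value(ftype, v):
    """Map one legacy value to the tuple stored in the new fuzzy
    columns, following Algorithm 2."""
    if ftype in ("classic", 1):
        return (v,)                          # value kept unchanged
    if ftype == 2:                           # columns (AT, A1, A2, A3, A4)
        if v is None:
            return (2, None, None, None, None)
        return (3, v, None, None, None)      # crisp value
    if ftype in (3, 4):                      # columns (AT, AP1, A1)
        if v is None:
            return (2, None, None)
        return (3, 1, v)                     # crisp text label, degree 1
    raise ValueError(f"unknown fuzzy type: {ftype}")

print(convert_value(2, 50))    # Habib's age in Table 8: (3, 50, None, None, None)
print(convert_value(3, "good"))
```

Applying this function to every row of every fuzzified column yields exactly the automatic part of the data conversion; the expert-driven fuzzification described later replaces some of these tuples by labels or distributions.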
Table 7. Table EMPLOYEE source

Matriculate | Name        | Salary | Age  | ···
----------- | ----------- | ------ | ---- | ---
005201      | Habib       | 2000   | 50   | ···
005202      | Mohamed Ali | 2000   | 45   | ···
005203      | José        | 2100   | 40   | ···
005204      | Amel        | 2200   | NULL | ···

1. To CREATE new tables, copy the values from the old tables and, optionally, drop the old tables and rename the new ones (the best option).
2. To ALTER the old tables and preserve the current information: We must store the old attribute values (which will be converted into fuzzy attributes) in temporary tables, modify the structure of these tables, convert the old attribute values to fuzzy attribute values, and copy them according to the new fuzzy attribute structure (Tables 2 and 3).
It is necessary to note that during the database design step, the choice of the most suitable fuzzy attribute type is a delicate task. The presence of an expert in FRDB design is strongly advised due to the complexity of assimilating the different fuzzy concepts (Ben Hassine, Touzi, & Ounelli, 2007).

Example 5: Figure 9 shows an example of conceptual schema conversion from a legacy conceptual schema to a FuzzyEER schema. Attribute Salary is transformed to FTYPE1, Age and Budget to FTYPE2, and Productivity to FTYPE3. The other attributes and primary keys remain unchanged. The constraints of the source ER schema can also be transformed to fuzzy constraints using the fuzzy (min, max) notation, as presented in the relationship Works_On. For example, on the PROJECT side, the (min, max) constraint indicates that in each project a minimum of approximately two employees must work (with a minimum degree of 0.5). At the same time, the number of employees in each project must not exceed approximately 30 (with a minimum degree of 0.25).
Step 2: Data Migration The data conversion consists in transforming the data of the RDB (crisp) to the format of the data defined by the fuzzy schema. It involves data transformations that materialize the schema transformations described above. These transformations include three stages, as shown in Figure 10. The first step consists in extracting the data from the database. This extraction takes into account the different kinds of constraints, data coherence, security, and so forth. The second step consists in converting these data in such a way that their structures coincide with the new format of the target FRDB schema. This transformation is made automatically using Algorithm 2, or manually using expert knowledge, as we will explain later. Finally, the data will be stored in the FRDB. However, during this transformation, there is a modification in the data representation for fuzzy attributes at the level of the database tables (see Tables 2 and 3 and the example in Figure 6) while introducing the data values according to their types
Table 8. Example of data conversion of the table EMPLOYEE for attribute FTYPE2 Age

Matriculate | Name        | Salary | AgeT | Age1              | Age2 | Age3 | Age4 | ···
----------- | ----------- | ------ | ---- | ----------------- | ---- | ---- | ---- | ---
005201      | Habib       | 2000   | 3    | 50                | NULL | NULL | NULL | ···
005202      | Mohamed Ali | 2000   | 4    | 1 (Id. for Adult) | NULL | NULL | NULL | ···
005203      | José        | 2100   | 6    | 40                | 30   | 50   | 10   | ···
005204      | Amel        | 2200   | 0    | NULL              | NULL | NULL | NULL | ···
(label: 4, approximate values: 6/8, unknown: 0, possibility distribution: 9-12, etc.), and at the level of the FMB tables while introducing their parameters (four parameters for trapezoidal labels, etc.). It should be noted that if we want to “fuzzify” some previous data, then the transformation is not automated. Fuzzy information may be more realistic than crisp information. For this reason, the intervention of an expert in FRDB design and in the database domain is strongly advised in order to choose the most suitable type among the different types of fuzzy values mentioned previously (Ben Hassine et al., 2007). Sometimes, the crisp data can be kept. In other cases, they will be transformed, using some standard rules, into linguistic terms, intervals, approximate values, and so forth. In some contexts, especially, NULL values may be transformed into UNKNOWN values. Algorithm 2 presents the automatic modification. This step shows the advantages of the migration towards FRDBs in terms of imprecise data modeling. It is important to note that in the legacy database, all attributes are crisp: The translation is then very easy; we have to choose between two cases, NULL or crisp. Another easy transformation is, for example, to store approximate values for FTYPE2 attributes: AT=6, A1=v, A2=v−margin, A3=v+margin, and A4=margin (the values in A2, A3, and A4 are stored only for efficiency).

Example 6: Suppose the relation Employee described in Table 7, with the schema in Figure 6, where we assigned FTYPE1 to the attribute Salary and FTYPE2 to Age. Since the attribute Salary does not undergo any transformation at the level of the database, only the attribute Age is transformed, as shown in Table 8. All fuzzy data are represented by a number. Also, every fuzzy object has an identifier in the FMB. In our example, the attribute AgeT stores the kind of data stored about the Age (see Table 2). The parameters of these data are stored in the remaining attributes.
• The employee Habib keeps the crisp value 50 years. This value (50) and its type (3) are stored respectively in the fields Age1 and AgeT.
• The age of Mohamed Ali receives the linguistic label adult, which must first be created (with the command CREATE LABEL) and stored in the FMB, as shown in Figure 7. The identifier of this label will be stored in the field Age1, while its type (4) is specified in the field AgeT.
• The employee José stores an approximate value of 40 years old with a margin of 10. We store 40 in the field Age1 and the type of this kind of value (6) in AgeT.
• The age of Amel stores the NULL value, but in this case we can translate it to UNKNOWN (because we know that every person has some age).
Step 3: Programs Migration The modification of the structure of the database requires, in the majority of cases, propagation to the level of the related programs. Since this is not the primary focus of this work, we present only an overview of programs conversion. In fact, if we want to use flexible queries and to store imprecise data, the communication between programs and the database must go through the FSQL server. The programs must be modified according to the representation, interrogation, and storage of the new data. Moreover, we must decide what to do with fuzzy values in each program. Note that migrating these programs not only means converting DBMS calls in programs, but also requires the reengineering of imperative programs to accept fuzzy values and surely the reconstruction of user interfaces. We draw our inspiration from the strategies of programs conversion proposed by Henrard et al. (2002, 2004) and Cleve (2004). One of these strategies relies on wrappers that encapsulate the FRDB. This strategy allows interaction of the application programs with the legacy data access logic through these wrappers instead of the legacy DBMS. These wrappers are in the form of a software layer permitting the translation of the previous statements written in SQL to the FSQL language (fuzzification in the statement, not in the data). The FSQL server translates in turn these queries to SQL language (defuzzification in the statement). The returned fuzzy answers will be defuzzified in order to be treated by the application programs. This process is depicted in Figure 11.

Figure 11. Programs migration based on wrappers
[Legacy programs issue standard crisp statements; the wrapper fuzzifies them (SQL ⇒ FSQL); the FSQL server defuzzifies each statement (FSQL ⇒ SQL) for the RDBMS, which carries it out against the FRDB + FMB; the resulting data are defuzzified before being returned to the legacy programs.]

The second strategy consists in rewriting the access statements in order to make them process the new data through the new fuzzy DBMS-DML. Each DML statement must be located, its parameters must be identified, and the new fuzzy DML statement sequence must be defined and inserted in the code. This task may be complex, because the legacy program will manage imprecise data instead of the legacy crisp data. The third strategy generalizes the problems met in the second strategy. In fact, the change of paradigm when moving from standard crisp data in the RDB to imprecise data in the FRDB induces problems such as whether the user now wants to use fuzzy information in FSQL statements, and the manipulation of the imprecise data returned by these statements. The program is rewritten in order to use the new fuzzy DBMS-DML at its full power and take advantage of the new data system
features. This strategy is much more complex than the previous one, since every part of the program may be influenced by the schema transformation. The most obvious steps consist of:

1. Identifying the statements and the data objects that depend on these access statements,
2. Deciding whether each statement will now be fuzzy or not,
3. Rewriting these statements and redefining their data objects, and
4. Treating the possible fuzziness in the returned answers.
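A wrapper of the first strategy could, for instance, rewrite crisp predicates into fuzzy ones before forwarding them to the FSQL server. The sketch below is a deliberately naive illustration (a real wrapper would parse the statement rather than use a regular expression; the set of fuzzy attributes and the FEQ rewriting rule are assumptions for the example):

```python
import re

def fuzzify_statement(sql, fuzzy_attrs):
    """Rewrite crisp equality comparisons on known fuzzy attributes into
    FSQL approximate comparisons (FEQ #value); other predicates are
    left untouched."""
    def repl(m):
        attr, value = m.group(1), m.group(2)
        if attr in fuzzy_attrs:
            return f"{attr} FEQ #{value}"
        return m.group(0)
    return re.sub(r"(\w+)\s*=\s*(\d+)", repl, sql)

print(fuzzify_statement("SELECT * FROM Employee WHERE Salary = 2200", {"Salary"}))
# SELECT * FROM Employee WHERE Salary FEQ #2200
```

The rewritten statement is then handed to the FSQL server, which performs the defuzzification into plain SQL for the underlying RDBMS, as described above.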
Easy Total Migration As we have shown, the migration of programs may be a very hard task in this process, but it is mandatory and essential in the total migration. With the easy total migration, we achieve a total fuzzy database (storing fuzzy values, fuzzy querying, and fuzzy data mining on fuzzy data) and also keep the existing data and applications with the minimum required modifications. The basic idea is to mix partial and total migration; that is, fuzzy attributes with fuzzy values are duplicated: one copy with fuzzy values and the other with only crisp values. In this process, we use the three
steps of the total migration with some modifications: (Step 1) schemas conversion, (Step 2) data conversion, and (Step 3) programs conversion. Steps 1 and 2 are the same, but now we preserve the old attributes. For example, if the Age attribute is fuzzified and converted to FTYPE2, then we preserve the existing Age attribute and add the new attributes AgeT, Age1, Age2, Age3, and Age4 (as in Table 8). The program conversion is now easier, but we must manage the new fuzzy attributes in some DML statements in order to keep the legacy programs running exactly as before the migration:

1. SELECT: No modifications required (except if the SELECT uses the asterisk, *, because it represents all the attributes and in the new FRDB there are more attributes).
2. DELETE: No modifications required.
3. INSERT: The values of the fuzzified attributes must also be inserted in the same row in the corresponding new fuzzy attributes (Algorithm 2).
4. UPDATE: The values of the fuzzified attributes must also be updated in the same row in the corresponding new fuzzy attributes (Algorithm 2).

The main drawback of this migration strategy is the redundancy in the fuzzified attributes (except for FTYPE1). The main advantage is the easy program migration. In some situations, this is the best option, using this strategy as an intermediate and temporary step towards a total migration.
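The INSERT/UPDATE duplication could even be pushed into the database itself. The following sketch (a hypothetical schema and trigger, using SQLite purely for illustration; the chapter does not prescribe this mechanism) keeps the duplicated fuzzy Age columns synchronized so that a legacy INSERT keeps working unchanged:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Employee (
  Matriculate TEXT PRIMARY KEY, Name TEXT, Age INTEGER,
  AgeT INTEGER, Age1, Age2, Age3, Age4   -- duplicated fuzzy columns
);
-- After a legacy INSERT, fill the fuzzy columns following Algorithm 2:
-- AT=2 for NULL, AT=3 plus the crisp value otherwise. A similar
-- AFTER UPDATE OF Age trigger would cover legacy UPDATE statements.
CREATE TRIGGER emp_age_fuzzy AFTER INSERT ON Employee BEGIN
  UPDATE Employee
  SET AgeT = CASE WHEN NEW.Age IS NULL THEN 2 ELSE 3 END,
      Age1 = NEW.Age
  WHERE Matriculate = NEW.Matriculate;
END;
""")
con.execute("INSERT INTO Employee (Matriculate, Name, Age) "
            "VALUES ('005201', 'Habib', 50)")
row = con.execute("SELECT AgeT, Age1 FROM Employee").fetchone()
print(row)  # (3, 50)
```

With such a mechanism, legacy programs never see the new columns, while flexible queries through the FSQL server can use them immediately.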
Conclusion Several real applications need to manage imprecise data and to let their users benefit from flexible querying (Bosc et al., 1998). Several theoretical solutions have been proposed (Bosc & Pivert, 1995; Buckles & Petry, 1982; Galindo, 1999; Medina et al., 1994; Umano, 1982; Zemankova-Leech & Kandel, 1985). Unfortunately, the repercussions
of these works on the practical level are negligible, even with the existence of some prototypes such as the FSQL server (Galindo et al., 1998, 2006). In this chapter, we proposed a migration approach from RDBs towards the generation of FRDBs. This approach is addressed mainly to database administrators and enterprises interested in such a fuzzy migration. We proposed three possible strategies for this migration. The first strategy, called partial migration, enjoys some of the advantages of FRDBs while preserving the existing schema, data, and applications. The second strategy, named total migration, consists in benefiting from all FRDB advantages (imprecise data storage, flexibility in queries, etc.). It involves a mapping of the existing data, schemas, and programs, while integrating the different fuzzy concepts. This strategy, based on the Henrard et al. (2002) approach, has three levels of conversion: conversion of the physical schema, which generally can be preceded by a conceptual schema conversion; data conversion; and applications conversion. We studied the first two levels in detail. The applications conversion, or migration of programs, may be a very hard task in this process, but it is mandatory and essential in the total migration and must be carried out with experts in fuzzy databases and in the database domain. The third strategy, called easy total migration, is a mixture of the previous two strategies, generally used as a temporary step. This option is easier and faster than the total migration and may be a good option in order to carry out the migration of programs slowly but steadily. As for perspectives on the future of this work, we mention (1) the automation of the conversion of the schema, data, and applications, following the algorithms defined in this chapter, and (2) the addition of an expert system to help the designer choose the appropriate fuzzy attribute types and other fuzzy objects (labels, quantifiers, etc.).
This last point would also make the use of FRDBs easier in real applications. In fact, one of the problems encountered during FRDB design is to determine the attributes (columns) likely to store fuzzy
How to Achieve Fuzzy Relational Databases Managing Fuzzy Data and Metadata
data and to choose their respective data types. The type assignment to these attributes is not an obvious task. This choice requires the designer, on the one hand, to know in detail the properties of every fuzzy attribute type and, on the other hand, to properly characterize the attribute in order to assign it the most suitable of the four fuzzy attribute types mentioned previously. We started working to solve this problem, and we have implemented an expert system to choose the suitable attribute type. It can also easily handle other fuzzy models by enriching its knowledge base. We are now working to integrate this expert system so that the migration process will be automated. We think that the automation of this migration will be a useful starting point for generalizing FDBs in the real database world.
Acknowledgment This work has been partially supported by the “Ministry of Education and Science” of Spain (projects TIN2006-14285 and TIN2006-07262) and the Spanish “Consejería de Innovación Ciencia y Empresa de Andalucía” under research project TIC-1570.
References Behm, A., Geppert, A., & Dittrich, K. R. (1997). On the migration of relational schemas and data to object-oriented database systems. In J. Gyrks, M. Krisper, & H. C. Mayr (Eds.), Proceedings of the 5th International Conference on Re-Technologies for Information Systems (pp. 13-33). Klagenfurt, Austria: Oesterreichische Computer Gesellschaft. Bellman, R. E., & Zadeh, L. A. (1970). Decision-making in a fuzzy environment. Management Science, 17(4), 141-175. Ben Hassine, M. A., Ounelli, H., Touzi, A. G., & Galindo, J. (2007, July). A migration approach from
crisp databases to fuzzy databases. In Proceedings of the IEEE International Conference FUZZ-IEEE 2007, London, UK (pp. 1872-1879). Ben Hassine, M. A., Touzi, A. G., & Ounelli, H. (2007). About the choice of data type in a fuzzy relational database. In B. Gupta (Ed.), Proceedings of the 22nd International Conference on Computers and Their Applications (CATA-2007) (pp. 231-238). Bosc, P. (1999). Fuzzy databases. In J. Bezdek (Ed.), Fuzzy sets in approximate reasoning and information systems (pp. 403-468). Boston: Kluwer Academic Publishers. Bosc, P., Liétard, L., & Pivert, O. (1998). Bases de données et flexibilité: Les requêtes graduelles. Techniques et Sciences Informatiques, 17(3), 355-378. Bosc, P., & Pivert, O. (1995). SQLf: A relational database language for fuzzy querying. IEEE Transactions on Fuzzy Systems, 3, 1-17. Bosc, P., & Pivert, O. (1997). Fuzzy queries against regular and fuzzy databases. In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems. Dordrecht: Kluwer Academic Publishers. Bosc, P., & Pivert, O. (2000). SQLf query functionality on top of a regular relational database management. In O. Pons, M. A. Vila, & J. Kacprzyk (Eds.), Knowledge management in fuzzy databases (pp. 171-190). Heidelberg: Physica-Verlag. Buckles, B. P., & Petry, F. E. (1982). A fuzzy representation of data for relational databases. Fuzzy Sets and Systems, 7, 213-226. Carrasco, R., Vila, M. A., & Galindo, J. (2003). FSQL: A flexible query language for data mining. Enterprise Information Systems, IV, 68-74. Hingham, MA: Kluwer Academic Publishers. Cleve, A. (2004). Data centered applications conversion using program transformations. Unpublished doctoral dissertation, Namur.
Cleve, A., Henrard, J., & Hainaut, J.-L. (2005). Co-transformations in information system reengineering. In Proceedings of the 2nd International Workshop on Meta-Models, Schemas and Grammars for Reverse Engineering. Electronic Notes in Theoretical Computer Science, 137(3), 5-15. De Caluwe, R., & De Tré, G. (Eds.). (2007). Preface to the special issue on advances in fuzzy database technology. International Journal of Intelligent Systems, 22(7), 662-663. Dubois, D., & Prade, H. (1997). Using fuzzy sets in flexible querying: Why and how? In T. Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Flexible query answering systems (pp. 45-60). Kluwer Academic Publishers. Elmasri, R., & Navathe, S. B. (2006). Fundamentals of database systems (5th ed.). Addison-Wesley. Galindo, J. (1999). Tratamiento de la imprecisión en bases de datos relacionales: Extensión del modelo y adaptación de los SGBD actuales. Doctoral dissertation, University of Granada, Spain. Retrieved February 9, 2008, from http://www.lcc.uma.es Galindo, J. (2005). New characteristics in FSQL: A fuzzy SQL for fuzzy databases. WSEAS Transactions on Information Science and Applications, 2(2), 161-169. Galindo, J. (2007). FSQL (fuzzy SQL): A fuzzy query language. Retrieved February 9, 2008, from http://www.lcc.uma.es/~ppgg/FSQL Galindo, J., Aranda, M. C., Caro, J. L., Guevara, A., & Aguayo, A. (2002). Applying fuzzy databases and FSQL to the management of rural accommodation. Tourist Management Journal, 23(6), 623-629. Galindo, J., Medina, J. M., & Aranda, M. C. (1999). Querying fuzzy relational databases through fuzzy domain calculus. International Journal of Intelligent Systems, 14, 375-411. Galindo, J., Medina, M., Pons, O., & Cubero, J. C. (1998). A server for fuzzy SQL queries. In T.
Andreasen, H. Christiansen, & H. L. Larsen (Eds.), Lecture Notes in Artificial Intelligence (Vol. 1495): Flexible query answering systems (pp. 164-174). Springer. Galindo, J., Urrutia, A., Carrasco, R., & Piattini, M. (2004c). Relaxing constraints in enhanced entity-relationship models using fuzzy quantifiers. IEEE Transactions on Fuzzy Systems, 12(6), 780-796. Galindo, J., Urrutia, A., & Piattini, M. (2004a). Fuzzy aggregations and fuzzy specializations in fuzzy EER model. In K. Siau (Ed.), Advanced topics in database research (pp. 105-126). Hershey, PA: Idea Group Publishing. Galindo, J., Urrutia, A., & Piattini, M. (2004b). Representation of fuzzy knowledge in relational databases. In Proceedings of the 15th International Workshop on Database and Expert Systems Applications (pp. 917-921). IEEE Computer Society. Galindo, J., Urrutia, A., & Piattini, M. (2006). Fuzzy databases: Modeling, design and implementation. Hershey, PA: Idea Group Publishing. Goncalves, M., & Tineo, L. (2006). SQLf vs. Skyline: Expressivity and performance. In Proceedings of the 15th IEEE International Conference on Fuzzy Systems Fuzz-IEEE 2006, Vancouver, Canada (pp. 2062-2067). Henrard, J., Cleve, A., & Hainaut, J.-L. (2004). Inverse wrappers for legacy information systems migration. In T. Philippe & V. D. H. Willem-Jan (Eds.), Proceedings of the 1st International Workshop on Wrapper Techniques for Legacy Systems (WRAP'04) (pp. 30-43). Technische Universiteit Eindhoven. Henrard, J., Hick, J. M., Thiran, P., & Hainaut, J.-L. (2002). Strategies for data reengineering. In Proceedings of the 9th Working Conference on Reverse Engineering (pp. 211-220). IEEE Computer Society Press. Jahnke, J. H., & Wadsack, J. P. (1999). Varlet: Human-centered tool support for database reengineering. In J. Ebert, B. Kullbach, & F. Lehner
(Eds.), Proceedings of the Workshop on Software-Reengineering, Bad Honnef, Germany. Jeusfeld, M., & Johnen, U. A. (1994). An executable meta model for re-engineering of database schemas. In Proceedings of the Conference on the Entity-Relationship Approach (pp. 533-547). Manchester, UK: Springer-Verlag. Kacprzyk, J., & Zadrożny, S. (1995). FQUERY for Access: Fuzzy querying for Windows-based DBMS. In P. Bosc & J. Kacprzyk (Eds.), Fuzziness in database management systems (pp. 415-433). Heidelberg, Germany: Physica-Verlag. Kacprzyk, J., & Zadrożny, S. (1999). The paradigm of computing with words in intelligent database querying. In L. A. Zadeh & J. Kacprzyk (Eds.), Computing with words in information intelligent systems (Part 1. Foundations, Part 2. Applications, pp. 382-398). Heidelberg/New York: Springer-Verlag. Kacprzyk, J., & Zadrożny, S. (2000). On a fuzzy querying and data mining interface. Kybernetika, 36, 657-670. Kacprzyk, J., & Zadrożny, S. (2001). Computing with words in intelligent database querying: Standalone and Internet-based applications. Information Sciences, 134, 71-109. Maraboli, R., & Abarzua, J. (2006). FSQL-f representación y consulta por medio del lenguaje PL/PGSQL de información imperfecta. Degree thesis, Universidad Católica del Maule, Ingeniero en Computación e Informática, Chile. Medina, J. M., Pons, O., & Vila, M. A. (1994a). GEFRED: A generalized model of fuzzy relational databases. Information Sciences, 76(1-2), 87-109. Medina, J. M., Pons, O., & Vila, M. A. (1994b). An elemental processor of fuzzy SQL. Mathware and Soft Computing, 3, 285-290. Medina, J. M., Pons, O., & Vila, M. A. (1995). FIRST: A fuzzy interface for relational systems. In Proceedings of the 6th International Fuzzy Systems Association World Congress, Brazil.
Menhoudj, K., & Ou-Halima, M. (1996). Migrating data-oriented applications to a relational database management system. In Proceedings of the 3rd International Workshop on Advances in Databases and Information Systems (ADBIS 1996) (pp. 102-108), Moscow. Pedrycz, W., & Gomide, F. (1998). An introduction to fuzzy sets: Analysis and design. MIT Press. ISBN 0-262-16171-0. Petry, F. E. (1996). Fuzzy databases: Principles and applications (International Series in Intelligent Technologies). Kluwer Academic Publishers (KAP). Prade, H., & Testmale, C. (1987). Fuzzy relational databases: Representational issues and reduction using similarity measures. Journal of the American Society of Information Sciences, 38(2), 118-126. Silberschatz, A., Korth, H. F., & Sudarshan, S. (2006). Database systems concepts (5th ed.). McGraw-Hill. Tahani, V. (1977). A conceptual framework for fuzzy query processing: A step toward very intelligent database systems. Information Processing and Management, 13, 289-303. Takahashi, Y. (1993). Fuzzy database query languages and their relational completeness theorem. IEEE Transactions on Knowledge and Data Engineering, 5(1), 122-125. Thiran, P., & Hainaut, J.-L. (2001). Wrapper development for legacy data reuse. In Proceedings of the 8th Working Conference on Reverse Engineering (pp. 198-207). Washington, DC: IEEE Computer Society Press. Tilley, S. R., & Smith, D. B. (1996). Perspectives on legacy system reengineering. Carnegie Mellon University: Software Engineering Institute. Tineo, L. (2000). Extending RDBMS for allowing fuzzy quantified queries. In M. Revell (Ed.), Lecture Notes in Computer Science, 1873, 407-416. Berlin: Springer-Verlag.
Umano, M. (1982). Freedom-O: A fuzzy database system. In M. Gupta & E. Sanchez (Eds.), Fuzzy information and decision processes (pp. 339-347). Amsterdam: North-Holland. Umano, M., & Fukami, S. (1994). Fuzzy relational algebra for possibility-distribution-fuzzy relational model of fuzzy data. Journal of Intelligent Information Systems, 3, 7-27. Urrutia, A., Galindo, J., & Piattini, M. (2002). Modeling data using fuzzy attributes. In Proceedings of the 22nd International Conference of the Chilean Computer Science Society (SCCC’02) (pp. 117-123). Chile: Computer Science Society. Wu, B., Lawless, D., Bisbal, J., Richardson, R., Grimson, J., Wade, V., & O’Sullivan, D. (1997). The butterfly methodology: A gateway-free approach for migrating legacy information systems. In Proceedings of the 3rd IEEE Conference on Engineering of Complex Computer Systems (ICECCS97) (pp. 200-205). Italy: IEEE Computer Society. Retrieved February 9, 2008, from https://www.cs.tcd.ie/publications/tech-reports/reports.99/TCD-CS-1999-15.pdf Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338-353. Zadeh, L. A. (1975). The concept of a linguistic variable and its application to approximate reasoning (parts I, II, and III). Information Sciences, 8, 199-251, 301-357 ; 9, 43-80. Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1), 3-28. Zadeh, L. A. (1983). A computational approach to fuzzy quantifiers in natural languages. Computational Mathematics Applications, 9, 149-184. Zadeh, L. A. (1992). Knowledge representation in fuzzy logic. In An introduction to fuzzy logic applications in intelligent systems. Kluwer Academic. Zemankova-Leech, M., & Kandel, A. (1985). Implementing imprecision in information systems. Information Sciences, 37, 107-141.
Key Terms CDEG Function: Function defined in FSQL to compute the Compatibility DEGree of each row. This value is the fulfillment degree of each row with respect to the fuzzy condition included in the WHERE clause of a SELECT statement in the FSQL language. This function may be used with an attribute as argument, in which case it computes the fulfillment degree for that specific attribute. If the argument is the asterisk symbol, *, then it computes the fulfillment degree using the whole condition, even when it includes fuzzy conditions on different attributes. Fuzzy Attribute: In a database context, a fuzzy attribute is an attribute of a row or object in a database, which allows querying by fuzzy information and/or storing this kind of information. Fuzzy Database: If a regular or classical database is a structured collection of records or data stored in a computer, a fuzzy database is a database which is able to deal with uncertain or incomplete information using fuzzy logic. There are many forms of adding flexibility in fuzzy databases. The simplest technique is to add a fuzzy membership degree to each record, that is, an attribute in the range [0,1]. However, there are other kinds of databases allowing fuzzy values to be stored in fuzzy attributes using fuzzy sets, possibility distributions, or fuzzy degrees associated with some attributes and with different meanings (membership degree, importance degree, fulfillment degree, etc.). Of course, fuzzy databases should allow fuzzy queries using fuzzy or nonfuzzy data, and there are some languages that allow these kinds of queries, like FSQL or SQLf. In summary, the research in fuzzy databases includes the following four areas: flexible querying in classical or fuzzy databases, extending classical data models in order to achieve fuzzy databases (fuzzy relational databases, fuzzy object-oriented databases, etc.), fuzzy data mining techniques, and applications of these advances in real databases.
FSQL (Fuzzy SQL): Extension of the popular language SQL that allows the management of
fuzzy relational databases using fuzzy logic. Basically, FSQL defines new extensions for fuzzy queries, extending the SELECT statement, but it also defines other statements. One of these fuzzy items is the definition of fuzzy comparators, based mainly on possibility and necessity theory. Besides, FSQL allows the definition of linguistic labels (like hot, cold, tall, short, etc.) and fuzzy quantifiers (most, approximately 5, near the half, etc.). The most recent publication about FSQL is the book Fuzzy Databases: Modeling, Design and Implementation by Galindo et al. (2006). Fuzzy Comparators: Different techniques to compare two values using fuzzy logic. FSQL defines fuzzy comparators like FEQ (fuzzy equal), NFEQ (necessarily fuzzy equal), FGT (fuzzy greater than), NFGT (necessarily fuzzy greater than), and so forth. Fuzzy Metaknowledge Base (FMB): In a fuzzy database, the FMB is the extension of the data dictionary in order to store the fuzzy metadata, that is, information about fuzzy objects: the fuzzy data type of each fuzzy attribute, the definition of labels, the margin for approximate values, the minimum distance for very separated values, fuzzy quantifiers, and so forth. Fuzzy Migration: Migration from crisp databases towards fuzzy databases in order to introduce imprecise/fuzzy information in current information systems. This fuzzy migration consists in deriving a new database from a legacy database and in adapting data, metadata, and the software components accordingly. It does not only constitute the adoption of a new technology, but also the adoption of a new paradigm. Fuzzy Query: Query with imprecision in the preferences about the desired items. These preferences may usually be set using fuzzy conditions in the queries.
These fuzzy conditions include many possible forms like fuzzy preferences (e.g., I prefer bigger than cheaper), fuzzy labels (e.g., hot and cold), fuzzy comparators (e.g., approximately greater or equal than), fuzzy quantifiers (e.g., most
or approximately the half), and so forth. One basic target in a fuzzy query is to rank the resulting items according to their fulfillment degree (usually a number between 0 and 1). FuzzyEER Model: Conceptual modeling tool, which extends the Enhanced Entity Relationship (EER) model with fuzzy semantics and fuzzy notations to represent imprecision and uncertainty in the entities, attributes, and relationships. The basic concepts introduced in this model are fuzzy attributes, fuzzy entities, fuzzy relations, fuzzy degrees, fuzzy degrees in specializations, and fuzzy constraints. A complete definition of this model is published in the book Fuzzy Databases: Modeling, Design and Implementation (Galindo et al., 2006). Legacy System: Existing system in a specific context, for example, an existing database. SQL (Structured Query Language): A computer language used to create, retrieve, update, and delete data from relational database management systems. SQL has been standardized by both ANSI and ISO. It includes DML (Data Manipulation Language) and DDL (Data Definition Language). The statement for querying is the SELECT command.
Endnotes
1. Oracle is possibly the most powerful database system. The latest versions are object-relational databases designed for grid computing. Some distributions are free but with some limits (such as storing up to 4 GB of user data). It began three decades ago with only one relational database and currently runs on all major operating systems, including Linux, UNIX (AIX, HP-UX, Mac OS X, Solaris, Tru64), and Windows. Official web page: http://www.oracle.com
2. PostgreSQL is a powerful, open source relational database system. It has more than 15 years of active development and a proven architecture that has earned it a strong reputation for reliability, data integrity, and correctness. It runs on all major operating systems, including Linux, UNIX (AIX, BSD, HP-UX, SGI IRIX, Mac OS X, Solaris, Tru64), and Windows. Official web page: http://www.postgresql.org
3. A DBRE is a technology used to recover the conceptual schema that expresses the semantics of the source data structure.
Chapter XV
A Tool for Fuzzy Reasoning and Querying Geraldo Xexéo Universidade Federal do Rio de Janeiro, Brazil André Braga IBM Brazil, Brazil
Abstract We present CLOUDS, which stands for C++ Library Organizing Uncertainty in Database Systems, a tool that allows the creation of fuzzy reasoning systems over classic, nonfuzzy, relational databases. CLOUDS can be used in three flavors: CLOUDS API, a C++ API; CLOUDS-L, a compiled language; and CLOUDSQL, a fuzzy extension to SQL queries (ANSI, 1992). It was developed using the object-oriented paradigm and has an extensible architecture based on a main control system that manages different models and runs queries and commands defined in them. As a test, it was incorporated into a geographic information system and used to analyze epidemiological data.
CLOUDS: Tools for Fuzzy Reasoning and Querying This chapter describes CLOUDS (C++ Library Organizing Uncertainty in Database Systems), a set of tools that allows a programmer to create or extend a database-based system with a fuzzy query engine and fuzzy reasoning capabilities. It also describes its first real-life application, the extension of an epidemiological geographic information system, GISEpi (Nobre, Braga, Pinheiro, & Lopes, 1997). CLOUDS is open source and can be downloaded from SourceForge.1
Real world data are seldom as correct, exact, well defined, or well understood as our relational databases lead us to believe. Typically, we use approximation or intervals to deal with information uncertainty, often in a natural and unconscious way and at times with a clear loss of semantics. Fuzzy sets and fuzzy logic are well-established theories used to represent uncertainty in control systems (Klir & Yuan, 1995). Database researchers have also used them to model different forms of information uncertainty, creating fuzzy databases. Galindo, Urrutia, and Piattini (2006) provide a good review of different fuzzy database models.
Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
One of the most promising applications of fuzzy databases is representing uncertainty in geographic information systems (GIS) (Bosc, Kraft, & Petry, 2005), which are plagued by different imperfections in data, like imprecise representation of terrain or statistical errors in data acquisition (Bolstad, 2005). GIS are an important decision-making tool for governmental, nongovernmental, and private organizations (Jankowski & Nyerges, 2001). To fulfill this role, they can be extended with a decision support system, possibly with a rule-based decision system built on knowledge collected from human experience. Again, fuzzy systems have established themselves as a good implementation strategy for rule-based decision systems, due to their capability of implementing human reasoning, including characteristics such as imprecision in data evaluation or simultaneously applying different rules (Pedrycz & Gomide, 1998). Motivated by the faulty data found in a real-world health-care GIS application (Nobre et al., 1997), due to quality problems in data gathering and compilation, and by the need to use imprecise judgments when deciding whether to implement governmental health-care policies, we decided to extend it with a fuzzy module supporting data analysis and decision making. The result is CLOUDS, a portable library that can be easily used in different database applications. The library, developed in C++, contains tools for processing fuzzy SQL queries, for describing different fuzzy models over a relational database, and for defining rules used in a fuzzy inference engine. In the next section, we give a short review of the main topics discussed in this chapter. The third section introduces CLOUDS. The fourth section describes CLOUDS-L in detail. The fifth section describes its use in GISEpi. The sixth section presents the conclusions.
Basic Concepts We assume that the reader is aware of the main developments in fuzzy sets and fuzzy logic. However, we would like to review a few basic concepts that are the basis of our proposal.
Fuzzy Systems and Fuzzy Reasoning “Fuzzy systems”2 is a general term encompassing all kinds of systems that use, in some part of their architecture, a mechanism based on fuzzy logic or fuzzy set theory. The traditional implementation strategy requires a main fuzzy engine that is isolated from the nonfuzzy (crisp) part of the system by crisp to fuzzy and fuzzy to crisp converters. Crisp to fuzzy conversion is known as fuzzification or fuzzy encoding. Fuzzy to crisp conversion is called defuzzification or fuzzy decoding (Klir & Yuan, 1995). Fuzzy propositions are logical statements that assume a fuzzy value. They can be conditional, qualified, or both, as well as simple. A simple, unconditional, and unqualified proposition states that a variable element belongs to a fuzzy set, as in “the age of x is old.” For a particular element, the degree of truth of the proposition is interpreted as the degree of membership of this element to the fuzzy set. In this way, any fuzzy proposition can be interpreted as a possibility distribution function that is equal to the membership function of the fuzzy set. A simple proposition p has the forms (Klir & Yuan, 1995) “p : V is F” or “p : V(i) is F” if it is important to discuss the individual element referred to by the proposition, as in “The age of John is old.” A qualified fuzzy proposition is a simple fuzzy proposition modified by a fuzzy truth qualifier or a fuzzy probability qualifier. Conditional propositions discuss the implication of one fuzzy proposition from another proposition, as in “If age is old, then strength is feeble.” Conditional propositions are equivalent to fuzzy implications. Among other options, it is common for a fuzzy system to use rules, or fuzzy implications, to formally represent knowledge. Although a single type of rule does not exist, the if-then representation is a standard. 
A basic if-then rule can be written in the form: IF a1 AND a2 … AND an THEN b, where ai, 1≤ i ≤ n, and b are simple fuzzy propositions. Like in the standard expert system approach,
each rule represents some reasonable assumption about the actions (output values) that should be taken in the case of that state of the system (input values). Both input and output values are described by fuzzy sets. All rules work in parallel. A fuzzy reasoning engine can be seen as performing three steps:
1. Deciding which rules to fire, based on the degree of truth of the antecedent,
2. Calculating the value of the consequent for each fired rule, and
3. Calculating a consolidated result.
This is known as the compositional inference engine (Zadeh, 1973) and is the simplest and most common deductive process used in fuzzy systems. To execute the second step, it is necessary to have a way to calculate the value of the consequent based on the value of the antecedent; that is, we need a function to represent the fuzzy implication. This function can be deduced using generalized modus ponens (Klir & Yuan, 1995). There are two main methods of inference in compositional fuzzy systems: the min-max (Mamdani) and the fuzzy additive method (Cox, 1994). In the min-max method, used by default in CLOUDS, the consequent membership function is restricted to the minimum of the predicate truth, and the compound result is the maximum of all these fuzzy sets, which is not compatible with fuzzy logic in the narrow sense, but achieves good practical results.
The Linguistic Approach to Fuzzy Systems The linguistic approach is Zadeh’s original idea for developing fuzzy systems. It is based on two main concepts: the linguistic variable and the linguistic term. A linguistic variable represents a concept that is measurable in some way, either objectively or subjectively, like “temperature” or “desire.” Linguistic variables are properties of an object or situation. Linguistic terms subjectively
rate the characteristic denoted by a linguistic variable. A linguistic term is a fuzzy set, and the linguistic variable defines its domain. For example, if “water temperature” is a linguistic variable, its values could be the linguistic terms “freezing,” “cold,” “warm,” “hot,” and “boiling.” Each linguistic term should have a membership function mapping measurable temperatures, such as 0 °C to 100 °C, to membership degrees. Every adequate representation of a fuzzy set involves the basic understanding of five related conceptual symbols as defined by Turksen (1991):
• the set of elements θ∈Θ, such as a “person” from a “group of friends”;
• the linguistic variable V, which is a label for one of the attributes of the elements θ∈Θ, such as the “age” of the “person”;
• the linguistic term A, which is an adjective or adverb describing the linguistic variable, chosen among the set of linguistic terms, such as “young,” which describes the “age”;
• a referential set X ⊂ [-∞,∞], which is a measurable numerical assignment interval for a particular attribute V of a set of elements θ∈Θ, such as “[0,120] years” to “age”; and
• a subjective numeric attribution µA(θ) of the membership value, which is the membership degree of the element identified by the linguistic term A when labeled by the linguistic variable V. For example, for age 40, the membership value in set “young” could be 0.3.
Therefore, for a linguistic variable V, there will be a measurement process resulting, for each element θ∈Θ, in a measured value mv ∈ [α,β], where α,β are, respectively, the greatest lower bound and least upper bound of the domain. To interpret this measurement, we define subjective notions as the linguistic terms A0, A1, A2, ..., An and their membership functions µ0(x), µ1(x), µ2(x), ..., µn(x), with domain [α,β] and range [0,1]. Applying µi to mv, we obtain the membership degree for the element θ in the set Ai, which represents the degree of accomplishment of the linguistic term Ai when it is used to express a subjective measure of V.
Fuzzy Databases and Fuzzy Extensions to SQL In this section, we briefly discuss the two main fuzzy models used to extend relational databases and describe the main characteristics associated with fuzzy SQL or other fuzzy querying languages.
Extending Relational Databases with Fuzzy Theory There are several proposals for fuzzy database systems (Galindo et al., 2006). The main lines of work, described by Petry (1996), extend the relational model with one of two fuzzy models: possibility-based or similarity-based. Similarity-based models generalize the concept of relation to the concept of fuzzy relation, working with similarity tables to define the similarity, the degree of redundancy, and the uniqueness of a tuple. For example, a basic similarity model could fuzzify a column describing a qualitative opinion of movies with domain (“excellent,” “good,” “average,” or “bad”), saying that a good movie is 80% excellent. A fuzzy database will use this knowledge to also retrieve “good” movies (at 80% membership value) when queried for “excellent” movies. Possibility-based models use possibility distributions to represent ill-defined concepts and incomplete information within a tuple. Each tuple, in whole or in part, is paired with a membership value, which describes the relevance of the tuple to the relation. In this way, the fuzzy relational model uses (new) relations: Rf : D1 × D2 × ... × Dn → [0,1] where D1, D2, …, Dn are the domains of the original relation. It is possible to select and eliminate redundant tuples: given two tuples ti and tj; ti = (di1,
di2, …, din), with membership value µi; and tj = (dj1, dj2, …, djn), with membership value µj; if dik = djk, ∀k, k = 1, …, n, and i ≠ j, then ti = tj and the merged tuple takes the membership value min(µi, µj). The possibility-based model represents a standard fuzzy set extension over the definition of relational tables as relations, that is, sets of tuples, requiring each tuple to have a membership value to the relation. These membership values can be assigned different semantic meanings, such as to indicate precision, timeliness, confidence in the value of data, and so forth. It is also possible to define other models that merge characteristics of these two models, such as the GEFRED model by Medina, Pons, and Vila (1994). Actually, both models are mostly orthogonal, since one is built upon the values themselves, while the other is built upon the membership of the tuple (and its values) to the relation. Considering the fuzzification of domains, one can see that fuzzy relational databases are not necessarily in first normal form, since their fuzzified domain values will not necessarily be singletons. For example, if a traditional (crisp) database allows the values of “size” as a domain to be “high,” “medium,” and “low,” a fuzzy database should allow something like “(high(0.4) and low(0.3))”. This can be implemented as one tuple or, to keep the first normal form, as multiple tuples, each one representing one of these values. For a complete overview of different fuzzy database models, the reader is directed to the second chapter of Galindo et al. (2006, pp. 45-57).
Fuzzy Queries
The importance of relational databases for any organization rests on two main concepts: highly structured, application-independent data stored in tables, and a language that allows ad-hoc queries: SQL (ANSI, 1992). SQL, which stands for Structured Query Language, is probably the most widely used computer language in organizations. Almost every program, regardless of the language in which it is implemented, has to push data to and pull data from a database management system (DBMS) through an
A Tool for Fuzzy Reasoning and Querying
Table 1. The 18 fuzzy comparators for FSQL (fuzzy SQL): 16 in the possibility/necessity family and 2 in the inclusion family (Galindo et al., 2006)

Possibility       | Necessity            | Meaning
FEQ or F=         | NFEQ or NF=          | Possibly/Necessarily Fuzzy Equal to…
FDIF, F!= or F<>  | NFDIF, NF!= or NF<>  | Possibly/Necessarily Fuzzy Different from…
FGT or F>         | NFGT or NF>          | Possibly/Necessarily Fuzzy Greater Than…
FGEQ or F>=       | NFGEQ or NF>=        | Possibly/Necessarily Fuzzy Greater than or Equal to…
FLT or F<         | NFLT or NF<          | Possibly/Necessarily Fuzzy Less Than…
FLEQ or F<=       | NFLEQ or NF<=        | Possibly/Necessarily Fuzzy Less than or Equal to…
MGT or F>>        | NMGT or NF>>         | Possibly/Necessarily Much Greater Than…
MLT or F<<        | NMLT or NF<<         | Possibly/Necessarily Much Less Than…
FINCL             | INCL                 | Fuzzy Included in… / Included in…
SQL interface. Any fuzzy database model is much more useful if enhanced by a fuzzy extension to SQL or another query language. Actually, fuzzy querying can be developed over traditional crisp databases as well, providing a powerful mechanism for describing imprecise questions. SQL is divided into two subsets (ANSI, 1992): the Data Definition Language (DDL) and the Data Manipulation Language (DML). DML has four basic commands: SELECT, UPDATE, INSERT, and DELETE. Most fuzzy SQL extensions, including CLOUDS, deal only with the SELECT statement, which retrieves data from the database according to a set of restrictions. Although they may seem to play a minor role, SELECT statements are probably the most important ones, since they allow for the dynamic creation of tables and, consequently, reports. Bosc and Pivert (1997) describe three approaches for imprecise querying:

• Separating the precise part of the query from the imprecise part and using a ranking mechanism to unify the answer,
• Translating the imprecise query to a precise query using ranges and then fuzzifying the result, and
• Directly implementing the fuzzy query.
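The second approach above can be sketched concretely: translate the imprecise predicate into a crisp range query over the term's support, then attach a membership degree to each returned row. The table, column, and the triangular term "about 50" (with the [44, 56] support used earlier in the chapter) are illustrative assumptions, not definitions from CLOUDS or FSQL.

```python
# Sketch of approach 2: crisp range query + fuzzification of the result.
import sqlite3

def about(x, center=50.0, spread=6.0):
    """Triangular membership for 'about <center>' (support [44, 56])."""
    return max(0.0, 1.0 - abs(x - center) / spread)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (store TEXT, copies INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("s1", 50), ("s2", 47), ("s3", 70)])

# Crisp range = support of the fuzzy term (membership > 0)
lo, hi = 50 - 6, 50 + 6
rows = conn.execute(
    "SELECT store, copies FROM orders WHERE copies BETWEEN ? AND ?",
    (lo, hi)).fetchall()

# Fuzzify the crisp answer: attach a membership degree and rank
ranked = sorted(((about(c), s) for s, c in rows), reverse=True)
# ranked → [(1.0, 's1'), (0.5, 's2')]; 's3' falls outside the support
```

Only a front-end to the DBMS is needed: the database itself answers an ordinary range query, and the ranking happens outside it.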
These three mechanisms lead to different implementations. In the first two, it is only necessary to build a front-end to the database, and one can rely heavily on translating the fuzzy query to SQL. The third mechanism implies access to database internals. One of the authors has previously implemented such an approach using an open-source object-oriented database engine (Boullosa, Cruz, & Xexéo, 1999). Several authors propose fuzzy query languages, most of them extending SQL and some of them using a Prolog-like syntax, such as Liu and Li (1990). Although the existence and use of a fuzzy query language is not tied to an underlying fuzzy database model, it does imply the definition of a fuzzy model over the database, with fuzzification and defuzzification rules. Most of the proposals extend SQL with fuzzy manipulation operators, such as the 18 fuzzy comparators proposed in FSQL (Galindo et al., 2006) and described in Table 1. The first requirement is to allow the use of fuzzy linguistic terms, or fuzzy numbers, in the queries. In addition, it must be possible to indicate the minimum membership value acceptable in a retrieved tuple. Li and Liu (1990), and other authors, use the basic syntax described in Program 1.
Program 1. The basic fuzzy extension to SQL

1. SELECT
The use of a fuzzy SQL language can be mapped to a fuzzy relational calculus or algebra (Galindo, Medina, & Aranda, 1999; Umano & Fukami, 1994). Other solutions avoid this formal approach and directly translate the SQL statement into fuzzy model definitions and formulae. It is also possible to map the fuzzy SQL expression into SQL and then reprocess the output given by the database to obtain a fuzzy answer. It is necessary to redefine the arithmetic and relational operators used in the WHERE clause to deal with fuzzy concepts. The arithmetic operators are easily redefined through the extension principle (Klir & Yuan, 1995; Pedrycz & Gomide, 1998). The classical relational operators return true (1) or false (0), while their fuzzy extensions return membership values. Different extensions have been suggested to express equality between fuzzy sets, such as functions based on inclusion, similarity, semantic distance, and compatibility. SQLf (Bosc & Pivert, 1995) is an SQL extension supporting fuzzy values and a great variety of fuzzy queries. Its basic principle is the introduction of fuzziness at two levels of the WHERE clause of the SELECT statement: in the predicates themselves and in the way they are combined. We would like to direct the reader to two recent proposals that have something in common with CLOUDS:
FRIL++ and FSQL. Cao and Rossiter (2003) have proposed FRIL++, a programming language for deductive probabilistic and fuzzy object-oriented databases. Their approach, however, does not use a fuzzy reasoning engine. FRIL++ uses a Prolog-like syntax, while CLOUDS is similar to SQL. FSQL (Galindo et al., 2006) is a very complete fuzzy extension to SQL. Although extensive, and also modifiable via an "ALTER FSQL" command (p. 254), it is not extensible. CLOUDS, although limited in scope, is based on an extensible architecture, built over abstract classes and open-source code. In this book, the reader can find a chapter by Urrutia, Tineo, and Gonzalez comparing the SQLf and FSQL languages.
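To make the possibility-style comparators of Table 1 more tangible, the sketch below implements one common reading of a fuzzy-equality comparator in the spirit of FEQ: the possibility that two fuzzy values are equal, computed as the height of the intersection, sup_x min(µA(x), µB(x)). This is an assumption of one textbook definition, not necessarily FSQL's exact semantics; the triangular shapes are invented for illustration.

```python
# Sketch: fuzzy equality as the height of the intersection of two fuzzy sets.
def triangular(a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    def mu(x):
        if a <= x <= b and b != a:
            return (x - a) / (b - a)
        if b <= x <= c and c != b:
            return (c - x) / (c - b)
        return 1.0 if x == b else 0.0
    return mu

def feq(mu_a, mu_b, domain):
    """Possibility of equality: sup over the domain of min(muA, muB)."""
    return max(min(mu_a(x), mu_b(x)) for x in domain)

about_50 = triangular(45, 50, 55)
about_53 = triangular(48, 53, 58)
domain = [x / 10.0 for x in range(400, 700)]   # 40.0 .. 69.9 in 0.1 steps
# feq(about_50, about_53, domain) ≈ 0.7: the two terms overlap substantially
```

A necessity-style comparator (NFEQ) would instead use inf_x max(1 − µA(x), µB(x)) over the same domain.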
Overview of CLOUDS
CLOUDS is a library that acts as a fuzzy front-end to relational databases. It is based on the linguistic approach to fuzzy systems. CLOUDS can be used in three different ways:

• CLOUDS API, a set of C++ classes and methods that implements all functions;
• CLOUDSQL, which implements a fuzzy extension to SQL SELECT statements; and
• CLOUDS Language (CLOUDS-L), which accepts CLOUDSQL statements, and a compiler that translates CLOUDS-L directly to API commands.
A typical CLOUDS-L session involves three phases:

• Describing the fuzzy model (input and fuzzification),
• Executing the model (fuzzy engine), and
• Providing output (defuzzification and output).
To give the reader a taste of CLOUDS, a CLOUDS-L example is shown in Program 2. In Program 2, we process the following steps:

• The "DATA" declaration (line 1) opens the table "Malaria1." The chosen table is in the SIGEPE format and is, by default, a file named "Malaria1."
• The "VARINT" declarations (lines 2 and 3) create two linguistic variables, with integer domains, over columns LAMEXA (number of examined patients) and LAMPOSTOT
Program 2. A simple program in CLOUDS Language (numbers used only for reference)

1.  DATA Malaria1 TYPE SIGEPIDADOS
2.  VARINT LAMEXA IN Malaria1 AS LAMEXA AUTOMATIC
3.  VARINT LAMPOSTOT IN Malaria1 AS LAMPOSTOT AUTOMATIC
4.  VARSOLREAL Relevance IN Malaria1 AS Relevance FROM 0 TO 100
5.  TERM Low IN LAMPOSTOT CURVE TRIANGULAR IN 0 0.0 0.5 ALPHA 0.2
6.  TERM Average IN LAMPOSTOT CURVE TRIANGULAR IN 0 0.5 1.0 ALPHA 0.2
7.  TERM High IN LAMPOSTOT CURVE TRIANGULAR IN 0.5 1.0 1.0 ALPHA 0.2
8.  TERM Average IN LAMEXA
9.  TERM High IN LAMEXA
10. HEDGE Very IN Low CONCENTRATOR FACTOR 2
11. TERM Very-Low IN LAMEXA
12. RULE R1 IF [{LAMEXA=High} AND {LAMPOSTOT=High}] THEN Relevance High
13. RULE R2 IF [{LAMEXA=High} AND {LAMPOSTOT=Average}] THEN Relevance Average
14. RULE R3 IF [{LAMEXA=Average} AND {LAMPOSTOT=Average}] THEN Relevance Average
15. RULE R4 IF [LAMEXA=Very-Low] THEN Relevance Low
16. MODULE Aval1 IN Malaria1 EVALUATION OVER LAMEXA
17. MODULE Aval2 IN Malaria1 EVALUATION OVER LAMPOSTOT
18. MODULE Aval3 IN Malaria1 EVALUATION OVER Relevance
19. OUTPUT LAMEXA IN Aval1 TYPE TABELASIGEPI Malaria1
20. OUTPUT LAMPOSTOT IN Aval2 TYPE TABELASIGEPI Malaria1
21. OUTPUT Relev IN Aval3 TYPE TABELASIGEPI Malaria1
(number of positive patients). These linguistic variables receive the same names as their crisp counterparts. The domain limits are automatically calculated.
• The "VARSOLREAL" declaration (line 4) creates a solution variable, named "Relevance," over a real domain that varies from 0 to 100.
• The first three "TERM" declarations (lines 5 to 7) create three linguistic terms, "Low," "Average," and "High."
  o These terms are then bound to the linguistic variable LAMPOSTOT.
  o Finally, their membership functions are described as triangular and defined over the normalized domain [0,1]. For "Average," this triangle, described as (0, 0.5, 1), starts at (0, 0), peaks at (0.5, 1), and ends at (1, 0). The other two are right triangles. The alpha-cut limits the membership to values greater than 20%. The result is that the "triangle" is rather a "triangle over a box."
• The fourth and fifth "TERM" declarations (lines 8 and 9) reuse the "Average" and "High" linguistic terms, binding them to the linguistic variable LAMEXA.
  o At this point, it is important to notice that, internally, the system will create copies of these terms and their domains, with the same shape relative to the normalized domain but probably different limits.
• The HEDGE declaration (line 10) creates the hedge "Very" over the term "Low," using the formula of a concentrator with a factor of 2. This will raise the membership value to the second power.
• The TERM declaration in line 11 defines a modified term, Very-Low, applied to LAMEXA.
• Lines 12 to 15 create rules that allow for the evaluation of the solution variable "Relevance."
• Lines 16 to 18 create results by evaluating what was previously defined (using the command MODULE). These results can be used later in different output commands.
• Lines 19 to 21 write the results of evaluating the fuzzy values of LAMPOSTOT, LAMEXA, and Relevance to a GISEpi table called "Malaria1" (actually inserting three new columns with those names in that table).
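What lines 5-11 of Program 2 set up can be sketched outside CLOUDS-L: a triangular term over the normalized domain, an alpha-cut, and a "Very" hedge implemented as a concentrator with factor 2. This is an illustrative Python sketch, and the zeroing interpretation of the alpha-cut is one plausible reading of "limits the membership to values greater than 20%."

```python
# Sketch: triangular linguistic term with an alpha-cut, plus a concentrator
# hedge, mirroring "TERM Low ... ALPHA 0.2" and "HEDGE Very ... FACTOR 2".
def triangular_term(a, peak, c, alpha=0.0):
    def mu(x):
        if a <= x <= peak and peak != a:
            m = (x - a) / (peak - a)
        elif peak <= x <= c and c != peak:
            m = (c - x) / (c - peak)
        else:
            m = 1.0 if x == peak else 0.0
        return m if m > alpha else 0.0    # alpha-cut: drop low memberships
    return mu

def very(term, factor=2):
    """Concentrator hedge: raise the membership to a power."""
    return lambda x: term(x) ** factor

low = triangular_term(0.0, 0.0, 0.5, alpha=0.2)   # TERM Low ... ALPHA 0.2
very_low = very(low)                              # HEDGE Very ... FACTOR 2
# low(0.25) == 0.5; very_low(0.25) == 0.25; low(0.45) is cut to 0.0
```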
Metamodel Overview
A CLOUDS system is divided into models. These models are independent and do not interact with each other. There is, however, a global model, a singleton that can be accessed by all models and can be used to exchange information between them. Tables are equivalent to relational tables and are imported from existing databases (or text files). The main data abstraction in a model is the linguistic variable. Each linguistic variable is described by the set of linguistic terms assigned to it. Most linguistic variables are fuzzy interpretations of crisp attributes in a relational table. Each linguistic term is a fuzzy set describing one subjective evaluation of the linguistic variable according to a membership function. Hedges are modifiers that can be attached to linguistic terms. Linguistic variables are built upon table columns. The lower and upper limits of the domain of a linguistic variable can be derived automatically from the values in the table or set programmatically by the user. Linguistic terms are defined over the normalized domain. Linguistic variables and linguistic terms are used to compose rules and queries. Rules evaluate solution variables, which calculate the membership values of their terms through those rules. Queries are used in the relational sense, with fuzzy extensions, and can refer to table columns or linguistic variables interchangeably. The description above closely follows the linguistic approach found in most fuzzy systems and databases. Its main advantage is that it joins
Figure 1. A conceptual view of a fuzzy model in CLOUDS and its relation with the data stored in a DBMS Fuzzy Model Querys and commands
Fuzzy Schema Logic Propositions
rules
-
O p era to rs L in gu istic T e rm s N u m eric V alue s F uz z y V ariab le s
C risp V ariable s
D ata R eq uis ition
Ling uistic V ariable s
S o lutio n V a ria ble
the concepts of fuzzy rules and fuzzy queries in a single system, a characteristic we have not seen in other systems. Figure 1 presents a conceptual view of a fuzzy model in CLOUDS. Queries and commands (such as OUTPUT) manipulate logic propositions that can be written using operators, linguistic terms, numeric values, or fuzzy variables. Rules also use logic propositions, to create solution variables. Data can be inserted directly into a logic proposition, but they usually come from a DBMS in the form of crisp or linguistic variables. Some crisp inputs can also be converted to fuzzy numbers. The "data interface" in the model represents the need to describe to a model the (physical) origin of the data.
System Overview
The main control system (Figure 2) manages the fuzzy models, which interact with the database, reading data and producing fuzzy or crisp results. It also interprets script files defining the fuzzy
models. There are two types of fuzzy models: a global model and a group of independent ones. Each independent model possesses its own data interfaces, queries, variables, and rules, but it cannot access components from other models. The global model is suitable for comparing and building rules from several models, and exchanging information among them.
Fuzzy Models
Each model is made up of three groups of objects: data interfaces, fuzzy schema, and fuzzy queries (see Figure 1). Data interfaces are virtual classes that provide the services necessary to fuzzify data stored in a database or a text file. These virtual classes should be implemented according to the chosen database, allowing the system to communicate with nonconventional databases such as spatial or object-oriented ones. Currently, CLOUDS provides concrete classes with direct access to comma-separated files, ".dbf" files, GISEpi (Nobre et al., 1997) files, and generic SQL databases (using ODBC).
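The data-interface abstraction can be sketched as an abstract base class offering the services the fuzzifier needs, with concrete subclasses per storage backend. The real CLOUDS classes are C++; the method names (`column`, `limits`) and the in-memory backend below are invented for illustration.

```python
# Sketch of the data-interface abstraction: an abstract class plus one
# concrete in-memory implementation (a CSV or ODBC backend would subclass
# DataInterface the same way).
from abc import ABC, abstractmethod

class DataInterface(ABC):
    @abstractmethod
    def column(self, name):
        """Return all values of a column as floats."""

    def limits(self, name):
        """Domain limits used to normalize a linguistic variable."""
        values = self.column(name)
        return min(values), max(values)

class ListInterface(DataInterface):
    """Concrete backend over in-memory columns (name -> list of values)."""
    def __init__(self, columns):
        self._columns = columns

    def column(self, name):
        return [float(v) for v in self._columns[name]]

data = ListInterface({"LAMEXA": [12, 40, 25]})
# data.limits("LAMEXA") == (12.0, 40.0)
```

New storage backends only implement `column`; normalization logic such as `limits` stays in the abstract class, which is the extensibility point the text describes.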
Figure 2. The architecture of CLOUDS. [The figure shows scripts, queries, and commands feeding the CLOUDS main control system, which manages the main fuzzy model (holding general variables, rules, and interfaces) and fuzzy models 1 to n, all connected to a database.]
Figure 3. A subset of the UML class diagram of CLOUDS showing the linguistic variable and linguistic term hierarchies
Linguistic variables, linguistic terms, rules, and operators compose the fuzzy schema. Each one of these components has a unique identifier, so we can use the same component in different locations within a model, ensuring consistency. We may, for example, change the curve parameters of the term “high” that is being used by three different linguistic variables, simultaneously affecting them. Arguably, this feature can be problematic when one needs to modify the behavior of just one variable, but it is always possible to define a new term to be used specifically in that variable. Rules allow the use of fuzzy inference rules, which calculate the value of solution variables. Queries
may be directly invoked from the control system or be predefined inside a model.
Linguistic Variables
In CLOUDS, linguistic variables fuzzify chosen database attributes. They are implemented as an abstract class, linguistic variable, with some basic concrete specializations: real linguistic variable, integer linguistic variable, fuzzy number, crisp variable, fuzzy variable, and solution variable. Figure 3 displays the Unified Modeling Language (UML) diagram for these classes.
Each linguistic variable is linked to an attribute (column) and to a list of linguistic terms. As an object in the model, its basic function is to get an attribute value from the database and to convert it into different fuzzy values, one for each linguistic term listed. A linguistic variable also normalizes the data. To do that, it first recovers the minimum and maximum for that attribute. The linguistic terms then evaluate the membership value according to some predefined function over a normalized domain. For example, a linguistic variable "AgeF" can be created from the column "Age" in the table "Student." When used, "AgeF" will dynamically recover the minimum and maximum values for "Age." Suppose that, for this database, the limits are 2 and 18. The age 2 will correspond to 0 in the normalized domain; the age 18 will correspond to 1. If we have defined two triangular linguistic terms, "Young" (0,0,1) and "Old" (0,1,1), a student with an Age value of 10 will have AgeF (fuzzy) values of Young(0.5) and Old(0.5). Real and integer linguistic variables are standard linguistic variables that deal with real and integer domains, respectively. Fuzzy numbers are a special kind of linguistic term representing approximate numbers like "about 3." To build a fuzzy number, the user indicates the domain and the granularity, that is, how many fuzzy numbers will divide the domain, and the system then automatically creates and assigns linguistic terms to the linguistic variable. Crisp variables are used when we want to map an attribute exactly as it is. Fuzzy variables are used when the database is a true fuzzy database and the attribute values are fuzzy values like "35% high" or "88% cold." The last kind of linguistic variable is the solution variable. Solution variables do not map to the database but are built from inference rules over other fuzzy variables, as explained later.
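The "AgeF" example above can be sketched as follows: a linguistic variable that recovers the column limits, normalizes each crisp value, and evaluates its terms over the normalized domain. The class and term names follow the running example but are otherwise invented.

```python
# Sketch of a linguistic variable with automatic domain limits and
# normalization, reproducing the AgeF example from the text.
class LinguisticVariable:
    def __init__(self, values, terms):
        self.lo, self.hi = min(values), max(values)   # recovered from the data
        self.terms = terms                            # name -> mu over [0,1]

    def fuzzify(self, crisp):
        x = (crisp - self.lo) / (self.hi - self.lo)   # normalize to [0,1]
        return {name: mu(x) for name, mu in self.terms.items()}

young = lambda x: 1.0 - x   # triangular (0,0,1) over the normalized domain
old = lambda x: x           # triangular (0,1,1)

age_f = LinguisticVariable(values=[2, 7, 10, 15, 18],
                           terms={"Young": young, "Old": old})
# age_f.fuzzify(10) → {'Young': 0.5, 'Old': 0.5}, matching the text
```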
Linguistic Terms
To define a linguistic term, it is necessary to select a function over the normalized domain (0.0 – 1.0) and to set its shape. This normalization allows for
a linguistic term to be used in different linguistic variables. CLOUDS provides four standard functions: triangular, trapezoidal, Gaussian, and customized. Others can easily be implemented by specialization of the "linguistic term" class. The first three are defined through a few parameters. The custom term, on the other hand, must be defined through a vector. Each linguistic term must be created first and then bound to the context of a linguistic variable. Terms actually serve as templates that can be used over any domain. They can be absolute or parametric. Absolute linguistic terms, used by default, are always positioned in the same way relative to the normalized domain. Parametric linguistic terms have an additional parameter, referring to the non-normalized domain, that defines where they should be positioned in the normalized domain; this is useful when defining concepts such as fuzzy numbers or propositions like "Around(X)." All fuzzy variables have a specialized linguistic term, NoInfo, that deals with null values. It has three specializations: UNDECIDED, UNDEFINED, and NULL. The term UNDECIDED is generated when a crisp value does not belong to the support of any linguistic term. The term UNDEFINED is generated when the crisp value is outside the domain of the linguistic variable. The term NULL is generated for invalid values that cannot be qualified as either UNDEFINED or UNDECIDED (including a database NULL). Null values can only be intentionally defined by a rule. This approach is operational and was born from the necessity of dealing with results derived by rules. Galindo et al. (2006, pp. 51-59) offer a good review of fuzzy models, including different meanings for null values. As a comparison, FSQL (Galindo et al., 2006, p. 187) uses UNKNOWN, which is similar to our UNDECIDED; UNDEFINED, representing attributes that are not applicable or are meaningless, which has some similarity to CLOUDS UNDEFINED; and NULL, representing total ignorance.
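The three NoInfo specializations just described can be sketched as a small classifier: given a crisp value, a variable's domain, and its terms' supports, decide whether the value fuzzifies normally or falls into UNDECIDED, UNDEFINED, or NULL. This is an operational reading of the text, not CLOUDS source code; the domain and supports are invented.

```python
# Sketch: classifying a crisp value against a variable's domain and the
# supports of its linguistic terms, following the NoInfo rules in the text.
def classify(value, domain, supports):
    lo, hi = domain
    if value is None or not isinstance(value, (int, float)):
        return "NULL"                    # invalid value (e.g., database NULL)
    if not (lo <= value <= hi):
        return "UNDEFINED"               # outside the variable's domain
    if not any(a <= value <= b for a, b in supports):
        return "UNDECIDED"               # in the domain, in no term's support
    return "OK"                          # fuzzifies normally

domain = (0.0, 100.0)
supports = [(0.0, 40.0), (60.0, 100.0)]  # a gap between the terms
# classify(50.0, ...) → 'UNDECIDED'; classify(120.0, ...) → 'UNDEFINED'
```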
Table 2. Hedge operators

Hedge  | Operator        | Function f(µ)
Very   | CONCENTRATION   | µ^k
Few    | DILATION        | µ^(1/k)
Almost | INTENSIFICATION | 2µ^k for 0 ≤ µ ≤ 0.5; 1 − 2(1 − µ)^k for 0.5 < µ ≤ 1
Hedges
Linguistic hedges, or simply hedges, modify linguistic terms. They are the fuzzy equivalents of adverbs, built as fuzzy mappings from [0,1] into [0,1], and are implemented following the decorator design pattern (Gamma, Helm, Johnson, & Vlissides, 1994). CLOUDS provides the most widely agreed upon linguistic hedges and their corresponding functions, shown in Table 2. Each linguistic term can be associated with a linked list of hedges. Hedges act like filters that modify the membership values of the linguistic term according to each hedge's type. The membership evaluation function calls the first hedge of the list, the outermost hedge, which recursively calls the next one, until a linguistic term is reached and provides a base value to be used in the calculations.
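The decorator-style hedge chain can be sketched as follows: each hedge wraps the next element of the list, and the outermost hedge is called first, exactly as the evaluation order described above. The class names are illustrative, not the CLOUDS C++ classes.

```python
# Sketch: hedges as decorators around a linguistic term. Evaluation starts
# at the outermost hedge and recurses inward to the base term.
class Term:
    def __init__(self, mu):
        self.mu = mu
    def membership(self, x):
        return self.mu(x)

class Concentration:
    """'Very': raises the inner membership to the power k (Table 2)."""
    def __init__(self, inner, k=2):
        self.inner, self.k = inner, k
    def membership(self, x):
        return self.inner.membership(x) ** self.k

class Dilation:
    """'Few': raises the inner membership to the power 1/k (Table 2)."""
    def __init__(self, inner, k=2):
        self.inner, self.k = inner, k
    def membership(self, x):
        return self.inner.membership(x) ** (1.0 / self.k)

tall = Term(lambda x: x)            # toy membership over [0,1]
very_very_tall = Concentration(Concentration(tall))
# very_very_tall.membership(0.9) ≈ 0.9 ** 4: hedges compose by nesting
```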
Rules
A rule is a statement such as "IF
also provide the OR operator, implemented as a t-conorm, and the unary NOT operator. Leaves of the tree represent individual conditions, and the inner nodes are fuzzy logic operators. Fuzzy logic operators can be calculated by many different functions. In CLOUDS, users can select which type of operators they will use. An abstract class implements the operators, and there is a specialization for each operator type. This construction allows us to build any proposed operator. A rule is evaluated through correlation, and, if necessary, the final fuzzy value can be defuzzified into a crisp value. Two defuzzification methods can be used: max-value (also known as the "height" method) and center of gravity (Klir & Yuan, 1995). It is also easy to extend CLOUDS, programmatically, with any other desired method. The resulting values, defuzzified or not, can be assigned to a data attribute of the output data interface.
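Rule evaluation and defuzzification can be sketched with the standard operators (min for AND, max for OR, 1 − x for NOT) and a center-of-gravity computation over a sampled output domain. Rule R1 of Program 2 is used as the running example; the input membership values and the shape of the output term "High" are invented assumptions.

```python
# Sketch: evaluating one fuzzy rule with standard operators, then
# defuzzifying the clipped output term by center of gravity.
AND, OR, NOT = min, max, lambda a: 1.0 - a

def cog(samples):
    """Center of gravity of (x, mu) samples."""
    num = sum(x * mu for x, mu in samples)
    den = sum(mu for _, mu in samples)
    return num / den if den else 0.0

# IF [LAMEXA=High AND LAMPOSTOT=High] THEN Relevance High  (rule R1)
lamexa_high, lampostot_high = 0.8, 0.6      # invented input memberships
firing = AND(lamexa_high, lampostot_high)   # rule firing strength: 0.6

# Clip the output term "High" (here a ramp mu(x) = x/100) at the firing level
high = lambda x: x / 100.0
samples = [(x, min(high(x), firing)) for x in range(0, 101, 5)]
relevance = cog(samples)   # a crisp Relevance value, weighted toward 100
```

Swapping `cog` for `max`-based selection of the peak would give the max-value ("height") method mentioned above.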
Queries
Our extension to SQL allows dealing with fuzzy variables and fuzzy terms. Models can define or directly execute several queries. When we define a query with a name, it is stored to be executed later. When we define a query without a name, it is immediately executed. CLOUDS actually implements a simple SQL parser and engine, which does not allow subqueries. A query is stored as a list of simple commands based on relational algebra. The translation made by the parser is very simple, although it is possible to implement optimization methods. To compute the membership degree of each row in the result set, CLOUDS uses, by default, the standard t-norm, t-conorm, and negation applied to the WHERE part of the SELECT statement. A CLOUDS query has the form: SELECT
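The default membership computation just described can be sketched as follows: the standard t-norm (min) combines the fuzzy predicates of the WHERE clause into a per-row membership degree, optionally filtered by a threshold. The predicate names and the threshold mechanism are illustrative, not the CLOUDS parser.

```python
# Sketch: per-row membership in the result set as the min (standard t-norm)
# of the WHERE clause's fuzzy predicates, with a minimum-membership filter.
def evaluate_where(row, predicates, threshold=0.0):
    """AND all fuzzy predicates with min; keep rows at or above threshold."""
    mu = min(p(row) for p in predicates)
    return mu if mu >= threshold else None

rows = [{"copies": 50}, {"copies": 40}, {"copies": 80}]
about_50 = lambda r: max(0.0, 1.0 - abs(r["copies"] - 50) / 20.0)
under_60 = lambda r: 1.0 if r["copies"] < 60 else 0.0

result = [(r["copies"], evaluate_where(r, [about_50, under_60], 0.3))
          for r in rows]
# → [(50, 1.0), (40, 0.5), (80, None)]: the last row falls below threshold
```

An OR in the WHERE clause would combine with max (the standard t-conorm), and NOT with 1 − µ, matching the defaults stated above.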