Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5708
Philippa Gardner Floris Geerts (Eds.)
Database Programming Languages 12th International Symposium, DBPL 2009 Lyon, France, August 24, 2009 Proceedings
Volume Editors

Philippa Gardner
Imperial College London, Department of Computing, London, UK
E-mail: [email protected]

Floris Geerts
University of Edinburgh, School of Informatics, Edinburgh, Scotland, UK
E-mail: [email protected]
Library of Congress Control Number: 2009932262
CR Subject Classification (1998): H.2, H.2.3, K.8.1, H.3, H.4
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-03792-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03792-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12740819 06/3180 543210
Preface
This volume contains the proceedings of the 12th International Symposium on Database Programming Languages (DBPL 2009), held in Lyon, France on August 24, 2009, and co-located with VLDB (the International Conference on Very Large Data Bases). DBPL continues to present high-quality work at the intersection of database and programming language research.

This proceedings volume contains the six papers accepted for DBPL 2009, which were selected by the Program Committee. Every submission was reviewed by at least three members of the committee. In addition, we sought the opinions of external referees, chosen because of their expertise in a particular topic.

We would like to thank all the authors who submitted papers to DBPL 2009. We also thank the members of the Program Committee for their excellent work during the electronic selection meeting. We are grateful to Andrei Voronkov for his EasyChair system, which made these discussions comparatively straightforward. We would like to thank Marcelo Arenas and Michael I. Schwartzbach for their assistance and sound counsel as Program Chairs of DBPL 2007. We finally thank Mohand-Said Hacid and Jean-Marc Petit for their superb local organization of DBPL 2009.

August 2009
Philippa Gardner Floris Geerts
Organization
Program Chairs

Philippa Gardner, Imperial College London, UK
Floris Geerts, University of Edinburgh, UK
Program Committee

Pablo Barceló, Universidad de Chile, Chile
Gavin Bierman, Microsoft Research, Cambridge, UK
Jan Chomicki, University at Buffalo, USA
Anuj Dawar, University of Cambridge, UK
J. Nathan Foster, University of Pennsylvania, USA
Cosimo Laneve, Università di Bologna, Italy
Maarten Marx, University of Amsterdam, The Netherlands
Christopher Ré, University of Washington, USA
Alan Schmitt, INRIA, Grenoble, France
Jérôme Siméon, IBM T.J. Watson Research Center, USA
Peter Thiemann, University of Freiburg, Germany
Stijn Vansummeren, University of Hasselt, Belgium
Jef Wijsen, Université de Mons-Hainaut, Belgium
Limsoon Wong, National University of Singapore, Singapore
External Referees

Pierre Genevès, INRIA, Grenoble, France
Giorgio Ghelli, Università di Pisa, Italy
Todd J. Green, University of Pennsylvania, USA
Nabil Layaïda, INRIA, Grenoble, France
Fabio Massacci, Università di Trento, Italy
Danilo Montesi, Università di Bologna, Italy
Prakash Ramanan, Wichita State University, USA
Table of Contents
Semantics, Types and Effects for XML Updates ..................... 1
   Michael Benedikt and James Cheney

An Automata-Theoretic Approach to Regular XPath .................. 18
   Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini,
   and Moshe Y. Vardi

The Script-Writer’s Dream: How to Write Great SQL in Your Own
Language, and Be Sure It Will Succeed ............................ 36
   Ezra Cooper

XML Security Views Revisited ..................................... 52
   Benoît Groz, Sławomir Staworko, Anne-Cécile Caron, Yves Roos,
   and Sophie Tison

A Tractable Subclass of DTDs for XPath Satisfiability with
Sibling Axes ..................................................... 68
   Yasunori Ishihara, Takuji Morimoto, Shougo Shimizu,
   Kenji Hashimoto, and Toru Fujiwara

General Database Statistics Using Entropy Maximization ........... 84
   Raghav Kaushik, Christopher Ré, and Dan Suciu

Author Index ..................................................... 101
Semantics, Types and Effects for XML Updates

Michael Benedikt¹ and James Cheney²

¹ Oxford University Computing Laboratory
² Laboratory for Foundations of Computer Science, University of Edinburgh
Abstract. The W3C recently released the XQuery Update Facility 1.0, a Candidate Recommendation for an XML update language. It appears likely that this proposal will become standard. XQuery has been equipped with a formal semantics and sound type system, but there has been little work on static analysis or typechecking of XML updates, and the typing rules in the current W3C proposal appear unsound for “transform” queries that perform embedded updates. In this paper, we investigate the problem of schema alteration, or synthesizing an output schema describing the result of an update applied to a given input schema. We review regular expression type systems for XQuery, present a core language and semantics for W3C-style XML updates, and develop an effect analysis and schema alteration, which can be used as the basis for sound typechecking for queries involving “transform”.
1 Introduction
P. Gardner and F. Geerts (Eds.): DBPL 2009, LNCS 5708, pp. 1–17, 2009. © Springer-Verlag Berlin Heidelberg 2009

Query and transformation languages for XML data have been studied extensively, both in the database and programming language communities. The World Wide Web Consortium (W3C) has developed XQuery, a standard XML query language with a detailed formal semantics and type system [1,2]. Most real-world data changes over time, and so it is also important to be able to update XML documents and XML-based data. However, query languages such as XQuery, and transformation languages such as XSLT, provide support only for “functional” computation over immutable data, and are awkward for writing transformations that update part of the data “in-place” while leaving most of the document alone.

There have been a number of proposals and prototype implementations for XML update languages (see for example [3,4,5,6]). While no clear winner has emerged so far, the W3C has introduced the XQuery Update Facility [7] (henceforth called simply “XQuery Update”), combining features from several proposals; this is now supported by many XML database implementations. However, the typechecking and static analysis problems for XQuery Update (and for XML updates more generally) remain ill-understood. In contrast to XQuery, there is no formal semantics; moreover, the proposed typing rules for XQuery Update only ensure that updates are minimally well-formed, and do not show how to compute the type of the document after an update is performed. In fact, as we shall see, the proposed typing rules in the current W3C proposal are unsound.

In this paper we develop a sound type and effect system for XQuery Update based on regular expression types [8]. Regular expression types are closely related
to tree automata [9] and have been employed in a number of other settings [10]. We show how to infer safe over-approximations for the results of both queries and updates. This is nontrivial because we must consider destructive update at the schema/type level.

A complication is that XQuery Updates have a somewhat involved “snapshot” semantics. An update expression is first evaluated, yielding a sequence of atomic update operations; then the atomic update sequence is sanity-checked and finally applied. Moreover, updates are not applied in the order they were generated (as a programmer might expect) but instead are applied in several phases: insertions and renamings first, then replacements, then deletions.

Example 1. Consider the update:

  for $y in $x//a delete $y,
  for $y in $x//a, $z in $x//d return (insert $z before $y)

This deletes all nodes matching $x//a and inserts copies of all nodes matching $x//d before the deleted nodes. Suppose the input $x has type¹ doc[a[], b[], c[d[]]]. One might expect that the a node will be deleted first, so that the second update has no effect, yielding result type doc[b[], c[d[]]]. However, the informal semantics in the W3C proposal reorders insert operations before deletions, so the actual result type is doc[d[], b[], c[d[]]].

Uses of ancestor or sibling XPath axes further complicate typechecking:

Example 2. Consider the update expression:

  for $y in $x//a/following::b/parent::c return delete $y

Intuitively, this deletes all c nodes that are parents of b nodes that follow some a node in the document. If the input $x has type doc[b[c[]∗, a[]∗]] then this update has no effect; if $x : doc[a[], c[b[]∗]] then the output will always have type doc[a[], c[]?]; if $x : doc[(c[b[]], a[])∗] then the output will always have type doc[(c[b[]], a[]+)?]. In the XQuery standard, however, the typing rules for axes such as following and parent are very conservative: they assume that the result of a query might be any part of the document. □
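The reordering in Example 1 can be simulated concretely. The following is a minimal executable sketch of the snapshot semantics, not the W3C algorithm itself: trees are mutable Python nodes, and a pending update list is applied in stage order (renames and insert-into, then positional inserts, then replaces, then deletes) rather than generation order. All names (`Node`, `apply_pending`, the update tags) are our own.

```python
# A toy model of snapshot update application: nodes are mutable objects so
# that pending updates can target nodes by identity.

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def copy_tree(n):
    # deep copy, like the paper's copying judgment
    return Node(n.label, [copy_tree(c) for c in n.children])

def walk(n):
    yield n
    for c in n.children:
        yield from walk(c)

def stage(u):
    # W3C-style ordering: insert-into/rename, positional inserts,
    # replaces, then deletes (see Section 3 of the paper).
    return {'ins_into': 1, 'ren': 1,
            'ins_before': 2, 'ins_after': 2,
            'repl': 3, 'del': 4}[u[0]]

def apply_pending(root, pending):
    for u in sorted(pending, key=stage):   # NOT generation order
        if u[0] == 'ins_before':
            _, new, tgt = u
            for n in walk(root):
                if tgt in n.children:
                    i = n.children.index(tgt)
                    n.children[i:i] = [copy_tree(x) for x in new]
                    break
        elif u[0] == 'del':
            _, tgt = u
            for n in walk(root):
                if tgt in n.children:
                    n.children.remove(tgt)
                    break

# Example 1's input: doc[a[], b[], c[d[]]]
a, b, d = Node('a'), Node('b'), Node('d')
doc = Node('doc', [a, b, Node('c', [d])])
# pending list in generation order: delete a, then insert copy of d before a
pending = [('del', a), ('ins_before', [d], a)]
apply_pending(doc, pending)
print([c.label for c in doc.children])  # ['d', 'b', 'c']
```

The insert is applied before the delete, so a copy of d survives, matching the result type doc[d[], b[], c[d[]]] rather than doc[b[], c[d[]]].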
This would be disastrous from the point of view of typechecking updates such as the above, however, since we would have to assume that any part of the input could be the target of an update. Finally, XQuery Update includes a new “transform” query expression that performs updates in the middle of a query. The “transform” expression copies data into new variables and then modifies the copied data. This complicates typechecking because the modified values may be used in subsequent queries. The W3C proposal’s typing rules for “transform” do not seem to take this into account, and appear unsound:
¹ For brevity, we use compact, XDuce-style notation [8] for XML trees and types.
Example 3. A typical W3C “transform” expression is of the form

  copy $y := $x
  modify delete $y/c
  return $y
This expression behaves as follows: first we copy the value of $x and assign it to $y; then we evaluate the modifying expression delete $y/c and apply the resulting updates; finally, we return $y. Suppose $x : a[b[], c[]]. Thus, initially $y will have the same type. According to the typing rules given in the W3C proposal [7], the return expression will be typechecked with $y still assigned type a[b[], c[]], so the result of the query will be assigned type a[b[], c[]], but the return value will be of the form a[b[]].

To recover soundness for “transform” expressions it is necessary to ensure that the types of updated variables remain correct after the update is performed. One trivial, but unsatisfying, way to do so is to set updated variables’ types to Any, a type that matches any XML document. Another possibility, perhaps that intended by the W3C proposal, is that the data should be revalidated after each update snapshot completes (including updates in the modify clause of a transform). Such revalidation could ensure that the type information remains valid after a transform is done, so the above example would result in a run-time type error. However, the current draft is ambiguous on this point: it does specify that some revalidation should take place, but does not specify that revalidation should ensure that types are preserved [7, Sec. 2.4.5]. In any case, dynamic revalidation is potentially costly, and would require the schema designer to anticipate all possible changes to the schema in advance, thus precluding typechecking XQuery Update expressions that, for example, add new elements to the document that were not present in the original schema.

As these examples illustrate, it is easy to find pathological updates for which “good” output schemas appear difficult to predict. □
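Operationally, the unsoundness in Example 3 is easy to reproduce. The sketch below uses plain Python stand-ins for the copy/modify/return phases (our own encoding, not W3C-specified behavior); the value flowing out of the transform no longer matches the static type a[b[], c[]].

```python
# Trees as (label, [children]) pairs; a hypothetical stand-in for
#   copy $y := $x  modify delete $y/c  return $y

import copy as copymod

x = ('a', [('b', []), ('c', [])])              # $x : a[b[], c[]]

y = copymod.deepcopy(x)                        # copy $y := $x
y = (y[0], [c for c in y[1] if c[0] != 'c'])   # modify: delete $y/c
result = y                                     # return $y

print(result)  # ('a', [('b', [])]) -- shape a[b[]], not a[b[], c[]]
```

The original `$x` is untouched (the copy is deep), but typechecking the return expression with `$y` still at type a[b[], c[]] misdescribes `result`.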
In fact, in general there may be no schema (based on regular expression types) that exactly captures the output of a query, because the range of a query or update over a regular input language may not be regular [11]. Even typechecking an update given a fixed input and output schema is hard in general, and undecidable for full XQuery Update. Nevertheless, it is worthwhile to find sound, static overapproximations to the result of an XML query or update. We focus on developing a pragmatic approach that demonstrates reasonable behavior on common cases. It is already difficult just to develop a nontrivial sound analysis for the W3C proposal, however, and experimental validation of the practical utility of our approach is beyond the scope of this paper. Prior work has been done on typechecking and other static analyses for UpdateX [3,12] and Flux [4], and other XML update proposals [5]. However, no prior work applies directly to the W3C’s current XQuery Update proposal. While Benedikt et al. [3,12] considered a language similar to XQuery Update, they did not investigate typechecking. Cheney [4] studied regular expression typechecking for Flux, an XML update language that is simpler, but also less expressive,
than XQuery Update. Ghelli et al. studied commutativity analysis for an update language whose semantics differs substantially from the current version [5].

We want to emphasize that we consider a strict sublanguage of XQuery Updates that includes many key features but excludes some complications to the XML data model (such as attributes, processing instructions, etc.). We also leave out the “replace value of” operation [7]. However, extensions to handle these features appear straightforward. We also do not model the optional XML Schema validation, dynamic type-name tags, or revalidation features of XQuery Update. Instead, we adopt a purely structural type system with subtyping based on language inclusion, as in XQuery and much other previous work on typed XML programming languages [1,2,8,10,4]. Thus, our approach applies to any conforming implementation of XQuery Update, independent of whether it implements validation.

In this paper, we consider these related problems for XQuery Updates:

– pending effect analysis: given a schema and an update, approximate the possible atomic updates (“effect”) generated by the update.
– schema alteration: given a schema and an update effect, find an output schema that approximates the results of applying atomic updates described by the effect.

Prior work on typechecking of queries has not handled upward axes, since they use regular expression types that specify only the hedge or subtree structure of returned nodes, not their position within a larger schema. To handle the interaction of schemas and updates, we develop a type and effect system that can record this information. Hence our approach applies to a language that contains all XPath axis steps. In many XML processing settings (particularly databases) we can assume a fixed input schema and type declarations for the free variables of the expression, so we do not consider the (likely harder) schema inference problem of inferring types for both input variables and results.
Due to space limitations, in the body of the paper we omit full treatment of “transform” queries; however, the omitted material is straightforward given the results in the paper, and is provided in the companion technical report [13]. We omit proofs and standard or straightforward definitions; these are also provided in a companion technical report. The technical report also presents extended examples. Outline. The rest of this paper is structured as follows: Section 2 reviews core XQuery and schema languages we will use, and Section 3 introduces the atomic update and XQuery Update languages, along with their operational semantics. Section 4 defines a pending effect analysis for update expressions and proves its soundness. Section 5 presents a schema alteration algorithm that applies a pending effect to a schema. We discuss a prototype implementation in Section 6. Section 7 discusses related and future work and Section 8 concludes.
2 Background
W3C XQuery Update 1.0 extends XQuery, which is already a large language. Even restricting attention to a core language, we must present a great deal of background material. In this section we review XML stores, regular expression types, XPath steps, and queries. Whenever possible we omit standard definitions that can be found in previous work or in [13].

XML stores. Let Loc be a set of locations l. A location sequence L is a list of locations; we write () for the empty location sequence and L · L′ for sequence composition. A store σ is a mapping from locations to constructors k, defined as follows:

  k ::= text[s] | a[L]

where s is a string, a is an element node label and L is a list of locations. A well-formed store corresponds to an acyclic forest of XML trees. (We follow the XML data model and XQuery semantics in storing data values such as strings using “text nodes” in the store.) We introduce a copying judgment σ, L →copy σ′, L′ that, intuitively, extends σ to a store σ′ by copying the subtree under each location in L to a fresh subtree, collecting the resulting locations in list L′. This judgment is defined formally in [13].

Regular expression types. Following previous work [8,10,4], we employ regular expression types τ for XML queries and updates:

  τ ::= () | T | a[τ] | δ | τ, τ | τ|τ | τ∗

Here, δ is the base type of “data” (e.g. strings), and T, T′, . . . ∈ TName are type names. We consider schemas S mapping type names to types. In order to ensure regularity, we forbid uses of top-level type names in S(T); for example, both the type definitions T → a[], T, b[]|() and T → a[T′], T|() are forbidden, whereas T → a[T′]∗ is allowed (and is equivalent to T → a[T′], T|()). Such schemas are called regular. A type whose type names are drawn from S is called an S-type.

Regular schemas are very general and flexible, but they are awkward for our purposes, for two reasons. First, we want to be able to typecheck queries and updates involving navigation axes such as descendant, ancestor and following more accurately than the default XQuery approach. Second, it is non-obvious how to apply the effects of updates to general regular schemas. Both problems can be ameliorated using flat schemas. Flat schemas provide an explicit type name for each “part” (e.g. element or data type) in the schema corresponding to a “part” of a document. This makes them more suitable for updating. Flat schemas are defined as follows:
First, we want to be able to typecheck queries and updates involving navigation axes such as descendant, ancestor and following more accurately than the default XQuery approach. Second, it is non-obvious how to apply the effects of updates to general regular schemas. Both problems can be ameliorated using flat schemas. Flat schemas provide an explicit type name for each “part” (e.g. element or data type) in the schema corresponding to a “part” of a document. This makes them more suitable for updating. Flat schemas are defined as follows: Definition 1. A flat type is a regular expression over type names. A flat schema is a schema in which all type definitions are either of the form T → δ, or T → a[τ ] where τ is a flat type.
  σ |=S () : ()
  σ |=S l : a[τ]           if σ(l) = a[L] and σ |=S L : τ
  σ |=S l : δ              if σ(l) = text[s]
  σ |=S L1 · L2 : τ1, τ2   if σ |=S L1 : τ1 and σ |=S L2 : τ2
  σ |=S L : τ1|τ2          if σ |=S L : τ1 or σ |=S L : τ2
  σ |=S L : τ∗             if σ |=S L : () | τ, τ∗
  σ |=S L : T              if σ |=S L : S(T)

Fig. 1. Validation rules
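The validation relation can be prototyped directly from these rules. The sketch below is our own illustration: it assumes a flat schema, encodes flat types as tagged tuples, and uses a naive exponential matcher for sequence and star (names such as `validate` are ours, not from the paper).

```python
# Flat schema: name -> ('elem', label, flat_type) | ('data',)
# Flat types (regexes over names): ('eps',), ('name', T),
# ('seq', t1, t2), ('alt', t1, t2), ('star', t)

def matches(store, schema, loc, T):
    # sigma |=_S loc : T for a single location and type name
    defn = schema[T]
    node = store[loc]
    if defn[0] == 'data':
        return node[0] == 'text'
    _, label, content = defn
    return node[0] == 'elem' and node[1] == label and \
        validate(store, schema, node[2], content)

def validate(store, schema, locs, t):
    # sigma |=_S L : tau, by cases on the flat type
    if t[0] == 'eps':
        return locs == []
    if t[0] == 'name':
        return len(locs) == 1 and matches(store, schema, locs[0], t[1])
    if t[0] == 'seq':
        return any(validate(store, schema, locs[:i], t[1]) and
                   validate(store, schema, locs[i:], t[2])
                   for i in range(len(locs) + 1))
    if t[0] == 'alt':
        return validate(store, schema, locs, t[1]) or \
               validate(store, schema, locs, t[2])
    if t[0] == 'star':
        # nonempty-prefix split guarantees termination
        return locs == [] or any(
            validate(store, schema, locs[:i], t[1]) and
            validate(store, schema, locs[i:], t)
            for i in range(1, len(locs) + 1))
    raise ValueError(t)

# The flattened schema Z -> a[Z*] from the text, and a store for a[a[]]
schema = {'Z': ('elem', 'a', ('star', ('name', 'Z')))}
store = {0: ('elem', 'a', [1]), 1: ('elem', 'a', [])}
print(validate(store, schema, [0], ('star', ('name', 'Z'))))  # True
```

The matcher is exponential in the worst case; it is meant only to make the declarative rules concrete, not to be an efficient validator.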
In a flat schema, a type name is mapped to either a single element a[τ] (with flat content type τ) or δ. For example, X, (Y∗, Z)∗ is a flat type and X → a[X, (Y∗, Z)∗] is a flat schema rule. Flat schemas are syntactically more restrictive than general schemas, and hence they are less convenient for users. Fortunately, it is always possible to translate a regular schema S to an equivalent flat schema S′, as follows: first introduce fresh type definitions T′ → a[τ] for each type of the form a[τ] occurring in the original schema, rewriting the existing definitions and un-nesting nested element constructors. Then, “inline” all occurrences of the original type names in the schema with their new definitions. Other S-types in a context Γ can also be translated to S′-types in this way. As an example, the flat schema S′ corresponding to Y → a[Y]∗ is Z → a[Z∗], and the flat S′-type corresponding to the S-type Y is Z∗.

Validation. We define a validation relation σ |=S L : τ that states that in store σ and schema S, location sequence L matches type τ. The rules in Figure 1 define validation.

Aliasing. In determining types for updates, we will have to know whether two types can point to the same thing – this is a critical part of the algorithm in the beginning of Section 5. Types T and T′ may alias² (with respect to S) provided that for some σ and l ∈ dom(σ), we have σ |=S l : T and σ |=S l : T′. Equivalently, types T and T′ do not alias provided that they are disjoint, considered as regular tree languages (with respect to S viewed as a tree automaton). Disjointness is decidable for regular languages, and for restricted expressions (e.g. 1-unambiguous), tractable procedures are known [14,15]. For the purposes of this paper we assume that we are given sound alias sets aliasS(T) such that if T and T′ may alias we have T′ ∈ aliasS(T).

XPath axes. XPath is an important sublanguage of both XQuery and XQuery Update.
XPath steps are expressions of the form:

  step ::= ax::φ
  φ ::= ∗ | n | text
  ax ::= self | child | descendant | parent | ancestor | · · ·
² Aliasing means that two names refer to the same thing. In pointer analysis, aliasing usually means that two variable names refer to the same memory location. Here, aliasing means two type names may match the same store location.
The semantics and static analysis problems for XPath have been well-studied [16,9]. We will abstract away from the details of XPath in this paper, by introducing judgments σ |= l/ax::φ ⇒step L to model XPath step evaluation and S ⊢ T/ax::φ ⇒step τ to model static typechecking for XPath steps. For the purposes of this paper, we assume that these relations satisfy the following soundness property:
Lemma 1. If S ⊢ T/ax::φ ⇒step τ and σ |= l/ax::φ ⇒step L and σ |=S l : T then σ |=S L : τ.

Environments and type contexts. We employ (dynamic) environments γ mapping variables x, y, . . . ∈ Var to location sequences L, and type contexts (also known as static environments) Γ mapping variables to regular expression types. We write • for an empty environment or type context, and write γ[x := L] for the result of updating a context by binding x to L. A type context is flat if its types are flat. An S-context is a context whose types are S-types. We also write σ |=S γ : Γ to indicate that ∀x ∈ dom(Γ). σ |=S γ(x) : Γ(x).

Queries. We introduce a core XQuery fragment, following Colazzo et al. [10].

  q ::= x | () | q, q | a[q] | s | x/step | if q then q1 else q2 | let x := q in q | for x ∈ q return q

The empty sequence (), element constructor a[q], sequential composition q, q and string s expressions build XML values. Variables and let-bindings are standard; conditionals branch depending on whether their first argument is nonempty. The expression x/step performs an XPath step starting from x. The iteration expression for x ∈ q return q′ evaluates q to L, and evaluates q′ with x bound to each location l in L, concatenating the results in order. We model the operational semantics of queries using a judgment σ, γ |= q ⇒ σ′, L. Note that the store σ may grow as a result of allocation, for example in evaluating expressions of the form a[q] and s. We employ an auxiliary judgment σ, γ |= q ⇒copy σ′, L that is used for element node construction and later in the semantics of inserts (see Section 3) and transforms [13, Sec. 6]. The rules defining these judgments are given in [13]; here are two illustrative rules:

  σ, γ |= q ⇒copy σ′, L′            if σ, γ |= q ⇒ σ0, L0 and σ0, L0 →copy σ′, L′
  σ, γ |= a[q] ⇒ σ′[l := a[L]], l   if σ, γ |= q ⇒copy σ′, L and l ∉ dom(σ′)
Note that the second rule employs the auxiliary “copying” judgment, which simply evaluates a query and makes a fresh copy of the result.
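To make this concrete, here is a minimal Python sketch (our own encoding; the paper does not prescribe an implementation) of the store, the copying judgment, and the element-construction rule. Fresh locations are modeled as increasing integers.

```python
import itertools

# Store: loc -> ('text', s) | ('elem', a, [locs])

def copy_locs(store, locs, fresh):
    # Mirrors the copying judgment: extend the store with fresh deep
    # copies of the subtrees rooted at locs, returning the new locations.
    out = []
    for l in locs:
        k = store[l]
        if k[0] == 'text':
            l2 = next(fresh)
            store[l2] = k
        else:
            _, a, kids = k
            kids2 = copy_locs(store, kids, fresh)
            l2 = next(fresh)
            store[l2] = ('elem', a, kids2)
        out.append(l2)
    return out

def construct(store, a, locs, fresh):
    # a[q]: copy the result of q, then allocate a fresh element node l,
    # with l not in dom(store)
    copied = copy_locs(store, locs, fresh)
    l = next(fresh)
    store[l] = ('elem', a, copied)
    return l

fresh = itertools.count(100)
store = {0: ('elem', 'b', []), 1: ('text', 'hi')}
root = construct(store, 'a', [0, 1], fresh)
print(store[root][1], len(store[root][2]))  # a 2
```

Note that the original subtrees at locations 0 and 1 are untouched; only fresh copies become children of the new element, as in the second rule above.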
3 Core XQuery Updates
Atomic updates. We consider atomic updates of the form:

  ι ::= ins(L, d, l) | del(l) | repl(l, L) | ren(l, a)
  d ::= ← | → | ↓ | first | last

Here, the direction d indicates whether to insert before (←), after (→), or into the child list in first (first), last (last) or arbitrary position (↓). Moreover, we consider sequences of atomic updates ω, with the empty sequence written ε and concatenation written ω; ω′.

Updating expressions. We now define the syntax of updating expressions, based roughly on those of the W3C XQuery Update proposal.

  u ::= () | u, u | if q then u1 else u2 | for x ∈ q return u | let x := q in u
      | insert q d q0 | replace q0 with q | rename q0 as a | delete q0

The XQuery Update proposal overloads existing query syntax for updates. The () expression is a “no-op” update, the expression u, u′ is sequential composition, and let-bindings, conditionals, and for-loops are also included. There are four basic update expressions: insertion insert q d q0, which says to insert a copy of q in position d relative to the value of q0; deletion delete q0, which says to delete the value of q0; renaming rename q0 as a, which says to rename the value of q0 to a; and replacement replace q0 with q, which says to replace the value of q0 with a copy of q. In each case, the target expression q0 is expected to evaluate to a single node; if not, evaluation fails.

Semantics. Updates have a multi-phase semantics. First, the updating expression is evaluated, resulting in a pending update list ω. We model this phase using an update evaluation judgment σ, γ |= u ⇒ σ′, ω, along with an auxiliary judgment σ, γ, x ∈ L |= u ⇒ σ′, ω that handles for-loops. The rules for these judgments are presented in Figure 2. Note that again the store may grow as a result of allocation, but the values of existing locations in σ do not change in this phase. Next, ω is checked to ensure, for example, that no node is the target of multiple rename or replace instructions.
We do not model this sanity-check phase explicitly here; instead we simply introduce an abstract predicate sanitycheck(ω) that checks that ω is a valid update sequence. Finally, the pending updates are applied to the store. The semantics of atomic updates is defined using the judgment σ |= ι ⇝ σ′ presented in Figure 3. One natural-seeming semantics for update application is simply to apply the updates in ω in (left-to-right) order. However, this naive semantics is not what the W3C proposal actually specifies [7]. Instead, updates are applied in the following order: (1) “insert into” and rename operations, (2) “insert before, after, as first” and “as last” operations, (3) “replace” operations, and finally (4) “delete” operations. (There is an extra stage for “replace value of” operations in [7], which
  σ, γ |= () ⇒ σ, ε
  σ1, γ |= u1, u2 ⇒ σ3, ω1; ω2                    if σ1, γ |= u1 ⇒ σ2, ω1 and σ2, γ |= u2 ⇒ σ3, ω2
  σ1, γ |= if q then u1 else u2 ⇒ σ3, ω1          if σ1, γ |= q ⇒ σ2, l · L and σ2, γ |= u1 ⇒ σ3, ω1
  σ1, γ |= if q then u1 else u2 ⇒ σ3, ω2          if σ1, γ |= q ⇒ σ2, () and σ2, γ |= u2 ⇒ σ3, ω2
  σ1, γ |= let x := q in u ⇒ σ3, ω                if σ1, γ |= q ⇒ σ2, L and σ2, γ[x := L] |= u ⇒ σ3, ω
  σ1, γ |= for x ∈ q return u ⇒ σ3, ω             if σ1, γ |= q ⇒ σ2, L and σ2, γ, x ∈ L |= u ⇒ σ3, ω
  σ1, γ |= insert q1 d q2 ⇒ σ3, ins(L1, d, l2)    if σ1, γ |= q1 ⇒copy σ2, L1 and σ2, γ |= q2 ⇒ σ3, l2
  σ1, γ |= delete q ⇒ σ2, del(l)                  if σ1, γ |= q ⇒ σ2, l
  σ1, γ |= replace q1 with q2 ⇒ σ3, repl(l1, L2)  if σ1, γ |= q1 ⇒ σ2, l1 and σ2, γ |= q2 ⇒copy σ3, L2
  σ1, γ |= rename q as a ⇒ σ2, ren(l, a)          if σ1, γ |= q ⇒ σ2, l
  σ, γ, x ∈ () |= u ⇒ σ, ε
  σ1, γ, x ∈ l · L |= u ⇒ σ3, ω1; ω2              if σ1, γ[x := l] |= u ⇒ σ2, ω1 and σ2, γ, x ∈ L |= u ⇒ σ3, ω2

Fig. 2. Rules for evaluating update expressions to pending update lists

  σ |= ins(L, ←, l) ⇝ σ[l′ := a[L1 · L · l · L2]]   if σ(l′) = a[L1 · l · L2]
  σ |= ins(L, →, l) ⇝ σ[l′ := a[L1 · l · L · L2]]   if σ(l′) = a[L1 · l · L2]
  σ |= ins(L, first, l) ⇝ σ[l := a[L · L′]]         if σ(l) = a[L′]
  σ |= ins(L, last, l) ⇝ σ[l := a[L′ · L]]          if σ(l) = a[L′]
  σ |= ins(L, ↓, l) ⇝ σ[l := a[L1 · L · L2]]        if σ(l) = a[L1 · L2]
  σ |= ren(l, b) ⇝ σ[l := b[L]]                     if σ(l) = a[L]
  σ |= repl(l, L) ⇝ σ[l′ := a[L1 · L · L2]]         if σ(l′) = a[L1 · l · L2]
  σ |= del(l) ⇝ σ[l′ := a[L1 · L2]]                 if σ(l′) = a[L1 · l · L2]

Fig. 3. Semantics of atomic updates

  stage(ins( , ↓, )) = 1     stage(ren( , )) = 1
  stage(ins( , d, )) = 2     (d ∈ {←, →, first, last})
  stage(repl( , )) = 3       stage(del( )) = 4

  σ0 |= ω ⇝ σ4          if σ0 |=1 ω ⇝ σ1, σ1 |=2 ω ⇝ σ2, σ2 |=3 ω ⇝ σ3 and σ3 |=4 ω ⇝ σ4
  σ |=i ω1; ω2 ⇝ σ′′    if σ |=i ωj ⇝ σ′ and σ′ |=i ωk ⇝ σ′′ for {j, k} = {1, 2}
  σ |=i ι ⇝ σ′          if stage(ι) = i and σ |= ι ⇝ σ′
  σ |=i ι ⇝ σ           if stage(ι) ≠ i
  σ |=i ε ⇝ σ

  σ, γ |= u ⇝ σ′′       if σ, γ |= u ⇒ σ′, ω, sanitycheck(ω) and σ′ |= ω ⇝ σ′′

Fig. 4. Update application
we omit.) Subject to these constraints, the order of application within each stage is unspecified. To model this behavior we introduce a judgment σ |= ω ⇝ σ′, along with an auxiliary function stage(ι) and judgments σ |=i ω ⇝ σ′ for stages i ∈ {1, 2, 3, 4}. The rules defining these judgments are shown in Figure 4. Note that the rule for sequential composition permits arbitrary reordering of update sequences (which are also identified up to associativity). Static analyses for the W3C semantics are not in general valid for the naive, “in-order” semantics and vice versa. The final rule in Figure 4 defines the judgment σ, γ |= u ⇝ σ′, which evaluates an update, checks that the resulting pending update list is valid, and then applies the updates to the store.

Inferring Types. For functional programs (i.e., queries) on documents, the notion of a valid type for an expression is fairly clear: given a schema S and expression e, a typing is a representation (e.g. by a regular expression type) of a set of trees; it is valid if it represents all of the possible hedges of subtrees returned by the query. Since XML updates modify the input store but do not return a value, the appropriate notion of a valid typing is less familiar. Our goal is to define a typing judgment S, Γ ⊢ u ⇒ S′, Γ′ that relates an update u, an input schema S and an S-context Γ to a new schema S′ and a new S′-context Γ′ in which the types of variables in Γ have been adjusted to account for the changes made by the update. The basic correctness criterion we expect for this judgment is that if the initial store satisfies Γ with respect to S, then the final store resulting from applying u satisfies the type context Γ′ with respect to S′. This property (Corollary 1) is the main result of the paper. Typically, the initial store will consist of a single tree and the environment γ will map a single variable $doc to the root of the tree.
In this case our correctness property guarantees that the portion of the output reachable from this root will satisfy the new schema S′.
4 Type and Effect Analysis
Query result type analysis. First, for queries we would like to define a typechecking judgment S; Γ ⊢ q : τ that calculates a return type τ for q when run in context Γ. Previous work on type systems for XML queries has been based on general regular-expression types [1,10]; here, however, we want to infer flattened types. To do this in the presence of element-node constructor expressions, we may need to add rules to the schema, so we employ a judgment S; Γ ⊢ q : τ; S′. The rules are mostly straightforward generalizations of those in Colazzo et al. [10] and so are relegated to [13]. The key new rules with respect to previous work are those for node construction and XPath axis steps, respectively:

S; Γ ⊢ q : τ; S′    T ∉ dom(S′)
────────────────────────────────
S; Γ ⊢ a[q] : T; S′[T := a[τ]]
S ⊢ Γ(x)/ax::φ ⇒ τ
──────────────────────── (step)
S; Γ ⊢ x/ax::φ : τ; S
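To make the node-construction rule concrete, here is a small illustrative sketch, not the authors' implementation: the dict-based schema representation and the helper names are invented for the example. It allocates a fresh type name T and extends the schema with the rule T → a[τ].

```python
# Illustrative sketch of S; Γ ⊢ a[q] : T; S'[T := a[τ]].
# Schemas are modeled as dicts mapping type names to (label, content) pairs.
import itertools

_counter = itertools.count()

def fresh_name(schema):
    """Return a type name not in dom(schema)."""
    while True:
        t = f"T{next(_counter)}"
        if t not in schema:
            return t

def type_element(schema, label, content_type):
    """Given q : content_type under `schema`, type a[q] by allocating a fresh
    flattened type name T and adding the rule T -> label[content_type]."""
    schema2 = dict(schema)            # S' extends S with one new rule
    t = fresh_name(schema2)
    schema2[t] = (label, content_type)
    return t, schema2

schema = {"U": ("b", ()), "V": ("c", ())}
t, schema2 = type_element(schema, "a", ("U", "V"))
assert schema2[t] == ("a", ("U", "V"))
assert set(schema) <= set(schema2) and t not in schema
```

The freshness check mirrors the T ∉ dom(S′) premise: the output type of a constructor is always a newly named rule rather than a structural regular expression, which is what keeps inferred types flattened.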
Semantics, Types and Effects for XML Updates
σ |=S L : τ    σ |=S l : T
──────────────────────────────────
σ |=S ins(L, d, l) : ins(τ, d, T)

σ |=S l : T
──────────────────────
σ |=S del(l) : del(T)

σ |=S l : T
──────────────────────────
σ |=S ren(l, a) : ren(T, a)

σ |=S l : T    σ |=S L : τ
────────────────────────────────
σ |=S repl(l, L) : repl(T, τ)

Fig. 5. Some representative effect validity rules

S; Γ ⊢ q : τ; S1    S1; Γ ⊢ q′ : T; S2    (if d ∈ {↓, ↙, ↘} then S2(T) = δ)
────────────────────────────────────────────────────────────────────────────
S; Γ ⊢ insert q d q′ : ins(τ, d, T); S2

S; Γ ⊢ q : T; S′
─────────────────────────────────────
S; Γ ⊢ rename q as a : ren(T, a); S′

S; Γ ⊢ q : T; S1    S1; Γ ⊢ q′ : τ; S2
───────────────────────────────────────────
S; Γ ⊢ replace q with q′ : repl(T, τ); S2

S; Γ ⊢ q : T; S′
──────────────────────────────
S; Γ ⊢ delete q : del(T); S′

Fig. 6. Some representative update effect-inference rules
Theorem 1 (Type Soundness). If S; Γ ⊢ q : τ; S′ then for all σ, γ, L, σ′, if σ |=S γ : Γ and σ, γ |= q ⇒ σ′, L then σ′ |=S′ L : τ.

Update effect analysis. We next turn to the problem of statically approximating the pending update list generated by an update expression. We use the term (pending) effect for such static approximations. Effects have the following forms:

Ω ::= ε | Ω; Ω | Ω|Ω | Ω∗ | ins(τ, d, T) | del(T) | ren(T, a) | repl(T, τ)

The semantics of effects is defined by the judgment σ |=S ω : Ω in Figure 5; we leave out standard rules for regular expression forms. Intuitively, σ |=S ω : Ω says that in store σ and schema S, the atomic updates ω match the effect expression Ω. We use judgments S; Γ ⊢ u : Ω; S′ and S; Γ; x ∈ τ ⊢ u : Ω; S′ to infer effects for plain and iterative updates respectively. We show some representative rules in Figure 6; the full definition is in the technical report. Note that typechecking an update may also require adding rules to the result schema, because of embedded node construction (e.g. insert foo[] ↓ x).

Theorem 2 (Effect Soundness). If S; Γ ⊢ u : Ω; S′ then for all σ, γ, if σ |=S γ : Γ and σ, γ |= u ⇒ σ′, ω then σ′ |=S′ ω : Ω.

Type soundness only guarantees that the results of successful executions will match the static type. Dynamic errors may still occur while evaluating a well-formed query. Similarly, update effect soundness only guarantees that the results of a successful update evaluation will match the computed effect, not that evaluation will be free of dynamic errors. We believe our techniques can be modified to issue static warnings about possible dynamic errors in queries, but this is beyond the scope of this paper.
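The schema-alteration algorithm below discards the regular-expression structure of effects and works with the set of atomic effects, written |Ω|. As a hedged illustration (the tuple encoding of effect expressions is invented for this sketch), that flattening step can be written as:

```python
# Sketch: effect expressions Ω as nested tuples, and the atomic-effect set |Ω|.
def atomic_effects(omega):
    """Flatten an effect expression to its set of atomic effects |Ω|,
    ignoring sequencing/choice/star structure, as the alteration
    algorithm described in the paper does."""
    kind = omega[0]
    if kind in ("seq", "alt"):            # Ω;Ω and Ω|Ω
        return atomic_effects(omega[1]) | atomic_effects(omega[2])
    if kind == "star":                     # Ω*
        return atomic_effects(omega[1])
    if kind == "eps":                      # ε
        return set()
    return {omega}                         # ins/del/ren/repl are atomic

omega = ("seq", ("ins", ("U", "V"), "into", "T"),
                ("alt", ("del", "T"), ("ren", "T", "b")))
assert atomic_effects(omega) == {
    ("ins", ("U", "V"), "into", "T"), ("del", "T"), ("ren", "T", "b")}
```

Dropping the regex structure is a deliberate overapproximation: it loses ordering and multiplicity information but keeps the alteration phases simple and sound.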
5 Schema Alteration
We now present an algorithm for schema alteration, that is, soundly overapproximating the actual effects a pending update may have on a schema. Given input type context Γ, schema S and pending effect Ω we want to infer a suitable output schema S′ and type context Γ′. The rough idea is as follows:

1. Augment the input schema S to S̃ by adding new temporary type names standing for "places" where updates may occur.
2. Determine which type names may match the same store location at run time, using alias analysis (as described in Section 2).
3. Simulate the effects of each stage of atomic update application on S̃.
4. Finally, flatten the updated S̃ to S′ and update the type context Γ to Γ′.

We first illustrate the above algorithm by an example:

Example 4. Suppose we have effect Ω = ins((U, V), ↙, T), del(T), ren(T, b), with schema S given by rules S → doc[T], T → a[U, V], U → b[], V → c[], and Γ = x : S. Using the schema S we will form a new schema S̃ extending S with additional type names and instrumented rules based on the rules of S. For example, for the single rule T → a[U, V] we generate three rules:

T̃ → T̃←, T̃r, T̃→      T̃r → a[T̃c]      T̃c → T̃↙, T̃↓, Ũ, T̃↓, Ṽ, T̃↓, T̃↘

Here, the five type names T̃↓, T̃←, T̃→, T̃↙, and T̃↘ stand for data inserted "into", "before", "after", "as first into", or "as last into" T. The type name T̃r stands for the data "replacing" T, and the type name T̃c stands for the "content" of T. The rest of the auxiliary type names are all initially defined as (). Note therefore that each type T̃ in the augmented schema S̃ is initially equivalent to T in S, in the sense that they match the same subtrees.

Next, we simulate the static effects, in order of stage. In stage 1, we perform the rename operation, by altering the definition of T̃r to a[T̃c] | b[T̃c]. In stage 2 we simulate the effect ins((U, V), ↙, T) by setting T̃↙ to (U, V)∗. Here we refer to the original types U and V in S, which have the same definitions as before. Stage 3 is inactive, and finally in stage 4 we apply the deletion by setting T̃r to a[T̃c] | b[T̃c] | (). In this example, there are no other type names that may alias T. Had there been, we would have applied the same changes to the aliases of T.

Finally, we re-flatten the final schema. In this case consider the rule for T̃. Flattening and simplifying yields S̃ → doc[T̃1 | T̃2 | ()], T̃1 → a[(U, V)∗, Ũ, Ṽ], T̃2 → b[(U, V)∗, Ũ, Ṽ]. Note that this type refers to both the old and new versions of U and V (they happen to be the same in this case). We also modify the type context to x : S̃ to reflect the change.

Another, more elaborate example is shown in Figure 7. We now describe the schema alteration algorithm more carefully.
Initial augmented schema:
  S → doc[T]    S̃ → S̃←, S̃r, S̃→    S̃r → doc[S̃c]    S̃c → S̃↙, S̃↓, T̃, S̃↓, S̃↘
  T → a[U, V]   T̃ → T̃←, T̃r, T̃→    T̃r → a[T̃c]      T̃c → T̃↙, T̃↓, Ũ, T̃↓, Ṽ, T̃↓, T̃↘
  U → b[]       Ũ → Ũ←, Ũr, Ũ→     Ũr → b[Ũc]      Ũc → Ũ↙, Ũ↓, Ũ↘
  V → c[]       Ṽ → Ṽ←, Ṽr, Ṽ→     Ṽr → c[Ṽc]      Ṽc → Ṽ↙, Ṽ↓, Ṽ↘
All other new type names are initialized to ().

Effect: |Ω| = {ins(V, ←, U), ren(U, d), repl(V, U∗), del(T)}

Schema changes:
  Phase 1: Ũr → b[Ũc] | d[Ũc]
  Phase 2: Ũ← → V∗
  Phase 3: Ṽr → c[Ṽc] | U∗
  Phase 4: T̃r → a[T̃c] | ()

Result schema (after some equational simplifications):
  S → doc[T]    S̃ → S̃r         S̃r → doc[S̃c]         S̃c → T̃
  T → a[U, V]   T̃ → T̃r         T̃r → a[T̃c] | ()      T̃c → Ũ, Ṽ
  U → b[]       Ũ → V∗, Ũr     Ũr → b[Ũc] | d[Ũc]   Ũc → ()
  V → c[]       Ṽ → Ṽr         Ṽr → c[Ṽc] | U∗      Ṽc → ()

Re-flattened schema:
  S → doc[T]   T → a[U, V]   U → b[]   V → c[]
  S̃ → doc[T̃ | ()]   T̃ → a[V∗, (Ũ1 | Ũ2), (Ṽ0 | U∗)]   Ũ1 → b[]   Ũ2 → d[]   Ṽ0 → c[]

Fig. 7. Detailed example of schema alteration
Preprocessing. First we pre-compute sound approximate aliasing information for S, computing the set alias(T) = aliasS(T) for each T. We next define the augmented schema S̃ as follows. For each rule T → a[τ] in S, we introduce the rules

T̃ → T̃←, T̃r, T̃→      T̃r → a[T̃c]      T̃c → T̃↙, h(τ), T̃↓, T̃↘

where h is the (unique) regular expression homomorphism satisfying h(U) = T̃↓, Ũ for all U in S. We map all other new type names in S̃ to ().

Effect application. We now apply the effects to the augmented schema. The behavior of an effect is applied to the effect's target type name and all of its aliases. We will ignore the regular expression structure of effects and just consider the set of atomic effects, written |Ω|. Similarly, we write |τ| for the set of all type names mentioned in τ. We also write {τ1, . . . , τn} for the regular expression τ1 | · · · | τn.

Phase 1: To simulate insert-into operations, for each type name T, we define the set I↓(T) = {U | ∃T′ ∈ alias(T). ∃τ. ins(τ, ↓, T′) ∈ |Ω|, U ∈ |τ|}. We then replace rule T̃↓ → () with T̃↓ → (I↓(T))∗ in S̃. To simulate renamings, for each type name T, we define the set N(T) = {b | ∃T′ ∈ alias(T). ren(T′, b) ∈ |Ω|}, and replace rule T̃r → τ0 with T̃r → τ0 | {b[T̃c] | b ∈ N(T)}.
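The two Phase-1 sets have direct set-comprehension readings. A minimal sketch, under an invented encoding where atomic effects are tuples and |τ| is simply a tuple of type names:

```python
# Sketch of the Phase-1 sets: I↓(T) collects type names inserted "into" T
# (or any alias of T); N(T) collects the labels T may be renamed to.
def insert_into_set(T, alias, atomic):          # I↓(T)
    return {U for eff in atomic
              if eff[0] == "ins" and eff[2] == "into" and eff[3] in alias[T]
              for U in eff[1]}

def rename_set(T, alias, atomic):               # N(T)
    return {eff[2] for eff in atomic if eff[0] == "ren" and eff[1] in alias[T]}

atomic = {("ins", ("U", "V"), "into", "T"), ("ren", "T", "b")}
alias = {"T": {"T"}, "U": {"U"}, "V": {"V"}}
assert insert_into_set("T", alias, atomic) == {"U", "V"}
assert rename_set("T", alias, atomic) == {"b"}
```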
Phase 2: To simulate the remaining insert operations, we define the set Id(T) = {τ | ∃T′ ∈ alias(T). ins(τ, d, T′) ∈ |Ω|} and then replace rule T̃d → () with T̃d → (Id(T))∗ for each type name T and direction d ∈ {←, →, ↙, ↘}.

Phase 3: To simulate replacement operations, we construct the set R(T) = {τ | ∃T′ ∈ alias(T). repl(T′, τ) ∈ |Ω|} of possible replacements for each T, and replace the rule T̃r → τ0 with T̃r → τ0 | R(T).

Phase 4: To simulate deletions, for each T, if del(U) ∈ |Ω| for some U ∈ alias(T), replace the rule T̃r → τ0 with T̃r → τ0 | ().

Postprocessing. Once we have finished symbolically updating S̃, we also update Γ to Γ̃ by replacing each binding x : τ in Γ with x : τ̃, where τ̃ is the regular expression obtained by replacing each T with T̃. We also flatten S̃ and Γ̃ to obtain S′ and Γ′. We will write S, Γ ⊢ Ω ⇒ S′, Γ′ to indicate that given input schema S and typing context Γ, symbolically evaluating Ω yields flattened output schema S′ and typing context Γ′. We also define S, Γ ⊢ u ⇒ S′′, Γ′′ as meaning that S; Γ ⊢ u : Ω; S′ and S′, Γ ⊢ Ω ⇒ S′′, Γ′′ hold for some S′ and Ω.

Correctness. The main correctness properties (proved in [13]) are:

Theorem 3. Suppose S, Γ ⊢ Ω ⇒ S′, Γ′. If σ |=S γ : Γ and σ |=S ω : Ω and σ |= ω ⇝ σ′ then σ′ |=S′ γ : Γ′.

Corollary 1. Suppose S, Γ ⊢ u ⇒ S′, Γ′ and σ |=S γ : Γ and σ, γ |= u ⇝ σ′. Then σ′ |=S′ γ : Γ′.
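The four phases can be sketched end-to-end on (roughly) the data of Example 4. This is an illustrative toy, not the prototype: the rule encoding, the auxiliary-name scheme (`T_r`, `T_c`, `T_before`, ...), and the phase bookkeeping are all invented here, and the flattening postprocessing is omitted.

```python
# Hedged sketch of the four alteration phases. Rule bodies use an invented
# encoding: ("alt", ...) for |, ("star", x) for x*, ("seq", ...) for
# concatenation, ("eps",) for ().
EPS = ("eps",)

def augment(schema):
    """For each rule T -> a[tau], add T_r -> a[T_c], T_c -> tau and
    auxiliary insertion points (the h-homomorphism interleaving of T_into
    markers is elided -- we only track the parts the phases below touch)."""
    aug = {}
    for t, (label, tau) in schema.items():
        aug[t + "_r"] = ("alt", (label, t + "_c"))
        aug[t + "_c"] = tau
        for d in ("before", "after", "into", "first", "last"):
            aug[t + "_" + d] = EPS
    return aug

def alter(aug, effects, alias):
    for eff in effects:                      # phase 1: renames
        if eff[0] == "ren":
            for t in alias[eff[1]]:
                aug[t + "_r"] = aug[t + "_r"] + ((eff[2], t + "_c"),)
    for eff in effects:                      # phases 1-2: inserts
        if eff[0] == "ins":
            for t in alias[eff[3]]:
                aug[t + "_" + eff[2]] = ("star", eff[1])
    for eff in effects:                      # phase 3: replacements
        if eff[0] == "repl":
            for t in alias[eff[1]]:
                aug[t + "_r"] = aug[t + "_r"] + (eff[2],)
    for eff in effects:                      # phase 4: deletions
        if eff[0] == "del":
            for t in alias[eff[1]]:
                aug[t + "_r"] = aug[t + "_r"] + (EPS,)
    return aug

schema = {"T": ("a", ("seq", "U", "V")), "U": ("b", EPS), "V": ("c", EPS)}
alias = {t: {t} for t in schema}             # no aliasing in this example
aug = alter(augment(schema),
            [("ins", ("seq", "U", "V"), "first", "T"),
             ("ren", "T", "b"), ("del", "T")], alias)
# after the phases: T_r -> a[T_c] | b[T_c] | ()  and  T_first -> (U, V)*
assert aug["T_r"] == ("alt", ("a", "T_c"), ("b", "T_c"), EPS)
assert aug["T_first"] == ("star", ("seq", "U", "V"))
```

Note how each phase only rewrites rules for the target's alias set, never the original rules of S, mirroring the paper's separation between old and instrumented type names.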
6 Implementation
We have developed a prototype implementation of the type and effect analysis and schema alteration in OCaml, to demonstrate the feasibility of the approach. We have tested the implementation on a number of examples from the XQuery Update Use Cases [17]. For these small updates and schemas, schema alteration takes under 0.1s. Space limitations preclude a full discussion of examples; we discuss the accuracy of the resulting schemas in [13]. However, there are several possible avenues for improvement:

– Currently flattening produces large numbers of temporary type names, increasing the size of the output and limiting readability. An obvious approach would be to perform flattening only "on demand", when effect application requires further navigation into the schema below a certain type name.
– Both effect application and flattening can produce redundancy in type expressions. Currently we simplify the regular expression types in the output schema using basic rules such as (), τ ≡ τ ≡ τ, () and (τ∗)∗ ≡ τ∗. Postprocessing using full-fledged regular expression simplification might be more useful [18].
– We have implemented a simple, but inaccurate alias analysis: we assume that two types alias if they have the same root element label. For the DTD-based examples in [13], this is exact. However, for more complex updates and schemas, we may need more sophisticated alias analysis to produce useful results.
– Type and effect inference appears to be worst-case exponential in the presence of nested for-loops. In practice, typical queries and updates are small and of low nesting depth, so we expect the size of the schema to be the dominant factor. The type, effect and schema alteration algorithms appear to be polynomial in the size of the schema for fixed expressions. Further study of the complexity of our analysis in the worst case or for typical cases may be of interest.
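The simple label-based alias analysis described in the first item above can be sketched in a few lines (the dict-based schema encoding is invented for the example):

```python
# Sketch of the prototype's alias analysis: two type names may alias
# iff their rules carry the same root element label.
def alias_sets(schema):
    """alias(T) = {U | label(U) == label(T)}, with schema mapping
    type names to (label, content) pairs."""
    return {t: {u for u, (lu, _) in schema.items() if lu == schema[t][0]}
            for t in schema}

schema = {"T": ("a", ("U", "V")), "U": ("b", ()), "V": ("b", ())}
aliases = alias_sets(schema)
assert aliases["U"] == {"U", "V"}     # both labeled b, so they may alias
assert aliases["T"] == {"T"}
```

This is sound but coarse: any two types with the same label are conflated, which is exactly why DTD-like schemas (one type per label) make it exact.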
7 Related and Future Work
There is a great deal of related work on static analysis of fragments of XPath [16], regular expression types and schema languages [8,19], and XML update language designs [3,4,5,6,7]. We restrict attention to closely related work. Cheney developed a typed XML update language called Flux [4], building on the XQuery type system of Colazzo et al. [10]. Flux differs significantly from XQuery Update and handles only child and descendant axes, but its semantics is much simpler. Static analysis problems besides typechecking have also been studied for XML or object query/update languages. Bierman [20] developed an effect analysis that tracks object-identifier generation side-effects in OQL queries. Benedikt et al. [3,12] presented static analyses for optimizing updates in UpdateX, a precursor to XQuery Update. Ghelli et al. [5] present a commutativity analysis for an XML update language. Roughly speaking, two updates u1, u2 commute if they have the same side-effects and results regardless of the order in which they are run. Their update language also differs from XQuery Update in important ways: in particular, the intended semantics of the W3C proposal (as formalized in this paper, for example) seems to imply that u1, u2 will always have the same (potential) side-effects as u2, u1. There is also prior work on typechecking for XML transformations (see e.g. Møller and Schwartzbach [21] for an overview). Much of this work focuses on decidable subproblems where both input and output schemas are given in advance, whereas we focus on developing sound, practical schema alteration techniques for general queries and updates. Also, there is no obvious mapping from XQuery Updates to transducers. Balmin et al. [22] and Barbosa et al. [23] present efficient dynamic techniques for checking that atomic updates preserve a fixed schema. These techniques are exact, but impose run-time overhead on all updates, and do not deal with changes to schemas.
Raghavachari and Shmueli [15] give efficient algorithms for revalidating data after updates to either the data or schema, but their approach places stronger restrictions on schemas. It would be interesting to combine static and dynamic revalidation techniques.
Building partly on the type and effect analyses in this paper, we have developed a schema-based independence analysis for XML queries and updates [24]. A query q and update u are statically independent if, roughly speaking, for any initial store, running q yields the same results as applying u and then running q. Static independence checking is useful for avoiding expensive recomputation of query results, and perhaps also for managing safe concurrent access to XML databases. Aliasing is a fundamental problem in static analysis, and has been studied in a wide variety of previous contexts. We envision applying ideas from region inference [25] or more advanced shape analysis techniques [26] to obtain more accurate alias information. Aliasing also arises in object-oriented programming languages in settings such as type-safe object reclassification [27] and typestates [28], and ideas from this work may also be useful for dealing with aliasing in XML updates.
8 Conclusions
XML update languages are an active area of study, but so far little is known about typechecking and static analysis for such languages. In this paper we have given an operational semantics for the W3C’s XQuery Update Facility 1.0 and developed the first (to our knowledge) sound type system for this language (although many details are relegated to the technical report [13]). As a Candidate Recommendation, XQuery Update is still a work in progress and we hope that our work will help improve the standard as well as provide a foundation for future study of XML updates. Acknowledgments. Cheney was supported by a Royal Society Research Fellowship and by EPSRC grant EP/F028288/1, “Heterogeneous and Permanent Data”. Benedikt is supported in part by EPSRC EP/G004021/1 (the Engineering and Physical Sciences Research Council, UK).
References

1. Boag, S., Chamberlin, D., Fernández, M.F., Florescu, D., Robie, J., Siméon, J.: XQuery 1.0: An XML query language. W3C Recommendation (January 2007), http://www.w3.org/TR/xquery
2. Draper, D., Fankhauser, P., Fernández, M., Malhotra, A., Rose, K., Rys, M., Siméon, J., Wadler, P.: XQuery 1.0 and XPath 2.0 formal semantics. W3C Recommendation (January 2007), http://www.w3.org/TR/xquery-semantics/
3. Benedikt, M., Bonifati, A., Flesca, S., Vyas, A.: Adding updates to XQuery: Semantics, optimization, and static analysis. In: Florescu, D., Pirahesh, H. (eds.) XIME-P (2005)
4. Cheney, J.: FLUX: FunctionaL Updates for XML. In: ICFP (2008)
5. Ghelli, G., Rose, K., Siméon, J.: Commutativity analysis for XML updates. ACM Trans. Database Syst. 33(4), 1–47 (2008)
6. Sur, G., Hammer, J., Siméon, J.: UpdateX – an XQuery-based language for processing updates in XML. In: PLAN-X (2004)
7. Chamberlin, D., Robie, J.: XQuery update facility 1.0. W3C Candidate Recommendation (August 2008), http://www.w3.org/TR/xquery-update-10/
8. Hosoya, H., Vouillon, J., Pierce, B.C.: Regular expression types for XML. ACM Trans. Program. Lang. Syst. 27(1), 46–90 (2005)
9. Schwentick, T.: Automata for XML – a survey. Journal of Computer and System Sciences 73, 289–315 (2007)
10. Colazzo, D., Ghelli, G., Manghi, P., Sartiani, C.: Static analysis for path correctness of XML queries. J. Funct. Program. 16(4-5), 621–661 (2006)
11. Papakonstantinou, Y., Vianu, V.: Type inference for views of semistructured data. In: PODS (2000)
12. Benedikt, M., Bonifati, A., Flesca, S., Vyas, A.: Verification of tree updates for optimization. In: Etessami, K., Rajamani, S.K. (eds.) CAV 2005. LNCS, vol. 3576, pp. 379–393. Springer, Heidelberg (2005)
13. Benedikt, M., Cheney, J.: Types, effects, and schema evolution for XQuery update facility 1.0, http://homepages.inf.ed.ac.uk/jcheney/publications/drafts/w3c-update-types-tr.pdf
14. Stearns, R., Hunt, H.: On the equivalence and containment problems for unambiguous regular expressions, regular grammars and finite automata. SIAM J. Comput. 14, 598–611 (1985)
15. Raghavachari, M., Shmueli, O.: Efficient revalidation of XML documents. IEEE Trans. on Knowl. and Data Eng. 19(4), 554–567 (2007)
16. Benedikt, M., Koch, C.: XPath leashed. ACM Comput. Surv. 41(1), 1–54 (2008)
17. Manolescu, I., Robie, J.: XQuery update facility use cases. W3C Candidate Recommendation (March 2008), http://www.w3.org/TR/xqupdateusecases
18. Trejo Ortiz, A.A.R., Anaya, G.F.: Regular expression simplification. Math. Comput. Simul. 45(1-2), 59–71 (1998)
19. Martens, W., Neven, F., Schwentick, T., Bex, G.J.: Expressiveness and complexity of XML Schema. ACM Transactions on Database Systems 31(3), 770–813 (2006)
20. Bierman, G.M.: Formal semantics and analysis of object queries. In: SIGMOD (2003)
21. Møller, A., Schwartzbach, M.I.: The design space of type checkers for XML transformation languages. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 17–36. Springer, Heidelberg (2004)
22. Balmin, A., Papakonstantinou, Y., Vianu, V.: Incremental validation of XML documents. ACM Transactions on Database Systems 29(4), 710–751 (2004)
23. Barbosa, D., Mendelzon, A.O., Libkin, L., Mignet, L., Arenas, M.: Efficient incremental validation of XML documents. In: ICDE. IEEE Computer Society, Los Alamitos (2004)
24. Benedikt, M., Cheney, J.: Schema-based independence analysis for XML updates. In: VLDB (2009)
25. Henglein, F., Makholm, H., Niss, H.: Effect types and region-based memory management. In: Pierce, B.C. (ed.) Advanced Topics in Types and Programming Languages. MIT Press, Cambridge (2005)
26. Sagiv, M., Reps, T., Wilhelm, R.: Solving shape-analysis problems in languages with destructive updating. ACM Trans. Program. Lang. Syst. 20(1), 1–50 (1998)
27. Drossopoulou, S., Damiani, F., Dezani-Ciancaglini, M., Giannini, P.: Fickle: Dynamic object re-classification. In: Knudsen, J.L. (ed.) ECOOP 2001. LNCS, vol. 2072, pp. 130–149. Springer, Heidelberg (2001)
28. Fink, S.J., Yahav, E., Dor, N., Ramalingam, G., Geay, E.: Effective typestate verification in the presence of aliasing. ACM Trans. Softw. Eng. Methodol. 17(2), 1–34 (2008)
An Automata-Theoretic Approach to Regular XPath

Diego Calvanese¹, Giuseppe De Giacomo², Maurizio Lenzerini², and Moshe Y. Vardi³

¹ KRDB Research Centre, Free University of Bozen-Bolzano, Italy
[email protected]
² Dipartimento di Informatica e Sistemistica, Sapienza Università di Roma, Italy
{degiacomo,lenzerini}@dis.uniroma1.it
³ Department of Computer Science, Rice University, P.O. Box 1892, Houston, TX 77251-1892, U.S.A.
[email protected]
Abstract. In this paper we present Regular XPath (RXPath), which is a natural extension of XPath with regular expressions over paths that has the same computational properties as XPath: linear-time query evaluation and exponential-time reasoning. To establish these results, we devise a unifying automata-theoretic framework based on two-way weak alternating tree automata. Specifically, we consider automata that have infinite runs on finite trees. This enables us to leverage and simplify existing automata-theoretic machinery and develop algorithms both for query evaluation and for reasoning over queries. With respect to the latter problem, we consider RXPath as a constraint language, and study constraint satisfiability, and query satisfiability and containment under constraints in the setting of RXPath.
1 Introduction
XML¹ has become the standard language for semistructured data, and the last few years have witnessed a strong interest in reasoning about XML queries and integrity constraints. From a conceptual point of view, an XML document can be seen as a finite node-labeled tree, and several formalisms have been proposed as query languages over XML documents. A common feature of many of these languages is the use of regular path expressions to navigate through XML documents, and XPath is a popular language for such navigation [1].

This paper introduces a powerful extension of CoreXPath, called RXPath, for expressing XML queries.² Our language is inspired by the work carried out in the last few years on extensions of CoreXPath [2,3]. In particular, we extend CoreXPath with nominals, as well as with regular path expressions over XML trees, expressed as two-way regular expressions over XPath axes. Our language is essentially Regular XPath [2], extended with nominals. A nominal is a formalism for denoting a single node in a document, similarly to XML global identifiers built through the construct ID. The power of our language in expressing paths is that of Propositional Dynamic Logic (PDL) [4] extended with converse, nominals, and deterministic programs. This combination of path-forming constructs results in one of the most expressive languages ever considered for specifying structural queries over XML documents.

We describe in this paper a comprehensive automata-theoretic framework for evaluating and reasoning about RXPath. Our framework is based on two-way weak alternating tree automata, denoted 2WATA. The use of automata-theoretic techniques in the context of XPath is not new. For example, tree walking automata are used in [2] to characterize the expressive power of Regular XPath, while bottom-up tree automata are used there to derive an algorithm for testing containment of Regular XPath queries. In contrast, here we show that 2WATA provide a unifying formalism, as they enable us both to derive a linear-time algorithm for the evaluation of RXPath queries and an exponential-time algorithm for testing containment of RXPath queries.

Our automata-theoretic approach is based on techniques developed in the context of program logics [5,6]. Here, however, we leverage the fact that we are dealing with finite trees, rather than the infinite trees used in the program-logics context. Indeed, the automata-theoretic techniques used in reasoning about infinite trees are notoriously difficult [7,8] and have resisted efficient implementation. The restriction to finite trees here enables us to obtain a much more feasible algorithmic approach. (A similar idea, focusing on query reasoning rather than query evaluation, has been pursued in [9,10]; for the latter proposal, an implementation is also provided.)

¹ http://www.w3.org/TR/REC-xml/
² A preliminary version of this paper, dealing with RXPath satisfiability only, has been presented at the 2008 Workshop on Logic in Databases (LID 2008).

P. Gardner and F. Geerts (Eds.): DBPL 2009, LNCS 5708, pp. 18–35, 2009.
© Springer-Verlag Berlin Heidelberg 2009
In particular, one can make use of symbolic techniques, at the base of modern model-checking tools, for effectively querying and verifying XML documents. It is worth noting that while our automata run over finite trees, they are allowed to have infinite runs. This separates 2WATA from the alternating tree automata used, e.g., in [11]. The key technical results here are that acceptance of trees by 2WATA can be decided in linear time, while nonemptiness of 2WATA can be decided in exponential time. The first application of the automata-theoretic results is that RXPath queries can be evaluated in linear time; more precisely, the running time for query evaluation is the product of the size of the input tree and the size of the query. This extends the results in [12] of polynomial-time algorithms for the evaluation of CoreXPath queries. Note that [12] also provides results for full XPath, which handles data as well, and hence goes beyond the core fragment considered here. Our result complements the one in [13], which considers an extension of RXPath with tests for equality between attributes at the end of two paths, and provides a query evaluation algorithm that is linear time in the size of the input tree but exponential in the size of the query.
The second application is to RXPath reasoning. There has been a lot of focus on testing containment of XPath queries, cf. [14]. Instead, we focus here on the more general problem of satisfying constraints over XML documents. Specifically, we consider here structural constraints [15]. Structural constraints are those imposing a certain form on the trees corresponding to the documents, with no explicit reference to values associated with nodes. Notable examples of formalisms allowing for expressing such constraints are DTDs (see footnote 1 and [16]) and XML Schema² [17]. We show how we can use RXPath to express such constraints, and then show that satisfiability of RXPath constraints can be checked in exponential time using our automata-theoretic results. The exponential decidability result for RXPath constraint satisfiability is not surprising, as this problem can be reduced to the satisfiability problem for Repeat-Converse-Deterministic PDL (repeat-CDPDL), a well-known variant of PDL, which can be solved using the automata-theoretic techniques of [6]. As noted above, however, those techniques are quite heavy and have so far resisted implementation. We also show that query satisfiability and query containment for RXPath can be reduced to checking satisfiability of RXPath constraints, thus enabling us to take advantage of the techniques developed for constraint satisfiability. Note that most previous results on this topic (see, e.g., [18]) refer to query languages that are either less expressive than RXPath, or are intended for semi-structured data modeled as graph-like structures, rather than XML trees.
2 Regular XPath
Following [3,19], we formalize XML documents as finite sibling trees, which are tree-like structures whose nodes are linked to each other by two relations: the child relation, connecting each node with its children in the tree, and the immediate-right-sibling relation, connecting each node with its sibling immediately to the right in the tree; the latter relation models the order between the children of a node in an XML document. Each node of the sibling tree is labeled by (possibly many) elements of a set of atomic propositions Σ. We consider the set Σ to be partitioned into Σa and Σid. The set Σa is a set of atomic propositions that represent either XML tags or XML attribute-value pairs. On the other hand, Σid is a set of special propositions representing (node) identifiers, i.e., propositions that are true in (i.e., that label) exactly a single node of the XML document. Such identifiers are essentially an abstraction of the XML identifiers built through the construct ID (see footnote 1), though a node can have multiple identifiers in our case. Observe that sibling trees are more general than XML documents, since they would allow the same node to be labeled by several tags. It is easy to impose RXPath constraints (see later) that force propositions representing tags to be disjoint if needed.
² http://www.w3.org/TR/xmlschema-0/ and http://.../xmlschema-1/
(⟨P⟩ϕ)Ts = {z | ∃z′. (z, z′) ∈ PTs ∧ z′ ∈ ϕTs}
([P]ϕ)Ts = {z | ∀z′. (z, z′) ∈ PTs → z′ ∈ ϕTs}
(¬ϕ)Ts = ΔTs \ ϕTs
(ϕ1 ∧ ϕ2)Ts = ϕ1Ts ∩ ϕ2Ts
(ϕ1 ∨ ϕ2)Ts = ϕ1Ts ∪ ϕ2Ts
(ϕ?)Ts = {(z, z) | z ∈ ϕTs}
(P1; P2)Ts = P1Ts ◦ P2Ts
(P1 ∪ P2)Ts = P1Ts ∪ P2Ts
(P∗)Ts = (PTs)∗
(P−)Ts = {(z′, z) | (z, z′) ∈ PTs}

Fig. 1. Semantics of node and path expressions
A sibling tree is a pair Ts = (ΔTs, ·Ts), where ΔTs is a tree³ and ·Ts is an interpretation function that assigns to each atomic symbol A ∈ Σa a set ATs of nodes of ΔTs, to each identifier Id a singleton IdTs containing one node of ΔTs, and that interprets the axis relations in the obvious way, namely:

childTs = {(z, z·i) | z, z·i ∈ ΔTs}
rightTs = {(z·i, z·(i+1)) | z·i, z·(i+1) ∈ ΔTs}
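For illustration, the tree domain and the two axis relations can be encoded directly as sets of integer tuples. This is a sketch under invented encodings, not part of the paper's formal development:

```python
# Sketch: a sibling tree's domain as a set of int tuples (words over N,
# per footnote 3), with the child and right axes computed from it.
def child_pairs(dom):
    """childTs = {(z, z.i) | z, z.i in dom}."""
    return {(z, w) for w in dom for z in dom
            if len(w) == len(z) + 1 and w[:len(z)] == z}

def right_pairs(dom):
    """rightTs = {(z.i, z.(i+1)) | both in dom}."""
    return {(z, w) for z in dom for w in dom
            if len(z) == len(w) >= 1 and z[:-1] == w[:-1] and w[-1] == z[-1] + 1}

dom = {(), (1,), (2,), (1, 1)}     # root, two children, grandchild under (1,)
assert child_pairs(dom) == {((), (1,)), ((), (2,)), ((1,), (1, 1))}
assert right_pairs(dom) == {((1,), (2,))}
```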
As in [3,19], we focus on a variant of XPath that allows for full regular expressions over the XPath axes. In fact, we make it explicit that such a variant of XPath is tightly related to Propositional Dynamic Logic (PDL) [20,4], and adopt the PDL syntax to express node and path expressions. RXPath expressions are of two sorts: node expressions, denoted by ϕ, and path expressions, denoted by P, defined by the following syntax (we omit parentheses):

ϕ −→ A | Id | ⟨P⟩ϕ | [P]ϕ | ¬ϕ | ϕ1 ∧ ϕ2 | ϕ1 ∨ ϕ2
P −→ child | right | ϕ? | P1; P2 | P1 ∪ P2 | P∗ | P−
where A ∈ Σa, Id ∈ Σid, and child and right denote the two main XPath axis relations. We consider the other XPath axis relations parent and left as abbreviations for child− and right−, respectively. Also, we use the usual abbreviations, including true, false, and ϕ1 → ϕ2. Given a sibling tree Ts = (ΔTs, ·Ts), we extend the interpretation function ·Ts to arbitrary node and path expressions as shown in Figure 1, where we have used the standard notions of chaining (· ◦ ·) and reflexive-transitive closure (·∗) over binary relations. Note that [P]ϕ is equivalent to ¬⟨P⟩¬ϕ. To develop our techniques for inference on RXPath, it is convenient to consider an additional axis fchild, connecting each node to its first child only, interpreted as

fchildTs = {(z, z·1) | z, z·1 ∈ ΔTs}
Using fchild, we can thus re-express the child axis as fchild; right∗. In this way, we can view sibling trees, which are unranked, as binary trees (see Section 4).
³ A (finite) tree is a non-empty (finite) set Δ of words over ℕ, such that if x·i ∈ Δ, where x ∈ ℕ∗ and i ∈ ℕ, then also x ∈ Δ, and if i > 1, then also x·(i−1) ∈ Δ. By convention we take x·0 = x, and x·i·−1 = x. A (finite) labeled tree over an alphabet L of labels is a pair T = (ΔT, ℓT), where ΔT is a (finite) tree and the labeling ℓT : ΔT → L is a mapping assigning to each node x ∈ ΔT a label ℓT(x) in L.
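A small sketch (with invented relational encodings) of a fragment of the Figure 1 semantics, checking on a concrete tree that the child axis coincides with fchild; right∗:

```python
# Sketch: relations as sets of node pairs; nodes as int tuples.
def compose(r1, r2):
    """P1; P2 as relation composition."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}

def star(rel, dom):
    """Reflexive-transitive closure of rel over dom (P*)."""
    result = {(z, z) for z in dom}
    frontier = result
    while frontier:
        frontier = compose(frontier, rel) - result
        result |= frontier
    return result

def diamond(rel, phi):
    """<P>phi = {z | exists z' with (z, z') in P and z' in phi}."""
    return {z for (z, w) in rel if w in phi}

dom = {(), (1,), (2,), (3,), (1, 1)}   # root with three children; (1,) has a child
child = {(z, w) for z in dom for w in dom if len(w) == len(z) + 1 and w[:-1] == z}
fchild = {(z, w) for (z, w) in child if w[-1] == 1}
right = {(z, w) for z in dom for w in dom
         if len(z) == len(w) >= 1 and z[:-1] == w[:-1] and w[-1] == z[-1] + 1}

assert compose(fchild, star(right, dom)) == child   # child = fchild; right*
assert diamond(child, {(1, 1)}) == {(1,)}           # parents of node (1,1)
```

The fixpoint in `star` is exactly the reflexive-transitive-closure clause of Figure 1, and the final assertion is the re-expression of child used to treat unranked sibling trees as binary trees.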
We say that a path expression is normalized if it is expressed by making use of fchild and right only, if ·− is pushed inside as much as possible, in such a way that it appears only in front of fchild and right, and if all node expressions occurring in it are normalized. A node expression is normalized if all path expressions occurring in it are normalized, and if it is in negation normal form, i.e., negation is pushed inside as much as possible, in such a way that it appears only in front of atomic symbols. RXPath expressions can be used to express queries on XML documents. An RXPath (unary) query is an RXPath node expression ϕ that, when evaluated over a sibling tree Ts, returns the set of nodes ϕTs. We also consider RXPath binary queries, where such a query is an RXPath path expression P that, when evaluated over a sibling tree Ts, returns the set of pairs of nodes PTs. We will address the problems of query evaluation and of query satisfiability by means of automata-theoretic techniques, presented in the next section. A further use of RXPath expressions is to specify constraints, which will be dealt with in Section 5, resorting again to automata.
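The converse-pushing step of normalization can be sketched with the standard PDL identities (P1; P2)− = P2−; P1−, (P1 ∪ P2)− = P1− ∪ P2−, (P∗)− = (P−)∗, and (ϕ?)− = ϕ?. The tuple encoding of path expressions below is invented for the sketch:

```python
# Sketch: push converse inward so it appears only on the atomic axes.
def push_converse(p, inv=False):
    if isinstance(p, str):                       # an axis: fchild or right
        return ("conv", p) if inv else p
    op = p[0]
    if op == "conv":                             # flip polarity and recurse
        return push_converse(p[1], not inv)
    if op == "seq":                              # (P1;P2)^- = P2^-; P1^-
        parts = [push_converse(q, inv) for q in p[1:]]
        return ("seq", *(reversed(parts) if inv else parts))
    if op == "union":                            # (P1 u P2)^- = P1^- u P2^-
        return ("union", *(push_converse(q, inv) for q in p[1:]))
    if op == "star":                             # (P*)^- = (P^-)*
        return ("star", push_converse(p[1], inv))
    if op == "test":                             # (phi?)^- = phi?
        return p
    raise ValueError(op)

assert push_converse(("conv", ("seq", "fchild", ("star", "right")))) == \
       ("seq", ("star", ("conv", "right")), ("conv", "fchild"))
```

In other words, (fchild; right∗)− normalizes to (right−)∗; fchild−, as required for converse to sit only on the two atomic axes.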
3 Two-Way Weak Alternating Tree Automata
We consider a variant of two-way alternating automata [21] that run, possibly infinitely, on finite labeled trees. Specifically, alternating tree automata generalize nondeterministic tree automata, while two-way tree automata generalize ordinary (one-way) tree automata by being allowed to traverse the tree both upwards and downwards. Formally, let B+(I) be the set of positive Boolean formulae over a set I, built inductively by applying ∧ and ∨ starting from true, false, and elements of I. For a set J ⊆ I and a formula ϕ ∈ B+(I), we say that J satisfies ϕ if assigning true to the elements in J and false to those in I \ J makes ϕ true. For a positive integer k, let [1..k] = {1, . . . , k} and [−1..k] = {−1, 0, 1, . . . , k}. For integers i, j with i ≤ j, let [i..j] = {i, . . . , j}. A two-way weak alternating tree automaton (2WATA) running over labeled trees all of whose nodes have at most k successors is a tuple A = (L, S, s0 , δ, α), where L is the alphabet of tree labels, S is a finite set of states, s0 ∈ S is the initial state, δ : S × L → B+([−1..k] × S) is the transition function, and α is the acceptance condition discussed below. The transition function maps a state s ∈ S and an input label a ∈ L to a positive Boolean formula over [−1..k] × S. Intuitively, if δ(s, a) = ϕ, then each pair (c′, s′) appearing in ϕ corresponds to a new copy of the automaton going in the direction suggested by c′ and starting in state s′. For example, if k = 2 and δ(s1 , a) = ((1, s2 ) ∧ (1, s3 )) ∨ ((−1, s1 ) ∧ (0, s3 )), when the automaton is in the state s1 and reads the node x labeled by a, it proceeds either by sending off two copies, in the states s2 and s3 respectively, to the first successor of x (i.e., x·1), or by sending off one copy in the state s1 to the predecessor of x (i.e., x·−1) and one copy in the state s3 to x itself (i.e., x·0). A run of a 2WATA is obtained by resolving all existential choices; the universal choices are left, which is why a run is a tree.
Because we are considering two-way automata, runs can start at arbitrary tree nodes, and need not start at the root.
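To make the satisfaction relation for positive Boolean formulae concrete, the following minimal Python sketch (the nested-tuple encoding and the function name are ours, not part of the paper) evaluates a formula under the assignment induced by a set J of atoms:

```python
def satisfies(J, phi):
    """Evaluate a positive Boolean formula phi under the assignment that
    makes exactly the atoms in J true. Formulae are True, False,
    ("atom", x), ("and", f1, ..., fn), or ("or", f1, ..., fn)."""
    if isinstance(phi, bool):
        return phi
    tag = phi[0]
    if tag == "atom":
        return phi[1] in J
    if tag == "and":
        return all(satisfies(J, f) for f in phi[1:])
    if tag == "or":
        return any(satisfies(J, f) for f in phi[1:])
    raise ValueError("not a positive Boolean formula")
```

For instance, for the transition δ(s1, a) = ((1, s2) ∧ (1, s3)) ∨ ((−1, s1) ∧ (0, s3)) above, the set J = {(1, s2), (1, s3)} satisfies the formula, matching the first of the two existential choices.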
An Automata-Theoretic Approach to Regular XPath
23
Formally, a run of a 2WATA A over a labeled tree T = (ΔT , T ) from a node x0 ∈ ΔT is a finite ΔT × S-labeled tree R = (ΔR , R ) satisfying:
1. ε ∈ ΔR and R(ε) = (x0 , s0 ).
2. Let R(r) = (x, s) and δ(s, T(x)) = ϕ. Then there is a (possibly empty) set S′ = {(c1 , s1 ), . . . , (cn , sn )} ⊆ [−1..k] × S such that S′ satisfies ϕ and, for each i ∈ [1..n], we have that r·i ∈ ΔR , x·ci ∈ ΔT , and R(r·i) = (x·ci , si ).
Intuitively, a run R keeps track of all transitions that the 2WATA A performs on a labeled input tree T : a node r of R labeled by (x, s) describes a copy of A that is in the state s and is reading the node x of T . The successors of r in the run represent the transitions made by the multiple copies of A that are being sent off either upwards to the predecessor of x, downwards to one of the successors of x, or to x itself.
A 2WATA is called "weak" due to the specific form of the acceptance condition α. Specifically, α ⊆ S, and there exists a partition of S into disjoint sets Si such that, for each set Si , either Si ⊆ α, in which case Si is an accepting set, or Si ∩ α = ∅, in which case Si is a rejecting set. In addition, there exists a partial order ≤ on the collection of the Si 's such that, for each s ∈ Si and s′ ∈ Sj for which s′ occurs in δ(s, a) for some a ∈ L, we have Sj ≤ Si . Thus, transitions from a state in Si lead to states in either the same Si or a lower one. It follows that every infinite path of a run of a 2WATA ultimately gets "trapped" within some Si . The path is accepting if and only if Si is an accepting set. A run R is accepting if all its infinite paths are accepting. A 2WATA A accepts a labeled tree T from a node x0 ∈ ΔT if there exists an accepting run of A over T from x0 . The language L(A) accepted by A is the set of trees that A accepts from the root ε.
3.1 The Acceptance Problem
Given a 2WATA A = (L, S, s0 , δ, α), a labeled tree T = (ΔT , T ), and a node x0 ∈ ΔT , we would like to determine whether A accepts T from x0 . This is called the acceptance problem. We follow the approach of [5] and solve the acceptance problem by first taking the product A × Tx0 of A and T from x0 . This product is an alternating automaton over a one-letter alphabet L0 , consisting of a single letter, say a; it simulates a run of A on T from x0 . The product automaton is A × Tx0 = (L0 , S × ΔT , (s0 , x0 ), δ′ , α × ΔT ), where δ′ is defined as follows:
– δ′((s, x), a) = Θx (δ(s, T(x))), where Θx is the substitution that replaces each pair (c, t) by the pair (t, x·c) if x·c ∈ ΔT , and by false otherwise.
Note that the size of A × Tx0 is simply the product of the size of A and the size of T . Note also that A × Tx0 can be viewed as a weak alternating word automaton running over the infinite word aω , since by taking the product with T we have eliminated all directions. We can now state the relationship between A × Tx0 and A, which is essentially a restatement of Proposition 3.2 in [5].
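The substitution Θx and the move x·c admit a direct implementation. Below is a hedged Python sketch (all names are ours; formulae use a nested-tuple encoding, with ("atom", (c, s)) for an atom): it rewrites a transition formula into a formula over pairs (state, node), replacing moves that fall off the tree by false, in the spirit of the definition of δ′ above.

```python
def move(x, c):
    """Compute x·c for c in [-1..k]: 0 stays put, -1 goes to the
    predecessor (undefined at the root), c >= 1 goes to the c-th successor."""
    if c == 0:
        return x
    if c == -1:
        return x[:-1] if x else None     # the root has no predecessor
    return x + (c,)

def theta(x, phi, tree_nodes):
    """Substitution Theta_x: replace each atom (c, s) by the atom (s, x·c)
    if x·c is a node of the tree, and by False otherwise."""
    if isinstance(phi, bool):
        return phi
    tag = phi[0]
    if tag == "atom":
        c, s = phi[1]
        y = move(x, c)
        return ("atom", (s, y)) if y is not None and y in tree_nodes else False
    return (tag,) + tuple(theta(x, f, tree_nodes) for f in phi[1:])
```

Applied at the root of a two-node tree, a downward atom survives while an upward atom collapses to false, since the root has no predecessor.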
Proposition 1. A accepts T from x0 iff A × Tx0 accepts aω .
The advantage of Proposition 1 is that it reduces the acceptance problem to the question of whether A × Tx0 accepts aω . This problem is referred to in [5] as the "one-letter nonemptiness problem". It is shown there that this problem can be solved in time linear in the size of A × Tx0 by an algorithm that imposes an evaluation of and-or trees over a decomposition of the automaton state space into maximal strongly connected components. The result in [5] is actually stronger: the algorithm there computes in linear time the set of initial states from which the automaton accepts aω . We therefore obtain the following result about the acceptance problem.
Proposition 2. Given a 2WATA A and a labeled tree T , we can compute, in time linear in the product of the sizes of A and T , the set of nodes x0 such that A accepts T from x0 .
3.2 The Nonemptiness Problem
The nonemptiness problem for 2WATAs consists in determining, for a given 2WATA A, whether it accepts some tree T from ε. This problem is solved in [6] for 2WATAs (actually, for a more powerful automata model) over infinite trees, using rather sophisticated automata-theoretic techniques. Here we solve the problem over finite trees, which requires less sophisticated techniques that are much easier to implement. In order to decide nonemptiness of 2WATAs, we resort to a conversion to standard one-way nondeterministic tree automata [22]. A one-way nondeterministic tree automaton (NTA) is a tuple A = (L, S, s0 , δ), analogous to a 2WATA, except that (i) the acceptance condition α has been dropped from the tuple, (ii) the directions −1 and 0 are not used in δ, and (iii) for each state s ∈ S and letter a ∈ L, the positive Boolean formula δ(s, a), when written in DNF, does not contain a disjunct with two distinct atoms (c, s1 ) and (c, s2 ) with the same direction c. In other words, each disjunct corresponds to sending at most one "subprocess" in each direction. While for 2WATAs we have a separate input tree and run tree, for NTAs we can assume that the run of the automaton over an input tree T = (ΔT , T ) is an S-labeled tree R = (ΔT , R ), which has the same underlying tree as T , and thus is finite, but is labeled by states in S. Nonemptiness of NTAs is known to be decidable in linear time [23]. It remains to describe the translation of 2WATAs to NTAs. Given a 2WATA A and an input tree T as above, a strategy for A on T is a mapping τ : ΔT → 2^(S×[−1..k]×S). Thus, each label in a strategy is an edge-[−1..k]-labeled directed graph on S. Intuitively, each label is a set of transitions. For each label ζ ⊆ S × [−1..k] × S, we define state(ζ) = {u : (u, i, v) ∈ ζ}, i.e., state(ζ) is the set of sources in the graph ζ.
In addition, we require the following: (1) s0 ∈ state(τ(ε)); (2) for each node x ∈ ΔT and each state s ∈ state(τ(x)), the set {(c, s′) : (s, c, s′) ∈ τ(x)} satisfies δ(s, T(x)) (thus, each label can be viewed as a strategy for satisfying the transition function); and (3) for each node x ∈ ΔT and each edge (s, i, s′) ∈ τ(x), we have that s′ ∈ state(τ(x·i)).
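Conditions (1)–(3) can be verified mechanically. The following Python sketch (our own naming; formulae use a nested-tuple encoding together with a small evaluator) checks whether a candidate mapping τ is a strategy for a given transition function, labeling, and tree domain:

```python
def satisfies(J, phi):
    # Evaluate a positive Boolean formula (True/False, ("atom", x),
    # ("and", ...), ("or", ...)) under the atoms in J.
    if isinstance(phi, bool):
        return phi
    if phi[0] == "atom":
        return phi[1] in J
    sub = (satisfies(J, f) for f in phi[1:])
    return all(sub) if phi[0] == "and" else any(sub)

def is_strategy(tau, delta, label, nodes, s0):
    """Check conditions (1)-(3): tau maps each node (a tuple) to a set of
    triples (s, c, s'); state(tau(x)) is the set of sources of the triples."""
    state = lambda z: {s for (s, c, t) in z}
    if s0 not in state(tau[()]):                            # condition (1)
        return False
    for x in nodes:
        for s in state(tau[x]):
            choice = {(c, t) for (u, c, t) in tau[x] if u == s}
            if not satisfies(choice, delta(s, label[x])):   # condition (2)
                return False
        for (s, c, t) in tau[x]:
            y = x[:-1] if c == -1 else (x if c == 0 else x + (c,))
            if t not in state(tau[y]):                      # condition (3)
                return False
    return True
```

This is only a sketch under the assumption that τ is defined on every node of the tree; a production version would also validate that every direction stays inside the tree domain.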
A path β in the strategy τ is a maximal sequence (u0 , s0 ), (u1 , s1 ), . . . of pairs from ΔT × S such that u0 = ε and, for all i ≥ 0, there is some ci ∈ [−1..k] such that (si , ci , si+1 ) ∈ τ(ui ) and ui+1 = ui ·ci . Thus, β is obtained by following transitions in the strategy. The path β is accepting if the sequence of states s0 , s1 , . . . is accepting. The strategy τ is accepting if all its paths are accepting.
Proposition 3 ([6]). A 2WATA A accepts an input tree T from ε iff A has an accepting strategy tree for T .
We have thus succeeded in defining a notion of run for alternating automata that has the same tree structure as the input tree. We are still facing the problem that paths in a strategy tree can go both up and down. We need to find a way to restrict attention to uni-directional paths. For this we need an additional concept. An annotation for A on T with respect to a strategy τ is a mapping η : ΔT → 2^(S×{0,1}×S). Thus, each label in an annotation is an edge-{0,1}-labeled directed graph on S. We assume that edge labels are unique; that is, a graph cannot contain both triples (s, 0, s′) and (s, 1, s′). We require η to satisfy some closure conditions for each node x ∈ ΔT . Intuitively, these conditions say that η contains all relevant information about finite paths in τ . Thus, an edge (s, c, s′) describes a path from s to s′, where c = 1 if this path goes through α. The conditions are:
(a) if (s, c, s′) ∈ η(x) and (s′, c′, s″) ∈ η(x), then (s, c″, s″) ∈ η(x), where c″ = max{c, c′};
(b) if (s, 0, s′) ∈ τ(x), then (s, c, s′) ∈ η(x), where c = 1 if s′ ∈ α and c = 0 otherwise;
(c) if x = y·i, (s, −1, s′) ∈ τ(x), (s′, c, s″) ∈ η(y), and (s″, i, s‴) ∈ τ(y), then (s, c′, s‴) ∈ η(x), where c′ = 1 if either s′ ∈ α, c = 1, or s‴ ∈ α, and c′ = 0 otherwise;
(d) if y = x·i, (s, i, s′) ∈ τ(x), (s′, c, s″) ∈ η(y), and (s″, −1, s‴) ∈ τ(y), then (s, c′, s‴) ∈ η(x), where c′ = 1 if either s′ ∈ α, c = 1, or s‴ ∈ α, and c′ = 0 otherwise.
The annotation η is accepting if for every node x ∈ ΔT and state s ∈ S, if (s, c, s) ∈ η(x), then c = 1. In other words, η is accepting if all cycles visit accepting states.
Proposition 4 ([6]). A 2WATA A accepts an input tree T from ε iff A has a strategy tree τ on T and an accepting annotation η of τ .
Consider now annotated trees (ΔT , T , τ, η), where τ is a strategy tree for A on (ΔT , T ) and η is an annotation of τ . We say that (ΔT , T , τ, η) is accepting if η is accepting.
Theorem 1. Let A be a 2WATA. Then there is an NTA An such that L(A) = L(An ). The number of states of An is exponential in the number of states of A.
The key feature of the state space of An is the fact that its states are pairs consisting of a subset of S and a subset of S × {0, 1} × S. Thus, a set of states of An can be described by a Boolean function on the domain S³. Similarly, the transition function of An can also be described as a Boolean function. Such functions can be represented by BDDs [24], enabling a symbolic approach to nonemptiness testing
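As an illustration of how an annotation aggregates finite paths, the sketch below (our own; a deliberate simplification) computes, for a single node, the least annotation label closed under conditions (a) and (b): it seeds the label with the 0-moves of the strategy label and closes under composition, keeping one flag per pair of states, as the uniqueness assumption on edge labels suggests. The cross-node conditions (c) and (d) are left out of this sketch.

```python
def local_annotation(tau_x, alpha):
    """Intra-node annotation closure. tau_x is the strategy label at a node
    (a set of triples (s, c, s')); alpha is the set of accepting states.
    eta maps a pair (s, s') to the flag of the edge (s, flag, s')."""
    eta = {}
    for (s, c, t) in tau_x:
        if c == 0:                                   # condition (b)
            eta[(s, t)] = max(eta.get((s, t), 0), 1 if t in alpha else 0)
    changed = True
    while changed:
        changed = False
        for (s, t), c in list(eta.items()):
            for (u, v), d in list(eta.items()):
                if t == u:                           # compose s -> t with u -> v
                    f = max(c, d)                    # condition (a)
                    if eta.get((s, v), -1) < f:
                        eta[(s, v)] = f
                        changed = True
    return {(s, c, t) for (s, t), c in eta.items()}
```

Composing a 0-edge into an accepting state with a further 0-edge yields a derived edge whose flag records that the path went through α.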
of 2WATAs, as shown below. We note that the framework of [6] also converts a two-way alternating tree automaton (on infinite trees) to a nondeterministic tree automaton (on infinite trees). The state space of the latter, however, is considerably more complex than the one obtained here, due to Safra's determinization construction. This makes it practically infeasible to apply the symbolic approach in the infinite-tree setting.
Theorem 2. Given a 2WATA A with n states and an input alphabet with m elements, deciding nonemptiness of A can be done in time exponential in n and linear in m.
As shown in [3] (see also Section 5 for dealing with identifiers), reasoning over RXPath formulas can be reduced to checking satisfiability in Propositional Dynamic Logics (PDLs). Specifically, one can resort to Repeat-Converse-Deterministic PDL (repeat-CDPDL), a variant of PDL that allows for expressing the finiteness of trees and for which satisfiability is ExpTime-complete [6]. This upper bound, however, is established using sophisticated infinite-tree automata-theoretic techniques (cf., e.g., [25]), which so far have resisted attempts at practically efficient implementation [7,8], due to the use of Safra's determinization construction [26] and parity games [27]. The main advantage of our approach here is that we use only automata on finite trees, which require a much "lighter" automata-theoretic machinery. As noted in Theorem 2, nonemptiness for 2WATAs can be tested in time that is exponential in the number of states and linear in the size of the alphabet. One can show that our 2WATA-based decision procedure can be implemented using a symbolic approach, which has the potential to handle automata with large state spaces [28].
4 2WATAs for RXPath Query Evaluation
We now address the problem of evaluating RXPath queries by means of 2WATAs. To do so, we first represent sibling trees as binary trees, and then encode the problem of evaluating an RXPath query ϕ into the acceptance problem for a 2WATA Aϕ whose number of states is linear in the size of ϕ. This allows us to establish a tight complexity bound for RXPath query evaluation. We work on binary trees. In order for such trees to represent sibling trees, we make use of special labels ifc, irs, hfc, hrs, where ifc (resp., irs) is used to keep track of whether a node is the first child (resp., the right sibling) of its predecessor, and hfc (resp., hrs) is used to keep track of whether a node has a first child (resp., a right sibling). Formally, we consider binary trees whose nodes are labeled with subsets of Σ ∪ {ifc, irs, hfc, hrs}. We call such a tree T = (ΔT , T ) well-formed if it satisfies the following conditions:
– For each node x of T , if T(x) contains hfc, then x·1 is meant to represent the fchild successor of x and hence T(x·1) contains ifc but not irs. Similarly, if T(x) contains hrs, then x·2 is meant to represent the right successor of x and hence T(x·2) contains irs but not ifc.
– The label T(ε) of the root of T contains neither ifc, nor irs, nor hrs. In this way, we restrict the root of T so as to represent the root of a sibling tree.
– For each Id ∈ Σid , there is at most one node x of T with Id ∈ T(x).
A sibling tree Ts = (ΔTs , ·Ts ) induces a well-formed binary tree πb(Ts). To define πb(Ts) = (ΔT , T ), we define, by induction on ΔTs , both a mapping πb from ΔTs to nodes of a binary tree, and the labeling of such nodes with ifc, irs, hfc, and hrs, as follows:
– πb(ε) = ε;
– πb(x·1) = πb(x)·1, for each node x·1 ∈ ΔTs ; moreover, hfc ∈ T(πb(x)) and ifc ∈ T(πb(x)·1);
– πb(x·(n+1)) = πb(x·n)·2, for each node x·(n+1) ∈ ΔTs with n ≥ 1; moreover, hrs ∈ T(πb(x·n)) and irs ∈ T(πb(x·n)·2).
Then, we take ΔT to be the range of πb , and we complete the definition of the labeling T(πb(x)), for each node x ∈ ΔTs , as follows: A ∈ T(πb(x)) iff x ∈ ATs , for each A ∈ Σa , and Id ∈ T(πb(x)) iff x ∈ Id Ts , for each Id ∈ Σid . Notice that, since in Ts each Id is interpreted as a singleton, T is well-formed.
To simplify the use of automata-theoretic techniques, we assume in the following that (normalized) path expressions are represented by means of finite automata rather than regular expressions. More precisely, a normalized path expression is represented as a finite automaton on finite words (NFA) P = (Θ, Q, q0 , ϱ, F ), in which the alphabet Θ consists of fchild, fchild⁻, right, right⁻, and node expressions followed by ?.
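The mapping πb and the structural labels can be computed directly on the word representation of nodes; the following Python sketch (function names are ours) encodes a sibling-tree domain as a binary tree, where direction 1 stands for the fchild step and direction 2 for the right step:

```python
def pi_b(x):
    """Map a sibling-tree node (a tuple of positive integers) to the
    corresponding binary-tree node: the n-th child becomes a first-child
    step followed by n-1 right-sibling steps."""
    out = ()
    for n in x:
        out += (1,) + (2,) * (n - 1)
    return out

def encode(sibling_nodes):
    """Return the binary-tree domain together with the structural labels
    ifc, irs, hfc, hrs attached to each binary-tree node."""
    labels = {pi_b(x): set() for x in sibling_nodes}
    for x in sibling_nodes:
        b = pi_b(x)
        if x:
            # ifc if x is a first child, irs if it is a right sibling
            labels[b].add("ifc" if x[-1] == 1 else "irs")
            # the predecessor of b in the binary tree gains hfc or hrs
            labels[b[:-1]].add("hfc" if b[-1] == 1 else "hrs")
    return labels
```

For instance, the sibling-tree node 2·1 (first child of the second child of the root) maps to the binary-tree node 1·2·1.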
The semantics of a path expression represented in such a way is the adaptation of the semantics for path expressions as given in Section 2, when we view them as regular expressions, with ;, ∪, and ∗ representing respectively the concatenation, union, and Kleene star operators: a path expression PN represented as an NFA denotes the same set of pairs of nodes as a path expression PR represented as a regular expression and defining the same language as PN , where the correspondence is applied inductively to the path expressions appearing in the node expressions of PN (respectively, PR ). We need to make use of a notion of syntactic closure, similar to that of Fischer-Ladner closure of a formula of PDL [4]. We first need to define the closure of path expressions: given a path expression P = (Θ, Q, q0 , ϱ, F ), we denote by Pq the path expression Pq = (Θ, Q, q, ϱ, F ) obtained from P by making q ∈ Q the initial state. The closure CL(P ) of P is the set CL(P ) = {Pq | q ∈ Q}. The syntactic closure CL(ϕ) of a node expression ϕ is defined inductively by asserting that {ϕ, ifc, irs, hfc, hrs} ⊆ CL(ϕ), and by the rules in Figure 2, where nnf(¬ψ) denotes the negation normal form of ¬ψ.
Proposition 5. Given a node expression ϕ, the cardinality of CL(ϕ) is linear in the length of ϕ.
Let ϕ be a normalized node expression. We first show how to construct a 2WATA Aϕ that, when run over the binary tree corresponding to a sibling tree Ts , accepts exactly from the nodes corresponding to those in ϕTs . The 2WATA Aϕ = (L, Sϕ , sϕ , δϕ , αϕ ) is defined as follows.
if ψ ∈ CL(ϕ) then nnf(¬ψ) ∈ CL(ϕ) (if ψ is not of the form ¬ψ′)
if ¬ψ ∈ CL(ϕ) then ψ ∈ CL(ϕ)
if ψ1 ∧ ψ2 ∈ CL(ϕ) then ψ1 , ψ2 ∈ CL(ϕ)
if ψ1 ∨ ψ2 ∈ CL(ϕ) then ψ1 , ψ2 ∈ CL(ϕ)
if ⟨P⟩ψ ∈ CL(ϕ) then ψ ∈ CL(ϕ), and ⟨Pq⟩ψ ∈ CL(ϕ) for each Pq ∈ CL(P )
if ⟨P⟩ψ ∈ CL(ϕ), where P = (Θ, Q, q0 , ϱ, F ), then ψ′ ∈ CL(ϕ) for each ψ′? ∈ Θ
if [P]ψ ∈ CL(ϕ) then ψ ∈ CL(ϕ), and [Pq]ψ ∈ CL(ϕ) for each Pq ∈ CL(P )
if [P]ψ ∈ CL(ϕ), where P = (Θ, Q, q0 , ϱ, F ), then ψ′ ∈ CL(ϕ) for each ψ′? ∈ Θ
Fig. 2. Closure of RXPath expressions
– The alphabet is L = 2^(Σ ∪ {ifc, irs, hfc, hrs}), i.e., all sets consisting of atomic symbols and the special symbols ifc, irs, hfc, hrs. This corresponds to labeling each node of the tree with a truth assignment to the atomic symbols, with information about the predecessor node, and with information about whether the children are significant.
– The set of states is Sϕ = CL(ϕ). Intuitively, when the automaton is in a state ψ ∈ CL(ϕ) and visits a node x of the tree, this means that the automaton has to check that the node expression ψ holds in x. When ψ is an atomic symbol α, i.e., an atomic proposition, an identifier, or one of the special symbols ifc, irs, hfc, hrs, this amounts to checking that the node label contains α.
– The initial state is sϕ = ϕ.
– The transition function δϕ is defined as follows:
1. For each λ ∈ L and each symbol α ∈ Σ ∪ {ifc, irs, hfc, hrs}, there are transitions
   δϕ(α, λ) = true if α ∈ λ, and false if α ∉ λ
   δϕ(¬α, λ) = true if α ∉ λ, and false if α ∈ λ
Such transitions check the truth value of atomic and special symbols and their negations in the current node of the tree.
2. For each λ ∈ L and each ψ1 , ψ2 ∈ CL(ϕ), there are transitions
   δϕ(ψ1 ∧ ψ2 , λ) = (0, ψ1 ) ∧ (0, ψ2 )
   δϕ(ψ1 ∨ ψ2 , λ) = (0, ψ1 ) ∨ (0, ψ2 )
Such transitions inductively decompose node expressions and move to appropriate states of the automaton to check the subexpressions.
3. For each λ ∈ L and each ⟨P⟩ψ ∈ CL(ϕ), where P = (Θ, Q, q0 , ϱ, F ), there is a transition δϕ(⟨P⟩ψ, λ) constituted by the disjunction of the following parts:
   if q0 ∈ F then (0, ψ)
   if q ∈ ϱ(q0 , fchild) then (0, hfc) ∧ (1, ⟨Pq⟩ψ)
   if q ∈ ϱ(q0 , right) then (0, hrs) ∧ (2, ⟨Pq⟩ψ)
   if q ∈ ϱ(q0 , fchild⁻) then (0, ifc) ∧ (−1, ⟨Pq⟩ψ)
   if q ∈ ϱ(q0 , right⁻) then (0, irs) ∧ (−1, ⟨Pq⟩ψ)
   if q ∈ ϱ(q0 , ψ′?) then (0, ψ′) ∧ (0, ⟨Pq⟩ψ)
Such transitions check step-by-step the existence of a path in the tree that conforms to the path expression P and such that ψ holds at its ending node.
4. For each λ ∈ L and each [P]ψ ∈ CL(ϕ), where P = (Θ, Q, q0 , ϱ, F ), there is a transition δϕ([P]ψ, λ) constituted by the conjunction of the following parts:
   if q0 ∈ F then (0, ψ)
   if q ∈ ϱ(q0 , fchild) then (0, ¬hfc) ∨ (1, [Pq]ψ)
   if q ∈ ϱ(q0 , right) then (0, ¬hrs) ∨ (2, [Pq]ψ)
   if q ∈ ϱ(q0 , fchild⁻) then (0, ¬ifc) ∨ (−1, [Pq]ψ)
   if q ∈ ϱ(q0 , right⁻) then (0, ¬irs) ∨ (−1, [Pq]ψ)
   if q ∈ ϱ(q0 , ψ′?) then (0, nnf(¬ψ′)) ∨ (0, [Pq]ψ)
Such transitions check step-by-step that, for all paths in the tree that conform to the path expression P , the node expression ψ holds at the ending node.
– The acceptance condition αϕ is the set of all node expressions [P]ψ ∈ CL(ϕ).
Observe that a simple partition Sϕ = ∪i Si of the set of states resulting from the above transition function is the one that reflects the syntactic structure of ϕ, and that puts all literals, including the ones corresponding to the special labels, in a single element of the partition, ordered below all other elements. Specifically, for each pair P , ψ for which [P]ψ or ⟨P⟩ψ appears explicitly in nnf(ϕ), the set of node expressions {[Pq]ψ | Pq ∈ CL(P )} forms an element Si of the partition; similarly, the set of node expressions {⟨Pq⟩ψ | Pq ∈ CL(P )} forms another element Sj of the partition. All other states in Sϕ form singleton elements of the partition, and sub-expressions are ordered below their containing expressions. Note that all node expressions ⟨P⟩ψ are in a rejecting set, which ensures that their satisfaction cannot be postponed indefinitely in an accepting run. As for the size of Aϕ , by Proposition 5, from the above construction we get:
Proposition 6. The number of states of Aϕ is linear in the size of ϕ.
Theorem 3. Let ϕ be a normalized node expression, Aϕ the 2WATA constructed above, Ts a sibling tree, and πb(Ts) the corresponding (well-formed) binary tree. Then a node x of Ts is in ϕTs iff Aϕ accepts πb(Ts) from πb(x).
From Proposition 2 and Theorem 3, we immediately get our first main result.
Theorem 4. Given a sibling tree Ts and an RXPath query ϕ, we can compute ϕTs in time that is linear in the number of nodes of Ts (data complexity) and in the size of ϕ (query complexity).
This technique can also be used to evaluate binary queries. Indeed, we can adorn (in linear time in the size of Ts) each node y of Ts with a unique identifier Id_y, obtaining a tree Ts′. Then, to evaluate the RXPath binary query P , we consider the unary queries ϕP,y = ⟨P⟩Id_y. The answer to P over Ts is simply P^Ts = ⋃_{y ∈ ΔTs} {(x, y) | x ∈ ϕP,y^Ts′}, which, by Theorem 4, can be computed in quadratic time in the number of nodes of Ts and in linear time in the size of P .
5 Reasoning on RXPath
We start by introducing RXPath root constraints, which are node expressions intended to be true on the root of the document, and study the problem of satisfiability and implication of such constraints. Formally, the root constraint ϕ is satisfied in a sibling tree Ts if ε ∈ ϕTs . A (finite) set Γ of RXPath root constraints is satisfiable if there exists a sibling tree Ts that satisfies all constraints in Γ . A set Γ of RXPath root constraints implies an RXPath root constraint ϕ, written Γ |= ϕ, if ϕ is satisfied in every sibling tree that satisfies all constraints in Γ . Note that unsatisfiability and implication of RXPath root constraints are mutually reducible to each other. Indeed, Γ is unsatisfiable if and only if Γ |= false. Also, Γ |= ϕ if and only if Γ ∪ {¬ϕ} is unsatisfiable. Hence, in the following, we deal with satisfiability only. In [3] it was shown that, for RXPath root constraints without identifiers, satisfiability is ExpTime-complete (see footnote 4). Here, as mentioned, we include special propositions Σid representing identifiers. However, the condition that a proposition is an identifier, i.e., denotes a singleton, can be expressed in RXPath. Indeed, we can force a proposition A to be a singleton by using the root constraint NA defined as follows, where the abbreviation u denotes the path expression (fchild ∪ right)∗:
NA = ⟨u⟩A ∧ [u]( (⟨fchild; u⟩A → [right; u]¬A)
             ∧ (⟨right; u⟩A → [fchild; u]¬A)
             ∧ (A → [(fchild ∪ right); u]¬A) )
Hence, using one such constraint NA for each identifier A ∈ Σid , the ExpTime-completeness result in [3] also gives us a complexity characterization for our variant of RXPath root constraints that involve identifiers.
Theorem 5 ([3]). Satisfiability of RXPath root constraints is ExpTime-complete.
(Footnote 4: The hardness result holds also if all propositions are disjoint, and in the case where they represent standard XML document tags.)
As mentioned, the technique for checking satisfiability of RXPath root constraints in [3] is based on a reduction to satisfiability in repeat-CDPDL, which so far has resisted implementation. Instead, by resorting to 2WATAs on finite trees as presented in Section 3, we offer a more promising path towards a viable implementation. To make use of such a technique for checking satisfiability of RXPath root constraints, we need to modify the 2WATA Aϕ in such a way that it accepts only binary trees that are well-formed (cf. Section 4). Indeed, a well-formed binary tree T = (ΔT , T ) induces a sibling tree πs(T ). To define πs(T ) = (ΔTs , ·Ts ), we define, by induction on ΔT , a mapping πs from ΔT to words over N as follows:
– πs (ε) = ε; if hfc ∈ T (ε), then πs (1) = 1; – if hfc ∈ T (x) and πs (x) = z·n, with z ∈ N∗ and n ∈ N, then πs (x·1) = z·n·1; – if hrs ∈ T (x) and πs (x) = z·n, with z ∈ N∗ and n ∈ N, then πs (x·2) = z·(n+1). Then, we take ΔTs to be the range of πs , and we define the interpretation function ·Ts as follows: for each A ∈ Σa , we define ATs = {πs (x) ∈ ΔTs | A ∈ T (x)}; similarly, for each Id ∈ Σid , we define Id Ts = {πs (x) ∈ ΔTs | Id ∈ T (x)}. Notice that, since T is well-formed, Id Ts contains at most one element. Note that the mapping πs ignores irrelevant parts of the binary tree, e.g., if the label of a node x does not contain hfc, even if x has a 1-successor, such a node is not included in the sibling tree. Also, πs can be considered the inverse of the mapping πb defined in Section 4. The 2WATA Awf ϕ = (L, S, sini , δ, α), obtained by modifying Aϕ so that it accepts (from ε) only trees that are well-formed, is defined as follows: – The set of states is S = Sϕ ∪ {sini , sstruc } ∪ {sId , nId | Id ∈ Σid }, where sini is the initial state, and the other additional states are used to check structural properties of well-formed trees. – The transition function is constituted by all transitions in δϕ , plus the following transitions ensuring that Awf ϕ accepts only well-formed trees. 1. For each λ ∈ L, there is a transition δ(sini , λ)=(0, ϕ) ∧ (0, ¬ifc) ∧ (0, ¬irs) ∧ (0, ¬hrs) ∧ (0, sstruc ) ∧
⋀Id∈Σid (0, sId )
Such transitions (i) move to the initial state of Aϕ to verify that ϕ holds at the root of the tree, (ii) check that the root of the tree is not labeled with ifc, irs or hrs, (iii) move to state sstruc , from which structural properties of the tree are verified, and (iv ) move to states sId (for each Id ∈ Σid ), from which the automaton verifies that a single occurrence of Id is present on the tree. 2. For each λ ∈ L, there is a transition δ(sstruc , λ) = ((0, ¬hfc) ∨ ((1, ifc) ∧ (1, ¬irs) ∧ (1, sstruc ))) ∧ ((0, ¬hrs) ∨ ((2, irs) ∧ (2, ¬ifc) ∧ (2, sstruc )))
Such transitions check that (i) for each node labeled with hfc, its left child is labeled with ifc but not with irs, and (ii) for each node labeled with hrs, its right child is labeled with irs but not with ifc.
3. For each λ ∈ L and each Id ∈ Σid there are transitions
   δ(sId , λ) = ((0, Id) ∧ ((0, ¬hfc) ∨ (1, nId )) ∧ ((0, ¬hrs) ∨ (2, nId )))
             ∨ ((0, ¬Id) ∧ (0, hfc) ∧ (1, sId ) ∧ ((0, ¬hrs) ∨ (2, nId )))
             ∨ ((0, ¬Id) ∧ (0, hrs) ∧ (2, sId ) ∧ ((0, ¬hfc) ∨ (1, nId )))
   δ(nId , λ) = (0, ¬Id) ∧ ((0, ¬hfc) ∨ (1, nId )) ∧ ((0, ¬hrs) ∨ (2, nId ))
Such transitions ensure that exactly one node of the tree is labeled with Id .
– The set of accepting states is α = αϕ . The states sini and sstruc each form a single element of the partition of states, where {sini } precedes all other elements and {sstruc } follows them. The states sId and nId are added to the element of the partition containing all literals. As for the size of Awf ϕ, by Proposition 6, and considering that the additional states and transitions in Awf ϕ are constant in the size of ϕ, we get:
Proposition 7. The number of states of Awf ϕ is linear in the size of ϕ.
Theorem 6. Let Γ be a (finite) set of RXPath root constraints, ϕ the conjunction of the constraints in Γ , and Awf ϕ the 2WATA constructed above. Then Awf ϕ is nonempty if and only if Γ is satisfiable.
Theorem 7. Checking the satisfiability of a (finite) set Γ of RXPath root constraints by checking nonemptiness of the 2WATA constructed above can be done in ExpTime.
6 Query Satisfiability and Query Containment
We now deal with query satisfiability and query containment under constraints, and we show that these problems can be reduced in linear time to satisfiability of RXPath root constraints. As a consequence, we get that these problems are ExpTime-complete and that we can exploit for them the automata-based techniques developed in this paper. In the following, we deal only with RXPath binary queries (i.e., path expressions), since RXPath unary queries (i.e., node expressions) can be rephrased as binary queries: indeed, ϕTs = {z | (z, z) ∈ (ϕ?)Ts }. We start our investigation with the query satisfiability problem. An RXPath query Q is satisfiable under a (finite) set of root constraints Γ if there exists a sibling tree Ts satisfying Γ such that QTs is non-empty. Considering the semantics of RXPath queries and root constraints, it is immediate to verify that Q is satisfiable under Γ if and only if Γ ∪ {⟨u; Q⟩true} is satisfiable. Hence, query satisfiability under root constraints in RXPath can be linearly reduced to satisfiability of RXPath root constraints, and we get the following result.
Theorem 8. Query satisfiability under root constraints in RXPath is ExpTime-complete.
We now turn our attention to query containment under constraints, i.e., verifying whether, for all databases satisfying a certain set of integrity constraints, the answer to a query is a subset of the answer to a second query. Checking containment of queries is crucial in several contexts, such as query optimization, query reformulation, knowledge-base verification, information integration,
integrity checking, and cooperative answering. Obviously, query containment is also useful for checking equivalence of queries, i.e., verifying whether for all databases the answer to a query is the same as the answer to another query. For a summary of results on query containment in semistructured data, see, e.g., [29]. Query containment under constraints in our setting is defined as follows: an RXPath query Q1 is contained in an RXPath query Q2 under a set of RXPath constraints Γ , written Γ |= Q1 ⊆ Q2 , if, for every sibling tree Ts that satisfies all constraints in Γ , we have that Q1Ts ⊆ Q2Ts . Again we can resort to satisfiability of root constraints to verify containment. Namely, Γ |= Q1 ⊆ Q2 if and only if Γ ∪ {⟨u; Id_st?; Q1; Id_end?⟩true, [u; Id_st?; Q2; Id_end?]false} is unsatisfiable, where Id_st and Id_end are newly introduced identifiers. Thus query containment under root constraints in RXPath can also be linearly reduced to unsatisfiability of RXPath root constraints.
Theorem 9. Query containment under root constraints in RXPath is ExpTime-complete.
It follows that for the above problems of reasoning about queries under RXPath root constraints, we can exploit the automata-based techniques developed in this paper. We conclude the section by observing that view-based query answering has also attracted the interest of the XPath community, e.g., [30]. It can be shown that the above techniques, based on a reduction to satisfiability of RXPath root constraints, can be adapted to solve view-based query answering as well.
7 Conclusions
In this paper we have studied RXPath, a powerful mechanism for expressing structural queries and constraints in XML. We have presented symbolic automata-based techniques for the evaluation of RXPath queries over XML trees and for checking the satisfiability of RXPath constraints, and we have illustrated how to apply the latter technique to both query containment and view-based query answering. Notably, the automata-theoretic techniques that we have introduced check for infinite computations on finite trees.

Acknowledgments. This research has been partially supported by NSF grants CCR-0124077, CCR-0311326, CCF-0613889, ANI-0216467, and CCF-0728882.
References

1. Clark, J., DeRose, S.: XML Path Language (XPath) version 1.0. W3C Recommendation (November 1999), http://www.w3.org/TR/1999/REC-xpath-19991116
2. ten Cate, B., Segoufin, L.: XPath, transitive closure logic, and nested tree walking automata. In: Proc. of PODS 2008, pp. 251–260 (2008)
D. Calvanese et al.
3. Marx, M.: XPath with conditional axis relations. In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis, M., Böhm, K., Ferrari, E. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 477–494. Springer, Heidelberg (2004)
4. Fischer, M.J., Ladner, R.E.: Propositional dynamic logic of regular programs. J. of Computer and System Sciences 18, 194–211 (1979)
5. Kupferman, O., Vardi, M.Y., Wolper, P.: An automata-theoretic approach to branching-time model checking. J. of the ACM 47(2), 312–360 (2000)
6. Vardi, M.Y.: Reasoning about the past with two-way automata. In: Larsen, K.G., Skyum, S., Winskel, G. (eds.) ICALP 1998. LNCS, vol. 1443, pp. 628–641. Springer, Heidelberg (1998)
7. Schulte Althoff, C., Thomas, W., Wallmeier, N.: Observations on determinization of Büchi automata. In: Farré, J., Litovsky, I., Schmitz, S. (eds.) CIAA 2005. LNCS, vol. 3845, pp. 262–272. Springer, Heidelberg (2006)
8. Tasiran, S., Hojati, R., Brayton, R.K.: Language containment using nondeterministic omega-automata. In: Camurati, P.E., Eveking, H. (eds.) CHARME 1995. LNCS, vol. 987, pp. 261–277. Springer, Heidelberg (1995)
9. Libkin, L., Sirangelo, C.: Reasoning about XML with temporal logics and automata. In: Cervesato, I., Veith, H., Voronkov, A. (eds.) LPAR 2008. LNCS, vol. 5330, pp. 97–112. Springer, Heidelberg (2008)
10. Genevès, P., Layaïda, N., Schmitt, A.: Efficient static analysis of XML paths and types. In: Proc. of the ACM SIGPLAN 2007 Conf. on Programming Language Design and Implementation (PLDI 2007), pp. 342–351 (2007)
11. Cosmadakis, S.S., Gaifman, H., Kanellakis, P.C., Vardi, M.Y.: Decidable optimization problems for database logic programs. In: Proc. of STOC 1988, pp. 477–490 (1988)
12. Gottlob, G., Koch, C., Pichler, R.: Efficient algorithms for processing XPath queries. ACM Trans. on Database Systems 30(2), 444–491 (2005)
13. Bojanczyk, M., Parys, P.: XPath evaluation in linear time. In: Proc. of PODS 2008, pp. 241–250 (2008)
14. Schwentick, T.: XPath query containment. SIGMOD Record 33(1), 101–109 (2004)
15. Fan, W.: XML constraints: Specification, analysis, and applications. In: Proc. of DEXA 2005 (2005)
16. Calvanese, D., De Giacomo, G., Lenzerini, M.: Representing and reasoning on XML documents: A description logic approach. J. of Log. and Comp. 9(3), 295–318 (1999)
17. Bex, G.J., Neven, F., Van den Bussche, J.: DTDs versus XML Schema: A practical study. In: Proc. of WebDB 2004, pp. 79–84 (2004)
18. Calvanese, D., De Giacomo, G., Lenzerini, M., Vardi, M.Y.: Reasoning on regular path queries. SIGMOD Record 32(4), 83–92 (2003)
19. Marx, M.: First order paths in ordered trees. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 114–128. Springer, Heidelberg (2004)
20. Afanasiev, L., Blackburn, P., Dimitriou, I., Gaiffe, B., Goris, E., Marx, M., de Rijke, M.: PDL for ordered trees. J. of Applied Non-Classical Logics 15(2), 115–135 (2005)
21. Slutzki, G.: Alternating tree automata. Theor. Comp. Sci. 41, 305–318 (1985)
22. Comon, H., Dauchet, M., Gilleron, R., Jacquemard, F., Lugiez, D., Tison, S., Tommasi, M.: Tree automata techniques and applications (2002), http://www.grappa.univ-lille3.fr/tata/
23. Doner, J.E.: Decidability of the weak second-order theory of two successors. Notices Amer. Math. Soc. 12, 819 (1965)
24. Bryant, R.E.: Graph-based algorithms for Boolean-function manipulation. IEEE Trans. on Computers C-35(8) (1986)
25. Sattler, U., Vardi, M.Y.: The hybrid μ-calculus. In: Goré, R.P., Leitsch, A., Nipkow, T. (eds.) IJCAR 2001. LNCS, vol. 2083, pp. 76–91. Springer, Heidelberg (2001)
26. Safra, S.: On the complexity of ω-automata. In: Proc. of FOCS 1988, pp. 319–327 (1988)
27. Jurdzinski, M.: Small progress measures for solving parity games. In: Reichel, H., Tison, S. (eds.) STACS 2000. LNCS, vol. 1770, pp. 290–301. Springer, Heidelberg (2000)
28. Burch, J.R., Clarke, E.M., McMillan, K.L., Dill, D.L., Hwang, L.J.: Symbolic model checking: 10^20 states and beyond. Information and Computation 98(2), 142–170 (1992)
29. Calvanese, D., De Giacomo, G., Lenzerini, M., Vardi, M.Y.: View-based query answering and query containment over semistructured data. In: Ghelli, G., Grahne, G. (eds.) DBPL 2001. LNCS, vol. 2397, pp. 40–61. Springer, Heidelberg (2002)
30. Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Rewriting regular XPath queries on XML views. In: Proc. of ICDE 2007, pp. 666–675 (2007)
The Script-Writer's Dream: How to Write Great SQL in Your Own Language, and Be Sure It Will Succeed

Ezra Cooper
University of Edinburgh
Abstract. We show how to translate expressions in a higher-order programming language into SQL queries. Somewhat surprisingly, we show that any suitable expression translates to a single SQL query, where the suitability is determined by a type-and-effect check. Thus, unlike in Hollywood where a script-writer can never be sure a movie sequel will be popular, we show how to be sure that your SQL—written in your own language—will succeed (in being translated).
Introduction

Language-integrated query is an approach to accessing relational databases from a general-purpose programming language, which reduces the infamous "impedance mismatch" [1] by allowing queries to be written with the full flexibility of the general language, and translating them to flat relational queries (SQL, for our purposes) when possible. The increased flexibility may include features like functional abstraction, permitting the refactoring of query fragments, or the ability to form nested intermediate data structures having no relational equivalent—all within the familiarity of the host language. Language-integrated query is exemplified by the systems Kleisli [2], LINQ [3], Links [4], and Ferry [5].

These existing systems vary in their support of abstraction and degree of totality. They do not always permit queries to be abstracted by refactoring query fragments into parameterized functions, nor permit using predefined functions in queries, even if those could be translated in their applied context—at least not with the full flexibility of λ-calculus. As to totality, they generally make a best effort to translate host-language expressions to queries, but may fail to do so at runtime—either with a run-time error or by executing a query outside the database; we call this partiality.

The increased flexibility of general programming languages makes it difficult to recognize query-translatable expressions. Besides the problem of nested data structures, programming languages normally have operations with no equivalent in the query language, including recursion, side-effecting statements, and primitive functions that simply aren't available. Existing versions of Kleisli, Links and LINQ resolve the tension through partiality: they may not always transform the expression completely to an SQL query. Instead they might give a run-time error (in the case of LINQ) or they might evaluate the query expression directly by the
host language in a naive and inefficient way (forgoing the benefit of indexes, for example). The consequence may be that the program behaves very inefficiently, with no indication until runtime that this happens. The Ferry system gives a total translation, but without supporting abstraction.

This paper contributes to the science of language-integrated query by extending query translation to higher-order functions (permitting abstraction) and offering a static test for translatability (ensuring totality at runtime, after passing the static check). Thus you can "write SQL" in your own native programming language, and be sure at compile time that it will succeed in translation.

Contributions. The technical contributions of this paper include:

1. a translation from any suitable expression of a typical impure, functional programming language to an equivalent SQL query,
2. a type-and-effect system for statically checking this suitability,
3. a generalization of existing results for unnesting relational algebra by Wong [6] to a higher-order calculus, and
4. a description of a proof of totality for the translation, securing a result by Fegaras [7].

How it works. The recipe for translating higher-order language-integrated queries can be summarized as follows:

1. At compile time,
   (a) check that query expressions have a flat relation type;
   (b) use a type-and-effect system to check that query expressions are pure;
   (c) generate two representations of each query-translatable expression: one suitable for execution and one suitable for query generation.
2. At runtime, to execute an expression via SQL,
   (a) insert the values (query representations) for any free variables, and
   (b) reduce it to eliminate intermediate structures (that is, functions and nested data structures). This produces a normal form directly translatable to SQL.
Example

Suppose Alice runs a local baseball league. First, she wants a list of the players with age at least 12. She might write this function (these examples use the syntax of Links, which is a general-purpose language but is specially adapted to look like a query language):

  fun overAgePlayers() {
    query {
      for (p <- players)
      where (p.age > 12)
        [(name = p.name)]
    }
  }
(The for (x <- xs) e construct is a bag comprehension, and works as follows: for each element in the bag denoted by xs, it evaluates e with x bound to that element, producing a new bag; the result is the union of all those bags resulting from the body. Comprehensions have a long history of use in programming languages from Haskell to Python and JavaScript. The construct where (e1) e2 is simply shorthand for if e1 then e2 else [].)

The compiler can deduce that this expression is in fact equivalent to an SQL query, so it accepts the function. However, the following code would give a compiler error:

  fun overAgePlayersReversed() {
    query {
      for (p <- players)
      where (p.age > 12)
        [(name = reverse(p.name))]    # ERROR!
    }
  }
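The bag-comprehension reading of for/where can be mimicked directly in an ordinary language. Here is a small Python sketch, with lists standing in for bags; the table contents are invented for illustration.

```python
# Bag semantics of `for (x <- xs) e` and `where (c) e`, mimicked with
# Python lists standing in for bags. The player data is invented.

players = [
    {"name": "Ann", "age": 14},
    {"name": "Bob", "age": 11},
    {"name": "Cal", "age": 13},
]

def bag_for(xs, body):
    # for (x <- xs) e: union of the bags the body produces, one per element
    return [y for x in xs for y in body(x)]

def where(cond, bag):
    # where (c) e is shorthand for: if c then e else []
    return bag if cond else []

over_age = bag_for(players,
                   lambda p: where(p["age"] > 12, [{"name": p["name"]}]))
# over_age == [{"name": "Ann"}, {"name": "Cal"}]
```

The point of the paper is that expressions written in this comprehension style, when pure and flat-typed, can always be turned into a single SQL query instead of being run in-memory like this.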
This is because the reverse function has no SQL equivalent, and so no query is equivalent to this expression.

Now, it takes nine players to make a baseball team, but some "teams" in Alice's league are short of players. She needs to generate a mailing list of players that belong to a full team. One way to do this is to collect, for each team, a team roster (list of players) and then filter for those with a roster of at least nine elements. She writes the following code:

  fun teamRosters() {
    for (t <- teams)
      [(name = t.name,
        roster = for (p <- players)
                 where (p.team == t.name)
                   [(playerName = p.name)])];
  }

  fun usablePlayers() {
    query {
      for (t <- teamRosters())
      where (length(t.roster) >= 9)
        t.roster
    }
  }
Note the lack of brackets [] around the final t.roster: since the for comprehension takes the union of bags produced by the body, we here take the union of satisfying rosters.

This expression is equivalent to an SQL query, although not in a direct way, since it uses an intermediate data structure that is nested (teamRosters has type [(name:String, roster:[(playerName:String)])]), and this is not supported by SQL. But since the final result is flat, our analysis accepts the query-bracketed expression and translates it into an equivalent SQL query, such as this one:

  select p.name as playerName
  from players as p, teams as t
  where p.team = t.name
    and (select count(*) from players as p2
         where p2.team = t.name) >= 9
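To see concretely that the intermediate value is nested while the final result is flat, the same pattern can be sketched in Python; the data and the lowered roster threshold are invented so the example stays small.

```python
# Nested intermediate structure, flat final result: per-team rosters
# (a nested value with no flat relational equivalent) are built, then
# flattened to the players on "full" teams. Data is invented and the
# threshold is lowered from 9 to 2 for illustration.

teams = [{"name": "Reds"}, {"name": "Blues"}]
players = [
    {"name": "Ann", "team": "Reds"},
    {"name": "Bob", "team": "Reds"},
    {"name": "Cal", "team": "Blues"},
]

def team_rosters():
    # nested type: list of records whose `roster` field is itself a list
    return [
        {"name": t["name"],
         "roster": [{"playerName": p["name"]}
                    for p in players if p["team"] == t["name"]]}
        for t in teams
    ]

def usable_players(min_size=2):
    # flat result, even though the intermediate value is nested
    return [r for t in team_rosters() if len(t["roster"]) >= min_size
            for r in t["roster"]]

# usable_players() == [{"playerName": "Ann"}, {"playerName": "Bob"}]
```

The paper's normalization rules eliminate the nested intermediate structure, which is why the whole thing can still become one flat SQL query.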
Also note that factoring out the part of the query which returns the bag of all rosters posed no problem: the query translator will simply inline the function and produce a single query.

Suppose now that Alice wishes to abstract the query condition on teams. That is, she wishes to write a function which accepts as argument a roster predicate, and produces a list of the player records belonging to those teams satisfying the predicate. With the following code, the query translator will still produce, in each invocation, a single SQL query:

  fun playersBySelectedTeams(pred) {
    query {
      for (t <- teamRosters())
      where (pred(t.roster))
        t.roster
    }
  }
The query translator will ensure that any argument passed as pred is itself a translatable function. If any call site tries to pass an untranslatable predicate, it produces a compiler error. This type of abstraction is particularly difficult to achieve in SQL, since the team rosters themselves cannot be explicitly constructed in SQL as part of a query. SQL restricts how subqueries can be used (for example, in various contexts they must return just one column, just one row, or both) and the subquery itself must be changed if we apply an aggregate function to its single column.

For example, suppose we want to form a sub-league of "senior" teams: teams with enough players of age at least 15. We might define the following predicates:

  fun fullTeam(list) {
    length(list) >= 9
  }

  fun seniorPlayers(list) {
    for (x <- list)
    where (x.age >= 15)
      [x]
  }
and use them as follows:

  playersBySelectedTeams(fun(x) { fullTeam(seniorPlayers(x)) })
Here we have freely used length and a filtering comprehension, despite the fact that in SQL these are represented very differently:

  select p.name as playerName
  from players as p, teams as t
  where p.team = t.name
    and (select count(*) from players as p2
         where p2.team = t.name and p2.age >= 15) >= 9
The application of count needs to be placed within the subquery, while the comparison >= 9 is placed outside of it. Also the condition on p2.age must be placed within the where clause of the subquery. It would not be easy to write a “template” SQL query which would produce through string substitution all the queries that playersBySelectedTeams does. Microsoft’s LINQ system allows abstraction of query predicates, provided they are defined at a special type (the type of expressions). Because of the special type, these predicates cannot directly be reused as ordinary functions, and composition is not supported, so the last example would require explicitly rewriting
the composed function fun(x) { fullTeam(seniorPlayers(x)) } as a new function. This paper shows how such a system might support composition.

Road map. The next sections (1) define the core source language and its static analysis, (2) translate it into SQL, (3) characterize the correctness of the algorithm, and (4) extend it in two ways: with recursion and with the length operator.
1 The Source Language
We define a language which resembles the core of an ordinary impure functional programming language, and is also a conservative extension of the (higher-order) Nested Relational Calculus with side-effects and a query annotation. Its grammar is given in Figure 1.

The terms [M], [], and M ⊎ N represent bag (multiset) operations: singleton construction, the empty bag, and bag union. The bag comprehension, written for (x ← L) M, computes the union of the bags produced by evaluating M in successive environments formed by binding x to the elements of L in turn. A table handle table s : T denotes a reference to a table, named s, in some active database connection; T designates the effective type of the table. The conditional form if B then M else N evaluates to the value of either M or N depending on the value of B. Records are constructed as a parenthesized sequence of field-name–term pairs (l₁ = M₁, …, lₙ = Mₙ) and destructed with the field projection M.l. When speaking of a record
(terms)           B, L, M, N ::= for (x ← L) M | if B then M else N | table s : T
                               | [M] | [] | M ⊎ N
                               | (l₁ = M₁, …, lₙ = Mₙ) | M.l
                               | L M | λx.N | x | c
                               | ⊕(M₁, …, Mₙ) | empty(M) | query M
(primitives)      ⊕
(table names)     s, t
(field names)     l
(types)           T ::= o | (l₁ : T₁, …, lₙ : Tₙ) | [T] | S →^e T
(base types)      o ::= bool | int | string
(atomic effects)  E ::= noqy | ···
(effect sets)     e, a set of atomic effects

Fig. 1. Grammar of the source language
This work                        NRC
L M                              L M
λx.N                             λx.M
x                                x
if B then M else N               if B then M else N
empty M                          empty M
c                                c
for (x ← L) M                    {M | x ∈ L}
[]                               {}
[M]                              {M}
M ⊎ N                            M ∪ N
table s : T                      x (a free variable)
(l₁ = M₁, …, lₙ = Mₙ), M.l       (L, M), (), π₁, π₂
⊕(M₁, …, Mₙ)                     M = N
query M                          (no equivalent)

Fig. 2. Nested Relational Calculus (NRC)
construction (l₁ = M₁, …, lₙ = Mₙ) we will indicate the immediate subterms by subscripting the metavariable M with labels, so that Mₗ is a field of the record for each label l. Similarly, for record types (l₁ : T₁, …, lₙ : Tₙ) the field types will be indicated Tₗ for each label l.

Functional abstraction λx.N and application L M are as usual. Variables are ranged over by x, y, z and other italic alphabetic identifiers, while c ranges over constants. The language may contain primitive operations, ranged over by ⊕, which must appear fully applied (this is not a significant restriction, since one may abstract over such expressions). The primitives should include boolean negation (¬). The form empty(M) evaluates to a boolean indicating whether the bag denoted by M is the empty bag or not. The form query M evaluates to the same value as M but instructs the compiler that M must evaluate as an SQL query. An expression is translatable if it can be so evaluated.

Terms are assigned types, which can be: base types, ranged over by o; record types (l₁ : T₁, …, lₙ : Tₙ), where each field label l is given a type Tₗ; bag types [T]; and function types S →^e T, where S is the function domain, T is the range, and e is a set of effects that the function needs permission to perform. We consider effects abstractly; E ranges over some arbitrary set of atomic effects, which includes at least an effect noqy and may include other runtime actions such as I/O or reference-cell mutations. Every effect should represent some kind of runtime behavior that has no SQL equivalent; we use the distinguished effect noqy to mark non-databasable operations when no other effect presents itself. For simplicity, this first calculus does not include the length operator; Section 4.2 extends the language to include it.

NRC Comparison. Compare this language with NRC as given by Wong [6] (Figure 2). The two languages are nearly the same. Some apparent differences
are only notational. NRC's comprehension form {M | x ∈ L} is identical to ours, for (x ← L) M. The NRC literature uses set notation, {}, {M}, and M ∪ N, but these can refer to any of the extended monads for bags, sets, or lists. (Here we treat only bags.) We write table handles explicitly as table s : T, with a name s and a required type annotation T, which facilitates local translation, while NRC uses free variables to refer to tables.

A few differences are not notational. NRC is defined with tuples while we use records, without loss of generality. We admit an arbitrary set of basic operations ⊕ while NRC, at its core, includes no operations; these are treated as extensions in the NRC literature. This formulation of NRC includes an equality test at each type; equality at base types is treated as a basic operation (ranged over by ⊕) in our calculus. Finally, our language extends NRC with the translatability assertion query M. (In fact Wong [6] gives the same grammar as reproduced here, but the paper goes on to treat λ-abstractions as though they can only appear in application position, as in (λx.N) M, effectively restricting the language to a first-order form. Here we treat it in its full higher-order glory.)

Type-and-effect system. The static semantics is defined by a type-and-effect system in Figure 3. It is close to standard systems along the lines of Talpin and Jouvelot [8]. The typing permits no recursion, and thus is analogous to the simply-typed λ-calculus. We add recursion later using a fixpoint operator. The system involves just one form of judgment, Γ ⊢ M : T ! e, which can be read, "In environment Γ, term M has type T and may take effects in the set e."

The type of each constant c is given as Tc, which should be a base type. Constant values at complex type can, of course, be constructed explicitly. We demand that each primitive ⊕ either has an SQL equivalent, ⊕sql, or carries a nonempty effect annotation. We also insist that primitives have basic argument types and a basic result type, or else carry an effect annotation. The limitation to base-type arguments means that functions like empty and length, which anyway require special rewrite rules, cannot be defined as primitives.

For example, the primitives might include addition, (+) : int × int →^∅ int, which has an SQL equivalent, as well as print : string →^{noqy} (), which prints to the terminal and has no SQL equivalent, and hence carries an effect.

In keeping with the flatness of SQL tables, we require that each table have a flat relation type: in table s : T we require T = [(l₁ : o₁, …, lₙ : oₙ)] for some record type whose fields all have base type. An immediate type annotation is required on table expressions. This may seem a nuisance; as a practical matter, however, it provides a direct way for the programmer to check whether the usage type of a table agrees with the underlying table's schema type in the DBMS. The type system is monomorphic, so for example each appearance of the empty bag [] must be given some particular concrete type.
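The effect half of this system can be sketched in a few lines of Python: effects propagate upward through a term, abstraction defers its body's effects to the arrow type, application releases them, and query M is accepted only when its body is pure. The AST encoding and the effect names below are illustrative, not the paper's concrete syntax.

```python
# Illustrative sketch of effect propagation in a type-and-effect system.
# Terms are nested tuples whose first element names the construct; this
# encoding is invented for the sketch.

def effects(term):
    kind = term[0]
    if kind in ("const", "var", "table", "empty_bag"):
        return set()
    if kind == "lam":
        # A lambda itself is pure: its body's (latent) effects live on
        # the arrow type, not on the abstraction (rule T-Abs).
        return set()
    if kind == "app":                      # ("app", fun, arg, latent_effects)
        _, f, a, latent = term
        # Application releases the function's latent effects (rule T-App).
        return effects(f) | effects(a) | set(latent)
    if kind == "prim":                     # ("prim", effect_set, *args)
        _, eff, *args = term
        return set(eff) | set().union(*[effects(a) for a in args])
    if kind in ("for", "union", "if", "singleton", "project"):
        # Compound forms simply union the effects of their subterms.
        return set().union(*[effects(t) for t in term[1:]
                             if isinstance(t, tuple)])
    raise ValueError(kind)

def query_ok(term):
    # query M is accepted only if M is pure (rule T-Db).
    return effects(term) == set()

pure = ("for", ("table",), ("singleton", ("var",)))
impure = ("prim", {"noqy"}, ("var",))
# query_ok(pure) == True; query_ok(impure) == False
```

A full checker would of course track types alongside effects and thread an environment for variables; only the effect bookkeeping is shown.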
(T-Const)     Γ ⊢ c : Tc ! ∅

(T-Var)       Γ, x : T ⊢ x : T ! ∅

(T-Abs)       Γ, x : S ⊢ N : T ! e
              ⟹  Γ ⊢ λx.N : S →^e T ! ∅

(T-App)       Γ ⊢ L : S →^e T ! e′    Γ ⊢ M : S ! e″
              ⟹  Γ ⊢ L M : T ! e ∪ e′ ∪ e″

(T-Record)    Γ ⊢ Mᵢ : Tᵢ ! eᵢ for each i
              ⟹  Γ ⊢ (l₁ = M₁, …, lₙ = Mₙ) : (l₁ : T₁, …, lₙ : Tₙ) ! ⋃ᵢ eᵢ

(T-Project)   Γ ⊢ M : (l₁ : T₁, …, lₙ : Tₙ) ! e    (l : T) among the fields
              ⟹  Γ ⊢ M.l : T ! e

(T-Op)        ⊕ : S₁ × ··· × Sₙ →^e T    Γ ⊢ Mᵢ : Sᵢ ! eᵢ for each 1 ≤ i ≤ n
              ⟹  Γ ⊢ ⊕(M₁, …, Mₙ) : T ! e ∪ ⋃ᵢ eᵢ

(T-Empty)     Γ ⊢ M : [T] ! e  ⟹  Γ ⊢ empty(M) : bool ! e

(T-Null)      Γ ⊢ [] : [T] ! ∅

(T-Singleton) Γ ⊢ M : T ! e  ⟹  Γ ⊢ [M] : [T] ! e

(T-Union)     Γ ⊢ M : [T] ! e    Γ ⊢ N : [T] ! e′
              ⟹  Γ ⊢ M ⊎ N : [T] ! e ∪ e′

(T-Db)        Γ ⊢ M : T ! ∅    T has the form [(l₁ : o₁, …, lₙ : oₙ)]
              ⟹  Γ ⊢ query M : T ! ∅

(T-Table)     T has the form [(l₁ : o₁, …, lₙ : oₙ)]
              ⟹  Γ ⊢ (table s : T) : T ! ∅

(T-If)        Γ ⊢ L : bool ! e    Γ ⊢ M₁ : T ! e′    Γ ⊢ M₂ : T ! e″
              ⟹  Γ ⊢ if L then M₁ else M₂ : T ! e ∪ e′ ∪ e″

(T-For)       Γ ⊢ M : [S] ! e    Γ, x : S ⊢ N : [T] ! e′
              ⟹  Γ ⊢ for (x ← M) N : [T] ! e ∪ e′

(T-Subsump)   Γ ⊢ M : T ! e    e ⊆ e′
              ⟹  Γ ⊢ M : T ! e′

Fig. 3. Type-and-effect system
The main proposition we show is that if a term query M has a typing derivation under these rules, then applying any closing, well-typed substitution σ gives a term which can be translated to an equivalent SQL query.
2 Making Queries
To make queries from the source language, we will rewrite source terms to a sublanguage which translates directly and syntactically into SQL. We first examine the sublanguage and its relationship to our SQL fragment, then turn to the rewrite system.
SQL-like sublanguage. The target sublanguage is given in Figure 4. Observe that the "normal form" expressions ranged over by V all have relation type: bag of record of base types, i.e., [(l₁ : o₁, …, lₙ : oₙ)]. Types of the form (l₁ : o₁, …, lₙ : oₙ) are called "row types," after the database rows that they represent.

SQL. The target SQL fragment is given in Figure 5. This includes all unions of queries on an inner join of zero or more tables, with result and query conditions taken from some given algebra of operations, including field projection, boolean conjunction, negation, the exists operator, and conditionals case … end.

SQL translation. The type-sensitive function ⟦−⟧ (Figure 6) translates each closed term in the sublanguage directly into a query. (A fine point: SQL has no way of selecting an empty set of result columns in a select clause; to translate [()] we need to offer some dummy value, or *, as the result column. An implementation can supply any such dummy value, remembering to ignore the columns that the database actually returns.) We assume, without loss of generality, that all bound variables in the source program are distinct; thus there can be no clashes among the tables' as aliases used in the from clause of the query.

Rewrite rules. The translation of source terms into the SQL-isomorphic sublanguage is given as a rewrite system (Figure 7). We write M[L/x] for the substitution of the term L for the free variable x in the term M.

A few rules beg explanation. The syntax of SQL permits conditional choices only at the level of individual fields, never at row or table level. Thus if-split transforms a choice between two bags into the union of two oppositely-guarded bags. Similarly, if-record pushes conditional choices at the row level down to the level of fields. As such, both of these rules (and only these) are type-sensitive. The empty-flatten rule ensures that the argument to empty can be normalized to one of the query normal forms of Figure 4, which requires that it have relation type. Rules like app-if are common in higher-order rewrite systems: app-if pushes the deconstructor (−)M past an interposing form, in order to bring it together with corresponding constructors (here λx.N), thus exposing β-reductions.

Several rules may seem unnecessary if we were to use SQL subqueries. For example, why employ the for-assoc rule if we can write an SQL query that uses a nested select statement in its from clause? After all, we could more directly translate the expression

  for (y ← for (x ← table s : T) [(b = x.a)]) [(c = y.b)]

into SQL as follows:

  select y.b as c from (select x.a as b from s as x) as y

The nested comprehension became a nested subquery. Why then for-assoc?
(query normal forms)    V, U ::= V ⊎ U | [] | F
(comprehension NFs)     F ::= for (x ← table s : T) F | Z
(comprehension bodies)  Z ::= if B then Z else [] | [R] | table s : T
(row forms)             R ::= (l₁ = B₁, …, lₙ = Bₙ) | x
(basic expressions)     B ::= if B then B else B | empty(V) | ⊕(B₁, …, Bₙ) | x.l | c

Fig. 4. SQL-like sublanguage
Q, R ::= Q union all R | S
S ::= select s₁, …, sₙ from t₁ as x₁, …, tₘ as xₘ where e
s ::= e as l | x.*
e ::= case when e then e else e end | c | x.l | e ∧ e | ¬e | exists(Q) | ⊕(e₁, …, eₙ)

Fig. 5. Target SQL fragment
⟦V ⊎ U⟧ = ⟦V⟧ union all ⟦U⟧
⟦[] : [(l₁ : T₁, …, lₙ : Tₙ)]⟧ = select null as l₁, …, null as lₙ from ∅ where false
⟦for (x ← table s : T) F⟧ = select e₁ as l₁, …, eₙ as lₙ from s as x, t₁ as y₁, …, tₘ as yₘ where e
    where ⟦F⟧ = select e₁ as l₁, …, eₙ as lₙ from t₁ as y₁, …, tₘ as yₘ where e
⟦if B then Z else []⟧ = select e₁ as l₁, …, eₙ as lₙ from t₁, …, tₘ where e ∧ ⟦B⟧
    where ⟦Z⟧ = select e₁ as l₁, …, eₙ as lₙ from t₁, …, tₘ where e
⟦table s : [(l₁ : o₁, …, lₙ : oₙ)]⟧ = select s.l₁ as l₁, …, s.lₙ as lₙ from s where true
⟦[R]⟧ = select ⟦R⟧ from ∅ where true
⟦(l₁ = B₁, …, lₙ = Bₙ)⟧ = ⟦B₁⟧ as l₁, …, ⟦Bₙ⟧ as lₙ
⟦x⟧ = x.*
⟦if B then B′ else B″⟧ = case when ⟦B⟧ then ⟦B′⟧ else ⟦B″⟧ end
⟦empty(V)⟧ = ¬exists(⟦V⟧)
⟦⊕(B₁, …, Bₙ)⟧ = ⊕sql(⟦B₁⟧, …, ⟦Bₙ⟧)
⟦x.l⟧ = x.l
⟦c⟧ = c

Fig. 6. Translation from normalized terms to SQL
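The core select-from-where case of this translation amounts to string assembly. Here is a Python sketch; the normal-form representation is invented and covers only a single comprehension whose body is one guarded row of basic expressions.

```python
# Sketch of the select-from-where case of the translation: a normal form
# `for (x <- table s) ... if cond then [(l = e, ...)] else []` becomes
# one SQL query. The triple encoding below is invented for the sketch.

def translate(nf):
    """nf = (tables, condition, row) where
       tables    : list of (table_name, alias) pairs
       condition : SQL boolean expression as a string ('true' if absent)
       row       : list of (sql_expr, result_label) pairs"""
    tables, cond, row = nf
    sel = ", ".join(f"{e} as {l}" for e, l in row)
    frm = ", ".join(f"{t} as {x}" for t, x in tables)
    return f"select {sel} from {frm} where {cond}"

q = translate(([("players", "p"), ("teams", "t")],
               "p.team = t.name",
               [("p.name", "playerName")]))
# q == "select p.name as playerName from players as p, teams as t
#       where p.team = t.name"
```

Because the normal forms of Figure 4 guarantee this shape (all generators draw from tables, all conditions are basic expressions), the real translation never needs subqueries in the from clause.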
(for-β)           for (x ← [M]) N : T ⇝ N[M/x]

(abs-β)           (λx.N) M : T ⇝ N[M/x]

(record-β)        (l₁ = M₁, …, lₙ = Mₙ).lᵢ : T ⇝ Mᵢ

(for-zero-l)      for (x ← []) M : T ⇝ []

(for-zero-r)      for (x ← N) [] : T ⇝ []

(for-assoc)       for (x ← for (y ← L) M) N : T ⇝ for (y ← L) (for (x ← M) N),
                  if y ∉ fv(N)

(for-union-src)   for (x ← M₁ ⊎ M₂) N : T ⇝ for (x ← M₁) N ⊎ for (x ← M₂) N

(for-union-body)  for (x ← M) (N₁ ⊎ N₂) : T ⇝ for (x ← M) N₁ ⊎ for (x ← M) N₂

(for-if-src)      for (x ← if B then M else []) N : T ⇝ if B then (for (x ← M) N) else []

(app-if)          (if B then L else L′) M : T ⇝ if B then L M else L′ M

(if-record)       if B then M else N : (l₁ : T₁, …, lₙ : Tₙ) ⇝ (l₁ = L₁, …, lₙ = Lₙ),
                  with Lₗ = if B then M.l else N.l for each label l

(if-split)        if B then M else N : [T] ⇝ (if B then M else []) ⊎ (if ¬B then N else []),
                  if N ≠ []

(if-zero)         if B then [] else [] : T ⇝ []

(if-for)          if B then (for (x ← M) N) else [] : T ⇝ for (x ← M) (if B then N else [])

(if-union)        if B then M ⊎ N else [] : T ⇝ (if B then M else []) ⊎ (if B then N else [])

(empty-flatten)   empty(M) : T ⇝ empty(for (x ← M) [()]),
                  if M is not relation-typed

(ignore-db)       query M : T ⇝ M

Fig. 7. The rewrite system for normalizing source-language terms
The answer is that such rules are critical to the unnesting. Watch how we rewrite this query, which internally constructs a nested bag-of-bag type, not an SQL-representable type:

  for (y ← for (x ← table s) [[x]]) y
    ⇝ for (x ← table s) (for (y ← [[x]]) y)    (for-assoc)
    ⇝ for (x ← table s) [x]                    (for-β)

And so the data is unnested. The for-assoc rule thus exposes β-reductions, which themselves eliminate other constructor/destructor pairs and so reduce the types of intermediate values.
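The derivation above can be run mechanically. Here is a tiny Python rewriter with just the two rules it uses, for-assoc and for-β, on an invented tuple-encoded AST; capture-avoidance is ignored, which is safe under the assumption (made in Section 2) that all bound variables are distinct.

```python
# Tiny rewriter for the unnesting derivation: only for-assoc and for-beta
# are implemented. Terms are nested tuples; this encoding is invented.

def rewrite(t):
    if t[0] == "for":
        _, x, src, body = t
        # for-assoc: for (x <- for (y <- L) M) N ~> for (y <- L) for (x <- M) N
        if src[0] == "for":
            _, y, L, M = src
            return rewrite(("for", y, L, ("for", x, M, body)))
        # for-beta: for (x <- [M]) N ~> N[M/x]
        if src[0] == "bag":
            return rewrite(subst(body, x, src[1]))
        return ("for", x, rewrite(src), rewrite(body))
    if t[0] == "bag":
        return ("bag", rewrite(t[1]))
    return t  # variables, tables, constants

def subst(t, x, v):
    # naive substitution; no capture-avoidance (bound variables distinct)
    if t == ("var", x):
        return v
    if isinstance(t, tuple):
        return tuple(subst(s, x, v) for s in t)
    return t

# for (y <- for (x <- table s) [[x]]) y  -- a nested bag-of-bags inside
nested = ("for", "y",
          ("for", "x", ("table", "s"), ("bag", ("bag", ("var", "x")))),
          ("var", "y"))
# rewrite(nested) == ("for", "x", ("table", "s"), ("bag", ("var", "x")))
```

Running it reproduces the derivation: one for-assoc step exposes a for-β redex, and the bag-of-bags intermediate value disappears.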
3 Correctness
To show that the translation is correct, we show that the rewrite rules are sound with respect to the static semantics, that the rewrite system is strongly normalizing, and that its normal forms are the ones given earlier. We merely state the results here; full proofs can be found in an accompanying technical report [9].

Soundness. We use the rewrite rules only on pure terms, so the soundness proposition says that types and purity are preserved by rewriting.

Proposition 1. If M ⇝ M′ and ⊢ M : T ! ∅ with T a relation type, then ⊢ M′ : T ! ∅. Furthermore, M ⇝* V then gives ⊢ V : T ! ∅.

In fact, the rewrite rules fail to preserve effects in general, since they may relocate, remove, and duplicate subterms and their effects.

Totality. We want the translation really to give a query for every well-typed query expression. First we show that the rewrite rules strongly normalize—they admit no infinite reduction—then that the normal forms fall into the SQL target.

Proposition 2. Any well-typed term strongly normalizes.

The proof uses the reducibility with ⊤⊤-lifting approach of Lindley and Stark [10]; our for construct is simply the let of the monadic metalanguage, and there is some additional work to deal with the rule if-for, since it, unusually, pushes a context inside the binder of another.

Proposition 3 (Normal forms). Closed, well-typed terms of effect-free relation type have normal forms that satisfy the grammar of Fig. 4.

The proof goes in two steps: first, by induction on the structure of terms, we establish a grammar of all normal forms, where destructors are all pushed to the inside with constructors on the outside; then, arguments based on the type and effect of query expressions narrow the grammar to the one of Fig. 4.
4 Extensions

4.1 Recursion
A general-purpose programming language without recursion would be severely hobbled; but in standard SQL, many recursive queries are not expressible. Thus we need to add recursion to our language but ban it from query expressions. We add recursion by introducing a recursive λ-abstraction as follows. It introduces a recursive function of one argument and forces an effect on its type.
48
E. Cooper
Γ, f : S
e∪{noqy}
−→
T, x : S M : T ! e
Γ recfun f x = M : S
e∪{noqy}
−→
(T-RecFun)
T !∅
This prohibition is strong; some recursive functions may be expressible in SQL, and it will be interesting to see if a stronger translation can be given for recursive functions.

4.2 Length Operator
The examples in Section 1 used a function length which we have not yet studied. This section shows how to extend the system to support it. In the source, it is much like empty, but SQL's nonuniformity forces us to handle it specially. First we extend the source language with length:

    M ::= · · · | length(M)

The typing rule is as you would expect, resulting in type int. Extend the SQL-like sublanguage (the normal forms) as follows:

    B ::= · · · | length(F)

Note that we will normalize the argument to a comprehension normal form, F, so that it gives a select query and not, say, a union all query. Next, augment the SQL target:

    e ::= · · · | select count(∗) from t̄ as x̄ where e

And hence we can translate it to SQL as follows:

    length(F) = select count(∗) from t̄ as x̄ where e
        where F = select s̄ from t̄ as x̄ where e

Now we add the following rewrite rules:

    length(M) ⇝ length(for (x ← M) [()])    (length-flatten)  if M is not relation-typed
    length([]) ⇝ 0                           (length-zero)
    length(M ∪ N) ⇝ length(M) + length(N)    (length-union)

This assumes, of course, that we have the constant 0 and the integer-addition operation (+) in our set of constants and primitives.
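As a rough illustration of the translation above, here is a minimal sketch that turns a normalized length(F), with F already in select-from-where shape, into a select count(*) query reusing F's from and where clauses. The Select record and all field names are hypothetical, not the paper's formalism.

```python
# Sketch (not the paper's implementation): translating length(F) for a
# comprehension normal form F into "select count(*) ...", reusing F's
# from/where clauses and discarding its select list.
from dataclasses import dataclass

@dataclass
class Select:
    columns: list        # the select list s of F (unused by count(*))
    tables: list         # [(table_name, alias), ...] -- the from clause
    where: str           # the where clause

def translate_length(f: Select) -> str:
    froms = ", ".join(f"{t} as {x}" for t, x in f.tables)
    return f"select count(*) from {froms} where {f.where}"

q = Select(columns=["p.name"], tables=[("projects", "p")], where="p.stable = 1")
print(translate_length(q))
# select count(*) from projects as p where p.stable = 1
```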
5 Related Work
The science of language-integrated query is young. It might begin with Thomas and Fischer [11], who first defined an algebra of nested relations, thus freeing queries from the flatland of the 1st-normal-form restriction.
The Script-Writer’s Dream
49
Paredaens and Van Gucht [12] gave the first unnesting result, showing that nested relational algebra, when restricted to flat input and output relations, is equivalent in power to traditional flat relational algebra. Wong [6] soon extended this result, showing that any first-order nested relational algebra expression can be rewritten so that it produces no intermediate data structures deeper than the greatest of its input and output relations. Suciu [13] showed that the unnesting holds even in the presence of recursion, through a first-order fixed-point combinator. Suciu and Wong [14] show how first-order functions in a higher-order NRC can be implemented in first-order NRC, provided all the primitive functions have a semantic property called "internal," which requires that they do not output basic values not contained in their input. Fegaras [7] also shows how to transform higher-order nested relational queries into flat ones using a rewrite system with similarities to ours; we extend this by offering a proof of normalization, taking the queries all the way to SQL, and showing how a type-and-effect system can separate the translatable and untranslatable fragments of a general-purpose language. Van den Bussche [15] extended Wong's first-order result to show that even nested-result-type expressions can be simulated with the flat relational algebra, using an interpretation function which assembles the flat results into a nested relation.

The effort to harmonize query languages with programming languages is nearly as old as the two fields themselves. Atkinson and Buneman [16] survey the early history of integration. The impedance mismatch problem between databases and programming languages is described by Copeland and Maier [1].
The use of comprehension syntax was a breakthrough for language integration, since it gave an iteration construct that was both powerful enough for much general-purpose programming and also explicit enough to admit a more direct translation to a query language; this connection is explored by many authors [17,18,19,20]. Grust [21] summarizes much preceding work on monad comprehensions as a query language. In a different setting, Wiedermann and Cook [22] present a technique, using abstract interpretation, for extracting structured queries from imperative, object-oriented program code, where method dispatch rather than first-class functions is the means of query abstraction.

Kleisli [23,2] is a functional programming system that allows querying a variety of data sources—including flat relational databases—and allows the use of nested relations as intermediate values and as results. The Kleisli system heuristically pushes constructs inside an SQL query, perhaps leaving some code untranslated. At run-time, it always correctly executes an expression, whether or not it is translatable to SQL. The Kleisli approach has the advantage that all query translation can be done at compile-time.

The LINQ project [3] integrates queries into several programming languages, using an object-oriented interface for forming queries; a comprehension-like syntax is also provided. The interface works with many data sources, including relational DBMSes. In LINQ, one writes query fragments (e.g. conditions and transformations) using a language-level quoting mechanism which captures the
code for use at runtime. Quoted and unquoted expressions give different types of values, and they are not interchangeable; in particular, quoted functions cannot be applied to produce new quoted functions, so compositionality is hampered. When translating an expression, LINQ may (a) produce a run-time error, (b) partition the expression into an SQL query and pre- and post-processing phases executed outside SQL, or (c) translate the expression completely to SQL.

The Links project [4] also offers language-integrated query. The original version of Links, like Kleisli, always correctly executes an expression but may not translate it to SQL; Links' execution may be very naive and perhaps inefficient. Ferry [5] is a new system which translates a first-order nested query with a nested result into multiple flat SQL queries (a fixed number based on the type), expanding on Van den Bussche and translating the queries all the way to SQL.
6 Conclusion and Future Work

Expanding on a rich history of theoretical results, we have shown how to extract SQL queries from a host programming language with higher-order functions, using comprehensions as a natural query primitive, as well as a static discipline for checking the translatability of expressions. This permits the first language-integrated-query system which supports functional abstraction and has a runtime SQL translator that is total—that is, once the static check passes, the runtime translator will succeed in translating the expression completely to SQL.

We prevented unlabelled recursion using monomorphic simple types, but polymorphism is essential in programming, so we aim to accommodate it. Sam Lindley has developed a calculus, Frak, which uses type kinds to constrain type variables in a polymorphic calculus, and incorporates effect inference.

We targeted a modest sublanguage of SQL, but would like to go further. For example, we might like to take advantage of grouping and aggregation (using the group by clause of SQL) and the nonstandard limit and offset clauses to fetch subrows. Also, supporting an ordered list type would be useful; the Ferry system handles ordered data, which we hope will inspire further results.

The query annotation proposed may be useful for asserting that an expression be SQL-translatable; but the programmer may instead want to allow the compiler more flexibility, e.g. to partition the expression into an SQL query and pre- and post-processing phases, along the lines of what is normally done in LINQ and Kleisli. A science of such partitioning may be called for; this might allow the programmer to write even more query expressions while retaining a high degree of confidence in their run-time efficiency.
References 1. Copeland, G., Maier, D.: Making smalltalk a database system. SIGMOD Rec. 14(2), 316–325 (1984) 2. Wong, L.: Kleisli, a functional query system. J. Functional Programming 10(1), 19–56 (2000)
3. Microsoft Corporation: The LINQ project: .NET language integrated query. White paper (September 2005)
4. Cooper, E., Lindley, S., Wadler, P., Yallop, J.: Links: Web programming without tiers. In: de Boer, F.S., Bonsangue, M.M., Graf, S., de Roever, W.-P. (eds.) FMCO 2006. LNCS, vol. 4709, pp. 266–296. Springer, Heidelberg (2006)
5. Grust, T., Mayr, M., Rittinger, J., Schreiber, T.: Ferry: Database-supported program execution. In: SIGMOD 2009 (June 2009)
6. Wong, L.: Normal forms and conservative extension properties for query languages over collection types. J. Comput. Syst. Sci. 52(3), 495–505 (1996)
7. Fegaras, L.: Query unnesting in object-oriented databases. In: SIGMOD 1998, pp. 49–60. ACM, New York (1998)
8. Talpin, J., Jouvelot, P.: The type and effect discipline. Information and Computation, 162–173 (1992)
9. Cooper, E.: The script-writer's dream: How to write great SQL in your own language, and be sure it will succeed (tech report). Technical Report EDI-INF-RR-1327, University of Edinburgh (May 2009)
10. Lindley, S., Stark, I.: Reducibility and ⊤⊤-lifting for computation types. In: Urzyczyn, P. (ed.) TLCA 2005. LNCS, vol. 3461, pp. 262–277. Springer, Heidelberg (2005)
11. Thomas, S.J., Fischer, P.C.: Nested relational structures. Advances in Computing Research 3, 269–307 (1986)
12. Paredaens, J., Van Gucht, D.: Possibilities and limitations of using flat operators in nested algebra expressions. In: PODS 1988, pp. 29–38. ACM, New York (1988)
13. Suciu, D.: Fixpoints and bounded fixpoints for complex objects. In: DBPL 1993, pp. 263–281 (1993)
14. Suciu, D., Wong, L.: On two forms of structural recursion. In: Vardi, M.Y., Gottlob, G. (eds.) ICDT 1995. LNCS, vol. 893, pp. 111–124. Springer, Heidelberg (1995)
15. Van den Bussche, J.: Simulation of the nested relational algebra by the flat relational algebra, with an application to the complexity of evaluating powerset algebra expressions. Theoretical Computer Science 254(1-2), 363–377 (2001)
16. Atkinson, M.P., Buneman, O.P.: Types and persistence in database programming languages. ACM Comput. Surv. 19(2), 105–170 (1987)
17. Trinder, P.: Comprehensions, a query notation for DBPLs. In: DBPL 1991, San Francisco, CA, USA (1992)
18. Breazu-Tannen, V., Buneman, P., Wong, L.: Naturally embedded query languages. In: Hull, R., Biskup, J. (eds.) ICDT 1992. LNCS, vol. 646, pp. 140–154. Springer, Heidelberg (1992)
19. Buneman, P., Libkin, L., Suciu, D., Tannen, V., Wong, L.: Comprehension syntax. SIGMOD Record 23, 87–96 (1994)
20. Grust, T., Scholl, M.H.: How to comprehend queries functionally. J. Intell. Inf. Syst. 12(2-3), 191–218 (1999)
21. Grust, T.: Monad comprehensions, a versatile representation for queries. In: The Functional Approach to Data Management, pp. 288–311. Springer, Heidelberg (2003)
22. Wiedermann, B., Cook, W.R.: Extracting queries by static analysis of transparent persistence. In: POPL 2007 (2007)
23. Buneman, P., Davidson, S.B., Hart, K., Overton, C., Wong, L.: A data transformation system for biological data sources. In: VLDB 1995, pp. 158–169 (1995)
XML Security Views Revisited

Benoît Groz1,2,3, Sławomir Staworko1, Anne-Cécile Caron1,2, Yves Roos1,2, and Sophie Tison1,2

1 Mostrare project, INRIA Lille Nord-Europe & LIFL (CNRS UMR8022)
2 University of Lille 1
3 ENS Cachan
Abstract. In this paper, we revisit the view-based security framework for XML without imposing any of the previously considered restrictions on the class of queries, the class of DTDs, and the type of annotations used to define the view. First, we show that the full class of Regular XPath queries is closed under query rewriting. Next, we address the problem of constructing a DTD that describes the view schema, which in general need not be regular. We propose three different methods of approximating the view schema and we show that the produced DTDs are indistinguishable from the exact schema (with queries from a class specific to each method). Finally, we investigate problems of static analysis of security access specifications.
1 Introduction
The wide acceptance of XML as the format for data representation and exchange clearly demonstrates the need for a general and flexible framework of secure access for XML databases. While security specification and enforcement are well established in relational databases, their methods and approaches cannot be easily adapted to XML databases. This is because an XML document stores information not only in its data nodes but also in the way it is structured. Consequently, the problem of secure access to XML databases has its own particular flavor and requires dedicated solutions.

The view-based security framework for XML databases [19] has received increased attention from both the theoretical and practical angles [3,9,4,17,22,18,5]. It can be summarized as follows:

– The administrator provides the schema of the document together with the security access specification (SAS) defining the nodes accessible by the user.
– A virtual view comprising all accessible nodes is defined; the view is never materialized but the user is given its schema.
– Every query over the view is rewritten to an equivalent query over the underlying document and then evaluated.
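The three steps above can be sketched as a toy pipeline (all names are hypothetical, and nodes are reduced to bare labels): a SAS marks nodes accessible, the view is the accessible subset, and a user query over the view is rewritten by conjoining the accessibility test before evaluating it on the underlying document.

```python
# Toy sketch of the framework's three steps (all names hypothetical).
# A "document" is just a set of nodes; the SAS marks nodes accessible.
doc = {"root", "a", "secret", "b"}
accessible = lambda n: n != "secret"        # security access specification

view = {n for n in doc if accessible(n)}    # virtual view (never actually
                                            # materialized in the framework)

def rewrite(user_query):
    # rewrite a query over the view into one over the document:
    # conjoin the accessibility test, then evaluate on the document
    return lambda n: accessible(n) and user_query(n)

q = lambda n: len(n) == 1                   # user query posed on the view
ans_via_rewrite = {n for n in doc if rewrite(q)(n)}
assert ans_via_rewrite == {n for n in view if q(n)}   # same answers
print(sorted(ans_via_rewrite))              # ['a', 'b']
```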
This work was partially supported by the Enumeration project ANR-07-blanc-.
P. Gardner and F. Geerts (Eds.): DBPL 2009, LNCS 5708, pp. 52–67, 2009. c Springer-Verlag Berlin Heidelberg 2009
The framework is parametrized by the class of queries, typically a fragment of XPath, and the type of formalism used to define the schema with the security access specification, typically an annotated DTD. Previous research often imposed various restrictions in order to facilitate the tasks relevant to the framework. For instance, taking the class of downward XPath queries allows one to use the knowledge of the document DTD to benefit the query rewriting [3]. The task can be additionally simplified if the node accessibility is downward closed, i.e. all descendants of an inaccessible node are inaccessible as well [1]. The last restriction is also beneficial to constructing the view schema [4]. In fact, without it the set of possible views is not necessarily a regular set of trees, and in particular, need not be representable with a DTD (cf. Example 3). For similar reasons, in some works only non-recursive DTDs are considered [3,17]. The aforementioned restrictions may easily limit the versatility of the framework (cf. Example 2).

In this paper we lift all the restrictions and revisit the three most frequently considered problems: query rewriting, construction of the view schema, and static analysis of SAS. We make the following contributions:

1. We show that Regular XPath is closed under rewriting over XML views. The rewriting is quadratic (combined complexity) whereas a lower exponential bound has been shown for downward XPath queries [5,6].
2. We propose a novel approach of approximating tree languages with a DTD. Basically, a good approximation is a DTD that is indistinguishable from the real schema by means of querying the document. We present three different approximation methods and identify the classes of queries unable to distinguish them from the real schema. We also argue about the optimality of our constructions.
3. We investigate the problem of comparing two SAS and propose two different semantics, node- and query-based.
The first compares the sets of accessible nodes while the second compares the sets of queries that the user can pose on the underlying document through the rewriting process. We provide a detailed complexity analysis of those two notions. We also investigate the problem of verifying accessibility of information. Because of space limitations we omit some of the proofs. They can be found in the Appendix of the full version available online.1

Related work. In [17] Rassadko extends the query rewriting approach of Fan et al. [3,4] to XPath queries with ascending axes. However, only non-recursive DTDs are considered and it remains to be seen if this approach can be further extended to handle horizontal axes as well. To ensure that the view schema can be captured with a DTD, Fan et al. [3,4] altered the definition of the view for recursive DTDs. Essentially, some inaccessible nodes, rather than being removed, are obfuscated, i.e. their attributes are removed and their names changed to dummy ones. While this approach guarantees the schema to be regular, it clearly limits the relationships one can hide.
http://researchers.lille.inria.fr/~staworko/papers/staworko-dbpl09.pdf
54
B. Groz et al.
Vercammen et al. [22] propose a query rewriting algorithm for the class of XPath queries with intersection and difference. The set of accessible nodes is defined with a single XPath query Q. While this XPath dialect is commonly available in existing systems, the query rewriting is obtained by incorporating the query Q using the intersection and difference operators. In a straightforward evaluation of XPath, this is equivalent to materializing the view, a behaviour the original motivation tried to avoid.

Our method of verification of information accessibility is inspired by the work of Libkin and Sirangelo [10] where the opposite problem is studied, i.e. verifying that some information is secret. They consider the class of MSO queries, which is more suitable in this context, and show the problem to be decidable for downward-closed accessibility specifications. We recall that X Reg is known to be properly included in MSO [21], and thus checking whether a property can be expressed in MSO need not be related to the same check w.r.t. X Reg.
2 Preliminaries
XML Documents. We assume a finite set of node labels Σ and model XML documents with unranked ordered labeled trees TΣ . Formally, a Σ-tree is a finite structure t = (Nt , roott , childt , nextt , λt ), where Nt is a set of nodes, roott ∈ Nt is a distinguished root node, childt ⊆ Nt ×Nt is the parent-child relation, nextt ⊆ Nt × Nt is the next-sibling relation, and λt : Nt → Σ is the function assigning to every node its label. We remark that we do not assume the set of nodes to be a prefix closed subset of N∗ . Moreover, equality of trees should not be confused with isomorphism: two trees are equal if and only if all the elements of their underlying structure are the same, including the node set. Example 1. Figure 1 contains an example of a tree representing an XML database with information on software development projects.
Fig. 1. Tree t0
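For concreteness, the structure t = (Nt, roott, childt, nextt, λt) just defined can be encoded directly; the small tree and node names below are illustrative rather than those of Fig. 1.

```python
# A direct encoding of a Σ-tree as the tuple (N_t, root_t, child_t,
# next_t, λ_t); the node names are illustrative, not those of Fig. 1.
t = {
    "nodes": {"n0", "n1", "n2", "n3"},
    "root": "n0",
    "child": {("n0", "n1"), ("n0", "n2"), ("n1", "n3")},  # parent-child pairs
    "next": {("n1", "n2")},                               # next-sibling pairs
    "label": {"n0": "projects", "n1": "project",
              "n2": "project", "n3": "name"},
}

def children(t, n):
    """Children of n in document order, using next_t to order siblings."""
    kids = {c for (p, c) in t["child"] if p == n}
    # the first child is the one that is nobody's next-sibling
    first = kids - {b for (a, b) in t["next"] if a in kids}
    order, cur = [], next(iter(first), None)
    while cur is not None:
        order.append(cur)
        cur = next((b for (a, b) in t["next"] if a == cur), None)
    return order

print(children(t, "n0"))  # ['n1', 'n2']
```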
Every project has a name and a type of license (either free or proprietary). Projects under development come with their source codes and documentation, whereas stable projects also have binaries.

Regular XPath queries. The class X Reg of Regular XPath queries [13] over Σ-trees is defined with the following grammar (with a varying over Σ):

    α ::≡ self | ⇓ | ⇑ | ⇒ | ⇐
    f ::≡ lab() = a | Q | true | false | not f | f and f | f or f
    Q ::≡ α | [f ] | Q/Q | Q ∪ Q | Q∗

Essentially, an X Reg query is a regular expression over base axes and filter expressions. Filter expressions are Boolean combinations of node label tests and existential path tests. We define several macros: α+ is short for α∗/α, Q[f ] is Q/[f ], α::a stands for α[lab() = a], and α::∗ is simply α, where a is a symbol, Q a query, f a filter expression, and α a base axis or its closure. The semantics of Regular XPath is defined in Fig. 2 (Boolean connectives are interpreted in the usual manner). [[Q]]t is the binary reachability relation on the nodes of t defined
    [[self]]t = {(n, n) | n ∈ Nt}        [[Q1/Q2]]t = [[Q1]]t ◦ [[Q2]]t
    [[⇓]]t = childt                      [[Q1 ∪ Q2]]t = [[Q1]]t ∪ [[Q2]]t
    [[⇑]]t = childt⁻¹                    [[Q∗]]t = [[Q]]t∗
    [[⇒]]t = nextt                       [[[f ]]]t = {(n, n) | n ∈ Nt, (t, n) |= f}
    [[⇐]]t = nextt⁻¹

    (t, n) |= lab() = a  iff  λt(n) = a
    (t, n) |= Q  iff  ∃n′ ∈ Nt. (n, n′) ∈ [[Q]]t

Fig. 2. The semantics of Regular XPath
by the query Q. By (t, n) |= f we denote that the filter f is satisfied at the node n of the tree t. We say that a query Q is satisfied in the tree t if (t, roott) |= Q. The set of answers to a query Q in a tree t is defined as Ans(Q, t) = {n ∈ Nt | (roott, n) ∈ [[Q]]t}. For instance, the query Q0 = ⇓::project[⇓::stable]/⇓::name selects (the nodes storing) the names of all stable projects. The set of answers to Q0 in t0 (Fig. 1) is Ans(Q0, t0) = {n4, n11}.

We recall from [13] that X Reg is closed under inversion, i.e. for every query Q there exists a query Q⁻¹ such that [[Q⁻¹]]t = [[Q]]t⁻¹ for any tree t. Basically, Q⁻¹ is obtained by reversing the base axes and the order of composition on the topmost level (filter expressions are unchanged). Naturally, |Q⁻¹| = |Q|.
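The relational reading of Fig. 2 can be sketched directly: base axes are binary relations on nodes, / is relational composition, and Q∗ is reflexive-transitive closure; the inversion property [[Q⁻¹]]t = [[Q]]t⁻¹ is then immediate for the base axes. The three-node tree below is illustrative.

```python
# Relational semantics sketch: axes as binary relations, composition and
# closure as in Fig. 2. Toy tree: r above a above b.
child = {("r", "a"), ("a", "b")}
nodes = {"r", "a", "b"}

def compose(R, S):
    # [[Q1/Q2]] = [[Q1]] ∘ [[Q2]]
    return {(m, o) for (m, n1) in R for (n2, o) in S if n1 == n2}

def star(R, nodes):
    # [[Q*]] = reflexive-transitive closure of [[Q]]
    closure = {(n, n) for n in nodes}
    while True:
        bigger = closure | compose(closure, R)
        if bigger == closure:
            return closure
        closure = bigger

down_star = star(child, nodes)               # [[⇓*]]
up = {(n, m) for (m, n) in child}            # [[⇑]] = child⁻¹, i.e. [[⇓]]⁻¹
print(("r", "b") in down_star)  # True: b is a descendant-or-self of r
```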
DTDs and annotations. A Document Type Definition (DTD) over Σ is a function D that maps Σ to regular expressions over Σ. We allow regular expressions defined with the grammar E ::≡ empty | a | E","E | E"|"E | E∗, where empty defines the empty sequence, a is a symbol of Σ, E","E is the concatenation, E"|"E is the union, and E∗ is the Kleene closure. In the sequel, we present DTDs using rules of the form a → E and if for a symbol a the rule is not specified, then a → empty is implicitly assumed. The dependency graph of a DTD D over Σ is a directed graph whose node set is Σ and whose set of arcs contains (a, b) if D(a) uses the symbol b. A DTD is recursive iff its dependency graph is cyclic. A Σ-tree t satisfies D if for every node n with k children n1, . . . , nk (listed in the document order) we have λt(n1) · · · λt(nk) ∈ L(E), where E = D(λt(n)). By L(D) we denote the set of all Σ-trees that satisfy D.

A security access specification (SAS) consists of a DTD and an annotation specifying the accessibility of document nodes. Although the DTD and the annotation typically come together, we will often operate on the annotation independently of the DTD. Formally, an annotation over Σ is a (possibly partial) function A that maps Σ × Σ to X Reg filter expressions. The size |A| of annotation A is simply the sum of the sizes of all filter expressions used in A. Annotations define accessibility of nodes as follows. A node b with the parent a is accessible w.r.t. A if the filter expression A(a, b) is satisfied at the node b. If A(a, b) is not defined, then the accessibility of the parent is used (inheritance). Finally, the root node is always accessible.

Example 2. The DTD D0 below captures the schema of XML databases described in Example 1.

    projects → project∗
    project → name, (stable | dev), license
    license → free | propr
    stable → src, bin, doc
    dev → src, doc

    A0(project, stable) = false
    A0(project, dev) = false
    A0(stable, src) = [⇑∗::project/⇓∗::free]
    A0(stable, doc) = true
    A0(dev, src) = [⇑∗::project/⇓∗::free]
    A0(dev, doc) = true
The annotation A0 gives access to all projects but deliberately hides the information whether or not the project is stable (in particular, it hides binaries). Additionally, A0 hides the source code of all projects developed under a proprietary license. In the tree t0 from Fig. 1 the root node projects is accessible and all nodes project are accessible by inheritance. The nodes name and license with their children are accessible by inheritance as well. A0 explicitly states that stable and dev are not to be accessible, and the nodes bin are inaccessible by inheritance. On the other hand, A0 overrides the inheritance for nodes doc and makes them accessible. Finally, the accessibility of src nodes is conditional: only n7 and n21 are accessible because only those satisfy the specified conditions, A0(stable, src) and A0(dev, src) resp.

Now, given an annotation A over Σ and a Σ-tree t, the view A(t) of t defined by A is the Σ-tree obtained by removing from t all inaccessible nodes. Removing
Fig. 3. The view A0 (t0 )
a node causes its children to be adopted by (or linked to) the parent of the node. Since the accessibility of every node is well defined and the root is always accessible, the view is a well-defined Σ-tree. Figure 3 presents the view A0(t0) (for t0 from Fig. 1). We remark that the nodes of the view constitute a subset of the nodes of the underlying document, i.e. NA(t) ⊆ Nt.
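A minimal sketch of accessibility and view construction together, for a simple annotation (constant true/false filters only) on an illustrative four-node tree: accessibility follows the annotation with inheritance, and the view attaches each accessible node to its nearest accessible proper ancestor.

```python
# Sketch combining the accessibility judgement (simple annotation with
# inheritance) and the view construction. Tree and annotation are
# illustrative, not those of Fig. 1.
parent = {"n1": "n0", "n2": "n1", "n3": "n1"}        # child -> parent
label = {"n0": "project", "n1": "stable", "n2": "doc", "n3": "bin"}
A = {("project", "stable"): False, ("stable", "doc"): True}

def accessible(n):
    if n not in parent:                    # the root is always accessible
        return True
    key = (label[parent[n]], label[n])
    if key in A:                           # annotation defined: use it
        return A[key]
    return accessible(parent[n])           # otherwise inherit from parent

def view_parent(n):
    """Parent in the view A(t): nearest accessible proper ancestor."""
    p = parent.get(n)
    while p is not None and not accessible(p):
        p = parent.get(p)
    return p

view_nodes = sorted(n for n in label if accessible(n))
view_edges = sorted((view_parent(n), n) for n in view_nodes if n in parent)
print(view_nodes)   # ['n0', 'n2']: stable and bin are hidden
print(view_edges)   # [('n0', 'n2')]: the doc node was adopted by n0
```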
3 Query Rewriting over XML Views
In this section we show that the class of Regular XPath queries is closed under query rewriting. The rewriting technique for downward queries [3] relies on the knowledge of the DTD. Our rewriting method works independently of the DTD. We remark, however, that the DTD contains important information about the expected structure of the document, and in practical applications, one should take advantage of it in order to improve the efficiency of rewriting. The method uses the fact that the accessibility of a node can be defined with a single filter expression (Lemma 1). This filter is used to construct rewritings of the base axes (Lemma 2), which are used to rewrite the user queries.

Lemma 1. For any annotation A there exists a filter expression f^A_acc such that for any tree t, a node n ∈ Nt is accessible in t w.r.t. A if and only if (t, n) |= f^A_acc. Moreover, f^A_acc can be constructed in O(|A|) time.
Proof. By dom(A) we denote the set of pairs of symbols for which A is defined. We begin by defining two filter expressions. The first checks if A defines a filter expression for the current node:

    fdom := ⋁_{(a,b)∈dom(A)} (lab() = b and ⇑::a),

and if it is the case, the second filter expression is used to evaluate it:

    feval := ⋁_{(a,b)∈dom(A)} (lab() = b and ⇑::a and A(a, b)).

Finally, we restate the definition of accessibility using Regular XPath:

    f^A_acc := ([not fdom]/⇑)∗/[not(⇑) or feval].
The filter expression f^A_acc plays an important role in the algorithms presented in this paper. Therefore, it is beneficial to optimize it, for instance, by incorporating the knowledge of the DTD. The filter expression f^A0_acc in the presence of D0 (Example 2) is equivalent to

    lab() ≠ stable and lab() ≠ dev and lab() ≠ bin and (lab() ≠ src or ⇑∗::project/⇓∗::free).
Now, we show how to rewrite the base axes.

Lemma 2. For any annotation A and any α ∈ {⇓, ⇑, ⇒, ⇐} there exists a Regular XPath expression R^A_α such that [[R^A_α]]t = [[α]]A(t) for every tree t. Moreover, |R^A_α| = O(|A|).

Proof. Essentially, the rewriting R^A_α defines paths, traversing inaccessible nodes only, from one accessible node to another accessible node in a manner consistent with the axis α. For the vertical axes the task is quite simple:

    R^A_⇓ := [f^A_acc]/⇓/([not f^A_acc]/⇓)∗/[f^A_acc]

and R^A_⇑ := (R^A_⇓)⁻¹. Rewritings of the horizontal axes are slightly more complex and we first define auxiliary filter expressions:

    f↓∃ := ([not f^A_acc]/⇓)∗/[f^A_acc],    f↓∅ := not f↓∃,    f∅→ := (⇒/[f↓∅])∗/[not(⇒)].

f↓∃ checks that the current node or any of its descendants is accessible. Conversely, f↓∅ checks whether the current node and all of its descendants are inaccessible. Similarly, f∅→ verifies that only inaccessible nodes can be found among the siblings following the current node and their descendants.

The expression R^A_⇒ seeks the next accessible node among the following siblings of the current node and their descendants. However, if there are no such nodes but the parent is inaccessible, then the next accessible node is sought among the following siblings of the parent. The last step is repeated recursively if needed.

    R^A_⇒ := [f^A_acc]/([f∅→]/⇑/[not f^A_acc])∗/⇒/([(not f^A_acc) and f↓∅]/⇒ ∪ [(not f^A_acc) and f↓∃]/⇓/[not ⇐])∗/[f^A_acc]

and R^A_⇐ := (R^A_⇒)⁻¹. We observe that |R^A_α| = O(|A|) for every α ∈ {⇓, ⇑, ⇒, ⇐}.
Finally, we state the main result.

Theorem 1. X Reg is closed under query rewriting, i.e. there exists a function Rewrite such that Ans(Q, A(t)) = Ans(Rewrite(Q, A), t) for any annotation A, any Regular XPath query Q, and any tree t. Moreover, Rewrite(Q, A) is computable in time O(|A| ∗ |Q|).

Proof. The function Rewrite(Q, A) replaces in Q every occurrence of a base axis α ∈ {⇓, ⇑, ⇒, ⇐} with R^A_α. A simple induction over the size of Q shows that [[Q]]A(t) = [[Rewrite(Q, A)]]t, with Lemma 2 handling the nontrivial base cases. Since the root is always accessible, we get Ans(Q, A(t)) = Ans(Rewrite(Q, A), t). We note that the rewritten query is constructed in time O(|A| ∗ |Q|).
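The substitution step in the proof can be sketched on a tiny hypothetical query AST: Rewrite walks the query and replaces each base-axis leaf with the corresponding rewriting R^A_α (here an opaque stand-in term); filter expressions would be traversed analogously. Since each axis occurrence is replaced by a term of size O(|A|), the O(|A| ∗ |Q|) bound is immediate.

```python
# Sketch of Rewrite(Q, A): structurally replace each base axis in the
# query with its rewriting R_alpha^A. The AST encoding is hypothetical.
def rewrite(q, R):
    kind = q[0]
    if kind == "axis":                   # base axis: substitute R_alpha^A
        return R[q[1]]
    if kind in ("seq", "union"):         # Q1/Q2 and Q1 ∪ Q2
        return (kind, rewrite(q[1], R), rewrite(q[2], R))
    if kind == "star":                   # Q*
        return ("star", rewrite(q[1], R))
    return q                             # filters handled analogously

# a stand-in rewriting for the ⇓ axis, playing the role of R_down^A
R = {"down": ("seq", ("filter", "f_acc"), ("axis", "down*"))}
q = ("star", ("axis", "down"))
print(rewrite(q, R))
# ('star', ('seq', ('filter', 'f_acc'), ('axis', 'down*')))
```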
We observe that the asymptotic complexity of our rewriting method is comparable to that of [4], but it handles a larger class of queries and DTDs. Currently, we are investigating automata-based formalisms to devise efficient evaluation algorithms. Also, we remark that our rewriting technique can be easily adapted to rewrite Conditional XPath queries, a strict subclass of Regular XPath which we believe to be more scalable.
4 View Schema Approximation
In general, the user will need a schema of the view to formulate her queries. In this section we address the problem of its construction. For a SAS (D, A) we define VS(D, A) = {A(t) | t ∈ L(D)}. Often constructing the view schema is simple. For example, the view schema for D0 and A0 (Example 5) is

    projects → project∗    project → name, doc, license    license → free | propr
In general, however, three potential obstacles may arise when inferring the view schema. Example 3. Consider the security access specifications (D1 , A1 ), (D2 , A2 ), and (D3 , A3 ) (the last example is due to [9]): D1 : r → a, b, c b→d|e
D2 : r → c c → (a, c, b) | empty
D3 : r → ak
···
ai → ai−1 , ai−1
···
a0 → a
A1 (r, a) = [⇒::∗/⇓::d],
A2 (r, c) = A2 (c, c) = false,
A3 (r, ak ) = false,
A1 (r, c) = [⇐::∗/⇓::d],
A2 (c, a) = A2 (c, b) = true,
A3 (a0 , a) = true.
We observe that VS(D1, A1) = {r(a, b(d), c), r(b(e))}, which although regular cannot be captured with a DTD; VS(D2, A2) = {r(a^k b^k) | k ∈ N} is not a regular tree set; and finally, VS(D3, A3) = {r(a^{2^k})} requires a DTD of size exponential in |D3|. To alleviate the infeasibility of constructing a DTD representing the view schema, we investigate approximation methods.

4.1 Intermediate Representation
The reason why VS(D1, A1) cannot be represented with a DTD is the correlation between the filter expressions A1(r, a) and A1(r, c): one is satisfied if and only if the other is. A simple way to circumvent this problem is to consider only simple annotations, i.e. annotations using only the constant filter expressions true and false. It can easily be shown that the schema of a view defined by a simple annotation on a non-recursive DTD can always be captured with a DTD. In the remainder of this section, we adopt the restriction to simple annotations. A comprehensive way to address the problem of correlation between filter
expressions would be to use a larger, more expressive class of schemas, for instance EDTDs [15,12]. EDTDs allow defining different content models (types) for the same symbol. Then, the correlations between filter expressions are easily captured with different types. For instance, an EDTD defining VS(D1, A1) is

    r → (a, b(1), c) | b(2)    b(1) → d    b(2) → e

Introducing EDTDs would unnecessarily complicate the presentation of our results. We conjecture, however, that our approximation methods generalize to handle arbitrary annotations if we use annotated EDTDs for the SAS and an EDTD for the view schema.2

To represent the exact view schema, we generalize the DTD by allowing the rules to use context-free grammars in place of regular expressions.

Definition 1. A generalized DTD over Σ is a function H that maps Σ to context-free grammars over Σ. A Σ-tree t is valid w.r.t. H if for every node n with k consecutive children n1, . . . , nk we have λt(n1) · · · λt(nk) ∈ L(G), where G = H(λt(n)). By L(H) we denote the set of all trees valid w.r.t. H.

Proposition 1. For every DTD D and every simple annotation A, there exists a generalized DTD H(D,A) such that L(H(D,A)) = VS(D, A). Moreover, H(D,A) can be computed in time O(|D|).

We also remark that for every generalized DTD H there exists a DTD D and a simple annotation A such that VS(D, A) = L(H). Thus, from now on we focus on generalized DTDs. Recall that testing regularity of a context-free language is undecidable [8]. Consequently,

Proposition 2. It is undecidable to test whether a generalized DTD is equivalent to some DTD.

4.2 Indistinguishability of Approximation
In our opinion, the main purpose of the schema is to guide the user in her attempt to formulate a meaningful query, and an approximation of the schema should be judged from this perspective. Consequently, we propose the following notion to assert the suitability of an approximation.

Definition 2. We say that two sets L1 and L2 of Σ-trees are indistinguishable by a class C of queries, denoted L1 ≈C L2, if for every Q ∈ C the query Q is satisfied by a tree in L1 if and only if it is satisfied by a tree in L2.

In the sequel, instead of L(H) ≈C L(D) we write H ≈C D. Also, we consider different subclasses of Regular XPath queries obtained by restricting the use of axes, filter expressions, and negation. By X Reg(A) we denote the class of
The schema of a view defined by an annotated EDTD cannot always be captured with an EDTD, but it can be shown that this is the case when the annotated EDTD is non-recursive.
Regular XPath queries using axes from A. Using the self axis and node label tests is always allowed, but using existential path tests (resp. negation) is allowed only if A contains [ ] (resp. not). This way, XReg(⇓, ⇑, ⇒, ⇐, [ ], not) denotes the class of all Regular XPath queries. We remark that XReg(A) allows the use of all regular operators, including transitive closure and union. Thus, for instance, XReg(⇓∗) = XReg(⇓+) = XReg(⇓).

Example 4. Let H2 = H(D2,A2) for (D2, A2) from Example 3 and recall that L(H2) = {r(a^n b^n) | n ∈ N}. Consider the following three DTDs:

D2^i : r → (a, b)∗
D2^ii : r → a∗, b∗
D2^iii : r → (a | b)∗
We observe that H2 is indistinguishable from: (i) D2^i by C1 = XReg(⇓, ⇑, [ ], not); (ii) D2^ii by C2 = XReg(⇓, ⇑, ⇒+, ⇐+, [ ]); (iii) D2^iii by C3 = XReg(⇓).
4.3 Three Approximations
The only difference between generalized DTDs and standard DTDs is the use of context-free grammars instead of regular expressions. Consequently, we base our approximation techniques on existing methods that approximate context-free languages with regular expressions. We remark that it remains a nontrivial task to choose those methods that guarantee the approximated DTD to be indistinguishable by a possibly large class of queries. Later, we argue that the presented methods are optimal.

The first method is due to Parikh [16]. We fix an ordering of the alphabet Σ = {a1, . . . , an} and define φ(w) = (|w|a1, . . . , |w|an), where |w|ai denotes the number of occurrences of the symbol ai in the word w. Parikh has shown that for every context-free grammar G there exists a regular expression E such that {φ(w) | w ∈ L(G)} = {φ(w) | w ∈ L(E)}, i.e., E defines a set of words obtained by permuting words of G. For instance, the expression for the context-free language {a^n b^n | n ∈ N} is (a, b)∗. In this paper, by P(G) we denote the Parikh regular expression obtained with the method described in [7]. We remark that |P(G)| = 2^O(|G|).

The second method uses subwords and was first described in [2]. Formally, u is a subword of w, denoted u ⪯ w, if u = u1 · · · uk and there exist v0, v1, . . . , vk ∈ Σ∗ such that v0 u1 v1 · · · vk−1 uk vk = w. [2] shows that for every CFG G one can construct a regular expression G↓ such that L(G↓) = {u | ∃w ∈ L(G). u ⪯ w}. This method also produces an exponential regular expression, i.e., |G↓| = 2^O(|G|).

The last method is very simple: for a grammar G it constructs the expression (alph(G))∗, where alph(G) is the set of all symbols used in words of G, which can be easily obtained from a trimmed copy of G. Naturally, |alph(G)| = O(|G|).

Definition 3. Let H be a generalized DTD.
(i) The Parikh approximation of H is the DTD D_H^P defined as D_H^P(a) = P(H(a)) for every a ∈ Σ.
(ii) The subword approximation of H is the DTD D_H^↓ defined as D_H^↓(a) = H(a)↓ for every a ∈ Σ.
62
B. Groz et al.
(iii) The subset approximation of H is the DTD D_H^S defined as D_H^S(a) = (alph(H(a)))∗ for every a ∈ Σ.

The approximations are illustrated in Example 4, where D2^i = D_{H2}^P, D2^ii = D_{H2}^↓, and D2^iii = D_{H2}^S. The subset approximation of VS(D3, A3) is r → a∗. We note that |D_H^P| = 2^O(|H|), |D_H^↓| = 2^O(|H|), and |D_H^S| = O(|H|). Finally, we identify the classes of queries relevant to each approximation method.
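Operationally, the three approximations rest on three simple word-level primitives: the Parikh image φ(w), the subword relation ⪯, and the symbol set alph. A small Python illustration (helper names are ours; the subword enumeration is exponential and intended only for tiny words):

```python
from collections import Counter
from itertools import combinations

def parikh(word):
    # Parikh image φ(w): symbol counts, order forgotten.
    return Counter(word)

def is_subword(u, w):
    # u ⪯ w: u is obtained from w by deleting symbols.
    it = iter(w)
    return all(ch in it for ch in u)

def subwords(w):
    # All subwords of w (up to 2^|w| candidates; illustration only).
    return {"".join(s) for n in range(len(w) + 1)
            for s in combinations(w, n)}

# {a^n b^n} against its three approximations from Example 4:
assert parikh("aabb") == parikh("abab")   # (a, b)* yields the same Parikh images
assert is_subword("ab", "aabb")           # subwords of a^n b^n lie in a*, b*
assert set("aabb") == {"a", "b"}          # alph gives the subset expression (a|b)*
```

The `Counter` equality check mirrors the fact that the Parikh approximation is exact on unordered content, while `subwords` and `set` correspond to the coarser subword and subset closures.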
Theorem 2. For any generalized DTD H we have:
(i) H and D_H^P are indistinguishable by C1 = XReg(⇓, ⇑, [ ], not).
(ii) H and D_H^↓ are indistinguishable by C2 = XReg(⇓, ⇑, ⇒+, ⇐+, [ ]).
(iii) H and D_H^S are indistinguishable by C3 = XReg(⇓).
We also remark that the subword and the subset methods construct a superset of the real schema. More precisely, L(H) ⊆ L(D_H^↓) ⊆ L(D_H^S). As for the Parikh approximation, D_H^P correctly characterizes H if we consider unordered trees.
4.4 Optimality
First, we note that it seems rather infeasible to construct methods yielding approximations indistinguishable for classes of queries larger than C1 or C2.

Proposition 3. There exists a generalized DTD H for which there is no DTD indistinguishable from H by XReg(⇓, ⇒) or XReg(⇓, ⇒+, [ ], not).

We remark that the Parikh and subword methods construct DTDs exponential in the size of the generalized DTD. This is not very surprising because, as seen in Example 3, VS(D3, A3) requires a DTD of exponential size. We show, however, that the exponential lower bound holds even when approximating w.r.t. quite strong restrictions of C1 and C2.

Theorem 3. Take any class of queries C containing XReg(⇓, [ ]) or XReg(⇓, ⇑). For all k ∈ N there exists a generalized DTD Hk whose size is O(k) and such that any DTD D indistinguishable from Hk by C is of size Ω(2^k).
5 Static Analysis of Security Access Specifications
In this section we investigate two ways of comparing security access specifications: equivalence and restriction. We assume the DTD to be unchanged, which allows us to eliminate the factor of comparing DTDs from our analysis. We recall that testing equivalence and inclusion of DTDs is known to be PSPACE-complete [11]. Also, we again use arbitrary annotations.

We begin with testing the equivalence of two annotations, which is essential for optimization of the query rewriting process. Next, we consider the problem of comparing the restrictiveness of two SAS. When the administrator modifies the SAS to further restrict the user's access to the document, it might be prudent to verify that the user is indeed not permitted access to any new data. To
address this problem we propose the notion of node-based restriction of SAS, which compares the sets of accessible nodes before and after the change. We observe, however, that this test does not always guarantee that the user cannot obtain more information after the change (cf. Example 6). Consequently, we propose the alternative notion of query-based restriction, which compares the sets of queries the user is allowed to evaluate on the underlying document.
5.1 Equivalence
Two annotations are equivalent if they produce exactly the same view for every source document. We emphasise that in the context of security, using isomorphism to compare the views defined by two annotations may be error prone: the same shape does not imply that the same data is accessible. Consequently, we base the notion of equivalence on equality of trees.

Definition 4. Two annotations A1 and A2 are equivalent in the presence of the DTD D, denoted A1 ≡D A2, if and only if A1(t) = A2(t) for every t ∈ L(D).

We observe that A1(t) = A2(t) if and only if exactly the same nodes of t are accessible w.r.t. A1 and A2. Consequently, we use Lemma 1 to relate equivalence of annotations to equivalence of Regular XPath queries, known to be EXPTIME-complete [14,13].

Theorem 4. Testing equivalence of annotations is EXPTIME-complete.
5.2 Node-Based Comparison
Now, we focus on comparing the restrictiveness of two SAS. The first approach to testing restriction compares the sets of nodes that are accessible in the views defined by the annotations.

Definition 5. Given two annotations A1 and A2 and a DTD D, A1 is a node-based restriction of A2 in the presence of D, denoted A1 ⊑_nb^D A2, if NA1(t) ⊆ NA2(t) for every t ∈ L(D).

Example 5. Recall S0 = (D0, A0) from Example 2 and consider an annotation A0′ that is obtained from A0 by making all src nodes inaccessible, i.e., A0′(stable, src) = A0′(dev, src) = false. Clearly, A0′ ⊑_nb^D0 A0.

In a manner analogous to Theorem 4, we can relate node-based restriction to containment of Regular XPath queries.

Corollary 1. Testing node-based restriction of annotations is EXPTIME-complete.

Given a DTD D over Σ, let Ann_D be the quotient set of all annotations by ≡D (which is an equivalence relation). We observe that by Lemma 1 every equivalence class has a representative that is total, i.e., is defined for every (a, b) ∈ Σ × Σ. Interestingly, (Ann_D, ⊑_nb^D) is a Boolean algebra, where Ann_D
64
B. Groz et al.
is the quotient set of all annotations by ≡D. The supremum, infimum, and complement operations are defined as: [A1 ⊔ A2](a, b) = A1(a, b) or A2(a, b), [A1 ⊓ A2](a, b) = A1(a, b) and A2(a, b), and [¬A](a, b) = not A(a, b). The top and bottom elements are A⊤(a, b) = true and A⊥(a, b) = false. These operations can serve as general tools for manipulating SAS, for instance, for merging two SAS.
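For total annotations over a fixed set of label pairs, the lattice operations above are pointwise Boolean operations. A minimal sketch (dict-based representation of our own devising, using the label pairs of Example 2):

```python
def sup(a1, a2):
    # [A1 ⊔ A2](a, b) = A1(a, b) or A2(a, b)
    return {k: a1[k] or a2[k] for k in a1}

def inf(a1, a2):
    # [A1 ⊓ A2](a, b) = A1(a, b) and A2(a, b)
    return {k: a1[k] and a2[k] for k in a1}

def compl(a):
    # [¬A](a, b) = not A(a, b)
    return {k: not v for k, v in a.items()}

pairs = [("stable", "src"), ("dev", "src")]   # hypothetical pair set
top = {p: True for p in pairs}                # A⊤: everything accessible
bot = {p: False for p in pairs}               # A⊥: nothing accessible
a1 = {("stable", "src"): True, ("dev", "src"): False}
```

The Boolean-algebra laws then hold pointwise, e.g. `sup(a1, compl(a1)) == top` and `inf(a1, compl(a1)) == bot`.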
5.3 Query-Based Comparison
Sometimes, reducing the number of nodes in the view may inadvertently reveal information that previously was hidden. To (roughly) identify the information that the annotation makes accessible to the user, we use the following predicate:

Public(D, A, Q) := ∃Q′ ∈ XReg. ∀t ∈ L(D). Ans(Q, t) = Ans(Q′, A(t)).

Example 6. Recall again S0 = (D0, A0) from Example 2, and consider the query Q = ⇓::projects/⇓::project[⇓::dev and ⇓::license/⇓::free] that selects projects with a free license that are currently under development. One can easily verify that Public(D0, A0, Q) does not hold. In fact, A0 was deliberately designed to hide the development status of projects. Now, consider A0′′ obtained from A0 by denying access to the source code of all projects under development, i.e., A0′′(dev, src) = false. Naturally, A0′′ ⊑_nb^D0 A0. However, the projects with a free license that are currently under development can be selected with the following query:

Q′ = ⇓::projects/⇓::project[not(⇓::src) and ⇓::license/⇓::free].

Clearly, this proves Public(D0, A0′′, Q).
We use Public to compare the information made accessible by the annotations.

Definition 6. Given two annotations A1 and A2 and a DTD D, A1 is a query-based restriction of A2 in the presence of D, denoted A1 ⊑_qb^D A2, if for every Regular XPath query Q, Public(D, A1, Q) implies Public(D, A2, Q).

We note that A1 ⊑_qb^D A2 implies A1 ⊑_nb^D A2 (consider Q1 = ⇓∗::∗). However, the converse is not necessarily true (cf. Example 6). We also remark that (Ann_D, ⊑_qb^D) need not even be a semi-lattice (an example can be found in the Appendix).

Now, recall A0′ and A0 from Example 5. Naturally, A0′ ⊑_qb^D0 A0. We observe also that A0′(t0) can be seen as a view constructed on top of the view A0(t0). Interestingly, this observation leads to an alternative characterization of query-based restriction (cf. Lemma 1).
Lemma 3. Given two annotations A1 and A2 and a DTD D, A1 ⊑_qb^D A2 if and only if there exists a filter expression fP such that for every t ∈ L(D) and every n ∈ Nt, n is accessible in t w.r.t. A1 if and only if n ∈ NA2(t) and (A2(t), n) |= fP.
In the sequel, we say that the filter expression fP proves A1 ⊑_qb^D A2. In Example 5, the filter expression that proves A0′ ⊑_qb^D0 A0 is fP = not(lab() = src).

Now, we focus on computational properties of query-based restriction. Unfortunately, the problem of testing regular separability of two context-free languages, known to be undecidable [20], can be reduced to testing query-based restriction.

Theorem 5. Testing query-based restriction is undecidable even for simple annotations.

Undecidability of testing query-based restriction may limit its applications. However, in certain settings, semi-automatic testing is often acceptable. For instance, we may require the administrator to provide an additional input that helps to prove that the restriction holds.

Corollary 2. Given two annotations A1 and A2, a DTD D, and a filter expression f, testing whether f proves A1 ⊑_qb^D A2 is EXPTIME-complete.

While Lemma 3 shows that testing query-based restriction is RE (recursively enumerable), for non-recursive DTDs there exists a coRE characterization.

Lemma 4. Given a non-recursive DTD D and two annotations A1 and A2, A1 ⊑_qb^D A2 if and only if A2(t) = A2(t′) implies A1(t) = A1(t′) for every t, t′ ∈ L(D).

This shows that query-based restriction for non-recursive DTDs is decidable. However, we are able to say more about its complexity.

Theorem 6. Testing query-based restriction for non-recursive DTDs is in EXPTIME and is PSPACE-hard.

Finally, we investigate the complexity of testing Public(D, A, Q), which may be used to verify whether the annotation A allows direct access to the information defined by Q on the underlying document.

Proposition 4. Testing query-based restriction can be reduced in polynomial time to Public, and vice versa.

By Theorems 5 and 6 we immediately get the following.

Corollary 3. Testing Public is undecidable for arbitrary DTDs and is in EXPTIME (and PSPACE-hard) for non-recursive DTDs.
6 Conclusions and Future Work
In this paper we have revisited problems central to the view-based XML security framework. We considered the full class of Regular XPath queries and recursive DTDs annotated with Regular XPath filter expressions. We have shown that the expressive power of annotated DTDs equals the expressive power of a single
Regular XPath query (selecting all the accessible nodes). This result is central to the method of rewriting queries over the view. In the second part of the paper we have investigated the problem of approximating the view schema with a DTD indistinguishable by a class of queries. We have provided an analysis of approximability, presented three different approximation methods, and shown lower bounds on the size of the view DTD. Figure 4 summarizes the results.
Fig. 4. Summary of approximation results (negative results in round boxes)
Finally, we have considered various problems of static analysis of SAS. We proposed two different approaches to comparing the restrictiveness of SAS and analysed their computational implications. We have also studied the relationship between the comparison of SAS and the verification of information accessibility.

As future work, we plan to investigate the practicality of the proposed solutions by implementing a working system. We believe that in practical settings the proposed algorithms will behave efficiently. We would like to combine the Parikh and subword approximation methods to produce meaningful DTDs of acceptable sizes. We also intend to extend our approach to more expressive schema and query formalisms, for instance EDTDs and n-ary XPath queries. Our preliminary research in this direction is very promising.
References

1. Benedikt, M., Fundulaki, I.: XML subtree queries: Specification and composition. In: Bierman, G., Koch, C. (eds.) DBPL 2005. LNCS, vol. 3774, pp. 138–153. Springer, Heidelberg (2005)
2. Courcelle, B.: On constructing obstruction sets of words. Bulletin of the EATCS 44, 178–186 (1991)
3. Fan, W., Chan, C.-Y., Garofalakis, M.N.: Secure XML querying with security views. In: ACM SIGMOD International Conference on Management of Data, pp. 587–598 (2004)
4. Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: SMOQE: A system for providing secure access to XML. In: International Conference on Very Large Data Bases (VLDB), pp. 1227–1230. ACM, New York (2006)
5. Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Rewriting regular XPath queries on XML views. In: International Conference on Data Engineering (ICDE), pp. 666–675 (2007)
6. Fan, W., Yu, J.X., Li, J., Ding, B., Qin, L.: Query translation from XPath to SQL in the presence of recursive DTDs. VLDB Journal (to appear, 2009)
7. Goldstine, J.: A simplified proof of Parikh's theorem. Discrete Mathematics 19(3), 235–239 (1977)
8. Greibach, S.A.: A note on undecidable properties of formal languages. Mathematical Systems Theory 2(1), 1–6 (1968)
9. Kuper, G., Massacci, F., Rassadko, N.: Generalized XML security views. In: ACM Symposium on Access Control Models and Technologies (SACMAT), pp. 77–84. ACM, New York (2005)
10. Libkin, L., Sirangelo, C.: Reasoning about XML with temporal logics and automata. In: Cervesato, I., Veith, H., Voronkov, A. (eds.) LPAR 2008. LNCS, vol. 5330, pp. 97–112. Springer, Heidelberg (2008)
11. Martens, W., Neven, F., Schwentick, T.: Complexity of decision problems for simple regular expressions. In: Fiala, J., Koubek, V., Kratochvíl, J. (eds.) MFCS 2004. LNCS, vol. 3153, pp. 889–900. Springer, Heidelberg (2004)
12. Martens, W., Neven, F., Schwentick, T., Bex, G.J.: Expressiveness and complexity of XML Schema. ACM Transactions on Database Systems (TODS) 31(3), 770–813 (2006)
13. Marx, M.: XPath with conditional axis relations. In: Bertino, E., Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis, M., Böhm, K., Ferrari, E. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 477–494. Springer, Heidelberg (2004)
14. Neven, F., Schwentick, T.: XPath containment in the presence of disjunction, DTDs, and variables. In: Calvanese, D., Lenzerini, M., Motwani, R. (eds.) ICDT 2003. LNCS, vol. 2572, pp. 312–326. Springer, Heidelberg (2002)
15. Papakonstantinou, Y., Vianu, V.: DTD inference for views of XML data. In: ACM Symposium on Principles of Database Systems (PODS), pp. 35–46 (2000)
16. Parikh, R.J.: On context-free languages. Journal of the ACM 13(4), 570–581 (1966)
17. Rassadko, N.: Policy classes and query rewriting algorithm for XML security views. In: Damiani, E., Liu, P. (eds.) Data and Applications Security 2006. LNCS, vol. 4127, pp. 104–118. Springer, Heidelberg (2006)
18. Rassadko, N.: Query rewriting algorithm evaluation for XML security views. In: Jonker, W., Petković, M. (eds.) SDM 2007. LNCS, vol. 4721, pp. 64–80. Springer, Heidelberg (2007)
19. Stoica, A., Farkas, C.: Secure XML views. In: IFIP WG 11.3 International Conference on Data and Applications Security, pp. 133–146. Kluwer, Dordrecht (2002)
20. Szymanski, T.G., Williams, J.H.: Non-canonical parsing. In: 14th Annual Symposium on Foundations of Computer Science, pp. 122–129. IEEE, Los Alamitos (1973)
21. ten Cate, B., Segoufin, L.: XPath, transitive closure logic, and nested tree walking automata. In: ACM Symposium on Principles of Database Systems (PODS), pp. 251–260 (2008)
22. Vercammen, R., Hidders, J., Paredaens, J.: Query translation for XPath-based security views. In: Grust, T., Höpfner, H., Illarramendi, A., Jablonski, S., Mesiti, M., Müller, S., Patranjan, P.-L., Sattler, K.-U., Spiliopoulou, M., Wijsen, J. (eds.) EDBT 2006. LNCS, vol. 4254, pp. 250–263. Springer, Heidelberg (2006)
A Tractable Subclass of DTDs for XPath Satisfiability with Sibling Axes

Yasunori Ishihara¹, Takuji Morimoto¹, Shougo Shimizu², Kenji Hashimoto³, and Toru Fujiwara¹

¹ Osaka University, Japan. {ishihara,fujiwara}@ist.osaka-u.ac.jp
² Advanced Institute of Industrial Technology, Japan. [email protected]
³ Nara Institute of Science and Technology, Japan. [email protected]
Abstract. The paper presents a tractable subclass of DTDs, called DC-DTDs, for XPath satisfiability with sibling axes. A DC-DTD is a DTD in which each content model is a concatenation of single tag names and Kleene-starred regular expressions. DC-DTDs are a proper subclass of the covering DTDs proposed by Montazerian et al., and a proper superclass of disjunction-free DTDs. In this paper, it is first shown that the tractability afforded by covering DTDs is fragile against sibling axes. Then, tractability of XPath satisfiability with sibling axes under DC-DTDs is demonstrated. Finally, as a limitation of the tractability of DC-DTDs, it is shown that upward axes appearing in qualifiers bring intractability even under disjunction-free DTDs.
1 Introduction
XPath satisfiability is one of the major theoretical topics in the field of XML databases. XPath is a query language for XML documents, which are often regarded as unranked labeled ordered trees. An XPath expression specifies a pattern of (possibly branching) paths from the root of a given XML document. The answer to an XPath expression for an XML document T is the set of nodes v of T such that the specified path pattern matches the path from the root to v. A given XPath expression p is satisfiable under a given DTD (Document Type Definition) D if there is an XML document T conforming to D such that the answer to p for T is a nonempty set.

Research on XPath satisfiability is strongly motivated by query optimization. When (a part of) an XPath expression has been found unsatisfiable, we can always replace the expression with the empty set without evaluating it. Lakshmanan et al. investigated the complexity of the satisfiability problem for tree pattern queries, which contain a useful fragment of XPath [1]. Benedikt et al. studied the complexity of XPath satisfiability widely and systematically.

P. Gardner and F. Geerts (Eds.): DBPL 2009, LNCS 5708, pp. 68–83, 2009.
© Springer-Verlag Berlin Heidelberg 2009

In [2,3], they investigated how combinations of XPath components such as downward/upward axes, qualifiers, attribute-value equalities, and negations
affect the complexity of the satisfiability problem. The relation between satisfiability and the XPath containment problem [4] was also discussed. Then, in [3,5], they took sibling axes into account. Unfortunately, it was found that satisfiability is in P only for a very small subclass of XPath.

Genevès and Layaïda tackled the intractability of XPath satisfiability itself. Their approach is to translate XPath expressions to formulas in monadic second-order (MSO) logic [6] and in a variant of μ-calculus [7,8]. Regular tree grammars [9], which are a general model of XML schemas and a proper superclass of DTDs, are also translated to such formulas. Then, satisfiability is verified by fast decision procedures for MSO and μ-calculus formulas.

On the other hand, several studies have tried to constrain DTDs so that satisfiability becomes tractable. In [1], satisfiability under non-recursive DTDs was examined. In [2,3,5], non-recursive and disjunction-free DTDs were considered. However, non-recursiveness does not broaden the tractable class of XPath, and disjunction-freeness broadens the tractable class of XPath only when sibling axes are excluded. Montazerian et al. presented two subclasses of DTDs, called duplicate-free DTDs and covering DTDs, under which XPath satisfiability becomes tractable [10]. Intuitively, a DTD is duplicate-free if every tag name appears at most once in each content model. A DTD is covering if each content model allows a sequence consisting of all the tag names which appear in the content model. They also claimed that most real-world DTDs are duplicate-free or covering. However, the examined class of XPath was rather small. Especially, neither upward axes nor sibling axes were handled.

The purpose of this paper is to propose a subclass of DTDs under which XPath satisfiability with sibling axes is tractable.
Our XPath expressions consist of ↓ (child axis), ↓∗ (descendant-or-self axis), ↑ (parent axis), ↑∗ (ancestor-or-self axis), →+ (following-sibling axis), ←+ (preceding-sibling axis), ∪ (path union), and [ ] (qualifier). Our XPath is a proper superclass of that of [10] but a proper subclass of that of [2,3,5]. Unlike the formulation in [2,3,5], every axis is always combined with a node test. On the other hand, a wild card for a node test cannot be realized without path unions. See the definition in Section 2.3 for the details. Following the notation of [2,3,5], a subclass of XPath is indicated by X(·). For example, the subclass with child axes and qualifiers is denoted by X(↓, [ ]).

Our first result is intractability of satisfiability of XPath expressions with sibling axes under covering DTDs. Precisely, we show that satisfiability under covering DTDs is NP-complete for (1) X(χ↓, χ↔, [ ]) and (2) X(χ↓, χ↑, χ↔, ∪), where χ↓ ∈ {↓, ↓∗}, χ↑ ∈ {↑, ↑∗}, and χ↔ ∈ {→+, ←+}. Note that satisfiability of X(↓, ↓∗, ∪, [ ]) is tractable under covering DTDs [10]. Thus, the tractability afforded by covering DTDs is very fragile against sibling axes.

Next, we propose a subclass of DTDs, called disjunction-capsuled DTDs (DC-DTDs for short). In a DC-DTD, every content model is a regular expression in the form of a concatenation of single tag names and Kleene-starred expressions. In other words, all disjunction operators are "capsuled" by Kleene star operators. For example, (a|bc)∗d(ab∗)∗ is disjunction-capsuled, but a∗|(bc)∗ is not. DC-DTDs are a proper subclass of covering DTDs and a proper superclass
Table 1. Results of this paper. Each row is a fragment of XPath, determined by which of ↓, ↓∗, ↑, ↑∗, →+, ←+, ∪, and [ ] it allows; each column is a class of DTDs. Reading the entries column by column:
– any DTDs: NPC [3], NPC [3], NPC [3], NPC [3], NPC [3], NPC [3], NPC [3], NPC [3], NPC [5]
– covering DTDs: P [10], NPC (§3), NPC (§3), NPC, NPC, NPC, NPC, NPC, NPC
– DC-DTDs: P, P, P, P (§4), P (§4), NPC, NPC, NPC, NPC
– disjunction-free DTDs: P [3], P, P, P, P, NPC [3], NPC (§5), NPC (§5), NPC [5]
of disjunction-free DTDs. Then, we show that satisfiability under DC-DTDs is tractable for (3) X(↓, ↓∗, →+, ←+, ∪, [ ]) and (4) X(↓, ↓∗, ↑, ↑∗, →+, ←+, ∪), which are proper superclasses of (1) and (2) above, respectively. For a given XPath expression in (3) or (4), satisfiability of each subexpression can be analyzed independently. Moreover, in order to handle sibling axes, it is enough to maintain the positions of subexpressions with respect to the top-level concatenation. This fact will be stated formally using the concept of schema graphs.

Finally, satisfiability for (5) X(↓∗, ↑∗, [ ]) and (6) X(χ↓, χ↑, →+, ←+, [ ]), where χ↓ ∈ {↓, ↓∗} and χ↑ ∈ {↑, ↑∗}, is shown to be NP-complete even under disjunction-free DTDs. Essentially, the intractability stems from upward axes appearing in qualifiers, which make it possible to traverse a path many times. By encoding a truth assignment as a path in a tree, the 3SAT problem can be reduced to the satisfiability of these classes of XPath.

Table 1 summarizes the results of this paper. A "+" in a multicolumn means "one of them." For example, the second row of the table represents the class (1) above, i.e., the class with one of ↓ and ↓∗, one of →+ and ←+, and [ ]. NPC stands for NP-complete. Bold letters indicate the contributions of this paper.

The rest of this paper is organized as follows. In Section 2 several definitions to formalize the XPath satisfiability problem are provided. Intractability of XPath satisfiability with sibling axes under covering DTDs is shown in Section 3. Next, tractability results on DC-DTDs are presented in Section 4. Then, in Section 5, it is shown that upward axes appearing in qualifiers bring intractability even under disjunction-free DTDs. Section 6 summarizes the paper.
2 Definitions

2.1 XML Documents
An XML document is represented by an unranked labeled ordered tree. The label of a node v, denoted λ(v), corresponds to a tag name. We extend λ to a function on sequences, i.e., for a sequence v1 · · · vn of nodes, let λ(v1 · · · vn ) = λ(v1 ) · · · λ(vn ). A tree is sometimes denoted by a term, e.g., a(b()c()) denotes a
tree consisting of three nodes: the root has label a, and its left and right children have labels b and c, respectively. Attributes are not handled in this paper.
2.2 DTDs
A regular expression over an alphabet Σ consists of the constant ε (the empty sequence) and the symbols in Σ, combined with the operators · (concatenation), ∗ (repetition), and | (disjunction). We exclude ∅ (the empty set) because we are interested only in nonempty regular languages. The concatenation operator is often omitted, as usual.

A regular expression is disjunction-capsuled if it is in the form of e1 e2 · · · en (n ≥ 1), where each ei (1 ≤ i ≤ n) is either a symbol in Σ or in the form of (e′i)∗ for some regular expression e′i. For example, (a|bc)∗d(ab∗)∗ is disjunction-capsuled, but a∗|(bc)∗ is not. Note that ε is disjunction-capsuled because ε can be written as ε∗. The length of a disjunction-capsuled regular expression e = e1 e2 · · · en, where each ei (1 ≤ i ≤ n) is either a symbol in Σ or in the form of (e′i)∗, is defined as n and denoted by len(e). Moreover, each i (1 ≤ i ≤ len(e)) is called a position, and each ei is called the i-th subexpression of e.

A regular expression e is disjunction-free if e contains no disjunction operators. A regular expression e over an alphabet Σ is covering if e matches a string consisting of all the symbols in Γe, where Γe ⊆ Σ is the set of symbols appearing in e.

A disjunction-free regular expression is vacuously disjunction-capsuled. Moreover, a disjunction-capsuled regular expression is always covering. To see this, regard a regular expression as a string generator. Since disjunctions occur only inside the scope of Kleene stars, we can generate a string by repeating the Kleene-starred subexpressions so that all the operands of the disjunctions appear simultaneously in the string.

Definition 1. A DTD is a 3-tuple D = (Σ, r, P), where:
– Σ is a finite set of labels,
– r ∈ Σ is the root label, and
– P is a mapping from Σ to the set of regular expressions over Σ. P(a) is called the content model of label a.

A disjunction-free DTD is a DTD such that P(a) is disjunction-free for every a ∈ Σ.
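The disjunction-capsuled condition is a purely syntactic check on the top level of the expression: scanning left to right, every factor must be a single symbol (optionally starred) or a parenthesized group immediately followed by ∗. A sketch for a plain ASCII regex syntax (our own checker, following the definition literally, so an unstarred parenthesized group is rejected):

```python
def is_disjunction_capsuled(e):
    # e uses alphanumeric symbols, '(', ')', '|' and '*'.  Accept iff e is a
    # concatenation e1 ... en where each ei is a symbol, a starred symbol,
    # or a starred parenthesized group; a top-level '|' is rejected.
    i, n = 0, len(e)
    while i < n:
        if e[i] == '(':
            depth, i = 1, i + 1
            while i < n and depth > 0:          # skip to the matching ')'
                depth += {'(': 1, ')': -1}.get(e[i], 0)
                i += 1
            if depth > 0 or i >= n or e[i] != '*':
                return False                    # unclosed or unstarred group
            i += 1
        elif e[i].isalnum():                    # bare symbol, optionally starred
            i += 1
            if i < n and e[i] == '*':
                i += 1
        else:
            return False                        # top-level '|' (or stray token)
    return True
```

On the running examples, `is_disjunction_capsuled("(a|bc)*d(ab*)*")` holds, `is_disjunction_capsuled("a*|(bc)*")` does not, and the empty expression (representing ε) is accepted.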
A disjunction-capsuled DTD, a DC-DTD for short, is a DTD such that P(a) is disjunction-capsuled for every a ∈ Σ. A covering DTD is a DTD such that P(a) is covering (i.e., P(a) matches a string consisting of all the symbols in ΓP(a)) for every a ∈ Σ.

Example 1. Let Σ = {contact, name, officeMail, homeMail}, and suppose that P(a) = ε for each a ∈ Σ − {contact}. If P(contact) = name · (officeMail | homeMail)∗, then the DTD (Σ, contact, P) is disjunction-capsuled. On the other hand, if P(contact) = name · (officeMail · homeMail | homeMail · officeMail), then the DTD is not disjunction-capsuled although it is covering. If
P(contact) = name · (officeMail | homeMail), then it is not even covering because P(contact) cannot match a string containing officeMail and homeMail simultaneously.

Definition 2. A tree T conforms to a DTD D = (Σ, r, P) if
– the label of the root of T is r, and
– for each node v of T with children sequence v1 · · · vn, P(λ(v)) matches λ(v1 · · · vn).

Let TL(D) denote the set of all the trees conforming to D.

The size of a regular expression is the number of constants and operators appearing in it. The size of a DTD is the sum of the sizes of all its content models.
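Definition 2 can be checked directly with a regular-expression engine: the children labels of each node, read left to right, must match the node's content model. A sketch (our own encoding: each label is abbreviated to one letter so that a children sequence becomes a string, and Python's re serves as the matcher):

```python
import re

def conforms(tree, dtd):
    # tree = (label, children); dtd = (root_label, P) with P mapping a
    # one-letter label to a Python regex over one-letter labels.
    root, P = dtd
    def ok(node):
        label, children = node
        word = "".join(c[0] for c in children)
        return (re.fullmatch(P[label], word) is not None
                and all(ok(c) for c in children))
    return tree[0] == root and ok(tree)

# Example 1, first DTD: c=contact, n=name, o=officeMail, h=homeMail;
# P(contact) = name · (officeMail | homeMail)*, leaves have content model ε.
P = {"c": "n(o|h)*", "n": "", "o": "", "h": ""}
doc = ("c", [("n", []), ("o", []), ("h", [])])
```

Here `conforms(doc, ("c", P))` holds, while a document whose contact lacks the leading name child does not.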
2.3 XPath Expressions
The syntax of an XPath expression p is defined as follows:

p ::= χ :: l | p/p | p ∪ p | p[q],
χ ::= ↓ | ↑ | ↓∗ | ↑∗ | →+ | ←+,
q ::= p | q ∧ q | q ∨ q,

where l ∈ Σ. Each χ ∈ {↓, ↑, ↓∗, ↑∗, →+, ←+} is called an axis. Also, a subexpression in the form of [q] is called a qualifier. The size of an XPath expression p is defined as the number of subexpressions in the form of χ :: l in p.

The semantics of an XPath expression over a tree T is defined as follows, where p and q are regarded as binary and unary predicates on nodes, respectively:

– T |= (↓:: l)(v, v′) if v′ is a child of v and λ(v′) = l.
– T |= (↑:: l)(v, v′) if v′ is the parent of v and λ(v′) = l.
– T |= (↓∗ :: l)(v, v′) if v′ is v or a descendant of v, and λ(v′) = l.
– T |= (↑∗ :: l)(v, v′) if v′ is v or an ancestor of v, and λ(v′) = l.
– T |= (→+ :: l)(v, v′) if v′ is a following sibling of v and λ(v′) = l.
– T |= (←+ :: l)(v, v′) if v′ is a preceding sibling of v and λ(v′) = l.
– T |= (p/p′)(v, v′) if there is v′′ such that T |= p(v, v′′) and T |= p′(v′′, v′).
– T |= (p ∪ p′)(v, v′) if T |= p(v, v′) or T |= p′(v, v′).
– T |= (p[q])(v, v′) if T |= p(v, v′) and T |= q(v′).
– T |= p(v) if there is v′ such that T |= p(v, v′).
– T |= (q ∧ q′)(v) if T |= q(v) and T |= q′(v).
– T |= (q ∨ q′)(v) if T |= q(v) or T |= q′(v).
A tree T satisfies an XPath expression p if there is a node v such that T |= p(v0 , v), where v0 is the root node of T . An XPath expression p is satisfiable under a DTD D if some T ∈ TL(D) satisfies p. Following the notation of [2,3,5], a subclass of XPath is indicated by X (·). For example, the subclass with child axes and qualifiers is denoted by X (↓, [ ]).
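The semantics above can be turned into a small node-set evaluator: a location step χ::l maps a set of nodes to the set of χ-reachable nodes labeled l, p/p′ is composition, ∪ is list union, and a qualifier is a nonemptiness test. A sketch (classes and axis names are our own, not from the paper):

```python
class Node:
    def __init__(self, label, children=()):
        self.label, self.parent, self.children = label, None, list(children)
        for c in self.children:
            c.parent = self

def axis(n, name):
    # Nodes reachable from n via the named axis.
    if name == "child":
        return list(n.children)
    if name == "parent":
        return [n.parent] if n.parent else []
    if name == "desc-or-self":
        return [n] + [d for c in n.children for d in axis(c, "desc-or-self")]
    if name == "anc-or-self":
        return [n] + (axis(n.parent, "anc-or-self") if n.parent else [])
    sibs = n.parent.children if n.parent else []
    i = sibs.index(n)
    return sibs[i + 1:] if name == "foll-sibling" else sibs[:i]

def step(nodes, ax, label):
    # χ::l applied to a node set; a qualifier [q] is just `step(...) != []`.
    return [m for n in nodes for m in axis(n, ax) if m.label == label]

t = Node("r", [Node("a"), Node("a"), Node("b")])  # the tree r(a()a()b())
```

For instance, `step(step([t], "child", "a"), "foll-sibling", "b")` is nonempty, so ↓::a/→+::b is satisfied by t when evaluated from the root.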
Our definition of XPath basically follows that of [2,3,5]. However, there are some differences. In [2,3,5], every axis is implicitly combined with a wild card (often denoted by ∗) for a node test; to specify a condition on the label, a qualifier and a function lab() must be used. Exceptionally, only the child axis can be combined with a node test. On the other hand, we prefer symmetry among axes. Therefore, in our definition, every axis is always combined with a node test, without using qualifiers, and the function lab() is excluded.

As another point, a wild card for a node test is excluded in order to clarify the effect of path unions, because it can be regarded as a restricted use of path unions. Actually, by replacing every occurrence of χ :: ∗ with (χ :: a) ∪ (χ :: b) ∪ · · · ∪ (χ :: z), where Σ = {a, b, . . . , z}, we can obtain an equivalent XPath expression without wild cards. The size of the obtained expression is at most |Σ| times larger than that of the original expression. Usually |Σ| is less than the size of the DTD, and therefore this translation can be done in polynomial time in the sizes of the XPath expression and the DTD.
3
Intractability Results on Covering DTDs
In this section, we provide two theorems demonstrating that the tractability given by covering DTDs is fragile in the presence of sibling axes.

Theorem 1. Satisfiability of X (χ↓, χ↔, [ ]), where χ↓ ∈ {↓, ↓∗} and χ↔ ∈ {→+, ←+}, under covering DTDs is NP-complete.

Proof. Membership in NP follows immediately from [3]. In what follows, we show the NP-hardness. First, we show the case of X (↓∗, →+, [ ]). Let φ be an instance of 3SAT, where φ = (L1,1 ∨ L1,2 ∨ L1,3) ∧ · · · ∧ (Ln,1 ∨ Ln,2 ∨ Ln,3) and Li,j ∈ {x1, . . . , xm, x̄1, . . . , x̄m}. Then, define D = (Σ, r, P) as follows:
– Σ = {r, x0, x1, . . . , xm, t, f},
– P(r) = x0, P(xk) = (t xk+1 f | f xk+1 t) for 0 ≤ k ≤ m − 1, P(xm) = ε.
For every tree T in TL(D) and every variable xk (1 ≤ k ≤ m), T has exactly one node vk whose label is xk. We can interpret a given tree T in TL(D) as a truth assignment as follows: the value of xk is true if the right sibling node of vk has label t, and false if it has label f. Thus, the unique path from v1 to vm in T, along with their right siblings, represents a truth assignment. Now, define an XPath expression p as follows:

p = ↓∗:: x0 [(q1,1 ∨ q1,2 ∨ q1,3) ∧ · · · ∧ (qn,1 ∨ qn,2 ∨ qn,3)],

where
qi,j = ↓∗:: xk / →+:: t   if Li,j = xk,
qi,j = ↓∗:: xk / →+:: f   if Li,j = x̄k.
Y. Ishihara et al.
Let T ∈ TL(D). Whether the qualifier part is satisfied by T depends only on the unique path from v1 to vm in T, which corresponds to a truth assignment. It is not difficult to see that φ is satisfiable if and only if p is satisfied by a tree conforming to D.
For the case of X (↓, →+, [ ]), we can prove the theorem in the same way by replacing ↓∗:: x0 with ↓:: x0, and ↓∗:: xk with ↓:: x1 / · · · / ↓:: xk for each 1 ≤ k ≤ m. Moreover, →+ and ←+ are completely symmetric. Hence, all four cases have been proved.

Theorem 2. Satisfiability of X (χ↓, χ↑, χ↔, ∪), where χ↓ ∈ {↓, ↓∗}, χ↑ ∈ {↑, ↑∗}, and χ↔ ∈ {→+, ←+}, under covering DTDs is NP-complete.

Proof. The proof is the same as that of Theorem 1, except that upward axes are used instead of qualifiers. For the case of X (↓∗, ↑∗, →+, ∪), an XPath expression p is defined as follows:

p = ↓∗:: x0 /(p1,1 ∪ p1,2 ∪ p1,3)/ · · · /(pn,1 ∪ pn,2 ∪ pn,3),

where
pi,j = ↓∗:: xk / →+:: t / ↑∗:: x0   if Li,j = xk,
pi,j = ↓∗:: xk / →+:: f / ↑∗:: x0   if Li,j = x̄k.
Again, either the ↑ or the ↑∗ axis is enough to show NP-hardness.
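The reduction in the proof of Theorem 1 can be built mechanically from a 3SAT instance. A small sketch (the string encodings, including 'eps' for ε and the standard XPath axis names, are our own choices, not from the paper):

```python
def reduction_theorem1(clauses, m):
    """Build the DTD productions P and XPath expression p of Theorem 1.

    clauses: list of 3-tuples of literals; k > 0 encodes x_k, k < 0 encodes
    its negation.  m: number of variables.
    """
    P = {"r": "x0"}
    for k in range(m):
        P[f"x{k}"] = f"(t x{k + 1} f | f x{k + 1} t)"
    P[f"x{m}"] = "eps"

    def q(lit):
        # a positive literal tests for right sibling t, a negative one for f
        k, sibling = abs(lit), "t" if lit > 0 else "f"
        return f"descendant-or-self::x{k}/following-sibling::{sibling}"

    qualifier = " and ".join(
        "(" + " or ".join(q(lit) for lit in clause) + ")" for clause in clauses
    )
    return P, f"descendant-or-self::x0[{qualifier}]"

# phi = (x1 or not x2 or x3): satisfiable, so p is satisfiable under D
P, p = reduction_theorem1([(1, -2, 3)], m=3)
print(p)
```

Each tree conforming to the generated DTD encodes one truth assignment, and the qualifier of p checks every clause against it, as in the proof.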
4
Tractability Results on DC-DTDs
In this section, we consider the larger classes of XPath discussed in the previous section, and show that under DC-DTDs satisfiability for these classes is tractable. Before explaining our results, we introduce the schema graph of a given DC-DTD, which represents the parent–child relationship as well as the possible positions of the children specified by the DC-DTD. We also define satisfaction relations between schema graphs and XPath expressions. The satisfaction relations coincide with satisfiability under DC-DTDs and are decidable in polynomial time.
4.1
Schema Graphs
Definition 3. The schema graph G = (U, E) of a DC-DTD D = (Σ, r, P) is a directed graph defined as follows:
– A node u ∈ U is either
• (⊥, 1, −, r), where ⊥ is a new symbol not in Σ, or
• (a, i, ω, b), where a, b ∈ Σ, 1 ≤ i ≤ len(P(a)) such that b appears in the i-th subexpression ei of P(a), and ω = “−” if ei is a single symbol in Σ and ω = “∗” otherwise.
The first, second, third, and fourth components of u are denoted by λpar(u), pos(u), ω(u), and λ(u), respectively. In particular, λ(u) is called the label of u. λpar, pos, and λ are extended to functions on sequences.
A Tractable Subclass of DTDs for XPath Satisfiability with Sibling Axes
75
– An edge from u to u′ exists in E if and only if λ(u) = λpar(u′).

Example 2. Let D = ({r, a, b, c}, r, P) be a DC-DTD, where
P(r) = (a|b)∗ c a∗,  P(a) = ε,  P(b) = r∗,  P(c) = ε.
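Conformance of a tree to a DTD (membership T ∈ TL(D)) just requires each node's children-label sequence to match that node's content model. A minimal sketch, checking a tree shaped like the one in Fig. 2 against the DC-DTD of Example 2 (the encodings of trees as (label, children) pairs and of content models as ordinary regular expressions over one-character labels are our own, chosen for brevity):

```python
import re

def conforms(tree, P):
    """Check membership in TL(D): every node's children-label sequence
    must match the node's content model, recursively."""
    label, children = tree
    if not re.fullmatch(P[label], "".join(c[0] for c in children)):
        return False
    return all(conforms(c, P) for c in children)

# The DC-DTD of Example 2
P = {"r": "(a|b)*ca*", "a": "", "b": "r*", "c": ""}
# A tree shaped like T of Fig. 2: children of the root are a b a c a a
T = ("r", [("a", []),
           ("b", [("r", [("a", []), ("c", [])])]),
           ("a", []), ("c", []), ("a", []), ("a", [])])
print(conforms(T, P))  # True
```

Encoding the disjunction-capsuled content models directly as regular expressions works here because every DC-DTD content model is a concatenation of single symbols and Kleene-starred disjunctions.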
The schema graph of D is shown in Fig. 1. If T ∈ TL(D), then each node of T can be associated with a node of the schema graph of D. More precisely, there exists a mapping θ, called an SG mapping of T, from the set of nodes of T to the set of nodes of the schema graph of D with the following properties:
– θ maps the root node of T to (⊥, 1, −, r).
– Let v be a node of T and v1 · · · vn be the children sequence of v. Then, θ(vj) = (λ(v), ij, ωij, λ(vj)), where 1 ≤ ij ≤ len(P(λ(v))), ωij = “−” if the ij-th subexpression of P(λ(v)) is a single symbol in Σ and ωij = “∗” otherwise, and ij ≤ ij′ if j ≤ j′. Moreover, for every maximum subsequence vj · · · vj′ such that ij = · · · = ij′, the ij-th subexpression of P(λ(v)) matches λ(vj · · · vj′).

[Fig. 1. A schema graph G, with nodes (⊥, 1, −, r), (r, 1, ∗, a), (r, 1, ∗, b), (r, 2, −, c), (r, 3, ∗, a), and (b, 1, ∗, r)]

[Fig. 2. A tree T and its SG mapping θ: θ(v0) = (⊥, 1, −, r); θ(v1) = θ(v3) = θ(v8) = (r, 1, ∗, a); θ(v2) = (r, 1, ∗, b); θ(v4) = θ(v9) = (r, 2, −, c); θ(v5) = θ(v6) = (r, 3, ∗, a); θ(v7) = (b, 1, ∗, r)]

Example 3. Consider the DC-DTD D defined in Example 2, and a tree T ∈ TL(D) shown in Fig. 2. In this case, there is a unique SG mapping θ of T, which is also shown in Fig. 2.
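Definition 3 translates directly into code. The following sketch builds the schema graph of Example 2; the encoding of content models as lists of ('sym', a) / ('star', {…}) subexpressions is our own, and '#' stands in for the fresh symbol ⊥:

```python
def schema_graph(P, root="r"):
    """Build the schema graph (U, E) of Definition 3 for a DC-DTD."""
    nodes = {("#", 1, "-", root)}                    # the node (⊥, 1, −, r)
    for a, model in P.items():
        for i, (kind, labels) in enumerate(model, start=1):
            omega = "-" if kind == "sym" else "*"
            for b in (labels if kind == "star" else [labels]):
                nodes.add((a, i, omega, b))          # (λpar, pos, ω, λ)
    # an edge u -> u' exists iff λ(u) = λpar(u')
    edges = {(u, v) for u in nodes for v in nodes if u[3] == v[0]}
    return nodes, edges

# P(r) = (a|b)*ca*, P(a) = ε, P(b) = r*, P(c) = ε  (Example 2)
P = {"r": [("star", {"a", "b"}), ("sym", "c"), ("star", {"a"})],
     "a": [], "b": [("star", {"r"})], "c": []}
nodes, edges = schema_graph(P)
print(len(nodes))  # 6 nodes, matching Fig. 1
```

The graph has six nodes, exactly those listed for Fig. 1, and its size is clearly polynomial in the size of the DTD.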
4.2
XPath Expressions without Upward Axes
In this section, satisfiability of XPath expressions without ↑ and ↑∗ under DC-DTDs is shown to be tractable. To accomplish this, we define the semantics of XPath expressions over schema graphs. Then, we show that an XPath expression p is satisfiable under a DC-DTD D if and only if p is satisfied by the schema graph of D.

Definition 4. A satisfaction relation |=↑ between a schema graph G and an XPath expression p ∈ X (↓, ↓∗, →+, ←+, ∪, [ ]) is defined as follows:
– G |=↑ (↓:: l)(u, u′) if there is an edge from u to u′ in G and λ(u′) = l.
– G |=↑ (↓∗:: l)(u, u′) if u′ is reachable from u in G and λ(u′) = l.
– G |=↑ (→+:: l)(u, u′) if λpar(u) = λpar(u′), λ(u′) = l, and pos(u) < pos(u′) if ω(u) = “−” and pos(u) ≤ pos(u′) if ω(u) = “∗”.
– G |=↑ (←+:: l)(u, u′) if λpar(u) = λpar(u′), λ(u′) = l, and pos(u) > pos(u′) if ω(u) = “−” and pos(u) ≥ pos(u′) if ω(u) = “∗”.
– G |=↑ (p/p′)(u, u′) if there is u″ such that G |=↑ p(u, u″) and G |=↑ p′(u″, u′).
– G |=↑ (p ∪ p′)(u, u′) if G |=↑ p(u, u′) or G |=↑ p′(u, u′).
– G |=↑ (p[q])(u, u′) if G |=↑ p(u, u′) and G |=↑ q(u′).
– G |=↑ p(u) if there is u′ such that G |=↑ p(u, u′).
– G |=↑ (q ∧ q′)(u) if G |=↑ q(u) and G |=↑ q′(u).
– G |=↑ (q ∨ q′)(u) if G |=↑ q(u) or G |=↑ q′(u).

The following theorem is almost obvious, because G captures all the possible topologies of the trees conforming to D. Formally, it can be proved by induction on the structure of p.

Theorem 3. Let p ∈ X (↓, ↓∗, →+, ←+, ∪, [ ]) and let D be a DC-DTD. If T |= p(v, v′) for some T ∈ TL(D) with an SG mapping θ, then G |=↑ p(θ(v), θ(v′)).

The converse of Theorem 3 is also shown by induction on the structure of p. The next lemma corresponds to the basis step and is easily proved.

Lemma 1. Let G be the schema graph of a DC-DTD D. Suppose that G |=↑ (χ:: l)(u, u′). Then, there are T in TL(D), its SG mapping θ, and its nodes v and v′ such that θ(v) = u, θ(v′) = u′, and T |= (χ:: l)(v, v′).
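The relation of Definition 4 can be evaluated bottom-up over the schema graph, which is what makes the satisfiability test polynomial. A sketch (the AST encoding of expressions is our own; for brevity, a qualifier is represented by a single path rather than a ∧/∨ combination, which would be handled analogously):

```python
def eval_path(p, nodes, edges):
    """Return the set of pairs (u, u') of schema-graph nodes satisfying p.

    p is an AST: ('step', axis, l) with axis in {'child', 'desc', 'fsib',
    'psib'}, ('seq', p1, p2), ('union', p1, p2), ('qual', p1, q) where the
    qualifier q is itself a path.  Nodes are (λpar, pos, ω, λ) tuples.
    """
    succ = {u: {v for (x, v) in edges if x == u} for u in nodes}

    def reach(u):                        # u itself plus everything reachable
        seen, todo = {u}, [u]
        while todo:
            for y in succ[todo.pop()]:
                if y not in seen:
                    seen.add(y)
                    todo.append(y)
        return seen

    def sib_ok(u, v, forward):
        if u[0] != v[0]:                 # siblings must share λpar
            return False
        lt = u[1] < v[1] if u[2] == "-" else u[1] <= v[1]
        gt = u[1] > v[1] if u[2] == "-" else u[1] >= v[1]
        return lt if forward else gt

    kind = p[0]
    if kind == "step":
        _, axis, l = p
        return {(u, v) for u in nodes for v in nodes if v[3] == l and (
            (axis == "child" and (u, v) in edges) or
            (axis == "desc" and v in reach(u)) or
            (axis == "fsib" and sib_ok(u, v, True)) or
            (axis == "psib" and sib_ok(u, v, False)))}
    if kind == "seq":
        a, b = eval_path(p[1], nodes, edges), eval_path(p[2], nodes, edges)
        return {(u, w) for (u, v) in a for (v2, w) in b if v == v2}
    if kind == "union":
        return eval_path(p[1], nodes, edges) | eval_path(p[2], nodes, edges)
    if kind == "qual":
        sat = {u for (u, _) in eval_path(p[2], nodes, edges)}
        return {(u, v) for (u, v) in eval_path(p[1], nodes, edges) if v in sat}
    raise ValueError(p)

# Schema graph of Fig. 1; is ↓∗::a[→+::c] satisfiable from the root?
N = {("#", 1, "-", "r"), ("r", 1, "*", "a"), ("r", 1, "*", "b"),
     ("r", 2, "-", "c"), ("r", 3, "*", "a"), ("b", 1, "*", "r")}
E = {(u, v) for u in N for v in N if u[3] == v[0]}
pairs = eval_path(("qual", ("step", "desc", "a"), ("step", "fsib", "c")), N, E)
print(any(u == ("#", 1, "-", "r") for (u, _) in pairs))  # True
```

Only the node (r, 1, ∗, a) has a following sibling c in the graph, so the qualifier singles it out; since it is reachable from the root, the expression is satisfiable.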
To show the induction step, we claim that satisfiability of subexpressions of p can be analyzed independently, as long as the label and the position of each node are correctly maintained. Roughly speaking, p1/p2 is always satisfiable if p1 and p2 are satisfiable. The following lemma states this claim precisely.

Lemma 2. Let D be a DC-DTD. Let T1 and T2 be in TL(D) with SG mappings θ1 and θ2, respectively. Suppose that T1 |= p1(v1, v1′), T2 |= p2(v2, v2′), and θ1(v1′) = θ2(v2). Then, there are T ∈ TL(D), its SG mapping θ, and its nodes v and v′ such that θ(v) = θ1(v1), θ(v′) = θ2(v2′), and T |= (p1/p2)(v, v′).

Proof. Since p does not contain any upward axes, whether T2 satisfies p2 depends only on the subtree of T2 whose root is the parent of v2. Let us call this subtree T2′. The tree T can be constructed by “merging” T2′ into the subtree T1′ of T1 whose root is the parent of v1′, without destroying the characteristics of the trees specified by p1 and p2.
Define a block of a tree (with respect to a given SG mapping) as a maximum subsequence of the children of the root node whose positions are the same. For example, the blocks of the tree T in Fig. 2 are v1v2v3, v4, and v5v6. Roughly speaking, merging trees is accomplished by interleaving their blocks. Let T1′ = a(B1 · · · Bn) and T2′ = a(C1 · · · Cn), where each of the Bi's and Ci's is a subtree sequence whose roots form a block. Suppose that v1′ and v2 are in the k-th blocks. Then, the merged subtree merge0(T1′, T2′) is

merge0(T1′, T2′) = a(merge(B1, C1) · · · merge(Bk−1, Ck−1) merge1(Bk, Ck) merge(Bk+1, Ck+1) · · · merge(Bn, Cn)).

If the i-th blocks correspond to a Kleene-starred expression, then merge(Bi, Ci) is defined as Bi Ci. Otherwise, Bi and Ci must be of the form a′(B1′ · · · Bm′) and a′(C1′ · · · Cm′), and merge(Bi, Ci) is recursively defined as a′(merge(B1′, C1′) · · · merge(Bm′, Cm′)).
Merging the k-th blocks needs special treatment. If the k-th blocks correspond to a single symbol in Σ, then just define merge1(Bk, Ck) as merge(Bk, Ck). Otherwise, define merge1(Bk, Ck) as Ck Bk′ Ck, where Bk′ is obtained from Bk by applying merge to the subtree with root v1′ and the subtree in Ck with root v2. Hence, v2 appears three times in merge1(Bk, Ck): once in Bk′, where v2 is merged with v1′, and twice in the copies of Ck. We need copies of Ck at both sides of Bk′ so that both sibling axes →+ and ←+ work correctly at Bk′. For example, let p1 = ↓:: a[↓:: c] and p2 = ←+:: b. Consider a DC-DTD D with P(r) = (a|b)∗, P(a) = c∗, and P(b) = P(c) = ε. Then, T1 = r(a(c())) ∈ TL(D) satisfies p1 at the root and its unique child, and T2 = r(b()a()) ∈ TL(D) satisfies p2 at the node with label a. However, if merge1(Bk, Ck) were defined as Bk′ Ck, then T would be r(a(c())b()a()) and would not satisfy p1/p2.
Let T be the tree obtained by merging T2′ into T1′ as above. Then, T conforms to D. Moreover, T1 and T2′ are subgraphs of T, where v1′ and v2 correspond to the same node in Bk′ of T. Since the XPath expressions p1 and p2 do not include negation, both expressions are satisfied by T.
By Lemmas 1 and 2, together with similar lemmas for the other operators appearing in p, we have the following theorem:

Theorem 4. Let p ∈ X (↓, ↓∗, →+, ←+, ∪, [ ]) and let D be a DC-DTD. If G |=↑ p(u, u′), then there are T ∈ TL(D), its SG mapping θ, and its nodes v and v′ such that θ(v) = u, θ(v′) = u′, and T |= p(v, v′).

Thus, satisfiability of p ∈ X (↓, ↓∗, →+, ←+, ∪, [ ]) under a DC-DTD D is equivalent to whether G |=↑ p((⊥, 1, −, r), u′) for some u′. Note that deciding whether G |=↑ p((⊥, 1, −, r), u′) for some u′ is in P. Indeed, for each subexpression χ:: l of p, compute the set of all pairs (u, u′) such that G |=↑ (χ:: l)(u, u′). Then, according to the structure of p, the set of all pairs (u, u′) satisfying each subexpression of p is computed in a bottom-up manner. Since the number of nodes of G is polynomial in the size of D, this algorithm runs in time polynomial in the sizes of D and p.
4.3
XPath Expressions without Qualifiers
In this section, we show that satisfiability of X (↓, ↓∗, ↑, ↑∗, →+, ←+, ∪) under DC-DTDs is in P. In this case, again, the satisfiability can be analyzed by using the schema graph. However, satisfiability of subexpressions cannot be analyzed independently because of the upward axes. Upward axes traverse exactly the same path that was traversed before by downward axes. Thus, the path from the root to the current node must be maintained during satisfiability analysis. Following this intuition, we define a satisfaction relation between schema graphs and this class of XPath expressions. An XPath expression will be regarded as a binary predicate on paths from (⊥, 1, −, r) in G.

Definition 5. A satisfaction relation |=[] between a schema graph G and an XPath expression p ∈ X (↓, ↓∗, ↑, ↑∗, →+, ←+, ∪) is defined as follows. Let s be a nonempty sequence of nodes of G starting with (⊥, 1, −, r), and s′ be a possibly empty sequence of nodes of G.
– G |=[] (↓:: l)(s, su′) if the path su′ exists in G and λ(u′) = l.
– G |=[] (↑:: l)(su, s) if the path su exists in G and the label of the last node of s is l.
– G |=[] (↓∗:: l)(s, ss′) if the path ss′ exists in G and the label of the last node of ss′ is l.
– G |=[] (↑∗:: l)(ss′, s) if the path ss′ exists in G and the label of the last node of s is l.
– G |=[] (→+:: l)(su, su′) if the paths su and su′ exist in G, λ(u′) = l, and pos(u) < pos(u′) if ω(u) = “−” and pos(u) ≤ pos(u′) if ω(u) = “∗”.
– G |=[] (←+:: l)(su, su′) if the paths su and su′ exist in G, λ(u′) = l, and pos(u) > pos(u′) if ω(u) = “−” and pos(u) ≥ pos(u′) if ω(u) = “∗”.
– G |=[] (p/p′)(s, s′) if there is s″ such that G |=[] p(s, s″) and G |=[] p′(s″, s′).
– G |=[] (p ∪ p′)(s, s′) if G |=[] p(s, s′) or G |=[] p′(s, s′).

To easily show the correspondence between G |=[] p and T |= p, we redefine the semantics of XPath so that p is a predicate on paths of T. Let v0 be the root of T ∈ TL(D).
Let w be a nonempty sequence of nodes of T starting with v0, and w′ be a possibly empty sequence of nodes of T.
– T |= (↓:: l)(w, wv′) if the path wv′ exists in T and λ(v′) = l.
– T |= (↑:: l)(wv, w) if the path wv exists in T and the label of the last node of w is l.
– T |= (↓∗:: l)(w, ww′) if the path ww′ exists in T and the label of the last node of ww′ is l.
– T |= (↑∗:: l)(ww′, w) if the path ww′ exists in T and the label of the last node of w is l.
– T |= (→+:: l)(wv, wv′) if the paths wv and wv′ exist in T, v′ is a following sibling of v, and λ(v′) = l.
– T |= (←+:: l)(wv, wv′) if the paths wv and wv′ exist in T, v′ is a preceding sibling of v, and λ(v′) = l.
– T |= (p/p′)(w, w′) if there is w″ such that T |= p(w, w″) and T |= p′(w″, w′).
– T |= (p ∪ p′)(w, w′) if T |= p(w, w′) or T |= p′(w, w′).

Since T is a tree, every node has a unique path from the root of T. It is easy to see that the above definition is equivalent to that in Section 2.3. Similarly to Theorem 3, the following theorem can be proved by induction on the structure of p:

Theorem 5. Let p ∈ X (↓, ↓∗, ↑, ↑∗, →+, ←+, ∪) and let D be a DC-DTD. Suppose that T |= p(w, w′) for some T ∈ TL(D) with an SG mapping θ. Then, G |=[] p(θ(w), θ(w′)).

The converse of Theorem 5 is shown by induction on the structure of p, as in the previous section. The next lemma corresponds to the basis step.

Lemma 3. Let G be the schema graph of a DC-DTD D. Suppose that G |=[] (χ:: l)(s, s′). Then, there are T in TL(D), its SG mapping θ, and paths w and w′ from the root of T such that θ(w) = s, θ(w′) = s′, and T |= (χ:: l)(w, w′).

The next lemma is a part of the induction step.

Lemma 4. Let D be a DC-DTD. Let T1 and T2 be in TL(D) with SG mappings θ1 and θ2, respectively. Suppose that T1 |= p1(w1, w1′), T2 |= p2(w2, w2′), and θ1(w1′) = θ2(w2). Then, there are T ∈ TL(D), its SG mapping θ, and paths w and w′ from the root of T such that θ(w) = θ1(w1), θ(w′) = θ2(w2′), and T |= (p1/p2)(w, w′).

Proof. Similarly to Lemma 2, T is constructed by merging T1 and T2.
More formally, T = merge0(T1, T2), provided that the special treatment merge1 is continued along w1 and w2. The details are omitted due to space limitations. In summary, the following theorem holds:

Theorem 6. Let p ∈ X (↓, ↓∗, ↑, ↑∗, →+, ←+, ∪) and let D be a DC-DTD. If G |=[] p(s, s′), then there are T ∈ TL(D), its SG mapping θ, and paths w and w′ from the root of T such that θ(w) = s, θ(w′) = s′, and T |= p(w, w′).
Thus, satisfiability of p ∈ X (↓, ↓∗, ↑, ↑∗, →+, ←+, ∪) under a DC-DTD D is equivalent to whether G |=[] p((⊥, 1, −, r), s′) for some s′. Now, we have to show that the latter condition is decidable in polynomial time. To do this, for a given p, we define a nondeterministic stack machine Mp which simulates |=[]. Note that the parameters of p in the definition of |=[] change only at their suffixes. The machine Mp maintains the parameters on a stack. The contents of the stack before and after a transition correspond to the first and second parameters of p.
– If p = ↓:: l, Mp chooses a node u′ of G such that there is an edge from the stack top to u′ and λ(u′) = l. Then, Mp pushes u′.
– If p = ↑:: l, Mp pops the stack, provided that the label of the new stack top is l.
– If p = ↓∗:: l, Mp chooses a node u′ of G such that there is an edge from the stack top to u′. Then, Mp pushes u′. Mp repeats this, and nondeterministically ceases to repeat when λ(u′) = l.
– If p = ↑∗:: l, Mp pops the stack a nondeterministic number of times, so that the label of the resulting stack top is l.
– If p = →+:: l, Mp chooses u′ such that λ(u′) = l, and pos(u) < pos(u′) if ω(u) = “−” and pos(u) ≤ pos(u′) if ω(u) = “∗”, where u is the node on the stack top. Then, Mp pops u and pushes u′.
– If p = ←+:: l, Mp chooses u′ such that λ(u′) = l, and pos(u) > pos(u′) if ω(u) = “−” and pos(u) ≥ pos(u′) if ω(u) = “∗”, where u is the node on the stack top. Then, Mp pops u and pushes u′.
– If p = p1/p2, Mp behaves first as Mp1, then as Mp2.
– If p = p1 ∪ p2, Mp behaves as Mp1 or as Mp2 nondeterministically.
The run of Mp with initial stack (⊥, 1, −, r) successfully stops with final stack s′ if and only if G |=[] p((⊥, 1, −, r), s′). Existence of a successful run of Mp can be reduced to language nonemptiness of nondeterministic pushdown automata, which is decidable in polynomial time.
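For illustration only, the case analysis of Mp can be turned into a brute-force simulation that bounds the number of pushes. This is not the polynomial-time procedure (that requires the reduction to pushdown-automaton emptiness); the list encoding of p and the push budget are our own devices:

```python
def runs(prog, stack, succ, budget):
    """Enumerate final stacks of M_p, exploring at most `budget` pushes."""
    if not prog:
        yield stack
        return
    op, rest = prog[0], prog[1:]
    if op[0] == "down" and budget > 0:               # ↓::l — push a child
        for u in succ[stack[-1]]:
            if u[3] == op[1]:
                yield from runs(rest, stack + (u,), succ, budget - 1)
    elif op[0] == "up":                              # ↑::l — pop once
        if len(stack) > 1 and stack[-2][3] == op[1]:
            yield from runs(rest, stack[:-1], succ, budget)
    elif op[0] == "dstar":                           # ↓∗::l — push 0+ edges
        frontier, seen = [stack], {stack}
        while frontier:
            s = frontier.pop()
            if s[-1][3] == op[1]:
                yield from runs(rest, s, succ, budget - (len(s) - len(stack)))
            if len(s) - len(stack) < budget:
                for u in succ[s[-1]]:
                    if s + (u,) not in seen:
                        seen.add(s + (u,))
                        frontier.append(s + (u,))
    elif op[0] == "ustar":                           # ↑∗::l — pop 0+ nodes
        for i in range(len(stack), 0, -1):
            if stack[i - 1][3] == op[1]:
                yield from runs(rest, stack[:i], succ, budget)
    elif op[0] in ("fsib", "psib"):                  # replace the stack top
        u = stack[-1]
        for v in succ:
            if v[0] != u[0] or v[3] != op[1]:
                continue
            fwd = u[1] < v[1] if u[2] == "-" else u[1] <= v[1]
            bwd = u[1] > v[1] if u[2] == "-" else u[1] >= v[1]
            if (op[0] == "fsib" and fwd) or (op[0] == "psib" and bwd):
                yield from runs(rest, stack[:-1] + (v,), succ, budget)
    elif op[0] == "union":                           # run either branch
        yield from runs(list(op[1]) + list(rest), stack, succ, budget)
        yield from runs(list(op[2]) + list(rest), stack, succ, budget)

# Schema graph of Fig. 1; p = ↓∗::c / ↑∗::r / ↓::a
N = {("#", 1, "-", "r"), ("r", 1, "*", "a"), ("r", 1, "*", "b"),
     ("r", 2, "-", "c"), ("r", 3, "*", "a"), ("b", 1, "*", "r")}
succ = {u: {v for v in N if u[3] == v[0]} for u in N}
p = [("dstar", "c"), ("ustar", "r"), ("down", "a")]
print(any(True for _ in runs(p, (("#", 1, "-", "r"),), succ, budget=4)))  # True
```

Because ↓∗ can push along cycles of the schema graph, an unbounded search need not terminate; the budget makes the sketch terminate at the cost of completeness, which is exactly the gap the pushdown-automaton reduction closes.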
5
Intractability Results on Disjunction-Free DTDs
In this section, we show a limitation of the tractability of DC-DTDs. The following two theorems claim that upward axes appearing in qualifiers bring intractability even under disjunction-free DTDs.

Theorem 7. Satisfiability of X (↓∗, ↑∗, [ ]) under disjunction-free DTDs is NP-complete.

Proof. Here we show only the NP-hardness. Let φ be an instance of 3SAT, where φ = (L1,1 ∨ L1,2 ∨ L1,3) ∧ · · · ∧ (Ln,1 ∨ Ln,2 ∨ Ln,3), and each Li,j is a member of {x1, . . . , xm, x̄1, . . . , x̄m}. Then, define D = (Σ, r, P) as follows:
– Σ = {r, t1, . . . , tm, f1, . . . , fm, d},
– P(r) = t1 f1, P(tk) = P(fk) = tk+1 fk+1 for 1 ≤ k ≤ m − 1, P(tm) = P(fm) = d, P(d) = ε.
Define an XPath expression p as follows:

p = ↓∗:: d[(q1,1 ∨ q1,2 ∨ q1,3) ∧ · · · ∧ (qn,1 ∨ qn,2 ∨ qn,3)],

where
qi,j = ↑∗:: tk   if Li,j = xk,
qi,j = ↑∗:: fk   if Li,j = x̄k.
Note that only upward axes appear in the qualifier part of p. That is, the qualifier part specifies a condition on the path from the root to a leaf node (whose label is d). Intuitively, each path from the root to a leaf node represents a truth assignment. It is not difficult to see that φ is satisfiable if and only if there is a leaf node satisfying the qualifier part.

Theorem 8. Satisfiability of X (χ↓, χ↑, →+, ←+, [ ]), where χ↓ ∈ {↓, ↓∗} and χ↑ ∈ {↑, ↑∗}, under disjunction-free DTDs is NP-complete.

Proof. We show the case of X (↓∗, ↑∗, →+, ←+, [ ]) by a reduction from 3SAT. The other cases are proved by similar reductions. Let φ be an arbitrary instance of 3SAT with the same symbol definitions as in Theorem 7. Let us define D = (Σ, r, P) as follows:
– Σ = {r, x1, . . . , xm, d},
– P(r) = x1 x1, P(xk) = xk+1 xk+1 for 1 ≤ k ≤ m − 1, P(xm) = d, P(d) = ε.
For any path from the root to a leaf node in a given tree T in TL(D), we can define a truth assignment as follows: the value of xk is true if the node vk labeled xk is a right child of its parent node vk−1 in the path, and false if vk is a left child of vk−1. Now, define an XPath expression p as follows:

p = ↓∗:: d[(q1,1 ∨ q1,2 ∨ q1,3) ∧ · · · ∧ (qn,1 ∨ qn,2 ∨ qn,3)],
where
qi,j = ↑∗:: xk / ←+:: xk   if Li,j = xk,
qi,j = ↑∗:: xk / →+:: xk   if Li,j = x̄k.
Then, whether the qualifier is satisfied by T ∈ TL(D) depends on the path from the root to a leaf node in T, which corresponds to a truth assignment. It is not difficult to see that φ is satisfiable if and only if p is satisfied by a tree conforming to D.
It remains open whether satisfiability of X (↓, ↑, [ ]) under DC-DTDs and disjunction-free DTDs is tractable or not.
6
Conclusions
The paper has presented a tractable subclass of DTDs, called DC-DTDs, for XPath satisfiability with sibling axes. DC-DTDs are a proper subclass of covering DTDs, and their tractability is robust against sibling axes. On the other hand, upward axes appearing in qualifiers bring intractability even under disjunction-free DTDs, which are a proper subclass of DC-DTDs. In this paper, and also in [2,3,5], satisfiability is defined over the paths from the root node. In other words, only absolute paths are considered. However, our results are still valid for relative paths, i.e., for the case where satisfiability is defined over all the paths of a tree. As for the tractability results, the satisfiability of a relative path prel is equivalent to that of the absolute path ((↓∗:: a) ∪ · · · ∪ (↓∗:: z))/prel, where Σ = {a, . . . , z}. Since the classes of XPath examined in Section 4 contain ↓∗, the tractability results remain valid even if XPath expressions are regarded as relative paths. The proofs of intractability for relative paths are exactly the same as the ones in Sections 3 and 5. There are several directions for future work. One direction is to augment the whole class of XPath by attributes, negation, and so on. The choice of sibling axes is also theoretically interesting. We have adopted →+ and ←+ because they coincide with the following-sibling and preceding-sibling axes of XPath. On the other hand, [3,5] adopted → and ←, denoting immediate right and left siblings. The interesting point is that the choice drastically affects the complexity of XPath satisfiability. For example, we have shown that satisfiability for X (↓, ↓∗, →+, ←+, ∪, [ ]) is tractable under DC-DTDs. However, by Proposition 7.2 of [3], satisfiability for X (↓, →, [ ]) (in our notation) is NP-hard even under disjunction-free DTDs. Another direction is to enlarge or restrict the class of DTDs. Regular tree grammars [9] are a general model for XML schemas.
Finding a tractable subclass of regular tree grammars for XPath satisfiability seems interesting. Also, some natural superclasses and subclasses of DC-DTDs should be investigated. For example, we are trying to incorporate the operator “?” (zero-or-one occurrence) into DC-DTDs. Conversely, satisfiability may be easier under a smaller class of DTDs, e.g., DTDs such that every content model is a concatenation e1∗ · · · en∗ of Kleene-starred expressions. DC-DTDs will be useful for conservative approximation of XPath satisfiability. By the word “conservative,” we mean that any satisfiable expression is decided to be satisfiable. In other words, if a conservative approximation algorithm says that p is unsatisfiable, then p is definitely unsatisfiable. For the purpose of query optimization, finding definitely unsatisfiable expressions quickly is often more important than deciding satisfiability exactly. Using DC-DTDs, a conservative approximation algorithm can be constructed easily. Let D be an arbitrary DTD. First, make each content model of D disjunction-capsuled. For example, transform (a|b)cab∗ to (a|b)∗cab∗. Let DDC be the resultant DTD. Next, decide
the satisfiability of a given XPath expression p under DDC. DDC is obviously a DC-DTD, and whenever a tree T conforms to D, it also conforms to DDC. Therefore, if p is decided to be unsatisfiable under DDC, then p is definitely unsatisfiable under D. Of course, a procedure that always answers “satisfiable” for any input XPath expression is a trivial, constant-time, conservative approximation algorithm. However, such a procedure is completely useless for query optimization. The better the approximation a procedure achieves, the better the optimization we obtain. For practical DTDs and queries, the approximation performance of the above procedure should be examined. Also, the usefulness of conservative approximation should be compared with the approaches in [6,7,8] that use fast satisfiability checkers.

Acknowledgments. The authors thank the anonymous reviewers for their insightful comments and suggestions. This research is supported in part by Grant-in-Aid for Scientific Research (C) 20500092 and Grant-in-Aid for Young Scientists (B) 40364111 from the Japan Society for the Promotion of Science.
References
1. Lakshmanan, L.V.S., Ramesh, G., Wang, H., Zhao, Z.J.: On testing satisfiability of tree pattern queries. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases, pp. 120–131 (2004)
2. Benedikt, M., Fan, W., Geerts, F.: XPath satisfiability in the presence of DTDs. In: Proceedings of the Twenty-Fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 25–36 (2005)
3. Benedikt, M., Fan, W., Geerts, F.: XPath satisfiability in the presence of DTDs. Journal of the ACM 55(2) (2008)
4. Miklau, G., Suciu, D.: Containment and equivalence for a fragment of XPath. Journal of the ACM 51(1), 2–45 (2004)
5. Geerts, F., Fan, W.: Satisfiability of XPath queries with sibling axes. In: Bierman, G., Koch, C. (eds.) DBPL 2005. LNCS, vol. 3774, pp. 122–137. Springer, Heidelberg (2005)
6. Genevès, P., Layaïda, N.: A system for the static analysis of XPath. ACM Transactions on Information Systems 24(4), 475–502 (2006)
7. Genevès, P., Layaïda, N.: Deciding XPath containment with MSO. Data & Knowledge Engineering 63(1), 108–136 (2007)
8. Genevès, P., Layaïda, N., Schmitt, A.: Efficient static analysis of XML paths and types. In: Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, pp. 342–351 (2007)
9. Murata, M., Lee, D., Mani, M., Kawaguchi, K.: Taxonomy of XML schema languages using formal language theory. ACM Transactions on Internet Technology 5(4), 660–704 (2005)
10. Montazerian, M., Wood, P.T., Mousavi, S.R.: XPath query satisfiability is in PTIME for real-world DTDs. In: Barbosa, D., Bonifati, A., Bellahsène, Z., Hunt, E., Unland, R. (eds.) XSym 2007. LNCS, vol. 4704, pp. 17–30. Springer, Heidelberg (2007)
General Database Statistics Using Entropy Maximization

Raghav Kaushik1, Christopher Ré2, and Dan Suciu2

1 Microsoft Research
2 University of Washington, Seattle, WA
Abstract. We propose a framework in which query sizes can be estimated from arbitrary statistical assertions on the data. In its most general form, a statistical assertion states that the size of the output of a conjunctive query over the data is a given number. A very simple example is a histogram, which makes assertions about the sizes of the outputs of several range queries. Our model also allows much more complex assertions that include joins and projections. To model such complex statistical assertions we propose to use the Entropy-Maximization (EM) probability distribution. In this model any set of statistics that is consistent has a precise semantics, and every query has a precise size estimate. We show that several classes of statistics can be solved in closed form.
1
Introduction
Modern database query optimizers are the result of thousands of man-years of work by very talented individuals, and so are extremely sophisticated. Although the optimizers themselves are sophisticated, the heuristics they employ often are not: when estimating the selectivity of two predicates, the optimizer might make an independence assumption and assume that the selectivities of the two predicates are independent. When estimating the size of the intersection of two columns, the optimizer might make a containment assumption and assume that the values in one column are contained in the other. The cleverness of the optimizer lies in how and when to apply these rules. In spite of this intense effort and the fundamental importance of the problem, there is no general theory that explains how the optimizer should make these choices or when such choices are consistent. In this paper, we take a first step towards such a general, principled theory of how optimizers should make use of statistics. The lack of a principled framework is likely to become an even more critical problem as new sources of statistics become available to the query optimizer. For example, several proposals [1, 2] advocate acquiring query feedback and incorporating this statistical feedback into the optimizer. It is easy to collect cardinality statistics from each query as it is executed by the database engine. The difficulty lies in combining this newfound plethora of statistics to produce a single, principled estimate. The lack of such a principled framework is a major reason that execution feedback has not been widely adopted in commercial database engines.

P. Gardner and F. Geerts (Eds.): DBPL 2009, LNCS 5708, pp. 84–99, 2009. © Springer-Verlag Berlin Heidelberg 2009
V3(z) :- S(−, z)
V4(x) :- R(x, −)
V5(x, y) :- R(x, y), S(y, −, x), T(x)
V6(x) :- T(x), U(x, z), 10 < z < 50

Σ: #R = 3000, #T = 42000, #V3 = 150, #V4 = 200, #V5 = 300000, #V6 = 4000

Γ: γ1 : R(−, y) ⇒ S(y)   γ2 : R(x, −) ⇒ T(x)   γ3 : R(z, y), S(y, z) ⇒ T(z)

Estimate: q(y) :- R(x, y), S(y, z)

Fig. 1. An example of a statistical program and a query q whose cardinality we would like to estimate
The key object of our study is a statistical program, which is a set of pairs (v, d), where v is a query (also called a view) and d > 0 is a number. Each pair (v, d) is called a statistical assertion, and means intuitively that the answer to v has expected size d; we write it as #v = d and say simply that “the size of v is d”. A statistical program encodes the information that is available to an optimizer. The primary use of this information is to estimate the expected result size of other queries during query optimization.

Example 1. Figure 1 illustrates a statistical program that asserts the sizes of 4 views and 2 base relations. The program also asks for the estimated size of another query. This program asserts that the expected number of distinct tuples in the relation R is 3000. The program also asserts (via V3) that the second attribute of S contains 150 values. Our proposal also allows complex assertions that involve joins and arithmetic predicates, such as V6. In addition to statistical assertions, our model also allows specifying inclusion constraints. These constraints are hard constraints and must hold in every instance I that the distribution considers possible, i.e., for which P[I] > 0. For example, γ1 says that each value in the second column of R also appears in the S relation. In this paper, we study the probability distribution over database instances that satisfies a given statistical program Σ.
1.1
Our Approach: Entropy Maximization
In this paper we define a model for a statistical program using the Entropy Maximization principle. We assume that the relations in the database are drawn from a finite domain D, of size N, and that each database instance I has some probability P(I). P is chosen to fit the statistics Σ without making any other assumptions on the data. More precisely, (1) for each assertion #v = d in Σ, the expected size of v under P is d, and (2) the probability distribution P has the maximum entropy among those that satisfy (1) (a formal definition is given in Sec. 2). The EM model for a statistical program Σ is an instance of the general EM principle in probability theory, discussed for example by Jaynes [3, Ch. 9, 11], which has also been applied to the consistent use and construction of histograms [4, 5]. The EM framework has many attractive features. First, any combination of statistical assertions has a well-defined semantics (except, of course, when it is inconsistent); thus, a statistical program is treated as a whole, as opposed to a set of separate synopses. Second, every query has a well-defined cardinality estimate; there is no restriction on the query, and the query estimate no longer depends on which heuristics are used to do the estimation. A third reason is that the EM framework has an interesting property that allows us to add a new statistical assertion smoothly: if the estimate of a query q under a statistical program Σ is d, then after adding the assertion #q = d the new EM probability distribution is identical to the previous one. In practice this means that if we add a small correction to the model, #q = d′ where d′ ≈ d, then the model will change smoothly. The final, and most important, conceptual reason is that, in a precise sense, the probability distribution given by entropy maximization depends only on the provided statistics and makes no additional assumptions beyond them. We return to this point when we formally define our model and state its properties in Sec. 2.1. In this paper, we study the following model computation problem: given a statistical program Σ and a set of full inclusion constraints Γ, find a solution to the EM model. (We explain below the reason for introducing constraints.)
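To make the definition concrete, here is a tiny numerical sketch for the simplest case of a single assertion #R = d over a domain of size N (the values of N and d are arbitrary toy choices of ours). Maximizing entropy under the linear constraint E[|I|] = d yields a Gibbs-form distribution P[I] ∝ α^|I|, which is tuple-independent with tuple probability d/N, i.e. odds α = d/(N − d); the constraint is then met exactly:

```python
from math import comb, log

N, d = 6, 2.0                 # toy domain size and asserted expected size #R = d
alpha = d / (N - d)           # odds of the tuple-independent solution

# P[I] is proportional to alpha^|I| over all 2^N instances (subsets of the domain);
# group instances by size k, of which there are C(N, k)
Z = sum(comb(N, k) * alpha**k for k in range(N + 1))           # equals (1 + alpha)^N
p_size = {k: comb(N, k) * alpha**k / Z for k in range(N + 1)}  # P[|I| = k]

expected_size = sum(k * pk for k, pk in p_size.items())
entropy = -sum(comb(N, k) * (alpha**k / Z) * log(alpha**k / Z)
               for k in range(N + 1))
print(round(expected_size, 10))   # the asserted statistic d, met exactly
```

With several assertions and constraints, the exponents become per-relation parameters (the odds αi of Sec. 1.2) and no such simple closed form is available in general, which is exactly the model computation problem studied in the paper.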
other assumptions on the data. More precisely (1) for each assertion #v = d in Σ, the expected size of v under P is d, and (2) the probability distribution P has the maximum entropy among those that satisfy (1) (formal definition given in Sec. 2). The EM model for a statistical program Σ is an instance of the general EM principle in probability theory, discussed for example by Jaynes [3, Ch.9,11], and which has also been applied to consistent use and construction of histograms [4, 5]. The EM framework has many attractive features. First, any combination of statistical assertions has a well-defined semantics (except, of course, when it is inconsistent); thus, a statistical program is treated as a whole, as opposed to a set of separate synopses. Second, every query has a well defined cardinality estimate; there is no restriction on the query, and the query estimate no longer depends on which heuristics are used to do the estimation. A third reason is that the EM framework has an interesting property that allows us to add a new statistical assertion smoothly: if the estimate of a query q under a statistical program Σ is d, then after adding the assertion #q = d the new EM probability distribution is identical to the previous one. In practice this means that if we add a small correction to the model, #q = d where d ≈ d, then the model will change smoothly. The final, and most important conceptual reason is that in a precise sense the probability distribution given by entropy maximization depends only on the provided statistics and makes no additional assumptions beyond this. We return to this point when we formally define our model and state its properties in Sec. 2.1. In this paper, we study the following model computation problem: given a statistical program Σ and a set of full inclusion constraints Γ , find a solutions to the EM model. (We explain below the reason for introducing constraints.) 
Since an EM solution is tied to the particular domain D, we seek to remove the dependency by letting the domain size N grow to infinity, as is done, for example, in random graphs [6], knowledge representation [7], or asymptotic query probability [8]. Since we seek an analytic understanding of the model, our goal is to find analytic (asymptotic) solutions to the EM model. Solving the EM model in general is, however, a very hard problem. We report in this paper several partial results on the asymptotic solutions for statistical programs. While our results do not add up to a comprehensive solution to statistical programs, they do offer explicit solutions in several cases, and shed light on the nature of the EM model for database statistics.

1.2 Main Technical Results
In this paper, we classify programs along two axes. First, whether the statistical assertions are on base tables only, or on both base tables and views: we say that Σ is in normal form (NF) if all statistical assertions are on base tables; otherwise it is in non-NF. The program in Fig. 1 is in non-NF. Second, whether the views in Σ, Γ have joins or are join-free: we call the program composite if all views have joins, and atomic if all are join-free (we do not consider mixed atomic/composite programs). All four combinations are
General Database Statistics Using Entropy Maximization
possible, e.g., an NF, composite program means that all statistical assertions are on base tables and the inclusion constraints are composite views. In this paper, we consider only composite programs¹. We give a complete solution for composite, NF programs: given statistics on the base tables and a set of full inclusion constraints, the EM model is described by an explicit formula; we prove this by using techniques from [10]. Second, we prove a more limited result for non-NF programs, by giving an explicit formula when the views are restricted to project-semi-joins. This explicit formula gives an important insight into the nature of the difficulty of non-NF programs, as we explain below. In addition to these explicit solutions, we discuss a generic technique that we encapsulate as the conditioning theorem (Sec. 3.1); it reduces a more complex program to a simpler program plus additional inclusion constraints, and it is what motivated us to study statistical programs together with constraints. The conditioning theorem states that every solution P to the EM model can be expressed as a conditional probability P(−) = P0(− | Γ), where P0 is a tuple-independent distribution called the prior, and Γ is a set of inclusion and set-equality constraints. Since P0 is tuple-independent, it is specified by a number pi ∈ (0, 1), one for each relation Ri, representing the probability that a generic tuple in the domain belongs to Ri; the corresponding odds is denoted αi = pi/(1 − pi). Understanding the quantity P0(Γ) is a key component of understanding the EM model. This quantity converges to 1 for composite NF programs, and to 0 for composite non-NF programs, suggesting that the techniques used to solve composite, NF programs (which turn out to be simpler) do not extend to the other cases. The EM model computation that we study in this paper is the first step in our program of using the EM framework for query size estimation.
The second step is to use the model in order to do query size estimation, which we leave for future work, noting that it has been solved for the special case of independent distributions [10]. The rest of the paper is organized as follows. We describe the EM model in Sec. 2, discuss composite programs in Sec. 3, and show how to handle range predicates in Sec. 4. We discuss related work in Sec. 5 and conclude in Sec. 6.
2 The EM Model
We introduce basic notation and then review the EM model. CQ denotes the class of conjunctive queries over a relational schema R1, ..., Rm. We define a project-join query as a conjunctive query without constants and where no subgoal has
¹ It turns out that atomic and mixed programs require an entirely different set of more complex, analytic techniques. In the full version of this paper [9], we discuss our preliminary results for atomic programs. We show, for example, that any program can be transformed into a normal-form program. Interestingly, this normalization process also introduces additional inclusion constraints.
repeated variables, and write PJ for the class of project-join queries. For example, R(x, y), S(y, z, u) is a project-join query, but neither R(a, x) nor S(x, x, y), T(y, z) is. An arithmetic predicate, or range predicate, has the form x op c, where op ∈ {<, ≤, >, ≥} and c is a constant; we denote by PJ≤ the set of project-join queries with range predicates. Let Γ be a set of full inclusion constraints, i.e., statements of the form ∀x̄. w(x̄) ⇒ Ri(x̄), where w ∈ PJ≤ and Ri is a relation name.
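These membership tests are purely syntactic, so they can be checked mechanically. The following sketch (our own illustration, not code from the paper) represents a query as a list of (relation name, argument tuple) subgoals, with variables modeled as strings and constants as integers:

```python
# Check membership in the class PJ of project-join queries: no constants,
# and no repeated variable inside a single subgoal. The representation
# (strings for variables, ints for constants) is an assumption of this
# sketch, not notation from the paper.

def is_project_join(query):
    for _, args in query:
        if any(isinstance(a, int) for a in args):  # a constant occurs
            return False
        if len(set(args)) != len(args):            # repeated variable in a subgoal
            return False
    return True

# The examples from the text:
q1 = [("R", ("x", "y")), ("S", ("y", "z", "u"))]  # project-join
q2 = [("R", (0, "x"))]                            # contains a constant: not in PJ
q3 = [("S", ("x", "x", "y")), ("T", ("y", "z"))]  # repeated variable: not in PJ

assert is_project_join(q1)
assert not is_project_join(q2) and not is_project_join(q3)
```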
2.1 Background: The EM Model
For a fixed domain D and constraints Γ we denote by I(Γ) the set of all instances over D that satisfy Γ; the set of all instances over D is I(∅), which we abbreviate I. A probability distribution on I(Γ) is a set of numbers p̄ = (pI)_{I∈I(Γ)} in [0, 1] that sum up to 1. We use the notations pI and P[I] interchangeably in this paper. A statistical program is a pair Σ = (v̄, d̄), where v̄ = (v1, ..., vs) are project-join queries, vi ∈ PJ, and d̄ = (d1, ..., ds) are positive real numbers. A pair (vi, di) is a statistical assertion that we write informally as #vi = di; in the simplest case it can just assert the cardinality of a relation, #Ri = di. A probability distribution on I(Γ) satisfies a statistical program Σ if E[|vi|] = di, for all i = 1, s. Here E[|vi|] denotes the expected value of the size of the view vi, i.e., Σ_{I∈I} |vi(I)| pI. We will also allow the domain size N to grow to infinity. For fixed values d̄ we say that a sequence of probability distributions (p̄N)_{N>0} satisfies Σ = (v̄, d̄) asymptotically if lim_{N→∞} EN[|vi|] = di, for i = 1, s. Given a program Σ, we want to determine the most "natural" probability distribution p̄ that satisfies Σ and use it to estimate query cardinalities. In general, there may not exist any probability distribution that satisfies Σ; in this case, we say that Σ is unsatisfiable. On the other hand, there may exist many solutions. To choose a canonical one, we apply the Entropy Maximization (EM) principle.

Definition 1. A probability distribution p̄ = (pI)_{I∈I(Γ)} is an EM distribution associated to Σ if the following two conditions hold: (1) p̄ satisfies Σ, and (2) it has the maximum entropy among all distributions that satisfy Σ, where the entropy is H = −Σ_{I∈I(Γ)} pI log pI.

With a slight abuse, we refer to an EM distribution as the EM model, assuming it is unique. For a simple illustration, consider the following program on the relation R(A, B, C): #R = 200, #R.A = 40, #R.B = 30, #R.C = 20.
Thus, we know the cardinality of R, and the number of distinct values of each of the attributes A, B, C. We want to estimate #R.AB, i.e., the number of distinct values of pairs AB. Clearly this number can be anywhere between 40 and 200, but currently there does not exist a principled approach for query optimizers to estimate the number of distinct pairs AB from the other four statistics. The EM model gives such a principled approach. According to this model, R is a random instance over a large domain D of size N, according to a probability distribution described by the probabilities pI, for I ⊆ D³. The distribution pI is defined
precisely: it satisfies the four statistical assertions above, and is such that the entropy is maximized. Therefore, the estimate we seek also has a well-defined semantics, as E[#R.AB] = Σ_{I⊆D³} pI |I.AB|. This estimate will certainly be between 40 and 200; it will depend on N, which is an undesirable property, but a sensible thing to do is to let N grow to infinity and compute the limit of E[#R.AB]. Thus, the EM model offers a principled and uniform approach to query size estimation. Of course, in order to compute any estimate we must first find the EM distribution pI; this is the goal in this paper. To describe the general form of an EM distribution, we need some definitions. Fix the set of constraints Γ and the views v̄ = (v1, ..., vs).

Definition 2. The partition function for Γ and v̄ is the following polynomial T with s variables x̄ = (x1, ..., xs):

T^{Γ,v̄}(x̄) = Σ_{I ∈ I(Γ)} x1^{|v1(I)|} · · · xs^{|vs(I)|}

Let ᾱ = (α1, ..., αs) be s positive real numbers. The probability distribution associated to (Γ, v̄, ᾱ) is:

pI = ω α1^{|v1(I)|} · · · αs^{|vs(I)|}    (1)

where ω = 1/T^{Γ,v̄}(ᾱ). We write T instead of T^{Γ,v̄} when Γ, v̄ are clear from the context. The partition function can be written more compactly as:

T(x̄) = Σ_{k1,...,ks} C_Γ(N, k1, ..., ks) x1^{k1} · · · xs^{ks}

where C_Γ(N, k1, ..., ks) denotes the number of instances I over a domain of size N that satisfy Γ and for which |vi(I)| = ki, for all i = 1, s. The following is a key characterization of EM distributions.

Theorem 1 ([3, page 355]). Let Σ = (v̄, d̄) be a statistical program. For any probability distribution p̄ that satisfies the statistics Σ the following holds: p̄ is an EM distribution iff there exist parameters ᾱ s.t. p̄ is given by Equation (1) (equivalently: p̄ is associated to (Γ, v̄, ᾱ)).

The message of this theorem is that the weight of an instance I under the EM distribution depends only on |v1(I)|, ..., |vs(I)|. That is, the distribution depends exactly on the provided statistics and makes no additional assumptions. It is this property that makes the EM distribution the natural model for database statistics: in a Bayesian sense, for a fixed set of statistics the EM model yields the optimal estimate. We refer to Jaynes [3, page 355] for a full proof and further discussion of this point; the "only if" part of the proof is both simple and enlightening, and we include it in the Appendix for completeness.
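For intuition, Theorem 1 can be verified by brute force on a toy program of our own choosing: one unary relation R over a domain of size N = 4 and the single statistic #R = 1.5. Solving E[|R|] = Nα/(1 + α) = d gives the EM parameter α = d/(N − d), and the resulting exponential-form distribution has strictly larger entropy than an alternative distribution satisfying the same statistic:

```python
# Brute-force illustration of Theorem 1 on a toy program (our own toy,
# not from the paper): unary R over a domain of size N = 4, statistic
# #R = 1.5. The EM distribution must have the form p_I = w * alpha^{|I|};
# solving E[|R|] = N*alpha/(1+alpha) = d gives alpha = d/(N-d).
from math import log, comb

N, d = 4, 1.5
alpha = d / (N - d)                    # EM parameter
Z = (1 + alpha) ** N                   # partition function (1+x)^N at x = alpha
p_em = {k: alpha**k / Z for k in range(N + 1)}  # prob of one fixed instance of size k

# The statistic is satisfied: E[|R|] = sum over sizes k of C(N,k)*k*p_em[k]
exp_size = sum(comb(N, k) * k * p_em[k] for k in range(N + 1))
assert abs(exp_size - d) < 1e-9

def entropy(probs):  # probs: individual-instance probabilities
    return -sum(p * log(p) for p in probs if p > 0)

H_em = entropy([p_em[k] for k in range(N + 1) for _ in range(comb(N, k))])

# An alternative distribution that also satisfies #R = 1.5: mass 1/2 spread
# uniformly over size-1 instances and 1/2 over size-2 instances.
H_alt = entropy([0.5 / comb(N, 1)] * comb(N, 1) + [0.5 / comb(N, 2)] * comb(N, 2))

assert H_em > H_alt  # the exponential-family form has strictly larger entropy
```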
We illustrate the utility of this theorem with two simple examples.

Example 2 (The Binomial Model). Consider a relation R(A, B) and the statistical assertion #R = d, with Γ = ∅. The partition function is the binomial T(x) = Σ_{k=0}^{N²} (N² choose k) x^k = (1 + x)^{N²}, and the EM model turns out to be the probability model that randomly inserts each tuple in R independently, with probability p = d/N². We need to check that this is an EM distribution: given an instance I of size k, P[I] = p^k (1 − p)^{N²−k}, which we rewrite as P[I] = ω α^k. Here α = p/(1 − p) is the odds of a tuple, and ω = (1 − p)^{N²} = P[I = ∅]. This is indeed an EM distribution by Theorem 1. Asymptotic query evaluation on a generalization of this distribution to multiple tables was studied in [8].

Example 3 (Overlapping Ranges). Consider two views²: v1(x, y) :- R(x, y), x < .60N and v2(x, y) :- R(x, y), .25N ≤ x, and the statistical program #v1 = d1, #v2 = d2 (again Γ = ∅). Assuming N = 100, the views partition the domain into three buckets, D1 = [1, 24], D2 = [25, 59], D3 = [60, 100], of sizes N1, N2, N3. Here we want to say that we observe d1 tuples in D1 ∪ D2 and d2 tuples in D2 ∪ D3. The EM model gives us a precise distribution that represents only these observations and nothing more. The partition function is (1 + x1)^{N1} (1 + x1x2)^{N2} (1 + x2)^{N3}, and the EM distribution has the form P[I] = ω α1^{k1} α2^{k2}, where k1 = |I ∩ (D1 ∪ D2)| and k2 = |I ∩ (D2 ∪ D3)|; we show in Sec. 4 how to compute the parameters α1, α2.

In this paper we study the model computation problem: given a statistical program Σ, find the parameters ᾱ for the EM model. The ultimate goal of our program is to further use these parameters to estimate the size of arbitrary queries, but we will not treat the latter problem in this paper. The model depends on the size of the domain N, and this is an undesirable property, since in practice N has no meaning other than that it is large.
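The partition function claimed in Example 3 can be checked numerically on a unary simplification (a single attribute instead of the pair (x, y), and toy bucket sizes N1 = N2 = N3 = 2; both simplifications are ours, chosen so the instance space can be enumerated):

```python
# Numeric check of Example 3's partition function on a unary toy version:
# domain {0..5} split into buckets D1 = {0,1}, D2 = {2,3}, D3 = {4,5};
# v1 sees D1 u D2 and v2 sees D2 u D3.
from itertools import chain, combinations

D1, D2, D3 = {0, 1}, {2, 3}, {4, 5}
D = D1 | D2 | D3

def subsets(s):
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def T_bruteforce(x1, x2):
    total = 0.0
    for I in subsets(D):
        I = set(I)
        k1 = len(I & (D1 | D2))  # |v1(I)|
        k2 = len(I & (D2 | D3))  # |v2(I)|
        total += x1**k1 * x2**k2
    return total

def T_closed(x1, x2):
    # the closed form claimed in Example 3
    return (1 + x1)**len(D1) * (1 + x1 * x2)**len(D2) * (1 + x2)**len(D3)

for x1, x2 in [(0.3, 0.7), (1.0, 1.0), (2.0, 0.5)]:
    assert abs(T_bruteforce(x1, x2) - T_closed(x1, x2)) < 1e-9
```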
For that reason, we study the asymptotic model computation problem in this paper: find a sequence of parameters ᾱN s.t. the distribution associated to (Γ, v̄, ᾱN) satisfies Σ asymptotically. To simplify our discussion we present our results for the case when the queries in the statistical program have no range predicates, and show in Sec. 4 how to handle range predicates. Thus, from now on until Sec. 4, we will assume all conjunctive queries to be without range predicates.

2.2 A Taxonomy for Statistical Programs
Recall that PJ denotes the class of project-join queries. We define here two subclasses. First, a project query is a single subgoal query without constants or repeated variables; denote P the class of project queries. Second, a single component join query is a project-join query with the following properties: it is minimized, has at least two subgoals, and has a single connected component; denote PJC the
² We represent range predicates as fractions of N so we can allow N to go to infinity.
class of single component join queries. Queries V3, V4 in Fig. 1 are in P; query V5 is in PJC. P and PJC are two disjoint subclasses of PJ that do not cover PJ. Some queries in PJ are in neither class, e.g., v(x, y) :- R(x, y), R(z, y), S(u) is a query that minimizes to R(x, y), S(u), which is neither in P nor in PJC (it has two connected components): we do not treat such queries in this paper. We classify statistical programs Σ = (v̄, d̄) and constraints Γ along two axes:

Definition 3. Σ is in normal form (NF) if all statistical assertions are on base tables; otherwise, it is in non-normal form (NNF).

Definition 4. (1) Σ is composite if for every statistical assertion #v = d, v is either a base table or is in PJC. Γ is composite if for every constraint ∀x̄. w(x̄) ⇒ Ri(x̄), w is in PJC. We say that Σ, Γ is composite if both are composite. (2) Σ, Γ is atomic if all their views are in P.

Thus, there are four combinations of programs: NF/NNF and composite/atomic. For example, in Fig. 1, the program (Σ1, Γ1) with Σ1 = {#R = 3000, #T = 42000} and Γ1 = {γ1} is an atomic, NF program; if we add the statistical assertion (V4, d4), then the program is still atomic, but no longer in normal form. On the other hand, (Σ1, Γ) with Γ = {γ3} is composite and in normal form; if we add the statistic (V5, d5) then this becomes a composite program, not in normal form. We do not treat mixed atomic/composite programs.
3 Composite Programs
We start by discussing the case when all queries are composite. First, we introduce the two main techniques used in this section, conditioning on the prior and the asymptotic probabilities from [8]; then we give our results.
3.1 From Conditionals to EM Models
Recall that I denotes the set of all database instances, without any constraints. Define a prior probability distribution to be any tuple-independent probability distribution P0 on I. As seen in Example 2, this is an EM distribution for a very simple NF program, which just asserts the cardinalities of each relation, #Ri = di, and has no constraints. Each tuple t in Ri has probability P0[t] = di/N^{arity(Ri)}, and the EM parameters are αi = pi/(1 − pi) ≈ pi. Now let us add a set of constraints Γ, i.e., consider the NF program consisting of both the cardinality assertions #Ri and the constraints Γ. Its EM model is obtained as follows:

Theorem 2 (Conditioning). Let P be the EM model for an NF program Σ, Γ. Then there exists a prior probability distribution P0 such that:

∀ I ∈ I(Γ): P[I] = P0[I | Γ]

Moreover, the expected values are obtained through the following transfer equation: E[|q|] = E0[|q| | Γ].
Proof. Let ᾱ be the parameters of P. Define the tuple-independent prior as follows: for each relation Ri, define P0[t ∈ Ri] = pi = αi/(1 + αi). (Thus, the odds of pi are precisely αi.) Then P0[I] = ω0 Π_i αi^{|Ri^I|} (follows by generalizing Example 2) and P[I] = ω Π_i αi^{|Ri^I|} (by definition). Thus, P and P0 are essentially the same expression, only P is defined over a restricted domain I(Γ) ⊆ I.

For a simple illustration, consider the statistical program Σ: #R(A, B) = d1, #T(B, C) = d2, and the constraints Γ: R(x, y), R(y, z) ⇒ R(x, z) and T(x, y), R(y, z) ⇒ T(z, x). To solve it, first solve a different, simpler program #R(A, B) = b1, #T(B, C) = b2, without constraints. This is a tuple-independent probability distribution P0. Then the solution to Σ, Γ is given as P[I] = P0[I | Γ]. The difficulty lies in choosing the statistics b1, b2 of the simpler model: we need to ensure that E0[|R| | Γ] = d1, E0[|T| | Γ] = d2.
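The conditioning construction can be verified by brute force on a tiny setting of our own: a binary relation R over the domain {0, 1} and, as Γ, the transitivity constraint R(x, y), R(y, z) ⇒ R(x, z). The conditional of a tuple-independent prior is again of the exponential form of Theorem 1, and conditioning shifts the expected cardinality, which is exactly why the prior statistics b must be solved for:

```python
# Brute-force illustration of the conditioning theorem on a toy setting
# (ours, not the paper's): binary R over {0,1}, Gamma = transitivity.
from itertools import chain, combinations, product

DOM = [0, 1]
TUPLES = list(product(DOM, DOM))  # the 4 possible tuples of R

def transitive(I):
    return all((x, z) in I for (x, y) in I for (y2, z) in I if y == y2)

def subsets(s):
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

p = 0.3                           # prior tuple probability
alpha = p / (1 - p)               # odds of a tuple

good = [set(I) for I in subsets(TUPLES) if transitive(set(I))]
prior = lambda I: p**len(I) * (1 - p)**(len(TUPLES) - len(I))

Z = sum(prior(I) for I in good)   # Z = P0[Gamma]
cond = {frozenset(I): prior(I) / Z for I in good}

# The conditional depends on I only through |I| (the EM form of Theorem 1):
w = cond[frozenset()]             # omega = P[I = empty]
for I in good:
    assert abs(cond[frozenset(I)] - w * alpha**len(I)) < 1e-12

# Conditioning shifts the expected size away from the unconditional one,
# which is why b must be solved for rather than set to d directly:
e_cond = sum(len(I) * cond[frozenset(I)] for I in good)
e_prior = len(TUPLES) * p
assert abs(e_cond - e_prior) > 1e-3
```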
3.2 Background: Asymptotic Query Probabilities
Based on our discussion, we need to study prior probabilities of the form P[t ∈ R] = b(R)/N^{arity(R)}, where b(R) is a constant that depends only on the relation symbol R. These tuple-independent distributions were studied in [8]. It was shown that for any Boolean conjunctive query q ∈ CQ there exist two constants E(q) and C(q), computable from the constants b(R) and the query expression alone, s.t. P[q] = C(q)/N^{E(q)} + O(1/N^{E(q)+1}). We give the expressions for C(q) and E(q) in Fig. 2.

Example 4. We illustrate the notations in Fig. 2 on the query q = R(x, y), R(y, z), with b(R) = b. D(q) = 4 − 3 = 1 and is called the degree of q; and b(q) = b². UQ(q) is obtained by substituting variables in q and contains four queries (up to isomorphism): q itself, then R(x, x), R(x, y), then R(x, y), R(y, y), and finally R(x, x). Their degrees are 1, 2, 2, 1, respectively, thus E(q) = 1 and is called the
V(q) = the number of distinct variables in q
a(q) = Σ {arity(g) | g ∈ goals(q)}
D(q) = a(q) − V(q)
b(q) = Π {b(g) | g ∈ goals(q)}
UQ(q) = {η(q) | η a substitution of the variables}
E(q) = min {D(q0) | q0 ∈ UQ(q)}
UQ⁰(q) = {q0 | q0 ∈ UQ(q), D(q0) = E(q)}
C(q) = Σ_{q0 ∈ UQ⁰(q)} b(q0)/aut(q0)

Fig. 2. Notations for Theorem 3 from [8]
exponent of q. UQ⁰(q) consists of the first and the last queries (those that have D = 1), and aut(q0) is the number of automorphisms of q0, which is 1 for both queries in UQ⁰(q). Finally, C(q) = b² + b is called the coefficient of q. Thus, P[q] = (b² + b)/N + O(1/N²). We consider here only conjunctive queries where all connected components have E > 0; this rules out some degenerate queries, whose treatment is more complex [11]. All PJC queries satisfy this property, since they have a single component and E > 0.

Theorem 3 ([8]). For any conjunctive query q ∈ CQ, P0(q) = C(q)/N^{E(q)} + O(1/N^{E(q)+1}).
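The quantities D(q) and E(q) of Fig. 2 can be recomputed mechanically for Example 4. Substitutions up to renaming correspond to set partitions of the variables; the enumeration below is our own illustration, not the paper's algorithm:

```python
# Compute the degree D(q) and exponent E(q) of Fig. 2 for the query
# q = R(x,y), R(y,z) of Example 4, by enumerating all identifications
# of variables (set partitions of {x, y, z}).

def partitions(items):
    # all set partitions of a list, generated recursively
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] + [first]] + part[i + 1:]
        yield part + [[first]]

q = [("R", ("x", "y")), ("R", ("y", "z"))]
variables = ["x", "y", "z"]

def degree(subgoals):
    goals = set(subgoals)  # duplicate subgoals collapse
    arity_sum = sum(len(args) for _, args in goals)
    nvars = len({v for _, args in goals for v in args})
    return arity_sum - nvars  # D = a(q) - V(q)

degrees = []
for part in partitions(variables):
    rep = {v: block[0] for block in part for v in block}
    q0 = [(r, tuple(rep[v] for v in args)) for r, args in q]
    degrees.append(degree(q0))

D_q = degree(q)     # degree of q itself: 4 - 3 = 1
E_q = min(degrees)  # exponent: minimum degree over all substitutions
assert (D_q, E_q) == (1, 1)
# the minimizers are q itself and R(x,x), matching UQ0(q) in Example 4
assert degrees.count(E_q) == 2
```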
3.3 Composite NF Programs
Theorem 4 (Composite, NF). Consider a statistical program in normal form Σ: #Rj = dj, for j = 1, m, and a set of inclusion constraints Γ where all queries are composite. Then an asymptotic solution to the EM model is given by αj = dj/N^{arity(Rj)}.

The proof is given in the full version of the paper; it uses Theorem 3 as well as specific properties of the expressions D and E in Fig. 2. At a high level, the proof exploits the fact that lim_N P0[Γ] = 1, where P0 is the prior associated to the same statistical program (Σ, Γ): the constraints Γ hold almost certainly in the prior, and hence the statistics are not affected by conditioning.

Example 5. Consider the constraints R(x, y), R(y, z) ⇒ R(x, z) and T(x, y, z), R(y, u) ⇒ S(y), and the statistical assertions |R| = d1, |T| = d2, |S| = d3. An asymptotic solution to the EM model is given by α1 = d1/N², α2 = d2/N³, α3 = d3/N.

3.4 Composite Non-NF Programs
Let Σ be a statistical program that consists of assertions on all relations, #Rj = dj, as well as assertions over composite views, #qi = di. Create a new relation symbol Ti, of the same arity as the view, for each statistical assertion, and define the set-equality constraints ∀x̄. ((∃ȳ. qi(x̄, ȳ)) ⇔ Ti(x̄)). Each set-equality constraint is expressed as γi ∧ δi, where γi is a full inclusion constraint and δi is a reverse inclusion constraint:

γi ≡ ∀x̄. qi(x̄) ⇒ Ti(x̄)
δi ≡ ∀x̄. Ti(x̄) ⇒ (∃ȳ. qi(x̄, ȳ))

Denote Γ the set of all inclusion constraints and Δ the set of all reverse inclusion constraints. As before, the EM solution is given by a conditional P[I] = P0[I | Δ ∧ Γ], where P0 is some tuple-independent prior. However, it is now more difficult to transfer the expected sizes, and we provide a closed-form solution only for a restricted class of views.
Definition 5. A query q ∈ PJ is a project-semi-join query if the following conditions hold, where x̄ = (x1, ..., xk) are its head variables:
– q has no self-joins (i.e., no repeated relation symbol).
– If two different subgoals in q share a variable y, then y ∈ x̄.
– For every subgoal g, if g contains xi then it also contains xi+1.

A core subgoal is a subgoal that contains the smallest number of head variables. The core of q is the set of core subgoals, denoted G. In what follows, a transfer equation is an equation that relates a size estimate under the prior distribution to the size estimate of the distribution under constraints.

Lemma 1. Let δ be an inverse inclusion constraint ∀x̄. T(x̄) ⇒ ∃ȳ. q(x̄, ȳ), where q is a project-semi-join query. Define the prior:

P0(t ∈ Rj) = b(Rj)/N^{arity(Rj)}
P0(t ∈ T) = 1 − b(T)/N^{E(G)}

(Here E(G) denotes the exponent of the core, see Fig. 2.) Let R1, ..., Rm be all subgoals in q, m ≥ 1. Then the transfer equation for the view is:

E0[|T| | δ] = (Π_{R ∈ goals(q)} b(R)) / b(T)    (2)

The transfer equation for any other relation in R̄ is as follows. If the query consists only of the core, then:

E0[|Ri| | δ] = b(Ri) + C(G)/b(T)    (3)
(C(G) is the coefficient of G, see Fig. 2.) If the query has subgoals other than the core, then the expected cardinalities are unchanged. We prove the lemma in the full version of the paper. From the lemma we derive:

Theorem 5 (Composite, non-NF). Let Σ be a statistical program where all queries are project-semi-join queries and do not share common subgoals. Then an EM model for Σ has the following parameters:
– For every base relation Rj, the parameter is αj = bj/N^{arity(Rj)}.
– For every view assertion vj, the parameter is αj = N^{E(G)}/bj, where G is the core of vj.
Here the numerical values bj are obtained by solving the system of equations (2) and (3).

It is interesting to compare the solution for an NF program to that for a non-NF program (Theorems 4 and 5). For NF programs all parameters have the form αi = d/N^a, for integer a > 0. For non-NF programs some parameters have the form N^a/d and, thus, go to infinity.
Example 6. Consider the following statistical program³:

#R1 = d1
#R2 = d2
#v = d3, where v(x) :- R1(x), R2(x)

Thus, we are given the sizes of R1, R2, and of their intersection. We introduce a new relation symbol T and the constraint δ = T(x) ⇔ R1(x), R2(x), then define the program in normal form:

#R1 = d1
#R2 = d2
#T = d3

The theorem gives us the EM solution as follows. The core is the entire query, hence we define the prior:

P0(R1(a)) = e1/N
P0(R2(a)) = e2/N
P0(T(a)) = 1 − e3/N

where P0(R1(a)) denotes the marginal probability of the tuple R1(a). Note that T has a very large probability. This gives us an EM model for our initial statistical program if we solve for e1, e2, e3 in:

d3 = e1 e2 / e3
d1 = e1 + e1 e2 / e3
d2 = e2 + e1 e2 / e3
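This particular system has a closed-form solution: subtracting the first equation from each of the other two gives e1 = d1 − d3 and e2 = d2 − d3, and then e3 = e1·e2/d3. A quick check with sample numbers of our own:

```python
# Closed-form solution of the system in Example 6, checked against the
# three original equations. The statistics d1, d2, d3 below are sample
# values chosen for this sketch.

def solve_example6(d1, d2, d3):
    e1 = d1 - d3           # from d1 = e1 + d3
    e2 = d2 - d3           # from d2 = e2 + d3
    e3 = e1 * e2 / d3      # from d3 = e1*e2/e3
    return e1, e2, e3

d1, d2, d3 = 100.0, 80.0, 20.0   # |R1|, |R2|, |R1 intersect R2|
e1, e2, e3 = solve_example6(d1, d2, d3)

assert abs(e1 * e2 / e3 - d3) < 1e-9
assert abs(e1 + e1 * e2 / e3 - d1) < 1e-9
assert abs(e2 + e1 * e2 / e3 - d2) < 1e-9
```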
Example 7. A more complex example is a statistical program that uses the following project-semi-join view:

v6(x1, x2, x3) :- R1(x1, x2, x3, y), R2(x2, x3), R3(x2, x3, z), R4(x1, x2, x3), R5(x1, x2, x3)

The core consists of R2 and R3, and so we define P0[t ∈ T] = 1 − eT/N², where eT is chosen such that d2 d3/eT = d6.
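Finding the core is a simple count of head variables per subgoal (well defined here because project-semi-join queries have no self-joins). A sketch checking the claim for v6, using the same query representation as our earlier snippets:

```python
# Compute the core of a project-semi-join query: the set of subgoals that
# contain the fewest head variables. Keying by relation name assumes no
# self-joins, which holds for project-semi-join queries by Definition 5.

head = ("x1", "x2", "x3")
v6 = [
    ("R1", ("x1", "x2", "x3", "y")),
    ("R2", ("x2", "x3")),
    ("R3", ("x2", "x3", "z")),
    ("R4", ("x1", "x2", "x3")),
    ("R5", ("x1", "x2", "x3")),
]

def core(query, head_vars):
    count = {r: sum(v in head_vars for v in args) for r, args in query}
    fewest = min(count.values())
    return {r for r, c in count.items() if c == fewest}

assert core(v6, head) == {"R2", "R3"}  # the claim of Example 7
```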
4 Bucketization
Finally, we re-introduce range predicates like x < c, both in the constraints and in the statistical assertions. To extend the asymptotic analysis, we assume that all constants are expressed as fractions of the domain size N; e.g., in Ex. 3 we have v1(x, y) :- R(x, y), x < 0.60N.
³ Its partition function is (1 + x1 + x2 + x1x2)^N. Intuitively, this is because each of the N elements of the domain either is in R1 only (and so pays x1), is in R2 only (and so pays x2), is in both R1 and R2 (and so pays x1x2), or is in neither (and pays 1).
Let R̄ = R1, ..., Rm be a relational schema, and consider a statistical program Σ, Γ with range queries over the schema R̄. We translate it into a bucketized statistical program Σ⁰, Γ⁰ over a new schema R̄⁰, as follows. First, use all the constants that occur in the constraints or in the statistical assertions to partition the domain into b buckets, D = D1 ∪ D2 ∪ ... ∪ Db. Then define as follows:
– For each relation name Rj of arity a, define b^a new relation symbols Rj^{i1···ia}, where i1, ..., ia ∈ [b]; then R̄⁰ is the schema consisting of all relation names Rj^{i1···ia}.
– For each conjunctive query q with range predicates, denote buckets(q) = {q^ī | ī ∈ [b]^{|Vars(q)|}} the set of queries obtained by associating each variable in q to a unique bucket and annotating the relations accordingly. Each query in buckets(q) is a conjunctive query over the schema R̄⁰, without range predicates, and q is logically equivalent to their union.
– Let BV = ∪ {buckets(v) | (v, d) ∈ Σ} (we include in BV queries up to logical equivalence), and let cu denote a constant for each u ∈ BV, s.t. for each statistical assertion #v = d in Σ the following holds:

Σ_{u ∈ buckets(v)} cu = d    (4)

Denote Σ⁰ the set of statistical assertions #u = cu, u ∈ BV.
– For each inclusion constraint w ⇒ R in Γ, create b^{|Vars(w)|} new inclusion constraints of the form w^{j̄} ⇒ R^{ī}; call Γ⁰ the set of new inclusion constraints.

Then the following holds:

Proposition 1. Let Σ⁰, Γ⁰ be the bucketized program for Σ, Γ. Let β̄ = (βk) be the EM model of the bucketized program. Consider some parameters ᾱ = (αj). Suppose that for every statistical assertion #vj = dj in Σ condition (4) holds, and the following condition holds for every query uk ∈ BV:

βk = Π_{j : uk ∈ buckets(vj)} αj    (5)
Then ᾱ is a solution to the EM model for Σ, Γ.

This gives us a general procedure for solving the EM model for programs with range predicates: introduce new unknowns cu and add Equations (4) and (5), then solve the EM model for the bucketized program under these new constraints.

Example 8. Recall Example 3: we have two statistics #σ_{A≤0.60N}(R) = d1 and #σ_{A≥0.25N}(R) = d2. The domain D is partitioned into three domains, D1 = [1, 0.25N), D2 = [0.25N, 0.60N), and D3 = [0.60N, N], and we denote N1, N2, N3 their sizes. The bucketization procedure is this. Define a new schema
R1, R2, R3, with the statistics #R1 = c1, #R2 = c2, #R3 = c3, then solve it subject to Equations (5):

β1 = α1
β2 = α1 α2
β3 = α2

We can solve for R1, R2, R3, since each Ri is given by a binomial distribution with tuple probability βi/(1 + βi) = ci/Ni. Now use Equations (4), c1 + c2 = d1 and c2 + c3 = d2, to obtain:

N1 α1/(1 + α1) + N2 α1α2/(1 + α1α2) = d1
N3 α2/(1 + α2) + N2 α1α2/(1 + α1α2) = d2

Solving this gives us the EM model. Consistent histograms [5] pursue a similar goal of using entropy maximization to capture statistics on overlapping intervals, but use a different, simpler probabilistic model based on frequencies.
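These two equations are smooth and monotone in α1, α2, so a plain Newton iteration with a numeric Jacobian solves them. Bucket sizes and statistics below are sample values of our own (N = 100, buckets of sizes 24, 35, 41), not numbers from the paper:

```python
# Newton solver for the two equations of Example 8, with a finite-difference
# Jacobian. The inputs N1, N2, N3, d1, d2 are sample values for this sketch.

N1, N2, N3 = 24.0, 35.0, 41.0
d1, d2 = 30.0, 40.0

def residual(a1, a2):
    m = a1 * a2 / (1 + a1 * a2)            # tuple probability in the middle bucket
    f1 = N1 * a1 / (1 + a1) + N2 * m - d1
    f2 = N3 * a2 / (1 + a2) + N2 * m - d2
    return f1, f2

a1, a2 = 1.0, 1.0                          # starting point
for _ in range(50):
    f1, f2 = residual(a1, a2)
    h = 1e-7                               # finite-difference step
    j11 = (residual(a1 + h, a2)[0] - f1) / h
    j12 = (residual(a1, a2 + h)[0] - f1) / h
    j21 = (residual(a1 + h, a2)[1] - f2) / h
    j22 = (residual(a1, a2 + h)[1] - f2) / h
    det = j11 * j22 - j12 * j21
    a1 -= (f1 * j22 - f2 * j12) / det      # Newton step via Cramer's rule
    a2 -= (f2 * j11 - f1 * j21) / det

f1, f2 = residual(a1, a2)
assert abs(f1) < 1e-6 and abs(f2) < 1e-6   # both equations satisfied
assert a1 > 0 and a2 > 0                   # valid odds
```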
5 Related Work
There are two bodies of work most closely related to this paper. The first is the work on cardinality estimation. As noted above, while a variety of synopsis structures have been proposed for cardinality estimation [12,13,14,15], they have all focused on various subclasses of queries, and deriving estimates for arbitrary query expressions has involved ad hoc steps, such as the independence and containment assumptions, which result in large estimation errors [16]. In contrast, we ask what the right framework is for performing cardinality estimation over arbitrary expressions in the presence of incomplete information, and we approach this task via the EM principle. The EM model has been applied in prior work to the problem of cardinality estimation [4,5]; however, the focus was restricted to queries that consist of conjunctive selection predicates over single tables. In contrast, we explore a full-fledged EM model that can incorporate statistics involving arbitrary first-order expressions.

The second body of work is that on probabilistic databases [17], which focuses on efficient query evaluation over a probabilistic database. The input statistics admit many possible distributions over the possible worlds, and we choose the distribution that has maximum entropy. Our focus in this paper is on deriving the parameters of this EM distribution; the related problem of query estimation for a given model is not addressed here, and is closely related to the problem of evaluating queries over probabilistic databases.

Finally, we observe that entropy maximization is a well-established principle in statistics for handling incomplete information [3]. As with probabilistic databases, new challenges emerge in the context of database systems; in our case, they concern the nature of the statistics.
6 Conclusion
In this paper we propose to model arbitrary database statistics using an Entropy-Maximization probability distribution. This model is attractive because any query has a well-defined size estimate, all statistics are treated as a whole rather than as individual synopses, and the model extends smoothly when new statistics are added. We reported several results that give explicit asymptotic solutions to statistical programs. As part of our technical development we described a technique, encapsulated as the conditioning theorem (Theorem 2), that is of independent interest and is likely to be applicable to other statistical programs. We leave for future work the second part of our program: using an EM model to obtain query size estimates. This has been solved in the past only for the independent case [10].
References

1. Stillger, M., Lohman, G.M., Markl, V., Kandil, M.: LEO - DB2's LEarning Optimizer. In: VLDB (2001)
2. Chaudhuri, S., Narasayya, V.R., Ramamurthy, R.: Diagnosing Estimation Errors in Page Counts Using Execution Feedback. In: ICDE (2008)
3. Jaynes, E.: Probability Theory: The Logic of Science. Cambridge University Press, Cambridge (2003)
4. Markl, V., Megiddo, N., et al.: Consistently estimating the selectivity of conjuncts of predicates. In: VLDB (2005)
5. Srivastava, U., Haas, P., Markl, V., Kutsch, M., Tran, T.M.: ISOMER: Consistent histogram construction using query feedback. In: ICDE (2006)
6. Erdős, P., Rényi, A.: On the evolution of random graphs. Magyar Tud. Akad. Mat. Kut. Int. Kozl. 5, 17–61 (1960)
7. Bacchus, F., Grove, A., Halpern, J., Koller, D.: From statistical knowledge bases to degrees of belief. Artificial Intelligence 87(1-2), 75–143 (1996)
8. Dalvi, N., Miklau, G., Suciu, D.: Asymptotic conditional probabilities for conjunctive queries. In: Eiter, T., Libkin, L. (eds.) ICDT 2005. LNCS, vol. 3363, pp. 289–305. Springer, Heidelberg (2004)
9. Kaushik, R., Ré, C., Suciu, D.: General database statistics using entropy maximization: Full version. Technical Report #05-09-01, University of Washington, Seattle, Washington (May 2009)
10. Dalvi, N., Suciu, D.: Answering queries from statistics and probabilistic views. In: VLDB (2005)
11. Dalvi, N.: Query evaluation on a database given by a random graph. Theory of Computing Systems (to appear, 2009)
12. Ioannidis, Y.E.: The History of Histograms. In: VLDB (2003)
13. Olken, F.: Random Sampling from Databases. PhD thesis, University of California at Berkeley (1993)
14. Deligiannakis, A., Garofalakis, M.N., Roussopoulos, N.: Extended wavelets for multiple measures. ACM Trans. Database Syst. 32(2) (2007)
15. Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking Join and Self-Join Sizes in Limited Storage. In: PODS (1999)
16. Ioannidis, Y.E., Christodoulakis, S.: On the propagation of errors in the size of join results. In: SIGMOD (May 1991)
17. Dalvi, N., Suciu, D.: Management of probabilistic data: Foundations and challenges. In: PODS, Beijing, China, pp. 1–12 (2007) (invited talk)
A Proof of Theorem 1
The "only if" direction is very simple to derive by using Lagrange multipliers for solving:

F0 = Σ_{I∈I} pI − 1 = 0    (6)
∀i = 1, ..., s: Fi = Σ_{I∈I} |vi(I)| pI − di = 0    (7)
H = maximum, where H = −Σ_{I∈I} pI log pI    (8)

According to that method, one has to introduce s + 1 additional unknowns, λ0, λ1, ..., λs: an EM distribution is a solution to a system of |I| + s + 1 equations consisting of Eq. (6), (7), and the following |I| equations:

∀I ∈ I: ∂(H − Σ_{i=0,s} λi Fi)/∂pI = 0

which, after absorbing the constant term of ∂H/∂pI into λ0 and flipping the signs of the multipliers, reads:

log pI − (λ0 + Σ_{i=1,s} λi |vi(I)|) = 0

This implies pI = exp(λ0 + Σ_{i=1,s} λi |vi(I)|), and the claim follows by denoting ω = exp(λ0) and αi = exp(λi), i = 1, s.
Author Index

Benedikt, Michael 1
Calvanese, Diego 18
Caron, Anne-Cécile 52
Cheney, James 1
Cooper, Ezra 36
De Giacomo, Giuseppe 18
Fujiwara, Toru 68
Groz, Benoît 52
Hashimoto, Kenji 68
Ishihara, Yasunori 68
Kaushik, Raghav 84
Lenzerini, Maurizio 18
Morimoto, Takuji 68
Ré, Christopher 84
Roos, Yves 52
Shimizu, Shougo 68
Staworko, Slawomir 52
Suciu, Dan 84
Tison, Sophie 52
Vardi, Moshe Y. 18