pairs where the value corresponds to the time taken to respond to a request for quote: SELECT QUOTE_UUID, DURATION FROM QUOTE_REQUESTS Q
The following function instead returns all quote requests processed in a time that exceeded the one specified as part of the SLA stipulated with the customer: SELECT QUOTE_UUID FROM QUOTE_REQUESTS Q, SLA, CUSTOMER C WHERE Q.CUSTOMER_ID=C.ID AND SLA.CUSTOMER_ID=C.ID AND Q.DURATION>SLA.MAX_DURATION
In general, a mapping is characterized by the following properties:
− A name (e.g., compute SLA violations).
− The data type of the values computed by the function, which can be numeric or Boolean. Numeric metrics are computed by numeric mappings. Taxonomical metrics are instead computed by associating, to each category in the taxonomy, a different Boolean mapping: if the mapping returns true for an element, then the element belongs to the category.
− The entity whose elements are labeled. For example, a mapping can associate values to requests for quotes.
− The mapping function, which returns a set of pairs <element ID, value>.
− The name and types of the mapping function parameters, if any.
Conceptually, the mapping function can be expressed in any language. In the cockpit, mapping functions are defined by means of SQL statements, for performance reasons.
Metrics give semantics to the values computed by the mappings. A metric is characterized by a definition and an implementation part. The definition part states i) the name of the metric; ii) the data type, which can be numeric, Boolean, or taxonomy; iii) the entity whose elements are to be measured (e.g., cost is computed for orders). The implementation part of a metric defines which mapping functions should be used to compute it. For example, a metric quote request SLA violation can be computed based on the Boolean mapping described above for those customers that had agreed on a certain SLA. Note that different customers can have different SLA targets.
In summary, metrics measure elements of entities via mappings. The metric model allows a metric to be “polymorphic”, i.e., computed via different mappings, depending on the element being measured (i.e., based on the business context to which the element belongs). The possibility of associating the same metric to different mappings is very important since it allows the definition of homogeneous metrics over heterogeneous elements.
An example is the SLA violation metric discussed above, where the computation logic depends on the customer (SLA logics can be very complex). In this case, the analyst can define one metric and state that a different mapping should be used based on the context (customer). The analyst will be able to evaluate SLA violations by simply looking at one metric, regardless of the way violations are computed. Contexts can be identified by any partitioning over the set of elements. However, to simplify modeling, partitioning is specified in terms of relationships to other entities or in terms of an entity’s attributes. For example, orders can be partitioned based on the customer or on their priority level. Note that the cockpit knows how to compute whether an element belongs to a context, since attributes and relationships have been described as part of the business model.
The key to reducing the development effort consists in allowing the definition of functions that are easily customizable and highly reusable, so that the same function can be leveraged to compute different metrics. For example, a mapping function that returns the value of a quote request attribute is completely generic, in that it does not “hardcode” the definition of the metric it is computing, nor the context to which it applies. As such, this function becomes handy for computing lots of different metrics. It is up to the cockpit, as we will see later, to use mappings in such a way that they compute the right metric on the right context with the right parameters and in the most efficient way. Hence, the development time for defining a new report tends to zero as the number of metrics grows. This is indeed a very concrete benefit. In a supply chain scenario, only nine mappings were needed to define all the metrics underlying the reporting application, consisting of more than 80 reports. In this way, the software development and testing effort is drastically reduced.
Reporting Model. The reporting model defines how we want to look at a metric. A report is characterized by the metric to be reported and by the desired statistics on this metric (e.g., average). In addition, grouping and filtering conditions can be specified, based on entity attributes and relationships. Thanks to the business model, the cockpit knows how to compute these. In addition, users can define the drill-down behavior, choosing from several options, including: displaying another specified report; executing a query; executing the code defined in a Java class; accessing a specified Web page. Finally, the report definition includes scheduling times and the preferred visualization (charting) technique. The definition of reports is done by means of a point-and-click GUI that is aware of the business domain and of the metric definitions, and therefore knows what options (in terms of, e.g., aggregation) can be provided. For example, it is aware that it is possible to aggregate orders by customers, but not vice versa.
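To make the reuse idea concrete, the following sketch shows how a single generic mapping (the quote request attribute value function mentioned above) could serve several metrics simply by registering different parameter values and context restrictions. It is a minimal illustration in Python with an embedded SQL template: the table and column names are taken from the examples in this paper, but the registry structure and the SEGMENT column are invented simplifications of the cockpit's internal metric-definition tables (described in the next section), not its actual schema.

```python
# One generic mapping: "value of a quote request attribute". The attribute
# name is a parameter, so the same SQL template can feed many metrics
# (duration, discount, priority, ...) without hardcoding any of them.
GENERIC_MAPPING_SQL = """
    SELECT Q.QUOTE_UUID AS ELEMENT_ID, QD.VALUE
    FROM QUOTE_REQUESTS Q, QUOTE_DATA QD, QUOTE_DATA_DEFS QDD
    WHERE QDD.NAME = :attribute_name
      AND QDD.ID = QD.DATA_DEF_ID
      AND QD.QUOTE_UUID = Q.QUOTE_UUID
"""

# Simplified stand-in for the cockpit's metric-definition tables: each metric
# reuses the same mapping with its own parameter values and optional context.
METRIC_REGISTRY = [
    {"metric": "quote duration", "mapping": GENERIC_MAPPING_SQL,
     "params": {"attribute_name": "DURATION"}, "context": None},
    {"metric": "quote discount (gold customers)", "mapping": GENERIC_MAPPING_SQL,
     "params": {"attribute_name": "DISCOUNT"},
     # SEGMENT is a hypothetical customer attribute used only for illustration.
     "context": "Q.CUSTOMER_ID IN (SELECT ID FROM CUSTOMER WHERE SEGMENT = 'gold')"},
]

def compute_metric(conn, entry):
    """Run the registered mapping with the metric-specific parameters.

    conn is any DB-API connection whose driver supports named parameters
    (e.g. sqlite3); the cockpit instead rewrites the SQL itself, as shown
    in the next section.
    """
    sql = entry["mapping"]
    if entry["context"]:
        sql += "      AND " + entry["context"]
    cur = conn.cursor()
    cur.execute(sql, entry["params"])
    return {(entry["metric"], element_id): value
            for element_id, value in cur.fetchall()}
```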
3 Mapping Transformations
In the cockpit, definitions of the domain, metric, and reporting schema (along with other configuration information) are performed through a Java or a Web-based GUI that accesses the cockpit definition API. The most interesting aspect in the definition process lies in the way mappings are handled. The cockpit keeps all metric definition data in a few database tables. In particular, a table meters defines the mappings used
by each metric, a table contexts stores information about the context for which a certain <metric,mapping> association should be applied, and a table meter_parameters defines which values should be given to mapping parameters when the mapping is used to compute a certain metric. For example, meter meterForCost may state that metric cost is measured by mapping numResourcesUsedTimesUnitCost (which has the unit cost as a parameter). An entry in table contexts may state that meterForCost is only applied to the purchase process. An entry in meter_parameters may state that for meterForCost, a certain cost value (e.g., 20$) should be used.
Mappings as written by the user have no knowledge of the cockpit database schema, as users should not be required to know the intricacies of the cockpit’s internals, but just the domain data. The cockpit extends mapping definitions so that they can not only label elements with metric values, but also identify which metrics are being computed, what the different parameter values are, and what the different contexts are to which the mapping should be applied when computing each metric. For example, Figure 1 shows how the cockpit transforms the code of mapping quote request attribute value. The modified query accesses the different tables that store metric definition data and returns, in addition to element IDs and values, also the metric to which these values refer. Also, the where clause is extended with the capability of restricting the context as specified by the metric definition.
The code that extends the mapping (like all of the cockpit code) is generic and has no a priori knowledge of the business domain. However, due to the way the domain model has been defined, the business domain description contains all the information necessary to perform the mapping transformation mentioned above, which can therefore be executed at mapping definition time. This post-processing of the mapping definition is what enables the creation of queries that are simple and generic, since the cockpit dynamically figures out what they should compute and on what data, as opposed to hardcoding this information into the query. Another benefit of this approach is that, thanks to the set-oriented nature of SQL, one execution of a given mapping computes all metrics that have been defined on top of it. Since many metrics typically depend on the same mappings, the computation time remains quasi-constant as the number of metrics grows.

SELECT Q.QUOTE_UUID, QD.VALUE, Q.TIMESTAMP,
       CK_M.ID                                    -- the computed metric
FROM QUOTE_REQUESTS Q, QUOTE_DATA QD, QUOTE_DATA_DEFS QDD,
     METRICS CK_M, METERS CK_AM, CONTEXTS CK_C, METER_PARS EMP
WHERE CK_AM.MAPPING_ID = 5                        -- determines all the metrics
  AND CK_AM.METRIC_ID = CK_M.ID                   -- based on this mapping
  AND CK_C.METER_ID = CK_AM.ID
  AND ( CK_C.CONTEXT_ELEMENT_ID IS NULL OR        -- implements the context
        (Q.QUOTE_UUID, CK_C.CONTEXT_ELEMENT_ID)   -- restrictions defined by all the
          IN (SELECT * FROM CK_CX_QUOTE_TO_PRODUCT))  -- metrics implemented by this mapping
  AND EMP.METER_ID = CK_AM.ID                     -- retrieves all the actual parameters
  AND QDD.NAME = EMP.PAR1                         -- used by all the metrics
  AND QDD.ID = QD.DATA_DEF_ID                     -- implemented by this mapping
  AND QD.QUOTE_UUID = Q.QUOTE_UUID
Fig. 6. Wrapping of mapping definitions. Original code is in bold.
4 Concluding Remarks
In summary, the proposed approach considerably simplifies the development of reporting solutions, both for verticals (e.g., for a financial application) and for middleware applications (e.g., it can be used to provide reports for workflows or messaging systems). The use of modules that encapsulate metrics and mappings makes such deployments easy to perform and to maintain. Further details and publications on this topic are available at http://www.hpl.hp.com/techreports/.
BISON: Providing Business Information Analysis as a Service Hakan Hacıgümüş, James Rhodes, Scott Spangler, and Jeffrey Kreulen IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA hakanh@acm.org, {jjrhodes, spangles, kreulen}@us.ibm.com
Abstract. We present the architecture of a Business Information Analysis provisioning system, BISON. The service provisioning system combines two prominent domains, namely structured/unstructured data analysis and service-oriented computing. We also discuss open research problems in the area.
1 Introduction
Today’s highly competitive business environment challenges enterprises to push their limits to take advantage of all available information to improve business performance and stay competitive. This challenge becomes more difficult with the ever increasing amount of data from disparate sources. Although there is a glut of data generated by various sources, those data mostly are not in a form that could be directly used to support critical business decision making processes. Thus, the essential problem is transforming the data into information that provides insights into the business operations and the competitiveness measures. Typically, the data are available from heterogeneous resources in varied formats. It is well known that the amount of unstructured data in text form far surpasses the amount of available data in structured form. Therefore, one important step in exploiting the available resources is structuring the inherently unstructured data in meaningful ways. A well-established first step in gaining understanding is to segment examples into meaningful categories. This leads to the idea of taxonomies. Taxonomies are meaningful hierarchical categorizations of documents into topics reflecting the natural relationships between the documents and their business objectives. Clearly there is a need for systems that enable knowledge workers to take full advantage of available data sources and generate business reports in a timely manner. We have developed such a system, called Business Insights Workbench (BIW), at the IBM Almaden Research Center. BIW has been used in numerous customer setups to solve complex problems that require understanding and analysis of very large textual data sets to fulfill the business objectives.
Service-oriented computing is a prominent new computing paradigm. It allows organizations to seamlessly integrate heterogeneous resources that are available internally or externally. Services are defined as autonomous, platform-independent elements that are described, implemented, published, and discovered using standard protocols. We envisioned the union of these two technologies, namely business information analysis and service-oriented computing, to deliver even greater value
for the customers. The resultant computing environment offers scores of new competencies such as better integration with intra- and inter-organization computing capabilities, enhanced and custom computing environments by composing available services from different providers, larger customer base coverage, and solutions to clustering, scalability, extensibility, dynamic provisioning, and fault tolerance.
2 Business Insights Workbench (BIW)
In this section we give an overview of our business information analysis system, BIW. The details of the system can be found in [1] and [2]. BIW is a comprehensive data analysis application that allows a knowledge worker to learn from large collections of unstructured documents. The tool can be used to automatically categorize a large collection of text documents and then to provide a broad spectrum of controls to refine the building of an arbitrarily complex hierarchical taxonomy. The applicable tasks are grouped under three categories, namely, Explore, Understand, and Analyze. The Explore operation performs the selection of the data of interest via queries or search from the data sources and summarizes structured values via metrics. It includes data-specific functions such as full text search, database drill down, data join, intersect, and subset selection. The Understand operation uses text mining to extract higher level features from unstructured information. It provides tools to edit generated taxonomies, visualize relationships among categories, build text models that can be applied to other data sets, and use nearest neighbor techniques to find related documents. The Analyze function examines the intersections between taxonomies and structured information. It is used to discover trends and correlations, visualize data categories over time, analyze relationships between categories in different taxonomies, and compare structured and unstructured information. The Analyze function essentially combines the structured and unstructured data sources to provide a complete view.
3 Service-Oriented Approach
We have been developing a business information analysis service provisioning system, BISON, based on the Business Insights Workbench presented above. The architecture diagram of the BISON system is shown in Figure 1. We use J2EE standards as the implementation framework. Communications among the system entities, including the clients, are implemented as Web Services. Data Sources may provide structured and unstructured data and metadata information, such as full text indexes. Typical examples of text indexes are indexes created by crawling engines for increased full-text search performance. Data sources expose all their characteristic information, such as the name of the source, the schema of the database, the types of data available, and the data access mechanisms, as web services. Data sources are registered in the services directory so that they can be discovered by the other system entities. The system architecture allows data sources to dynamically join or leave the resource pool, or change their service characteristics.
Fig. 1. The Architecture of BISON (Client, Service Directory, Taxonomy Analysis Service offering Explore, Understand, and Analyze, Data Access Service offering Explore and Analyze, Client Data Source, Data Sources 1 and 2, and a Service Control Center for billing, policies, SLAs, monitoring, configuration, and administration, all communicating through Web Services (WS) and exchanging service management information)
Data Access Service is primarily responsible for querying the data sources to retrieve the data of interest and performing database-oriented tasks. It also supports analyze functions over the data sources. The data access service is instrumented to facilitate service provisioning specific tasks. These tasks include access control, metering, and service monitoring for QoS and Service Level Agreements (SLAs). This information, along with the service management information from the other system entities, is collected, monitored, and acted upon by the Service Control Center.
Taxonomy Analysis Service provides all of the enhanced analysis capabilities defined over the taxonomies, that is, the explore, understand, and analyze functions. The separation of the data access service and the taxonomy analysis service provides service flexibility and scalability.
Service Control Center oversees the service provisioning tasks in the system. It monitors and aggregates the system management data provided by the individual services. This aggregate information is used for billing, configuration management, dynamic provisioning, SLA management, and policy enforcement.
The client is the consumer of the services. Typically there are two different types of clients in our system. First, there are individual users who connect to the service directly by using a web browser. The second category of clients is the service providers; they use our services to provide additional services to their end-users. The Client Data Source is a temporary data source that is created to maintain the intermediate results between the service calls for the client.
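As a rough illustration of how instrumented services might feed the Service Control Center, the sketch below shows a metered call path in Python. It is not BISON code (the actual system is built on J2EE and Web Services); the class and method names are invented, and only the idea that every service operation reports who called it and how long it took, so that billing and SLA monitoring can be derived centrally, is taken from the text.

```python
import time
from dataclasses import dataclass, field

@dataclass
class ServiceControlCenter:
    """Collects metering/monitoring events emitted by the instrumented services."""
    events: list = field(default_factory=list)

    def record(self, service, operation, client_id, elapsed_s):
        self.events.append({"service": service, "operation": operation,
                            "client": client_id, "elapsed_s": elapsed_s})

    def usage_by_client(self):
        usage = {}
        for e in self.events:
            usage[e["client"]] = usage.get(e["client"], 0.0) + e["elapsed_s"]
        return usage  # input to billing / SLA checks


class DataAccessService:
    """Query-oriented service; every call is metered for the control center."""
    def __init__(self, control_center):
        self.scc = control_center

    def explore(self, client_id, query):
        start = time.perf_counter()
        result = self._run_query(query)          # would hit a registered data source
        self.scc.record("DataAccessService", "explore",
                        client_id, time.perf_counter() - start)
        return result

    def _run_query(self, query):
        return []  # placeholder for the real data-source call
```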
4 Current Research and Technology Issues
Combining business information analysis applications and service-oriented computing presents certain research and technology issues. We describe some of those problems in this section.
Metering. Metering is important for the cost analysis of the service delivery and for generating the billing information. Metering is also important from the system management perspective, as it provides valuable information about the performance of the system components. The real challenge is defining metrics for metering that would be meaningful for the information management functions.
Monitoring. Along with the overall service and system monitoring, we need application-level monitoring techniques. Application-level monitoring helps in problem determination and also enables application-specific billing. Given that each application has its own characteristics, developing a common monitoring methodology that could be applied to all of the possible service components is a challenging problem.
Data Security and Privacy. The security of the data sources includes preventing unauthorized access to the data sources and preventing data disclosure to parties who are not entitled to use particular parts of the data. In addition, the clients may be concerned about revealing their identity through their interest in certain data sets. As a more challenging problem, if the client actually owns a data source and just uses the information analysis services, then the possible disclosure of confidential information to the service provider becomes an issue.
Business Integration. In most cases the information analysis is part of higher level business processes at the client organizations. Hence, we need to develop methodologies that allow us to model our system processes and tie them into higher level business processes.
5 Conclusion
We have presented the architecture of a Business Information Analysis provisioning system, BISON. The service provisioning system combines two very important domains, structured/unstructured data analysis and service-oriented computing. We have also described some open research problems in this area.
References 1. Scott Spangler, Jeffrey T. Kreulen: Interactive methods for taxonomy editing and validation, CIKM 2002 2. William F. Cody, Jeffrey T. Kreulen, Vikas Krishna, W. Scott Spangler: The integration of business intelligence and knowledge management. IBM Systems Journal 41(4): 697-713 (2002)
The Design and Architecture of the τ-Synopses System Yossi Matias, Leon Portman, and Natasha Drukh School of Computer Science, Tel Aviv University matias@cs.tau.ac.il, leon.portman@nice.com, kreimern@cs.tau.ac.il Abstract. Data synopses are concise representations of data sets that enable effective processing of approximate queries to the data sets. The τ-Synopses system is designed to provide a run-time environment for remote execution of multiple synopses for both relational and XML databases. The system can serve as an effective research platform for experimental evaluation and comparison of different synopses, as well as a platform for studying the effective management of multiple synopses in a federated or centralized environment.
1 Introduction
In large data recording and warehousing environments, it is often advantageous to provide fast, approximate answers to queries, whenever possible. The goal is to provide a quick response that is orders of magnitude faster than the time to compute an exact answer, by avoiding or minimizing the number of accesses to the base data. Approximate query processing is supported by synopses, which are compact representations of the original data, such as histograms, samples, wavelet-synopses or other methods [1]. In the AQUA system [2], synopses are precomputed and stored in a DBMS. The system supports approximate answers by rewriting queries originally directed to the base tables to run on these synopses, and it enables keeping synopses up-to-date as the database changes. The question of how to reconcile various synopses for large data sources with many tables was studied in [3]. The τ-Synopses system was designed to provide a run-time environment for execution of multiple synopses. The system can serve as an effective research platform for experimental evaluation and comparison of different synopses, as well as a platform for studying the effective management of multiple synopses. The synopses can be placed in a centralized environment, or they can function as web services in a federated architecture. A software demo of the system as a federated environment with remote execution of synopses was presented in [5]. A software demo of the system with emphasis on synopses management in a centralized environment was presented in [4]. The system currently includes several dozen synopses for both relational and XML databases.
This paper presents a high-level overview of the architecture and functionality of the system. For more details, please refer to the full paper [6].
2 The τ-Synopses Functionality
The main operational processes supported by the τ -Synopses system are: constructing and updating multiple pluggable synopses, interception and analysis of query workload, interception and analysis of data updates, approximate query processing, synopses management, and benchmarking. The user interface provides an administrator user with a capability to manage data sources, synopses specifications, updates and pre-defined workloads. Figure 1 depicts the main administration UI.
Fig. 1. The Administration UI
End users can test and compare different synopses that are registered in the system. In the Query execution mode a user can evaluate a single synopsis at a time. In the Query mode, the user selects the synopsis to be evaluated. Relational queries can be of the following structure: SELECT Sum(Data) FROM Relation WHERE filter > l AND filter < h. For XML synopses, the queries are XPath expressions. The system validates the user input expression for the XPath syntax and the tag labels are validated against existing labels of the underlying XML document. The Query mode also allows the evaluation of multiple queries at a time by specifying a workload to be evaluated. The approximate results obtained using the registered synopses are depicted together with the exact results computed by the system.
The Benchmark mode enables multiple synopses evaluation over pre-defined workloads and their comparison using visual display; see Figure 2. The user selects the synopses to be evaluated and the workload to be used for the evaluation. For the performance measurements, the minimum, maximum and step size for the synopses construction are set by the user. The system invokes the construction of the different synopses, and these synopses are then evaluated over the selected workload. The system can compute the accuracy of the different synopses using several error metrics.
Fig. 2. Benchmark Mode
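To give a flavor of what a registered synopsis computes for such range-aggregate queries, here is a toy Python sketch of an equi-width histogram synopsis answering the SELECT Sum(Data) ... WHERE filter > l AND filter < h template approximately. It is purely illustrative and not code from the τ-Synopses system; the class name, bucket count, and synthetic data are invented.

```python
class EquiWidthHistogramSynopsis:
    """Toy synopsis: per-bucket sums of Data over ranges of the filter column."""

    def __init__(self, rows, num_buckets=32):
        # rows: iterable of (filter_value, data_value) pairs from the base relation
        filters = [f for f, _ in rows]
        self.lo, self.hi = min(filters), max(filters)
        self.width = (self.hi - self.lo) / num_buckets or 1.0
        self.bucket_sums = [0.0] * num_buckets
        for f, d in rows:
            idx = min(int((f - self.lo) / self.width), num_buckets - 1)
            self.bucket_sums[idx] += d

    def estimate_sum(self, l, h):
        """Approximate SELECT Sum(Data) WHERE filter > l AND filter < h."""
        total = 0.0
        for idx, s in enumerate(self.bucket_sums):
            b_lo = self.lo + idx * self.width
            b_hi = b_lo + self.width
            overlap = max(0.0, min(h, b_hi) - max(l, b_lo))
            total += s * (overlap / self.width)   # uniform-within-bucket assumption
        return total

# Example: build from synthetic data and answer a range-sum query approximately.
rows = [(i, i * 0.5) for i in range(1000)]
syn = EquiWidthHistogramSynopsis(rows)
print(syn.estimate_sum(100, 200))   # compare against the exact sum over (100, 200)
```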
3 Architecture
In order to provide an effective operational and research platform, the τ-Synopses system commits to the following design goals:
– Pluggable integration
– Remote execution
– Distributed client-server environment
– Flexibility and scalability
– Low bandwidth requirement
The core of the τ-Synopses system architecture features the following components: Query Execution Engine, Synopses Manager, Updates Logger, and Workload Manager. These modules interact with the relational or XML databases which hold the data sets, and with registered synopses that act as web services. These synopses are connected either locally or remotely through a SOAP-enabled platform. The Synopses Manager is used for registration and maintenance of the synopses. A new synopsis is added to the system by registering its parameters (including the list of supported queries and data sets) in the Synopses Manager Catalog. The Query Execution Engine supports an interface for receiving query requests from end-users and invoking the appropriate synopsis (or synopses), as determined by the Synopses Manager, in order to process such queries. The Updates Logger feeds all data updates to the registered synopses by intercepting data update information in the data sources. The Workload Manager captures, maintains and analyzes workload information for building, maintaining and testing synopses.
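The pluggable-integration goal can be illustrated with a small sketch of the kind of contract a registered synopsis might expose to the Synopses Manager and Query Execution Engine. The interface below is invented for illustration; in the real system, synopses are registered in the Synopses Manager Catalog and invoked as local or remote web services rather than as in-process Python objects.

```python
from abc import ABC, abstractmethod

class Synopsis(ABC):
    """Contract a pluggable synopsis module might implement."""

    @abstractmethod
    def build(self, data, space_budget):
        """Construct the synopsis over the base data within a space budget."""

    @abstractmethod
    def update(self, change):
        """Absorb one data update forwarded by the Updates Logger."""

    @abstractmethod
    def estimate(self, query):
        """Return an approximate answer for a supported query."""


class SynopsesManager:
    """Very small stand-in for the Synopses Manager Catalog."""

    def __init__(self):
        self.catalog = []   # (synopsis, supported data set, supported query kinds)

    def register(self, synopsis, data_set, query_kinds):
        self.catalog.append((synopsis, data_set, query_kinds))

    def pick(self, data_set, query_kind):
        for syn, ds, kinds in self.catalog:
            if ds == data_set and query_kind in kinds:
                return syn
        raise LookupError("no registered synopsis for this query")


class QueryExecutionEngine:
    """Receives an end-user query and invokes the synopsis chosen by the manager."""

    def __init__(self, manager):
        self.manager = manager

    def answer(self, data_set, query_kind, query):
        return self.manager.pick(data_set, query_kind).estimate(query)
```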
Fig. 3. General Architecture (clients access the Synopses Framework, which manages registration, workload information, and access to the original data for multiple registered synopses, each backed by its local XML or relational data source)
The system provides a light-weight host process inside which the custom synopses run. The host is responsible for all communication with the system and is transparent to the synopsis. This design enables unconstrained deployment. A remote synopsis can be integrated into the system by deploying or adapting such a host into the remote system, and connecting the synopsis module locally into the host. Figure 3 illustrates an overall view of the system in a distributed environment, consisting of multiple remote synopses, each representing its local data source. The system modules were implemented in the .NET framework, with remote modules communicating through .NET Remoting. Any relational DB can be used as a database provider.
Acknowledgement. We thank Yariv Matia and Daniel Urieli for their contributions to the system development. We also thank the students in various classes at Tel Aviv University who have contributed synopses implementations to the system and for their feedback.
References 1. P. B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. External Memory Algorithms, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, American Mathematical Society, 50, A, 1999. Also in SODA’99. 2. P. B. Gibbons, V. Poosala, S. Acharya, Y. Bartal, Y. Matias, S. Muthukrishnan, S. Ramaswamy, and T. Suel. AQUA: System and techniques for approximate query answering. Technical report, Bell Laboratories, Murray Hill, 1998. 3. A. C. Konig and G. Weikum. A framework for the physical design problem for data synopses. In EDBT 2002 - Advances in Database Technology, March 2002. 4. Y. Matia, Y. Matias, and L. Portman. Synopses reconciliation via calibration in the τ -synopses system. In Software Demo, EDBT’06. 5. Y. Matias and L. Portman. τ -Synopses: a system for run-time management of remote synopses. In Software Demo, ICDE’04, EDBT’04. 6. Y. Matias, L. Portman, and N. Drukh. The design and architecture of the τ -synopses system. Technical Report, Tel Aviv University, 2005.
Integrating a Maximum-Entropy Cardinality Estimator into DB2 UDB
Marcel Kutsch1, Peter J. Haas2, Volker Markl2, Nimrod Megiddo2, and Tam Minh Tran3
1 IBM Boeblingen Laboratory, Schoenaicher Str. 220, 71032 Boeblingen, Germany kutschm@de.ibm.com
2 IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, USA {marklv, phaas, megiddo}@us.ibm.com
3 IBM Silicon Valley Research Laboratory, 555 Bailey Avenue, San Jose, CA 95141, USA minhtran@us.ibm.com
1 Introduction
When comparing alternative query execution plans (QEPs), a cost-based query optimizer in a relational database management system (RDBMS) needs to estimate the selectivity of conjunctive predicates. The optimizer immediately faces a challenging problem: how to combine available partial information about selectivities in a consistent and comprehensive manner [1]. This paper describes a prototype solution to this problem. In more detail, suppose that the optimizer needs to estimate the selectivity s1,2,...,n of a predicate of the form p1 ∧ p2 ∧ · · · ∧ pn defined over a specified relation R, where each simple predicate pi (also called a Boolean factor or BF) is of the form “column op literal.” Here column is the name of a column in R, op is a relational comparison operator such as “=”, “>”, or “LIKE”, and literal is a value in the domain of the column; e.g., MAKE = ’Honda’ or YEAR > 1995. As usual, the selectivity of a predicate p is the fraction of rows in R that satisfy p. Estimates are typically available for simple-predicate selectivities of the form si and, in modern optimizers, partial information is often available in the form of joint selectivity estimates such as s2,3, s1,3,5, and so forth. These joint selectivities are computed from multivariate statistics (MVS) such as multidimensional histograms, column-group statistics [2], and statistics on views. Gaps in the available information are typically filled using uniformity and independence assumptions.
A serious problem now arises in that there may be multiple, non-equivalent ways of estimating the selectivity of a given predicate. E.g., if selectivities s1, s2, s3, s1,2, and s2,3 are available, then we can estimate s1,2,3 as (i) s1,2,3 = s1 · s2 · s3, (ii) s1,2,3 = s1,2 · s3, or (iii) s1,2,3 = s1 · s2,3. Arbitrary, inconsistent choices among non-equivalent selectivity estimates lead to arbitrary, unreliable, and usually suboptimal choices of QEPs. The ad hoc methods for ensuring consistency used in current RDBMSs are cumbersome and expensive, and throw away valuable information; as discussed in [1], the net effect is to bias the optimizer toward those QEPs about which it has the least information.
To address the foregoing problems, the authors have proposed a technique called MAXENT [1], which uses the maximum entropy principle to obtain selectivity estimates. MAXENT exploits all available information in a consistent and principled manner, while avoiding imposition of any extraneous assumptions. We illustrate the basic idea for a predicate p1 ∧ p2 ∧ p3 having three conjuncts, assuming that partial selectivities s1, s2, s3, and s1,2 are available. For a binary string b of length 3, denote by xb the selectivity of the corresponding atom associated with the BFs in the predicate. For example, x110 is the selectivity of the atom p1 ∧ p2 ∧ ¬p3, and so forth. MAXENT provides a principled means of assigning values to the collection of atomic variables x = { xb : b ∈ {0, 1}^3 }, that is, for selecting a relative frequency distribution x over the atoms. Given the MAXENT solution, we can estimate the desired selectivity as s1,2,3 = x111. MAXENT first expresses the available information as a system of equations; in our example, the equations are:
x100 + x110 + x101 + x111 = s1      (1a)
x010 + x011 + x110 + x111 = s2      (1b)
x001 + x011 + x101 + x111 = s3      (1c)
x110 + x111 = s1,2                  (1d)
∑b xb = 1                           (1e)
There are typically many distributions x that satisfy (1a)–(1e); MAXENT selects the distribution with the highest entropy value H(x) = − ∑b xb log xb. Intuitively, H(x) is a measure of uncertainty in the distribution x; by maximizing entropy, MAXENT selects the “simplest” distribution that is consistent with the constraints in (1a)–(1e). In the absence of information, the MAXENT solution reduces to the classical optimizer assumptions of uniformity and independence. The maximum entropy distribution can be computed using a well known algorithm called Iterative Scaling (IS); see [1] for details.
In this paper we describe our experience in implementing a prototype of MAXENT in DB2 UDB. We found ourselves facing a number of challenges. First, the computational complexity of the IS algorithm grows exponentially in the number of BFs, so we needed to decompose the constrained optimization problem into multiple small problems, each involving only a subset of the BFs. Our next problem was that the set of constraints as in (1a)–(1e) was inconsistent in many cases, so that there did not exist a feasible solution x to the optimization problem and the IS algorithm failed to converge. Inconsistent constraints can arise both when statistics are gathered at different times and when selectivity estimates are based on erroneous uniformity and/or independence assumptions. Finally, the IS algorithm can fail to converge even when the constraints are consistent, if the set of constraints implies that a given variable xb must equal 0 in any feasible solution [and hence ∂H(x)/∂xb = −1 − log xb = +∞]. For example, if p1 ⇒ p2, so that s1,2 = s1, then x100 = x101 = 0. All such “implied zeros” must be identified and handled explicitly when setting up the constrained optimization problem.
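As a didactic illustration of what such a scaling computation looks like on the running example, the Python sketch below uses plain iterative proportional fitting over the 2^3 atoms, starting from the uniform distribution. It is a simplified stand-in for the IS algorithm of [1], and the numeric selectivities are made up (chosen to be consistent, with no implied zeros).

```python
from itertools import product

# Atoms b in {0,1}^3 for predicates p1, p2, p3; start from the uniform distribution.
atoms = {b: 1.0 / 8 for b in product((0, 1), repeat=3)}

# Known selectivities: (indices of the predicates the constraint covers, target value).
constraints = [({0}, 0.10), ({1}, 0.20), ({2}, 0.25), ({0, 1}, 0.05)]

def covered(b, idx_set):
    return all(b[i] == 1 for i in idx_set)

for _ in range(200):                       # cycle through the constraints until converged
    for idx_set, target in constraints:
        inside = sum(x for b, x in atoms.items() if covered(b, idx_set))
        outside = 1.0 - inside
        if inside == 0.0 or outside == 0.0:
            continue                       # degenerate case (related to implied zeros)
        for b in atoms:
            if covered(b, idx_set):
                atoms[b] *= target / inside
            else:
                atoms[b] *= (1.0 - target) / outside   # keeps the distribution normalized

# Maximum-entropy estimate of s1,2,3; here p3 comes out independent of (p1, p2),
# so the result is s1,2 * s3 = 0.05 * 0.25 = 0.0125.
print(round(atoms[(1, 1, 1)], 6))
```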
2 Implementation Overview
MAXENT fits seamlessly into the existing optimizer architecture. After parsing a query, the optimizer, as in previous versions of DB2 UDB, first precomputes selectivities from
single-column statistics and MVS (specifically, column-group statistics and statistics on views). The new MAXENT module then precomputes any missing selectivities using the iterative scaling algorithm. During the final steps of QEP enumeration, costing, and selection, the DB2 optimizer uses the precomputed selectivity estimates to estimate the cardinalities of each partial QEP that it considers. We now outline the various steps of our MAXENT implementation and indicate how we have addressed the challenges mentioned in the previous section. These steps, described below, consist of (1) partitioning, (2) inconsistency resolution, (3) implied-zero elimination, (4) iterative scaling, and (5) combination.
Partitioning: The partitioning step decomposes the set of BFs into disjoint subsets; the IS algorithm is run independently on each partition in order to reduce overall computation time. Using the available selectivities, the partitioning algorithm first tries to break up the BFs into mutually “independent” subsets that are each as small as possible; here, e.g., the subsets { p1, p2 } and { p3, p4 } are independent if s1,2,3,4 = s1,2 · s3,4. If any of the resulting subsets are still too large to be efficiently processed by the IS algorithm, the subset is partitioned further into (non-independent) subsets, in order to ensure sub-second response time. See [1, Sect. 5] for further elaboration of the partitioning problem. Steps (2)–(4) are now applied separately to each partition.
Inconsistency resolution: To detect and resolve inconsistencies in the constraint set, MAXENT solves a linear program (LP) that is obtained by adding a pair of “slack” variables to each constraint. The slack variables represent adjustments (increases or decreases) to the corresponding selectivity that are needed to achieve consistency. In our running example, constraint (1d) becomes x110 + x111 + a⁺1,2 − a⁻1,2 = s1,2, and we add the additional constraints a⁺1,2, a⁻1,2 ≥ 0 and 0 ≤ s1,2 − a⁺1,2 + a⁻1,2 ≤ 1. The other constraints (1a)–(1c) are similarly modified. We then use linear programming to minimize the sum of the slack variables, i.e., a⁺1 + a⁻1 + a⁺2 + a⁻2 + a⁺3 + a⁻3 + a⁺1,2 + a⁻1,2, subject to the modified constraints, thereby making the adjustments as small as possible. The selectivities are then modified using the slacks, e.g., we take s1,2 = s1,2 − a⁺1,2 + a⁻1,2. Our prototype uses the open source COIN LP solver (found at www.coin-or.org).
Implied-zero elimination: Implied zeros can be detected using linear programming; the maximum entropy optimization problem is then formulated so that implied zeros appear neither in the objective function nor in the constraints. The “exact” solution is computationally expensive, but the following heuristic method is very effective in practice. Write each atomic variable as xb = vb + wb, and then maximize ∑b vb subject to the usual maximum entropy constraints as in (1a)–(1e), as well as the additional constraints that 0 ≤ wb ≤ 1 and 0 ≤ vb ≤ ε for each b, where ε is a small number such as 0.0001. Then xb is taken as an implied zero if and only if wb = vb = 0 in the optimal solution. The idea is that setting xb = 0 requires setting vb = 0, which significantly impacts the objective function because of the ε upper bound; thus, only “true” implied zeros are likely to appear in an optimal solution to the LP.
Iterative Scaling: In this step MAXENT computes the solution to the maximum-entropy optimization problem using the IS algorithm, as described in [1].
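Returning to the inconsistency-resolution step, the sketch below sets up the slack-variable LP for the running example in Python using scipy; the prototype itself uses the COIN LP solver, and the sketch omits the extra 0 ≤ s − a⁺ + a⁻ ≤ 1 bounds mentioned above. The numeric selectivities are made up and deliberately inconsistent.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

atoms = list(product((0, 1), repeat=3))              # 8 atoms over p1, p2, p3
# Deliberately inconsistent inputs: s1,2 > s1 is impossible.
constraints = [({0}, 0.10), ({1}, 0.20), ({2}, 0.25), ({0, 1}, 0.15)]

n_x, n_c = len(atoms), len(constraints)
A_eq = np.zeros((n_c + 1, n_x + 2 * n_c))
b_eq = np.zeros(n_c + 1)
for i, (idx_set, s) in enumerate(constraints):
    for j, b in enumerate(atoms):
        if all(b[k] == 1 for k in idx_set):
            A_eq[i, j] = 1.0                          # sum of the covered atoms
    A_eq[i, n_x + 2 * i] = 1.0                        # + a_plus
    A_eq[i, n_x + 2 * i + 1] = -1.0                   # - a_minus
    b_eq[i] = s
A_eq[n_c, :n_x] = 1.0                                 # normalization: atoms sum to 1
b_eq[n_c] = 1.0

c = np.concatenate([np.zeros(n_x), np.ones(2 * n_c)])  # minimize total slack
bounds = [(0, 1)] * n_x + [(0, None)] * (2 * n_c)
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")

for i, (idx_set, s) in enumerate(constraints):
    a_plus, a_minus = res.x[n_x + 2 * i], res.x[n_x + 2 * i + 1]
    print(idx_set, "adjusted selectivity:", round(s - a_plus + a_minus, 4))
```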
Combination: This step computes the overall maximum entropy solution by multiplying selectivities from different partitions. E.g., given two partitions { p1, p2 } and { p3, p4 }, we would set x1101 = x(1)11 · x(2)01, where x(i)b is a selectivity for the ith partition.
3 Experimental Evaluation
Figure 1 shows the effect of the preprocessing steps on the quality of the maximum-entropy solution. We used 600 random queries with BFs involving between 3 and 14 columns of a DMV database as in [1]. The first (resp., second) three bars show the effect of the preprocessing without (resp., with) partitioning. With no preprocessing (Bar 1), the IS algorithm fails for 24% of the queries due to inconsistent constraints and for an additional 2% of the queries due to implied zeros. Inconsistency resolution (Bar 2) guarantees the existence of a maximum entropy solution, but IS still converges for only 74% of the queries because of implied zeros. Elimination of implied zeros (Bar 3) produces a maximum entropy solution for every query. Bars 1P–3P show similar results in the presence of partitioning.
Fig. 1. Success rate of MAXENT (percentage of the 600 queries for which the IS algorithm converges, fails to converge, or has no feasible MAXENT solution, for preprocessing steps 1–3 and 1P–3P)
Fig. 2. MAXENT computation time (total computation time in seconds for the 600 queries, for the same preprocessing steps)
Figure 2 shows the total computation time needed to build the maximum entropy model for the 600 queries. The various preprocessing regimes are as in Fig. 1. As can be seen, partitioning reduces the computation time by orders of magnitude. Partitioning also has an impact on the relative effects of the preprocessing strategies; we focus on Bars 1P–3P, since partitioning is the regime used in practice. Bar 2P shows that inconsistency resolution almost triples the computation time, but Bar 3P shows that the speedup due to implied-zero elimination more than compensates for the cost of inconsistency resolution. Overall, the performance impact of the preprocessing steps is negligible—the average time to process a query is 0.05 seconds—whereas the quality of the results is improved dramatically as in Fig. 1. As shown in [1], use of MAXENT can speed up query processing by orders of magnitude. Our new results show that a careful implementation of MAXENT can achieve these improvements in a commercial RDBMS while barely impacting query compilation time.
References 1. Markl, V., Megiddo, N., Kutsch, M., Tran, T.M., Haas, P.J., Srivastava, U.: Consistently estimating the selectivity of conjuncts of predicates. In: Proc 31st VLDB. (2005) 373–384 2. Ilyas, I.F., Markl, V., Haas, P.J., Brown, P.G., Aboulnaga, A.: CORDS: Automatic discovery of correlations and soft functional dependencies. In: Proc ACM SIGMOD. (2004) 647–658
Improving DB2 Performance Expert - A Generic Analysis Framework
Laurent Mignet1, Jayanta Basak1, Manish Bhide1, Prasan Roy1, Sourashis Roy1, Vibhuti S. Sengar1, Ranga R. Vatsavai1, Michael Reichert2, Torsten Steinbach2, D.V.S. Ravikant3, and Soujanya Vadapalli4
1 IBM India Research Laboratory, IIT Campus, New Delhi, India {lamignet, bjayanta, abmanish, prasanr, souraroy, visengar, rvatsava}@in.ibm.com
2 IBM Information Management Development, Böblingen, Germany {rei, torsten}@de.ibm.com
3 Cornell University, USA
4 IIIT Hyderabad soujanya@iiit.ac.in
Abstract. The complexity of software has been dramatically increasing over the years. Database management systems have not escaped this complexity. On the contrary, this problem is aggravated in database systems because they try to integrate multiple paradigms (object, relational, XML) in one box and are supposed to perform well in every scenario, unlike OLAP- or OLTP-specific systems. As a result, it is very difficult to fine tune the performance of a DBMS. Hence, there is a need for an external tool which can monitor and fine tune the DBMS. In this extended abstract, we describe a few techniques to improve DB2 Performance Expert, which helps in monitoring DB2. Specifically, we describe a component which is capable of doing early performance problem detection by analyzing sensor values over a long period of time. We also showcase a trends plotter and workload characterizer which allow a DBA to have a better understanding of resource usage. A prototype of these tools has been demonstrated to a few select customers and based on their feedback this paper outlines the various issues that still need to be addressed in the next versions of the tool.
1 Introduction
Early detection/prediction of performance problems and potential resource constraints is essential for building robust and adaptive systems. Such prediction systems are also useful in limiting the impact of a failure. Simple heuristics (rules of thumb), such as “the log size should be three times the raw data size”, are often inadequate to address this problem. For example, too small a log size will increase the risk of frequent failures (transaction log full), and too large a log size will prevent proper utilization of the system resources.
Work done during his stay in IRL. Work done during her internship in IRL.
Previous attempts to predict system performance include specific models that best characterize the given problem at hand. However, generalizing those models to a broad spectrum of database performance problems is very difficult. In our research we focus on generic tools that address a broad spectrum of DB2 health issues. In other words, instead of problem-specific models we are building a generic suite of models that can be applicable for addressing a broad set of performance problems that arise in day-to-day DB2 production environments. Our tools analyze the historical temporal data generated by DB2 Performance Expert [5] (PE) and predict potential resource constraints, performance, and specific trends.
1.1 Previous Art
Early warning mechanisms are a standard way to provide proactive features. In commercial database software, several products offer such features. DB2 Health Monitor [4] is one such tool; it monitors several gauges in the background and automatically issues an alert when a pre-specified threshold is met. Oracle, in its version 10g, provides trend analysis via the Automatic Database Diagnostic Monitor (ADDM) [6]. The ADDM includes a trend analysis wizard permitting users to forecast when a particular event may occur by using a linear regression mechanism [7]. There are a few third party tools which also provide similar functionalities. BMC Software [2] offers a Connection-Miner tool which is part of their DGI Classic Suite. The tool provides trend analysis reports showing resource utilization trends by hour, week, month, or other desired criteria. The same features are also available in their SmartDBA offering [2]. Another third party tool is NORAD DBControl from Bradmark [3], which provides almost the same functionality as the Connection-Miner tool. Finally, BEZ Systems [1] offers the BEZPlus software, which includes SerView, a database performance and SQL tuning tool, and CorpView, an analysis tool that uses a performance data warehouse to track usage of the resources, data and SQL activity on the monitored server. The warehouse is used to perform trend analysis, problem identification and root-cause analysis.
Unlike other database management tools, our current work tries to consolidate all these functionalities in a single tool, thereby giving a holistic view to the users. In the rest of the paper we first give a brief description of the components showcased to customers in May 2005. We then outline a few components which were conceived based on their feedback. Finally, we enumerate some problems that we encountered during the preliminary research on the components described in this paper.
2 Components and Features
This section outlines the different modules that could potentially extend the existing tool, DB2 Performance Expert. Explained next are the modules that were showcased to a few select customers.
Performance Wizard: Optimizing DB2 performance involves tuning several parameters which are dependent on the workload characteristics. Due to the dynamic nature of the workload, these parameters might need to be retuned. Hence there is a need for a tool that would suggest optimal parameter values to the DBA depending on the workload. This task is accomplished by the Performance Wizard. To that end it creates a tree of registered performance problems. The intermediate nodes of the tree consist of problems like memory problems, IO problems etc. The leaves of a sub-tree represent the parameter values that can potentially cause the problems represented by the nodes of the tree. The user chooses a branch of the tree on which the problem analysis is to be performed. The performance wizard engine analyzes the sensor values corresponding to the parameters represented by the leaves of the sub-tree (for which the analysis is to be performed). The value of these sensors is compared over time with a predefined threshold, and more than one deviation from the normal behavior of the sensor is necessary to detect a performance problem. If the engine detects a performance problem, it suggests a new value of the parameter to the user in order to fix the problem.
Trends Plotter: This component allows the user to plot the trend of several sensors. The user is able to define the workload on which the analysis is to be performed. A workload can be defined based on the queries executed by a specific user, application, location (IP address of the origin of the query) or already defined workloads. The user can then plot the usage of predefined resources one by one or as a set at the same time. The component also provides an option to drill down by workload so as to better understand the way the resources are being used. Another feature permits the user to plot the trend of the historical data using linear regression techniques. The users are also given an option to analyze the impact of the SQL queries performed by the RDBMS in terms of most used columns, tables, joins etc.
These two extensions were showcased to a few selected customers. Based on their feedback we are investigating the feasibility of their requirements. The additional components being explored are given below:
Workload characterization: This feature involves fitting a mathematical model that best describes the current workloads seen by the database engine. A correlation between the workload characteristics and various system parameters helps in estimating the system resource requirements for an arbitrary workload.
Discovering correlations between DB2 parameters: DB2 PE gathers over a hundred parameters and records the current status of each parameter in a database. This component tries to establish the relationships between these parameter values. Various feature selection techniques and correlation analysis might uncover interesting relationships between various subsets of these parameters.
Long term storage: This component involves developing the storage and archival techniques needed to efficiently manage the growing amount of historical runtime performance data present in DB2 PE. Apart from traditional compression techniques, we are also investigating various statistical distributions to
represent data very concisely (that is, store trends rather than trend data). Suitable statistical sampling techniques and inverse transformation (decompression) techniques are being investigated to reconstruct the original data.
Fitting Prediction Models: This feature exploits the workload characterization and correlation discovery components (explained above) to fit a suitable mathematical model for predicting performance and resource usage. This will provide advisory tools that would guide the user to select the best model based on a given workload. Incremental algorithms are being designed so that the analysis and prediction models do not further overburden the system. Figure 1 shows an example trend predicted (red) using one of our prediction models.
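As a hedged sketch of the kind of model fitting involved, the following Python code fits an autoregressive model of order p by ordinary least squares over a sensor time series and extrapolates it a few steps ahead. It only illustrates the general idea behind the AR(OLS) prediction shown in Figure 1; it is not the code used in DB2 PE, and the synthetic series and the choice p = 2 are arbitrary.

```python
import numpy as np

def fit_ar(series, p=2):
    """Fit an AR(p) model y_t = c + a1*y_{t-1} + ... + ap*y_{t-p} by OLS."""
    y = np.asarray(series, dtype=float)
    X = np.column_stack([y[p - k - 1:len(y) - k - 1] for k in range(p)])
    X = np.column_stack([np.ones(len(X)), X])          # intercept term
    coeffs, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coeffs                                      # [c, a1, ..., ap]

def forecast(series, coeffs, steps=10):
    """Extrapolate the fitted AR model `steps` points into the future."""
    p = len(coeffs) - 1
    history = list(series[-p:])
    preds = []
    for _ in range(steps):
        lags = history[::-1][:p]                       # most recent value first
        nxt = coeffs[0] + sum(a * v for a, v in zip(coeffs[1:], lags))
        preds.append(nxt)
        history.append(nxt)
    return preds

# Synthetic "sensor" series standing in for, e.g., SQL activity per interval.
t = np.arange(200)
sensor = 1000 + 2.5 * t + 30 * np.sin(t / 8) + np.random.normal(0, 10, len(t))
coeffs = fit_ar(sensor, p=2)
print(forecast(sensor, coeffs, steps=5))               # predicted next readings
```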
3 Future Work
During our work on the tool we encountered some problems due to the nature of the data as well as the nature of the application. Monitoring a software system always implies a cost, both in terms of communication (sending data over the network) and processing (gathering the data) on the monitored side. To lower this cost as much as possible, in our tool the data is only gathered using a discrete scheme (i.e., at some intervals). As a result, handling of missing data is an intrinsic part of the solution. This problem not only affects the numerical data extracted from the different sensors, but it has a bigger impact on the retrieval of the SQL statements. In the discrete scheme, the number of possible un-recovered SQL statements is not known and hence it is very difficult to estimate the correctness of the gathered information. Another problem is related to the analysis of the historical data when seen as numerical time series. The literature on econometrics or digital signal processing is abundant with techniques to analyze time series to extract trends or to predict future values. The basic assumption of these models (for instance AR, ARIMA) is that the time series is stationary or pseudo-stationary. Unfortunately, this assumption does not hold in the case of the sensor data collected during our experimental phases, which was generated using the TPC benchmark. We plan to investigate the feasibility of developing each of the components described above by designing new techniques which would overcome the enumerated problems.
Fig. 1. Prediction using AR model (AR(OLS) prediction for the sql.d95.ts sensor series over time)
References
1. BEZ Systems. www.bez.com.
2. BMC Software, DGI Classic Suite. http://www.bmc.com/products/products services detail/0,,0 0 0 1401,00.html.
3. Bradmark, NORAD DBControl. www.bradmark.com/site/products/products.html.
4. DB2 Health Monitor. http://www.db2mag.com/story/showArticle.jhtml?articleID=51200282.
5. DB2 PE. http://www-306.ibm.com/software/data/db2imstools/db2tools/db2pe/.
6. Oracle, ADDM. http://www.oracle.com/technology/oramag/oracle/04-may/o34tech talking.html.
7. Oracle, ADDM. http://www.oracle.com/technology/products/oracle9i/datasheets/oem diagnostic/diagnostic.html.
Managing Collections of XML Schemas in Microsoft SQL Server 2005 Shankar Pal, Dragan Tomic, Brandon Berg, and Joe Xavier Microsoft Corporation, Redmond, WA 98052, USA {shankarp, dragant, branber, joexav}@microsoft.com
Abstract. Schema evolution is of two kinds: (a) those requiring instance transformation because the application is simpler to develop when it works only with one version of the schema, and (b) those in which the old data must be preserved and instance transformation must be avoided. The latter is important in practice but has received scant attention in the literature. Data conforming to multiple versions of the XML schema must be maintained, indexed, and manipulated using the same query. Microsoft’s SQL Server 2005 introduces XML schema collections to address both types of schema evolution.
1 Introduction
Many XML schemas have become standard in vertical industry segments such as ACORD [1] and SportsML [5]. Microsoft’s Office products have made their XML schemas openly available [3]. These efforts have led to better interoperability among loosely connected systems and the development of new applications, such as rich search and data sharing among applications. Standards bodies evolve their schemas, which may require reworking the applications since the XML schema evolution may require database schema changes. The scenarios for schema evolution are:
1. The new schema extends an existing one with new schema components. For example, the new schema can add a top-level element called “language” if a publisher enters a new, foreign market. The application considers existing data to have a default value for the new schema components (e.g. US English).
2. The schema is modified in an incompatible way. This often occurs with change in business needs and merger of systems. The existing data must be mapped to the new schema, using, for example, XSL transformations [8], if it does not conform to the new schema. The data transformation can be avoided in some cases (e.g. the maxOccurs of an element changes from 7 to 5 but no instance has more than 5 occurrences).
3. The schema undergoes modification as government or business rules change. New data must conform to the new schema while old data must be retained in its old form for archival purposes. Examples are tax filing and securities trading. Tax laws change from one year to the next, but old tax returns should not be transformed to conform to the latest version of the XML schema.
While the first two kinds of schema evolution have been studied extensively in the literature, the third kind is a practical problem with a few ad hoc solutions. SQL Server 2005 addresses it using a meta-data notion called an XML schema collection –
a container of XML schemas that may be related (e.g. through <xs:import>) or unrelated to one another. Each schema in an XML schema collection C is identified using its target namespace. A new version of an XML schema with a new target namespace can be added to C and is treated like a new schema. This framework is powerful enough to support the other types of schema evolution too. The XML data is stored in a column of a rich data type called XML, as opposed to decomposing the data into tables and columns. This avoids database schema modification when XML schemas in C are added, dropped or modified. The database engine validates each XML instance according to the XML schema specified in the XML instance during data assignment and modification.
2 XML Schema Collection An XML schema collection C is created using the following statement and registering one or more XML schemas: CREATE XML SCHEMA COLLECTION C AS ‘<xs:schema> … ’ The schema processor identifies and stores in C the schema components supplied in the schemas. An XML schema component is anything defined at the top level of an XML schema, such as an element, attribute, type, or group definition. Schema components from multiple XML schemas with the same target namespace are grouped together by the target namespace. A user can type an XML column using C. The constraint imposed by C is the collective set of the schema constraints imposed by the individual XML schemas in C. The post-schema validation Infoset (PSVI), which adds type information to the Infoset [6], is encoded in the internal representation of the XML data for faster parsing during XQuery [7] processing. 2.1 Schema Evolution Suppose you add an XML schema with target namespace BOOK-V1 to an XML schema collection C. An XML column XDOC typed using C can store XML data conforming to BOOK-V1 schema. To extend the XML schema, the schema designer adds the new schema components to the BOOK-V1 namespace. Adding optional elements and attributes, and top-level elements and type definitions do not require revalidation of the existing XML data in column XDOC. Suppose later the application wants to provide a new version of the XML schema, for which it chooses the target namespace BOOK-V2. This XML schema is added to C without transforming the existing XML instances in XDOC. The XML column can store instances of both BOOK-V1 and BOOK-V2 schemas. This yields significant simplification in data management when C contains a large number of XML schemas. The user can insert and modify XML instances conforming to the latest version of the schema as well as those conforming to the older versions of the schema. The XML schema collection framework can also support applications that require instance transformation. The application has to supply the modified XML schema and
1104
S. Pal et al.
the transform, such as XSL or XQuery, to be applied to the XML data. The system can generate the transform by comparing the old and the new XML schemas. XML schemas may use a “version” attribute [6] to specify the schema version while using the same target namespace for all versions. There is no language support in XQuery to specify the version of a schema. Furthermore, the XML schemas published by W3C and Microsoft include the year and optionally the month and the day of publication in the target namespace. Hence, our choice has been to rely on a new target namespace to indicate evolving versions of an XML schema.
3 Indexing and Querying XML Data XML indexes [4] are built on XML columns to speed up different classes of queries. Type information from the XML schema collection is stored in the XML indexes for scalar values and element types. Furthermore, tag names and paths are encoded relative to their target namespace. This ensures that a search can distinguish between aelement within the target namespace BOOK-V1 from one within the target namespace BOOK-V2. This allows users to restrict their queries to the desired schema namespaces and yet get good performance. However, search for in all namespaces can become a union query or turn into a primary XML index scan. One of the main benefits of an XML schema collection is the ability to query across multiple XML schemas. The XQuery compiler uses the schema component definitions to perform static type analysis, based on which it can reject a priori some queries that would result in run-time errors (e.g. type mismatch). It can also infer cardinality to assist in static query optimizations. If desired, an XQuery expression can be limited to a specific version of an XML schema. This is desirable since the older data should be as easily searchable as the new data. Querying over multiple schema versions can be broken down as follows: •
• •
•
If elementFormDefault or attributeFormDefault is specified as unqualified in an XML schema, the target namespace of non-top level elements and attributes in the schema is absent. A query for those elements and attributes looks for the type definitions in the special no name target namespace. This allows queries over elements and attributes defined in different XML schemas within the XML schema collection C, and yields very good performance. If the element or attribute is qualified in the XML schema, then the appropriate target namespace can be used to limit an XQuery expression to the element or attribute definitions within the specified target namespace. When an element or attribute is unqualified in an XQuery expression, the query looks for the type definitions in the default target namespace specified in the XQuery prolog. If no definition is found, then the no name target namespace is searched for the type definitions. If the element or attribute is qualified in the XML schema, a search is performed over multiple target namespaces using a wildcard for the target namespace. Thus, to search forelements within multiple versions of an evolving XML schema, the user can write the path expression as //*:book. This looks for the definition of the element in all the XML schemas registered with the XML schema collection, including the no name target
Managing Collections of XML Schemas in Microsoft SQL Server 2005
1105
namespace. Wildcard namespace queries are slow because they result in a primary XML index scan but some optimizations can be done such as turning it into an “or” query over the different XML schemas in C.
4 Conclusions and Future Work The focus in SQL Server 2005 has been to build up the infrastructure for XML schema evolution using XML schema collection. Most of the common features of XML Schema specification have been implemented to meet customer needs. For future work, incorporating instance transformation is useful. This can be based on XQuery or XSLT, and the system can generate the transform. Some applications, such as securities trading, require limiting an XML schema collection to the latest version of the XML schemas. In a query such as //*:author, which finds theelements in all the target namespaces within an XML schema collection, the query can be restricted to only the latest schema based upon execution context settings. The next version of the SQL standard for XML [2] introduces the concept of a registered XML schema which can accommodate multiple XML schemas that are related to one another using <xs:import>. This allows schema evolution as long as the evolved XML schema imports the old schema; the registered XML schema descriptor must be updated as well. By comparison, our XML schema collection accommodates disjoint XML schemas and allows more general schema evolution.
References 1. ACORD – Global Insurance Standard. http://www.acord.org/standards/StandardsHome.aspx 2. Melton, J.: ISO/IEC 9075-14:2003, Information technology — Database languages — SQL — Part 14: XML-Related Specifications (SQL/XML) (2004) 3. Office 2003: XML Reference Schemas. http://www.microsoft.com/downloads/detailsaspx?familyid=fe118952-3547-420a-a412-00a2662442d9&displaylang=en 4. S. Pal et al.: Indexing XML Data Stored in a Relational Database. Proc VLDB (2004). 5. SportsML – Sports Markup Language. http://www.sportsml.org/ 6. XML Schema Part 1: Structures and Part 2: Datatypes. W3C Recommendation 2 May 2001. http://www.w3.org/TR/2001/REC-xmlschema-1-20010502, http://www.w3.org/TR/2001/REC-xmlschema-2-20010502 7. XQuery 1.0: An XML Query Language. http://www.w3c.org/TR/xquery 8. XSL Transformations (XSLT) Version 2.0. http://www.w3.org/TR/2005/WD-xslt2020050915/.
Enabling Outsourced Service Providers to Think Globally While Acting Locally Kevin Wilkinson, Harumi Kuno, Kannan Govindarajan, Kei Yuasa, Kevin Smathers, Jyotirmaya Nanda, and Umeshwar Dayal Hewlett-Packard Laboratories, 1501 Page Mill Rd., Palo Alto, CA 94304
1
Introduction
Enterprises commonly outsource all or part of their IT to vendors as a way to reduce the cost of IT, to accurately estimate what they spend on IT, and to improve its effectiveness. These contracts vary in complexity from the outsourcing of a world-wide IT function to smaller, country-specific, deals. For service providers to realize the economies of scale necessary for them to be profitable, they must “productize” services and work from models that standardize significant aspects of services and the resources required to fulfill them. However, “productization of services” is not simple. Services are inherently different from products. Services organizations are more likely to have significant regional variations when compared to their manufacturing counterparts. Local customers often prefer local service. Global customers expect the same service in all the regions that they operate in, and the service provider must thus leverage local service capabilities in a globally consistent manner. IT technology evolves rapidly, and service offerings, options on existing service offerings, and preferred standard solutions must follow suit. Finally, large global deals require the service provider to coordinate multiple teams from various regions with different areas of expertise and that use a variety of tools (often implemented using personal productivity applications such as Excel) and highly-manual processes. This leads to information silos where information is trapped in personal documents, making it difficult for management to make informed decisions about the services business. We introduce here novel techniques that address the unique needs of the IT service domain. Our initial prototype, introduced in [1], used conventional techniques. However, we found this approach was subject to limitations that prevented the wide adoption of our prototype (Section 2). Our new approach leverages metadata technology and the ubiquity of office productivity tools to overcome these barriers (Section 3).
2
Initial Approach
Our hypothesis is that a critical factor in these problems is that, fundamentally, the right information is not made available to the right person at the right time. Y. Ioannidis et al. (Eds.): EDBT 2006, LNCS 3896, pp. 1106–1109, 2006. c Springer-Verlag Berlin Heidelberg 2006
Enabling Outsourced Service Providers to Think Globally
1107
We believe this is because manual processes simply cannot scale to accommodate the tens of thousands of relationships and details involved in service design. To address these issues, we look to the manufacturing domain’s experience with computer-aided design (CAD) systems [2]. Backed by a library of components and their relationships, a CAD system takes care of tedious details of mechanical design tasks such as version management, component reuse, the validation of design rules, etc., freeing designers to focus on design objectives. Applying the CAD metaphor to the service domain, we hoped that by automating the management of relationship information and configuration processes, by exploiting human input for the difficult tasks of term/value configuration and mapping validation, and by integrating the relationships discovered during the human input phase back into the system, we would manage the information required for the design of service solutions more accurately and efficiently than either a fully-automated system or a non-automated system could. 2.1
First Prototype
Our first prototype application connected the stages of a deal lifecycle by providing a central repository that maintains and links between the per-customer data used in each of the stages. We refer readers to [1] for details. We call our prototype system the Service Configurator because it helps design solutions based on standard offerings followed by customer specific configuration. Integrating the lifecycle stages lets us identify and control the amount of unique work required for each customer and deliver low-cost, high-quality service offerings that adapt to changing customer needs. Our initial instinct was to use traditional design choices, such as a conventional three-tier web-client-server architecture. The data source layer was based upon a global information model, and used fixed schema mappings to provide access to data from domain-specific “authoritative” repositories maintained by the business unit. Each of these repositories was in turn itself global in nature, with relatively stable schema and content. Our business unit partners tested the prototype with data from an ongoing deal with a major customer, and validated benefits of the Service Configurator’s approach, such as information flow continuity and automation [1]. The prototype was reasonably well-received. Our collaborators were intrigued by the possibilities when presented with the concrete demonstrator, and we were chartered to build a second prototype with help from a larger pool of domain experts. 2.2
Issues with the First Prototype
Despite its positive reception, the first prototype suffered from issues that prevented it from being transferred directly to the business unit. Most significantly, we came to realize that service configuration is characterized by highly dynamic, distributed, disconnected processes, tools, and data sources. The first prototype was based upon a global perspective, yet its user community spans suborganizations, roles, and regions. Adopting the first system would have required big changes in the field’s processes and data sources, not to mention the creation
1108
K. Wilkinson et al.
of a receptor organization to develop the prototype into a complete system and support it. Because we did not accommodate existing regional infrastructures, tools, and practices, regional users could not use our system as-is. Second, although one of the most significant contributions of the first prototype is the data model, we came to realize that because the information model was exposed implicitly through web forms, the underlying concepts/relationships were not always clear to the system users. Finally, our system must bridge gaps between global and regional models, deal and service lifecycle stages, organizations and roles in the service business, and service portfolios. We provided only a global view of solutions, and found that the field needs to easily extend/augment the model. Furthermore, there was a need for a means to share those extensions among peers (and eventually promote the extensions to the global model). Our original, centralized, solution offered only limited interchange/interoperability. We needed an easy way to export/import subsets of the information model.
3
New (Federated) Approach
Our first system showed how a common information model could enable information sharing and new capabilities across the various dimensions of service outsourcing, e.g., service lifecycle, service offerings, tasks and roles. The goal of our second system is to overcome the barriers to using such a system by the service professionals. We used a two-pronged approach: integrate with familiar user tools and utilize semantic information models. Service professionals were skeptical how a central repository could maintain up-to-date information in the fast-paced, dynamic and disconnected environment of deal pursuit and delivery. We also observed a reliance on simple office productivity tools such as Excel and Word, which encapsulate information in manageable chunks and are frequently shared in the field. Consequently, we have developed interfaces between our system and Word and Excel. For Excel documents, we support mappings between cells in a spreadsheet and parameters in a service configuration. For Word documents, arbitrary snippets of content can be tagged and associated with a parameter. Transformation programs then exchange information between the Word or Excel documents and the repository. Our mappings enable repository content to appear in documents and the repository to be updated from the documents. Consequently, service professionals can immediately share and view any changes for a deal. An added advantage is that these documents are mobile, so information is available even when the repository is not accessible. Further, there is no learning curve for getting access to this information since the user interfaces, Word and Excel are familiar. The mappings are expressed in a relatively simple language and can be hidden. As needed, new mappings could easily be defined by a trained user. This basic capability is necessary but not sufficient because it only enables data sharing. It does not enable the sharing of data definitions, i.e., metadata. The information model in our first system was hidden behind a set of Web forms.
Enabling Outsourced Service Providers to Think Globally
1109
Our second system exposes the information model, and enables the definition and sharing of new concepts and relationships. We chose RDF/OWL as a representation language for our information model because it provides globally unique identifers for resources (URIs), enables rich semantic descriptions of information, and supports decomposition of models into smaller chunks that can be individually shared and extended. Rather than develop a single, complicated ontology to describe the complete services business, we are using an incremental approach that divides the business into sub-domains. There are a number of core ontologies. We anticipate the development of ontologies specific to regions, task or industry segment. RDF/OWL enables the extension and sharing of ontologies and facilitates their integration. However, expressed as an RDF graph, an ontology is not an appropriate representation for use by services professionals. We thus provide task-specific views of the ontology, each of which is a projection of the graph onto a tree. A tree provides a more natural presentation of the information, and is easily expressed in XML, facilitating interchange with other tools. E.g., a deal configuration graph might be projected onto a simpler, partonomic tree that lists, for each service, the modules comprising the service and the parameter values for each module. The views are expressed as templates and a transformation program applies the template to a graph to generate an XML tree. These views are intended for round-trip transformations. E.g., we can use a single view template to create round-trip transformations– from RDF to XML, as well as from XML back to RDF. In this way, we can extract information from a graph representing a configuration, incorporate that into some Word or Excel document, allow users to modify (in a limited way) that content and then update the configuration with the modified information.
4
Status and Challenges
Our second prototype has been met with enthusiastic reception from business unit contacts. Thus far, we have won sponsorship all the way up the services organization and have begun to engage with a transfer organization. In addition, we have identified some challenges to explore next. These include more sophisticated reporting and analysis functions, mechanisms for enabling the government of interactions between global and regional systems, and a workflow framework for analysis and delivery.
References 1. Kuno, H., Yuasa, K., Govindarajan, K., Smathers, K., Burg, B., Carau, P., Wilkinson, K.: Governing the contract lifecycle: A framework for sequential configuration of loosely-coupled systems. In: DNIS 2005. (2005) 2. Katz, R.H.: Toward a unified framework for version modeling in engineering databases. ACM Comput. Surv. 22 (1990) 375–409
Another Example of a Data Warehouse System Based on Transposed Files Antonio Albano1 , Luca De Rosa2 , Lucio Goglia, Roberto Goglia, Vincenzo Minei, and Cristian Dumitrescu3 1
3
Univ. of Pisa, Dept. of Computer Science, Largo B. Pontecorvo 3, 56127 Pisa, Italy 2 Advanced Systems, Via Napoli 159, 80013 Casalnuovo di Napoli (NA), Italy Sisteme Avanzate Italo-Romane, Calea Victoriei 155, Bucarest 1, Romania
Abstract. The major commercial data warehouse systems available today are based on record-oriented relational technology optimized for OLTP applications. Several authors have shown that substantial improvements in query performance for OLAP applications can be achieved by systems based on transposed files (column-oriented) technology, since the dominant queries only require grouping and aggregation on a few columns of large amounts of data. This new assumption underlying data warehouse systems means that several aspects of data management and query processing need to be reconsidered. We present some preliminary results of an industrial research project which is being sponsored by the Italian Ministry of Education, University and Research (MIUR) to support the cooperation of universities and industries in prototyping innovative systems. The aim of the project is to implement an SQL-compliant prototype data warehouse system based on a transposed file storage system. The paper will focus on the optimization of star queries with group-by.1
1
Introduction
The most important commercial data warehouse systems are based on recordoriented relational technology optimized for OLTP applications. They have been extended with new kinds of indexes and new optimization techniques for typical OLAP queries that require grouping and aggregation on just a few columns over large amounts of data. Several authors have demonstrated that, since a data warehouse system is query-intensive, an implementation based on a transposed file storage system (also called column-oriented, projection indexes) can achieve substantial improvements in OLAP query performance [7], [4], [3], [1], [6]. In 1
This work was partially supported by the MIUR, under FAR Fund DM 297/99, Project number 11384. The project partners are Advanced Systems, University of Pisa, Department of Computer Science, and University of Sannio, Research Centre on Software Technology.
Y. Ioannidis et al. (Eds.): EDBT 2006, LNCS 3896, pp. 1110–1114, 2006. c Springer-Verlag Berlin Heidelberg 2006
Another Example of a Data Warehouse System
1111
addition to research prototype systems there are several commercial products that have adopted transposed files as a storage structure, such as Addamark and KDB. Clearly, the idea of changing the way data are stored is not in itself enough to achieve the expected query performance. Other storage structures must be considered and new algorithms for generating access plans must be designed. Although algorithms for join query processing are reported in the above references, the execution of the most typical OLAP query with grouping and aggregation has received less attention. It is well known that a standard way to evaluate this kind of query is to perform all the joins first and then the group-by. However, several authors have shown that an optimizer should also consider doing the group-by before the join [2], [5].2 We present some preliminary results of an industrial research project SADAS which began in March 2003 and is being sponsored by the Italian Ministry of Education, University and Research (MIUR) to support the cooperation of universities and industries in prototyping innovative systems. The main contractor is the Italian company Advanced Systems that implemented the commercial system Overmillion in the late 1990s using a completely transposed storage structure. However the system does not support either multi-relation queries or SQL. On the basis of their successful experience, the aim of our project was to redesign and re-implement Overmillion to have a complete SQL-compliant data warehouse system using a transposed file storage system. The paper is organized as follows. First we outline the main storage structures in SADAS, and then we present the approach adopted in generating query plans for star queries with grouping and aggregations.
2
Storage Structures
We briefly present the storage structures implemented in SADAS and the physical operators used in query plans. We consider only those features that are relevant for the group-by optimization, and we make the following assumptions to simplify the presentation: (a) the database has a star schema, where the primary and foreign keys have only one attribute; (b) the database is read-only; (c) a completely transposed structure is used for each table, and so each column is stored in a separate file; (d) each column has fixed length values different from null. The system adopts the following storage structures: 1. each column A of a table with n records is stored in two files with n elements using two representations: (a) a file named A.CLN (column values) which contains the column values for all the records of a table (an index is maintained for this file) and (b) a file named A.CLI (codified column) which contains the column values coded as the position of the values in the corresponding column index; 2
We would like to apologize to other contributors in this area whose work we have not acknowledged due to limitations of space.
1112
A. Albano et al.
2. for each foreign key fk of the fact table F for the dimensional table D there is a join index, which is a file that contains the RIDs of the matching records in D, instead of the corresponding key values; 3. for all the not empty subsets of the fact table foreign keys, the following structures exist: (a) a partial multi-column index G where (a) for each column A the values are those stored in the corresponding file A.CLI, and (b) the index contains only the values part and the number of elements for each entry. For example, in the index on columns A and B of F , the k-th entry that corresponds to the values (a1 , b3 ), will contain the values (1, 3) if a1 has code 1 and b3 has code 3; (b) a file G.GDI that contains the column values coded as the position of the values in the corresponding column index. For example, if the i-th record of the fact table F has columns A e B with values (a1 , b3 ), in the file G.GDI the i-th entry contains the value k.
3
Optimization of Star Queries with Grouping and Aggregations
Usually a star query is executed by traditional data warehouse systems using the best of two plans, the one produced in the usual way by relational systems, and the other produced with the following basic phases: (a) the local conditions are applied to the fact table and to the dimensional tables that participate in the join to compute the local rowsets. A local rowset is a bitmap representing the selected tuples from a table; (b) using the join indexes and the local rowsets, the global rowset is computed, which is a bitmap representing the records from the fact table that belong to the star join. The algorithm to execute this phase depends on the kind of join indexes available; (c) once the relevant fact table records have been retrieved using the global rowset, they are joined with the dimension tables to produce the answer to the query. When a system adopts a completely transposed file solution, the best plan to execute a star query is produced by adopting the same first two phases, using the operators BMIndexFilter and BMJoinIndex, but a different strategy for the third phase: To evaluate a group-by, the optimizer determines if a group-by can be moved on the fact table to use the operator IndexGByJoinFromBM, which exploits the benefits of a specific structure provided to accelerate the operation starting from the global rowset produced by the BMJoinIndex operator: IndexGByJoinFromBM(O, {Ai }, {Dj }, {Rh }, {gk }) groups the records of the
fact table that participate in the star join, represented by the global rowset O, by a set of foreign keys {Ai }, and retrieves the values of the dimensional columns {Dj }. It returns a set of records whose fields are the values of the columns {Rh } ⊆ {Ai } ∪ {Dj } and the values of the aggregate functions {gk }. To exploit the benefits of doing the group-by before the join, we decided to consider the three cases studied in [2], [5], which we revised and reformulated
Another Example of a Data Warehouse System
1113
in the form of equivalent rules of the relational algebra. Due to limitations of space, we will only show the following case. Proposition 1. Let α(X) be the set of columns in X and R Cj S an equi-join using the primary key pk of S and the foreign key fk of R. R has the invariant grouping property S) ≡ π A∪F ((A∪α(Cj )−α(S) γF (R)) Cj S)
A γF (R Cj
if the following conditions hold: (a) the foreign key of R is functionally determined by the grouping columns A in R Cj S; (b) each aggregate function in F only uses columns of R. Example 1. Consider the star schema and the query: Product(PKproduct, pName, cost, division), Keys: {PKproduct}, {pName} Order(PKorder, FKproduct, FKdealer, price, qty, date), Key: {PKorder} SELECT FROM WHERE GROUP BY
pName, SUM(qty) AS Q Order, Product FKproduct = PKproduct AND division = ’D1’ AND qty = 100 pName;
Since the fact table Order has the invariant grouping property, the optimizer is aware that the group-by can be moved on Order, and so it considers the following plan: IndexGByJoinFromBM ({FKProduct}, {pName}, {pName}, {SUM(qty) AS Q}) BMJoinIndex (FKProductJDI) BMIndexFilter (qty = 100)
4
BMIndexFilter (division = 'D1')
Conclusions
We have briefly presented (a) some of the features of the SADAS system, a data warehouse management system based on transposed files, and (b) the approach adopted in generating query plans for star queries with grouping and aggregations that benefit from specialized data structures. A single-user version of the system using a subset of SQL is operational and the demonstration at EDBT will show how its query performance compares favorably with that of commercial row-store data warehouse products, as reported by authors of similar projects.
1114
A. Albano et al.
References 1. P. A. Boncz and M. L. Kersten. MIL Primitives for Querying a Fragmented World. The VLDB Journal 8 (1999) 101–119. 2. S. Chaudhuri and K. Shim: Including Group-By in Query Optimization. In Proc. Intl. Conf. on VLDB, Santiago, Chile, (1994) 354–366. 3. A. Datta, K. Ramamritham, and H. M. Thomas: Curio: A N ovel Solution for Efficient Storage and Indexing in Data Warehouses. In Proc. Intl. Conf. on VLDB, Edinburgh, Scotland, (1999) 730–733. 4. C. D. French: ”One size fits all” Database Architectures do not Work for DDS. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, San Jose, California, USA, (1995) 449–450. 5. C. A. Galindo-Legaria and M. M. Joshi: Orthogonal Optimization of Subqueries and Aggregation. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, Santa Barbara, California, USA, (2001) 571–581. 6. M. Stonebraker et al.: C-Store: A Column-Oriented DBMS. In Proc. Intl. Conf. on VLDB, Trondheim, Norway, (2005) 553–564. 7. M. J. Turner, R. Hammond, and P. Cotton: A DBMS for Large Statistical Databases. In Proc. Intl. Conf. on VLDB, Rio de Janeiro, Brazil, (1979) 319–327.
XG: A Grid-Enabled Query Processing Engine Radu Sion1 , Ramesh Natarajan2 , Inderpal Narang3 , and Thomas Phan3 1 Stony Brook University sion@cs.stonybrook.edu 2 IBM TJ Watson Research Lab phantom@us.ibm.com 3 IBM Almaden Research Lab narang@us.ibm.com 4 IBM Almaden Research Lab phantom@us.ibm.com
Abstract. In [12] we introduce a novel architecture for data processing, based on a functional fusion between a data and a computation layer. In this demo we show how this architecture is leveraged to offer significant speedups for data processing jobs such as data analysis and mining over large data sets. One novel contribution of our solution is its data-driven approach. The computation infrastructure is controlled from within the data layer. Grid compute job submission events are based within the query processor on the DBMS side and in effect controlled by the data processing job to be performed. This allows the early deployment of on-the-fly data aggregation techniques, minimizing the amount of data to be transfered to/from compute nodes and is in stark contrast to existing Grid solutions that interact with data layers as external (mainly) “storage” components. By integrating scheduling intelligence in the data layer itself we show that it is possible to provide a close to optimal solution to the more general grid trade-off between required data replication costs and computation speed-up benefits. We validate this in a scenario derived from a real business deployment, involving financial customer profiling using common types of data analytics.
1 Demonstration Outline In this demo we will show how integrating a computation grid with a data layer results in significant execution speedups. For example, with only 12 non-dedicated nodes, a speedup of approximately 1000% can be attained, in a scenario involving complex linear regression analysis data mining computations for commercial customer profiling. In this demo we deploy XG with live connections (on-site demo only also possible) to our 70+ nodes CPU Grid located in IBM Almaden. We then demonstrate the data processing scenarios discussed below, including the mining query execution speedup shown in Figure 2. Additionally, we propose to demonstrate some of the more subtle features of our system, opening insights into important underlying design decisions. These features include: (i) the runtime mechanism for on-demand QoS provisioning of data replication and computation (provisioning the right amount of resources per data processing task to meet QoS demands, e.g., deadlines), (ii) the on-the-fly recovery on failure in both computation and data layers, (iii) the automatic discovery of computation Y. Ioannidis et al. (Eds.): EDBT 2006, LNCS 3896, pp. 1115–1120, 2006. c Springer-Verlag Berlin Heidelberg 2006
1116
R. Sion et al.
resources, (iv) the runtime monitoring of computation and data flows between the data layer and the grid. This demonstration is facilitated by the fact that these features are exposed through a user-friendly GUI.
2 Introduction As increasingly fast networks connect vast numbers of cheaper computation and storage resources, the promise of “grids” as paradigms of optimized, heterogeneous resource sharing across boundaries [2], becomes closer to full realization. It already delivered significant successes in projects such as the Grid Physics Network (GriPhyN) [4] and the Particle Physics Data Grid (PPDG) [10]. While these examples are mostly specialized scientific applications, involving lengthy processing of massive data sets (usually files), projects such as Condor [1] and Globus [3] aim at exploring “computational grids” from a declared more main-stream perspective. There are two aspects of processing in such frameworks. On the one hand, we find the computation resource allocation aspect (“computational grid”). On the other hand however data accessibility and associated placement issues are also naturally paramount (“data grid”). Responses to these important data grid challenges include high performance file sharing techniques, file-systems and protocols such as GridFTP, the Globus Replica Catalog and Management tools in Globus, NeST, Chirp, BADFS , STORK , Parrot , Kangaroo and DiskRouter in Condor. The ultimate goal of grids is (arguably) an increasingly optimized use of existing compute resources and an associated increase of end-to-end processing quality (e.g. lower execution times). Intuitively, a tighter integration of the two grid aspects (“computational” and “data”) could yield significant advantages e.g., due to the potential for optimized, faster access to data, decreasing overall execution times. There are significant challenges to such an integration, including the minimization of data transfer costs by performing initial data-reducing aggregation, placement scheduling for massive data and fast-changing access patterns, data consistency and freshness. In this work we propose, analyze and experimentally validate a novel integrated data-driven grid-infrastructure in a data mining framework. Computation jobs can now be formulated, provisioned and transparently scheduled from within the database query layer to the background compute Grid. Such jobs include e.g., the computation of analytical functions over a data subset at the end of which the result is returned back in a data layer (either by reference to a specific location or inline, as a result of the job execution). We designed and built a specialized experimental grid infrastructure. One of the main design insights behind it is that (arguably) any “global” grid is ultimately composed of clustered resources at the “edge”. It is then only natural to represent it as a hierarchy of computation clusters and associated close-proximity data sources. Using our end-to-end solution (data-layer aggregation and compute grid invocation), in our considered application domain (data analysis for predictive modeling) significant speed-ups have been achieved versus the traditional case of data-layer processing. Scenario: Customer Analytics. Let us now explore an important commonly encountered operation scenario for data mining in a commercial framework that yielded significant cost and speed-up benefits from our solution: a large company (i.e., with a cus-
XG: A Grid-Enabled Query Processing Engine
1117
tomer base of millions of customers), maintains an active customer transaction database and deploys data mining to better customize and/or optimize its customer-interaction response and associated costs. There are two types of customer interactions, each subject to different types of requirements and response mechanisms, namely (i) incoming (“pull” model) inquiries and (ii) outgoing advertisements (“push” model). For space reasons, here we are discussing (i). Incoming inquiries (e.g., over the phone, online) are handled in a real-time or shortnotice manner. There are bounds on response-time (e.g., Human-Computer interaction experiences should feature response times of under 7-8 seconds to be acceptable) [11] to be satisfied. An imprecise but fast initial “pre”-response might be often preferable to an exact but slow one, as it is likely that higher waiting-times would result in a drop in overall customer satisfaction. The company’s response data is based on previously recorded customer “profiles”, composed of a history of transactions and a set of related predictive models. Such profiles need to be maintained with sufficient (preferably maximal) accuracy and the associated (predictive) models re- computed periodically or as part of an event- driven paradigm in which associate customer events trigger individual model re- computations. In such an interaction, often the center-point (and likely the most expensive) is processing a function of the immediate input data and the customer predictive models in the stored profile (“model scoring”). Often, also, new models need to be computed on the fly. Because of its real-time nature, and the potential for thousands of simultaneous incoming customer requests, this scenario is extremely challenging. To understand the size of this problem, let us quantify some of the previous statements with real data. Let us assume a customer base of over 10 million customers. Roughly 0.1% (10k) of them are active at any point in time (interactive and automated phone calls, web access, other automated systems). Preferably, the company response in each and every transaction should be optimally tailored (i.e., through on-demand data mining) to maximize profit and customer satisfaction. On average, only 75% (7.5k) of these active (meta)transactions are resulting in actual data mining tasks and, for each second, only 20% of these task- triggering customers require data mining. To function within the required response-behavior boundary, the company has to thus handle a continuous parallel throughput of 1500 (possibly complex) simultaneous data mining jobs. Achieving this throughput at the computation and data I/O level is very challenging from both a cost and scalability viewpoint.
3 System Architecture The end-to-end solution comprises several major components: modeling, aggregation and computation outsourcing (in the data layer) and grid scheduling and management (grid layer). Data Layer. Designing specifically for data mining over large data sets, requires a careful consideration of network data transfer overheads. As traditional data mining solutions are often based on code directly executed inside the database query processor, these overheads could often be reduced by an initial data aggregation step performed inside the data layer, before outsourcing the more computation heavy model generation tasks.
1118
R. Sion et al. 1
Job SOAP 2
Data Mining Task
SQL/MM
Data Data Aggregates Data Aggregates Aggregates
Stored procedure
Goal-based on-demand MetaScheduler
8
3
xg.sched.Scheduler
3
data placement
Data Tables Materialized Views Federated Data sets
6
1
2
3
4
5
6
(a)
High Speed LAN
8
9
xg.sched.Scheduler
4
EE
EE
Grid Cluster
5
xg.res.Discovery BM
Cluster Store
Wide Area Network
Analytics Code Base
2 EE BM
8
Network Resource Manager
BM
1 EE
6
4
Compute Grid
BM
3
7
Data Layer
Scheduler
7
Data Layer
BM
2 EE BM
8
3
Analytics Code Base Data Data Data
SOAP
BM
EE
6
3
7
Runtime Estimator Web Service Invocation
xg.res.Discovery
1
4
network information
5
Cluster Store
9
Compute Results
5
Job Model for Goal-based provisioning (data size, computation estimates)
1
Mining Results
4
2
High Speed LAN
BM
3
7 EE
4 EE
Grid Cluster
(b)
Fig. 1. (a) Data Layer Overview: The grid is leveraged transparently from the data side. Mining tasks can execute normally within the query processor (e.g., as stored procedures). (b) Both data replication/placement (to cluster stores) and job scheduling in the hierarchical grid infra-structure is controlled by a meta scheduler.
The solution allows for dispatching of multiple simultaneous data processing tasks from within the query processor (i.e. SQL level) by performing simple calls through a user defined function (UDF) mechanism. At the completion of these tasks, their results become available within the actual data layer, ready for further processing. The interaction between the data layer and compute grid is composed of two elements: job submission control and data placement/replication. Job submission is initiated in the database and forwarded to the main computation grid through a webservice interface exposing the main grid scheduling control knobs. This interaction is enabled by user defined functions (UDF) within DB2, (constructs present also in a majority of big-vendor DBMS solutions including DB2 [6], Oracle [9] and SQL Server [8]). The grid scheduler controls are exposed through a webservice interface. This is achieved through the XML Extender [7] and its SOAP messaging capabilities which allows the invocation of job submission methods exposed by the schedulers in the compute grid layer (see Figure 1 (a)). The invocation is asynchronous so as to not block the calling thread and to allow actual parallelism in data processing. While extensive details are out of the current scope, for illustration purposes, let us discuss here the query in Figure 2 (a). After the initial aggregation step (performed by agg()) the resulting computations are outsourced to the grid (grouped by customer, c id) through the analysis grid()
execution time (s)
120 100 80
SELECT c_id, analysis_grid(agg(c_assets)) FROM customer_tx WHERE c_age < 21 GROUP BY c_id
60 40 20 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 compute nodes
Fig. 2. (a) With increasing number of computation nodes, query execution time decreases. (b) Sample Query.
XG: A Grid-Enabled Query Processing Engine
1119
UDF. This constructs the necessary SOAP envelopes, converts the arguments to a serializable format and invokes the grid scheduler with two parameters for each customer: (i) an URL reference to the external regression analysis code (located in the grid code-base, see Figure 1 (b)) and (ii) a reference to the aggregate input data. There are two alternatives for data transfer to/from the compute grid: inline (as an actual parameter in the job submission – suitable for small amounts of input data and close local clusters) and by-reference where actual input data sources are identified as part of the job submission – suitable for massive data processing in a global grid. To support the input/output data by reference paradigm, in our design data replication is activated by the meta-scheduler at the grid cluster level, leveraging “close” data stores and linking in with data replication/placement mechanisms (Information Integration [5]) if the data is “far” (see Figure 1 (b)). Computation Layer. XG is a grid management solution designed for tight data-layer integration. It enables a hierarchical grid structure of individual fast(er)-bus compute clusters (at the extreme just a single machine). This design derived from the insight that (arguably) a majority of grid infra-structures are to be composed of multiple highspeed cluster networks linked by lower speed inter-networks. Designing an awareness of this clustering structure in the actual grid allows for location-based scheduling and associated data integration and replication algorithms. The grid hierarchy is supported by the concept of a meta-scheduler, (Figure 1 (b)) a software entity able to control a set of other individual (meta)schedulers in a hierarchical fashion, composing a multi-clustered architecture. Job submission at any entry-point in this hierarchy results in an execution in the corresponding connected subtree (or sub-graph). At the cluster level a scheduler is managing a set of computation nodes. A node is composed (among others) of an Execution Engine and an Execution Monitor. The scheduler deploys a discovery protocol for automatic discovery of available compute resources, a polling mechanism for monitoring job progress and notifications for job rescheduling (e.g., in case of failures) and a scheduling algorithm for job scheduling. The scheduling algorithm is designed as a plugin within the scheduler, allowing for different scheduling policies to be hot- swapped. For inter-operability, at all levels, the schedulers are designed be invoked (e.g. for job scheduling) through a web-service interface. This interface allows for job submission (with both inline and by-reference data), job monitoring and result retrieval among others. More details are out of scope here. Results. Our experimental test-bed consists of a grid cluster composed of 70 general purpose 1.2GHz Linux boxes with approximately 256MB of RAM each. The data layer deploys IBM DB2 ver. 8.2. with the XML Extender [7] enabled. We evaluated the actual speed-ups of the model generation process with increasing number of grid nodes. Figure 2 (b) illustrates a data mining scenario (for 100 customer profiling jobs) for which execution time went down from roughly 112 seconds for one compute node, to about 11 seconds when 12 nodes where deployed. The deployed code in both the data layer and the grid remained the same. What changed was just the availability of new computation power on the grid side.
1120
R. Sion et al.
4 Conclusions In this work we introduced a novel architecture for data processing, a functional fusion between a data and a computation layer. We then experimentally showed how our solution can be leveraged for significant benefits in data processing jobs such as data analysis and mining over large data sets. There are significant open avenues for future research including: the integration of a message passing interface solution in the compute grid, more complex failure recovery augmented by check-pointing, the implementation of privacy-preserving primitives for data processing, the exploration of both data placement and computation scheduling as first class citizens in resource scheduling. This becomes especially relevant in on-demand environments where a maximal throughput in data and compute intensive application is not possible using current manual data partitioning and staging methods. Last but not least, we believe grid-aware query processing to be an exciting avenue for future research, ultimately resulting in a computation aware grid query optimizer within a traditional DBMS query processor.
References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.
The Condor Project. Online at http://www.cs.wisc.edu/condor. The Global Grid Forum. Online at http://www.gridforum.org. The Globus Alliance. Online at http://www.globus.org. The Grid Physics Network. Online at http://www.griphyn.org. The IBM DB2 Information Integrator. Online at http://www.ibm.com/software/ data/integration. The IBM DB2 Universal Database. Online at http://www.ibm.com/software/ data/db2. The IBM DB2 XML Extender. Online at http://www.ibm.com/software/data/ db2/extenders/xmlext. The Microsoft SQL Server. Online at http://www.microsoft.com/sql. The Oracle Database. Online at http://www.oracle.com/database. The Particle Physics Data Grid. Online at http://www.ppdg.net. Julie Ratner. Human Factors and Web Development, Second Edition. Lawrence Erlbaum Associates, 2002. Radu Sion, Ramesh Natarajan, Inderpal Narang, Wen-Syan Li, and Thomas Phan. XG: A Data-driven Computation Grid for Enterprise-Scale Mining. In Proceedings of the International Conference on Database and Expert Systems Applications DEXA, 2005.
Managing and Querying Versions of Multiversion Data Warehouse Robert Wrembel and Tadeusz Morzy Pozna´ n University of Technology, Institute of Computing Science, Pozna´ n, Poland {Robert.Wrembel, Tadeusz.Morzy}@cs.put.poznan.pl
1
Introduction
A data warehouse (DW) is a database that integrates external data sources (EDSs) for the purpose of advanced data analysis. The methods of designing a DW usually assume that a DW has a static schema and structures of dimensions. In practice, schema and dimensions’ structures often change as the result of the evolution of EDSs, changes of the real world represented in a DW, new user requirements, new versions of software being installed, and system tuning activities. Examples of various change scenarios can be found in [1, 8]. Handling schema and dimension changes is often supported by schema evolution [2], temporal extensions [3, 8, 5, 10, 7], and versioning extensions [6]. Schema evolution approaches maintain one DW schema and the set of data that evolve in time. Temporal versioning techniques use timestamps on modified data in order to create temporal versions. In versioning extensions, a DW evolution is managed partially by means of schema versions and partially by data versions. These approaches solve the DW evolution problem partially. Firstly, they do not offer a clear separation between different DW states. Secondly, the approaches do not support modeling alternative, hypothetical DW states required for a what-if analysis. In order to eliminate the limitations of the aforementioned approaches, we propose a multiversion data warehouse.
2
Multiversion Data Warehouse - Supported Features
The multiversion data warehouse (MVDW) serves as a framework for: (1) separating various structures and contents of a DW, corresponding to various time periods; (2) creating and managing multiple alternative virtual business scenarios, required for the what-if analysis; (3) running queries addressing either a single or multiple DW versions [1]. Concept. The MVDW is composed of persistent versions, each of which describes a DW schema and data within a given time period. A DW version is in turn composed of a schema version and an instance version. A schema version describes the structure of a DW, whereas an instance version represents the set of data described by its schema version. Both of them are valid within Y. Ioannidis et al. (Eds.): EDBT 2006, LNCS 3896, pp. 1121–1124, 2006. c Springer-Verlag Berlin Heidelberg 2006
1122
R. Wrembel and T. Morzy
a certain period of time, represented by validity timestamps. We distinguish real and alternative versions. Real versions are used for representing changes in a real business environment, e.g. changing organizational structures, changing geographical borders of regions. Real versions are linearly ordered by the time they are valid within. Alternative versions represent virtual business scenarios and are used for simulation purposes. All DW versions form a version derivation graph whose root is the first real version. Schema and Dimension Structure Changes. The structure of a schema version and dimensions is modified by the set of operations that include among others: creating/modifying/dropping/renaming fact and level tables, adding/ modifying/dropping/renaming attributes, inserting/updating/deleting fact and dimension records, splitting/merging/reclassifying dimension records. Data Sharing. Multiple DW versions may partially use the same sets of data e.g., two real versions may use the same dimension. In order to eliminate data redundancy and reduce data volume, our prototype system applies data sharing. To this end, an information about all DW versions a given record belongs to is stored with this record in the form of bitmaps. A single bitmap represents one DW version. The number of bitmaps equals to the number of versions sharing data. The number of bits in a bitmap equals to the number of records in a given table. The ith bit in bitmap describing version Vm , is set to 1 if the ith record in the table exists in Vm . Otherwise the bit is set to 0. Querying Multiple DW Versions. In a MVDW, data of a user’s interest are usually distributed among several versions and a user may not be aware of the location of particular data. Moreover, DW versions addressed in queries may differ with respect to their schemata. For the purpose of querying a MVDW we proposed an extension to a standard SQL language that allows to: (1) address either a single version (further called a single-version query - SVQ) or multiple versions (further called a multi-version query - MVQ), (2) present results from various DW versions as if they belonged to a selected DW version, c.f. [9]. A user expresses a MVQ in terms of a selected schema version (usually the current real one), then a MVQ is processed by the parser and executor in the following steps. St1: Constructing the set of DW versions that are to be addressed in a MVQ - to this end version validity timestamps are used. St2: Decomposing MVQ into n SVQs taking into account schema changes (e.g. domain, attribute name, and table name changes). Each of SVQs addresses its own DW version. St3: Executing SVQs in their own DW versions. St4: Returning results of every SVQs to a user and presenting them separately. Additionally, every result set is annotated with: (1) information about a DW version the result was obtained from, (2) meta information about schema (e.g. attribute/table renaming, attribute domain modification) and dimension instance changes (e.g. reclassifying, splitting, or merging dimension records) between adjacent DW versions addressed by a MVQ. This meta information allows to analyze and interpret the obtained data appropriately. St5: Integrating SVQ results into one common data set (if possible). This set is represented in a DW version specified by a user
Managing and Querying Versions of Multiversion Data Warehouse
1123
(the current real version by default). The integration will not be possible if, for example, some attributes are missing in some queried DW versions. Transactional Maintenance. A MVDW is maintained by transactions in order to assure the consistency of its schema and data versions. To this end, we use four basic types of transactions. A versioning transaction is responsible for deriving a new DW version. A schema transaction is responsible for modifying a schema version. A data refreshing transaction is responsible for loading/refreshing fact and dimension level tables. A user transaction is applied to analytical processing by end users. The isolation of these transactions is achieved by a standard multiversion concurrency control algorithm.
GUI
Management layer
Schema and data mgr
MVQ interface
Transaction manager
Schema and data mgr library
MVDW
mgmt library
metadata
presentation module
MVQ parser
MVQ executor
dw versions
Fig. 1. The software architecture of the MVDW prototype
Metadata Management. In order to provide schema and data versioning and querying a MVDW, the set of well defined metadata is required. Our system manages metadata on: (1) the structure and content of every DW version, (2) changes applied between adjacent versions, (3) data conversion methods - which are required for MVQs, (4) transactions run in the system. Prototype Architecture and Implementation. Our MVDW has been implemented as a middle layer (cf. Figure 1) on top of Oracle10g DBMS that stores metadata, versions of data, and the library of the system’s management software. A user operates on a MVDW by means of a graphical interface implemented in Java. It makes available tools for managing multiversion schema and data, executing multiversion queries, and presenting their results. The middle management layer is composed of four main software modules, i.e. transaction manager, schema and data manager, MVQ parser, and MVQ executor.
3 Prototype Demonstration
The demonstration of the prototype MVDW will focus on: (1) showing how multiple DW versions are derived and managed, (2) applying schema and dimension
modification operations to selected DW versions, (3) demonstrating how queries can address multiple DW versions that differ with respect to their schemata and dimension structures, (4) demonstrating how the results of MVQs are visualized and how the results are annotated with meta information, and (5) populating DW versions with data and explaining data sharing between multiple DW versions.
4 Contribution
To the best of our knowledge, our MVDW prototype is the first system that supports changes not only to the structure of dimensions but also to the structure of a DW schema. Real DW versions can also be used for storing versions of data only, in which case our MVDW offers functionality identical to that of temporal systems. Alternative DW versions are used for simulation purposes within what-if analysis. By physically separating DW versions, one can clearly distinguish between different states of reality or simulation scenarios. A user can query only the versions of interest, thus limiting the searched volume of data. Our implementation of a multiversion query language allows querying DW versions that differ with respect to their schemata and the structure of dimensions [9]. Additionally, query results are annotated with meta information on the changes applied to the queried DW versions, which is a unique feature. This meta information allows the obtained results to be interpreted correctly. Changes made between DW versions are traced in the metaschema but, unlike lineage techniques, e.g., [4], our prototype manages the provenance of data and schema elements only inside the MVDW.
References
1. Bębel B., Eder J., Koncilia Ch., Morzy T., Wrembel R.: Creation and Management of Versions in Multiversion Data Warehouse. Proc. of ACM SAC, Cyprus, 2004
2. Blaschka M., Sapia C., Höfling G.: On Schema Evolution in Multidimensional Databases. Proc. of DaWaK, Italy, 1999
3. Chamoni, P., Stock, S.: Temporal Structures in Data Warehousing. Proc. of DaWaK, Italy, 1999
4. Cui Y., Widom J.: Lineage Tracing for General Data Warehouse Transformations. Proc. of VLDB, 2001
5. Eder J., Koncilia C., Morzy T.: The COMET Metamodel for Temporal Data Warehouses. Proc. of CAiSE, Canada, 2002
6. Golfarelli M., Lechtenbörger J., Rizzi S., Vossen G.: Schema Versioning in Data Warehouses. Proc. of ER Workshops, China, 2004
7. Microsoft ImmortalDB. Retrieved November 25, 2005 from http://research.microsoft.com/db/ImmortalDB/
8. Mendelzon A.O., Vaisman A.A.: Temporal Queries in OLAP. Proc. of VLDB, Egypt, 2000
9. Morzy T., Wrembel R.: On Querying Versions of Multiversion Data Warehouse. Proc. of DOLAP, USA, 2004
10. Salzberg B., Jiang L., Lomet D., Barrena M., Shan J., Kanoulas E.: A Framework for Access Methods for Versioned Data. Proc. of EDBT, 2004
Natix Visual Interfaces
A. Böhm, M. Brantner, C-C. Kanne, N. May, and G. Moerkotte
University of Mannheim, Germany
{alex, msb, cc, norman, moer}@pi3.informatik.uni-mannheim.de
Abstract. We present the architecture of Natix V2. Among the features of this native XML Data Store are an optimizing XPath query compiler and a powerful API. In our demonstration we explain this API and present XPath evaluation in Natix using its visual explain facilities.
1 The Natix System
The Natix Project [3] was among the first to realize the idea of a native XML Data Store (XDS), which supports XML processing down to the deep levels of the storage and query execution engine. Such native XDSs are now also being introduced by major database vendors [1]. Natix Version 2.0 provides most features of a native, enterprise-class XDS to application programmers, e.g. ACID transactions, efficient processing of XPath 1.0, and a rich set of APIs. Fig. 1 shows the modules contained in the Natix C++ library.
Fig. 1. Natix architecture (Application Programming Interface with DOM, SAX, and Text views; View Manager; System Control; Query Compiler; Query Execution; Schema Management; Transaction Management with Isolation and Recovery; Storage Engine)
Applications use Natix' schema management facilities to organize their persistent XML data collections. The topmost organizational unit is the Natix instance. System parameters, such as main memory buffer sizes, are instance-specific. Both transaction and crash recovery operate at the instance level. Therefore, all operations of a particular transaction must be executed within the context of a single instance. Each instance consists of several repositories that contain documents of a particular application domain. For example, the product catalog of an online shop could be stored within one repository, while another one would be used for the business reports. A repository comprises document collections. Document collections represent an unordered set of related documents, typically having a similar structure and markup, although they are not required to conform to a conjoint schema.
This work was supported by the Landesgraduiertenförderung Baden-Württemberg. This work was supported by the Deutsche Forschungsgemeinschaft under grant MO 507/10-1.
Applications use this hierarchy level for grouping documents together that are processed as a unit, for example, all items belonging to a particular commodity group of an online shop.
2 Application Programming Interface
Programming convenience, flexibility, and high performance are crucial for developing universal data management applications. Below, we describe the concepts that enable the Natix API to satisfy these requirements. Natix provides a variety of language bindings for these concepts. We present the C++ binding as an example.
2.1 Concepts
The Natix API [4] allows accessing and manipulating entities on all levels of the logical hierarchy through request objects. To perform a database operation, an application creates a request object and forwards it to the system. Instance-level operations such as instance creation and destruction are performed by sending requests to the instance, while transaction-level operations are performed by sending them to the corresponding transaction objects.
The API uses fragments as an abstract representation of all XML data that is handled by the Natix system. Fragments are analogous to UNIX file descriptors, which provide a uniform interface to a variety of physical objects such as files, sockets, memory regions, etc. Natix fragments provide a uniform interface to a variety of XML data sources, for example documents stored in Natix, documents stored in the file system, entire document collections, or query results.
Another important concept is that of views. Generally, there are many ways to represent and access XML data, such as the DOM and the SAX API, each of them having their particular benefits and drawbacks depending on the requirements of the respective application. In order to provide maximum flexibility, Natix implements many different interfaces for accessing XML data. To access an XML data source through a particular API, the application opens a corresponding view on a fragment. The current Natix release includes, among others, views for both the DOM and the SAX API, a C++ stream interface, a document metadata view and various sequence views for iterating over elements contained in an organizational unit (for example, all documents of a particular collection).
The convenience and flexibility of the fragment/view concept is complemented by an efficient implementation mechanism. Natix uses a fragment/view matrix to obtain the most efficient implementation of a particular API for a given fragment type. For example, when accessing a document in the file system using a DOM view, a conventional parser is used to create a main-memory DOM representation. In contrast, if a DOM view is requested for a document stored in Natix, an object manager will make sure that only those parts that are actually required by the application are loaded from secondary storage, thereby reducing main memory consumption and avoiding unnecessary I/O overhead. As another example, if a SAX view is opened for a query result, the SAX events can be returned to the application in a pipelined fashion while the query is being evaluated.
2.2 C++ Language Binding
We will illustrate the Natix binding for C++ on the basis of the evaluation of XPath queries.1 Two queries will be used as examples, one for selecting the book with a specific id (//book[@id='2342']) from a particular book collection, the other one for gathering all invoices that exceed a specific amount (/invoice[sum(item/@price) > 200]) from a document collection.
As proposed by the W3C, query execution in Natix is divided into two phases. We distinguish between a static and a dynamic evaluation phase. During the static phase, the query is compiled and prepared to be executed. A static query context is provided, which allows passing parameters to the compilation process. After the query is successfully prepared, it can be executed. A dynamic query context defines the environment for query execution, in particular the context item (e.g. a document or a collection). Figure 2 shows a few lines of code for executing a query on a single document. At the end, a DOMView provides a DOM representation of the requested book element.

    // start the transaction
    Transaction trans(inst);
    // create the query
    QueryHandle queryHandle = CreateQueryTidy(trans, "//book[@id='2342']");
    // prepare the query
    PrepareQueryTidy(trans, queryHandle);
    // execute it and get a fragment
    FragmentDescriptor queryResult = ExecuteQueryTidy(trans, queryHandle,
                                                      "store", "books", "document.xml");
    // open a DOM view on the result fragment and obtain the DOM node
    OpenViewTidy domView(trans, queryResult);
    xercesc::DOMNode *node = domView->getNode();

Fig. 2. Example for the query API
Note that the actual DOMNode object returned by the DOMView is a binary compatible instance of the C++ DOM binding of Xerces C++ [5]. Executing the second example query on all invoices in one collection and opening a DocumentSequenceView would return an iterator for all qualifying documents. For programming convenience, the Natix API makes use of several C++ language features such as automatic database resource deallocation when destroying the corresponding objects. In Natix parlance, all of the request objects used in the example are Tidy requests, which means that they free any resources when they are destroyed, for example if an exception has been raised.
3 Executing XPath Queries in Natix
Next, we give an overview of efficient and scalable XPath query compilation and evaluation. During this process an XPath query runs through different stages. Natix offers one visual explain facility for each of the following three steps of the compilation process.2
1 The interface is capable of handling additional query languages such as XQuery or XSLT.
2 The tool used to trace the compilation process in the demonstration is also available online at http://pi3.informatik.uni-mannheim.de/xpc.html
The result of the first stage, after parsing and normalizing the query, is an internal expression representation. It precisely describes the structure of a query, e.g. relationships between expressions or the classification of predicates.
Continuing with this representation, our process departs from the conventional approach of interpreting XPath and enters the realm of algebraic query evaluation. For every expression we apply translation rules that yield an operator tree as a result. Figure 3 shows an operator tree for the first example query. Besides special operators like unnestmap (Υ), this plan uses well-known operators such as a selection (σ) or aggregation (A). The detailed translation process is described in [2]. This query execution plan (QEP), which is a physical algebra plan, specifies detailed rules for evaluating the query.
Fig. 3. Query evaluation plan
The last step is called code generation and produces code that can be evaluated efficiently in the Natix runtime system (RTS). The RTS implements a full-featured algebra as introduced in [2]. To provide scalability, all sequence-valued algebra operators are implemented as iterators. The subscripts of these operators are implemented using assembler-like programs that are interpreted by the Natix virtual machine (NVM). These programs provide mechanisms to access the persistent representation of documents in the Natix page buffer or provide comparison and string functions.
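The iterator design of the runtime algebra can be illustrated with a short sketch. This is not Natix code (the actual operators are C++ classes whose subscripts are NVM programs); it is only a Python illustration of how sequence-valued operators such as an unnest-map and a selection can be pipelined without materializing intermediate results.

    # Pipelined evaluation in the spirit of //book[@id='2342']:
    # every operator consumes an iterator and yields items lazily.
    def unnest_descendants(nodes):
        for n in nodes:                       # descend into every subtree
            yield n
            yield from unnest_descendants(n.get("children", []))

    def select(predicate, nodes):
        return (n for n in nodes if predicate(n))

    def first_node(nodes):
        return next(iter(nodes), None)

    doc = {"tag": "lib", "children": [
        {"tag": "book", "id": "2342", "children": []},
        {"tag": "book", "id": "7", "children": []}]}

    hit = first_node(select(lambda n: n.get("tag") == "book" and n.get("id") == "2342",
                            unnest_descendants([doc])))
    print(hit["id"])                          # 2342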
4 Demonstration
Our demonstration consists of two main components: (1) We provide guidance on how to build XML applications and how to retrieve XML data using the various interfaces offered by Natix. Using several sample programs and a graphical user interface we demonstrate how to create and manage huge XML repositories. Opening views or executing queries shows a variety of possibilities for efficiently accessing the stored data. (2) We explain the internals of the XPath query compiler using the various graphical representations of query plans that are provided by Natix. Several example queries from, for instance, the XMark benchmark help to understand and interpret the facets of our XPath compiler after every compilation step using the different visual explain facilities. The resulting query evaluation plans provide deep insight into the Natix query execution engine. The Natix XDS is available for download at [4].
References
1. K. S. Beyer, R. Cochrane, V. Josifovski, J. Kleewein, G. Lapis, G. M. Lohman, B. Lyle, F. Ozcan, H. Pirahesh, N. Seemann, T. C. Truong, B. Van der Linden, B. Vickery, and C. Zhang. System RX: One Part Relational, One Part XML. In Proc. SIGMOD Conference, pages 347-358, 2005.
2. M. Brantner, C-C. Kanne, S. Helmer, and G. Moerkotte. Full-fledged Algebraic XPath Processing in Natix. In Proc. ICDE Conference, pages 705-716, 2005.
3. T. Fiebig, S. Helmer, C-C. Kanne, G. Moerkotte, J. Neumann, R. Schiele, and T. Westmann. Anatomy of a native XML base management system. VLDB Journal, 11(4):292-314, 2002.
4. Chair of Practical Computer Science III. Natix Manual. University of Mannheim, 2005. http://pi3.informatik.uni-mannheim.de/natix.html.
5. Apache XML Project. Xerces C++ parser. Project Web Site, 2002.
Hermes - A Framework for Location-Based Data Management* Nikos Pelekis, Yannis Theodoridis, Spyros Vosinakis, and Themis Panayiotopoulos Dept of Informatics, University of Piraeus, Greece {npelekis, ytheod, spyrosv, themisp}@unipi.gr http://isl.cs.unipi.gr/db
Abstract. The aim of this paper is to demonstrate Hermes, a robust framework capable of aiding a spatio-temporal database developer in modeling, constructing and querying a database with dynamic objects that change location, shape and size, either discretely or continuously in time. Hermes provides spatio-temporal functionality to state-of-the-art Object-Relational DBMS (ORDBMS). The prototype has been designed as an extension of STAU [6], which provides data management infrastructure for historical moving objects, so as to additionally support the demands of real time dynamic applications (e.g. Location-Based Services - LBS). The produced type system is packaged and provided as a data cartridge using the extensibility interface of Oracle10g. The offspring of the above framework extends PL/SQL with spatiotemporal semantics. The serviceableness of the resulting query language is demonstrated by realizing queries that have been proposed in [9] as a benchmarking framework for the evaluation of LBS.
* Research partially supported by the Pythagoras EPEAEK II Programme of the Greek Ministry of National Education and Religious Affairs, co-funded by the European Union.
1 Introduction
In recent years, we have been witnessing the explosion of emerging non-traditional database applications, such as location-based services. The main components of the underlying database of such applications include stationary and moving objects. The so-called Moving Objects Databases (MODs) are (or soon will be) ubiquitous. As the number of mobile commerce services, or mobile services in general, increases rapidly every day, the need for robust management of location data, as well as for the analysis of user movements, is vital. In this paper, we present the design and implementation issues of a research prototype for efficient location-based data management, called Hermes (the ancient Greek god of Commerce). Hermes can be considered as a MOD management system with emphasis on the peculiarities of MODs, from representation to querying issues. One could mention a series of applications of Hermes at various levels in the context of mobile services. For example, Hermes can be used as a plug-in in telecom companies' data warehouses
that include spatio-temporal content. The previous example refers to offline processing of such historical data. Besides, Hermes supports the data management of real-time mobile services, addressing the issues of modern dynamic applications. For instance, imagine a user (tourist, consumer) moving around a city equipped with a next-generation mobile terminal (e.g. a 3G cell-phone or a PDA enhanced by the presence of a GPS receiver), receiving hints of information, commercial spots etc. Researchers [1], [2], [3], [4], motivated by such application scenarios, have tried to model spatio-temporal databases using this concept of moving objects and to integrate them into any extensible DBMS. On the other hand, commercial relational or object-relational database systems offer limited capability of handling this kind of non-traditional data (object trajectories, in time and space). Hermes is the partial realization of the research vision discussed above.
2 The Prototype
Hermes is developed as a system extension that provides spatio-temporal functionality to Oracle10g's Object-Relational Database Management System (ORDBMS). The system is designed in a way that it can be used either as a pure temporal or a pure spatial system, but its main functionality is to support the modelling and querying of continuously moving objects. Such a collection of data types and their corresponding operations are defined, developed and provided as an Oracle data cartridge. Hermes Moving Data Cartridge (Hermes-MDC) is the core component of the Hermes system architecture. Hermes-MDC provides the functionality to construct a set of moving, expanding and/or shrinking geometries, as well as time-varying base types. Each of these moving objects is supplied with a set of methods that enable the cartridge user to query and analyze spatio-temporal data. By embedding this functionality offered by Hermes-MDC in Oracle's DML [5], one obtains an expressive and easy to use query language for moving objects. In order to implement such a framework in the form of a data cartridge, we exploit a set of standard data types together with the static spatial data types offered by the Spatial option of Oracle10g [5] and the temporal literal types introduced in a temporal data cartridge, called TAU Temporal Literal Library Data Cartridge (TAU-TLL) [6]. Based on these temporal and spatial object data types, Hermes-MDC defines a series of moving object data types illustrated in the UML class diagram of Figure 1. The reader interested in a detailed discussion of the resulting type system is referred to [7].
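To give an intuition of what a time-varying base type provides, the sketch below models a moving point as a sequence of timestamped positions with linear interpolation between them. It is only a conceptual illustration in Python; the actual Hermes-MDC types are Oracle object types defined in the data cartridge, and the class and method names used here are invented.

    from bisect import bisect_right

    class MovingPoint:
        """A 2D point whose position is a function of time (linear interpolation)."""
        def __init__(self, samples):
            self.samples = sorted(samples)    # list of (t, x, y)

        def at(self, t):
            ts = [s[0] for s in self.samples]
            if not (ts[0] <= t <= ts[-1]):
                return None                   # outside the object's lifespan
            i = bisect_right(ts, t)
            if i == len(ts):
                return self.samples[-1][1:]
            (t0, x0, y0), (t1, x1, y1) = self.samples[i - 1], self.samples[i]
            f = (t - t0) / (t1 - t0)
            return (x0 + f * (x1 - x0), y0 + f * (y1 - y0))

    route = MovingPoint([(0, 0.0, 0.0), (10, 5.0, 0.0), (20, 5.0, 5.0)])
    print(route.at(15))                       # (5.0, 2.5), halfway along the second segment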
3 Architecture of Hermes
Figure 2 illustrates the architecture of the Hermes system. A straightforward utilization scenario for a Hermes-MDC user is to design and construct a spatio-temporal object-relational database schema and build an application by transacting with this database. In this case, where the underlying ORDBMS is Oracle10g, in order to specify the database schema, the database designer writes scripts in the syntax of the Data Definition Language (DDL), i.e. PL/SQL extended with the spatio-temporal operations previously mentioned.
To build an application on top of such a database for creating objects, querying data and manipulating information, the application developer writes a source program in Java wherein he/she can embed PL/SQL scripts that invoke object constructors and methods from Hermes-MDC. The JDBC pre-processor integrates the power of the programming language with the database functionality offered by the extended PL/SQL and, together with the ORDBMS Runtime Library, generates the application's executable. Writing independent stored procedures that take advantage of Hermes functionality and compiling them with the PL/SQL compiler is another way to build a spatio-temporal application.
Fig. 1. Hermes-MDC Class Diagram
Fig. 2. The architecture of Hermes
4 The Demonstration Scenario
In order to demonstrate the usefulness and applicability of the server-side extensions provided by Hermes, we implement the majority of the benchmark queries for LBS proposed in [9]. Additionally, we specifically develop an LBS application scenario for travellers entering the area of an airport, construct a spatial database modeling the ground plan of the airport, and input random trajectories of travellers moving around the area. Then, we pose queries following the same classification as proposed in [9]. Indicative examples include:
• Queries on stationary reference objects; examples include point (e.g. does this check-in serve my flight?), range (e.g. are there any fellow travellers in the area in front of this check-in?), distance-based (e.g. find the closest check-in), nearest-neighbor (e.g. find the closest coffee shops to my current location) and topological queries (e.g. find travellers who crossed this gate during the past hour);
• Queries on moving reference objects; examples include distance-based (e.g. find travellers who passed close to me this evening) and similarity-based queries (e.g. find the three most similar trajectories to the one I have followed so far in the airport);
• Join queries; examples include distance-join (find the closest check-ins to travellers of this flight) and similarity-join queries (find the two most similar pairs of travellers' trajectories);
• Queries involving unary operators, such as travelled distance or speed (e.g. find the average speed of travellers on Saturday nights).
Based on related research work [8] the above queries constitute a minimum functionality a MOD system should provide. The above demonstration scenario is also accompanied (wherever appropriate) by visual illustrations formulated by MapViewer [5] (see Figure 3).
select shape from gates
select v.route.get_enter_leave_points (select gates.shape from gates where gates.id=1) from visitors v where v.id=1
Fig. 3. Visualization of entry and exit points in an area of interest
5 Conclusions
In this paper, the Hermes prototype system was introduced. Hermes has been designed as a system extension that provides spatio-temporal functionality to state-of-the-art ORDBMSs. The usability of the prototype is demonstrated by applying benchmark queries proposed for the evaluation of systems providing location-based services. Future work includes knowledge discovery from MODs. Typical examples we plan to incorporate in the Hermes framework are: 'characterization of routes as (in)frequent', 'habits in motion with respect to space and/or time constraints', etc.
References
1. S. Dieker and R. H. Guting, "Plug and Play with Query Algebras: Secondo. A Generic DBMS Development Environment", In Proc. of the Int. Database Engineering and Applications Symposium (IDEAS), p. 380-390, 2000.
2. M. Erwig, R.H. Guting, M. Schneider, and M. Vazirgiannis, "Spatio-Temporal Data Types: An Approach to Modelling and Querying Moving Objects in Databases", GeoInformatica, 3(3):265-291, 1999.
3. L. Forlizzi, R. H. Guting, E. Nardelli, M. Schneider, "A Data Model and Data Structures for Moving Objects Databases", In Proc. ACM SIGMOD Intl. Conf. on Management of Data, pages 319-330, 2000.
4. R.H. Guting, M. H. Bohlen, M. Erwig, C. S. Jensen, N. A. Lorentzos, M. Schneider, and M. Vazirgiannis, "A Foundation for Representing and Querying Moving Objects", ACM Transactions on Database Systems, 25(1):1-42, 2000.
5. Oracle Corp. Oracle Database Documentation Library, 10g Release 1 (10.1), URL: http://otn.oracle.com/pls/db10g/.
6. N. Pelekis. STAU: A spatio-temporal extension to ORACLE DBMS, PhD Thesis, UMIST, 2002.
7. N. Pelekis, B. Theodoulidis, Y. Theodoridis and I. Kopanakis. An Oracle Data Cartridge for Moving Objects. UNIPI-ISL-TR-2005-01, March 2005. URL: http://isl.cs.unipi.gr/db.
8. N. Pelekis, B. Theodoulidis, I. Kopanakis and Y. Theodoridis, "Literature Review of Spatio-Temporal Database Models", Knowledge Engineering Review, 19(3):235-274, 2005.
9. Y. Theodoridis. "Ten Benchmark Database Queries for Location-based Services", The Computer Journal, 46(6):713-725, 2003.
TeNDaX, a Collaborative Database-Based Real-Time Editor System
A Word-Processing 'LAN-Party'
Stefania Leone1, Thomas B. Hodel-Widmer2, Michael Boehlen3, and Klaus R. Dittrich1
1 University of Zurich, Department of Informatics, Winterthurerstrasse 190, 8057 Zurich, Switzerland, {leone, dittrich}@ifi.unizh.ch, http://www.tendax.net
2 Swiss Federal Institute of Technology (ETH Zurich), Leonhardshalde 21, 8092 Zurich, Switzerland, hodel@sipo.gess.ethz.ch
3 Free University of Bolzano-Bozen, Piazza Domenicani 3, 39100 Bolzano, Italy, boehlen@inf.unibz.it
Abstract. TeNDaX is a collaborative database-based real-time editor system. TeNDaX is a new approach to word-processing in which documents (i.e. content and structure, tables, images etc.) are stored in a database in a semi-structured way. This supports the provision of collaborative editing and layout, undo and redo operations, business process definition and execution within documents, security, and awareness. During the document creation process and use, meta data is gathered automatically. This meta data can then be used for the TeNDaX dynamic folders, data lineage, visual and text mining, and search. We present TeNDaX as a word-processing 'LAN-Party': collaborative editing and layout; business process definition and execution; local and global undo and redo operations; all based on the use of multiple editors and different operating systems. In a second step we demonstrate how one can use the data and meta data to create dynamic folders, visualize data provenance, carry out visual and text mining and support sophisticated search functionality.
1 Introduction
Text documents are a valuable resource for virtually any enterprise and organization. Documents like papers, reports and general business documentation contain a large part of (business) knowledge. Documents are mostly stored in a hierarchical folder structure on file servers and it is difficult to organize them with regard to classification, versioning etc., although it is of utmost importance that users can find, retrieve and edit documents in a user-friendly way. In most of the commonly used word-processing applications documents can be manipulated by only one user at a time, and tools for collaborative document editing and management are rarely deployed. Documents should be seen as a valuable
business asset which requires an appropriate data management solution. The need to store, retrieve and edit these documents collaboratively with guaranteed mechanisms for security, consistency, availability and access control is obvious. In the following, we present the database-based TeNDaX editor system which enables collaborative document editing and management, all within an interactive multi-user database environment.
2 The TeNDaX Editor System
TeNDaX stands for a Text Native Database eXtension. It enables the storage of text in databases in a native form so that text editing is represented as real-time transactions. Under the term 'text editing' we understand the following: writing and deleting text (characters), copying and pasting, defining layout and structure, inserting notes, setting access rights, defining business processes, inserting tables, pictures, and so on, i.e. all the actions regularly carried out by word processing users. By 'real-time transactions' we mean that editing text (e.g. writing a character/word) invokes one or several database transactions, so that everything which is typed appears within the editor as soon as these objects are stored persistently. Instead of creating files and storing them in a file system, the content and all of the meta data belonging to the documents is stored in a special way in the database, which enables very fast transactions for all editing tasks [3]. The database schema and the transactions are designed to be used in a multi-user environment, as is customary in the database context. As a consequence, many of the database features (data organization and querying, recovery, integrity and security enforcement, multi-user operation, distribution management, uniform tool access, etc.) are now, by means of this approach, also available for word processing.
TeNDaX creates an extension of a DBMS to manage text. This addition is carried out 'cleanly' and the corresponding data type represents a 'first-class citizen' [1] of a DBMS (e.g. integers, character strings, etc.). Since the document data is stored in the database, we automatically gather meta data during the whole document creation process [6]. We gain meta data on the document level (creator, roles, date and time, document object ID, document names, structure affiliation, note affiliation, security settings, size, authors, readers, state, places within static folders and user defined properties), on the character level (author, roles, date and time, copy-paste references, local and global undo/redo, security settings, version and user defined properties) and from structure, template, layout, notes, security and business process definitions.
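As an illustration of the character-level bookkeeping described above, the following sketch stores each inserted character together with a few pieces of meta data as one row and wraps every editing action in a database transaction. The schema and function are invented for this example and are far simpler than the actual TeNDaX storage structures (which, for instance, also keep copy-paste references and per-character security settings).

    import sqlite3, datetime

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE char_node (
                       doc_id INTEGER, pos INTEGER, ch TEXT,
                       author TEXT, inserted_at TEXT)""")

    def insert_char(doc_id, pos, ch, author):
        # one editing action = one database transaction
        with con:
            con.execute("UPDATE char_node SET pos = pos + 1 "
                        "WHERE doc_id = ? AND pos >= ?", (doc_id, pos))
            con.execute("INSERT INTO char_node VALUES (?, ?, ?, ?, ?)",
                        (doc_id, pos, ch, author,
                         datetime.datetime.now().isoformat()))

    for i, ch in enumerate("Hi"):
        insert_char(1, i, ch, "alice")
    print([r[0] for r in con.execute(
        "SELECT ch FROM char_node WHERE doc_id = 1 ORDER BY pos")])   # ['H', 'i']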
3 Demonstration: Word-Processing 'LAN-Party'
In our word-processing ‘LAN-Party’ we focus on the TeNDaX editor system. Editors installed on different operating systems (Windows XP, Linux, Mac OSX) will support the demonstration of the following TeNDaX features:
- Collaborative editing: We will concurrently work with multiple users on the same document. Editing a document includes operations like writing and deleting characters [4], inserting pictures, creating tables, applying layout and structure [2], local and global undo and redo operations, setting access rights etc. All these operations are carried out in a dynamic multi-user database environment.
- Business process definitions and flow: We will define and run a dynamic workflow within a document for ad-hoc cooperation on that document [5]. Tasks such as translation or verification of a certain document part can be assigned to specific users or roles. The workflow tasks can be created, changed and routed dynamically, i.e. at run-time.
- Dynamic Folders: On the basis of the automatically gathered document creation process meta data we will build dynamic folders [6]. Dynamic folders are virtual folders that are based on meta data. A dynamic folder can, for example, contain all documents a certain user has read within the last week. Its content is fluid and may change within seconds (e.g. as soon as a document changes). This represents a novel method for document management and text retrieval (a small illustrative sketch follows this list).
- Data Lineage: We can display document content provenance. Meta data about all editing and all copy-and-paste actions is stored with the document. This includes information about the source of the new document part, e.g. from which other document a text has been copied (either internal or external sources). We use this meta data to visualize data lineage as depicted in Figure 1.
- Visual Mining: The information visualization plug-in provides a graphical overview of all documents and offers a variety of interaction modalities. It is possible to navigate the document and meta data dimensions to gain an understanding of the entire document space. A visualization of a set of documents is shown in Figure 2.
- Search: The meta data based searching and ranking plug-in offers sophisticated search options. Documents and parts of documents can be found based either on the document content, or on the structure, or on the document creation process meta data. The search result can be ranked according to different ranking options, e.g. 'most cited', 'newest' etc.
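A dynamic folder can be thought of as a stored predicate over the gathered meta data that is re-evaluated whenever the folder is opened. The sketch below evaluates such a predicate over an in-memory list of document meta data records; in TeNDaX itself this is of course a database query, and the field names used here are invented.

    from datetime import datetime, timedelta

    docs = [
        {"name": "report.xml", "readers": {"alice", "bob"},
         "last_read": {"alice": datetime.now() - timedelta(days=2)}},
        {"name": "offer.xml", "readers": {"carol"},
         "last_read": {"carol": datetime.now() - timedelta(days=20)}},
    ]

    def dynamic_folder(predicate):
        # re-evaluated on every access, so its content follows the meta data
        return [d["name"] for d in docs if predicate(d)]

    read_by_alice_last_week = lambda d: (
        "alice" in d["readers"]
        and datetime.now() - d["last_read"]["alice"] < timedelta(days=7))
    print(dynamic_folder(read_by_alice_last_week))     # ['report.xml']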
Fig. 1. Data Lineage
Fig. 2. Visual Mining
4 Conclusion
The collaborative database-based real-time TeNDaX editor system enables the storage and management of documents within a database. It natively represents text in fully-fledged databases and incorporates all necessary collaboration support. It offers functions such as editing, awareness, fine-grained security, sophisticated document management, versioning, business processes, text structure, data lineage, text and visual mining - all within a collaborative multi-user database environment. Within the TeNDaX editor system presented above, we use database technology to provide full word-processing functionality and sophisticated document and meta data visualization. TeNDaX extends database technology and offers database-based universal data management for text documents.
References
1. Abiteboul, S., Agrawal, R., Bernstein, P., Carey, M., Ceri, S., Croft, B., DeWitt, D., Franklin, M., Molina, H. G., Gawlick, D., Gray, J., Haas, L., Halevy, A., Hellerstein, J., Ioannidis, Y., Kersten, M., Pazzani, M., Lesk, M., Maier, D., Naughton, J., Schek, H., Sellis, T., Silberschatz, A., Stonebraker, M., Snodgrass, R., Ullman, J., Weikum, G., Widom, J., and Zdonik, S.: The Lowell Database Research Self-Assessment, CACM Vol. 48(5), 111-118 (2005)
2. Hodel, T. B., Businger, D., Dittrich, K. R.: Supporting Collaborative Layouting in Word Processing. IEEE CoopIS (2004) 355-372
3. Hodel, T. B., Dittrich, K. R.: Concept and Prototype of a Collaborative Business Process Environment for Document Processing. Data & Knowledge Engineering, 52, Special Issue: Collaborative Business Process Technologies (2004) 61-120
4. Hodel, T. B., Dubacher, M., Dittrich, K. R.: Using Database Management Systems for Collaborative Text Editing. ECSCW CEW (2003)
5. Hodel, T. B., Gall, H., Dittrich, K. R.: Dynamic Collaborative Business Processes within Documents. ACM SIGDOC (2004) 97-103
6. Hodel, T. B., Hacmac, R., Dittrich, K. R.: Using Creation Time Meta Data for Document Management. CAiSE (2005)
Synopses Reconciliation Via Calibration in the τ -Synopses System Yariv Matia, Yossi Matias, and Leon Portman School of Computer Science, Tel-Aviv University matiayar@post.tau.ac.il, matias@cs.tau.ac.il, leon.portman@nice.com
Abstract. The τ -Synopses system was designed to provide a run-time environment for multiple synopses. We focus on its utilization for synopses management in a single server. In this case, a critical function of the synopses management module is that of synopses reconciliation: given some limited memory space resource, determine which synopses to build and how to allocate the space among those synopses. We have developed a novel approach of synopses calibration for an efficient computation of synopses error estimation. Consequently we can now perform the synopses reconciliation in a matter of minutes, rather than hours.
1 Introduction
Data synopses are concise representations of data sets, which enable effective processing of approximate queries to the data sets. Recent interest in approximate query processing and in effectively dealing with massive data sets has resulted in a proliferation of new synopses. The τ-Synopses system [6] was designed to provide a run-time environment for local and remote execution of various synopses. It provides the management functionality for registered synopses, and it enables easy registration of new synopses either locally or from remote SOAP-enabled platforms. The τ-Synopses system can serve as an effective research platform for experimental evaluation and comparison of different synopses, as well as a platform for studying the effective management of multiple synopses in a federated or centralized environment. The system was previously presented in the context of remote synopses, demonstrating how synopses can be managed in a distributed fashion [5].
We now focus our attention on the utilization of the τ-Synopses system for synopses management in a single server. In this case, a critical function of the Synopses Manager module is that of synopses reconciliation: given some limited memory space resource, determine which synopses to build and how to partition the available space among those synopses.
The problem of synopses reconciliation was previously studied and several algorithms were presented (e.g., [3, 2]). A basic operation in all reconciliation algorithms is that of estimating the accuracy of synopses implementations for given data sets. The common approach to obtaining such an estimation is to invoke expensive queries on the original data. We present a novel approach, in which the reconciliation algorithm can consult an error-estimation module, which can provide an error estimation without accessing the original data itself. Instead, the error-estimation module computes the synopsis approximation error based on synopsis characteristics, which are computed by a calibration process at integration time, and statistical data about the data sets, computed by a profiling process at registration time. This results in an effective reconciliation process.
Fig. 1. Synopses Manager Architecture
2 Synopsis Calibration and Synopses Reconciliation
The Synopses Manager module provides an automatic recommendation of which synopses to build and their sizes, based on the available synopses implementations, available memory space, and the registered relations and query workload. As depicted in Figure 1 the Synopses Manager includes the following types of data structures: (i) Relation Profile, which includes statistical information about each registered relation; (ii) Synopsis Characteristics, which includes parameters computed at integration time for each synopsis; and (iii) Reconciliation Recommendation, which includes the recommendations of synopses to be built for each registered relation. It also includes the following modules: Calibration, Profiling, Error Estimation, Reconciliation, and Approximate Query Optimizer; these modules are described in more details below. A more detailed explanation of the calibration and reconciliation processes can be found in [4].
The synopsis error-estimation function. The crux of our approach is a novel calibration technique [4], which associates with every synopsis implementation T an error-estimation function EE_T. We have found, through empirical testing, that the following function quite accurately describes the relative error of synopses implementations w.r.t. their corresponding relations and query workloads:

    EE_T(L, Z, Q, R, S) = a_1 L^{b_1} + a_2 Z^{b_2} + a_3 Q^{b_3} + a_4 R^{b_4} + a_5 S^{b_5} + a_6.

The arguments L, Z, Q, R and S, collectively denoted as the Relation Profile, are: the relation size (L), the relation data distribution skew (Z), the workload query skew (Q), the workload query range (R), and the synopsis size (S). The a_i and b_i coefficients, collectively denoted as the Synopsis Characteristics, are unique to each synopsis implementation.
Calibration. This module is invoked every time a new synopsis implementation T is integrated into the system. The calibration process runs a small number of tests on the synopsis implementation, measuring its behavior under various synthetic relations and workloads. It then derives the a_i and b_i coefficients of EE_T using a combination of least-squares linear fitting and the CPLEX commercial solver [1]. The coefficients of the function are stored in the Synopsis Characteristics data structure.
Error Estimation. This module is utilized by the Reconciliation and Approximate Query Optimizer modules. Given an approximate query to a relation with synopsis implementation T, the module computes the error-estimation function EE_T based on the parameters available from the Relation Profile and Synopsis Characteristics data structures, resulting in the estimated approximation error of the query.
Approximate Query Optimizer. This module has two functions: (1) it triggers the building of the required synopses based on the recommendations received from the Reconciliation module; and (2) when a user submits an approximate query to the system, this module performs the query on the relevant synopsis and also invokes the Error-Estimation module, returning both the estimated result and the estimated approximation error of the result.
Profiling. This module is invoked whenever a new relation or query workload is registered in the system. For relations, this module measures the cardinality of the relation (distinct count) and uses least-squares fitting to fit a Zipf parameter to the relation data distribution skew. For query workloads, the number and average range of the queries are calculated, and the Zipf parameter of the query distribution skew is again fitted using least-squares fitting. The computed statistical data is stored in the Relation Profile data structure.
Synopses reconciliation. Synopses reconciliation is basically an optimization problem: given available synopses implementations, a memory space limit, and a query workload, we would like to know the combination of synopses and their sizes that would yield the minimal error for the entire system.
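Before turning to the reconciliation module itself, the calibration step can be pictured as a small curve-fitting exercise. The sketch below fits the coefficients of EE_T to a set of (relation profile, synopsis size, measured relative error) samples with SciPy; it is only an illustration of the idea and not the system's actual procedure, which combines least-squares fitting with the CPLEX solver, and the synthetic numbers are made up.

    import numpy as np
    from scipy.optimize import curve_fit

    def ee(X, a1, b1, a2, b2, a3, b3, a4, b4, a5, b5, a6):
        L, Z, Q, R, S = X
        return (a1 * L**b1 + a2 * Z**b2 + a3 * Q**b3
                + a4 * R**b4 + a5 * S**b5 + a6)

    rng = np.random.default_rng(0)
    profiles = tuple(rng.uniform(0.1, 1.0, 200) for _ in range(5))    # L, Z, Q, R, S
    hidden = [0.3, 0.5, 0.1, 1.2, 0.2, 0.8, 0.1, 1.0, -0.4, 0.5, 0.05]
    measured = ee(profiles, *hidden)            # relative errors observed in the test runs

    # In practice the starting values come from a coarse search; here we simply
    # start close to the (unknown) characteristics so that the fit converges.
    p0 = [c * 1.2 + 0.01 for c in hidden]
    chars, _ = curve_fit(ee, profiles, measured, p0=p0, maxfev=20000)
    print(np.round(chars, 2))                   # the synopsis characteristics a_i, b_i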
The Synopses Reconciliation module can accommodate the implementations of any synopses reconciliation algorithm. The module currently has implementations of the algorithms from [3, 2], with the following modification: whenever the error measurement of a synopsis utilization is required, it uses the Error-Estimation module instead of the straightforward measurement, which involves executing costly queries against the database. The process is invoked on demand by the administrator and returns a recommended combination of synopses to build. Utilizing the same reconciliation algorithms and heuristics as those in [3, 2], but replacing the action of measuring the error with a call to an error-estimation function, significantly reduces the run time of the reconciliation process while maintaining good accuracy.
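Where the estimator plugs into reconciliation can be shown with a deliberately simple greedy space-allocation heuristic; this is a stand-in for the algorithms of [3, 2], not a description of them, and the relation names and toy error model are invented.

    def reconcile(relations, estimate_error, budget, step=8):
        """Greedily hand out memory in step-sized chunks to the synopsis whose
        estimated error drops the most, using only the error-estimation function."""
        alloc = {r: 0 for r in relations}
        while budget >= step:
            def gain(r):
                return estimate_error(r, alloc[r]) - estimate_error(r, alloc[r] + step)
            best = max(relations, key=gain)
            if gain(best) <= 0:
                break
            alloc[best] += step
            budget -= step
        return alloc

    # toy estimator: the error shrinks with the space granted to the synopsis
    est = lambda rel, size: {"orders": 2.0, "quotes": 1.0}[rel] / (1.0 + size)
    print(reconcile(["orders", "quotes"], est, budget=64))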
3 System Demonstration
We demonstrate the calibration process for one of these synopses, showing the accuracy of the calculated EET function over different relations and workloads. We also demonstrate a full reconciliation process over a complex setup of relations and workloads, showing how a process that would normally take hours to complete, is completed in minutes, and compare its results to those of the optimal combination.
References
1. ILOG Inc. ILOG CPLEX 8.0 - User's Manual, 2002.
2. H. V. Jagadish, H. Jin, B. C. Ooi, and K. L. Tan. Global optimization of histograms. In SIGMOD '01, pages 223-234, 2001.
3. A. C. Koenig and G. Weikum. A framework for the physical design problem for data synopses. In Proceedings of the 8th International Conference on Extending Database Technology, pages 627-645, 2002.
4. Y. Matia and Y. Matias. Efficient synopses reconciliation via calibration. Technical report, Tel Aviv University, 2005.
5. Y. Matias and L. Portman. τ-Synopses: a system for run-time management of remote synopses. In International Conference on Data Engineering (ICDE), Software Demo, pages 964-965, April 2004.
6. Y. Matias, L. Portman, and N. Drukh. The design and architecture of the τ-Synopses system. In Proc. EDBT '06, Industrial and Application, 2006.
X-Evolution: A System for XML Schema Evolution and Document Adaptation
Marco Mesiti1, Roberto Celle2, Matteo A. Sorrenti1, and Giovanna Guerrini2
1 Università di Milano, Italy, mesiti@dico.unimi.it
2 Università di Genova, Italy, guerrini@disi.unige.it
1 Introduction
The structure of XML documents, expressed as XML schemas [6], can evolve as well as their content. Systems must be frequently adapted to real-world changes or updated to fix design errors, and thus data structures must change accordingly in order to address the new requirements. A consequence of schema evolution is that documents that are instances of the original schema might not be valid anymore. Currently, users have to explicitly revalidate the documents and identify the parts to be updated. Moreover, once the parts that are not valid anymore have been identified, they have to be explicitly updated. All these activities are time-consuming and error-prone, and automatic facilities are required.
A set of primitives that can be applied to an XML schema to evolve its structure has been proposed in [4]. For each primitive we have determined the applicability conditions, that is, when its application produces a schema that is still well-formed. Moreover, we have analyzed when the primitive application alters the validity of the documents that are instances of the schema. Table 1 reports the evolution primitives, classified according to the object (element, simple type, and complex type) on which they are applied and the kind of operation (insertion, modification, deletion). Primitives marked with "∗" do not alter the validity of the document instances. Therefore, their application does not require a revalidation process. Other primitives might only alter the validity of a single element or of a restricted set of elements, depending on the schema specification. Therefore, in these cases, the entire revalidation of a document is useless. In [4] we developed an algorithm, based on an element type graph labelling, that minimizes the revalidation process to only the elements affected by the primitives, thus making the process more efficient.
A related problem is how to evolve the structure of document instances in order to make them valid for the evolved schema. Suppose an element that is mandatory for all valid documents is introduced in a schema. This element should be introduced (maybe with a default or null value) in all the previously valid documents. Suppose now that an element is removed from the schema. It should be removed from valid documents. The problem is more complex when the schema modification refers to an operator or to the repeatability of elements, and would often require user intervention.
Table 1. Evolution primitives

Simple Type
  Insertion: insert glob simple type∗, insert new member type
  Modification: change restriction, change base type, rename type∗, change member type, global to local∗, local to global∗
  Deletion: remove type∗, remove member type∗

Complex Type
  Insertion: insert glob complex type∗, insert local elem, insert ref elem, insert operator
  Modification: rename local elem, rename global type∗, change type local elem, change cardinality, change operator, global to local∗, local to global∗
  Deletion: remove element, remove operator, remove substructure, remove type∗

Element
  Insertion: insert glob elem
  Modification: rename glob elem∗, change type glob elem, ref to local∗, local to ref∗
  Deletion: remove glob elem∗
The adaptation process thus involves subtleties related both to the kind of update performed and to the structure of the updated type. Several updates require the detection of the minimal substructure for an element whose insertion is required in the documents in order to make them valid. Our approach to document adaptation is based on the use of restructuring structures, an extension of the labelled element type graph employed for document revalidation, in which labels can also be Δ+(l), Δ-(l), and Δ(lo→ln), with l, ln, and lo element labels. These structures allow the specification of the minimal modifications to be performed on documents invalidated by a schema update and are automatically inferred from the schema update whenever possible (otherwise, user intervention is required). The adaptation occurs during the revalidation process, and the idea is that to validate the special subelement Δ+(l), element l should be inserted. Similarly, to validate the special subelements Δ-(l) and Δ(lo→ln), element l should be deleted and element lo should be renamed to ln, respectively.
In this demonstration paper we present X-Evolution, a .NET system developed on top of a commercial DBMS that allows the specification of schema modifications in a graphical representation of an XML schema. It supports facilities for performing schema revalidation only when strictly needed and only on the minimal parts of documents affected by the modifications. Moreover, it supports the adaptation of original schema instances to the evolved schema. The adaptation process is semi-automatic and the required user intervention is minimized. Support is provided to the user for a convenient specification of the required updates.
Commercial tools (e.g. [1, 5]) have been developed for graphically designing XML schemas. However, they are not integrated with a DBMS and do not allow the semi-automatic revalidation and adaptation of the documents contained therein. Schema evolution had been previously investigated for DTDs in [3], where evolution operators were proposed. Problems caused by DTD evolution and the impact on existing documents are however not addressed. Moreover, since DTDs
are considerably simpler than XML Schemas [2], the proposed operators do not cover the full set of schema changes that can occur on an XML Schema.
Fig. 1. X-Evolution schema modification facility
2 X-Evolution Facilities
X-Evolution offers different kinds of facilities for handling the evolution of XML schemas, an efficient revalidation of document instances, and the adaptation of document instances to the evolved XML schema. X-Evolution connects to one or more databases in which the XML schemas and documents are stored. The left side bar of the schema modification facility (Figure 1) presents the identified documents and schemas. Whenever a user clicks on a schema, the system graphically represents the schema and identifies the documents that are valid for such a schema. Whenever a user clicks on a document, the system graphically represents the document and identifies the schemas for which the document is an instance. By graphically selecting a node of the tree representation of a schema (actually, a schema should be represented as a directed graph; for the sake of readability, nodes of the graph with more than one incoming edge are duplicated), all the possible schema evolution primitives that can be applied to such a node are visualized. When the user invokes an evolution primitive, X-Evolution
checks whether the operation can alter the schema consistency. If consistency is preserved, the operation is executed and the evolved schema is visualized. Whenever the operation alters (or can alter) the validity of document instances of the schema, X-Evolution points out the documents that are no longer valid. This operation is performed through an optimized revalidation algorithm detailed in [4]. The user interface then helps the user in the adaptation of the non-valid documents: for each non-valid document the system points out the elements that should be removed or added and allows the specification of default values or structures to be inserted.
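The effect of the adaptation step can be illustrated on a plain element tree. The sketch below applies the three kinds of restructuring labels (insertion, deletion, renaming) to an XML document with Python's standard library; it is an illustration only and does not reflect the actual .NET implementation or the labelling algorithm of [4].

    import xml.etree.ElementTree as ET

    def adapt(doc, inserts=(), deletes=(), renames=()):
        """inserts: (parent tag, new tag); deletes: tag names; renames: (old, new)."""
        for parent_tag, new_tag in inserts:              # Δ+: add a now-mandatory element
            for parent in doc.iter(parent_tag):
                ET.SubElement(parent, new_tag).text = "" # default (empty) content
        for tag in deletes:                              # Δ-: drop removed elements
            for parent in doc.iter():
                for child in list(parent):
                    if child.tag == tag:
                        parent.remove(child)
        for old, new in renames:                         # Δ(old -> new): rename elements
            for elem in doc.iter(old):
                elem.tag = new
        return doc

    doc = ET.fromstring("<order><buyer>ACME</buyer><fax>123</fax></order>")
    adapt(doc, inserts=[("order", "priority")], deletes=["fax"],
          renames=[("buyer", "customer")])
    print(ET.tostring(doc).decode())
    # <order><customer>ACME</customer><priority /></order>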
3 Demonstration
The demonstration of the X-Evolution system consists of four parts.
1. We will show how to connect the graphical interface to a database in which the documents and schemas are stored and how we can easily work with the graphical representation of documents and schemas.
2. We will show how to apply the developed evolution operations on a schema. Specifically, we will show how to construct a schema from scratch and how we can remove all the components from a schema, making it an empty schema. Consistency of the resulting schema is checked for each operation, and, if violated, no update is performed.
3. We will show the performance of our efficient approach to the revalidation of XML documents against the evolved schema with respect to the naive solution of entirely revalidating the documents that are instances of the schema.
4. Finally, we will show the effectiveness of our adaptation approach by specifying schema modifications and showing the result of the adaptation process.
Acknowledgement. The authors wish to thank Daniele Ghelli and Giuseppe Marchi for developing the graphical representation of XML documents and schemas.
References
1. Altova Inc. XMLSpy, 2005. http://www.altova.com/
2. G.J. Bex, F. Neven, and J. Van den Bussche. DTDs versus XML Schema: A Practical Study. WebDB, 79-84, 2004.
3. D. K. Kramer and E. A. Rundensteiner. Xem: XML Evolution Management. RIDE-DM, 103-110, 2001.
4. G. Guerrini, M. Mesiti, D. Rossi. Impact of XML Schema Evolution on Valid Documents. In Proc. of WIDM, Germany, 2005.
5. Stylus Studio Inc. Stylus XML Editor, 2005. http://www.stylusstudio.com
6. W3C. XML Schema Part 0: Primer, 2001.
TQuEST: Threshold Query Execution for Large Sets of Time Series
Johannes Aßfalg, Hans-Peter Kriegel, Peer Kröger, Peter Kunath, Alexey Pryakhin, and Matthias Renz
Institute for Computer Science, University of Munich
{assfalg, kriegel, kroegerp, kunath, pryakhin, renz}@dbs.ifi.lmu.de
Abstract. Effective and efficient data mining in time series databases is essential in many application domains, as for instance in financial analysis, medicine, meteorology, and environmental observation. In particular, temporal dependencies between time series are of capital importance for these applications. In this paper, we present TQuEST, a powerful query processor for time series databases. TQuEST supports a novel but very useful class of queries which we call threshold queries. Threshold queries enable searches for time series whose values are above a user-defined threshold at certain time intervals. Example queries are "report all ozone curves which are above their daily mean value at the same time as a given temperature curve exceeds 28°C" or "report all blood value curves from patients whose values exceed a certain threshold one hour after the new medication was taken". TQuEST is based on a novel representation of time series which allows the query processor to access only the relevant parts of the time series. This enables an efficient execution of threshold queries. In particular, queries can be readjusted with interactive response times.
1 Introduction
In this paper, we present TQuEST, a powerful analysis tool for time series databases which supports a novel but very important query type which we call threshold query. Given a query time series q, a threshold τ, and a time series database DB, a threshold query TQ_DB(q, τ) returns the time series p ∈ DB that has the most similar sequence of intervals of values above τ. In other words, each time series o ∈ DB ∪ {q} is transformed into a sequence of disjoint time intervals containing only those values of o that are (strictly) above τ. Then, a threshold query returns for a given query object q the object p ∈ DB having the most similar sequence of time intervals. Let us note that the exact values of the time series are not considered; rather, we are only interested in whether the time series is above or below a given threshold τ. The transformation of the time series into interval sequences is depicted in Figure 1.
Our new query type can for example be applied to pharmacological time series like blood parameters after drug treatment or to biological data like gene expression profiles. Another possible application for our tool is the analysis of environmental air pollution, which has gained rapidly increasing attention by
Fig. 1. Transformation of time series into sequences of time intervals
many European research projects in recent years. For example, German state offices for environmental protection maintain more than 100 million time series, each representing the daily course of air pollution parameters1 . It is important to know which parameters nearly simultaneously exceed their legal threshold. A lot of work on similarity search in time series databases has been published. The proposed methods mainly differ in the representation of the time series, a survey is given in [1]. Standard techniques for dimension reduction include Discrete Fourier Transform (e.g. [2]), Discrete Wavelet Transform (e.g. [3]), Piecewise Aggregate Approximation (e.g. [4]), and Chebyshev Polynomials [5]. However, all techniques which are based on dimension reduction cannot be applied to threshold queries because necessary temporal information is lost. An effective and efficient processing of queries like ”return all ozone time series which exceed the threshold τ1 = 50µg/m3 at a similar time as the temperature reaches the threshold τ2 = 25◦ C” is of high importance but not supported by the above mentioned techniques. Such a query type has to support the exploration of time series based on user-defined amplitude spectrums. TQuEST not only meets these prerequisites but also enables the user to interactively adjust the query threshold.
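As a concrete illustration of the transformation sketched in Figure 1, the following minimal Java sketch derives, for a time series given as an array of equally spaced measurements, the sequence of time intervals during which the values exceed a threshold τ. The class and method names are our own and do not refer to the actual TQuEST code base; the sketch assumes a recent Java version.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: derives the interval sequence of a time series for a threshold tau.
public class ThresholdIntervals {

    // A closed time interval [start, end] in which the series stays above the threshold.
    public record Interval(int start, int end) { }

    public static List<Interval> aboveThreshold(double[] series, double tau) {
        List<Interval> intervals = new ArrayList<>();
        int start = -1;                       // -1 means "currently below the threshold"
        for (int t = 0; t < series.length; t++) {
            if (series[t] > tau && start < 0) {
                start = t;                    // the series crosses tau from below
            } else if (series[t] <= tau && start >= 0) {
                intervals.add(new Interval(start, t - 1));
                start = -1;                   // the series falls back below tau
            }
        }
        if (start >= 0) {                     // series ends while still above tau
            intervals.add(new Interval(start, series.length - 1));
        }
        return intervals;
    }

    public static void main(String[] args) {
        double[] ozone = {40, 48, 53, 57, 51, 44, 49, 56, 60, 52, 45};
        System.out.println(aboveThreshold(ozone, 50.0));
        // prints: [Interval[start=2, end=4], Interval[start=7, end=9]]
    }
}

Comparing two time series with respect to a threshold then amounts to comparing their interval sequences, which is exactly the abstraction a threshold query operates on.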
2 Time Series Representation
As time series objects may be very complex, i.e. contain lots of measurements, we need a data representation of time series objects which allows us to process threshold queries very efficiently. An efficient processing enables interactive query response times, so that the query can be readjusted w.r.t. the last results without causing high response times. In the following, we assume that a time series is described by a sequence of connected segments and each segment denotes the interpolated time series course between a pair of subsequent time series values (measurements). 1
We thank U. Böllmann and M. Meindl for providing us with real-world datasets from the Bavarian State Office for Environmental Protection, Augsburg, Germany.
The query processor is based on a novel data representation of time series. Assume a threshold query with a threshold τ is given. At query time, the query processor needs to derive, for each time series, the relevant time intervals, i.e. the time frames where the time series course is above the specified threshold τ. Obviously, those segments of the time series which cross the query threshold τ suffice to determine the desired time frames. This observation can be used as follows: we decompose the time series into trapezoids whose upper and lower edges are parallel to the time axis, whose left side is bounded by an increasing time series segment, and whose right side is bounded by a decreasing segment. For a certain range of thresholds and a certain range of time, each trapezoid represents all time frames where the time series course is above the threshold. The trapezoids can be described very compactly and can be efficiently accessed by means of a spatial access method (e.g. [6]). The key point of our proposed time series representation is that we do not need to access the complete time series data at query time. Instead, only partial information about the time series objects suffices to report the results.
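The following sketch illustrates the trapezoid idea under simplifying assumptions; all names are hypothetical and not taken from TQuEST. A trapezoid stores its two bounding segments and the threshold range it covers, and the time interval for a concrete query threshold τ inside that range is obtained by linear interpolation on the two segments.

// Illustrative sketch of the trapezoid idea: a trapezoid covers a range of thresholds
// [tauLow, tauHigh]; its left side lies on an increasing segment and its right side on a
// decreasing segment of the time series. For a query threshold tau inside that range, the
// corresponding time interval is obtained by linear interpolation on the two segments.
public class ThresholdTrapezoid {
    // Bounding segments, each given by two measurements (t1, v1) and (t2, v2).
    private final double lt1, lv1, lt2, lv2;   // left (increasing) segment
    private final double rt1, rv1, rt2, rv2;   // right (decreasing) segment
    private final double tauLow, tauHigh;      // threshold range covered by this trapezoid

    public ThresholdTrapezoid(double lt1, double lv1, double lt2, double lv2,
                              double rt1, double rv1, double rt2, double rv2,
                              double tauLow, double tauHigh) {
        this.lt1 = lt1; this.lv1 = lv1; this.lt2 = lt2; this.lv2 = lv2;
        this.rt1 = rt1; this.rv1 = rv1; this.rt2 = rt2; this.rv2 = rv2;
        this.tauLow = tauLow; this.tauHigh = tauHigh;
    }

    public boolean covers(double tau) {
        return tau >= tauLow && tau <= tauHigh;
    }

    // Time at which a segment (t1,v1)-(t2,v2) crosses the value tau (linear interpolation;
    // segments are assumed not to be flat, so v2 != v1).
    private static double crossing(double t1, double v1, double t2, double v2, double tau) {
        return t1 + (tau - v1) * (t2 - t1) / (v2 - v1);
    }

    // The time interval [start, end] during which the series is above tau, for a covered tau.
    public double[] intervalAt(double tau) {
        double start = crossing(lt1, lv1, lt2, lv2, tau);
        double end   = crossing(rt1, rv1, rt2, rv2, tau);
        return new double[] { start, end };
    }
}

In a representation along these lines, only the bounding box of each trapezoid (its threshold range and time range) needs to be kept in the spatial index; the interpolation is performed only for the few trapezoids returned by the index lookup.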
3 Threshold Query Processor
In this section, we present TQuEST, a JAVA 1.5 application, which supports effective threshold queries efficiently. At first the user has to select the desired threshold type. In case of an Absolute Threshold τ , all underlying calculations are based on absolute time series values. Our tool also handles Relative Thresholds where τ is given relative to the maximum and the minimum values of the time series. This query mode requires that all time series are also represented in a normalized form. In many application fields, a number of parameters is observed at the same time. Quite often these measurements yield time series with totally different ranges of values. By using relative thresholds, our tool is nonetheless able to detect correlations between certain observed parameters (for example between temperature and ozone concentration). TQuEST supports two major types of queries: In the Time Interval Sequence Query mode, the user can specify a sequence of time intervals on the time bar and a threshold for the query. Our software then retrieves time series that match these parameters best (cf. Figure 2(a)). The Time Series Query mode uses a given time series as a query object. Here, the user has to provide a threshold τ1 for the query time series and a threshold τ2 for the objects to retrieve. Our software then searches for time series that exceed τ2 during the same (or similar) time intervals as the query time series exceeds τ1 (cf. Figure 2(b)). In both query modes, our tool ranks the resulting time series according to the similarity between the specified time intervals and the time intervals in which the result time series exceed the threshold. Depending on the available metadata for each time series our tool can be configured to display any combination of additional attributes associated with a specific time series (e.g. patient ID, observed parameter, geographical location of the corresponding sensor station, etc.). To get a quick visual impression of the results, multiple time series can be
Fig. 2. Query interface: (a) time interval sequence query, (b) time series query
selected for the graphical display in the middle area of the query GUI. Finally, by clicking the ‘Fetch Next’ button, the user can retrieve the next results from the database w.r.t. the ranking order. We demonstrate on real-world data that our prototype TQuEST is very useful. For example, we discovered relationships between meteorological data (e.g. air humidity) and air pollution attributes (e.g. particulate matter). Furthermore we can show that our tool can handle complex threshold queries in adequate time.
References
1. Keogh, E., Chakrabati, K., Mehrotra, S., Pazzani, M.: "Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases". In: Proc. ACM SIGMOD Int. Conf. on Management of Data, Santa Barbara, CA. (2001)
2. Agrawal, R., Faloutsos, C., Swami, A.: "Efficient Similarity Search in Sequence Databases". In: Proc. 4th Conf. on Foundations of Data Organization and Algorithms. (1993)
3. Chan, K., Fu, W.: "Efficient Time Series Matching by Wavelets". In: Proc. 15th Int. Conf. on Data Engineering (ICDE'99), Sydney, Australia. (1999)
4. Yi, B.K., Faloutsos, C.: "Fast Time Sequence Indexing for Arbitrary Lp Norms". In: Proc. 26th Int. Conf. on Very Large Databases (VLDB'00), Cairo, Egypt. (2000)
5. Cai, Y., Ng, R.: "Index Spatio-Temporal Trajectories with Chebyshev Polynomials". In: Proc. ACM SIGMOD Int. Conf. on Management of Data, Paris, France. (2004)
6. Beckmann, N., Kriegel, H.P., Seeger, B., Schneider, R.: "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles". In: Proc. ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, NJ. (1990)
VICO: Visualizing Connected Object Orderings Stefan Brecheisen, Hans-Peter Kriegel, Matthias Schubert, and Michael Gruber Institute for Informatics, University of Munich {brecheis, kriegel, schubert, gruber}@dbs.ifi.lmu.de
Abstract. In modern databases, complex objects like multimedia data, proteins or text objects can be modeled in a variety of representations and can be decomposed into multiple instances of simpler sub-objects. The similarity of such complex objects can be measured by a variety of distance functions. Thus, it quite often occurs that we have multiple views on the same set of data objects and do not have any intuition about how the different views agree or disagree about the similarity of objects. VICO is a tool that allows a user to interactively compare these different views on the same set of data objects. Our system is based on OPTICS, a density-based hierarchical clustering algorithm which is quite insensitive to the choice of parameters. OPTICS describes a clustering as a so-called cluster order on a data set which can be considered as an image of the data distribution. The idea of VICO is to compare the position of data objects or even complete clusters in a set of data spaces by highlighting them in various OPTICS plots. Therefore, VICO allows even non-expert users to increase the intuitive understanding of feature spaces, distance functions and object decompositions.
1 Introduction
In modern databases, complex objects like multimedia data, proteins or text objects can be modeled in a variety of representations and can be compared by a variety of distance or similarity functions. Thus, it quite often occurs that we have multiple views on the same set of data objects and do not have any intuition about how the different views on the data objects agree or disagree about the similarity of objects. VICO is a tool for comparing these different views on the same set of data objects. Our system is heavily based on OPTICS, a density-based hierarchical clustering algorithm which is quite insensitive to its parametrization. OPTICS describes a clustering as a so-called cluster order on a data set. A cluster order can be considered as an image of the data distribution in one representation. The idea of VICO is to select data objects or even complete clusters in one OPTICS plot and additionally highlight the same objects in all other displayed views on the data. VICO has the following three main applications: First, if more than one distance function for a given data set is available, it allows direct comparisons of the distance functions. Second, in a multi-represented setting, where multiple feature transformations for an object are available, the relationships between the given data representations can be examined by comparing the clusterings resulting w.r.t. these representations.
Third, the connection between multi-instance objects and their single instances can be examined by comparing the clustering of multi-instance objects to the clusterings w.r.t. single instances.
2 Algorithmic Foundation
In the following, we will introduce the basic concepts behind OPTICS [1], the clustering algorithm VICO employs to generate the density plot of a given data representation. OPTICS is a density-based hierarchical clustering algorithm that extends DBSCAN by deriving a cluster hierarchy that is displayed within the so-called reachability plot. The central concepts of OPTICS are the core distance and the reachability distance. The core distance of an object expresses the size of the neighborhood around the object containing at least MinPts other objects; in other words, the core distance of object o is the smallest distance for which o would be considered a core point with respect to MinPts. The reachability distance of an object p from o, denoted as d_reach(p, o), is the maximum of the true distance between o and p and the core distance of o. OPTICS performs a best-first run in a complete directed graph where the objects are the nodes and an edge between the objects p and o is labeled with d_reach(p, o). After starting its traversal at an arbitrary node, OPTICS always pursues first the edge that provides the smallest reachability distance and starts at an already reached object. When traversing the data from one object to the next, the reachability of the corresponding link is collected in the so-called reachability plot. Valleys in this plot indicate clusters: objects having a small reachability value are more similar to their predecessor objects than objects having a higher reachability value. The reachability plot generated by OPTICS can be cut at any level ε parallel to the abscissa. It represents the density-based clusters according to the density threshold ε: a consecutive subsequence of objects having a smaller reachability value than ε belongs to the same cluster. An example is presented in Fig. 1: for a cut at the level ε1, we retrieve two clusters denoted as A and B. Compared to this clustering, a cut at level ε2 would yield three clusters. The cluster A is
split into two smaller clusters denoted by A1 and A2, and cluster B is decreased in size. Usually, for evaluation purposes, a good value for ε would yield as many clusters as possible.
Fig. 1. Reachability plot (right) computed by OPTICS for a sample 2-D dataset (left)
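The cluster extraction obtained by cutting the plot at a level ε can be sketched as follows. This is our own simplified illustration, not VICO code: every consecutive run of reachability values below ε is treated as one cluster, and the object that opens the valley is counted as part of it.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: extract density-based clusters from a reachability plot by cutting
// it at level epsilon. Objects are identified by their position in the cluster order;
// a consecutive run of reachability values below epsilon forms one cluster.
public class ReachabilityCut {

    public static List<List<Integer>> clustersAt(double[] reachability, double epsilon) {
        List<List<Integer>> clusters = new ArrayList<>();
        List<Integer> current = null;
        for (int i = 0; i < reachability.length; i++) {
            if (reachability[i] < epsilon) {
                if (current == null) {
                    current = new ArrayList<>();
                    if (i > 0) current.add(i - 1);  // the object that "opened" the valley
                }
                current.add(i);
            } else if (current != null) {
                clusters.add(current);              // valley ends: close the current cluster
                current = null;
            }
        }
        if (current != null) clusters.add(current);
        return clusters;
    }

    public static void main(String[] args) {
        double[] plot = {0.9, 0.2, 0.3, 0.8, 0.25, 0.2, 0.3, 0.95};
        System.out.println(clustersAt(plot, 0.5));  // two valleys -> two clusters
    }
}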
3 Comparing Data Spaces Using VICO
The main purpose of VICO is to compare different feature spaces that describe the same set of data. For this comparison, VICO relies on the interactive visual exploration of reachability plots. Therefore, VICO displays any available view on a set of data objects as adjacent reachability plots and allows comparisons between the local neighborhoods of each object. Fig. 2 displays the main window of VICO. The left side of the window contains a so-called tree control that holds a subtree for each view of the data set. In each subtree, the keys are ordered w.r.t. the cluster order of the corresponding view. The tree control allows a user to directly search for individual data objects. In addition to the object keys displayed in the tree control, VICO displays the reachability plot of each view of the data set. Since valleys in the reachability plot represent clusters in the underlying representation, the user gets an instant impression of the richness of the cluster structure in each representation. However, to explore the relationships between the representations, we need to find out whether objects that are clustered in one representation are also similar in the other representations. To achieve this type of comparison, VICO allows the user to select any data object in any reachability plot or in the tree control. When a set of objects is selected in one view, the objects are highlighted in all other views as well. For example, if the user looks at the reachability plot of one representation and selects a cluster within this plot, the corresponding object keys are highlighted in the tree control and identify the objects that are contained in the cluster. Let us note that it is possible to visualize the selected objects as well, as long as there is a viewable object
Fig. 2. VICO displaying OPTICS plots of multi-represented data
representation. In addition to the information about which objects are clustered together, the set of objects is highlighted in the reachability plots of the other representations as well. Thus, we can easily decide whether the objects in one representation are placed within a cluster in another representation as well, or whether they are spread among different clusters or are part of the noise. If there exist contradicting reachability plots for the same set of data objects, it is interesting to know which of these representations is closer to the desired notion of similarity. Therefore, VICO allows the user to label data objects w.r.t. some class value. The different class values of the objects are displayed by different colors in the reachability plot. Thus, a reachability plot of a data space that matches the user's notion of similarity should display clusters containing objects of the same color. Fig. 2 displays a comparison of two feature spaces for an image data set; each image is labelled w.r.t. the displayed motif. Another feature of VICO is the ability to handle multi-instance objects. In a multi-instance representation, one data object is given by a set of separate feature objects. An example is CAD parts, which can be decomposed into a set of spatial primitives, each of which can be represented by a single feature vector. This way, the complete CAD part is represented by a set of feature vectors, which can be compared by a variety of distance functions. To find out which instances are responsible for clusters of multi-instance objects, VICO allows us to cluster the instances without considering the multi-instance object they belong to. Comparing this instance plot to the plot derived on the complete multi-instance objects allows us to analyze which instance clusters are typical for the clusters on the complete multi-instance objects. Thus, for multi-instance settings, VICO highlights all instances belonging to some selected multi-instance object.
4 Architecture and Implementation
VICO is implemented in Java 1.5 and thus runs on any platform supporting the current version of the Java Runtime Environment. VICO includes an integrated version of OPTICS, allowing the user to load and cluster data sets described in a variety of file formats like CSV and ARFF files. For this version of OPTICS, several distance measures are already implemented, such as the Euclidean, Manhattan, or cosine distance. Furthermore, VICO already implements various distance functions for multi-instance objects, e.g. the Hausdorff distance. The system is based on an extensible architecture, so that additional components like new distance functions can be integrated easily by implementing Java interfaces. Finally, VICO can directly load preprocessed reachability plots as well, and it can also export reachability plots that were computed by the integrated implementation of OPTICS.
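As an illustration of this kind of extension point, a pluggable distance function could look roughly as follows; the interface name and signature are hypothetical and do not reflect the actual VICO API.

// Hypothetical extension interface in the spirit of VICO's pluggable distance functions.
public interface DistanceFunction<T> {
    double distance(T a, T b);
}

// Example plug-in: Manhattan distance on feature vectors.
class ManhattanDistance implements DistanceFunction<double[]> {
    @Override
    public double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);   // vectors are assumed to have equal length
        }
        return sum;
    }
}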
Reference 1. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: “OPTICS: Ordering Points to Identify the Clustering Structure”. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’99), Philadelphia, PA. (1999) 49–60
XQueryViz: An XQuery Visualization Tool Jihad Boulos, Marcel Karam, Zeina Koteiche, and Hala Ollaic Department of Computer Science, American University of Beirut, P.O.Box 11-0236 Riad El-Solh, Beirut, Lebanon {jb06, mk62, zak08, hao04}@aub.edu.lb
Abstract. We present in this demo the description of XQueryViz, an XQuery visualization tool. This graphical tool can parse one or more XML documents and/or schemas and visualize them as trees with zooming, expansion, and contraction functionality. The tool can also parse a textual XQuery and visualize it as a DAG within two different windows: the first for the querying part (i.e. the For-Let-Where clauses) and the second for the "Return" clause. More importantly, users can build XQuery queries with this graphical tool by pointing and clicking on the visual XML trees to build the XPath parts of an XQuery, and then build the whole XQuery using visual constructs and connectors. A textual XQuery is then generated.
1 Introduction
We present in this demo XQueryViz, a graphical tool for the visualization and construction of XQuery queries. The tool is made of four main parts (Fig. 1): the first vertical window(s) is (are) dedicated to the visualization of XML documents and schemas. The second vertical window(s) is/are dedicated to the visualization of the "For-Let-Where" (FLW) clauses of a FLWR query, and the third vertical window(s) is/are reserved for the "Return" clauses of the (sub)query. The fourth window is a horizontal one where the textual representation of the XQuery is shown and updated dynamically during visual query construction. The contribution of this tool is its visualization and construction of XQuery queries in a natural data-flow manner, where a query block ("FLW" or "Return" in a query or in a sub-query) is represented as a tree, with predicates connecting branches in a tree. The visualization makes it natural for a user to imagine XML data flowing from the root to the leaves of the tree, and hence makes it easier to understand and construct more complex XQuery queries. Contrary to XQBE [1], where the emphasis is on simplicity and the target users are non-experts, XQueryViz is designed for more advanced users and hence is more complex than XQBE. We explain in the next three sections the three major parts of this graphical tool.
Fig. 1. A snapshot of XQueryViz showing an XML schema and an XQuery example
2 XML and XPath
The upper-left window(s) of XQueryViz is/are dedicated to XML documents and schemas and XPath visualization. Within these windows, both XML schemas and documents can be visualized as trees. Moreover, and contrary to most graphical interfaces that show XML documents and schemas in directory-like trees, the trees in this window have their natural top-down shapes. A user can load into this window one or more XML documents and/or schemas. She can then expand and/or contract different branches of a tree and zoom in and out on it. These two facilities are mostly helpful for large documents/schemas. The user can then build one or more XPath queries by pointing and clicking on the nodes of a tree. While working on this visual representation of the document, an XPath query is dynamically generated and updated in the lower (i.e. textual) window of the tool. In this way, the user can precisely understand the semantics of her node-clicking and XPath query generation relative to the XML schema/document. Elements, attributes and predicates can be included in the generated XPath expression. Several XPath expressions can be generated and saved, and then included in a larger XQuery that can normally contain several XPath expressions. Moreover, a user can use this part to define and construct new XML schemas.
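To give an idea of how such point-and-click XPath generation can work, the following hedged sketch assembles an XPath string from the element steps a user clicked on, an optional attribute, and an optional predicate; the class and method names are ours and are not part of XQueryViz.

import java.util.List;

// Illustrative sketch: build an XPath expression from the nodes a user clicked on the
// visual XML tree. Element steps are joined with '/', attributes are prefixed with '@',
// and an optional predicate is appended to the last step.
public class XPathBuilder {

    public static String fromClickedPath(List<String> elementSteps,
                                         String attribute, String predicate) {
        StringBuilder xpath = new StringBuilder();
        for (String step : elementSteps) {
            xpath.append('/').append(step);
        }
        if (predicate != null) {
            xpath.append('[').append(predicate).append(']');
        }
        if (attribute != null) {
            xpath.append("/@").append(attribute);
        }
        return xpath.toString();
    }

    public static void main(String[] args) {
        // e.g. clicking bib -> book -> title, with an illustrative predicate
        System.out.println(fromClickedPath(List.of("bib", "book", "title"),
                                           null, "@year > 1995"));
        // prints: /bib/book/title[@year > 1995]
    }
}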
Fig. 2. Visual icons for the XQuery clauses (For, Let, Where, Sort By, Return) and their components (XPath, variable, element, attribute, value, function/UDF, parameter, root, operator, wildcard, return element, return attribute, edges)
3 The Querying Part
The querying part of an XQuery is divided into three visual sub-parts that represent the "For", "Let", and "Where" clauses of that query. This part is visualized in the middle vertical window(s) of the graphical tool. Each clause is written in textual form with a specific color and shown in the visual window with that specific color. In this way, the user can naturally associate the sub-parts in that window with the clauses in the query. A user can currently load an XQuery from a file and visualize it in this middle window, with its "Return" clause in the third vertical window. The user also has the option of visually modifying this query and saving its new textual format, or visually building a completely new query with or without using the XPath construction facility provided in the first window. With these options, a user can manipulate an XQuery both textually and visually, so that it becomes as complicated as that user needs it to be. Every clause of an XQuery (i.e. For, Let, Where, Return, Order By) has its own visual construct, and every component in these clauses has its own visual construct too. The components that XQueryViz currently supports are: elements, attributes, wildcards, variables, XPath, document roots, numbers, strings, parent-child relationships, ancestor-descendant relationships, predicates, functions, and quantifiers. These are shown in a palette list to the left of the three vertical windows in Fig. 1 and are shown in Fig. 2 with their meanings. Sub-queries are recursively constructed in the same manner, as explained in the following. The "For" Clause: Multiple "For" clauses in an XQuery can be defined. The first "For" clause binds a variable to an XPath expression that should be applied to a certain XML document. Any subsequent "For" clauses may do the same, binding their variables to previously bound ones, or to the same or newly defined XML documents. The visual construct for the "For" clause has a right diamond that is connected to the tree representing the XPath expression; in this tree the root of the document has its own visual construct, and elements, attributes, and predicate expressions can also be visualized/constructed with their respective visual constructs. The "Let" Clause: Here too, multiple "Let" clauses can be defined, and they can also be intertwined with "For" clauses. The "Let" clause has its own visual construct that binds a variable to another variable (with or without an extension to its XPath), to an XPath expression on the root of an XML document, or to a constant value.
The “Where” Clause: Predicate expressions on the previously defined variables in the “For” and “Let” clauses are applied here. While the previous two clauses contain only XPath expressions with their internal predicates, the user can define here predicates across the different variables and XPath expressions; hence, eventually connecting some XPath trees to produce the general shape of a DAG.
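To illustrate how the visual constructs of the querying part map back to textual XQuery, the following sketch (our own simplification, not the XQueryViz implementation) models For, Let and Where clauses as simple objects and serializes them into a FLWR expression, mirroring how the textual window is kept in sync with the visual one.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a tiny model of For/Let/Where/Return constructs that is serialized
// to textual XQuery.
public class FlwrModel {
    private final List<String> forClauses = new ArrayList<>();
    private final List<String> letClauses = new ArrayList<>();
    private final List<String> whereConditions = new ArrayList<>();
    private String returnExpr = "()";

    public FlwrModel bindFor(String var, String xpath)  { forClauses.add("for " + var + " in " + xpath); return this; }
    public FlwrModel bindLet(String var, String expr)   { letClauses.add("let " + var + " := " + expr); return this; }
    public FlwrModel where(String condition)            { whereConditions.add(condition); return this; }
    public FlwrModel returning(String expr)             { returnExpr = expr; return this; }

    public String toXQuery() {
        StringBuilder q = new StringBuilder();
        forClauses.forEach(c -> q.append(c).append('\n'));
        letClauses.forEach(c -> q.append(c).append('\n'));
        if (!whereConditions.isEmpty()) {
            q.append("where ").append(String.join(" and ", whereConditions)).append('\n');
        }
        return q.append("return ").append(returnExpr).toString();
    }

    public static void main(String[] args) {
        String query = new FlwrModel()
            .bindFor("$b", "doc(\"bib.xml\")//book")
            .bindLet("$t", "$b/title")
            .where("$b/@year > 1995")
            .returning("<recent>{ $t }</recent>")
            .toXQuery();
        System.out.println(query);
    }
}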
4 The Return Part
A user can visualize and construct the result of a query in the third vertical window of the graphical tool. Almost the same visual constructs for using previously defined elements and attributes, and for applying predicates and functions, can also be used here. The major new visual constructs are the ones that define new elements. Moreover, variables defined in the "For-Let-Where" clauses (i.e. in the middle windows) have their own visual construct that marks them as previously defined. Sub-Queries: We recently extended XQueryViz to handle an XQuery as a sub-query in both the "Where" and the "Return" clauses of a larger query. Our design is based on recursively spawning, from the FLW or "Return" window of the enclosing XQuery, two new windows that contain the querying and returning parts of the sub-query. The as yet unresolved problem here is how to visually connect these new windows to the spawning "Where" or "Return" clause, so as to make it clear to a user how this sub-query is part of that "Where" or "Return" clause.
5 Conclusion
We presented in this paper the functionality of an XQuery visualization tool that can visualize and construct XQuery queries. The tool can also visualize XML schemas and documents and lets a user visually build XPath queries on them. XQuery queries can then be built and manipulated in a more natural and easier manner than with a purely textual format. We are currently preparing a usability study of XQueryViz in which a number of users with different levels of experience with XQuery will be asked to visualize and construct a number of XQuery queries; we will collect feedback and statistics on how effective and ergonomic this tool is.
Reference
1. D. Braga, A. Campi, and S. Ceri. XQBE (XQuery By Example): A Visual Interface to the Standard XML Query Language. ACM Transactions on Database Systems, Vol. 30, No. 2, June 2005. Also, demo in SIGMOD 2005.
SAT: Spatial Awareness from Textual Input Dmitri V. Kalashnikov, Yiming Ma, Sharad Mehrotra, Ramaswamy Hariharan, Nalini Venkatasubramanian, and Naveen Ashish Information and Computer Science, University of California, Irvine
1 Motivation
Recent events (the WTC attacks, the Southeast Asia tsunamis, Hurricane Katrina, the London bombings) have illustrated the need for accurate and timely situational awareness tools in emergency response. Developing effective situational awareness (SA) systems has the potential to radically improve decision support in crises by improving the accuracy and reliability of the information available to the decision-makers. In an evolving crisis, raw situational information comes from a variety of sources in the form of situational reports, live radio transcripts, sensor data, and video streams. Much of the data resides (or can be converted) in the form of free text, from which events of interest are extracted. Spatial or location information is one of the fundamental attributes of the events, and is useful for a variety of SA tasks. This demonstration will illustrate our approach, techniques and solutions for obtaining spatial awareness from raw input text. Several challenges arise in obtaining spatial awareness from raw text input: modeling/representation, event extraction and disambiguation, querying, reasoning, and visualization. We specifically focus on illustrating solutions for (a) modeling and representation that capture spatial uncertainty in text and (b) efficient indexing and processing of various types of spatial queries to support reasoning over spatial information. Our solutions are implemented in the context of a prototype system called SAT (Spatial Awareness from Text) that models and represents (potentially uncertain) event locations described in free text and incorporates several types of spatial queries of interest in SA applications. We demonstrate SAT in the context of two real-world applications that derive spatial information from text at different phases of the disaster response process:
• Offline spatial analysis of data from the Sept 11, 2001 WTC attacks to retrieve relevant events and the response as it occurred.
• Online, real-time assistance to field personnel using real-time communication transcripts between dispatchers and first responders from 911 call centers in the Los Angeles area.
Such tools enable social scientists and disaster researchers to accurately analyze transcribed communication logs and situational reports filed by the first
This work was supported by NSF grants 0331707, 0331690.
responders after major disasters. These techniques can also be used to support real-time triaging and filtering of relevant communications and reports among first responders (and the public) during a crisis. Our primary objective is to design database solutions to support applications where the real world is being monitored (potentially using a variety of sensing technologies) to support tasks such as situation assessment and decision-making. An illustrative example. Consider a scenario during the response to the September 11, 2001 WTC attacks that demonstrates the need for spatial awareness. The following are excerpts from two real reports1 filed by Port Authority Police Department (PAPD) officers:
1. "... the PAPD Mobile Command Post was located on West St. north of WTC and there was equipment being staged there ..."
2. "... a PAPD Command Truck parked on the west side of Broadway St. and north of Vesey St. ..."
These two reports refer to the same location, i.e. the same command post, a point-location in the New York, Manhattan area. However, neither of the reports specifies the exact location of the events; they do not even mention the same street names. Our objective is to represent and index such reports in a manner that enables efficient evaluation of spatial queries and subsequent analysis using the spatial data. Our system should efficiently support commonly used spatial queries, such as range, NN, spatial join, and so on. For instance, the representation must enable us to retrieve events in a given geographical region (e.g., around the World Trade Center). Likewise, it should enable us to determine the similarity between reports based on their spatial properties; e.g., we should be able to determine that the above events might refer to the same location (assuming a temporal correlation of the events). To support spatial analyses on free-text reports, merely storing the location in the database as free text is not sufficient either to answer spatial queries or to disambiguate reports based on spatial locations. For example, a spatial query such as 'retrieve events near WTC', based on keywords alone, can only retrieve the first report mentioned earlier. So instead, we need to project the spatial properties of the event described in the report onto the 2-dimensional domain Ω and answer queries within this domain. In this paper, we model uncertain event locations as random variables that have certain probability density functions (pdfs) associated with them. Assisted by GIS and probabilistic modeling tools, we map uncertain textual locations into the corresponding pdfs defined in Ω. Given that a large number of spatially uncertain events can potentially arise during crisis situations2, the focus of our project is on developing scalable solutions for effective and efficient processing of such spatially uncertain events.
1 Original audio data available in converted text form.
2 For instance, more than 1000 such events can be extracted from just 164 reports filed by Police Officers who participated in the disaster of September 11th, 2001.
2 Research Challenges and Solutions to Be Demonstrated
Development of an end-to-end approach for spatial awareness from textual input must address four practical challenges: (1) modeling uncertain spatial events, (2) representation, (3) indexing, and (4) query processing. In Figure 1, we show the major components of SAT. In the remainder of this section, we briefly describe the functionality of the SAT components and demonstrate the potential of SAT in handling these challenges.
Fig. 1. SAT Components: (1) extract candidate events from sentences using landmarks and spatial descriptors, (2) map spatial descriptors to probabilistic models, (3) initial pdf, (4) validate by analysts, (5) efficient pdf representation, (6) probabilistic event database and spatial index update, (7) index pruning (Phase I), (8) post processing (Phase II), (9) GIS display
Fig. 2. WTC: Data and Query
Fig. 3. GIS Interface
Modeling. Spatial uncertainty has been explored both in the GIS and in the database literature. We extend the probabilistic model for spatial uncertainty developed in [1, 2, 3]. In the probabilistic model, an uncertain location is treated as a continuous random variable (r.v.) which takes values (x, y) ∈ Ω and has a certain probability density function (pdf) f(x, y) associated with it. Interpreted this way, for any spatial region R, the probability that the location is inside R is computed as ∫∫_R f(x, y) dx dy. However, applying the probabilistic models in our context requires us to solve two problems: (a) event extraction from text and (b) modeling spatial uncertainty in text. The first four components (1–4) of Figure 1 are meant for these tasks. First, automated tools are employed to extract events from text, including their spatial properties. The analyst oversees this step to correct errors and to resolve extraction ambiguities (cf. [4]). Next, we map the extracted textual location into the corresponding probabilistic representation in a semi-supervised fashion. Our modeling solution [5] takes a spatial expression (s-expression) as its input and outputs the desired pdf for
the s-expression. It achieves that by analyzing landmarks and spatial descriptors (s-descriptors) mentioned in the s-expression. The analyst oversees these steps and adjusts the models if needed. We integrate this modeling process as a toolkit into the standard GIS system, as shown in Figure 3. For example, our extraction and modeling tools can automatically determine the uncertainty regions of the two reports in Section 1 and display them as in Figure 2. An analyst can use the probabilistic modeling toolkit to further enhance the probabilistic models. Besides demonstrating the modeling process, we will also demonstrate the practical significance of using the probabilistic models. Together with the query processing demonstration, we will show that simple bounding-region models are not sufficient to answer analytical queries. Representation. In our context, we need to be able to represent pdfs of complex shapes in the database. There are several known methods for such a representation, such as histograms or modeling the pdf as a mixture of Gaussians or of other distributions. However, these solutions do not scale well. In [6], we have proposed a novel compressed quad-tree representation. We have implemented this representation in SAT component No. 5 in Figure 1. Coupled with our new indexing strategies, it yields a significant performance boost in query response time, which we will demonstrate. It is interesting to note that the existing solutions that also deal with probabilistic spatial queries [1, 2, 3] do not address the representation issues directly. The reason is that their empirical evaluation is carried out using only simple densities such as uniform and Gaussian. Indexing and query processing. SAT efficiently supports several spatial query types, such as range, NN, and spatial join, commonly used in SA applications. For example, using a spatial region query, an analyst can express a query such as "find all the events whose location is around WTC". Figure 2 shows this query visually (shaded region). The system should compute the probability of the events being inside this region and filter away low-probability events. In [6], we have proposed a novel grid-based indexing approach. Compared to the state-of-the-art techniques proposed in [1, 3], our new indexing scheme achieves a 2–10 times speedup. The new index solution has been incorporated into the SAT system as components 6, 7 and 8 in Figure 1. In our demonstration, using both real and synthetic data, we will demonstrate the efficiency of the indexing solution on different types of spatial queries.
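To make the notion of filtering by probability concrete, the sketch below (our own, with hypothetical names) represents an uncertain location as a histogram pdf over a regular grid covering Ω and reports an event only if its probability mass inside a rectangular query region reaches a cut-off; the actual SAT system relies on the compressed quad-tree representation and grid index of [6] rather than the naive full scan shown here.

// Illustrative sketch: probability that an uncertain event location lies inside a
// rectangular query region, for a pdf represented as a histogram over a regular grid.
public class HistogramPdf {
    private final double[][] mass;      // probability mass per cell, sums (approximately) to 1
    private final double cellWidth, cellHeight;

    public HistogramPdf(double[][] mass, double cellWidth, double cellHeight) {
        this.mass = mass;
        this.cellWidth = cellWidth;
        this.cellHeight = cellHeight;
    }

    // Probability of the region [x1,x2] x [y1,y2]; cells partially covered by the region
    // are counted proportionally to the covered area.
    public double probabilityInside(double x1, double y1, double x2, double y2) {
        double p = 0.0;
        for (int i = 0; i < mass.length; i++) {
            for (int j = 0; j < mass[i].length; j++) {
                double cx1 = i * cellWidth, cx2 = cx1 + cellWidth;
                double cy1 = j * cellHeight, cy2 = cy1 + cellHeight;
                double overlapX = Math.max(0, Math.min(x2, cx2) - Math.max(x1, cx1));
                double overlapY = Math.max(0, Math.min(y2, cy2) - Math.max(y1, cy1));
                p += mass[i][j] * (overlapX * overlapY) / (cellWidth * cellHeight);
            }
        }
        return p;
    }

    // An event qualifies for the result of a probabilistic range query if its probability
    // mass inside the query region is at least the given threshold.
    public boolean qualifies(double x1, double y1, double x2, double y2, double threshold) {
        return probabilityInside(x1, y1, x2, y2) >= threshold;
    }

    public static void main(String[] args) {
        double[][] mass = { {0.10, 0.20}, {0.30, 0.40} };
        HistogramPdf pdf = new HistogramPdf(mass, 1.0, 1.0);
        System.out.println(pdf.qualifies(0.0, 0.0, 1.0, 2.0, 0.25));  // P = 0.30 -> true
    }
}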
3 Concluding Remarks
In this paper we presented a system – SAT – which builds spatial awareness and provides for reasoning with spatial locations from textual input. Such Situational Awareness (SA) applications abound in a variety of domains including homeland security, emergency response, command and control, process monitoring/automation, business activity monitoring, to name a few. We believe that techniques such as ours can benefit a very broad class of applications where free text is used to describe events.
References
1. Cheng, R., Kalashnikov, D., Prabhakar, S.: Querying imprecise data in moving object environments. TKDE 16 (2004)
2. Cheng, R., Kalashnikov, D., Prabhakar, S.: Evaluating probabilistic queries over imprecise data. In: SIGMOD. (2003)
3. Cheng, R., Xia, Y., Prabhakar, S., Shah, R., Vitter: Efficient indexing methods for probabilistic threshold queries over uncertain data. In: Proc. of VLDB. (2004)
4. Woodruff, A., Plaunt, C.: GIPSY: Georeferenced Information Processing SYstem. (1994)
5. Kalashnikov, D.V., Ma, Y., Hariharan, R., Mehrotra, S.: Spatial queries over (imprecise) event descriptions. Submitted for publication (2005)
6. Kalashnikov, D., Ma, Y., Mehrotra, S., Hariharan, R.: Spatial indexing over imprecise event data. (In: Submitted to EDBT 2006)
MUSCLE: Music Classification Engine with User Feedback Stefan Brecheisen1 , Hans-Peter Kriegel1 , Peter Kunath1 , Alexey Pryakhin1 , and Florian Vorberger2 1
Institute for Informatics, University of Munich {brecheis, kriegel, kunath, pryakhin}@dbs.ifi.lmu.de 2 Florian.Vorberger@detach.de
Abstract. Nowadays, powerful music compression tools and cheap mass storage devices have become widely available. This allows average consumers to transfer entire music collections from the distribution medium, such as CDs and DVDs, to their computer hard drive. To locate specific pieces of music, they are usually labeled with artist and title. Yet the user would benefit from a more intuitive organization based on music style to get an overview of the music collection. We have developed a novel tool called MUSCLE which fills this gap. While there exist approaches in the field of musical genre classification, none of them features a hierarchical classification in combination with interactive user feedback and a flexible multiple assignment of songs to classes. In this paper, we present MUSCLE, a tool which allows the user to organize large music collections in a genre taxonomy and to modify class assignments on the fly.
1 Introduction
The progress of computer hardware and software technology in recent years has made it possible to manage large collections of digital music on an average desktop computer. Modern computer systems are able to compress a piece of music to a few megabytes very quickly, and easy-to-use software that automates this process is available. Often, this software stores meta information, such as artist, album or title, along with the audio file. However, the amount and quality of the available meta information in publicly accessible online databases, e.g. freedb.org, is often limited. This metadata is especially useful when searching for a specific piece of music in a large collection. To organize and structure a collection, additional information such as the genre would be very useful. Unfortunately, the genre information stored in online databases is often incorrect or does not meet the user's expectations. In this demonstration paper, we present MUSCLE, a prototype of a powerful hierarchical genre classification tool for digitized audio. It is often problematic to assign a piece of music to exactly one class in a natural way: genre assignment is a somewhat fuzzy concept and depends on the taste of the user. Therefore, MUSCLE allows multi-assignments of one song to several classes. The classification is based on feature vectors obtained from three acoustic realms, namely
Fig. 1. Architecture of MUSCLE (timbre, rhythm, and pitch features of a piece of music feed classifiers attached to the nodes of the genre tree)
timbre, rhythm and pitch. Timbre features are derived from the frequency domain and were mainly developed for the purpose of speech recognition. The extraction of the timbral texture is performed by computing the short-time Fourier transform. We use the Mel-frequency cepstral coefficients (MFCCs), spectral flux and spectral rolloff as timbral representations [1]. Rhythmic content features are useful for describing the beat frequency and beat strength of a piece of music. In our demo, we use features derived from beat histograms [1] as a description of the rhythmic content. Pitch extraction tries to model human perception by simulating the behavior of the cochlea. Similar to the rhythmic content features, we derive pitch features from pitch histograms which are generated by a multipitch analysis model [2]. To sum up, each song is described by multiple representations and multiple instances, i.e. there exists a set of feature vectors per representation. The general idea of hierarchical classification is that a classifier located at an inner node solves only a small classification problem and therefore achieves more effective results more efficiently than a classifier that works on a large number of flatly organized classes. There exist only a few approaches for automatic genre classification of audio data. In [3], music pieces are classified into either rock or classic using k-NN and MLP classifiers. An approach for hierarchical genre classification which does not support user feedback is presented in [1]. Zhang [4] proposes a method for hierarchical genre classification which follows a fixed schema and where there is only limited support for user-created genre folders. Moreover, none of the above-mentioned hierarchical classification methods take full advantage of multi-instance and multi-represented music objects. In contrast, MUSCLE handles such rich object representations as well as an arbitrary genre hierarchy, deals with user feedback, and supports multi-assignment of songs to classes.
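To fix the terminology, a multi-represented, multi-instance song object can be pictured roughly as follows; this is our own language-neutral illustration in Java and not MUSCLE code.

import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a multi-represented, multi-instance music object: for each of the
// three acoustic realms there is a set of feature vectors extracted from the piece of music.
public class SongFeatures {
    public enum Representation { TIMBRE, RHYTHM, PITCH }

    private final Map<Representation, List<double[]>> instances =
            new EnumMap<>(Representation.class);

    public void addInstance(Representation rep, double[] featureVector) {
        instances.computeIfAbsent(rep, r -> new java.util.ArrayList<>()).add(featureVector);
    }

    public List<double[]> instancesOf(Representation rep) {
        return instances.getOrDefault(rep, List.of());
    }
}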
2 Theoretical Foundation
In this section, we briefly describe MUSCLE’s approach for classifying music pieces into a genre taxonomy (cf. Fig. 1). Support vector machines (SVM) are used as
classifiers, which achieve superior classification accuracy in various application areas and have received much attention recently. By using kernel functions in combination with SVMs, any kind of data can be classified. Since a music piece is described by a set of feature vectors, we apply a set kernel function [5] for SVMs. The hierarchical classification problem is handled by performing a two-layer classification process (2LCP) at each inner node N of the genre taxonomy. This process distinguishes only descendant nodes of N as classes C_single and acts as a guidepost for the hierarchical classification. In the first layer of the 2LCP, we train SVMs that distinguish only single classes in each representation. Since standard SVMs are able to make only binary decisions, we apply the so-called one-versus-one (OvO) approach in order to make a classification decision for more than two classes. We argue that for our application the OvO approach is best suited because the voting vectors provided by this method are a meaningful intermediate description that is useful for solving the multi-assignment problem in the second layer of our 2LCP. In order to perform the multi-assignment, we take advantage of the class properties in our application domain. We limit the possible class combinations to a subset C_combi ⊂ 2^C_single because several combinations do not make sense; e.g. a piece of music belonging to the class 'classic' is very implausible to also be in the class 'hip-hop'. Thus, the classifier (SVM) in the second layer of the 2LCP uses an aggregation of the voting vectors from the first layer of the 2LCP as input to assign an object to a class c ∈ C_single ∪ C_combi. The voting vectors provided by the first-layer SVMs for each representation are aggregated using a weighted linear combination. The weights in the combination are calculated using a so-called object-adjusted weighting. The intuition behind the object-adjusted weighting is that the object to be classified needs to have a sufficient distance from any of the other classes. For more details we refer to [6].
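The weighted aggregation of the first-layer voting vectors can be sketched as follows; the plain linear combination shown here takes the per-representation weights as given and merely stands in for the object-adjusted weighting of [6].

// Illustrative sketch of the second-layer input in the 2LCP: the one-versus-one voting
// vectors obtained per representation are combined into a single vector by a weighted
// linear combination. Computing the object-adjusted weights is out of scope of this sketch.
public class VoteAggregation {

    // votes[r][c] = number of OvO votes that representation r gave to class c.
    public static double[] aggregate(double[][] votes, double[] weights) {
        int numClasses = votes[0].length;
        double[] combined = new double[numClasses];
        for (int r = 0; r < votes.length; r++) {
            for (int c = 0; c < numClasses; c++) {
                combined[c] += weights[r] * votes[r][c];
            }
        }
        return combined;   // fed into the second-layer classifier (SVM) at this node
    }

    public static void main(String[] args) {
        double[][] votes = {
            {3, 1, 0},   // timbre representation
            {2, 2, 0},   // rhythm representation
            {1, 1, 2}    // pitch representation
        };
        double[] weights = {0.5, 0.3, 0.2};
        double[] combined = aggregate(votes, weights);
        System.out.printf("%.2f %.2f %.2f%n", combined[0], combined[1], combined[2]);
        // prints: 2.30 1.30 0.40
    }
}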
3 Practical Benefits
MUSCLE is implemented in C/C++ and runs on the Windows platform. Its hierarchical playlist acts as a jukebox. The installation archive of MUSCLE contains a default genre taxonomy including the necessary training data in the form of feature vectors for each song. This data is used in the demonstration. Using aggregated information such as feature vectors makes it possible to share the training data without having to distribute the underlying music data. Classes and training data in the genre taxonomy can be deleted, moved or added by the user. When the user commits the changes of the class hierarchy or of the corresponding training data, MUSCLE trains the affected classifiers. Note that usually only a small subset of the entire classifier hierarchy has to be trained because a modification at a node requires a partial adaptation of the node and all parent nodes only. It is also possible to start the training automatically after each modification or to run the training in the background. When the user is satisfied with the training setup, a folder to automatically classify all contained songs can be selected.
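A sketch of the partial adaptation mentioned above: when a node of the genre taxonomy is modified, only that node and its ancestors need their classifiers rebuilt. All names are illustrative and not taken from MUSCLE, whose implementation is in C/C++; the example is given in Java for consistency with the other sketches in this text.

// Illustrative sketch: after a modification in the genre taxonomy, only the modified node
// and its ancestors need their classifiers retrained; all other parts of the hierarchy
// remain untouched.
public class GenreNode {
    private final String name;
    private final GenreNode parent;   // null for the root of the taxonomy

    public GenreNode(String name, GenreNode parent) {
        this.name = name;
        this.parent = parent;
    }

    // Placeholder for training the two-layer classification process at this node.
    private void trainClassifiers() {
        System.out.println("retraining classifiers at node: " + name);
    }

    // Called when the training data or the children of this node have changed.
    public void onModified() {
        for (GenreNode node = this; node != null; node = node.parent) {
            node.trainClassifiers();   // walk up to the root, retraining each affected node
        }
    }

    public static void main(String[] args) {
        GenreNode root = new GenreNode("all genres", null);
        GenreNode rock = new GenreNode("rock", root);
        GenreNode punk = new GenreNode("punk", rock);
        punk.onModified();   // retrains "punk", "rock", and "all genres" only
    }
}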
Fig. 2. MUSCLE User Interface: (a) multi-assignment of songs, (b) user feedback
Fig. 2 illustrates MUSCLE's user interface. The main window displays the playlist containing the classification result in the form of a genre tree. An example of a multiple assignment of the song 'Anticipating' to the classes 'pop' and 'rhythm & base' can be seen in Fig. 2(a). In case the user wants to manually adjust the genre assignment of a song, entries can be re-arranged using drag & drop, as shown in Fig. 2(b).
References
1. Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10 (2002) 293–302
2. Tolonen, T., Karjalainen, M.: A computationally efficient multipitch analysis model. IEEE Transactions on Speech and Audio Processing 8 (2000) 708–716
3. Costa, C.H.L., Valle, J.D.J., Koerich, A.L.: Automatic classification of audio data. IEEE Transactions on Systems, Man, and Cybernetics 3 (2004) 562–567
4. Zhang, T.: Semi-automatic approach for music classification. In: Proc. of the SPIE: Conf. on Internet Multimedia Management Systems. Volume 5242. (2003) 81–91
5. Gärtner, T., Flach, P.A., Kowalczyk, A., Smola, A.J.: Multi-instance kernels. In: Proc. of the 19th Int. Conf. on Machine Learning (ICML). (2002) 179–186
6. Kriegel, H.P., Kröger, P., Pryakhin, A., Schubert, M.: Using support vector machines for classifying large sets of multi-represented objects. In: Proc. SIAM Int. Conf. on Data Mining (SDM). (2004)
iMONDRIAN: A Visual Tool to Annotate and Query Scientific Databases Floris Geerts1,3 , Anastasios Kementsietsidis1 , and Diego Milano2 1
School of Informatics, University of Edinburgh, UK {fgeerts, akements}@inf.ed.ac.uk 2 Università di Roma "La Sapienza", Italy diego.milano@dis-uniroma1.it 3 Hasselt University, Belgium
Abstract. We demonstrate iMONDRIAN, a component of the MONDRIAN annotation management system. Distinguishing features of MONDRIAN are (i) the ability to annotate sets of values and (ii) the annotation-aware query algebra. On top of that, iMONDRIAN offers an intuitive visual interface to annotate and query scientific databases. In this demonstration, we consider Gene Ontology (GO), a publicly available biological database. Using this database we show (i) the creation of annotations through the visual interface, (ii) the ability to visually build complex, annotation-aware queries, and (iii) the basic functionality for tracking annotation provenance. Our demonstration also provides a cheat window which shows the system internals and how visual queries are translated to annotation-aware algebra queries.
1 Introduction
Modern science relies increasingly on the use of database systems to store huge collections of scientific data. These data are generated from laboratory processes or are copied from other scientific databases. To make sense of these data and decide under which circumstances they can be used, scientists need to know their lineage, i.e., the conditions under which the data were generated, the accuracy of the processes that produced them, or the trustworthiness of the source from which the data were copied. These metadata are often stored in scientific databases in the form of annotations. In spite of their importance, existing data formats and schemas are not designed to manage the increasing variety of annotations. Moreover, DBMSs often lack support for storing and querying annotations. Our work on the MONDRIAN1 annotation management system [1] is motivated by the pressing needs of biologists, some of which are highlighted by the following example. Consider the relation in Figure 1, which lists triples of identifiers belonging to three distinct biological databases. Each triple associates the identifier gid of a gene (in the gene database) with the identifier pid of the protein (in the protein database) that
Piet Mondrian: Dutch painter whose paintings mainly consist of color blocks.
the gene produces, where the sequence of the protein is identified by sid (in the protein sequence database). Such relations are widely used in the biological domain and offer a quick way to cross-reference and establish associations between independent biological sources [2, 3]. Given such a relation, a biologist often wants to annotate each triple with any evidence that exist and verify its validity. Such evidence might include a reference to an article that mentions that the indicated gene produces the specified protein, or the name of a curator who verified this association. In the figure, we show possible annotations in the form of blocks and block labels. Blocks are used to indicate the set of values for which an annotation exists, while block labels are used to indicate the annotations themselves. In the figure, the annotations indicate the names of curators who verified that a particular association holds. So, in the first tuple, a block indicates Mary’s belief that the gene with GDB id 120231 produces protein with id P21359. Notice that parts of a triple can be verified by different curators (e.g. see the first tuple), while other parts are yet to be verified (e.g. see the third tuple). For annotations to be useful, the biologist must be able to query them. For example, she might want tuples that are annotated by either John or Mary. Or, she might want to find which are annotated, and by whom. Often, the lack of annotations is also of interest. For example, a biologist might want the gene-protein (gid, pid) pairs that are not annotated, so as to investigate the validity of these yet unverified pairs. To the best of our knowledge, MONDRIAN is the first system to support the annotation of sets of values, thus allowing for complex annotations such as the ones shown in the figure. Previous works only allowed for annotations to be attached to a particular value of a specific attribute (e.g., see [4]). Single-value annotations are insufficient since they fail to capture the complex relationships of values, relationships which span across attribute boundaries. Another distinguishing feature of MONDRIAN is the ability to query annotations and values alike. MONDRIAN offers an annotation-aware query algebra which we have shown to be both complete (it can express all possible queries over the class of annotated databases) and minimal (all the algebra operators are primitive) [1]. The expressiveness of our algebra goes well beyond the query capabilities of similar systems like, for example, DBNotes [5]. The algebra is simple and intuitive and is able to express all the queries mentioned earlier, and many more (see [1] for the full syntax and examples). For example, query q1 below retrieves all the tuples that are annotated by either John or Mary, while query q2 only retrieves tuples that have a gene-protein sequence (gid, sid) annotated pair. q1 = ΣMary ∪ ΣP eter
L q2 = Πgid,sid
In spite of being simple (and very easy to learn by those familiar with relational algebra), we don’t expect that biologists would want to learn yet another query algebra. Instead, it seems natural to offer a visual tool through which a biologists can both annotate data and query them. The objective of this demonstration is to present iMONDRIAN, a tool that offers the above capabilities. Once more, the simplicity of the algebra is to our favor since it facilitates the direct translation of visual queries to algebra queries.
2 The iMONDRIAN Demonstration
The demonstration of iMONDRIAN shows how the tool can be used by biologists throughout the lifecycle of annotations, starting from their insertion, to their querying, and ending with their deletion. The demonstration uses data and annotations from Gene Ontology (GO) [6], a publicly available biological database.
2.1 System Architecture
The MONDRIAN architecture is shown in Figure 2. MONDRIAN is built in Java and runs on top of MySQL. The iMONDRIAN component is the front-end through which a user interacts with the system. A visually expressed query is translated to a query written in the MONDRIAN query algebra, which is subsequently translated to SQL and executed over the underlying RDBMS. One advantage of MONDRIAN queries is that they are storage-model independent [1]. That is, MONDRIAN queries are at a level of abstraction that is independent of the chosen representation of annotations. Unlike the executed SQL queries, a change in this representation does not require the reformulation of our queries.
John
pid
gid
sid
I78852
120231
P21359
iMONDRIAN
A45770
120232
P35240
Mary
John, Mary
A01399
120233
P01138
John
Peter
A25218
120234
P08138
Mary
Fig. 1. An annotated relation
Result
Mondrian query algebra pid
Mondrian Query Engine
SQL
MySQL
gid
sid
I78852 120231 P21359 A45770 120232 P35240 A01399 120233 P01138
pid gid sid ann 1 0 1
1 1 1
0 1 1
John Mary Peter
Fig. 2. The MONDRIAN Architecture
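To give a flavour of the last translation step, the sketch below generates SQL for a simple selection on annotation labels. The storage relation ANNOTATION(tid, pid_flag, gid_flag, sid_flag, ann), with one row per block, is a guess inspired by Figure 2 and is not the actual MONDRIAN storage schema; the method names are likewise our own.

// Illustrative only: translating a simple selection on annotations into SQL.
// The storage layout assumed here (one row per block, with indicator flags that tell
// which attributes the block covers plus the annotation label) is a simplification
// inspired by Figure 2, not the system's actual schema.
public class AnnotationSelectionToSql {

    // SQL for "tuples of table R that carry an annotation with the given label".
    public static String sigma(String table, String label) {
        return "SELECT R.* FROM " + table + " R, ANNOTATION A " +
               "WHERE A.tid = R.tid AND A.ann = '" + label.replace("'", "''") + "'";
    }

    // SQL for "tuples that have a block covering the chosen attributes"
    // (in the spirit of the annotated-pair query q2 above).
    public static String annotatedOn(String table, boolean pid, boolean gid, boolean sid) {
        return "SELECT R.* FROM " + table + " R, ANNOTATION A " +
               "WHERE A.tid = R.tid" +
               " AND A.pid_flag = " + (pid ? 1 : 0) +
               " AND A.gid_flag = " + (gid ? 1 : 0) +
               " AND A.sid_flag = " + (sid ? 1 : 0);
    }

    public static void main(String[] args) {
        System.out.println(sigma("GENE_PROTEIN", "Mary"));
        System.out.println(annotatedOn("GENE_PROTEIN", false, true, true));
    }
}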
2.2 Demonstrated Functionality
Figures 3 and 4 show the iMONDRIAN interface. For ease of presentation, annotations are represented as colors. Thus, sets of values with the same annotation are colored the same. The same color can appear in a tuple over distinct attribute sets, and thus colors do not suffice to tell which attributes are annotated as a set. Therefore, when a user selects an attribute value in a tuple, all the other attributes with which this value is annotated are highlighted. Furthermore, a value can participate in more than one block and thus it can have multiple colors. Such values are shown in grey, with a black border, and when a user clicks on such a value, she sees all its colors in a popup window (see Figure 3). During the demo, we show how users can insert new tuples and annotations. An annotation can be inserted by selecting a set of values and attaching a color to them. This color can be either one that is already used in some tuple or a brand new color (annotation). The user can query both values and annotations, in isolation or in unison. For example, to query annotations, the user can select a color from a value and ask for all the tuples that have the same color (annotation). For example, a visual query v1 might
Fig. 3. The iMONDRIAN interface
Fig. 4. The result of a visual query
For example, a visual query v1 might ask for all the annotations with a red or green color. Or, the user can select a number of attribute columns (by clicking check boxes next to each attribute name) and pose a visual query v2 that returns all the tuples with annotations that involve all of these columns. Each visual query results in a new window containing the query result. The user can pose queries in the result window of a previous query, thus allowing complex queries to be built incrementally. Figure 4 shows the result of applying visual queries v1 and v2 to the relation of Figure 3. The composed query, written in the MONDRIAN query algebra, is available from the cheat window. This is useful if a user wants to execute the same query periodically: she does not have to go through the same steps in the iMONDRIAN interface, but only needs to copy the algebra query from the cheat window and send it directly to the MONDRIAN query engine. The demo also illustrates how MONDRIAN supports alternative annotation semantics [1]. For example, we discuss annotation (non-)inheritance, a property that determines whether, given an annotation over a set of values, any subset of these values also inherits the annotation. Finally, the demo illustrates some basic provenance functionality, which allows the origin of annotations to be traced back.
References 1. Geerts, F., Kementsietsidis, A., Milano, D.: MONDRIAN: Annotating and querying databases through colors and blocks. In: ICDE. (2006) (To appear). 2. Kementsietsidis, A., Arenas, M., Miller, R.J.: Data Mapping in Peer-to-Peer Systems: Semantics and Algorithmic Issues. In: ACM SIGMOD. (2003) 325–336 3. Tan, W.C.: Research Problems in Data Provenance. IEEE Data Engineering Bulletin 27 (2004) 45–52 4. Bhagwat, D., Chiticariu, L., Tan, W.C., Vijayvargiya, G.: An Annotation Management System for Relational Databases. In: VLDB. (2004) 900–911 5. Chiticariu, L., Tan, W.C., Vijayvargiya, G.: Dbnotes: a post-it system for relational databases based on provenance. In: ACM SIGMOD. (2005) 942–944 6. Consortium, T.G.O.: The gene ontology (go) database and informatics resource. Nucl. Acids Res 32 (2004) 258–261
The SIRUP Ontology Query API in Action
Patrick Ziegler, Christoph Sturm, and Klaus R. Dittrich
Database Technology Research Group, Department of Informatics, University of Zurich, Winterthurerstrasse 190, CH-8057 Zürich, Switzerland {pziegler, sturm, dittrich}@ifi.unizh.ch
Abstract. Ontology languages to represent ontologies exist in large numbers, and users who want to access or reuse ontologies can often be confronted with a language they do not know. Therefore, ontology languages are nowadays themselves a source of heterogeneity. In this demo, we present the SIRUP Ontology Query API (SOQA) [5] that has been developed for the SIRUP approach to semantic data integration [4]. SOQA is an ontology language independent Java API for query access to ontological metadata and data that can be represented in a variety of ontology languages. In addition, we demonstrate two applications that are based on SOQA: The SOQA Browser, a tool to graphically inspect all ontology information that can be accessed through SOQA, and SOQA-QL, an SQL-like query language that supports declarative queries against ontological metadata and data.
1 Introduction
In current information systems, ontologies are increasingly used to explicitly represent the intended real-world semantics of data and services. Ontologies provide a means to overcome heterogeneity by providing explicit, formal descriptions of concepts and their relationships that exist in a certain universe of discourse, together with a shared vocabulary to refer to these concepts. Based on agreed ontological domain semantics, the danger of semantic heterogeneity can be reduced. A large number of ontology languages are available to specify ontologies. Besides traditional ontology languages, such as Ontolingua [1] or PowerLoom1, there is a notable number of ontology languages for the Semantic Web, such as SHOE2, DAML3, or OWL4. Therefore, ontology languages are nowadays themselves a source of heterogeneity. As ontologies can be specified in a multitude of ontology languages, users looking for suitable ontologies can often be confronted with ontologies that are defined in a language they do not know. To make use of these ontologies, users either have to learn the particular ontology language or have to find and employ suitable tools to access the desired ontology.
1 http://www.isi.edu/isd/LOOM/PowerLoom/
2 http://www.cs.umd.edu/projects/plus/SHOE/
3 http://www.daml.org
4 http://www.w3.org/2004/OWL/
Heterogeneity caused by the use of different ontology languages can therefore be a major obstacle in ontology access. Besides this, building ontologies is a demanding and time-consuming task. Especially in cases where large all-embracing ontologies are built, the development phase can be a substantial investment — for example, more than a person-century has been invested in the development of CYC [2]. Therefore, existing ontologies should be reused so that advantage can be taken of the efforts spent during the ontology development phase. However, heterogeneity caused by the use of different ontology languages can be a considerable impediment to this. In this demo, we present the SIRUP Ontology Query API (SOQA) [5], which is an ontology language independent Java API for query access to ontological metadata and data that can be represented in a variety of ontology languages. In an example scenario, four publicly available ontologies, each represented in a different ontology language, are accessed through SOQA. It is shown how their contents can be compared quickly yet concisely with the graphical SOQA Browser and with declarative queries in the SOQA Query Shell.
2 Overview of the SIRUP Ontology Query API
In general, ontology languages are designed for a particular purpose and, therefore, they vary in their syntax and semantics. To overcome these differences, we defined the SOQA Ontology Meta Model. It represents modeling capabilities that are typically supported by ontology languages to describe ontologies and their components, i.e., concepts, attributes, methods, relationships, instances, and ontological metadata [5]. Based on the SOQA Ontology Meta Model, the functionality of the SOQA API was designed. The SIRUP Ontology Query API (SOQA) is an ontology language independent Java API for query access to ontological metadata and data that can be represented in a multitude of ontology languages. That is, SOQA provides read access to ontologies through a uniform API that is independent of the underlying ontology language and hardware/software platform. Consequently, accessing and reusing general foundational ontologies as well as specialized domain-specific ontologies can be facilitated. With SOQA, users and applications can be provided with unified access to metadata and data of ontologies according to the SOQA Ontology Meta Model. Besides, data of concept instances can be retrieved through SOQA. Note that although SOQA is mainly employed for ontology access for data content explication in the SIRUP integration approach [4], it is intended and designed to be a general-purpose ontology query API that can be used independently of SIRUP. SOQA and all its components are fully implemented in Java 1.5. From an architectural perspective, the SOQA API reflects the Facade design pattern: SOQA provides a unified interface to a subsystem which is in charge of retrieving information from ontologies that are specified in different ontology languages. SOQA as a Facade shields external clients from the internal SOQA components and represents a single point for unified ontology access (see Fig. 1). Examples of external clients of the API provided by the SOQA Facade are:
– The query language SOQA-QL [5], which supports declarative queries over data and metadata of ontologies that are accessed through SOQA;
– The SOQA Browser [5] that enables users to graphically inspect the contents of ontologies independent of the ontology language they are specified in;
– (Third-party) Java applications that use SOQA as a single point of access to information that is specified in different ontology languages. Possible application areas are virtual organizations, enterprise information and process integration, the Semantic Web, and semantics-aware universal data management.
Fig. 1. Overview of the SOQA software architecture (ontologies specified in OWL, PowerLoom, DAML, etc. are accessed through language-specific reasoners R1...Rn and wrappers W1...Wn behind the SOQA Facade, which serves SOQA-QL, the SOQA Browser, and other applications)
Based on its Facade architecture, SOQA provides more than 70 Java methods for unified ontology access according to the SOQA Ontology Meta Model (for details, see [5]). Behind the SOQA Facade, ontology wrappers are used as an interface to existing reasoners that are specific to a particular ontology language (see Fig. 1). Up to now, we have implemented SOQA ontology wrappers for OWL, PowerLoom, DAML, and the lexical ontology WordNet [3].
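To illustrate this Facade-plus-wrapper layering, the following sketch shows how such a design can look in Java. The interface and class names (OntologyWrapper, OntologyFacade, and their methods) are invented for the example; they do not reproduce SOQA's actual API, which comprises more than 70 methods.

```java
import java.util.List;
import java.util.Map;

// Minimal sketch of a Facade over language-specific ontology wrappers, in the
// spirit of SOQA; all names and signatures are illustrative only.
interface OntologyWrapper {
    List<String> conceptNames();                 // all concepts of the ontology
    List<String> attributeNames(String concept); // attributes of one concept
}

final class OntologyFacade {
    // One wrapper per supported ontology language, e.g. "OWL" -> an OWL wrapper.
    private final Map<String, OntologyWrapper> wrappers;

    OntologyFacade(Map<String, OntologyWrapper> wrappers) {
        this.wrappers = wrappers;
    }

    // Clients ask for ontology information without knowing the language;
    // the facade dispatches to the appropriate wrapper.
    List<String> conceptNames(String language) {
        OntologyWrapper w = wrappers.get(language);
        if (w == null) throw new IllegalArgumentException("unsupported language: " + language);
        return w.conceptNames();
    }

    List<String> attributeNames(String language, String concept) {
        return wrappers.get(language).attributeNames(concept);
    }
}
```

The point of the pattern is that clients such as the SOQA Browser or the Query Shell only ever see the facade; adding support for a new ontology language means adding a wrapper, not changing the clients.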
3 Demonstration Highlights
To illustrate the capabilities of SOQA, we assume an example scenario where a developer of an information system is looking for a publicly available ontology concerning persons from the university domain. Therefore, he or she might use a search engine and find (1) the Aktors Portal Ontology5, which is represented in OWL, (2) the PowerLoom Course Ontology6 developed in the SIRUP project, (3) the DAML University Ontology7 from the University of Maryland, and (4) the lexical ontology WordNet. In contrast to the traditional approach, where our developer has to cope with four ontology languages and different ontology tools, we demonstrate uniform access to ontological information through SOQA. In particular, we show the capabilities of our graphical SOQA Browser and the declarative SOQA Query Shell for user-friendly access to ontology information, independent of the language the four ontologies are represented in:
5 http://www.aktors.org/ontology/portal
6 http://www.ifi.unizh.ch/dbtg/Projects/SIRUP/ontologies/course.ploom
7 http://www.cs.umd.edu/projects/plus/DAML/onts/univ1.0.daml
SELECT author, documentation, languagename FROM ontology;
SELECT name, documentation FROM attributes(base1_0_daml:Student);
SELECT * FROM directsuperconcepts(wordnet_1:Student);
SELECT name, value(portal:emailAddress) FROM instances(subconcepts(portal:Student)) WHERE name < 'C';
Fig. 2. SOQA-QL Example Queries Against Different Ontologies
– The SOQA Browser is presented to quickly survey the concepts and their attributes, methods, relationships, and instances that are defined in a particular ontology, as well as metadata concerning the ontology itself. Thus, quick yet concise comparisons of the four ontologies in distinct browser windows are shown.
– In the SOQA Query Shell, SOQA-QL queries are formulated for detailed access to ontology data and metadata (see Fig. 2). Here, we present how the four ontologies and their components can be compared using declarative queries.
Based on this, we demonstrate how ontology access and reuse is enabled and facilitated through SOQA, which can hence contribute to the move from database systems to semantics-aware, universal data management.
References 1. A. Farquhar, R. Fikes, and J. Rice. The Ontolingua Server: A Tool for Collaborative Ontology Construction. International Journal of Human-Computer Studies (IJHCS), 46(6):707–727, 1997. 2. D. B. Lenat. CYC: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM, 38(11):32–38, 1995. 3. G. A. Miller. WordNet:A Lexical Database for English. Communications of the ACM, 38(11):39–41, 1995. 4. P. Ziegler and K. R. Dittrich. User-Specific Semantic Integration of Heterogeneous Data: The SIRUP Approach. In M. Bouzeghoub, C. Goble, V. Kashyap, and S. Spaccapietra, editors, First International IFIP Conference on Semantics of a Networked World (ICSNW 2004), pages 44–64, Paris, France, June 17-19, 2004. Springer. 5. P. Ziegler, C. Sturm, and K. R. Dittrich. Unified Querying of Ontology Languages with the SIRUP Ontology Query API. In G. Vossen, F. Leymann, P. C. Lockemann, and W. Stucky, editors, Datenbanksysteme in Business, Technologie und Web (BTW 2005), pages 325–344, Karlsruhe, Germany, March 2-4, 2005. GI.
Querying Mediated Geographic Data Sources
Mehdi Essid, François-Marie Colonna, Omar Boucelma, and Abdelkader Betari
LSIS and Université Paul Cézanne, Aix-Marseille 3, Avenue Escadrille Normandie-Niemen, F-13397 Marseille Cedex 20, first.last@lsis.org
Abstract. With the proliferation of geographic data and resources over the Internet, there is an increasing demand for integration services that allow a transparent access to massive repositories of heterogeneous spatial data. Recent initiatives such as Google Earth are likely to encourage other companies or state agencies to publish their (satellite) data over the Internet. To fulfill this demand, we need at minimum an efficient geographic integration system. The goal of this demonstration is to show some new and enhanced features of the VirGIS geographic mediation system.
1 Introduction
With the proliferation of geographic data and resources over the Internet, there is an increasing demand for integration services that allow transparent access to massive repositories of heterogeneous spatial data. Recent initiatives such as Google Earth [5] are likely to encourage other companies or state agencies to publish their (satellite) data over the Internet, while integrating such data still poses several challenges. The goal of this demonstration is to illustrate the enhanced and new features of VirGIS [3], a geographic mediation/wrapper system that provides the user with an integrated view of the data and advanced query facilities. Typical mediation approaches are data-driven and do not address the problem of integrating query capabilities. But the exploitation of available query capabilities is critical to a geographic mediation system. A preliminary version of the VirGIS prototype was demonstrated at ICDE'2004 [2]. This new version will exhibit advanced capabilities, both from the query engine point of view and from the user interface perspective.
2 What Will Be Demonstrated
We will show the following aspects:
1. An enhanced GQuery [1] user interface: in the previous prototype, users were able to pose WFS [6] queries and some limited GQuery expressions.
WFS queries or XML/KVP (keyword-value-pair) queries are XML programs sent to WFS servers wrapping data sources. Although this allowed access to any geographic repository that is OpenGIS compliant, i.e., with a WFS interface, the system inherited WFS limitations because it lacked complex query expressions. The need for a more expressive and powerful query language became obvious, and this led to the design and implementation of GQuery, an XQuery-based language that allows complex spatial queries over GML data sources.
2. An extended mapping language that allows the expression of complex mappings (not only 1-to-1) by using constraints or functions (either basic or topological ones) between corresponding entities, i.e., attributes or features (classes).
3. A smarter query rewriting strategy: previously, when a property of a feature was missing from a source, a null value was returned to the user. In the current rewriting algorithm, feature properties that are missing in a candidate source are searched for in the other (compensating) local sources. More details on our rewriting strategy are given in [4].
4. Performance enhancement: we discovered that 90% of the execution time was devoted to join operations, performed by the underlying XQuery join processor. To speed up query processing, we developed a specific join component based on a merge-sort join algorithm; a generic sketch of this kind of join is given below.
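As background for item 4, here is a minimal sketch of a sort-merge equi-join on string keys, assuming both inputs fit in memory; it illustrates the general technique only and is not the VirGIS join component itself.

```java
import java.util.*;

// Minimal sketch of a sort-merge equi-join on a string key; rows are modeled
// as (key, payload) pairs and both inputs are assumed to fit in memory.
public final class MergeJoin {

    static List<String[]> join(List<String[]> left, List<String[]> right) {
        left.sort(Comparator.comparing(r -> r[0]));
        right.sort(Comparator.comparing(r -> r[0]));
        List<String[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) i++;
            else if (cmp > 0) j++;
            else {
                // Emit the cross product of the two groups sharing this key.
                String key = left.get(i)[0];
                int i2 = i, j2 = j;
                while (i2 < left.size() && left.get(i2)[0].equals(key)) i2++;
                while (j2 < right.size() && right.get(j2)[0].equals(key)) j2++;
                for (int a = i; a < i2; a++)
                    for (int b = j; b < j2; b++)
                        out.add(new String[]{key, left.get(a)[1], right.get(b)[1]});
                i = i2;
                j = j2;
            }
        }
        return out;
    }
}
```

Once both inputs are sorted, each is scanned only once (apart from the cross products of duplicate keys), which is why such a component can be much faster than a generic nested-loop evaluation of XQuery joins.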
3 Demonstration Scenario
3.1 Data Sources
The scenario is a simplified version of the satellite catalogue interoperability problem. Consider the global (mediation) relation satellite(ID, Name, SatID, Elevation, Date, Geom, Url) describing a catalogue of satellite images. A user may pose a query against the satellite relation schema, asking for a satellite image (stored at the Url address) that covers a location described by Geom (coordinates or a bounding box), the image (shot) being taken by a satellite named Name and whose id is SatID, at a given Date and with a given sun Elevation. Relation satellite results from the integration of three relations stored in three different data sources, as illustrated in Figure 1 and described as follows:
– the DBC relation, DBC(key, Satellite, Sat_ID, Sun_elev, Date_, The_Geom), stored in a geographic shape format, contains images taken by different satellites (SPOT, ASTER, IKONOS, etc.). In our example, we are interested only in images taken by SPOT (satellite = 'spot');
– the ikonos relation, ikonos(key, Satellite, Sat_ID, Sun_el, Date_acqui, Geometry), relates to (IKONOS) data that is expressed in a different scale and stored in a PostGIS format; the spatial operator Buffer is applied to the Geometry attribute to perform scale conversion;
Fig. 1. Mapping between global and local schemas (the global satellite attributes are mapped to the DBC attributes under the condition Satellite = 'spot', to the ikonos attributes with a Buffer applied to Geometry, and, for Url, to the preview relation (Key, Filename))
– the preview (quick-look) relation, preview(key, filename), supplies data stored in a POSTGRES database.
A typical query (denoted Qdemo) in this scenario could be phrased as follows: Given a spatial location, return all satellite images supplied by SPOT and IKONOS between 2001 and 2004. As we already said, this version of VirGIS provides a more expressive mapping language. For example, one can add constraints to express mappings between features (the satellite global feature corresponds to the DBC real feature under the condition satellite = 'spot'). For more expressivity, these constraints may also contain spatial operators (the satellite global feature corresponds to the DBC real feature located in a zone z). Another extension of the mapping language allows aggregations between global and local attributes. For example, a global attribute may correspond to the result of an operator (spatial or not) executed over several local attributes (the global geometry attribute corresponds to a buffer built around the geometry of an ikonos feature).
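To give an idea of what such extended mappings amount to, the sketch below represents the two examples just mentioned as plain data structures. The record types and field names are invented for the illustration; they are not VirGIS's actual mapping language.

```java
import java.util.List;

// Illustrative representation of extended mappings: a global attribute may be
// bound to local attributes through an operator (e.g. Buffer), and a feature
// mapping may carry a constraint (e.g. Satellite = 'spot'). Names are invented.
record AttributeMapping(String globalAttr, List<String> localAttrs, String operator) {}

record FeatureMapping(String globalFeature, String localFeature,
                      String constraint, List<AttributeMapping> attributes) {}

final class SatelliteMappings {
    // satellite corresponds to DBC only for tuples with Satellite = 'spot'.
    static final FeatureMapping DBC = new FeatureMapping(
        "satellite", "DBC", "Satellite = 'spot'",
        List.of(new AttributeMapping("Geom", List.of("The_Geom"), null)));

    // The global Geom attribute is a buffer built around the ikonos geometry.
    static final FeatureMapping IKONOS = new FeatureMapping(
        "satellite", "ikonos", null,
        List.of(new AttributeMapping("Geom", List.of("Geometry"), "Buffer")));
}
```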
3.2 Demonstration Highlights
The query is posed over the satellite relation, either using GQuery or the graphical interface. When using the GQuery interface, a user can express complex queries over several global features at the same time. This could not be done in the previous prototype due to WFS limitations, which only allow one feature to be retrieved at a time. In order to process such complex queries, we developed a three-step algorithm described as follows:
Fig. 2. GQuery (Command Line) Interface
1. a decomposition step, in which we decompose a complex global query into elementary global queries, each of them dealing with one global feature only. This results in a global execution plan;
2. a rewriting step, in which each elementary global query is rewritten in terms of the local schemas. The algorithm relies on key dependencies to return complete answers. This results in an elementary execution plan for each elementary query;
3. a transformation step, during which we build the final execution plan. This is done by replacing each elementary query by its elementary execution plan and by adding transformation queries that perform source-to-local-schema translations.
Figure 2 illustrates the processing of a GQuery expression Qdemo. The figure highlights three screens: a general menu screen (left), a central screen that allows GQuery expressions to be entered online, and a right screen that details the step-by-step query processing. Figure 3 illustrates the graphical query interface, where a user simply delimits the bounding box target, sets some attribute values, and submits the query. The answer may consist of several objects: a simple click on an object (one brick) will display the info attached to it.
Fig. 3. Graphical User Interface
4 Conclusion
In this paper, we described the main VirGIS characteristics that will be demonstrated. The system relies on OpenGIS standards (data model, GML, WFS, etc.) and represents an advanced mediation system for geographic data. To the best of our knowledge, there is no other geographic mediation system that combines geographic standards with W3C ones.
References 1. O. Boucelma and F.-M. Colonna. Mediation for Online Geoservices. In Proc. 4th International Workshop, Web and Wireless Geographical Information Systems, W2GIS, pages 81–93. LNCS 3428, Springer-Verlag, November 2004. 2. O. Boucelma, M. Essid, Z. Lacroix, J. Vinel, J.-Y. Garinet, and A. Betari. VirGIS: Mediation for Geographical Information Systems. In Proc. ICDE 2004, Boston, March 30 - April 2 2004. 3. Omar Boucelma, Mehdi Essid, and Zoé Lacroix. A WFS-based Mediation System for GIS Interoperability. In Proceedings of the tenth ACM international symposium on Advances in Geographic Information Systems, pages 23–28. ACM Press, 2002.
Querying Mediated Geographic Data Sources
1181
4. M. Essid, O. Boucelma, F.-M. Colonna, and Y. Lassoued. Query Rewriting in a Geographic Mediation System. In Proceedings of the 12th ACM international symposium on Advances in Geographic Information Systems, pages 101–108. ACM Press, 2004. 5. Google. Google Earth Beta: A 3D interface to the planet. earth.google.com/. 6. OpenGIS. Web Feature Server Implementation Specification, September 19th 2002.
FIS-by-Step: Visualization of the Fast Index Scan for Nearest Neighbor Queries Elke Achtert and Dominik Schwald Institute for Computer Science, University of Munich, Germany {achtert, schwald}@dbs.ifi.lmu.de
Abstract. Many different index structures have been proposed for spatial databases to support efficient query processing. However, most of these index structures suffer from an exponential dependency of processing time on the dimensionality of the data objects. Due to this fact, an alternative approach for query processing on high-dimensional data is simply to perform a sequential scan over the entire data set. This approach often yields lower I/O costs than using a multi-dimensional index. The Fast Index Scan combines these two techniques and optimizes the number and order of blocks which are processed in a single chained I/O operation. In this demonstration we present a tool called FIS-by-Step which visualizes the single I/O operations during a Fast Index Scan while processing a nearest neighbor query. FIS-by-Step assists the development and evaluation of new cost models for the Fast Index Scan by providing the user with significant information about the applied page access strategy in each step of the algorithm.
1 Introduction
A large number of index structures for high-dimensional data have been proposed in previous years, cf. [2] for details. However, for sufficiently high-dimensional data the complexity of similarity queries on multidimensional index structures is still far away from being logarithmic. Moreover, simple query processing techniques based on a sequential scan of the data are often able to outperform approaches based on sophisticated index structures. This is due to the fact that usual index structures access data in portions that are too small and therefore cause lots of I/O accesses. The Fast Index Scan proposed in [1] subsumes the advantages of indexes and scan-based methods in an optimal way. The algorithm collects accesses to neighboring pages and performs chained I/O requests, where the length of the chains is determined according to a cost model. The benefit of this chained I/O processing is that the seek costs–the main part of the total I/O costs–have to be paid only once. The authors have shown that the Fast Index Scan clearly outperforms both the sequential scan and the Hjaltason and Samet algorithm which is typically used for processing nearest neighbor queries. In this demonstration we present a tool called FIS-by-Step to visualize the single chained I/O operations during a nearest neighbor query by applying the Fast Index Scan on top of an R-Tree.
Fig. 1. The Fast Index Scan for nearest neighbor queries (pages ordered by distance to the query in the priority list 3, 8, 10, 5, 1, 4, 2, 7, 6, 9; pages 3, 4, and 5 are read as a single chunk)
FIS-by-Step displays the applied page access strategy in each step of the algorithm and provides the user with significant statistical information. The step-by-step visualization is very useful in a lot of cases, e.g., for teaching and explaining the function of the Fast Index Scan, for visual evaluation of the applied strategies, or for the development of new strategies. The remainder of this paper is organized as follows: The concepts of the Fast Index Scan are described in Sect. 2. In Sect. 3 we demonstrate our tool FIS-by-Step for visualizing the I/O operations during a Fast Index Scan.
2 The Fast Index Scan
While the Fast Index Scan has been evaluated in [1] on top of the IQ-Tree, it can be applied to any R-Tree-like spatial index structure that consists of only one directory level. Usually, nearest neighbor queries are evaluated by the algorithm of Hjaltason and Samet (HS) [3], which has been proven to be optimal w.r.t. the number of accessed pages. Unlike the original HS algorithm, which loads and processes one page after the other, the Fast Index Scan adapts the HS algorithm: it tries to chain I/O operations for subsequent pages on disk and optimizes the number and order of pages which are processed in a single I/O operation. The HS algorithm keeps a priority list of all data pages in increasing order of their distance to the query point. For every page pi in the priority queue there exists a certain probability that pi has to be loaded to answer the query. The idea of the Fast Index Scan is to load in each step a chunk of neighboring pages with a sufficiently high probability instead of loading only one page, as the HS algorithm would do. In [1] the authors proposed a stochastic model to estimate the probability of a page being accessed during a nearest neighbor query. Based on this access probability, the cost balance of a page can be determined. A negative cost balance indicates that it is likely to be “profitable” to additionally load the page in the current chunk. This is the case if the additional transfer costs to read the page in the current chunk are less than the estimated costs to read the page later in the algorithm. In Fig. 1 the page strategy of the Fast Index Scan is visualized: starting with page 3, the Fast Index Scan extends the chunk and reads pages 4 and 5 additionally, because page 5 has a very high probability of being necessary to answer the query. Thus, reading pages 4 and 5 in the current chunk is less expensive than most probably having to load page 5 later, which would cause additional seek costs.
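The following sketch captures this chunk-extension rule in code. It takes page access probabilities and two cost constants as given and greedily extends the chunk whenever the expected cost of fetching the skipped-over pages later exceeds the extra transfer cost of reading them now; the actual cost model of [1] is more elaborate, so this is an illustration only.

```java
// Simplified sketch of the Fast Index Scan chunking decision. Pages are given
// in physical (on-disk) order, starting at the first page to be read, together
// with their estimated access probabilities.
final class ChunkPlanner {

    /** Returns the exclusive end index of the chunk that starts at 'start'. */
    static int chunkEnd(double[] prob, int start, double seekCost, double transferCost) {
        int end = start + 1;                       // the start page is always read
        for (int cand = start + 1; cand < prob.length; cand++) {
            // Extra cost of extending the chunk up to and including 'cand':
            double extraTransfer = (cand - end + 1) * transferCost;
            // Expected cost saved by not having to fetch these pages later:
            double saved = 0.0;
            for (int p = end; p <= cand; p++) saved += prob[p] * (seekCost + transferCost);
            if (saved > extraTransfer) end = cand + 1;   // negative cost balance: extend
        }
        return end;                                // chunk = pages [start, end)
    }
}
```

Note that a page with a low probability (like page 4 in Fig. 1) can still end up in the chunk if it lies between the start page and a later page whose probability makes the extension profitable.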
Fig. 2. Screenshots of the FIS-by-Step application: (a) Step 1, (b) Step 2
3 Visualization of the Fast Index Scan
The main purpose of the FIS-by-Step application is to show step-by-step how the Fast Index Scan solves a nearest neighbor query. Figure 2 shows screenshots of our application to explain how FIS-by-Step works. Before running a query, it is possible to change some settings like the pagesize in order to adjust the application to the data. After choosing a data file (a CSV file of hyperpoints), the first line of black rectangles appears in the main window of the application. Each of these rectangles represents a page on the disk, where the order of the rectangles is identical with the order of the pages on disk. After selecting the query point (which can be the first point of the data, a random point of the data, or any given point) the second line of rectangles appears, again showing all pages, but now there is one blue rectangle: This is the page that is the nearest one to the query point. The third line appears after using the “Next Step” button. This is the first step of the Fast Index Scan: The access probability is calculated for all pages. The different colors of the rectangles indicate the access probability of the pages: Black indicates an access probability of 0%, i.e. these pages need not to be read. Grey pages have been already processed and thus also have an access probability of 0%. A blue page is the nearest unprocessed page to the query point and therefore has an access probability of 100%. All other pages (with red to green color) have access probabilities between 0% and 100% and might have to be read during the algorithm. As illustrated in Fig. 2(a), in this step of our example 7 pages (underlined magenta) are read by the Fast Index Scan.
Fig. 3. Statistics about the solved query
After using the “Next Step” button again, two things can happen: either the query is solved and some statistics are displayed, or the query is not solved yet, so it is necessary to read some more pages, as shown in Fig. 2(b). Note that all pages that have been read in the last step are now colored gray, as their access probability is now 0%. A lot of red colored pages from the first step are now black, since they have a larger distance to the query point than the nearest point of the already processed pages. Also, there is a new blue page, i.e., a page with an access probability of 100%. This page is the one that is nearest to the query point when all already processed pages are ignored. After this step the example query is solved; thus, after using the “Next Step” button, no new line of rectangles appears but a popup window showing some statistics about the query (cf. Fig. 3). The statistical information consists of the number of accessed pages and the I/O time for solving the query using the Fast Index Scan, in comparison to using the HS algorithm or the sequential scan of the data set, respectively. As the statistics show, the Fast Index Scan outperforms the HS algorithm as well as the sequential scan. As mentioned above, the primary objective of our FIS-by-Step application is the step-by-step visualization of the Fast Index Scan. This stepwise visualization is very useful in a lot of cases, e.g., for teaching and explaining the Fast Index Scan. FIS-by-Step supports the visual evaluation, comparison, and improvement of strategies for building chunks for chained I/O operations, for laying out pages on disk, and for ordering pages for processing in the CPU. Furthermore, FIS-by-Step assists the development of new strategies, as the advantages and disadvantages of a strategy for given data are shown directly.
References 1. S. Berchtold, C. Böhm, H. V. Jagadish, H.-P. Kriegel, and J. Sander. Independent Quantization: An index compression technique for high-dimensional data spaces. In Proc. ICDE, 2000. 2. C. Böhm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3), 2001. 3. G. R. Hjaltason and H. Samet. Ranking in spatial databases. In Proc. SSD, 1995.
ArHeX: An Approximate Retrieval System for Highly Heterogeneous XML Document Collections
Ismael Sanz1, Marco Mesiti2, Giovanna Guerrini3, and Rafael Berlanga Llavori1
1 Universitat Jaume I, Castellón, Spain {berlanga, Ismael.Sanz}@uji.es
2 Università di Milano, Italy mesiti@dico.unimi.it
3 Università di Genova, Italy guerrini@disi.unige.it
1 Introduction
Handling the heterogeneity of structure and/or content of XML documents for the retrieval of information is a fertile field of research nowadays. Many efforts are currently devoted to identifying approximate answers to queries that require the relaxation of conditions on both the structure and the content of XML documents [1, 2, 4, 5]. Results are ranked relying on score functions that measure their quality and relevance, and only the top-k are returned. Current efforts, however, are still based on some form of homogeneity in the structure of the documents to be retrieved. The parent-child or ancestor-descendant relationship among elements must still be preserved, and the problem of similarity at the tag level (whose solution often requires the use of an ontology) is seldom considered [6, 8]. Consider, for example, two entity types Book and Author that are bound by the many-to-many Write relationship. Many XML representations are possible. Some may model book documents by starting from the Book entity type and listing for each book its authors. Others may model book documents by starting from the Author entity type and listing for each author the books she wrote. Current approaches fail to find relevant solutions in collections containing both kinds of documents because they can relax the structural constraint (book/author becomes book//author) but they are not able to invert the relationship (book/author cannot become author/book). A more general problem is that current systems [3] support only a specific similarity function on XML documents, while in practice the concept of “similarity” strongly depends on the requirements of each particular application. This makes it difficult, if not impossible, to tailor these systems to particular requirements. In this paper we present ArHeX, a system for approximate retrieval in the context of highly heterogeneous XML document collections. Our system is designed to support different similarity functions, including lexical (i.e., tag-oriented) and structural conditions, in order to handle a wide variety of heterogeneous collections. In ArHeX, a user can specify the pattern of data to be retrieved through a graphical interface.
Fig. 1. (a) The ArHeX architecture (pattern specification, pattern evaluation, measure selection, and parameter tuning over indexing structures, an ontology, and the heterogeneous XML document collection); (b) a sample pattern over BOOK, AUTHOR, EDITOR, and NAME
Moreover, she can specify mandatory constraints on some relationships among elements or on the element tags that should be preserved. By means of specifically tailored indexing structures and heuristics, ArHeX is able to efficiently identify the approximate answers to the specified retrieval query, ranked according to a similarity measure. Several parameters can be set and used to tune the behavior of the system to the application scenario in which it is employed.
2 ArHeX System
ArHeX allows users to specify a suitable similarity measure for their collection, combining lexical and structural conditions. The lexical measures range from simple techniques based on the overlap of substrings to ontology-based measures. Indexes are tailored to the required measure for an efficient computation, using an inverted file-like structure. A peculiarity of our index is that we do not have an entry for each tag in the collection; instead, a normalization process groups together similar tags, relying on the tag similarity function preferred by the user (a small sketch of this tag-grouping index is given after the list of functionalities below). ArHeX also supports a set of similarity measures that can be employed in the selection and ranking of query results. The considered measures range from standard information retrieval measures (e.g., occurrence of query tags) to more sophisticated ones (e.g., structure-based or sibling-order-based functions). The developed system is equipped with the following functionalities.
– Pattern specification. The structures of user queries are represented as patterns in our system. A pattern is a graph in which the user can specify a “preference” on the parent-child, ancestor-descendant, and sibling relationships existing among elements or on the tags of elements (depicted through dashed lines in the graphical representations). “Preference” means that higher scores are given to query answers presenting such a structure, but results that do not (or only partially) present such a structure are also returned. Moreover, a user can specify stricter constraints that must occur in the returned results.
Fig. 2. ArHeX pattern evaluation facility
Our constraints are classified into three categories: ancestor-descendant, same-level, and tag constraints (as detailed in [7]). Figure 1(b) shows an example of a pattern in which we search for books having an author, editor, and name elements. The book element can be the parent or the child of the author element. The name element can be the child of the author element but can also appear in other positions. The editor element should be found at the same level as the author element. In ArHeX, patterns are specified through a graphical interface and then mapped into an XML document.
– Pattern evaluation. The evaluation of a pattern in the collection is performed in different steps (details in [7]). First, through the inverted index, a pattern index organized in levels is generated, containing the elements in the collection whose tags are similar to those in the pattern. Then, fragments are generated by considering the parent-child and ancestor-descendant relationships among elements in the collection. Furthermore, through the use of a similarity measure and the heuristic locality principle [6], fragments are combined into regions when the similarity of the pattern with respect to a region is higher than that with respect to each single fragment. Mandatory constraints are checked both during fragment and region construction, depending on the category they belong to. Whenever a constraint is not met, the corresponding fragment/region can be dropped or penalized according to user preferences. Finally, the similarity measure is employed to rank the top-k results. Figure 2 shows the evaluation of a pattern, pointing out a similar region.
– Measure selection. Different measures can be applied for the evaluation of the similarity between a pattern and a region, depending on the application domain.
ArHeX allows the selection from a set of predefined measures and the combination of existing ones. Finally, in the evaluation of a pattern in the collection, a user can visualize the differences in the evaluations obtained through a subset of the considered measures.
– Parameter tuning. A user can tune the behavior of ArHeX to a specific scenario through a set of parameters that a graphical interface offers. For example, a user can specify the kind of tag similarity to employ (syntactic, semantic, or both). Moreover, she can specify an extra weight to assign to elements in the pattern that are not found in similar regions, or she can state whether regions that do not meet the mandatory constraints should be dropped or penalized (and, in the latter case, the weight to apply as a penalty).
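The sketch announced above illustrates the tag-grouping behavior of the inverted index: instead of one entry per distinct tag, tags considered similar by a pluggable predicate share one postings list. The class and method names are invented for the illustration; this is not the actual ArHeX implementation, and a real index would use more efficient grouping than a linear scan over the entries.

```java
import java.util.*;
import java.util.function.BiPredicate;

// Minimal sketch of an inverted file whose entries group similar tags, in the
// spirit of the ArHeX index; names and data structures are illustrative only.
final class TagIndex {
    private final BiPredicate<String, String> similar;            // user-chosen tag similarity
    private final Map<String, List<Integer>> postings = new LinkedHashMap<>();

    TagIndex(BiPredicate<String, String> similar) { this.similar = similar; }

    /** Registers an element (identified by nodeId) with the given tag. */
    void add(String tag, int nodeId) {
        for (Map.Entry<String, List<Integer>> e : postings.entrySet())
            if (similar.test(e.getKey(), tag)) { e.getValue().add(nodeId); return; }
        postings.computeIfAbsent(tag, k -> new ArrayList<>()).add(nodeId);
    }

    /** Returns the ids of elements whose tags are similar to the query tag. */
    List<Integer> lookup(String tag) {
        for (Map.Entry<String, List<Integer>> e : postings.entrySet())
            if (similar.test(e.getKey(), tag)) return e.getValue();
        return List.of();
    }
}
```

With such an index, the first step of pattern evaluation (building the level-organized pattern index) reduces to one lookup per pattern tag.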
3 The Demonstration
The demonstration will show the following features:
Specification of user-defined similarity measures. The system includes a library of component-based lexical and structural similarity functions, which can be tailored to the user’s needs. We will demonstrate the definition of tailored measures.
Queries on different real and synthetic collections of documents. The performance of similarity-based queries using the graphical interface will be presented, using different real and synthetic collections of documents. Different similarity measures will be exercised, showing the precision and recall results.
Comparison of different measures. The system supports the interactive exploration of heterogeneous collections by allowing the use of several distinct similarity measures, in order to compare the results.
References 1. Amer-Yahia, S., Cho, S., Srivastava, D.: Tree Pattern Relaxation. EDBT. LNCS(2287). (2002) 496–513. 2. S. Amer-Yahia, N. Koudas, A. Marian, D. Srivastava, D. Toman. Structure and Content Scoring for XML. VLDB. (2005) 361–372. 3. G. Guerrini, M. Mesiti, I. Sanz. An Overview of Similarity Measures for Clustering XML Documents. Chapter in A. Vakali and G. Pallis (eds.), Web Data Management Practices: Emerging Techniques and Technologies. Idea Group. 4. A. Marian, S. Amer-Yahia, N. Koudas, D. Srivastava. Adaptive Processing of Top-k Queries in XML. ICDE. (2005) 162–173. 5. A. Nierman, H.V. Jagadish. Evaluating Structural Similarity in XML Documents. WebDB. (2002) 61–66. 6. I. Sanz, M. Mesiti, G. Guerrini, R. Berlanga Llavori. Approximate Subtree Identification in Heterogeneous XML Documents Collections. XSym. LNCS(3671). (2005) 192–206. 7. I. Sanz, M. Mesiti, G. Guerrini, R. Berlanga Llavori. Approximate Retrieval of Highly Heterogeneous XML Documents. Tech. report. University of Milano. (2005). 8. A. Theobald, G. Weikum. The Index-Based XXL Search Engine for Querying XML Data with Relevance Ranking. EDBT. LNCS(2287). (2002) 477–495.
MonetDB/XQuery—Consistent and Efficient Updates on the Pre/Post Plane
Peter Boncz1, Jan Flokstra3, Torsten Grust2, Maurice van Keulen3, Stefan Manegold1, Sjoerd Mullender1, Jan Rittinger2, and Jens Teubner2
1 CWI Amsterdam, The Netherlands {boncz, manegold, sjoerd}@cwi.nl
2 Technische Universität München, Germany {grust, rittinge, teubnerj}@in.tum.de
3 University of Twente, The Netherlands {keulen, flokstra}@cs.utwente.nl
1 Introduction
Relational XQuery processors aim at leveraging mature relational DBMS query processing technology to provide scalability and efficiency. To achieve this goal, various storage schemes have been proposed to encode the tree structure of XML documents in flat relational tables. Basically, two classes can be identified: (1) encodings using fixed-length surrogates, like the preorder ranks in the pre/post encoding [5] or the equivalent pre/size/level encoding [8], and (2) encodings using variable-length surrogates, like, e.g., ORDPATH [9] or P-PBiTree [12]. Recent research [1] showed a clear advantage of the former for the efficient evaluation of XPath location steps, exploiting techniques like cheap node order tests, positional lookup, and node skipping in staircase join [7]. However, once updates are involved, variable-length surrogates are often considered the better choice, mainly because a straightforward implementation of structural XML updates using fixed-length surrogates faces two performance bottlenecks: (i) high physical cost (the preorder ranks of all nodes following the update position must be modified—on average 50% of the document), and (ii) low transaction concurrency (updating the size of all ancestor nodes causes lock contention on the document root). In [4], we presented techniques that allow an efficient and ACID-compliant implementation of XML updates also on the pre/post (respectively pre/size/level) encoding without sacrificing its superior XPath (i.e., read-only) performance. This demonstration describes in detail how we successfully implemented these techniques in MonetDB/XQuery1 [2, 1], an XML database system with full-fledged XQuery support. The system consists of the Pathfinder compiler, which translates and optimizes XQuery into relational algebra [6], on top of the high-performance MonetDB relational database engine [3].
1 MonetDB/XQuery and the Pathfinder compiler are available in open source: http://monetdb-xquery.org/ & http://pathfinder-xquery.org/; the second version including XML updates will be released well before EDBT 2006.
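As a reminder of why the pre/size/level encoding supports fast XPath evaluation: the descendants of a node v are exactly the nodes whose pre rank falls into the interval (v.pre, v.pre + v.size], and children are the descendants one level deeper. The following generic illustration of that property is not MonetDB/XQuery code.

```java
// Generic illustration of the pre/size/level encoding: descendants occupy a
// contiguous pre-rank interval, which is what operators like staircase join exploit.
final class PreSize {
    final int pre, size, level;

    PreSize(int pre, int size, int level) {
        this.pre = pre; this.size = size; this.level = level;
    }

    /** true iff 'other' is a descendant of this node. */
    boolean hasDescendant(PreSize other) {
        return other.pre > pre && other.pre <= pre + size;
    }

    /** true iff 'other' is a child of this node. */
    boolean hasChild(PreSize other) {
        return hasDescendant(other) && other.level == level + 1;
    }
}
```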
Fig. 1. The impact of structural updates on pre/size/level XML storage (inserting the three new nodes k, l, m below node g shifts the pre ranks of all following nodes by 3 and increases the size of every ancestor by 3)
2 XML Updates
XML updates can be classified into: (i) value updates, which include node value changes (be it text, comments, or processing instructions) and any change concerning attributes (attribute value changes, attribute deletion and insertion); and (ii) structural updates, which insert or delete nodes in an XML document. With the pre/size/level encoding, value updates map quite trivially to updates in the underlying relational tables. Therefore, we focus on structural updates in the remainder. W3C has not yet formulated a standard for XML updates. However, we expect that a future standard will include the functionality of the UpdateX language as proposed in [11]. Given that there is no standard XML update language (and hence syntax) yet, we decided to keep the changes in our XQuery parser limited by not using the syntax proposed in [11], but rather implementing the same update functionality by means of a series of new XQuery operators with side effects.
Consistent Bulk Processing. Semantically, which nodes are updated and with what values is determined solely using the pre-image (i.e., snapshot semantics). Still, updates need to be applied in the order mandated by XQuery evaluation, which conflicts with the bulk relational query execution employed in MonetDB/XQuery (where optimized query execution may use a different order). To overcome this problem, the update operators initially just produce a tape of intended updates. This tape is represented by an XQuery item sequence, and thus in the end is yielded in the correct order. Finally, after optimization (in which duplicate updates or updates on deleted nodes are pruned), these updates are applied and committed. In our opinion, this optimized bulk approach to updates is unique to MonetDB/XQuery. Note that the update tape, which separates query evaluation and update execution, bears some resemblance to the idea of monad-based I/O [10] in purely functional programming languages, e.g., Haskell.
Structural Update Problems. Figure 1 illustrates how the pre/size/level document encoding is affected by a subtree insert (a delete raises similar issues): all pre values of the nodes following the insert point change, as well as the size of all ancestor nodes.
The former issue imposes an update cost of O(N), with N the document size, because on average half of the document consists of following nodes. The latter issue is not so much a problem in terms of update volume (the number of ancestors is bound by the tree’s height, remaining small even for large XML instances) but rather one of locking: the document root is an ancestor of all nodes and thus must be locked by every update. This problem, however, can be circumvented by maintaining, for each transaction, a list of the nodes whose size was changed, together with the delta rather than the absolute changed value. This allows transactions to release locks on size immediately, and still commit later (even if the size of a node has meanwhile been changed by another committed transaction, we can just apply the delta to set it to a consistent state). With the problem of locking contention on size removed this way, in the sequel we concentrate on the problem of the shifts in pre.
Page-Wise Remappable Pre-Numbers. Figure 2 shows the changes introduced in MonetDB/XQuery to handle structural updates in the pre/size/level table, contrasting the read-only with the updatable representation. The key observations are that the table now becomes a pos/size/level table, that it is divided into logical pages which may contain unused tuples, that new logical pages are only appended at the end, and that pre/size/level becomes a view on pos/size/level with all pages mapped in logical order.
Fig. 2. Updates with logical pages (page size 8 tuples; unused tuples have level = NULL and their size unites the consecutive free space; an insert is first handled inside a page and, if the page is full, NULL-padded pages are appended; pre/size/level is a memory-mapped view with the pages in logical order, and pre is a virtual (void) column that adapts automatically)
Figure 2 shows the example document being stored in two logical pages. The logical size is measured as a number of tuples (here: 8) instead of bytes. The document shredder already leaves a certain (configurable) percentage of tuples unused in each logical page. Initially, the unused tuples are located at the end of each page. Their level column is set to NULL, while the size column holds the number of directly following consecutive unused tuples. This allows the staircase join to skip over unused tuples quickly. For the same reason, the size of existing nodes now also embraces the unused tuples within the respective subtrees. The advantage of unused tuples is that structural deletes just leave the tuples of the deleted nodes in place (they become unused tuples) without causing any shifts in pre numbers. And since unused tuples are counted in the size of their ancestors, deletes do not require updates of the size of their ancestors. Also, inserts of subtrees whose sizes do not exceed the number of unused tuples on the logical page do not cause shifts on other logical pages. Larger inserts only use page-wise table appends. This is the main reason to replace pre by pos.
The pos column is a densely increasing (0,1,2,...) integer column, which in MonetDB can be efficiently stored in a virtual (non-materialized) void column. We introduced new functionality in MonetDB to map the underlying disk pages of a table in a different, non-sequential order into virtual memory. Thus, by mapping the virtual memory pages of the pos/size/level table in logical page order, overflow pages that were appended to it become visible “halfway” in the pre/size/level view. In the example of Figure 2, three new nodes k, l, and m are inserted as children of context node g. This insert of three nodes does not fit the free space (the first page that holds g only has one unused tuple at pos=7), so a new logical page must be inserted in-between: we insert eight new tuples, of which only the first two represent real nodes (l and m); the remaining six are unused. Thanks to the virtual column feature of MonetDB, in the resulting pre/size/level view all pre numbers after the insert point automatically shift, at no update cost at all!
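The following sketch models this logical-page mechanism in a few lines: tuples live in fixed-size pages with unused slots, new pages are only appended physically but spliced into a logical order, and the pre/size/level view is just the enumeration of the pages in that order, so pre ranks shift implicitly. It is an illustration of the idea only, not MonetDB's storage code, and the names are invented.

```java
import java.util.*;

// Illustration of page-wise remappable pre numbers: pages are appended
// physically, but the view enumerates them in logical order, so the pre rank
// (the position in the enumeration) of later tuples shifts automatically.
final class LogicalPages {
    static final int PAGE_SIZE = 8;

    static final class Tuple {
        Integer level;          // null marks an unused slot
        int size;               // for unused slots: number of following unused slots
        String tag;
        Tuple(Integer level, int size, String tag) { this.level = level; this.size = size; this.tag = tag; }
    }

    final List<Tuple[]> pages = new ArrayList<>();     // physical, append-only
    final List<Integer> logicalOrder = new ArrayList<>();

    /** Appends a fresh, all-unused page and splices it into the logical order. */
    Tuple[] newPage(int logicalPosition) {
        Tuple[] page = new Tuple[PAGE_SIZE];
        for (int i = 0; i < PAGE_SIZE; i++)
            page[i] = new Tuple(null, PAGE_SIZE - i - 1, null);
        pages.add(page);
        logicalOrder.add(logicalPosition, pages.size() - 1);
        return page;
    }

    /** The pre/size/level view: pages in logical order; pre is the list index. */
    List<Tuple> view() {
        List<Tuple> v = new ArrayList<>();
        for (int p : logicalOrder) v.addAll(Arrays.asList(pages.get(p)));
        return v;
    }
}
```

Inserting a subtree that does not fit the free slots of a page amounts to a call like newPage(i + 1); every tuple stored on logically later pages gets a larger pre rank in view() without a single physical update.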
3 Conclusion
In our demonstration, we will show the performance and scalability of both read-only and update queries on potentially huge XML databases, provided by MonetDB/XQuery. The demonstration will graphically show how the key techniques described here influence the behavior of the system.
References 1. P. Boncz, T. Grust, S. Manegold, J. Rittinger, and J. Teubner. Pathfinder: Relational XQuery Over Multi-Gigabyte XML Inputs In Interactive Time. Technical Report INS-E0503, CWI, 2005. 2. P. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, and J. Teubner. Pathfinder: XQuery—The Relational Way. In Proc. VLDB Conf., 2005. (Demo). 3. P. Boncz and M.L. Kersten. MIL Primitives For Querying a Fragmented World. The VLDB Journal, 8(2), 1999. 4. P. Boncz, S. Manegold, and J. Rittinger. Updating the Pre/Post Plane in MonetDB/XQuery. In Proc. XIME-P, 2005. 5. T. Grust. Accelerating XPath Location Steps. In Proc. SIGMOD Conf., 2002. 6. T. Grust, S. Sakr, and J. Teubner. XQuery on SQL Hosts. In Proc. VLDB Conf., 2004. 7. T. Grust, M. v. Keulen, and J. Teubner. Staircase Join: Teach a Relational DBMS to Watch its (Axis) Steps. In Proc. VLDB Conf., 2003. 8. T. Grust, M. van Keulen, and J. Teubner. Accelerating XPath evaluation in Any RDBMS. ACM Trans. on Database Systems, 29(1), 2004. 9. P.E. O’Neil, E.J. O’Neil, S. Pal, I. Cseri, G. Schaller, and N. Westbury. ORDPATH: Insert-Friendly XML Node Labels. In Proc. SIGMOD Conf., 2004. 10. S. Peyton-Jones and P. Wadler. Imperative Functional Programming. In Proc. POPL Conf., 1993. 11. G. M. Sur, J. Hammer, and J. Simeon. UpdateX - An XQuery-Based Language for Processing Updates in XML. In Proc. PLAN-X, 2004. 12. J. Xu Yu, D. Luo, X. Meng, and H. Lu. Dynamically Updating XML Data: Numbering Scheme Revisited. World Wide Web Consortium, 8(1), 2005.
STRIDER: A Versatile System for Structural Disambiguation
Federica Mandreoli, Riccardo Martoglia, and Enrico Ronchetti
DII, Università degli Studi di Modena e Reggio Emilia, via Vignolese, 905/b - I 41100 Modena {fmandreoli, rmartoglia, eronchetti}@unimo.it
Abstract. We present STRIDER1, a versatile system for the disambiguation of structure-based information like XML schemas, structures of XML documents and web directories. The system performs high-quality fully-automated disambiguation by exploiting a novel and versatile structural disambiguation approach.
1 Introduction
In recent years, knowledge-based approaches, i.e., approaches which exploit the semantics of the information they access, have rapidly been acquiring more and more importance in a wide range of application contexts. We refer to “hot” research topics like schema matching and query rewriting [2, 5], also in peer data management systems (PDMS), XML data clustering and classification [8], and ontology-based annotation of web pages and query expansion [1, 3], all going in the direction of the Semantic Web. In these contexts, most of the proposed approaches share a common basis: they focus on the structural properties of the accessed information, which are represented adopting XML or ontology-based data models, and their effectiveness is heavily dependent on knowing the right meaning of the employed terminology. Fig. 1-a shows the hierarchical representation of a portion of the categories offered by eBay. It is an example of the typical tree-like, structure-based information managed in the above-mentioned contexts, and which our approach is successfully able to disambiguate. It contains many polysemous words, from string, to which WordNet [6], the most widely used commonly available vocabulary, associates 16 meanings, to batteries (11 meanings), memory (10 meanings), and so on. The information given by the surrounding nodes allows us to state, for instance, that string is a “stringed instrument played with a bow” and not a “linear sequence of symbols”, and that batteries are electronic devices and not a group of guns or whatever else. In this paper we propose STRIDER, a system which can support these kinds of approaches in overcoming the ambiguity of natural language, as it makes explicit the meanings of the words employed in tree-like structures. STRIDER exploits the novel versatile structural disambiguation approach we proposed in [4].
This work is partially supported by the Italian Council co-funded project WISDOM.
1 STRucture-based Information Disambiguation ExpeRt.
Fig. 1. (a) A portion of the eBay categories: buy, with subcategories such as computers (desktop PC components: memory, speaker, fan; accessories: batteries, mouse), cameras, antiques (furniture: chair), and musical instruments (string). (b) The complete STRIDER architecture: context extraction, terms/senses selection, graph context extraction, context expansion, and terms disambiguation over WordNet and the indexing structures, producing the terms' sense rankings.
2 An Overview of the STRIDER System
STRIDER is designed to perform effective disambiguation of tree-like structures. As shown in Fig. 1-b, which depicts the complete architecture of our system, STRIDER takes as input structure-based information such as XML schemas, structures of XML documents, and web directories, and disambiguates the terms contained in each node's label using WordNet as an external knowledge source. The outcome of the disambiguation process is a ranking of the plausible senses for each term. In this way, the system supports both completely automatic semantic annotation, obtained by selecting the top sense of each ranking, and assisted annotation, through a GUI that provides the user with useful suggestions. The STRIDER system has the following features:
– automated extraction of terms from the tree's nodes (Terms/Senses Selection component in Fig. 1-b);
– high-quality and fully automated disambiguation that:
• is independent of training or additional data, which are not always available [7];
• exploits a context that goes beyond the simple "bag of words" approach and preserves the information given by the hierarchy (graph context);
• allows flexible extraction and full exploitation of the graph context according to the application needs (Graph Context Extraction component in Fig. 1-b);
• enriches the graph context by considering the expanded context, with additional information extracted from WordNet definitions and usage examples (Context Expansion component in Fig. 1-b);
– interactive and automated feedback to increase the quality of the disambiguation results;
– a user-friendly GUI that speeds up the assisted disambiguation of trees by providing an easy-to-use layout of the informative components.
Technical details about the implemented techniques for structural disambiguation are available in [4].
Fig. 2. The Graphical User Interface of the STRIDER System
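To make the ideas of graph context, context expansion, and sense ranking more concrete, the following sketch shows one simplified way such a pipeline could be implemented on top of WordNet (here via NLTK). It is only an illustration under our own assumptions: the Node class, the crossing_setting values, and the gloss-overlap scoring are hypothetical and do not reproduce the actual STRIDER techniques, which are described in [4].

```python
# Simplified sketch of graph-context extraction, context expansion, and sense
# ranking; names and scoring are illustrative assumptions, not STRIDER's code.
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple
from nltk.corpus import wordnet as wn   # requires the WordNet corpus


@dataclass
class Node:
    label: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, label: str) -> "Node":
        child = Node(label, parent=self)
        self.children.append(child)
        return child


def graph_context(node: Node, crossing_setting: str = "ancestors+descendants") -> List[str]:
    """Collect the node labels that form the context of `node`."""
    context: List[str] = []
    p = node.parent
    while p is not None:                       # ancestors
        context.append(p.label)
        p = p.parent
    stack = list(node.children)                # descendants
    while stack:
        c = stack.pop()
        context.append(c.label)
        stack.extend(c.children)
    if crossing_setting == "whole-tree":       # every remaining node of the tree
        root = node
        while root.parent is not None:
            root = root.parent
        stack = [root]
        while stack:
            c = stack.pop()
            if c is not node and c.label not in context:
                context.append(c.label)
            stack.extend(c.children)
    return context


def expanded_context(context_labels: List[str]) -> Set[str]:
    """Context expansion: enrich the context labels with the words appearing in
    the WordNet definitions and usage examples of their senses."""
    words: Set[str] = set()
    for label in context_labels:
        for w in label.lower().split():
            words.add(w)
            for sense in wn.synsets(w, pos=wn.NOUN):
                gloss = sense.definition() + " " + " ".join(sense.examples())
                words.update(gloss.lower().split())
    return words


def rank_senses(term: str, context_words: Set[str]) -> List[Tuple[str, int]]:
    """Rank the WordNet senses of `term` by the overlap between each sense's
    gloss/examples and the (expanded) context, best sense first."""
    ranking = []
    for sense in wn.synsets(term, pos=wn.NOUN):
        gloss = sense.definition() + " " + " ".join(sense.examples())
        score = sum(1 for w in gloss.lower().split() if w in context_words)
        ranking.append((sense.name(), score))
    return sorted(ranking, key=lambda kv: kv[1], reverse=True)
```

A real implementation would also normalize morphology, weight the contributions of different context nodes, and handle multi-word labels; the point here is only how a graph context, its expansion with WordNet glosses, and a per-term sense ranking fit together.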
3 Demonstration
In this section we demonstrate the main features of STRIDER. The effectiveness of the system has been measured experimentally on several trees differing in their level of specificity and polysemy [4] (the trees are available online at www.isgroup.unimo.it/paper/strider). Fig. 2 shows STRIDER's GUI with the results of the disambiguation process for the eBay example (Fig. 1-a). In the left part of the GUI, the columns Node and Term show the outcome of the automated extraction of terms from the tree's nodes, while the column Synset contains the chosen sense for the corresponding term. For flexibility, the GUI allows users to fill in this column either by manually choosing one of the senses shown in the right part or by pressing the Magic Wand. This simple act triggers STRIDER's fully automatic disambiguation process, which is applied to the entire loaded tree and automatically chooses the top sense in the ranking of each term. When the user highlights a term in the left part of the GUI, the right part shows all the available senses and, for each of them, the synset's hypernym hierarchy. One of the major strengths of our system is the versatility of being able to choose the crossing setting that is best suited to the tree's characteristics. For instance, when the crossing setting is made up of the whole tree, the term antique of Fig. 1-a is not disambiguated as "an old piece of furniture or decorative object", but as "an elderly man", due to the presence
of terms like fan and speaker, which can denote persons rather than objects. This behavior is typical of trees that gather very heterogeneous concepts, such as web directories. Conversely, in trees with a very specific scope, for instance an IMDB tree schema on movies, terms like episode and genre are disambiguated correctly only when the whole tree is used as the crossing setting, whereas a restricted crossing setting made up of only ancestors and descendants gives wrong results. In general, the tests performed show that most term senses are assigned correctly in a single disambiguation pass (the mean precision on the tested trees is generally over 80% [4]). Such good performance is obtained even when the graph context provides little information, as in generic bibliographic schemas, thanks to the context expansion feature, which delivers higher disambiguation precision by expanding the context with related nouns contained in the description and usage examples of each sense in WordNet. To obtain even better results, the user can refine them through successive disambiguation runs, activating or deactivating the influence of the individual senses of the available context words on the disambiguation process. Further, the flexibility of our approach allows the user to benefit from a completely automated feedback, in which the results of the first run are refined by automatically disabling the contributions of all but the top-ranked X senses in the following runs.
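The automated feedback loop just described can be pictured with a small, self-contained sketch. It is an assumption-driven illustration, not STRIDER's implementation: the rank callback, the top_x parameter, and the number of runs are all hypothetical.

```python
# Toy illustration of automated feedback: after each run, only the top-X senses
# of every term are allowed to influence the next run.  Hypothetical interface.
from typing import Callable, Dict, List, Sequence, Set, Tuple

Ranking = List[Tuple[str, float]]        # (sense identifier, score), best first


def automated_feedback(
    terms: Sequence[str],
    rank: Callable[[str, Dict[str, Set[str]]], Ranking],
    top_x: int = 2,
    runs: int = 3,
) -> Dict[str, Ranking]:
    """`rank(term, active_senses)` ranks the senses of `term`, letting only the
    senses listed in `active_senses` (per context term) contribute to the score.
    An empty mapping means that every sense may contribute (the first run)."""
    active: Dict[str, Set[str]] = {}
    rankings: Dict[str, Ranking] = {}
    for _ in range(runs):
        rankings = {t: rank(t, active) for t in terms}
        # feedback: keep only the top-X senses of each term for the next run
        active = {t: {sense for sense, _ in r[:top_x]} for t, r in rankings.items()}
    return rankings
```

In the interactive mode described above, the user would edit the active mapping by hand instead of deriving it automatically from the previous ranking.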
4 Conclusions
The disambiguation performance achieved by STRIDER is encouraging and demonstrates the effectiveness of the adopted approach. The intuitive GUI allows easy interaction with the user; furthermore, the system can also be used in batch mode to meet the needs of the most cutting-edge semantic-aware applications, where user intervention is not feasible.
References
1. P. Cimiano, S. Handschuh, and S. Staab. Towards the Self-Annotating Web. In Proc. of the 13th WWW Conference, 2004.
2. H. Do, S. Melnik, and E. Rahm. Comparison of Schema Matching Evaluations. In Proc. of the 2nd WebDB Workshop, 2002.
3. M. Ehrig and A. Maedche. Ontology-Focused Crawling of Web Documents. In Proc. of the ACM SAC, 2003.
4. F. Mandreoli, R. Martoglia, and E. Ronchetti. Versatile Structural Disambiguation for Semantic-aware Applications. In Proc. of the 14th CIKM Conference, 2005.
5. F. Mandreoli, R. Martoglia, and P. Tiberio. Approximate Query Answering for a Heterogeneous XML Document Base. In Proc. of the 5th WISE Conference, 2004.
6. G. A. Miller. WordNet: A Lexical Database for English. CACM, 38(11), 1995.
7. I. Tatarinov and A. Halevy. Efficient Query Reformulation in Peer Data Management Systems. In Proc. of ACM SIGMOD, 2004.
8. M. Theobald, R. Schenkel, and G. Weikum. Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML Data. In Proc. of the WebDB Workshop, 2003.
An Extensible, Distributed Simulation Environment for Peer Data Management Systems
Katja Hose, Andreas Job, Marcel Karnstedt, and Kai-Uwe Sattler
Department of Computer Science and Automation, TU Ilmenau, P.O. Box 100565, D-98684 Ilmenau, Germany
Abstract. Peer Data Management Systems (PDMS) have recently attracted the attention of the database community. One of the main challenges of this paradigm is the development and evaluation of indexing and query processing strategies for large-scale networks. So far, research groups working in this area have built their own testing environments, which first causes considerable effort and second makes it difficult to compare different strategies. In this demonstration paper, we present a simulation environment that aims to be an extensible platform for experimenting with query processing techniques in PDMS and that allows for running large simulation experiments in distributed environments such as workstation clusters or even PlanetLab. In the demonstration we plan to show the evaluation of processing strategies for queries with specialized operators like top-k and skyline computation on structured data.
1 Introduction
In recent years, research concerning peer-to-peer (P2P) systems has mainly dealt with P2P applications built on environments that are much more sophisticated than the simple file sharing systems they originate from. In particular, Peer Data Management Systems (PDMS) are the focus of current work. They address the problems arising from peers with different local schemas while appearing to the user as one virtual database. Research activities in this area include not only the behavior and topology of a P2P system itself but also query processing strategies, routing indexes [1], schema correspondences [2], etc. All approaches and ideas have to be evaluated thoroughly. For this purpose, simulation environments are built. Such implementations have to meet several requirements in order to allow for reasonable evaluations. In the context of P2P research the primary requirements are: scalability, dynamic behavior, performance, extensibility, arbitrary experimental settings, repeatability, traceability/logging, and control (as mentioned in [3]). Usually, research groups build their own environments using different data sets, programming languages, application areas, and so on. This results in a couple of problems:
– Many environments are implemented rather hastily and have several shortcomings, such as limited network size, fixed index structures and query processing strategies, or even bugs. Other aspects are simplified and therefore cannot be examined.
– Results gained with different systems are usually not comparable. This leads to unsatisfying evaluations or high reimplementation costs.
– Existing environments are normally based on network simulators like ns-2 that are too low-level for our purpose.
– Most environments do not have an intuitive user interface, let alone a graphical one. Configuring and using the software is difficult, and the output is rather cryptic and does not allow for an intuitive analysis of the results.
In this paper we present SmurfPDMS (SiMUlating enviRonment For PDMS), a simulator that meets all the requirements mentioned above. Furthermore, it tries to overcome the shortcomings of existing simulators introduced above. In the following sections we present its architecture (Section 2) and the aspect of extensibility (Section 3). Afterwards, in Section 4, we present the graphical user interface that makes it easy to operate. Finally, in Section 5 we point out what features we are planning to show at the conference.
2 Architecture
SmurfPDMS logically consists of two parts that we implemented using Java: the simulation engine and an environment for configuration and experimenting. The latter uses a graphical interface based on Swing and JGraph (http://www.jgraph.com) for visualizing configuration parameters, statistics, networks, and so on. Moreover, this environment can generate initial settings, such as the network topology, based on user-specified parameters like the number of peers and the average number of connections per peer. The simulation engine simulates peers whose states are represented by peer objects. These objects have local caches and message queues to improve query processing, mappings to describe how to translate queries and data into a neighbor's schema, and indexes to describe a neighbor's data. The simulation environment not only allows for running simulations locally on a single computer, it also allows simulations that involve several computers, which improves scalability. Communication between participating computers is carried out by sending and receiving messages using the JXTA [4] protocols. Simulating only one peer per machine enables the simulator to act like a "real" P2P application. In future work we intend to use PlanetLab (http://www.planet-lab.org) as the underlying framework. Each participating computer manages several peer objects that can be connected to others residing on other computers. Whether a peer object is local or remote has no influence on the implementation of other components such as query processing strategies. A central clock mechanism allows for gaining comparable results even in resource-limited environments. Complementing the system architecture introduced above, Figure 1 shows the implementation architecture of SmurfPDMS. It consists of three layers: the graphical user interface, the simulation layer, and the network communication layer. This architecture rests on several central concepts: (i) managers that control the simulation or communication, (ii) messages for information exchange, (iii) hierarchies of inherited objects, and (iv) the distinction, among the participating JXTA peers, between multiple participants and one coordinator (a minimal sketch of these concepts follows the coordinator's task list below). The coordinator runs on a designated computer and coordinates the simulation. Its most important tasks are:
– Determine the setup, including calculating a network topology, assigning the peers to the participating computers, calculating data partitions, etc.
– Choose queries and determine the peers that initiate them.
[Figure 1: the three-layer implementation architecture. The GUI layer (Swing/JGraph) handles configuration files and statistics; the simulation layer contains the coordinator (with startup, logging, simulation, and local/remote peer managers and their peer objects) and the participants (with local/remote peer managers and peer objects); the network layer connects coordinator and participants through network managers, JXTA peer lists, and the JXTA protocols.]
Fig. 1. Architecture
– Determine and choose the peers that leave or join the network.
– Simulate communication and processing delays, and log the initial setup (in order to achieve repeatability) as well as results and statistics.
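The following sketch illustrates, under our own assumptions, how the central concepts described above could fit together: peer objects with local caches and message queues, a small message hierarchy, and a coordinator that advances a central clock in discrete steps. The class names and the delivery logic are hypothetical and are not the SmurfPDMS API; in the real system peers may live on remote participants and messages travel over JXTA rather than local queues.

```python
# Minimal sketch of peer objects, messages, and a coordinator-driven clock.
# All names are illustrative assumptions, not the actual SmurfPDMS interfaces.
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Message:
    sender: int
    receiver: int
    payload: object


class Query(Message):          # message hierarchy: queries, answers, ...
    pass


class Answer(Message):
    pass


@dataclass
class PeerObject:
    peer_id: int
    neighbors: List[int] = field(default_factory=list)
    cache: Dict[object, object] = field(default_factory=dict)
    inbox: deque = field(default_factory=deque)
    outbox: deque = field(default_factory=deque)

    def step(self) -> None:
        """Process everything received during the previous clock tick."""
        while self.inbox:
            msg = self.inbox.popleft()
            if isinstance(msg, Query):
                self.outbox.append(Answer(self.peer_id, msg.sender,
                                          self.cache.get(msg.payload)))


class Coordinator:
    """Owns the central clock and, in this toy version, also delivers messages."""

    def __init__(self, peers: List[PeerObject]):
        self.peers = {p.peer_id: p for p in peers}
        self.clock = 0

    def tick(self) -> None:
        for p in self.peers.values():          # every peer processes its inbox
            p.step()
        for p in self.peers.values():          # then outgoing messages are delivered
            while p.outbox:
                msg = p.outbox.popleft()
                self.peers[msg.receiver].inbox.append(msg)
        self.clock += 1
```

The determinism of such a stepwise clock, with messages becoming visible only at the next tick, is what makes logged setups reproducible across runs in this toy version.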
3 Extensibility
We achieve extensibility through a modular architecture. Each of the above-mentioned classes (manager, strategy, message, peer, etc.) represents a base class. Thus, the logical simulation engine of SmurfPDMS can be extended easily. For example, we have derived queries and answers from messages. If we want to introduce a new message type, we just have to derive a new class from the base class. In order to process such messages correctly, we can extend existing strategies or introduce a new strategy class as an extension of the corresponding base class. Just like the simulation engine, the environment for configuration and experimenting can also be extended. SmurfPDMS has several algorithms for initializing a simulation. Assume we want to test different topology construction algorithms or algorithms for distributing the initial data among peers. Then all we have to do is add an implementation of these algorithms to the classes concerned. All other parameters can remain the same. This allows us to examine the influence of these algorithms under the same environmental settings.
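As an illustration of this extension pattern (again with hypothetical class and method names rather than the real SmurfPDMS interfaces), a new query processing strategy could be added by deriving from a strategy base class and overriding a single method, leaving the rest of the engine untouched:

```python
# Hypothetical sketch of extending the simulator with a new routing strategy by
# subclassing a base class; the engine itself does not need to change.
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class RoutingStrategy(ABC):
    """Base class: decides to which neighbors a query is forwarded."""

    @abstractmethod
    def route(self, peer, query) -> List[int]:
        ...


class FloodingStrategy(RoutingStrategy):
    def route(self, peer, query) -> List[int]:
        return list(peer.neighbors)                 # forward to every neighbor


class IndexBasedStrategy(RoutingStrategy):
    """New behavior lives entirely in the subclass: route() consults a routing
    index (e.g., keyword -> peers) instead of flooding."""

    def __init__(self, routing_index: Dict[object, Sequence[int]]):
        self.routing_index = routing_index

    def route(self, peer, query) -> List[int]:
        candidates = set(self.routing_index.get(query.payload, ()))
        return [n for n in peer.neighbors if n in candidates]
```

A new message type would follow the same pattern: derive from the message base class and extend, or add, a strategy that knows how to handle it.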
4 Running Experiments
First of all, the simulator has to be configured (arbitrary simulation settings). The user can load configuration files, reconfigure individual parameters, or have the simulator compute settings such as the network topology, the data distribution, or the query mix. All these parameters and settings can be written to configuration files and reloaded for the next simulation (repeatability). Together with the log files, we can easily
reproduce and reconstruct entire simulations or reuse partial settings such as the topology (traceability/logging). Afterwards, the simulator looks for other JXTA peers in the network. The user can select among them those that he or she wants to participate in the simulation (scalability, performance). After the JXTA peers have been selected, the simulation can be started with the computer that the user currently operates on acting as coordinator (control). Once started, simulations can be halted, canceled, and executed stepwise. Moreover, the graphical user interface provides the user with further visualization features, such as messages and their details or peers' local statistics. Figure 2 illustrates the basic visualization principles.
[Figure 2: the simulation window, which visualizes the peers, the communication links between them, and the messages currently in transit.]
Fig. 2. Simulation window
The simulation ends either after a user-defined number of simulation steps or when no peer has any actions left to perform. In both situations the coordinator sends the break signal to all participating JXTA peers and thus ends the simulation. At the end of a simulation the global statistics are displayed to the user and a file containing that data is created. These results can be used for creating charts, as we did, for example, in [5] and [6].
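To show how writing all generated settings to configuration files supports repeatability and traceability, here is a small sketch under assumed names and an assumed JSON format; the real SmurfPDMS configuration files may look entirely different.

```python
# Illustrative configuration handling: every generated setting (including the
# random seed and the computed topology) is persisted so a run can be repeated.
import json
import random
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class SimulationConfig:
    seed: int = 42
    num_peers: int = 100
    avg_degree: int = 4
    max_steps: int = 1000
    topology: Dict[int, List[int]] = field(default_factory=dict)

    def generate_topology(self) -> None:
        rng = random.Random(self.seed)        # seeded, hence reproducible
        self.topology = {
            p: rng.sample([q for q in range(self.num_peers) if q != p],
                          self.avg_degree)
            for p in range(self.num_peers)
        }

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

    @classmethod
    def load(cls, path: str) -> "SimulationConfig":
        with open(path) as f:
            data = json.load(f)
        # JSON keys are strings; restore the integer peer identifiers
        data["topology"] = {int(k): v for k, v in data.get("topology", {}).items()}
        return cls(**data)
```

Reloading such a file reproduces the exact topology and settings of an earlier run, which is what allows different strategies to be compared under identical experimental settings.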
5 Demonstration
We plan to focus our demonstration on how to use the simulator to conduct experiments with arbitrary settings. This includes presenting algorithms for automatic topology generation as well as algorithms for distributing data among peers. Furthermore, we will show how to formulate queries by hand or have them generated automatically by the simulator. We will also show how to compare different query processing strategies by means of their results and the collected statistics. Finally, we will show how all of this can be done using a graphical user interface for both configuring the simulator and visualizing the results.
References
1. Crespo, A., Garcia-Molina, H.: Routing Indices for Peer-to-Peer Systems. In: 22nd Int. Conf. on Distributed Computing Systems (2002) 23–32
2. Tatarinov, I., Halevy, A.: Efficient Query Reformulation in Peer Data Management Systems. In: SIGMOD'04 (2004) 539–550
3. Buchmann, E., Böhm, K.: How to Run Experiments with Large Peer-to-Peer Data Structures. In: IPDPS'04 (2004)
4. Sun Microsystems, Inc.: JXTA v2.3.x: Java Programmer's Guide. http://www.jxta.org/docs/JxtaProgGuide_v2.3.pdf (2005)
5. Karnstedt, M., Hose, K., Sattler, K.U.: Query Routing and Processing in Schema-Based P2P Systems. In: DEXA'04 Workshops, IEEE Computer Society (2004) 544–548
6. Karnstedt, M., Hose, K., Stehr, E.A., Sattler, K.U.: Adaptive Routing Filters for Robust Query Processing in Schema-Based P2P Systems. In: IDEAS 2005, Montreal, Canada (2005)
Data Management in the Social Web
Karl Aberer
School of Computer and Communications Science, EPFL, Lausanne, Switzerland
karl.aberer@epfl.ch
An interesting observation is that the most successful applications on the Web incorporate some sort of social mechanism. This is true for commercial success stories, such as Ebay with its reputation mechanism, Amazon with its recommendation tool, and Google with PageRank, a recommendation-based ranking algorithm. Peer-to-peer file sharing and photo sharing are other recent examples where the essence of the application consists of social interactions. In these applications large numbers of anonymous participants interact, so that mechanisms for social control become increasingly important. This explains the recent interest in reputation-based trust management. The same issues will emerge when large numbers of services are deployed over the Web through Web services and Grid computing technology. The goal of the panel is to reflect on these developments, identify important classes of applications involving social interaction which require data management support and information management capabilities, and project from there the potential future impact on the field of data management. A characteristic property of applications involving social interactions is the large number of participants whose behavior needs to be tracked and analyzed. This implies a strong need for scalable data management capabilities. Will this require novel approaches in the area of data management, or will existing technology be sufficient? The past has shown that new applications frequently open new avenues in data management research. Examples are semi-structured data management, responding to the need of managing data on the Web, and stream data management, responding to the need of managing data in networked environments and sensor data. Recently, in the context of the Semantic Web, social mechanisms for semantic tagging, so-called folksonomies, have created quite some interest. There, the creation and alignment of structured data annotations for Web content becomes a social activity. Similarly to collaborative filtering in information retrieval, the social context is exploited in order to deal with the semantics problem, namely providing a proper interpretation of data. Is this a promising approach for dealing with one of the hardest problems in data management, namely semantic heterogeneity of structured data? In social settings uncertainty is omnipresent, since the intentions and interpretations of autonomous participants cannot be made completely transparent. This also holds true when it comes to the exchange and shared use of data. Is it possible that the recent growing interest of the database community in applying probabilistic techniques in
data management is also rooted in the need for appropriate tools to deal with the uncertainty resulting from interaction in a social context? Finally, from a more general perspective, new requirements on data management often initiate new directions of interdisciplinary research for the field. Will the need to provide solutions for data management on the Social Web lead database researchers to look into areas such as agent technologies, game theory, or micro-economics to better understand the mechanics of social interactions and their impact on data management solutions? These are some of the questions that we will pose to the panel in order to identify interesting directions for future data management research.
Author Index
Abbadi, Amr El 112 Aberer, Karl 1203 Abiteboul, Serge 1049, 1059 Achtert, Elke 1182 Afrati, Foto 942 Aggarwal, Charu C. 41 Agrawal, Divyakant 112 Agrawal, Rakesh 240 Albano, Antonio 1110 Amer-Yahia, Sihem 349 Ashish, Naveen 1159 Aßfalg, Johannes 276, 1147 Atzeni, Paolo 368 Bai, Yijian 588 Basak, Jayanta 1097 Behrends, Erik 792 Bender, Matthias 149 Berg, Brandon 1102 Bernstein, Abraham 59 Bernstein, Philip A. 368 Betari, Abdelkader 1176 Bhattacharya, Arnab 865 Bhide, Manish 1097 Bijay, Kumar Gaurav 608 Bleiholder, Jens 811 B¨ ohlen, Michael 257, 1135 B¨ ohm, A. 1125 Boncz, Peter 1190 Botev, Chavdar 349 Boucelma, Omar 1176 Boulos, Jihad 755, 1155 Brantner, M. 1125 Brecheisen, Stefan 1151, 1164 Brochhaus, Christoph 204 Broder, Andrei Z. 313 Bruno, Nicolas 386 Camoglu, Orhan 645 Canahuate, Guadalupe 884 Cappellari, Paolo 368 Casati, Fabio 1079 Castellanos, Malu 1079 Celle, Roberto 1143
Chan, Chee-Yong 478 Chaudhuri, Surajit 386 Chen, Ming-Syan 682 Cheng, Jiefeng 961 Chirkova, Rada 942 Chuang, Kun-Ta 682 Colonna, Fran¸cois-Marie 1176 Cormode, Graham 4 Dayal, Umeshwar 1079, 1106 De Rosa, Luca 1110 Demers, Alan 627 Dittrich, Klaus R. 59, 1135, 1172 Drukh, Natasha 1088 Dumitrescu, Cristian 1110 Eiron, Nadav 313 Essid, Mehdi 1176 Ewen, Stephan 847 Feng, Ying 112 Ferhatosmanoglu, Hakan 884 Fletcher, George H.L. 95 Flokstra, Jan 1190 Fontoura, Marcus 313 Fritzen, Oliver 792 Furfaro, Filippo 442 Gamper, Johann 257 Garofalakis, Minos 4 Geerts, Floris 1168 Gehrke, Johannes 627 Gemulla, Rainer 423 Georgiadis, Haris 570 Gergatsoulis, Manolis 942 Ghica, Oliviu 1039 Gibas, Michael 884 Goglia, Lucio 1110 Goglia, Roberto 1110 Golab, Lukasz 608 Govindarajan, Kannan 1106 Gruber, Michael 1151 Grust, Torsten 1190 Guerrini, Giovanna 1143, 1186
Haas, Peter J. 1092 Hacıg¨ um¨ u¸s, Hakan 1084 Hariharan, Ramaswamy 1159 Herscovici, Michael 313 Hinze, Annika 1039 Hodel-Widmer, Thomas B. 1135 Hong, Mingsheng 627 Hose, Katja 1198 Hu, Haibo 186 Hu, Meng 664 Hua, Kien A. 700 Huang, Jiun-Long 682 Hung, Jen-Jou 902 Hvasshovd, Svein-Olaf 405 Jagadish, H.V. 478, 737 Jensen, Christian S. 257 Jin, Ruoming 533 Job, Andreas 1198 Kache, Holger 847 Kahveci, Tamer 645 Kalashnikov, Dmitri V. 1159 Kanne, C-C. 1125 Kantere, Verena 1069 Kanza, Yaron 222 Karakashian, Shant 755 Karam, Marcel 1155 Karnstedt, Marcel 1198 Kementsietsidis, Anastasios 1168 Kersten, Martin 1 Khuller, Samir 811 Kiefer, Christoph 59 Kiringa, Iluju 1069 Koteiche, Zeina 1155 Koudas, Nick 460 Kreulen, Jeffrey 1084 Kriegel, Hans-Peter 276, 1147, 1151, 1164 Kr¨ oger, Peer 276, 1147 Kunath, Peter 276, 1147, 1164 Kuno, Harumi 1106 Kutsch, Marcel 1092 Lee, Dik Lun 186, 515 Lee, Ken C.K. 1020 Lee, Wang-Chien 1020 Lehner, Wolfgang 423 Lempel, Ronny 313 Leone, Stefania 1135
Li, Yunyao 737 Lin, Xuemin 961 Liu, Danzhou 700 Liu, Peiya 588 Liu, Shaorong 588 Ljosa, Vebjorn 865 Llavori, Rafael Berlanga 1186 Lochovsky, Frederick 77 Løland, Jørgen 405 Luo, Gang 921 Luo, Qiong 515 Ma, Yiming 1159 Maier, David 3 Mamoulis, Nikos 167 Mandreoli, Federica 295, 1194 Manegold, Stefan 1190 Manolescu, Ioana 1049 Mansmann, Florian 496 Markl, Volker 847, 1092 Martoglia, Riccardo 295, 1194 Matia, Yariv 1139 Matias, Yossi 1088, 1139 May, N. 1125 May, Wolfgang 792 Mazzeo, Giuseppe M. 442 McPherson, John 313 Megiddo, Nimrod 1092 Mehrotra, Sharad 1159 Meka, Anand 980 Mendelzon, Alberto O. 222 Mesiti, Marco 1143, 1186 Michel, Sebastian 149 Mignet, Laurent 1097 Milano, Diego 1168 Miller, Ren´ee J. 222 Minei, Vincenzo 1110 Moerkotte, G. 1125 Morzy, Tadeusz 1121 Mullender, Sjoerd 1190 Mylopoulos, John 1069 Nanda, Jyotirmaya 1106 Narang, Inderpal 1115 Natarajan, Ramesh 1115 Naughton, Jeffrey F. 921 Naumann, Felix 773, 811 Nehme, Rimma V. 1001 Neven, Frank 829 Ntarmos, Nikos 131
Author Index Ollaic, Hala 1155 ¨ Ozsu, M. Tamer 608 Pal, Shankar 1102 Panayiotopoulos, Themis 1130 Parthasarathy, Srinivasan 533 Pavlaki, Vassia 942 Pelekis, Nikos 1130 Phan, Thomas 1115 Pitoura, Theoni 131 Portman, Leon 1088, 1139 Pryakhin, Alexey 276, 1147, 1164 Puhlmann, Sven 773 Qi, Runping
313
Raghavachari, Mukund 552 Raman, Vijayshankar 847 Raschid, Louiqa 811 Ravikant, D.V.S. 1097 Reichert, Michael 1097 Renz, Matthias 276, 1147 Rhodes, James 1084 Riedewald, Mirek 627 Rittinger, Jan 1190 Ronchetti, Enrico 295, 1194 Roy, Prasan 1097 Roy, Sourashis 1097 Rundensteiner, Elke A. 1001 Sacharidis, Dimitris 4 Sanz, Ismael 1186 Sattler, Kai-Uwe 1198 Schenkel, Ralf 331 Scheuermann, Peter 1039 Schubert, Matthias 1151 Schwald, Dominik 1182 Seidl, Thomas 204 Seifert, Andr´e 902 Senellart, Pierre 1059 Sengar, Vibhuti S. 1097 Shan, Ming-Chien 1079 Shanmugasundaram, Jayavel Shekita, Eugene 313 Shmueli, Oded 552 Singh, Ambuj K. 865, 980 Sion, Radu 1115 Sirangelo, Cristina 442 Skopal, Tom´ aˇs 718 Smathers, Kevin 1106 Smeaton, Alan F. 2
Sorrenti, Matteo A. 1143 Spangler, Scott 1084 Steinbach, Torsten 1097 Sturm, Christoph 59, 1172 Su, Weifeng 77 Tan, Kian-Lee 478 Tao, Yufei 167 Taropa, Emanuel 1049 Terzi, Evimaria 240 Teubner, Jens 1190 Theobald, Martin 331 Theodoridis, Yannis 1130 Tomic, Dragan 1102 Trajcevski, Goce 1039 Tran, Tam Minh 1092 Triantafillou, Peter 131, 149 Tung, Anthony K.H. 478 Turaga, Deepak S. 23 Vadapalli, Soujanya 1097 Van de Craen, Dieter 829 van Keulen, Maurice 1190 Vassalos, Vasilis 570 Vatsavai, Ranga R. 1097 Venkatasubramanian, Nalini 1159 Venkateswaran, Jayendra 645 Vinnik, Svetlana 496 Vlachos, Michail 23 Voisard, Agnes 1039 Vorberger, Florian 1164 Vosinakis, Spyros 1130 Vu, Khanh 700
349
Wang, Chao 533 Wang, Fusheng 588 Wang, Haixun 961 Wang, Jiying 77 Weikum, Gerhard 149 Weis, Melanie 773 White, Walker 627 Wichterich, Marc 204 Wilkinson, Kevin 1106 Wrembel, Robert 1121 Wu, Ping 112 Wu, Yao 811 Wyss, Catharine M. 95 Xavier, Joe 1102 Xu, Jianliang 186, 1020
Yang, Huahai 737 Yang, Jiong 664 Yiu, Man Lung 167 Yu, Jeffrey Xu 961 Yu, Ning 700 Yu, Philip S. 23, 921, 961 Yu, Xiaohui 460 Yuasa, Kei 1106
Zhang, Caijie 112 Zhang, Zheng 222 Zhang, Zhenjie 478 Zhao, Ben Y. 112 Zhao, Dan 1069 Zhao, Dyce Jing 515 Zheng, Baihua 1020 Ziegler, Patrick 59, 1172 Zuzarte, Calisto 460