(2)
In other words, f is a dissolved sequence of two static shots g and h in t ∈ [t1, t2]. Let F(t) = f(x, y, t). By taking the frame difference, the following can be derived:

F(t) − F(t + k) = β(t, k) [F(t − k) − F(t)]   (3)
where β(t, k) = [α(t + k) − α(t)] / [α(t) − α(t − k)] > 1 and t ∈ [t1 + k, t2 − k]. If k > t2 − t1 + 1, a plateau effect will be exhibited, and this effect can be exploited effectively for dissolve detection [11]. Equation (2) can be further simplified by assuming α(t) to be a linear function, for instance α(t) = (t − t1)/(t2 − t1). This leads to a formula in terms of variance:
σf(t) = (σg + σh) α²(t) − 2 σg α(t) + σg   (4)
where σf(t), σg and σh are the variances of f(x, y, t), g(x, y) and h(x, y), respectively. Since σf(t) is a concave-upward parabolic curve,
dissolves can be detected simply by locating parabolic curves [1,8]. The limitations of (3) and (4) are mainly due to the linearity assumption on α(t) and the static-sequence assumption for shots g and h. As a result, most detectors are generally very sensitive to noise, camera and object motion. Recently, dissolve detection has also been considered as a pattern classification problem, and various machine learning algorithms have been brought in for this task [3]. For instance, in [6], multiple independent FSM (finite state machine) based detectors are used for detecting different types of shot boundaries. Together with classifiers, multi-resolution analysis is also adopted in [5,7,13]. In [7], dissolve detection is reduced to the cut detection problem by projecting the spatio-temporal slices to a lower temporal resolution scale. The detected cuts at the lower resolution are further classified in the higher resolution space by machine learning. In [13], multi-resolution features are extracted for active learning of SVM (support vector machines) to detect dissolves.
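As a simple illustration of the variance cue in (4), the following hypothetical sketch computes a per-frame luminance variance curve and flags windows whose profile fits an upward-opening parabola with an interior minimum. The window length and tolerance are illustrative defaults, not values taken from the cited works, and no noise or motion handling is attempted.

```python
import numpy as np

def frame_variances(frames):
    """frames: iterable of 2-D grayscale arrays; returns the variance curve sigma_f(t)."""
    return np.array([float(np.var(f)) for f in frames])

def dissolve_candidates(var_curve, win=21, fit_tol=0.15):
    """Flag window centres whose variance profile resembles the upward parabola of (4).

    A window centred on a local minimum is fitted with a quadratic; it is kept
    when the quadratic opens upward and explains most of the variation.
    """
    t = np.arange(win)
    half = win // 2
    hits = []
    for c in range(half, len(var_curve) - half):
        seg = var_curve[c - half:c + half + 1]
        if seg[half] != seg.min():          # centre must be the local minimum
            continue
        a, b, d = np.polyfit(t, seg, 2)     # quadratic fit: a*t^2 + b*t + d
        resid = seg - np.polyval([a, b, d], t)
        if a > 0 and np.mean(resid ** 2) < fit_tol * np.var(seg):
            hits.append(c)
    return hits
```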
Wipe
Wipe is also a common shot boundary transition, frequently found, for example, in sports and news videos. Many fancy wipe transitions are used by video editors today; Figure 1 shows a few examples. As each transition exhibits its own unique transition pattern, wipe patterns, compared to dissolves, are relatively hard for machines to learn. Early works are mostly based on frame differencing and edge detection to trace the moving boundary lines created by wipes. For instance, the work in [8] detects the edge patterns formed by wipes in spatio-temporal slices. The recent work in [4] analyzes the independence and completeness properties of wipe transitions. A cost function utilizing the MPEG motion vectors is derived to model these two properties for reliable wipe detection.

Video Shot Detection. Figure 1. Various wipe transitions.
General Challenges
As a fundamental building block to support other video applications, shot detection has been extensively studied in the literature. The challenges that are not completely solved for cut detection include abrupt illumination change (e.g., flashlight) and large object/camera movement, which could trigger abrupt changes of video signal and generate false alarms. For gradual transition (GT), on the other hand, the
major difficulty is the slow evolution of the signal, which is hard to observe. Furthermore, the temporal duration can vary greatly while the patterns can appear similar to slow camera/object movement, making GT detection a very challenging problem. For both cut and GT detection, a fundamental issue is the thresholding heuristic, which mostly relies on empirical settings. Over the years, thresholding techniques have evolved from simple global threshold setting to adaptive thresholding [10] and learning-based heuristics [3,5,13].
Key Applications
Shot detection is a fundamental task in video content manipulation. Shots often serve as the basic unit for video browsing, indexing and search. An interesting application of shot detection is that, by knowing the type, speed and frequency of shot transitions, the pace, mood, style and highlights of the underlying video events can be modeled for affective computing.
Experimental Results
The evaluations commonly used are recall and precision, defined as:

Precision = Number of Transitions Correctly Detected / Number of Transitions Detected   (5)

Recall = Number of Transitions Correctly Detected / Number of Transitions   (6)

Precision and recall lie in the interval [0, 1]. Low precision hints at the frequent occurrence of false positives, while low recall indicates the frequent occurrence of false negatives. As a gradual transition involves multiple frames, frame precision and frame recall are also used in order to evaluate the accurate localization of transitions:

Frame Precision = Number of Frames Correctly Located in the Detected Transitions / Number of Frames Located in the Detected Transitions   (7)

Frame Recall = Number of Frames Correctly Located in the Detected Transitions / Number of Frames in the Transitions   (8)
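The metrics (5)–(8) are straightforward to compute once detected and ground-truth transitions are given as frame intervals. The sketch below uses a simplified matching rule (any temporal overlap counts as a correct detection), which is an assumption for illustration rather than the official TRECVID evaluation protocol.

```python
def overlap(a, b):
    """Number of frames shared by two inclusive [start, end] intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def boundary_metrics(detected, truth):
    """detected, truth: lists of (start_frame, end_frame) transitions.

    Returns (precision, recall, frame_precision, frame_recall) as in (5)-(8),
    counting a detection as correct if it overlaps some ground-truth transition.
    """
    correct = [d for d in detected if any(overlap(d, t) for t in truth)]
    precision = len(correct) / len(detected) if detected else 0.0
    recall = len(correct) / len(truth) if truth else 0.0

    det_frames = sum(d[1] - d[0] + 1 for d in detected)
    truth_frames = sum(t[1] - t[0] + 1 for t in truth)
    hit_frames = sum(max(overlap(d, t) for t in truth) if truth else 0
                     for d in detected)
    frame_precision = hit_frames / det_frames if det_frames else 0.0
    frame_recall = hit_frames / truth_frames if truth_frames else 0.0
    return precision, recall, frame_precision, frame_recall
```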
Data Sets
The datasets most commonly used for experiments are the TRECVID benchmarks [9].
Cross-references
▶ Scene Boundary Detection ▶ Sub-Shot Detection
Recommended Reading
1. Alattar A.M. Detecting and compressing dissolve regions in video sequences with a DVI multimedia image compression algorithm. Int. Symp. Circuits Syst., 1:13–16, 1993.
2. Bouthemy P., Gelgon M., and Ganansia F. A unified approach to shot change detection and camera motion characterization. IEEE Trans. Circuits Syst. Video Tech., 9(7):1030–1044, 1999.
3. Hanjalic A. Shot boundary detection: unraveled and resolved. IEEE Trans. Circuits Syst. Video Tech., 12(2):90–105, February 2002.
4. Li S. and Lee M.-C. Effective detection of various wipe transitions. IEEE Trans. Circuits Syst. Video Tech., 17(6), June 2007.
5. Lienhart R. Reliable dissolve detection. In Proc. SPIE Storage and Retrieval for Media Databases, 2001.
6. Liu Z., Gibbon D., Zavesky E., Shahrary B., and Haffner P. AT&T Research at TRECVID 2006. In Proc. TREC Video Retrieval Evaluation, 2006.
7. Ngo C.W. A robust dissolve detector by support vector machine. In Proc. 11th ACM Int. Conf. on Multimedia, 2003.
8. Ngo C.W., Pong T.C., and Chin R.T. Video partitioning by temporal slice coherency. IEEE Trans. Circuits Syst. Video Tech., 11(8):941–953, August 2001.
9. TREC-Video, http://www-nlpir.nist.gov/projects/trecvid/.
10. Vasconcelos N. and Lippman A. Statistical models of video structure for content analysis and characterization. IEEE Trans. Image Process., 9(1):3–19, January 2000.
11. Yeo B.L. and Liu B. Rapid scene analysis on compressed video. IEEE Trans. Circuits Syst. Video Tech., 5(6):533–544, 1995.
12. Yu H. and Wolf W. A multi-resolution video segmentation scheme for wipe transition identification. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 1998.
13. Yuan J., Wang H., Xiao L., Zheng W., Li J., Lin F., and Zhang B. A formal study of shot boundary detection. IEEE Trans. Circuits Syst. Video Tech., 17(2), 2007.
14. Zabih R., Miller J., and Mai K. A feature-based algorithm for detecting and classifying scene breaks. In Proc. 3rd ACM Int. Conf. on Multimedia, 1995.
15. Zhang H.J., Kankanhalli A., and Smoliar S.W. Automatic partitioning of full-motion video. ACM Multimedia Syst., 1(1):10–28, 1993.

Video Shot-Cut Detection ▶ Video Segmentation

Video Skimming ▶ Video Summarization
Video Structure Analysis ▶ Video Content Structure
Video Structuring ▶ Video Content Structure
Video Summarization
Chong-Wah Ngo, Feng Wang
City University of Hong Kong, Hong Kong, China
Synonyms
Video abstraction; Video skimming
Definition
Video summarization is the generation of a short summary of the content of a longer video document by selecting and presenting the most informative or interesting material to potential users. The output summary is usually composed of a set of keyframes or video clips extracted from the original video with some editing process. The aim of video summarization is to speed up browsing of a large collection of video data, and to achieve efficient access and representation of the video content. By watching the summary, users can make quick decisions on the usefulness of the video. Depending on the applications and target users, the evaluation of a summary often involves usability studies to measure the content informativeness and quality of the summary.
Historical Background
Due to the advance of web technologies and the popularity of video capture devices in the past few decades, the amount of video data is dramatically increasing. This creates a strong demand for efficient tools to
browse, access, and manipulate large video collections. The most straightforward and simplest way for quick browsing of video content is to speed up the playback of the video by frame dropping or re-encoding. As studied in [6], the entire video can be watched in a shorter amount of time by fast playback with almost no pitch distortion using time-compression technology. This kind of approach is easy to implement and preserves most information while shortening the time spent watching videos. However, without comprehensive understanding of the original video content, the selection of video materials is impossible, and thus almost all redundant information still has to be watched. This limits further shortening of the summary duration and its enjoyability. Meanwhile, the achievable ratio of time compression is also limited. Since the 1990s, video content analysis has attracted a lot of research attention. Different approaches have been proposed for automatic structuring and understanding of video content. This enables video summarization by measuring and selecting the most informative materials based on a deeper comprehension of the video content. Video summarization can be seen as a general term covering video abstraction and video skimming. In some works, they are distinguished by small differences in the composition of the output. For instance, video skims are mainly composed of short clips skimmed from the original video. A video summary or abstraction could be a set of keyframes, clips or even other media (e.g., text) created to describe the video content. Nowadays, video summarization has become the key tool for facilitating efficient browsing and indexing of large video collections.
Foundations
In the past two decades, a lot of research effort has been devoted to video summarization. A systematic review of these works can be found in [10]. According to the definition, a summary should be as short or concise as possible to support efficient video browsing. On the other hand, it should be informative so that as much interesting and useful material as possible is included. Thus, the key to video summarization is how to measure the importance of different video clips and select the most useful ones so as to achieve both conciseness and informativeness. For this purpose, there are mainly two issues of concern: video content analysis, and summary generation by selecting appropriate materials. Video content analysis enables the
understanding and comprehension of the original video content. Different features are then extracted to structure and describe the video. Based on video content analysis, a shorter summary is generated by selecting and presenting the most important materials to users. Meanwhile, to cope with the characteristics of different video domains, many algorithms have been proposed to achieve better results for the summarization of specific video domains by employing domain-specific knowledge.

Features for Content Analysis
Along with the advances in video content analysis, more and more features are employed in video summarization. According to the features adopted for video content analysis, the algorithms can be categorized as shot detection based [15], motion based [3], audio based [9], and multi-modality based [4,11]. In recent years, most algorithms tend to integrate multiple features to better describe the video content. Besides the visual-audio features, and especially with the dramatic increase of web videos, metadata, context, social, and log information are being explored for the summarization of video corpora, e.g., the video clips that are watched by the most users may be more important and interesting. Which features to use usually depends on the requirements of specific applications.

Importance Measure for Summary Generation
How to decide the importance and usefulness of video segments is essential for summary generation. During the past two decades, different approaches have been proposed. The first kind of approach is based on video shot detection. Shot detection is one of the fundamental techniques for video structuring. In each segmented shot, one or several keyframes are selected and presented for users to efficiently browse and index the video content in order to find the interesting video segments for further exploration. In [15], several keyframes are extracted by detecting visual changes using color histograms inside each shot. In [2], to eliminate the possible redundancy among extracted keyframes, fuzzy clustering and data pruning methods are employed based on color information to obtain a set of non-redundant keyframes as the summary of the video. These approaches employ only limited low-level visual features. For the selection of keyframes or video segments, the main idea is to reduce the redundancy and maximize the visual difference in the generated
summary, so as to show more information as measured by low-level features. In the second kind of approach, importance measure is based on some pre-defined rules. In [8], a set of rules is defined to select the most useful frames or video segments. For instance, images between similar scenes that are less than 5 seconds apart are used for summarization; short successive shots are considered to introduce a more important topic. With prioritized video frames from each shot, another set of higher-order rules is then used to compose the selected clips for summary generation. For instance, the duration of each clip is at least 60 frames, based on empirical and user studies of visual comprehension in short video sequences. Similar rules can also be used in query-based or domain-specific video summarization. For instance, given a soccer video, the users may only be interested in the shots of goals. In this case, to make a summary, the goal events should first be detected and then shown to the users. In this kind of rule-based approach, the importance of a video segment is manually assigned by the predefined rules. This may be useful for some specific video domains and applications. However, the rules are usually subjective. Considering the variations of video content, to make desirable summaries, more and more detailed rules are required to cope with different scenarios. This becomes almost impossible for general videos. Today many approaches try to numerically measure the importance of video segments and model video summarization in a computational way. In [4], a user attention model utilizing a set of audio-visual features is proposed. Attention is a neurobiological concept. It implies the concentration of mental powers upon an object by close or careful observing or listening, i.e., the ability or power to concentrate mentally. When watching a video, human attention is attracted by different information elements such as faces, text, objects, camera motion, and so on. In [4], attention models are first computed based on different visual/audio features. A user attention curve is then generated by a linear-combination fusion scheme. The crests of the curve are the places most likely to attract the viewers' attention. Keyframes or video skims are extracted from these crests. Similarly, in [14], a perception curve that corresponds to human perception changes is constructed based on a number of visual features, including motion, contrast, special scenes, and statistical rhythm. The frames
corresponding to the peak points of the perception curve are extracted for summarization. This kind of approach usually integrates multiple modalities from video documents and human perception cues to depict the curve of video importance numerically for summarization. However, manually defined rules are still used, such as for the selection and the weighting of different modalities.
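A minimal sketch in the spirit of the linearly fused attention curve of [4] is given below. The per-frame feature curves, their weights, and the peak-picking rule are illustrative placeholders, not the actual models of [4] or [14].

```python
import numpy as np

def fused_attention(curves, weights):
    """Linearly fuse per-frame feature curves (e.g., motion, face, audio energy)
    into a single attention curve. Each curve is min-max normalised first;
    the weights are illustrative and would normally be tuned or learned."""
    acc = np.zeros(len(next(iter(curves.values()))))
    for name, w in weights.items():
        c = np.asarray(curves[name], dtype=float)
        span = c.max() - c.min()
        acc += w * ((c - c.min()) / span if span > 0 else np.zeros_like(c))
    return acc

def keyframes_from_crests(attention, min_gap=30):
    """Pick local maxima (crests) of the attention curve as keyframe positions,
    keeping the strongest peaks and enforcing a minimum temporal gap."""
    peaks = [t for t in range(1, len(attention) - 1)
             if attention[t] >= attention[t - 1] and attention[t] > attention[t + 1]]
    peaks.sort(key=lambda t: attention[t], reverse=True)
    chosen = []
    for p in peaks:
        if all(abs(p - q) >= min_gap for q in chosen):
            chosen.append(p)
    return sorted(chosen)
```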
Domain-Specific Video Summarization
According to the target video domains, video summarization can be categorized into general videos and domain-specific videos, such as music video, news video, sports video, and rushes video.
General video: The algorithms for general video summarization employ features available in all types of videos, and usually do not consider any domain-specific knowledge. In [5], a video is represented as a complete undirected graph and the normalized cut algorithm is employed to partition the graph into video clusters. The resulting clusters form a directed temporal graph and a shortest path algorithm is proposed to detect video scenes. The attention values are computed and attached to the scenes, clusters, shots and subshots. Thus, the temporal graph can describe the evolution and perceptual importance of a video. A video summary is generated from the temporal graph to emphasize both content balance and perceptual quality. The advantage of this kind of approach is that no prior knowledge is used and the algorithms can work for most videos. However, when watching videos of different domains, the users' attentions are quite different. By treating all videos in the same way, these algorithms usually cannot produce satisfactory results for videos of specific interest. Thus, a lot of works focus on specific video domains and make use of domain-specific knowledge in order to improve the summary quality.
Music video: In music video, the music plays the dominant role. Thus, music video summarization is mainly based on music analysis. In [13], the chorus is detected and used as a thumbnail for the music content. The analysis of visual content such as shot classification and text (lyrics) recognition is employed for music-visual-text alignment so as to make a meaningful and smooth music video summary.
Sports video: When watching sports videos, people are more interested in the exciting moments and great plays. Thus, highlight detection and event
classification are usually involved in order to make a good summary. In [1], a multi-level representation for sports video is proposed. Based on low-level features, different semantic shots (e.g., Audience, Player Close-up, and Player Following) are classified at the mid-level. For different sports games, the corresponding interesting objects and events (such as the football and a goal event in a soccer video) are detected at the high-level. The interesting events can then be selected for summary generation.
News video: In daily news videos, a topic is usually composed of a story chain during the topic evolution. The dependencies among related stories are important for summarizing a news topic. A feasible solution for rapid browsing of news topics is by threading and autodocumenting news videos [12]. In [12], the duality between stories and textual-visual concepts is first exploited to cluster the stories of the same topic. The topic structure is then presented by exploring the textual-visual novelty and redundancy of stories. By pruning the peripheral and redundant news stories in the topic structure, a main thread is extracted for autodocumentary.
Rushes video: Rushes are the raw materials captured during video production. Being unedited, rushes contain a lot of redundant and junk information which is intertwined with useful stock footage. In [11], stock footage is first located by motion analysis, repetitive shot detection, and shot classification. The most representative video clips are then selected to compose a summary based on object and event understanding.
Key Applications
Video summarization can be used in many applications.
Multimedia Archives Indexing and Retrieval
Nowadays more and more video documents are being digitized and archived worldwide. Video summarization can be used to index and retrieve a large video collection, and thus facilitate efficient access to video content. For instance, online summarization can support journalists when searching old video material, or when producing documentaries. Another example is the wide use of web videos, where the index could easily be realized by automatically generated video summaries.

Movie Marketing
In the broadcasting and filmmaking industries, a lot of raw footage (extra video, B-roll footage) is used to
generate the final products such as TV programs and movies. Twenty to forty times as much material may be shot as actually becomes part of the finished product. The "shoot-to-show" ratio, such as in BBC TV, ranges from 20 to 40. The producers see this large amount of raw footage as a cheap gold mine. Video summarization can be used to find the stock footage from the heavily redundant materials for potential reuse.

Home Entertainment
For instance, one can easily have a brief overview of what happened in a television series that may have been missed.
Future Directions
Performance evaluation
Despite the advances achieved in video summarization, a fundamental problem, i.e., performance evaluation, is yet to be solved. Today, the evaluation is mainly carried out by subjective scoring based on the users' personal judgment or a few predefined criteria. This limits effective comparison between different approaches.

Comprehensive understanding and modeling of video content
In the existing approaches, neither the rule-based nor the computational models can fully grasp the content and the storyline of the video. Besides more powerful approaches for video content analysis, the modeling of multimedia information is also essential for video summarization.

From large-scale video summarization to search and browsing
The existing works mainly focus on the summarization of a single video document or a small video cluster, where the information inside a video or the dependencies between videos can be relatively easily modeled. With the dramatic increase of video data (e.g., web videos), how to summarize a huge group of videos, for instance by structuring the relationships among diverse videos, is a big challenge. Besides video content dependency, context and social information are cues that could be explored for this type of large-scale summarization. In addition, the exploitation of summarization for large-scale video search and browsing remains an area yet to be studied.
Experimental Results
In the TRECVID Workshop 2007 on Rushes Summarization [7], 22 runs were submitted for evaluation on the same benchmark. Different criteria such as IN (inclusion of ground-truth objects/events), EA (easy to understand), and RE (redundancy of the summary) were assessed. Runs generated by CityUHK, NII, and LIP6 were verified by [7] as significantly better than two baselines from CMU. The baselines were generated by evenly sampling frames and by shot clustering, respectively.
Data Sets
Rushes Videos by TRECVID Workshop
Since 2004, the annual TRECVID Workshop has provided a benchmark for rushes video exploitation and summarization. The dataset is composed of 100 hours of raw video material produced during news video and film production.
Cross-references
▶ Video Content Analysis

References
1. Duan L.Y., Xu M., Chua T.S., Tian Q., and Xu C. A mid-level representation framework for semantic sports video analysis. In Proc. 11th ACM Int. Conf. on Multimedia, 2003.
2. Ferman A.M. and Tekalp A.M. Two-stage hierarchical video summary extraction to match low-level user browsing preferences. IEEE Trans. Multimedia, 5(2):244–256, 2003.
3. Liu T., Zhang H.J., and Qi F. A novel video keyframe extraction algorithm based on perceived motion energy model. IEEE Trans. Circuits Syst. Video Tech., 13(10):1006–1013, 2003.
4. Ma Y.F., Lu L., Zhang H.J., and Li M. A user attention model for video summarization. In Proc. 10th ACM Int. Conf. on Multimedia, 2002.
5. Ngo C.W., Ma Y.F., and Zhang H.J. Video summarization and scene detection by graph modeling. IEEE Trans. Circuits Syst. Video Tech., 15(2):296–305, 2005.
6. Omoigui N., He L., Gupta A., Grudin J., and Sanocki E. Time-compression: system concerns, usage, and benefits. In Proc. SIGCHI Conf. on Human Factors in Computing Systems, 1999.
7. Over P., Smeaton A.F., and Kelly P. The TRECVID 2007 BBC rushes summarization evaluation pilot. In TRECVID BBC Rushes Summarization Workshop in ACM Multimedia, 2007.
8. Smith M.A. and Kanade T. Video skimming and characterization through the combination of image and language understanding. In Proc. IEEE Int. Workshop on Content-Based Access of Image and Video Database, 1998.
9. Taskiran C.M., Pizlo Z., Amir A., Ponceleon D., and Delp E. Automated video program summarization using speech transcripts. IEEE Trans. Multimedia, 8(4):775–791, 2006.
10. Truong B.T. and Venkatesh S. Video abstraction: a systematic review and classification. ACM Trans. Multimedia Comput. Commun. Appl., 3(1), 2007.
11. Wang F. and Ngo C.W. Rushes video summarization by object and event understanding. In TRECVID Workshop on Rushes Summarization in ACM Multimedia Conference, September 2007.
12. Wu X., Ngo C.W., and Li Q. Threading and autodocumenting in news videos. IEEE Signal Process. Mag., 23(2):59–68, 2006.
13. Xu C., Shao X., Maddage N.C., and Kankanhalli M.S. Automatic music video summarization based on audio-visual-text analysis and alignment. In Proc. 31st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005, pp. 361–368.
14. You J., Liu G., Sun L., and Li H. A multiple visual models based perceptive analysis framework for multilevel video summarization. IEEE Trans. Circuits Syst. Video Tech., 17(3), 2007.
15. Zhang H.J., Wu J., Zhong D., and Smoliar S.W. An integrated system for content-based video retrieval and browsing. Pattern Recogn., 30(4):643–658, 1997.

View Adaptation
Kenneth A. Ross
Columbia University, New York, NY, USA

Synonyms
Materialized view redefinition
Definition
Small changes to the definition of a materialized view are often needed in database systems. View adaptation is the process of keeping a materialized view up-to-date when the definition of the view is changed. View adaptation aims to leverage the previously materialized view to generate the new view, since rebuilding the materialized view from scratch may be expensive.
Key Points
View adaptation is related to the question of answering queries using views. The new view can be thought of as a query, with the old view available to help compute it. However, view adaptation also admits in-place changes that are not possible using a query-answering approach. For example, if a redefined view contains most but not all of the records from the original view, then view adaptation can be achieved by deleting the records that no longer qualify. A variety of adaptation techniques are presented in [2,3], allowing changes to the SELECT, FROM,
WHERE, GROUP BY and HAVING clauses. UNION and EXCEPT views are also considered. In some cases, multiple changes to a view definition can be handled in a single pass, without materializing intermediate results for each change. Experimental results show the value of adaptation, particularly the in-place methods, relative to recomputation. Extensions of view adaptation to more general contexts, such as distributed databases, have also been proposed [1,4]. Data warehouses and decision support systems often employ materialized views to speed up query processing. Applications in which a user can change queries dynamically and see the results fast, such as data visualization, data archeology, and dynamic query processing, can also benefit from view adaptation.
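A toy illustration of the in-place idea mentioned above (not the actual algorithms of [2,3]): assuming the redefinition only tightens the selection predicate, adaptation reduces to deleting the materialized rows that no longer qualify instead of recomputing the view from the base tables.

```python
def adapt_by_deletion(materialized_rows, new_predicate):
    """In-place adaptation for a tightened WHERE clause: keep only the rows of
    the old materialization that satisfy the new predicate, reusing the old
    result rather than recomputing from the base tables."""
    return [row for row in materialized_rows if new_predicate(row)]

# Hypothetical example: a view over rows with salary > 10000 is redefined
# to salary > 20000; the old materialization is filtered in place.
old_view = [{"name": "a", "salary": 15000}, {"name": "b", "salary": 25000}]
new_view = adapt_by_deletion(old_view, lambda r: r["salary"] > 20000)
```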
Cross-references
▶ Answering Queries using Views ▶ Incremental View Maintenance ▶ Materialized Views
Recommended Reading
1. Bellahsene Z. View adaptation in the fragment-based approach. IEEE Trans. Knowl. Data Eng., 16(11):1441–1455, 2004.
2. Gupta A., Mumick I.S., Rao J., and Ross K.A. Adapting materialized views after redefinitions: techniques and a performance study. Inf. Syst., 26(5):323–362, 2001.
3. Gupta A., Mumick I.S., and Ross K.A. Adapting materialized views after redefinitions. In Proc. ACM SIGMOD Int. Conf. on Management of Data, San Jose, CA, 1995, pp. 211–222.
4. Mohania M.K. and Dong G. Algorithms for adapting materialized views in data warehouses. In Proc. Int. Symp. on Cooperative Database Systems for Advanced Applications, 1996, pp. 309–316.
View Definition
Yannis Kotidis
Athens University of Economics and Business, Athens, Greece
Synonyms
View expression
Definition
The definition of a view consists of the name of the view and of a query, whose result is used to determine the content of the view.

Key Points
A view is a virtual relation. Its content depends on the evaluation of a query over a set of base tables or other views in the database. This query is part of the view definition and is, typically, recomputed every time the view is referenced. In some cases, for efficiency, the tuples of a view may be materialized as a separate table in the database. In relational systems, a view is defined using the create view command:

create view <v> as <query expression>

The name of the view in the above example is <v>, and the schema and content of the view are derived on-demand by the evaluation of <query expression>, which should be a legal expression supported by the database management system. Different vendor systems may impose some constraints on the form of <query expression>; for instance, they may disallow references to temporary tables. When the data in the table(s) mentioned in <query expression> changes, the data in view <v> changes also. Consider the following view definitions:

create view v1 as
select Name, Age
from Personnel
where Department = 'Sales'

create view v2 as
select Wages.Name, Wages.Salary
from Personnel, Wages
where Personnel.Name = Wages.Name and Personnel.Department = 'Sales'

The first expression defines a view termed v1 that contains the name and age attributes from database table Personnel. The instance of view v1 consists of the subset of personnel data restricted to those working at department Sales. View v2, defined by the second expression, contains the result of the join between tables Personnel and Wages for employees working at the Sales department.

Cross-references
▶ View Maintenance Aspects ▶ View Unfolding ▶ Views
Recommended Reading
1. Adiba M.E. and Lindsay B.G. Database snapshots. In Proc. 6th Int. Conf. on Very Large Data Bases, 1980, pp. 86–91.
2. Dayal U. and Bernstein P. On the correct translation of update operations on relational views. ACM Trans. Database Syst., 8(3):381–416, 1982.
3. Gupta A., Jagadish H.V., and Mumick I.S. Data integration using self-maintainable views. In Advances in Database Technology, Proc. 5th Int. Conf. on Extending Database Technology, 1996, pp. 140–144.
4. Gupta H., Harinarayan V., Rajaraman A., and Ullman J.D. Index selection for OLAP. In Proc. 13th Int. Conf. on Data Engineering, 1997, pp. 208–219.
5. Kotidis Y. and Roussopoulos N. DynaMat: a dynamic view management system for data warehouses. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 371–382.
6. Roussopoulos N. View indexing in relational databases. ACM Trans. Database Syst., 7(2):258–290, 1982.
7. Roussopoulos N. An incremental access method for ViewCache: concept, algorithms, and cost analysis. ACM Trans. Database Syst., 16(3):535–563, 1991.
View Maintenance
Alexandros Labrinidis 1, Yannis Sismanis 2
1 Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, USA
2 IBM Almaden Research Center, Almaden, CA, USA
Synonyms
View update; Materialized view maintenance
Definition
View maintenance typically refers to the updating of a materialized view (also known as a derived relation) to make it consistent with the base relations it is derived from. Such an update typically happens immediately, with the transaction that updates the base relations also updating the materialized views. However, such immediate updates impose significant overheads on update transactions that cannot be tolerated by many applications. Deferred view maintenance, on the other hand, allows the view to become inconsistent with its definition, and a refresh operation is required to establish consistency. Typically, under deferred maintenance, a view is incrementally updated only just before data is retrieved from it (i.e., on-demand, just before a query is performed on the view).
Historical Background
Early systems that supported views did so in their "pure form," i.e., by storing just the view definition
and using query rewriting to take advantage of views in other queries [11]. Incremental view maintenance is introduced in [1] through a technique to efficiently detect relevant updates to materialized views, thus streamlining their maintenance. Deferred view maintenance is introduced in [10] as a scheme for materializing copies of views on workstations attached to a mainframe that maintains a shared global database. The workstations update local copies of the views while processing queries. In [7], deferred view maintenance is defined as the application of incremental view maintenance whenever desired, unlike the immediate view maintenance, where any database update triggers the incremental view maintenance algorithm. [5] has a nice survey of view maintenance techniques.
Foundations
Algorithms and techniques for the maintenance of materialized views can be classified according to three different criteria:
1. Whether the view is recomputed from scratch or not: recomputation versus incremental maintenance.
2. Whether the view is updated whenever the base data change or not: immediate versus deferred maintenance.
3. Whether queries can be executed while the view is being updated or not: online versus offline maintenance.
All the above dimensions are typically orthogonal. We explain the different options below.

View Recomputation
Recomputing a materialized view from the base relations it is derived from is the most general technique of updating. As such, it can be applied to any type of view, regardless of the complexity of the query definition. The disadvantage is that, in most cases, such recomputation is costly, and, in many cases, the view can be updated incrementally instead, at a fraction of the cost.

Incremental View Maintenance
It is possible to update a materialized view incrementally for many types of view definitions (i.e., queries). One such class is the general case of SPJ views (i.e., views whose definition is just a select-project-join query).
For example, assume that we have a view V defined over two relations R and S through a natural join (i.e., V = R ⋈ S; for simplicity of the presentation we ignore the selection and projection operators). Further, let us assume that we have a set of deleted tuples from relation R, denoted as R^D, and a set of inserted tuples into relation R, denoted as R^I (i.e., R′ = (R ∪ R^I) − R^D). Also, assume a set of deleted tuples from relation S, denoted as S^D, and a set of inserted tuples into relation S, denoted as S^I (i.e., S′ = (S ∪ S^I) − S^D). We trivially represent base relation updates as pairs of deletions and insertions.

Given the above, the updated version of V, i.e., V′, should be V′ = R′ ⋈ S′ = ((R ∪ R^I) − R^D) ⋈ ((S ∪ S^I) − S^D). By expanding this further, and grouping all the deletions from V as V^D and all the insertions to V as V^I, we have that:

V^D = (R^D ⋈ (S ∪ S^I)) ∪ ((R ∪ R^I) ⋈ S^D), and
V^I = (R^I ⋈ S) ∪ (R ⋈ S^I) ∪ (R^I ⋈ S^I),

so that V′ = (V ∪ V^I) − V^D. This incrementally computed formula should be less costly to evaluate than recomputing the entire join from scratch. The problem of incrementally updating materialized views is difficult in the general case, but there are additional classes of queries (i.e., besides SPJ views) for which it can be solved [6].
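A small sketch of the delta expressions above, under the simplifying assumptions that relations are represented as Python sets of tuples and the natural join is an equi-join on a single attribute position.

```python
def join(r, s, key_r=0, key_s=0):
    """Equi-join of two sets of tuples on the given attribute positions,
    returning concatenated tuples (a stand-in for the natural join)."""
    return {ra + sa for ra in r for sa in s if ra[key_r] == sa[key_s]}

def maintain(view, r, s, r_ins, r_del, s_ins, s_del):
    """Incrementally maintain V = R join S using the delta expressions in
    the text: V_D = (R_D join (S u S_I)) u ((R u R_I) join S_D) and
    V_I = (R_I join S) u (R join S_I) u (R_I join S_I)."""
    v_del = join(r_del, s | s_ins) | join(r | r_ins, s_del)
    v_ins = join(r_ins, s) | join(r, s_ins) | join(r_ins, s_ins)
    return (view | v_ins) - v_del
```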
Immediate View Maintenance
The default way of updating materialized views is to do so immediately, i.e., batch together, in a single transaction, the updating of the base relations and the updating of the materialized views that are derived from these relations. However, many applications cannot tolerate this delay, especially if they are interactive and users are expecting an answer at transaction commit.

Deferred View Maintenance
Incremental deferred view maintenance requires (i) techniques for checking which views are affected by an update to the base tables, (ii) auxiliary tables that maintain certain information, such as updates and deletes since the last view refresh, and finally (iii) techniques for propagating the changes from the base tuples to the view tuples without fully recomputing the view relation. First, Buneman [2] proposes a technique for the efficient implementation of alerters and triggers that checks each update operation prior to execution to see whether it can cause a view to change. In [1], an
efficient method for identifying updates that cannot possibly affect the views is described. Such irrelevant updates are then removed from consideration while differentially updating the views. In [7], the hypothetical relations technique developed in [12] is adapted to the purpose of storing and indexing the deltas to the base tables. The main idea is to use a single table AD that stores deletions and insertions for the base tables (updates can be modeled as a deletion followed by an insertion). Whenever a view is accessed, the base tables and the AD table need to be accessed (in order to check for new or deleted tuples). A Bloom filter, however, is used to check whether a tuple from the base relation exists in AD, significantly reducing irrelevant accesses to AD. In [4], the authors demonstrate that the ordering of the updates from the base tuples to the view tuples is critical, and call this phenomenon the state bug. Typically, an "incremental query" – evaluated during the refresh operation – avoids recomputing the full view and only incrementally computes the delta view to bring it up to date, based on updates/deletes made to the base tables. Such incremental queries can be evaluated in two states: the pre-update state, where the base table updates have not been applied yet, or the post-update state, where the changes have been applied. Most techniques assume the pre-update state, which severely limits the class of updates and views considered. The post-update state allows a much larger class of views to be maintained in a deferred manner; however, direct application of pre-update techniques results in incorrect answers (the state bug), and new techniques are proposed.
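The sketch below is a schematic rendering of the delta-table idea described above. A plain hash set stands in for the Bloom filter, the refresh logic is simplified, and the function that maps logged base-table changes to view-level insertions and deletions is left abstract; none of this reproduces the actual algorithms of [7] or [4].

```python
class DeferredView:
    """Deferred maintenance sketch: base-table changes are appended to a delta
    log (the 'AD' table of the text) and folded into the materialization only
    when the view is queried."""

    def __init__(self, compute_delta):
        self.compute_delta = compute_delta    # delta log -> (view inserts, view deletes)
        self.materialized = set()
        self.delta_log = []                   # pending (op, table, tuple) records
        self.touched = set()                  # membership filter over logged tuples

    def log_change(self, op, table, tup):
        self.delta_log.append((op, table, tup))
        self.touched.add((table, tup))

    def maybe_changed(self, table, tup):
        # A real Bloom filter may return false positives but never false
        # negatives, which is all that correctness requires here.
        return (table, tup) in self.touched

    def query(self):
        if self.delta_log:                    # refresh only on demand
            ins, dels = self.compute_delta(self.delta_log)
            self.materialized = (self.materialized | ins) - dels
            self.delta_log.clear()
            self.touched.clear()
        return self.materialized
```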
Offline View Maintenance
Typically, maintaining materialized views is done offline, without allowing queries to the materialized view to execute concurrently with the processing of the materialized view updates. This simplifies the view maintenance algorithms significantly, at the expense of delaying queries. Traditionally, in data warehousing environments [3], updates of materialized views are performed at night, thus minimizing the possibility of delaying user queries.

Online View Maintenance
The need of most companies for continuous operation (especially in the presence of the Web) has precipitated the need for online view maintenance, where
queries can be answered while the materialized views are being updated. In a centralized setting, this is typically achieved through some sort of multi-versioning, either as horizontal redundancy, where extra columns are added to hold the different versions [9], or as vertical redundancy, where extra rows are needed to hold the different versions [8]. In a distributed setting, this is typically achieved through determination of additional queries to ask of the data sources [13].
Key Applications
Materialized views help speed up the execution of frequently accessed queries, giving interactive response times to even the most complex queries. The cost of maintaining materialized views is typically amortized over multiple accesses (i.e., queries to the view). This has been utilized in many different application domains, from data warehousing to web data management. Beyond efficient algorithms and techniques to update materialized views, special attention has also been given to the view selection problem: how to identify which views should be materialized, and also to the issue of how to effectively use materialized views to answer other queries (i.e., by utilizing subsumption or caching).
Cross-references
▶ Recursive View Maintenance ▶ View Selection
Recommended Reading
1. Blakeley J.A., Larson P.-Å., and Tompa F.W. Efficiently updating materialized views. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1986, pp. 61–71.
2. Buneman P. and Clemons E.K. Efficiently monitoring relational databases. ACM Trans. Database Syst., 4(3):368–382, 1979.
3. Chaudhuri S. and Dayal U. An overview of data warehousing and OLAP technology. ACM SIGMOD Rec., 26(1):65–74, 1997.
4. Colby L.S., Griffin T., Libkin L., Mumick I.S., and Trickey H. Algorithms for deferred view maintenance. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996, pp. 469–480.
5. Gupta A. and Mumick I.S. Maintenance of materialized views: problems, techniques, and applications. IEEE Data Eng. Bull., 18(2):3–18, 1995.
6. Gupta A., Mumick I.S., and Subrahmanian V.S. Maintaining views incrementally. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1993, pp. 157–166.
7. Hanson E.N. A performance analysis of view materialization strategies. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1987, pp. 440–453.
8. Labrinidis A. and Roussopoulos N. A performance evaluation of online warehouse update algorithms. Tech. Rep. CS-TR-3954, Department of Computer Science, University of Maryland, 1998.
9. Quass D. and Widom J. On-line warehouse view maintenance. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1997, pp. 393–404.
10. Roussopoulos N. and Kang H. Principles and techniques in the design of ADMS±. IEEE Comput., 19(12):19–25, 1986.
11. Stonebraker M. Implementation of integrity constraints and views by query modification. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1975, pp. 65–78.
12. Woodfill J. and Stonebraker M. An implementation of hypothetical relations. In Proc. 9th Int. Conf. on Very Large Data Bases, 1983, pp. 157–166.
13. Zhuge Y., Garcia-Molina H., Hammer J., and Widom J. View maintenance in a warehousing environment. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1995, pp. 316–327.
View Maintenance Aspects
Antonios Deligiannakis
University of Athens, Athens, Greece
Definition
Database systems often define views in order to provide conceptual subsets of the data to different users. Each view may be very complex and require joining information from multiple base relations, or other views. A view can simply be used as a query modification mechanism, where user queries referring to a particular view are appropriately modified based on the definition of the view. However, in applications where fast response times to user queries are essential, views are often materialized by storing their tuples inside the database. This is extremely useful when recomputing the view from the base relations is very expensive. When changes occur to their base relations, materialized views need to be updated, with a process known as view maintenance, in order to provide fresh data to the user.
Historical Background
The use of relational views has long been proposed in relational database systems. The notion of materialized views, or snapshots, was first proposed in [1]. A snapshot represents the state of some portion of the database at the time when the snapshot was computed. Since the publication of [1], a large number of mechanisms for refreshing materialized views have been proposed. These
mechanisms mainly involve either the time of refreshing the materialized views, or how such a refresh operation can be performed (i.e., through a re-computation of the view, or in an incremental manner). While each policy has its own advantages and drawbacks, ADMS was the first system that realized the importance of having multiple maintenance policies within the same database [11,12]. An excellent classification of the view maintenance problem was presented in [4].
Foundations
Materialized views are used in order to provide faster response times to user queries. Figure 1 depicts a sample view-dependency graph of three materialized views (namely, V1, V2 and V3) defined over three base relations (denoted as B1, B2 and B3). If a view V contains in its definition a reference to a base relation (or another view) B, then there exists a directed edge in the view-dependency graph from B to V. Note that V3 in the sample view-dependency graph is expressed in terms of a base relation (B3) and another materialized view (V2); however, the ultimate set of base relations used to derive V3 is B1, B2 and B3. When the underlying data of the base relations change, the materialized views contain stale data until they are refreshed. View maintenance can be performed either through a re-computation of the view or, if possible, through an incremental procedure. An incremental update policy is often faster, and is thus
in many cases desirable, if only a small portion of the base relations has been modified. However, when a large portion of some base relations is updated (e.g., most tuples of a base relation are deleted), then a complete re-computation may actually be faster. The process of maintaining a materialized view can be classified based on five main dimensions.

Time Dimension
Every update transaction incurs a time penalty, thus slowing down queries performed during the update process. On the other hand, some applications (i.e., applications regarding stock data) cannot tolerate viewing stale data. Depending on when a materialized view is updated, three main policies for view maintenance can be defined [2]:
1. Immediate Views: Upon an update to a base relation, the materialized view is updated immediately. This is often accomplished using database triggers.
2. Deferred Views: Updating the materialized view is deferred until the first time when the view is queried.
3. Snapshot Views: The view is updated periodically (i.e., at daily intervals) by an asynchronous process.
Each of the aforementioned policies has some advantages and drawbacks, and each may be the policy of choice, depending on the characteristics of the targeted application. Immediate views incur an update penalty even if the updated tuples are never queried. Deferred views incur this update cost only at the first time when the view is queried after its base relations have been updated. Thus, deferred views incur a lower update overhead, but also worse query performance, since the query evaluation process may have to wait for some views to be updated, when compared to the immediate views policy. The snapshot views policy leads to faster query performance, since typically no updates are performed in parallel to the user queries, and to better update performance, since the updates are batched. Thus, snapshot views represent an attractive choice when the application may tolerate reading stale data.
View Maintenance Aspects. Figure 1. An example of a view-dependency graph.
Expressiveness of View Definition Language
Algorithms proposed for the maintenance of a materialized view can often handle only views expressed as a subset of SQL. For example, some techniques may only be able to handle views defined as select-project-join (SPJ) queries. However, a view definition may be much more complex, as it may contain (amongst others)
duplicates, arithmetic operations, aggregate or ranking functions, nested subqueries, set operations such as calculating the union or difference of two sets, outer-joins, recursion, etc. View maintenance algorithms often seek to derive formulas for determining the update to a view based on the updates to its base relations, and thus avoid recalculating the entire view from scratch. For example, for the sample dependency graph presented in Fig. 1, let view V2 be calculated as V2 = B1 ⋈_JC B2, where JC denotes the join condition in the definition of V2 between relations B1 and B2. If ΔB1 and ΔB2 represent the inserted (deleted) tuples of relations B1 and B2, then the inserted (deleted) tuples ΔV2 of view V2 can be calculated as:

ΔV2 = (ΔB1 ⋈_JC B2) ∪ (B1 ⋈_JC ΔB2) ∪ (ΔB1 ⋈_JC ΔB2)

Updates can be modeled as deletions followed by insertions. Deriving such formulas for computing the incremental update operations on a view typically becomes harder, or even impossible, as the expressiveness of the view definition language increases. Thus, maintenance algorithms often first check whether a view can be incrementally maintained, and then proceed to decide how to actually maintain it.

Available Information
When maintaining a materialized view, we do not always have access to both the materialized view and its base relations. For example, when one of the base relations represents a data stream, as in the case of Chronicle Views [7], the entire relation cannot be stored and is thus unavailable during the maintenance process. In general, during the view maintenance process one may, or may not, have access to the materialized view itself, to its base relations, or to information regarding the presence of keys and referential integrity constraints. Depending on the amount of information available, different algorithms can be used for maintaining a view.

Supported Modifications
The view maintenance algorithms can also be classified based on types of modifications to the base relations that they can handle. Not all algorithms handle both insertions and deletions to base relations. Some algorithms handle sets of modifications (i.e., insertions and deletions) in a single pass, as in [6], while others require one pass for each different modification set.
Some algorithms handle updates directly, while other algorithms model them as deletions followed by insertions. Finally, not all view maintenance algorithms can handle more complex modifications, such as modifications to the view definition. Even when such modifications are handled, the view maintenance can be achieved by different techniques that either recompute the view, or try to adapt the view, so that the old view can be used to materialize the new view [5].

Algorithm Applicability
View maintenance algorithms should be applicable to all possible data stored in the database tables, and to all instances of a particular modification (i.e., insertion, deletion, update, etc.). Techniques that operate correctly only on specific database instances, or only on specific modification instances, are less desirable. Figure 2, originally presented in [4], summarizes some areas of the problem space for three of the five dimensions described above, namely the expressiveness of the view definition language, the available information, and the supported modifications. The remaining two dimensions have been omitted for ease of presentation.

View Maintenance Aspects. Figure 2. The problem space, as presented in [4].
Key Applications
Data Warehousing
Materialized views are frequently used in data warehouses to provide personalized behavior to users, store frequently queried data derived from multiple relations, and to encapsulate different views of local or even remote data and databases. Materialization allows for significantly better query performance [8], while view maintenance, in the case of Immediate or Deferred Views, allows users to access fresh data.

Data Streams
Banking, billing, networking and stock applications often generate infinite streams of data. View maintenance algorithms can help answer complex queries over such infinite data streams without requiring access to the entire stream.

Caching
Cached data can be refreshed quickly using techniques developed for view maintenance when only a small portion of the cache has become stale. The cached
content may also involve dynamically generated web page fragments. In this case, the notion of WebViews [9,10] can be used to decide which such fragments to materialize, and how to determine a schedule for updating them.

Mobile Applications
If data needs to be transmitted under bandwidth constraints, as in applications with mobile clients holding a cell phone or a GPS device, then only the data altered since the last transmission needs to be transmitted. Similar techniques that transmit only the changes in the computed data can also be used in other bandwidth-constrained applications, such as transmitting smaller amounts of data to/from sensor nodes [3].
Cross-references
▶ Database Tuning and Performance ▶ Deferred View Maintenance ▶ Incremental View Maintenance ▶ Maintenance of Recursive Views ▶ Maintenance of Materialized Views with Outer-Joins ▶ Query Processing and Optimization in Object Relational Databases ▶ View Maintenance ▶ Viewcaches
Recommended Reading
1. Adiba M.E. and Lindsay B.G. Database snapshots. In Proc. 6th Int. Conf. on Very Large Data Bases, 1980, pp. 86–91.
2. Colby L., Kawaguchi A., Lieuwen D., Mumick I.S., and Ross K.A. Supporting multiple view maintenance policies. In Proc. ACM SIGMOD Conf. on Management of Data, 1997, pp. 405–416.
3. Deligiannakis A., Kotidis Y., and Roussopoulos N. Processing approximate aggregate queries in wireless sensor networks. Inf. Syst., 31(8):770–792, December 2006.
4. Gupta A. and Mumick I.S. Maintenance of materialized views: problems, techniques, and applications. IEEE Data Eng. Bull., Special Issue on Materialized Views and Data Warehousing, 18(2):3–19, June 1995.
5. Gupta A., Mumick I.S., Rao J., and Ross K.A. Adapting materialized views after redefinitions: techniques and a performance study. Inf. Syst., Special Issue on Data Warehousing, 16(5):323–362, July 2001.
6. Gupta A., Mumick I.S., and Subrahmanian V.S. Maintaining views incrementally. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1993, pp. 157–166.
7. Jagadish H.V., Mumick I.S., and Silberschatz A. View maintenance issues in the chronicle data model. In Proc. 14th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 1995, pp. 113–124.
8. Kotidis Y. and Roussopoulos N. DynaMat: a dynamic view management system for data warehouses. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 371–382.
9. Labrinidis A. and Roussopoulos N. WebView materialization. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 367–378.
10. Labrinidis A. and Roussopoulos N. Balancing performance and data freshness in Web database servers. In Proc. 29th Int. Conf. on Very Large Data Bases, 2003, pp. 393–404.
11. Roussopoulos N. The incremental access method of ViewCache: concept, algorithms, and cost analysis. ACM Trans. Database Syst., 16(3):535–563, September 1991.
12. Roussopoulos N. and Kang H. Principles and techniques in the design of ADMS±. IEEE Comput., 19(2):19–25, December 1986.
View Expression ▶ View Definition
View Update ▶ View Maintenance
View-based Data Integration
Yannis Katsis, Yannis Papakonstantinou
University of California-San Diego, La Jolla, CA, USA
Definition
Data Integration (or Information Integration) is the problem of finding and combining data from different sources. View-based Data Integration is a framework that solves the data integration problem for structured data by integrating sources into a single unified view. This integration is facilitated by a declarative mapping language that allows the specification of how each source relates to the unified view. Depending on the type of view specification language used, view-based data integration systems (VDISs) are said to follow the Global as View (GAV), Local as View (LAV) or Global and Local as View (GLAV) approach.
Historical Background
Data needed by an application are often provided by a multitude of data sources. The sources often employ heterogeneous data formats (e.g., text files, web pages, XML documents, relational databases), structure the data in different ways and can be accessed through different methods (e.g., web forms, database clients). This makes the task of combining information from multiple sources particularly challenging. To carry it out, one has to retrieve data from each source individually, understand how the data of the sources relate to each other and merge them, while accounting for discrepancies in the structure and the values, as well as for potential inconsistencies. The first to realize this problem were companies willing to integrate their structured data within or across organizations. Soon the idea of integrating the data into a single unified view emerged. These systems,
to be referred to as view-based integration systems (VDISs), would provide a single point of access to all underlying data sources. Users of a VDIS (or applications) would query the unified view and get back integrated results from all sources, whereas the task of combining data from the sources and resolving inconsistencies would be handled by the system transparently to the applications. The VDISs made their first appearance in the form of multidatabases and federated systems [11]. Subsequently, the research community dived into the problem of specifying the correspondence between the sources and the unified view. The outcome was three categories of languages to express the correspondence (GAV, LAV and GLAV), together with several related theoretical and system results. The industry also embraced the view-based integration framework by creating many successful VDISs (e.g., BEA AquaLogic, IBM WebSphere).
Foundations Abstracting out the differences between individual systems, a typical VDIS conforms to the architecture shown in Fig. 1. Sources store the data in a variety of formats (relational databases, text files, etc.). Wrappers solve the heterogeneity in the formats by transforming each source’s data model into a common data model used by the integration system. The wrapped data sources are usually referred to as local or source databases, the structure of which is described by corresponding local/ source schemas. This is in contrast to the unified view exported by the mediator, also called global/target database. Finally, mappings expressed in a certain mapping language (depicted as lines between the wrapped sources and the mediator) specify the relationship between the wrapped data sources (i.e., the local schemas) and the unified view exported by the mediator (i.e., the global schema). Given a VDIS, applications or users retrieve data from the sources indirectly by querying the global schema. It is the task of the mediator to consult the mappings to decide which data to retrieve from the sources, and how to combine them appropriately in order to form the answer to the user’s query. VDISs can be categorized according to the following three main axes: 1. Common data model and query language. The data model and query language that is exposed by the
View-based Data Integration. Figure 1. View-based Data Integration System (VDIS) architecture.
wrappers to the mediator and by the mediator to the applications. Commonly used data models include the relational, XML and object-oriented data model. 2. Mapping language. The language for specifying the relationship of sources with the view. Languages proposed in the literature fall into three categories; Global as View (GAV), Local as View (LAV) and Global and Local as View (GLAV). Being one of the most important components in a VDIS, these are discussed in detail below. 3. Data storage method. The decision on the place where the data are actually stored. The two extremes are the materialized and the virtual approach (see [14] for a comparison). In the materialized integration (also known as eager, in-advance or warehousing approach), all source data are replicated on the mediator. On the other hand, in the virtual mediation (e.g., Infomaster [6]) (or lazy approach), the data are kept in the sources and the global database is virtual. Consequently,
a query against the global database cannot be answered directly, but has to be translated to queries against the actual sources. Finally, some systems employ hybrid policies, such as virtualization accompanied by a cache. Specifying the Relationship of the Sources with the Unified View
To allow the mediator to decide which data to retrieve from each source and how to combine them into the unified view, the administrator of the VDIS has to specify the correspondence between the local schema of each source and the global schema through mappings. The mappings are expressed in a language, corresponding to some class of logic formulas. Languages proposed in the literature fall into three categories: Global as View (GAV), Local as View (LAV) and Global and Local as View (GLAV). In GAV the global database (schema) is expressed as a function of the local databases (schemas). LAV on the other hand follows the
opposite direction, with each local schema being described as a function over the global schema. Therefore, LAV allows to add a source to the system independently of other sources. Finally, GLAV is a generalization of the two. This section presents each of these approaches in detail and explains their implications on the query answering algorithms. Essentially, each of them represents a different trade-off between expressivity and hardness in query answering. Running example. For ease of exposition, the following discussion employs a running example of integrating information about books. The example employs the relational data model for both the sources and the global database. Moreover, the query language used by the users to extract information from the global database, and in turn by the mediator to retrieve information from the sources, is the language of conjunctive queries with equalities (CQ=); a subset of SQL widely adopted in database research. Figure 2 shows the employed local and global schemas. Relations are depicted in italics and their attributes appear nested in them. For instance, the global schema G in Fig. 2b consists of two relations Book and Book_Price. Relation Book has attributes ISBN, title, suggested retail price, author and publisher, while Book_Price stores the book price and stock information for different sellers. Data is fueled into the system by two sources; the databases of the bookstore Barnes & Noble (B&N) and the publisher Prentice Hall
(PH), with the schemas shown in Fig. 2a. Note that there are two versions of the PH schema, used in different examples. Global as View (GAV)
Historically, the first VDISs followed the Global as View (GAV) approach [7,12], in which the global schema is described in terms of the local schemas. In such systems the contents of each relation R in the target schema G are specified through a query (view) V over the combined schemas of the sources. Thus, in GAV the correspondence between the local schemas and the global schema can be described through a set of mappings of the form:
Vi → I(Ri)
one for each relation Ri of the global schema, where Vi is a query over the combined source schemas. I(Ri) is the identity query over Ri (i.e., a query that returns all attributes of Ri). The symbol → can represent either query containment (⊆) or query equality (=). This leads to two different semantics, referred to in the literature as the open-world and the closed-world assumption, respectively. Example: Consider the following two GAV mappings (For the examples, the identity query I over some relation is considered to return the attributes of that relation in the same order as they appear on the schema in Fig. 2).
View-based Data Integration. Figure 2. Local and global schemas of the running example.
View-based Data Integration
M1: V1 → I(Book)
M2: V2 → I(Book_Price)
where
V1(ISBN, title, sug_retail, authorName, "PH") :- PHBook(ISBN, title, authorID, sug_retail, format), PHAuthor(authorID, authorName)
and
V2(ISBN, "B&N", sug_retail, instock) :- PHBook(ISBN, title1, authorID, sug_retail, format), BNNewDeliveries(ISBN, title2, instock)
The mappings are graphically depicted in Fig. 3a and b, as described in [9]. This is similar to the way most mapping tools (i.e., tools that allow a visual specification of mappings), such as IBM Clio and MS BizTalk Mapper, display mappings. Mapping M1 intuitively describes how Book tuples in the global database are created. This is done by retrieving the ISBN, title and sug_retail from a PHBook tuple, the author from the corresponding
PHAuthor tuple (i.e., the PHAuthor tuple with the same authorID as the PHBook tuple), and finally setting the publisher to "PH" (since the extracted books are published by PH). Similarly, mapping M2 describes the construction of the global relation Book_Price. This involves combining information from multiple sources: the price from the suggested retail price information provided by PH and the inventory information from B&N, because B&N's administrator knows that B&N sells its books at the suggested retail price. Query Answering in GAV. GAV mappings have a procedural flavor, since they describe how the global database can be constructed from the local databases. For this reason, query answering in GAV is straightforward, both in the materialized and in the virtual approach. In the materialized approach, the source data are replicated in the global database by executing for each mapping Vi → I(Ri) the query Vi against the local databases and populating Ri with the query results.
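To make the materialized GAV strategy concrete, the following sketch evaluates the mapping query V1 over small in-memory source tables and copies the result into the global Book relation. The table contents and helper names below are invented for illustration and are not part of the running example's data.

```python
# Illustrative GAV materialization: evaluate mapping query V1 over the
# source relations and populate the global relation Book with its result.
# Source data below is invented for illustration.

ph_book = [  # PHBook(ISBN, title, authorID, sug_retail, format)
    ("111", "Databases", "a1", 50.0, "hardcover"),
    ("222", "Integration", "a2", 40.0, "paperback"),
]
ph_author = [  # PHAuthor(authorID, authorName)
    ("a1", "Smith"),
    ("a2", "Jones"),
]

def v1():
    """Mapping query V1: join PHBook with PHAuthor and fix the publisher to 'PH'."""
    result = []
    for isbn, title, author_id, sug_retail, _format in ph_book:
        for aid, author_name in ph_author:
            if aid == author_id:
                result.append((isbn, title, sug_retail, author_name, "PH"))
    return result

# Materialized approach: Book := V1 (the query result is copied into the global relation).
book = list(v1())
print(book)
```

In the virtual approach the same mapping would instead be used for view unfolding: a query over Book would be rewritten to run v1() directly against the sources.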
View-based Data Integration. Figure 3. Example of GAV mappings.
Subsequently, an application query Q against the global schema is answered by simply running Q over the materialized global database. On the other hand, in the virtual approach, data are kept in the sources and thus a query against the global schema has to be translated to corresponding queries against the local schemas. Due to the procedural flavor of GAV, this can be done through view unfolding (i.e., replacing each relational atom of the global schema in the query by the corresponding view definition). Intuitively, whenever a query asks for a global relation Ri, it will instead run the subquery Vi over the local schemas, which, according to the mapping Vi → I(Ri), provides the contents of Ri. Advantages. The simplicity of the GAV rules together with the straightforward implementation of query answering led to the wide adoption of GAV by industrial systems. From the research sector, representative GAV-based VDISs are MULTIBASE [11], TSIMMIS [5] and Garlic [2]. Disadvantages. GAV also has several drawbacks: First, since the global schema is expressed in terms of the sources, global relations cannot model any information not present in at least one source. For instance, the Book relation in the example could not contain an attribute for the book weight, since no source currently provides it. In other words, the value of each global attribute has to be explicitly specified (i.e., in the visual representation all global attributes must have an incoming line or an equality with a constant). Second, as observed in mapping M2 of the running example, a mapping has to explicitly specify how data from multiple sources are combined to form global relation tuples. Therefore, GAV-based systems do not facilitate adding a source to the system independently of other sources. Instead, when a new source wants to join the system, the system administrator has to inspect how its data can be merged with those of the other sources currently in the system and modify the corresponding mappings.
Local as View (LAV)
To overcome the shortcomings of GAV, researchers came up with the Local as View (LAV) approach [7,12]. While in GAV the global schema is described in terms of the local schemas, LAV follows the opposite direction expressing each local schema as a function of the global schema. LAV essentially corresponds to the ‘‘source owners view’’ of the system by describing which data of the global database are present in
the source. Using the same notation as in GAV, local-to-global correspondences can be written in LAV as a set of mappings:
I(Ri) → Ui
one for every relation Ri in the local schemas, where Ui is a query over the global schema and I the identity query. Example: Figure 4 shows the following two LAV mappings for the running example:
M′1: I(PHBookcondensed) → U1
M′2: I(BNNewDeliveries) → U2
where
U1(ISBN, title, author, sug_retail) :- Book(ISBN, title, sug_retail, author, "PH")
and
U2(ISBN, title, instock) :- Book(ISBN, title, sug_retail, author, publisher), Book_Price(ISBN, "B&N", sug_retail, instock)
For instance, M′1 specifies that PHBookcondensed contains information about books published by PH. Similarly, M′2 declares that BNNewDeliveries contains the ISBN and title of books sold by B&N at their suggested retail price and whether B&N has them in stock. In contrast to GAV mappings, LAV mappings have a declarative flavor, since, instead of explaining how the global database can be created, they describe what information of the global database is contained in each local database. Advantages. LAV addresses many of GAV's problems, with the most important being that sources can register independently of each other, since a source's mappings do not refer to other sources in the system. Disadvantages. LAV suffers from the symmetric drawbacks of GAV. In particular, it cannot model sources that have information not present in the global schema (this is the reason why the example above used the condensed version of PH's schema that did not contain the attribute format, which is not present in the global schema). Furthermore, due to LAV's declarative nature, query answering is no longer trivial, as described next. Mainly because of its technical implications to query answering, LAV has been extensively studied in the literature. Representative LAV-based systems include Information Manifold [10] and the system described in [13].
Query Answering in LAV. Since LAV mappings consist of an arbitrary query over the global schema, they may leave some information of the global database unspecified. For instance, mapping M′2 above only states that B&N sells its books at the suggested retail price, without specifying the exact price. Thus, there might be infinitely many global databases that could be inferred from the sources through the mappings (each of them would make sure that each pair of Book and Book_Price tuples created from a BNNewDeliveries tuple share the same value for price, but each such global database might choose a different constant for the value). These databases are called possible worlds. Their existence has two important implications: First, since a unique global database does not exist, it cannot be materialized and therefore LAV lends itself better to virtual mediation. However, there is still a way of replicating source information in a centralized place. This involves creating a "special" database that intuitively stores the general shape of all possible worlds. This "special" database is called the canonical universal solution and can be built through procedures employed in data exchange [3]. Second, since many global databases exist, the query answering semantics need to be redefined. The standard semantics adopted in the literature for query
answering in LAV-based systems are based on the notion of certain answers [1,8]. The certain answers to a query are the answers to the query which will always appear regardless of which possible world the query is executed against (i.e., the tuples that appear in the intersection of the sets of query answers against each possible world). Intuitively, certain answers return information that is guaranteed to exist in any possible world. Example: If in the integration system of Fig. 4 a query asks for all ISBNs that are sold by some seller at their suggested retail price, it will get back all ISBNs stored in relation BNNewDeliveries, because for each of them, any possible world will contain a pair of Book and Book_Price tuples with the same ISBN and the same prices (although these prices will have different values between possible worlds). On the other hand, if the query asks for all books sold at a specific price, it will not get back the ISBNs from BNNewDeliveries, because their exact prices are left unspecified by the mapping M′2 and will therefore differ among possible worlds. In order to compute the certain answers to a query in a virtual integration system following the LAV approach, the query against the global schema has to be translated to corresponding queries against the local schemas. This problem is called rewriting queries using
View-based Data Integration. Figure 4. Example of LAV mappings.
views (because the query over the global database has to be answered by using the sources which are expressed as views over it), and is also of interest to other areas of database research, such as query optimization and physical data independence (see [8] for a survey). In contrast to GAV, it is a non-trivial problem studied extensively by researchers.
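The certain-answer semantics described above can be illustrated by brute force on toy data: build a few possible worlds that are consistent with mapping M′2 (the same unknown price appears in Book and Book_Price, but each world picks a different constant) and intersect the query answers. This enumeration is only a didactic sketch with invented data; real systems compute certain answers by query rewriting, not by materializing possible worlds.

```python
# Didactic sketch of certain answers: intersect query results over a few
# sample possible worlds. The data and candidate prices are invented.

def possible_world(price):
    """One possible world for a BNNewDeliveries tuple ('333', 'Algorithms', True):
    the unknown suggested retail price is bound to some constant."""
    book = [("333", "Algorithms", price, "Doe", "PH")]   # Book(ISBN, title, sug_retail, author, publisher)
    book_price = [("333", "B&N", price, True)]           # Book_Price(ISBN, seller, price, instock)
    return book, book_price

def isbns_sold_at_suggested_retail(world):
    book, book_price = world
    return {b[0] for b in book for p in book_price
            if b[0] == p[0] and b[2] == p[2]}

def isbns_sold_at(world, price):
    _, book_price = world
    return {p[0] for p in book_price if p[2] == price}

worlds = [possible_world(p) for p in (10.0, 25.0, 99.0)]  # a sample of infinitely many worlds

# Certain in every world: the ISBN is always sold at its suggested retail price.
print(set.intersection(*[isbns_sold_at_suggested_retail(w) for w in worlds]))  # {'333'}
# Not certain: the exact price differs between worlds, so the intersection is empty.
print(set.intersection(*[isbns_sold_at(w, 25.0) for w in worlds]))             # set()
```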
Global and Local as View (GLAV)
To overcome the limitations of both GAV and LAV, [4] proposed a new category of mapping languages, called Global and Local as View (GLAV), which is a generalization of both GAV and LAV. GLAV mappings are of the form:
Vi → Ui
where Vi, Ui are queries over the local and global schemas, respectively. GLAV languages can trivially express both GAV mappings and LAV mappings by assigning to Ui a query returning a single global relation or to Vi a query asking for a single local relation, respectively. However, GLAV is a strict superset of both GAV and
LAV by allowing the formulation of mappings that do not fall either under GAV or under LAV (i.e., mappings in which Vi and Ui both do not return just a single local or global relation). Example: Figure 5 shows two GLAV mappings. The first mapping is the GAV mapping M1 first presented in Fig. 3a, while the second mapping is the LAV mapping M′2 used in Fig. 4b. Since Ui (a.k.a. the conclusion of the mapping) can be an arbitrary query over the global schema, GLAV allows the independent registration of sources in the same way as LAV. However, this also implies that the global database is incomplete. Consequently, query answering in GLAV is usually done under certain answer semantics, by extending query rewriting algorithms for LAV [15].
Alternatives to VDISs
Apart from the VDISs, research and industrial work led to many alternative approaches to data integration of structured data: 1. Vertical Integration Systems are specialized applications that solve the problem of data
View-based Data Integration. Figure 5. Example of GLAV mappings.
integration for a specific domain. For example, http://www.mySimon.com or http://www.rottentomatoes.com integrate price and movie information, respectively. 2. Extract Transform Load (ETL) Tools generally facilitate the actual migration of data from one system to another. When used for data integration they are closely tied to the problem of materializing the integrated database in a central data warehouse. Compared to these solutions, VDISs offer a more general approach to data integration with the following advantages: (i) The relationships between the sources and the unified view are explicitly stated and not hidden inside a particular implementation, (ii) a general VDIS implementation can be used in many different domains, and (iii) a VDIS deployment can be easily utilized by many applications, as shown in Fig. 1. In a way VDISs are analogous to Database Management Systems, offering a general way of managing the data (which in VDISs are heterogeneous and distributed), independently of the applications that need those data. Finally, another alternative to VDISs is Peer-to-Peer (P2P) Integration Systems that drop the requirement for a single unified view, allowing queries to be posed over any source schema. Although it is an active area of research, P2P systems have not yet been widely adopted in industry.
Key Applications
VDISs are used for integration of structured data in many different settings, including among others enterprises, government agencies and scientific communities.
Cross-references
▶ Information Integration ▶ Peer-to-Peer Data Integration ▶ Query Translation ▶ Query Rewriting Using Views
Recommended Reading
1. Abiteboul S. and Duschka O.M. Complexity of answering queries using materialized views. In Proc. 17th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 1998, pp. 254–263.
2. Carey M.J., Haas L.M., Schwarz P.M., Arya M., Cody W.F., Fagin R., Flickner M., Luniewski A., Niblack W., Petkovic D., Thomas II J., Williams J.H., and Wimmers E.L. Towards heterogeneous multimedia information systems: The Garlic approach. In Proc. 5th Int. Workshop on Research Issues on Data Eng., 1995, pp. 124–131.
3. Fagin R., Kolaitis P.G., Miller R.J., and Popa L. Data exchange: semantics and query answering. In Proc. Int. Conf. on Database Theory, 2002, pp. 207–224.
4. Friedman M., Levy A., and Millstein T. Navigational plans for data integration. In Proc. 16th National Conf. on AI and 11th Innovative Applications of AI Conf., 1999.
5. Garcia-Molina H., Papakonstantinou Y., Quass D., Rajaraman A., Sagiv Y., Ullman J., Vassalos V., and Widom J. The TSIMMIS approach to mediation: data models and languages. J. Intell. Inf. Syst., 8(2):117–132, 1997.
6. Genesereth M.R., Keller A.M., and Duschka O.M. Infomaster: an information integration system. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1997.
7. Halevy A. Logic-based techniques in data integration. In Logic Based Artificial Intelligence, 2000.
8. Halevy A.Y. Answering queries using views: a survey. VLDB J., 10(4):270–294, 2001.
9. Katsis Y., Deutsch A., and Papakonstantinou Y. Interactive source registration in community-oriented information integration. In Proc. 34th Int. Conf. on Very Large Data Bases, 2008.
10. Kirk T., Levy A.Y., Sagiv Y., and Srivastava D. The Information Manifold. In Information Gathering from Heterogeneous, Distributed Environments, 1995.
11. Landers T. and Rosenberg R.L. An overview of MULTIBASE. In Distributed Systems, Vol. II: Distributed Data Base Systems, 1986, pp. 391–421.
12. Lenzerini M. Data integration: a theoretical perspective. In Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2002.
13. Manolescu I., Florescu D., and Kossmann D. Answering XML queries over heterogeneous data sources. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001.
14. Widom J. Research problems in data warehousing. In Proc. 27th Int. Conf. on Very Large Data Bases, 1995.
15. Yu C. and Popa L. Constraint-based XML query rewriting for data integration. In Proc. 27th Int. Conf. on Very Large Data Bases, 2004.
Views YANNIS KOTIDIS Athens University of Economics and Business, Athens, Greece
Definition Views are virtual relations without independent existence in a database. The contents of the instance of a view are determined by the result of a query on a set of database tables or other views in the system.
Key Points Over the years, there have been several definitions and uses for the term view. From one perspective, a view is a query (or a macro) that generates data every time it is invoked. The term view is also used in association with the derived data that is produced by the execution of the view query. Views have also been used as indexes, summary tables or combinations of both. They provide physical data independence, by decoupling the logical schema from the physical design, which is usually driven by performance reasons. By controlling access to the views, one can control access to different parts of the data and implement different security policies. Views are also used in integration systems in order to provide unified access to physically remote databases. When the result of the view is stored as an independent table, the view is called materialized. Materialized views can be of two forms. In the first, the materialized table is treated as pure data, detached from the view definition that was used to derive its content. In the second, the view content is treated as derived data that needs to be properly maintained when update statements alter the state of the database, in a way that affects the instance of the view. This is called the view maintenance problem. In a symmetrical way, when a user or a program modifies a view (materialized or not) through an update statement, an implementation is required for translating the view update command into a series of updates on the base relations so that the requested update is observed in the view instance. This process is often termed update through views.
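As a small illustration of the difference between a virtual view and a materialized copy of its result, the following sketch uses SQLite; the table and column names are invented. The view re-evaluates its defining query on every access, while the materialized table becomes stale after a base-table update and must be maintained explicitly, which is exactly the view maintenance problem described above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders(id INTEGER, category TEXT, price REAL)")
con.executemany("INSERT INTO orders VALUES (?,?,?)",
                [(1, "bolt", 2.0), (2, "nut", 1.0), (3, "bolt", 3.0)])

# Virtual view: just a stored query, re-evaluated on every access.
con.execute("CREATE VIEW bolt_sales AS "
            "SELECT sum(price) AS total FROM orders WHERE category='bolt'")

# 'Materialized' copy: the view result stored as an ordinary table.
con.execute("CREATE TABLE bolt_sales_mat AS "
            "SELECT sum(price) AS total FROM orders WHERE category='bolt'")

con.execute("INSERT INTO orders VALUES (4, 'bolt', 5.0)")   # update the base table

print(con.execute("SELECT total FROM bolt_sales").fetchone())      # (10.0,) reflects the update
print(con.execute("SELECT total FROM bolt_sales_mat").fetchone())  # (5.0,)  stale until maintained
```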
Cross-references
▶ Answering Queries Using Views ▶ Updates Through Views ▶ View Definition ▶ View Maintenance
Recommended Reading
1. Adiba M.E. and Lindsay B.G. Database snapshots. In Proc. 6th Int. Conf. on Very Large Data Bases, 1980, pp. 86–91.
2. Dayal U. and Bernstein P. On the correct translation of update operations on relational views. ACM Trans. Database Syst., 8(3):381–416, 1982.
3. Gupta A., Jagadish H.V., and Singh Mumick I. Data integration using self-maintainable views. In Advances in Database Technology, Proc. 5th Int. Conf. on Extending Database Technology, 1996, pp. 140–144.
4. Gupta H., Harinarayan V., Rajaraman A., and Ullman J.D. Index selection for OLAP. In Proc. 13th Int. Conf. on Data Engineering, 1997, pp. 208–219.
5. Kotidis Y. and Roussopoulos N. DynaMat: a dynamic view management system for data warehouses. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 371–382.
6. Roussopoulos N. View indexing in relational databases. ACM Trans. Database Syst., 7(2):258–290, 1982.
7. Roussopoulos N. An incremental access method for ViewCache: concept, algorithms, and cost analysis. ACM Trans. Database Syst., 16(3):535–563, 1991.
Virtual Disk Manager ▶ Logical Volume Manager (LVM)
Virtual Health Record ▶ Electronic Health Records (EHR)
Virtual Partitioning MARTA MATTOSO Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Synonyms VP
Definition Virtual Partitioning (VP) is a distributed database design technique [2] that avoids physically partitioning tables at design time. VP is adopted in Database Clusters to implement intra-query parallelism [3]. VP is based on database replication and aims at designing a table partition dynamically, according to each query specification. The idea is to take advantage of the current configuration of the parallel execution environment, such as the number of nodes available, the tables of the query, the current load, and the available table replicas. Intra-query parallelism can be obtained through VP by rewriting a query into a set of sub-queries to be sent to different virtual partitions. At each node, the local sequential DBMS processes the sub-query on the specified virtual partition of the table. VP requires an extra query processing phase to compose the final result
from the partial ones produced by the sub-queries. This way, a database that is not physically partitioned can still be processed transparently through intra-query parallelism.
Key Points VP is implemented through query rewriting [1]. Basically, a database cluster (DBC) rewrites the original query by adding range predicates to it, which originates as many sub-queries as the number of table replicas and nodes available. Then, each node receives a different sub-query. For example, let Q be a query on table orders:
Q: select sum (price) from orders where category = 'bolt';
A generic sub-query on a virtual partition is obtained by adding to Q’s where clause the predicate ‘‘and order_id >=v1 and order_id < v2.’’ By binding [v1, v2] to n subsequent ranges of order_id values, n sub-queries are generated, each for a different node on a different virtual partition of orders. These virtual partitions correspond to horizontally partitioned data. After the execution of all sub-queries each partial result would then be inserted into a temporary table, e.g., tempResult. Finally, to compose the result, the partitioned results need to be combined by an aggregate query. In this example: select sum (sprice) from tempResult;
For VP to be effective, the tuples of the virtual partition must be physically clustered according to the added range predicate. Ideally, there should be a clustered ordered index on table orders based on order_id, so that the DBMS will not scan tuples outside the VP.
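The rewriting described above can be sketched in a few lines, assuming the order_id values lie in a known range [lo, hi) and n nodes are available; the SQL strings mirror the example query Q, while the range bounds and node count are illustrative only.

```python
def virtual_partition_queries(lo, hi, n):
    """Rewrite Q into n sub-queries, one per virtual partition of table orders."""
    step = (hi - lo + n - 1) // n
    subqueries = []
    for i in range(n):
        v1, v2 = lo + i * step, min(lo + (i + 1) * step, hi)
        subqueries.append(
            "SELECT sum(price) FROM orders "
            "WHERE category = 'bolt' "
            f"AND order_id >= {v1} AND order_id < {v2};"
        )
    return subqueries

# Each sub-query is sent to a different node; the partial sums are inserted
# into tempResult and combined by the final aggregate query.
for q in virtual_partition_queries(lo=0, hi=1_000_000, n=4):
    print(q)
print("SELECT sum(sprice) FROM tempResult;")
```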
Cross-references
▶ Clustering Index ▶ Data Partitioning ▶ Data Replication ▶ Distributed Database Design ▶ Horizontally Partitioned Data ▶ Intra-query Parallelism
Recommended Reading
1. Lima A.A.B., Mattoso M., and Valduriez P. Adaptive virtual partitioning for OLAP query processing in a database cluster. In Proc. 19th Brazilian Symp. on Database Systems, 2004, pp. 92–105.
2. Özsu M.T. and Valduriez P. Principles of Distributed Database Systems (2nd edn.). Prentice Hall, Upper Saddle River, NJ, 1999.
3. Röhm U., Böhm K., Schek H.-J., and Schuldt H. FAS – A freshness-sensitive coordination middleware for a cluster of OLAP components. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 754–768.
Vision ▶ Visual Perception
Visual Analysis ▶ Visual Analytics ▶ Visual Data Mining
Visual Analytics DANIEL A. KEIM, FLORIAN MANSMANN, ANDREAS STOFFEL, HARTMUT ZIEGLER University of Konstanz, Konstanz, Germany
Synonyms Visual analysis; Visual data analysis; Visual data mining
Definition
V
3342
V
Visual Analytics
gain insight into complex problems. The goal of visual analytics research is thus to turn the information overload into an opportunity: Decision-makers should be enabled to examine this massive, multi-dimensional, multi-source, time-varying, and often conflicting information stream through interactive visual representations to make effective decisions in critical situations.
Historical Background Automatic analysis techniques such as statistics and data mining developed independently from visualization and interaction techniques. However, some key thoughts changed the rather limited scope of the fields into what is today called visual analytics research. One of the most important steps in this direction was the need to move from confirmatory data analysis to exploratory data analysis, which was first stated in the statistics research community by John W. Tukey in his book ‘‘Exploratory data analysis’’ [7]. Later, with the availability of graphical user interfaces with proper interaction devices, a whole research community devoted their efforts to information visualization [1,2,5,8]. At some stage, this community recognized the potential of integrating the user in the KDD process through effective and efficient visualization techniques, interaction capabilities and knowledge transfer leading to visual data exploration or visual data mining [3]. This integration considerably widened the scope of both the information visualization and the data mining fields, resulting in new techniques and plenty of interesting and important research opportunities. The term visual analytics was coined by Jim Thomas in the research and development agenda ‘‘Illuminating the Path’’ [6], which had a strong focus on Homeland Security in the United States. Meanwhile, the term is used in a wider context, describing a new multidisciplinary field that combines various research areas including visualization, human-computer interaction, data analysis, data management, geo-spatial and temporal data processing, and statistics [4].
Foundations Visual analytics evolved from information visualization and automatic data analysis. It combines both former independent fields and strongly encourages human interaction in the analysis process as illustrated in Fig. 1. The focus of this section is to differentiate between visualization and visual analytics and thereby
Visual Analytics. Figure 1. Visual analytics as the interplay between data analysis, visualization, and interaction methods.
motivating its necessity. Thereafter, the visual analytics process is described and technical as well as social challenges of visual analytics are discussed. Visualization is the communication of data through the use of interactive interfaces and has three major goals: (i) presentation to efficiently and effectively communicate the results of an analysis, (ii) confirmatory analysis as a goal-oriented examination of hypotheses, and (iii) exploratory data analysis as an interactive and usually undirected search for structures and trends. Visual analytics is more than only visualization. It can rather be seen as an integral approach combining visualization, human factors, and data analysis. Visualization and visual analytics both integrate methodology from information analytics, geospatial analytics, and scientific analytics. Especially human factors (e.g., interaction, cognition, perception, collaboration, presentation, and dissemination) play a key role in the communication between human and computer, as well as in the decision-making process. In matters of data analysis, visual analytics furthermore profits from methodologies developed in the fields of statistical analytics, data management, knowledge representation, and knowledge discovery. Note that visual analytics is not likely to become a separate field of study, but its influence will spread over the research areas it comprises [9]. Overlooking a large information space is a typical visual analytics problem. In many cases, the information at hand is conflicting and needs to be integrated from heterogeneous data sources. Often the computer system lacks knowledge that is still hidden in the expert’s mind. By applying analytical reasoning, hypotheses about the
Visual Analytics
data can be either affirmed or discarded and eventually lead to a better understanding of the data. Visualization is used to explore the information space when automatic methods fail and to efficiently communicate results. Thereby human background knowledge, intuition, and decision-making either cannot be automated or serve as input for the future development of automated processes. In contrast to this, a well-defined problem where the optimum or a good estimation can be calculated by non-interactive analytical means would rather not be described as a visual analytics problem. In such a scenario, the non-interactive analysis should be clearly preferred due to efficiency reasons. Likewise, visualization problems not involving methods for automatic data analysis do not fall into the field of visual analytics. Visual Analytics Process
The visual analytics process is a combination of automatic and visual analysis methods with a tight coupling through human interaction in order to gain knowledge from data. Figure 2 shows an abstract overview of the different stages (represented through ovals) and their transitions (arrows) in the visual analytics process. In many visual analytics scenarios, heterogeneous data sources need to be integrated before visual or automatic analysis methods can be applied. Therefore, the first step is often to preprocess and transform the data in order to extract meaningful units of data for further processing. Typical preprocessing tasks are data
V
3343
cleaning, normalization, grouping, or integration of heterogeneous data into a common schema. Continuing with this meaningful data, the analyst can select between visual or automatic analysis methods. After mapping the data the analyst may obtain the desired knowledge directly, but more likely is the case that an initial visualization is not sufficient for the analysis. User interaction with the visualization is needed to reveal insightful information, for instance by zooming in different data areas or by considering different visual views on the data. In contrast to traditional information visualization, findings from the visualization can be reused to build a model for automatic analysis. As a matter of course these models can also be built from the original data using data mining methods. Once a model is created the analyst has the ability to interact with the automatic methods by modifying parameters or selecting other types of analysis algorithms. Model visualization can then be used to verify the findings of these models. Alternating between visual and automatic methods is characteristic for the visual analytics process and leads to a continuous refinement and verification of preliminary results. Misleading results in an intermediate step can thus be discovered at an early stage, which leads to more confidence in the final results. In the visual analytics process, knowledge can be gained from visualization, automatic analysis, as well as the preceding interactions between visualizations, models, and the human analysts. The feedback loop
V
3344
V
Visual Analytics
stores this knowledge of insightful analyses in the system, and contributes to enable the analyst to draw faster and better conclusions in the future. Technical and Social Challenges
While visual analytics profits from the increasing computational power of computer systems, faster networks, high-resolution displays, as well as novel interaction devices, it must be kept in mind that new technologies are always accompanied with a variety of technical challenges that have to be solved. Dynamic processes in scientific or business applications often generate large streams of real-time data, such as sensor logs, web statistics, network traffic logs, or atmospheric and meteorological data. The analysis of such large data streams which can consist of terabytes or petabytes of data is one of the technical challenges since advances in many areas of science and technology are dependent upon the capability to analyze these data streams. As the sheer amount of data does often not allow to store all data at full detail, effective compression and feature extraction methods are needed to manage the data. Visual analytics aims at providing techniques that make humans capable of analyzing real time data streams by presenting results in a meaningful and intuitive way while allowing interaction with the data. These techniques enable quick identification of important information and timely reaction on critical process states or alarming incidents. Synthesis of heterogeneous data sources is another challenge that is closely related to data streams, because real-world applications often access information from a large number of different information sources including collections of vector data, strings and text documents, graphs or sets of objects. Integrating these data sources includes many fundamental problems in statistics, machine learning, decision theory, and information theory. Therefore, the focus on scalable and robust methods for fusing complex and heterogeneous data sources is key to a more effective analysis process. One step further in the analysis process, interpretability and trustworthiness or the ability to recognize and understand the data can be seen as one of the biggest challenges in visual analytics. Generating a visually correct output from raw data and drawing the right conclusions largely depends upon the quality of the used data and methods. A lot of possible quality problems (e.g., data capture errors, noise, outliers, low precision, missing values, coverage errors, double
counts) can already be contained in the raw data. Furthermore, pre-processing of data in order to use it for visual analysis bears many potential quality problems (i.e., data migration and parsing, data cleaning, data reduction, data enrichment, up-/down-sampling, rounding and weighting, aggregation and combination). The concrete challenges are on the one hand to determine and to minimize these errors on the pre-processing side, and a flexible yet stable design of the visual analytics applications to cope with data quality problems on the other hand. From a technical point of view, the design of such applications should either be insensitive to data quality issues through usage of data cleaning methods or explicitly visualize errors and uncertainty to raise awareness for data quality issues. In many scenarios, interpreting the raw data only makes little or no sense at all if it cannot be embedded in context. Research on semantics may derive this context from meta data by capturing associations and complex relationships. Ontology-driven techniques and systems have already started to enable new semantic applications in a wide span of fields such as bioinformatics, web services, financial services, business intelligence, and national security. Research challenges thereby arise from the size of ontologies, content diversity, heterogeneity as well as from computation of complex queries and link analysis over ontology instances and meta data. Scalability in general is a key challenge of visual analytics, as it determines the ability to process large datasets by means of computational overhead as well as appropriate rendering techniques. Often, the huge amount of data that has to be visualized exceeds the limited amount of pixels of a display by several orders of magnitude. In order to cope with such a problem not only the absolute data growth and hardware performance have to be compared, but also the software and the algorithms to bring this data in an appropriate way onto the screen. As the amount of data is continuously growing and the amount of pixels on the display remains rather constant, the rate of compression on the display is steadily increasing. Therefore, more and more details get lost. It is an essential task of visual analytics to create a higher-level view onto the dataset, while maximizing the amount of details at the same time. The field of problem solving, decision science, and human information discourse constitutes a further visual analytics challenge since it not only requires
Visual Analytics
understanding of technology, but also comprehension of typical human capabilities such as logic, reasoning, and common sense. Many psychological studies about the process of problem solving have been conducted. In a usual test setup the subjects have to solve a welldefined problem where the optimal solution is known to the researchers. However, real-world problems are manifold. In many cases these problems are intransparent, consist of conflicting goals, and are complex in terms of large numbers of items, interrelations, and decisions involved. The dynamics of information that changes over time should not be underestimated since it might have a strong impact on the right decision. Furthermore, social aspects such as decision making in groups make the process even more delicate. While many novel visualization techniques have been proposed, their wide-spread usage has not taken place primarily due to the users’ refusal to change their daily working routines. User acceptance is therefore a further visual analytics challenge, since the advantages of developed tools need to be properly communicated to the audience of future users to overcome usage barriers, and to tap the full potential of the visual analytics approach. Visual analytics tools and techniques should not stand alone, but should integrate seamlessly into the applications of diverse domains, and allow interaction with existing systems. Although many visual analytics tools are very specific (i.e., in astronomy or nuclear science) and therefore rather unique, in many domains (e.g., business or network security applications) integration into existing systems would significantly promote their usage by a wider community. Finally, evaluation as a systematic analysis of usability, worth, and significance of a system is crucial to the success of visual analytics science and technology. During the evaluation of a system, different aspects can be considered such as functional testing, performance benchmarks, measurement of the effectiveness of the display in user studies, assessment of its impact on decision-making, or economic success to name just a few. Development of abstract design guidelines for visual analytics applications would constitute a great contribution.
Key Applications Visual analytics is essential in application areas where large information spaces have to be processed and analyzed. Major application fields are physics and
V
astronomy. Especially the field of astrophysics offers many opportunities for visual analytics techniques: Massive volumes of unstructured data, originating from different directions of space and covering the whole frequency spectrum, form continuous streams of terabytes of data that can be recorded and analyzed. With common data analysis techniques, astronomers can separate relevant data from noise, analyze similarities or complex patterns, and gain useful knowledge about the universe, but the visual analytics approach can significantly support the process of identifying unexpected phenomena inside the massive and dynamic data streams that would otherwise not be found by standard algorithmic means. Monitoring climate and weather is also a domain which involves huge amounts of data, collected by sensors throughout the world and from satellites in short time intervals. A visual approach can help to interpret these massive amounts of data and to gain insight into the dependencies of climate factors and climate change scenarios, which would otherwise not be easily identified. Besides weather forecasts, existing applications visualize the global warming, melting of the poles, the stratospheric ozone depletion, as well as hurricane and tsunami warnings. In the domain of emergency management, visual analytics can help determine the on-going progress of an emergency and identify the next countermeasures (e.g., construction of physical countermeasures or evacuation of the population) that must be taken to limit the damage. Such scenarios can include natural or meteorological catastrophes like flood or waves, volcanoes, storm, fire or epidemic growth of diseases (e.g., bird flu), but also human-made technological catastrophes like industrial accidents, transport accidents or pollution. Visual analytics for security and geographics is an important research topic. The application field in this sector is wide, ranging from terrorism informatics, border protection, path detection to network security. Visual analytics supports investigation and detection of similarities and anomalies in large data sets, like flight customer data, GPS tracking or IP traffic data. In biology and medicine, computer tomography, and ultrasound imaging for three-dimensional digital reconstruction and visualization produce gigabytes of medical data and have been widely used for years. The application area of bio-informatics uses visual analytics techniques to analyze large amounts of biological data.
3345
V
3346
V
Visual Association Rules
From the early beginning of sequencing, scientists in these areas face unprecedented volumes of data, like in the Human Genome Project with three billion base pairs per human. Other new areas like Proteomics (studies of the proteins in a cell), Metabolomics (systematic study of unique chemical fingerprints that specific cellular processes leave behind) or combinatorial chemistry with tens of millions of compounds even enlarge the amount of data every day. A brute-force computation of all possible combinations is often not possible, but interactive visual approaches can help to identify the main regions of interest and exclude areas that are not promising. Another major application domain for visual analytics is business intelligence. The financial market, with its hundreds of thousands of assets, generates large amounts of data every day, which accumulate to extremely high data volumes throughout the years. The main challenge in this area is to analyze the data under multiple perspectives and assumptions to understand historical and current situations, and then monitoring the market to forecast trends or to identify recurring situations. Other key applications in this area are fraud detection, detection of money laundering, or the analysis of customer data, insurance data, social data, and health care services.
Cross-references
▶ Cluster Visualization ▶ Comparative Visualization ▶ Data Mining ▶ Data Visualization ▶ Human-Computer Interaction ▶ Multidimensional Visualization Methods ▶ Multivariate Visualization Methods ▶ Parallel Visualization ▶ Scientific Visualization ▶ Visual Classification ▶ Visual Clustering ▶ Visual Content Analysis ▶ Visual Data Mining ▶ Visual Metaphor ▶ Visual On-Line Analytical Processing (OLAP) ▶ Visualization for Information Retrieval
Recommended Reading 1.
Card S.W., Mackinlay J.D., and Shneiderman B. (eds.) Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann, San Francisco, CA, USA, 1999.
2. 3. 4. 5. 6.
7. 8. 9.
Chen C. Information Visualization – Beyond the Horizon. Springer, Berlin, 2nd edn., 2004. Keim D.A. Visual exploration of large data sets. Commun. ACM, 44(8):38–44, 2001. Keim D.A. and Thomas J. Scope and challenges of visual analytics. Tutorial at IEEE Visualization Conf., 2007. Spence R. Information Visualization – Design for Interaction. Pearson Education Limited, Harlow, England, 2nd edn., 2006. Thomas J. and Cook K. Illuminating the path: Research and development agenda for visual analytics. IEEE-Press, Los Alamitos, CA, USA, 2005. Tukey J.W. Exploratory data analysis. Addison-Wesley, Reading, MA, USA, 1977. Ware C. Information Visualization – Perception for Design. Morgan Kaufmann, San Francisco, CA, USA, 1st edn., 2000. Wong P.C. and Thomas J. Visual Analytics – Guest Editors’ Introduction. IEEE Trans. Comput. Graph. Appl., 24(5):20–21, 2004.
Visual Association Rules L I YANG Western Michigan University, Kalamazoo, MI, USA
Synonyms Association rule visualization
Definition Association rule mining finds frequent associations between sets of data items from a large number of transactions. In market basket analysis, a typical association rule reads: 80% of transactions that buy diapers and milk also buy beer. The rule is supported by 10% of all transactions. In this example, the 10% and 80% are called support and confidence, respectively. Depending on the user-specified minimum support and minimum confidence, association rule mining often produces too many rules for humans to read over. The answer to this problem is to select the most interesting rules. As interestingness is a subjective measure, selecting the most interesting rules is inherently human being’s work. It is expected that information visualization may play an important role in managing a large number of association rules, and in identifying the most interesting ones.
Historical Background An association rule reflects a many-to-many relationship. In lacking of an effective visual metaphor
Visual Association Rules
to display many-to-many relationships, information visualization faces a technical challenge in displaying and interacting with many association rules. Most existing approaches have been designed to visualize association rules in restricted forms. The restriction can be on the number of items on each side of the rule, the type of the rule, the total number of items, or the total number of rules. These existing approaches can be classified into two categories: matrix-based approaches and graph-based approaches. A rectangular matrix can be used to visualize one-to-one association rules, where the left-hand side (LHS) items of all rules are listed along one axis and the right-hand side (RHS) of all rules are listed along the other
axis of the rectangular matrix. A rule could be shown as an icon in the corresponding cell. Figure 1a shows an example 3D visualization, where the height and color of each bar represent support and confidence values of the corresponding rule, respectively. The major restriction of this approach is that it allows only one item on each side of the rule. Wong et al. [9] gave an alternative approach of arranging axes, where one axis is used to list items and the other is used to list rules. A rule is then visualized by a 3D bar chart against all items in the rule. This approach is able to visualize many-to-many rules, as long as the number of rules is kept reasonably small. Figure 1b illustrates the approach where the color of each bar indicates whether the
Visual Association Rules. Figure 1. Two major matrix-based approaches: (a) LHS versus RHS and (b) Rule versus Item.
item is on the LHS or RHS of the rule. Support and confidence values of the rules are visualized alongside the bar charts. Another matrix-based approach is to use mosaic plots and double deck plots [4] to visualize the contingency table of a frequent itemset that gives rise to a many-to-one association rule. However, only many-to-one rules derived from a single frequent itemset can be visualized each time. Fukuda et al. [2] gave a matrix-based approach of visualizing numeric association rules, which contain two numeric attributes on the LHS and a Boolean attribute on the RHS. The approach displays the set of rules as a bitmap image. Directed graph is another technique to depict associations among items. There are two alternatives to assign graph nodes: 1. Each node represents an itemset. An association rule is visualized as an edge from the node representing its LHS itemset to the node representing its RHS itemset. Such an approach has been used in IBM DB2 Intelligent Miner. A major problem is that there may easily be too many nodes to display. 2. Each node represents an item. An association rule is visualized as a bunch of edges from LHS items to an intermediate node and another bunch of edges from the intermediate node to RHS items. Such an approach may work well only when a few items and rules are involved. The graph can quickly turn into a tangled mess with as few as a dozen rules. Klemettinen et al. [5] introduced a rule visualizer that uses this approach. Figure 2 illustrates the two alternatives of visualizing association rules as directed graphs.
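The second graph-based alternative can be sketched with the networkx library (assumed to be available): each rule gets an intermediate node, with edges from its LHS items and to its RHS items. The rules and their measures below are invented; even a handful of such rules is enough to reproduce the tangling problem mentioned above.

```python
import networkx as nx

# Rules as (LHS items, RHS items, support, confidence); the values are invented.
rules = [
    ({"diapers", "milk"}, {"beer"}, 0.10, 0.80),
    ({"bread"}, {"butter", "jam"}, 0.05, 0.60),
]

g = nx.DiGraph()
for idx, (lhs, rhs, sup, conf) in enumerate(rules):
    rule_node = f"R{idx}"                      # intermediate node representing the rule
    g.add_node(rule_node, support=sup, confidence=conf)
    for item in lhs:
        g.add_edge(item, rule_node)            # LHS item -> rule
    for item in rhs:
        g.add_edge(rule_node, item)            # rule -> RHS item

print(g.number_of_nodes(), g.number_of_edges())
# nx.draw(g, with_labels=True)  # rendering is left to the reader
```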
Visualization of association rules can be found in commercial data mining packages. SAS Enterprise Miner and SGI MineSet use matrix-based approaches and allow the user to visualize one-to-one association rules. IBM DB2 Intelligent Miner has an Associations Visualizer that uses the directed graph approach where each graph node displays an itemset.
Foundations Association rule mining [1,3] is one of the most researched areas in data mining. Let I = {i1, i2,...,ik} be a set of items. A subset A ⊆ I of items is called an itemset. Input data to association rule mining are n subsets {T1, T2,...,Tn}, called transactions, of I. Transaction Ti supports an itemset A if A ⊆ Ti. The percentage P(A) of transactions that support A is called the support of A. A frequent itemset is an itemset whose support is no less than a minimum value specified by the user. An association rule is an expression A → B where A and B are itemsets and A ∩ B = ∅. P(A ∪ B) is called the support of the rule. P(A ∪ B) / P(A) is called the confidence of the rule. The problem of association rule mining is to find all association rules whose supports are no less than a minimum support value and whose confidences are no less than a minimum confidence value. Frequent itemsets hold a well-known Apriori property: subsets of a frequent itemset are frequent. Efficient association rule mining algorithms [1] have been developed using the Apriori property for early pruning of the search space. The problem of too many discovered rules has also been studied. Klemettinen et al. [5] studied association rules at the border of frequent itemsets and used pattern templates to specify what the user wants to see. Testing criteria such as maximum entropy and
Visual Association Rules. Figure 2. Two major graph-based approaches: (a) Itemset as node and (b) Item as node.
chi-square (χ2) have been suggested to replace the confidence test for the purpose of finding the most interesting rules. Liu et al. [7] have given a way of pruning and visualizing association rules with the presence of item taxonomy, by allowing a user to specify knowledge in terms of rules. Unexpected rules (that do not conform to the user's knowledge) are selected and visualized in a structured way. Association rules are difficult to visualize for several reasons. First, an association rule reflects a many-to-many relationship and there is no effective visual metaphor to display many many-to-one, not to say many-to-many, relationships. Second, frequent itemsets and association rules have inherent closure properties. For example, any subset of a frequent itemset is also frequent; if a → bc is a valid association rule, then a → b, a → c, ab → c and ac → b are all valid association rules. Third, the challenge is to visualize many such itemsets or rules. With the absence of an effective visual metaphor to present many-to-many relationships and to accommodate the closure properties, frequent itemsets and association rules pose fundamental challenges to information visualization. Taking a function-theoretic view, the Apriori property basically says that frequent itemsets define an anti-monotone Boolean function f(i1, i2,...,ik) where i1, i2,...,ik are Boolean variables (items). As the function is anti-monotone, there is a border which separates 1's from 0's. Visualization of monotone Boolean functions has been studied by Kovalerchuk and Delizy in [6], although association rules are not their concern. Boolean vectors are displayed in multiple disk form where Boolean vectors of the same norm (equivalently, itemsets with the same number of items) are placed on the same disk. Figure 3 shows visualizations of an anti-monotone Boolean function f(i1, i2,...,i10) = ¬i1. Such a function is equivalent to frequent itemsets on {i1, i2,...,i10} where {i2,...,i10} is a frequent itemset and {i1} is infrequent. In Fig. 3a, Boolean vectors are arranged on each disk in their numeric order. The visualization does not permit any real understanding of the border of frequent itemsets. Fig. 3b rearranges Boolean vectors on each disk so that all vectors on a Hansel chain are aligned vertically. A chain is a sequence of Boolean vectors such that each vector produces the next vector by changing a "0" element to "1," and Hansel chains provide a way to non-repeatedly visit all vectors. As the function is monotone on each chain, Fig. 3b does show a border, although the border is highly zigzagged. Figure 3c rearranges Hansel chains according to the position of the first "1" value within each chain and shows the border in a clearer way. Frequent itemsets and association rules are often presented in a set-theoretic view. Given a set I = {i1, i2,...,ik} of items, its power set P(I) forms a lattice where the subset relationship specifies the partial order. For an example set of items, I = {a, b, c, d}, this lattice can be illustrated in Fig. 4a. Frequent
Visual Association Rules. Figure 3. Visualizing the anti-monotone Boolean function f(i1, i2,...,i10) = ¬i1. (Courtesy of [6].)
Visual Association Rules. Figure 4. Itemset lattice, item taxonomy tree, and generalized itemset lattice on I = {a, b, c, d}: (a) Itemset lattice, (b) An item taxonomy tree, and (c) Generalized itemset lattice. (Courtesy of [10].)
Frequent itemsets define a border on the lattice: if an itemset is frequent, so is every subset of it; if an itemset is infrequent, so is every superset of it. The objective of frequent itemset visualization is to visualize this border. In fact, this has been attempted in the three visualizations in Fig. 3, which visualize the whole lattice structure. Each disk in Fig. 3 displays (k choose l) itemsets, where l is the disk level. Clearly, such an approach fails when k is large. In order to deal with the problem of the long border of frequent itemsets, Yang developed an approach [10] for visualizing generalized frequent itemsets and association rules by introducing user interaction with the border. Generalized association rules come from
the introduction of item taxonomy. As shown in Fig. 4b, an item taxonomy is a directed tree whose leaf nodes are items and whose non-leaf nodes are item categories. Mining generalized association rules across item taxonomy was studied in [8]. Item taxonomy introduces another dimension of closure property. For example, an ancestor itemset of a frequent itemset is also frequent. The presence of item taxonomy extends an itemset lattice to a generalized itemset lattice. The user can choose which items or item categories to display in an item taxonomy tree. The partially displayed item taxonomy tree induces a border of displayable itemsets in the generalized itemset lattice. For example, if the
Visual Association Rules. Figure 5. Visualizing association rules with item taxonomy. (Courtesy of [10].)
items c, d, e, f in Fig. 4b are displayed, this induces the border of displayable itemsets shown in Fig. 4c. Only non-redundant displayable frequent itemsets, for example, ec and ed in Fig. 4c, are visualized. In association rule visualization, only non-redundant rules derived from displayable frequent itemsets are visualized. The non-redundant displayable frequent itemsets and association rules are visualized in 3D through parallel coordinates, where each parallel coordinate is replaced by a standing visualization of the item taxonomy tree. An itemset or rule is visualized as a Bézier curve connecting all the items in it. Parameters such as support and confidence can be mapped to graphical features such as the width or color of the curve. Figure 5 gives a screen snapshot of the interactive visualization of many-to-many association rules, where association rules are aligned according to where the RHSs separate from the LHSs. In Fig. 5, the left two coordinates represent the LHSs of the rules and the right three coordinates represent the RHSs of the rules. The displayed item taxonomy tree can be expanded or shrunk by user interaction, and each change triggers a new set of itemsets or rules to be displayed.
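The following sketch conveys the flavor of drawing rules as curves across parallel coordinates, with line width encoding support. The items and rules are invented, plain axes stand in for the item taxonomy trees of [10], and the quadratic Bézier segments are only a rough simplification of the curves shown in Fig. 5.

```python
# Rough sketch of drawing association rules as curves across parallel
# coordinates (a simplification of the visualization in [10]; items and
# rules are invented, and plain axes replace the item-taxonomy axes).
import numpy as np
import matplotlib.pyplot as plt

items = ["bread", "butter", "milk", "cereal", "jam"]   # y-positions on every axis
ypos = {it: i for i, it in enumerate(items)}
rules = [  # (ordered LHS items, ordered RHS items, support)
    (["bread", "butter"], ["milk"], 0.12),
    (["milk"], ["cereal", "jam"], 0.07),
]
n_lhs = max(len(r[0]) for r in rules)   # LHS axes on the left, RHS axes on the right

def bezier(p0, p1, p2, n=30):
    """Quadratic Bezier curve through control points p0, p1, p2."""
    t = np.linspace(0, 1, n)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

fig, ax = plt.subplots()
for lhs, rhs, sup in rules:
    coords = [(x, ypos[it]) for x, it in enumerate(lhs)]
    coords += [(n_lhs + x, ypos[it]) for x, it in enumerate(rhs)]
    pts = np.array(coords, float)
    for a, b in zip(pts[:-1], pts[1:]):              # one curve segment per axis pair
        mid = (a + b) / 2 + np.array([0.0, 0.4])     # lift the control point slightly
        seg = bezier(a, mid, b)
        ax.plot(seg[:, 0], seg[:, 1], linewidth=1 + 40 * sup)  # width encodes support

ax.set_xticks(range(n_lhs + max(len(r[1]) for r in rules)))
ax.set_yticks(range(len(items)))
ax.set_yticklabels(items)
plt.show()
```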
Key Applications As a standard feature in many data mining software packages, association rule visualization has been used widely in business intelligence and scientific research.
Future Directions Although approaches have been developed to visualize association rules, none of them has gained predominant acceptance, and none can simultaneously manage a large number of rules with multiple items on both sides.
Visual association rule research has many open problems. An obvious one is how to prune uninteresting itemsets and rules, a topic that has bothered researchers for years. Specific to visualization, open problems include how to visualize a large number of many-to-many rules and how to associate closure properties of itemsets or rules with intuitive visual presentations. Data visualization on the lattice structure has potential applications beyond visual association rules. For example, it can be used in the visualization of iceberg data cubes in data warehousing.
Cross-references
▶ Association Rules ▶ Data Visualization ▶ Visual Analytics ▶ Visual Data Mining
Recommended Reading
1. Agrawal R. and Srikant R. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. on Very Large Data Bases, 1994, pp. 207–216.
2. Fukuda T., Morimoto Y., Morishita S., and Tokuyama T. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996, pp. 13–23.
3. Han J. and Kamber M. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, USA, 2nd edn., 2005.
4. Hofmann H., Siebes A., and Wilhelm A. Visualizing association rules with interactive mosaic plots. In Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2000, pp. 227–235.
5. Klemettinen M., Mannila H., Ronkainen P., Toivonen H., and Verkamo I. Finding interesting rules from large sets of discovered association rules. In Proc. Int. Conf. on Information and Knowledge Management, 1994, pp. 401–407.
6. Kovalerchuk B. and Delizy F. Visual data mining using monotone Boolean functions. In Visual and Spatial Analysis: Advances in Data Mining, Reasoning, and Problem Solving, B. Kovalerchuk, J. Schwing (eds.). Springer, Berlin, 2004, pp. 387–406.
7. Liu B., Hsu W., Wang K., and Chen S. Visually aided exploration of interesting association rules. In Advances in Knowledge Discovery and Data Mining, 3rd Pacific-Asia Conf., 1999, pp. 380–389.
8. Srikant R. and Agrawal R. Mining generalized association rules. In Proc. 21st Int. Conf. on Very Large Data Bases, 1995, pp. 407–419.
9. Wong P.C., Whitney P., and Thomas J. Visualizing association rules for text mining. In Proc. IEEE Symp. on Information Visualization, 1999, pp. 120–123.
10. Yang L. Pruning and visualizing generalized association rules in parallel coordinates. IEEE Trans. Knowl. and Data Eng., 17(1):60–70, January 2005.
Visual Classification
Mihael Ankerst
Allianz, Munich, Germany
Synonyms Cooperative classification
Definition Decision trees have been used successfully for the task of classification. However, state-of-the-art algorithms do not incorporate the user in the tree construction process. Through the involvement of the user in the process of classification, he/she can provide domain knowledge to focus the search of the algorithm and gain a deeper understanding of the resulting decision tree. In a cooperative approach, both the user and the computer contribute what they do best: the user specifies the task, focuses the search using his/her domain knowledge and evaluates the (intermediate) results of the algorithm. The computer, on the other hand, automatically creates patterns satisfying the specified user constraints. The cooperative approach is based on a novel visualization technique for multi-dimensional data representing their impurity with respect to their class labels.
Historical Background The idea of visual classification has been built upon recent progress in the area of information visualization and decision trees.
Many different algorithms for learning decision trees have been developed over the last 20 years. For instance, CART [1] was one of the earliest systems which, in particular, incorporates an effective pruning phase. Its so-called minimum cost complexity pruning cuts off branches that have a large number of leaf nodes while yielding just a small reduction of the apparent error. SLIQ [2] is a scalable decision tree classifier that can handle both numerical and categorical attributes. In order to make SLIQ scalable for large training sets, special data structures called attribute lists are introduced, which avoid sorting the numerical attributes for each selection of the next split. Furthermore, a greedy algorithm for efficient selection of splits of categorical attributes is presented. An experimental evaluation demonstrates that SLIQ produces decision trees with state-of-the-art accuracy and tree size, with much better efficiency for large training sets. Visual representation of data as a basis for the human-computer interface has evolved rapidly in recent years. The increasing amount of available digital information in all kinds of applications has led to the challenge of dealing with both high dimensionality and large amounts of data. [3] gives a comprehensive overview of existing visualization techniques for large amounts of multidimensional data, which have no standard mapping into the Cartesian coordinate system.
Foundations A decision tree classifier constructs a tree in a top-down fashion, performing a greedy search through the very large space of all possible decision trees. At each current node, the attribute that is most useful for the task of classification (with respect to the subset of all training objects having passed all tests on the path to the current node) is selected. Criteria such as the information gain or the gini index have been used to measure the usefulness of an attribute. Domain knowledge about the semantics of the attributes and the attribute values is not considered by these criteria. Note that greedy algorithms for decision tree construction do not allow backtracking to a previous choice of an attribute when it later turns out to be suboptimal. Visual classification [4,5] is a user-centered approach to decision tree construction where the user and the computer can both contribute their strengths. The user provides domain knowledge and evaluates intermediate results of the algorithm, while the computer automatically creates patterns satisfying user constraints and generates appropriate visualizations of these patterns.
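A minimal sketch of the two impurity criteria mentioned above, evaluated for one candidate binary split, is given below; the attribute values, labels and threshold are hypothetical and this is not the code of any of the cited systems.

```python
# Minimal sketch of split evaluation with the gini index and information
# gain, as used by greedy decision tree construction (toy data only).
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_scores(x, y, threshold):
    """Impurity reduction of the binary split x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    w_l, w_r = len(left) / len(y), len(right) / len(y)
    gini_drop = gini(y) - (w_l * gini(left) + w_r * gini(right))
    info_gain = entropy(y) - (w_l * entropy(left) + w_r * entropy(right))
    return gini_drop, info_gain

x = np.array([1.0, 2.0, 2.5, 3.0, 4.5, 5.0])   # one numerical attribute
y = np.array([0, 0, 0, 1, 1, 1])               # class labels
print(split_scores(x, y, threshold=2.75))      # a perfect split on this toy data
```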
In this cooperative approach, domain knowledge of the user can direct the search of the algorithm. Additionally, by providing adequate data and knowledge visualizations, the pattern recognition capabilities of the human can be used to increase the effectiveness of decision tree construction. Furthermore, the user gets a deeper understanding of the decision tree than just obtaining it as the result of an algorithm. The three components of visual classification are visualizing the data, visualizing the decision tree, and the integration of algorithms into decision tree construction. Visualizing the data is done with the pixel-oriented bar technique, which works as follows: within a bar, the sorted attribute values are mapped to pixels in a line-by-line fashion according to their order. Each attribute is visualized independently from the other attributes in a separate bar. Figure 1 illustrates the bar visualization for the case of two attributes. The amount of training data that can be visualized by the bar technique is determined by the product of the number of data records and the number of attributes. Visualizing the decision tree is done with a visualization technique in which each node is represented by the data visualization of the chosen splitting attribute of
that node. For each level of the tree a bar is drawn representing all nodes of this level. The top-level bar corresponds to the root node of the decision tree. On lower levels of the tree the number of records, and thus the number of pixels, is reduced if there are leaves in upper levels; leaves are underlined with a black line. Black vertical lines indicate the split points set in the current bar. On lower levels, partitions of the data inherited from upper levels are marked by white vertical lines at the same horizontal position as the original split point. Attribute and node information at the mouse pointer position (attribute name, attribute value, minimum and maximum value, and number of records in this node) is displayed on demand. Upon a mouse click the system switches back to the data visualization of the corresponding node. Compared to a standard visualization of a decision tree, a lot of additional information is provided which is very helpful in explaining and analyzing the decision tree: the size of the node (number of training records corresponding to the node), the quality of the split (purity of the resulting partitions), and the class distribution (frequency and location of the training instances of all classes). Figure 2 illustrates the visualization of a decision tree for the Segment training data from the Statlog benchmark [6], which has 19 numerical attributes. The integration of algorithms into decision tree construction is solved by the following functions.
Visual Classification. Figure 1. Illustration of bar visualization.
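The bar technique itself is easy to emulate: sort the records by attribute value and write their class labels line by line into a rectangular pixel array, one bar per attribute. The sketch below does this on synthetic data; the class-correlated attribute and the default color mapping are placeholders and do not reproduce the actual PBC rendering.

```python
# Rough sketch of the pixel-oriented bar technique: per attribute, records
# are sorted by value and their class labels are written into the bar pixel
# by pixel, line by line (synthetic data; not the actual PBC rendering).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, width = 6000, 200                       # records and bar width in pixels
labels = rng.integers(0, 3, size=n)        # three classes
attr1 = labels + rng.normal(0, 0.4, n)     # attribute correlated with the class
attr2 = rng.normal(0, 1, n)                # attribute unrelated to the class

fig, axes = plt.subplots(2, 1)
for ax, attr, name in [(axes[0], attr1, "attribute 1"), (axes[1], attr2, "attribute 2")]:
    order = np.argsort(attr)               # sort records by attribute value
    bar = labels[order][: (n // width) * width].reshape(-1, width)
    ax.imshow(bar, aspect="auto", interpolation="nearest")  # color = class label
    ax.set_title(name)
    ax.set_axis_off()
plt.show()
```

The bar of the class-correlated attribute shows nearly pure color bands, while the unrelated attribute yields an impure, noisy bar, which is exactly the impurity cue the user exploits when choosing split attributes.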
Propose Split
For a set of attributes selected by the user, the attribute with the best split, together with the optimum split point of this attribute, is calculated and visualized. If a single attribute is specified as input, only the optimum split point for this attribute is determined. The propose split function turns out to be useful in two
Visual Classification. Figure 2. Illustration of knowledge visualization.
cases: first, whenever there are several candidate attributes with very similar class distributions and, second, when none of the attributes yields a good split which can be perceived from the visualization.
Look-Ahead
For some hypothetical split of the active node of the decision tree, the subtree of a given maximum depth is calculated and visualized with the new visualization technique for decision trees. This function offers a view of the hypothetical expansion up to a user-specified number of levels, or until a user-specified minimum number of records per node is reached. If the look-ahead function is invoked for a limited number of levels, it is very fast (a few seconds of runtime) because no pruning is performed in this case. The look-ahead function may provide valuable insights for selecting the next split attribute. Without the look-ahead function, when there are several candidate attributes the user selects one of them, chooses one or several split points and continues the tree construction. If at a later stage the expanded branch does not yield a satisfying subtree, the user will backtrack to the root node of this branch and will remove the previously expanded subtree. He/she will then select another candidate attribute, split it and proceed. Utilizing the look-ahead function, however, the user can request a look-ahead for each candidate attribute before actually performing a split. Thus, the necessity of backtracking may be avoided.
Visual Classification. Figure 3. PBC system.
Expand Subtree
For the active node of the decision tree, the algorithm automatically expands the tree. Several parameters may be provided to restrict the algorithmic expansion, such as the maximum number of levels and the minimum number of data records or the minimum purity per leaf node. The pruning of automatically created subtrees raises some questions. Should one only prune the subtree created automatically, or prune the whole tree, including the subtrees created manually? According to the paradigm of the user as a supervisor, pruning is only applied to automatically created subtrees. It is important to distinguish two different uses of the expand subtree function: the maximum number of levels is either specified or it is unspecified. In the first case, the decision tree is usually post-processed by the user. No pruning is performed if a maximum number of levels is specified. Otherwise, if no maximum
number of levels is specified, the user wants the system to complete this subtree for him and pruning of the automatically created subtree is performed if desired. The function expand subtree is useful in particular if the number of records of the active node is relatively small. Furthermore, this function can save a lot of user time because the manual creation of a subtree may take much more time than the automatic creation. The visual classification approach is implemented in the PBC system, as depicted in Fig. 3.
Key Applications The visual classification approach can be applied to any domain where decision tree classification is applicable. The value of the approach is the deeper understanding of the data and the decision tree.
Cross-references
▶ Decision Trees ▶ Pixel-Oriented Visualization Techniques
Recommended Reading
1. Breiman L., Friedman J.H., Olshen R.A., and Stone C.J. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
2. Mehta M., Agrawal R., and Rissanen J. SLIQ: A fast scalable classifier for data mining. In Advances in Database Technology, Proc. 5th Int. Conf. on Extending Database Technology, 1996.
3. Keim D.A. Visual database exploration techniques. Tutorial at Int. Conf. on Knowledge Discovery and Data Mining, 1997.
4. Ankerst M., Elsen C., Ester M., and Kriegel H.-P. Visual classification: an interactive approach to decision tree construction. In Proc. 5th Int. Conf. on Knowledge Discovery and Data Mining, 1999, pp. 392–396.
5. Ankerst M., Ester M., and Kriegel H.-P. Towards an effective cooperation of the computer and the user for classification. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2000.
6. Michie D., Spiegelhalter D.J., and Taylor C.C. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994. See also http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.
Visual Clustering
Mike Sips
Stanford University, Stanford, CA, USA
Synonyms Visual mining; Visual data mining
Definition Visual clustering is the synthesis of computational methods and interactive visualization techniques that presents a clustering structure, defined in a high-dimensional space, to the human analyst, in order to support the analyst in exploring and refining the clustering structure of high-dimensional data spaces based on his/her domain knowledge.
Historical Background The advancements made in computing technology over the last two decades allow both scientific and business applications to produce large data sets of increasing complexity and dimensionality. Automated clustering algorithms are indispensable for analyzing large n-dimensional data sets but often fall short of providing completely satisfactory results in terms of quality, meaningfulness, and relevance of the revealed clusters. With the increasing graphics capabilities of available computers, researchers realized that integrating the human into the clustering process based on visual feedback helps to improve the effectiveness of automated clustering algorithms. A synthesis of automated clustering algorithms and visualization usually yields not only better clustering results, but also a higher degree of user satisfaction and confidence in the findings.
Foundations Clustering is one of the basic data analysis tasks. It is the process of organizing data into similar groups. Unlike classification, clustering is unsupervised learning; in particular, the classes are unknown and no training set with pre-computed class labels is available. A clustering algorithm in general can be seen as an external source that labels the data. Automated clustering algorithms are indispensable in analyzing large n-dimensional data spaces, but they often fall short of computing completely satisfactory results. In general, two interconnected reasons can be observed. First, many clustering algorithms use assumptions about specific properties of clusters either as built-in defaults or as input parameter settings. Automated clustering algorithms are very sensitive to these input parameters. Analysts often get different clustering results even for slightly different parameter settings. Useful parameter settings are in general hard to determine a priori because humans have great difficulty in understanding the impact of different parameter settings on the revealed clusters, especially in n-dimensional data
spaces. Second, large n-D data spaces often have skewed distributions. The density, distribution and shape of the clusters are quite diverse in different subspaces. Skewed distributions are in general difficult to reveal using just one global parameter setting. A variety of different visual clustering approaches have been proposed. Visual clustering can be classified into four common approaches, based on their mechanisms for incorporating the data analyst into the clustering process. Several exploratory data analysis systems allow an interactive exploration of a given clustering structure based on projections; the aim is to make a clustering structure given in n-D interpretable for the human. A few visual clustering approaches are extensions of an optimized clustering algorithm with advanced visualization techniques. Other approaches extract the clustering structure based on the analyst's feedback; in these scenarios, clusters are characterized based on the analyst's specific domain knowledge. More recently, novel data analysis techniques supporting visual clustering have been adopted from related disciplines such as machine learning and statistics. An interesting approach uses visualization to support the selection of subspaces that contain strong clusters. Another interesting approach is the Self-Organizing Map.
Visual Exploration of a Given Clustering Structure
Many interactive systems use orthogonal standard projections to visualize the given clustering structure of an n-dimensional data space. Orthogonal standard projections are widely used in exploratory data analysis because they are easy to understand. Unlike with unlabeled data, the data items' class labels are exploited in these approaches to find interesting projections of the clustering structure, i.e., projections that preserve properties of the clusters. The n23Tool proposed by Yang [15] uses 3-D cluster-guided projections [5]. The idea of cluster-guided projections is to find a 3-D subspace that neatly separates four given clusters. A cluster-guided projection is based on the following observation: any combination of four distinct and non-collinear cluster centers defines a unique 3-D subspace. Moreover, such a subspace neatly separates the four clusters and, in addition, preserves the inter-cluster distances between any two of the cluster centers. To support an efficient exploration of clusters in n-dimensional data spaces, the n23Tool combines cluster-guided projections with the Grand Tour [3]. A cluster-guided tour
allows an analyst to quickly get an impression of all clusters by smoothly moving through the space of cluster-guided projections. Koren and Carmel [11] propose an alternative measure of goodness for preserving the clustering structure in projections. The basic idea is to weight the distances between points differently depending on whether they have the same class label. Given this objective function, the approach then searches for the best linear transformation of the n-dimensional data. This method has a significant advantage in comparison to traditional PCA or MDS, because it captures the cluster structure of the data and, in addition, the intra-cluster shapes. One general problem with general projections and embeddings is that users may have trouble interpreting the display. Another problem is that these approaches work well only when the clusters are neatly separated in n-D or the data contains only small amounts of noise. The Self-Organizing Map (SOM) is a neural network algorithm based on unsupervised learning (see [10] for further reading). The basic idea is to stretch a 1-D or 2-D neuron grid through the data. The lattice of the grid can be either rectangular or hexagonal. Each neuron is represented by an n-dimensional prototype vector. During the iterative training of a SOM, the neurons become the cluster means, and points closest to the neurons are considered to belong to that cluster. Intuitively, a SOM can be seen as a constrained k-means algorithm (the net of neurons is very flexible and folds onto the data clouds). SOMs can be used to visualize the clustering structure of an n-dimensional data space. The 1-D or 2-D grids get wrapped into a flat 2-D layout [12]. An alternative visual representation of the clustering structure can be achieved by visualizing either the distances of each grid unit to its neighboring grid units or the similarity of grid units. In most cases, gray shading (U-Matrix) and geometric shapes (Distance Matrix) are used to represent distances to neighboring grid units, and color is used to show the similarity of grid units (similarity coloring). A detailed discussion of SOM-based data visualization techniques is presented in [14].
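For concreteness, a minimal SOM training loop in NumPy is sketched below; the grid size, learning-rate and neighborhood schedules are arbitrary choices for illustration, and a real analysis would rely on an established SOM implementation rather than this sketch.

```python
# Minimal 2-D SOM training loop (illustration only; schedules are arbitrary).
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 5))                 # toy n-dimensional data
rows, cols, dim = 10, 10, data.shape[1]
weights = rng.normal(size=(rows, cols, dim))      # one prototype vector per neuron
grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)

n_iter = 5000
for t in range(n_iter):
    x = data[rng.integers(len(data))]
    # best matching unit = neuron whose prototype is closest to x
    d = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(d), d.shape)
    # shrinking neighborhood radius and learning rate
    sigma = 3.0 * (1 - t / n_iter) + 0.5
    lr = 0.5 * (1 - t / n_iter) + 0.01
    g = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1) / (2 * sigma ** 2))
    weights += lr * g[..., None] * (x - weights)  # pull neurons toward the sample

# assign each data point to its nearest prototype (its cluster)
flat = weights.reshape(-1, dim)
assignments = np.argmin(np.linalg.norm(data[:, None, :] - flat[None], axis=-1), axis=1)
```

The U-Matrix and similarity-coloring displays mentioned above are then simple functions of the distances between neighboring rows of the trained prototype grid.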
Extension of an Optimized Clustering Algorithm
An interesting extension of OptiGrid [8] with both an iconic visualization technique and kernel-density plots is HD-Eye [9]. OptiGrid is a density-based clustering algorithm that uses contracting projections and
separators to build up an n-dimensional grid. Clusters are defined as highly populated grid cells. HD-Eye considers clustering as a partitioning problem. A projection is of potential interest in finding cluster separators if it neatly separates a number of clusters in the projected space. Finding interesting projections and specifying good separators are two difficult problems; both are difficult to compute automatically. The basic idea of HD-Eye is to allow specific user feedback in two crucial steps of the partitioning process. To support the user in surveying the vast set of orthogonal projections in order to find potentially useful projections, HD-Eye employs an abstract iconic display (Fig. 1). The shapes of the icons are determined based on the number of maxima of the data's density function and how well the maxima are separated in the projections. The color of the icons represents the number of data items belonging to a maximum. Once a potentially interesting projection has been selected, HD-Eye provides 1-D/2-D color-based density plots for a further analysis of the projection. Density plots
easily allow the analyst to check whether a projection separates a number of clusters well. The HD-Eye approach does not require that a projection neatly separate all clusters, only at least two clusters. HD-Eye follows an iterative partitioning process. Based on his/her domain knowledge and intuition, the analyst interactively selects interesting projections from the iconic display. Then, the analyst defines useful separators within the density plots by drawing either a split line (linear partitions) or a split polygon (non-linear partitions). The analyst can easily follow the interactive partitioning process by inspecting the separator tree. After each interactive partitioning step, the clusters are refined, i.e., highly populated regions are determined on the refined grid and added into the separator tree. Another interesting approach is OPTICS [2] (Fig. 2). OPTICS is an extension of the DBSCAN [6] algorithm and creates a one-dimensional ordering of the data items. The ordering of the data items represents their density-based clustering, which can be obtained from a broad range of input parameter
Visual Clustering. Figure 1. The HD-Eye interface has three main visual components – the iconic display showing properties of the projections, 1-D and 2-D kernel-density plots for a further analysis of selected projections, and the separator tree. Clockwise starting from the top: separator tree, iconic representation of 1-D projections, 1-D projection histogram, 1-D color-based density plots, iconic representation of multi-dimensional projections and color-based 2-D density plot. The figure is taken from the exploration of a large molecular biology data set. (Used with permission of Alexander Hinneburg.)
Visual Clustering. Figure 2. The figure shows an example data set (Fig. 2a) and its reachability plot (Fig. 2b) – objects are on the x-axis together with their reachability values on the y-axis. (Used with permission of Mihael Ankerst.)
settings. Clusters in DBSCAN are determined based on the ε and MinPts parameters, where ε is the radius of a local neighborhood around each data item, and MinPts determines the minimal number of data items in the local neighborhood. The basic observation of OPTICS is that, given a constant MinPts value, density-based clusters with respect to a higher density (lower ε) are completely contained in density-based clusters with respect to a lower density (greater ε). The ordering of the data items can be used to visualize how the data items are clustered. These visualizations are called reachability plots. The data items are visualized according to the cluster ordering on the x-axis, together with their associated reachability distances on the y-axis. Intuitively, data items sharing a common cluster are close to each other in the cluster ordering and have similar reachability distances. The reachability distance jumps where a different cluster starts in the ordering. In addition, the ordering of the data items can be used to compute a cluster hierarchy (dendrogram).
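Assuming scikit-learn is available, whose OPTICS implementation exposes the cluster ordering and the reachability distances, a reachability plot in the spirit of Fig. 2b can be produced as follows; the synthetic blobs and parameter choices are for illustration only.

```python
# Sketch of a reachability plot using scikit-learn's OPTICS implementation
# (synthetic data; parameters chosen arbitrarily for illustration).
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=3, cluster_std=[0.4, 0.8, 1.5], random_state=0)
optics = OPTICS(min_samples=15).fit(X)

# reachability values in cluster order: valleys correspond to clusters,
# jumps mark the start of a new cluster in the ordering
reach = optics.reachability_[optics.ordering_]
plt.plot(range(len(reach)), reach, drawstyle="steps-post")
plt.xlabel("objects in cluster order")
plt.ylabel("reachability distance")
plt.show()
```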
Clustering Based on Visual User Feedback and Refinement
An interesting approach is presented by Aggarwal [1]. The basic idea is similar to HD-Eye, i.e., the search for projections that neatly separate clusters. In contrast to
HD-Eye, which uses an iconic display to show properties of the projections and then supports an interactive search for interesting projections, Aggarwal proposed an iterative cluster partitioning based on an automated subspace detection algorithm. The idea of the algorithm is straightforward and follows an iterative process. In each iteration step, a number of data items called polarization points are randomly chosen from the data set. Next, an analytic method is used to find a lower-dimensional subspace that neatly groups the data items around the polarization points. Once an interesting projection has been found, both HD-Eye and Aggarwal use kernel-density plots of the projected data in order to discover the clustering structure. In contrast to the HD-Eye approach, which allows an interactive separation of clusters in the density plots, Aggarwal's approach separates clusters by choosing noise levels. The iterative cluster partitioning process terminates when all data items belong to at least one cluster. The final clustering structure is extracted in a frequent item analysis based on the recorded user selections in each partitioning step. To control the granularity of the cluster separation in the final clustering structure, the analyst can choose different noise levels for the same projection.
ClusterSculptor [13] is an interesting approach that allows an analyst to interactively refine an initially computed hierarchical clustering structure. The hierarchical clustering structure is described as a dendrogram. The analyst refines that dendrogram by either splitting or merging a number of nodes; if necessary, the analyst can reorganize the whole dendrogram. ClusterSculptor supports the analyst in deciding whether a node should be split or merged by presenting the relationship between the attribute values and the clustering structure, as well as the distances of neighboring data items around a selected cluster. ClusterSculptor provides three main visualizations. A dendrogram shows the cluster hierarchy and allows the analyst to quickly survey the clustering structure. The distribution of the attribute values in
each dimension is shown in a point map. Each data item is visualized as a horizontal line with one colored pixel for each dimension. The color of each pixel represents the normalized attribute value in that dimension. Additionally, the data items are sorted along the y-axis to emphasize the clustering structure. The distances of neighboring data points around the leaf cluster closest to a selected data item are presented when the analyst selects a data item in the point map. Three different colors are used to represent the selected data item and to indicate whether or not data items share the same neighborhood with the selected data item. A number of exploratory data analysis systems provide interactive visualizations to support the analyst in finding clusters. GGobi [4] supports the analyst by providing a spin-and-brush mechanism. Starting with
Visual Clustering. Figure 3. The figure shows a visualization of the entropy matrix of a real cancer data set with 72 dimensions. The two histograms at the bottom of the image allow the analyst to adjust the classification and coloring in an interactive fashion. (Used with permission of Diansheng Guo.)
an initial projection of the data, the analyst tours through the space of possible projections until an interesting projection is reached. The analyst then paints a cluster in that projection using an easily distinguishable color, and continues touring until no more clusters are revealed.
Visualization Used as Interactive Feature Selection for Clustering
An interesting subspace selection method that effectively identifies subspaces determining strong clusters is proposed by Guo [7]. The method is based on the computation of the pairwise conditional entropy values of 2-D subspaces. The pairwise conditional entropy values are visualized in the so-called entropy matrix, which allows the data analyst to easily understand relationships among dimensions (Fig. 3). Interesting multi-dimensional subspaces can be either automatically extracted or visually identified in the entropy matrix. The selected dimensions can be fed into a clustering algorithm in order to improve the effectiveness of the clustering results. Although designed as a feature selection method, the entropy matrix allows the analyst to get an understanding of the overall clustering structure of an n-dimensional data space.
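The core computation behind such an entropy matrix can be sketched in a few lines; the equal-width binning and the synthetic data below are placeholders, and the normalization and binning details of Guo's actual method [7] are omitted.

```python
# Sketch of a pairwise conditional entropy matrix from 2-D histograms,
# in the spirit of [7] (equal-width binning; details of the original
# method are omitted, and the data are synthetic).
import numpy as np

def conditional_entropy(x, y, bins=10):
    """H(Y | X) estimated from a 2-D histogram of the pair (x, y)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1)
    h_xy = -np.sum(p_xy[p_xy > 0] * np.log2(p_xy[p_xy > 0]))
    h_x = -np.sum(p_x[p_x > 0] * np.log2(p_x[p_x > 0]))
    return h_xy - h_x                      # H(X,Y) - H(X)

rng = np.random.default_rng(2)
data = rng.normal(size=(2000, 4))
data[:, 1] = data[:, 0] + 0.1 * rng.normal(size=2000)   # dimension 1 depends on 0

d = data.shape[1]
entropy_matrix = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        if i != j:
            entropy_matrix[i, j] = conditional_entropy(data[:, i], data[:, j])
print(np.round(entropy_matrix, 2))   # low values hint at strongly related dimension pairs
```

Rendering this matrix as a color-coded image is then what the entropy matrix display in Fig. 3 does, at much larger scale and with interactive thresholding.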
Key Applications Effective mining of large traditional relational databases, complex 2-D and 3-D multimedia databases, geo-spatial databases, and bio-databases.
Cross-references
▶ Automated Clustering Algorithms ▶ Exploratory Data Analysis ▶ Visual Analytics ▶ Visual Data Mining
Recommended Reading
1. Aggarwal C.C. A human–computer interactive method for projected clustering. IEEE Trans. Knowl. Data Eng., 16(4):448–460, 2004.
2. Ankerst M., Breunig M.M., Kriegel H.P., and Sander J. OPTICS: ordering points to identify the clustering structure. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 49–60.
3. Asimov D. The grand tour: a tool for viewing multidimensional data. SIAM J. Sci. Stat. Comp., 6(1):128–143, 1985.
4. Cook D. and Swayne D.F. Interactive and Dynamic Graphics for Data Analysis – With R and GGobi. Springer Science and Business Media, New York, NY, USA, 2007.
5. Dhillon I.S., Modha D.S., and Spangler W.S. Visualizing class structure of multidimensional data. In Proc. 30th Symp. on the Interface: Computing Science and Statistics, 1998, pp. 488–493.
6. Ester M., Kriegel H.P., Sander J., and Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
7. Guo D. Coordinating computational and visual approaches for interactive feature selection and multivariate clustering. Inform. Vis., 2(4):232–246, 2003.
8. Hinneburg A. and Keim D.A. Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In Proc. 25th Int. Conf. on Very Large Data Bases, 1999, pp. 506–517.
9. Hinneburg A., Keim D.A., and Wawryniuk M. HD-Eye: visual mining of high-dimensional data. IEEE Computer Graphics and Applications, 19(5):22–31, 1999.
10. Kohonen T. Self-Organizing Maps. 3rd edn., Springer Series in Information Science, 2001.
11. Koren Y. and Carmel L. Robust linear dimensionality reduction. IEEE Trans. Vis. Comput. Graph., 10(4):459–470, 2004.
12. Kraaijveld M., Mao J., and Jain A. A nonlinear projection method based on Kohonen's topology preserving maps. IEEE Trans. Neural Networks, 6(3):548–559, 1995.
13. Nam E.J., Han Y., Mueller K., Zelenyuk A., and Imre D. ClusterSculptor: a visual analytics tool for high-dimensional data. In IEEE Symp. on Visual Analytics Science and Technology, 2007, pp. 75–82.
14. Vesanto J. SOM-based data visualization methods. Intell. Data Anal., 2(3), 1999.
15. Yang L. Interactive exploration of very large relational datasets through 3D dynamic projections. In Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2000, pp. 236–243.
Visual Content Analysis
Marcel Worring, Cees Snoek
University of Amsterdam, Amsterdam, The Netherlands
Synonyms Video indexing; Image indexing
Definition Visual content analysis is the process of deriving meaningful descriptors for image and video data. These descriptors are the basis for searching large image and video collections. In practice, before the process starts, one applies image processing techniques which
take the visual data, apply an operator, and return other visual data with less noise or specific characteristics of the visual data emphasized. The analysis considered in this contribution starts from here, ultimately aiming at semantic descriptors.
Historical Background Analyzing the content of visual data using computers has a long history, dating back to the 1960s. Some initial successes prompted researchers in the 1970s to predict that the problem of understanding visual material would soon be solved completely. However, the research in the 1980s showed that these predictions were far too optimistic. Even now, understanding visual data is still a major challenge. In the 1990s a new field emerged, namely content-based image retrieval (CBIR), where the aim is to develop methods for searching in large image archives. Where the computer vision community was publishing papers on the analysis of 10–50 images, the CBIR researchers started with 1,000 images or more. Nowadays, research papers consider data sets which are orders of magnitude larger in size. The paper by Smeulders et al. [10] reviewed the state of the art in 2000; almost all of its findings are still very relevant for current research. The CBIR field has not only developed itself since then, it also had a major impact on the computer vision field. Large visual collections are also becoming the standard in computer vision. It also led to new methods for analyzing visual data. In early methods for understanding the content of the data, pre-defined models were used. In current research, models are learned automatically from large sets of annotated examples, based on the appearance of objects in the visual data. This makes the computer vision algorithms more flexible and easier to adapt to new data sets. Video content analysis started later than image analysis due to the high processing power required. Computer capacity, however, has increased substantially over the last 10 years, making it feasible to process large collections of video as well. Research in video content analysis started with camera shot segmentation and simple analysis. Soon the research community realized that the result of automatic speech recognition is an important, sometimes decisive, factor in video content analysis. The best results are obtained if the visual and audio modalities are analyzed in conjunction [11]. Relying on speech is fine for video programs with a
clear speech signal, nicely separated from environmental sounds and music. More and more, however, people create video data without such clear speech, so the role of visual content has moved to the forefront in video content analysis again. So, although visual content analysis started around 50 years ago, it is still an active research area with many research challenges ahead.
Foundations Automatically extracting the semantics of visual data is a notoriously difficult problem, which has become known as the semantic gap [10]:
Semantic gap: The semantic gap is the lack of coincidence between the information that one can extract from the sensory data, and the interpretation that the same data has for a user in a given situation.
The existence of the gap has various causes. One reason is that different users interpret the same visual data in different ways. This is especially true when the user is making subjective interpretations of the visual data. In the following, such subjective interpretations are not considered. However, even for objective interpretations, like the presence of a car in a picture, developing automatic methods is still difficult. Consequently, their performance is not optimal. The difficulties are due to the large variations in appearance of visual data corresponding to one semantic concept. Cars, for example, come in different models, shapes, and colors. The above causes are inherent to the problem, and visual analysis methods should address these fundamental issues. There are also variations in appearance which are not due to the richness of the semantics: varying the viewpoint, lighting and other circumstantial conditions in the recording of a scene will deliver different data, whereas the semantics have not changed. These variations induce the so-called sensory gap, which is defined as [10]:
Sensory gap: The sensory gap is the gap between the object in the world and the information in an image recording of that scene.
One of the consequences of the semantic and sensory gaps is the fact that recordings of semantically different objects might appear more similar than two recordings of the same object in different recording conditions. Methods are needed that are minimally affected by the sensory gap, while still being able to distinguish objects with different semantics. This is the
essence of visual content analysis. General methods for performing the analysis are described next. Methods for deriving the semantics of visual data are mostly based on a supervised learning scheme. In such a scheme, the developer provides the system with a large set of annotated examples. For all of the examples, the system extracts low-level features. The system then uses a machine learner to build a model for each semantic concept. Now, if a new image or video without annotation is given, the system again computes features. It then uses the model to derive a measure of the probability of the concept being present in the visual data. An overview of the general concept-based video content analysis scheme is depicted in Fig. 1. So, the two main things to consider are the features to extract and the machine learning scheme. Many features exist; see [10,11] for an extensive overview. A good starting point for a set of practical visual features is defined in the MPEG-7 standard [8]. There are different categories of features, namely color features, texture features, shape features, and motion features, which are described next.
than histograms take the color layout into account, in addition to global characteristics. Where color is a property which can be measured at every pixel in the visual data, texture is a measure which considers patterns in the image. Again, there is a variety of methods to choose from with varying characteristics in terms of invariance. Many can be traced back to statistics on co-occurrence of grey values in a local neighborhood, or filters emphasizing patterns in certain directions like Gabor filters. In [3], a convenient set of features for texture description has been defined based on natural statistics. Color and texture often form the basis for segmenting the visual data in homogeneous regions. This process is known as weak segmentation as it is purely based on the visual data itself, not on the interpretation of the data. The latter is known as strong segmentation, which is delineating the contour of a semantic object in the image. Strong segmentation is required before shape features can be measured. As a consequence, shape features are only applicable for those applications where the background can be controlled easily, like the recognition of logos or a set of objects on a turntable. A distinction is made here between descriptors that describe the object as a region, e.g., based on properties of the medial axis, and measures that take the contour as the basis, like the curvature scale space descriptor. In most practical cases, strong segmentation is impossible, whereas global descriptors loose too much of
Visual Content Analysis. Figure 1. General scheme for concept-based learning in visual data using annotated examples.
the spatial information. Some methods therefore focus on weak segmentation, describing the resulting regions individually, or move to the detection of key points. These key points are characteristic points on the contours of objects, which in many cases capture the most important information of the object [7]. All of the above features are equally important for video data. However, for video one can also make use of the temporal dimension. One can analyze the motion pattern to find the camera motion, track regions or points of interest through the sequence and describe their tracks, or derive general measures of the activity in the scene. These additional information sources can enhance the interpretation of the data. Feature extraction yields a feature vector for every image or video segment. As indicated, for interpretation of the data, state-of-the-art systems follow a supervised learning approach. A good overview of learning methods is presented in [5]. Although there is a large variety of methods, the Support Vector Machine has become the standard choice in most visual indexing schemes [1,14]. The above considers the visual data in isolation, but often the data has associated text data, like the automatic speech recognition results for video. This data can be used as an additional source to improve the understanding of the data. In fact, for some concepts the textual data gives almost all the clues needed to interpret the visual data, while other concepts perform best when only the visual data is used. Relations between the detected concepts can also provide essential information for some concepts. Therefore, a method like [13], which learns for a given concept the optimal analysis scheme through appropriate analysis of the data, is the best choice.
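To make the supervised scheme of Fig. 1 concrete, the sketch below extracts a global color histogram per image and trains one SVM per concept using scikit-learn. The random "images" and annotations are stand-ins for a real annotated collection, and the histogram is far simpler than the MPEG-7 or invariant descriptors discussed above.

```python
# Sketch of the supervised concept-detection scheme: a global color histogram
# per image, one SVM per concept (random arrays stand in for real images and
# annotations; the feature is deliberately simplistic).
import numpy as np
from sklearn.svm import SVC

def color_histogram(image, bins=8):
    """Concatenated per-channel histograms of an RGB image in [0, 255]."""
    return np.concatenate([
        np.histogram(image[..., c], bins=bins, range=(0, 256), density=True)[0]
        for c in range(3)
    ])

rng = np.random.default_rng(3)
images = rng.integers(0, 256, size=(200, 32, 32, 3))      # toy "images"
labels = rng.integers(0, 2, size=200)                     # toy concept annotations

features = np.array([color_histogram(img) for img in images])
detector = SVC(kernel="rbf", probability=True).fit(features, labels)

# probability that the concept is present in a new, unannotated image
new_image = rng.integers(0, 256, size=(32, 32, 3))
print(detector.predict_proba(color_histogram(new_image)[None])[0, 1])
```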
Key Applications
Video content analysis plays a role in various application areas; the major ones are considered here.
Broadcasting
The broadcasting field generates enormous amounts of audio-visual data, which should be stored for later reference and re-use. Manual annotation is laborious and expensive. Tools for automatic analysis are of great importance to deal with this flood of videos.
Narrow Casting
Producing video, up to now, has been limited to major film studios or specialized companies. Now, every organization, association or individual can create video and put it on the web. Even more than for broadcasting companies, narrow casters have limited resources to manually analyze the content of their videos.
Law Enforcement
Through the availability of cheap cameras, video content analysis plays a major role in law enforcement. There is a need for content analysis to detect and analyze illicit video material on confiscated hard disks containing hundreds of gigabytes of video data. In video surveillance, a single camera generates 7 × 24 h of video per week. Any visual content analysis which helps the operator in reducing the load of watching these video streams is of great importance.
Web Search
There is no need to explain that there is a tremendous amount of video available on the Internet. Video search on the web, however, is still mostly limited to keyword-based search for full video clips. Video content analysis is needed to delve into the actual content of all these videos out there.
Future Directions
Obviously the research in video content analysis has not reached its end yet. On the contrary, the community is only beginning to automatically understand the meaning of visual material. To bring the field a step further, researchers should tackle the large number of open questions out there. Tangential contributions will introduce features with higher discriminatory power, while having better invariant properties, and new machine learning techniques, better suited for the specifics of visual data. A new direction is tackling the major bottleneck in supervised learning, namely the amount of training examples needed, by employing user-tagged visual data like Flickr. The latter annotations are less accurate than the current practice in supervised learning, but the amount of training samples is orders of magnitude larger. Another promising direction is the use of capture-time metadata, contextual information, and social networks to provide additional clues for interpreting the visual data.
Data Sets
The research community in video content analysis is very active. In recent years there was an important
role for benchmarks encouraging research in the various subfields. They do so by providing large test collections and uniform evaluation procedures. The benchmarks provide a forum where researchers can compare their results, bringing the research a big step further every year. For visual content analysis of images the most elaborate benchmark is the PASCAL VOC challenge [2]. This benchmark contains image sets of several hundreds to thousands of images with a ground truth at the semantic level, a box containing the object, as well as complete outlines of the object in the scene. In this manner, it provides a challenging data set for testing. Video analysis, especially for the purpose of search, has been studied for several years now in the TRECVID benchmark organized by NIST. In this benchmark, videos from mostly news and other broadcasts have been semantically annotated at the shot level [9]. Collections contain hundreds of hours of video material. Hence they not only pose a challenge at the
analysis level, but also at the computer performance level. In addition, one can see that the analysis of such data sets requires expertise from different fields, raising the threshold for working on such collections. Therefore, the MediaMill Challenge [12] has taken the idea of making shared data sets available one step further. This resource not only contains annotations of the TRECVID data, but also a collection of five pre-cooked experiments. To allow an easy start on the topic, the results of each sub-step are provided. Later, Columbia and the VIREO group made similar resources available [6,15]. The above efforts allow for visual database research without actually having to perform the visual analysis.
Experimental Results The performance of video analysis algorithms for information retrieval is typically measured by the average precision. This is a single-valued measure proportional to the area under a recall-precision curve. This value is the average of the precision over all relevant judged
Visual Content Analysis. Figure 2. Indication of the state-of-the-art performance in video content analysis. Note the clear relation between the performance and the amount of annotated examples.
shots. Hence, it combines precision and recall into one performance value. To be precise, let Lk = {l1, l2,...,lk} be a ranked version of the answer set A. At any given rank k, let R ∩ Lk be the number of relevant results in the top k of L, where R is the total number of relevant results. Then average precision is defined by
\[
\text{average precision} = \frac{1}{R} \sum_{k=1}^{|A|} \frac{R \cap L_k}{k}\,\lambda(l_k), \tag{1}
\]
where the indicator function λ(lk) = 1 if lk ∈ R and 0 otherwise. As the denominator k and the value of λ(lk) are dominant in determining average precision, it can be understood that this metric favors highly ranked relevant results. To give an indication of the state of the art, Fig. 2 gives an overview of results for TRECVID data using the features defined for the MediaMill Challenge. Performance is shown as a function of the number of annotated examples, as this has proven to be an important indicator for the quality of the result. One can conclude that performance is not perfect, but a rapid increase in performance over the last years can be observed. Hence, one can expect that more and more reliable concept detectors will become available in the coming years.
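Equation (1) translates directly into a few lines of Python; the ranked shot list and the relevance judgments below are invented.

```python
# Direct implementation of average precision as defined in (1)
# (the ranked result list and relevance judgments are invented).
def average_precision(ranked, relevant):
    """ranked: result list in rank order; relevant: set of relevant items."""
    hits, score = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:                 # lambda(l_k) = 1
            hits += 1                        # |R intersect L_k|
            score += hits / k
    return score / len(relevant)             # divide by total number of relevant items

ranked = ["s3", "s7", "s1", "s9", "s4", "s2"]
relevant = {"s3", "s1", "s2", "s8"}
print(average_precision(ranked, relevant))   # 0.25 * (1/1 + 2/3 + 3/6)
```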
6. Jiang Y.G., Ngo C.W., and Yang J. VIREO-374: LSCOM semantic concept detectors using local keypoint features. Available at: http://www.cs.cityu.edu.hk/~yjiang/vireo374/. 7. Lowe D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis., 60(2):91–110, 2004. 8. Sikora T. The MPEG-7 visual standard for content description-an overview. IEEE Trans. Circ. Syst. Video Tech., 11:696–702, 2001. 9. Smeaton A. Large scale evaluations of multimedia information retrieval: The TRECVid experience. In Proc. 4th Int. Conf. Image and Video Retrieval, 2005, pp. 19–27. 10. Smeulders A., Worring M., Santini S., Gupta A., and Jain R. Content based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell., 22(12):1349–1380, 2000. 11. Snoek C. and Worring M. Multimodal video indexing: A review of the state-ofpascal-the-art. Multimed. Tool Appl., 25(1):5–35, 2005. 12. Snoek C., Worring M., van Gemert J.C., Geusebroek J.M., and Smeulders A. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proc. 14th ACM Int. Conf. on Multimedia, 2006. 13. Snoek C., Worring M., Geusebroek J., Koelma D., Seinstra F., and Smeulders A. The semantic pathfinder: Using an authoring metaphor for generic multimedia indexing. IEEE Trans. Pattern Analy. Mechine Intell., 28(10):1678–1689, 2006. 14. Vapnik V. The nature of statistical learning theory. SpringerVerlag, New York, NY, USA, 2nd edn., 2000. 15. Yanagawa A., Chang S.F., Kennedy L., and Hsu W. Columbia university’s baseline detectors for 374 LSCOM semantic visual concepts. Columbia University, 2007, aDVENT technical report 222-2006-8.
Cross-references
▶ Content-Based Video Retrieval ▶ Decision Tree Classification ▶ Image Retrieval ▶ Multimedia Databases ▶ Multimedia Information Retrieval ▶ Video Scene and Event Detection ▶ Video Content Analysis
Visual Data Analysis ▶ Visual Analytics ▶ Visual Data Mining
Visual Data Exploration Recommended Reading 1. Chang C.C. and Lin C.J. LIBSVM: a library for support vector machines. 2001, software available at: http://www.csie.ntu.edu. tw/~cjlin/libsvm/. 2. Everingham M., van Gool L., Williams C., Winn J., and Zisserman A. The PASCAL visual object classes homepage. Available at: http://www.pascal-network.org/challenges/VOC/. 3. Gemert J., Geusebroek J., Veenman C., Snoek C., and Smeulders A. Robust scene categorization by learning image statistics in context. In Proc. Int. Workshop on Semantic Learning Applications in Multimedia, 2006. 4. Geusebroek J., Boomgaard R., Smeulders A., and Geerts H. Color invariance. IEEE Trans. Pattern Anal. Mach. Intell., 23(12):1338–1350, 2001. 5. Jain A., Duin R., and Mao J. Statistical pattern recognition: A review. IEEE Trans. Pattern Anal. Mach. Intell., 22(1):4–37, 2000.
▶ Dense Pixel Displays
▶ Disclosure Risk
Visual Data Mining
Simeon J. Simoff
University of Western Sydney, Sydney, NSW, Australia
Synonyms
Visual data analysis; Visual analysis; Visual discovery; Immersive data mining; VDM
Definition
Visual data mining (VDM) is the process of interaction and analytical reasoning with one or more visual representations of abstract data. The process may lead to the visual discovery of robust patterns in these data or provide some guidance for the application of other data mining and analytics techniques. It helps analysts obtain a deeper understanding of the underlying structures in a data set. The process relies on the tight interconnectedness of tasks, the selection of visual representations, the corresponding set of interactive manipulations, and the respective analytical techniques. Discovered patterns form the information and knowledge utilized in decision making.
Historical Background
Visual exploration of large data sets has long been used as a technique complementary to data mining, in order to obtain additional information about the data set. Since the early 1990s there has been recognition of the need for specific visualization techniques for large data sets that enable data mining tasks. The term "visual data mining" emerged in the late 1990s, labeling such techniques combined with interactive functionality and some guidelines for sense making from different visual displays of the data and/or the output of data mining algorithms. Ankerst's and Niggemann's theses around the millennium addressed the establishment of more systematic scientific and methodological foundations for the field. The 3D Visual Data Mining (3DVDM) project at Aalborg University added a new twist, placing the analyst literally "within" the visual representation of the data set in various stereoscopic virtual reality systems. During 2001–2003 a series of workshops on visual data mining, conducted in conjunction with major KDD conferences on both sides of the Atlantic, bridged scholars in data mining and knowledge discovery in databases, information visualization, and human-computer interaction, as well as cognitive psychologists interested in computing, industry practitioners, and developers of visual data mining tools. They established some common elements that need to be specified when developing methodologies for visual data mining, including: (i) the initial assumptions posed by the respective visual representations; (ii) the set of interactive operations over the respective visual representations and guidelines for their application and interpretation; and (iii) the range of
applicability and limitations of the visual data mining methodology. Most of the works published in the proceedings of these workshops later made their way into the books and special issues referred to in the section "Recommended Reading". These and numerous subsequent research and development efforts led to the establishment of visual data mining as an essential component of visual analytics.
Foundations
Visual data mining integrates the abilities of the human perceptual system for pattern recognition and human reasoning from images with the data processing and display power of computers, aiming to capitalize on the best of both worlds. From a disciplinary perspective, visual data mining emerged as the confluence of several disciplines, the main of which are grouped into two clusters in Fig. 1. On the right-hand side are those primarily related to computing, while on the left-hand side are those primarily related to humans. From the perspective of scientific fundamentals, though not exhaustive, this list shows the diversity of sources that contribute to the methods and approaches developed in visual data mining. The VDM process relies on the interactive visual processing pipeline shown in Fig. 2. Each step within the pipeline involves interaction with the analyst, and all the iterative loops in the process close via the analyst. Data mining algorithms can be used to assist the process before any visualization has been considered, and/or after a visual interaction with the data. Being the core of this human-centred process, visual processing is constrained by the capacity of the human visual system, roughly 10^6 to 10^7 observations. Large data sets easily surpass this constraint. Consequently, essential for visual data mining are low-dimensional visual representations of high-dimensional data that (i) preserve relations between the data points in the high-dimensional space, and (ii) can make such relations visually detectable. Coupling visualization algorithms with statistical techniques is one way of generating visual data representations that more accurately present the underlying properties of the data. Such visual representations are expected to fit consistently certain metaphors that are close to human mental models, as humans depict structures in the data by forming a mental model which captures only the gist of the information. These are important design considerations in visual data mining, as visualization algorithms display data sets that usually lack inherent 2D and 3D semantics.
Visual Data Mining. Figure 1. Visual data mining as a confluence of disciplines.
Visual Data Mining. Figure 2. Visual data mining is a human-centred interactive analysis and discovery process.
As manipulation of visual representations of the data set (or subsets of it) is essential in enabling visual pattern detection, the success of the process illustrated in Fig. 2 depends on: (i) the breadth of the collection of visualization techniques, (ii) the consistency of the design of these visual representations, (iii) the ability to remap
interactively the data attributes to the parameters of the visual representations, (iv) the set of functions for interacting with the visualization, and (v) the capabilities that these functions offer in support of the sense-making process. From an analyst's point of view, the suite of visual representations and interactive techniques is the toolbox for constructing "virtual worlds" from the data. The visual metaphors in such worlds are the vehicles to
transfer meaning from the visual structure of the data into the semantics of the underlying problem. The transfer is enabled by the rules of behavior of the visualization elements, which constitute a visual language. The consistent design of visual languages and of techniques for interaction with different visual representations of a data set takes into account the specifics of human long-term and short-term memory and knowledge of interaction with virtual environments, including frames of reference, orientation, and navigation. These considerations are also central to supporting the sense-making process in Fig. 2 (the process itself has several iterative steps that are not shown in the figure). Dealing with large volumes of data requires scalability, and interactivity requires real-time response of the visualization algorithms in the toolbox of Fig. 2 and of the overall visual processing pipeline, so that the details available in large databases can be analyzed. Interactivity also means support of various visual data mining operations, for example visual drill-down or visual clustering. The presentation of the extracted information and knowledge during the decision-making step in Fig. 2 usually requires a different visualization approach from that used in the visual data mining step. Such separation of visual analysis tasks from the visual presentation of the results is reflected in contemporary visual data mining technology.
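To make the idea of relation-preserving low-dimensional views concrete, the following Python sketch (an illustration only, not from the original entry; it assumes NumPy, scikit-learn, and matplotlib are available and uses synthetic data) projects a high-dimensional table to two dimensions and displays it for visual inspection.

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Hypothetical high-dimensional data set: 1,000 records, 50 attributes
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 50))

# Project to 2D while preserving as much variance (structure) as possible
projection = PCA(n_components=2).fit_transform(data)

# Display the projection so the analyst can look for clusters or outliers
plt.scatter(projection[:, 0], projection[:, 1], s=5)
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.title("2D view of a 50-dimensional data set")
plt.show()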
Key Applications
Visual data mining is applied in a wide range of areas of human endeavor that generate large volumes of otherwise incomprehensible data. These include science, business, technology development, medicine, and healthcare. Chemistry and new drug discovery are examples of successful application of VDM methods. The applications in various fields provide feedback to the development of VDM technology, in terms of the integration of 1D, 2D, 2.5D, and 3D visual representations, the respective sets of interactive operations, and evaluations of the visual data mining capabilities and interfaces by the analysts in the respective fields. Special key application areas of visual data mining are the analysis of geospatial and spatio-temporal data sets, visual link analysis, and visual text mining. Spatial data sets include spatially related attributes, which provide guiding information for the choice of visual
representations and interaction methods. Visual link analysis is applied in areas where there is a need to identify hidden links and chains of links between the attributes in a data set, for example fraud detection problems in different environments such as government, corporate, insurance, retailing, and crime investigation. These problems share a common feature: different cases of fraud may differ in terms of the patterns that reflect them in the data, and non-visual data mining and analytics methods may not be able to detect them. The advent of social computing has led to embedding visual link analysis technologies in social computing systems. Visual text mining uses a variety of statistical techniques to relate groups of words with groups of documents, and then generates visual data summarizations using different metaphors.
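As a rough illustration of the kind of link structure such tools visualize (a sketch added here, not taken from the entry; the account names, interaction counts, and drawing parameters are hypothetical), the following Python fragment builds and draws a weighted interaction graph with NetworkX.

import networkx as nx
import matplotlib.pyplot as plt

# Hypothetical interaction records: (entity_a, entity_b, number_of_contacts)
records = [
    ("acct_17", "acct_42", 12),
    ("acct_17", "acct_98", 3),
    ("acct_42", "acct_98", 9),
    ("acct_42", "acct_55", 1),
]

# Build a weighted graph; heavy edges may hint at hidden chains of links
graph = nx.Graph()
for a, b, weight in records:
    graph.add_edge(a, b, weight=weight)

# Draw the graph, scaling edge width by interaction count for visual inspection
pos = nx.spring_layout(graph, seed=1)
widths = [graph[u][v]["weight"] / 3 for u, v in graph.edges()]
nx.draw(graph, pos, with_labels=True, width=widths, node_size=900)
plt.show()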
Future Directions
Visual data mining, as a discovery mode of interaction with databases, plays a central role in visual analytics. Though the importance of interaction has been recognized, there is a strong demand for the development of interactive visualizations, which are fundamentally different from static visualizations. Designed with the foundations of perceptual and cognitive theory in mind, and focused on supporting the processes and methods in visual data mining, these visual data representations are expected to be relevant to specific tasks and effective in terms of achieving the analytics goals. Within this direction, essential are methods for: (i) the design and evaluation of visualizations for visual data mining, including metrics for the evaluation of the interactivity of visualizations and their ability to facilitate discovery processes; (ii) visualization techniques for projections of high-dimensional data that preserve the statistical properties of the patterns in the original data; (iii) further integration of data visualization and sonification and the development of interactive mining methods, including immersive ones, that operate with such data representations; and (iv) enabling VDM systems to support the research strategies and inquiry styles of different data analysts.
Experimental Results
As a rule, research papers in visual data mining that report on novel visual analysis techniques present results from the application of the respective techniques to several data sets. Papers that explore
human-computer interaction aspects report the outcomes of observational case studies and usability experiments.
Data Sets
Data sets are available from the KDnuggets site (http://www.kdnuggets.com/datasets/index.html), government bureaus of statistics, research centers, and other sources. For visual link analysis, the Enron organizational e-mail data set is available at http://www.cs.cmu.edu/~enron/.
URL to Code
This selective rather than inclusive sample of tools is divided into two categories: (i) visual data exploration and mining tools, and (ii) general visualization platforms that can be used to develop visual data mining tools.
Visual Data Exploration and Mining
Open Source
XmdvTool: http://davis.wpi.edu/xmdv/ Offers five methods for displaying flat form data and hierarchically clustered data, and a variety of interactive operations.
GGobi: http://www.ggobi.org/ Offers a similar variety of visualization and interaction methods.
VizRank: http://www.ailab.si:8088/supp/bi-vizrank Offers informative scatter-plot projections of high-dimensional class-labeled data.
3DVDM: http://www.inf.unibz.it/dis/projects/3dvdm/download.html Offers tools for exploring data in 3D virtual reality environments.
Commercial
Miner3D: http://www.miner3d.com/ Highly flexible visual data mining tool with many display and interaction methods.
NetMap Analytics: http://www.netmap.com.au/ Tool for visual link analysis, with separate visual analysis and visual presentation modes (the presentation mode bears some similarity to VisualLinks, http://www.visualanalytics.com).
NetMiner: http://www.netminer.com/ A collection of visual metaphors for interactive analysis of graph structures in data sets.
IN-SPIRE: http://in-spire.pnl.gov/ A visual text miner.
General Visualization Engines and Development Tools
Open Source
VTK: http://www.vtk.org/ The Visualization ToolKit (VTK), a software system for 3D computer graphics, image processing, and visualization.
Mondrian: http://moose.unibe.ch/tools/mondrian An information visualization engine that lets the visualization be specified via a script and hence can be utilized for the development of specialized visual data mining tools.
ParaView: http://www.paraview.org/New/index.html A visualization platform that supports distributed computational models for processing large data sets.
Gapminder: http://www.gapminder.org Based on the original Trendalyzer tool for knowledge extraction from interactive animation of statistical time series; the development has since been taken over by Google.
Cross-references
▶ Data Mining
▶ Human-Computer Interaction
▶ Information Visualization
▶ Knowledge Discovery in Data Bases
▶ Visual Analytics
▶ Visual Clustering
▶ Visual Information Processing
Recommended Reading
1. Ankerst M. Visual Data Mining. Faculty of Mathematics and Computer Science, University of Munich, Munich, 2000.
2. Chen C. Information Visualization: Beyond the Horizon. Springer, London, 2004.
3. Chittaro L., Combi C., and Trapasso G. Data mining on temporal data: a visual approach and its clinical application to hemodialysis. J. Visual Lang. Comput., 14:591–620, 2003.
4. Demšar U. Investigating visual exploration of geospatial data: an exploratory usability experiment for visual data mining. Comput. Environ. Urban., 31:551–571, 2007.
5. de Oliveira M.C.F. and Levkowitz H. From visual data exploration to visual data mining: a survey. IEEE Trans. Vis. Comput. Graph., 9(3):378–394, 2003.
6. Isenberg P., Tang A., and Carpendale S. An exploratory study of visual information analysis. In Proc. SIGCHI Conf. on Human Factors in Computing Systems, 2008.
7. Keim D.A., Mansmann F., Schneidewind J., and Ziegler H. Challenges in visual data analysis. In Proc. Int. Conf. on Information Visualization, 2006.
8. Keim D.A. and North S.C. Visual data mining in large geospatial point sets. IEEE Comput. Graph., 24(5):36–44, 2004.
9. Keim D.A., Sips M., and Ankerst M. Visual data-mining techniques. In Hansen C.D. and Johnson C.R. (eds.). Visualization Handbook. Elsevier, Amsterdam, 2005, pp. 831–843.
10. Kovalerchuk B. and Schwing J. (eds.). Visual and Spatial Analysis: Advances in Data Mining, Reasoning, and Problem Solving. Springer, Dordrecht, 2004.
11. Niggemann O. Visual Data Mining of Graph-Based Data. Department of Mathematics and Computer Science, University of Paderborn, Paderborn, Germany, 2001.
12. Shneiderman B. Inventing discovery tools: combining information visualization with data mining. In Proc. Discovery Science, 2001, pp. 17–28.
13. Simoff S.J., Böhlen M., and Mazeika A. (eds.). Visual Data Mining: Theory, Techniques and Tools for Visual Analytics. Springer, Heidelberg, 2008.
14. Soukup T. and Davidson I. Visual Data Mining: Techniques and Tools for Data Visualization and Mining. John Wiley & Sons, London, 2002.
15. Thomas J.J. and Cook K.A. Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE CS Press, Silver Spring, MD, 2005.
Visual Discovery
▶ Visual Data Mining
Visual Displays of Nonnumerical Data
▶ Visualizing Categorical Data
Visual Displays of Numerical Data
▶ Visualizing Quantitative Data
Visual Formalisms
David Harel, Shahar Maoz
The Weizmann Institute of Science, Rehovot, Israel
Definition
Visual formalisms are diagrammatic and intuitive, yet mathematically rigorous languages. Thus, despite their clear visual appearance, they come complete with a syntax that determines what is allowed, and semantics
that determines what the allowed things mean. The main emphasis in the visuality is typically placed on topological relationships between diagrammatic elements, such as encapsulation, connectedness, and adjacency. Geometric and metric aspects, such as size, shape, line-style, and color, may also be part of the formalism. Icons can be used too. Such languages typically involve boxes and arrows, and are often hierarchical and modular. Visual formalisms are typically used for the design of hardware and software systems. This includes structural as well as more complex behavioral specifications.
Historical Background Two of the oldest examples of visual formalisms are graphs and Venn diagrams, which are both originally due to Euler [7,8]. A graph, in its most basic form, is simply a set of points, called nodes or vertices, connected by lines, called edges or arcs. Its role is to represent a set of elements S and some binary relation R on them. The precise meaning of the relation R is part of the application and has little to do with the mathematical properties of the graph itself. Certain restrictions on the relation R yield special classes of graphs that are of particular interest, such as ones that are connected, directed, acyclic, planar, or bipartite. Graphs are used intensively in all branches of computer science; the elements represented by the nodes range from the most concrete (e.g., physical gates in a circuit diagram) to the most abstract (e.g., complexity classes in a classification scheme), and the edges have been used to represent many kinds of relations, e.g., logical, temporal, causal, or functional. Many of the fundamental algorithms in computer science are related to graphs. Some open problems are related to graphs too, such as the problem of whether graph isomorphism can be decided in polynomial time. A hypergraph is a generalization of a graph, where the relation R is not necessarily binary (in fact, it need not have a fixed arity). The corresponding edges are called hyperedges. Hypergraphs have applications in database theory (see, e.g., [9]). The information conveyed by a graph (or a hypergraph) is nonmetric, and is captured by the purely topological notion of connectedness; shapes, locations, distances, and sizes, for example, have no significance. Euler circles, or Venn diagrams (see [8,25] for the original work), represent sets using the fact that simple closed curves partition the plane into disjoint
inside and outside regions. They thus give the topological notions of enclosure, exclusion, and intersection the obvious set theoretic meanings of being a subset of, being disjoint from, and having a nonempty intersection with, respectively (see Fig. 1). Typical uses of Venn diagrams include only a small number of circles. However, they can be expanded to n-set diagrams (see, e.g., [5]). A higraph [17] is a formalism that further generalizes graphs, in the spirit of Euler circles. It adds not only hyperedges, but Euler/Venn-like topologically related ‘‘blobs’’ for the nodes, as well as a certain kind of partitioning and adjacency. It thus adds to the relation R notions of orthogonality and hierarchy, as well as multilevel and multinode edges (and hyperedges). The term ‘‘visual formalism’’ appears to be first used in the work on statecharts [16,17], a language for the specification and design of reactive systems. Higraphs [17] are the underlying graphical representation for statecharts. In general, visual formalisms provide mechanisms for higher levels of abstraction and thus help engineers in coping with the challenges posed by the ever increasing size and complexity of systems. The effectiveness of visual formalisms also relies on the power of human vision and cognitive perception in recognizing and memorizing visual properties, such as topological/
Visual Formalisms. Figure 1. An example of a Venn diagram.
spatial relationships between visual entities. They thus provide an appealing representation and communication medium. The advent of modern tools for graphical editing and display has also contributed to the increasing popularity of various visual formalisms among practitioners. Moreover, some tools indeed go beyond plain editing and syntax correctness checks, and allow their users to take advantage of the rich semantics of some visual formalisms. Thus, for languages that are sufficiently powerful, some tools offer automated procedures such as code generation and formal verification. These are directly applied to the visual artifacts and thus indeed exploit their full potential. The Unified Modeling Language (UML) [23], an OMG standard formed in the late 1990s, is a collection of visual languages for software and systems specification, defined under a common meta model, and supported by many tools. The standard, however, does not come with satisfactory semantics and hence, by itself, cannot be regarded as a fully fledged visual formalism, though many researchers have provided semantics for many parts thereof.
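The graph and hypergraph notions recalled earlier reduce to very small data structures. The following Python sketch (added as an illustration, not part of the original entry; the element names are invented) represents a graph as a set of elements S with a binary relation R, and a hypergraph by allowing edges of arbitrary arity.

# A graph: a set of elements S and a binary relation R over S
S = {"a", "b", "c", "d"}
R = {("a", "b"), ("b", "c"), ("c", "a")}          # ordinary (binary) edges

# A hypergraph: edges need not be binary, i.e., they may have any arity
hyperedges = [{"a", "b", "c"}, {"c", "d"}, {"a", "b", "c", "d"}]

def neighbors(node):
    """Nodes related to 'node' through the binary relation R."""
    return {y for (x, y) in R if x == node} | {x for (x, y) in R if y == node}

print(neighbors("a"))                        # -> {'b', 'c'}
print(any(len(e) > 2 for e in hyperedges))   # True: not all edges are binary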
Foundations
Some widely used visual formalisms include flowcharts, used for the representation of series of actions and decisions, mainly for simple imperative programs (see [14] for what seems to be the original usage thereof, and [12] for a fundamental paper on verification, which was carried out in the framework of flowcharts); data flow diagrams (DFD) (see, e.g., [13]), used to represent the flow of data through a system; entity relationship diagrams (ERD) (Fig. 2, and see [2] for the original work) and their modern application to objects, object model diagrams (OMD) and class diagrams (part of the UML [23], see Fig. 3); message sequence charts (MSC) [21] and their modal extension, live sequence charts [3,19], for inter-object scenario-based specification; the specification and description language (SDL) [20], for the specification and description of the behavior of reactive and distributed systems; Petri nets (see, e.g., [24], and see Fig. 4); statecharts [16,18] (see Fig. 5); and constraint diagrams [22], used to express logical constraints extended with quantification and navigation of relations.
Visual Formalisms. Figure 2. An example of an entity relationship diagram.
Visual Formalisms. Figure 3. An example of a class diagram.
Statecharts are a prime example of a visual formalism and are therefore briefly discussed below. However, the ideas presented are more general and have indeed been extended to other domains, such as security, databases, and, more generally, UML-style behavioral and structural modeling. In essence, statecharts are finite state machines drawn as state transition diagrams and extended with two key ideas: hierarchy and orthogonality. The hierarchy is introduced using a substating mechanism, and the orthogonality is introduced using a notion of simultaneity. Both are reflected in the graphics themselves, i.e., the hierarchy by encapsulation inspired by Euler/Venn diagrams, and the orthogonality by partitioning blobs into adjacent regions separated by dashed lines. Topological features are a lot more fundamental than geometric ones, in that topology is a more basic branch of mathematics than geometry in terms of symmetries and mappings. One thing being inside another is more basic than it being smaller or larger than the other, or one being a rectangle and the other a circle. Being connected to something is more basic than being green or yellow, or being drawn with a thick line or with a thin line. The brain understands topological features given visually much better than it grasps geometrical ones. The mind can see easily and immediately whether things are connected or not, whether one thing encompasses another, or intersects it, etc. (see, e.g., [15], for early work on this).
Visual Formalisms. Figure 4. An example of a Petri net.
Statecharts are not exclusively visual/diagrammatic. Their non-visual parts include, for example, the events that cause transitions, the conditions that guard transitions, and the actions that are to be carried out when a transition is taken. For these, statecharts borrow from both the Moore and the Mealy variants of state machines, in allowing actions on transitions between states or on entrances to or exits from states, as well as conditions that are to hold throughout the time the system is in a state. Unfortunately, some people tend to take diagrams too lightly, finding it difficult to consider a collection of graphics serious or "formal" enough. A common misconception about formality is that textual and symbolic languages are inherently formal, whereas visual and diagrammatic languages are not, as if one could measure formality by how many Greek letters and mathematical symbols the language contains. This myth, together with the deviously deceptive simplicity of graphics, leads to the so-called doodling phenomenon: a mindset that says that diagrams are what an engineer scribbles on the back of a napkin, but that the real work is done with textual languages.
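To give a rough feel for the non-visual ingredients just mentioned, the following Python fragment encodes a tiny statechart-like machine with nested states, a guard, and an action on a transition. It is a sketch added for illustration only, not the statecharts formalism itself; the state names, events, and guard are invented.

# A tiny statechart-like structure: states may contain substates (hierarchy),
# and transitions carry an event, an optional guard, and an optional action.
machine = {
    "Operating": {                        # superstate
        "substates": ["Idle", "Busy"],
        "initial": "Idle",
    },
    "transitions": [
        {"from": "Idle", "to": "Busy", "event": "job_arrived",
         "guard": lambda ctx: ctx["queue"] > 0,
         "action": lambda ctx: ctx.update(queue=ctx["queue"] - 1)},
        {"from": "Busy", "to": "Idle", "event": "job_done"},
    ],
}

def fire(state, event, ctx):
    """Take the first enabled transition for 'event'; return the new state."""
    for t in machine["transitions"]:
        if t["from"] == state and t["event"] == event:
            if t.get("guard", lambda c: True)(ctx):
                t.get("action", lambda c: None)(ctx)
                return t["to"]
    return state                           # no enabled transition: stay put

ctx = {"queue": 1}
state = machine["Operating"]["initial"]    # start in the Idle substate
state = fire(state, "job_arrived", ctx)    # -> "Busy", queue becomes 0
print(state, ctx)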
Visual Formalisms. Figure 5. An example of a statechart.
An interesting topic related to the use of visual formalisms concerns the layout of diagrams, and more generally, their usability and aesthetics. This includes the development of algorithms for the automatic layout of diagrams (see, e.g., [4]), as well as experimental research towards measuring the usability and effectiveness of different visual languages and the development of aesthetic principles. Another research direction deals with the mechanisms of the language definitions themselves and with the development of new visual languages, for example, domain specific visual formalisms targeted for specific application areas.
Key Applications
Software and Systems Structural Specification
Object model diagrams, and more generally, other structural diagrams, are very popular and are used successfully by software and systems engineers. See the various structural diagrams of the UML [23] and SDL [20].
Reactive Systems Behavioral Specification
Petri nets, statecharts, and variants of message sequence charts are successfully used in the design and development of reactive systems, for example in embedded and real-time safety-critical systems in the automotive, avionics, and telecommunication industries. They are used across the development life cycle of systems, from requirements analysis to specification to simulation to code generation to testing, formal verification, and documentation. In recent years such languages have also been used for the modeling and analysis of biological systems as reactive systems [6,11], which appears to be an application area of great potential.
Data Modeling and Querying
Entity relationship diagrams (ERD) are used for data modeling, or more broadly, for the conceptual specification of databases. Hypergraphs have applications in database theory (see, e.g., [10]). Finally, visual formalisms are also used for the specification of database queries (see [1] for a survey).
Cross-references
▶ Activity Diagrams
▶ Temporal Visual Languages
▶ Visual Query Language
Recommended Reading
1. Catarci T., Costabile M.F., Levialdi S., and Batini C. Visual query systems for databases: a survey. J. Vis. Lang. Comput., 8(2):215–260, 1997.
2. Chen P.P.-S. The entity-relationship model – toward a unified view of data. ACM Trans. Database Syst., 1(1):9–36, 1976.
3. Damm W. and Harel D. LSCs: Breathing Life into Message Sequence Charts. J. Form. Methods Syst. Des., 19(1):45–80, 2001. Preliminary version in P. Ciancarini, A. Fantechi, and R. Gorrieri (eds.). Proc. 3rd IFIP Int. Conf. on Formal Methods for Open Object-Based Distributed Systems, 1999, pp. 293–312.
4. Di Battista G., Eades P., Tamassia R., and Tollis I.G. Graph Drawing: Algorithms for the Visualization of Graphs. Prentice-Hall PTR, Upper Saddle River, NJ, USA, 1998.
5. Edwards A.W.F. Cogwheels of the Mind: The Story of Venn Diagrams. Johns Hopkins University Press, 2004.
6. Efroni S., Harel D., and Cohen I.R. Towards Rigorous Comprehension of Biological Complexity: Modeling, Execution and Visualization of Thymic T Cell Maturation. Gen. Res., 13(11):2485–2497, 2003.
7. Euler L. Commentarii academiae scientiarum Petropolitanae, Vol. 8, 1741.
8. Euler L. Lettres à une Princesse d'Allemagne, Vol. 2, 1772. Letters 102–108.
9. Fagin R. Degrees of acyclicity for hypergraphs and relational database schemes. J. ACM, 30(3):514–550, 1983.
10. Fagin R., Mendelzon A.O., and Ullman J.D. A simplified universal relation assumption and its properties. ACM Trans. Database Syst., 7(3):343–360, 1982.
11. Fisher J., Piterman N., Hubbard E.J.A., Stern M.J., and Harel D. Computational insights into C. elegans vulval development. Proc. Natl. Acad. Sci., 102(6):1951–1956, 2005.
12. Floyd R.W. Assigning meanings to programs. In J.T. Schwartz (ed.). Proc. Symposia in Applied Mathematics, Vol. 19, American Mathematical Society, 1967, pp. 19–32.
13. Gane C.P. and Sarson T. Structured Systems Analysis: Tools and Techniques. Prentice-Hall, Englewood Cliffs, NJ, 1979.
14. Goldstine H.H. and von Neumann J. Planning and Coding of Problems for an Electronic Computing Instrument. Institute for Advanced Study, Princeton, NJ, 1947. Reprinted in von Neumann's Collected Works, Vol. 5, A.H. Taub (ed.). Pergamon, London, 1963, pp. 80–151.
15. Green T.R.G. Pictures of Programs and Other Processes, or How to Do Things with Lines. Behav. Inform. Tech., 1:3–36, 1982.
16. Harel D. Statecharts: a visual formalism for complex systems. Sci. Comput. Program., 8:231–274, 1987.
17. Harel D. On visual formalisms. Commun. ACM, 31(5):514–530, 1988.
18. Harel D. and Gery E. Executable object modeling with statecharts. Computer, July 1997, pp. 31–42.
19. Harel D. and Marelly R. Come, Let's Play: Scenario-Based Programming Using LSCs and the Play-Engine. Springer, Berlin Heidelberg New York, 2003.
20. ITU. ITU-T Recommendation Z.100: Specification and Description Language. Technical report, International Telecommunication Union, 1992.
21. ITU. ITU-T Recommendation Z.120: Message Sequence Charts. Technical report, International Telecommunication Union, 1996.
22. Kent S. Constraint diagrams: visualizing invariants in object-oriented models. In Proc. 12th ACM SIGPLAN Conf. on Object-Oriented Programming Systems, Languages & Applications, 1997, pp. 327–341.
23. Object Management Group (OMG). UML: Unified Modeling Language. Available at: http://www.omg.org.
24. Reisig W. Petri Nets: An Introduction. Monographs in Theoretical Computer Science, An EATCS Series, Vol. 4. Springer, Berlin Heidelberg New York, 1985.
25. Venn J. Symbolic Logic. Macmillan and Co., London, 1881.
Visual Interaction
Maristella Matera
Politecnico di Milano University, Milan, Italy
Synonyms
Visual interaction design; Graphic design
Definition In the field of Human-Computer Interaction (HCI), Visual Interaction refers to the adoption of user interfaces for interactive systems, which make use of visual elements and visual interaction strategies with the aim of supporting perceptual inferences instead of arduous cognitive comparisons and computations. The design of Visual Interaction focuses on the definition of interaction mechanisms through which (i) the users can perform actions on the interactive system by means of visual elements, and (ii) the system can provide feedback to the user, by visually representing the results of the computations triggered by the user actions. The flow of user actions and system feedback over time then has to be coordinated. In the field of Databases, Visual Interaction refers to the adoption of visual interfaces that provide access to the collection of data stored in databases, by means of visual formalisms and mechanisms supporting both the presentation and the interaction with data.
Historical Background The information proliferation that characterized the end of the last century led to an increased use of databases by a continuously growing number of users. As an effect of such a trend, the traditional paradigm to access databases, based on formal, text-based query languages, became inadequate for a large portion of inexperienced users. In 1977, Moshe Zloof proposed QBE (Query By Example) [14], a pioneer query environment where a skeleton of a database relational schema was presented
and, based on it, the users were then able to express queries by entering commands, example elements, and selection conditions in visual tables. QBE was the precursor of a new research line, which fully emerged during the 1990s on the wave of the consensus gained by interactive systems and visual interfaces. In fact, in that period, several research initiatives, both in the HCI and the database field, focused on enhancing the quality of the interaction with databases by replacing the traditional, text-based query languages with visual paradigms. A new generation of visual tools, the so-called Visual Query Systems (VQSs) (see [4] for a survey), was therefore proposed, which promoted visual formalisms and visual interaction mechanisms at the level of both the presentation of and the interaction with the information contained in databases. The most widely adopted visual representations were forms, diagrams, and icons, each one being used in accordance with the nature of the information to be represented or queried. Some systems also adopted 3D visualizations [1] or virtual reality techniques [12]. Following the first proposals for VQSs, which especially focused on the interaction with structured, record-based information, some visual query languages for XML were also proposed [2,9]. However, it was the Web that led to the major change in the interaction with databases, proposing a novel, easy-to-use visual paradigm, which suddenly allowed a huge number of users to access huge amounts of data distributed across several data sources. The Web, today, provides the most widely adopted visual interface for accessing databases.
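As a rough illustration of the QBE idea described above (a sketch added here, not from the original entry; the relation, column names, cell conventions, and helper function are hypothetical and deliberately simplified), a filled-in table skeleton can be translated mechanically into an SQL query.

# A QBE-style specification: one "visual table" per relation; each cell is
# either empty, "P." (print this column), or a condition with an operator.
qbe = {
    "relation": "EMPLOYEE",
    "columns": {"NAME": "P.", "DEPT": "= 'Sales'", "SALARY": "> 30000"},
}

def qbe_to_sql(spec):
    """Mechanically translate the filled grid into a single-table SQL query."""
    printed = [c for c, v in spec["columns"].items() if v.strip() == "P."]
    conditions = [f"{c} {v.strip()}" for c, v in spec["columns"].items()
                  if v.strip() and v.strip() != "P."]
    sql = "SELECT " + (", ".join(printed) or "*") + " FROM " + spec["relation"]
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

print(qbe_to_sql(qbe))
# -> SELECT NAME FROM EMPLOYEE WHERE DEPT = 'Sales' AND SALARY > 30000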
Foundations
In the interaction with databases it is possible to recognize at least two different needs: (i) to let the users understand the database content, and (ii) to allow the users to interact with the content to extract the information they are interested in. From the visual interaction perspective, the main issue is therefore to find effective mechanisms to transmit to the users the information content of the database and to support users during the interaction with such content. This can result, for instance, in using different visual metaphors, defined as a mapping between a data model and a visual model [5], both for the understanding phase and for the interaction phase. Several works have focused on the effectiveness and efficiency of visual metaphors. Some of them (see for
V
example [5,10]) dealt with the effectiveness of schema display, trying to define and formalize the notion of adequate metaphors for representing the database schema, thus supporting the understanding phase. Some other works also covered the effectiveness of the interaction phase. For example, in [3] authors pointed out the need for multi-paradigmatic systems, able to provide querying environments in which adaptive interfaces exploit several visual representations and interaction mechanisms depending on the user capabilities and tasks. In [11] the author instead specifically focused on the problem of automatically generating effective presentations of query results, adapting the data representation to the nature of the data themselves, and also to the tasks that such data are meant to support. The majority of these works were characterized by stratified, multi-component architectures supporting the definition of proper visual formalisms. For example, in [10], authors propose a definition of the visual formalisms adopted for schema display characterized by (i) a data model that captures schemas, (ii) a visual model that captures visualizations, and (iii) a visual metaphor that provides the mapping between data and visual models. Such a stratified definition especially aimed at ensuring flexibility for the selection of the most appropriate metaphor for schema display, and also induced the formalization of many interesting properties. For instance, authors precisely defined the concept of correctness of the visual metaphors (no valid visual schema will map to an invalid data schema and vice versa) through a relationship between the constraints in the data model, and those in the visual model. While the previous work mainly focused on schema display during the understanding phase, in [3] the authors propose a stratified multi-component structure for the architecture of a multi-paradigmatic VQS, supporting both the understanding and the interaction phase. The most notable features were: 1. The provision of a formalism for expressing databases, which is able to represent, in principle, a database expressed in any of the most common data models. 2. The construction and the management of a user model, which allows the system to propose to the users the most appropriate visual representation according to their skills and needs.
3375
V
3376
V
Visual Interaction
3. A transformation layer, defining the translations among the different available representations, in terms of both database content and visual interaction mechanisms, which allows the users to shift from one representation to another according to their preferences. Given the always increasing relevance of XML-based data descriptions, in late 1990s some visual query languages for XML also started to be proposed, offering the advantage of letting users manipulate visual objects instead of managing the complexity of the underlying XML-base query languages (e.g., XQuery) [2]. However, even if such languages employed visual primitives, the proposed interaction paradigms still preserved a tight correspondence with the traditional (textual) query formulation. Therefore, their advantages have still to be proved. An incisive step forward in the visual interaction with databases was, however, determined by the growth of the Internet. The World Wide Web suddenly allowed a huge number of users to easily retrieve and access a large amount of information available in different formats – from unstructured multimedia data to traditional relational data. The Web indeed introduced the information browsing paradigm, a very powerful (from the user experience point of view) visual interaction mechanism, based on the ‘‘one-click’’ selection of hyperlinks to explore the information space. This innovation immediately provided an easier alternative to the traditional, text-based way to access and query databases. Especially with the advent of dynamic Web pages, relational databases were fully integrated into the Web, and Web pages became a composition of ‘‘content units,’’ each one corresponding to an application component in charge of querying at runtime the application data source, and displaying the extracted contents in the page. Additionally, search mechanisms were introduced. Form-based interfaces allowed the users to input (even complex) selection conditions that the Web application was able to transform into suitable SQL queries over the application contents; the query results were then returned into a new Web page, dynamically generated on the fly, typically in the form of an index of linked items. The Web thus definitely became a ‘‘de facto’’ visual interface for databases. The so called ‘‘data-intensive’’ Web applications gained momentum, offering Web interfaces over huge collections of data equipped with
both navigation and search mechanisms, as it happened for on-line trading and e-commerce applications, digital libraries, and several other classes of Web applications interfacing large data sources. Such applications share with the initially proposed VQSs a stratified architecture, being characterized by three orthogonal dimensions: 1. The application data, collecting the contents to be published by the application. 2. The hypertext interface, defining the way in which content units extracted from the application data source are clustered within pages, as well as the navigation mechanisms allowing users to move along the different pages. 3. The presentation style, defining layout and graphic properties for the rendering of both contents and navigation mechanisms within pages. Similarly to VQSs, such a stratified architecture ensures flexibility, by enabling the provision of multiple interfaces over the same content. Hypertext interfaces indeed can be considered views over the application data, and therefore several hypertext structures can be constructed over a same database [7] to satisfy the different requirements characterizing different classes of users, different access devices, or even different situations of use. Also, once a hypertext interface is defined, different presentation styles, defining the layout and the graphic properties of the constituent Web pages, can be applied on it. Therefore, especially in the case of adaptive Web applications, several combinations of hypertexts and presentation styles can be adopted to convey contents in the most appropriate modality with respect to the characteristics of the user or of the usage environment, possibly described in a user or a context model [6]. Recently, the so-called data-driven methods have been proposed for the design of the hypertext interfaces. They fund their approach on data modeling, and in particular on some properties of the data to be published in the hypertext interface, which can have impact on the effectiveness of the interface itself. In particular, the method proposed in [8] aims to enhance content accessibility, which is the ability of a Web application to offer effective and efficient access to its contents, supporting both (i) the identification of the most relevant content entities the application deals with and their mutual relationships (understanding phase), and (ii) the retrieval of the desired content
instances (interaction phase). In Latins, ‘‘rem tene, verba sequentur’’ (manage content, words will follow). Rephrasing this motto, the idea behind this method is that if ‘‘Rem’’ is the information globally available to the application, and ‘‘Verba’’ is the hypertext interface enabling information identification and retrieval through navigation and search, one can say that organizing the application content in accordance with some proper modeling abstractions can induce the definition of hypertext interfaces able to effectively support the information access on the Web. Authors
V
3377
therefore propose some data modeling patterns, called Web Marts, which suggest a data organization where: 1. Some core entities reflect the most important concepts addressed by the application. 2. Some detail entities represent additional information that complete the description of the core concepts. 3. Some access entities then classify the core contents for facilitating their access.
V
Visual Interaction. Figure 1. An example of visual composition of a Yahoo Pipe (http://pipes.yahoo.com/pipes/) (a) and its corresponding rendering (b). The pipe accesses the New York Times homepage (Fetch Feed component), analyzes it (Content Analysis component), and uses the extracted keywords to find related pictures at the Flickr Web site (nesting a For each Replace component with a Flickr component).
3378
V
Visual Interaction
By taking into account such data characterization, it is possible to achieve Web interfaces able to highlight the application core concepts, facilitating their retrieval by means of navigation and search mechanism defined on top of access entities, and support the browsing of the core concepts by means of navigation patterns defined on top of detail entities. In other words, the resulting hypertext interfaces are able to effectively support both the understanding of and the interaction with the Web application data.
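Recapping the dynamic-page mechanism recalled earlier in this section (form conditions translated into a query, with results rendered as an index of linked items), the following Python sketch is illustrative only: the table, columns, and URL scheme are invented, an in-memory SQLite database stands in for the application data source, and a real data-intensive Web application would of course be structured quite differently.

import sqlite3

def search_page(form):
    """Turn submitted form fields into a parameterized query and render the
    result set as a simple HTML index of linked items."""
    conn = sqlite3.connect(":memory:")                     # stand-in data source
    conn.execute("CREATE TABLE product(id INTEGER, name TEXT, price REAL)")
    conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                     [(1, "Laptop", 999.0), (2, "Lamp", 25.0)])

    clauses, params = [], []
    if form.get("name"):
        clauses.append("name LIKE ?")
        params.append("%" + form["name"] + "%")
    if form.get("max_price"):
        clauses.append("price <= ?")
        params.append(float(form["max_price"]))
    sql = "SELECT id, name FROM product"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)

    rows = conn.execute(sql, params).fetchall()
    links = [f'<li><a href="/product/{pid}">{name}</a></li>' for pid, name in rows]
    return "<ul>" + "".join(links) + "</ul>"               # index of linked items

print(search_page({"name": "la", "max_price": "100"}))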
Key Applications Nowadays, principles of Visual Interaction design are essential for the construction of any system providing access to databases. Even more, they constitute the basis for the development of modern data-intensive Web applications, especially considering the modern Web 2.0 paradigm that, besides other advantages, also promotes more advanced interaction mechanisms, fully comparable to the mechanisms traditionally adopted for desktop interactive applications.
Future Directions The current trend in the development of modern Web applications – and in particular of those applications commonly referred to as Web 2.0 applications – clearly points toward a high user involvement in the creation of contents. Users visually interact with Web sites not only as passive actors accessing information, but also as creators of the content instances published by the Web application. The so-called social applications, like MySpace, YouTube, or Wikipedia, are indeed living an indisputable success. However, users are not only more and more being involved in content creation. The phenomenon of Web mashups is now emerging to provide Web environments where the users have the opportunity to create online applications starting from contents and functions that are provided by third parties. Typical components that may be mashed up, i.e., composed, are RSS/Atom feeds, Web services, content wrapped from third party Web sites, or programmable APIs (like the one provided by Google Maps). Online mashup tools, like Yahoo Pipes (http://pipes.yahoo.com/pipes/) or the Google Mashup Editor (http://editor.googlemashups. com/), and a number of other academic research efforts (see for example [13]) confirm this trend, which implies the addition of a further level of interaction in the Web:
users are not only able to visually access data, but they are also able to visually compose the interfaces providing access to data. Figure 1, for example, shows the Yahoo Pipes editor, where a simple application is visually modeled by composing some components that analyze the contents published in the New York Times home page, and then use the retrieved keywords to find related photos at another Web site (www.flickr.com). Since unskilled end users are supposed to "program" the necessary composition logic, special emphasis is therefore currently put on intuitive, visual composition environments that do not require any programming effort. This seems to be one of the major challenges that will be addressed by future academic and industrial research efforts.
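In the spirit of the pipe shown in Fig. 1, a mashup is essentially a composition of fetch, transform, and query components. The following Python sketch is purely illustrative (the component functions are hypothetical placeholders, not real service APIs) and shows only the kind of composition logic a visual mashup editor would generate.

def pipe(*stages):
    """Compose components left to right: the output of one feeds the next."""
    def run(value):
        for stage in stages:
            value = stage(value)
        return value
    return run

# Hypothetical stand-ins for the visual components of Fig. 1
def fetch_feed(url):
    return ["Markets rally as rates hold", "New art exhibit opens downtown"]

def content_analysis(titles):
    return [word.lower() for t in titles for word in t.split() if len(word) > 5]

def image_search(keywords):
    return [f"photo_for_{k}.jpg" for k in keywords]   # placeholder results

news_to_photos = pipe(fetch_feed, content_analysis, image_search)
print(news_to_photos("http://example.org/feed"))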
Cross-references
▶ Visual Interfaces
▶ Visual Metaphor
▶ Visual Query Language
▶ Web 2.0/3.0
Recommended Reading 1. Boyle J. and Gray P.M.D. Design of 3D metaphors for database visualisation. In Proc. IFIP 2.6 3rd Working Conf. on Visual Database Systems. Lausanne, Switzerland, 1995, pp. 157–175. 2. Braga D., Campi A., and Ceri S. XQBE (XQuery By Example): a visual interface to the standard XML query language. ACM Trans. Database Syst., 30(2):398–443, 2005. 3. Catarci T., Chang S.-K., Costabile M.F., Levialdi S., and Santucci G. A graph-based framework for multiparadigmatic visual access to databases. IEEE Trans. Knowl. Data Eng., 8(3): 455–475, 1996. 4. Catarci T., Costabile M.F., Levialdi S., and Batini C. Visual query systems for databases: a survey. J. Visual Lang. Computing, 8(2):215–260, 1997. 5. Catarci T., Costabile M.F., and Matera M. Which metaphor for which database? In People and Computers X, Proc. HCI’95 Conf., 1995, pp. 151–165. 6. Ceri S., Daniel F., Matera M., and Facca F.M. Model-driven development of context-aware web applications. ACM Trans. Internet Technol., 7(1), 2007. 7. Ceri S., Fraternali P., and Matera M. Conceptual modeling of data-intensive web applications. IEEE Internet Computing, 6(4):20–30, 2002. 8. Ceri S., Matera M., Rizzo F., and Demalde´ V. Designing dataintensive web applications for content accessibility using web marts. Commun. ACM., 50(4):55–61, 2007. 9. Comai S., Damiani E., and Fraternali P. Computing graphical queries over XML data. ACM Trans. Inf. Syst., 19(4):371–430, 2001.
10. Haber E.M., Ioannidis Y.E., and Livny M. Foundations of visual metaphors for schema display. J. Intelligent Inf. Syst., 3(3/4):263–298, 1994. 11. Mackinlay J.D. Automating the design of graphical presentations of relational information. ACM Trans. Graphics, 5(2):110–141, 1986. 12. Massari A. and Saladini L. Virgilio: a VR-based system for database visualization. In Proc. Workshop on Advanced Visual Interfaces, 1996, pp. 263–265. 13. Yu J., Benatallah B., Saint-Paul R., Casati F., Daniel F., and Matera M. A framework for rapid integration of presentation components. In Proc. 16th Int. World Wide Web Conference, 2007, pp. 923–932. 14. Zloof M. Query By Example: a database language. IBM Syst. J., 16(4):324–343, 1977.
Visual Interaction Design
▶ Visual Interaction
Visual Interfaces
Tiziana Catarci
University of Rome, Rome, Italy
Synonyms
Graphical user interfaces; GUIs; Direct manipulation interfaces
Definition
Visual Interfaces are user interfaces that make extensive use of graphical objects (icons, diagrams, forms, etc.) that the user may directly manipulate on the screen through several kinds of pointing devices (including her/his fingers) and get almost instantaneous feedback (near real-time interactivity). Information-intensive interfaces typically exploit one or more visual languages, i.e., languages that systematically use visual signs to convey a meaning in a formal way.
Historical Background The importance of providing the user with an interactive environment where she/he could directly manipulate the computer content, and get an immediately graspable feedback out of her/his actions, was evident
V
since the first demonstration of the Sketchpad system by Ivan Sutherland [7]. This system was the basis of his 1963 MIT Ph.D. thesis. SketchPad supported manipulation of visual objects using a light-pen, including grabbing objects, moving them, changing size, and using constraints. It contained the seeds of hundreds of important interface ideas. Later on, in the 1970s, many of the interaction techniques popular in direct manipulation interfaces, such as how objects and text are selected, opened, and manipulated, resulted from research at Xerox PARC. The concept of direct manipulation interfaces for everyone was envisioned by Alan Kay of Xerox PARC in a 1977 article about the Dynabook [4]. The first commercial systems to make extensive use of direct manipulation were the Xerox Star (1981) [6], the Apple Lisa (1982) [10], and Macintosh (1984) [9]. Ben Shneiderman at the University of Maryland coined the term ‘‘direct manipulation’’ in 1982, identified the components, and gave psychological foundations [5]. As the quantity of information available became larger and larger, as well as the potential user community of databases and other information sources, there was a growing need for user interfaces that require minimal technical sophistication and expertise by the users, and support a wide variety of information intensive tasks. For instance, information-access interfaces must offer great flexibility on how queries are expressed and how data are visualized. The general idea was to exploit the well-known high bandwidth of the human-vision channel that allows both recognition and understanding of large quantities of information in no more than a few seconds. Thus, for instance, if the result of an information request can be organized as a visual display, or a sequence of visual displays, the information throughput is immensely superior to the one that can be achieved using textual support. User interaction becomes an iterative query-answer game that very rapidly leads to the desired final result. Conversely, the system can provide efficient visual support for easy query formulation. Displaying a visual representation of the information space, for instance, lets users directly point at the information they are looking for, without any need to be trained into the complex syntax of current query languages. Alternatively, users can navigate in the information space, following visible paths that will lead them to the targeted items. Again, thanks to the visual support users are able to easily understand how to
3379
V
3380
V
Visual Interfaces
formulate queries, and they are likely to achieve the task more rapidly and less prone to errors than with traditional textual interaction modes.
Foundations One fruitful sign system conceived by humans for the purpose of storing, communicating and perceiving essential information is based on bidimensional visual signs, like those contained in pictures, photographs, geographic maps. Visual signs are characterized by a high number of sensory variables: size, intensity, texture, shape, orientation, and color, which all concur to provide details about the information to be communicated [1]. For instance, Tufte suggests many different purposes for which color may be employed: ‘‘. . .to label (color as noun), to measure (color as quantity), to represent or imitate reality (color as representation), and to enliven or decorate (color as beauty). In a geographical map, color labels by distinguishing water from stone and glacier from field, measures by indicating altitude with contour and rate of change by darkening, imitates reality with river blues and shadows hatchures, and visually enlivens the topography quite behind what can be done in black and white alone’’ [8]. Exploiting the multidimensionality of the visual representation allows people to perform in a single instant of perception a visual selection, in the sense that all the meaningful correspondences among the visualized elements are captured. Other constructions, for example a linear text, do not permit this immediate grasping, the entire set of correspondences may only be reconstructed in the user memory. This is also due to the fact that visual representations are processed by human beings primarily in parallel, while text is read sequentially. Visual interfaces capitalize on all the above characteristics of envisioning the information, yet one can say that the basic paradigms adopted for enhancing visual interaction are: 1. The direct manipulation interaction 2. The metaphor 3. The visual representation These three concepts, although separately investigated in the literature, at the present state are so strictly correlated to deeply influence each other, and constitute the foundations for the development of visual interfaces. All three concepts are covered in specific entries of this Encyclopedia.
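As a small illustration of how several sensory variables can encode data simultaneously (a sketch added here, not part of the original entry; the data values are invented and matplotlib is assumed to be available), position, size, and color are each mapped to a different attribute of the same records.

import matplotlib.pyplot as plt

# Invented records: (price, rating, popularity, category)
records = [(10, 4.1, 120, 0), (25, 3.2, 300, 1), (40, 4.8, 80, 0), (55, 2.9, 500, 2)]

x = [r[0] for r in records]            # position on x: price
y = [r[1] for r in records]            # position on y: rating
size = [r[2] for r in records]         # mark size: popularity
color = [r[3] for r in records]        # hue: category

plt.scatter(x, y, s=size, c=color)
plt.xlabel("price")
plt.ylabel("rating")
plt.title("Three visual variables encoding three data attributes")
plt.show()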
A visual language may be defined as a language that systematically uses visual representations to convey a meaning in a formal way. Frequently, people have used visual languages, iconic languages and graphical languages as synonyms. If one follows the above definition they are all visual languages, but it may be more precise to call iconic languages only those using icons and pictures extensively, while diagrammatic (or graphical) languages are those primarily using diagrams (such as graphs, flow-charts, blockdiagrams, etc). When speaking about visual languages, some people allude to languages handling visual objects that are visually presented, i.e., images or pictorial objects of which a visual representation is given, others intend languages whose constructs are visual [3]. Chang calls the first ones visual information processing languages and the others visual programming languages. In the first case, one deals with conventional languages enhanced with subroutine libraries to handle visual objects. Image processing, computer vision, robotics, office automation, etc. are the typical application domains for these languages. Visual programming languages handle objects that do not necessarily have a visual representation. Within the class of visual programming languages, a subclass exists which manages a particular kind of non-visual objects, namely data in databases; this subclass of languages is that of visual query languages (VQLs), i.e., query languages exploiting the use of visual representations. While VQLs are usually adopted for query formulation, information visualization mechanisms are used for displaying the results that constitute the answer to a user request. Information visualization, an increasingly important field of Computer Science, focuses on visual mechanisms designed to communicate clearly to the user the structure of information, and allow her/him to make sense out of large quantities of data. Significant opportunities for information visualization may be found today in data mining and data warehousing applications, which typically access large data repositories. Also, the enormous quantity of information sources on the WWW also calls for visualization techniques.
Key Applications Any kind of traditional database-related activity (e.g., database creation, querying, mining, updating) has been and could be further supported by the usage of suitable visual interfaces, equipped with effective visual
representations, easy-to-use interaction mechanisms and data visualizations. For instance, information-discovery interfaces must support a collaboration between humans and computers, for example, for data mining. Due to humans' limited memory and cognitive abilities, the growing volume of available information has increasingly forced users to delegate the discovery process to computers, greatly underemphasizing the key role played by humans. Discovery should be viewed as an interactive process in which the system gives users the necessary support to analyze terabytes of data, and users give the system the feedback necessary to better focus its search. Visual interfaces are obviously also extremely important for novel database applications dealing with non-traditional data, such as multimedia, temporal, geographical, etc. Moreover, that the user interface has become a crucial component in the success of any modern (information-intensive) application is surely witnessed by the extraordinary growth of the Web: a click on the mouse is all users need to traverse links across the world. However, this much easier access to a huge quantity of varied information has created new problems for the users, related to information digestion and assimilation, which are difficult to achieve if one lets unstructured floods of data collect in perceptually rich and information-overabundant displays. The information flood can turn out to be useful only if it can be converted into a somewhat structured information flow that users can consume. Information Foraging theory has indeed been developed in order to evaluate the valuable knowledge gainable from a system in relation to the cost of the necessary interaction.
Future Directions Issues that are being further explored in modern interfaces include how best to array tasks between people and computers, create systems that adapt to different kinds of users, and support the changing context of tasks. Also, the system could suggest appropriate discovery techniques depending on data characteristics as well as data visualizations, and help integrate what are currently different tools into a homogeneous environment. Modern interfaces use (and will further exploit in the near future) multiple modalities for input and output (speech and other sounds, gestures, handwriting, animation, and video) and multiple screen sizes (from tiny to huge), and have an
‘‘intelligent’’ component (‘‘wizards’’ or ‘‘agents’’ to adapt the interface to the different wishes and needs of the various users). However, a key challenge in interacting with current information-abundant electronic environments is to shift from data-centric computer systems, which have been characteristic of over 40 years of Computer Science, to task-centric ones. Indeed, current desktop-oriented systems propose a mostly disconnected set of generic tools (word processor, e-mail reader, video and image visualizer, etc.). The fact that these tools are running on one system without being connected leads to awkward situations, such as re-entering or copying the same data among the different applications. Nowadays, users have to cope with an ever-growing amount of information, which they have to manage and use in order to perform their everyday tasks. This trend forces users to focus more on managing their information rather than on using it to accomplish their objectives. Managing information basically amounts to saving it, and possibly being able to find (and re-find) it for subsequent reuse. Given the overabundance of available information, the process of saving, finding, and re-finding itself needs to be supported by algorithms and tools more powerful than standard OS file browsers. During recent years, the creation of such tools has been the ultimate goal of the so-called Personal Information Management (PIM) field [9]. However, not much has been done in order to cope with the original problem, i.e., to support the user in executing her/his tasks so as to achieve her/his goals. Instead, it would definitely be beneficial for the user to move from a (static and rigid) object-centric world to a (dynamic and adaptive) task-centric one. Tasks are more of an abstract notion than a computer-manageable entity. Hence, such a move requires designing a system that is able to interpret the user's aims and to support the execution of the user's tasks. Tasks need to be explicitly represented in the system both in their static aspects, i.e., the kind of information that they manipulate, the kind of programs involved in these manipulations, and the security and authentication issues that may arise, and in their dynamic ones, i.e., the sequences of actions that they require, the alternative choices that are given to the user, and the pre-conditions, post-conditions, and invariants for the various activities that are involved in the task. Both static and dynamic aspects are of direct interest to the user and, hence, need to be expressed using explicit semantics that the user can share.
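As a purely illustrative sketch (the class and field names below are hypothetical and not drawn from any cited PIM system), the static and dynamic aspects of such an explicitly represented task could be modeled roughly as follows:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Activity:
    """One step of a task, framed by the conditions that govern its execution."""
    name: str
    pre_conditions: List[Callable[[], bool]] = field(default_factory=list)
    post_conditions: List[Callable[[], bool]] = field(default_factory=list)

@dataclass
class Task:
    """Explicit task representation: static aspects (what is manipulated)
    and dynamic aspects (how the involved activities unfold)."""
    # Static aspects: kinds of information, programs, security issues.
    information_kinds: List[str] = field(default_factory=list)
    programs_involved: List[str] = field(default_factory=list)
    security_constraints: List[str] = field(default_factory=list)
    # Dynamic aspects: ordered activities and invariants holding throughout.
    activities: List[Activity] = field(default_factory=list)
    invariants: List[Callable[[], bool]] = field(default_factory=list)
```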
Experimental Results The key point of any interactive system should be to support people as well as possible in achieving their goals and performing their tasks. To be more precise, the interaction should favor an increase in the efficiency of people performing their duties without this causing extra organizational costs, inconveniences, dangers and dissatisfaction for the user, undesirable impacts on the context of use and/or the environment, or long periods of learning, assistance and maintenance. In the literature, most of the above requirements are collectively associated with the software quality characteristic called usability. Proper usability evaluation needs to be conducted at all stages of the system life cycle in order to: (i) provide feedback which can be used to improve design; (ii) assess whether user and organizational objectives have been achieved; and (iii) monitor long-term use of the product or system. In the early stages of design, emphasis is placed on obtaining feedback that can be used as a guide in the design process, while later, when a realistic prototype is available, it is possible to assess whether user and organizational objectives have been achieved. Since changes in the early stages of design and development are less expensive than in later stages, evaluation has to be started as soon as the first design proposals are available. Depending on the development stage of the project, evaluation may be used to select and validate the design options that best meet the functional and user-centered requirements, elicit feedback and further requirements from the users, or diagnose potential usability problems and identify needs for improvement in the system. Expert evaluation can be fast and economical. It is good for identifying major problems but not sufficient to guarantee a successful interactive system. Controlled experiments with actual users are a key technique to identify usability issues that might have been missed by expert evaluation, since they relate to the real usage of the system and knowledge of the application domain. Generally speaking, evaluation techniques vary in their degree of formality, rigor and user involvement, depending on the environment in which the evaluation is conducted. The choice is determined by financial and time constraints, the stage of the development lifecycle and the nature of the system being developed.
Cross-references ▶ Data Visualization ▶ Diagram
▶ Direct Manipulation ▶ Form ▶ Human-Computer Interaction ▶ Icon ▶ Information Foraging ▶ Multimodal Interfaces ▶ Natural Interaction ▶ Result Display ▶ Usability ▶ Visual Metaphor ▶ Visual Data Mining ▶ Visual Query Language ▶ Visual Interaction ▶ Visual Representation ▶ WIMP Interfaces
Recommended Reading
1. Bertin J. Graphics and Graphic Information Processing. Walter de Gruyter, Berlin, 1981.
2. Catarci T., Dong X.L., Halevy A., and Poggi A. Personal Information Management, chap. Structure Everything. University of Washington Press, Seattle, WA, 2008.
3. Chang S.K. Principles of Pictorial Information Systems Design. Prentice-Hall, Englewood Cliffs, NJ, 1989.
4. Kay A. Personal dynamic media. IEEE Comput., 10(3):31–42, 1977.
5. Shneiderman B. Direct manipulation: a step beyond programming languages. IEEE Comput., 16(8):57–69, 1983.
6. Smith D.C., Harslem E., et al. The Star user interface: an overview. In Proc. 1982 National Computer Conference, 1982, pp. 515–528.
7. Sutherland I.E. SketchPad: a man-machine graphical communication system. In Proc. AFIPS Spring Joint Computer Conference. ACM, New York, NY, 1963.
8. Tufte E.R. Envisioning Information. Graphics Press, Cheshire, Connecticut, 1990.
9. Williams G. The Apple Macintosh computer. Byte, 9(2):30–54, 1984.
10. Williams G. The Lisa computer system. Byte, 8(2):33–50, 1983.
Visual Interfaces for Geographic Data ROBERT LAURINI LIRIS, INSA-Lyon, Lyon, France
Synonyms Cartography; Visualizing spatial data; Interactive capture; Interactive layout
Definition Geographic data are in essence visual multimedia data. Entering these data with keyboards and printing them as tables are not very practical actions. For entry, special devices exist (theodolites, lasers, aerial photos, satellite images, etc.), whereas visual layout is commonly called cartography or mapping. Visual interfaces thus have two facets: they allow the user to obtain output as maps, possibly on very large printers, and to formulate spatial queries visually.
Historical Background In the 1970s, the expression used was ‘‘computer-aided mapping,’’ to emphasize the idea that printing maps was the sole purpose of using computers. Then, in the 1980s, the structuring of the geographic database came to be considered the more important aspect, and the expression ‘‘Geographic Information Systems’’ was coined. From this period on, research has addressed both the presentation of maps and the presentation of queries. For centuries, cartographers have created a large corpus of rules for manual mapping. Those rules were progressively integrated and enlarged to compose maps by computer. For instance, zone hatching was considered a boring task when done manually, whereas with a computer this task is straightforward. In contrast, name placement was considered a relatively easy task, although in computing it is still a challenge.
Foundations It is common to say that ‘‘approximately 80% of all government data has some geographic component’’ (Langlois G., ‘‘Federal Computer Week,’’ Jan. 08, 2001). This statement underlines the importance of geographic information not only for administrations but also for companies, and not only for environmental and urban planning but also for geo-marketing, since all those data are stored in databases, or more specifically, in geographic or spatial databases. In addition to non-spatial attributes, geographic objects are characterized by geometric shapes and coordinates, usually in two dimensions, but increasingly with three dimensions and time. Those characteristics imply three consequences:
1. The necessity of special data models for storage (see the sketch below).
2. The necessity of distinguishing between storage format and layout format.
3. The necessity of specialized interfaces for both querying the databases and presenting the results, essentially by means of cartography.
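To make consequence 1 concrete, a geographic object is typically stored as a geometry plus non-spatial attributes. A minimal sketch in the style of the widely used GeoJSON convention (the coordinates and property names are invented for illustration) is:

```python
# A geographic object: a geometric shape with coordinates plus non-spatial attributes.
town_hall = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",                     # footprint of the building
        "coordinates": [[                      # one outer ring of [longitude, latitude] pairs
            [4.835, 45.767], [4.836, 45.767],
            [4.836, 45.768], [4.835, 45.768],
            [4.835, 45.767],                   # the ring is closed by repeating the first point
        ]],
    },
    "properties": {                            # non-spatial attributes
        "name": "Town hall",
        "height_m": 22.0,
    },
}
```

The storage format keeps full-precision coordinates; any layout format derived from it for a given scale is a separate artifact, which is consequence 2.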
The scope of this overall presentation is to study those interfaces. But before presenting them, it is necessary to emphasize that geographic data are not entered via keyboard but are acquired by different devices, such as theodolites, aerial photos, satellite images, and more recently, GPS (Global Positioning System). These devices are essentially characterized by different resolutions and error levels: while theodolites can measure objects with an accuracy of less than 0.1 mm, some satellite or GPS systems have accuracies on the order of 100 m. On the other hand, according to the size of the screens used and the size of the territory to be mapped, different scales must be used. In other words, the resulting map must be simplified or, more exactly, generalized in order to reach readability and completeness as it passes from the storage format to some layout format. This discussion is organized as follows: after a short introduction, visual interfaces for cartographic output are examined first, and then interfaces for querying. A small section on new challenges for mobile handheld devices concludes. Visual Interfaces for Cartographic Output
A map is the classical way to respond to a spatial query or to lay out the results of some spatial analysis. Over the centuries, cartographers and geographers have elaborated sets of knowledge and know-how to make maps. One of the main problems is known as generalization, i.e., the way to simplify a map; for instance, a country that is described in a geographic database with 1 million points/segments must be presented in a thumbnail with only 30 points/segments. Another aspect is called graphic semiology, i.e., the way to select the symbology for mapping. Generalities A fundamental rule in conventional cartography states that any object that is reduced, after scaling, to less than 0.1 mm cannot be mapped. For instance, a house or a road that is 10 m wide will be represented by 1 cm at 1:1000 scale, whereas at 1:100,000 scale it will not be represented at all. Another rule concerns details to be aggregated. An example would be two houses separated by a road: at some scales they will be represented separately, whereas at other scales they will be aggregated. As a consequence, at some scales a city is seen as a collection of separate houses, then as a set of city blocks, then as a compact area, and finally as a point.
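The 0.1 mm rule reduces to simple scale arithmetic; the sketch below (the function name and threshold constant are illustrative) reproduces the example of the 10 m wide object:

```python
MIN_MAPPABLE_MM = 0.1   # conventional cartographic threshold mentioned above

def size_on_map_mm(real_size_m: float, scale_denominator: int) -> float:
    """Size of an object on the map, in millimetres, at scale 1:scale_denominator."""
    return real_size_m * 1000.0 / scale_denominator

for scale in (1_000, 100_000):
    size = size_on_map_mm(10.0, scale)   # a house or road 10 m wide
    verdict = "drawn" if size > MIN_MAPPABLE_MM else "dropped or aggregated"
    print(f"at 1:{scale}: {size} mm on the map -> {verdict}")
# at 1:1000   -> 10.0 mm (i.e., 1 cm), so the object is drawn
# at 1:100000 -> 0.1 mm, at the threshold, so the object is not represented
```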
Visual Interfaces for Geographic Data. Figure 1. Example of generalization (1:25000, 1:35000, 1:50000). Source: http://recherche.ign.fr/labos/cogit/arGIGA.php [2].
Generalization
Through generalization, a piecewise line can be simplified into one with far fewer pieces, especially by using the Douglas-Peucker algorithm [5] or variants based on multi-agent systems [3]. Figure 1 gives some examples.
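The recursive idea behind the Douglas-Peucker simplification can be sketched in a few lines. The following is a didactic re-implementation (not the code of the cited works, and without the robustness or performance tuning of real cartographic libraries):

```python
import math

def douglas_peucker(points, tolerance):
    """Simplify a polyline: keep the endpoints, and recursively keep any point
    whose distance from the chord between them exceeds the tolerance."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    chord = math.hypot(x2 - x1, y2 - y1)

    def dist(p):
        # Perpendicular distance from p to the chord (or to the endpoint if degenerate).
        if chord == 0:
            return math.hypot(p[0] - x1, p[1] - y1)
        return abs((x2 - x1) * (y1 - p[1]) - (x1 - p[0]) * (y2 - y1)) / chord

    index, d_max = max(((i, dist(p)) for i, p in enumerate(points[1:-1], start=1)),
                       key=lambda t: t[1])
    if d_max <= tolerance:
        return [points[0], points[-1]]          # all intermediate points can be dropped
    left = douglas_peucker(points[:index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right                    # avoid duplicating the split point

# A wiggly line simplified with a 0.5-unit tolerance:
line = [(0, 0), (1, 0.1), (2, -0.1), (3, 2), (4, 2.1), (5, 0)]
print(douglas_peucker(line, 0.5))
```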
Graphic Semiology
Graphic semiology, invented by J. Bertin [1], is the study of the meaning of graphics. In other words, it deals with the signification of drawings, the choice of captions, symbols and icons, together with a methodology to transmit visual messages. Six visual variables have been proposed for representing spatial objects (see Fig. 2) – shape, size, orientation, pattern, hue and value. For instance, to represent a church, a cross symbol (shape) can be used; the symbol's size can be selected according to the initial size of the church, etc.
Visual Interfaces for Geographic Data. Figure 2. Bertin's visual variables. Source: http://atlas.nrcan.gc.ca/site/english/learningresources/carto_corner/vis_var.gif/image_view.
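As a purely illustrative sketch of how data attributes can be encoded in these visual variables (the data values and the mapping choices are invented), a simple symbol map could be drawn with matplotlib as follows:

```python
import matplotlib.pyplot as plt

# Invented data: church locations, their footprint area, and century of construction.
lon  = [4.83, 4.85, 4.87]
lat  = [45.76, 45.75, 45.77]
area = [400, 900, 250]          # m^2 -> mapped to the "size" variable
age  = [12, 17, 19]             # century -> mapped to the "hue/value" variable

plt.scatter(lon, lat,
            s=[a / 2 for a in area],   # size encodes footprint area
            c=age, cmap="viridis",     # color encodes age
            marker="+")                # shape: a cross symbol for churches
plt.colorbar(label="century of construction")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.show()
```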
Animation
For some applications, animated cartography can be used, characterized by movement of objects, flickering, mutation, modification of shape or color, velocity, etc. An excellent example is an animated weather-forecast map, in which iconized clouds move slowly.
Chorems
Chorems are a new way of representing schematized territories. Indeed, for several applications, it is not necessary to reproduce the complete database contents, but rather to map only the more important aspects. Chorems used to be designed manually, but by means of spatial and spatio-temporal data mining, geographic patterns can now be discovered and mapped. In other words, chorems are a new visual representation of geographic database summaries [4] and a way to represent geographic knowledge. The following example (Fig. 3) emphasizes the water problem in Brazil: it is easy to see that a conventional river map (Fig. 3a) does not show
the more crucial aspects as given in Fig. 3b with caption in Fig. 3c. Query Input
Visual interfaces for GIS are used not only for cartography, but also for query input, i.e., by means of interaction. Present systems can be classified into three categories:
1. Textual queries, for instance using spatial extensions of SQL/ORACLE
2. Tabular queries, by means of forms in which some queries are pre-programmed
3. Graphic queries based on widgets such as icons, mice, and clicks
Some basic queries, such as point-in-polygon, region or buffer queries, are usually expressed visually and interactively (a sketch of the point-in-polygon test is given after Fig. 4). However, more complex queries are also common, such as queries regarding
Visual Interfaces for Geographic Data. Figure 3. The water problem in Brazil using a conventional river map (a) and a chorem map (b), taken from [5].
Visual Interfaces for Geographic Data. Figure 4. Example of a visual query asking for ‘‘all cities crossed by a river’’ [6].
intersection or adjacency by means of Egenhofer's spatial relations. For instance, the LVIS system [2] is a visual system based upon those relations (see Fig. 4).
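Behind the visually specified point-in-polygon query mentioned above there is usually a simple geometric test. A common ray-casting sketch (a generic textbook formulation, not the algorithm of LVIS or of any particular GIS product) is:

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        crosses = (y1 > y) != (y2 > y)                       # edge spans the ray's y value
        if crosses and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside                              # each crossing toggles the state
    return inside

region = [(0, 0), (4, 0), (4, 3), (0, 3)]    # a rectangular region of interest
print(point_in_polygon(2, 1, region))        # True: the point lies inside
print(point_in_polygon(5, 1, region))        # False: the point lies outside
```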
However, a fourth method seems more interesting, based on a tangible table in which several persons can collaborate. Figure 5 shows such a table (from the Geodan Company); see [11] for details.
Visual Interfaces for Geographic Data. Figure 5. Example of a tangible table from the Geodan Company (http://www.geodan.nl).
Final Remarks: Challenges for Small Mobile Devices
In the early 1970s, the main problem was producing maps; then, increasingly, maps were seen as results of spatial queries or of spatial analysis techniques. However, while it is simple to lay out maps on conventional or very big screens, the screen size of the new handheld mobile devices requires discovering new modes of representing maps and interacting with them, essentially for Location-Based Services. A very common example is the need to represent the way to go from one location to another, as shown in Fig. 6. These new mobile handheld devices will imply new techniques for visualizing geographic data, essentially because of screen size.
Visual Interfaces for Geographic Data. Figure 6. Example of a map presented on a mobile device.
Key Applications Any domain of cartography, from urban and environmental planning to geology, archaeology, real estate mapping, location-based services, etc.
Cross-references
▶ Cartography ▶ Geographical Databases ▶ Geographic Information System ▶ Spatial Network Databases ▶ Spatial Indexing Techniques ▶ Spatial Information System
Recommended Reading
1. Bertin J. Sémiologie graphique. La Haye, Mouton, 1970.
2. Bonhomme C., Trepied C., Aufaure M.A., and Laurini R. A visual language for querying spatio-temporal databases. In Proc. 7th Int. Symp. on Advances in Geographic Inf. Syst., 1999, pp. 34–39.
3. Cécile D. Généralisation Cartographique par Agents Communicants: Le modèle CartACom. PhD Dissertation, University Paris VI, 2004.
4. Del Fatto V., Laurini R., Lopez K., Loreto R., Milleret-Raffort F., Sebillo M., Sol-Martinez D., and Vitiello G. Potentialities of chorems as visual summaries of spatial databases contents. In Proc. 9th Int. Conf. Visual Information Systems, 2007, pp. 537–548.
5. Douglas D. and Peucker T. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. The Can. Cartographer, 10(2):112–122, 1973.
6. Kraak M.-J. and Brown A. Web Cartography. CRC, Boca Raton, FL, 2000, 208pp.
7. Kraak M.-J. and Ormeling F. Cartography: Visualization of Geospatial Data (2nd edn.). Pearson Education, NJ, 2003, 205pp.
8. Lafon B., Codemard C., and Lafon F. Essai de chorème sur la thématique de l'eau au Brésil. http://histoire-geographie.ac-bordeaux.fr/espaceeleve/bresil/eau/eau.htm, 2005.
9. Laurini R. Information Systems for Urban Planning: A Hypermedia Co-operative Approach. Taylor and Francis, London, 2001, 308pp.
10. Laurini R. and Thompson D. Fundamentals of Spatial Information Systems. Academic Press, San Diego, CA, 1992.
11. van Borkulo E., Barbosa V., Dilo A., Zlatanova S., and Scholten H. Services for an emergency response system in The Netherlands. Second Symposium on Gi4DM, Goa, 2006.
Visual Metaphor MARIA FRANCESCA COSTABILE 1, ALAN F. BLACKWELL 2 1 University of Bari, Bari, Italy 2 University of Cambridge, Cambridge, UK
Synonyms Metaphor; Analogy
Definition Metaphor is a figure of speech, whose essence is to make a comparison between things that are not literally the same. For example, saying ‘‘that man is a lion’’ asks the reader to imagine how a topic (the man) could be reinterpreted in terms of some vehicle (a lion). In Graphical User Interfaces (GUIs), visual metaphor refers to a kind of analogy, by which designers present the user with an explanation of system behavior in terms of some image. In the early days of the GUI, attempts were made to present all computer behavior in terms of real world analogies, as in the case of the desktop metaphor. However, these large scale metaphors soon broke down, as the challenge of developing and maintaining whole systems of correspondence became apparent. Attempts to replicate the success of the desktop metaphor have failed [1].
In practice, UI designers now use the term to refer to visual conventions and genres that, although familiar, need not resemble any real-world objects (e.g., a dialog box or flowchart). The essential benefits of the original desktop metaphor are those provided by direct manipulation. Occasionally, real world analogies are successfully used to introduce real world models for new system behaviors (e.g., the shopping basket metaphor in e-commerce). In all cases, user acceptance is encouraged by the use of familiar visual forms, whether pictorial (an icon such as a picture of a shopping basket), or an abstract visual formalism (conventional layout of dialog boxes, conventional nodes and links for a flowchart).
Key Points A common view in cognitive science is that all abstract thought is based on physical metaphors [3] (e.g., adding ‘‘up’’). Computers also use expressions such as ‘‘cut and paste’’ to convey intended usage by analogy to existing technology. Once such figures of speech become widespread, they are unlikely to change. The best principle for visual metaphor design, therefore, is to present users with concepts and terminology that are familiar, whether from the physical world or from previous experience with computers. In creative product design, metaphor can also help designers to imagine novel forms. This more literary strategy is open to interpretation, and may provide less direct guidance to users of the final product. In the database area, visual formalisms present abstract database concepts to users. GUIs and the growth of database users in the late 80s pushed towards the development of visual query systems, i.e., ‘‘systems for querying databases that use a visual representation to depict the domain of interest and express related queries’’ [2]. A visual representation combines visual formalisms (diagrams, icons, forms) that can be interpreted from experience of existing representation genres – the more familiar and appropriate, the easier it is for the user to understand the usage intended by the designer. Visual metaphors create expectations about system functionality that, if not fulfilled, can disorientate users. Poor use of metaphor has caused some significant failures of software products. Since there are several ways of creating and interpreting metaphor, the effectiveness of any interface metaphor must be analyzed and empirically evaluated, firstly to ensure
that it is consistent with principles of direct manipulation, and secondly to determine whether users recognize and are familiar with the intended visual formalism.
Cross-references
▶ Direct Manipulation ▶ Visual Formalisms ▶ Visual Interaction ▶ Visual Query Language ▶ Visual Representation
Recommended Reading
1. Blackwell A.F. The reification of metaphor as a design tool. ACM Trans. Comput. Hum. Interact., 13(4):490–530, 2006.
2. Catarci T., Costabile M.F., Levialdi S., and Batini C. Visual query systems for databases: a survey. J. Vis. Lang. Comput., 8:215–260, 1997.
3. Lakoff G. and Johnson M. Metaphors We Live By. University of Chicago Press, Chicago, 1980.
Visual Mining ▶ Visual Clustering
Visual Multidimensional Analysis ▶ Visual On-line Analytical Processing (OLAP)
Visual On-Line Analytical Processing (OLAP) MARC H. SCHOLL, SVETLANA MANSMANN University of Konstanz, Konstanz, Germany
Synonyms Visual multidimensional analysis; Interactive visual exploration of multidimensional data
Definition An umbrella term encompassing a new generation of OLAP (On-Line Analytical Processing) end-user tools for interactive ad-hoc exploration of large multidimensional data volumes. Visual OLAP provides a comprehensive framework of advanced visualization techniques for representing the retrieved data set along with a powerful navigation and interaction scheme for specifying, refining, and manipulating the subset of interest. The concept emerged from the convergence of BI (Business Intelligence) techniques and the achievements in the areas of Information Visualization and Visual Analytics. Traditional OLAP frontends, designed primarily to support routine reporting and analysis, use visualization merely for expressive presentation of the data. In the visual OLAP approach, however, visualization plays the key role as the method of interactive query-driven analysis. Comprehensive analysis includes a variety of tasks such as examining the data from multiple perspectives, extracting useful information, verifying hypotheses, recognizing trends, revealing patterns, gaining insight, and discovering new knowledge from arbitrarily large and/or complex data volumes. In addition to conventional operations of analytical processing, such as drill-down, roll-up, slice-and-dice, pivoting, and ranking, visual OLAP supports further interactive data manipulation techniques, such as zooming and panning, filtering, brushing, collapsing etc.
Historical Background
OLAP emerged in the mid-90s as a technology for interactive analysis of large volumes of accumulated and consolidated business data. The underlying multidimensional data model allows users to view data from different perspectives by shaping it into multidimensional cubes of measurable facts, or measures, stored as the cube's cells and indexed by a set of descriptive categories, called dimensions. Member values within a dimension may be further arranged into a classification hierarchy (e.g., city → state → country) to enable additional aggregation levels. A data cube in a real-world application may hold millions of fact entries characterized by up to 20 dimensions, each supplied with a single or multiple classification hierarchies of various depth. Obviously, to gain insight into such huge and complex data, analysts need mechanisms for projecting the original set onto a two-dimensional (or at most three-dimensional) space and reducing it to a perceivable number of items. First proposals on using visualization for exploring large data sets were not tailored towards OLAP applications, but addressed the generic problem of visual
querying of large data sets stored in a database. Keim and Kriegel [3] proposed VisDB, a visualization system based on a new query paradigm. In VisDB, users are prompted to specify an initial query. Thereafter, guided by a visual feedback, they dynamically adjust the query, e.g., by using sliders for specifying range predicates on single attributes. Retrieved records are mapped to the pixels of the rectangular display area, colored according to the degree of their conforming to the specified set of selection predicates, and positioned according to a grouping or ordering directive. A traditional interface for analyzing OLAP data is a pivot table, or cross-tab, which is a multidimensional spreadsheet produced by specifying one or more measures of interest and selecting dimensions to serve as vertical (and, optionally, horizontal) axes for
summarizing the measures. The power of this presentation technique comes from its ability to summarize detailed data along various dimensions, and to arrange the aggregates computed at different granularity levels into a single view, preserving the ‘‘part-of’’ relationships between the aggregates. Figure 1 exemplifies the idea of ‘‘unfolding’’ a three-dimensional data cube (Fig. 1a) into a pivot table (Fig. 1b), with cells of the same granularity marked with matching background color in both representations. However, pivot tables are inefficient for solving non-trivial analytical tasks, such as recognizing patterns, discovering trends, identifying outliers, etc. [1,4,11]. Visualization has the power to save time and reduce errors in analytical reasoning by utilizing the phenomenal abilities of the human vision system to recognize patterns [2].
Visual On-Line Analytical Processing (OLAP). Figure 1. Projecting a multidimensional data cube onto a pivot table. (a) A sample three-dimensional data cube for storing sales transactions as measures Quantity and Amount characterized by dimensions Product, Date, and Store. (b) A pivot table view of sales data (measures Quantity and Amount) broken down vertically by Product and Date and horizontally by Store.
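The cube-to-pivot-table mapping of Fig. 1 can be reproduced with any aggregation engine. A minimal sketch with pandas (the sample rows are invented; the column names follow the figure) is:

```python
import pandas as pd

# Fact table: one row per sales transaction (as in Fig. 1a).
sales = pd.DataFrame({
    "Product":  ["Milk", "Milk", "Bread", "Bread"],
    "Date":     ["2008-01", "2008-02", "2008-01", "2008-02"],
    "Store":    ["Store A", "Store B", "Store A", "Store B"],
    "Quantity": [10, 7, 5, 12],
    "Amount":   [15.0, 10.5, 6.0, 14.4],
})

# Pivot table (as in Fig. 1b): measures broken down by Product/Date (rows) and Store (columns).
# margins=True adds an "All" row and column holding the totals at coarser granularity.
pivot = pd.pivot_table(sales,
                       values=["Quantity", "Amount"],
                       index=["Product", "Date"],
                       columns="Store",
                       aggfunc="sum",
                       margins=True)
print(pivot)
```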
OLAP interfaces of the current state of the art enhance the pivot table view by providing a set of popular business visualization techniques, such as bar-charts, pie-charts, and time series, as well as more sophisticated layouts, such as scatterplots, maps, treemaps, cartograms, matrices, grids, etc. and vendors’ proprietary visualizations (e.g., decomposition trees or fractal maps). Some tools go beyond mere visual presentation of data and propose sophisticated approaches inspired by the findings in information visualization research. Prominent examples of advanced visual systems are Advizor by Visual Insights [1] and Tableau by Tableau Software [2]. Advizor implements a technique that organizes the data into three perspectives. A perspective is a set of linked visual components displayed together on the same screen. Each perspective focuses on a particular type of analytical task, such as (i) single measure view using a 3D multiscape layout, (ii) multiple measures arranged into a scatterplot, and (iii) anchored measures presented using techniques from multidimensional visualization (box plots, parallel coordinates, etc.). Tableau is a commercialized successor of Polaris, a visual tool for multidimensional analysis developed by a research team of Pat Hanrahan at Stanford University [12]. Polaris inherits the basic idea of the classical pivot table interface that maps aggregates onto a grid defined
by dimension categories assigned to the grid’s rows and columns. However, Polaris uses embedded graphical marks rather than textual numbers in the table cells. The types of supported graphics are arranged into a taxonomy, comprising rectangle, circle, glyph, text, Gantt bar, line, polygon, and image layouts. Russom [10] summarizes the trends in business visualization software as a progression from rudimentary data visualization to advanced forms and proposes to distinguish three life-cycle stages of visualization techniques, such as maturing, evolving, and emerging, as depicted in Fig. 2. Within this classification, visual OLAP clearly fits into the emerging techniques for advanced interaction and visual querying. Ineffective data presentation is not the only deficiency of conventional OLAP tools. Further problems are cumbersome usability and poor exploratory functionality. Visual OLAP addresses those problems by developing fundamentally new ways of interacting with multidimensional aggregates. A new quality of visual analysis is achieved by unlocking the synergy between the OLAP technology, information visualization, and visual analytics.
Foundations A successful OLAP tool is capable of supporting a wide variety of analytical tasks. As mentioned in the previous
Visual On-Line Analytical Processing (OLAP). Figure 2. Trends in data visualization for business analysis (adapted with modifications from [7]).
section, different tasks are best solved by applying different visual presentations. OLAP tools account for this diversity by providing a comprehensive framework, which enables users to interactively generate satisfactory visual presentations. The overall query specification cycle evolves by (i) selecting a data source of interest, (ii) choosing a visualization technique (e.g., a scatterplot), and then (iii) mapping various data attributes to that technique's structural elements (e.g., horizontal and vertical axes) as well as to other visual attributes, such as color, shape, and size. The main elements of the framework are the navigation structure for visual querying of data sources, a taxonomy of available visual layouts and attributes, and a toolkit of interaction techniques for dynamic query refinement and visual representation of the data. A unified framework is obtained by designing an abstraction layer for each element and providing mapping routines (e.g., navigation events to database queries, query results to a visual layout, etc.) implementing the interaction between different layers. Visual Query Specification
Visual OLAP disburdens the end-user from formulating queries in the ‘‘raw’’ database syntax by allowing purely visual (i.e., by means of a computer mouse) query specification. Data cubes are represented as a browsable structure whose elements can be queried by ‘‘pointing-and-clicking’’ and ‘‘dragging-and-dropping.’’ The visual interface does not trade off advanced functionality for simplicity; rather, it facilitates the process of specifying ad hoc queries of arbitrary complexity.
A common navigation paradigm is that of a file browser that represents the contents as a recursive containment of elements. The nodes in the navigation hierarchy may be of the types database, schema, table (cube), dimension, classification level, and measure. In simplified configurations, the navigation may be limited to a single data cube and, thus, consist solely of dimensions and measures. Dimension hierarchies are presented as a recursive nesting of their classification levels, thereby allowing users either to browse directly in the dimension's data or to explore its hierarchy schema. In the former approach, denoted extension-based, the navigation tree of a dimension is a straightforward mapping of the dimension's data hierarchy: each hierarchical value is a node that can be expanded to access the contained next-level values. Alternatively, the navigation can explicitly display the dimension schema, with each level as a child node of its parent level. This so-called intension-based approach is especially appreciated for power analysis and for employing advanced visualization techniques. The latter navigation strategy becomes the only option when supporting multiple and heterogeneous dimension hierarchies [7] and a multi-cube join or drill-across [6]. Figure 3 shows the difference between instance-based and schema-based browsing for a hierarchical dimension Period. To navigate in multidimensional aggregates and perform dimensional reduction to extract data for analysis, OLAP defines a number of standard multidimensional analysis operations incorporated into a visual framework in the form of navigation events and interaction options.
Visual On-Line Analytical Processing (OLAP). Figure 3. Browsing options for hierarchical dimensions: extension versus intension navigation. (a) dimension instances arranged in a navigation hierarchy. (b) hierarchy of dimension categories with on-demand data display and an option to switch to extension navigation.
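The two browsing modes of Fig. 3 can be contrasted on a toy Period dimension. In the sketch below (plain Python, with invented member values), the intension is the ordered list of levels, while the extension is the tree of member values that an instance-based navigation widget would unfold:

```python
# Intension (schema-based navigation): the ordered levels of the Period dimension.
period_levels = ["Year", "Quarter", "Month"]

# Extension (instance-based navigation): the tree of member values.
period_members = {
    "2007": {"Q1": ["Jan", "Feb", "Mar"], "Q2": ["Apr", "May", "Jun"]},
    "2008": {"Q1": ["Jan", "Feb", "Mar"]},
}

def expand(node, depth=0):
    """Render the extension tree the way a navigation widget would unfold it."""
    for value, children in node.items():
        print("  " * depth + value)
        if isinstance(children, dict):
            expand(children, depth + 1)
        else:
            for leaf in children:
                print("  " * (depth + 1) + leaf)

expand(period_members)   # drilling into 2007 -> Q1 -> Jan, Feb, Mar, ...
```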
User interactions are translated into valid database queries. From the user's perspective, querying is done implicitly by populating the visualization with data and incrementally refining the view. The output of any OLAP query is a data cube. Visual presentation is generated by assigning the cube's elements – measure values and their dimensional characteristics – to the visual variables of the display. A visualization technique is defined by the visual variables, or attributes, it employs and the way those constructs are combined. Examples of visual attributes are position, shape, size, color, and orientation. Typically, each of the above visual attributes is used to represent a single data field returned by a query. Various attributes behave differently with respect to the range, data type, and the number of values they can meaningfully represent. The mapping of the cube's structure to the navigation scheme as well as the translation of navigation events into database queries rely on the metadata of the underlying data warehouse system. Metadata describes the structure of the cubes and their dimensions, measures, and applicable aggregation functions. In an advanced user interface, the analysts are able to define new measures in addition to the pre-configured ones. New measures are obtained by applying a different aggregate function (e.g., average, variance, count) or by specifying a more complex formula over a single or multiple measure attributes (e.g., computing a ratio between two aggregates).
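To make the translation of navigation events into database queries concrete, here is a hedged sketch (the fact-table and column names are invented; actual OLAP engines generate such statements through their metadata layer, often in MDX rather than plain SQL):

```python
def aggregate_query(measure, agg, group_by_levels, fact_table="sales_fact"):
    """Build an aggregate SQL query string for the current navigation state."""
    levels = ", ".join(group_by_levels)
    return (f"SELECT {levels}, {agg}({measure}) AS value "
            f"FROM {fact_table} GROUP BY {levels}")

# Initial view: total sales amount per year.
print(aggregate_query("amount", "SUM", ["year"]))
# Drill-down: the user expands a year node, so the quarter level is added.
print(aggregate_query("amount", "SUM", ["year", "quarter"]))
# A user-defined measure: average amount instead of the pre-configured sum.
print(aggregate_query("amount", "AVG", ["year", "quarter"]))
```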
Visualization Techniques
The task of selecting a proper visualization technique for solving a particular problem is far from trivial, as visual representations (metaphors) may be not only task dependent but also domain dependent. Successful visual OLAP frameworks need to be based on a comprehensive taxonomy of domains, tasks, and visualizations. The problem of assisting the analyst in identifying an appropriate visualization technique for a specific task is an unsolved issue in state-of-the-art OLAP tools. Typically, a user has to find an appropriate solution manually by experimenting with different layout options. To support a large set of diverse visualization techniques and to enable dynamic switching from one technique to another, an abstraction layer has to be defined for specifying the relationships between the data and its visual presentation. Maniatis et al. propose an abstraction layer solution, called a Cube Presentation
Model (CPM) [5], which distinguishes between two layers: a logical layer deals with data modeling and retrieval, whereas a presentation layer provides a generic model for representing the data (normally, on a 2D screen). The entities of the presentation layer include points, axes, multicubes, slices, tapes, cross-joins, and content functions. The authors demonstrate how CPM constructs can be mapped to advanced visual layouts using the example of the Table Lens – a technique based on a cross-tabular paradigm with support for multiple zoomable windows of focus. A common approach to visualization in OLAP applications relies on a set of templates, wizards, widgets, and a selection of visual formats. Hanrahan et al. [2] argue, however, that an open set of questions cannot be addressed by a limited set of techniques, and choose a fundamentally different approach for their visual analysis tool Tableau: a declarative visual query language, VizQL™, offers high expressiveness and composability by allowing users to create their own visual presentations by combining various visual components. Figure 4 illustrates the visualization approach of Tableau by showing just a small subset of sophisticated visual presentations, created using simple VizQL statements and not relying on any pre-defined template layout. The designers of Tableau deliberately restrict the set of supported visualizations to popular and proven ones, such as tables, charts, maps, and time series, doubting the general utility of exotic visual metaphors [2]. As a result, Tableau's approach is constrained to generating grids of visual presentations of uniform granularity and limited dimensionality.
Visual On-Line Analytical Processing (OLAP). Figure 4. Examples of sophisticated multidimensional visualizations generated by simple VizQL statements (used by permission of Tableau Software, Inc.).
implementation of multiscale visualizations within the framework of the Polaris system [13]. The underlying visual abstraction is that of a zoom graph that supports multiple zooming paths, where zooming actions may be tied to dimension axes or triggered by a different type of interaction. Lee and Ong propose a multidimensional visualization technique that adopts and modifies the Parallel Coordinates method for knowledge discovery in OLAP [4]. The main advantage of this technique is its scalability to virtually any number of dimensions. Each dimension is represented by a vertical axis and the aggregates are aligned along each axis in form of a bar-chart. The other side of the axis may be used for generating a bar-chart at a higher level of detail. Polygon lines adopted from the original Parallel Coordinates technique are used for indicating relationships among the aggregates computed along various dimensions (a relationship exists if the underlying sets of fact entries in both aggregates overlap). Mansmann and Scholl concentrate on the problem of losing the aggregates computed at preceding query steps while changing the level of detail and propose to use hierarchical layouts for capturing the results of multiple decompositions within the same display [9]. The authors introduce a class of multiscale visual metaphors called Enhanced Decomposition Tree: the levels of the visual hierarchy are created by decomposing the aggregates along a specified dimension and the nodes contain the resulting sub-aggregates arranged into an embedded visualization (e.g., a bar-chart). Various hierarchical layouts and embedded chart techniques are considered to account for different analysis tasks. Sifer presents a multiscale visualization technique for OLAP based on coordinated views of dimension hierarchies [11]. Each dimension hierarchy with qualifying fact entries attached as the bottom-level nodes is presented using a space-filling nested tree layout. Drilling-down and rolling-up is performed implicitly by zooming within each dimension view. Filtering is realized by (de-)selecting the values of interest at any level of dimension hierarchies, resulting either in highlighting the qualifying fact entries in all dimension views (global context coordination) or in eliminating the disqualified entries from the display (result only coordination). A similar interactive visualization technique, called the Hierarchical Dynamic Dimensional Visualization
(HDDV), is proposed in [14]. Dimension hierarchies are shown as hierarchically aligned barsticks. A barstick is partitioned into rectangles that represent portions of the aggregated measure value associated with the respective member of the dimension. Color intensity is used to mark the density of records satisfying a specified range condition. Unlike in [11], dimension-level bars are not explicitly linked to each other, which makes it possible to split the same aggregate along multiple dimensions and, thus, to preserve the execution order of the dis-aggregation steps.
Key Applications Visual OLAP should be considered an integral part of a BI architecture. The latter appears rather universal with respect to prospective application domains and comprises virtually all business and non-business scenarios that require quantitative analysis. Prominent business application areas are Financial Risk Management, Industrial Process Control, Operations Planning, Capital Markets Management, Network Monitoring, Marketing Analysis, Fraud/Surveillance Analysis, Portfolio Management, Customer/Product Analysis, Budget Planning, Operations Management, and Economic Analysis. Non-business applications are found primarily in government, healthcare, research, and academia.
Future Directions At present, visual OLAP lacks a unified formal model and query specification standard. Existing visual analysis frameworks are based on proprietary formalisms and models. Furthermore, advanced OLAP tools claim to turn visualization from a presentation layout into the method of data exploration. This claim implies the need for re-defining visualization as an instrument in terms of its structural components and interaction functions. A pioneering initiative addressing the above formalization issues by integrating visualization into a query language is the visual declarative data query language VizQL™ developed at Tableau Software Inc. and released in 2006 [2]. However, VizQL is a proprietary solution and is limited to the visual table paradigm of the Tableau system. To be universally adopted, a new standard for visual exploration of OLAP data has to be open, flexible, and extendible to account for a wide range of visualization approaches. A promising research direction is to evaluate a wealth of existing visualization techniques with respect
to their applicability to multidimensional analysis and to identify classes of techniques efficient for solving particular analysis tasks. One of the major visualization challenges for OLAP is the ability to present a large number of dimensions on a display. An additional visual attribute for mapping a dimension could be animation, as found in Gapminder software for interactive data exploration using animated scatterplots where animation is used to show the evolution of values along the timeline [9]. Another emerging research direction is concerned with spatio-temporal analysis that employs specialized techniques for spatial and/or temporal exploration, such as maps, cartograms, times series, and calendar views. These techniques are aware of the rich semantics behind temporal and geographic dimensions of the data. Rivest et al. [8] propose SOLAP (spatial OLAP) as a visual platform for spatio-temporal analysis using cartographic and general displays. The authors define different types of spatial dimensions and measures as well as a set of specialized geometry-aware operators.
Cross-references
▶ Business Intelligence ▶ Cube ▶ Data Visualization ▶ Dimension ▶ Hierarchy ▶ Measure ▶ Multidimensional Modeling ▶ On-Line Analytical Processing ▶ Visual Interfaces
Recommended Reading
1. Eick S.G. Visualizing multi-dimensional data. ACM SIGGRAPH Comput. Graph., 34(1):61–67, 2000.
2. Hanrahan P., Stolte C., and Mackinlay J. Visual analysis for everyone: understanding data exploration and visualization. Tableau Software Inc., 2007. White Paper, http://www.tableausoftware.com/docs/Tableau_Whitepaper.pdf
3. Keim D.A. and Kriegel H.-P. VisDB: database exploration using multidimensional visualization. IEEE Comput. Graph. Appl., 14(5):40–49, 1994.
4. Lee H.-Y. and Ong H.-L. A new visualisation technique for knowledge discovery in OLAP. In Proc. First Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1997, pp. 279–286.
5. Maniatis A.S., Vassiliadis P., Skiadopoulos S., and Vassiliou Y. Advanced visualization for OLAP. In Proc. ACM 6th Int. Workshop on Data Warehousing and OLAP, 2003, pp. 9–16.
6. Mansmann S. and Scholl M.H. Exploring OLAP aggregates with hierarchical visualization techniques. In Proc. 2007 ACM Symp. on Applied Computing, 2007, pp. 1067–1073.
7. Mansmann S. and Scholl M.H. Extending visual OLAP for handling irregular dimensional hierarchies. In Proc. 8th Int. Conf. Data Warehousing and Knowledge Discovery, 2006, pp. 95–105.
8. Rivest S., Bédard Y., and Marchand P. Toward better support for spatial decision making: defining the characteristics of spatial on-line analytical processing (SOLAP). Geomatica, 55(4):539–555, 2001.
9. Rosling H., Rönnlund A.R., and Rosling O. New software brings statistics beyond the eye. In Proc. Organisation for Economic Co-operation and Development, 2006, pp. 522–530.
10. Russom P. Trends in Data Visualization Software for Business Users. DM Review, May 2000.
11. Sifer M. A visual interface technique for exploring OLAP data with coordinated dimension hierarchies. In Proc. Int. Conf. on Information and Knowledge Management, 2003, pp. 532–535.
12. Stolte C., Tang D., and Hanrahan P. Polaris: a system for query, analysis, and visualization of multidimensional relational databases. IEEE Trans. Visual. Comput. Graph., 8(1):52–65, 2002.
13. Stolte C., Tang D., and Hanrahan P. Multiscale visualization using data cubes. IEEE Trans. Visual. Comput. Graph., 9(2):176–187, 2003.
14. Techapichetvanich K. and Datta A. Interactive visualization for OLAP, Part III. In Proc. Int. Conf. on Computational Science and its Applications, 2005, pp. 206–214.
15. Tegarden D.P. Business information visualization. Comm. AIS, 1(1), 1999, Article 4.
Visual Perception SILVIA GABRIELLI Bruno Kessler Foundation, Trento, Italy
Synonyms Sight; Vision
Definition In psychology, visual perception is the ability to transform visible light stimuli reaching the eyes into information supporting recognition processes and action. The various physical and processing components that enable a human being to assimilate information from the environment are known as the visual system. The act of seeing starts when the cornea and lens at the front of the eye focus an image of the outside world onto a light-sensitive membrane in the back of the eye,
called the retina. The retina is actually the part of the brain which works as a transducer for the conversion of patterns of light into neuronal signals. More precisely, the photoreceptive cells of the retina detect the photons of light and respond by producing neural impulses. These signals are processed in a hierarchical fashion by different parts of the brain, such as the lateral geniculate nucleus and the primary and secondary visual cortex. The major problem in visual perception is that what is seen is not simply and always a translation of retinal stimuli (i.e., the image on the retina). Thus, different theories and experimental studies have been devised to explain what visual processing does to create what is actually perceived. It is worth stressing that visual perception involves the acquisition of knowledge, so it is not merely an optical process but also entails cognitive activity.
Historical Background The foundations of modern theories of vision were laid in the seventeenth century, when Descartes and others established the principles of optics, making it possible to distinguish between the physical properties of light, images and the psychological properties of the visual experience. It was proposed that visual perception involves adding information to that present in the retinal image in order to reach properties such as solidity of an object and meaning. According to the empiricist position, developed primarily by von Helmholtz [8], vision originates from a form of unconscious inference: it is a matter of deriving a probable interpretation from incomplete data (a set of elementary sensations that are integrated and synthesized through a process of learning by association). Inference requires prior assumptions about the world, such as that light comes from above, or that objects are viewed from above and not below. The study of visual illusions (cases when the inference process goes wrong) has yielded a lot of insight into what sort of assumptions the visual system makes. The unconscious inference hypothesis has recently been investigated in Bayesian studies of visual perception. Proponents of this approach consider that the visual system performs some form of Bayesian inference to derive a perception from sensory data. Models based on this idea have been used to describe various visual subsystems, such as the perception of motion or the perception of depth.
Gestalt psychologists (in the 1930s and 1940s) questioned the assumption that vision is derived from inference mechanisms and previous knowledge [10]. They considered visual perception to be direct and based on the detection of patterns or configurations, as organized wholes available in the perceptual field. According to the Gestalt Laws of Organization, there are six main factors that determine how things are grouped in visual perception: proximity, similarity, closure, symmetry, common fate and continuity. The major problem with the Gestalt laws and theory is that they are descriptive of vision more than explanatory. In Gibson's ecological theory of perception [4], the emphasis is placed on the role played by relations in the visual environment, which are embedded within the spatial and temporal distribution of stimuli an active observer can perceive. According to Gibson, no inference mechanism is needed to detect these affordances, which are directly available in the surrounding environment. Computational approaches to the study of vision, such as Marr's work [5], have provided more detailed explanations of visual phenomena by building artificial intelligence models of the processes involved. Vision is considered as the process of forming a description or representation of what is in the scene from the retinal images. The system at work is modular and serial, consisting of a number of subprocesses, each one taking one representation and transforming it into another. It goes from the creation of the primal sketch (representing changes in light intensity occurring over space in the image and organizing these local descriptions into a 2D representation of regions and boundaries) to the 2½D sketch (where the layout of object surfaces, their distances and orientations relative to the observer are represented) to the creation of 3D model representations (specifying the solid shapes of objects matched against their corresponding representations in memory).
Foundations According to a multidisciplinary study of visual perception based on perceptual psychology, neuroscience and computational analysis, the purpose of vision is to produce information about objects, locations and events in the world from imaged patterns of light reaching the viewer. Psychology uses the term ‘distal stimulus’ to refer to the physical world under observation and ‘proximal stimulus’ to refer to the retinal image. The function of vision is to create a description of aspects of the distal
stimulus given the proximal stimulus. Although visual perception is said to be veridical when it produces accurate descriptions of the real world, in practice vision is better understood in the context of the cognitive and motor functions that it serves. Vision systems create descriptions of the visual environment based on properties of the incident illumination. The human vision system is primarily sensitive to patterns of light rather than to the absolute magnitude of light energy. The eye does not operate as a photometer. Instead, it detects spatial, temporal and spectral patterns in the light imaged on the retina, and information about these patterns of light forms the basis of visual perception. A system which measures changes in light energy rather than the magnitude of energy has ecological utility, since it makes it easier to detect patterns of light over large ranges of light intensity. This is also an advantage for computer graphics, which can make graphic displays work effectively by producing only patterns of spatial and temporal change similar to those of the real world. Depth perception: Human beings can perceive the world as 3D, although the images projected onto the retina are 2D (all points placed at different distances in space, but along the same directional axis, project onto the same point of the retina). Depth perception, also called stereopsis, relies mainly on binocular cues (cues that require input from both eyes) and on monocular cues (cues available from the input of just one eye), as well as on the synthetic integration performed by a person's brain based on the full field of view perceived with both eyes. Among the most important binocular cues are: (i) retinal disparity, due to the distance between the two eyes, which makes the projection of objects or scenes onto each retina slightly different. By using two images of the same scene obtained from slightly different angles, it is possible to triangulate the distance to an object with a high degree of accuracy. If an object is far away, the disparity of that image falling on both retinas will be small; if it is close, the disparity will be large. (ii) accommodation: when focusing on far away objects, the ciliary muscles stretch the eye lens, making it thinner. (iii) convergence: to focus on the same object, the two eyes need to converge; the angle of convergence is larger when the eyes fixate nearby objects. Examples of monocular cues are: (i) focus, the lens of the eye can change its shape to bring objects at different distances into focus. Knowing at what
distance the lens is focused when viewing an object means knowing the approximate distance to that object. (ii) perspective: the property of parallel lines converging at infinity allows people to reconstruct the relative distance of two parts of an object, or of landscape features. (iii) occlusion: the blocking of the sight of objects by others is also a clue which provides information about relative distance (the occluded object is perceived as farther away). (iv) peripheral vision: at the outer extremes of the visual field, parallel lines become curved, as in a photo taken through a fish-eye lens. This effect greatly enhances the viewer’s sense of being positioned within a real, three-dimensional space.
Perceptual constancies: To view the world by the exact image projected on the retina would mean perceiving it as very unstable. Objects within a person’s field of vision would constantly be changing shape, size, position and color as s/he moved toward and away from them, because the distance and therefore the amount of reflected light detected by the retina would also be changing. This is not the case, however: our surroundings are perceived as solid and stable since the world is not just ‘‘seen’’ but actively constructed from fragmentary perceptual data [4]. It is the brain that, by means of perceptual constancies, interprets the changing proximal stimulation to reach stability. For instance, in the case of size constancy our brain perceives an object in relation not only to its visual angle, but also to the perceived distance. Also, size constancy tells us that objects moving away from us are the same size even though their retinal image is getting smaller, and this is because size constancy is a property of the perceptual field (it refers to the relationship between the object and its surrounding context, which remains stable). The same principle works in form constancy, where an object is perceived as the same notwithstanding the different shapes that are projected onto the retina as its angle changes. That is to say that perception of the partial view becomes equivalent to perception of the whole object. Also, in the case of color and light constancies, our retina is able to distinguish the different wavelengths of light entering the eye; however, our perception remains relatively stable since it is affected by our previous experiences or knowledge, as well as by the context one is dealing with (e.g., color and lightness of an object do not depend only on the absolute intensity of light, but also on their relationships with the stimulation arriving from the surrounding areas).
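Returning to the retinal-disparity cue described above, the relation between disparity and distance is often illustrated with the standard pinhole-stereo approximation, in which depth is inversely proportional to disparity. The short sketch below is illustrative only and is not part of this entry; the baseline and focal-length values are invented placeholders.

```python
# Illustrative sketch (not from this entry): the pinhole-stereo approximation
# behind the retinal-disparity cue. Depth is inversely proportional to
# disparity (Z = f * B / d), so small disparities correspond to far objects.

def depth_from_disparity(disparity_px: float,
                         baseline_m: float = 0.065,   # assumed eye separation
                         focal_px: float = 800.0) -> float:
    """Estimate the distance (in meters) of a point seen with a given disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

if __name__ == "__main__":
    for d in (80.0, 20.0, 5.0):   # large disparity -> near, small disparity -> far
        print(f"disparity {d:5.1f} px -> depth {depth_from_disparity(d):6.2f} m")
```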
There are primarily two different ways in which visual perception and the perceptual process have been scientifically investigated. Psychophysical analysis has studied how a person’s perception is related to the stimulus. As an example, Stevens’ power law defines the relationship between the magnitude of a physical stimulus and its perceived intensity or strength [7]. The general form of the law is ψ(I) = k·I^a, where I is the magnitude of the physical stimulus, ψ is the psychophysical function capturing sensation (the subjective size of the stimulus), a is an exponent that depends on the type of stimulation and k is a proportionality constant that depends on the type of stimulation and the units used. A second way of studying perception considers its relation to the physiological processes that are occurring within the person’s sensors and/or brain. This is called physiological analysis of the perceptual process, which measures, for example, some aspects of a person’s brain activity when s/he looks at a visual stimulus. Magnetic resonance imaging (MRI) is a typical instrument used for mapping activation patterns in the human brain as a person watches a scene. Due to the complexity of visual perception, and the intimate relation among its psychophysical, physiological and cognitive aspects, the best way of getting a complete view of this phenomenon is by cross-referencing the different disciplines that have addressed its study.
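As a small numerical illustration of the power law (not part of the original entry), the sketch below evaluates ψ(I) = k·I^a for a few exponents; the exponent values are approximate textbook figures and are used here only as assumptions.

```python
# Minimal sketch of Stevens' power law, psi(I) = k * I**a.
# The exponents below are illustrative, approximate values; they are not
# taken from this entry.

def perceived_magnitude(intensity: float, a: float, k: float = 1.0) -> float:
    """Subjective magnitude predicted by the power law for stimulus intensity I."""
    return k * intensity ** a

stimuli = {
    "brightness (compressive, a < 1)": 0.33,
    "apparent length (roughly linear)": 1.0,
    "electric shock (expansive, a > 1)": 3.5,
}

if __name__ == "__main__":
    for label, a in stimuli.items():
        doubled = perceived_magnitude(2.0, a) / perceived_magnitude(1.0, a)
        print(f"{label}: doubling I multiplies sensation by {doubled:.2f}")
```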
Key Applications
Visual Interfaces
Visual perception provides the scientific basis for the design of visual interfaces that can best match the requirements of their users, by enabling easier and more transparent access to and interaction with information.
Information Visualization
In the field of information visualization, visual perception principles help to produce effective visualizations of non-geometric data retrieved from large document collections (e.g., digital libraries), the World Wide Web, and databases, with the aim of supporting users in making sense of the information contained there and of enhancing their creative thinking.
Computer Graphics
Visual perception has become a key component of computer graphics, since it helps to achieve the development of high-fidelity virtual environments in reasonable time by exploiting knowledge of how the human visual system works (e.g., rendering time can be saved by avoiding the computation of those parts of a scene that the human will fail to notice).
Artificial Intelligence
The results of studies on human visual perception inspire the development of computational models of vision that, in turn, can be used to design and implement visual abilities within artificial intelligence systems (e.g., robots).
Future Directions
Current theories and methods for the study of visual perception in psychophysics and neuroscience need to be integrated with information processing approaches in the attempt to model vision at a more abstract computational level. An interesting cross-fertilization is expected between neuroscience experimentation and computer vision theories, particularly for providing a detailed account of how the brain makes effective use of top-down resources (e.g., knowledge, memories, etc.), creates predictions about forthcoming stimuli and constantly matches expectations against signals from the environment [6]. It is likely that as researchers in visual neuroscience and computer vision develop a better understanding of the role played by bottom-up and top-down information processing mechanisms and their interaction, better theories of visual perception can be developed.
Experimental Results
A considerable amount of experimental evidence exists for all the theories and approaches to the study of visual perception cited above. References to results on the physiology, psychology and ecology of vision are provided in [1,3]. Application of visual perception principles to the design of information visualization and visual interfaces is discussed in [9].
Data Sets
An online laboratory providing a collection of demonstrations and experiments on visual perception can be found at: http://www2.psych.purdue.edu/coglab/VisLab/welcome.html.
URL to Code
The portal (above) also intends to provide software that can be used in the study of visual science.
Cross-references
▶ Visual Interfaces
Recommended Reading
1. Bloomer C.M. Principles of Visual Perception. Herbert, London, 1990.
2. Bruce V. and Green P.R. Visual Perception. Erlbaum, Hove, England, 1990.
3. Bruce V., Green P.R., and Georgeson M.A. Visual Perception: Physiology, Psychology, and Ecology. Psychology Press, New York, 2003.
4. Gibson J.J. The Ecological Approach to Visual Perception. Houghton Mifflin, Boston, 1979.
5. Marr D. Vision. W.H. Freeman, San Francisco, CA, USA, 1982.
6. Perception on Purpose. FP6-IST project N. 027268. Available at: http://perception.inrialpes.fr/POP/.
7. Stevens S.S. On the psychophysical law. Psychol. Rev., 64(3):153–181, 1957.
8. von Helmholtz H. (Obituary). In Proc. Royal Soc. Lond., Royal Society, Great Britain. Taylor and Francis, London, 1854.
9. Ware C. Information Visualization: Perception for Design. Morgan Kaufmann, San Francisco, CA, USA, 2004.
10. Wertheimer M. Gestalt theory. In A Source Book of Gestalt Psychology, W.D. Ellis (ed. & trans.). Routledge and Kegan Paul, London (original work published 1925), 1938, pp. 1–11.
Visual Query Language
Tiziana Catarci
University of Rome, Rome, Italy
Synonyms
Visual query system
Definition
Visual Query Languages (VQLs) are languages for querying databases that use a visual representation to depict the domain of interest and express related requests. VQLs provide a language to express queries in a visual format, and they are oriented towards a wide spectrum of users, especially novices who have limited computer expertise and are generally unaware of the inner structure of the accessed database. Systems implementing VQLs are usually called Visual Query Systems (VQSs) [8].
Historical Background
The birth of VQLs was due to several needs, including: providing a friendly human-computer interaction, allowing database search by non-technical users, and introducing a mechanism for comfortable navigation even in the case of incomplete and ambiguous queries. It is worth noting that the real precursor of VQLs was QBE, proposed by Moshe Zloof as early as 1977 [19]. QBE was really ahead of its time. Indeed, Zloof’s paper states: ‘‘the formulation of a transaction should capture the user’s thought process. . . .’’ This is a quite common idea today, but at that time (1977) the research on user interfaces and human-computer interaction was still in its infancy. Approximately in the same period, Smith coined the term ‘‘icon’’ in his Ph.D. thesis [16] and Alan Kay introduced the idea of direct manipulation interfaces that are, in principle, usable by everyone [14]. It was necessary to wait until the beginning of the 1980s for the first commercial systems making extensive use of direct manipulation, namely Xerox Star, Apple Lisa, and Macintosh. It is worth noting that QBE is based not only on the usage of examples for expressing queries, but also on the direct manipulation of relational tables inside a basic graphical user interface: an environment and an action modality which were quite unknown at that time, considering the almost simultaneous publication of Alan Kay’s paper. Another anticipatory idea is incremental query formulation, i.e., ‘‘. . .the user can build up a query by augmenting it in a piecemeal fashion.’’ Many papers still recommend allowing the user to express the query in several steps, by composing and refining the initial formulations [8]. Moreover, QBE was the first proposal of a query language in which the attention to the user interface is coupled with a rigorous definition of the language syntax and semantics and a careful study of the language’s expressive power. Unfortunately, many of the later proposals of visual query languages have emphasized only the aspects related to user interaction and ease of learning and of use, without also focusing on formal aspects, such as syntax, semantics and expressive power [8]. Such query languages are usually presented through examples of query formulation, making both the study of language expressiveness and the comparison with other languages difficult. The opposite is true for QBE, whose usability was also tested by comparing it with SQL in several experiments, such as those reported in [15] and [17]. Finally, many of the QBE ideas are still up-to-date, and it is amazing to note that QBE-like interfaces are nowadays adopted in commercial database systems, despite the current explosion of sophisticated visualizations and interaction mechanisms. However, it was only later, during the 1980s and early 1990s, that VQLs received the greatest attention from the database community, with the presentation of several diverse proposals for an effective visual and interactive query language.
Foundations
In [5], VQLs were reviewed and classified based on three criteria, namely:
- What can be done using the system, i.e., the expressive power of the environment.
- How the system may be used in order to build the query, i.e., the concept of usability as determined by the available strategies and the interaction and representation models.
- Whom, in terms of classes of users, the systems are addressed to.
The expressive power of a query system can be defined as the ability of the system to extract meaningful information from the database. More formally, according to [10], the expressive power can be based on the concept of computable query. Let U be a fixed countable set, called the universal domain, and let D(B), a subset of U, be a finite set which includes all the elements appearing in the database B. A query is a partial function giving an output (if any) which is a relation over D(B). A query Q is said to be computable if Q is partial recursive and satisfies a consistency criterion: if two databases are isomorphic, then the corresponding outputs of Q are also isomorphic (under the same isomorphism). In other words, the result of a query should be independent of the organization of the data in a database and should treat the elements of the database as uninterpreted objects. Excluding queries expressing undecidable problems, computable queries can be seen as the most general class of ‘‘reasonable’’ queries. Other meaningful classes of queries have been investigated in [11], giving rise to the so-called Chandra’s hierarchy. A significant class of queries inside the hierarchy is that of first order queries, and the term completeness was used to indicate that a query language
could express all the first order queries. However, it was soon apparent that the amount of expressive power provided by such a language is not adequate for expressing useful queries, such as transitive closure and, more generally, fixpoint queries (i.e., queries obtained by augmenting the standard first order operators with the construct of least fixpoint). On the other hand, languages that offer a friendly way to express queries simpler than first order queries are still significant, since average database users make elementary requests. As a consequence of the above remarks, the expressive power of the majority of visual query languages is lower than or equal to the classes of first order or fixpoint queries. More precisely, most of the iconic languages fall into the lowest-level classes, since they are directed to a casual user, who is mainly interested in a friendly expression of simple queries (e.g., select-project queries). On the other hand, most of the graphical languages are at most as expressive as relational algebra, and very few are placed in the upper levels of the hierarchy (see [8]). The notion of usability used in [5] is quite ‘‘database-oriented,’’ and differs from the one now commonly accepted in the human-computer interaction (HCI) community. Indeed, in [5] usability was specified in terms of the models used in VQLs for denoting both data and queries, their corresponding visual representations, and the strategies provided by the system in order to formulate the query. This definition does not reflect the HCI view of usability as a software quality related to user perception and fruition of the software system. In particular, the data model has no significant impact on the user perception of the system, whereas visual representations and interaction strategies influence the user-system interaction, since they are important parts of the system interface (the only thing the user sees of a system). As for the visual representation, the query representation is generally dependent on the database representation, since the way in which the query operands (i.e., data in the database) are presented constrains the query representation. For example, given a query on a relational database, the query may be formulated in terms of several representations, e.g., filling some fields in tables visualizing the relations, or following paths in a hypergraph that visualizes the relational schema. In this case the table and the hypergraph are two possible
representations associated with the relational database. On the other hand, the visual representation used to display the query result can be different from the database representation. This is mainly due to the fact that what is visualized for the query purpose is most often the schema of the database, while the actual database instances constitute the query result to be displayed to the user. Visual representations used in VQLs typically use forms, diagrams, icons or a combination of them. Form-based representations are the simplest way to provide the users with friendly interfaces for data manipulation. They are very common as application or system interfaces to relational databases, where the forms are actually a visualization of the tables. In query formulation, prototypical forms are visualized for users to state the query by filling appropriate fields. In systems such as QBE [19], only the intensional part of relations is shown: the user fills the extensional part to provide an example of the requested result. The system retrieves whatever matches the example. In more recent form-based representations the user can manipulate both the intensional and the extensional part of the database. Diagrammatic representations are also widely used in existing systems. Typically, diagrams represent data structures displayed using visual elements that correspond to the various types of concepts available in the underlying data model. Diagrammatic representations adopt as typical query operators the selection of elements, the traversal of adjacent elements and the creation of a bridge among disconnected elements. Iconic representations use icons to denote both the objects of the database and the operations to be performed on them. A query is expressed primarily by combining operand and operator icons. For example, icons may be vertically combined to denote conjunction (logical AND) and horizontally combined to denote disjunction (logical OR). In order to be effective, the proposed set of icons should be easily and intuitively understandable by most people. The need for users to memorize the semantics of the icons makes the approach manageable only for somewhat limited sets of icons. The hybrid representation is a combination of the above representations. Often, diagrams are used to describe the database schema, while icons are used either to represent specific prototypical objects or to
indicate actions to be performed. Forms are mainly used for displaying the query result. As with most graphical user interfaces, the VQLs that have been developed so far have mainly stressed the user input aspects of the interaction and have given little thought to the visualization of output data. Conversely, an appropriate visualization of the query result allows the user to better capture the relationships amongst the output data, and some systems are progressing in this direction. For instance, AMAZE employs 3D graphs [7]. The data are shown as a 3D snapshot of the n-dimensional results. Different methods of result visualization are also planned to be available to the user. The Film Finder system [1] visualizes information about movies by means of starfield displays, which show database objects as small selectable spots (either points or 2D figures). The displayed data can be filtered by changing the range of values on both the Cartesian axes. The query result fits on a single screen, and the system quickly, i.e., within a second, computes the new data display in response to the user’s requests. This property, called near real-time interactivity, ensures high usability. A different approach is to use virtual reality techniques to present the query result with a simulation of a real environment (i.e., a virtual one) that depicts a situation familiar to the user. VQRH [12] is one of the systems that provide the user with several visual representations for both query formulation and result visualization. One possibility is to use 3D features to present the results in the simulated reality setting. For example, if the database refers to the books in a library, a virtual library can be represented in which the physical locations of the books are indicated by icons in a 3D presentation of the book stacks of the library. Apart from the visual representation, any VQL is characterized by the way in which it allows the user to express his/her requests. Very often, the actual query specification is the second step of the user interaction, while there is a first step devoted to the understanding of the overall database content. This first phase is in general supported by providing the user with browsing and/or filtering and zooming mechanisms. Query formulation is the fundamental activity in the process of data retrieval. The query strategy by schema navigation has the characteristic of concentrating on a concept (or a group of concepts) and moving from it in order to reach other concepts of interest, on
which further conditions may be specified. Such a strategy differs according to the type of path followed during the navigation (see Figure 1 for an example of an unconnected path).
Visual Query Language. Figure 1. Unconnected path in QBD* [19].
A second strategy for query formulation is by subqueries. In this case the query is formulated by composing partial results. The third strategy for query formulation is by matching. It is based on the idea of presenting the structure of a possible answer that is matched against the stored data. The last strategy for query formulation is by range selection, which allows a search conditioned by a given range on multi-key data sets to be performed. The query is formulated through direct manipulation of graphical widgets, such as buttons, sliders, and scrollable lists, with one widget being used for every key. An interesting implementation of such a technique has been proposed in [1], and is called dynamic query. The user can either indicate a range of numerical values (with a range slider), or a sequence of names alphabetically ordered (with an alpha slider). Given a query, a new query is easily formulated by moving the position of a slider with the mouse: this is supposed to give a sense of power but also of fun to the user, who is challenged to try other queries and see how the result is modified. Usually, input and output data are of the same type and may even coincide.
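As a rough illustration of the range-selection/dynamic-query strategy just described, the sketch below re-evaluates one filter per key whenever a "slider" changes; the record fields and data are invented for this example, and no GUI toolkit is involved.

```python
# Hedged sketch of query by range selection / dynamic queries: one filter per
# key, re-evaluated every time a slider is moved. Field names and data are
# illustrative assumptions, not part of the original systems.

movies = [
    {"title": "Alpha", "year": 1982, "length": 96},
    {"title": "Borealis", "year": 1994, "length": 121},
    {"title": "Cascade", "year": 2001, "length": 88},
]

def dynamic_query(records, numeric_ranges=None, alpha_range=None, alpha_key="title"):
    """Return the records that fall inside every active slider range."""
    numeric_ranges = numeric_ranges or {}
    result = []
    for r in records:
        if any(not (lo <= r[k] <= hi) for k, (lo, hi) in numeric_ranges.items()):
            continue
        if alpha_range and not (alpha_range[0] <= r[alpha_key][:1].upper() <= alpha_range[1]):
            continue
        result.append(r)
    return result

# Each "slider move" is simply a new call; a real interface would refresh the
# display in near real time after every call.
print(dynamic_query(movies, {"year": (1990, 2005), "length": (80, 120)}, ("A", "C")))
```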
Key Applications
VQLs have been basically proposed to allow nonprogrammers to express database queries. In [5] it was also conjectured that, depending on the user class and the kind of task (i.e., query), certain VQLs are
more appropriate than others. For instance, VQLs based on extensive use of icons and icon composition mechanisms [18] are typically more suited to express simple (select-project-join) queries and to be used by naive users. The analyses underlying such statements were usually correct. Nevertheless, they needed to be substantiated by running user trials, and such experiments were (and still are) not so common in the VQL community. However, some of them have been carried out and basically support the hypothesis that different kinds of queries are better supported by different visual representations and interaction mechanisms. In contrast, results on the adequacy of the different visual representations with respect to the various classes of users are not as strong as was conjectured.
Future Directions
VQLs mainly deal with traditional databases, i.e., databases containing alphanumeric data. However, in recent years the application realms of databases have grown considerably in terms of both the number and the variety of data types. As a consequence, specialized systems have been proposed for accessing such new kinds of databases, containing non-conventional data such as images, videos, temporal series, maps, etc. Furthermore, the idea of an information repository has been deeply influenced by the beginning of the Web age, and different visual systems have been proposed to cope with the need for extracting information residing on the Web.
Temporal Databases
There are a growing number of applications dealing with data characterized by the temporal dimension
(e.g., medical records, biographical data, financial data, etc.). Still, visual interfaces for querying temporal databases have been less investigated than their counterparts for traditional databases. Typically, end-users of these data are competent in the field of the application but are not computer experts. They need easy-to-use systems able to support them in the task of accessing and manipulating the data contained in the databases. In this case, typical interactions with the data involve the visualization of some of their characteristics over some timeframe or the formulation of queries related to temporal events, such as the change of status of an employee or the reversal of a stock market trend.
Geographical Databases
Geographical information is most naturally conveyed in visual format. Maps and diagrams (i.e., schematic maps such as a bus network map) are the core means in user interactions, both for querying the database and for displaying the result of a query. A typical query would be ‘‘show me on a city map the post office that is closest to this location.’’ The query itself would most likely be expressed using preformatted forms and menus to select the city and the reference location. The result would be a blinking or otherwise highlighted point in the displayed map. Once a map is displayed, as a result of a previous query or as an initial background screen in a query formulation interaction, the map can be used to specify a new query. This typically supports queries such as ‘‘give me more information on this,’’ where the value of the this parameter is specified by pointing in some way to a location in the map (i.e., a point on the screen). Thus, in some sense, visual interaction is common practice in GIS systems, due to the intrinsically spatial reference that is associated with the data.
Web Visual Access
Nowadays, the Web is the widest information repository. However, finding the information of interest among the mass of uninteresting information is a very hard task. In order to help the user in retrieving information scattered everywhere on the Web, several proposals have been made by different research communities, such as those of databases, artificial intelligence, and human-computer interaction (see, e.g., [13]); only a limited number of them relate to visual querying and information visualization.
Visually Querying Digital Libraries
The main purpose of a digital library (DL) is to facilitate users in easily accessing the enormous amount of globally networked information, which includes preexisting public libraries, catalog data, digitized document collections, etc. Thus, it is fundamental to develop both the infrastructure and the user interface to effectively access the information via the Internet. The key technological issues are how to search and how to display desired selections from and across large collections. A DL interface must support a range of functions including query formulation, presentation of retrieved information, relevance feedback and browsing.
Experimental Results
As mentioned in the previous sections, user trials have been conducted to compare the usability of query languages. First, a well-known definition of usability is recalled: ‘‘the extent to which a product can be used with efficiency, effectiveness and satisfaction by specific users to achieve specific goals in specific environments’’ [3]. More precisely, effectiveness refers to the extent to which the intended goals of the system can be achieved; efficiency is the time, the money, and the mental effort spent to achieve these goals; and satisfaction depends on how comfortable the users feel using the system. In the case of VQLs the main goal is to extract information from the database by performing queries, and the accuracy in achieving such a goal is generally measured in terms of the accuracy of query completion (i.e., the user’s correctness rate when writing the queries). Measures of efficiency relate the level of effectiveness achieved to the expense of various resources, such as mental and physical effort, time, financial cost, etc. In principle, both the user’s and the organization’s points of view should be considered. However, the user’s efficiency is most frequently measured in terms of the time spent to complete a query. The above two measures (i.e., query accuracy and response time) can be evaluated quite precisely. Frequently this is done either by recording real users performing predefined tasks with the system and then analyzing the recorded data, or by directly observing the user. The most common tasks are query writing and query reading, both of which are performed by investigating the relationships between database queries expressed in natural language and the same queries expressed in the system under study. In query writing, the question is: ‘‘Given a query in natural language, how easily can a user express it through the query language statements?’’ The question for query reading is: ‘‘Given a query expressed through the query language statements, can the user express the query easily in natural language?’’ Moreover, other kinds of measures can be defined and evaluated, although with less precision. Measures of satisfaction describe the comfort and acceptability of the overall system used. The learnability of a product may be measured by comparing the usability of a product handled by one user along a time scale. Measuring usability in different contexts can assess the flexibility of a product. Usability of query languages was first studied through the comparison between QBE and SQL [15,17]. The former study [15] showed better user performance with QBE than with SQL, in both query reading and query writing tests. However, a later study [17] also comparing QBE and SQL took into account several factors, such as the use of the same database management system, a similar environment, etc. It is interesting to note that the query language type affected user performance only in ‘‘paper and pencil’’ tests, in which case QBE users had higher scores than SQL users. In on-line tests, the user’s accuracy was not affected by the type of language adopted, but the user’s satisfaction was much greater with QBE, and his or her efficiency much better. In [2], a language based on the previously mentioned dynamic queries was tested against two other query languages, both providing form fill-in as the input method. One of these languages (called FG) has a graphical visualization output, and the other one (called FT) has a fully textual output. The alternative interfaces were chosen to find out which aspect of dynamic queries makes the major difference, either the input by sliders or the output visualization. The tasks to be performed by the user concerned basically the selection of elements that satisfy certain conditions. However, the subjects were also asked to find a trend for a data property and to find an exception to a trend. The hypothesis that the dynamic query language would perform better than both the FG and the FT interface was confirmed. Similarly, the FG interface produced faster completion times than the FT interface. In particular, for the task of finding a trend, the possibility of getting an overview of the database (in the dynamic and FG interfaces) made the major difference. In searching for an exception, the dynamic interface performed significantly better than the FG and FT ones. This was due to the advantages offered by both the visualization and the sliders. The visualization allowed subjects to see exceptions easily when they showed up on the screen, and the sliders allowed them to quickly change the values to find the correct answer. Other experiments have been conducted [17,18] to compare a diagrammatic query language, namely QBD*, against both SQL and QBI, an iconic query language. The overall objective of the studies was measuring and understanding the comparative effectiveness and efficiency with which subjects can construct queries in SQL or in the diagrammatic or iconic languages. The experiments were designed to determine if there is a significant correlation between (i) the query class and the query language type, and (ii) the type of query language and the experience of the user. The subjects were undergraduate students, secretaries and professionals having different levels of expertise. The results of the comparison between QBD* and SQL confirmed the intuitive feeling that a visual language is easier to understand and use than a traditional textual language, not only for novice users but also for expert ones. The experts’ errors when using SQL were mainly due to the need to remember table names and to use a precise syntax. Working with QBD*, users can gain from looking at the E-R diagrams. On the basis of the figures obtained when comparing QBD* and QBI, one can say that expert users perform better using the QBD* system, while a small difference exists concerning the performance of non-expert users (slightly better using QBI).
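To make the effectiveness and efficiency measures discussed in this section concrete, the following sketch computes query-completion accuracy and mean completion time from recorded trials; the trial data and language labels are invented for illustration and do not reproduce any of the studies cited above.

```python
# Illustrative sketch (invented data): effectiveness (accuracy of query
# completion) and an efficiency proxy (mean completion time) per language.

trials = [  # (participant, language, correct, seconds)
    ("p1", "QBE", True, 95), ("p1", "SQL", False, 180),
    ("p2", "QBE", True, 120), ("p2", "SQL", True, 210),
    ("p3", "QBE", False, 150), ("p3", "SQL", True, 240),
]

def summarize(language: str):
    rows = [t for t in trials if t[1] == language]
    accuracy = sum(t[2] for t in rows) / len(rows)    # effectiveness
    mean_time = sum(t[3] for t in rows) / len(rows)   # efficiency proxy
    return accuracy, mean_time

for lang in ("QBE", "SQL"):
    acc, secs = summarize(lang)
    print(f"{lang}: accuracy {acc:.0%}, mean completion time {secs:.0f}s")
```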
Cross-references
▶ Data Model
▶ Data Visualization
▶ Digital Libraries
▶ Expressive Power of Query Languages
▶ Geographic Information System
▶ Multimedia Databases
▶ Temporal Database
▶ Usability
▶ Query Language
Recommended Reading
1. Ahlberg C. and Shneiderman B. Visual information seeking: tight coupling of dynamic query filters with starfield displays. In Proc. SIGCHI Conf. on Human Factors in Computing Systems, 1994, pp. 313–317.
2. Ahlberg C., Williamson C., and Shneiderman B. Dynamic queries for information exploration: an implementation and evaluation. In Proc. SIGCHI Conf. on Human Factors in Computing Systems, 1992, pp. 619–626.
3. Angelaccio M., Catarci T., and Santucci G. QBD*: a graphical query language with recursion. IEEE Trans. Softw. Eng., 16:1150–1163, 1990.
4. Badre A.N., Catarci T., Massari A., and Santucci G. Comparative ease of use of a diagrammatic vs. an iconic query language. In Interfaces to Databases, J. Kennedy and P.J. Barclay (eds.). Electronic Series Workshop in Computing, Springer, 1996.
5. Batini C., Catarci T., Costabile M.F., and Levialdi S. Visual query systems: a taxonomy. In Proc. 2nd IFIP W.G. 2.6 Working Conference on Visual Databases, 1991.
6. Bevan N. and Macleod M. Usability assessment and measurement. In The Management of Software Quality, M. Kelly (ed.). Ashgate Technical/Gower Press, Hampshire, UK, 1993.
7. Boyle J., Leishman S., and Gray P.M.D. From WIMP to 3D: the development of AMAZE. J. Vis. Lang. Comput. (Special issue on visual query systems), 7:291–319, 1996.
8. Catarci T., Costabile M.F., Levialdi S., and Batini C. Visual query systems for databases: a survey. J. Vis. Lang. Comput., 8(2):215–260, 1997.
9. Catarci T. and Santucci G. Diagrammatic vs. textual query languages: a comparative experiment. In Proc. IFIP W.G. 2.6 Working Conference on Visual Databases, 1995, pp. 57–85.
10. Chandra A.K. Programming primitives for database languages. In Proc. 8th ACM SIGACT-SIGPLAN Symp. on Principles of Programming Languages, 1981, pp. 50–62.
11. Chandra A.K. Theory of database queries. In Proc. 7th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 1988, pp. 1–9.
12. Chang S.K., Costabile M.F., and Levialdi S. Reality bites – progressive querying and result visualization in logical and VR spaces. In Proc. IEEE Symp. Visual Languages, 1994, pp. 100–109.
13. Geronimenko V. and Chen C. (eds.). Visualising the Semantic Web. Springer, Berlin, 2002.
14. Kay A. Personal dynamic media. IEEE Comput., 10(3):31–42, 1977.
15. Reisner P. Query languages. In Handbook of Human-Computer Interaction, M. Helander (ed.). North-Holland, Amsterdam, 1988, pp. 257–280.
16. Smith D.C. Pygmalion: A Computer Program to Model and Stimulate Creative Thought. Birkhauser Verlag, Basel, 1977.
17. Yen M.Y. and Scamell R.W. A human factors experimental comparison of SQL and QBE. IEEE Trans. Softw. Eng., 19(4):390–402, 1993.
18. Tsuda K., Yoshitaka A., Hirakawa M., Tanaka M., and Ichikawa T. Iconic browser: an iconic retrieval system for object-oriented databases. J. Vis. Lang. Comput., 1(1):59–76, 1990.
19. Zloof M.M. Query-by-example: a database language. IBM Syst. J., 16(4):324–343, 1977.
Visual Query System
▶ Visual Query Language
Visual Representation
Yannis Ioannidis
University of Athens, Athens, Greece
Synonyms
Graphical representation
Definition
The concept of ‘‘representation’’ captures the signs that stand in for and take the place of something else [5]. Visual representation, in particular, refers to the special case when these signs are visual (as opposed to textual, mathematical, etc.). On the other hand, there is no limit on what may be (visually) represented, which may range from abstract concepts to concrete objects in the real world or data items. In addition to the above, however, the term ‘‘representation’’ is often overloaded and used to imply the actual process of connecting the two worlds of the original items and of their representatives. Typically, the context determines quite clearly which of the two meanings is intended in each case; hence, the term is used for both without further explanation. Underneath any visual representation lies a mapping between the set of items that are being represented and the set of visual elements that are used to represent them, i.e., to display them in some medium. In order for a visual representation to be useful, the mapping must satisfy certain properties: it must be expressive as well as effective [1]. Expressiveness is related to the accuracy with which the visual representation captures the underlying items; an expressive mapping should neither lose any information nor lead to additional, irrelevant implications. Effectiveness is related to the quality of the interpretation of the visual representation by a human; an effective mapping should allow for a fast and unique interpretation of the underlying items, leaving no room for ambiguities or false impressions. Expressiveness and effectiveness are often enhanced significantly when
the set of visual elements used to represent a set of (interrelated) items is familiar to the user (possibly from other domains), and the corresponding mapping is intuitive so that it invokes the correct interpretation of these items effortlessly. The mapping that underlies a visual representation is often called a Visual Metaphor [4], as it realizes a ‘‘transfer’’ (metaphor is from the Greek word μεταφορά – metaphora, which means ‘‘transfer’’) of implicit and explicit characteristics of the visual elements to the items these represent. The quality of a visual representation depends on the visual elements used in it, on the choice of which visual element is mapped to which item and which visual element characteristic is mapped to which item characteristic, and on any constraints that must be imposed on the mapping. A brief introduction to the above is the main topic of this entry.
Historical Background
Visual representation of the external world has been exercised by humans for thousands of years and, in recent history, this has extended to abstract worlds as well. Visual metaphors have been used so widely that human cognition is considered tightly interwoven with, and sometimes even identified with, human vision. Work on visual representation of data and information dates as far back as the work of the Scottish political economist William Playfair, who was the first to use bar charts, pie charts, and other extremely intuitive and useful visual constructs that are used today for this purpose [6]. In recent times, many theories on the semantics of visual representations and how they may enhance human perception have been published, a great example being the work of Tufte [6,7]. Visual representation of information is extensively used in several types of computer systems, all of which use, either explicitly or implicitly, a set of visual elements and a mapping from them to the data items being visually represented. Given that the variety of possibilities regarding both the elements and the mappings is endless, a plethora of papers has been written on visual representation, including some aggregate bodies of work that offer a comprehensive view of the entire field [2]. In the following section, a formal model for visual representation is presented. Although visual representations can be employed for any kind of abstract or concrete items, for ease of exposition and without loss
of generality, the discussion focuses on representing data or information items.
Foundations
This section presents a formalism that captures the essence of visual representation and can be used as a means to classify and compare several approaches to the topic, and to explain some of the features of the relevant systems. For the sake of presentation clarity, it makes various simplifying assumptions and, therefore, its expressive power is not as high as some existing visual representations manifest. Nevertheless, it is expressive enough to capture a broad class of visual representations and characteristic enough to convey the main aspects of the concept overall. Without loss of generality, any dataset is assumed to conform to the structure and constraints of some data model (in database terminology, that would be a data schema). Visualizations of such a dataset require the corresponding notion of a visual model, which is similar to a data model, except that its primitives are visual. In analogy to a dataset being an instance of a data model, a visual representation that is an instance of a visual model will be called a visualset. In order for a visualset to obtain meaning and, therefore, become a visual representation of a dataset, it needs a mapping of its elements to the corresponding items of the dataset, so that users may see the former and understand the latter. Such a mapping is typically defined at the (data and visual) model level and propagated down to all instances of the models. Given this, such a mapping is called a (visual) metaphor. Separation of the visual representation process into three distinct components (visual model, data model, and visual metaphor) has many benefits. It permits metaphors to be tested for correctness, evaluated, compared, and combined. Also, it permits systems, especially visual-representation and user-interface tools, to be evaluated with respect to (i) the quality of their visual models and metaphors, (ii) their flexibility with respect to defining or changing their models and metaphors, and (iii) the particular ways in which this is done. Using the above three notions, the problem of visual representation of datasets may be stated more formally as follows. Given a data model D, let S(D) denote the set of valid datasets that can be constructed based on that model. Similarly, let S(G) denote the set of visualsets that can be constructed based on a visual
model G. The goal is to establish a binary relation between S(G) and S(D), the metaphor, whose specific properties depend on the precise operations that users want to perform on visualsets. Specifically:
1. If users want to be able to use a visualset of G to view any dataset of D in its entirety, then an onto function must exist of the form f: S(G) → S(D), so that every dataset can be represented visually.
2. If, in addition, users want to be able to use a visualset of G to update a dataset of D, then a total onto function must exist of the form f: S(G) → S(D), so that every visualset can be uniquely interpreted as a dataset.
Clearly, not all such functions f that satisfy the above properties are useful. Many are arbitrary mappings, with no obvious correspondence between the dataset and the visualset. The goal is to establish a relationship between the members of S(D) and S(G) so that when users view a visualset, they can infer the dataset to which it maps. This implies that f should be derived from a correspondence between the individual components and features of the data and visual models, which would result in a structural similarity between datasets and visualsets, making it easier for users to understand the former through the latter.
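The properties required of the mapping f can be checked mechanically on small, finite examples. The sketch below is an illustration only (the sets and pairs are invented, not part of the original formalism); it tests whether a candidate relation between toy versions of S(G) and S(D) is a function, total, onto, and 1-1.

```python
# Toy illustration (sets and pairs are invented): checking whether a candidate
# relation between finite S(G) and S(D) is a function, total, onto, and 1-1.

S_G = {"g1", "g2", "g3"}              # visualsets conforming to the visual model G
S_D = {"d1", "d2"}                    # datasets conforming to the data model D
pairs = {("g1", "d1"), ("g2", "d2"), ("g3", "d2")}   # candidate mapping f

def is_function(p):
    return all(len({d for g2, d in p if g2 == g}) <= 1 for g in S_G)

def is_total(p):
    return {g for g, _ in p} == S_G

def is_onto(p):
    return {d for _, d in p} == S_D

def is_one_to_one(p):
    return len({d for _, d in p}) == len(p)

print("function:", is_function(pairs), "total:", is_total(pairs),
      "onto:", is_onto(pairs), "1-1:", is_one_to_one(pairs))
```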
Data and Visual Models
The formalism below (essentially a meta-model) identifies the key components and features of a large and interesting class of data and visual models, which serves as an important example of the concepts surrounding visual representation. In that formalism, every data or visual model M can be seen as a quadruple M = <𝒫, 𝒜, 𝒱, 𝒞> defined as follows:
𝒫 is a finite set of classes (containers) of data items (resp., visual elements). Each such class P is associated with a (possibly infinite) set of unique ids I(P) that can be used to identify the members of class P.
𝒜 is a finite set of identifiers of attributes that data items (resp., visual elements) are expected to have, depending on the class they belong to. If A(P) represents the attributes of the members of class P, then 𝒜 = ∪_{P ∈ 𝒫} A(P). For all P ∈ 𝒫, each member of A(P) is of the form P.A, where A is the name of the corresponding attribute of the members of class P.
𝒱 is a (possibly infinite) set of identifiers of the possible values that the above attributes may have. If V(P.A) represents the possible values of attribute P.A, then 𝒱 = ∪_{P ∈ 𝒫} ∪_{P.A ∈ A(P)} V(P.A). For all P.A ∈ A(P), each element of V(P.A) is of the form P.A == v, where v is the corresponding potential value of attribute P.A. For an attribute P.A, V(P.A) may be equal to a set I(·) of ids of the members of some class.
𝒞 is a finite set of constraints, i.e., rules that must be satisfied by any dataset (resp., visualset) that conforms to M. These constraints are formulas in some prespecified language L and refer to members of 𝒫, 𝒜, 𝒱.
To draw an analogy, in relational database terms, the above describes relational tables (𝒫) with attributes (𝒜), the attributes’ domains (𝒱), and integrity constraints (𝒞). It is conceivable that some data or visual model may not be representable in (any enhancement of) the above formalism. Most of the common models fall naturally in this formalism, however, so it is the one adopted in this entry. Data models capture abstract organization of information and their creation is a classical database problem that is well understood. As an example, consider a very simple data model on employees. Each employee has a name, a salary, an age, and a department. The three possible departments are ‘‘toy’’, ‘‘candy’’, and ‘‘shoe’’. This data model is the quadruple D = <𝒫_D, 𝒜_D, 𝒱_D, 𝒞_D>, where 𝒞_D = ∅, and the data-item classes in 𝒫_D, their attributes in 𝒜_D, and their corresponding value sets in 𝒱_D are given in the following table.
Data item class (P)    Attribute (P.A)    Attribute values (V(P.A))
Employee               Name               text
                       Salary             [0, 10000]
                       Age                [0, 100]
                       Dept               {toy, candy, shoe}
Contrary to the situation with data models, creating suitable visual models is less straightforward, since they must reflect not only the (visual) information to be organized, but also the medium in which the
models are expressed. To create visual item classes for a visual model, visual building blocks are needed, which may actually be combinations of standard visual elements, such as points, lines, regions, text, etc. (A more formal discussion of visual constructs may be found in [3].) Constraints in the visual model are used to hold compositions together. For example, if a visual element is a box with a piece of text in the center, then it is a composition of a region and a text item satisfying the constraint that the location of the text item (its center) is the same as the location of the region (its center). As an example, consider a very simple visual model that supports the placement of objects in a two-dimensional data field. The only class of visual items is point (not in the mathematical sense), which is simply a region. This visual model is the quadruple G = <𝒫_G, 𝒜_G, 𝒱_G, 𝒞_G>, where again 𝒞_G = ∅, and the visual-item classes in 𝒫_G, their attributes in 𝒜_G, and their corresponding value sets in 𝒱_G are given in the following table. For simplicity, only a subset of the attributes is shown.
Visual item class (P)    Attribute (P.A)    Attribute values (V(P.A))
Point                    LocationX          plane-pixel-x
                         LocationY          plane-pixel-y
                         Size               {100 pixels}
                         Shape              {square, oval, triangle}
                         Color              {blue, red, green}
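One possible encoding of the two example models is sketched below, only as an illustration: each model is represented by its classes, attributes, attribute-value domains, and (empty) constraint set. The dictionary layout and the helper function are assumptions, not part of the entry.

```python
# Illustrative encoding (not from the entry) of the two example models:
# classes, attributes, attribute-value domains, and empty constraint sets.

data_model = {
    "classes": {"employee"},
    "attributes": {"employee": ["name", "salary", "age", "dept"]},
    "values": {"employee.name": "text",
               "employee.salary": (0, 10000),
               "employee.age": (0, 100),
               "employee.dept": {"toy", "candy", "shoe"}},
    "constraints": [],
}

visual_model = {
    "classes": {"point"},
    "attributes": {"point": ["locationX", "locationY", "size", "shape", "color"]},
    "values": {"point.locationX": "plane-pixel-x",
               "point.locationY": "plane-pixel-y",
               "point.size": {100},
               "point.shape": {"square", "oval", "triangle"},
               "point.color": {"blue", "red", "green"}},
    "constraints": [],
}

def in_domain(model, attr, value):
    """Very rough domain check: tuples are ranges, sets are enumerations."""
    dom = model["values"][attr]
    if isinstance(dom, tuple):
        return dom[0] <= value <= dom[1]
    if isinstance(dom, set):
        return value in dom
    return True  # free-form domains such as "text" or pixel coordinates

print(in_domain(data_model, "employee.age", 55))        # True
print(in_domain(visual_model, "point.color", "pink"))   # False
```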
Visual Metaphors
A visual metaphor is defined as a correspondence between features of a visual and a data model. This
correspondence induces a mapping between visualsets (instances of the visual model) and datasets (instances of the data model). Having the mapping be decomposed into mappings of the model features helps produce visualizations that, when viewed, allow the user to deduce the underlying dataset. Consider a data model D = <𝒫_D, 𝒜_D, 𝒱_D, 𝒞_D> and a visual model G = <𝒫_G, 𝒜_G, 𝒱_G, 𝒞_G>. A metaphor includes correspondences between item classes (𝒫_D and 𝒫_G), their attributes (𝒜_D and 𝒜_G), and their attribute values (𝒱_D and 𝒱_G). These correspondences describe the meaning of visual model features relative to the underlying data model. To allow presentation flexibility, correspondences are permitted between multiple features in the visual model and a single feature in the data model. More formally, a metaphor T is an onto function from G to D (denoted by T: G → D), which is the union of the following three onto functions:
1. Function T_p: 𝒫_G → 𝒫_D
2. Function T_a: 𝒜_G → 𝒜_D, where T_a(P_G.A_G) = P_D.A_D only if T_p(P_G) = P_D
3. Function T_v: 𝒱_G → 𝒱_D, where T_v(P_G.A_G == v_G) = (P_D.A_D == v_D) only if T_a(P_G.A_G) = P_D.A_D
As an illustration of the above definition, the following metaphor T maps from the visual model to the data model discussed above:

x          T_p(x)
Point      Employee

x                  T_a(x)
point.locationX    employee.salary
point.locationY    employee.age
point.color        employee.dept
point.shape        employee.dept

x                              T_v(x)
point.locationX == x           employee.salary == (x*10)
point.locationY == x           employee.age == (x/5)
point.color == ‘‘red’’         employee.dept == ‘‘toy’’
point.color == ‘‘green’’       employee.dept == ‘‘candy’’
point.color == ‘‘blue’’        employee.dept == ‘‘shoe’’
point.shape == ‘‘square’’      employee.dept == ‘‘toy’’
point.shape == ‘‘oval’’        employee.dept == ‘‘candy’’
point.shape == ‘‘triangle’’    employee.dept == ‘‘shoe’’

A metaphor provides meaning to the features of a visual model by establishing a correspondence between them and features of a data model. The precise meaning is captured by T. For example, displaying a red square point implies the existence of an employee in the toy department. The metaphor must be such that users can correctly and unambiguously interpret a
displayed visualset. This can be made more precise based on various mandatory and optional properties of T, i.e., being a function, onto, total, and 1-1. As a minimum requirement, a metaphor T has been defined as an onto function: if it weren’t onto, then some characteristics of a data model would not be captured visually; if it weren’t a function, then a single visual construct would have multiple meanings and, therefore, would not be interpreted appropriately. Beyond the above, a visual model may have features that do not carry any meaning and/or features that carry redundant meaning. Based on knowledge of T, users should be able to ignore the former and not be confused by the latter. In particular, if a metaphor is not total, then some visual elements do not mean anything in the data model. This creates no problem if the metaphor will be used simply for visual display of datasets: unmapped features of the visual model will be used just for presentation. On the other hand, if the metaphor will be used for updating the displayed datasets, then if a visual attribute P_G.A_G is mapped by T_a, all the values in V_G(P_G.A_G) should be mapped as well; otherwise, one would be able to draw visualsets that are not translatable to datasets. Also, for both retrieval and update, if a metaphor is not 1-1, then multiple visual elements have the same meaning in the data model. If T_p or T_v are not 1-1, then there is a choice of visual constructs that can be used, which should be left to the user or resolved via some default mechanism. If T_a is not 1-1, then there is redundancy: multiple visual attributes capture the same data attribute. Functions that are not 1-1 establish equivalence classes among the features of the visual model, i.e., several features may have the same meaning. For example, T_a(point.color) = T_a(point.shape) = employee.dept specifies redundancy: visual attributes point.color and point.shape redundantly capture data attribute employee.dept. Value mappings are similar: T_v(point.color == ‘‘green’’) = T_v(point.shape == ‘‘oval’’) = (employee.dept == ‘‘candy’’) indicates that specific values of redundant visual attributes (e.g., point.color and point.shape) capture the same value of a data attribute.
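A minimal sketch of how the example metaphor T could be held in code is given below, with T_p, T_a and T_v as plain dictionaries and a small routine that interprets one visual item; the data structures and function names are illustrative assumptions, not part of the entry.

```python
# Hedged sketch (structures and names are assumptions): the example metaphor T
# as plain dictionaries, plus a routine that reads one visual item as the data
# facts it stands for.

Tp = {"point": "employee"}

Ta = {"point.locationX": "employee.salary",
      "point.locationY": "employee.age",
      "point.color": "employee.dept",      # redundant with point.shape
      "point.shape": "employee.dept"}

Tv = {("point.color", "red"): ("employee.dept", "toy"),
      ("point.color", "green"): ("employee.dept", "candy"),
      ("point.color", "blue"): ("employee.dept", "shoe"),
      ("point.shape", "square"): ("employee.dept", "toy"),
      ("point.shape", "oval"): ("employee.dept", "candy"),
      ("point.shape", "triangle"): ("employee.dept", "shoe")}

def interpret(point):
    """Translate one point into the employee facts implied by the metaphor."""
    facts = {"class": Tp["point"],
             "salary": point["locationX"] * 10,   # Tv: salary == x*10
             "age": point["locationY"] / 5}       # Tv: age == x/5
    for attr in ("color", "shape"):               # redundant encodings agree
        facts.setdefault("dept", Tv[("point." + attr, point[attr])][1])
    return facts

# A red square point reads as an employee in the toy department.
print(interpret({"locationX": 210, "locationY": 275, "color": "red", "shape": "square"}))
```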
Datasets, Visualsets, and Visual Representation
As defined above, a dataset or visualset may be considered as an instantiation of a data or visual model, respectively. In particular, a dataset S of a model M is defined as follows (similarly for a visualset):
1. For every P ∈ 𝒫, there is a finite set [P] ⊆ I(P) of data items in class P that appear in dataset S.
2. For every P.A ∈ 𝒜, there is a total function [P.A]: [P] → V(P.A), which determines the value of the P.A attribute for every data item in [P].
3. For every c ∈ 𝒞, there is a constraint [c], constructed from c by replacing every P ∈ 𝒫 by [P] and every P.A ∈ 𝒜 by [P.A]. All these constraints are satisfied by the above.
Given a metaphor T: G → D, any dataset that conforms to data model D can be represented by a visualset that conforms to visual model G via a mapping of their contents that remains faithful to the metaphor. In particular, visual items in the classes of such a visualset represent data items in the corresponding classes (based on T) of the dataset. Similarly, the visual attributes and the values of these visual elements represent the data item attributes and their values as specified by T. Essentially, there is a function t induced by T that determines a mapping from any visualset that conforms to G to some dataset that conforms to D. As with metaphor T, the induced function t can be extended so that its domain includes instantiations of constraints as well, i.e., t([c]) is valid for c ∈ 𝒞_G. As a simple example of all the above, consider the following dataset that conforms to the data model described above:
employee(Jason, 2.1, 55, ‘‘toy’’)
employee(Iris, 2.6, 60, ‘‘toy’’)
employee(Nick, 3.2, 43, ‘‘shoe’’)
employee(Magda, 5.8, 47, ‘‘shoe’’)
employee(Chloe, 3.4, 46, ‘‘candy’’)
employee(Phoebe, 8.3, 31, ‘‘candy’’)
employee(Mike, 8.3, 70, ‘‘candy’’)
The figure below gives an example visualset that would be produced by applying the example metaphor above to this dataset. (The color of a point is indicated textually.) Clearly, real systems often deal with more complicated data and visual models and visual metaphors, but the underlying principles remain those highlighted in this exposition.
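For completeness, the sketch below applies the induced mapping in the rendering direction, turning the employee dataset above into point attributes by inverting the value mappings of T; the resulting coordinate values simply follow the literal inverse of the example mappings and are not meant to be realistic pixel positions.

```python
# Illustrative rendering direction (values follow the literal inverse of the
# example metaphor; they are not meant to be realistic pixel coordinates).

employees = [
    ("Jason", 2.1, 55, "toy"), ("Iris", 2.6, 60, "toy"),
    ("Nick", 3.2, 43, "shoe"), ("Magda", 5.8, 47, "shoe"),
    ("Chloe", 3.4, 46, "candy"), ("Phoebe", 8.3, 31, "candy"),
    ("Mike", 8.3, 70, "candy"),
]

DEPT_TO_COLOR = {"toy": "red", "candy": "green", "shoe": "blue"}
DEPT_TO_SHAPE = {"toy": "square", "candy": "oval", "shoe": "triangle"}

def render(emps):
    """Map each employee to the attributes of one point of the visual model."""
    points = []
    for name, salary, age, dept in emps:
        points.append((name, {                # name kept only for readability
            "locationX": salary / 10,         # inverse of salary == x*10
            "locationY": age * 5,             # inverse of age == x/5
            "size": 100,
            "color": DEPT_TO_COLOR[dept],
            "shape": DEPT_TO_SHAPE[dept],
        }))
    return points

for name, attrs in render(employees):
    print(name, attrs)
```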
Key Applications
Visual representation is manifested in all applications with a (visual) user interface. Furthermore, having explicit declaration, storage, and manipulation of data and visual models as well as visual metaphors is essential to systems that offer users the ability to modify the visual representations used, for reasons of personalization, quality control, or even plain variety.
Cross-references
▶ Data Visualization
▶ Visual Formalisms
▶ Visual Metaphor
▶ Visual Perception

Recommended Reading
1. Card S.K., Mackinlay J.D., and Shneiderman B. Information visualization. In Readings in Information Visualization: Using Vision to Think, 1999, pp. 1–34.
2. Card S.K., Mackinlay J.D., and Shneiderman B. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann, Los Altos, CA, 1999.
3. Foley J.D., van Dam A., Feiner S.K., and Hughes J.F. Computer Graphics: Principles and Practice. Addison-Wesley, Reading, MA, 1990.
4. Haber E.M., Ioannidis Y., and Livny M. Foundations of visual metaphors for schema display. J. Intell. Inf. Syst., 3(3/4):263–298, 1994.
5. Mitchell W. Representation. In Critical Terms for Literary Study, F. Lentricchia and T. McLaughlin (eds.), 2nd edn. University of Chicago Press, Chicago, IL, 1995.
6. Tufte E.R. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, 1983.
7. Tufte E.R. Envisioning Information. Graphics Press, Cheshire, CT, 1990.

Visual Similarity
▶ Image Retrieval

Visual Web Data Extraction
▶ GUIs for Web Data Extraction

Visual Web Information Extraction
▶ GUIs for Web Data Extraction

Visualization for Information Retrieval
Jin Zhang
University of Wisconsin-Milwaukee, Milwaukee, WI, USA
Definition
Visualization for information retrieval refers to a process that transforms the invisible abstract data and their semantic relationships in a data collection into a visible display and visualizes the internal retrieval processes for users [14]. Basically, visualization for information retrieval consists of two components: visually presenting objects in a meaningful way in a defined visual environment or space, and visualizing the information seeking process within that environment or space. Transforming the invisible and abstract information and its sophisticated semantic connections into a visual environment requires clearly defining a visual environment or space, pinpointing the objects in a data collection that are displayed in the visual environment, identifying significant connections among the objects, and projecting both the objects and the relationships onto the environment. The form and manner of visualizing the internal retrieval process vary across visualization models. Visualization of the internal retrieval process depends not only on a defined visual environment, but also on the information seeking paradigm, such as query searching or browsing.
Many traditional information retrieval evaluation models, such as the distance evaluation model, the cosine evaluation model, and the ellipse evaluation model, can be visualized in a visual space. Furthermore, new information retrieval evaluation models can be developed in the visual space. The primary advantage of visualization for information retrieval is that it makes objects, object relationships, and information seeking transparent to end users. As a result, information retrieval becomes more intuitive and effective. Visualization methods can be applied to various information retrieval models, such as the Boolean-based information retrieval model and the vector-based information retrieval model.
Historical Background
A traditional information retrieval system, such as an OPAC (Online Public Access Catalog) system or a search engine, usually matches a user's query with document surrogates in a database and returns a linear list of retrieval results. The results list includes relevant items retrieved from the database and may be ranked by title, author, publishing time, or similarity. By nature, an information retrieval process is iterative. Users of the information retrieval system are expected to use the results list to make relevance judgments, analyze subject relevance among the retrieved items, and readjust their search strategies. To this end, an information retrieval system should provide users not only with relevance information between a query and the retrieved items, but also with relevance information among the retrieved items in the retrieval results list. The former information is apparently important for information retrieval. The latter, which is just as valuable, would help users further expand, revise, and modify their search strategies. Unfortunately, even if an information retrieval system is equipped with a similarity ranking mechanism for its linear results list, it only provides users with the relevance information between a query and the retrieved items. The inherent weakness of the linear results list structure is that it cannot offer relevance information among the retrieved items, let alone the degree to which they are relevant. In a traditional information retrieval system, the user's information seeking process is discontinuous. After translating a user's information need into a
query and submitting it to the system, the user loses control over the internal information processing and regains control only after the retrieval results are returned. The user has no clue how the query is matched against document surrogates or how the final retrieval decision is made. In other words, the internal retrieval processing is a "black box" for the end user: users can neither observe the internal processing nor engage in it. If users could participate in the internal retrieval processing, however, information seeking would become more user-friendly, relevance judgment decision-making more accurate, and the retrieval process more natural. Information retrieval basically consists of two important fronts: browsing and querying. They are equally important for information seeking; each has its own advantages and disadvantages, and neither can replace the other. In a traditional information retrieval context where querying dominates the retrieval process, browsing cannot be fully utilized to explore information because a meaningful information browsing environment is lacking. Visualization for information retrieval opens a new avenue for users to seek information on both fronts. Korfhage [6], a pioneer and leading researcher in the field of visualization for information retrieval, made significant contributions to the early research in the field. He introduced the important concept of the reference point, which can be used as a projection reference when objects in a high-dimensional space are projected onto a low-dimensional visual space. He visually interpreted an information retrieval evaluation model in a visual space and proposed new visualization models for information retrieval.
Foundations

Models for Multiple Reference Points
A reference point (or point of interest) is introduced to represent users' information needs. In a broad sense, a query is a kind of reference point. A reference point can reflect a particular perspective of a user's information need: it can be a subject topic, a user's search preference, a previous query, or any information related to the user's information need. Since a document usually involves multiple facets, multiple reference points can reveal more information about the document from multiple perspectives.
The visualization models for multiple reference points can be classified into two categories: models for fixed multiple reference points and models for movable reference points. In a model for fixed reference points [11], developed to visualize a sophisticated Boolean query, the vertices of a polygon represent the fixed reference points and the edges of the polygon have the same length. Within the polygon there are several concentric layers; their number equals the number of reference points, i.e., the number of vertices of the polygon. Each layer represents a different logical combination zone of the involved reference points. For instance, the first outer layer represents the single-reference-point zone, the second outer layer represents the logical AND combination zone for two reference points, and so on, until the central layer, which represents the logical AND combination zone for all reference points (i.e., the final results of the query). Icons symbolizing the search results for the logical combination zones are located within the corresponding layers. The shape and number of the icons vary across layers so that they can be easily distinguished in the visual space. This multiple-layer structure can display not only the final results of a query but also the itemized results at different combination levels. In addition, each vertex of the polygon can be defined as a sub-query; in that case, the vertex can extend to a new sub-polygon that visualizes the sub-query results by connecting the vertex to the new sub-polygon. Thanks to this feature, the model can visualize a sophisticated Boolean query.
A model for movable reference points [9] works differently from the model for fixed reference points. Its most distinctive characteristic is that the involved reference points can be manipulated by users and placed anywhere in the visual space; in other words, the reference points can be moved to meaningful positions. Because the model is based on the vector model instead of the Boolean model, the similarity between a document and any of the involved reference points can be calculated with a selected similarity measure. Basically, the location of a document in the visual space is determined directly by the strengths (similarities) between the document and all involved reference points. Consequently, after all reference points and documents are projected onto the visual space, documents that are closer to a reference point are more relevant to it. The uniqueness of the model lies in the fact that the strength between a document and a reference point is relative: the strength is measured by the ratio of the similarity between the document and that reference point to the sum of the similarities between the document and all reference points in the visual space. It is this relativity, together with the fact that the similarities are not assigned directly to any coordinates of the visual space, that makes the reference points movable in the visual space. That is, the positions of the projected documents are determined by the locations of the reference points. The implication of this movability for information retrieval is that users can select a reference point of interest, move it, and observe the change of the object configuration caused by the movement. Documents in the visual space that are relevant to the moving reference point move accordingly; otherwise they stay still. Using this characteristic, users can effectively identify a group of terms related to a reference point in the visual space by selecting and moving it. This model can be applied to visualizing a returned results list from an information retrieval system, a full text, and hyperlink structures.
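As a concrete illustration of this relative-strength idea, the following is a minimal sketch that assumes cosine similarity over term vectors; it is not the published VIBE algorithm, and the function and variable names are invented for the example.

```python
import numpy as np

def layout_movable_reference_points(docs, ref_points, ref_positions):
    """Place each document at the similarity-weighted average of the
    2D positions of the reference points (a VIBE-like layout sketch).

    docs          : (n, d) array of document term vectors
    ref_points    : (k, d) array of reference point term vectors
    ref_positions : (k, 2) array of user-chosen 2D positions
    returns       : (n, 2) array of document positions in the visual space
    """
    # Cosine similarity between every document and every reference point.
    dn = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    rn = ref_points / np.linalg.norm(ref_points, axis=1, keepdims=True)
    sim = dn @ rn.T                                   # shape (n, k)

    # Relative strength: each row is normalized so the weights sum to one.
    weights = sim / sim.sum(axis=1, keepdims=True)

    # A document's position is the weighted combination of the reference
    # point positions; moving a reference point moves its related documents.
    return weights @ ref_positions
```

Moving one row of ref_positions and recomputing the layout reproduces the interactive behavior described above: documents similar to the moved reference point follow it, while unrelated documents stay still.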
Euclidean Space Characteristics Based Models
Two of the most prominent characteristics of the Euclidean space are distance and direction. Both are used to define and describe object locations in a space and to demonstrate relationships among the objects in the space. When documents are organized in a vector space, they can be described in a hyperspace where both the distance and the direction characteristics are preserved. Furthermore, both distance and direction are meaningful for information retrieval, so it is natural and intuitive to utilize them to develop information retrieval visualization models. By reducing the dimensionality of the high-dimensional document vector space, spatial relationships among the documents can be projected onto a low-dimensional space where the documents can be observed and manipulated. Euclidean space characteristics based models require two reference points to construct their low-dimensional visual spaces. If two reference points (R1 and R2) are clearly defined in the vector space, a group of important parameters can be calculated for a document (D): the two visual distances (R1D and R2D) between the document and
the two reference points, respectively, and the two visual angles (∠R2R1D and ∠R1R2D) formed by the three points (D, R1, and R2). For simplicity, D, R1, and R2 are displayed in a three-dimensional space (see Fig. 1). If one of the two visual distances and one of the two visual angles are assigned to the Y-axis and X-axis, respectively, this constructs the visual space of the distance-angle based visualization model [15]. If the two angles are assigned to the Y-axis and X-axis, respectively, this constructs the visual space of the angle-angle based visualization model [13]. If the two distances are assigned to the Y-axis and X-axis, respectively, this constructs the visual space of the distance-distance based visualization model [8]. Observe that as long as these visual spaces are defined, any document in the vector space can be effectively projected onto them. The valid display areas vary across the models, ranging from a closed area to a semi-open area. The uniqueness of these visualization models is reflected in the visualization of information retrieval evaluation models such as the distance evaluation model, cosine evaluation model, conjunction evaluation model, disjunction evaluation model, ellipse evaluation model, and oval evaluation model. In other words, they visualize the internal retrieval process in the visual spaces. For instance, in the vector space, the distance evaluation model corresponds to a hypersphere whose center is the query and whose radius is a retrieval threshold; documents within this invisible sphere are regarded as the retrieved documents. This hypersphere can be converted to a horizontal line in the distance-angle based low-dimensional visual space if the query and the origin of the vector space are defined as the two reference points, because the distance from any point on the sphere to the center, which is one of the projection parameters, is a constant regardless of the other projection parameter, the angle. Users can interact with the horizontal line in the visual space to control the size of the sphere in the vector space.

Visualization for Information Retrieval. Figure 1. Display of the visual distances and visual angles of a document.
Visualization of Hierarchy Structures
As an important browsing mechanism, a hierarchy structure is widely used to organize and present a variety of information. A hierarchy structure can provide users with a structured categorical framework that directs them to relevant information. Visualization techniques can enable users to navigate smoothly in a hierarchy structure by visualizing both global and local overviews of the structure; they enhance the controllability and flexibility of information exploration and therefore alleviate "disorientation" during navigation. ConeTree [10] is a visualization system that shows the root, branch, and leaf nodes of a tree structure in a 3D environment. Starting from the apex of a cone (the root of the tree), the tree fans out from left to right in the form of a cone. The child nodes are displayed along the circular circumference of the cone base, whose diameter depends on the number of child nodes on it. Any of these child nodes can extend to a new cone structure similar to its parent if it has children of its own. In this way, an entire hierarchy structure can be presented visually. TreeMap [1] displays a hierarchy structure in a different way: its visual space is a grid, and a cell of the grid can include several structurally similar sub-grids. The nested sub-grids show the categories at different levels; each sub-grid corresponds to a new sub-category.

Visualization of Internet Information
The Web presents both challenges and opportunities for visualization for information retrieval. Information on the Web is huge, dynamic, diverse, and hyperlinked, and the Web has become an indispensable source for people searching for information. One of the problems of information seeking on the Web is the notorious "lost in cyberspace" state. Visualization for information retrieval can alleviate, if not eliminate, this problem. Using information visualization techniques such as the hyperbolic method, people can effectively
visualize the hidden hyperlink relationships among connected web pages. In such a visualization environment, web pages (nodes) are connected by edges (hyperlinks). Users can easily recognize the visited paths, revisit the browsed web pages, and explore new paths in the visual space. A hyperbolic method is employed, for example, to visualize a subject directory whose nodes are linked by hyperlinks [4]. A search engine can respond to a user query with an overwhelming number of related web pages, and it is difficult for users to browse the returned list and identify the relevant pages. Information visualization techniques provide a unique way to address this problem. In a cartographic visual space, the web pages returned from multiple search engines are metaphorically presented as cities and the semantic link between two cities as a road [5]; the interactive map facilitates relevance judgment decisions. Grokker [3] categorizes all retrieved web pages and presents them as a group of circles in its visual space. A circle, which represents a category, includes multiple smaller circles, which are its sub-categories. Users can drill down from the top of a category to the bottom (the web pages) by selecting the corresponding circles in the visual space. It is worth pointing out that there are many other information visualization approaches, such as multidimensional scaling analysis [12], pathfinder associative networks [2], and self-organizing maps [7], which can also be applied to information retrieval. A detailed discussion of visualization for information retrieval, including comparisons among the various approaches and their implications for information retrieval, can be found in Zhang's monograph on the topic [14].
Cross-references
▶ Information Retrieval ▶ Information Visualization
Recommended Reading
1. Asahi T., Turo D., and Shneiderman B. Using treemaps to visualize the analytic hierarchy process. Inf. Syst. Res., 6(4):357–375, 1995.
2. Dearholt D.W. and Schvaneveldt R.W. In Pathfinder Associative Networks: Studies in Knowledge Organization, R.W. Schvaneveldt (ed.). Ablex Publishing, Norwood, NJ, 1990, pp. 1–30.
3. Grokker. Available online at: http://www.grokker.com/ (retrieved October 29, 2007).
4. Inxight. Available online at: http://www.inxight.com/ (retrieved on October 29, 2007).
5. Kartoo. Available online at: http://www.kartoo.com/ (retrieved on October 29, 2007). 6. Korfhage R.R. To see or not to see – is that the query? In Proc. 14th Annu. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1991, pp. 134–141. 7. Lin X. Map displays for information retrieval. J. Am. Soc. Inf. Sci., 48(1):40–54, 1997. 8. Nuchprayoon A. and Korfhage R.R. GUIDO: a visual tool for retrieving documents. In Proc. 1994 IEEE Comp. Soc. Workshop on Visual Languages, 1994, pp. 64–71. 9. Olsen K.A., Korfhage R.R., Sochats K.M., Spring M.B., and Williams J.G. Visualization of a document collection: the VIBE system. Inf. Process. Manag., 29(1):69–81, 1993. 10. Robertson G.G., Card S.K., and Mackinlay J.D. Information visualization using 3D interactive animation. Commun. ACM, 36(4):57–71, 1993. 11. Spoerri A. InfoCrystal: a visual tool for information retrieval and management. In Proc. Second Int. Conf. on Information and Knowledge Management, 1993, pp. 11–20. 12. White H.D. Visualizing a discipline: an author co-citation analysis of information science, 1972–1995. J. Am. Soc. Inf. Sci. Technol., 49(4):327–355, 1998. 13. Zhang J. TOFIR: a tool of facilitating information retrieval – introduce a visual retrieval model. Inf. Process. Manag., 37(4):639–657, 2001. 14. Zhang J. Visualization for Information Retrieval. Springer, Berlin, 2008. 15. Zhang J. and Korfhage R.R. DARE: distance and angle retrieval environment: a tale of the two measures. J. Am. Soc. Inf. Sci., 50(9):779–787, 1999.
Visualization Pipeline
HELWIG HAUSER, University of Bergen, Bergen, Norway
HEIDRUN SCHUMANN, University of Rostock, Rostock, Germany
Synonyms
Visualization reference model
Definition
The visualization pipeline is a general model for a typical structure of a visualization process, which has been abstracted from earlier works in visualization in the late 1980s [3], and then adapted and extended several times [1,2,...]. Starting out from data to be visualized and a particular visualization task at hand, a number of steps are processed along the visualization pipeline, including data enhancement, visualization mapping, and rendering, to eventually achieve a
visualization of the data with the purpose of serving the given visualization task through effectiveness, expressiveness, and appropriateness [4].
Key Points
Data visualization is a part of computer science which provides expressive visual representations of data – through the appropriate application of computer graphics, often also in an interactive form – with the intention to effectively aid specific user tasks, including the exploration, analysis, and presentation of data. Only graphical data visualization is considered here; other forms of data visualization, such as sonification or visualization through haptics, are disregarded. Visualization establishes an efficient link between users and their data, utilizing the enormous strengths of the human visual system in perception and cognition, and leading to a better understanding of the data through information extraction and knowledge crystallization. To do so, the data is transformed from its original space, i.e., the data space, to an abstract visualization space, usually in two or three dimensions, where the visual representation of the data is constructed. To eventually link up to the user, the visualization is rendered to an image or animation which is then displayed to the user. At least two mappings, i.e., the visualization mapping (from data space to visualization space) and rendering (from visualization space to image
space) are required to fully accomplish this process. An example is the geometric representation of a 3D scalar data distribution by a discretized iso-surface, i.e., a mesh of triangles, which is then rendered to the screen using computer graphics algorithms for visible surface detection and shading. In almost all cases, however, it is also necessary to first prepare and transform the data according to the needs of the visualization task at hand, so that visualization mapping and rendering can be applied afterwards. Haber and McNabb described this principal visualization model (as illustrated in Fig. 1) in 1990 [3]. About ten years later, Chi formulated the data state reference model for visualization [1], which expresses the visualization pipeline in terms of transformation and stage operators that – depending on their combination – result in various possible instances of the visualization pipeline model. In 2004, dos Santos and Brodlie concluded that it is more meaningful to split the first step of the visualization pipeline into two [2]: data analysis and filtering. While the original data is initially prepared (once) for visualization, e.g., through interpolation, i.e., in a data-centric fashion, the user thereafter selects which data to actually visualize in the highly interactive and thus user-centric filtering step. Especially in the case of explorative visualization, where users search for a priori unknown features in the data, flexible interaction with the filtering step is crucial.
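As a rough illustration of this staged structure (not taken from the cited works), the pipeline can be modeled as a chain of composable operators; the stage names follow the split into data analysis, filtering, mapping, and rendering described above, while the concrete operations inside each stage are placeholders.

```python
# Illustrative sketch of a visualization pipeline as composable stages.
# The stage contents are placeholders; only the overall structure mirrors
# the data analysis -> filtering -> mapping -> rendering chain.

def data_analysis(raw):
    """Prepare the data once, e.g., clean records and derive values."""
    return [float(v) for v in raw if v is not None]

def filtering(data, threshold):
    """User-centric selection of the data that is actually visualized."""
    return [v for v in data if v >= threshold]

def mapping(data):
    """Visualization mapping: turn data values into geometric primitives."""
    return [{"x": i, "height": v} for i, v in enumerate(data)]

def rendering(primitives):
    """Rendering: turn geometry into a displayable image (here, text)."""
    return "\n".join("#" * round(p["height"]) for p in primitives)

raw = [3.0, None, 5.5, 1.2, 4.8]
image = rendering(mapping(filtering(data_analysis(raw), threshold=2.0)))
print(image)
```

Interaction corresponds to re-running the pipeline from the stage whose parameters (e.g., the filter threshold) the user changed.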
Visualization Pipeline. Figure 1. The visualization pipeline according to Haber and McNabb [3].
Visualization Pipeline. Figure 2. The visualization pipeline based on operators [1], with two steps in data space [2], and with user interaction.
The consideration of user interaction in general, altering operator parameters at all stages of the visualization pipeline, is an important aspect of the modern visualization pipeline (as also shown in Fig. 2).
Cross-references
▶ Data Visualization ▶ Visual Data Mining
Recommended Reading
1. Chi E.H. A taxonomy of visualization techniques using the data state reference model. In Proc. IEEE Symp. on Information Visualization, 2000, pp. 69–75.
2. dos Santos S. and Brodlie K. Gaining understanding of multivariate and multidimensional data through visualization. Computers & Graphics, 28(3):311–325, 2004.
3. Haber R.B. and McNabb D.A. Visualization idioms: A conceptual model for scientific visualization systems. In Visualization in Scientific Computing, G.M. Nielson, B. Shriver, L.J. Rosenblum (eds.). IEEE Computer Society, 1990, pp. 74–93.
4. Schumann H. and Müller W. Visualisierung – Grundlagen und allgemeine Methoden (in German). Springer-Verlag, Berlin, 2000.
Visualization Reference Model ▶ Visualization Pipeline
Visualizing Categorical Data
ALI ÜNLÜ, ANATOL SARGIN
University of Augsburg, Augsburg, Germany
Synonyms
Visualizing categorical data; Graphics for discrete data; Visual displays of nonnumerical data; Plots for qualitative information
Definition
Categorical data are data recorded about units on variables which take values in a discrete set of categories. Examples of categorical variables are gender, citizenship, or number of children. Categorical variables can be dichotomous (two categories; e.g., gender) or polytomous (more than two categories;
e.g., citizenship), and nominal (unordered categories; e.g., gender) or ordinal (ordered categories; e.g., number of children). Categorical data can be in case form (individual raw data vector recorded about each unit) or frequency form (tabulated data counting over the categories of the variables), and univariate or multivariate (including bivariate). Quantitative variables can be discretized to become categorical variables (e.g., using child and adult instead of exact age). Strictly speaking, all data may be considered categorical because of limited precision of measurement. Visualizing categorical data, indeed visualization of data in general, seeks to (i) summarize the data, (ii) expose information and structure in the data, (iii) supplement the information available from analytic measures on the data, and (iv) suggest more adequate analytics for the data.
Key Points
Graphics for univariate categorical data are barcharts and stacked barcharts (for vertically drawn bars, all bars have the same width), and spineplots (modified barcharts where, for vertically drawn bars, all bars have the same height). The area of a bar represents the count for its category. Whereas (stacked) barcharts enable a good comparison of (cumulative) absolute counts, spineplots enable a direct comparison of proportions. When interactively highlighting a subgroup of units, spineplots allow visually comparing the proportions in different categories by looking at the heights of the highlighted areas [3]. Another popular, yet widely criticized, graphic is the pie chart (with partial, exploded, or perspective variants), for displaying shares. Graphics for multivariate categorical data are mosaic plots and their variations: equal-binsize and doubledecker plots (for comparing highlighted proportions), fluctuation diagrams (for identifying common combinations), and multiple barcharts (for comparing conditional distributions) [2]. Mosaic matrices, as a categorical analog of the scatterplot matrix, consist of all pairwise mosaic plots for two-way subtables [1]. Other powerful graphics are treemaps and trellis displays [2,3], for visualizing hierarchies and conditional structures, respectively. There are graphics specifically designed for bivariate categorical data, for instance, fourfold displays and sieve diagrams (parquet diagrams) [1]. For 2 × 2 tables, the fourfold display visualizes the association between the variables in terms of the odds ratio, with confidence rings providing a visual test of whether
the odds ratio differs significantly from 1. Sieve diagrams provide displays of the pattern of association in general r × c tables. Influence and diagnostic plots for such categorical data models as logistic regression, loglinear, and logit models (e.g., half-normal probability plots) provide examples of model-based visualizations of categorical data [1,2]. These plots help to evaluate model fit. Dimension reduction methods such as correspondence analysis and biplots provide visualizations of associations in n-way tables in a small number of dimensions [1].
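For reference, the following sketch computes the sample odds ratio of a 2 × 2 table together with an approximate 95% confidence interval (the standard normal approximation on the log odds ratio); the rings of a fourfold display are one way of showing such an interval graphically. The example counts are made up.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Sample odds ratio of a 2x2 table [[a, b], [c, d]] with an
    approximate confidence interval based on the log odds ratio."""
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_hat) - z * se_log)
    hi = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, (lo, hi)

# Hypothetical counts: rows = group 1/group 2, columns = success/failure.
print(odds_ratio_ci(30, 10, 15, 25))  # an interval excluding 1 suggests association
```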
Cross-references
▶ Data Visualization ▶ Multivariate Visualization Methods ▶ Visual Analytics ▶ Visualizing Quantitative Data
Recommended Reading
1. Friendly M. Visualizing Categorical Data. SAS Institute, Cary, NC, 2000.
2. Hofmann H. Mosaic plots and their variants. In Handbook of Data Visualization, C.H. Chen, W. Haerdle, A.R. Unwin (eds.). Springer, Berlin Heidelberg New York, 2008.
3. Unwin A.R., Theus M., and Hofmann H. Graphics of Large Datasets. Springer, Berlin Heidelberg New York, 2006.
Visualizing Clustering Results
ALEXANDER HINNEBURG
Martin-Luther-University Halle-Wittenberg, Halle/Saale, Germany
Synonyms
Dendrogram; Heat map
Definition
Visualizing clusters is a way to facilitate human experts in evaluating, exploring or interpreting the results of a cluster analysis. Clustering is an unsupervised learning technique, which groups a set of n data objects D = {x1,...,xn} into clusters, so that objects in the same cluster are similar, and objects from different clusters are dissimilar to each other. The data can be available (i) as an (n × n) matrix of similarities (or dissimilarities), and (ii) as an (n × d) data matrix, which describes each
data object by a d-dimensional vector. The second form has to be accompanied by a suitable similarity or dissimilarity measure, which computes a (dis)similarity score for a pair of d-dimensional vectors; a typical example of such a measure is the Euclidean metric. Clustering results may come in different forms: (i) as a partition of D, (ii) as a model which summarizes properties of D, and (iii) as a set of hierarchically nested partitions of D. Visualizations of those results focus on one or several properties, such as the similarity relations between clusters and/or the underlying data objects, the components of the clustering model, or the representation of the data objects. Cluster visualization serves to determine, by inspection, the semantics of clusters relevant to the application at hand, to check whether the cluster model and its parametrization (e.g., number of clusters) fit the data in a meaningful way, or to explore hierarchical relationships between clusters.
Historical Background
Clustering has been used as a technique to explore the structure of data. However, the results are abstract in nature, and in general the found cluster structure is not directly accessible to human experts as an overall picture. Data exploration is also facilitated by directly visualizing the data itself without clustering it. Simple examples include the scatter plot technique for vector data, which shows 2D projections for all possible d(d−1)/2 pairwise attribute combinations. Independently computed results of partitioning clustering algorithms can be shown by labeling the points in the scatter plots according to cluster membership. An example is shown in Fig. 1a. As the display of all 2D scatter plots requires quite a large view space, alternative techniques turn to dimensionality reduction to derive a single 2D projection of the vector data, which (approximately) optimizes some criterion. Dimensionality reduction techniques for data visualization include principal component analysis (PCA) using the first two principal components, classic multi-dimensional scaling (MDS), which performs PCA on the (dis)similarity matrix, and isoMDS, which minimizes the sum of squared differences between the original (dis)similarities and those in the projected 2D space. Newer techniques of this sort include FastMap [6] and latent probabilistic models like probabilistic and Bayesian PCA [4]. The optimization criteria differ from technique to technique, but the results are always sets of 2D points. The main property is that
dimensionality reduction and clustering are independently applied to the original data, and the results of both are combined in the visualization.

Visualizing Clustering Results. Figure 1. (a) Scatter plot of the four-dimensional Iris data, (b) 2D-PCA of the Iris data. The classes Setosa (S), Versicolor (C) and Virginica (V) are coded by letter and the membership of the clusters is coded by the colors red, green, and blue.

Another need for visualizing clustering results, besides data exploration, stems from cluster model evaluation and selection. Clustering methods require parameters; e.g., some algorithms need the number of clusters k as an input parameter. A large number of validation measures have been proposed to tackle the particular problem of choosing the right value for k. A typical example is the Bayesian information criterion (BIC), which balances the number of parameters with the model performance measured as likelihood. An example of BIC for a Gaussian mixture model applied to the Iris data is
dimensionality reduction and clustering are independently applied to the original data, and the results of both are combined in the visualization. Another need for visualizing clustering results beside data exploration stems from cluster model evaluation and selection. Clustering methods require parameters, e.g., some algorithms need the number of clusters k as an input parameter. There is a large number of proposed validation measures to tackle the particular problem of choosing the right value for k. A typical example is the Bayesian information criterion (BIC), which balances the number of parameters with the model performance measured as likelihood. An example of BIC for a Gaussian mixture model applied to the Iris data is
V
3419
shown in Fig. 2a. Gaussian mixture models are a probabilistic extension of k-means. Note that the BIC gives no clear indication whether k = 2 or k = 3 is better. Moreover, although measures like the BIC are good indicators for model selection, they may also mislead the user [5] in cases where the model oversimplifies the underlying data. In general, there is no broadly accepted measure that solves the problem of model selection, so cluster analysis often requires manual tuning of the algorithms to the application at hand. Visualization serves as a tool in the process of evaluating the performance of a clustering algorithm. However, due to its subjective nature, visualization is mainly used here as an auxiliary evaluation method.
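The figures in this entry were produced with the statistics package R (see the URL to Code section). As a rough, hedged equivalent, the following sketch computes BIC curves for Gaussian mixture models over a range of k using scikit-learn; note that scikit-learn reports BIC so that lower values are better, the opposite sign convention of the figure described above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data                      # the four-dimensional Iris data

bics = []
for k in range(1, 11):
    # Average over several random restarts, as in the figure's description.
    scores = [GaussianMixture(n_components=k, random_state=seed).fit(X).bic(X)
              for seed in range(10)]
    bics.append(np.mean(scores))
    print(f"k={k:2d}  mean BIC={bics[-1]:.1f}")

best_k = int(np.argmin(bics)) + 1         # lower BIC is better in scikit-learn
print("k minimizing BIC:", best_k)
```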
Visualizing Clustering Results. Figure 2. (a) BIC of a Gaussian mixture model fitted to the Iris data for k = 1,...,10. The shown BIC values are averages of 10 runs (higher BIC values are better). (b, c) show clusterings for k = 2 and k = 3.
Foundations
A simple but powerful method to visualize the results of clustering is to visualize the cluster model itself. A straightforward example is k-means, which has been used to determine the clusters (shown by color) in Fig. 1a, b. Note that in this case the cluster model, which consists of the cluster centers only, is shown indirectly by color coding the cluster membership. The cluster centers, which are the means of the clusters, can be visually approximated. An explicit visualization would be required to show the coordinates of the cluster centers and perhaps the decision boundaries of the clusters, which correspond to the edges of the Voronoi diagram induced by the cluster centers. A more complex cluster model is the Gaussian mixture model, whose components consist of multidimensional Gaussian distributions. Such a distribution has, in addition to the center coordinates, parameters for the covariances between the dimensions. Figures 2b, c show some 2D projections of the Gaussian mixture models with two and three components fitted to the Iris data. Each component is visualized as an ellipse, whose center corresponds to the center of the Gaussian, and the orientation of the ellipse reflects the correlations between dimensions. Specifically, each ellipse shows the points which have the same probability density with respect to the corresponding Gaussian. The figures do not show the cluster memberships of the data points, because the Gaussian mixture model only gives a posterior distribution P(c | x) for a point x and a cluster c, which allows several options for associating points with clusters. The most common method is to associate a point with the cluster with the maximum posterior. The shown visualizations combine both data and cluster models. However, there are many cases where this scenario is not directly applicable. The reasons might be that the data does not allow a direct visualization in a scatter plot style, or that the dimensionality of the vector data is too high to be easily reduced to 2D projections by picking dimensions by hand. A typical example is document clustering, where the data objects are documents. Documents are either represented as sets of words or as high-dimensional word-count vectors. Neither case allows a straightforward two-dimensional visualization of the document collection. A general method is parametric embedding (PE) [10], which takes the posterior distributions of the data objects with respect to some computed clusters,
and fits a Gaussian mixture model of two-dimensional data points, so that the posteriors of the mixture model reflect the original posteriors. The model fitting of parametric embedding is done by minimizing the sum of Kullback–Leibler divergences between the original posterior distributions and those from the two-dimensional Gaussian mixture model. Note that parametric embedding determines both the parameters for the two-dimensional Gaussian components of the mixture model and the positions of the two-dimensional data points onto which the original data objects are mapped. This technique allows the visualization of the clustering results of most probabilistic clustering algorithms, regardless of the type and dimensionality of the data. In the case where posterior probabilities are not available from the clustering algorithm, neighborhood component analysis (NCA) [8] can be applied for cluster visualization. NCA assumes vector data with given cluster (or class) labels and minimizes a stochastic version of the leave-one-out k-nearest-neighbor classification error by learning a Mahalanobis metric. Therefore, NCA falls into the class of metric learning. Cluster visualization is achieved by restricting the rank of the learned Mahalanobis matrix to two or three. In such settings, the linearly transformed data can be visualized as scatter plots in two- or three-dimensional space. An improvement of the PE technique is the combination of clustering and visualization [11], where the two-dimensional visualization acts like a regularizing prior distribution for the clustering model. The improvement is that the cluster membership values for each object are not fixed during the calculation of the visualization, so both the parameter set for the clustering model and the parameters for the visualization are estimated simultaneously. In the case of high-dimensional data, the regularizing visualization component helps to avoid overfitting of the clustering model. Although the experiments are done with a simple Gaussian mixture model (e.g., see Fig. 2), the improved technique could be used in combination with any probabilistic clustering model. Instead of using scatter plots, which represent multi-dimensional data by two-dimensional points, parallel coordinates [1] can be used, which have no need for dimensionality reduction (e.g., see Fig. 3). Parallel coordinates visualize a multi-dimensional vector as a sequence of consecutive line segments, and visually scale up to about 15 dimensions. The coordinate
axes are represented by vertical lines. Simple visualizations show the clustering by color coding the cluster membership. A more advanced technique is proposed in [7]: cluster centers are visualized by parallel coordinates, and additionally the spans of the clusters projected onto the attributes are visualized by drawing opaque tubes of varying thickness around the line segments of the cluster centers.

Visualizing Clustering Results. Figure 3. K-means clustering (k = 3) of the Iris data shown in parallel coordinates.

So far, clusters have been visualized by showing a possibly transformed version of the data objects themselves, together with some information about the clustering and/or the cluster model. A principal alternative to that approach is to visualize distance or similarity relations between the data objects and/or the derived clusters. In the next paragraphs, the methods are explained with respect to object distances to simplify the discussion; however, similarity measures could be used as well. The number of pairwise distances between n data objects scales quadratically in n, which makes it difficult to visualize all pairwise distances of large data sets (e.g., n ≥ 1,000). On the other hand, visualizing distances or similarities is applicable to a much wider spectrum of data types, in contrast to the previous approaches, which assumed a given or derived representation of the data objects as vectors. Thus, visualizing relative information is attractive for looking at clusterings of sequences, trees, graphs, sets, and other non-vector-like data. The direct visualization of the distance matrix D ∈ R^(n×n) of a data set codes the value of a matrix entry d_ij by the color or gray value of a pixel, while the
two indices i and j are coded by the pixel's position. In order to visualize a clustering, the columns and rows are partially ordered into blocks according to cluster membership. When the clustering is meaningful, changing the column and row permutation from a random order to the partial ordering induced by the clustering has a huge visual impact. In the ideal case, clusters form blocks which are placed along the diagonal of the matrix. Possible relations between clusters show up as off-diagonal blocks in such a plot. This technique also allows the comparison of a clustering with some ground truth, if available, by ordering the rows and columns according to clustering and class labels, respectively. Examples are shown in Fig. 4. As the distance values sometimes have a skewed distribution, it improves the visual results to apply a monotone sub(super)-linear transformation to the distance values before mapping them to color, in order to increase (decrease) the visual contrast. In Fig. 4, the square roots of the distances are shown. Non-clustering methods to find a good permutation of the distance matrix are discussed in [9] under the term seriation. Hierarchical clustering is one of the earliest clustering methods. The generic, agglomerative, bottom-up algorithm starts with a clustering where each data object is its own cluster. In each step, the closest pair of clusters (i.e., the pair with the smallest distance) is found and merged to form a new cluster. The distance between clusters C and C′ is induced by the distance between objects in different ways, e.g., single linkage d(C, C′) = min_{x∈C, x′∈C′} d(x, x′), average linkage d(C, C′) = (1 / (|C| · |C′|)) Σ_{x∈C} Σ_{x′∈C′} d(x, x′), or complete linkage d(C, C′) = max_{x∈C, x′∈C′} d(x, x′). After n − 1 steps, a hierarchy of nested clusterings is determined, which is represented as a binary tree. The visualization of such a tree is called a dendrogram, where the height of an inner node is the distance between the merged clusters. Leaf nodes have zero height. The height of a complete linkage dendrogram is the maximal distance between two objects, the height of a single linkage dendrogram is the maximal edge in the minimal spanning tree of the data, and the height of an average linkage dendrogram lies between the two. Figure 5 shows different dendrograms of the Iris data. Overall, a dendrogram shows n − 1 distances between clusters, which is much less distance information than the original distance matrix. So, a dendrogram can be seen as a lossy compressed form of
the distance matrix.

Visualizing Clustering Results. Figure 4. Distance matrix of the Iris data: (a) in random order, (b) columns and rows are ordered according to a k-means clustering with k = 3, and (c) rows are ordered according to a k-means clustering with k = 3 and columns are ordered by ground truth class labels.

The number of possible visualizations of the same dendrogram is 2^(n−1), which can be seen by noting that the orientation of each inner node has two possible states. So, the super-exponential problem of finding a good visualization of the similarity matrix among n! possibilities is reduced to the exponential problem of finding a good dendrogram visualization among 2^(n−1). Typically, heuristics are used to find a useful orientation of a dendrogram, but in [3,12] more principled optimization techniques are used to that end. The derived permutation of
the leaves, which correspond to the data objects, is also used as a meaningful ordering of the rows and columns of the distance matrix [14]. In the special case of vector data, a similar visualization is derived by hierarchically clustering the rows and columns of the data matrix, which is built by stacking the d-dimensional data vectors x1,...,xn into an n × d matrix. The respective dendrograms for the row and the column sets are shown at the borders of the matrix. Both variants are called heatmaps. Examples for the Iris data are shown in Fig. 6. These kinds of
visualizations are especially popular in gene expression analysis. As the width of a dendrogram scales linearly with the data size, the clustering of a large data set is difficult to visualize with this technique. Therefore, heuristics are used to compute a horizontal cut in the dendrogram, and only the upper part is visualized. However, cutting at a constant height does not always produce the best result, so new, more flexible heuristics are discussed in [13].

Visualizing Clustering Results. Figure 5. Dendrograms of the Iris data.

In the case of single-linkage-like cluster hierarchies, the OPTICS approach [2] visualizes the lower part of the hierarchy with an alternative technique. OPTICS extends flat single linkage clustering. For visualization purposes it computes an ordering of the data points which reveals nested clusters. Flat single linkage clustering sees the data as a graph, with the data objects as nodes. An edge connects two
nodes when the corresponding data objects are closer than a given threshold (EPS). Clusters are the connected components of the graph. A drawback of single linkage clustering is that well-separated clusters may be connected by chains of outliers and are then merged by the algorithm. Therefore, OPTICS extends the basic single linkage approach by introducing an additional linkage constraint, the core object condition. The condition says that every object in a cluster is linked to at least one core object. A core object has at least MINPTS − 1 other data objects in its neighborhood of radius EPS. Graph theory says that connected components can be found by two equivalent algorithms, namely depth first search (DFS) and breadth first search (BFS). BFS is usually implemented with a first-in-first-out (FIFO) queue. OPTICS uses a priority queue instead of a simple FIFO queue, which sorts the contained objects by ascending reachability distance. The reachability
distance of an object is defined as the distance to the MINPTS-nearest neighbor in the case of core objects, as the distance to the nearest core object in the case of border objects, which are linked to a core object, and as infinity in the case of outliers. A good intuition of the
reachability distance is to think of it as the inverse of data density. Objects with a low reachability distance have many neighbors, thus are in areas of high density. As the algorithm uses a priority queue, which outputs objects with a low reachability distance first, it focuses
on objects in high-density areas first. The mode of operation of OPTICS is that it starts randomly somewhere in a cluster, is then led to the cluster's center, and works its way to the border. After a cluster is finished, it starts with the next one in the same manner. The visualization technique of OPTICS displays the objects in the order they are processed by the algorithm and shows the reachability distance for each object. By the mode of operation of OPTICS, this method shows a cluster as a valley. The left border of the valley is formed by the random start, which is with high probability an object with a larger reachability distance. The bottom of the valley is formed by the center objects of the cluster, which have the smallest reachability distances. The right border of a valley is formed by the border objects of the cluster. If the cluster is perfectly separated from the rest of the data, the priority queue becomes empty and the algorithm jumps randomly to some still unlabeled object and processes the next cluster. Nested cluster structure shows up as dents in the bottom of a larger valley. As OPTICS never links objects further away than EPS, it does not determine links between clusters that are separated by more than EPS. Thus, the visualization shows only the lower part of the hierarchy. Examples of OPTICS plots are shown in Fig. 7.

Visualizing Clustering Results. Figure 6. Heatmaps of the Iris data showing (a) the data matrix and (b) the Euclidean distance matrix. The dendrograms used are computed by complete linkage.

Visualizing Clustering Results. Figure 7. OPTICS on Iris data.
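The original OPTICS plots in this entry were generated with the authors' own tooling; as a rough, hedged equivalent, scikit-learn's OPTICS implementation exposes the reachability distances and the processing order, from which a reachability plot can be drawn.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import load_iris

X = load_iris().data

# min_samples plays the role of MINPTS; max_eps corresponds to EPS.
optics = OPTICS(min_samples=5, max_eps=2.0).fit(X)

# Reachability distances in processing order: clusters appear as valleys.
reach = optics.reachability_[optics.ordering_]
reach = np.where(np.isinf(reach), np.nan, reach)  # first value can be infinite
plt.plot(range(len(reach)), reach)
plt.xlabel("processing order")
plt.ylabel("reachability distance")
plt.show()
```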
Key Applications
The visualization of clustering results is mainly used for interpreting the results of clustering and presenting them to people from the application side. It is a good tool to put the derived clusters into the context of the application at hand by annotating the visualization with meta-information. Visualizing clustering results also facilitates the exploration of the data and the tuning of clustering algorithms.
Recommended Reading
1. Inselberg A. Parallel Coordinates: Visual Multidimensional Geometry and its Applications. Springer, 2007.
2. Ankerst M., Breunig M.M., Kriegel H.-P., and Sander J. OPTICS: Ordering points to identify the clustering structure. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 49–60.
3. Bar-Joseph Z., Gifford D.K., and Jaakkola T.S. Fast optimal leaf ordering for hierarchical clustering. Bioinformatics, 17(90001):22–29, 2001.
4. Bishop C. Pattern Recognition and Machine Learning. Springer, New York, 2006.
5. Domingos P. Occam's two razors: The sharp and the blunt. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, 1998, pp. 37–43.
6. Faloutsos C. and Lin K.I. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1995, pp. 163–174.
7. Fua Y.-H., Rundensteiner E.A., and Ward M.O. Hierarchical parallel coordinates for visualizing large multivariate data sets. In Proc. IEEE Conf. on Visualization, 1999.
8. Goldberger J., Roweis S.T., Hinton G.E., and Salakhutdinov R. Neighbourhood components analysis. In Advances in Neural Inf. Proc. Syst. 18, Proc. Neural Inf. Proc. Syst., 2005, pp. 513–520.
9. Hahsler M., Hornik K., and Buchta C. Getting things in order: An introduction to the R package seriation. http://cran.at.r-project.org/web/packages/seriation/vignettes/seriation.pdf
10. Iwata T., Saito K., Ueda N., Stromsten S., Griffiths T.L., and Tenenbaum J.B. Parametric embedding for class visualization. Neural Comput., 19(9):2536–2556, 2007.
11. Kaban A., Sun J., Raychaudhury S., and Nolan L. On class visualisation for high dimensional data: Exploring scientific data sets. In Proc. 9th Int. Conf. on Discovery Science, 2006.
12. Koren Y. and Harel D. A two-way visualization method for clustered data. In Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2003, pp. 589–594.
13. Langfelder P., Zhang B., and Horvath S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, 24(5):719–720, 2008.
14. Strehl A. and Ghosh J. Relationship-based clustering and visualization for high-dimensional data mining. INFORMS J. Comput., 15(2):208–230, 2003.
URL to Code
The images in this article (except the OPTICS plots) are produced with the statistics package R. The code for the drawings can be found at http://www.informatik.uni-halle.de/~hinnebur/clusterVis.tar.gz.
Cross-references
▶ Clustering Overview and Applications ▶ Clustering Validity ▶ Visual Classification ▶ Visual Clustering ▶ Visual Data Mining
Visualizing Hierarchical Data
GRAHAM WILLS
SPSS Inc., Chicago, IL, USA
Synonyms
Hierarchical graph layout; Visualizing trees; Tree drawing; Information visualization on hierarchies; Hierarchical visualization; Multi-level visualization
Definition
Hierarchical data is data that can be arranged in the form of a tree. Each item of data defines a node in the tree, and each node may have a collection of other nodes as child nodes. The relationship between the parent nodes and the child nodes forms a tree network. The formal definition of a tree is that the graph formed by the nodes and edges (defined between parent and child node) is both connected and contains no cycles. The following properties of a tree are of more practical use from the point of view of displaying visualizations:
1. One node, called the root node, has no parent. All other nodes have exactly one parent.
2. Nodes with no children are termed leaf nodes. Nodes with children are termed interior nodes.
3. For every node in a tree, there is a single unique path up the tree going from parent to parent's parent and so on, which terminates in the root node. The number of nodes on the path from a node to the root is termed its depth.
Although not strictly required, the vast majority of hierarchical data, and the main application area, consists of trees where the parent nodes define some form of aggregation of the child nodes, so that the data for a parent is equal to, or is expected to be close to, the aggregation of the data for that node's children. The relationship between parent and child is often described in terms of inclusion; it is common to state that nodes contain their children. This aspect is often brought out in visualizations. A simple example of hierarchical data would consist of populations for the world, hierarchically broken down into sub-regions. At the top level, the root would be the world, with the population of the world as its data. The next level could be the four main continents, each with its individual population, and each continent would have countries as children, each with its population count. In each case, one would expect the population of the parent node to roughly equal the sum of the populations of its child nodes. If the data were collected at slightly different times, or from different sources, the populations of the continents might not exactly equal the sum of their children's populations, but one would expect them to be very close. Note that in this example, every leaf node has the same depth. Although this is a common property in many applications, it is not a required property. In fact,
if Antarctica is counted as a continent, it would have no contained countries, and the hierarchy would lose this property. If the hierarchy is deepened further, by dividing countries into regions, the leaves would sit at very different levels. The United States might be divided into states and then into zip codes, whereas Vatican City would not be divided up at all.
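To make the aggregation property concrete, the following is a small sketch of such a hierarchy represented as nested nodes, with a helper that reports how closely each parent's value matches the sum over its children; the values are placeholder numbers, not real population statistics.

```python
# Sketch of an aggregating hierarchy; the numbers are made-up placeholders.
tree = {
    "name": "World", "value": 100,
    "children": [
        {"name": "Continent A", "value": 60,
         "children": [{"name": "Country A1", "value": 35},
                      {"name": "Country A2", "value": 25}]},
        {"name": "Continent B", "value": 40,
         "children": [{"name": "Country B1", "value": 41}]},
    ],
}

def check_aggregation(node, path=""):
    """Report how far each interior node's value is from the sum of its children."""
    children = node.get("children", [])
    label = path + node["name"]
    if children:
        child_sum = sum(c["value"] for c in children)
        print(f"{label}: value={node['value']}, children sum={child_sum}")
        for c in children:
            check_aggregation(c, label + " / ")

check_aggregation(tree)
```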
Historical Background
Leonhard Euler is regarded as the founder of graph theory, publishing initially in the 1730s, but displays of trees predate even this, especially displays of family trees, some of which survive from significantly earlier. However, the discipline of visualizing hierarchical data in a systematic fashion dates back only to the availability of computers. In 1992, the International Symposium on Graph Drawing began meeting, and its conference proceedings, available from 1994, provide an excellent overall reference to the state of the art in this subject, although limited mainly to static graphs. From around 1990 onwards, increased attention has been paid to interactive, or dynamic, visualization of hierarchical data, although information on such techniques is scattered across many disciplines. User interface controls for interacting with hierarchies became common with the advent of windowed operating systems; trees of folders and files are commonplace, and techniques for filtering and pruning such views have seen significant research and user testing.
Foundations
There are a number of reasons why hierarchical data might be visualized, and it is important to identify the goal of any visualization, as different visualization techniques serve different goals. Common reasons to view hierarchies are:
Understanding structure: The goal is to understand the structure of the hierarchy; in the world population example, one might want to know how countries are distributed within continents, or to see how many major regions are within countries.
Understanding data: The goal is to understand the distribution of data across the hierarchy. In the example, one might want to know if each continent has a similar distribution of population, or if the lowest levels have similar populations (do zip codes
in the US, on average, have the same populations as the Vatican City?), or similar questions. Often, the goal is to understand the data within the context of the structure.

Summarizing large amounts of data: The goal is to reduce information overload by providing a summary of low-level data into aggregated data. A hierarchy allows the user of a visualization to set the level of detail they want, by only showing data up to a certain level in the hierarchy.

There are two basic branches of visualization techniques for hierarchies. The first is based on a node-edge graph-layout approach which focuses attention on the structure and relationships, and the second on space-filling approaches, which focus attention on the relative sizes of nodes in the hierarchy. Each is discussed below.

Node-Edge Layouts
These displays are essentially a specialization of general graph layout techniques. Each node in the hierarchy is displayed as a small glyph, commonly a circle or a square. Data on the node can be represented by changing aesthetic attributes of the node such as its size, color, pattern, etc. Fig. 1 below gives an example of a traditional node-edge display for hierarchical data. Each node represents a country, and they are aggregated into a hierarchy where the next level up is a sociological grouping. The populations have been aggregated by a mean average, so the node representing the ‘‘new world’’ countries has been given the mean population of countries in the new world. It is clear that there is little difference between populations of countries when they are aggregated into these groupings. This visualization has the advantage that it is good for showing the structure of the hierarchy – one can see clearly how many countries are in each group, and the distribution of the variable of interest. However, the display is not compact – a lot of space is wasted showing links having little information. Modifications to this basic display include the following: Ordering the nodes within each parent. The child nodes for each parent could be sorted by a variable. In Fig. 1, it would make sense to sort by the population to create a better overview of the distribution of that variable across countries within a group.
Displaying information on the edges. In the above figure, the edges could be coded according to the similarity between the country and its group, using any of a number of statistical techniques to define such similarity. One such technique starts simply with data items and then generates a hierarchy by successively clustering items into groups based on similarity. This technique is called hierarchical clustering and is often visualized using a dendrogram as in Fig. 2. The dendrogram shows when groups are formed using the vertical dimension; the lower on the vertical dimension, the more similar the groups that were merged were. In Fig. 2, F and G were merged first because they are the most similar pair, then A and B were formed into a group, then D and E. Then C was added to {A, B}, following which {D, E} was merged with {F, G}. The resulting groups {A, B, C}, {D, E, F, G}, {H} are only merged together at a much higher level of dissimilarity. The dendrogram therefore lets one see not only what clusters were created, but it also, by placing the merge information at a location in the vertical dimension proportional to the similarity of the groups being merged, allows one to see how good the resulting clusters are.
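The clustering behind such a dendrogram can be reproduced with standard tools. The following sketch assumes SciPy and Matplotlib are available (an assumption, not something prescribed by the entry) and uses invented two-dimensional items standing in for A–H; linkage() performs the successive merging and dendrogram() draws the merge heights.

```python
# A sketch of hierarchical clustering and its dendrogram using SciPy (assumed
# dependency); the items and their coordinates are invented for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = list("ABCDEFGH")
items = np.array([[0.0, 0.0], [0.2, 0.1], [0.9, 0.4],
                  [2.0, 2.1], [2.2, 2.0], [5.0, 5.0],
                  [5.1, 5.0], [9.0, 0.5]])

# Each merge joins the two currently most similar groups; the merge height
# recorded in Z is the dissimilarity at which the merge happened.
Z = linkage(items, method="average")

dendrogram(Z, labels=labels)
plt.ylabel("dissimilarity at merge")
plt.show()
```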
Space-Filling Layouts

In contrast to node-edge layouts, space-filling layouts take explicit advantage of the hierarchical nature of data directly. A space-filling layout is usable only when the main variable of interest is summable. As these techniques lay out areas in proportion to sizes, and parents visually include their children, a variable that can be summed is necessary. It is not possible to use space-filling layouts directly to show means, minimums or similar statistics. The base layout is limited to sums. It is, of course, possible to color or use some other non-size aesthetic for such techniques, but then the essential space-filling nature of the display is of little value. Figure 3 shows the example data with a radial space-filling layout. Each level in the hierarchy is represented by a band at a set radius, with the root node in the center, and children outside their parents. The angle subtended by a node is proportional to the percentage of the entire population that this node represents. Children are laid out directly outside their parents, so parents ‘‘divide up’’ the space for their children according to the child sizes. This is a typical
Visualizing Hierarchical Data. Figure 1. Standard Node-Edge layout for a hierarchical network.
space-filling technique; many such techniques have been discovered in various disciplines. They share the following characteristics: The root node occupies 100% of the space, or dimension of interest (in this case, the radial dimension). Each node partitions its space according to the relative sizes of its children.
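A minimal sketch of this partitioning rule is shown below (illustrative code; the node representation and the population figures are invented). Each node is given an angular extent proportional to its summed size, and children recursively subdivide the extent of their parent.

```python
# A sketch of the space-partitioning rule of a radial space-filling layout:
# each node receives an angular extent proportional to its summed size, and
# children subdivide their parent's extent. Illustrative code only.

def size_of(node):
    name, size, children = node
    return size if not children else sum(size_of(c) for c in children)

def assign_angles(node, start=0.0, extent=360.0, depth=0, out=None):
    """node = (name, size, children); returns [(name, depth, start, extent), ...]."""
    out = out if out is not None else []
    out.append((node[0], depth, start, extent))
    children = node[2]
    total = sum(size_of(c) for c in children) or 1.0
    angle = start
    for c in children:
        share = extent * size_of(c) / total
        assign_angles(c, angle, share, depth + 1, out)
        angle += share
    return out

world = ("World", None,
         [("Africa", None, [("Nigeria", 200, []), ("Egypt", 100, [])]),
          ("Europe", None, [("Germany", 83, []), ("France", 67, [])])])

for name, depth, start, extent in assign_angles(world):
    print(f"{name:8s} band {depth}: {start:6.1f} to {start + extent:6.1f} degrees")
```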
Visualizing Hierarchical Data. Figure 2. Dendrogram of a hierarchical clustering.
Figure 3 uses space more economically than Fig. 1, and also allows one to see sizes more clearly. It is generally to be preferred if the data support it. Compare this figure with Fig. 4, which uses a style of layout popularized under the name ‘‘TreeMap.’’ The treemap, instead of allocating space for child nodes outside the parent node, lays them out directly
Visualizing Hierarchical Data. Figure 3. The Population data of Fig. 1, this time showing summed population, using a space-filling radial layout.
Visualizing Hierarchical Data. Figure 4. TreeMap version of Fig. 3.
on top of the parent, thus making a maximally compact representation. This causes the immediate problem that there is then no way to distinguish the tree structure, since the children completely occlude the parents. Typically, as in Fig. 4, a small border is left around the child nodes to allow the tree structure to be ascertained, although that does distort the overall relationship between visual size and the sizes of the hierarchical data points. The TreeMap is not a good technique for learning about structure because of this issue, but
because it is very compact it can perform quite well in the domain to which it is applicable: displaying relative sizes of items in the hierarchy when the structure of the hierarchy is well-known or of little interest. There are numerous different ways to render this figure, and care should be taken to avoid situations where some rectangles are skinny and others are flat; it is harder to make size judgments based on areas with widely differing aspect ratios than it is on arcs with different angles as in Fig. 3, for example. In general, the TreeMap should
be treated as a specialist tool for hierarchies to which it is applicable, not as a general-purpose hierarchical visualization.
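For comparison with the radial layout, the following sketch shows the simplest possible treemap rule, usually called slice-and-dice: each rectangle is split among its children in proportion to their sizes, alternating split direction by depth. This is illustrative only; production treemaps normally use squarified variants precisely to avoid the extreme aspect ratios discussed above.

```python
# A sketch of the "slice-and-dice" treemap layout: each node's rectangle is
# split among its children in proportion to their sizes, alternating between
# horizontal and vertical splits by depth. Figures and names are illustrative.

def size(node):
    name, s, children = node
    return s if not children else sum(size(c) for c in children)

def treemap(node, x, y, w, h, depth=0, rects=None):
    """node = (name, size, children); returns [(name, x, y, w, h), ...]."""
    rects = rects if rects is not None else []
    rects.append((node[0], x, y, w, h))
    children = node[2]
    if not children:
        return rects
    total = float(sum(size(c) for c in children))
    offset = 0.0
    for c in children:
        frac = size(c) / total
        if depth % 2 == 0:   # split horizontally: children side by side
            treemap(c, x + offset * w, y, w * frac, h, depth + 1, rects)
        else:                # split vertically: children stacked
            treemap(c, x, y + offset * h, w, h * frac, depth + 1, rects)
        offset += frac
    return rects

tree = ("World", None, [("Asia", 4600, []), ("Africa", 1300, []),
                        ("Europe", 740, []), ("Americas", 1000, [])])
for name, x, y, w, h in treemap(tree, 0, 0, 100, 100):
    print(f"{name:9s} at ({x:5.1f},{y:5.1f}) size {w:5.1f} x {h:5.1f}")
```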
Interactive Visualization of Hierarchical Data

The entry on Visualization of Network Data indicated some techniques for interacting with general networks. These can be applied to the special case of hierarchies, and of these the most applicable and important technique is that of brushing and linking. Figure 5 demonstrates a number of techniques that have been detailed earlier, and adds interaction to the visualization. The base data are a set of consumer price
indices (CPI) from the United Kingdom, collected monthly. A hierarchical clustering has been performed on the months, based on all the CPIs in the data set except the overall CPI. At the bottom is a multiple time series chart showing four indices (food, clothing, housing, furniture). At the top right is a histogram showing residuals from a fitted time series model of the overall CPI. Each view is linked to each other view, so that selecting a region of one chart highlights the corresponding months in all other views. In this figure, the higher residuals have been selected, denoting months where the actual CPI was higher than the model predicted. This is shown in red in the
Visualizing Hierarchical Data. Figure 5. Linked Views of UK CPI data.
histogram, and in the time series view the segments corresponding to those months are shown with a thicker stroke. It appears that these residuals occur mainly on the downward part of regular cycles on the topmost time series (representing the clothing CPI). In the hierarchical view, a generalization has been made to the linking. As the months are aggregated at higher levels of the hierarchy, these interior nodes in the hierarchical tree map may be partially selected. In this figure the selection percentage is shown with a rainbow hue map, with blue indicating completely unselected, red completely selected, and green half-selected. Intermediate hues indicate intermediate selection percentages. At the outer levels some fully selected clusters can be seen, but more interestingly, looking at the inner bands, there are strong differences in color, indicating some relationship between the distributions of the residuals for the overall CPI, and the distribution of clusters of the other CPIs. This indicates that some form of inter-dependence between them should be investigated to improve the model.
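The partial-selection idea can be sketched as follows (illustrative code; the month hierarchy and the blue-to-red hue mapping are simplified stand-ins for the rainbow map described above): interior nodes take the fraction of their selected leaves, and that fraction is turned into a hue.

```python
# A sketch of partial selection in a linked hierarchy: interior nodes inherit
# the fraction of their selected leaves, and that fraction is mapped to a hue
# between blue (unselected) and red (fully selected). Illustrative only.

def selection_fraction(node, selected):
    """node = (name, children); returns (fraction of selected leaves, leaf count)."""
    name, children = node
    if not children:
        return (1.0 if name in selected else 0.0), 1
    frac, count = 0.0, 0
    for c in children:
        f, n = selection_fraction(c, selected)
        frac += f * n
        count += n
    return frac / count, count

def fraction_to_hue(frac):
    # 240 degrees (blue) for 0% selected, 120 (green) for 50%, 0 (red) for 100%.
    return 240.0 * (1.0 - frac)

months = ("year", [("Q1", [("Jan", []), ("Feb", []), ("Mar", [])]),
                   ("Q2", [("Apr", []), ("May", []), ("Jun", [])])])
selected = {"Feb", "Mar", "Apr"}

frac, _ = selection_fraction(months, selected)
print(f"year: {frac:.0%} selected, hue {fraction_to_hue(frac):.0f} degrees")
```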
Key Applications

Hierarchies are a very common data structure. Application areas include geographical data, which are invariably organized in hierarchies, and data on organizations such as businesses, military organizations, and political entities. Financial data is also often suitable. As well as a priori hierarchies, it is very common to generate hierarchies so as to aggregate data and allow higher-level information to be viewed. OLAP systems and other roll-up techniques can create hierarchies and allow users to move rapidly between levels to investigate data.
Data Sets

UK figures for CPI were retrieved from http://www.statistics.gov.uk/cpi/, and similar statistics should be generally available from most countries’ national statistics offices. The world population data are a subset taken from the CIA’s World Factbook, available at https://www.cia.gov/library/publications/the-world-factbook/.
Cross-references
▶ OLAP
▶ Visualizing Network Data
Recommended Reading
1. Di Battista G., Eades P., Tamassia R., and Tollis I. Graph Drawing: Algorithms for the Visualization of Graphs. Prentice-Hall, Englewood Cliffs, NJ, USA, 1999.
2. Friendly M. The Gallery of Data Visualization. http://www.math.yorku.ca/SCS/Gallery/.
Visualizing Network Data

Graham Wills
SPSS Inc., Chicago, IL, USA
Synonyms Graph layout; Network topology; Graph drawing; Information visualization on networks
Definition

A network is a set of nodes with edges connecting the nodes. When the graph defined by that set of nodes and edges has other associated data, the result is termed ‘‘network data.’’ Visualizing Network Data is the process of presenting a visual form of that structure so as to allow insight and understanding.
Historical Background Although examples of informal drawing of networks and hand-drawn graph layouts can be found stretching back many decades, the discipline of visualizing network data in a systematic fashion dates back only to the availability of computers, with an early paper by Tutte in 1963 titled ‘‘How to Draw a Graph’’ being a prime example. In 1992, the conference International Symposium on Graph Drawing commenced meeting and their conference proceedings, available from 1994, provide an excellent overall reference to the state of the art in this subject. From around 1990 onwards, increased attention has been paid to interactive, or dynamic visualization of Network Data. These techniques have surfaced in a variety of different fields, including Statistical Graphics, Information Visualization, and many applied areas.
Foundations

The basic goal behind a successful static visualization of network data is simply stated: producing a layout of nodes and edges within a bounded 2-dimensional (or, less commonly, 3-dimensional) region such that the
resulting display portrays the structure of the network as clearly as possible. To achieve this goal the following topics must be addressed:

Techniques for representing the nodes and edges
Aesthetic criteria that define what is meant by a clear representation
Layout Techniques
Extensions to layout techniques to incorporate data on nodes and edges
Interactive techniques to augment static layouts

Node and Edge Representation
The simplest form of display for a node is as a small glyph, commonly a circle or a square. It is simple and compact. Further, it is easy to add extra information to, such as color, pattern, label or other aesthetics. These can represent data on the nodes. The first three example layouts in Fig. 1 below show nodes as circles. The last layout is different; it uses an extended representation to display a node, allowing the edges to be displayed as vertical lines. This representation is most common when the graph is tree-like, or more generally is a hierarchical network. When there is special known structure for a graph like this, it is possible to use representations that help elucidate such properties. For general network data, however, simple nodes are the most common and suitable choice. Edge representation allows more freedom. If the edges are directed, so that they have a definite ‘‘from’’ and ‘‘to’’ node, then they are usually portrayed with arrows at the ends, unless the direction is obvious from the layout. Other common representational techniques are: Straight Edge. Figure 1(a) and Figure 1(d) show edges as straight lines directly linking nodes. This
has the advantage of being simple and making connections clear, but tends to result in more edge crossings. Polyline/Paths. Using paths instead of straight edges allows a reduction in the number of edge crossings in Fig. 1(c), but the resulting display is more complex. Orthogonal Paths. A style of representation where paths consist of orthogonal lines. This form of representation harkens back to printed circuit board layouts, an early application of graph layout. Quality Criteria
The question of what makes a good layout has been approached by many authorities. In practice trade-offs must be made, and specific applications stress some criteria more than others. Those criteria that are most often considered important are given below:

Minimize edge crossings
Minimize the area needed to display
Maximize the symmetry of the layout, both globally and locally
Minimize the number of bends in polyline layouts
Maximize the angle between edges, both at nodes and when they cross
Minimize total path lengths

There are also criteria that are suitable for specific graph types. For example, in a graph that is directed and acyclic (there is no path from nodes following edges that loops back on itself), one strong criterion is that the edges generally head in the same direction (all upward, or all downward, as in Fig. 1(a)). Note that Fig. 1(b) does not exhibit this quality, an example of the trade-offs made when considering different layout algorithms.
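One way to make such criteria operational is to combine a few of them into a single numeric score that can be used to compare candidate layouts. The sketch below is illustrative (the weights and the choice of only two criteria, crossings and total edge length, are arbitrary simplifications):

```python
# A sketch of a layout quality score combining two of the criteria above:
# number of edge crossings and total edge length. Lower cost = better layout.
from itertools import combinations

def segments_cross(a, b, c, d):
    """True if segment a-b properly crosses segment c-d (shared endpoints ignored)."""
    def orient(p, q, r):
        v = (q[0]-p[0])*(r[1]-p[1]) - (q[1]-p[1])*(r[0]-p[0])
        return (v > 0) - (v < 0)
    if {a, b} & {c, d}:
        return False
    return (orient(a, b, c) != orient(a, b, d) and
            orient(c, d, a) != orient(c, d, b))

def layout_cost(pos, edges, w_cross=10.0, w_length=1.0):
    """pos: node -> (x, y); edges: list of (u, v)."""
    crossings = sum(segments_cross(pos[u1], pos[v1], pos[u2], pos[v2])
                    for (u1, v1), (u2, v2) in combinations(edges, 2))
    total_length = sum(((pos[u][0]-pos[v][0])**2 + (pos[u][1]-pos[v][1])**2) ** 0.5
                       for u, v in edges)
    return w_cross * crossings + w_length * total_length

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
grid = {"a": (0, 0), "b": (1, 0), "c": (1, 1), "d": (0, 1)}
print(layout_cost(grid, edges))
```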
Visualizing Network Data. Figure 1. Representations: (a) Straight-edge, (b) Polyline, (c) Orthogonal, (d) Hierarchical.
Layout Techniques
Layout techniques for general networks employ a variety of techniques, the most common of which are:

Planar Embeddings. A planar graph is one that has a 2-dimensional representation that has no edge crossings. It is also possible to represent any planar graph using only straight-line edges. Although testing and subsequent layout can be done in linear time, the algorithms for doing so are complex. When each vertex has at most four edges, orthogonal representations are possible, but minimizing the number of bends is NP-hard, and approximate algorithms are employed in practice.

Planarization. If a graph is not planar, it can be made planar by adding ‘‘fake’’ nodes at the crossing points. One such technique would be to find the maximal planar subgraph and lay it out, then add the additional edges and insert the additional nodes at crossing points. The resulting graph is laid out using straight edges and then the fake nodes removed, leaving polylines connecting some remaining nodes.

Directed Embeddings. A similar technique is to extract a directed graph from the overall graph (by orienting the edges if necessary) and use a hierarchical drawing technique to place the nodes. A simple example would be finding a minimum spanning tree within the graph and laying it out directly.

Force-directed. Using a mixture of aesthetic criteria, the ‘‘energy’’ of a layout can be defined where low energy corresponds to a good layout. The resulting energy function can be minimized using techniques including randomization, simulated annealing, steepest descent, and simulation. Force-directed techniques are very general, and often can be implemented using iterative algorithms, which makes them amenable to a choice of stopping criteria that can deal with large networks more easily.
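The following is a minimal sketch of a force-directed layout in the spirit described above (essentially a simplified Fruchterman–Reingold-style iteration; the constants, cooling schedule and stopping rule are illustrative choices, not a reference implementation):

```python
# A minimal force-directed layout sketch: edges pull their endpoints together,
# all node pairs repel, and the layout is iterated while the step size "cools".
import math, random

def force_directed(nodes, edges, iterations=300, ideal=1.0):
    pos = {n: (random.random(), random.random()) for n in nodes}
    for step in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair of nodes.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                d = math.hypot(dx, dy) or 1e-9
                f = ideal * ideal / d
                disp[u][0] += f * dx / d; disp[u][1] += f * dy / d
                disp[v][0] -= f * dx / d; disp[v][1] -= f * dy / d
        # Attraction along edges.
        for u, v in edges:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / ideal
            disp[u][0] -= f * dx / d; disp[u][1] -= f * dy / d
            disp[v][0] += f * dx / d; disp[v][1] += f * dy / d
        # Move nodes a limited distance; shrink the limit as the layout cools.
        limit = 0.1 * (1.0 - step / iterations)
        for n in nodes:
            dx, dy = disp[n]
            d = math.hypot(dx, dy) or 1e-9
            pos[n] = (pos[n][0] + dx / d * min(d, limit),
                      pos[n][1] + dy / d * min(d, limit))
    return pos

nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")]
print(force_directed(nodes, edges))
```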
Layouts for Networks with Data

If, in addition to the network connections, one also has data on either or both of the nodes or the edges, standard Information Visualization techniques can be used to augment the network visualizations. The simplest augmentation is to map a variable for nodes or edges onto an aesthetic, such as color, size, shape, pattern, transparency or dashing. The basic act of labeling nodes is already an example of this use of aesthetics to convey information. Figure 2 shows an application to understanding correlation patterns between variables in a data set consisting of information
Visualizing Network Data. Figure 2. Correlations between Variables; baseball player data, 2004.
on Major League Baseball players in 2004. Only correlations passing a statistical test of adequacy have been retained, resulting in a non-connected graph. The edges contain data on two measures of association between the variables connected by the edge, which have been mapped to aesthetics as follows:

Color: The color represents the statistical significance of the association, with green being weak and red being strong.
Size: The width of the edges indicates how strong the association is, in the sense of how much of the variation in one variable can be explained by the other.

The overall layout has a straight-edge representation, and has been arrived at via a force-directed algorithm. When the data are similar to the above, where one has a measure (or, in this case, two measures) of an edge’s strength, the algorithms used should be modified so as to take the strength into account. Although this is possible for all algorithms, it is simplest to implement using force-directed techniques. The goal is to ensure that nodes that are highly associated with each other are close together, and that leads to the criterion that path length should be inversely proportional to edge weight. Reviewing the quality criteria above, some of the criteria can be modified for weighted networks as follows:

Minimize the weighted summed deviation between path lengths and inverse edge weights.
Penalize edge crossings, with crossings involving strong edges penalized more than weak edges.
Maximize the angle between edges, both at nodes and when they cross, penalizing angles involving strong edges more than weak edges.

The layout of Fig. 2 was achieved by a force-directed algorithm under these constraints. The measures of association were derived using the scagnostics algorithms of Wilkinson and Wills.

Interactive Augmentations
For small and medium-sized static networks, static layouts are adequate, but for larger data sets and for time-varying data, different techniques are required. These can roughly be divided into the following categories:

Distortion techniques. Suitable for large but not huge networks, distortion techniques allow users to place a focus point on a region of interest of the display, and the display redraws so as to magnify the area of interest and de-magnify the rest of the display. Since
usable magnification levels can go no higher than a factor of about 5, this technique can improve the number of nodes that can be visualized by a maximum of about 25. Specific versions of these techniques include fisheye and hyperbolic transformations. Figure 3 and Figure 4 below illustrate a network showing associations between words in Melville’s Moby Dick. The size of the nodes indicates word frequencies, and the link color indicates strength of association. The cluster at the top left is cluttered in Fig. 3, so an interactive fisheye is applied and the focus point is dragged to that cluster, allowing one to see that cluster in more detail, as shown in Fig. 4.

Brushing/Linking. Brushing and Linking techniques are generally applicable techniques for using one visualization in conjunction with another. In the basic implementation a binary pseudo-variable is added to the data set indicating the user’s degree of interest. This variable is mapped to an aesthetic in each linked chart, and the user is allowed to drag a brush, or click on
Visualizing Network Data. Figure 3. Force-directed layout of important words in Moby Dick. Node sizes indicate word frequencies. Edges link words that are commonly found close to each other.
Visualizing Network Data. Figure 4. Force-directed layout of important words in Moby Dick. A fisheye transformation has been placed over the dense cluster around the word ‘‘whale,’’ magnifying that cluster.
areas of interest in one chart so as to set corresponding values of the ‘‘degree of interest’’ variable and highlight corresponding parts of all linked charts. This technique is of particular value in the visualization of network data, as it allows any graph layout to be augmented with additional visualizations of data on nodes and links. A simple system for reducing visual complexity is to link histograms, bar charts or other summarized charts to a graph layout display, with the visibility of nodes and links based on the degree of interest. This allows users to select, for example, high values of one variable and then refine the selection by clicking on a bar of a categorical chart. The network display will then show only those items corresponding to the visually selected subset of rows. A trivial application of this is simply to provide sliders for each variable associated with nodes and links, and by dragging the sliders define a subset of the data which is then displayed as a network display. Animation. When data on networks varies over time, animating the results can provide insight into
the nature of the variation. Visualization is particularly suitable for such data as modeling dynamically evolving networks is a hard problem, with no general models currently available. Hence visualization becomes an important early step in understanding the problem. Technically, animation is similar to brushing, and can be simulated in a brushing environment by making selections on a view of the time dimension. The main differences are in internal techniques to optimize for animation, and in the user interfaces provided for each.
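As an illustration of the distortion techniques mentioned above, the sketch below applies a simple radial fisheye to individual points; the particular distortion formula is one common choice and the radius and magnification parameters are arbitrary:

```python
# A sketch of a radial fisheye distortion: points near the focus are magnified,
# points far away are compressed; points outside the lens radius are unchanged.
import math

def fisheye(point, focus, radius=1.0, magnification=4.0):
    """Map a 2-D point; the distortion acts within `radius` of the focus."""
    dx, dy = point[0] - focus[0], point[1] - focus[1]
    d = math.hypot(dx, dy)
    if d == 0 or d >= radius:
        return point                      # outside the lens: unchanged
    r = d / radius                        # normalised distance in [0, 1)
    r_new = (magnification + 1) * r / (magnification * r + 1)
    scale = r_new * radius / d
    return (focus[0] + dx * scale, focus[1] + dy * scale)

# Points close to the focus spread apart; points near the lens edge barely move.
for p in [(0.05, 0.0), (0.2, 0.0), (0.5, 0.0), (0.9, 0.0)]:
    print(p, "->", fisheye(p, focus=(0.0, 0.0)))
```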
Key Applications Network Data Visualization is widely applicable. Communication networks such as telephony, internet and cellular are key areas, with particular emphasis on network security issues such as intrusion detection and fraud monitoring.
Future Directions The field of network data visualization has been evolving steadily since its inception. The classic problem
remains open: providing high-quality layouts for networks. It is known that most problems in this area are NP-hard, and so ongoing research will focus on approximate algorithms. Application areas are increasingly providing large networks with sometimes millions of nodes, which require new algorithms for such data. Further, evolutionary networks, in which the topology of the network itself changes over time, have become more important, and algorithms for evolving a layout smoothly from an old state to accommodate a new state are needed.
Data Sets The baseball data can be found online at the baseball archive: http://www.baseball1.com. This contains a wealth of tables, and the example data used above was extracted from the tables found there. The text of Moby Dick can be found at many sites, including Project Gutenberg: http://www.gutenberg.org.
Cross-references
▶ Visualizing Hierarchical Networks
Recommended Reading
1. Battista G., Eades P., Tamassia R., and Tollis I. Graph Drawing: Algorithms for the Visualization of Graphs. Prentice-Hall, 1999.
2. Herman I., Melancon G., and Marshall M.S. Graph visualization and navigation in information visualization: a survey. IEEE Trans. Vis. Comput. Graph., 6(1), 2000.
3. International Symposium on Graph Drawing, http://graphdrawing.org/
Visualizing Quantitative Data

Anatol Sargin, Ali Ünlü
University of Augsburg, Augsburg, Germany
Synonyms Visualizing quantitative data; Graphics for continuous data; Visual displays of numerical data
Definition Quantitative data are data that can be measured on a numerical scale. Examples of such data are length, height, volume, speed, temperature or cost.
A quantitative variable can be transformed into a categorical variable by grouping, for example weight can be divided into underweight, normal weight and overweight. The inverse transformation may not be possible. Quantitative data can be categorical or continuous. This entry only concentrates on continuous data. Visualization in general means the graphical or visual display of data or relations.
Key Points

Univariate graphics are the most basic ones and display exactly one variable. There are three plot types that are commonly used: dotplots, boxplots and histograms. Dotplots draw every data point along one axis. This graphic shows clusters and gaps in the data. Overplotting can be a serious problem, because many points can be crowded around one position. Jittering of the points is then a possible solution. This means that the points are diffused along a second axis. Boxplots are also aligned on one axis, but show summary statistics (median, lower and upper hinges) and emphasize outliers. A boxplot gives a broad overview of the structure of the data and sets of boxplots are good for comparing several variables. Histograms are displayed using two axes. The values of the variable, grouped in bins, are on the abscissa and the frequencies of the values on the ordinate. Histograms can be very informative but strongly depend on the choice of the anchor point (start of the first bin) and binwidth.

Bivariate data are typically drawn as scatterplots, which show the structure and dependencies of two variables. The two variables are plotted against each other in an orthogonal coordinate system. The points can be drawn in different colors or symbols to show special groups of the data. Scatterplots are good for detecting one- and two-dimensional outliers and can, together with interactive methods, also be used for large datasets. Scatterplot matrices, also known as splom, are a generalization for n variables. The (i,j)-th entry is the scatterplot of variable i against variable j. Variable labels are often given in the matrix diagonal. Scatterplot matrices are only efficient for a small number of variables. For more variables parallel coordinates are a better choice. This space-saving graphic shows all variables in one display and hence can visualize the whole dataset.
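The sensitivity of histograms to the anchor point and binwidth can be seen with a few lines of code. The sketch below assumes NumPy is available (an assumption; the entry does not prescribe any tool) and simply reports how the bin counts change when the anchor or width changes:

```python
# A sketch illustrating how histogram bin counts depend on the anchor point
# (start of the first bin) and the binwidth. NumPy is an assumed dependency.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=200)

def bin_counts(values, anchor, width):
    """Counts per bin for bins [anchor, anchor+width), [anchor+width, ...), ..."""
    edges = np.arange(anchor, values.max() + width, width)
    counts, _ = np.histogram(values, bins=edges)
    return counts

for anchor, width in [(2.0, 2.0), (3.0, 2.0), (2.0, 1.0)]:
    print(f"anchor={anchor}, width={width}: {list(bin_counts(data, anchor, width))}")
```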
Cross-references
▶ Data Visualization
▶ Multivariate Visualization Methods
▶ Parallel Coordinates
▶ Visualizing Categorical Data
Recommended Reading
1. Cleveland W. Visualizing Data. Hobart Press, Summit, NJ, USA, 1993.
2. Unwin A., Theus M., and Hofmann H. Graphics of Large Datasets. Springer, Berlin Heidelberg New York, 2006.
Visualizing Spatial Data

▶ Visual Interfaces for Geographic Data

Visualizing Trees

▶ Visualizing Hierarchical Data

Volume

Kaladhar Voruganti
Network Appliance, Sunnyvale, CA, USA

Synonyms
Logical volume; Physical volume

Definition
The storage provided by a storage device is offered in the form of volumes. A volume corresponds to a logically contiguous piece of storage. The volumes offered by storage devices are typically known as physical volumes.

Key Points
Storage controllers usually create volumes by striping them across multiple disks that belong to a RAID group. RAID 1 (mirroring), RAID 5 (use of parity, but no dedicated parity disk), and RAID 6 (can tolerate two failures) are the most common RAID types. These volumes get mapped into LUNs by the host OS, or aggregated into logical volumes by a logical volume manager, or aggregated into virtual volumes by virtualization software. A file system and its associated files ultimately reside in a physical volume on a storage device. A database table’s data also ultimately resides in a physical volume on a storage device. Storage vendors typically use the volume as the unit of management. That is, they allow volume-level data copy, migration, snapshots, access control, compression, and encryption.

Cross-references
▶ Logical Unit Number (LUN)
▶ Logical Volume Manager
▶ LUN
▶ LUN Mapping
▶ RAID

Volume Set Manager

▶ Logical Volume Manager (LVM)

Voronoi Decomposition

▶ Voronoi Diagram
Voronoi Diagrams

Cyrus Shahabi¹, Mehdi Sharifzadeh²
¹University of Southern California, Los Angeles, CA, USA
²Google, Santa Monica, CA, USA
Synonyms Voronoi tessellation; Voronoi decomposition; Dirichlet tessellation; Thiessen polygons
Definition

The Voronoi diagram of a given set P = {p_1,...,p_n} of n points in R^d partitions the space R^d into n regions [2]. Each region includes all points in R^d with a common closest point in the given set P according to a distance metric D(·,·). That is, the region
Voronoi Diagrams. Figure 1. The ordinary Voronoi diagram.
corresponding to the point p ∈ P contains all the points q ∈ R^d for which the following holds:

∀p′ ∈ P, p′ ≠ p: D(q, p) ≤ D(q, p′)

The equality holds for the points on the borders of p’s and p′’s regions. Incorporating arbitrary distance metrics D(·,·) results in different variations of Voronoi diagrams. As an example, Additively Weighted Voronoi diagrams are defined by using the distance metric D(p, q) = L_2(p, q) + w(p), where L_2(·,·) is the Euclidean distance and w(p) is a numeric weight assigned to p. A thorough discussion on all variations is presented in [7]. Figure 1 shows the ordinary Voronoi diagram of nine points in R^2 where the distance metric is Euclidean. The region V(p) containing the point p is denoted as its Voronoi cell. With Euclidean distance in R^2, V(p) is a convex polygon. Each edge of this polygon is a segment of the perpendicular bisector of the line segment connecting p to another point of the set P. These edges and their end-points are called Voronoi edges and Voronoi vertices of the point p, respectively. For each Voronoi edge of the point p, the corresponding point in the set P is called a Voronoi neighbor of p. The point p is referred to as the generator of the Voronoi cell V(p). Finally, the set given by V_D(P) = {V(p_1),...,V(p_n)} is called the Voronoi diagram of the set P with respect to the distance function D(·,·).
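The definition can be illustrated directly by assigning each point of a small grid to its closest generator; with the Euclidean metric the resulting labels trace out the ordinary Voronoi cells. The sketch below is purely illustrative (the generator coordinates and grid resolution are arbitrary); libraries such as SciPy also provide scipy.spatial.Voronoi for computing the diagram itself.

```python
# A sketch of the Voronoi definition: every point q of a rasterised region is
# labelled with the generator p in P that minimises D(q, p).
import math

P = [(1.0, 1.0), (4.0, 1.5), (2.5, 4.0)]          # generator points (illustrative)

def nearest_generator(q, generators, dist=math.dist):
    return min(range(len(generators)), key=lambda i: dist(q, generators[i]))

# Rasterise a small region of R^2 into Voronoi cell labels.
cells = [[nearest_generator((x * 0.5, y * 0.5), P) for x in range(11)]
         for y in range(11)]
for row in reversed(cells):                        # print with y increasing upward
    print("".join(str(i) for i in row))
```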
Key Points

Voronoi diagrams exhibit the following properties [9]:
Property 1: The Voronoi diagram of a set P of points, V_D(P), is unique.
Property 2: Given the Voronoi diagram of P, the nearest point of P to a point p ∈ P is among the Voronoi
neighbors of p. That is, the closest point to p is one of the generator points whose Voronoi cells share a Voronoi edge with V(p).
Property 3: The average number of vertices per Voronoi cell of the Voronoi diagram of a set of points in R^2 does not exceed six. That is, the average number of Voronoi neighbors of each point of P is at most six.

Empowered by the above properties, different variations of Voronoi diagrams have been used as index structures for the nearest neighbor search. Hagedoorn introduces a directed acyclic graph based on Voronoi diagrams [3]. He uses the data structure to answer exact nearest-neighbor queries with respect to general distance functions in O(log² n) time using only O(n) space. In [6], Maneewongvatana proposes a hierarchical index structure for point data, termed the overlapped split tree (os-tree). Each os-tree node is associated with a convex polygon referred to as the cover, which includes all the points in space whose nearest neighbor is a data point associated with the node. The same principle used in Voronoi diagrams is utilized in os-trees to partition the space, using a subset of Voronoi edges, into regions, each having different sets of nearest neighbors. In [11], Xu et al. study indexing location data in location-based wireless services. They propose the D-tree, an index structure that can be used to efficiently process NN queries (planar point queries in their terminology). The D-tree simply indexes the point data using the subsets of the points’ Voronoi edges that partition the entire set into two subsets. This partitioning principle is used in all levels of the tree. Many studies also focus on utilizing individual Voronoi cells for query processing. Korn and Muthukrishnan [5] describe four examples of the Voronoi cell computation problem, drawn from different spatial/vector space domains in which the influence set of a given point is required. Stanoi et al. in [10] combine the properties of Voronoi cells (influence sets in their terminology) with the efficiency of R-trees to retrieve reverse nearest neighbors of a query point from the database. Zhang et al. [12] determine the so-called validity region around a query point as the Voronoi cell of its nearest neighbor. The cell is the region within which the result of the nearest neighbor query remains valid as the location of the query point is changing. To provide an efficient similarity search mechanism in a peer-to-peer data network, Banaei-Kashani and Shahabi propose that each node maintains its Voronoi cell based on its local neighborhood
in the content space [1]. As a more practical example, Kolahdouzan and Shahabi [4] propose a Voronoi-based data structure to improve the performance of exact k nearest neighbor search in spatial network databases. Sharifzadeh and Shahabi [8] utilize Additively Weighted Voronoi diagrams to process a class of nearest neighbor queries.
Cross-references
▶ Road Networks
▶ Rtree
Recommended Reading 1. Banaei-Kashani F. and Shahabi C. SWAM: a family of access methods for similarity-search in peer-to-peer data networks. In Proc. Int. Conf. on Information and Knowledge Management, 2004, pp. 304–313. 2. de Berg M., van Kreveld M., Overmars M., and Schwarzkopf O. Computational Geometry: Algorithms and Applications, 2nd ed. Springer, Berlin Heidelberg New York, 2000. 3. Hagedoorn M. Nearest neighbors can be found efficiently if the dimension is small relative to the input size. In Proc. 9th Int. Conf. on Database Theory, 2003, pp. 440–454. 4. Kolahdouzan M. and Shahabi C. Voronoi-based K nearest neighbor search for spatial network databases. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 840–851. 5. Korn F. and Muthukrishnan S. Influence sets based on reverse nearest neighbor queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 201–212. 6. Maneewongvatana S. Multi-Dimensional Nearest Neighbor Searching with Low-dimensional Data. PhD thesis, Computer Science Department, University of Maryland, College Park, MD, USA, 2001. 7. Okabe A., Boots B., Sugihara K., and Chiu S.N. Spatial Tessellations, Concepts and Applications of Voronoi Diagrams, 2nd edn. Wiley, Chichester, UK, 2000.
8. Sharifzadeh M. and Shahabi C. Processing optimal sequenced route queries using voronoi diagrams. Geoinformatica 12(4), Springer Netherlands, December 2008, pp. 411–433. 9. Sharifzadeh M. Spatial Query Processing Using Voronoi Diagrams. PhD thesis, Computer Science Department, University of Southern California, Los Angeles, CA, 2007. 10. Stanoi I., Riedewald M., Agrawal D., and El Abbadi A. Discovery of influence sets in frequently updated databases. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 99–108. 11. Xu J., Zheng B., Lee W.-C., and Lee D.L. The D-tree: an index structure for planar point queries in location-based wireless services. IEEE Trans. Knowl. Data Eng., 16(12):1526–1542, 2004. 12. Zhang J., Zhu M., Papadias D., Tao Y., and Lee D.L. Locationbased spatial queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 443–453.
Voronoi Tessellation ▶ Voronoi Diagram
VP ▶ Virtual Partitioning
VSM ▶ Vector-Space Model
W3C

Serguei Mankovskii
CA Labs, CA, Inc., Thornhill, ON, Canada

Synonyms
World Wide Web consortium

Definition
W3C is an international consortium for development of World Wide Web protocols and guidelines to ensure long-term growth of the Web.

Key Points
W3C was founded in 1994 by the inventor of the World Wide Web, Tim Berners-Lee, as a vendor-neutral forum for building consensus around Web technologies. The consortium consists of member organizations and a dedicated staff of technical experts. Membership is open to any organization or individual whose application is reviewed and approved by the W3C. Usually, W3C members invest significant resources into Web technologies. W3C fulfils its mission by creating recommendations that enjoy the status of international standards. In its first 10 years of existence, it produced over eighty W3C recommendations. W3C is responsible for such technologies as HTML, XHTML, XML, XML Schema, CSS, SOAP, WSDL and others. W3C members play a leading role in the development of the recommendations. W3C initiatives involve international, national, and regional organizations on a global scale. Global participation reflects the broad adoption of Web technologies. Along with broad adoption comes growth in the diversity of software and hardware used on the Web. W3C aims to facilitate hardware and software interoperability to avoid fragmentation of the Web. W3C operations are jointly administered by the MIT Computer Science and Artificial Intelligence Laboratory, the European Research Consortium for Informatics and Mathematics, and Keio University.

Cross-references
▶ Web 2.0/3.0
▶ Web Services
▶ XML
▶ XML Schema

Recommended Reading
1. W3C. Available at: http://www.w3.org/
W3C XML Path Language ▶ XPath/XQuery
W3C XML Query Language ▶ XPath/XQuery
W3C XML Schema ▶ XML Schema
WAN Data Replication

Maarten van Steen
VU University, Amsterdam, The Netherlands
Synonyms Wide-area data replication
Definition The field of WAN data replication covers the problems and solutions for distributing and replicating data across wide-area networks. Concentrating on databases alone, a wide-area database is defined as a collection of multiple, logically interrelated databases distributed and
possibly replicated across sites that are connected through a wide-area network. The characteristic feature is that data are spread across sites that are separated through wide-area links. Unlike links in local-area networks, the quality of communication through wide-area links is relatively poor. Links are subject to latencies of tens to thousands of milliseconds, there are often severe bandwidth restrictions, and connections between sites are much less reliable. In principle, WAN data replication also covers the distribution and replication of plain files. These issues are traditionally handled by wide-area distributed file systems such as AFS [11] and NFS [3,12], which are both widely used. These distributed file systems aim at shielding data distribution from applications, i.e., they aim at providing a high degree of distribution transparency. As such, they tackle the same problems that wide-area databases need to solve. However, matters are complicated for databases, because relations between and within files (i.e., tables) also need to be taken into account.
Historical Background WAN data replication is driven by the need for improving application performance across wide-area networks. Performance is generally expressed in terms of client-perceived quality of service: response times (latency), data transfer rates (bandwidth), and availability. Replication may also be driven by the requirement for reducing monetary costs, as the infrastructure over which data distribution and replication takes place is generally owned by a separate provider who may be charging per transferred byte. Replication of data has been explored in early widearea systems such as Grapevine [2], Clearinghouse [4],
and Lotus Notes [6]. All these systems concentrated on achieving a balance between data consistency, performance, and availability, recognizing that acceptable quality of service can be achieved only when weak consistency can be tolerated. In general, this required exploring application semantics making solutions more or less specific for an application. In the mid-1990s, researchers from Xerox explored client-centric consistency models by which data were distributed and replicated in a wide-area database such that a notion of strong data consistency could be presented to users individually [8]. For example, data that had been modified by a user when accessing the database at location A would be propagated to location B before the user would access the system again at B, but not to other locations. The need for WAN data replication became widely recognized with the explosion of the Web, which instantly revealed the shortcomings of its traditional client-server architecture. Up to date, virtually all Web sites are centrally organized, with end users sending requests to a single server. However, this organization does not suffice for large commercial sites, in which high performance and availability is crucial. To address these needs, so-called Content Delivery Networks (CDNs) came into play. A CDN is essentially a Web hosting service with many servers placed across the Internet (see Fig. 1). Its main goal is to ensure that a single Web site is automatically distributed and replicated across these servers in such a way that negotiated performance and availability requirements are met (see also [9]). CDNs surfaced when most Web sites were still organized as a (possibly very large) collection of files that could be accessed through a single server. Modern sites, however, are no longer statically organized, but deploy full-fledged databases from which Web content
WAN Data Replication. Figure 1. Web-based data replication deploying servers that are placed at the edge of the Internet. Adapted from Tanenbaum, van Steen: Distributed Systems, 2nd edn. Prentice-Hall, Englewood, Cliffs, NJ, 2007.
is dynamically generated by application servers. As a consequence, the so-called edge servers to which client requests are initially directed are gradually turning into servers hosting partially or fully replicated databases, or database caches. These issues are discussed below.
Foundations

A popular model used to understand the various issues involved in wide-area data(base) replication is the one in which every data item has a single associated server through which all its updates are propagated. Such a primary or origin server as it is called, simplifies the handling of conflicts and maintenance of consistency. In practice, an origin server maintains a complete database that is partially or completely replicated to other servers. In contrast, in an update anywhere approach, updates may be initiated and handled at any replica server. The main problem with this approach is that it requires global consensus among the replica servers on the ordering of updates if strong consistency is to be preserved. Achieving such consensus introduces serious scalability problems, for which reason various optimistic approaches have been proposed (optimistic in the sense that corrective actions may later be necessary). Queries are initially forwarded to edge servers, which then handle further processing. This could mean that subqueries are issued to different origin servers, but it is also possible that the edge server can compute the answer locally and send the response directly to the requesting client without further contacting the origin. In this context, key issues that need to be addressed for wide-area data replication are replica placement and consistency.

Replica Placement
Somewhat surprisingly, many researchers do not make a clear distinction between placement of server machines and placement of data on servers. Nevertheless, this distinction is important: while data placement can often be decided at runtime, this is obviously not the case for placement of server machines. Moreover, the criteria for placement are different: server placement should be done for many data objects, but deciding on the placement of data can be optimized for individual data objects. Both problems can be roughly tackled as optimization problems in which the best K out of N possible locations need to be selected. There are a number of variations of this problem, but most important is the
W
3443
fact that heuristics need to be employed due to the exponential complexity of known solutions. For this reason, runtime data placement decisions deploy simpler solutions. An overview is provided in [13]. Relevant in this context is comparing different placements to decide which one is best. To this end, a general cost function can be used in various metrics which are combined: cos t ¼ w 1 res1 þ w 2 res2 þ þw n resn ; resk 0; w k > 0 where resk is a monotonically increasing variable denoting the cost of resource k and wk its associated weight. Typical resources include distance (expressed in delay or number of hops) and bandwidth. Note that the actual dimension of cost is irrelevant. What matters is that different costs can be compared in order to select the best one. Using a cost-driven placement strategy also implies that resource usage must be measured or estimated. In many cases, estimating costs may be more difficult than one would initially expect. For example, in order for an origin server to estimate the delay between a client and a given edge server may require mapping Internet locations to coordinates in a high-dimensional geometric space [7]. Data Consistency
Data Consistency

Consistency of replicated data has received considerable attention, notably in the context of distributed shared-memory parallel computers. This class of computers attempts to mimic the behavior of traditional shared-memory multiprocessors on clusters or grids of computers. However, traditional distributed systems have generally assumed that only strong consistency is acceptable, i.e., all processes concurrently accessing shared data see the same ordering of updates everywhere. The problem with strong consistency is that it requires timely global synchronization, which may be prohibitively expensive in wide-area networks. Therefore, weaker consistency models often need to be adopted. To capture different models, Yu and Vahdat defined continuous consistency [15], a framework that captures consistency along three different dimensions: time, content, and ordering of operations. When updates are handled in a centralized manner, then notably the first two are relevant for WAN data replication. Continuous time-based consistency expresses to what
extent copies of the same data are allowed to be stale with respect to each other. Continuous content-based consistency is used to express to what extent the respective values of replicated data may differ, a metric that is useful when dealing with, e.g., financial data. The differences in consistency are subsequently expressed as numbers. By exchanging this information between sites, consistency enforcement protocols can be automatically started without further interference from applications. There are essentially three ways to bring copies in the same state. First, with state shipping the complete up-to-date state is transferred to a replica server. This update form can be optimized through delta shipping by which only the difference between a replica’s current state and that of a fresher replica is computed and transferred. These two forms of update are also referred to as passive replication, or asymmetric update processing. In contrast, with active replication, function shipping takes place, meaning that the operations that led to a new state are forwarded to a replica and subsequently executed. This is also known as symmetric update processing. Differences may also exist with respect to the server taking the initiative for being updated. In pull approaches a replica server sends a request to a fresher replica to send it updates. In the push approach, updates are sent to a replica server at the initiative of a server that has just been updated. Note that for pushing, the server taking the initiative will need to
know every other replica server as well as its state. In contrast, pulling in updates requires only that a replica server knows where to fetch fresh state. For these reasons, pull-based consistency enforcement is often used in combination with caches: when a request arrives at a caching server, the latter checks whether its cache is still fresh by contacting an origin server. An important observation for WAN data replication is that it is impossible to simultaneously provide strong consistency, availability, and partition tolerance [5]. In the edge-server model, this means that despite the fact that an origin server is hosting a database offering the traditional ACID properties, the system as a whole cannot provide a solution that guarantees that clients will be offered continuous and correct service as long as at least one server is up and running. This is an important limitation that is often overlooked.
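The three update forms can be contrasted with a small sketch over a replicated key-value map (illustrative only; real systems add versioning, ordering and failure handling on top of this):

```python
# A sketch contrasting state shipping, delta shipping and function shipping for
# a replicated key-value map. All names and values are illustrative.

primary = {"x": 1, "y": 2}

# --- state shipping: transfer the complete up-to-date state -------------------
def apply_state(replica, full_state):
    replica.clear()
    replica.update(full_state)

# --- delta shipping: transfer only the entries that differ --------------------
def compute_delta(old, new):
    return {k: v for k, v in new.items() if old.get(k) != v}

def apply_delta(replica, delta):
    replica.update(delta)

# --- function shipping: transfer the operation and re-execute it --------------
def increment(state, key, amount):
    state[key] = state.get(key, 0) + amount

# An update happens at the primary ...
old = dict(primary)
increment(primary, "x", 5)

# ... and can be propagated to a replica in any of the three ways:
state_copy = dict(old); apply_state(state_copy, primary)
delta_copy = dict(old); apply_delta(delta_copy, compute_delta(old, primary))
func_copy  = dict(old); increment(func_copy, "x", 5)

assert state_copy == delta_copy == func_copy == primary
print(primary)
```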
Key Applications

Wide-area data(base) replication is applied in various settings, but is arguably most prevalent in Web applications and services (see also [1]). To illustrate, consider a Web service deployed within an edge-server infrastructure. The question is what kind of replication schemes can be applied for the edge servers. The various solutions are sketched in Fig. 2. First, full replication of the origin server’s database to the edge servers can take place. In that case, queries can be handled completely at the edge server and problems revolve around keeping the database replica’s
WAN Data Replication. Figure 2. Caching and replication schemes for Web applications. Adapted from Tanenbaum, van Steen: Distributed Systems, 2nd edn. Prentice-Hall, Englewood, Cliffs, NJ, 2007.
WAN Data Replication
consistent. Full replication is generally a feasible solution when read/write ratios are high and queries are complex (i.e., spanning multiple database tables). The main problem is that if the number of replicas grow, one would need to resort to lazy replication techniques by which an edge server can continue to handle a query in parallel to bringing all replicas up-todate, but at the risk of having to reconcile conflicting updates later on [10]. If queries and data can be organized such that access to only a single table is required (effectively leading to only simple queries), partial replication by which only the relevant table is copied to the edge server may form a viable optimization. An alternative solution is to apply what is known as content-aware caching. In this case, an edge server has part of the database stored locally based on caching techniques. A query is assumed to fit a template allowing the edge server to efficiently cache and lookup query results. For example, a query such as ‘‘select * from items where price < 50’’ contains the answer to the more specific query ‘‘select * from items where price < 20.’’ In this case, the edge server is actually building up its own version of a database, but now with a data schema that is strongly related to the structure of queries. Finally, an edge server can also deploy contentblind caching by which query results are simply cached and uniquely associated with the query that generated them. In other words, unlike content-aware caching and database replication, no relationship with the structure of the query or data is kept track of. As discussed in [14], each of these replication schemes has its merits and disadvantages, with no clear winner. In other words, it is not possible to simply provide a single solution that will fit all applications. As a consequence, distributed systems that aim to support WAN data replication will need to incorporate a variety of replication schemes if they are to be of general use.
Cross-references
▶ Autonomous Replication ▶ Consistency Models for Replicated Data ▶ Data Replication ▶ Distributed Databases ▶ Eventual Consistency Optimistic Replication and Resolution ▶ One-Copy Serializability
W
3445
▶ Partial Replication ▶ Replica Control ▶ Replication Based on Group Communication ▶ Replication for High Availability ▶ Replication for Scalability ▶ Traditional Concurrency Control for Replicated Databases
Recommended Reading 1. Alonso G., Casati F., Kuno H., and Machiraju V. Web Services: Concepts, Architectures and Applications. Springer, Berlin Heidelberg New York, 2004. 2. Birrell A., Levin R., Needham R., and Schroeder M. Grapevine: an excercise in distributed computing. Commun. ACM, 25(4): 260–274, 1982. 3. Callaghan B. NFS Illustrated. Addison-Wesley, Reading, MA, 2000. 4. Demers A., Greene D., Hauser C., Irish W., Larson J., Shenker S., Sturgis H., Swinehart D., and Terry D. Epidemic algorithms for replicated database maintenance. In Proc. ACM SIGACTSIGOPS 6th Symp. on the Principles of Dist. Comp., 1987, pp. 1–12. 5. Gilbert S. and Lynch N. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant Web services. ACM SIGACT News, 33(2):51–59, 2002. 6. Kawell L., Beckhardt S., Halvorsen T., Ozzie R., and Greif I. Replicated document management in a group communication system. In Proc. 2nd Conf. Computer-Supported Cooperative Work. 1988, pp. 226–235. 7. Ng E. and Zhang H. Predicting Internet Network Distance with Coordinates-based Approaches. In Proc. 21st Annual Joint Conf. of the IEEE Computer and Communications Societies, 2002, pp. 170–179. 8. Petersen K., Spreitzer M., Terry D., and Theimer M. Bayou: replicated database services for world-wide applications. In Proc. 7th SIGOPS European Workshop, 1996, pp. 275–280. 9. Rabinovich M. and Spastscheck O. Web Caching and Replication. Addison-Wesley, Reading, MA, 2002. 10. Saito Y. and Shapiro M. Optimistic replication. ACM Comput. Surv., 37(1):42–81, 2005. 11. Satyanarayanan M. Distributed files systems. In Distributed Systems, 2nd edn., S. Mullender (ed.). Addison-Wesley, Wokingham, 1993, pp. 353–383. 12. Shepler S., Callaghan B., Robinson D., Thurlow R., Beame C., Eisler M., and Noveck D. Network File System (NFS) Version 4 Protocol. RFC 3530, 2003. 13. Sivasubramanian S., Szymaniak M., Pierre G., and van Steen M. Replication for Web Hosting Systems. ACM Comput. Surv., 36(3):1–44, 2004. 14. Sivasubramanian S., Pierre G., van Steen M., and Alonso G. Analysis of Caching and Replication Strategies for Web Applications. IEEE Internet Comput., 11(1):60–66, 2007. 15. Yu H. and Vahdat A. Design and Evaluation of a Conit-based Continuous Consistency Model for Replicated Services. ACM Trans. Comput. Syst., 20(3):239–282, 2002.
W
W
Watermarking
Watermarking ▶ Storage Security
Wavelets on Streams

Minos Garofalakis
Technical University of Crete, Chania, Greece
Definition Unlike conventional database query-processing engines that require several passes over a static data image, streaming data-analysis algorithms must often rely on building concise, approximate (but highly accurate) synopses of the input stream(s) in real-time (i.e., in one pass over the streaming data). Such synopses typically require space that is significantly sublinear in the size of the data and can be used to provide approximate query answers. The collection of the top (i.e., largest) coefficients in the wavelet transform (or, decomposition) of an input data vector is one example of such a key feature of the stream. Wavelets provide a mathematical tool for the hierarchical decomposition of functions, with a long history of successful applications in signal and image processing [10]. Applying the wavelet transform to a (one- or multi-dimensional) data vector and retaining a select small collection of the largest wavelet coefficient gives a very effective form of lossy data compression. Such wavelet summaries provide concise, generalpurpose summaries of relational data, and can form the foundation for fast and accurate approximate query processing algorithms.
Historical Background
Haar wavelets have recently emerged as an effective, general-purpose data-reduction technique for approximating an underlying data distribution and providing a foundation for approximate query processing over traditional (static) relational data. Briefly, the idea is to apply the Haar wavelet decomposition to the input relation to obtain a compact summary that comprises a select small collection of wavelet coefficients. The results of [3,9,11] have demonstrated that fast and accurate approximate query processing engines can be designed to operate solely over such compact wavelet
summaries. Wavelet summaries can also give accurate histograms of the underlying data distribution at multiple levels of resolution, thus providing valuable primitives for effective data visualization. In the setting of static relational tables, the Haar wavelet transform is well understood, and scalable wavelet decomposition algorithms for constructing wavelet summaries have been known for some time [3,11]. Typically, such algorithms assume at least linear space and several passes over the data. Data streaming models introduce the novel challenge of maintaining the topmost wavelet coefficients over a dynamic data distribution (rendered as a stream of updates), while only utilizing small space and time (i.e., significantly sublinear in the size of the data distribution).
Foundations
The Haar Wavelet Decomposition. Consider the one-dimensional data vector a = [2, 2, 0, 2, 3, 5, 4, 4] comprising N = 8 data values. In general, N denotes the data set size and [N] is defined as the integer index domain {0,...,N − 1}; also, without loss of generality, N is assumed to be a power of 2 (to simplify notation). The Haar Wavelet Transform (HWT) of a is computed as follows. The data values are first averaged together pairwise to get a new "lower-resolution" representation of the data with the pairwise averages [(2+2)/2, (0+2)/2, (3+5)/2, (4+4)/2] = [2, 1, 4, 4]. This averaging loses some of the information in a. To restore the original a values, detail coefficients that capture the missing information are needed. In the HWT, these detail coefficients are the differences of the (second of the) averaged values from the computed pairwise average. Thus, in this simple example, for the first pair of averaged values the detail coefficient is 0 since (2 − 2)/2 = 0, and for the second it is −1 since (0 − 2)/2 = −1. No information is lost in this process – one can reconstruct the eight values of the original data array from the lower-resolution array containing the four averages and the four detail coefficients. This pairwise averaging and differencing process is recursively applied on the lower-resolution array of averages until the overall average is reached, to get the full Haar decomposition. The final HWT of a is given by w_a = [11/4, −5/4, 1/2, 0, 0, −1, −1, 0], that is, the overall average followed by the detail coefficients in order of increasing resolution. Each entry in w_a is called a wavelet coefficient. The main advantage of using w_a instead of the original data vector a is that for vectors containing similar values
most of the detail coefficients tend to have very small values. Thus, eliminating such small coefficients from the wavelet transform (i.e., treating them as zeros) introduces only small errors when reconstructing the original data, resulting in a very effective form of lossy data compression [10]. A useful conceptual tool for visualizing and understanding the HWT process is the error-tree structure [9] (shown in Fig. 1 for the example array a). Each internal tree node c_i corresponds to a wavelet coefficient (with the root node c_0 being the overall average), and leaf nodes a[i] correspond to the original data-array entries. This view allows us to see that the reconstruction of any a[i] depends only on the log N + 1 coefficients in the path between the root and a[i]; symmetrically, it means a change in a[i] only impacts its log N + 1 ancestors in an easily computable way. The support of a coefficient c_i is defined as the contiguous range of the data array that c_i is used to reconstruct (i.e., the range of data/leaf nodes in the subtree rooted at c_i). Note that the supports of all coefficients at resolution level l of the HWT are exactly the 2^l (disjoint) dyadic ranges of size N/2^l = 2^(log N − l) over [N], defined as R_{l,k} = [k · 2^(log N − l),...,(k + 1) · 2^(log N − l) − 1] for k = 0,...,2^l − 1 (for each resolution level l = 0,...,log N). The HWT can also be conceptualized in terms of vector inner-product computations: let φ_{l,k} denote the vector with φ_{l,k}[i] = 2^(l − log N) for i ∈ R_{l,k} and 0 otherwise, for l = 0,...,log N and k = 0,...,2^l − 1; then, each of the coefficients in the HWT of a can be expressed as the inner product of a with one of the N distinct Haar wavelet basis vectors:
{ (1/2) · (φ_{l+1,2k} − φ_{l+1,2k+1}) : l = 0,...,log N − 1, k = 0,...,2^l − 1 } ∪ { φ_{0,0} }

Intuitively, wavelet coefficients with larger support carry a higher weight in the reconstruction of the original data values. To equalize the importance of all HWT coefficients, a common normalization scheme is to scale the coefficient values at level l (or, equivalently, the basis vectors φ_{l,k}) by a factor of sqrt(N/2^l). This normalization essentially turns the HWT basis vectors into an orthonormal basis – letting c_i* denote the normalized coefficient values, this fact has two important consequences: (i) the energy (squared L2-norm) of the a vector is preserved in the wavelet domain, that is, ||a||_2^2 = ⟨a, a⟩ = Σ_i a[i]^2 = Σ_i (c_i*)^2 (by Parseval's theorem); and (ii) retaining the B largest coefficients in absolute normalized value gives the (provably) best B-term approximation in terms of L2 (or, sum-squared) error in the data reconstruction (for a given budget of coefficients B) [10]. The HWT and its key properties also naturally extend to the case of multi-dimensional data distributions; in that case, the input a is a d-dimensional data array comprising N^d entries (without loss of generality, a domain of [N]^d is assumed for the d-dimensional case), and the HWT of a results in a d-dimensional wavelet-coefficient array w_a with N^d coefficient entries. The supports of d-dimensional Haar coefficients are d-dimensional hyper-rectangles (over dyadic ranges in [N]^d), and can be naturally arranged in a generalized error-tree structure [5,6].
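To make the decomposition and the normalization concrete, here is a minimal Python sketch (not part of the original entry; function names are illustrative) that computes the HWT of the running example and checks Parseval's theorem:

```python
import math

def haar_transform(a):
    """Unnormalized Haar wavelet transform of a vector whose length is a power of two.
    Returns the overall average followed by detail coefficients in order of increasing resolution."""
    details = []
    current = [float(x) for x in a]
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        level = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details = level + details        # coarser-level details go in front
        current = averages
    return current + details             # [c_0, c_1, ..., c_{N-1}]

def normalize(wa):
    """Scale each coefficient by sqrt(N / 2^l), turning the Haar basis into an orthonormal one."""
    N = len(wa)
    out = [wa[0] * math.sqrt(N)]                     # overall average
    for i in range(1, N):
        level = int(math.log2(i))                    # resolution level of coefficient i
        out.append(wa[i] * math.sqrt(N / 2 ** level))
    return out

a = [2, 2, 0, 2, 3, 5, 4, 4]
wa = haar_transform(a)
print(wa)                                            # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
print(sum(x * x for x in a), sum(c * c for c in normalize(wa)))   # 78 and (approximately) 78.0
```

Keeping the B entries of normalize(wa) with the largest absolute values (together with their positions) gives the best B-term L2 approximation mentioned above.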
Wavelets on Streams. Figure 1. Error-tree structure for the example data array a (N = 8).
Wavelet Summaries on Streaming Data. Abstractly, the goal is to continuously track a compact summary of the B topmost wavelet coefficient values for a dynamic data distribution vector a rendered as a continuous stream of updates. Algorithms for this problem should satisfy the key small space/time requirements for streaming algorithms; more formally, streaming wavelet algorithms should (ideally) guarantee (i) sublinear space usage (for storing a synopsis of the stream), (ii) sublinear per-item update time (to maintain the synopsis), and (iii) sublinear query time (to produce a, possibly approximate, wavelet summary), where "sublinear" typically means polylogarithmic in the domain size N. The streaming wavelet summary construction problem has been examined under two distinct data streaming models.

Wavelets in the Ordered Aggregate (Time Series) Streaming Model. Here, the entries of the input data vector a are rendered over time in increasing (or decreasing) order of the index domain values. This means, for instance, that a[1] (or the set of all updates to a[1]) is seen first, followed by a[2], then a[3], and so on. In this case, the set of the B topmost HWT values over a can be maintained exactly in small space and time, using a simple algorithm based on the error-tree structure (Fig. 1) [7]: Consider reading (an update to) item a[i + 1] in the stream; that is, all items a[j] for j ≤ i have already streamed through. The algorithm maintains the following two sets of (partial) coefficients:
1. The highest B HWT coefficient values for the portion of the data vector seen thus far.
2. log N + 1 straddling partial HWT coefficients, one for each level of the error tree. At level l, index i straddles the HWT basis vector φ_{l,k} for which i ∈ [k · 2^(log N − l), (k + 1) · 2^(log N − l) − 1]. (Note that there is at most one such basis vector per level.)
When the (i + 1)th data item is read, the value of each of the affected straddling coefficients is updated. With the arrival of the (i + 1)th item, some coefficients may no longer be straddling (i.e., their computation is now complete). In that case, the value of these coefficients is compared against the current set of B highest coefficients, and only the B largest coefficient values in the combined set are retained. Also, for levels where a straddling coefficient has been completed, a new straddling coefficient is initiated (with an initial value of 0). In this manner, at every position in the time-series stream, the set of the B topmost HWT coefficients is retained exactly.

Theorem 1 [7] The topmost B HWT coefficients can be maintained exactly in the ordered aggregate (time series) streaming model using O(B + log N) space and O(B + log N) processing time per item.

Similar algorithmic ideas can also be applied to more sophisticated wavelet thresholding schemes (e.g., that optimize for error metrics other than L2) to obtain space/time efficient wavelet summarization algorithms in the ordered aggregate model [8].
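A compact Python sketch of this idea follows (it is not the code of [7]; the bookkeeping here keeps a stack of partial block averages and ranks completed coefficients by normalized magnitude, so details may differ from the original algorithm):

```python
import heapq, math

def stream_top_b(stream, B):
    """Maintain the B largest (normalized) Haar detail coefficients over an ordered
    (time-series) stream a[0], a[1], ..., a[N-1], with N a power of two."""
    stack = []       # (level, partial average) of the straddling blocks, coarsest first
    top = []         # min-heap of (normalized magnitude, id, value) for completed details
    next_id = 0
    for x in stream:
        stack.append((0, float(x)))
        # whenever two adjacent blocks of the same size are complete, merge them:
        # half their difference is a completed detail coefficient, their average straddles on
        while len(stack) >= 2 and stack[-1][0] == stack[-2][0]:
            level, right = stack.pop()
            _, left = stack.pop()
            detail = (left - right) / 2.0
            support = 2 ** (level + 1)                      # size of the merged dyadic block
            heapq.heappush(top, (abs(detail) * math.sqrt(support), next_id, detail))
            next_id += 1
            if len(top) > B:
                heapq.heappop(top)                          # evict the smallest of the B+1
            stack.append((level + 1, (left + right) / 2.0))
    overall_avg = stack[0][1] if len(stack) == 1 else None  # root average, once N items arrived
    return overall_avg, [(val, mag) for mag, _, val in sorted(top, reverse=True)]

# For a = [2,2,0,2,3,5,4,4] and B = 2 this returns the overall average 2.75 together with
# the detail -5/4 (normalized magnitude ~3.54) and one of the two details of value -1 (~1.41).
print(stream_top_b([2, 2, 0, 2, 3, 5, 4, 4], 2))
```

The stack entries are exactly the straddling partial coefficients, one per level, and the heap holds the current top-B set of completed coefficients.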
Wavelets in the General Turnstile Streaming Model. Here, updates to the data vector a can appear in any arbitrary order in the stream, and the final vector entries are obtained by implicitly aggregating the updates for a particular index domain value. More formally, in the turnstile model, each streaming update is a pair of the form (i, v), denoting a net change of v in the a[i] entry; that is, the effect of the update is to set a[i] ← a[i] + v. (The model naturally generalizes to multi-dimensional data: for d data dimensions, each update ((i_1,...,i_d), v) effects a net change of v on entry a[i_1,...,i_d].) The problem of maintaining an accurate wavelet summary becomes significantly more complex when moving to this much more general streaming model. Gilbert et al. [7] prove a strong lower bound on the space requirements of the problem: for arbitrary turnstile streaming vectors, nearly all of the data must be stored to recover the top B HWT coefficients. Existing solutions for wavelet maintenance over turnstile data streams rely on randomized schemes that return only an approximate synopsis comprising (at most) B Haar coefficients that is provably near-optimal (in terms of the captured energy of the underlying vector) assuming that the data vector satisfies the "small-B property" (i.e., most of its energy is concentrated in a small number of HWT coefficients) – this assumption is typically satisfied for most real-life data distributions [7]. One of the key ideas is to maintain a randomized AMS sketch [2], a broadly applicable stream synopsis structure comprising randomized linear projections of the streaming data vector a. Briefly, an atomic AMS sketch of a is simply the inner product
⟨a, ξ⟩ = Σ_i a[i] ξ(i), where ξ denotes a random vector of four-wise independent ±1-valued random variates.

Theorem 2 [1,2] Consider two (possibly streaming) data vectors a and b, and let Z denote the O(log(1/δ))-wise median of O(1/ε^2)-wise means of independent copies of the atomic AMS sketch product (Σ_i a[i] ξ_j(i)) · (Σ_i b[i] ξ_j(i)). Then, |Z − ⟨a, b⟩| ≤ ε ||a||_2 ||b||_2 with probability ≥ 1 − δ.

Thus, using AMS sketches comprising only O(log(1/δ)/ε^2) atomic counters, the vector inner product ⟨a, b⟩ can be approximated to within ±ε ||a||_2 ||b||_2 (hence implying an ε-relative error estimate for the squared L2 norm ||a||_2^2). Since Haar coefficients of a are inner products with a fixed set of wavelet-basis vectors, the above theorem forms the key to developing efficient, approximate wavelet maintenance algorithms in the turnstile model. Gilbert et al. [7] propose a solution (termed "GKMS" in the remainder of the discussion) that focuses primarily on the one-dimensional case. GKMS maintains an AMS sketch for the streaming data vector a. To produce the approximate B-term representation, GKMS employs the constructed sketch of a to estimate the inner product of a with all wavelet basis vectors, essentially performing an exhaustive search over the space of all wavelet coefficients to identify the important ones. More formally, assuming that there is a B-coefficient approximate representation of the signal with energy at least η ||a||_2^2 (the "small-B property"), the GKMS algorithm uses a maintained AMS sketch to exhaustively estimate each Haar coefficient and selects up to B of the largest coefficients (excluding those whose square is less than ηε ||a||_2^2 / B, where ε < 1 is the desired accuracy guarantee). GKMS also uses techniques based on range-summable random variables constructed using Reed-Muller codes to reduce or amortize the cost of this exhaustive search by allowing the sketches of basis vectors (with potentially large supports) to be computed more quickly.

Theorem 3 [7] Assuming there exists a B-term representation with energy at least η ||a||_2^2, then, with probability at least 1 − δ, the GKMS algorithm finds a representation of at most B coefficients that captures at least (1 − ε)η of the signal energy ||a||_2^2, using O(log^2 N · log(N/δ) · B^2/(ηε)^2) space and per-item processing time.
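A minimal, illustrative Python rendering of the atomic AMS estimator and the median-of-means combination of Theorem 2 follows (the four-wise independent ±1 variates come from a random degree-3 polynomial over a prime field, a standard construction; parameter names and sizes are my own choices, not prescribed by [1,2]):

```python
import random, statistics

P = 2_147_483_647          # a large prime, assumed to exceed the domain size N

class FourWise:
    """Four-wise independent +/-1 variates from a random degree-3 polynomial mod P."""
    def __init__(self, rng):
        self.c = [rng.randrange(P) for _ in range(4)]
    def xi(self, i):
        v = (self.c[0] + self.c[1] * i + self.c[2] * i * i + self.c[3] * i * i * i) % P
        return 1 if v % 2 == 0 else -1

class AMSSketch:
    """Median-of-means AMS sketch of a turnstile stream: the number of 'means' copies
    controls the error epsilon, the number of 'groups' controls the failure probability."""
    def __init__(self, means, groups, seed=0):
        rng = random.Random(seed)
        self.hashes = [[FourWise(rng) for _ in range(means)] for _ in range(groups)]
        self.counters = [[0.0] * means for _ in range(groups)]
    def update(self, i, v):                        # turnstile update: a[i] <- a[i] + v
        for g, row in enumerate(self.hashes):
            for c, h in enumerate(row):
                self.counters[g][c] += v * h.xi(i)
    def inner_product(self, b):                    # estimate <a, b> for an explicit vector b
        group_means = []
        for g, row in enumerate(self.hashes):
            prods = [self.counters[g][c] * sum(b[i] * h.xi(i) for i in range(len(b)))
                     for c, h in enumerate(row)]
            group_means.append(sum(prods) / len(prods))
        return statistics.median(group_means)
```

Estimating a single Haar coefficient amounts to calling inner_product with the corresponding basis vector as b; doing this for every coefficient is exactly the exhaustive search that makes the GKMS query step expensive, which the range-summable constructions (and, later, the approach of [5]) aim to avoid.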
A potential problem lies in the query time requirements of the GKMS algorithm: even with the Reed-Muller code optimizations, the overall query time for discovering the top coefficients remains superlinear in N (i.e., at least O((1/ε^2) N log N)), violating the third requirement on streaming schemes. This also renders direct extensions of GKMS to multiple dimensions infeasible, since it implies an exponential explosion in query cost (requiring at least O(N^d) time to cycle through all coefficients in d dimensions). In addition, the update cost of the GKMS algorithm is linear in the size of the sketch, since the whole data structure must be "touched" for each update. This is problematic for high-speed data streams and/or even moderate-sized sketch synopses. To address these issues, Cormode et al. [5] propose a novel solution that relies on two key technical ideas. First, they work entirely in the wavelet domain: instead of sketching the original data entries, their algorithms sketch the wavelet-coefficient vector w_a as updates arrive. This avoids any need for complex range-summable hash functions (i.e., Reed-Muller codes). Second, they employ hash-based grouping in conjunction with efficient binary-search-like techniques to enable very fast updates as well as identification of important coefficients in polylogarithmic time.
– Sketching in the Wavelet Domain. The first technical idea in [5] relies on the observation that it is possible to efficiently produce sketch synopses of the stream directly in the wavelet domain. That is, the impact of each streaming update can be translated into updates on the relevant wavelet coefficients. By the linearity properties of the HWT and the earlier description, an update to a data entry corresponds to only polylogarithmically many coefficients in the wavelet domain. Thus, on receiving an update to a, it can be directly converted to O(polylog(N)) updates to the wavelet coefficients, and an approximate (sketch) representation of the wavelet-coefficient vector w_a can be maintained (a minimal sketch of this translation is given after this list).
– Time-Efficient Updates and Large-Coefficient Searches. Sketching in the wavelet domain means that, at query time, an approximate representation of the wavelet-coefficient vector w_a is available, and can be employed to identify all those coefficients that are "large" relative to the total energy of the data ||w_a||_2^2 = ||a||_2^2. While AMS sketches can provide such estimates (a point query is just a special case of an inner product), querying remains much
too slow, taking at least O((1/ε^2) N) time to find which of the N coefficients are the B largest. Instead, the schemes in [5] rely on a divide-and-conquer, binary-search-like approach for finding the large coefficients. This requires the ability to efficiently estimate sums of squares for groups of coefficients corresponding to dyadic subranges of the domain [N]. Low-energy regions can then be disregarded, recursing only on high-energy groups – this guarantees no false negatives, as a group that contains a high-energy coefficient will also have high energy as a whole. The algorithms of [5] also employ randomized, hash-based grouping of dyadic groups and coefficients to guarantee that each update only touches a small portion of the synopsis, thus guaranteeing very fast update times.
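As a small illustration of the first idea, the following sketch translates a single turnstile update into its wavelet-domain deltas (the indexing convention – coefficient 0 is the overall average and coefficient 2^l + k is the level-l detail over R_{l,k} – is my own, not anything prescribed by [5]):

```python
import math

def wavelet_updates(i, v, N):
    """Translate a turnstile update a[i] <- a[i] + v into the log2(N) + 1 affected
    (unnormalized) Haar coefficient deltas; these deltas can then be fed to any sketch
    of the wavelet-coefficient vector w_a."""
    deltas = [(0, v / N)]                   # the overall average changes by v/N
    for level in range(int(math.log2(N))):
        support = N >> level                # support size N / 2^level of this detail
        k = i // support                    # the dyadic range R_{level,k} containing i
        sign = 1 if (i % support) < support // 2 else -1
        deltas.append((2 ** level + k, sign * v / support))
    return deltas

# The update (i=5, v=2) on a domain of size 8 touches log2(8) + 1 = 4 coefficients:
print(wavelet_updates(5, 2, 8))   # [(0, 0.25), (1, -0.25), (3, 0.5), (6, -1.0)]
```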
Wavelets on Streams. Figure 2. The GCS data structure: Element x is hashed (t times) to a bucket of groups (using h(id(x))) and then a sub-bucket within the bucket (using f(x)), where an AMS counter is updated.
The key to the Cormode et al. solution is a hash-based probabilistic synopsis data structure, termed the Group-Count Sketch (GCS), that can estimate the energy of fixed groups of elements from a vector w of size N under the turnstile streaming model [5]. This translates to several streaming L2-norm estimation problems (one per group). A simple solution would be to keep an AMS sketch of each group separately; however, there can be many (e.g., linear in N) groups, implying space requirements that are O(N). Streaming updates should also be processed as quickly as possible. The GCS synopsis requires small, sublinear space and takes sublinear time to process each stream update item; more importantly, a GCS can provide a high-probability estimate of the energy of a group within additive error ε ||w||_2^2 in sublinear time. In a nutshell, the GCS synopsis first partitions the items of w into their groups using an id() function (which, in the case of Haar coefficients, is trivial since it corresponds to fixed dyadic ranges over [N]), and then randomly maps groups to buckets using a hash function h(). Within each bucket, a second stage of hashing of items to sub-buckets is applied (using another hash function f()), where each sub-bucket contains an atomic AMS sketch counter in order to estimate the L2 norm of the elements in the bucket. As with most randomized estimation schemes, a GCS synopsis comprises t independent instantiations of this basic randomized structure, each with independently chosen hash function pairs (h(), f()) and ξ families for the AMS estimator; during maintenance, a streaming update (x, u) is used to update each of the t AMS counters corresponding to element x. (A pictorial representation is shown in Fig. 2.) To estimate the energy of a group g, for each independent instantiation m = 1,...,t of the bucketing structure, the squared values of all the AMS counters in the sub-buckets corresponding to bucket h_m(g) are summed, and then the median of these t values is returned as the estimate.

Theorem 4 [5] The GCS can estimate the energy of item groups of the vector w within additive error ε ||w||_2^2 with probability ≥ 1 − δ using space of O((1/ε^3) log(1/δ)) counters, per-item update time of O(log(1/δ)), and query time of O((1/ε^2) log(1/δ)).

To recover coefficients with large energy in the w vector, the algorithm employs a hierarchical search-tree structure on top of [N]: each level in this tree structure induces a certain partitioning of elements into groups (corresponding to the nodes at that level), and per-level GCS synopses can be used to efficiently recover the high-energy groups at each level (and, thus, quickly zero in on high-energy Haar coefficients). Using these ideas, Cormode et al. [5] demonstrate that the accuracy guarantees of Theorem 3 can be obtained using O((B^3 log N / ε^3) · log(B log N / (εδ))) space, O(log^2 N · log(B log N / (εδ))) per-item processing time, and O((B^3 / ε^3) · log N · log(B log N / (εδ))) query time. In other words, their GCS-based solution guarantees sublinear space and query time, as well as per-item processing times that are sublinear in the size of the stream synopsis. Their results also naturally extend to the multi-dimensional case [5].
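A toy Python version of the GCS is given below (the bucket, sub-bucket, and repetition counts are arbitrary illustrative values, and Python's built-in hash stands in for the independent hash families used in [5]):

```python
import random, statistics

class GroupCountSketch:
    """Estimates the energy (sum of squares) of item groups of a turnstile-updated vector."""
    def __init__(self, group_of, buckets, subbuckets, reps, seed=0):
        self.group_of = group_of                       # the id() function mapping items to groups
        self.b, self.c, self.t = buckets, subbuckets, reps
        rng = random.Random(seed)
        self.seeds = [(rng.random(), rng.random(), rng.random()) for _ in range(reps)]
        self.counters = [[[0.0] * subbuckets for _ in range(buckets)] for _ in range(reps)]

    def _h(self, m, g):  return hash((self.seeds[m][0], g)) % self.b   # group -> bucket
    def _f(self, m, x):  return hash((self.seeds[m][1], x)) % self.c   # item  -> sub-bucket
    def _xi(self, m, x): return 1 if hash((self.seeds[m][2], x)) % 2 == 0 else -1

    def update(self, x, u):                            # stream update: w[x] <- w[x] + u
        g = self.group_of(x)
        for m in range(self.t):
            self.counters[m][self._h(m, g)][self._f(m, x)] += u * self._xi(m, x)

    def group_energy(self, g):                         # estimate of sum of w[x]^2 over group g
        return statistics.median(
            sum(s * s for s in self.counters[m][self._h(m, g)]) for m in range(self.t))

# Items 0..63 grouped into dyadic ranges of size 8; the energy of group 2 ({16,...,23})
# is estimated after two updates (true value 3^2 + 1^2 = 10).
gcs = GroupCountSketch(group_of=lambda x: x >> 3, buckets=16, subbuckets=8, reps=5)
gcs.update(17, 3.0)
gcs.update(18, -1.0)
print(gcs.group_energy(2))
```

In the full scheme, one such structure is kept for each level of the binary search tree over [N], and the search descends only into the children whose estimated energy is large enough, which yields the polylogarithmic query time stated above.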
Key Applications
Wavelet-based summaries are a general-purpose data-reduction tool, and the maintenance of such summaries
over continuous data streams has several important applications, including large-scale IP network monitoring and network-event tracking (e.g., for detecting network traffic anomalies or Denial-of-Service attacks), approximate query processing over warehouse update streams, clickstream and transaction-log monitoring in large web-server farms, and satellite- or sensornet-based environmental monitoring.
Data Sets
Several publicly available real-life data collections have been used in the experimental study of streaming wavelet summaries; examples include the US Census Bureau data sets (http://www.census.gov/), the UCI KDD Archive (http://kdd.ics.uci.edu/), and the UW Earth Climate and Weather Data Archive (http://www-k12.atmos.washington.edu/k12/grayskies/).
Future Directions
The area of streaming wavelet-based summaries is rich with interesting algorithmic questions. The bulk of the discussion here focuses on L2-error synopses. The problem of designing efficient streaming methods for maintaining wavelet summaries that optimize for non-L2 error metrics (e.g., general Lp error) under a turnstile streaming model remains open. Optimizing for non-L2 error implies more sophisticated coefficient thresholding schemes based on dynamic programming over the error-tree structure (e.g., [6,8]); while such methods can be efficiently implemented over ordered aggregate streams [8], no space/time efficient solutions are known for turnstile streams. Dealing with physically distributed streams also raises interesting issues for wavelet maintenance, such as trading off approximation quality against communication (in addition to time/space). The distributed AMS sketching algorithms of Cormode and Garofalakis [4] can be applied to maintain an approximate wavelet representation over distributed turnstile streams, but several issues (e.g., appropriate prediction models for local sites) remain open. Finally, from a systems perspective, the problem of incorporating wavelet summaries in industrial-strength data-streaming engines, and testing their viability in real-life scenarios, remains open.
Cross-references
▶ Approximate Query Processing ▶ Data Compression in Sensor Networks
▶ Data Reduction ▶ Data Sketch/Synopsis ▶ Synopsis Structure ▶ Wavelets in Database Systems
Recommended Reading
1. Alon N., Gibbons P.B., Matias Y., and Szegedy M. Tracking join and self-join sizes in limited storage. In Proc. 18th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 1999, pp. 10–20.
2. Alon N., Matias Y., and Szegedy M. The space complexity of approximating the frequency moments. In Proc. 28th Annual ACM Symp. on Theory of Computing, 1996, pp. 20–29.
3. Chakrabarti K., Garofalakis M., Rastogi R., and Shim K. Approximate query processing using wavelets. In Proc. 26th Int. Conf. on Very Large Data Bases, 2000, pp. 111–122.
4. Cormode G. and Garofalakis M. Approximate continuous querying over distributed streams. ACM Trans. Database Syst., 33(2):1–39, 2008.
5. Cormode G., Garofalakis M., and Sacharidis D. Fast approximate wavelet tracking on streams. In Advances in Database Technology, Proc. 10th Int. Conf. on Extending Database Technology, 2006, pp. 4–22.
6. Garofalakis M. and Kumar A. Wavelet synopses for general error metrics. ACM Trans. Database Syst., 30(4):888–928, December 2005.
7. Gilbert A.C., Kotidis Y., Muthukrishnan S., and Strauss M.J. One-pass wavelet decomposition of data streams. IEEE Trans. Knowl. Data Eng., 15(3):541–554, May 2003.
8. Guha S. and Harb B. Wavelet synopsis for data streams: minimizing non-euclidean error. In Proc. 11th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2005, pp. 88–97.
9. Matias Y., Vitter J.S., and Wang M. Wavelet-based histograms for selectivity estimation. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998, pp. 448–459.
10. Stollnitz E.J., DeRose T.D., and Salesin D.H. Wavelets for Computer Graphics – Theory and Applications. Morgan Kaufmann, San Francisco, CA, 1996.
11. Vitter J.S. and Wang M. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 193–204.
Weak Consistency Models for Replicated Data
ALAN FEKETE
University of Sydney, Sydney, NSW, Australia
Synonyms
Weak memory consistency; Copy divergence
Definition
Some designs for a distributed database system involve having several copies or replicas for a data item, at different sites, with algorithms that do not update these replicas in unison. In such a system, clients may detect a discrepancy between the copies. Each particular weak consistency model describes which discrepancies may be seen. If a system provides a weak consistency model, then the clients will require more careful programming than otherwise. Eventual consistency (q.v.) is the best-known weak consistency model.
Historical Background
In the late 1980s and early 1990s, replication research focused on systems that allowed replicas to diverge from one another in controlled ways. Epidemic or multi-master algorithms were introduced in the work of Demers et al. [4]. These researchers identified the importance of session properties [8], which ensure that clients see information that includes changes they could reasonably expect the system to know. Ladin et al. [6] extended this concept by allowing the clients to explicitly indicate information they expect the system to know when responding to requests. Consistency models for single-master designs with replicas that lag behind a master were also explored by several groups [1,7,9]. Much of the work was presented in two Workshops on the Management of Replicated Data, held in 1990 and 1992. Much later, Bernstein et al. [3] defined a consistency model called relaxed currency, in which stale reads can occur within transactions that also perform updates.
Foundations
In designing a distributed database system, the same data is usually kept replicated at different sites, for faster access and for better availability. However, there is generally a substantial performance penalty in trying to keep the replicas perfectly synchronized with one another so that clients never observe that replicas exist. Thus many system designers have proposed system designs where the replica control (q.v.) allows the clients to see that the replicas have diverged from one another. A consistency model captures the allowed observations that clients will make. In contrast to the models which maintain an illusion of a single-site system, a weaker consistency model allows the clients to see that the system has replicas. Different models are characterized by the ways in which the
divergence between replicas is revealed. In describing some of the models, the research community is sometimes divided between definitions that speak directly about the values in the replicas, and definitions that deal with the events that occur at the clients. Another axis of variation in consistency models comes from the possibility of having different consistency requirements for read-only transactions. It is common to require, for example, that all the transactions which modify the data should execute with the strong consistency model of 1-copy-serializability (q.v.), while some weak consistency is allowed for those transactions which contain only read operations. This section does not try to present every known weak consistency model, but rather indicates some characteristic examples of different styles. To motivate the weak consistency models, consider how badly things can go wrong in a system with uncontrolled read-one-write-all operations, where each update can be done anywhere first (inside the global client transaction) and then the updates are propagated lazily to the other replicas. Suppose there is such a system with one logical data item z, which is an initially empty relation R(p, q, r) where p is a primary key, and two sites A and B, each with a replica of z. In this example, there will be two different operations that can be performed on the data item: u, which inserts a row (INSERT INTO R VALUES (101, 10, 'Fred')), and v, which modifies some rows (UPDATE R SET q = q + 1 WHERE r = 'Fred'). Consider T1,C which just performs u and is submitted by client C, and T2,D which performs v and is submitted by client D. Here is a possible sequence of events. In the notation used, u_1[z^A, 1] represents the event where u is performed on the local replica of item z at site A, as part of transaction T1. Since the operations in the following example are modifications, the notation just shows as the return value the number of rows changed (1 in this particular event). Notice that the subscript on the event type indicates the transaction involved, and the superscript on the item name indicates the site of the replica which is affected. The event where the client C running transaction T1 learns the return value, and thus knows that the operation has been performed, can be represented by u_{1,C}[z, 1]. Note that there is no superscript on the item, since this is the client's view. Many consistency models need to refer to where transactions start and finish, so there are events like b_{1,C} for the start of transaction T1 at client C, or c_{1,C} for the commit of that transaction by the client, or indeed
c_1^A for the commit of the local subtransaction of T1 which is running at site A. The symbols T^A and T^B are used for the copier transactions that propagate updates previously done at other sites.

b_{1,C} b_{2,D} u_1[z^A, 1] u_{1,C}[z, 1] c_1^A c_{1,C} v_2[z^B, 0] v_{2,D}[z, 0] c_2^B c_{2,D} b^A v[z^A, 1] c^A b^B u[z^B, 1] c^B
(1)
Notice that in sequence (1), the operation v acts in quite a different way when performed at the two replicas (at z^B, no rows are found to match the WHERE clause, while at the other replica a row is modified). This leads to a situation where the final state of replica z^A has a row (101, 11, "Fred"), whereas the final state of z^B has (101, 10, "Fred"). That is, the replicas have diverged from one another, and they will stay this way unless another operation is submitted to reset the row in question. What is worse, the effects of this divergence can spread through other operations, which will act quite differently when performed at the two sites. This situation has been called "system delusion" or "split brain." It is clearly unacceptable, and each weak consistency model should at least prevent this from occurring.

Convergence. Several replica consistency models ensure that replicas eventually converge. Because this is a liveness feature, defining it formally involves looking at infinite sequences of events. However, informally, one can require that, provided the system enters a quiescent period without further changes, a convergence time will be reached when all replicas have the same value, a value that includes the effects of all the updates on the item, performed in some order. Thus, after the convergence point any read on the logical item will show this value. Before the convergence time, a read is allowed to show any value that reflects some arrangement of some subset of the updates to that logical item (and different reads can show incompatible arrangements of different subsets of updates). This eventual consistency (q.v.) model was first proposed in general distributed computing research, where operations are independent requests rather than combined into transactions; in replication of transactional DBMSs, there is an additional requirement that there should be an agreed serial order on all the updating transactions, so that the eventual state of every replica is the outcome of doing all the updating transactions in the given serial order. The key mechanisms for ensuring eventual consistency involve detecting and then reconciling the
situations where different replicas have applied operations in different orders. A common form of reconciliation involves having some deterministic rule to say which update should come first. At any replica where operations have been performed in the wrong order, inverse operations are used to roll back to an earlier state, and then the operations are reapplied in the correct order. For example, in sequence (1) above, if the authoritative order has T1,C before T2,D, then when T^B brings information to z^B about the missing operation u, the reconciler will use an inverse operation v^{-1}[z^B] to reverse the out-of-order operation v. This leads to the following sequence, in which the convergence point is reached at the end of T^B, when the replicas are in consistent states.

b_{1,C} b_{2,D} u_1[z^A, 1] u_{1,C}[z, 1] c_1^A c_{1,C} v_2[z^B, 0] v_{2,D}[z, 0] c_2^B c_{2,D} b^A v[z^A, 1] c^A b^B v^{-1}[z^B] u[z^B, 1] v[z^B, 1] c^B

(2)
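As a toy illustration of this rollback-and-reapply style of reconciliation (a simplified simulation written for this entry, not code from any of the cited systems), the following Python sketch replays the scenario above:

```python
def u(rows):                      # INSERT INTO R VALUES (101, 10, 'Fred')
    rows[101] = {"q": 10, "r": "Fred"}
    return 1

def v(rows):                      # UPDATE R SET q = q + 1 WHERE r = 'Fred'
    hits = [p for p, t in rows.items() if t["r"] == "Fred"]
    for p in hits:
        rows[p]["q"] += 1
    return hits                   # return the touched keys (not just a count), for the inverse

def v_inverse(rows, hits):        # compensating operation used by the reconciler
    for p in hits:
        rows[p]["q"] -= 1

A, B = {}, {}                     # the two replicas of z
u(A)                              # T1 runs u locally at site A
hits_at_B = v(B)                  # T2 runs v locally at site B: no rows match yet
v(A)                              # copier T^A brings v to A; it arrives after u, so it is in order
# Copier T^B brings u to B; v was applied out of order there, so roll it back and redo in order:
v_inverse(B, hits_at_B)           # a no-op here, since v touched no rows at B
u(B)
v(B)
print(A == B, A)                  # True {101: {'q': 11, 'r': 'Fred'}} - the replicas converged
```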
In practice, the burden of reconciliation can be reduced somewhat by noticing that, for many pairs of operations (those that commute, for example because they deal with different logical data items), the order does not matter. Nevertheless, [5] has shown that the cost of reconciliation is a serious limit to the scalability of multi-master designs that offer eventual consistency. The eventual consistency model is focused on the way update operations are done; until convergence is reached, read-only transactions are quite unconstrained. Variations on the eventual consistency model are defined by limiting, in various ways, the nondeterminism in the values that can be returned by read operations before the convergence point. For example, some consistency models enforce session properties. One notable property is called "read one's own writes." In this model, the subset of updates whose effects are seen in a read must include all writes (or, more generally, operations of update transactions) that were submitted at the same client, and further the arrangement of these writes must respect any cases where one transaction completed before another was started. More flexibly, some models, such as that in [6], allow each client to indicate, when requesting a read, a set of previous transactions that must be included in the set of operations whose effects lead to the read's return value.

Stale replicas. It is hard to write reconciliation code to deal with all the different cases of out-of-order
updates. A major class of system designs avoids this issue entirely, by having a single master or primary replica where all updates of the item are done first (it is common, but not essential, for the master replicas of all logical items to be at a single site). It is easy, in system designs like this, to ensure that the updates are propagated one after another from the master replica to the other replicas (called "slaves"). In this model, there is always a "correct" value for each logical data item; this is the value in the master replica. Any slave always has a value which the master held in the past; it reflects some prefix of the transactions whose updates are reflected in the master, and so one says that the slave replica is a (possibly) stale version of the data. The term "quasi-copy" has been applied to a replica with out-of-date information [2]. The weakest, most permissive consistency model for this system design is "relaxed currency" [3]. In this model, the clients' view is required to be equivalent to that in a serial, unreplicated, and multi-version database (note the change from the definition of 1-copy-serializability (q.v.), where the execution is equivalent to a serial, unreplicated, single-version database). There must be a serial order on all the transactions, so that whenever a client reads a logical item, it sees some version but not necessarily the current one (it may be an older version that is returned). Most system designs that offer relaxed currency provide a very restricted form: they also require that all the update transactions satisfy 1-copy-serializability (because all the operations of the update transactions are done at the master). In these designs, a read operation inside an update transaction must see the current version of the item; seeing a stale version can happen only in a read-only transaction. To illustrate the relaxed currency consistency model, here is a sequence of events that might happen in a system with two sites A and B. A is the master site for integer-valued items x and y, while B has a slave copy of y but no replica for x. T3,C executes a transfer of 2 units from y to x by reading and then updating both items at the master site. The read-only transaction T4,C determines the status of the two logical items, first reading x^A (since there is only one copy of x) and then reading the slave replica y^B. T^B denotes the copier transaction that propagates the change in y to the slave at site B. The event where a value 12 is written to the local
replica of item x at site A, as part of transaction T3, is shown as w_3[x^A, 12], and w_{3,C}[x, 12] represents the notification of this write to the client at C. Similarly, r_4[y^B, 20] is the event where T4 reads the value 20 in the replica of y at site B. As before, one also has events like b_{4,C} for the start of transaction T4 at client C, or c_{4,C} for the commit of that transaction by the client, or indeed c_4^A for the commit of the local subtransaction of T4 which is running at site A.

b_{3,C} b_3^A r_3[x^A, 10] r_{3,C}[x, 10] r_3[y^A, 20] r_{3,C}[y, 20] w_3[x^A, 12] w_{3,C}[x, 12] w_3[y^A, 18] w_{3,C}[y, 18] c_3^A c_{3,C} b_{4,C} b_4^A r_4[x^A, 12] r_{4,C}[x, 12] b_4^B r_4[y^B, 20] r_{4,C}[y, 20] c_4^A c_4^B c_{4,C} b^B w[y^B, 18] c^B
(3)
At the end of the execution, the slave site is up to date, with the latest value for the logical item y it keeps a replica of. However, during the execution, the slave replica is sometimes behind the master; in particular, this is true when T4 reads the replica. Thus in sequence (3), the client view is

b_{3,C} r_{3,C}[x, 10] r_{3,C}[y, 20] w_{3,C}[x, 12] w_{3,C}[y, 18] c_{3,C} b_{4,C} r_{4,C}[x, 12] r_{4,C}[y, 20] c_{4,C}
(4)
which could occur in a serial multiversion system with a single site, in the order T3,C then T4,C. Notice that in the serial multiversion system, when T4,C reads x it sees the latest version (produced by T3,C), but its read of y returns an older "stale" version taken from the initial state. There has been much work on enhancing weak consistency definitions with quantitative bounds on the divergence between replicas.
Key Applications
The commercial DBMS vendors all offer replication mechanisms with their products, which can be configured in ways that offer different consistency properties. The most common approaches give either eventual convergence (if multiple masters are allowed, and conflicts are reconciled correctly) or reads from stale replicas (if there is a single master). There is also a range of research prototypes which give the user various consistency models; in general these show a tradeoff, where improved performance comes from offering weaker consistency models.
Future Directions
Most recent research on replica control finds new algorithms that provide one of the known weak consistency models, either eventual consistency or else the combination of 1-copy serializable updates with reads from stale copies. The main focuses of research have been restricting how stale the reads can be, and making reconciliation easier to code correctly.
Cross-references
▶ Data Replication
Recommended Reading
1. Alonso R., Barbará D., and Garcia-Molina H. Data caching issues in an information retrieval system. ACM Trans. Database Syst., 15(3):359–384, 1990.
2. Alonso R., Barbará D., Garcia-Molina H., and Abad S. Quasi-copies: efficient data sharing for information retrieval systems. In Advances in Database Technology, Proc. 1st Int. Conf. on Extending Database Technology, 1988, pp. 443–468.
3. Bernstein P.A., Fekete A., Guo H., Ramakrishnan R., and Tamma P. Relaxed-currency serializability for middle-tier caching and replication. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2006, pp. 599–610.
4. Demers A.J., Greene D.H., Hauser C., Irish W., Larson J., Shenker S., Sturgis H.E., Swinehart D.C., and Terry D.B. Epidemic algorithms for replicated database maintenance. In Proc. ACM SIGACT-SIGOPS 6th Symp. on the Principles of Dist. Comp., 1987, pp. 1–12.
5. Gray J., Helland P., O'Neil P.E., and Shasha D. The dangers of replication and a solution. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996, pp. 173–182.
6. Ladin R., Liskov B., Shrira L., and Ghemawat S. Providing high availability using lazy replication. ACM Trans. Comput. Syst., 10(4):360–391, 1992.
7. Sheth A.P. and Rusinkiewicz M. Management of interdependent data: specifying dependency and consistency requirements. In Proc. 1st Workshop on the Management of Replicated Data, 1990, pp. 133–136.
8. Terry D.B., Demers A.J., Petersen K., Spreitzer M., Theimer M., and Welch B.B. Session guarantees for weakly consistent replicated data. In Proc. 3rd Int. Conf. on Parallel and Distributed Information Systems, 1994, pp. 140–149.
9. Wiederhold G. and Qian X. Consistency control of replicated data in federated databases. In Proc. 1st Workshop on the Management of Replicated Data, 1990, pp. 130–132.
Weak Coupling ▶ Loose Coupling
Weak Equivalence
CHRISTIAN S. JENSEN 1, RICHARD T. SNODGRASS 2
1 Aalborg University, Aalborg, Denmark
2 University of Arizona, Tucson, AZ, USA
Synonyms
Snapshot equivalence; Temporally weak
Definition
Informally, two tuples are snapshot equivalent, or weakly equivalent, if all pairs of timeslices with the same time instant parameter of the tuples are identical. Let temporal relation schema R have n time dimensions, D_i, i = 1,...,n, and let τ^i, i = 1,...,n, be corresponding timeslice operators, e.g., the valid timeslice and transaction timeslice operators. Then, formally, tuples x and y are weakly equivalent if

∀t_1 ∈ D_1 ... ∀t_n ∈ D_n (τ^n_{t_n}(...(τ^1_{t_1}(x))...) = τ^n_{t_n}(...(τ^1_{t_1}(y))...))

Similarly, two relations are snapshot equivalent, or weakly equivalent, if at every instant their snapshots are equal. Snapshot equivalence, or weak equivalence, is a binary relation that can be applied to tuples and to relations.
Key Points
The notion of weak equivalence captures the information content of a temporal relation in a point-based sense, where the actual timestamps used are not important as long as the same timeslices result. For example, consider the two relations with only a single attribute: {(a, [3, 9])} and {(a, [3, 5]), (a, [6, 9])}. These relations are different, but weakly equivalent. Both "snapshot equivalent" and "weakly equivalent" are being used in the temporal database community. "Weak equivalence" was originally introduced by Aho et al. in 1979 to relate two algebraic expressions [1,2]. This concept has subsequently been covered in several textbooks. One must rely on the context to disambiguate this usage from the usage specific to temporal databases. The synonym "temporally weak" does not seem intuitive – in what sense are tuples or relations weak?
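A small illustration (hypothetical helper code, not part of the entry): representing each tuple's valid time as a closed interval, the two relations above produce identical timeslices at every instant even though their representations differ.

```python
def timeslice(relation, t):
    """Snapshot of an interval-timestamped relation at time instant t."""
    return {value for (value, (start, end)) in relation if start <= t <= end}

r1 = {("a", (3, 9))}
r2 = {("a", (3, 5)), ("a", (6, 9))}

print(all(timeslice(r1, t) == timeslice(r2, t) for t in range(0, 12)))  # True: weakly equivalent
print(r1 == r2)                                                         # False: different tuples
```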
Cross-references
▶ Point-Stamped Temporal Models ▶ Snapshot Equivalence ▶ Temporal Database ▶ Time Instant ▶ Timeslice Operator ▶ Transaction Time ▶ Valid Time
Recommended Reading
1. Aho A.V., Sagiv Y., and Ullman J.D. Efficient optimization of a class of relational expressions. ACM Trans. Database Syst., 4(4):435–454, 1979.
2. Aho A.V., Sagiv Y., and Ullman J.D. Equivalences among relational expressions. SIAM J. Comput., 8(2):218–246, 1979.
3. Gadia S.K. Weak temporal relations. In Proc. 4th ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, 1985, pp. 70–77.
4. Jensen C.S. and Dyreson C.E. (eds.). A consensus glossary of temporal database concepts – February 1998 version. In Temporal Databases: Research and Practice, O. Etzion, S. Jajodia, S. Sripada (eds.). LNCS 1399, Springer-Verlag, Berlin, 1998, pp. 367–405.
Weak Memory Consistency ▶ Weak Consistency Models for Replicated Data
Web 2.0 Applications ▶ Social Applications
Web 2.0/3.0
ALEX WUN
University of Toronto, Toronto, ON, Canada
Definition
Web 2.0 and Web 3.0 are terms that refer to the trends and characteristics believed to be representative of next-generation web applications emerging after the dot-com collapse of 2001. The term Web 2.0 became widely used after the first O'Reilly Media Web 2.0 Conference (http://www.web2con.com/) in 2004.
Key Points
The term Web 2.0 does not imply a fundamental change in Internet technologies but rather captures what industry experts believe to be the key characteristics of successful web applications following the dot-com collapse. In general, Web 2.0 applications typically leverage one or more of the following concepts:
– Active participation, collaboration, and the wisdom of crowds: Consumers explicitly improve and add value to the web application they are using, either through volunteerism or in response to incentives and rewards. This typically involves explicitly creating and modifying content (Wikipedia entries, Amazon reviews, Flickr tags, YouTube videos, etc.).
– Passive participation and network effects: Consumers naturally improve and add value to the web application being used as a result of the application's inherent design and architecture. Only a small percentage of users can be relied on to participate actively, so user contribution can be made non-optional by design. Examples include the ranking of most-clicked search results, most-viewed videos, peer-to-peer file sharing, and viral marketing.
– The long tail: Catering to niche user interests in volume and breadth can be just as successful, if not more so, than narrowly catering to only the most popular interests. This is because content distribution over the Internet is not constrained by the physical limitations (such as location, storage, and consumer base) of real-world vendors and service providers. For example, an online music vendor incurs virtually no additional cost to "stock" songs that do not sell well individually but, when taken together, have a large population of fringe consumers.
– The Internet as a platform: Web 2.0 applications become extremely lightweight, flexible, and dynamic when the Internet is treated like a globally available platform with vast resources and an international consumer base. Such applications act more like software services than traditional software products and evolve quickly by staying perpetually in beta. For Web 2.0 applications, global connectivity also extends beyond traditional PCs to portable handheld devices and other mobile devices. Web applications that mimic traditional desktop applications (such as online operating systems, word processors,
and spreadsheets) are examples of Web 2.0 applications using the Internet as a platform.
Web 2.0 enabling technologies primarily build on top of existing Internet technologies in the form of APIs and web development tools – AJAX, MashUp development tools, and other tools for creating rich media content, for example. Web 3.0 is still purely hypothetical and intends to capture the trends and characteristics of web applications beyond Web 2.0. Hypotheses vary widely between industry experts, ranging from the semantic web to available bandwidth as being the key influences on Web 3.0 applications.
Cross-references
▶ AJAX ▶ MashUp ▶ Service
Web Advertising
VANJA JOSIFOVSKI 1, ANDREI BRODER 2
1 Uppsala University, Uppsala, Sweden
2 Yahoo! Research, Santa Clara, CA, USA
Synonyms
Online advertising; Contextual advertising; Search advertising

Definition
Web advertising aims to place relevant ads on web pages. As in traditional advertising, most of the advertising on the web can be divided into brand advertising and direct advertising. In the majority of cases, brand advertising is achieved by banners – images or multimedia ads placed on the web page. As opposed to traditional brand advertising, on the web the user can interact with the ad and follow the link to a website of the advertiser's choice. Direct advertising is mostly in the form of short textual ads on the side of the search engine result pages and other web pages. Finally, the web allows for other types of advertising that are hybrids and cross into other media, such as video advertising, virtual-worlds advertising, etc. Web advertising systems are built by implementing information retrieval, machine learning, and statistical techniques in a scalable, low-latency computing platform capable
of serving billions of requests a day and selecting from hundreds of millions of individual advertisements.
Historical Background
The Web emerged as a new publishing medium in the late 1990s. Since then, the growth of web advertising has paralleled the growth in the number of web users and the increased time people spend on the Web. Banner advertising started with the simple placement of banners at the top of the pages of targeted sites, and has since evolved into an elaborate placement scheme that targets a particular web user population and takes into account the content of the web pages and sites. Search advertising has its beginnings in specialized search engines for ad search. Combining web search with ad search was not something web users accepted from the start, but it has become mainstream today. Furthermore, today's search advertising platforms have moved from simply asking the advertiser to provide a list of queries for which the ad is to be shown to employing a variety of mechanisms to automatically learn which ad is appropriate for which query. Today's ad platforms are large, scalable, and reliable systems running over clusters of machines that employ state-of-the-art information retrieval and machine learning techniques to serve ads at rates of tens of thousands of times per second. Overall, the technical complexity of advertising platforms rivals that of web search engines.
Foundations Definition Web advertising aims to place relevant ads on web pages. As in traditional advertising, most of the advertising on the web can be divided into brand advertising and direct advertising. In the majority of cases, brand advertising is achieved by banners – images or multimedia ads placed on the web page. As opposed to traditional brand advertising, on the web the user can interact with the ad and follow the link to a website of advertisers choice. Direct advertising is mostly in the form of short textual ads on the side of the search engine result pages and other web pages. Finally, the web allows for other types of advertising that are hybrids and cross into other media, such as video advertising, virtual worlds advertising, etc. Web advertising systems are built by implementing information retrieval, machine learning, and statistical techniques in a scalable, low-latency computing platform capable
Web advertising spans Web technology, sociology, law, and economics. It has already surpassed some traditional mass media like broadcast radio, and it is the economic engine that drives web development. It has become a fundamental part of the web eco-system and touches the way content is created, shared, and disseminated – from static HTML pages to more dynamic content such as blogs and podcasts, to social media such as discussion boards and tags on shared photographs. This revolution promises to fundamentally change both the media and the advertising businesses over the next few years, altering a $300 billion economic landscape. As in classic advertising, in terms of goals, web advertising can be split into brand advertising, whose goal is to create a distinct favorable image for the advertiser's product, and direct-marketing advertising, which involves a "direct response": buy, subscribe, vote, donate, etc., now or soon.
In terms of delivery, there are two major types:
1. Search advertising refers to the ads displayed alongside the "organic" results on the pages of the Internet search engines. This type of advertising is mostly direct marketing and supports a variety of retailers from large to small, including micro-retailers that cover specialized niche markets.
2. Contextual advertising refers to ads displayed alongside some publisher-produced content, akin to traditional ads displayed in newspapers. It includes both brand advertising and direct marketing. Today, almost all non-transactional web sites rely on revenue from content advertising. This type of advertising supports sites that range from individual bloggers and small community pages to the web sites of major newspapers. There would have been a lot less to read on the web without this model!
Web advertising is a big business, estimated in 2005 at $12.5 billion spent in the US alone (Interactive Advertising Bureau – www.iab.com). But this is still less than 10% of the total US advertising market. Worldwide, internet advertising is estimated at $18 billion out of a $300 billion total. Thus, even at the estimated 13% annual growth, there is still plenty of room to grow, and hence an enormous commercial interest. From an ad-platform standpoint, both search and content advertising can be viewed as a matching problem: a stream of queries or pages is matched in real time to a supply of ads. A common way of measuring the performance of an ad platform is based on the clicks on the placed ads. To increase the number of clicks, the ads placed must be relevant to the user's query or page and to the user's general interests. There are several data engineering challenges in the design and implementation of such systems. The first challenge is the volume of data and transactions: modern search engines deal with tens of billions of pages from hundreds of millions of publishers, and billions of ads from tens of millions of advertisers. Second, the number of transactions is huge: billions of searches and billions of page views per day. Third, to deal with this, there is only a very short processing time available: when a user requests a page or types her query, the expectation is that the page, including the ads, will be shown in real time, allowing for at most a few tens of milliseconds to select the best ads. To achieve such performance, ad platforms usually have two components: a batch processing component
that does the data collection, processing, and analysis, and a serving component that serves the ads in real time. Although both of these are related to the problems solved by today's data management systems, in both cases existing systems have been found inadequate, and today's ad platforms require breaking new ground. The batch processing component of an ad system processes collections of multiple terabytes of data. Usually the data are not shared, and a typical processing run lasts from a few minutes to a few hours over a large cluster of hundreds, even thousands, of commodity machines. The concern here is only about recovering from failures during the distributed computation, and most of the time data are produced once and read one or more times. To be able to perform the same operation, a commercial database system would need to be deployed at a scale that is beyond the scope of both shared-nothing and shared-disk architectures. The reason for this is that database systems are designed for sharing data among multiple users in the presence of updates, and they have overly complex distribution protocols to scale and perform efficiently at this scale. The backbone of the batch processing components is therefore being built using simpler distributed computation models and distributed file systems running over commodity hardware. Several challenges lie ahead to make these systems more usable and easier to maintain. The first challenge is to define a processing framework and interfaces such that large-scale data analysis tasks (e.g., graph traversal and aggregation) and machine learning tasks (e.g., classification and clustering) are easy to express. So far there are two reported attempts to define such a query language for data processing in this environment [6,10]. However, there has been no reported progress on mapping these languages to a calculus and algebra model that would lend itself to optimization. To make the task easier, new machine learning and data analysis algorithms for large-scale data processing are needed. At the data storage layer, the challenge lies in designing data storage that resides on the same nodes where the processing is performed. The serving component of an advertising platform must have high throughput and low latency. To achieve this, in most cases the serving component operates over a read-only copy of the data, replaced occasionally by the batch component. The ads are
usually pre-processed and matched to an incoming query or page. The serving component has to implement reasoning that, based on a variety of features of the query/page and the ads, estimates the top few ads that have the maximum expected revenue within the constraints of the marketplace design and the business rules associated with that particular advertising opportunity. The first challenge here is developing features and corresponding extraction algorithms appropriate for real-time processing. As the response time is limited, today's architectures rely on serving from a large cluster of machines that hold most of the searched data in memory. One of the salient points of the design of ad search is an objective function that captures the necessary trade-off between efficient processing and quality of results, both in terms of relevance and revenue. This function requires probabilistic reasoning in real time. Several different techniques can be used to select ads. When the number of ads is small, linear programming can be used to optimize for multiple ads grouped in slates. Banner ads are sometimes placed using linear programming as the optimization technique [1,2]. In textual advertising the number of ads is too large to use linear programming. One alternative in such cases is to use unsupervised methods based on information retrieval ranking [4,11]. Scoring formulas can be adapted to take into account the specificity of ads as documents: short text snippets with multiple distinct sections. Information retrieval scoring can be augmented with supervised ranking mechanisms that use click logs or editorial judgments [9] to learn what ads work for a particular page/user/query. Here, the system needs to explore the space of possible matches. Some explore-exploit methods based on one-armed-bandit algorithms have been adapted to minimize the cost of exploration [3]. In summary, today's search and content advertising platforms are massive data processing systems that apply complicated data analysis and machine learning techniques to select the best advertising for a given query or page. The sheer scale of the data and the real-time requirements make this a very challenging task. Today's implementations have grown quickly, and often in an ad hoc manner, to deal with a fast-growing $15 billion market, but there is a need for improvement in almost every aspect of these systems as they adapt to even larger amounts of data, traffic, and new business models in web advertising.
Experimental Results
The reader is referred to [4,9,11] for the current published results on contextual advertising. A historical overview of search advertising is provided in [5]. Some results on query rewrites for search advertising are presented in [7,8].
Cross-references
▶ Information Retrieval Models
▶ Information Retrieval Operations
Recommended Reading
1. Abe N. Improvements to the linear programming based scheduling of Web advertisements. J. Electron. Commerce Res., 5(1):75–98, 2005.
2. Abrams Z., Mendelevitch O., and Tomlin J.A. Optimal delivery of sponsored search advertisements subject to budget constraints. In Proc. 8th ACM Conf. on Electronic Commerce, 2007, pp. 272–278.
3. Agarwal D., Chakrabarti D., Josifovski V., and Pandey S. Bandits for taxonomies: a model based approach. In Proc. SIAM International Conference on Data Mining, 2007.
4. Broder A., Fontoura M., Josifovski V., and Riedel L. A semantic approach to contextual advertising. In Proc. 33rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2007, pp. 559–566.
5. Fain D. and Pedersen J. Sponsored search: a brief history. In Proc. 2nd Workshop on Sponsored Search Auctions. Web publication, 2006.
6. Group C.S. Community systems research at Yahoo! ACM SIGMOD Rec., 36(3):47–54, 2005.
7. Jones R. and Fain D.C. Query word deletion prediction. In Proc. 26th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 435–436.
8. Jones R., Rey B., Madani O., and Greiner W. Generating query substitutions. In Proc. 15th Int. World Wide Web Conference, 2006, pp. 387–396.
9. Lacerda A., Cristo M., Andre M.G., Fan W., Ziviani N., and Ribeiro-Neto B. Learning to advertise. In Proc. 32nd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2006, pp. 549–556.
10. Pike R., Dorward S., Griesemer R., and Quinlan S. Interpreting the data: parallel analysis with Sawzall. Sci. Program. J., 13(4):277–298, 2005.
11. Ribeiro-Neto B., Cristo M., Golgher P.B., and de Moura E.S. Impedance coupling in content-targeted advertising. In Proc. 31st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005, pp. 496–503.
Web Application Server
▶ Application Server
Web Characteristics and Evolution
Dennis Fetterly
Microsoft Research, Mountain View, CA, USA
Definition
Web characteristics are properties related to collections of documents accessible via the World Wide Web. There are vast numbers of properties that can be characterized. Some examples include the number of words in a document, the length of a document in bytes, the language a document is authored in, the MIME type of a document, properties of the URL that identifies a document, the HTML tags used to author a document, and the hyperlink structure created by the collection of documents. As in the physical world, the process of change that the web continually undergoes is identified as web evolution. The web is a tremendously dynamic place, with new users, servers, and pages entering and leaving the system continuously, which causes the web to change very rapidly. Web evolution encompasses changes in all web characteristics, as defined above.
Historical Background
As early as 1994, researchers were interested in studying characteristics of the World Wide Web. As documented by Woodruff et al. [13], the Lycos project at the Center for Machine Translation at Carnegie Mellon University calculated and published statistics on the web (HTTP, FTP, and GOPHER) pages it collected. Those initial statistics included the 100 "most weighty words" by TF-IDF, the document size in bytes, and the number of words in each document. The first estimate of the size of the web also occurred in 1994, when Bray seeded a web crawl with 40,000 documents and estimated the size of the web to be the number of unique URLs that could be fetched. Using web documents collected in 1995, Woodruff et al. performed an extensive study [12] of 2.6 million HTML documents collected using the Inktomi web crawler. The motivations for this study included collecting information about the HTML markup in order to improve the language and to measure the prevalence of non-standard language extensions. The authors analyzed natural language properties, the correctness of the HTML markup, and the top-level domain distribution, in addition to nine per-page properties. The authors of
that study also observed that the web evolved "exceptionally quickly" based on their observations of two data sets collected just months apart. They note that the properties of the documents that exist in both collections vary greatly and that many of the documents that exist in the first collection do not exist in the second. In addition to characterizing documents discovered on the web, a number of usage-based studies were performed under the auspices of the World Wide Web Consortium (W3C). Jim Pitkow was the chair of the final working group, and he published a summary [11] of previous web characterization studies in 1999. These studies included characterizations of web requests, HTTP traffic, web document lifetime, and user behavior. In 2000, Broder et al. conducted a landmark study measuring the graph structure of the web [4] and demonstrating that the connectivity of pages on the web could be classified into a structure that resembled a "bowtie."
Foundations
As highlighted by Baeza-Yates et al. [2], one of the principal issues faced by any initiative to characterize the web is how to select the sample of documents to characterize. The importance of this procedure is highlighted in a recent study that compares statistics from four different large web crawls [1] and finds that there are quantitative differences in the statistical characterizations of these collections. Several different methods for sampling have been utilized. The initial studies referred to in the "Historical Background" section were performed when the web was much smaller than it currently is, and it was possible to crawl a reasonable fraction of the web in order to obtain a representative sample. However, even though those studies were able to crawl a reasonable fraction of the web, they were still only able to characterize content that they could fetch. A large portion of web content is inaccessible to the observer. This inaccessible content can be stored in a database (such content is commonly called the deep web), be part of a corporate intranet and thus not accessible from the public web, or require authorization to access. In order to estimate the size of the web, and the number of web servers providing web content, Lawrence and Giles chose Internet Protocol (IP) addresses at random and used the number of servers that successfully responded to estimate the total number
of web servers publicly accessible [1]. They also crawled all of the accessible pages on the first 2,500 web servers that responded to requests sent to their IP address. In order to address the issue of collecting a representative sample of connected, publicly accessible web pages, Henzinger et al. describe a method to generate a near-uniform sample of URLs using a biased random walk on the web [10]. Another approach to sampling that has been employed in at least ten cases is to seed the collection with all domains registered in a national domain; for example, Chile's national domain is .cl. Baeza-Yates et al. compare characterizations from ten national domains and two multinational web spaces (Africa and Indochina) [2]. They find that web characterizations that include a power law are consistent across all tested collections.

Another area of web characterization that has been the focus of significant study is the detection of duplicate and near-duplicate documents on the web. Detecting exact matches is much simpler than detecting near duplicates because each document can be reduced to a single fingerprint. Measures for computing text similarity, such as cosine similarity or the Jaccard similarity coefficient, require pairwise comparisons between documents, which is infeasible for the number of documents that exist on the web. To put this in perspective, in their 1997 paper "Syntactic Clustering of the Web" [3], Broder et al. state that a pairwise comparison would involve O(10^15) (a quadrillion) comparisons. There have been two general approaches for condensing the information retained for each document and minimizing the number of required comparisons. The Broder et al. approach approximates the Jaccard similarity coefficient by retaining the minimum value computed by each of several hash functions, and then using those retained values, called sketches or shingles, to summarize the document. The computational effort required to discover near-duplicate documents using this algorithm is further reduced via the use of super-shingles, or sketches of sketches. Near-duplicate documents are very likely to share at least one super-shingle, which reduces the complexity to checking each of a small number of super-shingles. Charikar developed the other approach, which uses random projections of the words in a document to approximate the cosine similarity of documents [5]. Henzinger later experimentally compared these two approaches [9].
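The following is a small, self-contained sketch of the shingling/min-hashing idea described above: each document is reduced to a fixed-size sketch, and the fraction of agreeing sketch positions estimates the Jaccard similarity of the two shingle sets. The shingle length, the number of simulated hash functions, and the use of salted MD5 are illustrative choices, not the parameters of the original algorithm.

```python
# Sketch of the shingling/min-hashing idea: estimate the Jaccard similarity
# of two documents from small fixed-size sketches.
import hashlib

def shingles(text, k=4):
    # Contiguous word k-grams of the document.
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_sketch(shingle_set, num_hashes=64):
    sketch = []
    for i in range(num_hashes):
        # Simulate independent hash functions by salting a single hash.
        sketch.append(min(
            int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set))
    return sketch

def estimated_jaccard(sketch_a, sketch_b):
    # The fraction of hash functions whose minima agree estimates |A ∩ B| / |A ∪ B|.
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)

d1 = "the quick brown fox jumps over the lazy dog near the river bank"
d2 = "the quick brown fox jumps over the lazy cat near the river bank"
print(estimated_jaccard(minhash_sketch(shingles(d1)), minhash_sketch(shingles(d2))))
```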
Several different approaches have been used to measure the evolution of the World Wide Web. One common approach has utilized traces from web proxies to study the evolution of web pages, which provides insight into improved policies for caching proxies. Douglis et al. investigated the rate of change of documents on the web [7] in order to validate two critical assumptions that apply to web caching: a significant number of requests are duplicates of past requests, and the content associated with those requests does not change between accesses. The evolution of the web has been extensively studied from a crawling perspective. In 2000, Cho and Garcia-Molina performed the first study of web page evolution [6]. They downloaded 720,000 pages daily from 270 "popular" hosts and computed a checksum over the entire page contents. These checksums were then compared to measure rates of change. Fetterly et al. [8] expanded upon this study in 2003 by performing a breadth-first crawl of 150 million pages each week for 11 weeks and measuring the degree of change that occurred within each page. At the intersection of web characteristics and evolution, Bar-Yossef et al. expand on the studies of link lifetime referenced by Pitkow [12] and investigate the decay of the web. They introduce the notion of a "soft 404," which is an error page returned with an HTTP result code of 200 ("OK") instead of a 404 code indicating that an error has occurred.
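A minimal sketch of the checksum-based change measurement described above is shown below; the URLs and page contents are stand-ins, and real studies of course operate on crawled snapshots rather than in-memory dictionaries.

```python
# Toy illustration of comparing content digests across two crawls to find
# pages that changed or disappeared; the crawl data here is made up.
import hashlib

def digest(content: bytes) -> str:
    return hashlib.sha1(content).hexdigest()

crawl_week1 = {"http://example.org/a": b"<html>old news</html>",
               "http://example.org/b": b"<html>stable page</html>"}
crawl_week2 = {"http://example.org/a": b"<html>fresh news</html>",
               "http://example.org/b": b"<html>stable page</html>"}

changed = [url for url in crawl_week1
           if url in crawl_week2
           and digest(crawl_week1[url]) != digest(crawl_week2[url])]
disappeared = [url for url in crawl_week1 if url not in crawl_week2]
print("changed:", changed, "disappeared:", disappeared)
```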
Key Applications
A significant cost incurred when operating a search service is the cost of downloading and indexing content. Being able to predict, as accurately as possible, when to re-crawl a particular page allows the search engine to utilize its resources wisely. A search engine also needs to be able to identify near-duplicate documents that exist within its index, in order to either remove them or crawl them at a reduced rate. Characterization of the web enables at least two important functions: interested parties can measure the evolution of the web along many axes, and, barring significant evolution of the measured characteristic, results of past characterizations can be used to validate whether or not a new sample is representative.
Data Sets
The Stanford WebBase Project provides access to its repository of crawled web content. The Laboratory
for Web Algorithmics (LAW) at the Dipartimento di Scienze dell'Informazione (DSI) of the Università degli studi di Milano provides several web graph datasets. The Internet Archive is another potential source of data, although at the time of this writing it does not grant new access to researchers.
http://dbpubs.stanford.edu:8091/testbed/doc2/WebBase/
http://law.dsi.unimi.it/index.php?option=com_include&Itemid=65
Cross-references
▶ Data Deduplication
▶ Document Links and Hyperlinks
▶ Incremental Crawling
▶ Term Statistics for Structured Text Retrieval
▶ Web Search Result De-duplication and Clustering

Recommended Reading
1. Serrano M.A., Maguitman A., Boguñá M., Fortunato S., and Vespignani A. Decoding the structure of the WWW: a comparative analysis of web crawls. ACM Trans. Web, 1(2):10, 2007.
2. Baeza-Yates R., Castillo C., and Efthimiadis E.N. Characterization of national web domains. ACM Trans. Int. Tech., 7(2):9, 2007.
3. Broder A.Z., Glassman S.C., Manasse M.S., and Zweig G. Syntactic clustering of the web. In Selected Papers from the Sixth International Conference on World Wide Web, 1997, pp. 1157–1166.
4. Broder A., Kumar R., Maghoul F., Raghavan P., Rajagopalan S., Stata R., Tomkins A., and Wiener J. Graph structure in the web: experiments and models. In Proc. 9th Int. World Wide Web Conference, 2000.
5. Charikar M.S. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual ACM Symp. on Theory of Computing, 2002, pp. 380–388.
6. Cho J. and Garcia-Molina H. The evolution of the web and implications for an incremental crawler. In Proc. 26th Int. Conf. on Very Large Data Bases, 2000, pp. 200–209.
7. Douglis F., Feldmann A., Krishnamurthy B., and Mogul J.C. Rate of change and other metrics: a live study of the world wide web. In Proc. 1st USENIX Symp. on Internet Technologies and Systems, 1997.
8. Fetterly D., Manasse M., Najork M., and Wiener J. A large-scale study of the evolution of web pages. In Proc. 12th Int. World Wide Web Conference, 2003, pp. 669–678.
9. Henzinger M. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proc. 32nd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2006, pp. 284–291.
10. Henzinger M.R., Heydon A., Mitzenmacher M., and Najork M. On near-uniform URL sampling. Comput. Netw., 33(1–6):294–308, 2000.
11. Lawrence S. and Giles L.C. Accessibility of information on the web. Nature, 400(6740):107–107, July 1999.
12. Pitkow J.E. Summary of WWW characterizations. Comput. Netw., 30(1–7):551–558, 1998.
13. Woodruff A., Aoki P.M., Brewer E.A., Gauthier P., and Rowe L.A. An investigation of documents from the world wide web. Comput. Netw., 28(7–11):963–980, 1996.

Web Content Extraction
▶ Fully Automatic Web Data Extraction

Web Content Mining
▶ Data, Text, and Web Mining in Healthcare
▶ Data Integration in Web Data Extraction System
Web Crawler
▶ Web Crawler Architecture
Web Crawler Architecture
Marc Najork
Microsoft Research, Mountain View, CA, USA
Synonyms
Web crawler; Robot; Spider
Definition
A web crawler is a program that, given one or more seed URLs, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks. Web crawlers are an important component of web search engines, where they are used to collect the corpus of web pages indexed by the search engine. Moreover, they are used in many other applications that process large numbers of web pages, such as web data mining, comparison shopping engines, and so on. Despite their conceptual simplicity, implementing high-performance web crawlers poses major engineering challenges due to the
scale of the web. In order to crawl a substantial fraction of the ‘‘surface web’’ in a reasonable amount of time, web crawlers must download thousands of pages per second, and are typically distributed over tens or hundreds of computers. Their two main data structures – the ‘‘frontier’’ set of yet-to-be-crawled URLs and the set of discovered URLs – typically do not fit into main memory, so efficient disk-based representations need to be used. Finally, the need to be ‘‘polite’’ to content providers and not to overload any particular web server, and a desire to prioritize the crawl towards high-quality pages and to maintain corpus freshness impose additional engineering challenges.
Historical Background
Web crawlers are almost as old as the web itself. In the spring of 1993, just months after the release of NCSA Mosaic, Matthew Gray [6] wrote the first web crawler, the World Wide Web Wanderer, which was used from 1993 to 1996 to compile statistics about the growth of the web. A year later, David Eichmann [5] wrote the first research paper containing a short description of a web crawler, the RBSE spider. Burner provided the first detailed description of the architecture of a web crawler, namely the original Internet Archive crawler [3]. Brin and Page's seminal paper on the (early) architecture of the Google search engine contained a brief description of the Google crawler, which used a distributed system of page-fetching processes and a central database for coordinating the crawl. Heydon and Najork described Mercator [8,9], a distributed and extensible web crawler that was to become the blueprint for a number of other crawlers. Other distributed crawling systems described in the literature include PolyBot [11], UbiCrawler [1], C-proc [4] and Dominos [7].
Foundations
Conceptually, the algorithm executed by a web crawler is extremely simple: select a URL from a set of candidates, download the associated web page, extract the URLs (hyperlinks) contained therein, and add those URLs that have not been encountered before to the candidate set. Indeed, it is quite possible to implement a simple functioning web crawler in a few lines of a high-level scripting language such as Perl. However, building a web-scale web crawler imposes major engineering challenges, all of which are
ultimately related to scale. In order to maintain a search engine corpus of, say, ten billion web pages in a reasonable state of freshness, say with pages being refreshed every 4 weeks on average, the crawler must download over 4,000 pages/second. In order to achieve this, the crawler must be distributed over multiple computers, and each crawling machine must pursue multiple downloads in parallel. But if a distributed and highly parallel web crawler were to issue many concurrent requests to a single web server, it would in all likelihood overload and crash that web server. Therefore, web crawlers need to implement politeness policies that rate-limit the amount of traffic directed to any particular web server (possibly informed by that server's observed responsiveness). There are many possible politeness policies; one that is particularly easy to implement is to disallow concurrent requests to the same web server; a slightly more sophisticated policy would be to wait for a time proportional to the last download time before contacting a given web server again.

In some web crawler designs (e.g., the original Google crawler [2] and PolyBot [11]), the page downloading processes are distributed, while the major data structures – the set of discovered URLs and the set of URLs that have to be downloaded – are maintained by a single machine. This design is conceptually simple, but it does not scale indefinitely; eventually the central data structures become a bottleneck. The alternative is to partition the major data structures over the crawling machines. Ideally, this should be done in such a way as to minimize communication between the crawlers. One way to achieve this is to assign URLs to crawling machines based on their host name. Partitioning URLs by host name means that the crawl scheduling decisions entailed by the politeness policies can be made locally, without any communication with peer nodes. Moreover, since most hyperlinks refer to pages on the same web server, the majority of links extracted from downloaded web pages are tested against and added to local data structures, not communicated to peer crawlers. Mercator and C-proc adopted this design [9,4].

Once a hyperlink has been extracted from a web page, the crawler needs to test whether this URL has been encountered before, in order to avoid adding multiple instances of the same URL to its set of pending URLs. This requires a data structure that supports set membership tests, such as a hash table.
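The basic crawl loop together with a simple per-host politeness policy can be sketched as follows. The `download` and `store` functions are stubs standing in for real HTTP fetching, HTML parsing, and corpus storage, and a production crawler would replace the in-memory frontier and seen-set with disk-based, partitioned data structures as discussed below.

```python
# Hedged sketch of a crawl loop with a fixed per-host politeness delay.
import time
from collections import deque
from urllib.parse import urlparse

POLITENESS_DELAY = 2.0            # seconds between requests to the same host
last_access = {}                  # host -> timestamp of the last request

def download(url):
    # Stub standing in for an HTTP fetch plus hyperlink extraction.
    return f"<html>{url}</html>", []

def store(url, page):
    # Stub standing in for persisting the page to the corpus.
    pass

def polite_fetch(url):
    host = urlparse(url).netloc
    wait = POLITENESS_DELAY - (time.time() - last_access.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)          # simple politeness: fixed delay per host
    last_access[host] = time.time()
    return download(url)

def crawl(seeds, limit=1000):
    frontier, seen = deque(seeds), set(seeds)
    while frontier and len(seen) <= limit:
        url = frontier.popleft()
        page, links = polite_fetch(url)
        store(url, page)
        for link in links:        # add newly discovered URLs to the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)

crawl(["http://example.org/"])
```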
Care should be taken that the hash function used is collision-resistant, and that the hash values are large enough (maintaining a set of n URLs requires hash values with log₂ n², i.e., 2 log₂ n, bits each). If RAM is not an issue, the table can be maintained in memory (and occasionally persisted to disk for fault tolerance); otherwise a disk-based implementation must be used. Implementing fast disk-based set membership tests is extremely hard, due to the physical limitations of hard drives (a single seek operation takes on the order of 10 ms). For a disk-based design that leverages locality properties in the stream of discovered URLs as well as the domain-specific properties of web crawling, see [9]. If the URL space is partitioned according to host names among the web crawlers, the set data structure is partitioned in the same way, with each web crawling machine maintaining only the portion of the set containing its hosts. Consequently, an extracted URL that is not maintained by the crawler that extracted it must be sent to the peer crawler responsible for it.

Once it has been determined that a URL has not been previously discovered, it is added to the frontier set containing the URLs that have yet to be downloaded. The frontier set is generally too large to be maintained in main memory (given that the average URL is about 100 characters long and the crawling system might maintain a frontier of ten billion URLs). The frontier could be implemented by a simple disk-based FIFO queue, but such a design would make it hard to enforce the politeness policies, and also to prioritize certain URLs (say, URLs referring to fast-changing news web sites) over other URLs. URL prioritization could be achieved by using a priority queue implemented as a heap data structure, but a disk-based heap would be far too expensive, since adding and removing a URL would require multiple seek operations. The Mercator design uses a frontier data structure that has two stages: a front-end that supports prioritization of individual URLs and a back-end that enforces politeness policies; both the front-end and the back-end are composed of a number of parallel FIFO queues [9]. If the URL space is partitioned according to host names among the web crawlers, the frontier data structure is partitioned along the same lines.
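The following sketch illustrates two ideas from the preceding paragraphs: storing fixed-size URL fingerprints instead of full URLs for the URL-seen test, and assigning URLs to crawling machines by hashing the host name. The fingerprint width, the number of crawlers, and the in-memory set are illustrative simplifications, not the Mercator design itself.

```python
# Illustrative sketch of URL fingerprinting and host-based partitioning.
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4

def url_fingerprint(url: str, bits: int = 80) -> int:
    # A set of n URLs needs roughly log2(n^2) bits to keep collisions unlikely;
    # 80 bits is comfortable even for tens of billions of URLs.
    return int(hashlib.sha1(url.encode()).hexdigest(), 16) >> (160 - bits)

def responsible_crawler(url: str) -> int:
    # All URLs of a host map to the same machine, so politeness stays local.
    host = urlparse(url).netloc
    return int(hashlib.md5(host.encode()).hexdigest(), 16) % NUM_CRAWLERS

seen = set()                      # in a real system: a disk-backed structure
def is_new(url: str) -> bool:
    fp = url_fingerprint(url)
    if fp in seen:
        return False
    seen.add(fp)
    return True

print(responsible_crawler("http://example.org/page"), is_new("http://example.org/page"))
```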
In the simplest case, the frontier data structure is just a collection of URLs. However, in many settings it is desirable to attach some attributes to each URL, such as the time when it was discovered, or (in the scenario of continuous crawling) the time of last download and a checksum or sketch of the document. Such historical information makes it easy to determine whether the document has changed in a meaningful way, and to adjust its crawl priority. In general, URLs should be crawled in such a way as to maximize the utility of the crawled corpus. Factors that influence the utility are the aggregate quality of the pages, the demand for certain pages and topics, and the freshness of the individual pages. All these factors should be considered when deciding on the crawl priority of a page: a high-quality, highly demanded and fast-changing page (such as the front page of an online newspaper) should be recrawled frequently, while high-quality but slow-changing pages and fast-changing but low-quality pages should receive a lower priority. The priority of newly discovered pages cannot be based on historical information about the page itself, but it is possible to make educated guesses based on per-site statistics. Page quality is hard to quantify; popular proxies include link-based measures such as PageRank and behavioral measures such as page or site visits (obtained from web beacons or toolbar data).

In addition to these major data structures, most web-scale web crawlers also maintain some auxiliary data structures, such as caches for DNS lookup results. Again, these data structures may be partitioned across the crawling machines.
Key Applications
Web crawlers are a key component of web search engines, where they are used to collect the pages that are to be indexed. Crawlers have many applications beyond general search, for example in web data mining (e.g., Attributor, a service that mines the web for copyright violations, or ShopWiki, a price comparison service).
Future Directions
Commercial search engines are global companies serving a global audience, and as such they maintain data centers around the world. In order to collect the corpora for these geographically distributed data centers, one could crawl the entire web from one data center and then replicate the crawled pages (or the derived data structures) to the other data centers; one could perform independent crawls at each data center and thus serve different indices to different
geographies; or one could perform a single geographically distributed crawl, where crawlers in a given data center crawl web servers that are (topologically) close by, and then propagate the crawled pages to their peer data centers. The third solution is the most elegant one, but it has not been explored in the research literature, and it is not clear if existing designs for distributed crawlers would scale to a geographically distributed setting.
URL to Code
Heritrix is a distributed, extensible, web-scale crawler written in Java and distributed as open source by the Internet Archive. It can be found at http://crawler.archive.org/
Cross-references
▶ Focused Web Crawling
▶ Incremental Crawling
▶ Indexing the Web
▶ Web Harvesting
▶ Web Page Quality Metrics

Recommended Reading
1. Boldi P., Codenotti B., Santini M., and Vigna S. UbiCrawler: a scalable fully distributed web crawler. Software Pract. Exper., 34(8):711–726, 2004.
2. Brin S. and Page L. The anatomy of a large-scale hypertextual search engine. In Proc. 7th Int. World Wide Web Conference, 1998, pp. 107–117.
3. Burner M. Crawling towards eternity: building an archive of the World Wide Web. Web Tech. Mag., 2(5):37–40, 1997.
4. Cho J. and Garcia-Molina H. Parallel crawlers. In Proc. 11th Int. World Wide Web Conference, 2002, pp. 124–135.
5. Eichmann D. The RBSE Spider – balancing effective search against web load. In Proc. 3rd Int. World Wide Web Conference, 1994.
6. Gray M. Internet growth and statistics: credits and background. http://www.mit.edu/people/mkgray/net/background.html
7. Hafri Y. and Djeraba C. High performance crawling system. In Proc. 6th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2004, pp. 299–306.
8. Heydon A. and Najork M. Mercator: a scalable, extensible web crawler. World Wide Web, 2(4):219–229, December 1999.
9. Najork M. and Heydon A. High-performance web crawling. Compaq SRC Research Report 173, September 2001.
10. Raghavan S. and Garcia-Molina H. Crawling the hidden web. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 129–138.
11. Shkapenyuk V. and Suel T. Design and implementation of a high-performance distributed web crawler. In Proc. 18th Int. Conf. on Data Engineering, 2002, pp. 357–368.

Web Data Extraction
▶ Web Harvesting

Web Data Extraction System
Robert Baumgartner¹,², Wolfgang Gatterbauer³, Georg Gottlob⁴
¹Vienna University of Technology, Vienna, Austria
²Lixto Software GmbH, Vienna, Austria
³University of Washington, Seattle, WA, USA
⁴Oxford University, Oxford, UK

Synonyms
Web information extraction system; Wrapper generator; Web macros; Web scraper
Definition
A web data extraction system is a software system that automatically and repeatedly extracts data from web pages with changing content and delivers the extracted data to a database or some other application. The task of web data extraction performed by such a system is usually divided into five different functions: (i) web interaction, which comprises mainly the navigation to usually pre-determined target web pages containing the desired information; (ii) support for wrapper generation and execution, where a wrapper is a program that identifies the desired data on target pages, extracts the data and transforms it into a structured format; (iii) scheduling, which allows repeated application of previously generated wrappers to their respective target pages; (iv) data transformation, which includes filtering, transforming, refining, and integrating data extracted from one or more sources and structuring the result according to a desired output format (usually XML or relational tables); and (v) delivering the resulting structured data to external applications such as database management systems, data warehouses, business software systems, content management systems, decision support systems, RSS publishers, email servers, or SMS servers. Alternatively, the output can be used to generate new web services out of existing and continually changing web sources.
Historical Background
The precursors of web data extraction systems were screen scrapers, which are systems for extracting
screen-formatted data from mainframe applications for terminals such as VT100 or IBM 3270. Other related techniques are ETL (Extract, Transform, Load) methods, which extract information from various business processes and feed it into a database or a data warehouse. A huge gap was recognized to exist between web information and the qualified, structured data usually required in corporate information systems or as envisioned by the Semantic Web. In many application areas there has been a need to automatically extract relevant data from HTML sources and translate this data into a structured format, e.g., XML, or into a suitable relational database format. A first and obvious solution was the evolution from screen scrapers to web scrapers, which can navigate to web pages and extract textual content. However, web scrapers usually lack the logic necessary to define highly structured output data as opposed to merely text chunks or textual snippets. Moreover, they are usually unable to collect and integrate data from different related sources, to define extraction patterns that remain stable in case of minor layout changes, to transform the extracted data into desired output formats, and to deliver the refined data into different kinds of applications. For this reason, specific research on web data extraction was needed, and several academic projects and some commercial research projects on web data extraction were initiated before or around the year 2000. While academic projects such as XWRAP [11], Lixto [2], and Wargo [14], and the commercial systems RoboMaker of Kapow Technologies (http://www.kapowtech.com/) and WebQL of QL2 Software (http://www.ql2.com/), focused on methods of strongly supervised "semi-automatic" wrapper generation, providing a wrapper designer with visual and interactive support for declaring extraction and formatting patterns, other projects were based on machine learning techniques. For example, WIEN [9], Stalker [13], and DEByE [10] focused on automatic wrapper induction from annotated examples. Some other projects focused on specific issues such as automatic generation of structured data sets from web pages [3] and particular aspects of navigating and interacting with web pages [1]. Finally, some systems require the wrapper designer to program the wrapper in a high-level language (e.g., SQL-like languages for selecting data from a web page) and merely give visual support to the programmer; an example is the W4F system [15]. A number of early
academic prototypes gave rise to fully fledged systems that are now commercially available. For example, Stalker gave rise to the Fetch Agent Platform of Fetch Technologies (http://www.fetch.com/), the Lixto system gave rise to the Lixto Suite of Lixto Software GmbH (http://www.lixto.com/), and the Wargo system evolved into the Denodo Platform [14]. A good and almost complete survey of web information extraction systems up to 2002 is given in [8].
Foundations
Formal foundations and semantics of data extraction
There are four main formal approaches to defining (the semantics of) web wrappers.

In the functional approach, a web wrapper is seen as a mapping f from the parse or DOM tree T of a web page to a set f(T) of sub-trees of T, where each subtree corresponds to an extracted data object. Another step specifies the relabeling of extracted data to fit a previously defined output schema.

The logical approach (see Logical Foundations of Web Data Extraction) consists of the specification of a finite number of monadic predicates, i.e., predicates that define sets of parse-tree nodes and that evaluate to true or false on each node of a tree-structured document. The predicates can be defined either by logical formulas or by logic programs such as monadic datalog, which was shown to be equivalent in expressive power to monadic second-order logic (MSO) [6]. Note that languages such as XPath or regular path expressions are subsumed by MSO.

The automata-theoretic approach relies on tree automata, which are a generalization of finite state automata from words to trees. The equivalence of the two logical approaches, MSO and monadic datalog, and the unranked query automata approach shows that the class of wrappers definable in these formalisms is quite natural and robust. Fortunately, this class is also quite expressive, as it accommodates most of the relevant data extraction tasks and properly contains the wrappers definable in some specifically designed wrapper programming languages [7].

Finally, the textual approach interprets web pages as text strings and uses string pattern matching for data extraction [13].
Methods of wrapper generation
With regard to user involvement, three principal approaches for wrapper generation can be distinguished: (i) manual wrapper programming, in which the system merely supports a user in writing a specific wrapper but cannot make any generalizations from the examples provided by the user; (ii) wrapper induction, where the user provides examples and counterexamples of instances of extraction patterns, and the system induces a suitable wrapper using machine learning techniques; and (iii) semi-automatic interactive wrapper generation, where the wrapper designer not only provides example data for the system, but rather accompanies the wrapper generation in a systematic computer-supported interactive process involving generalization, correction, testing, and visual programming techniques.

Architecture
Figure 1 depicts a high-level view of a typical fully-fledged semi-automatic interactive web data extraction system. This system comprises several tightly connected components and interfaces with three external entities: (i) the Web, which contains pages with information of interest; (ii) a target application, to which the extracted and refined data will ultimately be delivered; and (iii) the user, who interactively designs the wrapper.

The wrapper generator supports the user during the wrapper design phase. It commonly has a visual interface that allows the user to define which data should be extracted from web pages and how this data should be mapped into a structured format such as XML. The
visual interface itself can include several windows, such as: (i) a browser window that renders example web pages for the user; (ii) a parse tree window that shows the HTML parse tree of the current web page rendered by the browser; (iii) a control window that allows the user to view and control the overall progress of the wrapper design process and to input textual data (e.g., attribute names or XML tags); and (iv) a program window that displays the wrapper program constructed so far and allows the user to further adjust or correct the program. The subunit that actually generates the wrapper is contained in what is referred to here as the program generator. This submodule interprets the user actions on the example web pages and successively generates the wrapper.

Semi-automatic wrapper generators allow the user either to specify the URL of example web pages, which are structurally similar to the target web pages, or to navigate to such example pages. In the latter case, the navigation is recorded and can be automatically reproduced. The example pages are rendered in a browser window. Most wrapper generators also display the parse tree of the current page. In most systems, the wrapper designer can click on a node of the HTML parse tree, and the system immediately highlights the corresponding part of the example page in the browser, and vice versa. The designer can thus iteratively narrow down an example of a relevant data item after a few trials. Many web data extraction systems display an XPath-like path expression that precisely identifies the selected data item. The wrapper designer can then generalize this path expression by replacing some of the elements with wildcards.
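As a hypothetical illustration of this generalization step, the following example uses lxml (an assumed choice; any XPath-capable parser would do) on an invented HTML fragment: the node-specific path returned for the clicked data item is widened by dropping the row index, so that it matches the corresponding field of every similar record.

```python
# Hypothetical illustration of path-expression generalization with wildcards;
# the HTML snippet and both path expressions are made up for illustration.
from lxml import html

page = html.fromstring("""
<html><body><table id="offers">
  <tr><td class="title">Camera</td><td class="price">199</td></tr>
  <tr><td class="title">Lens</td><td class="price">349</td></tr>
</table></body></html>""")

# Path identifying the single data item the designer clicked on:
specific = "/html/body/table[@id='offers']/tr[1]/td[@class='price']"
# Generalized pattern: the row index is dropped, so the expression now
# matches the price cell of every similar record on the page.
generalized = "/html/body/table[@id='offers']/tr/td[@class='price']"

print([td.text for td in page.xpath(specific)])     # ['199']
print([td.text for td in page.xpath(generalized)])  # ['199', '349']
```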
Web Data Extraction System. Figure 1. Architecture of a typical web data extraction system.
The result is a generalized pattern that matches several similar data items on similarly structured pages. Some systems (e.g., Denodo and Lixto) allow a wrapper designer to select data items directly on the rendered web page. The wrapper generator often also provides support for associating a name or tag with each extraction pattern and for organizing patterns hierarchically. Systems based on wrapper induction allow the user to supply a large number of sample web pages with positive and/or negative examples of desired data items, from which the program generator tries to learn a generalized pattern. Systems for manual wrapper programming simply provide a visual environment for developing a wrapper in the underlying wrapper programming language, for testing the wrapper against example pages, and for debugging. In addition, the visual interface usually allows an application designer to specify when a wrapper should be executed.

The wrapper executor is the engine that facilitates the deployment of the previously generated wrappers. Typically, a wrapper program comprises functions such as deep web navigation (e.g., the means to fill out forms), identification of relevant elements with declarative rules or procedural scripts, and generation of a structured output format (e.g., XML or relational tables). The wrapper executor can also receive additional input parameters such as a start URL, a predefined output structure (e.g., a particular XML schema), and special values for forms. In some approaches, the output includes metadata (e.g., verification alerts in case of violated integrity constraints) in addition to the actual data of interest.

The wrapper repository stores the generated wrappers together with metadata. In some approaches, wrapper templates and libraries of domain concepts can be reused and extended. Systems such as Dapper (http://www.dapper.net/) offer a community-based extensible wrapper repository. In principle, systems can offer automatic wrapper maintenance and adaptation as part of the repository.

The data transformation and integration unit provides a means to streamline the results of multiple wrappers into one harmonized, homogeneous result. This step includes data integration, transformation, and cleaning, essentially the functions commonly found in the Transform step of the ETL processes of data warehouses and mediation systems. Technologies and tools comprise query languages, visual data mapping, automatic schema matching, and de-duplication techniques. Additionally, the
data can be composed into desired result formats such as HTML, Excel, or PDF. The data delivery unit (capturing the Load step in the ETL philosophy) decides on the appropriate output channel, such as email, an Open Database Connectivity (ODBC) connection, or FTP. Advanced wrapper generation systems offer a number of system connectors to state-of-the-art databases and enterprise software such as Customer Relationship Management (CRM) or Enterprise Resource Planning (ERP) systems, or partner with software providers in the area of Enterprise Application Integration (EAI) or Business Intelligence (BI) who offer such solutions.

The central control and scheduling unit is the heart of the processing engine. In general, synchronous and asynchronous extraction processes can be distinguished. A typical example of asynchronous triggering is a real-time user request in a flight meta-search application: the request is queued, the respective wrappers are triggered, parameter mappings are performed, and the result is integrated and presented to the user in real time. Sophisticated approaches offer a workflow-based execution paradigm, including parallelism, returning partial results, and interception points in complex booking transactions. Market monitoring or web process integration, on the other hand, are typical synchronous scenarios. An intelligent scheduler is responsible for distributing requests to the wrapper executor, iterating over a number of possible form inputs, and dealing with exception handling.

Commercial wrapper generation systems
Dapper. The Dapp Factory of Dapper offers fully server-based wrapper generation and supports a community-driven reusable wrapper repository. Dapper strongly relies on machine learning techniques for wrapper generation and concentrates exclusively on web pages that can be reached without deep web navigation. Users designing a wrapper label positive and negative examples and debug the result on a number of similarly structured web pages. Wrappers are hosted on the server, and wrapper results can be queried via REST. Commercial APIs are offered for using Dapper in enterprise scenarios.

Denodo (http://www.denodo.com). The Denodo ITPilot, formerly known as Wargo, is a platform for creating and executing navigation and extraction scripts that are loosely tied together. It offers graphical wizards for configuring wrappers and allows DOM events to be processed while navigating web pages. Deep Web navigation can be executed in
Internet Explorer, and the result pages are passed on to the extraction program. Furthermore, ITPilot offers some wrapper maintenance functionalities. Denodo additionally offers a tool called Aracne for document crawling and indexing.

Lixto. The Lixto Suite comprises the Lixto Visual Developer (VD), a fully visual and interactive wrapper generation framework, and the Java-based Lixto Transformation Server providing a scalable runtime and data transformation environment, including various Enterprise Application connectors. VD is ideally suited for dynamic Web 2.0 applications as it supports a visual Deep Web macro recording tool that makes it possible to simulate user clicks on DOM elements. Tightly connected with the navigation steps and hidden from the wrapper designer, the expressive language Elog is used for data extraction and linking to further navigation steps. VD is based on Eclipse and embeds the Mozilla browser.

Kapowtech. The Kapow RoboSuite (recently rebranded to Kapow Mashup Server) is a Java-based visual development environment for developing web wrappers. It embeds a proprietary Java browser, implements a GUI on top of a procedural scripting language, and maps data to relational tables. The RoboServer offers APIs in different programming languages for synchronous and asynchronous communication. In addition, variants of the Mashup Server for specific settings such as content migration are offered.

WebQL. QL2 uses a SQL-like web query language called WebQL for writing wrappers. By default, WebQL uses its own HTML DOM tree model instead of relying on a standard browser, but a loose integration with Internet Explorer is offered. The QL2 Integration Server supports concepts such as server clustering and HTML diffing. Furthermore, an IP address anonymization environment is offered.
Key Applications
Web market monitoring. Nowadays, a lot of basic information about competitors can be retrieved legally from public information sources on the Web, such as annual reports, press releases, or public databases. On the one hand, powerful and efficient tools for Extracting, Transforming, and Loading (ETL) data from internal sources and applications into a Business Intelligence (BI) data warehouse are already available and widely employed. On the other hand, there is a growing economic need to efficiently integrate external data, such as market and competitor information, into these systems as well.
With the World Wide Web as the largest single database on earth, advanced data extraction and information integration techniques as described here are required to process this web data automatically. At the same time, the extracted data has to be cleaned, transformed into semantically useful formats, and delivered in a "web-ETL" process into a BI system. Key factors in this application area include scalable environments to extract and schedule processing of very large data sets efficiently, capabilities to pick representative data samples, cleaning extracted data to make it comparable, and connectivity to data warehouses.

Web process integration. In markets such as the automotive industry, business processes are largely carried out by means of web portals. Business-critical data from divisions such as quality management, marketing and sales, engineering, procurement, and supply chain management have to be gathered manually from web portals. Through automation, suppliers can dramatically reduce the cost of these processes while at the same time improving their speed and reliability. Additionally, automation makes it possible to leverage web applications as web services, and hence wrapper generation systems can be considered an enabling technology for Service Oriented Architectures (SOA) as envisioned and realized in Enterprise Application Integration (EAI) and B2B processes. Key factors in this application area are workflow capabilities for the whole process of data extraction, transformation, and delivery, capabilities to treat all kinds of special cases occurring in web interactions, and excellent support of the latest web standards used during secure transactions.

Mashups. Increasingly, leading software vendors have started to provide mashup platforms such as Yahoo! Pipes or IBM QEDWiki. A mashup is a web application that combines a number of different websites into an integrated view. Usually, the content is taken via APIs by embedding RSS or Atom feeds, similar to REST (Representational State Transfer). In this context, wrapper technology transforms legacy web applications into lightweight APIs that can be integrated in mashups in the same way. As a result, web mashup solutions no longer need to rely on APIs offered by the providers of websites, but can extend their scope to the whole Web. Key factors for this application scenario include efficient real-time extraction capabilities for a large number of concurrent queries and a detailed understanding of how to map queries to particular web forms.
Future Directions
►Generic web wrapping. Without explicit semantic annotations on the current Web, data extraction systems that allow general web wrapping will have to move towards fully automatic wrapping of the existing World Wide Web. Important research challenges are (i) how to optimally bring semantic knowledge into the extraction process, (ii) how to make the extraction process robust (in particular, how to deal with false positives), and (iii) how to deal with the immense scaling issues. Today's web harvesting and automated information extraction systems show much progress (see, e.g., [4,12]), but still lack the combined recall and precision necessary to allow for very robust queries. A related topic is that of auto-adapting wrappers. Most wrappers today depend on the tree structure of a given web page and fail when the layout and code of web pages change. Auto-adapting wrappers, which are robust against such changes, could use existing knowledge of the relations on previous versions of a web page in order to automatically "heal" and adapt the extraction rules to the new format. The question here is how to formally capture change actions and execute the appropriate repair actions.

►Wrapping from visual layouts. Whereas web wrappers today predominantly focus on either the flat HTML code or the DOM tree representation of web pages, recent approaches aim at extracting data from the CSS box model and, hence, the visual representation of web pages [5]. This method can be particularly useful for layout-oriented data structures such as web tables and allows the creation of automatic and domain-independent wrappers that are robust against changes of the HTML code implementation.

►Data extraction from non-HTML file formats. There is substantial interest from industry in wrapping documents in formats such as PDF and PostScript. Wrapping of such documents must be mainly guided by visual reasoning over white space and Gestalt theory, which is substantially different from web wrapping and will, hence, require new techniques and wrapping algorithms, including concepts borrowed from the document understanding community.

►Learning to deal with web interactions. As web pages are becoming increasingly dynamic and interactive, efficient wrapping languages have to make it possible to record, execute, and generalize macros of web interactions and, hence, model the whole process of workflow integration. An example of such a web interaction is a complicated booking transaction.

►Web
form understanding and mapping. In order to automatically or interactively query deep web forms, wrappers have to learn the process of filling out complex web search forms and the usage of query interfaces. Such systems have to learn an abstract representation of each search form and map it to a unified meta form and vice versa, taking into account different form element types, contents, and labels.
Cross-references
▶ Business Intelligence
▶ Data Integration in Web Data Extraction System
▶ Data Mining
▶ Data Warehouse
▶ Datalog
▶ Deep-web Search
▶ Enterprise Application Integration
▶ Extraction, Transformation, and Loading
▶ Information Extraction
▶ Information Integration
▶ Logical Foundations of Web Data Extraction
▶ MashUp
▶ Screen Scraper
▶ Semantic Web
▶ Service Oriented Architecture
▶ Snippet
▶ Web Harvesting
▶ Web Information Extraction
▶ Web Services
▶ Wrapper Induction
▶ Wrapper Maintenance
▶ Wrapper Stability
▶ XML
▶ Xpath/Xquery
Recommended Reading
1. Anupam V., Freire J., Kumar B., and Lieuwen D. Automating web navigation with the WebVCR. Comput. Netw., 33(1–6):503–517, 2000.
2. Baumgartner R., Flesca S., and Gottlob G. Visual web information extraction with Lixto. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 119–128.
3. Crescenzi V., Mecca G., and Merialdo P. RoadRunner: towards automatic data extraction from large web sites. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 109–118.
4. Etzioni O., Cafarella M.J., Downey D., Kok S., Popescu A., Shaked T., Soderland S., Weld D.S., and Yates A. Web-scale information extraction in KnowItAll: (preliminary results). In Proc. 12th Int. World Wide Web Conference, 2004, pp. 100–110.
5. Gatterbauer W., Bohunsky P., Herzog M., Krüpl B., and Pollak B. Towards domain-independent information extraction from web tables. In Proc. 16th Int. World Wide Web Conference, 2007, pp. 71–80.
6. Gottlob G. and Koch C. Monadic datalog and the expressive power of languages for web information extraction. J. ACM, 51(1):74–113, 2002.
7. Gottlob G. and Koch C. A formal comparison of visual web wrapper generators. In Proc. 32nd Conf. on Current Trends in Theory and Practice of Computer Science, 2006, pp. 30–48.
8. Kuhlins S. and Tredwell R. Toolkits for generating wrappers: a survey of software toolkits for automated data extraction from Websites. NODe 2002, LNCS 2591, 2003.
9. Kushmerick N., Weld D.S., and Doorenbos R.B. Wrapper induction for information extraction. In Proc. 15th Int. Joint Conf. on AI, 1997, pp. 729–737.
10. Laender A.H.F., Ribeiro-Neto B.A., and da Silva A.S. DEByE – data extraction by example. Data Knowl. Eng., 40(2):121–154, 2000.
11. Liu L., Pu C., and Han W. XWRAP: an XML-enabled wrapper construction system for web information sources. In Proc. 16th Int. Conf. on Data Engineering, 2000, pp. 611–621.
12. Liu B., Grossman R.L., and Zhai Y. Mining web pages for data records. IEEE Intell. Syst., 19(6):49–55, 2004.
13. Muslea I., Minton S., and Knoblock C.A. Hierarchical wrapper induction for semistructured information sources. Autonom. Agents Multi-Agent Syst., 4(1/2):93–114, 2001.
14. Pan A., Raposo J., Álvarez M., Montoto P., Orjales V., Hidalgo J., Ardao L., Molano A., and Viña Á. The Denodo data integration platform. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002.
15. Sahuguet A. and Azavant F. Building intelligent web applications using lightweight wrappers. Data Knowl. Eng., 36(3):283–316, 2001.
Web Data Mining
▶ Data, Text, and Web Mining in Healthcare
Web Directories
▶ Lightweight Ontologies
Web ETL
Oliver Frölich
Lixto Software GmbH, Vienna, Austria
Synonyms
ETL using web data extraction techniques
Definitions
As ETL (an acronym for Extraction, Transformation, and Loading) is a well-established technology for the extraction of data from several sources and their cleansing, normalization, and insertion into a Data Warehouse (e.g., a Business Intelligence system), Web ETL stands for an ETL process where the external data to be inserted into the Data Warehouse is extracted from semi-structured Web pages (e.g., in HTML or PDF format) using Web Data Extraction techniques. In particular, back-end interchange of structured data that merely uses the Web as a transport medium, e.g., two database systems exchanging data with Web EDI technology (EDI, Electronic Data Interchange, stands for techniques and standards for the transmission of structured data, for example over the Web, in an application-to-application context), is not a Web ETL process, as no semi-structured data needs to be transformed using Web Data Extraction techniques.
Key Points
Powerful and efficient tools supporting ETL processes (Extracting, Transforming, and Loading of data) in a Data Warehouse context have existed for years now. They concentrate on the extraction of data from internal applications. But there is also a growing need to integrate external data: for example, in Competitive Intelligence, information about competitor activities and market developments is becoming a more and more important success factor for enterprises. The largest information source on earth is the World Wide Web. Unfortunately, data on the Web requires intelligent interpretation and cannot easily be used by programs, since the Web is primarily intended for human users. Therefore, just a few years ago, the extraction of data from semi-structured information sources like Web pages was mostly done manually and was very time consuming. Today, sophisticated toolsets for Web Data Extraction exist for retrieving relevant data automatically from Web sites and for transforming it into structured data formats. An example of such a toolset is the Lixto Suite [1], allowing for the extraction and transformation of data from Web pages into structured XML formats. This structured XML data can be integrated into Data Warehouse systems. The whole process, from Web Data Extraction to the integration of the cleansed and normalized information, is called a Web ETL process. Web ETL itself can be a part of a Business Intelligence process, turning semi-structured data into
business-relevant information in a Business Intelligence Data Warehouse. An example of a Business Intelligence application of Web ETL is given in [2]. Another application field of Web ETL besides Competitive Intelligence and Business Intelligence is Front-End System Integration, where Web interfaces of business applications are utilized for coupling these systems using Web ETL processes.
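As a toy illustration of such a Web ETL step, the sketch below extracts rows from an invented HTML price table, normalizes the values, and loads them into a SQLite table; real deployments would use a wrapper generation toolset and a full data warehouse rather than regular expressions and an in-memory database.

```python
# Hedged sketch of a minimal Web ETL step: Extract, Transform, Load.
import re
import sqlite3

html_page = """
<table><tr><td>Camera X100</td><td>EUR 199,00</td></tr>
<tr><td>Lens 50mm</td><td>EUR 349,00</td></tr></table>"""

def extract(page):
    # Extract: pull (product, price) pairs out of the semi-structured page.
    return re.findall(r"<tr><td>(.*?)</td><td>EUR ([\d.,]+)</td></tr>", page)

def transform(rows):
    # Transform: normalize the decimal separator and convert to float.
    return [(name, float(price.replace(".", "").replace(",", ".")))
            for name, price in rows]

def load(records):
    # Load: insert the cleansed records into the target database.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE competitor_prices (product TEXT, price REAL)")
    con.executemany("INSERT INTO competitor_prices VALUES (?, ?)", records)
    return con.execute("SELECT * FROM competitor_prices").fetchall()

print(load(transform(extract(html_page))))
```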
Cross-references
▶ Business Intelligence
▶ Competitive Intelligence
▶ Data Warehouse
▶ ETL
▶ Semi-Structured Data
Recommended Reading
1. Baumgartner R., Flesca S., and Gottlob G. Visual web information extraction with Lixto. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 119–128.
2. Baumgartner R., Frölich O., Gottlob G., Harz P., Herzog M., and Lehmann P. Web data extraction for business intelligence: the Lixto approach. In Proc. Datenbanksysteme in Business, Technologie und Web (BTW), 2005, pp. 48–65.
3. Frölich O. Optimierung von Geschäftsprozessen durch Integrierte Wrapper-Technologien. Dissertation, Institute of Information Systems, Vienna University of Technology, 2006.
Web Harvesting
Wolfgang Gatterbauer, University of Washington, Seattle, WA, USA
Synonyms
Web data extraction; Web information extraction; Web mining
Definition
Web harvesting describes the process of gathering and integrating data from various heterogeneous web sources. Necessary input is an appropriate knowledge representation of the domain of interest (e.g., an ontology), together with example instances of concepts or relationships (seed knowledge). Output is structured data (e.g., in the form of a relational database) that is gathered from the Web. The term harvesting implies that, while passing over a large body of available information, the process gathers only such information that lies in the domain of interest and is, as such, relevant.
Key Points
The process of web harvesting can be divided into three successive tasks: (i) data or information retrieval, which involves finding relevant information on the Web and storing it locally. This task requires tools for searching and navigating the Web, i.e., crawlers and means for interacting with dynamic or deep web pages, and tools for reading, indexing and comparing the textual content of pages; (ii) data or information extraction, which involves identifying relevant data on retrieved content pages and extracting it into a structured format. Important tools that allow access to the data for further analysis are parsers, content spotters and adaptive wrappers; (iii) data integration, which involves cleaning, filtering, transforming, refining and combining the data extracted from one or more web sources, and structuring the results according to a desired output format. The important aspect of this task is organizing the extracted data in such a way as to allow unified access for further analysis and data mining tasks. The ultimate goal of web harvesting is to compile as much information as possible from the Web on one or more domains and to create a large, structured knowledge base. This knowledge base should then allow querying for information similar to a conventional database system. In this respect, the goal is shared with that of the Semantic Web. The latter, however, tries to solve extraction a priori to retrieval by having web sources present their data in a semantically explicit form. Today's search engines focus on the task of finding content pages with relevant data. The important challenges for web harvesting, in contrast, lie in extracting and integrating the data. Those difficulties are due to the variety of ways in which information is expressed on the Web (representational heterogeneity) and the variety of alternative, but valid, interpretations of domains (conceptual heterogeneity). These difficulties are aggravated by the Web's sheer size, its level of heterogeneity and the fact that information on the Web is not only complementary and redundant, but often contradictory too. An important research problem is the optimal combination of automation (high recall) and human involvement (high precision). At which stages and in which manner a human user must interact with an otherwise fully automatic web harvesting system for optimal performance (in terms of speed, quality, minimum human involvement, etc.) remains an open question.
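A minimal sketch of the three tasks, under illustrative assumptions (the seed URLs and the extraction pattern are hypothetical, and a real harvester would use a crawler and adaptive wrappers rather than a single regular expression):

```python
import re
import urllib.request

# (i) Retrieval: fetch candidate pages (illustrative seed URLs).
seed_urls = ["http://example.org/team", "http://example.org/about"]
pages = {}
for url in seed_urls:
    with urllib.request.urlopen(url, timeout=10) as resp:
        pages[url] = resp.read().decode("utf-8", "replace")

# (ii) Extraction: spot instances of a simple "person, role" pattern.
PATTERN = re.compile(r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+),\s*(?P<role>CEO|CTO|Professor)")
raw_facts = [(m.group("name"), m.group("role"), url)
             for url, text in pages.items()
             for m in PATTERN.finditer(text)]

# (iii) Integration: deduplicate the extracted facts and keep the set of
# sources for each one as provenance, so contradictions can be examined.
knowledge = {}
for name, role, source in raw_facts:
    knowledge.setdefault((name, role), set()).add(source)

for (name, role), sources in knowledge.items():
    print(f"{name} -- {role} (seen in {len(sources)} source(s))")
```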
Cross-references
▶ Data Extraction
▶ Data Integration
▶ Fully-Automatic Web Data Extraction
▶ Information Retrieval
▶ Semantic Web
▶ Web Data Extraction
▶ Web Data Extraction System
▶ Web Scraper
▶ Wrapper
Recommended Reading
1. Ciravegna F., Chapman S., Dingli A., and Wilks Y. Learning to harvest information for the Semantic Web. In Proc. 1st European Semantic Web Symposium, 2004, pp. 312–326.
2. Crescenzi V. and Mecca G. Automatic information extraction from large websites. J. ACM, 51(5):731–779, 2004.
3. Etzioni O., Cafarella M.J., Downey D., Kok S., Popescu A.M., Shaked T., Soderland S., Weld D.S., and Yates A. Web-scale information extraction in KnowItAll: (preliminary results). In Proc. 12th Int. World Wide Web Conference, 2004, pp. 100–110.
Web Indexing
▶ Indexing the Web
Web Information Extraction
Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu, IBM Almaden Research Center, San Jose, CA, USA
Synonyms
Text Analytics; Information Extraction
Definition
Information extraction (IE) is the process of automatically extracting structured pieces of information from unstructured or semi-structured text documents. Classical problems in information extraction include named-entity recognition (identifying mentions of persons, places, organizations, etc.) and relationship extraction (identifying mentions of relationships between such named entities). Web information extraction is the application of IE techniques to process the
vast amounts of unstructured content on the Web. Due to the nature of the content on the Web, in addition to named-entity and relationship extraction, there is growing interest in more complex tasks such as extraction of reviews, opinions, and sentiments.
Historical Background
Historically, information extraction was studied by the Natural Language Processing community in the context of identifying organizations, locations, and person names in news articles and military reports [9]. From early on, information extraction systems were based on the knowledge engineering approach of developing carefully crafted sets of rules for each task. These systems view the text as an input sequence of symbols, and extraction rules are specified as regular expressions over the lexical features of these symbols. The formalism underlying these systems is based on cascading grammars and the theory of finite-state automata. One of the earliest languages for specifying such rules is the Common Pattern Specification Language (CPSL) developed in the context of the TIPSTER project [1]. To overcome some of the drawbacks of CPSL resulting from a sequential view of the input, the AfST system [2] uses a more powerful grammar that views its input as an object graph. Beginning in the mid-1990s, as the unstructured content on the Web continued to grow, information extraction techniques were applied in building popular Web applications. One of the earliest such uses of information extraction was in the context of screen scraping for online comparison shopping and data integration applications. By manually examining a number of sample pages, application designers would develop ad hoc rules and regular expressions to eke out relevant pieces of information (e.g., the name of a book, its price, the ISBN number, etc.) from multiple Web sites to produce a consolidated Web page or query interface. Recently, more sophisticated IE techniques are being employed on the Web to improve search result quality, guide ad placement strategies, and assist in reputation management [7,12].
Foundations
Knowledge-engineered rules have the advantage that they are easy to construct in many cases (e.g., rules to recognize prices, phone numbers, zip codes, etc.), easier to debug and maintain when written in a high-level rule
language, and provide a natural way to incorporate domain or corpus-specific knowledge. However, such rules are extremely labor intensive to develop and maintain. An alternative paradigm for producing extraction rules is the use of learning-based methods [6]. Such methods work well when training data is readily available and the extraction tasks are hard to encode manually. Finally, there has been work in the use of complex statistical models, such as Hidden Markov Models and Conditional Random Fields, where the rules of extraction are implicit within the parameters of the model [11]. For ease of exposition, the rule-based paradigm is used to present the core concepts of Web information extraction. Rule-based extraction programs are called annotators and their output is referred to as annotations. The two central concepts of rule-based extraction are rules and spans. A span corresponds to a substring of the document text represented as a pair of offsets (begin, end). A rule is of the form A ← P (Fig. 1), where A is an annotation (specified within angled brackets) and P is a pattern specification. Evaluating the rule associates any span of text that matches pattern P with the annotation A. The description of information extraction is organized around four broad categories of extraction tasks: entity extraction, relationship extraction, complex composite extraction, and application-driven extraction. The first two categories, while relevant and increasingly used in Web applications, are classical IE tasks that have been extensively studied in the literature even before the advent of the Web. Category 1. Entity extraction refers to the identification of mentions of named entities (such as persons, locations, organizations, phone numbers, etc.) in unstructured text. While the task of entity extraction is intuitive and easy to describe, the corresponding annotators are fairly complex and involve a large number of rules and carefully curated dictionaries. For example, a high-quality annotator for person names would involve several tens of rules to capture all of the different
conventions, shorthands, and formats used in person names all over the world.
Example 1. As an illustrative example, consider the simple annotator shown in Fig. 1 for recognizing mentions of person names. The annotator uses a CPSL-style cascading grammar specification. Assume that the input text has already been tokenized and is available as a sequence of ⟨Token⟩ annotations. Rules R4, R5, and R6 are the lowest level grammar rules since they operate only on the input ⟨Token⟩ annotations. Each of these rules attempts to match the text of a token against a particular regular expression and produces output annotations whenever the match succeeds. Rule R6 states that a span of text consisting of a single token whose text matches a dictionary of person names ("Michael," "Richard," etc.) is a ⟨PersonDict⟩ annotation. Rule R4 similarly defines ⟨Salutation⟩ annotations (e.g., Dr., Prof., Mr., etc.) and R5 defines ⟨CapsWord⟩ annotations consisting of a single token beginning with a capital letter. Finally, rules R1, R2, and R3 are the higher-level rules of the cascading grammar since they operate on the annotations produced by the lower-level rules. For example, R1 states that a salutation followed by two capitalized words is a person name. Such a rule will recognize names such as "Dr. Albert Einstein" and "Prof. Michael Stonebraker."
Category 2. Binary relationship extraction refers to the task of associating pairs of named entities based on the identification of a relationship between the entities. For instance, Example 2 describes the task of extracting instances of the CompanyCEO relationship, i.e., finding pairs of person and organization names such that the person is the CEO of the organization.
Example 2. Assume that mentions of persons and organizations have already been annotated (as in Category 1 above) and are available as ⟨Person⟩ and ⟨Organization⟩ annotations respectively. Figure 2 shows rules that identify instances of the CompanyCEO relationship. Rule R7 looks for pairs of ⟨Person⟩ and ⟨Organization⟩ annotations such that the text
Web Information Extraction. Figure 1. Simple rules for identifying person names.
between these annotations satisfies a particular regular expression. The regular expression that is used in this example will match any piece of text containing the phrase "CEO of" with an optional comma before the phrase and an arbitrary amount of whitespace separating the individual words of the phrase. Thus, it will correctly extract an instance of this relationship from the text "Sam Palmisano, CEO of IBM." As with Example 1, the rules presented here are merely illustrative, and high-quality relationship annotators will involve significantly more rules.
Category 3. Complex composite extraction. The presence of vast amounts of user generated content on the Web (in blogs, wikis, discussion forums, and mailing lists) has engendered a new class of complex information extraction tasks. The goal of such tasks is the extraction of reviews, opinions, and sentiments on a wide variety of products and services. Annotators for such complex tasks are characterized by two main features: (i) the use of entity and relationship annotators as sub-modules to be invoked as part of a higher level extraction workflow, and (ii) the use of aggregation-like operations in the extraction process.
Example 3. Consider the task of identifying informal reviews of live performances of music bands
embedded within blog entries. Figure 3 shows the high-level workflow of an annotator that accomplishes this task. The two individual modules, ReviewInstance Extractor and ConcertInstance Extractor, identify specific snippets of text in a blog. The ReviewInstance Extractor module identifies snippets that indicate portions of a concert review – e.g., "show was great," "liked the opening bands" and "Kurt Ralske played guitar." Similarly, the ConcertInstance Extractor module identifies occurrences of bands or performers – e.g., "performance by the local funk band Saaraba" and "went to the Switchfoot concert at the Roxy." The output from the ReviewInstance Extractor module is fed into the ReviewGroup Aggregator module to identify contiguous blocks of text containing ⟨ReviewInstance⟩ snippets. Finally, a ⟨ConcertInstance⟩ snippet is associated with one or more ⟨ReviewGroup⟩ annotations to obtain ⟨BandReview⟩ annotations. Note that in addition to the complex high-level workflow, each individual module in Fig. 3 is itself fairly involved and consists of tens of regular expression patterns and dictionaries.
Category 4. Application-driven extraction. The last category of extraction tasks covers a broad spectrum of scenarios where IE techniques are applied to perform extraction that is unique to the needs of a particular
Web Information Extraction. Figure 2. Simple rules for identifying CompanyCEO relationship instances.
Web Information Extraction. Figure 3. Extraction of informal band reviews.
Web application. While the early applications of IE were ad hoc and limited to screen scraping, there has been recent widespread use of IE in the context of improving search quality. A prototypical example is the use of specially crafted extraction rules applied to the URLs and titles of Web pages to identify high-quality index terms. Consider the following URLs:
U1 http://en.wikipedia.org/wiki/Michelangelo
U2 http://www.ibiblio.org/wm/paint/auth/michelangelo/
U3 http://www.michelangelo.com/
U4 http://www.artcyclopaedia.com/artists/michelangelo_buonarotti.html
It is easy to see that certain key segments of the URL string, such as the last segment of the path (as in U1 and U2), the portion of the hostname following the "www" (in U3) and the actual resource name (as in U4), provide fairly reliable clues as to the actual content of the corresponding Web page. Modern search engines, on the Web and on intranets, apply sophisticated patterns to extract such key segments from the URL and produce terms for their search index. In the above examples, such a technique would enable the term "michelangelo" or the phrase "michelangelo buonarotti" to be identified as high-quality index terms for the Web pages. In a similar fashion, extraction of key phrases from the captions of images and videos is being used to effectively index and search over online multimedia repositories. Other specialized search engines for verticals such as healthcare, finance, and people search also make significant use of information extraction to identify domain-specific concepts.
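The exact patterns used by production search engines are proprietary; the following is only a minimal sketch of the idea of extracting candidate index terms from URL segments, using the example URLs above:

```python
import re
from urllib.parse import urlparse

def url_index_terms(url: str) -> list[str]:
    """Heuristically pull candidate index terms out of a URL, e.g. the
    hostname portion following 'www' or the last path segment."""
    parsed = urlparse(url)
    terms = []

    # Hostname portion following "www." (as in http://www.michelangelo.com/)
    host = parsed.hostname or ""
    m = re.match(r"www\.([^.]+)\.", host)
    if m:
        terms.append(m.group(1))

    # Last non-empty path segment, with any extension stripped
    # (as in .../wiki/Michelangelo or .../michelangelo_buonarotti.html)
    segments = [s for s in parsed.path.split("/") if s]
    if segments:
        last = re.sub(r"\.[a-z]+$", "", segments[-1])
        terms.extend(t for t in re.split(r"[_\-]+", last.lower()) if t)

    return terms

print(url_index_terms("http://en.wikipedia.org/wiki/Michelangelo"))
# ['michelangelo']
print(url_index_terms("http://www.artcyclopaedia.com/artists/michelangelo_buonarotti.html"))
# ['artcyclopaedia', 'michelangelo', 'buonarotti']
```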
Key Applications
As the vast majority of information on the Web is in unstructured form, there is growing interest, within the database, data mining, and IR communities, in the use of information extraction for Web applications. Several research projects in the areas of intranet search, community information management, and Web analytics are already employing IE techniques to bring order to unstructured data. The common theme in all of these applications is the use of IE to process the input text and produce a structured representation to support search, browsing, and mining applications. Web search engines are also employing IE techniques to recognize key entities (persons, locations,
organizations, etc.) associated with a web page. This semantically richer understanding of the contents of a page is used to drive corresponding improvements to their search ranking and ad placement strategies. Finally, there is continuing interest in techniques to reliably extract reviews, opinions, and sentiments about products and services from the content present in online communities and portals. The extracted information can be used to guide business decisions related to product placement, targeting, and marketing. The complex nature of these extraction tasks as well as the heterogeneous and noisy nature of the data pose interesting research challenges.
Future Directions
While the area of information extraction has made considerable progress since its inception, several important challenges remain open. Two of these challenges that are under active investigation in the research community are described below.

Scalability
As complex information extraction tasks move to operating over Web-size data sets, scalability has become a major focus [8]. In particular, IE systems need to scale to large numbers of input documents (data set size) and to large numbers of rules and patterns (annotator complexity). Two classes of optimizations being considered in this regard are:
Low-Level Primitives: improving the performance of the low-level operations that dominate annotator execution time.
High-Level Optimization: automatically choosing more efficient orders for evaluating annotator rules.

Low-Level Primitives
In many information extraction systems, low-level text operations like tokenization, regular expression evaluation and dictionary matching dominate execution time. Speeding up these low-level operations leads to direct improvements in scalability, both in terms of the number of documents the system can handle and the number of basic operations the system can afford to perform on a given document [5]. Some of the problems being addressed currently in this context are given below.
When a large number of dictionaries are evaluated in an annotator (e.g., the BandReview annotator (Fig. 3) evaluates over 30 unique dictionaries), evaluating each dictionary separately is expensive, as the document text is tokenized once per dictionary evaluation. To address this inefficiency, techniques are being explored to evaluate multiple dictionaries efficiently in a single pass over the document text.
Evaluating complex regular expressions over every document is also an expensive operation. An active area of research is the design of faster regular expression matchers for special classes of regular expressions. Another approach being considered is the design of "filter" regular expression indexes, which can be used to quickly eliminate many of the documents that do not contain a match.
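A minimal sketch of single-pass dictionary evaluation, under illustrative assumptions (the dictionaries are hypothetical, entries are single tokens, and a production system would use a combined automaton such as Aho-Corasick for multi-token entries):

```python
import re

# Hypothetical dictionaries; an annotator like BandReview uses dozens.
DICTIONARIES = {
    "PersonDict":     {"michael", "richard", "albert"},
    "SalutationDict": {"dr.", "prof.", "mr.", "ms."},
    "InstrumentDict": {"guitar", "drums", "bass"},
}

# Invert the dictionaries once: token -> names of dictionaries containing it.
TOKEN_INDEX = {}
for dict_name, entries in DICTIONARIES.items():
    for entry in entries:
        TOKEN_INDEX.setdefault(entry, []).append(dict_name)

TOKENIZE = re.compile(r"\S+")

def annotate(document: str):
    """Tokenize the document once and evaluate every dictionary in a
    single pass, instead of re-tokenizing once per dictionary."""
    matches = []
    for tok in TOKENIZE.finditer(document):
        for dict_name in TOKEN_INDEX.get(tok.group().lower(), []):
            matches.append((tok.start(), tok.end(), dict_name))
    return matches

print(annotate("Prof. Michael Stonebraker plays guitar at the show."))
```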
High-Level Optimization
As web information extraction moves towards Categories 3 and 4 (more complex tasks), there is more opportunity for improving efficiency [3–14]. Some of these efficiency gains can be obtained using traditional relational optimizations such as "join reordering" and "pushing down selection predicates." There are also additional text-specific optimizations that may give significant benefit, two of which are described below.
Restricted Span Evaluation: Let R denote a set of rules for a given information extraction task. Let spans(d, R′) be the set of spans obtained by evaluating R′ ⊆ R over a document d. Let r ∈ (R ∖ R′) be a rule to be evaluated over the complete text of d. Restricted Span Evaluation is the optimization technique by which the annotator specification can be rewritten so that rule r is evaluated only on spans(d, R′) and not over the complete text of d. A specific instantiation of restricted span evaluation is presented below, using Rule R1 in the Person annotator (Fig. 1). This rule identifies a salutation followed by two capitalized words. Note that R1 uses R4 and R5 for identifying ⟨Salutation⟩ and ⟨CapsWord⟩ respectively. A naive evaluation strategy identifies all occurrences of ⟨Salutation⟩ and ⟨CapsWord⟩ over the complete document before R1 is evaluated. An alternative approach is to first identify ⟨Salutation⟩ in the document and then look for ⟨CapsWord⟩ only in the immediate vicinity of all the ⟨Salutation⟩ annotations. The latter approach evaluates the ⟨CapsWord⟩ rule only on a smaller amount of text, as the ⟨Salutation⟩ annotations occur infrequently.
This can result in considerable performance improvements.
Conditional Evaluation: Let R denote a set of rules for a given information extraction task and R1 and R2 be two non-overlapping subsets of R. Conditional Evaluation is the optimization technique by which the annotator specification can be rewritten so that R1 is conditionally evaluated on a document d depending on whether the evaluation of R2 over d satisfies certain predicates. An example predicate is whether the evaluation of R2 produces at least one annotation. A specific instantiation of Conditional Evaluation in the context of the BandReview annotator (Fig. 3) is given below. There are two main modules in the annotator, ConcertInstance and ReviewGroup, which can be conditionally evaluated – i.e., the absence of one in a document implies that the other need not be evaluated on that document. In general, Conditional Evaluation is applicable in large workflows at join conditions such as the BandReview Join in Fig. 3. Initial results on combining efficient evaluation of lower-level primitives and higher-level optimizations are very encouraging, and recent results have reported an order of magnitude improvement in execution time [13,14].
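The rewriting itself happens inside the IE system; the following is only a sketch of the effect of Restricted Span Evaluation in ordinary Python, with hypothetical patterns standing in for rules R1, R4 and R5 and an assumed context window:

```python
import re

SALUTATION = re.compile(r"\b(?:Dr|Prof|Mr|Ms)\.")
CAPSWORD   = re.compile(r"\b[A-Z][a-z]+\b")
WINDOW     = 40   # characters of context examined after each salutation

def person_names_restricted(document: str):
    """Evaluate the CapsWord pattern only in the vicinity of Salutation
    matches, instead of over the complete document text."""
    names = []
    for sal in SALUTATION.finditer(document):
        window = document[sal.end(): sal.end() + WINDOW]
        caps = CAPSWORD.findall(window)
        if len(caps) >= 2:            # R1: salutation + two capitalized words
            names.append(f"{sal.group()} {caps[0]} {caps[1]}")
    return names

text = "Yesterday Prof. Michael Stonebraker met Dr. Albert Einstein in Boston."
print(person_names_restricted(text))
# ['Prof. Michael Stonebraker', 'Dr. Albert Einstein']
```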
Uncertainty Management
Real-world annotators typically consist of a large number of rules with varying degrees of accuracy. For example, most annotations generated by Rule R1 (Fig. 1) are likely to be correct, since a salutation is a very strong indicator of the existence of a person name. Rule R3, on the other hand, will identify some correct names such as "James Hall" and "Dan Brown" along with some spurious annotations such as "Town Hall" and "Dark Brown." Since "Hall" and "Brown" are ambiguous person names that also appear in other contexts, and these dictionary matches are combined with a capitalized word, R3 has a lower accuracy than R1. Formally, this difference in accuracy between different rules can be captured by the notion of the precision of a rule.
Rule Precision: Suppose rule R identifies N annotations over a given document corpus, of which n_correct annotations are correct. Then the precision of rule R is the fraction n_correct / N. Each annotation a is associated with
a confidence value that is directly related to the precision of the corresponding rule. The confidence value associated with an annotation enables applications to meaningfully represent and manipulate the imprecision of information extracted from text [4]. For Category 1 tasks, associating such confidence values (such as probabilities) has been addressed in the existing literature. On the other hand, associating confidence values with annotations for more complex extraction tasks is still open [10]. To illustrate, consider Rule R6 in Fig. 2 for identifying the CompanyCEO relationship. This rule uses the existing annotations ⟨Person⟩ and ⟨Organization⟩. Therefore, the confidence of a CompanyCEO annotation depends not only on the precision of rule R6 but also on the confidences of the ⟨Person⟩ and ⟨Organization⟩ annotations. By modeling the confidence numbers as probabilities, it is possible to view the individual rules as queries over a probabilistic database. However, direct application of current probabilistic database semantics (possible worlds) is precluded, as illustrated in the following example. Rule R7 identifies an instance of the CompanyCEO relationship from the text "Interview with Lance Brown, CEO of PeoplesForum.com." This rule looks for the existence of the text "CEO of" between ⟨Person⟩ and ⟨Organization⟩, thereby making it a highly accurate rule. Even though the participating ⟨Person⟩ annotation has a low confidence (identified by a rule with low precision, Rule R3), the resulting ⟨CompanyCEO⟩ annotations should have high confidence due to the high precision of Rule R7. Such behavior cannot be modeled under "possible worlds semantics," where the confidence of ⟨CompanyCEO⟩ annotations is bounded by the confidences of the input ⟨Person⟩ and ⟨Organization⟩ annotations. Handling such anomalies in the calculation of probabilities for relationship and complex composite extraction tasks is an open problem.
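A small sketch of the basic bookkeeping described above, with hypothetical rule names and counts; how such confidences should be combined across rules is exactly the open problem discussed in the text:

```python
def rule_precision(n_identified: int, n_correct: int) -> float:
    """precision = n_correct / N over a labeled document corpus."""
    return n_correct / n_identified if n_identified else 0.0

# Precision observed for each rule on a (hypothetical) labeled corpus.
RULE_PRECISION = {
    "R1": rule_precision(480, 475),   # salutation rule: very precise
    "R3": rule_precision(520, 310),   # dictionary + CapsWord: noisier
}

def annotate_with_confidence(span, rule):
    """Attach the producing rule's precision to the annotation as a
    confidence value that downstream applications can reason over."""
    return {"span": span, "rule": rule, "confidence": RULE_PRECISION[rule]}

print(annotate_with_confidence((10, 25), "R3"))
```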
Cross-references
▶ Fully Automatic Web Data Extraction
▶ Information Extraction
▶ Languages for Web Data Extraction
▶ Metasearch Engines
▶ Probabilistic Databases
▶ Query Optimization
▶ Text Analytics
▶ Text Mining
▶ Uncertainty and Data Quality Management
▶ Web Advertising
▶ Web Data Extraction System
▶ Web Harvesting
▶ Wrapper Induction
Recommended Reading
1. Appelt D.E. and Onyshkevych B. The common pattern specification language. In Tipster Text Program Phase 3, 1999.
2. Boguraev B. Annotation-based finite state processing in a large-scale NLP architecture. In Proc. Recent Advances in Natural Language Processing, 2003, pp. 61–80.
3. Cafarella M.J. and Etzioni O. A search engine for natural language applications. In Proc. 14th Int. World Wide Web Conference, 2005, pp. 442–452.
4. Cafarella M.J. et al. Structured querying of Web text: a technical challenge. In Proc. 3rd Biennial Conf. on Innovative Data Systems Research, 2007, pp. 225–234.
5. Chandel A., Nagesh P.C., and Sarawagi S. Efficient batch top-k search for dictionary-based entity recognition. In Proc. 22nd Int. Conf. on Data Engineering, 2006.
6. Cohen W. and McCallum A. Information extraction from the World Wide Web. Tutorial at Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2003.
7. Cunningham H. Information extraction, automatic. In Encyclopedia of Language and Linguistics, 2nd edn., 2005.
8. Doan A., Ramakrishnan R., and Vaithyanathan S. Managing information extraction: state of the art and research directions. Tutorial in Proc. ACM SIGMOD Int. Conf. on Management of Data, 2006.
9. Grishman R. and Sundheim B. Message understanding conference-6: a brief history. In Proc. 16th Conf. on Computational Linguistics, 1996, pp. 446–471.
10. Jayram T.S., Krishnamurthy R., Raghavan S., Vaithyanathan S., and Zhu H. Avatar information extraction system. Q. Bull. IEEE TC on Data Engineering, 29(1):40–48, 2006.
11. Lafferty J., McCallum A., and Pereira F. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proc. 18th Int. Conf. on Machine Learning, 2001, pp. 282–289.
12. Li Y., Krishnamurthy R., Vaithyanathan S., and Jagadish H. Getting work done on the web: supporting transactional queries. In Proc. 32nd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2006, pp. 557–564.
13. Reiss F., Raghavan S., Krishnamurthy R., Zhu H., and Vaithyanathan S. An algebraic approach to rule-based information extraction. In Proc. 24th Int. Conf. on Data Engineering, 2008, pp. 933–942.
14. Shen W., Doan A., Naughton J., and Ramakrishnan R. Declarative information extraction using datalog with embedded extraction predicates. In Proc. 33rd Int. Conf. on Very Large Data Bases, 2007, pp. 1033–1044.
Web Information Extraction System
▶ Web Data Extraction System
Web Information Integration and Schema Matching
▶ Data Integration in Web Data Extraction System
WEB Information Retrieval Models
Craig Macdonald, Iadh Ounis, University of Glasgow, Glasgow, UK
Synonyms
Web search engines
Definition
The Web can be considered as a large-scale document collection, to which classical text retrieval techniques can be applied. However, its unique features and structure offer new sources of evidence that can be used to enhance the effectiveness of Information Retrieval (IR) systems. Generally, Web IR examines the combination of evidence from both the textual content of documents and the structure of the Web, as well as the search behavior of users and issues related to the evaluation of retrieval effectiveness in the Web setting. Web Information Retrieval models are ways of integrating many sources of evidence about documents, such as the links, the structure of the document, the actual content of the document, the quality of the document, etc., so that an effective Web search engine can be achieved. In contrast with the traditional library-type settings of IR systems, the Web is a hostile environment, where Web search engines have to deal with subversive techniques applied to give Web pages artificially high search engine rankings. Moreover, Web content is heavily duplicated, for example by mirroring, which search engines need to account for, while the size of the Web requires search engines to address the scalability of their algorithms in order to remain efficient. The commonly known PageRank algorithm – based on a document's hyperlinks – is an example of a source of evidence used to identify high quality documents. Another example is the commonly applied anchor text of the incoming hyperlinks. These are two of the most popular examples of link-based sources of evidence.
Historical Background
The first search engines for the Web appeared around 1992–1993, notably with the full-text indexing WebCrawler and Lycos both arriving in 1994. Soon after, many other search engines arrived, including Altavista, Excite, Inktomi and Northern Light. These often competed directly with directory-based services, like Yahoo!, which added search engine facilities later. The rise in prominence of Google, in 2001, was due to its recognition that the underlying Web search user task is not an adhoc task (where users want lots of relevant documents, but no particular ones) but a more precision-oriented task, where the relevance of the top-ranked result is important. This set Google apart from other search engines, since by using link analysis techniques (such as PageRank), hyperlink anchor text and other heuristics such as the title of the page, it could find itself at rank #1 for the query "google." This was something none of the other search engines at that time could achieve. Since then, there has been a distinct consolidation in the Web search engine market, with only three major players taking the majority of the English market: Google, Yahoo! and MSN Live. However, additional search engines are thriving in other areas: Baidu and Yandex have high penetration in the Chinese and Russian markets respectively, while Technorati, BlogPulse and other blog search engines are popular in the blogosphere. Moreover, since the arrival of Google, the Web IR research field has become much more active, with many more research groups and more conferences than before, to the point that it is almost a separate research field in its own right, with two separate ACM conferences (the World Wide Web Conference, and the Web Search and Data Mining Conference), in addition to the existing IR conferences.
Foundations
The Web can be considered as a large-scale document collection, to which classical text retrieval techniques can be applied. However, its unique features and structure offer new sources of evidence that can be used to enhance the effectiveness of IR systems. Generally, Web IR examines the combination of evidence from both the textual content of documents and the structure of the Web, as well as the search behavior of users, and issues related to the evaluation of retrieval effectiveness. The information available on the Web is very different from the information contained in either
libraries or classical IR collections. A large amount of information on the Web is duplicated, and content is often mirrored across many different sites. Moreover, many documents can be unintentionally or intentionally inaccurate, or are deliberately designed to mislead search engines. This means that Web IR models need to be developed so they can perform well in such a hostile environment. The Web is based on a hypertext document model, where documents are connected with direct hyperlinks. This results in a virtual network of documents. A major development in the Web IR field was that of query independent evidence. In particular, Page et al.'s PageRank algorithm [1] determines the quality of a document by examining the quality of the documents linked to it. The PageRank scores correspond to the probability of visiting a particular node in a Markov chain for the whole Web graph, where the states represent Web documents, and the transitions between states represent hyperlinks. PageRank was reported to be a fundamental component of the early versions of the Google search engine [1], and is beneficial in high-precision user tasks, where the relevance and quality of the top-ranked documents are important. Many other similar link analysis algorithms have been proposed, including those that can be applied in a query-dependent or independent fashion. Most are based on random walks, calculating the probability of a random Web user visiting a given page. Examples are Kleinberg's HITS [6] and the Absorbing Model [13]. Other sources of document quality have been reported, including the use of URL evidence to determine the type of the page (for instance, whether the URL is short or long, or how many "/" characters it contains [7]). The integration of such query independent evidence into the ranking strategy is an important issue. In the language modeling framework, it is natural to see the integration of query independent evidence as a document prior, which defines the overall likelihood of the document's retrieval [7]. Craswell et al. [3] introduce a framework (FLOE) for integrating query independent evidence with the retrieval score from BM25. The integration of several priors remains an important problem. Peng et al. [10] propose a probabilistic method of combining several document priors in the language modeling framework, while Plachouras [11] examines how various query independent evidence can be applied on a query-by-query basis.
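The exact transforms used by FLOE or by the language-modeling priors are beyond a short example, but the general shape of combining a query-dependent relevance score with query-independent evidence can be sketched as follows. The weights, scores and the log transform below are illustrative assumptions, not the published models:

```python
import math

def combined_score(relevance: float, prior: float, w: float = 1.0) -> float:
    """Add a weighted, log-transformed query-independent prior (for example
    a PageRank value or a URL-type prior) to a query-dependent relevance
    score. This mirrors a common log-linear style of combination; it is not
    the exact FLOE transform or a specific language-modeling prior."""
    return relevance + w * math.log(prior)

# Two candidate documents for the same query: the second has a slightly
# weaker content match but a much stronger query-independent prior.
doc_a = combined_score(relevance=12.4, prior=1e-7)
doc_b = combined_score(relevance=11.9, prior=5e-6)
print(doc_a, doc_b, doc_b > doc_a)   # the prior changes the ranking
```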
The algorithms HITS and PageRank, along with their extensions, explicitly employ the hyperlinks between Web documents in order to find high quality, or authoritative, Web documents. A form of implicit use of the hyperlinks in combination with content analysis is to use the anchor text associated with the incoming hyperlinks of documents. Web documents can be represented by an anchor text surrogate, which is formed by collecting the anchor text associated with the hyperlinks pointing to the document. The anchor text of the incoming hyperlinks provides a concise description of a Web document. The terms used in the anchor text may be different from the ones that occur in the document itself, because the author of the anchor text is not necessarily the author of the document. Indeed, Craswell et al. [2] showed that anchor text is very effective for navigational search tasks, and more specifically for finding home pages of Web sites. Several models have since been developed that use the anchor text of a document in addition to its content. Kraaij et al. [7] and Ogilvie and Callan [9] describe mixture language modeling approaches, where the probability of a term's occurrence in a document is a mixture of the probabilities of its occurrence in different textual representations of the document (e.g., content, title, anchor text). Later, Robertson et al. [14] showed that, due to the different term occurrence distributions of the different representations of a document, it is better to combine frequencies rather than scores. Indeed, shortly thereafter, Zaragoza et al. [16] and Macdonald et al. [8] devised weighting models where the frequency of a term occurring in each of a document's representations is normalized and weighted before scoring by the weighting model. This allows a fine-grained control over the importance of each representation of the document in the document scoring process. This has been further investigated by the use of multinomial Divergence from Randomness models to score structured documents [12].
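The normalization and saturation details of the published field-based models differ; the sketch below only illustrates the idea of weighting per-field term frequencies before a single saturating transform, with made-up field weights and no length normalization:

```python
# Illustrative field weights: anchor text and title count more than body.
FIELD_WEIGHTS = {"body": 1.0, "title": 4.0, "anchor": 6.0}
K1 = 1.2   # saturation parameter, in the style of BM25-like models

def field_combined_tf(term: str, fields: dict) -> float:
    """Combine per-field term frequencies with field weights *before*
    applying a single saturating transform, rather than combining the
    scores of separately scored fields."""
    pseudo_tf = sum(weight * fields.get(field, []).count(term)
                    for field, weight in FIELD_WEIGHTS.items())
    # Simple saturation; field length normalization omitted for brevity.
    return pseudo_tf / (pseudo_tf + K1)

doc = {
    "body":   "the university of glasgow is in glasgow".split(),
    "title":  "university of glasgow".split(),
    "anchor": "glasgow university homepage".split(),
}
print(field_combined_tf("glasgow", doc))
```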
Key Applications
Web search engines are heavy developers and users of Web IR technology. Yet, to prevent exploitation by spammers and imitation by commercial rivals, the models applied by the search engines remain closely guarded secrets. However, in recent years, the IR research field has grown, and many groups are now researching topics related to Web IR. The TREC forum,
discussed below, is a facilitator of much of this research, providing samples of the Web for research purposes.
Future Directions
There are many open problems in Web IR research. In particular, spam is an ever-growing issue – many sites are created with the purpose of manipulating search engine rankings for financial gain (e.g., advertising revenue). Search engines are in a constant battle with the spammers, and are constantly becoming more robust in adversarial conditions. Large Web search engines have often identified a huge number of features for every web page. Indeed, recent reports from the Microsoft Live search engine suggest that they use over 300 features when ranking web pages. At this scale, how should these features be combined? This area has recently been gaining much interest from the IR and machine learning communities, and was the subject of the Learning To Rank for Information Retrieval workshop [5]. Generally speaking, the idea is that machine learning techniques can be applied to learn a ranking function from the input data and the evaluation results, instead of developing a function from a theory. The Learning to Rank sub-field also encompasses the large-scale learning that search engines perform using the clickthrough data gleaned from user sessions, allowing them to learn the best results for popular queries. The advent of the blogging phenomenon has created many new and interesting search and data mining tasks for search engines in the era of user-generated content. The blogging community (known as the blogosphere) often responds to real-life events with comment and rhetoric. The TREC Blog track has been created to investigate retrieval issues in this niche area of the Web, and has thus far investigated the retrieval of opinionated posts – finding blog posts which not only discuss a target, but also disseminate an opinion about the target. As Web search becomes ubiquitous for the average knowledge worker, the need for internal company search engines that allow employees to search and mine the company intranet is becoming important. To this end, the TREC Enterprise track has been investigating retrieval tasks in the Enterprise setting.
Data Sets
Much research into techniques for Web IR has centered around the Text REtrieval Conference (TREC) Web
track and other related tracks. In particular, TREC has investigated retrieval from various samples of the Web, and produced several test collections. Each test collection consists of a corpus of Web documents crawled from the Internet, combined with queries expressing information needs (known as topics), and relevance assessments. Table 1 lists the years in which Web tasks ran at TREC; the document corpora used are described further in Table 2. It is of note that, over the early years of the Web tracks, the tasks were still being developed, and hence the early user tasks are not the most common tasks that users perform in search engines (e.g., adhoc retrieval). Instead, the focus on early precision and known-item retrieval tasks (e.g., home page and named page finding) was developed from studies of user interactions with search engines [15]. Table 2 describes the collections used for the TREC Web and Terabyte tracks over the years 1999–2006.
WEB Information Retrieval Models. Table 1. Tasks and collections applied by various Web IR tracks at TREC

TREC Track    | Collection | Tasks
2006 Terabyte | .GOV2      | Adhoc, Named page
2005 Terabyte | .GOV2      | Adhoc, Named page
2004 Terabyte | .GOV2      | Adhoc
2004 Web      | .GOV       | Mixed (Home page, Named page, Topic Distillation)
2003 Web      | .GOV       | Home page, Named page, Topic Distillation
2002 Web      | .GOV       | Named page, Topic Distillation
2001 Web      | WT10G      | Adhoc, Home page
2000 Web      | WT10G      | Adhoc
1999 Web      | WT2G       | Adhoc

WEB Information Retrieval Models. Table 2. Web IR research test collections

Collection | # Documents | # Links
.GOV2      | 25,205,179  | 261,937,150
.GOV       | 1,247,753   | 11,110,989
WT10G      | 1,692,096   | 8,063,026
WT2G       | 247,491     | 1,166,146
When developing Web IR techniques, where appropriate, it is common to apply the techniques on these standard test collections and compare to appropriately trained baselines. Hawking and Craswell [4] provide an overview of the Web track at TREC. Relatedly, TREC also has two test collections for Enterprise IR research, and a further collection for Blog IR research.
URL to Code
Text REtrieval Conference: http://trec.nist.gov
Web and Blog Test Collections: http://ir.dcs.gla.ac.uk/test_collections
ACM Special Interest Group IR: http://www.sigir.org/
Cross-references
▶ BM25
▶ Divergence from Randomness Models
▶ Indexing the Web
▶ Information Retrieval Models
▶ Probabilistic Retrieval Models and Binary Independence Retrieval (BIR) Model
▶ Web Indexing
▶ Web Page Quality Metrics
▶ Web Spam Detection
Recommended Reading
1. Brin S. and Page L. The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst., 30(1–7):107–117, 1998.
2. Craswell N., Hawking D., and Robertson S. Effective site finding using link anchor information. In Proc. 24th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001, pp. 250–257.
3. Craswell N., Robertson S., Zaragoza H., and Taylor M. Relevance weighting for query independent evidence. In Proc. 31st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005, pp. 416–423.
4. Hawking D. and Craswell N. The very large collection and Web tracks. In TREC: Experiment and Evaluation in Information Retrieval. Kluwer Academic Publishers, Dordrecht, 2004, pp. 199–232.
5. Joachims T., Li H., Liu T.Y., and Zhai C. SIGIR workshop report: learning to rank for information retrieval (LR4IR 2007). SIGIR Forum, 41(2):55–62, 2007.
6. Kleinberg J.M. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.
7. Kraaij W., Westerveld T., and Hiemstra D. The importance of prior probabilities for entry page search. In Proc. 25th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2002, pp. 27–34.
8. Macdonald C., Plachouras V., He B., Lioma C., and Ounis I. University of Glasgow at WebCLEF 2005: experiments in per-field normalisation and language specific stemming. In Proc. 6th Workshop, Cross-Language Evaluation Forum, 2005, pp. 898–907.
9. Ogilvie P. and Callan J. Combining document representations for known-item search. In Proc. 26th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 143–150.
10. Peng J., Macdonald C., He B., and Ounis I. Combination of document priors in Web information retrieval. In Proc. 8th Int. Conf. Computer-Assisted Information Retrieval, 2007.
11. Plachouras V. Selective Web Information Retrieval. PhD thesis, Department of Computing Science, University of Glasgow, 2006.
12. Plachouras V. and Ounis I. Multinomial randomness models for retrieval with document fields. In Proc. 29th European Conf. on IR Research, 2007, pp. 28–39.
13. Plachouras V., Ounis I., and Amati G. The static absorbing model for the Web. J. Web Eng., 165–186, 2005.
14. Robertson S., Zaragoza H., and Taylor M. Simple BM25 extension to multiple weighted fields. In Proc. Int. Conf. on Information and Knowledge Management, 2004, pp. 42–49.
15. Silverstein C., Henzinger M., Marais H., and Moricz M. Analysis of a very large AltaVista query log. Technical Report 1998-014, Digital SRC, 1998.
16. Zaragoza H., Craswell N., Taylor M., Saria S., and Robertson S. Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proc. 13th Text REtrieval Conference, 2004.
Web Macros
▶ Web Data Extraction System
Web Mining
▶ Data, Text, and Web Mining in Healthcare
▶ Languages for Web Data Extraction
▶ Web Harvesting
Web Mashups
Maristella Matera, Politecnico di Milano University, Milan, Italy
Synonyms
Composite Web applications
Definition
Web mashups are innovative Web applications, which rely on heterogeneous content and functions retrieved from external data sources to create new composite services.
Key Points
Web mashups characterize the second generation of Web applications, known as Web 2.0. They are composite applications, usually generated by combining content, presentation, or application functionality from disparate Web sources. Typical components that may be mashed up, i.e., composed, are RSS/Atom feeds, Web services, programmable APIs, and also content wrapped from third party Web sites. Components may have a proper user interface that can be reused to build the interface of the composite application, they may provide computing support, or they may just act as plain data sources. Content, presentation and functionality, as provided by the different components, are then combined in disparate ways: via JavaScript in the browser, via server-side scripting languages (like PHP), or via traditional languages like Java.
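A minimal server-side sketch of the idea, merging the items of two RSS feeds into one composite view; the feed URLs are placeholders, and a real mashup would typically also reuse the components' presentation or APIs:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Feed URLs are illustrative placeholders; any RSS 2.0 feeds would do.
FEEDS = [
    "http://example.com/news/rss.xml",
    "http://example.org/blog/rss.xml",
]

def fetch_items(url):
    """Retrieve an RSS feed and yield (title, link) pairs from its items."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        root = ET.fromstring(resp.read())
    for item in root.iter("item"):
        yield item.findtext("title", ""), item.findtext("link", "")

# The mashup: merge the items of several sources into one composite result.
merged = [entry for url in FEEDS for entry in fetch_items(url)]
for title, link in sorted(merged):
    print(f"{title} -> {link}")
```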
Cross-references
▶ Visual Interaction
▶ Web 2.0/3.0
Web Ontology Language
▶ OWL: Web Ontology Language
Web Page Quality Metrics
Ravi Kumar, Yahoo Research, Santa Clara, CA, USA
Synonyms
Link analysis
Definition
The primary mission of web search engines is to obtain the best possible results for a given user query. To accomplish this effectively, they rely on two crucial pieces of information: the relevance of a web page to the query and some aspect of the quality of the web page that is independent of the query. Relevance, the extent to which the query matches the content of the web page, is formalized and extensively studied in the field of information retrieval. Quality, on the other hand, is more nebulous and less well-defined. Nevertheless, one can identify three concrete and somewhat complementary aspects to the quality of a web page. The first is based on the absolute goodness of the web page and its associated meta-data. This might depend on a variety of parameters, including the worth of content that exists on the web page, the reputation of the person who authored the web page, the importance of the web site that hosts the web page, and so on. The second way is to focus on the quality of the web page as perceived by other pages that exist on the web. This can be captured by links that point to the web page: a hyperlink from one web page to another can be viewed as an endorsement of the latter by the former. The third way to measure quality is to focus on how web users perceive this page. This can be realized by studying the traffic on the web page. The main focus of this entry will be the second aspect of web page quality.
Historical Background
Web search engines were originally built using mainly classical information retrieval techniques, which were used to calculate the relevance of a query to a web page. The techniques were often modified in simple ways to account for the fact that most of the content on the web was written in HTML; for example, the presence of a query term in the title of the web page can connote a higher relevance. These methods served well in the early days of the web, but became increasingly inadequate as the web grew. One reason was that they failed to take into account a distinctive feature of web pages: the presence of explicit hyperlinks created by people to link one document to another. At a coarse level, the role of hyperlinks in web pages is two-fold. The first is to aid in the browsability and navigability from one web page to another on a web site. The second is to refer to outside sources of information that are considered relevant and authoritative by the author of the web page. Independently and almost concurrently, Brin and Page [4] and Kleinberg [12] came up with different methods to exploit the presence of links in order to ascribe a notion of quality to a web page, in the context of a collection of web pages with hyperlinks between them. The idea of using links between documents to infer a quality measure has precedents in the field of bibliometrics. For example, the impact factor was proposed by Garfield [7] to measure the quality of scientific journals. It was defined as the average number of citations to the journal. Bibliographic coupling was proposed by Kessler [11] to measure the similarity of
two hyperlinked documents. It was defined as the total number of documents linked to by both documents. An alternate measure of document similarity was proposed by Small [15]: the count of the total number of documents that have a link to both documents. The above measures treat all links alike and do not account for the quality of the citing document. A pervading theme in the works of Brin and Page and Kleinberg is to treat different links differently. This simple principle has proved to be very effective in practice in terms of improving web search and has found tremendous use in commercial search engines, even beyond web search.
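A short sketch of the two bibliometric measures on a toy citation graph (the documents and links are made up for illustration):

```python
# Citation graph as an adjacency set: cites[d] = set of documents d links to.
cites = {
    "A": {"X", "Y", "Z"},
    "B": {"Y", "Z"},
    "C": {"A", "B"},
    "D": {"A", "B", "X"},
}

def bibliographic_coupling(p, q):
    """Kessler: number of documents cited by *both* p and q."""
    return len(cites.get(p, set()) & cites.get(q, set()))

def co_citation(p, q):
    """Small: number of documents that cite *both* p and q."""
    return sum(1 for refs in cites.values() if p in refs and q in refs)

print(bibliographic_coupling("A", "B"))   # A and B both cite Y and Z -> 2
print(co_citation("A", "B"))              # C and D each cite both A and B -> 2
```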
Foundations
For the remainder of the entry, it is convenient to view the web as a directed graph or as a matrix. The web pages are nodes in this graph and the hyperlinks from one web page to another constitute the directed edges. This graph defines a natural adjacency matrix M whose (p,q)-th entry M_pq is 1 if and only if page p has a hyperlink to page q, and 0 otherwise. In 1998, Brin and Page [4] proposed a simple iterative method to define the quality of a web page. Their method, called PageRank, used the hyperlinks to arrive at the quality. One way to think about this method is to focus on the following browsing behavior of a web user. Most of the time, the user visits a web page and, after browsing, chooses one of the hyperlinks on the page at random to select the next page to browse. There is also a small chance that the user abandons browsing the current page entirely and chooses a random web page to continue browsing. This behavior can be formalized as follows. Each page p has a non-negative value π(p), updated iteratively. A stochastic process can be defined to abstract the user behavior, and in the limit, π(p) will be the fraction of the time spent at page p by the process. Let α ∈ (0,1) be a fixed constant. At each step of the process, with probability α, the process jumps to a page chosen uniformly at random from the entire set of web pages. With the remaining probability 1 − α, the process jumps to a web page q chosen uniformly at random from the set of web pages to which page p has a hyperlink. If p has no hyperlinks, then the process jumps to a web page chosen uniformly at random from the entire set of pages. Observe that this process can be modeled by the iteration π_{t+1} = P^T π_t with P = (1 − α)M′ + αJ. Here, M′ is a stochastic matrix corresponding
to the above process, J is the matrix in which each entry is the inverse of the total number of nodes in the graph, and α is the teleportation probability, usually chosen around 0.15. Notice that M′ can be easily derived from the adjacency matrix M. Since the above iteration corresponds to a linear system, it converges to a fixed point π = lim_{t→∞} π_t, which is the principal eigenvector of P^T. The quality of a page p is given by the value π(p). Many variants and extensions of the basic PageRank method have been proposed. A particularly interesting extension, called topic-sensitive PageRank, was proposed by Haveliwala [10]. The goal of this method is to define a quality measure that is biased with respect to a given topic. First, a subset of pages that correspond to the topic is identified; this can be done, for instance, by building a suitable classifier for the topic. Next, the basic PageRank process is modified in the following manner. Instead of jumping with probability α to a random page chosen from the entire set of web pages, the process jumps to a page chosen uniformly at random from this subset of pages. As before, the iteration converges and a quality measure for each page that is specific to this topic can be obtained. Also in 1998, Kleinberg [12] proposed a different iterative method to define the quality of a web page. His method, called HITS, is aimed at finding the most authoritative sources in a given collection of web pages. The intuition behind HITS is the following. Each web page has two attributes, namely, how good the page is as an authority and how good the page is as a hub in terms of the quality of web pages to which it links. These two attributes mutually reinforce one another: the quality of a page as an authority is determined by the hub quality of pages that have a hyperlink to it, and the quality of a page as a hub is determined by the authority quality of the pages to which it hyperlinks. This reinforcement can be written mathematically in the following manner. Each page p has a hub value h(p) and an authority value a(p). Translating the above intuition into iterative equations, one can obtain a_{t+1}(p) = Σ_{q | q→p} M_qp h_t(q) and h_{t+1}(p) = Σ_{q | p→q} M_pq a_t(q). Since these iterations correspond to linear systems, they converge to their respective fixed points a and h. It is easy to see that a is the principal eigenvector of M^T M and h is the principal eigenvector of M M^T. The quality of a web page p as an authoritative page is then given by its authority value a(p).
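A compact power-iteration sketch of both methods on a toy graph; the graph, iteration count and parameter value are illustrative, and real implementations operate on sparse Web-scale matrices rather than Python dictionaries:

```python
links = {                      # adjacency list: page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": [],                   # dangling page: no outlinks, no inlinks
}
pages = list(links)
N = len(pages)
ALPHA = 0.15                   # teleportation probability

def pagerank(iterations: int = 50):
    pi = {p: 1.0 / N for p in pages}
    for _ in range(iterations):
        nxt = {p: ALPHA / N for p in pages}          # random-jump mass
        for p, outs in links.items():
            targets = outs if outs else pages        # dangling pages jump anywhere
            share = (1 - ALPHA) * pi[p] / len(targets)
            for q in targets:
                nxt[q] += share
        pi = nxt
    return pi

def hits(iterations: int = 50):
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        hub = {p: sum(auth[q] for q in links[p]) for p in pages}
        norm_a = sum(v * v for v in auth.values()) ** 0.5
        norm_h = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / norm_a for p, v in auth.items()}
        hub = {p: v / norm_h for p, v in hub.items()}
    return auth, hub

print(pagerank())
print(hits()[0])               # authority values
```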
There have been several modifications and extensions to Kleinberg's original work; see, for instance, [2,5,6,13]. Chakrabarti et al. [5] used the anchor text of a link to generalize the entries in M to values in [0,1]. Bharat and Henzinger [2] proposed many heuristics, including ones to address the issue of a page receiving an unduly high authority value by virtue of many pages from the same web site pointing to it. Lempel and Moran [13] proposed a HITS-like method with slightly different definitions of authority and hub values. For a comparative account of PageRank and HITS, the reader is referred to the survey by Borodin et al. [3].
Key Applications
As mentioned earlier, web page quality can be a vital input to the effectiveness of search engines. Besides this obvious application, ideas behind PageRank and HITS have been used to tackle many other problems on the web. Four such applications are mentioned below. Gyöngyi et al. [9] developed link analysis methods to identify spam web pages. Their method is based on identifying a small, reputable set of pages and using the hyperlinks to discover more pages that are likely to be good. Rafiei and Mendelzon [14] used random walks and PageRank to compute the topics for which a given web page has a good reputation. Bar-Yossef et al. [1] proposed a random walk with absorbing states to compute the decay value of a web page, where the decay value of a page measures how well-maintained and up-to-date the page is. Gibson et al. [8] used the HITS algorithm as a means of identifying web communities.
Cross-references
▶ Information Retrieval
▶ Web Data Management
Recommended Reading
1. Bar-Yossef Z., Broder A., Kumar R., and Tomkins A. Sic transit gloria telae: towards an understanding of the web's decay. In Proc. 12th Int. World Wide Web Conference, 2004, pp. 328–337.
2. Bharat K. and Henzinger M. Improved algorithms for topic distillation in a hyperlinked environment. In Proc. 21st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1998, pp. 104–111.
3. Borodin A., Roberts G.O., Rosenthal J.S., and Tsaparas P. Link analysis ranking algorithms, theory, and experiments. ACM Trans. Internet Tech., 5:231–297, 2005.
4. Brin S. and Page L. The anatomy of a large-scale hypertextual web search engine. Comput. Netw., 30:107–117, 1998.
5. Chakrabarti S., Dom B., Gibson D., Kleinberg J., Raghavan P., and Rajagopalan S. Automatic resource compilation by analyzing hyperlink structure and associated text. Comput. Netw., 30:65–74, 1998.
6. Chakrabarti S., Dom B., Gibson D., Kumar R., Raghavan P., Rajagopalan S., and Tomkins A. Spectral filtering for resource discovery. In Proc. ACM SIGIR Workshop on Hypertext Analysis, 1998, pp. 13–21.
7. Garfield E. Citation analysis as a tool in journal evaluation. Science, 178:471–479, 1972.
8. Gibson D., Kleinberg J., and Raghavan P. Inferring Web communities from link topology. In Proc. ACM Conference on Hypertext, 1998, pp. 225–234.
9. Gyöngyi Z., Garcia-Molina H., and Pedersen J. Combating web spam with TrustRank. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 576–587.
10. Haveliwala T.H. Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search. IEEE Trans. Knowl. Data Eng., 15:784–796, 2003.
11. Kessler M.M. Bibliographic coupling between scientific papers. Am. Doc., 14:10–25, 1963.
12. Kleinberg J. Authoritative sources in a hyperlinked environment. J. ACM, 46:604–632, 2000.
13. Lempel R. and Moran S. SALSA: the stochastic approach for link-structure analysis. ACM Trans. Inform. Syst., 19:131–160, 2001.
14. Rafiei D. and Mendelzon A.O. What is this page known for? Computing web page reputations. Comput. Netw., 33:823–835, 2000.
15. Small H. Co-citation in the scientific literature: a new measure of the relationship between two documents. J. Am. Soc. Inform. Sci., 24:265–269, 1973.
Web QA ▶ Web Question Answering
Web Query Languages ▶ Semantic Web Query Languages
Recommended Reading 1. Bar-Yossef Z., Broder A., Kumar R., and Tomkins A. Sic transit gloria telae: towards and understanding of the web’s decay. In Proc. 12th Int. World Wide Web Conference, 2004, pp. 328–337. 2. Bharat K. and Henzinger M. Improved algorithms for topic distillation in a hyperlinked environment. In Proc. 21st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1998, pp. 104–111. 3. Borodin A., Roberts G.O., Rosenthal J.S., and Tsaparas P. Link analysis ranking algorithms, theory, and experiments. ACM Trans. Internet Tech., 5:231–297, 2005. 4. Brin S. and Page L. The anatomy of a large-scale hypertextual web search engine. Comput. Netw., 30:107–117, 1998. 5. Chakrabarti S., Dom B., Gibson D., Kleinberg J., Raghavan P., and Rajagopalan S. Automatic resource compilation by
Web Question Answering
Charles L. A. Clarke
University of Waterloo, Waterloo, ON, Canada
Synonyms Web QA
Definition A question answering (QA) system returns exact answers to questions posed by users in natural language, together
with evidence supporting those answers. A Web QA system maintains a corpus of Web pages and other Web resources in order to determine these answers and to provide the required evidence. A basic QA system might support only simple factual (or ‘‘factoid’’) questions. For example, the user might pose the question Q1. What is the population of India? and receive the answer ‘‘1.2 billion,’’ with evidence provided by the CIA World Factbook. A more advanced QA system might support more complex questions, seeking opinions or relationships between entities. Answering these complex questions might require the combination and integration of information from multiple sources. For example, the user might pose the question Q2. What methods are used to transport drugs from Mexico to the U.S.? and hope to receive a summary of information drawn from newspapers articles and similar documents. These two examples bracket the capabilities of current systems. Most QA systems can easily answer Q1, while questions such as Q2 represent an area of active research. It is important to note that this definition of Web QA does not encompass Websites that allow users to answer each other’s questions, nor sites that employ human experts to answer questions.
Historical Background
While experimental question answering systems have existed since the 1960s, research interest in question answering increased substantially in 1999, with the introduction of a question answering track at the Text REtrieval Conference (TREC) [14,19]. TREC is an annual evaluation effort supervised by the US National Institute of Standards and Technology (NIST). Since its inception in 1991, TREC has conducted evaluation efforts for a wide range of information retrieval technologies, including question answering. Inspired by the TREC QA track, Kwok et al. [8] developed one of the earliest Web QA systems. Their MULDER system answered questions by issuing queries to a commercial Web search engine and selecting answers from the pages it returned. They compared the performance of their system to that of several commercial search engines on the task of answering factoid questions taken from the TREC QA track. Also inspired by TREC, Radev et al. [16] developed a method for converting questions into Web queries by
applying techniques borrowed from statistical machine translation. Clarke et al. [5] crawled pages from the Web in order to gather information for answering trivia questions. Work by Dumais et al. [6] demonstrated the value of fully exploiting the volume of information available on the Web. Lam and Özsu [9] applied information extraction techniques to mine answers from multiple Web resources. Much of the work on question answering was conducted outside the context of the Web. A general survey of this work is given by Prager [14]. An overview of many influential methods and systems is provided by Strzalkowski and Harabagiu [17].
Foundations Question answering is an extremely broad area, integrating aspects of natural language processing, machine learning, data mining and information retrieval. This entry focuses primarily on the use of Web resources for question answering. Figure 1 provides a conceptual overview of the architecture of a basic QA system, showing the main components and processing steps. The system analyzes questions entered by the user, generates queries to collections of source material downloaded from the Web, and then selects appropriate answers from the resulting passages and tuples. The architecture shown in this figure emphasizes those components that are related to information retrieval and database systems, representing a composite derived from the published descriptions of a number of experimental systems [5,6,8,12,14,20]. The details of specific QA systems may differ substantially from this figure; an advanced QA system may include many additional components. Sources for question answering are downloaded from the Web using a crawler, which is similar in construction and operation to the crawlers used by commercial Web search engines. After crawling, an inverted index is constructed, to allow the resulting collection to be searched efficiently. Before construction of this inverted index, the crawler may annotate the collection to aid in question answering. For example, named entities, such as people, places, companies, quantities and dates, may be identified [15]. NLP techniques such as part-of-speech tagging and lightweight parsing may be applied. The creation of the index may also incorporate any of the standard indexing techniques associated with Web search, including link analysis.
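The sketch below illustrates the indexing step in miniature: it builds a toy in-memory inverted index with term positions and attaches a crude annotation (a regular expression standing in for a real named-entity tagger). The documents, tokenizer, and annotation rule are illustrative assumptions; a production system would use a disk-based index and full NLP annotators.

```python
from collections import defaultdict
import re

def tokenize(text):
    return re.findall(r"[a-z0-9.]+", text.lower())

def build_index(docs):
    """docs: dict mapping doc_id -> text.
    Returns (postings, annotations): postings maps term -> doc_id -> positions,
    annotations maps doc_id -> set of annotation tags found in the document."""
    postings = defaultdict(lambda: defaultdict(list))
    annotations = defaultdict(set)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            postings[term][doc_id].append(pos)
            if re.fullmatch(r"\d[\d.]*", term):   # crude stand-in for a NUMBER entity tagger
                annotations[doc_id].add("NUMBER")
    return postings, annotations

docs = {
    "d1": "India is home to roughly 1.2 billion people",
    "d2": "The population of India grew during the 1990s",
}
postings, annotations = build_index(docs)
print(postings["india"])   # positions of 'india' in each document
print(annotations["d1"])   # {'NUMBER'}
```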
Web Question Answering. Figure 1. Architecture of a Web question answering system.
Since the operation of a general Web crawler requires substantial effort and resources, many experimental QA systems substitute a commercial Web search engine for this crawling component [8]. To answer questions, one or more queries are issued to the search engine. Answers are then selected from the top documents returned by these queries. As an additional post-processing step, distinct from crawling, the QA system may perform information extraction and text mining to identify events and relationships within the crawled pages. This extracted information, in the form of tuples, is used to populate tables of facts for reference by the QA system, essentially providing ‘‘prepared’’ answers for many factoid questions [2]. For example, associations between books and their authors, or between companies and their CEOs might be established in this way [1,4]. In addition, facts may be collected from preestablished sources that are known to be reliable. For example, current scores and schedules might be scraped from a sports Website; up-to-date movie and TV trivia might be taken from an entertainment site. Wrappers to collect this information might be constructed manually or learned from examples [7]. Since all answers should be appropriately supported by evidence, tuples containing extracted information should include links to the sources from which they are derived.
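A minimal sketch of this kind of extraction: a single hand-written pattern pulls (company, CEO) tuples out of crawled text and records the source URL as evidence. The pattern, the example page, and the tuple schema are assumptions made for illustration; systems such as those cited learn many richer patterns from examples.

```python
import re

# One hand-written pattern standing in for learned extraction patterns:
# "<PERSON>, CEO of <COMPANY>" -> (company, ceo) tuple with a source link.
PATTERN = re.compile(r"([A-Z][\w.]+(?: [A-Z][\w.]+)*), CEO of ([A-Z][\w&.]+(?: [A-Z][\w&.]+)*)")

def extract_ceo_tuples(pages):
    """pages: iterable of (url, text). Yields (company, ceo, source_url) tuples."""
    for url, text in pages:
        for ceo, company in PATTERN.findall(text):
            yield (company, ceo, url)

pages = [
    ("http://example.org/news1", "Jane Doe, CEO of Acme Widgets, announced record profits."),
]
print(list(extract_ceo_tuples(pages)))
# [('Acme Widgets', 'Jane Doe', 'http://example.org/news1')]
```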
The result of these crawling and information extraction steps are two indexed collections. The first collection is a Web corpus of pre-processed Web pages, annotated to support question answering. The second collection is a database of extracted relations, providing pre-computed facts and answers. In its function and form, the first collection resembles a traditional information retrieval system; the second collection resembles a traditional relational database system. These collections are prepared in advance of user queries. Answers are then determined by accessing and merging information from both collections. In a deployed QA system, these collections would be maintained on an on-going basis, much as an incremental crawler maintains the index of a commercial Web search engine. After a user enters a question, the first stage in processing it is a question analysis step, in which an answer type is derived and queries to the retrieval components are constructed. The nature of the response required from the system will vary from question to question. The answer to a question might require a fact (‘‘How many planets are there?’’), a list (‘‘What are the names of the planets?’’), a yes-no response (‘‘Is Pluto a planet?’’), a definition (‘‘What is a dwarf planet?’’), or a description (‘‘Why isn’t Pluto a planet anymore?’’). The role of the answer type is to convey the nature of
the expected response to the answer selection component, providing a mapping of the question onto the domain supported by the system. The depth and detail provided by these answer types vary substantially from system to system. Broadly, the answer type encodes constraints implied by the question, which must be satisfied by the answer. For example, the answer type associated with Q1 might be any of
QUANTITY
POPULATION
POPULATION(COUNTRY(''India''))
where the notation in this example is strictly for illustrative purposes. Depending on the system, an answer type might be encoded in a wide variety of ways. The answer type associated with Q2 might be any of
SNIPPET
SUMMARY
SUMMARY(''transport'', ''drugs'', COUNTRY(''United States''), COUNTRY(''Mexico''))
These examples merely suggest the range of possibilities for expressing an answer type, which may include enumerations, patterns, and trees. Since a QA system should ideally be able to handle any question, many systems will support a catch-all OTHER answer type. In addition to the answer type, the question analysis component generates queries to the two retrieval components: the passage retrieval component, which searches the Web corpus, and the relational retrieval component, which searches the extracted relations. The passage retrieval component searches the Web corpus and returns passages that may contain the answer. At its simplest, a query to the passage retrieval component is a list of keywords and phrases, resembling a typical query to a Web search engine. For example, for Q1 the generated query might be ''population india,'' and for Q2 it might be ''transport drugs mexico''. At its simplest, the relational component may be queried using SQL, returning tuples that may contain the answer. For Q1, the population of India could be retrieved from a table of country populations and passed to the answer selection component. For more complex questions, the queries issued to the retrieval components will be correspondingly more complex. In some cases, multiple queries may be issued to either or both components, particularly when the answer requires more than a simple fact. For example,
as part of processing Q2, the relational component might retrieve lists of illicit drugs, the names of drug cartels and their leaders, and geographic locations along the US/Mexican border. It is also possible for the retrieval components to associate quality or certainty scores with passages and tuples they return, and these scores may then be taken into consideration when selecting an answer. Constraints implied by the answer type might also be incorporated into the query. For example, the query ‘‘population india ’’ includes a annotation tag indicating that a retrieved passage should contain a positive numeric value. The annotations generated during the creation of the Web corpus may allow the passage retrieval component to support structured queries that precisely specify the required content of an answer-bearing passage [3]. For question answering, passage retrieval is normally preferred over the document-oriented retrieval typical of other IR systems. In many cases, answers to questions are contained within a few sentences or paragraphs, and the use of passages reduces the processing load placed on the answer selection component. Even when an answer requires information from multiple sources, the information required from each source can often be found within a short passage. Typically, passages are one or more sentences in length, and the passage retrieval component might return tens or hundreds of passages for analysis by the answer selection component. For Q1, the passage retrieval component might return passages such as: Uttar Pradesh is not only the most populous state in India but is also one of the largest in area . . . The population of India grew by near 20% during the 1990’s to more than 1.1 billion . . . India is home to roughly 1.2 billion people . . . Keywords associated with the query are underlined. As seen in this example, the terms in passages need not match the keywords exactly; matches against morphological variants and other related terms (‘‘populous’’ and ‘‘people’’) may also be permitted. Passage retrieval algorithms for question answering usually treat proximity of query keywords as an important feature for retrieval. Good passages may contain most or all of the query keywords in close proximity [18]. Guided by the answer type, the answer selection component analyzes the passages and tuples returned by the retrieval components to generate answers.
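The following sketch shows one simple way to reward keyword proximity when scoring passages; the scoring formula is an illustrative assumption rather than any specific published algorithm.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9.]+", text.lower())

def passage_score(passage, query_terms):
    """Higher when more query terms match and when the matches cluster tightly."""
    tokens = tokenize(passage)
    positions = [i for i, t in enumerate(tokens) if t in query_terms]
    matched = {tokens[i] for i in positions}
    if not matched:
        return 0.0
    span = (positions[-1] - positions[0] + 1) if len(positions) > 1 else 1
    return len(matched) + (len(matched) - 1) / span   # coverage plus a tightness bonus

query = {"population", "india"}
passages = [
    "Uttar Pradesh is not only the most populous state in India",
    "The population of India grew by nearly 20% during the 1990s",
    "India is home to roughly 1.2 billion people",
]
for p in sorted(passages, key=lambda p: -passage_score(p, query)):
    print(round(passage_score(p, query), 2), p)
```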
Depending on the answer type, the output from this component may be a ranked list, a single best answer, or even a null answer, indicating that the system was unable to answer the question. In determining possible answers, the selection component may start by identifying a set of candidate answers. For example, given an answer type of QUANTITY and the list of passages above, the set of answer candidates might include ‘‘20%,’’ ‘‘1990,’’ ‘‘more than 1.1 billion’’ and ‘‘roughly 1.2 billion.’’ If the answer type is more restrictive, the candidates ‘‘20%’’ and ‘‘1990’’ might be excluded from the set. If a candidate answer appears in multiple passages, this repetition or redundancy may strengthen the support for that candidate. For example, the candidate answers ‘‘more than 1.1 billion’’ and ‘‘roughly 1.2 billion’’ provide redundant support for an answer of ‘‘1.2 billion.’’ Redundancy is an important feature in many answer selection algorithms [5,6]. The answer selection component must balance the support provided by redundancy with other factors, in order to select its final answer. In the leading research systems, answer selection is a complex process, and the development effort underlying this component may substantially outweigh the development effort underlying the rest of the system combined. In these advanced systems, the answer selection component may issue additional queries to the retrieval components as it tests hypotheses and evaluates candidate answers [13]. A discussion of advanced techniques for answer selection is beyond the scope of this entry. Additional information, and references to the literature, may be found in Prager [14]. Evaluation efforts have been undertaken annually by NIST since 1999. Each year, dozens of research groups from industry and academia participate in these experimental efforts. While these efforts do not specifically focus on question answering in a Web context, the evaluation methodologies developed as part of these efforts may be applied to evaluate Web QA systems. Instead of a Web corpus, a test collection of newspaper articles and similar documents is provided by NIST, and this collection forms the source for answers and evidence. However, many of the participating groups augment this collection with Web pages and other resources in order to improve the performance of their systems [20]. These systems use a combination of all available resources to determine an answer, and then project this answer back onto the test collection to provide the required evidence.
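A minimal sketch of redundancy-based answer selection for a QUANTITY answer type: candidates are extracted from each passage with a crude pattern, and the candidate supported by the most passages wins. The candidate extractor and the voting scheme are illustrative assumptions.

```python
import re
from collections import Counter

PATTERN = re.compile(r"\d[\d.]*(?:\s?(?:billion|million|%))?")

def extract_quantity_candidates(passage):
    """Crude QUANTITY extractor: a number with an optional scale word."""
    return [m.group(0).lower() for m in PATTERN.finditer(passage)]

def select_answer(passages):
    """Each passage votes once for every distinct candidate it contains;
    redundancy across passages strengthens support for a candidate."""
    votes = Counter()
    for passage in passages:
        for cand in set(extract_quantity_candidates(passage)):
            votes[cand] += 1
    return votes.most_common()

passages = [
    "The population of India grew by nearly 20% during the 1990s to more than 1.1 billion",
    "India is home to roughly 1.2 billion people",
    "Recent estimates put India at about 1.2 billion inhabitants",
]
print(select_answer(passages))
# '1.2 billion' gets two votes; the other candidates get one vote each
```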
While the details vary from year to year, the basic experimental structure remains the same. The test collection is distributed to participating groups well before the start of a formal experiment. At the start of an experiment, NIST distributes a test set of questions to each group. Participants are free to annotate the test collection and modify their systems prior to receiving this test set, but they must freeze the development of their systems once they receive it. Each group executes the questions using their QA system and returns the answers to NIST, along with supporting evidence. Answers are manually judged by NIST assessors, where the evidence must fully support an answer for it to judged correct. The answers to factoid questions are judged on a binary basis (correct or not). For more complex questions, answers are evaluated against a list of information ‘‘nuggets,’’ where each nugget represents a single piece of knowledge associated with the answer [11]. For example, the use of low-water crossings on the Rio Grande might form a single nugget in the answer to Q2. Answers are then scored on the basis of the nuggets they contain. While all judging is performed manually for the official NIST results, efforts have been made to construct automatic judging tools and re-usable collections, which allow QA system evaluation to take place outside the framework of the official experiments [10,11].
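A simplified sketch of the two kinds of scoring described above: binary accuracy for factoid questions and nugget recall for complex questions. The official TREC measures are more elaborate (for example, precision is derived from a length allowance and combined with recall in an F-measure), so the functions and the hypothetical nugget list below are illustrative only.

```python
def factoid_accuracy(judgments):
    """judgments: list of booleans, one per factoid question (answer judged correct?)."""
    return sum(judgments) / len(judgments)

def nugget_recall(answer_text, nuggets):
    """Fraction of expected nuggets covered by the answer. A nugget counts as
    matched if all of its keywords appear in the answer text -- a crude stand-in
    for the manual matching performed by assessors."""
    answer = answer_text.lower()
    matched = sum(1 for nugget in nuggets
                  if all(word in answer for word in nugget))
    return matched / len(nuggets)

# Hypothetical nuggets for Q2; real nugget lists are written by NIST assessors.
nuggets = [("low-water", "crossings"), ("tunnels", "border"), ("vehicles", "hidden")]
answer = "Smugglers use low-water crossings on the Rio Grande and tunnels under the border."
print(nugget_recall(answer, nuggets))   # 2 of the 3 nuggets are covered
```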
Key Applications By providing exact answers, rather than ranked lists of Web pages, question answering has the potential to reduce the effort associated with finding information on the Web. Commercial search engines already incorporate basic question answering features. For example, most commercial search engines provide an exact answer in response to Q1. However, none currently treat Q2 as anything other than a keyword query. Bridging the gap between these questions would represent a major step in the evolution of Web search.
Data Sets
From 1999 to 2007, the evaluation of experimental QA systems was conducted as part of TREC (http://trec.nist.gov). Starting in 2008, the evaluation of QA systems was incorporated into a new experimental conference series, the Text Analysis Conference (http://www.nist.gov/tac/). Test collections and other experimental data can be found by visiting those sites.
Cross-references
▶ Indexing the Web ▶ Information Retrieval ▶ Inverted Files ▶ Snippet ▶ Text Mining ▶ Web Crawler Architecture ▶ Web Harvesting ▶ Web Information Extraction ▶ Web Search Relevance Ranking
Recommended Reading
1. Agichtein E. and Gravano L. Snowball: Extracting relations from large plain-text collections. In Proc. ACM International Conference on Digital Libraries, 2000, pp. 85–94.
2. Agichtein E. and Gravano L. Querying text databases for efficient information extraction. In Proc. 19th Int. Conf. on Data Engineering, 2003, pp. 113–124.
3. Bilotti M.W., Ogilvie P., Callan J., and Nyberg E. Structured retrieval for question answering. In Proc. 33rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2007, pp. 351–358.
4. Brin S. Extracting patterns and relations from the World Wide Web. In Proc. Int. Workshop on the World Wide Web and Databases, 1998, pp. 172–183.
5. Clarke C.L.A., Cormack G.V., and Lynam T.R. Exploiting redundancy in question answering. In Proc. 24th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001, pp. 358–365.
6. Dumais S., Banko M., Brill E., Lin J., and Ng A. Web question answering: Is more always better? In Proc. 25th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2002, pp. 291–298.
7. Kushmerick N., Weld D.S., and Doorenbos R.B. Wrapper induction for information extraction. In Proc. 15th Int. Joint Conf. on AI, 1997, pp. 729–737.
8. Kwok C., Etzioni O., and Weld D.S. Scaling question answering to the Web. ACM Trans. Inf. Syst., 19(3):242–262, 2001.
9. Lam S.K.S. and Özsu M.T. Querying Web data – the WebQA approach. In Proc. 3rd Int. Conf. on Web Information Systems Eng., 2002, pp. 139–148.
10. Lin J. and Katz B. Building a reusable test collection for question answering. J. Am. Soc. Inf. Sci. Technol., 57(7):851–861, 2006.
11. Marton G. and Radul A. Nuggeteer: Automatic nugget-based evaluation using descriptions and judgements. In Proc. Human Language Technology Conf. of the North American Chapter of the Association of Computational Linguistics, 2006, pp. 375–382.
12. Narayanan S. and Harabagiu S. Question answering based on semantic structures. In Proc. 20th Int. Conf. on Computational Linguistics, 2004, pp. 693–701.
13. Pasca M.A. and Harabagiu S.M. High performance question/answering. In Proc. 24th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001, pp. 366–374.
14. Prager J. Open-domain question-answering. Found. Trends Inf. Retr., 1(2):91–231, 2006.
15. Prager J., Brown E., Coden A., and Radev D. Question-answering by predictive annotation. In Proc. 23rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2000, pp. 184–191.
16. Radev D.R., Qi H., Zheng Z., Blair-Goldensohn S., Zhang Z., Fan W., and Prager J. Mining the Web for answers to natural language questions. In Proc. Int. Conf. on Information and Knowledge Management, 2001, pp. 143–150.
17. Strzalkowski T. and Harabagiu S. (eds.). Advances in Open Domain Question Answering. Springer, Secaucus, NJ, USA, 2006.
18. Tellex S., Katz B., Lin J., Fernandes A., and Marton G. Quantitative evaluation of passage retrieval algorithms for question answering. In Proc. 26th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 41–47.
19. Voorhees E.M. Question answering in TREC. In TREC: Experiment and Evaluation in Information Retrieval, E.M. Voorhees and D.K. Harman (eds.). MIT, Cambridge, MA, USA, 2005, pp. 233–257.
20. Yang H., Chua T.-S., Wang S., and Koh C.-K. Structured use of external knowledge for event-based open domain question answering. In Proc. 26th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 33–40.
Web Resource Discovery ▶ Focused Web Crawling
Web Scraper ▶ Web Data Extraction System
Web Scraping ▶ Languages for Web Data Extraction
Web Search and Crawling ▶ Biomedical Scientific Textual Data Types and Processing
Web Search Engines ▶ WEB Information Retrieval Models
Web Search Query Rewriting
Rosie Jones 1, Fuchun Peng 2
1 Yahoo! Research, Burbank, CA, USA
2 Yahoo! Inc., Sunnyvale, CA, USA
Synonyms Query reformulation; Query expansion; Query assistance; Query suggestion
Definition
Query rewriting in Web search refers to the process of reformulating an original input query to a new query in order to achieve better search results. Reformulation includes but is not limited to the following:
1. Adding additional terms to express the search intent more accurately
2. Deleting redundant terms or re-weighting the terms in the original query to emphasize important terms
3. Finding alternative morphological forms of words by stemming each word, and searching for the alternative forms as well
4. Finding synonyms of words, and searching for the synonyms as well
5. Fixing spelling errors and automatically searching for the corrected form or suggesting it in the results
Historical Background
Web search queries are the words users type into web search engines to express their information need. These queries are typically 2–3 words long [7]. Traditional information retrieval allows retrieval of documents that do not contain all of the query words. While early search engines such as Lycos and AltaVista also retrieved documents which contained a subset of the query terms, current search engines such as Yahoo, Google, and MSN typically require web documents to contain all of the query terms (with the exception of stop-words – words such as prepositions and conjunctions which do not have a large impact on overall topic meaning). This is generally an appropriate behavior for web search, since the large number of documents available ensures that some documents will be found, and the overall task is then reduced to ranking those documents. It also greatly reduces the number of documents which must be considered. A typographical error, such as a misspelling, or use of non-conventional terminology (such as colloquial rather than medical terminology), or use of an ambiguous term can lead the search engine to retrieve fewer or different documents than the full set of documents relevant to the user's intent. The tendency of web searchers to look just at the first 10–20 results [8] exacerbates the severity of this, since it is then even more important to have the most relevant documents at the top of the list. Most web search engines offer interactive or automated query rewriting, to correct spelling, suggest synonyms for terms, and suggest related topics and subtopics the user might be interested in. In interactive query rewriting, the web searcher is shown one or more options for a spell correction or rephrasing of a query [1]. In automated query rewriting, a query is automatically modified to correct spelling, add terms or otherwise modify the original query, and then used to retrieve documents without further interaction from the user [12]. Automatic query rewriting needs to be accurate in order to avoid query intent drifting. Hence, most query rewriting techniques are for interactive query reformulation. The precursors to query rewriting in web search are spell correction (first used in word processing), and relevance and pseudo-relevance feedback [3] in information retrieval.

Foundations

Stemming
Stemming is the process of merging words with different morphological forms, such as singular and plural nouns, and verbs in past and present tense. Stemming is a long studied technology. Many stemmers have been developed, such as the Lovins stemmer [11] and the Porter stemmer [13]. The Porter stemmer is widely used due to its simplicity and effectiveness in many applications. However, the Porter stemming makes many mistakes because its simple rules cannot fully describe English morphology. Corpus analysis is used to improve Porter stemmer [17] by creating equivalence classes for words that are morphologically similar and occur in similar context.
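A minimal sketch of stem-based conflation, assuming the NLTK package and its PorterStemmer are available; the vocabulary is an illustrative assumption. Grouping words by their stem yields the kind of equivalence classes mentioned above (and also exposes Porter's occasional over- and under-conflation).

```python
from collections import defaultdict
from nltk.stem import PorterStemmer   # assumes the NLTK package is installed

stemmer = PorterStemmer()

def stem_classes(vocabulary):
    """Group words that the Porter stemmer maps to the same root."""
    classes = defaultdict(set)
    for word in vocabulary:
        classes[stemmer.stem(word)].add(word)
    return classes

vocab = ["run", "running", "runs", "runner",
         "populous", "population", "marching", "march"]
for root, words in stem_classes(vocab).items():
    print(root, sorted(words))
```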
Traditionally, stemming has been applied to Information Retrieval tasks by transforming words in documents to their root form before indexing, and applying a similar transformation to query terms. Although it increases recall, this naive strategy does not work well for Web search since it lowers precision. Recently, a context-sensitive stemming approach based on statistical machine translation [12] has been successfully applied to Web search to improve both recall and precision, by considering the likelihood of different variants occurring in web pages. The most likely variants are used to expand the query, and both the original query and the modified version can be used to retrieve web pages.
Spelling Correction
In Web search, spelling correction takes a user input query and provides a ''corrected'' form. Spelling correction is typically modeled as a noisy channel process. Separate models are built for character-level errors (edit models, which represent probabilities of character-level insertions, substitutions and deletions) and for overall word or word sequence probability (typically referred to as a language model) [2]. Web query logs can be used to improve the quality of both models, by providing data for the language model as well as the edit model [4]. Spell correction is commonly used as a query rewriting step before retrieving documents in web search, though the user is also sometimes consulted, with an interface asking ''Did you mean ...?'' Recently, major search engines have taken spelling correction one step further by directly retrieving results with the corrected query instead of asking users to click on ''Did you mean ...?'' suggestions. This significantly improves the user experience, as users may not notice that their original input query is misspelled. For example, a query ''brittney spears'' input to major search engines can now directly retrieve results for the correct form ''Britney Spears.'' However, it is also quite risky, as it largely depends on the quality of misspelling correction. If a wrong suggestion is used, the retrieved search results would be totally different from the original user intent. Thus, this implicit query rewriting has to be very conservative.
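The noisy-channel formulation can be sketched in a few lines: each candidate correction is scored by a language-model prior (here, toy query-log frequencies) multiplied by an error model (here, a simple function of edit distance). The candidate list, frequencies, and error rate are illustrative assumptions; real systems use much richer models on both sides.

```python
# Toy "language model": relative frequencies, e.g., mined from a query log.
QUERY_FREQ = {"britney spears": 0.9, "brittany spears": 0.08, "brittney spears": 0.02}

def edit_distance(a, b):
    """Standard Levenshtein distance (dynamic programming, one rolling row)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def correct(query, error_rate=0.1):
    """Noisy-channel scoring: P(candidate) * P(query | candidate),
    with the error model approximated as error_rate ** edit_distance."""
    scored = [(freq * error_rate ** edit_distance(query, cand), cand)
              for cand, freq in QUERY_FREQ.items()]
    return max(scored)[1]

print(correct("brittney spears"))   # -> 'britney spears': the prior outweighs the single edit
```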
Query Expansion
Query expansion adds words to a query to better reflect the search intent. For example, the query ''palo alto real estate'' may be expanded to ''palo alto California real estate'' if it is known that the user is interested in the city of ''Palo Alto'' in California. The expanded word ''California'' can be inferred from many features, such as IP address, session analysis, and geographic mapping information. Another type of expansion is to expand an abbreviated form to its full form. For example, ''HISD'' can be expanded to ''Houston Independent School District.'' This is referred to as abbreviation or acronym rewriting. One word may be mapped to multiple expansions; for example, ''abc'' could mean ''American border collie,'' ''american baseball coaches,'' ''American born Chinese'' or ''american broadcasting company.'' Depending on the query context, it can be expanded to the correct form [16]. Query expansion can be done on the concept level instead of the individual word level [6,14]. For example, ''job hunting'' can be expanded to ''recruiting'' or ''employment opportunity.'' One way of extracting concepts from a query is through query segmentation [15].
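A minimal sketch of corpus-driven expansion: candidate expansion terms are ranked by how often they co-occur with the full query in a small collection of texts (which could equally be query-log sessions or documents). The texts and the scoring rule are illustrative assumptions.

```python
from collections import Counter
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def expansion_candidates(query, texts, top_k=3):
    """Rank terms that co-occur with all of the query terms in the same text."""
    query_terms = set(tokenize(query))
    counts = Counter()
    for text in texts:
        tokens = set(tokenize(text))
        if query_terms <= tokens:                 # the text mentions the whole query
            counts.update(tokens - query_terms)   # credit the other terms it contains
    return [term for term, _ in counts.most_common(top_k)]

texts = [
    "palo alto california real estate listings",
    "real estate agents in palo alto california",
    "palo alto weather forecast",
]
print(expansion_candidates("palo alto real estate", texts))
# e.g. ['california', ...] -- 'california' co-occurs with the query in both matching texts
```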
Query Substitution and Suggestion
Some queries can be reformulated with a totally different query through query substitution and suggestion. Approaches to query substitution and suggestion involve identifying pairs of queries for which there is evidence that their meaning is related. One approach looks at sequences of queries issued by web searchers as they reformulate and modify their queries [10]. Others identify pairs of queries as related if they lead to clicks on the same documents or retrieve similar documents [5]. For example, the query ''the biggest sports event this year'' and the query ''olympic 2008'' have the same intent (assuming one infers that this year in the first query is 2008).
Query Word Deletion
It is obvious that stop words can be dropped in most cases in search. Some non-stop words can also be dropped [9]. Query word deletion is not as popular as query expansion, as most queries have enough matching documents in the whole Web. Still, word deletion from a query is important, as it can help improve the recall of tail queries (queries that are not very frequent).
Key Applications
Query rewriting and suggestion is useful for improving the performance of web search, as well as, more generally, for identifying related terms which can be used for other tasks.
Future Directions
One area of research for query rewriting is how to detect query intent drifting after rewriting. For example, stemming the query ''marching network'' to ''march networks,'' or ''blue steel'' to ''blued steel,'' is bad since the query after rewriting has a different intent from the input. How to model intent drifting is a challenging task in Web search.
Cross-references
▶ Query Expansion ▶ Query Expansion models ▶ Query Rewriting ▶ Web Search Relevance Feedback
Recommended Reading
1. Anick P. Using terminological feedback for Web search refinement – a log-based study. In Proc. 26th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 88–95.
2. Brill E. and Moore R.C. An improved error model for noisy channel spelling correction. In Proc. 38th Annual Meeting of the Assoc. for Computational Linguistics, 2000, pp. 86–293.
3. Croft W.B. and Harper D.J. Using probabilistic models of document retrieval without relevance information. J. Doc., 35(4):285–295, 1979.
4. Cucerzan S. and Brill E. Spelling correction as an iterative process that exploits the collective knowledge of Web users. In Proc. Conf. on Empirical Methods in Natural Language Processing, 2004, pp. 293–300.
5. Cui H., Wen J.R., Nie J.Y., and Ma W.Y. Probabilistic query expansion using query logs. In Proc. 11th Int. World Wide Web Conference, 2002, pp. 325–332.
6. Fonseca B.M., Golgher P., Pssas B., Ribeiro-Neto B., and Ziviani N. Concept-based interactive query expansion. In Proc. 14th ACM Int. Conf. on Information and Knowledge Management, 2008, pp. 696–703.
7. Jansen B.J., Spink A., and Saracevic T. Real life, real users, and real needs: a study and analysis of user queries on the web. Inf. Process. Manage. Int. J., 36(2):207–227, 2000.
8. Joachims T., Granka L., Pang B., Hembrooke H., and Gay G. Accurately interpreting clickthrough data as implicit feedback. In Proc. 31st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005, pp. 154–161.
9. Jones R. and Fain D. Query word deletion prediction. In Proc. 26th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 435–436.
10. Jones R., Rey B., Madani O., and Greiner W. Generating query substitutions. In Proc. 15th Int. World Wide Web Conference, 2006, pp. 387–396.
11. Lovins J.B. Development of a stemming algorithm. Mech. Translat. Comput. Ling., 2:22–31, 1968.
12. Peng F., Ahmed N., Li X., and Lu Y. Context sensitive stemming for Web search. In Proc. 33rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2007, pp. 639–646.
13. Porter M.F. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
14. Qiu Y. and Frei H.P. Concept based query expansion. In Proc. 16th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1993, pp. 160–169.
15. Tan B. and Peng F. Unsupervised query segmentation using generative language models and Wikipedia. In Proc. 17th Int. World Wide Web Conference, 2008, pp. 347–356.
16. Wei X., Peng F., and Dumoulin B. Analyzing Web text association to disambiguate abbreviation in queries. In Proc. 34th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2008, pp. 751–752.
17. Xu J. and Croft B. Query expansion using local and global document analysis. In Proc. 19th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1996, pp. 4–11.
Web Search Relevance Feedback
Hui Fang 1, ChengXiang Zhai 2
1 University of Delaware, Newark, DE, USA
2 University of Illinois at Urbana-Champaign, Urbana, IL, USA
Definition Relevance feedback refers to an interactive cycle that helps to improve the retrieval performance based on the relevance judgments provided by a user. Specifically, when a user issues a query to describe an information need, an information retrieval system would first return a set of initial results and then ask the user to judge whether some information items (typically documents or passages) are relevant or not. After that, the system would reformulate the query based on the collected feedback information, and return a set of retrieval results, which presumably would be better than the initial retrieval results. This procedure could be repeated.
Historical Background Quality of retrieval results highly depends on how effective a user’s query (usually a set of keywords) is in distinguishing relevant documents from non-relevant ones. Ideally, the keywords used in the query should occur only in the relevant documents and not in any non-relevant document. Unfortunately, in reality, it is often difficult for a user to come up with good keywords,
mainly because the same concept can be described using different words and the user often has no clue about which words are actually used in the relevant documents of a collection. A main motivation for relevance feedback comes from the observation that although it may be difficult for a user to formulate a good query, it is often much easier for the user to judge whether a document or a passage is relevant. Relevance feedback was first studied in the vector space model by Rocchio [8]. After that, many other relevance feedback methods have been proposed and studied in other retrieval models [6,15], including classical probabilistic models and language modeling approach. Although these methods are different, they all try to make the best use of the relevance judgments provided by the user and generally rely on some kind of learning mechanism to learn a more accurate representation of the user’s information need. Relevance feedback has proven to be one of the most effective methods for improving retrieval performance [8,6,10]. Unfortunately, due to the extra effort that a user must make in relevance feedback, users are often reluctant to provide such explicit relevance feedback. When there are no explicit relevance judgments available, pseudo feedback may be performed, also known as blind relevance feedback [2,7]. In this method, a small number of top-ranked documents in the initial retrieval results are assumed to be relevant, and relevance feedback is then applied. This method also tends to improve performance on average (especially recall). However, pseudo feedback method does not always work well, especially for queries where none of the top N documents is relevant. The poor performance is expected because the non-relevant documents are assumed to be relevant, which causes the reformulated query to shift away from the original information need. Somewhere inbetween relevance feedback and pseudo feedback is a technique called implicit feedback [5,4,11]. In implicit feedback, a user’s actions in interacting with a system (e.g., clickthroughs) are used to infer the user’s information need. For example, a result viewed by a user may be regarded as relevant (or more likely relevant than a result skipped by a user). Search logs serve as the primary data for implicit feedback; their availability has recently stimulated a lot of research in learning from search logs to improve retrieval accuracy (e.g., [3]).
Relevance feedback bears some similarity to query expansion in the sense that both methods attempt to select useful terms to expand the original query for improving retrieval performance. However, these two methods are not exactly same. First, query expansion does not necessarily assume the availability of relevance judgments. It focuses on identifying useful terms that could be used to further elaborate the user’s information need. The terms can come from many different sources, not just feedback documents (often called local methods for query expansion). For example, they can also come from any document in the entire collection (called global methods) [14] or external resources such as a thesaurus. Second, while relevance feedback is often realized through query expansion, it can be otherwise. For example, when all available examples are non-relevant, other techniques than query expansion may be more appropriate [13].
Foundations The basic procedure for relevance feedback includes the following steps. A user first issues a query, and the system returns a set of initial retrieval results. After returning the initial results, the user is asked to judge whether the presented information items (e.g., documents or passages) are relevant or non-relevant. Finally, the system revises the original query based on the collected relevance judgments typically by adding additional terms extracted from relevant documents, promoting weights of terms that occur often in relevant documents, but not so often in non-relevant documents, and down-weighting frequent terms in non-relevant documents. The revised query is then executed and a new set of results is returned. The procedure can be repeated more than once. Technically, relevance feedback is a learning problem in which one learns from the relevant and nonrelevant document examples how to distinguish new relevant documents from non-relevant documents. The learning problem can be cast as to learn either a binary classifier that can classify a document into relevant vs. non-relevant categories or a ranker that can rank relevant documents above non-relevant documents. Thus in principle, any standard supervised learning methods can be potentially applied to perform relevance feedback. However, as a learning problem, relevance feedback poses several special challenges, and as a result, a direct application of a standard supervised learning method is often not very
effective. First, many supervised learning methods only work well when there are a relatively large number of training examples, but in the case of relevance feedback, the number of positive training examples is generally very small. Second, the training examples in relevance feedback are usually from the top-ranked documents, thus they are not a random sample and can be biased. Third, it is unclear how to handle the original query. The query may be treated as a special short example of relevant documents, but intuitively this special relevant example is much more important than other relevant examples, and the terms in the query should be sufficiently emphasized to avoid drifting away from the original query concept. These challenges are exacerbated in the case of pseudo feedback, as the query would be the only reliable relevance information provided by the user. In the information retrieval community, many relevance feedback algorithms have been developed for different retrieval models. A commonly used feedback algorithm in vector space models is the Rocchio algorithm developed in the early 1970s [8], which remains a robust and effective state-of-the-art method for relevance feedback. In Rocchio, the problem of relevance feedback is defined as finding an optimal query to maximize its similarity to relevant documents and minimize its similarity to non-relevant documents. The revised query can be computed as
\[ Q_r = \alpha\, Q_o + \frac{\beta}{|D_r|} \sum_{d_i \in D_r} d_i \;-\; \frac{\gamma}{|D_n|} \sum_{d_j \in D_n} d_j \]
where $Q_r$ is the revised query vector, $Q_o$ is the original query vector, and $D_r$ and $D_n$ are the sets of vectors for the known relevant and non-relevant documents, respectively. $\alpha$, $\beta$, and $\gamma$ are the parameters that control the contributions from the two sets of documents and the original query. Empirical results show that positive feedback is much more effective than negative feedback, so the optimal value of $\gamma$ is often much smaller than that of $\beta$. A term would have a higher weight in the new query vector if it occurs frequently in relevant documents but infrequently in non-relevant documents. The IDF weighting further rewards a term that is rare in the whole collection. The feedback method in classical probabilistic models is to select expanded terms primarily based on the Robertson/Sparck-Jones weight [6], defined as follows:
\[ w_i = \log \frac{r_i / (R - r_i)}{(n_i - r_i) / (N - n_i - R + r_i)} \]
where wi is the weight of term i, ri is the number of relevant documents containing term i, ni is the number of documents containing term i, R is the number of relevant documents in the collection, and N is the number of documents in the collection. In language modeling approaches, feedback can be achieved through updating the query language model based on feedback information, leading to model-based feedback methods[15]. Relevance feedback received little attention in logical model, but some possible feedback methods have been discussed in [9]. Despite the difference in their way of performing relevance feedback, all these feedback methods implement the same intuition, which is to improve query representation by introducing and rewarding terms that occur frequently in relevant documents but infrequently in non-relevant documents. When optimized, most of these methods tend to perform similarly. Improvements over these basic feedback methods include (1) passage feedback [1] which can filter out the non-relevant part of a long relevant document in feedback, (2) query zone [12] which can improve the use of negative feedback information, and (3) pure negative feedback [13] which aims at exploiting purely negative information to improve performance for difficult topics. Most studies focus on how to use the relevance judgments to improve the retrieval performance. The issue of choosing optimally documents to judge for relevance feedback has received considerably less attention. Recent studies on active feedback [11] have shown that choosing documents with more diversity for feedback is a better strategy than choosing the most relevant documents for feedback in the sense that the relevance feedback information collected with the former strategy is more useful for learning. However, this benefit may be at the price of sacrificing the utility of presented documents from a user’s perspective. Although relevance feedback improves retrieval accuracy, the improvement comes at the price of possibly slowing down the retrieval speed. Indeed, virtually all relevance feedback algorithms involve iterating over all the terms in all the judged examples, so the computational overhead is not negligible, making it a challenge to use relevance feedback in applications where response speed is critical.
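A minimal sketch of the Rocchio update given above, operating on raw term-frequency vectors; the parameter values and the toy documents are illustrative assumptions, and a real implementation would also apply IDF weighting.

```python
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: Q_r = alpha*Q_o + beta*centroid(D_r) - gamma*centroid(D_n)."""
    new_query = Counter()
    for term, weight in vectorize(query).items():
        new_query[term] += alpha * weight
    for doc in relevant:
        for term, weight in vectorize(doc).items():
            new_query[term] += beta * weight / len(relevant)
    for doc in nonrelevant:
        for term, weight in vectorize(doc).items():
            new_query[term] -= gamma * weight / len(nonrelevant)
    return {t: w for t, w in new_query.items() if w > 0}   # keep positive weights only

relevant = ["jaguar big cat habitat rainforest", "jaguar cat species rainforest"]
nonrelevant = ["jaguar car dealership prices"]
print(rocchio("jaguar", relevant, nonrelevant))
# terms like 'cat' and 'rainforest' gain weight; 'car' and 'dealership' are kept out
```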
Key Applications Relevance feedback is a general effective technique for improving retrieval accuracy for all search engines, which is applicable when users are willing to provide explicit relevance judgments. Relevance feedback is particularly useful when it is difficult to completely and precisely specify an information need with just keywords (e.g., multimedia search); in such a case, relevance feedback can be exploited to learn features that can characterize the information need from the judged examples (e.g., useful non-textual features in multimedia search) and combine such features with the original query to improve retrieval results. Relevance feedback is also a key technique in personalized search where both explicit and implicit feedback information could be learned from all the collected information about a user to improve the current search results for the user. Despite the fact that relevance feedback is a mature technology, it is not a popular feature provided by the current web search engines possibly because of the computational overhead. However, the feature ‘‘finding similar pages’’ provided by Google can be regarded as relevance feedback with just one relevant document.
Future Directions Although relevance feedback has been studied for decades and many effective methods have been proposed, it is still unclear what is the optimal way of performing relevance feedback. The recent trend of applying machine learning to information retrieval, especially research on learning to rank and statistical language models, will likely shed light on the answer to this long-standing question. So far, research on relevance feedback has focused on cases where at least several relevant examples can be obtained in the top-ranked documents. However, when a topic is difficult, a user may not see any relevant document in the top-ranked documents. In order to help such a user, negative feedback (i.e., relevance feedback with only non-relevant examples) must be performed. Traditionally negative feedback has not received much attention due to the fact that when relevant examples are available, negative feedback information tends not to be very useful. An important future research direction is to study how to exploit negative feedback to improve retrieval accuracy for difficult topics as argued in [13].
Active feedback [11] is another important topic worth further studying. Indeed, to minimize a user’s effort in providing explicit feedback, it is important to select the most informative examples for a user to judge so that the system can learn most from the judged examples. It would be interesting to explore how to apply/adapt many active learning techniques developed in the machine learning community to perform active feedback.
Experimental Results The effectiveness of a relevance feedback method is usually evaluated using the standard information retrieval evaluation methodology and test collections. A standard test collection includes a document collection, a set of queries and judgments indicating whether a document is relevant to a query. The initial retrieval results are compared with the feedback results to show whether feedback improves performance. The performance is often measured using Mean Average Precision (MAP) which reflects the overall ranking accuracy or precision at top k (e.g., top 10) documents which reflects how many relevant documents a user can expect to see in the top-ranked documents. When comparing different relevance feedback methods, the amount of feedback information is often controlled so that every method uses the same judged documents. The judged or seen documents are usually excluded when computing the performance of a method to more accurately evaluate the performance of a method on unseen documents. The evaluation is significantly more challenging when the amount of feedback information can not be controlled, such as when different strategies for selecting documents for feedback are compared or a feedback method and a non-feedback method are compared. In such cases, it is tricky how to handle the judged documents. If they are not excluded when computing the performance of a method, the comparison would be unfair and favor a feedback that has received more judged examples. However, if they are excluded, the performance would not be comparable either because the test set used to compute the performance of each method would likely be different and may contain a different set of relevant and non-relevant documents.
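The two measures mentioned above are simple to compute once a ranked list and relevance judgments are available; the sketch below uses made-up document identifiers and judgments purely for illustration.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, 1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example with two queries (document ids and judgments are made up).
runs = [(["d3", "d1", "d7", "d2"], {"d1", "d2"}),
        (["d5", "d9", "d4"], {"d4"})]
print(mean_average_precision(runs))                      # about 0.42 for these toy judgments
print(precision_at_k(["d3", "d1", "d7"], {"d1"}, k=3))   # 1 relevant document in the top 3
```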
Data Sets Many standard information retrieval test collections can be found at: http://trec.nist.gov/
URL to Code
The Lemur project contains the code for most existing relevance feedback methods; it can be found at: http://lemurproject.org/.
Cross-references ▶ Implicit Feedback ▶ Pseudo Feedback ▶ Query Expansion
Recommended Reading 1. Allan J. Relevance feedback with too much data. In Proc. 18th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1995, pp. 337–343. 2. Buckley C. Automatic query expansion using SMART: TREC-3. In Overview of the Third Text Retrieval Conference (TREC-3), D. Harman (ed.), 1995, pp. 69–80. 3. Burges C., Shaked T., Renshaw E., Lazier A., Deeds M., Hamilton N., and Hullender G. Learning to rank using gradient descent. In Proc. 22nd Int. Conf. on Machine Learning, 2005, pp. 89–96. 4. Joachims T. Optimizing search engines using clickthrough data. In Proc. 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2002, pp. 133–142. 5. Kelly D. and Teevan J. Implicit feedback for inferring user preference. SIGIR Forum, 37(2):18–28, 2003. 6. Robertson S.E. and Jones K.S. Relevance weighting of search terms. J. Am. Soc. Inf. Sci., 27(3):129–146, 1976. 7. Robertson S.E., Walker S., Jones S., Hancock-Beaulieu M.M., and Gatford M. Okapi at TREC-3. In Proc. The 3rd Text Retrieval Conference, 1995, pp. 109–126. 8. Rocchio J. Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, NJ, 1971, pp. 313–323. 9. Ruthven I. and Lalmas M. A survey on the use of relevance feedback for information access system. Knowl. Eng. Rev., 18(2):95–145, 2003. 10. Salton G. and Buckley C. Improving retrieval performance by relevance feedback. J. Am. Soc. Inf. Sci., 44(4):288–297, 1990. 11. Shen X. and Zhai C. Active feedback in ad hoc information retrieval. In Proc. 31st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2005, pp. 59–66. 12. Singhal A., Mitra M., and Buckley C. Learning routing queries in a query zone. In Proc. 20th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1997, pp. 25–32. 13. Wang X., Fang H., and Zhai C. A study of methods for negative relevance feedback. In Proc. 34th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2008, pp. 219–226. 14. Xu J. and Croft W.B. Query expansion using local and global document analysis. In Proc. 19th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1996, pp. 4–11.
15. Zhai C. and Lafferty J. Model-based feedback in the language modeling approach to information retrieval. In Proc. Int. Conf. on Information and Knowledge Management, 2001, pp. 403–410.
Web Search Relevance Ranking
Hugo Zaragoza 1, Marc Najork 2
1 Yahoo! Research, Barcelona, Spain
2 Microsoft Research, Mountain View, CA, USA
Synonyms Ranking; Search ranking; Result ranking
Definition
Web search engines return lists of web pages sorted by the page's relevance to the user query. The problem of web search relevance ranking is to estimate the relevance of a page to a query. Nowadays, commercial web-page search engines combine hundreds of features to estimate relevance. The specific features and their mode of combination are kept secret to fight spammers and competitors. Nevertheless, the main types of features in use, as well as the methods for their combination, are publicly known and are the subject of scientific investigation.
Historical Background
Information Retrieval (IR) systems are the predecessors of Web search engines. These systems were designed to retrieve documents in curated digital collections such as library abstracts, corporate documents, news, etc. Traditionally, IR relevance ranking algorithms were designed to obtain high recall on medium-sized document collections using long, detailed queries. Furthermore, textual documents in these collections had little or no structure or hyperlinks. Web search engines incorporated many of the principles and algorithms of Information Retrieval systems, but had to adapt and extend them to fit their needs. Early Web search engines such as Lycos and AltaVista concentrated on the scalability issues of running web search engines using traditional relevance ranking algorithms. Newer search engines, such as Google, exploited web-specific relevance features such as hyperlinks to obtain significant gains in quality. These measures were partly motivated by research in citation analysis carried out in the bibliometrics field.
Foundations For most queries, there exist thousands of documents containing some or all of the terms in the query. A search engine needs to rank them in some appropriate way so that the first few results shown to the user will be the ones that are most pertinent to the user’s need. The interest of a document with respect to the user query is referred to as ‘‘document relevance.’’ this quantity is usually unknown and must be estimated from features of the document, the query, the user history or the web in general. Relevance ranking loosely refers to the different features and algorithms used to estimate the relevance of documents and to sort them appropriately. The most basic retrieval function would be a Boolean query on the presence or absence of terms in documents. Given a query ‘‘word1 word2’’ the Boolean AND query would return all documents containing the terms word1 and word2 at least once. These documents are referred to as the query’s ‘‘AND result set’’ and represent the set of potentially relevant documents; all documents not in this set could be considered irrelevant and ignored. This is usually the first step in web search relevance ranking. It greatly reduces the number of documents to be considered for ranking, but it does not rank the documents in the result set. For this, each document needs to be ‘‘scored’’, that is, the document’s relevance needs to be estimated as a function of its relevance features. Contemporary search engines use hundreds of features. These features and their combination are kept secret to fight spam and competitors. Nevertheless, the general classes of employed features are publicly known and are the subject of scientific investigation. The main types of relevance features are described in the remainder of this section, roughly in order of importance. Note that some features are query-dependent and some are not. This is an important distinction because query-independent features are constant with respect to the user query and can be pre-computed off-line. Query-dependent features, on the other hand, need to be computed at search time or cached. Textual Relevance
Modern web search engines include tens or hundreds of features which measure the textual relevance of a page. The most important of these features are matching functions which determine the term similarity to the query. Some of these matching functions depend
only on the frequency of occurrence of query terms; others depend on the page structure, term positions, graphical layout, etc. In order to compare the query and the document, it is necessary to carry out some non-trivial preprocessing steps: tokenization (splitting the string into word units), letter case and spelling normalization, etc. Beyond these standard preprocessing steps, modern web search engines carry out more complex query reformulations which allow them to resolve acronyms, detect phrases, etc. One of the earliest textual relevance features (earliest both in information retrieval systems and later in commercial web search engines) is the vector space model scoring function. This feature was used by early search engines. Since then other scoring models have been developed in Information Retrieval (e.g., Language Models and Probabilistic Relevance Models) [8] and have probably been adopted by web search engines. Although web search engines do not disclose details about their textual relevance features, it is known that they use a wide variety of them, ranging from simple word counts to complex nonlinear functions of the match frequencies in the document and in the collection. Furthermore, web search engines make use of the relative and absolute position of the matches in the document. In Information Retrieval publications, there have been many different proposals to make use of term position information, but no consensus has been reached yet on the best way to use it. Most known approaches are based on features of the relative distances of the match terms such as the minimum (or average, or maximum) size of the text span containing all (or some, or most) of the term matches. Web search engines have not disclosed how they use position information. Besides match position, web search engines exploit the structure or layout of documents, especially HTML documents. There are a number of ways to do this. One of the simplest is to compute textual similarity with respect to each document element (title, subtitles, paragraphs). More complex solutions integrate matches of different structural elements into a single textual relevance score (see for example [8]). Another type of textual relevance information is provided by the overall document quality. For example, Web search engines use automatic document classifiers to detect specific document genres such as adult content, commercial sites, etc. Specialized techniques
are also used to detect spam pages. Pages may be eliminated or demoted depending on the result of these analyses.
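To make the vector-space idea above concrete, the following is a minimal sketch, under illustrative assumptions, of TF-IDF weighting and cosine scoring between a query and a handful of documents; the toy documents, the whitespace tokenizer, and the logarithmic IDF form are assumptions for illustration, not the weighting used by any particular engine.

```python
import math
from collections import Counter

def tokenize(text):
    # Simplistic preprocessing: lowercase and split on whitespace.
    return text.lower().split()

def tf_idf_vectors(docs):
    """Build a TF-IDF vector (term -> weight) for each document in a tiny collection."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    df = Counter()
    for counts in term_counts:
        df.update(counts.keys())                 # document frequency of each term
    n = len(docs)
    idf = {t: math.log(1 + n / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in counts.items()} for counts in term_counts], idf

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = ["jaguar car dealer and prices", "jaguar habitat in the rainforest", "used car prices"]
doc_vectors, idf = tf_idf_vectors(docs)
query_counts = Counter(tokenize("jaguar car"))
query_vector = {t: tf * idf.get(t, 0.0) for t, tf in query_counts.items()}
ranking = sorted(range(len(docs)), key=lambda i: cosine(query_vector, doc_vectors[i]), reverse=True)
print(ranking)  # document indices ordered by estimated textual relevance to the query
```

In practice an engine combines many such matching functions (computed separately over title, body, anchor text, and other document elements) rather than a single cosine score over the full text.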
Hyperlink Relevance
The web is a hyperlinked collection of documents (unlike most previously existing digital collections, which had only implicit references). A hyperlink links a span of text in the source page (the ''anchor text'') to a target page (the ''linked page''). Because of this, one can think of a hyperlink as a reference, an endorsement, or a vote by the source page on the target page. Similarly, one can think of the anchor text as a description or an explanation of the endorsement. One of the innovations introduced by Web search engines was leveraging the hyperlink structure of the web for relevance ranking purposes. The hyperlink graph structure can be used to determine the importance of a page independently of the textual content of the pages. This idea of using web hyperlinks as endorsements was originally proposed by Marchiori [9] and further explored by Kleinberg [6] and Page et al. [11] (who also introduced the idea of using the anchor text describing the hyperlink to augment the target page).
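As a hedged illustration of how the hyperlink graph can yield a query-independent importance score in the spirit of PageRank [11], the following sketch runs a simple power iteration on a toy graph; the graph, damping factor, and iteration count are arbitrary assumptions rather than production settings.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share        # endorsement flows along the hyperlink
            else:
                for target in pages:
                    new_rank[target] += damping * rank[page] / n   # dangling page: spread uniformly
        rank = new_rank
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(toy_graph))   # page "c", endorsed by the most pages, receives the highest score
```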
Exploiting User Behavior
As stated above, hyperlink analysis leverages human intelligence, namely peer endorsement between web page authors. Web search engines can also measure users' endorsements by observing the search result links that are being clicked on. This concept was first proposed by Boyan et al. [2], and subsequently commercialized by DirectHit [3]. For an up-to-date summary of the state of the art, the reader is referred to [5]. Besides search result clicks, commercial search engines can obtain statistics of page visitations (i.e., popularity) from browser toolbars, advertising networks or directly from Internet service providers. This form of quality feedback is query-independent and thus less informative, but more abundant.
Performance
There are several aspects to the performance of a web relevance ranking algorithm. First, there are the standard algorithmic performance measures such as speed, disk and memory requirements, etc. Running time efficiency is crucial for web search ranking algorithms, since billions of documents need to be ranked in response to millions of queries per hour. For this reason most features need to be pre-computed off-line and only their combination is computed at query time. Some features may require specialized data structures so that they can be retrieved especially fast at query time. This is the case, for example, for term weights (which are organized in inverted indices [1]) or query-dependent hyperlink features [10]. A more fundamental aspect of the performance of a relevance ranking algorithm is its accuracy or precision: how good is the algorithm at estimating the relevance of pages? This is problematic because relevance is a subjective property, and can only be observed experimentally, by asking a human subject. Furthermore, the performance of a ranking algorithm does not depend equally on each page: the best-ranked pages are those seen by most users and are therefore the most important for determining the quality of the algorithm in practice. Performance evaluation measures used for the development of relevance ranking algorithms take this into account [8]. There exist other, less explicit measures of performance. For example, as users interact with a search engine, the distribution of their clicks over the different ranks gives an indication of the quality of the ranking (i.e., rankings leading to many clicks on the first results may be desired). However, this information is private to the search engines, and furthermore it is strongly biased by the order of presentation of results.
Feature Combination
All of the features of a page need to be combined to produce a single relevance score. Early web search engines had only a handful of features, and they were combined linearly, manually tuning their relative weights to maximize the performance obtained on a test set of queries. Modern search engines employ hundreds of features and use statistical methods to tune their combination. Although the specific details remain secret, a number of research publications exist on the topic (see for example [13]).
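A minimal sketch of the kind of linear feature combination described above; the feature names, hand-set weights, and candidate pages are purely illustrative assumptions — real engines combine hundreds of features and tune the weights statistically rather than by hand.

```python
# Hypothetical relative weights for a few relevance features (illustrative only).
WEIGHTS = {"textual_score": 2.0, "anchor_text_score": 1.5, "pagerank": 0.8, "click_rate": 1.2}

def relevance_score(features, weights=WEIGHTS):
    """Combine per-document feature values into a single score with a linear model."""
    return sum(weight * features.get(name, 0.0) for name, weight in weights.items())

candidates = {
    "http://example.org/a": {"textual_score": 0.7, "pagerank": 0.2, "click_rate": 0.1},
    "http://example.org/b": {"textual_score": 0.4, "anchor_text_score": 0.9, "pagerank": 0.5},
}
ranked = sorted(candidates, key=lambda url: relevance_score(candidates[url]), reverse=True)
print(ranked)   # URLs ordered by combined relevance score
```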
Key Applications
The key application of web search relevance ranking is in the algorithmic search component of web search engines. Similar methods are also employed to bias the ranking of the advertisements displayed in search results. Some of the principles have been applied in other types of search engines, such as corporate search (intranet, email, document archives, etc.).
Future Directions
Research continues to improve all of the relevance features discussed here. This research has led to a continuous improvement of search engine quality. Nevertheless, current relevance features are becoming increasingly hard to improve upon. Considerable research is centered today on discovering new types of features which can significantly improve search quality. Only two of the most promising areas are mentioned here.
Query understanding: Different types of queries may require very different types of relevance ranking algorithms. For example, a shopping query may require very different types of analysis from a travel or a health query. Work on algorithms that understand the intent of a query and select different relevance ranking methods accordingly could lead to dramatic increases in the quality of the ranking.
Personalization: In principle it is possible to exploit user information to ''personalize'' web search engine results. Different results may be relevant for the same query when it is issued by a layperson rather than by a topic expert, for example. There are many different ways to personalize results: with respect to the user's search history, with respect to the user's community, with respect to questionnaires or external sources of knowledge about the user, etc. Many scientific papers have been written on this topic, but the problem remains unsolved. Commercial web search engines have mainly shied away from personalized algorithms. Google has offered several forms of personalized search to its users, but this feature has not had much success. Nevertheless, the search continues for the right way to personalize relevance ranking.
Data Sets
Commercial search engines have at their disposal very large data sets comprising many documents (e.g., hundreds of millions), queries (e.g., tens of thousands), and human relevance evaluations (e.g., hundreds of thousands). These data sets are routinely used to develop and improve features for relevance ranking; see [10,13] for examples. Publicly available data sets for experimentation are very small compared to those used by commercial search engines. Nevertheless, they may be used to investigate some features and their combination. The most important data sets for web relevance ranking experiments are those developed in the Web track of the TREC competition organized by NIST [4].
Experimental Results Evaluation of web relevance ranking is difficult and very costly, since it involves human judges labeling a collection of queries and results as to their relevance. The most careful evaluations of web relevance features are carried out by web search engine companies, but they are not disclosed to the public. There have been very many partial evaluations of search engines published, but they are always controversial due to their small scale, their experimental biases and their indirect access to the search engine features. A number of experimental benchmarks have been constructed for public scientific competitions. Although they are small and partial they can be used for experimentation (see below).
URL to Code Due to the extraordinary cost of developing and maintaining a full-scale web search engine, there are no publicly available systems so far. The Nutch project (http://lucene.apache.org/nutch/) is aiming to build an open-source web-scale search engine based on the Lucene search engine. Other retrieval engines capable of crawling and indexing up to several millions of documents include INDRI (http://www.lemurproject.org/indri/), MG4J (http://mg4j.dsi.unimi.it/) and TERRIER (http://ir.dcs.gla.ac.uk/terrier/).
Cross-references
▶ Anchor Text ▶ BM25 ▶ Document Links and Hyperlinks ▶ Field-Based Information Retrieval Models ▶ Information Retrieval ▶ Language Models ▶ Relevance ▶ Relevance Feedback ▶ Text Categorization ▶ Text Indexing and Retrieval ▶ Vector-Space Model ▶ WEB Information Retrieval Models ▶ Web Page Quality Metrics ▶ Web Search Relevance Feedback ▶ Web Spam Detection
Recommended Reading 1. Baeza-Yates R. and Ribeiro-Neto B. Modern Information Retrieval. Addison Wesley, Reading, MA, 1999. 2. Boyan J., Freitag D., and Joachims T. A machine learning architecture for optimizing web search engines. In Proc. AAAI Workshop on Internet Based Information Systems, 1996.
3. Culliss G. The Direct Hit Popularity Engine Technology. A White Paper, DirectHit, 2000. Available online at https://www.uni-koblenz.de/FB4/Institutes/ICV/AGKrause/Teachings/SS07/DirectHit.pdf. Accessed on 27 Nov 2007. 4. Hawking D. and Craswell N. Very large scale retrieval and Web search. In TREC: Experiment and Evaluation in Information Retrieval, E. Voorhees and D. Harman (eds.). MIT Press, Cambridge, MA, 2005. 5. Joachims T. and Radlinski F. Search engines that learn from implicit feedback. IEEE Comp., 40(8):34–40, 2007. 6. Kleinberg J. Authoritative sources in a hyperlinked environment. Technical Report RJ 10076, IBM, 1997. 7. Langville A.N. and Meyer C.D. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, NJ, 2006. 8. Manning C.D., Raghavan P., and Schütze H. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 2008. 9. Marchiori M. The quest for correct information on the Web: hyper search engines. In Proc. 6th Int. World Wide Web Conference, 1997. 10. Najork M. Comparing the effectiveness of HITS and SALSA. In Proc. Conf. on Information and Knowledge Management, 2007, pp. 157–164. 11. Page L., Brin S., Motwani R., and Winograd T. The PageRank citation ranking: bringing order to the Web. Technical Report, Stanford Digital Library Technologies Project. 12. Richardson M., Prakash A., and Brill E. Beyond PageRank: machine learning for static ranking. In Proc. 15th Int. World Wide Web Conference, 2006, pp. 707–715. 13. Taylor M., Zaragoza H., Craswell N., Robertson S., and Burges C. Optimisation methods for ranking functions with multiple parameters. In Proc. Conf. on Information and Knowledge Management, 2006, pp. 585–593.
Web Search Result Caching and Prefetching
RONNY LEMPEL 1, FABRIZIO SILVESTRI 2
1 Yahoo! Research, Haifa, Israel; 2 ISTI-CNR, Pisa, Italy
Synonyms Search engine caching and prefetching; Search engine query result caching; Paging in Web search engines
Definition
Caching is a well-known concept in systems with multiple tiers of storage. For simplicity, consider a system storing N objects in relatively slow memory that also has a smaller but faster memory buffer of capacity k, which can store copies of k of the N objects (N >> k). This fast memory buffer is called the cache. The storage system is presented with a continuous stream of queries, each requesting one of the N objects. If the object is stored in the cache, a cache hit occurs and the object is quickly retrieved. Otherwise, a cache miss occurs, and the object is retrieved from the slower memory. At this point, the storage system can opt to save the newly retrieved object in the cache. When the cache is full (i.e., already contains k objects), this entails evicting some currently cached object. Such decisions are handled by a replacement policy, whose goal is to maximize the cache hit ratio (or rate) – the proportion of queries resulting in cache hits. Often, access patterns to objects, as found in query streams, are temporally correlated. For example, object y might often be requested shortly after object x has been requested. This motivates prefetching – the storage system can opt to retrieve and cache y upon encountering a query for x, anticipating the probable future query for y. The setting with respect to search result caches in Web search engines is somewhat different. There, the objects to be cached are result pages of search queries. A search query is defined as a triplet q = (qs, from, n), where qs is a query string, from denotes the relevance rank of the first result requested, and n denotes the number of requested results. The result page corresponding to q would contain the results whose relevance ranks with respect to qs are from, from + 1, ..., from + n - 1. The value of n is typically 10. Search engines set aside some storage to cache such result pages. However, a search engine is not a typical two-tiered storage structure, as results not found in the cache are not stored in slower storage but rather need to be generated through the query evaluation process of the search engine. Prefetching of search results occurs when the engine computes and caches j·n results for the query (qs, from, n), with j being some small integer, in anticipation of follow-up queries requesting additional result pages for the same query string.
Historical Background
Markatos was the first to study search result caching in depth in 2000 [11]. He experimented with various known cache replacement policies on a log of queries submitted to the Excite search engine, and compared the resulting hit ratios. In 2001, Saraiva et al. [12]
proposed a two-level caching scheme that combines caching of search results with the caching of frequently accessed postings lists. Prefetching of search engine results was studied from a theoretical point of view by Lempel and Moran in 2002 [6]. They observed that the work involved in query evaluation scales in a sub-linear manner with the number of results computed by the search engine. Then they proceeded to minimize the computations involved in query evaluations by optimizing the number of results computed per query. The optimization is based on a workload function that models both (i) the computations performed by the search engine to produce search results and (ii) the probabilistic manner by which users advance through result pages in a search session. In 2003, two papers proposed caching algorithms specifically tailored to the locality of reference present in search engine query streams: PDC – Probability Driven Caching [7], and SDC – Static Dynamic Caching [14] (extended version in [4]). The next section will focus primarily on those caching strategies and follow-up papers, e.g., the AC strategy proposed by Baeza-Yates et al. in [2]. In 2004, Lempel and Moran studied the problem of caching search engine results in the theoretical framework of competitive analysis [8]. For a certain stochastic model of search engine query streams, they showed an online caching algorithm whose expected number of cache misses is no worse than four times that of any online algorithm. Note that search engine result caching is just one specific application of caching or paging; the theoretical analysis of page replacement policies dates back to 1966, when Belady [3] derived the optimal offline page replacement algorithm. In 1985, Sleator and Tarjan published their seminal paper on online paging [15], showing the optimal competitiveness of the Least Recently Used (LRU) policy.
Foundations The setting for the caching of search results in search engines is as follows. The engine, in addition to its index, dedicates some fixed-size fast memory cache that can store up to k result pages. It is then presented with a stream of user-submitted search queries. For each query it consults the cache, and upon a cache hit returns the cached page of results to the user. Upon a cache miss, the query undergoes evaluation by the
index, and the requested page of results (along with perhaps additional result pages) are computed. The requested results are returned to the user, and all newly computed result pages are forwarded to the cache replacement algorithm. In case the cache is not full, these pages will be cached. If the cache is full, the replacement algorithm may decide to evict some currently cached result pages to make room for the newly computed pages. The primary goal of caching schemes (replacement policies and prefetching strategies) is to exploit the locality of reference that is present in many real life query streams in order to exhibit high hit ratios. Roughly speaking, locality of reference means that after an object x is requested, x and a small set of related objects are likely to be requested in the near future. Two types of locality of reference present in search engine query streams are detailed below. Topical Locality of Reference
The variety of queries encountered by modern Web search engines is enormous. People submit queries on any and all walks of life, in all languages spoken around the world. Nevertheless, some query strings are much more popular than others and occur with high frequency, motivating the caching of their results by the search engine. Frequent queries can be roughly divided into two classes: (i) queries whose popularity is stable over time, and (ii) queries with bursty popularity, which may rise suddenly (e.g., due to some current event) and drop quickly soon thereafter. Sequential Locality of Reference
This is due to the discrete manner in which search engines present search results to users – in batches of n (typically 10) results at a time. Some users viewing the page of results for the query q = (qs, from, n) will not be fully satisfied, and will soon thereafter submit the follow-up query (qs, from + n, n). In general, the more users have recently queried for (qs, from, n), the higher the likelihood that at least one of them will query for (qs, from + n, n). This behavior of users, coupled with the observation that the work involved in computing a batch of j·n (j > 1) results for a query string is significantly smaller than that of j independent computations of batches of n results, motivates the prefetching of search results in search engines [9]. Several studies have analyzed the manner in which users query search engines and view result
pages [11,7,13,1]. In general, it has been observed on numerous query logs that query popularities obey a power-law distribution, i.e., the number of query strings appearing n times in the log is proportional to n^-c. For example, Lempel and Moran [7] report a value of c of about 2.4 on a log of seven million queries submitted to AltaVista in late 2001 (the number of extremely popular queries typically exceeds the value predicted by the power-law distribution). While the 25 or so most popular queries in the query stream may account for over 1% of an engine's query load, a large fraction of any log is composed of queries that appear in it only once. For example, Baeza-Yates et al. [1] analyze a log in which the vast majority (almost 88%) of unique query strings were submitted just once, accounting for about 44% of the total query load. Such queries are bound to cause cache misses, and, along with the misses incurred for the first occurrence of any repeating query, imply an upper bound of about 50% for cache hit rates on that log. Many of the studies cited above report that users typically browse through very few result pages of a query, and that the ''deeper'' the result page, the fewer users view it. While the exact numbers vary in each study, the reports agree that the majority of queries will have only their top-ten results viewed, and that the percentage of queries for which more than three result pages are viewed is very small. However, note that statistical data on the order of 10^-2 (single percentage points) are powerful predictors for the purpose of caching result pages. Beyond achieving high hit ratios, caching schemes must also be efficient, i.e., the logic involved in cache replacement decisions must be negligible compared with the cost of a cache miss. Furthermore, they should be designed for high throughput, as search engines serve many queries concurrently and it would be counterproductive to have cache access become a bottleneck in their computation pipeline. Different caching schemes have been applied to search results, ranging from ''classic'' ones to policies specifically designed for search engine workloads.
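Tying back to the log statistics above, the following small sketch (on a toy query log) computes the compulsory-miss upper bound on the hit ratio: even an infinite cache must miss the first occurrence of every distinct query string, so singleton-heavy logs cap the achievable hit rate.

```python
from collections import Counter

def hit_ratio_upper_bound(query_log):
    """Upper bound on the hit ratio of any cache: the first occurrence of each distinct query misses."""
    distinct = len(Counter(query_log))
    return 1.0 - distinct / len(query_log)

toy_log = ["q1", "q2", "q1", "q3", "q1", "q2", "q4"]
print(hit_ratio_upper_bound(toy_log))   # 4 compulsory misses out of 7 requests -> bound of 3/7
```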
''Classic'' Caching Schemes
These schemes include the well-known Least Recently Used (LRU) replacement policy and its variations (e.g., SLRU and LRU2) [15]. To quickly recap, upon a cache miss, LRU replaces the least recently accessed of the currently cached objects with the object that caused the
cache miss. SLRU is a two-tiered LRU scheme: cache misses result in new objects entering the lower-tier LRU. Cache hits in the lower-tier LRU are promoted to the upper tier, whose least recently accessed object is relegated to be the most recently accessed object of the lower tier. The above replacement policies are very efficient, requiring O(1) time per replacement of an item. Furthermore, they are orthogonal to prefetching policies in the sense that one can decide to prefetch and cache any number of result pages following a cache miss.
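A minimal sketch of an LRU result cache built on Python's OrderedDict; keys are (query string, from, n) triplets as in the Definition above, and the toy capacity and result pages are illustrative — this mirrors the textbook policy, not any engine's implementation.

```python
from collections import OrderedDict

class LRUResultCache:
    """Least Recently Used cache for search result pages."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()        # maps (query string, from, n) -> cached result page

    def get(self, key):
        if key not in self.pages:
            return None                   # cache miss
        self.pages.move_to_end(key)       # mark as most recently used
        return self.pages[key]

    def put(self, key, result_page):
        if key in self.pages:
            self.pages.move_to_end(key)
        elif len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        self.pages[key] = result_page

cache = LRUResultCache(capacity=2)
cache.put(("jaguar", 1, 10), ["top-10 results ..."])
cache.put(("panthera onca", 1, 10), ["top-10 results ..."])
print(cache.get(("jaguar", 1, 10)) is not None)        # True: hit
cache.put(("weather", 1, 10), ["top-10 results ..."])  # evicts the least recently used entry
print(cache.get(("panthera onca", 1, 10)) is None)     # True: that page was evicted
```

An SLRU variant would chain two such structures, promoting pages from the lower tier to the upper tier on a hit.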
PDC – Probability Driven Cache
In PDC the cache is divided between an SLRU segment that caches result pages for top-n queries (i.e., queries of the form (qs, from, n) where from = 1), and a priority queue that caches result pages of follow-up queries. The priority queue estimates the probability of each follow-up result page being queried in the near future by considering all queries issued recently by users, and estimating the probability of at least one user viewing each follow-up page. The priority queue's eviction policy is to remove the page least likely to be queried. One drawback of PDC is that maintaining its probabilistic estimates of the worthiness of each result page requires an amortized time that is logarithmic in the size of the cache per processed query, regardless of whether the query caused a hit or a miss. An advantage of PDC is that it is better suited for prefetching than other schemes, since prefetched pages are treated separately and independently by the priority queue and are cached according to their individual worthiness rather than according to some fixed prefetching policy.
SDC – Static Dynamic Cache
SDC divides its cache into two areas: the first is a read-only (static) cache of results for the perpetually popular queries (as derived from some pre-analysis of the query log), while the second area dynamically caches results for the rest of the queries using any replacement policy (e.g., LRU or PDC). Introducing a static cache to hold results for queries that remain popular over time has the following advantages:
- The results of these queries are not subject to eviction in the rare case that the query stream exhibits some ''dry spell'' with respect to them.
- The static portion of the cache is essentially read-only memory, meaning that multiple query threads can access it simultaneously without the need to synchronize access or lock portions of the memory. This increases the cache's throughput in a real-life multithreaded system that serves multiple queries concurrently.
SDC can be applied with any prefetching strategy. See the discussion below for the specific prefetching strategy proposed in [14].
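A rough sketch of the static/dynamic split in SDC: a read-only dictionary of precomputed pages for the perpetually popular queries is consulted first, and all other queries fall through to a dynamic cache with an arbitrary replacement policy; the minimal LRU used as the dynamic part, the query keys, and the capacities are illustrative assumptions.

```python
from collections import OrderedDict

class SimpleLRU:
    """Minimal LRU used here as the dynamic part; any replacement policy (e.g., PDC) would do."""
    def __init__(self, capacity):
        self.capacity, self.pages = capacity, OrderedDict()
    def get(self, key):
        if key not in self.pages:
            return None
        self.pages.move_to_end(key)
        return self.pages[key]
    def put(self, key, page):
        if key in self.pages:
            self.pages.move_to_end(key)
        elif len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)
        self.pages[key] = page

class StaticDynamicCache:
    def __init__(self, static_pages, dynamic_cache):
        # static_pages: result pages for perpetually popular queries, derived off-line
        # from query-log analysis; this part is read-only and never evicted.
        self.static_pages = dict(static_pages)
        self.dynamic = dynamic_cache
    def get(self, key):
        if key in self.static_pages:      # lock-free read-only lookup
            return self.static_pages[key]
        return self.dynamic.get(key)
    def put(self, key, result_page):
        if key not in self.static_pages:  # only the dynamic segment is ever updated
            self.dynamic.put(key, result_page)

sdc = StaticDynamicCache({("weather", 1, 10): ["precomputed page"]}, SimpleLRU(capacity=1000))
print(sdc.get(("weather", 1, 10)))        # served from the static segment
```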
AC
AC was proposed by Baeza-Yates et al. in 2006 [2]. Like PDC and SDC, this scheme also divides its cache into two sections. The intuition behind AC is to identify the large fraction of rare ''tail'' queries and to relegate them to a separate portion of the cache, thus preventing them from causing the eviction of more worthy results.
Prefetching Policies
The prefetching policies appearing so far in the literature are rather simple. One straightforward policy is to simply choose some fixed number k of result pages to prefetch upon any cache miss. The value of k can be optimized to maximize the cache hit ratio [7] or to minimize the total workload on the query evaluation servers [9]. These two objective functions do not necessarily result in the same value of k, since it could be that it is easier to compute more queries with fewer results per query than fewer queries with more results per query. An adaptive prefetching scheme that also prefetches, in some cases, following a cache hit was proposed in [14]:
- Whenever a cache miss occurs for the top-n results of a query (i.e., when from = 1), evaluate the query and also prefetch its second page of results.
- For any cache miss of a follow-up query (from > 1), evaluate the query and prefetch some fixed number k of additional result pages.
- If the second page of results for a query is requested and is a cache hit, return the cached page and prefetch in the background the next k result pages.
Note that performing query evaluations following cache hits is a speculative operation that does not reduce the overall load on the backend in terms of query evaluations. It may balance that load somewhat and reduce the latency experienced by searchers when submitting follow-up queries.
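A sketch of the adaptive rules just listed, following the description of [14]; the function only decides how many result pages the backend should compute for a request, and the constant k is an illustrative parameter.

```python
PREFETCH_K = 2   # illustrative number of additional result pages to prefetch

def pages_to_compute(from_rank, page_size, cache_hit):
    """How many result pages to request from the query evaluation backend.

    Mirrors the adaptive scheme described above: on a miss for a top-n query fetch the
    first two pages; on a miss for a follow-up query fetch the requested page plus k more;
    on a hit for the second page, speculatively prefetch the next k pages in the background.
    """
    is_top_page = from_rank == 1
    is_second_page = from_rank == page_size + 1
    if not cache_hit:
        return 2 if is_top_page else 1 + PREFETCH_K
    if is_second_page:
        return PREFETCH_K    # the hit itself is served from the cache
    return 0                 # ordinary hit: nothing to compute

print(pages_to_compute(from_rank=1, page_size=10, cache_hit=False))   # 2
print(pages_to_compute(from_rank=11, page_size=10, cache_hit=False))  # 3 (with k = 2)
print(pages_to_compute(from_rank=11, page_size=10, cache_hit=True))   # 2 (speculative prefetch)
```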
As explained above, the choice of prefetching policy is orthogonal to the choice of replacement policy. However, PDC and AC are better suited to assigning different priorities to prefetched pages rather than treating them all as ‘‘recently used.’’ Note that prefetching is only effective when caching prefetched pages will result in more cache hits than the pages they replace in the cache. This depends on the characteristics of the cache (its size and replacement policy) as well as on the observed interaction of the users with the specific search engine in question (how often they tend to browse deep result pages).
Key Applications Caching and Prefetching of results in Web search engines is a key application ‘‘per se.’’
Future Directions Implementing search result caching in live, Web-scale systems brings with it many engineering challenges that have to do with the particular architecture of the specific search engine, and the specific nature of the application. This section briefly points at several such issues, each deserving of further research attention. Search Result Caching in Incremental Search Engines
The previous sections assumed that the results of query q are stationary over time. In the case of evolving collections cached results may become stale, and serving them defeats the purpose of an incremental engine. Thus, more advanced cache maintenance policies are required. Caching in the Presence of Personalized Search Results
Search engines are increasingly biasing search results toward the preferences of the individual searchers. As a consequence, two searchers submitting the same query might receive different search results. In this case, cache hit ratios decrease since results cannot always be reused across different searchers. Level of Details to Cache
A page of search results is comprised of many different components (e.g., result URLs, summary snippets, ads), with each component logically coming from different computational processes and often from different sets of physical servers. While caching is meaningless without saving the URLs of the results, the
other components can be computed separately, and whether to cache or recompute them is a matter of computational tradeoff. Holistic View
There are many types of objects other than result pages that merit caching in a search engine, e.g., postings lists of popular query terms, user profiles, summary snippets, various dictionaries, and more [12,10]. The optimal allocation of RAM for the caching of different objects is a matter for holistic considerations that depend on the specific architecture of the search engine.
Experimental Results The literature reports on many experiments with many query logs and caching schemes. All caching schemes have at least one tunable parameter – the cache size – whereas schemes with several cache segments (e.g., SLRU) have parameters governing the size of each segment. Search-specific caching schemes (PDC, SDC, AC) have additional tunable parameters, and when coupled with prefetching schemes, the number of reported experimental configurations grows significantly. Therefore, the summary below focuses on qualitative observations rather than quantitative ones. With respect to caching without prefetching, the literature finds that:
- Even generic caching schemes attain decent hit ratios. For example, LRU achieves on some query logs hit ratios as high as 80% of what an infinite cache would achieve. In terms of absolute values, hit ratios of 20–30% are common for LRU-based schemes. In fact, recency-based caching schemes are well suited for the large topical locality of reference exhibited by power-law query distributions, where significant portions of the query stream are comprised of a small set of highly popular queries.
- Search-specific caching schemes outperform generic caching schemes. However, the relative gains in terms of hit ratio are rarely higher than 10%.
Throughout the literature, prefetching improved the achieved cache hit ratios, in many cases by 50% and more as compared with the same experimental setup without prefetching. The following are additional observations:
- The largest relative gains in hit ratio are achieved for moderate amounts of prefetching, namely fetching one or two result pages beyond what is requested in the query. More aggressive prefetching improves the hit ratio by modest amounts at best, and in small caches may even cause it to deteriorate.
- As a general rule, larger caches merit more prefetching, since much of a large cache is devoted to low-merit result pages, which are less likely to be queried than follow-up result pages of current queries.
- Not surprisingly, the adaptive prefetching scheme proposed in [14] outperforms the simplistic policy of always prefetching the same amount of result pages regardless of the identity of the page that generated the cache miss.
Cross-references
▶ Buffer Management ▶ Cache Replacement as Trade-Off Elimination ▶ Cache-Conscious Query Processing ▶ Competitive Ratio ▶ Indexing the Web ▶ Inverted Files
Recommended Reading 1. Baeza-Yates R., Gionis A., Junqueira F., Murdock V., Plachouras V., and Silvestri F. The impact of caching on search engines. In Proc. 33rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2007, pp. 183–190. 2. Baeza-Yates R., Junqueira F., Plachouras V., and Witschel H.F. Admission policies for caches of search engine results. In Proc. 14th Int. Symp. String Processing and Information Retrieval, 2007, pp. 74–85. 3. Belady L.A. A study of replacement algorithms for a virtualstorage computer. IBM Syst. J., 5(2):78–101, 1966. 4. Fagni T., Perego R., Silvestri F., and Orlando S. Boosting the performance of web search engines: caching and prefetching query results by exploiting historical usage data. ACM Trans. Inf. Syst., 24(1):51–78, 2006. 5. Karedla R., Love J.S., and Wherry B.G. Caching strategies to improve disk system performance. Computer, 27(3):38–46, 1994. 6. Lempel R. and Moran S. Optimizing result prefetching in web search engines with segmented indices. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 370–381. 7. Lempel R. and Moran S. Predictive caching and prefetching of query results in search engines. In Proc. 12th Int. World Wide Web Conference, 2003, pp. 19–28. 8. Lempel R. and Moran S. Competitive caching of query results in search engines. Theor. Comput. Sci., 324(2):253–271, 2004. 9. Lempel R. and Moran S. Optimizing result prefetching in web search engines with segmented indices. ACM Trans. Internet Tech., 4:31–59, 2004. 10. Long X. and Suel T. Three-level caching for efficient query processing in large web search engines. In Proc. 14th Int. World Wide Web Conference, 2005, pp. 257–266.
11. Markatos E.P. On caching search engine query results. Comput. Commun., 24(2):137–143, 2001. 12. Saraiva P., Moura E., Ziviani N., Meira W., Fonseca R., and Ribeiro-Neto B. Rank-preserving two-level caching for scalable search engines. In Proc. 24th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001, pp. 51–58. 13. Silverstein C., Henzinger M., Marais H., and Moricz M. Analysis of a very large altavista query log. Technical Report 1998-014, Compaq Systems Research Center, October 1998. 14. Silvestri F., Fagni T., Orlando S., Palmerini P., and Perego R. A hybrid strategy for caching web search engine results. In Proc. 12th Int. World Wide Web Conference, 2003 (Poster). 15. Sleator D.D. and Tarjan R.E. Amortized efficiency of list update and paging rules. Commun. ACM, 28:202–208, 1985.
Web Search Result De-duplication and Clustering
XUEHUA SHEN 1, CHENGXIANG ZHAI 2
1 Google, Inc., Mountain View, CA, USA; 2 University of Illinois at Urbana-Champaign, Urbana, IL, USA
Definition Web search result de-duplication and clustering are both techniques for improving the organization and presentation of Web search results. De-duplication refers to the removal of duplicate or near-duplicate web pages in the search result page. Since a user is not likely to be interested in seeing redundant information, de-duplication can help improve search results by decreasing the redundancy and increasing the diversity among search results. Web search result clustering means that given a set of web search results, the search engine partitions them into subsets (clusters) according to the similarity between search results and presents the results in a structured way. Clustering results helps improve the organization of search results because similar pages will be grouped together in a cluster and a user can easily navigate into the most relevant cluster to find relevant pages. Hierarchical clustering is often used to generate a hierarchical tree structure which facilitates navigation into different levels of clusters. De-duplication can be regarded as a special way of clustering search results in which only highly similar (i.e., near-duplicate) pages would be grouped into a
cluster and only one page in a cluster is presented to the user. Although standard clustering methods can also be applied to do de-duplication, de-duplication is more often done with its own techniques which are more efficient than clustering. For example, Fig. 1 shows a screenshot of the clustered search results of the query ''jaguar'' from the Vivisimo meta-search engine (http://www.vivisimo.com), which clusters web search results and presents a hierarchical tree of search results. On the left are some clusters labeled with phrases; each cluster captures some aspect of the search results, and different senses of ''jaguar'' (e.g., jaguar animals vs. jaguar cars) have been separated. On the right are the documents in the cluster labeled as ''Panthera onca,'' which are displayed when the user clicks on this cluster on the left panel.
Web Search Result De-duplication and Clustering. Figure 1. Sample clustered results about ''jaguar'' from the Vivisimo search engine.
Historical Background The idea of clustering search results appears to have been first proposed in [6]. However, the application of document clustering to improve the performance of information retrieval has a much longer history [14], dating back to at least 1971 [9]. Most of this work has been motivated by the so-called cluster hypothesis, which says that closely associated documents tend to be relevant to the same requests [9]. The cluster hypothesis suggests that clustering can be potentially exploited to improve both search accuracy (e.g., similar documents can reinforce each other's relevance score) and search efficiency (e.g., scoring clusters takes less time than scoring all the documents in the whole collection). Research has been done in both directions with mixed findings. In [3], it was proposed that clustering can be used to facilitate document browsing, and a document browsing method called Scatter/Gather was proposed to allow a user to navigate in a collection of documents. In [6], Scatter/Gather was further applied to cluster search results and was shown to be more effective in helping a user find relevant documents than simply presenting results as a ranked list. Clustering web search results appears to have been first studied in [15], which shows that it is feasible to cluster web search results on-the-fly. Vivisimo (http://www.vivisimo.com/) is among the first Web search engines to adopt search result clustering. More recent work in this line has explored presenting search results in clusters (categories) defined based on context (e.g., [4,13]), hierarchical clustering (e.g., [5]), efficiency
of clustering, incorporating the prior knowledge or user feedback into clustering algorithms in a semi-supervised manner. Compared with clustering, document de-duplication has a relatively short history. The issue of document duplication was not paid much attention to by information retrieval researchers in early days, probably because it was not a serious issue in the traditional applications of retrieval techniques (e.g., library systems). However, the problem of detecting duplicate or near-duplicate documents has been addressed in some other contexts before it was studied in the context of the Web. For example, the problem has naturally occurred in plagiarism detection. The problem also arose in file systems [10], where one document could be easily copied, slightly modified, saved as another file, or transformed into another format, leading to duplicate or near-duplicate documents. The SCAM system [12] is among the early systems for detecting duplicate or near-duplicate documents. The algorithms proposed in these studies are often based on signatures – representing and comparing documents based on their signatures, or values generated by a hashing function based on substrings from documents.
Web page de-duplication was first seriously examined in [1], where the authors pointed out that document duplication arises in two ways on the Web: one is that the same document is placed in several places, while the other is that the same document appears in several almost identical incarnations. The authors further proposed an effective de-duplication algorithm based on representing and matching documents using shingles (i.e., contiguous subsequences of tokens in a document). It is now realized that Web page de-duplication is needed at the indexing, ranking, and evaluation stages [8] to decrease index storage, increase user satisfaction, and improve ranking algorithms, respectively.
Foundations Since de-duplication can be regarded as a special case of clustering, it is not surprising that de-duplication and clustering share very similar fundamental challenges, which include (i) designing an appropriate representation of a document; (ii) designing an effective measure of document similarity based on document representation; (iii) finding/grouping similar documents (duplicate or near-duplicate documents in the case of de-duplication)
quickly. Most research work attempts to solve all or some of these challenges. How to represent a document and how to measure similarity between documents are closely related to each other in that a similarity function making sense for one representation may not work for another representation. Both highly depend on how the notion of similarity is defined. This is a non-trivial issue as two documents may be similar in many different ways (e.g., in topic coverage, in genre, or in structure), and depending on the desired perspective of similarity, one may need different document representations and different ways to measure similarity. As a result, the optimal choice often depends on the specific application. For de-duplication, it often suffices to represent documents primarily based on simple tokenization of documents or even surface string representation of documents, since duplicate or near-duplicate documents generally share a lot of surface features and may contain very long identical strings. For the same reason, exact matching of two representations may also be sufficient most of the time. Most proposed methods for detecting duplicate or near-duplicate documents can be roughly classified into two categories: signature matching and content matching. They mainly differ in how a document is represented. In signature matching, a document is represented by a signature of the document which is often generated using a hash function based on some (not necessarily meaningful) subsequences of tokens in a document; thus the similarity between documents is mostly measured based on syntactic features of documents. In content matching, the basic unit of document representation is often a meaning unit such as a word or phrase intended to capture the goal of content matching. Moreover, the basic units are often weighted to indicate how well they reflect the content of a document using heuristic weighting methods such as TF-IDF weighting proposed in information retrieval. Since the document representation better reflects the content of a document in the content matching approach, the similarity between documents reflects their semantic similarity better than in the signature matching approach. In general, content matching captures more semantics in de-duplication but is generally more expensive than signature matching. A comprehensive evaluation of these methods can be found in [7].
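The following is a small sketch of the signature-matching idea using word k-shingles and the Jaccard coefficient (in the spirit of the shingling approach of [1] discussed next); the shingle length, the example pages, and the exact set representation are illustrative, and real systems hash and sample shingles to scale to the Web.

```python
def shingles(text, k=4):
    """Set of contiguous k-token subsequences (word shingles) of a document."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def resemblance(doc_a, doc_b, k=4):
    """Jaccard coefficient of the two documents' shingle sets."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    return len(a & b) / len(a | b) if a | b else 1.0

page1 = "the quick brown fox jumps over the lazy dog near the river bank"
page2 = "the quick brown fox jumps over the lazy dog near the river shore"
print(round(resemblance(page1, page2), 2))   # 0.82: high resemblance, likely near-duplicates
print(resemblance(page1, "an unrelated page about caching query results"))  # 0.0
```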
The shingling algorithm proposed in [1] remains a major state-of-the-art method for Web page de-duplication. In this algorithm, every contiguous subsequence of k tokens of a web page is used as a unit of representation (called a shingle), and a page is represented by the set of unique shingles occurring in the page. The similarity between two web pages is measured by the percentage of unique shingles shared by the two pages. Sampling can be used to estimate this similarity and improve efficiency of de-duplication. In the I-Match method [2], collection statistics are used to filter terms in a signature matching approach, thus achieving content matching to a certain extent. In a recent work [8], it is shown that additional content-based features can be exploited to significantly improve de-duplication accuracy. Clustering of search results is almost exclusively based on content matching rather than signature matching because the goal is to group together semantically similar pages which are not always syntactically similar. A document is often represented by a content-capturing feature vector. To construct such a vector, a set of features is selected and assigned appropriate weights. Many features may be adopted for clustering, including, e.g., terms from the full content of the HTML web page, the web page snippet (title and summary), the URL of the web page, and anchor text. These features need to be weighted appropriately to reflect the content of a page accurately. These challenges are essentially the same as in text retrieval, where there is a need to represent a document appropriately to match a document with a query accurately. In general, document similarity can be computed in a similar way to computing the similarity between a query and a document in a standard retrieval setting. Thus many methods developed in information retrieval, especially TF-IDF term weighting and language models, can be applied to clustering search results. The basic idea of TF-IDF weighting is to give a term a higher weight if the term occurs frequently in a document and occurs infrequently in the entire collection of documents. Similarity is often computed based on the dot product or its normalized form (i.e., cosine) of two TF-IDF vectors. Language models achieve a similar weighting heuristic through probabilistic models. Once similarities between documents are computed, many clustering methods can be applied, such as k-means, nearest neighbor, hierarchical agglomerative clustering, and spectral clustering. Instead of relying
on an explicit similarity function, a different strategy for clustering is to cast the clustering problem as fitting a mixture model to the text data; the estimated parameter values can then be used to easily obtain clusters. Methods of this kind are called model-based clustering methods. Finite multinomial mixture, probabilistic latent semantic indexing (PLSA), and latent Dirichlet allocation (LDA) are some examples in this category. In model-based clustering, the notion of similarity is implied by the model. Due to the lack of systematic comparisons among these different methods, it is unclear which clustering method is the best for clustering search results. Efficiency is a critical issue for clustering search results since a search engine must do that on-the-fly after the user submits a particular query. Usually clustering is applied after the search engine retrieves a ranked list of web pages. Thus the time spent on clustering is in addition to the original latency of web search. If the latency is noticeable, the user may be discouraged from using a clustering interface. Some clustering algorithms such as k-means and some model-based algorithms are ‘‘any time’’ algorithms in that they are iterative and can stop at any iteration to produce (less optimal) clustering results. These algorithms would thus be more appropriate for clustering search results when efficiency is a concern. Speeding up clustering is an active research topic in data mining, and many efficient algorithms proposed in data mining can be potentially applied to clustering search results. Another important challenge in clustering search results is to label a cluster appropriately, which clearly would directly affect the usefulness of clustering, but is yet under-addressed in research. A good cluster label should be understandable to the user, capturing the meaning of the cluster, and distinguishing the cluster from other clusters. In [11], a probabilistic method is proposed to automatically label a cluster with phrases. Another approach to improving cluster labeling and clustering of results in general is to learn from search logs to discover interesting aspects of a query and cluster the search results according to these aspects which can then be naturally labeled with some typical past queries [13].
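As a toy illustration of content-matching-based clustering of result snippets, the sketch below builds crude TF-IDF-style vectors and greedily groups snippets whose cosine similarity to a cluster representative exceeds a threshold; real systems use richer features and algorithms such as k-means or agglomerative clustering, and the snippets and the 0.25 threshold are invented for illustration.

```python
import math
from collections import Counter

def vectorize(snippets):
    """Crude TF-IDF-style vectors (term -> weight) for a handful of result snippets."""
    counts = [Counter(s.lower().split()) for s in snippets]
    df = Counter(term for c in counts for term in c)
    n = len(snippets)
    return [{t: tf * math.log(1 + n / df[t]) for t, tf in c.items()} for c in counts]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def greedy_cluster(snippets, threshold=0.25):
    """Assign each snippet to the first cluster whose representative is similar enough."""
    vectors, clusters = vectorize(snippets), []   # clusters: list of (representative, member indices)
    for i, vec in enumerate(vectors):
        for representative, members in clusters:
            if cosine(vec, representative) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]

results = ["jaguar car dealer and car prices",
           "used jaguar car prices",
           "jaguar big cat habitat",
           "the jaguar big cat of south america"]
print(greedy_cluster(results))   # [[0, 1], [2, 3]]: car-related vs. animal-related snippets
```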
Key Applications The major application of search result de-duplication is to reduce the redundancy in search results, which
presumably improves the utility of search results. Document de-duplication, however, has many other applications in Web search. For example, it can reduce the size of the index storage and allow a search engine to load the index of a more diverse set of web pages into memory. Eliminating duplicate or near-duplicate documents can also reduce the noise in the document set for evaluation, and thus would allow an evaluation measure to reflect more accurately the real utility of a search algorithm [8]. Besides applications for search engines, de-duplication also has many other applications such as plagiarism detection and tracking dynamic changes of the same web page. Clustering search results generally helps a user navigate in the set of search results. It is expected to be most useful when the search results are poor or diverse (e.g., when the query is ambiguous) or when the user has a high-recall information need. Indeed, when the user has a high-precision information need (e.g., navigational queries) and the search results are reasonably accurate, presenting a ranked list of results may be best and clustering may not be helpful. However, due to the inevitable mismatches between a query and web pages, the search results are more often non-optimal. When the search results are poor, it would take a user a lot of time to locate relevant documents from a ranked list of documents, so clustering is most likely helpful because it would enable a user to navigate into the relevant documents quickly. In the case of poor results due to ambiguity of query words, clustering can be expected to be most useful. When the user has a high-recall information need, clustering can also help the user quickly locate many additional relevant documents once the user identifies a relevant cluster. In addition, the overview of search results generated from clustering may be useful information itself from a user's perspective. Data visualization techniques can be exploited to visualize the clustering presentation as in Kartoo (http://www.kartoo.com). Appropriate visualization may increase the usefulness of clustering search results. Clustering search results, when used in an iterative way, can help a user browse a collection of web pages. In Scatter/Gather [3], for example, the user would interact with an information system by selecting interesting clusters, and the information system would re-cluster the contents of the selected clusters. Thus clustering search results can effectively support a user in interactively accessing information in a collection.
URL to Code Lemur (http://www.lemurproject.org/). Lemur is an open-source toolkit designed to facilitate research in information retrieval. Document clustering is included. CLUTO (http://glaros.dtc.umn.edu/gkhome/views/cluto/). CLUTO is a software package for clustering low- and high-dimensional datasets and for analyzing the characteristics of clusters.
Cross-references ▶ Data Clustering ▶ Data Visualization
Recommended Reading 1. Broder A.Z., Glassman S.C., Manasse M.S., and Zweig G. Syntactic clustering of the web. Comput. Networks, 29(8–13):1157–1166, 1997. 2. Chowdhury A., Frieder O., Grossman D.A., and McCabe M.C. Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst., 20(2):171–191, 2002. 3. Cutting D.R., Pedersen J.O., Karger D., and Tukey J.W. Scatter/gather: a cluster-based approach to browsing large document collections. In Proc. 15th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1992, pp. 318–329. 4. Dumais S.T., Cutrell E., and Chen H. Optimizing search by showing results in context. In Proc. SIGCHI Conf. on Human Factors in Computing Systems, 2001, pp. 277–284. 5. Ferragina P. and Gulli A. A personalized search engine based on Web-snippet hierarchical clustering. In Proc. 14th Int. World Wide Web Conference, 2005, pp. 801–810. 6. Hearst M.A. and Pedersen J.O. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In Proc. 19th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1996, pp. 76–84. 7. Hoad T. and Zobel J. Methods for identifying versioned and plagiarised documents. J. Am. Soc. Inf. Sci. Technol., 54(3):203–215, 2003. 8. Huffman S., Lehman A., Stolboushkin A., Wong-Toi H., Yang F., and Roehrig H. Multiple-signal duplicate detection for search evaluation. In Proc. 33rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2007, pp. 223–230. 9. Jardine N. and van Rijsbergen C. The use of hierarchic clustering in information retrieval. Inf. Storage Retr., 7(5):217–240, 1971. 10. Manber U. Finding similar files in a large file system. In Proc. USENIX Winter 1994 Technical Conference, 1994, pp. 1–10. 11. Mei Q., Shen X., and Zhai C. Automatic labeling of multinomial topic models. In Proc. 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2007, pp. 490–499. 12. Shivakumar N. and Garcia-Molina H. SCAM: a copy detection mechanism for digital documents. In Proc. 2nd Int. Conf. in Theory and Practice of Digital Libraries, 1995. 13. Wang X. and Zhai C. Learn from Web search logs to organize search results. In Proc. 33rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2007, pp. 87–94. 14. Willett P. Recent trends in hierarchic document clustering: a critical review. Inf. Process. Manage., 24(5):577–597, 1988. 15. Zamir O. and Etzioni O. Grouper: a dynamic clustering interface to Web search results. In Proc. 8th Int. World Wide Web Conference, 1999.
Web Services
ERIC WOHLSTADTER 1, STEFAN TAI 2
1 University of British Columbia, Vancouver, BC, Canada; 2 University of Karlsruhe, Karlsruhe, Germany
Synonyms e-Services
Definition Web services provide the distributed computing middleware that enables machine-to-machine communication over standard Web protocols. Web services are defined most precisely by their intended use rather than by the specific technologies used, since different technologies are popular [1]. Web services are useful in a compositional approach to application development, where certain key features of an integrated application are provided externally through one or more remote systems. Additionally, Web service standards are a popular platform for wrapping existing legacy applications in a more convenient format for interoperability between heterogeneous systems. To provide interoperability, Web services should follow standards for formatting application messages, describing service interfaces, and processing messages. Two popular technology choices discussed in this entry are SOAP-based services [5] and REST (REpresentational State Transfer)-based services [4]. In contrast to traditional World Wide Web (WWW) resources, Web services decouple the user interface from the underlying programmatic interface, to provide an application programming interface (API) to clients. This API provides the means through which Web services expose remote computational resources or informational content over the Internet. Web services are generally designed to follow a service-oriented architecture which promotes the loose coupling useful for
interaction in a wide-area environment. This loose coupling comes at the price of giving up more powerful mechanisms such as stateful objects which introduce tighter coupling [2,7].
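As a minimal, hedged illustration of the machine-to-machine style described above, the sketch below prepares a REST-style HTTP request and parses an XML response body; the endpoint URL, query parameter, and XML element names are invented for illustration, and the canned payload stands in for what urllib.request.urlopen(req) would return from a real service.

```python
from urllib.request import Request
from xml.etree import ElementTree

# Hypothetical REST-style resource identified by a URI; the API is an assumption, not a real service.
req = Request("http://api.example.org/weather?city=Vancouver", headers={"Accept": "application/xml"})

# A canned XML payload standing in for the service's reply (data only, no user interface markup).
payload = b"<weather><city>Vancouver</city><tempC>11</tempC></weather>"
root = ElementTree.fromstring(payload)
print(root.findtext("city"), root.findtext("tempC"))
```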
Historical Background Web services address enterprise application integration (EAI) in a wide area environment. Historically, they were designed in response to the need for a simpler base of standards than provided by distributed object computing. The original WWW was also unsatisfactory due to its tight coupling with the user interface. Distributed object computing provides the abstraction of remote stateful objects whose methods can be invoked transparently by clients. This form of remote method invocation is popular in proprietary distributed systems but has never been widely deployed on the Internet. A standardized technology known as the Common Object Request Broker Architecture (CORBA) is commonly used to develop and deploy distributed object applications. Although the CORBA platform provides powerful abstractions for application programmers, these abstractions tended to introduce a high level of complexity in the underlying infrastructure. Consensus was never reached on several important technical problems introduced by the stateful nature of objects such as distributed garbage collection and object identification [2,7]. Thus, distributed object computing is more appropriate in single enterprise deployment scenarios (e.g., telecommunications and military applications). With the rise in acceptance of XML, it made sense to leverage this format as a means of machine-to-machine communication. In creating new standards around this format, certain problematic features such as stateful objects were not included in the newly developed standards leading to more lightweight and loosely coupled Web services. The WWW was created as a collaborative platform for distributing information through web pages: documents with text and images including the capability to link between related content. The standard format for web pages, Hypertext Markup Language, included mixed elements of both informational content and user interface design. With the tremendous popularity of the WWW, came the opportunity to bring rich upto-date informational data sources to a large audience. This also required complex business processes to manage requests for data. Since the retrieved data was tangled with user interface code, it was difficult to separate the
two concerns for users interested in post-processing the retrieved information. ‘‘Wrappers’’ had to be used to parse, or ‘‘scrape’’ useful content from delivered pages. To simplify the distribution of information, some websites began offering remote APIs for information retrieval which has led to the interest in standardized Web services.
Foundations Since interoperability is a key element for any Web service, some standardized approach must be taken to export service functions. The most formal approach to building Web services is built around two core standards known as SOAP and the Web Services Description Language (WSDL) [8]. The standards are being developed by organizations such as the W3C and OASIS. Other lightweight and dynamic approaches are generally categorized under the moniker of REST services. The REST approach requires few standardized technologies beyond HTTP. For both approaches, the standard way to identify each service is through means of an Internet Uniform Resource Identifier (URI). To locate an existing service, a URI is typically obtained out-of-band through informal means, although other automated approaches to service discovery are being explored. In their most basic form, Web services provide only the basic primitives for exchanging documents between clients and servers. Standards for building robust, transactional, and secure services have been proposed but are not currently widely deployed. The remainder of this entry focuses on the core Web services fundamentals and describes how both SOAP/ WSDL and REST deal with the details of service implementation: message formatting, interface description, and message processing. Web services use a human readable message format for transport of application specific data. This practice simplifies the task of debugging and loosens a client’s reliance on complex implementations for marshalling data. A popular message format is XML due to its self-described nature and abundance of third-party support for processing and storage. The use of XML is mandated by the SOAP/WSDL approach and is popular in REST services also. Conversion of XML to data structures native for a specific programming language can be done through standardized mappings. These mappings are supported by middleware which automate the mapping process. When REST services
are consumed directly from a browser agent through JavaScript, a format known as JSON (JavaScript Object Notation) [3] can be used. Unlike XML, JSON provides a direct human readable serialization of JavaScript objects. This can reduce problematic mismatches between the differing type systems of XML Schema [10] and JavaScript. As an API, Web services often provide a number of distinct functions each with their own input requirements and output guarantees. This interface can be specified formally or informally. Formal description of Web service interfaces is provided through WSDL. WSDL is an extension of the XML Schema specification and most importantly provides description of the set of functions exposed by a service. WSDL specifications add the possibility of providing statically typed-checked service use for clients. This is done by implementing message format mapping through a code generation and compilation process. With REST, the inputs to service functions are specified as standard web URIs, making use of URI parameters for data input. Output can be produced using any standard HTTP supported encoding and XML is often used for this purpose. The original description of the REST architecture placed considerable emphasis on the fact that services should be described as a set of resources which could be manipulated using predefined HTTP methods (i.e., GET, POST, DELETE, etc.), rather than an application specific set of functions. However in practice, the term REST more commonly refers to any service accessible with little technology beyond HTTP, whether or not the service is modeled primarily as resources or functions. REST currently provides no formal treatment for description of services, although discussion of a standard REST description language is underway as of this writing. Although not required, many Web services middleware platforms are built around a common architectural style referred to as the Chain of Responsibility pattern [6]. On a local scale incoming and outgoing Web service messages pass through a series of components, called handlers or interceptors, before being passed to lower network layers. This architecture promotes loose-coupling within the middleware itself. Handlers addressing separate concerns such as reliability, transactions or security can easily be plugged into the middleware and configured on a per application basis. A large number of standards known as WS-* [9] are being developed to specify the formats and
protocols for managing these concerns in middleware implementations. SOAP provides a standard for encapsulating XML-based Web service messages with an envelope that can be used to communicate extra-functional information such as security tokens, time stamps, etc. Using SOAP, the Chain of Responsibility architecture can be extended to a distributed series of message processing steps. In the distributed scenario, each step can be handled at a potentially different intermediary location on the network, giving rise to Web service intermediaries. SOAP also standardizes the fault handling semantics for undeliverable messages and provides a number of message exchange patterns. These patterns capture certain reliability guarantees which the underlying middleware can potentially use for optimizing resource usage. REST messages are transported according to the semantics of the Hypertext Transfer Protocol resource access methods. REST architectural principles mandate the use of stateless protocols wherever possible. REST message processors should consider only point-to-point connections as opposed to an end-to-end connection. This simplifies the interposition of caches or other intermediaries for message processing. REST services should rely only on standardized HTTP message headers for communicating extra-functional processing instructions, to keep the implementation infrastructure lightweight.
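The handler pipeline described above can be sketched in a few lines of plain Java; the handler names, the string-based message representation, and the concerns they stand for are illustrative only and do not correspond to any particular middleware's API.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal Chain of Responsibility sketch: each handler inspects or transforms an
// outgoing message before it would be passed to lower network layers.
interface Handler {
    String handle(String message);
}

class TimestampHandler implements Handler {
    public String handle(String message) {
        // e.g., add a timestamp (here simply prepended to the payload)
        return "timestamp=" + System.currentTimeMillis() + ";" + message;
    }
}

class SigningHandler implements Handler {
    public String handle(String message) {
        // placeholder for a security concern such as signing or adding a token
        return "signature=" + Integer.toHexString(message.hashCode()) + ";" + message;
    }
}

public class HandlerChainSketch {
    public static void main(String[] args) {
        List<Handler> chain = new ArrayList<>();
        chain.add(new TimestampHandler());   // handlers are configured per application
        chain.add(new SigningHandler());

        String message = "<order><item>book</item></order>";
        for (Handler h : chain) {
            message = h.handle(message);     // each handler addresses one separate concern
        }
        System.out.println(message);         // message now carries the extra-functional data
    }
}
```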
Key Applications Web services expose data and application functionality using standard Internet technology for consumption by clients over the Web. For example, Web services are used to provide access to very large product data. Today, the range of Web services offerings spans from very simple services to more advanced offerings in support of specific business requirements. Simple services include format conversion services, transformation services and real-time information services such as news feeds and stock quotes. Business offerings include the full range of e-commerce, CRM, marketing, finance, and communication services, among them verification services such as credit checking, accounting services, or customized notification services. Further, entire business processes such as the shipment of goods are provided as Web services. The modular nature of Web services supports the de-composition of application functionality and business processes and their (re-)composition into new applications. Services composition applies both
within an organization and across different organizations. Compositions can take various forms, including workflow applications that require the use of a process composition language and centralized workflow engine. Web services in this way support EAI using standard Internet technology. Another form of composition are Mashups, re-purposing Web-accessible content and functionality in new Web applications that provide an interactive (human) end-user experience. For example, a Mashup may visualize location information of social event feeds on a map. Web services are also being used in re-architecting middleware software platforms as service-oriented architectures. Middleware services such as message queuing and data storage are provided as Web services; middleware functionality can thus be externalized and used as a remote service. This model targets small and medium sized enterprises in particular, but applies equally to larger size enterprises as well. More advanced services offerings include application hosting and execution services, which allow clients to have entire software applications hosted and executed externally.
Future Directions Today, there exists a variety of Web services protocols, APIs, languages, and specifications. Different combinations of Web services technology are being used for different purposes. For example, SOAP and Web services specifications in support of security and reliability are used in enterprise computing, whereas REST services are popular for simple services like data access and applications in the social Web. From a business viewpoint, Web services establish a new market for trading and contracting electronic services that complement and transform the traditional consulting services and software market. The development of this market today is accompanied by the emergence of intermediaries for buyers and sellers to exchange services. These intermediaries distinguish themselves using different pricing mechanisms, comarketing efforts, billing and accounting management, services usage monitoring, data integration models, and infrastructure quality guarantees for services availability. New Web services technology in support of this business development is expected to emerge.
Cross-references
▶ AJAX ▶ Coupling and De-Coupling
▶ Discovery ▶ Enterprise Application Integration ▶ Interface ▶ Loose Coupling ▶ Mashups ▶ OASIS ▶ Request Broker ▶ Service Oriented Architecture ▶ SOAP ▶ Tight Coupling ▶ W3C ▶ Web 2.0 (3.0)
Recommended Reading
1. Alonso G., Casati F., Kuno H., and Machiraju V. Web Services: Concepts, Architectures and Applications. Springer, Berlin, 2003.
2. Birman K. Like it or not, web services are distributed objects. Commun. ACM, 47(12):60–62, 2004.
3. Crockford D. The application/json media type for JavaScript object notation. Network Working Group, RFC 4627, 2006.
4. Fielding R. Architectural Styles and the Design of Network-based Software Architectures. Ph.D. Dissertation, University of California, 2000.
5. SOAP Version 1.2. W3C Recommendation, 2007.
6. Vinoski S. Chain of responsibility. IEEE Internet Comput., 6(6):80–83, 2002.
7. Vogels W. Web services are not distributed objects. IEEE Internet Comput., 7(6):59–66, 2003.
8. Web Services Description Language Version 2.0. W3C Recommendation, 2007.
9. Weerawarana S., Curbera F., Leymann F., Storey T., and Ferguson D. Web Services Platform Architecture. Prentice Hall, Upper Saddle River, NJ, 2005.
10. XML Schema. W3C Recommendation, 2004.
Web Services and the Semantic Web for Life Science Data
CARTIK R. KOTHARI, MARK D. WILKINSON
University of British Columbia, Vancouver, BC, Canada
Definitions Web services are frameworks for communication between computer applications on the World Wide Web. A general feature of ‘‘canonical’’ Web services is that they expose Web-based application interfaces in the form of a Web Services Description Language (WSDL) document with machine-readable content describing the input(s), output(s), and location of an application. The Semantic
web extends the traditional Web by applying human and machine-readable labels to the links between resources, and encouraging those resources to be ‘‘typed’’ into a set of well-grounded categories, defined by shared ontologies. Machines can thus explore and process the content of the Semantic Web in meaningful ways. Moreover, the Semantic Web moves beyond simple documents and allows linking of individual data points, analytical tools, and even conceptual entities with no physical representation. Ontologies are formal descriptions of a knowledge domain, and range from simple controlled vocabularies to explicit logical definitions of the defining properties for all concepts and inter-concept relationships within that knowledge domain. In the life sciences, with its multitude of specialized data repositories and data processing applications, the Semantic Web paradigm promises to overcome the hurdles to data discovery, integration, mining, and reuse that continues to hinder biomedical investigation in the post-genomic era.
Historical Background The life sciences are experiencing what might best be described as a feedback loop – the ability of life scientists to generate raw "omics" data in vast quantities is perhaps only exceeded by the quantity of data output from the tools that analyze this raw data, and both become the fodder for new downstream analyses. Though much of the data is stored in one of several generalized Web-based repositories – for example UniProt (http://www.pir.uniprot.org), GenBank (http://www.ncbi.nlm.nih.gov/GenBank), or EMBL (http://embl.org) – still more data is housed in specialized repositories such as Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) (GEO), for gene expression data, or species-specific Web databases such as Wormbase (http://www.wormbase.org/), Flybase (http://flybase.bio.indiana.edu/), DragonDB (http://antirrhinum.net), and TAIR (http://www.arabidopsis.org/). Similarly, many common analytical tools are also made available on the Web, through traditional CGI or JavaScript interfaces. In the course of their daily experimental investigations, biologists frequently need to use various combinations of these distributed data and/or analytical resources. For most biologists, this is achieved by manual copy/paste of the output from one resource into the input form fields of the next. Computer-savvy biologists might undertake to construct "screen scrapers" – small computer programs that automatically process
the output from Web resources and pipelined data from one resource to the next – and this is further facilitated by the creation of centralized code repositories such as BioPerl (http://www.bioperl.org) and BioJava (http://biojava.org) that provide a uniform API into a standard and curated set of ‘‘scrapers.’’ Nevertheless, automation of data extraction from traditional Web forms is fraught with difficulties; with the lack of a common data interchange format and shareable semantics being the biggest hurdle. The introduction of Web Services provided a more standardized, and platform/language-independent method of exposing data and analytical tools on the Web, and encouraged the use of XML as the standard data exchange format between different web resources. Moreover, the interface to and functionality of a Web Service could be described in a standardized machine processable syntax – Web Services Description Language (http://www.w3.org/TR/2007/REC-wsdl20primer-20070626/) (WSDL), and a novel transport protocol – Simple Object Access Protocol (http://www.w3. org/TR/2007/REC-soap12-part0–20070427/) (SOAP) – was introduced to hide the complexities of the data exchange process and thus facilitate design of simple object- oriented accessors to remote Web resources. A paradigm for Web Services discovery (UDDI) was also proposed; however this has not been widely adopted, and will not be discussed here. Although Web Services greatly aided the standardization of interfaces, the problem of interoperability had not been solved – the pipelining of Web Services continued to require sophisticated knowledge of the interfaces such that appropriate data-structures could be extracted from one service and fed into the next. This is because the XML data syntax lacks semantic grounding, and thus the input and output to/from any Web Service is opaque to automated interpretation, which thwarts attempts at automated Web Service workflow composition. The advent of the Semantic Web and ontologies constructed with Semantic Web compatible knowledge representation formalisms promises to overcome this hurdle. The Semantic Web is being implemented as a Web of machine processable data. This is in contrast to the first version of the World Wide Web where machines were primarily concerned with the presentation of data; consumption and interpretation were limited to humans. Data on the Semantic Web is rendered machine processable by annotation with syntactic constructs
whose semantics are grounded in formal mathematical logic. Knowledge representation formalisms such as the Resource Description Framework (http://www.w3.org/ RDF/) (RDF) and the Web Ontology Language (http:// www.w3.org/2004/OWL/) (OWL) provide syntactic constructs with logically grounded semantics that can be used to annotate web services. Ontologies such as the OWL-Services (http://www. w3.org/Submission/2004/SUBM-OWL-S-20041122/) ontology and the Web Services Modeling Ontology (http://www.w3.org/Submission/WSMO/) (WSMO) are fresh initiatives to leverage the Semantic Web infrastructure for web service composition. A web service on the Semantic Web can be described by instances of concepts from the OWL-S or WSMO and easily invoked and composed with other compatible web services.
Foundations
Web Services
Web Services have revolutionized the field of distributed computation. Web Services enable the remote discovery and execution of software applications across the dimensions of the Web in an implementation agnostic manner. According to the World Wide Web Consortium (W3C), a Web Service is ‘‘a software system designed to support interoperable machine to machine interaction over a network.’’ The functionality of Web Services can be easily understood from three different perspectives viz. the service provider, the service client, and the service registry. The Service Provider
A service provider is a person or organization that is responsible for the development and maintenance of the Web Service. The service provider is familiar with the implementation details of the service, such as the architecture of the underlying application and the programming language used. These details are encapsulated by the Web Service interface, which describes the function of the Web Service, its location, input parameters and output parameters and their data-types. Consider an application that performs a search for relevant publication in an online, open access repository such as PubMed (http://www.pubmed.org). This application would be encapsulated in a web interface (the Web Service), which advertises the function of the service (searching for relevant publications), the location of the service (a resolvable URI, usually a URL), the input parameters (keyword, name of the
author, citation details such as journal name and/or date of publication, all of which would be string-literal data-types), and the output parameters of the service (title of the publication, abstract, and citation details, all of which are string-literals). This Web Service description is published as a WSDL document and is (optionally) registered in a registry. Note that the description does not include implementation details. The underlying application could be implemented on a Linux or Microsoft platform, using any programming or scripting language desired. More importantly, it can also be accessed by a client application on any platform using any language.
The Service Client
The service client locates a specific Web Service from its description. The client then remotely invokes the discovered service and executes it. In the PubMed example, consider a service client that is looking for publications about the Vascular Endothelial Growth Factor (VEGF) protein. The service client discovers the publication search service from its WSDL description. The WSDL description details the input parameters accepted by the Web Service, one of which is "keyword." The client sends the text string "VEGF" into the interface as the "keyword" parameter and thereby invokes the service. The communication between the client and the service generally utilizes the HTTP protocol, and is often supplemented by a standardized messaging format such as SOAP. Note that "keyword" is a human-readable label indicating the purpose of that input parameter: the WSDL interface definition must be interpreted by a person in order to determine the intent of that parameter, emphasizing the point that, in traditional Web Services, the interfaces are opaque due to the lack of ontological grounding of XML tags.
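The following Java sketch illustrates the kind of statically typed client view that code generation from a WSDL description is meant to provide for this example; the interface, class, and method names are hypothetical, and the in-line implementation merely stands in for the generated stub that would marshal the call into a SOAP message.

```java
import java.util.List;

// Hypothetical typed view of the publication-search service described above.
// A WSDL-driven toolkit would normally generate an interface like this, plus a stub
// that marshals the call into a SOAP/XML message and sends it to the service URI.
interface PublicationSearchService {
    List<Publication> searchByKeyword(String keyword);
}

// Result type corresponding to the service's output parameters (all string literals).
class Publication {
    final String title;
    final String abstractText;
    final String citation;
    Publication(String title, String abstractText, String citation) {
        this.title = title;
        this.abstractText = abstractText;
        this.citation = citation;
    }
}

public class ServiceClientSketch {
    public static void main(String[] args) {
        // In a real client the stub would come from generated code; this placeholder
        // implementation stands in for the remote call.
        PublicationSearchService service = keyword -> List.of(
                new Publication("Placeholder VEGF result", "abstract omitted", "citation omitted"));

        // The client only sees typed parameters; the meaning of "keyword" still has to
        // be read by a person from the WSDL description.
        for (Publication p : service.searchByKeyword("VEGF")) {
            System.out.println(p.title);
        }
    }
}
```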
The Service Registry The service registry acts as a ‘‘yellow pages’’ for Web Service discovery, and brokers the direct interaction between a client and a chosen service. Specifically, the registry holds WSDL descriptions of Web Services, including their functional characteristics such as input and output parameters, and location. In addition, human readable descriptions of the web service may also be included in the WSDL description. Details of the service provider, and classification information (e.g., the UNSPSC category of the service), and reliability estimates may also be provided. In addition, Web Service registries may perform
ancillary tasks such as periodic polling to identify inactive services, and collecting performance metrics and reliability estimates. The Semantic Web
The Semantic Web initiative aims to make the vast amounts of information on the Web processable by machines. Previously, computers were involved only in the presentational aspects of this information, which was marked up with tags from the HyperText Markup Language (http://www.w3.org/html/) (HTML). HTML tags specify how the tagged information is to be presented on a browser. However, there are no semantics associated with HTML tags, nor are there semantics associated with the hyperlinks between documents, thus preventing automated interpretation of the contained information. The development of knowledge representation formalisms for the Semantic Web such as RDF, the Ontology Inference Layer (http://www.ontoknowledge.org/oil/) (OIL), the DARPA Agent Markup Language (http://www.daml.org) (DAML), from the Defense Advanced Research Projects Agency (http://www.darpa.mil) (DARPA), and OWL has been an attempt to overcome these limitations of the Web. OWL is the most advanced formalism for markup on the Semantic Web, providing a very expressive and decidable set of constructs for ontology development, and has been adopted by the W3C as a standard. The semantics of constructs from OWL (and OIL and DAML) are grounded in mathematical logic. Therefore, the information content is now accessible to computers and can be used by logical inference axioms to deduce implicit knowledge. Lastly, the information content can be queried by machines across the Web. Intelligent, semantics-based searches become a possibility, replacing the prevalent simple keyword-based searches and word- and hyperlink-based algorithms. Three components are critical to understanding the Semantic Web architecture: ontologies, query engines, and inference engines. Figure 1 shows the Semantic Web architecture proposed by Tim Berners-Lee.
Web Services and the Semantic Web for Life Science Data. Figure 1. Proposed semantic web architecture (Berners-Lee et al., 2001).
Ontologies
Ontologies are machine-interpretable models of knowledge domains. An ontological representation of a knowledge domain contains logically grounded definitions of concepts and inter-concept relations that are specific to that domain. Since these concepts and relations are defined in a rigorous logical
framework, their semantics can be shared and processed by machines without loss of clarity. The information content of Web pages on the Semantic Web can be marked up as instances of ontologically defined concepts and relations. Knowledge representation formalisms such as OWL can be used to construct ontologies of different knowledge domains. Ontology editing tools such as Protégé (http://protege.stanford.edu/), the TopBraid suite (http://www.topquadrant.com/topbraidsuite.html), and Altova SemanticWorks (http://www.altova.com/products/semanticworks/semantic_web_rdf_owl_editor.html) have made the process of ontology development very straightforward for knowledge domain experts. This is attested by the proliferation of ontologies, especially in the life sciences domain.
Query Engines
Marking up the information content of Web pages with semantically rich tags makes it possible for query engines and languages to retrieve this information. Two of the earliest query languages for retrieving RDF-annotated information were SPARQL (http://www.w3.org/TR/rdf-sparql-query/) and the RDF Query Language (http://139.91.183.30:9090/RDF/RQL/) (RQL). SeRQL (http://www.openrdf.org/doc/sesame/users/ch06.html), nRQL [9], and RDQL (http://www.w3.org/Submission/2004/SUBM-RDQL-20040109/) are other querying formalisms that have been developed to retrieve RDF-annotated documents from the Semantic Web.
Inference Engines
Inference is the process of deducing implicit information from the combination of explicitly stated information and inference axioms. For example,
given that any person who is the parent of a parent is a grandparent (the axiom), and given that Lisa is the parent of Charles and Charles is the parent of Harry (explicitly stated facts), it can be deduced that Lisa is a grandparent of Harry (inference). In this fashion, logical inference axioms can be used to reason with the concept (and relation) definitions and their instances. The information content of the Semantic Web can be reasoned with by specialized inference engines such as FaCT++ (http://owl.man.ac.uk/factplusplus/), Pellet (http://pellet.owldl.org), and RACERPro (http://www.racer-systems.com/products/racerpro/index.phtml).
Semantic Web Services
The possibility of marking up Web Services with semantically expressive tags led to the adoption of the Semantic Web Services initiative (http://www.swsi.org/). The objective of the Semantic Web Services initiative is the use and leverage of the framework of Semantic Web technologies to enable the automation and dynamism of Web Services annotation, publication, discovery, invocation, execution, monitoring, and composition. Referring again to the PubMed search example, the input and output parameters of the search web service can be described using ontologically defined concept instances. WSDL provides a means to specify the acceptable input parameters of the Web Service to be one of three string literals, viz. keyword, citation details, and author name. However, the fact that the author name, citation details, and keyword are specialized types of string literals cannot be explicitly stated with WSDL. Using an ontology to ground these Web Service parameters as specific ‘‘types’’ of entity, it then becomes possible to indicate, to both human and machine, the purpose of each parameter field. Moreover, the concept of a ‘‘Literature Search Tool’’ could be defined as a Web Service that had parameters of type ‘‘author’’ and ‘‘keyword.’’ An inference engine could then be used to discover this PubMed Web Service in response to a user-request for Literature Search Tools. Semantic Web Services in the Life Sciences
Computational methods, also known as in silico experiments or workflows, complement in vivo and in vitro methods in biological research, and provide high-throughput access to tools that can manage the large volumes of genomic, proteomic, and other types of data required by modern biomedical experiments.
Since many analytical tools are now available as Web Services, it is now possible to "pipeline" these tools together to create in silico workflows, with the biologist manually connecting the output of one service to the input of the next based on a human-readable WSDL description. This, however, is a complex task, and requires the biologist to be aware of the individual capabilities and operating principles of each Web Service. Availability of a comprehensive textual description of a Web Service's capabilities and operating principles is rare, and not always helpful to a biologist who is not computer-savvy. Therefore, the constructed workflows may be error-prone and/or procedurally flawed. Moreover, many applications go through versions or become unavailable from time to time, making workflows somewhat fragile. The burden of "re-wiring" a workflow to accommodate a new version of an interface, replacing a dead interface with an equivalent functional one, and discovering new or improved applications as and when they become available is also placed on the biologist and requires knowledge and awareness outside of their domains of expertise. Machine-processable descriptions of Web Services are thus crucial to enabling their automatic discovery and invocation, and most importantly, their automated composition into stable, self-repairing analytical workflows. Registries of semantically annotated Web Services such as myGrid (http://www.mygrid.org.uk/) and Moby Central (http://biomoby.org/RESOURCES/MOBY-S/ServiceInstances) have been used to facilitate the discovery and use of Web Services by biologists. However, biologists still need to familiarize themselves with datatype hierarchies to discover web services that can process their data. Recent innovations capable of inferring datatypes from actual data [4] promise to relieve biologists of this requirement as well. The automated composition of Web services is more troublesome. Automatically pipelining the output of one web service into the input of another is the simplest of the proposed approaches to service composition. In mathematical terms, this is analogous to finding a route in a directed graph. A web services network can be visualized as a directed graph as shown in Figure 2. Every node in Figure 2 corresponds to a web service. Directed edges leading to a web service correspond to input parameters, and those leading away from a
service, to output parameters. Finding a route from node 7 (which could represent a service that queries a genome database given an accession number, for instance) to node 10 (which may represent a service that performs a sequence comparison and outputs a report) in this figure is a computationally complex problem, capable of consuming large amounts of computation time and memory. Even for finding the shortest possible route in the figure from node 7 to node 10, uninformed search algorithms are exponentially complex. This means that the time taken to find a solution (and the computational memory used) increases exponentially with the number of possible solutions. Heuristics are required to prune the search space in such problems and reduce the complexity to decidable and tolerable limits. In web service composition, the complexity of the problem is compounded by several factors. Some services are asynchronous, meaning they take in input parameters without outputting anything. Other services may be notifiers, giving output parameters without (seemingly) taking in any input. Some edges may not necessarily lead to other services, being dead ends for the route-finding algorithm. Lastly, the shortest combination of services may not be the one the biologist actually wants. The automated composition of web services is, therefore, an open and fascinating research problem. The interested reader may refer to the resources at the Web Services Choreography Working Group (http://www.w3.org/2002/ws/chor/) Web site at the W3C for more information.
Web Services and the Semantic Web for Life Science Data. Figure 2. A directed graph representation of a web service network.
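The route-finding view of composition can be sketched as a breadth-first search over a directed graph of services; the service names and edges below are invented, and the sketch ignores the asynchronous, notifier, and dead-end complications discussed above.

```java
import java.util.*;

public class ServiceCompositionSketch {
    // Adjacency list: service -> services whose input can consume its output.
    static Map<String, List<String>> graph = new HashMap<>();

    static void connect(String from, String to) {
        graph.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Breadth-first search for one shortest pipeline of services from start to goal.
    static List<String> findPipeline(String start, String goal) {
        Map<String, String> predecessor = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        predecessor.put(start, null);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(goal)) {
                LinkedList<String> path = new LinkedList<>();
                for (String s = goal; s != null; s = predecessor.get(s)) path.addFirst(s);
                return path;
            }
            for (String next : graph.getOrDefault(current, List.of())) {
                if (!predecessor.containsKey(next)) {   // not visited yet
                    predecessor.put(next, current);
                    queue.add(next);
                }
            }
        }
        return List.of(); // no composition found
    }

    public static void main(String[] args) {
        // Invented services: a database query, a format converter, a sequence comparison.
        connect("GenomeDbQuery", "FormatConverter");
        connect("FormatConverter", "SequenceComparison");
        connect("GenomeDbQuery", "DeadEndService");
        System.out.println(findPipeline("GenomeDbQuery", "SequenceComparison"));
    }
}
```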
Key Applications
BioMoby [10] is an open-source, extensible framework that enables the representation, discovery, retrieval, and integration of biological Web Services. By registering their analysis and data access services with BioMoby, service providers agree to use and provide service specifications in a shared semantic space. The BioMoby Central registry now hosts more than a thousand services in the United States, Canada, and several other countries across the world. BioMoby uses a datatype hierarchy to facilitate automated discovery of Web Services capable of handling specific input datatypes. As a minimal Web-based interface, the Gbrowse Moby (http://moby.ucalgary.ca/gbrowse_moby) service browser can be used by biologists to discover and invoke biological web services from the Moby registry and seamlessly chain these services to compose multi-step analytical workflows. The process is data-centric, relying on the input and output datatype specifications of the services. The Seahawk client interface [4] can infer the datatype of the input data files directly and immediately present the biologist with a list of BioMoby Web Services that can process that input file. Users of the Seahawk interface are relieved of the necessity to familiarize themselves with datatype hierarchies, and instead are free to concentrate on the analytical aspects of their work. Other clients for BioMoby services include Remora (http://lipmbioinfo.toulouse.inra.fr/remora/cgi/remora.cgi), Ahab (http://bioinfo.icapture.ubc.ca/bgood/Ahab.html), and MOWserv (http://www.inab.org/MOWServ/). myGrid [3] is an example of a computational grid architecture that brings together various computational resources to support in silico biological experiments. The computational resources include analysis and data retrieval services. As such, myGrid is a service grid, and the myGrid philosophy regards in silico experiments as distributed queries and workflows. Taverna [5], a module of myGrid, provides a means to integrate the diverse resources on the myGrid framework into reusable in silico workflows. Grimoires [1] is a semantics-enabled service registry that enables discovery of myGrid services. Taverna also provides an interface to access BioMoby services [6], a promising development towards the integration of research initiatives in the area of biomedical informatics. The PathPort framework [11] developed at the Virginia Bioinformatics Institute presents a Web-based interface that makes it possible for end users to invoke local and distributed biological Web Services that are
described in WSDL, in a location and platform independent manner. Sembowser [8] is an ontology based implementation of a service registry that facilitates Web Service annotation, search and discovery based upon domain specific keywords whose semantics are grounded in mathematical logic. Sembowser associates every published Web Service with concepts defined in an ontology and further, classifies the published service on the basis of the task it performs and the domain with which it is associated. The Simple Semantic Web Architecture and Protocol (http://semanticmoby.org) (SSWAP), is a Semantic Web Services architecture that uses ontology based descriptions of biological Web Services from the Virtual Plant Information Network (http://vpin.ncgr.org/) (VPIN) to discover, and remotely execute Web services. Formerly known as Semantic Moby, this architecture is the closest realization of the conventional definition of a Semantic Web Services framework, and utilizes the knowledge embedded in any Web-based ontology to assist in discovery and pipelining of Web Services. The need for semantic descriptions to enable automated Web Service discovery, invocation and composition has led to the development of prototypical ontology based frameworks. The OWL-S framework, the METEOR-S framework (http://lsdis.cs.uga.edu/ projects/meteor-s/), and the Web Services Modeling Framework (WSMF) [2] are examples of research initiatives in this area. Ontology based Web Service architectures are in very early stages of development and their evolution into industrial strength technologies will be eagerly anticipated. Web services with machine processable descriptions are of great value in the development of mashups. A mashup is a composite application that brings together data and functionality from diverse sources using technologies such as AJAX and RSS. Mashup development has been of specific interest to the Web 2.0 community, with its emphasis on mass collaborative information gathering techniques, as exemplified by Wikipedia (http://www.wikipedia.org), Flickr (http://www.flickr.com), and Delicious (http:// del.icio.us). Very recently, diverse applications have been embedded with Semantic Web ontologies, query engines, and inference engines to create mashups. Mashups have been used to answer specific questions about genes responsible for Alzheimer’s disease and also visualize these genes [7]. In this demo, results from SPARQL
queries have been integrated with data extracted from the Allen Brain Atlas site (http://www.brainatlas.org/aba/) using screen scraping techniques, and finally visualized using Google Maps (http://maps.google.com) and the Exhibit (http://simile.mit.edu/exhibit/) visualization toolkit.
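A mashup component of the kind described above might retrieve RDF data by sending a SPARQL query to an endpoint over plain HTTP, as in the following Java sketch; the endpoint URI and the vocabulary used in the query are placeholders rather than the actual resources used in [7].

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SparqlQuerySketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and vocabulary; substitute a real SPARQL endpoint and ontology terms.
        String endpoint = "https://example.org/sparql";
        String query =
                "PREFIX ex: <http://example.org/genes#> " +
                "SELECT ?gene WHERE { ?gene ex:associatedWith ex:AlzheimersDisease }";

        // Per the SPARQL protocol, the query is passed as the 'query' parameter;
        // the Accept header asks for the standard XML results format.
        URI uri = URI.create(endpoint + "?query=" + URLEncoder.encode(query, StandardCharsets.UTF_8));
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Accept", "application/sparql-results+xml")
                .GET()
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // results would then be merged with other data sources
    }
}
```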
Future Directions The Semantic Web, with its promise of information sharing, reuse and integration, has seen widespread adoption by the life sciences community. Currently, the curators of many of the major genome, proteome, transcriptome, and interactome repositories are adopting Semantic Web standards, while more and more service providers are gravitating towards the Semantic Web Services paradigm. Semantic Web standards are crucial to a synergistic approach to a full understanding of, and representation of, complexity in Life Sciences.
Cross-references
▶ Controlled Medical Vocabulary ▶ Data Structures and Models for Biological Data Management ▶ Data Types in Scientific Data Management Systems ▶ Grid and Workflows ▶ Mashups ▶ Ontologies ▶ Ontologies and Life Science Data Management ▶ Ontologies in Scientific Data Integration ▶ Ontology ▶ Ontology-Based Data Models ▶ OWL: Web Ontology Language ▶ Query Languages for Ontological Data ▶ Query Languages for the Life Sciences ▶ RDF ▶ Scientific Workflows ▶ Screen Scraper ▶ Semantic Web ▶ Semantic Web Query Languages ▶ Semantic Web Services ▶ SOAP ▶ W3C ▶ Web 2.0 (3.0) ▶ Web Services ▶ Workflow Management and Workflow Management Systems ▶ Workflow Model ▶ Workflow Modeling ▶ XML
Recommended Reading
1. Fang W., et al. Performance analysis of a semantics enabled service registry. In Proc. 4th All Hands Meeting, Nottingham, UK, 2005. Available at: http://users.ecs.soton.ac.uk/lavm/papers/Fang-AHM05.pdf
2. Fensel D. and Bussler C. The Web Services Modeling Framework (WSMF). Electron. Commerce Res. Appl., 1(2):113–137, 2002.
3. Goble C., Pettifer S., Stevens R., and Greenhalgh C. Knowledge integration: in silico experiments in bioinformatics. In The Grid: Blueprint for a New Computing Infrastructure (2nd edn.), I. Foster, C. Kesselman (eds.). Morgan Kaufmann, Los Altos, CA, 2003.
4. Gordon P. and Sensen C. Seahawk: moving beyond HTML in Web-based bioinformatics analyses. BMC Bioinformatics, 8(June):208, 2007.
5. Hull D., Wolstencroft K., Stevens R., Goble C., Pocock M., Li P., and Oinn T. Taverna: a tool for building and running workflows of services. Nucleic Acids Res., 34:W729–W732, 2006.
6. Kawas E., Senger M., and Wilkinson M. BioMoby extensions to the Taverna workflow management and enactment software. BMC Bioinformatics, 7:523, 2006.
7. Ruttenberg A., et al. Advancing translational research with the semantic web. BMC Bioinformatics, 8(Suppl. 3):S2, May 2007.
8. Sahoo S., Sheth A., Hunter B., and York W. Sembowser: adding semantics to a biological web services registry. In Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences, C. Baker, K. Cheung (eds.). Springer, 2007.
9. Wessel M. and Möller R. A high performance semantic web query answering engine. In Proc. Int. Workshop on Description Logics, 2005.
10. Wilkinson M. and Links M. BioMOBY: an open-source biological web services proposal. Brief. Bioinform., 3(4):331–341, 2002.
11. Xue T., Yang B., Will R., Sharp B., Kenyon R., Crasta O., and Sobral B. A generalized framework for pathosystems informatics and bioinformatics web services. In Proc. 2007 Int. Conf. on Bioinformatics and Computational Biology, 2007.
Web Services Business Process Execution Language ▶ Composed services and WS-BPEL
Web Site Wrappers ▶ Languages for Web Data Extraction
Web Spam Detection
MARC NAJORK
Microsoft Research, Mountain View, CA, USA
Synonyms Spamdexing; Google bombing; Adversarial information retrieval
Definition Web spam refers to a host of techniques to subvert the ranking algorithms of web search engines and cause them to rank search results higher than they would otherwise. Examples of such techniques include content spam (populating web pages with popular and often highly monetizable search terms), link spam (creating links to a page in order to increase its link-based score), and cloaking (serving different versions of a page to search engine crawlers than to human users). Web spam is annoying to search engine users and disruptive to search engines; therefore, most commercial search engines try to combat web spam. Combating web spam consists of identifying spam content with high probability and, depending on policy, downgrading it during ranking, eliminating it from the index, no longer crawling it, and tainting affiliated content. The first step, identifying likely spam pages, is a classification problem amenable to machine learning techniques. Spam classifiers take a large set of diverse features as input, including content-based features, link-based features, DNS and domain-registration features, and implicit user feedback. Commercial search engines treat their precise set of spam-prediction features as extremely proprietary, and features (as well as spamming techniques) evolve continuously as search engines and web spammers are engaged in a continuing "arms race."
Historical Background Web spam is almost as old as commercial search engines. The first commercial search engine, Lycos, was incorporated in 1995 (after having been incubated for a year at CMU); and the first known reference to ‘‘spamdexing’’ (a combination of ‘‘spam’’ and ‘‘indexing’’) dates back to 1996. Commercial search engines began to combat spam shortly thereafter, increasing their efforts as it became more prevalent.
Spam detection became a topic of academic discourse with Davison’s paper on using machine learning techniques to identify ‘‘nepotistic links,’’ i.e., link spam [4], and was further validated as one of the great challenges to commercial search engines by Henzinger et al. [9]. Since 2005, the workshop series on Adversarial Information Retrieval on the Web (AIRWeb) provides a venue for researchers interested in web spam.
Foundations Given that the objective of web spam is to improve the ranking of select search results, web spamming techniques are tightly coupled to the ranking algorithms employed (or believed to be employed) by the major search engines. As ranking algorithms evolve, so will spamming techniques. For example, if web spammers were under the impression that a search engine would use click-through information of its search result pages as a feature in their ranking algorithms, then they would have an incentive to issue queries that bring up their target pages, and generate large numbers of clicks on these target pages. Furthermore, web spamming techniques evolve in response to countermeasures deployed by the search engines. For example, in the above scenario, a search engine might respond to facetious clicks by mining their query logs for many instances of identical queries from the same IP address and discounting these queries and their result clickthroughs in their ranking computation. The spammer in turn might respond by varying the query (while still recalling the desired target result), and by using a ‘‘bot-net’’ (a network of third-party computers under the spammer’s control) to issue the queries and the click-throughs on the target results. Given that web spamming techniques are constantly evolving, any taxonomy of these techniques must necessarily be ephemeral, as will be any enumeration of spam detection heuristics. However, there are a few constants: Any successful web spamming technique targets one or more of the features used by the search engine’s ranking algorithms. Web spam detection is a classification problem, and search engines use machine learning algorithms to decide whether or not a page is spam. In general, spam detection heuristics look for statistical anomalies in some of the features visible to the search engines.
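As a toy illustration of the query-log heuristic mentioned above, the following Java sketch counts result clicks per (IP address, query) pair and flags suspicious repetitions; the log format and threshold are invented, and production systems rely on far more sophisticated statistical models.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClickLogAnomalySketch {
    public static void main(String[] args) {
        // Invented log records, one per result click, in the form "ipAddress|query".
        List<String> clickLog = List.of(
                "10.0.0.5|cheap watches", "10.0.0.5|cheap watches", "10.0.0.5|cheap watches",
                "10.0.0.5|cheap watches", "192.168.1.9|database systems");

        Map<String, Integer> counts = new HashMap<>();
        for (String record : clickLog) {
            counts.merge(record, 1, Integer::sum); // count clicks per (IP, query) pair
        }

        int threshold = 3; // invented cutoff; real systems model expected click distributions
        counts.forEach((key, count) -> {
            if (count > threshold) {
                System.out.println("Discount clicks for " + key + " (" + count + " repeats)");
            }
        });
    }
}
```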
Web Spam Detection as a Classification Problem
Web spam detection can be viewed as a binary classification problem, where a classifier is used to predict whether a given web page or entire web site is spam or not. The machine learning community has produced a large number of classification algorithms, several of which have been used in published research on web spam detection, including decision-tree-based classifiers (e.g., C4.5), SVM-based classifiers, Bayesian classifiers, and logistic regression classifiers. While some classifiers perform better than others (and the spam detection community seems to favor decision-tree-based ones), most of the research focuses not on the classification algorithms, but rather on the features that are provided to them.
Taxonomy of Web Spam Techniques
Content spam refers to any web spam technique that tries to improve the likelihood that a page is returned as a search result and to improve its ranking by populating the page with salient keywords. Populating a page with words that are popular query terms will cause that page to be part of the result set for those queries; choosing good combinations of query terms will increase the portion of the relevance score that is based on textual features. Naïve spammers might perform content spam by stringing together a wide array of popular query terms. Search engines can counter this by employing language modeling techniques, since web pages that contain many topically unrelated keywords or that are grammatically ill-formed will exhibit statistical differences from normal web pages [11]. More sophisticated spammers might generate not a few, but rather millions of target web pages, each page augmented with just one or a few popular query terms. The remainder of the page may be entirely machine-generated (which might exhibit statistical anomalies that can be detected by the search engine), entirely copied from a human-authored web site such as Wikipedia (which can be detected by using near-duplicate detection algorithms), or stitched together from fragments of several human-authored web sites (which is much harder, but not impossible to detect). Link spam refers to any web spam technique that tries to increase the link-based score of a target web page by creating lots of hyperlinks pointing to it. The hyperlinks may originate from web pages owned and
controlled by the spammer (generically called a link farm), they may originate from partner web sites (a technique known as link exchange), or they may originate from unaffiliated (and sometimes unknowing) third parties, for example web-based discussion forums or in blogs that allow comments to be posted (a phenomenon called blog spam). Search engines can respond to link spam by mining the web graph for anomalous components, by propagating distrust from spam pages backwards through the web graph, and by using content-based features to identify spam postings to a blog [10]. Many link spam techniques specifically target Google’s PageRank algorithm, which not only counts the number of hyperlinks referring to a web page, but also takes the PageRank of the referring page into account. In order to increase the PageRank of a target page, spammers should create links on sites that have high PageRanks, and for this reason, there is a marketplace for expired domains with high PageRank, and numerous brokerages reselling them. Search engines can respond by temporarily dampening the endorsement power of domains that underwent a change in ownership. Click spam refers to the technique of submitting queries to search engines that retrieve target result pages and then to ‘‘click’’ on these pages in order to simulate user interest in the result. The result pages returned by the leading search engines contain clientside scripts that report clicks on result URLs to the engine, which can then use this implicit relevance feedback in subsequent rankings. Click spam is similar in method to click fraud, but different in objective. The goal of click spam is to boost the ranking of a page, while the goal of click fraud (generating a large number of clicks on search engine advertisements) is to spend the budget associated with a particular advertisement (to hurt the competitor who has placed the ad or simply to lower the auction price of said ad, which will drop once the budget of the winning bidder has been exhausted). In a variant of click fraud, the spammer targets ads delivered to his own web by an ad-network such as Google AdSense and obtains a revenue share from the ad-network. Both click fraud and click spam are trivial to detect if launched from a single machine, and hard to detect if launched from a bot-net consisting of tens of thousands of machines [3]. Search engines tackle the problem by mining their click logs for statistical anomalies, but very little is known about their algorithms.
Cloaking refers to a host of techniques aimed at delivering (apparently) different content to search engines than to human users. Cloaking is typically used in conjunction with content spam, by serving a page containing popular query terms to the search engine (thereby increasing the likelihood that the page will be returned as the result of a search), and presenting the human user with a different page. Cloaking can be achieved using many different techniques: by literally serving different content to search engines than to ordinary users (based for example on the well-known IP addresses of the major search engine crawlers), by rendering certain parts of the page invisible (say by setting the font to the same color as the background), by using client-side scripting to rewrite the page after it has been delivered (relying on the observation that search engine crawlers typically do not execute scripts), and finally by serving a page that immediately redirects the user’s browser to a different page (either via client-side scripting or the HTML ‘‘meta-redirect’’ tag). Each variant of cloaking calls for a different defense. Search engines can guard against different versions of the same page by probing the page from unaffiliated IP addresses [13]; they can detect invisible content by rendering the page; and they can detect page modifications and script-driven redirections by executing client-side scripts [12].
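The probing defense against cloaking can be sketched as fetching the same URL under two different identities and comparing the responses; the sketch below varies only the User-Agent header (a simplification of probing from unaffiliated IP addresses), uses a placeholder URL, and applies a deliberately crude equality test.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CloakingProbeSketch {
    static String fetch(HttpClient client, String url, String userAgent) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String url = "http://example.com/";  // placeholder page to probe

        // Fetch once pretending to be a crawler and once pretending to be a browser.
        String asCrawler = fetch(client, url, "ExampleCrawler/1.0");
        String asBrowser = fetch(client, url, "Mozilla/5.0");

        // Crude comparison; real detectors compare content signatures and must also
        // rule out legitimately dynamic pages before flagging a page as cloaked.
        if (!asCrawler.equals(asBrowser)) {
            System.out.println("Content differs by user agent: possible cloaking");
        } else {
            System.out.println("No difference observed");
        }
    }
}
```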
Key Applications Web spam detection is used primarily by advertisementfinanced general-purpose consumer search engines. Web spam is not an issue for enterprise search engines, where the content providers, the search engine operator and the users are all part of the same organization and have shared goals. However, web spam is bound to become a problem in any setting where these three parties – content providers, searchers, and search engines – have different objectives. Examples of such settings include vertical search services, such as product search engines, company search engines, people search engines, or even scholarly literature search engines. Many of the basic concepts described above are applicable to these domains as well; the precise set of features useful for spam detection will depend on the ranking algorithms used by these vertical search engines.
Future Directions Search engines are increasingly leveraging human intelligence, namely the observable actions of their
user base, in their relevance assessments; examples include click-stream analysis, toolbar data analysis, and analysis of traffic on affiliate networks (such as the Google AdSense network). It is likely that many of the future spam detection features will also be based on the behavior of the user base. In many respects, the distinction between computing features for ranking (promoting relevant documents) and spam detection (demoting facetious documents) is artificial, and the boundary between ranking and spam suppression is likely to blur as search engines evolve.
Experimental Results Several studies have assessed the incidence of spam in large-scale web crawls at between 8% and 13% [11,5]; the percentage increases as more pages are crawled, since many spam sites serve a literally unbounded number of pages, and web crawlers tend to crawl high-quality human-authored content early on. Ntoulas et al. describe a set of content-based features for spam detection; these features, when combined using a decision-tree-based classifier, resulted in an overall spam prediction accuracy of 97% [11].
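For illustration only, the following Java sketch computes two simple content-based features and applies a hand-set rule; these are not the features or the classifier of [11], which combines a much richer feature set with a trained decision tree.

```java
import java.util.List;
import java.util.Set;

public class ContentFeatureSketch {
    // Illustrative features only; a real detector uses many features and a trained model.
    static double popularTermFraction(List<String> words, Set<String> popularTerms) {
        long hits = words.stream().filter(popularTerms::contains).count();
        return words.isEmpty() ? 0.0 : (double) hits / words.size();
    }

    static double averageWordLength(List<String> words) {
        return words.stream().mapToInt(String::length).average().orElse(0.0);
    }

    public static void main(String[] args) {
        Set<String> popularTerms = Set.of("cheap", "free", "loan", "casino"); // invented list
        List<String> pageWords = List.of("cheap", "loan", "free", "casino", "cheap", "loan");

        double f1 = popularTermFraction(pageWords, popularTerms);
        double f2 = averageWordLength(pageWords);

        // Hand-set rule standing in for a decision-tree or other learned classifier.
        boolean likelySpam = f1 > 0.5;
        System.out.printf("popularTermFraction=%.2f avgWordLength=%.2f spam=%b%n", f1, f2, likelySpam);
    }
}
```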
4. Davison B.D. Recognizing nepotistic links on the web. In Proc. AAAI Workshop on Artificial Intelligence for Web Search, 2000. 5. Fetterly D., Manasse M., and Najork M. Spam, damn spam and statistics. In Proc. 7th Int. Workshop on the Web and Databases, 2004, pp. 1–6. 6. Gyo¨ngyi Z., and Garcia-Molina H. Spam: its not just for Inboxes anymore. IEEE Comput., 38(10):28–34, 2005. 7. Gyo¨ngyi Z. and Garcia-Molina H. Web Spam Taxonomy. In Proc. 1st Int. Workshop on Adversarial Information Retrieval on the Web, 2005, pp. 39–47. 8. Gyo¨ngyi Z., Garcia-Molina H., and Pedersen J. Combating Web spam with TrustRank. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 576–587. 9. Henzinger M., Motwani R., and Silverstein C. Challenges in web search engines. ACM SIGIR Forum 36(2):11–22, 2002. 10. Mishne G., Carmel D., and Lempel R. Blocking blog spam with language model disagreement. In Proc. 1st Int. Workshop on Adversarial Information Retrieval on the Web, 2005, pp. 1–6. 11. Ntoulas A., Najork M., Manasse M., and Fetterly D. Detecting spam web pages through content analysis. In Proc. 15th Int. World Wide Web Conference, 2006, pp. 83–92. 12. Wang Y.M., Ma M., Niu Y., and Chen H. Spam double-funnel: connecting Web spammers with advertisers. In Proc. 16th Int. World Wide Web Conference, 2007, pp. 291–300. 13. Wu B. and Davison B. Detecting semantic cloaking on the web. In Proc. 15th Int. World Wide Web Conference, 2006, pp. 819–828.
Data Sets
Castillo et al. have compiled the WEBSPAM-UK2006 data set [2], a collection of web pages annotated by human judges as to whether or not they are spam. This data set has become a reference collection for the field, and has been used to evaluate many of the more recent web spam detection techniques.
Cross-references
▶ Indexing the Web ▶ Web Page Quality Metrics ▶ Web Search Relevance Feedback ▶ Web Search Relevance Ranking
Recommended Reading
1. Becchetti L., Castillo C., Donato D., Leonardi S., and Baeza-Yates R. Using rank propagation and probabilistic counting for link-based spam detection. In Proc. KDD Workshop on Web Mining and Web Usage Analysis, 2006.
2. Castillo C., Donato D., Becchetti L., Boldi P., Leonardi S., Santini M., and Vigna S. A reference collection for Web spam. ACM SIGIR Forum, 40(2):11–24, 2006.
3. Daswani N. and Stoppelman M. and the Google Click Quality and Security Teams. The anatomy of Clickbot.A. In Proc. 1st Workshop on Hot Topics in Understanding Botnets, 2007.
4. Davison B.D. Recognizing nepotistic links on the web. In Proc. AAAI Workshop on Artificial Intelligence for Web Search, 2000.
5. Fetterly D., Manasse M., and Najork M. Spam, damn spam and statistics. In Proc. 7th Int. Workshop on the Web and Databases, 2004, pp. 1–6.
6. Gyöngyi Z. and Garcia-Molina H. Spam: it's not just for inboxes anymore. IEEE Comput., 38(10):28–34, 2005.
7. Gyöngyi Z. and Garcia-Molina H. Web spam taxonomy. In Proc. 1st Int. Workshop on Adversarial Information Retrieval on the Web, 2005, pp. 39–47.
8. Gyöngyi Z., Garcia-Molina H., and Pedersen J. Combating Web spam with TrustRank. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 576–587.
9. Henzinger M., Motwani R., and Silverstein C. Challenges in web search engines. ACM SIGIR Forum, 36(2):11–22, 2002.
10. Mishne G., Carmel D., and Lempel R. Blocking blog spam with language model disagreement. In Proc. 1st Int. Workshop on Adversarial Information Retrieval on the Web, 2005, pp. 1–6.
11. Ntoulas A., Najork M., Manasse M., and Fetterly D. Detecting spam web pages through content analysis. In Proc. 15th Int. World Wide Web Conference, 2006, pp. 83–92.
12. Wang Y.M., Ma M., Niu Y., and Chen H. Spam double-funnel: connecting Web spammers with advertisers. In Proc. 16th Int. World Wide Web Conference, 2007, pp. 291–300.
13. Wu B. and Davison B. Detecting semantic cloaking on the web. In Proc. 15th Int. World Wide Web Conference, 2006, pp. 819–828.
Web Structure Mining ▶ Data, Text, and Web Mining in Healthcare
Web Transactions
Heiko Schuldt, University of Basel, Basel, Switzerland
Synonyms
Internet transactions
Definition
A Web Transaction is a transactional interaction between a client, usually a web browser, and one or several databases as the backend of a multi-tier architecture. The middle tier of the architecture includes a web server which accepts client requests via HTTP. It forwards these requests either directly to the underlying database or to an application server which, in turn, interacts with the database.
Key Points
The main application of Web transactions is in e-commerce applications. In a minimal configuration, the architecture for Web transactions consists of a client, a web server, and a database server. Database access is provided at the web server level, e.g., by embedding database access into Java servlets or JavaServer Pages (JSP). More sophisticated architectures consider a dedicated application server layer (i.e., an implementation of the Java Enterprise Edition specification, Java EE) and support distributed databases. Communication between the application and database layers is usually implemented on the basis of JDBC (Java Database Connectivity), ODBC (Open Database Connectivity), or native database protocols. Transaction support at the application server level is provided by services like JTS (Java Transaction Service) via the Java Transaction API (JTA) in the Java world, or OTS (Object Transaction Service) in CORBA. Essentially, these services allow several application server calls and their interactions with resource managers to be associated with the same transaction context and to be coordinated by using a two-phase commit (2PC) protocol. Thus, Web transactions mostly focus on atomic commit processing. Similarly, protocols on the Web service stack exploit 2PC to atomically execute several Web services. Multi-tier web applications require special support for failure handling at the application level (application recovery). In order to increase the scalability of multi-tier architectures for Web transactions, application servers can be replicated. To avoid the database backend then becoming a bottleneck, dedicated caching strategies are applied at the middle tier. Web transactions can be part of Transactional processes.
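The coordination described above – several resource-manager interactions tied to one transaction context and committed atomically – can be illustrated with a minimal two-phase commit sketch. The following Python code only illustrates the protocol's control flow; the ResourceManager class, its methods, and the transaction identifier are hypothetical and are not the JTA/OTS APIs.

# Minimal sketch of the two-phase commit (2PC) control flow used by services
# such as JTS/OTS to coordinate the resource managers touched by one Web
# transaction. The ResourceManager interface below is hypothetical.

class ResourceManager:
    def __init__(self, name):
        self.name = name
    def prepare(self, tx_id):      # phase 1: vote to commit or abort
        print(f"{self.name}: prepared {tx_id}")
        return True
    def commit(self, tx_id):       # phase 2a: make changes durable
        print(f"{self.name}: committed {tx_id}")
    def rollback(self, tx_id):     # phase 2b: undo changes
        print(f"{self.name}: rolled back {tx_id}")

def two_phase_commit(tx_id, resources):
    # Phase 1: ask every participant to prepare; any "no" vote aborts all.
    if all(rm.prepare(tx_id) for rm in resources):
        for rm in resources:       # Phase 2: unanimous yes -> commit everywhere
            rm.commit(tx_id)
        return "committed"
    for rm in resources:           # otherwise roll back all participants
        rm.rollback(tx_id)
    return "aborted"

# An order that spans two databases behind the application server:
print(two_phase_commit("tx-42", [ResourceManager("orders-db"),
                                 ResourceManager("inventory-db")]))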
Cross-references
▶ Application Server ▶ Caching ▶ Transactional Processes ▶ Web Service
Web Usage Mining ▶ Data, Text, and Web Mining in Healthcare
Web Views
Alexandros Labrinidis, Department of Computer Science, University of Pittsburgh, Pittsburgh, PA, USA
Synonyms
Web Views; HTML fragment
Definition
Web Views are web pages or web page fragments that are automatically created from base data, which are typically stored in a database management system (DBMS).
Key Points
Although caching of HTML pages (and fragments) has been proposed in the literature as early as the mid-1990s, the term Web View was introduced in 1999 [2] to denote that these fragments are generated through queries made to a back-end DBMS. The concept of a Web View facilitates the materialization of dynamically generated HTML fragments outside the DBMS. The main advantage of materializing Web Views (outside the DBMS, e.g., at the web server) is that the web server need not query the DBMS for every user request, thus greatly improving query response time for the user [3]. On the other hand, for the quality of the data returned to the user to be high, the system must keep materialized Web Views fresh (in the background). This essentially decouples the processing of queries at the web server from the processing of updates at the DBMS (which are also propagated to materialized Web Views). Materializing a Web View presents a trade-off. On the one hand, it can greatly improve response time for user-submitted queries. On the other hand, it generates a background overhead for the Web View to be kept fresh (essentially, a materialization decision can be
viewed as a ‘‘contract’’ for the Web View to be kept fresh). As such, selecting which Web Views to materialize is an important problem that has received attention [5,6]. This problem is essentially similar to the view selection problem in traditional DBMSs [5], with added complexity due to the online nature of the Web and the need for any solution to be highly dynamic, constantly adapting to changing conditions and workloads. Having identified the set of Web Views to materialize, there is the issue of determining the order by which to propagate updates to them. Since one update to a base relation can trigger the refresh of multiple different materialized Web Views, the order by which these are updated can make an impact on the overall quality of data returned to the user. This is essentially a special-case online scheduling problem, which has been addressed in [4]. Web Views have been used successfully to decouple the processing of queries (to generate dynamic, database-driven web pages) from that of updates (to the content stored inside the DBMS used to drive the web site). This enables much better performance (in terms of response time to user queries) without sacrificing the freshness of the data served back to the user.
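The decoupling of query serving from update propagation can be sketched as follows. This is a minimal illustration, not the design of any actual system: the WebView class, its query function, and the staleness flag are invented for the example.

# Sketch of decoupling query serving from update propagation for a
# materialized Web View. Names and structure are illustrative only.

class WebView:
    def __init__(self, name, query_fn):
        self.name = name
        self.query_fn = query_fn    # recomputes the HTML fragment from base data
        self.fragment = query_fn()  # materialize once up front
        self.stale = False

    def serve(self):
        # The web server answers from the cached fragment; it does not need
        # to query the DBMS on the user's critical path.
        return self.fragment

    def mark_stale(self):
        # Called when an update to the base data affects this view.
        self.stale = True

    def refresh_if_needed(self):
        # Run in the background, outside the request/response path.
        if self.stale:
            self.fragment = self.query_fn()
            self.stale = False

# Hypothetical base table and view definition.
prices = {"tea": 3, "coffee": 4}
price_list = WebView("price-list",
                     lambda: "<ul>" + "".join(f"<li>{k}: {v}</li>"
                                              for k, v in prices.items()) + "</ul>")

print(price_list.serve())       # fast path: served from the materialized view
prices["coffee"] = 5            # update at the DBMS
price_list.mark_stale()         # update propagation marks the view
price_list.refresh_if_needed()  # background refresh keeps the view fresh
print(price_list.serve())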
Cross-references
▶ View Maintenance

Recommended Reading
1. Gupta H. and Mumick I.S. Selection of views to materialize under a maintenance cost constraint. In Proc. 7th Int. Conf. on Database Theory, 1999, pp. 453–470.
2. Labrinidis A. and Roussopoulos N. On the materialization of Web views. In Proc. ACM SIGMOD Workshop on The Web and Databases, 1999, pp. 79–84.
3. Labrinidis A. and Roussopoulos N. Web View materialization. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 367–378.
4. Labrinidis A. and Roussopoulos N. Update propagation strategies for improving the quality of data on the Web. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 391–400.
5. Labrinidis A. and Roussopoulos N. Balancing performance and data freshness in Web database servers. In Proc. 29th Int. Conf. on Very Large Data Bases, 2003, pp. 393–404.
6. Labrinidis A. and Roussopoulos N. Exploring the tradeoff between performance and data freshness in database-driven Web servers. VLDB J., 13(3):240–255, 2004.

Web Widget ▶ Snippet

What-If Analysis
Stefano Rizzi, University of Bologna, Bologna, Italy

Definition
In order to be able to evaluate beforehand the impact of a strategic or tactical move, so as to plan optimal strategies to reach their goals, decision makers need reliable predictive systems. What-if analysis is a data-intensive simulation whose goal is to inspect the behavior of a complex system, such as the corporate business or a part of it, under some given hypotheses called scenarios. In particular, what-if analysis measures how changes in a set of independent variables impact a set of dependent variables with reference to a given simulation model; such a model is a simplified representation of the business, tuned according to the historical corporate data. In practice, formulating a scenario enables the building of a hypothetical world that the analyst can then query and navigate.

Historical Background
Though what-if analysis can be considered a relatively recent discipline, its background is rooted at the confluence of different research areas, some of which date back decades. First of all, what-if analysis borrows some of the techniques developed within the simulation community and contextualizes them for business intelligence. Simulations are used in a wide variety of practical contexts, including physics, chemistry, biology, engineering, economics, and psychology. Much literature has been written in this field over the years, mainly regarding the design of simulation experiments and the validation of simulation models [5,8,9]. Another relevant field for what-if analysis is economics, which provides the insights into business processes necessary to build and test simulation models. For instance, in [1] a set of alternative approaches to forecasting are surveyed, and useful guidelines are given for selecting the best ones according to the availability and reliability of knowledge. Finally, what-if analysis relies heavily on database and data warehouse technology. Though data warehousing systems have been playing a leading role in supporting the decision process over the last decade, they are aimed at supporting analysis of past data
(‘‘what-was’’) rather than giving conditional anticipations of future trends (‘‘what-if’’). Nevertheless, the historical data used to reliably build what-if predictions are normally taken from the enterprise data warehouse. Besides, there is a tight relationship between what-if analysis and multidimensional modeling, since input and output data for what-if analysis are typically stored within cubes [7]. In particular, in [2] the SESAME system for formulating and efficiently evaluating what-if queries on data warehouses is presented. Here, scenarios are defined as ordered sets of hypothetical modifications on multidimensional data. Finally, there are relevant similarities between simulation modeling for what-if analysis and the modeling of Extraction, Transformation and Loading (ETL) applications; in fact, both ETL and what-if analysis can be seen as a combination of elementary processes, each transforming an input data flow into an output.
Foundations
As sketched in Fig. 1, a what-if application is centered on a simulation model that establishes a set of complex relationships between some business variables corresponding to significant entities in the business domain (e.g., products, branches, customers, costs, revenues, etc.). A simulation model supports one or more scenarios, each describing one or more alternative ways to construct a prediction of interest for the user. The prediction typically takes the form of a multidimensional cube, whose dimensions and measures correspond to business variables, to be interactively explored by the user by means of any On-Line Analytical Processing (OLAP) front-end. A scenario is
characterized by a subset of business variables, called source variables, and by a set of additional parameters, called scenario parameters, that the user has to assign values to in order to execute the model and obtain the prediction. While business variables are related to the business domain, scenario parameters convey information technically related to the simulation, such as the type of regression adopted for forecasting and the number of past years to be considered for regression. Distinguishing source variables among business variables is important since it enables the user to understand which are the ‘‘levers’’ that she can independently adjust to drive the simulation. Each scenario may give rise to different simulations, one for each assignment of the source variables and of the scenario parameters. A simple example of a what-if query in the marketing domain is: How would my profits change if I run a 3×2 (pay 2 and take 3) promotion for one week on all audio products on sale? Answering this query requires building a simulation model capable of expressing the complex relationships between the business variables that determine the impact of promotions on product sales, and running it against the historical sale data in order to determine a reliable forecast for future sales. In particular, the source variables for this scenario are the type of promotion, its duration, and the product category it is applied to. Possible scenario parameters could be the type of regression used for forecasting and the number of past years to be considered for regression. The specific simulation expressed by the what-if query reported in the text is determined by giving the values ‘‘3×2,’’ ‘‘one week,’’ and ‘‘audio,’’ respectively,
What-If Analysis. Figure 1. Functional sketch for what-if analysis.
to the three source variables. The prediction could be a cube with dimensions week and product and measures revenue, cost and profit. Importantly, what-if analysis should not be confused with sensitivity analysis, aimed at evaluating how sensitive the behavior of the system is to a small change of one or more parameters. Besides, there is an important difference between what-if analysis and simple forecasting, widely used especially in the banking and insurance fields. In fact, while forecasting is normally carried out by extrapolating trends out of the historical series stored in information systems, what-if analysis requires simulating complex phenomena whose effects cannot be simply determined as a projection of past data. On the other hand, applying forecasting techniques is often required during what-if analysis. In [4] the authors report a useful classification of forecasting methods into judgmental, such as those based on expert opinions and role-playing, and statistical, such as extrapolation methods, expert systems and rule-based forecasting. The applicability of these methods to different domains is discussed, and an algorithm for selecting the best method depending on the specific characteristics of the problem at hand is reported. A separate mention is in order for system dynamics [4,11], an approach to modeling the behavior of nonlinear systems, in which cause-effect relationships between abstract events are captured as dependencies among numerical variables; in general, such dependencies can give rise to retroactive interaction cycles, i.e., feedback loops. From a mathematical standpoint, systems of differential equations are the proper tool for modeling such systems. In the general case, however, a solution cannot always be found analytically, so numerical techniques are often used to predict the behavior of the system. A system dynamics model consists of a set of variables linked together, classified as stock and flow variables; flow variables represent the rate at which the level of cumulation in stock variables changes. By running simulations on such a model, the user can understand how the system will evolve over time as a consequence of a hypothetical action she takes. She can also observe, at each time step, the values assumed by the model variables and (possibly) modify them. Thus, it appears that system dynamics can effectively support what-if applications in which the current state of any part of the system could influence its own future state through a closed chain of dependency links.
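The promotion scenario above can be made concrete with a toy simulation model: a baseline forecast obtained by linear regression over historical weekly sales, to which an assumed promotion uplift is applied. All figures, the uplift factor, and the choice of linear regression are illustrative assumptions; a real simulation model would be tuned and validated against the corporate data warehouse.

# Toy what-if simulation for the "3x2 promotion on audio products" scenario.
# Historical data, the uplift factor, and the linear-regression choice are
# illustrative assumptions, not part of any real model.

def linear_forecast(history, steps_ahead=1):
    # Ordinary least-squares fit of y = a*t + b over past weekly sales.
    n = len(history)
    ts = range(n)
    mean_t = sum(ts) / n
    mean_y = sum(history) / n
    a = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, history)) / \
        sum((t - mean_t) ** 2 for t in ts)
    b = mean_y - a * mean_t
    return a * (n - 1 + steps_ahead) + b

def simulate(history, promo_uplift, unit_price, unit_cost, free_ratio):
    baseline_qty = linear_forecast(history)     # no-promotion forecast
    promo_qty = baseline_qty * promo_uplift     # scenario: demand uplift
    paid_qty = promo_qty * (1 - free_ratio)     # "pay 2 and take 3": 1/3 free
    revenue = paid_qty * unit_price
    cost = promo_qty * unit_cost
    return {"qty": promo_qty, "revenue": revenue, "cost": cost,
            "profit": revenue - cost}

weekly_audio_sales = [100, 104, 110, 115, 121, 128]   # invented history
print(simulate(weekly_audio_sales, promo_uplift=1.8,
               unit_price=50.0, unit_cost=30.0, free_ratio=1/3))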
Designing a what-if application requires a methodological framework; the one presented in [6] relies on seven phases: 1. Goal analysis, aimed at determining which business phenomena are to be simulated, and how they will be characterized. The goals are expressed by (i) identifying the set of business variables the user wants to monitor and their granularity; and (ii) defining the relevant scenarios in terms of source variables the user wants to control. 2. Business modeling, which builds a simplified model of the application domain in order to help the designer to understand the business phenomenon as well as give her some preliminary indications about which aspects can be either neglected or simplified for simulation. 3. Data source analysis, aimed at understanding what information is available to drive the simulation and how it is structured. 4. Multidimensional modeling, which defines the multidimensional schema describing the prediction by taking into account the static part of the business model produced at phase 2 and respecting the requirements expressed at phase 1. 5. Simulation modeling, whose aim is to define, based on the business model, the simulation model allowing the prediction to be constructed, for each given scenario, from the source data available. 6. Data design and implementation, during which the multidimensional schema of the prediction and the simulation model are implemented on the chosen platform, to create a prototype for testing. 7. Validation, aimed at evaluating, together with the users, how faithful the simulation model is to the real business model and how reliable the prediction is. If the approximation introduced by the simulation model is considered to be unacceptable, phases 4–7 should be iterated to produce a new prototype. The three modeling phases require a supporting formalism. Standard UML can be used for phase 2 (e.g., a use case diagram and a class diagram coupled with activity diagrams) and any formalism for conceptual modeling of multidimensional databases can be effectively adopted for phase 4. Finding a suitable formalism to give broad conceptual support to phase 5 is much harder, though some examples based on the use of colored Petri nets, event graphs and flow charts can be found in the simulation literature [10].
Key Applications Among the killer applications for what-if analysis, it is worth mentioning profitability analysis in commerce, hazard analysis in finance, promotion analysis in marketing, and effectiveness analysis in production planning. Less traditional, yet interesting applications described in the literature are urban and regional planning supported by spatial databases, index selection in relational databases, and ETL maintenance in data warehousing systems. Either spreadsheets or OLAP tools are often used to support what-if analysis. Spreadsheets offer an interactive and flexible environment for specifying scenarios, but lack seamless integration with the bulk of historical data. Conversely, OLAP tools lack the analytical capabilities of spreadsheets and are not optimized for scenario evaluation [2]. Recently, what-if analysis has been gaining wide attention from vendors of business intelligence tools. For instance, both SAP SEM (Strategic Enterprise Management) and SAS Forecast Server already enable users to make assumptions on the enterprise state or future behavior, as well as to analyze the effects of such assumptions by relying on a wide set of forecasting models. Also Microsoft Analysis Services provides some limited support for what-if analysis. This is now encouraging companies to integrate and finalize their business intelligence platforms by developing what-if applications for building reliable business predictions.
Future Directions Surprisingly, though a few commercial tools are already capable of performing forecasting and what-if analysis, and some papers describe relevant applications in different fields, very few attempts have been made so far to address methodological and modeling issues in this field (e.g., [6]). On the other hand, facing a what-if project without the support of a design methodology is very time-consuming, and does not adequately protect the designer and his customers against the risk of failure. The main problem related to the design of what-if applications is to find an adequate formalism to conceptually express the simulation model, so that it can be discussed and agreed upon with the users. Unfortunately, no suggestion to this end is given in the literature, and commercial tools do not offer any general modeling support. Another relevant problem is to establish a general framework for estimating the loss of precision that is introduced when modeling low-level phenomena with
higher-level dependencies. This could allow designers to assess the reliability of the prediction as a function of the quality of the historical data sources and of the precision of the simulation model. Decision makers are used to navigating multidimensional data within OLAP sessions, which consist of the sequential application of simple and intuitive OLAP operators, each transforming a cube into another one. Consequently, it is natural for them to ask for this paradigm for accessing information to be extended to what-if analysis as well. This would allow users to mix together navigation of historical data and simulation of future data into a single session of analysis. In the same direction, an approach has recently been proposed for integrating OLAP with data mining [3]. This raises an interesting research issue. In fact, OLAP should be extended with a set of new, well-formed operators specifically devised for what-if analysis. An example of such an operator could be apportion, which disaggregates a quantitative measure down a hierarchy according to some given criterion (driver). For instance, a transportation cost forecasted by branch and month could be apportioned by product type proportionally to the quantity shipped for each product type. In addition, efficient techniques for supporting the execution of such operators should be investigated.
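The semantics of the suggested apportion operator can be sketched directly: a measure forecasted at a coarse level is disaggregated down a hierarchy proportionally to a driver. The numbers below are invented for illustration.

# Sketch of an "apportion" operator: disaggregate a forecasted measure
# (e.g., a transportation cost for one branch and month) down to product
# types, proportionally to a driver (here, shipped quantity).

def apportion(total, driver):
    """Split `total` over the keys of `driver` proportionally to its values."""
    weight_sum = sum(driver.values())
    return {key: total * value / weight_sum for key, value in driver.items()}

forecast_cost = 12000.0                                         # coarse forecast
shipped_qty = {"audio": 300, "video": 500, "accessories": 200}  # the driver
print(apportion(forecast_cost, shipped_qty))
# {'audio': 3600.0, 'video': 6000.0, 'accessories': 2400.0}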
Cross-references
▶ Business Intelligence ▶ Cube ▶ Data Warehousing Systems: Foundations and Architectures ▶ On-Line Analytical Processing
Recommended Reading 1. Armstrong S. and Brodie R. Forecasting for marketing. In Quantitative methods in marketing. G. Hooley and M. Hussey (eds.). Int. Thompson Business Press, London, 1999, pp. 92–119. 2. Balmin A., Papadimitriou T., and Papakonstantinou Y. Hypothetical Queries in an OLAP Environment. In Proc. 26th Int. Conf. on Very Large Data Bases, 2000, pp. 220–231. 3. Chen B., Chen L., Lin Y., and Ramakrishnan, R. Prediction cubes. In Proc. 31st Int. Conf. on Very Large Data Bases, 2005, pp. 982–993. 4. Coyle R.G. System Dynamics Modeling: A Practical Approach. Chapman and Hall, London, 1996. 5. Fossett C., Harrison D., and Weintrob H. An assessment procedure for simulation models: a case study. Oper. Res., 39(5): 710–723, 1991.
6. Golfarelli M., Rizzi S., and Proli A. Designing what-if analysis: towards a methodology. In Proc. ACM 9th Int. Workshop on Data Warehousing and OLAP, 2006, pp. 51–58. 7. Koutsoukis N.S., Mitra G., and Lucas C. Adapting on-line analytical processing for decision modeling: the interaction of information and decision technologies. Decis. Support Syst., 26(1):1–30, 1999. 8. Kreutzer W. System Simulation – Programming Styles and Languages. Addison Wesley, Reading, MA, 1986. 9. Law A.M. and Kelton W.D. Simulation Modeling and Analysis. McGraw-Hill Higher Education, Boston, MA, 1999. 10. Lee C., Huang H.C., Liu B., and Xu Z. Development of timed colour petri net simulation models for air cargo terminal operations. Comput. Ind. Eng., 51(1):102–110, 2006. 11. Roberts E.B. Managerial applications of system dynamics. Pegasus Communications, 1999.
While Loop ▶ Loop
Wide-Area Data Replication ▶ WAN Data Replication
Wide-Area Storage Systems ▶ Peer-to-Peer Storage

WIMP Interfaces
Stephen Kimani, CSIRO Tasmanian ICT Centre, Hobart, TAS, Australia

Definition
There exist many types of interaction styles. They include, but are not limited to: command line interfaces, natural language, question/answer and query dialog, form-fills and spreadsheets, WIMP, and three-dimensional interfaces. The most common of these interaction styles is WIMP. WIMP is an acronym for Windows, Icons, Menus and Pointers; alternatively, it is an acronym for Windows, Icons, Mice and Pull-down menus. Examples of user interfaces that are based on the WIMP interaction style include Microsoft Windows for PCs, MacOS for the Apple Macintosh, and various X Window-based systems for UNIX.

Historical Background
WIMP interfaces were invented at the SRI laboratory in California, and their development continued at Xerox PARC. The 1981 Xerox Star workstation is considered to be the first production computer to use the desktop metaphor, productivity applications, and a three-button mouse. WIMP was popularized by the Apple Macintosh in the early 1980s. The interaction style has since been adopted by the Microsoft Windows operating system, Motif, the X Window System, and others. The rapid rise of Microsoft Windows has made WIMP the dominant interface paradigm. WIMP interfaces have been enriched with new user interface widgets over the years, but the basic structure of a WIMP interface has remained largely unchanged. WIMP interfaces typically present the work space using a desktop metaphor [5]. Everything is presented in a two-dimensional space containing windows, and the functionality of the application is made available through interface widgets such as menus, dialog boxes, toolbars, palettes, and buttons.

Foundations
This section describes the elements of the WIMP interface.
Windows are areas of the display that behave as if they were independent terminals. They are typically rectangular areas of the display that can be manipulated independently on the display screen. In most cases, the view of the contents of a window can be changed by scrolling or editing. Windows can contain text and/or graphics. They can be moved, resized, closed, maximized, or minimized (reduced to an icon). Many windows can be displayed on the screen simultaneously thereby allowing multiple separated tasks to be visible at the same time. The user can switch from one from one task to another by moving from one window to another. The components of windows usually include: Scrollbars: They enable users to move the contents of the window up and down (vertical scrollbar), or from side to side (horizontal scrollbar).
W
3530
W
WIMP Interfaces
Title bars: They describe the name of the window. Status bar: It displays the status of the task/process in the window. Boxes/widgets for resizing and closing the window. Oftentimes, all windows in use will not fit on the screen space at once. Several strategies exist for managing multiple windows: 1. Iconification/minimizing: This strategy allows screen space to be saved by reducing windows to window icons. The user can re-expand the icons at will. A window icon therefore serves as a visual reminder of the window. 2. Tiling: In this case, the system uses all the available screen space to display the windows. The windows do not overlap. Tiling can take many forms. For instance: some systems use a fixed number of tiles while others allow variable numbers of tiles. 3. Overlapping: This strategy is probably now the most popular. In the strategy, windows are allowed to partially obscure each other like overlapping papers arranged on a desk. Cascading is a form of overlapping where the windows are automatically arranged fanned out, usually in a diagonal line so that the title and one other border of each window can be seen. With cascading, many windows can therefore be displayed in a limited screen space at the same time (Figs. 1 and 2).
When multiple windows are displayed on the screen at the same time, the active window is usually distinguished by a shaded/bold title bar. The active window is said to have focus. Windows have three size states: maximized, minimized/iconified and normal/restored. Some systems support windows within windows e.g., in Microsoft PowerPoint (MS Office 2000). Icons
In the context of WIMP, an icon is a small picture or image, used to represent some aspect of the system (such as a printer icon to represent the printing action) or to represent some entity/object (such as a window) (Fig. 3). As it was indicated earlier, a window may among other things be closed completely. Alternatively it may be reduced to this small representation (iconified/ minimized). An icon therefore can save screen space and therefore many windows can be available on the screen simultaneously. Moreover, it can serve as a reminder to the user that s/he can resume dialog (or interaction with the represented application) by simply opening up the window. Pointers
A pointer is a cursor on the display screen. It is worth noting that interaction in WIMP interfaces relies a lot on pointing and selecting interaction objects such as
WIMP Interfaces. Figure 1. Common components of a window.
WIMP Interfaces
W
3531
WIMP Interfaces. Figure 2. Tiling versus cascading.
WIMP Interfaces. Figure 3. Examples of icons.
icons and menu items. Pointers are therefore important in WIMP interfaces. There are at least two types of cursors: mouse cursor and text cursor. A mouse cursor shows the user where the current position of the mouse is considered to be with respect to the windows on screen. Usually, the shape and behavior of the mouse cursor can be changed. A text cursor shows where input will be directed from the keyboard. The mouse is the most common device for pointing and clicking. However, other input devices such as joystick, trackball, cursor keys or keyboard shortcuts too can be used for such pointing and selecting tasks. In a particular system, different cursors are used to represent different modes/states e.g., normal cursor as an arrow, double-arrowed cursor for resizing windows, hour-glass when system is busy, etc. A hot-spot is the location to which the cursor points. Selection occurs at the coordinate of the hot-spot. The hot-spot of the cursor should be obvious to the user (Fig. 4). Menus
A menu is an interaction feature that presents a set of options displayed on the screen where the selection and execution of one (or more) of the options results in a change in the state of the interface [8]. The human beings’ ability to recognize information on being given
a visual cue is superior to their ability to recall the information. Menus therefore can serve as cues for the operations or services that the system can perform. Therefore, the labels/names used for the commands in menus should be informative and meaningful. A potential candidate for selection can be indicated by using a pointing device (e.g., mouse, arrow keys, etc) to move the pointer accordingly. As the pointer moves, visual feedback is typically given by some kind of visual emphasis such as highlighting the menu item. Selection of a menu item can be realized by some additional user action (such as pressing a button on the pointing devices, pressing some key on the keyboard, etc). Keyboard accelerators, which are key combinations that have the same effect as selecting the menu item, are sometimes offered. When there are too many items, the menu items are often grouped and layered. The main menu, usually represented as a menu bar, should be conspicuous or readily available/accessible. Menu bars are often placed at one of the sides of the screen or window. For instance: at the top of the screen (e.g., MacOS), at the top of each window (e.g., Microsoft Windows). Most systems show the currently selected state of any group of menu items by displaying them in bold or by a tick. Entries that are disallowed in the current context are often shown in a dimmed font. Types of menus: Pull-down menus: They are attached to a main menu under the title bar or to buttons. Pull-down menus are dragged from the main menu by
WIMP Interfaces. Figure 4. Examples of pointers.
moving the pointer into the menu bar and pressing the button. Fall-down menus: They automatically appear from the main menu when the pointer enters the menu bar, without having to press the button. Pop-up menus: They appear when a particular region of the screen or window, maybe designated by an icon, is selected, but they remain in position until the user instructs it to disappear again e.g., by clicking on a ‘‘close box’’ in the border of the menu’s window, by releasing the mouse button, etc. Pin-up menus: They can be ‘‘pinned’’ or attached to the screen, staying in place until explicitly asked to go away. Pie menus: The menu options in a pie menu have a circular arrangement with the pointer at the center.
There are two main challenges with menus: deciding which items to include and how to group those items. Here are some guidelines on menu design: A menu label in pull-down menus should reflect the functions of the underlying menu items. The items in pull-down menus should be grouped by function. Menu groupings in pull-down menus should be consistent (to facilitate the transfer of learning and confidence to the user). Menu items should be ordered in the menu according to importance and frequency of use. Opposite functionalities (e.g., ‘‘save’’ and ‘‘delete’’) should be kept apart to prevent accidental selection of the wrong function. Additional Interaction Elements
Many other interaction elements may be present in a WIMP interface. For instance: buttons, sliders, toolbars, palettes, dialog boxes, etc. Buttons are individual and isolated regions within a display that can be selected by the user to invoke a specific operation or state. Sliders are used to set quantities that vary continuously within given limits or even to enable the user
to choose between large numbers of options [9]. A toolbar is a collection of small buttons, each with icons, that provides convenient access to commonly used functions. It should be pointed out that although the function of a toolbar is similar to that of a menu bar, the icons in a toolbar are smaller than the equivalent text and therefore more functions can be simultaneously displayed. A palette is a mechanism for making the set of possible modes and the active/current mode visible to the user. A dialog box is an information window used by the system to provide contextual information.
Key Applications
WIMP interfaces have made computer usage more accessible to users who previously could not use the earlier types of user interfaces. Such users include young children (who cannot yet read or write), managers, and non-professional home users [11]. WIMP interfaces can be credited for the increased emphasis on the incorporation of user interface design and usability evaluation in the software development process in the late 1980s and early 1990s. On the whole, WIMP interfaces can be said to be instrumental in the realization of applications that are characterized by relative ease of learning, ease of use, and ease of transfer of knowledge due to the consistency in WIMP-based designs [11].
Cross-references
▶ Human-Computer Interaction ▶ Icon ▶ Visual Interaction ▶ Visual Interfaces ▶ Visual Metaphor
Recommended Reading
1. Alistair E. The design of auditory interfaces for visually disabled users. In Proc. SIGCHI Conf. on Human Factors in Computing Systems, 1988, pp. 83–88.
2. Balakrishnan R. and Kurtenbach G. Exploring bimanual camera control and object manipulation in 3D graphics interfaces. In Proc. SIGCHI Conf. on Human Factors in Computing Systems, 1999, pp. 56–62.
3. Beaudouin-Lafon M. Designing interaction, not interfaces. In Proc. Working Conf. on Advanced Visual Interfaces, 2004, pp. 15–22.
4. Beaudouin-Lafon M. and Lassen H.M. The architecture and implementation of CPN2000, a post-WIMP graphical application. In Proc. 13th Annual ACM Symp. on User Interface Software and Technology, 2000, pp. 181–190.
5. Cesar P. Tools for adaptive and post-WIMP user interfaces. New Directions on Human Computer Interaction, 2005.
6. Dix A., Finlay J., Abowd G., and Beale R. Human-Computer Interaction. Prentice Hall, Englewood Cliffs, NJ, 2003.
7. Green M. and Jacob R. SIGGRAPH '90 workshop report: software architectures and metaphors for non-WIMP user interfaces. Comput. Graph., 25(3):229–235, 1991.
8. Paap K.R. and Roske-Hofstrand R.J. Design of menus. In Handbook of Human-Computer Interaction, M. Helander (ed.). North-Holland, Amsterdam, 1998.
9. Preece J., Rogers Y., Sharp H., Benyon D., Holland S., and Carey T. Human-Computer Interaction. Addison-Wesley, Reading, MA, 1994.
10. Odell D.L., Davis R.C., Smith A., and Wright P.K. Toolglasses, marking menus, and hotkeys: a comparison of one and two-handed command selection techniques. In Proc. 2004 Conf. on Graphics Interface, 2004, pp. 17–24.
11. van Dam A. Post-WIMP user interfaces. Commun. ACM, 40(2):63–67, 1997.
Window-based Query Processing
Walid G. Aref, Purdue University, West Lafayette, IN, USA
Synonyms
Stream query processing
Definition
Data streams are infinite in nature. As a result, a query that executes over data streams specifies a ‘‘window’’ of focus, i.e., the part of the data stream that is of interest to the query. When new data items arrive into the data stream, the window may either expand or slide to allow the query to process these new data items. Hence, queries over data streams are continuous in nature, i.e., the query is continuously re-evaluated each time the query window slides. Window-based query processing on data streams refers to the various ways and techniques for processing and evaluating continuous queries over windows of data stream items.
Historical Background
Windows over relational tables were introduced into Standard SQL (SQL:1999) in order to support data analysis, decision support, and, more generally, OLAP-type operations. However, the motivation for having windows in data stream management systems is quite different. Since data streams are infinite, it is vital to limit and focus the scope of a query to a manageable and finite portion of the data stream. Earlier works on window query processing focused on processing tuple-count and time-sliding windows. Since then, window query processing techniques have been developed to deal with out-of-order tuple arrivals, e.g., [14], revision tuple processing, e.g., [1,13], incremental evaluation techniques, e.g., [6], stream punctuation techniques, e.g., [17], multi-query optimization techniques, e.g., shared execution of window queries [4,7,9,10], and adaptive stream query processing techniques, e.g., [12].
Foundations
Queries over data streams are continuous in nature. A continuous query progressively produces results as new data items arrive into the data stream. Since data streams are infinite, queries that execute over a data stream need to define a region of interest (termed a window). There are several ways by which a query can define its window(s) of interest. These include stream-based versus operation-based windows, tuple-count versus time-sliding windows, and sliding versus predicate windows. These ways of specifying windows are orthogonal and can hence be combined. For example, a window can be time-sliding and, at the same time, operation-based. Similarly, a stream-based window can also be predicate-based, and so on.
Query Processing Techniques: Incremental Evaluation Versus Reevaluation
Whenever the data items within a window change, e.g., when the window slides with the arrival of new tuples or with the expiration of some old tuples from the window, a continuous query’s answer needs to be reproduced. There are two mechanisms for reproducing the answer to a continuous query, namely (i) query reevaluation and (ii) incremental evaluation, which are explained in the following sections.
Query Reevaluation
In query reevaluation, a window over a data stream is viewed as an instantiation of a table that contains the data tuples in the window. When the window slides, a new table is formed (or opened). So, from the point of view of a continuous query, a data stream is a sequence of tables. With the arrival of a new table (window of data stream items), the query is reevaluated to produce a new output result. As a consequence, an important feature of query reevaluation is that the semantics of the traditional query processing operators, e.g., selects and joins, do not change. When the slide of a window is less than its range, multiple overlapping windows will be open concurrently over the same stream. Hence, it is possible that a newly arriving tuple contributes to multiple windows.
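How many windows a single arrival participates in follows directly from the range and slide parameters; the following sketch (with invented values) enumerates the open windows that contain a given timestamp.

# For a time-based window of range R that slides every S time units, the
# windows are [k*S, k*S + R). A tuple with timestamp ts falls into every
# window whose start lies in (ts - R, ts]. Values below are illustrative.

def open_windows(ts, window_range, slide):
    first = max(0, (ts - window_range) // slide + 1)   # earliest window index
    last = ts // slide                                 # latest window index
    return [(k * slide, k * slide + window_range) for k in range(first, last + 1)]

# Range 10 minutes, slide 2 minutes: a tuple at minute 11 is re-evaluated
# in five overlapping windows.
print(open_windows(ts=11, window_range=10, slide=2))
# [(2, 12), (4, 14), (6, 16), (8, 18), (10, 20)]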
Query Processing using Revision Tuples
At times, data stream items may be noisy or erroneous due to the nature of the data streaming applications. As a result, stream data sources may need to ‘‘revise’’ previously issued tuples in the form of ‘‘revision tuples’’ [1,13]. So, in a sense, processing a revised tuple may involve reprocessing the input tuples that were in the same window or computation as the tuple being revised, to produce revised output results. Consequently, some query operators will need to preserve ‘‘state’’ in order to reprocess the revised input tuples. For example, aggregates may need to store the individual values (or some summary of these values) that were used to produce the aggregate result so that the aggregate operator can be recomputed given the revised input tuple. This approach is referred to as ‘‘upstream processing.’’ It is also possible to go downstream, i.e., to start from the output tuples and work backwards: previously generated output tuples are corrected using the new and original values of the revised tuple as well as some state information, depending on the nature of the participating query operators. More detail about processing revision tuples can be found in [1,13].
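A minimal sketch of upstream revision processing for a windowed SUM is shown below; the per-window state and the tuple encoding are invented for illustration and do not reflect Borealis's actual implementation.

# Sketch of "upstream" revision processing for a windowed SUM: the operator
# keeps the individual input values per window so that a revision tuple
# (window_id, tuple_id, new_value) can trigger a corrected result.

class RevisableSum:
    def __init__(self):
        self.values = {}                       # (window_id, tuple_id) -> value
        self.sums = {}                         # window_id -> current sum

    def insert(self, window_id, tuple_id, value):
        self.values[(window_id, tuple_id)] = value
        self.sums[window_id] = self.sums.get(window_id, 0) + value
        return self.sums[window_id]

    def revise(self, window_id, tuple_id, new_value):
        old = self.values[(window_id, tuple_id)]
        self.values[(window_id, tuple_id)] = new_value
        self.sums[window_id] += new_value - old   # emit a revised result
        return self.sums[window_id]

op = RevisableSum()
op.insert("w1", "t1", 10)
op.insert("w1", "t2", 7)                      # sum for w1 is now 17
print(op.revise("w1", "t1", 4))               # revision tuple -> corrected sum 11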
Query Processing using Punctuations
Some stream query operators are stateful while others are blocking. Since data streams are infinite, the state of these query operators may be unbounded in size. Similarly, blocking operators cannot wait indefinitely, as the data streams never end. Given some a priori knowledge about the stream semantics, it is possible to embed, within the data stream, special annotations, termed
‘‘punctuations,’’ that break the infinite data stream into finite sub-streams. Then, stateful and blocking query operators can be applied successfully to each of the sub-streams. Stream punctuations can take multiple forms. For example, one stream can send a ‘‘last-reading-within-this-hour’’ punctuation to reflect that no more readings for this hour are expected from that source. In this case, a blocking operator can produce results related to this source once the operator receives this punctuation. Similarly, in a bidding application, sending a ‘‘no-more-bids-for-item:%itemid’’ punctuation would allow the bid to be finalized for that item. In general, a punctuation can be viewed as a predicate that evaluates to ‘‘false’’ for every tuple in the stream that follows the punctuation. This helps query operators generate output tuples following a punctuation as well as reduce the size of the state kept per operator. Detailed discussion of query processing using punctuations can be found in [17].
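A minimal sketch of a punctuation-aware blocking aggregate (highest bid per item) is shown below; the tuple and punctuation encodings are invented for illustration.

# Sketch of a blocking aggregate (highest bid per item) that uses punctuations
# to emit results and purge state. Tuple/punctuation formats are invented.

def max_bid_operator(stream):
    best = {}                                    # per-item state
    for element in stream:
        if element[0] == "punctuation":          # ("punctuation", item_id)
            item = element[1]                    # no more bids for this item
            if item in best:
                yield (item, best.pop(item))     # finalize and drop state
        else:                                    # ("bid", item_id, amount)
            _, item, amount = element
            best[item] = max(best.get(item, amount), amount)

bids = [("bid", "chair", 40), ("bid", "lamp", 15), ("bid", "chair", 55),
        ("punctuation", "chair"),                # bidding on "chair" is closed
        ("bid", "lamp", 22), ("punctuation", "lamp")]
print(list(max_bid_operator(bids)))              # [('chair', 55), ('lamp', 22)]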
Query Processing Using Heartbeats
Stream data sources often assign a timestamp to each data element they produce. Due to the distributed nature of the stream data sources, data elements may arrive at the data stream management system out of order. System buffers need to store the out-of-order tuples to present them later to the query processor in increasing timestamp order. Before processing a tuple t with timestamp ts, it is important to guarantee that no more tuples with timestamps less than ts will arrive at the system. Stream heartbeats are a mechanism that provides such an assurance. Each data source is responsible for generating its own heartbeats periodically and embeds them within its own data stream. In cases where a data source does not have this capability, the data stream management system should be able to synthesize and generate heartbeats for each data stream based on parameterized knowledge of the environment, e.g., the maximum network delay. When processing a query that accesses multiple streams, say s1, s2,...,sn, the query processor computes the minimum timestamp, say tsmin, of the heartbeats of all the n streams. All the tuples in the system buffer with timestamps less than tsmin are forwarded to the query for processing. Such query-level heartbeats are simple and easy to implement. However, they can block the processing of a query unnecessarily, e.g., when one of the streams is temporarily blocked while the others are available. Alternatively, in contrast to a
query-level heartbeat, in an operator-level heartbeat, the query processor computes the minimum timestamp of the heartbeats for the streams input to each query operator in the plan. In this case, each operator will have its own heartbeat. Input tuples with timestamps less than an operator’s heartbeat are forwarded to that operator. Latency and memory requirements of both query-level and operator-level heartbeats are further detailed in [14].
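A query-level heartbeat amounts to releasing every buffered tuple whose timestamp lies below the minimum heartbeat over the query's input streams, as in the following sketch (buffer contents and heartbeat values are invented).

# Sketch of query-level heartbeat processing: buffered tuples are released to
# the query only once every input stream has advanced past their timestamps.
import heapq

def release(buffers, heartbeats):
    ts_min = min(heartbeats.values())        # query-level heartbeat
    ready = []
    for name, buf in buffers.items():        # buf is a min-heap on timestamp
        while buf and buf[0][0] < ts_min:
            ready.append(heapq.heappop(buf))
    return sorted(ready), ts_min             # forward in timestamp order

buffers = {"s1": [(3, "a"), (9, "b")], "s2": [(5, "x"), (12, "y")]}
for buf in buffers.values():
    heapq.heapify(buf)
heartbeats = {"s1": 10, "s2": 8}             # per-stream heartbeats
print(release(buffers, heartbeats))          # ([(3, 'a'), (5, 'x')], 8)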
Incremental Query Evaluation
In incremental query evaluation, whenever the window contents change, the query does not reprocess all the tuples inside the window. Instead, the query processes only the changed tuples. Moreover, only the changes to the query answer are reported. The answer to the query is treated more like a materialized view, in the sense that only the changes in the base tables (the deltas) need to be processed by the view expression. As a result, some new tuples get inserted into the view while others get deleted from it. Two types of events need to be handled in the case of incremental query evaluation. The first event type is when a new tuple arrives into the data stream and hence falls inside the stream’s window. The second event type is when a tuple exits the window. For example, when a time-sliding window slides, a tuple’s timestamp may fall outside the time interval covered by the window, and hence the tuple expires and exits the window. Systems vary in how they handle both types of events. STREAM [16] generates two types of streams: an Insert (I) stream that contains the newly arriving tuples, and a Delete (D) stream that contains the expiring tuples. In contrast to STREAM, which has two types of streams, Nile [8] maintains only one stream but with two types of tuples: positive and negative tuples. Positive tuples correspond to the new tuples that arrive into the window while negative tuples correspond to the expiring tuples. Positive tuples are the regular tuples that a traditional query processor handles. In contrast, negative tuples are processed differently by each query operator.
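The incremental style can be illustrated with a windowed SUM that consumes only the changes: each inserted tuple is added, each expiring tuple is subtracted, and the updated answer is emitted after every change. The event encoding below is invented for illustration.

# Sketch of incremental evaluation of a windowed SUM: the operator consumes
# only inserts (new tuples) and deletes (expired tuples) and emits the new
# answer after each change.

def incremental_sum(changes):
    total = 0
    for sign, value in changes:      # sign is +1 (insert) or -1 (delete/expire)
        total += sign * value
        yield total

# The window slides: 5 and 7 enter, later 5 expires, then 4 enters.
events = [(+1, 5), (+1, 7), (-1, 5), (+1, 4)]
print(list(incremental_sum(events)))   # [5, 12, 7, 11]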
Incremental Query Processing Using Negative Tuples
In a traditional relational query evaluation plan (QEP), a table-scan operator is usually at the bottom (leaf) level of the QEP and serves as the interface operator that pipelines the tuples of a relation to the rest of the
QEP. In incremental query evaluation, the equivalent of a table-scan operator is the window-expire operator (W-expire, for short). A W-expire operator is assigned to each stream in the query and is added at the bottom (leaf) level of a QEP. The inputs to the W-expire operator are the raw stream tuples as well as the definition of the window (say, a window of range 10 min and slide 2 min). The function of the W-expire operator is two-fold: (i) introduce the new stream tuples to the QEP as they arrive into the stream, and (ii) generate (or synthesize) negative tuples that correspond to stream tuples that expire from the window and introduce them to the QEP. Conceptually, a negative tuple, say t−, needs to trace the same path in the QEP that its corresponding positive tuple, say t+, took, to produce the same results that t+ produced (but with a negative sign). Once it receives a negative tuple, each query operator, in turn, processes the negative tuple and produces zero or more negative tuples as output that get processed by the operators next in the pipeline. The semantics of all query operators are extended to handle positive as well as negative tuples. Additionally, each query operator needs to store additional state information to be able to process negative tuples successfully. This is similar to the case of processing revision tuples in Borealis. Details about the extended semantics and state information of query operators for incremental query processing, as well as optimizations to reduce the cost of handling negative tuples, can be found in [6].
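The behavior of a W-expire operator for a time-based window can be sketched as follows; the tuple format is invented, and the sketch assumes timestamp-ordered input.

# Sketch of a W-expire operator for a time-based sliding window of range R:
# arrivals are emitted as positive tuples, and tuples older than the window
# are re-emitted as negative tuples. Tuple format is illustrative.
from collections import deque

def w_expire(stream, window_range):
    live = deque()                              # tuples currently in the window
    for ts, value in stream:                    # assume timestamp-ordered input
        while live and live[0][0] <= ts - window_range:
            old = live.popleft()
            yield ("-", old)                    # negative tuple: expiration
        live.append((ts, value))
        yield ("+", (ts, value))                # positive tuple: new arrival

readings = [(1, "a"), (2, "b"), (4, "c"), (7, "d")]
print(list(w_expire(readings, window_range=3)))
# [('+', (1, 'a')), ('+', (2, 'b')), ('-', (1, 'a')), ('+', (4, 'c')),
#  ('-', (2, 'b')), ('-', (4, 'c')), ('+', (7, 'd'))]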
Query Processing for Predicate Windows
In contrast to a sliding window, a predicate window is specified by a predicate that defines the stream data tuples of interest to a query. For example, a query may define its window of interest to be the room identifiers of the rooms with temperatures greater than 110°F. When a room’s temperature falls below 110°F, that room identifier exits the window; otherwise it remains inside the window of interest. Predicates that define a window may greatly vary in complexity. In general, a predicate window can be expressed using a regular select-from-where query. Consider the predicate window that is defined by the following Select query over Stream roomTemperatures: Select room-id, temperature from roomTemperatures where temperature > 120. The room-ids and temperature values of the high-temperature rooms are maintained inside this predicate window.
Three types of tuples need to be maintained in this scenario: insert, delete, and update tuples. Insert and delete tuples are the same as the positive and negative tuples used for incremental evaluation, while update tuples are similar to revision tuples. When the temperature of Room 5 increases from 110 to 130°F (i.e., the room did not have a high temperature before), the corresponding tuple (room-id, temperature): (5, 130) enters into the predicate window via an insert tuple. Similarly, when the temperature of the same room changes from 130 to 140°F, the corresponding tuple (5, 140) replaces the old tuple (5, 130) via an update tuple. Finally, when the temperature of Room 5 goes below 120°F, e.g., to 70°F, the corresponding tuple (5, 140) is eliminated from the predicate window via a delete tuple. Because of the generality of predicate windows and the use of insert, delete, and update tuples, predicate windows can be used to support views over data streams [5].
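Generating the three kinds of tuples amounts to comparing each new reading against the current content of the predicate window, as in the following sketch of the room-temperature example (the encoding is invented).

# Sketch of maintaining the predicate window "rooms with temperature > 120"
# by emitting insert/update/delete tuples as readings arrive.

def predicate_window(readings, threshold=120):
    inside = {}                                   # room-id -> last temperature
    for room, temp in readings:
        if temp > threshold:
            kind = "update" if room in inside else "insert"
            inside[room] = temp
            yield (kind, room, temp)
        elif room in inside:
            yield ("delete", room, inside.pop(room))

readings = [(5, 110), (5, 130), (5, 140), (5, 70)]
print(list(predicate_window(readings)))
# [('insert', 5, 130), ('update', 5, 140), ('delete', 5, 140)]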
Shared Execution of Multiple Window-based Queries
In continuous stream query processing, it is imperative to exploit resources and commonality among query expressions to achieve improved efficiency. Shared execution is one important means of achieving this goal. The role of joins in particular is further enhanced in stream query processing due to the use of ‘‘selection pull-up.’’ In the NiagaraCQ system [9,10], it was shown that the traditional heuristic of pushing selection predicates below joins is often inappropriate for continuous query systems, because early selection destroys the ability to share subsequent join processing. Since join processing is expensive relative to selections, it is almost always beneficial to process the join once in shared mode and then subject the produced join output tuples to the appropriate selection operations. A similar argument holds for aggregation and Group-by operations. Multiple windowed query operators, e.g., window joins, can share their execution when the windows of interest over one data stream overlap. For example, consider the two streams A and B and the two join operators J and K that use the same join predicate, e.g., A.a = B.b. Assume operation-based time-sliding windows of sizes wJ for J and wK for K, e.g., wJ = 1 h and wK = 1 day. Since J and K have the same join predicate and wJ < wK, the join output tuples of J are a subset of the join output tuples of K. Notice that it is possible that two tuples, say A1 and B1, join and their
timestamps are more than 1 h apart but less than 1 day. Hence, the join output pair is part of K’s join output tuples but is not part of J’s. Executing both operations separately wastes system resources. Sharing the execution of the join operator would save CPU and memory resources. One feasible policy is to perform only the join with the largest window (K in the example, since wK is the larger of the two windows) [4,12]. The output of this join is then routed to multiple output streams (two streams in the example), where the output tuples for the joins with smaller windows get filtered out based on their timestamps. One advantage of this policy is its simplicity, since arriving tuples are completely processed before considering the next incoming tuple. However, this policy has a disadvantage in that it delays the processing of small-window joins until the largest-window join is completely processed, hence negatively affecting the response time of small-window joins. An alternative policy for shared execution of window joins is to have all new tuples perform the join with the smallest window first, then the next larger window joins, and so on until the largest window join is processed [7]. As long as there are tuples to be processed by the smallest window, they are processed first before any tuples are processed by the larger windows. Notice that the join output pairs produced for the smallest window are also directly output tuples for all the larger windows, but the converse is not true, as explained above. The response time for the small-window joins is significantly enhanced. However, significant state information needs to be maintained for each tuple to track the point inside the windows from which that tuple resumes execution with each of the window joins. More detail and tradeoffs of each of the policies, as well as better optimized policies for shared execution of joins, can be found in [7].
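The ‘‘largest window first, then filter’’ policy can be sketched by evaluating the join once under the largest window and routing each result to the queries whose smaller window constraints it also satisfies. The nested-loop formulation and the data below are illustrative simplifications of an actual symmetric window join.

# Sketch of shared execution of two window joins J (1 hour) and K (1 day)
# with the same predicate A.a = B.b: join once with the largest window, then
# route each result to the queries whose time constraint it satisfies.
HOUR, DAY = 3600, 86400

def shared_join(stream_a, stream_b, query_windows):
    results = {q: [] for q in query_windows}
    largest = max(query_windows.values())
    for ts_a, a in stream_a:
        for ts_b, b in stream_b:
            if a == b and abs(ts_a - ts_b) <= largest:
                for q, w in query_windows.items():   # route + filter
                    if abs(ts_a - ts_b) <= w:
                        results[q].append((ts_a, ts_b, a))
    return results

A = [(0, "x"), (10 * HOUR, "y")]
B = [(30 * 60, "x"), (20 * HOUR, "y")]
print(shared_join(A, B, {"J": HOUR, "K": DAY}))
# The x-pair (30 min apart) reaches both J and K; the y-pair (10 h apart)
# is filtered out of J and delivered only to K.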
Out-of-order Tuple Processing
In processing queries over data streams, it is often the case that the window operators are order sensitive. For example, when continuously computing the running average over the most-recent ten tuples in a stream, the output result depends on the order of the tuples in the stream. With the ordered arrival of a new tuple, the 11th most-recent tuple gets dropped and the new average is computed. The same is true for time-sliding windows. When tuples arrive in order, the timestamp of the new tuple is used to slide the window. This is not
the case when tuples arrive out of order. In this case, none of the windows can get closed and any produced answer pertaining to a given window cannot be considered final since at any future point in time, a tuple can arrive into that window and hence the output needs to be recalculated. There are several ways to deal with out-of-order tuples. As explained in Section ‘‘Query Processing using Revision Tuples,’’ revision tuples are one way to deal with out-of-order tuples. The timestamp, say TS, of the most-recent tuple of a stream is maintained. Whenever an out-of-order tuple arrives (this tuple’s timestamp is less than TS), a revision tuple is issued to revise the output of the affected window(s). When the maximum possible delay of a tuple is known in advance, the stream query processor can leave open all the potential windows that are within this delay window. Once the maximum delay is reached, the corresponding windows get closed and the tuples inside the window are processed to produce the final answers for that window. The input stream manager can synthesize and issue a punctuation or a heartbeat to that effect so that the window results can be computed and finalized.
Key Applications
Key applications for window-based stream query processing include computer network traffic analysis, network performance monitoring, intrusion detection, stock market data analytics, web browsing click-stream analytics, algorithmic stock trading, sensor network data summarization and processing, moving object tracking, traffic monitoring, and surveillance applications.
Cross-references
▶ Adaptive Stream Processing ▶ Continuous Queries in Sensor Networks ▶ Continuous Query ▶ Data Stream ▶ Distributed Streams ▶ Event Stream ▶ Fault-Tolerance and High Availability in Data Stream Management Systems ▶ Histograms on Streams ▶ Publish/Subscribe over Streams ▶ Stream Models ▶ Stream Processing ▶ Stream-Oriented Query Languages and Operators
W
3537
▶ Windows ▶ XML-Stream Processing
Recommended Reading
1. Abadi D., Ahmad Y., Balazinska M., Cetintemel U., Cherniack M., Hwang J-H., Lindner W., Maskey A.S., Rasin A., Ryvkina E., Tatbul N., Xing Y., and Zdonik S. The design of the Borealis stream processing engine. In Proc. 2nd Biennial Conf. on Innovative Data Systems Research, 2005, pp. 277–289.
2. Abadi D., Carney D., Cetintemel U., Cherniack M., Convey C., Lee S., Stonebraker M., Tatbul N., and Zdonik S. Aurora: a new model and architecture for data stream management. VLDB J., 12(2):120–139, 2003.
3. Bai Y., Thakkar H., Luo C., Wang H., and Zaniolo C. A data stream language and system designed for power and extensibility. In Proc. Int. Conf. on Information and Knowledge Management, 2006, pp. 337–346.
4. Chandrasekaran S. and Franklin M.J. Streaming queries over streaming data. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 203–214.
5. Ghanem T.M. Supporting Views in Data Stream Management Systems. Ph.D. Dissertation. Department of Computer Science, Purdue University, 2007.
6. Ghanem T.M., Hammad M.A., Mokbel M.F., Aref W.G., and Elmagarmid A.K. Incremental evaluation of sliding-window queries over data streams. IEEE Trans. Knowl. Data Eng., 19(1):57–72, 2007.
7. Hammad M.A., Franklin M.J., Aref W.G., and Elmagarmid A.K. Scheduling for shared window joins over data streams. In Proc. 29th Int. Conf. on Very Large Data Bases, 2003, pp. 297–308.
8. Hammad M.A., Mokbel M.F., Ali M.H., Aref W.G., Catlin A.C., Elmagarmid A.K., Eltabakh M., Elfeky M.G., Ghanem T., Gwadera R., Ilyas I.F., Marzouk M., and Xiong X. Nile: a query processing engine for data streams. In Proc. 20th Int. Conf. on Data Engineering, 2004, p. 851.
9. Jianjun C., DeWitt D.J., Feng T., and Yuan W. NiagaraCQ: a scalable continuous query system for internet databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 379–390.
10. Jianjun C., DeWitt D.J., and Naughton J.F. Design and evaluation of alternative selection placement strategies in optimizing continuous queries. In Proc. 18th Int. Conf. on Data Engineering, 2002, pp. 345–356.
11. Johnson T., Muthukrishnan S., Shkapenyuk V., and Spatscheck O. A heartbeat mechanism and its application in gigascope. In Proc. 31st Int. Conf. on Very Large Data Bases, 2005, pp. 1079–1088.
12. Madden S., Shah M.A., Hellerstein J.M., and Raman V. Continuously adaptive continuous queries over streams. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2002, pp. 49–60.
13. Ryvkina E., Maskey A.S., Cherniack M., and Zdonik S. Revision processing in a stream processing engine: a high-level design. In Proc. 22nd Int. Conf. on Data Engineering, 2006.
14. Srivastava U. and Widom J. Flexible time management in data stream systems. In Proc. 23rd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2004, pp. 263–274.
15. Stonebraker M., Cetintemel U., and Zdonik S. The 8 requirements of real-time stream processing. ACM SIGMOD Rec., 34(4):42–47, 2005.
16. The STREAM Group. STREAM: the Stanford stream data manager. IEEE Data Eng. Bull., 26(1):19–26, 2003.
17. Tucker P.A., Maier D., Sheard T., and Fegaras L. Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowl. Data Eng., 15(3):555–568, 2003.
Windows
Carlo Zaniolo
University of California-Los Angeles, Los Angeles, CA, USA
Synonyms
Logical window = Time-based window; Physical window = Tuple-based windows
Definition
Windows were introduced as part of SQL:1999 OLAP Functions. For instance, given a sequence of bids it is possible to use the following SQL:2003 statement to find the last 40 offers (the current offer and the previous 39) for item 0021:
SELECT ItemID, avg(Offer) OVER (ORDER BY Time ROWS 39 PRECEDING)
FROM BIDS
WHERE ItemID = 0021
When BIDS is instead a data stream, the ‘‘ORDER BY TIME’’ clause becomes redundant, and clauses such as ‘‘FOLLOWING’’ are often not supported in continuous query languages. However, these languages still provide a ‘‘PARTITION BY’’ clause (or the more traditional ‘‘GROUP BY’’ clause), whereby a user can specify that the average of the last 40 offers must be computed for all items, not just item 0021. In addition to entailing powerful and flexible analytical queries on ordered sequences and time-series, as in databases, windows on data streams play the key role of synopses, and are widely employed in this capacity. In particular, since it is not feasible to memorize unbounded data streams, window joins are used instead. In window joins, the newly arriving tuples in one data stream are joined with the recent tuples in the window of the other data streams (and symmetrically). The design and implementation of window-based extensions for aggregates
and join operators for data streams has generated significant research work [1,2,3].
Key Points
Traditional SQL aggregates are blocking (and thus not suitable for continuous queries on data streams) but their window versions are not. Therefore, the window concept is the cornerstone of many continuous query languages, where its functionality has also been extended with new constructs, such as slides, tumbles, and landmark windows, that are not in SQL:2003. In a sliding window of w tuples (seconds) the aggregate computed over w is returned for each new incoming tuple: when a slide of size s is also specified, then results are only returned every s tuples (seconds). With s = w we have a tumbling window in which results are only returned at the end of each window. When w/s = k, then the window, and the computation of the aggregate, is partitioned into k panes [2]. A landmark window is one where an occurrence of some event of semantic significance, e.g., a punctuation mark [3], defines one or both endpoints. Efficient implementation requires delta computations that exploit the algebraic properties of aggregates – e.g., by increasing (decreasing) the current sum with the value from the tuple entering (leaving) the window – and architectures that consolidate the vast assortment of windows and aggregates at hand [1,2]. Windows provide the basic synopsis needed to support joins with limited memory. Special techniques are used to optimize multi-way joins, and response time for all joins. Further optimization issues occur when load-shedding is performed by either (i) using secondary store to manage overflowing window buffers, or (ii) dropping tuples from the windows in such a way that either a max-subset or a random sample of the original window join is produced. Age-based policies, where older tuples are dropped first, are preferable in certain applications. Sketching techniques are often used to estimate the productivity of tuples, whereby the least productive tuples are dropped first [3]. The same techniques can also be used to estimate duplicates in DISTINCT aggregates; reservoir-sampling inspired techniques have instead been proposed for aggregates where duplicates are not ignored.
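The slide/tumble behavior and the delta computation described above can be sketched as follows for a tuple-based window. The class name and the SUM aggregate are chosen only for illustration; they do not correspond to any particular system's operators.

from collections import deque

class SlidingSum:
    """Tuple-based sliding-window SUM with window w and slide s (sketch).

    The running sum is maintained by a delta computation: add the entering
    tuple, subtract the one leaving the window. With s == w the window
    degenerates into a tumbling window.
    """

    def __init__(self, w, s):
        self.w, self.s = w, s
        self.buf = deque()
        self.total = 0
        self.since_report = 0

    def on_tuple(self, value):
        self.buf.append(value)
        self.total += value                   # tuple entering the window
        if len(self.buf) > self.w:
            self.total -= self.buf.popleft()  # tuple leaving the window
        self.since_report += 1
        if self.since_report == self.s:       # report once per slide
            self.since_report = 0
            return self.total
        return None

sliding = SlidingSum(w=3, s=1)     # report after every tuple
tumbling = SlidingSum(w=3, s=3)    # report once per full window
for x in [1, 2, 3, 4, 5, 6]:
    print(sliding.on_tuple(x), tumbling.on_tuple(x))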
Cross-references
▶ Data Sketch/Synopsis
▶ Join
▶ One-Pass Query Processing
▶ Punctuations
▶ SQL
▶ Stream Processing
Recommended Reading
1. Bai Y. et al. A data stream language and system designed for power and extensibility. In Proc. Int. Conf. on Information and Knowledge Management, 2006, pp. 337–346.
2. Li J. et al. Semantics and evaluation techniques for window aggregates in data streams. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005, pp. 311–322.
3. Maier D., Tucker P.A., and Garofalakis M. Filtering, punctuation, windows and synopses. In Stream Data Management, Vol. 30, N. Chaudhry, K. Shaw, M. Abdelguerfi (eds.). Kluwer, Dordrecht, 2005, pp. 35–56.
Workflow ▶ Workflow Management
Workflow Constructs
Nathaniel Palmer
Workflow Management Coalition, Hingham, MA, USA
Synonyms Process semantics
Definition The elements of a Workflow Model defining how the process is executed.
Wireless Sensor Networks ▶ Sensor Networks
Within-Element Term Frequency ▶ Term Statistics for Structured Text Retrieval
Word Conflation ▶ Stemming
Word of Mouth ▶ Reputation and Trust ▶ Trust and Reputation in Peer-to-Peer Systems
Work Element ▶ Activity
Work Performer ▶ Actors/Agents/Roles
Key Points
Workflow Constructs represent the elements of a Workflow Model, such as Control Data, Splits, Joins, Activities and Actors, which combined define the parameters of how a process instance is executed. The “Semantics” of the process refer to how it is executed in the context of workflow rules, such as the steps, sequencing and dependencies; whereas the “Syntax” of the process refers to its definition within the context of a specific language. Workflow Constructs relate directly to semantics; however, they do not define specific syntax.
Cross-references
▶ Activity ▶ Actors/Agents/Roles ▶ Control Data ▶ Join ▶ Split ▶ Workflow Model
Workflow Control Data ▶ Control Data
Workflow Enactment Service State Data ▶ Control Data
Workflow Engine State Data ▶ Control Data
Workflow Evolution
Peter Dadam, Stefanie Rinderle
University of Ulm, Ulm, Germany
Synonyms Process evolution; Schema evolution in workflow management systems; Schema evolution in process management systems; Adaptive workflow/process management; Workflow/process instance changes
Definition The term evolution has been originally used in biology and means the progressive development of a species over time, i.e., the adaptation to changing environmental requirements. Business processes (which are often called workflows when implemented and thus automated within a workflow management system) also ‘‘live’’ within an environment (e.g., the enterprise or the market). This environment is typically highly dynamic and thus the running workflows have to adapt to these changing requirements – i.e., to evolve – in order to keep up with the ever-changing business environment and provide their users with the competitive edge. Workflow evolution implies two basic challenges: change realization and change discovery. Change realization means that it must be technically possible to adapt workflows. Change discovery refers to the question which workflow modifications are required when the environmental requirements change. So far, research has focused on the first topic whereas the second one is subject to future research. Regarding the technical realization, workflow changes can take place at two levels – at the workflow instance and the workflow type level. Instance-specific changes are often applied in an ad-hoc manner and become necessary in conjunction with real-world exceptions. They usually affect only single workflow instances. As opposed to this, in conjunction with workflow schema changes at the workflow type level, a collection of related instances may have to be adapted, i.e., workflow schema changes are propagated to running workflow instances or running workflow instances are migrated to the changed workflow schema.
Historical Background In research workflow evolution has been mainly addressed from a technical point of view so far. More precisely, different approaches have been developed for the realization of workflow change during the last 10 years [8]. These approaches can be divided into different categories, i.e., pre-planned workflow changes, flexibility by design, and flexible workflow management technology. Pre-planned workflow changes: In this category workflow changes are planned in advance [3,6]. Then, based on different strategies, one or several of these changes are executed during runtime if necessary (e.g., if an exception occurs). Strategies to initiate a workflow change can be rule-based (e.g., using ECA rules), goal-based (e.g., using AI planning techniques), or process-driven (e.g., based on graph grammars). Flexibility by design: Workflow adaptations can be also foreseen in the workflow description [1,12] (e.g., in the workflow graph). The approaches range from simply modeling alternative branches with associated selection codes to approaches which specify workflows only as a rough ‘‘skeleton’’ in advance and leave the detailed specification of the workflow to the user at runtime. Then, for example, the user can select which activities are to be executed in which order. Some approaches also integrate the support of planning methods from Artificial Intelligence for the runtime workflow specification. Flexible workflow management technology: Approaches in this field either deal with the modification of workflow instances [7] or with workflow schema evolution (i.e., changes at workflow type level) [2,11,13,14]. Basic questions, however, are very similar for workflow schema and instance changes, mainly concerning the correctness of the workflow management systems after changes at either level have been carried out. Consequently, different change frameworks on workflows have been defined [15] and several correctness criteria have been developed which partly depend on the used workflow meta model (e.g., Petri Nets) [8]. Whereas the correctness question is more or less sufficient to deal with workflow instance changes, other approaches have discovered that, for workflow schema changes, additional challenges arise. The reason is that if a workflow schema is modified, a possibly large number of workflow instances running according to the workflow schema is affected as well. Thus, approaches arose to deal with
efficiency and usability aspects. Furthermore, some proof-of-concept prototypes were implemented. The last development stage of technical workflow evolution research has been made when considering workflow schema and instance changes in their interplay and not in an isolated manner [9]. Currently, more and more approaches start to integrate semantic knowledge into workflow management systems. This can be used as a basis for developing intelligent adaptive systems which are able to learn from previous changes and therefore automatically adapt their workflows when the environment changes (cf. Future Directions).
Foundations As already mentioned, workflow evolution comprises two basic challenges – change discovery and change realization. Change discovery refers to the detection of necessary business process optimizations which should be brought into the running workflows. For change discovery different approaches have emerged recently which are, for example, based on process mining or case-based reasoning techniques. Since these trends are currently ‘‘in the flow’’ the reader is referred to the ‘‘Future Directions’’ section for more details. Contrary, the topic of workflow change (i.e., how to bring the discovered changes into the running system) has been investigated in great detail. Note that the following discussion of scientific challenges in the context of workflow evolution refers to dynamic workflow changes (i.e., changing a workflow instance or a workflow schema during runtime) and not to pre-planned workflow changes. Fig. 1a shows an example of an order workflow based on which the problem spectrum is motivated in the following. At workflow type level, changes are handled by modifying the affected workflow schema S based on which a collection of workflow instances I1,..., In is running. Intuitively, it is necessary that the modification of a correct workflow schema S again results in a correct workflow schema S’ (a workflow schema is called correct if it satisfies correctness constraints set out by the underlying workflow meta model, i.e., the formalism used to describe the business processes, e.g., Petri Nets). This so called static schema correctness can be preserved, for example, by the applied change operations. After changing workflow schema S it is also necessary to deal with process instances I1,..., In running on S (cf. Fig. 1b). Different strategies for this so-called
dynamic process schema evolution have been proposed in literature. One possibility is to cancel all running instances and to restart them according to the new workflow schema. However, this strategy may cause an immense loss of work and would usually not be accepted by users. Therefore, at minimum, it is claimed that workflow instances I1,..., In started according to workflow schema S can be finished on S without being interrupted. This strategy can be implemented by providing adequate versioning concepts and may be sufficient for workflows of short duration. However, doing so raises severe problems in conjunction with long-running workflows as often found in, for example, clinical or engineering environments. Due to the resulting mix of workflow instances running on old and new schema versions a chaos within the production or the offered services may occur. Furthermore, often, it is not acceptable to run instances on the old schema version if laws or business rules (e.g., clinical guidelines) are violated. For these reasons, it is quite important to be able to apply workflow schema changes to running instances as well. This scenario is called the propagation of workflow schema changes to running workflow instances or, in other words, the migration of the (‘‘old’’) running instances to the changed workflow schema. Note that the number of running instances I1,..., In may be very large. In environments like hospitals or telecommunication companies, for example, n > 10,000 may easily hold. To better understand the challenge of propagating a workflow schema change to running workflow instances, one can compare a running workflow instance with an application program. This program is currently executed and consists of several procedures. These steps can be successively executed (sequential execution), can be situated within ifthen-else conditions (alternative branches), or can be even executed in parallel threads. Workflow schema evolution can then be compared to manipulating the running program by inserting one or more procedures, deleting procedures, or changing the order of procedures in the midst of program execution. In particular, the changes have to be carried out by maintaining a correct program execution (e.g., correct parameter provision) for all possible execution alternatives. Consequently, it is a non-trivial but crucial task for workflow management systems to enable workflow
schema evolution. The following main requirements have been identified: 1. Completeness: Workflow designers must not be restricted, neither by the used workflow meta model nor by the offered change operations. 2. Correctness: The ultimate ambition of any adaptive/ flexible workflow management system must be the correctness of workflow changes; i.e., introducing changes to the runtime system without causing
Workflow Evolution. Figure 1. Example order workflow.
inconsistencies or errors (like deadlocks or improperly invoked activity programs). Therefore, adequate correctness criteria are needed. These criteria must not be too restrictive, i.e., no workflow instance should be needlessly excluded from applying a dynamic change. 3. Efficient Correctness Checks: Assume that an appropriate correctness criterion for workflow changes has been found. Then the challenge is to check this criterion efficiently. This is especially important for
large-scale environments with hundreds (up to thousands) of running workflow instances. 4. Usability: The migration of workflow instances to a changed workflow schema must not require expensive user interactions. First of all, such interactions may lead to delays within the instance executions. Secondly, users (e.g., designers or administrators) must not be burdened with the job to adapt workflow instances to changed workflow schemes. Complex process structures and extensive state adaptations might overstrain them quickly. Therefore, it is crucial to offer methods for the automatic adaptation of workflow instances. However, supporting workflow schema changes is not sufficient to enable workflow evolution. In fact, it must be possible to modify single workflow instances as well. Such workflow instance changes are often carried out in an ad-hoc manner in order to deal with an exceptional situation, e.g., an unforeseen correction of the inventory information in an order workflow as depicted for instance 8 in Fig. 1b. In the literature, workflow schema and instance changes have been an important research topic for several years. So far, there are only a few adaptive systems supporting both kinds of changes in one system [4,14], however, only in a separated manner (i.e., already individually modified workflow instances are excluded from migration to the changed workflow schema). As discussed, this approach is not sufficient in many cases, particularly in connection with longrunning workflows. Assume, for example, a medical treatment workflow which is normally executed in a standardized way for every patient. Assume further that due to an unforeseen reaction, an additional drug is given to a certain patient. However, this deviation from the standard procedure must not imply that the affected workflow instance (and therefore the patient) is excluded from further workflow optimizations. Therefore, it must be possible to propagate workflow schema changes at the type level to individually modified instances as well. This interplay between concurrently applied workflow schema and instance changes raises many challenges, for example, the question to which degree the changes overlap. Overlapping means that workflow schema and instance change (partly) have the same effects on the underlying workflow schema. Such an overlap occurs, for example, if a workflow instance has
(partly) anticipated a future workflow schema change in conjunction with a flawed process design. In this situation conflicts may arise between the overlapping workflow schema and instance changes (e.g., if the same activities are deleted). Different approaches have been developed to deal with concurrent workflow schema and instance changes (e.g., based on workflow schema and change comparisons in order to determine overlapping degrees between changes) [9]. Thus, the current state-of-the art comprises sufficient solutions for the technical realization of workflow changes. However, there are advanced challenges regarding the semantic realization of workflow changes and the acquisition of the workflow changes. These challenges are described within the ‘‘Future Directions’’ section.
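A rough sketch of the migration step discussed above is given below: an instance is migrated only if the work it has already performed can be "replayed" on the changed schema. The graph encoding and the replay test are strong simplifications of the correctness criteria in the cited literature, and all schema and activity names are invented for illustration.

def replayable(trace, schema, start):
    """Can the already-executed trace be replayed on the changed schema?

    schema maps each activity to the set of activities that may follow it;
    this is a deliberate simplification of the compliance criteria discussed
    in the literature, meant only to illustrate the idea of a migration check.
    """
    if not trace:
        return True
    if trace[0] != start:
        return False
    for a, b in zip(trace, trace[1:]):
        if b not in schema.get(a, set()):
            return False
    return True

def migrate_instances(instances, new_schema, start):
    migrated, stay_on_old = [], []
    for name, trace in instances.items():
        (migrated if replayable(trace, new_schema, start) else stay_on_old).append(name)
    return migrated, stay_on_old

# Changed order schema: a new 'check_inventory' step now precedes 'ship'.
new_schema = {"receive": {"check_inventory"},
              "check_inventory": {"ship"},
              "ship": {"bill"},
              "bill": set()}
instances = {"I1": ["receive"],             # still compliant with the change
             "I2": ["receive", "ship"]}     # already skipped the new step
print(migrate_instances(instances, new_schema, "receive"))   # (['I1'], ['I2'])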
Key Applications
Production workflow (process) management systems
Clinical workflows
Office automation
Development workflows (e.g., in the automotive domain)
Enterprise application integration (?)
Future Directions Currently, there are comprehensive solutions available regarding the technical support of workflow changes. However, this is not sufficient for a complete workflow evolution solution. First of all, in addition to the technical requirements as discussed in the ‘‘Foundations’’ section, also semantic aspects must be taken into account. In order to make this technology broadly applicable, it is not sufficient to check only whether the planned workflow changes violate any structural or state-related correctness properties. In addition, it must also be possible to verify if semantic constraints (e.g., business rules, four eyes principle) imposed on the affected workflow are violated [5]. In some cases the necessity to change a workflow schema has reasons like changes in legislation, in global business rules, or in the structural organization of the enterprise etc. In many cases, however, workflow schema changes are motivated by the observation that the existing workflow does not meet some real-world requirements. For example, the order of steps is not optimal, some steps are missing or not necessary, or important cases are not reflected in the workflow
schema. It should not be left to users to detect necessary changes. If users find ‘‘work-arounds’’ to locally solve their problem, this information may never reach the person in charge. More promising is to let the system analyze execution logs in order to discover repetitive deviations (e.g., using process mining), to compute an alternative workflow schema covering these cases, and to support the process designer to (semi-)automatically perform the necessary workflow schema evolution [10].
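The log-analysis idea can be illustrated with a crude sketch that counts how often instances deviate from the modeled activity sequence. Real approaches rely on process mining techniques; the deviation signature used here, and the activity names, are assumptions made only for the example.

from collections import Counter

def deviation_signature(trace, expected):
    """A crude description of how an instance trace deviates from the
    expected sequence: which activities were inserted and which skipped."""
    inserted = tuple(a for a in trace if a not in expected)
    skipped = tuple(a for a in expected if a not in trace)
    return (inserted, skipped)

def frequent_deviations(logs, expected, min_support):
    counts = Counter(deviation_signature(t, expected) for t in logs)
    # Ignore fully conforming traces and rare, one-off work-arounds.
    return {sig: n for sig, n in counts.items()
            if sig != ((), ()) and n >= min_support}

expected = ["examine", "decide", "inform"]
logs = [["examine", "decide", "inform"],
        ["examine", "extra_test", "decide", "inform"],
        ["examine", "extra_test", "decide", "inform"],
        ["examine", "decide"]]
print(frequent_deviations(logs, expected, min_support=2))
# {(('extra_test',), ()): 2} -> a candidate for evolving the schema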
Cross-references
▶ Business Intelligence ▶ Business Process Execution Language ▶ Business Process Management ▶ Business Process Reengineering ▶ Composed Services and WS-BPEL ▶ Petri Nets ▶ Process Mining ▶ Process Optimization ▶ Workflow Constructs ▶ Workflow Management and Workflow Management System ▶ Workflow Management Coalition ▶ Workflow Model ▶ Workflow Patterns ▶ Workflow Schema ▶ Workflow Transactions
Recommended Reading 1. Adams M., ter Hofstede A.H.M., Edmond D., and van der Aalst W.M.P. A service-oriented implementation of dynamic flexibility in workflows. In Proc. Int. Conf. on Cooperative Inf. Syst., 2006. 2. Casati F., Ceri S., Pernici B., and Pozzi G. Workflow evolution. Data Knowl. Eng., 24(3):211–238, 1998. 3. Heimann P., Joeris G., Krapp C., and Westfechtel B. DYNAMITE: dynamic task nets for software process management. In Proc. 18th Int. Conf. on Software Eng., 1996, pp. 331–341. 4. Kochut K., Arnold J., Sheth A., Miller J., Kraemer E., Arpinar B., and Cardoso J. IntelliGEN: a distributed workflow system for discovering protein-protein interactions. Distr. Parallel Databases, 13(1):43–72, 2003. 5. Ly L.T., Rinderle S., and Dadam P. Semantic correctness in adaptive process management systems. In Proc. Int. Conf. Business Process Management, 2006, pp. 193–208. 6. Mu¨ller R., Greiner U., and Rahm E. AgentWork: a workflow system supporting rule-based workflow adaptation. Data Knowl. Eng., 51(2):223–256, 2004. 7. Reichert M. and Dadam P. ADEPTflex – supporting dynamic changes of workflows with-out losing control. J. Intell. Inform. Syst., 10(2):93–129, 1998.
8. Rinderle S., Reichert M., and Dadam P. Correctness criteria for dynamic changes in workflow systems – a survey. Data Knowl. Eng., 50(1):9–34, 2004. 9. Rinderle S., Reichert M., and Dadam P. Disjoint and overlapping process changes: challenges, solutions, applications. In Proc. Int. Conf. on Cooperative Inf. Syst., 2004, pp. 101–120. 10. Rinderle S., Weber B., Reichert M., and Wild W. Integrating process learning and process evolution – a semantics based approach. In Int. Conf. Business Process Management, 2005, pp. 252–267. 11. Sadiq S., Marjanovic O., and Orlowska M. Managing change and time in dynamic workflow processes. Int. J. Cooper. Inform. Syst., 9(1, 2):93–116, 2000. 12. Sadiq S., Sadiq W., and Orlowska M. Pockets of flexibility in workflow specifications. In Proc. 20th Int. Conf. on Conceptual Modeling, 2001, pp. 513–526. 13. van der Aalst W.M.P. Exterminating the dynamic change bug: a concrete approach to support workflow change. Inform. Syst. Front., 3(3):297–317, 2001. 14. Weske M. Formal foundation and conceptual design of dynamic adaptations in a workflow management system. In Proc. Hawaii Int. Conf. System Sciences, 2001. 15. Weber B., Rinderle S., and Reichert M. Change patterns and change support features in process-aware information systems. In Proc. Int. Conf. Advanced Information Systems Engineering, 2007, pp. 574–588.
Workflow Join
Nathaniel Palmer
Workflow Management Coalition, Hingham, MA, USA
Synonyms AND-join; Rendezvous; Synchronization join
Definition A point in the workflow where two or more parallel executing activities converge into a single common thread of control.
Key Points
The execution of parallel activities commences with an AND-Split and concludes with an AND-Join. For example, in a credit application process there may be a split in the workflow at which point multiple activities are completed separately (in parallel, if not simultaneously). Each parallel executing thread is held until the set of all thread transitions to the next activity is completed, at which point the threads converge and the next activity is initiated.
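A minimal sketch of this AND-join bookkeeping is shown below, using the credit application example. The class and activity names are illustrative assumptions, not part of any standard.

class AndJoin:
    """An AND-join fires its outgoing activity only after every parallel
    branch created by the matching AND-split has completed (sketch)."""

    def __init__(self, incoming_branches, next_activity):
        self.waiting = set(incoming_branches)
        self.next_activity = next_activity

    def branch_completed(self, branch):
        self.waiting.discard(branch)
        if not self.waiting:
            return self.next_activity   # all threads converged: continue
        return None                     # still held

# Credit application: two checks run in parallel after the AND-split.
join = AndJoin({"credit_history_check", "income_check"}, "assess_application")
print(join.branch_completed("income_check"))          # None, still waiting
print(join.branch_completed("credit_history_check"))  # 'assess_application'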
Key Points The automation of a business process is defined within a Process Definition, which identifies the various process activities, procedural rules and associated control data used to manage the workflow during process enactment. Many individual process instances may be operational during process enactment, each associated with a specific set of data relevant to that individual process instance (or workflow ‘‘Case’’). A loose distinction is sometimes drawn between production workflow, in which most of the procedural rules are defined in advance, and ad-hoc workflow, in which the procedural rules may be modified or created during the operation of the process.
Cross-references
▶ AND-Split ▶ OR-Join ▶ Process Life Cycle ▶ Workflow Management and Workflow Management System
Cross-references
▶ Business Process Management ▶ Event-Driven Business Process Management ▶ Workflow Management and Workflow Management System ▶ Workflow Model ▶ XML Process Definition Language
Workflow Lifecycle ▶ Process Life Cycle
Workflow Management and Workflow Management System Workflow Loop
Johann Eder
University of Vienna, Vienna, Austria
▶ Loop
Synonyms Business process management
Workflow Management
Nathaniel Palmer
Workflow Management Coalition, Hingham, MA, USA
Synonyms Workflow; Business management
process management; Case
Definition The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules.
Definition Business Process: A set of one or more linked procedures or activities that collectively realize a business objective or policy goal, normally within the context of an organizational structure defining functional roles and relationships [10]. Workflow: The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules [10]. Workflow Management System (WFMS): A system that defines, creates and manages the execution of workflows through the use of software, running on one or more workflow engine, which is able to interpret
W
W
the process definition, interact with workflow participants and, where required, invoke the use of IT tools and applications [10].
software more flexible and to integrate heterogeneous applications. Aspects of Workflows
Historical Background
In the 1980s, office automation was a major research focus with several topics, including the support for the movement of documents through offices. Some proposals were made to model document or form flows and to support the ‘‘paperless’’ office. Another driver for the development of workflows came from extended transaction models. The problem was to solve integrity problems in heterogeneous information systems where, within one business process, several databases are updated. Integrity was compromised if some of the updates failed or were not carried out. Extended transaction models aimed to provide assurance that all steps were performed properly. A third development stream came from computer supported cooperative work, where the need arose to better support repeating processes. In the 1980s, revolutionary ideas for improving business by business process reengineering also emerged. Enterprises were urged to organize themselves around their core business processes and not according to functional considerations as before. This movement in business administration called for software to support process driven organizations. Since the 1990s, hundreds of workflow systems have been developed, both academic prototypes and commercial products with quite different functionalities.
Foundations Workflows automate and/or support the execution of business processes. The core idea of workflow management is to separate the process logic from the functional application logic. This means that applications provide functionality, and the workflow system controls, when to involve which functionality. So the workflow is responsible for who (which actor) does what (which activity), with which data and resources, when (control logic) under which constraints. And it is the job of the application system or workflow participant to know how an activity is performed. So the workflow system is responsible for the logistics of a process but not for the execution of the individual steps. This separation of concerns, to separate the doing of things from the organization of the process is supposed to make
A workflow consists of a set of activities and the dependencies between them, the so-called business logic. These dependencies can be expressed in different forms. The most frequent way is to express the dependencies by control flow definitions, often supported by graphical notations. Other ways are data flow or rules. However, the control flow is only one aspect of workflow management, although it has received most attention. The following aspects of workflows can be distinguished: Activities: Activities are units of work to be performed, either by a specific program or by an user, typically interacting with some application programs. Activities can be elementary, i.e from the point of the workflow systems they are the smallest units of work. Elementary activities are frequently called tasks. Activities can be composed to complex activities. Process: The process defines the control flow, i.e., the admissible relative orders of activity executions. Typical elements of control flow are sequence, andparallelism (concurrent execution of activities), orparallelism (selection of different activities, inclusive or exclusive). Workflow models provide the means for defining the process. Many process models rely on (variants of) Petri-nets or (simpler) variants of Petrinets, in particular, so called workflow nets. Workflow nets are graphs, where the nodes are activities and the directed edges represent a precedence relation between activities. Additionally, there are control nodes: andsplit for invoking several successor activities concurrently, and-join for synchronizing parallel threads, xorsplits for selecting one of several successor activities and xor-join for joining alternate paths again. Actors: Actors perform the tasks defined in the workflow specification. Actors can be humans or systems. Systems perform so-called automatic tasks, i.e., execute some application programs. Human actors can be identified directly (by a user name), or indirectly by reference to a position in the organization, by roles, representing a set of users competent for fulfilling a tasks, by organizational unit, by specification of necessary skills and competencies, or by reference to some data item containing the actor. For indirect representation of the individual actor, the workflow system has to resolve the role at runtime and assign a user to
a particular task. For the resolution of actors specified by organizational units, the workflow system maintains a model of the organization. Data: Data are needed for the processing of a workflow. Three kinds of data comprise workflows. Workflow control data are managed by a WFMS and describe workflow execution (e.g., control and data flow among activities), relevant internally for a WFMS and without a long-term impact beyond the scope of the current workflow (e.g., a termination state of an activity). Application data are managed by the applications supporting the process instance and generally are never seen by the WFMS. Workflow relevant data are used by the WFMS to determine the state transitions of a workflow (e.g., transition conditions) and may be accessed both by the WFMS and the applications. Further, a workflow system manages organizational data which it uses at runtime to resolve actors, and audit data kept in the workflow log documenting the execution details of workflow instances. Applications: Applications are pieces of software that can be invoked within a workflow to fulfill a certain activity. A WFMS has to administer applications and the details of their invocation. Other aspects: Other aspects of workflows include constraints that have to be satisfied (e.g., pre- and postcondition of activities, quality-of-service constraints), in particular temporal constraints specifying deadlines and bounds for the duration between activities. Business rules may restrict the process.
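A toy encoding of the control-flow constructs listed above (tasks, and-split/and-join, xor-split/xor-join) is sketched below. The node names, the ordering process, and the firing rules are invented for illustration and do not follow any particular workflow meta model or Petri-net semantics.

# A toy workflow net for an ordering process (illustrative node names).
NODES = {
    "receive": {"kind": "task",      "out": ["split"]},
    "split":   {"kind": "and-split", "out": ["pack", "bill"]},
    "pack":    {"kind": "task",      "out": ["join"]},
    "bill":    {"kind": "task",      "out": ["join"]},
    "join":    {"kind": "and-join",  "out": ["check"]},
    "check":   {"kind": "xor-split", "out": ["ship", "cancel"]},
    "ship":    {"kind": "task",      "out": []},
    "cancel":  {"kind": "task",      "out": []},
}

def incoming(node):
    return [n for n, spec in NODES.items() if node in spec["out"]]

def run(first_node, choose=lambda options: options[0]):
    tokens = {("start", first_node)}        # tokens sit on arcs (src, dst)
    trace = []
    progressed = True
    while progressed:
        progressed = False
        for node, spec in NODES.items():
            arriving = {arc for arc in tokens if arc[1] == node}
            if not arriving:
                continue
            if spec["kind"] == "and-join" and len(arriving) < len(incoming(node)):
                continue                    # and-join waits for all branches
            # tasks and xor-joins fire on a single token; and-joins need all
            consume = arriving if spec["kind"] == "and-join" else {min(arriving)}
            tokens -= consume
            trace.append(node)
            outs = spec["out"]
            if spec["kind"] == "xor-split" and outs:
                outs = [choose(outs)]       # pick exactly one branch
            tokens |= {(node, dst) for dst in outs}
            progressed = True
            break
    return trace

print(run("receive"))
# e.g. ['receive', 'split', 'pack', 'bill', 'join', 'check', 'ship']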
W
Workflow Management Systems
Workflow Management Systems are generic software systems that are driven by workflow models and control the execution of workflows. The Workflow Management Coalition (WfMC) developed a reference architecture for WFMS that has been widely adopted. Figure 1 shows the model with its components and interfaces. The Workflow Enactment Service, consisting of one or several workflow engines, is the core of a workflow management system. It accepts a process description, interprets it, creates instances (workflows), and controls the execution of the process instances. This means that the activities of the workflows are invoked according to the logic expressed in the process model, activities are dispatched to actors, and, if necessary, external services and external applications are invoked or process instances are forwarded to other workflow enactment services. Interface 1 accepts the process definition that contains all information necessary for the execution of process instances. Process definition tools are employed to describe processes in a proper format. Some process definition languages have been proposed as standards (e.g., XPDL or BPEL) to have uniform interfaces between process definition tools and workflow enactment services. Nevertheless, many workflow management systems feature proprietary languages, partly because of the different types of workflows they intend to support.
W
Workflow Management and Workflow Management System. Figure 1. Workflow Management Systems: WfMC reference architecture [2].
W
Interface 2 links the workflow enactment service with workflow client applications, the piece of software the workflow participants (or actors) work with. The main feature of this component is to maintain the worklist, a collection of all activities assigned to a particular user. Users can typically accept an activity, delegate it, refuse it, work on it, or complete it. Interface 3 describes how other applications are invoked. This interface is frequently used to invoke legacy applications. Since applications (in particular interactive programs) can be invoked from the workflow client application as well, the interfaces 2 and 3 are typically combined permitting the invocation of applications both directly from the workflow enactment service and on user request from the workflow client tool. Interface 4 addresses workflow interoperability. It defines the interaction between workflow enactment services, the invocation of subworkflows, the synchronization between cooperating workflows, and the forwarding of workflows to other workflow applications. Interface 5 specifies the interaction with administration and monitoring tools. Administration tools provide the general administration of the workflow system: the registration of user, of application programs, the security management, in particular the administration of rights. The monitoring tools allow for the inspection of the progress of workflow instances and typically provide sophisticated statistics about the execution of workflows to support process managers and to provide data for process improvement.
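The worklist handling at interface 2 can be sketched roughly as follows. The operations (offer, accept, complete), the role model, and the item states are assumptions for illustration, not the WfMC interface definition.

class Worklist:
    """Toy worklist component: the enactment service adds work items for a
    role; a user holding that role can accept and then complete them."""

    def __init__(self, user_roles):
        self.user_roles = user_roles          # user -> set of roles
        self.items = {}                       # item id -> dict

    def offer(self, item_id, activity, role):
        self.items[item_id] = {"activity": activity, "role": role,
                               "state": "offered", "user": None}

    def worklist_of(self, user):
        roles = self.user_roles[user]
        return [i for i, it in self.items.items()
                if (it["role"] in roles and it["state"] == "offered")
                or (it["user"] == user and it["state"] == "accepted")]

    def accept(self, user, item_id):
        item = self.items[item_id]
        assert item["role"] in self.user_roles[user] and item["state"] == "offered"
        item.update(state="accepted", user=user)

    def complete(self, user, item_id):
        item = self.items[item_id]
        assert item["user"] == user and item["state"] == "accepted"
        item["state"] = "completed"   # the enactment service would be notified here

wl = Worklist({"alice": {"clerk"}, "bob": {"manager"}})
wl.offer("wi-1", "check invoice", "clerk")
print(wl.worklist_of("alice"))   # ['wi-1']
wl.accept("alice", "wi-1")
wl.complete("alice", "wi-1")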
Key Applications Workflow systems are mainly employed to support the management and execution of business processes. The typical architecture is that a workflow management system is used for managing the business processes and for integrating the different application areas. Such an architecture can be found in various enterprises and organizations. A few example applications include insurance claim processing, loan processing, product development, bug reporting, order processing, media production, etc. However, there are additional types of workflow applications: Groupware like systems with high flexibility to support workflow that cannot be fully modeled beforehand but develop as the process is executed. These systems primarily support the cooperation
between individuals and care for highly dynamic processes. There are workflow systems as top layers on large enterprise software systems, adding processes to the rich integrated functionality of these systems. Workflow systems are used to integrate heterogeneous application systems and provide an integration platform for these systems. Workflow systems are a layer in the architecture of software systems caring for the process aspects just as data management systems are responsible for storing and retrieving data (process brokers). Workflows are frequently classified into three different groups according to the criteria of how much they can be automated, how frequent they are executed, and how much they can be specified: Ad-hoc workflows are highly dynamic workflows that are not fully specified at build time and consist mainly of manual activities. Examples are document routing, or review processes. Administrative workflows are well structured and more repetitive. They are the typical office paper shuffling type of processes. Examples are budgeting, purchase order processing, etc. Production workflows are highly repetitive, well structured and fully automated. They represent the core business processes of an organization. Examples are insurance claim handling, bug reporting, order processing, etc. The employment of workflow technology is supposed to bring the following benefits: Efficiency: The automation of process control, the increased possibilities for concurrent execution of activities of the same workflow instance, and the automation of the forwarding of work to the next actor increase the efficiency of business processes as measured in throughput and in turn-around time. Process modeling: The business processes are explicitly specified and do no longer solely exist in the heads of the participants or in manuals. This is a necessary prerequisite to reflect on processes and to improve processes. Automated process control: the centralized automatic control of the process execution guarantees that business rules are respected. Deviations from the specified process, which are of course possible, are documented and can be used for auditing and
process improvement. Users are informed of work to do, and the data used for performing activities are automatically delivered to the respective workers. Integration: Workflow systems allow to bridge heterogeneous applications. Functionalities in application systems are called external applications. The workflow describes and the WFMS enforces the proper sequences of invocation of functionalities in different application systems and transports data between different systems. In general, the integrity of the whole system consisting of heterogeneous applications is achieved. Transparency: Transparency is increased through explicit process specification and documentation. All steps and decisions in the execution of a process are centrally documented (workflow log or workflow history). Therefore, it is easy to report the actual state of a process instance. Monitoring: The progress of workflows can be easily monitored. For example, delays or exceptional situations are immediately recognized allowing process managers to react timely to assure that deadlines are met. Documentation: The execution of workflows is automatically documented. Each process can be traced, e.g., for the purpose of revision. The process documentation is also a valuable source for statistics about the processes to identify weaknesses of the process definition and for process improvements. Quality assurance: Quality assurance procedures like the ISO 9000 Suite or TQM (Total Quality Management) require in particular that processes are specified, that processes are executed according to their specification, and that all processes are documented. All that is supported by workflow systems. Process oriented management: Workflow systems support the management of process driven organization.
Workflow systems are a flourishing market. Applications can be found in a large number of application domains: e.g., insurances, banks, government agencies, hospitals, etc. Scientific workflows are successfully applied in life science, physics, and chemistry, etc. There are many applications with embedded workflow technology, e.g., document management systems, content management systems, or ERP-systems.
W
Future Directions An important development of workflow systems is their combination with web service technology. Web service composition relies on the same principles as workflow, the main difference is that the activities are web services. Another important topic is the support of interorganizational business processes. The benefits of workflow technology should be exploited also for business processes that span several organizations (enterprizes). While this is already achieved for some types of virtual enterprizes, where the partners (suppliers, costumers, consultants, service providers, etc.) involved and the interfaces are fairly stable, it is still a challenge to support processes with changing partners or multitudes of partners. Here a lot of research is still needed, in particular, integration with semantic web services.
Cross-references
▶ Database Techniques to Improve Scientific Simulations ▶ Extended Transaction Models ▶ Workflow Management Coalition ▶ Workflow Model ▶ Workflow Transactions
Recommended Reading 1. van der Aalst W. et al. Workflow Management: models, methods, and systems. MIT, Cambridge, MA, 2002. 2. Dumas M., van der Aalst W.M., and ter Hofstede A.H. Processaware information systems: bridging people and software through process technology. Wiley, 2005. 3. Eder J., Groiss H., and Liebhart W. The Workflow Management System Panta Rhei. In Workflow Management Systems and ¨ zsu, A. Sheth Interoperability. Kalinichenko, A. Dogac, M.T. O (eds). 1998. pp. 129–144. 4. Ellis C.A. and Nutt G.J. Office Information Systems and Computer Science. ACM Comput. Surv., 12(1): 27–60, 1980. 5. Ellis C.A. and Nutt G.J. Modeling and Enactment of Workflow Systems. In Proc. 14th Int. Conf. on Application and Theory of Petri Nets., 1993, pp. 1–16. 6. Georgakopoulos D., Hornick M., and Sheth A. An overview of workflow management: From process modeling to workflow automation infrastructure. Distrib. Parallel Databases, 3(2):119–153, 1995. 7. Jablonski S. and Bussler C. Workflow management: modeling concepts, architecture and implementation. Int. Thomson Computer, 1996. 8. Leymann F. and Roller D. Production workflow. Prentice-Hall, Englewood, Cliffs, NJ, 2000. 9. The Workflow Reference Model. Workflow Management Coalition, 1995. 10. Workflow management coalition terminology and glossary [R]. Workflow Management Coalition, 1999.
W
W
Workflow Management Coalition
Workflow Management Coalition
Nathaniel Palmer
Workflow Management Coalition, Hingham, MA, USA
Cross-references
▶ Workflow Management and Workflow Management System ▶ Workflow Model ▶ XML Process Definition Language
Synonyms XPDL
Definition
Workflow Meta-Model ▶ Workflow Schema
The Workflow Management Coalition, founded in August 1993, is a non-profit, international organization of workflow vendors, users, analysts and university/research groups. The Coalition’s mission is to promote and develop the use of workflow through the establishment of standards for software terminology, interoperability and connectivity among BPM and workflow products. Comprising more than 250 members worldwide, the Coalition is the primary standards body for this software market.
▶ Process Mining
Key Points
Nathaniel Palmer
Workflow Management Coalition, Hingham, MA, USA
The WfMC creates and contributes to process related standards, educates the market on related issues, and is the only standards organization that concentrates purely on process. The WfMC created Wf-XML and XPDL, the leading process definition used today in over 70 solutions to store and exchange process models. XPDL is not an executable programming language but a process design format for storing the visual diagram and process syntax of business process models, as well as extended product attributes. The Coalition has developed a framework for the establishment of workflow standards. This framework includes five categories of interoperability and communication standards that will allow multiple workflow products to coexist and interoperate within a user’s environment. These interface points are structured around the workflow reference model which provides the framework for the Coalition’s standards program. The reference model identifies the common characteristics of workflow systems and defines five discrete functional interfaces through which a workflow management system interacts with its environment – users, computer tools and applications, other software services, etc. Working groups meet individually, and also under the umbrella of the Technical Committee, which is responsible for overall technical direction and coordination.
Workflow Mining
Workflow Model
Synonyms Business process model; Process definition
Definition A set of one or more linked procedures or activities which collectively realize a business objective or policy goal, normally within the context of an organizational structure defining functional roles and relationships.
Key Points A business process is typically associated with operational objectives and business relationships, for example an Insurance Claims Process, or Engineering Development Process. A process may be wholly contained within a single organizational unit or may span several different organizations, such as in a customer-supplier relationship. A business process has defined conditions triggering its initiation in each new instance (e.g., the arrival of a claim) and defined outputs at its completion. A business process may involve formal or relatively informal interactions between participants; its duration may also vary widely.
A business process may consist of automated activities, capable of workflow management, and/or manual activities, which lie outside the scope of workflow management.
Cross-references
▶ Workflow Management and Workflow Management System ▶ Workflow Modeling
Workflow Model Analysis
W.M.P. van der Aalst
Eindhoven University of Technology, Eindhoven, The Netherlands
Definition The correctness, effectiveness, and efficiency of the business processes supported by a workflow management systems are vital to the organization. A workflow process definition which contains errors may lead to angry customers, back-log, damage claims, and loss of goodwill. Flaws in the design of a workflow definition may also lead to high throughput times, low service levels, and a need for excess capacity. This is why it is important to analyze a workflow process definition before it is put into production. In order to analyse a workflow definition, it is necessary to translate the definition into a model suitable for analysis, e.g., a simulation model or a model that allows for verification techniques (e.g., a Petri net).
Process-aware information systems are driven by process models. A typical example of a process-aware information system is a system generated using workflow management software. In such cases there is an explicit process model that can be used as a starting point for analysis [3,4]. There are three types of workflow model analysis: 1. Validation, i.e., testing whether the workflow behaves as expected. 2. Verification, i.e., establishing the correctness of a workflow. 3. Performance analysis, i.e., evaluating the ability to meet requirements with respect to throughput times, service levels, and resource utilization.
Validation can be done by interactive simulation: a number of fictitious cases are fed to the system to see whether they are handled well. For verification and performance analysis more advanced analysis techniques are needed. Fortunately, many powerful analysis techniques have been developed for formal methods such as Petri nets [2,3]. Linear algebraic techniques can be used to verify many properties, e.g., place invariants, transition invariants, and (non-)reachability. Coverability graph analysis, model checking, and reduction techniques can be used to analyze the dynamic behavior of the workflow process. Simulation, queueing networks, and Markov-chain analysis can be used for performance evaluation (cf. [1,2]). Note that all three types of analysis assume that there is some initial workflow model. This implies that analysis may require some modeling or some translation from one language to another. In some cases, process mining can be used to discover the underlying workflow model.
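As a flavor of what such automated analysis looks like, the sketch below performs a very weak structural check: every node of a workflow graph must lie on some path from the start node to the end node. Genuine verification (e.g., soundness of the corresponding Petri net) is considerably stronger; the graph encoding here is an assumption made only for illustration.

def reachable(graph, start):
    """Nodes reachable from start by a simple graph search."""
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for succ in graph.get(node, []):
            if succ not in seen:
                seen.add(succ)
                frontier.append(succ)
    return seen

def structurally_sound(graph, start, end):
    """Weak structural check: every node is on some path from start to end.
    Real verification of workflow definitions is much stronger."""
    forward = reachable(graph, start)
    reverse = {n: [] for n in graph}
    for n, succs in graph.items():
        for s in succs:
            reverse.setdefault(s, []).append(n)
    backward = reachable(reverse, end)
    return all(n in forward and n in backward for n in graph)

workflow = {"start": ["a"], "a": ["b", "c"], "b": ["end"],
            "c": ["end"], "end": [], "orphan": ["end"]}
print(structurally_sound(workflow, "start", "end"))   # False: 'orphan' unreachable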
Cross-references
▶ Business Process Management ▶ Petri Nets ▶ Process Mining ▶ Workflow Management
Recommended Reading 1.
2. 3.
Key Points
W
4.
Marsan M.A., Balbo G., and Conte G. et al. Modelling with generalized stochastic Petri nets. Wiley Series in Parallel Computing. Wiley, New York, NY, USA, 1995. Reisig W. and Rozenberg G. (eds.) Lectures on Petri Nets I: Basic Models. LNCS. 1491. Springer, Berlin Heidelberg New York, 1998. van der Aalst W.M.P. The application of Petri nets to workflow management. J. Circ. Syst. Comput., 8(1):21–66, 1998. van der Aalst W.M.P. and van Hee K.M. Workflow Management: Models, Methods, and Systems. MIT, Cambridge, MA, 2004.
Workflow Modeling
Marlon Dumas
University of Tartu, Tartu, Estonia
Synonyms Business process modeling
Definition A workflow model is a representation of the way an organization operates to achieve a goal, such as
W
W
delivering a product or a service. Workflow models may be given as input to a workflow management system to automatically coordinate the tasks composing the workflow model. However, workflow modeling may be conducted purely for documentation purposes or to analyze and improve the operations of an organization, without this improvement effort implying automation by means of a workflow system. A typical workflow model is a graph consisting of at least two types of nodes: task nodes and control nodes. Task nodes describe units of work that may be performed by humans or software applications, or a combination thereof. Control nodes capture the flow of execution between tasks, therefore establishing which tasks should be enabled or performed after completion of a given task. Workflow models, especially when they are intended for automation, may also include object nodes denoting inputs and outputs of tasks. Object nodes may correspond to data required or produced by a task. But in some notations, they may correspond to physical documents or other artifacts that need to be made available or that are produced or modified by a task. Additionally, workflow models may include elements for capturing resources that are involved in the performance of tasks. For example, in two notations for workflow modeling, the Unified Modeling Language (UML) Activity Diagrams, and the Business Process Modeling Notation (BPMN), resource types are captured as swimlanes, with each task belonging to one swimlane (or multiple in the case of UML). In other notations, such as Event-driven Process Chains (EPC), this is represented by attaching resource types to each task.
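The model elements just described can be written down as a small data structure, as in the sketch below. The tasks, swimlanes, and data objects are invented, and the accompanying check on the data perspective is only one of many analyses such a representation enables.

# Toy representation of task nodes with a swimlane (resource type) and
# input/output object nodes (illustrative names only).
TASKS = {
    "check completeness": {"lane": "Clerk", "inputs": ["application"],
                           "outputs": ["checked application"]},
    "check credit history": {"lane": "System", "inputs": ["checked application"],
                             "outputs": ["credit report"]},
    "assess application": {"lane": "Officer",
                           "inputs": ["checked application", "credit report"],
                           "outputs": ["decision"]},
}

def undefined_inputs(tasks, externally_provided):
    """Report data objects that some task reads but no task (or the
    environment) produces: a simple completeness check on the data
    perspective of the model."""
    produced = set(externally_provided)
    for spec in tasks.values():
        produced.update(spec["outputs"])
    return {obj for spec in tasks.values()
            for obj in spec["inputs"] if obj not in produced}

print(undefined_inputs(TASKS, externally_provided={"application"}))   # set()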
Historical Background The idea of documenting business processes to improve customer satisfaction and business operations dates back to at least the 1960s when it was raised, among others, by Levitt [8]. Levitt contended that organizations should avoid focusing exclusively on goods manufacturing, and should view their ‘‘entire business process as consisting of a tightly integrated effort to discover, create, arouse, and satisfy customer needs.’’ However, it is not until the 1970s that the idea of using computer systems to coordinate business operations based on process models emerged. Early scientific work on business process modeling was undertaken in the context of office information systems by Zisman [13] and Ellis [2] among others. During the
1970s and 1980s there was optimism in the IT community regarding the applicability of process-oriented office information systems. Unfortunately, few deployments of this concept succeeded, partly due to the lack of maturity of the technology, but also due to the fact that the structure and operations of organizations were centered around the fulfillment of individual functions rather than end-to-end processes. Following these early negative experiences, the idea of process modeling lost ground during the next decade. Towards the early 1990s, however, there was a renewed interest in business process modeling. Instrumental in this revival was the popularity gained in the management community by the concept of Business Process Reengineering (BPR) advocated by Hammer [3] and Davenport [1] among others. BPR contended that overspecialized tasks carried across multiple organizational units should be unified into coherent and globally visible processes. This management trend fed another wave of business process technology known as workflow management. Workflow management, in its original form, focused on capturing business processes composed of tasks performed by human actors and requiring data, documents and forms to be transferred between them. Workflow management was successfully applied to the automation of routine and high-volume administrative processes such as orderto-cash, procure-to-pay and insurance claim handling. However it was less successful in application scenarios requiring the integration of heterogeneous information systems, and those involving high levels of evolution and change. Also, standardization efforts in the field of workflow modeling and management, led by the Workflow Management Coalition (WfMC), failed to gain wide adoption. The BPR trend also led to the emergence of business process modeling tools that support the analysis and simulation of business processes, without targeting their automation. The ARIS platform, based on the EPC notation, is an example of this family of tools. Other competing tools supported alternative notations such as IDEF3 and several variants of flowcharts. By the late 1990s, the management community had moved from the concept of BPR to that of Business Process Improvement (BPI), which advocates an incremental and continuous approach to adopting process-orientation. This change of trend combined with the limited success of workflow management and its failure to reach standardization, led to the
term workflow acquiring a negative connotation. Nonetheless, the key concepts underpinning workflow management reemerged at the turn of the twenty-first century under the umbrella of Business Process Management (BPM). BPM adopts a more holistic view than workflow management, covering not only the automated coordination of processes, but also their monitoring and continuous analysis and improvement. Also, BPM deals not only with the coordination of human tasks, but also with the integration of heterogeneous information and software systems. With BPM came the realization that business processes should be analyzed and designed both from a business viewpoint and from a technical viewpoint, and that methods are required to bridge these viewpoints. Also, the BPM trend led to a convergence of modeling languages. BPMN has emerged as a standard for process modeling at the analysis level, while WS-BPEL has taken this place at the implementation level. Other notations include UML Activity Diagrams and EPCs at the analysis level, and YAWL (Yet Another Workflow Language) [6] at the execution level.
Foundations Workflow modeling can be approached from a number of perspectives [4]. The control-flow perspective describes the ordering and causality relationships between tasks. The data perspective (or information perspective) describes the data that are taken as input and produced by the activities in the process. The resource perspective describes the structure of the organization and identifies resources, roles, and groups. The task perspective describes individual steps in the processes and thus connects the other three perspectives. Many workflow modeling notations emphasize the control-flow perspective. This is the case for example
of process modeling notation based on flowcharts. Flowcharts consist of actions (also called activities) and decision nodes connected by arcs that denote sequential execution. In their basic form, flowcharts do not make a separation between actions performed by different actors or resources. However, an extension of flowcharts, known as cross-functional flowcharts, allow one to perform this partitioning. Flowcharts have inspired many other contemporary process modeling notations, including UML Activity Diagrams and BPMN. A simplified business process model for credit approval in BPMN is given in Fig. 1. An execution of this process model is triggered by the receipt of a credit application, denoted by the leftmost element in the model. Following this, the application is checked for completeness. Two outcomes are possible: either the application is marked as complete or as incomplete. This choice is represented by the decision gateway depicted by the diamond labeled by an ‘‘X’’ sign. If the application is incomplete, additional information is sought from the applicant. Otherwise, two checks are performed in parallel: a credit history check and an income check. This is denoted by a parallel split gateway (the first diamond labeled by a ‘‘+’’ sign). The two parallel checks then converge into a synchronization gateway (the second diamond labeled by a ‘‘+’’). The credit history check may be performed automatically while the income source check may require an officer to contact the applicant’s employer by phone. After these checks, the application is assessed and it is either accepted or rejected. A second family of notations for business process modeling are those based on state machines. State machines provide a simple approach to modeling behavior. In their basic form, they consists of states connected by transitions. When used for business process
Workflow Modeling. Figure 1. Example of a business process model in BPMN.
modeling, it is common for the transitions in a state machine to be labeled by event-condition-action (ECA) rules. The semantics of a transition labeled by an action is the following: if the execution of the state machine is in the source state, an occurrence of the event (type) has been observed, and the condition holds, then the transition may be taken. If the transition is taken, the action associated with the transition is performed and the execution moves to the target state. A transition may contain any combination of these three elements: e.g., only an event, only a condition, only an action, an event and a condition but no action, etc. In some variants of state machines, it is possible to attach activities to any given state. The semantics is that when the state is entered, the activity in question may be started, and the state can only be exited once this activity has completed. Several commercial business process management systems support the execution of process models defined as state machines. Examples include the ‘‘Business State Machine’’ models supported by IBM Websphere and the ‘‘Workflow State Machine’’ models supported by Microsoft Windows Workflow Foundation. A disadvantage of using basic state machines for process modeling is their tendency to lead to state explosion. Basic state machines represent sequential behavior: the state machine can only be in one state at a time. This, when representing a business process model with parallel execution threads, one essentially needs to enumerate all possible permutations of the activities that may be executed concurrently. Another source of state explosion is exception handling. Statecharts have been proposed as a way to address this state explosion problem. Statecharts include a notion of concurrent components as well as transitions that may interrupt the execution of an entire component of the model. The potential use of statecharts for workflow modeling has been studied in the research literature, for example in the context of the Mentor research
prototype [10], although it has not been adopted in commercial products. Figure 2 shows a statechart process model intended to be equivalent to the BPMN ‘‘Credit Approval’’ process model of Fig. 1. It can be noted that decision points are represented by multiple transitions sharing the same source state and labeled with different conditions (conditions are written between square brackets). Parallel execution is represented by means of a composite state divided into two concurrent components. Each of these components contains one single state. In the general case, each concurrent component may contain an arbitrarily complex state machine. Finally, a third family of notations for control-flow modeling of business processes are those based on Petri nets. In the research literature, workflow modeling notations based on Petri nets have been used mainly for studying expressiveness and verification problems (e.g., detecting deadlocks in workflow models). Also, certain Petri net-based notations are supported by business process modeling tools and workflow management systems. A notable example is YAWL. YAWL is an extension of Petri nets designed on the basis of the Workflow Patterns. A key design goal of YAWL is to support as many workflow patterns as possible in a direct manner, while retaining the theoretical foundation of Petri nets. Like Petri nets, YAWL is based on the concepts of place and transition (Transitions are called tasks in YAWL while places are called conditions.). YAWL also incorporates a concept of decorator, which is akin to the concept of gateway in BPMN. In addition, it supports the concepts of sub-process (called composite task) and cancelation region. The latter is used to capture the fact that the execution of a set of tasks may be interrupted by the execution of another task, and can be used for example for fault handling. Figure 3 shows a YAWL process model intended to be equivalent to the BPMN ‘‘Credit Approval’’ process
Workflow Modeling. Figure 2. Example of a business process model captured as a statechart.
Workflow Modeling. Figure 3. Example of a business process model captured in YAWL.
model of Fig. 1 and the statechart process model of Fig. 2. The symbols attached to the left and the right of each task are the decorators. The decorator on the left of task ‘‘Check Completeness’’ is an XOR-merge decorator, meaning that the task can be reached from either of its incoming flows. The decorator of the right of this same task is an XOR-split, which corresponds to the decision gateway in BPMN. The decorator just before ‘‘Check Credit History’’ is an AND-split (akin to a parallel split gateway) and the one attached to the left of ‘‘Assess Application’’ is an AND-join and it is used to denote a full synchronization. The data perspective of process modeling can be captured using notations derived from dataflow diagrams. Dataflow diagrams are directed graphs composed of three types of nodes: processes, entities and data stores. Entities denote actors, whether internal or external to an organization. Processes denote units of work that transform data. Data stores denote units of data storage. A data store is typically used to store data objects of a designated type. The arcs in a dataflow diagram are called data flows. Data flows denote the transfer of data between entities, processes and data stores. They may be labeled with the type of data being transferred. Data flow diagrams are often used very liberally, with almost no universally accepted syntactic constraints. However, best practice dictates that each process should have at least one incoming data flow and one outgoing data flow. Data flows may connect entities to processes and vice-versa, but can not be used to connect entities to data stores or to connect one entity to another directly. It is possible to connect two processes directly by means of a data flow, in which case it means that the output of the first process is the input of the second. It is also possible to assign ordinal numbers to processes in order to capture their
execution order. Dataflow diagrams also support decomposition: Any process can be decomposed into an entire dataflow diagram consisting of several subprocesses. Data flows are quite restricted insofar as they do not capture decision points and they are not suitable for capturing parallel execution or synchronization. Because of this, dataflow diagrams are not used to capture processes in details, instead, they are used as a high-level notation. On the other hand, some elements from data flow diagrams have found their way into other process modeling notations such as EPCs, which incorporate constructs for capturing input and output data of functions. Similarly, Activity Diagrams have a notion of object nodes, used to represent inputs (outputs) consumed (produced) by activities. Also, more sophisticated variants of dataflow languages have been adopted in the field of scientific workflow, which is concerned by the automation of processes involving large datasets and complex computational steps in fields such as biology, astrophysics and environmental sciences. When modeling business processes for execution (e.g., by a workflow system), it is important to capture not only the flow of data into and out of tasks, but also data transformation, extraction and aggregation steps. At this level of detail, it may become cumbersome to use simple data flows. Instead, executable process modeling languages tend to rely on scoped variables. Variables may be defined for example at the level of the process model. Tasks may take as input several input and output parameters, and data mappings are used to connect the data available in the process model variables, and the input and output parameters. Data mappings may be inbound (i.e., expressions that extract data from variables in the task’s encompassing
scope and assign values to the task’s input parameter) or outbound (from the output parameters to the variables in the encompassing scope). This is the approach used for example in YAWL, Mentor and AdeptFlex [9]. It is also the approach adopted by many commercial workflow products and by WS-BPEL. One area of research in the area of workflow modeling is the analysis of their expressiveness [6]. Several classes of workflow notations have been identified from this perspective: structured workflow models, synchronizing workflow models, and standard workflow models. Structured workflow models are those that can be composed out of four basic control-flow constructs, one for choosing one among multiple blocks of activities, another for executing multiple blocks of activities in parallel, a third one for executing multiple blocks of activities in sequence, and finally a fourth one for repeating a block of activities multiple times. An activity by itself can be a block, and more complex blocks can be formed by composing smaller blocks using one of the four constructs. It has been shown that structured workflow models are well-behaved, meaning in particular that their execution can not result in deadlocks [5]. On the other hand, this class of models has low expressive power. It is not difficult to construct examples of unstructured workflow models that can not be translated to equivalent structured ones under reasonable notions of equivalence. Synchronizing workflow models are acyclic workflow models composed of activities that are connected by transitions. These transitions correspond to control links in the WS-BPEL terminology. They impose precedence relations and they can be labeled with conditions. An activity can be executed if all its predecessors have been either completed or skipped. Like structured workflow models, this class of models are well-behaved. However, they are not very expressive. Finally, standard workflow models correspond to the subset of BPMN composed of tasks, decision gateways, merge gateways (where multiple alternative branches converge), parallel split gateways and synchronization gateways. This class of models is more expressive than the former two, but on the other hand, some of these models may contain deadlocks. Other classes of workflow models are those corresponding to Workflow Nets, which are more expressive than standard workflow models as they can capture different types of choices. YAWL models are even more expressive due to the presence of the cancelation region
construct, and another construct (called the ‘‘OR-join’’) for synchronization of parallel branches, where some of these branches may never complete. A significant amount of literature in the area of workflow verification has dealt with studying properties of modeling languages incorporating a notion of OR-join [7].
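The class of structured workflow models discussed above lends itself to a compact representation. The sketch below is a minimal, hypothetical Python encoding of the four block constructors (sequence, choice, parallel, loop) with single activities as leaves; the helper function and the example model are illustrative assumptions, not part of the cited formalizations.

```python
from dataclasses import dataclass
from typing import List

# A structured workflow model is composed from four block constructors;
# an activity by itself is also a block.

@dataclass
class Activity:
    name: str

@dataclass
class Sequence:
    blocks: List["Block"]   # execute the sub-blocks one after the other

@dataclass
class Choice:
    blocks: List["Block"]   # execute exactly one of the sub-blocks

@dataclass
class Parallel:
    blocks: List["Block"]   # execute all sub-blocks concurrently, then join

@dataclass
class Loop:
    body: "Block"           # repeat the body block multiple times

Block = (Activity, Sequence, Choice, Parallel, Loop)

def activities(block):
    """Collect all activity names occurring in a structured model."""
    if isinstance(block, Activity):
        return [block.name]
    if isinstance(block, Loop):
        return activities(block.body)
    # Sequence, Choice and Parallel all carry a list of sub-blocks.
    return [name for sub in block.blocks for name in activities(sub)]

# Loosely based on the credit approval example, forced into structured form.
model = Sequence([
    Activity("Check completeness"),
    Loop(Activity("Request additional information")),
    Parallel([Activity("Check credit history"), Activity("Check income sources")]),
    Activity("Assess application"),
])
print(activities(model))
```

Because every block has a single entry and a single exit, models built this way cannot deadlock, which mirrors the well-behavedness result mentioned above.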
Key Applications
Business process modeling can be applied for documentation, business analysis, business process improvement and automation. In some cases, organizations undertake business process modeling projects for the sake of documenting key activities and using this documentation for employee induction and knowledge transfer. Business process models are sometimes required for compliance reasons, for example as a result of legislation in the style of the Sarbanes-Oxley act in the United States, which requires public companies to have internal control processes in place to detect and prevent fraud. A process model that captures how an organization currently works is called an ‘‘as is’’ process model. Such ‘‘as is’’ models are useful for understanding bottlenecks and generating ideas for improvement using process analysis techniques, including simulation and activity-based costing. The integration of improvement ideas into a business process leads to a ‘‘to be’’ process model. Some of these improvement ideas may include automating (parts of) the business process. At this stage, business process models need to be made more detailed. Once they have been made executable, business process models may be deployed in a business process execution engine (also called a workflow engine).
Future Directions
Standardization efforts in the field of business process modeling have led to the current co-existence of two standard process modeling languages: BPMN at the business analysis level and BPEL at the implementation and execution level. One question that arises from the co-existence of these two standards is whether or not models in BPMN should be synchronized with models defined in BPEL and how. The assumption behind this question is that business analysts will define models in BPMN, while developers are likely to work at the level of BPEL. To facilitate the collaboration between these stakeholders, it would be desirable to relate models defined in these two languages. An alternative would be for the two languages to converge into a single one.
Another related area where open problems exist is that of scientific workflow modeling. Researchers have argued for some time that workflow-related problems in scientific fields such as genetics and the ecological sciences do not have the same requirements as those found in the automation of business processes. These differences in requirements call for new paradigms for workflow modeling. While several proposals have been put forward [9], e.g., workflow modeling languages based on variants of dataflow modeling, consolidation still needs to occur and consensus in the area is yet to be reached.
Cross-references
▶ Activity Diagrams ▶ Business Process Execution Language ▶ Business Process Management ▶ Business Process Modeling Notation ▶ Business Process Reengineering ▶ Petri Nets ▶ Workflow Model Analysis ▶ Workflow Patterns
Recommended Reading
1. Davenport T.H. Process Innovation: Reengineering Work through Information Technology. Harvard Business School, Boston, 1992.
2. Ellis C.A. Information control nets: a mathematical model of office information flow. In Proc. Conf. on Simulation, Measurement and Modeling of Computer Systems, 1979, pp. 225–240.
3. Hammer M. Reengineering work: don't automate, obliterate. Harvard Bus. Rev., July/August 1990, pp. 104–112.
4. Jablonski S. and Bussler C. Workflow Management: Modeling Concepts, Architecture, and Implementation. Int. Thomson Computer, London, UK, 1996.
5. Kiepuszewski B., ter Hofstede A.H.M., and Bussler C. On structured workflow modelling. In Proc. 12th Int. Conf. on Advanced Information Systems Engineering, 2000, pp. 431–445.
6. Kiepuszewski B., ter Hofstede A.H.M., and van der Aalst W.M.P. Fundamentals of control flow in workflows. Acta Inform., 39(3):143–209, 2003.
7. Kindler E. On the semantics of EPCs: resolving the vicious circle. Data Knowl. Eng., 56(1):23–40, 2006.
8. Levitt T. Marketing myopia. Harvard Bus. Rev., July/August 1960, pp. 45–56.
9. Ludäscher B., Altintas I., Berkley C., Higgins D., Jaeger E., Jones M., Lee E.A., Tao J., and Zhao Y. Scientific workflow management and the Kepler system. Concurrency Comput.: Pract. Exp., 18(10):1039–1065, 2006.
10. Muth P., Wodtke D., Weissenfels J., Dittrich A., and Weikum G. From centralized workflow specification to distributed workflow execution. J. Intell. Inform. Syst., 10(2), 1998.
11. Reichert M. and Dadam P. ADEPTflex: supporting dynamic changes of workflows without losing control. J. Intell. Inform. Syst., 10(2):93–129, 1998.
12. van der Aalst W.M.P. and ter Hofstede A.H.M. YAWL: yet another workflow language. Inform. Syst., 30(4):245–275, 2004.
13. Zisman M.D. Representation, Specification and Automation of Office Procedures. PhD thesis, University of Pennsylvania, Wharton School of Business, 1977.
Workflow on Grid ▶ Grid and Workflows
Workflow Participant ▶ Actors/Agents/Roles
Workflow Patterns
W. M. P. VAN DER AALST
Eindhoven University of Technology, Eindhoven, The Netherlands
Definition
Differences in features supported by the various contemporary commercial workflow management systems point to different views on suitability and different levels of expressive power. One way to characterize these differences and to support modelers in designing nontrivial workflows is to use a patterns-based approach. Requirements for workflow languages can be indicated through workflow patterns, i.e., frequently recurring structures in processes that need support from some process-aware information system. The Workflow Patterns Initiative [6] aims at the systematic identification of workflow requirements in the form of patterns. The Workflow Patterns Initiative resulted in various collections of patterns. The initial twenty control-flow patterns have been extended with additional control-flow patterns and patterns for other perspectives such as the resource, data, and exception handling perspectives. All of these patterns have been used to evaluate different systems in a truly objective manner.
Key Points The Workflow Patterns Initiative [6] was established with the aim of delineating the fundamental requirements
that arise during business process modeling on a recurring basis and describe them in an imperative way. Based on an analysis of contemporary workflow products and modeling problems encountered in various workflow projects, a set of twenty patterns describing the control-flow perspective of workflow systems was identified [1]. Since their release, these patterns have been widely used by practitioners, vendors and academics alike in the selection, design and development of workflow systems. An example of one of the initial workflow patterns is the Discriminator pattern, which refers to the ability ‘‘to depict the convergence of two or more branches such that the first activation of an incoming branch results in the subsequent activity being triggered and subsequent activations of remaining incoming branches are ignored.’’ It can be seen as a special case of the more general N-out-of-M pattern, where N is equal to one. While more recent languages and systems support this pattern, many traditional workflow management systems have severe problems supporting it. As a result, the designer is forced to change the process, to work around the system, or to create ‘‘spaghetti diagrams.’’ Based on many practical applications, the original set of twenty control-flow patterns was extended in two directions. First of all, the set of control-flow patterns was extended and refined. There are now more than forty control-flow patterns, all formally described in terms of Petri nets. Second, patterns for other perspectives were added. The data perspective [5] deals with the passing of information, scoping of variables, etc., while the resource perspective [4] deals with resource-to-task allocation, delegation, etc. More than eighty patterns have been defined for both perspectives. The patterns for the exception handling perspective deal with the various causes of exceptions and the various actions that need to be taken as a result of exceptions occurring. The resulting sets of patterns have been used for: (i) improving the understanding of workflow management and clarifying the semantics of different constructs, (ii) analyzing the requirements of workflow projects, (iii) evaluating products, and (iv) the training of workflow designers.
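A minimal sketch of the Discriminator semantics quoted above, written in Python for illustration: the first completing branch triggers the subsequent activity, later completions are ignored, and the join resets once all branches have completed. The class and callback names are assumptions made for this example; real workflow engines implement the pattern in their own terms.

```python
class Discriminator:
    """1-out-of-M join: fire on the first incoming branch, ignore the rest."""

    def __init__(self, num_branches, on_fire):
        self.num_branches = num_branches
        self.on_fire = on_fire        # callback that triggers the subsequent activity
        self.completed = 0
        self.fired = False

    def branch_completed(self, branch_id):
        self.completed += 1
        if not self.fired:
            self.fired = True
            self.on_fire(branch_id)   # first activation triggers the next activity
        # subsequent activations of the remaining branches are ignored
        if self.completed == self.num_branches:
            # all branches done: reset so the construct can be reused in a loop
            self.completed = 0
            self.fired = False

join = Discriminator(3, on_fire=lambda b: print(f"triggered by branch {b}"))
join.branch_completed("credit check")    # triggers the subsequent activity
join.branch_completed("income check")    # ignored
join.branch_completed("fraud check")     # ignored, then the join resets
```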
Cross-references
▶ Workflow Constructs ▶ Workflow Management ▶ Workflow Schema
Recommended Reading
1. van der Aalst W.M.P., ter Hofstede A.H.M., Kiepuszewski B., and Barros A.P. Workflow patterns. Distrib. Parallel Databases, 14(1):5–51, 2003.
2. Casati F., Castano S., Fugini M.G., Mirbel I., and Pernici B. Using patterns to design rules in workflows. IEEE Trans. Softw. Eng., 26(8):760–785, 2000.
3. Hohpe G. and Woolf B. Enterprise Integration Patterns. Addison-Wesley, Reading, MA, 2003.
4. Russell N., van der Aalst W.M.P., ter Hofstede A.H.M., and Edmond D. Workflow resource patterns: identification, representation and tool support. In Proc. 17th Int. Conf. on Advanced Information Systems Eng., 2005, pp. 216–232.
5. Russell N., ter Hofstede A.H.M., Edmond D., and van der Aalst W.M.P. Workflow data patterns: identification, representation and tool support. In Proc. 24th Int. Conf. on Conceptual Modeling, 2005, pp. 353–368.
6. Workflow Patterns Home Page. http://www.workflowpatterns.com.
Workflow Scheduler ▶ Scheduler
Workflow Schema
NATHANIEL PALMER
Workflow Management Coalition, Hingham, MA, USA
Synonyms Workflow Meta-Model
Definition The meta structure of a business process defined in terms of a data model.
Key Points
In contemporary workflow systems, the Workflow Schema is represented by a standard process definition language such as XPDL (XML Process Definition Language), which is based on XML Schema. In non-standard environments, or for workflows defined within a specific application (not a BPMS or workflow management system), the Workflow Schema is built as a proprietary data model.
Cross-references
▶ Workflow Model ▶ Workflow Modeling ▶ XML Process Definition Language
Workflow Transactions
VLADIMIR ZADOROZHNY
University of Pittsburgh, Pittsburgh, PA, USA
Synonyms Transactional workflows; Transactional processes; Transactional business processes
Definition
Workflow Transaction is a unit of execution within a Workflow Management System (WFMS) that combines a process-centric approach to modeling, executing and monitoring a complex process in an enterprise with data-centric Transaction Management, which ensures the correctness and reliability of workflow applications in the presence of concurrency and failures. Workflow Transactions accommodate application-specific Extended Transaction Models (ETMs) that relax the basic ACID properties of classical database transaction models.
Key Points
Consider a Workflow Management System (WFMS) that models complex processes in an enterprise (e.g., processing an insurance claim) using a workflow structured as a set of business tasks that must be performed in a specific order. Some of the business tasks may be performed by humans, while others are implemented as processes that typically create, process and manage information using loosely coupled, distributed, autonomous and heterogeneous information systems. This process-centric approach to modeling a complex enterprise process can be contrasted with the data-centric approach taken in database Transaction Management (TM), which keeps track of data dependencies, supports concurrent data access, and provides crash recovery. The traditional TM approach is too strict for workflow applications, which require selective use of transactional properties to allow flexible inter-task collaboration within a workflow. For example, the classical ACID transaction model enforces task isolation, which makes task cooperation impossible. Another example is the large amount of task blocking and aborting caused by transaction synchronization, which is not acceptable for large-scale workflows. To address these issues, Workflow Transactions accommodate application-specific Extended Transaction Models (ETMs) that relax some of the ACID properties. There are several variations of the workflow transaction concept, depending on how Workflow and Transaction Management are merged. In general, workflows are more abstract than transactions, and transactions provide specific semantics to workflow models. Alternatively, there are approaches where transactions are considered at a higher level of abstraction, while workflows define the process structure for the transaction models. Workflows and transactions can also interact as peers at the same abstraction level within a loosely coupled process model. While workflow transactions commonly support business applications, workflow technology has also been adopted to support workflows for scientific exploration and discovery.
Cross-references
▶ Transaction Management ▶ Workflow Management
Recommended Reading
1. Georgakopoulos D., Hornick M., and Sheth A. An overview of workflow management: from process modeling to workflow automation infrastructure. Distrib. Parallel Databases, 3(2):119–153, 1995.
2. Grefen P. and Vonk J. A taxonomy of transactional workflow support. Int. J. Coop. Inf. Syst., 15(1):87–118, 2006.
Workflow/Process Instance Changes ▶ Workflow Evolution
World Wide Web Consortium ▶ W3C
WORM ▶ Storage Devices ▶ Storage Security ▶ Write Once Read Many
Wrapper ▶ Temporal Strata
Wrapper Adaptability ▶ Wrapper Stability
Wrapper Induction
MAX GOEBEL 1, MICHAL CERESNA 2
1 Vienna University of Technology, Vienna, Austria
2 Lixto Software GmbH, Vienna, Austria
Synonyms Wrapper Generation; Information Extraction
Definition
Wrapper induction (or query induction) is a subfield of wrapper generation, which itself belongs to the broader field of information extraction (IE). In IE, wrappers transform unstructured input into structured output formats, and a wrapper generation system describes the transformation rules involved in such transformations. Wrapper induction is a solution to wrapper generation where transformation rules are learned from examples and counterexamples (inductive learning). The induced wrapper is subsequently applied to unseen input documents to collect further label relations of interest. To ease the annotation of examples by the user, the learning framework is often implemented within a visual annotation environment, where the user selects and deselects elements visually. The term ‘‘wrapper induction’’ was first conceptualized by Nicholas Kushmerick in his influential PhD thesis in 1997 in the context of semi-structured Web documents [10], where the author also laid the theoretical foundations of wrapper induction in the PAC learning framework.
Historical Background IE is a very mature research field that has spun off the database community in the need to integrate a constantly increasing number of unstructured or semi-structured documents into structured database systems. The World Wide Web (WWW) as a means of an information sink has been accepted very quickly since its mainstream adoption in the early 1990s. Its low cost and wide spread has resulted in an enormous surge of data to be available on the Web. Particularly in the past decade, the number of documents of high
informational value on the Web has exploded. The lack of rigid structure in Web documents together with the Web’s vast value has initiated an independent research field for wrapper generation, which studies IE approaches particularly for semi-structured Web documents. First IE approaches in the Web domain have seen systems where wrappers were constructed manually [13,16], yet it soon showed that manual wrapper systems did not scale well and were inflexible to change in document structure. The urge for more principled, and automated wrapper generation systems sparked the beginning of a whole new research sub-field of wrapper generation. The beginnings of wrapper induction can be found in [10]. In this original work of Kushmerick, the author identified six wrapper classes and evaluated them in terms of learning theoretical concepts. The conclusion of his work was that quite expressive wrapper systems can be learned from surprisingly simple wrappers. His implementation prototype was called WIEN (1998), and it presents the first wrapper induction system. Shortly after, SoftMealy and STALKER have been published. The SoftMealy system [7] (1998) relaxed the finite state transducer (Mealy machine) for Web documents, which brought great performance improvements over the WIEN system. The STALKER system [9,14] introduced hierarchical wrappers generated from landmark automata. In [17] (1999), the state machine approach was augmented with observation probabilities to form a Hidden Markov model for header extraction from research papers. BWI [6] (boosted wrapper induction, 2000) extended wrapper induction from its initial application on Web documents back to free text documents. It uses a probabilistic model for tuple length classification together with a boosting algorithm for delimiter learning. In more recent work, [8] generate node predicates on the DOM tree representation of a HTML document (such as tagName, tagAttribute, #children, #siblings). These are constructed for each example node in the training set together with those nodes along the path towards the root from each such node – until a common ancestor is reached. Verification rules are generated from special verification predicates to be checked later. Finally, patterns with equal extraction results are combined. In [3], the authors also use the DOM tree representation of web documents to detect repetitive
occurrences of pre/post subtree constellations. They use boolean logic to combine all rules into a minimal disjunctive normal form over all training examples. Applying supervised learning to Web wrapper generation bears a crucial drawback: sufficiently large sets of documents with annotated examples are hard to obtain and require expertise in the making. Labor and time investments renders the labeling of examples costly, creating a bottleneck in IE from Web documents. As a result, part of the research focused on tools that allow non-expert users to create wrappers interactively via visual user interaction. Starting with a set of non-annotated documents, the algorithm tries to infer the correct wrapper by asking the user to annotate a minimal number of examples. Lixto [2], NoDoSe [1], Olera [4] are examples of interactive tools that allow users to build wrappers through a GUI. Finally, DEByE [12] (2002) is similar to STALKER and WIEN as it also used landmark automata for rule generation. Unlike all previously mentioned wrapper induction systems that generate extraction rules top-down (most general first), DEByE approaches the problem bottom-up, starting with the most specific extraction rule and generating from there.
Foundations From Free Text to Web Documents
Wrapper generation systems nowadays focus on the Web domain, where documents are of a semistructured format. Originally, wrapper generation systems were developed for free text documents. Free text exhibits no implicit structure at all, and extraction requires natural language processing techniques such as Part-of-Speech tagging. With the advent of the Web as a content platform, the importance of free text has dwindled in IE with respect to the importance of Web documents. Web documents are an intermediate
format, lying between free text (unstructured) and structured text (e.g., XML). The use of HTML tags allows to apply relations between individual text segments in the sense of specifying structural information (see Fig.1). One genuine problem for Web IE is that the lack of strict rules on these relations yields room for ambiguities of a tag’s semantic meaning in different contexts. Web wrapper induction is mainly concerned with the exploitation of tag tree (DOM) structures. Wrapper Induction
Kushmerick defined the wrapper induction problem as a simple model of information extraction. The input collection is referred to as a set of documents S. An extraction rule (also called a query) retrieves a subset of a document according to some user-defined filter arguments. The result of a query is a set of information fields from S. Information fields are made up of vectors of unique attributes, which are referred to as tuples. The concept of attributes is borrowed from relational databases, where attributes correspond to table columns in the relational model. In some systems the order of the attributes is relevant; others allow permutations and missing attributes. The collection of all tuples describes the content of S. Given this definition, a wrapper can be formalized as the function mapping a page to its content:
page → wrapper → content
Different wrappers define different mappings. In turn, this means that the outputs of two different wrappers are two different subsets of all tuples defined in the page's content.
Patterns and Output Structure
Wrapping semi-structured resources is equivalent to extracting a hierarchy of elements from a document and returning as result a structured, relational
Wrapper Induction. Figure 1. Relational data in free text (left) and formatted in semi-structured HTML (right).
data model that reflects the extracted hierarchy. The hierarchy consists of patterns which can be seen as elementary wrapping operators. Each pattern in the hierarchy defines an extraction of one atomic tuple class from a given parent pattern. In the simplest case, data is structured in tabular form. Each tuple can be extracted separately and the attributes of each tuple are arranged in linear chains. A more difficult pattern structure are nested patterns where a parent pattern branches into multiple different child patterns (e.g., similar to a tree). In Fig.2, this is illustrated by extracting the pattern author from the parent pattern description. Kushmerick’s Wrapper Classes
In his PhD thesis, Kushmerick analyzed delimiter-based wrappers for IE from strings. He devised six wrapper classes of varying complexity that can deal with the extraction of different pattern scenarios by identifying start and stop delimiters. In his formulation, Kushmerick defined the wrapper induction problem as a Constraint Satisfaction Problem (CSP) with all possible combinations of left and right delimiters as input. The constraints are formulated on the forbidden combinations of the input, which in turn vary with the wrapper class. What is interesting is that despite their simplicity, these wrapper classes yield decent results on a large benchmark set. Wrapper Class LR LR (Left–Right) is the simplest wrapper class on which all subsequent classes are built
Wrapper Induction. Figure 2. A wrapper represented by a hierarchy of patterns. Each pattern is defined by a set of rules describing how to locate its instances from its parent pattern.
upon. LR wrappers identify one delimiter token on each side of the tuple to be extracted, where L denotes the start token, and R denotes the end token. By taking all possible prefix and postfix strings in the document string for a given example tuple, the learning algorithm first identifies the candidate sets for the left and right delimiters. With L and R being mutually independent, a principled search can be performed via checking the validity of each candidate delimiter for the complete training set. This can also be expressed as constraints on the search space containing of all permutations of L and R delimiters. Wrapper Class HLRT A more expressive wrapper class is Head-Left–Right-Tail, which is a direct extension from LR. It includes a page’s head H and tail T for cases where similar delimiters are used for relevant and irrelevant information fields on a page. Delimiter H determines where to start looking for the LR pattern (the end of the head section), and T respectively determines where the search should end. Together, H and T define the search body of the page. The learning algorithm follows directly from the LR algorithm with the new constraints included. H and T must be learned jointly on l1. Wrapper Class OCLR Open–Close-Left–Right is an alternative to the HLRC wrapper class, where the head and tail delimiters are replaced with open and close delimiters. The difference here is that, contrary to the HLRT class where the complete search body for the wrapper is specified by the H and T delimiters, the O and C delimiters merely specify the beginning and end of a tuple in a page. It thus extends LR to a multi-slot extractor. O and C must be learned jointly on l1. Wrapper Class HOCLRT Head-Open–Close-Left– Right-Tail is the combination of the HLRT and the OCLR wrapper classes. It is the most powerful of the four basic wrapper classes, yet all four additional delimiters H, T, O and C must be learned jointly on l1. Wrapper Class N-LR The previous four wrapper classes work well on tabular (relational) data sets. As it is common to have hierarchical nestings in relational data, Kushmerick devised two extensions to the previous wrapper classes that can deal with nested data. Nested data are tree-like data structures where the attributes are nested hierarchically. The first nested wrapper class, N-LR, extends the simple LR class. It uses the relative position among attributes to determine which attribute to extract next, given a current attribute. Here, all attributes must be learned jointly.
Wrapper Class N-HLRT Finally, wrapper class N-HLRT combines the expressiveness of nested attributes with the head tail constraints from the HLRT class. It is exactly N-LR with the additional delimiters H and T to be considered in the hypothesis space of all candidate wrappers. While in practice, Kushmerick showed these wrapper classes to be able to learn the correct wrapper from relatively few examples, the theoretical PAC bounds for the same wrappers are very loose [11]. Other Wrapper classes
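As a rough illustration of the LR class described above, the sketch below induces one left and one right delimiter for a single attribute by validating candidates against all labeled examples, and then applies the delimiters to an unseen page. It is a simplified, hypothetical Python rendering that ignores the CSP formulation and the HLRT/OCLR/HOCLRT refinements; the function names and sample pages are invented.

```python
def candidate_delimiters(page, value, max_len=8):
    """Candidate left/right delimiters: suffixes of the text before the value
    and prefixes of the text after it."""
    i = page.index(value)
    before, after = page[:i], page[i + len(value):]
    lefts = [before[-k:] for k in range(1, min(max_len, len(before)) + 1)]
    rights = [after[:k] for k in range(1, min(max_len, len(after)) + 1)]
    return lefts, rights

def valid(delim, examples, side):
    """A delimiter is valid if it bounds the labeled value on every training page."""
    for page, value in examples:
        i = page.index(value)
        if side == "left" and not page[:i].endswith(delim):
            return False
        if side == "right" and not page[i + len(value):].startswith(delim):
            return False
    return True

def induce_lr(examples):
    """Pick the longest delimiters consistent with all examples (one attribute)."""
    lefts, rights = candidate_delimiters(*examples[0])
    left = max((d for d in lefts if valid(d, examples, "left")), key=len)
    right = max((d for d in rights if valid(d, examples, "right")), key=len)
    return left, right

def execute_lr(page, left, right):
    """Extract every occurrence delimited by (left, right) from an unseen page."""
    values, pos = [], 0
    while (start := page.find(left, pos)) != -1:
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        values.append(page[start:end])
        pos = end + len(right)
    return values

examples = [("<b>Rome</b> <i>+39</i>", "Rome"), ("<b>Oslo</b> <i>+47</i>", "Oslo")]
left, right = induce_lr(examples)
print(execute_lr("<b>Paris</b> <i>+33</i> <b>Bonn</b> <i>+49</i>", left, right))
```

On these toy pages the longest consistent delimiters happen to be '&lt;b&gt;' and '&lt;/b&gt; &lt;i&gt;', which is enough to extract the city names ('Paris', 'Bonn') from the unseen page.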
Apart from delimiter-based wrapper classes, there also exist other wrapper classes that can prove highly effective, but which depend on the domain of the input documents. For Web wrapper induction, interesting classes include: 1. Edit distance similarity. The edit distance (Levensthein Distance) exists both for string and tree structures. It measures the difference between two strings or trees as the number of basic edit operations that need to be performed to turn one argument into the other. Due to the tree structure of the DOM (Document Object Model) of Web pages, this method lends itself very well to identifying tree patterns in Web documents. 2. Rule-based systems. Prolog-like rules are induced from examples. Typically, systems generate a most-general set of rules from the given examples and use covering algorithms to reduce the rules to a minimum set covering all examples. The advantage of rule-based wrappers is the relative ease to interpret rules by a user. Active Learning
The key bottleneck of wrapper induction is the need for labeled training data to be available. An effective solution to reduce the number of annotations in a training corpus is offered by active learning. Active learning tries to identify examples that are not labeled but which are informative to the learning process, and the user is asked to label only such informative examples. Active learning has been applied to the Stalker wrapper generation system in [15], using co-testing. In co-testing, redundant views of a domain are implemented through mutually-exclusive feature sets per view. Views can be exploited by learning a separate classifier per view, as the ensemble of all classifiers
W
3563
reduces the hypothesis space for the wrapper. ***[5] provides a comprehensive overview of different active learning strategies, including common machine learning techniques such as bagging (multiple partitioning of training data) and boosting (ensemble learning).
Key Applications Several of the presented wrapper induction systems are known for being successfully commercialized. The Lixto system [2] has emerged into an equally named spin-off company of the Technical University Vienna, Austria (http://www.lixto.com). Kushmerick’s work [10] is being integrated into products of a US-based company called QL2 (http://www.ql2.com). Similarly, technologies created by the authors of the STALKER system [14] are used in products of another US company called Fetch Technologies (http://www.fetch. com). Web Intelligence solutions offered by these companies have customers for example in the travel, retail or consumer goods industries. Typical services built on top of the technology are on-demand price monitoring, revenue management, product mix tracking, Web process integration or building of Web mashups. Wrapper Induction Algorithms
WIEN is the first system for wrapper induction from semi-structured documents. The system learns extraction rules in two stages: In a fist step, simple LR rules are generated from the training data, that specify left and right delimiters for each example. The second step refines the LR rules into example conforming HLRT rules, where additional Head and Tail delimiters are added if necessary. The WIEN algorithm assumes attributes to be equally ordered among all tuples, and it cannot deal with missing attributes. WIEN supports both single-slot and multi-slot patterns. Softmealy is based upon a finite-state transducer and it utilizes contextual rules as extraction rules. Before being input to the transducer, a document is tokenized and special separators are added between two consecutive tokens. The transducer regards its input as a string of separators, where state transitions are governed by the left and right context tokens surrounding the current input separator. SoftMealy machines have two learnable components: the graph structure of the state transducer and the contextual rules governing the state transitions. The graph structure should be kept simple and small, which is achieved
W
3564
W
Wrapper Induction
via a conservative extension policy. Contextual rules are learned by a set-covering algorithm that finds the minimal number of contextual rules covering all examples. Stalker generates so-called landmark automata (SkipTo functions) from a set of training data. Nesting patterns allows the system to decompose complex extraction rules into a hierarchy of more simple extraction rules. Stalker employs a covering algorithm to reduce its initial set of candidate rules to a best-fit set of rules. Co-testing learns both forward and backward rules on the document. In case of agreement the system knows that the selected rule is correct, otherwise the user is given the conflicting example for labeling in order to resolve the conflict (active learning). In boosted wrapper induction (BWI), delimiterbased extraction rules are generated from simple contextual patterns. Boundaries that uniquely describe relevant tuples are identified and assigned fore and aft scores. Learning in the BWI approach constitutes of determining fore detectors F, aft detectors A, and the pattern length function. The latter is a maximum likelihood estimator on the word occurrences in the training set. F and A are learned using a boosting algorithm, where several weak classifiers are boosted into one strong classifier. BWI works both in natural language and wrapper domains. NoDoSe is an interactive tool that allows the user to hierarchically decompose a complex wrapper pattern into simple sub-wrappers. The system works both in the free text and the Web domain with the help of two distinct mining components. Learning rules consist of start and stop delimiters, either on text level (first component) or on HTML structure level (second component). Parsing generates a concept tree of rules that support multi-slot patterns and attribute permutation. DEByE is another interactive induction system that applies bottom-up extraction rules. The authors define so-called attribute-value pair (AVP) patterns, and use window passages surrounding an example tuple to build lists of AVP patterns. Unlike other wrapper induction systems that typically work top-down, their approach allows for more flexibility as the order of individual patterns here is no longer relevant. They claim that, unlike other wrapper induction systems that typically work top-down, identifying atomic attribute objects prior to identifying the nested object itself adds flexibility to a system as partial objects are detected as well as nested objects with differing
attribute orders. On the downside, this approach adds quite some complexity to the wrapper induction process. Future Directions
Several trends can be identified in current wrapper induction research. Shift to automation. The shifting of (semi-)supervised IE applications from free text to Web documents emphasized the cost problem of obtaining annotated examples from such sources. Unsupervised, fully automated methods using methods such as alignment, entity recognition, entity resolution or ontologies are motivated to overcome the so-called labelling bottleneck. Similarly, in semi-supervised systems, effort further focused to minimize the number of required examples (e.g., through active learning, see above). Adoption of new technologies. Over time the complexity of Web sites has substantially increased. Technologies such as Web forms, AJAX or Flash are today used to build the so-called Deep Web. Many documents that are interesting for wrapper induction are therefore hidden in the Deep Web. Because usual crawling algorithm do not work well, new techniques such as focused crawling, Deep Web navigations, form mapping or metasearch receive increased research attention. Shift to semantic annotations. The results of state-ofthe-art wrapper induction systems support high confidence in success for many classic IE scenarios. With the quick and thourough adoption of the Web as a media and data communication protocol there comes an increasing demand for semantic services that allow users to enrich and interlink existing information with ontological meta-annotations. The envisioned semantic Web that would offer this functionality quickly proved to be a rather slow development process with no end in sight. Web IE systems on the other hand have started to add support for semantic labeling and even automatic detection of such relational information from labeled and unlabeled sources.
Data Sets The most known data set used in information extraction is called RISE (http://www.isi.edu/info-agents/ RISE). RISE is a collection of resources both from the Web and the natural text domains. Among others, it contains documents related to seminar announcements, Okra search, IAF address finder. Large fraction of the pages in the corpus originates before the year
Wrapper Maintenance
2000. Because of their simplicity are those pages considered out-dated with respect to the current structure and layout of modern Web pages.
Cross-references
▶ Data Clustering (VII.b) ▶ Data Mining ▶ Decision Tree Classification ▶ Information Retrieval ▶ Semi-Structured Text Retrieval ▶ Text Mining
Recommended Reading
1. Adelberg B. NoDoSE: a tool for semi-automatically extracting structured and semistructured data from text documents. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998, pp. 283–294.
2. Baumgartner R., Flesca S., and Gottlob G. Visual Web information extraction with Lixto. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 119–128.
3. Carme J., Ceresna M., and Goebel M. Web wrapper specification using compound filter learning. In Proc. IADIS Int. Conf. WWW/Internet 2006, 2006.
4. Chang C.H. and Kuo S.C. OLERA: semisupervised web-data extraction with visual support. IEEE Intell. Syst., 19(6):56–64, 2004.
5. Finn A. and Kushmerick N. Active learning selection strategies for information extraction. In Proc. Workshop on Adaptive Text Extraction and Mining, 2003.
6. Freitag D. and Kushmerick N. Boosted wrapper induction. In Proc. 12th National Conf. on AI, 2000, pp. 577–583.
7. Hsu C.N. and Dung M.T. Generating finite-state transducers for semi-structured data extraction from the Web. Inf. Syst., 23(8):521–538, 1998.
8. Irmak U. and Suel T. Interactive wrapper generation with minimal user effort. In Proc. 15th Int. World Wide Web Conf., 2006, pp. 553–563.
9. Knoblock C.A., Lerman K., Minton S., and Muslea I. Accurately and reliably extracting data from the Web: a machine learning approach. Q. Bull. IEEE TC on Data Eng., 23(4):33–41, 2000.
10. Kushmerick N. Wrapper Induction for Information Extraction. Ph.D. thesis, University of Washington, 1997.
11. Kushmerick N. Wrapper induction: efficiency and expressiveness. Artif. Intell., 118(1–2):15–68, 2000.
12. Laender A.H.F., Ribeiro-Neto B., and da Silva A.S. DEByE – data extraction by example. Data Knowl. Eng., 40(2):121–154, 2002.
13. Liu L., Pu C., and Han W. XWRAP: an XML-enabled wrapper construction system for Web information sources. In Proc. 16th Int. Conf. on Data Engineering, 2000, pp. 611–621.
14. Muslea I., Minton S., and Knoblock C. STALKER: learning extraction rules for semistructured, Web-based information sources. 1998. URL: citeseer.ist.psu.edu/muslea98stalker.html.
15. Muslea I., Minton S., and Knoblock C.A. Selective sampling with redundant views. In Proc. 12th National Conf. on AI, 2000, pp. 621–626.
16. Sahuguet A. and Azavant F. WysiWyg web wrapper factory (W4F). 2001. URL: http://citeseer.ist.psu.edu/553711.html; http://www.ai.mit.edu/people/jimmylin/papers/Sahuguet99.ps.
17. Seymore K., McCallum A., and Rosenfeld R. Learning hidden Markov model structure for information extraction. In Proc. AAAI 99 Workshop on Machine Learning for Information Extraction, 1999.

Wrapper Generation ▶ Wrapper Induction
Wrapper Generator ▶ Web Data Extraction System
Wrapper Generator GUIs ▶ GUIs for Web Data Extraction
Wrapper Maintenance
KRISTINA LERMAN, CRAIG A. KNOBLOCK
University of Southern California, Marina del Rey, CA, USA
Synonyms Wrapper verification and reinduction; Wrapper repair
Definition A Web wrapper is a software application that extracts information from a semi-structured source and converts it to a structured format. While semi-structured sources, such as Web pages, contain no explicitly specified schema, they do have an implicit grammar that can be used to identify relevant information in the document. A wrapper learning system analyzes page layout to generate either grammar-based or ‘‘landmark’’-based extraction rules that wrappers use to extract data. As a consequence, even slight changes in the page layout can break the wrapper and prevent it from extracting data correctly. Wrapper maintenance is a composite task that (i) verifies that the wrapper continues to extract data
W
correctly from a source, and (ii) repairs the wrapper so that it works on the changed pages.
Historical Background Wrapper induction algorithms [6,3,11] exploit regularities in the page layout to find a set of extraction rules that will accurately extract data from a source. As a consequence, even minor changes in page layout break wrappers, leading them to extract incorrect data, or fail to extract data from the page altogether. Researchers addressed this problem in two stages. Wrapper verification systems [5,9] monitor wrapper output to confirm that it correctly extracts data. Assuming that the structure and the format of data does not change significantly, they compare new wrapper output to that produced by the wrapper when it was successfully invoked in the past and evaluate whether the two data sets are similar. Wrapper verification systems learn data descriptions, also called content features, and evaluate wrapper output similarity based on these features. Two different representations of content features have been studied in the past. Kushmerick [4,5] and Chidlovskii [1] represent wrapper output by global numeric features, e.g., word count, average word length, HTML tag density, etc. Lerman et al. [8,9] instead learn syntactical patterns associated with data, e.g., they learn that addresses start with a number and are followed by a capitalized word. Both systems ascertain that the distribution of features over the new output is similar to that over data extracted by the wrapper in the past. If statistically significant changes are found, the verification system notifies the operator that the wrapper is no longer functioning properly. If the wrapper is not working correctly, the next task is to automatically repair it by learning new extraction rules. This task is handled by a wrapper reinduction system. Since the wrapper learning algorithm simply needs a set of labeled examples of data on new pages in order to learn extraction rules, researchers have addressed the reinduction problem by automatically labeling data instances on new pages. The problem is more complex than simply looking for instances of old wrapper output on new pages because data can change in value (e.g., weather, stock quotes) or format (abbreviations added, etc.). Instead, to identify the new data instances, reinduction systems combine learned content features, heuristics based on page structure, and similarity with the old wrapper output. Lerman et al. [9] use syntactical patterns learned from
old wrapper output to identify possible data instances on new pages. They then exploit regularities in the structure of data presented on the page to narrow candidates to ones likely to be the correct instances of a field. For example, a Web source listing restaurant information displays city, state and zipcode in the same order in all records. Likewise, instances of the same field are, in most cases, surrounded by the same HTML markup on all pages. Chidlobskii [1] take a similar approach, selecting data instance candidates based on learned numeric features, and exploiting page grammar and data structure to identify correct examples. Meng et al. [10] and Raposo et al. [12], on the other hand, rely on the document-object-model (DOM) tree to help identify correct examples of data on new pages.
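The verification approach based on global numeric features, as described above, can be sketched as follows. This is a minimal, hypothetical Python illustration: the chosen features (mean length, digit density, alphabetic density) and the drift threshold are assumptions made for the example and are not the statistics used in the cited systems.

```python
def numeric_features(values):
    """Global content features of a set of extracted values:
    mean length, mean digit density and mean alphabetic density."""
    n = max(len(values), 1)
    mean_len = sum(len(v) for v in values) / n
    digits = sum(sum(c.isdigit() for c in v) / max(len(v), 1) for v in values) / n
    alpha = sum(sum(c.isalpha() for c in v) / max(len(v), 1) for v in values) / n
    return {"count": mean_len, "digits": digits, "alpha": alpha}

def verify(training_values, new_values, tolerance=0.3):
    """Crude verification: flag the wrapper if any feature drifts by more than
    `tolerance` (relative) between old and new output."""
    old, new = numeric_features(training_values), numeric_features(new_values)
    for name in old:
        base = old[name] if old[name] else 1e-9
        if abs(new[name] - old[name]) / base > tolerance:
            return False   # output looks different: the wrapper may be broken
    return True

old_zipcodes = ["90210", "90292", "10001"]
print(verify(old_zipcodes, ["90048", "94305"]))            # True: still zipcode-like
print(verify(old_zipcodes, ["Marina del Rey", "Boston"]))  # False: layout change suspected
```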
Foundations Figure 1 shows the life cycle of a wrapper. During its normal operation the wrapper is extracting data from the source, and learning content features that describe the data. The wrapper verification system monitors wrapper output to confirm that it is working correctly. Once it determines that a wrapper has stopped working, it uses the learned content features and page and data structure heuristics to identify data on the new pages. The labeled pages are then submitted to the wrapper induction system which learns new extraction rules. Content-based features play a central role in both verification and reinduction. Data returned by online sources usually has some structure: phone numbers, prices, dates, street addresses, etc. all follow some format. Unfortunately, character-level regular expressions are too fine-grained to capture this structure. Instead,
Wrapper Maintenance. Figure 1. Life cycle of a wrapper, illustrating verification and reinduction stages.
researchers represent the structure of data as a sequence of tokens, called a pattern. Tokens are strings generated from an alphabet containing different character types: alphabetic, numeric, punctuation, etc. The token's character types determine its assignment to one or more syntactic categories: alphabetic, numeric, etc. The categories form a hierarchy, which allows for multi-level generalization. Thus, a set of zipcodes "90210" and "90292" can be represented by specific tokens "90210" and "90292," as well as more general types 5Digit, Number, Alphanum. An efficient algorithm to learn patterns describing a field from examples was described in [9]. An alternate approach represents the content of data by global numeric features, such as character count, average word length, and density of types, i.e., proportion of characters in the training examples that are of an HTML, alphabetic, or numeric type. This approach would learn the following features for zipcodes: Count = 5.0, Digits = 1.0, Alpha = 0.0, meaning that zipcodes are, on average, five characters long, containing only numeric characters.
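To make the two representations concrete, the following sketch (hypothetical Python code, not taken from any of the cited systems; the token categories and feature names are illustrative) derives both a token-level pattern and global numeric features from a handful of example values.

import re
from statistics import mean

TOKEN = re.compile(r"\d+|[A-Za-z]+|\S")

def categorize(token):
    # Map a token to a coarse syntactic category; the specific token
    # itself could be kept for a more specific pattern level.
    if token.isdigit():
        return "%dDigit" % len(token)          # e.g., 5Digit
    if token.isalpha():
        return "Capitalized" if token[0].isupper() else "Lower"
    return "Punct"

def learn_pattern(examples):
    # Generalize the examples to one token-category sequence, keeping
    # only positions on which all examples agree.
    sequences = [[categorize(t) for t in TOKEN.findall(e)] for e in examples]
    if len({len(s) for s in sequences}) != 1:
        return None                            # no common fixed-length pattern
    return [cats[0] if len(set(cats)) == 1 else "Any"
            for cats in zip(*sequences)]

def numeric_features(examples):
    # Global features in the spirit of character count / digit density.
    return {
        "mean_length": mean(len(e) for e in examples),
        "digit_density": mean(sum(c.isdigit() for c in e) / len(e)
                              for e in examples),
        "alpha_density": mean(sum(c.isalpha() for c in e) / len(e)
                              for e in examples),
    }

zipcodes = ["90210", "90292"]
print(learn_pattern(zipcodes))       # ['5Digit']
print(numeric_features(zipcodes))    # mean_length 5.0, digit_density 1.0, alpha_density 0.0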
Wrapper Verification
Wrapper verification methods attempt to confirm that wrapper output is statistically similar to the output produced by the same wrapper when it was successfully invoked in the past. Henceforth, successful wrapper outputs are called training examples. A wrapper verification system learns content-based features and their distribution over the training examples. In order to check that the learned features have a similar distribution over the new wrapper output, one approach [5] calculates the probability of generating the new values randomly for each field. Individual field probabilities are then combined to produce an overall probability that the wrapper has extracted data correctly. Another approach to verification [9] uses training examples to learn data patterns for each field. The system then checks that this description still applies to the new data extracted by the wrapper.
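A minimal sketch of this idea follows (hypothetical code; the published systems apply proper statistical tests over several features rather than a single length feature and a fixed threshold): the mean and spread of a feature are learned from the training examples, and the wrapper is flagged if the new output deviates by more than a few standard deviations.

from statistics import mean, stdev

def field_lengths(values):
    return [len(v) for v in values]

def verify_field(training_values, new_values, max_sigma=3.0):
    # Compare the average value length of the new output to the
    # distribution observed on past (training) output.
    train = field_lengths(training_values)
    mu, sigma = mean(train), stdev(train) or 1.0
    new_mu = mean(field_lengths(new_values))
    return abs(new_mu - mu) <= max_sigma * sigma

old = ["90210", "90292", "10036", "94305"]
new_ok = ["60601", "73301"]
new_bad = ["Page not found", "Page not found"]
print(verify_field(old, new_ok))     # True  -> wrapper looks fine
print(verify_field(old, new_bad))    # False -> notify the operator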
Wrapper Reinduction
Content-based features are used by the wrapper reinduction system to automatically label the new pages. It first scans each page to identify all text segments that match the learned features. The candidate text segments that contain significantly more or fewer tokens than expected are eliminated from consideration. The learned features are often too general and will match
many, possibly hundreds, of text segments on each page. Among these spurious text segments are the correct examples of the data field. Researchers use additional heuristics to help narrow the set to correct examples of the field: true examples are expected to appear in the same position (e.g., after a tag) and in the same context (e.g., restaurant address precedes city name) on each page. In addition, they are expected to be visible to the user (not part of an HTML tag) and appear within the same block of the HTML page (or the same DOM sub-tree). This information is used to split candidate extracts into groups. The next step is to score groups based on their similarity to the training examples. This technique generally works well, because at least some of the data usually remains the same when the page layout changes. Of course, this assumption does not apply to data that changes frequently, such as weather information, flight arrival times, stock quotes, etc. However, previous research has found that even in these sources, there is enough overlap in the data that the approach works. The final step of the wrapper reinduction process is to provide the extracts in the top-scored group to the wrapper induction algorithm, along with the new pages, to learn new data extraction rules. For some types of Web sources, automatic wrapper generation is a viable alternative to wrapper reinduction. One such system, developed by Crescenzi et al. [2], learns grammar-based extraction rules without requiring the user to label any pages. Once the wrapper verification system detects that the wrapper is not working correctly, it collects new sample pages from which it learns new extraction rules. Still another approach to data extraction uses a combination of content-feature learning and page analysis to automatically extract data from Web pages [7].
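The grouping-and-scoring step described earlier in this section can be sketched as follows (hypothetical code; the context is reduced here to a single string, whereas real systems combine position, surrounding markup, and DOM-block membership): candidates are grouped by their context, and each group is scored by its overlap with the old wrapper output.

from collections import defaultdict

def regroup_and_score(candidates, old_values):
    # candidates: list of (context, value) pairs found on the new pages,
    # where context could be, e.g., the enclosing tag path.
    groups = defaultdict(list)
    for context, value in candidates:
        groups[context].append(value)
    # Score each group by how many of its values also occurred in the old
    # wrapper output; real systems break ties with further heuristics.
    old = set(old_values)
    scored = {ctx: sum(v in old for v in vals) / len(vals)
              for ctx, vals in groups.items()}
    best = max(scored, key=scored.get)
    return best, groups[best]

candidates = [("/html/body/div/span", "Casablanca Moroccan"),
              ("/html/body/div/span", "Felix's Fish Camp"),
              ("/html/body/div/b",    "Open 7 days a week")]
old_output = ["Casablanca Moroccan", "Killer Shrimp"]
print(regroup_and_score(candidates, old_output))
# ('/html/body/div/span', ['Casablanca Moroccan', "Felix's Fish Camp"])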
Key Applications Automatic wrapper maintenance can be incorporated into a wrapper learning system to reduce the amount of user effort involved in monitoring wrappers for validity and repairing them.
Experimental Results To evaluate the performance of wrapper maintenance systems, researchers collected data over time, saving Web pages along with data extracted from them by wrappers. In one set of experiments researchers [9] monitored 27 wrappers (representing 23 distinct data
sources) by storing results of 15–30 queries periodically over 10 months. The wrapper verification system learned content descriptions for each field and made a decision about whether the new output was statistically similar to the training set. A manual check of the 438 comparisons revealed 37 wrapper changes attributable to changes in the page layout. The verification algorithm correctly discovered 35 of these changes and made 15 mistakes. Of these mistakes, 13 were false positives, which means that the verification program decided that the wrapper failed when in reality it was working correctly. Only two of the errors were the more important false negatives, meaning that the algorithm did not detect a change in the data source. Representing the structure of content by global numeric features instead would have missed 17 wrapper changes. Researchers also evaluated the reinduction algorithm only on the ten sources that returned a single tuple of results per page. They used the algorithm to extract 35 fields from ten Web sources for which correct output is known, regardless of whether the source had actually changed or not. The output of the reinduction algorithm is a list of tuples extracted from ten pages, as well as extraction rules generated by the wrapper induction system for these pages. Though in most cases they were not able to extract every field on every page, they still learned good extraction rules as long as a few examples of each field were correctly labeled. The algorithm was evaluated in two stages: first, researchers checked how many fields were successfully identified; second, they checked the quality of the learned extraction rules by using them to extract data from test pages. The algorithm was able to correctly identify fields 277 times across all data sets, making 61 mistakes, of which 31 were false positives and 30 were false negatives. On the extraction task, the automatically learned extraction rules achieved 90% precision and 80% recall. The results above apply to sources that return a single item per page. Other researchers have applied reinduction methods to repair wrappers for sources that return lists of items per page [1,13] with similar performance results.
Data Sets The data set used in the wrapper maintenance experiments discussed above is available from the University
of Southern California Information Sciences Institute: http://www.isi.edu/integration/datasets/reinduction/wrapper-verification-data2.zip. The data contains results of invocations of wrappers for different Web sources over a period of 10 months. Each wrapper was executed with the same small set of queries (if they contained no time-dependent parameters). The data set contains both the returned Web pages and data extracted by the wrapper from the source. Over the data collection period, several sources changed in a way that prevented wrappers from correctly extracting data. The wrappers were repaired manually in these cases and data collection continued. The performance of the USC's verification and reinduction systems on this dataset is reported in [9].
Cross-references
▶ Web Data Extraction System ▶ Web Information Extraction ▶ Wrapper Induction
Recommended Reading 1. Chidlovskii B. Automatic repairing of web wrappers by combining redundant views. In Proc. 14th IEEE Int. Conf. Tools with Artificial Intelligence, 2002, pp. 399–406. 2. Crescenzi V. and Mecca G. Automatic information extraction from large websites. J. ACM, 51(5):731–779, 2004. 3. Hsu C.-N. and Dung M.-T. Generating finite-state transducers for semi-structured data extraction from the web. J. Inform. Syst., 23:521–538, 1998. 4. Kushmerick N. Regression testing for wrapper maintenance. In Proc. 14th National Conf. on AI, 1999, pp. 74–79. 5. Kushmerick N. Wrapper verification. World Wide Web J., 3(2):79–94, 2000. 6. Kushmerick N., Weld D.S., and Robert B. Doorenbos. Wrapper induction for information extraction. In Proc. 15th Int. Joint Conf. on AI, 1997, pp. 729–737. 7. Lerman K., Gazen C., Minton S., and Knoblock C.A. Populating the semantic web. In Proc. AAAI Workshop on Advances in Text Extraction and Mining, 2004. 8. Lerman K. and Minton S. Learning the common structure of data. In Proc. 12th National Conf. on AI, 2000, pp. 609–614. 9. Lerman K., Minton S., and Knoblock C. Wrapper maintenance: a machine learning approach. J. Artif. Intell. Res., 18:149–181, 2003. 10. Meng X., Hu D., and Li C. Schema-guided wrapper maintenance. In Proc. of 2003 Conf. on Web Information and Data Management, 2003, pp. 1–8. 11. Muslea I., Minton S., and Knoblock C.A. Hierarchical wrapper induction for semistructured information sources. Auton. Agent Multi Agent Syst., 4:93–114, 2001.
12. Raposo J., Pan A., Álvarez M., and Hidalgo J. Automatically generating labeled examples for web wrapper maintenance. In Proc. of 2005 IEEE/WIC/ACM Int. Conf. on Web Intelligence, 2005, pp. 250–256. 13. Raposo J., Pan A., Álvarez M., and Hidalgo J. Automatically maintaining wrappers for semi-structured web sources. Data Knowl. Eng., 61(2):331–358, 2007.
Wrapper Repair ▶ Wrapper Maintenance
Wrapper Robustness ▶ Wrapper Stability
Wrapper Stability
Georg Gottlob
Oxford University, Oxford, UK
Synonyms Wrapper robustness; Wrapper adaptability
Definition A wrapper, whether hand-written or generated by a Web data extraction system, is a program that extracts data from information sources of changing content and translates the data into a different format. The stability of a wrapper is the degree of insensitivity to changes of the presentation (i.e., formatting, layout, or syntax) of the data sources. A stable wrapper is ideally able to extract the desired data from source documents even if the layout of the current documents differs from the layout of those documents that were used as examples at the time the wrapper was generated. Thus, a wrapper is stable or robust if it is able to cope with perturbations of the original layout.
Key Points
Documents, and Web pages in particular, are usually susceptible to slight layout changes over time. Most of these changes are minor changes. For example, an advertisement may be added to the top of a Web page. One would ideally wish that changes in the layout of the source will not prevent a wrapper from extracting the desired data, i.e., that the wrapper remains stable under such changes (or, equivalently, robust to these changes), as long as the data items to be extracted still appear in the document. For obvious reasons, totally stable wrappers can hardly be generated. However, there are sensible differences in stability between simple screen scrapers or Web scrapers on one hand, and wrappers generated by advanced Web data extraction systems, on the other hand. Screen scrapers or Web scrapers often rely on exact data positions in a document or in the parse tree of an HTML Web page. For example, a screen scraper may memorize that a certain price to extract from a Web page is the data item occurring as the 37th leaf of the Web page's parse tree. Of course, in case the page's layout is altered by adding, say, an advertisement in front of the data, such a wrapper will fail to extract the intended price. Advanced Web data extraction systems, on the other hand, characterize the objects to be extracted via syntactic, contextual, or semantic properties rather than by their position only. For example, the Lixto system [1] could characterize a desired price as a numeric data item preceded by a currency symbol, occurring in the second column of a table entitled "Price list," such that a certain article name appears in the first column of the same row. Clearly, such a characterization is much more stable than a purely positional one. There are only very few studies on wrapper stability, among which, for example, [2], and this topic remains an important one for future research. One issue of particular interest is the automatic adaptation of wrappers to new layouts [2–5].
Cross-references
▶ Web Data Extraction System
Recommended Reading
1. Baumgartner R., Flesca S., and Gottlob G. Visual web information extraction with Lixto. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 119–128.
2. Davulcu H., Yang G., Kifer M., and Ramakrishnan I.V. Computational aspects of resilient data extraction from semistructured sources. In Proc. 19th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2000, pp. 136–144.
3. Meng X., Hu D., and Li C. Schema-guided wrapper maintenance for web-data extraction. In Proc. Fifth ACM CIKM Int. Workshop on Web Information and Data Management, 2003, pp. 1–8.
4. Phelps T.A. and Wilensky R. Robust intra-document locations. In Proc. 9th Int. World Wide Web Conf., 2000, pp. 105–118.
5. Wong T. and Lam W. A probabilistic approach for adapting information extraction wrappers and discovering new attributes. In Proc. 2004 IEEE Int. Conf. on Data Mining, 2004, pp. 257–264.
Wrapper Verification and Reinduction ▶ Wrapper Maintenance
Write Once Read Many
Kenichi Wada
Hitachi Limited, Tokyo, Japan
Synonyms
Write once read mostly; Write one read multiple; WORM
Definition
A data storage technology that allows information to be written to storage media a single time only, and prevents the user from accidentally or intentionally altering or erasing the data. Optical disc technologies such as CD-R and DVD-R are typical WORM storage.
Key Points
There are two types of WORM technology: physical and logical WORM.
1. Physical WORM uses storage media which physically can be written only once, and prevents the user from accidentally or intentionally altering or erasing the data. Optical storage media such as CD-R and DVD-R are good examples. Offering fast access and long-term storage capability, physical WORM has historically been used for archiving data that requires a long retention period.
2. Logical WORM uses storage systems which provide a WORM capability by using electric keys or other measures to prevent rewriting even with non-WORM storage media, such as disk drives. Driven by the recent growth of legislation in many countries, logical WORM is now increasingly being used to archive corporate data, such as financial documents, emails, and certification documents, and to find them quickly in case of discovery.
Cross-references
▶ Digital Archives and Preservation
Write Once Read Mostly ▶ Write Once Read Many
Write One Read Multiple ▶ Write Once Read Many
WS-BPEL ▶ Composed services and WS-BPEL
WS-Discovery ▶ Discovery
X
XA Standard ▶ Two-Phase Commit Protocol
XMI ▶ XML Metadata Interchange
XML
Michael Rys
Microsoft Corporation, Sammamish, WA, USA
Synonyms Extensible markup language; XML 1.0
Definition The Extensible Markup Language or XML for short is a markup definition language defined by a World Wide Web Consortium Recommendation that allows annotating textual data with tags to convey additional semantic information. It is extensible by allowing users to define the tags themselves.
Historical Background XML was developed as a simplification of the ISO Standard Generalized Markup Language (SGML) in the mid-1990s under the auspices of the World Wide Web Consortium (W3C). Some of the primary contributors were Jon Bosak of Sun Microsystems (the working group chair), Tim Bray (then working at Textuality and Netscape), Jean Paoli of Microsoft, and C. Michael Sperberg-McQueen, then of the University of Chicago. Initially released as a version 1.0 W3C recommendation on 10 Feb. 1998, XML has undergone several revisions since then. The latest XML 1.0 recommendation edition is the fourth edition as of
this writing. A fifth edition is currently undergoing review. The fifth edition adds some functionality to XML 1.0 that was part of the XML 1.1 recommendation, which has achieved very little adoption. Based on the XML recommendation, a whole set of related technologies have been developed, both at the W3C and other places. Some technologies augment the core syntactic XML recommendation, such as the XML Namespaces and XML Information Set recommendations, while others build additional infrastructure on it, such as the XML Schema recommendation or the XSLT, XPath and XQuery family of recommendations. Since XML itself is being used to define markup vocabularies, it also forms the basis for vertical industry standards in manufacturing, finance and other areas, including standard document formats such as XHTML, DocBook, Open Document Format (ODF) and Office Open XML (OOXML), and forms the foundation of the web services infrastructure.
Foundations XML's markup format is based on the notion of defining well-formed documents and is geared towards human-readability and international usability (by basing it on Unicode). Among its design goals were ease of implementation by means of a simple specification, especially compared to its predecessor, the SGML specification, and to make it available on a royalty-free basis to both implementers and users. Well-formed documents basically contain markup elements that have a begin tag and an end tag as in the example below:
<tag>character data</tag>
Element tags can have attributes associated with them that provide information about the tag, without interfering with the flow of the textual character data that is being marked up:
<tag attribute="value">character data</tag>
Processing instructions can be added to convey processing information to XML processors, and comments can be added for commenting and documentation purposes. They follow different syntactic forms than element tags and can appear anywhere in a document, except within the begin tag and end tag tokens themselves:
<?target processing instruction data?>
<!-- a comment -->
A well-formed document must also have exactly one top-level XML element, and can contain several processing instructions and comments on the top level next to that element. The order among these elements is information-bearing, since they are meant to mark up an existing document flow. Thus, the following two well-formed XML documents are not the same:
<doc> <element>value1</element> <element>value2</element> </doc>
<doc> <element>value2</element> <element>value1</element> </doc>
The XML information set recommendation defines an abstract data model for these syntactic components, introducing the notion of document information items for a document, element information items for element tags, attribute information items for their attributes, character information items for the marked-up character data, etc. The XML namespace recommendation adds the ability to scope an element tag name to a namespace URI, to provide the ability to scope markup vocabularies to a domain identifier. Besides defining an extensible markup format, the XML recommendation also provides a mechanism to constrain the XML document markup to follow a certain grammar by restricting the allowed tag names and composition of element tags and attributes with document type declarations (DTDs). Documents that follow the structure given by DTDs are not only well-formed but also valid. Note that the XML Schema recommendation provides another mechanism to constrain XML documents. Finally, XML also provides mechanisms to reuse parts of a document (down to the character level) using so-called entities.
For more information about XML, please refer to the recommended reading section.
Key Applications While XML was originally designed as an extensible document markup format, it has quickly taken over tasks in other areas due to the wide availability of free XML parsers, its readability and flexibility. Besides the use for document markup, two of the key application scenarios for XML are the use for interoperable data interchange in loosely-coupled systems and for ad hoc modeling of semi-structured data. XML's first major commercial applications actually have been to describe the data and, with DTDs or XML Schema formats, the structure of messages that are being exchanged between different computer systems in application-to-application data exchange scenarios and web services. XML is not only being used to describe the message infrastructure format and information such as the SOAP protocol, RSS or Atom formats, but also to describe the structure and data of the message payloads. Often, XML is also used in more ad hoc micro-formats for more REST-ful web services. At the same time that XML was being developed, several research groups were looking into data models that were less strict than relational, entity-relationship and object-oriented models, by allowing instance-based properties and heterogeneous structures. XML's tree model provides a good fit to represent such semi-structured, hierarchical properties, and its flexible format is well-suited to model the sparse properties and rapidly changing structures that are often occurring in semi-structured data. Therefore, XML has often been used to model semi-structured data in data modeling. XML support has been added to databases in the form of either pure XML databases or extensions of existing database platforms such as relational database systems, enabling databases to manage XML documents and serve all three application scenarios.
Cross-references ▶ XML Attribute ▶ XML Document ▶ XML Element ▶ XML Schema ▶ XPath/XQuery ▶ XSL/XSLT
Recommended Reading
1. Namespaces in XML 1.0, latest edition. Available at: http://www.w3.org/TR/xml-names
2. Wikipedia entry for XML. Available at: http://en.wikipedia.org/wiki/XML
3. XML 1.0 Information Set, latest edition. Available at: http://www.w3.org/TR/xml-infoset
4. XML 1.0 recommendation, latest edition. Available at: http://www.w3.org/TR/xml
5. XML 1.1 recommendation, latest edition. Available at: http://www.w3.org/TR/xml11
XML (Almost) ▶ Semi-structured Data
XML 1.0 ▶ XML
XML Access Control
Dongwon Lee 1, Ting Yu 2
1 The Pennsylvania State University, University Park, PA, USA
2 North Carolina State University, Raleigh, NC, USA
Definition
XML access control refers to the practice of limiting access to (parts of) XML data to only authorized users. Similar to access control over other types of data and resources, XML access control is centered around two key problems: (i) the development of formal models for the specification of access control policies over XML data; and (ii) techniques for efficient enforcement of access control policies over XML data.
Historical Background
Access control is one of the fundamental security mechanisms in information systems. It is concerned with who can access which information under what circumstances. The need for access control arises naturally when a multi-user system offers selective access to shared information. As one of the oldest problems in security, access control has been studied extensively in a variety of contexts, including operating systems, databases, and computer networks. The most influential policy models today are discretionary access control (DAC), mandatory access control (MAC), and role-based access control (RBAC) models. In DAC, the owner of an object (e.g., a file or database table) solely determines which subjects can access that object, and whether such privileges can be further delegated to other subjects. In MAC, whether a subject can access an object or not is determined by their security classes, not by the owner of the object. In RBAC, privileges are associated with roles. Users are assigned to roles, and thus can only exercise access privileges characterized by their roles. Typical implementations of access control are in the form of access control lists (ACLs) and capabilities. In ACLs, a system maintains a list for each object of subjects who have access to that object. In capabilities, each subject is associated with a list that indicates those objects to which it has access. ACLs and capabilities are suitable to enforce coarse-grained access control over objects with simple structures (e.g., file systems or table-level access control in relational databases). They are often not efficient for fine-grained access control over objects with complex structures (e.g., element-level access control in XML, or row-level and cell-level access in relational databases).
Foundations XML access control is fine-grained in nature. Instead of controlling access to the whole XML database or document, it is often required to limit a user's access to some substructures of an XML document (e.g., some subtrees or some individual elements). An XML access control policy can be typically modeled as a set of access control rules. Each rule is a 5-tuple (subject, object, action, sign, type), where (i) subject defines a set of subjects; (ii) object defines a set of elements or attributes of an XML document; (iii) action denotes the actions that can be performed on XML (e.g., read, write, and update); (iv) sign ∈ {+, −} indicates whether this rule is a grant rule or a deny rule; and (v) type ∈ {LC, RC} refers to either local check or global check. Intuitively, an access control rule specifies that subjects in subject can (when sign = +) or cannot (when sign = −) perform action specified by action on those elements or attributes in object. When type = RC, this authorization decision also applies to the descendants of those in object.
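Such rules translate directly into data; the following hypothetical sketch (object expressions are kept as plain XPath strings, and R1/R2 correspond to the rules used in Example 1 below) shows one possible encoding together with a helper that selects the rules relevant to a subject and an action.

from collections import namedtuple

# One access control rule: (subject, object, action, sign, type)
Rule = namedtuple("Rule", ["subject", "object", "action", "sign", "type"])

R1 = Rule("admin", "/people/person/name", {"read"}, "-", "LC")
R2 = Rule("admin", "/people//address//*", {"read", "update"}, "+", "RC")

def rules_for(policy, subject, action):
    # Select the rules that could apply to a given subject and action.
    return [r for r in policy
            if r.subject == subject and action in r.action]

print(rules_for([R1, R2], "admin", "update"))   # only R2 applies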
Object is usually specified using some XML query language, such as XPath expressions. There are multiple ways to identify subject. Following RBAC, subject can be specified as one or several roles (e.g., student and faculty) [4]. It may also follow attribute-based access control, where each user features a set of attributes and subject is defined based on the values of those attributes (e.g., those subjects whose age is over 21). Some access control policies in the literature also follow MAC, where subject refers to security classes. Example 1. Consider two access control rules for the subject of "admin" as follows:
R1: (admin, /people/person/name, read, −, LC)
R2: (admin, /people//address//*, read/update, +, RC)
R1 indicates that users belonging to the admin role cannot read textual and attribute data of XML node <name>, the child of <person> that is the child of the root node <people>. On the other hand, R2 specifies that the same users of the admin role can read and even update any XML data that are descendants of XML node <address> if they are descendants of <address> under <people>. Once an access control policy is specified, there are two problems that need to be addressed. First, given any access request, one needs to determine whether or not a rule exists that applies to the request. If so, the policy is said to be complete. Most access control systems adopt a closed world assumption. That is, if no rules apply to a request, the request is denied. Second, multiple rules with different authorization decisions may apply to a request. One typical way to resolve such conflicts is to let denial override permit. For XML access control, it may also be solved by having more explicit rules override less explicit ones. For instance, an LC rule may override an RC rule. For two RC rules, the one that applies to a node's nearest ancestor usually dominates. Languages for specifying access control policy are proposed in such efforts as XACL by IBM [7]. Therefore, it is also possible to use much more expressive languages to specify access control policy within XML access control models. Finally, the use of authorization priorities with propagation and overriding is related to similar techniques studied in object-oriented databases.
enforce access control at the node-level of XML trees. The idea of view-based enforcement (e.g., [5,11]) is to create and maintain a separate view for each user (or role) who is authorized to access a specific portion of the XML data [4]. The view contains exactly the set of data nodes that the user is authorized to access. After views are constructed, during run time, users can simply run their queries against the views without worrying about access control issues. Although views can be prepared offline, in general, view-based enforcement schemes suffer from high maintenance and storage costs, especially for a large number of roles: (i) a virtual or physical view is created and stored for each role; (ii) whenever a user performs an update operation on the data, all views that contain the corresponding data need to be synchronized. To tackle this problem, researchers proposed a method using compressed XML views to support access control [11]. The basic idea is based on the observation of accessibility locality, i.e., elements close to each other tend to have the same accessibility for a subject. Based on this observation, a compressed accessibility map only maintains some "crucial" nodes in the view. With simple annotation on those crucial nodes, other nodes' accessibility can be efficiently inferred instead of explicitly stored. Each node in the compressed view is associated with a label (desc, self), where desc can be either d+ or d−, indicating whether its descendants are accessible or not, and self can be either s+ or s−, indicating whether the node itself is accessible. Given any node in an XML tree, by its relationship to those labeled nodes in the compressed view, we can infer its accessibility. Example 2. Consider the XML tree in Fig. 1a with squares and circles denoting accessible and inaccessible nodes, respectively. The corresponding compressed view is shown in Fig. 1b. Note that since node C is a descendant of B and B is labeled (d−, s+), C can be inferred to be inaccessible. In the non-view-based XML access control techniques (e.g., [2,9,10]), an XML query is pre-classified against the given model to be "entirely" authorized, "entirely" prohibited, or "partially" authorized before being submitted to an XML engine. Therefore, without using pre-built views, those entirely authorized or prohibited queries can be quickly processed. Furthermore, those XML queries that are partially authorized are re-written using state machines such that they request only data that are granted to the users or roles.
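Returning to the compressed accessibility map idea, the inference step can be sketched as follows (hypothetical code over a small parent-linked tree in the spirit of Example 2, with node B labeled (d−, s+); the actual structure in [11] supports efficient lookups over large documents): walk up to the nearest labeled node and use its self label if that node is the query node itself, or its desc label otherwise.

# Compressed accessibility map sketch: only "crucial" nodes carry a
# (desc, self) label; every other node inherits from its nearest
# labeled ancestor.
parent = {"B": "A", "C": "B", "D": "A"}          # child -> parent
labels = {"A": ("d+", "s+"), "B": ("d-", "s+")}  # node -> (desc, self)

def accessible(node):
    n = node
    while n is not None:
        if n in labels:
            desc, self_ = labels[n]
            flag = self_ if n == node else desc
            return flag.endswith("+")
        n = parent.get(n)
    return False   # closed-world default: no applicable label means deny

print(accessible("B"))   # True:  B itself is labeled s+
print(accessible("C"))   # False: nearest labeled ancestor B says d-
print(accessible("D"))   # True:  inherits d+ from A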
Example 3. Consider three access control rules for the security role "admin." Furthermore, an administrator "Bob" requested three queries, Q1 to Q3, in XPath as follows:
R1: (admin, /people/person/name, read, −, LC)
R2: (admin, /people//address//*, read/update, +, RC)
R3: (admin, /regions/namerica/item/name, read, +, LC)
Q1: /people/person/address/street
Q2: /people/person/creditcard
Q3: /regions//*
Then, Q1 by Bob can be entirely authorized by both R1 and R2, but entirely denied by R3. Similarly, Q2 is entirely authorized by R1, and entirely denied by both R2 and R3. Finally, Q3 is entirely accepted by R1, entirely denied by R2, and partially authorized by R3, and needs to be re-written to /regions/namerica/item/name in order not to conflict with R3.
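The per-node decisions underlying such classifications can be sketched as follows (hypothetical code; it handles only root-anchored patterns built from /, //, and *, as in the examples above, ignores subjects and LC/RC propagation, and resolves conflicts with a simple deny-overrides policy rather than the full query rewriting of [9,10]).

import re

def xpath_to_regex(pattern):
    # Supports only child steps, the descendant axis // and the wildcard *.
    out = "^"
    for step in pattern.lstrip("/").split("/"):
        if step == "":          # came from "//": any number of intermediate steps
            out += "(/[^/]+)*"
        else:
            out += "/" + ("[^/]+" if step == "*" else re.escape(step))
    return out + "$"

def decide(rules, node_path, action):
    # rules: list of (object_pattern, action_set, sign); deny overrides grant.
    decision = "deny"                       # closed-world default
    for pattern, actions, sign in rules:
        if action in actions and re.match(xpath_to_regex(pattern), node_path):
            if sign == "-":
                return "deny"
            decision = "grant"
    return decision

rules = [("/people/person/name", {"read"}, "-"),
         ("/people//address//*", {"read", "update"}, "+")]
print(decide(rules, "/people/person/address/street", "read"))  # grant (via R2)
print(decide(rules, "/people/person/name", "read"))            # deny  (via R1)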
Key Applications As XML has been increasingly used not only as a data exchange format but also as a data storage format, the problem of controlling selective access over XML is indispensable to data security. Therefore, XML access control issues have been tightly associated with secure query processing techniques in relational and XML databases (e.g., [3,5]). The access control policy model of XML can also be extended to express security requirements of other semi-structured data such as LDAP and object-oriented databases. The aforementioned access control enforcement techniques can be further extended to protect privacy during the exchange of XML data in distributed information sharing systems. For instance, in the PPIB system [8], XML access control is used to hide what the query content is or where data objects are located,
etc. In an environment where XML data are stored in a distributed fashion and users may ask sensitive queries whose privacy must be kept to its utmost extent (e.g., HIV-related queries in a health information network), XML access control techniques can be used, along with XML content-based routing techniques [6].
Cross-references
▶ Relational Access Control ▶ Secure XML Query Processing
Recommended Reading 1. Bertino E. and Ferrari E. Secure and selective dissemination of XML documents. ACM Trans. Inform. Syst. Secur., 5(3):290–331, 2002. 2. Bouganim L., Ngoc F.D., and Pucheral P. Client-based access control management for XML documents. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 84–95. 3. Cho S., Amer-Yahia S., Lakshmanan L.V.S., and Srivastava D. Optimizing the secure evaluation of twig queries. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 490–501. 4. Damiani E., Vimercati S., Paraboschi S., and Samarati P. A finegrained Access Control System for XML Documents. ACM Trans. Inform. Syst. Secur., 5(2):169–202, 2002. 5. Fan W., Chan C.-Y., and Garofalakis M. Secure XML querying with security views. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004, pp. 587–598. 6. Koudas N., Rabinovich M., Srivastava D., and Yu T. Routing XML queries. In Proc. 20th Int. Conf. on Data Engineering, 2004, p. 844. 7. Kudo M. and Hada S. XML document security based on provisional authorization. In Proc. 7th ACM Conf. on Computer and Communications Security, 2002, pp. 87–96. 8. Li F., Luo B., Liu P., Lee D., and Chu C.H. Automaton segmentation: a new approach to preserve privacy in XML information brokering. In Proc. 14th ACM Conf. on Computer and Communications Security, 2007, pp. 508–518. 9. Luo B., Lee D., Lee W.C., and Liu P. QFilter: fine-grained runtime XML access control via NFA-based query rewriting.
In Proc. Int. Conf. on Information and knowledge Management, 2004, pp. 543–552. 10. Murata M., Tozawa A., and Kudo M. XML access control using static analysis. In Proc. 10th ACM Conf. on Computer and Communication Security, 2003, pp. 73–84. 11. Yu T., Srivastava D., Lakshmanan L.V.S., and Jagadish H.V. A compressed accessibility map for XML. ACM Trans. Database Syst., 29(2):363–402, 2004.
XML Algebra ▶ XML Tuple Algebra
XML Application Development ▶ XML Programming
XML Attribute
Michael Rys
Microsoft Corporation, Sammamish, WA, USA
Synonyms
XML attribute; XML attribute node
Definition
An XML attribute is used to represent additional information on an XML element in the W3C XML recommendation [2].
Key Points
The name of an XML attribute has to be unique for a given element. Therefore, the following XML element is not allowed:
<e a1="v1" a1="v2"/>
while the following is well-formed:
<e1 a1="v1"> <e2 a1="v2"/> </e1>
An XML attribute has a value that follows the attribute value normalization rules outlined in section 3.3.3 of [2]. This means that several whitespace characters do not get preserved in attribute values, unless they are explicitly represented with a character entity (e.g., &#xD; for carriage return). Attributes can be constrained and typed by schema languages such as XML DTDs [2] or XML Schema [3]. An XML attribute is represented as an XML attribute information item in the XML Information Set [1] and an XML attribute node in the XPath and XQuery data model [4].
Cross-references
▶ XML ▶ XML Document ▶ XML Element
Recommended Reading
1. XML 1.0 Information Set, latest edition available online at: http://www.w3.org/TR/xml-infoset.
2. XML 1.0 Recommendation, latest edition available online at: http://www.w3.org/TR/xml.
3. XML Schema Part 0: Primer, latest edition available online at: http://www.w3.org/TR/xmlschema-0/.
4. XQuery 1.0 and XPath 2.0 Data Model (XDM), latest edition available online at: http://www.w3.org/TR/xpath-datamodel/.
XML Attribute Node ▶ XML Attribute
XML Benchmarks
Denilson Barbosa 1, Ioana Manolescu 2, Jeffrey Xu Yu 3
1 University of Alberta, Edmonton, AB, Canada
2 INRIA, Saclay–Île-de-France, Orsay, France
3 The Chinese University of Hong Kong, Hong Kong, China
Definition An XML benchmark is a specification of a set of meaningful and relevant tasks, intended to assess the functionality and/or performance of an XML processing tool or system. The benchmark must specify the following: (i) a deterministic workload, consisting of a set of XML documents and/or a procedure for obtaining these and a set of operations to be performed;
(ii) detailed rules for executing the workload and making the measurements; (iii) the metrics used to report the results of the benchmark; and (iv) standard ways of interpreting the results.
Historical Background XML has quickly become the preferred format for representing and exchanging data in the Web age. The level of acceptance of XML is astonishing, especially when one considers that this technology was introduced only in 1997. XML is an enabling technology with applications in virtually all domains of information processing. At the time of writing, XML is widely used in content distribution on the Web (e.g., RSS feeds), as the foundation of large initiatives such as the Semantic Web and Web Services, and is the basis for routinely used productivity tools, such as text editors and spreadsheets. Such complexity led to the development of a large number of fairly narrowly-scoped benchmarks for XML. Moreover, there is still no clear understanding of what an XML benchmark should look like. The first XML benchmarks were developed in academia for testing specific processing tasks and/or relatively narrow applications. In fact, some of these earlier benchmarks are categorized as micro-benchmarks. For example, XMach emulated the scenario of an XML file server using a large number of simple XML files, while XMark modeled an online auction application using a single complex XML document. Over time, XML benchmarks evolved to contemplate more realistic and complex application scenarios, in an attempt to emulate the typical workload of larger applications. For instance, XBench modeled four scenarios, resulting from the combination of two factors: (i) the nature of the documents (data-centric vs. document-centric); and (ii) the number of documents in the workload (single-document vs. multi-document). The first XML benchmark developed entirely by industry was TPoX (Transaction Processing over XML), which simulates a financial application in a multi-user environment with concurrent access to the XML data. TPoX is intended for testing all aspects of a relational-based XML storage and processing system. Another development worth mentioning concerns the declarative synthetic data generators for XML. While, strictly speaking, these are not benchmarks, these tools help in obtaining appropriate testing data
with reasonably low effort and high enough customizability.
Foundations Benchmarks should be simple, portable, scale to different workload sizes, and allow objective comparisons of competing systems and tools [4]. It should be noted that, given the diversity and complexity of its applications, developing meaningful and realistic benchmarks for XML is a truly herculean task. Also, XML processing tools fall into many categories, from simple storage services to sophisticated query processors, thus adding to the complexity of developing relevant and realistic XML benchmarks. The two factors above have led to the development of benchmarks that are relatively narrow in scope, focusing on very specific tasks. Moreover, the lack of universally accepted, comprehensive XML benchmarks resulted in the development of general-purpose synthetic data generators, which allow the user to obtain customized test data with relatively low effort. This is significant, as such tools were not popular until the advent of XML.
XML Microbenchmarks
A micro-benchmark is a narrowly-defined benchmark aimed at testing very specific aspects of a tool and/or system.
The Michigan Benchmark
The Michigan Benchmark is an XML Micro-benchmark developed at the University of Michigan [8]. It uses a single synthetic document that does not resemble a typical document from any real world application domain. Instead, it is carefully designed to allow the testing of the following query processing operations: matching attributes by value; selecting elements by name; evaluation of positional predicates; selection of nodes based on predicates over their parent, children, ancestors or descendants; join operations; computing aggregate functions; and processing updates. The authors applied the benchmark to three database systems: two native XML DBMSs, and a commercial ORDBMS.
XML Application Benchmarks
An application benchmark is a comprehensive set of tasks that approximates the workload of a typical application in the respective domain. Four important
XML application benchmarks are XMach-1 [2], XMark [9], XBench [10], and TPoX [7].
XMach-1
XMach-1 (XML Data Management Benchmark) is a multi-user benchmark developed at the University of Leipzig, Germany [2]. Unlike most existing XML benchmarks that are designed to test the query processors of database management systems, XMach-1 is designed to test database management systems which include a query processor as well as other key components. In terms of measurement, XMach-1 evaluates systems based on throughput performance (XML queries per second) instead of response time for user-given XML queries. XMach-1 considers a system architecture to support web applications, which consists of four main components, namely, XML database, application servers, loaders and browser clients. In the XML database, there are multiple schemas, and there are between 2 and 100 documents per schema. Each document is generated using the 10,000 most frequent English words, and occupies between 2 and 100 KB of storage. The workload contains eight queries and three update operations. Some evaluation results can be found in [6].
XMark
XMark is the result of the XML benchmark project, led by a team at CWI [9]. It models an Internet auctioning application. The workload consists of a large database, in a single XML document, containing: (i) items for auction in geographically dispersed areas; (ii) bidders and their interests; and (iii) detailed information about existing auctions, which can be open or closed. XMark’s workload includes 20 queries that cover the following broad kinds of operations: simple node selections; document queries for which order information is relevant; navigational queries; and computing aggregate functions. The XMark data generator employs several kinds of probability distributions and uses several real data values (e.g., names of countries and people) to produce realistic data; also, the textual content uses real words of the English language. XMark is by far the most widely used XML benchmark at the time of writing.
XBench
XBench is a benchmark suite developed at the University of Waterloo [10]. XBench defines application benchmarks categorized according to two criteria: single-document versus multi-document and data-centric versus text-centric domains. The latter
criterion distinguishes data management and exchange scenarios from content management applications (e.g., electronic publishing). Document collections in XBench range in size from a few kilobytes to several gigabytes, and its workload consists of bulk-loading as well as various query and text-based search operations. Results of an evaluation of four different systems, comprising both native XML stores and relational-based stores, are provided in [10].
TPoX
TPoX (Transaction Processing over XML) is a comprehensive application benchmark developed jointly by IBM and Intel [7]. TPoX simulates a financial application domain (security trading) and is based on the industry-standard XML Schema specification FIXML [3]. The testing environment in TPoX covers several aspects of XML management inside a DBMS, including the use of XQuery, SQL/XML, updates, and concurrent access to the data. The authors report on an experimental evaluation of the IBM DB2 product for storing and processing XML data [7].
Synthetic Data Generators
Synthetic data have other applications besides benchmarking, such as testing specific components of a complex system or application. In this setting, an important requirement for a data generator, besides generating realistic data (i.e., synthetic data whose characteristics match those of typical real data in the application), is the ability to easily customize the test data (e.g., its structure). Declarative synthetic data generators, on the other hand, are tools that produce synthetic data according to specifications that describe what data to generate, as opposed to how to generate such data, thus facilitating the generation of synthetic data. Declarative data generators are intended for easing the burden of obtaining test data, unlike the data generators of standardized benchmarks, which have the characteristics of the data they produce embedded in their source code. Declarative data generators rely on formalisms providing higher levels of abstraction than programming languages, such as conceptual schema languages annotated with probabilistic information (for describing the characteristics of the intended data). Such probabilistic information is needed because schema languages specify only what content is allowed in valid document instances. A realistic data generator must allow the specification of the characteristics of
typical documents as well. For example, while a schema formalism will specify that a book element may contain between 1 and 10 authors, a realistic data generator will allow one to define a probability distribution for the number of authors in the test data. Another desirable feature of realistic synthetic data is that it satisfies integrity and referential constraints. For instance, an XML document describing a book review should refer to an existing book in the test data. Two examples of declarative XML data generators are the IBM XML Generator [5], whose data specifications are based on Document Type Definitions, and ToXgene [1], which relies on XML Schema specifications. Both tools allow the specification of skewed probability distributions for elements, attributes, and textual nodes. ToXgene, being based on XML Schema, supports different data types, as well as key and referential constraints. ToXgene also offers a simple declarative query language that allows one to model relatively complex dependencies among elements and attributes involving arithmetic and string manipulation operations. For instance, it allows one to model that the total price of an invoice should be the sum of the individual prices of the items in that invoice multiplied by the appropriate tax rates. Finally, ToXgene offers support for generating recursive XML documents.
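As a generic illustration of driving generation from declarative distributions (this is plain Python rather than the actual ToXgene or IBM generator specification languages, and the distribution values and element names below are made up), element counts can be sampled from a user-specified probability distribution:

import random
import xml.etree.ElementTree as ET

# Declarative piece of the specification: how many authors a book has.
AUTHOR_COUNT_DIST = {1: 0.6, 2: 0.25, 3: 0.1, 4: 0.05}
FIRST_NAMES = ["Ana", "Bob", "Chen", "Dora"]

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

def make_book(book_id):
    book = ET.Element("book", id=f"b{book_id}")
    ET.SubElement(book, "title").text = f"Title {book_id}"
    for _ in range(sample(AUTHOR_COUNT_DIST)):
        ET.SubElement(book, "author").text = random.choice(FIRST_NAMES)
    return book

catalog = ET.Element("catalog")
for i in range(3):
    catalog.append(make_book(i))
print(ET.tostring(catalog, encoding="unicode"))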
Key Applications
Meaningful benchmarks are essential for the development of new technologies, as they allow developers to assess progress and understand intrinsic limitations of their tools. Applications include functionality testing, performance evaluation, and system comparisons.
Recommended Reading
1. Barbosa D. and Mendelzon A.O. Declarative generation of synthetic XML data. Softw. Pract. Exper., 36(10):1051–1079, 2006.
2. Böhme T. and Rahm E. XMach-1: a benchmark for XML data management. In Proc. German Database Conference. Springer, Berlin, 2001, pp. 264–273; Multi-user evaluation of XML data management systems with XMach-1. LNCS, vol. 2590, 2003, pp. 148–159.
3. Financial Information Exchange Protocol. FIXML 4.4 Schema Version Guide. Available at: http://www.fixprotocol.org.
4. Gray J. (ed.). The Benchmark Handbook for Database and Transaction Systems (2nd edn.). Morgan Kaufmann, San Francisco, CA, USA, 1993, ISBN 1-55860-292-5.
5. IBM XML Generator. Available at: http://www.alphaworks.ibm.com/tech/xmlgenerator, 2007.
6. Lu H., Yu J.X., Wang G., Zheng S., Jiang H., Yu G., and Zhou A. What makes the differences: benchmarking XML database implementations. ACM Trans. Internet Technol., 5(1):154–194, 2005.
7. Nicola M., Kogan I., and Schiefer B. An XML transaction processing benchmark. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2007, pp. 937–948.
8. Runapongsa K., Patel J.M., Jagadish H.V., Chen Y., and Al-Khalifa S. The Michigan benchmark: towards XML query performance diagnostics. Inf. Syst., 31(2):73–97, 2006.
9. Schmidt A., Waas F., Kersten M.L., Carey M.J., Manolescu I., and Busse R. XMark: a benchmark for XML data management. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 974–985.
10. Yao B.B., Özsu M.T., and Khandelwal N. XBench benchmark and performance testing of XML DBMSs. In Proc. 20th Int. Conf. on Data Engineering, 2004, pp. 621–633.
XML Cardinality Estimation ▶ XML Selectivity Estimation
XML Compression
Jayant R. Haritsa 1, Dan Suciu 2
1 Indian Institute of Science, Bangalore, India
2 University of Washington, Seattle, WA, USA
Definition XML is an extremely verbose data format, with a high degree of redundant information, due to the same tags being repeated over and over for multiple data items, and due to both tags and data values being represented as strings. Viewed in relational database terms, XML stores the "schema" with each and every "record" in the repository. The size increase incurred by publishing data in XML format is estimated to be as much as 400% [14], making it a prime target for compression. While standard general-purpose compressors, such as zip, gzip or bzip, typically compress XML data reasonably well, specialized XML compressors have been developed over the last decade that exploit the specific structural aspects of XML data. These new techniques fall into two classes: (i) Compression-oriented, where the goal is to maximize the compression ratio of the data, typically up to a factor of two
better than the general-purpose compressors; and (ii) Query-oriented, where the goal is to integrate the compression strategy with an XPath query processor such that queries can be processed directly on the compressed data, selectively decompressing only the data relevant to the query result.
Historical Background Research into XML compression was initiated with Liefke and Suciu's development in 2000 of a compressor called XMill [6]. It is based on three principles: separating the structure from the content of the XML document, bucketing the content based on its tags, and compressing the individual buckets separately. XMill is a compression-oriented technique, focusing solely on achieving high compression ratios and fast compression/decompression, ignoring the query processing aspects. Other compression-oriented schemes that appeared around the same time include Millau [4], designed for efficient encoding and streaming of XML structures; and XMLPPM, which implements an extended SAX parser for online processing of documents [2]. There are also several commercial offerings that have been featured on the Internet (e.g., [11,13,15]). Subsequently, the focus shifted to the development of query-oriented techniques intended to support query processing directly on the compressed data. This stream of research began with Tolani and Haritsa [9] presenting in 2002 a system called XGrind, where compression is carried out at the granularity of individual element/attribute values using a simple context-free compression scheme – tags are encoded by integers while textual content is compressed using non-adaptive Huffman (or Arithmetic) coding. XGrind consciously maintains a homomorphic encoding, that is, the compressed document is still in XML format, the intention being that all existing XML-related tools (such as parsers, indexes, schema-checkers, etc.) could continue to be used on the compressed document. In 2003, Min et al. [7] proposed a compressor called XPRESS that extended the homomorphic compression approach of XGrind to include effective evaluation of query path expressions and range queries on numerical attributes. Their scheme uses a reverse arithmetic path-encoding that encodes each path as an interval of real numbers between 0 and 1. The following year produced the XQueC system [1] from Arion et al., which supports cost-based tradeoffs between compact storage and efficient processing.
An excellent survey of the state-of-the-art in XML compressors is available in [1].
Foundations The basic principles for compressing XML documents are the following:
Separate structure from data: The structure consists of XML tags and attributes, organized as a tree. The data consists of a sequence of items (strings) representing element text contents and attribute values. The structure and the data are compressed separately.
Group data items with related meaning: Data items are logically or physically grouped into containers, and each container is compressed separately. By exploiting similarities between the values in a container, the compression improves substantially. Typically, data items are grouped based on the element type, but some systems (e.g., [1]) chose more elaborate grouping criteria.
Apply specialized compressors to containers: Some data items are plain-text, while others are numbers, dates, etc., and for each of these different domains, the system uses a specialized compressor.
Compression-Oriented XML Compressors
The architecture of the XMill compressor [6], which is typical of several XML compressors, is depicted in Fig. 1. The XML file is parsed by a SAX parser that sends tokens to the path processor. The purpose of the path processor is to separate the structure from the data, and to further separate the data items according to their semantics. Next, the structure container and all data containers are compressed with gzip, then written to disk. Optionally, the data items in certain containers may be compressed with a user-defined semantic compressor. For example, numerical values or IP addresses can be binary encoded, dates can be represented using specialized data structures, etc. By default, XMill groups data items based on their innermost element type. Users, however, can override this by providing container expressions on the command line. The path processor uses these expressions to determine in which container to store each data item. Path expressions also determine which semantic compressor to apply (if any). The amount of main memory holding all containers is fixed. When the limit is exhausted, all containers are gzip-ed and written to disk as one logical block, and then the compression resumes. In effect, this partitions the
XML Compression. Figure 1. Architecture of the XMill compressor [4].
input XML file into logical blocks that are compressed independently. The decompressor, XDemill, is similar, but proceeds in reverse. It reads one block at a time in main memory, decompresses every container, then merges the XML tags with the data values to produce the XML output. Example. To illustrate the working of XMill, consider the following snippet of Web-server log data, where each entry represents one HTTP request: 202.239.238.16 GET / HTTP/1.0 text/html 200 1997/10/01-00:00:02 4478 http://www.so-net.jp/ Mozilla/3.0 [ja]
After the document is parsed, the path processor separates it into the structure and the content, and further separates the content into different containers. The structure is obtained by removing all text values and attribute values and replacing them with their container number. Start-tags are dictionary-encoded, i.e., assigned an integer value, while all end-tags are replaced by the same, unique token. For illustration purposes, start-tags are denoted with T1, T2,..., the unique end-tag with /, and container numbers with C1, C2,.... In this example the structure of the entry element is: T1 T2 C1 / T3 C2 / T4 C3 / T5 C4 / T8 C7 / T8 C7 / T11 C10 / T12 C11 / / Here T1 = apache:entry, T2 = apache:host, and so on, while / represents any end tag. Internally, each token is encoded as an integer: tags are positive, container numbers are negative, and / is 0. Numbers between (−64, 63) take one byte, while numbers outside this range take two or four bytes; the example string is overall coded in 26 bytes. The structure is compressed using gzip, which is based on Ziv-Lempel's LZ77 algorithm [10]. This results in excellent compression, because LZ77 exploits very well the frequent repetitions in the structure container: the compressed structure usually amounts to only 1–3% of the compressed file. Next, data items are partitioned into containers, then compressed. Each container is associated with
an XPath expression that defines which data items are stored in that container, and an optional semantic compressor for the items in that container. By default there is a container for each tag tag occurring in the XML file, the associated XPath expression is //tag, and there is no associated semantic compressor. Users may override this on the command line. For example:
xmill -p //shipping/address -p //billing/address file.xml
creates two separate containers for address elements: one for those occurring under shipping, and one for those occurring under billing. If there exist address elements occurring under elements other than shipping or billing, then their content is stored in a third container. Note that the container expressions only need to be specified to the compressor, not to the decompressor. Overriding the default grouping of data items usually results in only modest improvements in the compression ratio. Much better improvements are achieved, however, with semantic compressors.
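The container mechanism can be illustrated with a small sketch. It is hypothetical and much simplified: only element text content is handled, containers are keyed by the innermost element tag (the default grouping), zlib stands in for gzip, and container numbering, attributes, semantic compressors and the fixed memory window are all omitted.

```python
# Minimal sketch (not XMill itself): separate an XML document's structure from its
# text content, group text values into per-tag containers, and compress each part.
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

def compress_xmill_style(xml_text):
    root = ET.fromstring(xml_text)
    tags = {}                       # dictionary-encoding of start-tags: tag -> T<i>
    containers = defaultdict(list)  # default grouping: one container per innermost tag
    structure = []

    def walk(elem):
        code = tags.setdefault(elem.tag, "T%d" % (len(tags) + 1))
        structure.append(code)
        if elem.text and elem.text.strip():
            containers[elem.tag].append(elem.text.strip())
            structure.append("C:" + elem.tag)   # placeholder for the container number
        for child in elem:
            walk(child)
        structure.append("/")                   # single token for every end-tag
    walk(root)

    return {
        "structure": zlib.compress(" ".join(structure).encode()),
        "containers": {t: zlib.compress("\0".join(v).encode())
                       for t, v in containers.items()},
    }

doc = "<entry><host>202.239.238.16</host><status>200</status></entry>"
print({k: len(v) for k, v in compress_xmill_style(doc)["containers"].items()})
```

Because similar values end up in the same container, each compressed stream is far more repetitive than the interleaved original, which is exactly what the grouping principle is meant to exploit.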
Query-Oriented XML Compressors
XGrind. The technique described in [9] is intended to simultaneously provide efficient query-processing performance and reasonable compression ratios. Basic requirements to achieve the former objective are (i) fine-grained compression at the element/attribute granularity of query predicates, and (ii) context-free compression assigning codes to data items independent of their location in the document. Algorithms such as LZ77 are not context-free, and therefore XGrind uses non-adaptive Huffman (or Arithmetic) coding, in which two passes are made over the XML document – the first to collect the statistics and the second to do the actual encoding. A separate character-frequency distribution table is used for each element and non-enumerated attribute, resulting in fine-grained characterization. The DTD is used to identify enumerated-type attributes and their values are encoded using a simple binary encoding scheme, while the compression of XML tags is similar to that of XMill. With this scheme, exact-match and prefix-match queries can be completely carried out directly on the compressed document, while range or partial-match queries only require on-the-fly decompression of the element/attribute values that feature in the query predicates. A distinguishing feature of the XGrind compressor is that it ensures homomorphic compression – that
is, its output, like its input, is semi-structured in nature. In fact, the compressed XML document can be viewed as the original XML document with its tags and element/attribute values replaced by their corresponding encodings. The advantage of doing so is that the variety of efficient techniques available for parsing/querying XML documents can also be used to process the compressed document. Second, indexes can be built on the compressed document similar to those built on regular XML documents. Third, updates to the XML document can be directly executed on the compressed version. Finally, a compressed document can be directly checked for validity against the compressed version of its DTD. As a specific example of the utility of homomorphic compression, consider repositories of genomic data (e.g., [12]), which allow registered users to upload new genetic information to their archives. With homomorphic compression, such information could be compressed by the user, then uploaded, checked for validity, and integrated with the existing archives, all operations taking place completely in the compressed domain. To illustrate the working of XGrind, consider the XML student document fragment along with its DTD shown in Figs. 2 and 3. An abstract view of its XGrind compressed version is shown in Fig. 4, in which nahuff(s) denotes the output of the HuffmanCompressor for an input data value s, while enum(s) denotes the output of the Enum-Encoder for an input data value s, which is an enumerated attribute. As is evident from Fig. 4, the compressed document output in the second pass is semi-structured in nature, and maintains the property of validity with respect to the compressed DTD. The compressed-domain query processing engine consists of a lexical analyzer that emits tokens for encoded tags, attributes, and data values, and a parser built on top of this lexical analyzer does the matching and dumping of the matched tree fragments. The parser maintains information about its current path location in the XML document and the contents of the set of XML nodes that it is currently processing. For exact-match or prefix-match queries, the query path and the query predicate are converted to the compressed-domain equivalents. During parsing of the compressed XML document, when the parser detects that the current path matches the query path, and that the compressed data value matches the compressed
XML Compression. Figure 2. Fragment of student database.
XML Compression. Figure 3. DTD of student database.
XML Compression. Figure 4. Abstract view of compressed XGrind database.
query predicate, it outputs the matched XML fragment. An interesting side-effect is that the matching is more efficient in the compressed domain as compared to the original domain, since the number of bytes to be processed have considerably decreased. XPRESS. While maintaining the homomorphic feature of XGrind, XPRESS [7] significantly extends its scope by supporting both path expressions and range queries (on numeric element types) directly on the compressed data. Here, instead of representing the tag of each element with a single identifier, the element label path is encoded as a distinct interval in [0.0, 1.0). The specific process, called reverse arithmetic encoding, is as follows: First, the entire interval [0.0, 1.0) is partitioned into disjoint sub-intervals, one for each distinct element. The size of the interval is
XML Compression. Table 1. XPRESS interval scheme

Element tag  | Path label               | Element interval | Path interval
book         | book                     | [0.0, 0.1)       | [0.0, 0.1)
section      | book.section             | [0.3, 0.6)       | [0.3, 0.33)
subsection   | book.section.subsection  | [0.6, 0.9)       | [0.69, 0.699)
proportional to the normalized frequency of the element in the data. In the second step, these element intervals are reduced by encoding the path leading to this element in a depth-first tree traversal from the root. An example from [7] is shown in Table 1, where the element intervals are computed from the first partitioning, and the corresponding reduced intervals by following the path labels from the root. For each node in the tree, the associated interval is incrementally computed from the parent node. The intervals generated by reverse arithmetic encoding guarantee that "If a path P is represented by the interval I, then all intervals for suffixes of P contain I." Therefore, the interval for subsection, which is [0.6, 0.9), contains the interval [0.69, 0.699) for the path book.section.subsection. Another feature of XPRESS is that it infers the data types of elements and for those that turn out to be numbers over large domains, the values are compressed by first converting them to binary and then using differential encoding, instead of the default string encoding. Recently, XPRESS has been extended in [8] to handle updates such as insertions or deletions of XML fragments. Finally, an index-based compression approach that improves on both the compression ratio and the query processing speed has been recently proposed in [3].
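The containment property can be illustrated with a small sketch. The intervals are simply copied from Table 1; the containment test is only an assumption about how a query processor might exploit reverse arithmetic encoding, not a description of XPRESS itself.

```python
# Illustrative sketch of XPRESS-style matching: a query step on an element matches
# any encoded label path whose path interval lies inside that element's interval.
PATH_INTERVALS = {                  # path intervals from Table 1
    "book": (0.0, 0.1),
    "book.section": (0.3, 0.33),
    "book.section.subsection": (0.69, 0.699),
}
ELEMENT_INTERVALS = {"book": (0.0, 0.1), "section": (0.3, 0.6), "subsection": (0.6, 0.9)}

def contains(outer, inner):
    """True if interval `inner` lies inside `outer` (the suffix-of-path test)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

# //subsection matches book.section.subsection because [0.69, 0.699) is contained
# in subsection's element interval [0.6, 0.9); book's interval is not inside section's.
print(contains(ELEMENT_INTERVALS["subsection"], PATH_INTERVALS["book.section.subsection"]))  # True
print(contains(ELEMENT_INTERVALS["section"], PATH_INTERVALS["book"]))                         # False
```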
Key Applications Data archiving, data exchange, query processing.
Cross-references
▶ Data Compression in Sensor Networks ▶ Indexing Compressed Text ▶ Lossless Data Compression ▶ Managing Compressed Structured Text ▶ Text Compression ▶ XML ▶ XPath/XQuery
Recommended Reading 1. Arion A., Bonifati A., Manolescu I., and Pugliese A. XQueC: a query-conscious compressed XML database. ACM Trans. Internet Techn., 7(2):1–35, 2007. 2. Cheney J. Compressing XML with multiplexed hierarchical PPM models. In Proc. Data Compression Conference, 2001, pp. 163–172. 3. Ferragina P., Luccio F., Manzini G., and Muthukrishnan M. Compressing and Searching XML Data Via Two Zips. In Proc. 15th Int. World Wide Web Conference, 2006, pp. 751–760. 4. Girardot M. and Sundaresan N. Millau: an encoding format for efficient representation and exchange of XML over the Web. In Proc. 9th Int. World Wide Web Conference, 2000. 5. Liefke H. and Suciu D. An extensible compressor for XML data. ACM SIGMOD Rec., 29(1):57–62, 2000. 6. Liefke H. and Suciu D. XMill: an efficent compressor for XML data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 153–164. 7. Min J.K., Park M., and Chung C. XPRESS: a queriable compression for XML data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 122–133. 8. Min J.K., Park M., and Chung C. XPRESS: a compressor for effective archiving, retrieval, and update of XML documents. ACM Trans. Internet Techn., 6(3):223–258, 2006. 9. Tolani P. and Haritsa J. XGRIND: a query-friendly XML compressor. In Proc. 18th Int. Conf. on Data Engineering, 2002, pp. 225–235. 10. Ziv J. and Lempel A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337–343, 1977. 11. www.dbxml.com 12. www.ebi.ac.uk 13. www.ictcompress.com 14. www.ictcompress.com/xml.html 15. www.xmlzip.com
XML Data Dependencies ▶ XML Integrity Constraints
XML Data Integration ▶ XML Information Integration
XML Database Design ▶ Semi-structured database design
XML Database System ▶ XQuery Processors
XML Document MICHAEL RYS Microsoft Corporation, Sammamish, WA, USA
Synonyms XML document
Definition An XML document is a text document that follows the W3C XML recommendation [3].
Key Points An XML document is considered well-formed, if it satisfies the XML 1.0 well-formedness constraints of having exactly one top-level XML element, balanced begin- and end-tags, and all the other rules outlined in [3]. It is considered valid, if it validates against a schema language such as the XML DTDs [3] or XML Schema [1]. An XML document is represented as an XML document information item in the XML Information Set [2] and an XML document node in the XPath and XQuery data model [4].
Cross-references ▶ XML ▶ XML Attribute ▶ XML Element
Recommended Reading
1. Schema Part 0: Primer, latest edition available online at: http://www.w3.org/TR/xmlschema-0/.
2. XML 1.0 Information Set, latest edition available online at: http://www.w3.org/TR/xml-infoset.
3. XML 1.0 Recommendation, latest edition available online at: http://www.w3.org/TR/xml.
4. XQuery 1.0 and XPath 2.0 Data Model (XDM), latest edition available online at: http://www.w3.org/TR/xpath-datamodel/.

XML Database ▶ XML Storage

XML Element
MICHAEL RYS Microsoft Corporation, Sammamish, WA, USA
Synonyms XML element
Definition An XML element is a markup element in the W3C XML recommendation [3] that consists of a beginning markup tag and an end tag with optional content and optional attributes.
Key Points An XML element is a markup element in the W3C XML recommendation [3] that consists of a beginning markup tag and an end tag. It has to have balanced begin- and end-tags as in this example: <element>content</element>. XML elements can be nested as in <e1>some optional text <e2>nested element</e2> some optional text</e1>. They can have XML attributes as in <element attribute="value">content</element>, and they can be empty, meaning have no content, as in <element/>. The structure of elements is free within the grammatical rules of [3], but elements can be constrained and typed by schema languages such as XML DTDs [3] or XML Schema [1]. An XML element is represented as an XML element information item in the XML Information Set [2] and an XML element node in the XPath and XQuery data model [4].
Cross-references ▶ XML ▶ XML Attribute ▶ XML Document

Recommended Reading
1. Schema Part 0: Primer, latest edition available online at: http://www.w3.org/TR/xmlschema-0/.
2. XML 1.0 Information Set, latest edition available online at: http://www.w3.org/TR/xml-infoset.
3. XML 1.0 Recommendation, latest edition available online at: http://www.w3.org/TR/xml.
4. XQuery 1.0 and XPath 2.0 Data Model (XDM), latest edition available online at: http://www.w3.org/TR/xpath-datamodel/.
XML Enterprise Information Integration ▶ XML Information Integration
XML Export ▶ XML Publishing
XML Filtering ▶ XML Publish/Subscribe
XML Indexing XIN LUNA DONG, DIVESH SRIVASTAVA AT&T Labs–Research, Florham Park, NJ, USA
Definition XML employs an ordered, tree-structured model for representing data. Queries in XML languages like XQuery employ twig queries to match relevant portions of data in an XML database. An XML Index is a data structure that is used to efficiently look up all matches of a fragment of the twig query, where some of the twig query fragment nodes may have been mapped to specific nodes in the XML database.
Historical Background XML path indexing is related to the problem of join indexing in relational database systems [15] and path indexing in object-oriented database systems (see, e.g., [1,9]). These index structures assume that the schema is homogeneous and known; these assumptions do not hold in general for XML data. The DataGuide [7] was the first path index designed specifically for XML data, where the schema may be heterogeneous and may not even be known.
Foundations Notation
An XML document d is a rooted, ordered, node-labeled tree, where (i) each node corresponds to an XML element, an XML attribute, or a value; and (ii) each edge corresponds to an element-subelement, element-attribute, element-value, or an attribute-value relationship. Non-leaf nodes in d correspond to XML elements and attributes, and are labeled by the element tags or attribute names, while leaf nodes in d correspond to values. For the example XML document of Fig. 1a, its tree representation is shown in Fig. 1b. Each node
is associated with a unique number referred to as its Id, as depicted in Fig. 1b. An XML database D is a set of XML documents. Queries in XML languages like XPath and XQuery make fundamental use of twig queries to match relevant portions of data in an XML database. A twig query Q is a node- and edge-labeled tree, where (i) nodes are labeled by element tags, attribute names, or values; (ii) edges are labeled by an XPath axis step, e.g., child, descendant, or following-sibling. For example, the twig query in Fig. 2a corresponds to the path expression /book[./descendant::author[./following-sibling::author[fn="jane"]][ln="poe"]]. Given an XML database D, and a twig query Q, a match of Q in D is identified by a mapping from nodes in Q to nodes in D, such that: (i) Q's node labels (i.e., element tags, attribute names and values) are preserved under the mapping; and (ii) Q's edge labels (i.e., XPath axis steps) are satisfied by the corresponding pair of nodes in D under the mapping. For example, the twig query of Fig. 2a matches a root book element that has a descendant author element that (i) has a child ln element with value poe; and (ii) has a following sibling author element that has a child fn
XML Indexing. Figure 1. (a) An XML database fragment. (b) XML tree (Id numbering).
XML Indexing. Figure 2. (a) Twig query. (b) Twig query fragments.
element with value jane. Thus, the book element in Fig. 1b is a match of the twig query in Fig. 2a. An XML Index I is a data structure that is used to efficiently look up all matches of a fragment of the twig query Q, where some of the twig query fragment nodes may have been mapped to specific nodes in the XML database. Some fragments of the twig query of Fig. 2a are shown in Fig. 2b and include the edges book/ descendant::author and author/following-sibling::author, and the path book/ descendant::author[ln =‘‘poe’’]. The matches returned by XML Index lookups on different fragments of a twig query Q can be ‘‘stitched together’’ using query processing algorithms to compute matches to Q. The following sections describe various techniques that have been proposed in the literature to index node, edge, path and twig fragments of twig queries. Node Indexes
When the fragment of a twig query Q that needs to be looked up in the index is a single node labeled by an element tag, an attribute name, or a value, a classical inverted index is adequate. This index is constructed by associating each element tag (or attribute name, value) with the list of all the node Ids in the XML database with that element tag (attribute name, value, respectively). For example, given the data of Fig. 1b, the list associated with the element tag author would be [7,27,47], and the list associated with the value jane would be [9,49].
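A minimal sketch of such a node-level inverted index follows; the flat list of (Id, label) pairs is a hypothetical stand-in for the data of Fig. 1b.

```python
# Each label (element tag, attribute name, or value) maps to the list of node Ids
# carrying that label, in document order.
from collections import defaultdict

nodes = [(7, "author"), (9, "jane"), (27, "author"), (47, "author"), (49, "jane")]

index = defaultdict(list)
for node_id, label in nodes:
    index[label].append(node_id)

print(index["author"])  # [7, 27, 47]
print(index["jane"])    # [9, 49]
```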
Positional Numberings, Edge Indexes
Now consider the case when the fragment of a twig query Q that needs to be looked up in the index is an edge labeled by an XPath axis step. For example, find
all author nodes that are descendants of the book node with Id = 1. As another example, find all author nodes that are following-siblings of the author node with Id = 7. Using Node Ids A simple solution is to use an inverted
index that associates each (node1 label, node2 label, XPath axis) triple to the list of all pairs of XML database node Ids that satisfy the specified node labels and axis relationship. For example, given the data of Fig. 1b, the list associated with the triple (book, author, descendant) would be [(1,7),(1,27), (1,47)], and the list associated with the triple (author, author, following-sibling) would be [(7,27),(7,47),(27,47)]. In general, this inverted index could be much larger than the number of nodes in the original XML database, especially for XPath axes such as descendant, following-sibling and following. To overcome this limitation, more sophisticated approaches are required. A popular approach has been to (i) use a positional numbering system to identify nodes in an XML database; and (ii) demonstrate that each XPath axis step corresponds to a predicate on the positional numbers of the corresponding nodes. Two such approaches, using Dewey numbering and using Interval numbering, are described next.
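The following sketch illustrates this naive edge index. The pair lists are the ones quoted above for Fig. 1b, and the optional bound_id argument mimics a lookup in which one query node has already been mapped to a database node.

```python
# A (label1, label2, axis) triple maps to all node-Id pairs satisfying that axis step.
# Building these lists requires a traversal of the database, omitted here.
edge_index = {
    ("book", "author", "descendant"): [(1, 7), (1, 27), (1, 47)],
    ("author", "author", "following-sibling"): [(7, 27), (7, 47), (27, 47)],
}

def lookup(label1, label2, axis, bound_id=None):
    """Return matching pairs, optionally keeping only those whose first node is bound_id."""
    pairs = edge_index.get((label1, label2, axis), [])
    return [p for p in pairs if bound_id is None or p[0] == bound_id]

print(lookup("book", "author", "descendant", bound_id=1))  # [(1, 7), (1, 27), (1, 47)]
```

As the surrounding text notes, materializing every pair can be much larger than the database itself, which is what motivates the positional numbering schemes described next.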
Using Dewey Numbering
An elegant solution for edge indexing is to associate each XML node n with its DeweyId, proposed by [14], and obtained as follows: (i) associate each node with a numeric Id ensuring that sibling nodes are given increasing numbers in a left-to-right order; and (ii) the DeweyId of a node is obtained by concatenating the Ids of all nodes along
XML Indexing. Figure 3. (a) Dewey numbering. (b) Interval numbering.
the path from the root node of n’s XML document to n itself (The similarity with the Dewey Decimal System of library classification is the reason for its name). Figure 3a shows the DeweyIds of some nodes in the XML tree, using the numeric Ids associated with those nodes in Fig. 1b. For example, the fn node with Id = 8 has DeweyId = 1.6.7.8. DeweyIds can be used to easily find matches to various XPath axis steps. In particular, node n2 is a descendant of node n1 if and only if (i) n1.DeweyId is a prefix of n2.DeweyId. For example, in Fig. 3a, the jane node with DeweyId = 1.6.7.8.9 is a descendant of the author node with DeweyId = 1.6.7. Similarly, node n2 is a child of node n1 if and only if (i) n2 is a descendant of n1; and (ii) n2’s DeweyId extends n1’s DeweyId by one Id. By maintaining DeweyIds of nodes in classical trie data structures, various XPath axis lookups can be done efficiently. The main limitation of DeweyIds is that their size depends on the depth of the XML tree, and can get quite large. This limitation is overcome by using Interval numbering, described next.
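The prefix tests can be written down directly. The sketch below uses the DeweyIds quoted from Fig. 3a and is only meant to illustrate the comparisons, not a trie-based implementation.

```python
# n2 is a descendant of n1 iff n1's DeweyId is a proper prefix of n2's;
# a child extends the parent's DeweyId by exactly one component.
def is_descendant(dewey1, dewey2):
    p1, p2 = dewey1.split("."), dewey2.split(".")
    return len(p1) < len(p2) and p2[:len(p1)] == p1

def is_child(dewey1, dewey2):
    return is_descendant(dewey1, dewey2) and \
        len(dewey2.split(".")) == len(dewey1.split(".")) + 1

# DeweyIds from Fig. 3a: author = 1.6.7, its fn child = 1.6.7.8, the value jane = 1.6.7.8.9
print(is_descendant("1.6.7", "1.6.7.8.9"))  # True
print(is_child("1.6.7", "1.6.7.8"))         # True
print(is_child("1.6.7", "1.6.7.8.9"))       # False
```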
Using Interval Numbering The position of an XML node n is represented as a 3-tuple: (LeftPos, RightPos, PLeftPos), where (i) numbers are generated in an increasing order by visiting each tree node twice in a left-to-right, depth-first traversal; n. LeftPos is the number generated before visiting any node in n’s subtree and n.RightPos is the number generated after visiting every node in n’s subtree; and (ii) n.PLeftPos is the LeftPos of n’s parent node (0 if n is the root node of DocId). Figure 3b depicts
the LeftPos and RightPos numbers of each node in the XML document. It can be seen that each XPath axis step between a pair of XML database nodes can be tested using a conjunction of equality and inequality predicates on the components of the 3-tuple. In particular, node n2 is a descendant of node n1 if and only if: (i) n1.LeftPos < n2.LeftPos; and (ii) n1.RightPos > n2.RightPos. An element n2 is a child of an element n1 if and only if n1.LeftPos = n2.PLeftPos. An element n2 is a following-sibling of an element n1 if and only if: (i) n1.RightPos < n2.LeftPos; and (ii) n1.PLeftPos = n2.PLeftPos. For example, in Fig. 3b, the jane node with interval number (9,10,8) is a descendant of the author node with interval number (7,16,6). Thus, the set of 3-tuples corresponding to the interval numbering of nodes of an XML database can be indexed using a 3-dimensional spatial index such as an R-tree, and the different XPath axis steps correspond to different regions within the 3-dimensional space. Variations of this approach have been considered in, e.g., [10,2,8]. Note that for both Dewey numbering and Interval numbering, one would need to leave gaps between numbers to allow for insertions of new nodes in the XML database [5].
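A sketch of these predicates, using the (LeftPos, RightPos, PLeftPos) tuples quoted from Fig. 3b:

```python
# Axis tests over interval numbers represented as (LeftPos, RightPos, PLeftPos) tuples.
def is_descendant(n1, n2):
    return n1[0] < n2[0] and n1[1] > n2[1]

def is_child(n1, n2):
    return n1[0] == n2[2]            # parent's LeftPos equals child's PLeftPos

def is_following_sibling(n1, n2):
    return n1[1] < n2[0] and n1[2] == n2[2]

author = (7, 16, 6)   # interval number of the author node in Fig. 3b
jane = (9, 10, 8)     # interval number of the jane value node
print(is_descendant(author, jane))  # True
print(is_child(author, jane))       # False (jane is two levels below author)
```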
Path Indexes
When the fragment of a twig query Q that needs to be looked up in the index is a subpath of a root-to-leaf path in Q, where some (possibly none) of the nodes have been mapped to specific nodes in the XML database, XML path indexes are very useful. The works in the literature have primarily focused on the case where
each edge in the subpath of Q is labeled by child, i.e., all matches are subpaths of root-to-leaf paths in an XML document. For example, a path index can be used to efficiently look up all matches to the path fragment author[ln =‘‘poe’’] of the twig query depicted in Fig. 2a. A framework by Chen et al. [3] is described next, which covers most existing XML path index structures, and solves the BoundIndex problem. Problem BoundIndex: Given an XML database D, a subpath query P with k node labels and each edge labeled by child, and a specific database node id n, return all k-tuples (n1,...,nk) that identify matches of P in D, rooted at node n. The framework of [3] requires each node in an XML document to be associated with a unique numeric identifier; this could be, e.g., the Id of the node in Fig. 1b. To create a path index, [3] conceptually separates a path in an XML document into two parts: (i) a schema path, which consists solely of schema components, i.e., element tags and attribute names; and (ii) a leaf value as a string if the path reaches a leaf. Schema paths can be dictionary-encoded using special characters (whose lengths depend on the dictionary size) as designators for the schema components. Most of the works in the literature have focused on indexing XML schema paths (see, e.g., [7,12,4]). Notable exceptions that also consider indexing data values at the leaves of paths include [6,17,3]. In order to solve the BoundIndex problem, one needs to explicitly represent paths that are arbitrary subpaths of the root-to-leaf paths, and associate each such path with the node at which the subpath is rooted. Such a relational representation of all the paths in an XML database is (HeadId, SchemaPath, LeafValue, IdList), where HeadId is the id of the start of the path, and IdList is the list of all node identifiers along the schema path, except for the HeadId. As an example, a fragment of the 4-ary relational representation of the data tree of Fig. 1b is given in Table 1; element tags have been encoded using boldface characters as designators, based on the first character of the tag, except for allauthors which uses U as its designator. Given the 4-ary relational representation of XML database D, each index in the family of indexes: (i) stores a subset of all possible SchemaPaths in D; (ii) stores a sublist of IdList; and (iii) indexes a
XML Indexing. Table 1. The 4-ary relation for path indexes

HeadId | SchemaPath | LeafValue | IdList
1      | B          | null      | []
1      | BT         | null      | [2]
1      | BT         | XML       | [2]
1      | BU         | null      | [6]
1      | BUA        | null      | [6,7]
1      | BUAF       | null      | [6,7,8]
1      | BUAF       | jane      | [6,7,8]
1      | BUAL       | null      | [6,7,12]
1      | BUAL       | poe       | [6,7,12]
...    |            |           |
6      | U          | null      | []
6      | UA         | null      | [7]
6      | UAF        | null      | [7,8]
6      | UAF        | jane      | [7,8]
6      | UAL        | null      | [7,12]
6      | UAL        | poe       | [7,12]
...    |            |           |
subset of the columns HeadId, SchemaPath, and LeafValue. Given a query, the index structure probes the indexed columns in (iii) and returns the sublist of IdList stored in the index entries. Many existing indexes fit in this framework, as summarized in Table 2. For example, the DataGuide [7] returns the last Id of the IdList for every root-to-leaf prefix path. Similarly, IndexFabric [6] returns the Id of either the root or the leaf element (first or last Id in IdList), given a root-to-leaf path and the value of the leaf element. Finally, the DATAPATHS index is a regular B+-tree index on the concatenation of HeadId, LeafValue and the reverse of SchemaPath (or the concatenation LeafValueHeadIdReverseSchemaPath), where the SchemaPath column stores all subpaths of root-toleaf paths, and the complete IdList is returned; the DATAPATHS index can solve the BoundIndex problem in one index lookup. Twig Indexes
ViST [16] and PRIX [13] are techniques that encode XML documents as sequences, and perform subsequence matching to look up all matches to twig queries.
XML Indexing. Table 2. Members of family of path indexes

Index             | Subset of SchemaPath       | Sublist of IdList     | Indexed Columns
Value [11]        | paths of length 1          | only last Id          | SchemaPath, LeafValue
Forward link [11] | paths of length 1          | only last Id          | HeadId, SchemaPath
DataGuide [7]     | root-to-leaf path prefixes | only last Id          | SchemaPath
Index Fabric [6]  | root-to-leaf paths         | only first or last Id | reverse SchemaPath, LeafValue
ROOTPATHS [3]     | root-to-leaf path prefixes | full IdList           | LeafValue, reverse SchemaPath
DATAPATHS [3]     | all paths                  | full IdList           | LeafValue, HeadId, reverse SchemaPath
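As a rough illustration of the framework summarized in Table 2, the sketch below builds a DATAPATHS-style lookup keyed on (HeadId, LeafValue, reversed SchemaPath). The handful of hard-coded rows, the one-character designator strings, and the choice to reverse the path in the key are simplifying assumptions, not the actual index layout of [3].

```python
# BoundIndex as a single dictionary probe: given a head node, an encoded schema path
# and a leaf value, return the stored IdLists (rows abbreviated from Table 1).
from collections import defaultdict

rows = [
    (1, "BUAL", "poe", [6, 7, 12]),
    (6, "UAL", "poe", [7, 12]),
    (6, "UAF", "jane", [7, 8]),
]

datapaths = defaultdict(list)
for head, path, leaf, ids in rows:
    datapaths[(head, leaf, path[::-1])].append(ids)

def bound_index(head_id, schema_path, leaf_value):
    """IdLists of matches of schema_path rooted at head_id that end in leaf_value."""
    return datapaths.get((head_id, leaf_value, schema_path[::-1]), [])

# Matches of author.ln = "poe" (encoded UAL) rooted at the allauthors node with Id 6:
print(bound_index(6, "UAL", "poe"))  # [[7, 12]]
```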
Key Applications XML Indexing is important for efficient XML query processing, both in relational implementations of XML databases and in native XML databases.
Future Directions It is important to investigate XML Path Indexes for the case of path queries with edge labels other than child, especially when different edges on a query path have different edge labels. Another extension worth investigating is to identify classes of twig queries that admit efficient XML Twig Indexes.
Data Sets University of Washington XML Repository: http://www.cs.washington.edu/research/ xmldatasets/.
Cross-references
▶ XML Document ▶ XML Tree Pattern, XML Twig Query ▶ XPath/XQuery ▶ XQuery Processors
Recommended Reading 1. Bertino E. and Kim W. Indexing techniques for queries on nested objects. IEEE Trans. Knowledge and Data Eng., 1 (2):196–214, 1989.
2. Bruno N., Koudas N., and Srivastava D. Holistic twig joins: optimal XML pattern matching. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2002, pp. 310–321. 3. Chen Z., Gehrke J., Korn F., Koudas N., Shanmugasundaram J., and Srivastava D. Index structures for matching XML twigs using relational query processors. Data Knowl. Eng., 60 (2):283–302, 2007. 4. Chung C.-W., Min J.-K., and Shim K. APEX: an adaptive path index for XML data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2002, pp. 121–132. 5. Cohen E., Kaplan H., and Milo T. Labeling dynamic XML trees. In Proc. ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, 2002, pp. 271–281. 6. Cooper B.F., Sample N., Franklin M.J., Hjaltason G.R., and Shadmon M. A fast index for semistructured data. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 341–350. 7. Goldman R. and Widom J. DataGuides: enabling query formulation and optimization in semistructured databases. In Proc. 23th Int. Conf. on Very Large Data Bases, 1997, pp. 436–445. 8. Grust T. Accelerating XPath location steps. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2002, pp. 109–120. 9. Kemper A. and Moerkotte G. Access support in object bases. ACM SIGMOD Rec., 19(2):364–374, 1990. 10. Kha D.D., Yoshikawa M., and Uemura S. An XML indexing structure with relative region coordinate. In Proc. 17th Int. Conf. on Data Engineering, 2001, pp. 313–320. 11. McHugh J. and Widom J. Query optimization for XML. In Proc. 25th Int. Conf. on Very Large Data Bases, 1999, pp. 315–326. 12. Milo T. and Suciu D. Index structures for path expressions. In Proc. 15th Int. Conf. on Data Engineering, 1999, pp. 277–295. 13. Rao P. and Moon B. PRIX: Indexing and querying XML using Pruffer sequences. In Proc. 20th Int. Conf. on Data Engineering, 2004, p. 288.
14. Tatarinov I., Viglas S., Beyer K., Shanmugasundaram J., Shekita E., and Zhang C. Storing and querying ordered XML using a relational database system. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2002, pp. 204–215. 15. Valduriez P. Join indices. ACM Trans. Database Syst., 12(2):218–246, 1987. 16. Wang H., Park S., Fan W., and Yu P. ViST: a dynamic index method for querying XML data by tree structures. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 110–121. 17. Yoshikawa M., Amagasa T., Shimura T., and Uemura S. XRel: a path-based approach to storage and retrieval of XML documents using relational databases. ACM Trans. Internet Tech., 1(1):110–141, 2001.
XML Information Integration ALON HALEVY Google Inc., Mountain View, CA, USA
Synonyms XML data integration; XML enterprise information integration
Definition Information integration systems offer uniform access to a set of autonomous and heterogeneous data sources. Sources can range from database systems and legacy systems to forms on the Web, web services and flat files. The data in the sources need not be completely structured as in relational databases. The number of sources in a information integration application can range from a handful to thousands. XML information integration systems are ones that based on an XML data model and query language.
Key Points XML played a significant role in the development of information integration, from the research and from the commercialization perspectives. The emergence of XML fueled the desire for information integration, because it offered a common syntactic format for sharing data. Once this common syntax was in place, organizations could start thinking about sharing data with each other (or even, within the organization) and integrating data from multiple sources. Of course, XML did very little to address the semantic integration issues – sources could still share XML files whose tags
were completely meaningless outside the application. However, the fact that XML is semi-structured was an advantage when modeling heterogeneous data that may have different schematic structure. From the technical perspective, several information integration systems were developed with an XML data model at their core. Developing these systems lead to many advances in XML data management, including the development of query and update languages, languages for schema mapping, algorithms for query containment and answering queries using views, and efficient processing of XML data streaming from external sources. Every aspect of information integration systems had to be reexamined with XML in mind. Interestingly, since the push to commercialize information integration came around the same time as the emergence of XML, many of the innovations mentioned above had to be developed by start-ups on a very short fuse.
Key Applications Some of the key applications of information integration are:
- Enterprise data management, querying across several enterprise data repositories
- Accessing multiple data sources on the web (and in particular, the deep web)
- Large scientific projects where multiple scientists are independently producing data sets
- Coordination across multiple government agencies
Cross-references
▶ Adaptive Query Processing ▶ Information Integration ▶ Model Management
Recommended Reading
1. Abiteboul S., Benjelloun O., and Milo T. The active XML project: an overview. VLDB J., 17(5):1019–1040, 2008.
2. Deshpande A., Ives Z., and Raman V. Adaptive Query Processing. Foundations and Trends in Databases. Now Publishers, 2007. www.nowpublishers.com/dbs.
3. Haas L. Beauty and the Beast: The theory and practice of information integration. In Proc. 11th Int. Conf. on Database Theory, 2007, pp. 28–43.
4. Halevy A.Y., Ashish N., Bitton D., Carey M.J., Draper D., Pollock J., Rosenthal A., and Sikka V. Enterprise information integration: successes, challenges and controversies. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005, pp. 778–787.
XML Information Retrieval ▶ INitiative for the Evaluation of XML retrieval (INEX)
XML Integrity Constraints MARCELO ARENAS Pontifical Catholic University of Chile, Santiago, Chile
Synonyms XML data dependencies
Definition An XML integrity constraint specifies a semantic restriction over the data stored in an XML document. A number of integrity constraint languages have been proposed for XML, which can be used to enforce different types of semantic restrictions. These proposals, together with some languages for specifying restrictions on the element structure of XML documents (e.g., DTD and XML Schema), are currently used to specify the schema of XML documents.
Historical Background The problem of defining and manipulating integrity constraints is one of the oldest problems in databases. Soon after the introduction of the relational model by Codd in the ’70s, researchers developed several languages for specifying integrity constraints, and studied many fundamental problems for these languages. In the relational model, a database is viewed as a collection of relations or tables. For instance, a relational database storing information about courses in a university is shown in Fig. 1. This database consists of a time-varying part, the data about courses, and a part considered to be time independent, the schema of the relations, which is given by the name of the relations (CourseInfo and CourseTerm) and the names of the attributes of the relations. Usually, the information contained in a database satisfies some semantic restrictions. For example, in the relation CourseInfo shown in Fig. 1, it is expected that only one title is associated to each course number.
By providing the schema of a relation, a syntactic constraint is specified (the structure of the relation), but a semantic constraint as the one mentioned above cannot be specified. To overcome this problem, it is necessary to specify separately a set of semantic restrictions. These restrictions are called integrity constraints, and they are expressed by using suitable languages. Although a number of integrity constraint languages were developed for relational databases, functional and inclusion dependencies are the ones used most often. A functional dependency is an expression of the form X → Y, where X and Y are sets of attributes. A relation satisfies X → Y if for every pair of tuples t1, t2 in it, if t1 and t2 have the same values on X, then they have the same values on Y. An inclusion dependency is an expression of the form R[X] ⊆ S[Y], where R and S are relation names, and X, Y are sets of attributes of the same cardinality. A relational database satisfies R[X] ⊆ S[Y] if for every tuple t1 in R, there exists a tuple t2 in S such that the values of t1 on X are the same as the values of t2 on Y. For example, relation CourseTerm shown in Fig. 1 satisfies functional dependency {Number, Section} → Room, since every section of a course is given in only one room, and relations CourseInfo and CourseTerm satisfy inclusion dependency CourseTerm[Number] ⊆ CourseInfo[Number], since every course number mentioned in CourseTerm is also mentioned in CourseInfo (and thus every course given in some term has a name). Integrity constraint languages have also been developed for more recent data models, such as nested relational and object-oriented databases. Functional and inclusion dependencies are also present in all these models. In fact, two subclasses of functional and inclusion dependencies, namely, keys and foreign keys, are most commonly found in practice. For the case of relational databases, a key dependency is a functional dependency of the form X → U, where U is the set of attributes of a relation, and a foreign key dependency is formed by an inclusion dependency R[X] ⊆ S[Y] and a key dependency Y → U, where U is the set of attributes of S. For example, Number → Number, Title is a key dependency for relation CourseInfo, while CourseTerm[Number] ⊆ CourseInfo[Number] and Number → Number, Title is a foreign key dependency. Keys and foreign keys are fundamental to conceptual database design, and are supported by
XML Integrity Constraints. Figure 1. Example of a relational database.
the SQL standard. They provide a mechanism by which objects can be uniquely identified and referenced, and they have proved useful in update anomaly prevention, query optimization and index design.
Foundations A number of integrity constraint specifications have been proposed for XML. The hierarchical nature of XML data calls for not only absolute constraints that hold on an entire document, such as dependencies found in relational databases, but also relative constraints that only hold on sub-documents, beyond what it is encountered in traditional databases. Here, both absolute and relative functional dependencies, keys, inclusion dependencies and foreign keys are considered for XML. In general, integrity constraints for XML are defined as restrictions on the nodes and data values reachable by following paths in some tree representation of XML documents. For example, Fig. 2 shows an XML document storing information about courses at the University of Toronto (Uof T), and Fig. 3 shows a tree representation of this document. Given an XML document D, element types t1,...,tn in D and a symbol ‘ such that ‘ is either an element type in D or an attribute in D or the reserved symbol text(), the string t1,...,tn.‘ is a path in D if there exist nodes u1,...,un in D such that (i) each ui is if type ti, (ii) each ui+1 is a child of ui, (iii) if ‘ is an element type, then there exists a node u of type ‘ that is a child of un, (iv) if ‘ is an attribute, then ‘ is defined for un, and (v) if ‘ = text(), then un has a child of type PCDATA. For example, student.taking, student.taking. cno, UofT.course.title and UofT.course.title. text() are all paths in the XML document shown in Figs. 2 and 3. Given a node u in an XML document D and a path p in D, reach(u, p) is defined to be the set of all nodes and values reachable by following p from u in D. Furthermore, if P is a regular expression over an alphabet consisting of the reserved symbol text() and the
element types and attributes in an XML document D, then a node v is reachable from a node u in D by following P, if there exists a string p in the regular language defined by P such that v 2 reach(u, p). The set of all such nodes is denoted by reach(u, P). For example, for the XML tree shown in Fig. 3, reach (u1, UofT.student) = {u2}, reach(u1, UofT.student. name) = {J. Smith}, reach(u1, UofT.course.title. text()) = {Compilers} and reach(u1,UofT.(student.taking+course). cno) = {CSC258,CSC309}. Keys and Functional Dependencies for XML
Absolute keys for XML were first considered by Fan and Simeo´n [7]. Let D be an XML document and t an element type in D. Then ext(t) is defined to be the set of all nodes of D of type t. For example, if D is the XML tree shown in Fig. 3, then ext(taking) = {u4, u5}. Moreover, given a node υ and a list of attributes X = [a1,...,ak] that are defined for υ, υ[X] is defined as [υ.a1,...,υ.ak], where υ.ai is the value of attribute ai for node υ. For example, u2[sno, name] is [st1, J. Smith] in the XML tree shown in Fig. 3. Absolute keys for XML are defined as follows [6,7]. An absolute key is an expression of the form t[X] ! t, where t is an element type and X is a nonempty set of attributes defined for t. An XML document D satisfies this constraint, denoted by D ⊨ t[X] ! t, if for every v1, v2 2 ext(t), if v1[X] = v2[X], then v1 = v2. Thus, t[X] ! t says that the X-attribute values of a t-element uniquely identify the element in ext(t). Notice that two notions of equality are used to define keys: value equality is assumed when comparing attributes, and node identity is used when comparing elements. The same symbol ‘ = ’ is used for both, as it will never lead to ambiguity. The following are typical keys for the XML document shown in Fig. 2: student[sno] ! student and course[cno] ! course. The first constraint says that student number (sno) is an identifier for students, and the second constraint says that course number
XML Integrity Constraints. Figure 2. Example of an XML document.
XML Integrity Constraints. Figure 3. Tree representation of the XML document shown in Fig. 2.
(cno) is an identifier for courses. It should be noticed that if courses in different departments can have the same course number, then key course[cno] ! course has to be replaced by course[cno, dept] ! course. Since XML documents are hierarchically structured, users may be interested in the entire document as well as in its sub-documents. The latter give rise to relative keys, that only hold on certain sub-documents. The extension of Fan and Simeo´n’s language to the relative case was done by Arenas et al. in [8]. This language is introduced below. Notation v1 ≺ v2 is used when v1 and v2 are two nodes in an XML document and v2 is a descendant of v1. For example, u1 ≺ u4 and u3 ≺ u7 in the XML tree shown in Fig. 3. A relative key is an expression of the form t(t1[X] ! t1), where t, t1 are element types and X is a nonempty set of attributes that is defined for t1. This constraint says that relative to each node v of type t, the set of attributes X is a key for all the t1-nodes that are descendants of v. That is, an XML document D satisfies this constraint, denoted by D⊨t(t1[X] ! t1), if for every v 2 ext(t), and for every v1, v2 2 ext(t1) such that v ≺ v1 and v ≺ v2, if v1[X] = v2[X], then v1 = v2. For example, the following is a typical relative key for the XML document shown in Fig. 2: course
(enrolled[sno] → enrolled). This constraint says that relative to the elements of type course, sno is an identifier for the elements of type enrolled. Thus, this constraint states that every student can enroll at most once in a given course. It should be noticed that the previous constraint cannot be replaced by the absolute key enrolled[sno] → enrolled, since this states that every student can enroll in at most one course. It should also be noticed that absolute keys are a special case of relative keys when t is taken to be the type of the root of the XML documents. For example, for the case of the XML document shown in Fig. 2, absolute key student[sno] → student is equivalent to relative key UofT(student[sno] → student). A more powerful language for expressing XML keys was introduced by Buneman et al. [4,3]. This language allows the definition of absolute and relative keys. More precisely, an absolute key is an expression of the form (P, {Q1,...,Qn}), where P, Q1,...,Qn are regular expressions. An XML document D satisfies this key if for every pair of nodes u, v ∈ reach(root, P), where root is the node identifier of the root of D, if reach(u, Qi) ∩ reach(v, Qi) ≠ ∅, for every i ∈ [1, n], then u and v are the same node. For example, a key dependency can be used to express that name is an identifier
for students in the University of Toronto database: (UofT.student, {name}). Notice that if a nested structure is used in this database to distinguish first names from last names: <student sno="st1"> <name> <first>John</first> <last>Smith</last> </name> ... </student> then to characterize name as an identifier for students, two paths are included in the right-hand side of the key dependency: (UofT.student, {name.first.text(), name.last.text()}). In [4,3], Buneman et al. defined a relative key as a pair of the form (P′, K), where P′ is a regular expression and K is an absolute key of the form (P, {Q1,...,Qn}). An XML document satisfies this key if every node reached from the root by following a path in P′ satisfies K, that is, for every u ∈ reach(root, P′) and for every v1, v2 ∈ reach(u, P), if reach(v1, Qi) ∩ reach(v2, Qi) ≠ ∅, for every i ∈ [1, n], then v1 and v2 are the same node. For example, a relative key constraint can be used to express that a student cannot take the same course twice in the XML database shown in Fig. 2: (UofT.student, (taking, {cno})). This key dependency is relative since two distinct students can take the same course. It should be noticed that every key t(t1[a1,...,ak] → t1) in Arenas et al.'s language [8] can be represented as (S*.t, (S*.t1, {a1,...,ak})) in Buneman et al.'s language [4,3], where S is the alphabet consisting of all the element types in an XML document. In [2], Arenas and Libkin introduced a functional dependency language for XML. In this language, a functional dependency over an XML document D is an expression of the form X → p, where X ∪ {p} is a set of paths in D. To define the notion of satisfaction for functional dependencies, Arenas and Libkin [2] used a relational representation of XML documents. Given an XML document D, let paths(D) be the set of all paths in D. Then a tree tuple over D is a mapping t that assigns to each path p ∈ paths(D) either a node identifier or a data value or the null value ⊥, in a way that is consistent with the tree representation of D. Formally, if p ∈ paths(D), the last symbol of p is an element type τ and t(p) ≠ ⊥, then (i) t(p) = u, where u is a node
identifier in D of type τ; (ii) if p′ is a prefix of p, then t(p′) is a node identifier that lies on the path from the root to u in D; (iii) if a is an attribute defined for u in D, then t(p.a) = u.a; and (iv) if p.text() is a path in D, then there is a child s of u of type PCDATA such that t(p.text()) = s. A tree tuple is maximal if it cannot be extended to another one by changing some nulls to either node identifiers or data values. The set of maximal tree tuples in D is denoted by tuples(D). Then functional dependency {p1,...,pm} → p is true in D if for every pair t1, t2 ∈ tuples(D), whenever t1(pi) = t2(pi) ≠ ⊥ for all i ≤ m, then t1(p) = t2(p) holds. For example, let D be the XML document shown in Figs. 2 and 3. In this database, there is at most one name associated with each student number, which can be represented by means of the functional dependency UofT.student.sno → UofT.student.name. XML document D satisfies this constraint, which can be proved formally by constructing tuples(D). Table 1 shows the two tree tuples contained in this set. These tuples satisfy UofT.student.sno → UofT.student.name since they have the same values in UofT.student.sno and also in UofT.student.name.
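A small sketch of this semantics follows, with tree tuples modeled as dictionaries from paths to values (None standing in for ⊥) and the two tuples abbreviated from Table 1; it is an illustration of the definition, not the algorithm of [2].

```python
# An XML functional dependency {p1,...,pm} -> p holds if any two tree tuples that
# agree on all (non-null) left-hand-side paths also agree on the right-hand-side path.
def satisfies_fd(tree_tuples, lhs_paths, rhs_path):
    for t1 in tree_tuples:
        for t2 in tree_tuples:
            if all(t1[p] == t2[p] and t1[p] is not None for p in lhs_paths):
                if t1[rhs_path] != t2[rhs_path]:
                    return False
    return True

t1 = {"UofT.student.sno": "st1", "UofT.student.name": "J. Smith",
      "UofT.student.taking.cno": "CSC258", "UofT.student.taking.grade": "A"}
t2 = {"UofT.student.sno": "st1", "UofT.student.name": "J. Smith",
      "UofT.student.taking.cno": "CSC309", "UofT.student.taking.grade": "B+"}

print(satisfies_fd([t1, t2], ["UofT.student.sno"], "UofT.student.name"))        # True
print(satisfies_fd([t1, t2], ["UofT.student.sno"], "UofT.student.taking.cno"))  # False
```

The failing second check shows why the dependency on grades below has to be made relative to the course as well as the student.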
One of the first XML integrity constraint languages was proposed by Abiteboul and Vianu [1]. They considered inclusion dependencies of the form P Q, where P and Q are regular expressions. An XML document D rooted at u satisfies this constraint if reach(u, P) reach(u,Q). For example, in the database shown in Fig. 2, the following constraint expresses that the set of courses taken by each student is a subset of the set of courses given by the university: UofT.student. taking.cno UofT.course.cno. An inclusion constraint P Q where P are Q are paths, like in the previous example, is called a path constraint [1]. In [5], Buneman et al. introduced a more powerful path constraint language. Given an XML document D and paths p1, p2, p3 in D, in this language a constraint is an
X
3596
X
XML Integrity Constraints
expression of either the forward form ∀x∀y (x ∈ reach(root, p1) ∧ y ∈ reach(x, p2) → y ∈ reach(x, p3)), where root represents the root of the XML document, or the backward form ∀x∀y (x ∈ reach(root, p1) ∧ y ∈ reach(x, p2) → x ∈ reach(y, p3)). This language can be used to express relative inclusion dependencies. For example, if the document shown in Fig. 2 is extended to store information about students and courses in many universities, <UofT> is replaced by <university>: ... Then the following dependency in Buneman et al.'s language [5] can be used to state that for each university, the set of courses taken by each one of its students is a subset of the set of courses given by that university: ∀x∀y (x ∈ reach(root, db.university) ∧ y ∈ reach(x, student.taking.cno) → y ∈ reach(x, course.cno)). This constraint is relative to each university and, thus, it cannot be expressed by using Abiteboul and Vianu's path constraints [1], as this language can only express constraints on the entire document. By using Abiteboul and Vianu's approach, it can only be said that if a student is taking a course, then this course is given in
some university: db.university.student.taking.cno ⊆ db.university.course.cno. A language for expressing absolute foreign keys for XML was proposed by Fan and Libkin in [6]. In this language, an absolute foreign key is an expression of the form t1[X] ⊆FK t2[Y], where t1, t2 are element types, and X, Y are nonempty lists of attributes defined for t1 and t2, respectively, and |X| = |Y|. This constraint is satisfied by an XML document D, denoted by D ⊨ t1[X] ⊆FK t2[Y], if D ⊨ t2[Y] → t2, and in addition for every v1 ∈ ext(t1), there exists v2 ∈ ext(t2) such that v1[X] = v2[Y]. The extension of Fan and Libkin's proposal to the relative case was done by Arenas et al. in [8]. In this proposal, a relative foreign key is an expression of the form t(t1[X] ⊆FK t2[Y]), where t, t1, t2 are element types, X and Y are nonempty lists of attributes defined for t1 and t2, respectively, and |X| = |Y|. It indicates that for each node v of type t, X is a foreign key of descendants of v of type t1 that references a key Y of t2-descendants of v. That is, an XML document D satisfies this constraint, denoted by D ⊨ t(t1[X] ⊆FK t2[Y]), if D ⊨ t(t2[Y] → t2) and for every v ∈ ext(t) and every v1 ∈ ext(t1) such that v ≺ v1, there exists v2 ∈ ext(t2) such that v ≺ v2 and v1[X] = v2[Y].
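The absolute versions of these conditions can be sketched as follows, over a hypothetical list of (node Id, element type, attributes) triples; the relative variants would additionally restrict ext(t1) and ext(t2) to descendants of a given node.

```python
# ext(t) is the set of nodes of element type t; keys and foreign keys are checked
# by comparing attribute-value tuples, as in the definitions above.
def satisfies_key(nodes, t, X):
    """t[X] -> t: the X-attribute values uniquely identify each node in ext(t)."""
    seen = {}
    for nid, typ, attrs in nodes:
        if typ == t:
            val = tuple(attrs[a] for a in X)
            if val in seen and seen[val] != nid:
                return False
            seen[val] = nid
    return True

def satisfies_foreign_key(nodes, t1, X, t2, Y):
    """t1[X] references key t2[Y]: Y is a key of t2, and every X-value occurs as a Y-value."""
    if not satisfies_key(nodes, t2, Y):
        return False
    targets = {tuple(a[y] for y in Y) for _, typ, a in nodes if typ == t2}
    return all(tuple(a[x] for x in X) in targets for _, typ, a in nodes if typ == t1)

nodes = [(2, "student", {"sno": "st1"}), (3, "course", {"cno": "CSC258"}),
         (7, "enrolled", {"sno": "st1"})]
print(satisfies_key(nodes, "student", ["sno"]))                               # True
print(satisfies_foreign_key(nodes, "enrolled", ["sno"], "student", ["sno"]))  # True
```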
Key Applications XML integrity constraints are essential to schema design, query optimization, efficient storage, index
XML Integrity Constraints. Table 1. Maximal tree tuples in the XML tree shown in Fig. 3

Path                       | t1               | t2
UofT                       | u1               | u1
UofT.student               | u2               | u2
UofT.student.sno           | st1              | st1
UofT.student.name          | J. Smith         | J. Smith
UofT.student.taking        | u4               | u5
UofT.student.taking.cno    | CSC258           | CSC309
UofT.student.taking.grade  | A                | B+
UofT.course                | u3               | u3
UofT.course.cno            | CSC258           | CSC258
UofT.course.dept           | Computer Science | Computer Science
UofT.course.title          | u6               | u6
UofT.course.title.text()   | Compilers        | Compilers
UofT.course.enrolled       | u7               | u7
UofT.course.enrolled.sno   | st1              | st1
design, data integration, and in data transformations between XML and relational databases.
Cross-references
▶ Database Dependencies ▶ Functional Dependency ▶ Key
Recommended Reading
1. Abiteboul S. and Vianu V. Regular path queries with constraints. In Proc. 16th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 1997, pp. 122–133.
2. Arenas M. and Libkin L. A normal form for XML documents. In Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2002, pp. 85–96.
3. Buneman P., Davidson S., Fan W., Hara C., and Tan W.C. Reasoning about keys for XML. In Proc. 8th Int. Workshop on Database Programming Languages, 2001, pp. 133–148.
4. Buneman P., Davidson S., Fan W., Hara C., and Tan W.C. Keys for XML. In Proc. 10th Int. World Wide Web Conference, 2001, pp. 201–210.
5. Buneman P., Fan W., and Weinstein S. Path constraints in semistructured and structured databases. In Proc. 17th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 1998, pp. 129–138.
6. Fan W. and Libkin L. On XML integrity constraints in the presence of DTDs. J. ACM, 49(3):368–406, 2002.
7. Fan W. and Siméon J. Integrity constraints for XML. In Proc. 19th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2000, pp. 23–34.
8. Arenas M., Fan W., and Libkin L. On the complexity of verifying consistency of XML specifications. SIAM J. Comput., 38(3):841–880, 2008.
XML Message Brokering ▶ XML Publish/Subscribe
XML Metadata Interchange MICHAEL WEISS Carleton University, Ottawa, ON, Canada
Synonyms
X
3597
integration of tools, repositories, applications, and data warehouses. The framework defines rules for generating XML schemas from a metamodel based on the Metaobject Facility (MOF). XMI is most frequently used as an interchange format for UML, although it can be used with any MOF-compliant language.
Key Points The motivation for introducing XMI was the need to provide a standard way through which UML tools could exchange UML models. XMI produced by one tool can generally be imported by another tool, which allows exchange of models among tools by different vendors, or the exchange of models with other types of tools upstream or downstream the tool chain. As stated above, XMI is not limited to mapping UML to XML, but it provides rules to generate DTDs or XML schemas and XML documents from any MOFcompliant language. Thus, a model that conforms to some MOF-compliant metamodel can be translated to an XML document that conforms to a schema generated according to the rules of the XMI standard. However, since XMI represents UML models in XML, XMI is more widely applicable. For example, XMI has been used for the analysis of models and for the integration of applications, both internal and with external parties. The key idea underlying XMI is to provide rules for generating schemas from a metamodel. These include the mapping of model attributes to XML elements and attributes, and model associations to containment relationships between XML elements or to links. Notably, XMI maps inheritance to composition in XML, in which elements and attributes from superclasses in the model are ‘‘copied down’’ into the XML document, as XML lacks inheritance.
Cross-references
▶ Extensible Markup Language ▶ Metamodel ▶ Meta Object Facility ▶ UML
XMI
Recommended Reading
Definition
2.
XMI (XML Metadata Interchange) is an XML-based integration framework for the exchange of models, and, more generally, any kind of XML data. XMI is used in the
3.
1.
Carlson D. Modeling XML Applications with UML. AddisonWesley, Reading, MA, 2001. Grose T., Doney G., and Brodsky S. Mastering XMI. Wiley, New York, 2002. OMG, MOF 2.0/XMI Mapping, version 2.1.1, 2007, http://www. omg.org/spec/XMI/2.1.1
X
3598
X
XML Parsing, SAX/DOM
XML Parsing, SAX/DOM
Chengkai Li, University of Texas at Arlington, Arlington, TX, USA
Definition XML parsing is the process of reading an XML document and providing an interface to the user application for accessing the document. An XML parser is a software apparatus that accomplishes such tasks. In addition, most XML parsers check the well-formedness of the XML document and many can also validate the document with respect to a DTD (Document Type Definition) or XML schema. Through the parsing interface, the user application can focus on the application logic itself, without dwelling on the tedious details of XML. There are mainly two categories of XML programming interfaces, DOM (Document Object Model) and SAX (Simple API for XML). DOM is a tree-based interface that models an XML document as a tree of nodes, upon which the application can search for nodes, read their information, and update the contents of the nodes. SAX is an event-driven interface. The application registers with the parser various event handlers. As the parser reads an XML document, it generates events for the encountered nodes and triggers the corresponding event handlers. Recently, there have been newly proposed XML programming interfaces such as pull-based parsing, e.g., StAX (Streaming API for XML), and data binding, e.g., JAXB (Java Architecture for XML Binding).
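To make the two interfaces concrete, the following minimal sketch parses the same document once with DOM and once with SAX using the standard JAXP factories. The file name and the element name being searched for are illustrative assumptions, not part of the original entry.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DomVsSax {
    public static void main(String[] args) throws Exception {
        File doc = new File("message.xml");   // placeholder document

        // DOM: the parser materializes the whole document as a tree,
        // which can then be searched and updated in memory.
        Document tree = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(doc);
        NodeList titles = tree.getElementsByTagName("title");   // placeholder element name
        System.out.println("DOM found " + titles.getLength() + " <title> elements");

        // SAX: the application registers event handlers; the parser calls them
        // back while reading, so nothing is kept beyond what the handler stores.
        SAXParserFactory.newInstance().newSAXParser().parse(doc, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                System.out.println("start element: " + qName);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                // text content arrives incrementally as "characters" events
            }
        });
    }
}
```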
Historical Background DOM (Document Object Model) was initially used for modeling HTML (HyperText Markup Language) by various Web browsers. As inconsistencies existed among the individual DOM models adopted by different browsers, inter-operability problems arose in developing browser-neutral HTML codes. W3C (World Wide Web Consortium) standardized and released DOM Level 1 specification on October 1, 1998, with support for both HTML and XML. DOM Level 2 was released in November 2000 and added namespace support. The latest specification DOM Level 3 was released in April 2004. SAX (Simple API for XML) was developed in late 1997 through the collaboration of several implementers of early XML parsers and many other members of the
XML-DEV mailing list. The goal was to create a parserindependent interface so that XML applications can be developed without being tied to the proprietary API (Application Programming Interface) of any specific parser. SAX1 was finalized and released on May 11, 1998. The latest release SAX2, finalized in May 2000, includes namespace support.
Foundations
XML parsing is the process of reading an XML document and providing an interface to the user application for accessing the document. An XML parser is a software apparatus that accomplishes such tasks. In addition, most XML parsers check the well-formedness of the XML document and many can also validate the document with respect to a DTD (Document Type Definition) [2] or XSD (XML Schema (W3C)) [11]. Through the parsing interface, the user application can focus on the application logic itself, without dwelling on the tedious details of XML, such as Unicode support, namespaces, character references, well-formedness, and so on.

XML Programming Interfaces
Most XML parsers can be classified into two broad categories, based on the types of API that they provide to the user applications for processing XML documents. Document Object Model (DOM): DOM is a treebased interface that models an XML document as a tree of various nodes such as elements, attributes, texts, comments, entities, and so on. A DOM parser maps an XML document into such a tree rooted at a Document node, upon which the application can search for nodes, read their information, and update the contents of the nodes. Simple API for XML (SAX): SAX is an event-driven interface. The application receives document information from the parser through a ContentHandler object. It implements various event handlers in the interface methods in ContentHandler, and registers the ContentHandler object with the SAX parser. The parser reads an XML document from the beginning to the end. When it encounters a node in the document, it generates an event that triggers the corresponding event handler for that node. The handler thus applies the application logic to process the node specifically. The SAX and DOM interfaces are quite different and have their respective advantages and disadvantages. In general, DOM is convenient for random
access to arbitrary places in an XML document, can not only read but also modify the document, although it may take a significant amount of memory space. To the contrary, SAX is appropriate for accessing local information, is much more memory efficient, but can only read XML. DOM is not memory efficient since it has to read the whole document and keep the entire document tree in memory. The DOM tree can easily take as much as ten times the size of the original XML document [4]. Therefore it is impossible to use DOM to process very large XML documents, such as the ones that are bigger than the memory. In contrast, SAX is memory efficient since the application can discard the useless portions of the document and only keep the small portion that is of interests to the application. A SAX parser can achieve constant memory usage thus easily handle very large documents. SAX is appropriate for streaming applications since the application can start processing from the beginning, while with DOM interface the application has to wait till the entire document tree is built before it can do anything. DOM is convenient for complex and random accesses that require global information of the XML document, whereas SAX is more suited for processing local information coming from nodes that are close to each other. The document tree provided by DOM contains the entire information of the document, therefore it allows the application to perform operations involving any part of the document. In comparison, SAX provides the document information to the application as a series of events. Therefore it is difficult for the application to handle global operations across the document. For such complex operations, the application would have to build its own data structure to store the document information. The data structure may become as complex as the DOM tree. Since DOM maintains information of the entire document, its API allows the application to modify the document or create a new document, while SAX can only read a document. DOM and SAX are the two standard APIs for processing XML documents. Most major XML parsers support them. There are also alternative tree-based XML models that were designed to improve upon DOM,
including JDOM and DOM4J. In addition to DOM and SAX, other types of APIs for processing XML documents have emerged recently and are supported by various parsers. Two examples are pull-based parsing and Java data binding.
Pull-Based Parsing: Evolving from XMLPULL, StAX (Streaming API for XML) also works in a streaming fashion, similar to SAX. Different from SAX, where the parser pushes document information to the application, StAX enables the application to pull information from the parser. This API is more natural and convenient to the programmer since the application takes full control in processing the XML document.
Java Data Binding: A data-binding API provides the mapping between an XML document and Java classes. It can construct Java objects from the document (unmarshalling) or build a document from the objects (marshalling). Accessing and manipulating XML documents thus become natural and intuitive, since such operations are performed through the methods of the objects. The application can thus focus on the semantics of the data themselves instead of the details of XML. JAXB (Java Architecture for XML Binding) is a Java specification based on this idea.

Validating Parsers
In addition to accessing XML documents, another critical functionality of XML parsers is to validate the correctness of the documents. Given that XML is a popular data model for data representation and exchange over the Internet, the correctness of XML documents is important for applications to work properly. It is difficult to let the application itself handle incorrect documents that it does not expect. Fortunately, most major XML parsers have the ability to validate the correctness of XML documents. The correctness of an XML document can be defined at several levels. At the bottom, the document should follow the syntax rules of XML. Such a document is called a well-formed document. For example, every non-empty element in a well-formed document should have a pair of starting tag and ending tag. Furthermore, the document should conform to certain semantic rules defined for the application domain, such as a ‘‘state’’ node containing one and only one ‘‘capital’’ node. Such semantic rules can be defined by XML schema specifications such as DTD or XSD (XML Schema (W3C)). An XML document that complies with a
schema is called a valid document. Finally, the application may enforce its own specific semantic rules. In principle all parsers are required to perform the mandated checks of well-formedness, although there are parser implementations that do not. A parser that checks the validity of XML documents with respect to an XML schema, in addition to their well-formedness, is a validating parser. Schema validation is supported by both the DOM and SAX APIs. Most major XML parsers are validating parsers, although some may turn off schema validation by default.
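The following minimal sketch shows how a validating parse is typically requested through the standard JAXP validation API; the schema and document file names are placeholders.

```java
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class ValidateAgainstXsd {
    public static void main(String[] args) throws Exception {
        // Compile the schema once; it can be reused across documents.
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new File("states.xsd")));

        Validator validator = schema.newValidator();
        try {
            // Throws SAXException if the document violates the schema,
            // e.g., a "state" element without exactly one "capital" child.
            validator.validate(new StreamSource(new File("states.xml")));
            System.out.println("document is valid");
        } catch (SAXException e) {
            System.out.println("document is invalid: " + e.getMessage());
        }
    }
}
```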
XML Parsing Performance
There are relatively few studies in the literature on the performance of XML parsing. However, as the first step in every application that processes XML documents, parsing can easily become the bottleneck of application performance. DOM is memory intensive since it has to hold the entire document tree in memory, making it incapable of handling very large documents. Therefore, efforts have been made to improve DOM parser performance by exploiting lazy XML parsing [7]. The key idea is to avoid loading unnecessary portions of the XML document into the DOM tree. It consists of two stages: the pre-parsing stage builds a virtual DOM tree, and the progressive parsing stage expands the virtual tree with concrete contents when they are needed by the application. Farfán et al. [3] further extended the idea to reduce the cost of the pre-parsing stage by partitioning an XML document, thus only reading a partition into the DOM tree when it is needed. Nicola et al. [6] investigated several real-world XML applications where the performance of SAX parsers is a key obstacle to the success of the projects. They further verified that schema validation can add significant overheads, which can sometimes even be several times more expensive than parsing itself. One reason for such low efficiency is the division of the parsing and validation steps: in conventional parsers these two steps are separate because validation often requires the entire document to be in memory, and thus has to wait until parsing is finished. Therefore, even for a SAX parser, the advantage of memory efficiency is lost. To cope with this challenge, there have been studies on integrating parsing and validation into a schema-specific parser [1,9,12]. For example, [1] constructs a push-down automaton to combine parsing and validation. Van Engelen et al. [10] use deterministic finite state automata (DFA) to integrate them; the DFA is built from the schema according to mapping rules. Kostoulas et al. [5] further apply compilation techniques to optimize such integrated parsers. Takase et al. [8] explore a different way to improve parser performance: it memorizes parsed XML documents as byte sequences and reuses previous parsing results when the byte sequence of a new XML document partially matches the memorized sequences.

Key Applications
Every XML application has to parse an XML document before it can access the information in the document and perform further processing. Therefore, XML parsing is a critical component in XML applications.
URL to Code
List of XML parsing interfaces:
DOM, http://www.w3.org/DOM/
JDOM, http://jdom.org/
DOM4J, http://dom4j.org/
SAX, http://www.saxproject.org/
StAX, http://jcp.org/en/jsr/detail?id=173
XMLPULL, http://www.xmlpull.org/
JAXB, http://www.jcp.org/en/jsr/detail?id=222

List of XML parsers:
Ælfred, http://saxon.sourceforge.net/aelfred.html
Crimson, http://xml.apache.org/crimson/
Expat, http://expat.sourceforge.net/
JAXP, https://jaxp.dev.java.net/
Libxml2, http://xmlsoft.org/index.html
MSXML, http://msdn.microsoft.com/en-us/library/ms763742.aspx
StAX Reference Implementation (RI), http://stax.codehaus.org/
Sun's Stax implementation, https://sjsxp.dev.java.net/
XDOM, http://www.philo.de/xml/
Xerces, http://xerces.apache.org/
Cross-references
▶ XML ▶ XML Attribute ▶ XML Document ▶ XML Element ▶ XML Programming ▶ XML Schema
Recommended Reading
1. Chiu K., Govindaraju M., and Bramley R. Investigating the limits of SOAP performance for scientific computing. In Proc. 11th IEEE Int. Symp. on High Performance Distributed Computing, 2002, pp. 246–254.
2. Document Type Declaration, http://www.w3.org/TR/REC-xml/#dt-doctype
3. Farfán F., Hristidis V., and Rangaswami R. Beyond lazy XML parsing. In Proc. 18th Int. Conf. Database and Expert Syst. Appl., 2007, pp. 75–86.
4. Harold E.R. Processing XML with Java(TM): a Guide to SAX, DOM, JDOM, JAXP, and TrAX. Addison-Wesley, MA, USA, 2002.
5. Kostoulas M., Matsa M., Mendelsohn N., Perkins E., Heifets A., and Mercaldi M. XML screamer: an integrated approach to high performance XML parsing, validation and deserialization. In Proc. 15th Int. World Wide Web Conference, 2006, pp. 93–102.
6. Nicola M. and John J. XML parsing: a threat to database performance. In Proc. Int. Conf. on Information and Knowledge Management, 2003, pp. 175–178.
7. Noga M., Schott S., and Löwe W. Lazy XML processing. In Proc. 2nd ACM Symp. on Document Engineering, 2002, pp. 88–94.
8. Takase T., Miyashita H., Suzumura T., and Tatsubori M. An adaptive, fast, and safe XML parser based on byte sequences memorization. In Proc. 14th Int. World Wide Web Conference, 2005, pp. 692–701.
9. Thompson H. and Tobin R. Using finite state automata to implement W3C XML schema content model validation and restriction checking. In Proc. XML Europe, 2003, pp. 246–254.
10. Van Engelen R. Constructing finite state automata for high performance XML web services. In Proc. Int. Symp. on Web Services, 2004, pp. 975–981.
11. XML Schema (W3C), http://www.w3.org/XML/Schema
12. Zhang W. and Van Engelen R. A table-driven streaming XML parsing methodology for high-performance web services. In Proc. IEEE Int. Conf. on Web Services, 2006, pp. 197–204.
XML Persistence ▶ XML Storage

XML Process Definition Language
Nathaniel Palmer, Workflow Management Coalition, Hingham, MA, USA

Synonyms
XPDL

Definition
The primary standards body for workflow management and business process interoperability.
Key Points
The XML Process Definition Language (XPDL) is a format standardized by the Workflow Management Coalition to interchange business process definitions between different modeling tools, BPM suites, workflow engines and other software applications. XPDL defines an XML schema for specifying the declarative part of a workflow. XPDL is designed to exchange the process design, both the graphics and the semantics of a workflow business process. XPDL contains elements to hold the X and Y position of the activity nodes, as well as the coordinates of points along the lines that link those nodes. XPDL provides the serialization for BPMN. This distinguishes XPDL from BPEL, which is also a process definition format but does not contain elements to represent the graphical aspects of a process diagram; BPEL focuses exclusively on the executable aspects of the process.
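As a hedged illustration of how an application might read the graphical information described above, the sketch below extracts activity names and node coordinates from an XPDL file with the standard DOM API. The element and attribute names (Activity, Coordinates, XCoordinate, YCoordinate) reflect the author's understanding of XPDL 2.x and should be checked against the schema version actually in use.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class ReadXpdlLayout {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(new File("process.xpdl"));

        // Assumed XPDL 2.x layout elements: Activity with nested
        // NodeGraphicsInfo/Coordinates carrying XCoordinate/YCoordinate.
        NodeList activities = doc.getElementsByTagNameNS("*", "Activity");
        for (int i = 0; i < activities.getLength(); i++) {
            Element activity = (Element) activities.item(i);
            NodeList coords = activity.getElementsByTagNameNS("*", "Coordinates");
            if (coords.getLength() > 0) {
                Element c = (Element) coords.item(0);
                System.out.println(activity.getAttribute("Name") + " at ("
                        + c.getAttribute("XCoordinate") + ", "
                        + c.getAttribute("YCoordinate") + ")");
            }
        }
    }
}
```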
Cross-references
▶ Business Process Execution Language ▶ Business Process Modeling Notation ▶ Workflow Model ▶ Workflow Schema
XML Programming
Peter M. Fischer, ETH Zurich, Zurich, Switzerland
Synonyms
XML application development
Definition
XML programming [2] covers methods and approaches to process, transform and modify XML data, often within the scope of a larger application which uses imperative programming languages. Similar to database programming, an important issue in XML programming is the impedance mismatch between the existing programming models, which are mostly based on an object-oriented data model and use an
imperative style, and XML programming approaches, which are based on an XML data model, and apply various programming styles. A plethora of XML programming approaches exists, driven by different usage patterns of XML in applications. The XML programming approaches can be classified into three areas: (i) XML APIs to existing languages, (ii) XML extensions of existing programming languages, and (iii) Native XML processing languages. The varying sets of XML programming requirements and XML programming approaches make it impossible to declare a clearly preferable approach. Careful analysis by the application designer is needed to determine which technique is best suited for a particular setting.
Historical Background The need for XML programming arose soon after XML had been established as a simplified derivative of SGML, capable of representing any kind of semistructured data. Historically, three areas were influential to shape the directions in XML programming: Low-level APIs oriented towards document parsing: XML being regarded as a structured document format, a popular way to program XML is based on letting a document parser transform the XML into some internal structure of the target programming language, based on concepts of the areas of compiler construction. This approach had already been used for HTML with great success, making DOM the default API for client-side web programming. Document Transformation languages are a second influence also based on the document nature of XML. Such languages specify rules to apply to specific fragments of a document and generate new document parts out of the matched fragments and modification/creation statements. A well-known transformation language out of the SGML world is DSSSL, which is often used in the context of DocBook for scientific document processing. Database programming: a third influence stems from treating XML as a generic data representation format that, in turn, can be maintained and queried in a similar way as relational data. In database programming, two different data models (relational/ OO) and two different programming styles (declarative/imperative) need to be reconciled as well. Four main directions were developed:
1. Call-level interfaces: a literal string of the database language is given as a parameter to a function of the imperative host language. This function is used to interface with the DBMS and returns a result that can be turned into data types of the host language. Typical examples of these are ODBC or JDBC (see the sketch after this list).
2. Embedded SQL: the query expressions of the relational language are embedded in the host application code, allowing for better type checking and hiding the details of the actual interfacing to the DBMS.
3. Automatic mapping layers: instead of modeling both the application with (usually) object-oriented methodologies (UML) and the database with relational methodologies (E/R) and writing expressions in both the database and the host language, only the application is modeled and the code for the application is written. The necessary database schema and the query expressions to retrieve and insert data into the DBMS are automatically generated from the application. The database aspect is hidden and development is simplified. The drawbacks are the lack of flexibility and often lower performance than with a separate database design.
4. Procedural extensions to SQL: an inverse approach to hiding the database is to integrate the application into the database, thus benefitting from the stability and scalability of the DBMS environment while exploiting performance benefits by being "close" to the data. For this approach, SQL has been extended with imperative constructs to enhance its expressive power (Turing-completeness) and make the development style more suitable to general applications. Well-known examples of such SQL extensions are PL/SQL (Oracle) or Transact-SQL (Sybase, Microsoft).
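As a concrete illustration of the first direction, the sketch below issues a query through JDBC, the canonical call-level interface; the connection URL, credentials, and table are hypothetical. XML call-level interfaces such as XQJ follow the same pattern, with XQuery strings in place of SQL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CallLevelExample {
    public static void main(String[] args) throws Exception {
        // Connection URL, credentials, and table are hypothetical.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/demo", "user", "secret");
             // The query is handed to the DBMS as a plain string ...
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT name, balance FROM accounts WHERE balance > ?")) {
            stmt.setInt(1, 1000);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // ... and the result is converted back into host-language types.
                    System.out.println(rs.getString("name") + ": " + rs.getInt("balance"));
                }
            }
        }
    }
}
```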
Foundations

Specific Requirements of XML Programming
While there are certain similarities to related areas like database programming (in particular, the impedance mismatch), XML programming has its own special set of requirements and challenges, which can be traced back to two specific areas: (i) the advantages and deficiencies of XML as data and programming model,
and the resulting differences to other, established programming methodologies, and (ii) the widely varying and non-uniform use of XML in terms of usage and operations.

Conceptual Aspects of XML
XML has a clear set of advantages that set it apart from other approaches used to represent data, and led to its rapid acceptance. A determining factor of the success of XML has been the independence from particular vendors and platforms. This advantage has been further strengthened by a large number of high-quality tools for XML technologies that are often available under permissive licenses, making the integration of XML technology into existing or new applications easy. As a result, knowledge about XML and related technologies is available freely. The XML syntax is both human readable and machine readable, avoiding problems that occur if a format is only specialized for one way of interpretation, such as unstructured text or complex binary encodings. XML is not just a document format, but includes methods and technologies for metadata description, RPC, workflow management, declarative querying, security, document processing and many more. These technologies and methods cover a large part of the required features for application development and ensure high interoperability not just among the classes of XML technology, but also among the applications building on them. From a data and application design point of view, XML provides a number of benefits that make it a good choice over competing approaches: in contrast to relational or object-oriented approaches, data and its interpretation are decoupled. This decoupling allows writing code that works on schema-less data, but does not prohibit adding schema when needed. Working with schema-less data is particularly helpful to shorten time-to-market times when building new applications, as the time-consuming and tedious schema design phase can be shortened. Similarly, it is helping with long-lived data (common in many business settings) where the code is already outdated, but the data will still be needed for new applications. The benefit of not being forced to have a schema is being further enhanced by the fact that the semi-structured model of XML allows representing a large spectrum of data ‘‘shapes,’’ reaching from unstructured data like
annotated text to highly structured data like a relational table. The XML syntax and data models are not limited to represent data, but are also used to represent metadata and code, allowing uniform management and modifications. XML, however, does have limitations that reduce its usefulness for certain applications: many standardization efforts, as well as many development efforts, follow a bottom-up approach by defining small entities with low complexity. While this approach ensures that the individual standards and components are easy to understand and use, a combination of standards or components does not always cover all required aspects, thus leaving room for interpretation and incompatibility. A more serious conceptual problem is the limitation of the data model towards tree structures, making it difficult to express arbitrary graph structures or N:M relations. This issue is being aggravated because there is not a commonly accepted way to specify references in XML data. Another important deficiency is the lack of standard design methodology. UML and Entity-RelationshipModeling have greatly helped to establish the concepts of object-oriented and relational technologies, respectively. By using them, modeling and developing applications are greatly simplified. While not being a deficiency of XML per se, the mismatch between XML concepts and programminglanguage/database concepts complicates the use of XML and makes programming for it more difficult: since the data model of XML is neither object-based nor relational, translation needs to be done from and to XML, keeping it outside the usual type system and program analysis of the existing environments. The ability of XML to both work without any schema and represent many shapes of data makes direct mapping of arbitrary XML instances to a strict object-based structure (or to relational schema) often impossible, forcing the use of generic and less useable APIs. The object-oriented approach of hiding the data and binding the methods tightly to this data are juxtaposed to the nature of XML where data are explicitly exposed and accessible to many different methods. Differences in XML Usage
The second main issue in programming is the wide variety of ways in which XML is used. This usage includes the type of content that an XML instance represents, the operations that are performed on the data and
some non-functional properties that have an impact on the processing such as size, structure, persistence etc. XML is used to represent a large number of different types of content, each leading to different functional requirements, favoring specific approaches for XML programming. The first, and currently most popular, class of using XML is as a document storage format. Typical examples are XHTML, office documents (OOXML, ODF), and graphics formats (SVG). These files are usually used to be presented in a human readable format by an application, transformed into another document format, modified and stored again. A second class is database content, either in forms of document collections or representing complex data in XML format. Here, the focus is on maintaining a large data set of XML and being able to successfully retrieve matching data. A third class is about program code, metadata and configuration data, where XML is used to determine the behavior of a data processing system. A fourth class is to use it as a communication format, in cases such as SOAP, or REST, putting the focus of transferring data or state in a loosely coupled, yet efficient way from one system to the other. A fifth and upcoming class is to use XML for log files, event streams or scientific data streams, requiring the analysis of the data by correlating data or detecting event patterns. On these types of content, different operations are performed, which in turn are better supported by certain XML programming approaches: Next to the complete retrieval of the XML content, limited or full-scale query operations such as filtering/selection, projection or the joins are common operations. A typical operation for many scenarios is also the creation of new XML data. Updating existing XML data is common in database settings, document management, code and metadata. More specialized operations are full text search, relevant for document collections and databases and trigger processing and event generation, which are commonly used either in databases or data streams. In addition to the types of content and the operations, a number of non-functional properties have an impact on which XML programming approach to use in particular scenarios. Having large volumes of XML requires an approach that supports the relevant technologies for scaling well, but for small volumes such an approach might be too heavyweight. Similarly, dealing with persistent data requires different, more elaborate
approaches if only temporary data are used. Using very structured data allows different optimization in storage and processing of XML than using unstructured data, since implementation techniques out of either relational databases or text processing are more appropriate. Again, different XML programming approaches are better suited for one or the other. Similarly, some approaches can deal better with data that is read only, append only or freely updateable. Additional Classification Criteria
Next to the criteria that come from the basic properties of XML and the variety of the XML usage scenarios, there are additional classification criteria that – in the widest sense – are concerned with architecture, ‘‘user experience’’ of the approaches, typing, compatibility and performance. The integration of XML processing into the architecture of an application is an important aspect of an XML programming approach. The traditional approach is to use XML just as the input and/or serialization format and leave the architecture of the application unaffected. A second, more disruptive but also more powerful approach, is to use an XML type inside the type system of the host programming language. This type is usually also augmented by more or less powerful expressions that work on it. The third, and most intrusive approach, is to use an XML data model and a native XML expression language throughout the whole application. These three approaches show an increasing amount of disruptiveness, forcing developers and architects to give up on the existing knowledge on application development. On the other hand, the three approaches also show an increasing ability to utilize the advantages of XML, such as schema-less processing while reducing the impedance mismatch between the XML world and the application. The first approach allows a programmer to have high productivity quickly, since existing programming knowledge can be re-used, but on the long run the complications of dealing with the impedance mismatch in this approach make the second and the third approach a more compelling option. Compliance to W3C standards is important when interoperating with other XML-based applications, but certain standards, such as XML Schema, can add a significant amount of complexity to an approach. Closely related are the aspects of the XML data model (e.g., Infoset, XDM, proprietary), type support for the XML data
(general node types versus full schema types) and the support for static type analysis, because an extended XML type support adds complexity to an API, but allows for more strict correctness checking and better optimizations. Important aspects of the acceptance of an XML programming approach are performance and optimizability. Low-level approaches that are not deeply integrated into the architecture tend to have the best performance on the short run, as developers can choose their access pattern on the data and optimize a program written in a language they are familiar with. On the long run, declarative solutions with a uniform data and expression model and strong typing support hold the most promise, as they can shift the burden of optimizing from the programmer to optimizers built into the system. Approaches to XML Programming
A large number of approaches to develop XML-based applications exist, they are influenced by document processing and database programming methods. In this section, the approaches are clustered along the amount of change to the architecture and programming style they require compared to object-oriented/ imperative programming: when making a decision on a specific approach to use, the main trade-off is how much of the XML advantages should/need to be used compared to the disruption caused by moving away from the well-known application development environments. The three main classes of approaches are XML interfaces to existing languages, XML extensions to existing languages and XML-oriented/native XML programming languages. The first class, interfaces to existing languages, provides methods to maintain all of the program logic and complexity inside the established imperative/objectoriented languages. XML is treated like any other source of outside data by limiting its impact to an adaptation layer/API and representing XML as instances of the native, non-extended type system, e.g., as tree of node objects. By doing so, the impact on existing programming models is kept low, but at the cost of either having very generic APIs with low productivity or limited flexibility. The impedance mismatch in the data model and the expressions as well as the purely imperative programming style limits the possibilities for optimization. This first class can further be broken down into two subclasses, generic XML-oriented APIs and
schema-driven code generation. Generic XML-oriented APIs do not require any knowledge of the particular structure of an XML instance, thus representing XML as generic, low-level objects in the host language such as trees of nodes, event sequences or node item sequences. The majority of these APIs provide low level, parseroriented programming interfaces such as DOM, SAX and StaX. They support a limited set of querying operations such as selection or projection, the creation of new XML and updates to existing XML; any more highlevel functionality needs to be implemented in the imperative language. This allows for a large amount of generality and possibly high performance, since access and processing can be tailored to the needs of a specific application. This advantage, however, comes at a high price: the generic and low-level nature of the APIs cause low developer productivity. There also exist call-level interfaces to XML-oriented programming languages or database system such as XQJ, which allow shifting/ moving of some XML-oriented operations outside the imperative program. This frees the developer from lowlevel work, but requires learning and understanding of a separate expression language with a different data model and different semantics. Schema-driven code generation increases the abstraction level and developer productivity by automatically creating high-level host programming language objects representing specific, typed XML, e.g., by turning an XML item representing ‘‘person’’ data into a person object. This object can then be accessed and manipulated by the methods the object exposes, e.g., a method to get the name of a person. By doing so, the level of abstraction is increased, developers do not need to learn many details about XML and can stay within their well-understood object-oriented/imperative world, all leading to higher productivity. Schema knowledge (e.g., DTD, XML schema or an ad-hoc format) is needed to perform this automatic code generation, restricting this approach to scenarios where the XML instances are highly structured and this structure information is known in advance. This limits the flexibility of the code generation approach, as it reduces XML to a representation format of host language objects. Well-known examples of such codegeneration approaches are XML Beans or JAXB. The second class of XML programming approaches, XML extensions to existing programming languages, still maintains all the logic and complex application code in the imperative/OO programming
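The following hedged sketch illustrates the data-binding style with JAXB (javax.xml.bind, bundled with older JDKs and available as a separate dependency in newer ones). In practice the Person class would be generated from a DTD or XML Schema; here it is hand-annotated so the example is self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

public class BindingExample {

    // In practice this class would be generated from a DTD or XML Schema.
    @XmlRootElement(name = "person")
    public static class Person {
        @XmlElement public String name;
        @XmlElement public int age;
    }

    public static void main(String[] args) throws Exception {
        JAXBContext ctx = JAXBContext.newInstance(Person.class);

        // Unmarshalling: XML document -> typed host-language object.
        Person p = (Person) ctx.createUnmarshaller().unmarshal(
                new StringReader("<person><name>Ann</name><age>42</age></person>"));
        System.out.println(p.name + " is " + p.age);

        // Marshalling: object -> XML document.
        StringWriter out = new StringWriter();
        ctx.createMarshaller().marshal(p, out);
        System.out.println(out);
    }
}
```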
language, but extends the host language to represent XML as a first-class data type of this language. This extension of the type system is often accompanied by expressions that work on that new type, which can range from accessor methods to (limited) declarative querying possibilities. The XML type system support is often aligned with the properties of the host language type system, thus reducing the mismatch, but also limiting the compatibility with W3C standards (e.g., XML Schema not being implemented). The tighter integration with the host programming language increases programmer productivity, without restricting the flexibility as much as schema-driven code generation. This second class can again be broken down into subclasses: XML as a native programming language type, XML as a native ORDBMS database type and query capabilities inside the programming language. Adopting XML as a native programming language data type is an approach that has been taken especially by web-oriented programming languages such as Javascript/Ecmascript and PHP, because using XML interfaces was not considered sufficient any more. XML is a first-rate data type that can be constructed inside the program or read from a file. Instances of this XML data type can be accessed by functions that incorporate a subset of path navigation with a syntax that resembles the access to a field or a class member, e.g., x.balance to access the balance child of the XML variable x in XML extension of Javascript, E4X. While the mismatch to the host language is being reduced, a mismatch to W3C XML standards is often created. The imperative nature of the languages often limits the optimizability. Using XML as native DB data type tries to achieve similar effects in the relational/object-relation database scenario, since there also exist impedance mismatches between the relational space and the XML space. Adding XML as a column type provides one possible solution to store XML in relational database system. On the language side, SQL/XML [3] blends SQL (as a relational query language) with XQuery as an XML query language by providing methods to map from one data model to the other, and embed XQuery expressions into SQL. This approach takes advantage of the DBMS infrastructure including triggers, transactional support, scalability, clustering, and reliability. Since both languages are declarative and there is mapping between the data models, global optimization over
both XML and relational expression is possible. SQL/ XML is supported by the major database vendors, making it well-known and providing good tool support and documentation. The drawbacks include the high overhead of loading all the XML data into the database, which is not useful for temporary XML or small volumes of data, and the complexity and cost of running a database server. The combination of two different query languages with different syntax, semantics and data models hinders the productivity of developers. Adding query capabilities inside existing imperative language takes imperative languages a step closer to the database and increases the productivity, flexibility and optimizability of an existing programming language. The most prominent example of including such capabilities is the LinQ extension of the Microsoft .NET framework, which is based on the COmega research prototype [4]. On top of XML data type extensions and basic accessors eventwell (as described in the previous cases) and similar extension for relation data, LinQ provides query capabilities over all supported data models. These query capabilities come in the form of explicit query operators similar to relational algebra including collection-oriented selection and projection, joins, grouping etc., that work on all data types. On top of these operations, LinQ provides declarative queries similar in style to the SQL SelectFrom-Where. LinQ provides a good integration of the different data models and with the rest of the language, including the libraries, and yields a high productivity for developers familiar with .NET. Significant drawbacks are the lack of support for typed XML, the limited scope of static analysis, and the dominance of imperative constructs in the language, making database-style optimizations like lazy evaluation, streaming and indexing hard to do automatically. The third class, XML-oriented programming languages, handles all application logic in an environment that is based purely on an XML data model and expressions working on this data model. Doing so represents a significant disruption from conventional languages, including giving up existing tools and design approaches. Compliance to W3C standard is high as is productivity related to XML processing. Many of these languages are, however, not designed to be generalpurpose programming languages, lacking libraries and support for areas like GUI programming or numerical computations.
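As a concrete illustration of the "XML as a native database type" approach described above, the sketch below embeds an SQL/XML XMLQUERY call in an SQL statement issued through JDBC. The connection URL, the customers table, and its XML column info are hypothetical, and the exact SQL/XML syntax accepted varies between database vendors.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLXML;
import java.sql.Statement;

public class SqlXmlExample {
    public static void main(String[] args) throws Exception {
        // URL, table "customers", and XML column "info" are hypothetical;
        // SQL/XML syntax details differ between database vendors.
        try (Connection conn = DriverManager.getConnection("jdbc:db2://localhost/demo");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT XMLQUERY('$d/customer/name' PASSING info AS \"d\") " +
                     "FROM customers WHERE id = 42")) {
            while (rs.next()) {
                SQLXML xml = rs.getSQLXML(1);   // JDBC 4.0 XML value type
                System.out.println(xml.getString());
            }
        }
    }
}
```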
This third class can again be split into three subclasses: domain-specific languages, expression languages with an XML type system, and XML scripting. Domain-specific languages utilize XML types and expression for a specific task that is often related to one particular use of XML. Clearly, such a language only works well within its intended design domain, but the close match to XML technologies and the restriction to relevant concepts facilitate high productivity. A very prominent example is BPEL, a workflow description language with XML syntax, which is used to ‘‘orchestrate’’ Web Services. It is transparent/agnostic to the actual XML data model, query language and expression language by just focusing on the control flow expressions needed in workflow environments. It provides a high abstraction level and many useful concepts needed in workflow environment. The separation of control flow and the actual expressions restricts the optimizability even in its intended domain. Languages with an XML Type System are designed to handle operations on XML data in the most useable and efficient way, but not as general programming languages. The best known languages are XPath 2.0, XSLT 2.0 and XQuery. All of these languages work with a consistent data model (XDM) and provide high compliance to other W3C standards. The operations covered-include querying XML instances, construction of new XML instances (XQuery+XSLT), updating existing XML instances (XQuery) and full text search (XQuery). XSLT carries an XML syntax and uses a recursive pattern approach of document transformation language. It works best for small volumes of temporary XML data that are of unknown structure. XQuery uses a declarative style based in iterations and support for static type analysis. It works well for structured data and provides the facilities for good optimizability. Implementations exist for persistent and temporary data, large and small volumes of data in database systems, and as stand-alone expression engines. XML Scripting languages represent a recent development where a declarative XML type language (often XQuery) is extended with imperative constructs. The goal is to gain usability and a certain level of expressiveness by allowing (limited) side effects while still maintaining optimizability. Such a language could be used for at all tiers of XML software stacking, proving the potential for a truly XML-only application development. Major challenges in the long run will be developer acceptance and the building of good
compilers/optimizers to actually achieve the potential performance benefits. Compared to imperative languages with query capabilities, the chances for optimizability are better, since the starting point is already a well-optimizable language. The most prominent example of such an XML scripting language is [1].
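To show how an imperative host application typically hands work to one of the XML-native languages discussed here, the sketch below applies an XSLT stylesheet through the JAXP transformation API that ships with the JDK. File names are placeholders, and the JDK's built-in processor implements XSLT 1.0 only; an external engine would be needed for XSLT 2.0 features.

```java
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ApplyStylesheet {
    public static void main(String[] args) throws Exception {
        // Compile the (placeholder) stylesheet and apply it to an input document.
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("render.xsl")));
        t.transform(new StreamSource(new File("order.xml")),
                    new StreamResult(new File("order.html")));
        System.out.println("transformation finished");
    }
}
```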
Key Applications
XML programming is relevant for almost all usage scenarios of XML that go beyond simple storage and retrieval. The large variety of use cases for XML is reflected in the different requirements for XML programming. A very common case is web application programming, where the browser, as the client layer, and the database server layer both contain and manipulate XML data. Web services, as a way of loosely coupling applications via XML messages, have become popular for information and application integration. An area in between these two use cases is mashups, where data and services from different sources are combined to form new applications.
Future Directions
A promising future direction is the integration of continuous/streaming XML data into "regular" programming environments. Many data stream sources already produce their data as XML, and this data needs to be combined with other streaming or static data. Semantic querying, which is now handled using RDF/OWL, can be brought towards more standard XML programming models, thus enabling more effective ways for data integration. The interaction between XML integrity constraints and programming will become more important, as reasoning over programs and data (including verification) will improve the quality of applications. Automatic maintenance of a well-defined state (similar to relational integrity constraints in RDBMSs) will simplify programming XML applications in data-intensive environments. Native XML languages have the potential to change the architecture for many data-intensive applications which currently are built in a multi-tier fashion. Using the same data model and expressions allows collapsing layers/tiers when needed, setting the main direction of partitioning not along the tiers, but along services and their respective data.
Cross-references
▶ Active XML ▶ AJAX ▶ Composed Services and WS-BPEL
▶ Entity Relationship Model ▶ Java Database Connectivity ▶ Open Database Connectivity ▶ SOAP ▶ Unified Modeling Language ▶ W3C ▶ Web Services ▶ XML ▶ XML Information Integration ▶ XML Integrity Constraints ▶ XML Parsing, SAX/DOM ▶ XML Process Definition Language ▶ XML Schema ▶ XML Stream Processing ▶ XPath/XQuery ▶ XQuery Full-Text ▶ XQuery Processors ▶ XSL/XSLT
Recommended Reading
1. Chamberlin D., Carey M.J., Fernandez M., Florescu D., Ghelli G., Kossmann D., Robie J., and Simeon J. XQueryP: an XML application development language. In Proc. XML 2006 Conference, 2006.
2. Florescu D. and Kossmann D. Programming for XML. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2006, p. 801.
3. Funderburk J.E., Malaika S., and Reinwald B. XML programming with SQL/XML and XQuery. IBM Syst. J., 41(4):642–665, 2002.
4. Meijer E., Schulte W., and Bierman G. Unifying tables, objects and documents. In Proc. Workshop on Declarative Programming in the Context of Languages, 2003.

XML Publish/Subscribe
Yanlei Diao, University of Massachusetts Amherst, MA, USA
Michael J. Franklin, University of California-Berkeley, Berkeley, CA, USA

Synonyms
Selective XML dissemination; XML filtering; XML message brokering

Definition
As stated in the entry "Publish/Subscribe over Streams," publish/subscribe (pub/sub) is a many-to-many communication model that directs the flow of messages from senders to receivers based on receivers' data interests. In this model, publishers (i.e., senders) generate messages without knowing their receivers; subscribers (who are potential receivers) express their data interests, and are subsequently notified of the messages from a variety of publishers that match their interests. XML publish/subscribe is a publish/subscribe model in which messages are encoded in XML and subscriptions are written in an XML query language such as a subset of XQuery 1.0. (In the context of XML pub/sub, "messages" and "documents" are often used interchangeably.) In XML-based pub/sub systems, the message brokers that serve as central exchange points between publishers and subscribers are called XML message brokers.

Historical Background
As described in "Publish/Subscribe over Streams," XML pub/sub has emerged as a solution for loose coupling of disparate systems at both the communication and content levels. At the communication level, pub/sub enables loose coupling of senders and receivers based on the receivers' data interests. With respect to content, XML can be used to encode data in a generic format that senders and receivers agree upon due to its flexible, extensible, and self-describing nature; this way, senders and receivers can exchange data without knowing the data representation in individual systems.

Foundations
XML pub/sub raises many technical challenges due to the requirements of large-scale publish/subscribe (as described in the entry "Publish/Subscribe over Streams"), the volume of XML data, and the complexity of XML processing. Among them, two challenges are highlighted below.
XML stream processing. In XML-based pub/sub systems, XML data continuously arrives from external sources, and user subscriptions, stored as continuous queries in a message broker, are evaluated every time a new data item is received. Such XML query processing is referred to as stream-based. In cases where incoming messages are large, stream-based processing also needs to start before the messages are completely received in order to reduce the delay in producing results.
Handling simultaneous XML queries. Compared to XML stream systems, a distinguishing aspect of XML pub/sub systems lies in the size of their query populations. All the queries stored in an XML message broker are simultaneously active and need to be matched efficiently with each incoming XML message. While multi-query processing has been studied for relational databases and relational streams, the complexity of XML processing, including structure matching, predicate evaluation, and transformation, requires new techniques for efficient multi-query processing in this new context.
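The first of these two challenges can be made concrete with the standard StAX pull API, which lets a broker start evaluating subscriptions on each parsing event before the rest of the message has arrived; the file name and element name below are placeholders.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingMatch {
    public static void main(String[] args) throws Exception {
        // In a broker the stream would arrive over the network; a file stands in here.
        try (InputStream in = new FileInputStream("incoming-message.xml")) {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            while (reader.hasNext()) {
                // Each event can be evaluated against the stored subscriptions
                // immediately, before the rest of the message has been received.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "pubdata".equals(reader.getLocalName())) {   // placeholder element
                    System.out.println("saw a <pubdata> start tag");
                }
            }
            reader.close();
        }
    }
}
```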
Foundation of XML Stream Processing for Publish/Subscribe
Event-based parsing. Since XML messages can be used to encode data of immense sizes (e.g., the equivalent of a database's worth of data), efficient query processing requires fine-grained processing upon arrival of small constituent pieces of XML data. Such fine-grained XML query processing can be implemented via an event-based API. A well-known example is the SAX interface that reports low-level parsing events incrementally to the calling application. Figure 1 shows an example of how a SAX interface breaks down the structure of the sample XML document into a linear sequence of events. "Start document" and "end document" events mark the beginning and the end of the parse of a document. A "start element" event carries information such as the name of the element and its attributes. A "characters" event reports a text string residing between two XML tags. An "end element" event corresponds to an earlier "start element" event and marks the close of that element. To use the SAX interface, the application receiving the events must implement handlers to respond to different events. In particular, stream-based XML processors can use these handlers to implement event-driven processing.
An automata-based approach. A popular approach to event-driven XML query processing is to adopt some form of finite automaton to represent path expressions [1,13]. This approach is based on the observation that a path expression (a small, common subset of XQuery) written using the axes ("/", "//") and node tests (element name or "*") can be transformed into a regular expression. Thus, there exists a finite automaton that accepts the language described by such a path expression [11]. Such an automaton can be created by mapping the location steps of the path expression to the automaton states. Figure 2 shows an example automaton created for a simple path expression, where the two concentric circles represent the accepting state. As XML messages arrive, they are parsed with an event-based parser, and the events raised during parsing are used to drive the execution of the automaton. In particular, "start element" events drive the automaton through its various transitions, and "end element" events cause the execution to backtrack to the previous states. A path expression is said to match a message if, during parsing, the accepting state for that path is reached.

XML Filtering
In XML filtering systems, user queries are written using path expressions that can specify constraints over both structure and content of XML messages. These queries are applied to individual messages (hence, stateless processing). Query answers are ‘‘yes’’ or ‘‘no’’ – computing only Boolean results in XML filtering avoids complex issues of XML query processing related to multiple matches such as ordering and duplicates, hence enabling simplified, high-performance query processing. XFilter [1], the earliest XML filtering system, considers matching of the structure of path expressions and explores indexing for efficient filtering. It builds a dynamic index over the queries and uses the parsing events of a document to probe the query index. This approach quickly results in a smaller set of queries that can be potentially matched by a message, hence avoiding processing queries for which the message is irrelevant. Built over the states of query automata, the dynamic index identifies the states that the execution of these automata is attempting to match at a particular moment. The content of the index constantly changes as parsing events drive the execution of the automata. YFilter [4] significantly improves over XFilter in two aspects. By creating a separate automaton per query, XFilter can perform redundant work when significant commonalities exist among queries. Based on this insight, YFilter supports sharing in processing by using a combined automaton to represent all path expressions; this automaton naturally supports shared representation of all common prefixes among path expressions. Furthermore, the combined automaton is implemented as a Nondeterministic Finite Automaton (NFA) with two practical advantages: i) a relatively small number of states required to represent even
XML Publish/Subscribe. Figure 1. An example XML document and results of SAX parsing.
large numbers of path expressions and complex queries (e.g., with multiple wildcards ‘‘*’’ and descendent axes ‘‘//’’), and ii) incremental maintenance of the automaton upon query updates. Results of YFilter show that its shared path matching approach can offer order-ofmagnitude performance improvements over XFilter while requiring only a small maintenance cost. Structure matching that XFilter considers is one part of the XML filtering problem; another significant part is the evaluation of predicates that are applied to path expressions (e.g., addressing attributes of elements, text data of elements, or even other path expressions) for additional filtering. Since shared structure matching has been shown to be crucial for performance, YFilter supports predicate evaluation using post-processing of path matches after shared structuring matching, and further leverages relational processing in such post-processing. Figure 3 shows two example queries and their representation in YFilter. Q1 contains a root element ‘‘/nitf’’ with two nested paths applied to it. YFilter decomposes the query into two linear paths ‘‘/nitf/head/ pubdata[@edition.area=‘‘SF’’],’’ and ‘‘/nitf//tobject.subject[@tobject.subject.type =‘‘Stock’’].’’ The structural part of these paths is represented using the NFA with the common prefix ‘‘/nitf’’ shared between the paths. The accepting states of these paths are state 4 and state 6, where the network of operators (represented as boxes) for the remainder of Q1 starts. At the bottom of the network, there is a selection (s) operator above each accepting state to handle the value-based predicate in the corresponding path. To handle the correlation between
XML Publish/Subscribe. Figure 2. A path expression and its corresponding finite automaton.
the two paths (e.g., the requirement that it should be the same ‘‘nitf’’ element that makes these two paths evaluate to true), YFilter applies a join (. /) operator after the two selections. Q2 is similar to Q1 and hence shares a significant portion of its representation with Q1. Index-Filter [2] builds indexes over both queries and streaming data. The index over data speeds up the processing of large documents while its construction overhead may penalize the processing of small ones. Results of a comparison between Index-Filter and YFilter show that Index-Filter works better when the number of queries is small or the XML document is large, whereas YFilter’s approach is more effective for large numbers of queries and short documents. XMLTK [8] converts YFilter’s NFA to a Deterministic Finite Automaton (DFA) to further improve the filtering speed. A straightforward conversion could theoretically result in severe scalability problems due to an explosion in the number of states. This work, however, shows that such explosion can be avoided in many cases by using lazy construction of the DFA and placing certain restrictions on the types of documents and queries supported (when suitable for the application). XPush [9] further explores a pushdown
automaton for shared processing of both structure and value-based constraints. Such an automaton can provide high efficiency when wildcard ("*") and descendant ("//") operators are rare in queries and periodic reconstruction of the automaton can be used. FiST [14] views path expressions with predicates as twig patterns and considers ordered twig pattern matching. For such ordered patterns, it transforms the patterns as well as the XML documents into sequences using Prüfer's method. This approach allows holistic matching of ordered twig patterns, as opposed to matching individual paths and then merging their matches during post-processing in YFilter (which works for both ordered and unordered patterns), resulting in significant performance improvements over YFilter.

XML Filtering and Transformation
XML Filtering and Transformation
The XML filtering solutions presented above have not addressed the transformation of XML messages for customized result delivery, which is an important feature in XML-based data exchange and dissemination. For XML transformation, queries are written using a richer subset of XQuery, e.g., the For-Where-Return expressions. To support efficient transformation for many simultaneous queries, YFilter [5] further extends its NFA-based framework and develops alternatives for building transformation functionality on top of shared path matching. It explores the tradeoff between shared path matching and post-processing for result customization, by varying the extent to which
they push paths from the For-Where-Return expressions into the shared path matching engine. To further reduce the significant cost of post-processing of individual queries, it employs provably correct optimizations based on query and DTD (if available) inspection to eliminate unnecessary operations and choose more efficient operator implementations for post-processing. Moreover, it provides techniques for also sharing post-processing across multiple queries, similar to those in continuous query processing over relational streams.
Stateful XML Publish/Subscribe
In [10], efficient processing of large numbers of continuous inter-document queries over XML streams (hence, stateful processing) is addressed. The key idea that it exploits is to dissect query specifications into tree patterns evaluated within individual documents and value-based joins performed across documents. While employing existing path evaluation techniques (e.g., YFilter) for tree pattern evaluation, it proposes a scalable join processor that leverages relational joins and view materialization to share join processing among queries.
XML Routing
As described in ‘‘Publish/Subscribe Over Streams,’’ distributed pub/sub systems need to efficiently route messages from their publishing sites to the brokers hosting relevant queries for complete query processing. While the concept of content-based routing and many architectural solutions can be applied in XML-based
XML Publish/Subscribe. Figure 3. Example queries and their representation in YFilter.
pub/sub systems, routing of XML messages raises additional challenges due to the increased complexity of XML query processing. Aggregating user subscriptions into compact routing specifications is a core problem in XML routing. Chan et al. [3] aggregate tree pattern subscriptions into a smaller set of generalized tree patterns such that (i) a given space constraint on the total size of the subscriptions is met, and (ii) the loss in precision (due to aggregation) during document filtering is minimized (i.e., a constrained optimization problem). The solution employs tree-pattern containment and minimization algorithms and makes effective use of document-distribution statistics to compute a precise set of aggregate tree patterns within the allotted space budget. ONYX [6] leverages YFilter technology for efficient routing of XML messages. While subscriptions can be written using For-Where-Return expressions, the routing specification for each output link at a broker consists of a disjunctive normal form (DNF) of absolute linear path expressions, which generalizes the subscriptions reachable from that link while avoiding expensive path operations. These routing specifications can be efficiently evaluated using YFilter, even with some work shared with complete query processing at the same broker. To boost the effectiveness of routing, ONYX also partitions the XQuery-based subscriptions among brokers based on exclusiveness of data interests. Gong et al. [7] introduce Bloom filters into XML routing. The proposed approach treats a path query as a string and maps all query strings into a Bloom filter using hash functions. The routing table is composed of multiple Bloom filters. Each incoming XML message is parsed into a set of candidate paths that are mapped using the same hash functions and compared with the routing table. This approach can filter XML messages efficiently with relatively small numbers of false positives. Its benefits in efficiency and routing table maintenance are significant when the number of queries is large.
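A rough sketch of the Bloom-filter idea follows (a simplified illustration with assumed parameters, not the implementation of [7]): the path queries behind each output link are hashed into that link’s Bloom filter, an incoming message’s candidate root-to-leaf paths are hashed the same way, and the message is forwarded on every link whose filter may contain one of them.

import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, s):
        for i in range(self.k):                    # k independent hash functions via salting
            h = hashlib.sha1(f"{i}:{s}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, s):
        for p in self._positions(s):
            self.bits[p] = 1

    def might_contain(self, s):                    # false positives possible, no false negatives
        return all(self.bits[p] for p in self._positions(s))

# Routing table: one Bloom filter per output link, built from the path queries behind that link.
routing_table = {"link1": BloomFilter(), "link2": BloomFilter()}
routing_table["link1"].add("/nitf/head/pubdata")
routing_table["link2"].add("/order/item/price")

# An incoming XML message is parsed into its candidate root-to-leaf paths...
candidate_paths = ["/nitf/head/pubdata", "/nitf/body/p"]
# ...and forwarded on every link whose filter may contain one of them.
forward_to = [link for link, bf in routing_table.items()
              if any(bf.might_contain(p) for p in candidate_paths)]
print(forward_to)   # ['link1'] (up to false positives)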
Key Applications
Personalized News Delivery. News providers are adopting XML-based formats (e.g., News Industry Text Format [12]) to publish news articles online. Given articles marked up with XML tags, a pub/sub-based news delivery service allows users to express a wide variety of interests as well as to specify which portions of the relevant articles (e.g., title and abstract only) should be returned. Really Simple Syndication (RSS) provides similar yet simpler services based on URL- and/or keyword-based preferences.
Application Integration. XML publish/subscribe has been widely used to integrate disparate, independently developed applications into new services. Messages exchanged between applications (e.g., purchase orders and invoices) are encoded in a generic XML format. Senders publish messages in this format. Receivers subscribe with specifications on the relevant messages and the transformation of relevant messages into an internal data format for further processing.
Mobile Services. In mobile applications, clients run a multitude of operating systems and can be located anywhere. Information exchange between information providers and a huge, dynamic collection of heterogeneous clients has to rely on open, XML-based technologies and can be further facilitated by pub/sub technology, including filtering and transformation for adaptation to wireless devices.
Data Sets
XML data repository at University of Washington, http://www.cs.washington.edu/research/xmldatasets/, and Niagara experimental data, http://www.cs.wisc.edu/niagara/data.html.
URL to Code
YFilter is an XML filtering engine that processes simultaneous queries (written in a subset of XPath 1.0) against streaming XML messages in a shared fashion. For each XML message, it returns a result for every matched query (http://yfilter.cs.umass.edu/). ToXgene is a template-based generator for large, consistent collections of synthetic XML documents (http://www.cs.toronto.edu/tox/toxgene/). XMark is an XQuery benchmark suite to analyze the capabilities of an XML database (http://www.xmlbenchmark.org/).
Cross-references
▶ Continuous Queries ▶ Publish/Subscribe Over Streams ▶ XML ▶ XML Document ▶ XML Parsing ▶ XML Schema ▶ XML Stream Processing ▶ XPath/XQuery
Recommended Reading
1. Altinel M. and Franklin M.J. Efficient filtering of XML documents for selective dissemination of information. In Proc. 26th Int. Conf. on Very Large Data Bases, 2000, pp. 53–64.
2. Bruno N., Gravano L., Koudas N., and Srivastava D. Navigation- vs. index-based XML multi-query processing. In Proc. 19th Int. Conf. on Data Engineering, 2003, pp. 139–150.
3. Chan C.Y., Fan W., Felber P., Garofalakis M.N., and Rastogi R. Tree pattern aggregation for scalable XML data dissemination. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 826–837.
4. Diao Y., Altinel M., Zhang H., Franklin M.J., and Fischer P.M. Path sharing and predicate evaluation for high-performance XML filtering. ACM Trans. Database Syst., 28(4):467–516, 2003.
5. Diao Y. and Franklin M.J. Query processing for high-volume XML message brokering. In Proc. 29th Int. Conf. on Very Large Data Bases, 2003, pp. 261–272.
6. Diao Y., Rizvi S., and Franklin M.J. Towards an Internet-scale XML dissemination service. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 612–623.
7. Gong X., Qian W., Yan Y., and Zhou A. Bloom filter-based XML packets filtering for millions of path queries. In Proc. 21st Int. Conf. on Data Engineering, 2005, pp. 890–901.
8. Green T.J., Gupta A., Miklau G., Onizuka M., and Suciu D. Processing XML streams with deterministic automata and stream indexes. ACM Trans. Database Syst., 29(4):752–788, 2004.
9. Gupta A.K. and Suciu D. Stream processing of XPath queries with predicates. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 419–430.
10. Hong M., Demers A.J., Gehrke J., Koch C., Riedewald M., and White W.M. Massively multi-query join processing in publish/subscribe systems. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2007, pp. 761–772.
11. Hopcroft J.E. and Ullman J.D. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Boston, MA, 1979.
12. International Press Telecommunications Council. News Industry Text Format. Available online at: http://www.nitf.org/, 2004.
13. Ives Z.G., Halevy A.Y., and Weld D.S. An XML query engine for network-bound data. VLDB J., 11(4):380–402, 2002.
14. Kwon J., Rao P., Moon B., and Lee S. FiST: scalable XML document filtering by sequencing twig patterns. In Proc. 31st Int. Conf. on Very Large Data Bases, 2005, pp. 217–228.
XML Publishing
Zachary Ives
University of Pennsylvania, Philadelphia, PA, USA
Synonyms
XML export
Definition
XML Publishing typically refers to the creation of XML output (either in the form of a character stream or file)
from a relational DBMS. XML Publishing typically must handle three issues: converting an XML query or view definition into a corresponding SQL query; encoding hierarchy in the SQL data; and generating tags around the encoded hierarchical data. Since in some cases the relational data may have originated from XML, the topics of XML Storage and XML Publishing are closely related and often addressed simultaneously.
Historical Background
The topic of XML Publishing arose very soon after database researchers suggested a connection between XML and semi-structured data [5], a topic that had previously been studied in the database literature [1,2,8]. Initially the assumption was that XML databases would probably need to resemble those for semi-structured data in order to get good performance. Florescu and Kossmann [11] showed that storing XML in a relational database could be more efficient than storing it in a specialized semi-structured DBMS. Concurrently, Deutsch et al. were exploring hybrid relational/semi-structured storage in STORED [7]. Soon after, the commercial DBMS vendors became interested in adding XML capabilities to their products. The most influential developments in that area came from IBM Almaden’s XPERANTO research project [4,14], which formed the core of Shanmugasundaram’s thesis work [13]. Today, most commercial DBMSs use a combination of all of the aforementioned techniques: storing XML data in relations, storing hybrid relational/semi-structured data, and storing XML in a hierarchical format resembling semi-structured data.
Foundations
As every student of a database class knows, a good relational database schema is in first normal form (1NF): separate concepts and multi-valued attributes are split into separate tables, and taken together the tables can be visualized as a graph-structured Entity-Relationship diagram. In contrast, XML data are fundamentally not in 1NF: an XML document has a single root node and hierarchically encodes data. Thus, there are two main challenges in XML Publishing: first, taking queries or templates describing XML output and mapping them into operations over 1NF tables; and second, efficiently computing and adding XML tags to the data being queried. Typically, the latter problem is addressed with two separate
modules, and hence the XML Publishing problem consists of three steps:
1. Converting queries, which typically are posed in a language other than SQL, into an internal representation from which SQL can be constructed.
2. Composition and optimization of the resulting queries, such that redundant work is minimized.
3. Adding tags, which is often done in a post-processing step outside the DBMS.
Each of these topics is addressed in the remainder of this section. The discussion primarily focuses upon the XPERANTO and SilkRoute systems, which established many of the basic algorithms used in XML Publishing.
Converting Queries
The first issue in XML Publishing is that of taking the specification of the XML output and converting it into some form from which SQL can be constructed. A number of different forms have been proposed for specifying the XML output:
1. Proprietary XML template languages, such as IBM DB2’s Document Access Definition (DAD) or SQL Server’s XML Data Reduced (XDR), which essentially provide a scheme for annotating XML templates with SQL queries that produce content.
2. SQL extensions, such as the SQL/XML standard [12], which extend SQL with new functions to add tags around relational attributes and intermediate table results.
3. Relational-XML query languages, such as the RXL language in early versions of SilkRoute [10], which provide a language that creates hierarchical XML content using relational tables. RXL resembles a combination of Datalog and an XML query language (specifically, XML-QL [6]), and it can be used to define XML views that can be queried using a standard XML query language.
4. XML query languages with built-in XML views of relations, an approach first proposed in XPERANTO [3], where each relational table is mapped into a virtual XML document and a standard XML language (like XPath or XQuery) can be used to query the relations and even to define XML views.
Over time, the research community has come to the consensus that the last approach offers the best set of trade-offs, as it offers a single compositional language for all XML queries over the data in the RDBMS. The commercial market has settled on a combination
of this same approach, for XML-centric users, plus SQL extensions for relation-centric users. Here the discussion assumes an XML-centric perspective, with XQuery as the language, since it is representative. XML Publishing systems generally only attempt to tackle a subset of the full XQuery specification, focusing on the portions that are particularly amenable to execution in a relational DBMS. There are many commonalities between basic XQueries and SQL: selections, projections, and joins can be similarly expressed in both of these languages, views can be defined, and queries can be posed over data in a way that is agnostic as to whether the data comes from a view or a raw source. The two main challenges in conversion lie in the input to the XQuery – where XPaths must be matched against relations or XML views defined over relations – and in creating the hierarchical nested XQuery output. XPath matching over built-in views of relations is, of course, trivial, as each relation attribute maps to an XML element or attribute in the XPath. Conversion gets significantly more complex when the XPaths are over XML views constructed over the original base relations: this poses the problem of XML query composition, where the DBMS should not have to compute each composed view separately, but rather should statically unfold them.
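For the single-table case, the ‘‘trivial’’ mapping can be sketched as follows (table, column, and view names are hypothetical; real systems such as XPERANTO work on a general internal representation rather than string rewriting): an XPath over the built-in view /R/row/column unfolds directly into a SELECT over relation R, with the value predicate becoming a WHERE clause.

import re

# Unfold an XPath over the built-in view of a single relation R (exposed as /R/row/<column>)
# into SQL. Only the pattern /R/row[col=value]/col2 is handled in this sketch.
def xpath_to_sql(xpath):
    m = re.fullmatch(r"/(\w+)/row\[(\w+)=([\w.]+)\]/(\w+)", xpath)
    if m is None:
        raise ValueError("only the simple single-table pattern is handled here")
    table, pred_col, pred_val, out_col = m.groups()
    return f"SELECT {out_col} FROM {table} WHERE {pred_col} = {pred_val}"

print(xpath_to_sql("/dept/row[deptno=10]/dname"))
# SELECT dname FROM dept WHERE deptno = 10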
Composition and Optimization
As a means of composing queries, SilkRoute takes each XQuery definition and creates an internal view forest representation that describes the structure of the XML document; it converts an XQuery into a canonical representation called XQueryCore, and its algorithms compose XQueryCore operations over an input document. XPERANTO, based in large part on IBM’s DB2, performs very similar operations, but starts by creating an internal representation of each query block in a model called XQGM, then executes a series of rewrite rules to merge the blocks by merging steps that create hierarchy in a view’s output with steps that traverse that hierarchy in a subsequent query’s input. The resulting SQL is often highly complex and repetitive, with many SQL blocks being unioned together: each level of the XML hierarchy may require a separate SQL block, and each view composition step in the optimization process may create multiple SQL query blocks (an XPath over a view may match over multiple relational attributes from the base data).
Thus, optimizing the resulting SQL becomes of high importance. XML publishing systems typically convert from XQuery to SQL, not to actual executable query plans, so they can only do a limited amount of optimization on their own. Here they use a significant number of heuristics [10,14] to choose the best form for the sets of SQL queries, and rely on the RDBMS’s cost-based optimizer to more efficiently optimize the queries. (Recently, commercial vendors have invested significant effort in improving their cost-based query optimizers for executing the types of queries output by XML Publishing systems.)
Adding Tags
The final step in XML Publishing is that of adding tags to the data from the relational query. There have been two main approaches to this task: in-engine, relying on SQL functions to add the tags; and outside the DBMS, relying on an external middleware module to add the tags. Each is briefly described.
In-engine Tagging
In [14], the authors proposed a
method for adding XML tags using the SQL functions XMLELEMENT, XMLATTRIBUTE, and XMLAGG: each of these takes a relational attribute or tuple and converts it into a partial XML result, encoded in a CLOB datatype. Naturally, the first two create an element or attribute tag, respectively, around a data value. The third function, XMLAGG, functions as an SQL aggregate function: it takes an SQL row set (possibly including XMLELEMENT or XMLATTRIBUTE values) and outputs a tuple with a CLOB containing the XML serialization of the rowset’s content. The drawback to the in-engine tagging approach is that CLOBs are not always handled efficiently, as they are typically stored separately from the tuples with which they are associated. Hence, when this method was incorporated into the SQL/XML standard, a new XML datatype was proposed that could be implemented in a manner more efficient than the CLOB.
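The behavior of these functions can be emulated in a few lines (a sketch of the semantics described above, not of any engine’s internals): XMLELEMENT wraps content in a tag, and XMLAGG concatenates the fragments produced for the rows of a group.

from xml.sax.saxutils import escape

def xmlelement(name, *content):
    # XMLELEMENT-like behavior: wrap text or already-serialized fragments in a tag.
    return f"<{name}>{''.join(content)}</{name}>"

def xmlagg(fragments):
    # XMLAGG-like behavior: aggregate the fragments produced for the rows of one group.
    return "".join(fragments)

# Rows of a child relation grouped under one parent, e.g., the items of one order.
items = [("pen", 2), ("ink", 1)]
order_xml = xmlelement(
    "order",
    xmlagg(xmlelement("item", escape(name), xmlelement("qty", str(qty)))
           for name, qty in items),
)
print(order_xml)
# <order><item>pen<qty>2</qty></item><item>ink<qty>1</qty></item></order>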
Tagging Middleware
A more common approach is to simply produce a tuple representation of the XML document tree/forest within the SQL engine, and to convert this tuple stream into XML content externally, using a separate tagging module [9,14]. There are two common techniques for encoding hierarchy in tuple streams: outer join and outer union. Suppose there are relational tables as in Fig. 1a, encoding an element a and its child element b, and one wants to encode their output as XML in which each a element contains its b elements as children.
Outer join. In SilkRoute, tuples are generated using outer joins between parent and child relations. The tuple stream will be sorted in a way that allows the tagger to hold only the last tuple, not all data (see Fig. 1b). The ‘‘parent’’ portion of the tree (the a element) will appear in both tuples, and the external XML tagger will compare consecutive tuples in the tuple stream to determine the leftmost position where a value changed – from which it can determine what portion of the XML hierarchy has changed.
Outer union. In contrast, XPERANTO uses a so-called sorted outer union representation (Fig. 1c), which substitutes null values for repeated copies of values (but not keys). Its approach relies on two characteristics of IBM’s DB2 engine that are shared by some but not all other systems: (i) null values can be very efficiently encoded and processed, meaning that the query is more efficient; (ii) null values sort high, i.e., DB2 considers null values to have a sort value greater than any non-null value. The tagger simply needs to look for the first non-null attribute to determine what portion of the XML to emit.
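A minimal external tagger for the outer-join encoding can be sketched as follows (the row layout (a_id, a_val, b_id, b_val) is an assumption for illustration; SilkRoute’s tagger is more general): rows arrive sorted, consecutive tuples are compared, and a new a element is opened only when the parent columns change.

def tag_outer_join(rows):
    """rows: tuples (a_id, a_val, b_id, b_val), sorted by (a_id, b_id);
    b_id/b_val are None when an a element has no b child (outer join)."""
    out, prev_a = [], None
    for a_id, a_val, b_id, b_val in rows:
        if a_id != prev_a:                 # leftmost changed columns are the parent key:
            if prev_a is not None:
                out.append("</a>")         # close the previous parent...
            out.append(f'<a val="{a_val}">')   # ...and open a new one
            prev_a = a_id
        if b_id is not None:               # emit the child carried by this tuple
            out.append(f"<b>{b_val}</b>")
    if prev_a is not None:
        out.append("</a>")
    return "".join(out)

print(tag_outer_join([(1, "x", 10, "u"), (1, "x", 11, "v"), (2, "y", None, None)]))
# <a val="x"><b>u</b><b>v</b></a><a val="y"></a>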
Key Applications
XML publishing has become a key capability in the relational database arena, as increasingly data
XML Publishing. Figure 1. Tuple representations of XML data.
interchange mechanisms and Web services are being built using XML data from an RDBMS. Today the ‘‘Big Three’’ commercial DBMS vendors all use some techniques for XML Publishing, and it is likely that smaller DBMSs, including those in open source, will gradually incorporate them as well.
Cross-references
▶ Approximate XML Querying ▶ XML Storage ▶ XML Views
Recommended Reading
1. Abiteboul S., Quass D., McHugh J., Widom J., and Wiener J.L. The Lorel query language for semistructured data. Int. J. Digit. Libr., 1(1):68–88, 1997.
2. Buneman P., Davidson S.B., Fernandez M.F., and Suciu D. Adding structure to unstructured data. In Proc. 13th Int. Conf. on Data Engineering, 1997, pp. 336–350.
3. Carey M.J., Florescu D., Ives Z.G., Lu Y., Shanmugasundaram J., Shekita E., and Subramanian S. XPERANTO: publishing object-relational data as XML. In Proc. 3rd Int. Workshop on the World Wide Web and Databases, 2000, pp. 105–110.
4. Carey M., Kiernan J., Shanmugasundaram J., Shekita E., and Subramanian S. XPERANTO: a middleware for publishing object-relational data as XML documents. In Proc. 26th Int. Conf. on Very Large Data Bases, 2000, pp. 646–648.
5. Deutsch A., Fernández M.F., Florescu D., Levy A.Y., and Suciu D. XML-QL. In Proc. The Query Languages Workshop, 1998.
6. Deutsch A., Fernandez M.F., Florescu D., Levy A., and Suciu D. A query language for XML. Comp. Networks, 31(11–16):1155–1169, 1999.
7. Deutsch A., Fernandez M.F., and Suciu D. Storing semistructured data with STORED. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 431–442.
8. Fernandez M.F., Florescu D., Kang J., Levy A.Y., and Suciu D. Catching the boat with Strudel: experiences with a web-site management system. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998, pp. 414–425.
9. Fernandez M.F., Kadiyska Y., Suciu D., Morishima A., and Tan W.C. SilkRoute: a framework for publishing relational data in XML. ACM Trans. Database Syst., 27(4):438–493, 2002.
10. Fernandez M., Tan W.C., and Suciu D. SilkRoute: trading between relations and XML. Comp. Networks, 33(1–6):723–745, 2000.
11. Florescu D. and Kossmann D. A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database. Tech. Rep. 3684, INRIA, 1999.
12. ISO/IEC 9075-14:2003 Information technology – Database languages – SQL – Part 14: XML-Related Specifications (SQL/XML).
13. Shanmugasundaram J. Bridging Relational Technology and XML. Ph.D. thesis, University of Wisconsin-Madison, 2001.
14. Shanmugasundaram J., Shekita E.J., Barr R., Carey M.J., Lindsay B.G., Pirahesh H., and Reinwald B. Efficiently publishing relational data as XML documents. VLDB J., 10(2–3):133–154, 2001.
XML Retrieval
Mounia Lalmas 1, Andrew Trotman 2
1 Queen Mary, University of London, London, UK
2 University of Otago, Dunedin, New Zealand
Synonyms
Structured document retrieval; Structured text retrieval; Focused retrieval; Content-oriented XML retrieval
Definition
Text documents often contain a mixture of structured and unstructured content. One way to format this mixed content is according to the adopted W3C standard for information repositories and exchanges, the eXtensible Mark-up Language (XML). In contrast to HTML, which is mainly layout-oriented, XML follows the fundamental concept of separating the logical structure of a document from its layout. This logical document structure can be exploited to allow a more focused sub-document retrieval. XML retrieval breaks away from the traditional retrieval unit of a document as a single large (text) block and implements focused retrieval strategies that return document components, i.e., XML elements, instead of whole documents in response to a user query. This focused retrieval strategy is believed to be of particular benefit for information repositories containing long documents, or documents covering a wide variety of topics (e.g., books, user manuals, legal documents), where the user’s effort to locate relevant content within a document can be reduced by directing them to the most relevant parts of the document.
Historical Background
Managing the enormous amount of information available on the web, in digital libraries, in intranets, and so on, requires efficient and effective indexing and retrieval methods. Although this information
is available in different forms (text, image, speech, audio, video, etc.), it remains widely prevalent in text form. Textual information can be broadly classified into two categories, structured and unstructured. Unstructured information has no fixed pre-defined format, and is typically expressed in natural language. For instance, much of the information available on the web is unstructured. Although this information is mostly formatted in HTML, thus imposing some structure on the text, the structure is only for presentation purposes and carries essentially no semantic meaning. Correct nesting of the HTML structure (that is, to form an unambiguous document logical structure) is not imposed. Unstructured information is accessed through flexible but mostly simplistic means, such as simple keyword matching or bag-of-words techniques. Structured information is usually represented using XML, a mark-up language similar to HTML except that it imposes a rigorous structure on the document. Moreover, unlike HTML, XML tags are used to specify semantic information about the stored content and not the presentation. A document correctly marked up in XML has a fixed document structure in which semantically separate document parts are explicitly identified – and this can be exploited to provide powerful and flexible access to textual information. XML has been accepted by the computing community as a standard for document mark-up and an increasing number of documents are being made available in this format. As a consequence, numerous techniques are being applied to access XML documents. The use of XML has generated a wealth of issues that are being addressed by both the database and information retrieval communities [3]. This entry is concerned with content-oriented XML retrieval [2,5] as investigated by the information retrieval community. Retrieval approaches for structured text (marked up in XML-like languages such as SGML) were first proposed in the late 1980s. In the late 1990s, the interest in structured text retrieval grew due to the introduction of XML in 1998. Research on XML information retrieval was first coordinated in 2002 with the founding of the Initiative for the Evaluation of XML Retrieval (INEX). INEX provides a forum for the
evaluation of information retrieval approaches specifically developed for XML retrieval.
Foundations
Within INEX, the aim of an XML retrieval system is ‘‘to exploit the logical structure of XML documents to determine the best document components, i.e., best XML elements, to return as answers to queries’’ [7]. Query languages have been developed in order to allow users to specify the nature of these best components. Indexing strategies have been developed to obtain a representation not only of the content of XML documents, but also of their structure. Ranking strategies have been developed to determine the best elements for a given query.
Query Languages
In XML retrieval, the logical document structure is additionally used to determine which document components are most meaningful to return as query answers. With appropriate query languages, this structure can be specified by the user. For example, ‘‘I want a paragraph discussing penguins near to a picture labeled Otago Peninsula.’’ Here, ‘‘penguins’’ and ‘‘Otago Peninsula’’ specify content (textual) constraints, whereas ‘‘paragraph’’ and ‘‘picture’’ specify structural constraints on the retrieval units. Query languages for XML retrieval can be classified into content-only and content-and-structure query languages. Content-only queries have historically been used as the standard form of input in information retrieval. They are suitable for XML search scenarios where the user does not know (or is not concerned with) the logical structure of a document. Although only the content aspect of an information need can be specified, XML retrieval systems must still determine the best granularity of elements to return to the user. Content-and-structure queries provide a means for users to specify conditions referring both to the content and the structure of the sought elements. These conditions may refer to the content of specific elements (e.g., the returned element must contain a section about a particular topic), or may specify the type of the requested answer elements (e.g., sections should be retrieved). There are three main categories of content-and-structure query languages [1]:
1. Tag-based queries allow users to annotate words in the query with a single tag name that specifies the type of results to be returned. For example, the query ‘‘section:penguins’’ requests section elements on ‘‘penguins.’’
2. Path-based queries are based upon the syntax of XPath. They encapsulate the document structure in the query. An example in the NEXI language is: ‘‘//document[about(.,Otago Peninsula)]//section[about(.//title, penguins)].’’ This query asks for sections that have a title about ‘‘penguins,’’ and that are contained in a document about ‘‘Otago Peninsula.’’
3. Clause-based queries use nested clauses to express information needs, in a similar way to SQL. The most prominent clause-based language for XML retrieval is XQuery. A second example is XQuery Full-Text, which extends XQuery with text search predicates such as proximity searching and relevance ranking.
The complexity and the expressiveness of content-and-structure query languages increase from tag-based to clause-based queries. This increase in expressiveness and complexity often means that content-and-structure queries are viewed as too difficult for end users (because they must, for example, be intimately familiar with the document structure). Nonetheless they can be very useful for expert users in specialized scenarios, and have also been used as an intermediate between a graphical query language (such as Bricks [12]) and an XML search engine.
Indexing Strategies
Classical indexing methods in information retrieval make use of term statistics to capture the importance of a term in a document and, consequently, to discriminate between relevant and non-relevant content. Indexing methods for XML retrieval require similar term statistics, but for each element. In XML retrieval there are no a priori fixed retrieval units. The whole document, one of its sections, or a single paragraph within a section, all constitute potential answers to a single query. The simplest approach to allow the retrieval of elements at any level of granularity is to index each element separately (as a separate document in the traditional sense). In this case, term statistics for
each element are calculated from the text of the element and all its descendants. An alternative is to derive the term statistics through the aggregation of the term statistics of the element’s own text and those of each of its children. A second alternative is to only index leaf elements and to score non-leaf elements through propagation of the scores of their children elements. Both alternatives can include additional parameters incorporating, for instance, element relationships or special behavior for some element types. It is not uncommon to discard elements smaller than some given threshold. A single italicized word, for example, may not be a meaningful retrieval unit. A related strategy, selective indexing, involves building separate indexes for those element types previously seen to carry relevant information (sections, subsections, etc., but not italics, bold, etc.). With selective indexing the results from each index must be merged to provide a single ranked result list across all element types. It is not yet clear which indexing strategy is the best. The best approach appears to depend on the collection, the types of elements (i.e., the DTD) and their relationships. In addition, the choice of the indexing strategy currently has an effect on the ranking strategy. More details about indexing strategies can be found in the entry on Indexing Units.
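The aggregation alternative can be sketched with Python’s standard ElementTree (the statistics kept and the traversal are illustrative assumptions): term counts are computed for each element’s own text and then added, bottom-up, to those of its children, so that every element carries statistics for its entire subtree.

from collections import Counter
import xml.etree.ElementTree as ET

def aggregate_term_stats(elem, stats):
    # term counts of the element's own text, plus (recursively) those of its children
    counts = Counter((elem.text or "").lower().split())
    for child in elem:
        counts += aggregate_term_stats(child, stats)
        counts += Counter((child.tail or "").lower().split())   # text between/after child elements
    stats[elem] = counts
    return counts

doc = ET.fromstring("<sec><title>penguins</title><p>penguins of the Otago Peninsula</p></sec>")
stats = {}
aggregate_term_stats(doc, stats)
print(stats[doc]["penguins"])   # 2 -- aggregated over the whole <sec> subtree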
Ranking Strategies
XML documents are made of XML elements, which define the logical document structure. Thus sophisticated ranking strategies can be developed to exploit the various additional (structural) evidence not seen in unstructured (flat) text documents.
Element Scoring
Many of the retrieval models developed for flat document retrieval have been adapted for XML retrieval. These models have been used to estimate the relevance of an element based on the evidence associated with the element only. This is done by a scoring function based, for instance, on the vector space model, BM25, or language models, typically adapted to incorporate XML-specific features. As an illustration, a scoring function based on language models [6] is described next.
Given a query $q = t_1,\ldots,t_n$ made of n terms, an element e and its corresponding element language model $\theta_e$, the element e is ranked using the following scoring function:
$$P(e \mid q) \propto P(e) \cdot P(q \mid \theta_e)$$
where P(e) is the prior probability of relevance for element e and $P(q \mid \theta_e)$ is the probability of the query q being ‘‘generated’’ by the element language model, calculated as follows:
$$P(t_1,\ldots,t_n \mid \theta_e) = \prod_{i=1}^{n} \big(\lambda P(t_i \mid e) + (1-\lambda) P(t_i \mid C)\big)$$
Here $P(t_i \mid e)$ is the maximum likelihood estimate of term $t_i$ in element e, $P(t_i \mid C)$ is the probability of the query term in the collection, and $\lambda$ is the smoothing parameter. $P(t_i \mid e)$ is the element model based on element term frequency, whereas $P(t_i \mid C)$ is the collection model based on inverse element frequency. An important XML-specific feature is element length, since this can vary radically – for example, from a title to a paragraph to a document section. Element length can be captured by setting the prior probability P(e) as follows:
$$P(e) = \frac{\mathrm{length}(e)}{\sum_{e' \in C} \mathrm{length}(e')}$$
where length(e) is the length of element e. Including length in the ranking calculation has been shown to lead to more effective retrieval than not doing so.
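A direct transcription of this scoring function is given below (a sketch with toy statistics; a real system would precompute these in the index, and [6] derives the collection statistics from element frequencies):

def lm_score(query_terms, elem_tf, elem_len, coll_tf, coll_len, total_elem_len, lam=0.5):
    # P(e|q) ∝ P(e) * prod_i [ lam * P(t_i|e) + (1 - lam) * P(t_i|C) ], with a length prior P(e)
    score = elem_len / total_elem_len              # P(e) = length(e) / sum of all element lengths
    for t in query_terms:
        p_te = elem_tf.get(t, 0) / elem_len        # maximum-likelihood element model
        p_tc = coll_tf.get(t, 0) / coll_len        # collection model used for smoothing
        score *= lam * p_te + (1 - lam) * p_tc
    return score

# Toy statistics for a single element and the collection.
print(lm_score(["penguins"], elem_tf={"penguins": 3}, elem_len=100,
               coll_tf={"penguins": 40}, coll_len=100000, total_elem_len=500000))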
Contextualization
The above strategy only scores an element based on the content of the element itself. Considering additional evidence has been shown to be beneficial for XML retrieval. In particular for long documents, using evidence from the element itself as well as its context (for example, the parent element) has been shown to increase retrieval performance. This strategy is referred to as contextualization. Combining the element score and a separate document score has also been shown to improve performance.
Propagation
When only leaf elements are indexed, a propagation mechanism is used to calculate the relevance score of the non-leaf elements. The propagation combines the retrieval scores of the leaf elements (often using a weighted sum) and any additional element
characteristics (such as the distance between the element and the leaves). A non-trivial issue is the estimation of the weights for the weighted sum.
Merging
It has also been common to obtain several ranked lists of results, and to merge them to form a single list. For example, with the selective indexing strategy [10], a separate index is created for each a priori selected type of element (such as article, abstract, section, paragraph, and so on). A given query is then submitted to each index, each index produces a separate list of elements, normalization is performed (to take into account the variation in size of the elements) and the results are merged. Another approach to merging produces several ranked lists from a single index and for all elements in the collection (a single index is used as opposed to separate indices for each element). Different ranking models are used to produce each ranked list. This can be compared to document fusion investigated in the 1990s.
Processing Structural Constraints
Early work in XML retrieval required structural constraints in content-and-structure queries to be strictly matched, but specifying an information need including structural constraints is difficult; XML document collections have a wide variety of tag names. INEX now views structural constraints as hints as to where to look (what sort of elements might be relevant). Simple techniques for processing structural constraints include the construction of a dictionary of tag synonyms and structure boosting. In the latter, the retrieval score of an element is generated ignoring the structural constraint, but is then boosted if the element matches the structural constraint. More details can be found in the entry on Processing Structural Constraints.
Processing Overlaps
It is one task to provide a score expressing how relevant an element is to a query, but a different task to decide which of a set of several overlapping relevant elements is the best answer. If an element has been estimated relevant to a query, it is likely that its parent element is also estimated relevant to the query, as these two elements share common text. However, returning chains of elements to the user should be avoided to ensure that the user does not receive the same text several times (one for each element in the
chain). Deciding which element to return depends on the application and the user model. For instance, in INEX, the best element is one that is highly relevant, but also specific to the topic of request (i.e., does not discuss unrelated topics). Removing overlap has mostly been done as a post-ranking task. A first approach, and the most commonly adopted one, is to remove elements directly from the result list. This is done by selecting the highest ranked element in a chain and removing any ancestors and descendants at lower ranks. Other techniques looked at the distribution of retrieved elements within each document to decide which ones to return. For example, the root element would be returned if all retrieved elements were uniformly distributed in the document. This technique was shown to outperform the simpler techniques. Efficiency remains an issue as the removal of overlaps is done at query time. More details can be found in the entry on Processing Overlaps.
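The commonly adopted list-based strategy can be sketched as follows (element identifiers are represented as slash-separated paths within a document; this is an illustration rather than any specific system’s code):

def remove_overlap(ranked):
    """ranked: list of (doc_id, element_path, score) in decreasing score order.
    Keep the highest-ranked element of every ancestor/descendant chain."""
    kept = []
    def related(p, q):   # ancestor-or-self / descendant relationship between two element paths
        return p == q or q.startswith(p + "/") or p.startswith(q + "/")
    for doc, path, score in ranked:
        if not any(doc == d and related(path, p) for d, p, _ in kept):
            kept.append((doc, path, score))
    return kept

results = [("d1", "/article/sec[2]/p[3]", 0.9),
           ("d1", "/article/sec[2]", 0.8),        # ancestor of the first hit: dropped
           ("d1", "/article/sec[5]", 0.7)]
print(remove_overlap(results))
# [('d1', '/article/sec[2]/p[3]', 0.9), ('d1', '/article/sec[5]', 0.7)]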
Key Applications
XML retrieval approaches (from query languages to ranking strategies) are relevant to any application concerned with access to repositories of documents annotated in XML, or in similar mark-up languages such as SGML or ASN.1. Existing repositories include electronic dictionaries and encyclopedias such as the Wikipedia [4], electronic journals such as the journals of the IEEE [7], plays such as the collected works of William Shakespeare [8], and bibliographic databases such as PubMed (www.ncbi.nlm.nih.gov/pubmed/). XML retrieval is becoming increasingly important in all areas of information retrieval; the application to full-text book searching is obvious, and such commercial systems already exist [11].
Experimental Results
Since 2002 work on XML retrieval has been evaluated in the context of INEX. Many of the proposed approaches have been presented at the yearly INEX workshops, held in Dagstuhl, Germany. Each year, the INEX workshop pre-proceedings (which are not peer-reviewed) contain preliminary papers describing the details of participants’ approaches. Since 2003 the final INEX workshop proceedings have been peer-reviewed, and since 2004 they have been published by Springer as part of the Lecture Notes in Computer
Science series. Links to the pre- and final proceedings can be found on the INEX web site (http://www.inex.otago.ac.nz/).
Data Sets
Since 2002 INEX has collected data sets that can be used for conducting XML retrieval experiments [9]. Each data set consists of a document collection, a set of topics, and the corresponding relevance assessments. The topics and associated relevance assessments are available on the INEX web site (http://www.inex.otago.ac.nz/). It should be noted that the relevance assessments on the latest INEX data set are released first to INEX participants.
Cross-references
▶ Aggregation-Based Structured Text Retrieval ▶ Content-and-Structure Query ▶ Evaluation Metrics for Structured Text Retrieval ▶ Indexing Units ▶ Information Retrieval Models ▶ INitiative for the Evaluation of XML Retrieval ▶ Integrated DB&IR Semi-Structured Text Retrieval ▶ Logical Structure ▶ Narrowed Extended XPath I ▶ Presenting Structured Text Retrieval Results ▶ Processing Overlaps ▶ Processing Structural Constraints ▶ Propagation-Based Structured Text Retrieval ▶ Relationships in Structured Text Retrieval ▶ Structure Weight ▶ Structured Document Retrieval ▶ Structured Text Retrieval Models ▶ Term Statistics for Structured Text Retrieval ▶ XML ▶ XPath/XQuery ▶ XQuery Full-Text
Recommended Reading
1. Amer-Yahia S. and Lalmas M. XML search: languages, INEX and scoring. ACM SIGMOD Rec., 35(4):16–23, 2006.
2. Baeza-Yates R., Fuhr N., and Maarek Y.S. (eds.). Special issue on XML retrieval. ACM Trans. Inf. Syst., 24(4), 2006.
3. Blanken H.M., Grabs T., Schek H.-J., Schenkel R., and Weikum G. (eds.). Intelligent Search on XML Data, Applications, Languages, Models, Implementations, and Benchmarks. Springer, Berlin, 2003.
4. Denoyer L. and Gallinari P. The Wikipedia XML corpus, comparative evaluation of XML information retrieval systems. In Proc. 5th Int. Workshop of the Initiative for the Evaluation of XML Retrieval, 2007, pp. 12–19.
5. Fuhr N. and Lalmas M. (eds.). Special issue on INEX. Inf. Retrieval, 8(4), 2005.
6. Kamps J., de Rijke M., and Sigurbjörnsson B. The importance of length normalization for XML retrieval. Inf. Retrieval, 8(4):631–654, 2005.
7. Kazai G., Gövert N., Lalmas M., and Fuhr N. The INEX evaluation initiative. In Intelligent Search on XML Data, Applications, Languages, Models, Implementations, and Benchmarks, H.M. Blanken, T. Grabs, H. Schek, R. Schenkel, G. Weikum (eds.). Springer, 2003, pp. 279–293.
8. Kazai G., Lalmas M., and Reid J. Construction of a test collection for the focused retrieval of structured documents. In Proc. 25th European Conf. on IR Research, 2003, pp. 88–103.
9. Lalmas M. and Tombros A. INEX 2002–2006: understanding XML retrieval evaluation. In Proc. 1st Int. DELOS Conference, 2007, pp. 187–196.
10. Mass Y. and Mandelbrod M. Component ranking and automatic query refinement for XML retrieval. In Proc. 3rd Int. Workshop of the Initiative for the Evaluation of XML Retrieval, 2004, pp. 73–84.
11. Pharo N. and Trotman A. The use case track at INEX 2006. SIGIR Forum, 41(1):64–66, 2007.
12. van Zwol R., Baas J., van Oostendorp H., and Wiering F. Bricks: the building blocks to tackle query formulation in structured document retrieval. In Proc. 28th European Conf. on IR Research, 2006, pp. 314–325.
XML Schema
Michael Rys
Microsoft Corporation, Sammamish, WA, USA
Synonyms
XML schema; W3C XML schema
Definition
XML Schema is a schema language that allows one to constrain XML documents and provides type information about parts of the XML document. It is defined in a World Wide Web Consortium recommendation [3–5].
Key Points
XML Schema is a schema language that allows the constraining of XML documents and provides type information about parts of the XML document. It is defined in a World Wide Web Consortium recommendation [3–5]. The current version is 1.0. Unlike document type definitions (DTDs), which are defined in the XML recommendation itself [1], XML Schema uses an XML-based vocabulary to describe the schema constraints. A schema describes elements and attributes and their content models and types. It provides complex types that describe the content model of elements and simple types that describe the type of attributes and leaf element nodes. Types can be related to each other in so-called derivation hierarchies, either by restricting or extending a super type. Elements with different names can be grouped together into substitution groups if their types are in a derivation relationship. The element, attribute and type declarations are called schema components. A schema is normally associated with a target namespace to which these schema components belong. Figure 1 depicts an example schema (based on examples in [10]) that defines a schema for the target namespace http://www.example.com/PO1 containing the following schema components: two global element declarations, three global complex type declarations, and one global simple type declaration. Besides constraining XML documents, XML Schemas are also being used to define vocabularies and semantic models, often in the context of information exchange scenarios, to define and constrain the data contracts for data exchanged between clients and services. The schema components can in addition be used to provide additional type information in XQuery [7,8], XPath [2,8] and XSLT [9]. Note that version 1.1 of XML Schema is currently under development at the W3C [6,10].
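For instance, assuming the third-party lxml library is available, a document can be validated against such a schema programmatically; the element and attribute names below are made up for illustration.

from lxml import etree

xsd = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="purchaseOrder">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="orderDate" type="xs:date"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# Compile the schema and validate an instance document against it.
schema = etree.XMLSchema(etree.fromstring(xsd))
doc = etree.fromstring(b'<purchaseOrder orderDate="2008-01-01"><item>ink</item></purchaseOrder>')
print(schema.validate(doc))   # True if the document conforms to the schema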
Cross-references
▶ XML ▶ XML Attribute ▶ XML Document ▶ XML Element ▶ XPath/XQuery ▶ XQuery ▶ XSLT
XML Schema. Figure 1. Example XML Schema.
Recommended Reading
1. XML 1.0 Recommendation, latest edition. Available at: http://www.w3.org/TR/xml
2. XML Path Language (XPath) 2.0, latest edition. Available at: http://www.w3.org/TR/xpath20/
3. XML Schema Part 0: Primer, latest edition. Available at: http://www.w3.org/TR/xmlschema-0/
4. XML Schema Part 1: Structures, latest edition. Available at: http://www.w3.org/TR/xmlschema-1/
5. XML Schema Part 2: Datatypes, latest edition. Available at: http://www.w3.org/TR/xmlschema-2/
6. XML Schema 1.1 Part 2: Datatypes, latest edition. Available at: http://www.w3.org/TR/xmlschema11-2/
7. XQuery 1.0: An XML Query Language, latest edition. Available at: http://www.w3.org/TR/xquery/
8. XQuery 1.0 and XPath 2.0 Data Model (XDM), latest edition. Available at: http://www.w3.org/TR/xpath-datamodel/
9. XSL Transformations (XSLT) Version 2.0, latest edition. Available at: http://www.w3.org/TR/xslt20/
10. W3C XML Schema Definition Language (XSDL) 1.1 Part 1: Structures, latest edition. Available at: http://www.w3.org/TR/xmlschema11-1/
XML Schemas
▶ XML Types
Selectivity estimators apply an estimation procedure on a synopsis of the data. Due to the stringent time and space constraints of query optimization, of which selectivity estimation is only one of the steps, selectivity estimators are faced with two, often conflicting, requirements: they have to accurately and efficiently estimate the cardinality of queries while keeping the synopsis size to a minimum. While there is a large body of literature on selectivity estimation in the context of relational databases, the inherent complexity of the XML data model creates new challenges. Synopsis structures for XML must capture statistical characteristics of document structure in addition to those of data values. Moreover, the flexibility allowed by the use of regular expressions to define XML elements typically results in data that has highly-skewed structure. Finally, queries over XML documents require path expressions over possibly long paths, which essentially translates to a number of joins that is much larger than what is found in typical database applications. This increases the scope for inaccurate estimates, since it is known that errors propagate rapidly when estimating multi-way join queries.
Historical Background
XML Selectivity Estimation M AYA R AMANATH 1, J ULIANA F REIRE 2, N EOKLIS P OLYZOTIS 3 1 Max-Planck Institute for Informatics, Saarbru¨cken, Germany 2 School of Computing, University of Utah, UT, USA 3 University of California Santa Cruz, Santa Cruz, CA, USA
Synonyms XML cardinality estimation
Definition Selectivity estimation in database systems refers to the task of estimating the number of results that will be output for a given query. Selectivity estimates are crucial in query optimization, since they enable optimizers to select efficient query plans. They are also employed in interactive data exploration as timely feedback about the expected outcome of user queries, and can even serve as approximate answers for count queries.
Selectivity estimation for XML queries has its roots in early works on semi-structured data [7]. These include both lossy techniques for structural summarization [9] and lossless techniques for deriving structural indices [4,8]. The development of XML and associated query languages (XPath and XQuery) has created new challenges to the problem of selectivity estimation. In particular, XML query languages involve a richer set of structure- and content-related operators and hence require more sophisticated techniques for synopsis generation and estimation. A similar observation can be made for the XML data model itself, which is more involved compared to the early models for semistructured data. This diversity has given rise to a host of different XML selectivity estimation techniques that address different aspects of the general problem. These techniques can be characterized with respect to a number of different features, including: information captured in the synopsis structure (e.g., tree- vs. graphbased view of XML data, structure only vs. structure and content); the class of supported queries (e.g., simple path expressions, XPath); the use of schema information; the ability to provide accuracy guarantees; synopsis generation strategy (e.g., streaming vs.
non-streaming); support for maintaining the synopsis in the presence of updates to the data.
Foundations
In general, XML selectivity estimation techniques consist of two components: a synopsis structure that captures the statistical features of the data, and an estimation framework that computes selectivity estimates based on the constructed synopsis. The techniques presented in the recent literature cover a wide spectrum of design choices in terms of these two components. For instance, Bloom histograms [17] and Markov tables [1] base the synopsis on an enumeration of simple paths, while XSketch [11], XSeed [19], and StatiX [3] build synopsis structures that capture the branching path structure of the data set. Other points of variation include: the supported query model (e.g., structure-only queries [1] vs. queries with value predicates [3,10,11], or linear queries [1] vs. twig queries [2,12]); the use of schema information [3,18]; probabilistic guarantees on estimation error [17]; and explicit support for data updates [5,13,17]. Table 1 provides an overview of the current literature on XML selectivity estimation techniques. A detailed description of each technique is beyond the scope of this entry. Instead, the following classification of techniques is discussed: synopses based on path enumeration, graph-based synopses, and updatable synopses.
Path Enumeration
A straightforward approach to handle simple path expression queries is to enumerate all paths in the data and store the count for each path. Of course, the number of paths in the data can be very large, and thus summarization techniques are required to concisely store this information. Markov tables [1] record information on paths of up to a specified length m only, and further reduce the number of paths by deleting paths with low frequencies. The selectivity of path expressions for paths with length less than or equal to m is derived by a lookup of this table. For longer path expressions, selectivity is computed by combining the counts of shorter paths. In contrast, the bloom histogram method, proposed by Wang et al. [17], groups together simple paths with ‘‘similar’’ frequencies into buckets of a histogram. Each bucket of paths is then summarized using a Bloom filter that represents the contents (paths) of the bucket, and a representative count is associated with it. Selectivities are estimated by first
identifying the bucket containing the query path and retrieving its count. Estimation errors are bounded and depend on the number of histogram buckets and the size of the Bloom filter. While the above techniques use path enumeration as a basic operation to estimate the selectivity of simple path expressions, the same concept is used in [2] to answer more complex ‘‘twig’’ queries, i.e., path queries with branches. The idea is to record, for each path, a small fixed-length signature of the element ids that constitute its roots. These signatures can be ‘‘intersected’’ in order to estimate the number of common roots for several simple paths, or equivalently, the number of matches for the twig query that is formed by joining the simple paths at their root. The set of simple paths and their counts, as well as their signatures, are organized into a summary data structure called the correlated subtree (CST). Given a twig query, it is first broken down into a set of constituent simple paths based on the CST, and the corresponding signatures and counts are combined in order to estimate the overall selectivity.
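The Markov-table estimation can be sketched as follows (a simplified illustration of the combination formula with toy counts, not the exact method of [1]): counts are kept for paths of length up to m = 2, and the count of a longer path is estimated by chaining the statistics of consecutive steps.

# Counts of tag paths of length 1 and 2, as they occur in the data (toy numbers).
counts = {("a",): 100, ("b",): 300, ("c",): 500,
          ("a", "b"): 250, ("b", "c"): 450}

def estimate(path, counts):
    """Estimate the count of a path such as ('a','b','c') from length-<=2 statistics:
       count(a/b/c) ~= count(a/b) * count(b/c) / count(b)."""
    if tuple(path) in counts:
        return counts[tuple(path)]
    est = counts.get(tuple(path[:2]), 0)
    for i in range(1, len(path) - 1):
        pair, mid = counts.get((path[i], path[i + 1]), 0), counts.get((path[i],), 0)
        est = est * pair / mid if mid else 0.0
    return est

print(estimate(("a", "b", "c"), counts))   # 250 * 450 / 300 = 375.0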
Graph-Synopsis-Based Techniques
At an abstract level, a graph synopsis summarizes the basic graph structure of an XML document. More formally, given a data graph G = (V_G, E_G), a graph synopsis S(G) = (V_S, E_S) is a directed node-labeled graph, where (i) each node v ∈ V_S corresponds to a subset of element (or attribute) nodes in V_G (termed the extent of v) that have the same label, and (ii) an edge (u,v) ∈ E_G is represented in E_S as an edge between the nodes whose extents contain the two endpoints u and v. For each node u, the graph synopsis records the common tag of its elements and a count field for the size of its extent. In order to capture different properties of the underlying path structure and value content, a graph synopsis is augmented with appropriate, localized distribution information. As an example, the XSketch-summary mechanism [11], which can estimate the selectivity of simple path expressions with branching predicates, augments the general graph-synopsis model with: (i) localized per-edge stability information, indicating whether the synopsis edge is backward- and/or forward-stable, and (ii) localized per-node value distribution summaries. In short, an edge (u,v) in the synopsis is said to be forward-stable if all the elements of u have at least one child in v; similarly, (u,v) is backward-stable if all the elements in v have at least one parent in u. Accordingly,
XML Selectivity Estimation. Table 1. Summary of work on XML selectivity estimation. The table compares the proposals discussed in this entry – Chen et al. [2], Aboulnaga et al. [1], XPathLearner [5], CXHist [6], Wu et al. [18], XSketch [11], XSeed [19], XCluster [10], StatiX [3], Sartiani [15], Bloom Histograms [17], IMAX [13], Wang et al. [16], and SketchTree [14] – along the dimensions: input (data, schema, query feedback, or streaming data), summary structure, structure predicates, value predicates, error guarantees, and update support.
Data Stream
for each node u that represents elements with values, the synopsis records a summary H(u) which captures the corresponding value distribution and thus enables selectivity estimates for value-based predicates. Recent studies [10,12,19] have proposed a variant of graph-synopses that employ a clustering-based model in order to capture the path and value distribution of the underlying XML data. Under this model, each synopsis node is viewed as a ‘‘cluster’’ and the enclosed elements are assumed to be represented by a corresponding ‘‘centroid’’ that aggregates their characteristics. The TreeSketch [12] model, for instance, defines the centroid of a node u as a vector of average child counts (c1,c2,...,cn), where cj is the average child count from elements in u to every other node vj in the synopsis. Thus, the assumption is that each element in
u has exactly cj children to node vj. Furthermore, the clustering error, that is, the difference between the actual child counts in u and the centroid, provides a measure of the error of approximation. While the above techniques derive the graph structure from the data itself, StatiX [3] exploits the graph structure provided by the XML Schema to build synopses. In a way, the schema can be regarded as a synopsis of the data since it describes the general structure of the data. StatiX makes use of the mapping between an element in the data and a type in the schema (typically assigned during the validation phase) in order to build its synopsis structure. In addition, the statistics gathering phase consists of assigning ordinal numbers to each type found in the data. The schema graph, which forms a basic synopsis, is augmented with histograms
which capture parent-child distributions (built using ordinal numbers) and value distributions (built using values occurring in the data). By appropriately tuning the number of types through a set of schema transformations and by decreasing or increasing the number of histogram buckets, coarse to fine-grained synopsis structures can be built. Selectivities of branching path expressions with value predicates can be estimated using either a simple lookup (possible for simple path expressions where the return value for the query corresponds to a single type) or histogram-based cardinality estimation techniques to combine value histograms with parent-child distributions (needed for branching path expressions with value predicates). Updatable Synopses
Many selectivity estimation techniques for XML assume that the underlying data are static and require an offline scan over the data in order to gather statistics and build synopsis structures. Although this assumption is valid for applications that primarily use XML as a document exchange format, it does not hold for scenarios where XML is used as a native storage format. For these, it is important that estimators support synopsis maintenance in addition to synopsis construction. The Bloom Histogram technique [17] outlined in Table 1 supports updates by maintaining a ‘‘full’’ summary (maybe on disk) and rebuilding the bloom histogram periodically. That is, each update is first parsed into paths, and a table of paths is updated with the new cardinalities. The bloom histogram which is used for selectivity estimation will be rebuilt from this path table when a sufficient number of updates have been received and the bloom histogram estimates are no longer reliable. IMAX [13], which is built on top of StatiX [3], updates the synopsis structure directly as and when the update is received. The key concept in this technique is to first identify the correct location of the update (corresponding to the ordinal numbers of the types in the update) and then to estimate and update the correct buckets of the required histograms. It is possible to estimate the amount of error in the histograms and to schedule a re-computation of the required set of histograms from the data if the error becomes too large. While the approaches outlined above respond directly to changes in the data, XPathLearner [5] and CXHist [6] observe the query processor and use a query feedback loop to maintain the synopsis structure. In
short, these techniques obtain the true selectivity of each query as it is processed, and use this information in order to update the synopsis accordingly. As a result, the synopsis becomes more refined with respect to frequent queries and can thus provide more accurate selectivity estimates for a significant part of the workload.
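The maintenance loop just described can be made concrete with a small sketch. The Python fragment below is not an implementation of any of the estimators in Table 1; it only illustrates, under invented names, the idea of keeping an exact per-path count table (the "full" summary) that supports incremental updates and from which a compact estimator could periodically be rebuilt.

```python
import xml.etree.ElementTree as ET
from collections import Counter

class PathCounts:
    """Exact per-path element counts: a stand-in for the 'full' summary
    from which a compact synopsis could periodically be rebuilt."""
    def __init__(self):
        self.counts = Counter()            # label path -> number of elements

    def add_document(self, xml_text):
        self._walk(ET.fromstring(xml_text), "")

    def remove_document(self, xml_text):   # deletions handled symmetrically
        self._walk(ET.fromstring(xml_text), "", delta=-1)

    def _walk(self, elem, prefix, delta=1):
        path = prefix + "/" + elem.tag
        self.counts[path] += delta
        for child in elem:
            self._walk(child, path, delta)

    def estimate(self, path):
        # Exact for simple root-to-node paths; a real synopsis would answer
        # from a compressed structure (Markov table, bloom histogram, ...).
        return self.counts.get(path, 0)

stats = PathCounts()
stats.add_document("<site><people><person/><person/></people></site>")
stats.add_document("<site><people><person/></people></site>")
print(stats.estimate("/site/people/person"))   # -> 3
```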
Key Applications The main utility of cardinality estimation is to serve as input into a query optimizer which determines the best execution plan for a query. These estimates can also serve as approximate answers to ‘‘COUNT’’ queries. In addition, they are used in interactive data exploration in order to provide users with early feedback regarding the number of results to the submitted query.
Cross-references
▶ Query Optimization ▶ Query Processing ▶ XML ▶ XML Tree Pattern, XML Twig Query ▶ XPath/XQuery
Recommended Reading
1. Aboulnaga A., Alameldeen A.R., and Naughton J. Estimating the selectivity of XML path expressions for internet scale applications. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 591–600.
2. Chen Z., Jagadish H.V., Korn F., Koudas N., Muthukrishnan S., Ng R.T., and Srivastava D. Counting twig matches in a tree. In Proc. 17th Int. Conf. on Data Engineering, 2001, pp. 453–462.
3. Freire J., Haritsa J., Ramanath M., Roy P., and Siméon J. StatiX: making XML count. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2002, pp. 181–191.
4. Goldman R. and Widom J. Dataguides: enabling query formulation and optimization in semistructured databases. In Proc. 23rd Int. Conf. on Very Large Data Bases, 1997, pp. 436–445.
5. Lim L., Wang M., Padmanabhan S., Vitter J., and Parr R. XPathLearner: an on-line self-tuning Markov histogram for XML path selectivity estimation. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 442–453.
6. Lim L., Wang M., and Vitter J. CXHist: an on-line classification-based histogram for XML string selectivity estimation. In Proc. 31st Int. Conf. on Very Large Data Bases, 2005, pp. 1187–1198.
7. McHugh J., Abiteboul S., Goldman R., Quass D., and Widom J. A database management system for semistructured data. ACM SIGMOD Rec., 26(3):54–66, September 1997.
8. Milo T. and Suciu D. Index structures for path expressions. In Proc. 7th Int. Conf. on Database Theory, 1999, pp. 277–295.
9. Nestorov S., Ullman J., Wiener J., and Chawathe S. Representative objects: concise representations of semistructured, hierarchical data. In Proc. 13th Int. Conf. on Data Engineering, 1997, pp. 79–90.
10. Polyzotis N. and Garofalakis M. XCluster synopses for structured XML content. In Proc. 22nd Int. Conf. on Data Engineering, 2006, p. 63.
11. Polyzotis N. and Garofalakis M. XSketch synopses for XML data graphs. ACM Trans. on Database Syst., 31(3):1014–1063, 2006.
12. Polyzotis N., Garofalakis M., and Ioannidis Y. Approximate XML query answers. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004, pp. 263–274.
13. Ramanath M., Zhang L., Freire J., and Haritsa J. IMAX: incremental maintenance of schema-based XML statistics. In Proc. 21st Int. Conf. on Data Engineering, 2005, pp. 273–284.
14. Rao P. and Moon B. SketchTree: approximate tree pattern counts over streaming labeled trees. In Proc. 22nd Int. Conf. on Data Engineering, 2006, p. 80.
15. Sartiani C. A framework for estimating XML query cardinality. In Proc. 6th Int. Workshop on the World Wide Web and Databases, 2003, pp. 43–48.
16. Wang W., Jiang H., Lu H., and Yu J.X. Containment join size estimation: models and methods. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 145–156.
17. Wang W., Jiang H., Lu H., and Yu J.X. Bloom histogram: path selectivity estimation for XML data with updates. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 240–251.
18. Wu Y., Patel J.M., and Jagadish H.V. Estimating answer sizes for XML queries. In Advances in Database Technology, Proc. 8th Int. Conf. on Extending Database Technology, 2002, pp. 590–608.
19. Zhang N., Özsu M.T., Aboulnaga A., and Ilyas I.F. XSEED: accurate and fast cardinality estimation for XPath queries. In Proc. 22nd Int. Conf. on Data Engineering, 2006, p. 61.
XML Storage
Denilson Barbosa1, Philip Bohannon2, Juliana Freire3, Carl-Christian Kanne4, Ioana Manolescu5, Vasilis Vassalos6, Masatoshi Yoshikawa7
1University of Alberta, Edmonton, AB, Canada; 2Yahoo! Research, Santa Clara, CA, USA; 3University of Utah, Salt Lake City, UT, USA; 4University of Mannheim, Mannheim, Germany; 5INRIA Saclay–Île-de-France, Orsay, France; 6Athens University of Economics and Business, Athens, Greece; 7University of Kyoto, Kyoto, Japan
Synonyms XML persistence; XML database
Definition A wide variety of technologies may be employed to physically persist XML documents for later retrieval or update, from relational database management systems to hierarchical systems to native file systems. Once the target technology is chosen, there is still a large number of storage mapping strategies that define how parts of the document or document collection will be represented in the back-end technology. Additionally, there are issues of optimization of the technology and strategy used for the mapping. XML Storage covers all the above aspects of persisting XML document collections.
Historical Background Even though the need for XML storage naturally arose after the emergence of XML, similar techniques had been developed earlier, since the mid-1990’s, to store semi-structured data. For example, the LORE system included a storage manager specifically designed for semi-structured objects, while the STORED system allowed the definition of mappings from semi-structured data to relations. Even earlier, storage techniques and storage systems had been developed for object-oriented data. These techniques focused on storing individual objects, including their private and public data and their methods. Important tasks included performing garbage collection, managing object migration and maintaining class extents. Object clustering techniques were developed that used the class hierarchy and the composition hierarchy (i.e., which object is a component of which other object) to help determine object location. These techniques, and the implemented object storage systems, such as the O2 storage system, influenced the development of subsequent semi-structured and XML storage systems. Moreover, the above solutions or ad-hoc approaches had also been used for the storage of large SGML (Standard Generalized Markup Language, a superset and precursor to XML) documents.
Foundations Given the wide use of XML, most applications need or will need to process and manipulate XML documents, and many applications will need to store and retrieve data from large documents, large collections of documents, or both. As an exchange format, XML can be simply serialized and stored in a file, but serialized document storage often is very inefficient for query processing and updates.
As a result, a large-scale XML storage infrastructure is critical to modern application performance. Figure 1a shows a simple graphical outline of an XML DTD for movies and television shows. As this example shows, XML data may exhibit great variety in their structure. At one extreme, relationalstyle data like title, year and boxoff children of show in Fig. 1a may be represented in XML. At the opposite extreme are highly irregular structures such as might be found under the reviews tag. Figure 2 shows a similar graphical representation of a real-life DTD for scientific articles. Since every HTML structure or formatting element is also an XML element or attribute, the corresponding XML tree is very deep and wide, and no two sections are likely to have the same structure. XML processing workloads are also diverse. Queries and updates may affect or return few or many nodes. They may also need to ‘‘visit’’ large portions of an XML
tree and return nodes that are far apart, such as all the box-office receipts for movies, or may only return or affect nodes that are ‘‘close’’ together, such as all the information pertaining to a single review. A few different ways of persisting XML document collections are used, and each addresses differently the challenges posed by the varied XML documents and workloads. Instance-Driven Storage
In instance-driven storage, the storage of XML content is driven by the tree structure of the individual document, ignoring the types assigned to the nodes by a schema (if one exists). In some cases, e.g., when documents have irregular structure or an application mostly generates navigations to individual elements, instance-driven storage can greatly simplify the task of storing XML content. One instance-driven technique is to
XML Storage. Figure 1. Movie DTD and example native storage strategy.
XML Storage. Figure 2. Graphical outline of a complex article DTD (strongly simplified).
store nodes and edges in one or more relational tables. A second approach is to implement an XML data model natively. Tabular Storage of Trees
A relational schema for encoding any XML instance may include relations child, modeling the parent-child relationship, and tag, attr, id, text, associating to each element node respectively a tag, an attribute, an identity and a text value, as well as sets that contain the root of the document and the set of all its elements. Notice that such a schema does not allow full reconstruction of an original XML document, as it does not retain information on element order, whitespace, comments, entity references etc. The encoding of element order, which is a critical feature of XML, is discussed later in this article. A relational schema for encoding XML may also need to capture built-in integrity constraints of XML documents, such as the fact that every child has a single parent, every element has exactly one tag, etc. Tabular storage of trees as described enables the use of relational storage engines as the target storage technology for XML document collections. While capable of storing arbitrary documents, with this approach a large number of joins may be required to answer queries, especially when reconstructing subtrees. This is the basic storage mapping supported by Microsoft’s SQL Server as of 2007. Native XML Storage Native XML storage software implements data structures and algorithms specifically designed to store XML documents on secondary memory. These data structures support one or more of the XML data models. Salient functional requirements implied by standard data models include the preservation of child order, a stable node identity, and support for type information (depending on the data model supported). An additional functional requirement in XML data stores is the ability to reconstruct the exact textual representation of an XML document, including details such as encoding, whitespace, attribute order, namespace prefixes, and entity references. A native XML storage implementation generally maps tree nodes and edges to storage blocks in a manner that preserves tree locality, i.e., that stores parents and children in the same block. The strategy is to map XML tree structures onto records managed by a storage manager for variable-size records. One possible approach is to map the complete document
to a single Binary Large Object and use the record manager’s large object management to deal with documents larger than a page. This is one of the approaches for XML storage supported by the commercial DBMS Oracle as of 2007. This approach incurs significant costs both for update and for query processing. A more sophisticated strategy is to divide the document into partitions smaller than a disk block and map each partition to a single record in the underlying store. Large text nodes and large child node lists are handled by chunking them and/or introducing auxiliary nodes. This organization supports efficient local navigation and tree reconstruction without, for example, loading the entire tree into memory. Such an approach is used in the commercial DBMS IBM DB2 as of 2007 (starting with version 9). Native stores can support efficiently updates, concurrency control mechanisms and traditional recovery schemes to preserve durability (see ACID Properties). Figure 1b shows a hypothetical instance of the schema of Fig. 1a. The types of nodes are indicated by shape. One potential assignment of nodes to physical storage records is shown as groupings inside dashed lines. Note that show elements are often physically stored with their review children, and reviews are frequently stored with the next or previous review in document order. Physical-level heuristics that can be implemented to improve performance include compressed representation of node pointers inside a block, and string dictionaries allowing integers to replace strings appearing repeatedly, such as tag names and namespace URIs. Schema-Driven Storage
When information about the structure of XML documents is given, e.g., in a DTD or an XML Schema (see XML Schema), techniques have been developed for XML storage that exploit this information. In general, nodes of the same type according to the schema are mapped in the same way, for example to a relational table. Schema information is primarily exploited for tabular storage of XML document collections, and in particular in conjunction with the use of a relational storage engine as the underlying technology, as described in the next paragraph. In hybrid XML storage different data models, and potentially even different systems, store different document parts.
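Before turning to relational mappings in detail, the following Python sketch illustrates the tabular (node/edge) storage of trees described earlier: a document is shredded into child, tag and text relations, and even a short path query already requires joins over them. The relation and function names are ad hoc for this example, and element order, attributes, comments and other details that a real system must preserve are ignored.

```python
import xml.etree.ElementTree as ET

def shred(xml_text):
    """Shred a document into 'child', 'tag' and 'text' relations (order ignored)."""
    tag, text, child = {}, {}, []          # node_id -> tag / text; (parent_id, child_id) pairs
    counter = 0
    def visit(elem, parent_id):
        nonlocal counter
        node_id = counter
        counter += 1
        tag[node_id] = elem.tag
        if elem.text and elem.text.strip():
            text[node_id] = elem.text.strip()
        if parent_id is not None:
            child.append((parent_id, node_id))
        for c in elem:
            visit(c, node_id)
    visit(ET.fromstring(xml_text), None)
    return tag, text, child

tag, text, child = shred("<show><title>X</title><year>1994</year></show>")

# Even the short path /show/year needs a join of the 'child' relation with 'tag' (twice):
years = [c for (p, c) in child if tag[p] == "show" and tag[c] == "year"]
print([text[n] for n in years])            # -> ['1994']
```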
Relational Storage for XML Documents
Techniques have been developed that enable the effective use of a relational database management system to store XML. Figure 3a illustrates the main tasks that must be performed for storing XML in relational databases. First, the schema of the XML document is mapped into a suitable relational schema that can preserve the information in the original XML documents (Storage Design). The resulting relational schema needs to be optimized at the physical level, e.g., with the selection of appropriate file structures and the creation of indices, taking into account the distinctive characteristics of XML queries and updates in general and of the application workload in particular. XML documents are then shredded and loaded into the flat tables (Data Loading). At runtime, XML queries are translated into relational queries, e.g., in SQL, submitted to the underlying relational system and the results are translated back into XML (Query Translation). Schema-driven relational storage mappings for XML documents are supported by the Oracle DBMS. An XML-to-relational mapping scheme consists of view definitions that express what data from the XML document should appear in each relational table and constraints over the relational schema. The views generally map elements with the same type or tag name to a table and define a storage mapping. For example, in Fig. 3b, two views, V1 and V2 are used to populate the Actors and Shows tables respectively. A particular set of storage views and constraints along with physical storage and indexing options together comprise a storage design. The process of parsing an XML document and populating a set of relational views according to a storage design is referred to as shredding. Due to the mismatch between the tree-structure of XML documents and the flat structure of relational tables, there are many possible storage designs. For example, in Fig. 3b, if an element, such as show, is guaranteed to have only a single child of a particular type, such as seasons, then the child type may optionally be inlined, i.e., stored in the same table as the parent. On the other hand, due to the nature of XML queries and updates, certain indexing and file organization options have been shown to be generally useful. In particular, the use of B-tree indexes (as opposed to hash-based indexes) is usually beneficial, as the translation of XML queries into relational languages often involves range conditions. There is evidence that the
best file organization for the relations resulting from XML shredding is index-organized tables, with the index on the attribute(s) encoding the order of XML elements. With such file organization, index scanning allows the retrieval of the XML elements in document order, as required by XPath semantics, with a minimum number of random disk accesses. The use of a path index that stores complete root-to-node paths for all XML elements also provides benefits. Cost-Based Approaches A key quality of a storage mapping is efficiency – whether queries and updates in the workload can be executed quickly. Cost-based mapping strategies can derive mappings that are more efficient than mappings generated using fixed strategies. In order to apply such strategies, statistics on the values and structure of an XML document collection need to be gathered. A set of transformations and annotations can be applied to the XML schema to derive different schemas that result in different relational storage mappings, for example by merging or splitting the types of different XML elements, and hence mapping them into the same or different relational tables. Then, an efficient mapping is selected by comparing the estimated cost of executing a given application workload on the relational schema produced by each mapping. The optimizer of the relational database used as storage engine can be used for the cost estimation. Due to the size of the search space for mappings generated by the schema transformations, efficient heuristics are needed to reduce the cost without missing the most efficient mappings. Physical database design options, such as vertical partitioning of relations and the creation of indices, can be considered in addition to logical database design options, to include potentially more efficient mappings in the search space. The basic principles and techniques of cost-based approaches for XML storage are shared with relational cost-based schema design. Correctness and Losslessness An important issue in designing mappings is correctness, notably, whether a given mapping preserves enough information. A mapping scheme is lossless if it allows the reconstruction of the original documents, and it is validating if all legal relational database instances correspond to a valid XML document. While losslessness is enough for applications involving only queries over the documents, if documents must conform to an XML schema
XML Storage. Figure 3. Relational storage workflow and example.
and the application involves both queries and updates to the documents, schema mappings that are validating are necessary. Many of the mapping strategies proposed in the literature are (or can be extended to be) lossless. While none of them are validating, they can be extended with the addition of constraints to only allow updates that maintain the validity of the XML document. In particular, even though losslessness and validation are undecidable for a large class of mapping schemes, it is possible to guarantee information
preservation by designing mapping procedures which guarantee these properties by construction. Order Encoding Schemes Different techniques have been proposed to preserve the structure and order of XML elements that are mapped into a relational schema. In particular, different labeling schemes have been proposed to capture the positional information of each XML element via the assignment of node labels. An important goal of such schemes is to be able
to express structural properties among nodes, e.g., the child, descendant, following sibling and other relationships, as conditions on the labels. Most schemes are either prefix-based or range-based and can be used with both schema-driven and instance-based relational storage of XML. In prefix-based schemes, a node’s label includes as a prefix the label of its parent. Dewey-based order encodings are the best known prefix-based schemes. The Dewey Decimal Classification was originally developed for general knowledge classification. The basic Dewey-based encoding assigns to each node in an XML tree an identifier that records the position of a node among its siblings, prefixed by the identifier of its parent node. In Fig. 1b, the Dewey-based encoding would assign the identifier 1.1.2 to the dashed-line year element. In range-based order encodings, such as interval or pre/post encoding, a unique {start,end} interval identifies each node in the document tree. This interval can be generated in multiple ways. The most common method is to create a unique identifier, start, for each node in a preorder traversal of the document tree, and a unique identifier, end, in a postorder traversal. Additionally, in order to distinguish children from descendants, a level number needs to be recorded with each node. An important consideration for any order-encoding scheme is to be able to handle updates in the XML documents, and many improvements have been made to the above basic encodings to reduce the overhead associated with updates. Hybrid XML Storage
Some XML documents have both very structured and very unstructured parts. This has led to the idea of hybrid XML storage, where different data models, and even systems using different storage technologies, store different document parts. For example, in Fig. 3b, review elements and their subtrees can be stored very differently from show elements, e.g., by serializing each review according to the dashed lines in the figure or by storing them in a native XML storage system. Prototype systems such as MARS and XAM have been proposed that support a hybrid storage model at the system level, i.e., provide physical data independence. In these systems, different access methods corresponding to the different storage mappings are formally described using views and constraints, and query processing
involves the use of query rewriting using views. Moreover, an appropriate tool or language is necessary to specify hybrid storage designs effectively and declaratively. An additional consideration in favor of hybrid XML storage is that storing some information redundantly using different techniques can improve the performance of querying and data retrieval significantly by combining their benefits. For example, schema-directed relational storage mappings often give better performance for identifying the elements that satisfy an XPath query, while native storage allows the direct retrieval of large elements. In environments where updates are infrequent or update cost less important than query performance, such as various web-based query systems, such redundant storage approaches can be beneficial.
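To make the order-encoding discussion concrete, the sketch below computes both a Dewey-style prefix label and a pre/post ("range") label for every node of a small in-memory tree, and checks an ancestor-descendant relationship purely from the labels. The exact label formats are illustrative assumptions; deployed systems differ in many details, update handling in particular.

```python
import xml.etree.ElementTree as ET

def label(root):
    """Assign Dewey labels (tuples) and (pre, post, level) triples to every node."""
    dewey, pre_post = {}, {}
    counter = {"pre": 0, "post": 0}
    def visit(elem, prefix, level):
        counter["pre"] += 1
        pre = counter["pre"]
        dewey[elem] = prefix
        for i, child in enumerate(elem, start=1):
            visit(child, prefix + (i,), level + 1)
        counter["post"] += 1
        pre_post[elem] = (pre, counter["post"], level)
    visit(root, (1,), 0)
    return dewey, pre_post

def is_ancestor_dewey(a, d):               # a's label is a proper prefix of d's
    return len(a) < len(d) and d[:len(a)] == a

def is_ancestor_range(a, d):               # (pre, post, level) triples
    return a[0] < d[0] and a[1] > d[1]

root = ET.fromstring("<show><name><first>A</first></name><year>1994</year></show>")
dewey, rng = label(root)
first = root.find("name/first")
print(is_ancestor_dewey(dewey[root], dewey[first]))   # -> True
print(is_ancestor_range(rng[root], rng[first]))       # -> True
```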
Key Applications XML Storage techniques are used to efficiently store XML documents, XML messages, accumulated XML streams and any other form of XML-encoded content. XML Storage is a key component of an XML database management system. It can also provide significant benefits for the storage of semi-structured information with mostly tree structure, including scientific data.
Cross-references
▶ Dataguide ▶ Deweys ▶ Intervals ▶ Storage Management ▶ Top-k XML Query Processing ▶ XML Document ▶ XML Indexing ▶ XML Schema ▶ XPath/XQuery
Recommended Reading 1. Arion A., Benzaken V., Manolescu I., and Papakonstantinou Y. Structured materialized views for XML queries. In Proc. 33rd Int. Conf. on Very Large Data Bases, 2007, pp. 87–98. 2. Barbosa D., Freire J., and Mendelzon A.O. Designing information-preserving mapping schemes for XML. In Proc. 31st Int. Conf. on Very Large Data Bases, 2005, pp. 109–120. 3. Beyer K., Cochrane R.J., Josifovski V., Kleewein J., Lapis G., ¨ zcan F., Pirahesh H., Seemann N., Lohman G., Lyle B., O Truong T., der Linden B.V., Vickery B., and Zhang C. System RX: one part relational, one part XML. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005, pp. 347–358. 4. Chaudhuri S., Chen Z., Shim K., and Wu Y. Storing XML (with XSD) in SQL databases: interplay of logical and physical designs. IEEE Trans. Knowl. Data Eng., 17(12):1595–1609, 2005.
5. Deutsch A., Fernandez M., and Suciu D. Storing semi-structured data with STORED. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 431–442. 6. Fiebig T., Helmer S., Kanne C.C., Moerkotte G., Neumann J., Schiele R., and Westmann T. Anatomy of a native XML base management system. VLDB J., 11(4):292–314, 2003. 7. Georgiadis H. and Vassalos V. XPath on steroids: exploiting relational engines for XPath performance. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2007, pp. 317–328. 8. Ha¨rder T., Haustein M., Mathis C., and Wagner M. Node labeling schemes for dynamic XML documents reconsidered. Data Knowl. Eng., 60(1):126–149, 2007. 9. McHugh J., Abiteboul S., Goldman R., Quass D., and Widom J. Lore: a database management system for semistructured data. ACM SIGMOD Rec., 26:54–66, 1997. 10. Shanmugasundaram J., Tufte K., He G., Zhang C., DeWitt D., and Naughton J. Relational databases for querying XML documents: limitations and opportunities. In Proc. 25th Int. Conf. on Very Large Data Bases, 1999, pp. 302–314. 11. Ve´lez F., Bernard G., and Darnis V. The O2 object manager: an overview. In Building an Object-Oriented Database System, The Story of O2. F. Bancilhon, C. Delobel, and P.C. Kanellakis (eds.), Morgan Kaufmann, San Francisco, CA, USA, 1992, pp. 343–368.
XML Stream Processing
Christoph Koch
Cornell University, Ithaca, NY, USA
Definition XML Stream Processing refers to a family of data stream processing problems that deal with XML data. These include XML stream filtering, transformation, and query answering problems. A main distinguishing criterion for XML stream processing techniques is whether to filter or transform streams. In the former scenario, XML streams are usually thought of as consisting of a sequence of rather small XML documents (e.g., news items), and the (Boolean) queries decide for each item to either select or drop it. In the latter scenario, the input stream is transformed into a possibly quite different output stream, often using an expressive transformation language.
Historical Background With the spread of the XML data exchange format in the late 1990s, the research community has become interested in processing streams of XML data. The selective dissemination of information that is not strictly tuple-based, such as electronically disseminated
news, was one of the first, and remains one of the foremost applications of XML streams. Subscriptions to items of interest in XML Publish-Subscribe systems are usually expressed in weak XML query languages such as tree patterns and fragments of the XPath query language. This problem has been considered with the additional difficulty that algorithms have to scale to large numbers of queries to be matched efficiently in parallel. In addition to XML publish-subscribe in the narrow sense, a substantial amount of research has addressed the problems of processing more expressive query and stream transformation languages on XML streams. This includes automata as well as data-transformation query language-based approaches (e.g., XQuery).
Foundations Controlling Memory Consumption. Efficient stream processing is only feasible for data processing problems that can be solved by strictly linear-time one-pass processing of the data using very little main memory. For the XML stream filtering problem, and restricted query languages such as XML tree patterns, linear-time evaluation techniques are not hard to develop (see e.g., [1] for an extensive survey). However, developing one-pass filtering techniques that require little main memory is nontrivial. Techniques from communication complexity have been used to study memory lower bounds of streaming XPath evaluation algorithms. It has been observed in [5] that there can be no streaming algorithm with memory consumption sublinear in the depth of the data tree, even for simple tree pattern queries and very small XPath fragments. This is a tight bound. Boolean queries in query languages of considerable expressiveness can be processed on XML streams using memory linear in the depth of the tree and independent of any other aspects of its size. By classical reductions between logics and automata on trees, this includes queries in first-order and even monadic second-order logic and as a consequence tree pattern queries and a large class of XPath queries [1]. Since the nesting depth of XML documents tends to be shallow, this is a positive result. Selecting Nodes or Subtrees. The above observation – that memory consumption does not depend on the size of the XML document but only on the depth of the parse tree – only holds for the problem of testing, for each document, whether it matches the tree pattern or
not. If output is to be produced by selecting documents or subtrees, then there are simple queries which require most of the document to be buffered. For example, consider the following two XML documents,
⟨A⟩⟨B/⟩ ... ⟨B/⟩⟨C/⟩⟨/A⟩ and ⟨A⟩⟨B/⟩ ... ⟨B/⟩⟨D/⟩⟨/A⟩,
and the XPath query /*[C]. Any implementation of this query that is to output selected documents must select the entire left document but not the right. Hence such an implementation will have to buffer the prefix of either document up to the C resp. D node. This may amount to buffering almost all of the document. The problem of efficiently selecting nodes using XPath on XML streams with close to minimum space requirements was studied in several works, and results are usually space bounds depending linearly on the depth of the data tree, a function of certain properties of the query, and the number of candidate output nodes from the data tree, which in some cases can be nearly all the nodes in the parse tree of the XML document. Automata-Based Stream Processing. A large part of the research into efficient XML stream processing is based on compiling queries or subscriptions into automata that then run on the data stream. This is not surprising since automata provide a natural one-pass processing model. For most forms of automata one can analyze the runtime memory usage easily. Note, however, that finite word automata are not sufficiently expressive to keep track of the position in the nested XML stream. Translating XPath queries into pushdown automata has been studied in several works. Pushdown automata yield document depth-bounded space usage. The pushdown automaton nature is somewhat concealed in some of this work; the processing model can be thought of as a finite word automaton for a query path expression which runs on the path from the root node of the XML tree to the current data tree node. There is also a pushdown automaton, independent of the path expression, that acts as a controller for the word automaton, managing the current path using a stack and conceptually rerunning the word automaton every time a new node in the stream is encountered. The blow-up required to compute deterministic finite word automata for query path matching is exponential in the size of the query. The sources of this
exponentiality were explored in [4]. An alternative is to maintain a nondeterministic finite word automaton for path matching, as done in YFilter [3], where the technique is described in detail. Transducer Networks. In the following, a technique is described for constructing automata that have the power to effect stream transformation while being deterministic, consuming little main memory, and being of size polynomial in the size of the query. The exponential size of deterministic automata is avoided by not compiling the automata for managing and recognizing the subexpressions of an XPath query into a single "flat" automaton. These automata are instead kept apart, as a transducer network [6]. A transducer network consists of a set of synchronously running transducers (here, deterministic pushdown transducers) where each transducer runs, possibly in parallel with some other transducers, either on the input XML stream, or on the output of another transducer (in which case the input is the original stream where some nodes may have been annotated using labels). Two transducers may also be "joined," producing output whose annotations are pairs consisting of the annotations produced by the two input transducers. Next, this is formalized, and some of the transducers that form part of a transducer network are exhibited. XPath queries are first rewritten into nested filters with paths of length one; for instance, the query child::A/descendant::B is first rewritten into child[lab() = A ∧ descendant[lab() = B]]. To emphasize the goal of checking whether the query can be successfully matched, rather than computing nodes matched by a path, axis filters are written as ∃child[φ] and ∃descendant[φ]. The rewritten queries will now be translated into transducer networks inductively. The axes used have to be forward axes (child, next sibling, descendant, and following). A large class of XPath queries with other axes (e.g., ancestor) can be rewritten to use just forward axes [2]. A deterministic pushdown transducer T is a tuple (Σ, Γ, O, Q, q0, F, δ) with input alphabet Σ, stack alphabet Γ, output alphabet O, set of states Q, start state q0, set of final states F, and transition function δ : Q × Σ × ({ε} ∪ Γ) → Q × Γ* × O. For no q ∈ Q, s ∈ Σ, γ ∈ Γ are both δ(q, s, ε) and δ(q, s, γ) defined. Here ε denotes the empty word. All transducers will have Q = F; that is, all states are final states, so all valid
runs will be accepting. If the transducer T is in state q and has uv on the stack, and if δ(q, s, v) = (q′, w, o), then T makes a transition to state q′ and stack uw (u, v, w ∈ Γ*) on input s, and produces output o, denoted (q, uv) →s/o (q′, uw). A run on input s1...sn is a sequence of transitions (q0, ε) →s1/o1 ... →sn/on (q, u) that produces output o1...on. A transducer T[∃descendant[φ]] running on the output stream of transducer T[φ] is a deterministic pushdown transducer with Σ = O = {⟨⟩, t, f}, Γ = {t, f}, Q = F = {qf, qt}, q0 = qf, and transition function
δ: (qx, ⟨⟩, ε) ↦ (qf, x, ⟨⟩)
   (qx, y, z) ↦ (q_{x∨y∨z}, ε, x)   for y, z ∈ {t, f}.
On seeing an opening tag of a node, this transducer memorizes on the stack whether φ was matched in the subtrees of the previously seen siblings of that node. On returning (i.e., seeing a closing tag), the transducer labels the node (by its proxy the closing tag) with t or f (true or false) depending on whether φ was matched in the node's subtree, which is encoded in the state.
Example 1. On input ⟨⟩ ⟨⟩ ⟨⟩ ⟨⟩ f t t ⟨⟩ ⟨⟩ ⟨⟩ t f f t, T[∃descendant[·]] has the run
(qf, ε) →⟨⟩/⟨⟩ (qf, f) →⟨⟩/⟨⟩ (qf, ff) →⟨⟩/⟨⟩ (qf, fff) →⟨⟩/⟨⟩ (qf, ffff) →f/f (qf, fff) →t/f (qt, ff) →t/t (qt, f) →⟨⟩/⟨⟩ (qf, ft) →⟨⟩/⟨⟩ (qf, ftf) →⟨⟩/⟨⟩ (qf, ftff) →t/f (qt, ftf) →f/t (qt, ft) →f/t (qt, f) →t/t (qt, ε)
and produces output ⟨⟩ ⟨⟩ ⟨⟩ ⟨⟩ f f t ⟨⟩ ⟨⟩ ⟨⟩ f t t t (see Fig. 1). A transducer T[∃child[φ]] can be defined similarly. The transducers for testing labels and computing conjunctions of filters do not need a stack. The transducer T[lab() = A] has the opening and closing tags of the XML document as input alphabet Σ, O = {⟨⟩, t, f}, Q = F = {q0}, and δ = {(q0, ⟨·⟩, ε) ↦ (q0, ε, ⟨⟩), (q0, ⟨/A⟩, ε) ↦ (q0, ε, t), (q0, ⟨/B⟩, ε) ↦ (q0, ε, f)} (where B stands for all node labels other than A). The transducer T[φ ∧ ψ] has Σ = {⟨⟩} ∪ {t, f}², O = {⟨⟩, t, f}, Q = F = {q0} and δ = {(q0, ⟨⟩, ε) ↦ (q0, ε, ⟨⟩), (q0, (x, y), ε) ↦ (q0, ε, x ∧ y)}. The overall execution of a transducer network is exemplified in Fig. 1, where the filter that matches the XPath expression self::A/descendant::B, rewritten into (∃descendant[lab() = B]) ∧ lab() = A, is evaluated using a transducer network. The transducers for the different subexpressions run synchronously; each symbol (opening or closing tag) from the input stream is first transformed by T[φ1] and T[φ3]; the output of T[φ1] is piped into T[φ2] and the output of both T[φ2] and T[φ3], as a pair of symbols, is piped into T[φ4]. Only then is the next symbol of the input stream processed, which is handled in the same way, and so on. In the example of Fig. 1, the final transducer labels exactly those nodes t on which the filter is true. Checking whether the filter can be matched on the root node, which is not the case in this example, can be done using an additional pushdown automaton which is not exhibited here but is simple to define.
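The run of Example 1 can be reproduced by directly simulating the two transition rules of T[∃descendant[φ]]. In the Python sketch below, "<>" stands for an opening tag and the t/f symbols are the labels assumed to have been attached to closing tags by the inner transducer; the code is illustrative only and not taken from any of the cited systems.

```python
def run_exists_descendant(stream):
    """Simulate T[exists-descendant[phi]] on a stream of '<>', 't', 'f' symbols."""
    state, stack, output = "f", [], []
    for symbol in stream:
        if symbol == "<>":                 # opening tag: remember siblings' result, restart
            stack.append(state)
            state = "f"
            output.append("<>")
        else:                              # closing tag labeled 't' or 'f' by T[phi]
            output.append(state)           # did phi match strictly below this node?
            z = stack.pop()
            state = "t" if "t" in (state, symbol, z) else "f"
    return output

stream = ["<>", "<>", "<>", "<>", "f", "t", "t", "<>", "<>", "<>", "t", "f", "f", "t"]
print("".join(run_exists_descendant(stream)))   # -> <><><><>fft<><><>fttt
```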
XML Stream Processing. Figure 1. Document tree (top left), transducer network (top right), and run of the transducer network (bottom).
Key Applications See XML Publish-Subscribe.
Cross-references
▶ Data Stream ▶ XML Publish/Subscribe
Recommended Reading
1. Benedikt M. and Koch C. XPath unleashed. ACM Comput. Surv., 41(3), 2009.
2. Bry F., Olteanu D., Meuss H., and Furche T. Symmetry in XPath. Tech. Rep. PMS-FB-2001-16, LMU München, 2001, short version.
3. Diao Y. et al. YFilter: efficient and scalable filtering of XML documents. In Proc. 18th Int. Conf. on Data Engineering, 2002, pp. 341–342.
4. Green T.J., Miklau G., Onizuka M., and Suciu D. Processing XML streams with deterministic automata. In Proc. 9th Int. Conf. on Database Theory, 2003, pp. 173–189.
5. Grohe M., Koch C., and Schweikardt N. Tight lower bounds for query processing on streaming and external memory data. Theor. Comput. Sci., 380(1–2):199–217, 2007.
6. Olteanu D. SPEX: streamed and progressive evaluation of XPath. IEEE Trans. Knowl. Data Eng., 19(7):934–949, 2007.
XML Tree Pattern, XML Twig Query
Laks V. S. Lakshmanan
University of British Columbia, Vancouver, BC, Canada
Synonyms Tree pattern queries; TPQ; TP; Twigs
Definition A tree pattern query (also known as twig query) is a pair Q = (T, F), where T is a node-labeled and edge-labeled tree with a distinguished node x ∈ T and F is a boolean combination of constraints on nodes. Node labels are variables such as $x, $y. Edge labels are one of "pc," "ad," indicating parent-child or ancestor-descendant. Node constraints are of the form $x.tag = TagName or $x.data relOp val, where $x.data denotes the data content of node $x, and relOp is one of =, <, >, ≤, ≥, ≠. Informally, a tree pattern query specifies a pattern tree, with a set of constraints. Some of the constraints specify what the node labels (tags) should be. Some of them specify how pairs of nodes are related to one
another – as a parent-child or as an ancestor-descendant. Finally, constraints on the data content of nodes enforce what data values are expected to be present at the nodes. Taken together, a tree pattern is similar in concept to a selection condition in relational algebra. There, the condition specifies what it takes a tuple to be part of a selection query result. A tree pattern similarly specifies what it takes an XML fragment (i.e., a subtree of the input data tree) to be part of a selection query result. This will be formalized below.
Historical Background One of the earliest papers to use the term twig query was Chen et al. [1]. In this paper, the authors were interested in estimating the number of matches of a twig query against an XML data tree, using a summary data structure. However, the notion of a twig in this paper was confined to tree patterns without ancestor-descendant relationships. Furthermore, data values were required to be strings over an alphabet (i.e., PCDATA) and a match was defined based on identity of node labels in the twig query and the labels of corresponding nodes in the data tree. Amer-Yahia et al. [2] introduced the notion of tree pattern query where the authors allowed both parentchild and ancestor-descendant relationships between pattern nodes. They motivated such queries in the context of querying LDAP-style network directories and as a core operation in query languages such as XML-QL and Quilt (the prevalent predecessors of XQuery). They focused on minimization of tree pattern queries. Jagadish et al. [3] was the first paper to define tree pattern queries in the current general form. They did so in the context of TAX – a tree algebra for XML data manipulation. Actually, tree patterns as defined in [3] allow for a richer class of node predicates than defined above. An XML data management system called TIMBER that was developed using the foundations of TAX algebra is described in Jagadish et al. [4]. Subsequently, an extensive body of work has flourished on various aspects of tree pattern (or twig) queries, ranging from their minimization [2,5,6], their efficient evaluation via the so-called structural and holistic joins [7,8] and their extensions leveraging indices [9,10]. Twig queries have served as a basis for defining the semantics of group-by for XML data [11]. Subsequently, efficient algorithms for computing groupby queries over XML data were reported in [12] while [13] addresses efficient computation of XML cube. All of
these works use tree patterns in an essential way. Given the importance of approximate match when searching an XML document, papers dealing with approximate match have also taken tree patterns as a basis. These include tree pattern relaxation [14] and FleXPath [15]. Last but not the least, given the importance of twig queries, much work has been done on estimating selectivity of twig queries [16–18].
Foundations Let D = (N, A) be an XML data tree, i.e., an ordered node-labeled tree whose nodes are labeled by tags, which are element names drawn from an alphabet Π, and whose leaf nodes are additionally labeled by strings from Σ*. The latter correspond to PCDATA and represent the data content of leaf nodes. A twig query specifies a pattern tree. The semantics of such a query is based on the notion of a match. Let Q = (T, F) be a tree pattern, where T = (V, E) is a node-labeled and edge-labeled tree, and let D be an XML data tree. Then a match is defined based on the notion of an embedding. Given Q and D as above, an embedding is a function h: V → N, mapping query or pattern nodes to data nodes such that the following conditions are satisfied: Whenever (x, y) ∈ E is an edge labeled "pc" (resp., "ad"), h(y) is a child (resp., descendant) of h(x) in D. The boolean combination of conditions of the form $x.tag = TagName and $x.data relOp val in F is satisfied: An atom $x.tag = TagName is satisfied provided $x is the label of node v in Q and the tag of the data node h(v) is TagName.
An atom $x.data relOp val is satisfied provided $x is the label of node v of Q and the node label of the data node h(v) stands in relationship relOp to the constant val. Satisfaction w.r.t. boolean combinations of atoms is defined in the standard way. Figure 1 shows a sample data tree. Figure 2 shows a tree pattern query and illustrates matches. In this figure, the distinguished node (node labeled $n) is identified by a surrounding box. Informally, this pattern says find the immediate or transitive name sub-elements of book elements such that the book was published after 1900, as indicated by an immediate year subelement of the book element. Note the edge labels ‘‘pc’’ and ‘‘ad’’ in Fig. 2(a) that specify the intended relationship between elements. In Fig. 2(b), the matching data nodes are indicated in red. The result of a selection query with this tree pattern against the input of Fig. 1(a) consists of the two name sub-elements corresponding to matches of the distinguished node of the tree pattern.
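A naive main-memory evaluation of such a pattern makes the embedding definition concrete. In the Python sketch below, the pattern encoding (dicts with tag, axis, pred and out fields) is an ad hoc illustration rather than the TAX syntax, and the sample document only mimics the spirit of Figs. 1 and 2; efficient algorithms such as structural and holistic joins [7,8] avoid this exhaustive search.

```python
import xml.etree.ElementTree as ET

def descendants(e):
    for c in e:
        yield c
        yield from descendants(c)

def embed(pattern, node):
    """Return (ok, outs): ok iff some embedding maps the pattern root to node;
    outs collects the corresponding bindings of the distinguished node."""
    tag = pattern.get("tag", "*")
    pred = pattern.get("pred", lambda n: True)
    if (tag != "*" and tag != node.tag) or not pred(node):
        return False, []
    outs = [node] if pattern.get("out") else []
    for child_pat in pattern.get("children", []):
        cands = list(node) if child_pat["axis"] == "pc" else list(descendants(node))
        results = [embed(child_pat, c) for c in cands]
        if not any(ok for ok, _ in results):
            return False, []              # a required branch cannot be embedded
        outs += [o for ok, os in results if ok for o in os]
    return True, outs

doc = ET.fromstring(
    "<bib><book><year>1995</year><author><name>Smith</name></author></book>"
    "<book><year>1850</year><author><name>Jones</name></author></book></bib>")

# book[year > 1900]//name, with the name node distinguished
pattern = {"tag": "book", "children": [
    {"axis": "pc", "tag": "year", "pred": lambda n: int(n.text) > 1900},
    {"axis": "ad", "tag": "name", "out": True}]}

hits = []
for candidate in descendants(doc):
    ok, outs = embed(pattern, candidate)
    if ok:
        hits.extend(outs)
print([h.text for h in hits])             # -> ['Smith']
```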
Key Applications As mentioned earlier, a substantial amount of work on XML query evaluation and optimization has flourished using twig/tree patterns as a basis. One of the reasons for this is that tree pattern queries can be seen as an abstraction of a ‘‘core’’ subset of XPath [19]. In addition to the query evaluation issues mentioned above, tree pattern queries have also attracted attention from the point of view of query static analysis, study of structural properties, and query answering using views. In static analysis, Hidders [20], Lakshmanan
XML Tree Pattern, XML Twig Query. Figure 1. Example XML data tree.
XML Tree Pattern, XML Twig Query. Figure 2. (a) Example tree pattern query and (b) Matches.
et al. [21], and Benedikt et al. [22] study satisfiability of fragments of XPath queries, which can be modeled as tree patterns with possible extensions. Benedikt et al. [23] study structural properties of several XPath fragments corresponding to different extensions of the basic class of tree patterns as defined in this chapter. Xu and Ozsoyoglu [24], Deutsch and Tannen [25], and Lakshmanan et al. [26] study different formulations of the query answering using views problem for XML. It is important to investigate richer extensions of tree patterns, capturing a greater subset of the features found in XPath. By far, the most important such feature that is lacking in tree patterns is element order. Various papers have examined XPath fragments together with sibling axis. An alternative approach is to consider tree patterns with two kinds of internal nodes – ordered and unordered. An unordered node is just like a node in a standard tree pattern. An ordered internal node is one for which the order of its children in the pattern should obey a specified order. e.g., suppose x is an ordered internal node and it has children y, z, w in that order, in the pattern. Then this imposes a constraint that a match of node z should follow the corresponding match of y. Similarly, a match of w should follow the corresponding match of z. Another example extension that is worthwhile investigating is a notion of relaxation of a tree pattern with order and keyword search that is appropriate for searching XML documents.
Cross-references
▶ Top-k XML Query Processing ▶ XML Retrieval ▶ XML Schema ▶ XML Selectivity Estimation ▶ XML Views ▶ XPath/XQuery ▶ XSL/XSLT
Recommended Reading 1. Al-Khalifa S., Jagadish H.V., Koudas N., Patel J.M., Srivastava D., and Wu Y. Structural joins: a primitive for efficient XML query pattern matching. In Proc. 18th Int. Conf. on Data Engineering, 2002, pp. 141–152. 2. Amer-Yahia S., Cho S.R., Lakshmanan L.V.S., and Srivastava D. Minimization of tree pattern queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2001, pp. 497–508. 3. Amer-Yahia S., Cho S.R., and Srivastava D. Tree pattern relaxation. In Advances in Database Technology, Proc. 8th Int. Conf. on Extending Database Technology, 2002, pp. 496–513. 4. Amer-Yahia S., Lakshmanan L.V.S., and Pandit S. FleXPath: flexible structure and full-text querying for XML. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004, pp. 83–94. 5. Benedikt M., Fan W., and Geerts F. XPath satisfiability in the presence of DTDs. In Proc. 24th ACM SIGACT-SIGMODSIGART Symp. on Principles of Database Systems, 2005, pp. 25–36. 6. Benedikt M., Fan W., and Kuper G.M. Structural properties of XPath fragments. In Proc. 9th Int. Conf. on Database Theory, 2003, pp. 79–95. 7. Bruno N., Koudas N., and Srivastava D. Holistic twig joins: optimal XML pattern matching. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2002, pp. 310–321. 8. Chen Z., Jagadish H.V., Korn F., Koudas N., Muthukrishnan S., Ng R., and Srivastava D. Counting twig matches in a tree. In Proc. 17th Int. Conf. on Data Engineering, 2001, pp. 595–604. 9. Chien S.Y., Vagena Z., Zhang D., Tsotras V.J., and Zaniolo C. Efficient structural joins on indexed XML documents. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 263–274. 10. Deutsch A. and Tannen V. Reformulation of XML queries and constraints. In Proc. 9th Int. Conf. on Database Theory, 2003, pp. 225–241. 11. Flesca S., Furfaro F., and Masciari E. On the minimization of Xpath queries. In Proc. 29th Int. Conf. on Very Large Data Bases, 2003, pp. 153–164. 12. Freire J., Haritsa J., Ramanath M., Roy P., and Simeon J. StatiX: making XML count. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2002, pp. 181–191. 13. Gokhale C., Gupta N., Kumar P., Lakshmanan L.V.S., Ng R., and Prakash B.A. Complex group-by queries for XML. In Proc. 23rd Int. Conf. on Data Engineering, 2007, pp. 646–655.
14. Hidders J. Satisfiability of XPath expressions. In Proc. 9th Int. Workshop on Database Programming Languages, 2003, pp. 21–36. 15. Jagadish H.V., Lakshmanan L.V.S., Srivastava D., and Thompson K. TAX: a tree algebra for XML. In Proc. 8th Int. Workshop on Database Programming Languages, 2001, pp. 149–169. 16. Jagadish H.V., Al-Khalifa S., Chapman A., Lakshmanan L.V.S., Nierman A., Paparizos S., Patel J.M., Srivastava D., Wiwatwattana N., Wu Y., and Yu C.TIMBER: a native XML database. VLDB J., 11(4):274–291, 2002. 17. Jiang H., Lu H., Wang W., and Ooi B.C. XR-tree: indexing XML data for efficient structural joins. In Proc. 19th Int. Conf. on Data Engineering, 2003, pp. 253–263. 18. Lakshmanan L.V.S., Ramesh G., Wang H., and (Jessica) Zhao Z. On testing satisfiability of tree pattern queries. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 120–131. 19. Lakshmanan L.V.S., Wang H., and (Jessica) Zhao Z. Answering tree pattern queries using views. In Proc. 32nd Int. Conf. on Very Large Data Bases, 2006, pp. 571–582. 20. Miklau G. and Suciu D. Containment and equivalence for a fragment of XPath. In Proc. 21st ACM SIGACT-SIGMODSIGART Symp. on Principles of Database Systems, 2002, pp. 65–76. 21. Paparizos S., Al-Khalifa S., Jagadish H.V., Lakshmanan L.V.S., Nierman A., Srivastava D., and Wu Y. Grouping in XML. In EDBT 2002 Workshop on XML-Based Data Management LNCS, vol. 2490, Springer, Berlin, 2002, pp. 128–147. 22. Polyzotis N. and Garofalakis M. XSketch synopses for XML data graphs. ACM Trans. Database Syst., September 2006. 23. Polyzotis N., Garofalakis M., and Ioannidis Y. Selectivity estimation for XML twigs. In Proc. 20th Int. Conf. on Data Engineering, 2004, pp. 264–275. 24. Wiwatwattana N., Jagadish H.V., Lakshmanan L.V.S., and Srivastava D. X3: a cube operator for XML OLAP. In Proc. 23rd Int. Conf. on Data Engineering, 2007, pp. 916–925. 25. XML Path Language (XPath) Version 1.0. http://www.w3.org/ TR/xpath. ¨ zsoyoglu Z. Rewriting XPath queries using 26. Xu W. and Meral O materialized views. In Proc. 31st Int. Conf. on Very Large Data Bases, 2005, pp. 121–132.
XML Tuple Algebra
Ioana Manolescu1, Yannis Papakonstantinou2, Vasilis Vassalos3
1INRIA Saclay–Île-de-France, Orsay, France; 2University of California-San Diego, La Jolla, CA, USA; 3Athens University of Economics and Business, Athens, Greece
Synonyms Relational algebra for XML; XML algebra
Definition An XML tuple-based algebra operates on a domain that consists of sets of tuples whose attribute values are items, i.e., atomic values or XML elements (and hence, possibly, XML trees). Operators receive one or more sets of tuples and produce a set, list or bag of tuples of items. It is common that the algebra has special operators for converting XML inputs into instances of the domain and vice versa. XML tuplebased algebras, as is also the case with relational algebras, have been extensively used in query processing and optimization [1–10].
Historical Background The use of tuple-based algebras for the efficient set-ata-time processing of XQuery queries follows a typical pattern in database query processing. Relational algebras are the most typical vehicle for query optimization. Tuple-oriented algebras for object-oriented queries had also been formulated and have a close resemblance to the described XML algebras. Note that XQuery itself as well as predecessors of it (such as Quilt) are also algebras, which include a list comprehension operator (‘‘FOR’’). Such algebras and their properties and optimization have been studied by the functional programming community. They have not been the typical optimization vehicle for database systems.
Foundations The emergence of XML and XQuery motivated many academic and industrial XML query processing works. Many of the works originating from the database community based XQuery processing on tuple-based algebras since in the past, tuple-based algebras delivered great benefits to relational query processing and also provided a solid base for the processing of nested OQL data and OQL queries. Tuple-based algebras for XQuery carry over to XQuery key query processing benefits such as performing joins using set-at-a-time operations. A generic example algebra, characteristic of many algebras that have been proposed as intermediate representations for XQuery processing, is described next. It is based on a Unified Data Model that extends the XPath/XQuery data model with sets, bags, and lists of tuples; notice that the extensions are only visible to algebra operators.
Given an XML algebra, important query processing challenges include efficient implementation of the operators as well as cost models for them, algebraic properties of operator sequences and algebraic rewriting techniques, and cost-based optimization of algebra expressions. Unified Data Model
The Unified Data Model (UDM) is an extension of the XPath/XQuery data model with the following types: Tuples, with structure [$a1 = val1,...,$ak = valk], where each $ai = vali is a variable-value pair. Variable names such as $a1, $a2, etc. follow the syntactic restrictions associated with XQuery variable names; variable names are unique within a tuple. A value may be (i) the special constant ⊥ (null) (the XQuery Data Model also has a concept of nil; for clarity of exposition, the two concepts should be considered as distinct, in order to avoid discrepancies that may stem from the overloading), (ii) an item, i.e., a node (generally standing for the root of an XML tree) or atomic value, or (iii) a (nested) set, list, or bag of tuples. Given a tuple [$a1 = val1,...,$ak = valk], the list of names [$a1,...,$ak] is called the schema of the tuple. We restrict our model to homogeneous collections (sets, bags or lists) of tuples. That is, the values taken by a given variable $a in all tuples are of the same kind: either they are all items (some of which may be null), or they are all collections of tuples. Moreover, in the latter case, all the collections have the same schema. Note that the nested sets/bags/lists are typically exploited for building efficient evaluation plans. If the value of variable t.$a is a set, list or bag of tuples, let [$b1, $b2,...,$bm] be the schema of a tuple in t.$a. In this case, for clarity, we may denote a $bi variable, 1 ≤ i ≤ m, by $a.$bi (concatenating the variable names, from the outermost to the innermost, and separating them by dots). Lists, bags and sets of tuples are denoted as ⟨t1,...,tn⟩, {{t1,...,tn}}, and {t1,...,tn}, respectively, and are referred to collectively as collections. In all three cases the tuples t1,...,tn must have the same schema, i.e., collections are homogeneous. Sets have no duplicate tuples, i.e., no two tuples in a set are id-equal as defined below. Two tuples are id-equal, denoted as t1 =id t2, if they have the same schema and the values of the corresponding variables either (i) are both null, or
(ii) are both equal atomic values (i.e., they compare equal via =v), or are both nodes with the same id (i.e., they compare equal via =id), or (iii) are both sets of tuples and each tuple of one set is id-equal to a tuple of the other set. For case (iii), similar definitions apply if the variable values are bags or lists of tuples, by taking into account the multiplicity of each tuple in bags and the order of the tuples for lists.
Notation Given a tuple t = [...$x = v...], it is said that $x maps to v in the context of t. The value that the variable $x maps to in the tuple t is t.$x. The notation t′ = t + ($var = v) indicates that the tuple t′ contains all the variable-value pairs of t and, in addition, the variable-value pair $var = v. The tuple t″ = t + t′ contains all the variable-value pairs of both tuples t and t′. Finally, (id) denotes the node with identifier id.
Sample document and query A sample XML document is shown in tree form in Fig. 1. For readability, the figure also displays the tag or string content of each node. The following query will be used to illustrate the XML tuple algebra operators:
for $C in $I//customer return
<customer>{
  for $N in $C/name, $F in $N/first, $L in $N/last
  return { $F, $L, $C/address }
}</customer>
(Q1)
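To make the data model more concrete, the following Python sketch (purely illustrative; the class and function names are hypothetical and not part of any published algebra implementation) represents UDM tuples as dictionaries whose values are nodes, atomic values, None (for ⊥), or nested lists of tuples, and implements the id-equality test defined above for the list case.

```python
class Node:
    """A document node, identified by a node id (assumed unique per node)."""
    def __init__(self, node_id, label):
        self.id, self.label = node_id, label

def id_equal(t1, t2):
    """Tuples are id-equal if they have the same schema and corresponding values
    are both null, equal atomic values, nodes with the same id, or collections
    whose members are pairwise id-equal (list case shown; bags/sets analogous)."""
    if set(t1) != set(t2):                      # same schema (variable names)
        return False
    for var in t1:
        v1, v2 = t1[var], t2[var]
        if v1 is None and v2 is None:
            continue
        if isinstance(v1, Node) and isinstance(v2, Node):
            if v1.id != v2.id:
                return False
        elif isinstance(v1, list) and isinstance(v2, list):
            if len(v1) != len(v2) or not all(id_equal(a, b) for a, b in zip(v1, v2)):
                return False
        elif v1 != v2:                          # atomic values compared by value
            return False
    return True

t = {"$C": Node("c1", "customer"), "$N": None}
print(id_equal(t, {"$C": Node("c1", "customer"), "$N": None}))   # True
```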
Unified Algebra
Tuple-based XQuery algebras typically consist of operators that: (i) perform navigation into the data and deliver collections of bindings (tuples) for the variables; (ii) construct XPath/XQuery Data Model values from the tuples; (iii) create nested collections; (iv) combine collections applying operations known from the relational algebra, such as joins. It is common to have redundant operators for the purpose of delivering performance gains in particular environments; structural joins are a typical example of such redundant operators. An XQuery q, with a set of free variables V, is translated into a corresponding algebraic expression f_q, which is a function whose input is a collection with schema V. The algebraic expression outputs a collection in the Unified Data Model. The trivial operators used to convert the collection into an XML-formatted document are not discussed.
XML Tuple Algebra. Figure 1. Tree representation of sample XML document.
Selection
The selection operator σ is defined in the usual way, based on a logical formula (predicate) that can be evaluated over each of its input tuples. The selection operator outputs those tuples for which the predicate evaluates to true.
Projection
The projection operator π is defined by specifying a list of column names to be retained in the output. The operator π does not eliminate duplicates; duplicate-eliminating projection is denoted by π0.
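As an illustration only, the following minimal Python sketch shows selection and projection over collections of tuples represented as dictionaries; the function names are hypothetical, and duplicate elimination would additionally require the id-equality test of the data model.

```python
def select(predicate, tuples):
    """σ: keep the tuples for which the predicate evaluates to true."""
    return [t for t in tuples if predicate(t)]

def project(variables, tuples):
    """π: retain only the listed variables; duplicates are not eliminated."""
    return [{v: t[v] for v in variables} for t in tuples]

I = [{"$C": "c1", "$P": 10}, {"$C": "c2", "$P": 25}, {"$C": "c2", "$P": 30}]
print(select(lambda t: t["$P"] > 20, I))   # the two tuples with $P above 20
print(project(["$C"], I))                  # [{'$C': 'c1'}, {'$C': 'c2'}, {'$C': 'c2'}]
```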
Navigation The navigation operator follows the established tree pattern paradigm. XQuery tuplebased algebras involve tree patterns where the root of the navigation may be a variable binding, in order to capture nested XQueries where a navigation in the inner query may start from a variable of the outer query. Furthermore, XQuery tuple-based algebras have extended tree patterns with an option for ‘‘optional’’ subtrees: if no match is found for optional subtrees, then those subtrees are ignored and a ⊥ (null) value is returned for variables that appear in them. This feature is similar to outerjoins, and has been used in some algebras to consolidate the navigation of multiple nested queries (and hence of multiple tree patterns) into a single one. In what follows, the tree patterns are limited to child and descendant navigation. The literature describes extensions to all XPath axes. An unordered tree pattern is a labeled tree, whose nodes are labeled with (i) an element/attribute or the wildcard ∗ and (ii) optionally, a variable $var. The edges are either child edges, denoted by a single line, or descendant edges, denoted by a double line. Furthermore, edges can be required, shown as continuous lines, or optional, depicted with dashed lines. The variables appearing in the tree pattern form the schema of the tree pattern.
The semantics of the navigation operator is based on the mapping of a tree pattern to a value. Given a variable-value pair $R = ri and a tree pattern T with schema V, a mapping of T to ri is a mapping of the pattern nodes to nodes in the XML tree represented by ri such that the structural relationships among pattern nodes are satisfied. A mapping tuple (also called binding) has schema V and pairs each variable in V with the node to which it is mapped under such a mapping. The function map(T, ri) returns the set of bindings corresponding to all mappings. A mapping may be partial: a node connected to the pattern by an optional edge may not be mapped, in which case the corresponding variable is mapped to ⊥. To formally define embeddings for tree patterns T that have optional edges, the auxiliary functions pad and sp are introduced. Given a set of variables V, the function pad_V(t) extends the tuple t with a variable-value pair $v = ⊥ for every variable $v of V that is not included in t; i.e., pad pads t with nulls so that it has the desired schema V. The function is overloaded to apply to a set of tuples S, so that pad_V(S) extends each tuple of S. Given two tuples t and t′, t is said to be more specific than t′ if for every attribute/variable $v, either t.$v = t′.$v or t′.$v = ⊥. For example, [$A = na, $B = ⊥, $C = ⊥] is less specific than [$A = na, $B = nb, $C = ⊥]. Given a set S of tuples, sp(S) denotes the set that consists of all tuples of S that are not less specific than any other tuple of S. For example, sp({[$A = na, $B = ⊥, $C = ⊥], [$A = na, $B = nb, $C = ⊥]}) = {[$A = na, $B = nb, $C = ⊥]}. Then, given a tree pattern T with set of variables V and optional edges, a set of tree patterns T1,...,Tn is created that have no optional edges and are obtained by non-deterministically replacing each optional edge with a required edge, or removing the edge and the
subtree that is adjacent to it. The embeddings of the pattern T are then defined as sp(pad_V(map(T1, I) ∪ ... ∪ map(Tn, I))). Mapping an unordered tree pattern to a value produces unordered tuples. In an ordered tree pattern the variables are ordered by the preorder traversal sequence of the tree pattern, and this ordering translates into an order on the attribute values of each result tuple. Tuples in the mapping result are then ordered lexicographically according to the order of their attribute values. The navigation operator nav_T inputs and outputs a list of tuples with schema V. The parameter T is a tree pattern, whose root must be labeled with a variable $R that also appears in V. The input to the operator is a list of the form ⟨t1,...,tn⟩, where each ti, i = 1,...,n, is a tuple of the form [...,$R = ri,...]. The output of the operator is the list of tuples ti + t′i, for all i, where t′i ∈ map(T, ri). Figure 2 shows the pattern TN1 corresponding to the navigation part in query Q1, and the result of nav_TN1 on the sample document. The order of the tuples is justified as follows: the depth-first pre-order of the variables in the tree pattern is ($C, $N, $F, $L, $1). Consequently, the first tuple precedes the second
because c1 << c2, and the second tuple precedes the third tuple because a21 << a22.
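A small Python sketch of the auxiliary pad and sp functions follows (the helper names are hypothetical and mirror the definitions above): pad completes a binding with nulls for the missing schema variables, and sp keeps only the most specific bindings.

```python
def pad(schema, t):
    """Extend tuple t with $v = None (null) for every schema variable it lacks."""
    return {v: t.get(v) for v in schema}

def more_specific(t1, t2):
    """t1 is at least as specific as t2 if they agree wherever t2 is non-null."""
    return all(v2 is None or t1.get(v) == v2 for v, v2 in t2.items())

def sp(tuples):
    """Keep the tuples that are not less specific than some other tuple."""
    return [t for t in tuples
            if not any(s is not t and more_specific(s, t) and not more_specific(t, s)
                       for s in tuples)]

S = [{"$A": "na", "$B": None, "$C": None},
     {"$A": "na", "$B": "nb", "$C": None}]
print(sp([pad(["$A", "$B", "$C"], t) for t in S]))
# [{'$A': 'na', '$B': 'nb', '$C': None}]  -- only the most specific binding survives
```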
Construction
Tuple-based XQuery algebras include operators that construct XML from the bindings of variables. This is captured by the crList_L operator, which inputs a collection of tuples and outputs an XML tree/forest. The parameter L is a list of construction tree patterns, called a construction pattern list. For example, consider the query:
for $C in $I//customer, $N in $C/name,
    $F in $N/first, $L in $N/last,
    $A in $C/address
return <customer>{ $F, $L, $A }</customer>
(Q2)
Figure 3 depicts the navigation pattern TN2, the construction pattern list TC2, which consists of a single construction pattern, and a simple algebraic expression, corresponding to Q2. Nested Plans The combination of navigation and
construction operators captures the navigation and construction of unnested FLWR XQuery expressions. The following operators can be used to handle nested queries.
XML Tuple Algebra. Figure 2. Tree pattern for XQuery Q1.
XML Tuple Algebra. Figure 3. Navigation and construction for Q2.
XML Tuple Algebra. Figure 4. Algebraic plan for Q1.
XML Tuple Algebra. Figure 5. Alternative algebraic plan for Q1.
Apply
The app_{p↦$R} (as in "apply plan") unary operator takes as parameters an algebra expression p, which delivers an XML "forest", and a result variable $R, which should not appear in the schema V of the input tuples. Intuitively, for every tuple t in the input collection I, p({t}) is evaluated and the result is assigned to $R:
app_{p↦$R}(I) = {t + ($R = r) | t ∈ I, r = p({t})}
For example, query Q1 produces a customer output element for every customer element in the input. The output element includes the first name and last name pairs (if any) of the customer and, for each name, all the address children (if any) of the input element. Figure 4 depicts an algebraic plan for Q1, using the app operator. TC3, TA3 and TN3 are navigation patterns, while TU1, TU2 and TU3 are construction pattern lists, consisting of one, two, and one construction patterns, respectively. The partial plans p1 and p2 each apply some navigation starting from $C and construct XML output in $1 and $2, respectively. The first app operator reflects the XQuery nesting, while the second app corresponds to the inner return clause. The final crList operator produces customer elements. The app operator is defined on one tuple at a time. However, many architectures (and algebras) also consider a set-at-a-time execution. For instance, instead of evaluating app_{p1↦$1} on one tuple at a time (which means twice for customer c2, since she has two addresses), one could group its input by customer ID and apply p1 on one group of tuples at a time. In such cases, the following pair of operators is useful.
Group-By
The groupBy_{Gid,Gv↦$R} operator has three parameters: the list of group-by-id variables Gid, the list of group-by-value variables Gv, and the result variable $R. The operator partitions its input I into sets of tuples PG(I) such that all tuples of a partition have id-equal values for the variables of Gid and equal values for the variables of Gv. The output consists of one tuple for each partition. Each output tuple has the variables of Gid and Gv and an additional variable $R, whose value is the partition (some algebras do not repeat the variables of Gid and Gv in the partition, for efficiency reasons). Note that input tuples that have a ⊥ in any of the Gid or Gv variables are not considered.
Apply-on-Set
The app^s_{p,$V↦$R} operator assumes that the variable $V of its input I is bound to a collection
of tuples. The operator applies the plan p on t.$V for every tuple t from the input, and assigns the result to the new variable $R. For example, Fig. 5 shows another possible plan for Q1. This time a single navigation pattern TC, is used which includes optional edges. The navigation result may contain several tuples for each customer with multiple addresses. Thus, the navigation result is grouped by $C before applying p1 on the sets. The benefits of implementing nested plans by using grouping and operators that potentially deliver null tuples (outerjoin in particular) had first been observed in the context of OQL. Comparing Fig. 5 with Fig. 4 shows that the same query may be expressed by different expressions in the unified algebra. Providing multiple operators (or combinations thereof) which lead to the same computation is typical in algebras.
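The following Python sketch (the function names and the dictionary representation are illustrative only) mimics the tuple-at-a-time apply and the groupBy/apply-on-set pair described above; partitions are keyed here by plain equality rather than the id-equality of the data model.

```python
def apply(plan, result_var, tuples):
    """app: evaluate the plan on each one-tuple collection and bind the result."""
    return [{**t, result_var: plan([t])} for t in tuples]

def group_by(group_vars, result_var, tuples):
    """groupBy: partition the input and bind each partition to result_var.
    Tuples with a null in a grouping variable are not considered."""
    parts = {}
    for t in tuples:
        if any(t[v] is None for v in group_vars):
            continue
        parts.setdefault(tuple(t[v] for v in group_vars), []).append(t)
    return [{**dict(zip(group_vars, key)), result_var: part}
            for key, part in parts.items()]

def apply_on_set(plan, set_var, result_var, tuples):
    """app^s: evaluate the plan once per nested collection, not once per tuple."""
    return [{**t, result_var: plan(t[set_var])} for t in tuples]

I = [{"$C": "c2", "$A": "a21"}, {"$C": "c2", "$A": "a22"}, {"$C": "c1", "$A": None}]
grouped = group_by(["$C"], "$G", I)
out = apply_on_set(lambda part: [t["$A"] for t in part], "$G", "$1", grouped)
print(out)   # one result tuple per customer, each built from its whole group
```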
Key Applications An XML tuple algebra is an important intermediate representation for the investigation and the implementation of efficient XML processing techniques. An XML tuple algebra or similar extensions to relational algebra are used as of 2008 by XQuery processing systems such as BEA Aqualogic Data Services Platform and the open source MonetDB/XQuery system.
Cross-references
▶ Top-k XML Query Processing ▶ XML Document ▶ XML Element ▶ XML Storage ▶ XML Tree Pattern, XML Twig Query ▶ XPath/XQuery
Recommended Reading 1. Arion A., Benzaken V., Manolescu I., Papakonstantinou Y., and Vijay R. Algebra-based identification of tree patterns in XQuery. In Proc. 7th Int. Conf. Flexible Query Answering Systems, 2006, pp. 13–25. 2. Beeri C. and Tzaban Y. SAL: an algebra for semistructured data and XML. In Proc. ACM SIGMOD Workshop on The Web and Databases, 1999, pp. 37–42. 3. Cluet S. and Moerkotte G. Nested Queries in Object Bases. Technical report, 1995. 4. Deutsch A., Papakonstantinou Y., and Xu Y. The NEXT logical framework for XQuery. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 168–179.
5. Michiels P., Mihaila G.A., and Sime´on J. Put a tree pattern in your algebra. In Proc. 23rd Int. Conf. on Data Engineering, 2007, pp. 246–255. 6. Papakonstantinou Y., Borkar V.R., Orgiyan M., Stathatos K., Suta L.,Vassalos V., and Velikhov P. XML queries and algebra in the Enosys integration platform. Data Knowl. Eng., 44 (3):299–322, 2003. 7. Re C., Sime´on J., and Ferna´ndez M. A complete and efficient algebraic compiler for XQuery. In Proc. 22nd Int. Conf. on Data Engineering, 2006, p. 14. 8. The XQuery Language. www.w3.org/TR/xquery, 2004. 9. XQuery 1.0 and XPath 2.0 Data Model. www.w3.org/TR/xpathdatamodel. 10. XQuery 1.0 Formal Semantics. www.w3.org/TR/2005/WDxquery-semantics.
XML Typechecking
VÉRONIQUE BENZAKEN1, GIUSEPPE CASTAGNA2, HARUO HOSOYA3, BENJAMIN C. PIERCE4, STIJN VANSUMMEREN5
1University Paris 11, Orsay Cedex, France; 2C.N.R.S. and University Paris 7, Paris, France; 3The University of Tokyo, Tokyo, Japan; 4University of Pennsylvania, Philadelphia, PA, USA; 5Hasselt University and Transnational University of Limburg, Diepenbeek, Belgium
Definition In general, typechecking refers to the problem where, given a program P, an input type s, and an output type t, one must decide whether P is type-safe, that is, whether it produces only outputs of type t when run on inputs of type s. In the XML context, typechecking problems mainly arise in two forms: XML-to-XML transformations, where P transforms XML documents conforming to a given type into XML documents conforming to another given type. XML publishing, where P transforms relational databases into XML views of these databases and it is necessary to check that all generated views conform to a specified type. A type for XML documents is typically a regular tree language, usually expressed as a schema written in a schema language such as DTD, XML Schema, or Relax NG (see XML Types). In the XML publishing case, the input type s is a relational database schema, possibly with integrity constraints.
Typechecking problems may or may not be decidable, depending on (i) the class of programs considered, (ii) the class of input types (relational schemas, DTDs, XML Schemas, Relax NG schema, or perhaps other subclasses of the regular tree languages), and (iii) the class of output types. In cases where it is decidable, typechecking can be done exactly. In cases where it is undecidable, one must revert to approximate or incomplete typecheckers that may return false negatives – i.e., may reject a program even if it is type-safe. Even when exact typechecking is possible, approximate typechecking may be preferable as this is often computationally cheaper than exact typechecking. In the programming languages literature, typechecking often not only entails verifying that all outputs are of type t, but also requires detecting when the program may abort with a run-time error on inputs of type s [3]. The above definition encompasses such cases: view run-time errors as a special result value error and then typecheck a program against an output type that does not contain the value error.
Historical Background Although typechecking is a fundamental and wellstudied problem in the theory of programming languages [3], the types necessary for XML typechecking (based on regular tree languages) differ significantly from the conventional data types usually considered (i.e., lists, records, classes, and so on). Indeed, although it is possible to encode XML types into conventional datatypes, this encoding lacks flexibility in the sense that programs tend to need artificial changes when types evolve [22]. For this reason, Hosoya et al. [22] proposed regular tree languages as the ‘‘right’’ notion of types for XML and presented an approximate typechecker in this context. The typechecker was implemented in the XML-to-XML transformation language XDuce [21] whose approach was later extended to general purpose programming by CDuce (functional programming) and Xtatic (object-oriented programming). XDuce’s approach also lies at the basis of XQuery’s typechecking algorithm [6]. The contemporary study of exact typechecking for XML-to-XML transformations started with an investigation of relatively simple transformation languages [29,32,33]. Ironically enough, the fundamentals of exact typechecking for more advanced transformation languages were already investigated a long time before XML appeared [7,8]. These fundamentals were
revived in the XML era by Milo, Suciu, and Vianu in their seminal work on k-pebble tree transducers [30], which was later extended to other transformation languages [26,39,40]. The computational complexity of exact typechecking was investigated in [14,27,28]. Exact typechecking algorithms for XML publishing scenarios were given by Alon et al. [1].
Foundations
Exact Typechecking
XML-to-XML Transformations Recall that in this setting, P is a program that should transform XML documents of a type s into documents of a type t. When the languages in which the transformation and the types are expressed are sufficiently restricted in power, exact typechecking is possible. There are two major approaches to the construction of an exact typechecking algorithm: forward inference and backward inference. Forward inference solves the typechecking problem directly by first computing the image O of the input type s under the transformation P, i.e., O := {P(d) | d ∈ s}, and then checking O ⊆ t [28,30,32,33,37]. This approach does not work if O goes beyond the context-free tree languages, as checking O ⊆ t then becomes undecidable. Sadly, this is already the case when P is written in very simple transformation languages, such as the top-down tree transducers (this fact is known as folklore; see, e.g., [14]). Also, computing O itself becomes undecidable for more advanced transformation languages. Backward inference, on the other hand, first computes the pre-image I of the output type t under P, i.e., I := {d | P(d) ∈ t}, and then checks s ⊆ I. Backward inference often works even when the transformation language is too expressive for forward inference. The technique has successfully been applied to a range of formal models of real-world transformation languages like XSLT, from the top-down and bottom-up tree transducers [7], to macro tree transducers [8,14,27], macro forest transducers [34], k-pebble tree transducers [30], tree transducers based on alternating tree automata [39], tree transducers dealing with atomic data values [37], and high-level tree transducers [40]. As mentioned in the definition, static detection of run-time errors can be phrased as a particular form of typechecking by introducing a special output value error. Exact typechecking in this form has been
investigated for XQuery programs written in the non-recursive for-let-where-return fragment of XQuery without automatic coercions but with the various XPath axes, node constructors, value and node comparisons, and node label and content inspections, in the setting where the input type s is given by a recursion-free regular tree language. The crux of decidability here is a small-model property: if P(d) = error for some d of type s, then there exists another input d′ of type s, whose size depends only on P and s, such that P(d′) = error. It then suffices to enumerate all inputs d′ of type s up to the maximum size and check whether P(d′) = error. There are only a finite number of such d′ (up to isomorphism, and P cannot distinguish between isomorphic inputs), from which decidability follows [41]. This small-model property continues to hold when the above XQuery fragment is extended with arbitrary primitives satisfying some general niceness properties [41].
XML Publishing In this setting, the input to the program P is a relational database D. Suppose that P computes its output XML tree by posing simple select-project-join queries to D, nesting the results of these queries, and constructing new XML elements. Exact typechecking for programs of this form, when the input type s is a relational database schema with key and foreign key constraints and the output type t is a "star-free" DTD, is decidable [1]. (See [1] for a precise definition and examples of the concept "star-free.") As was the case for detecting run-time errors in XQuery programs, the crux of decidability here is again a small-model property. Typechecking remains decidable for output DTDs t that are not star-free, but then the queries in P must not use projection. Typechecking unfortunately becomes undecidable when the output types t are given by XML Schemas or Relax NG schemas [1]. Typechecking also becomes undecidable when P uses queries more expressive than select-project-join queries.
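Both decidability arguments above rest on a small-model property, and once the size bound is known, the decision procedure reduces to exhaustive checking. The sketch below is entirely illustrative: computing the actual bound from P and the input type is the hard part established by the theory, and the simple string enumerator merely stands in for "all inputs of the given type up to the bound".

```python
import itertools

def typechecks(program, candidates):
    """Return True iff program(d) never yields 'error' on the enumerated inputs."""
    for d in candidates:
        try:
            if program(d) == "error":
                return False
        except Exception:            # treat run-time failures as the error value
            return False
    return True

def small_inputs(alphabet, max_size):
    """Toy stand-in for 'all inputs of type s up to the size bound'."""
    for n in range(max_size + 1):
        for w in itertools.product(alphabet, repeat=n):
            yield "".join(w)

# a toy 'program' that errors on inputs containing "bb": detected within the bound
print(typechecks(lambda d: "error" if "bb" in d else d, small_inputs("ab", 3)))  # False
```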
Approximate Typechecking
The expressive power that realistic applications require of practical transformation languages is often too high to allow for exact typechecking. In such cases, one must revert to approximate or incomplete typecheckers that guarantee that all successfully checked programs are type-safe, but that may also reject some type-safe programs. Existing techniques
can be grouped into two categories: type systems and flow analyses. Type Systems
Many conventional programming languages (such as C and Java) specify what programs to accept by a type system [35]. Typically, such a system consists of a set of typing rules that determine the type of each subexpression of a program. Often, in order to help the typechecker, the programmer is required to supply type annotations on variable declarations and in other specified places. The pioneer work applying this approach to the XML setting was the XDuce (transduce) language [21], whose type system is based on regular tree languages. One significant point in this work is its definition of a natural notion of subtyping as the inclusion relation between regular tree languages and its demonstration of the usefulness of allowing a value of one type to be viewed as another type with a syntactically completely different structure [22]. In addition, although the decision problem for subtyping is known to be EXPTIME-complete, the ‘‘top-down algorithm’’ used in the XDuce implementation is empirically shown to be efficient in most cases that actually arise in typechecking [13,22,36]. Such a type system also needs machinery to reduce the amount of type annotations that otherwise tends to be a burden to the user, in particular when the language supports a non-trivial mechanism to manipulate XML documents such as regular expression patterns [20] or filter expressions [17]. A series of works address this problem by proposing automatic type inference schemes that have certain precision properties in a sense similar to the exact typechecking in the previous section [17,20,42]. These ideas have further been extended for XML attributes [19] and para-metric polymorphism [18,43]. CDuce (pronounced ‘‘seduce’’) extends XDuce in various ways [2]. From a language point of view, CDuce embraces XDuce’s approach of a functional language based on regular expression patterns [20] and extends it with finer-grained pattern matching, complete two-way compatibility with programs and libraries in the OCaml programming language, Unicode, queries, XML Schema validation, and, above all, higher-order and overloaded functions. XML types are enriched with general purpose data types, intersection and negation types, and functional types. Finally, the CDuce type inference algorithm for patterns is implemented by a new kind of tree automaton and proved to
be optimal [10]. Among these extensions, the addition of higher-order functions is significant. Theoretically, this extension is not trivial, first because functions do not fit well in the framework of finite tree automata, but more deeply because this entails a definitional cycle: the definition of typechecking uses subtyping, whose definition then uses the semantics of types (NB: subtyping is defined as inclusion between the sets denoted by given two types), whose definition in turn uses well-typedness of values; the last part depends on typechecking in the presence of higher-order functions since typechecking of a function abstraction lx.e requires analysis of its internal expression e. Some solutions are known for breaking this circularity [14,43]. Also, an approach to combine one of these treatments with high-order functions has been proposed [43]. Several research groups explore ways of mixing a XDuce-like type system with an existing popular language. Xtatic [15] carries out this program for the C# language, developing techniques to blend regular expression types with an object-oriented type system. XJ [16] makes a closely related effort for Java. OCamlDuce [11] mixes with OCaml, proposing a method to intermingle a standard ML type inference algorithm with XDuce-like typechecking. XHaskell [25] is also another instance for Haskell; their approach is, however, to embed XML types into Haskell typing structures (such as tuples and disjoint sums) in the style of data-binding, yet support XDuce-like subtyping in its full flexibility by deploying a coercion technique [25]. The formal semantics of XQuery defined by the W3C contains a type system based on a set of inductive typing rules [6]. Their first draft was heavily based on XDuce’s type system [9]. Later, they switched to a different one that reflects the object-oriented hierarchical typing structure adopted by XML Schema. As byproducts of the above pieces of work, several optimization and compilation techniques that exploit typing information have been proposed [10,24]. Flow-Analysis
Flow analysis is a static analysis technique that has long been studied in the programming language community. A series of investigations has been conducted to adapt flow analysis to approximate XML typechecking, concurrently with the XDuce-related work [3,23,31]. In this approach, the user does not need to write type annotations for intermediate values as in XDuce; instead, the static analyzer infers them completely, thus providing a more
user-friendly system. One potential drawback is that the specification is rather informal and therefore, when the analyzer raises an error, the reason can sometimes be unclear; empirically, however, such false negatives are rare. Flow analysis is applied first to Bigwig language system, an extension of Java with an XML-manipulating facility called ‘‘templates’’ [3]. Though this first attempt handles only XHTML types, they naturally generalize it to arbitrary XML types, calling the resulting system XAct [40]. Their techniques are further extended and applied to static analysis of XSLT [31].
Key Applications XML typechecking is a key component of XQuery, the standard XML query language. As outlined above, XML typechecking in XQuery is based on a set of inductive typing rules that reflects the object-oriented hierarchical typing structure adopted by XML Schema. Different approaches to XML typechecking may be found in research prototypes like CDuce, OcamlDuce, XDuce, XAct, XHaskell, and Xtatic for which references are given below.
Url to Code CDuce: http://www.cduce.org OCamlDuce: http://www.cduce.org/ocaml.html XAct: http://www.brics.dk/Xact/ Xduce: http://xduce.sourceforge.net XHaskell: http://taichi.ddns.comp.nus.edu.sg/taichiwiki/XhaskellHomePage Xtatic: http://www.cis.upenn.edu/~bcpierce/xtatic A gentle introduction to exact typechecking for both XML-to-XML transformations and XML Publishing can be found in [37]. A non-technical presentation of Regular Expression Types and Patterns and their use in query languages can be found in the joint DPBL and XSym 2005 invited talk [4]. For a more complete presentation of Regular Expression Types and Patterns and the associated type-checking and subtyping algorithms we recommend the reader to refer to the seminal JFP article by Hosoya, Pierce, and Vouillon [22]. The joint ICALP and PPDP 2005 keynote talk [5] constitutes a relatively simple survey of the problem of type-checking higher-order functions and an overview on how to derive subtyping algorithms semantically: full technical details can be found in an extended version published in the JACM [13].
Cross-references
▶ Database Dependencies ▶ XML ▶ XML Types ▶ XPath/XQuery
Recommended Reading 1. Alon N., Milo T., Neven F., Suciu D., and Vianu V. Typechecking xml views of relational databases. ACM Trans. Comput. Log., 4(3):315–354, 2003. 2. Benzaken V., Castagna G., and Frisch A. CDuce: an XML-centric general-purpose language. In Proc. 8th ACM SIGPLAN Int. Conf. on Functional Programming, 2003, pp. 51–63. 3. Brabrand C., Møller A., and Schwartzbach M.I. The project. ACM Trans. Internet Tech., 2(2):79–114, 2002. 4. Castagna G. Patterns and types for querying XML. In Proc. of DBPL 2005, Tenth International Symposium on Database Programming Languages, 2005, pp. 1–26. 5. Castagna G. and Frisch A. A gentle introduction to semantic subtyping. In Proc. 7th Int. ACM SIGPLAN Conf. on Principles and Practice of Declarative Programming, 2005, pp. 198–208. 6. Draper D., Fankhauser P., Ashok Malhotra M.F., Rose K., Rys M., Sime´on J., and Wadler P. XQuery 1.0 and XPath 2.0 Formal Semantics, 2007. http://www.w3.org/Tr/query-semantics/. 7. Engelfriet J. Top-down tree transducers with regular look-ahead. Math. Syst. Theory, 10:289–303, 1977. 8. Engelfriet J. and Vogler H. Macro tree transducers. J. Comput. Syst. Sci., 31(1):710–146, 1985. 9. Ferna´ndez M.F., Sime´on J., and Wadler P. A semi-monad for semi-structured data. In Proc. 8th Int. Conf. on Database Theory, 2001, pp. 263–300. 10. Frisch A. Regular tree language recognition with static information. In Proc. 3rd IFIP Int. Conf. on Theoretical Computer Science, 2004, pp. 661–674. 11. Frisch A. OCaml+CDuce. In Proc. 11th ACM SIGPLAN Int. Conf. on Functional Programming, 2006, pp. 192–200. 12. Frisch A., Castagna G., and Benzaken V. Semantic subtyping. In Proc. 17th IEE Conf. on Logic in Computer Science, 2002, pp. 137–146. 13. Frisch A., Castagna G., and Benzaken V. Semantic subtyping: dealing set-theoretically with function, union, intersection, and negation types. J. ACM, 55(4):1–64, 2008. 14. Frisch A. and Hosoya H. Towards practical typechecking for macro tree transducers. In Proc. 11th Int. Workshop on Database Programming Languages, 2007, pp. 246–261. 15. Gapeyev V., Levin M.Y., Pierce B.C., and Schmitt A. The Xtatic experience. In Proc. Workshop on Programming Language Technologies for XML (PLAN-X). January 2005. University of Pennsylvania Technical Report MS-CIS-04-24, 2004. 16. Harren M., Raghavachari M., Shmueli O., Burke M.G., Bordawekar R., Pechtchanski I., and Sarkar V. XJ: facilitating XML processing in Java. In Proc. 14th Int. World Wide Web Conference, 2005, pp. 278–287. 17. Hosoya H. Regular expression filters for XML. J. Funct. Program., 16(6):711–750, 2006.
18. Hosoya H., Frisch A., and Castagna G. Parametric polymorphism for XML. In Proc. 32nd ACM SIGACT-SIGPLAN Symp. on Principles of Programming Languages, 2005, pp. 50–62. 19. Hosoya H. and Murata M. Boolean operations and inclusion test for attribute-element constraints. Theor. Comput. Sci., 360 (1–3):327–351, 2006. 20. Hosoya H. and Pierce B.C. Regular expression pattern matching for XML. J. Funct. Program., 13(6):961–1004, 2002. 21. Hosoya H. and Pierce B.C. XDuce: a typed XML processing language. ACM Trans. Internet Tech., 3(2):117–148,2003. 22. Hosoya H., Vouillon J., and Pierce B.C. Regular expression types for XML. ACM Trans. Program. Lang. Syst., 27(1):46–90, 2004. 23. Kirkegaard C. and Møller A. Xact – XML transformations in Java. In Proc. Programming Language Technologies for XML, 2006, p. 87. 24. Levin M.Y. and Pierce B.C. Type-based optimization for regular patterns. In Proc. 10th Int. Workshop on Database Programming Languages, 2005, pp. 184–198. 25. Lu K.Z.M. and Sulzmann M. XHaskell: regular expression types for Haskell. Manuscript, 2004. 26. Maneth S., Perst T., Berlea A., and Seidl H. XML type checking with macro tree transducers. In Proc. 24th ACM SIGACTSIGMOD-SIGART Symp. on Principles of Database Systems, 2005, pp. 283–294. 27. Maneth S., Perst T., and Seidl H. Exact XML type checking in polynomial time. In Proc. 11th Int. Conf. on Database Theory, 2007, pp. 254–268. 28. Martens W. and Neven F. Frontiers of tractability for typechecking simple xml transformations. J. Comput. Syst. Sci., 73(3):362–390, 2007. 29. Milo T. and Suciu D. Type inference for queries on semistructured data. In Proc. 18th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 1999, pp. 215–226. 30. Milo T., Suciu D., and Vianu V. Typechecking for XML transformers. J. Comput. Syst. Sci., 66(1):66–97, 2003. 31. Møller A., Olesen M.O., and Schwartzbach M.I. Static validation of XSL transformations. ACM Trans. Programming Languages and Syst., 29(4): Article 21, 2007. 32. Murata M. Transformation of documents and schemas by patterns and contextual conditions. In Proc. 3rd Int. Workshop on Principles of Document Processing, 1996, pp. 153–169. 33. Papakonstantinou Y. and Vianu V. DTD inference for views of XML data. In Proc. 19th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2000, pp. 35–46. 34. Perst T. and Seidl H. Macro forest transducers. Inf. Process. Lett., 89(3):141–149, 2004. 35. Pierce B.C. Types and Programming Languages. MIT, 2002. 36. Suda T. and Hosoya H. Non-backtracking top-down algorithm for checking tree automata containment. In Proc. 10th Int. Conf. Implementation and Application of Automata, 2005, pp. 83–92. 37. Suciu D. The XML typechecking problem. ACM SIGMOD Rec., 31(1):89–96, 2002. 38. Sulzmann M. and Lu K.Z.M. A type-safe embedding of XDuce into ML. Electr. Notes Theor. Comput. Sci., 148(2):239–264, 2006. 39. Tozawa A. Towards static type checking for XSLT. In Proc. 1st ACM Symp. on Document Engineering, 2001, pp. 18–27.
40. Tozawa A. XML type checking using high-level tree transducer. In Proc. 8th Int. Symp. Functional and Logic Programming, 2006, pp. 81–96. 41. Vansummeren S. On deciding well-definedness for query languages on trees. J. ACM, 54(4):19, 2007. 42. Vouillon J. Polymorphism and XDuce-style patterns. In Proc. Programming Languages Technologies for XML, 2006, pp. 49–60. 43. Vouillon J. Polymorphic regular tree types and patterns. In Proc. 33rd ACM SIGACT-SIGPLAN Symp. on Principles of Programming Languages, 2006, pp. 103–114.
XML Types
FRANK NEVEN
Hasselt University and Transnational University of Limburg, Diepenbeek, Belgium
Synonyms XML schemas
Definition To constrain the structure of allowed XML documents, for instance with respect to a specific application, a target schema can be defined in some schema language. A schema consists of a sequence of type definitions specifying a (possibly infinite) class of XML documents. A type can be assigned to every element in a document valid w.r.t. a schema. As the same holds for the root element, the document itself can also be viewed to be of a specific type. The schema languages DTDs, XML Schema, and Relax NG, are, on an abstract level, different instantiations of the abstract model of unranked regular tree languages.
Historical Background
Brüggemann-Klein et al. [3] were the first to revive the theory of regular unranked tree automata [14] for the modelling of XML schema languages. Murata et al. [10] provided the formal taxonomy as presented here. Martens et al. [8] characterized the expressiveness of the different models and provided type-free abstractions.
Foundations Intuition
Consider the XML document in Fig. 1 that contains information about store orders and stock contents.
XML Types. Figure 1. Example XML document.
Orders hold customer information and list the items ordered, with each item stating its id and price. The stock contents consist of the list of items in stock, with each item stating its id, the quantity in stock, and – depending on whether the item is atomic or composed from other items – some supplier information or the items of which they are composed, respectively. It is important to emphasize that order items do not include supplier information, nor do they mention other items. Moreover, stock items do not mention prices. DTDs are incapable of distinguishing between order items and stock items because the content model of an element can only depend on the element's name in a DTD, and not on the context in which it is used. For example, although the DTD in Fig. 2 describes all intended XML documents, it also allows supplier information to occur in order items and price information to occur in stock items. The W3C specification essentially defines an XSD as a collection of type definitions, which, when abstracted away from the concrete XML representation of XSDs, are rules like
store → order[order]*, stock[stock]
(⋆)
that map type names to regular expressions over pairs a[t] of element names a and type names t. Intuitively, this particular type definition specifies an XML fragment to be of type store if it is of the form <store> f1 ... fn g </store>, where n ≥ 0, f1,...,fn are XML fragments of type order, and g is an XML fragment of type stock. Each type name that
occurs on the right-hand side of a type definition in an XSD must also be defined in the XSD, and each type name may be defined only once. Using types, an XSD can specify that an item is an order item when it occurs under an order element and is otherwise a stock item. For example, Fig. 2 shows an XSD describing the intended set of store documents. Note in particular the use of the types item1 and item2 to distinguish between order items and stock items. It is important to remark that the "Element Declaration Consistent" (EDC) constraint of the W3C specification requires multiple occurrences of the same element name in a single type definition to occur with the same type. Hence, type definition (⋆) is legal, but
persons → (person[male] + person[female])+
is not, as person occurs both with type male and with type female. Of course, element names in different type definitions can occur with different types (which is exactly what yields the ability to let the content model of an element depend on its context). On a structural level, ignoring attributes and the concrete syntax, the structural expressiveness of Relax NG corresponds to XSDs without the EDC constraint.
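The EDC constraint is easy to check once the element/type pairs occurring in each type definition are known. The Python sketch below uses a deliberately simplified stand-in for real XSD content models (each definition is reduced to the list of element/type pairs it mentions; the names are taken from the examples above and are otherwise hypothetical).

```python
# each type maps to the element/type pairs occurring in its content model;
# the regular-expression structure itself is irrelevant for the EDC check
defs = {
    "store":   [("order", "order"), ("stock", "stock")],
    "order":   [("item", "item1")],
    "stock":   [("item", "item2")],
    "persons": [("person", "male"), ("person", "female")],   # violates EDC
}

def violates_edc(type_def):
    seen = {}
    for elem, typ in type_def:
        if seen.setdefault(elem, typ) != typ:   # same element name, different type
            return True
    return False

for name, d in defs.items():
    if violates_edc(d):
        print(f"type {name!r} violates Element Declaration Consistent")
# prints: type 'persons' violates Element Declaration Consistent
```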
A Formalization of Relax NG
An XML fragment f = f1 ... fn is a sequence of labeled trees where every tree consists of a finite number of nodes, and every node v is assigned an element name denoted by lab(v). There is always a virtual root which
XML Types. Figure 2. A DTD and an XSD describing the document in Fig.1.
acts as the common parent of the roots of the different fi. For a set EName of element names and a set Types of type names, the set of elements is defined as {a[t] | a ∈ EName, t ∈ Types}. The set of regular expressions is given by the following syntax:
r ::= ε | a | r, r | r + r | r* | r+ | r?
where ε denotes the empty string and a is an element. Their semantics is the usual one and is therefore omitted. An XSchema is a tuple S = (EName, Types, ρ, t0) where EName and Types are finite sets of elements and types, respectively, ρ is a mapping from Types to regular expressions, and t0 ∈ Types is the start type. A typing λ of f is a mapping assigning a type λ(v) ∈ Types to every node v in f (including the virtual root). For a node v with children v1,...,vm, define child-string(λ, v) as the string lab(v1)[λ(v1)] ... lab(vm)[λ(vm)]. An XML fragment f then conforms to or is valid w.r.t. S if there is a typing λ of f such that for every node v, child-string(λ, v) matches the regular expression ρ(λ(v)), and λ(root) = t0. The mapping λ is then called a valid typing. Despite the clean formalization, the above definition does not entail a validation algorithm. One possibility is to compute for each node v in f a set of possible types D(v) ⊆ Types such that for each type t ∈ D(v), the XML subfragment rooted at v is valid w.r.t. the schema with start type t. The XML fragment is then valid w.r.t. S itself when the start type t0 belongs to D(root). The sets D(v) can be computed in a bottom-up fashion. Indeed, t ∈ D(v) iff (i) v is a leaf node and ρ(t) contains the empty string; or (ii) v is a non-leaf node with children v1,...,vn and there are t1 ∈ D(v1),...,tn ∈ D(vn) such that lab(v1)[t1] ... lab(vn)[tn] matches ρ(t). A valid typing can then be computed from the sets D by an additional top-down pass through the tree. Although this kind of bottom-up validation is a bit at
odds with the general concept of top-down or streaming XML processing, the algorithm can be adapted to this end (cf., for instance, [10,13]). For general XSchemas, a valid typing is not necessarily unique and cannot always be computed in a single pass [8]. XSchemas as defined above correspond precisely to the class of unranked regular hedge languages [3] and can be seen as an abstraction of Relax NG. Note that the present formalization is overly simplistic w.r.t. attributes, as Relax NG treats them in a way uniform to elements using attribute–element constraints [6].
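A direct Python rendering of the bottom-up computation of the sets D(v) is sketched below. The encoding of content models as regular expressions over child tokens and the toy schema (in the spirit of Fig. 2, with content models heavily simplified) are illustrative assumptions; the naive enumeration of child-type combinations is exponential, whereas real validators rely on tree-automata techniques.

```python
import itertools, re

# A node is (label, [children]); a schema maps each type to a regular expression
# over child tokens of the form "label/type;".
def validate(node, schema, start_type):
    def D(n):
        label, children = n
        child_types = [D(c) for c in children]
        possible = set()
        for t, rx in schema.items():
            pat = re.compile(rx)
            # try all combinations of child types (exponential; for illustration only)
            for combo in itertools.product(*child_types):
                s = "".join(f"{c[0]}/{ct};" for c, ct in zip(children, combo))
                if pat.fullmatch(s):
                    possible.add(t)
                    break
        return possible
    return start_type in D(node)

schema = {
    "store": r"(order/order;)*stock/stock;",
    "order": r"(item/item1;)+",
    "stock": r"(item/item2;)*",
    "item1": r"",      # leaf content models omitted for brevity
    "item2": r"",
}
doc = ("store", [("order", [("item", [])]),
                 ("stock", [("item", []), ("item", [])])])
print(validate(doc, schema, "store"))    # True
```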
Relationship with Tree and Hedge Automata
Although an XML fragment can consist of a sequence of labeled trees, in the literature it is customary to restrict this sequence to a single tree. XSchemas as defined above then define precisely the unranked regular tree languages [3,11]. Although several automata formalisms capturing this class have been defined [3,4,5], each with their own advantages [9], XSchemas correspond most closely to the model of Brüggemann-Klein et al. [3], which is defined as follows. A tree automaton is a tuple A = (Q, EName, δ, q0), where Q is the set of states (or, types), q0 ∈ Q is the start state (start type), and δ maps pairs (q, a) ∈ Q × EName to regular expressions over Q. An input tree f is accepted by the automaton if there exists a mapping λ from the nodes of f to Q, called a run (or, a typing), such that the root is labeled with the start state, and for every node v with children v1,...,vn, the string λ(v1) ... λ(vn) matches δ(λ(v), lab(v)). The translation between XSchemas and tree automata is folklore and can for instance be found in [3].
Deterministic Regular Expressions
The unique particle attribution constraint (UPA) requires regular expressions to be deterministic in the following sense: the form of the regular expression should allow each symbol of the input string to match
uniquely against a position in the expression when processing the input string in one pass from left to right, that is, without looking ahead in the string. For instance, the expression r = (a + b)*a is not deterministic, as the first symbol in the string aaa can already be matched to two different a's in r. The equivalent expression b*a(b*a)*, on the other hand, is deterministic. Unfortunately, not every regular expression can be rewritten into an equivalent deterministic one [2]. Moreover, the deterministic regular expressions do not form a very robust subclass: they are not closed under union, concatenation, or Kleene star, prohibiting an elegant constructive definition [2]. Deterministic regular expressions are characterized as one-unambiguous regular expressions by Brüggemann-Klein and Wood [2]. Deciding whether a regular expression is one-unambiguous can be done in quadratic time [1]. Furthermore, it can be decided in EXPTIME whether there is a deterministic regular expression equivalent to a given regular expression [2]. If so, the algorithm can return an expression of a size which is double exponential. It is unclear whether this can be improved.
valid with respect to P if, for every node v of f, there is a rule r ! s 2 P such that the string formed by the labels of the nodes on the path from the root to v match r and the string formed by the children of v match s (cf. [7,8] for more details). The single-type restriction further ensures that XSDs can be uniquely typed in a one-pass top-down fashion. To be precise, one-pass typing in a top-down fashion means that the first time a node is visited, a type should be assigned (so only based on what has been seen up to now), and that a child can be visited only when its parent has already visited. Type inclusion and equivalence is EXPTIMEcomplete for XSchemas and is in PTIME for DTDs and XSDs. In fact, w.r.t. the latter, the problem reduces to the corresponding problem for the class of employed regular expressions [8].
A Formalization of DTDs and XSDs
1. Bru¨ggemann-Klein A. Regular expressions into finite automata. Theor. Comput. Sci., 120(2):197–213, 1993. 2. Bru¨ggemann-Klein A. and Wood D. One unambiguous regular languages. Inform. Comput., 140(2):229–253, 1998. 3. Bru¨ggemann-Klein A., Murata M., and Wood D. Regular tree and regular hedge languages over unranked alphabets. Technical Report HKUST-TCSC-2001-0, The Hongkong University of Science and Technology, 2001. 4. Carme J., Niehren J., and Tommasi M. Querying unranked trees with stepwise tree automata. In Proc. 15th Int. Conf. Rewriting Techniques and Applications, 2004, pp. 105–118. 5. Cristau J., Lo¨ding C., and Thomas W. Deterministic automata on unranked trees. In Proc. 15th Int. Symp. Fundamentals of Computation Theory, 2005, pp. 68–79. 6. Hosoya H. and Murata M. Boolean operations and inclusion test for attribute-element constraints. Theor. Comput. Sci., 360(1–3):327–351, 2006. 7. Martens W., Neven F., and Schwentick T. Simple off the shelf abstractions for XML schema. ACM SIGMOD Rec., 36(4):15– 22, 2007. 8. Martens W., Neven F., Schwentick T., and Bex G.J. Expressiveness and complexity of XML schema. ACM Trans. Database Syst., 31(3):770–813, 2006. 9. Martens W. and Niehren J. Minimizing tree automata for unranked trees. In Proc. 10th Int. Workshop on Database Programming Languages, 2005, pp. 232–246. 10. Murata M., Lee D., Mani M., and Kawaguchi K. Taxonomy of XML schema languages using formal language theory. ACM Trans. Internet Tech., 5(4):660–704, 2005. 11. Neven F. Automata theory for XML researchers. ACM SIGMOD Rec., 31(3):39–46, 2002.
Let S = (EName, Types, ρ, t0) be an XSchema. Then, S is local when EName = Types and the regular expressions in ρ are defined over the alphabet {a[a] | a ∈ EName}. This simply means that the name of the element also functions as its type. Furthermore, S is single-type when there are no elements a[t1] and a[t2] in a ρ(t) with t1 ≠ t2. A DTD is then a local XSchema where regular expressions are restricted to be deterministic. Finally, an XSD is a single-type XSchema where regular expressions are restricted to be deterministic.
Expressiveness and Complexity
XSchemas form a very robust class, for instance, equivalent to the monadic second-order logic (MSO) definable classes of unranked trees [12], and are closed under the Boolean operations. XSDs, on the other hand, are not closed under union or complement [8,10]. They define precisely the subclasses of XSchemas closed under ancestor-guarded subtree exchange, and are much closer to DTDs than to XSchemas, as becomes apparent from the following equivalent type-free alternative characterization. A pattern-based XSD P is a set of rules {r1 → s1, ..., rm → sm} where all ri are horizontal regular expressions and all si are deterministic vertical regular expressions. An XML fragment f is
Cross-references
▶ XML Schema ▶ XML Typechecking
Recommended Reading
12. Neven F. and Schwentick T. Query automata on finite trees. Theor. Comput. Sci., 275:633–674, 2002. 13. Segoufin L. and Vianu V. Validating streaming XML documents. In Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2002, pp. 53–64. 14. Thatcher J.W. Characterizing derivation trees of context-free grammars through a generalization of finite automata theory. J. Comput. Syst. Sci., 1(4):317–322, 1967.
XML Updates
GIORGIO GHELLI
University of Pisa, Pisa, Italy
Definition
The term XML Updates refers to the act of modifying XML data while preserving its identity, through the operators provided by an XML manipulation language. Identity preservation is crucial to this definition: the production of XML data from XML data without preserving the original data identity is called XML transformation. The general notion of identity has many concrete incarnations. The XQuery/XPath data model (see [14]) associates a Node Identity with each node of the XML syntax tree. In a language based on this data model, updates differ from transformations because the former modify the data but preserve node identities. Another hallmark of updates is that an expression that refers to the data being updated may have a different value after the update is evaluated, while the evaluation of XML transformations does not change the value of any other expression. XML updates may be embedded in any XML manipulation language, but this entry focuses on XML updates in the context of languages of the XQuery family.
Historical Background
The first languages proposed to manipulate XML data did not support XML updates, because XML transformations can often be used as a substitute for XML updates, and the former are simpler to optimize and have cleaner semantics. However, many applications need to update persistent XML data, and eventually the problem was tackled. The first widely known proposal in the scientific literature was presented in [11], and was clearly influenced by previous work on SQL updates and on primitives for tree updates.
Foundations
The design of an XML update mechanism has three important aspects: the definition of the update operators, the revalidation of modified data after the updates, and the embedding of the operators in the language.
Operators
There is a wide agreement on the basic update operators. Almost every proposal includes operators to delete a subtree, insert a subtree in a specific position, rename a node, and replace the value of a node. The deletion of a node T may be defined as an operation that just detaches T from its parent (detach semantics) or as an operation that invalidates every node of T (erasure semantics). The first choice needs to be supported by a garbage-collector, while the second may require the management of pointers to invalidated nodes. The insertion of a tree T below a node n may break the tree-structure of XML data, if the tree T has already a parent, and would create a cycle if n were a node of T. For this reason, the insertion operator typically copies T before inserting it below n. Node renaming must be defined with some care because it may break invariants connected to name spaces. Some proposals also include a move T1 into T2 operation to move a subtree T1 to a different location without altering its identity. This operation cannot be encoded by inserting T1 into T2, and then deleting the original T1, because the insertion performs a copy, hence this encoding does not preserve the identity of T1. Revalidation
Dynamic revalidation is, currently, the technique of choice, in order to ensure that modified data still respect type invariants. The complexity of revalidation depends on the language used to express structural invariants. For example, if a DTD is used, when a valid subtree rooted at an element with name q is moved from a position to another, it is only necessary to verify whether an element with name q may be found in that position. If XML Schema is used, then the content model of the moved subtree may depend on its position, hence the subtree has to be revalidated. In any case, a full revalidation of the updated structure is usually not needed, and many optimizations are generally possible. In some cases, static code analysis may avoid dynamic revalidation altogether. Work on this aspect is only at its beginning stages.
The Operators and the Full Language
The presence of update operators in an XML language may heavily affect the possibility of optimization. Any important optimization aims at reducing the number of times an expression is evaluated. Such a reduction is typically possible if the value of that expression at a certain time is guaranteed to be equal to its value when it was last evaluated. Updates make this property very difficult to prove. This problem has been faced with two main approaches: separation of queries and updates, and delayed update application. The separation approach is exemplified by [6,10,11]: these languages distinguish between expressions and statements and put strong limitations on the places where statements may appear. The proposals in [2,3] perform a ‘‘partial’’ separation of queries and updates. They use the same syntax for non-updating expressions and updating expressions, which means, for example, that standard FLWOR expressions are used to iterate both classes of expressions. However, the places where updating expressions may appear are severely limited: for example, in a FLWOR expression, they can only appear in the return clause. The languages XQuery! and LiXQuery+ [8,9], instead, have just one syntactic category (expressions) and no syntactic limitation on where the updates may appear. The delayed-application approach is based on the definition of a snapshot scope; all the update expressions, or statements, in the scope are not executed immediately, but only when the scope is closed. This ensures that, inside the scope, the value of an expression is not affected by updates. For example, the proposals of [3,10,11] define a whole-query snapshot scope, while in XQueryP [2] every update is applied immediately. Finally, in XQuery! and LiXQuery+ [8,9], the programmer has a full control of the snaphot scope. Apart from optimization, the delayed approach is also proposed for consistency reasons. The partial execution of a set of correlated updates would create consistency problems. These problems can be avoided by collecting these updates in a scope, and by imposing that all the updates in each snapshot scope are evaluated atomically.
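The delayed-application idea can be pictured with a pending-update list that a snapshot scope collects and then applies atomically when the scope closes. The Python classes below are an illustrative sketch under that assumption, not the API of any of the cited languages; note how insertion copies the inserted tree, as discussed for the operators above.

```python
import copy

class Elem:
    def __init__(self, name, children=None):
        self.name, self.children = name, list(children or [])

class Snapshot:
    """Collect update primitives during the scope and apply them on exit."""
    def __init__(self):
        self.pending = []
    def delete(self, parent, child):
        self.pending.append(lambda: parent.children.remove(child))
    def insert_into(self, target, tree):
        # insertion copies the tree, so the original keeps its identity and parent
        self.pending.append(lambda: target.children.append(copy.deepcopy(tree)))
    def rename(self, node, new_name):
        self.pending.append(lambda: setattr(node, "name", new_name))
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_val, tb):
        if exc_type is None:            # apply all updates atomically at scope end
            for apply_update in self.pending:
                apply_update()

root = Elem("orders", [Elem("order")])
with Snapshot() as s:
    s.rename(root.children[0], "archived-order")
    print(root.children[0].name)        # still "order": updates are delayed
print(root.children[0].name)            # "archived-order" after the scope closes
```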
Key Applications Updates are unavoidable in any language that manipulates XML persistent data, as the one proposed in [11], and are very useful in any language for scripting
purposes, as XQueryP [2]. Application scenarios are detailed in [4] and in [5].
Future Directions The most important open problems, in this field, are optimization and static analysis. Some work on the optimization of queries with updates has been done (see [1,7], for example), but the field is huge. Static analysis is also an area where lot of work has to be done. Here, a crucial issue is the design of algorithms to substitute dynamic post-update revalidation with some form of static type-checking.
Experimental Results Some of the proposals have a public implementation that has been use to experiment with the semantics and the performance of the language; for example, XL has a demo reachable from [13], and XQuery! has a demo reachable from [15]. The W3C maintains a list of XQuery implementations in [12].
Cross-references
▶ XML ▶ XML Algebra ▶ XML Query Processing ▶ XML Type Checking ▶ XPath/XQuery
Recommended Reading 1. Benedikt M., Bonifati A., Flesca S., and Vyas A. Adding updates to XQuery: Semantics, optimization, and static analysis. In Proc. 2nd Int. Workshop on XQuery Implementation, Experience and Perspectives, 2005. 2. Carey M., Chamberlin D., Fernandez M., Florescu D., Ghelli G., Kossmann D., Robie J., and Sime´on J. XQueryP: an XML application development language. In Proc. of XML, 2006. 3. Chamberlin D., Florescu D., and Robie J. XQuery update facility. W3C Working Draft, July 2006. 4. Chamberlin D. and Robie J. XQuery update facility requirements. W3C Working Draft, June 2005. http://www.w3.org/ TR/xquery-update-requirements/. 5. Engovatov D., Florescu D., and Ghelli G. XQuery scripting extension 1.0 requirements. W3C Working Draft, June 2007. http://www.w3.org/TR/xquery-sx-10-requirements. 6. Florescu D., Gru¨nhagen A., and Kossmann D. XL: an XML programming language for Web service specification and composition. In Proc. 11th Int. World Wide Web Conference, 2002, pp. 65–76. 7. Ghelli G., Onose N., Rose K., and Sime´on J. XML query optimization in the presence of side effects. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2008, pp. 339–352.
8. Ghelli G., Ré C., and Siméon J. XQuery!: an XML query language with side effects. In Proc. 2nd Int. Workshop on Database Technologies for Handling XML Information on the Web, 2006, pp. 178–191.
9. Hidders J., Paredaens J., Vercammen R., and Demeyer S. On the expressive power of XQuery-based update languages. In Database and XML Technologies, 5th Int. XML Database Symp., 2006, pp. 92–106.
10. Sur G.M., Hammer J., and Siméon J. An XQuery-based language for processing updates in XML. In Proc. Programming Language Technologies for XML (PLAN-X), 2004.
11. Tatarinov I., Ives Z., Halevy A., and Weld D. Updating XML. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2001, pp. 413–424.
12. W3C. W3C XQuery site. http://www.w3.org/XML/Query.
13. XL team. XL site. http://xl.inf.ethz.ch.
14. XQuery 1.0 and XPath 2.0 data model (XDM). W3C Recommendation, January 2007.
15. XQuery! team. XQuery! site. http://xquerybang.cs.washington.edu.
XML Views Murali Mani Worcester Polytechnic Institute, Worcester, MA, USA
Synonyms XML publishing
Definition Database applications provide an XML view of their data so that the data is available to other applications, especially web applications. Database systems provide support for the client applications to use (query and/or manipulate) the data. The operations specified by the client applications are composed with the view definitions by the database system, which then performs these operations. The internal data model used by the database application, as well as how the operations are performed, are transparent to the client applications; they see only an XML view of the entire system. XML views help the database systems to maintain their legacy data, as well as utilize the optimization features present in legacy systems (especially SQL engines), and at the same time make the data accessible to a wide range of web applications.
Historical Background Views (external schemas) are a feature [12] present in almost all database systems. Views provide data independence as well as the ability to control access to portions of the data for different classes of users. With XML [2] becoming the standard for information exchange over the web since 1998, database applications have used XML views to publish their data and to make the data accessible to web applications. Nowadays, XML views are supported by most major database engines such as Microsoft SQL Server, Oracle, and IBM DB2.
Foundations The views that database systems support can either be virtual or materialized [12]. When the view is virtual, only the view definition is stored in the system. Whenever a client application accesses the data by issuing a query over the view, the database system composes the user query with the view definition and this combined query is executed over the underlying data. The advantage of virtual views is that the data is never out of date, as the data is stored, accessed and manipulated in only one place. Materialized views, on the other hand, store the data for the view along with the view definition. Therefore the same data is now in more than one location, and the different copies of the data need to be kept consistent by maintaining the materialized views whenever the underlying data changes. Incremental and efficient maintenance of materialized views has been studied [6,9,11] and is supported by most commercial SQL engines. The advantage of materialized views is that the user query can potentially be answered faster, as it can be answered directly from the materialized view without composing it with the view definition. In this article, both virtual and materialized XML views that are defined primarily over relational data are considered, and state-of-the-art techniques and open problems for these are described. Mapping Between the XML View and the Underlying Data
The mapping between the XML view and the underlying data is used for publishing the data, as well as for performing the user-requested actions (such as answering queries) when the view is virtual. Different ways of specifying mappings between the XML view and the underlying data are possible. The canonical mapping [13] is a very simple one where there is a 1-1 mapping between the relational tuples and the XML elements in the view. An XML element is constructed for every row in every table in the relational database. Thus the entire data in the relational database is captured in this canonical XML view. Slightly more complex mappings that still capture the entire relational data in the XML view are studied in [8], where the key-foreign key constraints are used to nest XML elements within each other. The translation of queries (especially navigation queries as in XPath) in the above mapping schemes is quite straightforward. However, oftentimes a database application needs to publish its data as an XML view that conforms to a standard schema. In such cases, a more complex mapping scheme is needed, as all of the underlying data may not be exposed in the view; also, the underlying data might need to be restructured to conform to the schema. In [10], the authors study how the user can specify relationships between the underlying schema and the view schema diagrammatically. Based on some assumptions, the system then analyzes the user-specified relationships and translates them into meaningful mappings that can be understood by the system (such as SQL queries). The user is also consulted when there is potential ambiguity. In [5,13], the mapping is specified using the XQuery language [15]. This is similar to the scenario that is well understood by SQL engines (where view definitions are specified in SQL). The database administrator can specify the XML view using an XQuery expression that conforms to the required schema.
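For instance (a sketch only; the table, file, and element names are hypothetical), a view over a relational table student(id, name), reached through its canonical XML image, could be written in XQuery as:

<students>{
  (: one student element per row of the canonical image of table student :)
  for $row in doc("student-canonical.xml")//row
  return
    <student id="{ $row/id }">
      <name>{ $row/name/text() }</name>
    </student>
}</students>

Client applications see only such student elements; how the underlying rows are stored and fetched stays hidden behind the view definition.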
User Queries over XML Views The systems that support XML views need to provide the capability for users to query the data. In [5,13], the
authors study how user queries specified using XQuery can be answered efficiently in the scenario where the XML view is also specified using XQuery. The architecture for such a system, as described in [13], is shown in Fig. 1. One of the main assumptions made by these systems is that the SQL engine is best equipped to handle efficiently those computations that an SQL engine can handle; therefore one needs to push down as much computation as possible into the SQL engine. The view composer shown in Fig. 1 composes the user query with the view query. The computation pushdown module then rearranges the combined query plan so that the SQL portion is at the bottom of the query plan and includes everything that can be done by the SQL engine. In such a case, the middleware only needs to do tagging, and this is done in a single pass over the data returned by the SQL engine.
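As a hypothetical illustration of this pushdown (continuing the sketched view definition given earlier), suppose a client asks for //student[@id = "s123456"]/name against the view. After composition, the combined plan could be rewritten as:

(: composed query: the user's selection and projection are folded into the view body :)
for $row in doc("student-canonical.xml")//row[id = "s123456"]
return <name>{ $row/name/text() }</name>

Everything inside the for clause, that is, scanning the rows, the selection on id and the projection of name, can be handed to the SQL engine as a single SELECT over the base table; the middleware then merely wraps each returned value in a name element in one pass over the SQL result.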
XML Views. Figure 1. Typical architecture for processing user queries over XML views.
User Updates over XML Views While a lot of work has focused on how to answer user queries over XML views, very little work has focused on handling user updates over virtual XML views. As the view is virtual, the view update needs to be performed by updating the base data in such a way that the effect expected by the user is achieved. A common semantics used for view updates is the side-effect-free semantics shown in Fig. 2, as described in [3,7]. In the figure, D represents the database instance, DEFv represents the view definition, V is the view instance, and u is the user-specified view update. u(V) represents the effect that the user wants to achieve on the view; the view update problem therefore is to find an update U over the base data such that the user-desired effect is achieved on the view.
XML Views. Figure 2. Illustrating side-effect free semantics.
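In the notation of Fig. 2, the translation U is side-effect free exactly when applying it to the base data produces the updated view and nothing else, i.e., when DEFv(U(D)) = u(DEFv(D)) = u(V). If DEFv(U(D)) and u(V) differ, U has a side effect and cannot serve as a correct translation of u.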
It is possible that such an update U does not exist, in which case the view update cannot be performed. In some cases there could be a unique U; in other cases, multiple such updates over the base data exist, in which case the ambiguity needs to be resolved using heuristics, by the user, or by rejecting the view update. Updating SQL views is itself considered a hard problem, and the existing solutions handle only a subset of the view definitions. When a user specifies updates over view definitions that use ‘‘non-permissible’’ operators (such as aggregation operators), the system rejects these updates. Most solutions, including commercial ones, use a schema-level analysis to perform or reject the view update, where they utilize the base schema, the view definition and the user-specified update statement. Some solutions also examine the base data, in which case fewer view updates need to be rejected. The main approaches that study updates over XML views of relational databases include [1,14]. In [1], the authors translate the XML view into a set of SQL views, so that the existing solutions for SQL views can be utilized. In [14], the authors identify that the XML view update problem is harder than the SQL view update problem – the SQL view update problem boils down to the case where the XML view schema has only one node. For a
general XML view, given an update (such as delete) to be performed on an XML view element, the authors partition the XML view schema nodes into three categories – for one of the categories, the authors utilize the results from SQL view update research; for the other two categories, they propose new approaches to check for side effects. As follow-up work to [14], the authors have studied how data-level analysis can be used for the XML view update problem as well. Maintenance of Materialized XML Views
In order to keep materialized views consistent with the underlying base data, one approach is to recompute the view every time there is a base update. However, this approach is not efficient, as the base updates are typically very small compared to the entire base data. It would be more efficient if incremental view maintenance could instead be performed, where only the update to the view is computed. Incremental view maintenance typically consists of two steps – in the propagate phase, the delta change to the view is computed using an incremental maintenance plan, and in the apply phase, the view is refreshed using this delta change. Incremental maintenance of XML views is more complex than maintenance of SQL views because of the more complex features of XML, including nested structure (a form of aggregation) and order, and of XML query languages such as XQuery. In [4], the authors study how to incrementally maintain XML views defined over underlying data sources that are also XML (note that the solutions apply to the case where the underlying data sources are relational as well). The architecture of their approach is shown in Fig. 3. As the underlying base is XML, the base updates can come in many different granularities – this requires a validate phase, which combines the update with the base data to make it a complete update. This is then fed to the incremental maintenance plan in the propagate phase, which computes the delta change to the view. In the apply phase, the view is refreshed using the delta change to the view.
XML Views. Figure 3. Architecture for maintaining materialized XML views.
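As a minimal sketch of these two phases (the view below is hypothetical and deliberately simple, a projection with no joins, aggregation or reordering), consider a materialized view defined by:

(: the stored view: one name element per student in the base document :)
for $s in doc("students.xml")//student
return <name>{ $s/last-name/text() }</name>

If a new student element is inserted into students.xml, the propagate phase only evaluates the view body over the inserted element, producing a single new name element as the delta, and the apply phase appends that delta to the stored view; neither phase re-reads the rest of the base data or of the view. Views involving joins, aggregation or order require correspondingly more elaborate maintenance plans [4,6,9,11].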
Key Applications XML views are useful to any database application that wishes to publish its data on the web.
Future Directions XML views are already being used widely by database applications, and their usage is expected to increase further in the future. There are still several issues that need to be studied to make such systems more efficient. For processing queries, query composition will result in several unnecessary joins; therefore the query optimizer must be able to remove unnecessary joins, which requires it to infer key constraints in the query plan. For processing view updates, a combined schema and data analysis promises to be an efficient approach and needs to be investigated. Incremental view maintenance further requires new efficient approaches for handling operations (such as aggregation) present in XML query languages.
Cross-references
▶ Top-k XML Query Processing ▶ XML Information Integration ▶ XML Publishing ▶ XML Updates ▶ XPath/XQuery
Recommended Reading
1. Braganholo V.P., Davidson S.B., and Heuser C.A. From XML view updates to relational view updates: Old solutions to a new problem. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 276–287.
2. Bray T., Paoli J., Sperberg-McQueen C.M., Maler E., and Yergeau F. Extensible Markup Language (XML) 1.0, W3C Recommendation. Available at: http://www.w3.org/XML
3. Dayal U. and Bernstein P.A. On the correct translation of update operations on relational views. ACM Trans. Database Syst., 7(3):381–416, September 1982.
4. El-Sayed M., Rundensteiner E.A., and Mani M. Incremental maintenance of materialized XQuery views. In Proc. 22nd Int. Conf. on Data Engineering, 2006, p. 129.
5. Fernandez M., Kadiyska Y., Suciu D., Morishima A., and Tan W.-C. SilkRoute: A framework for publishing relational data in XML. ACM Trans. Database Syst., 27(4):438–493, December 2002.
6. Griffin T. and Libkin L. Incremental maintenance of views with duplicates. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1995, pp. 328–339.
7. Keller A.M. Algorithms for translating view updates to database updates for views involving selections, projections and joins. In Proc. 4th ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, 1985, pp. 154–163.
8. Lee D., Mani M., Chiu F., and Chu W.W. NeT & CoT: Translating relational schemas to XML schemas using semantic constraints. In Proc. Int. Conf. on Information and Knowledge Management, 2002, pp. 282–290.
9. Palpanas T., Sidle R., Cochrane R., and Pirahesh H. Incremental maintenance for non-distributive aggregate functions. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 802–813.
10. Popa L., Velegrakis Y., Miller R.J., Hernandez M.A., and Fagin R. Translating Web data. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 598–60.
11. Quass D. Maintenance expressions for views with aggregates. In Proc. Workshop on Materialized Views: Techniques and Applications, 1996, pp. 110–118.
12. Ramakrishnan R. and Gehrke J. Database Management Systems. McGraw Hill, 2002.
13. Shanmugasundaram J., Kiernan J., Shekita E., Fan C., and Funderburk J. Querying XML views of relational data. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 261–27.
14. Wang L., Rundensteiner E.A., and Mani M. Updating XML views published over relational databases: Towards the existence of a correct update mapping. Data Knowl. Eng. J., 58(3):263–298, 2006.
15. W3C XQuery Working Group. Available at: http://www.w3.org/XML/Query/
XPath/XQuery Jan Hidders, Jan Paredaens University of Antwerp, Antwerpen, Belgium
Synonyms W3C XML path language; W3C XML query language
Definition XPath (XML path language) and XQuery (XML query language) are query languages defined by the W3C (World Wide Web Consortium) for querying XML documents. XPath is a language based on path expressions that allows the selection of parts of a given XML document. In addition it also allows some minor computations resulting in values such as strings, numbers or booleans. The semantics of the language is based on a representation of the information content of an XML document as an ordered tree. An XPath expression usually consists of a series of steps that each navigate through this tree in a certain direction and select the nodes in that direction that satisfy certain properties. XQuery is a declarative, statically typed query language for querying collections of XML documents such as the World Wide Web, a file system or a database. It is based on the same interpretation of XML documents as XPath, and includes XPath as a sublanguage, but adds the possibility to query multiple documents in a collection of XML documents and combine the results into completely new XML fragments.
Historical Background The development of XPath and XQuery as W3C standards is briefly described in the following. XPath 1.0 XPath started as an initiative when the W3C XSL working group and the W3C XML Linking working group (which was working on XLink and XPointer) realized that they both needed a language for matching patterns in XML documents. They decided to develop a common language and jointly published a first working draft for XPath 1.0 in July 1999, which resulted in the recommendation [8] in November 1999. XQuery 1.0 and XPath 2.0
The history of XQuery starts in December 1998 with the organization by W3C of the workshop QL ’98 on XML query languages. This workshop received a lot of attention from the XML, database, and full-text search communities, and as a result the W3C started the XML Query working group [7]. Initially, its goals were only to develop a query language, and a working draft of the requirements was published in January 2000 and a first working draft for XQuery in February 2001. More than 3 years later, in August 2004, the charter of the group was extended with the goal of codeveloping XPath 2.0 with the XSL working group. Finally, in January 2007 the recommendations for both XQuery 1.0 [10] and XPath 2.0 [9] were published.
Historical Roots of XQuery 1.0 Most features of XQuery can be traced back to its immediate predecessor Quilt [2]. This language combined ideas from other predecessors such as XPath 1.0 and XQL, from which the concept of path expressions was taken. From XML-QL came the idea of using variable bindings in the construction of new values. Older influences were SQL, whose SELECT-FROM-WHERE expressions formed the inspiration for the FLWOR expressions, and OQL, which showed the benefit of a functional and fully orthogonal query language. Finally there were also influences from other query languages for semi-structured data such as Lorel and YATL.
Foundations The organization of this section is as follows. First, the XPath 1.0 data model is presented, then the XPath 1.0 language, which is followed by a description of XQuery 1.0; finally XPath 2.0 is briefly discussed.
The XPath 1.0 Data Model The information content of an XML document is represented by an ordered tree that can contain seven types of nodes: root nodes, element nodes, attribute nodes, text nodes, comment nodes, processing instruction nodes and namespace nodes. Three properties can be associated with these nodes: a local name, a namespace name and a string-value. The local name is defined for element, attribute, processing instruction and namespace nodes. For the latter two it represents the target for the processing instruction and the prefix for the namespace, respectively. The namespace name represents the full namespace URI of a node and is defined for the element and attribute nodes. The string-value is defined for all types of nodes but for the root and element node it is derived from the string-values of the nodes under it. For attribute nodes it is the normalized attribute value, for text nodes it is the text content of the node, for processing instruction nodes it is the processing instruction data, for comment nodes it is the text of the comment and for namespace nodes it is the absolute URI for the namespace. An example of an XML document and its corresponding data model is given in Fig. 1.
XPath/XQuery. Figure 1. An XML document and its XPath 1.0 data model.
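A small document of the kind depicted in Fig. 1 (the element names are illustrative and match the path-expression examples used below) is:

<?xml version="1.0"?>
<class name="Databases">
  <student id="s123456">
    <first-name>An</first-name>
    <last-name>Janssen</last-name>
  </student>
  <student id="s654321">
    <first-name>Jan</first-name>
    <last-name>Peeters</last-name>
  </student>
</class>

In the corresponding tree, class and student become element nodes, name and id become attribute nodes, and the character data inside first-name and last-name becomes text nodes; the root node sits above the class element.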
Over the nodes in this tree a document order is defined that orders the nodes as they are encountered in a pre-order walk of the tree. The XPath 1.0 Language The most important type of expression in XPath is the location path, which is a path expression that selects a set of nodes from the tree describing the document. A location path can be relative, in which case it starts navigating from a certain context node, or absolute, in which case it starts from the root node of the document. In its simplest form a relative location path consists of a number of steps separated by /, such as for example class/student/@id, where class and student are steps that navigate to children element nodes with those names, and @id navigates to attribute nodes with name id. It selects all nodes that can be reached from the context node by these steps, i.e., the id attribute nodes that are directly below student element nodes that are in turn children of class element nodes that are children of the context node. The path expression becomes an absolute location path, i.e., it starts navigation from the root node, if it is started with / as in /class/student/@id. The wildcards * and @* can be used to navigate to element nodes and attribute nodes, respectively, with any name. The steps can also be separated by //, which means that the path expression navigates to the current node and all its descendants before it applies the next step. So class//first-name navigates to all first-name elements that are directly or indirectly below a class element directly below the context node. A special step is the self step, which is denoted as . and remains in the same place, so class//. returns the class element node and all the nodes below it. Another special step
is the parent step, which is denoted as .. and navigates to the parent of the current context node. A location path can also start with //, which means that it is an absolute location path whose first step navigates to the root node and all the nodes below it. For example, //student will select all student element nodes in the document. Path expressions can also be combined with the union operator, denoted as |, which takes the union of the results of two path expressions. For example, //(student | teacher)/(last-name | first-name) returns the first and last names of all students and teachers. With each step zero or more predicates can be specified that must hold in order for nodes to be selected. Such predicates are indicated in square brackets, such as for example in //student[@id=‘‘s123456’’]/grade, which selects the grade elements below student elements with a certain id attribute value. Conditions that can be specified include (i) comparisons between the string-values of the results of path expressions and other expressions, (ii) existential predicates that check whether a certain path expression returns a nonempty result and (iii) positional predicates that check the position of the current node in the result of the step for the context node. Comparisons between path expressions that return more than one node have an existential semantics, i.e., the comparison is assumed to hold if at least one of the returned nodes satisfies the equation. For example, //course[enrolled/student/last-name=‘‘Janssen’’] returns the courses in which at least one student with the last name ‘‘Janssen’’ enrolled. Available comparison operators include =, != (not equals), < and <=. An example of an existential
predicate is //course[enrolled/student], which selects courses in which at least one student enrolled. Finally, an example of a positional predicate is given in //course/teacher[1]/name, which selects for each course the name of the first teacher of that course. If there are multiple predicates then a positional predicate takes the preceding predicates into account. For example, //course[subject=‘‘databases’’][1] selects the first of the courses that have databases as their subject. The comparisons and existential conditions can be combined with the and and or operators, and the not() function. For path expressions that have to navigate over more types of nodes and require other navigation steps a more general syntax is available. In this syntax a single step is specified as axis::node-test followed by zero or more predicates. Here axis specifies the direction of navigation and node-test a test on the name or the type of the node that has to be satisfied. There are 13 possible navigation axes: child, attribute, descendant, descendant-or-self, parent, ancestor, ancestor-or-self, following, preceding, following-sibling, preceding-sibling, self and namespace. The following axis navigates to all nodes that are larger in document order but not descendants, and the following-sibling axis navigates to all siblings that are larger in document order. Possible node-tests are names, the * wild-card and tests for specific node types such as comment(), text(), processing-instruction() or node(), which matches all nodes. Finally two special functions position() and last() can be used in predicates and denote the position of the current node in the result of the step and the size of that result, respectively. To illustrate, the location path /class//student[1] can also be written in this syntax as /child::class/descendant-or-self::node()/child::student[position()=1]. Next to the operators for navigation, XPath also has functions for value manipulation of node sets, strings, booleans and numbers. This includes the arithmetic operators +, -, *, div and mod, aggregation operators such as count() and sum(), string manipulation such as string concatenation, space normalization, substring manipulation, etc., type conversion operators such as string(), number() and boolean(), and finally functions that retrieve properties of nodes such as name(), string() (also overloaded for type conversion) and namespace-uri().
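As a small illustration of these functions (the expressions below are illustrative and reuse the element names of the earlier examples):

count(//class/student)
concat(//student/first-name, " ", //student/last-name)
//course[not(enrolled/student)]

The first expression counts all student elements below class elements; the second builds a single string, where a node set passed in a string context is converted to the string-value of its first node in document order; the third combines not() with an existential condition to select the courses in which no student enrolled.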
XQuery 1.0
The data model for XML documents is for XQuery 1.0 largely the same as for XPath 1.0. The main changes are that the root of a document is represented by a document node, and that typed values are associated with nodes. The types of these typed values include the built-in types from XML Schema and are associated with all values computed in XQuery. However, for the sake of brevity these types are mostly ignored here. Another important change is that the fundamental data structure is no longer sets of nodes but ordered sequences of nodes and atomic values as defined by XML Schema. All results of expressions are such sequences, and single values such as 1 and ‘‘mary’’ are always interpreted as singleton sequences. Note that these sequences are flat, i.e., they cannot contain nested sequences. The concatenation of sequences s1 and s2 is denoted as s1,s2. So the expression (‘‘a’’, ‘‘b’’) denotes in fact the concatenation of the singleton sequence ‘‘a’’ and the singleton sequence ‘‘b’’. Therefore (1, (2, 3)) is equivalent to (1, 2, 3). The path expressions as defined in XPath 1.0 are almost all included and have mostly the same semantics. An important difference is that they do not return a set of nodes but a sequence of nodes sorted in document order. Several extensions to path expressions are introduced. Next to the union operator, which is equivalent to the | operator in XPath 1.0, there are also the intersect and difference operators that correspond to set intersection and set difference. These also return their results sorted in document order. The aggregation functions such as count() and sum() are naturally generalized as fn:count() and fn:sum() for sequences, and new ones such as fn:min(), fn:max() and fn:avg() are added. Note that built-in functions in XQuery are prefixed with a namespace, usually fn. Next to the old value comparisons new types of comparisons are introduced, such as is and is not that compare the node identity of two nodes, and << that checks precedence in document order. For navigating over references, the fn:id() function is introduced that given an identifier retrieves the element node with that identifier, and the function fn:idref() that given an identifier retrieves the element nodes that refer to this identifier. The access to collections of XML documents is provided by the functions fn:doc() and fn:collection() that both expect as argument a string containing a URI. These functions retrieve the requested
XML fragments associated with this URI and, if this was not already done before, construct their data models and return a document node or a sequence of nodes that are the roots of the fragments.
An example of their use would be fn:doc(‘‘courses.xml’’)/course[@code=‘‘DB201’’]/enrolled/student, which retrieves the students enrolled in the course DB201 from the file courses.xml.
The core expressions of XQuery are the FLWOR expressions, which are illustrated by the following example:
for $s in fn:doc(‘‘students.xml’’)//student,
    $e in fn:doc(‘‘enrollments.xml’’)//enrollment
let $cn := fn:doc(‘‘courses.xml’’)//course[@crs-code=$e/@crs-code]/name
where $s/@stud-id = $e/@stud-id
order by $cn
return <enroll>{$s/name, $cn}</enroll>
A FLWOR expression starts with one or more for and let clauses that each bind one or more variables (which always start with $). The for clause binds variables such that they iterate over the elements of the result sequence of an expression, and the let clause binds the variable to the entire sequence. This is followed by an optional where clause with a selection condition, an optional order by clause that specifies a list of sorting criteria, and a return clause that contains an expression that constructs the result. In the example the for clause binds the variable $s such that it iterates over the student elements in the file students.xml in the order in which they appear there. Then, for each of those, it binds the variable $e such that it iterates over all the enrollments in the file enrollments.xml in the order in which they appear there. For every binding of the variables it evaluates the let clause, which binds $cn to the name of the course in the enrollment $e. Then it selects those combinations for which the condition in the where clause is true, i.e., if the student $s belongs to the enrollment $e. The resulting bindings are sorted by the order by clause on the course name in $cn. Finally, the return clause creates for each binding in the result of the preceding clauses an enroll element that contains the name element of student $s and the name element in $cn. Note that if there had been no order by clause then the result would have been sorted in the order in which the students are listed in students.xml. An additional optional feature of for clauses is the at clause, which binds an extra variable to the position of the current binding. For example, the expression for $x at $p in (‘‘a’’, ‘‘b’’, ‘‘c’’) return ($p, $x) returns the sequence (1, ‘‘a’’, 2, ‘‘b’’, 3, ‘‘c’’). New nodes can be constructed in two ways. The first is the direct constructor, which is demonstrated in the FLWOR example above and consists of literal XML with embedded XQuery expressions between curly braces. For the literal XML the corresponding nodes are created and deep copies of the results of the expressions are inserted into it. This can be used to compute the content and the attribute values of the new node, such as in {$x/full-name}. An alternative that also allows computation of the element name is the computed element constructor of the form element{e1}{e2}, which constructs a new element node with the name computed by e1 and with, as its content, i.e., all nodes directly below it, deep copies of those computed by e2. Such computed constructors are available for all types of nodes except namespace nodes. XQuery also adds conditional expressions such as if (e1) then e2 else e3, and type switches of the form typeswitch(e) case t1 return e1 case t2 return e2 ... default return ed, which returns the result of the first ei such that the result of e matches type ti, or the result of ed if none of the types match. It also has logical quantifiers such as some $v in e1 satisfies e2 and every $v in e1 satisfies e2. In order to allow query optimization techniques for unordered data formats there is a function fn:unordered() that allows the user to indicate that the ordering of a result is of no importance. In addition, there is a global parameter called the ordering mode such that when it is declared as unordered it means, informally stated, that path expressions may produce unordered results and FLWOR expressions without an order by clause may change the iteration order. This ordering mode can be reset locally for an expression e with the statements unordered{e} and ordered{e}. A very powerful feature is the possibility to start a query with a list of possibly recursive function definitions. For example, declare function countElem
($s){fn:count($s/self::element()) + fn:sum(for $e in $s/* return countElem($e))}
defines a function that counts the number of element
nodes in a fragment. This feature makes XQuery Turing complete and gives it the expressive power of a full-blown programming language. Finally, there is a wide range of predefined functions for the manipulation and conversion of atomic values as defined in XML Schema, and functions for sequences such as fn:distinct-values() that removes duplicate values, fn:reverse() that reverses a sequence and fn:deep-equal() that tests if two sequences are deep equal. XPath 2.0
This version of XPath is based on the same data model as XQuery 1.0 and is semantically and syntactically a subset of XQuery 1.0. The main omissions are (i) user-defined functions, (ii) all clauses in FLWOR expressions except the for clause without at and the return clause, (iii) the node constructors and (iv) the typeswitch expression.
Key Applications The XPath language is used in several W3C standards such as DOM (Level 3), XSL, XLink, XPointer, XML Schema and XForms, and also in ISO standards such as Schematron. Most programming languages that offer some form of support for XML manipulation also support XPath for identifying parts of XML fragments. This includes C/C++, Java, Perl, PHP, Python, Ruby, Scheme and the .Net framework. The usage of the XQuery language can be roughly categorized into three different but not completely disjoint areas, for which different types of implementations are available. The first is that of standalone XML processing, where the XML data that is to be queried and transformed consists of documents stored on a file system or the World Wide Web, and these data are processed by a standalone XQuery engine. The second area is that of database XML processing, where the XML documents are stored in an XML-enabled DBMS that has an integrated XQuery engine. The third and final area is that of XML-based data integration, where data from different XML and non-XML sources are integrated into an XML view that can be queried and transformed with XQuery. For each of the mentioned areas the XQuery engine faces different challenges. For example, for database XML processing it must optimally use the data structures provided by the DBMS, which might have an XML-specific storage engine, a relational storage engine or a
mixture of both. On the other hand, for data integration it may be more important to determine how to optimally combine data from streaming and non-streaming sources and how to recognize and delegate query processing tasks to the data sources that have those capabilities themselves. As a consequence, different XQuery engines are often specialized in one of the mentioned application areas.
Future Directions Since XPath 2.0 and XQuery 1.0 have become W3C recommendations, the involved working groups have continued to work on several extensions of these languages: The XQuery Update Facility This extension adds update operations to XQuery. It allows expressions such as
for $s in /inventory/clothes/shirt[@size = "XXL"]
return do replace value of $s/@price with $s/@price - 10
that combine the XQuery syntax with update operations such that multiple elements can be updated at once. A last call working draft for the XQuery Update Facility was published by W3C in August 2007. XQuery 1.1 and XPath 2.1 In March 2007 the XML Query working group published a first version of the XML Query 1.1 requirements, and they plan to produce with the XSL working group the requirements for XPath 2.1. The proposed extensions for XQuery include grouping on values, grouping on position, calling external functions that, for example, invoke web services, adding explicit node references that can be used as content and can be dereferenced, and finally higher-order functions. XQuery 1.0 and XPath 2.0 Full-Text This extension adds constructs for doing full-text searches on selections of documents, text-search scoring variables that can be used in FLWOR expressions, and full-text matching options that can be defined in the query prolog. A last call working draft was published in May 2007. XQuery Scripting Extensions This adds imperative features such that the resulting language can be more readily used for tasks that would otherwise typically be accomplished using an imperative language with XML
capabilities. Proposed extensions include constructs for controlling the order of computation in FLWOR expressions, operations that cause side effects such as variable assignments, and operations that observe these side effects. A first working draft describing the requirements for this extension was published by the XML Query working group in March 2007.
XQFT ▶ XQuery Full-Text
Cross-references
▶ XML Information Integration ▶ XML Programming ▶ XML Storage ▶ XML Tuple Algebra ▶ XML Updates ▶ XML Views ▶ XQuery Full-Text ▶ XQuery Processors ▶ XSL/XSLT
XQuery 1.0 and XPath 2.0 Full-Text ▶ XQuery Full-Text
XQuery Compiler ▶ XQuery Processors
Recommended Reading
1. Brundage M. XQuery: The XML Query Language. Pearson Higher Education, Addison-Wesley, Reading, MA, USA, 2004.
2. Chamberlin D.D., Robie J., and Florescu D. Quilt: An XML query language for heterogeneous data sources. In Proc. 3rd Int. Workshop on the World Wide Web and Databases, 2000, pp. 53–62.
3. Hidders J., Paredaens J., Vercammen R., and Demeyer S. A light but formal introduction to XQuery. In Database and XML Technologies, 2nd Int. XML Database Symp., 2004, pp. 5–20.
4. Katz H., Chamberlin D., Kay M., Wadler P., and Draper D. XQuery from the experts: A guide to the W3C XML query language. Addison-Wesley Longman, Boston, MA, USA, 2003.
5. Melton J. and Buxton S. Querying XML: XQuery, XPath, and SQL/XML in context. Morgan Kaufmann, San Francisco, CA, USA, 2006.
6. Walmsley P. XQuery. O’Reilly Media, 2007.
7. W3C. W3C XML query (XQuery). http://www.w3.org/XML/Query/.
8. W3C. XML path language (XPath), version 1.0, W3C recommendation 16 November 1999. http://www.w3.org/TR/xpath/, November 1999.
9. W3C. XML path language (XPath) 2.0, W3C recommendation 23 January 2007. http://www.w3.org/TR/xpath20/, January 2007.
10. W3C. XQuery 1.0: An XML query language, W3C recommendation 23 January 2007. http://www.w3.org/TR/xquery/, January 2007.
XPDL ▶ Workflow Management Coalition ▶ XML Process Definition Language
XQuery Full-Text Chavdar Botev 1, Jayavel Shanmugasundaram 2
1 Yahoo! Research, Cornell University, Ithaca, NY, USA
2 Yahoo! Research, Santa Clara, USA
Synonyms XQuery 1.0 and XPath 2.0 Full-Text; XQFT
Definition XQuery Full-Text [11] is a full-text search extension to the XQuery 1.0 [9] and XPath 2.0 [8] XML query languages. XQuery 1.0, XPath 2.0, and XQuery Full-Text are query languages developed by the World Wide Web Consortium (W3C).
Historical Background The XQuery [9] and XPath languages [8] have evolved as powerful languages for querying XML documents. While these languages provide sophisticated structured query capabilities, they only provide rudimentary capabilities for querying the text (unstructured) parts of XML documents. In particular, the main full-text search predicate in these languages is the fn:contains
($context, $keywords) function (http://www.w3.org/TR/xpath-functions/#func-contains), which intuitively returns the Boolean value true if the items in the $context parameter contain the strings in the $keywords parameter. The fn:contains function is sufficient for simple sub-string matching but does not provide more complex search capabilities. For example, it cannot support queries like ‘‘Find titles of books (//book/title) which include ‘‘xquery’’ and ‘‘fulltext’’ within five words of each other, ignoring capitalization of letters.’’ Furthermore, XQuery does not support the concept of relevance scoring, which is very important in the area of full-text search. To address these shortcomings, W3C has formulated the XQuery 1.0 and XPath 2.0 Full-Text 1.0 Requirements [12]. These requirements specify a number of features that must, should or may be supported by full-text search extensions to the XQuery and XPath languages. These include the level of integration with XQuery 1.0 and XPath 2.0, support for relevance scoring, and extensibility as the most important features of such extensions. W3C has also described a number of use cases for full-text search within XML documents that occur frequently in practice. These use cases can be found in the document XQuery 1.0 and XPath 2.0 Full-Text 1.0 Use Cases [13]. The use cases contain a wide range of scenarios for full-text search that vary from simple word and phrase queries to complex queries that involve word proximity predicates, word ordering predicates, use of thesauri, stop words, stemming, etc. The XQuery Full-Text language has been designed to meet both the XQuery 1.0 and XPath 2.0 Full-Text 1.0 Requirements and the XQuery 1.0 and XPath 2.0 Full-Text 1.0 Use Cases. XQuery Full-Text is also related to previous work on full-text search languages for semi-structured data: ELIXIR [5], JuruXML [4], TeXQuery [3], TiX [1], XIRQL [6], and XXL [7].
Foundations XQuery Full-Text provides a full range of query primitives (also known as selections) that facilitate the search within the textual content of XML documents. The textual content is represented as a series of tokens which are the basic units to be searched. Intuitively, a token is a character, n-gram, or sequence of characters. An ordered sequence of tokens that should occur in the document together is referred to as a phrase. The
process of converting the textual content of XML documents to a sequence of tokens is known as tokenization. XQuery Full-Text proposes four major features for the support of full-text search in the XQuery and XPath query languages:
1. Tight integration with the existing XQuery and XPath syntax and semantics
2. Query primitives for support of complex full-text search in XML documents
3. A formal model for representing the semantics of full-text search
4. Support for relevance scoring.
Each of these features is discussed below. XQuery Full-Text Integration
One of the main design paradigms of XQuery Full-Text is the tight integration with the XQuery and XPath query languages. The integration is both on the syntax and data-model levels. The syntax level integration is achieved through the introduction of a new XQuery expression, the FTContainsExpr. The FTContainsExpr acts as a regular XQuery expression and thus, it is fully composable with the rest of the XQuery and XPath expressions. The basic syntax of the FTContainsExpr is: Expr ‘‘ftcontains’’ FTSelection
The XQuery expression Expr on the left-hand side is called the search context. It describes the nodes from the XML documents that need to be matched against the full-text search query described by the FTSelection on the right-hand side. Expr can be any XQuery/XPath expression that returns a sequence of nodes. The syntax of the FTSelection will be described later. The entire FTContainsExpr returns true if some node in Expr matches the full-text search query FTSelection. Note that since FTContainsExpr returns results within the XQuery data model, FTContainsExpr can be arbitrarily nested within XQuery expressions. Consider the following simple example. //book[./title ftcontains ‘‘information’’ ftand ‘‘retrieval’’]//author
The above example returns the authors of books whose title contains the tokens ‘‘information’’ and ‘‘retrieval’’. The expression ./title defines the search context for the FTContainsExpr in the predicate expression within the brackets [], which is the title child element node of the current book element node. The FTSelection ‘‘information’’ ftand ‘‘retrieval’’ specifies that book titles must contain the tokens ‘‘information’’ and ‘‘retrieval’’ to be considered a match. Integration can be achieved not only for XQuery Full-Text expressions within XQuery/XPath expressions but also vice versa. Consider the following example. for $color in (‘‘red’’, ‘‘yellow’’) for $car in (‘‘ferrari’’, ‘‘lamborghini’’, ‘‘maserati’’) return //offer[. ftcontains {$color, $car} phrase]
within the brackets [] which is the title child element node of the current book element node. The FTSelection ‘‘information’’ ftand ‘‘retrieval’’ specifies that book titles must contain the tokens ‘‘information’’ and ‘‘retrieval’’ to be considered a match. Integration can be achieved not only for XQuery Full-Text expressions within XQuery/XPath expressions but vice versa. Consider the following example. for $color in (‘‘red,’’ ‘‘yellow’’) for $car in (‘‘ferrari,’’ ‘‘lamborghini,’’ ‘‘maserati’’) return //offer[. ftcontains {$color, $car} phrase]
The above query returns offer element nodes which contain any of a number of combinations of colors and cars as a phrase. The FTContainsExpr will match phrases like ‘‘red lamborghini’’ or ‘‘yellow ferrari’’ but it will not match ‘‘red porsche’’ or ‘‘yellow rusty ferrari.’’ XQuery Full-Text Query Primitives
This section describes the basic XQuery Full-Text query primitives and how they can be combined to construct complex full-text search queries. The XQuery Full-Text query primitives include token and phrase search, token ordering, token proximity, token scope, match cardinality, and Boolean combinations of the previous. Further, XQuery FullText allows control over the natural language used in the queried documents, the letter case in matched tokens, and the use of diacritics, stemming, thesauri, stop words, and regular-expression wildcards though the use of match options. The XQuery Full-Text query primitives are highly composable within each other. This allows for the construction of complex full-text search queries. The remainder of this section will briefly describe some of the available query primitives and show how they can be composed. The most basic query primitive is to match a sequence oftokens also known asFTWords. For example, the above car offer query can also be written as: //offer[. ftcontains {‘‘red ferrari,’’ ‘‘yellow ferrari,’’ ‘‘red lamborghini,’’ ‘‘yellow lamborghini,’’ ‘‘red maserati,’’ ‘‘yellow maserati’’} any]
The above FTContainsExpr matches offer nodes which contain any of the listed phrases. The any option specifies that it is sufficient to match a single phrase within an offer node. It is also possible to use the option all, which specifies that all phrases should be matched within a single node. As was shown earlier, the phrase option specifies that all nested phrases should be combined into a single phrase. Other possible options are any word or all word, which specify that the nested phrases need to be first broken into separate tokens before applying the respective disjunctive or conjunctive match semantics. For example, the expression //offer[. ftcontains {‘‘red ferrari’’, ‘‘yellow lamborghini’’} all word]
is equivalent to //offer[. ftcontains {‘‘red’’, ‘‘ferrari’’, ‘‘yellow’’, ‘‘lamborghini’’} all]
Two other basic query primitives restrict the proximity of the matched tokens. There are two flavors of proximity predicates. The FTWindow primitive specifies that the matched tokens have to be within a window of a specified size. Consider the example //offer[. ftcontains {‘‘red ferrari’’, ‘‘yellow lamborghini’’} all
window at least 6 words] The above expression will return offer nodes which contain, say, ‘‘red ferrari and brand new yellow lamborghini’’ because both phrases occur within a window of seven words, i.e., the window matches the FTWindow size restriction of at least six words. The expression will not match nodes which contain ‘‘red ferrari and yellow lamborghini’’ because the phrases occur within a window of five words. The other flavor of proximity primitive is FTDistance. It can be used to restrict the number of intervening tokens between the matched tokens or phrases. For example, consider the expression //offer[. ftcontains {‘‘new’’, ‘‘brand’’, ‘‘ferrari’’} all
distance at most 1 word] The above expression will return offer nodes which contain the specified query tokens with at most one intervening token between consecutive
occurrences of the query tokens. For example, the expression will return nodes which contain ‘‘brand new red ferrari’’ because there are no intervening tokens between ‘‘brand’’ and ‘‘new’’ and one intervening token between ‘‘new’’ and ‘‘ferrari.’’ On the other hand, the expression will not return nodes which contain ‘‘new car of the ferrari brand’’ because there are three intervening tokens between ‘‘new’’ and ‘‘ferrari.’’ XQuery Full-Text also allows the specification of the order of the matched tokens using the FTOrder primitive. //offer[. ftcontains {‘‘new’’, ‘‘brand’’, ‘‘ferrari’’} all
distance at most 1 word ordered] The above query expands the previous FTDistance example by specifying that the query tokens can occur in the nodes only in the specified order. More complex query expressions can be built using FTAnd, FTOr, FTUnaryNot, and FTMildNot. They allow for the combination of other FTSelections into conjunctions, disjunctions, and negations. Here is a complex example using FTAnd and some of the FTSelections introduced earlier. //offer[. ftcontains (({‘‘red’’, ‘‘ferrari’’} all window at most 3 words) ftand ({‘‘yellow’’, ‘‘lamborghini’’} all window at most 3 words)) window at most 20 words]
The above expression specifies that the resulting offer nodes must contain ‘‘red’’ and ‘‘ferrari’’ within
a window of 3 words, ‘‘yellow’’ and ‘‘lamborghini’’ within a window of 3 words, and all of them within a window of 20 words. //offer[. ftcontains ‘‘red ferrari’’ ftnot ‘‘yellow lamborghini’’]
This example of FTNot will return offer nodes which contain the phrase ‘‘red ferrari’’ but not the phrase ‘‘yellow lamborghini.’’ Sometimes, this kind of negation can be too strict. For example, consider a query that looks for articles about the country Mexico but not the state New Mexico. //article[. ftcontains ‘‘Mexico’’ ftnot ‘‘New Mexico’’]
The above query will not return articles which talk about both the country Mexico and the state ‘‘New Mexico.’’ The query can be rewritten using FTMildNot. //article[. ftcontains ‘‘Mexico’’ not in ‘‘New Mexico’’]
Intuitively, the above query specifies that the user wants articles where the token ‘‘Mexico’’ has occurrences that are not part of the phrase New Mexico (although there may still be occurrences of the phrase New Mexico). The power of XQuery Full-Text queries can be further increased with the use of match options that control the way tokens are matched within the document. For example, if the user wants to find offers containing ‘‘FERRARI’’ (all capital letters), she can use the ‘‘uppercase’’ match option: //offer[. ftcontains ‘‘ferrari’’ uppercase]
This query will match only tokens ‘‘FERRARI’’ regardless of how the token is specified in the query. Match options can be used to encompass several FTSelections. For example, consider the query //article[. ftcontains (‘‘car’’ ftand ‘‘aircraft’’) window at most 50 words with thesaurus at ‘‘http://acme.org/thesauri/Synonyms’’]
The above query searches for articles that contain the tokens ‘‘car’’ and ‘‘aircraft’’ or their synonyms using the thesaurus identified by the URI ‘‘http://acme.org/thesauri/Synonyms.’’ The matched tokens have to be within a window of at most 50 words. As mentioned in the beginning of this section, there are other FTSelections and match options provided by XQuery Full-Text. The description of all of these features goes beyond the scope of the article. The reader is referred to the complete specification of the language available at [11]. This section concludes the brief overview of the language primitives in XQuery Full-Text with a description of the extension-point feature. Extension points can be used to provide additional implementation-defined full-text search functionality. An example of the use of such an extension can be found below. declare namespace acmeimpl = ‘‘http://acme.org/AcmeXQFTImplementation’’; //book/author[name ftcontains (# acmeimpl:use-entity-index #) {‘‘IBM’’}]
The above example shows the use of the extension use-entity-index provided by the ACME imple-
mentation of XQuery Full-Text. It directs the implementation to use an entity index for the token ‘‘IBM’’ so it can also match other references to the same entity, such as ‘‘International Business Machines’’ or even ‘‘Big Blue.’’ If the above query is evaluated using another implementation the use-entity-index extension will be ignored.
XQuery Full-Text Formal Model
Amer-Yahia et al. [3] have shown that the XQuery 1.0 and XPath 2.0 Data Model (XDM) [10] is inadequate to support the composability of the complex full-text query primitives described in the previous section. Intuitively, XDM represents data only about entire XML nodes but not about sub-node entities like the positions of the tokens within a node. This precludes, for example, the evaluation of nested proximity FTSelections (FTWindow and/or FTDistance) such as the ones used in the FTAnd example. Therefore, XQuery Full-Text needs a more precise data model that can support the complex full-text search operations and their composability. The language specification [11] introduces the AllMatches model. The result of every FTSelection applied on a node from the search context can be represented as an AllMatches object within this model. The AllMatches object has a hierarchical structure. Each AllMatches object consists of zero, one, or more Match objects. Intuitively, a Match object describes one set of positions of tokens in the node that match the corresponding FTSelection. The AllMatches object contains all such possible Match objects. The positions of tokens in a match are described using TokenInfo objects, which contain data about the relative position of the token in the node and to which query token it corresponds. This model is sufficient to describe the semantics of all FTSelections. The semantics of every FTSelection can be represented as a function that takes as input one or more AllMatches objects from its nested FTSelections and produces an output AllMatches object. The remainder of this section will illustrate the process. The illustration will use some simplifications so as not to burden the exposition with excessive details. Nevertheless, the example below will illustrate some of the basic techniques in the workings of the model. Consider the simple query
//offer[. ftcontains {‘‘red’’, ‘‘ferrari’’} all window at most 3 words]
Let’s assume that the token ‘‘red’’ occurs in the current node from the search context in relative positions 5 and 10 and the token ‘‘ferrari’’ occurs in relative positions 7 and 25. The AllMatches objects for each of these tokens contain two Match objects – one for each position of occurrence:
‘‘red’’: AllMatches {
  Match{TokenInfo{token:‘‘red’’, pos:5}},
  Match{TokenInfo{token:‘‘red’’, pos:10}}}
‘‘ferrari’’: AllMatches {
  Match{TokenInfo{token:‘‘ferrari’’, pos:7}},
  Match{TokenInfo{token:‘‘ferrari’’, pos:25}}}
To obtain the AllMatches for the FTWords {‘‘red’’, ‘‘ferrari’’} all, each pair of
positions for ‘‘red’’ and ‘‘ferrari’’ has to be combined into a single Match object.
(‘‘red’’, ‘‘ferrari’’) all: AllMatches {
  Match{TokenInfo{token:‘‘red’’, pos:5}, TokenInfo{token:‘‘ferrari’’, pos:7}},
  Match{TokenInfo{token:‘‘red’’, pos:10}, TokenInfo{token:‘‘ferrari’’, pos:7}},
  Match{TokenInfo{token:‘‘red’’, pos:5}, TokenInfo{token:‘‘ferrari’’, pos:25}},
  Match{TokenInfo{token:‘‘red’’, pos:10}, TokenInfo{token:‘‘ferrari’’, pos:25}}}
Finally, to obtain the AllMatches for the outermost FTWindow, those Matches which violate the window size condition have to be filtered out. Only the first Match satisfies it: the window sizes are 3, 4, 21, and 16, respectively. Therefore, the final AllMatches is:
(‘‘red’’, ‘‘ferrari’’) all window at most 3 words: AllMatches {
  Match{TokenInfo{token:‘‘red’’, pos:5}, TokenInfo{token:‘‘ferrari’’, pos:7}}}
The resulting AllMatches object is non-empty and therefore, the current node from the search context
satisfies the FTSelection and the FTContainsExpr should return true. The semantics of every FTSelection can be defined in terms of similar operations on AllMatches objects. The description of all such operations goes beyond the scope of this article. The reader is referred to the language specification [11] for further details. Relevance Scoring in XQuery Full-Text
One of the core features of full-text search is the ability to rank the results in the order of their decreasing relevance. Usually, the relevance of a result is represented using a number called the score of the result. Higher scores represent more relevant results. To support scores and relevance ranking, XQuery Full-Text extends the XQuery language with support for score variables. Score variables give access to the underlying scores of the results of the evaluation of an XQuery Full-Text expression. Score variables can be introduced as part of a for or let clause in a FLWOR expression. Here is a simple example:
for $o score $s in //offer[. ftcontains {‘‘red’’, ‘‘ferrari’’} all window at most 3 words]
order by $s descending
return <result id=‘‘{$o/@id}’’ score=‘‘{$s}’’/>
The above is a fairly typical query: it iterates over all offers that contain "red" and "ferrari" within a window of three words using the $o variable, binds the score of each offer to the $s variable, and uses those variables to return the ids of the offers and their scores in order of decreasing relevance. Score variables can also be bound in let clauses. This allows the scoring condition to differ from the one used for filtering the objects. Consider the following example:

for $res at $p in (
    for $o in //offer[. ftcontains {"red", "ferrari"} all]
    let score $s := $o ftcontains {"excellent", "condition"}
                       all window at most 10 words
    order by $s descending
    return $o)
where $p <= 10
return $res
The above query obtains the top 10 offers for red Ferraris whose relevance is estimated using the full-text search condition {"excellent", "condition"} all window at most 10 words. It should be noted that XQuery Full-Text does not specify how scoring should be implemented. Each XQuery Full-Text implementation can use a scoring method of its choice, as long as the generated scores are in the range [0,1] and higher scores denote greater relevance.
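To illustrate this freedom, here is a deliberately naive sketch of a scoring function; it is purely hypothetical and is not prescribed or implied by the standard. It scores a node by the fraction of distinct query tokens it contains, which trivially yields values in [0,1]; real implementations typically use far more elaborate measures such as TF-IDF or BM25 adapted to XML elements.

    def naive_score(node_tokens, query_tokens):
        # Fraction of distinct query tokens that occur in the node: always in [0, 1].
        query = set(query_tokens)
        hits = query.intersection(node_tokens)
        return len(hits) / len(query) if query else 0.0

    # Example: an offer mentioning only "excellent" gets score 0.5 for the
    # two-token condition {"excellent", "condition"}.
    print(naive_score(["red", "ferrari", "excellent", "shape"],
                      ["excellent", "condition"]))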
Key Applications

Support for full-text search in XML documents.
Cross-references
▶ Full-Text Search
▶ Information Retrieval
▶ XML
▶ XPath/XQuery
Recommended Reading
1. Al-Khalifa S., Yu C., and Jagadish H. Querying structured text in an XML database. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 4–15.
2. Amer-Yahia S., Botev C., Doerre J., and Shanmugasundaram J. XQuery full-text extensions explained. IBM Syst. J., 45(2):335–351, 2006.
3. Amer-Yahia S., Botev C., and Shanmugasundaram J. TeXQuery: a full-text search extension to XQuery. In Proc. 12th Int. World Wide Web Conference, 2004, pp. 583–594.
4. Carmel D., Maarek Y., Mandelbrod M., Mass Y., and Soffer A. Searching XML documents via XML fragments. In Proc. 26th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 151–158.
5. Chinenyanga T. and Kushmerick N. Expressive and efficient ranked querying of XML data. In Proc. 4th Int. Workshop on the World Wide Web and Databases, 2001, pp. 1–6.
6. Fuhr N. and Grossjohann K. XIRQL: an extension of XQL for information retrieval. In Proc. ACM SIGIR Workshop on XML and Information Retrieval, 2000, pp. 172–180.
7. Theobald A. and Weikum G. The index-based XXL search engine for querying XML data with relevance ranking. In Advances in Database Technology, Proc. 8th Int. Conf. on Extending Database Technology, 2002, pp. 477–495.
8. XML Path Language (XPath) 2.0. W3C Recommendation. Available at: http://www.w3.org/TR/xpath20/
9. XQuery 1.0: An XML Query Language. W3C Recommendation. Available at: http://www.w3.org/TR/xquery/
10. XQuery 1.0 and XPath 2.0 Data Model (XDM). W3C Recommendation. Available at: http://www.w3.org/TR/xpath-datamodel/
11. XQuery 1.0 and XPath 2.0 Full-Text 1.0. W3C Working Draft. Available at: http://www.w3.org/TR/xpath-full-text-10/
12. XQuery 1.0 and XPath 2.0 Full-Text 1.0 Requirements. W3C Working Draft. Available at: http://www.w3.org/TR/xpath-full-text-10-requirements/
13. XQuery 1.0 and XPath 2.0 Full-Text 1.0 Use Cases. W3C Working Draft. Available at: http://www.w3.org/TR/xpath-full-text-10-use-cases/
XQuery Interpreter

▶ XQuery Processors
XQuery Processors

Torsten Grust 1, H. V. Jagadish 2, Fatma Özcan 3, Cong Yu 4
1 University of Tübingen, Tübingen, Germany
2 University of Michigan, Ann Arbor, MI, USA
3 IBM Almaden Research Center, San Jose, CA, USA
4 Yahoo! Research, New York, NY, USA
Synonyms

XML database system; XQuery compiler; XQuery interpreter

Definition

XQuery processors are systems for efficient storage and retrieval of XML data using queries written in the XQuery language. A typical XQuery processor includes the data model, which dictates the storage component; the query model, which defines how queries are processed; and the optimization modules, which leverage various algorithmic and indexing techniques to improve the performance of query processing.

Historical Background

The first W3C working draft of XQuery was published in early 2001 by a group of industrial experts. It is heavily influenced by several earlier XML query languages including Lorel, Quilt, XML-QL, and XQL. XQuery is a strongly-typed functional language whose basic principles include simplicity, compositionality, closure, schema conformance, XPath compatibility, generality, and completeness. Its type system is based on XML Schema, and it contains the XPath language as a subset. Over the years, several software vendors have developed products based on XQuery in varying degrees of conformance. A current list of XQuery implementations is maintained on the W3C XML Query working group's homepage (http://www.w3.org/XML/XQuery). There are three main approaches to building XQuery processors. The first is to leverage existing relational database systems as much as possible and evaluate XQuery in a purely relational way, by translating queries into SQL, evaluating them with the SQL engine, and reformatting the output tuples back into XML results. The second approach is to retain the native structure of the XML data both in the storage component and during the query evaluation process. One native XQuery processor is Michael Kay's Saxon, which provides one of the most complete and conforming implementations of the language available at the time of writing. Compared with the first approach, this native approach avoids the overhead of translating back and forth between XML and relational structures, but it also faces the significant challenge of designing new indexing and evaluation techniques. Finally, the third approach is a hybrid style that integrates native XML storage and XPath navigation techniques with existing relational techniques for query processing.

Foundations

Pathfinder: Purely Relational XQuery
The Pathfinder XQuery compiler has been developed under the main hypothesis that well-understood relational database kernels also make for efficient XQuery processors. Such a relational account of XQuery processing can indeed yield scalable XQuery implementations – provided that the system exploits suitable relational encodings of both (i) the XQuery Data Model (XDM), i.e., tree fragments as well as ordered item sequences, and (ii) the dynamic semantics of XQuery, which allow the database back-end to play its trump card: set-oriented evaluation. Pathfinder determinedly implements this approach, effectively realizing the dashed path in Fig. 1b. Any relational database system may assume the role of Pathfinder's back-end database; the compiler does not rely on XQuery-specific built-in functionality and requires no changes to the underlying database kernel.
Internal XQuery and Data Model Representation. To represent XML fragments, i.e., ordered unranked trees of XML nodes, Pathfinder can operate with any node-level tabular encoding of trees (one node per row) that – along with other XDM specifics like tag name, node kind, etc. – preserves node identity and document order. A variety of such encodings are available, among them variations of pre/post region encodings [8] or ORDPATH identifiers [15]. Ordered sequences of items are mapped into tables in which a dedicated column preserves sequence order. Pathfinder compiles incoming XQuery expressions into plans of relational algebra operators. To actually operate the database back-end in a set-oriented fashion, Pathfinder draws the necessary amount of independent work from XQuery's for-loops. The evaluation of a subexpression e in the scope of a for-loop yields an ordered sequence of zero or more items in each loop iteration. In Pathfinder's relational encoding, these items are laid out in a single table for all loop iterations, one item per row. A plan consuming this loop-lifted or "unrolled" representation of e may, effectively, process the results of the individual iterated evaluations
of e in any order it sees fit – or in parallel [10]. In some sense, such a plan is the embodiment of the independence of the individual evaluations of an XQuery for-loop body. Exploitation of Type and Schema Information. Pathfinder's node-level tree encoding is schema-oblivious and does not depend on XML Schema information to represent XML documents or fragments. If the DTD of the XML documents consumed by an expression is available, the compiler annotates its plans with node location and fan-out information that assists XPath location path evaluation and is also used to reduce a query's runtime effort invested in node construction and atomization. Pathfinder additionally infers static type information for a query to further prune plans, e.g., to control the impact of polymorphic item sequences, which present a challenge for the strictly typed table columns of standard relational database systems. Query Runtime. Internally, Pathfinder analyzes the data flow between the algebraic operators of the generated plan to derive a series of operator annotations (keys, multi-valued dependencies, etc.) that drive plan simplification and reshaping.
XQuery Processors. Figure 1. (a) DB2 XML system architecture. (b) Pathfinder: purely relational XQuery on top of vanilla database back-ends. (c) Timber architecture overview [11].
A number of XQuery-specific optimization problems, e.g., the stable detection of value-based joins or XPath twigs and the exploitation of local order indifference, may be approached with simple or well-known relational query processing techniques. The Pathfinder XQuery compiler is retargetable: its internal table algebra has been designed to match the processing capabilities of modern SQL database systems. Standard B-trees provide excellent index support. Code generators exist that emit sequences of SQL:1999 statements [9] (no SQL/XML functionality is used, but the code benefits if OLAP primitives like DENSE_RANK are available). Bundled with its code generator targeting the extensible column-store kernel MonetDB, Pathfinder constitutes XQuery technology that processes XML documents in the gigabyte range in interactive time [7].
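To make the relational encoding idea concrete, the following is a minimal sketch (not Pathfinder code; the table layout, values, and names are illustrative) of a pre/post region encoding in which every node becomes one row, so that an XPath descendant step becomes a simple range predicate that a relational engine can evaluate set-at-a-time:

    # Each node is one row: (pre, post, level, tag), for the tiny document
    #   <db><R><t/><t/></R><S><t/></S></db>
    # with pre/post ranks assigned by a depth-first traversal.
    rows = [
        (0, 5, 0, "db"),
        (1, 2, 1, "R"),
        (2, 0, 2, "t"),
        (3, 1, 2, "t"),
        (4, 4, 1, "S"),
        (5, 3, 2, "t"),
    ]

    def descendants(rows, context_pre):
        # Descendant step as a range predicate over the encoded table:
        # v is a descendant of u iff pre(v) > pre(u) and post(v) < post(u).
        pre_u, post_u = next((p, q) for (p, q, _l, _t) in rows if p == context_pre)
        return [(p, q, l, t) for (p, q, l, t) in rows if p > pre_u and q < post_u]

    # All t descendants of R (the node with pre = 1):
    print([r for r in descendants(rows, 1) if r[3] == "t"])

Expressed over a database table instead of a Python list, the same predicate is an ordinary conjunctive range condition, which is exactly what lets a relational back-end evaluate location steps with plain B-tree support.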
Timber: A Native XML Database System
Timber [11] is an XML database system which manages XML data natively: i.e., the XML data instances are stored in their actual format and XQueries are processed directly. Because of this native representation, there is no overhead for converting the data between the XML format and a relational representation or for translating queries from XQuery into SQL. While many components of a traditional database can be reused (for example, the transaction management facilities), other components need to be modified to accommodate the new data model and query language. The key contribution of the Timber system is a comprehensive set-at-a-time query evaluation engine based on an XML manipulation algebra, which incorporates novel access methods as well as algebraic rewriting and cost-based optimizations. An architecture overview of Timber is shown in Fig. 1c, as presented in [11]. XML Data Model Representation. When XML documents are loaded into the system, Timber automatically assigns each XML node four labels ⟨D, S, E, L⟩, where D indicates which document the node belongs to, and S, E, and L represent the start key, end key, and level of the node, respectively. These labels allow quick detection of relationships between nodes. For example, a node ⟨d1, s1, e1, l1⟩ is an ancestor of another node ⟨d1, s2, e2, l2⟩ iff s1 < s2 ∧ e1 > e2. Each node, along with its labels, is stored natively, and in the
order of their start keys (which corresponds to their document order), into the Timber storage back-end. XQuery Representation. A central concept in the Timber system is the TLC (Tree Logical Class) algebra [12,17,18], which manipulates sets (or sequences) of heterogeneous, ordered, labeled trees. Each operator in the TLC algebra takes as input one or more sets (or sequences) of trees and produces as output a set (or sequence) of trees. The main operators in TLC include filter, select, project, join, reordering, duplicate-elimination, grouping, construct, flatten, and shadow/illuminate. There are several important features of the TLC algebra. First, like the relational algebra, TLC is a "proper" algebra with compositionality and closure. Second, TLC efficiently manages the heterogeneity arising in XML query processing. In particular, heterogeneous input trees are reduced to homogeneous sets for bulk manipulation through tree pattern matching. This tree pattern matching mechanism is also useful in selecting portions of interest in a large XML tree. Third, TLC can efficiently handle both set and sequence semantics, as well as a hybrid semantics where part of the input collection of trees is ordered [17]. Finally, the TLC algebra covers a large fragment of XQuery, including nested FLWOR expressions. Each incoming XQuery is parsed and compiled into the TLC algebra representation (i.e., logical query plans) before being evaluated against the data. The Timber system, and its underlying algebra, has been extended to deal with text manipulation [2] and probabilistic data [14]. A central challenge with querying text is that exact-match retrieval is too crude to be satisfactory. In the field of information retrieval, it is standard practice to use scoring functions and provide ranked retrieval. The TIX algebra [2] shows how to compute and propagate scores during XML query evaluation in Timber. Traditionally, databases have only stored facts, which by definition are certain. Recently, there has been considerable interest in the management of uncertain information in a database. The bulk of this work has been in the relational context, where it is easy to speak of the probability of a tuple being in a relation. ProTDB develops a model for storing and manipulating probabilistic data efficiently in XML [14]. The probability of occurrence of an element in a tree (at that position) is recorded conditioned on the probability of its parent's occurrence.
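The interval-style labels make ancestor tests a constant-time comparison. Below is a minimal sketch, illustrative only (Timber's actual key assignment differs in detail), that assigns ⟨D, S, E, L⟩ labels by a depth-first traversal and implements the ancestor test quoted above:

    import xml.etree.ElementTree as ET

    def label(root, doc_id=1):
        # Assign (D, S, E, L) to every element; S/E come from a counter that
        # brackets each subtree, so containment of intervals mirrors ancestry.
        labels, counter = {}, [0]
        def visit(node, level):
            counter[0] += 1
            start = counter[0]
            for child in node:
                visit(child, level + 1)
            counter[0] += 1
            labels[node] = (doc_id, start, counter[0], level)
        visit(root, 1)
        return labels

    def is_ancestor(a, b):
        # a is an ancestor of b iff same document, a starts before b and ends after b.
        return a[0] == b[0] and a[1] < b[1] and a[2] > b[2]

    doc = ET.fromstring("<db><R><t/><t/></R><S><t/></S></db>")
    lab = label(doc)
    r, first_t = list(doc)[0], list(doc)[0][0]
    print(is_ancestor(lab[r], lab[first_t]))   # True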
Query Runtime: Evaluation and Optimization. The heart of the Timber system is the query evaluation engine, which compiles logical query plans into the physical algebra representation (i.e., physical query plans) and evaluates those plans against the stored XML data to produce XML results. It includes two main subcomponents: the query optimizer and the query evaluator. The query evaluator executes the physical query plan. The separation between the logical algebra and the physical algebra is greater in XML databases than in relational databases, because the logical algebra here manipulates trees while the physical algebra manipulates "nodes" – data are accessed and indexed at the granularity of nodes. This requires the design of several new physical operators. For example, for each XML element, the query may need to access the element node itself, its child sub-elements, or even its entire descendant subtree. This requires the node materialization physical operator, which takes a (set of) node identifier(s) and returns a (set of) XML tree(s) that correspond to the identifier(s). When and how to materialize nodes affects the overall efficiency of the physical query plan, and it is the job of the optimizer to make the right decision. The structural join is another physical operator that is essential for the efficient retrieval of data nodes that satisfy certain structural constraints. Given a parent-child or ancestor-descendant relationship condition, the structural join operator retrieves all pairs of nodes that satisfy the condition. Multiple structural join evaluations are typically required to process a single tree-pattern match. In Timber, a whole stack-based family of algorithms has been developed to efficiently process structural joins, and they are at the core of query evaluation in Timber [1]. The query optimizer attempts to find the most efficient physical query plan that corresponds to the logical query plan. Every pattern match in Timber is computed as a sequence of structural joins, and the order in which these are computed makes a substantial difference to the evaluation cost. As a result, join order selection is the predominant task of the optimizer. Heuristics developed for relational systems often do not work well for XML query optimization [20]. Timber employs a dynamic programming algorithm to enumerate a subset of all possible join plans and picks the plan with the lowest cost. The cost is calculated by the result size estimator, which relies on the position histogram [19] to estimate the lower and upper bounds of each structural join.
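To give a flavor of such stack-based processing, here is a minimal sketch of an ancestor-descendant structural join over (start, end) labeled node lists sorted by start key. It is a simplified variant of the stack-based idea cited above, not Timber's actual implementation, and it assumes the labels come from a single document so that intervals are either nested or disjoint.

    def structural_join(ancestors, descendants):
        # ancestors, descendants: lists of (start, end) labels sorted by start key.
        # Returns all (ancestor, descendant) pairs with interval containment,
        # using one merge pass plus a stack of currently "open" ancestors.
        result, stack = [], []
        a = d = 0
        while a < len(ancestors) and d < len(descendants):
            # Pop ancestors whose interval ends before both current candidates begin.
            while stack and stack[-1][1] < min(ancestors[a][0], descendants[d][0]):
                stack.pop()
            if ancestors[a][0] < descendants[d][0]:
                stack.append(ancestors[a])
                a += 1
            else:
                result.extend((anc, descendants[d]) for anc in stack)
                d += 1
        # Remaining descendants can still match ancestors left on the stack.
        while d < len(descendants):
            while stack and stack[-1][1] < descendants[d][0]:
                stack.pop()
            result.extend((anc, descendants[d]) for anc in stack)
            d += 1
        return result

    # R and S elements as ancestors, t elements as descendants, with
    # Timber-style (start, end) keys:
    print(structural_join([(2, 7), (8, 11)], [(3, 4), (5, 6), (9, 10)]))

Because both inputs are scanned once and the stack holds only the ancestors currently enclosing the scan position, the join runs in time linear in the input plus output sizes, which is what makes this operator family attractive as the core of tree-pattern evaluation.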
DB2 XML: A Hybrid Relational and XML DBMS
DB2 XML (DB2 is a trademark of IBM Corporation) is a hybrid relational and XML database management system, which unifies new native XML storage, indexing, and query processing technologies with existing relational storage, indexing, and query processing. A DB2 XML application can access XML data using either SQL/XML or XQuery [6,16]. The general system architecture is shown in Fig. 1a. It builds on the premises that (i) relational and XML data will co-exist and complement each other in enterprise information management solutions, and (ii) XML data are different enough to require their own storage and processor. Internal XQuery and Data Model Representation. At the heart of DB2 XML's native XML support is the XML data type, introduced by SQL/XML. DB2 XML introduces a new native XML storage format to store XML data as instances of the XQuery Data Model in a structured, type-annotated tree. By storing the binary representation of type-annotated XML trees, DB2 XML avoids repeated parsing and validation of documents. In DB2 XML, XQuery is not translated into SQL, but rather mapped directly onto an internal query graph model (QGM) [6], which is a semantic network used to represent the data flow in a query. Several QGM entities are re-used to represent various set operations, such as iteration, join, and sorting, while new entities are introduced to represent path expressions and to deal with sequences. The most important new operator is the one that captures XPath expressions. DB2 XML does not normalize XPath expressions into FLWOR blocks, where iteration between steps and within predicates is expressed explicitly. Instead, XPath expressions that consist solely of navigational steps are expressed as a single operator. This allows DB2 XML to apply rewrite and cost-based optimization [4] to complex XQueries, as the focus is not on ordering the steps of an XPath expression. Exploitation of Type and Schema Information. DB2 XML provides an XML Schema repository (XSR) to register and maintain XML schemas and uses those schemas to validate XML documents. An important feature of DB2 XML is that it does not require an XML schema to be associated with an XML column. An XML column can store documents validated according to many different and evolving schemas, as well as schema-less documents. Hence, the association between schemas and XML documents is on a per-document basis, providing maximum flexibility.
As DB2 XML has been targeted to address schema evolution [5], it does not support the schema import or static typing features of XQuery. These two features are too restrictive because they do not allow conflicting schemas, and each document insertion or schema update may result in recompilation of applications. Hence, DB2 XML does not exploit XML schema information in query compilation. However, it uses simple data type information for optimization, such as the selection of indexes. Query Runtime. The DB2 XML query evaluation runtime contains three major components for XML query processing:
1. XQuery Function Library: DB2 XML supports several XQuery functions and operators on XML Schema data types using native implementations.
2. XML Index Runtime: DB2 XML supports indexes defined by particular XML path expressions, which can contain wildcards, descendant-axis navigation, and kind tests. Under the covers, an XML index is implemented with two B+trees: a path index, which maps distinct reverse paths to generated path identifiers, and a value index that contains path identifiers, values, and node identifiers for each node that satisfies the defining XPath expression. As indexes are defined via complex XPath expressions, DB2 XML employs the XPath containment algorithm of [3] to identify the indexes that are applicable to a query.
3. XML Navigation: The XNAV operator evaluates multiple XPath expressions and predicate constraints over the native XML store by traversing parent-child relationships between nodes [13]. It returns node references (logical node identifiers) and atomic values to be further manipulated by other runtime operators.
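The split into a path index and a value index can be pictured with a small sketch. The following is illustrative only: plain dictionaries stand in for the two B+trees, and the reverse-path keys, identifiers, and values are made up. It shows how an equality predicate on an indexed path could be answered by first resolving the reverse path to a path identifier and then probing the value index.

    # Path index: distinct reverse paths -> generated path identifiers.
    path_index = {
        "price/offer/db": 1,
        "year/offer/db": 2,
    }

    # Value index: (path id, value) -> node identifiers satisfying the indexed XPath.
    value_index = {
        (1, 30000): ["node17", "node42"],
        (1, 55000): ["node99"],
        (2, 1998): ["node17"],
    }

    def lookup(reverse_path, value):
        # Resolve the reverse path, then probe the value index with (path id, value).
        pid = path_index.get(reverse_path)
        if pid is None:
            return []
        return value_index.get((pid, value), [])

    print(lookup("price/offer/db", 30000))   # ['node17', 'node42']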
Key Applications

Scalable systems for XML data storage and XML query processing are essential to effectively manage the increasing amount of XML data on the web.
URL to Code

Pathfinder
The open-source retargetable Relational XQuery compiler Pathfinder is available and documented at www.pathfinder-xquery.org. MonetDB/XQuery – Pathfinder
bundled with the relational database back-end MonetDB – is available at www.monetdb-xquery.org.

Timber
Timber is available and documented at www.eecs.umich.edu/db/timber.
Cross-references
▶ Top-k XML Query Processing
▶ XML Benchmarks
▶ XML Indexing
▶ XML Storage
▶ XPath/XQuery
Recommended Reading
1. Al-Khalifa S., Jagadish H.V., Patel J.M., Wu Y., Koudas N., and Srivastava D. Structural joins: a primitive for efficient XML query pattern matching. In Proc. 18th Int. Conf. on Data Engineering, 2002, pp. 141–152.
2. Al-Khalifa S., Yu C., and Jagadish H.V. Querying structured text in an XML database. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 4–15.
3. Balmin A., Özcan F., Beyer K.S., Cochrane R.J., and Pirahesh H. A framework for using materialized XPath views in XML query processing. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 60–71.
4. Balmin A. et al. Integration cost-based optimization in DB2 XML. IBM Syst. J., 45(2):299–230, 2006.
5. Beyer K.S., Özcan F., et al. System RX: one part relational, one part XML. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005, pp. 347–358.
6. Beyer K.S., Siaprasad S., and van der Linden B. DB2/XML: designing for evolution. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005, pp. 948–952.
7. Boncz P.A., Grust T., van Keulen M., Manegold S., Rittinger J., and Teubner J. MonetDB/XQuery: a fast XQuery processor powered by a relational engine. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2006, pp. 479–490.
8. Grust T. Accelerating XPath location steps. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2002, pp. 109–120.
9. Grust T., Mayr M., Rittinger J., Sakr S., and Teubner J. A SQL:1999 code generator for the Pathfinder XQuery compiler. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2007, pp. 1162–1164.
10. Grust T., Sakr S., and Teubner J. XQuery on SQL hosts. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 252–263.
11. Jagadish H.V., Al-Khalifa S., Chapman A., Lakshmanan L.V.S., Nierman A., Paparizos S., Patel J., Srivastava D., Wiwatwattana N., Wu Y., and Yu C. TIMBER: a native XML database. VLDB J., 11:274–291, 2002.
12. Jagadish H.V., Lakshmanan L.V.S., Srivastava D., and Thompson K. TAX: a tree algebra for XML. In Proc. 8th Int. Workshop on Database Programming Languages, 2001, pp. 149–164.
13. Josifovski V., Fontoura M., and Barta A. Querying XML streams. VLDB J., 14(2):197–210, 2005.
14. Nierman A. and Jagadish H.V. ProTDB: probabilistic data in XML. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 646–657.
15. O'Neil P., O'Neil E., Pal S., Cseri I., Schaller G., and Westburg N. ORDPATHs: insert-friendly XML node labels. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004, pp. 903–908.
16. Özcan F., Chamberlin D., Kulkarni K.G., and Michels J.-E. Integration of SQL and XQuery in IBM DB2. IBM Syst. J., 45(2):245–270, 2006.
17. Paparizos S. and Jagadish H.V. Pattern tree algebras: sets or sequences? In Proc. 31st Int. Conf. on Very Large Data Bases, 2005, pp. 349–360.
18. Paparizos S., Wu Y., Lakshmanan L.V.S., and Jagadish H.V. Tree logical classes for efficient evaluation of XQuery. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004, pp. 71–82.
19. Wu Y., Patel J.M., and Jagadish H.V. Estimating answer sizes for XML queries. In Advances in Database Technology, Proc. 8th Int. Conf. on Extending Database Technology, 2002, pp. 590–608.
20. Wu Y., Patel J.M., and Jagadish H.V. Structural join order selection for XML query optimization. In Proc. 19th Int. Conf. on Data Engineering, 2003, pp. 443–454.
XSL Formatting Objects

▶ XSL/XSLT
XSL/XSLT

Bernd Amann
Pierre & Marie Curie University (UPMC), Paris, France
Synonyms

eXtensible Stylesheet Language; eXtensible Stylesheet Language Transformations; XSL-FO; XSL Formatting Objects
Definition

XSL (eXtensible Stylesheet Language) is a family of W3C recommendations for specifying XML document transformations and typesettings. XSL is composed of three separate parts:

XSLT (eXtensible Stylesheet Language Transformations): a template-rule based language for the structural transformation of XML documents.
XPath (XML Path Language): a structured query language for the pattern-, type- and value-based selection of XML document nodes.

XSL-FO (XSL Formatting Objects): an XML vocabulary for the paper-document-oriented typesetting of XML documents.
Historical Background

The development of XSL was mainly motivated by the need for an open typesetting standard for displaying and printing XML documents. Its conception was strongly influenced by the DSSSL (Document Style Semantics and Specification Language) ISO standard (ISO/IEC 10179:1996) for SGML documents. Like DSSSL, XSL separates the document typesetting task into a transformation task and a formatting task. Both languages are also based on structural recursion for defining transformation rules, but whereas DSSSL applies a functional programming paradigm, XSL uses XML template rules and XPath pattern matching for defining document transformations. The W3C working group on XSL was created in December 1997 and a first working draft was released in August 1998. XSLT 1.0 and XPath 1.0 became W3C recommendations in November 1999, and XSL-FO reached recommendation status in October 2001. During the succeeding development of XQuery, both the XQuery and XSLT Working Groups shared responsibility for the revision of XPath, which became the core language of XQuery. XSLT 2.0, XPath 2.0 and XQuery 1.0 achieved W3C recommendation status in January 2007.
Foundations

XSLT Programming
XSLT programming consists in defining collections of transformation rules that can be applied to different classes of document nodes. Each rule is composed of a matching pattern and a possibly empty XML template. The matching pattern is used for dynamically binding rules to nodes according to their local (name, attributes, attribute values) and structural (document position) properties. Rule templates are XML expressions composed of static XML output fragments and dynamic XSLT instructions generating new XML fragments from the input data. The following example illustrates the usage of XSLT template rules for implementing some simple
relational queries on the XML representation of a relational database db. The database contains two relations R(a: int, b: int) and S(b: int, c: int). Each relation is represented by a single element containing a sub-element of element type t for each tuple:

<db>
  <R> <t a="..." b="..."/> ... </R>
  <S> <t b="..." c="..."/> ... </S>
</db>

The first transformation is defined by the following two rules:

Rule 1:
<xsl:template match="/">
  <T><xsl:apply-templates select="db/*/t"/></T>
</xsl:template>

Rule 2:
<xsl:template match="t">
  <xsl:copy-of select="."/>
</xsl:template>

Both rules specify the class of nodes to which they can be applied by an absolute (starting from the document root '/') or a relative XPath expression. The first rule applies to the document root '/', whereas the second rule matches all document elements of type t (independently of their absolute position in the document). The XSLT processing model is based on the recursive application of template rules to a current node list, which is initialized with the list containing only the document root. The transformation consists of binding each node to exactly one transformation rule (the existence of at least one matching rule is guaranteed by a default rule for each kind of document node) and replacing it by the corresponding rule template. Templates contain static XML data and dynamic XSLT
instructions which recursively generate new transformation requests (XSLT instructions are distinguished from static fragments by the XSLT namespace http://www.w3.org/1999/XSL/Transform, which is generally, but not necessarily, bound to the prefix xsl:). The whole transformation process stops when all transformation requests have been executed. The rule template of rule 1 generates an element of type T, which contains a new transformation request of type xsl:apply-templates for all tuples in the database returned by the XPath expression db/*/t. This expression is evaluated against the context node of the rule (the document root) and returns a new current node list containing all elements of type t in R and S (the wildcard * matches all element names). Rule 2 matches all elements (tuples) of type t and creates a copy by the instruction xsl:copy-of select=".". The final result is a relation T containing the union of R and S. The previous example illustrates the usage of XPath for the dynamic selection of template rules and for recursively generating new current node lists for transformation. Rule patterns can simply match nodes by their element types, but they can also use more complex XPath patterns defining value-based and structural matching constraints. For example, in order to copy only tuples of R, one might replace rule 2 by the following two rules:

Rule 3:
<xsl:template match="R/t">
  <xsl:copy-of select="."/>
</xsl:template>

Rule 4:
<xsl:template match="S/t"/>

Rule 3 matches and copies sub-elements t of some element R, whereas rule 4 matches tuples of S without creating any node in the output tree. Observe that the same result can be achieved by keeping rule 2 and replacing rule 1 by the following rule:

Rule 5:
<xsl:template match="/">
  <T><xsl:apply-templates select="db/R/t"/></T>
</xsl:template>

The following three rules simulate the relational selection σ_{b=5}(R) by copying only the tuples of R with attribute value b=5:

Rule 6:
<xsl:template match="R/t[@b=5]">
  <xsl:copy-of select="."/>
</xsl:template>

Rule 7:
<xsl:template match="R/t"/>

Rule 8:
<xsl:template match="S/t"/>

Rule 6 matches only tuples of R with attribute value b equal to 5 and creates a copy of this subset in the output tree. Rule 7 applies to all tuples of R without any other restriction. The resulting binding conflict is solved by a set of rules based on import precedence, priority attribute values, and certain syntactical criteria, which generally choose the rule with the most specific matching criteria. For example, rule 7 applies to all nodes that can be transformed by rule 6, whereas the opposite is not true. The inclusion problem for XPath patterns has been formally shown to be PSPACE-complete [9], and the recommendation document defines an incomplete set of syntactical criteria that can easily be implemented and that solves a large set of conflicts in practice. If the conflict resolution algorithm for template rules leaves more than one matching rule, the XSLT processor must generate a recoverable dynamic error, where the optional recovery action is to select, from the matching template rules that are left, the one that occurs last in declaration order. XSLT rules can be assigned to different computation modes by using an optional mode attribute. For example, the following three rules compute the natural join of R and S on attribute b by a simple nested loop. Rule 9 applies to all tuples of R in default mode and represents the outer loop of the join: the XPath expression /db/S/t[@b=current()/@b] selects, for each current tuple in R, all tuples in S with the same b attribute value. The inner loop is represented by rule 10, which is triggered by rule 9 and joins the parameter $tuple with some tuple in S. The template of this rule creates a new tuple by copying all attributes of element $tuple and all attributes of the current node (tuple). The mode attribute is used for distinguishing between the outer loop (default mode) and the inner loop (mode join). In order to "neutralize" the default transformation of S, rule 11 applies to S in default mode and generates nothing.

Rule 9:
<xsl:template match="R/t">
  <xsl:apply-templates select="/db/S/t[@b=current()/@b]" mode="join">
    <xsl:with-param name="tuple" select="."/>
  </xsl:apply-templates>
</xsl:template>
Rule 10:
<xsl:template match="S/t" mode="join">
  <xsl:param name="tuple"/>
  <t>
    <xsl:copy-of select="$tuple/@*"/>
    <xsl:copy-of select="@*"/>
  </t>
</xsl:template>

Rule 11:
<xsl:template match="S/t"/>

XSL-FO Document Typesetting
XSL-FO is a vocabulary of XML element types for defining the main typographic objects (chapters, pages, paragraphs, figures, etc.) and their properties (character set, indentation, justification, etc.) used for the paper-oriented typesetting of documents. The main structure of an XSL-FO document is defined by a document model (fo:layout-master), one or several page sequences (fo:page-sequence) composed of one or several flows (fo:flow) of blocks (fo:block). Each of these elements can be parametrized by a set of attribute values for configuring the layout of pages (page height, margins), tables (column width and height, padding, etc.), lists (item label and distance) and blocks (font family and size, text alignment, etc.). The richness and complexity of the XSL-FO typesetting model is illustrated by the size of the recommendation document (400 pages), and the reader is invited to consult the W3C's XSL-FO tutorial for a detailed introduction.

Theoretical Foundations of XSLT
The most important theoretical aspects of XSL concern XSLT and its interaction with XPath. XSLT and XPath have been theoretically evaluated and compared to other XML query languages by using different theoretical frameworks. For example, the formal semantics of XSLT has been defined by a rewriting process starting from an empty ordered output document, a set of rewriting rules (the XSLT program) and an ordered labeled input tree with "pebbles" (the current node list) [6]. Based on this representation it was possible to show that, under certain constraints (no join on data values), given two XML schemas (regular tree languages) t1 and t2 and an XSLT stylesheet (k-pebble transducer) T, it is possible to type-check T with respect to its input
XSL/XSLT
type t2: 8 t 2 t1:T(t) t2 is decidable. Similarly, [2] proposed a formal model for a subset of XSLT using tree-walking tree-transducers with registers and look-ahead, which can compute all unary monadic second-order (MSO) structural patterns. This expressiveness result provides an important theoretical foundation for XSLT, since MSO captures different robust formalisms like first order logics with set quantification, regular tree languages, query automata and finitevalued attribute grammars. The computation model of XSLT can also be compared with that of structural recursion for semi-structured data [1]. Structural recursion is proposed by functional programming languages like CAML for the dynamic selection of function definitions by matching function parameter values with pattern signatures. For example, the following transformation function f removes all elements of type E and renames all elements of type A to B: f(v)=v f({})={} f({0 A0 : t})={ 0 B0 : f(t)} f({0 E0 : t})=f(t) This function can be defined in XSLT as follows: <xsl:template match=0 *0 > <xsl:copy><xsl:apply-templates select=0 *0 > <xsl:template match=0 E0 /> <xsl:template match=0 A0 > <xsl:apply-templates select=0 .0 > There are two main differences between XSLT and structural recursion. First, XSLT is defined on trees, whereas structural recursion can be applied on arbitrary graphs. Second, all structural recursion programs are guaranteed to terminate, whereas it is easily possible to write infinite XSL transformations as it is shown in the following rule, which recursively generates for any element a new element containing the result of its own transformation: <xsl:template match=0 *0 > <xsl:apply-templates select=0 .0 > The result of this transformation rule is an infinite tree . . .