Societal Challenges and Geoinformatics
edited by

A. Krishna Sinha
Department of Geological Sciences, Virginia Tech, Blacksburg, Virginia 24061, USA, and Adjunct Professor of Geology and Geoinformatics, Department of Geological Sciences, San Diego State University, San Diego, California 92182, USA

David Arctur
Open Geospatial Consortium, 35 Main Street, Suite 5, Wayland, Massachusetts 01778, USA

Ian Jackson
British Geological Survey, Keyworth, Nottinghamshire NG12 5GG, UK

Linda C. Gundersen
U.S. Geological Survey, MS 911 National Center, Reston, Virginia 20192, USA
Special Paper 482
3300 Penrose Place, P.O. Box 9140
Boulder, Colorado 80301-9140, USA
2011
Copyright © 2011, The Geological Society of America (GSA), Inc. All rights reserved. GSA grants permission to individual scientists to make unlimited photocopies of one or more items from this volume for noncommercial purposes advancing science or education, including classroom use. In addition, an author has the right to use his or her article or a portion of the article in a thesis or dissertation without requesting permission from GSA, provided the bibliographic citation and the GSA copyright credit line are given on the appropriate pages. For permission to make photocopies of any item in this volume for other noncommercial, nonprofit purposes, contact The Geological Society of America. Written permission is required from GSA for all other forms of capture or reproduction of any item in the volume including, but not limited to, all types of electronic or digital scanning or other digital or manual transformation of articles or any portion thereof, such as abstracts, into computer-readable and/or transmittable form for personal or corporate use, either noncommercial or commercial, for-profit or otherwise. Send permission requests to GSA Copyright Permissions, 3300 Penrose Place, P.O. Box 9140, Boulder, Colorado 80301-9140, USA.

GSA provides this and other forums for the presentation of diverse opinions and positions by scientists worldwide, regardless of their race, citizenship, gender, religion, sexual orientation, or political viewpoint. Opinions presented in this publication do not reflect official positions of the Society.

Copyright is not claimed on any material prepared wholly by government employees within the scope of their employment.

Published by The Geological Society of America, Inc.
3300 Penrose Place, P.O. Box 9140, Boulder, Colorado 80301-9140, USA
www.geosociety.org

Printed in U.S.A.

GSA Books Science Editors: Marion E. Bickford and Donald I. Siegel

Library of Congress Cataloging-in-Publication Data

Societal challenges and geoinformatics / edited by A. Krishna Sinha ... [et al.].
p. cm. — (Special Paper ; 482)
Includes bibliographical references.
ISBN 978-0-8137-2482-9 (pbk.)
1. Geographic information systems. 2. Geodatabases. I. Sinha, A. Krishna, 1941–.
G70.212.S6295 2011
910.285—dc23
2011028777

Cover caption: The relevance of geoscience. Geological block diagram courtesy of Christopher Wardle, British Geological Survey.
10 9 8 7 6 5 4 3 2 1
Contents

Preface . . . . . v

1. Integrating sensor data and geospatial tools to enhance real-time disaster management capabilities: Wildfire observations . . . . . 1
   Vincent G. Ambrosia, Donald V. Sullivan, and Sally W. Buechel

2. Ontological relations and spatial reasoning in earth science ontologies . . . . . 13
   Hassan A. Babaie

3. Geoscience metadata—No pain, no gain . . . . . 29
   Jeremy R.A. Giles

4. Geoscience data and derived spatial information: Societal impacts and benefits, and relevance to geological surveys and agencies . . . . . 35
   R.A. Hughes

5. Strategic Sustainability Assessment . . . . . 41
   B. Deal, E. Jenicek, W. Goran, N. Myers, and J. Fittipaldi

6. Grid query optimization in the analysis of cone penetration testing data . . . . . 59
   Patrick M. Dudas, Hassan A. Karimi, and Abdelmounaam Rezgui

7. The role and development of a persistent interoperability test bed for geosciences research . . . . . 69
   M.J. Jackson, G. Hobona, L. Bernard, J. Brauner, and C. Higgins

8. GEONETCast: Global satellite data dissemination and the technical and social challenges . . . . . 77
   George Jungbluth, Richard Fulton, Linda Moodie, Paul Seymour, Mike Williams, Lothar Wolf, and Jiashen Zhang

9. Developing and implementing international geoscience standards—A domestic perspective . . . . . 87
   J.L. Laxton and T.R. Duffy

10. The need for ontologies: Bridging the barriers of terminology and data structure . . . . . 99
    Leo Obrst and Patrick Cassidy

11. Data provenance for preservation of digital geoscience data . . . . . 125
    Beth Plale, Bin Cao, Chathura Herath, and Yiming Sun

12. Theoretical foundations of the event bush method . . . . . 139
    Cyril A. Pshenichny and Oksana M. Kanzheleva

13. Infusing semantics into the knowledge discovery process for the new e-geoscience paradigm . . . . . 165
    A. Krishna Sinha

14. Global Map: International cooperation in the mapping sciences . . . . . 183
    D.R. Fraser Taylor
Preface
The fusion of informatics technologies with geoscience-based data and tools signals a necessary change in the way we manage the future of our science. It has become abundantly clear that for data to be useful, it must exist without borders and allow scientists, educators, and decision makers to use it freely and easily. Although the goal appears to be simple, it is very complex in detail, and this volume is dedicated to the broader community who wish to participate in translating data into knowledge. This transformation will enable all of us who are interested in geoscience-based solutions to address significant challenges facing society, such as sustainability of resources, urbanization, and climate change. In a recent report in The Wall Street Journal (January 8–9, 2011), the CEO of General Motors, Dan Akerson, was quoted as saying, "GM has to start acting like a consumer-driven, not engineering-driven company." Geoscience is no different; we have to make our science societally relevant and user friendly, and not be driven solely by technology. Geoinformatics can therefore be considered an agent for making our data and products useful to the public at large.

Contributors to this volume are recognized authorities in facilitating informatics-based solutions to global challenges, and they are committed to expanding the role of the geosciences by translating data into knowledge. The chapters in this volume cover a broad spectrum of research themes (presented alphabetically by primary author's last name) and provide the latest thinking that will influence ongoing and future research in the emerging science of geoinformatics. Fourteen research papers, co-authored by thirty-eight researchers from both the geosciences and computer science, cover a spectrum of topics that range from integrating sensor and satellite data to the need for interoperability through test beds and semantics. Other research topics addressed include strategic sustainability, international standards and collaborations, metadata, provenance, query optimization, and the event bush method. This vast array of topics has one common theme—facilitating the use of data and tools that enable the geoinformatics community to act as first responders to societal challenges.

This book follows an earlier publication, Geoinformatics: Data to Knowledge (Geological Society of America Special Paper 397, 2006), and is a testament to GSA's leadership role in supporting geoinformatics.

A. Krishna Sinha, Senior Volume Editor
The Geological Society of America Special Paper 482 2011
Integrating sensor data and geospatial tools to enhance real-time disaster management capabilities: Wildfire observations

Vincent G. Ambrosia
California State University–Monterey Bay, Seaside, California 93955, and National Aeronautics and Space Administration (NASA) Ames Research Center, Moffett Field, California 94035, USA

Donald V. Sullivan
Sally W. Buechel
NASA Ames Research Center, Moffett Field, California 94035, USA
ABSTRACT

The primary factors needed to manage disaster events are time-critical geospatial information on the event occurrence and presentation of that information in an easily manageable, collaborative/interactive geospatial decision-support and visualization environment. In this chapter, we describe the development, integration, and use of an unmanned airborne system (UAS), a multispectral sensor with autonomous onboard processing capabilities, a data distribution system, and geospatial processes to deliver real-time information to emergency incident management teams facing wildland fires. The unique integration of the described tools has contributed to an order of magnitude decrease in the delivery time of critical geospatial information to disaster managers. The UAS wildfire imaging campaigns in the western United States in 2007 and 2008 are briefly described in the context of real-world adaptation and utility of the resultant information improvements. These capabilities have far-reaching applications to other time-critical disaster event management scenarios, and they are being expanded to further utilize various UAS platforms and other airborne sensor system data. This chapter also describes the resultant integration issues faced and the solutions for ubiquitous adaptation of many of these processes in future UAS missions.
INTRODUCTION
Large-scale wildfires occur frequently throughout the United States every year. The fire season begins in late winter in the Southeast United States (initially in Florida) and transitions both northward into the Appalachian states and westward. The severe fire season transitions into the Southwest, north through the Rockies, west to the Pacific Northwest, and then south through California, culminating in late fall–early winter fire events in Southern California. This extended fire season taxes the resources of local, state, and federal agencies mandated with monitoring and mitigating these destructive events.
Ambrosia, V.G., Sullivan, D.V., and Buechel, S.W., 2011, Integrating sensor data and geospatial tools to enhance real-time disaster management capabilities: Wildfire observations, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 1–12, doi:10.1130/2011.2482(01). For permission to copy, contact
[email protected]. © 2011 The Geological Society of America. All rights reserved.
During the extensive burning seasons throughout the United States, piloted aircraft employing thermal imaging systems are deployed and/or contracted by some state agencies and by the National Interagency Fire Center (NIFC; Boise, Idaho) to collect large volumes of data over fires spread from Canada to Mexico, and from the Rocky Mountain Front to the Pacific Ocean. Large fires such as the Yellowstone conflagration in 1988; the Cerro Grande, New Mexico, fire in 2000; the western Montana fires of 2000; the Colorado fires of 2002; the San Diego, California, fires of 2003; the Southern California fires of 2007; and the Northern California fires of 2008 burdened these remote-sensing data gathering crews and taxed both the resources and the stamina of these personnel.

Goals

Currently, geospatial information derived from airborne systems can require a few hours for processing, georectification, use, and integration by field personnel. Those processes can be streamlined and automated to provide near-real-time contextual geospatial information to mitigation managers on disaster events. The goals for integrating the unmanned airborne system (UAS) platform, payload, data telemetry, and georectification capabilities were to significantly improve the timeliness of the data stream for utility in fire mapping and disaster mitigation, and to provide improved and more accurate information on fire conditions than is currently realized. This chapter describes the processes that allow geospatial data development and use within 15 min of acquisition by the Autonomous Modular Scanner (AMS)–Wildfire sensor on the National Aeronautics and Space Administration (NASA) UAS platform.

SYSTEMS AND TECHNOLOGY INTEGRATION

Between 2006 and 2008, NASA and the U.S. Forest Service (USFS) managed a series of unmanned airborne sensor missions that showcased five major integrated technologies for improving the timeliness and utility of wildfire geospatial information. Those five technologies were: (1) use of a long-duration UAS as a sensor platform; (2) development of improved wildfire sensor systems for UAS platform operations; (3) autonomous onboard data processing capabilities; (4) enhanced geospatial tools; and (5) real-time data distribution capabilities. These five technologies, assets, and tools are described in the following sections. A schematic showing the data collection, information processing, and data distribution process is shown in Figure 1. Each of the elements shown in that schematic will be further detailed in this chapter.

Unmanned Airborne System—NASA Ikhana

The NASA Ikhana UAS is a modified General Atomics–ASI, Inc., Predator-B (MQ-9) unmanned aerial vehicle (UAV), and
it entered NASA service in January 2007 to support earth science and aeronautics research (Fig. 2). "Ikhana" is a Native American Choctaw word meaning intelligence, conscious, or aware. The name is descriptive of the research goals NASA has established for the aircraft and its related systems. The Ikhana UAS consists of the Ikhana aircraft, a ground control station, ground support equipment, and ground communications systems. The Ikhana is remotely controlled by a pilot on the ground seated at a console located in the ground control station. The sensor system operator, seated at a console located in the ground control station, can remotely control the AMS-Wildfire sensor payload carried aloft by the Ikhana. The Ikhana home base is the NASA Dryden Flight Research Center (DFRC) at Edwards Air Force Base (EAFB), California. The Ikhana is capable of ~24 h duration, ~13,720 m (45,000 feet) altitude, and flight legs of over 7408 km (4000 nautical miles) (Table 1). The Ikhana flew its first science missions in support of wildfire observations in August 2007.

Payloads can be flown in the nose compartment of the Ikhana or in pods at various wing-mount locations on the aircraft. The imaging payload instrument used for the wildfire imaging missions was mounted in an instrument payload pod that was attached under the wing. This configuration allowed quick instrument access and the ability to "swap out" the sensor pod rapidly for necessary maintenance or mission reconfiguration. All necessary electronic cabling was designed to provide power from the Ikhana to the payload in the wing pod, as well as to connect the payload instrument, processors, and interface to the aircraft data telemetry system, described in the next section.

Aircraft and Data Telemetry on UAS

There are two kinds of ground communications to the aircraft: line-of-sight (LOS) and satellite over-the-horizon (OTH) systems. A portable ground data terminal provides command and control and payload uplink/downlink when the aircraft is within radio line of sight (~130 km or 70 nautical miles). The satellite communications system provides the over-the-horizon uplink and downlink to the ground control station. Aircraft and telemetry data are downlinked to the ground control station for display on the payload operator and user consoles. The OTH data telemetry system in the NASA Ikhana UAS is used for bidirectional command and control of the UAS as well as for bidirectional control of the AMS-Wildfire sensor system. The Ikhana telemetry is accomplished through a Ku-band frequency, commercial-service-provider, geosynchronous satellite platform. The system has a data bandwidth capacity of 3.0 megabits per second (Mbs), of which 1.0 Mbs is used for data transmission and 2.0 Mbs is used for video transmission. This telemetry link allows imagery and level II data products from the AMS sensor, developed on the payload processors, to be sent from the UAS to the ground control station and then redistributed through the Internet to the community. The sensor payload system, which provides the imagery and processed information, is described in the following section.
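As a rough illustration of why this link budget supports near-real-time delivery, the sketch below estimates the downlink time for one processed image frame. The 2 MB frame size is an assumption for the example (the chapter later quotes 1–3 Mb GeoTIFF frames), and only the 1.0 Mbs data share of the link is used.

```python
# Back-of-envelope downlink timing for one processed image frame. The 2 MB
# frame size is an assumed mid-range value (the chapter quotes 1-3 Mb frames);
# only the 1.0 Mbs data share of the 3.0 Mbs Ku-band link carries products.

LINK_RATE_MBITS_PER_S = 1.0   # portion of the Ku-band link reserved for data
FRAME_SIZE_MBYTES = 2.0       # assumed mid-range GeoTIFF frame size

transfer_s = (FRAME_SIZE_MBYTES * 8.0) / LINK_RATE_MBITS_PER_S
print(f"~{transfer_s:.0f} s to downlink one {FRAME_SIZE_MBYTES:.0f} MB frame")
# ~16 s per frame, a small fraction of the 15 min delivery metric cited later.
```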
Figure 1. Autonomous Modular Scanner (AMS)–Wildfire sensor image data collection/distribution architecture. The Applanix Position and Orientation System for Airborne Vehicles (POS/AV) combines data with the onboard inertial measurement unit (IMU)/differential global positioning system (DGPS) to provide sensor/platform geolocation information, which is combined with the AMS data at the digitizer to generate an automated data capture (ADC) set and a navigation (NAV) data multiplexer (MUX). Data are then transferred to the payload computer and transmitted through a satellite communications (Sat Com) telemetry link to a ground computer where sensor command and control (C&C) capabilities reside (to monitor sensor functionality). At the ground station an intelligent mission management (IMM) process agent resides, allowing the real-time information sharing of sensor system performance characteristics to dictate flight/data collection modifications to be performed in real time. Data are then transferred to a server, allowing access via web services to the image data, which are formatted into public domain metadata standard georeferencing information embedded within a TIFF image file (GEOTIF), for ease of ingestion into GIS systems. QA/QC—quality assurance/quality control.
Figure 2. The NASA Ikhana UAS platform. The sensor is carried in a pod located under the wing, as can be seen in this image. The aircraft size and specifications can be derived from Table 1. The payload pod housing the AMS instrument is approximately 2.5 m long.
TABLE 1. NATIONAL AERONAUTICS AND SPACE ADMINISTRATION IKHANA UNMANNED AIRBORNE SYSTEM SPECIFICATIONS
Length: 11 m (36 ft)
Wing span: 20.11 m (66 ft)
Maximum take-off weight: 4773 kg (10,500 lb)
Payload: 1089 kg (2400 lb) of instruments
Range: >7408 km (4000 nautical miles)
Speed: 170–200 knots
Operational altitudes: Up to 13,720 m (45,000 ft)
Maximum endurance: ~24 h
Internal payload bay: Yes
External payload mounts: Yes
Electrical power: 4.9 kW @ sea level; 2.8 kW @ altitude
Autonomous Modular Scanner (AMS) Sensor

The AMS-Wildfire sensor, developed at the NASA Ames Research Center, is an airborne multispectral imaging line scanner capable of high-altitude autonomous operations on both manned and unmanned aircraft. The sensor is a highly modified Daedalus AADS-1268 scanning system that has interchangeable optical components (primary apertures), and it can support pixel instantaneous fields of view (IFOV) of 1.25 mrad and 2.5 mrad. The swath width is always 716 pixels across, giving total angular widths of roughly 43° or 86°, respectively, and scan rates are continuously adjustable from 2 scans/s to 33 scans/s, which allows operations through a wide range of altitudes and aircraft speeds. Spatial resolution is determined by altitude and the primary aperture size (1.25 mrad or 2.5 mrad). For the wildfire missions flown in 2007 and 2008, the Ikhana operated at a nominal altitude of 7011 m (23,000 feet) above mean sea level (amsl) (~20,000 feet above ground level [AGL]), while the AMS-Wildfire instrument was configured with an aperture size of 2.5 mrad, which provided a pixel spatial resolution of ~15 m (50 ft). The system is configured with sixteen (16) discrete spectral channels, ranging from the visible through shortwave-, mid-, and thermal-infrared (VIS-IR-TIR) (Table 2). The TIR channels are calibrated for accurate (~0.5 °C) temperature discrimination of hot targets, up to ~850 K. The TIR channels simulate those found on the proposed National Polar-Orbiting Operational Environmental Satellite System (NPOESS) visible/infrared imager/radiometer suite (VIIRS) instrument (channels M12 and M15).

Because the AMS line scanner collects a series of scan lines over a wildfire event, the raw spectral data are sent to a computer processor onboard the platform to further process the data into useful information data sets for delivery to a telemetry system and distribution to receiving nodes on the ground. The onboard autonomous data processing is described in the following section.
TABLE 2. AUTONOMOUS MODULAR SCANNER–WILDFIRE 16-CHANNEL SCANNER SPECIFICATIONS
Spectral band   Wavelength (µm)
1               0.42–0.45
2               0.45–0.52 (TM1)
3               0.52–0.60 (TM2)
4               0.60–0.62
5               0.63–0.69 (TM3)
6               0.69–0.75
7               0.76–0.90 (TM4)
8               0.91–1.05
9               1.55–1.75 (TM5), high gain
10              2.08–2.35 (TM7), high gain
11              3.60–3.79 (VIIRS* M12), high gain
12              10.26–11.26 (VIIRS M15), high gain
13              1.55–1.75 (TM5), low gain
14              2.08–2.35 (TM7), low gain
15              3.60–3.79 (VIIRS M12), low gain
16              10.26–11.26 (VIIRS M15), low gain
Note: Total field of view is 42.5° or 85.9° (selectable). Instantaneous field of view is 1.25 mrad or 2.5 mrad (selectable). Spatial resolution is 3–50 m (variable based on altitude).
*VIIRS—visible/infrared imager/radiometer suite.
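As a quick check on the numbers above, the snippet below computes the nadir ground sample distance implied by the selectable IFOVs in Table 2 and the ~20,000 ft above-ground altitude flown in 2007–2008. It is a small-angle, nadir-only approximation (off-nadir pixels are larger), not the onboard georectification code.

```python
# Nadir ground-sample-distance check, using the small-angle relation
# GSD ~= altitude_AGL * IFOV. Nadir only; off-nadir pixels grow with scan angle.

def nadir_gsd(altitude_agl_m: float, ifov_mrad: float) -> float:
    """Approximate pixel footprint at nadir, in meters."""
    return altitude_agl_m * ifov_mrad * 1e-3

altitude_agl_m = 20_000 * 0.3048          # ~20,000 ft above ground level
for ifov_mrad in (1.25, 2.5):             # selectable IFOVs listed in Table 2
    print(f"IFOV {ifov_mrad} mrad -> GSD ~{nadir_gsd(altitude_agl_m, ifov_mrad):.1f} m")
# 2.5 mrad at ~6100 m AGL gives ~15 m, matching the pixel size quoted in the text.
```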
Onboard Autonomous Data Processing

The onboard data processing system was designed to complete the acquisition, preprocessing, information extraction, and output product generation from the raw spectral data collected by the AMS-Wildfire sensor system (Fig. 3). The processing chain is initiated based upon a single-step "acquisition request" from the data system operator. The remaining processing steps are autonomous, based on that single request for a data acquisition. This request includes, among other options, the band selection for the three-band visual product to deliver, the desired output resolution (optional), and the algorithm to apply to the data stream (fire detection, burn severity index, etc.). Based on the request received, the processing system selects the necessary AMS-Wildfire spectral channels from the continuous full-resolution sensor data stream, performs the conversion to temperature/radiance as appropriate, applies the requested algorithm, extracts the resulting information in vector form if appropriate, and creates a georectified visual raster product.
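A minimal sketch of this request-driven chain is given below. It is illustrative only: the dataclass fields, function names, channel indices, and the stand-in calibration and threshold steps are hypothetical, and the real flight software produces georeferenced GeoTIFF and shapefile products rather than NumPy arrays.

```python
from dataclasses import dataclass
import numpy as np

# Illustrative sketch (not the actual flight software) of the single-request-driven
# chain described above: one acquisition request fixes the band selection, output
# resolution, and algorithm; the remaining steps then run autonomously per frame.

@dataclass
class AcquisitionRequest:
    visual_bands: tuple = (5, 7, 12)      # three-band visual product (hypothetical choice)
    output_resolution_m: float = 15.0     # optional resampling resolution
    algorithm: str = "fire_hotspot"       # or "nbr", etc.

def process_frame(raw_counts: np.ndarray, request: AcquisitionRequest) -> dict:
    """Run one 1200-line frame of raw scanner counts through the chain."""
    selected = raw_counts[list(request.visual_bands), :, :]   # 1. band selection
    calibrated = selected.astype(float) * 0.01                # 2. stand-in count->radiance/temperature
    if request.algorithm == "fire_hotspot":
        mask = calibrated[0] > 3.6                            # 3. stand-in threshold test
    else:
        mask = np.zeros(calibrated.shape[1:], dtype=bool)
    # 4. vector extraction and 5. georectification would follow; the real system
    #    writes an ESRI shapefile and a georeferenced GeoTIFF for downlink.
    return {"raster": calibrated, "detection_mask": mask}

frame = np.random.randint(0, 1024, size=(16, 1200, 716))      # 16 channels x 1200 lines x 716 pixels
products = process_frame(frame, AcquisitionRequest())
print(products["detection_mask"].sum(), "pixels flagged in this synthetic frame")
```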
Figure 3. Schematic of the on-board autonomous processing architecture. The Applanix Position and Orientation System for Airborne Vehicles (POS/AV) data and the AMS data are sent through an analog-to-digital (A to D) converter and then to the sensor controller, then via the user datagram protocol (UDP) to the Link Module computer for on-board algorithm and geo-processing of data. Operations processing command exchanges between the on-board Link Module and the ground station are sent via a 9.6 kbs ppp (point-to-point protocol) data link. The processed data products are sent to the same ground station via a wide-band Ku frequency satellite communications link, and are then available on the ground for ingestion into mapping or web mapping services.
The vector and raster products are produced on the link-module computer processor and transmitted via the Ikhana telemetry antenna through the commercial Ku-band SatCom system to the ground station. The steps of the autonomous processing are detailed in the following subsections.

Preprocessing

Eighteen channels (16 spectral and 2 metadata channels) of the AMS-Wildfire sensor are received continuously on the link-module. The selected raw digital data counts are converted to apparent radiance for visible and near-infrared wavelength channels, and brightness temperature for the thermal channels. The most recent sensor calibration information (from the NASA Ames sensor spectral calibration facility) is employed to derive apparent radiance, in addition to solar elevation angle and relative solar/sensor azimuth information, determined from knowledge of the current time and position of the sensor platform during acquisition. Two on-sensor black-body calibration reference source temperature readings provide a linear digital count-to-
radiance conversion, which is then used in an approximate inverse Planck's equation to produce a brightness temperature for each pixel in the thermal channels. This onboard preprocessing calibration step allows data to be spectrally and thermally consistent from mission to mission.

Extracting Information—Autonomous Algorithm Processing

Fire hot-spot detection algorithm. The onboard processing system for the AMS sensor was designed to support a variety of data manipulation algorithms to utilize the multispectral range (16 channels) of the sensor. For fire hot-spot detection, a multichannel temperature threshold algorithm, based on that developed by the Canadian Center for Remote Sensing (CCRS), was implemented (Li et al., 2000a, 2000b; Flasse and Ceccato, 1996; Cahoon et al., 1992). The CCRS algorithm is similar to other fire-detection algorithms used on various satellite-derived thermal data, including the Moderate-Resolution
Imaging Spectroradiometer (MODIS) fire-detection algorithm. The CCRS algorithm was originally developed for use with satellite (AVHRR) imagery (Li et al., 2000b), but it has been adapted for use on various airborne sensor systems, including the AMS-Wildfire sensor. The fire hot-spot detection algorithm uses the 3.6 μm channel of the AMS-Wildfire sensor to define fire via a temperature threshold, and two or more additional channels to further refine this classification. Multichannel thresholds take advantage of particular fire and nonfire target characteristics to remove fire commission errors encountered when using a single midwave thermal-infrared channel–derived temperature value alone. For example, restricting the reflectance value of a selected near-infrared channel helps eliminate sun glint, a common cause of fire commission error. The threshold values used in the algorithm (AMS channels 11 and 12 and, for daytime missions, channel 7; see Table 2) are parameters that can be variably set by the operator during a mission. The current fire hot-spot detection algorithm is calculated as:

If Band 11 (3.60–3.79 µm) > Band 11 minimum temperature (e.g., 360 K),
and Band 12 (10.26–11.26 µm) > Band 12 minimum temperature (e.g., 290 K),
and Band 11 – Band 12 > difference minimum (e.g., 14 K),
and (if available) Band 7 (0.76–0.90 µm) < reflectance maximum (e.g., 0.4) (to screen high-reflectance commission errors),
then the pixel is classified as a fire hotspot.

A vector data set outlining the boundaries of adjacent "fire hotspot–detected" pixels is provided as an alternative or additional data product. This vector file, in ESRI shapefile format, is processed through the same telemetry system and delivered along with the raster product for display in geographic information system (GIS) packages or other Web-map services. An example of the hot-spot detection vector file data is shown in Figure 4.

Normalized burn ratio (NBR). A normalized burn ratio (NBR) index algorithm option was also implemented for use as a postfire vegetation rehabilitation assessment tool. The NBR utilizes two spectral channels from the AMS-Wildfire sensor: band 7 (0.76–0.90 µm) and band 10 (2.08–2.35 µm). These two bands provide the best contrast between photosynthetically healthy and burned vegetation (Howard et al., 2002). The normalized burn ratio index is calculated as:

NBR = (Band 7 – Band 10) / (Band 7 + Band 10),

where band numbers refer to the AMS-Wildfire channels (see Table 2 for band characteristics). The NBR is usually determined for both prefire (satellite-provided) and postfire scenes, and it is useful for accomplishing postfire assessments of burn severity conditions to assist in remediation activities.
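The following sketch transcribes the threshold test and the NBR ratio above into Python. It assumes the inputs have already been converted onboard to brightness temperature (kelvin) for bands 11 and 12 and to reflectance for bands 7 and 10; the default thresholds are the example values quoted in the text and, as noted, are operator-adjustable.

```python
import numpy as np

# Direct transcription of the multichannel threshold test and NBR ratio above.
# Inputs are per-pixel arrays already converted onboard to brightness
# temperature (K) for bands 11/12 and to reflectance for bands 7/10.

def fire_hotspot_mask(b11_K, b12_K, b7_refl=None,
                      t11_min=360.0, t12_min=290.0, diff_min=14.0, refl_max=0.4):
    """CCRS-style multichannel fire test; thresholds are the example values
    from the text and can be adjusted by the operator during a mission."""
    mask = (b11_K > t11_min) & (b12_K > t12_min) & ((b11_K - b12_K) > diff_min)
    if b7_refl is not None:                 # daytime: screen sun-glint commission errors
        mask &= (b7_refl < refl_max)
    return mask

def normalized_burn_ratio(b7_refl, b10_refl):
    """NBR = (band 7 - band 10) / (band 7 + band 10); dNBR = prefire NBR - postfire NBR."""
    return (b7_refl - b10_refl) / (b7_refl + b10_refl)

# Tiny synthetic example: one hot pixel among background.
b11 = np.array([[310.0, 420.0], [300.0, 295.0]])
b12 = np.array([[295.0, 320.0], [292.0, 290.0]])
b7  = np.array([[0.25, 0.10], [0.35, 0.30]])
print(fire_hotspot_mask(b11, b12, b7))      # only the 420 K pixel passes all tests
```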
Figure 4. AMS-Wildfire scanner data collected 19 July 2008 over the Canyon Complex fire, northern California. The fire hot-spot detection algorithm data (polygons in yellow) overlain on Google Earth terrain indicate that the fire extends beyond the most recently defined fire perimeters (dark area polygons). The blue line running NE to SW is the real-time flight track of the Ikhana UAS during the data collection effort. This scene covers an approximate area of 11 × 11 km and is centered at approximately 39°44′50″ N, 121°12′12″ W.
Since the AMS is calibrated, the AMS-acquired postfire NBR values can theoretically be compared to preburn satellite data (with the same spectral channels, such as Landsat Thematic Mapper) to derive a differenced normalized burn ratio (dNBR), where the difference is determined by subtraction (pre-NBR – post-NBR). Currently, any number of arithmetic algorithms requiring up to five channels of the AMS-Wildfire data set can be easily added to the system to provide autonomous level II product generation for distribution through the telemetry link to the user community on the ground. The system is dynamic and can accommodate additional processes as warranted.

Georectification

Inertial measurement unit (IMU). The fully automated georectification processing utilizes metadata from an Applanix Position and Orientation System for Airborne Vehicles (POS AV) model 310 system. The POS AV-310 integrates precision global positioning satellite (GPS) data with inertial technology to provide real-time and postprocessed (POSPac) measurements of the position, roll, pitch, and heading of airborne sensors. The Applanix POS AV-310 system, with the inertial measurement unit (IMU) mounted on the sensor scan head, combines differential GPS output (via the Omnistar wide-area differential GPS service) with the IMU data to provide accurate platform position/attitude information at the time a scan line is collected. The POS AV-310 roll, pitch, heading, altitude,
Integrating sensor data and geospatial tools for disaster management and platform position are recorded with each scan line from the full 100 Hz Applanix output and embedded in the sensor data stream as they are collected during a mission. Photogrammetric projective transformation equations are used to determine the position of each pixel in the scan line as projected to the ground, with “ground” being determined by the onboard digital elevation model (DEM) data for the area being overflown. The AMS data pixels are resampled to a requestor-specified map resolution and requested band order (for visual products) to produce a particular output product. For efficiency, data are processed in “frames” of acquired flight-line transects (1200 lines of scanner data), which, as GeoTIFF formatted files, combine to produce fire mosaics when displayed in various GIS. An example of a real-time developed and processed AMS-Wildfire image frame mosaic for the 2008 Basin (California) wildfire is shown in Figure 5. Digital elevation model data (DEM). A digital elevation model (DEM) is served on the onboard processor and consists of a composite data set of 1 arc-s Shuttle Radar Topographic Mission (SRTM) elevation “tiles,” which are turned into a mosaic in real time as needed, creating a seamless DEM for the entire western United States, where the majority of the missions are flown. The SRTM data are at 30 m postings (spatial resolution) (U.S. Geological Survey, 2008). The SRTM DEM data are used to define the geospatial context (latitude, longitude, elevation) reference for georectification of the sensor line-scanner data. Each of the AMS-Wildfire data pixels are georeferenced based upon the relationship between the location and attitude of the sensor platform (which defines the pointing vector of the line-scanner pixel at acquisition time) and the latitude, longitude, and elevation of the terrain (from the SRTM data). The geometric accuracy of the real-time georeferenced final products has not yet been rigorously assessed. The position and attitude data when used with only the “forward” solution are inherently less accurate than when postprocessed. In eight flight lines of repetitive overflights from a 2006 wildfire mission data set, a single overflight in 2007 and 2008, and two calibration flights, the real-time (postprocessed) georeferencing accuracies had an average root-mean-square (RMS) error of 2.5 pixels (R. Dominguez, 2007, personal commun.). These errors include those related to any fixed misalignment between the sensor and the Applanix navigation system. The authors found similar georeferencing positional RMS errors on onboard real-time UAS AMS-Wildfire–acquired data sets collected in 2007 as well. Improvements to these results can be made. The relative alignments of the sensor/IMU and GPS antennae must be reassessed with every new system reinstallation, since very small orientation discrepancies will reduce the precision of the georeferencing. Improvements may also be achieved with an increase in the DEM resolution. The 30 m posting SRTM DEM product used for this study is coarser than desirable for georectification with our higher-spatial-resolution AMS-Wildfire instrument data. The onboard product generation, algorithm processes, and georectification processes were developed with automation and
Figure 5. AMS-Wildfire sensor real-time processed image frame mosaic for the Basin Fire, Big Sur, California, collected on 8 July 2008. Data were processed and mosaicked in "real time" from 5 flight lines of frame data. The fire hot-spot detection algorithm shape-file data are shown draped on the AMS-Wildfire sensor 3-channel composite image. This scene covers an approximate area of 8122 sq. km (56 × 56 sq. mi.) and is centered at approximately 36°13′40″ N, 121°30′00″ W, immediately south of Monterey, California.
The onboard product generation, algorithm processes, and georectification processes were developed with automation and near-real-time delivery of information as critical objectives. The complete image processing time onboard the UAS takes ~10 s per image-file frame (1200 lines of AMS-Wildfire spectral data). With the additional data transmission time (via satellite telemetry) and ground-based quality control assessment, the total process time (to final delivery to a server for Internet distribution) still falls well within the 15 min defined as a metric for near-real-time data delivery. Automation can be extended by increasing the system's usable knowledge base (i.e., day/night awareness to autonomously employ the correct detection algorithm, spatial concept of a fire, etc.) to further remove the sensor data engineer from the process. The resultant geospatially registered imagery is transmitted from the aircraft through a satellite communications system down to the ground control station and is then made available to the community for visualization in a multitude of GIS systems or data visualization software. Those capabilities are defined in the following sections.

Real-Time Data Distribution and Geospatial Tools

The level II georectified data sets and imagery are sent from the onboard link module to the Ku-band telemetry satellite communications system (described in the "Aircraft and Data Telemetry on UAS" section). The GeoTIFF files have moderate file sizes (1–3 Mb per frame), allowing for minimal transmittance time
through the telemetry link to the ground control station, where they are then sent to servers at NASA Ames for redistribution through the Internet. The geospatial processing services for real-time AMS-Wildfire–derived data products were implemented utilizing open standards promulgated primarily by the Open Geospatial Consortium (OGC). The services utilized included: (1) sensor planning service (SPS), (2) Web notification service (WNS), (3) Web map service (WMS), (4) Web coverage service (WCS), and (5) Web feature service (WFS). Operationally, an incident responder (fire management team) would register an image acquisition request with the SPS and request a notification of acquisition once it has occurred, via a range of Web notification services. WNS notification mechanisms include e-mail and instant messaging (IM). This notification service allows interested requestors to remain current on new acquisitions and to be notified when those data elements are available for viewing or download. The common alerting protocol (CAP), a format codeveloped by the Federal Emergency Management Agency (FEMA) and the State of California Governor's Office of Emergency Services (CA-OES), describes the message content. Items included in the CAP message are network pointers to the various access mechanisms of the requested image. There are five types of pointers to the image data: (1) a pointer to the original spectral data available via anonymous file transfer protocol (FTP); (2) a pointer to the data via an OGC-compliant WMS, used by GIS clients (such as ESRI ARC users); (3) a pointer to the data via an OGC-compliant WCS, used primarily by other processing services, including fire and smoke modeling teams; (4) a pointer to a Keyhole markup language (kml) file, used primarily by Google Earth clients; and (5) a pointer to a thumbnail-sized version of the file for quick-look viewing of the data. Web services utilized include an SPS where tasking requests can be submitted, a sensor alert service (SAS) to alert the SPS of new data, a Web map service, and a Web coverage service, both of which are automatically updated. After the Web services are updated, an SAS alert is sent to the requestor in the format requested (e-mail or IM). The requestor then can retrieve the original data directly for ingest into a desktop GIS, access the WMS or WCS via a desktop GIS, or visualize it using any standard Web browser or Google Earth. The project team developed a visualization capability utilizing Google Earth as a "front-end" for real-time image viewing of the kml files as they were automatically generated. A Google Earth tool, the Collaborative Decision Environment (CDE), was developed to simplify three-dimensional visualization of the wildfire imagery and to allow the additional visualization of pertinent fire information data layers such as fire weather, etc. (Fig. 6). Incident command teams were provided linkages to the
CDE (served at NASA Ames Research Center) through a network link to the data "mash-up" service. The CDE was used extensively, as was access to the various WMS data-formatted holdings. The team also developed an enhanced browser-based viewer that allowed fire images and flight tracks to be displayed on a user-selectable base-map layer, for users who preferred non-kml-formatted data viewers. Therefore, in addition to the CDE kml-formatted data, the AMS imagery and shape files were available as layers in road, satellite, and hybrid maps from Yahoo, Google, OpenStreetMap, and Microsoft's "Bing" base maps (Fig. 7).

CAPABILITY DEMONSTRATIONS, 2006–2008

During the Western States Fire Missions (WSFM) in 2006–2008, the aforementioned tools were matured and integrated together to develop real-time (under 15 min) geospatial wildfire imagery and information for use by incident command teams. Next, the 2006–2008 Western States Fire Missions are briefly summarized to provide context for employment of the integrated tools during operational missions.

2006 Mission Series

In the fall of 2006, the Western States Fire Mission Series was initiated following receipt of the Federal Aviation Administration (FAA) Certificate of Authorization (COA) for limited flight opportunities in the National Airspace System (NAS). On 24 October, the Ikhana UAS predecessor (Altair), with the AMS-Wildfire instrument onboard, was allowed to fly into the National Airspace System over a small series of controlled burns on the eastern flanks of the Sierra Nevada. The mission demonstrated long-duration flight profiles and was the first NAS operation for the project. The Altair UAS subsequently supported emergency fire imaging on 28 October over the Esperanza Fire in Southern California, providing real-time information on fire location and progression to the on-site incident command team. The emergency COA provided unprecedented access for the UAS to the highly populated Los Angeles Basin NAS to support fire data-collection activities. The Esperanza Fire had the California governor's "State Emergency Declaration" status, allowing the FAA to respond quickly to modify the area of operations and flight conditions for the UAS. During that 16 h mission, the AMS-Wildfire scanner system provided multiple image data sets of the fire progression. Following the Esperanza Fire emergency support mission, the 2006 mission series ended. In 2006, a total of ~40 h of operations were flown, with AMS-Wildfire scanner data collected over two burn complexes.

Western States Fire Mission, 2007

The 2007 mission series comprised the first flights of the new NASA Ikhana UAS, which was delivered in January 2007.
Figure 6. Components of the Collaborative Decision Environment (CDE). The visualization element of the CDE employs Google Earth. The critical fire data elements (left side) that compose the additional visualization components are a “mash-up” of data from various web-served data locations, including those from the National Weather Service (NWS), Naval Research Lab (NRL), Massachusetts Institute of Technology (MIT), U.S. Forest Service (USFS), University of Maryland (UMD), the National Interagency Fire Center (NIFC), the U.S. Geological Survey (USGS), the Federal Aviation Administration (FAA), and others. The CDE also allows integration of instant messaging (IM) and provision of streaming video data from the acquiring UAS platform, in addition to the 3-D visualization of the AMS-Wildfire sensor–acquired data.
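A Google Earth front end such as the CDE typically pulls refreshed products through a KML NetworkLink, as described above. The snippet below writes a minimal example of such a link; the href is a placeholder rather than the actual NASA service, and the element names follow the standard KML 2.2 schema.

```python
# Minimal KML NetworkLink of the kind a Google Earth-based viewer such as the CDE
# can load to pull periodically refreshed fire products. The href is a placeholder,
# not the actual NASA service; element names follow the standard KML 2.2 schema.

NETWORK_LINK = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <NetworkLink>
    <name>AMS-Wildfire hot-spot products (example)</name>
    <Link>
      <href>https://example.org/ams/latest_hotspots.kml</href>
      <refreshMode>onInterval</refreshMode>
      <refreshInterval>300</refreshInterval>  <!-- re-check every 5 minutes -->
    </Link>
  </NetworkLink>
</kml>
"""

with open("ams_hotspots_networklink.kml", "w", encoding="utf-8") as f:
    f.write(NETWORK_LINK)
```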
The 2007 Western States Fire Mission series was initiated in August, following FAA approval of the Ikhana COA. In total, eight fire data-collection missions occurred during the 2007 fire season. The first four missions demonstrated long-duration and long-range capabilities, with data collection over fires located in eight western states. Mission operations ranged from ~10 to 22 h in duration, with 2593–5926 km (1400–3200 nautical miles) mission ranges. During those first four flights, 27 fires were overflown and imaged, with real-time geospatial fire data relayed to incident command centers (ICCs). To assist in information integration, Wildfire Research and Applications Partnership (WRAP) team members were embedded at various ICCs.

Southern California Firestorm Support Missions, October 2007

In late October 2007, a series of Santa Ana wind–driven fires erupted in the Los Angeles and San Diego regions of Southern California. Over 11 major fires were burning in the
Southern California area. The NASA Ikhana AMS sensor supported the fire management teams on those wildfires with provision of real-time hot-spot detection and postfire assessment imagery. On 23 October, the team requested and received an emergency COA modification from the FAA to facilitate operations in the affected regions. On 24 October, the Ikhana UAS with the AMS-Wildfire sensor flew the first of four missions over 11 major wildfire complexes in the region. Flight endurance each day was between 7 and 9 h with ~2500 km (1350 nm) mission ranges. Many of the fires were imaged twice a day to provide up-to-date fire progression information. Team members were again embedded in various ICCs and county-level Emergency Operations Centers (EOCs). Fire information from the AMS-Wildfire was delivered to the ICCs and EOCs as well as to national operations centers, including NIFC and the Department of Homeland Security (DHS). A summary of the 2007 missions is shown in Table 3.
Figure 7. Data collection and distribution of sensor data, illustrating the web services employed in the generation and distribution of the AMS sensor data sets. The AMS-derived products, generated onboard the Ikhana UAS, transmitted to the Payload Ground Control System (GCS), and then to the NASA FTP (file transfer protocol) server, are processed into a variety of product formats including Keyhole Markup Language (KML) Google Earth–enabled formats, Web Map Service (WMS) format, and a common alerting protocol (CAP) notification. Data investigators can also access AMS-derived product data (fire hot-spot shape files) from the NASA FTP server (or a mirrored site at the U.S. Forest Service) using geographic information system (GIS) software, such as ESRI's ArcGIS package.
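For clients that prefer the OGC map services over KML, a GetMap request of the kind Figure 7 alludes to can be assembled as below. The endpoint, layer name, and bounding box are placeholders for illustration; the query parameters are the standard WMS 1.1.1 GetMap parameters.

```python
from urllib.parse import urlencode

# Illustrative construction of an OGC WMS GetMap request for a fire-imagery layer.
# The endpoint and layer name are placeholders, not the actual NASA servers; the
# parameters are the standard WMS 1.1.1 GetMap parameters a GIS client would send.

WMS_ENDPOINT = "https://example.org/wms"      # hypothetical server
params = {
    "SERVICE": "WMS",
    "VERSION": "1.1.1",
    "REQUEST": "GetMap",
    "LAYERS": "ams_fire_hotspots",            # hypothetical layer name
    "STYLES": "",
    "SRS": "EPSG:4326",
    "BBOX": "-121.35,39.65,-121.05,39.85",    # lon/lat box near the Canyon Complex fire
    "WIDTH": 1024,
    "HEIGHT": 768,
    "FORMAT": "image/png",
    "TRANSPARENT": "TRUE",
}
print(f"{WMS_ENDPOINT}?{urlencode(params)}")
```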
TABLE 3. 2007 WESTERN STATES FIRE MISSION SUMMARY
Flight date   Duration (h)   Fires flown   Mileage (km)
16 Aug.       10             4             2593 (1400 nm)
29 Aug.       16.1           7             4630 (2500 nm)
7 Sept.       20             12            5926 (3200 nm)
27 Sept.      9.9            4             3333 (1800 nm)
24 Oct.       9              11            2500 (1350 nm)
25 Oct.       8.7            11            2500 (1350 nm)
26 Oct.       7.8            11            2500 (1350 nm)
28 Oct.       7.1            11            2500 (1350 nm)
Western States Fire Mission, 2008

In late June 2008, lightning ignited hundreds of fires in Northern California. When the California governor declared a State of Emergency, the WRAP team was requested to support state and federal firefighting efforts. The WRAP team requested an emergency COA amendment from the FAA to allow flight operations over a vast region of the state, which was granted on 25 June. Since the Ikhana and the AMS-Wildfire sensor had been undergoing routine maintenance and preparation for a much later seasonal mission, the first flights over Northern California did not occur until 8 July.
TABLE 4. 2008 WESTERN STATES FIRE MISSION SUMMARY
Flight date   Duration (h)   Fires flown   Mileage (km)
8 July        9.5            9             2593 (1400 nm)
19 July       5.0            4             1852 (1000 nm)
17 Sept.      3              1             1482 (800 nm)
19 Sept.      3.5            2             1482 (800 nm)
Subsequently, four total missions were flown in 2008 through the conclusion of the fire season (Table 4). Each of the mission data-collection efforts focused on providing real-time fire information to the various ICCs as well as to the State Operations Center and the Multi-Agency Coordination Center, where data were integrated into the wildfire management decision process. The 2008 missions showcased new imaging and software enhancements, including the delivery of postfire NBR data sets to incident teams. Following the Northern California firestorms, the remainder of the western U.S. experienced a "light" fire season, which allowed the WRAP project team to "stand down" capabilities for the majority of August, September, and October. In October 2008, the WRAP team discontinued demonstration and development missions, due to the official "close" of the western U.S. fire season.
FINDINGS AND INTEGRATION ISSUES

The demonstration and emergency support missions flown by the NASA-USFS research team demonstrated autonomous geospatial data collection, processing, and delivery within 15 min of acquisition by the AMS sensor on the NASA UAS (Ikhana) platform. This time period was well within the metrics established at the initiation of our project flights in 2006 (1 h delivery metric). The information and data product delivery times were also an order of magnitude faster than those of methods previously employed for fire detection and mapping. Most of the major integration issues with adaptation of these capabilities for operational utility are related to component costs. UAS platforms, such as the NASA Ikhana, are expensive compared to manned aircraft platforms, but in some cases they operate at much greater ranges, altitudes, and durations, and with improved flight characteristics. One of the major UAS operational integration issues is related to the current regulatory framework for operations of UAS in the NAS and the restrictions imposed on those operations. Routine UAS flights in the NAS are probably over 5 yr away from reality. The national disaster management communities may see some relief from those restrictions, now that it has been shown that UAS capabilities are essential for supporting observation and monitoring strategies over rapidly evolving events, such as wildfires. The sensor system described in this chapter (AMS-Wildfire) is a "one-of-a-kind" scientific instrument development. All the components of the instrument are commercial-off-the-shelf technologies, and there is interest in private industry (the sensor development industry) in developing duplicate and "next-generation" sensor systems. The components that compose the satellite telemetry system are ubiquitous and are being integrated on various manned and unmanned platforms, although their costs are still high. The autonomous real-time georectification processing is modeled after various photogrammetric processes used to postmission-correct aerial photography, and it is being evolved further as a result of a Small Business Innovative Research grant to Terra-Mar, Inc., to develop fast emergency response georectification procedures. During our research, the project team matured and evolved those processes to operate autonomously on data sets collected from a remotely operated sensor system. The real-time fire hot-spot detection and burn assessment algorithms are based on calculations developed for use with various satellite imaging systems, including those developed for the Advanced Very High-Resolution Radiometer and the Moderate-Resolution Imaging Spectroradiometer. We have shown that those same algorithms can be operated in the onboard processing chain, allowing for level II data products to be produced in near real time. This process is more time efficient than delivering sensor level I data to the ground and then manually applying those algorithms to the data to produce level II products. The capacity for providing custom-developed data products direct from an acquiring sensor is limitless. This will have a significant impact on time savings and data utility, especially for the disaster management community, who rely on rapid provision of decision-making information.
The "sensor webs" developed and enhanced during the period of these missions have demonstrated the utility of designing Web services with open standard interfaces. The Google Earth visualization tool, the messaging and communication environment, and the real-time video broadcasting all were developed around open standards and allowed access to users of all data-serving platforms, including personal digital assistant devices.

CONCLUDING REMARKS

We have demonstrated that various platform, sensor, communications, and geospatial technologies can be integrated to provide near-real-time intelligence in support of disaster management entities. In our work with the U.S. wildfire management agencies, we designed a functional system for meeting disaster response metrics of "geospatial data delivery in under an hour from the collection event." We have shown that geospatial data can be provided within a 5–15 min time period (from collection), which represents a significant advance in current capabilities. Large-capacity, long-duration, high-altitude UAS platforms can play a significant role in providing repetitive, lingering capabilities over disaster events, especially dynamic, evolving events like wildfires. The OTH satellite data and voice communications telemetry systems on these platforms can be employed to command and control acquisition of the imaging payload as well as to provide sensor data to ground team members through the same telemetry linkages. This allows rapidly updateable information to be in the hands of incident management teams when needed. Imaging sensor systems can be designed to collect critical spectral and thermal wavelength channel information specifically "tuned" to the phenomenon that is being observed. The use of multispectral data in analysis of wildfire behavior characteristics is critical to ascertain fire behavior, location, and movement. The spectral channels defined in this chapter are essential for wildfire observations. Multichannel capabilities offer clear advantages over single-channel fire detection systems, as we have shown in this chapter. Image-processing capabilities to derive level II data from acquiring sensor systems can be automated and included as part of the payload processing package on an airborne (or satellite) sensing platform, such as the UAS described in this chapter. Complex algorithms can be integrated into the processing scheme to further reduce labor- and time-consuming postmission analysis tasks. By integrating sensor and/or platform IMU and positioning information with terrain DEM data, a fully georectified image product can be developed autonomously onboard an aircraft, further reducing the critical labor and time requirements for delivery of accurate geospatial data. Web-enabled GIS tools and systems such as Google Earth or ESRI products provide a user-friendly "platform" for display of georectified imagery and information. Our goals were to ensure that the information products developed autonomously from the UAS-acquiring sensor would integrate seamlessly into a
multitude of geospatial visualization packages. We achieved that objective by providing autonomously generated data products in Open Geospatial Consortium (OGC) standard formats. Following 3 yr of system development and emergency support missions in the western United States, we have demonstrated that current off-the-shelf technologies can be integrated to provide the disaster management community with the data and "intelligence" that they require in real time. We anticipate that the civilian use of UAS will increase dramatically, especially in support of disaster management and disaster relief efforts. The processes and technologies described here for the use of UAS platforms and enabling sensors and technologies should form the foundation for designing future disaster monitoring and observation capabilities. These integrated technologies have obvious cross-disciplinary application to other disaster events in the United States and the world.

ACKNOWLEDGMENTS

We acknowledge the support of the National Aeronautics and Space Administration (NASA) through a grant (REASoN-0109-0172) awarded to support this work. We are also grateful for the support of S. Wegener (Bay Area Environmental Research Institute, BAERI), B. Lobitz (California State University–Monterey Bay), F. Enomoto (NASA), S. Johan (NASA), S. Schoenung (BAERI), T. Zajkowski (U.S. Forest Service–Remote Sensing Applications Center, USFS-RSAC), E. Hinkley (USFS-RSAC), S. Ambrose (NASA), T. Fryberger (NASA), T. Rigney (NASA), B. Cobleigh (NASA), G. Buoni
(NASA), J. Myers (University of California–Santa Cruz, UCSC), T. Hildum (UCSC), M. Cooper (General Atomics–Aeronautical Systems Inc.), and J. Brass (NASA). We would also like to acknowledge the wildfire management community members who engaged us in defining observation criteria and metrics that allowed us to help improve their wildfire/disaster mitigation capabilities.

REFERENCES CITED

Cahoon, D.R., Jr., Stocks, B.J., Levine, J.S., Cofer, W.R., III, and Chung, C.C., 1992, Evaluation of a technique for satellite-derived area estimation of forest fires: Journal of Geophysical Research, v. 97, p. 3805–3814.
Flasse, S.P., and Ceccato, P.S., 1996, A contextual algorithm for AVHRR fire detection: International Journal of Remote Sensing, v. 17, p. 419–424, doi:10.1080/01431169608949018.
Howard, S.M., Ohlen, D.O., McKinley, R.A., Zhu, Z., and Kitchen, J., 2002, Historical fire severity mapping from Landsat data: Pecora 15/Land Satellite Information IV/ISPRS (International Society of Photogrammetry and Remote Sensing) Commission I/FIEOS (Future Intelligent Earth Observation Satellites) 2002 Conference Proceedings: Bethesda, Maryland, American Society of Photogrammetry and Remote Sensing (CD-ROM).
Li, Z., Nadon, S., Cihlar, J., and Stocks, B., 2000a, Satellite mapping of Canadian boreal forest fires: Evaluation and comparison of algorithms: International Journal of Remote Sensing, v. 21, p. 3071–3082, doi:10.1080/01431160050144965.
Li, Z., Nadon, S., and Cihlar, J., 2000b, Satellite detection of Canadian boreal forest fires: Development and application of an algorithm: International Journal of Remote Sensing, v. 21, p. 3057–3069, doi:10.1080/01431160050144956.
U.S. Geological Survey, 2008, Shuttle Radar Topographic Mission: http://srtm.usgs.gov/index.php, site updated 23 June 2008 (accessed 1 June 2009).

MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
Ontological relations and spatial reasoning in earth science ontologies Hassan A. Babaie Department of Geosciences, Georgia State University, Atlanta, Georgia 30302-4105, USA
ABSTRACT Several types of fundamental ontological relations connect the endurant (continuant) and perdurant (occurrent) entities in every domain. These include: instantiation, parthood, location, and connection relations, and those that are derived from them, such as adjacency, overlap, containment, and coincidence. Some of these types of relations, and their subtypes, are formally defined in the context of the Web Ontology Language (OWL) for a variety of endurant geological examples, mostly from the Nankai Trough in southwest Japan and the San Andreas fault in California. Here, the foundational ontological relations are discussed to show their application in building useful earth science ontologies. These relations, defined as properties in OWL, are given in the context of the Resource Description Framework (RDF) triples and their relationship to relational databases. The role of properties in providing semantics, reasoning, and knowledge structuring and representation is discussed for various ontological relations. The semantics of classes are provided by the metaproperty and restrictions of the properties that use these classes as domain and range. Types of properties are described from different perspectives and for different purposes. Property subclassing, through OWL’s subproperty construct, is used to restrict properties. The formal definitions of the foundational taxonomic (isA), partonomic (partOf), location (locatedIn), containment (containedIn, componentOf), and topologic (overlap, adjacentTo) relations, at the class and instance levels, are given in first-order logic for continuant geological entities. Geologic examples for several other basic relations such as derivesFrom, transformationOf, and absorb are also given.
INTRODUCTION
Depiction of the structure and dynamics of reality, to enable spatial and temporal reasoning about simple and composite components of Earth's natural systems, requires construction of ontologies that are based on formalized representation of the static and dynamic entities and their complex relationships. Formal domain ontologies (Kashyap et al., 2008; Donnelly et al., 2005; Smith, 2003, 2004) represent reality by depicting the taxonomic, partonomic, and other types of hierarchical structure of objects and events in the real world (Lambe, 2007). Application of ontologies to support information management, knowledge discovery, and spatial and temporal reasoning has been increasingly appreciated by earth scientists in recent years, as is apparent from
a significant increase in the number of projects devoted to building top-level and domain ontologies (e.g., Raskin and Pan, 2005; Sinha, 2006). Construction of useful ontologies requires a thorough understanding of the formal meaning of the foundational ontological relations such as instantiation, parthood, location, and connection, and those that are derived from them, such as adjacency, overlap, containment, and coincidence (e.g., Randell et al., 1992; Cohn and Varzi, 2003; Donnelly, 2004b; Schulz et al., 2005, 2006). Despite the spatial and spatio-temporal nature of most components in Earth systems, and the fact that most natural objects are complex composite entities, the existing taxonomic structures in current earth science ontologies are mainly based on the primitive isA (i.e., subclass, instantiation, or subsumption) relation, and they underrepresent the other types of ontological relations. The hierarchies in these ontologies generally do not include the mereological part-whole (i.e., partitive) and topological (e.g., connection) relations that are needed to depict composite, spatially and spatio-temporally related entities. Moreover, there seems to be confusion in the earth science community about the difference between formal relations that hold at the universal level and those that exist among instances (individuals) in reality. In this paper, I focus on introducing the major fundamental relations (both Web Ontology Language [OWL] qualified names and user-defined) and the difference between those that hold between universal types and those that exist among instances in reality. The main objective of this paper is to introduce the formal ontological relations so that they can be used more consistently in designing better and more reliable earth science ontologies. Due to the scarcity of ontologies in the earth sciences, most of the material used in this paper is based on work in artificial intelligence, medical informatics, and analytical philosophy, where significant progress has been made in ontological engineering. The formal relations in this paper are given in first-order logic notation (Levesque and Lakemeyer, 2000; Brachman and Levesque, 2004), and the symbols used in this paper include: ∧ (and), ∨ (or), → (then), ¬ (not), ∃ (there exists), ⊆ (is a subclass of), ∩ (intersection of), and ∀ (for all, if any). All variables are given in Helvetica font, and those for types are given in uppercase letters, e.g., X, C, R. Variables representing particulars or instances of universal types are given in lowercase letters, e.g., x, c, r. Universal types and particulars are also given in Helvetica font, with the first letter of every word capitalized, e.g., Ocean, Rock, Fault, and the first letter of each word in compound names capitalized in camel case, e.g., IgneousRock, AccretionaryPrism. Relation names start with a lowercase letter in Helvetica font, and the first letter of each subsequent word in composite names is in uppercase, i.e., camel case (e.g., partOf, isA, connectedTo). Concepts, i.e., term definitions, are given in italic font, e.g., mineral, water. Although the geological examples given to elucidate the semantics of each type of relation are varied, they mostly relate to the spatial relations in the Nankai Trough accretionary prism in southwest Japan (Tobin and Kinoshita, 2006) and the San Andreas fault in California.
PROPERTIES AND SEMANTIC LANGUAGES

Semantic models (e.g., ontologies) have a graph-theoretical structure, consisting of nodes representing terms, and edges representing relations that link the nodes (Smith et al., 2005). The relata for the relations are classes (i.e., terms) that represent entities in reality. Ontology captures the formal relations and their universal relata, based on the real relationships and constraints between the instances in a specific domain of discourse, i.e., a discipline or field (e.g., Smith, 2003, 2004; Bains, 2006). This way, ontologies can be used to represent domain knowledge and theories, and support inquiries in the domain. Information models depict reality by using classes to represent the universals and instances of these classes to represent the individuals. Knowledge in any domain is a collection of numerous true statements (propositions) (Carrara and Sacchi, 2006). For example, the statement, “Rock is an aggregate of one or more minerals,” is a piece of petrology knowledge. The following are two other examples of tectonics and structural geology knowledge statements: “Temperature (T) increases with depth (z),” denoted by the geothermal gradient ∂T/∂z ≈ 30 °C/km, and “stress (σ) and strain (e) are linearly related (at low temperatures/pressures and shallow depth),” expressed by Hooke's law of elasticity, σ = Ee, where E is the proportionality constant. These knowledge statements are composed of terms (e.g., stress, strain in the last statement) that represent universal classes that evoke domain concepts, and the relationships among them (e.g., the “linearly related to” relation between stress and strain). Knowledge statements are explicitly asserted in ontologies in the form of Resource Description Framework (RDF) triples, using semantic web languages such as OWL (Web Ontology Language) and its underlying RDF and RDF schema (RDFS) languages (Breitman et al., 2007; Allemang and Hendler, 2008; Antoniou and van Harmelen, 2008; Kashyap et al., 2008). Every statement in RDF is like a value in a single cell of a database table, which requires three components for its complete representation (Allemang and Hendler, 2008): a row identifier (subject, s), a column identifier (predicate, P), and the value in the cell (object, o) (Fig. 1). The subject is the thing (individual) for which we make the statement, the predicate is the property (relation) for the subject, and the object is the value for the property. We refer to the subject-predicate-object statement as an RDF triple. Figure 1 shows the concept of the RDF triple and its relationship to the relational database table structure, and Figure 2 gives an example of conversion of a relational database table into RDF triples. A knowledge base built on a domain ontology is a large set of such RDF triples about the individuals in that domain. Because each row (i.e., record) in a relational database table has multiple columns (fields, attributes), several triples in an ontology often relate to one subject (Figs. 1B and 2B). In other words, a single subject in a knowledge base may relate to many objects through many predicates, and hence many RDF triples. For example, a specific, individual sedimentary bed (subject) has specific values (objects) for its age, orientation, and composition predicates.
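To make the preceding paragraph concrete, the following is a minimal Turtle (N3-style) sketch of such a set of triples for one individual bed. The namespace URI and the class, property, and individual names (struc:SedimentaryBed, struc:age, struc:orientation, struc:composition, struc:Bed017) are illustrative assumptions, not terms of any published ontology.

@prefix struc: <http://example.org/structuralgeology#> .   # hypothetical namespace

# One subject (an individual sedimentary bed) described by several triples,
# analogous to one row of a relational table with several columns.
struc:Bed017  a                  struc:SedimentaryBed ;    # "a" abbreviates rdf:type
              struc:age          "Pliocene" ;
              struc:orientation  "N45E, 30SE" ;
              struc:composition  "siltstone" .

Each predicate-object pair corresponds to one column-value pair of the row shown in Figure 1.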
Figure 1. The correspondence between the Resource Description Framework triple (subject-predicate-object) and a relational database table. (A) Each row (record) in a relational database table represents a subject (s). Each column (field or attribute) in the table is a predicate (property, Pi). The cell value at the intersection of a row and column is the object (Oi). (B) Each subject (S) corresponds with many objects (Oi) through many properties (Pi) (Allemang and Hendler, 2008).
In semantic web jargon, subjects are said to be instances of classes, predicates are properties (relations), and objects, which are either instances of other classes or are data types, provide values for the properties. There are two types of properties in OWL: datatype and object. A datatype property is a binary relation between a set of individuals (subjects) and a set of instances of a typed literal (e.g., XSD, XML Schema Definition data types) (Breitman et al., 2007). An object property is a binary relation between sets of individuals of two classes, i.e., the subject and object of an RDF triple are both individuals. Properties are restricted by their domain and range. Domain relates a property (predicate) to the subject class of a triple that uses the property. This way, the domain declaration imposes a restriction on the type of the subject instances that use the property (Antoniou and van Harmelen, 2008); the domain is the set of values for which the property is defined, i.e., it is the collection of types that use the property. For example, the domain for the composition property in the triple Mineral composition MineralGroup is the Mineral class. Range relates a property to a target (object) class or data type, and therefore it puts a restriction on the set of values a property can take. For example, the range for the composition property in the previous triple can be silicate, phosphate, sulfide, hydroxide, etc., which are all of type MineralGroup. Domain and range are constructs that give information on how a property may be used, and they should be declared with care. The direction of roles should be defined correctly when using the domain and range. Table 1 shows some statements in the structural geology domain (namespace prefix: struc), given in the N3 serialization format (Allemang and Hendler, 2008). In these statements, the Fold class has the foldDescription and tightness datatype properties of string type, and a foldAxis object property for which the range is of the Line type.

RELATIONS AND REASONING
Figure 2. A relational database table (A) converted into the Resource Description Framework triples (B) for the first row (record) of the table.
As conceptual units, relations constitute a major component of knowledge structuring, representation, and reasoning by providing semantics, i.e., meaning (Jouis, 2002; Lambe, 2007). Because semantic data are focused mainly on the relationship
TABLE 1. ABBREVIATED N3 SERIALIZATION OF THE STRUCTURAL GEOLOGY DOMAIN'S FOLD CLASS

N3 statements:
  struc:foldDescription rdf:type owl:DatatypeProperty.
  struc:foldDescription rdfs:domain struc:Fold.
  struc:foldDescription rdfs:range xsd:string.
  struc:foldAxis rdf:type owl:ObjectProperty.
  struc:foldAxis rdfs:domain struc:Fold.
  struc:foldAxis rdfs:range struc:Line.
  struc:tightness rdf:type owl:DatatypeProperty.
  struc:tightness rdfs:domain struc:Fold.
  struc:tightness rdfs:range xsd:string.

Resource Description Framework triples:
  Fold foldDescription "Harmonic fold in bedding"
  Fold tightness "open"
  Fold foldAxis Line1
between individuals, ontologies are property-oriented in contrast to object-oriented models (Segaran et al., 2009). Properties in ontology languages such as OWL are first-class citizens, meaning that, like classes, they can subsume subproperties (Antoniou and van Harmelen, 2008). Because properties are defined globally, they can be used anywhere by any class, i.e., unlike methods in an OO (object-oriented) language, a property in RDF does not belong to one class! This provides a great deal of flexibility in OWL and its sublanguages because there is no need to modify classes when we introduce new properties. Relations provide a framework to meaningfully connect categories of entities defined in classes that stand for our domain types. Properties stand for relations or constitute the arguments for relations. Classes are defined in ontologies based on properties or values of the properties; for example, we can define classes of minerals that are silicate, or classes of deformed rock that are mylonitic. Ontologies are often built in one domain and reused in another through importation. Reuse requires declaration of a namespace for each ontology. The XML namespace standard, used by RDF, provides a global scope to domain classes and properties, allowing reuse of existing types in new ones without naming conflicts. The XML standard allows different communities of scientists (e.g., oceanography and atmospheric science) to independently develop their own ontologies and markup languages. There is often a need to integrate these autonomously developed vocabularies into other applications, and this is where the namespace becomes very useful. It is common for two domain vocabularies to contain classes or properties that have the same name but that are structured in different ways, and thus provide meaning differently. If the two vocabularies are shared by an application, there will be a name conflict, which would lead to an error during processing. The namespace prevents this kind of name collision by assigning the similarly named terms, which belong to different domains, to different uniform resource identifiers (URIs) that reference these communities. Declaration of a namespace is done by the xmlns (XML namespace) attribute, which allows both an optional prefix and a URI to be chosen for the namespace. The prefix, which references the URI, qualifies each term within a vocabulary to a specific community, e.g., struc:Fold, ocean:Floor, which are qualified names for the structural geology and oceanography domains, respectively.
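A minimal sketch of such namespace declarations in Turtle follows, assuming hypothetical URIs for the two communities; the point is that identically named local terms remain distinct because each prefix expands to a different URI.

@prefix struc: <http://example.org/structuralgeology#> .   # hypothetical URI
@prefix ocean: <http://example.org/oceanography#> .        # hypothetical URI
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .

struc:Fold   a rdfs:Class .    # qualified name in the structural geology vocabulary
ocean:Floor  a rdfs:Class .    # qualified name in the oceanography vocabulary
# A term named "Floor" in the structural geology vocabulary, if one existed,
# would be struc:Floor and would not collide with ocean:Floor.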
The fact that rocks have textures and textures have textual descriptions can be stated as: Rock texture Texture and Texture textureType XSD:string, respectively. Let's assume that Rock and Texture are defined in the Petrology ontology (with the “petr” namespace prefix, which abbreviates a URIref), and then they are imported into another ontology. The imported Rock and Texture classes are referred to, in the ontology that is using the Petrology ontology, by the qualified names petr:Rock and petr:Texture, respectively. In practice, the petr prefix stands for a URI, which is a unique identifier for the Petrology ontology if it exists. Figure 3 shows the graphical and textual presentation of the texture property and its domain (Rock) and range (Texture) classes, and the textureType property and its domain (Texture) and range (XSD:string). Two instances (individuals) of the Rock and Texture classes are also shown. There are two general types of relation: static and dynamic (Jouis, 2002). While the static relations connect the spatial or static aspects of entities (i.e., no change in state), the dynamic relations deal with the temporal and spatio-temporal concepts (e.g., process, event) that involve change of state. Examples of the static relations are: IdahoBatholith locatedIn Idaho; Rock isA Solid. An example of a dynamic relation is: Faulting displace Rock. Relations are used to structure knowledge and provide semantics for reasoning through the following constructs: (1) hyponymy (Cruse, 2002), to specialize general spatial entities through the “isA” and “is a kind of” relations (e.g., NormalFault isA Fault); (2) troponymy (Fellbaum, 2002), to subclass verbs that represent the effects of processes (e.g., strain isA deform, slip isA move); and (3) meronymy (Pribbenow, 2002; Koslicki, 2008), to structure complex wholes by using the “part of” relation (e.g., Mineral partOf Rock). A statement such as “NormalFault isA Fault” or “strain isA deform” implies inclusion of meaning, that is, the meaning of the NormalFault type or the strain relation includes the meaning of the Fault type or deform relation, respectively. In other words, the members of a hyperonym-hyponym pair (i.e., a general-specialized pair, e.g., Fault and NormalFault, or strain and deform) are of the same semantic type. In a troponymy in which elongate and distort are subproperties of the strain relation, the subordinate (i.e., specialized) relations (elongate, distort) contain the meaning of the superordinate (general) relation (strain), but they add extra semantics to it, that is, whereas strain is change in something (volume, length, and
Figure 3. (A) Directed graph of two resource description framework (RDF) triples: Rock texture Texture and Texture textureType XSD:string. “andesite” and “porphyritic” are instances of the Rock and Texture classes. (B) The N3 serialization of the two RDF triples, defining the Rock class and the domain and range for the texture and textureType properties.
angle), the subordinate elongate or distort relation deals specifically with the change in length or shape, respectively (Fellbaum, 2002). The difference between the verbal relations elongate and distort is in the “manner” or “way” in which the processes that these relations represent occur. The elongate or distort relation, as in “Extension elongate Pebble” or “Shearing distort Fossil,” denotes the manner or way in which a pebble or fossil strains by elongation or change in angle, respectively.

SUBSUMPTION OF PROPERTIES AND CLASSES

Properties relate classes in the hierarchy. Importing and equating properties from different sources can be done with the rdfs:subPropertyOf construct. When an ontology is imported into another one, we may need to make two properties equivalent. For example, assume that two groups of geologists (e.g., geochemistry and tectonics groups) have knowledge bases in which one domain uses the property study and the other one uses investigate for the scientific work done by its geologists. Let's assume that the namespace prefixes for these two domains are geochem and tect, respectively. Assuming that the investigate and study verbal properties mean the same thing, we make them equivalent by letting each property be a subproperty of the other (Allemang and Hendler, 2008), i.e., geochem:study rdfs:subPropertyOf tect:investigate and tect:investigate rdfs:subPropertyOf geochem:study. Or, we can use the owl:equivalentProperty construct for this purpose. As another example, if the hydrogeology ontology uses the property transport for what the structural geology ontology calls move, we can use the rdfs:subPropertyOf as long as the domain and range of these various properties are of the same type. To state that all uses of the transport and move properties are the same, we assert: hydro:transport rdfs:subPropertyOf struc:move. If we have the triple x transport y in the hydrogeology domain (hydro namespace prefix), we can infer x move y in the structural geology domain (struc namespace prefix), as long as x and y are of related types (e.g., water and ion). For example, since transport is a subproperty of the move property, the explicit assertion hydro:Water hydro:transport geochem:Ion lets us infer that hydro:Water struc:move geochem:Ion.
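The study/investigate example can be written out as follows; this is a sketch that assumes hypothetical namespace URIs for the geochem and tect vocabularies.

@prefix geochem: <http://example.org/geochemistry#> .   # hypothetical URI
@prefix tect:    <http://example.org/tectonics#> .      # hypothetical URI
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .

# Declaring each property a subproperty of the other makes them interchangeable:
geochem:study     rdfs:subPropertyOf  tect:investigate .
tect:investigate  rdfs:subPropertyOf  geochem:study .

# Equivalently, with a single OWL assertion:
geochem:study     owl:equivalentProperty  tect:investigate .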
If class (type, set) C is a subclass of A and a subclass of B (i.e., C rdfs:subClassOf A, and C rdfs:subClassOf B), then C is in the intersection of sets A and B (i.e., C ⊆ A ∩ B). In this case, if individual x is in C, then x is also in both A and B. For example, pyroclastic rocks (e.g., tuff) have the properties of both volcanic and clastic sedimentary rocks. In a default namespace (where no prefix precedes the colon in the qualified name), this fact is asserted as :PyroclasticRock rdfs:subClassOf :VolcanicRock, and :PyroclasticRock rdfs:subClassOf :DepositionalRock. If we now assert that :HuckleberryTuff rdf:type :PyroclasticRock, we infer (i.e., reason through the inference rules) that the Huckleberry tuff (in Wyoming) is both depositional (a kind of sedimentary) and volcanic, i.e., we derive the following two inferred statements: :HuckleberryTuff rdf:type :VolcanicRock, and :HuckleberryTuff rdf:type :DepositionalRock. Notice that the inference is unidirectional, i.e., pyroclastic rock is both volcanic and depositional, but not every depositional (sedimentary) or volcanic rock is pyroclastic! Each class defines essential and accidental properties (Colomb, 2007). The essential properties are necessary for the individuals to have if they are members of the class. Individuals may or may not have the accidental properties. For example, one essential property of the silicate family of minerals is to have a composition made of silicon, oxygen, and certain metals. If a mineral does not have Si or O (essential properties), then it does not belong to the Silicate class. Thus, essential values for all instances of a class must necessarily be the same, and they must always be present. However, being glassy, blue, smoky, purple, or milky for quartz (a silicate) is accidental. Notice that what is essential for a subclass may be accidental for its superclass. For example, in the IgneousRock isA Rock taxonomy, formation from magma, which is essential for a member of the IgneousRock subclass, is accidental for a member of the Rock superclass. An essential whole is a complex individual with essential unifying properties (relating all the essential parts) that are required by the whole. For example, the Fold essential whole must have limb and hinge line as essential parts. If a property is both essential and sufficient to identify the class, it is called rigid. For example, physical, optical, and chemical properties of minerals are rigid. All instances of a given mineral (rigid class) have the same set of values drawn from its rigid properties. Properties represent the binary relations (predicates) between resources or individuals, referenced by the subjects and objects in the RDF triples (Hitzler et al., 2009). Properties are defined in RDF as instances of the rdf:Property class (e.g., solidify rdf:type rdf:Property). The meaning of the recrystallize property, relating metamorphism and rocks (Metamorphism recrystallize Rock), is the set of all individuals that are recrystallized. As sets, properties exhibit a similarity to classes rather than to individuals. Properties are subclassed in OWL by applying the rdfs:subPropertyOf construct (e.g., grainBoundaryMigrate rdfs:subPropertyOf recrystallize). The rdfs:subPropertyOf provides a mechanism to extend properties, from more general to more specific, i.e., to allow a hierarchy of properties. “P is said to be a subproperty of P′ if every instance of P is also an instance of P′” (Colomb, 2007), or, stated differently, “P is a subproperty of P′ if P′(x,y) whenever P(x,y)” (Fig. 4). In general, P rdfs:subPropertyOf P′ means: if x P y, then x P′ y, i.e., if x and y are related by P, then they are also related by P′. To clarify this, think of the shear and extend properties, which are more specific types of the displace property (i.e., shear rdfs:subPropertyOf displace and extend rdfs:subPropertyOf displace) (Fig. 4). In this case, if Fault shear Grain, then Fault displace Grain. The brittlyDeform property is more specific than deform, and rotate is more specific than deform, i.e., rotate rdfs:subPropertyOf deform. This means that if a fault rotates a fold, it deforms it (if Fault rotate Fold, then Fault deform Fold). The crystallize property is a subproperty of solidify (crystallize rdfs:subPropertyOf solidify), and strain, rotate, and translate are subproperties of the deform property (e.g., strain rdfs:subPropertyOf
Figure 5. A property inheriting from two other properties, where x and y are instances of classes A and B, which are related by property P. The relation x P y implies the two top x R y and x S y relations (properties).
Figure 4. Subclassing of properties, where x and y are instances of the A and B classes, respectively. (A) P is a subproperty of P′ if x p y infers x p′ y, i.e., the relation of two instances (resources) x and y by the subproperty infers the relation by the superproperty.
deform). So crystallize rdfs:subPropertyOf solidify means that if a magma x crystallizes into a mineral y (i.e., if x P y), then x also solidifies into mineral y (i.e., x P′ y). Also, grainBoundaryMigrate rdfs:subPropertyOf recrystallize means that
grain boundary migration during metamorphism brings about recrystallization. Notice that subsumption is transitive, i.e., material that is strained is also deformed; material that is crystallized is also solidified. The converse is not necessarily true, that is, something that solidifies does not necessarily do it through crystallization. There are cases where a property (P) inherits meaning from two (R and S) (or more) other properties. This can be done by defining property P to be the logical intersection of R and S (i.e., P ⊆ R ∩ S), where two resources x and y are related by property P. In other words, if x P y, then x R y and x S y (Fig. 5). For example, recrystallization (defined in the structural geology ontology with the struc namespace) implies both crystal plastic deformation and strain softening. These relations are asserted as follows: struc:recrystallize rdfs:subPropertyOf struc:crystalPlasticallyDeform and struc:recrystallize rdfs:subPropertyOf struc:strainSoften. This is a unidirectional inference, i.e., when recrystallization occurs, it strain softens the rock, but not all strain softening is achieved via recrystallization (for example, it can also occur via recovery).
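The recrystallize example of a property that inherits from two other properties can be serialized as follows; the individuals (struc:Metamorphism05, struc:Mylonite01) are hypothetical and only illustrate the inference.

@prefix struc: <http://example.org/structuralgeology#> .   # hypothetical URI
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .

struc:recrystallize  rdfs:subPropertyOf  struc:crystalPlasticallyDeform ;
                     rdfs:subPropertyOf  struc:strainSoften .

# Asserting the subproperty between two (hypothetical) individuals...
struc:Metamorphism05  struc:recrystallize  struc:Mylonite01 .
# ...lets a reasoner infer both superproperty statements:
#   struc:Metamorphism05  struc:crystalPlasticallyDeform  struc:Mylonite01 .
#   struc:Metamorphism05  struc:strainSoften              struc:Mylonite01 .
# The reverse inference does not hold (strain softening can also occur via recovery).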
Optional properties can subsume mandatory subproperties, but mandatory properties can only have mandatory subproperties (Colomb, 2007). Instances of the domain classes for an optional property are not required to participate in it. For example, ductilelyDeform may be optional for a Rock, but it can subsume mylonitize, which can subsume recrystallize. If the mylonitize property is mandatory for a crystal-plastically deformed fault rock, then its recrystallize and recover subproperties must be mandatory too. Mandatory properties are expressed with the existential quantifier (∃). For example, Folding (a process) must involve either a planar object (e.g., bedding, foliation) or a linear object (flute cast, lineation). So, if there is an instance of Folding, there also exists an instance of a Planar or Linear class through the deform property. For example, if the Folding and Bedding classes are defined in the structural geology ontology (namespace prefix: struc), then the domain and range for the deform property are given by: struc:deform rdfs:domain struc:Folding, and struc:deform rdfs:range struc:Bedding. A class can “supply” or “carry” a property. It supplies the property if the property only holds for the class and not its superclass, i.e., it is defined in the subclass. It carries the property if it is inherited from the superclass (Guarino and Welty, 2002; Colomb, 2007). Properties can carry essentiality (+E), unity (+U), identity (+I), and rigidity (+R). These metaproperties are defined in the Ontoclean method (Guarino and Welty, 2002). It should be noted that the Ontoclean method is only one of several possible ontology evaluation methods. Essentiality means that all instances must have the same value for their property. Unity relates to the parts that are needed to make the whole. It is concerned about how parts of a complex object are put together to make the whole object. Identity refers to the properties that are needed to identify instances of a class. Rigidity is provided by the necessary and sufficient properties that identify the class of an individual. If any of these metaproperties is annotated with a negative prefix (read: “not”), e.g., –E, –U, –I, and –R, it means that the metaproperty does not necessarily hold (may hold by accident) for all instances of the class for which the property is defined (Guarino and Welty, 2002). Thus, –E means that property is not essential, although it could be accidental. Same is true for –U, –I,
and –R, which read “not unity,” “not identity,” and “not rigidity,” respectively. A metaproperty annotated with ~ (read “anti”), e.g., ~E (anti-essential), ~U (anti-unity), ~I (anti-identity), or ~R (anti-rigid), means that it necessarily does not hold for any instance of the class. An anti-identity and anti-unity property cannot be used as a basis of an identity or unifying relation for any instance, respectively. An anti-essential property can be updated (i.e., the value may change) in all instances (Colomb, 2007). The +R and +E properties must be mandatory, whereas –R, ~R, –E, and ~E can be optional. Subclasses cannot weaken the strength of the metaproperty for the superclass, i.e., a subclass cannot have an ~E, ~I, or ~U if the superclass has the property +E, +I, or +U. However, the opposite is possible, i.e., if a superclass has a property with metaproperties ~E, ~I, or ~U, the subclass can have that property with +E, +I, and +U, respectively (Colomb, 2007).

RELATIONS THAT HOLD BETWEEN FUNDAMENTAL TYPES OF ENTITIES

The entities in a domain (e.g., subduction zone, fold-and-thrust belt) fall into two broad, disjoint (i.e., nonoverlapping) categories (e.g., Smith and Grenon, 2004; Bittner et al., 2004): (1) continuants (endurants), and (2) occurrents (perdurants). The continuants include material and immaterial substances, parts (both fiat and bona fide parts; Smith, 2001), boundaries, aggregates, qualities, roles, functions, and spatial regions (Smith and Grenon, 2004). The continuant objects, such as fault, lake, accretionary prism, rock, and porosity, exist in their entirety (i.e., as a mereological whole) at any temporal slice (i.e., at a time instant, ti) of their four-dimensional (spatio-temporal) life history. Despite the continuous change in the object (attribute) and relational properties (e.g., partitive and connection relations) of the continuants, these entities maintain their identity through time and space, as long as they exist. For example, the continuous qualitative changes in the type, thickness, and spatial location of sediments in the Kumano forearc basin, and its underlying accretionary prism, do not change the identity of these components of the Nankai Trough (just as a change in the color of your hair does not change you). While continuants represent the static part of reality, the occurrents correspond to the dynamics of the world. The occurrents include events that signify the instantaneous beginning and end of state change in objects (e.g., rock, fault) through homogeneous processes that bring qualitative change to the continuants. For example, the accretionary prism (a continuant) grows through processes of offscraping, underplating, and sedimentation, which modify the structure of the prism over time. In this paper, I focus on the formal spatial relations and do not cover the temporal and spatio-temporal relations, which require discussion of occurrents, such as processes and events, that are beyond the scope of the paper. These relations can be found in Babaie (2011). Formal, in this case, means that the relations, which are defined in first-order logic, apply to any domain of
reality, for example, to subduction zone, strike-slip fault, experimental rock deformation, or atmospheric science. Notice that the formal relations, such as partOf, are not necessarily part of the OWL language. The formal relations (e.g., partOf, locatedIn) may hold between: (1) continuant objects, e.g., SeismogenicZone partOf PlateBoundaryFaultZone; SplayFault locatedIn AccretionaryPrism; UnderplatedSediment externallyConnectedTo AccretionaryPrism, (2) occurrent objects, e.g., Comminution partOf Cataclasis; Fracturing partOf Faulting, or (3) between objects of the two disjoint types, e.g., Folding involves Discontinuity; Ions participate-in Mineralization. Notice that the relations between two continuants
are defined at the time instants at which the two objects are related to each other, i.e., xi partOf yi, or xi properPartOf yi, for the time index i. The related objects can have different granularities, for example, Microfracture partOf DamageZone; TwinBanding locatedIn Mineral. A damage zone, a fractal entity, can exist over a large range of scales, from microscopic to regional, compared to twin banding, which is microscopic. The universal term entity refers to objects, processes, events, functions, times, and places (Smith and Rosse, 2004). Entities are represented in information systems by classes (universals, types, kinds) and their instances (individuals, particulars, tokens), which exist in a specific region of space and time. Examples of entities include Ocean (a continuant type) and its instances, e.g., IndianOcean and PacificOcean, and Subduction (an occurrent type). Smith (2004) and Klein and Smith (2006) defined concept to refer to the meaning of a consensual, general term that has been agreed upon by domain experts and has been used in making scientific assertions. For example, the San Andreas fault in California is an individual of the Fault type (class). The string “San Andreas fault” is a term (symbol) that refers to the actual, individual San Andreas fault, and it evokes the abstract fault concept, which has a well-defined meaning to the domain experts. The concept fault means “a planar discontinuity in rock along which there has been some displacement.” The concept ocean refers to the universal Ocean type, the largest body of water on Earth. The fault and ocean concepts refer to (i.e., stand for) the universal types Fault and Ocean that have instances (particulars, individuals) in reality, such as the San Andreas Fault and Pacific Ocean. Concepts do not exist; they are used to represent the universal types that can be instantiated in reality. Ontologies are not about concepts; they are models of individual entities that instantiate the universals in space and time, e.g., SanAndreasFault, NankaiTroughAccretionaryPrism. These two examples are instances of the Fault and AccretionaryPrism types, respectively. Universal ontological types are represented in information models as artifacts, such as classes in UML (Unified Modeling Language) diagrams, entities in entity relationship diagrams, elements in XML schema, and tables in databases. Thus, UML classes represent ontological types that have instances in reality, and they are given specific terms that refer to our concepts. It is imperative that we not think of ontologies as hierarchies
of concepts, but of types and instances and the ways in which they are related. For example, the rock and water concepts are not related in dictionaries. In reality, however, rock can contain water in its pores, i.e., Pore partOf Rock and Pore contains Water (inverse of Water containedIn Pore). Notice that the types Rock, Water, and Pore are related at the universal level. At the instance level in reality, however, there are some real rocks (i.e., instances of the Rock type) that do not have pores and therefore do not contain water. Ontologies are depictions of both the universal types and the real relations that may exist among instances of these types, based on domain theories and knowledge. METAPROPERTIES Complex entities can be partitioned either through decomposition into subclasses (taxonomy) or into parts (partonomy) (Tversky, 1990). Class inclusion through taxonomy is based on similarity. Whereas the meronomic relations are between concepts, allowing no inheritance by the subclass, the taxonomic relations are within concepts, making it possible for a class to inherit properties from its superclass (Pribbenow, 2002). In contrast to the downward inheritance in a taxonomy, a partonomy may allow an upward inheritance, whereby a whole inherits from its parts. For example, an ultramafic rock inherits its dark color and high density from its mineral parts. A fold inherits its shape from the constituent layers; a molecule inherits its composition from its elements. Relations can be unary, binary, ternary, or n-ary (e.g., Smith, 2003; Smith et al., 2005). There are three general types of relations (Smith et al., 2005) that hold between: (1) classes, i.e.,
<class, class>, e.g., isA; (2) a class and its instances, i.e., <class, instance>, e.g., instanceOf; and (3) instances, i.e., <instance, instance>, e.g., partOf. In the following, some of the formal, primitive, foundational relations that obtain between different classes are described. Ontological relations are those that obtain between instances in reality independent of the ways in which we gain knowledge about them or represent them (Smith et al., 2005). Formal means that the relations are general and domain-neutral, and primitive means that they are atomic and other relations can be derived from them. The metaproperties, which define properties of properties, are defined next (e.g., Breitman et al., 2007). If the property that relates two classes is the same in both directions, we declare the property to be symmetric. P is symmetric if and only if, for any x and y, P(x,y) if and only if P(y,x). For example, “twinsWith” is a symmetric property (if x twinsWith y, then y twinsWith x). Symmetric properties must be declared as such (P rdf:type owl:SymmetricProperty). The inference for a symmetric property is as follows: P owl:inverseOf P. That is, a symmetric property is its own inverse. Property R is said to be the inverse of property P if, for any x and y, P(x,y) if and only if R(y,x). Many properties in one direction have an inverse property in the opposite direction but are named differently. For example,
the first property in each of the following pairs reverses the direction of the second property: analyzes and analyzedBy, investigates and investigatedBy, hasSample and sampleOf, wrote and writtenBy, and locatedIn and locationOf. The partOf property is an inverse property. This means that if a fold has a limb as part, then the limb is part of the fold (struc:Fold struc:hasPart struc:Limb; struc:partOf owl:inverseOf struc:hasPart). In mathematics, a relation P is said to be transitive if, for any x, y, and z, P(x,y) and P(y,z) imply P(x,z). This is represented by the owl:TransitiveProperty and may be declared as follows: P rdf:type owl:TransitiveProperty. The inference for this property is as follows: if x P y and y P z, then x P z. For example, if C partOf B and B partOf A, then C partOf A. For example, by being fractal, faults have segments that have smaller fault segments, which have even smaller segments, which are themselves faults, such that struc:FaultSegment struc:partOf struc:FaultSegment; struc:partOf rdf:type owl:TransitiveProperty; struc:FaultSegment rdfs:subClassOf struc:Fault. The locatedIn property is also transitive: geo:locatedIn rdf:type owl:TransitiveProperty. For example, if tect:SanAndreasFault geo:locatedIn geo:California and geo:California geo:locatedIn geo:UnitedStates, then tect:SanAndreasFault geo:locatedIn geo:UnitedStates. Property P is functional if, for any x, y, and z, P(x,y) and P(x,z) imply y = z; that is, the property can take only one value for a given individual. This is in analogy with a mathematical function (y = 3x), which for any single input value (e.g., x = 2) returns one unique value (y = 6). For example, the location of a sample given by its longitude and latitude is a functional property. The owl:FunctionalProperty and owl:InverseFunctionalProperty allow merging of data for the same individual from different sources. The owl:FunctionalProperty can only take one value for any individual, allowing sameness to be inferred. The inference rule for this construct is as follows: if P rdf:type owl:FunctionalProperty, and X P A and X P B, then A owl:sameAs B. Property P is inverse functional if, for any x, y, and z, P(y,x) and P(z,x) imply y = z. The owl:InverseFunctionalProperty, which is the inverse of the owl:FunctionalProperty, is very useful for merging data from different sources. This property is equivalent to a key in relational databases, such as a social security number or driver's license number. The inference rule of this construct is as follows: if P rdf:type owl:InverseFunctionalProperty, and A P X and B P X, then A owl:sameAs B. For example, assume that the location of any measurement is uniquely identified by its longitude and latitude (defined by the location class), i.e., no two samples can be taken at the same exact spot. Given :Station134 :locatedAt :LocationA and :Station346 :locatedAt :LocationA, we infer that :Station134 owl:sameAs :Station346. For a one-to-one relationship, we use a combination of the owl:FunctionalProperty and owl:InverseFunctionalProperty. For example, we can define the sample ID to be unique: :sampleId rdfs:domain :Sample. :sampleId rdfs:range xsd:integer. :sampleId rdf:type owl:FunctionalProperty. :sampleId rdf:type owl:InverseFunctionalProperty. So, any two samples with the same ID must
be the same sample!
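Written out in Turtle, the sampleId example above reads as follows (the default namespace URI is a placeholder); because the property is declared both functional and inverse functional, a reasoner concludes that two sample individuals carrying the same ID denote the same sample.

@prefix :     <http://example.org/samples#> .    # hypothetical default namespace
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

:sampleId  a            owl:DatatypeProperty ,
                        owl:FunctionalProperty ,
                        owl:InverseFunctionalProperty ;
           rdfs:domain  :Sample ;
           rdfs:range   xsd:integer .
# (Note: OWL DL restricts inverse-functional declarations to object properties;
#  owl:hasKey is the DL-safe alternative in OWL 2. The sketch mirrors the text.)

# Two assertions with the same ID value...
:SampleA  :sampleId  1047 .
:SampleB  :sampleId  1047 .
# ...entail  :SampleA owl:sameAs :SampleB .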
Notice that not every functional property can also be an inverse functional property. For example, mineral composition can be functional only, because every mineral has a unique composition, but many individuals can share that same composition (if they belong to the same class) (compare with the hasFather property). Some properties can only be inverse functional, but not functional. For example, a single-author publication of an author (or a description of an outcrop or a thin section by one person) may be inverse functional, because it only belongs to one person. The person can have several such publications or descriptions.

INSTANTIATION RELATION

Individuals (instances) are related to the universals (types) with the primitive isA relation, e.g., SanAndreasFault isA Fault. The isA relation, as a mechanism for subtyping in reality, provides for the specialization of a class in an information model. In scientific investigations, we deal with the individuals, not the universals. These instances can be of the continuant or occurrent types (Klein and Smith, 2006). As scientists, we study individuals such as the SanAndreasFault, or a specific specimen of a rock, or a water sample from a particular river (i.e., an instance of the River type). The instances stand in different relations to each other in the real world. At the class level, we may have: Mylonite isA Rock; Pore properPartOf Rock; Recrystallization potentialPartOf Deformation; Basin contains Water; and AccretionaryPrism tangentiallyConnectedTo SubductingPlate. Notice that in reality a particular deformation (e.g., brittle deformation), somewhere in space and time, may not include recrystallization. The assertion Mylonite isA Rock implies that all instances of the type Mylonite are also instances of the type Rock. However, notice that only some instances of the type Rock are also instances of the Mylonite type. The assertion ForearcBasin potentiallyAdjacentTo AccretionaryPrism implies that all instances of the type ForearcBasin may be adjacent to some instance of the type AccretionaryPrism. Two conditions are needed for a type to be a subtype of another type (Smith, 2004): C isA D if C and D are universals, and, for all times t, if anything instantiates universal C at t, then that same thing must also instantiate the universal D at t. A universal is anything that can be instantiated by an individual (particular). For example, if contraction in a subduction complex leads to the creation of an instance of a SplayFault in the accretionary prism, it also forms an instance of the Fault supertype at the same time. Instantiation is represented, at the instance level, by the instanceOf relation, for example, c instanceOf C at t, which means that the continuant particular c instantiates universal C at time t (Smith et al., 2005), e.g., BereaSandstone instanceOf SedimentaryRock. The binary instanceOf relation is written as Inst (c, C), or Cct, where the particular c is an instanceOf the universal C. Every universal has a particular (i.e., ∃c Inst [c, C]), and every particular is an instance of a universal (i.e., ∃C Inst [c, C]) (Donnelly et al., 2005). The symbol ∃ is the existential quantifier, which
reads: there exists some (at least one). The instance relation can be used to define subsumption at the universal level: isA (C, D) = ∀c (Inst[c, C] → Inst[c, D]), which says that C is subsumed by D (i.e., C isA D) if every instance (c) of C is also an instance of D. The ∀ symbol is the universal quantifier, which reads: for all, if any. For example, isA (Pseudotachylyte, FaultRock), or isA (Silicate, Mineral). These can also be written as: Pseudotachylyte isA FaultRock, and Silicate isA Mineral. The universal-level assertion C isA D means: for all c, t, if c instanceOf C at t, then c instanceOf D at t (Smith et al., 2005). For example, ThrustFault isA Fault, which is true only if instantiation of a structure of type ThrustFault, say in an accretionary prism, leads to the instantiation of a structure of type Fault at the same time in the prism. Thus, in the examples FaultSegment isA Fault and Mylonite isA FaultRock, the instantiation of the FaultSegment or Mylonite leads to simultaneous instantiation of the Fault or FaultRock, respectively. Notice that the isA relation does not hold between the concepts (i.e., meanings of the terms); it holds between universals with actual instances in reality. When we assert that ThrustFault isA Fault, we are not saying that the meaning of the term thrust fault is the same as the meaning of the term fault. The meanings of these two terms are of course different. The assertion means that the universal ThrustFault is a subtype of the universal Fault. Thus, the isA relation is used here to mean subtyping between universals and between their corresponding instances. The isA relation represents a necessary, but not a sufficient, condition for an entity. For example, the universal type Mylonite is necessarily a FaultRock, but it has other properties that the more general FaultRock type does not have. A ThrustFault is necessarily a Fault, but a thrust has some properties that are unique to itself, which a general type of Fault may lack. In other words, it is not sufficient to say that a ThrustFault isA Fault or Mylonite isA Rock. Even though a human being is necessarily a mammal, it is not sufficient to say that a human being isA mammal (dogs are also mammals). There is a difference between dogs and humans even though both are necessarily mammals. The difference is represented by additional and unique object and relational properties of the subclasses in the ontology model. For example, an instance of the type Mylonite may have foliation, lineation, and zones of recrystallization that an instance of Rock may not have.

PARTONOMIC RELATION

Although most ontologies apply the isA relation for class hierarchies, the mereological (partOf, part-whole, partitive) relation is probably of equal value for building the hierarchical structure of ontologies (Pribbenow, 2002; Schulz and Hahn, 2005; Koslicki, 2008). Notice that, although OWL uses the rdfs:subClassOf property to construct the isA relation, it does not have a qualified name for constructing the partOf or hasPart relation. Many entities of interest to Earth scientists are composite, i.e., aggregates made of several parts, which have complex spatial or spatio-temporal structures. The following is a discussion of
the formal semantics of non-OWL constructs such as the partOf and hasPart properties. Composite entities can be separated into parts, which may be spatial objects, temporal entities (events), or spatio-temporal entities such as processes. For example, a subduction complex above the plate-boundary fault is a composite whole made of several parts that include the accretionary prism, forearc basin, plate-boundary fault zone, and inner and outer slope basins. A strike-slip fault is a composite of many segments, steps, and bends. A ductile deformation (process) along a fault may contain several spatio-temporal parts, which may include subprocesses of recrystallization, recovery, or cataclastic flow. Each of these parts may have its own parts, e.g., the prism part includes the offscraped and underplated thrust sheets of sediment; a segment is made of many other segments, steps, and bends (because faults are fractal). The offscraped thrust sheets are bounded by thrust faults that are members (i.e., parts) of a collection of faults (a whole). The plate-boundary fault zone, at the base of the prism, may have the following parts: seismogenic zone, mylonite zone, cataclastic zone, aseismic zone, and boundaries that include the décollement. Notice that even though the seismogenic zone may be part of some plate-boundary fault zone in a subduction zone, not all such fault zones have a seismogenic zone. Facts like this need to be included in the ontology. Recrystallization may include dynamic recrystallization by grain boundary migration or rotation, or static recrystallization.
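As a sketch of how such instance-level parthood assertions might look in Turtle (partOf is user-defined here, since OWL has no built-in parthood term, and the URIs and individual names are illustrative assumptions):

@prefix geo: <http://example.org/subductionzone#> .   # hypothetical URI
@prefix owl: <http://www.w3.org/2002/07/owl#> .

geo:partOf  a  owl:ObjectProperty ,
               owl:TransitiveProperty .   # user-defined parthood property

# Instance-level parthood in a subduction complex (illustrative individuals):
geo:AccretionaryPrism1       geo:partOf  geo:SubductionComplex1 .
geo:ForearcBasin1            geo:partOf  geo:SubductionComplex1 .
geo:UnderplatedThrustSheet7  geo:partOf  geo:AccretionaryPrism1 .
# Transitivity lets a reasoner infer:
#   geo:UnderplatedThrustSheet7  geo:partOf  geo:SubductionComplex1 .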
of X) exists at the same time as a partOf y (i.e., at the instance level). This means that y does not exist unless it has an instance of X as part. For example, NankaiTroughAccretionaryPrism hasPart AccretedSediment, which means that the accretionary prism cannot exist at time t (e.g., today) if it does not have accreted sediment as a part. Thus, X partOf Y holds, by combining the two assertions, if and only if for any instance x of X existing at time t, there is some simultaneously existing instance y of Y, such that x is an instance-level part of y, and y has x as part. For example, SubductingPlate partOf SubductionZone, which means that if the PhilippinePlate is an instance of the SubductingPlate and the NankaiTrough is an instance of the SubductionZone at the present time, then the subducting Philippine plate can only exist today as a part of the present-day Nankai Trough subduction zone. The relationship between parts and the whole that composes or aggregates the parts is the subject of the formal theory of part-whole structure called mereology (e.g., Simon, 1987; Varzi, 2003). The mereological partOf relation (denoted as P) and its inverse, hasPart (denoted as P-1), that obtain between two individuals (a part and its whole) constitute the fundamental relations for composite entities (e.g., Varzi, 1996; Casati and Varzi, 1999; Pontow, 2004; Schulz and Hahn, 2005; Koslicki, 2008). The relations may hold in the following cases (e.g., Schulz et al., 2006): between material and nonmaterial objects, parthood over time, parthood and spatial location, and parthood between occurrents (Schulz et al., 2005, 2006). For example, Mineral partOf Rock (or Rock hasPart Mineral) signifies that an instance of the class Mineral (which represents individual minerals in reality in a domain model) is a part of an instance of the class Rock at a specific instant of time, t. In UML the isA and partOf relations are represented by subclassing and composition (filled black diamond) or aggregation (open diamond), respectively. The partitive partOf relation is needed, along with other semantic relations such as attribution (e.g., thickness of the forearc sediments), class-subclass (e.g., ThrustFault isA Fault), spatial inclusion (e.g., NankaiTrough locatedIn Japan), and connection relation (e.g., ForearcBasin externallyConnectedTo Prism), for a better representation of reality (e.g., Donnelly et al., 2005). Partitive relations in ontologies, designed to portray the reality in natural systems, hold between universals and then are applied to individuals using constraints based on domain knowledge. We need to make a distinction between parthood at the universal level, i.e., PartOf (A, B) versus that at the instance or individual level, i.e., partOf (x, y) (Schulz et al., 2006). For example, compare PartOf (PlateBoundaryFault, SubductionZone) versus partOf (NankaiTroughPlateBoundaryFault, NankaiSubductionZone). The class-level PartOf (A, B) means that the universal A is part of universal B if every instance of B has some instance of A as part, and every instance of A is part of some instance of B. Schulz et al. (2006) introduced the ternary relation Inst, which relates an individual to a universal at time t, i.e., Inst (x, A, t). The formal definition of the class-level PartOf (A, B) is then given based on the Inst (x, A, t) as follows: ∀x,t Inst (x, A, t) ∃ y
Inst (y, B, t) ∧ p (x, y, t), which reads: for all x and t, if x is an instanceOf A at time t, then there exists a y such that y is an instanceOf B at t and x is a partOf y at t. For example, Inst (SanAndreasFault, StrikeSlipFault, t) means that the San Andreas fault is a strike-slip
The definition of the universal-level HasPart (A, B) relation (which is the inverse of the universal PartOf [A, B] relation) is as follows: ∀y,t {Inst (y, B, t) → ∃x (Inst [x, A, t] ∧ p [x, y, t])}. The class-level PartOf (A, B) can be interpreted in the following different ways (Schulz et al., 2006), which need to be clarified using constraints such as cardinality in the ontology model. (1) It may mean that all instances of A are partOf some instance of B, for example, PartOf (UnderplatedSediment, AccretionaryPrism); on this reading there may be some prisms that have no underplated sediment as part. (2) It may mean that all instances of B have some instance of A as part, for example, PartOf (Bed, Formation). (3) It may mean that all instances of A are partOf some instance of B and all instances of B have some instance of A as part, for example, PartOf (OuterSlopeBasin, AccretionaryPrism). (4) It may imply only that there is at least one instance of A that is partOf some instance of B, for example, PartOf (SplayFault, AccretionaryPrism); notice, however, that not all splay faults are part of accretionary prisms, and not all prisms have splay faults. The relationship between parts and the whole may be functional, structural, spatial, or based on whether or not parts are separable from the whole, or are homeomerous (Lambrix, 2000). An example of a functional partOf relation is that between cement/matrix (parts) and the sediment (a whole); the function of cement is to hold the grains together. The function of pores or open fractures (parts) in a rock (a whole) is to store or transmit fluids, respectively. Parts are separable if the whole survives after the parts are separated from it, for example, when dissolved ions (parts) are removed from water (the whole). Parts are homeomerous if they are the same kind of thing as their whole, e.g., a calcite or quartz crystal (part) in a monomineralic marble or quartzite (whole), respectively. Specimens taken from a sample of a core of granite are homeomerous with the sample, and with the core itself. Gerstl and Pribbenow (1996) divided a composite whole into three end-member types: heterogeneous complex, uniform collection, and homogeneous mass. The mereological partOf relation is further extended (i.e., specialized) by the following relations (Lambrix, 2000). The componentOf relation implies the existence of a complex, heterogeneous structure in the whole, and functional, structural, spatial, temporal, and other relations among the components (Gerstl and Pribbenow, 1996), and between components and the whole. The parts (i.e., components) are separable and nonhomeomerous, and they have spatial and temporal relations to the whole. Examples: UnderplatedSediment componentOf AccretionaryPrism; ForearcBasin componentOf SubductionComplex; PlateBoundaryFaultZone componentOf SubductionComplex. Both the prism and the subduction complex, in these cases, are heterogeneous, complex wholes, with several
internal components that have spatial and temporal relations to each other. The memberOf (elementOf) relation relates all members (parts) to a compositionally uniform collection (a whole), in the same way, without any functional or subclassing implication. The parts in this case have membership based on spatial proximity, not type (i.e., they are not subclasses of the whole), and they are separable and nonhomeomerous. Examples: RiserCore memberOf CoreCollection, for the boreholes in the NanTroSEIZE project in the Nankai Trough; SplayFault memberOf SplayFaultCollection, for a splay fault propagating from the plate-boundary fault; FractureSet memberOf FractureSystem, for the sets of fractures forming in shear zones; SiteNT3-01A memberOf NanTroSEIZESiteCollection. The portionOf relation relates a homeomerous, separable, and nonfunctional part to a homogeneous whole. For example, WorkingCoreSplit portionOf Core represents the portion of the core, other than the archive core split, that is set aside for different analyses. Other examples include: SpotCoreSample portionOf RoundCoreSample; DrillingMudSample portionOf DrillingMud. The nonfunctional, separable, and nonhomeomerous stuffOf relation holds between a whole and the material that it is partly or wholly made of. Examples: Rock stuffOf Core (in addition to other stuff, such as air, water, and mud); Mineral stuffOf Vein; IgneousRock stuffOf Dike. There are several criteria that can be used to identify genuine parthood of entities (Smith et al., 2005). The main criterion is sortality, which means that parts need to be of the right type to enable instantiation of the whole. For example, a drill pipe stuck in a borehole is not part of the rock, and monitoring equipment in a borehole is not part of the borehole; rather, it is containedIn it. Metal and plastic cannot be part of rock. A question arises as to whether a piece of meteorite embedded in a clastic sedimentary rock (e.g., mudstone) is part of the rock, even though meteorite and mudstone are not of the same type. In this case, the meteorite is a clast, and clasts are parts of the clastic rock. Provenance can help in this case: the extraterrestrial origin of the meteorite does not fit the terrestrial origin of the other clasts in the mudstone. The function of a part may be essential to the functioning of the whole. For example, the cement of a conglomerate is a part of the conglomerate because it holds the clasts together; if there were no cement, the conglomerate would be another entity (a loose gravel aggregate). A part may be a structural element of the whole. For example, one or more limbs of a fold are needed for a fold to exist. In most cases, the life cycles of the part and the whole correspond to each other. For example, quartz, feldspar, and mica are parts of a granitic rock. However, the feldspar may be altered into clay at a time when other components (i.e., mica and quartz) are still parts of the granite. There may be exceptions to the life cycle rule if temporary parthood occurs. For example, the seismogenic zone in the plate-boundary fault zone under the accretionary prism may migrate into the prism and become part of the prism at a later time.
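These distinctions among componentOf, memberOf, portionOf, and stuffOf can be summarized by whether the part is functional, separable, and homeomerous with respect to its whole. The sketch below is a minimal, illustrative encoding of that summary (a simplified reading of Gerstl and Pribbenow, 1996, and Lambrix, 2000, not a normative one); it only checks whether the observed features of a candidate part-whole pair are consistent with a named subrelation:

```python
# Feature triples (functional, separable, homeomerous) for the specialized
# part-whole relations as characterized in the text; illustrative only.
# Note that memberOf and stuffOf share a triple: they differ in the kind of
# whole (uniform collection versus constituted object), not in these features.
PART_WHOLE_KINDS = {
    "componentOf": (True,  True, False),   # complex, heterogeneous whole
    "memberOf":    (False, True, False),   # uniform collection
    "portionOf":   (False, True, True),    # homogeneous whole
    "stuffOf":     (False, True, False),   # material a whole is made of
}

def consistent(relation, functional, separable, homeomerous):
    """Do the observed features of a part-whole pair fit the named subrelation?"""
    return PART_WHOLE_KINDS[relation] == (functional, separable, homeomerous)

# WorkingCoreSplit portionOf Core: nonfunctional, separable, homeomerous.
print(consistent("portionOf", functional=False, separable=True, homeomerous=True))  # True
# A homeomerous part cannot stand in the memberOf relation as characterized here.
print(consistent("memberOf", functional=False, separable=True, homeomerous=True))   # False
```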
The relata of the part-whole relations, in addition to objects, can be activities and processes. For example, the phaseOf or stageOf relation holds between different (spatio-temporal) parts of activities or processes. The phaseOf and stageOf relations are functional but not separable or homeomerous. Activity in this case involves human or machine agents, and it may be sampling, examination, simulation, or drilling. Examples: Stage3 stageOf NanTroSEIZEDrilling; PaleomagneticMeasurement phaseOf NonDestructivePhysicalPropertyMeasurement; MicroscopicMeasurement phaseOf OnBoardAnalysis. All nonabstract objects occupy spatial and temporal regions. The placeOf relation holds
between an area (a whole) and the places that are located in it. Examples: Japan placeOf Nankai Trough. LOCATION AND CONTAINMENT RELATIONS The locatedIn relation (inverse of the placeOf relation) holds between a continuant and a spatial region (r) at time t (i.e., c locatedIn r at t), i.e., it depends on a function (Galton, 2000, 2004; Donnelly, 2004a), which assigns a region r(c, t) that any continuant instance c exactly occupies at time t. Thus, c locatedIn d at t means that r(c, t) partOf r(d, t) at t (Smith et al., 2005), which means c is locatedIn d if c’s region is part of d’s region (Donnelly, 2004a). Example: Vein locatedIn fracture at time t, Mylonite locatedIn ShearZone at time t. In all cases, scientists measure instances of the continuants at the present time, and therefore the present time is implied in the assertions (i.e., t is dropped). However, the present time becomes part of the past as new measurements are done in the “present times,” i.e., time instants of the future. Many of the locatedIn relations actually represent instantaneous parthood, for example, the SeismogenicZone partOf PlateBoundaryFault only applies at a specific time, and it may not be true at other times because the zone may migrate with time. Symbolically, the location relation is given as: locatedIn (c, d) = P(r[c], r[d]), which reads: c is locatedIn d if region c is partOf region d. Thus, if x is partOf y, then x is locatedIn y. Examples: if Mylonite locatedIn ShearZone at time t, then the region of Mylonite is partOf the region of the ShearZone at time t. If VolcaniclasticSediment is locatedIn ForearcBasin at t, then the region of VolcaniclasticSediment is partOf the region of the Forearc Basin at t. At the universal level, the assertion C locatedIn D means for all c, t, if c instanceOf C at time t, then there is some d such that d instanceOf D at t, and c locatedIn d at t. For example, at the present time: SanAndreasFault partiallyLocatedIn California, or YellowstoneHotSpot partiallyLocatedIn Wyoming. We can derive the following transitive relations from the locatedIn relation (Donnelly, 2004c): if c is locatedIn d and d is locatedIn z, then c is locatedIn z. Example: Mg2+ ion locatedIn water sample, which occupies a fracture, is located in the fracture. If c is part of d and d is locatedIn z, then c is locatedIn z. Example: A fracture that is part of a thrust sheet located in the accretionary prism is located in the prism. If c is locatedIn d and d is partOf z, then c is locatedIn z. Example: If volcaniclastic sediment locatedIn
the Kumano forearc basin sequence, and the forearc is partOf the Nankai Trough, then the volcaniclastic sediment is locatedIn the trough. If two objects coincide partially or wholly without being part of one another, then we use the coincidence relation, which is another kind of location relation. In other words, a continuant may exist in a spatial region occupied by another continuant of which it is not a part. In this case, the first entity may coincideWith but is not locatedIn the larger entity, which means that it is not part of the second entity. Examples are: water in an open fracture, and ions or contaminants in the pores of sediment or an aquifer. In all of these cases, there is no parthood relation, just spatial coincidence. It is not necessary for a fracture or pore to have water or a contaminant. Object x is said to overlap object y if x and y share a common part z (Pontow, 2004). In other words, Oxy = ∃z (Pzx ∧ Pzy). Example: NankaiAccretionaryPrism overlaps NankaiTrough. Object x is discreteFrom object y if x does not overlap y, i.e., Dxy = ¬Oxy. Examples: SubductingPlate discreteFrom ForearcBasin; SlopeBasin discreteFrom ForearcBasin. In the partial coincidence (PCoin [x, y]) case, the regions of the two objects, x and y, overlap without a part-whole relation (i.e., ¬Oxy). Thus, PCoin (x, y) = O(r[x], r[y]). Partial coincidence is reflexive, PCoin (x, x), which means that any object partially coincides with itself, and it is symmetric, PCoin (x, y) = PCoin (y, x), which means that if x partially coincides with y, then y partially coincides with x. For example, ForearcBasin partiallyCoincidesWith AccretionaryPrism, which means that even though the spatial regions of these two entities overlap, the forearc basin is not part of the prism. Partial coincidence is more common than total (whole) coincidence. The coincidesWith and locatedIn relations are related: locatedIn (x, y) → PCoin (x, y), i.e., if x is locatedIn y, then x partially coincides with y. Other relations include containedIn, which obtains between a material continuant and a site, i.e., an empty space that can contain the object. For example, ZeoliteCrystal containedIn Vug; Water containedIn Fracture; Contaminant containedIn Pore. Again, like the partOf relation, the containedIn relation holds at a certain time index, t. Some composite spatial entities are derived from other entities through the Boolean-like operations of sum, product, and complement, which are equivalent to the set theory operations of union, intersection, and complement, respectively (Galton, 2004). For example, a fracture system in the plate-boundary fault zone is a sum of several variably oriented fracture sets of different types (e.g., Riedel shear, Y-shear, and P-shear fractures). This is true for one-dimensional fracture traces and two-dimensional fracture planes. An example in three dimensions is the accretionary prism, which is the sum of all thrust sheets of accreted and underplated sediments and rocks.
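The locatedIn and partOf rules quoted above from Donnelly (2004c) are directly computable: from a handful of asserted facts, all implied locatedIn facts can be derived by closing the two relations under those rules. The sketch below (instance names are illustrative, and the time index is left implicit, as in the text) is one minimal way to do this in Python:

```python
# Asserted instance-level facts; names are illustrative placeholders.
part_of = {("Fracture1", "ThrustSheet2"), ("ThrustSheet2", "AccretionaryPrism1")}
located_in = {("MgIon1", "WaterSample1"), ("WaterSample1", "Fracture1")}

def infer_located_in(part_of, located_in):
    """Close locatedIn under: partOf(x, y) implies locatedIn(x, y);
    locatedIn(c, d) & locatedIn(d, z) -> locatedIn(c, z);
    partOf(c, d) & locatedIn(d, z) -> locatedIn(c, z);
    locatedIn(c, d) & partOf(d, z) -> locatedIn(c, z)."""
    loc = set(located_in) | set(part_of)
    while True:
        new = set()
        for (c, d) in loc:
            new |= {(c, z) for (d2, z) in loc | part_of if d2 == d}
        for (c, d) in part_of:
            new |= {(c, z) for (d2, z) in loc if d2 == d}
        if new <= loc:
            return loc
        loc |= new

derived = infer_located_in(part_of, located_in)
print(("MgIon1", "AccretionaryPrism1") in derived)  # True: the ion is located in the prism
```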
The derivesFrom relation obtains between two nonidentical individuals. The assertion c derivesFrom c1 means that continuant object c is derived from continuant object c1 at some time t, and that c1 does not exist anymore. At the universal level, we can have Gneiss derivesFrom Granite. There are three types of derivesFrom relations: (1) Continuation, where an instance of a continuant succeeds another at time t. (2) Fusion, where two or more continuants fuse into a new continuant and the earlier continuants cease to exist; examples include water and ions crystallizing into a new mineral at t, two magmas mixing at instant t, and P- and Y-slip surfaces merging and becoming a new slip surface. Unification (Smith and Brogaard, 2003) is a close relative of the fusion relation, in which two or more continuants join, but continue to exist, in a new complex continuant (for example, pebbles of many older rocks unifying in a conglomerate; porphyroclasts of older minerals unifying in a mylonite). (3) Fission, where a part of a continuant breaks into two or more other continuants at time t, which will exist on their own. Example: metamorphism of a rock leading to the formation of a new mineral at the expense of several older minerals (parts) that existed in the rock. The adjacentTo relation (Donnelly et al., 2005) is a proximity (topology) relation that holds between two disjoint continuants. The transformationOf relation represents change in the continuants over time. It obtains between an instance of a class C at time t, which used to be an instance of another, disjoint class C1 at an earlier time t1. Examples: QuartzofeldspathicMylonite transformationOf Granite; Soil transformationOf Rock at t. The absorb relation obtains when a continuant continues to exist but absorbs another continuant, which ceases to exist on its own; for example, Mineral absorbs Water, which means a mineral may absorb a water molecule (H2O, as the hydroxyl ion, OH⁻) or another ion into its crystalline structure. Rector et al. (2006) introduced the notion of collectives and granular parts, which relates closely to the formalization of entities such as rocks and sediment that are made of many parts. Such collective wholes (e.g., rock, sediment) have "emergent" properties that do not exist in the individual parts (minerals, grains); for example, the chain structure of a silicate emerges only when the silicon-oxygen tetrahedra connect to each other to make the whole. The emergent property of the silicate mineral (a collective) is not a property of the individual atoms (silicon, oxygen). A collective of grains in a sedimentary rock has emergent properties such as porosity, hydraulic conductivity, texture, and fabric that do not make sense for individual grains. Collectives are themselves parts of larger collectives; for example, minerals, which are collectives of atoms, are parts of larger rock collectives, which are themselves parts of even larger collectives organized in different packages such as lithostratigraphic rock units, members, formations, groups, and sequences. The emergent properties of a lithostratigraphic rock unit may include such things as its anisotropy and homogeneity in all sorts of physical properties. Rector et al. (2006) distinguished two types of subrelation under the parthood relation related to collectivity: granular parthood and determinate parthood. Collectives are aggregates of grains that play the same role in the whole, and they do not depend on the number of grains in the collective. For example, the relation of grains in a layer of sediment is a granular parthood.
In this case, there is an indeterminate number of grains in the layer, and removal of one part (i.e., a grain) does not necessarily diminish the whole (the sedimentary layer). Compare this with the relation of a collective of several tectonostratigraphic units of different ages separated by unconformities or faults (e.g., in an accretionary prism). In this case, which represents determinate parthood, the removal of any of the parts (e.g., a thrust sheet), which are limited in number, will necessarily diminish the whole, and the integrity of the collective will be lost.
DISCUSSION AND SUMMARY
Continuant entities that exist in any domain of the natural world stand in a variety of ontological relations to each other, including instantiation, parthood (P), overlap (O), location (LocIn), containment, and adjacency. These types of binary relations, denoted by R (Smith and Rosse, 2004; Donnelly et al., 2005), can hold between universals or particulars of any earth science discipline. We use consensual, agreed-upon terms to define the universals in our domain, and we represent them as classes in information systems. The relations that exist among the instances of a universal may not exist among the concepts (terms) that represent them. While only one instance of a material universal type can occupy a unique spatial region in a specific temporal region, many instances of a given universal type can synchronously exist in different spatial regions. The spatial regions of these simultaneously existing instances of the same universal type may or may not overlap. Smith and Rosse (2004) and Donnelly et al. (2005) introduced a refinement of each of the binary relations (R) of parthood (P), overlap (O), instantiation, and location (LocIn). In the following, R can stand for P, O, LocIn, or other types of binary relations. Donnelly et al. (2005) defined the R1(A, B), R2(A, B), and R12(A, B) relations among universal types A and B, depending on whether the restriction is put on the first argument (A) or the second (B). These R-structures are defined next (notice the all-some structure in the three cases). R1(A, B) = ∀x {Inst (x, A) → ∃y (Inst [y, B] ∧ Rxy)}, i.e., A is related to B (e.g., by instantiation, parthood, location) if all instances of A are related to some instance of B. This means that each A stands in the R relation (e.g., proper parthood) to some B. Notice that the emphasis is on A, i.e., something is true about A; hence the subscript 1 in R1, to emphasize the first argument. For example, the assertion PP1(ForearcBasin, SubductionComplex) means that each forearc basin is a properPartOf some subduction complex. This does not say that each subduction complex must have a forearc basin as a proper part. The assertion O1(SeismogenicZone, PlateBoundaryFaultZone) means that every seismogenic zone overlaps (i.e., shares a common part with) some plate-boundary fault zone under the prism. This assertion does not say that each plate-boundary fault zone must overlap a seismogenic zone. O1(SplayFault, Décollement) means that every splay fault overlaps some décollement, but not the other way around. LocIn1(Vein, Fracture) asserts that every vein is located in some fracture. It does not mean that every fracture
has a vein in it. PCoin1(AccretionaryPrism, SubductingPlate) asserts that every accretionary prism partially coincides with some subducting plate, but not every subducting plate has to partially coincide with an accretionary prism. Notice that we can assert PP1(AccretedSediment, AccretionaryPrism), but not PP1(UnderplatedSediment, AccretionaryPrism). We can assert PCoin1(Trench, SubductingPlate) but not PCoin1(SubductingPlate, Trench). R2(A, B) = ∀y {Inst (y, B) → ∃x (Inst [x, A] ∧ Rxy)}, i.e., the relation between A and B stands if all instances of B are related to some instance of A. Here, the restriction is on the second argument (B), i.e., it says that for each B, there is some A that stands in the R relation to it. For example, PP2(Sediment, ForearcBasin) states that each forearc basin has some sediment as a proper part. Notice that the statement does not say that each instance of sediment is a proper part of a forearc basin. The assertion O2(AccretionaryPrism, SlopeBasin) states that every slope basin overlaps some accretionary prism. However, it does not assert that every accretionary prism overlaps some slope basin. LocIn2(Sediment, ForearcBasin) states that every forearc basin has some sediment located in it. However, not all sediments are located in forearc basins. PCoin2(SubductionZone, UnderplatedSediment) states that every underplated sediment partially coincides with some subduction zone. It does not say that every subduction zone partially coincides with underplated sediment. R12(A, B) = R1(A, B) ∧ R2(A, B) conjoins the above two cases and states that each instance of A stands in the R relation to some instance of B, and each instance of B stands in the R relation to some instance of A. In this case, the restriction is on all instances of both A and B. For example, PP12(SubductingPlate, SubductionZone) says that each subducting plate is a proper part of a subduction zone, and each subduction zone has a subducting plate as a proper part. O12(SubductingSediment, PlateBoundaryFault) asserts that every subducting sediment overlaps some plate-boundary fault, and every plate-boundary fault overlaps some subducting sediment. LocIn12(SubductionZone, Ocean) states that every subduction zone is located in some ocean, and every ocean has some subduction zone. PCoin12(TransformFault, MidOceanRidge) asserts that every transform fault partially coincides with some mid-ocean ridge, and every mid-ocean ridge partially coincides with some transform fault. These R-structures, defined by Donnelly et al. (2005), provide a powerful means for spatial reasoning. A complete and comprehensive representation of knowledge in a specific earth science domain requires the application of these R-structures, and the axioms of the ontologies in these domains need to differentiate among the R1, R2, and R12 structures to enable effective spatial reasoning.
REFERENCES CITED
Allemang, D., and Hendler, J., 2008, Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL: Amsterdam, Morgan Kaufmann Publishers, 330 p.
Antoniou, G., and van Harmelen, F., 2008, A Semantic Web Primer: Cambridge, Massachusetts Institute of Technology Press, 264 p. Babaie, H., 2011, Modeling geodynamic processes with ontologies, in Keller, R., and Baru, C., eds., Geoinformatics: Cyberinfrastructure for the Solid Earth Sciences: Cambridge, UK, Cambridge University Press, p. 166–189. Bains, P., 2006, The Primacy of Semiosis: An Ontology of Relations: Toronto, University of Toronto Press, 186 p. Bittner, T., Donnelley, M., and Smith, B., 2004, Endurants and perdurants in directly depicting ontologies: AI Communications, v. 17, no. 4, p. 247– 258. Brachman, R.J., and Levesque, H.J., 2004, Knowledge Representation and Reasoning: Amsterdam, Morgan Kaufmann Publishers, 381 p. Breitman, K.K., Casanova, M.A., and Truszkowski, W., 2007, Semantic Web: Concepts, Technologies and Applications: Berlin, Springer-Verlag, 327 p. Carrara, M., and Sacchi, E., 2006, Propositions: An introduction, in Carrara, M., and Sacchi, E., eds., Propositions: Semantic and Ontological Issues: Amsterdam, Rodopi B.V. Publishing, p. 1–27. Casati, R., and Varzi, A.C., 1999, Parts and Places: The Structures of Spatial Representation: Cambridge, Massachusetts Institute of Technology Press, 238 p. Cohn, A.G., and Varzi, A.C., 2003, Mereological connection: Journal of Philosophical Logic, v. 32, p. 357–390, doi:10.1023/A:1024895012224. Colomb, R.M., 2007, Ontology and the Semantic Web: Amsterdam, IOS Press, 258 p. Cruse, D.A., 2002, Hyponymy and its varieties, in Green, R., Bean, C.A., and Myaeng, S.H., eds., The Semantics of Relationships: An Interdisciplinary Perspective: Information Science and Knowledge Management Series: Dordrecht, the Netherlands, Kluwer Academic Publishers, p. 3–21. Donnelly, M., 2004a, Layered mereology, in Gottlob, G., and Walsh, T., eds., Proceedings of the 18th Joint International Conference on Artificial Intelligence (JICAI 2003): San Francisco, Morgan Kaufman, p. 1269–1274. Donnelly, M., 2004b, A formal theory for reasoning about parthood, connection, and location: Artificial Intelligence, v. 160, p. 145–172, doi:10.1016/ j.artint.2004.06.003. Donnelly, M., 2004c, On parts and holes: The spatial structure of the human body, in Fieschi, M., et al., eds., MEDINFO 2004: Amsterdam, IOS Press, p. 351–355. Donnelly, M., Bittner, T., and Rosse, C., 2005, A formal theory for spatial representation and reasoning in biomedical ontologies: Artificial Intelligence in Medicine, v. 36, no. 1, p. 1–27, doi:10.1016/j.artmed.2005.07.004. Fellbaum, C., 2002, On the semantics of troponymy, in Green, R., Bean, C.A., and Myaeng, S.H., eds., The Semantics of Relationships: An Interdisciplinary Perspective: Information Science and Knowledge Management Series: Dordrecht, the Netherlands, Kluwer Academic Publishers, p. 22–34. Galton, A., 2000, Qualitative Spatial Change: New York, Oxford University Press, 409 p. Galton, A., 2004, Multidimensional Mereotopology, in Dubois, D., Welty, C., and Williams, M.-A., eds., Proceedings of the International Conference on the Principles of Knowledge Representation (KR’04): Menlo Park, California, AAAI Press, p. 45–54. Gerstl, P., and Pribbenow, S., 1996, A conceptual theory of part-whole relations and its applications: Data and Knowledge Engineering, v. 20, p. 305–322, doi:10.1016/S0169-023X(96)00014-6. 
Guarino, N., and Welty, C., 2002, Identity and subsumption, in Green, R., Bean, C.A., and Myaeng, S.H., eds., The Semantics of Relationships: An Interdisciplinary Perspective: Information Science and Knowledge Management Series: Dordrecht, the Netherlands, Kluwer Academic Publishers, p. 111–125. Hitzeler, P., Krotzsch, M., and Rudolph, S., 2009, Foundations of Semantic Web Technologies: Boca Raton, Florida, CRC Press, 427 p. Jouis, C., 2002, Logic of relationships, in Green, R., Bean, C.A., and Myaeng, S.H., eds., The Semantics of Relationships: An Interdisciplinary Perspective: Information Science and Knowledge Management Series: Dordrecht, the Netherlands, Kluwer Academic Publishers, p. 127–140. Kashyap, V., Bussler, C., and Moran, M., 2008, The Semantic Web: Berlin, Springer-Verlag, 404 p. Klein, G., and Smith, B., 2006, Concept Systems and Ontologies: Journal of Biomedical Informatics, v. 39, no. 3, p. 274–287. Koslicki, K., 2008, The Structure of Objects: New York, Oxford University Press, 288 p.
Lambe, P., 2007, Organizing Knowledge: Taxonomies, Knowledge and Organizational Effectiveness: Oxford, UK, Chandos Publishing, 277 p. Lambrix, P., 2000, Composite objects, in Part-Whole Reasoning: Lecture Notes in Artificial Intelligence, v. 1771: New York, Springer, p. 21–30. Levesque, H.J., and Lakemeyer, G., 2000, The Logic of Knowledge Bases: Cambridge, Massachusetts Institute of Technology Press, 282 p. Pontow, C., 2004, A note on the axiomatics of theories of parthood: Data and Knowledge Engineering, v. 50, p. 195–213, doi:10.1016/j.datak.2003.12.002. Pribbenow, S., 2002, Meronomic relationships: From classical mereology to complex part-whole relations, in Green, R., Bean, C., and Myaeng, S.H., eds., The Semantics of Relationships: An Interdisciplinary Perspective: Information Science and Knowledge Management Series: Dordrecht, the Netherlands, Kluwer Academic Publishers, p. 35–50. Raskin, R., and Pan, M.J., 2005, Knowledge representation in the semantic web for Earth and environmental terminology (SWEET), in Babaie, H.A., and Ramachandran, R., eds., Applications in Geosciences: Computers and Geosciences, v. 31, p. 1119–1125. Rector, A., Rogers, J., and Bittner, T., 2006, Granularity scale and collectivity: When size does and does not matter: Journal of Biomedical Informatics, v. 39, p. 333–349, doi:10.1016/j.jbi.2005.08.010. Schulz, S., and Hahn, U., 2005, Part-whole representation and reasoning in biomedical ontologies: Artificial Intelligence in Medicine, v. 34, no. 3, p. 179–200, doi:10.1016/j.artmed.2004.11.005. Schulz, S., Daumke, P., Smith, B., and Hahn, U., 2005, How to distinguish parthood from location in bio-ontologies, in Friedman, C.P., ed., American Medical Informatics Association (AMIA) Annual Symposium Proceedings 2005, p. 669–673. Schulz, S., Kumar, A., and Bittner, T., 2006, Biomedical ontologies: What partOf is and isn't: Journal of Biomedical Informatics, v. 39, p. 350–361, doi:10.1016/j.jbi.2005.11.003. Segaran, T., Evans, C., and Taylor, J., 2009, Programming the Semantic Web: Sebastopol, California, O'Reilly Media Inc., 282 p. Shanks, G., Tansley, E., and Weber, R., 2004, Representing composites in conceptual modeling: Communications of the Association for Computing Machinery, v. 47, no. 7, p. 77–80, doi:10.1145/1005817.1005826. Sider, T., 2001, Four-Dimensionalism: An Ontology of Persistence and Time: Oxford, UK, Clarendon Press, 255 p.
Simon, P., 1987, Parts: A Study in Ontology: Oxford, UK, Clarendon Press, 390 p. Sinha, K., ed., 2006, Geoinformatics: Data to Knowledge: Geological Society of America Special Paper 397, 288 p. Smith, B., 2001, Fiat objects: Topoi, v. 20, no. 2, p. 131–148, doi:10.1023/A: 1017948522031. Smith, B., 2003, Ontology, in Floridi, L., ed., Blackwell Guide to the Philosophy of Computing and Information: Oxford, UK, Blackwell, p. 155–166. Smith, B., 2004, Beyond concepts: Ontology as reality representation, in Varzi, A., and Vieu, L., eds., Formal Ontology and Information Systems: Amsterdam, IOS Press, p. 73–84. Smith, B., and Brogaard, B., 2003, Sixteen days: The Journal of Medicine and Philosophy, v. 28, p. 45–78, doi:10.1076/jmep.28.1.45.14172. Smith, B., and Grenon, P., 2004, The cornucopia of formal-ontological relations: Dialectica, v. 58, p. 279–296, doi:10.1111/j.1746-8361.2004 .tb00305.x. Smith, B., and Rosse, C., 2004, The role of foundational relations in the alignment of biomedical ontologies, in Fiechi, M., et al., eds., Proceedings MEDINFO 2004: Amsterdam, IOS Press, p. 444–448. Smith, B., Ceuster, W., Kellages, B., Kohler, J., Kumar, A., Lomax, J., Nungall, C., Neuhaus, F., Rector, A.L., and Rosse, C., 2005, Relations in biomedical ontologies: Genome Biology, v. 6, p. R46, doi:10.1186/gb-2005-6 -5-r46. Tobin, H.J., and Kinoshita, M., 2006, NanTroSEIZE: The IODP Nankai Trough Seismogenic Zone Experiment: Scientific Drilling, v. 2, p. 23–27. Tversky, B., 1990, Where partonomies and taxonomies meet, in Tsohatzidis, S.L., ed., Meanings and Prototypes: Studies in Linguistics and Categorization: New York, Routledge, p. 334–344. Varzi, A.C., 1996, Parts, wholes, and part-whole relations: The prospect of mereotopology: Data and Knowledge Engineering, v. 20, p. 259–286, doi:10.1016/S0169-023X(96)00017-1. Varzi, A.C., 2003, Mereology, in Zalta, E.N., ed., Stanford Encyclopedia of Philosophy: CSLI (Center for the Study of Language and Information) Internet publication: http://plato.stanford.edu/contents.html (accessed May 2011). MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
The Geological Society of America Special Paper 482 2011
Geoscience metadata—No pain, no gain Jeremy R.A. Giles British Geological Survey, Keyworth, Nottingham NG12 5GG, UK
ABSTRACT Metadata are an essential tool for use in the management of geoscientific information. Well-managed metadata provide a number of key information management functions, including facilitating data discovery and providing a robust framework for information asset management. The realization of these and other benefits is predicated on the existence of well-maintained metadata. Sadly, metadata are commonly absent, incomplete, inaccurate, inarticulate, or obsolete. Some of the benefits and opportunities that arise from well-managed metadata collections are discussed here. The rapid development of spatial data infrastructures means that maintenance of metadata for geoscience information is becoming increasingly important.
INTRODUCTION
The phrase "information entropy" was used by Michener et al. (1997) to describe the tendency for stored information to become more disordered over time. Metadata provide a tool that has been developed to slow this inexorable deterioration of the value of the information content of a data set. Metadata's high-level purpose is to maintain the continuing usefulness of information so that it can be understood, exploited, reused, and repurposed. In practice, the high-level purpose of metadata is fulfilled through its implementation in three principal roles or functions. The first role is described as "discovery metadata." As the name suggests, this is primarily a tool that supports identification of an information object that might be suitable for an intended purpose. The second metadata role involves its use to describe the information object in such a way that the potential user comes to an understanding of both the context in which the underlying data set was created and processed, and its potential and limitations. The third role of metadata is as an asset management tool for use by the data set manager responsible for their long-term curation. A good metadata record enables the potential user of a data set to understand the contents of the resource, its suitability for a proposed purpose, and its limitations. On the simplest level, metadata should give the reader of the record a clear understanding of: (1) the information content of the data set; (2) the reason(s) for data collection; (3) the location(s) where the data were collected; (4) the time at which, or interval during which, the data were collected; (5) the person, team, or organization that collected the data; and (6) the methods of data collection. The term "metadata" is relatively new—it was first used in print in 1973. However, the recording process that the term encompasses has a long history. Terms such as registry, directory, and catalogue describe records that fulfill a similar function. Numerous data catalogues and data directories have been, and continue to be, published physically by organizations wishing to make potential users aware of their data holdings. With the advent of the Internet, many of these resources have been migrated progressively into an Internet-compatible form. For example, the National Aeronautics and Space Administration Global Change Master Directory (http://gcmd.nasa.gov), which
Giles, J.R.A., 2011, Geoscience metadata—No pain, no gain, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 29–33, doi:10.1130/2011.2482(03). For permission to copy, contact [email protected]. © 2011 The Geological Society of America. All rights reserved.
was created in 1987 to enable users to locate and obtain access to earth science data sets, is a sophisticated metadata collection. By the last decade of the twentieth century, numerous metadata collections were available. However, many of them followed organizational, community, national, or regional standards, and, as a result, they lacked interoperability. Geoscience information is commonly difficult and thus expensive to acquire. As a result, geoscientists have become adept at reusing and repurposing previously acquired information (Wood and Curtis, 2004). In many countries, a legislative framework exists to sustain reuse of commercially acquired geoscientific information to support the work of regulators and geological survey organizations. For example, in Great Britain, the Water Resources Act of 1991 requires information collected during the drilling of water wells to be supplied to the Natural Environmental Research Council, the parent body for the British Geological Survey. In some cases, existing legal frameworks are supplemented by informal agreements for data sharing that include data reuse in academic research. However, anyone who has attempted such reuse of information might have found it to be a frustrating and time-consuming activity. Peebler (1996) estimated that the average exploration and production professional spent from 20% to 30% of their total project time searching for, loading, and formatting data. Other studies have suggested that this figure can be much higher for exploration in new areas. In addition, without a clear understanding of the context in which the data sets were originally created, their value and usefulness can be questioned. Generally, the older the data set, the more difficult it is to understand in the absence of appropriate metadata. This problem is well understood by Michener et al. (1997), who described the typical degradation of information content associated with data over time. They identified a number of potential inflection points within the general curve that represents the decay of the information content value. Examples include the retirement, career change, or death of the data set creator(s) and advances in storage technology. BENEFITS OF METADATA A well-formed metadata record provides a wide range of benefits (IGGI Working Group on Metadata Implementation, 2004). It enables faster and easier discovery of information, which facilitates communication and allows development of a comprehensive understanding of existing data sets. Such knowledge will in turn prevent the re-creation of a data set that has previously been compiled, and avoid the additional costs that such re-creation would incur. After discovery of one or more data sets, the well-formed metadata records allow accurate judgments to be made about the potential for reuse and repurposing in ongoing research or in the development of new products and services. A clear understanding of the context of the collection of a raw data set and its subsequent processing contributes to the long-term preservation of the data set. Metadata are an essential tool for those who manage data sets as an organizational asset because they provide the key
information that enables managers to make and justify meaningful decisions. The Global Spatial Data Infrastructure Association has defined spatial data infrastructure as "the relevant base collection of technologies, policies, and institutional arrangements that facilitate the availability of and access to spatial data" (GSDI, 2004, p. 8). Metadata represent a fundamental element of such a spatial data infrastructure. An example of a spatial data infrastructure initiative is the Infrastructure for Spatial Information in the European Community, or INSPIRE (Directive 2007/2/EC). The European Union has recognized that the general situation relating to spatial information in Europe is one of fragmentation of data sets and sources, gaps in availability, lack of harmonization between data sets at different geographical scales, and duplication of information collection. These problems make it difficult to identify, access, and use the data that are available. There is a clearly recognized need for quality georeferenced information to support the development and understanding of the complexity and interactions among human activities and environmental pressures and the associated impacts. Geology, characterized according to composition and structure, including bedrock, aquifers, and geomorphology, is named specifically in the INSPIRE Directive, along with soils, natural risks (such as landslides, earthquakes, and volcanoes), energy resources, and mineral resources. The requirements of INSPIRE are some of the drivers underlying the OneGeology initiative (www.onegeology.org), which aims to create dynamic geological map data of the world that will be made available via the Web. Other examples of spatial data infrastructures include the Australian Spatial Data Infrastructure (ASDI), United Nations Spatial Data Infrastructure (UNSDI), and the U.S. National Spatial Data Infrastructure (NSDI). The primary element of a spatial data infrastructure is good-quality geospatial metadata. A standards framework was put in place by the International Organization for Standardization (ISO) with the development of standards for geographic information metadata (ISO 19115:2003 [ISO, 2003]) and an extensible markup language (XML) schema implementation for the metadata (ISO/TS 19139:2007 [ISO, 2007]). This primary importance of metadata is also reflected in INSPIRE, where metadata are the first element of the implementation road map. Carefully targeted metadata collections can contribute to the generation of real economic benefit. In 2000, the UK government identified the need to stimulate oil and gas investment on the UK continental shelf by attracting new niche companies that had the skills necessary to exploit previously undeveloped discoveries by utilizing technically innovative solutions. The aim was to ensure that indigenous oil and gas production remained at significant levels so that it would continue to contribute about $70 billion a year to the UK economy. A key element of the strategy was the development of metadata-based information resources that would provide details about the information that was available and where to obtain it. The result was DEAL (www.ukdeal.co.uk), which is sponsored by the UK Department of Energy and Climate Change and the industry body Oil and Gas UK. DEAL
publishes information on (1) oil and gas wells; (2) seismic surveys; (3) infrastructure, such as pipelines; (4) licenses; (5) fields; and (6) strategic environmental assessments. One of the principal benefits of metadata for public-sector bodies, such as geological survey organizations, is the reduction of legislative compliance risk. Many countries have established freedom of information legislation under which public-sector bodies are required to respond to requests for information within short time scales. Such legislation is predicated on the existence of sufficient metadata to enable resource discovery. The European Union is in the process of implementing a long-term strategy on improving access to, and the sharing of, environmental information, including many aspects of the geosciences. This includes improving public access to environmental information (Directive 2003/4/EC), establishing an infrastructure for spatial information in the European Union (INSPIRE), and drafting a directive for a Shared Environmental Information System (SEIS). Each of these directives is being, or will be, transposed into national legislation in the European Union member states, where individual geological survey organizations will be required to create, disseminate, and maintain metadata to comply with their legal obligations.
THE PROBLEM WITH METADATA
In a perfect world, every data set would be fully and clearly described by a complete metadata record. The record would be maintained regularly, so that the information content remained up-to-date and accurate. Individual metadata records would take into account similar records already in existence and would ensure that any such pair of records reflected their close relationship while clearly describing any distinctive features. This would guarantee that appropriate database searches would be able to recall apposite information with precision. "Recall" and "precision" are terms originally used in this context within the library community. Recall describes the capability to discover relevant records; a search that misses a lot of relevant information is described as having poor recall. A search that recalls relevant records along with numerous irrelevant ones is said to have poor precision; the user finds it difficult to identify the valuable records amongst the numerous returns. Many simple searches using Internet search engines have poor precision. In the real world, individual metadata records fall far short of the ideal. Poor-quality metadata can lead to misleading conclusions and costly mistakes, yet few people understand the nature of the errors associated with their own metadata. The principal obstacle to the creation and maintenance of a well-formed metadata record is poor management control. This shortfall expresses itself in the form of incomplete, inaccurate, internally inconsistent, inarticulate, and obsolete records. To expect data creators to compile complex and accurate metadata records routinely and willingly is optimistic. An appropriate management framework must exist to promulgate policies that will motivate data creators to ensure that the metadata records are created and maintained.
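Recall and precision, as used above, have simple operational definitions, and it can be useful to compute them when evaluating a search over a metadata catalogue. The sketch below, in Python with made-up record identifiers, is one minimal way to do so:

```python
def recall_and_precision(retrieved, relevant):
    """Recall: fraction of the relevant records that were found.
    Precision: fraction of the retrieved records that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 1.0
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    return recall, precision

# A search that returns relevant records buried among irrelevant ones:
relevant = {"borehole-logs", "water-well-records", "aquifer-properties"}
retrieved = {"borehole-logs", "water-well-records",
             "annual-report-1998", "press-release-23", "staff-directory"}
print(recall_and_precision(retrieved, relevant))  # recall ~0.67, precision 0.4
```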
Another persistent problem occurs when metadata aggregators harvest third-party collections to compile more extensive thematic or regional metadata collections. Generally, these records are copied to the aggregators’ metadata collections, and, if for some reason the link with the original record is broken, the copy commonly continues to be published in its obsolete form. It is not uncommon to find such obsolete records, many over a decade old, still being published by metadata aggregators. REALIZING THE BENEFITS OF METADATA For an organization to be able to create, maintain, and disseminate its metadata in a consistent and reliable manner, there must be an appropriate management framework in place to ensure success. Without the subjective pain of an investment of time and resources, there will be no gains; the benefits will not be realized. The steps required are straightforward. (1) Establish a metadata policy. (2) Adopt an appropriate standards framework. (3) Initiate and support metadata collection. (4) Disseminate the metadata. (5) Maintain the metadata. The first and most significant step for any organization wishing to implement systematic, organization-wide metadata is to define an achievable metadata policy. This is a set of broad, high-level principles that form the guiding framework within which the metadata management can operate. This policy must be embraced formally by senior management and supported by appropriate resources and authority. The metadata policy would normally be a subsidiary part of the organizational data or information policy, and as such should cross-refer to it. Experience has shown that a successful metadata policy implementation needs committed managerial backing at the highest level within an organization. The compilation of an individual well-formed metadata record by someone who has knowledge and experience of the data set is a trivial task. However, the management overhead of convincing many busy staff across an entire organization that they should each provide information individually is considerable. A strong and unambiguous high-level directive is absolutely essential. Staff members must be left with no doubt regarding the benefits of undertaking the required actions and the penalties for not doing so. Thus, it is also important that policy statements have a champion who will own the policy at an executive level within an organization. The policy champion will be accountable for ensuring compliance within the organization and identifying the resources required to create, maintain, disseminate, and exploit the metadata resource. In recent years, the International Organization for Standardization (ISO) has established the currently definitive standards framework for metadata. Its adoption and implementation will eventually remove the previously complex situation presented by numerous independent organizational, community, national, or regional standards. Many organizations are still migrating their metadata collections to the appropriate ISO standards, and
it will still be some years before interoperability becomes normal. However, any organization starting to create a new metadata collection now, or reengineering an existing collection, should adopt the relevant ISO standards as the core of its metadata policy. The primary metadata standard for nonspatial data is ISO 15836:2009 (ISO, 2009), and for spatial data, the metadata standard is ISO 19115:2003 (ISO, 2003). Because the ISO standards do not yet meet all requirements, there is a continuing need for community standards to extend the ISO standards. The first major task following the adoption of a metadata policy and a standard is to initiate compilation of metadata across the organization. This should be managed as a single project undertaken by a data set expert who is given the task of identifying the organization's data sets and one or more individuals who understand the specific resources. The quickest way to ensure that the metadata are compiled in a consistent and complete form is to have each of the data set experts interviewed by a metadata steward who is responsible for creating the metadata record. Dissemination of an organization's metadata can be arranged in a number of ways. At its simplest, the metadata inventory could be published regularly as a searchable document on an internal server or intranet. It could be published regularly as sets of static Web pages or as a query form Web page linking to a metadata database. The better way is as a Web service publishing the ISO/TS 19139:2007 (ISO, 2007) XML implementation of ISO 19115:2003 (ISO, 2003). This approach gives users the greatest flexibility in how they select, manipulate, and present metadata from one or more organizations. Once the metadata for an organization have been compiled, it is essential that they be maintained actively, or their currency and value will depreciate over time. Maintenance is required on several levels, and each of these needs to be considered. To realize the full potential of a metadata collection, it is commonly necessary to maintain it over long periods. Even with investment in technically sophisticated search tools, such systems will find little user acceptance if the data are incomplete or not up-to-date. Hence, following initial metadata compilation, the subsequent metadata maintenance is of major importance. One role of the metadata steward is to ensure that maintenance is carried out consistently over a long period. The shelf life of a metadata record is surprisingly short, and records must be reviewed and updated on a regular cycle. Contact details within metadata can only be described as volatile, and they need reviewing more frequently than other elements. Other aspects of the metadata record can be reviewed on different cycles, depending upon whether the data set described is static or dynamic. Such reviews should not be left to the individual data set experts because their exit from the organization might be the cause of the records' obsolescence. Review should be undertaken by the metadata steward, who is responsible for maintaining the metadata records. In addition to reviewing the metadata content, the steward should also check the currency of the underlying metadata standard, in case it has become outdated following revision of the ISO standards.
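Much of the review cycle described above can be scripted. The following sketch (field names and review periods are illustrative, loosely following the six questions listed in the introduction rather than any ISO 19115 element set) flags records that are incomplete or overdue for review:

```python
from datetime import date, timedelta

# Minimal discovery-level fields; the names are illustrative placeholders,
# not ISO 19115 element names.
REQUIRED = ("title", "abstract", "extent", "temporal_coverage",
            "originator", "method", "contact", "last_reviewed")

# Illustrative review cycles: contact details are volatile, so review them more often.
REVIEW_AFTER = {"contact": timedelta(days=180), "default": timedelta(days=365)}

def problems(record, today=None):
    """Return a list of reasons why a metadata record needs attention."""
    today = today or date.today()
    issues = [f"missing field: {f}" for f in REQUIRED if not record.get(f)]
    last = record.get("last_reviewed")
    if last and today - last > REVIEW_AFTER["default"]:
        issues.append("record overdue for full review")
    if last and today - last > REVIEW_AFTER["contact"]:
        issues.append("contact details overdue for review")
    return issues

example = {"title": "Water well records",
           "abstract": "Logs supplied under the Water Resources Act 1991",
           "extent": "Great Britain", "originator": "BGS",
           "contact": "enquiries@...", "last_reviewed": date(2010, 1, 15)}
print(problems(example, today=date(2011, 6, 1)))
# ['missing field: temporal_coverage', 'missing field: method',
#  'record overdue for full review', 'contact details overdue for review']
```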
KEY ROLES IN METADATA IMPLEMENTATION
There are three key roles in the process of managing metadata. A successful metadata policy implementation requires authority from the highest level within the organization. The metadata policy champion is the owner of the policy within the organization and is responsible for ensuring that appropriate authority and resources are available to implement and maintain the policy. A metadata steward is responsible for day-to-day management of the organization's metadata and for ensuring that the metadata record is comprehensive and meets the standards adopted by the organization. Stewards report to the policy champion and can call upon authority and resources to fulfill their role. They work with the data set experts to compile new metadata records and maintain existing ones. When data set experts change roles or leave the organization, the steward ensures that a replacement is identified. Finally, the steward checks periodically whether the ISO standard used by the organization has been updated. The data set expert is responsible for providing the metadata steward with sufficient information to create and maintain a well-formed metadata record.
CONCLUSION
The whole geoscience community would benefit if high-quality metadata were available. Such metadata would enable the rapid discovery, accurate assessment, and easier management of information assets; their absence adds considerable hidden costs to the whole sector in terms of lost time and lost opportunities. Metadata have a poor reputation as being ineffectual and expensive. This reputation is rightly deserved because many organizations fail to put the management framework in place to ensure that well-formed metadata records are created, maintained, and disseminated. However, metadata are a powerful tool that can provide a wide range of benefits to organizations that have the discipline to manage metadata activity effectively. One way to improve the situation is to recognize and criticize existing published records that do not meet the needs of the users. Take time to e-mail the record owner and explain metadata shortfalls or why the records are inaccurate or out of date. Use every opportunity to peer-review existing metadata records.
REFERENCES CITED
Directive 2003/4/EC, 2003, On public access to environmental information: Official Journal of the European Union, L&C, v. L41, p. 26. Directive 2007/2/EC, 2007, On establishing an Infrastructure for Spatial Information in the European Community (INSPIRE): Official Journal of the European Union, L&C, v. L108, p. 1. Global Spatial Data Infrastructure Association (GSDI), 2004, Developing Spatial Data Infrastructures: The SDI Cookbook, Version 2.0: Global Spatial Data Infrastructure Association, www.gsdi.org/docs2004/Cookbook/cookbookV2.0.pdf (accessed 20 April 2011). IGGI (Intragovernmental Group on Geographic Information) Working Group on Metadata Implementation, 2004, The Principles of Good Metadata Management (2nd ed.): London, Office of the Deputy Prime Minister, 30 p.
International Organization for Standardization (ISO), 2003, ISO 19115:2003, Geographic Information—Metadata: Geneva, Switzerland, International Organization for Standardization, 140 p. International Organization for Standardization (ISO), 2007, ISO/TS 19139:2007—Geographic Information—Metadata: XML Schema Implementation: Geneva, Switzerland, International Organization for Standardization, 111 p. International Organization for Standardization (ISO), 2009, ISO 15836:2009—Information and Documentation: The Dublin Core Metadata Element Set: Geneva, Switzerland, International Organization for Standardization, 5 p. Michener, W.K., Brunt, J.W., Helly, J., Kirchner, T.B., and Stafford, S.G., 1997, Non-geospatial metadata for the ecological sciences: Ecological Applica-
tions, v. 7, p. 330–342, doi:10.1890/1051-0761(1997)007[0330:NMFTES ]2.0.CO;2. Peebler, R., 1996, Extended integration: The key to future productivity leap: Oil & Gas Journal, v. 94, no. 21, p. 57–61. Wood, R., and Curtis, A., 2004, Geological prior information and its application to geoscientific problems, in Curtis, A., and Wood, R., eds., Geological Prior Information: Informing Science and Engineering: Geological Society of London Special Publication 239, p. 1–14. MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
The Geological Society of America Special Paper 482 2011
Geoscience data and derived spatial information: Societal impacts and benefits, and relevance to geological surveys and agencies R.A. Hughes British Geological Survey, Sir Kingsley Dunham Centre, Keyworth, Nottingham NG12 5GG, UK
ABSTRACT Low levels of geospatial literacy and geoscientific understanding mean that basic geological map data are meaningful to, and can therefore be interpreted by, a remarkably small number of people with specialist knowledge and training. Nevertheless, geological maps continue to underpin the exploration, exploitation, and management of natural resources such as fossil fuels, minerals, and groundwater. Geological maps can, however, be the essential basis for derived, spatial geoscience information with which complex science relating to societally relevant issues such as geohazards can be communicated meaningfully to the layperson. Such derived spatial geoscience information offers opportunities for geological surveys and agencies to demonstrate societal relevance by creating social and economic benefits. Production and delivery of such information from complex geoscientific data should therefore be central to the mission of geological surveys and agencies. This pathway is traced from data to information and knowledge of use in decision making. Societal benefits and impacts are described and quantified using case studies and independent economic impact analysis data.
WHO USES GEOLOGICAL MAPS?

Geospatial Literacy

The ability to understand even simple topographical mapping requires of the user knowledge of fundamental concepts such as scale, azimuthal orientation, elevation, contours, and map symbology. Nevertheless, the proportion of the population able to use such topographical mapping is on the rise, at least in the UK. This is due to many factors, including the ready availability of online mapping from established providers, the increasingly routine availability and use of global positioning system (GPS) satellite positioning and navigation systems, and new initiatives enabled by these technologies (e.g., OpenStreetMap [www.openstreetmap.org] and OpenCycleMap [www.opencyclemap.org]). Innovative outreach programs from state topographical mapping agencies such as the UK's Ordnance Survey (OS) also play a part. The OS, for example, provides paper topographical maps free of charge to secondary school pupils (the 11–16 age group) in the UK. With links into the school curriculum, it is clear that this initiative is increasing the level of understanding of topographical mapping by children in the UK from all socioeconomic backgrounds.
Hughes, R.A., 2011, Geoscience data and derived spatial information: Societal impacts and benefits, and relevance to geological surveys and agencies, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 35–40, doi:10.1130/2011.2482(04). For permission to copy, contact [email protected]. © 2011 The Geological Society of America. All rights reserved.
Geoscientific Literacy To interpret a geological map, not only do the fundamentals of scale, azimuthal orientation, and elevation need to be understood, but there must also be a familiarity with highly specialized concepts including stratigraphy and stratigraphical relationships, geological structure, and complex symbology, and an ability to visualize structural, stratigraphic, and lithodemic relationships in three dimensions. So, what proportion of a population has the necessary skills and knowledge to both produce and interpret geological map data to a professional standard? In trying to answer this question, it is useful to consider a few national examples, and it is necessary also to accept the basic assumption that these types of interpretations (and productions) require graduate-level geoscience knowledge and skills. Natural resource exploration and production represent a major contribution to the Canadian economy. Between May 2008 and May 2009, oil, gas, and mineral production were valued at CN$49.9 billion or 4.2% of gross domestic product (GDP) (Statistics Canada [www.statcan.gc.ca]), while during the same period, exploration investment was estimated at CN$2.8 billion (The Northern Miner, 2009). Canada’s population is ~33 million, and in 2008, Canadian universities produced around 1200 geoscience graduates. In 2008, therefore, a paltry additional 0.004% of the Canadian population developed the expertise to interpret a geological map to acceptable professional standards. In the UK, which has an economy that is less dependent upon resource exploration and production, but to which geoscience is equally important for other environmental management reasons, ~1300 earth sciences (geology, environmental sciences, etc.) students graduated in 2008. With a national population of ~60 million, this means that in 2008, an extra 0.002% of the UK population developed the expertise to interpret a geological map to professional standards. Whatever the errors in these crude figures, it is beyond dispute that the proportions of the populations of these countries able to interpret geological maps are extremely low. A further relevant trend lies in the numbers of geoscience graduates that do not pursue careers in geoscience, and whose essential skills in producing and interpreting geological maps are effectively lost to their national economies. The American Geological Institute (2009), for example, reported that less than 13% of the ~6000 new U.S. geoscience bachelor’s majors graduating in the fall of 2008 will ever work professionally in geoscience. Similarly (but slightly less pessimistically), of the ~1300 earth science graduates produced annually in the UK, only half of those will find their first employment in the earth sciences. We can conclude, therefore, that although levels of topographical mapping literacy and use are increasing (at least in the UK), the proportion of the population of major nations that can produce and interpret basic geological mapping to professional standards is extraordinarily small. Furthermore, in the cases of the UK and United States in 2008, at least half (and in the United
States significantly more) of those graduating from universities with the appropriate skills are lost to the profession, either through deliberate career choice decisions or because of the unavailability of geoscience employment. The taxes and royalty contributions of the extractive industries to national economies can be huge. Since these industries are reliant on basic geological map data, it follows that the underpinning contribution of such data to the development and productivity of resource-dependent economies is also very great. However, taking into account the extremely low levels of geospatial and geoscientific literacy described here, it is evident that basic geological map data are largely meaningless to the overwhelming proportion of the population, who lack the skills and knowledge required to interpret such data. There are, however, very real and quantifiable social and economic impacts and benefits in high-resolution spatial geoscience data sets that can be derived from basic geological map data.

DERIVED SPATIAL GEOSCIENCE INFORMATION

Geological surveys and environmental agencies in tectonically and volcanically active parts of the globe have for many years been using primary geological mapping as the baseline data set from which to derive geohazard maps for use in civil planning, emergency planning and response, and engineering design. The need to understand and mitigate the potential hazards of earthquakes in particular has focused the efforts of agencies on producing geospatial information that can be used by nonspecialists. Examples include seismicity, ground shaking, induced landslide potential, and both probabilistic and deterministic liquefaction potential maps developed specifically for urban planning and development, disaster mitigation, and response planning (see, for example, U.S. Geological Survey, 2009). While tsunami hazard potential maps for some parts of the Pacific coast of North America have existed for many years, one consequence of the devastating Indian Ocean tsunami of 26 December 2004 was the focusing of research efforts on mapping zones of tsunami hazard along previously poorly understood populous coastlines. As a result, tsunami hazard maps and related information (at least at low resolutions) now exist for most of the world's populous coastlines, including those areas known to be at high risk such as Japan (e.g., Government of Japan Cabinet Office, 2004) and those areas where tsunami hazard potential is generally regarded as low (e.g., Schmidt-Thomé, 2006). Similarly, pyroclastic flow and mud-flow hazard potential maps are available for many of the world's volcanoes that are located in populous areas (see, for example, Orsi et al., 2004). While large parts of Earth's surface are seismically relatively inactive and distant from active volcanic sources, they may still be vulnerable to more insidious but nevertheless potentially damaging geological hazards. As custodians of unique geoscience data holdings and owners of sometimes unique in-house expertise, geological surveys and agencies have real opportunities to demonstrate societal relevance and generate economic benefits
by producing geospatial information characterizing vulnerability to these hazards.

From Data to Information and Knowledge for Decision Making—The Knowledge Transfer Dimension

In 2001, the British Geological Survey (BGS) launched the world's first national digital vectorized and attributed geological map at the 1:50,000 scale (DiGMapGB; see British Geological Survey, 2009a). In 2003, the BGS initiated a program with the specific objective of using the DiGMapGB digital geological map data as the basis from which to produce societally useful, derivative national geohazard potential information. Within this program, the BGS has since 2003 produced, and annually updated, national geohazard potential data sets, including ground stability, natural radon potential, flooding (groundwater, coastal, and riverine), permeability, non–coal mining, and potentially harmful elements in soils. The BGS national ground stability data set (GeoSure; Fig. 1; see also British Geological Survey, 2009b), for example, consists of multilayer geospatial information (i.e., a "geographic information system [GIS] data set") that gives indicative hazard potential values for six shallow geohazards, namely, swelling and shrinking clays, soluble rocks, landslides, compressible rocks, collapsible rocks, and unconsolidated running sands. First launched in 2003, it is now widely used by both the public and commercial sectors in the UK for property-specific searches as a preliminary step to property sale/purchase transactions and by operators of major infrastructure such as highway and railway networks. GeoSure is also now used by major UK insurers in setting premium charges for buildings insurance. Effective knowledge transfer is integral to the success of such geoscience spatial data sets, enabling them to be used by laypersons with no specialist geoscience knowledge or understanding. The GeoSure landslide hazard potential layer, for example, is produced by multiparameter analysis with expert knowledge and validation (Table 1). However, those parameters, data, and expertise are translated into information expressed in plain English, so that complex science can be used as a basis for decision making by unqualified persons, therefore increasing the reach, value, and impact of the science. Significantly for the layperson, this high-resolution geospatial information can be interrogated on a location- or property-specific basis, or by postal code (ZIP code). The BGS offers its own such site-specific reporting information service (GeoReports; British Geological Survey, 2009b), or the user can purchase the same BGS information supplied by one of many private-sector "value-added resellers."

Economic Benefits of Derived Geoscience Spatial Information

In 2006, the BGS GeoSure natural ground stability hazard potential information was used as a case study by the economists
at PricewaterhouseCoopers (PwC) in an investigation into the economic benefit of environmental research, commissioned by the BGS's parent body, the Natural Environment Research Council (NERC) (Natural Environment Research Council, 2006). PwC concluded the following:
(1) By using this information, decision makers are empowered to make better-informed decisions, and they can avoid future costs and prevent loss of investment by avoiding or mitigating subsidence incidents.
(2) BGS information on subsidence risk, provided at postal code and household level, is "accurate and relevant to user needs," responsive to climate change impacts, and "meets the needs of the information age."
(3) By using this information, financial and social costs can be avoided through avoiding investing in areas at risk of subsidence, or taking preemptive action and mitigating subsidence.
(4) Using this information, wider societal benefits can be created, such as avoidance of stress, injury, and disruption associated with loss of property.
Noting that the annual cost of subsidence to the UK insurance industry is ~£300 million, PwC concluded that use of the BGS ground stability information could save UK insurers between £70 million and £270 million in reduced payouts between 2006 and 2030.
The BGS–Health Protection Agency (HPA) natural radon hazard potential information for England and Wales was launched in 2007 (see Miles et al., 2007). This high-resolution spatial data set (see also British Geological Survey, 2007) was produced using a methodology that combines empirical radon measurements with digital geology (Miles and Appleton, 2005), and it can be interrogated on a property- or location-specific basis. Property owners can use the information to find out indicative natural radon potential levels at their properties or locations (see Fig. 2), and to decide on remedial action if necessary. Similarly, builders and developers of new homes and commercial premises can factor the same information into the designs of new buildings to mitigate the effects of natural radon. Exposure to indoor radon is the largest contributor to the radiation exposure of the population (Miles and Appleton, 2005). It is estimated that about 1100 persons die each year in the UK due to lung cancers caused by natural radon (Independent Advisory Group on Ionising Radiation, 2009). The risk increases significantly when radon exposure is combined with other carcinogens such as tobacco smoke. In many parts of the world, natural radon almost certainly causes more deaths than any other environmental hazard. Unlike the BGS ground stability information, the BGS-HPA natural radon potential information has not yet been subject to independent economic impact analysis. However, it seems highly likely that such an analysis would conclude that use of the information could lead to the avoidance of a significant number of potentially fatal lung cancers each year.
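The radon mapping approach, which combines indoor measurements with digital geology, can be illustrated with a deliberately simplified sketch. The grouping of measurements by geological unit and the 200 Bq/m3 UK action level reflect published UK practice, but the unit names, measurement values, and the simple exceedance fraction below are illustrative only; the published BGS-HPA method uses more sophisticated statistical modelling and grid averaging.

```python
# Simplified sketch: estimate an indicative radon potential per geological
# unit as the fraction of measured dwellings at or above the UK action
# level (200 Bq/m3). Values below are invented for illustration.
from collections import defaultdict

ACTION_LEVEL = 200.0  # Bq/m3

# (geological_unit, indoor_radon_Bq_per_m3) pairs -- illustrative values
measurements = [
    ("Carboniferous limestone", 310.0), ("Carboniferous limestone", 95.0),
    ("Carboniferous limestone", 240.0), ("Alluvium", 20.0),
    ("Alluvium", 45.0), ("Granite", 520.0), ("Granite", 180.0),
]

by_unit = defaultdict(list)
for unit, value in measurements:
    by_unit[unit].append(value)

for unit, values in by_unit.items():
    exceed = sum(v >= ACTION_LEVEL for v in values) / len(values)
    print(f"{unit}: {exceed:.0%} of {len(values)} homes at/above action level")
```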
Figure 1. (Left) GeoSure landslide hazard potential derivative layer. (Right) GeoSure swell-shrink clay hazard potential derivative layer. Hazard classes shown in each panel: Significant, Moderate, and Low to nil; scale bar 0–100 km. Copyright © British Geological Survey (Natural Environment Research Council); used with permission.
TABLE 1. FROM DATA TO INFORMATION: THE BRITISH GEOLOGICAL SURVEY GeoSure LANDSLIDE HAZARD POTENTIAL DATA SET SOURCE DATA AND INFORMATION PROVIDED TO USERS

Data sources and parameters:
• Lithology
• Structure
• Geotechnical properties
• Porosity, permeability
• Groundwater, natural springs
• Digital terrain models
• Slope angle and class
• BGS national landslide database
• Expert and geographic information system validation processes

Information provided for decision making:
"High hazard potential: Slope instability problems almost certainly present and may be active. Significant constraint on land use."
"Low hazard potential: Slope instability problems are not believed to occur, but consideration should be given to potential problems of adjacent areas impacting on the site."
Figure 2. Natural radon hazard potential data set for part of southwest England. Copyright © British Geological Survey (Natural Environment Research Council) and Health Protection Agency; used with permission.
SOCIETALLY RELEVANT GEOSCIENCE DATA AND THE FUTURE OF GEOLOGICAL SURVEYS AND AGENCIES State budgets for many geological surveys and agencies are under severe downward pressure. While basic geological map data continue to underpin the exploitation and management of natural resources, beyond these essential applications, they offer little value to wider society, in which there are extremely low levels of geoscientific literacy. There are, however, great—and largely untapped—opportunities for all geological surveys and agencies to demonstrate their relevance to governments and their wider populations by generating geoscience information that has broad societal reach and that can yield real and quantifiable social and economic benefits. The UK government has introduced policies (Research Councils UK, 2007) that put a clear expectation upon public-sector research to generate significant increases in the social and economic impacts and benefits of their research programs. It is not known if similar far-sighted policies exist in other countries. However, by virtue of their unique assets of expertise and national data holdings, geological surveys and agencies are extremely well placed to deliver socioeconomic benefits through relevant geoscience information that can be used to support informed decision making across all sectors of society, and so serve the objectives and needs of governments, society, and agencies alike. Looking to the future, geological surveys and agencies should respond to societal challenges by making the provision of such information central to their missions. ACKNOWLEDGMENTS Topographical data in the figures are reproduced with the permission of the Ordnance Survey of Great Britain.
REFERENCES CITED

American Geological Institute, 2009, Status of the Geoscience Workforce 2009: http://www.agiweb.org/workforce/reports/2009-StatusReportSummary.pdf (accessed 24 June 2011).
British Geological Survey, 2007, HPA (Health Protection Agency)-BGS Radon Potential Dataset: http://www.bgs.ac.uk/radon/hpa-bgs.html (accessed 24 June 2011).
British Geological Survey, 2009a, DiGMapGB-50 digital geological map of Great Britain: http://www.bgs.ac.uk/products/digitalmaps/digmapgb_50.html (accessed 24 June 2011).
British Geological Survey, 2009b, GeoReports: http://shop.bgs.ac.uk/GeoReports/.
Government of Japan Cabinet Office, 2004, Tsunami and Storm Surge Hazard Manual (English edition): 112 p.
Independent Advisory Group on Ionising Radiation, 2009, Radon and Public Health: Chilton, Oxfordshire, UK, Health Protection Agency, 240 p.
Miles, J.H.C., and Appleton, J.D., 2005, Mapping variation in radon potential both between and within geological units: Journal of Radiological Protection, v. 25, p. 257–276, doi:10.1088/0952-4746/25/3/003.
Miles, J.H.C., Appleton, J.D., Rees, D.M., Green, B.M.R., Adlam, K.A.M., and Myers, A.H., 2007, Indicative Atlas of Radon in England and Wales: Didcot, Oxfordshire, UK, Health Protection Agency, 36 p.
Natural Environment Research Council, 2006, Economic Benefit of Environmental Science: www.nerc.ac.uk/publications/corporate/economic.asp (accessed 24 June 2011).
The Northern Miner, 2009, 1–7 June 2009, v. 95, no. 15, p. 1–2.
Orsi, G., Vito, M., and Isaia, R., 2004, Volcanic hazard assessment at the restless Campi Flegrei caldera: Bulletin of Volcanology, v. 66, p. 514–530, doi:10.1007/s00445-003-0336-4.
Research Councils UK, 2007, Increasing the Economic Impact of the Research Councils: http://www.rcuk.ac.uk/documents/publications/ktactionplan.pdf (accessed 24 June 2011).
Schmidt-Thomé, P., 2006, The Spatial Effects and Management of Natural and Technological Hazards in Europe: European Spatial Planning Observation Network, 197 p.
Statistics Canada, 2009, www.statcan.gc.ca/daily-quotidien/090731/t090731a1-eng.htm.
U.S. Geological Survey, 2009, Reducing Hazards in the Central and Eastern U.S.: Reports and maps: http://earthquake.usgs.gov/regional/ceus/index.php.
MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
Printed in the USA
The Geological Society of America Special Paper 482 2011
Strategic Sustainability Assessment

B. Deal, Department of Urban and Regional Planning, University of Illinois, Champaign, Illinois 61820, USA
E. Jenicek, W. Goran, and N. Myers, Engineer Research and Development Center, Construction Engineering Research Laboratory, Champaign, Illinois 61822, USA
J. Fittipaldi, U.S. Army Environmental Policy Institute, Arlington, Virginia 22202, USA
ABSTRACT New strategies for sustainability within the Department of Defense are focused on addressing present and future needs while strengthening community partnerships that improve operational abilities. This “across-the-fence line” strategic thinking requires innovative tools that can engage a broad segment of the community and a variety of military interest groups. These tools must provide a platform for understanding the challenges and realizing the goals of both private- and public-sector interests. They must tangibly represent many different potential futures, their implications, and policies that can help mobilize solutions quickly and easily in a uniform, consistent, and democratic manner. The Strategic Sustainability Assessment (SSA) consists of a series of complementary tools for forecasting and backcasting that provide regional stakeholders a unique perspective on potential sustainable regional policy and investment choices. Forecasting approaches use dynamic spatial modeling techniques to project potential future urban transformations and their implication to the social, environmental, and economic fabric of the region. Backcasting is used to determine critical sets of strategic interventions designed to offset the simulated future impacts. The results of the analysis are managed through the use of a Web-based GeoPortal. This helps to democratize the information by providing it to local stakeholders in a useable and accessible way. The hope is that greater and more direct access to models and the information they generate will help lead to better, more sustainable planning decisions in our military bases and in our communities.
Deal, B., Jenicek, E., Goran, W., Myers, N., and Fittipaldi, J., 2011, Strategic Sustainability Assessment, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 41–57, doi:10.1130/2011.2482(05). For permission to copy, contact [email protected]. © 2011 The Geological Society of America. All rights reserved.
INTRODUCTION

Sustainability has become an important issue of broad public concern. Clean air and water, renewable energy, open space, and pollution prevention are not only essential for improving the livability of our communities but also imperative for the successful long-term operation of our military installations. However, demographic and lifestyle shifts have increased our communal demand for land and other limited resources. As these demands grow, they place pressures on the infrastructure, resources, and long-term sustainability of our regions and the installations they support. As the embodiment of enormous capital investment in infrastructure, land, and personnel, military installations are critical to local and state economies and to the sustainability of defense, security, and military readiness. Some military installations' economic and environmental contributions to the local community, however, have become overshadowed by perceived incompatibilities, such as noise, dust, resource competition, land use, land values, and land availability. These points of contention arise as the local community expands and available resources become scarcer. Eventually, the installation's benefit to the community may be outweighed by the community's requirement for resources, and the military may be perceived as a barrier to local growth and development. These potentially "unsustainable" installations face a number of risks, including downsizing, realignment, and even closure. The U.S. Army Strategy for the Environment, "Sustain the Mission—Secure the Future," establishes a long-range vision for meeting the Army's mission while addressing issues of regional sustainability (Assistant Secretary of the Army for Installations and Environment, 2004). The foundation of the strategy focuses Army thinking on addressing present and future needs while strengthening community partnerships that can help improve operational abilities. This across-the-fence line strategic thinking requires innovative geoinformatic planning support tools that can engage a broad segment of the community and various military interest groups. These tools must provide a platform for understanding the challenges and realizing the goals of both private- and public-sector interests. They must tangibly represent many different potential futures, their implications, and policies that can help mobilize solutions quickly and easily in a uniform, consistent, and democratic manner. Such planning support innovations are not only timely and important, but they are also critical if we are to effectively communicate the great challenges to improving the built environment and sustaining our defense infrastructure that lie ahead. Better access to regional growth and other models and the information they generate will improve public participation, feedback, and support in the planning process, ultimately leading to better, more sustainable decisions for our communities, institutions, and publicly supported defense infrastructure. This chapter describes an approach for engaging stakeholders and the broader defense community in regional planning and
sustainability issues through the development of an innovative planning support system (PSS). The PSS was constructed to inform planning decisions as part of the Strategic Sustainability Assessment (SSA) project. The SSA is a long-term project sponsored by the Army Environmental Policy Institute that seeks to provide quantitative and visually accessible information on issues critical to the sustainability of public and private communities. The SSA uses a variety of models, tools, and research techniques to provide strategic analyses for the Army and its surrounding communities in their journey toward increasing sustainability. The product is a series of ongoing, regular studies and reports that focus on specific regions or issues that enable the development of implementation plans and concepts for the Army Strategy for the Environment (ASE). The following briefly describes the role that geoinformatics played in the development of a PSS used in an Army SSA application for the Fort Bragg region of North Carolina. We describe the basis for the system and the complex, large-scale models it deploys, along with a case study example of its application in North Carolina.

DEPARTMENT OF DEFENSE COMMUNITIES, GEOINFORMATIC TECHNOLOGIES, AND PLANNING SUPPORT

Department of Defense Communities

By design, military installations were historically located in geographic regions with plentiful and inexpensive land. These are often areas with soils too poor for agriculture and few competing employment centers. Testing and training ranges are sited far from existing populations in an attempt to ameliorate complaints related to noise and dust. Over time the civilian population has crept closer to installation boundaries, both in response to the availability of secure employment and to provide services to the service members and their families living on or near the installation. The previously isolated bases are now, at times, a catalyst for development activity in the surrounding communities (i.e., outside the fence line¹). These "encroachment" events—one or more of the many pressures that limit the military's use of land, air, and sea space—can affect an installation's ability to sustain its operations by seriously restricting, and in some cases shutting down, its ability to train and test soldiers for conflict. U.S. Army installations face additional pressures from both inside and outside the military complex. For example, Army installations must comply with the same environmental regulations as their civilian neighbors across the fence, although in greater depth and with greater penalty for noncompliance. Noncompliance outside of federal lands leads to monetary fines; inside the fence, noncompliance can lead to greater fines and a loss of funding and/or jobs, and it can project an image of the
¹Outside and inside the fence line are terms used by the military to denote activities that take place "outside" or "inside" military jurisdictional boundaries.
installation as an inconsiderate neighbor or an unresponsive federal entity. In a broad context, installations must also compete with each other for their very existence as the Army of the future trims real estate. Consequently, if the installation has not proactively engaged its neighbors in planning practices, its perceived value to the community might be outweighed by its value as a source of developable land (or other resources) or by its image as an environmental polluter. The Department of Defense's (DoD) interest in and ability to reach outside its spheres of control has been lacking for several reasons:
1. Resources/priorities—DoD installations prioritize mission-related activities over all other facility and facility-planning activities. They recognize factors that impact mission activities such as those arising from nearby community growth as secondary issues. Therefore, the funding, personnel, and time needed to engage in community planning processes are typically under-resourced.
2. Data—U.S. military installations have traditionally managed data within their fence line with great acumen, but not beyond. The initial geographic information systems developed by military installations generally contained data only for regions inside their boundary. Other agencies such as the U.S. Department of Agriculture have followed suit by mapping only outside military installation boundaries, creating a hodge-podge of data, making it difficult to find cohesion among approaches and availability.
3. Short-term focus—Military installation commanders have the authority to interact with their local communities, but they frequently rotate assignments, and this rotation, plus the pressing and urgent short-term issues they face, has tended to limit their focus on community planning and coordination. Planning issues typically have a long time line and are often put aside while short-term issues are addressed.
4. Authority—There are some formal mechanisms for resource sharing and coordinated planning between installations and communities in the DoD (the Office of Economic Adjustment offers programs to facilitate these interactions), but utilization is dependent on local adaptation and limited funding.

Geoinformatic Technologies

The tools needed to plan these complex DoD-dominated regions within a twenty-first-century society are now on the horizon. With these tools, citizens and stakeholders are no longer bound by an incomprehensible future; they are able to work with tangible representations of many different potential futures, shifting their thinking about the future region from an abstract idea to potential realities. How will our region change if a new investment is made? What types of lands are off limits to urban development? How do the changes affect our schools? Important regional decisions based on data and information about the future can be viewed and shared in a uniform, consistent, and democratic manner. Planners have easy access to this information
and can use it to assemble a set of actions (policies, regulations, investments, or designs) based on public input. Emerging geoinformatic technologies make this new paradigm for planning possible. These technologies have the ability to transform planning from a slow and costly, paper-driven, black-box process of producing reports to a transparent, democratized way of delivering information relevant to policy and investment decisions quickly and easily. Regions will no longer create a plan just for the sake of creating a plan, and the chasm between making and using plans will gradually shrink. Making, using, and evaluating plans will be increasingly consistent across space, time, and functional area (land use, schools, infrastructure, etc.). Decisions can be made on the basis of tangible information available and viewable long into the future by a wide range of constituent groups. These models link physical changes in a community to their relevant consequences, enhancing the individual and collective decision-making processes inside and outside the fence line.

Planning Support Systems

A growing set of planning support tools that utilize geoinformatic technologies has become available in recent years. Brail and Klosterman (2001) outlined the state of the art and described new approaches in PSS development due in some measure to increased computational capabilities and availability of digital data. Geertman and Stillwell (2003) added a review of the role of PSS in planning with the intention of documenting best practices and promoting the use of planning tools. They described the evolution of PSS-like tools and identified spatially explicit decision-support systems as an important subcategory. They discussed tools used in three types of planning: participatory, strategic and land use, and environmental. Klosterman described categorization of PSS along two dimensions: by the planning task that the model addresses and the technique or approach it utilizes (Brail and Klosterman, 2001). This recent evolution of planning support tools toward dynamic spatial simulation systems contrasts somewhat with earlier work in spatial (and aspatial) reasoning systems (Kim et al., 1990). Knowledge-based reasoning systems were loosely founded on a philosophical ideal of capturing the manner in which expert knowledge is applied to address complex planning problems. These systems were characterized by the use of multiple types of domain knowledge and complex domain models to support reasoning processes. This approach is important in considering the development of PSSs in the context of participatory planning exercises. Active participation of stakeholders is an important component of the planning process. Participatory planning has traditionally been led by public agencies and relies on face-to-face contacts in venues such as town-hall meetings. Information limitations in various forms often restrict the public's access to or interest in participating in the process, making many such processes "expert"-driven with intermittent public feedback. Technological innovations in policy analysis and computing have started
to break down these barriers in many aspects of governance by providing more information and services through the Web. This technology has created an opportunity to develop applications that extend interactivity by providing analytical tools that allow the public to perform on-the-fly policy evaluations and provide immediate feedback. More specifically, these tools allow users to compare outcomes of policy choices in relation to location, type, and intensity related to spatial growth and development of places (cities, regions, etc.) on the basis of their impacts on various quality-of-life indicators—and their potential implications for adjacent military operations. A regionally based PSS utilizing geoinformatic technologies and planning processes was constructed to inform planning decisions as part of a SSA project, and it was applied to a DoD Army installation–dominated region. STRATEGIC SUSTAINABILITY ASSESSMENT The SSA is described as an integrated planning approach that optimizes regional resources and begins to develop sustainable management plans for regions with a strong DoD presence using geoinformatics, dynamic planning support tools, and participatory planning methods. Generally, the SSA initiates a regional journey toward sustainability. Together, regional stakeholders identify assets and resources and define a vision for their future. SSA modeling tools then simulate a range of likely futures that use the current assets and compare the outcomes with the participants’ vision in order to identify the changes needed to achieve common goals. The SSA process does not prescribe a future or finalize a plan for getting there. Rather, it provides information to empower regional decision makers working toward a sustainable future. It is the assembly of data, tools, and plans in a participatory environment that fosters the identification of regional resources and priorities, potential future realities, and emerging coalitions. Necessary tools in forming an SSA engagement include models that characterize current conditions and threats, models that project future conditions and policy scenarios, sustainability impact models, and visualization tools. In any given region, entities will already employ many of these models and tools. They will likely have comprehensive or capital improvement plans that characterize the current condition and evolution of resources. They have likely set future goals and objectives. Specialized impact models are probably used by municipalities to monitor transportation, by school districts to monitor enrollments, and by utility providers to monitor water quantity and quality. The strength (and transferability) of the SSA is in bringing these tools together with supplemental resources to fill in the gaps to shape geoinformatic, dynamic, participatory planning support. Initiating a Set of Tools and Resources The SSA pilot study initiated a toolbox of resources to assist regions. These tools included the Sustainable Installations
Regional Resource Assessment (SIRRA), Land-Use Change Assessment (LUCA), and land-use projection, impact assessment, and GeoPortal visualization tools of the Land-Use Evolution and Impact Assessment Model (LEAM). These tools complement common regional efforts. SIRRA and LUCA are DoD-developed tools for comparing stresses and investments across regions. Each uniquely informs national and broad regional stakeholders about the ways in which a local situation fits within a larger context. LEAM is a privately developed land-use simulation approach that is growing in popularity with DoD installations for its useful role in local public policy and planning deliberations (Deal and Pallathucheril, 2007). SIRRA is a screening tool for assessing relative vulnerability in 10 sustainability issue areas: (1) air quality, (2) airspace, (3) energy, (4) urban development, (5) threatened and endangered species, (6) locational sustainability, (7) water, (8) economic issues, (9) quality of life, and (10) transportation. The results of SIRRA analyses are used to identify regions and sustainability issues that require further study using additional data sources. SIRRA was developed under the Strategic Environmental Research and Development Program (SERDP) and was recognized as the 2006 SERDP Project of the Year. SIRRA, which provided auditable data for the Army stationing analysis for Base Realignment and Closure (BRAC) 2005, has been used to evaluate an existing installation’s ability to absorb additional forces and a region’s capability of supporting a new installation. It is also continually used at installation sustainability planning workshops. Incorporation of SIRRA into the SSA brings regional awareness to individual stakeholders. The assessment provides valuable screenings for which additional studies, planning, and actions may be recommended to ensure continued viability (Jenicek et al., 2004). Also developed under SERDP, LUCA examines local and regional land-use trends and impacts on military installation missions. It can provide technologies and data to help installations and units proactively plan to protect the mission sustainability of DoD’s current and future capabilities (Lozar et al., 2003). A LUCA analysis shows a series of landscape changes over 30 years and covers a variety of regional scales. Trends are drawn from the analysis of historic land-use and land cover maps, satellite images, and other sources. Several studies have utilized differing data sources to tailor graphical presentations and comparative analyses of changes over time. Land-use change can significantly and permanently affect opportunities to test and train, but the decades-long process of change is easy to overlook in installation planning. LUCA analyses bring the recognition of land-use change trends to the SSA. LEAM—a dynamic urban modeling and simulation tool that forecasts the outcomes of urban investment and policy initiatives—is at the core of the SSA efforts to build an interactive PSS modeling environment. LEAM technology and framework have been described elsewhere in detail (Deal and Pallathucheril, 2007; Deal, 2008) and will only be summarized here. The LEAM framework consists of two major parts: (1) a land-use
change (LUC) model defined by multiple drivers that describe the local causal mechanisms and allow easy addition and removal of variables for playing out alternative scenarios, and (2) impact assessment models that support rapid analysis and interpretation of land-use changes depending on local interest and applicability. In other words, the LEAM framework is intended to help users find answers to the complex questions of "What if?" and "So what?" The LEAM LUC model uses a structured, gridded surface with state-change conditions that evolve over time, similar to other change assessment technologies. The LEAM grid surface is not flat, however, but gains a "hilly" topography on the basis of both biophysical and socioeconomic constraining factors. It incorporates state-change techniques to calculate a probability that represents the potential of each cell to change from one land-use category to another. Unlike other state-change approaches, however, the probability of (state) change is influenced by local interactions (such as the accessibility of the cell to a given attractor), global interactions (the state of the regional economy, for example), and other causal mechanisms, producing suitability scores that contribute to the grid surface relief and affect subsequent allocation. Similar to other large-scale urban models, LEAM works at the regional scale and incorporates regional macro-socioeconomic models combined within the modeling framework to help determine demand for land. Unlike other large-scale models, however, LEAM aggregates to the regional scale from a fine-scale (30 m × 30 m) resolution that includes cell-based micromodels. This architecture enables loosely and tightly coupled linking with other models that may operate at different spatial scales (transportation models, for example) and the capability to quickly link to models that describe the potential implications of the simulated changes. One submodel, LEAMecon, is an econometric, input-output model that determines the regional demand for residential, commercial, and open-space land. Unlike other approaches that use regional constraints on demand to determine spatial allocation (see Wu and Martin, 2002), in the LEAM approach, households and jobs are established by LEAMecon and converted into a demand for land using sector-based economic and demographic analyses that are quantitative, defensible, and allow for economically based what-if scenario questions to be tested. Within each scenario tested, the estimated demand for land serves as a target for regional land allocation. Market variables increase or decrease development rates on the basis of how well the regional demand targets are met (or not met).

SSA Engagement

Once model simulations are established, scenario descriptions of alternative land-use policies, investment decisions, growth trends, and unexpected events (among others) can be simulated, analyzed, and compared for regional importance. Simulated outcomes are described in graph, chart, text, and map form and are used in engaging in local dialogue and analyzing the potential
implications of the changes described (Deal and Schunk, 2004). The assessment of probable impacts is important for understanding the “so what” of scenario simulations. A visual and quantitative representation of each scenario’s outcome provides both an intuitive means of understanding and a basis for analyzing the implications of potential decisions. These representations act as a catalyst for discussion and communal decision making. The importance of effective visualization devices for dealing with dynamic spatial data sets has long been recognized, especially in the field of natural resource research (Ekstrom, 1984; Rosenfeld and Kak, 1982). Natural resource scientists have been using visualization tools to better understand their science, while social scientists have sought to better understand human behaviors vis-à-vis those resources (Cox, 1990; Malm et al., 1981). While the case for supporting visualization at detailed as well as regional scales was made over a decade ago, not much progress has been made in that regard (Orland, 1992). Geographic information system (GIS) software has boosted dynamic spatial models in that it eases management and manipulation of the large amounts of spatial data that go into them. At the same time, GIS software may limit our ability to effectively deal with the data sets produced because the software does not extend beyond conventional cartographic representations. As a result, the visualization devices used on these data sets support only very basic inferences across space. Even “map movies” (animated maps) have limited value beyond some very preliminary inferences, since each frame is displayed for a relatively short period, and detailed comparisons cannot be made across time. Budthimedhee et al. (2002) rendered some key insights into the characteristics of visualization devices that can effectively and efficiently support inferences from dynamic spatial data sets. First, because the speed at which inferences can be made is critically important with large data sets, she draws on the idea that we must pay attention to the ease and accuracy with which the preattentive visual system can assess relative magnitudes. Second, because of the amount of data needed to make inferences, she draws on the idea that graphic attributes of a visualization device may be more important than its efficiency in using ink to represent the data (as is the conventional wisdom). Third, she notes that the more proximate the information needs to be in order to make inferences, the more proximate the information needs to be when visualized (Wickens, 1992). These ideas were critical in the development of the SSA Web-based GeoPortal (described later herein). The SSA GeoPortal development process explored the potential for building visualization devices for multiscale dynamic spatial data sets by focusing on the elementary perceptual task, number and type of graphic attributes involved, and proximity of compatible components. Fort Bragg SSA Example At the time of the SSA project, the Fort Bragg region already had sustainability planning efforts and regional partnerships
under way, although for reasons described previously, they had been tenuous and cumbersome. Fort Bragg was one of the pioneering Army installations for the (inside-the-fence) Installation Sustainability Planning process, which was begun there in 2003. A nonprofit entity, the Sustainable Sandhills group emerged from that process as a regional sustainability entity and a potential catalyst for outside-the-fence interactions. In addition, Joint Land-Use Studies had been completed in the region and a BRAC Regional Task Force had begun to analyze the potential implications if there were significant increases in the number of soldiers housed at Fort Bragg. Building on these efforts, participants in the Fort Bragg SSA pilot study began by identifying current assets and resources and defining a vision for their future. Building the SSA Framework Existing Conditions The Fort Bragg region is undergoing significant environmental, social, economic, and mission changes. In addition to major Army initiatives—Army to Modular Force Conversion and “Grow the Army”—the BRAC Commission recommended the realignment of U.S. Army Forces Command and U.S. Army Reserve Command to Pope Air Force Base and Fort Bragg. Together, these activities are expected to bring tens of thousands of new residents to the region by 2013, an influx exacerbated by the inherent population growth. The Fort Bragg region grew 59.9% in Fayetteville between 1990 and 2000, a time frame in which the overall U.S. population grew just over 13%. Within a 5-mile buffer of the Fort Bragg fence line, urban areas grew 22% between 1992 and 2001, compared with an Army average
of 26%, obtained from 98 Army installations analyzed (Lozar et al., 2005).

Economic Drivers

A LEAMecon model was developed for the Fort Bragg–Fayetteville region to forecast changes in output, employment, population, and income over time on the basis of changes in the market, technology, productivity, and other exogenous factors. The core model consists of nine economic sectors and nine components of final demand. The output from each sector is consumed by other sectors (interindustry flows) and by components of final demand (which characterizes value added in the economy). Various shocks, such as investments in specific sectors, increases in public spending, or changes in household consumption, can be applied to the regional economic system. The employment model accounts for changes in productivity over time when determining regional employment levels. The total population is subdivided into different age cohorts, each of which has a specific role to play in regional land-use change. The resulting economic trend is used as an input to a dynamic housing market simulation, which then feeds into LEAM as residential land-use change. LEAMecon forecasts employment in the Fort Bragg region to increase by ~216,000 to ~1.7 million jobs by 2030. It forecasts a corresponding population increase of ~343,000 to a total regional population of 2.8 million people over the same time period. Figure 1 describes by-sector changes to 2030. It shows a sharp increase in service sector jobs and a slight decrease in manufacturing jobs over the next 20 years, corresponding roughly with national trends. Government jobs and retail trade employment figures are also expected to increase significantly in this region, generally due to the DoD presence. The financial, insurance, and
Figure 1. Changes in economic structure in the Fort Bragg region of North Carolina to 2030.
real estate sectors (FIRE) are also expected to increase, but at a slower rate over time.

Land-Use Changes

Several scenarios of potential land-use change were tested as a result of stakeholder discussions:
Business-as-usual (BAU). The BAU or reference scenario represents an estimate of the spatial extents of future regional development should current trends continue. Figure 2 shows that residential development in the BAU scenario occurs overwhelmingly in Cumberland County, and more specifically in and around Fayetteville and along the southeastern boundary of Fort Bragg. Conversion of more than 10,000 acres (4047 ha) of forest, 5000 acres (2023 ha) of grassland, and almost 6000 acres (2428 ha) of agricultural land will provide the
space for new residential development. Harnett, Lee, and Moore Counties are expected to lose the highest proportion of agricultural land to development by 2035. Low-impact development (LID). This scenario illustrates the generalized benefits of encouraging various LID policies and their potential impacts on the region. Conversion of urban open space slows overall, resulting from policies aimed at maintaining natural or predevelopment hydrology and infiltration rates. This scenario shows little difference with baselines in land use, but it has much broader, positive implications when water quality and quantity impact models are used. Transit-oriented development (TOD). Results from the TOD scenario indicate that almost 72% of new residential development will occur in Cumberland County (61% occurs in Cumberland in the BAU scenario). Interestingly, a greater amount
BAU LAND-USE CHANGE IN ACRES BY COUNTY FOR THE FORT BRAGG REGION (2001–2035)

County       Water  Residential  Commercial  Barren   Forest  Grasslands  Agricultural  Urban open space  Wetlands
Cumberland      –6       22,399         399     –33  –10,533      –4964         –5882              –551       –20
Harnett         –4         9614          79      –4    –2415      –2155         –4733              –161        –8
Hoke             0         1777         120      –1     –599       –412          –788               –52        –1
Lee             –5         6096         367     –46    –2903       –797         –2442              –185        –1
Montgomery       0         1322          77      –2     –622       –204          –458               –61        –1
Moore           –5         8900         533       0    –4915      –1581         –2295              –381        –6
Richmond        –2         5432         220    –118    –2139      –1205         –1854              –144        –2
Scotland        –1         3529         146       0    –1202       –691         –1565               –90        –1

Figure 2. Business-as-usual (BAU) growth scenario for the Fort Bragg region.
of natural areas would be converted to urban land uses in the TOD scenario than in either the BAU or LID scenarios.
Closure of Bragg Boulevard. This scenario presents a possible land-use outcome of closing Bragg Boulevard to through traffic. In this scenario, conversion to urban land uses from natural and agricultural lands continues to roughly the same degree as in the BAU scenario.
Regional zoning. This scenario uses the normalized regional zoning map as produced by the Sustainable Sandhills project, showing development patterns if the current zoning were strictly enforced. In Cumberland County, conversion to residential land use remains high, but commercial land use increases by 1.4% above the No Bragg Boulevard scenario. Conversion of natural and agricultural lands is similar for all scenarios, ~40% forest, 18% grassland, 35% agricultural, 3% urban open space, and the remainder, ~2%, split between water and wetlands.
Typically, the results of a 30 to 50 yr run are evaluated with a summary map that indicates where new development is projected in a region during this period. Summary maps can also be developed that zoom in on a particular part of a region to assist in local planning efforts. Results are also summarized in spreadsheets and graphs that indicate growth over time in the region and the land uses that decline as urbanization increases.

Assessing the Implications of SSA Scenarios

Housing

Households within the Fort Bragg region grew by over 122,000 between 1970 and 2000 and will likely increase by 210,000 by 2035. A primary driver for this growth is the presence of Fort Bragg and several other proximal DoD installations and the BRAC process. As part of that process, Fort Bragg and Pope Air Force Base are set to receive 62,775 active-duty military and civilian personnel by 2013 (U.S. Department of Defense, 2005). This does not include the soldiers' families and the supporting services that inevitably follow. Adding to housing pressures, key demographic shifts are taking place on a regional and national scale, such as falling household size, single-worker households, age increases, increasing divorce rates, and growth in foreign-born populations. As previously noted, Cumberland County will likely experience the largest growth in households (Fig. 3). Over 50% of the additional 120,608 households expected between 2000 and 2035 will likely be located near the city of Fayetteville, as inbound residents are likely to reside near existing population centers. Harnett County has a rural character and is attractive for development due to its convenient location relative to both Fort Bragg and the North Carolina Research Triangle. Households in Harnett will likely migrate to the southwest and northern areas of the county near the communities of Sprout Springs, Lillington, Erwin, and Dunn. The majority of residents enjoy unincorporated, large-lot living, so zoning will likely limit residential development. Moore County is attractive as a resort
Figure 3. Potential household change by county in the Fort Bragg region to 2035.
and retirement community; its 43 golf courses are especially valued. The communities of Southern Pines, Pinehurst, and Carthage have a distinctive character that the residents wish to maintain. Zoning again will be an issue for residential development in Moore County. New sprawling residential growth patterns in the region can have significant environmental consequences. On average, densities in the region are decreasing, and total developed land is projected to increase 64% (50,000 acres or 20,234 ha) by 2035, causing a loss of 3.25% of existing grasslands, 2.88% of current agriculture land uses, and 1.99% of existing forested areas. This translates into a loss of more than 15,000 acres (6070 ha) of agriculture and 33,000 acres (13,355 ha) of plant and animal habitat. The region is home to several unique plant and animal species— particularly, the endangered red-cockaded woodpecker—and this loss of habitat could threaten the survival of this species. With the in-migration of soldiers, there comes an increase in housing demand, which eventually translates to rising housing costs. Historically, home prices in the Fall Line region have been favorable to low—93% of the current households spend less than 30% of their household income on housing rents or mortgages (U.S. Census Bureau, 1990). Moreover, current fair market rents are affordable to the majority of pay grades for inbound military personnel. New construction (2007) costs an average of $138,000, which requires a minimum income of $42,720, given current lending practices.2 This is just below a warrant officer (W-4) pay grade. The current median household income in the Fall Line region is $42,000—ideally affording a $137,000 priced home. Housing affordability is a key indicator of economic health and a community’s mobility. When affordable housing is
2 Loan amount at 80% loan to value, a 6.61% effective interest rate, and income required to qualify at 25% qualifying ratio.
unavailable to low-income households, family resources needed for food, medical and dental care, and other necessities are diverted to housing costs. Residential instability results as families are forced to move frequently, live with other families in overcrowded conditions, or experience periods of homelessness. Moreover, when home prices are high, development is pushed outward, and sprawl increases. Finally, economic diversity, income distribution, and social integration are all connected to this indicator as well.
Housing-Job Balance
The Fort Bragg region expects 39% of its inbound residents to be employed in the service sector. Overall, an estimated 295,735 of year 2035 total households (65%) are expected to have a working-wage income, and the remaining 158,703 will likely have higher professional wages. A large sector of working-wage residents earns $24,000 annually. This means that developments in the region should feature houses in the $100,000 to $160,000 price range (Bureau of Labor Statistics, 2008). A ratio of jobs to households is commonly used to express the concept of jobs-housing balance. The most basic measure is the number of jobs in an area divided by the number of households. The recommended target range for jobs-housing ratios is 1.3–1.7 (Ewing et al., 1996). The models suggest that the Fort Bragg region will maintain a jobs-housing ratio of 1.8 through 2035. This suggests that many residents may not need to commute outside of the community for employment purposes. The exception is Harnett County.
LEAM Training Opportunities Model
Considerations of the incompatibilities between military installations and surrounding communities have traditionally been approached by assessing the areas of the region that might be affected by specific military activity. For example, noise contours can be produced to simulate the spread of training noise activity over space in the vicinity of a military installation (Fig. 4). This map can be used to identify where potential residential complaints might occur given the training activity studied. This essentially looks at the situation as a neighborhood problem; that is, don’t move to the noisy places! An alternative approach might view the problem as a military training problem. For example, the same contour might be originated from every new household generated in the region. As in the previous example, the spatial extent of each contour might represent the probability of complaint. The resulting pattern, however, shows the areas on the installation to which specific training exercises might be limited in order to diffuse complaint probability (Fig. 5). This changes the community problem to a regionally centered one. If the military is too constrained by the conversion of surrounding land to urban uses, for example, its ability to effectively carry out its mission suffers and the installation may close, something neither the community nor the military would like to see.
Figure 4. Simulated noise contours of potential training activity on Fort Bragg.
Figure 5. Areas (in yellow) considered to be low-complaint probability zones from artillery training noise.
Water Usage
Water availability is an issue of increasing concern nationally and locally. In the Fort Bragg SSA study, each scenario was analyzed for its potential water availability implication. We also tested various water policy interventions for their efficacy in reducing regional water consumption and demand. The U.S. Geological Survey report Estimated Water Use in the United States in 2000 was used as the basis for the analysis, with some modification and local improvement (Hutson et al., 2004). Figure 6 shows a sector-by-sector forecast of the projected water consumption from 2000 to 2035 in the Fort Bragg region. Consumption is expected to increase from ~48 billion gallons (182 million m³) a year to ~60 billion gallons (227 million m³) a year, a 25% increase. Residential and commercial water consumption are each projected to increase 39%, and agricultural water usage is expected to drop ~4% due to loss of agricultural lands to development.
Water needs can be met in the future by increasing the efficiency of water usage while reducing leaks and waste through cost-effective water-saving technologies, revised economic policies, and appropriate state and local regulations. Strategic intervention initiatives used in the water analysis include a Public System Loss Management Program planned for 2010, a Commercial/Industrial Water Conservation Program in 2012, a Residential Water Conservation Program in 2015, and an Agricultural Water Conservation Program in 2018. Residential rainwater harvesting will also be introduced in 2025, along with commercial/industrial rain water–gray water programs. Water reuse was not considered as an option for this analysis because that would require the development of a separate distribution system. Figure 7 shows the Fort Bragg region’s water usage projection when the various strategic interventions are applied.
Water Quality
LEAM water quality (LEAMwq) is a method for quickly assessing the impacts of urbanization on surface runoff and non-point-source (NPS) pollutant loading, providing a first screening that identifies where more advanced modeling is needed. A simple export coefficient modeling approach can predict pollutant loading as a function of the export of pollutants from each source in the study area (Johnes, 1996). LEAMwq integrates LEAM with the Long-Term Hydrologic Impact Assessment (L-THIA) model. L-THIA is a GIS-based export coefficient model developed at Purdue University with the support of the U.S. Environmental Protection Agency. It calculates mean surface runoff and NPS pollutant loading for a given region and period using daily precipitation series, a land-use map, and a hydrological soil group map. The pollutants
Figure 6. A sector-by-sector forecast of the projected water consumption from 2000 to 2035 in the Fort Bragg region. MGY—million gallons per year.
Figure 7. The Fort Bragg region’s water usage projection when the various strategic interventions are applied. MGY—million gallons per year.
selected for this analysis include total nitrogen, suspended particles, and phosphorus. Average storm-water runoff and average total sediments are expected to increase by ~1.6%, and average total nitrogen is expected to decrease by ~1.7% across the region between 2000 and 2035 (Fig. 8). The decrease in regional nitrogen loading in local streams is related to a general decline in agricultural land uses, since industrial farming practices are the main cause of seasonal nitrogen loading. The simulated rapid increase in storm-water runoff is directly correlated with an increase in total regional impervious surfaces in the form of new development. Roadways, sidewalks, parking lots, rooftops, driveways, patios, and recreational uses such as basketball courts and bike paths all contribute to a regional rise in imperviousness.
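The export coefficient approach cited above (Johnes, 1996) is, in its generic form, an area-weighted sum; the specific coefficients used by LEAMwq and L-THIA are not given in this chapter, so the relation is shown here only for orientation:

L = \sum_i E_i A_i,

where L is the annual load of a given pollutant leaving the study area, A_i is the area in land use i, and E_i is the export coefficient for that land use (mass per unit area per year). Urbanization changes the A_i terms, which is how the LEAM land-use forecasts translate into the loading and runoff trends summarized in Figure 8.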
Figure 8. Regional nitrogen loading and runoff for the Fort Bragg region. Land-use change scenarios: BAU—business as usual; LID— low-impact development; TOD—transit-oriented development; No Bragg—closure of Bragg Boulevard; Suit All—residential, commercial, and industrial sustainability; Suit Nat—natural area sustainability; RegZone—regional zoning.
Energy
A military installation is just one of the regional users of energy. If the installation is large and the regional population is small, it may be both the largest single user and the dominant user. In the past five decades, growth around installations has been significant, and although an installation may be the largest single user of energy, it may not be a large part of the aggregate energy demand in the region. The energy consumption patterns of the installation and its energy reduction program may not be significant compared with regional energy demands and general consumption patterns. Therefore, from an energy perspective, the sustainability of an installation in the regional context depends on the region as a whole moving to a more efficient consumption pattern and resilient energy supply mix along with the installation. The energy gap is the difference between projected energy use and energy use with sustainable programs and goals in place and met. The goals of the recent initiative in the Sandhills for the energy team are to “promote energy efficiency and conservation, renewable energy utilization, and sustainable building design.” Taken to the full extent, these would set a goal for regional fuels to be of domestic origin and have a high renewable content. The goal for 2035 could be ~30% renewable. A further goal would be to reduce the usage of natural gas and petroleum by the percentage currently imported. Currently, ~18% of natural gas and ~70% of petroleum are imported. Reducing petroleum usage by 70% may not be attainable and is also out of the regional government’s control to a great extent, but it should be attempted by encouraging alternatives to automobile usage. This was not included in the model, although improvements in corporate average fuel economy standards were factored in. Figure 9 shows the expected increases in energy consumption related to increased population and related land development. The significant growth over time shows the imperative to infuse better technologies. Figure 10 illustrates the potential effects of strategic policy interventions within the energy sector consisting of efficiency improvements for new and existing buildings, a transition to zero-net-energy buildings in the residential and commercial sectors, an efficiency program for the agricultural sector, and an emphasis on efficiency and a transition to hybrid automobiles in the transportation sector.
Emissions
Criteria air pollutant emissions stem predominantly from energy usage. Carbon dioxide emissions come from energy usage and other sources such as deforestation, land-use change, and agriculture. The military installation is just one of the regional users of energy and one of the generators of air emissions. If the installation is large and the regional population is small, the installation may be both the largest single energy user and the dominant emitter. Historically, this has been the pattern, but in the past five decades, growth around installations has been significant. Although an installation may still be the largest single user of energy and producer of emissions, it may not be a large part of the aggregate emissions in the region. This is especially
true when large urban centers and industrial areas are near the installation. Therefore, the emission patterns of the installation operations may not be significant compared with regional emissions due to energy demands and general consumption patterns. From an overall emissions and energy perspective, the sustainability of an installation in the regional context depends on the region as a whole moving to a more benign emissions footprint on the basis of efficient energy use and consumption patterns and the use of low-carbon energy sources along with the installation. Table 1 provides the air emissions projections based on the modeled energy projections, including the strategic policy interventions defined in the energy section.
Infrastructure Costs
Infrastructure demand and costs (transportation, utilities, other urban infrastructure, and air pollution) were also estimated
as part of the SSA analysis. This involved updating, improving, and implementing the Social Costs of Alternative Land Development System (SCALDS), originally developed for the U.S. Department of Transportation, Federal Highway Administration, by Parsons Brinckerhoff Quade and Douglas, Inc., of Portland, Oregon (Delucchi, 1998). The SCALDS model has three general calculation paths. The physical development path models the consumption of land, projected mixture of new housing units, local infrastructure cost, and annual operating costs of sewer, water, and storm-water services. Projections of the average amount of nonresidential building space needed to support new development are also possible. The total travel cost path models the annual operating cost of peak and nonpeak travel on a passenger-miles-traveled basis. The third path models the air pollution produced by transport mode, the energy consumed by transportation, and the energy
Figure 9. Fort Bragg region energy projection.
Figure 10. Fort Bragg region energy projection with policy interventions. inc—including, wd— wood, lpg—liquid petroleum gas, kero—kerosene.
consumed by residential land use in nondollar units. The residential energy consumption contains a factor that approximates the nonresidential energy consumption. This path also estimates the cost of the energy consumed by transport and residential land use. The final outputs from the model are regional projections of the demand and costs for transportation, local infrastructure, and residential and commercial energy over the next 30 yr. Infrastructure costs were analyzed for three scenarios for the Fort Bragg region: BAU, LID, and TOD. The results indicate a potential net savings of $751,208,503 in local infrastructure costs between 2000 and 2035 should the region follow LID strategies (Table 2). Most of the savings would come from nontransportation-related energy savings and reducing the need for engineered sanitary and storm-sewer infrastructure.
PULLING IT TOGETHER
LEAM GeoPortals
Our SSA PSS (the SSA GeoPortal) uses an open-source spatial data viewer (MapViewer) in conjunction with a Web-based map service (Google Maps API [application programming interface]) to render and view map images, impact analyses, and other data. Users can easily pan, zoom, and move around the informa-
tion. Using the Google Maps API also enables the user to view satellite images of the data (Figs. 11 and 12). Because of the ability to zoom in closely, one can get a real sense of the ways in which land-use changes might affect their community. They can physically place themselves in the image and locate issues of primary importance to them relatively quickly and easily. Gaining access to the visual data, however, is not enough to effectively influence planning decisions. A critical step involves the system from which the information is both derived and managed. An ideal PSS conveys the complexity of regional dynamics in a setting accessible to those of elementary technological savvy. Our experiences have suggested that the entire PSS architecture must reside in a simple and easy-to-use content management system architecture that requires no specialized skills to upload, create, manage, or view information. Our GeoPortal-based PSS is based on the open-source content management system Plone (http://www.plone.org). We customize Plone for the SSA application through the design and implementation of Plone objects for storing aspatial data, plans, LEAM scenarios, and other spatial data. This system, although still in the exploratory stage, allows content of different types — from text to images to GIS layers—to be managed in a consistent and uniform manner, effectively simplifying the reporting and analysis process. Full-text searches (see Discussion about ISoP
TABLE 1. FORT BRAGG REGIONAL AIR EMISSIONS PROJECTIONS WITH POLICY INTERVENTIONS BY YEAR
(Pollutant values in tons. Columns 2000–2035 are regional air emission projections related to increased population and related land development; columns marked with an asterisk are projections with strategic policy interventions within the energy sector.)

Pollutant       2000        2015        2025        2035        2015*       2025*       2035*
SO2             37,200      43,625      48,716      53,879      41,720      32,791      19,425
NOx             23,489      25,351      31,927      35,531      27,255      24,175      21,422
CO              5352        6480        7264        8056        6273        5959        6201
CO2             9,854,773   11,761,663  13,253,675  14,761,922  11,291,910  9,555,083   7,039,789
Particulates    1364        1566        1726        1887        1189        935         556
Hg              0.99        1.19        1.35        1.51        1.13        0.89        0.52

TABLE 2. INFRASTRUCTURE COST COMPARISON OF BUSINESS-AS-USUAL (BAU) TO LOW-IMPACT DEVELOPMENT (LID) SCENARIOS

Social costs comparison between BAU and LID scenarios      BAU value (US$)    LID value (US$)    LID to BAU comparison (%)
Increase in annual private cost for water                  136,677,328        133,679,852        97.8
Increase in annual private cost for sewer                  138,538,718        136,489,870        98.5
Increase in annual private cost for stormwater systems     13,853,872         13,648,987         98.5
Increase in annual nontransportation energy costs          898,233,935        806,054,717        89.7
Local infrastructure costs                                 6,618,014,035      5,866,805,532      88.6
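The final column of Table 2 is, as the figures imply, simply the LID value expressed as a percentage of the corresponding BAU value:

LID to BAU comparison (%) = 100 × (LID value / BAU value),

for example, 100 × 133,679,852 / 136,677,328 ≈ 97.8 for the annual private cost of water.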
Figure 11. The Strategic Sustainability Assessment (SSA) GeoPortal, http://www.leam.uiuc.edu/ssa/.
below) can be combined with spatial queries to help locate data of interest. A built-in workflow capability supports a create-review-publish cycle for new content. Map exploration allows the usual selection of raster and vector map overlays, the ability to zoom and pan, and an added ability to dial back and forth in time within LEAM dynamic simulation results. This allows SSA users to experience potential changes to complex urban (and nonurban) systems in a systematic manner. There are opportunities for additional information, metadata, and comment on each piece of information presented. When GIS layers and legends are included in the system, the potential of the PSS for making information usable, accessible, and manageable is illustrated. The SSA GeoPortal PSS helps regional stakeholders to integrate activities and data that lead to plan formulation and policy making by making the results of model building and scenario planning viewable and accessible. This creates real opportunities to engage decision makers, stakeholders, and the general public in the process of making both plans and planning decisions. With
advances in open-source Internet technology, we believe there is untested potential for the efficient and democratic creation of plans without the institutional largesse and conflict of twentieth-century planning processes.
DISCUSSION
In its journey to increase sustainability, the Fort Bragg region has embarked on a series of ongoing, regular studies and reports that focus on specific issues and that enable the development of implementation plans and concepts. This journey has several byproducts, for example, beneficial partnerships and opportunities for further dialogue with internal and external stakeholders, recommendations for policy development and new initiatives as the Army works to achieve the goals of the ASE, and identification of short-term actions needed to ensure that long-term goals are met. Many regions are looking for cooperative planning opportunities between local and regional agencies in pursuit of sustainable
Figure 12. A view of the Land-Use Evolution and Impact Assessment Model (LEAM) results in the Fort Bragg region using the Strategic Sustainability Assessment (SSA) GeoPortal. Yellow and red areas are currently urbanized areas or areas that will become urbanized by the year 2035; these overlie a section scaled aggregate analysis, where darker blue is more intense development.
growth solutions. However, individual plans often overlap and possibly conflict with each other, causing the formation of partnerships and actions to fail. A key component of the SSA in helping regions to succeed is the delivery of plan content. Recent literature refers to this as an information system of plans (ISoP). Currently, considerable effort is expended on making plans, but little attention is paid to making them usable by related projects and planning agencies. An ISoP is an interactive, centralized database designed to help local, county, and regional decision makers form policy in a multiplan environment and allow plan content to be more easily and intelligently retrieved. It facilitates analysis of plan overlap and conflict. Ideally, the GeoPortal records the spatial and temporal scope of the plans in the ISoP, along with an abstract description and a link to the plan in its entirety. Using the SSA GeoPortal, stakeholders and decision makers not only have access to the data and analyses on which the original plan was based, but also urban development decisions and plans that have been made since that time. New planning activi-
ties no longer start from scratch, but instead take into account plans already made by other agencies and entities working on the issue at hand. They quickly assess the actions in the original plan that are relevant and those that are not; they develop a revised set of actions that reflect a current understanding of the region. Furthermore, analyses by consultants and the data used in these plans and analyses are mandatorily stored in a publicly owned information system.
CONCLUSIONS
Contemporary urban society faces increasing challenges—burgeoning populations, climate change, increasing resource demands, and constantly evolving political environments—in balancing the needs of economic development with goals of sustainable development, especially relating to the environment and equity. Government agencies, when they are able to acknowledge these goals, find it difficult to communicate the long-term
benefits of difficult policy choices designed to address these challenges. Citizens, on the other hand, are limited by their capacity to access and understand the complexity and range of information needed to participate effectively in a public process. In the field of land-use planning and policy, this information relates to a wide range of decisions and investment choices made by public entities, including zoning, infrastructure development, housing location and type, green infrastructure questions, social capital support, and environmental preservation. Increasing public accessibility to the information (and the models that produce it) will affect such decisions. A link between the potential policy and investment questions and their relevant consequences has several advantages. First, making the information available and visually accessible will enable the public to play a greater role in individual and collective decision-making processes. Second, access to models that provide better information about the future should help produce better decisions. Third, development of an innovative cyber PSS will help enhance communal goals of regional sustainability by facilitating alternative forms of policy making. Active participation of stakeholders is an important component of the planning process. Participatory planning has traditionally been led by public agencies that rely on face-to-face contacts in venues such as town-hall meetings, but information limitations in various forms often restrict the public’s access to, or interest in, participating in the processes, making them “expert”-driven with intermittent public feedback. Technological innovations in policy analysis and computing have started to break down these barriers in many aspects of governance by providing more information and services through the Web. This has created an opportunity to develop applications that extend interactivity by providing analytical tools through the Web that allow the public to perform on-the-fly policy evaluations and provide immediate feedback. More specifically, these tools allow users to compare outcomes of policy choices in relation to location, type, and intensity related to spatial growth and development of places (cities, regions, etc.) on the basis of their impacts on various quality-of-life indicators—and their potential implications for adjacent military operations. In this chapter, we have shown how geoinformatic technologies are helping to change the way we think about planning for sustainability. Future work in this arena involves the development of planning support systems that can be created and that can evolve with an understanding of the actual use and processing of information by lay people, especially with respect to decision making and risk. Such systems would be informed by an understanding of the ways in which communities of people (as opposed to the individuals in communities) think about issues and risks, and the ways in which communities come to understand the quality of information pertinent to planning from a variety of sources. These systems might be considered “sentient,” in that they respond to the individuals who are using them and the evolving data that inform them. The ultimate objective in this and related work, however, is to effectively create an easily accessible support system that con-
verts raw data into usable information that communities would find helpful in reaching better-informed decisions. ACKNOWLEDGMENTS We acknowledge the support provided by Michael C. Cain, director of the Army Environmental Policy Institute, and Tad Davis, Deputy Assistant Secretary of the Army. The real-world experiences gained at Fort Bragg and the Sandhills region would not have been possible without active engagement from a broad array of planning professionals, especially Paul Wirt, Fort Bragg, and Jon Parsons, Sustainable Sandhills. REFERENCES CITED Assistant Secretary of the Army for Installations & Environment (ASAIE), 2004, The Army Strategy for the Environment: Washington, D.C., Office of the Assistant Secretary of the Army for Installations and Environment, 12 p. Brail, R.K., and Klosterman, R.E., eds., 2001, Planning Support Systems: Integrating Geographic Information Systems, Models, and Visualization Tools: Redlands, California, Environmental Systems Research Institute Press, 428 p. Budthimedhee, K., Li, J., and George, R.V., 2002, ePlanning: A snapshot of the literature on using the World Wide Web in urban planning: Journal of Planning Literature, v. 17, no. 2, p. 227–246. Bureau of Labor Statistics, 2008, Occupational Outlook Handbook, 2008–09 Edition: Washington, D.C., U.S. Department of Labor, 720 p. Cox, D.J., 1990, The art of scientific visualization: Academic Computing, v. 4, no. 6, p. 20–56. Deal, B., 2008, Sustainable Land-Use Planning: The Integration of Process and Technology: Saarbrücken, Germany, Verlag Dr. Müller, 128 p. Deal, B., and Pallathucheril, V.G., 2007, Developing and using scenarios, in Hopkins, L.D., and Zapata, M., eds., Engaging the Future: Forecasts, Scenarios, Plans, and Projects: Cambridge, Massachusetts, Lincoln Institute of Land Policy, 374 p. Deal, B., and Schunk, D., 2004, Spatial dynamic modeling and urban land use transformation: A simulation approach to assessing the costs of urban sprawl: Ecological Economics, v. 51, no. 1–2, p. 79–95. Delucchi, M.A., 1998, The National Social Cost of Motor Vehicle Use: Federal Highway Administration Report FHWA-PD-99-001, http://www.fhwa.dot .gov/scalds/socialcost.htm (accessed June 1998). Ekstrom, M.P., 1984, Digital Image Processing Techniques: Orlando, Florida, Academic Press, Inc., 372 p. Ewing, R., Porter, D.R., Heflin, C.C., and DeAnna, M.B., 1996, Best Development Practices: Doing the Right Thing and Making Money at the Same Time: Chicago, American Planning Association, 180 p. Geertman, S., and Stillwell, J., eds., 2003, Planning Support Systems in Practice: Berlin, Springer, 578 p. Hutson, S.S., Barber, N.L., Kenny, J.F., Linsey, K.S., Lumia, D.S., and Maupin, M.A., 2004, Estimated Water Use in the United States in 2000: U.S. Geological Survey Circular 1268, 46 p. Jenicek, P., Svehla, P., Zabranska, J., and Dohanyos, M., 2004, Factors affecting nitrogen removal by nitritation/denitritation: Water Science and Technology, v. 495–496, p. 73–79. Johnes, P.J., 1996, Evaluation and management of the impact of land use change on the nitrogen and phosphorus load delivered to surface waters: The export coefficient modeling approach: Journal of Hydrology (Amsterdam), v. 183, no. 3–4, p. 323–349, doi:10.1016/0022-1694(95)02951-6. Kim, T.J., Wiggins, L.L., and Wright, J.R., eds., 1990, Expert Systems: Applications to Urban Planning: New York, Springer-Verlag, 268 p. 
Lozar, R.C., Ehlschlaeger, C.R., and Cox, J., 2003, A Geographic Information Systems (GIS) and Imagery Approach to Historical Urban Growth Trends around Military Installations: Champaign, Illinois, Engineer Research and Development Center Report TR-03-9, 47 p. Lozar, R.C., Meyer, W.D., Schlagel, J.D., Melton, R.H., MacAllister, B.A., Rank, J.S., MacDonald, D.P., Cedfeldt, P.T., Kirby, P.M., and Goran, W.D.,
2005, Characterizing Land-Use Change Trends around the Perimeter of Military Installations: Champaign, Illinois, Engineer Research and Development Center Report TR-05-4, 106 p. Malm, W., Kelley, K., Molenar, J., and Daniel, T., 1981, Human perception of visual air quality (uniform haze): Atmospheric Environment, v. 15, no. 10–11, p. 1875–1890. Orland, B., 1992, Evaluating regional changes on the basis of local expectations: A visualization dilemma: Landscape and Urban Planning, v. 21, no. 4, p. 257–259, doi:10.1016/0169-2046(92)90035-X. Rosenfeld, A., and Kak, A.C., 1982, Digital Picture Processing: Orlando, Florida, Academic Press. U.S. Census Bureau, 1990, Census of Population and Housing: Washington, D.C., U.S. Department of Commerce.
U.S. Department of Defense (DoD), March 2005, Department of Defense Report to the Defense Base Closure and Realignment Commission: Department of the Army Analysis and Recommendations, BRAC 2005 Volume III: Washington, D.C., U.S. Department of Defense, 760 p., http:// www.brac.gov/finalreport.html (accessed 6 July 2011). Wickens, C.D., 1992, Engineering Psychology and Human Performance (2nd ed.): Glenview, Illinois, Scott, Foresman, and Co., 560 p. Wu, F., and Martin, D., 2002, Urban expansion simulation of southeast England using population surface modelling and cellular automata: Environment & Planning A, v. 34, no. 10, p. 1855–1876, doi:10.1068/a3520. MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
Printed in the USA
6.
The Geological Society of America Special Paper 482 2011
Grid query optimization in the analysis of cone penetration testing data Patrick M. Dudas Hassan A. Karimi Geoinformatics Laboratory, School of Information Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania 15213, USA Abdelmounaam Rezgui Center for Intelligent Spatial Computing, George Mason University, Fairfax, Virginia 22030, USA
ABSTRACT Soil liquefaction takes place during and/or after the occurrence of an earthquake and is a major contributor to urban seismic risk. Geologists use a technique called the cone penetration test (CPT) to determine the properties of soils, including liquefaction levels, which yields large amounts of soil data. The analysis of such massive amounts of data requires high-performance computing resources. In this paper, we present GQO (Grid Query Optimizer), a distributed algorithm that enables the analysis of large CPT data sets efficiently on a grid cyberinfrastructure.
INTRODUCTION
For data-intensive problems, most solutions can be found using a sequential program on a single central processing unit (CPU) within a reasonable amount of time. This requires little overhead to complete the task at hand, and scientists and engineers, with limited programming skills, can produce solutions within a practical time frame and with minimal effort. However, even in these cases, programs may still run for a long period of time. Additionally, there are cases where programs may require large amounts of storage in the order of terabytes or even petabytes. Often, such problems are compounded to a point that traditional computing resources are no longer feasible, and they require the application of dedicated high-performance computing resources with large storage capacities. For these high-performance computing resources, costs include not only the initial startup but also maintenance and upgrade. Research projects such as weather simulations, astrophysics, earthquake simulations, electrical circuits, and human genome mapping (Barney, 2009) are just a few of the many examples of computationally intensive and data-intensive problems requiring high-performance computing solutions. These types of problems involve retrieval of massive data sets from multiple data sources (e.g., university laboratories, research laboratories) and require complex computations to answer a single scientific question. Finding efficient means for managing and sharing these data sets is a challenge to not only scientists and engineers, but to any discipline that employs very large data sets in their respective fields of study. The geographical distribution and computational complexity of many scientific problems require efficient alternatives for distributed data retrieval and query processing. Grid computing has been recognized as a potential platform for distributed data processing. A grid cyberinfrastructure can be formed by connecting several computers via a network connection and using common management software to schedule and distribute jobs over these computers. Typically, the computers of a grid belong to different entities and are physically in different locations. TeraGrid
Dudas, P.M., Karimi, H.A., and Rezgui, A., 2011, Grid query optimization in the analysis of cone penetration testing data, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 59–67, doi:10.1130/2011.2482(06). For permission to copy, contact [email protected]. © 2011 The Geological Society of America. All rights reserved.
(National Science Foundation TeraGrid, 2010) is an example of a national grid that connects computers from around the country and has many nodes that are designated as supercomputers, which boast roughly two petaflops of computing resources and 50 petabytes of stored data and records (National Science Foundation TeraGrid, 2010). In a grid environment, each computer is considered its own independent node, and scalability is the ultimate objective. New computers can be added to a grid without much effort. Most grid environments require a batching or queuing hierarchy whereby users can request nodes and an amount of time on each node. Other variables can be added to a request, including a specific operating system, hardware, memory allocation, or programming language support. Grids can vary in the ways in which they allocate time and resources by either allocating nodes when the system is idle or partitioning small amounts of resources at all times. By utilizing existing resources while not influencing their local operations, parallelization of computational jobs is possible and results in improved execution times in line with Amdahl’s law, which predicts the maximum speedup achievable when a program is run on multiple processors (Amdahl, 1967). Despite the potential of grid computing for solving many complex scientific problems, it has yet to be applied to a large number of challenging problems. Many scientists are still unable to access large amounts of data at different locations, bring them together, and perform computations on them in a controllable environment. Query processing optimization remains a challenge when the size of the data set is very large and the data are geographically distributed. In such cases, users may have to manually partition the query to obtain the data from different locations and then manually join the resulting data sets. To automate query processing optimization, Grid Query Optimizer (GQO) was developed (Liu and Karimi, 2008). GQO optimizes query processing in distributed heterogeneous databases on grid environments. In this paper, we discuss the use of GQO for cone penetration test (CPT) data on PittGrid, the University of Pittsburgh’s campus grid infrastructure.
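Amdahl’s law, mentioned above but not written out in the chapter, is usually stated as the speedup S obtainable on N processors when a fraction p of a job can be parallelized:

S(N) = 1 / ((1 - p) + p/N),

so S approaches 1/(1 - p) as N grows. Even a highly parallel job (p = 0.95) is therefore limited to a twentyfold speedup, which is one reason a grid query optimizer also tries to minimize serial costs such as data transfer between hosts.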
CONE PENETRATION TESTING DATA
The mission of the U.S. Geological Survey (USGS) is to “serve the nation by providing reliable scientific information to describe and understand the Earth; minimize loss of life and property from natural disasters; manage water, biological, energy, and mineral resources; and enhance and protect our quality of life” (U.S. Geological Survey, 2009). Information provided by the USGS can range from hazards data distributions to surface exploration via cone penetration tests. In particular, the USGS utilizes CPT in the analysis of soil liquefaction. Soil liquefaction occurs during or after earthquakes and is caused by an increase in water pressure in the pores of the sand or dirt, which then causes a decrease in effective stress, and thus a decrease in the shear strength. This can create what are called “soil volcanoes.” CPT data are usually collected with CPT trucks (Fig. 1) that are positioned at locations of interest. Most data sets are from coastal areas of California and some areas of Indiana, and they are used for the exploration of shallow subsurfaces (<30 m in depth). The U.S. Geological Survey Earthquake Hazards Program uses 23-ton trucks to push a cone, roughly 3.6 cm in diameter, into the ground using part of the truck’s own weight, which is supported by both the tip of the cone and the sleeve of the cone. Sensors are attached to the cones and sleeves to provide numeric properties about the soil and its pores. Even though this is nonintrusive drilling, the data obtained from this testing are numerous. A few examples of these data include (Havach, 2003): (1) tip resistance, determined by the force required to push the tip of the cone through the soil; (2) sleeve friction, determined by the force required to push the sleeve through the soil (the resistance to liquefaction can be inferred from these two measurements); and
Figure 1. Example of cone penetration test trucks and measurements (Western Earthquakes Hazards Team, 2009a). Photo by Scott Haefner, USGS. Figure by Tom Holzer, USGS.
Grid query optimization in the analysis of cone penetration testing data (3) S-wave traveltime, which is the amount of time required for a shear wave, created by a sledgehammer or automatically by a compressed air–driven hammer, to reach the seismometer on the cone. Information derived from this data could be used for: seismichazard mapping; paleoseismic exploration; groundwater quality studies; environmental contamination studies; and regional or site-specific three-dimensional mapping of shallow sedimentary deposits (Havach, 2003). Data used in the work presented in this paper were collected from the following locations: Alameda County, California (211 sites), San Luis Obispo County, California (37 sites), San Bernardino County, California (133 sites), Santa Clara County, California (163 sites), and Evansville area, Indiana (58 sites). Most of the data were collected in Alameda County, California, at 211 sites as shown in Figure 2, where dots represent the sites where measurements were made. The CPT data available from the USGS were collected for various projects by the USGS Western Earthquake Hazards Team (2009b). As specified by the USGS, the data were collected independently of the analysis. So, data sets were not tailored for one particular study but were meant for any study that could utilize this information. Figure 3 gives the names and descriptions of all the data. These files were all standard text-flat files with relatively the same format. To convert these unstructured files into structured databases, the first and most significant step was to design the
Figure 2. Cone penetration test (CPT) data (Alameda County, California) (Western Earthquakes Hazards Team, 2009a).
ERD (entity-relation diagram) based on the data usage. It was determined that because most queries involved numerical calculations, different tables would provide a logical order to optimize calculation. Sample calculations include: (1) Ground attenuation—One of the first calculations (queries) needed is to determine the amount of dissipation via an earthquake, which is caused by ground motion intensity or ground attenuation. (2) Seismic stress ratio—Soil liquefaction can be analyzed using the cyclic stress ratio (CSR) or, as more commonly called, seismic stress ratio. This is derived from the horizontal earthquake force (Day, 2001). (3) Cyclic resistance ratio—Using the soil type, estimation can be made of the ground’s resistance to liquefaction during an earthquake or cyclic resistance ratio. Once this and CSR are calculated, a safety factor (Fs) can be determined. Higher values for the safety factor indicate that liquefaction is less likely to occur (Mitchell, 2001). The ERD is shown in Figure 4. Each table’s filename was used as the primary key. The CPTData table was created to reduce redundancy for all the input files. As illustrated in the ERD, for every file, there is a single set of metadata about the location, a single set of geographical metadata, and multiple tuples (rows in a database table) of CPT measurements. We utilized GQO on CPT data to experiment with time performance on PittGrid. Since PittGrid is based on Globus and related technologies, the remainder of this section provides an overview of them. Many middlewares are now available based on the infrastructure of Globus (Globus Alliance, 2008), an open-source toolkit for constructing computer grids provided by the Globus Alliance. Globus provides the means with which to implement jobs in a grid environment using Open Grid Services Architecture (OGSA) (Globus Alliance, 2009b). Open Grid Services Architecture Data Access and Integration (OGSA-DAI) (Globus Alliance, 2009a) is one of the extended services that interrelates databases in a grid environment. OGSA-DAI resolves data-access problems by providing a global grid data services engine that manages data resource information in a grid environment. Several organizations worldwide have built similar grid environments to solve data-intensive problems by applying grid techniques such as DataGrid (2008), EGEE (Enabling Grids for E-sciencE, 2008), Particle Physics Data Grid (2008), Open Science Grid (2008), and other advanced computing centers (Data Intensive Distributed Computing Research Group, 2008; Virtual Organization, 2008; Pacific Northwest National Laboratory, 2008; Advanced Computing for Science, 2008). Most of these organizations or virtual organizations spent a lot of their efforts on managing petabytes-scale data, retrieving data, and scheduling jobs. For example, DataGrid (2008), funded by the European Union, provides an infrastructure and middleware for solving problems that involve intensive computations and analysis of large-scale data sets across different scientific domains (Segal et al., 2000).
Figure 3. Cone penetration test (CPT) data information (Western Earthquakes Hazards Team, 2009a).
Figure 4. Entity-relation diagram (ERD) for a cone penetration test (CPT) database.
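The ERD itself appears only as Figure 4. A hypothetical minimal schema consistent with the description above (File_name as the key of every table, one set of site metadata and one set of geographic metadata per file, and many measurement tuples in CPTData) might look like the following sketch; only File_name, State, Water_Depth, Depth, Tip_Resistance, and sleeve_Friction appear in the chapter (in the example query given later), and the remaining columns and all data types are illustrative assumptions.

CREATE TABLE Locations (
  File_name  VARCHAR(64) PRIMARY KEY,  -- one set of geographic metadata per sounding file
  State      CHAR(2),                  -- used in the example query (e.g., 'CA')
  Latitude   DOUBLE,                   -- assumed column
  Longitude  DOUBLE                    -- assumed column
);

CREATE TABLE CPT (
  File_name   VARCHAR(64) PRIMARY KEY REFERENCES Locations (File_name),
  Water_Depth DOUBLE                   -- named in the example query; other site metadata omitted
);

CREATE TABLE CPTData (
  File_name       VARCHAR(64) REFERENCES Locations (File_name),
  Depth           DOUBLE,
  Tip_Resistance  DOUBLE,
  sleeve_Friction DOUBLE               -- one tuple per measurement depth, many per file
);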
DataGrid addresses issues such as work scheduling, data management, monitoring services, fabric management, and storage management. It gradually evolved into the current Enabling Grids for E-sciencE (EGEE) (2008), which is the largest, multidisciplinary grid infrastructure in the world.
GRID QUERY OPTIMIZER
GQO was developed to effectively and efficiently process queries on widely distributed, heterogeneous database management systems in grids. Built on the infrastructure of OGSA-DAI, the core optimizing strategy of GQO consists of selecting the best data resource from its replicas and parallelizing query execution on the grid. It includes two main components, resource selection and parallelism processing, and three auxiliary services, the transmission prediction service (TPS), environment information service (EIS), and database information service (DBIS), as shown in Figure 5. The resource selection component is composed of two subcomponents: the resource locator and the resource ranker. The resource locator collaborates closely with the three auxiliary services to locate the required resources. The three auxiliary services collect the data about transmission latency, host information, and database information in the grid. With all of this information retrieved by the resource locator, the resource ranker calculates the rank for each host that possesses the query. One of the unique features of GQO is its novel resource allocation algorithm, which uses a ranking system to
assign weights to the parameters and evaluate them to determine those that may influence the decision-making process, rather than relying on a single parameter. These parameters take into account a host’s MIPS (million instructions per second, a standard measure of processing speed), RAM (random access memory) size, current workload, and transmission latency rate (TLR). TLR is the mean transmission capability between two hosts at a specific time, calculated from historical statistical data. The logic is that the higher the ranking of a given host, the better the chance that it will be assigned the query segment. After the resource ranker decides the host(s) to carry out the operations, another component of GQO, parallelism processing, detects the potential parallelism to maximize query optimization. Since parallelism exists when there is no data dependency between operations, the algorithm iterates over the operations in the query checking for data dependencies. The output of parallelism processing is a sequential execution plan that contains parallel steps. Upon the determination of the execution plan, the query is ready to be sent to the grid for execution. GQO has been implemented on PittGrid (Natarajan, 2008), the University of Pittsburgh’s campus grid. PittGrid has around 500 nodes, most running Linux (with different flavors such as Red Hat, Fedora, and Scientific Linux), some running Windows XP, Vista, and Windows 7, and a few Solaris and Mac hosts. Memory in these nodes ranges from 512 MB to 2 GB, and most of them are either single-core or dual-core Intel processors with
Figure 5. Grid Query Optimizer (Liu and Karimi, 2008).
speeds ranging from 1.8 GHz to 3.00 GHz. There is also one central PittGrid server that has two dual-core processors running at 2.8 GHz, 8 gigabytes of available RAM, and a 4 × 300 gigabyte hard drive using RAID (Redundant Array of Independent Disks) 5/1. Even though PittGrid has only been active for a couple of years in production, it has recorded using over a million CPU hours for a number of different projects throughout the university. This includes projects in physics and astronomy, simulating neural networks, mathematics modeling, lung volume during infection, clinical pharmacology in the quantum genomics, entromics, and many other projects ranging from critical care medicine to statistics. PittGrid supports several programming languages, including C++, Java, C, and FORTRAN, as well as the graphing and numerical computing languages R and MATLAB. The PittGrid environment is structured on Condor and Globus, with a separate layer and API (Application Programming Interface) dedicated to the management of database systems based on OGSA-DAI. As shown in Figure 6, the user first submits SQL (Structured Query Language) queries to the portal, which looks for the server that supports GQO to optimize the queries, locate the resources, and send the list of resources to the OGSA-DAI to carry out the query execution. GQO can be installed on one or several servers on PittGrid. In its current version, the portal server is on the main server. GQO in the Grid Environment GQO addresses issues related to resource allocation and parallel processing. It also makes the process of querying distributed relational databases more user-friendly and scalable by two components: SQL parser and execution plan. The SQL parser is the first step in query processing. It converts the SQL string into an abstract query tree (AQT), which is basically a tree data structure that represents the relations and their operations. After the AQT is built, the resource locator will
first collect the relations involved in the query and look up the services for information on relations, databases, hosts, and related OGSA-DAI handlers and identifiers (IDs). The process, shown in Figure 7, has the following steps: Step 1—The resource locator sends a request to the DBIS for information on relations. Step 2—The DBIS returns not only individual relation information, such as the number of records and size, but also a set of databases that have the same relation. Another set of information DBIS returns is a set of hosts that own the databases obtained from this step with their OGSA-DAI handlers and IDs. Step 3—With the host name from step 2, the resource locator will send a request to the EIS for host information by the host name where the databases are stored. Step 4—The EIS returns information on the host such as MIPS, RAM, and CPU. This information is then used to evaluate the capability of a host to carry out join operations in the ranking system. Step 5—The resource locator will iterate over all the hosts in the “host set” to retrieve all the transmission latency from one host to the other hosts. Step 6—After receiving the request, the TPS calculates the mean transmission latency between two hosts from the historical statistical data and then sends the result back to the resource locator. The resource locator then sends all the data to the resource ranker to grade each host. After the resource ranker ranks all the hosts, the resource locator outputs the physical query tree (PQT), which is the AQT with the host information. An execution plan is then generated by the parallelism processing component. This plan maximizes parallelism by detecting the data dependency between operations. The leaf nodes have the highest chance of being selected by the parallelism processing component (Fig. 8). The critical point in parallel processing is the notification of completeness of data transferring between two parallelized processes. A stack of threads is created to solve this issue. The main thread starts from the root node and intermediate nodes, which create subthreads for their two children nodes until the leaf nodes are reached. The leaf nodes create the subthreads for the relations for the selection in the nodes. The completeness of two selection subthreads in a leaf node, say node 1, triggers the join operation, and the completeness of two join operations in node 1 and node 2 triggers the join operation on intermediate node 3. The host with a higher rank is chosen to perform the join operation rather than selecting a new host for the sake of reducing latency due to data transfer. For example, in node 1, if the rank of host 1 is higher than host 2, the data obtained from host 2 will be transferred to host 1 to perform the join operation. VALIDATION
Figure 6. Grid Query Optimizer (GQO) in PittGrid.
MySQL 5.1.41 was used to store the data sets. After all data were ported from flat files to relational tables, it was determined that the database was not substantial enough in size to warrant
Figure 7. Resource locator and resource ranker.
Figure 8. Query tree with parallel processing.
a valid test for GQO. The database was roughly 13 MB in size, and most queries, when run natively in MySQL, took only a few seconds to return (an average of 2.3 s over 10 tries). The computation platform was an AMD Athlon 64 dual-core host with 1790 MB of RAM and a 100 GB hard drive.
Given its size, the original database was not used in the analysis of earthquake data as initially planned. To continue testing, we opted to replicate the data multiple times to create a database with two tables of close to 2000 entries each (the CPT and Locations tables) and one table of close to 750,000 entries (the CPTData table).
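The chapter does not show how this replication was scripted. One way to do it, reusing the hypothetical column names from the schema sketch earlier, is to reinsert the measurement table into itself, doubling its size on each pass until it reaches the roughly 750,000-row scale described above:

-- Each execution doubles the number of measurement tuples.
-- File_name values are simply repeated, which is adequate for timing tests
-- because the joins in the test query below remain valid.
INSERT INTO CPTData (File_name, Depth, Tip_Resistance, sleeve_Friction)
SELECT File_name, Depth, Tip_Resistance, sleeve_Friction
FROM CPTData;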
A query was designed that required all values to calculate cyclic resistance ratio for California. The query included water depth, soil depth, tip resistance, and sleeve friction for all positions in California as follows:

SELECT CPT.Water_Depth, CPTData.Depth, CPTData.Tip_Resistance, CPTData.sleeve_Friction
FROM CPTData, CPT, Locations
WHERE CPT.File_name = Locations.File_name
  AND CPTData.File_name = Locations.File_name
  AND Locations.State = 'CA'
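The columns retrieved above feed the liquefaction calculations named earlier in the chapter but not written out there. In the widely used simplified procedure, the seismic (cyclic) stress ratio and the safety factor take the forms

CSR = 0.65 (a_max / g) (σ_v0 / σ′_v0) r_d    and    Fs = CRR / CSR,

where a_max is the peak horizontal ground acceleration, σ_v0 and σ′_v0 are the total and effective vertical stresses at the test depth (derived from the soil depth and water depth), r_d is a depth-reduction factor, and the cyclic resistance ratio CRR is estimated from the normalized tip resistance and sleeve friction. Liquefaction is anticipated where Fs < 1. These expressions are given here only for orientation; the exact correlations used in the study are not specified in the chapter.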
The execution of this query either never completed on the stand-alone platform (because system resources were exhausted) or required over an hour to complete (over 10 repeated runs, the average was 1 h 34 min 34 s), plus close to an hour more to return all of the resulting data. Utilization of a similar database, along with other databases, showcased the way in which a grid cluster allows multiple data sets to work together seamlessly. Tests were then run on a simulated grid environment, producing results that showed that not only did the queries complete,
but they completed faster than with other optimization algorithms. Table 1 lists the databases used in these tests. The following queries (Fig. 9) were also run for two other techniques, CIO (Complete Iteration Optimizer) and RO (Randomized Optimizer). Details about these techniques may be found in Liu and Karimi (2008). The final analysis for these tests showed that, compared to CIO and RO, GQO produced better results in seven out of the eight queries (see Fig. 10). Our experiments took into account the mean query processing time, which is the time needed to generate query execution plans plus the time to run those plans (Liu and Karimi, 2008).
TABLE 1. SIMULATED DATABASES AND RELATIONS
Database    Relation    No. of records    Size (MB)
DB1         R1          550,000           100.2
            R3          3,000,000         650
DB2         R2          100,000           150.5
DB3         R4          50,000            90.7
DB4         R5          300,000           120.8
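The test queries themselves are published only as abstract query trees (Fig. 9). A query of the same general shape, selections on relations stored in different databases joined on shared keys, might look like the following; the join keys (k1, k2) and the filter column (a) are hypothetical, since only the relation names in Table 1 are given.

SELECT R1.a, R2.a, R4.a
FROM R1, R2, R4                -- R1 resides in DB1, R2 in DB2, and R4 in DB3 (Table 1)
WHERE R1.k1 = R2.k1            -- hypothetical join key
  AND R2.k2 = R4.k2            -- hypothetical join key
  AND R1.a > 100;              -- hypothetical selection, executable on the host holding R1

GQO's parser would convert such a statement into an abstract query tree whose leaf selections can run in parallel on the hosts holding R1, R2, and R4, with each join assigned to the higher-ranked of the two hosts involved.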
Figure 9. Tested queries in abstract query trees.
Figure 10. Mean query processing time (QPT) for CIO (Complete Iteration Optimizer), GQO (Grid Query Optimizer), and RO (Randomized Optimizer).
CONCLUSION
Our experiments show that when working with large data sets that are dispersed on a grid environment, GQO can produce results more efficiently than either a stand-alone host or the other query optimization techniques available on grid environments. Earthquake analysis using CPT could ultimately generate data in a more timely fashion, especially for large sets of CPT data, allowing researchers to perform soil liquefaction calculations in real time. If applied in an actual event, this could save a city, state, or country emergency costs, recovery costs, and, most importantly, human lives.
REFERENCES CITED
Advanced Computing for Science, 2008, Available from http://acs.lbl.gov/ (accessed December 2007). Amdahl, G.M., 1967, Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities: Proceedings of the April 18–20,
1967, Spring Joint Computer Conference: New York, Association for Computing Machinery, 810 p., doi:10.1145/1465482.1465560. Barney, B., Introduction to Parallel Computing, https://computing.llnl.gov/ tutorials/parallel_comp/ (accessed April 15, 2009). DataGrid, 2008, The DataGrid Project, http://eu-datagrid.web.cern.ch/eu -datagrid/ (accessed September 16, 2008). Data Intensive Distributed Computing Research Group, 2008, http://acs.lbl .gov/ (accessed July 29, 2008). Day, R.W., 2001, Geotechnical Earthquake Engineering Handbook: New York, McGraw-Hill Professional, 700 p. Enabling Grids for E-sciencE, 2008, Enabling Grids for E-sciencE, http://www .eu-egee.org/ (accessed March 20, 2008). Globus Alliance, 2008, Globus Toolkit, http://www.globus.org/ (accessed October 27, 2008). Globus Alliance, 2009a, OGSA-DAI, http://dev.globus.org/wiki/OGSA-DAI (accessed June 5, 2009). Globus Alliance, 2009b, Towards Open Grid Services Architecture, http:// www.globus.org/ogsa/ (accessed April 12, 2009). Havach, G.A., 2003, Subsurface Exploration with the Cone Penetration Testing Trucks (Fact Sheet 028-03): U.S. Geological Survey Fact Sheets, http:// pubs.usgs.gov/fs/2003/fs028-03/ (accessed October 16, 2009). Liu, S., and Karimi, H.A., 2008, Grid query optimizer to improve query processing in grids: Future Generation Computer Systems, v. 24, no. 5, p. 342–353, doi:10.1016/j.future.2007.06.003. Mitchell, J.K., 2001, Selected Geotechnical Papers of James K. Mitchell (Idriss, I.M., editor): Civil Engineering Classics: Reston, Virginia, American Society of Civil Engineers, p. 844–885. Natarajan, S., 2008, About PittGrid, http://www.cs.pitt.edu/pittgrid (accessed December 3, 2008). National Science Foundation TeraGrid, 2010, The TeraGrid: 2005–2011, https://www.teragrid.org/web/about/index (accessed October 19, 2010). Open Source Grid, 2008, The Globus Alliance, http://www.globus.org/ (accessed April 23, 2008). Pacific Northwest National Laboratory, Data Intensive Computing, 2008, Pacific Northwest National Laboratory, Data Intensive Computing, http:// dicomputing.pnl.gov/ (accessed April 23, 2008). Particle Physics Data Grid, 2008, Particle Physics Data Grid, http://www.ppdg .net/ (accessed May 1, 2008). Segal, B., Robertson, L., Gagliardi, F., and Carminati, F., 2000, Grid Computing: The European Data Grid Project: Nuclear Science Symposium Conference Record, v. 1, p. 2-1; doi:10.1109/NSSMIC.2000.948988. U.S. Geological Survey, 2009, United States Geological Survey–Mission Statement, http://www.usgs.gov/aboutusgs/who_we_are/ (accessed June 1, 2009). Virtual Organization, 2008, Towards Open Grid Services Architecture, http:// www.globus.org/ogsa/ (accessed April 12, 2008). Western Earthquakes Hazards Team, 2009a, Earthquake Hazards Program: Cone Penetration Testing (CPT) Data—Introduction, http://earthquake .usgs.gov/regional/nca/cpt/data/index.php (accessed December 12, 2009). Western Earthquakes Hazards Team, 2009b, Earthquake Hazards Program: About Data, http://earthquake.usgs.gov/ (accessed December 12, 2009).
MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
Printed in the USA
The Geological Society of America Special Paper 482 2011
The role and development of a persistent interoperability test bed for geosciences research M.J. Jackson G. Hobona Centre for Geospatial Science, Triumph Road, University of Nottingham, Nottingham NG7 2TU, UK L. Bernard J. Brauner Technical University of Dresden, Mommenstrasse 9, Dresden, 01069, Germany C. Higgins EDINA National Data Centre, 160 Causewayside, University of Edinburgh, Edinburgh EH9 1PR, UK
ABSTRACT Over the past decade, desktop image analysis and geographic information system (GIS) software have matured into the dominant tool for geoprocessing. Desktop solutions usually incorporate a broad range of geospatial processing functionality together with locally maintained data in order to provide a tightly coupled and largely autonomous environment for carrying out operational geospatial activity. This approach has also been adopted by the research community. Researchers typically bring together at the local level both the necessary analytical tools and the data for the research activity at hand. There is now, however, a move away from such desktop technology. A loosely coupled service-oriented architecture, based on the deployment of Web services developed and maintained by a dispersed community, is now seen as a more powerful and flexible approach. Likewise, with this approach, the data may also be distributed and accessed directly from databases maintained by the collection agencies rather than being duplicated at the site where analysis is being undertaken. This new approach, based on established and emerging standards for geospatial interoperability, has many advantages, which are discussed herein. This chapter describes the challenges associated with the more dispersed and collaborative nature of the operational and research programs that are based on such an architectural approach. The need for and benefits of a persistent interoperability test bed for geosciences research and education are discussed, as is the question of ways in which to facilitate the move of such an architectural approach into routine operational use.
Jackson, M.J., Hobona, G., Bernard, L., Brauner, J., and Higgins, C., 2011, The role and development of a persistent interoperability test bed for geosciences research, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 69–75, doi:10.1130/2011.2482(07). For permission to copy, contact [email protected]. © 2011 The Geological Society of America. All rights reserved.
EVOLUTION OF INTEROPERABILITY FOR GIS The growth of desktop computer mapping in the 1980s and the subsequent development of the ubiquitous personal computer for geographic information system (GIS) processing brought about a surge in the number of people who could directly manipulate and analyze geographic data. This started, or at least accelerated, the process of moving the use of GIS out from the elite confines of specialist mapping and GIS groups and departments into the domain of the general user. This democratization of GIS, however, was still limited. The user still typically had to acquire GIS software from a particular vendor. Mixing of software functionality between vendor solutions was fraught with problems of incompatibility between data formats and software design. While basic functionality might be cheap, or even available for free as an open-source product, many users and organizations found that specific functionality or required customization was absent in the standard product. These gaps could only be filled by buying vendor-based options or by commissioning consultants to develop add-ons and extensions. For the GIS companies, this was often a major revenue stream and therefore a situation that, if not deliberately manipulated, was certainly not seen in a negative light. Moreover, not only were data acquisition costs high, but the need to host the data locally incurred additional direct costs for storage and database software, as well as continuing costs for database administration expertise and data updating and maintenance. Additional and more significant developments occurred from the mid-1990s with the introduction of Web-based map search and display products such as MapQuest and MultiMap (both launched in 1996). At the time, the main GIS vendors were slow and cautious in responding to the public demand for such services. The belief was that such a capability was the application of fairly trivial but disruptive technology for map viewing, which would not significantly impact on the user requirement for professional geospatial analysis software. MapQuest was sold to America Online, Inc. (now TimesWarner), for $1.1 billion in 2000.1 At the time, this was a valuation that almost certainly exceeded that of all the mainstream specialist GIS vendors put together. Yet, even after this event, the GIS vendors were still slow to react to the changing market developments, constrained as they were by their legacy products, underpinning software architectures, and business models. In February 2005, Google, then only a 7 yr stripling, was able to enter a still immature Web-mapping market, and, with the launch of its Map API (application programming interface) (June 2005), both the commercial and academic use of mash-ups started to emerge. During this period, Google data and software releases and related developments by Yahoo, Microsoft, and others created a burst of energy and enthusiasm for the use of geospatial data. In a very short period of time, imagery and mapping, at a level of detail and with a comprehensiveness of cover previously only 1
MultiMap was purchased by Microsoft in December 2007.
available or affordable by government agencies or perhaps large commercial organizations such as utilities and oil companies, were available for all. Not only this, but it was largely free of charge, together with free or very low cost simple tools to allow user data sets to be merged and examined in a spatial context. There was a major gulf between these new “community” developments and the analytical precision and power of the professional desktop GIS software, but they met a need and attracted a mass-market user base while “GIS” remained a niche. The vision of GIS built from discrete but interoperable service components accessing multiple data sets from their source providers was one that emerged in a similar time scale, though initially with much less market visibility. The Open Geospatial Consortium (OGC, http://www.opengeospatial.org/), itself preceded by the Open Grass Foundation, was launched in 1994 with its declared vision of “diverse geoprocessing systems communicating directly over networks by means of a set of open interfaces based on the ‘Open Geodata Interoperability Specification (OpenGIS)’” (http://www.opengeospatial.org/ogc/history). Early efforts of the OGC were focused on specifications for access to tightly coupled geospatial data stores, e.g., the OGC Simple Features Standard published in 1997. During this initial period, OGC developed the OGC Abstract Specification, which set the stage for the development of the loosely coupled Web service standards. In 1999, the OGC Web Map Test Bed created the initial versions of the Web Map Service, Geography Markup Language, and the Catalog Service for the Web. This test bed was a pivotal point in the OGC in applying the abstract specifications to the Web. The Web provided tremendous energy and ideas, which in turn influenced the abstract specifications. The experience of the Web Map Test Bed as a cradle of research and development is a fundamental concept in developing geospatial standards. After this leap forward from the Web Map Test Bed, OGC’s Technical Committee and Planning Committee codified this new understanding by approving version 1 of the OGC Reference Model (ORM). The OGC ORM brings together, under the International Organisation for Standardisation Reference Model for Open and Distributed Processing (ISO RM-ODP), a comprehensive description of OGC’s technical baseline for interoperability and the framework for the adopted and emerging OpenGIS Specifications. The early 2000s also saw the approval of the highly significant OpenGIS Geography Markup Language version 3.0 (GML 3.0) and the OpenGIS Web Map Service (WMS). The OGC WMS was later to become accepted by ISO/TC (International Standards Organisation Technical Committee) 211, Geographic Information/Geomatics, for standardisation as ISO 19128. Figure 1 sketches the discussed time line of this GIS evolution. Building on the first OGC Web Map Test Bed, the OGC Interoperability Program was established. This program has now conducted multiple initiatives, ranging from initial development in test beds to near-operational deployments in pilots. OGC interoperability experiments have been recently added to
respond to the need for more initiatives that tie implementation with specifications in support of standards development.

Figure 1. The geographic information system (GIS) evolution time line, running from monolithic desktop GIS (~1985) through client-server GIS (~1995) to geoservices in an SDI (2011), and spanning a data-exchange-format era, a GI-interoperability-via-API era, a spatial data infrastructure era, and a volunteered-GI and GI-service-chain-on-demand era, with milestones at 1994 (OGC founded), 1996 (first GSDI conference), 1999 (first OGC test bed), 2005 (launch of Google Earth), and 2007 (EU INSPIRE Directive). API—application programmer interface; GI—geospatial information; GSDI—Global Spatial Data Infrastructure organization; INSPIRE—Infrastructure for Spatial Information in the European Community; OGC—Open Geospatial Consortium; R-DBMS—relational data base management system; SDI—spatial data infrastructure.
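For readers who have not met these interfaces, a flavor of how lightweight the service calls are may be useful. The following sketch assembles an OGC WMS 1.3.0 GetMap request in Python; the endpoint URL and layer name are hypothetical, and the snippet is illustrative rather than drawn from any of the test beds described here.

```python
# Minimal sketch of an OGC WMS 1.3.0 request, as standardized in ISO 19128.
# The endpoint and layer name below are hypothetical placeholders.
from urllib.parse import urlencode

WMS_ENDPOINT = "https://example.org/geoserver/wms"  # hypothetical service


def getmap_url(layer: str, bbox: tuple, width: int = 800, height: int = 600) -> str:
    """Build a GetMap URL; BBOX order for EPSG:4326 in WMS 1.3.0 is lat/lon."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "CRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return WMS_ENDPOINT + "?" + urlencode(params)


# A GetCapabilities request (SERVICE=WMS&REQUEST=GetCapabilities) returns an XML
# document advertising the layers a server offers; GetMap then renders one of them.
print(getmap_url("geology:bedrock", (50.0, -6.0, 56.0, 2.0)))
```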
Equally important as the developments of technical specifications for interoperable GIS, related organizational measures foster the implementation of spatial data infrastructures to support national, regional, and international administrations. The European Union (EU) INSPIRE (Infrastructure for Spatial Information in the European Community) initiative (European Commission, 2007) and the Global Spatial Data Infrastructure organization (GSDI; http://www.gsdi.org/) are two examples of such organizational development. The late 1990s and early years of 2000 were also a time when the development of mobile computing and the ability to position these mobile devices either by network positioning or global positioning system (GPS) were becoming feasible propositions for the mass market. These developments were particularly driven by the increasing ubiquity of mobile phones and fast data communications using GPRS (General Packet Radio Service) or the newly emerging 3G networks. These developments were particularly pronounced in Japan and also in Europe, where telecommunications companies invested many billions of Euros in purchasing 3G licenses (over £22 billion in the UK alone). In 2000, M.J. Jackson led a group in such a company with the purpose of designing the software platform to supply location-based services (LBS) to a mass market. This activity demonstrated that just as Web mapping services required a different approach to software and systems design compared to traditional GIS, so location-based services also had their specific needs, which required fresh thinking and a different approach to the problems of data management and delivery, and particularly to product evolution (Jackson and Tucker, 2005). The past decade has seen a succession of technologies emanating from diverse areas that offer up a vision of the always-connected citizen served by a high-bandwidth, location- and
context-aware communications system that can supplement the visual environment with a broad range of connected data, analysis, and visualization services. This personalized augmented reality is no longer just the realm of science fiction, but it is at the threshold of implementation on everyone’s mobile phone. In turn, this consumer-level technology and citizen empowerment have driven a movement whereby people frustrated by limited access to location-based and location-referenced data due to licensing restrictions, cost, or just ease of access have become personal surveyors of map and georeferenced data. This movement has been spurred by the formation of Web-based groups collaborating to combine the individual inputs into comprehensive collections of location data. “Volunteered geographic information,” now an established umbrella term to describe the various kinds of society-driven spatial data collection and publication initiatives (Craglia et al., 2008), has become a major source of publically available free data accessible via Web services over the Internet. The Open Street Map project (http:// www.openstreetmap.org/) represents one excellent example of this movement. RESEARCH AND COMMERCIAL PROGRESS CAN IMPROVE BY INCREASING TEST-BED FACILITIES While we have arrived at the threshold, however, we have also in the main paused from stepping over it. The vision and the way forward have been well, if not widely, understood since the early 2000s. It is recognized that the world of context-aware (including location-aware) ubiquitous and pervasive computing opens up major economic potential. Online search and social networking companies have grown from nothing to be multibillion-dollar companies in the matter of a few years based on their mass-market appeal. The same potential exists with LBS. However, just as it was not possible to predict the exact form
of the services that made social networking companies successful, so the precise nature or timing of killer LBS applications (or the application that becomes a killer application upon integration with location) cannot be predicted. For example, the use of satellite navigation applications, such as TomTom, implement route finding and network algorithms that have existed in professional GIS for many years. However, it is only in the last year or two, with convergence of advances in low-cost GPS chips and mobile phone technology, that “satnav” applications have become a mass-consumer product. The likely scenario, however, is that in a manner analogous to genetic evolution, those organizations and applications that can support the widest territorial growth and adapt most rapidly to market trends and fashions are likely to be amongst the market dominators. In geospatial systems terms, this may be equated to those system architectures that can access the broadest and best sources of data, that can be interoperable with the synergistic services from sister applications areas, and that can most rapidly reconfigure their set of accessible services into market-honed applications. So, we have a classic “crossing the chasm” dilemma. If one stays on the familiar ground with proven legacy products, one has short-term security but at the expense of increasing separation from potential capability and future market demand. If one leaves the threshold at this moment, uncertainty about the ways in which the multiple new strands of technology best complement each other and the best combinations and delivery mechanisms will lead many early entrepreneurs to fail, as we saw from the first rush over the threshold in the early 2000s. This analogy applies equally to the academic research community. The development of Web map services and especially the availability of high-resolution imagery and large-scale maps from companies such as Google and Microsoft, together with the tools for creating mash-ups and sharing data, have greatly broadened the research use of geographic data sets for both multidisciplinary analytical purposes and the visualization of results. However, it is still frequently the case that a significant percentage of the research funding available for any given award must be spent establishing a bespoke software environment, creating services that exist elsewhere (but in an incompatible form), and managing the data within a project-specific environment. While the development of national and international spatial data infrastructures (SDI) and international environmental programs such as Global Earth Observation System of Systems (GEOSS) (http://www .earthobservations.org/geoss.shtml, accessed 8 June 2011), with its heavy emphasis on interoperability between multiple partners, is stimulating the introduction of open standards, the uptake in the wider community will remain slow and cautious until the perceived risk of adopting such an approach on an end-to-end systems basis is significantly reduced (Jackson et al., 2009). OGC Web Services Initiative This situation was recognized early by the Open Geospatial Consortium (OGC), who initiated the OGC Interoperability
Program. After the initial two Web Map Test Beds in 1998 and 2000, the Interoperability Program has conducted a series of OGC Web Services (OWS) Initiatives, beginning with OWS-1 in October 2001. The mission of the OWS-1 initiative was the definition and specification of interfaces that support interoperability among geospatially enabled Web services, where a Web service was defined as an application that can be published, located, and dynamically invoked across the Web. The Web services might perform functions ranging from simple requests to complicated business processes, and, once deployed, other applications and other Web services can discover and invoke the deployed service. In the OGC Web Services test-bed process, sponsors2 establish the requirements, technical scope, and agenda, as well as the form and content of demonstrations. They also contribute resources to support the initiative. Then, together with other technology vendors and researchers, they work together to define, test, and document open-interface specifications that enable Web services in the geospatial community. From the beginning, OWS-1 was envisioned as a multiphase initiative, whereby each phase would focus within a defined and short time scale on specific technology and specification “thread sets” and, in each case, would end with a demonstration of the capabilities that had been developed. Subsequent OWS initiatives have been organized, the most recent, at the time of writing, being OWS-7 (http://www.opengeospatial.org/standards/ requests/50), which was completed in 2010, and OWS-8 will be carried out in 2011. These test beds have been a key element in the process of developing the interoperability specifications, but they are by definition short-lived. They are not designed to provide a shareable platform for research in the academic community. They are also not designed to provide a facility that can be used in the context of graduate or postgraduate education, although videos taken of the latter OWS test beds do provide a useful function in this context. Other Test-Bed Activity Other geospatial test-bed developments have been undertaken, and some leading examples are described in Hobona et al. (2009). Collectively, these initiatives are helping to build community confidence in the benefits of moving away from desktop geoscience information systems and toward the concept of discrete geospatial data and processing services linked together 2
The sponsors of OWS-1 were the Federal Geographic Data Committee, GeoConnection–Natural Resources Canada, Lockheed Martin, the National Aeronautics and Space Administration, the U.S. Army Corps of Engineers Engineer Research and Development Center, U.S. Environmental Protection Agency EMPACT (Environmental Monitoring for Public Access and Community Tracking) Program, the U.S. Geological Survey, and the U.S. National Imagery and Mapping Agency. Participants were BAE Systems, Compusult Ltd., CubeWerx, Dawn Corporation, DLR (German Aerospace Center), ESRI, Galdos Systems, George Mason University, Intergraph, Ionic Software, Laser-Scan, PCI Geomatics, Polexis, SAIC, Social Change Online, University of Alabama– Huntsville (UAH), and Syncline.
across multiple providers and sites to allow flexible applications development. However, with a few exceptions, progress is limited. The need still remains for a more persistent test bed to support collaborative research and development based on interoperability standards and to provide a platform for education and applications development. Within Europe, the recognition of this need has led to a specific collaboration between research centers with the goal of providing such a capability. THE AGILE-EUROSDR-OGC PERSISTENT INTEROPERABILITY TEST BED (PTB) INITIATIVE An initiative with this need in mind was agreed upon on 8 May 2007 at an AGILE-EuroSDR-OGC3 Workshop on Persistent Geospatial Test Beds held in Aalborg, Denmark, as part of the annual Association of Geographic Information Laboratories for Europe (AGILE) Conference. At this workshop, the consensus view was that a persistent geospatial test bed should be established within the European research community. The objectives proposed for this initiative were: (1) to provide a research platform for collaborative European research in geospatial interoperability; (2) to aid the assessment of the current standards for geospatial interoperability in terms of research compatibility, completeness, consistency, ease of use, and extensibility; (3) to provide an environment for teaching standards and techniques for geospatial interoperability; and (4) to provide a resource to the members of AGILE, EuroSDR, and OGC Europe for the coordination of research requirements, as well as definition, testing, validation, and development of open standards. The proposal indicated that the test bed should be: (1) distributed and consist of data, Web services, hardware, software, tools, and human resources, and (2) based upon both ISO and OGC geospatial standards and other appropriate information technology standards (e.g., from W3C and OASIS). A consultation program was also recommended to address issues such as: (1) What purposes should a test bed serve? (2) What does it need to consist of? (3) How should it be governed and administered? (4) How should it be constructed? (5) How might it be funded? (6) How may it be sustained so that it is available in the long term? (7) What would the relationship of the test bed be to other data and service infrastructures, such as Europe's National Spatial Data Infrastructures (NSDIs), Europe's
INSPIRE program (http://inspire.jrc.ec.europa.eu/), the United Nation’s SDI (UNSDI, http://www.ungiwg.org/ unsdi.htm), and the Group on Earth Observation’s Global Earth Observation System of Systems (GEOSS; http://www.earthobservations.org/)? Subsequent to the workshop, a steering committee was formed having one representative from each of the three sponsor organizations, and a call for contributions to a persistent test-bed program was sent out. This program subsequently became known as the “AGILE-EuroSDR-OGC PTB” or PTB (http://sdi-testbed .eu) for short (Gould et al., 2007). SHARED INFRASTRUCTURE The shared infrastructure of the PTB conceptually consists of user interface, discovery, geoprocessing, and data access tiers. An illustration of the stacking of the tiers is presented in Figure 2. A wiki-based portal is the gateway to all PTB resources, documentation, and other material. The discovery tier consists of catalogue services hosted at various participating PTB organizations. It is intended for the catalogue services to register all contributed services. Such a shared infrastructure is not unique to the PTB; a similar infrastructure has been implemented for GEOSS (http://www.earthobservations.org/geoss.shtml, accessed 8 June 2011). The key role played by the catalogue services is inherited from the publish-find-bind paradigm of the service-oriented architecture (SOA), on which OGC Web services are developed. In support of the PTB, the University of Nottingham commissioned development to modify an installation of the Geonetwork catalogue service to support the registration of resources without a geospatial location; examples of such resources include tutorials, specifications, and program source code. Cataloguing of
3 Association of Geographic Information Laboratories of Europe (http://www.agile-online.org/)–European Spatial Data Research Organisation (http://www.eurosdr.net/start/)–Open Geospatial Consortium (http://opengeospatial.org/).

Figure 2. Overview of tiers in the shared persistent test bed (PTB) infrastructure: a user interface tier (wiki-based portal), a discovery tier (catalogue services), a geoprocessing tier (processing services and computational grids), and a data access tier (feature, coverage, and security services).
nonspatially referenced resources will help shape the PTB into not only a technology platform but also a knowledge-sharing platform. Responses to the initial call and the subsequent development of use cases have provided components for geoprocessing and data access tiers. With the discovery tier of the PTB close to maturity, the next phases of the PTB will see further organization of the geoprocessing (Brauner et al., 2009) and data access tiers of the platform. Current geospatial research undertaken by some of the respondents from the initial call suggests that grid computing may be included in geoprocessing tasks (Hobona et al., 2010), while security and sensor webs may enhance data access. It is therefore reasonable to suggest that the geoprocessing and data access tiers of the PTB will continue to evolve beyond the currently implemented use cases. PTB ACTIVITY AND PROGRESS The response to the 2007 call was strong. Over 30 organizations from across Europe indicated that they wished to participate. The responding organizations included national mapping and research agencies and local/regional government bodies as well as academic institutions. The positive response also included cross-European organizations such as the European Commission Joint Research Centre in Ispra, Italy. In order to ensure academic independence and objectivity, this first call was not open to commercial organizations. The magnitude and institutional breadth of response underlined the need for such a capability, but a pragmatic approach forward was needed in the absence of funding. In December 2007, it was agreed that the first phase of the test bed should be progressed through the development of a small number of discrete use case implementations. The services developed in these use cases would be published and made remotely accessible to participating members. The first phase was to successfully implement and demonstrate interoperability between selected member laboratories. The next phase would then start to link these with further use cases (1) to facilitate collaboration across organizations on SDI/INSPIRE-related research challenges and (2) to demonstrate increasingly rich application scenarios through the chaining of the use case services. These use cases (see http:// plone.itc.nl/gitestbed/phase-1-documents/PTB-background -and-future-directions-proposal.pdf) would also be developed to provide exemplars for teaching students. Use case results were published in Swan et al. (2008) and Hobona et al. (2009). Due to lack of resources, a comprehensive formal consultation exercise was not carried out, but feedback was received from several workshops and meetings across Europe where the project was discussed. This feedback indicated that the focus on a standards-based persistent interoperability test bed and software platform for research, teaching, and demonstration was relevant to the European SDI as defined by INSPIRE, as well as to other European and international programs. As a standards-based platform adopting a service-oriented and Web-services–based archi-
tecture, the PTB potentially offered a powerful basis for research group collaboration. By encouraging harmonization and sharing of data and services, it would also aid innovation and productivity in research and efficiency in the knowledge-transfer process. Importantly, it was felt that only through the availability of a persistent interoperability test bed could the risk of proceeding with the adoption of a multipartner loosely coupled data and service geospatial architecture be justified and the necessary skills base be developed. From 2008, a key objective was the goal of integrating the PTB activity more closely with European national and EU programs and especially INSPIRE. Linked to this objective, there is the need to secure adequate financial underpinning for the PTB initiative so as to support project management and coordination of the current use cases and to allow future sustainable development. A significant development in the context of these objectives is the establishment of cross-linkages with two major EU-funded projects: (1) In July 2008, an EU Framework 7 Programme Coordination and Support Action project called “GEOSS, INSPIRE, and GMES [Global Monitoring for Environment and Security], an Action in Support” (GIGAS; http://www.thegigasforum.eu/gigasweb.html) commenced. GIGAS promotes the coherent and interoperable development of the GMES (http://www.gmes .info/), INSPIRE, and GEOSS initiatives through their concerted adoption of standards, protocols, and open architectures. (2) 2008 also saw the commencement of the EuroGeographics-led European Spatial Data Infrastructure Network (ESDIN) EU eContentplus project (http://www .esdin.eu/). ESDIN aims to provide a bridge between the theory and the practice of the INSPIRE directive toward implementation and usage of interoperable geographical data by “spatially enabled societies.” As the importance of the European INSPIRE (http://www .inspire-geoportal.eu/) activities grows, the relevance of the PTB also increases to the European SDI programs. Whereas support from ESDIN and GIGAS provides the PTB with much-needed human resources, the PTB provides GIGAS with a template (Brauner et al., 2010) for an interoperability testing platform that can support multiple initiatives (such as GEOSS, INSPIRE, and GMES). Further, the PTB also potentially provides ESDIN with a persistent interoperability platform on which technical components of European SDI can be tested. Hence, the relationship among the PTB, GIGAS, and ESDIN is symbiotic and is expected to lead to further collaborations with other INSPIRErelated projects. The current implementations of the four use cases have been successful as examples of interoperability research undertaken between collaborating academic institutions. They do not, however, yet represent the stable research platform envisaged that would allow other researchers to discover and access a broad spectrum of test-bed services in order to use them in client
applications linked to their own and/or other third-party services for collaborative research purposes. This service-chaining ability requires the use case developers to publish the services, e.g., by providing a suitable catalogue and a PTB Web portal. The number and range of services also need to be enriched before meaningful applications may be built from the test-bed services or before they provide the community with the opportunity to test client functionality, test work-flow chains, assess standards/applications performance, or do detailed evaluations between alternative algorithms and service components. The goal is for the PTB to offer reliable and/or mutually interchangeable services that are stable enough to function as the basis for a wide range of European and global research projects. The gap between these objectives and current PTB status requires a significant commitment from participating PTB partners and a high level of coordination. This gap is most likely to be filled by taking modest incremental steps. To ensure focus, it is also best carried out within the framework of collaboration with major initiatives and projects such as the EU-funded GIGAS and ESDIN projects and the EuroSDR research commission structure. The proposed design for the next stage of platform development includes the creation of a use case that is largely achievable through the chaining of existing use case implementations. This will look to incorporate outputs from the phase 1 use cases, especially relating to data and services security. This exemplar will demonstrate access control around OGC Web Services on top of pan-European data. The widely used "Single Sign On" standards-based Shibboleth approach to federating access control has been integrated with OGC GeoXACML (www.geoxacml.org/) to demonstrate access to OGC WMS without changing the OGC interface. Because uptake of Shibboleth is widespread within the European academic sector, the plans are to build on this and work toward the objective of major content providers making their data available into the PTB on the basis of production strength security being available. Inclusion of authentication and authorization components in service discovery and chaining scenarios could drive development of use cases showcasing advances in geospatial Web service security. CONCLUSION This study concludes that a persistent test bed is a key tool for enhancing research into geoscience interoperability. However, we acknowledge that the lack of initial funding has been a significant obstacle to the speedy development of the PTB and that support from beneficiary projects (such as GIGAS and ESDIN) is critical to implementation. We further conclude that the international nature of the collaborating organizations in the
PTB provides a community representative of the users and stakeholders of INSPIRE. Therefore, the PTB has an important role to play in the adoption and future development of the European SDI (INSPIRE) and SDI on a global basis. REFERENCES CITED Brauner, J., Foerster, T., Schaeffer, B., and Baranski, B., 2009, Towards a research agenda for geoprocessing services, in Haunert, J., Kieler, B., and Milde, J., eds., Proceedings of 12th AGILE (Association of Geographic Information Laboratories of Europe) International Conference on Geographic Information Science, Hanover, Germany: AGILE: http:// agile2009.de (accessed 9 June 2011). Brauner, J., Bernard, L., and Müller, M., 2010, An organisational frame for interoperability testing in spatial data infrastructures, in INSPIRE (Infrastructure for Spatial Information in Europe) Conference 2010: Krakow, Poland, INSPIRE: http://inspire.jrc.ec.europa.eu/events/conferences/ inspire_2010/abstracts/69.doc (accessed 9 June 2011). Craglia, M., Goodchild, M.F., Annoni, A., Camara, G., Gould, M., Kuhn, W., Mark, D., Masser, I., Maguire, D., Liang, S., and Parsons, E., 2008, Nextgeneration Digital Earth: A position paper from the Vespucci Initiative for the Advancement of Geographic Information Science: International Journal of Spatial Data Infrastructures Research, v. 3, p. 146–167. European Commission, 2007, Directive 2007/2/EC of the European Parliament and of the Council of 14 March 2007 Establishing an Infrastructure for Spatial Information in the European Community (INSPIRE): Council, European Parliament and European Official Journal of the European Union, L 108/1–L 108/14. Gould, M., Jackson, M., Klien, E., and Lemmens, R., 2007, Design and implementation of persistent testbed for geo-web services, in Proceedings of the 13th EC-GI and GIS (European Commission Geospatial Information and Geographic Information Systems) Workshop 2007: Porto, Portugal; http:// www.ec-gis.org/Workshops/13ec-gis/presentations/5_sdi_technology _III_Klien_2.pdf (accessed 9 June 2011.) Hobona, G., Jackson, M.J., Gould, M., Higgins, C., Brauner, J., Matheus, A., Foerster, T., Nash, E., Lemmens, R., Abele, S., Swan, J., Anand, S., Strobel, S., Bishr, M., Korduan, P., and James, P., 2009, Establishing a persistent interoperability test-bed for European Geospatial Research, in Haunert, J., Kieler, B., and Milde, J., eds., Proceedings of the 12th AGILE (Association of Geographic Information Laboratories of Europe) International Conference on Geographic Information Science 2009: Hanover, Germany, Leibniz University of Hanover, 10 p., http://www.ikg.uni-hannover.de/ agile/fileadmin/agile/paper/31.pdf (accessed 9 June 2011). Hobona, G., Fairbairn, D., Hiden, H., and James, P., 2010, Orchestration of grid-enabled geospatial web services in geoscientific workflows: IEEE Transactions on Automation Science and Engineering, v. 7, no. 2, p. 407– 411. Jackson, M.J., and Tucker, T., 2005, Architecture and design considerations for mobile location based services, in Location Intelligence Conference 2005: http://www.powershow.com/view/39c21-YTEyN/Architecture_and _Design_Considerations_for_Mobile_Location_Based_Services_flash _ppt_presentation (accessed 9 June 2011). 
Jackson, M.J., Schell, D., and Taylor, D.R.F., 2009, The evolution of geospatial technology calls for changes in geospatial research education and government management: Directions Magazine, 6 April 2009, 10 p.; http://www .directionsmag.com/articles/the-evolution-of-geospatial-technology -calls-for-changes-in-geospatial-rese/122571 (accessed 9 June 2011). Swan, J., Foerster, T., Lemmens, R., Hobona, G., Anand, S., and Jackson, M., 2008, Discovery and invocation of schematisation services: A use case for the OGC-EuroSDR-AGILE persistent test-bed for Europe: GEOconnexion International Magazine, November 2008, p. 24–27. MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
Printed in the USA
The Geological Society of America Special Paper 482 2011
GEONETCast: Global satellite data dissemination and the technical and social challenges George Jungbluth Richard Fulton Linda Moodie NOAA/NESDIS (National Oceanic and Atmospheric Administration/National Environmental Satellite, Data, and Information Service), 1335 East West Highway, SSMC 1, Silver Spring, Maryland 20910, USA Paul Seymour NOAA/NESDIS (National Oceanic and Atmospheric Administration/National Environmental Satellite, Data, and Information Service), Suitland Federal Center, 4231 Suitland Road, Room 1654, Suitland, Maryland 20746, USA Mike Williams Lothar Wolf European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT), Eumetsat Allee 1, D-64295 Darmstadt, Germany Jiashen Zhang National Satellite Meteorological Center, China Meteorological Administration, No. 46, South Street of Zhong-Guan-Cun, District Haidian, Beijing 100081, China
ABSTRACT GEONETCast is a global, near–real-time, environmental data dissemination system in support of the Global Earth Observation System of Systems (GEOSS). The goal of the system is to enable enhanced dissemination, application, and exploitation of environmental data and products for the diverse societal benefits defined by the Group on Earth Observations (GEO), including agriculture, energy, health, climate, weather, disaster mitigation, biodiversity, water resources, and ecosystems. The system consists of three regional broadcasts: EUMETCast (operated by the European Organisation for the Exploitation of Meteorological Satellites [EUMETSAT], covering Europe, Africa, and parts of Asia and the Americas), CMACast (operated by the China Meteorological Administration [CMA], covering Asia and parts of the Pacific), and GEONETCast Americas (operated by the U.S. National Oceanic and Atmospheric Administration [NOAA], covering North, Central, and South America and the Caribbean). The GEONETCast system uses the Digital Video Broadcast-Satellite
Jungbluth, G., Fulton, R., Moodie, L., Seymour, P., Williams, M., Wolf, L., and Zhang, J., 2011, GEONETCast: Global satellite data dissemination and the technical and social challenges, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 77–85, doi:10.1130/2011.2482(08). For permission to copy, contact [email protected]. © 2011 The Geological Society of America. All rights reserved.
(DVB-S) or Digital Video Broadcasting–Satellite–Second Generation (DVB-S2) standard over commercial communications satellites and low-cost, off-the-shelf technology to widen the access of new user groups to Earth observation information.
INTRODUCTION Government ministers from 58 countries and the European Commission agreed at the third Earth Observation Summit in February 2005 to put in place a Global Earth Observation System of Systems (GEOSS) to meet the need for timely, quality, longterm global provision of environmental information as a basis for sound decision making and to enhance delivery of benefits to society. The ministers also established the intergovernmental Group on Earth Observations (GEO) to take the steps necessary to implement GEOSS. The United States, European Commission, China, and South Africa serve as co-chairs of GEO, which now includes 86 member countries, the European Commission, and 61 participating organizations (GEO Secretariat, 2011). GEO’s vision for GEOSS is the global leveraging of existing and future regional, national, and global environmental observation and data management systems for the benefit of all society. In many cases, organizations or governments develop and implement environmental observation and data management systems for their national and regional users’ needs without linking them with other similar systems in other regions, often resulting in a comprehensive yet disconnected patchwork of valuable environmental resources that cannot be exploited by all of society. Often, these systems are built originally for specific, limited purposes without realization of the potential value to other scientific or other disciplines for little or no added cost (e.g., the value of meteorological satellite observations for critical decision making in the health or energy communities). In the current resourceconstrained age of trying to do more with less, it is imperative for world leaders to reduce the constantly increasing impacts of environmental challenges by working together to share their individual resources across political borders and across scientific disciplines, since many environmental problems are fundamentally global in scope (CENR/IWGEO, 2005). It is the objective of GEOSS to engage organizations and governments to take their existing environmental observing, value-added data processing, and distribution systems and integrate them together into a globally linked “system of systems” that can provide societal benefits for a global audience (CENR/ IWGEO, 2005; GEO Secretariat, 2007). The participation of no less than 86 member countries in GEOSS is a testament to the common understanding of the need for global cooperation to address global environmental challenges that cross political boundaries. GEONETCast is envisioned as one piece of this broader initiative and a step forward in the global exchange of environmental information using a common and inexpensive receive station platform based on the latest satellite communication technology that can be implemented regardless of the lack
of terrestrially based communication infrastructure. For more information on GEOSS, please see the GEO Web page at http:// earthobservations.org. GEONETCAST CONCEPT GEONETCast is an important near–real-time data distribution system within GEOSS by which environmental data and products from participating data providers are transmitted to users through a global network of communications satellites using a multicast, broadband capability. This general dissemination capability, manifested through a small number of regional but interconnected GEONETCast systems, can be especially useful in parts of the world where high-speed terrestrial communication lines and/or Internet are not available or in regions where these lines have been disrupted by natural disasters. It is intended to complement other existing data dissemination systems using other data delivery methods (Fulton et al., 2009). A motivating factor to increase the use of environmental data across the world is to make it accessible to all nations in a cost-effective and efficient manner. GEONETCast promises to facilitate and enhance access to environmental data in the nine defined societal benefit areas of GEO (agriculture, weather, water, energy, health, climate, biodiversity, disasters, and ecosystems) (GEO Secretariat, 2007). Potential societal benefits from GEONETCast exist in all nine of GEO’s defined societal benefit areas. GEONETCast is a pipe (not unlike the Internet) through which environmental data are transmitted from the originating data providers to the data end users, so its benefits encompass all of the benefits derived from using the diverse environmental data that it carries. It is therefore an enabler of benefits through enhanced communications so that users that once may not have had access to data can now be a part of the network, at limited cost, to derive their associated benefits (Fulton et al., 2009; Williams and Wolf, 2007). NOAA and EUMETSAT (European Organisation for the Exploitation of Meteorological Satellites) presented their vision for GEONETCast at the second GEO Plenary meeting in December 2005. The concept was endorsed and became a GEO Workplan Task in Architecture and Data Management. In November 2006, early success in the development of GEONETCast was showcased at the third GEO Plenary meeting in Bonn, Germany, through an international press conference (Williams and Wolf, 2007). In 2007, the China Meteorological Administration volunteered their FENGYUNCast system, the precursor to CMACast, as a key component of the GEONETCast system, and in April 2008, NOAA began operational broadcasts of its GEONETCast Americas service. These services joined EUMETSAT’s
contribution to the GEONETCast system, EUMETCast, which went operational in November 2002. In 2010, the MITRA satellite dissemination service, maintained by the Russian Federal Service for Hydrometeorology and Environmental Monitoring (Roshydromet), began work on integrating its operations into the global GEONETCast framework. GEONETCAST SYSTEM CONCEPT AND ARCHITECTURE The GEONETCast system follows the GEOSS concept in being a network of regional dissemination services working together to form a global system. GEONETCast is a user-driven (from a requirements and products standpoint) interconnected global network of near-real-time regional dissemination systems that link GEOSS environmental data, products, service providers, and users across the globe. Each regional system is focused on a specific sector of the globe, primarily supporting the specific needs of users in that sector. However, these regional systems are interoperable with each other to allow data files to flow across the regional boundaries in both directions as needed by users in other regions. There are two main system components of GEONETCast: (1) a regional data collection, management, and dissemination system, and (2) distributed user receiver stations. The regional data collection, management, and dissemination systems within GEONETCast are referred to as the GEONETCast Regional Network Centers (GNCs). These regional hubs, EUMETCast, CMA-
Cast (the follow-on system to CMA’s FENGYUNCast), and GEONETCast Americas, perform data collection and receipt of GEOSS files from regional data providers and other GNCs. They then schedule, prioritize, and manage the broadcast of these files over their regional systems to users in near–real time (Fig. 1). The GEONETCast broadcasts are disseminated through the use of commercial communications satellites, which rebroadcast the uplink from the GNCs to the users. Currently, all regions are broadcasting to users by means of the Digital Video BroadcastSatellite (DVB-S) or Digital Video Broadcasting–Satellite– Second Generation (DVB-S2) standard using a Ku and C-band radio transmission. Files are sent over the system, which is organized into a series of subchannels, similar to the way in which a satellite television broadcast is organized into selectable subchannels by broadcaster. These subchannels can be selected or deselected, allowing users to download all or any relevant portion of the GEONETCast broadcast that they need, using only the receive system and antenna. Each GNC sets its own broadcast standards, encoding and access rights, and software parameters, allowing regional services to maintain the flexibility necessary to utilize service models best suited to the regional client base while maintaining interoperability and data transfer abilities with the other regional hubs. Currently, thousands of files per day are exchanged between network centers. Although each system may have unique system architecture characteristics, they are all able to exchange data files in both directions in a manner that is transparent to the user. For example,
Figure 1. Global GEONETCast coverage by regional provider (Fulton et al., 2009). CMA—China Meteorological Administration; NOAA— National Oceanic and Atmospheric Administration; EUMETSAT—European Organisation for the Exploitation of Meteorological Satellites.
data files originating in China, Africa, or Europe can be received by GEONETCast Americas for rebroadcast as needed by users in the Americas, and, similarly, data files originating in the Americas can be sent to other regional systems for broadcast in their regions (Fig. 2).
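To make the select/deselect behavior of the subchannels more concrete, the sketch below shows, in Python, how a receive station might route incoming broadcast files to local directories according to the channels an operator has subscribed to. It is purely illustrative: the real regional broadcasts use the client datacasting software described above, and the channel names and file-naming convention here are invented.

```python
# Schematic illustration of the subchannel select/deselect idea described above.
# Real GEONETCast receive stations use the client datacasting software supplied
# with each regional broadcast; the channel names and file layout here are invented.
from pathlib import Path
import shutil

# Channels the operator has subscribed to, mapped to local target directories.
SUBSCRIBED = {
    "SBA-DISASTERS": Path("incoming/disasters"),
    "SBA-WEATHER": Path("incoming/weather"),
}


def route_file(broadcast_file: Path) -> bool:
    """File names are assumed to carry their channel as a prefix, e.g.
    'SBA-WEATHER_20110115_precip.nc'. Unsubscribed channels are ignored."""
    channel = broadcast_file.name.split("_", 1)[0]
    target = SUBSCRIBED.get(channel)
    if target is None:
        return False                      # deselected channel: discard
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy2(broadcast_file, target)  # deliver to the local data store
    return True
```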
Receive Stations The GEONETCast satellite broadcast is received on the ground by relatively low-cost user receive stations utilizing common off-the-shelf components to the maximum extent possible to minimize user costs. These stations include a dish antenna (for C-band, typically 2.4 m; for Ku-band, 1 m in diameter), cabling, and a standard personal computer and hardware and software components necessary to decode the incoming satellite signal and create data files on the station's hard drive. These components include a standard commercial DVB-S or DVB-S2 receiver box or card and client datacasting software. GEONETCast receive stations are intended to be relatively affordable, with project materials costs amounting to under $3000, where the C-band antenna, at $1500, is the largest expense (Ku antennas are cheaper). Commercial DVB receiver boxes cost approximately $80–200, and the client software used by the GNCs is also not expensive (Fulton et al., 2009; Williams and Wolf, 2007). An advantage of using the widespread DVB-S standard and C- or Ku-band transmissions (depending on region) is that there is a large supply of discount and surplus, previously used hardware that can be repurposed and incorporated in a GEONETCast receiving station for very low cost (Fig. 3).
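As a rough worked example of the cost envelope just described, the sketch below tallies a hypothetical C-band station against the ~$3000 materials budget. Only the antenna, receiver, and software figures come from the text above; the other line items are placeholder estimates, not published prices.

```python
# Back-of-the-envelope cost model for a C-band GEONETCast receive station.
# Only the antenna, receiver, and software figures are quoted in the chapter;
# the remaining line items are rough placeholders.
PARTS_USD = {
    "C-band dish antenna (2.4 m)": 1500,     # largest single expense (per text)
    "DVB-S/DVB-S2 receiver box": 200,        # quoted range is $80-200
    "Personal computer": 600,                # placeholder estimate
    "Cabling, LNB, mounting hardware": 300,  # placeholder estimate
    "Client datacasting software": 100,      # "not expensive" per the text
}

total = sum(PARTS_USD.values())
assert total < 3000, "should stay under the ~$3000 project materials budget"
for item, cost in PARTS_USD.items():
    print(f"{item:<35} ${cost:>5}")
print(f"{'Total':<35} ${total:>5}")
```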
Regional Dissemination Systems within GEONETCast
EUMETCast EUMETSAT’s EUMETCast data dissemination system, which is their regional contribution to the GEONETCast system, actually predated the development of GEONETCast, and many of the design concepts ultimately utilized by the new global GEONETCast system originated in the early 2000s from the EUMETCast model, including the use of DVB-S technology on commercial satellites as a means to distribute real-time environmental data. The interoperability concepts behind EUMETCast with respect to data acquisition and redistribution allow it to be a substantial contribution to the World Meteorological Organization’s (WMO) Integrated Global Data Dissemination System (IGDDS), a concept for a global standard for the broadcast of meteorological satellite data (Withee, 2006). The GEONETCast concept for distribution of a much more diverse selection of environmental data (beyond just meteorological satellite data) naturally followed and was developed and implemented by the international GEONETCast Implementation Group in direct support of the GEOSS concept for globally linking environmental
Figure 2. GEONETCast data dissemination and exchange (Fulton et al., 2009).
data management systems to maximize international societal benefits (Moura, 2006). The EUMETCast system uses several commercial satellites and transponders to cover the region of Europe, Africa, and the Middle East. Additional details can be found at http://www.eumetsat.int/Home/Main/What_We_Do/EUMETCast/System_Description/index.htm. EUMETCast uses Ku-band and C-band frequencies with several satellite footprints. Unique sets of data are broadcast over each footprint based on regional user needs. EUMETCast currently serves ~3200 users in Europe (14.5 Mb/s), 320 in Africa (3.5 Mb/s), and 50 through its Americas service (2.5 Mb/s). More information on EUMETCast can be found at the following web link: http://www.eumetsat.int/Home/Main/DataAccess/EUMETCast/index.htm. CMACast CMA's contribution to GEONETCast is called CMACast, a follow-on to the FENGYUNCast service originally deployed in 2004. Similar to EUMETCast, FENGYUNCast started out with an IGDDS focus on distribution of their Chinese meteorological satellite data to users, but it has since expanded to include a broader set of environmental data from other environmental disciplines as well. As a result, FENGYUNCast also became CMA's contribution to GEONETCast for the Asia-Pacific region (Fulton et al., 2009). One of the requirements of deploying a GEONETCast system that goes beyond an IGDDS is that the system must accommodate all types of environmental data from as many of the nine societal benefit areas as possible. FENGYUNCast started out initially using Ku-band, but then switched to C-band
frequencies, which have the advantage of a larger footprint coverage area than Ku-band. CMACast, initiated in 2010, will broadcast to ~2600 users in 18 countries throughout Asia and parts of the Pacific with a bandwidth of 70 Mb/s. CMACast uses a C-band transmission on the DVB-S2 standard. The broadcast contains meteorological and satellite data, products, and forecasts, as well as training and capacity building. More information on CMA and CMACast can be found by visiting the CMA Web site at http://www.cma.gov .cn/en/. GEONETCast Americas GEONETCast Americas is the Western Hemisphere component of the global GEONETCast system. Maintained by NOAA, GEONETCast Americas broadcasts on the C-band from the Intelsat IS-9 communications satellite in geostationary orbit and has a broadcast footprint reaching from southern Canada through southern Chile (Fulton, 2007). GEONETCast Americas went operational in April 2008 and currently serves users in seven countries throughout the Western Hemisphere. GEONETCast Americas maintains a 2 Mb/s bandwidth. The current product suite contains a number of NOAA/NESDIS (National Environmental Satellite, Data, and Information Service) products, including fire and ash products, precipitation, and ice cover, as well as satellite-derived products from the SERVIR Regional Visualization and Monitoring System hosted by CATHALAC (Centro del Agua del Trópico Húmedo para América Latina y Caribe or Water Center for the Humid Tropics for Latin America and the Caribbean) in Panama, Brazilian INPE (Grupo de Organização da Comissão Nacional de
Figure 3. GEONETCast system hardware (Fulton, 2007). DVB—Digital Video Broadcast.
Atividades Espaciais or National Institute for Space Research), Argentine CONAE (Comisión Nacional de Actividades Espaciales or National Space Activities Commission), U.S. Environmental Protection Agency (EPA), and the National Aeronautics and Space Administration (NASA). GEONETCast Americas differs in some respects from the other regional broadcasts in that it was commissioned expressly to serve the GEOSS effort through the broadcast of Earth observation information products and data in the nine societal benefit areas. NOAA currently operates and maintains other primary systems and infrastructure to disseminate data streams from its polarorbiting and geosynchronous meteorological satellites. More information on GEONETCast Americas can be found by visiting the Web site at http://www.geonetcastamericas.noaa.gov/.
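The regional services described in the preceding subsections differ mainly in operator, frequency band, bandwidth, and coverage. The short sketch below gathers those figures into one structure for side-by-side comparison; the values are simply those quoted in this chapter (circa 2011) and will have changed since.

```python
# Summary of the three GEONETCast regional broadcasts as characterized in this
# chapter (figures are as reported circa 2011 and will have changed since).
REGIONAL_BROADCASTS = [
    {"name": "EUMETCast", "operator": "EUMETSAT", "bands": "Ku/C",
     "since": 2002, "bandwidth_mbps": 14.5,  # European service; Africa 3.5, Americas 2.5
     "coverage": "Europe, Africa, Middle East (plus an Americas service)"},
    {"name": "CMACast", "operator": "CMA", "bands": "C (DVB-S2)",
     "since": 2010, "bandwidth_mbps": 70.0,
     "coverage": "Asia and parts of the Pacific"},
    {"name": "GEONETCast Americas", "operator": "NOAA", "bands": "C",
     "since": 2008, "bandwidth_mbps": 2.0,
     "coverage": "North, Central, and South America and the Caribbean"},
]

for b in REGIONAL_BROADCASTS:
    print(f"{b['name']:<22} {b['operator']:<9} {b['bands']:<11} "
          f"since {b['since']}  {b['bandwidth_mbps']:>5} Mb/s  {b['coverage']}")
```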
GEONETCAST CURRENT OPERATIONAL STATUS AND PLANS
GEONETCast has successfully demonstrated a full initial operating capability, with near-global reach and broadcasts available to users in all regions 24 h per day, 7 d per week, and 365 d per year. The regional network centers coordinate their efforts through the GEONETCast Implementation Group, which communicates status and ongoing operational plans through twice-monthly teleconferences. In addition to resolving operational and technical problems in coordination and data transfer, the Implementation Group enables regional centers to learn about new users and data providers in each region, bringing new potential synergies and allowing innovations to be more quickly spread across regions. Since initiation, GEONETCast has also become quickly integrated into the overall GEO effort. Several GEO-backed projects supported by the European Union, ICT (Information and Communication Technology), USAID (U.S. Agency for International Development), and others take advantage of the system, and GEONETCast is positioned as a necessary infrastructure link in several GEO work plan tasks on capacity building and development. Capacity-building and EO (Earth Observation) projects taking advantage of GEONETCast for implementation include DevCoCast (GEONETCast applications for and by developing countries), AIDA (Advancing ICT for DRM [Disaster Risk Management] in Africa), MIA-VITA (European Union Volcano and Disaster Mitigation Initiative), AEGOS (African-European Georesources Observation System), SERVIR (NASA-USAID Environmental Information Dissemination Partnership), and RANET (Radio and Internet for the Communication of Hydro-Meteorological and Climate Related Information), among others. GEONETCast's activities through 2011 include the following operational and thematic milestones:
• Further develop the GEONETCast Training Channel, including securing and/or developing content, negotiating broadcast and usage plans, and utilizing user feedback in improving service.
• Further develop the GEONETCast Alert Channel and increase the product offering, including guidance on information and alert characteristics, routing and priority, as well as interoperability.
• Work with the International Charter for Space and Major Disasters to carry out trial transmissions of Charter information over regional broadcasts.
• Forge new contacts in academic, policy planning, and other communities for the use of the GEONETCast system, and explore multinational academic and policy partnerships taking advantage of the GEONETCast system as an international information dissemination tool.
• Operationalize data sharing and interoperability among the GNCs; the GEONETCast Network Centre Service Specification document has been agreed by the three original GNCs.
• Explore options for expanding GNC coverage into the Pacific, and gauge potential user interest and potential avenues for operations of an expansion.
• Develop the Product Navigator to allow GEONETCast users to find metadata information related to the products being transmitted, including reference to the data sent through GEONETCast into the various EO portals developed within GEO.
• Organize training courses for the setting up and utilization of GEONETCast receiving stations, with a particular focus on developing countries (in coordination with CB09-02g).
• Continue to cooperate with regional partners to address user data needs, including participation and/or leadership in requirements generation activities.
(2011 GEONETCast progress report to GEO)
1. Data Exchange

In 2010, GEONETCast focused on enhancing data-exchange activities between the regional network centers. All three regional broadcasts have established initial data-exchange capabilities and are providing operational data exchange, ensuring that products from one regional broadcast are available to other regions as requested by users. The MITRA system has established initial data-exchange arrangements with the GEONETCast Network Centres and has set milestones for integrating into the global GEONETCast system.

2. Product Navigator

In 2008, EUMETSAT released the GEONETCast Product Navigator, a fully searchable file-based catalog of GEONETCast product offerings across all regional broadcasts that is viewable using standard Web browsers. The Product Navigator can organize products by a number of user-selectable variables, including provider, broadcast, file type, and societal benefit area. The Product Navigator is fully compliant with GEO and World Meteorological Organization metadata standards, and it will be included for distribution in all GEONETCast global broadcasts, as well as through the GEO and regional GEONETCast Web portals. It will be updated and distributed to GEONETCast users routinely via the satellite broadcast as well as the Internet.
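Faceted selection of the kind the Product Navigator offers can be illustrated in a few lines of Python. The catalog entries and field names below are invented for illustration and do not reflect the actual Product Navigator file format.

```python
# Toy catalog illustrating selection by the user-selectable variables named in the
# text (provider, broadcast, file type, societal benefit area). The entries and
# field names are invented; they are not the real Product Navigator format.
CATALOG = [
    {"name": "Fire product", "provider": "NOAA/NESDIS",
     "broadcast": "GEONETCast Americas", "file_type": "GeoTIFF", "sba": "Disasters"},
    {"name": "Precipitation estimate", "provider": "NOAA/NESDIS",
     "broadcast": "GEONETCast Americas", "file_type": "netCDF", "sba": "Water"},
    {"name": "Vegetation index", "provider": "INPE",
     "broadcast": "EUMETCast", "file_type": "GeoTIFF", "sba": "Agriculture"},
]

def select(catalog, **criteria):
    """Return catalog entries matching every supplied facet value."""
    return [entry for entry in catalog
            if all(entry.get(key) == value for key, value in criteria.items())]

if __name__ == "__main__":
    for entry in select(CATALOG, broadcast="GEONETCast Americas", sba="Disasters"):
        print(entry["name"])
```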
3. GEONETCast Training Channel

In 2008, GEONETCast demonstrated proofs of concept for the GEONETCast Training Channel, which will be dedicated to products and programming designed to train new users in the application and production of Earth observation products, and in the use of GEONETCast and other GEOSS tools for accessing Earth observation information. Initial proofs of concept included transmitting distance-learning modules over the GEONETCast Americas and EUMETCast Africa broadcasts, as well as data-exchange activities to send training materials between the GEONETCast Americas and EUMETCast regional network centers. An ongoing goal of GEONETCast is to further develop the service strategy for the Training Channel, including determining user requirements and potential training materials, as well as determining the ways in which file selection and broadcasts will be optimized to best serve user needs.

4. Hazard Alert Channel

In 2008, GEONETCast also demonstrated initial proofs of concept for a Hazard Alert Channel, intended to relay regional environmental warnings and alerts in priority broadcasts to users. This capability is inherent in the GEONETCast architecture and could bring real benefit to users in remote locales without other reliable means of information dissemination. A service strategy and user requirements are being determined, and initial product streams are coming online in the Americas and in Europe and Africa. Initial pilot concepts include bringing flash-flood warnings and guidance to rural communities in Central America. The GEONETCast Hazard Alert Channel is fully compatible with the WMO Common Alerting Protocol (CAP) standard (a minimal CAP sketch is given after the DevCoCast section below).

5. DevCoCast

A key goal for 2009 was to widen the reach of the GEONETCast system by bringing it to new users and user groups. The DevCoCast project, funded by the European Commission, has been instrumental in supporting the activation of new receiver stations in Africa and in Latin America. The International Institute for Geo-Information Science and Earth Observation (ITC) of the Netherlands has partnered with DevCoCast to provide training in the use of GEONETCast for environmental stewardship in Africa and South America. DevCoCast, in partnership with local data providers, has held multiple training events for users in Africa and South America from 2009 to September 2011, reaching large numbers of users through in-person education sessions and online seminars.
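As context for the Hazard Alert Channel described above, the following is a minimal sketch of a CAP 1.2 alert document assembled with Python's standard library. Only a small subset of the CAP element set is shown, and the identifier, sender, and message text are invented; a real alert would carry additional elements and would be issued by an authorized alerting authority.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

CAP_NS = "urn:oasis:names:tc:emergency:cap:1.2"  # CAP 1.2 namespace

def build_flash_flood_alert(area_desc: str, headline: str) -> bytes:
    """Assemble a minimal CAP 1.2 alert document (small subset of elements only)."""
    ET.register_namespace("", CAP_NS)
    alert = ET.Element(f"{{{CAP_NS}}}alert")

    def add(parent, tag, text):
        element = ET.SubElement(parent, f"{{{CAP_NS}}}{tag}")
        element.text = text
        return element

    add(alert, "identifier", "GNC-A-TEST-0001")          # invented identifier
    add(alert, "sender", "[email protected]")     # invented sender
    add(alert, "sent", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00"))
    add(alert, "status", "Test")
    add(alert, "msgType", "Alert")
    add(alert, "scope", "Public")

    info = ET.SubElement(alert, f"{{{CAP_NS}}}info")
    add(info, "category", "Met")
    add(info, "event", "Flash Flood")
    add(info, "urgency", "Expected")
    add(info, "severity", "Severe")
    add(info, "certainty", "Likely")
    add(info, "headline", headline)

    area = ET.SubElement(info, f"{{{CAP_NS}}}area")
    add(area, "areaDesc", area_desc)

    return ET.tostring(alert, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    print(build_flash_flood_alert("Hypothetical river basin, Central America",
                                  "Flash-flood guidance exceeded").decode("utf-8"))
```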
6. New User Groups and Data Providers

GEONETCast has established user groups in the meteorological community and is expanding to include environmental science organizations dedicated to regional stewardship, such as the CATHALAC center in Panama and the RCMRD (Regional Centre for Mapping of Resources for Development) facility in Nairobi, Kenya. GEONETCast is committed to growing its user base to serve a wider section of decision makers, scientists, and the public. In 2009, GEONETCast Americas began a collaboration with the NOAA-funded CREST (Cooperative Remote Sensing Science and Technology Center) alliance of universities in the United States to link remote-sensing science graduate teams in separate locations with satellite-derived and other Earth observation data of use to their work. The CREST partnership also has plans to develop new products, drawing on its own research and an X-band receiver station at the City College of New York, that can eventually be disseminated back over the GEONETCast system to affiliated academic partners and the greater user base. As of 2010, a GNC-A receive station has been in place at the Bowie State University campus in Maryland, and another is being installed at the University of Puerto Rico at Mayagüez (UPRM). In 2008 and 2009, GEONETCast Americas released Announcements of Opportunity calling for data providers to propose products for dissemination over the GEONETCast system. The announcements were distributed at the 2008 GEOSS in the Americas symposium, as well as internally through U.S. government agencies.

SOCIAL AND TECHNOLOGICAL CHALLENGES TO GEONETCAST

Technical Challenges

System Integration

The technical challenges to the successful implementation of the GEONETCast system have in large part been identified over the last several years during the development of the system and are on the way to being resolved. These challenges are mostly related to integrating three, not wholly similar, regional dissemination systems to ensure the necessary levels of interoperability and data exchange. While all three GEONETCast regional broadcasts use either the DVB-S or DVB-S2 standard and maintain an almost identical hardware requirement for receive stations, each network maintains its own network software solution, each of which offers varying levels of differentiated access rights and user registration requirements. For example, EUMETCast broadcasts an encrypted signal that users decode with the use of a network key, CMACast uses encoding with a main-station user authorization system, and GEONETCast Americas maintains no data encryption or user registration requirement. The varying levels of encryption and user access policies across systems mean that additional coordination is sometimes necessary to allow users access to encrypted data from non-native regional broadcasts.
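To summarize the access arrangements just described, the sketch below records them as plain data and flags when extra coordination is likely to be needed for cross-regional access. The structure and wording are illustrative only and are not drawn from any GEONETCast specification.

```python
# Illustrative summary of the regional access arrangements described in the text.
# Field values paraphrase the chapter; nothing here is an official specification.
REGIONAL_BROADCASTS = {
    "EUMETCast": {"dvb": "DVB-S/DVB-S2", "access": "encrypted; decoded with a network key"},
    "CMACast": {"dvb": "DVB-S2", "access": "encoded; main-station user authorization"},
    "GEONETCast Americas": {"dvb": "DVB-S", "access": "no encryption or user registration"},
}

def needs_cross_region_coordination(home_region: str, data_region: str) -> bool:
    """True when a user served by one regional broadcast wants data carried on
    another region's broadcast that applies encryption or encoding."""
    if home_region == data_region:
        return False
    return "no encryption" not in REGIONAL_BROADCASTS[data_region]["access"]

if __name__ == "__main__":
    print(needs_cross_region_coordination("GEONETCast Americas", "EUMETCast"))  # True
    print(needs_cross_region_coordination("EUMETCast", "GEONETCast Americas"))  # False
```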
Data Exchange

The final vision for GEONETCast is to have all regional broadcasts joined to one another with robust data links to ensure that users in any region can have access to any and all Earth observation products, regardless of product origin or region. Creating dedicated data links compatible with the data dissemination and security standards of each regional organization has been a technical challenge. All three systems maintain common service specifications and similar standards, which help to overcome these challenges. For example, all three systems use the DVB-S protocol and TCP/IP FTP (Transmission Control Protocol/Internet Protocol File Transfer Protocol) for data acquisition and ingestion, as well as exchange. Currently, initial links between all regions are in place, and working links with the necessary security and automation for sustainable operation are in advanced implementation stages. It is important to note that while the technical capacity for data exchange is designed into the system, the transfer of products from one region to another is driven by user demand and is implemented on a case-by-case basis.

Service Cost

The GEONETCast direct broadcast model utilizes commercial communications satellite technology to broadcast environmental information and products to receive stations consisting of easy-to-buy, off-the-shelf components. This model shifts much of the cost of providing the data broadcast service onto the disseminating organization, through the establishment and maintenance of a satellite broadcast communication infrastructure and the recurring purchase of commercial satellite bandwidth sufficient to carry the required data volume. This, however, provides an advantage to users, who are relieved of the need to acquire multiple satellite data streams and products from disparate providers using proprietary receiver infrastructure, or through costly high-speed Internet connections. User-supplied components for a working receiver station cost less than $3000 (Fulton et al., 2009; Williams and Wolf, 2007), and these are one-time costs, since there are generally no (or minimal) recurring subscription costs to obtain the broadcast; the labor for installing a satellite dish and receiver hardware adds somewhat more. While this is a small sum in comparison with the cost of some other satellite or terrestrial communications technologies, it can still be considerable for target audiences in rural areas and in developing countries, especially if potential users are not currently utilizing Earth observation technology in their work and may not be convinced of its utility.

Language

GEONETCast is a near-global system with a widespread and diverse user community. While the main operational language at the coordinating level and in many of the product offerings is English, a majority of GEONETCast users are not native English speakers, and a large proportion of them do not speak English at all. GEONETCast has begun efforts to disseminate products, including capacity-building and training programming, in local languages, including training modules for Spanish- and Portuguese-speaking users in Latin America and Africa. Lack of local-language support nevertheless remains an obstacle to expanding the user base in some areas.
User Capacity for Data Exploitation

GEONETCast maintains the technical capability to send Earth observation data to a large percentage of the world's population. However, many potential end users lack the technical and knowledge base to exploit the available data streams (including value-added data processing) and to locally disseminate the data they receive. GEONETCast is an infrastructure component of GEOSS, along with the GEO Portals and other dissemination tools, and while it serves an important capacity-building role, there is no data processing capability available to users through the GEONETCast broadcast itself. GEONETCast is now working closely with the GEO Capacity Building Committee and other groups to bring important capacity-building programming to end users who require training in utilizing GEONETCast-provided and other data for local use. The GEONETCast Training Channel is a key element in supporting these efforts.

Social Challenges

Growing Sustainable Communities of Providers and Users, and Servicing and Generating Requirements

The key social challenge to the success of the GEONETCast system is providing a service that data providers and users find valuable and that they want to see expanded in capability. GEONETCast seeks to bring Earth observation products to new audiences, including many who have never had access to such information before. Therefore, GEONETCast must work together with GEO and other partners to market the service to potential users and to those data providers who wish to provide greater access to the valuable data products they create. This process includes targeting user groups, regions, and sectors of society that could benefit from the information and products carried over this type of "last mile" system. A temptation for information dissemination service managers in this process is to populate the dissemination service with data sets and products that one would guess would be of interest to potential users. GEONETCast has, however, worked to resist this urge to "fill the pipe" and works to actively generate requirements from current and potential users. Building momentum here is a challenge: it is difficult to generate requirements from a small but growing number of active users; however, greater numbers of users will come only after a larger suite of identified products is included in the broadcast. In the Americas, the GEONETCast Americas Coordination Group is actively engaging both data providers and end users using an accepted push-pull vetting process for discovery, acceptance, and broadcast of new environmental data types desired by users over the service. In 2010, this group began working with a World Meteorological Organization (WMO) requirements task team to determine products and data of interest to users in WMO Regions III and IV (South and Central America and the Caribbean).
The new recommended products are currently being added to the broadcast where feasible. Each region has approached this challenge uniquely. In Africa, DevCoCast, supported by the European Commission, has worked with local partners to hold training sessions on the use of the GEONETCast system and has supported the acquisition of new receiving stations. This project has also been successful in Latin America, where the Brazilian INPE/CPTEC (Centro de Previsão de Tempo e Estudos Climáticos or Center for Weather Forecast and Climate Studies) is supporting training programs in the use of satellite-derived products and the GEONETCast system. Efforts are also under way in Latin America to develop a cadre of specialists experienced in the installation and use of GEONETCast systems to work with new and potential users to identify local needs and user requirements, as well as to install and calibrate receiver systems. Specialists in four countries (Argentina, Brazil, Costa Rica, and Panama) have expressed interest in working with these new user groups. Targeting potential data providers presents a similar challenge, requiring marketing and targeted education about the ways in which the system can enhance the information dissemination goals of an organization at little cost. Just as a process of guided requirements generation is helpful in getting new users to buy into the system, working with potential product providers is necessary to show the advantages of a low-cost and robust dissemination system for extending the reach of some of their product offerings. A separate challenge has been to maximize the participation of GEO constituencies in the work of GEONETCast. GEONETCast is one of the official infrastructure projects within GEO, along with the GEO Portal. While the process of selecting and registering data sets and Earth observation products for inclusion in the GEO Portal is well under way, the process of maximizing the potential of GEONETCast as a dissemination system on the part of these same information providers is at an earlier stage. Likewise, as new GEONETCast efforts such as the Training Channel and the Hazard Alert Channel become operational, new priorities will include engaging experts in the capacity-building and disaster-mitigation fields to guide the development of programming for those information sources and consulting on how to target new user communities with that programming.
CONCLUSION

GEONETCast is an operational and effective, low-cost, satellite-based dissemination system for Earth observation data that broadcasts environmental data and products from multiple providers over a single unified broadcast. GEONETCast broadcasts environmental information to Africa, Asia, Europe, the Americas, and parts of the Pacific, and the system's user base is growing and diversifying. This year, GEONETCast is working to develop the Training and Hazard Alert Channels, further implement network center data exchange, finalize the GEONETCast Product Navigator, and expand the user base in Africa and the Americas. Social challenges to GEONETCast's success center on educating new user groups in the utility and possibilities of the system, including generating concrete user data requirements and raising the capacity to utilize Earth observation information among new target populations. Technical challenges include system integration across three distinct regional dissemination systems, data exchange, and hardware and installation costs. Although challenges exist, we are meeting them and making measured progress in expanding the user base for GEONETCast and increasing the value of the system worldwide.

REFERENCES CITED

CENR/IWGEO (Committee on Environment and Natural Resources/Interagency Working Group on Earth Observations), 2005, Strategic Plan for the U.S. Integrated Earth Observation System: Washington, D.C., National Science and Technology Council Committee on Environment and Natural Resources, 149 p. (Available from CENR Executive Secretariat, 1401 Constitution Ave. NW, Washington, DC 20230, or at http://www.ostp.gov.)
EUMETSAT User Service, 2007, GEONETCast: A Nerve System for the Planet: A Brochure for EUMETSAT: Darmstadt, Germany, EUMETSAT, 15 p.
Fulton, R., 2007, GEONETCast Americas—Status and Plans, in International Symposium on Remote Sensing of Environment, June 2007: http://www.earthobservations.org/documents/geonetcast/20070627_geonetcast_americas_status_and_plans.pdf (accessed 29 July 2011).
Fulton, R., Seymour, P., and Moodie, L., 2009, GEONETCast Americas—A GEOSS Environmental Data Dissemination System Using Commercial Satellites, in Space Technologies for the Benefit of Human Society and Earth: New York, Springer Publishing, 303 p.
GEO (Group on Earth Observations), 2007, The Full Picture: Leicester, UK, GEO Secretariat, Tudor Rose Publishers, 143 p., http://www.earthobservations.org.
GEO Secretariat, 2007, The First 100 Steps to GEOSS: Geneva, Switzerland, GEO Secretariat, 212 p., available at http://www.earthobservations.org.
Moura, A.D., 2006, WMO's contribution to GEOSS and GEONETCast: World Meteorological Organization Bulletin, v. 55, no. 4, p. 256–260.
Williams, M., and Wolf, L., 2007, GEONETCast—Delivering environmental data to users worldwide: IEEE Systems Journal, v. 2, no. 3, p. 401–405.
Withee, G.W., 2006, Integrated global data dissemination service: World Meteorological Organization Bulletin, v. 55, no. 4, p. 249–252.
MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
Printed in the USA
The Geological Society of America Special Paper 482 2011
Developing and implementing international geoscience standards—A domestic perspective

J.L. Laxton and T.R. Duffy
British Geological Survey, Murchison House, West Mains Road, Edinburgh EH9 3LA, UK
ABSTRACT

Here, we describe the development and implementation of standards for the dissemination of geoscience information. We do this from the perspective of the British Geological Survey, but this perspective is considered typical of many geological survey organizations. When geoscience data dissemination occurred through the use of paper maps, standards were mainly developed by individual organizations. The introduction of digital systems for map production and data storage required the development of corporate data models. The evolution of the Web as a means of searching for data led to the development of metadata standards, first at the national level, but soon after at the international level as well. The requirement for interoperable digital geoscience data has led to the need for an accepted international conceptual data model, and so we describe the development and implementation of GeoSciML, the GeoScience Markup Language (GSML). Agreement on a schema enables delivery of data in a standard form, but semantic harmonization is required for full interoperability. The implementation of Web services using GeoSciML requires the use of Open Geospatial Consortium (OGC) and International Organization for Standardization (ISO) open standards, but difficulties have been encountered through lack of full compliance with these standards on the part of software suppliers. The UK Digital National Framework is a means of achieving interoperability between data from different domains at a national level, and it is a good basis for compliance with the mandatory pan-European INSPIRE (Infrastructure for Spatial Information in Europe) framework. The international standards described here are essential in order to meet society's growing demand for interoperable geoscience information in a wide variety of applications.
Laxton, J.L., and Duffy, T.R., 2011, Developing and implementing international geoscience standards—A domestic perspective, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 87–97, doi:10.1130/2011.2482(09). For permission to copy, contact [email protected]. © 2011 The Geological Society of America. All rights reserved.
INTRODUCTION

There is a wide range of standards in use in different areas of the geosciences, but this chapter concentrates on those concerned with the dissemination of information. We look at the issues from the viewpoint of the UK and the British Geological Survey (BGS), but we consider this to be typical of many similar organizations around the world, and so it can be used to show, more generally, the evolution from a national to an international perspective.

The dissemination of information has always been central to the role of geological survey organizations, and the geological map has been the primary means of communication since the earliest days of geology and the world's first geologic map, that of England and Wales produced by William Smith in 1801 (Winchester, 2002). The geological map is a synthesis and interpretation of a wide range of data sources (Harrison, 1963), such as field mapping and subsurface information derived from boreholes and observations in mines. These raw data have not traditionally been published by geological survey organizations, although, where they are not confidential, they are commonly made available on request.

From the earliest days of geology, the need for some level of standardization was apparent, in particular concerning the colors and symbolization used on geological maps. The Second International Geological Congress in 1881 agreed on a standard color scheme, primarily for the first Geological Map of Europe (Vai, 2004), but most standardization has remained limited to one organization or one country and has led to a wide range of different map specifications being developed (e.g., British Geological Survey, 2000). These standards were initially concerned with the production of printed maps using traditional cartographic techniques, but they evolved with the advent of digital cartography to meet new requirements (e.g., Federal Geographic Data Committee, 2006).

In parallel with the introduction of computer-aided design (CAD) systems to handle geological maps, there was an increase in the digitization of other geoscience data and the early adoption of database management systems. Even at this early stage of digitization, it was recognized that digital data opened the way to multidisciplinary analysis but that standards would be required if this potential were to be realized (Albert, 1983). These standards were still seen in a national context. Use of digital cartography (CAD) systems rapidly evolved into the use of geographic information systems (GIS), and with it came the idea that maps could be considered, along with the predominantly observation-point data in relational databases, as part of a corporate spatial database. BGS developed a Digital Map Production System based on such a spatial database, which was used both to produce standard geological maps and to produce derivative products meeting specific customer requirements (Laxton and Becken, 1996). This was underpinned by a logical model that was seen as the standard upon which storage of digital geological map data could be based (Bain and Giles, 1997).
The need for an integrated approach to the organization and management of digital geoscience data led the BGS to initiate the BGS-geoIDS (BGS Geoscience Integrated Database System) project (Giles et al., 1997). This initiative had the aim of creating "an integrated corporate database of geoscience information" (p. 4) to meet a range of business objectives, including "making data available to staff and customers that comply with a documented standard and thereby improve the quality of the products and services" (p. 42); "to provide the opportunity to combine/integrate data of diverse types, to develop innovative new products and solutions" (p. 42); and "to allow corporate implementation of standards and establish best practice" (p. 43). A corporate standard for metadata and a corporate data model were seen as priorities.

The metadata were primarily for discovery purposes, both internally and externally, and it was important for these to conform to developing standards. The UK, along with many other countries, developed national metadata standards. In the case of the UK, this was through the National Geospatial Data Framework (NGDF), and in the first instance, BGS adopted this standard (Holmes et al., 1999). The appearance of the international metadata standard ISO 19115 (ISO/TC 211, 2003) led many countries to revise their national standards in the form of profiles of ISO 19115. The UK profile was termed the UK GEMINI standard (Cabinet Office, 2004). This was the first international data standard widely adopted by geological survey organizations. BGS developed its own profile of ISO 19115, but it was fully compatible with the GEMINI standard.

The UK Natural Environment Research Council (NERC), the parent body of BGS, in its strategy document for the period 2002–2007 (NERC, 2002), identified the exploitation of knowledge as one of its five strategic priorities and stressed the need for increased international collaboration. In support of this, BGS committed to use of the Web as a means of delivering data to its external users (British Geological Survey, 2004), although at the beginning this was mainly through the download of some free products and the ability to search online and order data. The theme was continued in the most recent BGS strategy (British Geological Survey, 2009), which identifies "improving the communication of geoscience knowledge" (p. 6) as a key challenge and "exploiting the Web" (p. 7) as a central means to achieve this.

While the basic function of geological survey organizations, to be the principal source of geoscience information within their territories, has not changed significantly since their foundation, the technological circumstances in which they operate have changed. The move from paper-based products, to digital products, to Web dissemination has necessarily led to an evolution of the geoscience standards that are required. Early standards were generally set by, and for, individual geological survey organizations, although the need for some international harmonization was recognized, as in the case of map color schemes. Digitization of data required the development of more integrated data models, and the idea of the "corporate database" evolved, but this was still generally seen within the context of individual organizations. It is the advent of Web dissemination of data that has brought about the requirement for internationally agreed-upon geoscience data standards. The development of these standards is discussed in the following sections.
GEOSCIML—AN INTERNATIONAL GEOSCIENCE DATA EXCHANGE STANDARD

Development Process

It is becoming increasingly important to be able to query and exchange digital geoscientific information between data providers and users. Technological opportunities arising from the evolution of geospatial information standards and Internet standards for data delivery services in general are making such interoperability a viable proposition. Early experiments suggested that the development of a generic geoscience markup language would lead to significant improvements in interoperability (Sen and Duffy, 2004). In order to investigate these opportunities, BGS convened a meeting in Edinburgh of international geoscience data providers, mainly geological surveys, in 2003. Following from this meeting, the Interoperability Working Group (IWG) was set up under the auspices of the International Union of Geological Sciences Commission for the Management and Application of Geoscience Information (CGI). The working group was tasked with developing a conceptual geoscience data model, mapping this to a common interchange format, and demonstrating the use of this interchange format through the development of a test bed. Active participants in the working group were drawn from 10 geological survey organizations around the world (AZGS [Arizona, USA], BGS [UK], BRGM [France], ISPRA [Italy], GA [Australia], GSC [Canada], GSJ [Japan], GSV [Victoria, Australia], SGU [Sweden], and the U.S. Geological Survey [USA]) and one research institution (CSIRO [Australia])¹.

In order for there to be interchange of information, there has to be agreement on the nature and structure of the information to be interchanged. The simplest way of achieving this would be if all geoscience data providers shared a common database structure. However, because most providers already have their own data models, have differing internal business requirements, and gather and hold different information, this option is not possible. The solution is to agree on a common conceptual data model, to which data held in existing databases can be mapped. Such a data model needs to identify the objects being described (e.g., "faults"), their properties (e.g., "displacement"), and the relations between objects (e.g., "faults are a type of GeologicStructure"). Such a model can be described graphically using the Unified Modeling Language (UML), an ISO (International Organization for Standardization) standard (ISO/IEC, 2005).
A uniform methodology, based on standards published by the Open Geospatial Consortium (OGC) and the ISO, was developed and applied in the design of GeoSciML, the GeoScience Markup Language. The method is based on the Object Management Group's Model Driven Architecture (MDA)® (OMG, 2003), with model design in UML using the General Feature Model from ISO 19109 (ISO/TC 211, 2005), the use of components from other standards in the ISO 19100 series, and production of the XML schema following the encoding rules specified in ISO 19136 (ISO/TC 211, 2007). The resultant encoding shows a literal and explicit relationship to the UML model. This is unlikely to be as compact as hand-coded special cases, but it is consistently structured across similar models. Full structure and meaning are preserved, and compactness is easily dealt with using standard compression techniques. Furthermore, the use of standard components for elements that are common across domains ensures maximum interoperability.

Scope of GeoSciML

The GeoSciML standard is designed to assist interoperability through standardization of the data structure used in geoscience (Simons et al., 2006). It is built on top of international standards for data language, such as the Geography Markup Language (GML) (ISO/TC 211, 2007), and the W3C Web systems standards. The GeoSciML initiative will put in place the standards to enable the interoperability of geoscience data, but semantic interoperability will only be achieved when the data content also conforms to accepted international standards. Figure 1 shows where GeoSciML fits within the hierarchy of interoperability. GeoSciML is useful even without full semantic interoperability because it enables geological survey organizations, and other geoscience data providers, to deliver the same type of information to customers in a standard format, thus allowing easier analysis and processing. The GeoSciML conceptual model assists in the development of semantic interoperability by identifying those concepts that require standardization for such interoperability.
¹Explanation of acronyms: AZGS—Arizona Geological Survey; BGS—British Geological Survey; BRGM—Bureau de Recherches Géologiques et Minières; ISPRA—Istituto Superiore per la Protezione e la Ricerca Ambientale; GA—Geoscience Australia; GSC—Geological Survey of Canada; GSJ—Geological Survey of Japan; GSV—Geological Survey of Victoria; SGU—Sveriges Geologiska Undersökning; CSIRO—Commonwealth Scientific and Industrial Research Organisation.

Figure 1. The level of interoperability that GeoSciML is designed to address. GeoSciML—GeoScience Markup Language; GIS—geographic information systems; OGC—Open Geospatial Consortium. (The figure depicts four levels of interoperability: data systems [systems], data language [syntax], data structure [schematic], and data content [semantic], spanning the geoscience and OpenGIS [OGC] communities.)
The ICS (International Commission on Stratigraphy) stratigraphic time chart, produced from the geologic time scale (Gradstein et al., 2004), is an example of internationally accepted concept definitions, and this has been implemented in GeoSciML. There are, however, few geoscience domains that have such agreed-upon concept definitions, and therefore, in the first instance, the IWG set up a Concept Definitions task group, which is developing simple vocabularies designed specifically for interoperability. It is recognized that these vocabularies are a temporary measure until full concept definitions have been developed by the relevant international geoscience communities.

We will not describe the GeoSciML data model in full but rather give a brief overview that can be used to discuss the issues involved in implementing such a standard in a domestic context. The scope of the content of GeoSciML has been restricted in the current version, 2.1, to those geoscience objects that form the main components of a geological map (geological units, faults, contacts, and their defining concepts), as well as boreholes. In modeling the geological map, the concept of the GeologicFeature, which can be either a GeologicUnit or a GeologicStructure, is central. The model separates the concept of the GeologicFeature from its occurrences, which are termed MappedFeatures. A typical geological map is made up of MappedFeatures, which are in turn specified by GeologicFeatures. A single MappedFeature can be specified by only one GeologicFeature, but a GeologicFeature can have several MappedFeature occurrences, on maps of different scales or in a three-dimensional model, for example. The main function of a MappedFeature therefore is to hold geometry, which it inherits from standard types defined in GML. Figure 2 shows the UML diagram for the GeologicFeature package, one of 12 packages in the GeoSciML data model, and illustrates the relationship between MappedFeatures and GeologicFeatures.

The GeoSciML data model allows for the detailed description of GeologicUnits (Fig. 3) using a wide range of specific data types, such as WeatheringDescription and MetamorphicDescription. The composition can be described in detail using the related EarthMaterial package, which allows the definition of RockMaterials in terms of their constituent parts. The age of a GeologicFeature is defined in terms of GeologicEvents, which can be aggregated to give a GeologicHistory. GeologicUnits can be classified using ControlledConcepts drawn from vocabularies such as stratigraphic lexicons (Fig. 4). ControlledConcepts can in turn be defined using prototypes, which can be any type of feature but which must be consistent with the ControlledConcept being defined. In the case of a stratigraphic lexicon, the prototype would be a normative GeologicUnit definition.

Implementing GeoSciML

We will use our experience in the use of GeoSciML at BGS to illustrate some of the issues involved in implementing an international standard such as this.
The first data set that we attempted to map to GeoSciML is the BGS digital map standard DigMap (Becken and Green, 2000). This standard is used primarily for maps that have traditionally been published at scales of 1:50,000, 1:250,000, and 1:625,000. The information shown on the maps varies a little between scales, but they are all basically lithostratigraphic maps classified according to terms defined in the BGS Lexicon of Named Rock Units (http://www.bgs.ac.uk/lexicon/). This lexicon provides information about the age, stratigraphic rank and parentage, and lithology of each unit, with the latter using definitions from the BGS Rock Classification Scheme (http://www.bgs.ac.uk/bgsrcs/home.html). It should be noted that the lexicon is designed to cover all mapped rocks in the UK, so it includes lithodemic units such as igneous intrusions. Large intrusions are generally given distinct names in the lexicon, but minor intrusions are given a generic name, such as "Unnamed Igneous Intrusion, Westphalian." In some cases, a single stratigraphic unit may be subdivided on maps on the basis of lithology, for example, limestone bands within a predominantly sandstone unit. Both the lexicon name and the lithology are therefore required to uniquely identify a mapped unit.

Because DigMap is designed to be a self-contained format that can be used to deliver digital geological maps to customers, it incorporates attribute values drawn from the BGS Lexicon of Named Rock Units, rather than simply referencing these in the lexicon. DigMap therefore has fields not just for the rock name codes and lithology codes, but also for the full-text descriptions of what the codes mean. Similarly, DigMap contains fields for the code and description of each rock unit above the unit in question in the stratigraphic hierarchy: bed to member to formation to subgroup to group to supergroup. The minimum and maximum ages of the unit are recorded, with fields for both geochronological terms (e.g., Albian) and years. The temporal hierarchy is recorded in a similar way to the stratigraphic hierarchy, with each temporal level above the unit in question, from epoch through to eon, being recorded.

In implementing this structure in GeoSciML, the first step is to map features in DigMap to their equivalents in the GeoSciML data model. For example, the mapped polygons in DigMap relate directly to MappedFeatures specified by a GeologicUnit in GeoSciML. The second step is to map attributes in DigMap to properties in the GeoSciML data model, and also to derive values for any properties that are mandatory in GeoSciML but may not be specifically recorded in DigMap. For example, a mandatory property of GeologicUnit is the geologic unit type, and, as noted already, DigMap includes a mixture of lithostratigraphic and lithodemic units. At present, there is no unambiguous way of distinguishing between these in either the BGS Lexicon of Named Rock Units or through a retrieval on the basis of lithology, so the geologic unit type property can only be accurately populated through individual classification by a geologist. Until this has been done and the information added to the lexicon, the geologic unit type value must either be set to "unspecified," which is correct but provides little information, or set to "lithostratigraphic," which will be correct in the majority of cases but incorrect in a minority.
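A minimal sketch of the two mapping steps just described, written in Python for illustration. The DigMap field names (LEX, RCS, and their description fields) and the dictionary output are simplifying assumptions made here; an operational mapping targets the GeoSciML XML schema itself.

```python
# Sketch of mapping a DigMap polygon record to GeoSciML-style objects.
# The field names (LEX, LEX_D, RCS, RCS_D) and the dictionary output are
# assumptions made for illustration; the real target is the GeoSciML XML schema.

def digmap_to_geosciml(polygon: dict) -> dict:
    """Step 1: map the DigMap polygon to a MappedFeature specified by a GeologicUnit.
    Step 2: map attributes to GeoSciML properties, deriving mandatory ones."""
    geologic_unit = {
        "type": "GeologicUnit",
        # Mandatory in GeoSciML but not recorded in DigMap: default conservatively.
        "geologicUnitType": "unspecified",
        # Classifier drawn from the BGS Lexicon of Named Rock Units.
        "classifier": {"vocabulary": "BGS Lexicon of Named Rock Units",
                       "code": polygon["LEX"], "name": polygon["LEX_D"]},
        # Composition term drawn from the BGS Rock Classification Scheme.
        "composition": {"vocabulary": "BGS Rock Classification Scheme",
                        "code": polygon["RCS"], "name": polygon["RCS_D"]},
    }
    mapped_feature = {
        "type": "MappedFeature",
        "shape": polygon["geometry"],    # the MappedFeature holds the geometry
        "specification": geologic_unit,  # one GeologicUnit specifies this MappedFeature
    }
    return mapped_feature

if __name__ == "__main__":
    example = {"LEX": "HYPOTHETICAL GROUP", "LEX_D": "Hypothetical Group",
               "RCS": "LMST", "RCS_D": "Limestone", "geometry": "POLYGON((...))"}
    print(digmap_to_geosciml(example)["specification"]["classifier"]["name"])
```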
Figure 2. Summary Unified Modeling Language (UML) diagram for the GeoSciML GeologicFeature package.
Over time, such problems can be corrected by modifying the national data models to conform more closely to the requirements of the international ones. The converse situation arises where there is information in a data provider's database that cannot be mapped to GeoSciML. For example, map polygons in DigMap have a property relating them to a particular published map sheet, but there is no equivalent property in GeoSciML. This illustrates the general point that mapping between national data models and international ones is likely to lead to a loss of semantic resolution.

DigMap draws information from the BGS Lexicon of Named Rock Units, which in turn draws from the BGS Rock Classification Scheme and the BGS Geological Timechart (http://www.bgs.ac.uk/discoveringGeology/time/timechart/home.html), and it holds this in a flat data structure with significant redundancy in order to produce a self-contained product for delivery to customers.
As noted previously, GeoSciML allows stratigraphic lexicons to be encoded using the vocabulary package structure, and the BGS Lexicon of Named Rock Units maps well using this mechanism. This structure gives flexibility in data delivery by allowing access to the vocabularies in different ways. The vocabularies can be delivered along with the data, but, more efficiently, they can also be delivered separately, so that a customer might store the vocabulary for use with multiple data deliveries. Alternatively, the vocabulary can be made available as a vocabulary Web service, allowing compact data delivery that can reference the online vocabulary for more detailed information where needed.
Figure 3. Summary Unified Modeling Language (UML) diagram for the GeoSciML GeologicUnit package.

Figure 4. Summary Unified Modeling Language (UML) diagram for the GeoSciML Vocabulary package.
In implementing GeoSciML, it is necessary to be clear about the objective. If the aim is to deliver an organization's data in a standard form with the least loss of semantic resolution, then the organization's own vocabularies should be used, as described for the implementation of DigMap. However, in some cases, a key objective may be to integrate data from several sources; for example, the OneGeology project (http://www.onegeology.org/) aims "to create dynamic digital geological map data for the world," while the related OneGeology-Europe project (http://www.onegeology-europe.eu/index.php/) aims "to create a geological map at 1:1,000,000 (of Europe)." These projects require semantic as well as schematic interoperability (Fig. 1), and this can only be achieved through the use of shared vocabularies. The CGI Simple Lithology vocabulary has been developed as part of the GeoSciML initiative to help address this issue. As its name suggests, it provides a simple lithology classification intended for interoperability. It is not intended as a replacement for any of the detailed lithological vocabularies currently used by organizations, such as the BGS Rock Classification Scheme. The main issue in implementing a vocabulary for interoperability is to relate the generally higher-semantic-resolution concepts used in an organization's internal vocabulary to the broader concepts used in the interoperable vocabulary.
If the classification systems used in the two vocabularies are broadly similar, then concepts from the internal vocabulary are likely to fall within the concept space of concepts defined in the interoperable vocabulary. If the classification systems are different, then concepts from the internal vocabulary may overlap the concept space of several terms in the interoperable vocabulary. In this case, the hierarchy of terms in the interoperable vocabulary will need to be navigated until a sufficiently broad concept is found that includes the concept from the internal vocabulary, and this can lead to a significant loss of semantic resolution. In the future, developments in the study of semantics and the use of ontologies may provide a better approach to mapping between overlapping concepts (Reitsma et al., 2008).
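The navigation just described can be sketched as a small algorithm: where an internal concept overlaps several terms in the interoperable vocabulary, walk up the interoperable hierarchy until a single term is broad enough to contain them all. The hierarchy below is an invented fragment used only for illustration; it is not the actual CGI Simple Lithology vocabulary.

```python
# Invented fragment of a broad lithology hierarchy (child -> parent); this is NOT
# the actual CGI Simple Lithology vocabulary, only a stand-in for illustration.
PARENT = {
    "mudstone": "fine_grained_sedimentary_rock",
    "siltstone": "fine_grained_sedimentary_rock",
    "sandstone": "clastic_sedimentary_rock",
    "fine_grained_sedimentary_rock": "clastic_sedimentary_rock",
    "clastic_sedimentary_rock": "sedimentary_rock",
    "sedimentary_rock": "rock",
}

def ancestors(term: str) -> list:
    """Return the term followed by its ancestors, narrowest first."""
    chain = [term]
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain

def broadest_common_term(candidate_terms: list) -> str:
    """Find the narrowest interoperable term whose concept space includes every
    candidate the internal concept overlaps (which may lose semantic resolution)."""
    chains = [ancestors(term) for term in candidate_terms]
    common = set(chains[0]).intersection(*map(set, chains[1:]))
    for term in chains[0]:              # chains are ordered narrowest first
        if term in common:
            return term
    return "rock"                       # top of the invented hierarchy

if __name__ == "__main__":
    # An internal class overlapping two interoperable concepts maps to their
    # narrowest common ancestor.
    print(broadest_common_term(["mudstone", "siltstone"]))  # fine_grained_sedimentary_rock
    print(broadest_common_term(["mudstone", "sandstone"]))  # clastic_sedimentary_rock
```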
WEB SERVICE STANDARDS

The OGC, as well as defining GML, has developed standards for the interchange of spatial data, the two most relevant being the Web Map Service (WMS) (OGC, 2006) and the Web Feature Service (WFS) (OGC, 2005). WMS is concerned mainly with Web map rendering of particular chosen properties with chosen classification symbolizations. A WFS, however, delivers data in GML format, and thus allows querying and exchange of information about individual features, potentially including all the richness of information described in a domain or community model such as GeoSciML.

In order to test GeoSciML, the participating organizations set up WMS/WFS services to serve example geological map and borehole data using the GeoSciML standard and addressing commonly accepted use cases. Figure 5 shows a WMS of the lithostratigraphy of the UK at 1:625,000 scale within the OneGeology Portal, overlain with a fragment of the GeoSciML Web Feature Service for that data set. The participants deliberately did not use the same software implementations of WMS/WFS, allowing different options to be compared and providing a greater test of true interoperability between different data providers. Many open-source software solutions are being tried, and, increasingly, commercial offerings that support GML 3.x schemas such as GeoSciML are becoming available. BGS has used the free and open-source MapServer (http://mapserver.org/), GeoServer (http://geoserver.org), and deegree (http://www.deegree.org/) software for such Web services and is also using Snowflake (http://www.snowflakesoftware.co.uk) commercial software. While it is highly desirable to use international standards, such as the OGC Open Standards, as the basis for implementing interoperable systems, the existence of a published standard does not always guarantee its maturity or full adoption by software implementations.
Frequently, "optional" items in the standards are not implemented at all, despite potential users requiring them in practice. This can also reduce interoperability, because software with a stricter adherence to the standards will have difficulty interoperating with software with a less rigorous adherence. For this reason, several software solutions had to be tried and sometimes wrapped with in-house–written "mediator" software to improve the Web service outputs. Experience with implementing the test bed has enabled commercial vendors, such as Snowflake, to improve the ability of their software to implement a real community schema like GeoSciML. The test-bed participants' internal data storage schemas were all different, and "mapping" from these internal schemas to the GeoSciML schema is part of the process of increasing interoperability between data set providers by expressing, and providing as Web services, available data in a commonly agreed-upon form.

WMS/WFS services are new and require information communities to agree on common conceptual data models to get the most benefit from them. Geoscience, through the GeoSciML initiative, is one of the first communities to do this, which makes it well placed to exploit the opportunities that Web services will open up for sharing data and enabling data to be incorporated transparently into users' Web applications. In the future, the semantic web and semantic grid will take these developments further and are likely to revolutionize the way information is accessed and used. The underpinning geoscience conceptual data model will need to be extended to cover a broader area of geoscience, for example, to more observational data, in order to fully exploit these opportunities.
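As an illustration of how such services are consumed, the sketch below issues a WFS GetFeature request using standard key-value parameters and lists element content from the returned GeoSciML document. The endpoint URL and feature type name are placeholders rather than a real BGS or OneGeology service, and elements are matched by local name so the sketch does not depend on a particular GeoSciML namespace version.

```python
# Sketch of a WFS GetFeature request returning GeoSciML, using only the standard
# library. The endpoint and typeName below are placeholders, not a real service.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

WFS_ENDPOINT = "https://example.org/geoserver/wfs"   # placeholder endpoint

def list_feature_names(max_features: int = 10) -> list:
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": "gsml:MappedFeature",   # placeholder feature type name
        "maxFeatures": str(max_features),
    }
    url = WFS_ENDPOINT + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=60) as response:
        tree = ET.parse(response)
    # Collect the text of any element whose local name is "name" (e.g., gml:name
    # on the returned features), ignoring the namespace URI.
    names = []
    for element in tree.iter():
        if isinstance(element.tag, str) and element.tag.rsplit("}", 1)[-1] == "name":
            if element.text:
                names.append(element.text.strip())
    return names

if __name__ == "__main__":
    for name in list_feature_names():
        print(name)
```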
Figure 5. Web services of the lithostratigraphy of the UK (http://portal.onegeology.org).
UK DIGITAL NATIONAL FRAMEWORK

The UK Digital National Framework (DNF) was launched by the Ordnance Survey (OS), the UK national topographic mapping agency, in 1999. The DNF is defined as "…a model for the integration of geographic information of all kinds—from national reference data sets to application information at a local level" (Ordnance Survey, 2004, p. 13). The objective of DNF is to encourage the reuse and enable the interoperability of spatial data within the UK through the establishment of a set of principles and guidelines. It does not aim to create a new standard but rather draws on existing standards. The architecture of DNF enables "Business Objects," for example, a planning application or a highway maintenance record, to establish a permanent reference to a spatial object in the "Reference Base" (Rackham, 2010). This frees users of spatial data from the need to maintain separate geographic objects in their databases and ensures that any changes in the reference base are automatically reflected in user applications.
An "Association Model" (Malcomson, 2006) has been developed to describe, at the logical level, the various types of DNF association.

DNF does not specify a particular reference base, but rather it draws up a set of "principles for the selection of DNF Reference Objects," which include the need for objects to be maintained to a published specification, for objects to have a defined life cycle, and for object instances to be identified and versioned. In practice, however, the DNF Reference Base has become synonymous with the OS MasterMap® digital topographic database, which is "a consistent and maintained framework for the referencing of geographic information (GI) in Great Britain" (Ordnance Survey, 2010, p. 6). MasterMap® object instances have unique "Topographic Object Identifiers (TOIDs)®," which can be used as the basis for referencing and association, and they also have attributes that record the object life cycle.

The UK, along with all other European Union (EU) countries, will be required to make various spatial data sets, including geological data, freely available as part of the INSPIRE (Infrastructure for Spatial Information in Europe) initiative. There is a great deal of overlap between the objectives of DNF and INSPIRE, although INSPIRE applies at a European level and DNF applies at a UK national level.
Figure 6. An illustration of the use of Digital National Framework in a geological context. © Crown copyright. All rights reserved British Geological Survey 100017897 2009.
organizations in conforming to the INSPIRE standard, as well as ensuring a high level of consistency and interoperability between the various elements of the UK contribution to INSPIRE. In recognition of this, the DNF is now being developed in conjunction with the UK SDI (Spatial Data Infrastructure), which in turn is being developed in support of INSPIRE.
As noted previously, while the DNF principles do not specify the use of OS MasterMap® as the reference base, in practice, it is the only reference base that has been used within DNF. This poses a significant problem in implementing many environmental data sets, including geological data, within the DNF framework because there is commonly no logical relationship between the environmental object and any object in the topographic reference base. BGS is working with the DNF developers to address this problem and has suggested that the DNF Reference Base be extended to include a range of different reference layers such as geology, soils, or land cover, which would each conform to the “principles for the selection of DNF Reference Objects.” Depending on the use case, “Business Objects” could be associated with Reference Objects from any of the reference base layers.
Figure 6 illustrates the use of DNF in a geological context. Two geological Reference Objects (BGS670–671), areas of coal outcrop, partially overlay a topographic Reference Object (OS123), a field. A Business Object might be a planning application for open-cast extraction of coal from within the field. This Business Object would derive its geometry from the field Reference Object and some of its attribute information from the coal Reference Objects, without the requirement to carry out a GIS-type spatial overlay operation between the coal and the field polygons. There would, of course, need to be a means of obtaining the identifiers of the various objects, but a future INSPIRE-type Web service could provide access to these. Similar use cases can be envisaged with other combinations of Reference Objects. These associations would not be dynamic—changes in one Reference Base layer would not be reflected in others, so potentially the associations could become invalid over time. For many requirements, such as the one described here, this would not be a significant issue because the Reference Base layers are unlikely to change within the time scale for which the association is required.
IMPLEMENTATION OF INSPIRE
BGS is a Legally Mandated Organisation in the UK context (as a national geological survey with a royal charter from the NERC parent body) that is required by the European Community INSPIRE directive (European Union, 2007), as transposed to UK law, to declare and provide significant parts of its data as interoperable Network Services (Web services including WMS and WFS) conforming to INSPIRE regulations and technical guidance. As such, BGS needs standards such as the GeoSciML schema to be designed and adopted for use within INSPIRE, in particular for the Annex II Geology data theme and possibly for various Annex III “geohazard and applied” themes. GeoSciML has already been designed and accepted by a wide
international geoscience community, and so it has been proposed that the next version of GeoSciML (version 3) will be a candidate for use as the Geology Annex II schema. GeoSciML has already been tested for use in data exchange through OGC/ISO Web services. GeoSciML is also being tested through the aforementioned OneGeology-Europe project. The INSPIRE implementation timetable is a long one, stretching to 2019, but the first deliverable required of Legally Mandated Organisations like BGS is to publish, through Catalog Web Services (OGC, 2007), metadata in ISO standard form on the data sets and data services that are published by BGS. This is being addressed now.
SUMMARY AND CONCLUSIONS
Since the earliest days of geosciences, there has been recognition that some level of standardization is necessary to enable the communication of geoscience information. Initially, this was mainly concerned with establishing national standards, such as stratigraphic nomenclature or the content of a particular geological map series, but some international standards such as map color schemes were also developed. The digitization of geoscience information led to an increased need for organizations to develop internal standards, such as corporate data models and vocabularies, in order to manage and integrate their data. This was driven partly by the need to manage and integrate digital data corporately to meet increasing requirements to allow customers to tailor products to their specific needs, rather than simply producing a series of standard all-purpose products such as traditional geological map series.
The advent of the Web as a means of accessing geoscience data provided a major impetus for the development and implementation of standards. Initially, the Web was used as a means of providing information on data availability, which in turn raised the requirement for discovery metadata. Since many organizations, in particular government data providers, were in a similar position, this provided an incentive to develop national metadata standards linked to national metadata portals. Because the Web is inherently international, it also facilitated an increased requirement to carry out multinational data searches, and this led to the development of regional and then international discovery metadata standards. These were the first international data standards widely adopted by geological survey organizations.
The increased availability of digital geoscience data from a wide range of organizations, and the ability to discover these data through metadata portals, led to an increasing demand for access to the data in a standard form to enable integrated analysis and processing. To make data available in a standard format requires conformance to a standard conceptual data model. Since individual organizations have generally developed their own internal data models, this requires agreement on an international conceptual model, to which national and regional ones can be mapped. The development of GeoSciML is an example of this for the information generally shown on geological maps.
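To make this kind of standards-based access concrete, the sketch below shows how a client might request GeoSciML-encoded features from an OGC Web Feature Service of the sort exposed by the test bed and the OneGeology portal. It is a minimal illustration only: the endpoint URL is a placeholder rather than a real service, the feature type and namespace reflect GeoSciML 2.x conventions and would differ for other versions, and a real client would add authentication, paging, and error handling.

```python
# Minimal sketch of an OGC WFS GetFeature request for GeoSciML data.
# The endpoint below is a placeholder, not a reference to a real service.
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

WFS_ENDPOINT = "https://example.org/geoserver/wfs"  # hypothetical endpoint
params = {
    "service": "WFS",
    "version": "1.1.0",
    "request": "GetFeature",
    "typeName": "gsml:MappedFeature",            # a GeoSciML feature type
    "maxFeatures": "10",
    "outputFormat": "text/xml; subtype=gml/3.1.1",
}

with urlopen(f"{WFS_ENDPOINT}?{urlencode(params)}") as response:
    tree = ET.parse(response)

# Count the returned GeoSciML features; the namespace URI is illustrative
# of GeoSciML 2.x and would change with the schema version in use.
GSML = "urn:cgi:xmlns:CGI:GeoSciML:2.0"
features = tree.getroot().findall(f".//{{{GSML}}}MappedFeature")
print(f"Received {len(features)} mapped features")
```

Because the response is expressed against the community schema rather than any provider's internal model, the same few lines work against any conforming service, which is precisely the interoperability benefit argued for above.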
The passage from internal to international standards has often been achieved through initial conformance with national standards, which then develop to conform to emerging international standards. As well as the development of metadata standards, in the UK, the development of the DNF as a national infrastructure for interoperability has provided a strong base for the implementation of INSPIRE.
The implementation of software standards, such as the OGC Open Standards for Web services, can be constrained by the availability of commercial software that fully conforms to the standards. Early adopters of such standards may have to make compromises over full compliance with the standard, while feedback from such experience to the software developers will help improve the software for later adopters.
Since the early days of geological survey organizations, the demand for geoscience data has increased enormously, and many different sectors of society, mainly nongeoscientists, now require such data in some form or other. In parallel with this, technological developments have enabled geoscience data to be made available more easily, in forms tailored to individual users’ requirements. However, it will only be possible to realize this potential if geoscience data interoperability is underpinned by international standards.
ACKNOWLEDGMENTS
The development of GeoSciML was carried out by an international group of collaborators under the Commission for the Management and Application of Geoscience Information Interoperability Working Group, and we have drawn on their work in describing GeoSciML. This paper is published with permission of the executive director of the British Geological Survey (NERC).
REFERENCES CITED
Albert, T.M., 1983, Geoscience data management: Computers & Geosciences, v. 9, p. 3–6, doi:10.1016/0098-3004(83)90029-8.
Bain, K.A., and Giles, J.R.A., 1997, A standard model for storage of geological map data: Computers & Geosciences, v. 23, p. 613–620, doi:10.1016/S0098-3004(97)00043-5.
Becken, K., and Green, C., 2000, DigMap: A digital geological map of the United Kingdom: British Geological Survey: Earthwise, v. 16, p. 8–9.
British Geological Survey, 2000, Specifications for the Preparation of 1:10,000 Scale Geological Maps (2nd ed.): British Geological Survey Research Report RR/00/02, 70 p.
British Geological Survey, 2004, Proposals for the British Geological Survey Core Strategic Programme 2005–2010: Nottingham, British Geological Survey, 86 p.
British Geological Survey, 2009, British Geological Survey Strategy 2009–2014: Applied Geoscience for Our Changing Earth: Nottingham, British Geological Survey, 28 p.
Cabinet Office, e-Government Unit, 2004, UK GEMINI Standard Version 1.0: London, Cabinet Office, 54 p.
European Union, 2007, Directive 2007/2/EC of the European Parliament and of the Council of 14 March 2007 establishing an Infrastructure for Spatial Information in the European Community (INSPIRE): The Official Journal of the European Union, v. 50, L 108, 25.4.2007, p. 14.
Federal Geographic Data Committee (prepared for the Federal Geographic Data Committee [FGDC] by the U.S. Geological Survey), 2006, FGDC Digital Cartographic Standard for Geologic Map Symbolization: Reston, Virginia, Federal Geographic Data Committee Document Number FGDC-STD-013-2006, 290 p., 2 plates. Giles, J.R.A., Booth, D.C., Booth, S.J., Coates, J.S., Jackson, I., Lawrence, D.J.D., Lowe, D.J., McInnes, J.L., McKenzie, A.A., and Smith, I.F., 1997, BGS-geoIDS Project Scoping Study Report: British Geological Survey Technical Report WO/97/17R, 133 p. Gradstein, F.M., Ogg, J.G., and Smith, A.G., 2004, A Geologic Time Scale: Cambridge, UK, Cambridge University Press, 589 p. Harrison, J.M., 1963, Nature and significance of geological maps, in Albritton, C.C., ed., The Fabric of Geology: Stanford, California, Freeman, Cooper & Co., p. 225–232. Holmes, K.A., Dobinson, A., Giles, J.R.A., Johnson, C.C., Lawrence, D.J.D., and McInnes, J.L., 1999, BGS-geoIDS Metadata—Issues Document: British Geological Survey Technical Report WO/99/01R, 79 p. ISO/IEC (International Organization for Standardization/International Electrotechnical Commission), 2005, Information Technology—Open Distributed Processing—Unified Modeling Language (UML) Version 1.4.2: ISO/IEC 19501, 458 p. ISO/TC (International Organization for Standardization/Technical Committee) 211, 2003, Geographic Information—Metadata: ISO 19115, 140 p. ISO/TC (International Organization for Standardization/Technical Committee) 211, 2005, Geographic Information—Rules for Application Schema: ISO 19109, 81 p. ISO/TC (International Organization for Standardization/Technical Committee) 211, 2007, Geographic Information—Geography Mark-Up Language (GML): ISO 19136, 421 p. Laxton, J.L., and Becken, K., 1996, The design and implementation of a spatial database for the production of geological maps: Computers & Geosciences, v. 22, p. 723–733, doi:10.1016/0098-3004(95)00128-X. Malcomson, E., 2006, Association Model: Digital National Framework (DNF) Technical Guide, dnf0004, 23 p. NERC (Natural Environment Research Council), 2002, Science for a Sustainable Future 2002–2007: NERC Strategy Document: Swindon, Natural Environment Research Council, 24 p. OGC (Open Geospatial Consortium), 2005, Web Feature Service Implementation Specification Version 1.1: OGC document OGC 04-094, 117 p. OGC (Open Geospatial Consortium), 2006, Web Map Service Implementation Specification Version 1.3.0: OGC document 06-042, 85 p. OGC (Open Geospatial Consortium), 2007, OpenGIS Catalogue Services Specification 2.0.2—ISO Metadata Application Profile: OGC document 07-045, 125 p. OMG (Object Management Group), 2003, MDA Guide Version 1.0.1: OMG Document omg/2003-06-01, 62 p. Ordnance Survey, 2004, The Digital National Framework—Evolving a Framework for Interoperability across All Kinds of Information: Southampton, Ordnance Survey White Paper, 39 p. Ordnance Survey, 2010, OS MasterMap® Topography Layer: User Guide and Technical Specification Version 1.9: Southampton, Ordnance Survey, 142 p. Rackham, L., 2010, Digital National Framework (DNF)—Overview: Digital National Framework Technical Guide Version 3.0, dnf0001, 48 p. Reitsma, F., Laxton, J., Ballard, S., Kuhn, W., and Abdelmoty, A., 2008, Semantics, ontologies and e-science for the geosciences: Computers & Geosciences, v. 35, p. 706–709, doi:10.1016/j.cageo.2008.03.014. Sen, M., and Duffy, T.R., 2004, GeoSciML: Development of a generic GeoScience mark-up language: Computers & Geosciences, v. 31, p. 
1095–1103, doi:10.1016/j.cageo.2004.12.003. Simons, B., Boisvert, E., Brodaric, B., Cox, S., Duffy, T.R., Johnson, B.R., Laxton, J.L. and Richard, S., 2006, GeoSciML: Enabling the exchange of geological map data, in Australian Earth Sciences Convention, Melbourne, 2006, Abstracts & Handbook: Sydney, Geological Society of Australia (CD-ROM). Vai, G.B., 2004, The Second International Geological Congress, Bologna, 1881: Episodes, v. 27, p. 13–20. Winchester, S., 2002, The Map That Changed the World: London, HarperCollins, 338 p. MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
Printed in the USA
The Geological Society of America Special Paper 482 2011
The need for ontologies: Bridging the barriers of terminology and data structure Leo Obrst* The MITRE Corporation, 7515 Colshire Drive, McLean, Virginia 22102-7508, USA Patrick Cassidy MICRA, Inc., 735 Belvidere Ave., Plainfield, New Jersey 07062, USA
ABSTRACT This chapter describes the need for complex semantic models, i.e., ontologies of real-world categories, referents, and instances, to go beyond the barriers of terminology and data structures. Terms and local data structures are often tolerated in information technology because these are simpler, provide structures that humans can seemingly interpret easily and easily use for their databases and computer programs, and are locally salient. In particular, we focus on both the need for ontologies for data integration of databases, and the need for foundational ontologies to help address the issue of semantic interoperability. In other words, how do you span disparate domain ontologies, which themselves represent the real-world semantics of possibly multiple databases? We look at both sociocultural and geospatial domains and provide rationale for using foundational and domain ontologies for complex applications. We also describe the use of the Common Semantic Model (COSMO) ontology here, which is based on lexical-conceptual primitives originating in natural language, but we also allow room for alternative choices of foundational ontologies. The emphasis throughout this chapter is on database issues and the use of ontologies specifically for semantic data integration and system interoperability.
INTRODUCTION
Nationwide, lack of semantic interoperability is estimated to have cost U.S. businesses over $100 billion per year in lost productivity (http://en.wikipedia.org/wiki/Semantic_interoperability). Such estimates do not consider loss of opportunity for timely action due to slow information sharing, nor do they measure lost opportunities that would arise from existence of a standard of meaning that would allow new, independently developed programs to interoperate accurately. The effects of the lack of semantic interoperability on national security are difficult to measure, but they add to this economic inefficiency.
Differing terminology and different database structures within various communities create serious barriers to the timely transfer of information among communities. Barriers to accurate transfer of information between communities, also called semantic interoperability, can be so large that they present a practical impossibility, given the limited resources available for most tasks.
*[email protected] Obrst, L., and Cassidy, P., 2011, The need for ontologies: Bridging the barriers of terminology and data structure, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 99–123, doi:10.1130/2011.2482(10). For permission to copy, contact [email protected]. © 2011 The Geological Society of America. All rights reserved.
The difficulty is not in transferring bits and bytes of data but in transferring information between separately developed systems in a form that can be correctly machine-interpreted and used for automated inference. Optimally, the information to be transferred must be in a neutral form and independent of the original purpose for gathering and recording the data. If recorded in such a form, it can be reused in an automated process by any system regardless of its origin. Computational ontologies, developed over the past 30 yr, provide the current state of the art for expressing meanings in such an application-neutral form. This report discusses why adoption of a common foundation ontology (FO) throughout an enterprise can enable quick and accurate exchange of machine-interpretable information without modifying local users’ terminologies or practices, and the principle is briefly explored as it relates to sociocultural and geospatial information. A foundation ontology as described here is an ontology that represents the most basic and abstract concepts (classes, properties, relations, functions) that are required to logically describe the concepts in any domain ontology that is mapped to it. The principles described here apply to sharing of information among all domains, including geospatial information.
The need for an ontological approach can be better understood by examining the limits of some common approaches to interoperability. Commercial vendors have developed systems that reduce this inefficiency. The syntactic and structural processes are called “data integration” or “enterprise information integration.” Among the tactics proposed is the process of “data warehousing.” In this process, data from multiple databases are extracted, transformed, and loaded into a new database that provides an integrated view of the local enterprise. This tactic is practical only within a single enterprise that has a central director that can enforce data sharing and adequately fund the integrated database. In addition, it typically does not allow real-time updates of the integrated database. Furthermore, there are no explicit semantics associated with this approach; the semantics remains implicit, and individuals must inspect the structural schemas, the appropriate data dictionaries, and the application procedural code that surrounds access to or processing of the data from the database in order to understand the intended semantics.
Here is an example. One agency may have a database that includes a table of terrorist organizations, with individual members related to each organization by the column (attribute) “member” in the “terrorist organization” table. One record might specify a member of such an organization and the member’s position in the organization. The program using that database may also access lists of people trying to enter the United States, and it will immediately notify border agents to detain anyone who appears in the “member” relation of a terrorist organization for questioning. Another agency may have a database with a table of individuals considered “persons of interest.” A column “member-of” might specify the organizations to which an individual belongs. The program using that database may be searching for social relations among persons of interest, perhaps using Structured Query Language (SQL) or another query language. A common membership could indicate a significant link. If those two agencies share their data, neither program will recognize the significance of the alternative way of specifying an organization’s memberships. The two organizations represent the relations in an inverse fashion, so even a cross-reference table that maps columns of identical meaning would be unable to catch this relationship. The meaning (semantics) of social relations is locked in the procedural code of the programs that perform actions based on the implied, but not explicit, meaning of the data tables. However, if both databases are properly mapped to an ontology, even simple ontology formalisms, such as the Web Ontology Language (OWL) (Bechhofer et al., 2004), will automatically generate the inverses of the specified relations. Consequently, the attempt of that potential terrorist trying to enter the country could be recognized immediately and automatically, in spite of differences in the way information is encoded in different databases.
Another situation may arise where one agency has a database of social relations containing the field “CloseRelatives,” which lists all relatives with up to two links of family relations (i.e., parent, uncle, aunt, brother-in-law, etc.), but another agency may only list specific relations individually. In the ontology, the specific relations will be subrelations of the more general relation (e.g., an uncle will also be recognized as a “close relation”). A query for “CloseRelatives” on the database will return the more specific relatives from other databases mapped to the ontology, using the inference mechanism of the ontology. Other examples and scenarios illustrating the advantages of using ontologies for database integration are provided later in this paper. Appendix A will describe the similarities between ontologies and databases. Appendix B will describe the use of ontologies for federated queries across multiple databases. Appendix C will provide an example of a query translation from the ontology to multiple databases using that ontology.
When different geospatial communities develop ontologies or databases derived for different purposes, sharing their data will require the logical descriptions of geospatial entities as well as other concepts to be carefully specified. In order for logical inference to work consistently among different ontology-based applications, the concepts, especially the relations, to be shared in common must be logically represented in terms with an agreed-upon meaning, and in formats that, if not identical, can be accurately translated. One common concept is that of “distance.” However, to date, there is little agreement even on a common representation for units of measure, among which units of distance, time, and force (gravity, geomagnetism) will be used frequently in geospatial applications. Note that there are existing or emerging ontologies for units of measure, e.g., the emerging OASIS (Organization for the Advancement of Structured Information Standards) Quantities and Units of Measure Ontology Standard (QUOMOS, 2011) Technical Committee, the National Aeronautics and Space Administration (NASA) Quantities, Units, Dimensions, and Data Types (QUDT, 2011) ontology, the Cyc Quantity and Units of Measure representation, and the Ontolingua physical quantities ontology (Ontolingua,
1994), but as of yet, there is no common agreement on any of those ontologies. For precise reasoning, representation of geographical features such as land masses, atmosphere, and bodies of water will depend on having a common representation for substances such as rock, sand, air, and water, for light and heat, and for objects such as vegetation or artifactual structures (buildings, roads, bridges), as well as for spatial concepts. In some geospatial applications, entities such as countries are represented as literal strings that may be interpreted in different ways by the procedural code of different applications. This leaves a lot of room for error in interpretation when the different aspects of a country may need to be reasoned with; distinctions may need to be made between the spatial region of the geographical territory, the physical objects (structures, mineral resources) within the geographical territory, the government controlling that territory, the population or citizenry, and the abstract notion of a country as an agent that can perform certain actions and enforce its laws. Later herein, we discuss the reasons for adopting a common FO that contains at least the primitive elements used in the different applications, which will provide a basis for accurate interoperability among those applications. Examples of such primitive elements, in addition to units of measure, that may frequently be required are: the geometric primitives of the Open Geospatial Consortium’s Geography Markup Language (GML, 2011) standard, spatial relations such as those formalized in the Region Connection Calculus (Randall et al., 1992; Renz, 2002; Obrst, 2002; Cohn and Gotts, 1994), time relations such as those formalized in the OWL Time ontology (Hobbs and Pan, 2004; Fernández-López and Gómez-Pérez, 2004; Hayes, 1996), and geographical relations such as those in the Geographical Ontologies of Bitters (2002), among others.
In some cases, specifications for the intended meanings of terms may require some observational or measuring procedure to be identified, and that procedure must then be related to the term to make the intended meaning unambiguous. One example from the geographic information system (GIS) domain can be seen in the description by Schuurman and Leszczynski (2006, p. 711–712) of the way in which the term “crown closure” (the density of forest canopy) can be defined using the same words by two different organizations but measured by different procedures, resulting in different values for that property for the same section of forest. The term can only be made unambiguous for computational purposes, and its uses related to each other, by including a reference to the measuring procedure as part of the logical description. Certain mathematical procedures may also need representation as primitives. As mentioned by Fonseca et al. (2003), in translating information from one GIS application to another, it may be necessary, for example, to convert a Triangulated Irregular Network representation of a region to a representation using isolines. In such a case, some interpolation technique(s) will be needed to convert the different kinds of representation.
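As a concrete, if simplified, illustration of such an interpolation step, the sketch below resamples scattered elevation points, standing in for the vertices of a Triangulated Irregular Network, onto a regular grid from which isolines can then be derived. It is a minimal sketch only: the point values are invented, SciPy and Matplotlib are assumed to be available, and a production conversion would respect the TIN's own triangulation and, as argued in the text, record the chosen method alongside the converted data.

```python
# Minimal sketch: resample TIN-like scattered points onto a grid and
# derive isolines. Values and grid resolution are purely illustrative.
import numpy as np
from scipy.interpolate import griddata
import matplotlib.pyplot as plt

# Scattered (x, y) samples with synthetic elevations, standing in for TIN vertices.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 100.0, size=(200, 2))
values = np.sin(points[:, 0] / 20.0) * np.cos(points[:, 1] / 20.0) * 50.0

# Regular grid onto which the surface is interpolated.
gx, gy = np.meshgrid(np.linspace(0, 100, 101), np.linspace(0, 100, 101))

# The interpolation method itself ("linear" here) is exactly the kind of
# detail the paper argues must be represented in the ontology.
grid_z = griddata(points, values, (gx, gy), method="linear")

# Isolines (contours) derived from the gridded surface.
plt.contour(gx, gy, grid_z, levels=10)
plt.savefig("isolines.png")
```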
For information to be converted accurately and consistently by means of a foundation ontology, the interpolation procedure used will need to be specified and represented in the ontology, so that any GIS application will be able to accurately transfer its information in a form usable by others, and be able to reason with either representation as required. As with measuring procedures, addition of an interpolation procedure to the ontology might require adding some additional primitive ontology elements beyond those found in the common foundation ontology, but interpolation procedures may themselves be constructible from more basic primitives, and they may not require additions to the primitive inventory of the foundation ontology. Such procedures for specialized fields may require new primitive elements, but they may themselves be constructible from primitives already present in the foundation ontology. If required, new primitives for specialized purposes may not need to be included in the most basic foundation ontology, but they can be included in a midlevel extension that would be used only by domain ontologies that require use of such procedural primitives. For the purposes of this paper, and the prototype addressing the sociocultural semantic modeling that engendered the need for this paper, we chose the Common Semantic Model (COSMO) foundational ontology (Cassidy, 2008), which is largely based on the development of semantic primitives behind the common words and concepts of natural language. Similar to the COSMO approach, Barker, Porter, and Clark (2001) described the use of a foundation ontology, though their set of primitives appears to be smaller than our interpretation of the set needed for a broadly adequate foundation ontology. After a basic foundation ontology is adopted, new applications may be mapped to it that have primitive elements not decomposable into elements already in the foundation ontology, and those missing elements will need to be added in order to support interoperability of the new application with other applications already mapped to the foundation ontology. By mapping to such a common foundation ontology, all geospatial applications may share the same information without loss of conceptual precision, and they may share it with other communities that use the foundation ontology as the basic standard of meaning. To overcome the opaqueness of meanings in traditional databases within the federal government, a few relatively isolated divisions have attempted approaches to ensure that shared information would be recognized by all programs sharing the data. However, a strategy for retrieving information from multiple databases in a form suitable for automated inference has not yet achieved widespread use. The difficulty of reaching common agreement on one standard of meaning has been a subject of speculation and debate. Systems that are not based on ontologies have limited expressiveness and do not support the logical inferencing that will permit full representation of meanings. This may inhibit potential users who do not find the representation models they need or prefer. The ontologies that are sufficiently expressive are time-consuming to learn effectively. Without publicly available
programs demonstrating their usefulness, motivation to undertake the cost of implementing them has limited most programs to a small exploratory phase. For ontologies, this latter point suggests that it is only a matter of time until convincing utility demonstrations are developed to provide the motivation for the necessary resources. Another factor is an almost universal misunderstanding among those who have not studied ontology technology regarding bridging terminology barriers. Because of a lack of understanding of the new technology and misleading focus on linguistic phenomena, it is still widely, but incorrectly, believed that overcoming these barriers will be impossible. An additional factor is that individual ontology systems are not used by a large community and thus the number of third-party vendors of utilities that make ontology simple is small. Only in the past few years have vendors developed programs that make use of ontologies easier, and, thus far, mostly for the OWL ontology language. Another issue is the lack of agreement among the ontology research community regarding the content of the foundation ontology. One attempt to avoid the latter problem is to view the foundation ontology as a “defining vocabulary,” which is being pursued in developing the COSMO ontology (Cassidy, 2008), in parallel with the sociocultural data model. Most efforts at retrieving information from multiple databases in a form suitable for automated inference suffer from a lack of explicit, expressive, and application-neutral data representation. One well-developed information exchange system, the National Information Exchange Model (NIEM, 2011), grew from a U.S. Department of Justice initiative to exchange information about criminal suspects to a broader program to support exchanging information across the federal government (e.g., to the Department of Homeland Security). Although NIEM is widely used, it does not have the expressiveness of an ontology, and, consequently, other projects within the federal government are still attempting to develop alternative approaches in order to move beyond the simple message-passing supported by NIEM formats (i.e., Information Exchange Package Documentation). One initiative (Wisnosky, 2010), apparently similar in several ways to the one we suggest here, is now being undertaken within the Business Mission Area of the U.S. Department of Defense (DoD). In that approach, a common core vocabulary based on primitives is logically specified by an ontology that constrains the meanings of the terms; the common vocabulary provides the basis for translating data among local data stores. This system is being developed incrementally, but details and examples of performance are not yet available for public inspection. Several industry groups have developed specialized information exchange standards, which are also commonly aimed at accurate information exchange (e.g., UCore [2009], the National Building Information Model Standard [2011], and the Standard for the Exchange of Product model data [STEP, 2011], International Organization for Standardization [ISO] standard 10303: see http://www.iso.org/iso/iso_catalogue.htm). The primary alternative to the use of a common foundation ontology that has been explored by a number of groups is
ontology mapping or ontology alignment (Kalfoglou and Schorlemmer, 2003, 2004; Ehrig, 2005; Euzenat and Shvaiko, 2007; Musen and Noy, 2002; Stoutenburg, 2009). There have been conferences and workshops devoted to that technique. However, the accuracy of automatic mapping is low (usually less than 60%, often much less) and insufficient to support automated inference in mission-critical applications. Semiautomated mapping techniques may be useful to align independently created ontologies to the common foundation ontology when the domain ontology is not created originally from components provided by the foundation ontology. In the context of geospatial ontologies, matching techniques have been studied by Cruz and Sunna (2008) and Sunna and Cruz (2007), who also found that conceptual heterogeneity among classification methods required human inspection to deduce proper alignments for most classes. For the purpose of automated inference, the difficulties related to automated alignment are considerably worse than for many of the cases studied because alignment between the all-important semantic relations in formal ontologies is very complicated, and it can be challenging even for experienced ontologists. To achieve generalizability beyond simple message passing, the federal government has embarked on the Community of Interest (COI) paradigm. The COI paradigm enables two or more organizations that form a COI to develop specific vocabularies and models of those vocabularies to share information. Some organizations, such as the Enterprise Vocabulary Team (EVT) under the Air Force CIO office (Parmelee, 2007) have adopted or are adopting principled approaches to developing COI vocabularies and models. The EVT is developing semantic models of the COI vocabularies, which are expressed in OWL, and then it is generating downstream products (e.g., XML schemas) for the actual exchange of data. Other approaches, including some with ontologies in their architectures, do exist. For example, several projects within the federal government are aimed at developing a “core” model for information that represents concepts common to many domains and that can be used as a tool for integrating different domain models. The DoD Core taxonomy (Hayes, 2005), NIEM, Universal Core (UCore), Command and Control (C2) Core, the Maritime Information Exchange Model (MIEM, 2008), the Federal Enterprise Architecture Reference Ontology (FEA-RMO), and other efforts within the intelligence community are examples. Except for a few ontologies used in the intelligence community, these efforts do not use ontologies, but instead they use data schemas. They do not have the expressiveness required to unambiguously represent the meanings of concepts from different domains of interest. Examples of this expressiveness problem are given in the body of this chapter and in Appendix B. Exceptions to the attributions here are the ontologies developed for the semantic layers of UCore and C2 Core (Winters and Tolk, 2009; Smith et al., 2009), which, however, are recognized only as products affiliated with the standards and not direct products of those standards.
These other approaches do not provide a general solution to exchanging information among databases with heterogeneous types and purposes. Some form of information exchange system must be developed to allow true federated (or consistent distributed) queries, from a requester with the proper permissions to the databases, to operate on one of the federal networks without requiring the local systems or practices to be modified to accommodate the external query beyond granting access. The following section describes the method, based on existing technology, that is likely to achieve that goal.
THE ONTOLOGY SOLUTION
An ontology solution requires both expressive ontologies and a well-constructed ontological architecture (Obrst, 2010), the latter of which includes the use of foundational (sometimes called “upper”), midlevel, utility, domain, and subdomain ontologies. In this section, we discuss the level of expressiveness of the ontologies needed for complex applications and the nature of the fully founded ontological architecture needed.
An Expressive Ontology
An adequately expressive ontology enables information to be exchanged in a form that contains the elements needed (the classes, relations, functions, instances, and potentially rules) for a computer system to properly interpret and automatically use the information without prior knowledge of, or interaction with,
the system that created the information. In Figure 1, the level of expressiveness (ability to describe the meaning of a concept) required is that of a logical theory represented minimally in OWL-DL, but preferably in OWL-Full, with, in addition, rules expressed in a rule language such as the Semantic Web Rule Language, the Rule Interchange Format, or in logic programming, or a hybrid representation (Samuel et al., 2008). In Figure 1, “terms” are natural language words or phrases that act as indices into the underlying meaning, i.e., the concept (or composition of concepts) and the syntax (e.g., string) that stands in for or is used to indicate the semantics (meaning). “Concepts” (referents, categories) are units of semantics, i.e., the node (entity) or link (relation) in the mental or knowledge representation model and a placeholder for the real-world referent. Both weak and strong ontologies use the subclass_of relation between a child concept and its parent. The primary distinction between a weak and a strong ontology is that a “weak” ontology is expressed in a knowledge representation language that is not based on a formal logic. Why is this important? It means that a machine can only read and process a weak ontology (e.g., currently, that means models in an entity-relation language or the Unified Modeling Language [UML]). It cannot semantically interpret the ontology, i.e., ingest the ontology and perform automated reasoning on it. So, a weak ontology is not semantically interpretable by machine; a “strong” ontology is. The figure depicts the correlation between complexity of the semantic model—ranging from taxonomy, thesaurus, and conceptual model to logical theory—and complexity of the potential application. As the expressiveness of the
Figure 1. More expressive semantic models enable more complex applications. SW—semantic Web.
semantic model increases, so does the possibility of solving more complex problems. For applications that require great semantic precision, i.e., where approximate or loose characterizations of the semantics simply will not accomplish what is needed, more expressive models (ontologies) are required, and only logical theories are expressive enough to enable complex applications such as semantic interoperability (Obrst, 2003; Daconta et al., 2003; Obrst et al., 2007, 2010a, 2010b; Gruninger et al., 2008) and complex decision support for rich situational awareness and course-of-action analysis (Stoutenburg et al., 2007a, 2007b; Obrst, 2010). This ability to automatically and correctly interpret data without contact with the creators of the data is critical because massive amounts of data are generated and accessible, making it a practical impossibility for data consumers to consult with more than a few data source managers to resolve the ambiguities that are present in information when it is not represented in a form as expressive and unambiguous as an ontology. This is especially important when time-critical information must be interpreted and forwarded to the person or system that can take the appropriate action. For 20 years, ontology technology has been developing rapidly; it is now being deployed to solve practical problems. For example, the Cyc ontology has been used to integrate clinical data at a major medical facility (Lenat et al., 2010). A separate system “SIRI” (http://siri.com) is a “virtual personal assistant” that uses an ontology for information management, and it is now included as an app for the iPhone. As will be described later herein, an ontology, together with an interface to local data, can serve as a “translation utility” that allows local groups to develop and pursue their tasks independently while still being able to accurately share information with other independent groups. The use of a common foundation ontology to enable sharing of information has been suggested previously. As early as 1998, Guarino (1998) provided theoretical arguments for the use of a common foundation (“top-level”) ontology in information sharing, and also explained why a post hoc attempt to map domain ontologies that were developed independently can provide a misleading apparent overlap, because domain ontologies are often insufficiently detailed to rule out many unintended models, some of which may appear identical between domains.
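To make the expressiveness argument of this section concrete, the short sketch below revisits the membership example from the introduction. It is a minimal illustration, not a fragment of any operational system: the namespaces, property names, and individuals are invented, and rdflib with the owlrl OWL-RL reasoner is assumed simply because they are freely available; any OWL-capable toolkit would serve. Declaring that the two agencies' properties are inverses, and that an "uncle" relation specializes "close relative," lets the reasoner derive the facts that each database left implicit in procedural code.

```python
# Minimal sketch: OWL axioms (inverseOf, subPropertyOf) let a reasoner
# bridge two databases that encode membership in opposite directions.
# All names below are illustrative; rdflib and owlrl are assumed installed.
from rdflib import Graph, Namespace
import owlrl

EX = Namespace("http://example.org/demo#")
ttl = """
@prefix ex:   <http://example.org/demo#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Agency A records organization -> member; Agency B records person -> organization.
ex:hasMember a owl:ObjectProperty .
ex:memberOf  a owl:ObjectProperty ; owl:inverseOf ex:hasMember .

# Agency B lists specific kin relations; Agency A queries "close relative."
ex:hasUncle  a owl:ObjectProperty ; rdfs:subPropertyOf ex:hasCloseRelative .

# Source data, as each agency recorded it.
ex:OrgX    ex:hasMember ex:PersonP .
ex:PersonQ ex:hasUncle  ex:PersonR .
"""

g = Graph()
g.parse(data=ttl, format="turtle")
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)  # materialize inferences

# The derived triples answer each agency's query in its own terms.
print((EX.PersonP, EX.memberOf, EX.OrgX) in g)             # True: inverse inferred
print((EX.PersonQ, EX.hasCloseRelative, EX.PersonR) in g)  # True: subproperty inferred
```

A weak model (an entity-relation diagram or a bare XML schema) can record that both columns exist, but it cannot state the inverse or specialization axioms, which is why only a logical-theory-level ontology supports this kind of automated interpretation.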
Need for a Foundational Ontology
The COSMO (Cassidy, 2008) ontology is only one of several existing foundational ontologies that might serve as the starting point for developing a common foundation ontology for one or more communities. Others are OpenCyc (2011), Suggested Upper Merged Ontology (SUMO, 2001) (Niles and Pease, 2001), Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) (Masolo et al., 2003), Basic Formal Ontology (BFO) (Grenon, 2003a, 2003b; Grenon and Smith, 2004), and OCHRE (Object-Centered High-level REference) (Schneider, 2003). An overview of other foundational ontologies is presented in chapter 6 of the book by Oberle (2006). Another discussion of the ways in which foundation ontologies can be used for interoperability can be found in Borgo and Lesmo (2008). Comparisons of foundational ontologies can be found in Semy et al. (2005), Mascardi et al. (2006), Grenon (2003a), and elsewhere. Finally, there is the Upper Ontology Summit, which was an effort to correlate existing upper ontologies (Obrst et al., 2006).
In Figure 2, the layers represent the upper, midlevel, and domain (or lower) ontology layers. Sometimes the upper and midlevel ontologies are called foundational ontologies, and there can be more than one ontology in each layer. In our view, “upper” or “foundational” ontologies are typically abstract and make assertions about constructs such as identity criteria, parts/wholes, substance and constitution, space and time (and endurants and perdurants), necessary properties, dynamic properties, attribute spaces, etc., that apply to all lower levels; hence, they span all midlevel and domain ontologies (Obrst, 2010). These foundational ontologies may consist of levels, as discussed as “levels of reality” in a number of papers (Poli, 2003, 2010; Poli and Obrst, 2010). “Midlevel” ontologies are more specific, making assertions that span multiple domain ontologies. “Domain” or “subdomain” ontologies address specialized portions of the real world, i.e., specific aspects of electronic commerce, military situational awareness, intelligence analysis, or geographical, geophysical, and geopolitical portions of the world, and they can become very fine grained. “Utility” ontologies are specialized ontologies that address very common domains shared by very many domains and can be represented at different levels in the figure, e.g., units of measure, time, biology experimental construct ontologies, etc. “Super-domain” ontologies are those ontologies that act as upper or foundational ontologies for specific complex domains, such as biology, medicine, physics, services, etc. Sometimes it can be hard to substantively differentiate midlevel, utility, and super-domain ontologies, since many ontologies can be characterized at each of those levels.
Figure 2. Ontological architecture.
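As a small illustration of how these layers connect in practice, the sketch below anchors a domain-level geoscience class and property to midlevel and upper (foundation) terms. Every identifier in it is invented for the example; there is no claim that COSMO or any other foundation ontology uses these names. The point is only that domain terms acquire their shared meaning by being defined from the foundation ontology's primitives.

```python
# Minimal sketch of the layered architecture: a domain term is defined by
# reference to midlevel and upper (foundation) terms. All names are invented.
from rdflib import Graph, Namespace, RDF, RDFS, OWL

UPPER = Namespace("http://example.org/upper#")     # foundation layer
MID   = Namespace("http://example.org/midlevel#")  # spans many domains (units, space, time)
GEO   = Namespace("http://example.org/geology#")   # domain layer

g = Graph()
# Upper layer: very abstract categories.
g.add((UPPER.PhysicalObject, RDF.type, OWL.Class))
# Midlevel layer: a generic spatial-extent property usable by many domains.
g.add((MID.hasSpatialExtent, RDF.type, OWL.ObjectProperty))
# Domain layer: geoscience terms defined in terms of the layers above.
g.add((GEO.GeologicUnit, RDF.type, OWL.Class))
g.add((GEO.GeologicUnit, RDFS.subClassOf, UPPER.PhysicalObject))
g.add((GEO.occupiesArea, RDFS.subPropertyOf, MID.hasSpatialExtent))

print(g.serialize(format="turtle"))
```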
The use of ontologies for sharing of geospatial information has been previously discussed by Schuurman and Leszczynski (2006). However, as they point out (p. 711), the formalization of the basic GIS data that must precede mapping to a common ontology has been only slowly adopted. Whether the GIS data are currently structured as a relational database, as an ontology, or in some other form, the potential for accurate interoperability provides a reason to undertake the additional effort needed for information modelers to describe their data using the logical form provided by a common foundation ontology.
Some vendors claim to provide systems that use an ontology to integrate relational databases. For database integration, commercial vendors such as Cycorp (http://www.cyc.com) and HighFleet (formerly Ontology Works, http://www.highfleet.com) offer systems that have a built-in foundation ontology and utilities to link that ontology to multiple databases. Availability of an existing foundation ontology does not eliminate the need to create specific domain ontologies and ontology extensions that can map the local data stores to the foundation ontology and, through the foundation ontology, to other data stores. Several foundation ontologies are available for free public use, and some commercial vendors provide abilities beyond their ontologies, such as utilities for the process of linking the foundation ontology to a database. If, in the database integration process for the federal government, open-source programs are desired, the utilities for using the foundation ontology must be developed based on a foundation ontology selected or developed for that purpose. However, this linkage utility will only be developed once, and it will not greatly increase the cost of integration beyond the effort required for developing the mappings of local data stores. Whether using an existing vendor’s integration utility or developing an open system to minimize the long-term costs and maximize the open-source community’s participation, both
involve mapping each local data store to a common representation of meaning (requiring some expressive ontology), and, when required, converting information in one terminology and format into information with the same meaning in another terminology and format, using the foundation ontology as a common standard of meaning. Using a common ontology is a practical method to avoid many-to-many mappings of multiple databases. As depicted in Figure 3, n databases can be mapped to each other through a common ontology near linearly, with only 2n – 1 mappings to the common ontology, while without a common reference interlingua, n² – n individual mappings would be required among the databases; for ten databases, that is 19 mappings rather than 90. As a result of this well-known difference, almost every project that attempts to integrate multiple knowledge organization systems uses a common core to which mappings are made and through which terminology conversions can be accomplished. Most of these projects address only local integration concerns, so the resulting mappings of one project are not generalizable beyond that project. Compounding the lack of compatibility of local integration projects is the fact that few have used a common standard of meaning with the expressiveness of an ontology. As a result, it is inevitable that one group’s common meanings are unsatisfactory to others. To avoid the problem of limited local orientation or limited expressiveness, a translation utility is needed, which is based on a common foundation ontology with an inventory of basic concepts that is large enough to allow definitions of most specialized concepts found in databases. The task is to develop a translation utility and apply it to integrate multiple databases. The task includes the development of domain ontologies that extend the basic concepts represented in the foundation ontology, and thereby accurately represent the meanings of the data in the integrated databases. Such an ontology-based system will perform
Figure 3. Complexity of semantic integration with and without ontologies: An ontology allows for near-linear semantic integration (actually 2n – 1) rather than near n² (actually n² – n) integration. Each application/database maps to the “lingua franca” of the ontology, rather than to each other.
conversions rapidly, accurately, and automatically. This will enable a true federal-wide federated search for information using local terminology mapped to the common ontology. For situations where a database has been developed without reference to a common foundation ontology, the conceptual schema used to develop the database can serve as a record of the conceptual elements of importance for the application, accelerating the process of post hoc mapping. Fonseca et al. (2003) discussed issues in relating conceptual schemas to their corresponding ontological representation. Some scenarios showing how an ontology can query across multiple databases are given in Appendix B. The access to a particular store of information will depend upon the permission level a querying community has to access other data stores, and the development of mappings to those data stores. In this paper, we illustrate the principles by describing the way in which a common foundation ontology can support accurate federated search among diverse databases. However, the same principle can be used to allow data from one source to be entered into multiple databases, by translating the assertions in the data from one data source into the terminology and format of all of the local databases that may wish to use and store that data. This principle also supports accurate communication among multiple intelligent agents or among ontology-based applications generally, and a query of multiple databases is merely one procedure that may be used by a large number of different applications for data sharing. The principle of a common foundation ontology allows great flexibility in the programs and communications methods used locally or within small communities; it is only necessary, when communication to a broad audience is desired, for one of the local agents to translate the information into the logical vocabulary of the common foundation ontology, and then from there to any other system that wishes to access the information. The communications methods among local groups of databases can therefore be specialized and optimized for local efficiency, and translation to the common logical foundation is required only for information to be communicated from or to systems outside the local community. For brevity, we consider the case where the data reside only in the local databases, and queries to the remote databases are created at query time. This can be done by translating a query posed to one of the federated databases, transmitting the translations to the other databases then receiving the replies and translating them back to the terminology of the original database. Alternative architectures have been suggested. In one alternative (Critchlow et al., 1998), the ontology is used to translate local databases into a common warehouse database that represents all the desired data. Data are entered locally, but the federated search is accomplished by query of the warehouse database. This alternative architecture can reduce potential inefficiency of automatically generated queries by creating a single warehouse database for which optimum forms for queries to the warehouse can be identified and used. Another alternative architecture uses a Data Access Mediation Service that accepts queries and returns
answers. It also provides effective translations between local terminologies by links to two other services—a semantic map and an association—and the local data stores. In that architecture, the single ontology mapping is replaced by two types of translation service—the semantic map, which specifies the data source and elements in the source that satisfy the query, and the cached associations, which translate the source data back into the terminology of the requester. Either of these alternative architectures may be more efficient than the simple query translation enabled by a common foundation ontology. However, the simple translation method should minimize demands on the local data managers and create the ability to federate queries across databases quickly. This architecture will be an intermediate phase while a more efficient method is developed to query across multiple diverse data stores; it will ease the transition from the existing structures of legacy databases to the time when the schemas of databases are generated directly from an ontology. In all these architectures, a common foundation ontology (or its procedural equivalent) is essential to serve as an unambiguous standard for the meanings of the local data and enable accurate query translation and data retrieval. The common foundation ontology will provide the standard for meanings of information extracted from free text and the bridge between the structured data in databases and the unstructured data in text. A common foundation ontology will allow the creation of new databases by developing an ontology as the conceptual data model and automatically generating the database schema from the ontology. This procedure will be no more costly than traditional database development, as discussed later herein. If each local ontology is developed using the basic concepts in the common foundation ontology, the databases generated will automatically map to each other. Automatic generation of database schemas permits local terminology to be used in the local ontology and database. The local terminology is mapped to the common foundation ontology by linkage of the local ontology to the common foundation ontology. This is all invisible to the user. The method of generating a database directly from an ontology had already been implemented commercially in the HighFleet system. Use of a translation service to federate existing databases is a tactic that can deal with legacy databases while allowing database developers to become familiar with the use of ontologies as they work with existing data stores. For those only beginning to formalize their data in an ontology format, Appendix A describes considerations and principles that should be kept in mind. There are now many books, articles, and online tutorials that can assist beginners who are learning the basic techniques and tools; new resources appear rapidly. There is also a large body of literature and annual conferences addressing the problem of relating separately developed ontologies to each other. Variously, this technical thread goes by ontology “matching,” “alignment,” and “mapping” (Sowa, 2010; Kalfoglou and Schorlemmer, 2003, 2004; Ehrig, 2005; Euzenat and Shvaiko, 2007; Stoutenburg, 2009; Obrst, 2010), and, of course, a potential
post hoc “merging” (Noy and Musen, 2000; McGuinness et al., 2000). The annual International Semantic Web Conference (ISWC) and associated conferences also have relevant papers. In addition, research on formalized context, formal interoperability, and ontology modularization also addresses very similar issues of mapping ontologies or portions of ontologies and modularization of ontologies (Blair et al., 1992; Guha, 1991; McCarthy, 1990, 1993; McCarthy and Buvač, 1997; Gabbay, 1996; Akman and Surav, 1996, 1997; Giunchiglia and Bouquet, 1997, 1998; Giunchiglia and Ghidini, 1998; Obrst et al., 1999; Lenat, 1998; Meseguer, 1998; Menzel, 1999; Basin et al., 2000; Blackburn, 2000; Bouquet et al., 2003; Obrst, 2010; Obrst and Nichols, 2005; Haase et al., 2006, and the papers therein). These approaches do not use the tactic of focusing on semantic primitives, as does the COSMO ontology used for this project.
The task of converting existing relational databases into ontology format has also been investigated. There are simple systems such as D2RQ, described by Bizer (2009), which converts an RDB (relational database) to an RDF (Resource Description Framework) graph, as well as RDB2Onto, described by Laclavik (2006), and there are also more complicated systems, such as R2O (Barrassa et al., 2004), in which a separate query system is developed to deal with potential mismatches among different databases (a minimal sketch of this kind of conversion appears after the cost discussion below).
Cost Considerations
The cost of this method of achieving interoperability for newly developed databases is no greater than the cost of traditional database development because development of the domain ontology for a new database replaces developing the conceptual and logical data models. Because the number of databases can be expected to increase at least at the rate of economic expansion, ~3%–6% per year, within 10 to 20 yr the number of new databases being created will exceed legacy databases. Therefore, it is significant that practical semantic interoperability for the majority of databases can be achieved with no additional development cost, beyond the minimal cost of achieving agreement on the common foundation ontology. For legacy databases, retroactive mapping will be practical, but the cost is likely to be a significant fraction of the cost of developing a new database, because each element (table, field, or restriction) must be mapped to the foundation ontology. A “triage” approach can be taken, where critical databases are mapped immediately, less critical ones are mapped when the databases are refactored or are merged into a data warehouse, and unimportant databases (those not important for federated search purposes) are left unmapped. With the anticipated savings in interoperability costs, this triage approach could quickly pay for the development cost for a translation utility and mapping of legacy databases. There is also an emerging field within ontological engineering that attempts to strictly characterize the costs associated with ontology development; ONTOCOM is one such attempt (Simperl et al., 2006, 2009; Bürger, 2008; Bürger and Simperl, 2008; Imtiaz et al., 2009).
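As noted above, the following is a minimal sketch of exposing relational rows as RDF statements phrased in a shared ontology's terms, in the spirit of tools such as D2RQ. Everything in it is illustrative: the table, the column names, and the ontology terms are invented; SQLite and rdflib are used only because they are freely available; and a real deployment would use a declarative mapping rather than hand-written code.

```python
# Minimal sketch: expose rows of a local relational table as RDF triples
# phrased in a shared ontology's terms. All names here are invented.
import sqlite3
from rdflib import Graph, Literal, Namespace, RDF

ONT = Namespace("http://example.org/shared-ontology#")  # hypothetical shared terms
DAT = Namespace("http://example.org/agency-a/record/")  # local record URIs

# A stand-in for a local legacy table.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE persons_of_interest (id INTEGER PRIMARY KEY,
                                      full_name TEXT, member_of TEXT);
    INSERT INTO persons_of_interest VALUES (1, 'P. Example', 'OrgX');
""")

g = Graph()
for row_id, full_name, member_of in db.execute(
        "SELECT id, full_name, member_of FROM persons_of_interest"):
    person = DAT[f"person/{row_id}"]
    g.add((person, RDF.type, ONT.Person))
    g.add((person, ONT.hasName, Literal(full_name)))
    # The local column 'member_of' is mapped to the shared ontology relation,
    # so other systems can query it without knowing this table's layout.
    g.add((person, ONT.memberOf, DAT[f"organization/{member_of}"]))

print(g.serialize(format="turtle"))
```

Once every participating database is wrapped this way against the same foundation ontology, a federated query needs only the shared terms, which is the point of the triage strategy described above.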
ONTOLOGY VERSUS TRADITIONAL DATA MODEL

This brief summary provides a general overview of the differences between an ontology and a traditional data model. For a detailed discussion, refer to Appendix A. Traditional data models are syntactic and structural, without any explicit semantics for the tables and fields. Data are represented as strings, numbers, and simple enumerations, and are often shown as a choice among a limited number of string values for an attribute. By contrast, in a proper ontology, few things other than entity names would be represented as simple strings. The elements (including attributes) would be part of a class structure; they would be first-class entities in the sense of being types (categories or classes), relations, or instances of types. Each first-class entity can have many relations to other entities in the ontology, whereas strings in an enumeration do not have semantically meaningful internal structure. For example, in a traditional data model, a list of religions serving as attributes for a person might include the strings "Islam," "Sunni Islam," "Christianity," "Church of England," etc. Those strings may not be related to anything else in the data model, whereas in an ontology, the meanings of these terms, and rules about them, are explicitly recorded to whatever level of detail is required by any of the applications using any of the data stores. At a minimum, each religion would be a first-class element (which might be a subtype of "Religion" or an instance of that type); the conceptual structure containing information about the religion might be potentially useful in some application. In the traditional data model, the same string might be used differently in other tables, depending on the procedural code that interprets the data in the tables. The relations and differences may be of no significance in a given application, even though the implicit concepts are related (e.g., Sunni Islam would be a specialization of Islam). On the other hand, when that information is reused in a different application, the fact that the strings refer to first-class entities and have a direct relation to each other may be significant to the second application.

Next, in Table 1 we present a segment from one database that provides a small sampling of cases where interpretations can depend on specific applications and require careful mapping to avoid misinterpretation. This example illustrates issues that can arise from representation of the common concept of a "person," but equally troublesome issues may be found in most databases when the data need to be shared for use beyond the application for which the original database was created. Note that MIDB refers to the "Modernized Integrated Database" used by the U.S. Department of Defense (see ref. MIDB, 2011) (Table 1).
TABLE 1. EXAMPLE DATABASE TABLE
TABLE name: INDIVIDUAL_VITALS
TABLE description: Table recording basic information about an individual

Column name          Data type              Description
ENTITY_GUID          Unique identifier(16)  Lookup unique identifier for entity.
NAME_FIRST           Nvarchar(54)           A given name conferred upon an individual at birth or through a legal process. (MIDB)
NAME_LAST            Nvarchar(54)           The surname or given name conferred upon an individual at birth, through marriage (e.g., maiden name), or through a legal process. It can also be the family name. (MIDB)
ALIAS                Bit(1)(1,0)            Alternate name for entity.
BIRTH_PLACE          Nvarchar(100)          A textual description of the individual's birth location to include, if possible, the city, town, village, state, and province. (MIDB)
BIRTH_DATE           SASODATE(12)           The date of an individual's birth. (MIDB)
ETHNICITY            Nvarchar(50)           Entity's ethnic classification (lookup).
INDIVIDUAL_RELIGION  Char(6)
ALLEGIANCE           Varchar(3)             The Department of Defense (DoD) Standard Country Code designator for the country or political entity to which the entity owes its allegiance. (MIDB)
CITIZENSHIP1         Varchar(3)             Lookup to country_codes of an entity's primary citizenship.
Note: MIDB—Modernized Integrated Database used by the U.S. DoD.
Entity_Guid

A GUID ("Globally Unique Identifier") can be uniform throughout a database, but it may not be interpretable between databases. To recognize the identity of individual entities, some method will be required to "de-duplicate" individuals that are returned from a search. If a common unique identifier is not used in different databases, other attributes and a complex algorithm may be used to arrive at a probability that two individuals are identical. The translation utility can include such algorithms, unless simple presentation of results without concern for duplication is acceptable.

Name_First, Name_Last

These are usually unambiguous for Western European names, but they may be ambiguous for names in other cultures (e.g., Asian names, where the family name is actually the first name). The local databases must be consistent in assigning the family name to "NAME_LAST." In databases where other elements of a name are maintained (patronymic, mother's maiden name), it will be necessary to ensure that the "last" name is consistent with the "last" name of typical American names when mapping. For single-name individuals, consistent use of the "first name" is expected.

Alias

This field may record names other than the one given at birth. In this database, there is no way to distinguish the type of alias; the bit merely indicates that it is an alias. In another database, the type of alias (married name, assumed name, nickname, pen name, nom de guerre, etc.) may be indicated by a code. The codes need to be unambiguously interpretable and stored in the ontology mappings. The type of "alias" can be important. The significance of a maiden name may be useful in a social networking application but not interpreted in other applications. In an ontology, "maiden name" will be a first-class entity with its own significance regarding relationships. The significance of an alias such as "Abu Abbas" will also be indicated as a property of a name in an ontology (again, useful for social networking), but it will be lost in most databases that simply record a name as a string.

Birth_Place

In this database, birthplace is a complex string and not encoded in any standardized way. To align with other databases that record birthplace, a parse of that string will be required to extract country, town, and other significant location information that can help identify an individual. A "birthplace" of "Jersey City" must also be recognized as being consistent with a birthplace of "New Jersey" and a birthplace of "USA." This transitivity is needed to allow such inferences and recognize proper relations.

Birth_Date

Birth date is generally used consistently, though format conversion may be required. The ontology mappings will automatically convert to a consistent internal format. Where the birth date is approximate or not known, different databases may have different methods of coping, and knowledge of these different methods will be needed for proper interpretation and translation.

Ethnicity

Different databases may have different codes for ethnicity classifications, requiring a simple translation to a common representation, but it is possible for different ethnicity classifications to be used, in which case the ontology would need to represent both classifications, with relations between them, to properly interpret and translate queries.
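To make the preceding point concrete, the following minimal sketch (plain Java; the code values and concept identifiers are hypothetical placeholders) shows the kind of "simple translation to a common representation" that the ontology mappings would support for a coded field such as ETHNICITY: each local code is linked to a shared concept, and a value from one database is translated into the corresponding code of another database by going through that concept.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    // Sketch of code-to-common-concept translation for a coded column such as ETHNICITY.
    // The code values and concept identifiers below are hypothetical placeholders.
    public class CodeTranslation {

        // Local code -> common concept identifier, one map per database (derived from the ontology mappings).
        static final Map<String, String> DB1_TO_CONCEPT = new HashMap<>();
        static final Map<String, String> DB2_TO_CONCEPT = new HashMap<>();

        static {
            DB1_TO_CONCEPT.put("ETH-03", "ex:Ethnicity_Kurdish");
            DB2_TO_CONCEPT.put("KUR",    "ex:Ethnicity_Kurdish");
        }

        // Translate a DB1 code into the equivalent DB2 code via the shared concept.
        static Optional<String> db1ToDb2(String db1Code) {
            String concept = DB1_TO_CONCEPT.get(db1Code);
            if (concept == null) return Optional.empty();
            return DB2_TO_CONCEPT.entrySet().stream()
                    .filter(e -> e.getValue().equals(concept))
                    .map(Map.Entry::getKey)
                    .findFirst();
        }

        public static void main(String[] args) {
            System.out.println(db1ToDb2("ETH-03").orElse("no equivalent code"));  // prints KUR
        }
    }

Where the two databases use classifications that do not align one-to-one, the mapping would instead point to related concepts in the ontology, and the relations between those concepts would be reported with the answer.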
Individual_Religion

In this database, the religion attribute is a code. Such codes will vary among databases and need to be referenced to a common classification. As previously mentioned, religions can have subdivisions. The significance of a religion and its subdivisions in a database is likely to be buried in the procedural code. Over time, actual usage can become unknown to those who inherit the database from its original creators and application developers. For example, a database's application may implement a procedural rule limiting any attempt to use Shiite workers in a Sunni neighborhood, reflecting the fact that Sunni Muslims and Shiite Muslims in a particular time and place were sufficiently antagonistic that they refused to associate, or would even attempt to kill each other. In an ontology, a religion is a first-class entity about which many such facts may be asserted. Some of those facts may be used implicitly by one application, with no trace in the database.

Allegiance, Citizenship1

Both attributes refer to a country. If "allegiance" is not identical to "primary citizenship," its significance will be interpreted within the procedural code of the local application; yet it may be significant to other applications, if it can be interpreted properly. A difference or identity in these two attributes may be significant, but to be used properly, the definition of "allegiance" must be explicitly related in the ontology to an individual's citizenship and behavioral tendencies.

In all cases, a common ontology and mappings from that ontology to the databases will allow the data fields to be unambiguously represented, so that fields of identical meaning will be mapped to the same element in the ontology. Fields of related meanings will also be mapped, with the relations represented in a form that permits automatic retrieval of potentially significant data, even when data in the remote databases are not identical to those in the originating community's database.

Role of the Translation Utility

The translation utility will interpret data in one database and convert it to its equivalent representation in another database. The utility will use logical inferencing and procedural code; these mechanisms are too detailed to present fully in a short paper. For example, a "LengthInMeters" column in one database might be used with a data type of decimal, whereas a different database might represent "Length" as two columns—a decimal number and a unit of length measure. The one column in the first database and the two columns in the second database would both be mapped to an instance of "LengthMeasure" in the ontology, with unit conversions performed when required. Other types of data will not have equivalents that can be easily converted. In any given database, such conversions could be programmed as integrity constraints on input or stored procedures. However, such procedures are unique to specific databases and cannot be shared, because the interpretation
of the results depends on the ways in which the data are used in the application. A more robust solution is to include logical rules in a domain ontology; these are the equivalent of the procedural functions in computer programs that encode "business rules." One such logical rule might notify a particular office when a certain situation is recognized, even though the person holding that office may differ for each of the different communities sharing their data. Where logical rules alone are inadequate, an ontology can call on external procedural functions, as when performing complex mathematical calculations required for translation or inference. Finally, the translation utility itself may use stored procedures for translation.

In addition to recognizing related, but not identical, data elements in different data stores, an ontology will permit the generation of answers that require some form of logical inference (this could be performed automatically once the local data are mapped to an ontology). For example, a query may ask for a person's relatives, even though the local data store may have only parent-child relations, while a remote data store may have information on some of the same people but contain only sibling relations. The ontology, using information from both stores, will not only recognize both kinds of kinship, but, using domain rules, will be able to generate a list of increasingly remote relations, including uncles, aunts, and cousins. One such rule might be:

(if ?Person1 is the father of ?Person2, and ?Person3 is a brother of ?Person1, then ?Person3 is an uncle of ?Person2)

An important capability enabled by an ontological representation is the ability to create such rules and add them to the domain ontology without recompilation of the program. This provides a simple mechanism for creating "new information" from existing information. Such rules can be immediately translated into the ontology format and added to the domain ontology, in effect permitting changes to the translation program in a simple manner. Because domain experts can understand such rules, they could themselves update the query translation utility. Use of declarative formats to simplify data structures is common to database and ontology design. However, the ability of ontologies to rapidly ingest and use inference rules allows them to adapt quickly, and at less cost, to additional domain knowledge from the domain experts.

In summary, by mapping local databases to a common ontology, the definitions of the data can be interpreted by other applications that have data stores linked to the ontology. By linking simple concepts and strings in one data model to more complex concepts in the ontology, the utility of the data in the first database can be expanded beyond its original purpose and in ways that a different application can use to infer more information.

ONTOLOGY USE FOR DATA INTEROPERABILITY

A foundational ontology should be used to achieve data interoperability among multiple data stores. The foundation
ontology provides a set of concepts that can be combined to create a domain ontology, which represents meanings in each local data store. In this way, when one community accesses data in nonlocal data stores, a query interpretable in the community’s local data store (e.g., SQL created by a local user interface) is automatically converted to a federated query in conceptual form using the mappings to the foundation ontology. This federated conceptual query can then be automatically converted into local queries to all accessible data stores that may have relevant information. The answers follow a reverse pathway and are converted into the terminology of the originator’s local data store and presented to the user. A sketch of one possible query method (called the Query Translation Method) is presented in Figure 4 (note: data paths are numbered in sequential order; each numbered path moves over the network from a source to a destination). The explanations of the numbered data paths in the figure are: 1. A query entered in the standard form for some local data store (e.g., database 1 [DB1]) is forwarded to the local copy of the query interface and sent over the network to the query translation service. 2. (a1) The query translation service interprets the query in terms of the common data model, based on the ontology extensions and mappings that relate each local data store to the foundation ontology and its domain extension. The query is translated into the terminology of each local data store using the
mappings, forwarded to the local data stores over the network, and received by each local data store in its own terminology. (a2) An example of the way in which a database query in one database could be translated into a comparable query in multiple other databases is given in Appendix C. This example assumes that the local data stores do not need to provide any services other than simple access to their databases. If the local data stores provide an access service other than SQL query, a different translation mechanism might be needed. (b) The translated queries from the translation service are received by each local data store via a Web service. 3. Information retrieved from each local data store is returned to the translation service in the terminology of each local data store. 4. The translation service converts the retrieved data into the terminology of the querying user (e.g., DB1 terminology) and returns the answers in the user’s terminology. The local copy of the query interface presents data to the user and, in the case of semantic mismatches (related, but not identical, data meanings), adds qualifying comments to alert the user of the meanings of the retrieved data that may not conform to the format of the querying user’s local database or local terminology. In the query translation method, all data reside in the local data stores. The local data stores are accessed in a read-only manner from the translation service; updates to the data are performed locally. There is no replication of local data. Outside of the local data stores, key components of this translation process are:
Figure 4. The query translation method. (The figure diagrams a federated query using the foundation ontology: a local query interface and data access service sends a query [1] over the network to the query translation service, which uses the foundation ontology and the per-store extensions and mappings (DB1–DB7 and a data warehouse) to forward translated queries [2] to the Q-service at each local data store; answers return [3] to the translation service and [4] to the user.)
1. Extensions of the foundation ontology express concepts in each of the local data stores, not in the foundation ontology. These extensions together constitute a domain ontology that is a model of the domain represented by the full set of connected databases. This domain ontology is defined in terms of the foundation ontology. When it must perform complex inferences to answer queries, the linkage to the foundation ontology may be used.
2. Elements in the local data store are mapped to concepts in the domain extension ontology.
3. A local data access service is connected locally to the local query interface, which accepts the query from the local user, forwards it to the translation service, receives the answers from the translation service in the terminology of the local data store, and presents the retrieved information to the user. The data access service is responsible for presenting the retrieved data in a way that notifies the user when information is similar, but not identical, to the queried data.
4. Query services (Q-service) at each local data store accept queries from the network in a format agreed to by the local data stores and return answers to the translation service. The query format may be standard SQL, but the query service must provide a wrapper that can accept the query over the network.

Creating and Using Mappings from Databases to the Ontology

Although the mappings to the common foundation ontology must be created separately for each local data store, the translation service utility that uses the mappings to interpret local data in terms of the foundation ontology and convert it to terms conforming to a different local data store should only be developed once, by a team of ontologists and programmers. Adequate foundation ontologies are available to support the translation service and do not need to be developed. The effort of an ontologist will be required only to develop the specialized domain ontologies and mappings for each set of databases to be integrated. The mappings themselves will all be extensions of the foundation ontology, and the mappings will use relations that can be interpreted by the conversion utility, based on the foundation ontology. Periodic improvements and supplements to the foundation ontology may be required as additional specialized data stores are integrated, but local data store developers need not learn the additional internal translation mechanisms or assist in their development. The conversion to and from the conceptual form supported by the foundation ontology and its extension can be accomplished dynamically at query time. By using separate mappings from each local data store to the foundation ontology, a translation utility can perform the conversion from local form to common conceptual form and the reverse. The mappings are created for each local data store by a team that includes an ontologist familiar with the foundation ontology and a domain expert who understands the database terms to be mapped. The mappings take the form of an extension to
the foundation ontology specifying the particular elements in the foundation ontology (or its domain extension) that correspond to tables or columns in each local database. Mapping updates will be required if the local data model changes; this should be easier than development of the initial mapping. Only a single mapping of each local data store to the foundation ontology is required in order to create mappings between all of the local data stores. This takes a problem that is potentially n-squared in complexity (n being the number of data stores to integrate) and reduces it to "order of n." When sophisticated interfaces can be developed to permit domain experts to find the ontology elements that correspond to the data elements they use locally, then assistance from an ontologist may no longer be needed. There are different options for a system architecture that will implement this mechanism for data integration. These include having the translation from and to the local terminologies done centrally and the federated queries sent to each local store using the local terminology (refer to Fig. 4), or having the queries transmitted in a common terminology and translated to the local terminology at the local data store site by the local query service. In either case, Web services must be implemented in each local data node to accept queries in the required format. Alternative architectures were discussed in the section "The Ontology Solution."

Prototyping and Implementing Ontology-Based Integration

Development of a working prototype using the query translation method will require an extension to this study. This phase of the project has developed, in addition to a preliminary version of the Sociocultural Ontology, only a simple Java program to illustrate one type of query translation. The cost of adaptation of commercial methods for data integration via ontologies needs to be weighed against the cost of developing an open-source method suitable for widespread government and commercial use. For DoD-wide or federal-wide data integration, the cost of developing an open-source translation utility will be a small fraction of the total cost for developing new databases or mappings to legacy databases. An open-source utility addressing the issue of federated query via a common foundation ontology will permit refinements from a large community of interested DoD and non-DoD developers. For now, it is anticipated that any query mechanism currently used by each local data store can be intercepted, translated, and sent to other accessible data stores with relevant information. In addition to assisting in the development of the mappings by providing clear descriptions of their data elements, the local data stores will need to provide access by implementing a service that can accept database queries in SQL or another agreed common format. That service can be developed as part of the project developing the translation utility. It will be the same for all local data stores, unless some local system requires modification of the standard service. Thus, we do not anticipate the need to develop new user interfaces but rely on intercepting a
query at a local data store in a format such as SQL, sending it over the networks connecting the local data stores, and returning answers from remote data stores in the same format. In the long term, development of a natural language interface for federated query will make wide access to data easier. Development of the translation utility should be a project within the federal government, and it should be integrated with support for a single foundation ontology. The foundation ontology should be chosen from among existing ontologies, and supplemented when required. Because the translation utility is so dependent on the format and basic structure of the foundation ontology, it is not practical for an open-source generic utility to be developed by an academic or industrial group. Integrating multiple data stores will require each local team involved in data management to clearly define each of the data elements (tables, columns, restrictions) in their system; read the documentation provided by the ontology team describing the meanings of the logical representations they create; and confirm or correct the interpretations of the ontology team. With the mappings in hand, the ontology team and developers of the translation interface can develop the programs that will provide the desired translation capability. Further validation testing will come when data transfers are executed in a realistic user setting. If a local data store does not have an implemented service allowing access to the data store via the network, one must be added to the local system. It may be necessary to develop local access programs tuned to specific data stores if they do not permit access via the same method (e.g., passing an SQL query over the network). It also may be desirable to implement new display methods to return the query results from remote data stores that indicate the source of the answers. For example, when remote information may be close, but not identical, in meaning to the requested information, it may be signaled by a standardized prefix in the answers (e.g., an "[S]" to indicate that the remote store only contains supertypes of the type requested, and the answer following the [S] may or may not relate to the more specific type requested).

The Problem of De-Duplication

Specific individuals or objects in different databases may not be easily recognized as identical. A unique identifier in one database may be unrelated to a unique identifier used in another database. The methods for determining that two individuals or objects in different databases are the same ("de-duplication") can be complicated. In favorable cases, there may be unique identifiers that are common (e.g., Vehicle Identification Number, Social Security Number [SSN]). In general, methods may be included in the translation service to determine if two individuals returned as the answer to the same query on different databases are identical. Such methods may use multiple attributes (for people: name, address, date of birth, place of birth, SSN, passport number, telephone number, e-mail address, known relatives) to arrive at a probabilistic estimate of common identity. For an initial imple-
mentation of query translation, such de-duplication methods may be omitted, but the system will be most useful if information that can indicate identity is used and reported to the user. Data Conversion Effort for Sociocultural Data A prototype was begun, based upon the process described herein, as part of an effort to integrate data stores containing sociocultural information. Although this example derived from a project involving sociocultural information, the principle of a common foundation ontology applies to integration of information among any community that wishes to share their data and have them accurately interpreted and properly used by other computer systems. In our effort, we used the COSMO ontology (Cassidy, 2008). As mentioned earlier, other foundational ontologies could have been chosen, and in past applications, we have indeed chosen or recommended others (Semy et al., 2005). COSMO was used in this project because COSMO was initiated and has continued to be developed with the purpose of including representations of any of the basic (primitive) concepts that exist in any other foundation ontology; thus COSMO, when fully developed, will have all of the primitive ontology elements needed to build representations of more complex ontology elements in any linked ontology, regardless of the domain of interest. The sociocultural data integration prototype included the following features: 1. The foundation ontology providing the fundamental concept representations that will be used to define domain-specific concepts was the “Common Semantic Model” (COSMO), an ontology developed as a merger of several publicly available foundation ontologies and extended to include required basic concepts not already in those ontologies. 2. Development of the Sociocultural Ontology (SCO), an extension of the COSMO, was begun to represent the specialized concepts found in data stores describing sociocultural information. The priorities for adding new concepts to the SCO were derived from lists of important sociocultural terms supplied by domain experts, who provided cases indicating the kind of sociocultural information needed for actions undertaken by U.S. military forces. 3. A mapping from the SCO to the MAP-HT (Mapping the Human Terrain, 1997) data model can be created when a required minimum number of sociocultural concepts have been added to the SCO. 4. A translation utility can be developed to convert information from the format and terminology of one mapped database into the format and terminology of another. The structure and language of this utility are not yet determined. Some lessons from the early phases of this project are discussed next. The sociocultural information integration effort is still at an early stage and can easily adapt to the preferences or requirements of other data integration projects. Projects that are already
mature in their data models, such as the NIEM, can be treated as another form of data store to which mappings can be developed to the foundation ontology. Projects that are still in the formative stages could be fully integrated by proactive coordination with this sociocultural data integration project. Such coordination of data model development with an ontology-based SCO effort will allow all parties to find common representations. Coordination at the data model development stage will simplify the mapping effort to the foundation ontology, while still allowing local data store developers to choose any terminology or structure that they find most convenient. Residual local structures still differing from the foundation or other local representations can be mapped by the process discussed here, but the mapping effort will be reduced if coordination at the formative stages is feasible. Conversely, without such coordination, local choices for data representations might differ merely due to the absence of an accepted standard, thereby increasing the mapping effort.

Issues in Creating Mappings of Databases to Ontologies

This section discusses some issues in creating mappings from databases to ontologies.

Collaborate with the Database Developers

The first issue in creating mappings is finding a way to collaborate with the database developers to determine the meanings of the database elements to be mapped. This requires a good data dictionary explaining the meanings of the database elements. For some databases, a data dictionary is critical because the names of the data elements are not self-explanatory. For example, no data dictionary is available for one database, and its elements look like the ones below. In table IND (individual), these attributes (columns) exist:
  CIH_Charactertraits
  CIH_Traits
  CIH_Characteristics
  . . .
and in table IND_SOURCE_INFO:
  CIH_CharTraits.
Any of these elements might have a meaning identical or related to the others, but without detailed explanations, a mapping cannot be created to any other database, even if the names of the columns are identical. We cannot be certain of the intended interpretation or usage of these elements without consulting the developers or users of that database. Other examples abound. In the same sample database, there is a table IND_SOURCE_INFO with the column
  CIH_MembershipInOrgs
. . . and a table IND_SL with the column
  CIH_Memberships
. . . and a table ORG (organization, apparently)
with the columns CIH_Membership and CIH_Personnel. If the CIH_Membership relation points to a list of members of an organization, then it could be related as the inverse of CIH_Memberships or CIH_MembershipInOrgs, but how do the latter two relations differ, if at all?

Two Elements with Different Names May Have Identical Meanings

This is the simplest mismatch to map. Consider three columns from two databases: CIH_dob, BIRTH_DATE, and GMI_BIRTH_DATE. All three columns appear to represent the date of birth for an individual, and the mapping appears obvious. However, there may be different formats for the date, so a translation of format may be required. If no format conversion is required, this translation could be done with a lookup table. If conversion is required, a simple data model with the translation provided by an XSLT ("Extensible Stylesheet Language Transformations") format conversion method may suffice. However, other conversions will not be so simple, and adopting the more general solution of mapping to an ontology will serve the complicated as well as the simple cases. Even for this simple case, an ontology can be helpful. In one database, there may be an integrity constraint that individuals cannot be paid for work performed before they were born, but across different databases, that kind of integrity constraint cannot be enforced. When multiple databases are mapped to an ontology, the logical rules of the ontology could detect anomalies that might not be detectable in one database. In two different databases, a person with a certain SSN might have two different birth dates. Knowing that a person can have only one birth date allows the discrepancy to be detected by logical inference within the ontology. Such consistency checking can be performed automatically within an ontology when data conversion is being performed. Within a single database, it can only be performed at data-entry time.

Data with Similar Meaning Can Need Conversion

A case where conversion of data elements is required occurs when measurements are expressed in different units in different databases. An XSLT procedure could perform conversions, even if the mappings of databases are created among data models rather than ontologies.

No Off-the-Shelf Ontology Extension for Specific Use

Each local data store that is to be integrated into a coordinated query system will have many concepts, data structures,
and terms specific to that local system that cannot be expected to appear in any preexisting ontology. In favorable cases, some similar knowledge may already have been represented for a previous data store, and only minor additions or changes to an existing ontology mapping may be required to represent similar data in a new agency or enterprise. That state of development of ontology mapping is unlikely to occur until years of widespread use of a common system have passed. For now, neither vendors nor in-house developers can develop a proper mapping of a local data store to a common foundation ontology without an effort specifically directed at representing the local data structures used in the enterprise. The size of the effort will depend on the size of the local data model (the number of distinct types of data elements). Input from the developers and managers of the local data store will be essential in creating an accurate mapping to the common foundation ontology by clarifying ambiguities in the local data model.
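As a rough illustration of the product of such an effort, the following sketch (plain Java) records, for each local column, the ontology element it corresponds to and a conversion note. The ontology element names and conversion notes are hypothetical placeholders; in the approach described here, a real mapping would be expressed as an extension of the foundation ontology rather than as program data, but the content being captured is of this kind.

    import java.util.List;

    // Sketch of per-column mapping entries from local databases to the foundation ontology.
    // The ontology element names (ex:...) and conversion notes are hypothetical placeholders.
    public class MappingEntries {

        // A table.column in a local store, the ontology element it maps to, and an optional conversion note.
        record ColumnMapping(String table, String column, String ontologyElement, String conversionNote) {}

        public static void main(String[] args) {
            List<ColumnMapping> mappings = List.of(
                new ColumnMapping("INDIVIDUAL_VITALS", "BIRTH_DATE", "ex:hasBirthDate", "normalize date format"),
                new ColumnMapping("INDIVIDUAL_VITALS", "CITIZENSHIP1", "ex:hasPrimaryCitizenship", "lookup via country code table"),
                new ColumnMapping("IND", "CIH_dob", "ex:hasBirthDate", "normalize date format")
            );
            mappings.forEach(System.out::println);  // two columns mapped to the same element are thereby recognized as equivalent
        }
    }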
How the Foundation Ontology Bridges Multiple Domains

A foundation ontology serves as the common conceptual and logical language through which different human languages or technical terminologies can be translated into each other. The present study was focused on converting terms used in different databases, but the same principle applies to information in other forms, such as free text. Widely varying use of terms by different communities and in different contexts makes ordinary human languages unsuitable for representation in a form that computers can properly interpret. The foundation ontology overcomes such incompatibility by providing a relatively small common "defining vocabulary" of basic concepts that can be used in combinations to describe any of the more specialized concepts of interest to different communities. Use of the common foundation ontology allows precise specification of the similarities and differences in the use of terms, regardless of the context or community in which they are used, and thereby facilitates accurate exchange of information. By converting local knowledge in local terminology into its representation using the foundation ontology, translations into multiple other terminologies can be provided. Use of a common foundation ontology as the means of translation allows the local user control over community terminology without sacrificing the ability to accurately transfer information among different communities. The development of "common core" vocabularies that are not structured as ontologies suitable for automated reasoning does not provide the functionality required for this task.

The principle of using a relatively small controlled vocabulary to provide easily understood definitions of a much larger set of terms has been established practice in the dictionary publishing industry. The Longman Dictionary of Contemporary English (1987) uses a controlled vocabulary of 2100 words, with which it defines all of the 65,000 words in the dictionary. A study of the Longman vocabulary showed that the minimum required vocabulary is actually only 1200 words (Wilks et al., 1996; Guo, 1989). The notion of a basic set of primitive concepts as the basis for human language understanding has also gained experimental support from recent studies of brain activity using functional magnetic resonance imaging (Mitchell et al., 2008). The COSMO ontology, chosen as the foundation ontology for this study, is structured specifically to serve as a conceptual analogue of the linguistic defining vocabulary, and it is being developed to include concept representations corresponding to the full set of 2100 Longman defining vocabulary words. This ontology should allow creation of normalized, precise logical specifications for any information in databases or other information sources. The benefit of using a relatively small set of ontology elements as the basis for defining terms or concepts from many fields is that the restricted size makes it easier for developers of interfaces to understand and use the common foundation ontology. The more widely it is used, the more effective such a common standard of meaning will be.

AUTHORS' NOTE
The views expressed in this paper are those of the authors alone and do not reflect the official policy or position of the MITRE Corporation or any other company or individual.

APPENDIX A. ONTOLOGIES AND DATABASES: SIMILARITIES AND DIFFERENCES

This appendix describes ontologies and databases and their similarities and differences by focusing on their respective goals and design processes.

Introduction

Ontologies are about vocabularies and their meanings, with an explicit, expressive, and well-defined semantics that is possibly machine-interpretable. Ontologies limit the possible formal models of interpretation (semantics) of those vocabularies to the set of meanings a modeler intends (i.e., close to the human conceptualization). None of the other "vocabularies," such as database schemas or object models with less expressive semantics, does that. The approaches with less-expressive semantics assume that humans will look at the "vocabularies" and supply the semantics via the human semantic interpreter (your mental model). Additionally, a developer will code programs to enforce the local semantics that the database/database management system (DBMS) cannot.
1. The developers may get it right.
2. Other people will have to read that code, interpret it, and see whether it will do what it should do.
3. The higher you go in terms of data warehouses, marts, etc., the more semantic error creeps in.
Ontologies model generic, real-world concepts and their meanings, unlike either database schemas or object models, which are typically specific to a particular set of applications and represent limited semantics. A given ontology cannot model any given domain completely. However, in capturing real-world and imaginary (e.g., a theory of unicorns and other fantastic beasts) semantics, you can reuse, extend, refine, and generalize that semantic model.
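The following fragment (plain Java, hypothetical rule and code values) illustrates the point about developer-coded semantics: a "business rule" of the kind discussed in the main text ends up as procedural code whose meaning is invisible to any other application reading the same tables and can only be recovered by a human reading, and correctly interpreting, the program.

    // Sketch of local semantics enforced in application code rather than in the schema.
    // The religion/neighborhood codes and the rule itself are hypothetical; the column
    // INDIVIDUAL_RELIGION in the example table stores only an opaque Char(6) code.
    public class LocalRuleCheck {

        // The scheduling restriction lives only in this method, not in the schema or data dictionary.
        static boolean canAssignWorker(String workerReligionCode, String neighborhoodCode) {
            if ("SUNNI ".equals(neighborhoodCode) && "SHIITE".equals(workerReligionCode)) {
                return false;  // restriction never recorded anywhere another application can see it
            }
            return true;
        }

        public static void main(String[] args) {
            System.out.println(canAssignWorker("SHIITE", "SUNNI "));  // prints false
        }
    }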
It is suggested to reuse ontologies; database schemas cannot generally be reused. A database conceptual schema might be used as the basis of an ontology; however, that would be a leap from an entity-relation model to a conceptual model (i.e., a weak ontology) to a logical theory (strong ontology). Similarly, one could begin with a taxonomy or a thesaurus and migrate it to an ontology. Logical and physical schemas are useless for this purpose, since they incorporate non-real-world knowledge (and in non-machine-interpretable form). By the time the physical schema is achieved, there are only relations and key information; the little semantics available was thrown away at the conceptual schema level.

The methodologies for ontologies and databases are similar. The database designer or knowledge/ontology engineer must consider an information space that captures certain kinds of knowledge. However, a database designer does not care about the real world, but about constructing a specific local container/structure of data that will hold the user's data in an access-efficient way. A good database designer works with users and generates use cases and scenarios based on the expected user interaction. Similarly, ontologists work with domain experts and/or subject matter experts (SMEs) and get a sense of the semantics that they require. A good ontologist will analyze the data available (bottom up) and analyze what the domain expert says (top down). In many cases (e.g., intelligence analysis), the ontologist will not only ask the SME about the kinds of questions that are asked for the tasks, but also the kinds of questions that should be asked (which are impossible to get answered currently by using mainstream database and system technology).

The Database Design Process: Three Stages

There are three stages to the database design process:
1. In interaction with prospective users and stakeholders of the proposed database, the database designer will create a conceptual schema using a modeling language and tools based on entity-relation (ER) models, extended ER models, or object-oriented models using UML.
2. Once this conceptual schema is captured, the designer will refine it to become a logical schema, sometimes called a logical data model; it will still be in an ER language or UML. The logical schema typically results from refining the conceptual schema using normalization and other techniques to move closer to the physical model that will be implemented to create the actual database. This is done by normalizing the relations (and attributes, if the conceptual schema contains these) using the same ER and UML languages.
3. Finally, the designer will refine the logical schema to become the physical schema, where the tables, columns, keys, etc., are defined, and the physical layout is optimized in terms of index elements and the sectors in the database in which the various data elements are placed.

A data dictionary may be created for the database. This expresses the meaning of various database elements in natural language documentation. The data dictionary is only semantically interpretable by individuals, since it is written in natural language. The most expressive real-world semantics of the database creation process exist in the conceptual schema and data dictionary. The conceptual schema may be part of the documentation for the process of developing the database (i.e., an artifact of that process). The data dictionary will be kept as documentation.
Unfortunately, the underlying physical database and its schema may change dramatically without the original conceptual schema and data dictionary being comparably changed. This is also the case with UML models used to create object-oriented systems and sometimes to define enterprise architectures.
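A small illustration of how little survives at the physical level: using standard JDBC metadata calls, all that can be recovered from a live database is structure—table names, column names, data types, and keys—with none of the intent captured in the conceptual schema or data dictionary. The connection URL below is a placeholder and the sketch assumes an appropriate JDBC driver is on the classpath.

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.DriverManager;
    import java.sql.ResultSet;

    // Sketch: list the physical structure of a table via JDBC metadata.
    // Only names and SQL types come back; the real-world semantics are gone.
    public class PhysicalSchemaDump {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:example")) {
                DatabaseMetaData meta = conn.getMetaData();
                try (ResultSet cols = meta.getColumns(null, null, "INDIVIDUAL_VITALS", null)) {
                    while (cols.next()) {
                        // Column name and SQL type are all the "semantics" the physical schema exposes.
                        System.out.println(cols.getString("COLUMN_NAME") + " : " + cols.getString("TYPE_NAME"));
                    }
                }
            }
        }
    }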
Databases and Ontologies: Integrity Enforcement

Databases typically try to enforce three kinds of integrity:
1. Domain integrity (not the same notion of "domain" used in logic/ontologies): Domains are usually data-type domains (i.e., integers, strings, real numbers, or column-data domains). Typically, symbolic objects are not entered into a database, just strings; so on data entry or update, some program (or the DBMS) will make sure that if a column is defined to contain only integer data, the user can enter only integer data.
2. Referential integrity: This refers to key relationships, primary and foreign. This kind of integrity is structural, making sure that if a key is updated, other keys dependent on it are updated appropriately: add, delete, update (usually considered an initial delete, followed by an add).
3. Semantic integrity: This represents real-world constraints, sometimes called "business rules," that are held over the data. Databases and DBMSs cannot usually do this (even with active and passive triggers), so auxiliary programming code usually enforces this (e.g., "no other employee can make more than the CEO" or other cross-dependencies).

You cannot check database consistency the way you can check an ontology, using a logical knowledge representation language. For databases, you can only enforce the three types of integrity. For an ontology, you can check consistency in two ways: (1) syntactically (proof theory) and (2) semantically (model theory). This consistency checking can be done at two levels:
1. Prove that your KR (knowledge representation) language is sound and complete at the meta-level.
• Sound ("Phi ⊢ A" implies "Phi ⊨ A"): The proof system will not prove anything that is not valid.
• Complete ("Phi ⊨ A" implies "Phi ⊢ A"): The proof system is strong enough to prove everything that is valid.
"Phi ⊢ A" means that A follows from, or is a consequence of, Phi. "Phi ⊨ A" means that A is a semantic consequence or entailment of Phi in some model (or valuation system) M with truth values, etc., i.e., the argument is valid. Both ⊢ and ⊨ are called turnstiles, syntactic and semantic, respectively.
2. Check the consistency of a theory (ontology) at the object level.
• This is like negation consistency: There is no A such that both "Phi ⊢ A" and "Phi ⊢ ~A" (i.e., a contradiction).

The Ontology Design Process

In creating common knowledge (as opposed to deep domain knowledge), intuition and understanding of the world can be used to develop the ontology. A good background in formal ontology or formal semantics helps, because then you have already learned:
1. a rigorous, systematic methodology;
2. formal machinery for expressing fine details of world semantics;
3. an appreciation of many alternative analyses, pitfalls, errors, etc.; and
4. complex knowledge about things in the world and insight into your pretheoretical knowledge.
In linguistics, it is said that although everyone knows how to use a natural language like English, very few know how to characterize that knowledge or know anything about prospective theories of that knowledge. Native
speakers do not have good subjective insight into how they do things; they just do them. Ontology design and development uses typical software development lifecycle notions, but enriches them with the following processes: 1. Incremental, staged ontology development and deployment (typically breadth-first) with feedback from user/developer community at each stage, as time warrants. a. Define versioning, change management. b. General process: Ontology Lifecycle Management; 2. Form team (three roles: ontologists, domain expert liaison, domain expert/SME partners) and develop project plan; 3. Identify stakeholders (end users, developers); 4. Identify existing data stores and systems and analyze them; 5. Investigate existing model resources for potential harvesting; 6. Develop competency questions, use cases, scenarios, requirements; 7. Develop architecture and select tools, design; and 8. Iterate: build, test and review (per stage), refine, and deploy. Initially, the ontology development process seeks requirements primarily by developing competency questions: What questions (queries) do you want answered by the newly developed ontologies? These questions will help drive out specific use cases. Those use cases will be refined to create specific scenarios and requirements. Competency questions will be used to determine if the ontology development effort is complete (i.e., if reasonable responses to the queries are given, with sufficient detail, as judged by knowledgeable domain experts, then you are done). Concurrently, simultaneous bottom-up and top-down analysis is performed. Nearly every ontology development effort requires accessing and using data stored in databases or data obtained from existing software applications. Bottom-up analysis analyzes existing data stores and legacy systems that you want to cover to capture their semantics. This may require the creation of conceptual models (local ontologies), if those stores and systems do not have existing models or those models are insufficient. This process must address syntactic (relational vs. XML, other file or message syntax), structural (database [DB] schema, XML schema, message schema), and semantic heterogeneity and interoperability, including potentially different business rules, conversions, etc. Top-down analysis includes consideration of queries that cannot be now asked, even using existing data sources and legacy systems. Examples include: link or social network analysis, temporal (time-based) queries or time series, part/whole decomposition, spatial-temporal flow (supply chain) of distributed material, equipment (biological, chemical, nuclear weapons), suppliers, etc. An extended example may be a transportation ontology for a government agency concerned with logistics, which would include: 1. Determine coverage needed (competency questions, use cases, requirements) and project plan. The coverage may include transportation modes; vehicles, conveyers, and subclasses, operational characteristics; organizations and people; facilities and locations; cargo; services and related entities; travel and transportation routes; related and affecting notions; designations and identifiers; events associated with transportation and travel; others as deemed appropriate. The coverage will be decided by referring to the competency questions that have been developed, any use cases and scenarios used to obtain the competency questions, and any requirements distilled from all of these sources. 2. 
Determine existing models, resources for harvesting, and prospective integration with existing upper and middle ontologies.
3. Analyze resources and existing data stores with respect to coverage requirements. 4. Design incremental breadth-first stages: • Develop core subclass hierarchies with basic relations. • Expand the representation of facilities, routes, and cargo. • Address in-depth organizational, governance, and control concepts for transportation, including travel planning and itineraries, effects of regulatory policy on transportation activities, and some physical factors such as transportation impediments (weather and terrain). 5. Implement each stage with review and feedback, gauge competency question results. When the competency questions can be answered reasonably, with the appropriate correctness, detail, and level of granularity as adjudicated by domain experts who understand the data and applications, then the ontology development process is completed. Ontologies vs. Databases Often with nonontological approaches to capturing the semantics of data, systems, and services, the modeling process stops at a syntactic and structural model, and even throws the impoverished semantic model away to act as historical artifact. It is completely separated from the evolution of the live database, system, or service, and is still only semantically interpretable by an individual who can read the documents, interpret the graphics, supply the real-world knowledge of the domain, and who understands how the database, system, or service will be implemented and used. Ontologists want to shift some of that “semantic interpretative burden” to machines and have them mimic human semantics (i.e., understand what we mean). The result would bring the machine up to the human, not force the human to the machine level. By “machine semantic interpretation” we mean structuring and constraining in logical, axiomatic language the symbols humans supply; the machine will conclude, via an automated inference process, what an individual would in comparable circumstances. The knowledge representation language that enables this automated inference must be a language that makes fine modeling distinctions and has formal or axiomatic semantics for those distinctions, so no direct human involvement will be necessary—the meaning of “automated inference.” The primary purpose of the database is for storage and ease of access to data, not complex use. Software applications (with the data semantics embedded in non-reusable code via programmers) and individuals must focus on data use, manipulation, and transformation, all of which require a high degree of interpretation of the data. Extending the capabilities of a database often requires significant reprogramming and restructuring of the database schema. Extending the capabilities of an ontology can be done by adding to its set of constituent relationships. In theory, this may include relationships for semantic mapping, whereas semantic mapping between multiple databases will require external applications. APPENDIX B. SCENARIOS FOR ONTOLOGY USE IN FEDERATED QUERY This appendix illustrates examples where an ontology’s more detailed semantic relations can provide the links between databases, allowing retrieval of relevant data that are not represented directly in less-expressive data structures. Part Relations A database may contain information specifying which entities are parts of others, but the meanings of the “part” relations may take
semantically or lexically distinct forms in different databases. There is an abstract, generic notion of "part" discussed in the philosophical literature, as explored by Casati and Varzi (1999); but in practical applications more specific notions of "part" may be used, such as:
1. a finger is part of a hand,
2. a person is part of an organization (group membership),
3. Sunnis are a part of the Muslim community (subgroups),
4. a division is part of a company (business),
5. a company is part of a division (military),
6. oxygen is part of water (chemical composition),
7. salt is a part of seawater (substance mixtures),
8. a word is part of a sentence (syntactic rule),
9. an angle is part of a polygon (definition),
10. an engine is part of a car (necessary part),
11. an engine is a component of a car (a component is a separately manufactured part),
12. a module is part of a spacecraft (major segment),
13. a subroutine is part of a program,
14. taking a step is part of walking (process part),
15. buying a ticket is part of taking a train (standardized script part), and
16. the Normandy invasion was part of the Second World War (event part).
The meanings of many of these are sufficiently different from the others that the use of a generic "part" relation to mean all these things would cause serious error if the assertions using those relations were used in a system that performed automated inference. Both an ontology and a traditional data model will allow specifying specialized "part" relations for semantically distinct types. In either case, one may assert, for example, that the Normandy invasion wasaSubeventOf the Second World War, where the relation "wasaSubeventOf" is used only to relate events to other events, but the ontology will also allow us to specify that "wasaSubeventOf" is a subrelation of the more generic "isaPartOf." Therefore, if ?X wasaSubeventOf ?Y, then it is also true that ?X isaPartOf ?Y. This permits cases where one database may use the generic "isaPartOf" relation on heterogeneous objects, and another may use a more specialized "part" relation. Answering queries posed with the generic relation from data recorded with the more specialized relation would be very difficult without the subrelation inferences available in an ontology. For a specific example, a relation "isaComponentOf" may be defined in one database, with the intended interpretation being that ?X isaComponentOf ?Y means that ?X is a separately manufactured unit that is assembled with other units to produce a ?Y. In the ontology, "isaComponentOf" would be a subrelation of "isaPhysicalPartOf." A second database may not make any distinction between components and other kinds of "parts" (e.g., an arm "isaPartOf" a body). A query for physical parts from a user of the second, less specific database could still return "components" from the first database without misinterpreting the more specific intended meaning of "component" in the first database. Thus, in one database, an engine may be part of a car, and, in another, an engine may be a component of a car. Both of these can be properly interpreted by an ontology without confusing the different intended meanings of "part" and "component." This simple inference is useful in cases where one database might have more than one of these meanings of "part" in a single table, whereas another database uses only the more specific "component" relation. A query from users of the first database for "parts" could automatically return "components." Inferences on a hierarchy of relations (attributes) are built into many ontology systems (including the Web Ontology Language [OWL]), but they would need to be specially programmed with an XML schema definition (XSD) data structure. In the ontology, the proper inferences are performed automatically.
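The subrelation inference described above can be sketched in a few lines of plain Java. The relation hierarchy, entity names, and assertions below are hypothetical, and a real system would delegate this reasoning to an OWL or rule reasoner rather than hand-written code; the sketch only shows why an assertion made with the specific relation isaComponentOf is returned by a query phrased with the generic isaPartOf.

    import java.util.List;
    import java.util.Map;

    // Sketch of answering a generic "part" query over assertions made with more specific relations.
    // Relation hierarchy and assertions are hypothetical placeholders.
    public class SubrelationQuery {

        record Assertion(String subject, String relation, String object) {}

        // Direct super-relation for each relation in the hierarchy.
        static final Map<String, String> SUPER_RELATION = Map.of(
            "isaComponentOf", "isaPhysicalPartOf",
            "isaPhysicalPartOf", "isaPartOf",
            "wasaSubeventOf", "isaPartOf"
        );

        // True if 'relation' is the queried relation or one of its (transitive) specializations.
        static boolean satisfies(String relation, String queried) {
            for (String r = relation; r != null; r = SUPER_RELATION.get(r)) {
                if (r.equals(queried)) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            List<Assertion> kb = List.of(
                new Assertion("Engine_17", "isaComponentOf", "Car_3"),
                new Assertion("LeftArm_5", "isaPartOf", "Body_9")
            );
            // A query for generic "parts" also returns the assertion recorded with the component relation.
            kb.stream()
              .filter(a -> satisfies(a.relation(), "isaPartOf"))
              .forEach(a -> System.out.println(a.subject() + " isaPartOf " + a.object()));
        }
    }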
Unit Conversion

In two different databases, an attribute "height" may be represented as a decimal number. In database 1, the units are assumed to be meters. In database 2, the units are assumed to be feet. Although the intended meanings of the two attributes are similar, one cannot substitute one for the other. A mapped ontology can recognize the relations between the two attributes, and a translation program accessing the ontology mappings can convert the height values in one database to equivalent values in the units of the other database. An outline of an example of translation is shown in Appendix C.

Specialization-Generalization Mismatches

In database 1, there may be a table representing "Armored_Vehicles" with no subtypes of that type explicitly represented. In database 2, there may be tables for "Tanks" and "Armored_Personnel_Carriers." A query in database 1 terminology for Armored_Vehicles must be able to retrieve Tanks and Armored Personnel Carriers from the second database. However, these mappings are not 1:1. A query for "Tanks" in the second database could retrieve all "Armored_Vehicles" from the first database. Assuming that there are no attributes that can distinguish among armored vehicles, the returned query would need to indicate in some standardized manner that the returned value (Armored_Vehicle) is an instance of a supertype of the requested type and may not actually be of the type (Tank) requested. A prefix or suffix on the returned value would be a simple method to signal such a potential mismatch, if no other mechanism were provided by the user interface.

Transitive Relations

There are relations that are "transitive" in this sense: if R is a transitive relation, then ?X R ?Y and ?Y R ?Z imply ?X R ?Z. For example, the "sibling" relation is transitive: if A is a sibling of B and B is a sibling of C, then A is a sibling of C. Such inference rules are automatically executed in an ontology, allowing the generation of new knowledge from existing knowledge. When transitive relations are used in more than one database, new knowledge can be created from the combination of the databases that is not available from any one database alone. Some "part" relations are transitive: X is a physical part of Y; X is a subregion of Y; X is a textual part of Y (where X and Y are texts). However, some "part" relations are not transitive: J is a member of OrganizationX; a crowbar (an object) is composed of Steel (a substance). Transitive relations must be individually marked as such in the ontology.
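As a rough illustration of what marking a relation as transitive licenses, the following sketch derives the implied assertions for a transitive "part" relation whose asserted facts come from two different databases. The class, relation, and instance names are ours and purely hypothetical; an ontology system performs this inference itself.

import java.util.*;

/** Minimal sketch: deriving inferred facts for a relation marked as transitive. */
public class TransitiveClosure {
    public static void main(String[] args) {
        // Asserted facts for the (transitive) relation isaPhysicalPartOf,
        // drawn here from two different databases.
        List<String[]> facts = new ArrayList<>(List.of(
            new String[]{"Piston", "Engine"},   // from database 1
            new String[]{"Engine", "Car"}));    // from database 2

        // Repeatedly apply: (?X R ?Y) and (?Y R ?Z) imply (?X R ?Z).
        boolean added = true;
        while (added) {
            added = false;
            List<String[]> inferred = new ArrayList<>();
            for (String[] a : facts)
                for (String[] b : facts)
                    if (a[1].equals(b[0]) && !contains(facts, a[0], b[1]) && !contains(inferred, a[0], b[1]))
                        inferred.add(new String[]{a[0], b[1]});
            if (!inferred.isEmpty()) { facts.addAll(inferred); added = true; }
        }
        for (String[] f : facts)
            System.out.println(f[0] + " isaPhysicalPartOf " + f[1]);
        // Prints the new fact "Piston isaPhysicalPartOf Car", which neither database asserts on its own.
    }

    private static boolean contains(List<String[]> facts, String x, String y) {
        for (String[] f : facts) if (f[0].equals(x) && f[1].equals(y)) return true;
        return false;
    }
}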
APPENDIX C. AN EXAMPLE OF QUERY TRANSLATION

This appendix describes a method by which a query against a database mapped to an ontology could be translated into a semantically equivalent query to other ontology-mapped databases.

Introduction

This appendix presents a simplified hypothetical example of the logical mechanism by which a Structured Query Language (SQL) query on one database mapped to the foundation ontology would be translated into an equivalent query on other mapped databases. The actual translator program has not yet been developed. This appendix represents an outline of a small part of the kind of code that would be developed for the simple case of unit conversion in a query. The translator program would be developed once to integrate many databases. A more detailed illustration of query translation was created as a short Java program ("TestQuery.java"). That program illustrates the steps required for translation to remote databases from an SQL query posed to one of the federated databases. This example presents a case that does not require an ontology to implement; a less-expressive data model might represent the kinds of relations used here. However, there will be other cases, particularly for transitive and inverse relations or where inference rules (e.g., business rules) are needed, where an ontology or its expressive equivalent is needed. This example exhibits a part of the mechanics of a hypothetical translation program that is more complex than simple one-to-one conversion. The full query translator will consist of a program that can:
1. access an ontology that contains the domain model and mappings for the set of databases to be queried (the access is via an Application Programming Interface [API] implemented by the Ontology Management Program [OMP]);
2. perform manipulations on the information retrieved from the ontology;
3. receive queries from a user interface;
4. transmit queries to remote databases;
5. receive answers from remote databases; and
6. send the answers, translated into the terminology of the user, back to the user's interface.
The domain ontology for any given set of databases to be integrated will be an extension of the foundation ontology. It will contain representations of all of the terms in the full set of databases, among which translations are to be performed. The extension ontology will also contain the mappings from the databases to the extension ontology. Other issues, such as de-duplication, can only be resolved on a case-by-case basis. The translator may have an index for simple 1:1 correspondences of elements in the different databases, making ontology query unnecessary for those cases. For retrieval of information from remote data stores when the mappings are too complex to return data in the exact form in which they appear in the database of the originating query, information returned to the user from the remote database should indicate that it is related but not identical to the kind of information stored in the local store. The relations may be: D1 is a subtype of D2; D2 is a subtype of D1; D2 and D1 are both subtypes of a common type; or D1 and D2 have overlapping but not identical types and cannot be clearly distinguished. Those cases are not illustrated here. This translation mechanism is only an outline, and it has not been implemented for a Common Semantic Model (COSMO)–based translator. In this illustration, relations are used as variables. Some ontology query languages do not permit such use. To properly map table columns to ontology relations, the logic will need to be quasi–second order, but tactics to accomplish that effect are available for first-order reasoners. Alternatively, procedural code that is equivalent to the logical operations specified here might be used in the translator; the ontology could be stored as another database. Additional complexity occurs when data stored in a table in one database are stored in multiple tables in a second database.
In that case, the foreign keys that link corresponding elements in the other tables of such a database will also need to be identified by the ontology query mechanism. That situation is not treated here. This simple case of measure translation is only one step more complex than a simple table lookup for one-to-one translations. More complex relationships exist among database elements and ontology elements (e.g., when a table is used to represent more than one type of entity, and some of the columns are relevant to description of one type of entity but not of another).

The Translation Program

In this example, the translation program uses Java language conventions. The ontology access will be performed using an API specific to the OMP, but in this example, ontology queries will be shown in the Knowledge Interchange Format (KIF) language, which is interpretable by several public programs, including the SigmaKEE (Sigma Knowledge Engineering Environment, 2011; used by SUMO) implementation of Vampire (the First-Order Logic theorem prover, 2009) and the HighFleet (formerly Ontology Works) Integrated Ontology Development Environment (IODE) system.

1. The Query

A simple query is given for the height of people having the last name "Cassidy" in a local database where people are represented in a table called INDIVIDUAL, and a person's height is in a column called "height"; the unit of measure is assumed to be meters. This example does not query for attributes of a specific individual, but for attributes of all persons having the same last name. If the attributes of individuals were to be queried, either a unique identifier would be used throughout the set of databases or some set of attributes would be used for de-duplication. Where necessary, de-duplication procedures might match individuals to some adequate level of confidence by a probabilistic matching algorithm. This example can represent any query for a particular attribute on individual entities of any particular type having some other attribute in common.

(Q:DB1) Select HEIGHT from INDIVIDUAL where NAME_LAST = "Cassidy"
2. The Query Is Received

The query is received from the Data Access Service (mechanism unspecified here) over the network by the translation program, along with the name of the original database to which the query was posted. For simplicity, the original database is called "DB1" and is received as the value of the string variable "sourceDB."

3. The Translator Interprets the Query

The translator interprets the query by constructing the (preexisting) unique names for the database table and column by string concatenation, using a double underscore to separate database name, table name, and column name. In the program, the names of the ontology elements are stored as strings. The pseudo-call QueryOntology used here returns an array of variable names corresponding to the return value(s) of a KIF query. More than one table and more than one column in the set of databases can correspond to a type or relation (e.g., attribute) in the ontology. In the following, the lines preceded by a double slash represent comments in the program, and those lines without the initial double slash represent computer code that can be executed as a program.

// Define some Java variables used.
String[] rel1name = new String[1000];     // space for the relation names
String[] rel2name = new String[1000];     // space for other relation names
String[] tablename = new String[100];     // space for the table names
String[] classname = new String[100];     // ontology classes that correspond to the queried table
String[] remoteTables = new String[100];  // tables in the remote databases that correspond to the queried table
String[] remoteCols = new String[100];    // columns (attributes) in the remote databases corresponding to the queried attribute
String[] remoteNumCols = new String[100]; // columns (attributes) in the remote databases that are numbers representing a measure
String[] remoteMeasCols = new String[100]; // columns (attributes) in the remote databases that represent complete measures, including the unit of measure and quantity
String remoteID;                          // name of the unique identifier column for a remote table

// ********* Get strings for query **************
// String variable "sourceDB" and the query string are received along with the call to the
// translator program (not shown); the query string is parsed to get the column name for the
// return value, the table name, and the column name(s) in the "where" clause. The names
// of the table and columns used in the ontology are then constructed.
String qtable = sourceDB + "__" + "INDIVIDUAL";  // = DB1__INDIVIDUAL
String qcol = qtable + "__" + "HEIGHT";          // = DB1__INDIVIDUAL__HEIGHT
String idcol = qtable + "__" + "NAME_LAST";
String idval = "Cassidy";

// The method "QueryOntology" represents the API call to query an ontology for whatever
// ontology management program is being used.
// Get the corresponding ontology elements for the table and the columns:
// note that the ontology query returns an array of class names in the ontology that correspond
// to the table in the query. Usually (and in this case) only one Type in the ontology
// is represented by the table referenced in the query.
classname = QueryOntology(correspondsToTable, ?TABLE, qtable);

// The relations in the ontology that correspond to the database columns in the query
// are similarly found by the API call.
// Note that the idcol (unique id column) may be specific to a particular table,
// or may (as with SSN) show up in multiple tables. We need to cover all cases.
rel1name = QueryOntology(correspondsToDB_Column, ?ID, idcol);
rel2name = QueryOntology(correspondsToDB_Column, ?REL, qcol);

// Next, it is necessary to determine if the corresponding ontology elements are
// directly mapped, or whether some form of conversion is required. In the present
// case, the field value is a simple number, but it represents a measure. Therefore,
// it will need to be converted to the standard form in the ontology for measures of that type.
// The ontology is queried to determine if the present element needs conversion. For this
// simple case, we only ask whether the ontology relation is a "dbMappedMeasureRelation,"
// which means that it is a number representing a measure. We only illustrate the query for
// one corresponding ontology relation (rel2name[0], which has a value of
// 'DB1__INDIVIDUAL__HEIGHT'); there may be more than one corresponding relation.
// The query call in this case returns only a true or false response.
boolean isaMeasure = QueryOntologyTF(isanInstanceOf, rel2name[0], dbMappedMeasureRelation);

// The answer for this case is "true." That means that the type of measure and units need to be
// determined, and a conversion into the ontology standard units must be performed.
// These queries are known to return unique values, so arrays are not necessary.
String measureType = QueryOntologyStr(quantifiesMeasure, rel2name[0], ?measure);

// We can also get the type of measure and its units, but these are not needed in the translation.
// They may be useful for more detailed reasoning.
String measureType = QueryOntologyStr(hasMeasure, rel2name[0], ?mtype);
String measureUnit = QueryOntologyStr(hasUnit, rel2name[0], ?munit);
// The actual number in the "height" attribute may need conversion, but that depends on whether
// the remote databases store the comparable data in the same or different units.
// We can postpone conversion until we know it is necessary.
// Now we can inquire whether other accessible databases have this attribute recorded for
// the class of Person.
// First, find tables in all connected databases that represent an Individual person:
// Since there may be more than one Type in the ontology that is represented
// in the local table being queried, the array classname may have a length greater than
// one. The query below would have to be iterated over all Types in the ontology that
// are returned in "classname[]," but for this example, we find that there is only one,
// so we only query for classname[0]. The return value will be an array of table names,
// potentially more than one in any given database, and likely more than one in the full
// set of integrated databases.
remoteTables = QueryOntology(correspondsToTable, classname[0], ?TABLES);

// The name(s) of the column(s) in the full set of integrated databases that correspond
// to the ontology relation need to be discovered. This query gets all corresponding columns,
// not only those in the relevant tables. However, only the columns in the relevant
// tables will be used. We have to iterate over all of the n relations that correspond
// to the column in the query, though typically there should only be one.
// In this case we are dealing with a measure, which may be represented by a bare number
// in one column (as in the table in the query-originating DB), or by a column value
// that includes the unit of measure. We need to search for both methods of representation.
remoteNumCols = QueryOntology(quantifiesMeasure, ?measureCols, measureType);
remoteMeasCols = QueryOntology(correspondsToDB_Column, rel2name[n], ?COLUMNS);

// The individual columns are linked to the tables in which they appear by an ontology relation.
// Using this, we select only those columns that appear in the relevant tables, thereby
// not querying irrelevant tables (the relation "height" might be used for entities other
// than that in the Person table). An alternative to cross-checking the returned columns above
// with returned tables is to perform the column query for one database table at a time,
// iterating over all tables returned by the table query. In that case, the column queries
// would use a more complex expression transmitted to the ontology, one for each
// table "remoteTables[n]" (each of which corresponds to the ontological type of the
// table DB1__INDIVIDUAL, which was queried by the original query, in this case
// corresponding to the ontological type "Person"). The query would then look like the
// following, but the carriage returns would be omitted; they are here just for clarity:
// remoteNumCols = QueryOntology((and
//     (quantifiesMeasure ?measureCols measureType)
//     (isaColumnInDBTable ?measureCols remoteTables[n])));

// ******** An example of cross-checking of tables and columns is omitted for brevity
// because it is a simple iterated string comparison in Java. *************
// To determine whether any conversion of the number in the measure column is
// required, to translate to some other database, the query above for "measureUnit"
// is repeated for each corresponding remote database column that contains a bare number.
// If the unit of measure is the same, no conversion is necessary. If the unit of measure
// is different, the conversion will need to be done by some mathematical function. Such
// conversions may be simpler in the procedural code of the translation program, but the
// conversion factors should be encoded in the ontology in any case, to minimize dependence on
// procedural code.

// *************** Correlating the NAME_LAST column *****************************
// A column name translation similar to that above also needs to be performed
// on the NAME_LAST column referenced in the "where" clause of the original query.
// This part will proceed similarly to the previous translation of attribute columns
// except that (1) the tables are already known and need not be found again; and
// (2) this attribute is not quantitative and therefore uses a different, simpler ontology
// relation. This step is omitted for this example.

// *************** Formulate Remote Queries *****************************
// Now, knowing the columns and tables in the remote databases that have a representation of
// the measure being queried (person height), we can formulate an SQL query for each
// separate database to retrieve the specific values.
// Several kinds of translation procedures will need to be developed,
// depending on the complexity of the relations between the concepts represented in different
// databases. As mentioned previously, for simple one-to-one translations, a lookup table
// could be used and would likely be faster than ontology query.
// Depending on the method of access granted by the remote
// databases, the query may need separate formulation for each local database. Alternatively,
// all local databases may agree to receive the query in a standard format, and do some form of
// format conversion locally. This architecture issue has not been decided, and the choice may
// depend on the policies applied to remote access by the local database managers.
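The outline above stops short of the final conversion of returned values. A hedged sketch of what that last step might look like follows; the unit names, conversion factors, and the MeasureConverter class are hypothetical stand-ins for information that, as noted above, would actually be encoded in the ontology rather than hard-coded in procedural code.

/** Hypothetical sketch of the unit-conversion step; factors would come from the ontology. */
public class MeasureConverter {
    // Conversion factors to an assumed ontology standard unit (meters, for lengths).
    private static double toStandardFactor(String unit) {
        switch (unit) {
            case "Meter": return 1.0;
            case "Foot":  return 0.3048;
            default: throw new IllegalArgumentException("Unknown unit: " + unit);
        }
    }

    /** Converts a bare number from a remote column's unit into the querying column's unit. */
    public static double convert(double value, String remoteUnit, String localUnit) {
        double inStandardUnits = value * toStandardFactor(remoteUnit);
        return inStandardUnits / toStandardFactor(localUnit);
    }

    public static void main(String[] args) {
        // A remote database stores height in feet; the querying database (DB1) assumes meters.
        double remoteHeight = 5.9;  // feet
        System.out.println(convert(remoteHeight, "Foot", "Meter"));  // approximately 1.798
    }
}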
REFERENCES CITED Akman, V., and Surav, M., 1996, Steps toward formalizing context: AI Magazine, v. 17, no. 3, p. 55–72. Akman, V., and Surav, M., 1997, The use of situation theory in context modeling: Computational Intelligence, v. 13, no. 3, p. 427–438. Barker, K., Porter, B., and Clark, P., 2001, A library of generic concepts for composing knowledge bases, in Proceedings of the K-CAP’01 International Conference on Knowledge Capture, Victoria, Canada: New York, Association for Computing Machinery, p. 14–21; http://www.cs.utexas .edu/users/kbarker/papers/kcap01-content.pdf (accessed May 15, 2011). Barrasa, J., Corcho, Ó., and Gómez-Pérez, A., 2004, R2O, an extensible and semantically based database to ontology mapping language, in Bussler, C., Tannen, V., and Fundulaki, I., eds., Second International Workshop on Semantic Web and Databases, Toronto, Canada Workshop Proceedings: New York, Springer; http://www.cs.man.ac.uk/~ocorcho/documents/ SWDB2004_BarrasaEtAl.pdf (accessed May 15, 2011). Basic Formal Ontology (BFO); available in different implementations, as described at http://www.ifomis.uni-saarland.de/bfo (accessed May 15, 2011). Basin, D., D’Agostino, M., Gabbay, D., Matthews, S., and Viganò, L., eds., 2000, Labelled Deduction: Amsterdam, Kluwer Academic Publishers, 266 p. Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Patel-Schneider, P.F., and Stein, L.A., 2004, OWL Web Ontology Language Reference, February 10, 2004; http://www.w3.org/TR/owl-ref/ (accessed May 15, 2011). Bitters, B., 2002, Feature classification system and associated 3-dimensional model libraries for synthetic environments of the future, in Proceedings of I/ITSEC (Interservice/Industry Training, Simulation and Education Conference) (National Training and Simulation Association NTSA);
http://vissim.uwf.edu/Pubs/IITSEC-2002_042.pdf (accessed May 15, 2011). Bizer, C., 2010, The D2RQ Platform—Treating Non-RDF Databases as Virtual RDF Graphs; http://www4.wiwiss.fu-berlin.de/bizer/d2rq/ (accessed May 15, 2011). Blackburn, P., 2000, Internalizing labelled deduction, in Proceedings of Hylo’99, First International Workshop on Hybrid Logics: Journal of Logic and Computation, v. 10, no. 1, p. 137–168. Blair, P., Guha, R.V., and Pratt, W., 1992, Microtheories: An Ontological Engineer’s Guide: Cycorps Technical Report Cyc-050-92: Austin, Texas, Cycorps. Borgo, S., and Lesmo, L., 2008, The attractiveness of foundational ontologies in industry, in Borgo, S., and Lesmo, L., eds., Proceeding of the 2008 Conference on Formal Ontologies Meet Industry: Amsterdam, the Netherlands, IOS Press, p. 1–9. Bouquet, P., Giunchiglia, F., Van Harmelen, F., Serafini, L., and Stuckenschmidt, H., 2003, OWL: Contextualizing ontologies, in Fensel, D., Sycara, K.P., and Mylopoulos, J., eds., 2nd International Semantic Web Conference (ISWC 2003): Sanibel Island, Florida, October 20–23, 2003: New York, Springer, p. 164–179. Bürger, T., 2008, A benefit estimation model for ontologies, in Poster Proceedings of the 5th European Semantic Web Conference (ESWC), Tenerife, Spain, June 1–5, 2008; http://ontocom.sti-innsbruck.at/docs/eswc2008-benefitestimation.pdf (accessed May 15, 2011). Bürger, T., and Simperl, E., 2008, Measuring the benefits of ontologies, in Meersman, R., Tari, Z., and Herrero, P., eds., On The Move to Meaningful Internet Systems, OTM 2008 Workshops: Berlin, Heidelberg, Springer Verlag. Includes Proceedings of Ontology Content and Evaluation in Enterprise (OntoContent ’08), Monterrey, Mexico, November 9–14, 2008, p. 584–594. Casati, R., and Varzi, A.C., 1999, Parts and Places: The Structures of Spatial Representation: Cambridge, Massachusetts Institute of Technology Press, 238 p. Cassidy, P., 2008, COSMO ontology, http://micra.com/COSMO/ (accessed May 16, 2011). Cohn, A.G., and Gotts, N.M., 1996, The ‘egg-yolk’ representation of regions with indeterminate boundaries, in Burrough, P., and Frank, A.M., eds., Geographical Objects with Undetermined Boundaries: London, Taylor & Francis, p. 171–188. Command and Control (C2) Core, 2010, http://www.jfcom.mil/about/fact_c2core.html (accessed May 16, 2011). Critchlow, T., Ganesh, M., and Musick, R., 1998, Automatic generation of warehouse mediators using an ontology engine, in Proceedings of the 5th KRDB (Knowledge Representation Meets DataBases) workshop: Seattle, Washington, May 1998; http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-10/ (accessed May 16, 2011). Cruz, I.F., and Sunna, W., 2008, Structural alignment methods with applications to geospatial ontologies, in Janowicz, K., Raubal, M., Schwering, A., and Kuhn, W., eds., Special Issue on Semantic Similarity Measurement and Geospatial Applications: Transactions in GIS, v. 12, no. 6, December 2008: Hoboken, New Jersey, Wiley-Blackwell, p. 683–711. Daconta, M., Obrst, L., and Smith, K., 2003, The Semantic Web: A Guide to the Future of XML, Web Services, and Knowledge Management: Indianapolis, Wiley Publishing, 312 p. Ehrig, M., 2007, Ontology Alignment: Bridging the Semantic Gap: New York, Springer Science+Business Media, 247 p. Euzenat, J., and Shvaiko, P., 2007, Ontology Matching: Berlin, Heidelberg, Springer-Verlag, 333 p. Federal Enterprise Architecture Reference Ontology (FEA-RMO), http://web-services.gov/fea-rmo.html (accessed May 16, 2011). 
Fernández-López, M., and Gómez-Pérez, A., 2004, Searching for a time ontology for semantic web applications, in Varzi, A.C., and Vieu, L., eds., Proceedings of the Third International Conference on Formal Ontology in Information Systems (FOIS-2004), Turín, Italy, November 4–6, 2004: Frontiers in Artificial Intelligence and Applications, v. 114: Amsterdam, IOS Press, p. 331–341; http://www.iospress.nl/loadtop/load .php?isbn=faia (accessed May 16, 2011.) Fonseca, F., Davis, C., and Câmara, G., 2003, Bridging ontologies and conceptual schemas in geographic information integration: GeoInformatica, v. 7, p. 355–378, doi:10.1023/A:1025573406389. Gabbay, D., 1996, Labelled Deductive Systems: Principles and Applications. Volume 1: Introduction: Oxford, UK, Oxford University Press, 512 p.
Geospatial Markup Language (GML): http://www.opengeospatial.org/ standards/gml (accessed May 16, 2011). Giunchiglia, F., and Bouquet, P., 1997, Introduction to Contextual Reasoning: An Artificial Intelligence Perspective: Trento, Italy, Istituto per la Ricerca Scientifica e Tecnological (IRST) Technical Report 9705-19, May 1997, 22 p. Giunchiglia, F., and Bouquet, P., 1998, A Context-Based Framework for Mental Representation: Trento, Italy, Istituto per la Ricerca Scientifica e Tecnologica (IRST) Technical Report 9807-02, July 1998, 7 p. Giunchiglia, F., and Ghidini, C., 1998, Local models semantics, or contextual reasoning = locality + compatibility, in Cohn, A., Schubert, L., and Shapiro, S., eds., Proceedings of the Sixth International Conference on Principles of Knowledge Representation and Reasoning (KR’98), Trento, Italy, June 2–5, 1998: San Francisco, Morgan Kaufmann, p. 282–289. Grenon, P., 2003a, BFO in a Nutshell: A Bi-Categorial Axiomatization of BFO and Comparison with DOLCE: IFOMIS (Institute for Formal Ontology and Medical Information Science), University of Leipzig, Technical Report 06/2003; http://www.ifomis.org/Research/IFOMISReports/ IFOMIS%20Report%2006_2003.pdf (accessed May 16, 2011), 37 p. Grenon, P., 2003b, Spatio-Temporality in Basic Formal Ontology: SNAP and SPAN, Upper Level Ontology, and Framework of Formalization (Part I): IFOMIS (Institute for Formal Ontology and Medical Information Science, University of Leipzig), Technical Report Series 05/2003; http:// www.ifomis.org/Research/IFOMISReports/IFOMIS%20Report%2005 _2003.pdf (accessed May 16, 2011). Grenon, P., and Smith, B., 2004, SNAP and SPAN: Towards dynamic geospatial ontology: Spatial Cognition and Computation, v. 4, no. 1, p. 69–104. Gruninger, M., Bodenreider, O., Olken, F., Obrst, L., and Yim, P., 2008, Ontology, taxonomy, folksonomy: Understanding the distinctions: Journal of Applied Ontology, v. 3, no. 3, p. 191–200. Guarino, N., 1998, Formal ontology and information systems, in Formal Ontology in Information Systems (“FOIS”), Proceedings of FOIS’98, Trento, Italy, June 6–8, 1998: Amsterdam, IOS Press, p. 3–15. Guha, R.V., 1991, Contexts: A Formalization and Some Applications [Ph.D. thesis]: Stanford, California, Stanford University, 146 p. Guo, C.-M., 1989, Constructing a Machine-Tractable Dictionary from the Longman Dictionary of Contemporary English [Ph.D. thesis]: Las Cruces, New Mexico State University, 140 p. Available from University Microfilms International, Ann Arbor, Michigan. Haase, P., Honavar, V., Kutz, O., Sure, Y., and Tamilin, A., eds., 2006, Proceedings of the 1st International Workshop on Modular Ontologies (WoMO’06), Athens, Georgia, November 5, 2006. Available from Sun SITE Central Europe (CEUR) Workshop Proceedings, v. 232, and at http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-232/ (accessed May 16, 2011). Hayes, P., 1996, A Catalog of Temporal Theories: University of Illinois Technical Report UIUCBI-AI-96-01, http://www.ihmc.us/groups/phayes/wiki/ a3817/attachments/e15ba/TimeCatalog.pdf (accessed May 16, 2011). Hazarika, S.M., and Cohn, A.G., 2001, A taxonomy for spatial vagueness: An alternative egg- yolk interpretation, in Proceedings of SVUG’01 (SunGard Virtual User Group), 2001; http://citeseerx.ist.psu.edu/viewdoc/download ?doi=10.1.1.8.2852&rep=rep1&type=pdf (accessed May 16, 2011). Hobbs, J.R., and Pan, F., 2004, An ontology of time for the semantic web, in ACM Transactions on Asian Language Information Processing, v. 3, no. 
1: New York, Association for Computing Machinery, p. 66–85; http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.94.9478&rep =rep1&type=pdf (accessed May 16, 2011). Imtiaz, A., Bürger, T., Popov, I.O., and Simperl, E., 2009, Framework for value prediction of knowledge-based applications, in 1st Workshop on the Economics of Knowledge-Based Technologies (ECONOM 2009) in conjunction with 12th International Conference on Business Information Systems (BIS 2009), Poznan, Poland, April 27–29, 2009; http://ontocom .sti-innsbruck.at/docs/ECONOM2009.pdf (accessed May16, 2011). Kahlert, R.C., and Sullivan, J., 2006, Microtheories, in von Hahn, W., and Vertan, C., eds., First International Workshop: Ontology Based Modeling in the Humanities, April 7–8, 2006: Report FBI-HH-B-264-06: Hamburg, Germany, University of Hamburg, p. 56–66; http://clio-knows .sourceforge.net/Microtheories-v2.pdf (accessed May 16, 2011). Kalfoglou, Y., and Schorlemmer, M., 2003, Mapping: The state of the art: The Knowledge Engineering Review, v. 18, no. 1, p. 1–31, doi:10.1017/ S0269888903000651. Kalfoglou, Y., and Schorlemmer, M., 2004, Ontology mapping: The state of the art, in Sheth, A., Staab, S., and Uschold, M., eds., Dagstuhl Seminar
Proceedings on Semantic Interoperability and Integration. Internationales Begegnungs- und Forschungszentrum (IBFI): Schloss Dagstuhl, Germany, 40 p.; http://drops.dagstuhl.de/opus/volltexte/2005/40/pdf/04391 .KalfoglouYannis.Paper.40.pdf (accessed May 16, 2011). Laclavik, M., 2006, RDB2Onto: Relational database data to ontology individuals mapping, in Navrat, P., et al., eds., Tools for Acquisition, Organization, and Presenting of Information and Knowledge: Bratislava, Vydavatelstvo STU, 2006, Workshop, Nizke Tatry, Slovakia, September 29–30: Bratislava, Slovakia, Vydavate©stvo STU, Vazovova 5, p. 86–89; http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.64.998&rep =rep1&type=pdf (accessed May 16, 2011). Lenat, D., 1998, The Dimensions of Context-Space: Austin, Texas, Cycorp, Technical Report; http://www.cyc.com/doc/context-space.pdf (accessed May 16, 2011). Lenat, D., Witbrock, M., Baxter, D., Blackstone, E., Deaton, C., Schneider, D., Scott, J., and Shepard, B., 2010, Harnessing Cyc to answer clinical researchers’ ad hoc queries: AI Magazine, v. 31, no. 3, p. 13–32. Also available at http://www.cyc.com/technology/whitepapers_dir/ Harnessing_Cyc_to_Answer_Clincal_Researchers_ad_hoc_Queries.pdf (accessed May 16, 2011). Longman Dictionary of Contemporary English (new ed.), 1987, Essex, UK, Longman Group, 1229 + 82 p. Mapping the Human Terrain (MAP-HT), 1997, a Joint Capabilities Technical Demonstration (JCTD) Project led by the Army. Public information is available at http://en.wikipedia.org/wiki/Human_terrain_system and http:// www.army.mil/-news/2010/10/12/46426-socio-cultural-data-collection -provides-insight-for-commanders/ and http://www.memestreams.net/ users/pnw/blogid783316 (all accessed May 16, 2011). Mascardi, V., Cordì, V., and Rosso, P., 2006, A Comparison of Upper Ontologies: Technical Report from Dipartimento di Informatica e Scienze dell’Informazione (DISI), Genoa, Italy: Report DISI-TR-06-21, http:// www.disi.unige.it/person/MascardiV/Download/DISI-TR-06-21.pdf (accessed May 16, 2011). Masolo, C., Borgo, S., Gangemi, A., Guarino, N., and Oltramari, A., 2003, Wonderweb Deliverable D18: Ontology Library (final): Trento, Italy, Laboratory for Applied Ontology, ISTC-CNR, 349 p.; www.loa-cnr.it/ Papers/D18.pdf (accessed May 16, 2011). Versions of the ontology itself are available at http://www.loa-cnr.it/DOLCE.html (accessed May 16, 2011). McCarthy, J., 1990, Formalizing Common Sense: Papers by John McCarthy: Norwood, New Jersey, Ablex Publishing Corporation, 256 p. McCarthy, J., 1993, Notes on formalizing context, in Bajcsy, R., ed., Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, August 28–September 3, 1993: Chambéry, France, Morgan Kaufmann. McCarthy, J., and Buvač, S., 1997, Formalizing context (expanded notes), in Aliseda, A., van Glabbeek, R., and Westerståhl, D., eds., Computing Natural Language: Stanford, California, Stanford University, p. 13–50; http://www-formal.stanford.edu/buvac/formalizing-context.ps (accessed May 16, 2011). McGuinness, D.L., Fikes, R., Rice, J., and Wilder, S., 2000, An environment for merging and testing large ontologies, in Cohn, A.G., Giunchiglia, F., and Selman, B., eds., Proceedings of the Seventh International Conference on Principles of Knowledge Representation and Reasoning (KR2000), Breckenridge, Colorado, April 12–15, 2000: San Francisco, California, Morgan Kaufmann, p. 483–493; http://citeseerx.ist.psu.edu/viewdoc/down load?doi=10.1.1.109.1812&rep=rep1&type=pdf (accessed May 16, 2011). 
Menzel, C., 1999, The objective conception of context and its logic: Minds and Machines, v. 9, no. 1, p. 29–56, doi:10.1023/A:1008390129138. Meseguer, J., 1998, Formal interoperability, in Proceedings of the Fifth International Symposium on Artificial Intelligence and Mathematics, Fort Lauderdale, Florida, January 4–6, 1998; http://rutcor.rutgers.edu/~amai/ aimath98/Extended_Abstracts/JMeseguer.ps (accessed May 16, 2011). MIDB, 2011, The Modernized Integrated Database (MIDB), http://www.fas .org/irp/program/disseminate/midb.htm (accessed May 17, 2011). MIEM, 2008, The Maritime Information Exchange Model, http://faculty.nps .edu/fahayesr/miem.html and references therein (accessed May 16, 2011). Mitchell, T.M., Shinkareva, S.V., Carlson, A., Chang, K.-M., Malave, V.L., Mason, R.A., and Just, M.A., 2008, Predicting human brain activity associated with the meanings of nouns: Science, v. 320, p. 1191, doi:10.1126/ science.1152876.
Musen, M., and Noy, N., 2002, Evaluating ontology mapping tools: Requirements and experience, http://bmir.stanford.edu/file_asset/index.php/57/ BMIR-2002-0936.pdf (accessed May 16, 2011). National Building Information Model Standard, 2011, http://www.wbdg.org/ (accessed May 17, 2011). National Information Exchange Model (NIEM), 2011, http://www.niem.gov/ (accessed May 17, 2011). Niles, I., and Pease, A., 2001, Towards a standard upper ontology, in Welty, C., and Smith, B., eds., Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), Ogunquit, Maine, October 17–19, 2001: New York, ACM Press, p. 2–9; http://citeseerx.ist .psu.edu/viewdoc/download?doi=10.1.1.75.2093&rep=rep1&type=pdf (accessed May 16, 2011). Noy, N.F., and Musen, M.A., 2000, PROMPT: Algorithm and tool for automated ontology merging and alignment, in Seventeenth National Conference on Artificial Intelligence (AAAI-2000), Menlo Park, California, Association for the Advancement of Artificial Intelligence: Paper available as Stanford Medical Informatics, Stanford University Technical Report SMI-2000-0831, p. 450–455, http://bmir.stanford.edu/file_asset/ index.php/159/BMIR-2000-0831.pdf (accessed May 16, 2011). Oberle, D., 2006, Semantic Management of Middleware: New York, Springer, 276 p. Obrst, L., 2002, Review of “Qualitative Spatial Reasoning with Topological Information,” by Jochen Renz, 2002, Springer-Verlag: Association for Computing Machinery Computing Reviews. Obrst, L., 2003, Ontologies for semantically interoperable systems, in Frieder, O., Hammer, J., Quershi, S., and Seligman, L., eds., Proceedings of the Twelfth Association for Computing Machinery International Conference on Information and Knowledge Management (CIKM 2003), New Orleans, Louisiana, November 3–8: New York, Association for Computing Machinery, p. 366–369; http://portal.acm.org/citation.cfm ?id=956863.956932 (accessed May 16, 2011). Obrst, L., 2010, Ontological architectures, in Poli, R., Seibt, J., and Kameas, A., eds., TAO—Theory and Applications of Ontology, vol. 2: Computer Applications: New York, Springer, p. 27–67. Obrst, L., and Nichols, D., 2005, Context and ontologies: Contextual indexing of ontological expressions, in Association for the Advancement of Artificial Intelligence 2005 Workshop on Context and Ontologies, poster (AAAI 2005), Pittsburgh, Pennsylvania, July 9–13, 2005; http:// www.mitre.org/work/tech_papers/tech_papers_05/05_0903/index.html (accessed May 16, 2011). Obrst, L., Whittaker, G., and Meng, A., 1999, Semantic context for object exchange, in Association for the Advancement of Artificial Intelligence (AAAI) Workshop on Context in AI Applications, Orlando, Florida, July 19, 1999; http://www.aaai.org/Papers/Workshops/1999/WS-99-14/ WS99-14-014.pdf (accessed May 16, 2011). Obrst, L., Cassidy, P., Ray, S., Smith, B., Soergel, D., West, M., and Yim, P., 2006, The 2006 Upper Ontology Summit Joint Communiqué: Journal of Applied Ontology. v. 1, no. 2, p. 203–211. Obrst, L., Ceusters, W., Mani, I., Ray, S., and Smith, B., 2007, The evaluation of ontologies: Toward improved semantic interoperability, in Baker, C.J.O., and Cheung, K.H., eds., Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences: New York, Springer, p. 139–158. Obrst, L., Janssen, T., and Ceusters, W., eds., 2010a, Ontologies and Semantic Technologies for Intelligence: Frontiers in Artificial Intelligence and Applications: Amsterdam, the Netherlands, IOS Press, Book Series, v. 213, 227 p. 
Obrst, L., Stoutenburg, S., McCandless, D., Nichols, D., Franklin, P., Prausa, M., and Sward, R., 2010b, Ontologies for rapid integration of heterogeneous data for command, control, and intelligence, in Obrst, L., Janssen, T., and Ceusters, W., eds., Ontologies and Semantic Technologies for the Intelligence Community: Frontiers in Artificial Intelligence and Applications: Amsterdam, the Netherlands, IOS Press, Book Series, v. 213, p. 71–90. Ontolingua, 1994, The Physical Quantities ontology: “Theory Physical Quantities”; http://www.ksl.stanford.edu/htw/dme/thermal-kb-tour/physical -quantities.html (accessed May 16, 2011). OpenCyc, 2011, The OpenCyc ontology, http://cyc.com/cyc/opencyc (accessed May 16, 2011). Parmelee, M., 2007, The Vocabulary Driven Value Chain: Department of Defense (DoD) Community of Interest (COI) Forum, July 31, 2007; http://
semanticommunity.info/@api/deki/files/6978/=MParmelle07312007.ppt (accessed May 17, 2011). Poli, R., 2003, Descriptive, formal and formalized ontologies, in Fisette, D., ed., Husserl’s Logical Investigations Reconsidered: Dordrecht, Netherlands, Kluwer, p. 193–210. Poli, R., 2010, The categorial stance, in Poli, R., Seibt, J., Healy, M., and Kameas, A., eds., Theory and Applications of Ontology, vol. 1: Philosophical Perspectives: Berlin, Springer, p. 1–22. Poli, R., and Obrst, L., 2010, The interplay between ontology as categorial analysis and ontology as technology, in Poli, R., Healy, M., and Kameas, A., eds., Theory and Applications of Ontology, vol. 2: Computer Applications: New York, Springer, p. 1–26. QUDT, 2011, Quantities, Units, Dimensions, and Data Types (QUDT) ontology: Quantities, Units, Dimensions and Data Types in OWL and XML. Maintained by the National Aeronautics and Space Administration, http://www.qudt.org/ (accessed May 16, 2011). QUOMOS, 2011, The Quantities and Units of Measure Ontology Standard (QUOMOS) TC, an OASIS (Advancing Open Standards for the Information Society) Technical Committee, http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=quomos (accessed May 16, 2011). Randall, D.A., Cui, Z., and Cohn, A.G., 1992, A spatial logic based on regions and connection, in Nebel, B., Swartout, W., and Rich, C., eds., Proceedings of the 3rd International Conference on Principles of Knowledge Representation and Reasoning, Cambridge, Massachusetts, October 1992: San Francisco, Morgan Kaufmann, p. 165–176. Renz, J., 2002, Qualitative spatial reasoning with topological information: Lecture notes in Computer Science, v. 2293: Artificial Intelligence: Berlin, Springer-Verlag, 201 p. Samuel, K., Obrst, L., Stoutenberg, S., Fox, K., Franklin, P., Johnson, A., Laskey, K., Nichols, D., Lopez, S., and Peterson, J., 2008, Applying prolog to semantic web ontologies and rules: Moving toward description logic programs: The Journal of the Theory and Practice of Logic Programming (TPLP), v. 8, no. 3, p. 301–322. Schneider, L., 2003, How to build a foundational ontology: The object-centered high-level reference ontology OCHRE, in Günter, A., Kruse, R., and Neumann, B., eds., Proceedings of the 26th Annual German Conference on AI (KI 2003): Advances in Artificial Intelligence, Lecture Notes in Computer Science, v. 2821: New York, Springer, p. 120–134. Schuurman, N., and Leszczynski, A., 2006, Ontology-based metadata: Transactions in GIS, v. 10, no. 5, p. 709–726. Semy, S., Pulvermacher, M., and Obrst, L., 2005, Toward the Use of an Upper Ontology for U.S. Government and U.S. Military Domains: An Evaluation: MITRE Technical Report, MTR 04B0000063, November 2005; http://www.mitre.org/work/tech_papers/tech_papers_05/04_1175/index.html (accessed May 16, 2011). SigmaKEE, 2011, Sigma Knowledge Engineering Environment (SigmaKEE), http://sourceforge.net/projects/sigmakee/ (accessed May 16, 2011). Simperl, E., Tempich, C., and Sure, Y., 2006, ONTOCOM (ONTOlogy COst Model): A cost estimation model for ontology engineering, in Cruz, I., et al., eds., Proceedings of the International Semantic Web Conference (ISWC 2006): New York, Springer, p. 625–639. More information on ONTOCOM is available at http://ontocom.ag-nbi.de/ (accessed May 16, 2011). 
Simperl, E., Popov, I.O., and Bürger, T., 2009, ONTOCOM revisited: Towards accurate cost predictions for ontology development projects, in Aroyo, L., et al., eds., Proceedings of the European Semantic Web Conference 2009 (ESWC ’09), Heraklion, Greece, May 20–June 4, 2009: New York, Springer, p. 248–262. Smith, B., Vizenor, L., and Schoening, J., 2009, Universal core semantic layer, in Proceedings of the Ontologies for the Intelligence Commu-
nity (OIC) Conference, October 20–22, 2009: Fairfax, Virginia, George Mason University; http://sunsite.informatik.rwth-aachen.de/Publications/ CEUR-WS/Vol-555/paper5.pdf (accessed May 16, 2011). Sowa, J., 2010, Building, Sharing, and Merging Ontologies, http://www.jfsowa .com/ontology/ontoshar.htm (accessed July 5, 2011). STEP, 2011, Standard for the Exchange of Product Model Data (STEP), ISO 10303: http://www.tc184-sc4.org/SC4_Open/SC4%20Legacy%20 Products%20%282001-1983 08%29/STEP_%2810303%29/ (accessed May 16, 2011). Stoutenburg, S., 2009, Advancing Ontology Alignment: New Methods for Biomedical Ontology Alignment Using Non-Equivalence Relations [Ph.D. thesis]: Colorado Springs, University of Colorado, 222 p.; http://www .cs.uccs.edu/~kalita/work/StudentResearch/StoutenburgPhDThesis2009 .pdf (accessed May 16, 2011). Stoutenburg, S., Obrst, L., Nichols, D., Franklin, P., Samuel, K., and Prausa, M., 2007a, Ontologies and rules for rapid enterprise integration and event aggregation: Vocabularies, ontologies, and rules for the enterprise (VORTE 07), EDOC 2007 (2007 Eleventh International IEEE EDOC Conference Workshop): Annapolis, Maryland, doi: 10.1109/EDOCW.2007.22. Stoutenburg, S., Obrst, L., McCandless, D., Nichols, D., Franklin, P., Prausa, M., and Sward, R., 2007b, Ontologies for rapid integration of heterogeneous data for command, control, and intelligence, in Ontologies for the Intelligence Community Conference: Columbia, Maryland; http://citeseerx .ist.psu.edu/viewdoc/download?doi=10.1.1.142.9502&rep=rep1&type =pdf (accessed May 16, 2011). Suggested Upper Merged Ontology (SUMO), 2001, http://www.ontologyportal .org/ (accessed May 17, 2011). Sunna, W., and Cruz, I.F., 2007, Structure-based methods to enhance geospatial ontology alignment geospatial semantics, in Second International Conference (GeoS): New York, Springer, Lecture Notes in Computer Science 4853, p. 82–97. UCore, 2009, Universal Core (UCore), http://cio-nii.defense.gov/initiatives/ ucore.shtml (accessed May 16, 2011). Vampire First-Order Logic Theorem-Prover, 2009, http://www.vprover.org/ (accessed May 17, 2011). Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster, G., Neumann, H., and Hübner, S., 2001, Ontology-based integration of information—A survey of existing approaches, in Proceedings of the International Joint Conference on Artificial Intelligence 2001 (IJCAI-01) Workshop on Ontologies and Information Sharing, Seattle, Washington, p. 108–117. Proceedings available at http://citeseerx.ist.psu.edu/viewdoc/download?doi =10.1.1.13.3504&rep=rep1&type=pdf (accessed May 16, 2011). Wilks, Y., Slator, B., and Guthrie, L., 1996, Electric Words: Dictionaries, Computers, and Meanings: Cambridge, Massachusetts Institute of Technology Press, 289 p. Winters, L., and Tolk, A., 2009, C2 domain ontology within our lifetime, in Proceedings of the 14th International Command and Control Research and Technology Symposium (ICCRTS): Washington, D.C., June 15–17, 2009; http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA503107&Location =U2&doc=GetTRDoc.pdf 59 pages (accessed May 16, 2011). Wisnosky, D.E., 2010, Semantic Technology in the Department of Defense, Business Mission Area, in Semantic Technology for Intelligence, Defense, and Security (STIDS) Conference 2010, October 26–29, 2010: Fairfax, Virginia, George Mason University; http://stids.c4i.gmu.edu/ presentations/STIDS_Keynote_Wisnosky.pdf (accessed May 16, 2011).
MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
Printed in the USA
The Geological Society of America Special Paper 482 2011
Data provenance for preservation of digital geoscience data

Beth Plale, School of Informatics and Computing, Indiana University, 919 E. 10th St., Bloomington, Indiana 47405, USA
Bin Cao, Teradata Corporation, 100 N. Sepulveda Blvd, El Segundo, California 90245, USA
Chathura Herath and Yiming Sun, School of Informatics and Computing, Indiana University, 919 E. 10th St., Bloomington, Indiana 47405, USA
ABSTRACT

The first step in the preservation of digital scientific data is gathering enough information "about" a scientific outcome or data collection so that it can be discovered and used a decade later as easily as it is used at the time. Data provenance, or lineage of a collection, can capture the way in which a particular scientific collection was created, when, and by whom. Tools that automate the collection of provenance can reduce the burden on the researcher, and provenance data can be stored in ways that make the data more amenable to long-term preservation. We discuss the various dimensions of data provenance in data-driven geospatial science with the goal of conveying a good grasp of provenance collection, representation, and use. Our research in data cyberinfrastructure utilizes real-time observational data in on-demand weather forecasts, and we discuss this aspect as well.
INTRODUCTION

The earth, oceanographic, and atmospheric sciences generate large amounts of data from observational instruments and sensors, in volumes that are proliferating as sensor technology becomes more of a commodity and as cyberinfrastructure capable of handling those volumes becomes more established. For instance, the Advanced Microwave Scanning Radiometer–Earth Observing System (AMSR-E) captures sea-ice measurements as images that are then sent through an ingest process in which they undergo image refinement, including application of edge-detection algorithms that pick out the boundary between land and sea ice. Atmospheric forecast model data are generated
hourly at the National Oceanic and Atmospheric Administration Earth Systems Research Laboratory, and these data are downloaded for use in a range of applications, including the National Science Foundation–funded Vortex 2 field study during spring 2010, which spanned the heartland of the United States to take close-up measurements of tornadic and severe storm activity (IU, 2010). Environmental data also come from efforts such as the International Forestry Resources and Institutions (IFRI) research initiative, which examines the ways in which governance arrangements affect forests and the people who depend on them. The IFRI database today holds 15 years of survey data taken of forested areas worldwide. These data sets are an increasingly valuable part of our heritage as we attempt to answer larger questions around
Plale, B., Cao, B., Herath, C., and Sun, Y., 2011, Data provenance for preservation of digital geoscience data, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 125–137, doi:10.1130/2011.2482(11). For permission to copy, contact [email protected]. © 2011 The Geological Society of America. All rights reserved.
climate change and other grand-scale phenomena. However, in order for a data collection to remain valuable years after it was generated, it must be accompanied by a description of itself. A potted plant, a rose bush for instance, comes with instructions on how to take care of it. Similarly, for a data object to be usable 10 or 100 yr from now, it must come with instructions describing the context of the data object. For instance, what is the object's format? What is the context under which the object was created? By whom? The contextual story surrounding a data object is its provenance. Provenance is the lineage of a data product or collection of data (Simmhan et al., 2005), and it tells us who or what process acted on a data collection prior to its arrival at its present destination. Like the lineage of a piece of art, the historical record can be incomplete, but even an incomplete record, when taken over a long history, presents information about the object and its usefulness that is not available elsewhere. The role that data provenance plays in geoinformatics is to capture information about the time a collection was created and its originator. It can capture the transformations a data collection underwent and the sequence of tasks that were executed and data products applied to generate a new collection. These actions can occur inside a database through updates, or outside a database through task graph scripting, command line sequences, human activity, or workflow systems. Provenance can identify event causality; enable broader forms of sharing, reuse, and long-term preservation of scientific data; and be used to attribute ownership and determine the quality of a particular data set (Simmhan et al., 2006). A popular data model for provenance is the Open Provenance Model (OPM; Moreau et al., 2008). Emerging from e-Science researchers, OPM is evolving as a standardized representation of provenance. OPM, which specifies a model and a set of inference rules for provenance graphs, defines three primary entities: artifacts—immutable pieces of state; processes—actions or series of actions performed on or caused by artifacts; and agents—contextual entities that act as a catalyst of a process. OPM defines several causal dependencies, including but not limited to: process used artifact; artifact was generated by a process; process was triggered by another process. Causality edges use the past tense, referring to an execution that has been completed. Hence, provenance graphs capture past activity. Provenance about data products is useful to researchers in multiple ways. Provenance information can be used in knowledge reasoning (Leake and Kendall-Morwick, 2008; Zhao et al., 2008) to recommend a particular task graph plan (or "workflow"). For instance, experience may say that one cluster is far more reliable than another. A completion tool would recommend a workflow plan that executes on the more reliable cluster. Provenance data can be useful in ascertaining the quality of a data product. Simmhan et al. (2006) suggested that data quality has several dimensions: intrinsic metadata, provenance quality, quality of service, and community perception of quality. The intrinsic metadata quality score makes use of the inherent properties that can be discovered by having physical access to the data, such as the observing system's manufacturer specifications for accuracy and precision.
Provenance is an important quality metric since the derivation process has significant implications for the data's quality, and errors introduced by faulty data tend to inflate as they propagate to data derived from them. The quality of service metric for a data product present in a data repository measures the ability to access the data product and transfer it to a remote location for an application's use at a certain resource cost. The community perception of quality is based on intrinsic factors such as the reputation of an organization from which data products were generated. Data provenance can be used to replay a task sequence (Foster et al., 2002). It can be used to construct a record for the preservation of a scientific collection. The difference between provenance for replay and provenance for preservation is the relevant time span. Replay expects the execution environment to undergo no change from run to run. In other words, the executables in the workflow are assumed to be accessible and runnable. Data are expected to exist at the physical location, and the workflow is assumed to run in the same execution system. Provenance collected for purposes of preservation, on the other hand, must also contain contextual information (Kowalczyk, 2007) because it is unlikely that an outdated version of an executable model will be runnable several years later, given the pace of hardware and software technology change. The software framework that executed the task graph is unlikely to exist either. Further discussion of the uses of provenance information for preservation appears later in this chapter. In this chapter, we discuss the various dimensions of data provenance in data-driven geospatial science with the goal of conveying a solid understanding of provenance collection, representation, and use. Our research in data cyberinfrastructure utilizes real-time observational data in on-demand weather forecasts, and we discuss this aspect as well.

Use Case for Provenance Capture

The AMSR-E (http://www.ghcc.msfc.nasa.gov/AMSR/index.html) is a passive microwave radiometer aboard a satellite operated by the National Aeronautics and Space Administration (NASA). It produces infrared images every 50 min on each of its dozen or so frequency channels. It observes atmospheric, land, oceanic, and cryospheric parameters, including precipitation, sea-surface temperatures, ice concentrations, snow-water equivalent, surface wetness, wind speed, atmospheric cloud water, and water vapor. The image products that are beamed down continuously are processed by a continuous ingest stream. The AMSR-E ingest pipeline takes in level 1 data products and generates level 2 and level 3 data products (geoscience instruments often use a level numbering scheme wherein a larger level number means additional aggregated or derived products; level 0 data, often seen only at the instrument, could even be analog data), creating data sets that are key to research in climate variability. The AMSR-E ingest processing stream has existed for some time. As such, provenance collection has to adapt to it. As shown in Figure 1, the ingest flow is under the control of
Data provenance for preservation of digital geoscience data a single script, S1. Prior capture can be carried out by insertion of instrumentation sensors in the script and select applications that the script invokes. Some applications require other forms of capture, such as the log file of an application that is parsed by a provenance “adaptor” as shown in Figure 1. The AMSR-E ingest pipeline as it existed prior to our provenance capture effort had primitive support for tracking provenance kinds of information, but this record was limited in a number of ways as follows: 1. Full lineage information was not collected. There was no direct traceability back to the artifacts that directly influenced the generation. 2. Sometimes complete input information was not collected. An example is the level 3 rain product. Due to a limitation in the length of the input pointer field in the metadata, only the first 900 products are listed. 3. Change notification was based on a “paper push” model. There was no simple, automated mechanism for the data user to request information on previous or more recent versions of a given product. Versions were poorly tracked. 4. Provenance information was embedded in the data. Existing provenance-like information was captured in the inventory metadata and stored in the files. The drawback with this approach is that once the data were removed (e.g., replaced by a newer version), so was the provenance information. 5. Some provenance and quality information was not available to users. Production history and quality assurance (QA) files were generated but not distributed to data users. While production history information was embedded in the metadata stored within the file, the QA files may have contained information not available elsewhere. 6. Comparison of two versions of data products was cumbersome. To compare two different versions of a data prod-
Figure 1. Provenance collection of Advanced Microwave Scanning Radiometer–Earth Observing System (AMSR-E) ingest pipeline.
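As a concrete illustration of the adaptor-style capture shown in Figure 1, the following minimal Python sketch parses a hypothetical application log into provenance records. The log format, field names, and output handling are assumptions for illustration only; they do not reproduce the actual AMSR-E pipeline or its adaptor.

import json
import re

# Assumed (hypothetical) log line format:
# "2011-02-17T10:03:22 L2_RAIN used=AMSRE_L1A_0123.hdf produced=AMSRE_L2_RAIN_0123.hdf"
LOG_PATTERN = re.compile(
    r"(?P<time>\S+)\s+(?P<task>\S+)\s+used=(?P<used>\S+)\s+produced=(?P<produced>\S+)"
)

def parse_log(path):
    """Yield one provenance record per recognized log line."""
    with open(path) as log:
        for line in log:
            match = LOG_PATTERN.match(line)
            if match:
                yield {
                    "timestamp": match.group("time"),
                    "process": match.group("task"),
                    "inputs": [match.group("used")],
                    "outputs": [match.group("produced")],
                }

if __name__ == "__main__":
    # A real adaptor would forward these records to the provenance service;
    # here they are simply printed as JSON for inspection.
    for record in parse_log("ingest_application.log"):
        print(json.dumps(record))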
We address these limitations by treating each task in the ingest flow as a separate application and applying the capture techniques of Figure 1, where each task was instrumented separately using an appropriate instrumentation approach. Some of the use cases that drive this ongoing work are (1) comparison of two provenance graphs captured at different times, which picks up differences in algorithm versions; and (2) diagnosis of interference from new communication satellites with NASA satellites on select (lower-frequency) channels: can data errors introduced at this level propagate far enough to remain visible? Finally, provenance must provide a response to the complaint, "Something isn't right with this data product."

PROVENANCE COLLECTION, REPRESENTATION, AND STORAGE

The scientific workflow that is orchestrated by a workflow engine has gained acceptance in scientific communities as a means to automatically execute task sets having control, data, or time dependencies. Gil et al. (2006) elevated workflows to the status of a first-class entity in a cyberinfrastructure architecture, an acknowledgment of their primary role. A workflow is a graph consisting of a collection of vertices, V, and a collection of edges, E, where each edge in E joins a pair of vertices in V. Workflow graphs often exhibit multiply connected vertices. Each edge is unidirectional, indicating control flow in the execution of the graph. Execution of the graph can be envisioned as a single token on an edge that transfers from node V1 to node V2. The token can be a control message or a single data object. A node fires when a token has been received. For multiply connected vertices, the pointed-to node fires when a token has been received along each of the input edges. A node generates an output token (control message or data object) along each output edge.

Workflow functionality could be hand-coded into scripts, but scripts lack flexibility, and code must be added as new cases arise. When a task graph needs to be reordered, tasks swapped, or a delay introduced, workflow engines are easier to work with because new task graphs can be assembled easily and automatically. Workflows also allow for more flexible binding to input and output definitions and data. The workflow graph is a high-level, or declarative, description of what needs to be done, not how it is to be done. The declarative description allows for higher-level abstractions. For instance, a data object that flows between vertex V1 and vertex V2 could be specified as the "latest measurement reading from sensor network X." The logic to resolve this, which might be a query to a real-time data repository, could be plugged into a workflow engine with ease. In a hand-coded script, however, it represents another special case that must be coded into an already complex script.

The capture of provenance during workflow execution seems most logically carried out closest to where the data are generated, because this is where the most is known
about a data object. However, scientific applications are often community codes, written in Fortran and having ad hoc ways of being invoked and parameters passed that preclude solutions requiring direct modifications to the code base. In order to be used within a cyberinfrastructure system (CI system), the black box applications are “wrapped” as Web services. This wrapper, which provides the orchestration tool with a uniform interface to talk to all the various scientific codes, is also where status monitoring and provenance collection can take place because the wrapper has some level of information about tasks and their inputs and outputs. Provenance reflects not only details of inputs, outputs, and task execution, but information about who executed what and on whose behalf and so on. As such, collection takes place in several places. As shown in Figure 2, the researcher works with the CI system through a Web portal. At the portal, the user is identified and constructs a workflow graph. Provenance collection is done at the portal to capture the user’s session. A workflow engine executes the workflow script on a user’s behalf, so capture occurs there as well. Finally, capture is done within the Web service that wraps the application codes. Provenance events travel from the local instrumentation points to the remote provenance service.
This can be through a synchronous or asynchronous send-receive message protocol or by putting the message on an enterprise service bus (also called a publish-subscribe system) on which the provenance service is listening. The Karma provenance collection and management system (Simmhan et al., 2008), which we built to carry out the collection shown in Figure 2, captures both process provenance and data provenance in user-driven workflow systems. Process provenance records the execution of a single process (i.e., task); data provenance records the derivation path of a data product. Provenance is collected on the basis of discrete provenance activities during the life cycle of a workflow execution. These activities can be aggregated to form complex data and process provenance graphs that can span across workflows. Unlike other provenance-enabled workflow systems (e.g., VisTrails; Scheidegger et al., 2008), Karma is not tightly coupled to a particular workflow system and can capture and store streaming provenance information. This gives Karma the flexibility to capture provenance from different systems. Karma provenance collection is implemented as modular, pluggable instrumentation so as to be usable in diverse architectures composed of, say, Axis 2 Web services, Java classes, and message bus listeners.
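To make the instrumentation points concrete, the following minimal Python sketch shows the kind of pluggable "sensor" a service wrapper might call around each application invocation. The class name, event names, and fields are illustrative assumptions and do not reproduce Karma's actual interfaces or message formats.

import time
import uuid

class ProvenanceSensor:
    """Collects provenance events for one service invocation and hands them
    to a publish function (e.g., a message-bus send); the default just prints."""

    def __init__(self, workflow_id, service_id, publish=print):
        self.workflow_id = workflow_id
        self.service_id = service_id
        self.invocation_id = str(uuid.uuid4())
        self.publish = publish

    def _event(self, kind, **fields):
        event = {
            "kind": kind,                       # e.g., invocation_started, data_produced
            "workflow_id": self.workflow_id,
            "service_id": self.service_id,
            "invocation_id": self.invocation_id,
            "timestamp": time.time(),
        }
        event.update(fields)
        self.publish(event)

    def invocation_started(self, inputs):
        self._event("invocation_started", inputs=inputs)          # process provenance

    def data_produced(self, data_id, url):
        self._event("data_produced", data_id=data_id, url=url)    # data provenance

    def invocation_finished(self, success):
        self._event("invocation_finished", success=success)

# Sketch of a service wrapper using the sensor around a black-box application.
# Assumes `app` returns a mapping of output data IDs to their locations.
def run_wrapped_task(workflow_id, service_id, app, inputs):
    sensor = ProvenanceSensor(workflow_id, service_id)
    sensor.invocation_started(inputs)
    try:
        outputs = app(inputs)                   # call the wrapped (black-box) code
        for data_id, url in outputs.items():
            sensor.data_produced(data_id, url)
        sensor.invocation_finished(True)
        return outputs
    except Exception:
        sensor.invocation_finished(False)
        raise

In a deployment, the publish function would send each event to the provenance service over a send-receive call or place it on the enterprise service bus described above.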
Provenance Capture Figure 2. Points at which provenance collection occurs in a workflow-driven system: at the portal, at the workflow engine, and at the Web service–wrapped application tasks. The user creates a workflow script (in BPEL [Business Process Execution Language]) that gets sent to a workflow engine. The workflow engine invokes application tasks that run within a service wrapper. Provenance capture occurs on the client side and during workflow execution.
The architecture of the Karma system is shown in Figure 3. The Karma provenance service can receive events by subscribing and listening to an enterprise service bus through a "subscriber interface." When it receives an event, the corresponding provenance handler retrieves the provenance information and invokes the "ingest core" to store or update provenance data in the relational database. Karma also supports a direct send-receive interface, shown as "synchronous ingest web service." In this way, the system can be configured more simply, without an enterprise service bus, which may not be hosted in some research laboratories. Karma is being extended to accept provenance graphs in the Resource Description Framework (RDF) format. To access and retrieve provenance data stored in the database, the user queries the database through the "query interface," which draws upon knowledge discovery tools to complete the provenance record. Karma provides clients with tools such as a query client, a visualization client, and a client used to retrieve data in a form suitable for digital preservation.

Figure 3. Architecture of a provenance collection system. The application generates events that are passed to the collection system. Client interfaces, upper right, retrieve provenance for use. The service layer and core layer provide ingest, storage, and knowledge-generation capability. OPM—Open Provenance Model; XML—Extensible Markup Language; REST—Representational State Transfer.

Provenance Representation

Provenance can be represented and stored using different technologies, including relational databases (e.g., REDUX; Barga and Digiampietri, 2008), semantic Web technologies such as RDF
(e.g., myGrid; Zhao et al., 2008), and internal private formats (e.g., PASS; Holland et al., 2008). Relational databases together with Extensible Markup Language (XML) views are also used (e.g., VisTrails; Scheidegger et al., 2008). Provenance data can be organized in layers for efficient storage. REDUX stores the provenance data in a multilayer model with increasingly detailed and instantiated workflow information from abstract representations down to fully instantiated and executed workflows. The VisTrails provenance information is organized into three layers: the workflow evolution layer, which captures the relationships among the series of workflows created in an exploratory task; the workflow layer, which consists of specification of individual workflows; and the execution layer, which stores run-time information about the execution of workflow modules. The provenance in Chimera (Foster et al., 2002) is represented in the Virtual Data Language, which represents data products as abstract typed data sets and their materialized replicas. Computational process templates, called transformations, are scripts in the file system. The parameterized instance of the transformations, called derivations, can be connected to form workflows that consume and produce replicas. The provenance manager of VIEW (Lin et al., 2008) contains three layers: the provenance model layer, the relational model layer, and the model-mapping layer. The provenance model layer represents scientific workflow run provenance via domain ontologies; the
relational model layer stores and queries provenance metadata; and the model mapping layer performs schema mapping, data mapping, and query mapping for the other two layers. Unlike these systems, in which workflows have to be predefined before execution, Karma can capture and store the provenance of data products generated by workflows that have not been predefined. The two-layer information model we propose (Cao et al., 2009) contains a registry layer, which contains metadata about the instance, and an execution layer. As shown in Figure 4, the registry level has similarities to registries used in Web service architectures in that it contains information about services and data products at a sufficient level of detail to support discovery and automated decisions about whether or not to bind a particular data product or service. The registry for provenance is not used for binding, but it needs to contain sufficient information to build a data object that can be preserved indefinitely. The execution level captures instance invocation and execution details of
a particular sequence of actions. Note how the structure of the top layer mirrors the bottom. The two-layer model recognizes commonalities in workflows and stores that common information consistently and without redundancy. In more detail, the registry level records the metadata of services and data that may be used in an execution sequence. It also serves to capture the structure of workflows when the structure can be known in advance. The registry level contains three primary entities: abstract service, abstract method, and abstract data product. There are two special abstract services: composite service and opaque service. A composite service represents a workflow having member services; the relationship is represented by “has subservice.” An opaque service represents a black-box workflow in which the internals of services involved are not known. For predefined workflows, workflow structure in terms of services is represented by the relationship “has next-service,” which means a service has another service
following it. An abstract service can have zero or more abstract methods. Each abstract method has zero or more data products as input and output. The input/output parameter could be either a granule or a collection. If an input/output parameter for a method has multiple components, it is represented as an "abstract data collection"; otherwise, it is an "abstract data granule." In summary, the registry level captures the metadata of services (abstract Web Services Description Language [WSDL]), the methods inside a service, the name and type of input parameters and output results of each method, and the structure of workflows in terms of services for predefined workflows. The order of method execution is recorded in the execution level by method invocation.

Figure 4. Information model for representing execution-level provenance (bottom), i.e., based on information gathered only at run time, and registry-level provenance (top), which captures detailed information gathered from nonexecution sources. WSDL stands for Web Services Description Language, a language that encodes the definition of a Web service.

The execution level models instances, that is, execution-time information about method invocations and the data products used or generated by each invocation. Service, method invocation, and data product in the execution level are, respectively, instances of abstract service, abstract method, and abstract data product in the registry level. The client is an entity that initiates workflows or services; it could be a user, a workflow engine, or another program. A client, a method, or a service (including a workflow, which is a composite service) is an entity. Invocation of a workflow, service, or method is represented by an entity invoking another entity. The execution information of an invocation includes request/response status and time stamps. The location, initialization, and termination time stamps of a service are recorded by a service instance. The data values, data size and location, and the time stamp of data product generation or use are recorded as provenance for a data object. Similar to the registry level, the input/output parameter for a method could be either a granule or a collection.

Compared to the layered model as used, for instance, by Barga and Digiampietri (2008), Karma has three advantages. First, Karma can capture and store streaming provenance data, i.e., the workflow can be unknown until execution. Second, for services and data products that may be used by multiple workflows, Karma stores the metadata only once in the registry level to provide uniformity and avoid redundancy. Third, for predefined workflows, Karma stores the structure of the workflow only once at the registry level and separate instances of run-time provenance at the execution level, which allows efficient storage for repeated workflow executions.

Application of Information Model

From the information model, we constructed a database schema used for provenance data storage and query. The provenance database schema represents the information model with a 1:1 mapping to store and query provenance data. Following the information model, the database tables are also in two levels. The primary tables at the registry level include abstract service, abstract method, abstract data product, and the input/output relationship between abstract data product and abstract method. There are three auxiliary tables to store composite service, workflow structure, and data collection.
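The split between registry-level (abstract) entities and execution-level instances can be sketched with a few Python data classes. The class and field names below are illustrative choices based on Figure 4, not Karma's actual schema, and the real system stores these entities as relational tables.

from dataclasses import dataclass, field
from typing import List, Optional

# Registry level: abstract descriptions that may be shared by many runs.
@dataclass
class AbstractDataProduct:
    name: str
    type_annotation: str          # e.g., "granule" or "collection"

@dataclass
class AbstractMethod:
    name: str
    inputs: List[AbstractDataProduct] = field(default_factory=list)
    outputs: List[AbstractDataProduct] = field(default_factory=list)

@dataclass
class AbstractService:
    name: str
    abstract_wsdl: str            # pointer to the abstract service description
    methods: List[AbstractMethod] = field(default_factory=list)
    next_service: Optional["AbstractService"] = None   # "has next-service" for predefined workflows

# Execution level: run-time instances of the abstract entities above.
@dataclass
class DataProduct:
    data_id: str
    url: str
    size: int
    instance_of: AbstractDataProduct

@dataclass
class Invocation:
    invoker_id: str               # entity (client or service) that initiates the call
    invokee_id: str               # entity (method or service) being invoked
    request_send_time: float
    response_receive_time: float
    is_success: bool
    inputs: List[DataProduct] = field(default_factory=list)
    outputs: List[DataProduct] = field(default_factory=list)

Here the instance_of field plays the role of the "is instance of" link in Figure 4, and next_service stands in for the "has next-service" relationship used for predefined workflows.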
The tables at the execution level are derived from those used in Karma2 (Simmhan et al., 2008), with the extension of client, method, and data collection to comply with the information model. The service, method, and data product tables store the run-time instance data. Note that although the tables at the execution level store instance data for the tables at the registry level, the primary-key/foreign-key relationships between these two levels are not mandatory. We represent a method invocation in terms of an invoker and an invokee, where the invoker is the one who initiates the invocation, and the invokee is the one who is invoked. Both invoker and invokee are entities. The invoker can be a client (i.e., client method invocation) or a service (i.e., service method invocation). We assign an entity identifier (ID) to each invoker and invokee, and these IDs are used to represent the invocation relationship.

OBSERVATIONAL DATA STREAMS IN WORKFLOWS

A unique element of geoscience research is the multitude of data generated in real time about the condition of the ocean floor, the atmosphere, riverbeds, even the climate variation from bottom to top of a large sequoia. Doppler radar data, for instance, can be captured for a metropolitan region and mined as they are generated from the radars, looking for developing severe-storm signatures. When an appropriate signature is detected, a workflow can be triggered. This approach was demonstrated in the Linked Environments for Atmospheric Discovery (LEAD) project system (Droegemeier et al., 2005; Vijayakumar and Plale, 2007). Our earlier LEAD approach suffered from a cognitive dissonance between the one-time execution of a workflow graph and the processing of a stream of continuously arriving events. We stepped back, reexamined the problem, and adopted an approach that tries to fold event processing into the graph model and graph execution model of workflow systems. The architecture of this approach is discussed in this section.

To address continuously arriving data from instruments, the graph model must be extended to accommodate an input edge that has multiple tokens on it, and a workflow node that executes repeatedly until each token in the set has been ingested and processed. Other, single-token input edges to the same node have their token consumed repeatedly, once for each token on the time-series edge. For example, consider vertex V2 in Figure 5. The edge (Vs,V2) has a time-series input of tokens e1, e2, …, e5, and edge (V1,V2) is a conventional input of one event, d1. The execution semantics dictate that V2 is invoked for each event in the time series with input sets (e1,d1), (e2,d1), …, (e5,d1). Workflow outputs are saved in output variables, but workflows that interact with time series of events tend to produce a time series of output events that can be returned to the system as another event stream.

Scientific workflow systems provide the means for composing and executing workflows. The LEAD cyberinfrastructure, for instance, uses the XBaya (Gannon et al., 2007) tool for workflow composition and monitoring, and the Apache Orchestration Director Engine (ODE) workflow engine for workflow execution.
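The execution semantics described above for a time-series edge, namely invoking the node once per arriving event while reusing the single conventional input, can be sketched in Python as follows. The function and variable names are illustrative and do not correspond to XBaya, ODE, or LEAD code.

def run_active_node(node, stream_events, static_input):
    """Fire `node` once per event on the time-series edge, pairing each
    event with the single token on the conventional edge, and collect
    the outputs as a new time series."""
    output_stream = []
    for event in stream_events:
        output_stream.append(node(event, static_input))
    return output_stream

# Toy example: Vs produces events e1..e5, V1 produces a single product d1,
# and V2 combines each event with d1.
def v2(event, data_product):
    return f"processed({event},{data_product})"

outputs = run_active_node(v2, ["e1", "e2", "e3", "e4", "e5"], "d1")
print(outputs)   # one output per input event, forming an output time series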
XBaya provides a drag and drop interactive interface for composing workflows out of Web services and data sources in a representation that is workflow language agnostic. XBaya can compile the graphical workflow representation into a language called WSBPEL (Web Services Business Process Execution Language), Jython execution scripts, and Taverna Scuffl scripts (Zhao et al., 2008). In normal mode of operation, XBaya-composed workflows are deployed to the open-source Apache ODE. Once a workflow is deployed, it can be invoked multiple times. Actual execution of the workflow takes place inside the Apache ODE workflow engine. The interactions of LEAD workflow components are more comprehensively discussed by Gannon et al. (2007), and the challenges of workflow systems are discussed in Gil et al. (2006). For the user to take one workflow node in a workflow graph and convert it to take a stream instead of a single input, it requires new nodes to be added. One, illustrated in Figure 5, is the “active node.” The active node takes a sequence of events (e.g., time series) as input and produces a time series as output. This node behaves differently from standard nodes in that it executes repeatedly over all the events in a sequence before stopping. An alternate approach would be to add a control loop that exits V2 and loops back around to feed back into V2, but a control loop fails to reflect the changed input/output signature of an active node, particularly with respect to the time-series output. The actions taking place under the hood to support the active node are illustrated in Figure 6, which in part A shows the workflow graph before time-series abstractions are added. Part B shows the stream source added to a workflow and a node’s inputs configured to be a stream input, represented by the replacement of a single stream input source (in A) with two input sources (in B). This simple programming model extension was chosen because it is easier for a researcher to integrate events processing into composition of conventional scientific workflows that she is familiar with, than to specify events processing separately, which can require the specification of complex behavior. Once a streaming workflow is composed, step 1, and deployed to the workflow engine (steps 2 and 3), the time-series
data sources are consulted (step 4), and connections are made to the deployed workflow (step 5). The composed workflow is compiled into two executable workflows, shown as C and D in the figure. The control workflow (C in Fig. 6) initiates the execution and receipt of a message from the event-processing system and dispatches it to the child workflow. The child workflow, D, is the original scientific workflow. The execution of the deployed workflow begins by invoking the control workflow, which contacts the complex event-processing system to inform it that the workflow is alive and ready to receive events. After that, the workflow waits for observational data readings to arrive from the stream service. When the control workflow receives the next data reading (step 7), it invokes the child workflow with the data reading just received (step 8), and the child workflow begins executing. Meanwhile, the control workflow loops back and waits for another event from the stream service. It should be noted that a new instance of the child workflow is executed for each event received at the control workflow in this scenario. This execution pattern results in the child workflow producing an output for each invocation. These outputs in fact constitute a stream of their own, and in this framework that time series is published back to the stream-processing system (step 9).

The simple programming model shown to the user is additionally supported through new semantics in the BPEL script to implement the control workflow, C in Figure 6. Specifically, the semantics are needed to implement a stream-receive node, as follows:

while (workflow lifetime not ended) {
    Receive(X);
    Invoke(Service-1, X);
}
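The same stream-receive pattern can be sketched outside BPEL. The following minimal Python sketch uses hypothetical names (it is not LEAD or Apache ODE code) to show a control loop that launches one child-workflow run per incoming event and republishes the outputs as a new stream.

import queue
import threading

def child_workflow(event):
    """Stand-in for the original scientific workflow (D in Fig. 6)."""
    return f"forecast_from({event})"

def control_workflow(incoming, outgoing, lifetime_events=5):
    """Control workflow (C in Fig. 6): receive an event, invoke the child
    workflow with it, publish the result, and loop back for the next event."""
    for _ in range(lifetime_events):          # stands in for "workflow lifetime not ended"
        event = incoming.get()                # Receive(X)
        result = child_workflow(event)        # Invoke(Service-1, X)
        outgoing.put(result)                  # publish the output back as a stream (step 9)

incoming, outgoing = queue.Queue(), queue.Queue()
worker = threading.Thread(target=control_workflow, args=(incoming, outgoing))
worker.start()
for reading in ["r1", "r2", "r3", "r4", "r5"]:    # observational data readings
    incoming.put(reading)
worker.join()
print([outgoing.get() for _ in range(5)])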
This stream-receive logic allows the control workflow to wait for an incoming event X from the complex event-processing system, and once that event is received, the control workflow invokes a node service-1 with the incoming event X. In the scenario of Figure 6, service-1 would be the child workflow representing the actual experiment. This abstraction allows the events in the stream to be executed within the context of a workflow execution and to be managed and monitored as a conventional workflow. As shown in the thick light-blue lines of C in Figure 6, the incoming and outgoing data flows to the stream-receive node have different cardinality, and those data-flow edges now represent streams rather than conventional data flow. This opens up an opportunity to manipulate streams in the workflow itself using stream operators such as select, join, and aggregation. This is part of our ongoing research.

Figure 5. Graphical interface view of the way in which the active node operates. When a workflow node is made "active," instead of taking a single data object as input, it can take a sequence of objects.

Figure 6. Under the hood: architectural extensions to workflow construction to support event sequences. The Query API (Query Application Programming Interface) is where queries are submitted. When a query is submitted, an EPR (end point reference) is returned, which can be thought of as the end of a hose out of which events of interest will flow. BPEL—Business Process Execution Language.

ARCHITECTURE FOR PROTOTYPE
Data sets are an increasingly valuable part of our heritage. However, for data to remain valuable years after it was generated or collected, it must be accompanied by a description of itself. For a data object to be usable 10 and 100 yr from now, it must carry with it information that helps make it usable. Provenance about
a data object or collection can be captured, and we described earlier a capture and representation system that supports just this, but the bundling of provenance with the data objects is something we did not discuss. We discuss this in the final section. Figure 7 outlines a possible system for gathering the provenance and domain-specific metadata of a complex data object. The complex data object in question could be of a forecast model execution. To fully understand a weather forecast, one needs to know how the
model was configured, what input files were used, and how these input files were generated. The machine on which the model was run is important, as is the version of the model code. One might even want information about the disks on which the data were stored, particularly if there is suspicion of low reliability for some storage systems. Figure 7 shows gathering of the preservation information from the distributed sources where it resides. We break the functionality into three components: archive and preservation, query and discovery, and dissemination. We then discuss each in turn.

Figure 7. Steps in gathering metadata and provenance to build a preservation object. CI—cyberinfrastructure; XMC Cat—XML Metadata Concept Catalog.

The archive and preservation function creates the preservation bundle from an experiment upon a user request. When the user makes a request, they pass along an experiment ID and their security credential. The preservation module uses the user credential and experiment ID to contact a metadata catalog. We use the XML Metadata Concept Catalog (XMC Cat; Jensen and Plale, 2008). XMC Cat is a scientific metadata catalog that was designed based on an observed common structural pattern among XML schemas of many e-Science domains. A community can use XMC Cat for their domain-specific metadata by loading their community-defined XML schema into it. XMC Cat stores and indexes complex hierarchical metadata attributes, and it answers complex queries based on domain-specific metadata and/or data structural information. The particular instance of XMC Cat that our prototype preservation system contacts manages the metadata for scientists who signed up to use a personal workspace in the LEAD Science Gateway. It returns the complete metadata information about the experiment, as well as the collections and data products related to the experiment and their metadata. The preservation system then contacts the Karma provenance service and obtains invocation information about the processes in the experiment, the ingest and generation by these processes of the artifacts previously returned from the XMC Cat, and other entity and causal relationship details relevant to the experiment. At this point, the preservation system would have gathered enough information at levels 1, 2, and 3 to make the preservation bundle complete. It then contacts a name-resolution service to resolve logical IDs of the artifacts to physical URLs and transfers the data files to assemble an archive information package. The preservation system could optionally be configured to be aware of additional services, such as service registries or even source code repositories. If this were the case, the preservation service would send requests to retrieve additional information, including service WSDLs, deployment configuration details, and source code files. These pieces would be added to the archive bundle.

After archive bundles are created and preserved, the information should be indexed so that future users can issue queries for discovery. Although detailed metadata and provenance are being collected throughout the life cycle of the experiments, they do little good if they are buried inside an archive bundle, so the information must be exposed to the outside world to answer queries. Some digital repositories such as FedoraCommons (2010; see http://www.fedora-commons.org) provide native support for metadata in the form of the 15 Dublin Core elements (2010; see http://dublincore.org/documents/2010/10/11/dces/). These elements are generic in nature and provide minimal metadata support. While this may be sufficient for documents in digital libraries, researchers in e-Science are more likely to be interested in domain-specific metadata and make discoveries by forming
queries over these attributes. In addition, the primary function of a preservation system is to preserve data and knowledge; indexing metadata attributes and answering queries are not in its scope, nor should they be its responsibility. It is best if we leave such tasks to tools specialized for such jobs, namely, metadata catalogs. For the discovery of preserved objects, we believe a viable approach for rich discovery in e-Science is to attach a metadata catalog such as XMC Cat to the preservation system to handle domain-specific queries. When the preservation system retrieves metadata, provenance, and other service-related information during the archiving phase, it sends the information to this XMC Cat for indexing. Later, when a user tries to search for earlier experiments, he/she would first form a query to the metadata catalog. XMC Cat returns entries that match the query. The user then uses the IDs of these entries to retrieve the archive bundle from the preservation system.

The dissemination of a preserved object is through a lightweight module, lightweight because the preservation system does not try to achieve reproducibility. Its sole function is to hand a client an archive bundle based on the ID provided. For our prototype, an archive bundle is a ZIP file containing all data and information together, which is stored on a file system. Eventually we will look into multiple replicas of the same archive bundle to guard against disasters, which would require the addition of a replica manager. We will also explore the trade-off between having each archive bundle as a prepackaged ZIP file versus having its contents scattered as individual files and creating a single ZIP package upon dissemination request.

EXPERIMENTAL EVALUATION

We examined the performance of the preservation system to identify latencies that might suggest alternate approaches or optimizations. Our plan is to extend a standard digital repository such as FedoraCommons using services that bridge the gap between metadata and provenance generation and archival. The first step reported here quantifies costs. Our experimental evaluation uses provenance data generated by three different workflows that were executed by a workflow emulator, workflowEM (Ramakrishnan et al., 2010), and it creates archive packages from them. We chose these three because they represent workflows of varying sizes. We measured the performance of key steps under workloads of varying sizes. All software programs with the exception of the XMC Cat were executed on an Intel Core2 Duo machine with both cores clocked at 1.60 GHz, 2 GB of memory, running Windows Vista 32-bit Ultimate edition. The XMC Cat server ran on a server with 2 Dual Core AMD Opteron processors with each core clocked at 1 GHz, 16 GB of memory, running Red Hat Enterprise Linux 4.

For each workflow, the first metric measured is the time it takes to retrieve metadata from XMC Cat and to "synthesize" the metadata document. "Synthesis" is necessary because the XMC Cat hosts metadata from real-world experiment runs, whereas workflowEM provides simulated runs, so we need to modify the
metadata document given by the XMC Cat to ensure that artifacts' names and data object locations on the Web (their URLs) are consistent with those in the workflow notifications. After "synthesis," the metadata document contains a complete list of artifacts involved in a workflow, so we can measure the performance time of the next step, which is to store these artifacts into the registry layer of Karma, where the metadata are stored as annotations per OPM version 1.1. Using this list of artifacts, information for a complete OPM graph can be populated in the Karma registry layer by sending queries for process information and other causal dependency information from the execution layer of Karma. We then measure the performance of this step. We verify the OPM graph from the Karma registry by querying for OPM graphs using each artifact ID and process ID to check whether they result in the same graph (currently, there are no agent entities in any of these workflows, so agent IDs are not used). While the verification of graph correctness is not part of our performance evaluation, we measure the time it takes each ID to retrieve the graph. The version of XMC Cat we use is part of the LEAD cyberinfrastructure, so we do not have metadata or data for the Animation and Motif workflows; we use synthetic data products and metadata instead. Lastly, we measure the time it takes to create a ZIP file from each workflow. In addition to XML serialization of the OPM graph and the "synthesized" metadata documents returned from the XMC Cat, we also archive physical artifacts. The source code files we include are generated as random ASCII files, 10 MB in size. In reality, source code files rarely reach this size, so by archiving a file of this size, we can estimate an upper bound on the performance time.

Types of Workflow

The first workflow we use is a LEAD North American Mesoscale (NAM) initialized forecast workflow. This workflow represents a small-scale workflow (in terms of number of nodes), consisting of 6 tasks and 9 artifacts, and the size of its artifacts varies from 0.2 MB to 2.4 GB. The second workflow is a medium-sized one that renders computer animation frames to create a movie. This workflow has a total of 22 tasks that together use and generate 42 artifacts. The sizes of the artifacts also range widely, from as small as 10 KB to as large as 500 MB. The third workflow is Motif, for analyzing genomes; it is large, having a width of 135 parallel tasks, 138 total processes, and 275 artifacts.

The time it takes to register a workflow execution to Karma is dependent on the size of the workflow graph. This is demonstrated by Figure 6. The smaller LEAD workflow, with its 6 tasks, takes less than one eighth of the time of the Motif workflow, with its 138 processes and 275 artifacts (data products). The time it takes to package an archive bundle is measured for one step of the task, that is, the time it takes to compress, or zip, the provenance, metadata, and data into an archive bundle for preservation. The dominant performance factors are the number of artifacts and their size, as shown in Table 1.
TABLE 1. TIME REQUIRED TO CREATE ARCHIVE BUNDLES AND SIZES OF BUNDLE BEFORE COMPRESSION

                                        LEAD workflow   Animation workflow   Motif workflow
Bundle time (s)                               408              215                303
Aggregated size of artifacts (MB)            3672             2010               2793
Aggregated size of source code (MB)            60              220               1380
Total size (MB)                              3732             2230               4173
From this table, we see that the small LEAD workflow takes the longest time to archive. This is because the other two workflows, although large in terms of nodes, have mostly identical services, so there are fewer unique services than in the LEAD workflow. In addition, the total size of artifacts in LEAD exceeds that in the other two workflows. It may seem surprising that the total uncompressed size of the Motif workflow experiment is the highest, yet its performance time is shorter than that of the LEAD workflow. This is because, although the total size is the largest for the Motif workflow, a large portion of it goes to the simulated service "source code" files, which all point to a single 10 MB ASCII character file, whereas the other "experiment data" files are binary. Therefore, the ZIP compression algorithm is able to process the ASCII files much faster than the binary files because the ASCII files yield a better compression ratio.

CONCLUSION

Digital collections that are generated during the course of scholarly investigation have the potential to significantly advance the scientific enterprise. Their value and probability for reuse, however, depend upon the quality of their description. Collections that have minimal information are often not used, and this is why much scientific data are written once and never read. Our research investigates ways to collect provenance information easily such that the burden is low on both the developer and the user of geoscience application codes. Our investigations began with provenance collection in workflow-driven systems. We then extended collection beyond workflow-driven systems to examine provenance collection wherever it is needed, including cloud computing infrastructures, where the sheer scale of industry-built data centers promises high availability and reliability at low cost to any user—an attractive value proposition for the scientific community, which has dealt with issues of reliability as systems grow larger and more complex. Our goal is seamless, low-burden collection of provenance information that can facilitate the long-term preservation and reuse of important digital geospatial scientific data collections.

ACKNOWLEDGMENTS

We thank Keith Brewster of Oklahoma University, and Jeffrey Cox, Girish Subramanian, and Suresh Marru from Indiana University for helpful discussions on this work. We thank Unidata for assistance and involvement. This work was funded by the National Science Foundation under grants OCI-0721674,
ATM-0331480, and CSR-0720580, and through a grant by the Eli Lilly Corporation.

REFERENCES CITED

Barga, R., and Digiampietri, L., 2008, Automatic capture and efficient storage of e-Science experiment provenance: Concurrency and Computation, v. 20, no. 5, p. 419–429, doi:10.1002/cpe.1235.
Cao, B., Plale, B., Subramanian, G., Robertson, E., and Simmhan, Y., 2009, Provenance information model of Karma, in Third International Workshop on Scientific Workflows (SWF'09): Proceedings of the 2009 Congress on Services, Washington, D.C., IEEE (Institute of Electrical and Electronics Engineers) Computer Society Press, p. 348–351, doi:10.1109/SERVICES-I.2009.54.
Droegemeier, K., Brewster, K., Xue, M., Weber, D., Gannon, D., Plale, B., Reed, D., Ramakrishnan, L., Alameda, J., Wilhelmson, R., Baltzer, T., Domenico, B., Murray, D., Wilson, A., Clark, R., Yalda, S., Graves, S., Ramachandran, R., Rushing, J., and Joseph, E., 2005, Service-oriented environments for dynamically interacting with mesoscale weather: Computing in Science and Engineering, Computer Society Press and American Institute of Physics, v. 7, no. 6, p. 12–29.
Foster, I., Vöckler, J., Wilde, M., and Zhao, Y., 2002, Chimera: A virtual data system for representing, querying, and automating data derivation, in Proceedings of the International Conference on Scientific and Statistical Database Management: Washington, D.C., IEEE (Institute of Electrical and Electronics Engineers) Computer Society Press, p. 37–46.
Gannon, D., Plale, B., Christie, M., Huang, Y., Jensen, S., Liu, N., Marru, S., Pallickara, S.L., Perera, S., Shirasuna, S., Simmhan, Y., Slominski, A., Sun, Y., and Vijayakumar, N., 2007, Building grid portals for e-Science: A service oriented architecture, in Grandinetti, L., ed., High Performance Computing and Grids in Action: Amsterdam, the Netherlands, IOS Press, p. 1179–1185.
Gil, Y., Deelman, E., Ellisman, M., Fahringer, T., Fox, G., Gannon, D., Goble, C., Livny, M., Moreau, L., and Myers, J., 2006, Report on the 2006 NSF Workshop on Challenges of Scientific Workflow: http://vtcpc.isi.edu/wiki/images/3/3a/NSFWorkflowFinal.pdf.
Holland, D., Seltzer, M., Braun, U., and Muniswamy-Reddy, K., 2008, PASSing the provenance challenge: Concurrency and Computation, v. 20, no. 5, p. 531–540, doi:10.1002/cpe.1227.
IU (Indiana University), 2010, Weather Prediction Technology Supports National Tornado Research Project: http://newsinfo.iu.edu/news/page/normal/14369.html (accessed March 2011).
Jensen, S., and Plale, B., 2008, Using characteristics of computational science schemas for workflow metadata management, in 2nd International Workshop on Scientific Workflows (SWF'08), Congress on Services: Washington, D.C., IEEE (Institute of Electrical and Electronics Engineers) Computer Society Press, http://doi.ieeecomputersociety.org/10.1109/SERVICES-1.2008.42.
Kowalczyk, S.T., 2007, Towards a preservable object: A Qualifying Examination, 29 October 2007; available at http://ella.slis.indiana.edu/~skowalcz/Qual/STK_Qualification_10-30.pdf (accessed July 2011).
Leake, D.B., and Kendall-Morwick, J., 2008, Towards case-based support for e-Science workflow generation by mining provenance, in Proceedings of the 9th European Conference on Advances in Case-Based Reasoning: Berlin, Heidelberg, Springer-Verlag, p. 269–283.
Lin, C., Lu, S., Lai, A., Chebotko, A., Fei, X., Hua, J., and Fotouhi, F., 2008, Service-oriented architecture for VIEW: A visual scientific workflow management system, in Proceedings of IEEE (Institute of Electrical and Electronics Engineers) Conference on Services Computing: Washington, D.C., IEEE Computer Society Press, p. 335–342, doi:10.1109/SCC.2008.118.
Moreau, L. (ed.), Plale, B., Miles, S., Goble, C., Missier, P., Barga, R., Simmhan, Y., Futrelle, J., McGrath, R., Myers, J., Paulson, P., Bowers, S., Ludaescher, B., Kwasnikowska, N., van den Bussche, J., Ellkvist, T., Freire, J., and Groth, P., 2008, OPM: The Open Provenance Model (v1.01): University of Southampton Electronics and Computer Science Technical Report, 35 p., http://eprints.ecs.soton.ac.uk/16148 (accessed May 2011).
Ramakrishnan, L., Plale, B., and Gannon, D., 2010, WORKEM: Representing and emulating distributed scientific workflow execution state, in Proceedings of the 10th International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010): Washington, D.C., IEEE (Institute of Electrical and Electronics Engineers) Computer Society Press.
Scheidegger, C., Koop, D., Santos, E., Vo, H., Callahan, S., Freire, J., and Silva, C., 2008, Tackling the provenance challenge one layer at a time: Concurrency and Computation, v. 20, no. 5, p. 473–483, doi:10.1002/cpe.1237.
Simmhan, Y., Plale, B., and Gannon, D., 2005, A survey of data provenance in e-Science: SIGMOD Record, v. 34, no. 3, p. 31–36, doi:10.1145/1084805.1084812.
Simmhan, Y., Plale, B., and Gannon, D., 2006, Towards a quality model for effective data selection in collaboratories, in Institute of Electrical and Electronics Engineers Workshop on Workflow and Data Flow for Scientific Applications (SciFlow06): Washington, D.C., IEEE (Institute of Electrical and Electronics Engineers) Computer Society Press.
Simmhan, Y., Plale, B., and Gannon, D., 2008, Karma2: Provenance management for data-driven workflows: International Journal of Web Services Research, v. 5, no. 2, p. 1–22, doi:10.4018/jwsr.2008040101.
Vijayakumar, N., and Plale, B., 2007, Towards low overhead provenance tracking in near real-time stream filtering, in Moreau, L., and Foster, I., eds., International Provenance and Annotation Workshop (IPAW'06): Berlin, Springer-Verlag, Lecture Notes in Computer Science, v. 4145, p. 46–54.
Zhao, J., Goble, C., Stevens, R., and Turi, D., 2008, Mining Taverna's semantic web of provenance: Concurrency and Computation, v. 20, no. 5, p. 463–472, doi:10.1002/cpe.1231.
MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
Printed in the USA
The Geological Society of America Special Paper 482 2011
Theoretical foundations of the event bush method Cyril A. Pshenichny* VNIIOkeangeologia, Angliisky Prospect, 1, Saint Petersburg 190121, Russian Federation Oksana M. Kanzheleva† National Research University of Information Technologies, Mechanics and Optics, Kronverkskiy Prospect, 49, Saint Petersburg 197101, Russian Federation
ABSTRACT

This paper presents a description of the theoretical foundations of a novel method of geoinformatics, the event bush, and discusses its place and role in a newly formulated framework for information modeling of geoenvironments. The event bush addresses a particular yet very wide type of geoenvironment, that of directed alternative changes, which is likely to occur in many information domains. Modeling these environments, this method has reached the state of conceptual model and may become, with further development, one of the first examples of strict formal theory in the earth sciences. The method is exemplified by application to the complex volcanic environment of Mount Etna, Sicily.
INTRODUCTION

Information, along with matter and energy, is a fundamental property of the universe. However, in contrast to these, it has not been systematically studied by any of the natural sciences. Information is addressed by these disciplines only with numerical measures and in relation to matter and energy (Umpleby, 2007). The study of the behavior of information per se in the universe should be attributed to the science of informatics, perhaps as its principal theoretical goal. Studying means understanding, which, in turn, brings back such synonyms as interpretation, generalization, and, finally, modeling (in the broadest sense). Hence, one would not be wrong to say that a goal of informatics is modeling of the universe through
the information that exists in it. Like any generalization, modeling is based on some formal properties of objects. Since models are of objects of the so-called “real,” “material,” “objective,” or “physical” world, their information modeling must be based on general properties of information in this world. Generally, these properties merge into the field of “general” informatics, which is tightly interrelated with the mathematics and logic that provide formal patterns for all kinds of studies (mass, energy, and information, including, to our mind, general description of algorithms—see, e.g., Uspenskii and Semenov, 1993). However, mathematics and logic offer a vast but “mute” variety of formalisms, and the virtue of informatics is to make them “speak” about the properties of information seen in the world, “tuning” these formalisms by different information domains (geosciences,
*Current address: Geognosis Project, Center for Innovation and Technology, National Research University of Information Technologies, Mechanics and Optics, Birzhevaya Liniya, 14-16, Saint Petersburg 199034, Russian Federation; [email protected]. † [email protected]. Pshenichny, C.A., and Kanzheleva, O.M., 2011, Theoretical foundations of the event bush method, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 139–164, doi:10.1130/2011.2482(12). For permission to copy, contact [email protected]. © 2011 The Geological Society of America. All rights reserved.
life sciences, business activities, medicine, and so forth), where these or those properties may look more or less prominent. Such intersections of informatics with particular domains form "thematic" branches of informatics (geoinformatics, bioinformatics, business informatics, medical informatics, and others), which are all bidisciplinary sciences. From general informatics, they may borrow formalisms for application or for pre-application "tuning" (external testing in Fig. 1), a task that is quite intriguing in itself. However, perhaps yet more intriguing is the thematic appeal to general informatics, which actually begins the cycle of information studies. Specialists working in the sciences and humanities, observing the world, be it a landscape, a rock sequence, or business activities, may come up with some generalized vision of things in their domain. This vision later can receive a formalized notation and be tested, which literally means conveying it to the corresponding thematic informatics. If this internal testing is successful, the notation is transported upward, to general informatics, through which it comes to other thematic informatics for external testing, and if the latter tests are successful too, it becomes universal for tasks of a particular kind regardless of information domain (Fig. 1). It fructifies an appropriate mathematical formalism (or calls for a new one) and becomes an information technology.
In this sense, “geo,” “bio” (-informatics), and the like do not mean the field of application but merely the field of generation of an approach—e.g., geometry is related to Earth just like to anything else, but in ancient Greece, it appeared in the field of land use, and this is fixed in the name of the formalism. It is accepted that thematic branches of informatics exist to provide information services to particular fields (geosciences, biology, medicine, business activities, and so forth) by means of information technologies. However, as is seen from the previous, this is only the tip of the iceberg and perhaps the least interesting part of it. The real intellectual claim of informatics and its fundamental scientific value are in the effort of scientists working in particular fields (e.g., natural sciences) to shape their professional vision, their intellectual intuition, and work out a formalizable approach that would meet a formalism (or urge to create one) and make it “speak.” Thus, thematic branches of informatics, belonging to corresponding domains of science (earth sciences, life sciences, business studies, medicine), allow the latter to explore not only matter and energy transformations but also the behavior of information in their domain. This process, from first intuitive hints to creation of an information technology, can be termed “information modeling.”
Figure 1. Interaction between general informatics and thematic branches of informatics, on one side, and between general informatics and mathematical/logical disciplines, on the other. Cyclic interaction between general and thematic informatics begins at conveying a generalized yet not formalized vision of things from some particular thematic field of informatics to the general informatics and passes two circles until an information technology based on this vision is distributed to end users in any relevant thematic field of informatics, making the cycle of information studies full.
In this context, geoinformatics is a thematic branch of informatics that belongs to earth sciences and focuses on information modeling in this domain. It has acquired such information technologies as ontologies, taxonomies, and markup languages from general informatics, and it has enriched the latter with the concept of organization of spatial information (dominantly spatial data by now) known as geographic information systems or just geoinformation systems (GIS). GIS now can be applied to map not only Earth or other planets, but virtually everything that has, even imaginary or abstract, form, to which various qualities or properties may be related and which may need to be compared on spatial ground. Herewith, it looks natural to inquire about formalisms other than GIS that can appear in the womb of geoinformatics (1) to find wide cross-disciplinary application and (2) to form, along with the tools borrowed from general informatics, an armory of information modeling techniques for bodies or processes in, on, and near Earth addressed by other geosciences (let us term them "geoentities"). In recent years, an approach to knowledge organization seems to have been nurtured by geoinformatics that has already passed internal testing by a variety of tasks in volcanology, Quaternary geology, sedimentary geology, and other fields—the "event bush" (Pshenichny and Khrabrykh, 2002; Pshenichny et al., 2008, 2009).
This paper aims to provide a complete description of the event bush approach, from underlying intuitive vision to as strict a formulation as possible at the moment, for its further inspection and use in the geosciences and in informatics. This purpose implies the following tasks:
1. describe the vision that underlies the formalism,
2. list basic assumptions, definitions, rules, and corollaries,
3. exemplify the method by geoscientific application, and
4. discuss its place and role in the evolution of geoinformatics and general informatics.
All of these will be addressed here.
INTUITIVE BASEMENT: APPREHENSION OF GEOENTITIES

General Considerations

One feature that looks obvious to the majority of geoscientists is the ubiquitous interrelationship between objects of similar range in, on, or near Earth. To properly describe one geoentity, we almost always need information on similar geoentities. Indeed, it makes little sense to consider one stone, its shape, and composition, or the way it was formed, because an isolated stone is never formed alone. However, if we look at that single body of rock from which this and other stones were formed, we cannot avoid pondering about the position of this rock in a rock sequence, i.e., again, its interrelation with similar geoentities.
Also, if we focus not on objects (e.g., stones) but on the process that formed them, it becomes clear that, quantitatively, this process is a member of a range of processes (e.g., from slumping to rockfall), and, qualitatively, it never (or almost never) has only one cause. The same causes in other combinations produce different processes (e.g., boiling of groundwater may produce geysers in one case, a flow of lava fragments in another, and an outburst of host rocks in yet another case), so they need to be considered too in order to understand why this process, and not others, has taken place. Such a feature can be found in many domains (e.g., history, biology, business, etc.), and in all of them, the formalisms that can handle it might be useful, but in the geosciences, it is very prominent and likely to be captured by formalisms of potentially wide cross-disciplinary scope. In geology, this means that any geological body or process is environed (at least potentially) by more or less similar ones, and only their joint consideration can work to explain observations. Therefore, an important feature of information on the geoentities is that to explain even one of these, information is needed for the entire environment to which it belongs, and these environments, rather than objects themselves, should be the targets of the human mind and information modeling. We would like to leave the term “environment” as more or less clear by intuition and define it quite freely as a group of somehow interrelated entities (geoentities in the Earth information domain) that necessarily involves entities (bodies or processes) otherwise united in one class by some of their important (not accidental) properties. There seems to be parallelism between environment and another pertinent term, “system” (Bogdanov, 1926; Von Bertalanffy, 1968), but their interrelation, deserving a special study, is not directly related to the purpose of this paper. However, one point should be emphasized, that the concept of system implies consideration of matter and energy transformations (which are the virtue of physical modeling), while environment as stated here is subject to information modeling. By the way, information modeling of environments should not be confused with environmental modeling, which is, in fact, physical modeling for the purposes of geoecology (Jakeman et al., 2008). The main instrument of information modeling is language. Following the more or less conventional understanding of language based on the studies of De Saussure (1983 [1913]), Bloomfield (1914), and many others, language is a system of signs (indices, icons, symbols) for encoding and decoding information. In the context of information modeling, the well-known division of languages into natural and artificial does not seem to be appropriate, because the so-called natural languages tend not to be strict enough, and formal languages (including mathematics and logic) are not specific to, or meaningful for, the modeled information domain. Instead, some experience of knowledge representation demonstrates “artificially” strict and formal management of “natural” words (Sowa, 2000) or, broadly speaking, qualitative expressions, with numbers, parameters, equations, or other formal and quantitative entities going above them (if ever
necessary). Therefore, the behavior of information can be studied only through its reflection in qualitative expressions. Hence, any kind of numerical or mathematical modeling should be preceded by information modeling that has to be based on firm semantic (“natural”) basement, though in research practice, this is often not the case. Note, this also supports the necessity of an applied counterpart of informatics, a selection of methods to look at the ways in which these qualitative (commonly, though not necessarily, verbal) expressions are treated by humans, and thus try to minimize the human subjectivity in information modeling. Such selection of methods of various origins (from statistics to psychology, from linguistics to physiology) is known as knowledge engineering (Feigenbaum, 1984). It is worthwhile to look at the practice of geoinformatics, be it a GIS project creation, a data scheme design, an ontology development, or any other activity, from the point of view of language. Such geoentities as bodies, since they cannot be taken apart from their properties, are understood as expressions having at least one subject naming the body or body class and at least one predicate that denotes a property and relates to the subject, e.g., subject, “rock,” and predicate, “magmatic.” Herewith, some predicates may well act as subjects themselves, e.g., in the expression “plagioclase is a mineral,” “to be mineral” is a predicate, but “mineral” is another subject that may have its own properties (at least, “to include plagioclase”). In the case of multiple subjects, no subject without a predicate is admitted, but one predicate may relate to more than one subject, and at least one predicate must relate to all subjects of the geoentity (“volcanic rocks and landforms”). In the case of multiple predicates, a predicate may relate to another predicate, but in a chain of predicate of predicate of predicate … of predicate, the last one must relate to subject—for instance, “blocks of intensely sericitized plagioclase granite,” where predicate “intense” relates to predicate “to be sericitized,” which, in turn, relates to predicate “plagioclase,” which, in turn, relates to predicate “granite,” and this relates to the subject “blocks.” Also, any number of predicates from one on may relate to one subject (e.g., “blocks of granite and gneiss”), except, certainly, for the predicates that have been considered contradicting (e.g., “thoroughly unsorted cross-bedded sand”). In addition, it is quite common for a subject, unless given a proper noun, to appear to be a class to which it belongs itself. For instance, a lava flow can split into several flows, and in some sense, we will be considering them as different flows, but in some sense, still as one, unless something critical happens with one of these “branches.” Any wave is a set of waves, and any set of waves is a wave in given place and time. Ground shaking is a set of innumerable ground shakings, and any set of ground shakings is ground shaking. Finally, a set of geoenvironments is a geoenvironment. Sometimes, an additional condition is required to maintain such “self-membership” relation, e.g., coincidence in space and/or time—any set of rocks bedded one over another is a rock too.
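To make this subject–predicate reading concrete, here is a minimal Python sketch of our own (the class names `Predicate` and `GeoEntityExpression` are illustrative, not part of the paper's formalism), covering a predicate that relates directly to a subject, a predicate-of-predicate chain, and one predicate relating to several subjects:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Predicate:
    """A property; it may relate to another predicate (a chain) or, if
    'qualifies' is None, directly to the subject(s) of the expression."""
    name: str
    qualifies: Optional["Predicate"] = None

@dataclass
class GeoEntityExpression:
    """A body or body class read as subject(s) plus predicate(s)."""
    subjects: List[str]
    predicates: List[Predicate] = field(default_factory=list)

# "rock is magmatic": one subject, one predicate relating directly to it.
rock = GeoEntityExpression(["rock"], [Predicate("magmatic")])

# "blocks of intensely sericitized plagioclase granite":
# 'intense' -> 'sericitized' -> 'plagioclase' -> 'granite' -> subject 'blocks'.
chain = Predicate("intense",
                  Predicate("sericitized",
                            Predicate("plagioclase",
                                      Predicate("granite"))))
blocks = GeoEntityExpression(["blocks"], [chain])

# "volcanic rocks and landforms": one predicate relating to all subjects.
volcanics = GeoEntityExpression(["rocks", "landforms"], [Predicate("volcanic")])
```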
Furthermore, even if one subject (say, one earthquake) is affected by various circumstances of the environment at one time (e.g., rock structure, rock density, water in pores), to correctly consider the overall influence, we decompose the earthquake in our mind—and consider it as "copies of itself," affected separately by factor one, by factor two, and so forth—and we build models and then integrate these models to account for the overall impact. Thus, the distinction between the "body" and "class of body" (body class) becomes uncertain. To cope with this methodological hindrance, it seems necessary to appeal to a notion of "generality of subject" in geoenvironments. Let us distinguish between "general" and "particular" subjects.

Process is another type of geoentity, which can be comprehended in information modeling as a change of either body name (i.e., subject), or its property (i.e., predicate), or both. In other words, process as an information modeling term is a succession of finite expressions representing body names and properties described in the previous paragraph. Any two neighboring members of this succession must differ in at least one word (subject or predicate); otherwise, they would denote not a process but another state of things (e.g., quiescence). Each member of the succession that denotes process represents an "event" and receives a relative and qualitative time-specific meaning: Some body changes (qualitatively), i.e., acquires (or loses) a property or becomes a different body, after some event in a succession and before another. Hence, process can be defined as a succession of at least two different events. In this way, the concept of time appears in the information modeling—again, in purely qualitative form. Then, if there is a succession of at least two absolutely similar events (e.g., where subjects differ by generality or do not differ at all), it must mean that the body remains the same, while others are involved in processes and undergo changes. These considerations may, and should, be reconciled with the fundamentals of modeling of matter and energy that proceed quantitatively in terms of physical parameters linked by equations. This is an intriguing and challenging issue, however, lying apart from the scope of this paper. Here, these considerations readily bring us to a classification of environments.

Classification of Environments

It seems natural to think that various methods of geoinformatics should be sharpened for different tasks. Then, it looks reasonable to look for a classification of possible tasks of geoinformatics, which would appropriately place, inter alia, the method of event bush and, desirably, as yet unapplied methods. Having agreed that the principal goal of geoinformatics is information modeling of geoenvironments, one may expect that the classification of environments may best organize the armory of geoinformatics (and simultaneously be exported to general informatics). Later herein, we will suggest such an organization. An environment must include bodies and may include processes (Fig. 2).
Figure 2. Classification of environments in the earth science information domain.

Environments of the former type are, e.g., rock
sequences as addressed by stratigraphy, and those of the latter are, e.g., environments of sedimentation. The environments that include solely bodies are, as construed previously, not changing, while environments of the other type do change. This distinction appears to be valuable also for the methods of informatics, which are or can be used in geoinformatics. Indeed, some of them focus on bodies and properties (i.e., subjects and their predicates) and will be henceforth denoted as subject-based, some, on processes or events, and will be referred to as event-based. Regardless of underlying formalisms, all of these methods appear to admit a graphic expression as a network, or graph (in most cases); still, the meaning of nodes and edges may vary. Subject-based methods include a very wide range of instruments (many of them were reviewed by Sowa, 2000) and are usually expressed as directed or nondirected, possibly cyclic graphs with each subject and each predicate occupying a separate node, and edges meaning relations like “is a,” “a property of,” “x acts on y” (the latter already implies an event but does not consider it), and the like. In this cluster of methods, a very peculiar one is GIS, in which every subject has a spatial reference as a predicate, and
some general rules of managing these spatial references are introduced. This is a specific contribution of geoinformatics to general informatics. Other subject-based methods that have been used in the earth science domain (ontologies, taxonomies, thesauruses, entity-relationship diagrams, relational data schemes) were rather borrowed by geoinformatics from general informatics, and yet other sets of methods (conceptual graphs, definitional networks, correlational nets, relational graphs, monotonic description logics, assertional networks, and possibly, thematic organization packets) are awaiting their geo-applications. A deep research into the formal nature, graphic representation, and peculiarities of application of subject-based methods (or at least their overview and classification) would be of great theoretical value and crossdisciplinary benefit but does not constitute a task of this work. Nevertheless, yet from this brief survey, one may conclude that geoinformatics has not employed so many of the existing methods and has provided only one. Contrary to subject-based ones, the event-based methods include in their nodes not separate subjects or predicates but integer events. Some may include parameters (e.g., influence diagrams or, occasionally, belief networks). However, accurate
consideration of parameters is likely to reveal hidden event-based structure (Carniel et al., 2011), and a virtue of informatics is to make the hidden seen. Edges mean various kinds of cause-effect relations, which can be well or poorly understood. Two subtypes of changing geoenvironments can be identified: (1) those in which changes are not directed and (2) those with directed changes.

If changes are not directed, or there are no clear beginnings and ends of changes, there must be either a cycle (a loop), or oscillations expressed qualitatively, or chaotic changes, which cannot be analyzed qualitatively (but not complete quiescence, which would bring us to the no-change environment addressed by subject-based methods). The formalisms that satisfy this precondition may include causal loops and some others, and computational tools like Markov chains, already used in the earth sciences (Aspinall et al., 2006; Oliveros et al., 2008).

If changes are directed in an environment and there may be only a singular path from one event to another, their flow represents a scenario (in the case of invariant changes) or scenarios (if options are possible) where some causes lead to definite results, which can be somehow fixed in mind. This type of environment provides ground for the method of event trees (see, e.g., Newhall and Hoblitt, 2002) and their ramifications (event trees, decision trees, fault trees; quantitative: probability trees, Bayesian trees, and so forth). Such a tree has only one starting event ("root") and suggests only one path to any of its further nodes. No cycles are admitted, even locally.

Finally, the most sophisticated environment type is that of directed alternative changes. Formalisms describing such environments are expressed graphically as constructions having many "roots" or net shapes, allowing various paths at least between some of the nodes, and permitting local cycles. These are causal networks, systems thinking diagrams, nonmonotonic description logics, influence diagrams, Boolean networks, reasoning maps, and the tool particularly addressed in this work, the event bush. Also, there are quantitative methods based on such formalisms, the belief networks (normally without cycles, though Tulupyev and Nikolenko [2005] suggested algebraic cyclic belief networks) and the quantitative version of the event bush, the probability bush. The edges in the event bush, as will be shown later herein, may unite not only two but any number of nodes. This makes the graphic notation of the said formalism yet wider than a classical graph, where only two nodes can be linked by an edge (Tutte, 1998).

Properties of Geoenvironments with Directed Alternative Changes: Some Intuitive Hints

We consider an environment to have directed alternative changes if it becomes an arena for various, intertwining processes, which are thought to originate from a limited number of causes or sources (but not less than two), which tend to leave a definable result, and which can be followed as more or less distinct scenarios.
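Before these environments are examined in detail, the graph-structural contrast just drawn—event trees with one root and a single path to every node versus multi-root, net-shaped constructions with possibly several paths—can be illustrated by a small check over plain edge lists. This is our own sketch, not an algorithm from the event bush literature; revisited nodes are simply skipped when counting paths.

```python
from collections import defaultdict

def roots(edges):
    """Nodes with no incoming edge, i.e., the 'starting events'."""
    nodes = {n for edge in edges for n in edge}
    targets = {t for _, t in edges}
    return nodes - targets

def count_paths(edges, start, goal, seen=frozenset()):
    """Number of distinct directed paths from start to goal (revisits skipped)."""
    if start == goal:
        return 1
    out = defaultdict(list)
    for s, t in edges:
        out[s].append(t)
    seen = seen | {start}
    return sum(count_paths(edges, nxt, goal, seen)
               for nxt in out[start] if nxt not in seen)

# Event-tree-like structure: one root, one path to any node.
tree = [("e1", "e2"), ("e1", "e3"), ("e2", "e4")]
# Bush-like structure: two roots (a passive condition and an invading agent)
# and two alternative paths leading to e4.
bush = [("cond", "e3"), ("agent", "e3"), ("cond", "e4"), ("e3", "e4")]

print(len(roots(tree)), count_paths(tree, "e1", "e4"))    # 1 1
print(len(roots(bush)), count_paths(bush, "cond", "e4"))  # 2 2
```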
For instance, consider a geodynamically active environment in a mountainous area in a temperate climatic zone or very high mountains in a warmer climate, such as the Alps, Caucasus, Cordilleras of North America, Andes, Himalayas, and many others. Such an environment implies ice caps topping mountains, including, possibly, volcanoes, which have a heat flow from the interior (e.g., active volcanoes or young plutons), deep valleys hosting rivers, and forested mountain slopes or feet. Even slight warming under a mountain summit or a regional earthquake may trigger ice avalanches, turning into mudflows or lahars downslope, which become yet more devastating as they capture tree trunks. Mudflows may pond gullies and cause floods, which, in turn, evoke other dangerous consequences. To understand possible threats, one should first construct and qualitatively constrain all possible scenarios of evolution of such an environment. One way to do this is to delve deep into memory and knowledge of such landscapes and enforce intuition. Nonetheless, this method of creating a complete set of scenarios is not immune from forgetting, misunderstanding, or under- or overestimation and therefore seems not reliable enough if not substantiated by formal treatment based on a pattern that expresses common features of environments of this type. Contemplating various environments that take place on Earth now or are believed to have done so in the geological past—eruption and cooling of volcanic products, advancement/ retreat of a sea or glacier, tectonic rebuilding of a region, or routine accumulation of sediments in a basin—we may notice some features of similarity between environments of strikingly diverse origin, scale, and duration. Thus, even if an environment is rather routine, there is (was) some more or less clear onset when some invading external agent, or agents (magma, sea/glacier, regional tectonic movement, including, possibly, downsagging, which forms a sedimentary basin), was imposed on the local geological or landscape conditions. The role of these conditions may decrease with time (volcanic eruption), or remain essentially the same (tectonic rebuilding), or become qualitatively different (sedimentation in a basin). External agents are being “projected” on local conditions, and the latter, when affected, transform diversely in different parts as determined by their own internal properties and the properties of these agents, and they continue to interact with each other. For instance, the properties of eruption products are initially constrained by (1) chemical composition, gas content, fragments of old minerals and rocks, temperature and pressure in magma, and they are finally determined by (2) the conditions of cooling and solidification, including the interactions between them (topography, amount and form of free water on surface, character of surface deposits and vegetation, weather conditions during and after the eruption, and so forth). If we consider the environment itself to be absolutely passive, then we may speak of simple projection of external invading agents with concomitant transformations determined mainly by the properties of these agents, but if the environment is considered rather active itself, we have to consider these agents (e.g., advancing glacier) only as an
addition to ever-present external agents—rivers, winds, and the like, transforming the local conditions all along. However, even in this case, the onset of the environment in question is the time of invasion of the new agent (magma in the previous example). Likewise, the properties of a transgression sequence are determined by those of (1) the incoming sea and (2) the landscape; properties of a sedimentary batch formed in a basin are determined by (1) properties of rocks bedded around and (2) character of tectonic movements that expose them to denudation and create a basin for catching the sediments; and the newly built tec-
tonic framework, naturally, is constrained by (1) older geological structure and (2) the character of tectonic action operating upon it (Fig. 3). Also, it should be noted that there is no strict prescription for defining location, on the one hand, and agent(s), on the other. Rather, the distinction is intentional: For instance, if we are interested in the way in which ascending and erupting magma is transformed into a volcanic facies, the magma has to be taken for (1) conditions and put on the left in our graph, and the landscape features must be taken for (2) the agents transforming the magma
Figure 3. Anatomy of geoenvironments of alternative directed changes by four various examples. Regardless of contents, such environment can be formulated in a universal way shown in capitals on the left and top. However, the left part of the title may directly relate to the "invaded conditions" and may (as in the "How a sedimentary batch…" example) relate to end results.
and put on top, and if the question is, “How is the landscape affected by volcanic eruption,” everything is vice versa—landscape is 1, while magma is 2. Then, products of one environment will occur in another, where they will be affected by new agents and transform respectively, and nothing formed once is lost or forgotten by the nature. Everything finds its new place and role. Lava and ignimbrite flows are dissected by rivers, covered by vegetation, and create new landscape; newly formed sea develops its own subaquatic landscape, depending, again, on regional climate conditions and character of coast; glaciers melt down and produce a new ensemble of postglacial landforms and related landscape features; new tectonic frameworks may host intrusion of plutons and formation of ore; and a sedimentary bed may be folded and metamorphosed. This makes us reason in terms of successions of environments “conjoined” with each other in space and time (Fig. 4). Commonly, not only one but many products of one environment are “borrowed” by another as starting points for new pathways of geological evolution. A unit of eruption products always includes several species of rock (juvenile pyroclastics, resurgent pyroclastics, possibly lava, hydrothermally altered rock), each forming distinct bodies, as well as new fractures, pits, cones, and so forth. A transgression unit necessarily includes a number of layers topping one another in well-known order (“transgression bedding”). Tectonic framework, naturally, represents an ensemble of interrelated dislocations. A sedimentary bed formed in a sagging basin can be quite homogeneous, but, as geologists know, this is very rarely the case—commonly, it also represents a succession of layers. Naturally, all these “end members,” documenting some accomplished scenarios in the environment in which they were formed, will become a new “invaded environment” (Fig. 4) or contribute to it, and maybe some of these
under some conditions may also become "invading agents" themselves. This new environment may make some of the properties inherited by these products "speak" in a new context; therefore, nothing in the description of products may be omitted. For instance, presence of water-soaked rocks or sediments in the way of a calm lava flow may cause rapid explosions on it and even shift it to a pyroclastic-flow behavior. Different patterns of cracks in a melting glacier are believed to cause different modes of melting and postglacial sedimentation, which, in turn, are clearly pronounced by changes in forests in a taiga landscape. Organisms living in seawater can accumulate some chemical elements, the normal rate of which is negligible, and form mineral deposits in a sedimentary bed. Some of the faults developed in a region may appear crucial for generation of a hydrographic network, excavation of wide river valleys, and then, consequently, promotion of human infrastructure. (Such processes perhaps could be expressed as chains of up to three to five environments or more.) However, simultaneously, it should be kept in mind that some of the features geologists used to pay attention to may appear far from that influential on the choice of scenario of environment evolution—e.g., higher silica content in magma generally favors explosive or extrusive versus effusive behavior of erupted products, but it has only "traditional" and "symbolic" power in determining the principal (qualitative) future course of events, except in some specific cases. Another peculiar feature is that scenarios on and inside Earth may readily recur (e.g., remelting and recrystallization, transformation of pyroclastic flow into pyroclastic surge and back in the course of one eruption, repeated eruptions from one source, repeated transgressions or glaciations, cyclic sedimentation or
Figure 4. Example of succession of geoenvironments.
tectonism), but, in any case, there is an obvious starting point before and an endpoint after the loop within some larger-scaled directed change of an environment. Importantly, an environment can be viewed at different scales (taking scale purely qualitatively as an indication of the detail with which we study it). For example, formation of a sedimentary basin can be conceptualized as tectonic reworking of a segment of lithosphere and thought of as movement of lithosphere plates or the like, or as destruction and accumulation of rocks involving the transport of material with mechanical and chemical transformation, or as interaction of grains of sand, organic matter, mineralized water, and so forth—but at any level, there will be "passive" conditions and "invading" agents. Moreover, different levels can be "sewed up vertically," as a segment of lithosphere includes particular rock batches, and a rock includes grains, minerals, organic remnants, water in cracks, etc. Therefore, the conceptualizations made at different scales are neither independent nor conflicting but somehow "nested" and resemble self-similarity sensu Mandelbrot (1982); this is shown in another example in Figure 5. Still, even at a single given scale, one geoenvironment or several geoenvironments that we refer to a similar class may behave differently under similar conditions, e.g., a growing extrusive dome on a volcano may end up with a violent explosion or just stop growing. A reason for a different behavior may not always be found, if it exists at all. The information modeling of a directed alternative changes environment must be able to account for such behavior.
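One hedged way to picture this "nesting" in data is a recursive container in which every level carries its own passive conditions and invading agents and may hold finer-scale environments. The sketch below is our illustration only, not a structure prescribed by the method; all names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Environment:
    """One scale of conceptualization: passive conditions plus invading agents."""
    name: str
    passive_conditions: List[str]
    invading_agents: List[str]
    nested: List["Environment"] = field(default_factory=list)  # finer scales

basin = Environment(
    name="formation of a sedimentary basin",
    passive_conditions=["segment of lithosphere", "rocks exposed to denudation"],
    invading_agents=["regional tectonic movements (downsagging)"],
    nested=[
        Environment(
            name="accumulation of one rock batch",
            passive_conditions=["grains of sand", "organic matter", "mineralized water"],
            invading_agents=["transport with mechanical and chemical transformation"],
        )
    ],
)
```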
Finally, if we look at the "anatomy and physiology" of events in a geoenvironment of directed alternative changes, we may need to recognize some more or less typical collisions. If a body or body class is left on its own, without any notable external influence, solely due to its internal causes, it may:
1. Remain essentially the same (e.g., a mature river valley, a cooled intrusive body, an abyssal plain);
2. Acquire or lose some properties (evolving river valley, hot intrusion, continental slope foot);
3. Become another body, either changing its essential properties (e.g., a volcano will become extinct and become a hill) or still keeping them (an avalanche of ice blocks from a mountain top may transform into a mudflow downslope but remain similarly fast and devastating);
4. Produce another body (or a number of bodies, simultaneously or in succession) of similar class or diverse classes, still remaining qualitatively the same (e.g., lava flow produces marginal clastic facies, or ice sheet spawns icebergs);
5. Stop being itself and become a set of bodies of similar class (e.g., a hill is dissected by erosion into a group of hills) or diverse classes (e.g., a pyroclastic flow leaves a succession of facies that differ in lithology, physiography, and composition, from proximal lithic breccia to distal veneer deposits and coignimbrite tephra, all being solid deposits and not a flow at all).
Importantly, in the former case, the resulting geoentities and the initial one all must have incompatible sets of properties; otherwise this is a situation described in item 4.
Figure 5. Example of "nested" (self-similar) geoenvironments.
If we consider not a singular body but a body class, we have to put "similar superclass" instead of "similar class" in this rationale. (These cases perhaps will be applicable to invariant change environments too.)

Several aspects need to be stressed here. Recognition, or identification, of bodies is certainly intuitive. One may recognize a pyroclastic flow as a single moving body having integer properties; another may deem it as a congregation of moving particles—and in the latter case, transformation of a flow into solid facies will not be a number 5 case of transformation of bodies of different classes, because particles just change their properties and settle down, but a number 2 case, and then with compaction and possible remelting, a number 3 case. In addition, while various facies from one flow may well coexist in time and space, for a given particle, on the contrary, these options are incompatible, because it must "go" either in one or in another facies. Another, quite arbitrary factor is the reference of two bodies to one or two different classes and the very meaning of "the same." Still, whatever a scientist's opinion could be, it would fit one of these cases. Thus, this list of cases seems to be universal, while the choice of case may appear dependent on scale and on highly subjective personal presuppositions addressed by knowledge engineering. Crossing the scales—passing from, say, a flow or rock to the particles they consist of—we, in fact, make the internal causes external, and vice versa, although such diversity of internally and externally driven events can be seen in one scale as well, not necessarily moving from scale to scale.

Considering the external influence on a body (body class), the following cases can be thought of:
6. Two or more bodies interact with each other on an equal basis if they have at least one common subject and produce another body, in which the subject is the common subject of interacting bodies, and the common predicate is necessarily present among the predicates—e.g., if a province undergoes tectonic movements and the same province undergoes volcanism, it becomes a volcano-tectonic province.
7. Two or more bodies interact with each other on an equal basis if they have at least one common property (co-occurrence in time and/or space or any other predicate) and produce another entity, in which the subject is formed by the predicates of the interacting bodies, and predicates are formed by their subjects—e.g., flowing fragments of magma (shards and pumice) and flowing clasts of host rocks form a flow (new subject) of fragments of magma (shards and pumice) and clasts of host rocks (subjects of initial bodies have turned into predicates of the resulting one).
8. Two bodies interact so that one modifies the other. Either only the predicate of the modified body is changed by the modifier (e.g., a sedimentary batch unexposed to denudation becomes exposed to it, hot intrusive body heats the host rock at the exocontact, etc.), or the subject is changed too (e.g., a sedimen-
tary batch previously unexposed to denudation is eroded, hot intrusive body quenches the host rock at the exocontact, so that it transforms into another rock, etc.). These intuitive provisions have determined the creation and use of the event bush formalism described in the next section.

BASIC ASSUMPTIONS, DEFINITIONS, RULES, AND COROLLARIES

General Comments

The method of event bush was first suggested by Pshenichny and Khrabrykh (2002) and was recently developed by Pshenichny et al. (2008, 2009) to reveal the behavior of information in environments of directed alternative change by constructing the full set of possible scenarios under some given fixed starting conditions (both "passive conditions" and "invading agents"). Some promising attempts of application of the event bush were reported by Pshenichny et al. (2005) for optimization of the Bayesian belief network technique, by Pshenichny and Fedukov (2007) for assessment of environmental consequences of ore extraction in shallow waters, by Pshenichny et al. (2009) for evolution of landscape and Quaternary deposits and for general volcanological purposes, by Behncke and Pshenichny (2009) for construction of eruption scenarios for Etna volcano, Sicily, and by Diviacco and Pshenichny (2010) for creation of concept-referenced spaces in collaborative research. Nonetheless, the basement of the method has remained rather loose and semi-intuitive. The present work aims to start filling this gap by going toward more formal grounds.

Architecture of the Event Bush

An environment of directed alternative changes is described by the event bush as a set of events of four types arranged in a certain order (henceforth called multiflow structure—see Fig. 6) and connected by special connectives.
ia. Primary internal events. These are primary, nonunique inputs representing the "passive conditions." These would determine any further course of events ("happenings").
ib. Primary external events. These are the "invading agents." They may affect basic inputs or influence their further, indirect manifestations, thus "shaping" different "happenings."
ii. Secondary events that result from primary internal inputs with or without the contribution of primary external ones—the "happenings" proper, formulated in a strict concise way indicating their core features determined by the causes, following the principle "one more cause—one change of event."
iii. Tertiary events that denote end results, or products, generated either by primary internal or by secondary events, with or without primary external ones. Tertiary events document the completed "happenings."
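Read as a data structure, the multiflow structure amounts to typed events carrying subjects and predicates. The following Python sketch is our own illustration of that reading (the class and field names are ours, not part of any published event bush implementation), with two hypothetical inputs for a volcanic example:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class EventType(Enum):
    PRIMARY_INTERNAL = "ia"   # passive conditions
    PRIMARY_EXTERNAL = "ib"   # invading agents
    SECONDARY = "ii"          # the "happenings" proper
    TERTIARY = "iii"          # end results, or products

@dataclass
class Event:
    subjects: List[str]
    predicates: List[str]
    etype: EventType
    general: bool = True      # generality of subject(s), where relevant

conduit_magma = Event(["magma"], ["rises in the conduit"], EventType.PRIMARY_INTERNAL)
groundwater = Event(["groundwater"], ["saturates the host rocks"], EventType.PRIMARY_EXTERNAL)
```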
Figure 6. Multiflow structure of the event bush.
Any event in an event bush is uniquely characterized by the set of subjects, set of predicates, type (ia, ib, ii, or iii), and, in some cases, generality. Two primary events may not have similar subject. Other pieces of information (even not necessarily verbal) can be associated with events and/or connectives, but the presence of events i–iii is mandatory. The following relations between items i–iii are set: not a single event may lead to ia or ib; event ia without an event ib or ii may lead only to ii or one event iii; event ia may not lead to any event together with another ia; event ib may lead only to an event ii and only together with ia or another ii; event iii may not lead to any other event. These relations are enforced by the connectives of the event bush.
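As a hedged illustration, the type rules just listed can be written as a single admissibility check. This is our simplified reading (it ignores the "only one event iii" cardinality and the connective-specific constraints introduced below):

```python
def connection_allowed(cause_types, effect_type):
    """Admissibility of a connection by event type only (simplified)."""
    ia, ib, ii, iii = "ia", "ib", "ii", "iii"
    if effect_type in (ia, ib):          # no event may lead to a primary event
        return False
    if iii in cause_types:               # a tertiary event leads nowhere
        return False
    if cause_types.count(ia) > 1:        # ia never acts together with another ia
        return False
    if cause_types == [ia]:              # ia alone: only to ii or to (one) iii
        return effect_type in (ii, iii)
    if ib in cause_types:                # ib: only to ii, and only with ia or ii
        return effect_type == ii and any(t in (ia, ii) for t in cause_types)
    return effect_type in (ii, iii)

print(connection_allowed(["ia"], "iii"))       # True
print(connection_allowed(["ib"], "ii"))        # False: ib never acts alone
print(connection_allowed(["ia", "ib"], "ii"))  # True
```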
Figure 7. Graphic notation for the connectives of the event bush: (A) flux, (B) influx, (C) furcation, and (D) conflux. E1, E2, …, En are connected events.
Connectives of the Event Bush

For simplicity, we will consider in this subsection only the events having one subject and the predicates (one or more) that relate directly to the subject (i.e., no "predicate of predicate" expressions). Other cases, however interesting, will be studied later in relation to class-subclass relations in the event bush and the link between the event bush and conceptual structures. Based on the observations of behavior of information in the geoenvironments of directed alternative changes presented in the "Properties of Geoenvironments…" subsection, the following connectives are introduced.
1. Flux connective (Fig. 7A) describes one event (Ei) producing another (Ej):
Ei Flux Ej.
2. Influx connective (Fig. 7B) describes two events (Ei, Ej) producing another (Ek), but playing different roles (this will be described below):
Ei, Ej Influx Ek.
3. Furcation connective (Fig. 7C) describes production of multiple events (Ei+1, Ei+2, …, En) by one (Ei):
Ei Furcation Ei+1, Ei+2, …, En.
4. Conflux connective (Fig. 7D) describes production of one event (En) by multiple events (Ei, Ei+1, Ei+2, …, En–1):
Ei, Ei+1, Ei+2, …, En–1 Conflux En.
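In code, the four connectives differ mainly in how many events stand on each side. A minimal sketch under that reading (ours; it does not enforce the modus-specific subject and predicate rules that follow):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Connective:
    kind: str            # "flux", "influx", "furcation", or "conflux"
    causes: List[str]    # event identifiers on the left of the connective
    effects: List[str]   # event identifiers on the right of the connective

    def __post_init__(self):
        n_in, n_out = len(self.causes), len(self.effects)
        arity_ok = {
            "flux":      n_in == 1 and n_out == 1,
            "influx":    n_in == 2 and n_out == 1,   # modified event + modifier
            "furcation": n_in == 1 and n_out >= 2,
            "conflux":   n_in >= 2 and n_out == 1,
        }[self.kind]
        if not arity_ok:
            raise ValueError(f"{self.kind}: wrong number of connected events")

flux = Connective("flux", ["Ei"], ["Ej"])
influx = Connective("influx", ["Ei", "Ej"], ["Ek"])
furcation = Connective("furcation", ["Ei"], ["Ei+1", "Ei+2"])
conflux = Connective("conflux", ["Ei", "Ei+1"], ["En"])
```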
From event to event, the following components may change:
1. subject(s) S,
2. predicates P,
3. type of event, or
4. generality of subject(s).
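A small worked illustration of these four components (our sketch; the event representation is simplified to plain tuples and is not the notation of the paper):

```python
def changed_components(e1, e2):
    """List which of the four components differ between two connected events.
    Each event is a tuple (subjects, predicates, event_type, generality)."""
    labels = ("subject(s)", "predicates", "type of event", "generality of subject(s)")
    return [name for name, a, b in zip(labels, e1, e2) if a != b]

before = (("extrusive dome",), ("stable",), "ii", "general")
after = (("extrusive dome",), ("not stable",), "ii", "general")
print(changed_components(before, after))  # ['predicates']  (cf. the flux modi below)
```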
We postulate that type of event and generality of subject(s) may change if, and only if, subjects and predicates do not change, and vice versa: if both subjects and predicates are the same in two connected events, then either their type or generality of subject(s) must change. If these conditions are not met, the formalism is not an event bush. All the changes listed here are governed by the connectives. These will be considered in detail in the following.

The sign ~ is used to denote negation. Negation in the event bush has been understood rather freely as a sign (not a connective!) related solely to a predicate, meaning that the subject of the considered event has this predicate somewhere in the bush but not in this event. We expect to define negation more strictly when we study the class-subclass relations in the event bush.

Flux Connective

The connective of flux, by which one event (Ei) produces another (Ej), Ei Flux Ej, has six different modi.
1. Change of subject from one secondary event to another with the same predicate: ii SiPk Flux Modus 1 ii SjPk (Fig. 8A), e.g., "Organic matter (Si) has a form of tree branches in Quaternary deposits (Pk) Flux Modus 1 Silica (Sj) has a form of tree branches in Quaternary deposits (Pk)." This is good for formal expression of processes evolving "by themselves" and leading to crucially different results, like various phenomena of substitution (mineral pseudomorphism, transformation of pyroclastic density current into a surge or lahar during volcanic eruptions, and so forth) that widely occur in Earth's interior and on the surface. Here, the natural object (i.e., geoentity) denoted by Si is waning, and the geoentity denoted by Sj is waxing. Another example of this modus gives a different idea of interpretation: "A pluton (Si) emplaced into a sedimentary batch (Pk) Flux Modus 1 Heat is emitted into a sedimentary batch (Pk)." Emission, or spawning, of matter or energy is another meaning of this modus in description of geoenvironments (i.e., geoscientific contexts). Finally, this modus may have the sense of bare influence of one geoentity on another: "A pluton (Si) emplaced into a sedimentary batch (Pk) Flux Modus 1 A fold (Sj) is formed in a sedimentary batch (Pk)." Thus, modus 1 appears to have three interpretation meanings, which can be roughly marked as "substitution," "emission," and "influence." In the former, the geoentity denoted by the subject of the initial event ceases to exist, and in the two others, continues
to exist. Both subjects belong to a similar superclass characterized by similar predicate(s) Pk, though they do not necessarily exhaust it.
2. Change of subject from one secondary event to another with changing predicate and without a semantic interrelationship between subjects and predicates: ii SiPk Flux Modus 2 ii SjPl (Fig. 8B). For example, "Ice blocks (Si) avalanche (Pk) Flux Modus 2 Mud (Sj) flows (Pl)," or "Volcano (Si) erupts on land (Pk) Flux Modus 2 Ash (Sj) is emitted to stratosphere (Pl)," or "Climate (Si) gets wetter (Pk) Flux Modus 2 Glaciers (Sj) grow (Pl)." These three examples illustrate the same meanings as those of modus 1 (meanings: substitution, emission, influence), but there is no more ground to refer the two subjects to one class.
3. Change of subject from one secondary event to another with changing predicate and with a semantic interrelationship between subjects and predicates: ii SiPj Flux Modus 3 ii SPjPSi (Fig. 8C). An example of this could be "Slope (Si) is affected by rockfall (Pj) Flux Modus 3 Rockfall (SPj) passes by slope (PSi)." The semantic relationship of "changing roles" of predicate and subject expresses an interaction of two geoentities in which first one affects the other, and then, vice versa. Indeed, the slope is transformed by rockfall, but, simultaneously, it could stop the rockfall. In time, the former always precedes the latter, at least a little. Moreover, being affected by rockfall, it could generate another geoentity, e.g., a seismic signal.
4. Change (negation versus assertion) of predicate from one secondary event to another: ii Si~Pk Flux Modus 4 ii Si Pk, ii SiPk Flux Modus 4 ii Si~Pk (Fig. 8D). These relations could be exemplified, the former by, "Stream flow (Si) does not erode limestone (~Pk) Flux Modus 4 Stream flow (Si) does erode limestone (Pk)," and the latter by, "Extrusive dome (Si) is stable (Pk) Flux Modus 4 Extrusive dome (Si) is not stable (~Pk)." This could be a formal expression, inter alia, for self-evolving processes where the results are not as crucial as those addressed in modus 1 in the "substitution" sense.
5. Change of type of event: from primary internal to secondary, from secondary to tertiary, from primary internal to tertiary: ia SiPj Flux Modus 5 ii Si Pj; ii SiPj Flux Modus 5 iii Si Pj; ia SiPj Flux Modus 5 iii Si Pj (Fig. 8E).
Figure 8. Modi of the flux connective (except modus 6) on the multiflow structure background: (A) modus 1, (B) modus 2, (C) modus 3, (D) modus 4, and (E) modus 5. See comments in the text.
An example of these relations could be "River (Si) flows (Pj) Flux Modus 5 River (Si) flows (Pj)." This reflects an important feature of any changing geoenvironment—the state of quiescence for some of its objects, when one may speak not as much about an event as about a fact, i.e., a "fixed," or stationary, event, which still may influence other events and participate in processes (for instance, be modified by some other events).
6. Change of generality of subject—from general to particular: ia any Si Pj Flux Modus 6 ii some Si Pj; ii any Si Pj Flux Modus 6 ii some Si Pj. This modus will be commented on and exemplified later herein because it requires us to introduce some other connectives.
Any primary internal event may produce by itself either only one event by flux modus 5, or any number of events, though with unchanged subjects and predicates, by flux modus 6. Any secondary event may produce by itself only one event by modus 5 and any number of events by modus 6 (with the previous condition on subjects and predicates imposed by modus 6). Any modus except 5 can be applied any number of times and along with any other modi to the event, and all the resulting events will have subjects denoting coexisting geoentities, contrary to the connective of furcation (see later herein). Given the architecture of the event bush, any other modus of flux seems to be a repetition or a subtype of those already given here. Tertiary events may be produced only by the flux connective.

Influx Connective

The influx connective unites three events, from which one (the modified event Ei, Si~PSj), being affected by another (the modifier Ej, Sj P), produces the third one, Ek, with the subject from Ei (Si) and predicate (PSj) determined by the modifier's subject: Si~PSj, Sj P Influx Si PSj (Fig. 9). This connective expresses one of the most common interactions on and in Earth, when something deviates from its initial state under external influence. The geoentity denoted by Si continues to exist but acquires a property it has not had (and therefore can be modified as such only once, for it does not exist with the negation of this property anymore), and the geoentity denoted by the modifier's subject, Sj, providing this property, may participate in whatever number of influxes (together with its property P, of course) and provide similar or different properties to other modified events. Influx can be exemplified as follows: "River (Si) is unponded (~PSj), Blocks of rock (Sj) avalanche (P) Influx River (Si) is ponded (PSj)," and a semantic relation between "blocks of rock" and "to pond" is assumed herewith ("what blocks of rock will try to do if they meet a river"). In general, the predicate PSj is always semantically related to the subject of the modifier. For instance, if we imagine an event
bush describing the deposition from flows of volcanic material, in a modifier "Valley develops in the lowland," the subject "valley" may produce a predicate "elongated" that would apply to any subject denoting a deposit forming from flow ("an elongated deposit of ash/mud/blocks of rock forms"). These relations may follow from a hierarchically superior event bush in a succession of self-similar (in our case, nested) bushes described in the subsection "Multibush Constructions" of the "Strategies of Application" section, but they still need to be formalized. We expect to do this with consideration of the class-subclass structure of the event bush formalism.
The influx connective is always accompanied by a modus 5 flux connective leading from a primary internal or secondary to a tertiary event, ia/ii Si~PSj Flux Modus 5 iii Si~PSj, which means that the initial event would not change unless affected by another event. If Ei acts as the modified event in a number of influxes, they all are related to the said flux. Only those events that flux by modus 5 to an event of the third type can be modified by influx. At the same time, thinking crisply, the initial event, which is modified by the influx, may be in flux only when it acts alone; the modus 5 of flux may be true only if the influx is not, and vice versa, the influx may be the case if, and only if, flux modus 5 is false. The event Ei usually is primary internal, but it can also be secondary. Ej is primary external or secondary. Ek is always secondary (Fig. 9):
ia Si~PSj, ib Sj P Influx ii Si PSj;
ii Si~PSj, ib Sj P Influx ii Si PSj;
ia Si~PSj, ii Sj P Influx ii Si PSj;
ii Si~PSj, ii Sj P Influx ii Si PSj.

Furcation Connective

The furcation connective produces multiple events (Ei+1, Ei+2, …, En; n > 1) from one (Ei), where all Ei, Ei+1, Ei+2, …, En are mutually incompatible and together exhaust the whole set of possible outcomes of Ei (Fig. 10). Two modi are possible for this connective: (1) the subject remains the same but changes the predicate(s) it had in the initial event, and (2) the subject changes itself but the predicate(s) it had in the initial event remains the same. They will be shown below for two resulting events.
1. Attributing incompatible combinations of predicates to a similar subject, modus 1 performs division of predicates: ii Si Pj PkPl Furcation Modus 1 ii Si Pj~PkPl, ii Si Pj Pk~Pl (Fig. 10A). This represents a situation of natural choice for a process to take this or that scenario, and it can be illustrated by the following case: "Lava (Si) extrudes (Pj) yet neither piles up (Pk) nor slides down (Pl) Furcation Modus 1 Lava (Si) that has extruded (Pj) does not pile up (~Pk) but slides down (Pl), Lava (Si)
Figure 9. Subjects and predicates in the influx connective on the multiflow structure background.

Figure 10. Subjects and predicates in the furcation connective on the multiflow structure background: (A) modus 1 and (B) modus 2.
that has extruded (Pj) piles up (Pk) and does not slide down (~Pl)." Here, "has extruded" obviously means "does not extrude anymore." This is true if we are referring to a particular portion of lava (certainly, not the entire erupted lava, which is free to behave in a variety of ways simultaneously). It should be kept in mind here, though, that the peculiarity of "having acquired" a property may be formally interpreted not as having but, on the contrary, as not having this property anymore—"lava has extruded and does something else on the surface" actually means that lava does not extrude now; still, it has been accepted in the event bush that the property must relate to the subject in such a case, contrary to the case that a similar subject does not have this property at all (e.g., lava was spilled out and, hence, never extruded). With deeper formalization, this shortcoming must be fixed.
2. If subjects of resulting events are mutually exclusive and together exhaust the class that is the subject of the initial event, then the following can be put forth (shown for two resulting events for simplicity): ii Si Pl Furcation Modus 2 ii Sj Pl, ii SkPl (Fig. 10B). As in flux modus 1, the process described here leads to a complete change of subject, but unlike that modus, here there is an option. For instance, "Water (Si) is in the gully on slope of the volcano (Pl) Furcation Modus 2 Stream (Sj) is in the gully on slope of the volcano (Pl), Lahar (Sk) is in the gully on slope of the volcano (Pl)." This is not just a classification of events but exactly expresses the cause-effect relationship, because first water accumulates in a gully, and then this gives birth either to a big stream or to a lahar. Similar predicate(s) are necessary here to demonstrate that natural objects involved both in the initial and the resulting events belong to a similar superclass characterized by similar predicate(s) and together exhaust it.
Both modi of furcation involve secondary events only. It is assumed that the geoentity denoted by the subject of the initial event does not exist anymore, and geoentities denoted by the subjects of resulting events appear as a result of the process denoted by furcation. By definition, the resulting events are mutually incompatible and together exhaust the whole variety of possible outcomes of the initial one; hence, no other consequences of the said event are possible. Then, if such an event is included into the connectives of furcation and flux (regardless of modus), not more than one of these may be true.

Conflux Connective

Conflux describes production of one event (En) by multiple events (Ei, Ei+1, Ei+2, …, En–1) (see Fig. 11). There are three modi of this connective.
1. The events Ei, Ei+1, Ei+2, …, En–1 have at least one common predicate, the subject of En comes out of this predicate, and the predicates of En come from the subjects of Ei, Ei+1, Ei+2, …, En–1. Both transformations are governed by semantic relations between the predicate of initial events and subject of the result-
ing one, defined similarly as the semantic relation in the influx connective. In terms of subjects and predicates, for two initial events with one predicate each, this can be shown as ii Si Pk, ii Sj Pk Conflux Modus 1 ii SPkPSi PSj (Fig. 11A), e.g., "Droplets of magmatic melt (Si) flow in gas envelope downslope (Pk), Fragments of crystals (Sj) flow in gas envelope downslope (Pk) Conflux Modus 1 Downslope flow in gas envelope (SPk) involves droplets of magmatic melt (PSi) and fragments of crystals (PSj)." This modus portrays a case when several events coinciding by some properties or circumstances (e.g., those of space, time, composition, or others) produce a new geoentity (denoted by the subject of the resulting event), and this geoentity is determined exactly by their common properties or circumstances that unite them.
2. The events Ei, Ei+1, Ei+2, …, En–1 have at least one common predicate, the subject of En comes out of their subjects, and the predicates are the common predicates of Ei, Ei+1, Ei+2, …, En–1. For two initial events, it looks as follows: ii Si Pi, ii Sj Pi Conflux Modus 2 ii SSiSj Pi (Fig. 11B). An example of this modus could be the following: "Sericite (Si) alters granite (Pi), chlorite (Sj) alters granite (Pi) Conflux Modus 2 Sericitic-chloritic aggregate (SSiSj) alters granite (Pi)." The meaning of this modus is close to that of the previous one, but the new entity is formed not by coinciding circumstances or properties but by "merging" subjects of initial events.
3. The events Ei, Ei+1, Ei+2, …, En–1 all have similar subjects and similar predicates, and these are the subjects and predicates of the event En (shown for two initial events): ii some Si Pj, ii some Si Pj Conflux Modus 3 ii any Si Pj. This modus is used only with modus 6 of flux described next.
In any modus of conflux, geoentities denoted by the subjects of initial events do not exist anymore, and the geoentity denoted by the subject of the resulting event appears. All events united by conflux are secondary.

Flux Connective Modus 6 and Conflux Modus 3: Explication

These two modi have been designed specially to account for the important case of self-similarity of subjects that often occurs in the geosciences. As was mentioned previously, any succession of rocks is a rock, any class of co-occurring ground shakings is ground shaking, and, ultimately, any class of related geoenvironments is a geoenvironment itself. Hence, very often, the subjects in the event bush have to be dual, being simultaneously an integer entity and a class dividable into subclasses, each
Figure 11. Subjects and predicates in conflux connective: (A) modus 1 and (B) modus 2.
of the latter being a similar, though not the same, geoentity. The following pattern is suggested to represent this formally (Fig. 12). For instance, a single layer of sandstone is being eroded by rivers in river valleys and by abundant vegetation at watersheds. The results of these destruction processes are virtually diverse, mechanical removal in one case and chemical transformation in situ in the other. Hence, as soon as the layer may become exposed to some kind of denudation, semantically the corresponding subject loses its generality, and structurally the modus 6 of flux takes place. The state of geoentity in this case is the quiescence, but with an opportunity of disruption, which occurs by influx connectives (see Fig. 12). Any flux connective of modus 6 may be used not alone but in an ensemble with
• other connectives of said modus with similar events to the left from it,
• flux connectives of modus 5, the events at the left parts of which are those to the right from flux modus 6 connectives,
• influx connectives modifying events to the left from flux modus 5, and
• a conflux connective (see following) uniting unmodified events of flux modus 5 (Fig. 12).
However, those parts of the sandstone layer that occur at valley sides and at watersheds but that remain intact continue to behave as an integer body that is expressed by a general subject. Therefore, they are confluxed back, and this is the virtue of conflux modus 3. This structural feature of event bush may appear especially useful for physical modeling of complex influences on various subjects. Quiescence expressed by repeated flux modus 5 is also numerically important here because it acquires temporal, spatial, or other meanings.

An Instructive Analogy

The suggested kit of connectives complies well with the cases of behavior of geoentities in directed alternative change environments described in the "Properties of Geoenvironments…" subsection:
1. Case 1 is represented by modus 5 of flux,
2. Case 2 is represented by modus 4 of flux,
3. Case 3 is represented by modus 1 or 2 of flux,
4. Case 4 is represented by modus 4 (if results are of the same class as the initial geoentity) or 2 (if of different classes) of flux,
5. Case 5 is represented by furcation modus 2 (if results are of similar class as the initial geoentity) or 1 (if not); sometimes the latter can also be expressed as a succession of furcation modus 2 and flux modus 1 or 2,
6. Case 6 is represented by modus 2 of conflux,
7. Case 7 is represented by modus 1 of conflux, and
8. Case 8 is represented by influx.
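For quick reference, the correspondence just listed can be transcribed as a simple lookup table (an illustrative restatement only; the class-dependent alternatives are kept as text):

```python
# The case-to-connective correspondence listed above, as a lookup table.
CASE_TO_CONNECTIVE = {
    1: "flux, modus 5",
    2: "flux, modus 4",
    3: "flux, modus 1 or 2",
    4: "flux, modus 4 (results of the same class) or modus 2 (different classes)",
    5: "furcation, modus 2 (similar class) or modus 1 (otherwise); the latter may "
       "also be a succession of furcation modus 2 and flux modus 1 or 2",
    6: "conflux, modus 2",
    7: "conflux, modus 1",
    8: "influx",
}

print(CASE_TO_CONNECTIVE[8])  # influx
```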
This gives us the reason to consider the choice of these connectives optimal for describing the processes in environments of directed alternative changes. In addition, one may notice a parallelism between these connectives and mechanisms of production of new organisms known in biology. Indeed, flux is an analog to vegetative reproduction, in which one event is root, and the other is offspring. Influx can be paralleled with sexual reproduction. A modified event is the “mother,” the modifier is the “father,” and the result is the
Figure 12. Flux modus 6 and conflux modus 3.
“child” (or, again, though in a bit different sense, “offspring”). The subject of “child” is naturally the same as the subject of “mother,” which is very similar to live birth, while the contribution of “father” is purely “genetic information” (the predicate). However, with time, the child may become really very alike with the father—this option is expressed by modus 3 of flux (though, from the point of view of this analogy, it is a root-offspring relation). Furcation is nothing else but division of cells, and therefore the subject is inherited by all newly formed “cells” as a nucleus would be. However, the analogy here is not complete because furcation implies that only one of the resulting events indicates a geoentity that takes place; others remain only possible but inexistent (contrary to a dividing cell that produces coexisting cells). Conflux represents one more mechanism of producing new organisms, the colonial growth. The resulting event is a “colony” formed by some law, either the unity of subject or of predicate of uniting “primitive organisms,” which certainly are “imprinted” in its structure (e.g., subject formed from a common predicate).
Definition of the Event Bush

The conceptual framework put forth here brings us to a definition of the event bush method. The event bush is a geoinformatic method of construction of scenarios in environments of directed alternative change that is based on the multiflow structure, must include the connectives of flux and influx, and may include the connectives of conflux and furcation.

This definition, still being rather far from formal, fixes the event bush in a state in which it can be, and has been, applied to various geoscientific tasks. The avenue for further research is a totally formal definition of the event bush, which could be taken independently of a geoscientific context to be exported into the field of general informatics for “external testing.” This is expected to be achieved with the investigation of class-subclass relations within an event bush and between the bushes to explore the behavior of subjects and predicates throughout the bush and throughout a network of bushes. At present, however, we deem it pertinent to summarize the relations established within the bush between various types of events (ia, ib, ii, and iii) by its connectives:

ia Ei Flux ii Ej,
ia Ei Flux iii Ej,
ii Ei Flux ii Ej,
ii Ei Flux iii Ej,
ia Ei ib Ej Influx ii Ek,
ia Ei ii Ej Influx ii Ek,
ii Ei ib Ej Influx ii Ek,
ii Ei ii Ej Influx ii Ek,
ii Ei Furcation ii Ej,
ii Ei, ii Ei+1 …, ii Ei+k Conflux ii Ek+1.
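These admissible patterns can be read as a small type system over events. The following Python sketch is one possible encoding, invented here rather than taken from the original method: it records, for each connective, which event types may appear in its causal and effect parts, and checks a proposed change against them.

```python
from dataclasses import dataclass

# Event types: "ia" (primary internal), "ib" (primary external),
# "ii" (secondary), "iii" (tertiary).

@dataclass(frozen=True)
class Event:
    etype: str       # "ia", "ib", "ii", or "iii"
    subject: str
    predicate: str

# Admissible (causal types, effect type) patterns, transcribed from the list above.
# Flux and furcation take a single causal event; influx takes a modified event plus
# a modifier; conflux unites two or more secondary events.
ADMISSIBLE = {
    "flux":      {(("ia",), "ii"), (("ia",), "iii"), (("ii",), "ii"), (("ii",), "iii")},
    "influx":    {(("ia", "ib"), "ii"), (("ia", "ii"), "ii"),
                  (("ii", "ib"), "ii"), (("ii", "ii"), "ii")},
    "furcation": {(("ii",), "ii")},   # each furcation effect is checked one at a time here
}

def change_is_admissible(connective: str, causes: list[Event], effect: Event) -> bool:
    """Check a proposed change against the admissible event-type patterns."""
    if connective == "conflux":
        # ii Ei, ii Ei+1, ..., ii Ei+k Conflux ii Ek+1
        return len(causes) >= 2 and all(c.etype == "ii" for c in causes) and effect.etype == "ii"
    pattern = (tuple(c.etype for c in causes), effect.etype)
    return pattern in ADMISSIBLE.get(connective, set())
```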
Some remarks will be made in the following section to proceed to application issues of the event bush method.

Concomitant Definitions, Rules, and Corollaries

The corollaries and ultimate consequences of the definitions of the event bush and the connectives need to be explored further to optimize their use and avoid possible ambiguity. However, some observations can be made right now. For instance, the possible ways to generate a new subject in the event bush are modi 1 and 3 of Flux and modus 1 of Conflux. Event ii can form cyclic patterns, but with an entrance from ia or another ii and an exit to iii or another ii. Also, for the obvious reason that the resulting events of furcation are incompatible, neither these nor their consequences, provided they have no other causes, may be united by conflux.

Another point stated in the definitions of all the connectives can be summarized as follows. If a predicate P appears for the first time in relation to subject S in event E, this automatically means that in all events throughout the bush that have the same subject but that do not follow directly or indirectly from E, there must be the negation of P, if not explicitly put otherwise.

Some more definitions would enable reasoning and inference in the event bush framework. Because the environment the event bush aims to describe is that of directed alternative changes, it seems useful to interpret the notion of change in terms of the event bush. In the event bush, change is a correctly built expression including one (only one!) connective of any type. General formulae for these expressions were listed in the subsection Connectives of the Event Bush. Change has one causal part (that includes the events, one or more, left of the connective) and one effect part (the events to the right of the connective). The two changes in which an ia event may be involved as a cause alone are flux modus 5 to ii or iii events and flux modus 6 to ii events.

Also, it should be postulated that the left (causal) part of each change is unique, except for the flux modi 4 and 6, and given the left part and the connective (regardless of modus), the right part may also be singular. This means that
1. a given pair of events may produce either solely one influx, or one conflux, or nothing at all;
2. any three or more events may produce either one conflux, or nothing; and
3. if one (secondary) event is the causal part of flux, except for flux modus 2 in the meaning of influence, and simultaneously is the causal part of furcation, only one of these relations may take place in reality; if one (secondary) event is the causal part
of flux modus 2, in the meaning of influence of one event on another (so that the geoentity denoted by the subject of the event in the causal part is not meant to wane), and is also the left part of furcation, both relations may coexist in reality.

Then, flow is a finite totally ordered set (Schröder, 2003) of changes, in which the left part of the first change is a type ia event, the right part of the last change is a type iii event, and the right part of every preceding change is the left part of the following change. Flow describes such geoentities as process or quiescence. A flow is always followed from type ia to iii. If there is a change with influx in it, the modifier of this influx (ib or ii) is included, but not the events that caused it, unless there are successive influx changes. If there is a flow in which at least one event participates in the causal parts of two changes (e.g., one being flux, and another, furcation), there is another flow, which overlaps with the former one in the interval from the beginning (ia) to this very event, and then ramifies. Furthermore, if there is a flow without type ii events, first, it may only consist of one change, namely, the flux modus 5 from type ia to iii, and then, there also must be a flow that overlaps with this one in the beginning (i.e., at the ia event) and includes at least one influx. This influx of ia and another event, ib or ii, will produce a type ii event, which will then route this flow further. There must not be events in a bush that are not included in a flow.

Ramifying (furcating or influxing), confluxing, or coming to a common effect by different fluxes, flows form a multiflow, which is, by definition, the structure of the event bush.

A graphic notation has been adopted for the event bush. It is based on the multiflow structure (Fig. 6), placing events ia on the left, ib on the top, ii in the center, and iii on the right. Events are plotted as rectangular boxes. The connectives are marked: flux, as an ordinary arrow from cause to effect (Fig. 7A); influx, as an arrow from the modifier to the effect with a “right turn” from the modified event (Fig. 7B); conflux, as a “double right turn” (Fig. 7C); and furcation, as a circle with ramification to effects (Fig. 7D). This visualizes the bush and makes it possible to “drive” along the flows following the “road painting” of connectives.

Now, the applications of the method can be considered.

STRATEGIES OF APPLICATION

Individual Bushes

Existing applications of the event bush (Pshenichny et al., 2005, 2009; Pshenichny and Fedukov, 2007; Behncke and Pshenichny, 2009; Diviacco and Pshenichny, 2010), however tentative, already appear complicated enough to be hardly treatable by a human. This is not only because of the natural complexity of the modeled environments but largely because of a lack of semantic structure, which has been introduced in the Basic Assumptions section of the present paper. This structure and the
rules of composition of the event bush have been continuing to evolve before reaching a state of maturity. Next, we will demonstrate one of the latest event bushes that fixes this weakness. A spectacular view of a slope of an active polygenic volcano with lava cones and flows and active faults is shown in Figure 13A. Figure 13B presents an event bush that explains how the observed bodies were formed and the other bodies had to cogenetically form below the surface or could alternatively form on and below the surface. This event bush can be recorded as follows (cause and effect parts of changes are shown in brackets for better readability). ([ia] Host rocks exist in volcano without fissures and dislocations) Flux Modus 5 ([iii] Host rocks exist in volcano without fissures and dislocations). ([ia] Host rocks exist in volcano without fissures and dislocations, [ib] Fissures develop) Influx ([ii] Host rocks in volcano are dissected by fissures yet not dislocated). ([ii] Host rocks in volcano are dissected by fissures yet not dislocated) Flux Modus 5 ([iii] Host rocks in volcano are dissected by fissures yet not dislocated). ([ii] Host rocks in volcano are dissected by fissures yet not dislocated) Flux Modus 2 ([ii] Fissures develop in nondislocated host rocks of volcano but are not filled with magma). ([ii] Fissures develop in nondislocated host rocks of volcano but are not filled with magma) Flux Modus 5 ([iii] Fissures develop in nondislocated host rocks of volcano but are not filled with magma). ([ii] Fissures develop in nondislocated host rocks of volcano but are not filled with magma, [ib] Magma ascends) Influx ([ii] Fissures that develop in nondislocated host rocks of volcano are filled with magma). ([ii] Fissures that develop in nondislocated host rocks of volcano are filled with magma) Flux Modus 3 ([ii] Magma fills the fissures in nondislocated host rocks of volcano). ([ii] Magma fills the fissures in nondislocated host rocks of volcano, neither stopping in the fissures in nondislocated host rocks of volcano and solidifying nor ascending through fissures in nondislocated host rocks of volcano) Furcation Modus 1 ([ii] Magma that filled the fissures in nondislocated host rocks of volcano stops in the fissures in nondislocated host rocks of volcano yet does not solidify and does not ascend through fissures in nondislocated host rocks of volcano, [ii] Magma that filled the fissures in nondislocated host rocks of volcano does not stop in the fissures and does not solidify in nondislocated host rocks of volcano but ascends through fissures in nondislocated host rocks of volcano). ([ii] Magma stops in the fissures in nondislocated host rocks of volcano and yet does not solidify) Flux Modus 4 ([ii] Magma that stopped in the fissures in nondislocated host rocks of volcano solidifies in the fissures in nondislocated host rocks of volcano). ([ii] Magma that stopped in the fissures in nondislocated host rocks of volcano solidifies in the fissures in nondislocated
host rocks of volcano) Flux Modus 5 ([iii] Magma that stopped in the fissures in nondislocated host rocks of volcano solidifies in the fissures in nondislocated host rocks of volcano). ([ii] Fissures develop in nondislocated host rocks of volcano but are not filled with magma) Flux Modus 3 ([ii] Host rocks of volcano are dislocated along fissures not filled with magma). ([ii] Host rocks of volcano are dislocated along fissures not filled with magma) Flux Modus 5 ([iii] Host rocks of volcano are dislocated along fissures not filled with magma). ([ia] Undeformed slope without fissures, uncovered by lava exists on volcano) Flux Modus 5 ([iii] Undeformed slope without fissures, uncovered by lava exists on volcano). ([ia] Undeformed slope without fissures, uncovered by lava exists on volcano, [ii] Fissures develop in nondislocated host rocks of volcano but are not filled with magma) Influx ([ii] Undeformed slope of volcano uncovered by lava is dissected by fissures). ([ia] Undeformed slope without fissures, uncovered by lava exists on volcano, [ii] Host rocks of volcano are dislocated along fissures not filled with magma) Influx ([ii] Slope of volcano uncovered by lava is deformed by host rocks of volcano dislocated along fissures not filled with magma). ([ii] Slope of volcano uncovered by lava is deformed by host rocks of volcano dislocated along fissures not filled with magma) Flux Modus 5 ([iii] Slope of volcano uncovered by lava is deformed by host rocks of volcano dislocated along fissures not filled with magma).

Figure 13 (on this and facing page). Example of interpretation of volcanic environment by the event bush: (A) the environment proper (Mount Etna, Sicily, near Rifugio Sapienza), and (B) the event bush explaining the way in which the observed bodies were formed and the other bodies had to cogenetically form below the surface or could alternatively form on and below the surface. The lava cones in the foreground are a few dozen meters high.
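The bracketed record above (and continued below) follows a regular pattern: a cause part, a connective with its modus, and an effect part. It therefore lends itself to machine reading. The parser below is only a hypothetical illustration of that pattern, since the event bush itself prescribes no file format; the Change container, the helper names, and the regular expressions are assumptions introduced here.

```python
import re
from dataclasses import dataclass

@dataclass
class Change:
    causes: list[tuple[str, str]]   # (event type, event text)
    connective: str                 # "Flux", "Influx", "Furcation", or "Conflux"
    modus: int | None
    effects: list[tuple[str, str]]

# One change per line, in the bracketed notation used in the text, e.g.
# "([ia] Host rocks exist in volcano without fissures and dislocations) Flux Modus 5 ([iii] ...)"
CHANGE_RE = re.compile(
    r"^\((?P<causes>.+?)\)\s+(?P<conn>Flux|Influx|Furcation|Conflux)"
    r"(?:\s+Modus\s+(?P<modus>\d+))?\s+\((?P<effects>.+)\)$"
)
EVENT_RE = re.compile(r"\[(?P<etype>i{1,3}[ab]?)\]\s*(?P<text>[^\[]+)")

def parse_change(line: str) -> Change:
    m = CHANGE_RE.match(line.strip().rstrip("."))
    if m is None:
        raise ValueError(f"not a well-formed change: {line!r}")
    def events(part: str) -> list[tuple[str, str]]:
        return [(e.group("etype"), e.group("text").strip(" ,")) for e in EVENT_RE.finditer(part)]
    modus = int(m.group("modus")) if m.group("modus") else None
    return Change(events(m.group("causes")), m.group("conn"), modus, events(m.group("effects")))
```

A collection of such Change objects is one way to hold a whole bush in memory, after which the type checks sketched earlier can be applied change by change.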
([ii] Fissures develop in nondislocated host rocks of volcano but are not filled with magma, [ii] Magma ascends through fissures in nondislocated host rocks of volcano) Influx ([ii] Fissures that develop at slope of volcano are filled with magma). ([ii] Fissures that develop at slope of volcano are filled with magma) Flux Modus 5 ([iii] Fissures at slope of volcano are filled with magma). ([ii] Fissures that develop at slope of volcano are filled with magma) Flux Modus 2 ([ii] Lava reaches the slope of volcano). ([ii] Lava reaches the slope along fissures on slope of volcano yet neither flows from fissures nor forms lava cones along fissures) Furcation Modus 1 ([ii] Lava that reached the slope along fissures on slope of volcano flows along fissures on slope of volcano from fissures and does not form lava cones on slope of volcano, [ii] Lava that reached the slope along fissures on slope of volcano does not flow along fissures on slope of volcano but forms lava cones along fissures on slope of volcano). ([ii] Lava forms lava cones along fissures on slope of volcano) Flux Modus 3 ([ii] Lava cones form and erupt lava along fissures on slope of volcano). ([ii] Lava cones form and erupt lava along fissures on slope of volcano) Flux Modus 5 ([iii] Lava cones form and erupt lava along fissures on slope of volcano). ([ii] Lava cones form and erupt lava along fissures on slope of volcano) Flux Modus 2 ([ii] Lava flows along fissures on slope of volcano from fissures). ([ii] Slope of volcano uncovered by lava is deformed by host rocks of volcano dislocated along fissures not filled with magma, [ii] Lava forms lava cones along fissures on slope of volcano) Influx ([ii] Slope of volcano deformed by host rocks of volcano dislocated along fissures not filled with magma is covered by lava erupted from lava cones along fissures). ([ii] Slope of volcano deformed by host rocks of volcano dislocated along fissures not filled with magma is covered by lava erupted from lava cones along fissures) Flux Modus 5 ([iii] Slope of volcano deformed by host rocks of volcano dislocated along fissures not filled with magma is covered by lava erupted from lava cones along fissures).

Structural and semantic rules of the event bush make it possible, first, to construct a scenario describing some particular (e.g., actually observed) case involving the necessary and sufficient set of events (ia), (ib), (ii), and (iii), then abstract from this particular case and generate a complete set of possible scenarios based on these very events, their subjects, and predicates. For example,

([ii] Magma stops in the fissures in nondislocated host rocks of volcano yet does not solidify) Flux Modus 4 ([ii] Magma that stopped in the fissures in nondislocated host rocks of
volcano solidifies in the fissures in nondislocated host rocks of volcano). ([ii] Magma that stopped in the fissures in nondislocated host rocks of volcano solidifies in the fissures in nondislocated host rocks of volcano) Flux Modus 5 ([iii] Magma that stopped in the fissures in nondislocated host rocks of volcano solidifies in the fissures in nondislocated host rocks of volcano).

Thus, “missing parts” of a picture can be added, requiring a minimum of additional geoentities (i.e., in line with the “Occam’s razor” rule). Formulations in boxes in Figure 13B may differ from those above; this is due to more intricate relationships between the predicates (“becomes obvious,” “becomes irrelevant,” “is repeated,” etc.) or even peculiarities of natural language. These need to be addressed in further investigations. To describe more complex environments, one may, depending on his or her vision of the environment, either develop one bush or appeal to multibush constructions.

Multibush Constructions

One-Level Interrelated Bushes

A formalization of different geoenvironments by the event bush poses a natural question: How can different bushes be related to each other? They can be united “on plane” and “in space.” “On plane,” this can be done “in line” (tertiary events of one bush become primary internal or external events of another), in a simple network (tertiary events of one bush become primary internal events, and tertiary events of another become primary external events of the same third bush), and in a complex network (tertiary events of one bush become primary internal or external events of more than one bush; or, vice versa, one bush takes primary internal or external events from more than one bush). “In space,” primary internal, primary external, secondary, or tertiary events of one bush can become secondary events of another. These interrelations, which still need to be explored and understood, must give us a better vision of interrelated geoenvironments and hopefully arm us with an efficient tool to create sophisticated information structures simulating the evolutionary scenarios of Earth (a minimal sketch of the simplest, “in line” case follows the next subsection).

Nested Self-Similar Bushes

Aiming to view a singular geoenvironment at different scales, as shown in Figure 5, a series of bushes can be constructed from the most general down to as detailed as desired. This opportunity was demonstrated by Behncke and Pshenichny (2009). The theoretical foundation for building self-similar bushes may come from class-subclass or part-whole relations between the subjects of events of more general and less general bushes. However, these relations within and between the event bushes need to be studied thoroughly, and then the theory of self-similar event bushes will be reported.
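Returning to the “in line” coupling described under One-Level Interrelated Bushes above, the sketch below joins two bushes by treating matching tertiary events of an upstream bush as primary internal events of a downstream bush. It is an assumption of this sketch, not a prescription of the method, that bushes are stored as simple dictionaries keyed by event type; the helper name link_in_line and the example formulations are likewise invented here.

```python
# Hypothetical "in line" coupling of two event bushes: tertiary (iii) events of the
# upstream bush whose wording matches a primary internal (ia) event of the
# downstream bush are regarded as one and the same event.
def link_in_line(upstream: dict[str, set[str]], downstream: dict[str, set[str]]) -> set[str]:
    """Return the tertiary events of `upstream` reused as primary internal events of `downstream`.

    Each bush is assumed to be a mapping from event type ("ia", "ib", "ii", "iii")
    to the set of event formulations of that type.
    """
    shared = upstream.get("iii", set()) & downstream.get("ia", set())
    if not shared:
        raise ValueError("no tertiary event of the upstream bush re-enters the downstream bush")
    return shared

# Example (invented formulations):
# erosion_bush = {"iii": {"Sandstone layer is dissected by valleys"}}
# slope_bush   = {"ia":  {"Sandstone layer is dissected by valleys"}}
# link_in_line(erosion_bush, slope_bush) -> {"Sandstone layer is dissected by valleys"}
```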
Families of Event Bushes

Variations in behavior of a similar geoenvironment, or of geoenvironments of similar class, under the same conditions, which seem to be controversial, can be modeled by a set of event bushes with similar semantics, that is, as a family of event bushes. This can also be used to cope with the incompleteness of the geological record or other data. All bushes in a family have similar sets of (ia), (ib), (ii), and (iii) events but differ in structure, i.e., in the set of connectives uniting these events. While some of the connectives will certainly repeat from bush to bush (e.g., flux modus 2 [ia–iii]), others will not. Thus, one may figure out the relations maintained, or asserted, throughout the entire family, in some of its bushes, or in none of them. This strongly resembles the concept of tautology and executable formula in logic (Smullyan, 1968). Meanwhile, this also models the diversity of expert opinions and may serve for their elucidation and reconciliation (Wood and Curtis, 2004).

DISCUSSION: THE EVENT BUSH IN THE GEOINFORMATIC AND INFORMATIC FRAMEWORK

Interrelationships between objects of similar range can be found in many domains (e.g., history, biology, business, etc.). Hence, the approach to identification and classification of methods of information modeling based on the environments they address seems to be extendable from geoinformatics onto other thematic fields. This vision may be a valuable contribution of geoinformatics to general informatics, if it passes the “external” testing. Simultaneously, it looks interesting to view the evolutionary aspect of the armory of geoinformatics. Indeed, the main intellectual challenge is extraction of a strict, “solid” vision from a hot mess of professional intuition, personal feelings, beliefs, impressions, and pieces of other knowledge, for which a metaphor
would be emplacement of lava or intrusive magma and its further cooling, solidification, and crystallization of “minerals” (e.g., less evolved, subjects and predicates, and more complex, interrelated events) in a particular order.

Let us compare the environment-based classification with another classification of methods of geoinformatics suggested earlier by Sinha et al. (2008) based on progressively strengthening semantics (Fig. 14). Transition from weaker to stronger semantics includes, according to Sinha and coauthors, a passage from taxonomy (expressed as a relational model written, e.g., in extensible markup language [XML]), to thesaurus, to conceptual model, and, finally, to logical theory formulated in terms of first-order logic (Smullyan, 1968). This is a natural pathway of increasing maturity of information modeling, which can be seen in general as progressing from description of discoveries to integration of information. Stronger semantics mean larger expressive power, a higher level of interoperability, and, as a result, feasibility of more complex queries. It allows, according to Sinha et al. (2008), “inferences” (not necessarily in the exact logical sense—see, e.g., Gentzen, 1934) from heterogeneous data sets. At the same time, it requires community agreement on conceptual relationships in the information domain. Such agreement, complete or partial, can be fairly well expressed by an event bush or a family of bushes.

Figure 14. Conceptual evolution of methods of geoinformatics (by Sinha et al., 2008). XML—extensible markup language, OWL—Web Ontology Language, UML—unified modeling language, RDF—resource description framework, and DB—database.

In comparing the two schemes of methods of geoinformatics, our classification (Fig. 2) and the evolutionary scheme of Sinha et al. (Fig. 14), one may suppose that the early stage of evolution of the methodology of geoinformatics (the taxonomy/thesaurus stage) refers solely to the subject-based methods in our classification, while the mature state (the conceptual model/logical theory stage) refers to both subject-based and event-based ones. This seems reasonable because the primitive relations between individual subjects and predicates (class-subclass, part-whole, etc.), which are the basis at the early stage, are not enough to set the relationships between the events
unless these relations are supported by much more evolved structural rules like the connectives of the event bush—just as a volcanic rock cannot transform into an intrusive one in the process of crystallization. The event bush, hence, must be regarded as a conceptual model in the classification of Sinha et al. (2008). Tight interrelationship between structure and semantics is the virtue of this approach. Clarification of the class-subclass relations between the subjects and predicates of the bush will bridge the gap between the event-based and subject-based methods and allow us to carefully track all the properties and their influence on further products generated in an environment (also marking permitted omissions of properties, as done in Fig. 13B). Such a “property survey” can be easily performed throughout the bush or even interrelated bushes. This is the forthcoming task of our research.

The event bush also has good prospects of becoming a formal theory (logical or similar), which occupies the highest position in the succession plotted by Sinha et al. (2008; see Fig. 14). There is a mind-teasing analogy between the connectives of the event bush and logical connectives (though no direct parallels between particular connectives seem plausible), between changes and propositional formulae, between flows present in a family of bushes and tautologies/executable formulae, and some others. Taking the left and right parts of any change as “meta-events,” one perhaps may think of nested bushes, similar to logical connectives that may unite not only variables but also formulae of any length. The issue of truth values or their analogs for the event bush needs to be discussed; however, the interrelationships between flows based on the definitions of connectives may open a theoretical opportunity of inference nearly sensu Gentzen (1934). Certainly, this will not be a logical interpretation but a new formal theory built in the same way as existing logical systems (Hilbert and Bernays, 1934). More research is needed to find out whether the rules and definitions adopted herein (in the “Basic Assumptions…” section) may result in duplication of events, in incompatible (controversial) inferences, or in situations in which it is unclear whether a predicate P relates to the subject S in a particular event or not, and hence, which events may and may not follow from this event by this or that connective.

To our knowledge, this is the first attempt to create a more or less strict conceptual model in the geosciences. Existing event-based methods (e.g., Bayesian belief networks—see, e.g., Jensen, 2001; Aspinall et al., 2003; or event trees—see, e.g., Newhall and Hoblitt, 2002; Bayesian trees—see, e.g., Marzocchi et al., 2008) have not entered, in fact, the field of geoinformatics, because they have no semantic rules governing the formulation and behavior of events. Nevertheless, the event bush can also be used for their optimization, being convertible to both, though with some loss of relevant information (Pshenichny et al., 2009). The well-known method of neural networks, in our opinion, may have semantic constraints, at least at the level of architecture (input layer, hidden layers, output layer), and recent publications show promising intersections of neural networks with the
subject-based methods (i.e., ontology design—see Abolhassani et al., 2006). The issues of conceptual interrelation and an opportunity of co-application of event bush and neural networks will be addressed in future studies. Importantly, in constructing the conceptual models, the event bush neither provides nor is based on definitions of involved events, or their subjects, or predicates, i.e., it allows geoscientists to reason exactly as they do standing at an outcrop and pointing to actual rock features. Ironically, absence of common natural language (English, Spanish, French, Russian, any other) may help specialists better understand each other, just pointing to different objects in a rock wall, drawing primitive sketches on sand, making gestures, and expressing their (dis)agreement. Thus, they manage to exchange complicated ideas that would otherwise require a few research papers to express, and, yet more important, collectively reason, avoiding concepts diversely defined in their scientific schools. They appeal not to concepts but directly to geoentities, implicitly naming them as “…what we see here and anything like this anywhere else.” Meanwhile, the factor that is left completely at everyone’s discretion is what to consider “like this.” This appears to be an efficient way to discard one of the most painful issues of formalization of the geoscience, the disagreement on definitions. Instead, what is said is, in fact, the following, “Whatever to consider LIKE THIS, it MUST behave so and so depending on this, this, and that circumstance.” In many cases, this seems to be a way to perform a working formalization of geoscientific contexts. It should be also stressed that the event bush operates with the geoentities, initially not accompanied by physical parameters or mathematical variables, but enables reasoning in terms of geoentities (“things” or even “words”) almost as strict as in terms of parameters and numerical values. Thus, it draws a clear distinction between the qualitative and quantitative and proves that geoentities do not need to be interpreted in terms of physics or mathematics to be treated strictly and formally. However, this does not undermine the quantitative modeling; rather, it allows one to better focus the models and more firmly root them in the geoentities. The theoretical opportunity for binding geophysical models to the event bushes describing the corresponding phenomena qualitatively, as well as practical benefits of this venture have been shown by Carniel et al. (2011). Also, Pshenichny et al. (2009) show how time (and then, space) values can be incorporated in the event bush. This is seen as a methodologically better option to address the studies of matter and energy of Earth. Information modeling of geoentities is even more urgent, because, like many other dominantly descriptive disciplines, the geosciences show a kind of internal resistance to formalization, application of artificial intelligence and design methods, and corresponding theoretical rethinking. This is due to the character of descriptive knowledge, which is genuinely intuitive and stuck to particular processes and objects, as well as personal apprehension instead of focusing on general properties and conditions and ubiquitous laws. From one side, this opens another, so far largely unused, opportunity for formalization taken precisely
by the event bush—mimicking the way the objects behave in nature, a formalism is sought that best suits our “feeling” of the object as well as our way of reasoning about it. However, from the other side, any theoretical rethinking of an information domain, being a desired consequence of application of the information technologies (Loudon, 2000), draws a picture different from that the community is used to, regardless of whether the latter is adequate and even correct. Since the very first steps, the way of thinking that underlies the event bush has differed from virtually all existing pathways of thought “officially” perceived in the earth sciences, be it the essentially inductive and nonstrict traditional geological consideration (for example, reconstruction of geological history), modeling in terms of physical parameters, building of a single-root event/probability tree (Newhall and Hoblitt, 2002), or compilation of a Bayesian belief network based on the expert’s knowledge and intuition. Implementation of a formalism, then, will proceed much faster if it can find direct analogies in the “physical” world, as do trees, neural network shapes, mindmaps, or loop diagrams. The event bush has at least two such obvious common-sense analogies: one with the patterns of flow, and another with the methods of production of new organisms in biology. Also, one may parallel its connectives with the “driving code” (to “drive” different ways from the left and top to the right) or other examples of everyday life. This gives a hope that the event bush will successfully pass the “external testing” and become an instrument of general informatics. At this stage, its expressive and communicative power will be employed to the fullest, and it may become an efficient tool to communicate within the scientific community and between this community and society as a whole at the optimal balance between clarity and strictness.

CONCLUDING REMARKS

A peculiar feature of the behavior of information in the earth science domain is the ubiquitous interrelationship between entities of similar range. This provides the groundwork to consider geoinformatics as the study of environments that include geoentities (objects and/or processes) of similar class, which are also bound by some other relation. On this basis, a classification of geoenvironments can be suggested, which, first, seems to sufficiently organize the set of methods that are or can be used by geoinformatics, and then, appears transferable to other alike disciplines (some life sciences, business activities, history, and others). In this classification, the complexity ascends from subject-based to event-based methods (which model the no-change and changing environments, correspondingly). Among the event-based, the complexity generally grows from nondirected to directed change environments, and among the directed change ones, from those with invariant to those with alternative changes. These kinds of environments may be addressed by a variety of methods, but none of them except the event bush suggests semantic and structural rules that would allow us to consider it a conceptual model and a method of information modeling. How-
ever, existing methods (at least Bayesian belief networks) may benefit from the event bush, which seems to be reducible to these, and thus it may become a mature means of information modeling. Basic assumptions, definitions, and rules of composition of the event bush in their present form allow us to widely apply the event bush in a variety of geoscientific tasks. An example of volcanological application was shown in this paper. So far, the event bush enables us to construct the interrelated flows of events, including those that are missing in observation, requiring minimum additional geoentities. Nevertheless, further theoretical research toward the complete formalization is needed to examine and develop the theoretical basement of the event bush method to ensure the absence of flaws and ambiguities and formulate the guidelines for automatic reasoning in it. ACKNOWLEDGMENTS We are deeply obliged to Sergey Nikolenko for constructive criticism of the first part of this work, as well as to Victoria Shterkhun, Zina Khrabrykh, Alexander Rezyapkin, and many others, who have worked side by side with us for years, and who encourage and support us. The research was carried out in the framework of the Marie Curie Action “International Research Staff Exchange Scheme” (FP7-PEOPLE-IRSES-2008) Cross-Disciplinary Knowledge Transfer for Improved Natural Hazard Assessment (CRODINAS; 2009–2011), EC Framework Programme 7, grant no. 230826. REFERENCES CITED Abolhassani, H., Bagheri-Hariri, B., and Haeri, S.H., 2006, On ontology alignment experiments: Webology, v. 3, no. 3, article 28, http://www.webology .ir/2006/v3n3/a28.html (accessed 4 June 2011). Aspinall, W.P., Woo, G., Voight, B., and Baxter, P.J., 2003, Evidence-based volcanology: Application to eruption crises: Journal of Volcanology and Geothermal Research, v. 128, p. 273–285, doi:10.1016/S0377-0273 (03)00260-9. Aspinall, W.P., Carniel, R., Jaquet, O., Woo, G., and Hincks, T., 2006, Using hidden multi-state Markov models with multi-parameter volcanic data to provide empirical evidence for alert level decision-support: Journal of Volcanology and Geothermal Research, v. 153, p. 112–124, doi:10.1016/ j.jvolgeores.2005.08.010. Behncke, B., and Pshenichny, C.A., 2009, Modeling unusual eruptive behavior of Mt. Etna, Italy, by means of event bush: Journal of Volcanology and Geothermal Research, v. 185, p. 157–171, doi:10.1016/j.jvolgeores .2009.04.020. Bloomfield, L., 1914, An Introduction to the Study of Language: New York, Henry Holt and Company, 335 p. Bogdanov, A., 1926, Allgemeine Organisationslehre: Tektologie, v. I, p. II. Carniel, R., Pshenichny, C.A., Khrabrykh, Z., Shterkhun, V. and Pascolo, P., 2011, Modeling models: Understanding structure of geophysical knowledge by means of the event bush method, in Marschallinger, R., and Zobl, F., eds., Mathematical Geosciences at the Crossroads of Theory and Practice: Proceedings of the Conference of International Association of Mathematical Geosciences, 5–9 September, Salzburg, Austria. De Saussure, F., 1983 [1913], Course in General Linguistics (Harris, R., translator): La Salle, Illinois, Open Court. Diviacco, P., and Pshenichny, C.A., 2010, Concept-referenced spaces in computer-supported collaborative work, in Proceedings of the European Geosciences Union (EGU) 5th General Assembly, Vienna (Austria): Geophysical Research Abstracts, v. 12, EGU201-EGU6258 (CD-ROM).
Feigenbaum, E.A., 1984, Knowledge engineering: The applied side of artificial intelligence, in Hagel, H.P., ed., Proceedings of a Symposium on Computer Culture: The Scientific, Intellectual, and Social Impact of the Computer: New York, New York Academy of Sciences, p. 91–107. Gentzen, G., 1934, Untersuchungen über das logische Schliessen, I–II. “Math. Z.,” bd. 39, h. 2, 3 (English translation: Gentzen, G., 1969, Investigations into logical deduction, in Szabo, M., ed., The Collected Papers of Gerhard Gentzen: Amsterdam, North-Holland, p. 68–128). Hilbert, D., and Bernays, P., 1934, Grundlagen der Mathematik (Fundamentals of Mathematics), Bd. I: Heidelberg, Springer-Verlag, 2 vols. Jakeman, A.J., Voinov, A., Rizzoli, A.E., and Chen, S., 2008, Environmental Modelling, Software and Decision Support: State of the Art and New Perspectives: Amsterdam, Elsevier, 384 p. Jensen, F.V., 2001, Bayesian Networks and Decision Graphs: New York, Springer, 268 p. Loudon, T.V., 2000, Geoscience after IT: Computers & Geosciences, v. 26, 13 p. Mandelbrot, B.B., 1982, The Fractal Geometry of Nature: New York, W.H. Freeman and Company, 468 p. Marzocchi, W., Sandri, L., and Selva, J., 2008, BET_EF: A probabilistic tool for long- and short-term eruption forecasting: Bulletin of Volcanology, v. 70, p. 623–632, doi:10.1007/s00445-007-0157-y. Newhall, C.G., and Hoblitt, R.P., 2002, Constructing event trees for volcanic crises: Bulletin of Volcanology, v. 64, p. 3–20, doi:10.1007/s004450100173. Oliveros, A.Q., Carniel, R., Tárraga, M., and Aspinall, W., 2008, On the application of hidden Markov model and Bayesian belief network to seismic noise at Las Cañadas caldera, Tenerife, Spain: Chaos, Solitons, and Fractals, v. 37, p. 849–857, doi:10.1016/j.chaos.2006.09.073. Pshenichny, C.A., and Fedukov, R.A., 2007, Formal treatment of knowledge in water science by means of event bush, in Proceedings of the European Geosciences Union (EGU) 2nd General Assembly, Vienna (Austria): Geophysical Research Abstracts, v. 9, EGU2007-A-01016 (CD-ROM). Pshenichny, C.A., and Khrabrykh, Z.V., 2002, Knowledge base of formation of subaerial eruption unit, in Leroy, S., and Stuart, I., eds., Environmental Catastrophes and Recovery in the Holocene: London, Brunel University, http://atlas-conferences.com/cgi-bin/abstract/caiq-22 (accessed 4 June 2011). Pshenichny, C.A., Carniel, R., and Akimova, V.L., 2005, Decreasing the uncertainty of BBN technique by means of complex formal approach to volcanological information treatment, in Proceedings of the European Geosciences Union (EGU) 2nd General Assembly, Vienna (Austria): Geophysical Research Abstracts, v. 7, EGU05-A-01016 (CD-ROM). Pshenichny, C.A., Nikolenko, S.I., Carniel, R., Sobissevitch, A.L., Vaganov, P.A., Khrabrykh, Z.V., Moukhachov, V.P., Shterkhun, V.L., Rezyapkin,
A.A., Yakovlev, A.V., Fedukov, R.A., and Gusev, E.A., 2008, The event bush as a potential complex methodology of conceptual modelling in the geosciences, in Sanchez-Marre, M., Bejar, J., Comas, J., Rizzoli, A., and Guariso, G., eds., Proceedings, iEMSs—International Congress on Environmental Modelling and Software: Barcelona, International Environmental Modelling and Software Society, v. 2, p. 900–912. Pshenichny, C.A., Nikolenko, S.I., Carniel, R., Vaganov, P.A., Khrabrykh, Z.V., Moukhachov, V.P., Akimova-Shterkhun, V.L., and Rezyapkin, A.A., 2009, The event bush as a semantic-based numerical approach to natural hazard assessment (exemplified by volcanology): Computers & Geosciences, v. 35, p. 1017–1034, doi:10.1016/j.cageo.2008.01.009. Schröder, B.S.W., 2003, Ordered Sets: An Introduction: Boston, Birkhäuser Boston, Inc., 391 p. Sinha, K., Malik, Z., Raskin, R., Barnes, C., Fox, P., McGuinness, D., and Lin, K., 2008, Semantics-based interoperability framework for geosciences: Eos (Transactions, American Geophysical Union), Fall Meeting supplement, abstract IN31D-11. Smullyan, R.M., 1968, First-Order Logic: Berlin, Springer-Verlag, 158 p. Sowa, J.F., 2000, Knowledge Representation: Logical, Philosophical, and Computational Foundations: Pacific Grove, California, Brooks/Cole Publishing Co., 594 p. Tulupyev, A.L., and Nikolenko, S.I., 2005, Directed cycles in Bayesian belief networks: Probabilistic semantics and consistency checking complexity, in Gelbukh, A., de Albornoz, A., and Terashima, H., eds., Proceedings of the Mexican International Conference on Artificial Intelligence 2005: Berlin, Springer-Verlag, Lecture Notes in Artificial Intelligence, v. 3789, p. 214–223. Tutte, W.T., 1998, Graph Theory As I Have Known It: New York, Oxford University Press, 156 p. Umpleby, S.A., 2007, Physical relationships among matter, energy and information: Systems Research and Behavioral Science, v. 24, p. 369–372, doi:10.1002/sres.761. Uspenskii, V.A., and Semenov, A.L., 1993, Algorithms: Main Ideas and Applications: Dordrecht, the Netherlands, Kluwer Academic Publishers, 269 p. Von Bertalanffy, L., 1968, General System Theory: Foundations, Development, Applications: New York, George Braziller, 289 p. Wood, R., and Curtis, A., 2004, Geological prior information and its application to geoscientific problems, in Curtis, A., and Wood, R., eds., Geological Prior Information: Geological Society of London Special Publication 239, p. 1–14.
MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
Printed in the USA
The Geological Society of America Special Paper 482 2011
Infusing semantics into the knowledge discovery process for the new e-geoscience paradigm

A. Krishna Sinha
Department of Geological Sciences, Virginia Tech, Blacksburg, Virginia 24061, USA

Sinha, A.K., 2011, Infusing semantics into the knowledge discovery process for the new e-geoscience paradigm, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 165–181, doi:10.1130/2011.2482(13). For permission to copy, contact [email protected]. © 2011 The Geological Society of America. All rights reserved.
ABSTRACT The need to develop a geoscience cyberinfrastructure framework for both the discovery and semantic integration of disciplinary databases in geosciences is abundantly clear as we seek to unravel both the evolutionary history of Earth and address significant societal challenges. Although geoscientists have produced large amounts of data, the ability to find, access, and properly interpret these large data resources has been very limited. The main reason for the difficulties associated with both discovery and integration of heterogeneous and distributed data sets is perhaps related to the adoption of various acronyms, notations, conventions, units, etc., by different research groups. This makes it difficult for other scientists to correctly understand the semantics associated with data, and it makes the interpretation and integration of data simply infeasible. This paper presents the scientific rationale for developing new capabilities for semantic integration of data across geoscience disciplines. In order to enable the sharing and integration of geosciences data on a global scale, ontology-based data registration and discovery are required. Hence, this paper describes the need to develop foundation-level ontologies for efficient, reliable, and accurate data sharing among geoscientists. Ontologically registered data can be modeled through the use of geoscientific tools to answer complex user queries. This paper emphasizes the need to share tools such as Web services that are registered to a service ontology and made accessible to the scientific community at large. Future development would include an ontology of concepts associated with processes, enabling users to conduct both forward and reverse modeling toward a more robust understanding of complex geoscience phenomena. This paper presents two use cases for a semantic infrastructure model registering data and services, including processes for analysis of complex geoscience queries.
INTRODUCTION

Communities of scientists around the world are working toward the goal of discovering new knowledge through a better understanding of the fundamental principles that lie behind com-
plex and heterogeneous databases (Sinha et al., 2010). There is common consensus that access and integration (e.g., layering of data) of data are prerequisites for creating an information infrastructure, but, arguably, this cannot be the ultimate goal. We need to add transformative capabilities to data and related information
through integrating processes that capture the full significance of the data leading to knowledge discovery. Figure 1 (Fleming, 1996) emphasizes the stages required to move from data to principles of science, which can be readily referred to as transforming data to knowledge.
Figure 1. The pathway from data to knowledge and beyond is a three-step activity: first seeking information as it relates to description, definition, or perspective (what, when, where), and then knowledge constituting strategy, practice, method, or approach (how), which leads to new insight and principles (why).
While information entails an understanding of the relations between data, it generally does not provide a foundation for why the data are what they are, nor an indication as to how the data would change over time through physical, chemical, and biological processes. When a pattern of relationships exists amidst the data and information, the pattern has the potential to represent knowledge that can be further explored through a better understanding of principles. I support the premise that semantic capabilities at all levels of logic are required to follow the path from syntactic to semantic interoperability (Fig. 2), as well as to infer unknown relationships between data through reasoning (Sinha et al., 2010).

Ontologies have long been used to describe all entities within an area of reality, and all relationships between such entities. Therefore, an ontology constitutes a set of well-defined terms with well-defined relationships (Gruber, 1993), and it can be a vital tool enabling researchers to turn data into knowledge. Computer scientists have made significant contributions to linguistic formalisms and computational tools for developing complex vocabulary systems using reason-based structures, and I suggest that a three-tiered ontology framework will be required to provide researchers with the pathway from data to examination of the fundamental principles that govern the sciences.

There is common consensus that scientific disciplines have to deal with (1) large data volumes, (2) complexity of data, (3) volatility of data, (4) heterogeneity of data, (5) broad distribution of data resources, and (6) access to tools and services that can appropriately render and represent data and data products. A knowledge-creation infrastructure that enables access to and integration of heterogeneous data is required to meet these challenges, and it is often referred to as e-science. The semantic capabilities needed
to integrate complex and heterogeneous data within such an infrastructure is the focus of this paper. Specifically, I emphasize that this infrastructure will require a combination of upper-level, midlevel, and foundation-level ontologies coupled with data-level domain ontologies, as well as semantically enabled tools and services to achieve the goal of transforming data into knowledge.

Figure 2. There are multiple levels of semantics and associated interoperability capabilities (Obrst, 2003). Increasing interoperability services requires increasing community agreement on conceptual relationships across participating geoscience disciplines. Strong semantics allow inferences and reasoning from data set contents (Sinha et al., 2010). DAML+OIL—Darpa Agent Markup Language+Ontology Interchange Language, DB—database, ER—Entity-Relationship, OWL—Web Ontology Language, RDF/S—Resource Description Frameworks, UML—Unified Modeling Language, XML—extensible markup language, XTM—syntax for topic maps.

In the new e-science paradigm, geoscientists have moved toward using the Web as a medium to exchange and discover vast amounts of data. The current practice is dominated by establishing methods to access data, with little emphasis on capturing the meaning of the data, which would facilitate interoperability and integration. Some common current methods for integration include schema integration, leading to the use of mediated schemas that provide a uniform query interface for multiple resources (Halevy et al., 2006). Methods using peer data management (Aberer, 2003) can allow participating data sources to retrieve data directly from each other, and this is likely to extend data integration to the Internet scale. However, such query capabilities require syntactic and semantic mapping across resources in order to be effective. Clearly, availability of ontologies will become a prerequisite for semantic integration. In this paper, I adopt the definition of ontology as a set of knowledge terms, including the vocabulary, the semantic interconnections, and rules of inference and logic for some particular topic (Gruber, 1993; Noy, 2004).

BUILDING A SEMANTIC FRAMEWORK FOR THE SCIENCES AND EARTH SCIENCE IN PARTICULAR

Here, three types of ontologic frameworks are identified for discovery of data and its integration: object (e.g., materials), process (e.g., chemical reactions), and service (e.g., simulation models or geochemical filters) (Sinha et al., 2006a; Malik et al., 2007a). Objects represent our understanding of the state of the system when the data were acquired, while processes capture the physical and chemical actions on objects that may lead to changes in state and condition over time. Service provides tools (e.g., simulation models and analysis algorithms) to assess multiple hypotheses, including inference or prediction. Object ontology characterizes the semantics of the data. It maps the metadata in the databases to different concepts essential for data search and integration. The service ontology maps instances of services to conceptual tasks to permit semantic searches and automatic linkages to types of data. The process ontology captures the broad domain knowledge, including information such as understanding of the data set, relationships among the different variables, normal range of the variables, known causal relationships (e.g., Reitsma and Albrecht, 2005; Sinha et al., 2006a; Barnes, 2006), etc. These three classes of ontologies within the semantic layer of e-geoscience are thus required to enable automated discovery, analysis, utilization, and understanding of data through both induction and deduction. Although this paper emphasizes primarily the current status of object and service ontologies, I recognize the need to expand this capability to a point where sci-
entists can examine the relationships between data and external factors, such as processes that may influence our understanding of the reasons certain events happen. However, the development of object ontologies is a prerequisite for semantic interoperability across process, object, and service ontologies (Sinha et al., 2006b; Rezgui et al., 2007, 2008). It is important to note that the semantic framework presented in this paper is conceptual rather than explicitly formulated to meet all known rules for building formal ontologies. It is, however, the underpinning for formalizing its semantic content. I also briefly describe the status of two software-engineered prototypes that enable data to be registered to known ontologies, leading to a new method for discovery, integration, and analysis of heterogeneous data.

OBJECT ONTOLOGY: SEMANTIC FRAMEWORK FOR DATA

Object ontologies can be represented at four levels of abstraction: upper- (Semy et al., 2004) and mid-level (Raskin and Pan, 2005), as well as foundation-level and discipline-specific (i.e., earth science) ontologies (Fig. 3). The latter two are the subject of this paper. Upper-level ontologies, e.g., the Suggested Upper Ontology (SUO) (Phytila, 2002; Niles and Pease, 2001) and the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) (Masolo et al., 2002), are domain independent and provide universal concepts applicable to multiple domains, while midlevel ontologies, e.g., SWEET (Semantic Web for Earth and Environmental Terminology; sweet.jpl.nasa.gov/ontology/), constitute a concept space that organizes knowledge of earth system science across its multiple, overlapping subdisciplines. SWEET also includes concepts of data representations, services, and legacy. Foundation-level ontologies capture relationships among conceptual organizations of types of data, including their measurements, while domain-level ontologies are discipline specific and can be used for efficient, reliable, and accurate data sharing among earth scientists (Sinha et al., 2006a, 2006b). These ontologies seek to utilize existing community-accepted high-level ontologies such as SUO (http://suo.ieee.org/SUO/Ontology-refs.html, Institute of Electrical and Electronics Engineers [IEEE] endorsed) and SWEET (http://sweet.jpl.nasa.gov). In particular, the SWEET ontology contains formal definitions for terms used in earth and space sciences and encodes a structure that recognizes the spatial distribution of earth environments (earth realm) and the interfaces between different realms (Raskin and Pan, 2005; Raskin, 2006). Thus, SWEET provides an extensible midlevel terminology that can be readily utilized by both foundation-level and domain-specific ontologies (Malik et al., 2010).

Figure 3. Conceptual organization of object ontologies (Unified Modeling Language [UML] diagram) at various levels of granularity is necessary for transformation of data to knowledge. Both SUO (http://suo.ieee.org/SUO/Ontology-refs.html) and SWEET (http://sweet.jpl.nasa.gov) ontologies can be used to provide connectivity to existing and future ontologies related to all science disciplines. The ontology for elements and isotopes under the concept of Material (Matter) discussed in this paper is considered a foundation ontology because it is common to all science domains. Such high-level UML diagrams show that Materials have properties, age, structure, and location, while Measurements include all analytical tools, including human observations, used for gathering data associated with any object. Domain-specific ontologies utilize these concepts for semantic integration across disciplines. SUMO—Suggested Upper Merged Ontology, SWEET—Semantic Web for Earth and Environmental Terminology.

In this paper, I describe an approach for building both foundation- and domain-level ontologies for interdisciplinary integration within the earth sciences. Specifically, I describe why community-supported foundation ontology development is a prerequisite for developing domain ontologies. I then use two case histories to emphasize the complexity of this endeavor and suggest that the semantic
Figure 3. Conceptual organization of object ontologies (Unified Modeling Language [UML] diagram) at various levels of granularity is necessary for transformation of data to knowledge. Both SUO (http://suo.ieee.org/SUO/Ontology-refs.html) and SWEET (http://sweet.jpl.nasa.gov) ontologies can be used to provide connectivity to existing and future ontologies related to all science disciplines. The ontology for elements and isotopes under the concept of Material (Matter) discussed in this paper is considered a foundation ontology because it is common to all science domains. Such high-level UML diagrams show that Materials have properties, age, structure, and location, while Measurements include all analytical tools, including human observations, used for gathering data associated with any object. Domain-specific ontologies utilize these concepts for semantic integration across disciplines. SUMO—Suggested Upper Merged Ontology, SWEET—Semantic Web for Earth and Environmental Terminology.
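As a minimal illustration of how the layering in Figure 3 can be expressed in practice, the sketch below uses the Python rdflib library to declare a foundation-level Material class beneath a midlevel concept and to hang earth-science domain classes beneath it. The namespaces and class names are placeholders chosen for this example, not the published SUO or SWEET identifiers.

```python
# A minimal sketch (Python, rdflib) of layering domain concepts under
# foundation- and midlevel concepts.  All namespaces and class names are
# illustrative placeholders, not the published SUO/SWEET identifiers.
from rdflib import Graph, Namespace, RDF, RDFS, OWL

SWEET = Namespace("http://example.org/sweet/")       # stand-in for a SWEET module
FOUND = Namespace("http://example.org/foundation/")  # foundation-level concepts
GEO = Namespace("http://example.org/geodomain/")     # earth-science domain concepts

g = Graph()
g.bind("sweet", SWEET)
g.bind("found", FOUND)
g.bind("geo", GEO)

# The foundation-level Material concept is declared once and reused by every
# science domain; here it is anchored to a midlevel (SWEET-like) concept.
g.add((FOUND.Material, RDF.type, OWL.Class))
g.add((FOUND.Material, RDFS.subClassOf, SWEET.Substance))

# Domain-level classes inherit from the foundation concept, so a query posed
# against Material also discovers data registered against Mineral or Rock.
for cls in (GEO.Mineral, GEO.Rock):
    g.add((cls, RDF.type, OWL.Class))
    g.add((cls, RDFS.subClassOf, FOUND.Material))

print(g.serialize(format="turtle"))
```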
SEMANTIC FRAMEWORK FOR OBJECT ONTOLOGY AT THE FOUNDATION LEVEL

Foundation ontologies for all sciences (Fig. 3) can be viewed as a representation of formal declarative specifications of all objects, phenomena, and their interrelationships. I emphasize that the concept of Matter (labeled as "Material" in Fig. 4; Helvetica font is used for concepts and classes), including all thermodynamic states of matter, is the most fundamental of all ontologies. Clearly, without matter, there can be no semantic concept of location, time, and structure, or of the physical properties of matter and the instruments that measure these properties. These foundation ontologies may then be used to capture discipline-specific terms, such as those for minerals, rocks, the geologic time scale, geologic structures, or geologic phenomena. This approach also readily accepts geoscience terms being developed through GeoSciML (http://www.geosciml.org/), a markup language designed to promote syntactic integration of heterogeneous resources (Boisvert et al., 2003; Simons et al., 2006; Malik et al., 2010). In the following section, I first describe individual foundation-level ontology packages through high-level concept maps. These concept maps are organized to explore relationships without the full attribution of class structures and associations.
Figure 4. Foundation-level representation of the concepts as packages and its connectivity with midlevel SWEET ontologies. I use the term "packages" to emphasize that multiple ontologies are contained within each package. For example, the Material package contains ontologies for elements and isotopes, which are readily linked to domain-specific concepts such as minerals, rocks, water, or magma and can be readily linked to GeoSciML-endorsed terms. The semantic relationship between all packages and their subclasses is the foundation for semantic interoperability. The packages shown are Material, Time, Structure, Location, Physical Properties, Measurements, Mathematical Functions, and Phenomenon, with imports from the SWEET Units, Space, Physical Property, and Physical Phenomena ontologies, OWL-Time (http://www.isi.edu/~pan/OWL-Time.html), and a numerics ontology.
When possible, earth-science domain-level terms and concepts have been added to the foundation-level ontologies to show their extensibility and thus enable semantic registration of data and terms for integration.

Material: The elements, constituents, or substances of which something is composed or can be made; matter that has qualities which give it individuality and by which it can be categorized (Webster's New Collegiate Dictionary, http://www.merriam-webster.com/dictionary/material, accessed 9 May 2011). Materials include classes of elements and isotopes, as well as states of matter, such as solids, liquids, fluids, and gases, and all associated properties. These fundamental classes are linked to earth science–specific concepts of minerals and rocks (Fig. 5).
Element Ontology

Figure 6 is a class diagram representing the different concepts related to elements and the explicit relationships between them. The element class (e.g., DeLaeter et al., 2003) contains a list of properties, including name, nominal atomic weight, symbol, atomic number, color, atomic radius, melting point, boiling point, density, specific heat, ionization potential, electron affinity, heat of fusion, heat of vaporization, bonding radius, and electronegativity. In addition to other properties, such as classification of elements as metallic, nonmetallic, or semimetallic, Goldschmidt's classification (atmophile, chalcophile, siderophile, or lithophile), rare earth group, platinum group elements, etc., it also contains properties representing the arrangement of the element in the periodic table (http://www.webelements.com/): group, period, and block (s-block, p-block, d-block, or f-block). The following resources were used to gather information about elements and their properties: http://chemlab.pc.maricopa.edu/periodic/periodic.html, http://www.webelements.com/, as well as the IUPAC (International Union of Pure and Applied Chemistry) technical report on atomic weights of the elements (DeLaeter et al., 2003). The major difference between the element ontology described in this paper and the periodic table ontology developed under the DARPA Agent Markup Language (DAML) program is the explicit packaging of elements of interest to geoscientists, e.g., large ion lithophile elements (LILEs). I also represent each element as a subclass of the Element class, so that each element inherits all common properties of the class Element. I use the term "nominal atomic weight" to represent the atomic weights of both mono-isotopic and polyisotopic elements, and to distinguish it from isotope atomic weights.

Isotope Ontology
Figure 7 is a class diagram that shows relationships within the isotope ontology. It shows that an Isotope has the following properties: symbol, number of neutrons, and isotope atomic weight (Rosman and Taylor, 1998). I separate the concept of atomic weight for elements from that for isotopes to prevent overwriting of the assigned atomic weight of the parent element. Every isotope (such as uranium 238) is a separate class and inherits the properties of the parent element's (uranium's) isotope class as well as the abstract Isotope class. For instance, U238 and U235 are subclasses of the U_Isotope class, which in turn is a subclass of the Isotope class. Another concept represented in the isotope ontology is the classification of isotopes into two subclasses: nonradionuclide and radionuclide. I use this system to include decay series, in which a radionuclide can be a primary parent, an intermediate parent, or an intermediate daughter (Fig. 7). Each of these subclasses has its corresponding properties. Primary or intermediate parent isotopes have a half-life, decay mode, abundance, and a final daughter or a decay series including intermediate daughters.
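To make this inheritance structure concrete, the fragment below sketches it in plain Python; the layout is only an illustrative reading of Figures 6–8 rather than code generated from the actual ontology files, and the numerical values are standard reference figures included for the example.

```python
# Illustrative sketch of the element/isotope class structure described in the
# text: every element is a subclass of Element, and every isotope is a subclass
# of its parent element's isotope class, which in turn subclasses Isotope.
class Element:
    name: str
    symbol: str
    atomic_number: int
    nominal_atomic_weight: float   # kept distinct from isotope atomic weights


class Uranium(Element):
    name = "uranium"
    symbol = "U"
    atomic_number = 92
    nominal_atomic_weight = 238.029
    goldschmidt_class = "lithophile"


class Isotope:
    number_of_neutrons: int
    isotope_atomic_weight: float


class U_Isotope(Isotope):
    parent_element = Uranium       # every U isotope references its parent element


class U238(U_Isotope):
    number_of_neutrons = 146
    isotope_atomic_weight = 238.051
    is_radionuclide = True         # primary parent of a decay series


class U235(U_Isotope):
    number_of_neutrons = 143
    isotope_atomic_weight = 235.044
    is_radionuclide = True


# A query for uranium data can traverse from the element to its isotopes and
# back through the parent_element reference.
assert issubclass(U238, Isotope) and U238.parent_element is Uranium
```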
Figure 5. Unified Modeling Language (UML) representation of Materials package, emphasizing linkages to earth science concepts. Elements and isotopes constitute the fundamental attributes of Materials and can be readily utilized in other sciences, e.g., in biochemistry, where the concept of cells and their properties could replace minerals or rocks.
Figure 6. Element ontology Unified Modeling Language (UML) diagram emphasizing the details required to discover and integrate semantically registered databases. See relationship of the Class Element/Isotope to other classes in Figure 5.
Figure 7. Unified Modeling Language (UML) diagram for isotope ontology. Registration of isotopic measurements to such conceptual relationships will allow for integration between elements and isotopes.
A final daughter has an abundance and a primary parent. For the isotope properties, the following resources were used to gather the information: http://chemlab.pc.maricopa.edu/periodic/periodic.html, http://ie.lbl.gov/toi/perchart.htm, as well as the IUPAC technical report on isotopic compositions of the elements (Rosman and Taylor, 1998).

HOW ARE ELEMENT AND ISOTOPE ONTOLOGIES LINKED TOGETHER?

Every element in the element ontology is connected to its corresponding isotopes (Fig. 8). For instance, an instance of the Uranium (U) class has one to many instances of U_Isotope; conversely, every U_Isotope has a reference to its parent element Uranium. In order for isotopes to inherit all properties of elements, I have treated each element as a subclass of the Element class, rather than considering each element as an instance. Discovery and integration between elements and isotopes are more easily accomplished through this structure.

Structure: Arrangement of particles or parts in a substance or body; the aggregate of elements of an entity in their relationship to each other (Webster's New Collegiate Dictionary, http://www.merriam-webster.com/dictionary/structure, accessed 9 May 2011). It includes the organization of all geologic structures and their
relationship to each other within concepts that range from zero dimensions (0-D) to three dimensions (3-D). An ontologic framework for structures (Fig. 9) has been established that represents concepts of 0-D, 1-D, 2-D, and 3-D geometric forms, which can capture all known geometrical forms and their relationships to each other. It should be emphasized that a Contain relationship links 0-D, 1-D, and 2-D forms to 3-D (the volume concept). For example, in earth science, discontinuities such as seismic, lithologic, planar, spherical, lattice, fault, and fracture, as well as chemical or isotopic discontinuities, have a 2-D geometric form.
Figure 8. An example illustrating how the uranium element is linked to its isotopes through both has and is a conceptual relationships.
In order to extend these concepts to data gathered by geologists, I use lineation as an example of a 1-D geometric form with field measurements of trend and plunge. Specific types of lineation can then be represented as subclasses with properties inherited from the concept of lineation. I have also established a class relationship that allows the meaning of data situated in x-y-z space to be captured. For example, the concept of Relative Sequence contains concepts of Sequence, which in turn contain X, Y, and Z Sequences. I have further established links between Time Sequence and Sequence, so that geologic units above or below a datum horizon can be explicitly related as older or younger than the datum. This is a very useful mechanism for the portrayal of stratigraphic horizons, regardless of the rock type involved. Extended to its larger spatial concept, it also allows us to recognize layers of Earth, such as the crust and mantle (also see SWEET for Earth layers).
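The sketch below illustrates one way the geometric-form hierarchy and the Contain relationship of Figure 9 might be rendered in code, using lineation (trend and plunge) as the 1-D example from the text. The class and attribute names are assumptions chosen for illustration, not the formal ontology.

```python
# Sketch of the Structure package: geometric forms from 0-D to 3-D, a Contain
# relationship from lower-dimensional forms into 3-D volumes, and lineation as
# a 1-D form carrying field measurements of trend and plunge (in degrees).
from dataclasses import dataclass, field
from typing import List


@dataclass
class GeometricForm:
    dimension: int


@dataclass
class Lineation(GeometricForm):
    dimension: int = 1
    trend_deg: float = 0.0
    plunge_deg: float = 0.0


@dataclass
class Discontinuity(GeometricForm):
    dimension: int = 2
    kind: str = "lithologic"   # seismic, planar, fault, fracture, isotopic, ...


@dataclass
class Volume(GeometricForm):
    dimension: int = 3
    contains: List[GeometricForm] = field(default_factory=list)  # Contain relation


# A 3-D volume (e.g., a pluton) contains lower-dimensional structures.
pluton = Volume()
pluton.contains.append(Lineation(trend_deg=45.0, plunge_deg=20.0))
pluton.contains.append(Discontinuity(kind="fault"))
print([type(s).__name__ for s in pluton.contains])
```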
Figure 9. Unified Modeling Language (UML) diagram representing concepts contained in Structure package. Note the earth science concepts that are readily linked to foundation-level concepts of all dimensions, 0-D, 1-D, 2-D, and 3-D. Concepts of earth layers contained in SWEET are also linked in this package.
Location: A position or site occupied or marked by some distinguishing feature (Webster's New Collegiate Dictionary, http://www.merriam-webster.com/dictionary/location, accessed 9 May 2011). All materials and structures have a location within a 3-D volume. Although many coordinate reference systems are recognized (Fig. 10), geoscientists commonly use a geodetic reference frame for the absolute location of samples or points of observation. Relative location is a more difficult concept to classify as data, but concepts of near or adjacent have been successfully used to identify polygons in geographic information systems (Schuurman, 2004).

Time: The measured or measurable period during which an action, process, or condition exists or continues (Webster's New Collegiate Dictionary, http://www.merriam-webster.com/dictionary/time, accessed 9 May 2011). This also includes concepts of duration and geologic events. This package (Fig. 11) deals with relationships of classes within the concept of geologic time. I utilized OWL-Time (http://www.w3.org/TR/2006/WD-owl-time-20060927/) as a starting foundation-level ontology and added geologic concepts associated with absolute ages, intervals, or durations of geologic events. I extended calendar time (required for querying present-day events as recorded through a calendar clock) to ranges that include geologic time scales. The capability to query data across these concepts provides a uniquely temporal perspective to complement the better-characterized spatial queries. The organization of the concepts also allows the user to access multiple geologic time scales, because geologic time is a temporal entity with attributes such as beginning, end, and duration.
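As a rough illustration of these temporal concepts (a geologic time entity with a beginning, an end, and a duration that can be queried against other intervals), the sketch below uses plain Python. The containment test is a simplified stand-in for the richer OWL-Time interval relations, and the boundary ages are approximate examples only.

```python
# Sketch of a geologic time interval with beginning, end, and duration, plus a
# simple containment test in the spirit of OWL-Time interval relations.
# Ages are in Ma (millions of years before present); larger values are older.
from dataclasses import dataclass


@dataclass
class GeologicInterval:
    name: str
    begin_ma: float   # older bound
    end_ma: float     # younger bound

    @property
    def duration_ma(self) -> float:
        return self.begin_ma - self.end_ma

    def contains(self, other: "GeologicInterval") -> bool:
        # True if `other` falls entirely within this interval.
        return self.begin_ma >= other.begin_ma and self.end_ma <= other.end_ma


# Approximate boundary ages, used here only as examples.
mesozoic = GeologicInterval("Mesozoic", begin_ma=251.9, end_ma=66.0)
jurassic = GeologicInterval("Jurassic", begin_ma=201.4, end_ma=145.0)

assert mesozoic.contains(jurassic)
print(round(jurassic.duration_ma, 1))   # -> 56.4
```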
Measurements: Instrumental measurements and images (chemical, physical, and morphological) of matter. This package, when fully developed, would contain ontologies for instruments, including their operating conditions. Many of these instruments and their operating parameters are already available in SWEET.

Phenomenon: A fact or event of scientific interest susceptible to scientific description and explanation (Webster's New Collegiate Dictionary, http://www.merriam-webster.com/dictionary/phenomenon, accessed 9 May 2011). Events such as earthquakes or volcanism are considered phenomena. This package is fully covered in the SWEET ontology.

Physical Properties: An attribute common to all members of a class (Webster's New Collegiate Dictionary, http://www.merriam-webster.com/dictionary/%20properties, accessed 9 May 2011). All planetary materials have physical properties. This package is fully covered in the SWEET ontology.

Mathematical and Statistical Functions: A set of commonly used notations that describe numerical data and the attributes associated with them. Computations such as standard error, regression, and standard deviation are considered functions. This package is fully covered in the SWEET ontology.
Figure 10. Global Geographic Location system adapted from http://www.colorado.edu/geography/gcraft/notes/coordsys/coordsys.html and http://www.iki.rssi.ru/vprokhor/coords.htm (Dana, 1995; Russell, 1971; Hapgood, 1992).
Figure 11. Unified Modeling Language (UML) diagram representing the Geologic Time class and its properties (adapted from Hobbs and Pan, 2004; Allen, 1991).
SERVICE ONTOLOGY: SEMANTIC ORGANIZATION OF TOOLS AND SERVICES

These ontologies are designed to provide a semantic framework for classes of tools and services that provide computational or classification capabilities for data sets. As an example, a classification tool such as the alumina saturation index can utilize rock geochemical data to classify a sample as peraluminous or metaluminous. In a later section (scenario 1), A-type igneous rocks can be recognized through discriminant diagrams such as those given by Eby (1990). More complex modeling codes, such as reactive transport, can be used to model the behavior of elements in a mine waste-disposal facility. Within this framework, codes can be organized within a service ontology and made available to a user based on the query (Malik et al., 2007b; Rezgui et al., 2008). Such tools and services should be wrapped as Web services to facilitate discovery in a Web environment. I foresee these services as applications developed, shared, and registered to the service ontology by geoscientists across the world. Availability of a semantic framework will allow individuals to register their applications to specific concepts, e.g., A-type classification of igneous rocks.
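To indicate how such a classification tool might be registered against a service-ontology concept and applied to registered geochemical data, the sketch below implements the alumina saturation index mentioned above (molar Al2O3/(CaO + Na2O + K2O), with values above 1 taken as peraluminous) and maps it to a concept name. The registry structure and concept label are assumptions for illustration, not the published DIA service ontology.

```python
# Sketch of registering a classification tool against a service-ontology
# concept.  The alumina saturation index (A/CNK) is molar Al2O3 divided by
# molar (CaO + Na2O + K2O); concept names and the registry layout are
# illustrative assumptions.
MOLAR_MASS = {"Al2O3": 101.96, "CaO": 56.08, "Na2O": 61.98, "K2O": 94.20}


def alumina_saturation_index(oxides_wt_pct: dict) -> str:
    """Classify a whole-rock analysis (oxide wt%) as peraluminous or metaluminous."""
    mol = {ox: oxides_wt_pct[ox] / MOLAR_MASS[ox] for ox in MOLAR_MASS}
    a_cnk = mol["Al2O3"] / (mol["CaO"] + mol["Na2O"] + mol["K2O"])
    return "peraluminous" if a_cnk > 1.0 else "metaluminous"


# Toy service registry: ontology concept -> callable.  In the full system the
# callable would be wrapped and advertised as a Web service.
SERVICE_REGISTRY = {"AluminaSaturationClassification": alumina_saturation_index}

sample = {"Al2O3": 14.2, "CaO": 1.1, "Na2O": 3.4, "K2O": 4.6}
tool = SERVICE_REGISTRY["AluminaSaturationClassification"]
print(tool(sample))   # -> peraluminous for this granite-like analysis
```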
PROCESS ONTOLOGY: INTERPRETING "WHY" FROM DATA THROUGH REVERSE AND FORWARD PROCESS MODELS

Because of the emerging state of ontologic research in the earth sciences, no foundation- or domain-level ontologies for geologic processes are yet available for refinement. However, Sinha et al. (2006a) and Barnes (2006) have presented high-level conceptual schemas for geologic processes that produce or modify igneous rocks.
For example, as shown in an overview diagram (Fig. 12), it is easy to recognize that there are many interrelated concepts woven into the logic of interpreting igneous rocks. It is clear that, in the category of presolidification events, the process of partial melting is the key prerequisite for the formation of igneous rocks. Associated with this process, there are issues of sources, depth of melting, and the tectonic setting of the melting event. Additional processes represented in the presolidification hierarchy are segregation, transport, and emplacement. Syn- to postsolidification processes include phenomena such as the rate of cooling or alteration. It is clear that many phenomena act as operators on any given process, and the cumulative effect of such actions leads to the final product, i.e., an igneous rock. The characteristic contribution of each process to the final physical form (i.e., the shape of a pluton) or its chemical signature is often difficult to assess, and this may lead to multiple hypotheses, as discussed for scenario 2. The development of process ontologies requires the integration of phenomena and events associated with the entire hierarchy represented in Figure 12. As such, a link between processes and concepts within a semantic framework would provide the relationships that exist during formation of an igneous rock, from the partial melting event to the solidification event; this is similar in concept to the feedback loop applied to granitic plutons (Petford et al., 1997).

DISCOVERY, INTEGRATION, AND ANALYSIS OF RESOURCES

A prototype software environment for a discovery, integration, and analysis (DIA) engine of semantically registered data and services (Fig. 13) has been described in detail by Rezgui et al. (2007, 2008) and Malik et al. (2010).
Figure 12. A graphic view of multiple processes that are linked together through process, time, tectonic, and observational relationships (from Sinha et al., 2006a). The phenomena shown include source rock, heat source, depth of melting, stress regime, melt viscosity, melt density, geometrical form, depth of emplacement, volatile content, rate of cooling, alteration, and metamorphism; the processes include melting, segregation, transport, and emplacement (presolidification processes), and solidification and postsolidification changes (syn- to postsolidification processes).
The primary objective of constructing the DIA engine is to build a service-oriented computational infrastructure that enables scientists to discover, integrate, and analyze earth science data. Data discovery enables users to retrieve the distributed data sets (i.e., located at multiple sites) that are pertinent to the research task at hand. Data integration enables users to query various data sets along common attributes to extract previously unknown information called data products. The data products that are generated can either be used in their delivered form or used as input to the data analysis phase. Data analysis may be used to verify certain hypotheses, or it may refine the data product with further data discovery and integration. All data used by DIA are referenced to the original data provider, even through multiple iterations, e.g., by controlling the provenance of data (Simmhan et al., 2005), and this is a prerequisite for building trust in the online conduct of science (Zaihrayeu et al., 2005).

SEMANTICS-ENABLED REGISTRATION OF DATA AND SERVICES: SIMPLIFYING DISCOVERY PRIOR TO INTEGRATION

I suggest that two types of semantic registration are necessary for both discovery and integration of data: (1) data registration, e.g., SEDRE, the semantically enabled data registration engine described by Malik et al. (2010); and (2) service registration, e.g., SESRE, a semantically enabled services registration engine that includes process-oriented codes. These concepts are shown in Figure 13. Ontology-aided data integration, accomplished by registering databases to ontologies, systematically resolves both syntactic and semantic heterogeneity, allowing scientists to focus on the content of a database rather than its schema (Lin and Ludäscher, 2003). SEDRE facilitates discovery through resource registration at three levels:

1. Keyword-based registration: Discovery of data resources (e.g., gravity, geologic maps, etc.) requires registration through the use of high-level index terms. For instance, the popular AGI index terms (American Geological Institute GeoRef Thesaurus; http://www.agiweb.org/news/spot_nov8_thesaurus.html) can be used. If necessary, other index terms, such as those provided by AGU (American Geophysical Union, http://www.agu.org/pubs/authors/manuscript_tools/journals/index_terms/), can be used as well and eventually be cross-indexed to each other.

2. Ontological class-level registration: Discovery of the semantic content of databases requires registration of the database to a class-level ontology, such as rock geochemistry, gravity database, etc.

3. Item detail–level registration: Item detail–level registration consists of associating a column in a database with a specific concept or attribute of an ontology, thus allowing the resource to be queried using concepts instead of actual values. This mode of registration is most suitable for data sets built on top of relational databases. However, item detail–level registration can be extended to cover Excel spreadsheets and maps in ESRI Shapefile format by internally mapping such data sets to PostgreSQL tables.
Figure 13. Conceptual organization of classes of ontologies within the extensible discovery, integration, and analysis (DIA) engine, where the class of service ontologies is inclusive of both tools and process ontologies. SEDRE—semantically enabled data registration engine; SESRE—semantically enabled services registration engine. The components shown are the user, data and service discovery, the DIA registration component (SEDRE for data, SESRE for services), the object and service ontologies, and tools and process codes wrapped as Web services.
For example, a column in a geochemical database may be specified as representing an SiO2 measurement. Ontological data registration at the item detail level uses the concepts of Subject, Object, Value, and Unit. Figure 14 shows the relationship between these concepts and the method by which columns of data sets can be mapped to them. In an example utilizing geochemical data, Rock represents the Subject (sample 1758), which contains the element compound SiO2 as one of its Objects. The Object SiO2 has a Value of 50.72 and is measured in the wt% Unit. To facilitate such registration, one can envision a graphic user interface (Fig. 15) for geochemical data. Similar interfaces can be readily created for all major disciplines, thus making it easy for data providers to semantically tag their data for ontology-based discovery. For instance, SO2 columns in the data sets are mapped to terms adopted from the element ontology (see Figs. 5 and 6), while units of measurement are made available through the SWEET unit ontology. SEDRE allows data owners to maintain control over their data and stores only the data–ontology term mappings. The mappings also include concepts of longitude/latitude coordinates (from the location ontology) to enable efficient access to spatial data. I recognize that data registration through ontologies is a time-consuming process, and that data owners may not be able or willing to register their data sets in "one go." Therefore, SEDRE is developed as a downloadable service: data owners can download SEDRE (along with all the required ontology terms) onto their personal machines and connect to SEDRE's online repository only to upload the data–ontology mappings. This allows data owners to register their data at their own convenience, while keeping ownership of the data. DIA uses different "registry servers" (RSs), which could be distributed worldwide, to provide directory functionalities (registration of data and tools, indexing, search, etc.). The providers of resources advertise their resources on registry servers, which may then be (manually or automatically) discovered and used.
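The Subject–Object–Value–Unit pattern of item detail–level registration can be pictured as a simple mapping from database columns to ontology concepts, as in the sketch below. It is a schematic reading of Figure 14; the table, column, and concept identifiers are invented for illustration.

```python
# Sketch of item detail-level registration: columns of a relational table are
# mapped to ontology concepts (Subject, Object, Value, Unit) so the data can
# later be queried by concept rather than by column name.  Identifiers here
# are invented for illustration.
COLUMN_MAPPINGS = [
    {"column": "SampleID", "concept": "Subject/RockSample"},
    {"column": "SiO2", "concept": "Object/AnalyticalOxide/SiO2",
     "value_role": "Value", "unit": "wt%"},
]


def register_row(row: dict) -> list:
    """Translate one database row into (subject, concept, value, unit) statements."""
    subject = row["SampleID"]
    statements = []
    for mapping in COLUMN_MAPPINGS:
        if mapping.get("value_role") == "Value":
            statements.append(
                (subject, mapping["concept"], row[mapping["column"]], mapping["unit"])
            )
    return statements


# The example from the text: rock sample 1758 with SiO2 = 50.72 wt%.
print(register_row({"SampleID": "1758", "SiO2": 50.72}))
# -> [('1758', 'Object/AnalyticalOxide/SiO2', 50.72, 'wt%')]
```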
SCENARIO 1

To illustrate the different DIA components, consider the following query: What are the distribution, U/Pb zircon ages, and gravity expression of A-type plutons in Virginia?

Query Specification

In the DIA engine, the user query can be expressed in one of two ways: a text-based format or a menu-based format. The text-based format allows a user to query the entire database, while the menu-based format (Fig. 16) lets the user select only specific items, which in turn queries only a subset of the data. The user does not need extensive knowledge of querying techniques, models, or keywords (which may be required in a text-based format). The task at hand can be completed with the help of a few mouse clicks, and query results are produced as long as the data required to answer the query are present, i.e., empty result sets are returned only in the case of missing data. The user clicks through the different menus to "build" an exact query.

Filtering and Integration

Data filtering is a process in which the DIA engine transforms a raw data set into a data product. Data filtering may also take a data product as its input. Examples of data products include a map showing the A-type bodies in the Mid-Atlantic region, an Excel file giving the ages of those A-type bodies, a gravity database table spatially related to A-type bodies saved as a contoured gravity map, etc. Data products used in data integration may be of two types: prepackaged or created dynamically.
Figure 14. The ontologic registration of data to concepts is the key to semantic interoperability and integration. Note the representation of a rock sample as a subject with object defining the data itself. The application of ontology-based concepts of subject and object facilitates registration and discovery. For example, column SampleID represents the concept of Subject (in this instance it is a rock sample), while the concept of Object contains the concept of analytical oxide with value and unit. Based on such deep ontologies, it is possible to easily register and query for data associated with any subject. The excerpted section from the planetary material ontology shows an AnalyticalOxideConcentration class (multiplicity 1 to 0..n) with attributes analyticalOxide, concentration, and errorOfConcentration, each carrying a value with unit.
Figure 15. Schematic representation of registration of data through SEDRE (semantically enabled data registration engine). This user interface is specifically designed to register data from atmospheric studies, and, as such, commonly measured compounds are made available under the section Major elements. As shown in the inset labeled New Mapping, we readily register subject (e.g., SO2) and capture its value and units (in Dobson units). Such templates can be readily made for all subdisciplines and will lead to easy registration of data to known semantics. SESRE—semantically enabled services registration engine.
Figure 16. Query specification through menus. The discriminant diagrams are made available as Web services, and they use a point-in-polygon algorithm to recognize whether a sample has A-type affinities. The menu also provides a link history that enables a user to follow the steps involved in the classification of samples.
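Because such discriminant-diagram services reduce to testing whether a sample plots inside a labeled field, a minimal point-in-polygon routine of the kind they could call is sketched below; the polygon used here is a made-up placeholder, not the actual A-type field of Eby (1990).

```python
# Minimal ray-casting point-in-polygon test of the kind a discriminant-diagram
# Web service could apply to a sample's plotted coordinates.  The polygon below
# is a placeholder, not the published A-type field.
def point_in_polygon(x: float, y: float, polygon: list) -> bool:
    """Return True if point (x, y) lies inside polygon [(x1, y1), (x2, y2), ...]."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        crosses = (y1 > y) != (y2 > y)          # edge straddles the horizontal ray
        if crosses and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside


a_type_field = [(1.0, 1.0), (10.0, 1.0), (10.0, 8.0), (1.0, 8.0)]  # placeholder field
print(point_in_polygon(4.2, 3.7, a_type_field))   # -> True: sample plots inside
```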
Querying prepackaged data is usually faster but is not flexible and provides little support for complex scientific discovery. Dynamically created data products may require on-the-fly integration and extensive query processing, but they enable far richer possibilities for scientific discovery.

DIA's Service-Oriented Approach for Facilitating Semantic Integration of Heterogeneous Data

The DIA engine is a Web-based, service-oriented system developed using a variety of technologies, including ESRI's ArcGIS Server 9.1, Microsoft's .NET framework, Web services, Java, and JNBridge 3 (Rezgui et al., 2008; Malik et al., 2010). Users submit queries through DIA's Web-accessible graphical interface. The engine translates these queries into a sequence of tasks such as accessing map servers, discovering and accessing data sources, invoking Web services (e.g., Cardoso and Sheth, 2006), filtering features, joining layers, and graphically rendering query results for visualization. The DIA engine also enables users to save their query history as well as export data products for future reference. Since the DIA engine is developed along a service-oriented approach, key code modules are wrapped as Web services. This approach has two advantages. First, it makes the system readily extensible: as the geoscience community introduces new services, these can be integrated into the DIA engine as new functionalities. Second, services developed for DIA may be used as building blocks to produce other systems. The DIA engine supports several querying modes (geological map-based queries, region-based queries, etc.). To answer the example A-type plutons query, the user first selects the option "Geological map-based queries" in DIA's main menu (Fig. 17B). The system then accesses a geological map server, gets a (default) geological map, and displays it to the user (Fig. 17). This map enables the user to select the area of interest (i.e., Virginia), either by selecting a bounding box or by selecting the entire state. In the latter case, the DIA engine accesses a gazetteer to determine the selected state's latitude-longitude coordinates. The user then uses DIA's drop-down menus (Fig. 17B; also Fig. 16) to identify a computational filter (the A-type magma class filter in this case) to be applied to the data samples located in the selected area.
The DIA system is designed to search all semantically registered data sets that have samples located in the area of interest. The user then selects all of the A-type bodies and requests U/Pb ages. If ages have been registered to these bodies within a geochronologic database, the information is retrieved and an age is displayed for each body (Fig. 17C). Similarly, access to semantically registered point-source gravity data from a site such as http://paces.geo.utep.edu provides the capability to plot the distribution of individual stations and use a kriging tool to construct contour maps at various scales (Fig. 17C). Ultimately, the user is presented with an integrative view (Fig. 17D) that can be used to discover relationships between the occurrence of certain types of igneous rocks and their gravity signature.

SCENARIO 2

Why are sulfur contents associated with volcanic activity similar for volcanoes from different plate-tectonic settings? (This illustrates the need to eventually link object-service and process ontologies.) Volcanism and its bearing on climate change have been the subject of many studies (e.g., Blong, 1984; Robock, 2000), and this area of research provides scientists with the opportunity to study the scales at which climate can be influenced. For example, a global decrease in temperature of 0.1 °C to 10 °C can be positively correlated with sulfur yield in grams (Fischer et al., 1997). For larger eruptions, such as Mount Pinatubo in 1991, the very substantial amounts of material (e.g., sulfur dioxide reacting with water to form sulfuric acid) reach into the stratosphere above ~15 km, where their effects are felt on a global scale and can persist for years. Although existing ontologies such as those presented here and formalized in SWEET can support smart search and integration, they are unable to explain the patterns in data or the mechanisms (processes) responsible for the abundance of sulfur in volcanic eruptions. The concept of plate tectonics provides a framework for the geologically associated general characteristics of volcanoes (Sinha et al., 2007), but many of the details of volcanic activity and its products are difficult to explain through this paradigm. For example, many volcanoes share common characteristics (high sulfur emissions) despite being located in different tectonic settings, and I have chosen Mount Pinatubo (convergent-margin setting in the Philippines) and Nyamuragira (continental divergent setting in East Africa) to show similar SO2 loading of the atmosphere.
Figure 17. A layered bottom-up depiction of discovery, integration, and analysis (DIA) phases from data to data product. Panels show the steps required to integrate geospatial, geotemporal, and geophysical data through the use of tools and services that have been registered to known ontologies. Panel labels, from bottom to top: A—ontologically registered data and services (the Material, Structure, Physical Properties, Location, Time, Measurements, Functions, and Phenomenon packages, with geochronology, geochemistry, geophysics, and geospatial data); B—tool selection (igneous area of interest, geochemical magma class, A-type discriminant diagram); C—geospatial and geotemporal A-type bodies and gravity gridding; D—integration and analysis yielding the data product, plutons and the gravity field.
Based on new ontologies for both volcanoes and plate tectonics (Sinha et al., 2007), one can suggest that an integrated conceptual understanding of processes associated with volcanic systems, from magma generation to eruption, is required to link characteristics of volcanic eruptions to plate tectonics. The ultimate objective of understanding which particular volcanoes are capable of influencing climate change requires this integrative approach. For domain experts, investigation of geologic processes typically involves data-driven inference, specifically, the use of detailed data sets involving objects, time, and space to derive testable genetic hypotheses ("reverse" models). For volcanic rocks, reverse process models require the ability to use field, geochemical, and mineralogical data to infer magma source(s), storage
site(s), and causes for compositional diversity. This is generally done by thermodynamic calculations to determine pressure and temperature, and by matching patterns of compositional arrays to infer differentiation/assimilation processes. By a combination of logic and trial-and-error comparison, researchers eliminate as many genetic scenarios as possible. If coded for hands-off computation, this approach borders on that of artificial intelligence. In contrast, predictive models require the ability to visualize and calculate a process from its inception to completion (“forward” models). Forward modeling of magmatic processes associated with volcanism, and extensible to other geologic phenomena, is similar to workflow calculations. In the simplest approach, magmatism is a bottom-to-top phenomenon, from a mantle heat source to final pluton emplacement or eruption. However, although adding complexity to forward models is relatively straightforward, the usefulness of such complex models may suffer due to lack of general applicability in complex volcanic environments.
SUMMARY AND FUTURE RESEARCH DIRECTIONS This paper presents an overview for development of foundation-level ontologies that enable semantic integration of data and tools. The recently developed prototypes for DIA and SEDRE engines readily show semantic capabilities that utilize ontologies and Web services to organize, annotate, and define data sets and tools. However, for geoscientists to advance their scientific goals, it is likely that an active presence in the emerging “Semantic Web” (Berners-Lee et al., 2001) will be required. This would enable data and applications to be automatically understood and processed without geographical or organizational boundaries (Alonso-Jiménez et al., 2006), and thus lead to efficiency through precise “information hunting,” e.g., smart search. Other advantages for the geoscience community for participating in the use of Semantic Web technologies suggested by Sinha et al. (2010) include: facilitated knowledge management (processes of capturing, extracting, processing, and storing knowledge) (Alonso et al., 2004), integration across heterogeneous domains through ontologies (Fox et al., 2008), efficient information filtering (sending selective data to right clients), formation of virtual communities (Reitsma and Albrecht, 2005), legacy capture for long-term archiving, serendipity (finding unexpected collaborators), and Web-based education (Ramamurthy, 2006). Capabilities based on semantic integration of data, services, and processes will become the new paradigm in scientific endeavors and provide a significant boost to the visibility of geoscience research and education in a competitive world (Sinha et al., 2010). ACKNOWLEDGMENTS The author acknowledges the decade-long interaction with many geoscience and computer science colleagues and extends special thanks to Kai Lin, Abdelmounaam Rezgui, Zaki Malik, Robert Raskin, Calvin Barnes, Boyan Brodaric, Peter Fox, and Deborah McGuinness for their support in developing semantic capabilities for the geoscience community. This research was supported by National Science Foundation award EAR022558. All concept maps were prepared using free software provided by the Institute for Human and Machine Cognition, http://cmap.ihmc.us/. REFERENCES CITED Aberer, K., 2003, Special issue on peer to peer data management: Special Interest Group on the Management of Data (SIGMOD) Record, v. 32, p. 69–72. Allen, J.F., 1991, Time and time again: The many ways to represent time: International Journal of Intelligent Systems, v. 6, p. 341–355, doi:10.1002/int .4550060403. Alonso, G., Casati, F., Kuno, H., and Machiraju, V., 2004, Web Services: Concepts, Architecture, and Applications: Berlin, Springer Verlag, 354 p. Alonso-Jiménez, J.A., Borrego-Díaz, J., Chávez-González, A.M., and Martín-Mateos, F.J., 2006, Foundational challenges in automated semantic Web data and ontology cleaning: IEEE Intelligent Systems, v. 21, no. 1, p. 42–52.
Barnes, C., 2006, From Object to Process Ontology: U.S. Geological Survey Scientific Investigations Report 2006-5201, p. 40–41. Berners-Lee, T., Hendler, J., and Lassila, O., 2001, The semantic web: Scientific American, v. 284, p. 34–43, doi:10.1038/scientificamerican0501-34. Blong, R.J., 1984, Volcanic Hazards: Sourcebook on the Effects of Eruptions: Orlando, Academic Press, 424 p. Boisvert, E., Johnson, B.R., Schweitzer, P.N., and Anctil, M., 2003, XML Encoding of the North American Data Model: U.S. Geological Survey Open-File Report 03-471, http://pubs.usgs.gov/of/2003/of03-471/ boisvert/index.html (accessed 4 May 2011). Cardoso, J., and Sheth, A., 2006, The semantic web and its applications, in Cardoso, J., and Sheth, A.P., eds., Semantic Web Services, Processes and Applications, Volume 3: New York, Springer Verlag, p. 3–33. Dana, P.H., 1995, Co-ordinate System Overview: http://www.colorado.edu/ geography/gcraft/notes/coordsys/coordsys.html (accessed 8 May 2011). DeLaeter, J.R., Bohlke, J.K., DeBievre, P., Hidaka, H., Peiser, H.S., Rosman, K.J.R., and Taylor, P.D., 2003, Atomic weights of the elements: Review 2000: Pure and Applied Chemistry, v. 75, p. 683–800, doi:10.1351/ pac200375060683. Eby, G.N., 1990, The A-type granitoids: A review of their occurrence and chemical characteristics and speculation on their petrogenesis: Lithos, v. 26, p. 115–134, doi:10.1016/0024-4937(90)90043-Z. Fischer, R.V., Heiken, G., and Hulen, J.B., 1997, Volcanoes: Crucibles of Change: Princeton, New Jersey, Princeton University Press, 317 p. Fleming, N., 1996, Coping with a Revolution: Will the Internet Change Learning?: Canterbury, New Zealand, Lincoln University, http:// www.vark-learn.com/documents/information_and_knowle.pdf (accessed 9 May 2011). Fox, P., Sinha, A.K., McGuinness, D., Raskin, R.G., and Rezgui, A., 2008, A Volcano Erupts—Semantic Data Registration and Integration: U.S. Geological Survey Scientific Investigations Report 2008-5172, p. 72–75. Gruber, T.R., 1993, A translation approach to portable ontologies: Knowledge Acquisition, v. 5, p. 199–220, doi:10.1006/knac.1993.1008. Halevy, A., Rajaraman, A., and Ordille, J., 2006, Data integration: The teenage years, in Dayal, U., Whang, K., Lomet, D., Alonso, G., Lohman, G., Kersten, M., Cha, S.K., and Kim, Y., eds., Proceedings of the 32nd International Conference on Very Large Data Bases (Seoul, Korea): Very Large Data Bases, VLDB Endowment: New York, Association of Computing Machinery, p. 9–16. Hapgood, M.A., 1992, Space physics coordinate transformations: A user guide: Planetary and Space Science, v. 40, p. 711–717, doi:10.1016/ 0032-0633(92)90012-D. Hobbs, J.R., and Pan, J., 2004, An ontology of time for the semantic web: ACM (Association for Computing Machinery) Transactions on Asian Language Information Processing, v. 3, p. 66–85, doi:10.1145/1017068.1017073. Lin, K., and Ludäscher, B., 2003, A system for semantic integration of geologic maps via ontologies, in Ashish, N., and Goble, C., eds., Semantic Web Technologies for Searching and Retrieving Scientific Data (SCIS), ISWC 2003 Workshop, v. 83: Aachen, Germany, Aachen University, Sun Site Central Europe. Malik, Z., Rezgui, A., and Sinha, A.K., 2007a, Ontologic Integration of Geoscience Data on the Semantic Web: U.S. Geological Survey Scientific Investigations Report 2007-5199, p. 41–43. 
Malik, Z., Rezgui, A., Sinha, A.K., Lin, K., and Bouguettaya, A., 2007b, DIA: A web services–based infrastructure for semantic integration in geoinformatics, in Proceedings of the IEEE (Institute of Electrical and Electronics Engineers) International Conference on Web Services (ICWS 2007): New York, IEEE, p. 1016–1023. Malik, Z., Rezgui, A., Medjahed, B., Ouzzani, M., and Sinha, A.K., 2010, Semantic integration on geosciences: International Journal of Semantic Computing, v. 4, no. 3, p. 1–30, doi:10.1142/S1793351X10001036. Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A., and Schneider, L., 2002, The Wonderweb Library of Foundational Ontologies and the DOLCE Ontology: Laboratorio di Ontologia Applicata Technical Report D 17: Padova, Italy, 37 p. Niles, I., and Pease, A., 2001, Towards a standard upper ontology, in Welty, C., and Smith, B., eds., Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001): New York, Association of Computing Machinery, p. 2–9. Noy, N.F., 2004, Semantic integration: A survey of ontology-based approaches: Special Interest Group on the Management of Data (SIGMOD) Record, v. 33, p. 65–70, doi:10.1145/1041410.1041421.
Infusing semantics into the knowledge discovery process for the new e-geoscience paradigm Obrst, L., 2003, Ontologies for semantically interoperable systems, in Proceedings of the Twelfth International Conference on Information and Knowledge Management (CIKM 03): New York, Association of Computing Machinery, p. 366–369. Petford, N., Clemens, J.D., and Vigneresse, J., 1997, Application of information theory to the formation of granitic rocks, in Bochez, J., Hutton, D.H.W., and Stephens, W.E., eds., Granite: From Segregation of Melt to Emplacement Fabrics: Dordrecht, the Netherlands, Kluwer Academic Press, p. 3–10. Phytila, C., 2002, An Analysis of the SUMO and Description in Unified Modeling Language, Phytila-SUMO.htm, http://suo.ieee.org/SUO/SUMO/index .html (accessed 8 May 2011). Ramamurthy, M.K., 2006, A new generation of cyberinfrastructure and data services for earth system science education and research: Advances in Geosciences, v. 8, p. 69–78, doi:10.5194/adgeo-8-69-2006. Raskin, R.G., 2006, Development of ontologies for earth system science, in Sinha, A.K., ed., Geoinformatics: Data to Knowledge: Geological Society of America Special Paper 397, p. 195–200. Raskin, R.G., and Pan, M.J., 2005, Knowledge representation in the semantic web for earth and environmental terminology (SWEET): Computers & Geosciences, v. 31, p. 1119–1125, doi:10.1016/j.cageo.2004.12.004. Reitsma, F., and Albrecht, J., 2005, Modeling with the semantic web in the geosciences: Institute of Electrical and Electronics Engineers Intelligent Systems, v. 20, p. 86–88, doi:10.1109/MIS.2005.32. Rezgui, A., Malik, Z., and Sinha, A.K., 2007, DIA Engine: Semantic Discovery, Integration, and Analysis of Earth Science Data: U.S. Geological Survey Scientific Investigations Report 2007-5199, p. 15–18. Rezgui, A., Malik, Z., and Sinha, A.K., 2008, Semantically Enabled Registration and Integration Engines (SEDRE and DIA) for the Earth Sciences: U.S. Geological Survey Scientific Investigations Report 2008-5172, p. 47–52. Robock, A., 2000, Volcanic eruptions and climate: Review of Geophysics, v. 38, p. 191–219. Rosman, K.J.R., and Taylor, P.D.P., 1998, Isotopic compositions of the elements 1997: Pure and Applied Chemistry, v. 70, p. 217–235, doi:10.1351/ pac199870010217. Russell, C.T., 1971, Geophysical coordinate transformation: Cosmic Electrodynamics, v. 2, p. 184–196.
Schuurman, N., 2004, GIS: A Short Introduction: Malden, Massachusetts, Blackwell Publishing, 171 p. Semy, S., Pulvermacher, M., and Obrst, L., 2004, Towards the Use of an Upper Ontology for U.S. Government and Military Domains: An Evaluation: The MITRE Corporation (04-0603), http://handle.dtic.mil/100.2/ ADA459575 (accessed 8 May 2011). Simmhan, Y.L., Plale, B., and Gannon, D., 2005, Survey of data provenance in e-science: Special Interest Group on the Management of Data (SIGMOD) Record, v. 34, no. 3, p. 31–36, doi:10.1145/1084805.1084812. Simons, B., Boisvert, E., Brodaric, B., Cox, S., Duffy, T., Johnson, B., Laxton, J., and Richard, S., 2006, GeoSciML: Enabling the exchange of geological map data, in Australian Society of Exploration Geophysicists Extended Abstracts: Australia, Commonwealth Scientific and Industrial Research Organization, p. 1–4. Sinha, A.K., Zendel, A., Brodaric, B., Barnes, C., and Najdi, J., 2006a, Schema to ontology for igneous rocks, in Sinha, A.K., ed., Geoinformatics: Data to Knowledge: Geological Society of America Special Paper 397, p. 169– 182. Sinha, A.K., Lin, K., Raskin, R., and Barnes, C., 2006b, Cyberinfrastructure for the Geosciences-Ontology Based Discovery and Integration: U.S. Geological Survey Scientific Investigation Report 2006-5201, p. 1–2. Sinha, A.K., McGuinness, D., Fox, P., Raskin, R., Condie, K., Stern, R., Hanan, B., and Seber, D., 2007, Towards a Reference Plate Tectonics and Volcano Ontology for Semantic Scientific Data Integration: U.S. Geological Survey Scientific Investigations Report 2007-5199, p. 43–46. Sinha, A.K., Malik, Z., Rezgui, A., Zimmerman, H., Barnes, C.G., Thomas, W.A., Jackson, I., Gundersen, L.C., Heiken, G., Raskin, R., Fox, P., McGuinness, D.L., and Seber, D., 2010, Geoinformatics: Transforming data to knowledge for geosciences: GSA Today, v. 20, no. 12, p. 4–10, doi:10.1130/GSATG85A.1. Zaihrayeu, I., da Silva, P., and McGuinness, D.L., 2005, IWTrust: Improving user trust in answers from the Web, in Proceedings of 3rd International Conference on Trust Management (iTrust2005): Rocquencourt, France, Springer, p. 384–392.
MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
Printed in the USA
The Geological Society of America Special Paper 482 2011
Global Map: International cooperation in the mapping sciences D.R. Fraser Taylor Geomatics and Cartographic Research Centre, Carleton University, Ottawa, Ontario, K1S 5B6, Canada, and Chair, International Steering Committee for Global Mapping, Geospatial Information Authority of Japan, Tsukuba, Japan
ABSTRACT This chapter discusses the origins and purpose of Global Map, the current situation of the initiative, and the challenges it faces in the future. A major societal challenge facing the world today involves finding a way to deal more effectively with growing environmental problems. Reliable geographic information at a global scale is an indispensable element in formulating policy responses to global environmental challenges. The main purpose of Global Map is to describe the status of the global environment to aid in decision-making processes. Global Map provides digital maps of the terrestrial surface of Earth at a resolution of 1 km, with consistent and comparable specifications for every country. It is produced in cooperation with the national mapping organization in each country. Global Map was initiated by the government of Japan as a contribution to the action plan of the United Nations Agenda 21 program. There are four vector and four raster layers. Version 1 of Global Map was released in June 2008 and includes coverage of Antarctica. It also includes two global maps with complete high-quality coverage, one on land cover and the other on percentage tree cover. New uses of Global Map include disaster adaptation, mitigation, and management, and educational applications. Although Global Map as a product is important, the cooperative process by which Global Map is produced is equally important. This ongoing cooperation will help to ensure the future of Global Map as it enters a new phase in its development and make a substantial contribution to capacity building in the application of geoinformation to sustainable development.
ANTECEDENTS OF GLOBAL MAP The idea of international cooperation in mapping is not a new one. The idea of an international 1:1,000,000 scale map of the world produced through a cooperative international effort was suggested at the end of the nineteenth century. At that time, the world was undergoing an unprecedented period of international cooperation in a number of fields, especially international trade and communications. Although the term was not used at that time, what is now called “globalization” was certainly a topic
of discussion (Pearson et al., 2006). Cartography at that time was primarily a tool of the nation state and, in many cases, of the colonial ambitions of these states. It was argued that this situation should change in order to respond to the international realities and possibilities of the new century. Each nation state was producing its own maps in a variety of scales and formats, and, as a result, there was no cartographic coverage at the global scale that could be used to respond to the emerging global challenges of the time. Clearly, there were problems relating to "interoperability" between and among national map series.
Taylor, D.R.F., 2011, Global Map: International cooperation in the mapping sciences, in Sinha, A.K., Arctur, D., Jackson, I., and Gundersen, L., eds., Societal Challenges and Geoinformatics: Geological Society of America Special Paper 482, p. 183–191, doi:10.1130/2011.2482(14). For permission to copy, contact [email protected]. © 2011 The Geological Society of America. All rights reserved.
The idea of an internationally agreed-upon map series at the 1:1,000,000 scale had been proposed as early as 1860 by the British cartographer Sir Henry James (James, 1860), but the most significant and systematic outline of this idea was made by Penck in 1891 (Penck, 1893) at the International Geographical Congress in Switzerland. Although the idea was widely accepted and supported in principle at the time, implementation was very slow, and the international cooperation required to produce it proved difficult to achieve. There were protracted and often rancorous discussions, and initial progress was very slow. It was not until 1909 that agreement was finally reached at an international conference held in London and work began to create International Map of the World in a number of countries. A major setback took place in 1913, when the United States withdrew from the project, partially because of its impatience with the slow progress but also because of the isolationist policies of the government of the time (Pearson et al., 2006). The United States decided to produce its own 1:1,000,000 scale map of Latin America, which it considered to be in its “sphere of influence.” Despite this, at a second international map of the world conference held in Paris in 1913, final agreement on the map specifications was reached among the 34 nations represented. A coordinating bureau for the initiative was established at the British Ordnance Survey. The outbreak of World War I destroyed the cooperative process that had created the map specifications, but the impetus did not die, and the Royal Geographical Society of the UK produced a series of maps at the 1:1,000,000 scale, which used a simplified version of the international map of the world specifications (Heffernan, 1996), and the Ordnance Survey produced eight sheets of India according to those same specifications between 1914 and 1918 (Heffernan, 1996). Somewhat ironically, the Royal Geographical Society’s 1:1,000,000 maps were used at the Peace Conference in Paris in 1919, and the existence and utility of these maps provided a boost to the concept after the war. Slow progress on the creation of the international map of the world continued in the 1920s and 1930s, but this was again interrupted by the outbreak of World War II in 1939. During that war, the value of a 1:1,000,000 map series was recognized by many of the participants on both sides of the conflict, and several nations, including Japan, produced their own map series based on international map of the world specifications. In 1949, the International Geographical Union established a Commission on the International Map of the World, which suggested that the responsibility for the international map of the world be given to the cartographic unit of the newly established United Nations (UN), and this took place in 1951. There was, however, considerable skepticism in professional cartographic circles over the need for an international map of the world in the postwar era, and, despite UN interest and support, relatively few new sheets were produced. The influential American cartographer Arthur Robinson went as far as to argue that the international map of the world was no more than “cartographic wallpaper” (Robinson, 1965). The project continued to limp along, and in 1989, a UNESCO (United Nations Educational Scientific and Cultural Organization) report concluded that the international map of the
world was no longer feasible, and the project came to a formal end with less than half the map sheets required ever having been produced. Analysis of the international map of the world experience suggests that there were a number of reasons why this ambitious experiment did not succeed. There were no “…clear, consistent and manageable objectives” (Pearson et al., 2006, p. 24), and those promoting it over the years failed to create and implement a “clear and consistent vision for their project” (Pearson et al., 2006, p. 24). This experience provides valuable lessons for subsequent attempts to utilize geoinformatics, especially mapping, to respond to societal challenges, which is the unifying theme of this volume. The historical experience of the failure to fully implement the International Map of the World project, despite many decades of effort, has special significance for those attempting to implement Global Map (Pearson et al., 2006), which is the focus of this chapter. THE VISION, ORIGIN, AND PURPOSE OF GLOBAL MAP There are direct and interesting parallels between the vision and plans to create Global Map and the earlier attempts to create the international map of the world (Pearson et al., 2006). Almost exactly a century after the international map of the world was proposed, the government of Japan, with support from the United States, proposed a new initiative to create a 1:1,000,000 digital map of the world to aid in environmental and sustainable development decision making. Japan had a special interest in environmental issues at the international level and saw the creation of Global Map as one specific response to the challenges posed by the United Nations Conference on Environment and Development held in Brazil in 1992. The action plan of Agenda 21, which came out of that meeting, included a specific call for the creation of global environmental data as an aid to decision making (Pearson et al., 2006). Japan had earlier created international world maps. There is clear evidence in the report “An Image Survey Watching the Earth,” produced by the Geographical Survey Institute of Japan (GSI, 1991), that the proponents of Global Map, who were based in the GSI of Japan, had carefully considered the international map of the world experience in creating their action plan to establish and create Global Map (GSI, 1991). A deliberate attempt was made to avoid the major weaknesses that underlay the lack of success of the international map of the world project, which have been outlined by a number of authors (Winchester, 1995; Rhind, 2000; Heffernan, 2002; Pearson et al., 2006). This is evident both in the substantive content of Global Map and in the ongoing processes by which it is being created. International recognition and involvement have been keys to the creation of Global Map as outlined by Maruyama (1998), Masaharu and Akiyama (2003), and Okatani et al. (2006). Continuing international endorsement and political support, especially from the United Nations, have been critical factors for success. Global Map requires the active participation of national
and regional mapping organizations, and great care has been taken to ensure that each member of the Global Map family can effectively make a contribution to Global Map. This “bottom-up” participative process is a key element of the Global Map initiative. For many nations, this involves an ongoing capacity-building process in geoinformation, which again is an important element in the creation of Global Map.

The First International Workshop on Global Mapping was held in Japan in 1994, which set a target date for a first version of Global Map by the year 2000. The International Steering Committee for Global Mapping (ISCGM, which consists of representatives of national mapping organizations) was established at a second workshop held in 1996. Professor John Estes of the United States was elected as the first chair of ISCGM, and the director general of the Geographical Survey Institute of Japan, Kunio Nonmura, outlined the proposals for the creation of Global Map. It was to be a digital map of the world at 1 km resolution (~1:1,000,000 scale) with eight layers, four vector and four raster, and common specifications. It was to be made freely available for public use in the international arena. Global Map was formally proposed by Japan and the United States and accepted as part of the implementation plan for Agenda 21 in 1997 at the 19th Special Session of the United Nations Economic and Social Council. The first Global Map Forum was held later that year, and the specifications of Global Map were finalized after an extensive consultation process. In January 1998, ISCGM sent out a letter with endorsement from the United Nations, inviting all of the national mapping organizations of the world to participate in Global Map.

It is interesting to note the differences between the launch of Global Map and that of the international map of the world project discussed earlier. A great deal of preparatory work in the international arena was done to establish the concept for Global Map. It had clear and measurable objectives and a clear purpose. It had wide international support and was endorsed as an integral part of a United Nations initiative. Agreement was reached on the major elements of the initial specifications before nations were formally invited to participate, and these specifications were created in a pragmatic manner, taking into account the capabilities and wishes of participating nations. Global Map had a well-funded and well-organized secretariat to coordinate its activities. Although Japan and the United States played a large role in creating the vision for Global Map, from the outset, great efforts were made to ensure that it was not seen as the initiative of any one national mapping organization but was a truly international initiative coordinated by an international steering committee of national mapping organizations with a neutral chair who was not the director of any one of these organizations. Global Map was designed to meet global needs, but the way in which it was constructed meant that each country’s national needs were also met. In many developing nations, for example, the Global Map training programs and workshops funded by the government of Japan helped to build much-needed human and institutional capacity in geospatial information management and to create national data sets. This
support has been continuous for over two decades, and it both encourages and facilitates national participation. Global Map thus became both a national and international endeavor. Many nations lacked the initial capacity to create their own digital map coverage. This challenge was met in two ways. First, to create version 0 of Global Map, existing digital data sets such as GTOPO30, Vector Map Level 0, and one-kilometer Advanced Very High Resolution Radiometer (AVHRR) data were freely provided to each participating organization, which then updated and verified that coverage according to Global Map specifications. The provision of these data sets was facilitated by the U.S. Geological Survey. Without this support, the release of version 0 in 2000, which was largely based on these three data sets, would not have been possible. The quick release of version 0 with global coverage was important to demonstrate the viability of the Global Map vision. American isolationism seriously damaged the creation of the International Map of the World, but the involvement of the United States as a global player was a great advantage for the creation of Global Map.

A second factor was the capacity-building program built in as an integral part of the Global Map initiative. From the outset, nations were asked to identify the level of involvement they wished to undertake. A country choosing level A involvement agreed not only to process its own data but also to help one or more other countries to do so. Countries requiring assistance to complete their coverage would choose the level C designation, and those choosing level B would agree only to process data for their own country. Japan has been the most active of the countries choosing the level A designation. This is largely because of the partnership established among the Geographical Survey Institute (the national mapping organization, in 2010 renamed the Geospatial Information Authority of Japan), the Japan International Cooperation Agency (JICA), and the Ministry of Land, Infrastructure, and Transport (MLIT). JICA is one of the few aid agencies specifically giving assistance to mapping agencies in the developing world (JICA, 2003). Part of that assistance involves support for the creation of Global Map. For a number of years, MLIT has supported a Global Map workshop in Africa, and it has also helped to facilitate an annual scholarship program funded by JICA since 1994, which brings trainees from national mapping organizations to the Geospatial Information Authority of Japan for extended periods to receive training in the production of Global Map. These efforts have been supplemented by the private sector, and both ESRI and Intergraph have provided grant support to aid national mapping organizations, especially in the acquisition of software. As a result of these capacity-building efforts, many of the first nations to complete their Global Map coverage were developing nations, including Kenya, Myanmar, and Mongolia, among others.

A major event in the history of Global Map was the involvement of the Global Map secretariat in the World Summit for Sustainable Development held in Johannesburg in 2002 (Masaharu
and Akiyama, 2003). The secretariat participated in all four workshops leading up to the summit and in the summit itself. Largely as a result of this input, the implementation document that came out of the summit contains paragraphs 132 and 133 as follows:
132. Promote the development and wider use of Earth observation technologies, including satellite remote sensing, global mapping and geographic information systems, to collect quality data on environmental impacts, land use and land-use changes, including through urgent actions at all levels to: (a) Strengthen cooperation and coordination among global observing systems and research programmes for integrated global observations, taking into account the need for building capacity and sharing of data from ground-based observations, satellite remote sensing and other sources among all countries; (b) Develop information systems that make the sharing of valuable data possible, including the active exchange of Earth observation data; (c) Encourage initiatives and partnerships for global mapping. 133. Support countries, particularly developing countries, in their national efforts to: (a) Collect data that are accurate, long-term, consistent and reliable; (b) Use satellite and remote-sensing technologies for data collection and further improvement of ground-based observations; (c) Access, explore and use geographic information by utilizing the technologies of satellite remote sensing, satellite global positioning, mapping and geographic information systems. (Capitalization as in original text; United Nations, 2002, p. 64)
The explicit recognition of Global Map and the identification of ISCGM as an implementing agency were important to reaffirm United Nations support for Global Map. All of the nations present at the summit endorsed the document, and at the national level, this provides each national mapping agency with strong arguments for the allocation of national resources for the production of Global Map. It is important to note that, as outlined earlier, Global Map is part of a much wider program by the government of Japan to support initiatives leading to improvements in environment and sustainable development at the global level. In 2000, Japan committed almost one third of its large Official Development Assistance budget for this purpose (Okada, 2003), and at the Johannesburg summit, it announced a new program, the Environmental Conservation Initiative for Sustainable Development (Okada, 2003), further strengthening Japan’s international support for environmental initiatives, of which Global Map is a part. Geomatics initiatives are much more likely to be effective if they are part of a much wider commitment to a specific goal. In the case of Global Map, this is the commitment of Japan to provide long-term funding and support to policies and programs aimed at improving global environmental conditions and furthering sustainable development at the global scale. It can be argued that no geomatics initiative can be fully effective if it is an isolated program. The International Map of the World project was isolated from other international initiatives of the time, which may have contributed to its lack of success. Global Map plays a very specific role in a much wider effort. It is also an effort that has been sustained over time with adequate funding and continuing support. This support has been provided since 1992, and the government of Japan has indicated that support will continue for a new phase of Global Map after the release of version 1 of Global Map in June 2008.

THE PRESENT SITUATION
Table 1 shows the progress of Global Map over time, and Figure 1 shows the coverage as of July 2010 in map form. The initial progress of Global Map in terms of actual production of data was slower than expected, but as can be seen from Table 1, there was a very rapid acceleration in 2007 and 2008. As of July 2010, 164 countries and 16 regions including Antarctica were participating in Global Map. This represents over 96% of Earth’s territorial surface. Data have been released for almost 60% of Earth’s surface by area and for 52% of the world’s population, and data for many other countries are undergoing verification. Sample coverage of the eight layers of Global Map for Kenya is shown in Figure 2. Version 1 of Global Map was released on 5 June 2008. In addition to the coverage mentioned already, a Global Percentage Tree Cover Map and a Global Land Cover Map were launched (Akatsuka, 2008). The map of global percentage tree cover is shown in Figure 3. Global Map has made remarkable progress, but many challenges remain in addition to achieving complete coverage.

THE CHALLENGES AND THE FUTURE

Plans for Global Map Phase III were approved by ISCGM at the 15th meeting held in Tokyo in January 2008 (Secretariat of the ISCGM, 2008a). Global Mapping Forum 2008 was also held at the same time, giving a much wider audience the opportunity to discuss and demonstrate the use of Global Map. Of particular note in this respect was the opportunity to observe “Global Map School” in action. Global Map School uses online Global Map coverage of two nations to facilitate online discussions on environmental issues between schools in each of the countries. In this case, the schools involved were Keio Futsubu and Chubotu Junior High School in Japan and Princess Chulabhorn’s College, Nakhon Si Thammarat, of Thailand.
TABLE 1. PARTICIPATION IN GLOBAL MAP

Year | Number of countries and regions participating | Number of countries and regions for which data have been released
1998 | 12 | 0
1999 | 59 | 0
2000 | 10 | 10 (Global Map version 0, using existing global data sets, released)
2001 | 8 | 4
2002 | 32 | 2
2003 | 9 | 6
2004 | 5 | 2
2005 | 18 | 1
2006 | 19 | 6
2007 | 6 | 20
2008 | 2 | 24
Total | 180 | 75
Figure 1. Progress of the Global Mapping project.
This was the third such Global Map School session, and it illustrated the utility of Global Map for educational purposes. The educational uses of Global Map were not specifically envisioned when Global Map was first introduced, and this is an important and growing application area for Global Map. Several nations, including the United States, are considering the use of their national coverage for educational purposes, and this will increase the value of Global Map to societies worldwide.

A second important new application of Global Map is in the area of disaster adaptation, mitigation, and management. Immediately after the disastrous tsunami in the Indian Ocean in 2004, the Geographical Survey Institute of Japan developed a map of the disaster area using Global Map data (Ubukawa et al., 2008). The institute developed both land-use maps and elevation maps and models from the Global Map data and released these online. Since 2004, the institute has released maps of eight major disasters, and since 2007, these have also been posted on ReliefWeb, the website of the United Nations Office for the Coordination of Humanitarian Affairs, the United Nations agency that coordinates information on disaster areas. The Global Map data are posted within 48 hours of the disaster. Global Map data were of particular value during the floods in Myanmar in 2008. Since Myanmar had completed its Global Map coverage, this was one of the few comprehensive base map coverages available. Although small-scale mapping has limitations in disaster situations, it can be useful in comparative and contextual terms, as was the case in Myanmar.
Figure 2. The eight data layers of Global Map for Kenya.
Figure 3. Global Map of percentage tree cover.
Global Map has also been used at the global scale to model projected change in maximum daily rainfall and regions at risk from sea-level change (www.iscgm.org). When Global Map was first conceived, the concept of national spatial data infrastructures was in its infancy. A national spatial data infrastructure can be defined as “the technology, policies, standards, human resources, and related activities necessary to acquire, process, distribute, use, maintain, and preserve spatial data” (United States Office of Management and Budget, 2002, p. 2). This concept has grown in importance in recent years, and a new Global Spatial Data Infrastructure Association has emerged. Many nations are creating national spatial data infrastructures. Although not specifically conceived as such, Global Map is an operational global spatial data infrastructure (Taylor, 2005). In addition, many nations, such as Kenya (Mweru, 2005) and Brazil, have used their Global Map coverage as the framework to create their national spatial data infrastructure by adding additional data layers and creating larger-scale coverages. Three hundred and forty-six participants from 24 countries attended Global Map Forum 2008 and unanimously adopted the Global Map Tokyo Declaration, which reads: Recalling that during the Earth Summit in 1992, the nations of the world addressed global environmental problems and adopted Agenda 21, that ISCGM was established to promote the development of Global Map, and that in 2002 global environment was further discussed in Johannesburg at WSSD, where the goal of the development of Global Map was further supported, [we] express our gratitude to 179 countries and regions of the world that have participated in the project for
their efforts to bring Global Map to this stage. At the same time, we work to further expand the use of Global Map. We also call on all those countries not yet committed to Global Map to join and work to ensure coverage of the terrestrial surface of the Earth. We further recognize that global environmental problems such as climate change, deforestation and desertification have become serious problems for humanity, that issues on climate change will be a major topic at the G8 Hokkaido Toyako Summit in July, which brings together world leaders. All people, including those dedicated to mapping the Earth should make a contribution to solving these problems. The users and producers of Global Map call for the strengthening and coordination required to make Global Map, which has been developed with common specifications and internationally agreed standards, and which accurately describes the status and impact of human activities. Global Map gives a common understanding to people who live on the Earth. Global Map should be more usable and easily available to assist in decision making to help solve the common environmental problems facing humanity. Of particular importance in this respect is capacity building activities for and with developing nations. (Secretariat of the ISCGM, 2008a, p. 5)
The declaration gave a new sense of vision for Global Map. In the declaration, mention is made of the G8 Summit in Japan, at which Global Map was prominently featured. Global Map was also featured at the meetings of the ministers of the environment of the G8 nations and at the Fourth Tokyo International Conference on African Development (TICAD IV), which was held in Yokohama in late May 2008. TICAD has been held every 5 yr since 1993, and it is an international conference jointly organized by the government of Japan, the World Bank, and the United Nations focusing on development problems in Africa. Fifty-one
African heads of state participated in the conference in 2008, as did over 3000 participants. A major product of the meeting was the Yokohama Action Plan, which outlines concrete support actions. Under the section dealing with Environmental/Climate Change Issues, Global Map is specifically mentioned: “2. Adaptation. Promote technical assistance such as establishing and updating the Global Map data for the entire Africa, describing the status of its environment in five years” (Nakamura, 2008, p. 7). The international recognition of the importance of Global Map is important for the future of the initiative. Such recognition makes the work of ISCGM more visible at both national and international levels, and the legitimacy of Global Map is enhanced. A specific goal for the creation and use of Global Map in Africa has been established. Although the use of Global Map is increasing and numerous examples of such use are given on the ISCGM Website, it is clear that additional efforts are required to further inform the world of the existence of Global Map and to promote and increase its use. This was a key message in the discussions and resolutions of the 15th meeting of ISCGM in June 2008 (Secretariat of the ISCGM, 2008b). Among the resolutions dealing with this topic, the following appears: “ISCGM anticipates that the above-mentioned outreach activities will lead to a broad and effective use of Global Map for research and policy formulation for environmental protection, mitigation of natural disasters and the achievement of sustainable development and for education and other purposes” (Secretariat of the ISCGM, 2008b, p. 2). Increasing the effective use of Global Map is, perhaps, the major challenge facing Global Map as it enters the third phase of its existence. Experience has shown that regardless of the perceived value of an initiative such as Global Map from the perspective of those producing and supporting it, its value to society will depend upon its use, and encouraging such use must be an integral part of the planning process for all endeavors of this type. Producing Global Map is a necessary, but not sufficient, step. Global Map must be easy to use and readily accessible to users at low or no cost. This is a central principle of Global Map, but there are a number of barriers to be overcome if that goal is to be achieved. Some of these are technical, but others are administrative and political.

On the technical side, Global Map has recently updated its specifications. The original specifications created by ISCGM are over a decade old, and although they were the best available at the time, much has changed over the years (Taylor, 2008; Nagayama, 2008). Global Map reviewed its technical specifications, and a workshop to discuss and approve the new specifications was held in September 2009. ISCGM is a Class A member of ISO (International Organization for Standardization), and its new specifications meet the latest ISO TC 211 standards (Secretariat of the ISCGM, 2009). The major changes include a change to the official format of vector data from Vector Product Format to Geography Markup Language (GML) 3.2.1 (ISO 19136) and the adoption of a metadata profile based on ISO 19115. The tile size is determined by each nation. Increased efforts are being made to make technical access to data easier for users, and discussions are under way to create a Global Map Data portal.
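To make the practical effect of the change to GML concrete, the short sketch below shows how a downloaded Global Map vector tile might be inspected with the open-source GDAL/OGR library, whose Python bindings include a GML reader. This is a minimal illustration only: the file name and any layer names it prints are placeholders, since the actual Global Map tile naming and layer structure are not specified in this chapter.

    # A minimal sketch, assuming the GDAL/OGR Python bindings are installed.
    # The file name is a placeholder; actual Global Map tile names differ.
    from osgeo import ogr

    ds = ogr.Open("globalmap_tile_example.gml")  # hypothetical downloaded tile
    if ds is None:
        raise SystemExit("GML file could not be opened")

    # List each vector layer with its geometry type and feature count.
    for i in range(ds.GetLayerCount()):
        layer = ds.GetLayer(i)
        print(layer.GetName(),
              ogr.GeometryTypeToName(layer.GetGeomType()),
              layer.GetFeatureCount())

The same library’s ogr2ogr utility can, in principle, convert such a tile into other common formats, which may ease its integration into national spatial data infrastructures.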
In 2008, the average number of monthly downloads of Global Map data increased by over 100% between January and December. There is an obvious linkage between Global Map and national spatial data infrastructures, as mentioned earlier. Scale is also an issue because for many small states, especially small island states, the 1:1,000,000 scale is too small to be of real utility, and Global Map has decided to accept data at larger scales, such as 1:250,000, to help address this problem. It is also important that Global Map data be regularly updated. This is the responsibility of each participating nation, and a 5 yr update cycle is anticipated. The adoption of the new standards and specifications poses updating challenges. Existing coverage must be updated to reflect these changes. All new Global Map coverage will be in the new format. To date, only one nation, Bulgaria, has updated its maps using the new specifications.

A technical challenge facing Global Map is to create seamless regional coverage. The existing tile structure is based on a nation-by-nation approach. Euroglobal Map has resolved this problem, and similar efforts are under way to create a seamless Global Map for Asia, Latin America (Barriga, 2005), and North America. These efforts are not being led by ISCGM but by a variety of other organizations building on and expanding Global Map coverage. These include the Pan American Institute of Geography and History and the Permanent Committee for Geographic Information Processing for Asia and the Pacific. A seamless regional approach also brings with it the political issues of disputed border regions, but in contentious issues of this type, Global Map follows the United Nations’ practice. Because existing Global Map coverage is created by individual nations or regions, each participant uses its own definition of its borders, which the existing tile structure allows. This ability to define their own borders has been a factor in the acceptance of Global Map specifications, and the retention of the tile structure in the new specifications reflects the need to respond to the interests of ISCGM members.

The utility of Global Map in technical terms will be greatly enhanced by ensuring interoperability with other geospatial data sets, and in creating the new specifications for Global Map, special attention has been given to this issue. Despite the considerable efforts of the Open Geospatial Consortium (OGC) and others, many technical problems of interoperability between geospatial data sets, such as semantic interoperability, still remain. The new Global Map standards and specifications are fully consistent with ISO TC 211 and utilize both the Open Geospatial Consortium Web Map Service (WMS) specifications and GML.
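As an illustration of what adopting the OGC Web Map Service interface implies for users, the sketch below assembles a standard WMS 1.3.0 GetMap request in Python. The endpoint URL and layer name are hypothetical placeholders (no Global Map service address is given in this chapter); only the parameter structure follows the published WMS 1.3.0 specification, in which EPSG:4326 bounding boxes are expressed in latitude/longitude order.

    # A minimal sketch of an OGC WMS 1.3.0 GetMap request.
    # The endpoint and layer name are hypothetical; the parameter structure
    # follows the WMS 1.3.0 specification.
    from urllib.parse import urlencode

    endpoint = "https://example.org/globalmap/wms"  # placeholder service URL
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": "landcover",          # placeholder layer name
        "STYLES": "",
        "CRS": "EPSG:4326",             # WMS 1.3.0 uses CRS rather than SRS
        "BBOX": "-4.7,33.9,5.5,41.9",   # approximate extent of Kenya, lat/lon order
        "WIDTH": "800",
        "HEIGHT": "600",
        "FORMAT": "image/png",
    }
    print(endpoint + "?" + urlencode(params))

Printing the assembled URL, rather than fetching it, keeps the example independent of any particular server being available.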
The difficult and complex administrative and political issues surrounding access to geospatial data in general were outlined in a Committee on Data for Science and Technology (CODATA) White Paper on this topic released in 2008 (CODATA, 2008). This paper was produced to help implement the data-sharing principles of GEOSS, the Global Earth Observation System of Systems, in which ISCGM is playing an active role. GEOSS established a data-sharing task force in 2004, and this group produced the GEOSS Data Sharing Principles action plan for consideration at the GEOSS ministerial summit in November 2010. This major international initiative is described in another chapter of this volume, but the issues discussed apply to any initiative to use geoinformation for societal benefit, including all of those described in this book. If geospatial data are not easily accessible, then no initiative designed to meet societal needs will succeed in achieving its objective. The acceptance of principles on data access is clearly much easier than their implementation, as the Global Map experience indicates. In signing the agreement to participate in Global Map, each participant agrees to make its data available at no or low cost, although a distinction is made between data for public or commercial use. In the latter case, nations are free to follow their own business models relating to charging for data.

Despite this formal agreement, problems have arisen. For example, excellent Global Map data have been available for Europe in the form of Euroglobal Map for some time, but the business model of Eurogeographics has made it very difficult to implement the Global Map principle of free or low-cost access to data. Eurogeographics as an organization did not exist when many individual European nations agreed to participate in Global Map. Discussions on this matter have continued over the years, and a partial solution was reached in late 2008, when 11 of the member nations of Eurogeographics agreed in principle to allow free access to their Euroglobal Map data. This will require new licensing agreements, which are currently under discussion. In late 2009, the business model of the Ordnance Survey, a key member of Eurogeographics, underwent substantial revision to make more of its map data readily available at minimal cost. This will affect the business model of Eurogeographics, and constructive discussions continue between ISCGM and Eurogeographics to further resolve the data-access issues.

Financial return is also a major factor in the creation and release of data from Russia. Russia is a participant in Global Map but argues that it requires substantial payment for its data before it is prepared to provide them to ISCGM. There have been political and organizational changes in Russia since its initial agreement to participate in Global Map, and, again, discussions are under way between ISCGM and the new organization responsible for national mapping in Russia. In times of organizational change, resolution of these issues is a difficult and complex process. Security issues are slowing down the release of data for some members such as Israel, although it is interesting to note that the Global Map data for Palestine (West Bank and Gaza Strip) have already been released. Initially, security concerns also affected the release of Global Map data for both India and Pakistan, despite the small scale of the data involved. These concerns were satisfactorily resolved, and their resolution helped to bring about more general availability of geospatial data, especially in India, where discussions took place at the cabinet level, resulting in a much improved situation. Despite these difficulties, access and availability of Global Map data are rapidly increasing, as indicated by the growing
number of downloads from the Global Map Website described earlier. Global Map also has a very close working relationship with the United Nations Second Administrative Level Boundaries (SALB) project. For example, SALB boundary codes are now an integral part of the new Global Map specifications. Global Map was a model for the OneGeology project described elsewhere in this volume, and the structure and organization of OneGeology reflect much of the Global Map experience. The Global Land Cover Network of the Food and Agriculture Organization (FAO) is also a Global Map partner organization, and the Global Land Cover Map uses the FAO land-cover categories.

CONCLUSION

The empirical evidence supports the conclusion that Global Map has been remarkably successful, although many challenges remain. The technical challenges will be much easier to overcome than the administrative and political ones. The new specifications developed in late 2009 are a positive response to the technical challenges. Global Map is working closely with GEOSS to help address the nine societal benefit areas. GEOSS is discussed in more detail elsewhere in this volume. The major administrative challenges involve the more active participation of key partner nations, especially Russia. Global Map depends on voluntary cooperation, and the approach used to encourage such participation includes incentives, capacity building, and technical support working through the national mapping organizations. Every effort is made to facilitate national needs and priorities through participation in this international project.

Many factors have contributed to the ongoing success of Global Map, which is an excellent example of the utility of geospatial information to society, but one stands out above all of the others: the cooperative process by which Global Map is being created. Many nations and international organizations have the technical ability to produce global data sets, and there are numerous excellent examples of this. What sets Global Map apart is the role of the national and regional organizations participating in Global Map in creating and/or checking and verifying the digital coverage required. The Global Map coverage for countries such as Kenya, for example, is a Kenyan creation of which the nation can be justifiably proud (Mweru, 2005). Kenya is making use of Global Map coverage to meet a wide variety of societal needs. In creating Global Map, national and regional needs are being met, and, at the same time, human and social capital are being built in the application of geospatial technologies to the development of those societies. In developing nations in particular, it is important for indigenous scientists to be involved in the application of geospatial technologies (Taylor, 2004). To help with environmental and sustainable development decision making, Global Map as a product is making an important contribution to society, but it is perhaps the process by which it is produced and the capacity building in the creation and use of geospatial data that will make the greatest contribution to sustainable development.
REFERENCES CITED

Akatsuka, F., 2008, Release of land cover and percentage tree cover data of the Global Map, version 1 (global version): Global Mapping Newsletter, v. 51, p. 3.
Barriga, R., 2005, America’s global map: Presentation to the 12th ISCGM (International Steering Committee for Global Mapping): Tsukuba, Japan, ISCGM, 15 p.
Committee on Data for Science and Technology (CODATA), 2008, White Paper on GEOSS Data Sharing Principles: Paris, CODATA, 93 p., http://www.earthobservations.org/documents/dsp/Draft%20White%20Paper%20for%20GEOSS%20Data%20Sharing%20Policies_27Sept08.pdf (accessed January 2009).
Geographical Survey Institute of Japan, 1991, An Image Survey Watching the Earth: Tokyo, Printing Bureau, Ministry of Japan, 263 p. (in Japanese).
Heffernan, M., 1996, Geography, cartography and military intelligence: Transactions of the Institute of British Geographers, v. 21, p. 504–533, doi:10.2307/622594.
Heffernan, M., 2002, The politics of the map in the early twentieth century: Cartography and Geographic Information Science, v. 29, p. 207–226, doi:10.1559/152304002782008512.
James, H., 1860, Description of the projection used in the topographical section of the War Office for maps embracing large portions of the Earth’s surface: Journal of the Royal Geographical Society, v. 30, p. 106–111, doi:10.2307/1798292.
Japan International Cooperation Agency, 2003, Contributions to National Mapping Progress: Tokyo, JICA Social Development Study Department.
Maruyama, H., 1998, History of activities for international agreement on the development of the Global Map: Bulletin of the Geographical Survey Institute of Japan, v. 44, p. 63–90.
Masaharu, M., and Akiyama, M., 2003, Publicity activities of Global Mapping at Johannesburg summit and outcomes of the summit: Bulletin of the Geographical Survey Institute of Japan, v. 49, p. 56–69.
Mweru, K., 2005, Kenya’s experience with Global Map: Presentation to the 12th ISCGM (International Steering Committee for Global Mapping): Tsukuba, Japan, ISCGM Secretariat, 12 p.
Nagayama, T., 2008, Global Map Standards and Specifications: Presentation to the International Standards Organization Standards in Action Workshop: Tsukuba, Japan, International Steering Committee for Global Mapping Secretariat, http://www.isotc211.org/WorkshopTsukuba/Workshop-Tsukuba.htm (accessed January 2009).
Nakamura, T., 2008, TICAD IV and Global Map: Global Mapping Newsletter, v. 50, p. 7.
Okada, S., 2003, Towards a green future: Asian Pacific Perspectives: Japan+1, v. 1, p. 26–29.
Okatani, T., Maruyama, H., Sasaki, M., Yaguguchi, T., Magayama, S., Kayaba, M., Abe, M., and Kishimoto, N., 2006, Progress of Global Mapping Project: The Johannesburg summit in 2002: Bulletin of the Geographical Survey Institute of Japan, v. 53, p. 7–16.
Pearson, A., Taylor, D.R.F., Kline, K.D., and Heffernan, M., 2006, Cartographic ideals and geopolitical realities: International Map of the World from the 1890s to the present: The Canadian Geographer, v. 50, no. 2, p. 149–176, doi:10.1111/j.0008-3658.2006.00133.x.
Penck, A., 1893, Construction of a map of the world on a scale of 1:1,000,000: Geographical Journal, v. 1, p. 253–261.
Rhind, D.W., 2000, Current shortcomings of global mapping and the creation of a new geographical framework for the world: The Geographical Journal, v. 166, p. 295–305, doi:10.1111/j.1475-4959.2000.tb00031.x.
Robinson, A.H., 1965, The future of the international map: The Cartographic Journal, v. 1, p. 1–4.
Secretariat of the International Steering Committee for Global Mapping (ISCGM), 2008a, Global Mapping Newsletter, v. 50, June, 6 p.
Secretariat of the International Steering Committee for Global Mapping (ISCGM), 2008b, Global Mapping Newsletter, v. 51, September, 4 p.
Secretariat of the International Steering Committee for Global Mapping (ISCGM), 2009, Global map specifications version 2, in Report of the Sixteenth Meeting of the International Steering Committee for Global Mapping: Bangkok, Thailand, International Steering Committee for Global Mapping, October, 75 p.
Taylor, D.R.F., 2004, Capacity building and geographic information technologies in African development, in Brunn, S.D., Cutter, S.L., and Harrington, J.W., eds., Geography and Technology: Dordrecht, the Netherlands, Kluwer, p. 521–546.
Taylor, D.R.F., 2005, The history and development of Global Map as a global spatial data infrastructure, in Proceedings of the International Federation of Surveyors Working Week and the 8th International Conference of the Global Spatial Data Infrastructure (GSDI) Association: Cairo, GSDI (CD-ROM).
Taylor, D.R.F., 2008, Global Map standards and specifications, in Proceedings of GSDI-10 Conference, Standards Workshop: Trinidad (CD-ROM).
Ubukawa, T., Kisanuki, J., and Akatsuka, F., 2008, Global Map—An international project: Geographische Rundschau: International Edition, v. 4, p. 62–65.
United Nations, 2002, Report of the World Summit on Sustainable Development: http://www.johannesburgsummit.org/html/documents/summit-docs.html (accessed December 2008).
United States Office of Management and Budget, 2002, National Spatial Data Infrastructure of the United States Circular A-16: http://whitehouse.gov/OMB/circulars/a016_REV.htm#background (accessed December 2009).
Winchester, S., 1995, Taking the world’s measure: Cartography’s greatest undertaking survived wars and bureaucratic snarls only to die when it was nearly done: Civilization, v. 2, p. 56–59.

MANUSCRIPT ACCEPTED BY THE SOCIETY 17 FEBRUARY 2011
Printed in the USA